[01:08:29] PROBLEM - MariaDB sustained replica lag on m1 on db2160 is CRITICAL: 17.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[01:09:03] PROBLEM - MariaDB sustained replica lag on m1 on db1117 is CRITICAL: 13.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[01:10:39] RECOVERY - MariaDB sustained replica lag on m1 on db1117 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[01:11:43] RECOVERY - MariaDB sustained replica lag on m1 on db2160 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[07:26:14] jynus: when would it be a good day/time to stop db1117 (backup source for misc hosts)
[07:26:33] any time now, but let me double check
[07:26:38] sure!
[07:28:12] yeah, confirmed, all backups for misc server finished yesterday normally
[07:28:23] great! stopping it now thanks
[07:30:02] I am seeing snapshots are taking more and more time, I may change those a bit to make them faster or revert to start earlier, as for example, they haven't finished yet
[08:25:11] Is there (meant to be) any way to say "regenerate the thumbs for this image"?
[08:29:12] Emperor: you mean something like thumb.php?
[08:29:23] or at a different layer?
[08:30:44] I don't know what thumb.php is, sorry. But https://phabricator.wikimedia.org/T333042 contains complaints that e.g. the thumbs for https://commons.wikimedia.org/wiki/File:Vake_District.svg are out of date (which they do seem to be), and I want to know if there's some way they (or I) can cause those thumbs to be regenerated that is nicer than me manually deleting the thumbnails from swift and thus getting the 404 handler to fire again, which feels like The Wrong Approach
[08:32:30] in theory, sending a purge forces refreshes of thumbs, the problem is that there is a bug somewhere else preventing that
[08:33:10] [AFAICT in this case there are thumbs in eqiad, but not codfw, but I might be driving swift list wrong]
[08:33:15] https://en.wikipedia.org/wiki/Wikipedia:Purge#Images
[08:34:20] it just feels like there are some quite unhappy people about thumbs, and I'd like to help them without it eating all my time ever
[08:37:30] how are the thumbs for https://commons.wikimedia.org/wiki/File:Vake_District.svg outdated? Can you tell me something to differentiate them?
[08:37:51] not doubting you, just want to see how to differentiate them
[08:38:18] so in the presenting case https://commons.wikimedia.org/wiki/File:Vake_District.svg?action=purge and click OK?
[08:38:36] that's the documented "fix"
[08:38:56] jynus: if you look at the history, the older images have some yellow shading (higher ground, maybe?) that isn't in the newer version
[08:39:16] the purge does seem to have helped here
[08:39:51] the issue is in some cases that doesn't work, like with: https://phabricator.wikimedia.org/T330942 or https://phabricator.wikimedia.org/T334487
[08:40:25] yeah, and indeed https://upload.wikimedia.org/wikipedia/commons/thumb/7/75/Vake_District.svg/1280px-Vake_District.svg.png is still the older one
[08:40:52] that's because purge only purges the thumbs for that page
[08:41:00] not all thumbs
[08:41:07] but that's one of the standard sizes linked from the commons page...?
[08:41:46] I am not saying this makes sense, I am just saying it "purges shown thumbs" :-)
[08:42:16] I understand now that what you want is "purge all thumbs generated from given image", right?
[08:42:24] yes
[08:42:59] I am not familiar with the backend, check if there is something like that on the mediawiki/maintenance repo
[08:43:02] weirdly, that purge attempt seems to have updated the thumb in codfw but not eqiad
[08:43:15] and that looks like a bug
[08:43:45] equally weirdly that's the _only_ thumb of that image in codfw if I'm doing swift list --prefix 7/75/Vake_District.svg wikipedia-commons-local-thumb.75 right
[08:44:55] whereas in eqiad it looks to have regenerated the 1024, 1125, 1200, and 640 thumbs
[08:45:04] I see
[08:45:11] I think I know the issue
[08:45:18] but not any of the other sizes
[08:45:24] that looks like a multi-DC logic bug
[08:46:02] it probably assumes it only does prethumb generation in one datacenter
[08:46:16] you should report that finding
[08:46:25] T330942 was already something in this area
[08:46:26] T330942: Latest image thumbnails aren't replaced correctly after image reupload - https://phabricator.wikimedia.org/T330942
[08:47:16] T331138 is the more architectural image
[08:47:17] T331138: FileBackendMultiWrite multi-dc and thumbnail handling - https://phabricator.wikimedia.org/T331138
[08:47:21] issue, sorry
[08:51:34] jynus: mediawiki/maintenance? gerrit isn't showing me anything if I search for that
[08:56:26] I'm checking it, cannot find it
[08:56:45] https://www.mediawiki.org/wiki/Manual:Maintenance_scripts/List_of_scripts#File_maintenance_scripts
[08:57:06] https://phabricator.wikimedia.org/source/mediawiki/browse/master/maintenance/
[08:57:59] jynus: thanks; looks like there's nothing obvious there.
[08:58:40] but I am unsure if mw would even know which thumbs are available
[08:58:57] it may have to be built at the swift layer :-(
[08:59:23] can you check that at least the original is the same on both dcs?
[09:04:35] yes, it's the same in both DCs
[09:06:44] The other other problem is that it's quite hard to get the CDN to stop serving a cached thumbnail. e.g. https://commons.wikimedia.org/wiki/File:Vake_District.svg even after a purge still has in the middle of it https://upload.wikimedia.org/wikipedia/commons/thumb/7/75/Vake_District.svg/750px-Vake_District.svg.png which is an old thumb; I've removed the thumb from swift, but it's still cached by the CDN and AFAICT there's no way to get the CDN to update that (and try and regenerate the thumb) other than waiting for it to time out...
[09:09:20] I think for cdn purges (assuming app layer/swift/thumbor works as expected) there is an actual script for that
[09:09:52] I think it is being used when removing images
[09:12:36] I am checking https://www.mediawiki.org/wiki/Manual:PurgePage.php
[09:12:55] and it says "In addition, individual page types (such as FilePage) and extensions, may register additional actions. For example, when purging a File page, it also deletes thumbnails from Swift storage, and purge the URLs of all thumbnail sizes and variations (page1, page2, 120px, 320px etc. etc.)"
[09:13:36] This is the equivalent to action=purge
[09:13:50] so there is something not working there (multidc?)
[09:17:42] should I reopen T330942 ?
[09:17:43] T330942: Latest image thumbnails aren't replaced correctly after image reupload - https://phabricator.wikimedia.org/T330942
[09:18:49] in particular it seems strange that mostly new thumbs are being made in eqiad given codfw is master
[09:19:38] (though I note that codfw does now have a new 750px thumb)
[09:21:12] jynus: it would be reasonably plausible to write a cookbook for "delete all thumbs for this image" - e.g. in the case in point the original is 7/75/Vake_District.svg in wikipedia-commons-local-public.75 and you can find all the thumbs for it by doing swift list --prefix 7/75/Vake_District.svg wikipedia-commons-local-thumb.75 (note the change in container name)
[09:22:10] however, while that is nice as a patch, the important thing is for the logic to work not needing that!
[09:23:36] if it doesn't take very long, looks like a useful tool, though
[09:28:23] Yeah, I don't feel I'm very well equipped to try and sort out the "why isn't this working properly?" issue, but maybe at least a tool so SRE could bin old thumbs (and thus cause new ones to be produced eventually when the CDN notices) would help a bit
[10:23:03] okay, I'll spend some time on it now, I guess there goes my day
[10:24:42] Amir1: sorry
[10:25:38] don't be, it's WMF's shortcoming
[10:25:55] FWIW, purging https://commons.wikimedia.org/wiki/File:Free-object-universal-property.svg didn't make the CDN drop anything, perhaps it cleared some thumbs from codfw? [sorry, should have checked], it didn't remove any from eqiad. I've removed them from eqiad, so hopefully when the CDN gets bored now thumbor will make new ones
[10:26:12] Amir1: that one is apropos T334303
[10:26:12] T334303: PNG thumbnail of Wikimedia Commons SVG file sometimes not updated - https://phabricator.wikimedia.org/T334303
[10:27:36] Amir1: as I noted above, from a swift POV a cookbook of the form "given a commons image X, remove all the thumbs from swift for X (iff both clusters have a consistent original image)" shouldn't be too hard to put together, if it's likely to be useful
[10:27:38] my guess is that it's not at the CDN level, because that will be purged automatically after a certain time; it's probably swift
[10:28:05] (mw not knowing the correct swift)
[10:28:45] Mmm. if we want to pick another test victim from commons and call purge I can do before-and-after on the relevant thumb containers, if that is helpful.
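The container/prefix relationship described above for the proposed cookbook can be sketched in a few lines of Python. This is only an illustration built from the one example in the log (original 7/75/Vake_District.svg in wikipedia-commons-local-public.75, thumbs in wikipedia-commons-local-thumb.75); the function names, the `site` and `sharded` parameters, and the assumption that the shard suffix is always the second path component are hypothetical, not taken from any existing cookbook.

```python
def thumb_location(original_path: str, site: str = "wikipedia-commons",
                   sharded: bool = True) -> tuple[str, str]:
    """Return (thumb container, listing prefix) for an original object path.

    Assumes paths of the form "7/75/Name.ext", as in the log's example.
    """
    parts = original_path.split("/", 2)
    if (len(parts) != 3 or len(parts[0]) != 1 or len(parts[1]) != 2
            or parts[1][0] != parts[0]):
        raise ValueError(f"unexpected object path: {original_path!r}")
    container = f"{site}-local-thumb"
    if sharded:
        # Assumption: the container shard suffix matches the two-character
        # directory, as with ".75" for "7/75/..." in the log.
        container += f".{parts[1]}"
    return container, original_path


def swift_list_command(original_path: str, site: str = "wikipedia-commons",
                       sharded: bool = True) -> list[str]:
    """Build the `swift list --prefix ...` invocation quoted in the log."""
    container, prefix = thumb_location(original_path, site, sharded)
    return ["swift", "list", "--prefix", prefix, container]


def is_thumb_of(object_name: str, original_path: str) -> bool:
    """True if a listed object sits under the original's thumb directory.

    The trailing slash avoids matching other files whose names merely
    start with the same characters.
    """
    return object_name.startswith(original_path + "/")
```

For example, `swift_list_command("7/75/Vake_District.svg")` yields the same arguments as the command quoted above. The test wikis mentioned later in the log appear to use unsharded containers, which is what `sharded=False` is meant to cover.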
[10:28:49] Emperor: that should already somewhat exist in mw as a maint script, it's used by T&S
[10:29:10] (it deletes the file too but making it purge thumbs only is not much work, if we fix it)
[10:29:29] Amir1: DYK if that picks up all the non-standard-size thumbs too?
[10:29:55] yeah it does, it basically sends a req to swift to give back all existing thumbnails
[10:30:11] the same process happens in reupload and that was what was broken last time
[10:31:04] the pregenerate job doesn't do anything, it basically hits http for certain sizes triggering thumb generation
[10:31:18] right
[10:31:37] it would be noop if the thumb is already there for any reason (e.g. the previous one not getting deleted)
[10:33:30] https://www.mediawiki.org/wiki/Manual:EraseArchivedFile.php looks like it has similar logic for erased files (but that's not quite what is required here)
[10:42:51] yeah, we can repurpose it/make a new one
[10:48:24] Emperor: How can I run a swift command? I basically want to run this: https://ms-fe.svc.eqiad.wmnet/v1/AUTH_mw/wikipedia-test2-local-thumb
[10:48:57] Amir1: https://wikitech.wikimedia.org/wiki/Swift/How_To
[10:49:06] awesome thanks
[10:49:23] ms-fe{1,2}009 are the machines with the credential files on
[11:21:03] root@ms-fe1009:~# swift list --prefix "2/2a/Wikitech-2021-blue-large-icon_(copy).png" wikipedia-test2-local-thumb
[11:21:03] Container 'wikipedia-test2-local-thumb' not found
[11:21:11] hmm
[11:21:43] what's the commons page with that image on?
[11:22:05] it's not on commons, test2.wikipedia.org and it's happening there too
[11:22:11] https://test.wikipedia.org/wiki/File:Wikitech-2021-blue-large-icon_(copy).png
[11:22:26] aah, I think I messed it up, it should be test wiki
[11:24:30] Container 'wikipedia-test-local-thumb' not found either
[11:25:15] I can see wikipedia-test2-local-public
[11:25:47] also wikipedia-test-local-thumb
[11:26:04] also wikipedia-testcommons-local-thumb
[11:26:22] oh, also wikipedia-testwikidata-local-thumb if you're not yet sufficiently confused
[11:26:37] we have a lot of test wikis
[11:26:50] but the container should be there, why is this erroring:
[11:26:53] root@ms-fe1009:~# swift list --prefix "2/2a/Wikitech-2021-blue-large-icon_(copy).png" wikipedia-test-local-thumb
[11:26:54] Container 'wikipedia-test-local-thumb' not found
[11:27:07] maybe lack of access?
[11:27:18] (I did . /etc/swift/account_AUTH_dispersion.env)
[11:27:30] oh, yes, that's not the correct credential
[11:27:34] 2 ticks
[11:28:12] your comment returns 4 things for me; you want /etc/swift/account_AUTH_mw.env
[11:28:25] command, even.
[11:28:33] this cold has apparently fried my typing neurons
[11:28:33] aah
[11:28:35] thanks
[11:29:56] NP :)
[11:50:35] Emperor: how hard would it be to set up bi-directional replication for thumbs?
[11:50:53] we do that in x2 and ParserCache, it's painful but not too hard
[11:53:09] Amir1: we _used_ to chuck every new thumbnail at the other dc
[11:53:57] thumbnail generation is _meant_ to be active/active
[11:54:09] cf https://phabricator.wikimedia.org/T313102#8093002
[11:54:46] anyhow, the "chuck thumbnails at the other DC" process caused outages cf T313102 so we stopped doing it
[11:54:46] T313102: Uncaught TimeoutError from inactivedc_request caused swift-proxy to wedge itself - https://phabricator.wikimedia.org/T313102
[11:55:15] https://gerrit.wikimedia.org/r/c/operations/puppet/+/816726/ was the change
[11:55:49] Another idea: Delegate deletion of thumbs to swift and replicate that. Instead of discovering all thumbnails (which it's clearly not doing a good job of), make mw issue a command to swift to delete any thumb it has under the given file name, swift would replicate that command to the secondary dc
[11:56:00] is that possible?
[11:58:59] I'm not sure. Swift itself doesn't have any idea of the original / thumb relationship; we can just infer that from object names
[11:59:43] I'd be very reluctant to add extra complexity to the swift rewrite middleware (which, as noted previously at length I want to get rid of entirely)
[12:00:21] our weekly rclone run doesn't look at thumbs containers at the moment (because they're so very large)
[12:05:02] I thought per discussion on T313102 that thumb generation was meant to be active:active; is that not working?
[12:05:03] T313102: Uncaught TimeoutError from inactivedc_request caused swift-proxy to wedge itself - https://phabricator.wikimedia.org/T313102
[12:09:33] (and, I guess relatedly, I'm not sure how many problem thumbs we actually have, and whether some sort of "no, really purge the dratted things" tool would be BALGE)
[12:09:36] Emperor: yeah, I understand, I was wondering if e.g. it can be done via prefix
[12:09:39] err, By And Large Good Enough
[12:10:13] depends on what you mean by active/active
[12:11:05] this service can be set up in multiple ways that we could call active/active
[12:11:07] I don't think the HTTP API has the --prefix option
[12:12:17] cf https://docs.openstack.org/api-ref/object-store/index.html#containers
[12:12:25] (not sure what the swift cli is doing under the hood)
[12:13:34] Amir1: Hm, I understood Tim to be saying that thumb generation should result in the new thumb being generated/stored in both DCs
[12:14:01] that is not happening, I guarantee that
[12:15:25] especially imagine someone hitting the secondary dc to get a specific thumb size that doesn't exist already. There is no mw involved here
[12:15:46] through the proxy -> thumbor -> swift
[12:16:10] or more proxy -> swift -> thumbor -> swift
[12:16:40] Ugh, all is sadness
[12:17:02] * Emperor wonders what Tim meant then
[12:18:11] could be the pregen sizes
[12:18:41] that is mw
[12:18:46] (it hits it internally)
[12:19:24] Could be; though purging a commons page seems to not actually regenerate the pregen thumb sizes
[12:19:41] reupload should and that's a job async
[12:19:43] oh, wait, it looks to have in eqiad but not codfw
[12:20:17] the job basically hits urls pretending to be a user
[12:20:18] Ah, no, probably purging just deleted them, and then my looking goes via eqiad and that's why eqiad has thumbs for something I purged earlier but codfw does not
[12:20:24] it literally curls
[12:20:25] Amir1: right, yes
[12:21:03] Could we make purge do similarly (and at both DCs) without a mountain of work? That might solve a bunch of the user requests
[12:21:49] here is the juicy part of the problem.
[12:22:07] 😿
[12:22:31] mw deletes them by hitting the primary swift asking for any thumb size for the image
[12:22:52] the list doesn't return the sizes in the secondary dc because it's not in the main swift
[12:23:19] so it doesn't purge them
[12:23:55] stupid question, then: could it also ask the secondary swift?
[12:24:08] One other way is to make reupload hit all swifts and delete them one by one but that might lead to very slow reuploads
[12:24:13] we certainly can try
[12:24:44] I even was writing the code for it last time, just the config for different swifts is ... confusing
[12:24:54] "yay"
[12:27:40] now I need to go buy something for my partner because he tends to lose adapters constantly. Will be back in an hour or two
[12:29:18] Amir1: TY for spending time and effort on this
[12:30:07] I appreciate him spending time and effort on this as well
[12:30:21] lol
[12:55:07] one thing that could be done is to transmit that, even if this is annoying (and shouldn't happen in the first place), original files are not affected
[12:55:17] so it is only temporary
[14:50:27] urandom: https://phabricator.wikimedia.org/T330693#8776772 so no PVs. What should we do next?
[16:34:40] db1117 may need some prometheus exporter reload, I can do them if it is just that
[17:43:09] I did those systemd restarts
[18:34:10] thanks jynus
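The fix floated at 12:23:55 and 12:24:08 (ask every swift cluster for its own thumb listing, then delete per cluster, rather than relying on the primary's listing) can be sketched as follows. This is a minimal Python illustration under stated assumptions, not MediaWiki's actual purge code: `purge_thumbs_everywhere` and the `Lister`/`Deleter` callables are hypothetical stand-ins for whatever client talks to each datacenter's swift.

```python
from typing import Callable, Iterable

# Hypothetical per-cluster hooks: (container, prefix) -> object names,
# and (container, object name) -> None.
Lister = Callable[[str, str], Iterable[str]]
Deleter = Callable[[str, str], None]


def purge_thumbs_everywhere(
    container: str,
    original_path: str,
    clusters: dict[str, tuple[Lister, Deleter]],
) -> dict[str, list[str]]:
    """Delete every thumb of one original in each cluster.

    Each cluster is listed separately, so thumbs that exist only in the
    secondary datacenter are found too. Returns what was deleted per cluster.
    """
    deleted: dict[str, list[str]] = {}
    for dc, (list_objects, delete_object) in clusters.items():
        # Materialise the listing first so deletion cannot disturb iteration,
        # and keep only objects under "<original_path>/".
        names = sorted(
            name
            for name in list_objects(container, original_path)
            if name.startswith(original_path + "/")
        )
        for name in names:
            delete_object(container, name)
        deleted[dc] = names
    return deleted
```

As noted in the discussion, the cost of this shape is one listing plus N deletes per cluster on every reupload or purge, which is why slow reuploads were raised as a concern.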