[10:07:01] There's something weird going on with thumbnails, and for once I don't think either swift or thumbor are to blame. T383023 and T383034 are both users complaining about being unable to view a thumb; the thumbs are in both ms clusters (and are not new). I can't repro (nor can another Europe-based person), but we've had a number of complaints (including village pump), which is making me wonder if one of our cache-only DCs is having an issue?
[10:07:01] T383023: PNG thumbnail of SVG file not displayed; returns "Unauthorized" error when attempting to view - https://phabricator.wikimedia.org/T383023
[10:07:01] T383034: Preview images from Wikimedia Commons cannot be displayed properly - https://phabricator.wikimedia.org/T383034
[10:28:38] thumbnail here https://phabricator.wikimedia.org/T383023#10431900 is of user getting an "Unauthorized" error trying to view https://upload.wikimedia.org/wikipedia/commons/thumb/f/f8/Buick_Regal_2_--_10-30-2009.jpg/280px-Buick_Regal_2_--_10-30-2009.jpg
[10:44:22] said thumb definitely in both swift clusters
[10:54:49] https://phabricator.wikimedia.org/P71802 is a pointer to the issue - whilst one can swift stat objects in wikipedia-commons-local-thumb.f8 in codfw, you can't currently stat the container itself(!)
[10:57:59] knowing mostly nothing about swift arch, does it matter which ms-fe in codfw you use?
[10:58:09] I assume not
[10:58:56] just looking at that (we don't keep credentials on >1, so it's a bit of a faff)
[11:00:21] Same behaviour on ms-fe2012, so it's not frontend-specific.
[11:13:36] Well, that's not good - ask swift-get-nodes where the container dbs should be for that container and they're all absent(!)
[11:15:00] the container ring hasn't changed since 23 December
[11:16:44] the account thinks the container should exist
[11:20:27] OK, this is Bad (TM)
[11:20:58] I can't find the codfw container db for wikipedia-commons-local-thumb.f8
[11:21:44] Not quite sure how swift stat on the contents of it is even working in codfw (cached?)
[11:23:39] is it the only container db affected?
[11:32:15] there are 43066 containers in that account, LMS if there is a quick way to check.
[11:32:33] [I wonder if swift stat doesn't use the container DB and just looks up the object in the object ring?]
[11:36:34] running a check on just the thumb containers (256 of them)
[11:37:35] OK, of the thumb containers only wikipedia-commons-local-thumb.f8 affected
[11:40:48] back-of-an-envelope suggests it'd take about 4 hours to check all 43,066 containers, so I'll kick that off in a tmux
[11:41:59] If the dbs are anywhere they're likely in a directory that contains the path /accounts0/containers/16503/280/4077d9164732d6587761ef101bcbc280 so it _might_ be worth looking for that (but I suspect it's gone. I just don't know where or how or why)
[11:47:43] No joy
[11:49:24] and yes, you can do 'swift-get-nodes /etc/swift/object.ring.gz AUTH_mw containername objectname', so I think that's why swift stat on objects in wikipedia-commons-local-thumb.f8 is still working even though the container db is gone.
[11:51:27] On 2nd Jan, we were still good.
[11:52:10] could it be https://phabricator.wikimedia.org/T379942 ?
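[editor's sketch] The container check described above (stat every container, report the ones that fail) could look roughly like the following. This is not the checker script that was actually run, which is not shown in the log. The function names are hypothetical, the 00-ff shard naming is inferred from names like ".f8", and `swift_stat` assumes the `swift` CLI exits non-zero when a stat fails and that credentials are already sourced in the environment.

```python
import subprocess
from typing import Callable, Iterable, List


def thumb_containers(prefix: str = "wikipedia-commons-local-thumb") -> List[str]:
    """Build the 256 shard names, assuming two-hex-digit suffixes 00..ff."""
    return [f"{prefix}.{i:02x}" for i in range(256)]


def swift_stat(container: str) -> bool:
    """Hypothetical probe: True if `swift stat <container>` exits with 0.

    Assumes the swift CLI is installed and signals failure via exit status.
    """
    result = subprocess.run(["swift", "stat", container], capture_output=True)
    return result.returncode == 0


def find_unstattable(containers: Iterable[str],
                     stat: Callable[[str], bool]) -> List[str]:
    """Return the containers for which the stat probe fails or raises."""
    missing = []
    for name in containers:
        try:
            ok = stat(name)
        except Exception:
            ok = False
        if not ok:
            missing.append(name)
    return missing


def seconds_per_container(total_hours: float, n_containers: int) -> float:
    """Average time budget per container implied by a total-runtime estimate."""
    return total_hours * 3600 / n_containers
```

Passing the probe in as a callable keeps the loop testable without a cluster; in real use it would be `find_unstattable(thumb_containers(), swift_stat)`. The "about 4 hours for 43,066 containers" estimate works out to roughly a third of a second per stat.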
[11:52:45] ah no it should only be work on 0
[11:52:53] I think not - Amir1 has been working on those, and he's only been doing 06-0f
[11:52:56] be working on 0*
[11:53:19] working in the early hours of Jan 5
[11:53:56] yeah, I'm currently on 0* ones, will start 1* soon but it shouldn't be related
[11:53:57] but we were saying 401 to that container by midnight this morning
[11:54:10] LMS when that started yesterday
[11:56:56] ms-fe2009 first said 401 to something in wikipedia-commons-local-thumb.f8 at 07:20:50 yesterday (5th Jan)
[11:57:58] So my best theory is that something around that time deleted the container dbs
[11:59:39] [??maybe via issuing a delete for the container?? though I'd naively expect that to clear out the contents first]
[12:01:11] Not sure if this warrants becoming an incident?
[12:02:58] any way to know how many thumbs are affected? any way we can ~quickly sync back from eqiad for instance?
[12:03:20] like do we consider this data loss with user impact?
[12:03:24] it's 1/256 of our thumbs, but only a fraction of those are actually in use.
[12:03:38] And we officially don't care about thumbs - cf Amir who has been merrily deleting them
[12:03:55] true, but we're not regenerating them on demand in this scenario
[12:04:47] So one thing we _could_ do to get back to functioning would be to re-create the affected container - then the thumbs would get re-generated on demand. It'd probably bump the load on thumbor a bit initially
[12:07:55] Emperor: that would be fine I think
[12:08:08] at worst we scale thumbor up a bit
[12:08:39] Currently trying to see if I can find a suitable DELETE in the logs for yesterday
[12:13:20] worst case, you can drop the whole container and start afresh. 1/256th regen should be fine for now. The thumbs reqs per sec to swift is ~1,500 per second.
Thumbor regens around 25/s
[12:13:41] it'll add a bit of load but not much
[12:14:51] I need to figure out how to set the correct ACLs (mw:thumbor,mw:media,.r:* read mw:thumbor,mw:media write)
[12:17:25] I think: "swift post wikipedia-commons-local-thumb.f8 --read-acl 'mw:thumbor,mw:media,.r:*' --write-acl 'mw:thumbor,mw:media'" ?
[12:18:14] ^-- I'd like a +1 from someone before doing this :-/
[12:20:44] one sec
[12:21:56] I went looking for DELETE in logs from yesterday for wikipedia-commons-local-thumb.f8 in codfw; there are 5208, but all lines contain the string 'px-' which I think means they are all expected/legit-looking deletes of thumbs
[12:23:08] Emperor: I don't know if it's related or not but in ms-fe2009, trying to get stat on any container gives me "Container not found". I wanted to check ACLs against other containers
[12:23:29] [but in any case https://wikitech.wikimedia.org/wiki/Swift/How_To#Delete_a_container_or_object suggests a container delete should remove the object]
[12:23:55] Amir1: you need to have sourced the mw credentials
[12:24:09] I have done that, without it it gives a different error
[12:24:46] Amir1: WFM: https://phabricator.wikimedia.org/P71802#287938
[12:24:55] I checked it in ms-fe1009, the ACL you wrote there is correct to me
[12:25:38] OK, I'll try the swift post command; I suspect it'll fail because according to the account the container still exists.
[12:25:51] ah, I know what's going on, ". 
/etc/swift/account_AUTH_admin.env" this is incorrect
[12:26:09] ignore my issue then
[12:26:22] That seems to have worked
[12:26:51] https://phabricator.wikimedia.org/P71802#287939
[12:26:59] It's getting writes now
[12:27:18] ❯ curl --connect-to upload.wikimedia.org:443:upload-lb.codfw.wikimedia.org https://upload.wikimedia.org/wikipedia/commons/thumb/f/f8/Buick_Regal_2_--_10-30-2009.jpg/280px-Buick_Regal_2_--_10-30-2009.jpg -v -o /dev/null 2>&1 |grep HTTP
[12:27:23] < HTTP/2 200
[12:28:30] one less container for me to clean :D
[12:29:09] I'm going to leave my checker-script running in case we've any others. What I don't have is a good model of what happened to the container db :(
[12:30:28] but I think I should probably step away from the keyboard for a bit now
[12:30:33] qps to thumbor going up but seems like it's handling the load ok
[12:31:42] 👍
[13:08:44] one user complaint of 429 from thumbor (for a different container)
[13:13:14] ugh the response code scale is log10
[13:13:53] we're sending about 10 429 per second
[13:16:58] it's back to baseline, so I don't think we should worry too much about it
[13:25:06] that particular file looks like it's getting a 500 from the backend
[13:33:11] I think it's getting generation failure throttled
[13:33:29] * claime lunch, actually
[13:43:40] yeah, GIMP doesn't like the original .png
[15:19:07] I'm tracking investigation of this issue in T383053 but it looks like swift decided the container DB was corrupt
[15:19:08] T383053: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw - https://phabricator.wikimedia.org/T383053
[15:19:48] (I'm covering for jhathaway today)
[15:37:55] JennH: Do you need me to switch off db2143? https://phabricator.wikimedia.org/T382751?
[15:38:26] I can shut it off. But thank you for checking!
[15:38:35] JennH: Let me stop mariadb first :)
[15:38:52] It is still depooled- ah yes.
[15:39:01] I was gonna ask lol
[15:40:16] JennH: Done. 
You can go for it now anytime you want (I also upgraded the kernel)
[15:40:54] Thank you!
[16:05:33] Is sendmail/postfix/mailman owned by the same team, and if yes, which team is that (I'm unsure between infrafun and svcops)
[16:05:55] I/F
[16:05:58] k
[16:09:55] context, T383047
[16:09:56] T383047: Could not send confirmation email: Unknown error in PHP's mail() function. - https://phabricator.wikimedia.org/T383047