[08:41:59] Emperor: morning, something is off with swift @ codfw, it's showing 502s from every PoP that uses codfw (codfw+ulsfo+eqsin), while the eqiad ones (eqiad+drmrs+esams) have a flat line for 502s
[08:46:00] :(
[08:50:20] WTB: useful logging for requests through swift
[08:51:53] vgutierrez: I'm doing a rolling-restart now
[08:52:15] we have more frontends, they've arrived in the DC, but still aren't ready to commission :-/
[08:57:28] the rolling restart doesn't seem to have any impact on the 502s
[08:57:56] I'm staring at https://grafana.wikimedia.org/goto/qWxnUIJ4z?orgId=1
[08:59:03] Yeah, I was noticing it hadn't helped.
[09:01:30] the problem remains that I don't see the 502s, I can't see where they're coming from, and the swift servers themselves aren't showing a huge number of errors (or I'm looking at the wrong frontends)
[09:02:19] e.g. I'm still seeing some "Client disconnected on read of..." log entries, but not a huge number
[09:03:57] the TTFB graphs for eqiad and codfw look similar
[09:20:32] vgutierrez: OK, so I went and followed a 502 I could find in the logs back through all the backends and https://phabricator.wikimedia.org/P44364 [NDA, it has IPs in]
[09:23:39] That looks to me like swift was asked for a thumbnail that doesn't exist and ended up saying 502, which I __think__ means that our horrible rewrite middleware tried to talk to thumbor and got a failure - godog might confirm that this is (or isn't) expected behaviour here.
[09:24:38] Emperor: I've pasted two requests reported as 502 by ats-backend @ cp2028 in that phab paste
[09:25:19] LMS if I can find them
[09:25:26] https://upload.wikimedia.org/wikipedia/commons/thumb/5/54/Part_of_Great_Barrier_Reef_from_Helecopter.JPG/720px-Part_of_Great_Barrier_Reef_from_Helecopter.JPG is also failing for me via drmrs
[09:25:51] (this takes a while - I have to use cumin to query the frontends first, then see if there's a txnid, then try the backends...)
[09:27:19] Emperor: sorry I can't remember what happens in that case (i.e. thumbor fails)
[09:27:21] this last one exists in different sizes
[09:27:24] https://commons.wikimedia.org/wiki/File:Part_of_Great_Barrier_Reef_from_Helicopter.jpg
[09:27:43] yeah.. I've messed up a little bit by getting the URL from ats-be
[09:27:47] https://upload.wikimedia.org/wikipedia/commons/thumb/d/de/Part_of_Great_Barrier_Reef_from_Helicopter.jpg/720px-Part_of_Great_Barrier_Reef_from_Helicopter.jpg
[09:27:52] that's working as expected here
[09:28:27] volans: hmm notice the typo in the one on the paste
[09:28:34] Helecopter VS helicopter
[09:28:48] are we returning 502s for 404s?
[09:30:50] sorry, working on it, this takes time to log-dive
[09:32:13] vgutierrez: I've dug the first one out of the logs for you (into the phab paste)
[09:32:23] * vgutierrez checking
[09:32:39] here again swift is asking 6 nodes for the object (so the 3 real ones and 3 fallback nodes), all of which say 404.
[09:33:27] Emperor: and backends returning 404s trigger a 502 on ms-fe?
[09:33:30] there we do _also_ see the frontend asking for a listing; I'm not sure how that relates
[09:34:13] vgutierrez: Normally, it shouldn't - but for thumbnails we have our own custom 404 plumbing that is meant to ask thumbor to produce the thumbnail and then return the answer to the requestor.
[09:34:59] Emperor: so besides the 404s on ms-be nodes we should be able to track that request on thumbor?
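
A very rough sketch of the flow Emperor describes above - not the actual rewrite middleware (that is the rewrite.py linked just below), and the thumbor endpoint here is made up: on a 404 for a thumbnail path, ask thumbor to render it and pass whatever thumbor answers, including a 502, straight back to the client.

    # Sketch only, not the real rewrite.py; the thumbor endpoint is a placeholder.
    import urllib.error
    import urllib.request

    THUMBOR_BASE = "http://localhost:8800"  # assumption: wherever thumbor listens

    def thumb_status(swift_status, thumb_path):
        """If swift said 404 for a thumbnail, ask thumbor to render it and pass
        thumbor's answer (including a 502) straight back to the client."""
        if swift_status != 404:
            return swift_status              # swift had it, or failed differently
        try:
            resp = urllib.request.urlopen(THUMBOR_BASE + thumb_path, timeout=30)
            return resp.status               # thumbor rendered the thumbnail
        except urllib.error.HTTPError as exc:
            return exc.code                  # thumbor's own 502 propagates upward
        except urllib.error.URLError:
            return 502                       # thumbor unreachable -> Bad Gateway
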
[09:35:07] https://github.com/wikimedia/operations-puppet/blob/production/modules/swift/files/python3.9/SwiftMedia/wmf/rewrite.py
[09:35:17] vgutierrez: ideally, yes (but I don't know where/how to)
[09:35:49] I would expect a request from swift (probably ms-fe2010, which was handling this request) to thumbor to generate that thumbnail at 09:23:06 or thereabouts
[09:36:44] also.. our generic 404 is quite misleading
[09:37:18] https://upload.wikimedia.org/wikipedia/it/thumb/1/1d/Tempio_di_Antonino_e_Faustina.jpg/782px-Tempio_di_Antonino_e_Faustina.jpg returns a 404 but definitely looks like a 5xx
[09:42:07] I'll go follow your other request through the logs for now
[09:45:38] Emperor: nothing special about them, just the first two that appeared, FYI
[09:48:53] vgutierrez: that second one is more interesting, as it happens.
[09:49:23] vgutierrez: you'll see we've been asked for it 88 times this morning, and we've said 404 to 83 of those requests and 502 to 5
[09:49:48] FWIW the CDN caches 404s for 10 minutes
[09:50:59] sorry, "we" here == "the swift service"
[09:51:40] what's the difference between the 502s and the 404s?
[09:52:02] cause we've seen that 404s in ms-be can translate to a 502 to ats-backend aka the user
[09:53:15] I'm now looking for the transaction you flagged (which ended up being 502) in the backends; then I'll pick one that swift ended up saying 404 to
[09:57:55] vgutierrez: paste updated, but TL;DR> from the backend they look the same, which I think means the next question is "what did thumbor say to those requests?"
[09:58:17] [I'm going to be AFK for 10m now]
[10:09:06] Do we have any thumbor experts here? Or at least someone who is familiar with where/how it keeps logs? I know hnowlan has been looking at k8sifying it...
[10:12:03] Emperor: Don't know much about thumbor, but maybe check https://logstash.wikimedia.org/goto/181e3bec56099a8ab44db52a66984215 ?
[10:12:35] But from what I heard h.nowlan say it's a pain to debug
[10:15:11] those logs don't look very enlightening about what thumbor is doing, from a quick look (but I'm probably driving logstash wrong)
[10:16:46] Emperor: No, they really don't. :/
[10:17:18] I had the link in my bookmarks, but yeah, they're quite unhelpful
[10:24:14] OK, after some wrestling, I can do e.g. host: thumbor* AND message: "Tempio_di_Antonino_e_Faustina.jpg/782px-Tempio_di_Antonino_e_Faustina.jpg"
[10:32:50] vgutierrez: OK, for the requests you asked about, swift is returning a 502 from thumbor (having failed to find the thumbnail locally).
[10:34:35] hmmm so a 502 could be thumbor being AWOL?
[10:34:47] or too busy?
[10:34:55] cause after all that's a Bad Gateway error
[10:36:02] /srv/log/thumbor/thumbor.error.log is empty on thumbor2003, which is "helpful"
[10:36:30] vgutierrez: I think the 502 is "swift asked thumbor, thumbor said 502 for whatever reason, swift passed that back to the client"
[10:36:59] Emperor: yeah.. I'm wondering why thumbor is returning a 502
[10:37:19] Hm...
[10:38:31] sometimes thumbor logs ClientException('Object GET failed ',) ; sometimes it logs SSLError(SSLError(IOError(24, 'Too many open files'),),). I'm not sure that's a relevant detail
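
The manual log chase Emperor describes through the morning (grep the ms-fe proxy log for the URL, pull out the swift transaction id, then look for that id in the ms-be logs) can be scripted roughly as below; the log paths are placeholders, and the txn-id pattern assumes swift's usual tx<21 hex>-<10 hex> form.

    # Rough, scripted version of the ms-fe -> ms-be log chase; paths and the
    # transaction-id format are assumptions to adjust for the local setup.
    import re
    import sys

    TXN_RE = re.compile(r"tx[0-9a-f]{21}-[0-9a-f]{10}")

    def txn_ids(proxy_log, url_fragment):
        """Collect transaction ids from proxy log lines mentioning the URL."""
        ids = set()
        with open(proxy_log, errors="replace") as fh:
            for line in fh:
                if url_fragment in line:
                    ids.update(TXN_RE.findall(line))
        return ids

    def backend_lines(backend_log, ids):
        """Yield backend log lines that mention any of the transaction ids."""
        with open(backend_log, errors="replace") as fh:
            for line in fh:
                if any(txn in line for txn in ids):
                    yield line

    if __name__ == "__main__":
        # e.g. trace.py proxy.log object-server.log 720px-Part_of_Great_Barrier_Reef...
        proxy, backend, fragment = sys.argv[1:4]
        ids = txn_ids(proxy, fragment)
        print("transaction ids:", ", ".join(sorted(ids)) or "(none found)")
        for line in backend_lines(backend, ids):
            print(line, end="")
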
[10:38:41] -rw-r--r-- 1 root root 0 Feb 13 06:25 thumbor.error.log
[10:38:42] -rw-r--r-- 1 root root 37M Feb 13 10:38 thumbor.error.log.1
[10:38:50] it looks to me like thumbor is struggling with log rotation
[10:38:52] 502 in thumbor could be literally *anything* tbh unfortunately
[10:38:59] "yay"
[10:39:17] Feb 13 10:38:28 thumbor2003 thumbor@8804[28810]: 2023-02-13 10:38:28,756 8804 thumbor:ERROR [SWIFT_STORAGE] get exception: SSLError(SSLError(IOError(24, 'Too many open files'),),)
[10:39:24] that's pretty specific
[10:39:53] that doesn't seem to correspond with the requests it says 502 to, though
[10:39:59] vgutierrez@thumbor2003:/srv/log/thumbor$ fgrep "Too many open files" thumbor.error.log.1 |wc -l
[10:39:59] 72466
[10:40:19] e.g. Feb 13 09:23:14 thumbor2003 thumbor@8804[28810]: 2023-02-13 09:23:14,479 8804 tornado.access:ERROR 502 GET /wikipedia/commons/thumb/5/54/Part_of_Great_Barrier_Reef_from_Helecopter.JPG/720px-Part_of_Great_Barrier_Reef_from_Helecopter.JPG (10.192.48.44) 56.78ms
[10:40:54] Oh, actually, that's also an IOError one 2023-02-13 09:23:14,475 8804 thumbor:ERROR [SWIFT_LOADER] get_object failed: https%3A//ms-fe.svc.codfw.wmnet/v1/AUTH_mw/wikipedia-commons-local-public.54/5/54/Part_of_Great_Barrier_Reef_from_Helecopter.JPG SSLError(SSLError(IOError(24, 'Too many open files'),),)
[10:41:52] mind you, there are 41 IOErrors logged against that, and only 8 502s returned
[10:42:09] at least per grep -c Part_of_Great_Barrier_Reef_from_Helecopter.JPG thumbor.error.log.1
[10:43:39] Emperor: you've also got the haproxy log available there
[10:44:01] apparently the helecopter thumb triggers 404s or 502s
[10:44:26] Ah, actually, if I restrict the date ranges (log rotation FTL here) I think we do have about the same number of IOErrors and 502s
[10:44:38] that's on /var/log/haproxy/haproxy.log
[10:45:38] grep Part_of_Great_Barrier_Reef_from_Helecopter.JPG thumbor.404.log | grep IOError | grep -c 2023-02-13 # grep Part_of_Great_Barrier_Reef_from_Helecopter.JPG thumbor.error.log.1 | grep -c 'Feb 13' both 6
[10:45:54] likewise Feb 12 / 2023-02-12 both 3
[10:46:23] Let me update the paste with some of this
[10:54:00] vgutierrez: right, the IOErrors are always EMFILE, and thumbor2003 has logged 34124 of those today
[10:54:46] vgutierrez: and I think all of those will end up in thumbor saying 502 to swift (and in turn swift saying 502 upwards)
[10:55:06] hnowlan: dunno if it'd be worth increasing the max open files on thumbor hosts?
[11:02:54] vgutierrez@thumbor2003:~$ sudo -i systemctl show thumbor@8801.service |grep NOFILE
[11:02:54] LimitNOFILE=4096
[11:02:54] LimitNOFILESoft=1024
[11:03:36] Can't hurt to bump it honestly
[11:03:49] Especially since they're dedicated hosts
[11:03:57] Emperor: sounds like it couldn't hurt. sorry, in a meeting
[11:04:05] hnowlan: want me to take care of it?
[11:08:42] once I'm out I want to try to see what fetching those files does in a dev instance though, I worry those 500s are hiding something to do with the image itself rather than *just* being ulimit issues.
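
The same comparison as the greps above, done in one pass over the error log; the path and message formats are taken from the lines pasted in this conversation and may differ on other hosts.

    # Count EMFILE errors vs 502 responses per day in a thumbor error log.
    # Log path and line formats are assumptions based on the snippets above.
    import re
    from collections import Counter

    LOG = "/srv/log/thumbor/thumbor.error.log.1"

    emfile = Counter()
    http_502 = Counter()
    day_re = re.compile(r"(\d{4}-\d{2}-\d{2})")

    with open(LOG, errors="replace") as fh:
        for line in fh:
            m = day_re.search(line)
            if not m:
                continue
            day = m.group(1)
            if "Too many open files" in line:
                emfile[day] += 1
            if "tornado.access:ERROR 502" in line:
                http_502[day] += 1

    for day in sorted(set(emfile) | set(http_502)):
        print(f"{day}: EMFILE={emfile[day]} 502s={http_502[day]}")
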
[11:09:08] claime: if you have time, please
[11:09:19] you can possibly set it per-firejail
[11:09:21] hnowlan: yeah I can check it out, np
[11:09:36] Hmm that's possible yeah
[11:13:37] hnowlan: the ones that started all this, the image doesn't exist at all :)
[11:16:51] vgutierrez: I wonder if firejail bypasses the system limit or not
[11:18:28] In any case, the firejail set limit is default for now
[11:24:07] If you do manage to adjust the file limit, LMK, so I can see if that changes our error volume?
[11:24:15] Sure
[11:28:42] Thanks :) given the error messages, I'm hopeful it'll improve matters...
[11:29:27] Emperor: I'll try just raising the firejail profile's limit
[11:30:08] We'll see if we need to raise system limits, because I'm not sure how firejail handles this
[11:56:19] Firejail profile merged and applied on thumbor2003, restarting thumbor-instances service
[11:57:06] Ah I need to deppol. restart then repool
[11:57:10] depool*
[11:58:32] Done on thumbor2003, waiting for new requests to come in to see if it helped or not
[11:59:18] claime: doesn't seem to have entirely worked - mvernon@thumbor2003:/srv/log/thumbor$ systemctl show thumbor@8801.service | grep NOFILE => LimitNOFILE=4096
[11:59:31] Emperor: I didn't change what systemd sets
[12:00:05] cgoubert@thumbor2003:~$ /usr/bin/firejail --profile=/etc/firejail/thumbor.profile ulimit -n
[12:00:06] Reading profile /etc/firejail/thumbor.profile
[12:00:08] Parent pid 8706, child pid 8707
[12:00:10] Child process initialized
[12:00:12] 8192
[12:00:20] Ah, yes, systemd is just being unhelpful
[12:00:33] /proc/3365/limits has the right numbers in
[12:01:01] (the pid I pulled from thumbor@8801.service
[12:01:02] )
[12:01:06] I'm waiting to see if I get any more limit errors
[12:01:17] If not, I'll run puppet on all thumbor hosts and roll restart
[12:02:40] claime: there are now python tracebacks in thumbor.error.log.1 :-/
[12:02:51] yeah
[12:02:53] I see them
[12:03:16] Feb 13 12:02:01 thumbor2003 thumbor@8802[3047]: 2023-02-13 12:02:01,499 8802 tornado.access:ERROR 500 GET /wikipedia/commons/thumb/d/dc/Born_of_a_Dream.mpg/200px-Born_of_a_Dream.mpg.jpg (10.192.16.7
[12:04:07] They look like actual errors though
[12:05:16] claime: yeah, there are similar errors in older logs, too, so apologies for chucking a red herring at you
[12:05:22] No worries
[12:06:59] I'll deploy/roll-restart in codfw first,
[12:07:05] +1
[12:07:46] yeah, that looks like an image format error
[12:07:50] claime: thanks for handling that change
[12:08:29] no worries hnowlan ;)
[12:18:07] 5xx basically gone to 0 in codfw
[12:18:46] I wonder how many stale or otherwise obsolete (timed out jobs etc) file handles were hanging around pre-restart
[12:19:42] yeah...
[12:20:13] probably couldn't hurt to do eqiad, there's 20k open files on thumbor1006 for example
[12:20:22] cgoubert@thumbor1001:~$ sudo lsof | grep -c deleted
[12:20:24] 419
[12:20:33] Not that many stales
[12:20:42] But yeah, I'm gonna roll-restart eqiad too
[12:21:04] running puppet now
[12:21:47] ty
[12:22:26] a *lot* of pipes, which are the side effect of it shelling out
[12:22:38] yeah
[12:22:48] haproxy[37648]: proxy thumbor has no server available!
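
As an aside on the checks above: since systemctl show still reports the old LimitNOFILE, the effective numbers live in /proc, as the /proc/3365/limits check shows. A small sketch that reads the soft "Max open files" limit and counts open fds for every running thumbor worker; matching on "thumbor" in the cmdline is just an assumption about the process name, and reading another user's fd table needs root.

    # Read the effective NOFILE soft limit and current fd count per process.
    import os

    def nofile_limit(pid):
        with open(f"/proc/{pid}/limits") as fh:
            for line in fh:
                if line.startswith("Max open files"):
                    return int(line.split()[3])     # soft limit column
        return None

    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/cmdline", "rb") as fh:
                cmd = fh.read().replace(b"\0", b" ").decode()
            if "thumbor" not in cmd:
                continue
            open_fds = len(os.listdir(f"/proc/{pid}/fd"))
            print(f"pid {pid}: {open_fds} open fds, soft limit {nofile_limit(pid)}")
        except (FileNotFoundError, PermissionError):
            continue   # process exited, or we lack permission to read its fd table
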
[12:22:52] Huh
[12:23:15] Ah, it screams when you roll-restart
[12:23:18] ok haproxy
[12:25:17] Ok all done restarting in eqiad and codfw
[12:26:18] Emperor: ^
[12:28:23] if you want a lot more debug output on a single image in thumbor, there's a single thumbor instance running in kubernetes staging that you can access via thumbor.svc.codfw.wmnet:8800 and the logs are a lot more verbose
[12:28:36] not always useful, but it's good for figuring out if there's a problem with the image or the image handling
[12:33:11] redeploying changeprop-jobqueue after the rollback on Friday
[12:33:48] hnowlan: ack
[12:35:48] Hope we don't get Friday's certificate bs all over again
[12:37:27] claime: https://logstash.wikimedia.org/goto/14c050179c619427ad5509f87998f9df looking much happier
[12:38:34] Emperor: Yeah, looks good.
[12:39:35] likewise the ATS backends / origin servers view for codfw swift
[12:50:41] jobqueue restarts done
[12:51:04] have even less idea wtf happened with the roll-restart last week now 🤷
[12:53:35] Yeah I was gonna say, I kept an eye on the logs and... Nothing happened
[12:54:03] Computers were a mistake
[12:54:18] interesting note on the codfw thumbor 5xx errors - they started to spike just after we repooled after all services were depooled: https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor?orgId=1&from=1675761708619&to=1676292728844
[12:56:43] hnowlan: You meand on 2023/02/07 ?
[12:56:47] mean*
[13:05:04] claime: yup. I guess *maybe* a bunch of streams didn't get properly closed all at once, and that left pipes open. Pretty silly problem to have though
[13:05:55] hnowlan: Agreed
[13:31:30] hope you are feeling better, question_mark!
[14:27:13] andrewbogott: morning!
[14:27:22] https://gerrit.wikimedia.org/r/c/operations/puppet/+/888236 should be a quick review
[14:27:40] happy to take care of rolling it out but it's a simple name change for a deprecated option
[14:27:42] sukhe: we may be having an outage so you'll need to ping me later :)
[14:27:51] oh sorry continue
[14:37:50] hnowlan, Emperor, claime: ATS seems to be happy after your file limit tuning in thumbor
[14:38:51] see https://grafana.wikimedia.org/goto/VO_BFH14z?orgId=1 picking just 502 you can see it flatlining at ~12 UTC
[14:46:05] *\o/*
[15:07:41] Yay :D
[15:17:20] tbf, I'm not sure it's specifically the ulimit change, or just the service restart that caused lingering fds to be cleared
[15:17:49] but still, good that it cleared the issue for now
[15:36:31] sukhe: OK, I have a moment of peace, want to merge that recursor change right now?
[15:36:38] andrewbogott: happy to!
[15:36:48] I'll keep an eye out
[15:36:51] thank you
[15:37:14] I am going to disable puppet on A:dns-rec, just in case, even though NOOP from PCC
[15:37:38] cloudservices* is the only one I am concerned about, in the sense that it will result in a restart
[15:37:41] merging
[15:40:45] cloudservices1004 looks good
[15:40:54] NOOP on dns1001, merging on others
[15:42:10] yep, everything looks fine on my end. Thanks for the warning!
[15:43:04] thanks!
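
Going back to hnowlan's 12:28 tip about the more verbose staging instance: a minimal single-image check against it might look like the sketch below. The endpoint and port are as quoted above; the thumbnail path shape is taken from the tornado.access log lines earlier, so treat both as assumptions to adjust.

    # One-off check of a single thumbnail against the staging thumbor instance.
    import urllib.error
    import urllib.request

    BASE = "http://thumbor.svc.codfw.wmnet:8800"   # as mentioned at 12:28 above
    PATH = ("/wikipedia/commons/thumb/5/54/Part_of_Great_Barrier_Reef_from_"
            "Helecopter.JPG/720px-Part_of_Great_Barrier_Reef_from_Helecopter.JPG")

    try:
        resp = urllib.request.urlopen(BASE + PATH, timeout=60)
        print(resp.status, resp.headers.get("Content-Type"))
    except urllib.error.HTTPError as exc:
        print(exc.code, exc.reason)   # e.g. 404/500/502, with the verbose logs to match
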
[16:02:32] I'm dry-running the switchdc.mediawiki cookbook, and I'm hitting an error in 03-set-db-readonly; I can't figure out if it's because I'm dry-running and nothing changes, or if there's actually a problem with the cumin query building in mysql.check_core_masters_in_sync()
[16:02:37] Paste https://phabricator.wikimedia.org/P44430
[16:02:49] If anyone has insight, it'd be much appreciated :)
[16:05:43] Looks like it happens when it's trying to check x2
[16:07:15] claime: having a look
[16:07:30] volans: You can check the output in my tmux on cumin1001 if you want
[16:07:43] sudo to my user
[16:07:58] ack
[16:25:00] claime: so the problem is that none of the hosts in A:db-section-x2 are part of A:db-core
[16:25:13] marostegui: hey, we have the main board for db2181, is it possible to power the server off? thanks
[16:25:22] the generated query quen calling get_core_masters_heartbeats('eqiad', 'eqiad')
[16:25:31] is: 'A:db-core and A:eqiad and A:db-section-x2 and A:db-role-master'
[16:25:39] and that raises the exception because there are no hosts
[16:25:55] s/quen/when/
[16:26:02] papaul_: yes, give me a sec
[19:25:00] inflatador: I'd certainly be interested if they have any thoughts/observations re "ghost" objects (our tickets T327253 / https://bugs.launchpad.net/swift/+bug/2003941 and the older https://bugs.launchpad.net/swift/+bug/1619408 are all public); but I think I understand what I need to do from a swift/sqlite POV; it's mostly "just" the question of doing more of it automatically (e.g. parsing swift-get-nodes output) and in a way that works on our infra
[19:53:08] Emperor: ACK. I sent out the phab task to a few former coworkers to see if they have any feedback
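
On the "parsing swift-get-nodes output" automation Emperor mentions at 19:25: a possible starting point is sketched below. It assumes the usual "Server:Port Device" lines in the output (the exact layout varies between swift versions, so treat the pattern as something to verify); the example arguments reuse the container/object from the SWIFT_LOADER log line earlier.

    # Sketch: extract primary and handoff nodes from swift-get-nodes output.
    # The output format is an assumption to check against the local swift version.
    import re
    import subprocess

    def object_nodes(ring, account, container, obj):
        out = subprocess.run(
            ["swift-get-nodes", ring, account, container, obj],
            capture_output=True, text=True, check=True,
        ).stdout
        nodes = []
        for line in out.splitlines():
            m = re.match(r"Server:Port Device\s+(\S+):(\d+) (\S+)(.*)", line)
            if m:
                ip, port, device, rest = m.groups()
                nodes.append({
                    "ip": ip,
                    "port": int(port),
                    "device": device,
                    "handoff": "Handoff" in rest,
                })
        return nodes

    # e.g. object_nodes("/etc/swift/object.ring.gz", "AUTH_mw",
    #                   "wikipedia-commons-local-public.54",
    #                   "5/54/Part_of_Great_Barrier_Reef_from_Helecopter.JPG")
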