[08:41:59] Emperor: morning, something is off with swift @ codfw, it's showing 502s from every PoP that uses codfw (codfw+ulsfo+eqsin), while the eqiad ones (eqiad+drmrs+esams) have a flat line for 502s
[08:46:00] :(
[08:50:20] WTB: useful logging for requests through swift
[08:51:53] vgutierrez: I'm doing a rolling-restart now
[08:52:15] we have more frontends, they've arrived in the DC, but still aren't ready to commission :-/
[08:57:28] the rolling restart doesn't seem to have any impact on the 502s
[08:57:56] I'm staring at https://grafana.wikimedia.org/goto/qWxnUIJ4z?orgId=1
[08:59:03] Yeah, I was noticing it hadn't helped.
[09:01:30] the problem remains that I don't see the 502s, I can't see where they're coming from, and the swift servers themselves aren't showing a huge number of errors (or I'm looking at the wrong frontends)
[09:02:19] e.g. I'm still seeing some "Client disconnected on read of..." log entries, but not a huge number
[09:03:57] the TTFB graphs for eqiad and codfw look similar
[09:20:32] vgutierrez: OK, so I went and followed a 502 I could find in the logs back through all the backends and https://phabricator.wikimedia.org/P44364 [NDA, it has IPs in]
[09:23:39] That looks to me like swift was asked for a thumbnail that doesn't exist and ended up saying 502, which I __think__ means that our horrible rewrite middleware tried to talk to thumbor and got a failure - godog might confirm that this is (or isn't) expected behaviour here.
[09:24:38] Emperor: I've pasted two requests reported as 502 by ats-backend @ cp2028 in that phab paste
[09:25:19] LMS if I can find them
[09:25:26] https://upload.wikimedia.org/wikipedia/commons/thumb/5/54/Part_of_Great_Barrier_Reef_from_Helecopter.JPG/720px-Part_of_Great_Barrier_Reef_from_Helecopter.JPG is also failing for me via drmrs
[09:25:51] (this takes a while - I have to use cumin to query the frontends first, then see if there's a txnid, then try the backends...)
[09:27:19] Emperor: sorry I can't remember what happens in that case (i.e. thumbor fails)
[09:27:21] this last one exists in different sizes
[09:27:24] https://commons.wikimedia.org/wiki/File:Part_of_Great_Barrier_Reef_from_Helicopter.jpg
[09:27:43] yeah.. I've messed up a little bit by getting the URL from ats-be
[09:27:47] https://upload.wikimedia.org/wikipedia/commons/thumb/d/de/Part_of_Great_Barrier_Reef_from_Helicopter.jpg/720px-Part_of_Great_Barrier_Reef_from_Helicopter.jpg
[09:27:52] that's working as expected here
[09:28:27] volans: hmm notice the typo in the one on the paste
[09:28:34] Helecopter VS helicopter
[09:28:48] are we returning 502s for 404s?
[09:30:50] sorry, working on it, this takes time to log-dive
[09:32:13] vgutierrez: I've dug the first one out of the logs for you (into the phab paste)
[09:32:23] * vgutierrez checking
[09:32:39] here again swift is asking 6 nodes for the object (so the 3 real ones and 3 fallback nodes), all of which say 404.
[09:33:27] Emperor: and backends returning 404s trigger a 502 on ms-fe?
[09:33:30] there we do _also_ see the frontend asking for a listing; I'm not sure how that relates
[09:34:13] vgutierrez: Normally, it shouldn't - but for thumbnails we have our own custom 404 plumbing that is meant to ask thumbor to produce the thumbnail and then return the answer to the requestor.
[09:34:59] Emperor: so besides the 404s on ms-be nodes we should be able to track that request on thumbor?
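
A very rough sketch of the flow Emperor describes above - not the actual rewrite middleware (that is the rewrite.py linked just below), and the thumbor endpoint here is made up: on a 404 for a thumbnail path, ask thumbor to render it and pass whatever thumbor answers, including a 502, straight back to the client.

    # Sketch only, not the real rewrite.py; the thumbor endpoint is a placeholder.
    import urllib.error
    import urllib.request

    THUMBOR_BASE = "http://localhost:8800"  # assumption: wherever thumbor listens

    def thumb_status(swift_status, thumb_path):
        """If swift said 404 for a thumbnail, ask thumbor to render it and pass
        thumbor's answer (including a 502) straight back to the client."""
        if swift_status != 404:
            return swift_status              # swift had it, or failed differently
        try:
            resp = urllib.request.urlopen(THUMBOR_BASE + thumb_path, timeout=30)
            return resp.status               # thumbor rendered the thumbnail
        except urllib.error.HTTPError as exc:
            return exc.code                  # thumbor's own 502 propagates upward
        except urllib.error.URLError:
            return 502                       # thumbor unreachable -> Bad Gateway
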
[09:35:07] https://github.com/wikimedia/operations-puppet/blob/production/modules/swift/files/python3.9/SwiftMedia/wmf/rewrite.py
[09:35:17] vgutierrez: ideally, yes (but I don't know where/how to)
[09:35:49] I would expect a request from swift (probably ms-fe2010, which was handling this request) to thumbor to generate that thumbnail at 09:23:06 or thereabouts
[09:36:44] also.. our generic 404 is quite misleading
[09:37:18] https://upload.wikimedia.org/wikipedia/it/thumb/1/1d/Tempio_di_Antonino_e_Faustina.jpg/782px-Tempio_di_Antonino_e_Faustina.jpg returns a 404 but definitely looks like a 5xx
[09:42:07] I'll go follow your other request through the logs for now
[09:45:38] Emperor: nothing special about them, just the first two that appeared, FYI
[09:48:53] vgutierrez: that second one is more interesting, as it happens.
[09:49:23] vgutierrez: you'll see we've been asked for it 88 times this morning, and we've said 404 to 83 of those requests and 502 to 5
[09:49:48] FWIW the CDN caches 404s for 10 minutes
[09:50:59] sorry, "we" here == "the swift service"
[09:51:40] what's the difference between the 502s and the 404s?
[09:52:02] cause we've seen that 404s in ms-be can translate to a 502 to ats-backend aka the user
[09:53:15] I'm now looking for the transaction you flagged (which ended up being 502) in the backends; then I'll pick one that swift ended up saying 404 to
[09:57:55] vgutierrez: paste updated, but TL;DR> from the backend they look the same, which I think means the next question is "what did thumbor say to those requests?"
[09:58:17] [I'm going to be AFK for 10m now]
[10:09:06] Do we have any thumbor experts here? Or at least someone who is familiar with where/how it keeps logs? I know hnowlan has been looking at k8sifying it...
[10:12:03] Emperor: Don't know much about thumbor, but maybe check https://logstash.wikimedia.org/goto/181e3bec56099a8ab44db52a66984215 ?
[10:12:35] But from what I heard h.nowlan say it's a pain to debug
[10:15:11] those logs don't look very enlightening about what thumbor is doing, from a quick look (but I'm probably driving logstash wrong)
[10:16:46] Emperor: No, they really don't. :/
[10:17:18] I had the link in my bookmarks, but yeah, they're quite unhelpful
[10:24:14] OK, after some wrestling, I can do e.g. host: thumbor* AND message: "Tempio_di_Antonino_e_Faustina.jpg/782px-Tempio_di_Antonino_e_Faustina.jpg"
[10:32:50] vgutierrez: OK, for the requests you asked about, swift is returning a 502 from thumbor (having failed to find the thumbnail locally).
[10:34:35] hmmm so a 502 could be thumbor being AWOL?
[10:34:47] or too busy?
[10:34:55] cause after all that's a Bad Gateway error
[10:36:02] /srv/log/thumbor/thumbor.error.log is empty on thumbor2003, which is "helpful"
[10:36:30] vgutierrez: I think the 502 is "swift asked thumbor, thumbor said 502 for whatever reason, swift passed that back to the client"
[10:36:59] Emperor: yeah.. I'm wondering why thumbor is returning a 502
[10:37:19] Hm...
[10:38:31] sometimes thumbor logs ClientException('Object GET failed ',) ; sometimes it logs SSLError(SSLError(IOError(24, 'Too many open files'),),). I'm not sure that's a relevant detail
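
The manual log chase Emperor describes through the morning (grep the ms-fe proxy log for the URL, pull out the swift transaction id, then look for that id in the ms-be logs) can be scripted roughly as below; the log paths are placeholders, and the txn-id pattern assumes swift's usual tx<21 hex>-<10 hex> form.

    # Rough, scripted version of the ms-fe -> ms-be log chase; paths and the
    # transaction-id format are assumptions to adjust for the local setup.
    import re
    import sys

    TXN_RE = re.compile(r"tx[0-9a-f]{21}-[0-9a-f]{10}")

    def txn_ids(proxy_log, url_fragment):
        """Collect transaction ids from proxy log lines mentioning the URL."""
        ids = set()
        with open(proxy_log, errors="replace") as fh:
            for line in fh:
                if url_fragment in line:
                    ids.update(TXN_RE.findall(line))
        return ids

    def backend_lines(backend_log, ids):
        """Yield backend log lines that mention any of the transaction ids."""
        with open(backend_log, errors="replace") as fh:
            for line in fh:
                if any(txn in line for txn in ids):
                    yield line

    if __name__ == "__main__":
        # e.g. trace.py proxy.log object-server.log 720px-Part_of_Great_Barrier_Reef...
        proxy, backend, fragment = sys.argv[1:4]
        ids = txn_ids(proxy, fragment)
        print("transaction ids:", ", ".join(sorted(ids)) or "(none found)")
        for line in backend_lines(backend, ids):
            print(line, end="")
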
[10:38:41] -rw-r--r-- 1 root root 0 Feb 13 06:25 thumbor.error.log
[10:38:42] -rw-r--r-- 1 root root 37M Feb 13 10:38 thumbor.error.log.1
[10:38:50] it looks to me like thumbor is struggling with log rotation
[10:38:52] 502 in thumbor could be literally *anything* tbh unfortunately
[10:38:59] "yay"
[10:39:17] Feb 13 10:38:28 thumbor2003 thumbor@8804[28810]: 2023-02-13 10:38:28,756 8804 thumbor:ERROR [SWIFT_STORAGE] get exception: SSLError(SSLError(IOError(24, 'Too many open files'),),)
[10:39:24] that's pretty specific
[10:39:53] that doesn't seem to correspond with the requests it says 502 to, though
[10:39:59] vgutierrez@thumbor2003:/srv/log/thumbor$ fgrep "Too many open files" thumbor.error.log.1 |wc -l
[10:39:59] 72466
[10:40:19] e.g. Feb 13 09:23:14 thumbor2003 thumbor@8804[28810]: 2023-02-13 09:23:14,479 8804 tornado.access:ERROR 502 GET /wikipedia/commons/thumb/5/54/Part_of_Great_Barrier_Reef_from_Helecopter.JPG/720px-Part_of_Great_Barrier_Reef_from_Helecopter.JPG (10.192.48.44) 56.78ms
[10:40:54] Oh, actually, that's also an IOError one 2023-02-13 09:23:14,475 8804 thumbor:ERROR [SWIFT_LOADER] get_object failed: https%3A//ms-fe.svc.codfw.wmnet/v1/AUTH_mw/wikipedia-commons-local-public.54/5/54/Part_of_Great_Barrier_Reef_from_Helecopter.JPG SSLError(SSLError(IOError(24, 'Too many open files'),),)
[10:41:52] mind you, there are 41 IOErrors logged against that, and only 8 502s returned
[10:42:09] at least per grep -c Part_of_Great_Barrier_Reef_from_Helecopter.JPG thumbor.error.log.1
[10:43:39] Emperor: you've also got the haproxy log available there
[10:44:01] apparently the helecopter thumb triggers 404s or 502s
[10:44:26] Ah, actually, if I restrict the date ranges (log rotation FTL here) I think we do have about the same number of IOErrors and 502s
[10:44:38] that's on /var/log/haproxy/haproxy.log
[10:45:38] grep Part_of_Great_Barrier_Reef_from_Helecopter.JPG thumbor.404.log | grep IOError | grep -c 2023-02-13 # grep Part_of_Great_Barrier_Reef_from_Helecopter.JPG thumbor.error.log.1 | grep -c 'Feb 13' both 6
[10:45:54] likewise Feb 12 / 2023-02-12 both 3
[10:46:23] Let me update the paste with some of this
[10:54:00] vgutierrez: right, the IOErrors are always EMFILE, and thumbor2003 has logged 34124 of those today
[10:54:46] vgutierrez: and I think all of those will end up in thumbor saying 502 to swift (and in turn swift saying 502 upwards)
[10:55:06] hnowlan: dunno if it'd be worth increasing the max open files on thumbor hosts?
[11:02:54] vgutierrez@thumbor2003:~$ sudo -i systemctl show thumbor@8801.service |grep NOFILE
[11:02:54] LimitNOFILE=4096
[11:02:54] LimitNOFILESoft=1024
[11:03:36] Can't hurt to bump it honestly
[11:03:49] Especially since they're dedicated hosts
[11:03:57] Emperor: sounds like it couldn't hurt. sorry, in a meeting
[11:04:05] hnowlan: want me to take care of it?
[11:08:42] once I'm out I want to try to see what fetching those files does in a dev instance though, I worry those 500s are hiding something to do with the image itself rather than *just* being ulimit issues.
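
The same comparison as the greps above, done in one pass over the error log; the path and message formats are taken from the lines pasted in this conversation and may differ on other hosts.

    # Count EMFILE errors vs 502 responses per day in a thumbor error log.
    # Log path and line formats are assumptions based on the snippets above.
    import re
    from collections import Counter

    LOG = "/srv/log/thumbor/thumbor.error.log.1"

    emfile = Counter()
    http_502 = Counter()
    day_re = re.compile(r"(\d{4}-\d{2}-\d{2})")

    with open(LOG, errors="replace") as fh:
        for line in fh:
            m = day_re.search(line)
            if not m:
                continue
            day = m.group(1)
            if "Too many open files" in line:
                emfile[day] += 1
            if "tornado.access:ERROR 502" in line:
                http_502[day] += 1

    for day in sorted(set(emfile) | set(http_502)):
        print(f"{day}: EMFILE={emfile[day]} 502s={http_502[day]}")
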
[11:09:08] claime: if you have time, please
[11:09:19] you can possibly set it per-firejail
[11:09:21] hnowlan: yeah I can check it out, np
[11:09:36] Hmm that's possible yeah
[11:13:37] hnowlan: the ones that started all this, the image doesn't exist at all :)
[11:16:51] vgutierrez: I wonder if firejail bypasses the system limit or not
[11:18:28] In any case, the firejail set limit is default for now
[11:24:07] If you do manage to adjust the file limit, LMK, so I can see if that changes our error volume?
[11:24:15] Sure
[11:28:42] Thanks :) given the error messages, I'm hopeful it'll improve matters...
[11:29:27] Emperor: I'll try just raising the firejail profile's limit
[11:30:08] We'll see if we need to raise system limits, because I'm not sure how firejail handles this
[11:56:19] Firejail profile merged and applied on thumbor2003, restarting thumbor-instances service
[11:57:06] Ah I need to deppol. restart then repool
[11:57:10] depool*
[11:58:32] Done on thumbor2003, waiting for new requests to come in to see if it helped or not
[11:59:18] claime: doesn't seem to have entirely worked - mvernon@thumbor2003:/srv/log/thumbor$ systemctl show thumbor@8801.service | grep NOFILE => LimitNOFILE=4096
[11:59:31] Emperor: I didn't change what systemd sets
[12:00:05] cgoubert@thumbor2003:~$ /usr/bin/firejail --profile=/etc/firejail/thumbor.profile ulimit -n
[12:00:06] Reading profile /etc/firejail/thumbor.profile
[12:00:08] Parent pid 8706, child pid 8707
[12:00:10] Child process initialized
[12:00:12] 8192
[12:00:20] Ah, yes, systemd is just being unhelpful
[12:00:33] /proc/3365/limits has the right numbers in
[12:01:01] (the pid I pulled from thumbor@8801.service
[12:01:02] )
[12:01:06] I'm waiting to see if I get any more limit errors
[12:01:17] If not, I'll run puppet on all thumbor hosts and roll restart
[12:02:40] claime: there are now python tracebacks in thumbor.error.log.1 :-/
[12:02:51] yeah
[12:02:53] I see them
[12:03:16] Feb 13 12:02:01 thumbor2003 thumbor@8802[3047]: 2023-02-13 12:02:01,499 8802 tornado.access:ERROR 500 GET /wikipedia/commons/thumb/d/dc/Born_of_a_Dream.mpg/200px-Born_of_a_Dream.mpg.jpg (10.192.16.7
[12:04:07] They look like actual errors though
[12:05:16] claime: yeah, there are similar errors in older logs, too, so apologies for chucking a red herring at you
[12:05:22] No worries
[12:06:59] I'll deploy/roll-restart in codfw first,
[12:07:05] +1
[12:07:46] yeah, that looks like an image format error
[12:07:50] claime: thanks for handling that change
[12:08:29] no worries hnowlan ;)
[12:18:07] 5xx basically gone to 0 in codfw
[12:18:46] I wonder how many stale or otherwise obsolete (timed out jobs etc) file handles were hanging around pre-restart
[12:19:42] yeah...
[12:20:13] probably couldn't hurt to do eqiad, there's 20k open files on thumbor1006 for example
[12:20:22] cgoubert@thumbor1001:~$ sudo lsof | grep -c deleted
[12:20:24] 419
[12:20:33] Not that many stales
[12:20:42] But yeah, I'm gonna roll-restart eqiad too
[12:21:04] running puppet now
[12:21:47] ty
[12:22:26] a *lot* of pipes, which are the side effect of it shelling out
[12:22:38] yeah
[12:22:48] haproxy[37648]: proxy thumbor has no server available!
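
As an aside on the checks above: since systemctl show still reports the old LimitNOFILE, the effective numbers live in /proc, as the /proc/3365/limits check shows. A small sketch that reads the soft "Max open files" limit and counts open fds for every running thumbor worker; matching on "thumbor" in the cmdline is just an assumption about the process name, and reading another user's fd table needs root.

    # Read the effective NOFILE soft limit and current fd count per process.
    import os

    def nofile_limit(pid):
        with open(f"/proc/{pid}/limits") as fh:
            for line in fh:
                if line.startswith("Max open files"):
                    return int(line.split()[3])     # soft limit column
        return None

    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/cmdline", "rb") as fh:
                cmd = fh.read().replace(b"\0", b" ").decode()
            if "thumbor" not in cmd:
                continue
            open_fds = len(os.listdir(f"/proc/{pid}/fd"))
            print(f"pid {pid}: {open_fds} open fds, soft limit {nofile_limit(pid)}")
        except (FileNotFoundError, PermissionError):
            continue   # process exited, or we lack permission to read its fd table
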
[12:22:52] Huh
[12:23:15] Ah, it screams when you roll-restart
[12:23:18] ok haproxy
[12:25:17] Ok all done restarting in eqiad and codfw
[12:26:18] Emperor: ^
[12:28:23] if you want a lot more debug output on a single image in thumbor, there's a single thumbor instance running in kubernetes staging that you can access via thumbor.svc.codfw.wmnet:8800 and the logs are a lot more verbose
[12:28:36] not always useful, but it's good for figuring out if there's a problem with the image or the image handling
[12:33:11] redeploying changeprop-jobqueue after the rollback on Friday
[12:33:48] hnowlan: ack
[12:35:48] Hope we don't get Friday's certificate bs all over again
[12:37:27] claime: https://logstash.wikimedia.org/goto/14c050179c619427ad5509f87998f9df looking much happier
[12:38:34] Emperor: Yeah, looks good.
[12:39:35] likewise the ATS backends / origin servers view for codfw swift
[12:50:41] jobqueue restarts done
[12:51:04] have even less idea wtf happened with the roll-restart last week now 🤷
[12:53:35] Yeah I was gonna say, I kept an eye on the logs and... Nothing happened
[12:54:03] Computers were a mistake
[12:54:18] interesting note on the codfw thumbor 5xx errors - they started to spike just after we repooled after all services were depooled: https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor?orgId=1&from=1675761708619&to=1676292728844
[12:56:43] hnowlan: You meand on 2023/02/07 ?
[12:56:47] mean*
[13:05:04] claime: yup. I guess *maybe* a bunch of streams didn't get properly closed all at once, and that left pipes open. Pretty silly problem to have though
[13:05:55] hnowlan: Agreed
[13:31:30] hope you are feeling better, question_mark!
[14:27:13] andrewbogott: morning!
[14:27:22] https://gerrit.wikimedia.org/r/c/operations/puppet/+/888236 should be a quick review
[14:27:40] happy to take care of rolling it out but it's a simple name change for a deprecated option
[14:27:42] sukhe: we may be having an outage so you'll need to ping me later :)
[14:27:51] oh sorry continue
[14:37:50] hnowlan, Emperor, claime: ATS seems to be happy after your file limit tuning in thumbor
[14:38:51] see https://grafana.wikimedia.org/goto/VO_BFH14z?orgId=1 picking just 502 you can see it flatlining at ~12 UTC
[14:46:05] *\o/*
[15:07:41] Yay :D
[15:17:20] tbf, I'm not sure it's specifically the ulimit change, or just the service restart that caused lingering fds to be cleared
[15:17:49] but still, good that it cleared the issue for now
[15:36:31] sukhe: OK, I have a moment of peace, want to merge that recursor change right now?
[15:36:38] andrewbogott: happy to!
[15:36:48] I'll keep an eye out
[15:36:51] thank you
[15:37:14] I am going to disable puppet on A:dns-rec, just in case, even though NOOP from PCC
[15:37:38] cloudservices* is the only one I am concerned about, in the sense that it will result in a restart
[15:37:41] merging
[15:40:45] cloudservices1004 looks good
[15:40:54] NOOP on dns1001, merging on others
[15:42:10] yep, everything looks fine on my end. Thanks for the warning!
[15:43:04] thanks!
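
Going back to hnowlan's 12:28 tip about the more verbose staging instance: a minimal single-image check against it might look like the sketch below. The endpoint and port are as quoted above; the thumbnail path shape is taken from the tornado.access log lines earlier, so treat both as assumptions to adjust.

    # One-off check of a single thumbnail against the staging thumbor instance.
    import urllib.error
    import urllib.request

    BASE = "http://thumbor.svc.codfw.wmnet:8800"   # as mentioned at 12:28 above
    PATH = ("/wikipedia/commons/thumb/5/54/Part_of_Great_Barrier_Reef_from_"
            "Helecopter.JPG/720px-Part_of_Great_Barrier_Reef_from_Helecopter.JPG")

    try:
        resp = urllib.request.urlopen(BASE + PATH, timeout=60)
        print(resp.status, resp.headers.get("Content-Type"))
    except urllib.error.HTTPError as exc:
        print(exc.code, exc.reason)   # e.g. 404/500/502, with the verbose logs to match
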
[16:02:32] I'm dry-running the switchdc.mediawiki cookbook, and I'm hitting an error in 03-set-db-readonly; I can't figure out if it's because I'm dry-running and nothing changes, or if there's actually a problem with the cumin query building in mysql.check_core_masters_in_sync()
[16:02:37] Paste https://phabricator.wikimedia.org/P44430
[16:02:49] If anyone has insight, it'd be much appreciated :)
[16:05:43] Looks like it happens when it's trying to check x2
[16:07:15] claime: having a look
[16:07:30] volans: You can check the output in my tmux on cumin1001 if you want
[16:07:43] sudo to my user
[16:07:58] ack
[16:25:00] claime: so the problem is that none of the hosts in A:db-section-x2 are part of A:db-core
[16:25:13] marostegui: hey, we have the main board for db2181, is it possible to power the server off? thanks
[16:25:22] the generated query quen calling get_core_masters_heartbeats('eqiad', 'eqiad')
[16:25:31] is: 'A:db-core and A:eqiad and A:db-section-x2 and A:db-role-master'
[16:25:39] and that raises the exception because there are no hosts
[16:25:55] s/quen/when/
[16:26:02] papaul_: yes, give me a sec
[19:25:00] inflatador: I'd certainly be interested if they have any thoughts/observations re "ghost" objects (our tickets T327253 / https://bugs.launchpad.net/swift/+bug/2003941 and the older https://bugs.launchpad.net/swift/+bug/1619408 are all public); but I think I understand what I need to do from a swift/sqlite POV; it's mostly "just" the question of doing more of it automatically (e.g. parsing swift-get-nodes output) and in a way that works on our infra
[19:53:08] Emperor: ACK. I sent out the phab task to a few former coworkers to see if they have any feedback
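
On the "parsing swift-get-nodes output" automation Emperor mentions at 19:25: a possible starting point is sketched below. It assumes the usual "Server:Port Device" lines in the output (the exact layout varies between swift versions, so treat the pattern as something to verify); the example arguments reuse the container/object from the SWIFT_LOADER log line earlier.

    # Sketch: extract primary and handoff nodes from swift-get-nodes output.
    # The output format is an assumption to check against the local swift version.
    import re
    import subprocess

    def object_nodes(ring, account, container, obj):
        out = subprocess.run(
            ["swift-get-nodes", ring, account, container, obj],
            capture_output=True, text=True, check=True,
        ).stdout
        nodes = []
        for line in out.splitlines():
            m = re.match(r"Server:Port Device\s+(\S+):(\d+) (\S+)(.*)", line)
            if m:
                ip, port, device, rest = m.groups()
                nodes.append({
                    "ip": ip,
                    "port": int(port),
                    "device": device,
                    "handoff": "Handoff" in rest,
                })
        return nodes

    # e.g. object_nodes("/etc/swift/object.ring.gz", "AUTH_mw",
    #                   "wikipedia-commons-local-public.54",
    #                   "5/54/Part_of_Great_Barrier_Reef_from_Helecopter.JPG")
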