[08:27:27] hello, good morning
[08:27:34] quick +1 here? https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/126
[09:24:52] arturo: approved
[09:25:24] dhinus: thanks!
[09:29:30] thanks for managing the redis incident over the weekend, BTW
[09:31:05] dhinus: so the last time the redis max connection errors thing came up, I said that we can look into it more if it happens again
[09:31:21] arturo: no prob, luckily I was at home with no big plans
[09:32:15] taavi: yes, we should probably try to understand what's the root cause, at least we know that restarting the service seems to fix it
[09:32:51] the redis docs say that 'By default recent versions of Redis don't close the connection with the client if the client is idle for many seconds: the connection will remain open forever.', which seems like exactly the sort of thing that would cause this kind of issue
[09:33:15] yep, did we update the Redis version recently?
[09:33:24] not as far as I'm aware
[09:33:40] or maybe some new tool is using the connection without closing it?
[09:33:58] is there a Redis setting to force-close the connections after some time?
[09:34:00] yeah, or some network instability causing more connections to drop, or something like that
[09:34:17] * arturo mumbles a joke about redis wanting to be replaced by valkey
[09:34:26] I will open a task to track this issue
[09:34:29] apparently the 'timeout' setting can be used for that
[09:34:40] https://redis.io/docs/latest/develop/reference/clients/#client-timeouts
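
A minimal sketch of the `timeout` setting discussed above, assuming shell access to the redis host and a 300-second value picked purely for illustration:

  # check the current idle timeout (0 = never close idle clients, the Redis default)
  redis-cli CONFIG GET timeout

  # force-close clients idle for more than 300 seconds, applied at runtime
  redis-cli CONFIG SET timeout 300
  # to persist it across restarts, set the same directive in redis.conf: timeout 300

  # CLIENT LIST shows a per-connection idle= field, handy for spotting which
  # client/tool is holding connections open
  redis-cli CLIENT LIST
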
[10:00:13] good news: i have successfully migrated an existing VM from linuxbridge to openvswitch without its IP address changing
[10:00:47] the bad news is that the migration from one agent to another seems to require manually updating the vif_type from linuxbridge to ovs in the neutron DB, it's not fully automatic
[10:22:41] dhinus: there was a tool that added a health check that does a `celery ping`, that connects to redis as a backend
[10:23:05] dcaro: do you think that tool could be the one that's leaking connections?
[10:23:35] I can try writing a patch using taavi's suggestion of the "timeout" setting
[10:23:39] maybe, it definitely makes sure that something runs every few seconds
[10:24:23] dhinus: iirc it was https://gitlab.wikimedia.org/toolforge-repos/link-dispenser/
[10:39:38] taavi: I think the good news is bigger than the bad news :-P
[10:40:38] I created T363683 to follow up on the upgrade chat we had the other day
[10:40:39] T363683: Decision request template - kubernetes upgrade workgroup - https://phabricator.wikimedia.org/T363683
[13:11:59] dcaro: the alert "TooManyDProcesses" is firing for tools-k8s-worker-nfs-42, I tried following the runbook and I don't seem to find NFS-related errors in journalctl
[13:12:33] have you seen other types of problems that cause the same alert?
[13:12:36] last time I was unable to find out what the cause was
[13:12:43] do new processes get stuck?
[13:13:09] (ex. ls of the homes dir, or the projects dir)
[13:13:10] how can I find out?
[13:13:13] ok
[13:13:15] let me try
[13:13:24] careful, as you'll lose that shell
[13:13:28] (it will get stuck)
[13:13:30] *might
[13:13:49] you can try something like `ls /path/to/homes &` so it starts in the background
[13:14:12] (I should add that to the runbook xd)
[13:14:19] "ls /mnt/nfs/labstore-secondary-tools-project" is working fine
[13:17:48] hmm, yep, I can ssh too
[13:18:13] (last time I was not able to ssh as my user, as the shell would get stuck trying to access the mounted user home)
[13:18:21] there seems to be only one tool affected
[13:22:45] just added a couple of notes to the runbook, I seem to be able to ls, and head the file the processes are stuck in
[13:22:52] you can try restarting that tool's pods
[13:23:42] wait, it's actually working ok
[13:23:52] the processes come in and out of the D state
[13:24:22] but there's like 80 processes all accessing that file in real time
[13:24:31] https://www.irccloud.com/pastebin/rmxwBWje/
[13:25:06] some caching would be nice I guess xd
[13:26:12] it feels like a cgi script getting hammered by clients or similar
[13:26:41] it's also exceeding the memory limit
[13:26:45] [Mon Apr 29 13:21:44 2024] Memory cgroup out of memory: Killed process 3230425 (perl) total-vm:20132kB, anon-rss:12424kB, file-rss:1480kB, shmem-rss:0kB, UID:51437 pgtables:80kB oom_score_adj:985
[13:29:20] maybe we can try restarting that pod?
[13:29:34] how did you find the affected tool?
[13:34:07] arturo: do you know if a hypervisor can be migrated from the 'legacy' vlan naming scheme (interface.NNNN) to the new scheme (vlanNNNN) without a full reimage?
[13:40:21] mmmm
[13:40:48] I don't think there is any puppet code to clean up old network interface settings, so it might be cleaner to just reimage
[13:41:09] (or to clean up stuff like sysctls, monitoring, etc. for the old interface)
[13:41:33] I think the official procedure so far is: network change --> reimage
[13:42:01] and therefore that may be the only thing supported everywhere
[13:42:23] ok, that's fine
[13:42:46] with the node id stored in hiera, a reimage doesn't require any special steps other than draining, right?
[13:44:19] correct
[13:44:31] cool
[13:44:38] i will be reimaging cloudvirt2002-dev shortly
[13:44:54] well, if changing interfaces, you may want to double-check netbox
[13:45:34] also, make sure the system is configured for single NIC (if no host-level hiera overrides exist, then it should be fine)
[13:45:56] when the CloudVPSDesignateLeaks alert fires, is there a way to find the leaked records that is faster than running "wmcs-dnsleaks"?
[13:46:10] dhinus: the systemd journal for the unit that generates the prom node file
[13:46:27] in cloudcontrol?
[13:46:32] yeah
[13:46:46] iirc currently it runs on all of them, I've been meaning to make it run on only one
[13:47:09] thanks, I'll add it to the alert runbook
[13:49:46] * arturo out for a bit
[13:52:51] the DNS leaks are still coming from fullstackd, do we know what's the issue?
[13:58:10] dhinus: I just 'ps aux | grep D'
[14:00:36] dcaro: ack
[14:04:39] wdyt about restarting the pod?
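
For the TooManyDProcesses checks above, a sketch of slightly sturdier commands than a bare `ps aux | grep D`; the NFS path is the one from the log, everything else is an assumption about what the runbook could include:

  # list only processes in uninterruptible sleep (state D), with the kernel
  # function they are blocked in
  ps -eo state,pid,user,wchan:32,cmd | awk '$1 ~ /^D/'

  # poke the NFS mount without risking a stuck shell by bounding the call
  timeout 10 ls /mnt/nfs/labstore-secondary-tools-project || echo "ls hung or failed"

  # map a stuck PID back to a tool via its user and cgroup (<pid> is a placeholder)
  ps -o user=,cgroup= -p <pid>
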
[14:06:38] sorry, had to reboot my laptop
[14:07:19] dhinus: so given that the pod is not really stuck, it's just bombarding NFS, I don't think it will help much as long as the external requests keep coming in
[14:11:05] let's wait a few hours then and see if it stops
[14:20:14] okok
[15:12:50] PSA: we are about to shut down ceph in CODFW, to test the new incident response process (T348887)
[15:12:52] T348887: Decision Request - Incident Response Process - https://phabricator.wikimedia.org/T348887
[15:14:03] if you want to follow along, we are in https://meet.google.com/byy-mqst-jco
[15:20:27] Hey all, it seems that ceph is down in codfw1dev (for a drill!), can I get a hand?
[15:20:35] I'll start a doc
[15:20:39] um... cteam!
[15:20:52] here
[15:21:02] * dcaro putting on the unicorn costume
[15:21:18] * dcaro here 🦄
[15:22:57] for now Andrew is the IC
[15:23:07] thanks!
[15:23:17] i'm looking at ceph
[15:23:18] (fyi, we are testing our incident response process, nothing is really going on)
[15:23:29] ceph commands on cloudcephmon2004-dev are hanging
[15:24:49] coordination document is https://docs.google.com/document/d/1z5_zT9W-0s7j-4xqjZVD9Sg8ooE4tChZP2lS5Ykf2AA/edit
[15:25:29] @taavi same here
[15:25:38] so I think the ceph manager service is having some issues
[15:25:51] can't access ceph from any host yep
[15:25:53] `aborrero@cloudcephmon2004-dev:~ $ sudo ceph status` hangs
[15:26:54] so that is ceph-mon@cloudcephmon200N-dev.service
[15:27:07] there is a very suspicious ceph-crash.service
[15:27:11] i will try restarting that on cloudcephmon2004-dev
[15:27:52] updated status on -cloud
[15:28:51] !status doing an incident drill https://docs.google.com/document/d/1z5_zT9W-0s7j-4xqjZVD9Sg8ooE4tChZP2lS5Ykf2AA/edit
[15:28:51] Too long status
[15:29:51] well at least that service is running now, but ceph commands are still hanging
[15:30:36] I'll restart the other mons
[15:30:57] ceph health detail finally finished:
[15:30:58] HEALTH_WARN 1/3 mons down, quorum cloudcephmon2004-dev,cloudcephmon2005-dev; 962 slow ops, oldest one blocked for 196 sec, mon.cloudcephmon2004-dev has slow ops
[15:30:58] [WRN] MON_DOWN: 1/3 mons down, quorum cloudcephmon2004-dev,cloudcephmon2005-dev
[15:30:58] mon.cloudcephmon2006-dev (rank 1) addr [v2:10.192.20.20:3300/0,v1:10.192.20.20:6789/0] is down (out of quorum)
[15:30:58] [WRN] SLOW_OPS: 962 slow ops, oldest one blocked for 196 sec, mon.cloudcephmon2004-dev has slow ops
[15:31:09] nice, some response :)
[15:31:17] HEALTH_OK
[15:31:18] I see health ok now
[15:31:30] same here
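
A rough sketch of the monitor checks and restart used during the drill above; the host and unit names come from the log, while the timeout wrapper and the admin-socket `mon_status` call are assumptions about how to avoid hanging commands:

  # ceph commands hang while the monitors lack quorum, so bound them
  timeout 30 sudo ceph status

  # ask the local monitor directly over its admin socket (works without quorum);
  # assumes the mon id is the short hostname, as in the unit name above
  sudo ceph daemon mon.$(hostname -s) mon_status

  # restart the monitor that the health output reports as out of quorum
  sudo systemctl restart ceph-mon@cloudcephmon2006-dev.service

  # confirm recovery
  sudo ceph health detail
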
[15:34:17] https://usercontent.irccloud-cdn.com/file/01Vu0hce/image.png
[15:34:50] dhinus: ^ after 'add responders' there's that 'add users' link below that opens another text box
[15:34:50] Is horizon refusing login for everyone, or just for me?
[15:35:04] Rook: should be only on codfw, does it happen on eqiad too?
[15:35:35] Yeah I can't seem to login in eqiad
[15:36:02] Rook: that's me (although unexpected), will fix in a moment
[15:36:38] I can log into eqiad horizon
[15:36:45] Rook: actually, it's working for me (in eqiad)
[15:36:48] codfw1dev I would expect to be broken
[15:36:55] Let me try some more
[15:38:30] I'm in horizon now. Now on to the mystery of why tofu can't access things and the hub container is stuck
[15:38:33] Thanks!
[15:39:33] Rook: I think that's T363696 (which you likely already have open)
[15:39:34] T363696: [tf-infra-test] Authentication failed - https://phabricator.wikimedia.org/T363696
[15:39:53] btw all, I'm declaring the drill incident resolved :)
[15:40:58] I love that the WMCS world has progressed to feeling a need to do a tabletop outage for practice. This would not have been something we needed to invent back in the mid-teens. ;)
[15:43:25] It turned out to be pretty easy to fix!
[15:43:52] bd808: and we scheduled another one :-)
[15:44:32] we can chaosmonkey ourselves from time to time
[15:45:25] If anyone is put out by horizon being broken in codfw1dev please ping me, otherwise I'm going to continue to test there for a while.
[15:45:28] The best way to schedule an outage circa 2017 was to have an offsite or all-hands. Both were somehow magically guaranteed to make the servers sad. Losing a top or rack switch during the 2017 all-hands was especially exciting. ;)
[15:45:40] *top of rack
[15:45:58] bd808: LOL
[15:47:58] During my first or second week at WMF a bunch of SREs were at someone's (Leslie's?) apartment eating a pizza and I suddenly noticed that everyone around me had a laptop open and a worried expression
[15:48:17] I might be willing to get a page in exchange for a trip to somewhere nice. but only if the incident can be resolved in a couple hours and then I can enjoy the offsite :)
[15:48:20] It was like if we put too many SREs together in one place it caused some kind of feedback loop that made the servers go down.
[15:51:28] SRE critical mass
[15:51:56] yep
[15:53:05] maybe that's some kind of manifestation of the bus factor in real life
[15:53:41] hmmm... I'm thinking that we might want to make fourohfour a toolforge component of sorts, as it's needed to get the nginx ingress up and running
[15:54:42] any concerns about that?
[15:55:03] dcaro: fine for me
[15:55:35] SGTM
[15:55:55] the only concern would be to make sure it runs with lowered privs, because today, being a tool, it gets a lower-priv PSP. If it becomes an admin component, the PSP gets looser
[15:56:24] andrewbogott: looking at the google doc history for the incident doc template, there's multiple people modifying the template title then immediately reverting it :D
[15:56:47] dcaro: happy for fourohfour to become more "official" somehow. It has always been infrastructure for our ingress layer.
[15:56:58] yep, I'm sure everyone does that. Maybe the one there should just be
[15:57:35]
[15:58:31] 👍 I'll start a new repo for it then, will deploy it only on lima-kilo for starters, and move from there
[15:58:41] dcaro: ack
[16:00:21] hmm... gitlab seems to be unable to import from a url from within gitlab xd
[16:06:52] rip
[16:07:17] * arturo offline
[16:16:21] hmm... it seems that lima-kilo deployment fails after a bit
[16:16:24] https://www.irccloud.com/pastebin/nJ6aVfwX/
[17:25:16] * dcaro off
[18:09:17] * bd808 lunch
[23:52:11] * bd808 off
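
On the fourohfour PSP concern raised above (15:55), a sketch of how one could compare the policy the tool gets today with what an admin component would get; the namespace and policy names are placeholders, not the real ones:

  # list the PodSecurityPolicies defined on the Toolforge cluster
  kubectl get psp

  # check which PSP an existing fourohfour pod was admitted under
  # (the kubernetes.io/psp annotation is set at admission time)
  kubectl -n <fourohfour-namespace> get pod <pod-name> -o yaml | grep 'kubernetes.io/psp'

  # compare the privilege level of the tool policy vs the admin one
  kubectl describe psp <tool-psp> <admin-psp>
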