[08:27:27] hello, good morning
[08:27:34] quick +1 here? https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/126
[09:24:52] arturo: approved
[09:25:24] dhinus: thanks!
[09:29:30] thanks for managing the redis incident over the weekend, BTW
[09:31:05] dhinus: so the last time the redis max connection errors thing came up, I said that we can look into it more if it happens again
[09:31:21] arturo: no prob, luckily I was at home with no big plans
[09:32:15] taavi: yes, we should probably try to understand what's the root cause, at least we know that restarting the service seems to fix it
[09:32:51] the redis docs say that 'By default recent versions of Redis don't close the connection with the client if the client is idle for many seconds: the connection will remain open forever.', which seems like exactly the sort of thing that would cause this kind of issue
[09:33:15] yep, did we update the Redis version recently?
[09:33:24] not as far as I'm aware
[09:33:40] or maybe some new tool is using the connection without closing it?
[09:33:58] is there a Redis setting to force-close the connections after some time?
[09:34:00] yeah, or some network instability causing more connections to drop, or something like that
[09:34:17] * arturo mumbles a joke about redis wanting to be replaced by valkey
[09:34:26] I will open a task to track this issue
[09:34:29] apparently the 'timeout' setting can be used for that
[09:34:40] https://redis.io/docs/latest/develop/reference/clients/#client-timeouts
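
A minimal sketch of the `timeout` setting discussed above, assuming shell access to the redis host and a 300-second value picked purely for illustration:

  # check the current idle timeout (0 = never close idle clients, the Redis default)
  redis-cli CONFIG GET timeout

  # force-close clients idle for more than 300 seconds, applied at runtime
  redis-cli CONFIG SET timeout 300
  # to persist it across restarts, set the same directive in redis.conf: timeout 300

  # CLIENT LIST shows a per-connection idle= field, handy for spotting which
  # client/tool is holding connections open
  redis-cli CLIENT LIST
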
[10:00:13] good news: i have successfully migrated an existing VM from linuxbridge to openvswitch without its IP address changing
[10:00:47] the bad news is that the migration from one agent to another seems to require manually updating the vif_type from linuxbridge to ovs in the neutron DB, it's not fully automatic
[10:22:41] dhinus: there was a tool that added a health check that does a `celery ping`, that connects to redis as a backend
[10:23:05] dcaro: do you think that tool could be the one that's leaking connections?
[10:23:35] I can try writing a patch using taavi's suggestion of the "timeout" setting
[10:23:39] maybe, it definitely makes sure that something runs every few seconds
[10:24:23] dhinus: iirc it was https://gitlab.wikimedia.org/toolforge-repos/link-dispenser/
[10:39:38] taavi: I think the good news is bigger than the bad news :-P
[10:40:38] I created T363683 to follow up on the upgrade chat we had the other day
[10:40:39] T363683: Decision request template - kubernetes upgrade workgroup - https://phabricator.wikimedia.org/T363683
[13:11:59] dcaro: the alert "TooManyDProcesses" is firing for tools-k8s-worker-nfs-42, I tried following the runbook and I don't seem to find NFS-related errors in journalctl
[13:12:33] have you seen other types of problems that cause the same alert?
[13:12:36] last time I was unable to find out what the cause was
[13:12:43] do new processes get stuck?
[13:13:09] (ex. ls of the homes dir, or the projects dir)
[13:13:10] how can I find out?
[13:13:13] ok
[13:13:15] let me try
[13:13:24] careful, as you'll lose that shell
[13:13:28] (it will get stuck)
[13:13:30] *might
[13:13:49] you can try something like `ls /path/to/homes &` so it starts in the background
[13:14:12] (I should add that to the runbook xd)
[13:14:19] "ls /mnt/nfs/labstore-secondary-tools-project" is working fine
[13:17:48] hmm, yep, I can ssh too
[13:18:13] (last time I was not able to ssh as my user, as the shell would get stuck trying to access the mounted user home)
[13:18:21] there seems to be only one tool affected
[13:22:45] just added a couple of notes to the runbook, I seem to be able to ls, and head the file the processes are stuck in
[13:22:52] you can try restarting that tool's pods
[13:23:42] wait, it's actually working ok
[13:23:52] the processes come in and out of the D state
[13:24:22] but there's like 80 processes all accessing that file in real time
[13:24:31] https://www.irccloud.com/pastebin/rmxwBWje/
[13:25:06] some caching would be nice I guess xd
[13:26:12] it feels like a cgi script getting hammered by clients or similar
[13:26:41] it's also exceeding the memory limit
[13:26:45] [Mon Apr 29 13:21:44 2024] Memory cgroup out of memory: Killed process 3230425 (perl) total-vm:20132kB, anon-rss:12424kB, file-rss:1480kB, shmem-rss:0kB, UID:51437 pgtables:80kB oom_score_adj:985
[13:29:20] maybe we can try restarting that pod?
[13:29:34] how did you find the affected tool?
[13:34:07] arturo: do you know if a hypervisor can be migrated from the 'legacy' vlan naming scheme (interface.NNNN) to the new scheme (vlanNNNN) without a full reimage?
[13:40:21] mmmm
[13:40:48] I don't think there is any puppet code to clean up old network interface settings, so it might be cleaner to just reimage
[13:41:09] (or to clean up stuff like sysctls, monitoring, etc. for the old interface)
[13:41:33] I think the official procedure so far is: network change --> reimage
[13:42:01] and therefore that may be the only thing supported everywhere
[13:42:23] ok, that's fine
[13:42:46] with the node id stored in hiera, a reimage doesn't require any special steps other than draining, right?
[13:44:19] correct
[13:44:31] cool
[13:44:38] i will be reimaging cloudvirt2002-dev shortly
[13:44:54] well, if changing interfaces, you may want to double-check netbox
[13:45:34] also, make sure the system is configured for single NIC (if no host-level hiera overrides exist, then it should be fine)
[13:45:56] when the CloudVPSDesignateLeaks alert fires, is there a way to find the leaked records that is faster than running "wmcs-dnsleaks"?
[13:46:10] dhinus: the systemd journal for the unit that generates the prom node file
[13:46:27] in cloudcontrol?
[13:46:32] yeah
[13:46:46] iirc currently it runs on all of them, I've been meaning to make it run on only one
[13:47:09] thanks, I'll add it to the alert runbook
[13:49:46] * arturo out for a bit
[13:52:51] the DNS leaks are still coming from fullstackd, do we know what's the issue?
[13:58:10] dhinus: I just 'ps aux | grep D'
[14:00:36] dcaro: ack
[14:04:39] wdyt about restarting the pod?
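
For the TooManyDProcesses checks above, a sketch of slightly sturdier commands than a bare `ps aux | grep D`; the NFS path is the one from the log, everything else is an assumption about what the runbook could include:

  # list only processes in uninterruptible sleep (state D), with the kernel
  # function they are blocked in
  ps -eo state,pid,user,wchan:32,cmd | awk '$1 ~ /^D/'

  # poke the NFS mount without risking a stuck shell by bounding the call
  timeout 10 ls /mnt/nfs/labstore-secondary-tools-project || echo "ls hung or failed"

  # map a stuck PID back to a tool via its user and cgroup (<pid> is a placeholder)
  ps -o user=,cgroup= -p <pid>
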
[14:06:38] sorry, had to reboot my laptop
[14:07:19] dhinus: so given that the pod is not really stuck, it's just bombarding NFS, I don't think it will help much as long as the external requests keep coming in
[14:11:05] let's wait a few hours then and see if it stops
[14:20:14] okok
[15:12:50] PSA: we are about to shut down ceph in CODFW, to test the new incident response process (T348887)
[15:12:52] T348887: Decision Request - Incident Response Process - https://phabricator.wikimedia.org/T348887
[15:14:03] if you want to follow along, we are in https://meet.google.com/byy-mqst-jco
[15:20:27] Hey all, it seems that ceph is down in codfw1dev (for a drill!), can I get a hand?
[15:20:35] I'll start a doc
[15:20:39] um... cteam!
[15:20:52] here
[15:21:02] * dcaro putting on the unicorn costume
[15:21:18] * dcaro here 🦄
[15:22:57] for now Andrew is the IC
[15:23:07] thanks!
[15:23:17] i'm looking at ceph
[15:23:18] (fyi, we are testing our incident response process, nothing is really going on)
[15:23:29] ceph commands on cloudcephmon2004-dev are hanging
[15:24:49] coordination document is https://docs.google.com/document/d/1z5_zT9W-0s7j-4xqjZVD9Sg8ooE4tChZP2lS5Ykf2AA/edit
[15:25:29] @taavi same here
[15:25:38] so I think the ceph manager service is having some issues
[15:25:51] can't access ceph from any host yep
[15:25:53] `aborrero@cloudcephmon2004-dev:~ $ sudo ceph status` hangs
[15:26:54] so that is ceph-mon@cloudcephmon200N-dev.service
[15:27:07] there is a very suspicious ceph-crash.service
[15:27:11] i will try restarting that on cloudcephmon2004-dev
[15:27:52] updated status on -cloud
[15:28:51] !status doing an incident drill https://docs.google.com/document/d/1z5_zT9W-0s7j-4xqjZVD9Sg8ooE4tChZP2lS5Ykf2AA/edit
[15:28:51] Too long status
[15:29:51] well at least that service is running now, but ceph commands are still hanging
[15:30:36] I'll restart the other mons
[15:30:57] ceph health detail finally finished:
[15:30:58] HEALTH_WARN 1/3 mons down, quorum cloudcephmon2004-dev,cloudcephmon2005-dev; 962 slow ops, oldest one blocked for 196 sec, mon.cloudcephmon2004-dev has slow ops
[15:30:58] [WRN] MON_DOWN: 1/3 mons down, quorum cloudcephmon2004-dev,cloudcephmon2005-dev
[15:30:58] mon.cloudcephmon2006-dev (rank 1) addr [v2:10.192.20.20:3300/0,v1:10.192.20.20:6789/0] is down (out of quorum)
[15:30:58] [WRN] SLOW_OPS: 962 slow ops, oldest one blocked for 196 sec, mon.cloudcephmon2004-dev has slow ops
[15:31:09] nice, some response :)
[15:31:17] HEALTH_OK
[15:31:18] I see health ok now
[15:31:30] same here
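
A rough sketch of the monitor checks and restart used during the drill above; the host and unit names come from the log, while the timeout wrapper and the admin-socket `mon_status` call are assumptions about how to avoid hanging commands:

  # ceph commands hang while the monitors lack quorum, so bound them
  timeout 30 sudo ceph status

  # ask the local monitor directly over its admin socket (works without quorum);
  # assumes the mon id is the short hostname, as in the unit name above
  sudo ceph daemon mon.$(hostname -s) mon_status

  # restart the monitor that the health output reports as out of quorum
  sudo systemctl restart ceph-mon@cloudcephmon2006-dev.service

  # confirm recovery
  sudo ceph health detail
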
[15:34:17] https://usercontent.irccloud-cdn.com/file/01Vu0hce/image.png
[15:34:50] dhinus: ^ after 'add responders' there's that 'add users' link below that opens another text box
[15:34:50] Is horizon refusing login for everyone, or just for me?
[15:35:04] Rook: should be only on codfw, does it happen on eqiad too?
[15:35:35] Yeah I can't seem to login in eqiad
[15:36:02] Rook: that's me (although unexpected), will fix in a moment
[15:36:38] I can log into eqiad horizon
[15:36:45] Rook: actually, it's working for me (in eqiad)
[15:36:48] codfw1dev I would expect to be broken
[15:36:55] Let me try some more
[15:38:30] I'm in horizon now. Now on to the mystery of why tofu can't access things and the hub container is stuck
[15:38:33] Thanks!
[15:39:33] Rook: I think that's T363696 (which you likely already have open)
[15:39:34] T363696: [tf-infra-test] Authentication failed - https://phabricator.wikimedia.org/T363696
[15:39:53] btw all, I'm declaring the drill incident resolved :)
[15:40:58] I love that the WMCS world has progressed to feeling a need to do a tabletop outage for practice. This would not have been something we needed to invent back in the mid-teens. ;)
[15:43:25] It turned out to be pretty easy to fix!
[15:43:52] bd808: and we scheduled another one :-)
[15:44:32] we can chaosmonkey ourselves from time to time
[15:45:25] If anyone is put out by horizon being broken in codfw1dev please ping me, otherwise I'm going to continue to test there for a while.
[15:45:28] The best way to schedule an outage circa 2017 was to have an offsite or all-hands. Both were somehow magically guaranteed to make the servers sad. Losing a top or rack switch during the 2017 all-hands was especially exciting. ;)
[15:45:40] *top of rack
[15:45:58] bd808: LOL
[15:47:58] During my first or second week at WMF a bunch of SREs were at someone's (Leslie's?) apartment eating a pizza and I suddenly noticed that everyone around me had a laptop open and a worried expression
[15:48:17] I might be willing to get a page in exchange for a trip to somewhere nice. but only if the incident can be resolved in a couple hours and then I can enjoy the offsite :)
[15:48:20] It was like if we put too many SREs together in one place it caused some kind of feedback loop that made the servers go down.
[15:51:28] SRE critical mass
[15:51:56] yep
[15:53:05] maybe that's some kind of manifestation of the bus factor in real life
[15:53:41] hmmm... I'm thinking that we might want to make fourohfour a toolforge component of sorts, as it's needed to get the nginx ingress up and running
[15:54:42] any concerns about that?
[15:55:03] dcaro: fine for me
[15:55:35] SGTM
[15:55:55] the only concern would be to make sure it runs with lowered privs, because today, being a tool, it gets a lower-priv PSP. If it becomes an admin component, the PSP gets looser
[15:56:24] andrewbogott: looking at the google doc history for the incident doc template, there's multiple people modifying the template title then immediately reverting it :D
[15:56:47] dcaro: happy for fourohfour to become more "official" somehow. It has always been infrastructure for our ingress layer.
[15:56:58] yep, I'm sure everyone does that. Maybe the one there should just be
[15:57:35]
[15:58:31] 👍 I'll start a new repo for it then, will deploy it only on lima-kilo for starters, and move from there
[15:58:41] dcaro: ack
[16:00:21] hmm... gitlab seems to be unable to import from a url from within gitlab xd
[16:06:52] rip
[16:07:17] * arturo offline
[16:16:21] hmm... it seems that lima-kilo deployment fails after a bit
[16:16:24] https://www.irccloud.com/pastebin/nJ6aVfwX/
[17:25:16] * dcaro off
[18:09:17] * bd808 lunch
[23:52:11] * bd808 off
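
On the fourohfour PSP concern raised above (15:55), a sketch of how one could compare the policy the tool gets today with what an admin component would get; the namespace and policy names are placeholders, not the real ones:

  # list the PodSecurityPolicies defined on the Toolforge cluster
  kubectl get psp

  # check which PSP an existing fourohfour pod was admitted under
  # (the kubernetes.io/psp annotation is set at admission time)
  kubectl -n <fourohfour-namespace> get pod <pod-name> -o yaml | grep 'kubernetes.io/psp'

  # compare the privilege level of the tool policy vs the admin one
  kubectl describe psp <tool-psp> <admin-psp>
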