[08:27:27] hello, good morning
[08:27:34] quick +1 here? https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/126
[09:24:52] arturo: approved
[09:25:24] dhinus: thanks!
[09:29:30] thanks for managing the redis incident over the weekend, BTW
[09:31:05] dhinus: so the last time the redis max connection errors thing came up, I said that we can look into it more if it happens again
[09:31:21] arturo: no prob, luckily I was at home with no big plans
[09:32:15] taavi: yes, we should probably try to understand what's the root cause, at least we know that restarting the service seems to fix it
[09:32:51] the redis docs say that 'By default recent versions of Redis don't close the connection with the client if the client is idle for many seconds: the connection will remain open forever.', which seems like exactly the sort of thing that would cause this kind of issue
[09:33:15] yep, did we update the Redis version recently?
[09:33:24] not as far as I'm aware
[09:33:40] or maybe some new tool is using the connection without closing it?
[09:33:58] is there a Redis setting to force-close the connections after some time?
[09:34:00] yeah, or some network instability causing more connections to drop, or something like that
[09:34:17] * arturo mumbles a joke about redis wanting to be replaced by valkey
[09:34:26] I will open a task to track this issue
[09:34:29] apparently the 'timeout' setting can be used for that
[09:34:40] https://redis.io/docs/latest/develop/reference/clients/#client-timeouts
[10:00:13] good news: i have successfully migrated an existing VM from linuxbridge to openvswitch without its IP address changing
[10:00:47] the bad news is that the migration from one agent to another seems to require manually updating the vif_type from linuxbridge to ovs in the neutron DB, it's not fully automatic
[10:22:41] dhinus: there was a tool that added a health check that does a `celery ping`, that connects to redis as a backend
[10:23:05] dcaro: do you think that tool could be the one that's leaking connections?
[10:23:35] I can try writing a patch using taavi's suggestion of the "timeout" setting
[10:23:39] maybe, it definitely makes sure that something runs every few seconds
[10:24:23] dhinus: iirc it was https://gitlab.wikimedia.org/toolforge-repos/link-dispenser/
[10:39:38] taavi: I think the good news are bigger than the bad news :-P
[10:40:38] I created T363683 to follow up on the upgrade chat we had the other day
[10:40:39] T363683: Decision request template - kubernetes upgrade workgroup - https://phabricator.wikimedia.org/T363683
[13:11:59] dcaro: the alert "TooManyDProcesses" is firing for tools-k8s-worker-nfs-42, I tried following the runbook and I don't seem to find NFS-related errors in journalctl
[13:12:33] have you seen other types of problems that cause the same alert?
[13:12:36] last time I was unable to find out what was the cause
[13:12:43] do new processes get stuck?
[13:13:09] (ex. ls of the homes dir, or the projects dir)
[13:13:10] how can I find out?
[13:13:13] ok
[13:13:15] let me try
[13:13:24] careful, as you'll lose that shell
[13:13:28] (it will get stuck)
[13:13:30] *might
[13:13:49] you can try something like `ls /path/to/homes &` so it starts in the background
[13:14:12] (I should add that to the runbook xd)
[13:14:19] "ls /mnt/nfs/labstore-secondary-tools-project" is working fine
[13:17:48] hmm, yep, I can ssh too
[13:18:13] (last time I was not able to ssh as my user, as the shell will get stuck trying to access the mounted user home)
[13:18:21] there seems to be only one tool affected
[13:22:45] just added a couple notes in the runbook, I seem to be able to ls, and head the file the processes are stuck in
[13:22:52] you can try restarting that tool's pods
[13:23:42] wait, it's actually working ok
[13:23:52] the processes come in and out of the D state
[13:24:22] but there's like 80 processes all accessing that file in real time
[13:24:31] https://www.irccloud.com/pastebin/rmxwBWje/
[13:25:06] some caching would be nice I guess xd
[13:26:12] it feels like a cgi script getting hammered from clients or similar
[13:26:41] it's also exceeding the memory limit
[13:26:45] [Mon Apr 29 13:21:44 2024] Memory cgroup out of memory: Killed process 3230425 (perl) total-vm:20132kB, anon-rss:12424kB, file-rss:1480kB, shmem-rss:0kB, UID:51437 pgtables:80kB oom_score_adj:985
[13:29:20] maybe we can try restarting that pod?
[13:29:34] how did you find the affected tool?
[13:34:07] arturo: do you know if a hypervisor can be migrated from the 'legacy' vlan naming scheme (interface.NNNN) to the new scheme (vlanNNNN) without a full reimage?
[13:40:21] mmmm
[13:40:48] I don't think there is any puppet code to cleanup old network interfaces setting, so it might be cleaner to just reimage
[13:41:09] (or to cleanup stuff like sysctls, monitoring, etc for the old interface)
[13:41:33] I think the official procedure so far is: network change --> reimage
[13:42:01] and therefore that may be the only thing supported everywhere
[13:42:23] ok, that's fine
[13:42:46] with the node id stored in hiera, a reimage doesn't require any special steps other than draining, right?
[13:44:19] correct
[13:44:31] cool
[13:44:38] i will be reimaging cloudvirt2002-dev shortly
[13:44:54] well, if changing interfaces, you may want to double check netbox
[13:45:34] also, make sure the system is configured for single NIC (if no host-level hiera overrides exist, then it should be fine)
[13:45:56] when the CloudVPSDesignateLeaks fires, is there a way to find the leaked records that is faster than running "wmcs-dnsleaks"?
[13:46:10] dhinus: the systemd journal for the unit that generates the prom node file
[13:46:27] in cloudcontrol?
[13:46:32] yeah
[13:46:46] iirc currently it runs on all of them, I've been meaning to only make it run on one
[13:47:09] thanks, I'll add it to the alert runbook
[13:49:46] * arturo out for a bit
[13:52:51] the DNS leaks are still coming from fullstackd, do we know what's the issue?
[13:58:10] dhinus: I just 'ps aux | grep D'
[14:00:36] dcaro: ack
[14:04:39] wdyt about restarting the pod?
[14:06:38] sorry, had to reboot my laptop
[14:07:19] dhinus: so given that the pod is not really stuck, it's just bombarding NFS, I don't think it will help much as long as the external requests keep coming in
[14:11:05] let's wait a few hours then and see if it stops
[14:20:14] okok
[15:12:50] PSA: we are about to shut down ceph in CODFW, to test the new incident response process (T348887)
[15:12:52] T348887: Decision Request - Incident Response Process - https://phabricator.wikimedia.org/T348887
[15:14:03] if you want to follow along, we are in https://meet.google.com/byy-mqst-jco
[15:20:27] Hey all, it seems that ceph is down in codfw1dev (for a drill!) can I get a hand?
[15:20:35] I'll start a doc
[15:20:39] um... cteam!
[15:20:52] here
[15:21:02] * dcaro putting on the unicorn costume
[15:21:18] * dcaro here 🦄
[15:22:57] for now Andrew is the IC
[15:23:07] thanks!
[15:23:17] i'm looking at ceph
[15:23:18] (fyi, we are testing our incident response process, nothing is really going on)
[15:23:29] ceph commands on cloudcephmon2004-dev are hanging
[15:24:49] coordination document is https://docs.google.com/document/d/1z5_zT9W-0s7j-4xqjZVD9Sg8ooE4tChZP2lS5Ykf2AA/edit
[15:25:29] @taavi same here
[15:25:38] so I think the ceph manager service is having some issues
[15:25:51] can't access ceph from any host yep
[15:25:53] `aborrero@cloudcephmon2004-dev:~ $ sudo ceph status` hangs
[15:26:54] so that is ceph-mon@cloudcephmon200N-dev.service
[15:27:07] there is a very suspicious ceph-crash.service
[15:27:11] i will try restarting that on cloudcephmon2004-dev
[15:27:52] updated status on -cloud
[15:28:51] !status doing an incident drill https://docs.google.com/document/d/1z5_zT9W-0s7j-4xqjZVD9Sg8ooE4tChZP2lS5Ykf2AA/edit
[15:28:51] Too long status
[15:29:51] well at least that service is running now, but ceph commands are still hanging
[15:30:36] I'll restart the other mons
[15:30:57] ceph health detail finally finished:
[15:30:58] HEALTH_WARN 1/3 mons down, quorum cloudcephmon2004-dev,cloudcephmon2005-dev; 962 slow ops, oldest one blocked for 196 sec, mon.cloudcephmon2004-dev has slow ops
[15:30:58] [WRN] MON_DOWN: 1/3 mons down, quorum cloudcephmon2004-dev,cloudcephmon2005-dev
[15:30:58] mon.cloudcephmon2006-dev (rank 1) addr [v2:,v1:] is down (out of quorum)
[15:30:58] [WRN] SLOW_OPS: 962 slow ops, oldest one blocked for 196 sec, mon.cloudcephmon2004-dev has slow ops
[15:31:09] nice, some response :)
[15:31:17] HEALTH_OK
[15:31:18] I see health ok now
[15:31:30] same here
[15:34:17] https://usercontent.irccloud-cdn.com/file/01Vu0hce/image.png
[15:34:50] Is horizon refusing login for everyone, or just for me?
[15:34:50] dhinus: ^ after 'add responders' there's that 'add users' link below that opens another text box
[15:35:04] Rook: should be only on codfw, does it happen on eqiad too?
[15:35:35] Yeah I can't seem to login in eqiad
[15:36:02] Rook: that's me (although unexpected), will fix in a moment
[15:36:38] I can log into eqiad horizon
[15:36:45] Rook: actually, it's working for me (in eqiad)
[15:36:48] codfw1dev I would expect to be broken
[15:36:55] Let me try some more
[15:38:30] I'm in horizon now. Now on to the mystery of why tofu can't access things and the hub container is stuck
[15:38:33] Thanks!
[15:39:33] Rook: I think that's T363696 (which you likely already have open)
[15:39:34] T363696: [tf-infra-test] Authentication failed - https://phabricator.wikimedia.org/T363696
[15:39:53] btw all, I'm declaring the drill incident resolved :)
[15:40:58] I love that the WMCS world has progressed to feeling a need to do a tabletop outage for practice. The best way to schedule an outage circa 2017 was to have an offsite or all-hands. Both were somehow magically guaranteed to make the servers sad.
Losing a top or rack switch during the 2017 all-hands was especially exciting. ;)
*top of rack
During my first or second week at WMF a bunch of SREs were at someone's (Leslie's?) apartment eating a pizza and I suddenly noticed that everyone around me had a laptop open and a worried expression
I might be willing to get a page in exchange for a trip to somewhere nice. but only if the incident can be resolved in a couple hours and then I can enjoy the offsite :)
It was as if putting too many SREs together in one place caused some kind of feedback loop that made the servers go down.
SRE critical mass
yep
maybe that's some kind of manifestation of the bus factor in real life
hmmm... I'm thinking that we might want to make fourohfour a toolforge component of sorts, as it's needed to get the nginx ingress up and running
any concerns about that?
[15:55:03] dcaro: fine for me
[15:55:35] SGTM
[15:55:55] the only concern would be to make sure it runs with lowered privs, because today, it being a tool, gets a lower-priv PSP. If an admin component, the PSP gets more loose
[15:56:24] andrewbogott: looking at the google doc history for the incident doc template, there's multiple people modifying the template title then immediately reverting it :D
[15:56:47] dcaro: happy for fourohfour to become more "official" somehow. It has always been infrastructure for our ingress layer.
[15:56:58] yep, I'm sure everyone does that. Maybe the on there should just be [
[
👍 I'll start a new repo for it then, will deploy it only on lima-kilo for starters, and move from there
dcaro: ack
hmm... gitlab seems to be unable to import from a url from within gitlab xd
rip
* arturo offline
hmm... it seems that lima-kilo deployment fails after a bit
https://www.irccloud.com/pastebin/nJ6aVfwX/
* dcaro off
* bd808 off
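
Note on the TooManyDProcesses thread above: `ps aux | grep D` also matches any line that merely contains a capital D (user names, command arguments), so a stricter check may be handy for the runbook. A minimal sketch, assuming a Linux host where /proc/<pid>/stat is readable; the function names `parse_stat` and `d_state_processes` are made up for illustration and do not come from any WMCS tooling:

```python
from __future__ import annotations

from pathlib import Path


def parse_stat(stat_line: str) -> tuple[int, str, str]:
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    left, _, rest = stat_line.partition("(")
    comm, _, tail = rest.rpartition(")")
    return int(left), comm, tail.split()[0]


def d_state_processes(proc_root: str = "/proc") -> list[tuple[int, str]]:
    """List (pid, comm) for processes in uninterruptible sleep (state D)."""
    found = []
    for stat_file in Path(proc_root).glob("[0-9]*/stat"):
        try:
            pid, comm, state = parse_stat(stat_file.read_text())
        except (OSError, ValueError, IndexError):
            continue  # process exited (or was unreadable) while scanning
        if state == "D":
            found.append((pid, comm))
    return found


if __name__ == "__main__":
    for pid, comm in d_state_processes():
        print(pid, comm)
```

As noted in the thread, processes can move in and out of the D state, so a single snapshot may look different from one run to the next; running it a few times gives a better picture.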