[14:18:51] https://phabricator.wikimedia.org/T351452 has all projects ticked off, but https://openstack-browser.toolforge.org/puppetclass/role::puppetmaster::standalone still lists tf-pm-1.terraform.eqiad1.wikimedia.cloud, is that safe to ignore?
[14:25:58] moritzm: checking...
[14:27:51] moritzm: I deleted it :)
[14:28:16] excellent :-)
[14:29:07] pki-pm.pki.eqiad1.wikimedia.cloud can be ignored, so I'll start removing things related only to cloud puppet 5 masters tomorrow
[14:31:25] sounds good!
[14:58:22] another NFS worker alert... they seem to be getting more frequent?
[15:11:36] eh guys, apologies, time slipped away on me and I missed the start of our network-sync meeting
[15:12:01] I can jump on now if you think it's worthwhile. For my part nothing urgent to discuss
[15:12:16] topranks: it's next week in my calendar
[15:12:24] hahahaha
[15:12:39] is it better to be 10 minutes late or 1 week early?
[15:12:50] clearly I have a poor grasp on that thing called "time" anyway :P
[15:12:56] I will talk to you next week :)
[15:13:15] :D
[15:40:39] the NFS alert for tools-k8s-worker-nfs-23 stopped firing for a while, now it's firing again
[17:27:00] the other day there were a couple too, they turned out to be "real" load, as in a tool that was doing a lot of disk stuff, and it went away eventually. The main tell is that the graph goes up and down (so processes are not getting stuck)
[17:27:58] in this case, for example, nfs-23 seems stuck
[17:28:20] https://usercontent.irccloud-cdn.com/file/rd6OZ9We/image.png
[17:33:06] dcaro: what do you do in such a scenario? the runbook suggests restarting the pods that are stuck, or should we just reboot the node?
[17:34:29] just reboot the node, you can try to debug the NFS issue if you want, but rebooting works
[17:34:34] (and should be harmless)
[17:35:03] ok!
[17:35:31] I tried checking the logs as suggested in the runbook, and I don't see any "not responding" logs today, but there are a few from the previous days
[17:35:33] (/me wanting to make sure also that rebooting is ok)
[17:35:42] let me try rebooting it!
[17:35:49] I found the dmesg logs to be flaky too
[17:36:07] you can check which files the stuck process has open, though it might get you stuck too
[17:36:39] (the behavior is usually that some filehandles get stuck, and anyone trying to access them gets stuck too, but trying to access any other part of the NFS tree works well)
[17:37:06] ack
[17:37:11] the reboot is in progress
[17:37:16] reboot phase: wait_drain
[17:38:00] it did log "node/tools-k8s-worker-nfs-23 drained", but then it stopped on "reboot phase: wait_drain"
[17:39:37] it might have to force reboot the worker
[17:39:54] what is the "wait_drain" doing?
[17:40:06] if some of the processes are not pods (iirc there's a cron that starts getting stuck too, doing some kind of lsof, the wmf_restart_* thingie maybe)
[17:41:36] ok I found the debug logs
[17:41:41] in /var/log/spicerack
[17:41:43] "Waiting for node tools-k8s-worker-nfs-23 to stop all it's pods, still 1 running"
[17:42:05] "Something happened while rebooting host tools-k8s-worker-nfs-23, trying a hard rebooting the instance"
[17:42:13] too late to find out what the pod was :D
[17:43:53] xd
[17:49:52] * dhinus offline
[17:53:54] andrewbogott: we got pacific packages for bullseye \o/ https://apt.wikimedia.org/wikimedia/pool/thirdparty/ceph-pacific/
[17:54:20] great!
[18:30:37] * dcaro off
[18:59:02] tiring day...
[18:59:07] cya tomorrow
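
Editor's note: the NFS debugging discussed above (grepping dmesg for "not responding" messages, and checking which processes are wedged on stuck filehandles) can be sketched roughly as below. This is not the runbook tooling; the mount path is irrelevant here and the script deliberately avoids lsof-style inspection of open files, since, as noted in the conversation, poking a stuck filehandle can hang the inspector too.

```python
#!/usr/bin/env python3
"""Rough diagnostic sketch for a suspected stuck NFS worker.

Assumptions (not from the runbook): reading dmesg and /proc is enough
to spot the symptoms.  We only look at process state, not open files,
because listing the open files of a truly stuck process can itself
block on the dead NFS filehandle.
"""
import re
import subprocess
from pathlib import Path


def nfs_not_responding_messages() -> list[str]:
    """Return kernel log lines hinting at an unresponsive NFS server."""
    dmesg = subprocess.run(["dmesg", "-T"], capture_output=True, text=True, check=True)
    return [line for line in dmesg.stdout.splitlines()
            if re.search(r"nfs: server .* not responding", line)]


def processes_in_d_state() -> list[tuple[int, str]]:
    """Return (pid, comm) for processes in uninterruptible sleep (state D)."""
    stuck = []
    for status in Path("/proc").glob("[0-9]*/status"):
        try:
            fields = dict(line.split(":\t", 1)
                          for line in status.read_text().splitlines() if ":\t" in line)
        except OSError:
            continue  # process exited while we were reading
        if fields.get("State", "").startswith("D"):
            stuck.append((int(status.parent.name), fields.get("Name", "?")))
    return stuck


if __name__ == "__main__":
    for line in nfs_not_responding_messages():
        print(line)
    for pid, name in processes_in_d_state():
        print(f"possibly stuck on NFS: pid={pid} comm={name}")
```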
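
And a minimal sketch of the drain-then-reboot flow the cookbook appears to follow (drain the node, wait out the "wait_drain" phase, fall back to a hard reboot when a pod or stuck process never exits). This is not the wmcs/spicerack cookbook; it only mirrors the observable behaviour from the log quotes above using plain kubectl and the OpenStack CLI, and the node name, timeout, and uncordon step are illustrative assumptions.

```python
#!/usr/bin/env python3
"""Illustrative drain-then-reboot flow for a Toolforge NFS worker.

NOT the real cookbook: just the drain / wait_drain / hard-reboot
fallback pattern, expressed with plain kubectl and the OpenStack CLI.
"""
import subprocess
import time

NODE = "tools-k8s-worker-nfs-23"   # example node from the conversation
DRAIN_TIMEOUT = 15 * 60            # assumption: give pods 15 minutes to go away


def run(*cmd: str) -> str:
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def pods_on_node(node: str) -> list[str]:
    out = run("kubectl", "get", "pods", "--all-namespaces",
              "--field-selector", f"spec.nodeName={node}", "-o", "name")
    return [p for p in out.splitlines() if p]


def drain_and_reboot(node: str) -> None:
    run("kubectl", "drain", node, "--ignore-daemonsets", "--delete-emptydir-data")

    deadline = time.monotonic() + DRAIN_TIMEOUT
    while pods_on_node(node) and time.monotonic() < deadline:
        print(f"reboot phase: wait_drain, still {len(pods_on_node(node))} pod(s) running")
        time.sleep(30)

    if pods_on_node(node):
        # Something never let go (e.g. a process wedged on an NFS
        # filehandle): fall back to a hard reboot, as the cookbook did.
        run("openstack", "server", "reboot", "--hard", node)
    else:
        run("openstack", "server", "reboot", "--soft", node)

    # Allow pods to schedule on the node again (the real flow would
    # wait for the instance to come back up before this point).
    run("kubectl", "uncordon", node)


if __name__ == "__main__":
    drain_and_reboot(NODE)
```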