[08:35:46] hello folks
[08:35:59] I am going to use jmap on puppetserver1002 to get a heap dump
[08:36:17] this shouldn't impact merges, but if you see anything hanging/not-working lemme know
[09:44:19] (completed the above)
[09:45:24] I am going to deploy a mw-config change in a bit to swap another poolcounter node in codfw
[09:45:40] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1073427
[10:06:31] started the deployment, will update once done
[10:14:56] done, nothing exploded so far
[13:12:12] puppetserver1002 is unresponsive, I was trying to run puppet to apply a change for the jvm but it hung
[13:12:27] no metrics published in https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=puppetserver1002&var-datasource=thanos&var-cluster=misc
[13:13:28] trying to log in as root via the serial console but it hangs
[13:13:44] out of curiosity do we have swap enabled (even a small swap partition) on those hosts?
[13:15:21] I see Disk /dev/mapper/vg0-swap: 976 MiB, 1023410176 bytes, 1998848 sectors
[13:15:25] on 1001 though
[13:15:51] my guess is that makes things worse
[13:16:22] it'd be so much better if the OOM killer intervened here
[13:16:31] yep I agree
[13:17:58] I am restarting puppetserver
[13:18:38] I have applied my change manually (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1073751) so I'll avoid another restart
[13:23:15] ok so puppetserver seems up and running, with 35 jruby workers instead of 48
[13:23:28] we'll see at steady state if it is more stable
[13:23:51] if so, we can roll restart the rest of the cluster
[13:30:22] totally unrelated, swapping another poolcounter IP (this time in eqiad)
[13:41:17] poolcounter1006 live and serving conns
[13:44:04] swapping the last poolcounter as well
[13:54:06] done! All poolcounters are on bookworm
[13:54:15] and mw/thumbor use the new IPs
[13:54:21] thanks elukey !
[13:59:28] {◕ ◡ ◕}