[10:23:23] Heads-up, especially on-callers: we're 10 minutes away from starting the rolling upgrade of dse-k8s-eqiad to Kubernetes 1.31. T414484
[10:23:23] T414484: Upgrade DSE clusters to kubernetes 1.31 - https://phabricator.wikimedia.org/T414484
[10:23:59] I'm temporarily disabling the scap deployment of mediawiki to dse-k8s-eqiad now: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1260729
[10:46:18] btullis: o/ are you coordinating the work somewhere?
[10:50:49] Yes, b.rouberol and I are in https://meet.google.com/ysk-aazb-zha and taking notes in https://docs.google.com/document/d/1q7Amw_XSN_Lfb7fCnaSprpW8Z43iMyD4NOD3Lbq2hR4/edit?tab=t.0 Feel free to join and bring snacks.
[10:51:16] We're still backing up postgres at the moment.
[10:51:41] btullis: only ping me if you need me! Good luck!
[10:53:45] Thanks. j.oal is also on standby from the DE side.
[11:12:25] bjensen, jayme - to spice up your last hour of on-call, I am going to deploy https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1261382
[11:12:47] ack, prepared for spiciness :)
[11:23:38] All good so far; I am re-initializing maps1012.
[11:23:40] It will take a bit.
[13:33:42] The dse-k8s-eqiad rolling upgrade is complete, with the exception of Istio, which we can come back to. It has gone really well, which is a testament to all the great work of others.
[13:47:22] {◕ ◡ ◕}
[14:23:17] I couldn't have said it better! On top of this, we also upgraded in place, without wiping etcd, to avoid deleting PVCs (which would have deleted the underlying data in Ceph).
[14:24:02] Hey folks, a note that we intend to switch the deployment server over to eqiad in about 40 minutes.
[14:30:35] ^^ Nice job, all! This is the first time I'm aware of that we've done a rolling upgrade for k8s at WMF.
[14:40:38] ./utils/run_ci_locally.sh seems to be broken on trixie? https://www.irccloud.com/pastebin/oHCbFjFh/
[15:05:30] jhathaway, rzl: we may overrun the puppet request window with the deployment server switchover by a small amount, but we hope to be done shortly after 16:00 UTC.
[15:11:50] bjensen: ack, thanks
[15:12:07] jhathaway: FYI, the patch in there is something I'm coordinating with AW; I've got it.
[15:15:43] rzl: thanks
[15:32:08] vgutierrez: it may be related to https://gerrit.wikimedia.org/r/c/operations/puppet/+/1142675/5/utils/run_ci_locally.sh
[15:36:35] vgutierrez: I also run trixie and it works for me; just tested it.
[15:36:58] It is also way faster, thanks jhathaway :)
[15:37:46] vgutierrez: yup, probably my fault.
[15:37:50] Happy to debug.
[16:17:27] rzl: we're done, thanks for waiting!
[16:17:38] bjensen: ack, thanks!
[18:18:44] on-callers: I'm going to reboot the last of the sessionstore hosts. No impact expected (just a heads-up).
[19:02:48] Hi on-callers, I found a fun race condition that is going to make puppet fail its first run on most of the hosts in the fleet. I'll be issuing manual re-runs for failed hosts every few minutes.
[19:03:00] Notice: /Stage[main]/Prometheus::Nic_saturation_exporter/Package[python3-attr]/ensure: created
[19:03:02] Notice: /Stage[main]/Prometheus::Nic_saturation_exporter/Systemd::Service[nic-saturation-exporter]/Service[nic-saturation-exporter]/ensure: ensure changed 'stopped' to 'running' (corrective)
[19:03:30] Beta cluster seems down?
[19:05:05] cscott: unless a new issue has surfaced, possibly T420833?
[19:05:05] T420833: Beta cluster is slow as sludge / serves 503 and 504 - https://phabricator.wikimedia.org/T420833
[19:06:38] A_smart_kitten: ok, thanks.
[19:25:42] Thanks for the revert, cdanis.