[09:25:22] <wikibugs>	 serviceops, collaboration-services, Data-Platform-SRE, Prod-Kubernetes, Kubernetes: Update Kubernetes clusters to >1.25 - https://phabricator.wikimedia.org/T341984#10351753 (JMeybohm)
[11:39:36] <wikibugs>	 serviceops, DC-Ops, ops-eqiad: Degraded RAID on wikikube-worker1256 - https://phabricator.wikimedia.org/T379454#10352409 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1256.eqiad.wmnet with OS bookworm completed: - wikikube-worker1256 (...
[11:56:07] <wikibugs>	 serviceops, DC-Ops, ops-eqiad: Degraded RAID on wikikube-worker1256 - https://phabricator.wikimedia.org/T379454#10352473 (ops-monitoring-bot) pool host wikikube-worker1256.eqiad.wmnet by cgoubert@cumin1002 with reason: RAID ok
[11:56:08] <wikibugs>	 serviceops, DC-Ops, ops-eqiad: Degraded RAID on wikikube-worker1256 - https://phabricator.wikimedia.org/T379454#10352474 (ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by cgoubert@cumin1002 pool for host wikikube-worker1256.eqiad.wmnet completed: - wikikube-worker1256.eqiad....
[11:57:02] <wikibugs>	 serviceops, DC-Ops, ops-eqiad: Degraded RAID on wikikube-worker1256 - https://phabricator.wikimedia.org/T379454#10352475 (Clement_Goubert) Open→Resolved Host reimaged, RAID ok, repooled
[12:08:04] <wikibugs>	 serviceops, Patch-For-Review: wikikube-worker13[13-27] implementation tracking - https://phabricator.wikimedia.org/T380350#10352542 (Clement_Goubert)
[12:41:28] <wikibugs>	 serviceops: wikikube-worker13[13-27] implementation tracking - https://phabricator.wikimedia.org/T380350#10352689 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1314.eqiad.wmnet with OS bookworm
[12:41:29] <wikibugs>	 serviceops: wikikube-worker13[13-27] implementation tracking - https://phabricator.wikimedia.org/T380350#10352688 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1313.eqiad.wmnet with OS bookworm
[12:41:45] <wikibugs>	 serviceops: wikikube-worker13[13-27] implementation tracking - https://phabricator.wikimedia.org/T380350#10352693 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1315.eqiad.wmnet with OS bookworm
[12:42:50] <wikibugs>	 serviceops: wikikube-worker13[13-27] implementation tracking - https://phabricator.wikimedia.org/T380350#10352695 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1316.eqiad.wmnet with OS bookworm
[12:42:53] <wikibugs>	 serviceops: wikikube-worker13[13-27] implementation tracking - https://phabricator.wikimedia.org/T380350#10352696 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1317.eqiad.wmnet with OS bookworm
[12:43:23] <wikibugs>	 serviceops: wikikube-worker13[13-27] implementation tracking - https://phabricator.wikimedia.org/T380350#10352699 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1318.eqiad.wmnet with OS bookworm
[12:43:58] <wikibugs>	 serviceops: wikikube-worker13[13-27] implementation tracking - https://phabricator.wikimedia.org/T380350#10352702 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1319.eqiad.wmnet with OS bookworm
[12:44:44] <wikibugs>	 serviceops: wikikube-worker13[13-27] implementation tracking - https://phabricator.wikimedia.org/T380350#10352704 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1320.eqiad.wmnet with OS bookworm
[13:25:25] <wikibugs>	 serviceops: wikikube-worker13[13-27] implementation tracking - https://phabricator.wikimedia.org/T380350#10352902 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1313.eqiad.wmnet with OS bookworm completed: - wikikube-worker1313 (**PASS**)   - D...
[13:28:04] <wikibugs>	 serviceops: wikikube-worker13[13-27] implementation tracking - https://phabricator.wikimedia.org/T380350#10352923 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1316.eqiad.wmnet with OS bookworm completed: - wikikube-worker1316 (**PASS**)   - D...
[13:30:14] <wikibugs>	 serviceops: wikikube-worker13[13-27] implementation tracking - https://phabricator.wikimedia.org/T380350#10352928 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1317.eqiad.wmnet with OS bookworm completed: - wikikube-worker1317 (**PASS**)   - D...
[13:33:24] <wikibugs>	 serviceops: wikikube-worker13[13-27] implementation tracking - https://phabricator.wikimedia.org/T380350#10352954 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1315.eqiad.wmnet with OS bookworm completed: - wikikube-worker1315 (**PASS**)   - D...
[13:35:37] <wikibugs>	 serviceops: wikikube-worker13[13-27] implementation tracking - https://phabricator.wikimedia.org/T380350#10352988 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1319.eqiad.wmnet with OS bookworm completed: - wikikube-worker1319 (**PASS**)   - D...
[13:41:29] <wikibugs>	 serviceops: wikikube-worker13[13-27] implementation tracking - https://phabricator.wikimedia.org/T380350#10353017 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1314.eqiad.wmnet with OS bookworm completed: - wikikube-worker1314 (**PASS**)   - D...
[13:43:44] <wikibugs>	 serviceops: wikikube-worker13[13-27] implementation tracking - https://phabricator.wikimedia.org/T380350#10353030 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1320.eqiad.wmnet with OS bookworm completed: - wikikube-worker1320 (**PASS**)   - D...
[13:47:13] <wikibugs>	 serviceops: wikikube-worker13[13-27] implementation tracking - https://phabricator.wikimedia.org/T380350#10353047 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1318.eqiad.wmnet with OS bookworm completed: - wikikube-worker1318 (**PASS**)   - D...
[14:05:29] <cdanis>	 hi folks, I'm now puppet-merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/1094489, which (in a very stupid way) prevents a misconfiguration of mostly the wikikube clusters.  we can do it a better way in the future but this is a fine bandaid for now, since we've over-filled wikikube on nodes (and run out of pod IP space) at least twice now, which leads to many interesting and annoying
[14:05:31] <cdanis>	 failure cases
[14:11:23] <claime>	 sooo it's breaking reimages now 
[14:11:29] <claime>	 Error: Could not run: Evaluation Error: Error while evaluating a Function Call, value returned from k8s::fetch_cluster_config has wrong type, entry 'cluster_nodes' expects size to be between 1 and 255, got 264 (file: /srv/puppet_code/environments/production/modules/profile/manifests/containerd.pp, line: 10, column: 17)
[14:11:42] <cdanis>	 sigh
[14:12:08] <claime>	 I'm gonna check which cluster triggers it
[14:13:16] <effie>	 deployment is also stalled
[14:13:21] <cdanis>	 I can revert it
[14:13:46] <claime>	 it's eqiad but idk if I have 10 nodes to decom there
[14:14:02] <effie>	 claime: like this   (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "7c5370b88ad906237ef913307acaf57fb66be9af1c0fad52fbcb64fb82f2d202": plugin type="calico" failed (add): failed to request IPv4 addresses: Assigned 0 out of 1 requested IPv4 addresses; No more free affine blocks and strict affinity enabled ?
[14:14:07] <cdanis>	 oh
[14:14:09] <claime>	 ugh
[14:14:10] <cdanis>	 so eqiad is actually broken right now
[14:14:16] <claime>	 effie: that's running out of ips
[14:14:42] <effie>	 it seems so yes
[14:14:47] <cdanis>	 need to delete 10 nodes asap then?
[14:14:59] <claime>	 yeah, I'll delete the ones I just imaged
[14:15:04] <claime>	 gimme a sec
[14:15:16] <cdanis>	 I am not going to submit the revert, if that sounds good to you
[14:15:25] <effie>	 I am mid deployment, and my guess is that rolling back is not going to help
[14:15:55] <cdanis>	 effie: once we are in this state, we need to remove the excess nodes, and then, possibly, wait for calico to run a few reconciliation cycles / age out the old leases
[14:16:49] <cdanis>	 claime: do you need more hands?
[14:16:53] <effie>	 we were lucky enough to hit the backport window, but I let folks know 
[14:16:56] <claime>	 nah I'm good
[14:21:38] <claime>	 need to check one more thing because there's a few nodes cordoned off, jayme 1309-1312 is your reboots?
[14:21:57] <effie>	 claime: I am moving forward looks like 
[14:22:08] <effie>	 woohoo
[14:22:32] <claime>	 we're down to 249 nodes with the ones I removed
[14:22:54] <claime>	 so under threshold, which should unblock deployments
[14:23:27] <claime>	 but that means I have to revert my patch adding them to kubernetes in hiera, because that breaks puppet
[14:23:46] <claime>	 And wait until we actually fix the ip exhaustion issue before putting them back
[14:23:59] <cdanis>	 claime: I can revert my puppet CI patch for now, but I'm hesitant to do so
[14:24:09] <claime>	 cdanis: don't
[14:24:11] <cdanis>	 okay
[14:24:20] <claime>	 It's gonna bite us again if you remove it
[14:24:27] <cdanis>	 yeah, that was basically my thought
[14:24:39] <cdanis>	 what I *could* do, is stop puppet from failing on the masters
[14:24:56] <claime>	 straight revert won't work because the patch was wrong and I removed a node afterwards
[14:25:01] <cdanis>	 right
[14:25:20] <effie>	 are we letting devs backport? 
[14:25:36] <cdanis>	 +1 from me
[14:26:00] <cdanis>	 claime: it's this line that is making the puppet runs fail https://gerrit.wikimedia.org/r/c/operations/puppet/+/1097389/1/modules/k8s/types/clusterconfig.pp and it doesn't have any effect other than that
[14:26:09] <effie>	 my deployment is rolled back, a handful of nodes are left
[14:26:11] <cdanis>	 the CI check is just in the Rakefile change
[14:26:37] <effie>	 claime: objections to let the backport resume?
[14:26:52] <claime>	 nah go ahead
[14:27:21] <cdanis>	 claime: if it unblocks you, feel free to just remove the `, 255` from clusterconfig.pp and self-+2
[14:27:35] <cdanis>	 I need to go pack a bag, I have a train to catch in about 3 hours
[14:27:42] <claime>	 cdanis: it's moot because I can't add the nodes anyways
[14:27:48] <cdanis>	 ack
[14:27:51] <effie>	 cdanis: you mean a literal train not a mediawiki train?
[14:27:53] <cdanis>	 (I'll be off and on IRC)
[14:27:56] <cdanis>	 effie: yes
[14:28:05] <effie>	 I am surprised!
[14:28:07] <effie>	 go!
[14:32:50] <claime>	 effie: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1097392 +1 for this please?
[14:33:05] <cdanis>	 +1
[14:33:08] <wikibugs>	 serviceops, collaboration-services, Data-Platform-SRE, Machine-Learning-Team, and 3 others: Update knative-serving+net-istio to v1.12.x on ML clusters - https://phabricator.wikimedia.org/T380723#10353244 (klausman) V1.16.0 requires Istio 1.20, so I have backed down the version to the v.1.12 serie...
[14:33:11] <claime>	 then I'll decom 6 hosts, and need to maths
[14:33:16] <claime>	 :p
[14:33:26] <effie>	 haha sure 1 sec
[14:33:47] <cdanis>	 claime: btw there is probably an off-by-one error in my original patch 😅
[14:33:49] <claime>	 it's ok c.danis did
[14:34:09] <claime>	 And I shouldn't merge this during a deployment actually
[14:34:17] <claime>	 Because it'll trigger ferm
[14:34:34] <claime>	 jeez
[14:34:58] <effie>	 claime: it will trigger ferm at a rate of puppet runs, so it is not that terrible 
[14:35:05] <wikibugs>	 serviceops, collaboration-services, Data-Platform-SRE, Machine-Learning-Team, and 3 others: Update knative-serving+net-istio to v1.12.x on ML clusters - https://phabricator.wikimedia.org/T380723#10353232 (klausman)
[14:35:11] <effie>	 unless we want to run puppet everywhere asap 
[14:35:19] <claime>	 fair
[14:35:34] <claime>	 yeah I don't need it to force run
[14:36:10] <effie>	 famous last words: I guess it is ok to merge and start having puppet working :p
[14:36:18] <claime>	 yeah
[15:05:55] <jayme>	 claime: sorry - was out for lunch. But the cordoned nodes were mine (for reboot) indeed
[15:08:07] <claime>	 erf sorry, I uncordoned wikikube-worker13[10-12].eqiad.wmnet 
[15:08:47] <jayme>	 it's fine - it will just lead to more pod churn
[15:08:53] <claime>	 https://gerrit.wikimedia.org/r/c/operations/puppet/+/1097405 https://gerrit.wikimedia.org/r/c/operations/puppet/+/1097406
[15:08:57] <claime>	 eqiad decoms
[16:17:12] <cdanis>	 https://phabricator.wikimedia.org/P71136 might be useful at times
[16:20:25] <_joe_>	 the first rule of anything coming from cdanis with "oneliner" in the title is to wear sunglasses before looking at it
[16:25:09] <claime>	 and also ask what version of yq it's using :p
[16:32:41] <moritzm>	 the underhanded C code contest, 2024 YAML edition
[16:55:49] <_joe_>	 lol
[17:01:16] <wikibugs>	 serviceops, DC-Ops, ops-eqiad, Prod-Kubernetes, and 2 others: wikikube-ctrl1001.eqiad.wmnet: The CMOS battery has reached the end of its usable life or has failed. - https://phabricator.wikimedia.org/T379622#10354314 (VRiley-WMF) a:VRiley-WMF Would we like to proceed with replacing the CMOS b...
[18:07:00] <wikibugs>	 serviceops, DC-Ops, ops-eqiad, Prod-Kubernetes, and 2 others: wikikube-ctrl1001.eqiad.wmnet: The CMOS battery has reached the end of its usable life or has failed. - https://phabricator.wikimedia.org/T379622#10354661 (JMeybohm) Open→Declined Oh, sorry. We forgot to update here. Lets n...
[19:38:27] <wikibugs>	 serviceops, Content-Transform-Team-WIP, Push-Notification-Service, Wikipedia-Android-App-Backlog, Essential-Work: Timeout errors when making requests to Firebase for push notifications - https://phabricator.wikimedia.org/T379647#10355078 (CDanis) Have you tried also setting the http_proxy and...
[19:38:56] <wikibugs>	 serviceops, Content-Transform-Team-WIP, Push-Notification-Service, Wikipedia-Android-App-Backlog, Essential-Work: Timeout errors when making requests to Firebase for push notifications - https://phabricator.wikimedia.org/T379647#10355079 (CDanis) a:Jgiannelos→None
[23:32:29] <wikibugs>	 serviceops, Content-Transform-Team, MediaWiki-Engineering, MW-Interfaces-Team, and 3 others: Transition parsoidtest1001 to PHP 8.1 - https://phabricator.wikimedia.org/T380485#10355701 (Scott_French) Thanks for writing this up, @cscott.  So, one alternative to migrating parsoidtest1001 to PHP 8.1...