[09:25:22] serviceops, collaboration-services, Data-Platform-SRE, Prod-Kubernetes, Kubernetes: Update Kubernetes clusters to >1.25 - https://phabricator.wikimedia.org/T341984#10351753 (JMeybohm)
[11:39:36] serviceops, DC-Ops, ops-eqiad: Degraded RAID on wikikube-worker1256 - https://phabricator.wikimedia.org/T379454#10352409 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1256.eqiad.wmnet with OS bookworm completed: - wikikube-worker1256 (...
[11:56:07] serviceops, DC-Ops, ops-eqiad: Degraded RAID on wikikube-worker1256 - https://phabricator.wikimedia.org/T379454#10352473 (ops-monitoring-bot) pool host wikikube-worker1256.eqiad.wmnet by cgoubert@cumin1002 with reason: RAID ok
[11:56:08] serviceops, DC-Ops, ops-eqiad: Degraded RAID on wikikube-worker1256 - https://phabricator.wikimedia.org/T379454#10352474 (ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by cgoubert@cumin1002 pool for host wikikube-worker1256.eqiad.wmnet completed: - wikikube-worker1256.eqiad....
[11:57:02] serviceops, DC-Ops, ops-eqiad: Degraded RAID on wikikube-worker1256 - https://phabricator.wikimedia.org/T379454#10352475 (Clement_Goubert) Open→Resolved Host reimaged, RAID ok, repooled
[12:08:04] serviceops, Patch-For-Review: wikikube-worker13[13-27] implementation tracking - https://phabricator.wikimedia.org/T380350#10352542 (Clement_Goubert)
[12:41:28] serviceops: wikikube-worker13[13-27] implementation tracking - https://phabricator.wikimedia.org/T380350#10352689 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1314.eqiad.wmnet with OS bookworm
[12:41:29] serviceops: wikikube-worker13[13-27] implementation tracking - https://phabricator.wikimedia.org/T380350#10352688 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1313.eqiad.wmnet with OS bookworm
[12:41:45] serviceops: wikikube-worker13[13-27] implementation tracking - https://phabricator.wikimedia.org/T380350#10352693 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1315.eqiad.wmnet with OS bookworm
[12:42:50] serviceops: wikikube-worker13[13-27] implementation tracking - https://phabricator.wikimedia.org/T380350#10352695 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1316.eqiad.wmnet with OS bookworm
[12:42:53] serviceops: wikikube-worker13[13-27] implementation tracking - https://phabricator.wikimedia.org/T380350#10352696 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1317.eqiad.wmnet with OS bookworm
[12:43:23] serviceops: wikikube-worker13[13-27] implementation tracking - https://phabricator.wikimedia.org/T380350#10352699 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1318.eqiad.wmnet with OS bookworm
[12:43:58] serviceops: wikikube-worker13[13-27] implementation tracking - https://phabricator.wikimedia.org/T380350#10352702 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1319.eqiad.wmnet with OS bookworm
[12:44:44] serviceops: wikikube-worker13[13-27] implementation tracking - https://phabricator.wikimedia.org/T380350#10352704 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1320.eqiad.wmnet with OS bookworm
[13:25:25] serviceops: wikikube-worker13[13-27] implementation tracking - https://phabricator.wikimedia.org/T380350#10352902 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1313.eqiad.wmnet with OS bookworm completed: - wikikube-worker1313 (**PASS**) - D...
[13:28:04] serviceops: wikikube-worker13[13-27] implementation tracking - https://phabricator.wikimedia.org/T380350#10352923 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1316.eqiad.wmnet with OS bookworm completed: - wikikube-worker1316 (**PASS**) - D...
[13:30:14] serviceops: wikikube-worker13[13-27] implementation tracking - https://phabricator.wikimedia.org/T380350#10352928 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1317.eqiad.wmnet with OS bookworm completed: - wikikube-worker1317 (**PASS**) - D...
[13:33:24] serviceops: wikikube-worker13[13-27] implementation tracking - https://phabricator.wikimedia.org/T380350#10352954 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1315.eqiad.wmnet with OS bookworm completed: - wikikube-worker1315 (**PASS**) - D...
[13:35:37] serviceops: wikikube-worker13[13-27] implementation tracking - https://phabricator.wikimedia.org/T380350#10352988 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1319.eqiad.wmnet with OS bookworm completed: - wikikube-worker1319 (**PASS**) - D...
[13:41:29] serviceops: wikikube-worker13[13-27] implementation tracking - https://phabricator.wikimedia.org/T380350#10353017 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1314.eqiad.wmnet with OS bookworm completed: - wikikube-worker1314 (**PASS**) - D...
[13:43:44] serviceops: wikikube-worker13[13-27] implementation tracking - https://phabricator.wikimedia.org/T380350#10353030 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1320.eqiad.wmnet with OS bookworm completed: - wikikube-worker1320 (**PASS**) - D...
[13:47:13] serviceops: wikikube-worker13[13-27] implementation tracking - https://phabricator.wikimedia.org/T380350#10353047 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1318.eqiad.wmnet with OS bookworm completed: - wikikube-worker1318 (**PASS**) - D...
[14:05:29] hi folks, I'm now puppet-merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/1094489, which (in a very stupid way) prevents a misconfiguration of mostly the wikikube clusters. we can do it a better way in the future but this is a fine bandaid for now, since we've over-filled wikikube on nodes (and run out of pod IP space) at least twice now, which leads to many interesting and annoying
[14:05:31] failure cases
[14:11:23] sooo it's breaking reimages now
[14:11:29] Error: Could not run: Evaluation Error: Error while evaluating a Function Call, value returned from k8s::fetch_cluster_config has wrong type, entry 'cluster_nodes' expects size to be between 1 and 255, got 264 (file: /srv/puppet_code/environments/production/modules/profile/manifests/containerd.pp, line: 10, column: 17)
[14:11:42] sigh
[14:12:08] I'm gonna check which cluster triggers it
[14:13:16] deployment is also stalled
[14:13:21] I can revert it
[14:13:46] it's eqiad but idk if I have 10 nodes to decom there
[14:14:02] claime: like this (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "7c5370b88ad906237ef913307acaf57fb66be9af1c0fad52fbcb64fb82f2d202": plugin type="calico" failed (add): failed to request IPv4 addresses: Assigned 0 out of 1 requested IPv4 addresses; No more free affine blocks and strict affinity enabled ?
[14:14:07] oh
[14:14:09] ugh
[14:14:10] so eqiad is actually broken right now
[14:14:16] effie: that's running out of ips
[14:14:42] it seems so yes
[14:14:47] need to delete 10 nodes asap then?
[14:14:59] yeah, I'll delete the ones I just imaged
[14:15:04] gimme a sec
[14:15:16] I am not going to submit the revert, if that sounds good to you
[14:15:25] I am mid deployment, and my guess is that rolling back is not going to help
[14:15:55] effie: once we are in this state, we need to remove the excess nodes, and then, possibly, wait for calico to run a few reconciliation cycles / age out the old leases
[14:16:49] claime: do you need more hands?
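The "No more free affine blocks and strict affinity enabled" error above is consistent with the pod IP pool holding fewer address blocks than the cluster has nodes. A minimal sketch of that arithmetic in Python, assuming each node needs at least one block of its own: the /26 block size is Calico's documented IPv4 default, while the pool CIDR `10.0.0.0/18` and both function names are hypothetical and do not come from this log.

```python
import ipaddress


def affine_block_count(pool_cidr: str, block_prefix: int = 26) -> int:
    """Return how many fixed-size IPAM blocks fit in a pod IP pool.

    block_prefix=26 is Calico's default IPv4 block size; the pool CIDR
    passed in below is a hypothetical value, not the real wikikube pool.
    """
    pool = ipaddress.ip_network(pool_cidr)
    if block_prefix < pool.prefixlen:
        raise ValueError("a block cannot be larger than the pool")
    return 2 ** (block_prefix - pool.prefixlen)


def nodes_without_block(node_count: int, pool_cidr: str, block_prefix: int = 26) -> int:
    """Return how many nodes cannot get a block of their own.

    Assumes (as a simplification) that every node claims exactly one block.
    """
    return max(0, node_count - affine_block_count(pool_cidr, block_prefix))


if __name__ == "__main__":
    hypothetical_pool = "10.0.0.0/18"  # assumption for illustration only
    print(affine_block_count(hypothetical_pool))
    print(nodes_without_block(264, hypothetical_pool))
```

Under these assumptions, any node beyond the block count has no block to allocate pod addresses from, which would match the observation that removing the excess nodes was needed before deployments could proceed.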
[14:16:53] we were lucky enough to hit the backport window, but I let folks know
[14:16:56] nah I'm good
[14:21:38] need to check one more thing because there's a few nodes cordoned off, jayme 1309-1312 is your reboots?
[14:21:57] claime: I am moving forward looks like
[14:22:08] woohoo
[14:22:32] we're down to 249 nodes with the ones I removed
[14:22:54] so under threshold, which should unblock deployments
[14:23:27] but that means I have to revert my patch adding them to kubernetes in hiera, because that breaks puppet
[14:23:46] And wait until we actually fix the ip exhaustion issue before putting them back
[14:23:59] claime: I can revert my puppet CI patch for now, but I'm hesitant to do so
[14:24:09] cdanis: don't
[14:24:11] okay
[14:24:20] It's gonna bite us again if you remove it
[14:24:27] yeah, that was basically my thought
[14:24:39] what I *could* do, is stop puppet from failing on the masters
[14:24:56] straight revert won't work because the patch was wrong and I removed a node afterwards
[14:25:01] right
[14:25:20] are we letting devs backport?
[14:25:36] +1 from me
[14:26:00] claime: it's this line that is making the puppet runs fail https://gerrit.wikimedia.org/r/c/operations/puppet/+/1097389/1/modules/k8s/types/clusterconfig.pp and it doesn't have any effect other than that
[14:26:09] my deployment is rolled back, a handful of nodes are left
[14:26:11] the CI check is just in the Rakefile change
[14:26:37] claime: objections to let the backport resume?
[14:26:52] nah go ahead
[14:27:21] claime: if it unblocks you, feel free to just remove the `, 255` from clusterconfig.pp and self-+2
[14:27:35] I need to go pack a bag, I have a train to catch in about 3 hours
[14:27:42] cdanis: it's moot because I can't add the nodes anyways
[14:27:48] ack
[14:27:51] cdanis: you mean a literal train not a mediawiki train?
[14:27:53] (I'll be off and on IRC)
[14:27:56] effie: yes
[14:28:05] I am surprised!
[14:28:07] go!
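The size constraint discussed above ("expects size to be between 1 and 255, got 264") can be illustrated with a small Python sketch. This is not the Puppet type from clusterconfig.pp or the Rakefile CI check; the function names and the example host names are hypothetical, and only the bounds 1 and 255 and the counts 264 and 249 are taken from the log.

```python
MAX_CLUSTER_NODES = 255  # upper bound quoted in the Puppet error message


def excess_nodes(node_count: int, limit: int = MAX_CLUSTER_NODES) -> int:
    """Return how many nodes must be removed to get back under the limit."""
    return max(0, node_count - limit)


def check_cluster_nodes(nodes: list[str], limit: int = MAX_CLUSTER_NODES) -> None:
    """Raise ValueError if the node list is empty or larger than the limit."""
    if not 1 <= len(nodes) <= limit:
        raise ValueError(
            f"'cluster_nodes' expects size to be between 1 and {limit}, "
            f"got {len(nodes)}"
        )


if __name__ == "__main__":
    # Hypothetical host names, generated only to reach the counts in the log.
    nodes = [f"example-worker{i:04d}" for i in range(264)]
    print(excess_nodes(len(nodes)))  # nodes over the limit
    check_cluster_nodes(nodes[:249])  # passes: 249 is under the limit
    try:
        check_cluster_nodes(nodes)
    except ValueError as err:
        print(err)
```

A check of this shape fails loudly at configuration time rather than at pod scheduling time, which is the trade-off debated above: it blocked reimages once the cluster was already over the limit, but removing it would let the cluster be over-filled again.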
[14:32:50] effie: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1097392 +1 for this please?
[14:33:05] +1
[14:33:08] serviceops, collaboration-services, Data-Platform-SRE, Machine-Learning-Team, and 3 others: Update knative-serving+net-istio to v1.12.x on ML clusters - https://phabricator.wikimedia.org/T380723#10353244 (klausman) V1.16.0 requires Istio 1.20, so I have backed down the version to the v.1.12 serie...
[14:33:11] then I'll decom 6 hosts, and need to maths
[14:33:16] :p
[14:33:26] haha sure 1 sec
[14:33:47] claime: btw there is probably an off-by-one error in my original patch 😅
[14:33:49] it's ok c.danis did
[14:34:09] And I shouldn't merge this during a deployment actually
[14:34:17] Because it'll trigger ferm
[14:34:34] jeez
[14:34:58] claime: it will trigger ferm at a rate of puppet runs, so it is not that terrible
[14:35:05] serviceops, collaboration-services, Data-Platform-SRE, Machine-Learning-Team, and 3 others: Update knative-serving+net-istio to v1.12.x on ML clusters - https://phabricator.wikimedia.org/T380723#10353232 (klausman)
[14:35:11] unless we want to run puppet everywhere asap
[14:35:19] fair
[14:35:34] yeah I don't need it to force run
[14:36:10] famous last words: I guess it is on to merge and start having puppet working :p
[14:36:18] yeah
[15:05:55] claime: sorry - was out for lunch. But the cordoned nodes were mine (for reboot) indeed
[15:08:07] erf sorry, I uncordoned wikikube-worker13[10-12].eqiad.wmnet
[15:08:47] it's fine - it will just lead to more pod churn
[15:08:53] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1097405 https://gerrit.wikimedia.org/r/c/operations/puppet/+/1097406
[15:08:57] eqiad decoms
[16:17:12] https://phabricator.wikimedia.org/P71136 might be useful at times
[16:20:25] <_joe_> the first rule of anything coming from cdanis with "oneliner" in the title is to wear sunglasses before looking at it
[16:25:09] and also ask what version of yq it's using :p
[16:32:41] the underhanded C code contest, 2024 YAML edition
[16:55:49] <_joe_> lol
[17:01:16] serviceops, DC-Ops, ops-eqiad, Prod-Kubernetes, and 2 others: wikikube-ctrl1001.eqiad.wmnet: The CMOS battery has reached the end of its usable life or has failed. - https://phabricator.wikimedia.org/T379622#10354314 (VRiley-WMF) a:VRiley-WMF Would we like to proceed with replacing the CMOS b...
[18:07:00] serviceops, DC-Ops, ops-eqiad, Prod-Kubernetes, and 2 others: wikikube-ctrl1001.eqiad.wmnet: The CMOS battery has reached the end of its usable life or has failed. - https://phabricator.wikimedia.org/T379622#10354661 (JMeybohm) Open→Declined Oh, sorry. We forgot to update here. Lets n...
[19:38:27] serviceops, Content-Transform-Team-WIP, Push-Notification-Service, Wikipedia-Android-App-Backlog, Essential-Work: Timeout errors when making requests to Firebase for push notifications - https://phabricator.wikimedia.org/T379647#10355078 (CDanis) Have you tried also setting the http_proxy and...
[19:38:56] serviceops, Content-Transform-Team-WIP, Push-Notification-Service, Wikipedia-Android-App-Backlog, Essential-Work: Timeout errors when making requests to Firebase for push notifications - https://phabricator.wikimedia.org/T379647#10355079 (CDanis) a:Jgiannelos→None
[23:32:29] serviceops, Content-Transform-Team, MediaWiki-Engineering, MW-Interfaces-Team, and 3 others: Transition parsoidtest1001 to PHP 8.1 - https://phabricator.wikimedia.org/T380485#10355701 (Scott_French) Thanks for writing this up, @cscott. So, one alternative to migrating parsoidtest1001 to PHP 8.1...