[07:35:45] greetings
[09:06:23] morning!
[09:20:44] hello
[10:43:11] we'll soon migrate ganeti/ulsfo to routed Ganeti and I'd really love to wrap up the Bird 2.18 migration before that. I'd appreciate it if someone could have a look at https://gerrit.wikimedia.org/r/c/operations/puppet/+/1242431, I'll also take care of upgrading these hosts to Bird 2.18
[11:57:37] * dcaro lunch
[11:58:09] maybe taav.i or abogot.t might be able to help with that :/, it will take me a while to catch up on the bird/cloudgw side
[12:47:54] moritzm: +1, sorry for the delay
[12:49:19] thanks! I'll merge and upgrade these later today
[12:51:36] taavi: were you able to successfully deploy https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/1161 in the end? (should we merge that MR?)
[12:52:04] dcaro: yes, thanks for the reminder, merged
[12:52:14] got distracted enough in the process to forget to click merge in the end
[12:52:51] np, yep, that might be a point to improve in the process
[13:30:07] XioNoX: would now be a good time for me to deploy https://gerrit.wikimedia.org/r/c/operations/homer/public/+/970275?
[13:30:24] taavi: sure
[13:59:30] ok, finally merged after the CI problems were sorted out, deploying to codfw first
[14:00:36] XioNoX: is a diff for https://gerrit.wikimedia.org/r/c/operations/homer/public/+/1247735 expected?
[14:12:43] quick review: https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/276 (cleaning up a bit of code after the migration)
[14:18:42] lgtm
[14:27:24] thanks!
[14:37:50] taavi: I guess it didn't get rolled out to all devices, it's fine to merge it
[14:37:57] ack, thanks
[14:43:42] network tests pass in codfw, now deploying to eqiad
[14:52:01] all done
[15:04:19] I'll upgrade bird on cloudlb now, starting with 1001
[15:05:17] looking for reviews for: https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/83 https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/1143
[15:06:17] both cloudlb* nodes upgraded, all seem fine to me
[15:26:04] thx moritzm
[15:26:23] hm, I can't use test-cookbook to test a spicerack change, only a cookbook change? dang
[16:12:04] upgrading Bird on cloudservices now, starting with 1006
[16:20:52] * dcaro brb
[16:23:35] both cloudservices* nodes upgraded, all seem fine to me
[16:46:28] 🎉
[16:46:35] thanks moritzm
[16:53:42] andrewbogott: if you have a bunch of changes for the toolforge/etcdctl spicerack module and there is no plan to make that module a generic etcdctl one that can be reused by prod cookbooks, it might be a good time to move it out of spicerack and into the wmcs-cookbooks repository, and then use the external_modules_dir config and the related SpicerackExtenderBase to expose it inside spicerack
[16:54:42] volans: that makes sense; I'll defer to dcaro about the future of those files.
[16:54:57] I'm quite alarmed at how unstable and broken the etcdctl cli is :(
[16:56:12] "eventual consistency" is a fancy word for "retry until it works" xd
[16:56:20] *phrasing
[16:57:35] I mean the code itself is unstable. The cli flags changed between 3.3 and 3.4, the docs about the change are incorrect, when you request json output it returns decimal IDs rather than hex IDs, etc. etc.
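[editor's note] volans' suggestion above is to move the etcdctl helpers out of spicerack into the wmcs-cookbooks repo and expose them through spicerack's extender mechanism. A minimal sketch of the shape that could take — all class and attribute names here are illustrative assumptions, and the real hook (`external_modules_dir` plus subclassing `SpicerackExtenderBase`) should be checked against the spicerack docs:

```python
# Hypothetical layout: etcdctl helpers living in the wmcs-cookbooks repo,
# exposed via an extender so cookbooks keep a spicerack-like accessor.
# This does NOT import spicerack; it only sketches the intended structure.


class EtcdctlModule:
    """Stand-in for the etcdctl helpers moved out of spicerack proper."""

    def __init__(self, remote_host):
        # remote_host would be a spicerack RemoteHosts instance in practice.
        self.remote_host = remote_host


class WMCSExtender:
    """Sketch of a class that would subclass spicerack.SpicerackExtenderBase.

    With external_modules_dir pointing at the wmcs-cookbooks modules and this
    class registered as the extender, cookbooks could keep calling something
    like spicerack.etcdctl(host) without the code living in spicerack itself.
    """

    def __init__(self, spicerack):
        self._spicerack = spicerack

    def etcdctl(self, remote_host) -> EtcdctlModule:
        # Accessor exposed on the Spicerack instance by the extender machinery.
        return EtcdctlModule(remote_host)
```

This keeps the call sites in wmcs-cookbooks unchanged while letting spicerack drop the Toolforge-specific module.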
[16:59:13] that's more annoying, as it would be easily fixable
[17:20:56] PSA: macOS 26.3 seems to have a Rosetta bug that is breaking lima-kilo :/
[17:21:47] I've not yet upgraded for fear of similar issues :D
[17:25:18] oops
[17:37:20] maybe it's not as bad as I thought, k9s was failing with "segmentation fault", but downloading the latest AMD binary fixed it
[17:37:36] I'll do more debugging tomorrow, but I have a sort-of functioning lima-kilo again
[17:38:56] andrewbogott: I'm pretty sure you're hitting the default switch of ETCDCTL_API from 2 to 3 in etcd 3.4, https://etcd.io/docs/v3.4/dev-guide/interacting_v3/
[17:39:45] I was sure of that too, but I already have "export ETCDCTL_API=3" set everywhere
[17:40:07] is there yet another env flag I'm missing?
[17:42:09] where has the spicerack code been setting that previously?
[17:43:04] AFAICT it's not in spicerack/wmcs-cookbooks
[17:45:34] * dhinus off
[17:47:11] I'm not sure about spicerack, I've been doing side-by-side tests on the command line. In any case, I'm confident that the docs are wrong in various ways, since they're not even consistent within the docs.
[17:47:45] and the thing with IDs being sometimes hex and sometimes dec is https://github.com/etcd-io/etcd/issues/9975
[17:48:17] So -- I have a way to move forward and get spicerack working. It was just a pain. And spicerack can't use json formatting because of the hex/dec thing
[17:49:49] * dhinus back because I noticed a toolsdb alert :/
[17:50:18] i mean the spicerack code is clearly written for v2, so 3.4 flipping the default to v3 breaking it seems logical
[17:51:12] so probably the spicerack code should explicitly set the envvar to one or the other, and if that's v3 the commands in there probably need changing to the v3 syntax
[17:51:32] the toolsdb alert is another instance of https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication#Transactions_way_slower_on_the_secondary_than_the_primary
[17:52:04] FYI the equivalent etcdctl commands in the k8s cookbooks always set ETCDCTL_API=3 explicitly
[17:52:16] * volans gotta go, sorry
[17:58:37] I opened T419045 to track the toolsdb alert
[17:58:37] T419045: [toolsdb] ToolsToolsDBReplicationLagIsTooHigh - 2026-03-04 - https://phabricator.wikimedia.org/T419045
[17:58:59] I'm trying to fix it following the runbook
[18:03:00] thanks!
[18:07:44] why is toolforge etcd alerting now?
[18:14:08] It's just the puppet thing? Or do you see another alert I don't see?
[18:16:03] andrewbogott: FIRING: PuppetAgentStaleLastRun: Last Puppet run was over 24 hours ago on instance tools-k8s-etcd-29 in project tools
[18:16:38] given that none of the automation works so far (aiui), I would only expect to see broken stuff in toolsbeta, not tools
[18:17:42] the stale puppet is just me testing something else last night, I've re-enabled it.
[18:18:22] the toolsdb issue is fixed, but the replica will take a while to catch up
[18:18:48] there might be other slow queries in the queue, I'll log off for now, but check back later
[18:19:23] ah, thanks
[18:43:42] * dcaro off
[18:43:44] cya tomorrow!
[19:13:55] andrewbogott: are you aware that the toolschecker etcd health check has been failing for a while?
[19:14:20] yeah, it's just because an etcd node is stuck in a transitional state. I'm working on cleaning up.
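[editor's note] The two etcdctl pitfalls discussed above — etcd 3.4 flipping the default API from v2 to v3, and `-w json` emitting member IDs as decimal while the plain-text output uses hex (etcd issue #9975) — can both be handled defensively in calling code. A minimal sketch, not the actual spicerack implementation:

```python
import json


def etcdctl_cmd(*args: str) -> list[str]:
    # Pin ETCDCTL_API=3 explicitly on every invocation, matching what the
    # k8s cookbooks do, so the etcd 3.4 default change (v2 -> v3) can never
    # silently alter behaviour underneath us.
    return ["env", "ETCDCTL_API=3", "etcdctl", *args]


def member_ids_hex(member_list_json: str) -> list[str]:
    # `etcdctl -w json member list` emits member IDs as decimal integers,
    # while the plain-text output (and most etcd tooling) shows hex.
    # Normalise to lowercase hex strings so the two can be compared.
    data = json.loads(member_list_json)
    return [format(member["ID"], "x") for member in data["members"]]
```

With a normaliser like this, json output becomes usable again despite the hex/dec mismatch, since every ID is converted to one canonical form before comparison.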
[19:14:45] I ack'd but it must've reset
[19:14:51] I'm confused - I thought the automation was broken?
[19:15:49] it's partially broken, in complicated ways.
[19:16:10] Anyway, I'm going to march through the OS upgrade via live hacks because I predict that getting an actual spicerack change merged will take months.
[19:16:40] (for example, the existing cookbook works for creating new nodes since the creation execution happens on the lowest-numbered node, which is still Bullseye)
[19:17:07] can we not?
[19:17:12] 'not' what?
[19:18:51] do any of this by hand instead of fixing the automation
[19:19:23] this is probably the most critical datastore we have around in total
[19:19:48] (and it's still not even backed up T339934)
[19:19:49] T339934: [etcd,infra] Find a backup solution for the etcd database - https://phabricator.wikimedia.org/T339934
[21:46:00] andrewbogott: is T393782 dead or just stalled on capacity? I'm trying to understand the concerns expressed in the team channel yesterday about actually using Magnum.
[21:46:05] T393782: Investigate new Magnum drivers - https://phabricator.wikimedia.org/T393782
[21:46:39] I have been operating under the impression that it was soft-launched, and I've been building systems using it since the fall of 2024. Hearing yesterday that it was something folks wanted to remove entirely was a shock.
[21:48:05] That task is stalled by me having been on vacation, and by other upgrades needing to happen before it can happen in eqiad1.
[21:48:13] You can play with the new drivers in codfw1dev, they're running there.
[21:48:27] Zuul and GitLab-CI both need Kubernetes clusters. We can pivot to some other provisioning system for these projects if that is required, but I need to know that it is necessary.
[21:48:27] there will also have to be some kind of migration path for users who are already using magnum in eqiad1
[21:49:54] andrewbogott: did you agree with volans and godog that projects using Magnum were a surprise? Do I need to plan a replacement?
[21:51:04] Um... I can't speak for others. We spent a while in Lisbon assessing how many current magnum users there were -- it's a handful, and iirc they're all staff. No one /since then/ should be surprised that there are users, since we discussed it.
[21:51:30] As to whether it's worth supporting or not... that remains to be discussed.
[21:51:57] And of course whether the new drivers are drastically more reliable than the old ones (which I hope they will be) would be a factor in that.
[21:53:34] I think our team will have an actual manager in a couple of weeks, we might be a bit easier to deal with after that :)
[22:47:05] I responded in hopefully-not-hysterical terms to the magnum and CI tickets.