[09:50:29] FYI designate @ codfw1dev seems to be struggling, per https://gitlab.wikimedia.org/repos/cloud/cloud-vps/networktests-tofu-provisioning/-/pipelines
[10:02:47] arturo: thanks. taavi: maybe related to the reboots in codfw?
[10:38:04] maybe, let me look
[10:41:12] hrm, the "restart designate" cookbook relies on designate working, which is not great i think
[10:41:28] * taavi writes some manual cumins instead
[10:44:24] yeah that fixed it
[12:43:47] hmm I think the haproxy metric format has changed between bullseye and bookworm, and that's why we're now getting alerts for everything being down
[12:48:52] review for https://gerrit.wikimedia.org/r/c/operations/alerts/+/1151673/?
[12:57:14] dhinus: if you have a moment ^
[13:03:25] taavi: in meetings, will look in a bit
[13:37:43] taavi: you're still thinking that we should delay the epoxy upgrade for T395255? (I probably do too but you might've learned things since I last checked)
[13:37:44] T395255: codfw1dev has seen neutron metadata agents down since epoxy upgrade - https://phabricator.wikimedia.org/T395255
[13:40:07] andrewbogott: I think I'm happy to not consider it a blocker. when I said that, I wasn't fully sure what the impact was
[13:40:33] we need to report that upstream, and probably should delete the agent objects from the database to stop the cookbooks from getting confused
[13:40:40] not great, but I don't think it's a showstopper
[13:40:59] ok.
[13:41:34] however, https://gerrit.wikimedia.org/r/c/operations/alerts/+/1151673 is something I'd like to get in before you start upgrading things
[13:41:49] I'm not especially worried about this bug, but I am worried that in the long run the OS releases may have lost coherence -- if all their integration testing is containerized, then they may be approving different projects to release with different versions of oslo and not even know they're doing it.
[13:41:57] oh yep, I have that one open already!
[13:42:59] I guess they introduced a mysterious third state that's neither up nor down?
[13:43:51] there are 5 different states
[13:44:17] they changed the metric format to report the actual state, instead of "is it up or is it something else"
[13:44:54] seems reasonable
[13:45:43] but that meant our alerting interpreted "the backend is not in maintenance mode" as a problem
[13:46:01] * andrewbogott nods
[13:46:19] is that '# deploy-site: eqiad' at the top of that file just wrong?
[13:47:01] i think that's a historic relic from the days when we didn't have a prometheus instance in codfw at all
[13:47:24] we added one to collect metrics for dashboards and such, but never had a proper discussion on whether we want alerts from there or not
[13:48:38] * andrewbogott nods
[13:48:56] I'm sure I get /some/ alerts from codfw1dev, but I haven't paid attention to see if it's a complete mirror or a subset.
[13:58:26] the remaining haproxy alerts are from cloudlb1002 (which is still on bullseye, where the new query doesn't quite work). I can reimage that to bookworm, or we could silence it for now and do it, say, tomorrow. any preferences?
[14:01:11] Let's reimage. I'm happy to do it if you don't feel like it or are running out of day.
[14:02:05] moritzm: Just one final check: if I reboot a bunch of hosts /today/, will they land on a kernel I can trust? Are there things I should double-check post-reboot?
[14:02:55] i'll kick it off. i have enough meetings that I'll be around to babysit it
[14:07:05] thanks!
[14:08:10] for bookworm yes! they have the ITS mitigations and all known regressions are ironed out
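A quick post-reboot sanity check is only a couple of commands (a minimal sketch; the exact name of the ITS entry under the vulnerabilities directory depends on the kernel version):

    # confirm the host came back up on the expected bookworm kernel
    uname -r
    # print the kernel's view of all CPU vulnerability mitigations, ITS included
    grep . /sys/devices/system/cpu/vulnerabilities/*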
[14:08:37] the 5.10.x kernels don't have the ITS fixes backported, so these are not yet ready, but I think practically all WMCS hosts are on Bookworm by now
[14:21:00] Yeah, as far as I know it's just bookworm things that I'm rebooting.
[14:21:07] If there are bullseye hosts, should I /not/ reboot them?
[15:14:57] andrewbogott: cloudlb reimages are done
[15:15:13] thanks!
[15:21:30] andrewbogott: it won't hurt at all, but they'll need to be rebooted again in a few weeks (once the fixes are out)
[15:21:47] so you can just as well skip them for now if they are tricky to reboot
[15:22:18] sounds good, I don't think there are any in the list anyway
[15:22:24] but good to know that it's harmless!
[15:43:59] FYI data-persistence just moved some wikidata tables to the new "x3" section, which is not yet available on wikireplicas
[15:44:32] they say it will take multiple days, which was not clear to me until now
[15:44:43] I will send an announcement to cloud-announce
[15:45:24] dhinus: https://wikitech.wikimedia.org/wiki/News/2025_Wikidata_term_store_database_split
[15:46:12] "must be adapted to connect to the new cluster" is not entirely right, though
[15:46:45] it doesn't mention the multiple days of waiting :)
[15:47:16] yes, because I didn't know of that detail when writing that page :-)
[15:47:31] but I'm saying you should update that as well if you're sending an announcement
[15:47:44] yep, that was my point. I'm not sure if data-persistence knew, or whether they were also surprised
[15:47:50] ack, I will update that page too :)
[15:58:06] taavi: quick review? https://etherpad.wikimedia.org/p/termdata
[15:58:28] i would link to the News page
[15:58:39] instead of/in addition to the task
[15:58:42] makes sense
[16:00:05] how about now?
[16:02:16] dhinus: lgtm. I also updated the timeline section
[16:02:42] I just got an edit conflict. your version has a nicer icon :D
[16:03:06] sorry :D
[16:04:07] email sent
[16:04:21] thank you!
[16:04:39] andrewbogott: do you remember offhand what kind of issues Nodepool was causing on OpenStack? (circa 2015-2017)
[16:05:19] I think it was just a traffic issue -- too many VMs being created at the same time, causing basically everything to be unstable. I don't think we ever really drilled down to figure out if nodepool could be tuned more gently.
[16:05:29] Do we need nodepool for zuul3?
[16:05:50] Zuul 3 is the context, yes
[16:06:14] I thought we could use static virtual machines (similar to what we do with Jenkins now), but they would still run potentially arbitrary code
[16:06:16] In theory nodepool doesn't do anything that magnum and/or opentofu and/or other existing workflows can't do.
[16:06:34] so we need one-off instances/containers, whatever. The model is usually containers spun up on AWS/Azure/GCE etc
[16:06:50] So it should be possible to tune it so it doesn't break things? But everything is different at every level by now.
[16:06:56] Can nodepool manage containers rather than VMs?
[16:07:07] yup
[16:07:15] ok, that seems likely better if it works?
[16:07:21] Bryan was mentioning Magnum to spin up a K8S cluster in a dedicated WMCS project
[16:07:23] (if I understood properly)
[16:07:43] but is Magnum in real use?
[16:07:51] can it "easily" spin up a k8s cluster?
[16:08:22] Magnum is in real use, yes -- paws and quarry are deployed using magnum. It's not super stable in the current release, but I'm upgrading at this very minute, which should make it slightly better.
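For anyone wanting to see what is already running, Magnum lives behind the "openstack coe" commands (a sketch, assuming the python-magnumclient plugin is installed and the project's credentials are sourced):

    # list the Kubernetes clusters Magnum currently manages in this project
    openstack coe cluster list
    # and the templates they were built from
    openstack coe cluster template list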
[16:08:40] and toolforge uses it as well?
[16:08:46] If you need to build a new magnum cluster every hour/every day, i wouldn't really recommend it, because it fails now and then. But if you just need to build yourself a k8s cluster now and then, it's a good option.
[16:08:49] nope!
[16:08:57] No, at the moment toolforge is a puppet + manually-managed cluster.
[16:09:07] toolforge will never die :]
[16:09:21] Sorry, to clarify: if nodepool wants to use an already-existing magnum cluster, then magnum is a good option.
[16:09:50] I believe the catalyst folks are using k3s, so you could also check in with them about how it's going. I'm not sure if that's a multi-node setup though.
[16:09:58] Nodepool interacting with the OpenStack backend definitely had issues, so I'd rather avoid that
[16:10:46] Magnum lets us spin up a k8s cluster, and from that k8s we could then spin up pods/containers/jobs, correct?
[16:11:22] yep, that's right. Magnum builds a cluster and spits out a k8s config file that you can give to nodepool
[16:11:56] that sounds too easy :]
[16:13:02] hashar: notice that nowhere does it say "and your cluster will work" :P
[16:13:52] as my brother likes to say: "that is a 'their' problem"
[16:13:54] :)
[16:13:56] It won't be that easy, but those are reasonable steps to start with
[16:14:07] +1 worth a try
[16:16:23] hashar: I would check and see what nodepool uses to talk to k8s, since there are a million different possible ways to approach that.
[16:16:43] yeah, I guess we will do some prototyping
[16:16:51] probably you'd use opentofu to create the initial cluster, for the sake of reproducibility. The magnum web UI is not ideal...
[16:16:56] I would wait until the new version of magnum is deployed and (somewhat) tested, which is maybe later today or this week?
[16:17:23] +1
[16:17:28] perfect timing :)
[16:18:20] I guess my next action will be to file a task for your team with a layout of what we have in mind
[16:18:37] * dhinus puts high hopes on the new version and prepares for his hopes to be crushed
[16:18:48] well, there are sort of two magnum upgrades coming: a little one today, and then a maybe actually significant one in a few weeks.
[16:19:29] ah sorry, I lost track of the various parts. is the more significant one the driver change?
[16:19:30] Today is a version upgrade but still with the Heat driver. The move to the new drivers is the actual thing that might make things faster.
[16:19:47] Yeah, the driver change is the major thing. lbaas was a requirement for that.
[16:19:57] is the new driver already running on codfw?
[16:21:00] no
[16:21:25] hashar: here is an example of using opentofu to stand up a k8s cluster with magnum: https://github.com/toolforge/quarry/tree/main/tofu
[16:21:36] but the new version of the heat driver + associated bits seems to work slightly better
[16:21:43] ack
[16:23:37] doesn't openstack have a way to templatize a project?
[16:25:32] that thing I linked you to is tofu creating and managing a k8s template, and then telling openstack to build it.
[16:25:46] So, kind of -- but it's a lot easier to do with tofu than with page-long curl statements.
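For a rough sense of what the tofu config in that repo is driving underneath, the bare CLI flow looks something like this (a hedged sketch only; the template, image, network and flavor names below are illustrative placeholders, not the ones quarry actually uses):

    # define a reusable cluster template (placeholder names, for illustration only)
    openstack coe cluster template create example-k8s-template \
        --coe kubernetes \
        --image <coreos-image> \
        --external-network <external-net> \
        --flavor <worker-flavor> \
        --master-flavor <control-plane-flavor>
    # ask magnum to build a cluster from that template
    openstack coe cluster create example-cluster \
        --cluster-template example-k8s-template \
        --master-count 1 \
        --node-count 2
    # once it's ACTIVE, write out the kubeconfig that something like nodepool could consume
    openstack coe cluster config example-cluster --dir .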
[16:26:19] Sorry, I may not be understanding the question :)
[16:29:45] you kind of answered
[16:30:09] I guess I would expect OpenStack to have a project that fills more or less the same role as OpenTofu
[16:31:00] so we'd need opentofu to spin up instances that Magnum can then use to set up a k8s cluster on top of those instances?
[16:31:42] (I feel I lose a level of abstraction every ten years, or technology adds an extra layer or two every decade and I can't keep up)
[16:36:49] Magnum creates the VMs.
[16:37:00] So you make a template describing the k8s cluster you want
[16:37:11] and a little tofu resource definition saying "make me a cluster like this template"
[16:37:16] and then tofu asks magnum to make it.
[16:37:52] Or you can just feed the template directly to magnum, but that turns out to be clumsier.
[16:38:30] I think looking at that github link will answer more questions than I can :)
[16:56:25] andrewbogott: yeah, i/we will, thank you!
[20:34:08] There are a few more cloudvirts to upgrade, but things seem to be working fine. I'm going to step away for a bit while the cookbook finishes up the remaining cloudvirts. Please ping me if something breaks!
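If anyone wants to keep half an eye on the remaining cloudvirts while the cookbook runs, a quick health pass could be as simple as (a sketch, assuming admin credentials are available wherever the openstack CLI is run):

    # confirm every nova-compute service is enabled and reporting "up"
    openstack compute service list --service nova-compute
    # spot-check hypervisor state and capacity
    openstack hypervisor list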