[09:50:29] <arturo>	 FYI designate @ codfw1dev seems to be struggling, per https://gitlab.wikimedia.org/repos/cloud/cloud-vps/networktests-tofu-provisioning/-/pipelines
[10:02:47] <dhinus>	 arturo: thanks, taavi maybe related to the reboots in codfw?
[10:38:04] <taavi>	 maybe, let me look
[10:41:12] <taavi>	 hrm, the "restart designate" cookbook relies on designate working which is not great i think
[10:41:28] * taavi writes some manual cumins instead
[10:44:24] <taavi>	 yeah that fixed it
[12:43:47] <taavi>	 hmm I think the haproxy metric format has changed between bullseye and bookworm, and that's why we're now getting alerts for everything being down
[12:48:52] <taavi>	 review for https://gerrit.wikimedia.org/r/c/operations/alerts/+/1151673/?
[12:57:14] <taavi>	 dhinus: if you have a moment ^
[13:03:25] <dhinus>	 taavi: in meetings, will look in abit
[13:37:43] <andrewbogott>	 taavi: you're still thinking that we should delay the epoxy upgrade for T395255?  (I probably do too but you might've learned things since I last checked)
[13:37:44] <stashbot>	 T395255: codfw1dev has seen neutron metadata agents down since epoxy upgrade - https://phabricator.wikimedia.org/T395255
[13:40:07] <taavi>	 andrewbogott: I think I'm happy to not consider it a blocker. when I said that I wasn't fully sure what the impact of that was
[13:40:33] <taavi>	 we need to report that upstream, and probably should delete the agent objects from the database to stop the cookbooks from getting confused
[13:40:40] <taavi>	 not great, but I don't think it's a showstopper
[13:40:59] <andrewbogott>	 ok.
[13:41:34] <taavi>	 however https://gerrit.wikimedia.org/r/c/operations/alerts/+/1151673 is something I'd like to get in before you start upgrading things
[13:41:49] <andrewbogott>	 I'm not especially worried about this bug but I am worried that in the long run the OS releases may have lost coherence -- if all their integration testing is containerized then they may be approving different projects to release with different versions of oslo and not even know they're doing it.
[13:41:57] <andrewbogott>	 oh yep, I have that one open already!
[13:42:59] <andrewbogott>	 I guess they introduced a mysterious third state that's neither up nor down?
[13:43:51] <taavi>	 there's 5 different states
[13:44:17] <taavi>	 they changed the metric format to have the actual state, instead of "is it up or is it something else"
[13:44:54] <andrewbogott>	 seems reasonable
[13:45:43] <taavi>	 but that meant our alerting interpreted "the backend is not in maintenance mode" as a problem
[13:46:01] * andrewbogott nods
[13:46:19] <andrewbogott>	 is that '# deploy-site: eqiad' at the top of that file just wrong?
[13:47:01] <taavi>	 i think that's a historic relic from the days when we didn't have a prometheus instance in codfw at all
[13:47:24] <taavi>	 we added one to collect metrics for dashboards and such, but never had a proper discussion on whether we want alerts from there or not
[13:48:38] * andrewbogott nods
[13:48:56] <andrewbogott>	 I'm sure I get /some/ alerts from codfw1dev but I haven't paid attention to see if it's a complete mirror or a subset.
[13:58:26] <taavi>	 the remaining haproxy alerts are from cloudlb1002 (which is still on bullseye, where the new query doesn't quite work). I can reimage that to bookworm, or we could silence that for now and do it, say, tomorrow. any preferences?
[14:01:11] <andrewbogott>	 Let's reimage. I'm happy to do it if you don't feel like it or are running out of day.
[14:02:05] <andrewbogott>	 moritzm: Just one final check: if I reboot a bunch of hosts /today/ will they land on a kernel I can trust? Are there things I should doublecheck post-reboot?
[14:02:55] <taavi>	 i'll kick it off. i have enough meetings that I'll be around to babysit it
[14:07:05] <andrewbogott>	 thanks!
[14:08:10] <moritzm>	 for bookworm yes! they have the ITS mitigations and all known regressions are ironed out
[14:08:37] <moritzm>	 the 5.10.x kernels don't have the ITS fixes backported, so these are not yet ready, but I think practically all WMCS hosts are on Bookworm by now
[14:21:00] <andrewbogott>	 Yeah, as far as I know it's just bookworm things that I'm rebooting.
[14:21:07] <andrewbogott>	 If there are bullseye hosts, should I /not/ reboot them?
[15:14:57] <taavi>	 andrewbogott: cloudlb reimages are done
[15:15:13] <andrewbogott>	 thanks!
[15:21:30] <moritzm>	 andrewbogott: it won't hurt at all, but they'll need to be rebooted again in a few weeks (once the fixes are out)
[15:21:47] <moritzm>	 so you can just as well skip them for now if they are tricky to reboot
[15:22:18] <andrewbogott>	 sounds good, I don't think there are any in the list anyway
[15:22:24] <andrewbogott>	 but good to know that it's harmless!
[15:43:59] <dhinus>	 FYI data-persistence just moved some wikidata tables to the new "x3" section, which is not yet available on wikireplicas
[15:44:32] <dhinus>	 they say it will take multiple days, which was not clear to me until now
[15:44:43] <dhinus>	 I will send an announcement to cloud-announce
[15:45:24] <taavi>	 dhinus: https://wikitech.wikimedia.org/wiki/News/2025_Wikidata_term_store_database_split
[15:46:12] <dhinus>	 "must be adapted to connect to the new cluster" is not entirely right, though
[15:46:45] <dhinus>	 it doesn't mention the multiple days of waiting :)
[15:47:16] <taavi>	 yes, because I didn't know of that detail when writing that page :-)
[15:47:31] <taavi>	 but I'm saying you should update that as well if you're sending an announcement
[15:47:44] <dhinus>	 yep that was my point, I'm not sure if data-persistence knew, or they were also surprised
[15:47:50] <dhinus>	 ack, I will update that page too :)
[15:58:06] <dhinus>	 taavi: quick review? https://etherpad.wikimedia.org/p/termdata
[15:58:28] <taavi>	 i would link to the News page
[15:58:39] <taavi>	 instead of/in addition the task
[15:58:42] <dhinus>	 makes sense
[16:00:05] <dhinus>	 how about now?
[16:02:16] <taavi>	 dhinus: lgtm. I also updated the timeline section
[16:02:42] <dhinus>	 I just got an edit conflict. your version has a nicer icon :D
[16:03:06] <taavi>	 sorry :D
[16:04:07] <dhinus>	 email sent
[16:04:21] <taavi>	 thank you!
[16:04:39] <hashar>	 andrewbogott: do you remember offhand what kind of issues Nodepool was causing on OpenStack? (circa 2015-2017)
[16:05:19] <andrewbogott>	 I think it was just a traffic issue -- too many VMs being created at the same time causing basically everything to be unstable. I don't think we ever really drilled down to figure out if nodepool could be tuned more gently.
[16:05:29] <andrewbogott>	 Do we need nodepool for zuul3?
[16:05:50] <hashar>	 Zuul 3 is the context yes
[16:06:14] <hashar>	 I thought we could use static virtual machines (similar to what we do with Jenkins now), but they would still run potentially arbitrary code
[16:06:16] <andrewbogott>	 In theory nodepool doesn't do anything that magnum and/or opentofu and/or other existing workflows do.
[16:06:34] <hashar>	 so we need one off instances/containers whatever.  The model is usually container spinned up on AWS/Azure/GCE etc
[16:06:36] <andrewbogott>	 So it should be possible to tune it so it doesn't break things? But everything is different at every level by now.
[16:06:50] <andrewbogott>	 Can nodepool manage containers rather than VMs?
[16:06:56] <hashar>	 yup
[16:07:07] <andrewbogott>	 ok, that seems likely better if it works?
[16:07:15] <hashar>	 Bryan was mentioning Magnum to spin up a K8S cluster in a dedicated WMCS project
[16:07:21] <hashar>	 (if I understood properly)
[16:07:23] <andrewbogott>	 In that case I don't know that nodepool would be hitting the openstack APIs at all really.
[16:07:43] <hashar>	 but is Magnum in real use?
[16:07:51] <hashar>	 can it "easily" spins up a k8s cluster?
[16:08:22] <andrewbogott>	 Magnum is in real use, yes -- paws and quarry are deployed using magnum. It's not super stable in the current release but I'm upgrading at this very minute which should make it slightly better.
[16:08:40] <hashar>	 and toolforge uses it as well?
[16:08:46] <andrewbogott>	 If you need to build a new magnum cluster every hour/every day i wouldn't really recommend it because it fails now and then. But if you just need to build yourself a k8s cluster now and then it's a good option.
[16:08:49] <taavi>	 nope!
[16:08:57] <andrewbogott>	 No, at the moment toolforge is a puppet + manually-managed cluster.
[16:09:07] <hashar>	 toolforge will never die :]
[16:09:21] <andrewbogott>	 Sorry, to clarify: if nodepool wants to use an already-existing magnum cluster, then magnum is a good option.
[16:09:50] <andrewbogott>	 I believe that the catalyst folks are using k3s so you could also check in with them about how it's going. I'm not sure if that's a multi-node setup though.
[16:09:58] <hashar>	 Nodepool interacting with the OpenStack backend definitely had issues, so I'd rather avoid that
[16:10:46] <hashar>	 Magnum let us spin a k8s and from that k8s we could then spin pod/containers/jobs correct?
[16:11:22] <andrewbogott>	 yep, that's right. Magnum builds a cluster and spits out a k8s config file that you can give to nodepool
[16:11:56] <hashar>	 that sounds too easy :]
[16:13:02] <dhinus>	 hashar: notice nowhere it says "and your cluster will work" :P
[16:13:52] <hashar>	 as my brother uses to say: "that is a 'their' problem"
[16:13:54] <hashar>	 :)
[16:13:56] <andrewbogott>	 It won't be that easy but those are reasonable steps to start with
[16:14:07] <dhinus>	 +1 worth a try
[16:16:23] <andrewbogott>	 hashar: I would check and see what nodepool uses to talk to k8s since there are a million different possible ways to approach that.
[16:16:43] <hashar>	 yeah I guess we will do some prototype
[16:16:51] <andrewbogott>	 probably you'd use opentofu to create the initial cluster, for the sake or reproducibility. The magnum web UI is not ideal...
[16:16:56] <dhinus>	 I would wait until the new version of magnum is deployed and (somewhat) tested, which is maybe later today or this week?
[16:17:23] <hashar>	 +1
[16:17:28] <hashar>	 perfect timing :)
[16:18:20] <hashar>	 I guess my next action will be to file a task for your team with a layout of what we have in mind
[16:18:37] * dhinus puts high hopes on the new version and prepares for his hopes to be crushed
[16:18:48] <andrewbogott>	 well there are sort of two magnum upgrades coming, a little one today and then an actually maybe significant one in a few weeks.
[16:19:29] <dhinus>	 ah sorry, I lost track of the various parts. is the more significant one the driver change?
[16:19:30] <andrewbogott>	 Today is a version upgrade but still with the Heat driver. The move to the new drivers is the actual thing that might make things faster.
[16:19:47] <andrewbogott>	 Yeah, the driver change is the major thing. lbaas was a requirement for that.
[16:19:57] <dhinus>	 is the new driver already running on codfw?
[16:21:00] <andrewbogott>	 no
[16:21:25] <andrewbogott>	 hashar: here is an example of using opentofu to stand up a k8s cluster with magnum: https://github.com/toolforge/quarry/tree/main/tofu
[16:21:36] <andrewbogott>	 but the new version of the heat driver + associated bits seems to work slightly better
[16:21:43] <dhinus>	 ack
[16:23:37] <hashar>	 doesn't openstack have a way to templatize a project?
[16:25:32] <andrewbogott>	 that thing I linked you to is tofu creating and managing a k8s template, and then telling openstack to build it.
[16:25:46] <andrewbogott>	 So, kind of -- but it's a lot easier to do with tofu than with page-long curl statements.
[16:26:19] <andrewbogott>	 Sorry, I may not be understanding the question :)
[16:29:45] <hashar>	 you kind of answered
[16:30:09] <hashar>	 I guess I would expect OpenStack to have a project that fills more or less the same role as OpenTofu
[16:31:00] <hashar>	 so we'd need opentofu to spin up instances that Magnum can then use to setup a k8s cluster on top if those instances?
[16:31:42] <hashar>	 (I feel I loose a level of abstraction every ten years, or technology adds an extra layer or two every decade and I can't keep up)
[16:36:49] <andrewbogott>	 Magnum creates the VMs.
[16:37:00] <andrewbogott>	 So you make a template describing the k8s cluster you want
[16:37:11] <andrewbogott>	 and a little tofu resource definition saying "make me a cluster like this template"
[16:37:16] <andrewbogott>	 and then tofu asks magnum to make it.
[16:37:52] <andrewbogott>	 Or you can just feed the template directly to magnum, but that turns out to be clumsier.
[16:38:30] <andrewbogott>	 I think looking at that github link will answer more questions than I can :)
[16:56:25] <hashar>	 andrewbogott: yeah i/we will thank you!
[20:34:08] <andrewbogott>	 There are a few more cloudvirts to upgrade but things seem to be working fine. I'm going to step away for a bit while the cookbook finishes up the remaining cloudvirts. Please ping me if something breaks!