[09:27:34] topranks: hi there! I have allocated this IPv6 prefix for openstack @ eqiad1: https://netbox.wikimedia.org/ipam/prefixes/1102/ please double-check it is correct
[10:28:11] fix for the issue we saw the other day during the toolforge deploy demo https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1092784 (related to forks)
[10:59:41] arturo: that looks correct to me yep, fire away
[10:59:50] topranks: thanks!
[12:20:45] * dcaro lunch
[13:58:25] andrewbogott: when you are awake, in T380208 I'm running out of ideas on how to make rabbitmq work
[13:58:26] T380208: openstack: codfw1dev: rabbitmq is crashing - https://phabricator.wikimedia.org/T380208
[14:01:15] * arturo food time
[14:42:05] arturo: unfortunately the issue that I saw yesterday was not rabbit related. So we may have two problems -- I'll look at rabbit soon though.
[15:05:29] I'm booking a time slot for the toolsdb upgrade next monday, I'm undecided between 10 UTC and 13 UTC, any preference?
[15:05:48] it will hopefully be very uneventful, I can even record a screenshare if anybody is interested
[15:07:06] andrewbogott: thanks
[15:07:39] dhinus: no preference on my side, if you do it at 13 UTC, we might hang in the collab for it
[15:23:54] draft email: https://etherpad.wikimedia.org/p/toolsdb-10.6
[15:24:48] dhinus: LGTM
[15:50:11] I just saw a 'possible kernel error on cloudvirt1062' flash by on alertmanager. Is that something familiar?
[15:50:23] (I reimaged it yesterday, so it might just be a side-effect of that somehow)
[15:50:31] yep, that's expected on reboots
[15:50:40] but worth double-checking what the error was
[15:51:52] probably just T379351
[15:51:52] T379351: kernel message: SGX disabled by BIOS - https://phabricator.wikimedia.org/T379351
[15:52:39] yeah T380249
[15:52:39] T380249: Kernel error Server cloudvirt1062 may have kernel errors - https://phabricator.wikimedia.org/T380249
[15:56:04] great, thanks
[15:56:28] dcaro, arturo: I just saw an email about last call for KubeCon talks and that reminded me of the CNCF CTO asking to hear more about the grid -> k8s migration. Y'all have a week to submit a proposal. :)
[15:56:54] bd808: I have submitted a proposal already
[15:57:10] for kubecon eu 2025 london
[15:57:17] I owe cncf an email too -- would anyone (who isn't me) be willing to be the new point of contact for them?
[15:57:32] excellent arturo. that's the one I saw the last call email for
[16:13:18] 👍
[16:14:51] arturo: what's your talk about? (to avoid overlapping)
[16:15:20] andrewbogott: I can, I'm still part of a couple groups (end users, harbor and buildpacks)
[16:15:21] dcaro: https://docs.google.com/document/d/1ujsFGWcjX-Y-_hQjVcrNKxkvsy2eNPpPi4GXKfqTUH0/edit?usp=sharing
[16:15:44] dcaro: great, thank you!
[16:19:25] arturo: I did a talk about the grid migration for the staff meeting, you might be able to reuse the slides and such
[16:20:04] ok!
[16:21:33] (actually the P&T quarterly it seems, looking)
[16:22:32] I'll let you know if/when the talk is accepted
[16:22:50] https://docs.google.com/presentation/d/1Anj1y-LIiz-RD-XMbBCDMTRDw2TdV2lUl3ciXyQSxig/edit#slide=id.g2c8ce9cbdad_2_31 <- there, it was relatively short, but might give some nice quotes/numbers and such
[16:25:47] arturo: speaker notes for those slides https://etherpad.wikimedia.org/p/eiwGl_RxtBIFSyuxv5Yr
[16:35:43] ack
[16:53:30] I should mention that I helped arturo draft the talk proposal, and he kindly offered to add me as a "co-speaker" in the submission, so if it gets accepted I will be on stage with arturo
[16:53:38] they will notify us in january whether it's accepted or not
[16:57:56] andrewbogott: I will go offline now, please update the phab ticket if you work on rabbitmq so I can pick it up tomorrow, thanks!
[16:58:13] ok! I think rabbit is working again but I'm still digging a bit
[16:58:51] what did you do?
[17:00:10] Reset everything, and did a 'cluster_forget_node' on the troubled node.
[17:00:21] I suspect that's what you did already, so I'm waiting to see if it immediately breaks again
[17:00:52] yeah, I did that for the other 2 nodes; rabbit01 I left alone since I decided it was the source of truth for the cluster
[17:01:27] thanks for working on that, I truly was out of ideas earlier today
[17:01:29] * arturo offline
[17:01:31] when I reset 03, rabbit01 still thought it was connected to it
[17:01:36] so that's probably the source of the split-brain
[17:31:17] * bd808 pokes Chris Aniszczyk on linkedin about arturo and dhinus' talk proposal
[17:33:57] was there some kind of purge of old projects in codfw1dev? I see a lot of orphaned VMs attached to deleted projects.
[17:34:11] (80% probability that I did it right before going on break and then forgot)
[17:51:55] andrewbogott: yep, let me find it
[17:53:31] https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/commit/ca8b21897b2e084197b9ea85dbaf17f109661fa8
[17:54:16] ok... this is probably a tofu issue then
[17:54:19] this was merged without checking whether there were resources attached, so it left a few orphan vms
[17:54:35] more than a few :(
[17:54:50] I would say it's an openstack issue :P because the os api lets you delete a project with resources in it
[17:54:52] Can our opentofu setup be changed so it checks for things before deleting projects?
[17:55:01] we discussed how to check that in tofu
[17:55:15] but there's no "easy" way, I think I created a task
[17:55:17] It's kind of an openstack issue, but different projects don't really know about each other.
[17:55:27] Anyway, I will do some cleanup now that I know this was on purpose
[17:56:37] yep, it's the "modular" nature of openstack... I still find it very annoying that a single API "delete" call can create so much inconsistency
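(Editor's note: a minimal sketch of the reset-and-forget sequence described at 17:00 above, in cookbook-style Python. This is not the actual wmcs-cookbook; the hostnames are hypothetical, and note the real rabbitmqctl subcommand is forget_cluster_node.)

```python
#!/usr/bin/env python3
"""Sketch: rejoin a troubled rabbitmq node and drop its stale membership.

Assumes rabbitmqctl is on PATH and this runs on the node being rebuilt.
Node names below are hypothetical placeholders.
"""
import subprocess

PRIMARY = "rabbit@cloudrabbit01"  # node kept as the source of truth


def rabbitmqctl(*args: str) -> None:
    """Run a rabbitmqctl subcommand, failing loudly on error."""
    subprocess.run(["rabbitmqctl", *args], check=True)


# Stop the broker app (the Erlang VM stays up), wipe local state,
# rejoin the cluster, and start the app again.
rabbitmqctl("stop_app")
rabbitmqctl("reset")
rabbitmqctl("join_cluster", PRIMARY)
rabbitmqctl("start_app")

# Separately, on the surviving node, forget the stale member so it can't
# reappear and cause a split-brain:
#   rabbitmqctl forget_cluster_node rabbit@cloudrabbit03
```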
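(Editor's note: for the cleanup andrewbogott mentions, a minimal orphan-VM finder might look like the sketch below, using openstacksdk: list VMs across all projects and flag any whose project no longer exists in keystone. The clouds.yaml entry name is an assumption.)

```python
#!/usr/bin/env python3
"""Sketch: find VMs attached to deleted projects (T380303-style orphans).

Assumes a clouds.yaml entry named "codfw1dev" with admin credentials.
"""
import openstack

conn = openstack.connect(cloud="codfw1dev")

# All project IDs keystone still knows about.
live_projects = {p.id for p in conn.identity.projects()}

# Any server whose owning project is gone is an orphan.
orphans = [
    s for s in conn.compute.servers(all_projects=True)
    if s.project_id not in live_projects
]

for s in orphans:
    print(f"orphan VM {s.name} ({s.id}) in deleted project {s.project_id}")
```

Run periodically, a non-empty result here could also feed the "leak detector" alerts discussed below instead of just printing.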
[17:56:53] yeah it's not great
[17:57:23] There was a discussion about fixing this 10 years ago and the # of things that would have to be orchestrated got so huge that the topic was largely dropped iirc
[17:57:34] (in the keystone discussions, I mean)
[18:00:41] also some of these projects were doing things :(
[18:01:12] ok, I didn't create a task, but there was a discussion in this channel a couple weeks ago https://wm-bot.wmcloud.org/logs/%23wikimedia-cloud-admin/20241106.txt
[18:01:21] and I left a comment in https://github.com/terraform-provider-openstack/terraform-provider-openstack/issues/1774
[18:02:41] T380303
[18:02:42] T380303: Openstack: many orphaned (or seemingly orphaned) VMs in codfw1dev - https://phabricator.wikimedia.org/T380303
[18:05:41] another related discussion: https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/118
[18:06:32] maybe having more "leak detectors" is a possible solution? so if this happens again, we at least get some alerts firing
[18:06:33] it feels to me like something that could be done in a cookbook (at least check that there will be leaks, and list them for you to handle later)
[18:06:47] as a preventive measure, I mean
[18:07:23] the issue is I don't like checking for project leaks on every tofu change, because that means slowing down 99% of MRs
[18:07:37] there could be other solutions, one could be to split the project list into a separate tofu repo
[18:07:57] I'm tempted to say that tofu should never delete anything, that should be left to cookbooks
[18:08:11] but maybe we can't have tofu creation w/out tofu deletion?
[18:08:36] we can maybe introduce some checks, only for "project" resources
[18:09:29] I think it could be a CI or cookbook check that verifies whether the tofu plan is trying to delete a project
[18:09:36] if it is, it refuses to continue
[18:09:53] hm
[18:09:56] we should discuss this in a phab task, andrewbogott can you create one?
[18:10:17] I guess it depends what you want the entry point to be, whether you want to start with a cookbook or a tofu patch
[18:10:20] "prevent tofu from creating orphan resources" or something like that
[18:10:21] having tofu delete any resources without manual confirmation troubles me. I don't think 'forgetting to add it to tofu' should be a death sentence.
[18:10:31] (that iirc is still an unresolved question)
[18:10:51] We have public APIs, users can create things without adding them to the tofu repo.
[18:11:05] So having tofu actively maintain state in parallel with that is an accident waiting to happen
[18:11:18] dcaro: yes, that's an open question on what the entrypoint is
[18:12:08] if the entrypoint is meant to be tofu, then the checks (manual or not) have to happen after the patch is created; if the entrypoint is the cookbook, then they can live in the cookbook (so we are not forced to implement them inside tofu itself)
[18:12:51] dhinus: I'm really starting in the middle of this, is there a parent task you want me to attach to or something?
[18:13:07] nope, I don't think we have one
[18:14:37] there are multiple parallel discussions I think, on how we want to use tofu, which things it should track, how we can minimize "accidents", etc.
[18:15:10] I'm not sure where the best place is to have those discussions :)
[18:15:59] maybe the offsite is a good one?
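(Editor's note: one way the "refuses to continue" check from 18:09 could look, as a sketch. It assumes the plan was saved with `tofu plan -out=plan.bin` and exported with `tofu show -json plan.bin > plan.json`; the file names and the gate's placement in CI vs. a cookbook are assumptions, though openstack_identity_project_v3 is the real project resource type in the openstack provider.)

```python
#!/usr/bin/env python3
"""Sketch: refuse to apply a tofu plan that would delete a project.

Assumes the machine-readable plan was exported to plan.json beforehand:
    tofu plan -out=plan.bin && tofu show -json plan.bin > plan.json
"""
import json
import sys

# Resource types we never want tofu to delete on its own.
PROTECTED_TYPES = {"openstack_identity_project_v3"}

with open("plan.json") as f:
    plan = json.load(f)

# Each entry in resource_changes carries the planned actions for one resource.
deletions = [
    rc["address"]
    for rc in plan.get("resource_changes", [])
    if rc["type"] in PROTECTED_TYPES and "delete" in rc["change"]["actions"]
]

if deletions:
    print("refusing to continue, this plan deletes project(s):")
    for addr in deletions:
        print(f"  {addr}")
    sys.exit(1)  # non-zero exit fails the CI job / cookbook step
```

Because it only inspects the plan for project resources, this kind of gate would not slow down the 99% of MRs that never touch a project.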
[18:16:39] +1 for offsite
[18:17:20] I think we only have 4 hours to discuss all team topics at the offsite :D
[18:17:28] yep it will be hard :D
[18:18:33] for starters... T380310
[18:18:33] T380310: opentofu shouldn't delete openstack resources - https://phabricator.wikimedia.org/T380310
[18:18:42] andrewbogott: thanks!
[18:18:55] I have to log off for today, I will comment in that task tomorrow
[18:47:25] * dcaro off