[00:54:32] I'm pretty sure that magnum is working properly again after https://gerrit.wikimedia.org/r/c/operations/puppet/+/1134388 (and possibly the patch that came before it).
[09:58:31] arturo: let me know if you are around and if I can proceed with removing the cloudsw static routes
[09:58:49] topranks: I'm around
[09:58:59] but also, I'm about to jump into a meeting
[09:59:01] I already removed those for the cloudsw loopback IPs (inconsequential to the overall network, just used for monitoring) and I can see things look ok, so I'm confident everything will be fine
[09:59:02] ok
[09:59:03] no rush
[09:59:14] ping me when you are available and we can give it a try
[09:59:19] if you are fine with me not paying attention, then you are good to go I guess!
[10:00:02] I'd probably feel more comfortable if you were able to keep an eye on alerts etc while I made the change
[10:00:39] though tbh those ping stats are very good for detecting any issues
[10:49:02] topranks: I'm available now
[10:49:17] ok cool, give me a few mins to get myself lined up
[10:53:51] arturo: ok I'm gonna kick off
[10:54:03] ok!
[10:55:04] ok default routes removed on the e4/f4 side
[10:55:11] looks ok, doing some checks and keeping an eye on graphs
[10:55:15] ok
[11:00:49] arturo: topranks: do you have an estimate of how long this will take? I have a toolforge maintenance planned (k8s upgrade), but I'd wait for you to complete what you're doing first
[11:01:17] dhinus: about 10 mins?
[11:01:31] topranks: thanks sounds good
[11:01:44] ok, thanks <3
[11:01:59] in that spirit I'm gonna move ahead and remove the statics from cloudsw1-c8 nw
[11:02:00] *now
[11:02:06] ok
[11:05:08] done
[11:07:19] gonna remove them from d5 now
[11:10:28] ok done there also
[11:12:15] everything looks ok that I can see
[11:12:42] arturo: if you are happy to give the all-clear we can probably let dhinus continue?
[11:12:52] I see all green at the moment
[11:13:21] ok thanks! I'll wait a couple minutes just to be safe :)
[11:13:23] \o/
[11:13:25] cool
[11:15:11] dhinus: opinions? https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/170
[11:17:35] arturo: LGTM, maybe we could define it in some kind of shared module to be reused for both codfw and eqiad?
[11:18:04] I have approved it anyhow, feel free to merge
[11:18:22] I think this is a good example of something that is probably identical in both deployments, so we could keep it more DRY
[11:18:47] we could soft link the files
[11:19:15] I'm starting to think more in "modules", I think that's where tofu really shines
[11:19:25] so far we're just using it to automate lists of things, which is good
[11:19:39] but the next step is using it to abstract
[11:20:36] smth like a "shared/network_tests" module
[11:22:01] I don't think the problem is biting us today that much. Maybe we can figure out something later when we really see the need
[11:22:21] I say this because we just invested quite a significant amount of time in the refactor
[11:22:22] yeah agreed, not an "issue" in any way
[11:23:19] ok
[11:23:20] the refactor was necessary in any case, and I think it enables more complex things in the future
[11:23:53] maybe we can experiment more in the toolforge-tofu project
[11:23:59] where we are starting from scratch
[11:28:16] sure
[11:37:16] I'm starting the toolforge k8s upgrade T390214
[11:37:17] T390214: Upgrade "tools" cluster to k8s 1.29.15 - https://phabricator.wikimedia.org/T390214
[11:38:00] I'm screensharing in https://meet.google.com/jjz-wjzn-jqi if anyone wants to follow along
[11:41:37] 👍
[11:43:53] !status upgrading toolforge k8s T390214
[11:43:54] No changes to apply or no status section in the topic
[11:43:54] T390214: Upgrade "tools" cluster to k8s 1.29.15 - https://phabricator.wikimedia.org/T390214
[12:00:34] arturo: I think I see the issue, I'm going to re-activate the IPv6 peering now, there is a keyword I missed which took both sessions down
[12:00:42] ok
[12:01:21] I'll only do cr1 for now and double-check everything is correct with v4 also before re-activating cr2
[12:03:31] I think it looks ok, I'll wait a few but also let me know if you spot anything
[12:03:37] https://www.irccloud.com/pastebin/VD36dbxb/
[12:08:38] so far so good
[12:10:57] thanks for confirming
[12:12:05] this alert is now showing up
[12:12:08] https://usercontent.irccloud-cdn.com/file/AiD7jLUN/image.png
[12:12:10] (just noticed)
[12:14:07] that's ok, that's from when I rolled back / disabled the peering when you pinged
[12:14:22] I brought it back up in the past minute again now
[12:14:26] thanks
[12:15:03] ok
[12:39:37] control nodes are upgraded and working fine
[12:39:44] now upgrading worker nodes, it will take a while
[12:44:21] I divided the list of worker nodes into 4 chunks, and I'm running each chunk serially, so upgrading a max of 4 nodes in parallel
[12:44:40] functional tests are running fine so far, only got a short failure while upgrading the control nodes
[12:51:29] another short failure now (1 test failed, then the following ones were ok)
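A rough sketch of the rolling pattern described above (four chunks, nodes within a chunk upgraded one at a time, the four chunks running in parallel), assuming hypothetical node names and a placeholder per-node step rather than the actual upgrade tooling:

    from concurrent.futures import ThreadPoolExecutor

    def upgrade_node(node: str) -> None:
        """Placeholder for the real per-node step (drain, upgrade kubelet, reboot, uncordon)."""
        print(f"upgrading {node}")

    def upgrade_chunk(chunk: list[str]) -> None:
        # Within a chunk, nodes are handled strictly one after another.
        for node in chunk:
            upgrade_node(node)

    workers = [f"tools-k8s-worker-{i}" for i in range(1, 41)]  # hypothetical names
    chunks = [workers[i::4] for i in range(4)]  # 4 roughly equal chunks

    # The four chunks run concurrently, so at most 4 nodes are upgraded at once.
    with ThreadPoolExecutor(max_workers=4) as pool:
        for future in [pool.submit(upgrade_chunk, chunk) for chunk in chunks]:
            future.result()  # re-raise any failure from a chunk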
[12:53:42] arturo: ok I am done with all changes to support the v6 routing in eqiad
[12:53:54] topranks: 🎉!
[12:53:54] (unless I forgot something ofc)
[12:54:29] but it looks good I think, and the important thing is I'll be taking my hands off the config so hopefully no more "surprises" in the near future
[12:54:53] topranks: i think cloudsw1-b1-codfw is missing a cloud-private v6 address still?
[12:55:13] taavi: ah yes, sorry I was just checking your task on that side of it
[12:55:30] I need to step aside for a bit, be back later
[12:55:38] in eqiad what we are missing is the anycast BGP stuff for VIPs announced from cloudservice etc
[12:56:06] we can add it any time though, rest of the infra is ready
[12:57:23] yea that's T379282, which i'd like to test in codfw first; that needs that switch address allocated, as well as ipv6 equivalents for the v4 ranges used for those
[12:57:25] T379282: IPv6 for cloud-realm services - https://phabricator.wikimedia.org/T379282
[13:16:05] taavi: ok np, I updated the task there
[13:17:04] first step I think is to get the hosts configured with IPv6 addresses on cloud-private-b1-codfw (vlan2151 ints)
[13:17:20] then establish the BGP to the switch and announce whatever service IPs are required
[13:20:16] topranks: thanks! I'll have a look at the puppetization in a bit to see how far I got the last time I was looking at this
[13:20:37] ok no.
[13:20:47] sry, no problem
[13:21:16] for the host IP allocations and dns ping me if you’ve a plan for it (we could copy last octet from v4 for instance)
[13:21:46] I’ve a bunch of netbox scripts I can use to add them there if we have a plan
[13:30:34] topranks: my plan was to re-use the last v4 octet, but let's hold off assigning those for everything for now
[13:32:41] Cool yeah I think that makes sense
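For the addressing plan above (re-using the last IPv4 octet), one possible reading is sketched below with documentation-range addresses only; the actual prefixes, and whether the octet is kept as literal digits (.14 -> ::14) or as a plain integer (.14 -> ::e), are still to be decided:

    import ipaddress

    def v6_from_v4(v4_addr: str, v6_prefix: str) -> ipaddress.IPv6Address:
        # Copy the decimal last octet of the host's IPv4 address verbatim into
        # the IPv6 interface ID.
        last_octet = v4_addr.rsplit(".", 1)[1]
        return ipaddress.IPv6Address(f"{v6_prefix}{last_octet}")

    print(v6_from_v4("192.0.2.14", "2001:db8:2151::"))  # 2001:db8:2151::14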
[13:34:43] review for this please? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1134694
[13:47:30] topranks: fwiw, cloudlb2002/2003-dev are my testbed for this, i've manually allocated addresses for them a while back even
[14:04:13] topranks: hmm, I wonder if we need to do something to prevent hosts that don't have dual-stack addresses yet from using their prod v6 link for reaching cloud-private addresses?
[14:04:52] taavi: in a meeting but yes we might need to think about that
[14:13:33] dhinus: we cannot create VMs using the tofu-infra repo, as we cannot specify the project; they go to the project used for auth, so "admin"
[14:13:36] :-(
[14:14:22] do you mean the API/provider does not support specifying the project?
[14:14:30] do you need separate project-level creds?
[14:14:36] yeah
[14:15:28] I don't know who to blame for this, if the opentofu provider or the openstack API
[14:15:52] all other openstack resources seem to allow specifying the project, so I'll blame this on the openstack API
[14:16:48] which project do you want to use? a dedicated one?
[14:17:10] for example, the patch we merged earlier to create a few VMs in the testlabs project
[14:17:16] they end up in the admin project
[14:17:21] I see
[14:17:31] do we need them to be in testlabs?
[14:17:42] we don't usually create VMs in the admin project
[14:17:49] ack
[14:18:48] not ideal, but maybe we could create a "testlabs" folder with its own tfstate and separate plan/apply?
[14:18:58] or a separate repo
[14:19:19] it is true that I wasn't sure VMs were a fitting resource in the tofu-infra repo
[14:19:34] yeah
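On the VM/project limitation above: Nova places a new server in whatever project the auth token is scoped to, so the usual workaround is to scope the credentials to the target project (a dedicated clouds.yaml entry or OS_PROJECT_* variables) rather than setting a project on the resource. A minimal openstacksdk sketch, where the cloud entry name and the image/flavor/network names are purely illustrative:

    import openstack

    # "eqiad1-testlabs" is a hypothetical clouds.yaml entry whose auth section
    # is scoped to the testlabs project.
    conn = openstack.connect(cloud="eqiad1-testlabs")

    server = conn.create_server(
        name="tofu-provision-test",
        image="debian-12.0-bookworm",         # illustrative image name
        flavor="g4.cores1.ram2.disk20",       # illustrative flavor name
        network="lan-flat-cloudinstances2b",  # illustrative network name
        wait=True,
    )
    print(server.status)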
[15:39:58] taavi: regarding the cloud-private ranges I guess we need to decide how we're going to move forward
[15:40:16] part of me thinks it might be best to add the interface IPs and static routes to everything in advance
[15:40:21] but don't add any DNS entries
[16:15:46] andrewbogott: A community member has attempted a patch for T364605 -- https://gerrit.wikimedia.org/r/c/labs/striker/+/1134724. This should probably block on Simon's https://gerrit.wikimedia.org/r/c/labs/striker/+/1035718
[16:15:46] T364605: Move Striker to Bitu username validation API - https://phabricator.wikimedia.org/T364605
[16:17:09] I think striker logins are broken today, so that's probably a high priority. slyngs do you know anything about the status there? I'm trying to sort out a different bug so don't want to get too off track
[16:17:43] Taavi figured it out, https://phabricator.wikimedia.org/T391237
[16:18:32] It's working now after the database has been fixed.
[16:19:56] ah, great
[16:20:04] ok, so those patches are no longer urgent but still of interest!
[16:22:07] Yes, the breakage and patches can be seen as separate "issues" :-)
[16:22:54] andrewbogott: this is basically a patch to fix the "all SUL names are blocked from being tool names" issue from T380384
[16:22:54] T380384: [toolsadmin] Striker cannot create Developer accounts or tools with names matching existing SUL accounts - https://phabricator.wikimedia.org/T380384
[16:23:34] or at least the start of being able to fix that
[21:13:22] dhinus: magnum seems to work, at least some of the time, in codfw1dev. It's still misbehaving in eqiad1 for reasons that I still haven't sorted.
[23:46:12] maintain-dbusers hung on cloudcontrol1007 again -- https://sal.toolforge.org/log/zryjEpYBffdvpiTreOC2 -- when it restarted there was a lot of work to do.
[23:46:54] I'm not sure what happened to make that service flaky, but it seems to need an active monitor of some sort at this point.
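One possible shape for that "active monitor", purely as a sketch and not how maintain-dbusers is set up today: run the service with a systemd watchdog (WatchdogSec= on the unit) and ping it from the main loop, so a hang results in a restart rather than a silent stall:

    import time
    from systemd import daemon  # python3-systemd

    def do_one_pass() -> None:
        """Placeholder for the real per-iteration work (syncing accounts etc.)."""
        time.sleep(1)

    def main() -> None:
        while True:
            do_one_pass()
            # With WatchdogSec= set, systemd restarts the service if this ping
            # stops arriving, instead of leaving a hung process running.
            daemon.notify("WATCHDOG=1")

    if __name__ == "__main__":
        main()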