[00:21:40] * bd808 off
[09:44:18] I have made a few improvements to https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Openstack_upgrade and I'm now ready to proceed with the upgrade
[09:45:38] I think it makes sense to set the maintenance mode for Horizon, as recommended by the wiki
[09:46:12] taavi: do you agree, and would you also send an email to cloud-announce or elsewhere?
[09:47:37] dhinus: yeah, a short notice that horizon will be unavailable and the apis will be unstable would not hurt
[10:40:13] million dollar question: is it worth disabling puppet on all/some cloud* hosts before merging this? https://gerrit.wikimedia.org/r/c/operations/puppet/+/978636
[10:40:46] I'm tempted to say "no" as most config files will only be read when a service is restarted, but that is a big guess
[10:43:00] Puppet might actually restart services because it finds a config change
[10:43:24] dhinus: I would disable
[10:43:53] on everything (cloudcontrols, nets, virts)?
[10:44:01] I guess I can use cumin
[10:44:11] yep
[10:44:25] sounds good, I'll also add the Cumin command to the wiki
[10:46:16] this seems to return what I need: sudo cumin 'P{C:openstack::serverpackages::zed::bookworm}'
[10:47:10] I could add the cluster to make it reusable for future upgrades on codfw
[11:14:12] puppet is disabled on all hosts to be upgraded, and I have merged the patch
[11:28:20] now upgrading cloudcontrol1007
[11:45:36] cloudcontrol1007 is done, now I've started cloudcontrol1006
[11:48:02] cloudcontrol1007 has an alert "HAProxyBackendUnavailable"
[11:53:45] that seems to have cleared by itself?
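[Editor's note: the cumin query quoted at [10:46:16] selects every host carrying the zed/bookworm server packages, and the plan was to disable Puppet on all of them before merging. A minimal sketch of that step follows; the reason string and the `disable-puppet` wrapper invocation are assumptions, and the command is echoed rather than executed since cumin only exists on cluster management hosts.]

```shell
# Hypothetical sketch: disable Puppet on every host matched by the
# Puppet-class query from the log, before merging the config change.
# Echoed instead of run, since cumin is only available on WMF cluster hosts.
QUERY='P{C:openstack::serverpackages::zed::bookworm}'
REASON='openstack upgrade in progress'   # assumed reason string
echo sudo cumin "$QUERY" "disable-puppet '$REASON'"
```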
[11:54:31] yep
[12:01:59] cloudcontrol1006 upgrade completed
[12:03:50] starting the upgrade of cloudcontrol1005
[12:21:21] cloudcontrol1005 completed
[12:23:27] I'll take a break for lunch, then continue with the other nodes
[13:40:30] I'm back, starting the upgrade on cloudnets, but first checking the network tests
[13:41:56] network tests are passing
[13:42:26] I forgot to set_maintenance on horizon before starting with the cloudcontrols :/ doing it now
[13:48:23] hmm the set_maintenance script doesn't seem to work
[13:48:38] "Hosts cloudweb[1003-1004].wikimedia.org now in maintenance mode."
[13:48:45] maybe horizon has moved to different hosts?
[13:49:24] the cookbook derives the hostnames dynamically from Cumin
[13:50:12] no, it is running on cloudwebs
[13:50:35] the most disruptive part (cloudcontrols) was already upgraded, so setting things to maintenance now seems unnecessary
[13:52:53] agreed
[13:53:09] we still need to figure out why the cookbook is not working, but we can do it later
[13:53:28] i think it's not been adapted to the containerized horizon deploy
[13:53:36] ah yes that must be it!
[13:54:48] upgrading cloudnet1005, which is the standby host
[14:07:49] cloudnet1005 is upgraded, some alerts have not cleared yet but I think they will soon
[14:08:28] * andrewbogott here, reading backscroll
[14:09:27] Yeah, could be that maintenance mode doesn't work post-dockerization
[14:11:36] as soon as the cloudnet1005 alerts clear, I will upgrade cloudnet1006, which will cause a Neutron failover and, according to the wiki, a brief network outage
[14:12:27] yep!
short enough that it shouldn't time out connections
[14:14:06] I think the alerts will take a few more minutes to clear because of how they're designed (using min_over_time)
[14:14:14] the systemd units are up and running so I think it's safe to proceed
[14:16:34] started the upgrade on cloudnet1006
[14:17:29] andrewbogott: the wiki has no information on cloudrabbits, I assume they can be upgraded with the same cookbook?
[14:18:21] good question! I don't think of them as needing to be in sync with openstack services since rabbitmq isn't an openstack project
[14:18:28] but if there are new package versions in the bpo...
[14:18:42] let me check
[14:40:17] maybe it was in bullseye?
[14:40:26] and we now fetch it from the official debian repo?
[14:40:48] that's definitely what /we/ do :) It might be that zigo doesn't package it in the bpo because it's the same
[14:40:59] let's see if I can find him to ask
[14:42:33] I'm still confused by the 3 (leftover?) packages installed with bpo11: https://phabricator.wikimedia.org/P54014
[14:43:27] hm. harmless but weird
[14:43:55] yep, it would be nice to remove them, so we make it explicit that cloudrabbits don't need the bpo repo
[14:44:41] I find it useful to check the presence of the bpo repo on a host as an indication of what needs to be upgraded for a new openstack version
[14:45:20] despite what apt said, when I did 'apt install python3-cinderclient' it installed the bpo12 version
[14:45:49] I suspect that sometimes we'll want the bpo present
[14:45:54] but let's see if zigo has a clear answer
[14:46:44] apt-get upgrade --dry-run shows a lot of things that it would like to upgrade
[14:47:25] nothing I care about especially. I wonder if the openstack client packages are installed there by accident, or as a side-effect of me trying to get the right apt repo present
[14:47:37] cloudvirts doing OK?
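[Editor's note: the "3 leftover bpo packages" check discussed above (P54014 is not reproduced here) amounts to filtering the installed-package list for backports version suffixes. A hedged sketch, with a stubbed sample standing in for real `dpkg-query -W -f '${Package} ${Version}\n'` output — the package names and versions below are illustrative, not the actual paste contents.]

```shell
# Hypothetical sketch: list packages still pinned to a backports (~bpo)
# version on a host. The sample is a stub; in real use, pipe the output of
#   dpkg-query -W -f '${Package} ${Version}\n'
# into the awk filter instead.
sample='python3-cinderclient 1:9.4.0-1~bpo11+1
rabbitmq-server 3.10.8-1.1
python3-novaclient 2:18.3.0-1~bpo11+1'
printf '%s\n' "$sample" | awk '/~bpo/ {print $1}'
```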
[14:47:46] I've done two with only a canary
[14:52:43] I think the "bpo11" instead of "bpo12" is just a glitch in the naming: for some reason the bpo packages in "bookworm-zed-backports" retained the "11", but should really have "12" in the name
[14:54:03] so the 3 bpo packages in cloudrabbits are indeed coming from the correct "bookworm-zed-backports", and can be updated to "bookworm-antelope-backports"
[14:54:35] andrewbogott: do you mind doing the apt upgrades on all cloudrabbit nodes, until we find out if we can remove those packages?
[14:54:52] you think I should do a full 'apt-get upgrade'?
[14:55:01] seems harmless since it doesn't affect rabbit packages
[14:56:07] makes sense
[14:56:59] I'm now upgrading cloudvirt106[0-5] with a for loop that will do those in sequence
[14:58:12] dhinus: I'm starting a meeting but ping me if anything bad happens :)
[14:58:53] ok!
[15:32:59] cloudvirt106[0-7] have been upgraded. now doing cloudvirt105[0-9]
[15:42:23] I need to go get something done during normal people hours, talk to you tomorrow
[16:06:02] dhinus: Zigo says "I stopped doing this, as it's not really needed, and it's annoying to upgrade rabbitmq or OVS between Debian releases."
[16:06:28] Which seems sensible to me! I'll see if we can remove that whole bit from those servers.
[16:10:19] great, thanks!
[17:08:45] quick link to see the status of upgraded vs to-upgrade cloudvirts: https://debmonitor.wikimedia.org/packages/nova-compute
[17:08:50] 31 done, 17 to go
[17:09:16] I've added some quick checks I discussed with andrewbogott to https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Openstack_upgrade#Upgrading_cloudvirt_nodes
[17:10:26] I get a 500 from that debmonitor link (well, and apparently all debmonitor links)
[17:15:51] hmm seems to work for me
[17:17:59] that's odd
[17:19:15] Rook: is quarry sufficiently delegated to volunteers that we should stop seeing/caring about alerts?
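[Editor's note: the "for loop that will do those in sequence" ([14:56:59]) is not shown in the log. A minimal sketch of the shape it likely had, with `echo` standing in for the actual per-host upgrade command, whose name the log does not give.]

```shell
# Hypothetical sketch of upgrading cloudvirt1060-1065 one host at a time.
# `echo` is a stand-in; the real upgrade command/cookbook isn't named in
# the log, so substitute it inside the loop body.
for n in $(seq 1060 1065); do
  echo "upgrading cloudvirt${n}"
  # <run the actual upgrade command for cloudvirt${n} here, sequentially>
done
```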
[17:38:08] andrewbogott: I've upgraded cloudvirtlocal*, and I'm doing cloudvirt-wdqs* -- I'll let you handle the remaining 9 regular cloudvirts (1031 to 1039)
[17:38:46] if debmonitor is not working you can also check "sudo cumin 'P{C:openstack::serverpackages::zed::bookworm}'"
[17:39:21] ok!
[17:41:21] shall we send an update to cloud-announce saying that the upgrade is completed?
[17:42:05] yep!
[17:42:27] do you want to send it after you complete all the cloudvirts?
[17:43:20] you can go ahead and send it now, the cloudvirt upgrades won't affect users at all
[17:44:57] ok!
[17:48:18] If anybody has some time to help debug email sadness, JSherman is asking about T347512 in -cloud
[17:48:19] T347512: Some emails from The Wikipedia Library aren't being received as expected - https://phabricator.wikimedia.org/T347512
[17:49:20] andrewbogott: sent
[17:49:33] great!
[17:49:45] bd808: I might have some time tomorrow if nobody finds the answer before that
[17:53:50] cloudvirt-wdqs* are all upgraded, only cloudvirt[1035-1039] are left
[17:54:06] I'm logging off shortly
[17:56:04] andrewbogott: yes*
[17:56:04] *not really sure what the * is, but it feels like there is one
[17:56:35] ok! I'll continue to ignore for now and will maybe see about getting them off the board entirely
[17:56:47] Seems reasonable
[18:19:28] * dhinus off
[18:41:41] * bd808 lunch
[18:55:25] Rook: andrewbogott: I updated the config, quarry alerts should no longer be sent to us
[18:55:34] thanks!
[18:57:10] 👍
[20:28:29] taavi: the app creds for eqiad1 on tf-bastion.tf-infra-test.eqiad1.wikimedia.cloud have stopped working. Would you guess that they simply expired, or that the upgrade this morning somehow broke all app credentials everywhere?
[20:28:37] (I'm making new ones for that one use-case in the meantime...)
[20:29:00] andrewbogott: I have no idea. I'm not aware of any application credential expiry mechanism
[20:29:08] hrm ok.
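[Editor's note: Keystone application credentials can optionally be created with an expiration, which would explain credentials "just stopping working" as discussed above. A hedged sketch with the openstack CLI's `--expiration` option; the credential name and date are invented for illustration, and the commands are echoed since they need real cloud credentials to run.]

```shell
# Hypothetical sketch: create an application credential with an expiry, then
# inspect it (the `expires_at` field shows when it stops working).
# Echoed rather than executed: requires sourced Keystone credentials.
echo openstack application credential create tf-bastion-creds \
  --expiration '2024-12-31T23:59:59'
echo openstack application credential show tf-bastion-creds
```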
[20:29:11] does it say 'expired' in an error or what?
[20:29:36] I'm not any deeper than just seeing the TF error "│ Error: Error creating OpenStack container infra client: Authentication failed"
[20:29:40] Investigating more now
[20:44:08] yep, I think they're just expired. You can definitely flag a set of credentials with an expiration date, and new ones work fine.
[21:45:26] I'm installing laptop updates so out for now.
[22:56:14] The problem with the wpl project outbound emails in T347512 turns out to be the sender address they are using (noreply@wikipedialibrary.wmflabs.org) not having SPF and Google rejecting the messages as a result. Because the sender address itself is also bogus there wasn't anywhere for mx-out03 to send a bounce notification to.
[22:56:15] T347512: Some emails from The Wikipedia Library aren't being received as expected - https://phabricator.wikimedia.org/T347512
[23:29:22] zero team-wmcs alerts!
[23:44:18] zarro boogs found? nice :)
[23:45:17] https://en.wikipedia.org/wiki/Bugzilla#Zarro_Boogs for those lucky enough not to know the reference.
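[Editor's note: the T347512 root cause above (sender domain with no SPF record, mail rejected by Google) can be checked by looking for a `v=spf1` entry in the domain's TXT records, e.g. with `dig +short TXT <domain>`. In this sketch the TXT records are stubbed so no DNS query is made; the stub value is invented and the domain's real records may differ.]

```shell
# Hypothetical sketch: does a sender domain publish SPF?
# Real usage: txt_records=$(dig +short TXT wikipedialibrary.wmflabs.org)
txt_records='"some-verification-token=abc123"'   # stub: deliberately no v=spf1
if printf '%s\n' "$txt_records" | grep -q 'v=spf1'; then
  echo "SPF record present"
else
  echo "no SPF record: strict receivers like Gmail may reject the mail"
fi
```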