[01:02:20] * bd808 off
[14:01:34] andrewbogott: I'm ready to start with the upgrade :) I'd say let's try using this channel instead of Google Meet so we have a record of what's happening, but we can jump into the Meet if there's any complication
[14:01:55] i'm also around if I can be helpful
[14:02:06] thanks taavi!
[14:02:10] I
[14:02:14] I've updated https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Openstack_upgrade#Upgrade_the_main_deployment_%28eqiad1%29
[14:03:51] hmm I left the old link in there, fixed now
[14:04:12] no, still wrong :/
[14:08:00] fixed. the task for the upgrade is T348843
[14:08:01] T348843: [openstack] Upgrade eqiad1 cluster to Antelope - https://phabricator.wikimedia.org/T348843
[14:11:08] first patch: https://gerrit.wikimedia.org/r/c/operations/puppet/+/978545
[14:11:50] running PCC to double check
[14:12:20] shall I send an email to cloud-announe?
[14:12:23] *cloud-announce
[14:13:00] i would not, just upgrading designate should not be very disruptive
[14:13:21] ok. PCC is looking good
[14:13:42] are the designate pools.yaml changes expected?
[14:13:53] let me check
[14:14:45] hmm weird
[14:15:14] it might not have an impact, but I'm not sure where that change is coming from
[14:18:11] balloons, bd808: I'm no longer sure what I did exactly when I experimented with mono/c# a while ago: even if it's possible to install mono runtime/dependencies via apt, the detect phase of the buildpack lifecycle will fail if the codebase isn't recognized as one of the builder-included buildpacks. So it's possible to add jammy-provided language runtimes and dev dependencies, but the main language needs to be
[14:18:11] one of the supported ones.
[14:19:30] taavi: ooh I think that's been changed after my copy of zed files to antelope
[14:19:51] dhinus: yeah I was just looking it up, https://gerrit.wikimedia.org/r/c/operations/puppet/+/961170 changes it for antelope but not zed
[14:21:30] then maybe it's needed for antelope?
[14:24:36] hmm looks like cloudservices in codfw were upgraded on Oct, 9 and that patch is from Oct, 4
[14:24:49] andrewbogott: do you remember any detail about that patch?
[14:33:23] dhinus: running late, be there shortly
[14:35:32] no worries
[14:35:59] I checked in codfw and the dot is gone, and designate seems happy, so I think there's no harm in removing it in eqiad too
[14:36:38] yeah, let's just move forward
[14:38:01] dhinus: pools.yaml is staged by puppet but not applied. So the change is definitely harmless
[14:39:38] (and also I'm here now)
[14:40:15] i'm looking at the cloud vps puppet agent failure errors
[14:40:16] ok, I'll disable puppet in cloudservices100[56] then merge the patch
[14:40:36] sounds good!
[14:41:29] blancadesal, ahh, yes. Apt can help in mixed toolchain scenarios, or when you need a specific dependency not provided. But isn't a drop-in replacement for buildpack language support
[14:44:26] merged, running the cookbook on cloudservices1005
[14:45:42] the cookbook is gonna reboot the host, do we expect no impact at all for end users?
[14:46:52] should be ok
[14:47:10] full command: cookbook wmcs.openstack.cloudvirt.upgrade_openstack_node --fqdn-to-upgrade cloudservices1005.eqiad.wmnet --task-id T348843
[14:47:10] T348843: [openstack] Upgrade eqiad1 cluster to Antelope - https://phabricator.wikimedia.org/T348843
[14:47:14] it's one of two resolvers. If there are half-baked services that only check one of them it might be briefly upset
[14:47:44] one of two auth servers. the recursors use a single VIP that's anycasted to both boxes, so no service disruption there
[14:47:52] the command is wrong because I copy-pasted it from the wiki :)
[14:48:00] taavi: ack
[14:48:22] balloons: enabling third-party buildpack use might be the fastest way to increase flexibility and possible use cases without overhauling the whole buildpack paradigm by going into BYO container support
[14:49:33] actually, I copy-pasted it from the cookbook help string, which is wrong :)
[14:49:56] revised command: cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node --fqdn-to-upgrade cloudservices1005.eqiad.wmnet --task-id T348843
[14:50:11] cloudcontrol?
[14:50:24] the cookbook is called like that for all nodes, I think
[14:50:30] then there's a separate cloudvirt one for cloudvirts
[14:50:43] we should rename the cookbook
[14:51:00] that's confusing
[14:51:31] ah it has a bunch of `if "control" in self.fqdn_to_upgrade` statements
[14:51:41] yes, very. the wiki seems to confirm "cloudcontrol" is the one to use, it should probably just be called "upgrade_openstack_node"
[14:51:49] the cloudvirt upgrade script doesn't reboot, among other differences
[14:52:12] yep, that one is fine
[14:52:14] so the 'cloudcontrol' is to distinguish between the two cookbooks.
[14:52:22] the cookbook is running and doing things
[14:52:25] but maybe we could merge them and let it do a hostname detection
[14:53:26] yep, we should rethink it. can you +1 this in the meantime? I thought it was already merged https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/969172
[14:54:24] done
[14:55:17] thanks
[14:55:30] the cookbook has almost completed on cloudservices1005
[14:55:53] and it does run puppet for you, not sure if it also re-enables it
[14:56:02] yes it does
[15:00:10] taavi: looks like you fixed the 'profile::base::production::role_description' thing?
[15:00:20] yes
[15:00:24] thanks!
[15:01:59] the cookbook has completed with PASS on cloudservices1005
[15:02:11] I raised T352297
[15:02:11] T352297: [wmcs-cookbooks] unify upgrade_openstack cookbooks - https://phabricator.wikimedia.org/T352297
[15:02:31] there's a bunch of alerts in alertmanager about cloudservices1005
[15:03:19] jah, downtiming doesn't seem to work (or isn't attempted)
[15:08:24] dhinus: something is a bit broken but I think it's version conflict between the two cloudservices nodes. So best to go ahead and upgrade the other one
[15:09:08] ok I'll run the cookbook on host #2
[15:09:11] 1006
[15:09:34] ok!
[15:28:00] cookbook completed with PASS. a bunch of alerts on cloudservices1006 but the ones on 1005 are gone
[15:31:37] all alerts gone
[15:32:56] labs-ip-alias-dump.service still failing on 1005
[15:34:11] hmm
[15:34:22] that could be a leftover alert from when it was down
[15:34:32] it only retries every 20 I think
[15:34:41] failing in both, checked with systemctl
[15:34:43] it also seems to have been failing on codfw1dev for a while
[15:34:50] I'm pretty sure it was not failing 20 mins ago when I checked 1005
[15:35:03] the error message seems like a programming/api change error, not a network error
[15:35:41] this one? "Unable to parse project=bastion, region=eqiad1-r, does the project exist?"
[15:36:14] yes, 'AttributeError: 'FloatingIP' object has no attribute 'attached'' which is causing that
[15:37:08] A fullstack test just completed successfully, I'm starting another one to double-check
[15:38:27] taavi: I can look at the labs-ip-alias-dump thing after the meeting if you aren't already mid-fix
[15:38:40] andrewbogott: go for it
[15:40:37] 2nd fullstack test worked
[15:40:50] so we can go forward with other upgrades dhinus if you want to keep going today
[15:44:56] dhinus: want to keep upgrading things today?
[15:46:35] hmm given we have the team meeting in 15 mins I think it might be best to continue tomorrow?
[15:47:11] that's fine with me, this is a good stopping place
[15:47:12] if you manage to fix the labs-ip-alias-dump thing today, I can continue with upgrading cloudcontrols tomorrow morning?
[15:47:25] what do you think should be the order for the remaining hosts?
[15:47:56] cloudcontrols, cloudnets, cloudvirts
[15:48:34] I'll prepare the patch
[19:09:34] taavi: can I get a +1 on https://gerrit.wikimedia.org/r/c/operations/puppet/+/978676 ?
[19:29:46] andrewbogott: yes!
[19:43:16] thx
[19:43:22] * bd808 lunch