[08:27:25] morning
[09:24:44] o/
[10:09:10] please review https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1004088
[10:37:22] FYI taavi I'm about to do another manual cleanup of the openstack DB for a similar nova-compute registration problem
[10:37:43] this is T357631
[10:37:43] T357631: openstack: nova refuses to admit a compute node after a reimage - https://phabricator.wikimedia.org/T357631
[10:39:19] the compute service delete / discover dance turns out to not be enough :-(
[10:49:13] * dhinus paged about cloudvirt1032 cc arturo
[10:49:28] I'm working on it, did we downtime it yesterday for 2 days?
[10:50:10] probably not :D
[10:50:32] I just downtimed it again for 2 additional days
[10:50:41] sorry for the noise
[10:50:48] aborrero@cumin1002:~ $ sudo cookbook sre.hosts.downtime cloudvirt1032.eqiad.wmnet -D 2 --reason "nova-compute registration"
[11:32:52] I just tried to unlock my laptop with the password 'sudo -i'
[11:32:53] xd
[11:38:32] the wmcs.openstack.cloudvirt.lib.ensure_canary cookbook is misbehaving in a way that's supposed to be covered by unit tests
[11:44:12] T357970
[11:44:13] T357970: wmcs.openstack.cloudvirt.lib.ensure_canary cookbook creates multiple canary VMs - https://phabricator.wikimedia.org/T357970
[12:00:24] FYI cloudvirt1032 is in service now after moving to a single-NIC setup. It has several canary VMs because of the bug above
[12:01:06] and as of today, the pre-reimage/post-reimage cookbooks are not enough to handle the nova compute service registration mess, so I will be trying another (additional) approach, which is to store the hypervisor ID in puppet
[12:01:15] i.e.: https://phabricator.wikimedia.org/T357631#9558343
[12:01:49] running an errand now, be back in ~30
[13:46:53] for kubernetes upgrades, the cadence we decided was twice a year iirc? or do we go for every 4 months? (the major version release cadence)
[13:48:56] kubernetes does releases every four months, we have no internal cadence for those
[13:49:31] ack, best effort I guess then
[13:52:18] added a note in https://phabricator.wikimedia.org/T133598, feel free to reword/change/correct me etc.
[14:00:09] the toolsdb replication lag alert triggered again, I tracked it in T357624
[14:00:09] T357624: [toolsdb] Replica is frequently lagging behind the primary - https://phabricator.wikimedia.org/T357624
[14:00:27] sorry, wrong phab, I meant T357979
[14:00:28] T357979: [toolsdb] ToolsToolsDBReplicationLagIsTooHigh - 2024-02-20 - https://phabricator.wikimedia.org/T357979
[15:02:58] dhinus: I am awake early, should we do the cloudrabbit reboots?
[15:19:31] yep, I'm in a meeting
[15:19:36] I'll let you know as soon as I'm free
[15:19:49] I was waiting for you to be online before touching the rabbits :)
[15:21:08] ok!
[15:22:04] "touching the rabbits" lol
[15:23:43] the rabbits are easily startled
[15:24:56] xD
[15:25:03] they bite
[15:34:23] https://www.google.com/url?sa=i&url=https%3A%2F%2Fwww.pinterest.com%2Fpin%2F443252788298381056%2F&psig=AOvVaw1bcJBYeydcbWD_NiW_6aiU&ust=1708529650055000&source=images&cd=vfe&opi=89978449&ved=0CBIQjRxqFwoTCMia4YqfuoQDFQAAAAAdAAAAABAE
[15:34:25] oops
[15:34:39] de-googled
[15:34:40] https://www.pinterest.com/pin/443252788298381056/
[16:00:30] andrewbogott: I'm available now, sorry for the wait!
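
(Editor's note: the "compute service delete / discover" dance referenced above for T357631 is not spelled out in the log. A minimal sketch follows, assuming standard OpenStack CLI/nova-manage invocations on a cloudcontrol host; the exact commands used by the team and the <service-id> value are assumptions, not taken from the log.)

    # list nova-compute service records and find the stale one for the reimaged hypervisor
    openstack compute service list --service nova-compute
    # delete the stale record, then ask nova to re-discover compute hosts in the cell
    openstack compute service delete <service-id>
    nova-manage cell_v2 discover_hosts --verbose
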
[16:00:52] I think there was a procedure on the wiki somewhere
[16:01:03] for restarting the whole rabbit cluster
[16:01:29] possibly, let's look
[16:02:25] https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/RabbitMQ
[16:02:45] Yeah, for a total rebuild (which might yet come)
[16:02:50] https://www.irccloud.com/pastebin/a1BhwaWI/
[16:03:15] In theory order doesn't matter, but I'd reboot 3, 2, 1 and check for stability after every reboot.
[16:03:27] And expect catastrophe after rebooting the 1001 :/
[16:04:27] ok let's try
[16:05:03] and would you use sre.hosts.reboot-single to reboot?
[16:05:14] yes
[16:05:48] ok! starting with 1003
[16:06:59] ok, 1001 can see that 1003 is down but doesn't seem upset
[16:10:28] and now 1001 and 1003 both seem satisfied with the recovery
[16:13:51] the reboot cookbook has completed
[16:14:15] I'll proceed with 1002
[16:16:50] ok
[16:17:13] 1002 disappeared from the "running" list as expected
[16:17:30] when you say "doesn't seem upset", what do you mean exactly? :)
[16:18:38] Just that the cluster_status output seemed normal except for the missing node
[16:23:16] andrewbogott: when you are not in the middle of an operation, could you please look at this patch? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1005065
[16:24:01] andrewbogott: ok. 1002 done, and status is looking fine. rebooting 1001.
[16:24:33] ok!
[16:25:06] arturo: yep!
[16:25:07] running "status" on 1002 looks fine (1001 is missing from "Running Nodes")
[16:25:27] yeah, I'm pleasantly surprised
[16:28:03] all 3 are now back in service and status seems ok
[16:28:16] did I get lucky? /me crosses fingers :P
[16:28:29] yep, now I'm just watching the admin-monitoring project to make sure VMs can still get created
[16:29:19] what are you checking exactly?
[16:29:33] just watching the admin-monitoring instance list in horizon
[16:29:38] to see if broken things start piling up
[16:29:43] ok!
[16:29:53] I will add some notes to https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Maintenance#cloudrabbitXXXX
[17:10:28] * arturo offline
[17:18:50] * dcaro off
[17:18:53] cya tomorrow!
[17:18:55] \awy
[17:18:59] xS
[17:19:01] \away
[17:19:07] now....
[19:07:06] * bd808 lunch
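
(Editor's note: a minimal sketch of the rolling cloudrabbit reboot described above, assuming the nodes are cloudrabbit1001-1003.eqiad.wmnet and that cluster health is checked with rabbitmqctl on one of the remaining nodes; hostnames and the exact checks used are assumptions, not taken from the log.)

    # reboot one node at a time, 1003 first and 1001 last, from a cumin host
    sudo cookbook sre.hosts.reboot-single cloudrabbit1003.eqiad.wmnet
    # before moving on, confirm the cluster only shows the expected nodes as running
    sudo rabbitmqctl cluster_status
    # repeat for cloudrabbit1002.eqiad.wmnet, then cloudrabbit1001.eqiad.wmnet,
    # checking cluster_status (and that new VMs still get created) after each reboot
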