[01:30:52] * bd808 off
[09:23:34] I'm looking at the novafullstack errors in cloudcontrol1006
[09:25:32] ack, thanks, yesterday it was running ok (had failed on sunday)
[09:52:50] would it be okay to reboot clouddumps1001, clouddumps1002 and clouddb2002-dev today? these are the last remaining hosts missing for T321313
[09:54:04] moritzm: clouddb2002-dev is fine to reboot anytime, you can do that or I can if you prefer that. clouddumps is a bit more complex because they host NFS shares
[09:55:47] I'll quickly go ahead with clouddb2002-dev, then. if clouddumps are more complex to handle, I'll leave them as-is, but it would be good to wrap these up in the next 1-2 weeks
[09:56:07] the current kernels are quite old, we ran into the task yesterday during Phab task triage
[09:56:26] ok! I'll take care of clouddumps* in the next few days
[09:56:31] thanks!
[10:06:06] regarding the novafullstack alert: I did not find an obvious reason why it was failing other than the number of leaks being > max, so I just cleaned up all the VMs
[10:14:27] :/
[11:44:32] could you please have another look at https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/59 maybe we can merge it today
[11:44:37] and get it published
[11:48:48] I might be able to give it another look after lunch, I want to finish the first look at all the migration tasks
[11:51:06] taavi: what about embedding the openapi spec in a wikitech page?
[11:51:44] you mean doing a bot to sync the data to wikitech?
[11:52:53] I think I mean the nice UI that can be generated from the openapi spec
[11:55:41] I would wait until we have the aggregated API before moving the definitions to wikitech (for users), otherwise we (developers) can manage with the yaml+online tools or similar
[11:56:50] +
[11:57:11] I thought there was urgency for making this available for users
[11:58:08] not to users, to ourselves
[11:58:21] (so we can generate the client)
[11:59:28] for the users there's less urgency, and I think that having a more stable and aggregated API first might avoid users having to change things at the beginning
[12:26:52] looking for a quick review on https://gerrit.wikimedia.org/r/c/operations/puppet/+/1002941 and https://gerrit.wikimedia.org/r/c/operations/dns/+/1002940
[12:29:13] +1'd
[12:29:41] thanks
[12:46:44] * dcaro lunch
[13:00:05] any ongoing work on cloudcontrol1006?
[13:01:37] I'm rebooting things as a part of the kernel updates. should have mentioned it here/downtimed, sorry
[13:02:08] ok
[13:16:53] taavi: is galera happy in eqiad1?
[13:17:10] it should be, that alert seems like a false positive but I'm looking
[13:17:30] haproxy having no backends sounds a bit concerning
[13:18:55] yeah. it's a false positive apparently caused by how our haproxy config has 1007 as galera primary and the rest as backups in case 1007 goes down
[13:19:08] I can see the expected several hundred connections open on 1005
[13:19:30] ok
[13:23:02] filed T357406 to fix that
[13:23:03] T357406: "HAProxy service mysql has no available backends" fires when galera primary is down - https://phabricator.wikimedia.org/T357406
[13:24:10] thanks
[13:38:43] that is fixed in https://gerrit.wikimedia.org/r/c/operations/alerts/+/1002983
[13:39:26] dcaro: dhinus: I'd like to reboot cloudcontrol1005 but you both are logged in there
[13:39:37] logged out
[13:39:54] taavi: logged out
[13:40:15] thanks! just did not want to disrupt any ongoing scripts/etc
[13:41:13] thanks for pinging :)
[13:41:52] I usually have N terminals open to different hosts of things I'm checking/doing in parallel :/ (kind of like browser tabs)
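
Editor's note on the 10:06 message about cleaning up leaked novafullstack VMs: the actual cleanup is done with WMCS's own tooling, which is not shown in the log. Purely as an illustration of the idea, below is a minimal, hypothetical dry-run sketch using openstacksdk. The clouds.yaml entry name, the "fullstack" name prefix, and the age threshold are assumptions, and the sketch only prints candidates rather than deleting anything.

#!/usr/bin/env python3
"""Hypothetical dry run: list long-lived VMs that look like leaked
nova-fullstack test instances.  Not the actual WMCS tooling."""
from datetime import datetime, timedelta, timezone

import openstack

CLOUD = "eqiad1"              # assumption: name of the clouds.yaml entry
NAME_PREFIX = "fullstack"     # assumption: prefix used by the fullstack test VMs
MAX_AGE = timedelta(hours=2)  # assumption: anything older is considered leaked


def main() -> None:
    conn = openstack.connect(cloud=CLOUD)
    now = datetime.now(timezone.utc)

    for server in conn.compute.servers():
        if not server.name.startswith(NAME_PREFIX):
            continue
        created = datetime.strptime(server.created_at, "%Y-%m-%dT%H:%M:%SZ")
        created = created.replace(tzinfo=timezone.utc)
        age = now - created
        if age > MAX_AGE:
            # A real cleanup would call conn.compute.delete_server(server)
            # after double-checking; here we only report the candidate.
            print(f"leak candidate: {server.name} (age {age})")


if __name__ == "__main__":
    main()
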
[14:44:57] taavi: thanks for all the reboots in T356975! if you get bored, I can do the ones that are left between today and tomorrow
[14:45:53] dhinus: well my main project ATM is waiting for tools k8s nodes to be created, so I don't mind doing other boring stuff like cloudvirt reboots in the meantime
[14:46:19] ok, thanks :)
[14:46:23] would you mind taking care of the non-trivial stuff? so cloudvirt-wdqs/local, and remaining non-cloudvirt nodes
[14:46:33] sure
[14:46:40] cloudrabbits are the painful ones I guess
[14:46:47] yep
[14:46:52] wdqs and local, I'm trying to remember what I did last time
[14:46:59] when we upgraded those to bookworm
[14:48:42] looks like cloudvirtlocal were reimaged by andrewbogott in T345811
[14:48:42] T345811: [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811
[14:49:11] while wdqs were upgraded when they were moved to a new rack in T346948
[14:49:12] T346948: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948
[14:50:16] dhinus: as long as you do one at a time it should work OK? In theory etcd handles the loss of 1/3 nodes
[14:50:24] (perhaps that was not your question)
[14:51:00] I think it was :)
[14:51:47] cloudvirt-wdqs is less straightforward. I check to see if any VMs are running on them (often not) and if so check in with the owners before reboot.
[14:52:58] there's one VM each
[14:53:04] which could be a canary
[14:53:17] that's easy then!
[14:53:32] They do use those hosts sometimes but it's generally pretty ephemeral.
[14:53:40] nice
[14:57:03] I'm rebooting wdqs hosts with wmcs.openstack.cloudvirt.safe_reboot
[15:02:08] icinga doesn't seem to believe the 'safe' part, I downtimed them there
[15:03:35] thanks, is it maybe that the cookbook removes the downtime too soon?
[15:05:42] hmm there was an interesting exception: "Aggregate 7 already has host cloudvirt-wdqs1002"
[15:05:46] the cookbook completed anyway
[15:06:56] ok the same error was printed also for wdqs1001
[15:07:19] in both cases, "Aggregate 1" and then "Aggregate 7"
[15:10:43] all 3 -wdqs hosts have been rebooted
[15:11:55] I need to go afk for a bit so I won't do more reboots for today (but I'm joining the toolforge meeting later)
[15:12:22] andrewbogott: if you wanted more test subjects for the designate leaks, I can delete more k8s worker nodes. note though that we're starting to get a bit low on old named workers, there's maybe 15 or so left
[15:15:53] thanks taavi. After more digging yesterday I determined that the problem is a past (fixed) bug rather than a present bug. There's a batch of old dns records that don't have the associated 'managed_resource' id which would clean them up along with a deleted VM.
[15:16:09] So I'm going to stop worrying about it and just do periodic cleanups after the fact.
[15:16:16] ok!
[15:16:31] That does mean that after you delete them you can ping me about the cleanup :)
[17:20:31] * arturo offline
[17:21:21] * dcaro out
[19:01:32] * bd808 lunch
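
Editor's note on the designate-leak cleanup discussed at 15:15: as far as I know, the 'managed_resource' linkage lives in Designate's database and is not exposed in normal recordset API output, so the actual periodic cleanup is not reproduced here. As a rough, hypothetical illustration of spotting stale records, the sketch below cross-checks A recordsets in one zone against the instances that still exist. The clouds.yaml entry and zone name are assumptions, and nothing is deleted.

#!/usr/bin/env python3
"""Hypothetical dry run: flag A records in a Designate zone whose leading
label no longer matches an existing instance.  Not the actual WMCS cleanup."""
import openstack

CLOUD = "eqiad1"                                       # assumption: clouds.yaml entry
ZONE_NAME = "example-project.eqiad1.wikimedia.cloud."  # assumption: zone to inspect


def main() -> None:
    conn = openstack.connect(cloud=CLOUD)

    # Names of the instances that still exist in the current project.
    live_names = {server.name for server in conn.compute.servers()}

    zone = conn.dns.find_zone(ZONE_NAME)
    if zone is None:
        raise SystemExit(f"zone {ZONE_NAME} not found")

    for recordset in conn.dns.recordsets(zone):
        if recordset.type != "A":
            continue
        instance_label = recordset.name.split(".", 1)[0]
        if instance_label not in live_names:
            # A real cleanup would also confirm the record is unmanaged
            # (missing managed_resource) before deleting the recordset.
            print(f"stale candidate: {recordset.name} -> {recordset.records}")


if __name__ == "__main__":
    main()
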