[01:30:52] * bd808 off
[09:23:34] I'm looking at the novafullstack errors in cloudcontrol1006
[09:25:32] ack, thanks, yesterday it was running ok (had failed on sunday)
[09:52:50] would it be okay to reboot clouddumps1001, clouddumps1002 and clouddb2002-dev today? these are the last remaining hosts missing for T321313
[09:54:04] moritzm: clouddb2002-dev is fine to reboot anytime, you can do that or I can if you prefer that. clouddumps is a bit more complex because they host NFS shares
[09:55:47] I'll quickly go ahead with clouddb2002-dev, then. if clouddumps are more complex to handle, I'll leave them as-is, but it would be good to wrap these up in the next 1-2 weeks
[09:56:07] the current kernels are quite old, we ran into the task yesterday during Phab task triage
[09:56:26] ok! I'll take care of clouddumps* in the next few days
[09:56:31] thanks!
[10:06:06] regarding the novafullstack alert: I did not find an obvious reason why it was failing other than the number of leaks being > max, so I just cleaned up all the VMs
[10:14:27] :/
[11:44:32] could you please have another look at https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/59 maybe we can merge it today
[11:44:37] and get it published
[11:48:48] I might be able to give it another look after lunch, I want to finish the first look at all the migration tasks
[11:51:06] taavi: what about embedding the openapi spec in a wikitech page?
[11:51:44] you mean doing a bot to sync the data to wikitech?
[11:52:53] I think I mean the nice UI that can be generated from the openapi spec
[11:55:41] I would wait until we have the aggregated API before moving the definitions to wikitech (for users), otherwise we (developers) can manage with the yaml+online tools or similar
[11:56:50] +
[11:57:11] I thought there was urgency for making this available for users
[11:58:08] not to users, to ourselves
[11:58:21] (so we can generate the client)
[11:59:28] for the users there's less urgency, and I think that having a more stable and aggregated API first might avoid users having to change things at the beginning
[12:26:52] looking for a quick review on https://gerrit.wikimedia.org/r/c/operations/puppet/+/1002941 and https://gerrit.wikimedia.org/r/c/operations/dns/+/1002940
[12:29:13] +1'd
[12:29:41] thanks
[12:46:44] * dcaro lunch
[13:00:05] any ongoing work on cloudcontrol1006?
[13:01:37] I'm rebooting things as a part of the kernel updates. should have mentioned it here/downtimed, sorry
[13:02:08] ok
[13:16:53] taavi: is galera happy in eqiad1?
[13:17:10] it should be, that alert seems like a false positive but I'm looking
[13:17:30] haproxy having no backends sounds a bit concerning
[13:18:55] yeah. it's a false positive apparently caused by how our haproxy config has 1007 as galera primary and the rest as backups in case 1007 goes down
[13:19:08] I can see the expected several hundred connections open on 1005
[13:19:30] ok
[13:23:02] filed T357406 to fix that
[13:23:03] T357406: "HAProxy service mysql has no available backends" fires when galera primary is down - https://phabricator.wikimedia.org/T357406
[13:24:10] thanks
[13:38:43] that is fixed in https://gerrit.wikimedia.org/r/c/operations/alerts/+/1002983
[13:39:26] dcaro: dhinus: I'd like to reboot cloudcontrol1005 but you both are logged in there
[13:39:37] logged out
[13:39:54] taavi: logged out
[13:40:15] thanks! just did not want to disrupt any ongoing scripts/etc
[13:41:13] thanks for pinging :)
[13:41:52] I usually have N terminals open to different hosts of things I'm checking/doing in parallel :/ (kind of like browser tabs)
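
Editor's note on the 10:06 message about cleaning up leaked novafullstack VMs: the actual cleanup is done with WMCS's own tooling, which is not shown in the log. Purely as an illustration of the idea, below is a minimal, hypothetical dry-run sketch using openstacksdk. The clouds.yaml entry name, the "fullstack" name prefix, and the age threshold are assumptions, and the sketch only prints candidates rather than deleting anything.

#!/usr/bin/env python3
"""Hypothetical dry run: list long-lived VMs that look like leaked
nova-fullstack test instances.  Not the actual WMCS tooling."""
from datetime import datetime, timedelta, timezone

import openstack

CLOUD = "eqiad1"              # assumption: name of the clouds.yaml entry
NAME_PREFIX = "fullstack"     # assumption: prefix used by the fullstack test VMs
MAX_AGE = timedelta(hours=2)  # assumption: anything older is considered leaked


def main() -> None:
    conn = openstack.connect(cloud=CLOUD)
    now = datetime.now(timezone.utc)

    for server in conn.compute.servers():
        if not server.name.startswith(NAME_PREFIX):
            continue
        created = datetime.strptime(server.created_at, "%Y-%m-%dT%H:%M:%SZ")
        created = created.replace(tzinfo=timezone.utc)
        age = now - created
        if age > MAX_AGE:
            # A real cleanup would call conn.compute.delete_server(server)
            # after double-checking; here we only report the candidate.
            print(f"leak candidate: {server.name} (age {age})")


if __name__ == "__main__":
    main()
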
[14:44:57] taavi: thanks for all the reboots in T356975! if you get bored, I can do the ones that are left between today and tomorrow
[14:45:53] dhinus: well my main project ATM is waiting for tools k8s nodes to be created, so I don't mind doing other boring stuff like cloudvirt reboots in the meantime
[14:46:19] ok, thanks :)
[14:46:23] would you mind taking care of the non-trivial stuff? so cloudvirt-wdqs/local, and remaining non-cloudvirt nodes
[14:46:33] sure
[14:46:40] cloudrabbits are the painful ones I guess
[14:46:47] yep
[14:46:52] wdqs and local, I'm trying to remember what I did last time
[14:46:59] when we upgraded those to bookworm
[14:48:42] looks like cloudvirtlocal were reimaged by andrewbogott in T345811
[14:48:42] T345811: [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811
[14:49:11] while wdqs were upgraded when they were moved to a new rack in T346948
[14:49:12] T346948: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948
[14:50:16] dhinus: as long as you do one at a time it should work OK? In theory etcd handles the loss of 1/3 nodes
[14:50:24] (perhaps that was not your question)
[14:51:00] I think it was :)
[14:51:47] cloudvirt-wdqs is less straightforward. I check to see if any VMs are running on them (often not) and if so check in with the owners before reboot.
[14:52:58] there's one VM each
[14:53:04] which could be a canary
[14:53:17] that's easy then!
[14:53:32] They do use those hosts sometimes but it's generally pretty ephemeral.
[14:53:40] nice
[14:57:03] I'm rebooting wdqs hosts with wmcs.openstack.cloudvirt.safe_reboot
[15:02:08] icinga doesn't seem to believe the 'safe' part, I downtimed them there
[15:03:35] thanks, is it maybe that the cookbook removes the downtime too soon?
[15:05:42] hmm there was an interesting exception: "Aggregate 7 already has host cloudvirt-wdqs1002"
[15:05:46] the cookbook completed anyway
[15:06:56] ok the same error was printed also for wdqs1001
[15:07:19] in both cases, "Aggregate 1" and then "Aggregate 7"
[15:10:43] all 3 -wdqs hosts have been rebooted
[15:11:55] I need to go afk for a bit so I won't do more reboots for today (but I'm joining the toolforge meeting later)
[15:12:22] andrewbogott: if you wanted more test subjects for the designate leaks, I can delete more k8s worker nodes. note though that we're starting to get a bit low on old named workers, there's maybe 15 or so left
[15:15:53] thanks taavi. After more digging yesterday I determined that the problem is a past (fixed) bug rather than a present bug. There's a batch of old dns records that don't have the associated 'managed_resource' id which would clean them up along with a deleted VM.
[15:16:09] So I'm going to stop worrying about it and just do periodic cleanups after the fact.
[15:16:16] ok!
[15:16:31] That does mean that after you delete them you can ping me about the cleanup :)
[17:20:31] * arturo offline
[17:21:21] * dcaro out
[19:01:32] * bd808 lunch
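
Editor's note on the designate-leak cleanup discussed at 15:15: as far as I know, the 'managed_resource' linkage lives in Designate's database and is not exposed in normal recordset API output, so the actual periodic cleanup is not reproduced here. As a rough, hypothetical illustration of spotting stale records, the sketch below cross-checks A recordsets in one zone against the instances that still exist. The clouds.yaml entry and zone name are assumptions, and nothing is deleted.

#!/usr/bin/env python3
"""Hypothetical dry run: flag A records in a Designate zone whose leading
label no longer matches an existing instance.  Not the actual WMCS cleanup."""
import openstack

CLOUD = "eqiad1"                                       # assumption: clouds.yaml entry
ZONE_NAME = "example-project.eqiad1.wikimedia.cloud."  # assumption: zone to inspect


def main() -> None:
    conn = openstack.connect(cloud=CLOUD)

    # Names of the instances that still exist in the current project.
    live_names = {server.name for server in conn.compute.servers()}

    zone = conn.dns.find_zone(ZONE_NAME)
    if zone is None:
        raise SystemExit(f"zone {ZONE_NAME} not found")

    for recordset in conn.dns.recordsets(zone):
        if recordset.type != "A":
            continue
        instance_label = recordset.name.split(".", 1)[0]
        if instance_label not in live_names:
            # A real cleanup would also confirm the record is unmanaged
            # (missing managed_resource) before deleting the recordset.
            print(f"stale candidate: {recordset.name} -> {recordset.records}")


if __name__ == "__main__":
    main()
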