[08:03:41] good morning
[08:20:36] morning!
[09:02:13] taavi: re harbor, after some further experimentation and scouring the docs, it seems that the api would need the credentials of an admin user to view/edit quotas. https://phabricator.wikimedia.org/T341068#9358319
[09:02:33] do you see any issue with this?
[09:04:11] blancadesal: can we add the robot user as a member to all of the projects and solve it that way?
[09:06:25] it doesn't seem so as the "robot account" is a separate concept that doesn't map to any type of user
[09:08:25] robots don't show up in the user list, and can't be added as users to projects
[09:09:38] ok, that's a bit annoying, going with the admin account seems like the only option then :/
[09:09:55] i don't like it, but i think it's fine to give the builds-api access to the admin account
[09:10:17] we could create a separate "robot admin" account at least for auditing purposes
[09:45:05] hello! I'm back from holidays, I'm catching up on emails, etc. but let me know if there's something I should look at :)
[09:57:42] welcome back :)
[10:04:46] I added a "tools migrated in the last week" counter to https://grid-deprecation.toolforge.org/
[10:05:29] taavi: nice
[11:00:08] dhinus, blancadesal: quick +1 on a further quota increase for T350484? https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/140
[11:00:12] T350484: Request increased quota for anchor-corrector Toolforge tool - https://phabricator.wikimedia.org/T350484
[11:08:51] taavi: done
[11:08:58] thx
[12:16:04] anyone coming to collab today?
[12:16:34] sure
[14:05:33] Rook: the backup job for the 'test-postgres' trove instance in quarry is failing. Before I start debugging, is that something useful that I should try to preserve?
[14:05:49] (I have the feeling I asked you that same question last week)
[14:09:08] andrewbogott: any reason not to send cloudmetrics1003/4 to dc-ops for decom? (T351077)
[14:09:08] T351077: decommission cloudmetrics1003.eqiad.wmnet, cloudmetrics1004.eqiad.wmnet - https://phabricator.wikimedia.org/T351077
[14:10:48] * andrewbogott checks the spreadsheet
[14:11:09] In theory those servers are useful until 2026 so I'd hope they'd get repurposed. Maybe make a note of that in the task?
[14:12:57] but otherwise handoff seems good.
[14:13:10] I'll make a note
[14:15:15] thanks. that was my concern as well
[14:29:57] dhinus: shall we resume the etcd-node shuffling so we can upgrade cloudvirtlocal* ?
[14:35:31] * andrewbogott shuffles
[15:02:34] hm, andrewbogott: are the libvirtd-admin.socket alerts related to something that you're doing? or should I investigate?
[15:03:32] um... I don't know why they'd be related
[15:03:36] I'm reimaging cloudvirtlocal1001
[15:04:14] hm
[15:04:20] Nov 27 14:54:52 cloudvirt1060 libvirtd[4037060]: Our own certificate /var/lib/nova/clientcert.pem failed validation against /var/lib/nova/cacert.pem: The certificate hasn't got a known issuer.
[15:05:24] jbond: you have a moment to help with a certificate issue probably related to the puppet 7 migration?
[15:09:14] why now though? the puppet7 change happened >5 days ago didn't it?
[15:09:57] that's a very good question
[15:10:16] lots of prometheus changes just now from filippo
[15:11:38] how does the vm migration process work? does that talk to libvirtd?
[15:12:05] cold-migration (which I'm doing today) does not
[15:12:11] I believe that live migration does
[15:13:52] it feels like something is missing an intermediary cert in the libvirt setup
[15:14:07] yep
[15:15:48] a-ha. puppet::expose_agent_certs has some special logic for that on puppet 7, but we don't use that resource
[15:15:50] patch incoming
[15:16:21] puppet says "/usr/bin/virsh secret-define --file /etc/libvirt/libvirt-secret-eqiad1-compute.xml" is failing but it works on the commandline...
[15:18:33] andrewbogott: was there a way to see what user created a db? Judging from the time it was created it might be one of the new maintainers, though I'm not sure what test-postgres might be. Sounds like it is causing some issues; if so, I think it is safe to remove
[15:18:56] I don't think I can tell but I'll check again
[15:19:56] nope, can't tell. I can ping a few of the new maintainers though, have a suggestion for who might be responsive?
[15:21:52] It would have been Audiodude, SD0001, or maybe Framawiki. Though I do suspect it is not needed, it's giving some unhappy errors. And it isn't part of the intended shift to k8s
[15:22:46] andrewbogott: https://gerrit.wikimedia.org/r/c/operations/puppet/+/977697/
[15:24:52] taavi: you know I'm going to ask this, but... how did that work before?
[15:25:30] oh, I see, nevermind
[15:26:54] it worked before we migrated to puppet 7. i have no clue why it broke today, and not for example last saturday
[15:32:51] hello! the restbase CI has been using parsiod-external-ci-access.beta.wmflabs.org for a long time over HTTP and it's no longer accessible (which is probably for the best). What would be the best way to create a valid tls cert for a beta.wmflabs.org domain? (ticketed here https://phabricator.wikimedia.org/T350353)
[15:34:11] taavi: I'm still not getting clean puppet runs on e.g. cloudvirt1060
[15:34:50] hnowlan: traditionally you would create a proxy that will serve as an https endpoint. There's a UI for that in horizon under the 'dns' sidebar tab
[15:36:28] hnowlan: but that assumes that there's an http service running on the VM and you just have a firewall problem. It's hard for me to guess what's going on on that VM.
[15:46:13] andrewbogott: seems there was a floating IP set up for the host rather than using the proxies previously
[15:46:38] I can only create proxies on wmcloud.org, is this preferred? Not a dealbreaker for me, just curious
[15:46:46] hnowlan: deployment-prep is weird, probably that was an attempt to resemble prod
[15:46:51] yes, only under wmcloud.org
[15:47:01] but you could also adjust the security groups for that VM and open up port 80 again probably
[15:47:12] or learn all about acme-chief and set up a proper cert
[15:47:17] if there's a way to do this right via the proxies that would be ideal
[15:47:39] as long as you don't care about the domain name that should be easiest.
[15:48:03] taavi: seems like a race where puppet marks the file 0400 and then immediately tries to read it and can't
[15:48:18] (the file being /etc/libvirt/libvirt-secret-eqiad1-cinder.xml)
[15:48:57] well, I guess it's not actually a race, just a bug. Does puppet7 not run as root?
[15:53:59] andrewbogott: I'm still seeing some nodes complain about the new certificate setup
[15:54:39] it seems like having the chain in clientcert.pem doesn't work properly.
[15:56:47] so possibly two problems :/
[15:57:19] taavi: do you know of any user-facing symptoms of this problem? (I'm just trying to figure out if it's urgent or not)
[15:57:45] since we're all about to start a meeting
[15:58:23] I'm not sure, I suspect it might affect VM live migrations. I think we'd have noticed already if it was affecting more urgent stuff
[15:58:32] https://gerrit.wikimedia.org/r/c/operations/puppet/+/977728/ is my try 2 for a fix.
[16:04:22] that patch seems to have done the trick, but it needs a manual 'sudo systemctl stop libvirtd.service && sudo systemctl restart libvirtd-tls.socket' to un-confuse systemd. I'll do that via cumin after the meeting
[16:04:36] sounds good. thanks for the fix!
[16:13:58] taavi in a meeting at the moment, I'll ping when out
[16:43:39] taavi i just read through the history, looks like things are working now?
[17:01:51] dhinus: cloudvirtlocal1001 is fully rebuilt, I'm starting to drain 1002 now. I'm happy to finish them up unless you want me to save anything for you :)
[17:02:26] please go ahead :)
[17:04:17] jbond: I think things are somewhat working now, thanks
[17:04:48] taavi: ok great
[17:04:56] andrewbogott: it seems like the failing libvirtd services have been fixed by enough puppet runs. i will clean up the phab tasks and maybe file some tasks to tweak the alertmanager rules
[17:05:15] ok!
[19:20:10] dhinus: cloudvirtstatic hosts are all on bookworm now
[19:20:18] um, cloudvirtlocal
[19:20:39] awesome!
[19:23:04] taavi or dhinus, happen to know how I can make alert manager forget about servers that have been decommed? I expected the decom cookbook to do that but apparently not.
[19:23:28] andrewbogott: which alert?
[19:23:39] there are lots, for cloudvirt1025-1030.
[19:23:49] "Neutron neutron-linuxbridge-agent on cloudvirt1027 is down"
[19:24:43] by removing them from neutron. https://phabricator.wikimedia.org/P53913
[19:25:37] ah, I see, it's not the host alerting...
[19:25:41] ok, I think I know how to do that