[08:35:01] got disconnected and now I got banned from using my dcaro nick temporarily
[08:36:12] morning
[08:36:25] quick review https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/22 (fixes the bump_version script)
[08:40:03] 👀
[08:40:39] LGTM.
[09:26:40] just noticed the prompt for the lima-kilo vm is `lima-lima-kilo`
[09:39:03] there's many lima* stuff xd
[10:21:55] dcaro: when do you plan to merge the jobs-api refactor?
[10:22:04] I need to write some patches
[10:25:23] I've deployed the first patch on toolsbeta, testing it
[10:25:44] I might wait until after lunch for the next, but today for sure
[10:32:36] great
[11:22:43] is there anything to consider for upgrading postgresql on cloudbackup? (just a security update to 13.13 to 13.14, no major version bump)
[11:31:34] moritzm: I'm not even sure what uses postgres on there at the moment
[11:35:16] poking at Hiera this seems to for Cinder backups going to /srv/cinder-backups/postgresql, it seems Andrew was the last to touch this, I'll wait until he's around
[11:35:37] ok
[11:35:39] thanks
[12:48:41] quick review? https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1017849
[12:51:06] left a comment
[12:51:25] right, thanks
[12:51:54] abandoned
[13:02:48] having GPU issues today :-(
[13:02:49] nostromo kernel: NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
[13:09:20] Hey all, I'm going to start rebuilding the toolforge etcd cluster w/Bullseye. Please ping me immediately if k8s starts misbehaving.
[13:09:53] andrewbogott: quick review? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1017857/
[13:10:48] xd, off the bus
[13:24:20] moritzm: I don't think the postgres update is complicated, but you can start with cloudbackup100[12]-dev which is lower stakes than cloudbackup200[1234]
[13:36:32] ok. cloudbackup100[12]-dev are on a different OS (bookworm) and given they were only installed on Friday also on the most current version
[13:36:45] so as far as bullseye hosts are concerned only 2001/2002 need the update
[13:40:18] moritzm: I
[13:40:52] I'll likely decom those two (2001/2002) later in the week, so you can probably just ignore them unless the upgrade is an urgent security thing.
[13:46:56] ack, that sounds good to me, I'll ignore these
[14:16:15] andrewbogott: how is the etcd replacement going? I see some puppet run alerts
[14:16:52] I'm progressing through the cookbook failures, will clean up the broken ones shortly.
[14:20:26] reminder: the toolforge monthly meeting is tomorrow, please remember to add any agenda items you would like to discuss to https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Monthly_meeting. also if you're interested in moderating the meeting, now is a good time to indicate that
[14:27:04] thanks!
[14:27:17] tools-prometheus-6 went down, is anyone doing anything with it?
[14:28:02] not me
[14:29:36] not me
[14:29:51] hmm... back up (got email alerts), but the VM says it's been up for 26 days
[15:33:13] * arturo offline
[15:40:05] andrewbogott: taavi@tools-checker-5:~ $ sudo curl --cert /var/lib/toolschecker/etcd/tools-checker-5.tools.eqiad1.wikimedia.cloud.pem --key /var/lib/toolschecker/etcd/tools-checker-5.tools.eqiad1.wikimedia.cloud.priv https://tools-k8s-etcd-22.tools.eqiad1.wikimedia.cloud:2379/health
[15:40:05] {"health":"false"}
[16:09:43] * dcaro off
[16:09:45] cya tomorrow
[16:12:25] ` Warning FailedMount 29s (x7 over 111s) kubelet MountVolume.SetUp failed for volume "kube-api-access-4hdj9" : [failed to fetch token: serviceaccounts "default" is forbidden: User "system:node:tools-k8s-worker-nfs-56" cannot create resource "serviceaccounts/token" in API group "" in the namespace "tool-wikibugs-testing": no relationship found between node 'tools-k8s-worker-nfs-56' and this object, failed to sync configmap
[16:12:25] cache: timed out waiting for the condition]`
[16:12:40] etcd related I'd guess?
[16:17:41] could be...
[16:17:59] did it recover or are things still stuck?
[16:19:17] I think it is starting to work now. I had a bit of trouble deleting the unscheduled pod, but that seems to have resolved itself.
[16:19:40] ok. I will try to keep etcd up :)
[16:20:18] the next task in my list did what was expected, so hopefully all is good :)
[17:59:10] * bd808 lunch + partial eclipse watching