[06:28:36] <taavi>	 morning
[06:28:45] <taavi>	 is anyone looking at the puppet failures on cloudbackup1002-dev.eqiad.wmnet?
[06:29:14] <taavi>	 `Could not find class ::openstack::serverpackages::antelope::bullseye`
[07:11:19] <dcaro>	 morning
[07:11:23] <dcaro>	 I can take a look
[07:12:50] <dcaro>	 I see no alerts
[07:13:50] <dcaro>	 but I see the error, looking
[07:17:47] <dcaro>	 it's just not there, we don't have a bullseye version of the serverpackages for antelope
[07:18:57] <dcaro>	 the repo is missing also on https://mirrors.wikimedia.org/osbpo/dists/
[07:26:47] <dcaro>	 hmm, that is just a mirror from upstream, so maybe the packages are not built upstream yet
[07:28:13] <dcaro>	 it synced last time yesterday morning, so seems fresh
[07:53:00] <dcaro>	 maybe dhinus or andrewbogot.t have started to work on it for the upgrade?
[08:14:05] <dcaro>	 oh, it seems toolsbeta-harbor is failing to connect to it's db
[08:14:26] <dcaro>	 blancadesal: ^ are you doing anything there?
[08:15:08] <dcaro>	 2023-10-18T08:14:10Z [ERROR] [/lib/http/error.go:54]: {"errors":[{"code":"UNKNOWN","message":"unknown: deal with /service/notifications/tasks/47 request in transaction failed: failed to connect to `host=ttg4ncgzifw.svc.trove.eqiad1.wikimedia.cloud user=harbor database=harbor`: dial error (dial tcp 172.16.5.95:5432: connect: connection refused)"}]}
[08:19:02] <dcaro>	 hmm.... the /var/run/postgresql directory was owned by root, making the database container fail to start as it runs as `database`, chowning the directory allowed it to start :/
[08:19:22] <blancadesal>	 dcaro: nope, nothing beyond the upgrade-harbor MRs 
[08:19:45] <dcaro>	 I suspect that might happen every time the VM is restarted (was one of the affected VMs from the cloudvirt1051 going down yesterday, so it got restarted somewhere else)
[08:19:58] <dcaro>	 the VM being the database VM, not harbor
[08:39:54] <dhinus>	 o/ let me check if we tried to upgrade that host and when
[08:46:12] <dhinus>	 looking at the IRC log, I have reimaged cloudbackup1001-dev on Sep 18th, but I probably forgot to reimage 1002, which is still on bullseye. antelope is only available in bookworm, so that's why Puppet is failing.
[08:46:20] <dhinus>	 reimaging 1002 to bookworm should fix it
[08:46:42] <dcaro>	 ack :)
[08:47:17] <dcaro>	 quick review to fix tests on builds-cli (not sure how I got a broken commit merged) https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/merge_requests/20
[08:47:55] <dhinus>	 the reimage of cloudbackup1001-dev is also tracked here https://phabricator.wikimedia.org/T345810#9174746
[08:48:04] <dhinus>	 I will start the reimage of cloudbackup1002-dev now
[09:01:36] <dcaro>	 thanks, no rush
[09:21:08] <blancadesal>	 could I get a quick review of https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-builder/-/merge_requests/18? It's not a big change and I'd like to deploy it together with the previous MR 
[09:35:43] <dcaro>	 do you want me to test it locally also?
[09:40:12] <dcaro>	 nm, done
[09:47:27] <blancadesal>	 dcaro: sorry, distracted with something else. Thanks!
[09:55:47] <dcaro>	 quick review also https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/merge_requests/22, making releasing easier
[10:05:22] <blancadesal>	 lgtm
[10:30:09] * dcaro lunch
[10:50:30] <taavi>	 prepping the toolforge k8s upgrade (starting in 10 minutes) here: https://etherpad.wikimedia.org/p/toolforge-k8s-upgrade-1.23
[11:01:25] <taavi>	 ok I'm starting
[11:02:27] <taavi>	 alertmanager downtime set
[11:03:33] <taavi>	 running prepare_upgrade cookbook
[11:04:52] <taavi>	 filed T349195
[11:04:52] <stashbot>	 T349195: cloud/instance-puppet.git updater is broken - https://phabricator.wikimedia.org/T349195
[11:06:36] <taavi>	 running upgrade worker cookbook for control-4
[11:08:05] <taavi>	 cluster upgrade starting
[11:09:53] <taavi>	 upgrading docker and kubelet on control-4
[11:10:45] <taavi>	 rebooting instance, again some permission errors with locks
[11:12:13] <dcaro>	 taavi: ask if you need assistance with anything
[11:15:29] <taavi>	 sure
[11:15:46] <taavi>	 the pods took a moment to restart after rebooting control-4 but seem ok now
[11:15:50] <taavi>	 moving on to control-5
[11:18:38] <taavi>	 control-5 updating control plane components
[11:20:00] <taavi>	 upgrading docker and kubelet on control-5
[11:21:46] <dcaro>	 oh, mirrors.wikimedia.org seems down
[11:22:31] <taavi>	 that's why the apt update was taking a while?
[11:23:11] <dcaro>	 maybe yes
[11:25:34] <taavi>	 it seems to be back
[11:25:48] <dcaro>	 yep
[11:25:52] <taavi>	 continuing to control-6
[11:26:24] <taavi>	 the k8s packages come from apt.wm.o not mirrors.wm.o, so apt update being slow is the main effect on the upgrade. still not ideal
[11:27:06] <taavi>	 filed T349197 too
[11:27:07] <stashbot>	 T349197: Remove TTLAfterFinished from config - https://phabricator.wikimedia.org/T349197
[11:29:45] <taavi>	 control plane upgrade complete
[11:29:55] <dcaro>	 \e/
[11:30:05] <taavi>	 verified that new pods are still getting scheduled and start fine
[11:30:09] <taavi>	 moving to first worker node
[11:35:57] <taavi>	 first 3 nodes upgraded, looks good
[11:36:02] <taavi>	 I'm starting the mass upgrade scripts
[11:38:32] <taavi>	 (by which I mean `cat ~taavi/nodes2.txt | xargs -L1 cookbook wmcs.toolforge.k8s.worker.upgrade --cluster-name tools --src-version 1.22.17 --dst-version 1.23.17 --hostname` on three different tmux tabs since the cookbook only has support for a single node at this point)
[12:05:53] <taavi>	 all nodes upgraded
[12:06:11] <taavi>	 the new cookbook seems to be much more aggressive about draining nodes compared to the old script, which is why it was this fast
[12:11:55] <taavi>	 dcaro: where was the third-party k8s component version information moved from https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Kubernetes/Components? I don't see it on https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy
[12:12:50] <dcaro>	 taavi: which component specifically?
[12:13:18] <taavi>	 I need an overview table which details which versions of various third-party components we use and what k8s versions those support
[12:14:34] <dcaro>	 taavi: for that you have the components in toolforge-deploy, if you want the metrics related ones, they are defined in the helmfile for it
[12:14:42] <dcaro>	 or the calico ones, are in the calico helmfile
[12:15:02] <dcaro>	 there's no k8s versions support specified there though
[12:15:56] <taavi>	 yes, and having that data in a single place is required for k8s version upgrade planning
[12:16:32] <dcaro>	 feel free to add it to the toolforge-deploy docs
[12:17:19] <dcaro>	 that will have to be manually kept up to date though, as (afaik) there's no easy way to extract whit k8s versions each component supports
[12:18:29] <dcaro>	 I would suggest to keep that info close to the version specification of each component, and extract it somehow for human consumption, so when someone changes the version of that component, will not miss updating that info
[12:21:09] <taavi>	 there's no single place with the versions of all of the components in toolforge-deploy, so I just re-added it to the wiki page
[12:22:47] <dcaro>	 git grep chartVersion
[12:27:35] <taavi>	 includes a bunch of info that I don't need (in-house components), and the values files don't have all of the info I need
[12:33:23] <dcaro>	 feel free to add it then if it's missing
[12:34:28] <dcaro>	 hmm, it's a pity that not all implement https://helm.sh/docs/topics/charts/#the-kubeversion-field
[12:34:55] <dhinus>	 cloudbackup1002-dev is reimaged and looking fine
[12:36:02] <dcaro>	 I think we can generate that info parsing the helm chart's kubeVersion string if there, or adding it manually if not
[12:37:05] <dcaro>	 the kubeVersion of the ones that have it does not seem good either though
[12:37:19] <dcaro>	 dhinus: 🎉
[12:39:37] <dcaro>	 taavi: adding it to toolforge-deploy will allow also to check it at deploy time, and prevent or warn when you are trying to deploy a component that does not support the target k8s
[12:39:50] <dcaro>	 and keep the single source of truth
[12:44:59] <taavi>	 if we had that data for everything, sure. but right now we don't and I don't have the time to fix it completely, so I will continue to use the manually maintained but complete and very useful table
[12:45:29] <dcaro>	 please move it to the docs at least
[13:40:33] <blancadesal>	 dcaro: do you know which postgres version we are currently using for harbor?
[13:46:36] <blancadesal>	 nm, it's 12.7
[13:48:12] <andrewbogott>	 I'm going to miss the meeting today but I have 45 minutes or so before chaos resumes. Please ping if there's anything I can help with/unblock.
[13:50:53] <taavi>	 andrewbogott: can you the canary for cloudvirt1051 back to 1051? I think I moved it to 1058 when evacuating all other VMs from it
[13:51:15] <andrewbogott>	 sure
[13:52:01] <andrewbogott>	 is 1051 back and healthy again?
[13:52:56] <dhinus>	 what's the process for starting a canary VM btw? I could not find anything in wikitech
[13:53:59] <andrewbogott>	 there's a cookbook which should go through and ensure that everything is as it should be
[13:54:15] <taavi>	 it's back up at least. I have not tested running anything on it yet, and the cause for the lockup is a mystery
[13:54:19] <andrewbogott>	 that or create by hand on the cli
[13:54:25] <andrewbogott>	 taavi: ok
[13:56:50] <andrewbogott>	 taavi: in theory the move should look like this:
[13:56:52] <andrewbogott>	 openstack server migrate --live --host cloudvirt1051 0f6b34aa-cef3-4c06-977d-2dc868dcf16d --os-compute-api-version 2.30
[13:57:01] <andrewbogott>	 but that's not working for weird reasons, I'm investigating
[13:58:32] <jbond>	 dcaro: fyi i had a look through the incident re isc-dhc-client.  i can't think of a way we could really detect this at the puppet level.  i guess the advice is dont be explicit and use ensure_packages($package) in every class that needs it.  
[13:58:44] <jbond>	 ill keep mulling it over but nothing comes to mind right now
[14:07:37] * dcaro paged
[14:07:58] <dcaro>	 cloudvirte1058 nova proc, looking
[14:09:26] <andrewbogott>	 that was me, sorry
[14:09:32] <andrewbogott>	 already resolved, I just restarted the service
[14:09:37] <andrewbogott>	 and got unlucky with the check
[14:09:45] <andrewbogott>	 dcaro: ^
[14:10:03] <dcaro>	 ack
[14:10:24] <dcaro>	 jbond: thanks
[14:10:43] <andrewbogott>	 Nova is saying "Compute host cloudvirt1058 could not be found" when I try to move hosts off of 1058.  Seems to be unique to that server.
[14:11:00] <andrewbogott>	 Of course nova also reports that that nova is up on that host, and notices when it's down.
[14:12:34] <dcaro>	 you mean "to move VMs" right?
[14:13:05] <dcaro>	 we messed up with the database yesterday when moving VMs out of 1051, maybe something got out of place?
[14:14:34] <andrewbogott>	 yes, moving vms
[14:14:44] <andrewbogott>	 could be although the complaint is about 1058, not about 1051
[14:16:14] <dcaro>	 yes, but the instances were moved to 1058
[14:16:33] <andrewbogott>	 ah, ok
[14:16:38] <dcaro>	 so maybe we forgot to update some entry somewhere, and openstack is unable to link it
[14:16:38] <andrewbogott>	 which db? i can look
[14:16:58] <dcaro>	 there's some info in the ticket https://phabricator.wikimedia.org/T349109#9258647
[14:17:18] <dcaro>	 the db was nova_eqiad1
[14:17:43] <dcaro>	 and neutron
[14:17:50] <dcaro>	 cinder seemed to get updated by itself
[14:19:06] <dcaro>	 taavi: ^ fyi
[14:29:58] <taavi>	 definitely possible the db is not in sync somewhere, I was following some clearly outdated upstream docs in a state of panic
[14:32:40] <andrewbogott>	 Running "update instances set node='cloudvirt1058.eqiad.wmnet' where host='cloudvirt1058';" which may help or hurt
[14:32:44] <andrewbogott>	 sorry about the panic :(
[14:33:52] <andrewbogott>	 that seems to have helped.
[14:34:02] <andrewbogott>	 Seems like those two fields are redundant, or at least redundant for our purposes
[14:36:39] <andrewbogott>	 and now the canary is back where it belongs
[14:37:56] <andrewbogott>	 btw taavi did you try 'openstack server evacuate'?  In theory that's the command to move a server off of a failed hypervisor
[14:38:16] <taavi>	 no, I had no clue that even exists
[14:38:19] <andrewbogott>	 I can add that to the docs if you link me to the runbook you were following
[14:38:40] <andrewbogott>	 ok :)  I think the docs say "evacuate the VMs" without specifying that evacuate is a literal thing
[14:39:15] <taavi>	 there was no runbook except the very generic https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown
[14:39:56] <andrewbogott>	 oh yeah, I guess it's hard to link to a specific runbook from a hostdown alert...
[14:40:57] <andrewbogott>	 yesterday you quoted "restore the server or evacuate manually the VMs on it" I was thinking I'd update the string wherever it is
[14:42:21] <taavi>	 that's what the prometheus alert says https://gerrit.wikimedia.org/g/operations/alerts/+/1f77dd6f684fb5211736172f70d006c627d15b02/team-wmcs/general_nodes_down.yaml#37
[14:47:42] <andrewbogott>	 oh, so I really can add a custom runbook. Let's see if I can do that before I have to go
[14:48:17] <andrewbogott>	 thx taavi (and also thanks for dealing with the outage while I was dealing with other things)
[14:49:40] <andrewbogott>	 hm, no such luck, gotta go in a minute.  I'll work on the runbook later
[15:48:10] <taavi>	 anyone up for reviewing https://gerrit.wikimedia.org/r/c/operations/puppet/+/966871? I have no clue why it worked before, but it certainly does not work anymore
[15:52:16] <taavi>	 dcaro: dhinus: sorry I forgot to ask this during the meeting, will you let brett know we selected our victims for the incident review ritual? or should I?
[15:52:27] <dcaro>	 taavi: done :)
[15:52:37] <dcaro>	 I asked him to add the team to the invite too
[15:52:39] <taavi>	 awesome, thanks
[15:52:58] <dhinus>	 ah I just asked lmata as well :)
[15:53:03] <dhinus>	 in the -sre channel
[15:58:13] <taavi>	 added build and envvars services to the toolforge sidebar https://wikitech.wikimedia.org/w/index.php?title=Template%3AToolforge_nav&diff=2121081&oldid=2116117
[16:01:37] <dhinus>	 thanks!
[16:02:49] <taavi>	 I'll also rewrite the leads for both of the articles, to make them bit less technical for those who don't care about toolforge internals
[16:05:16] <dhinus>	 I wonder if we should list buildservice and envvars also in https://wikitech.wikimedia.org/wiki/Help:Toolforge#Main_features_of_Toolforge
[16:13:46] <dcaro>	 taavi: created https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/119 with the kubernetes compatibility info, nothing automated yet though
[16:14:40] <dcaro>	 I added it to https://wikitech.wikimedia.org/wiki/Help:Toolforge/Quickstart#Build_and_host_your_first_tool
[16:14:43] <dcaro>	 (build service)
[17:11:50] * dcaro off