[06:58:26] greetings
[07:52:23] dcaro: re: k8s upgrade for tools, will that be on meet too ? I'm happy to follow along
[08:02:10] I can join the meet yep if anyone wants to follow, I'll be there in a minute
[08:05:33] ok will join shortly too
[08:06:20] morning
[09:21:44] morning xd,
[09:21:55] fyi. the upgrade is going ok :)
[09:34:23] great. let me know if there's any hiccup or if you need help!
[09:39:19] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1184792 should be good to go, please take a look
[09:54:23] godog: lgtm, added a question, I might be asking silly things but well, just in case xd
[09:55:04] thank you dcaro !
[09:55:18] `tools-k8s-worker-nfs-68` is failing to drain, it has a stuck pod that does not terminate, looking
[09:56:56] https://www.irccloud.com/pastebin/DQgCIRsT/
[09:57:33] that sounds like nfs :/, godog remember I told you that I thought draining/undraining might be enough for NFS nowadays? well this proves it's not xd
[09:57:44] lolsob
[09:59:37] there's two nodes stuck at least, nfs-48 is stuck too, I'll reboot them manually and rerun the upgrade
[10:56:10] can I get a +1 on https://github.com/toolforge/paws/pull/497 ?
[11:05:52] +1d
[11:07:02] thanks!
[11:13:30] I'm waiting for a full run of the functional tests, but all the nodes are now upgraded (missing the bastions)
[11:15:21] and done :), I'll merge the cookbook patches and update the docs while running the tests, but the upgrade is done \o/
[11:30:12] everything good 👍, all tests green
[13:36:55] dcaro: ready for https://gerrit.wikimedia.org/r/c/operations/puppet/+/1185937 ?
[13:37:41] 👍
[13:45:16] ok, here go the mons...
[15:21:58] dcaro: found the explanation for the missing alerts, I did document it in T381589, but then forgot about it :)
[15:21:59] T381589: [wikireplicas] Route alerts to WMCS team - https://phabricator.wikimedia.org/T381589
[15:22:06] "Replication lag alerts were disabled for clouddb hosts in https://gerrit.wikimedia.org/r/c/operations/alerts/+/835117, by filtering based on job: mysql-core."
[15:22:37] so we need to duplicate those alerts for clouddb hosts, where job=mysql-labs
[15:22:45] yep
[15:23:06] good catch
[15:23:23] ideally, we would check other alerts for those hosts as well, right now it's a bit of a mess as described in that task
[15:26:27] the fact that quarry alerts users about wikireplicas lagging is a good stopgap until we fix this properly
[15:28:43] yep, that's quite useful
[15:29:12] it's kinda relieving/calming to know that when your query is taking long
[15:36:19] feature request for Trove, cc andrewbogott T403977
[15:36:20] T403977: Solution for trove instance access without vps instance - https://phabricator.wikimedia.org/T403977
[16:54:59] andrewbogott: making some incremental progress on provisioning that k8s cluster via magnum, but now running into an issue where the cinder-csi-plugin fails to start. it seems to be related to DNS resolution failure of `openstack.eqiad1.wikimedia.org`
[16:55:20] https://www.irccloud.com/pastebin/nhHFKGRr/
[16:56:04] does dns work on that host at all?
[16:56:13] i started up a debug pod in the `kube-config` namespace and verified that the lookup fails from it as well
[16:56:20] i haven't tried from the node. let me do that
[16:56:37] or, really, does network egress even work?
[16:58:18] from the master node, it's a-ok. i'll check egress in general from a pod
[17:00:34] doh. ok i should have checked that first. yeah, no egress it seems. i am very surprised everything else started up ok
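A minimal sketch of how the DNS-vs-egress question above can be split from inside the cluster, assuming a throwaway busybox debug pod; the pod name and the test IP are illustrative and not taken from the pastebins:

```bash
# Start a disposable debug pod with a shell (busybox ships nslookup and wget).
kubectl run dns-debug --rm -it --restart=Never --image=busybox:1.36 -- sh

# Inside the pod:
# 1. DNS path: which nameserver did the pod get, and does it answer?
cat /etc/resolv.conf
nslookup openstack.eqiad1.wikimedia.org
nslookup kubernetes.default.svc.cluster.local

# 2. Raw egress, bypassing DNS: talk to a fixed public IP (illustrative target).
#    Any quick response, even an HTTP error or redirect, means packets leave
#    the cluster; a hang until the timeout points at blocked egress instead.
wget -q -T 5 -O /dev/null http://1.1.1.1/ ; echo "wget exit code: $?"
```

If the in-cluster service lookup fails while the fixed-IP test responds, the problem is the pod DNS path rather than egress, which matches where the debugging below ends up.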
[17:01:47] * dcaro off
[17:01:50] cya tomorrow
[17:02:31] dduvall: I don't immediately know how to address that but it seems like a problem :)
[17:13:40] oh, actually nm. egress works. it's just dns resolution that is busted. the nameserver in `/etc/resolv.conf` is not reachable. i wonder if it's due to the `service_cluster_ip_range` i specified
[17:14:00] * dduvall pokes around
[17:38:04] congrats on another boring Kubernetes upgrade d.caro and everyone else who worked to make that smooth and repeatable. :)
[19:24:48] andrewbogott: looks like my custom `service_cluster_ip_range="10.245.0.0/16"` was indeed the problem. seems like it wants to set the nameserver for pods on the cluster network to `10.254.0.10` regardless of what the service network is set to and what the ip of `kube-dns` is. switching back to the default `service_cluster_ip_range="10.254.0.0/16"` avoids the issue
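A rough sketch of confirming that mismatch and of the workaround, under a few assumptions: the DNS service is the upstream-default `kube-dns` in `kube-system`, the cluster and template names are hypothetical, and the `service_cluster_ip_range` label syntax is taken from the value quoted above (whether the label is honored depends on the Magnum driver in use):

```bash
# What ClusterIP did the cluster DNS service actually get?
kubectl -n kube-system get svc kube-dns -o jsonpath='{.spec.clusterIP}{"\n"}'

# What nameserver did pods get? It should match the ClusterIP above; in the
# failure described here it was 10.254.0.10 (from the default range) even
# though the service network was 10.245.0.0/16.
kubectl run resolv-check --rm -i --restart=Never --image=busybox:1.36 -- \
  cat /etc/resolv.conf

# Workaround used here: create the cluster with the default service CIDR so
# the hard-coded DNS IP falls inside it. Cluster/template names are made up.
openstack coe cluster create my-cluster \
  --cluster-template my-template \
  --labels service_cluster_ip_range=10.254.0.0/16
```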