[06:58:26] greetings
[07:52:23] dcaro: re: k8s upgrade for tools, will that be on meet too ? I'm happy to follow along
[08:02:10] I can join the meet yep if anyone wants to follow, I'll be there in a minute
[08:05:33] ok will join shortly too
[08:06:20] morning
[09:21:44] morning xd,
[09:21:55] fyi. the upgrade is going ok :)
[09:34:23] great. let me know if there's any hiccup or if you need help!
[09:39:19] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1184792 should be good to go, please take a look
[09:54:23] godog: lgtm, added a question, I might be asking silly things but well, just in case xd
[09:55:04] thank you dcaro !
[09:55:18] `tools-k8s-worker-nfs-68` is failing to drain, it has a stuck pod that does not terminate, looking
[09:56:56] https://www.irccloud.com/pastebin/DQgCIRsT/
[09:57:33] that sounds like nfs :/, godog remember I told you that I thought draining/undraining might be enough for NFS nowadays? well this proves it's not xd
[09:57:44] lolsob
[09:59:37] there's two nodes stuck at least, nfs-48 is stuck too, I'll reboot them manually and rerun the upgrade
[10:56:10] can I get a +1 on https://github.com/toolforge/paws/pull/497 ?
[11:05:52] +1d
[11:07:02] thanks!
[11:13:30] I'm waiting for a full run of the functional tests, but all the nodes are now upgraded (missing the bastions)
[11:15:21] and done :), I'll merge the cookbook patches and update the docs while running the tests, but the upgrade is done \o/
[11:30:12] everything good 👍, all tests green
[13:36:55] dcaro: ready for https://gerrit.wikimedia.org/r/c/operations/puppet/+/1185937 ?
[13:37:41] 👍
[13:45:16] ok, here go the mons...
[15:21:58] dcaro: found the explanation for the missing alerts, I did document it in T381589, but then forgot about it :)
[15:21:59] T381589: [wikireplicas] Route alerts to WMCS team - https://phabricator.wikimedia.org/T381589
[15:22:06] "Replication lag alerts were disabled for clouddb hosts in https://gerrit.wikimedia.org/r/c/operations/alerts/+/835117, by filtering based on job: mysql-core."
[15:22:37] so we need to duplicate those alerts for clouddb hosts, where job=mysql-labs
[15:22:45] yep
[15:23:06] good catch
[15:23:23] ideally, we would check other alerts for those hosts as well, right now it's a bit of a mess as described in that task
[15:26:27] the fact that quarry alerts users about wikireplicas lagging is a good stopgap until we fix this properly
[15:28:43] yep, that's quite useful
[15:29:12] it's kinda relieving/calming to know that when your query is taking long
[15:36:19] feature request for Trove, cc andrewbogott T403977
[15:36:20] T403977: Solution for trove instance access without vps instance - https://phabricator.wikimedia.org/T403977
[16:54:59] andrewbogott: making some incremental progress on provisioning that k8s cluster via magnum, but now running into an issue where the cinder-csi-plugin fails to start. it seems to be related to DNS resolution failure of `openstack.eqiad1.wikimedia.org`
[16:55:20] https://www.irccloud.com/pastebin/nhHFKGRr/
[16:56:04] does dns work on that host at all?
[16:56:13] i started up a debug pod in the `kube-config` namespace and verified that the lookup fails from it as well
[16:56:20] i haven't tried from the node. let me do that
[16:56:37] or, really, does network egress even work?
[16:58:18] from the master node, it's a-ok. i'll check egress in general from a pod
[17:00:34] doh. ok i should have checked that first. yeah, no egress it seems. i am very surprised everything else started up ok
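A minimal sketch of how the DNS-vs-egress question above can be split from inside the cluster, assuming a throwaway busybox debug pod; the pod name and the test IP are illustrative and not taken from the pastebins:

```bash
# Start a disposable debug pod with a shell (busybox ships nslookup and wget).
kubectl run dns-debug --rm -it --restart=Never --image=busybox:1.36 -- sh

# Inside the pod:
# 1. DNS path: which nameserver did the pod get, and does it answer?
cat /etc/resolv.conf
nslookup openstack.eqiad1.wikimedia.org
nslookup kubernetes.default.svc.cluster.local

# 2. Raw egress, bypassing DNS: talk to a fixed public IP (illustrative target).
#    Any quick response, even an HTTP error or redirect, means packets leave
#    the cluster; a hang until the timeout points at blocked egress instead.
wget -q -T 5 -O /dev/null http://1.1.1.1/ ; echo "wget exit code: $?"
```

If the in-cluster service lookup fails while the fixed-IP test responds, the problem is the pod DNS path rather than egress, which matches where the debugging below ends up.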
[17:01:47] * dcaro off
[17:01:50] cya tomorrow
[17:02:31] dduvall: I don't immediately know how to address that but it seems like a problem :)
[17:13:40] oh, actually nm. egress works. it's just dns resolution that is busted. the nameserver in `/etc/resolv.conf` is not reachable. i wonder if it's due to the `service_cluster_ip_range` i specified
[17:14:00] * dduvall pokes around
[17:38:04] congrats on another boring Kubernetes upgrade d.caro and everyone else who worked to make that smooth and repeatable. :)
[19:24:48] andrewbogott: looks like my custom `service_cluster_ip_range="10.245.0.0/16"` was indeed the problem. seems like it wants to set the nameserver for pods on the cluster network to `10.254.0.10` regardless of what the service network is set to and what the ip of `kube-dns` is. switching back to the default `service_cluster_ip_range="10.254.0.0/16"` avoids the issue
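A rough sketch of confirming that mismatch and of the workaround, under a few assumptions: the DNS service is the upstream-default `kube-dns` in `kube-system`, the cluster and template names are hypothetical, and the `service_cluster_ip_range` label syntax is taken from the value quoted above (whether the label is honored depends on the Magnum driver in use):

```bash
# What ClusterIP did the cluster DNS service actually get?
kubectl -n kube-system get svc kube-dns -o jsonpath='{.spec.clusterIP}{"\n"}'

# What nameserver did pods get? It should match the ClusterIP above; in the
# failure described here it was 10.254.0.10 (from the default range) even
# though the service network was 10.245.0.0/16.
kubectl run resolv-check --rm -i --restart=Never --image=busybox:1.36 -- \
  cat /etc/resolv.conf

# Workaround used here: create the cluster with the default service CIDR so
# the hard-coded DNS IP falls inside it. Cluster/template names are made up.
openstack coe cluster create my-cluster \
  --cluster-template my-template \
  --labels service_cluster_ip_range=10.254.0.0/16
```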