[01:25:22] taav.i: I have migrated most trove instances to ovs. The remaining hosts are three that I'm pretty sure never worked in the first place (I've opened tickets poking users about them) and tools-harbordb. I fear that restarting tools-harbordb will cause a toolforge outage so I'll coordinate with you about that when we're both online.
[07:05:58] morning
[08:31:16] dcaro: welcome back!
[08:37:29] :), I see kyverno is up and running, kudos!
[08:45:42] yeah
[08:46:08] I wanted to get rid of it, but then upstream replied and offered a few hints, we tried again, and it worked this time
[08:49:45] what was the trick? the cluster-wide policy?
[08:50:47] the key is that I was able to reproduce the outage in lima-kilo, so that allowed me to safely play with different settings
[08:50:56] in the end the trick was a combo of:
[08:51:14] * increase memory headroom for the k8s api-server, we upscaled each control node to have more RAM
[08:51:25] * disable one optional component of kyverno
[08:52:05] * increase the number of replicas of some kyverno pods
[08:54:12] alert:
[08:54:13] Kubernetes worker tools-k8s-worker-nfs-43 has many processes stuck on IO (probably NFS)
[08:55:29] it seems that has been going on since Friday
[08:55:49] nfs-38 also started having issues at the same time, but stopped eventually (maybe rebooted)
[08:56:54] * arturo rebooting laptop for nvidia driver update
[09:02:22] morning
[09:02:55] i'm going to expand the toolforge non-nfs worker pool a bit
[09:03:15] is it getting loaded?
[09:05:05] it's about as loaded as the NFS pool, which means that non-NFS tools are getting scheduled on NFS workers
[09:06:28] i'm also going to start moving tools k8s workers to OVS later today
[09:07:34] ack
[09:19:11] ok
[09:33:55] dhinus: we're going to have to reboot the tools-db nodes to move them to OVS at some point, do you have an idea when's the best moment to do that?
[09:37:34] taavi: hmm I would maybe wait for the replica to catch up, but apart from that I think any moment is fine
[09:37:57] maybe we can alert people on cloud-announce of a short expected downtime? it will probably be very short, so it might not be necessary
[09:42:19] dhinus: the actual VM downtime will be about a minute or two, and stopping mariadb and then starting it back up will take a bit more. so not sure
[09:42:54] dcaro: can I safely reboot the harbor VM at any point?
[09:44:01] taavi: I'm also undecided but I'm leaning towards "let's send a short email"
[09:44:38] I mean, //ideally// we could use it as a chance for T344719 :P
[09:44:39] T344719: [toolsdb] test failover procedure - https://phabricator.wikimedia.org/T344719
[09:46:13] taavi: hmm, should be safe, though maybe stop harbor using docker compose first. Note that the build service will not work, and any tools using buildservice images will not be able to pull them while the VM is offline
[09:46:56] arturo: are you looking into the nfs-43 host?
[09:46:59] and gitlab CI for toolforge components will not work either
[09:47:10] dcaro: no, sorry, I'm busy with other stuff.
[09:47:12] dcaro: I was looking at the nfs-43 alert just now
[09:47:39] dcaro: ack, doing that now
[09:47:40] dhinus: ack, the nfs server went away and some OS processes seem to have gotten stuck (wmf-autorestart of lldp, it seems)
[09:47:58] so it might need a hard reboot
[09:48:07] https://www.irccloud.com/pastebin/1iKN3cZs/
[09:49:55] all done with the tools-harbor reboot
[09:50:14] 👍
[09:52:12] dcaro: hmm where did you find those logs? I can't find them with "sudo journalctl --boot -1 | grep tools-nfs"
[09:52:25] dmesg
[09:54:36] ok I'll reboot the host. any idea why the logs don't show up in journalctl?
[09:56:06] do you usually drain the host before rebooting?
[09:56:45] how far back does journalctl go?
[09:56:45] I usually use the cookbook that drains it, yep, though it should not be a big issue
[09:57:14] dcaro: journalctl goes back to last month, so that does not explain it
[09:57:32] it should be in the current boot
[09:57:43] I think
[09:59:10] https://www.irccloud.com/pastebin/Y7HRPFb0/
[09:59:34] ^ yep, it has not been rebooted in a while :)
[10:00:11] dhinus: if you need to reboot it, you might as well move it to OVS with the same reboot :-)
[10:00:58] taavi: ok! what was the migration command again?
[10:01:26] sudo cookbook wmcs.openstack.migrate_server_to_ovs --cluster-name eqiad1 --project <project> --server <server>
[10:01:56] thanks
[10:02:06] waiting for the drain cookbook to complete.
[10:03:16] dcaro: I can find the logs if I use "--boot 0" instead of "--boot -1"
[10:03:45] 👍
[10:05:27] should we suggest using 0 instead of -1 in the runbook?
[10:05:30] https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses
[10:06:13] the drain cookbook failed with "Still has 1 pods running" (the pod is calico)
[10:06:26] no sorry, that was an annotation
[10:06:34] the pod is fourohfour-5bb849974b-f8dj4
[10:06:41] dhinus: we should probably remove the --boot option completely
[10:07:05] I think I can ignore the stuck pod and proceed with migrate_server_to_ovs
[10:07:31] I might have run the journalctl command in the example after the server rebooted xd
[10:08:14] dcaro: agreed, let's remove --boot
[10:13:16] taavi: tools-k8s-worker-nfs-43 migrated to ovs
[10:13:27] \o/
[10:14:00] only 58 more workers (+ 4 ingress/control nodes) to go
[10:14:42] dcaro: runbook updated
[10:15:43] thanks :)
[10:30:46] heads up, I want to set kyverno policies to enforce with this patch: https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/46
[10:31:08] arturo: ack
[10:32:06] arturo: looking at `kubectl get policyreports -A` I see lots of things in the 'fail' state, is that expected?
[10:33:49] taavi: I believe those are old entries. The policies have since been renamed (and disabled)
[10:33:58] I'm double-checking before deploying
[10:34:17] ok.. how do we check that the new policies are working as expected?
[10:34:52] taavi: see comment here: https://phabricator.wikimedia.org/T368044#9909524
[10:35:03] * dhinus lunch
[10:35:07] ah I see
[10:36:10] I think you can also describe the policyreports you mentioned earlier
[10:36:10] aborrero@tools-k8s-control-7:~$ sudo -i kubectl -n tool-zhwp-afc-bot describe policyreport pol-toolforge-zhwp-afc-bot-pod-policy
[10:50:57] taavi: I have been checking, and I believe those failed reports are for deployments that were created _before_ I added an explicit security context
[10:51:23] because kyverno is doing cluster-wide scans of previously defined resources and reporting whether they comply with the policies
[10:51:43] so, older deployments don't have, for example, an explicitly defined fsGroup
[10:52:47] taavi: which makes me remember we need to merge this: https://gitlab.wikimedia.org/repos/cloud/toolforge/tools-webservice/-/merge_requests/37 care to re-review?
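For illustration only, a minimal sketch of the kind of check being discussed above, assuming a hypothetical tool namespace "tool-example" with a deployment called "example" (neither name appears in the log):

    # Hypothetical check, run from a k8s control node: does the tool's
    # deployment define an explicit pod-level securityContext?
    sudo -i kubectl -n tool-example get deployment example \
      -o jsonpath='{.spec.template.spec.securityContext}'
    # Deployments created after the explicit security context was added
    # should show fsGroup/runAsUser/runAsGroup values here; older ones
    # typically show nothing, which would explain the 'fail' entries in
    # the policyreports.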
[11:03:51] https://www.irccloud.com/pastebin/2jnekFeA/
[11:26:59] * dcaro lunch, will be late for the coworking space
[12:01:28] FYI, I'm reenabling the kyverno reports controller, which was previously deactivated (given it's kind of an optional thing). I have tested its performance on lima-kilo with 4k policy rules with no problems. Patch is: https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/343/diffs Rollback is a revert of that patch + a deploy
[15:33:03] the "D Processes" alert fired on a few different NFS workers
[15:34:17] looks like the numbers are already going down
[15:46:21] * arturo offline
[15:50:07] dhinus, do you remember anything about T368233?
[15:50:14] T368233: can the db server 'maps-test-2' be deleted? - https://phabricator.wikimedia.org/T368233
[15:50:44] andrewbogott: nope, I don't think I've seen it before
[15:51:05] It's surely trash (possibly even made by me) but I'm always nervous destroying things
[15:52:21] I also think it can be deleted, but I would feel better if we had some backup somewhere
[15:52:34] * dhinus would like something like "wmcs glacier" :)
[15:53:39] slow storage that we can dump things to and keep them around for a few months/years "just in case"
[17:03:53] taavi: https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/8/commits and https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1049240
[20:45:59] topranks: We have a new host an-redacteddb1001 which presents as a clouddb replica server, but which seems to not permit network access from cloud-vps. I'm guessing that there's some special network rule in place for the existing clouddb hosts that needs adding there as well? T368316
[20:46:00] T368316: maintain-dbusers.service failing on cloudcontrol1005 - https://phabricator.wikimedia.org/T368316
[20:46:04] ^ cc taavi, btullis
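As a quick way to confirm the symptom described in the last messages, a hedged sketch of a connectivity check from a Cloud VPS instance; the FQDN and port below are placeholders, since neither is given in the log:

    # Hypothetical check from any Cloud VPS instance: can we reach the new
    # replica's MariaDB port at all? Substitute the real FQDN and port.
    nc -vz -w 5 <an-redacteddb1001-fqdn> 3306
    # Running the same check against one of the existing clouddb replica
    # hosts should succeed; if only the new host fails, that points at a
    # missing network/firewall rule rather than a client-side problem.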