[08:01:40] morning
[08:02:20] morning
[08:02:29] o/
[08:03:23] morning
[08:37:59] morning!
[08:46:11] can I get a +1 to deploy this to toolforge and merge to main? https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/238
[08:51:48] do you want anyone to test it first?
[08:53:01] it would be cool if that repo had a ci job that showed the diff of what would be applied to the clusters
[08:58:18] that'd be nice, yes. It would need access of some sort to a k8s cluster, no? Or can helm work out the diff from a local file of sorts?
[09:00:36] you can render the templates without access to the cluster. So render the templates for the main branch, render the templates for that commit, and show a diff of those, no cluster access required
[09:00:48] of course that doesn't show if there have been any local hacks or so, but that is fine
[09:03:46] that'd be enough, yes
[09:19:46] there are a number of improvements we could add here
[09:20:13] another one I would like to see is some kind of alert if there is a diff between this repo and what's live in the cluster
[09:20:41] to help us detect undeployed changes
[09:20:53] there might be a task for that one
[09:21:16] helm is not very good at detecting changes made by hand to the resources it manages, but I would like to have some detection for that as well
[09:21:31] T358908
[09:21:34] T358908: [infra,ci] Alert when toolforge-deploy changes are not deployed - https://phabricator.wikimedia.org/T358908
[09:21:42] nice
[09:22:08] yep, helm ignores most (if not all) of the changes made by hand, and keeps a secret (iirc) with the changes it deployed instead
[09:22:19] so if you change anything by hand, it does not notice
[09:22:57] and I think it was base64'd twice, for some reason
[09:23:27] along the same lines, having a cookbook, or script, to help force-redeploy may be useful
[09:23:41] if a change is made by hand, running the deployment cookbook won't do anything
[09:23:49] so some kind of forced redeploy is required
[09:36:54] uninstall-install of the chart works, but it deletes everything in the meantime (and we don't want that on toolforge)
[09:37:13] though probably we don't want to manually change anything on toolforge either
[09:44:13] agreed
[09:58:05] the CloudVPSDesignateLeaks alert has been flapping quite a bit over the weekend, does anyone know why?
[10:01:00] nope, though I saw a ticket fly by about some cloudvps project refreshing their VMs, so maybe they removed old VMs?
[10:02:03] yeah, but that should not flap the alert
[10:14:03] I thought that old VMs were the only ones that would leak
[10:14:26] if they should not, then I don't know how the leaks would happen (besides bugs somewhere)
[10:19:47] wouldn't it be cool for the k8s deploy cookbook to write a message to the merge request?
[10:23:23] arturo: are the SAL messages it logs not enough? I'm worried that would result in too much spam
[10:24:30] I would find it useful to write it in all the bugs related to the release upgrade (so you know your fix has been deployed)
[10:24:38] I usually write the message by hand into the MR anyway. I don't think it is too much spam
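A minimal sketch of the template-diff CI job discussed around 09:00, assuming a checkout of the MR branch and a chart layout like components/<name>; the chart path and release name here are illustrative, not the real toolforge-deploy layout:

    # render the chart from main and from the MR locally, then diff; no cluster access needed
    git fetch origin main
    git worktree add /tmp/main origin/main
    helm template jobs-api /tmp/main/components/jobs-api > /tmp/rendered-main.yaml
    helm template jobs-api ./components/jobs-api > /tmp/rendered-mr.yaml
    diff -u /tmp/rendered-main.yaml /tmp/rendered-mr.yaml || true  # non-zero exit only means "there is a diff"

As noted in the conversation, this only compares what the repo would render, so it would not surface any local hacks applied directly to the cluster.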
[10:25:02] you still need to test that it works by hand though
[10:25:34] (so there might be a hand-written comment needed even if the deploy one is automated)
[10:26:49] toolforge-deploy MRs can be updated and deployed multiple times (for example, when testing it), that's why I think it could be useful to have a record of when it was deployed (the updates are already tracked by gitlab itself)
[10:28:04] without that message, using this as an example https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/238 you would need to stare at both the MR update history and the SAL and manually do the math to see when each update was deployed
[10:29:08] I don't think this is high priority in any way, just some information correlation I've missed a few times
[10:29:50] it might be more useful in that regard to have a 'live' way of seeing all the versions that are currently deployed in an environment
[10:30:05] (something I have also missed several times already xd)
[10:30:47] mainly because deploys don't tell you the current state of the environment, as they can be reverted
[10:31:00] something like this?
[10:31:02] https://usercontent.irccloud-cdn.com/file/0aiEMHhO/image.png
[10:31:26] a bit nicer than helm list
[10:34:18] like highlighting mr-deployments, linking to the toolforge-deploy commit it was deployed with, etc.
[10:35:02] and filtering out user-charts
[10:35:28] grouping per component (instead of per helm chart)
[10:40:53] I am rebooting cloudgw1002 (currently inactive eqiad1 cloudgw) for https://phabricator.wikimedia.org/T366555
[10:42:45] ack
[10:43:03] dcaro: in lima-kilo, when running the functional tests I get this
[10:43:17] https://www.irccloud.com/pastebin/GAk5dne8/
[10:43:39] triggered by the setup_file step of the build-smoke-test file
[10:43:41] I got that once, but it went away the next time I tried to run it
[10:44:05] I'm getting it always :-(
[10:44:06] I think it might happen when trying to clean up if there's still no harbor project created for that tool
[10:44:24] most likely it is a deadlock
[10:44:37] we can add 'toolforge build clean --yes-i-know 2>/dev/null || :' to the tests to ignore failures
[10:44:38] can't create the harbor project bc it can't run the tests
[10:44:49] ok, will send that patch
[10:45:17] but probably the client should just not fail and say that no builds have been created yet
[10:45:38] though the FORBIDDEN returned by harbor is not helpful
[10:45:47] (instead of not found or such)
[10:46:51] it might be a change of behavior in harbor, as the api already checked that
[10:46:54] https://www.irccloud.com/pastebin/gqgfREu7/
[10:47:03] https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/321
[10:50:59] thanks
[10:51:44] arturo: to gracefully fail over from cloudgw1001 to cloudgw1002, I should stop keepalived on 1001, right?
[10:51:58] taavi: mmmm
[10:52:08] yeah, I think that should do it
[10:52:18] they don't run BPG yet, right?
[10:52:21] BGP*
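Not the dashboard from the screenshot above, but a rough command-line approximation of the "what is currently deployed" view discussed around 10:29-10:35; it assumes cluster admin access and that user charts live in namespaces with a tool- prefix, which is a guess:

    # list every helm release, drop tool-owned ones, and show namespace, release, chart version and last update
    helm list --all-namespaces --output json \
      | jq -r '.[] | select(.namespace | startswith("tool-") | not) | [.namespace, .name, .chart, .updated] | @tsv' \
      | column -t -s $'\t'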
[10:52:29] arturo: when rebooting the lima-kilo VM, jobs-api starts failing with
[10:52:30] │ requests.exceptions.ConnectionError: HTTPSConnectionPool(host='kubernetes.default.svc', port=443): Max retries exceeded with url: /api/v1/namespaces/tf-public/configmaps/image-config (Caused by NameResolutionError(" (the python api)
[10:52:51] arturo: correct
[10:52:54] [Errno -3] Temporary failure in name resolution
[10:52:54] xd
[10:52:56] ok trying
[10:53:06] taavi: anyway, a normal reboot should also be graceful enough
[10:53:42] dcaro: fails to resolve the k8s API FQDN itself, that's weird
[10:53:43] yeah, I just wanted to be extra careful juuust in case there was something in the new kernel that required an immediate rollback
[10:53:46] but it seems like there is not
[10:53:54] taavi: ack
[10:53:56] now rebooting 1001
[10:54:06] arturo: it happens with some of the services actually, cert-manager, kyverno and metrics too fail to start the first time
[10:54:26] dcaro: maybe they start before coredns itself is started
[10:56:13] kyverno is not coming up yet
[11:00:07] downloading images?
[11:01:42] E0610 11:00:52.592068 1 reflector.go:148] k8s.io/client-go@v0.27.3/tools/cache/reflector.go:231: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: Get "https://10.96.0.1:443/api/v1/namespaces/kyverno/configmaps?fieldSelector=metadata.name%3Dkyverno-metrics&limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
[11:01:55] from the kyverno-admission-controller
[11:02:02] tohe
[11:02:07] *note that I have 4 workers
[11:02:16] that should be the k8s API address
[11:03:08] let me re-delete the deployments for it again to force recreating the pods
[11:03:19] *re-delete the pods to recreate the containers
[11:03:36] still the same error
[11:04:37] is the k8s API up? did you load the base PSP?
[11:04:52] did you run the ansible setup script in full?
[11:05:28] the ip might have changed after the reboot
[11:05:29] 10.244.0.1/32
[11:05:41] the k8s API address?
[11:06:04] the one I see from 'ip a' on the toolforge-control-node container
[11:06:24] mmmm
[11:06:45] so maybe that's self-inflicted by the fact that we change the hostname and such after the VM creation?
[11:07:35] I think it might be a combination of multiple nodes + docker not keeping the networks after reboot
[11:07:43] let me try to reboot again, see if the ips change
[11:08:28] but the k8s API address should remain stable, I expect it not to be affected by stuff happening outside k8s
[11:08:30] hmm... though those ips are not allocated by docker
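For reference, a sketch of the debugging steps mentioned around 11:02-11:05: checking what the in-cluster API address (the 10.96.0.1 in the kyverno error) currently is, and re-deleting the kyverno pods so their deployments recreate them; assumes kubectl access from inside the lima-kilo VM:

    # the ClusterIP of the apiserver service, which should stay stable across reboots
    kubectl get service kubernetes -n default -o wide
    # delete the kyverno pods so the deployments recreate the containers, then watch them come back
    kubectl -n kyverno delete pod --all
    kubectl -n kyverno get pods -w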
[11:09:32] hmm, maybe it's the docker inside the nodes (the one k8s uses)
[11:09:45] that should be containerd
[11:10:04] they use containerd, yep
[11:10:31] got confused by '--provider-id=kind://docker/toolforge/toolforge-control-plane' in the arguments
[11:20:24] another weirdness, dcaro
[11:20:29] in my system
[11:20:38] @test "tail logs and wait (slow)" {
[11:20:38] # this also waits for it to finish
[11:20:38] toolforge build logs -f
[11:20:38] }
[11:20:56] this returns but the Status: OK is still not reported, so the next test fails
[11:21:04] https://www.irccloud.com/pastebin/lA8kGgTp/
[11:22:19] you had a patch for that, I commented there. The issue is that the pipelinerun object might take a bit to get updated by the tekton controller, so even if the pod finished successfully, the object is not updated right away (eventual consistency means temporal inconsistency xd)
[11:23:05] about the extra k8s nodes, cordoning all but one worker makes everyone happy, so for some reason the network seems to get borked after a reboot inside kind
[11:24:23] oh, but jobs-api fails with
[11:24:25] https://www.irccloud.com/pastebin/gDcwP2h9/
[11:24:37] is it trying to mount it from the worker node?
[11:25:06] it's there :/
[11:25:08] https://www.irccloud.com/pastebin/406agjF2/
[11:28:09] I'll rebuild the VM
[11:30:03] * dcaro lunch
[11:30:42] arturo: this is the MR for the build show issue: https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/320
[11:48:36] ok! I forgot
[12:47:36] oh, I just got the cpu issue!
[12:47:39] https://www.irccloud.com/pastebin/JkslKeno/
[12:47:47] clean VM, just ran the functional tests
[12:47:54] 🎉
[12:48:17] maybe it is not a CPU issue, but why is the node not available?
[12:48:34] it might be related to the weird network condition after a reboot
[12:49:08] kyverno is taking ~20% of the cpu for itself
[12:49:37] how did you see that?
[12:49:37] volume admission gets 14%
[12:49:43] kubectl describe node
[12:50:07] it has 10 pods, requesting 2% each (100m)
[12:50:43] jobs-api does not request anything xd (0%)
[12:50:48] maybe that's the solution?
[12:50:51] mmm
[12:51:03] this is requests/limits, not actual usage, no?
[12:51:19] ok
[12:51:26] yep, that's what's getting exhausted
[12:51:29] so maybe we can instruct kyverno not to request anything in lima-kilo
[12:52:33] the others probably too
[14:59:20] cloudweb puppet alerts are me, fix incoming
[14:59:38] ack, I was going to ask
[15:00:09] (git logs did not show anything very clearly xd)
[15:02:50] fixed
[15:23:04] As a result of the chat here last week about options for the redis container I had proposed I made . Feedback and testing welcome.
[15:27:17] bd808: that's really cool!
[15:31:19] the coolest part is that I just had to put together bits that y'all have made :)
[15:33:52] thanks bd808 !
[16:04:06] * arturo offline
[17:31:29] * dcaro off
[17:44:03] This "containers" tool also gave me a good excuse to use Striker's support for multiple toolinfo records per tool that was noticed in this channel a few weeks ago -- https://toolsadmin.wikimedia.org/tools/id/containers
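Going back to the requests-versus-usage point around 12:49-12:51, this is roughly how those percentages can be inspected; the node name below is the usual kind control-plane name and may differ in a given lima-kilo setup:

    # the totals "kubectl describe node" shows are reserved requests/limits, not live CPU usage
    kubectl describe node toolforge-control-plane | sed -n '/Allocated resources:/,/Events:/p'
    # per-pod CPU requests across namespaces, to see which components reserve what
    kubectl get pods -A -o custom-columns='NS:.metadata.namespace,POD:.metadata.name,CPU_REQ:.spec.containers[*].resources.requests.cpu'

Dropping a request entirely (or setting it to zero) would then be a change in each chart's values for lima-kilo, as suggested at 12:51.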