[07:06:18] greetings
[09:02:14] morning
[09:02:42] hey
[09:04:22] morning
[09:06:29] o/
[10:33:39] dcaro: https://cluebotng-monitoring.toolforge.org/d/f0116e33-c519-4562-b625-306262d46dd0/kubelet-active-pods the model change was merged on the 25th, bit of a generic metric but it does look like the nodes were really slowly getting drained as thought
[10:36:37] nice, it did stop going down pretty quick though, and even recovered a bit, that makes it weird
[10:40:14] that's also when we did changes to the default requests and limits, so that might also have an impact (as the nfs workers were saturated while non-nfs were not, freeing quotas in the nfs ones would move workload to them too)
[10:41:52] the flickering of the number of pods in nfs workers is big enough to make up for any changes in the non-nfs ones xd, stats are hard
[10:46:31] Yeah, I'm also not sure what the big drop on nfs is and we can't select by image type
[10:47:13] I just re-ran all my tools and they moved back as expected - one then threw errors because I missed the mount option; perhaps it's worth a note in the changelog as there's been a month of things doing one (incorrect) thing and now they will 'break'
[10:53:52] 👍
[11:42:45] fixing some calico CRDs on toolsbeta (see https://gitlab.wikimedia.org/repos/cloud/toolforge/calico/-/merge_requests/18)
[11:52:41] wot, the functional tests failed but the output seems to be off the screen somehow? D:
[11:53:09] huh, got a screenshot?
[11:53:15] (more curious than anything xd)
[11:55:13] dcaro: https://phabricator.wikimedia.org/F68239972
[11:55:52] it's off screen on the left!
[11:57:11] the 'run this command to run these tests manually' command does not work either: https://phabricator.wikimedia.org/P84343
[11:59:08] good point, it still needs some env variables :/
[11:59:10] `TEST_TOOL_UID`
[11:59:16] want me to file a task?
[11:59:42] yes please, maybe also the cwd might matter
[12:01:03] taavi: oh, I see you are running the maintain-harbor tests, those might be broken
[12:01:23] wmcs.toolforge.component.deploy is trying to run those
[12:01:30] filed T408679
[12:01:32] T408679: Command given to run failing functional tests manually does not function - https://phabricator.wikimedia.org/T408679
[12:02:12] that's the default when the component does not specify a specific suite, this is calico right?
[12:02:16] yes
[12:02:30] yep, I think it does not have anything set
[12:02:42] you can try running for example jobs-api, that would cover most stuff
[12:03:03] (not using the deploy, but using the cookbook to run the tests, or manually, passing -c jobs-api)
[12:05:54] like, from a clone of toolforge-deploy, `dcaro@tools-bastion-15:~/toolforge-deploy$ utils/run_functional_tests.sh -c jobs-api`, or `cookbook wmcs.toolforge.run_tests --component jobs-api --cluster-name toolsbeta`
[12:06:08] found the cookbook, doing that now
[12:06:17] but IMHO wmcs.toolforge.component.deploy should not fail like this
[12:06:32] it should not, maintain-harbor tests should be passing
[12:07:28] let me try those tests again to see whether it was a one-off or what
[12:07:40] they are broken, I think Raymond_Ndib.e is working on it
[12:07:50] let me find the task
[12:08:40] T407496
[12:08:41] T407496: [maintain-harbor] Failing to cleanup stale artifacts - https://phabricator.wikimedia.org/T407496
[12:09:15] oh, it was merged but failing to deploy https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/1012
[12:09:19] so half-deploy :/
[12:09:21] *deployed
[12:09:44] looking
[12:09:52] that seems like the same error I hit
[12:11:17] does it happen if you run only the jobs-api tests?
[12:11:26] I'll check the tests in prod also
[12:12:08] not so far
[12:13:02] okok, I'll look into the maintain-harbor tests
[12:14:37] i think it's just that that maintain-harbor issue r.aymond is looking at hasn't been fixed properly
[12:14:41] yep
[12:14:59] the other day we discussed solving it a different way than it is
[12:15:50] you can try skipping tests on deploy, and running them manually
[12:15:51] `--skip-tests`
[12:16:19] or passing a `--filter-tags` with `jobs-api` only
[12:16:29] the jobs-api tests are happy, so I'll deploy the calico crd fix to tools with that if that seems fine to you?
[12:16:33] as a workaround
[12:16:43] yep lgtm
[12:16:47] cool
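For reference, a minimal sketch of running a single test suite by hand as discussed above. This is illustrative only: it assumes `TEST_TOOL_UID` is the main variable the auto-printed command was missing, that the value shown is a placeholder, and that running from the toolforge-deploy checkout is the cwd the harness expects (T408679 tracks making the printed command work as-is).

```bash
# Rough sketch, not the canonical invocation (see T408679): assumes a
# toolforge-deploy checkout on the bastion, that the cwd matters, and that
# TEST_TOOL_UID is the main env var the printed command was missing.
cd ~/toolforge-deploy
export TEST_TOOL_UID=12345          # placeholder value, for illustration only
utils/run_functional_tests.sh -c jobs-api

# or let the cookbook set the environment up instead:
# cookbook wmcs.toolforge.run_tests --component jobs-api --cluster-name toolsbeta
```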
[12:23:30] dcaro: can you think of any reason why the deploy cookbook should not show the helmfile apply output? since right now we get extremely unhelpful errors like this: https://phabricator.wikimedia.org/P84344
[12:24:29] sometimes if there's no diff when applying the helmfile it does not return any diff output
[12:24:40] I think that might be a spicerack/tests interaction, looking
[12:25:31] yep, I think it's that, spicerack swallows the output on error
[12:25:40] did it report on the MR though?
[12:25:46] (it should have)
[12:26:41] nothing, since that's the deploy stage and not the test stage
[12:26:43] volans: ^ iirc there was some fix on the way? Maybe already there?
[12:26:53] true
[12:26:55] RemoteExecutionError has results if that's what you're looking for
[12:27:05] https://doc.wikimedia.org/spicerack/master/api/spicerack.remote.html#spicerack.remote.RemoteClusterExecutionError
[12:27:13] sorry wrong link
[12:27:17] https://doc.wikimedia.org/spicerack/master/api/spicerack.remote.html#spicerack.remote.RemoteExecutionError
[12:27:32] not sure if that's relevant here, I don't think we try to capture this output anyway?
[12:28:30] unless print_output=False it should be printed indeed
[12:29:27] I think we are hiding the output
[12:29:49] https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1199760/
[12:30:05] we use wmcs_libs/common.py:CUMIN_UNSAFE_WITHOUT_OUTPUT = CuminParams(print_output=False, print_progress_bars=False)
[12:31:32] taavi: +1d
[12:34:21] and pair that with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1199761
[12:34:50] * volans lunch
[12:35:14] seems reasonable
[12:38:04] which reveals that the reason the deploy failed is that helm timed out waiting for all the calico-node pods to be upgraded
[12:38:30] oh, that's happened before with others, tekton for example
[12:38:39] kyverno does that too
[12:38:56] yep, just making a patch to copy-paste the timeout bump from kyverno :D
[12:39:02] but I remember seeing the progress... so I might not have used the cookbook?
[12:39:40] I suspect that alloy will also fail there, as if the daemonset changes, it will take a while to restart all pods
[12:39:53] (as it has a disruption budget of 1 iirc)
[12:40:33] depends on how fast that starts up I guess. the calico one takes a real while
[12:41:15] I manually restarted them last time and it took a while (in the order of 10s of minutes to restart all)
[12:41:20] https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/1020
[12:41:48] that might be around there yep xd
[12:58:13] taavi: I got to go for lunch, if you are confident feel free to deploy the newer version of calico, otherwise I can review and test later
[13:24:16] I'll push it to toolsbeta and then head to lunch myself as well
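Since the deploy cookbook shows no progress while helm waits on the daemonset, one way to follow the calico-node roll by hand is sketched below. The daemonset name, namespace and label selector are assumptions based on upstream calico defaults, so adjust them to whatever our chart actually renders.

```bash
# Follow the rollout that helm is waiting on (names/labels assumed from upstream calico).
kubectl -n kube-system rollout status daemonset/calico-node --timeout=60m

# Or eyeball which pods have already picked up the new image:
kubectl -n kube-system get pods -l k8s-app=calico-node \
  -o custom-columns=NAME:.metadata.name,IMAGE:.spec.containers[0].image
```

If the daemonset replaces one pod at a time, a fleet of ~80 workers can easily take longer than helm's default wait, which is why the timeout bump ends up being copy-pasted between components.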
[15:21:46] Does anyone know off the top of their head how often the kube client certs are rotated?
[15:27:06] Damianz: about weekly
[15:28:24] thanks
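If a more precise answer is needed than "about weekly", the cert itself can be inspected from the tool account. A sketch below; the `~/.toolskube/client.crt` path is an assumption about where the tool's client cert lives, so check the kubeconfig first.

```bash
# Check the actual expiry of a tool's kube client cert (path is an assumption;
# the tool's kubeconfig is the source of truth for where the cert lives).
grep client-certificate "$HOME/.kube/config"
openssl x509 -noout -enddate -in "$HOME/.toolskube/client.crt"
```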
[15:56:27] I am upgrading calico on toolforge, I don't expect any impact but FYI just in case
[16:08:51] Damianz: fyi. there's a random trigger too, so it's not always at the same time
[16:14:29] * taavi wonders whether even the 30min timeout is enough
[16:18:37] does it show any kind of progress? iirc kyverno did not, and if you don't know it's going to take long, it gets you quite nervous
[16:19:39] it does not, so I am just looking at the pods in kube-system manually
[16:20:00] looks like about 50 out of 80 pods done in 20 minutes
[16:21:52] uff, yep, looks a bit tight
[16:29:48] yep, timed out with a good 15-20 nodes to go
[16:29:51] * taavi doubles the limit
[16:30:44] https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/1025
[16:34:14] dcaro: yeah, basically I will just keep nfs mounts for alloy for now... the job that gets the copied to the env daily doesn't matter if it fails for 1 day (for now)... but thanks for the info
[16:38:11] andrewbogott: this is the guide I followed for publishing a .deb to production https://wikitech.wikimedia.org/wiki/Debian_packaging/Package_your_software_as_deb
[16:47:08] thanks! I'm told that this is the new hotness: https://wikitech.wikimedia.org/wiki/Debian_packaging_with_dgit_and_CI
[16:47:31] but I need to get lunch and run errands, will tackle when I'm back
[17:06:04] andrewbogott: that page links to my page on the second paragraph... but doesn't specify when you should use one or the other :)
[17:08:00] oh and the answer to that question is here: https://wikitech.wikimedia.org/wiki/Debian_packaging#Rebuilding_a_package
[18:02:31] I made a few changes to https://wikitech.wikimedia.org/wiki/Debian_packaging/Package_your_software_as_deb, hopefully it's a bit clearer now
[18:02:55] bd808: I saw you were also editing it, I should've retained your changes
[18:04:58] dhinus: looks good to me.
[18:10:19] * dcaro off, cya tomorrow
[18:13:39] * dhinus also off