[07:06:18] greetings
[09:02:14] morning
[09:02:42] hey
[09:04:22] morning
[09:06:29] o/
[10:33:39] dcaro: https://cluebotng-monitoring.toolforge.org/d/f0116e33-c519-4562-b625-306262d46dd0/kubelet-active-pods the model change was merged on the 25th, bit of a generic metric but it does look like the nodes were really slowly getting drained as thought
[10:36:37] nice, it did stop going down pretty quick though, and even recovered a bit, that makes it weird
[10:40:14] that's also when we did changes to the default requests and limits, so that might also have an impact (as the nfs workers were saturated while non-nfs were not, freeing quotas in the nfs ones would move workload to them too)
[10:41:52] the flickering of the number of pods in nfs workers is big enough to make up for any changes in the non-nfs ones xd, stats are hard
[10:46:31] Yeah, I'm also not sure what the big drop on nfs is and we can't select by image type
[10:47:13] I just re-ran all my tools and they moved back as expected - one then threw errors because I missed the mount option; perhaps it's worth a note in the changelog as there's been a month of things doing one (incorrect) thing and now they will 'break'
[10:53:52] 👍
[11:42:45] fixing some calico CRDs on toolsbeta (see https://gitlab.wikimedia.org/repos/cloud/toolforge/calico/-/merge_requests/18)
[11:52:41] wot, the functional tests failed but the output seems to be off the screen somehow? D:
[11:53:09] huh, got a screenshot?
[11:53:15] (more curious than anything xd)
[11:55:13] dcaro: https://phabricator.wikimedia.org/F68239972
[11:55:52] it's off screen on the left!
[11:57:11] the 'run this command to run these tests manually' command does not work either: https://phabricator.wikimedia.org/P84343
[11:59:08] good point, it still needs some env variables :/
[11:59:10] `TEST_TOOL_UID`
[11:59:16] want me to file a task?
[11:59:42] yes please, maybe also the cwd might matter
[12:01:03] taavi: oh, I see you are running the maintain-harbor tests, those might be broken
[12:01:23] wmcs.toolforge.component.deploy is trying to run those
[12:01:30] filed T408679
[12:01:32] T408679: Command given to run failing functional tests manually does not function - https://phabricator.wikimedia.org/T408679
[12:02:12] that's the default when the component does not specify a specific suite, this is calico right?
[12:02:16] yes
[12:02:30] yep, I think it does not have anything set
[12:02:42] you can try running for example jobs-api, that would cover most stuff
[12:03:03] (not using the deploy, but using the cookbook to run the tests, or manually, passing -c jobs-api)
[12:05:54] like, from a clone of toolforge-deploy, `dcaro@tools-bastion-15:~/toolforge-deploy$ utils/run_functional_tests.sh -c jobs-api`, or `cookbook wmcs.toolforge.run_tests --component jobs-api --cluster-name toolsbeta`
[12:06:08] found the cookbook, doing that now
[12:06:17] but IMHO wmcs.toolforge.component.deploy should not fail like this
[12:06:32] it should not, maintain-harbor tests should be passing
[12:07:28] let me try those tests again to see whether it was a one-off or what
[12:07:40] they are broken, I think Raymond_Ndib.e is working on it
[12:07:50] let me find the task
[12:08:40] T407496
[12:08:41] T407496: [maintain-harbor] Failing to cleanup stale artifacts - https://phabricator.wikimedia.org/T407496
[12:09:15] oh, it was merged but failing to deploy https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/1012
[12:09:19] so half-deploy :/
[12:09:21] *deployed
[12:09:44] looking
[12:09:52] that seems like the same error I hit
[12:11:17] does it happen if you run only the jobs-api tests?
[12:11:26] I'll check the tests in prod also
[12:12:08] not so far
[12:13:02] okok, I'll look into the maintain-harbor tests
[12:14:37] i think it's just that that maintain-harbor issue r.aymond is looking at hasn't been fixed properly
[12:14:41] yep
[12:14:59] the other day we discussed solving it a different way than it is
[12:15:50] you can try skipping tests on deploy, and running them manually
[12:15:51] `--skip-tests`
[12:16:19] or passing a `--filter-tags` with `jobs-api` only
[12:16:29] the jobs-api tests are happy, so I'll deploy the calico crd fix to tools with that if that seems fine to you?
[12:16:33] as a workaround
[12:16:43] yep lgtm
[12:16:47] cool
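For reference, a minimal sketch of running a single test suite by hand as discussed above. This is illustrative only: it assumes `TEST_TOOL_UID` is the main variable the auto-printed command was missing, that the value shown is a placeholder, and that running from the toolforge-deploy checkout is the cwd the harness expects (T408679 tracks making the printed command work as-is).

```bash
# Rough sketch, not the canonical invocation (see T408679): assumes a
# toolforge-deploy checkout on the bastion, that the cwd matters, and that
# TEST_TOOL_UID is the main env var the printed command was missing.
cd ~/toolforge-deploy
export TEST_TOOL_UID=12345          # placeholder value, for illustration only
utils/run_functional_tests.sh -c jobs-api

# or let the cookbook set the environment up instead:
# cookbook wmcs.toolforge.run_tests --component jobs-api --cluster-name toolsbeta
```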
[12:23:30] dcaro: can you think of any reason why the deploy cookbook should not show the helmfile apply output? since right now we get extremely unhelpful errors like this: https://phabricator.wikimedia.org/P84344
[12:24:29] sometimes if there's no diff when applying the helmfile it does not return any diff output
[12:24:40] I think that might be a spicerack/tests interaction, looking
[12:25:31] yep, I think it's that, spicerack swallows the output on error
[12:25:40] did it report on the MR though?
[12:25:46] (it should have)
[12:26:41] nothing, since that's the deploy stage and not the test stage
[12:26:43] volans: ^ iirc there was some fix on the way? Maybe already there?
[12:26:53] true
[12:26:55] RemoteExecutionError has results if that's what you're looking for
[12:27:05] https://doc.wikimedia.org/spicerack/master/api/spicerack.remote.html#spicerack.remote.RemoteClusterExecutionError
[12:27:13] sorry wrong link
[12:27:17] https://doc.wikimedia.org/spicerack/master/api/spicerack.remote.html#spicerack.remote.RemoteExecutionError
[12:27:32] not sure if that's relevant here, I don't think we try to capture this output anyway?
[12:28:30] unless print_output=False it should be printed indeed
[12:29:27] I think we are hiding the output
[12:29:49] https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1199760/
[12:30:05] we use wmcs_libs/common.py:CUMIN_UNSAFE_WITHOUT_OUTPUT = CuminParams(print_output=False, print_progress_bars=False)
[12:31:32] taavi: +1d
[12:34:21] and pair that with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1199761
[12:34:50] * volans lunch
[12:35:14] seems reasonable
[12:38:04] which reveals that the reason the deploy failed is that helm timed out waiting for all the calico-node pods to be upgraded
[12:38:30] oh, that's happened before with others, tekton for example
[12:38:39] kyverno does that too
[12:38:56] yep, just making a patch to copy-paste the timeout bump from kyverno :D
[12:39:02] but I remember seeing the progress... so I might not have used the cookbook?
[12:39:40] I suspect that alloy will also fail there, as if the daemonset changes, it will take a while to restart all pods
[12:39:53] (as it has a disruption budget of 1 iirc)
[12:40:33] depends on how fast that starts up I guess. the calico one takes a real while
[12:41:15] I manually restarted them last time and it took a while (in the order of 10s of minutes to restart all)
[12:41:20] https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/1020
[12:41:48] that might be around there yep xd
[12:58:13] taavi: I got to go for lunch, if you are confident feel free to deploy the newer version of calico, otherwise I can review and test later
[13:24:16] I'll push it to toolsbeta and then head to lunch myself as well
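Since the deploy cookbook shows no progress while helm waits on the daemonset, one way to follow the calico-node roll by hand is sketched below. The daemonset name, namespace and label selector are assumptions based on upstream calico defaults, so adjust them to whatever our chart actually renders.

```bash
# Follow the rollout that helm is waiting on (names/labels assumed from upstream calico).
kubectl -n kube-system rollout status daemonset/calico-node --timeout=60m

# Or eyeball which pods have already picked up the new image:
kubectl -n kube-system get pods -l k8s-app=calico-node \
  -o custom-columns=NAME:.metadata.name,IMAGE:.spec.containers[0].image
```

If the daemonset replaces one pod at a time, a fleet of ~80 workers can easily take longer than helm's default wait, which is why the timeout bump ends up being copy-pasted between components.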
[15:21:46] Does anyone know off the top of their head how often the kube client certs are rotated?
[15:27:06] Damianz: about weekly
[15:28:24] thanks
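If a more precise answer is needed than "about weekly", the cert itself can be inspected from the tool account. A sketch below; the `~/.toolskube/client.crt` path is an assumption about where the tool's client cert lives, so check the kubeconfig first.

```bash
# Check the actual expiry of a tool's kube client cert (path is an assumption;
# the tool's kubeconfig is the source of truth for where the cert lives).
grep client-certificate "$HOME/.kube/config"
openssl x509 -noout -enddate -in "$HOME/.toolskube/client.crt"
```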
[15:56:27] I am upgrading calico on toolforge, I don't expect any impact but FYI just in case
[16:08:51] Damianz: fyi. there's a random trigger too, so it's not always at the same time
[16:14:29] * taavi wonders whether even the 30min timeout is enough
[16:18:37] does it show any kind of progress? iirc kyverno did not, and if you don't know it's going to take long, it gets you quite nervous
[16:19:39] it does not, so I am just looking at the pods in kube-system manually
[16:20:00] looks like about 50 out of 80 pods done in 20 minutes
[16:21:52] uff, yep, looks a bit tight
[16:29:48] yep, timed out with a good 15-20 nodes to go
[16:29:51] * taavi doubles the limit
[16:30:44] https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/1025
[16:34:14] dcaro: yeah, basically I will just keep nfs mounts for alloy for now... the job that gets the copied to the env daily doesn't matter if it fails for 1 day (for now)... but thanks for the info
[16:38:11] andrewbogott: this is the guide I followed for publishing a .deb to production https://wikitech.wikimedia.org/wiki/Debian_packaging/Package_your_software_as_deb
[16:47:08] thanks! I'm told that this is the new hotness: https://wikitech.wikimedia.org/wiki/Debian_packaging_with_dgit_and_CI
[16:47:31] but I need to get lunch and run errands, will tackle when I'm back
[17:06:04] andrewbogott: that page links to my page on the second paragraph... but doesn't specify when you should use one or the other :)
[17:08:00] oh and the answer to that question is here: https://wikitech.wikimedia.org/wiki/Debian_packaging#Rebuilding_a_package
[18:02:31] I made a few changes to https://wikitech.wikimedia.org/wiki/Debian_packaging/Package_your_software_as_deb, hopefully it's a bit clearer now
[18:02:55] bd808: I saw you were also editing it, I should've retained your changes
[18:04:58] dhinus: looks good to me.
[18:10:19] * dcaro off, cya tomorrow
[18:13:39] * dhinus also off