[09:02:23] morning
[09:03:01] o/ morning
[09:20:43] hey dcaro I have been thinking about the `dump` function for toolforge jobs
[09:20:44] https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/17
[09:21:24] what if all the new logic lived in the API, with a new endpoint like `/dump`, so all the magic to minimize the defaults etc. would live on the server side?
[09:21:39] and the CLI only does the GET + print
[09:22:48] that's one of the goals of https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/67, so validation is done only on the API side (as far as it's reasonable)
[09:23:44] the only concern I have with the dump endpoint is that it overlaps with the "future" orchestrator endpoint, that will be using a different format for the config https://docs.google.com/document/d/1y7cIX3oiqOH8hEuPhSEuqWx-yElxTq9Ga8qJRvqAJjY/edit#heading=h.y2ic81vn5mos
[09:25:01] I see
[09:25:48] though even if it does, once we have API versioning management it would be easier to manage on the API side than on the cli
[09:26:26] then I see 2 ways to move forward
[09:26:45] 1) merge the MR as is, just as a stopgap, and revisit later when we have the component API
[09:26:57] 2) don't merge, and wait for the component API
[09:27:00] what do you think?
[09:30:57] I think that the component API will take some time, so we should ship the feature sooner
[09:32:38] ok! then I'll wait in case you want to review it too, then merge
[09:33:01] is it ready for review then?
[09:33:07] I think so, yes
[09:33:08] (/me was waiting for the label)
[09:33:45] just added it
[09:34:26] ack, will review
[09:37:28] thanks
[09:41:26] fyi, I use this https://wm-lol.toolforge.org/?q=mrs to monitor MRs that need reviews, so usually things without the `Needs review` label don't show up
[09:42:14] actually it's https://wm-lol.toolforge.org/api/v1/search?query=mrs, which ends up being https://gitlab.wikimedia.org/groups/repos/cloud/-/merge_requests?scope=all&state=opened&label_name[]=Needs%20review&approved_by_usernames[]=None
[09:42:29] (whatever... I just type `mrs` in my address bar)
[09:44:11] quick review: https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/112
[10:09:12] dcaro: approved
[10:09:22] * arturo running an errand, be back in a bit
[10:09:26] thanks!
[10:38:26] thoughts? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1013963
[10:40:56] taavi: I thought about it a few times, but I'm not sure what the rationale behind the default value is
[10:41:14] I mean: what's the downside? is there any situation where we want a short interval?
[10:42:03] I will comment in the patch so the discussion can be found in the future :)
[10:43:02] in my mind someone either spots the initial alert on irc/email when it first comes, or someone sees it on the alertmanager dashboard; I'm not sure when a repeat alert is /that/ useful
[10:43:14] * taavi comments on patch too
[10:59:29] heads-up, I'm replacing the metricsinfra alertmanager host with a bookworm one; if you see weird alerts it might be due to that
[11:00:19] ack
[11:01:49] ack, 24h sounds good to me too
[13:24:53] * dcaro lunch
[13:37:30] taavi: on Friday you said that you expected acme-chief to be able to sync between nodes but I don't see evidence that that's properly set up on cloud-vps (it uses keyholder but keyholder doesn't have any keys) -- have you seen that bit work on VMs before? Am I just missing some hiera?
[13:38:17] it works on tools at least. you would need to generate those keys I guess.
[13:42:50] ok. for the moment I did a manual sync and the next issue is '[unable to get local issuer certificate for /CN=cloudinfra-acme-chief-02.cloudinfra.eqiad1.wikimedia.cloud]' which I think is unrelated to the sync thing
[13:43:18] where are you seeing that?
[13:44:03] for example mx-out03.cloudinfra.eqiad1.wikimedia.cloud
[13:47:59] `taavi@cloudinfra-acme-chief-02:~$ sudo systemctl restart nginx` fixed that
[13:48:47] huh
[13:50:11] yeah, that was weird, since a plain `reload` did not help. but `openssl s_client` was showing a different issuer than what the certificates on disk were, so I figured it had to be the old certificates still in memory somewhere
[13:51:29] ok, maybe this will save me next time: https://wikitech.wikimedia.org/w/index.php?title=Acme-chief%2FCloud_VPS_setup&diff=2162397&oldid=2124649
[13:51:48] thanks! That may actually be the last piece for this new acme-chief host but let's see...
[13:54:46] mx-out05.cloudinfra.eqiad1.wikimedia.cloud just installed a bunch of certs so I think we're good.
[13:54:56] now I need to figure out how to test mx servers...
[13:55:22] oh, that's going to be annoying since we (still) don't have host-independent service names for those
[13:57:29] yeah, I added new .wmcloud.org dns for them but there are likely other pieces missing...
[13:58:36] we really should have, say, mx-out-{a,b}.cloudinfra.wmcloud.org and re-use those every time we replace the hosts instead of having the individual host name in the public names
[14:00:04] that seems fine... I can add dns and certs for that
[14:06:18] looking for a review of: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1013523
[16:58:31] * arturo offline
[17:34:41] * dcaro off
[17:34:43] cya tomorrow
[18:03:19] This diff tool seems kind of neat -- https://difftastic.wilfred.me.uk/
[18:24:34] * bd808 lunch
[18:28:49] it does!
[18:34:59] I just now replaced the old Buster cloudinfra mx-out servers. Tests suggest that it's still working, but please let me know if you find issues with outbound emails.
[21:29:20] Commands like `kubectl get pods` are being excruciatingly slow right now. What dashboard should I look at to see how loaded the k8s cluster is?
[21:32:37] * bd808 kills a huge cpu-sucking sha1sum command on tools-sgebastion-10
[21:35:23] The `ps` command I ran didn't include username in the output, but someone was trying to compute the sha1sum of about 1000 pdf files at the same time. :sigh:
[21:36:00] all with highly descriptive names like "0a0ff02400db424ef8e7ce165d066b4b72e570a5.pdf"
[21:36:58] load on the instance has gone down from ~8 to 0.4 as a result of the kill :)
[21:37:06] that file name reminds me of https://phabricator.wikimedia.org/T349913
[21:37:57] taavi: I was thinking the same thing
[21:39:42] wtf? https://phabricator.wikimedia.org/P58916 -- I can't `become hoiscript`
[21:40:08] Some goofiness in their profile files maybe
[21:41:07] ugh. a fish shell switch in `.bash_profile`
[21:44:58] taavi: it was totally T349913
[21:44:59] T349913: 'hoiscript' tool uses an unreasonable amount of disk space - https://phabricator.wikimedia.org/T349913
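A side note on the stale-certificate exchange around 13:50 above: a minimal way to confirm that nginx is still serving an old certificate is to compare the issuer it presents on the wire with the issuer of the file on disk. The host, port, and on-disk path below are assumptions for illustration rather than values taken from the log:

```
# Issuer and expiry of the certificate nginx is serving right now (host/port are assumptions)
echo | openssl s_client -connect cloudinfra-acme-chief-02.cloudinfra.eqiad1.wikimedia.cloud:443 \
    -servername cloudinfra-acme-chief-02.cloudinfra.eqiad1.wikimedia.cloud 2>/dev/null \
  | openssl x509 -noout -issuer -enddate

# Issuer and expiry of the certificate on disk (path/filename are assumptions; adjust for the actual acme-chief layout)
sudo openssl x509 -in /etc/acmecerts/<cert-name>/live/rsa-2048.chained.crt -noout -issuer -enddate

# If the two issuers differ, the old certificate is still loaded in memory and a plain reload may not help
sudo systemctl restart nginx
```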
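And on the tools-sgebastion-10 cleanup around 21:32: the exact `ps` invocation used is not in the log (it reportedly lacked a username column), but something along these lines lists the owning user next to the top CPU consumers; the PID at the end is a hypothetical placeholder:

```
# Check overall load on the instance
uptime

# Top CPU consumers, with owning user, sorted by CPU usage
ps -eo user,pid,pcpu,pmem,etime,args --sort=-pcpu | head -n 15

# Once the offending process is identified, stop it
sudo kill <pid>
```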