[09:02:23] morning
[09:03:01] o/ morning
[09:20:43] hey dcaro I have been thinking about the `dump` function for toolforge jobs
[09:20:44] https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/17
[09:21:24] what if all the new logic lived in the API, with a new endpoint like `/dump`, so all the magic to minimize the defaults etc. would live on the server side?
[09:21:39] and the CLI only does the GET + print
[09:22:48] that's one of the goals of https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/67, so validation is done only on the API side (as far as it's reasonable)
[09:23:44] the only concern I have with the dump endpoint is that it overlaps with the "future" orchestrator endpoint, that will be using a different format for the config https://docs.google.com/document/d/1y7cIX3oiqOH8hEuPhSEuqWx-yElxTq9Ga8qJRvqAJjY/edit#heading=h.y2ic81vn5mos
[09:25:01] I see
[09:25:48] though even if it does, once we have API versioning management it would be easier to manage on the API side than on the cli
[09:26:26] then I see 2 ways to move forward
[09:26:45] 1) merge the MR as is, just as a stopgap, and revisit later when we have the component API
[09:26:57] 2) don't merge, and wait for the component API
[09:27:00] what do you think?
[09:30:57] I think that the component API will take some time, so we should ship the feature sooner
[09:32:38] ok! then I'll wait in case you want to review it too, then merge
[09:33:01] is it ready for review then?
[09:33:07] I think so, yes
[09:33:08] (/me was waiting for the label)
[09:33:45] just added it
[09:34:26] ack, will review
[09:37:28] thanks
[09:41:26] fyi, I use this https://wm-lol.toolforge.org/?q=mrs to monitor MRs that need reviews, so usually things without the `Needs review` label don't show up
[09:42:14] actually it's https://wm-lol.toolforge.org/api/v1/search?query=mrs, which ends up being https://gitlab.wikimedia.org/groups/repos/cloud/-/merge_requests?scope=all&state=opened&label_name[]=Needs%20review&approved_by_usernames[]=None
[09:42:29] (whatever... I just type `mrs` in my address bar)
[09:44:11] quick review: https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/112
[10:09:12] dcaro: approved
[10:09:22] * arturo running an errand, be back in a bit
[10:09:26] thanks!
[10:38:26] thoughts? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1013963
[10:40:56] taavi: I thought about it a few times, but I'm not sure what the rationale behind the default value is
[10:41:14] I mean: what's the downside? is there any situation where we want a short interval?
[10:42:03] I will comment in the patch so the discussion can be found in the future :)
[10:43:02] in my mind someone either spots the initial alert on irc/email when it first comes, or someone sees it on the alertmanager dashboard; I'm not sure when a repeat alert is /that/ useful
[10:43:14] * taavi comments on patch too
[10:59:29] heads-up, I'm replacing the metricsinfra alertmanager host with a bookworm one; if you see weird alerts it might be due to that
[11:00:19] ack
[11:01:49] ack, 24h sounds good to me too
[13:24:53] * dcaro lunch
[13:37:30] taavi: on Friday you said that you expected acme-chief to be able to sync between nodes but I don't see evidence that that's properly set up on cloud-vps (it uses keyholder but keyholder doesn't have any keys) -- have you seen that bit work on VMs before? Am I just missing some hiera?
[13:38:17] it works on tools at least. you would need to generate those keys I guess.
[13:42:50] ok. for the moment I did a manual sync and the next issue is '[unable to get local issuer certificate for /CN=cloudinfra-acme-chief-02.cloudinfra.eqiad1.wikimedia.cloud]' which I think is unrelated to the sync thing
[13:43:18] where are you seeing that?
[13:44:03] for example mx-out03.cloudinfra.eqiad1.wikimedia.cloud
[13:47:59] `taavi@cloudinfra-acme-chief-02:~$ sudo systemctl restart nginx` fixed that
[13:48:47] huh
[13:50:11] yeah, that was weird, since a plain `reload` did not help. but `openssl s_client` was showing a different issuer than what the certificates on disk were, so I figured it had to be the old certificates still in memory somewhere
[13:51:29] ok, maybe this will save me next time: https://wikitech.wikimedia.org/w/index.php?title=Acme-chief%2FCloud_VPS_setup&diff=2162397&oldid=2124649
[13:51:48] thanks! That may actually be the last piece for this new acme-chief host but let's see...
[13:54:46] mx-out05.cloudinfra.eqiad1.wikimedia.cloud just installed a bunch of certs so I think we're good.
[13:54:56] now I need to figure out how to test mx servers...
[13:55:22] oh, that's going to be annoying since we (still) don't have host-independent service names for those
[13:57:29] yeah, I added new .wmcloud.org dns for them but there are likely other pieces missing...
[13:58:36] we really should have, say, mx-out-{a,b}.cloudinfra.wmcloud.org and re-use those every time we replace the hosts instead of having the individual host name in the public names
[14:00:04] that seems fine... I can add dns and certs for that
[14:06:18] looking for a review of: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1013523
[16:58:31] * arturo offline
[17:34:41] * dcaro off
[17:34:43] cya tomorrow
[18:03:19] This diff tool seems kind of neat -- https://difftastic.wilfred.me.uk/
[18:24:34] * bd808 lunch
[18:28:49] it does!
[18:34:59] I just now replaced the old Buster cloudinfra mx-out servers. Tests suggest that it's still working, but please let me know if you find issues with outbound emails.
[21:29:20] Commands like `kubectl get pods` are being excruciatingly slow right now. What dashboard should I look at to see how loaded the k8s cluster is?
[21:32:37] * bd808 kills a huge cpu-sucking sha1sum command on tools-sgebastion-10
[21:35:23] The `ps` command I ran didn't include username in the output, but someone was trying to compute the sha1sum of about 1000 pdf files at the same time. :sigh:
[21:36:00] all with highly descriptive names like "0a0ff02400db424ef8e7ce165d066b4b72e570a5.pdf"
[21:36:58] load on the instance has gone down from ~8 to 0.4 as a result of the kill :)
[21:37:06] that file name reminds me of https://phabricator.wikimedia.org/T349913
[21:37:57] taavi: I was thinking the same thing
[21:39:42] wtf? https://phabricator.wikimedia.org/P58916 -- I can't `become hoiscript`
[21:40:08] Some goofiness in their profile files maybe
[21:41:07] ugh. a fish shell switch in `.bash_profile`
[21:44:58] taavi: it was totally T349913
[21:44:59] T349913: 'hoiscript' tool uses an unreasonable amount of disk space - https://phabricator.wikimedia.org/T349913
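A side note on the stale-certificate exchange around 13:50 above: a minimal way to confirm that nginx is still serving an old certificate is to compare the issuer it presents on the wire with the issuer of the file on disk. The host, port, and on-disk path below are assumptions for illustration rather than values taken from the log:

```
# Issuer and expiry of the certificate nginx is serving right now (host/port are assumptions)
echo | openssl s_client -connect cloudinfra-acme-chief-02.cloudinfra.eqiad1.wikimedia.cloud:443 \
    -servername cloudinfra-acme-chief-02.cloudinfra.eqiad1.wikimedia.cloud 2>/dev/null \
  | openssl x509 -noout -issuer -enddate

# Issuer and expiry of the certificate on disk (path/filename are assumptions; adjust for the actual acme-chief layout)
sudo openssl x509 -in /etc/acmecerts/<cert-name>/live/rsa-2048.chained.crt -noout -issuer -enddate

# If the two issuers differ, the old certificate is still loaded in memory and a plain reload may not help
sudo systemctl restart nginx
```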
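And on the tools-sgebastion-10 cleanup around 21:32: the exact `ps` invocation used is not in the log (it reportedly lacked a username column), but something along these lines lists the owning user next to the top CPU consumers; the PID at the end is a hypothetical placeholder:

```
# Check overall load on the instance
uptime

# Top CPU consumers, with owning user, sorted by CPU usage
ps -eo user,pid,pcpu,pmem,etime,args --sort=-pcpu | head -n 15

# Once the offending process is identified, stop it
sudo kill <pid>
```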