[09:00:09] o/
[09:01:22] o/
[09:19:07] morning
[09:29:33] hmpf... I pushed to main on maintain-kubeusers, give me a sec to fix it
[09:30:00] do you have any pre-specified / recommended method to cherry-pick a change on cloudcumins for a cookbook? or do you do it locally?
[09:31:14] done
[09:31:23] I do it locally
[09:31:44] arturo: `test-cookbook -c [gerrit change number]`
[09:31:44] (for now)
[09:44:33] ok!
[10:11:00] for T357227, do we have any admin docs on how to give a tool write access to the elasticsearch cluster?
[10:11:01] * dhinus paged, can't look at it right now
[10:11:01] T357227: Elasticsearch credential request for capacity-exchange - https://phabricator.wikimedia.org/T357227
[10:11:27] I will be online in 1 hour
[10:11:29] blancadesal: https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin#Granting_a_tool_write_access_to_Elasticsearch
[10:11:32] dhinus: I will take a look
[10:12:04] arturo: "cloudvirt1032/nova-compute proc minimum" and "cloudvirt1032/ensure kvm processes are running" just paged, this you?
[10:12:10] yes, me
[10:12:14] just downtimed it
[10:12:15] it went away
[10:12:17] ack
[10:12:30] I'm reimaging the host
[10:12:39] ok
[10:12:48] that's a paging alert from icinga. I thought we got rid of all of those
[10:15:15] it seems nova-compute is still there
[10:23:38] blancadesal: hmm, the elasticsearch procedure might be worth amending to write the credentials as envvars instead of in an NFS file
[10:25:31] taavi: for toolsdb access we currently have both the envvars and a file, no?
[10:27:22] are envvars used for non-buildservice deploys?
[10:34:14] blancadesal: I'm not aware
[10:34:22] yes, they are used for all
[10:36:18] https://www.irccloud.com/pastebin/Sp0zgju6/
[10:36:20] FYI there seem to be fleet-wide puppet problems
[10:36:35] looking
[10:36:54] I refreshed CA certs last week, might be related
[10:37:17] dcaro: this is mostly hardware servers, so maybe prod related
[10:37:30] ack
[10:37:36] https://www.irccloud.com/pastebin/frEQymda/
[10:37:57] that seems puppet7 migration related
[10:38:15] see -operations, there may be an outage ongoing
[10:39:10] that error comes from the puppetmaster directly
[10:43:07] writing elasticsearch creds to envvars instead of an ini file seems ok to me. any opinions against?
[10:45:20] +1 from me
[10:58:46] could someone please lend me the management password while a misconfig in pwstore for my user is fixed?
[11:00:29] the elasticsearch cred docs say to run `tools-clushmaster-02:~$ sudo cumin "P{O:wmcs::toolforge::elastic7}" "run-puppet-agent"` — should that host be tools-cumin-1?
[11:00:44] yeah I fixed that already, you need to refresh the page :D
[11:00:51] haha
[11:26:57] the puppet issues should start resolving now
[11:27:02] is anyone looking at the tools NFS share already?
[11:30:06] taavi: I got as far as creating a task T357882 but I didn't do anything for real
[11:30:07] T357882: 2024-02-19: toolforge NFS cleanup - https://phabricator.wikimedia.org/T357882
[11:30:18] ok, are you planning to or should I?
[11:30:36] please, take over. I'm in the middle of a reimage
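Following up on the 10:23-10:43 discussion above about writing the elasticsearch credentials as envvars instead of an NFS ini file: a minimal sketch of how a tool could prefer envvars and fall back to a legacy ini file. The envvar names, ini path and section name below are hypothetical placeholders, not the names provisioned by the actual Toolforge procedure.

    import configparser
    import os
    from pathlib import Path

    # Hypothetical names, for illustration only; the real envvar names and
    # the legacy ini path are whatever the Toolforge procedure provisions.
    ENV_USER = "ELASTICSEARCH_USERNAME"
    ENV_PASSWORD = "ELASTICSEARCH_PASSWORD"
    INI_PATH = Path.home() / ".elasticsearch.ini"

    def elasticsearch_credentials() -> tuple[str, str]:
        """Prefer envvars, fall back to the legacy ini file on NFS."""
        user = os.environ.get(ENV_USER)
        password = os.environ.get(ENV_PASSWORD)
        if user and password:
            return user, password

        # Fallback: read the ini file from the tool's home directory.
        config = configparser.ConfigParser()
        config.read(INI_PATH)
        section = config["default"]  # hypothetical section name
        return section["user"], section["password"]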
[11:30:51] sure
[11:31:46] thanks
[12:14:39] hmm, the U key on my keyboard seems to occasionally require a stronger-than-usual press to work correctly :/
[12:20:45] not as bad as my old macbook where that happened on the "cmd" key and I was constantly typing "c" instead of copying :D
[12:21:18] (they did replace the keyboard for free though as it was the infamous batch of 2017 macbooks)
[12:22:03] yep, that seems worse
[12:23:12] I'll probably have to replace the switch or something. I think I have a few leftovers/spares lying around somewhere
[12:26:02] python question
[12:26:05] I get this traceback
[12:26:08] https://www.irccloud.com/pastebin/wkuyNxji/
[12:26:20] in this patch https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1004116
[12:26:40] it is not immediately obvious to me whether I need to do a cast, ask for an attribute, or what?
[12:31:23] arturo: in `args=["--deployment", self.cluster_name, "--hostname-list", self.hostname]` self.cluster_name is an enum and not a string. probably needs to be replaced with `self.cluster_name.value` if I recall correctly
[12:31:30] can I get a +1 on T357901?
[12:31:31] T357901: Request increased server-group-members quota for tools - https://phabricator.wikimedia.org/T357901
[12:32:18] +1'd
[12:32:43] thanks
[12:33:01] now let's see if the quota increase cookbook is familiar with this even existing. I certainly was not
[12:33:09] same here
[12:34:04] I think it worked
[12:34:13] | server-group-members | 80 |
[12:37:59] ok
[12:44:48] taavi: I tested https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1004088 and it works, though it may need to be extended later
[12:44:54] would you like to +1?
[12:45:55] I will look in a bit
[12:55:05] thanks
[12:55:17] * arturo brb
[13:52:48] just found an 846G log file on NFS :/
[14:05:22] well we're back to 77% usage (down from 86%). that was easy
[14:57:22] I'm out today (and about to be offline) but popping in to make sure the cloudvirt1032 alert isn't anything. arturo, that's just you reimaging with 1 nic?
[15:03:17] I think it was yes
[15:04:19] arturo reimaged that host this morning, I think there's only the KVM alert that is still firing
[15:05:23] ok!
[15:06:43] oh, you should resolve it on splunk then
[15:07:30] I'm only seeing it on icinga
[15:08:25] it's firing but acked both in alertmanager and victorops
[15:08:44] when it's resolved in icinga, it should autoresolve in am and vo
[15:09:28] the issue is that if it's acked in vo it will re-page in 24h
[15:09:39] or something like that
[15:10:33] yep I think that's right, if arturo needs more than 24 hours to put it back in service, we should probably resolve it manually in vo
[15:10:34] though I'm not sure that's enough to prevent it from re-paging again in vo :)
[15:10:56] my recollection is that even if the issue is fixed vo will re-page, I think it never resolves iself
[15:10:59] *itself
[15:11:34] at least for certain alerts I remember they just went to "resolved" in vo without me doing anything
[15:11:55] ok, well maybe we should experiment then :)
[15:12:10] Anyway, I'm out! Have a good rest of Monday all.
[15:13:05] see you tomorrow!
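Regarding the traceback discussed at 12:26-12:31 above: a plain Enum member cannot be used where a string is expected (for example in a subprocess argument list), while `.value` yields the underlying string. The snippet below is a minimal reproduction of that pattern with an illustrative enum, not the actual wmcs-cookbooks code.

    import subprocess
    from enum import Enum

    class ClusterName(Enum):
        # Illustrative members only; the real enum lives in wmcs-cookbooks.
        TOOLS = "tools"
        TOOLSBETA = "toolsbeta"

    cluster = ClusterName.TOOLS

    # Fails: subprocess expects str/bytes arguments, not an Enum member.
    # subprocess.run(["echo", "--deployment", cluster])
    #   -> TypeError: expected str, bytes or os.PathLike object, not ClusterName

    # Works: .value gives the plain string the command line needs.
    subprocess.run(["echo", "--deployment", cluster.value], check=True)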
[15:15:10] I downtimed the alert in icinga
[15:15:25] let's see what happens in vo
[15:17:40] this previous incident did auto-resolve in vo when icinga sent a RECOVERY notification https://portal.victorops.com/ui/wikimedia/incident/4440/timeline
[15:18:36] but a downtime is probably not going to be enough because icinga will not send a RECOVERY
[15:19:05] I will resolve it manually in vo
[15:20:19] sorry for the noise, I think I ran the downtime cookbook for 1 day, so not sure what happened here
[15:24:42] I think it downtimed alertmanager but not icinga, not sure why
[15:24:52] was it from cloudcumin?
[15:25:58] normal cumin
[15:26:16] the reimage cookbook should also take care of downtiming
[15:26:29] more specifically:
[15:26:31] aborrero@cumin1002:~ $ sudo cookbook sre.hosts.downtime cloudvirt1032.eqiad.wmnet -D 1 --reason "reimage"
[15:31:02] I'm checking the spicerack logs in cumin1002 and it seems like it created the downtime... but I can't find it in icinga
[15:31:57] was that downtime done before or after the reimage?
[15:32:15] after the reimage was completed
[15:32:31] it paged after KVM was missing the canary VM, after the reimage
[15:32:45] so I ran the downtime cookbook to silence the KVM check
[15:32:52] in the SAL I only see a manual downtime after the reimage https://sal.toolforge.org/production?p=0&q=aborrero&d=
[15:37:14] there's a downtime at 10:11 for 1 day, then there's the reimage starting at 11:11 (I guess the reimage wipes puppetdb and also wipes the downtime), then there's a second downtime at 11:28 but that's for 2 hours only, and it expired
[15:38:38] I think the reimage cookbook expects all services will be back within 2 hours, but kvm is still not up and running, so icinga alerted again at 13:34
[15:48:39] I'm considering no longer using the jobs-api and build service phabricator tags in favor of using toolforge everywhere (and maybe manually adding `[jobs-api,buildservice]`-style prefixes to the tasks)
[15:48:56] especially as we are all working together on toolforge and there are currently no 'subgroups'
[15:49:03] that would allow having one single workboard
[15:49:38] we can discuss tomorrow in the check-in
[16:08:01] I don't think I have a strong opinion. Having a dedicated tag for jobs-api has been useful in the past given the number of tickets we had for it alone. But maybe today is different?
[16:08:50] we were also splitting jobs-api and buildservice work from toolforge at large (and two different groups of people focusing on each), now we are just one group
[16:23:59] I don't have a strong opinion either, but I think the pro of having separate tags is that the backlog looks more organized, and we can still have a single board for "prioritized & in progress" tasks (the "toolforge iteration" milestone board)
[16:25:07] if we move to a single board, we could try adding links in the sidebar to easily filter the board to show only tasks including "[jobs-api]" in the title. that might be faster to use than multiple tags with separate boards
[16:27:50] I think the idea we discussed recently of declining old tasks with no activity, or moving them to an "icebox" column, would also help to keep the board more manageable
[16:28:47] the trigger has been that I just duplicated two tasks today because I did not find them under 'toolforge', but they were on the subproject xd
[16:31:53] yep, searchability across projects is not great in phabricator... I like the idea of experimenting with a single board, especially if we combine it with some cleanup of old tasks
[16:50:58] I'm open to other alternatives too, but yep, having everything in one board would help me find and triage tasks
[16:57:12] I seem to be unable to rerun the pipeline on https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/8/pipelines
[16:57:16] ^can someone try?
[16:58:10] is it because it has open threads? that'd be weird, it should block merging, not rerunning ci :/
[16:58:12] I do not see a button to do that either
[16:58:28] but also the last run passed, maybe it only shows when it failed?
[16:58:50] maybe
[16:59:49] it's ok for now, but for ci that builds images we clean them up after a bit (to save space), so we need to be able to retrigger
[17:18:38] * dcaro off
[17:18:40] cya tomorrow
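Regarding the 16:57-16:59 question about rerunning a pipeline that already passed: one possible workaround is to create a fresh pipeline for the MR's source branch through the GitLab REST API (`POST /projects/:id/pipeline`), which works regardless of whether the previous run succeeded. The sketch below assumes a personal access token with `api` scope; the branch name and token are placeholders, and this is not necessarily what the team does in practice.

    import requests
    from urllib.parse import quote

    GITLAB = "https://gitlab.wikimedia.org/api/v4"
    # Placeholders: adjust the branch and token for the real MR.
    PROJECT = quote("repos/cloud/toolforge/jobs-cli", safe="")
    BRANCH = "the-mr-source-branch"
    TOKEN = "REDACTED"

    # Create (and run) a fresh pipeline on the ref. The separate
    # /pipelines/:id/retry endpoint only retries failed jobs, which is why
    # it does not help when the previous run passed.
    resp = requests.post(
        f"{GITLAB}/projects/{PROJECT}/pipeline",
        headers={"PRIVATE-TOKEN": TOKEN},
        params={"ref": BRANCH},
        timeout=30,
    )
    resp.raise_for_status()
    print(resp.json()["web_url"])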