[09:00:09] o/
[09:01:22] o/
[09:19:07] morning
[09:29:33] hmpf... I pushed to main on maintain-kubeusers, give me a sec to fix it
[09:30:00] do you have any pre-specified / recommended method to cherry-pick a change on cloudcumins for a cookbook? or do you do it locally?
[09:31:14] done
[09:31:23] I do it locally
[09:31:44] arturo: `test-cookbook -c [gerrit change number]`
[09:31:44] (for now)
[09:44:33] ok!
[10:11:00] for T357227, do we have any admin docs on how to give a tool write access to the elasticsearch cluster?
[10:11:01] * dhinus paged, can't look at it right now
[10:11:01] T357227: Elasticsearch credential request for capacity-exchange - https://phabricator.wikimedia.org/T357227
[10:11:27] I will be online in 1 hour
[10:11:29] blancadesal: https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin#Granting_a_tool_write_access_to_Elasticsearch
[10:11:32] dhinus: I will take a look
[10:12:04] arturo: "cloudvirt1032/nova-compute proc minimum" and "cloudvirt1032/ensure kvm processes are running" just paged, this you?
[10:12:10] yes, me
[10:12:14] just downtimed it
[10:12:15] it went away
[10:12:17] ack
[10:12:30] I'm reimaging the host
[10:12:39] ok
[10:12:48] that's a paging alert from icinga. I thought we got rid of all of those
[10:15:15] it seems nova-compute is still there
[10:23:38] blancadesal: hmm, the elasticsearch procedure might be worth amending to write the credentials as envvars instead of in an NFS file
[10:25:31] taavi: for toolsdb access we currently have both the envvars and a file, no?
[10:27:22] are envvars used for non-buildservice deploys?
[10:34:14] blancadesal: I'm not aware
[10:34:22] yes, they are used for all
[10:36:18] https://www.irccloud.com/pastebin/Sp0zgju6/
[10:36:20] FYI there seem to be fleet-wide puppet problems
[10:36:35] looking
[10:36:54] I refreshed CA certs last week, might be related
[10:37:17] dcaro: this is mostly hardware servers, so maybe prod related
[10:37:30] ack
[10:37:36] https://www.irccloud.com/pastebin/frEQymda/
[10:37:57] that seems puppet7 migration related
[10:38:15] see -operations, there may be an outage ongoing
[10:39:10] that error comes from the puppetmaster directly
[10:43:07] writing elasticsearch creds to envvars instead of an ini file seems ok to me. any opinions against?
[10:45:20] +1 from me
[10:58:46] could someone please lend me the management password while a misconfig in pwstore for my user is fixed?
[11:00:29] the elasticsearch cred docs say to run `tools-clushmaster-02:~$ sudo cumin "P{O:wmcs::toolforge::elastic7}" "run-puppet-agent"` — should that host be tools-cumin-1?
[11:00:44] yeah I fixed that already, you need to refresh the page :D
[11:00:51] haha
[11:26:57] the puppet issues should start resolving now
[11:27:02] is anyone looking at the tools NFS share already?
[11:30:06] taavi: I got as far as creating a task T357882 but I didn't do anything for real
[11:30:07] T357882: 2024-02-19: toolforge NFS cleanup - https://phabricator.wikimedia.org/T357882
[11:30:18] ok, are you planning to or should I?
[11:30:36] please, take over. I'm in the middle of a reimage
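Following up on the 10:23-10:43 discussion above about writing the elasticsearch credentials as envvars instead of an NFS ini file: a minimal sketch of how a tool could prefer envvars and fall back to a legacy ini file. The envvar names, ini path and section name below are hypothetical placeholders, not the names provisioned by the actual Toolforge procedure.

    import configparser
    import os
    from pathlib import Path

    # Hypothetical names, for illustration only; the real envvar names and
    # the legacy ini path are whatever the Toolforge procedure provisions.
    ENV_USER = "ELASTICSEARCH_USERNAME"
    ENV_PASSWORD = "ELASTICSEARCH_PASSWORD"
    INI_PATH = Path.home() / ".elasticsearch.ini"

    def elasticsearch_credentials() -> tuple[str, str]:
        """Prefer envvars, fall back to the legacy ini file on NFS."""
        user = os.environ.get(ENV_USER)
        password = os.environ.get(ENV_PASSWORD)
        if user and password:
            return user, password

        # Fallback: read the ini file from the tool's home directory.
        config = configparser.ConfigParser()
        config.read(INI_PATH)
        section = config["default"]  # hypothetical section name
        return section["user"], section["password"]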
[11:30:51] sure
[11:31:46] thanks
[12:14:39] hmm, the U key on my keyboard seems to occasionally require a stronger-than-usual press to work correctly :/
[12:20:45] not as bad as my old macbook where that happened on the "cmd" key and I was constantly typing "c" instead of copying :D
[12:21:18] (they did replace the keyboard for free though as it was the infamous batch of 2017 macbooks)
[12:22:03] yep, that seems worse
[12:23:12] I'll probably have to replace the switch or something. I think I have a few leftovers/spares lying around somewhere
[12:26:02] python question
[12:26:05] I get this traceback
[12:26:08] https://www.irccloud.com/pastebin/wkuyNxji/
[12:26:20] in this patch https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1004116
[12:26:40] it is not immediately obvious to me whether I need to do a cast, ask for an attribute, or what?
[12:31:23] arturo: in `args=["--deployment", self.cluster_name, "--hostname-list", self.hostname]` self.cluster_name is an enum and not a string. probably needs to be replaced with `self.cluster_name.value` if I recall correctly
[12:31:30] can I get a +1 on T357901?
[12:31:31] T357901: Request increased server-group-members quota for tools - https://phabricator.wikimedia.org/T357901
[12:32:18] +1'd
[12:32:43] thanks
[12:33:01] now let's see if the quota increase cookbook is familiar with this even existing. I certainly was not
[12:33:09] same here
[12:34:04] I think it worked
[12:34:13] | server-group-members | 80 |
[12:37:59] ok
[12:44:48] taavi: I tested https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1004088 and it works, though it may need to be extended later
[12:44:54] would you like to +1?
[12:45:55] I will look in a bit
[12:55:05] thanks
[12:55:17] * arturo brb
[13:52:48] just found an 846G log file on NFS :/
[14:05:22] well we're back to 77% usage (down from 86%). that was easy
[14:57:22] I'm out today (and about to be offline) but popping in to make sure the cloudvirt1032 alert isn't anything. arturo, that's just you reimaging with 1 nic?
[15:03:17] I think it was yes
[15:04:19] arturo reimaged that host this morning, I think there's only the KVM alert that is still firing
[15:05:23] ok!
[15:06:43] oh, you should resolve it on splunk then
[15:07:30] I'm only seeing it on icinga
[15:08:25] it's firing but acked both in alertmanager and victorops
[15:08:44] when it's resolved in icinga, it should autoresolve in am and vo
[15:09:28] the issue is that if it's acked in vo it will re-page in 24h
[15:09:39] or something like that
[15:10:33] yep I think that's right, if arturo needs more than 24 hours to put it back in service, we should probably resolve it manually in vo
[15:10:34] though I'm not sure that's enough to prevent it from re-paging again in vo :)
[15:10:56] my recollection is that even if the issue is fixed vo will re-page, I think it never resolves iself
[15:10:59] *itself
[15:11:34] at least for certain alerts I remember they just went to "resolved" in vo without me doing anything
[15:11:55] ok, well maybe we should experiment then :)
[15:12:10] Anyway, I'm out! Have a good rest of Monday all.
[15:13:05] see you tomorrow!
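Regarding the traceback discussed at 12:26-12:31 above: a plain Enum member cannot be used where a string is expected (for example in a subprocess argument list), while `.value` yields the underlying string. The snippet below is a minimal reproduction of that pattern with an illustrative enum, not the actual wmcs-cookbooks code.

    import subprocess
    from enum import Enum

    class ClusterName(Enum):
        # Illustrative members only; the real enum lives in wmcs-cookbooks.
        TOOLS = "tools"
        TOOLSBETA = "toolsbeta"

    cluster = ClusterName.TOOLS

    # Fails: subprocess expects str/bytes arguments, not an Enum member.
    # subprocess.run(["echo", "--deployment", cluster])
    #   -> TypeError: expected str, bytes or os.PathLike object, not ClusterName

    # Works: .value gives the plain string the command line needs.
    subprocess.run(["echo", "--deployment", cluster.value], check=True)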
[15:15:10] I downtimed the alert in icinga
[15:15:25] let's see what happens in vo
[15:17:40] this previous incident did auto-resolve in vo when icinga sent a RECOVERY notification https://portal.victorops.com/ui/wikimedia/incident/4440/timeline
[15:18:36] but a downtime is probably not going to be enough because icinga will not send a RECOVERY
[15:19:05] I will resolve it manually in vo
[15:20:19] sorry for the noise, I think I ran the downtime cookbook for 1 day, so not sure what happened here
[15:24:42] I think it downtimed alertmanager but not icinga, not sure why
[15:24:52] was it from cloudcumin?
[15:25:58] normal cumin
[15:26:16] the reimage cookbook should also take care of downtiming
[15:26:29] more specifically:
[15:26:31] aborrero@cumin1002:~ $ sudo cookbook sre.hosts.downtime cloudvirt1032.eqiad.wmnet -D 1 --reason "reimage"
[15:31:02] I'm checking the spicerack logs in cumin1002 and it seems like it created the downtime... but I can't find it in icinga
[15:31:57] was that downtime done before or after the reimage?
[15:32:15] after the reimage was completed
[15:32:31] it paged after KVM was missing the canary VM, after the reimage
[15:32:45] so I ran the downtime cookbook to silence the KVM check
[15:32:52] in the SAL I only see a manual downtime after the reimage https://sal.toolforge.org/production?p=0&q=aborrero&d=
[15:37:14] there's a downtime at 10:11 for 1 day, then there's the reimage starting at 11:11 (I guess the reimage wipes puppetdb and also wipes the downtime), then there's a second downtime at 11:28 but that's for 2 hours only, and it expired
[15:38:38] I think the reimage cookbook expects all services will be back within 2 hours, but kvm is still not up and running, so icinga alerted again at 13:34
[15:48:39] I'm considering no longer using the jobs-api and build service phabricator tags in favor of using toolforge everywhere (and maybe manually adding `[jobs-api,buildservice]`-style prefixes to the tasks)
[15:48:56] especially as we are all working together on toolforge and there are currently no 'subgroups'
[15:49:03] that would allow having one single workboard
[15:49:38] we can discuss tomorrow in the check-in
[16:08:01] I don't think I have a strong opinion. Having a dedicated tag for jobs-api has been useful in the past given the number of tickets we had for it alone. But maybe today is different?
[16:08:50] we were also splitting jobs-api and buildservice work from toolforge at large (and two different groups of people focusing on each), now we are just one group
[16:23:59] I don't have a strong opinion either, but I think the pro of having separate tags is that the backlog looks more organized, and we can still have a single board for "prioritized & in progress" tasks (the "toolforge iteration" milestone board)
[16:25:07] if we move to a single board, we could try adding links in the sidebar to easily filter the board to show only tasks including "[jobs-api]" in the title. that might be faster to use than multiple tags with separate boards
[16:27:50] I think the idea we discussed recently of declining old tasks with no activity, or moving them to an "icebox" column, would also help to keep the board more manageable
[16:28:47] the trigger has been that I just duplicated two tasks today because I did not find them under 'toolforge', but they were on the subproject xd
[16:31:53] yep, searchability across projects is not great in phabricator... I like the idea of experimenting with a single board, especially if we combine it with some cleanup of old tasks
[16:50:58] I'm open to other alternatives too, but yep, having everything in one board would help me find and triage tasks
[16:57:12] I seem to be unable to rerun the pipeline on https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/8/pipelines
[16:57:16] ^can someone try?
[16:58:10] is it because it has open threads? that'd be weird, it should block merging, not rerunning ci :/
[16:58:12] I do not see a button to do that either
[16:58:28] but also the last run passed, maybe it only shows when it failed?
[16:58:50] maybe
[16:59:49] it's ok for now, but for ci that builds images we clean them up after a bit (to save space), so we need to be able to retrigger
[17:18:38] * dcaro off
[17:18:40] cya tomorrow
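Regarding the 16:57-16:59 question about rerunning a pipeline that already passed: one possible workaround is to create a fresh pipeline for the MR's source branch through the GitLab REST API (`POST /projects/:id/pipeline`), which works regardless of whether the previous run succeeded. The sketch below assumes a personal access token with `api` scope; the branch name and token are placeholders, and this is not necessarily what the team does in practice.

    import requests
    from urllib.parse import quote

    GITLAB = "https://gitlab.wikimedia.org/api/v4"
    # Placeholders: adjust the branch and token for the real MR.
    PROJECT = quote("repos/cloud/toolforge/jobs-cli", safe="")
    BRANCH = "the-mr-source-branch"
    TOKEN = "REDACTED"

    # Create (and run) a fresh pipeline on the ref. The separate
    # /pipelines/:id/retry endpoint only retries failed jobs, which is why
    # it does not help when the previous run passed.
    resp = requests.post(
        f"{GITLAB}/projects/{PROJECT}/pipeline",
        headers={"PRIVATE-TOKEN": TOKEN},
        params={"ref": BRANCH},
        timeout=30,
    )
    resp.raise_for_status()
    print(resp.json()["web_url"])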