[05:00:10] wikibugs seems to be down again
[05:07:51] Restarted it, but it doesn't seem to be able to show phab comments on irc yet
[05:08:08] maybe I should create a task so it can be looked at properly
[05:09:52] https://phabricator.wikimedia.org/T308995
[08:35:57] marostegui: was there any specific message that was not posted? Quoting from wikitech: "After restart, the bot joins only #wikimedia-cloud on connect by default. Other channels are joined when the first relevant message to them is supposed to be sent."
[08:53:04] but yeah some message should have been shown by now :(
[09:19:44] hello folks, to complete https://phabricator.wikimedia.org/T296982 I moved the kafka clusters in deployment-prep to the fixed uid/gid (logging, jumbo, main)
[09:20:13] the prod ones are already good, and afaik there is no other cluster that we maintain/take care of
[09:20:43] I am asking since it would be great to default to the fixed uid/gid from now on, but if there is any cluster that you think may need a look please tell me
[09:20:46] <_joe_> elukey: thanks a lot. This really shouldn't have been yours to do, so thanks multiple times
[09:21:14] <_joe_> elukey: I would say we need the analytics/o11y folks to acknowledge that?
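As context for the fixed uid/gid discussion above: filesystem ownership is stored as numeric ids, so if a reimage lets the `kafka` user land on a different uid, the broker can lose access to its own data directories. A minimal sketch of the kind of sanity check this implies, with an illustrative account name and ids (not the real values used for the kafka clusters):

```python
import pwd

def check_fixed_ids(expected):
    """expected: mapping of user name -> (uid, gid); returns a list of
    (name, actual_uid, actual_gid) tuples for any account that does not
    resolve to the expected numeric ids on this host."""
    mismatches = []
    for name, (uid, gid) in expected.items():
        entry = pwd.getpwnam(name)
        if (entry.pw_uid, entry.pw_gid) != (uid, gid):
            mismatches.append((name, entry.pw_uid, entry.pw_gid))
    return mismatches

# Example usage (root is the only account with ids we can assume):
# check_fixed_ids({"root": (0, 0)}) -> []
```

Running a check like this across all brokers of a cluster would surface any host still on the old, dynamically assigned ids before it matters.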
[09:21:55] _joe_ ah yes it was an old thing that I forgot to finish, I was reviewing open tasks :) I'll ping them
[09:25:05] FYI, an interesting article and related new open source project from Cloudflare to validate prometheus alerting rules: https://blog.cloudflare.com/monitoring-our-monitoring/ CCing observability, ONFIRE, godog, lmata
[09:44:47] very cool, thanks for the link volans
[09:45:03] seems promising
[09:45:12] the article's author is also upstream for the UI at alerts.w.o, FWIW
[09:45:28] oh, nice, didn't know :)
[10:09:31] godog: https://gerrit.wikimedia.org/r/c/operations/alerts/+/792564 is alerting on the wmcs prometheus instance, I wonder if the MXQueueNoMetrics alert should be limited to thanos or something similar
[10:27:20] taavi: yes I'll limit the alert to eqiad/codfw and prometheus 'ops' only, thanks for the heads up
[12:30:10] Thanks for the link volans!
[12:34:41] I need to disable puppet in eqiad for a few minutes to reduce traffic towards puppetdb while it gets moved to a new ganeti node
[12:35:28] ack, frontend or backend?
[12:36:02] puppetdb itself, I'm disabling the puppet agents across A:eqiad
[12:36:23] ack, if it was the DB you could need to disable it everywhere ;)
[12:37:07] ganeti1011 is one of the old servers with only a 1G NIC, and if all agents are running that generates too much change in puppetdb to get transferred over the 1G link
[12:38:29] yep
[12:46:32] puppet has been re-enabled
[14:26:54] <_joe_> marostegui: did you try to restart wikibugs this morning?
[14:29:11] I think so, I remember reading that
[14:31:19] <_joe_> anyone else tried anything?
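On "limit the alert to eqiad/codfw and prometheus 'ops' only": the idea is that a rule firing from an instance it was never meant for (here, the wmcs prometheus) is noise, so the rule should be scoped to the instances it was written against. A hypothetical sketch of that scoping decision; the mapping, file name, and field names are made up for illustration, the real mechanism lives in the operations/alerts repo:

```python
# Made-up scope table: which sites/instances a rule file is deployed for.
RULE_SCOPE = {
    "mx_queue.yaml": {"sites": {"eqiad", "codfw"}, "instance": "ops"},
}

def rule_applies(rule_file, site, instance):
    """True if the given rule file is in scope for this prometheus instance."""
    scope = RULE_SCOPE.get(rule_file)
    if scope is None:
        return True  # unscoped rules apply everywhere
    return site in scope["sites"] and instance == scope["instance"]

# rule_applies("mx_queue.yaml", "eqiad", "ops")  -> True
# rule_applies("mx_queue.yaml", "eqiad", "wmcs") -> False
```

With a table like this, the MXQueueNoMetrics rule would simply never be evaluated on the wmcs instance instead of firing there spuriously.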
it's still not working and I can't find any logs for the running pods
[14:59:47] _joe_: yes I tried
[14:59:55] followed what wikitech said
[15:00:04] <_joe_> yeah doesn't work rn
[15:00:07] and as it didn't work I created a task for the experts
[15:12:54] !log updating firmware on ganeti5001 per T308211
[15:12:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:13:00] T308211: Upgrade ganeti/eqsin to Bullseye - https://phabricator.wikimedia.org/T308211
[16:06:44] wikibugs has re-appeared in -ops, do you know what the issue was?
[16:40:41] volans: restarting it fixed it fine. Earlier in the day it was only the webservice that was restarted, not the actual bot.
[16:40:49] Not sure exactly why it crashed though
[16:41:12] https://phabricator.wikimedia.org/T308995 was the task
[16:41:21] RhinosF1: ack, but according to SAL it was already restarted 3 days ago
[16:41:42] volans: it was only down a day
[16:45:11] The time wikibugs went down yesterday was a few minutes after the Redis switchover
[16:45:25] and restarting it properly cleared it
[16:45:42] ack
[21:31:22] Back up here in the UK
[23:06:06] First roundup by the PHP Foundation: https://fosstodon.org/@php/108210293476275428
[23:37:05] volans: if I understand the Cloudflare/pint article correctly, its linter/CI mode relies on a running prometheus server to know whether metrics "exist". This seems fair and useful for its production/daemon mode, but it's a bit disappointing for the linter mode, which makes it quite impractical for us to use, I think? It's not something we can set up in our puppet CI, for example, apart from its syntax checks.
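To illustrate the objection in that last message: a "does this metric exist" check can only work by querying a live Prometheus for the series name and seeing whether anything comes back, which is exactly why it can't run in an offline CI job. A minimal sketch of the response-side of such a check, assuming only the standard `/api/v1/query` JSON shape (the querying itself is left out on purpose):

```python
def has_series(api_response):
    """True if a parsed /api/v1/query response matched at least one series,
    i.e. the queried metric currently "exists" on that Prometheus."""
    if api_response.get("status") != "success":
        return False
    return bool(api_response.get("data", {}).get("result"))

# A CI linter without network access has nothing to feed this function,
# so only syntax-level checks remain possible there.
```

That dependency on live data is what makes the tool's linter mode a poor fit for a network-isolated puppet CI job, as the message says.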