[09:19:25] volans: cookbooks fail trying to post to alertmanager to set downtimes, I'm guessing that the authentication is done by ip?
[09:19:51] (so only requests from cumin/internal are allowed)
[09:20:34] dcaro: you mean when running from your laptop?
[09:21:27] yes, Alertmanager doesn't have ACL support at all AFAIK and IIRC it's authorized via IPs. I can check but you might want to ask o11y
[09:22:25] see hieradata/common/profile/alertmanager/api.yaml
[09:33:05] that broke all our downtiming cookbooks :/
[09:45:54] will have to revert to not using alertmanager yet, and only icinga
[09:47:10] FYI o11y is actively migrating alerts from icinga to alertmanager, not downtiming there would most likely mean unwanted IRC spam and potentially also a page
[09:47:24] yep
[09:50:49] the alternative is that we don't downtime at all, which means more unwanted IRC spam and more pages :), so taking the lesser of two evils until a better solution is possible
[14:10:37] Any objections to keeping the second section in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/conftool-data/discovery/services.yaml sorted? I'm about to add a service and could fold it into that change
[14:11:13] It _looks_ like the section used to be sorted alphabetically, but it is not anymore.
[16:10:15] !log ganeti3001 rebooting and reimaging for firmware updates via T308238
[16:10:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:10:19] T308238: Upgrade ganeti/esams to Bullseye - https://phabricator.wikimedia.org/T308238
[16:48:28] It looks like CI might be down. Jenkins isn't spawning a new job despite changes on https://gerrit.wikimedia.org/r/c/operations/alerts/+/804450
[16:49:11] brett: just overloaded (it happens almost every day around this time), see https://integration.wikimedia.org/zuul/
[16:50:25] volans: oho. Thanks for the info
[16:51:16] that's a lot of jobs all at once O.o
[16:55:04] A lot of them are chained
[16:56:33] like a spiked flail to the face
[17:10:37] anyone know why thumbor2006 has had puppet disabled since May 31st?
[17:10:47] reason is "foo"
[17:10:52] 😬
[17:11:37] it means it has not been disabled properly with the disable-puppet script :/
[17:12:31] hnowlan: 301 mutante
[17:12:48] I have not disabled it. I only noticed it was disabled for reasons unknown to me.
[17:13:13] mutante: your bash history says otherwise :)
[17:13:15] then I rebooted both 2006 and 2004. 2004 is the one that is broken physically
[17:13:43] maybe it was intended for another host? anyway please never use 'puppet agent', ever
[17:14:02] use the bash wrapper scripts instead
[17:14:05] checking SAL
[17:14:21] disable-puppet, enable-puppet, run-puppet-agent
[17:15:04] ok
[17:15:13] would it have changed anything about not having a reason?
[17:15:20] looking for one
[17:15:28] would have added the $USER
[17:15:40] which in this case was you, AFAICT from bash history
[17:15:42] why is it not named run-puppet btw? to be consistent :)
[17:15:54] XioNoX: ask j.oe
[17:15:56] :)
[17:16:19] ok
[17:16:19] also puppet has a lot of subcommands, for example on the puppetmasters
[17:19:11] hnowlan: it's enabled and puppet ran now, sorry about that. I think it was an alert related to rsyslog
[17:19:41] mutante: ah, cool. Thanks!
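(For context on the discussion above: the downtime cookbooks set downtimes by creating silences through the Alertmanager v2 API, and since Alertmanager has no ACLs, access is restricted by source IP via hieradata/common/profile/alertmanager/api.yaml. A minimal sketch of such a request, with a hypothetical endpoint and matcher rather than the actual cookbook code, might look like this:)

```python
# Minimal sketch, not the actual cookbook: create a downtime-style silence by
# POSTing to the Alertmanager v2 API. The endpoint and matcher label are
# hypothetical; Alertmanager itself does no authentication, so in practice the
# request only succeeds from allowed source IPs (e.g. cumin hosts).
from datetime import datetime, timedelta, timezone

import requests

ALERTMANAGER_URL = "http://alertmanager.example.wmnet:9093"  # hypothetical endpoint


def silence_host(hostname: str, hours: int = 4, comment: str = "maintenance") -> str:
    """Create a silence matching all alerts for a host; return the silence ID."""
    now = datetime.now(timezone.utc)
    payload = {
        "matchers": [{"name": "instance", "value": hostname, "isRegex": False}],
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(hours=hours)).isoformat(),
        "createdBy": "downtime-cookbook",
        "comment": comment,
    }
    resp = requests.post(f"{ALERTMANAGER_URL}/api/v2/silences", json=payload, timeout=10)
    resp.raise_for_status()
    return resp.json()["silenceID"]
```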
[17:20:50] hnowlan: so..2004 is down and I depooled it hard (inactive)
[17:20:59] and the purchase date is > 5 years ago
[17:21:08] so I'm afraid there won't be much repairing
[17:21:15] that being said, Papaul took the ticket
[17:21:19] mutante: it just came back up
[17:21:24] but I don't have high hopes
[17:21:29] how did you do that?
[17:21:40] I had tried powercycle but it was like nothing happened
[17:21:52] not sure, Papaul did it :)
[17:25:05] mutante: it seems healthy, I will repool if that's okay with you
[17:25:42] hnowlan: sure, please do
[17:26:07] I won't question how he did the magic
[17:26:13] maybe DRAC reset
[17:26:21] ooof, didn't realise how much the other instances are suffering in its absence
[17:26:44] déjà maps
[17:29:03] yea, that's the issue here. one host down is too many
[17:29:06] and the hosts are old
[17:29:13] so we are lucky this is back :)
[17:30:29] the spec doesn't seem to be too nonstandard, so if we're really in trouble in future we could scavenge some old decommed hardware for a temporary replacement
[17:30:58] but I Would Prefer Not To
[17:31:50] that was basically what I said in a meeting earlier.. I feel like this is going towards reusing other hardware.. like stealing from appservers.. which is horrible but "if we have to"
[17:32:13] if we can make this survive just a little bit longer that is much better though
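(For context on "depooled it hard (inactive)" and the later repool: this refers to the host's pooled state in conftool. A rough sketch of the equivalent confctl calls follows; the host name and invocation are assumed from standard conftool usage, not copied from the actual session.)

```python
# Rough sketch of depool/repool in conftool terms. The confctl selector syntax
# follows standard conftool usage but is written from memory here, not taken
# from the session above.
import subprocess

HOST = "thumbor2004.codfw.wmnet"  # the host discussed above (assumed FQDN)


def set_pooled(state: str) -> None:
    """Set pooled state: 'yes' (serving), 'no' (depooled), or 'inactive' (hard-depooled)."""
    subprocess.run(
        ["confctl", "select", f"name={HOST}", f"set/pooled={state}"],
        check=True,
    )

# Depool hard while the host is down, then repool once it looks healthy again:
# set_pooled("inactive")
# set_pooled("yes")
```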