[07:32:14] <_joe_> !incidents
[07:32:14] 5611 (ACKED) db2189 (paged)/MariaDB Replica SQL: s2 (paged)
[07:32:14] 5621 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (cr1-esams.wikimedia.org)
[07:32:14] 5620 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr2-eqiad.wikimedia.org)
[07:32:15] 5619 (RESOLVED) db2207 (paged)/MariaDB Replica SQL: s2 (paged)
[07:32:15] 5618 (RESOLVED) db2148 (paged)/MariaDB Replica SQL: s2 (paged)
[07:32:15] 5617 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (cr2-eqord.wikimedia.org)
[07:32:15] 5616 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr2-eqiad.wikimedia.org)
[07:32:16] 5615 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (cr2-eqord.wikimedia.org)
[07:32:16] 5614 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr2-eqiad.wikimedia.org)
[07:32:17] 5613 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org)
[07:32:17] 5612 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org)
[08:36:11] !incidents
[08:36:11] 5611 (ACKED) db2189 (paged)/MariaDB Replica SQL: s2 (paged)
[08:36:12] 5621 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (cr1-esams.wikimedia.org)
[08:36:12] 5620 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr2-eqiad.wikimedia.org)
[08:36:12] 5619 (RESOLVED) db2207 (paged)/MariaDB Replica SQL: s2 (paged)
[08:36:12] 5618 (RESOLVED) db2148 (paged)/MariaDB Replica SQL: s2 (paged)
[08:36:12] 5617 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (cr2-eqord.wikimedia.org)
[08:36:13] 5616 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr2-eqiad.wikimedia.org)
[08:36:13] 5615 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (cr2-eqord.wikimedia.org)
[08:36:13] 5614 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr2-eqiad.wikimedia.org)
[08:36:14] 5613 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org)
[08:36:14] 5612 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org)
[11:08:58] while working on some specs for the acme_chief::server class I'm getting the following error: `Could not find resource 'Exec[apt-get update]' in parameter 'require'`
[11:09:25] do I need some helper to make the spec test happy?
[11:23:59] let(:pre_condition) { 'include apt' } --> that fixes the issue
[11:49:31] Do we have a style for re-using functions between cookbooks? sre.swift.remove-ghost-objects has _find_db_paths as a member function of class RemoveGhostObjectsRunner, and I think I need similar code for a different cookbook
[11:50:34] Emperor: first step, move that common code into sre/swift/__init__.py; second step, let's chat about whether there is a case for making a swift library in spicerack
[11:51:31] for step 1 you can either have a function that gets all the params it needs, or make a base Runner class for all/some of your swift cookbooks if that makes more sense
[11:51:48] my use-case this time is (hopefully!) more of a one-off: run the sqlite integrity check against a bunch of container DBs
[11:54:53] https://github.com/wikimedia/operations-cookbooks/blob/420e429dfcd377310ff215a5a8abef13e774f549/cookbooks/sre/swift/remove-ghost-objects.py#L400 is the function concerned; it needs access to bits of the Runner class (remote.query, though I guess you could pass that in as a parameter)
[11:55:43] yes
[12:00:43] Thanks, I'll have a go on that basis
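To make the "step 1" suggestion above concrete: pull the shared logic out of RemoveGhostObjectsRunner into a plain function in sre/swift/__init__.py that takes everything it needs as parameters. Below is a minimal sketch for the one-off use-case discussed, not the actual repo code: the function name check_container_dbs, its signature, and the /srv/swift-storage path are all hypothetical, and it assumes the container DBs are plain SQLite files reachable where the code runs (in practice the cookbook would dispatch this via remote.query).

```python
"""Hypothetical shared helper for cookbooks/sre/swift/__init__.py.

Sketch of "step 1": a module-level function that takes all the
parameters it needs instead of living on one cookbook's Runner class.
"""
import sqlite3
from contextlib import closing
from pathlib import Path


def check_container_dbs(db_paths):
    """Run SQLite's integrity check over a list of container DB files.

    Returns a dict mapping each path to the rows returned by
    PRAGMA integrity_check (a single "ok" when the DB is healthy).
    """
    results = {}
    for path in db_paths:
        # Open read-only so a corrupt DB is never modified by the check.
        with closing(sqlite3.connect(f"file:{path}?mode=ro", uri=True)) as conn:
            rows = conn.execute("PRAGMA integrity_check").fetchall()
        results[str(path)] = [row[0] for row in rows]
    return results


if __name__ == "__main__":
    # Example over a hypothetical Swift data directory layout.
    dbs = Path("/srv/swift-storage").glob("containers/**/*.db")
    for db, outcome in sorted(check_container_dbs(dbs).items()):
        print(db, "->", "; ".join(outcome))
```

Keeping the helper parameter-only (no Runner state) is what makes it callable from any cookbook, and later promotable into a spicerack library if "step 2" ever happens.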
[12:36:03] If there is an alert tonight about an ldap user missing from puppet, that would be me - there is only one step missing in the onboarding, but I don't want to add an unnecessary step with a temporary patch for an ldap-only user (in theory it should be solved before it alerts)
[12:47:47] jynus: ack, thanks for the note
[12:48:24] (a while ago we fixed the report to only mail the SRE IF alias and not root@, so most people don't see it anyway)
[12:49:21] actually, it is solved now, so I will work on the patch now
[12:49:38] moritzm: thanks, sent you a review in case you have the time later
[12:50:06] feel free to suggest alternative people from your team to help review access patches so I don't send them all to you 0:-D
[12:50:10] sure, I'll have a look in a bit
[13:10:11] !incidents
[13:10:11] 5623 (UNACKED) Manual (paged) by urbanecm (murbanec@wikimedia.org): Nearly complete Gerrit outage
[13:10:12] 5622 (RESOLVED) NELHigh sre (thanos-rule tcp.timed_out)
[13:10:12] 5611 (RESOLVED) db2189 (paged)/MariaDB Replica SQL: s2 (paged)
[13:10:12] 5621 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (cr1-esams.wikimedia.org)
[13:10:12] 5620 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr2-eqiad.wikimedia.org)
[13:10:12] 5619 (RESOLVED) db2207 (paged)/MariaDB Replica SQL: s2 (paged)
[13:10:13] 5618 (RESOLVED) db2148 (paged)/MariaDB Replica SQL: s2 (paged)
[13:10:13] 5617 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (cr2-eqord.wikimedia.org)
[13:10:13] 5616 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr2-eqiad.wikimedia.org)
[13:10:14] 5615 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (cr2-eqord.wikimedia.org)
[13:10:14] 5614 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr2-eqiad.wikimedia.org)
[13:10:15] 5613 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org)
[13:10:15] 5612 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org)
[13:10:21] !ack 5623
[13:10:21] 5623 (ACKED) Manual (paged) by urbanecm (murbanec@wikimedia.org): Nearly complete Gerrit outage
[13:10:31] I've asked in #wikimedia-sre-collab about gerrit
[13:10:54] Emperor: fyi there's also a little discussion in -releng about the behaviour
[13:10:57] oh cool, I had just started digging
[13:11:20] joining the effort
[13:12:56] urbanecm: I think the service belongs to collab, hence my channel choice :)
[13:13:30] not doubting the choice, i just wanted to inform :)
[14:07:30] hnowlan: apologies for the cloudelastic decom, looking at it now
[14:15:47] based on the cumin2002 logs, it looks like the cookbook ran successfully. LMK if I missed anything
[14:55:49] inflatador: not entirely sure tbh, seems that it's stuck in the puppetdb somehow
[15:01:04] jayme: ^^ is PCC still broken from https://phabricator.wikimedia.org/T380937#10461825 ?
[15:01:31] idk, we bypassed it
[15:01:58] ACK, I'm guessing not, since that was almost a week ago
[15:03:41] inflatador: the values are still in the pcc puppetdb - so I guess compilation on a deploy host will still fail for the same reason
[15:05:55] jayme: interesting, I don't remember this ever happening when we decom'd early CE hosts. Will take a look some more, but I'll probably need an adult at some pt
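A quick way to check the theory that the decommissioned host's data is stuck in PuppetDB is to query PuppetDB's v4 API directly. A rough sketch, assuming unauthenticated HTTP access; puppetdb.example.org is a placeholder, not the real PCC PuppetDB endpoint:

```python
"""Ask PuppetDB whether it still has data for a decommissioned host.

Sketch against PuppetDB's v4 query API; the endpoint below is a
placeholder, not the real PCC PuppetDB.
"""
import json
import sys
from urllib.error import HTTPError
from urllib.request import urlopen

PUPPETDB = "http://puppetdb.example.org:8080"  # placeholder endpoint


def node_status(certname):
    """Return PuppetDB's node record, or None if the node is unknown."""
    try:
        with urlopen(f"{PUPPETDB}/pdb/query/v4/nodes/{certname}") as resp:
            return json.load(resp)
    except HTTPError:
        return None  # PuppetDB answers 404 for unknown/purged nodes


if __name__ == "__main__":
    node = node_status(sys.argv[1])
    if node is None:
        print("not in PuppetDB")
    else:
        # A decommissioned host should show up as deactivated or expired;
        # a live record here means stale data survived the decom.
        print("deactivated:", node.get("deactivated"))
        print("expired:", node.get("expired"))
```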
[15:14:14] hnowlan: did you just downtime in https://phabricator.wikimedia.org/T383820#10479104? (that's what it looks like from the ticket, but it's actually down, so...)
[15:15:44] urandom: I depooled and downtimed
[15:15:58] depooled in the restbase sense though??
[15:16:06] oh no, sorry - been a while so I forgot
[15:16:08] sorry... too many question marks
[15:16:11] just in pybal
[15:16:14] ya, ok
[16:05:53] Anyone for a bluesky +1? https://gerrit.wikimedia.org/r/c/operations/dns/+/1113170
[16:06:16] jynus: looking and getting the full context
[16:06:53] I verified identity of requested through Slack
[16:06:57] *request
[17:33:25] volans, elukey: I'm trying to decommission db2133.codfw.wmnet and the cookbook fails the IPMI connection test. I tried 5 times.
[17:34:27] very nice :D
[17:35:06] is the db2133.mgmt.codfw.wmnet IP reachable? If yes, we may want to go through the Management console wikitech page to see if anything can be done
[17:35:13] federico3: you could go through https://wikitech.wikimedia.org/wiki/Management_Interfaces#Troubleshooting_Commands
[17:35:34] thanks
[17:35:56] it should guide you through the different failures and help you fix them
[17:42:07] <_joe_> volans: stop pretending hardware makes sense, and tell him what sacrifices he must make to the gods of quantum mechanics
[17:42:20] <_joe_> federico3: so you need a chicken and a goat
[17:42:30] <_joe_> :)
[17:43:01] Dress yourself in ham and circle twice counterclockwise around the server
[17:43:10] <_joe_> (my point about hardware being dark magic is serious though)
[17:50:57] claime: naheulbeuk reference spotted
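The Management_Interfaces troubleshooting page linked above is a manual checklist; its first two steps — is the mgmt IP reachable at all, and does IPMI over LAN answer — can be scripted along these lines. This is a sketch, not what the decommission cookbook actually runs; it assumes ipmitool is installed and the management password is exported as IPMI_PASSWORD (which ipmitool's -E flag reads).

```python
"""First-pass IPMI triage for a mgmt interface, e.g. db2133.mgmt.codfw.wmnet.

Sketch of the manual steps on the Management_Interfaces wikitech page:
ping the mgmt IP, then try IPMI over LAN. Assumes ipmitool is installed
and the password is exported in the IPMI_PASSWORD environment variable.
"""
import subprocess
import sys


def run(cmd):
    """Run a command, returning (exit code, combined output)."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode, (proc.stdout + proc.stderr).strip()


def triage(mgmt_host, user="root"):
    code, _ = run(["ping", "-c", "3", "-W", "2", mgmt_host])
    print(f"ping: {'ok' if code == 0 else 'FAILED'}")
    if code != 0:
        return  # no point trying IPMI if the mgmt IP is unreachable

    # IPMI over LAN: the same kind of check the decom cookbook's
    # connection test performs.
    code, out = run([
        "ipmitool", "-I", "lanplus", "-H", mgmt_host, "-U", user, "-E",
        "chassis", "power", "status",
    ])
    print(f"ipmi: {'ok - ' if code == 0 else 'FAILED - '}{out}")


if __name__ == "__main__":
    triage(sys.argv[1])
```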
[18:26:32] lol, just realized I've been !log'ing into the void in -operations ...
[18:26:32] is anyone here familiar with the procedure for getting stashbot to rejoin the channels it's supposed to be in?
[18:27:00] swfrench-wmf: just restart it https://wikitech.wikimedia.org/wiki/Tool:Stashbot#Maintenance
[18:27:01] AFAICT per https://wikitech.wikimedia.org/wiki/Tool:Stashbot, there's an admin group one needs to be in?
[18:27:09] but if you don't have access I can do it for you
[18:28:26] {done}
[18:28:45] swfrench-wmf: try to re-log your last thing now
[18:28:47] volans: ah, awesome, thank you!
[18:29:05] sorry for the delayed reply - doing a fiddly thing ...
[18:29:10] there's our friend
[19:57:02] I was checking puppet-disabled hosts for another reason and saw that gerrit1003 still has puppet disabled
[19:57:05] The last Puppet run was at Tue Jan 21 15:48:21 UTC 2025 (248 minutes ago). Puppet is disabled. adding a temporary scraping workaround in nftables - jelto
[19:57:28] er, sorry for the inadvertent ping jel.to
[19:57:47] yeah, that's still needed for now, otherwise a puppet run would yank the temporarily added block rule
[19:57:49] safe to revert this?
[19:57:52] ok
[19:58:03] thanks moritzm, I thought the thing was resolved, but ok
[19:59:16] I'll figure out a puppet-resistant temporary hack tomorrow with Jelto (and then a more long-term fix)
[19:59:41] does this block anything or were you merely cleaning alerts?
[20:00:08] moritzm: no, just ensure Puppet was enabled on some hosts that we disabled Puppet on for Traffic and noticed this so thought to check
[20:00:18] doesn't block anything
[20:00:27] ok
[20:00:35] s/just ensure/was just ensuring
[20:33:57] We decided to leave puppet disabled for today until we find a persistent fix tomorrow. We did not want to risk any more Gerrit downtime
[20:36:13] thanks, makes sense. sorry about the ping -- should have checked the message before pasting it!
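For reference, the "Puppet is disabled" state reported at 19:57:05 comes from the agent's disable lock file: `puppet agent --disable "reason"` writes the reason as JSON, and `puppet agent --enable` (or removing the file) re-enables the agent. A minimal sketch that reads it, assuming the WMF-style default state directory; `puppet config print agent_disabled_lockfile` shows the actual path on a given host.

```python
"""Report whether the local Puppet agent is disabled, and why.

Minimal sketch: assumes the lock file lives at the default path below;
`puppet config print agent_disabled_lockfile` gives the real location.
"""
import json
from pathlib import Path

LOCKFILE = Path("/var/lib/puppet/state/agent_disabled.lock")


def disabled_message():
    """Return the disable reason, or None if the agent is enabled."""
    if not LOCKFILE.exists():
        return None
    # `puppet agent --disable "reason"` stores the reason as JSON.
    data = json.loads(LOCKFILE.read_text() or "{}")
    return data.get("disabled_message") or "(no reason given)"


if __name__ == "__main__":
    msg = disabled_message()
    if msg is None:
        print("Puppet is enabled")
    else:
        print(f"Puppet is disabled: {msg}")
```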