[07:32:14] <_joe_> !incidents
[07:32:14] 5611 (ACKED) db2189 (paged)/MariaDB Replica SQL: s2 (paged)
[07:32:14] 5621 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (cr1-esams.wikimedia.org)
[07:32:14] 5620 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr2-eqiad.wikimedia.org)
[07:32:15] 5619 (RESOLVED) db2207 (paged)/MariaDB Replica SQL: s2 (paged)
[07:32:15] 5618 (RESOLVED) db2148 (paged)/MariaDB Replica SQL: s2 (paged)
[07:32:15] 5617 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (cr2-eqord.wikimedia.org)
[07:32:15] 5616 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr2-eqiad.wikimedia.org)
[07:32:16] 5615 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (cr2-eqord.wikimedia.org)
[07:32:16] 5614 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr2-eqiad.wikimedia.org)
[07:32:17] 5613 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org)
[07:32:17] 5612 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org)
[08:36:11] !incidents
[08:36:11] 5611 (ACKED) db2189 (paged)/MariaDB Replica SQL: s2 (paged)
[08:36:12] 5621 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (cr1-esams.wikimedia.org)
[08:36:12] 5620 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr2-eqiad.wikimedia.org)
[08:36:12] 5619 (RESOLVED) db2207 (paged)/MariaDB Replica SQL: s2 (paged)
[08:36:12] 5618 (RESOLVED) db2148 (paged)/MariaDB Replica SQL: s2 (paged)
[08:36:12] 5617 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (cr2-eqord.wikimedia.org)
[08:36:13] 5616 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr2-eqiad.wikimedia.org)
[08:36:13] 5615 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (cr2-eqord.wikimedia.org)
[08:36:13] 5614 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr2-eqiad.wikimedia.org)
[08:36:14] 5613 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org)
[08:36:14] 5612 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org)
[11:08:58] while working on some specs for the acme_chief::server class I'm getting the following error: `Could not find resource 'Exec[apt-get update]' in parameter 'require'`
[11:09:25] do I need some helper to make the spec test happy?
[11:23:59] let(:pre_condition) { 'include apt' } --> that fixes the issue
[11:49:31] Do we have a style for re-using functions between cookbooks? sre.swift.remove-ghost-objects has _find_db_paths as a member function of class RemoveGhostObjectsRunner, and I think I need similar code for a different cookbook
[11:50:34] Emperor: first step, move that common code into sre/swift/__init__.py; second step, let's chat about whether there is a case for making a swift library in spicerack
[11:51:31] for step 1 you can either have a function that gets all the params it needs, or make a base Runner class for all/some of your swift cookbooks if that makes more sense
[11:51:48] my use-case this time is (hopefully!) more of a one-off: run the sqlite integrity check against a bunch of container DBs
[11:54:53] https://github.com/wikimedia/operations-cookbooks/blob/420e429dfcd377310ff215a5a8abef13e774f549/cookbooks/sre/swift/remove-ghost-objects.py#L400 is the function concerned; it needs access to bits of the Runner class (remote.query, though I guess you could pass that in as a parameter)
[11:55:43] yes
[12:00:43] Thanks, I'll have a go on that basis
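To make the "step 1" suggestion above concrete: pull the shared logic out of RemoveGhostObjectsRunner into a plain function in sre/swift/__init__.py that takes everything it needs as parameters. Below is a minimal sketch for the one-off use-case discussed, not the actual repo code: the function name check_container_dbs, its signature, and the /srv/swift-storage path are all hypothetical, and it assumes the container DBs are plain SQLite files reachable where the code runs (in practice the cookbook would dispatch this via remote.query).

```python
"""Hypothetical shared helper for cookbooks/sre/swift/__init__.py.

Sketch of "step 1": a module-level function that takes all the
parameters it needs instead of living on one cookbook's Runner class.
"""
import sqlite3
from contextlib import closing
from pathlib import Path


def check_container_dbs(db_paths):
    """Run SQLite's integrity check over a list of container DB files.

    Returns a dict mapping each path to the rows returned by
    PRAGMA integrity_check (a single "ok" when the DB is healthy).
    """
    results = {}
    for path in db_paths:
        # Open read-only so a corrupt DB is never modified by the check.
        with closing(sqlite3.connect(f"file:{path}?mode=ro", uri=True)) as conn:
            rows = conn.execute("PRAGMA integrity_check").fetchall()
        results[str(path)] = [row[0] for row in rows]
    return results


if __name__ == "__main__":
    # Example over a hypothetical Swift data directory layout.
    dbs = Path("/srv/swift-storage").glob("containers/**/*.db")
    for db, outcome in sorted(check_container_dbs(dbs).items()):
        print(db, "->", "; ".join(outcome))
```

Keeping the helper parameter-only (no Runner state) is what makes it callable from any cookbook, and later promotable into a spicerack library if "step 2" ever happens.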
[12:36:03] If there is an alert tonight about an ldap user missing from puppet, that would be me - there is only one step missing in the onboarding, but I don't want to add an unnecessary step with a temporary patch for an ldap-only user (in theory it should be solved before it alerts)
[12:47:47] jynus: ack, thanks for the note
[12:48:24] (a while ago we fixed the report to only mail the SRE IF alias and not root@, so most people don't see it anyway)
[12:49:21] actually, it is solved now, so I will work on the patch now
[12:49:38] moritzm: thanks, sent you a review in case you have the time later
[12:50:06] feel free to suggest alternative people from your team to help review access patches so I don't send them all to you 0:-D
[12:50:10] sure, I'll have a look in a bit
[13:10:11] !incidents
[13:10:11] 5623 (UNACKED) Manual (paged) by urbanecm (murbanec@wikimedia.org): Nearly complete Gerrit outage
[13:10:12] 5622 (RESOLVED) NELHigh sre (thanos-rule tcp.timed_out)
[13:10:12] 5611 (RESOLVED) db2189 (paged)/MariaDB Replica SQL: s2 (paged)
[13:10:12] 5621 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (cr1-esams.wikimedia.org)
[13:10:12] 5620 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr2-eqiad.wikimedia.org)
[13:10:12] 5619 (RESOLVED) db2207 (paged)/MariaDB Replica SQL: s2 (paged)
[13:10:13] 5618 (RESOLVED) db2148 (paged)/MariaDB Replica SQL: s2 (paged)
[13:10:13] 5617 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (cr2-eqord.wikimedia.org)
[13:10:13] 5616 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr2-eqiad.wikimedia.org)
[13:10:14] 5615 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (cr2-eqord.wikimedia.org)
[13:10:14] 5614 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr2-eqiad.wikimedia.org)
[13:10:15] 5613 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org)
[13:10:15] 5612 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org)
[13:10:21] !ack 5623
[13:10:21] 5623 (ACKED) Manual (paged) by urbanecm (murbanec@wikimedia.org): Nearly complete Gerrit outage
[13:10:31] I've asked in #wikimedia-sre-collab about gerrit
[13:10:54] Emperor: fyi there's also a little discussion in -releng about the behaviour
[13:10:57] oh cool, I had just started digging
[13:11:20] joining the effort
[13:12:56] urbanecm: I think the service belongs to collab, hence my channel choice :)
[13:13:30] not doubting the choice, i just wanted to inform :)
[14:07:30] hnowlan: apologies for the cloudelastic decom, looking at it now
[14:15:47] based on the cumin2002 logs, it looks like the cookbook ran successfully. LMK if I missed anything
[14:55:49] inflatador: not entirely sure tbh, seems that it's stuck in the puppetdb somehow
[15:01:04] jayme: ^^ is PCC still broken from https://phabricator.wikimedia.org/T380937#10461825 ?
[15:01:31] idk, we bypassed it
[15:01:58] ACK, I'm guessing not, since that was almost a week ago
[15:03:41] inflatador: the values are still in the pcc puppetdb - so I guess compilation on a deploy host will still fail for the same reason
[15:05:55] jayme: interesting, I don't remember this ever happening when we decom'd early CE hosts. Will take a look some more, but I'll probably need an adult at some pt
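A quick way to check the theory that the decommissioned host's data is stuck in PuppetDB is to query PuppetDB's v4 API directly. A rough sketch, assuming unauthenticated HTTP access; puppetdb.example.org is a placeholder, not the real PCC PuppetDB endpoint:

```python
"""Ask PuppetDB whether it still has data for a decommissioned host.

Sketch against PuppetDB's v4 query API; the endpoint below is a
placeholder, not the real PCC PuppetDB.
"""
import json
import sys
from urllib.error import HTTPError
from urllib.request import urlopen

PUPPETDB = "http://puppetdb.example.org:8080"  # placeholder endpoint


def node_status(certname):
    """Return PuppetDB's node record, or None if the node is unknown."""
    try:
        with urlopen(f"{PUPPETDB}/pdb/query/v4/nodes/{certname}") as resp:
            return json.load(resp)
    except HTTPError:
        return None  # PuppetDB answers 404 for unknown/purged nodes


if __name__ == "__main__":
    node = node_status(sys.argv[1])
    if node is None:
        print("not in PuppetDB")
    else:
        # A decommissioned host should show up as deactivated or expired;
        # a live record here means stale data survived the decom.
        print("deactivated:", node.get("deactivated"))
        print("expired:", node.get("expired"))
```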
[15:14:14] hnowlan: did you just downtime in https://phabricator.wikimedia.org/T383820#10479104? (that's what it looks like from the ticket, but it's actually down, so...)
[15:15:44] urandom: I depooled and downtimed
[15:15:58] depooled in the restbase sense though??
[15:16:06] oh no, sorry - been a while so I forgot
[15:16:08] sorry... too many question marks
[15:16:11] just in pybal
[15:16:14] ya, ok
[16:05:53] Anyone for a bluesky +1? https://gerrit.wikimedia.org/r/c/operations/dns/+/1113170
[16:06:16] jynus: looking and getting the full context
[16:06:53] I verified identity of requested through Slack
[16:06:57] *request
[17:33:25] volans, elukey: I'm trying to decommission db2133.codfw.wmnet and the cookbook fails the IPMI connection test. I tried 5 times.
[17:34:27] very nice :D
[17:35:06] is the db2133.mgmt.codfw.wmnet IP reachable? If yes, we may want to go through the Management console wikitech page to see if anything can be done
[17:35:13] federico3: you could go through https://wikitech.wikimedia.org/wiki/Management_Interfaces#Troubleshooting_Commands
[17:35:34] thanks
[17:35:56] it should guide you through the different failures and help you fix them
[17:42:07] <_joe_> volans: stop pretending hardware makes sense, and tell him what sacrifices he must make to the gods of quantum mechanics
[17:42:20] <_joe_> federico3: so you need a chicken and a goat
[17:42:30] <_joe_> :)
[17:43:01] Dress yourself in ham and circle twice counterclockwise around the server
[17:43:10] <_joe_> (my point about hardware being dark magic is serious though)
[17:50:57] claime: naheulbeuk reference spotted
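The Management_Interfaces troubleshooting page linked above is a manual checklist; its first two steps — is the mgmt IP reachable at all, and does IPMI over LAN answer — can be scripted along these lines. This is a sketch, not what the decommission cookbook actually runs; it assumes ipmitool is installed and the management password is exported as IPMI_PASSWORD (which ipmitool's -E flag reads).

```python
"""First-pass IPMI triage for a mgmt interface, e.g. db2133.mgmt.codfw.wmnet.

Sketch of the manual steps on the Management_Interfaces wikitech page:
ping the mgmt IP, then try IPMI over LAN. Assumes ipmitool is installed
and the password is exported in the IPMI_PASSWORD environment variable.
"""
import subprocess
import sys


def run(cmd):
    """Run a command, returning (exit code, combined output)."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode, (proc.stdout + proc.stderr).strip()


def triage(mgmt_host, user="root"):
    code, _ = run(["ping", "-c", "3", "-W", "2", mgmt_host])
    print(f"ping: {'ok' if code == 0 else 'FAILED'}")
    if code != 0:
        return  # no point trying IPMI if the mgmt IP is unreachable

    # IPMI over LAN: the same kind of check the decom cookbook's
    # connection test performs.
    code, out = run([
        "ipmitool", "-I", "lanplus", "-H", mgmt_host, "-U", user, "-E",
        "chassis", "power", "status",
    ])
    print(f"ipmi: {'ok - ' if code == 0 else 'FAILED - '}{out}")


if __name__ == "__main__":
    triage(sys.argv[1])
```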
[18:26:32] lol, just realized I've been !log'ing into the void in -operations ...
[18:26:32] is anyone here familiar with the procedure for getting stashbot to rejoin the channels it's supposed to be in?
[18:27:00] swfrench-wmf: just restart it https://wikitech.wikimedia.org/wiki/Tool:Stashbot#Maintenance
[18:27:01] AFAICT per https://wikitech.wikimedia.org/wiki/Tool:Stashbot, there's an admin group one needs to be in?
[18:27:09] but if you don't have access I can do it for you
[18:28:26] {done}
[18:28:45] swfrench-wmf: try to re-log your last thing now
[18:28:47] volans: ah, awesome, thank you!
[18:29:05] sorry for the delayed reply - doing a fiddly thing ...
[18:29:10] there's our friend
[19:57:02] I was checking puppet-disabled hosts for another reason and saw that gerrit1003 still has puppet disabled
[19:57:05] The last Puppet run was at Tue Jan 21 15:48:21 UTC 2025 (248 minutes ago). Puppet is disabled. adding a temporary scraping workaround in nftables - jelto
[19:57:28] er, sorry for the inadvertent ping jel.to
[19:57:47] yeah, that's still needed for now, otherwise a puppet run would yank the temporarily added block rule
[19:57:49] safe to revert this?
[19:57:52] ok
[19:58:03] thanks moritzm, I thought the thing was resolved, but ok
[19:59:16] I'll figure out a puppet-resistant temporary hack tomorrow with Jelto (and then a more long-term fix)
[19:59:41] does this block anything or were you merely cleaning alerts?
[20:00:08] moritzm: no, just ensure Puppet was enabled on some hosts that we disabled Puppet on for Traffic and noticed this so thought to check
[20:00:18] doesn't block anything
[20:00:27] ok
[20:00:35] s/just ensure/was just ensuring
[20:33:57] We decided to leave puppet disabled for today until we find a persistent fix tomorrow. We did not want to risk any more Gerrit downtime
[20:36:13] thanks, makes sense. sorry about the ping -- should have checked the message before pasting it!
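For reference, the "Puppet is disabled" state reported at 19:57:05 comes from the agent's disable lock file: `puppet agent --disable "reason"` writes the reason as JSON, and `puppet agent --enable` (or removing the file) re-enables the agent. A minimal sketch that reads it, assuming the WMF-style default state directory; `puppet config print agent_disabled_lockfile` shows the actual path on a given host.

```python
"""Report whether the local Puppet agent is disabled, and why.

Minimal sketch: assumes the lock file lives at the default path below;
`puppet config print agent_disabled_lockfile` gives the real location.
"""
import json
from pathlib import Path

LOCKFILE = Path("/var/lib/puppet/state/agent_disabled.lock")


def disabled_message():
    """Return the disable reason, or None if the agent is enabled."""
    if not LOCKFILE.exists():
        return None
    # `puppet agent --disable "reason"` stores the reason as JSON.
    data = json.loads(LOCKFILE.read_text() or "{}")
    return data.get("disabled_message") or "(no reason given)"


if __name__ == "__main__":
    msg = disabled_message()
    if msg is None:
        print("Puppet is enabled")
    else:
        print(f"Puppet is disabled: {msg}")
```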