[07:57:22] <moritzm>	 is eqsin open for changes again do we expect that the JTAC asks for further diagnosis/reboots of the hardware for today?
[07:57:42] <XioNoX>	 moritzm: all is back to normal in eqsin
[07:57:46] <moritzm>	 ack, thx
[07:58:06] <XioNoX>	 I'm going to tackled esams today... wish me luck :)
[07:58:14] <moritzm>	 I do :-)
[08:32:15] <jynus>	 the majority of criticals on icinga right now are systemd errors- if any of you have 5 minutes, could you have a look and see if there is some about a service you own, that is easy to fix or a known isue to ack: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?allunhandledproblems&sortobject=services&sorttype=2&sortoption=3&serviceprops=270336
[08:36:09] <jynus>	 for example, one I checked on people2002 was about a restart process about a service that no longer existed CC moritzm, so doing a reset-failed removed it (apparently puppet hadn't removed it properly)
[08:39:56] <XioNoX>	 jynus: the bgpalerter ones is related to recent work jbond and I are doing
[08:40:18] <jynus>	 that's ok, I am more worried about ones that have been going on for 30 days unattended
[08:40:28] <XioNoX>	 the "real" service is running, not sure why it added a node- one
[08:40:35] <jynus>	 and haven't been even noticed
[08:40:41] <XioNoX>	 jynus: or the other way around :)
[08:40:58] <XioNoX>	 if they have been going for ages they should be tasks and not alerts
[08:41:03] <XioNoX>	 hmmm
[08:41:07] <XioNoX>	 maybe we should do that
[08:41:24] <XioNoX>	 have a script that convert long running alerts to tasks and ack them automatically
[08:41:56] <jynus>	 but that would defeat the purpose of having an alert
[08:42:18] <jynus>	 the idea is having humans checking them, otherwise the alert should probably be removed
[08:42:57] <jynus>	 I think it is ok to have some ongoing, but the list is now too large an potentially confusing for oncall/responders
[08:45:00] <XioNoX>	 not sure I agree
[08:45:17] <XioNoX>	 an alert going on for days is not really an "alert"
[08:45:21] <moritzm>	 the puppetdb2003 one is also WIP, will resolve later when I merge the patch to fix the underlying issue
[08:45:51] <jynus>	 XioNoX: sure we agree on that, so it shouldn't be on the list of alerts
[08:46:12] <XioNoX>	 agreed :)
[08:46:39] <XioNoX>	 my half joking suggestion was more of a workaround, now that the alerts are there
[08:46:54] <XioNoX>	 and in the task it could be investigated if the alert is even needed
[08:47:00] <XioNoX>	 or how to improve it
[08:47:07] <jynus>	 I am trying people to ack, check or do some actionables if we can to reduce alert fatigue, so "having an alert" doesn't become the norm
[08:47:39] <jynus>	 the people I talked to, mostle outside SRE wasn't even aware the alerts were ongoing! 
[08:47:58] <XioNoX>	 I tried that many times, made some progress and then with time it goes back to that
[08:48:30] <jynus>	 So I am helping here- again, it is ok to have ongoing alerts while you work on it/it is a real issue
[08:49:00] <jynus>	 but in the case of systemd ones, probably most haven't been noticed
[08:49:23] <XioNoX>	 +1
[08:49:27] <jynus>	 or actually you should be right that they should be reduced to warning in some cases
[08:50:08] <XioNoX>	 people look even less at warnings :)
[08:52:47] <jynus>	 yes, but at least they may help reducing the congnitive load of the outage responder - not saying it is a great solution- I think systemd in general should alert, but I also see some teams don't look at those alerts that in a way we have "imposed" on them (outside SRE team)
[08:53:32] <jynus>	 the other issue is probably the lack of ownership for some services :-/
[08:55:13] <XioNoX>	 yeah agreed
[08:55:51] <XioNoX>	 I'm wondering at which point small improvements here help, vs. getting around a table and coming with a long term solution
[08:56:15] <jynus>	 maybe I can start some kind of "best practices" tutorial and try to agree them between SREs to have some common agreed procedures
[08:56:23] <XioNoX>	 and how that fits with the move to prometheus/alertmanager
[08:56:42] <jynus>	 yeah, that adds complexity too
[08:57:30] <XioNoX>	 jynus: iirc we did something like that you and me long time ago (best practices) trying to find them
[08:57:41] <Emperor>	 feels like it's one of those things where if you don't have broad agreement that people _should_ be trying to make sure they don't leave things unacked you're not going to make progress
[08:57:44] <jynus>	 for example, one thing we do in our team, is to always disable notifications for hosts being setup
[08:58:03] <jynus>	 (it takes a lot of time for dbs to be setup)
[08:58:07] <XioNoX>	 there https://wikitech.wikimedia.org/wiki/Icinga#How_to_handle_active_alerts
[08:58:08] <Emperor>	 (cf some of the clinic duty stuff we talked about in Prague)
[08:58:24] <jynus>	 this was not known as a possiblity by some- it is just a hiera line
[09:00:04] <jynus>	 and the reason I started looking at this is because being on call it just makes my life easier the cleaner the dashboard is
[09:00:33] <jynus>	 "ok, I can see XioNoX has started his maintenance as I see some failures here, checked"
[09:00:38] <XioNoX>	 Emperor: yeah I agree
[09:01:14] <XioNoX>	 and it should be a session at the next summit :)
[09:01:28] <jynus>	 Emperor: it is not all, for example, if systemd creates a lot of alerts, as I said, maybe the tooling could be tuned too
[09:02:11] <jynus>	 again, not saying we should reduce that for SREs, but I feel other teams didn't "signed" for that, specially those without a lot of system people
[09:02:37] <jynus>	 it is difficult to find the right balance and will require a lot of input from many people
[09:03:41] <jynus>	 then there are things that are "useful" for a team, but not for site stability- I ended up moving backups alerts mostly outside icinga for that reason
[09:04:28] <jynus>	 as nobody should be notified because a single backup job run failed, but I want to know
[09:04:51] <_joe_>	 jynus: https://gerrit.wikimedia.org/r/c/operations/puppet/+/849928 and followups
[09:05:08] <_joe_>	 I don't know why I never got to merge them
[09:05:22] <jynus>	 _joe_: looks nice, let me get added if I can help!
[09:05:29] <_joe_>	 that would help reducing the noise wouldn't it
[09:05:33] <jynus>	 happy to get some traction of rising these issues
[09:05:38] <jynus>	 _joe_: indeed!
[09:06:03] <jynus>	 so happy to have people commenting and caring about this
[09:07:07] <jynus>	 I wonder if we could have some discussion session or something to move forward on some decisions?
[09:07:51] <jynus>	 e.g. things like that patch and rising awareness/tune https://wikitech.wikimedia.org/wiki/Icinga#How_to_handle_active_alerts
[09:08:49] <jynus>	 related- my new alert failed, and I think it is a missing dependency, going to fix that
[09:11:01] <XioNoX>	 jynus: 302 lmata too
[09:11:56] <jynus>	 I don't mind onfire (or observability LUL) taking care of this- but I would love also if some grassroot process did it, so it didn't look "imposed"
[09:13:51] <jynus>	 (and just to be clear- it is ok to have outstanding alerts for ongoing issues, it is when bad patterns arise that I think we can do better (e.g. too many systemd alerts, or some teams unaware of long ongoing alerts)
[09:14:40] <jynus>	 now let me go fix the new warnings I created :-D
[09:17:58] <Emperor>	 I'm about to go and edit actually-private-puppet to go with https://gerrit.wikimedia.org/r/c/labs/private/+/868718
[09:23:18] <claime>	 jynus: I had the exact same feedback to give from my first on-call rotation before the holidays :)
[09:24:14] <jynus>	 great, so please help too if you can having a look at some of the services your team may own 0:-D
[09:34:05] <claime>	 jynus: I regularly do :) I still reset the systemd failure for train-presync because I'm nice like that :P
[09:34:28] <jynus>	 :-(
[09:34:44] <jynus>	 remember that is something is annoying, the issue is the automation (or lack of) not you!
[09:35:58] <XioNoX>	 fyi, esams switches upgrade in ~20min
[09:36:05] <XioNoX>	 everything is alrady depooled and downtimed
[09:36:30] * jynus crosses fingers
[09:37:07] <claime>	 jynus: It *technically* isn't one of our services, but it's on one of our machines
[09:37:14] <XioNoX>	 keeping an eye on drmrs as well for risks of transit links saturations - https://librenms.wikimedia.org/bill/bill_id=25/
[09:37:27] <claime>	 XioNoX: gl;hf
[09:38:18] <Emperor>	 private changes made, deploying hiera change
[09:40:52] <Emperor>	 (and running puppet by hand on a representative set of target nodes to check for no unexpected changes)
[10:00:55] <XioNoX>	 esams going down
[10:08:36] <Emperor>	 hiera change has gone smoothly, I've put in https://gerrit.wikimedia.org/r/c/labs/private/+/879283 to remove the old entries and then we can close T162123 (opened in 2017) :)
[10:08:37] <stashbot>	 T162123: Refactor swift credentials to be global rather than per-site - https://phabricator.wikimedia.org/T162123
[10:13:16] <XioNoX>	 you won't beleive it but all 3 esams switches came back healthy, and with only a 10min downtime
[10:16:27] <volans>	 lol
[10:16:33] <volans>	 don't 'jinx it
[10:17:03] <Emperor>	 :)
[10:18:26] <XioNoX>	 Arelion is running hot in drmrs but so far so good https://librenms.wikimedia.org/graphs/to=1673518200/id=23135/type=port_bits/from=1673496600/
[10:22:52] <XioNoX>	 keeping esams depooled for the next maintenance with remote hands in 1h - https://phabricator.wikimedia.org/T325048
[10:39:44] <dhinus>	 I was playing around with victorops and noticed there are 17 incidents in "triggered" status that are very old (>6 months) and without a team assigned. do you mind if I batch-resolve them?
[10:42:37] <jynus>	 let me see
[10:42:49] <jynus>	 I didn't see anything not resolved last time I checked
[10:43:19] <dhinus>	 if you go to the dashboard and select "all teams" form the dropdown you should see them
[10:43:25] <dhinus>	 *from
[10:43:31] <jynus>	 ah, I see- I was only looking at SRE team
[10:45:11] <jynus>	 they seem to be old ones, not properly classified?
[10:45:52] <jynus>	 so go ahead
[10:46:34] <jynus>	 although I wonder if they will generate alert spam?
[10:47:20] <jynus>	 nah, they don't have anyone being notified- seems like tests when initially setup
[10:47:41] <dhinus>	 probably yeah, I'll resolve them
[10:48:25] <dhinus>	 done
[10:48:57] <dhinus>	 there was indeed some alert spam :/
[10:48:57] <jynus>	 paged for your team are routed correctly, to your knowledge?
[10:49:00] <jynus>	 *pages
[10:49:10] <dhinus>	 yes I believe
[10:49:22] <dhinus>	 but let me know if you see something strange
[10:49:34] <jynus>	 probably they predate the team division
[10:49:56] <dhinus>	 yes exactly
[11:54:55] <XioNoX>	 fyi, we're going to get some alerts about esams, it's depooled, for T318783
[11:54:55] <stashbot>	 T318783: cr2-esams:FPC0 Parity error - https://phabricator.wikimedia.org/T318783
[12:01:55] <XioNoX>	 moritzm: that's https://phabricator.wikimedia.org/T322529
[12:02:11] <XioNoX>	 downtime/ack expired today as it was supposed to come back up today
[12:02:29] <moritzm>	 ack, ok
[13:02:02] <Amir1>	 claime: can you ping me a bit before the mwmaint restart? Please 🥺
[14:10:24] <moritzm>	 FYI, in 10 minutes I'll reboot irc1001, irc.wikimedia.org has been failed over to irc2001 which has already been rebooted, so any reconnecting bots will connect to it
[14:11:28] <moritzm>	 (the majority of bots doesn't seem to reconnect by default, #en.wikipedia on irc2001 only has 12 connected bots despite it being the primary for two days now)
[14:11:54] <volans>	 that has always be the case IIRC
[14:12:38] <taavi>	 I'd suspect most bots will only reconnect once they have been disconnected
[14:18:52] <moritzm>	 yeah, that has been my experience for past reboots as well
[15:37:19] <cdanis>	 XioNoX: nice to know that esams is no longer 'too big to fail' :D
[15:38:26] <XioNoX>	 cdanis: for sure! some drmrs links run hot, but it's undercontrol
[15:39:39] <_joe_>	 indeed
[15:39:47] <claime>	 Amir1: For sure, it's next week anyways
[17:33:06] <jynus>	 I was so excited about some other work that forgot to mention stuff for handover- today was a calm day
[17:33:41] <jynus>	 netops followed with their maintenance IIRC, but no incidents
[17:34:33] <jynus>	 I've been trying to reduce the number of criticals on icinga to facilitate oncall work, check if you can help somehow :-D
[17:35:20] <denisse>	 jynus: Taking a look. ^^
[17:36:12] <jynus>	 (specially if you know some easy wins for services your team owns or knows about) but no need to go overboard
[17:36:39] <jynus>	 e.g. my largest worry was the large number of systemd checks failed, sometimes for a long time
[17:36:40] <mutante>	 thanks jynus and also for your work on check_legal. I love  https://wikitech.wikimedia.org/wiki/Check_legal_html
[17:36:47] <mutante>	 looking at icinga 
[17:37:35] <mutante>	 often systemd checks can be fixed with 'systemctl reset-failed'
[17:37:46] <mutante>	 because they happened in the past but not anymore
[17:37:53] <mutante>	 and if they do it just comes back by itself
[17:38:01] <jynus>	 mutante: yeah, I did a few that had wrong services still enabled
[17:38:17] <jynus>	 e.g. puppet removed them but didn't update systemd properly or something
[17:38:26] <mutante>	 cool! let me see what else we have 
[17:38:35] <claime>	 Taking the opportunity to re-up _joe_'s work https://gerrit.wikimedia.org/r/c/operations/puppet/+/849928
[17:38:54] <claime>	 (excludes individually monitored sytemd units from the general systemd check)
[17:38:54] <jynus>	 basically a clearer dashboard will make much easier to spot anomalies
[17:39:13] <jynus>	 claime: that will help too, but it is too late in my day today, sorry
[17:39:50] <mutante>	 oh, I had not seen that, thanks for linking to it, nice
[17:39:56] <claime>	 jynus: i know, I was informing mutante and denisse ;)
[17:40:07] <jynus>	 oh, sorry :-D
[17:40:35] <mutante>	 well, starting with "HOSTS down", one of the mc machines did not come back from reboot or something
[17:40:44] <mutante>	 mc2040
[17:40:55] <mutante>	 effie: did that have a problem during maintenance?
[17:42:13] <mutante>	 SAL says it was just normally rebooted. but it's down now. looking at mgmt console
[17:42:24] <effie>	 2040 ?
[17:42:29] <effie>	 not 1040 ?
[17:42:44] <mutante>	 effie: yea, 2040
[17:42:52] <effie>	 now that is odd
[17:43:10] <claime>	 it was rebooted without issues 3 days ago according to my logs
[17:43:13] <effie>	 if you are on the console can you just do a powercycle lplase?
[17:43:30] <effie>	 I did it, and it came back so, something else is odd
[17:43:35] <claime>	 But it crashed today 2023-01-12 13:50:45   [+icinga-wm]    PROBLEM - Host mc2040 is DOWN: PING CRITICAL - Packet loss = 100%                                                 
[17:43:48] <effie>	 ok I missed that ine 
[17:43:49] <mutante>	 I am on mgmt
[17:43:55] <claime>	 effie: me too :/
[17:43:58] <effie>	 mutante: please powercycle it 
[17:44:06] <mutante>	 console is empty, will powercycle
[17:44:13] <effie>	 claime: I was taking care of the eqiad ones, I shoudl haveseen that 
[17:44:14] <effie>	 anyway 
[17:45:08] <mutante>	 racadm>>racadm serveraction powercycle
[17:45:08] <mutante>	 Server power operation initiated successfully
[17:46:39] <mutante>	 ..booting...
[17:47:05] <jynus>	 mutante: thanks for contributing to the grind! Leaving you for the day! I am happy for what was accomplished today!
[17:47:22] <mutante>	 jynus: thx, cya
[17:47:48] <effie>	 sigh this is a rather new machine 
[17:48:02] <mutante>	 effie: SSH should work again now
[17:48:06] <mutante>	 just got on
[17:48:19] <effie>	 yeah just saw my pings having a reply 
[17:48:29] <effie>	 thank ytou for taking care of this daniel 
[17:48:38] <claime>	 20  | Jan-12-2023 | 13:50:29 | ECC Uncorr Err   | Memory                      | Uncorrectable memory error
[17:48:40] <claime>	 Gronk..
[17:48:52] <claime>	 That ain't good
[17:49:05] <mutante>	 Description: The self-heal operation successfully completed at DIMM DIMM_B2.
[17:49:10] <mutante>	 Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B2.
[17:49:25] <mutante>	 self-heal succesful ? wow, lol
[17:49:35] <sukhe>	 mutante: AI™ 
[17:49:41] <mutante>	 omg:p
[17:49:59] <mutante>	 Description: The memory health monitor feature has detected a degradation in the DIMM installed in DIMM_B2. Reboot system to initiate self-heal process
[17:50:05] <mutante>	 and then we did
[17:50:33] <claime>	 sukhe: Now even ECC is AI smh my head
[17:51:03] <mutante>	 https://phabricator.wikimedia.org/P43145
[17:51:22] <claime>	 Ah, racadm got more info than getsel
[17:51:29] <claime>	 ipmi-sel*
[17:51:50] <sukhe>	 this is a new machine?
[17:51:55] <claime>	 We maybe probably want to change that DIMM
[17:51:58] <mutante>	 we are supposed to create a dcops ticket using their template, I think
[17:52:02] <mutante>	 if we want it replaced
[17:52:12] <claime>	 mutante: yep
[17:52:36] <mutante>	 sukhe: November 2021
[17:54:13] <mutante>	 https://phabricator.wikimedia.org/T326834
[17:54:58] <mutante>	 denisse: I used the template for a ticket linked at https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Hardware_Troubleshooting_Runbook .. just to share
[17:55:24] <mutante>	 claime: wanna paste your output too ^ ?
[17:55:45] <claime>	 mutante: My output is an ipmi truncated version of yours
[17:55:50] <claime>	 But yeah sure
[17:55:50] <denisse>	 mutante: Thanks, I've subscribed. :)
[17:56:41] <mutante>	 claime: effie: well.. should we depool it though?
[17:57:04] <mutante>	 the dcops template says so.. but it can also wait until they actually get to it I suppose
[17:57:47] <mutante>	 "Put system into a failed state in Netbox." OK, can do .     "Provide urgency of request, along with justification (redundancy, dependencies, etc)"  eh..not sure 
[17:58:17] <claime>	 It was down for 4 hours with no impact
[17:58:29] <claime>	 So I'd say low-medium
[17:58:50] <mutante>	 'k
[17:59:01] <effie>	 mutante: about mc2040 ?
[17:59:14] <claime>	 yeah
[17:59:33] <mutante>	 set to 'failed' in netbox
[17:59:37] <effie>	 the mgutter pool takes over 
[17:59:40] <mutante>	 because it says to do that
[18:00:11] <effie>	 if you hgave not created a DCops request, I can do so 
[18:00:17] <mutante>	 effie: so no confctl action needed? it's because dcops template says s
[18:00:29] <claime>	 mutante: yes, and FYI if you didn't know, it's the requester's responsibility to put it back in active when done
[18:00:32] <mutante>	 effie: I did, but see the checkboxes https://phabricator.wikimedia.org/T326834
[18:01:58] <mutante>	 maybe you can do the 2 remaining ones
[18:02:00] <effie>	 the gutter pool takes over 
[18:02:08] <mutante>	 great
[18:02:15] <effie>	 yeah I will deal with it tomorrow because it is rather late here
[18:02:25] <effie>	 thank you very much !
[18:02:41] <claime>	 Only the depooling left over
[18:02:54] <mutante>	 ok, good night effie
[18:03:00] <claime>	 g'night effie o/
[18:03:36] <effie>	 ttyl!
[18:03:37] <denisse>	 Good night effie. :)
[18:05:32] <mutante>	 claime: yea, so since these are not in confctl, I don't know of other depooling
[18:05:57] <claime>	 mutante: from what I understand, it should failover on its own through mcrouter and the gutter pool
[18:06:28] <mutante>	 yea, I got that we have no technial problem thaks to the gutter pool
[18:06:36] <mutante>	 now was just speaking about that checkbox 
[18:06:38] <mutante>	 but it's fine 
[18:06:43] <claime>	 Ah right
[18:07:24] <mutante>	 dcops just wants "you can work on it anytime"
[18:07:49] <claime>	 mutante: lol that's exactly what I was typing
[18:08:07] <mutante>	 ah, its ops-codfw though
[18:08:20] <mutante>	 ok, please do:)
[18:10:59] <mutante>	 so, so no more hosts down in Icinga, but 30 service alerts to go 
[18:11:17] <mutante>	 normal range though I guess
[18:14:08] <mutante>	 out of 30, 19 are active alerts that are not acked or donwtimed but only have disabled notifications. I'd still advice against using that way to silence
[18:14:54] <mutante>	 so I'm going to ACK all those because otherwise they are indistinguuishable from real problems
[18:15:22] <mutante>	 incl. cassandra*, cloudcontrol*
[18:16:22] <mutante>	 but since disabled notifications never auto-re-enable themselves, they are also often forgotten from previous times.. so you can have real alerts in that state
[18:16:37] <mutante>	 something that does not happen when using downtimes
[18:19:57] <claime>	 fyi mutante, downtimed mc2040, dcops are going to work on it
[18:20:06] <claime>	 and now I'm out o/
[18:20:22] <denisse>	 claime: See you!!
[18:20:40] <claime>	 Bye denisse :)
[18:21:24] <mutante>	 thanks, cya
[18:53:08] <mutante>	 XioNoX: cr2-eqsin and cr4-ulsfo both have 1 interface down. is that a link between them and known maintenance we are waiting for like the other day?
[18:59:28] <cdanis>	 mutante: it used to be easier to see this on icinga... ah those alerts still come from icinga
[18:59:43] <cdanis>	 you can look in the details on icinga to see which interfaces
[18:59:47] <cdanis>	 https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=cr2-eqsin&service=Router+interfaces
[19:00:10] <cdanis>	 `xe-0/1/4: down -> Transport: cr4-ulsfo:xe-0/1/2 (SingTel...` from cr2-eqsin
[19:00:18] <cdanis>	 so yeah, a transport link between the two routers
[19:00:45] <cdanis>	 I see some recent emails from Arzhel to Singtel as well
[19:05:07] <mutante>	 cdanis: thank you! ACK:)
[21:30:44] <jhathaway>	 got my ripe atlas probe in the mail, https://atlas.ripe.net/probes/62952/
[21:33:08] <RhinosF1>	 Nice jhathaway
[21:33:33] <Reedy>	 not having usb sticks they inevitably break sounds useful
[21:33:56] <Reedy>	 jhathaway: You know it displays your address, right?
[21:34:08] <Reedy>	 (or an address, at least)
[21:34:34] <jhathaway>	 yeah thanks, I guess I could make it more anonymous, but I'm not exactly concerned about the privacy risk, though perhaps I should be?
[21:35:27] <jhathaway>	 They didn't make that clear when I was naming the probe, or I probably didn't read carefully enough, I thought it was my private name
[21:35:43] <RhinosF1>	 I can tell you many stories but the risk is fairly low
[21:36:05] <RhinosF1>	 I’m not aware of any LTA whose actually taken action on threats made
[21:36:45] <jhathaway>	 perhaps I'll change it to the nearest cross streets
[21:39:33] <jhathaway>	 updated, thanks folks
[21:39:47] <Reedy>	 You'd think they'd put a disclaimer under the box
[21:40:19] <jhathaway>	 :)
[21:40:35] <Reedy>	 Do they pay bug bounties?
[21:40:44] <jhathaway>	 not sure
[23:36:44] <denisse>	 jhathaway: the ripe atlas looks very interesting