[07:57:22] is eqsin open for changes again, or do we expect that JTAC asks for further diagnosis/reboots of the hardware today?
[07:57:42] moritzm: all is back to normal in eqsin
[07:57:46] ack, thx
[07:58:06] I'm going to tackle esams today... wish me luck :)
[07:58:14] I do :-)
[08:32:15] the majority of criticals on icinga right now are systemd errors- if any of you have 5 minutes, could you have a look and see if there is one about a service you own that is easy to fix, or a known issue to ack: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?allunhandledproblems&sortobject=services&sorttype=2&sortoption=3&serviceprops=270336
[08:36:09] for example, one I checked on people2002 was about a restart process for a service that no longer existed (CC moritzm), so doing a reset-failed removed it (apparently puppet hadn't removed it properly)
[08:39:56] jynus: the bgpalerter ones are related to recent work jbond and I are doing
[08:40:18] that's ok, I am more worried about ones that have been going on for 30 days unattended
[08:40:28] the "real" service is running, not sure why it added a node- one
[08:40:35] and haven't even been noticed
[08:40:41] jynus: or the other way around :)
[08:40:58] if they have been going on for ages they should be tasks and not alerts
[08:41:03] hmmm
[08:41:07] maybe we should do that
[08:41:24] have a script that converts long-running alerts to tasks and acks them automatically
[08:41:56] but that would defeat the purpose of having an alert
[08:42:18] the idea is to have humans checking them, otherwise the alert should probably be removed
[08:42:57] I think it is ok to have some ongoing, but the list is now too large and potentially confusing for oncall/responders
[08:45:00] not sure I agree
[08:45:17] an alert going on for days is not really an "alert"
[08:45:21] the puppetdb2003 one is also WIP, will resolve later when I merge the patch to fix the underlying issue
[08:45:51] XioNoX: sure, we agree on that, so it shouldn't be on the list of alerts
[08:46:12] agreed :)
[08:46:39] my half-joking suggestion was more of a workaround, now that the alerts are there
[08:46:54] and in the task it could be investigated whether the alert is even needed
[08:47:00] or how to improve it
[08:47:07] I am trying to get people to ack, check or act on some actionables where we can to reduce alert fatigue, so "having an alert" doesn't become the norm
[08:47:39] the people I talked to, mostly outside SRE, weren't even aware the alerts were ongoing!
[08:47:58] I tried that many times, made some progress, and then with time it goes back to that
[08:48:30] So I am helping here- again, it is ok to have ongoing alerts while you work on it/it is a real issue
[08:49:00] but in the case of the systemd ones, probably most haven't been noticed
[08:49:23] +1
[08:49:27] or actually you might be right that they should be reduced to warning in some cases
[08:50:08] people look even less at warnings :)
[08:52:47] yes, but at least they may help reduce the cognitive load of the outage responder - not saying it is a great solution- I think systemd in general should alert, but I also see some teams don't look at those alerts that in a way we have "imposed" on them (outside the SRE team)
[08:53:32] the other issue is probably the lack of ownership for some services :-/
[08:55:13] yeah agreed
[08:55:51] I'm wondering at which point small improvements here help, vs. getting around a table and coming up with a long-term solution
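As context for the reset-failed cleanup mentioned at 08:36, a minimal shell sketch of the manual steps on an affected host; the unit name is a made-up placeholder, not one of the real failing services.

```bash
# list the units currently in a failed state (what the icinga systemd check counts)
systemctl list-units --state=failed

# check whether the unit file still exists on disk; this exits non-zero
# and prints an error if puppet already removed it
systemctl cat stale-old-job.service

# if the unit is gone, or the failure is old and not recurring, clear the
# stale failed state so the check goes green again
sudo systemctl reset-failed stale-old-job.service
```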
[08:56:15] maybe I can start some kind of "best practices" tutorial and try to agree on them between SREs to have some common agreed procedures
[08:56:23] and how that fits with the move to prometheus/alertmanager
[08:56:42] yeah, that adds complexity too
[08:57:30] jynus: iirc we did something like that, you and me, a long time ago (best practices), trying to find them
[08:57:41] feels like it's one of those things where if you don't have broad agreement that people _should_ be trying to make sure they don't leave things unacked, you're not going to make progress
[08:57:44] for example, one thing we do in our team is to always disable notifications for hosts being set up
[08:58:03] (it takes a lot of time for dbs to be set up)
[08:58:07] there https://wikitech.wikimedia.org/wiki/Icinga#How_to_handle_active_alerts
[08:58:08] (cf some of the clinic duty stuff we talked about in Prague)
[08:58:24] this was not known as a possibility by some- it is just a hiera line
[09:00:04] and the reason I started looking at this is because, being on call, the cleaner the dashboard is, the easier my life is
[09:00:33] "ok, I can see XioNoX has started his maintenance as I see some failures here, checked"
[09:00:38] Emperor: yeah I agree
[09:01:14] and it should be a session at the next summit :)
[09:01:28] Emperor: it is not all; for example, if systemd creates a lot of alerts, as I said, maybe the tooling could be tuned too
[09:02:11] again, not saying we should reduce that for SREs, but I feel other teams didn't "sign up" for that, especially those without a lot of systems people
[09:02:37] it is difficult to find the right balance and will require a lot of input from many people
[09:03:41] then there are things that are "useful" for a team, but not for site stability- I ended up moving backup alerts mostly outside icinga for that reason
[09:04:28] as nobody should be notified because a single backup job run failed, but I want to know
[09:04:51] <_joe_> jynus: https://gerrit.wikimedia.org/r/c/operations/puppet/+/849928 and followups
[09:05:08] <_joe_> I don't know why I never got to merge them
[09:05:22] _joe_: looks nice, let me get added if I can help!
[09:05:29] <_joe_> that would help reduce the noise, wouldn't it
[09:05:33] happy to get some traction on raising these issues
[09:05:38] _joe_: indeed!
[09:06:03] so happy to have people commenting and caring about this
[09:07:07] I wonder if we could have some discussion session or something to move forward on some decisions?
[09:07:51] e.g. things like that patch and raising awareness of / tuning https://wikitech.wikimedia.org/wiki/Icinga#How_to_handle_active_alerts
[09:08:49] related- my new alert failed, and I think it is a missing dependency, going to fix that
[09:11:01] jynus: 302 lmata too
[09:11:56] I don't mind onfire (or observability LUL) taking care of this- but I would also love it if some grassroots process did it, so it didn't look "imposed"
[09:13:51] (and just to be clear- it is ok to have outstanding alerts for ongoing issues, it is when bad patterns arise that I think we can do better (e.g. too many systemd alerts, or some teams unaware of long ongoing alerts))
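On the point at 08:57 about keeping hosts-under-setup from paging: besides the hiera knob mentioned at 08:58, a plain Icinga downtime achieves something similar. A rough sketch assuming the sre.hosts.downtime cookbook; the exact flag names and the host query are from memory and should be checked against the cookbook's --help before use.

```bash
# schedule a downtime for a host while it is still being provisioned,
# so it neither alerts nor pages (db1234 is a placeholder hostname)
sudo cookbook sre.hosts.downtime --hours 72 -r "host still being set up" 'db1234.eqiad.wmnet'
```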
[09:14:40] now let me go fix the new warnings I created :-D
[09:17:58] I'm about to go and edit actually-private-puppet to go with https://gerrit.wikimedia.org/r/c/labs/private/+/868718
[09:23:18] jynus: I had the exact same feedback to give from my first on-call rotation before the holidays :)
[09:24:14] great, so please help too if you can by having a look at some of the services your team may own 0:-D
[09:34:05] jynus: I regularly do :) I still reset the systemd failure for train-presync because I'm nice like that :P
[09:34:28] :-(
[09:34:44] remember that if something is annoying, the issue is the automation (or lack of it), not you!
[09:35:58] fyi, esams switches upgrade in ~20min
[09:36:05] everything is already depooled and downtimed
[09:36:30] * jynus crosses fingers
[09:37:07] jynus: It *technically* isn't one of our services, but it's on one of our machines
[09:37:14] keeping an eye on drmrs as well for risks of transit link saturation - https://librenms.wikimedia.org/bill/bill_id=25/
[09:37:27] XioNoX: gl;hf
[09:38:18] private changes made, deploying hiera change
[09:40:52] (and running puppet by hand on a representative set of target nodes to check for no unexpected changes)
[10:00:55] esams going down
[10:08:36] hiera change has gone smoothly, I've put in https://gerrit.wikimedia.org/r/c/labs/private/+/879283 to remove the old entries and then we can close T162123 (opened in 2017) :)
[10:08:37] T162123: Refactor swift credentials to be global rather than per-site - https://phabricator.wikimedia.org/T162123
[10:13:16] you won't believe it but all 3 esams switches came back healthy, and with only a 10min downtime
[10:16:27] lol
[10:16:33] don't jinx it
[10:17:03] :)
[10:18:26] Arelion is running hot in drmrs but so far so good https://librenms.wikimedia.org/graphs/to=1673518200/id=23135/type=port_bits/from=1673496600/
[10:22:52] keeping esams depooled for the next maintenance with remote hands in 1h - https://phabricator.wikimedia.org/T325048
[10:39:44] I was playing around with victorops and noticed there are 17 incidents in "triggered" status that are very old (>6 months) and without a team assigned. do you mind if I batch-resolve them?
[10:42:37] let me see
[10:42:49] I didn't see anything not resolved last time I checked
[10:43:19] if you go to the dashboard and select "all teams" form the dropdown you should see them
[10:43:25] *from
[10:43:31] ah, I see- I was only looking at the SRE team
[10:45:11] they seem to be old ones, not properly classified?
[10:45:52] so go ahead
[10:46:34] although I wonder if they will generate alert spam?
[10:47:20] nah, they don't have anyone being notified- seems like tests from when it was initially set up
[10:47:41] probably yeah, I'll resolve them
[10:48:25] done
[10:48:57] there was indeed some alert spam :/
[10:48:57] paged for your team are routed correctly, to your knowledge?
[10:49:00] *pages
[10:49:10] yes I believe
[10:49:22] but let me know if you see something strange
[10:49:34] probably they predate the team division
[10:49:56] yes exactly
[11:54:55] fyi, we're going to get some alerts about esams, it's depooled, for T318783
[11:54:55] T318783: cr2-esams:FPC0 Parity error - https://phabricator.wikimedia.org/T318783
[12:01:55] moritzm: that's https://phabricator.wikimedia.org/T322529
[12:02:11] downtime/ack expired today as it was supposed to come back up today
[12:02:29] ack, ok
[13:02:02] claime: can you ping me a bit before the mwmaint restart? Please 🥺
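Regarding "running puppet by hand on a representative set of target nodes" (09:40) to check a hiera change for unexpected effects, a minimal sketch using stock puppet options; WMF wraps agent runs in helper scripts, but the flags below are plain upstream puppet.

```bash
# dry run: report what the new hiera/puppet change *would* do, without applying anything
sudo puppet agent --test --noop

# once the reported diff looks as expected, do a real run
sudo puppet agent --test
```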
[14:10:24] FYI, in 10 minutes I'll reboot irc1001, irc.wikimedia.org has been failed over to irc2001 which has already been rebooted, so any reconnecting bots will connect to it
[14:11:28] (the majority of bots don't seem to reconnect by default, #en.wikipedia on irc2001 only has 12 connected bots despite it being the primary for two days now)
[14:11:54] that has always been the case IIRC
[14:12:38] I'd suspect most bots will only reconnect once they have been disconnected
[14:18:52] yeah, that has been my experience for past reboots as well
[15:37:19] XioNoX: nice to know that esams is no longer 'too big to fail' :D
[15:38:26] cdanis: for sure! some drmrs links run hot, but it's under control
[15:39:39] <_joe_> indeed
[15:39:47] Amir1: For sure, it's next week anyways
[17:33:06] I was so excited about some other work that I forgot to mention stuff for handover- today was a calm day
[17:33:41] netops followed with their maintenance IIRC, but no incidents
[17:34:33] I've been trying to reduce the number of criticals on icinga to facilitate oncall work, check if you can help somehow :-D
[17:35:20] jynus: Taking a look. ^^
[17:36:12] (especially if you know some easy wins for services your team owns or knows about), but no need to go overboard
[17:36:39] e.g. my largest worry was the large number of systemd checks failing, sometimes for a long time
[17:36:40] thanks jynus and also for your work on check_legal. I love https://wikitech.wikimedia.org/wiki/Check_legal_html
[17:36:47] looking at icinga
[17:37:35] often systemd checks can be fixed with 'systemctl reset-failed'
[17:37:46] because they happened in the past but not anymore
[17:37:53] and if they do, it just comes back by itself
[17:38:01] mutante: yeah, I did a few that had wrong services still enabled
[17:38:17] e.g. puppet removed them but didn't update systemd properly or something
[17:38:26] cool! let me see what else we have
[17:38:35] Taking the opportunity to re-up _joe_'s work https://gerrit.wikimedia.org/r/c/operations/puppet/+/849928
[17:38:54] (excludes individually monitored systemd units from the general systemd check)
[17:38:54] basically a clearer dashboard will make it much easier to spot anomalies
[17:39:13] claime: that will help too, but it is too late in my day today, sorry
[17:39:50] oh, I had not seen that, thanks for linking to it, nice
[17:39:56] jynus: i know, I was informing mutante and denisse ;)
[17:40:07] oh, sorry :-D
[17:40:35] well, starting with "HOSTS down", one of the mc machines did not come back from reboot or something
[17:40:44] mc2040
[17:40:55] effie: did that have a problem during maintenance?
[17:42:13] SAL says it was just normally rebooted. but it's down now. looking at mgmt console
[17:42:24] 2040 ?
[17:42:29] not 1040 ?
[17:42:44] effie: yea, 2040
[17:42:52] now that is odd
[17:43:10] it was rebooted without issues 3 days ago according to my logs
[17:43:13] if you are on the console can you just do a powercycle please?
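For reference, the "looking at mgmt console" step above (17:42) for a host that dropped off the network looks roughly like the sketch below. The mgmt hostname follows the usual <host>.mgmt.<site>.wmnet convention seen elsewhere in this log, and the available racadm subcommands depend on the iDRAC generation, so treat this as illustrative rather than exact.

```bash
# from a bastion, connect to the host's iDRAC over SSH
ssh root@mc2040.mgmt.codfw.wmnet

# then, typed at the racadm>> prompt:
racadm serveraction powerstatus   # is the box even powered on?
racadm serveraction powercycle    # hard power cycle, as done just below
console com2                      # attach to the serial console to watch it boot
```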
[17:43:30] I did it, and it came back, so something else is odd
[17:43:35] But it crashed today: 2023-01-12 13:50:45 [+icinga-wm] PROBLEM - Host mc2040 is DOWN: PING CRITICAL - Packet loss = 100%
[17:43:48] ok I missed that one
[17:43:49] I am on mgmt
[17:43:55] effie: me too :/
[17:43:58] mutante: please powercycle it
[17:44:06] console is empty, will powercycle
[17:44:13] claime: I was taking care of the eqiad ones, I should have seen that
[17:44:14] anyway
[17:45:08] racadm>>racadm serveraction powercycle
[17:45:08] Server power operation initiated successfully
[17:46:39] ..booting...
[17:47:05] mutante: thanks for contributing to the grind! Leaving you for the day! I am happy with what was accomplished today!
[17:47:22] jynus: thx, cya
[17:47:48] sigh, this is a rather new machine
[17:48:02] effie: SSH should work again now
[17:48:06] just got on
[17:48:19] yeah, just saw my pings getting a reply
[17:48:29] thank you for taking care of this, daniel
[17:48:38] 20 | Jan-12-2023 | 13:50:29 | ECC Uncorr Err | Memory | Uncorrectable memory error
[17:48:40] Gronk..
[17:48:52] That ain't good
[17:49:05] Description: The self-heal operation successfully completed at DIMM DIMM_B2.
[17:49:10] Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B2.
[17:49:25] self-heal successful? wow, lol
[17:49:35] mutante: AI™
[17:49:41] omg :p
[17:49:59] Description: The memory health monitor feature has detected a degradation in the DIMM installed in DIMM_B2. Reboot system to initiate self-heal process
[17:50:05] and then we did
[17:50:33] sukhe: Now even ECC is AI smh my head
[17:51:03] https://phabricator.wikimedia.org/P43145
[17:51:22] Ah, racadm got more info than getsel
[17:51:29] ipmi-sel*
[17:51:50] this is a new machine?
[17:51:55] We probably want to change that DIMM
[17:51:58] we are supposed to create a dcops ticket using their template, I think
[17:52:02] if we want it replaced
[17:52:12] mutante: yep
[17:52:36] sukhe: November 2021
[17:54:13] https://phabricator.wikimedia.org/T326834
[17:54:58] denisse: I used the template for a ticket linked at https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Hardware_Troubleshooting_Runbook .. just to share
[17:55:24] claime: wanna paste your output too ^ ?
[17:55:45] mutante: My output is an ipmi truncated version of yours
[17:55:50] But yeah sure
[17:55:50] mutante: Thanks, I've subscribed. :)
[17:56:41] claime: effie: well.. should we depool it though?
[17:57:04] the dcops template says so.. but it can also wait until they actually get to it I suppose
[17:57:47] "Put system into a failed state in Netbox." OK, can do. "Provide urgency of request, along with justification (redundancy, dependencies, etc)" eh.. not sure
[17:58:17] It was down for 4 hours with no impact
[17:58:29] So I'd say low-medium
[17:58:50] 'k
[17:59:01] mutante: about mc2040 ?
[17:59:14] yeah
[17:59:33] set to 'failed' in netbox
[17:59:37] the gutter pool takes over
[17:59:40] because it says to do that
[18:00:11] if you have not created a DCops request, I can do so
[18:00:17] effie: so no confctl action needed? it's because the dcops template says so
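On the getsel / ipmi-sel exchange above (17:51), two ways to read the same hardware event log that recorded the DIMM_B2 error; package names and output details can vary per host, so this is a sketch rather than an exact recipe.

```bash
# from the OS, with FreeIPMI (freeipmi-tools on Debian): show the most recent SEL entries
sudo ipmi-sel | tail -n 20

# from the iDRAC, which often carries the longer human-readable descriptions
# pasted above; run at the racadm>> prompt after ssh'ing to the mgmt interface
racadm getsel
```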
[18:00:29] mutante: yes, and FYI if you didn't know, it's the requester's responsibility to put it back to active when done
[18:00:32] effie: I did, but see the checkboxes https://phabricator.wikimedia.org/T326834
[18:01:58] maybe you can do the 2 remaining ones
[18:02:00] the gutter pool takes over
[18:02:08] great
[18:02:15] yeah I will deal with it tomorrow because it is rather late here
[18:02:25] thank you very much !
[18:02:41] Only the depooling left over
[18:02:54] ok, good night effie
[18:03:00] g'night effie o/
[18:03:36] ttyl!
[18:03:37] Good night effie. :)
[18:05:32] claime: yea, so since these are not in confctl, I don't know of any other depooling
[18:05:57] mutante: from what I understand, it should fail over on its own through mcrouter and the gutter pool
[18:06:28] yea, I got that we have no technical problem thanks to the gutter pool
[18:06:36] I was just speaking about that checkbox
[18:06:38] but it's fine
[18:06:43] Ah right
[18:07:24] dcops just wants "you can work on it anytime"
[18:07:49] mutante: lol that's exactly what I was typing
[18:08:07] ah, it's ops-codfw though
[18:08:20] ok, please do :)
[18:10:59] so, no more hosts down in Icinga, but 30 service alerts to go
[18:11:17] normal range though I guess
[18:14:08] out of 30, 19 are active alerts that are not acked or downtimed but only have disabled notifications. I'd still advise against using that way to silence
[18:14:54] so I'm going to ACK all those because otherwise they are indistinguishable from real problems
[18:15:22] incl. cassandra*, cloudcontrol*
[18:16:22] but since disabled notifications never auto-re-enable themselves, they are also often forgotten from previous times.. so you can have real alerts in that state
[18:16:37] something that does not happen when using downtimes
[18:19:57] fyi mutante, downtimed mc2040, dcops are going to work on it
[18:20:06] and now I'm out o/
[18:20:22] claime: See you!!
[18:20:40] Bye denisse :)
[18:21:24] thanks, cya
[18:53:08] XioNoX: cr2-eqsin and cr4-ulsfo both have 1 interface down. is that a link between them and a known maintenance we are waiting for, like the other day?
[18:59:28] mutante: it used to be easier to see this on icinga... ah, those alerts still come from icinga
[18:59:43] you can look in the details on icinga to see which interfaces
[18:59:47] https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=cr2-eqsin&service=Router+interfaces
[19:00:10] `xe-0/1/4: down -> Transport: cr4-ulsfo:xe-0/1/2 (SingTel...` from cr2-eqsin
[19:00:18] so yeah, a transport link between the two routers
[19:00:45] I see some recent emails from Arzhel to Singtel as well
[19:05:07] cdanis: thank you! ACK :)
[21:30:44] got my ripe atlas probe in the mail, https://atlas.ripe.net/probes/62952/
[21:33:08] Nice jhathaway
[21:33:33] not having usb sticks that inevitably break sounds useful
[21:33:56] jhathaway: You know it displays your address, right?
[21:34:08] (or an address, at least)
[21:34:34] yeah thanks, I guess I could make it more anonymous, but I'm not exactly concerned about the privacy risk, though perhaps I should be?
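On the confctl question above (18:00, 18:05): mc2040 is not managed by conftool, so mcrouter's gutter pool covers it, but for hosts that are behind conftool a manual depool looks roughly like the sketch below. mw1234 is a placeholder and the selector syntax is from memory, so double-check against confctl --help before use.

```bash
# inspect the current state of a conftool-managed host
sudo confctl select 'name=mw1234.eqiad.wmnet' get

# take it out of rotation (and set/pooled=yes to put it back afterwards)
sudo confctl select 'name=mw1234.eqiad.wmnet' set/pooled=no
```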
[21:35:27] They didn't make that clear when I was naming the probe, or I probably didn't read carefully enough, I thought it was my private name
[21:35:43] I can tell you many stories but the risk is fairly low
[21:36:05] I'm not aware of any LTA who's actually taken action on threats made
[21:36:45] perhaps I'll change it to the nearest cross streets
[21:39:33] updated, thanks folks
[21:39:47] You'd think they'd put a disclaimer under the box
[21:40:19] :)
[21:40:35] Do they pay bug bounties?
[21:40:44] not sure
[23:36:44] jhathaway: the ripe atlas looks very interesting
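For the probe linked above (21:30), its connection status can also be checked from the public RIPE Atlas v2 API; the endpoint exists, but the exact response fields shown here are from memory and may differ.

```bash
# fetch the probe object and show its status block ("Connected", "Disconnected", ...)
curl -s https://atlas.ripe.net/api/v2/probes/62952/ | jq '.status'
```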