[10:44:10] I asked on the sre channel, but maybe this is a better place. I have a question about pages/victorops/alertmanager: we had acknowledged a page on victorops and added a karma ack (!ACK message), but after the victorops ack expired it paged again. What should the process have been instead?
[10:44:35] Resolving on the victorops side and !ACK on alertmanager? Using an icinga downtime? Any other?
[10:45:26] Currently we have resolved on victorops, kept the ack on alertmanager, and added nothing on icinga (see https://alerts.wikimedia.org/?q=instance%3Dcloudnet1004%3A9100)
[13:34:41] hey dcaro, victorops will re-trigger acked alerts after 24 hours on its own. yes, resolving it in victorops will prevent re-triggering of the existing "old" alert, and at the same time an ack/downtime on the monitoring system will prevent new alerts from being generated
[14:12:48] okok, so a working workflow could be: resolve on victorops, ack on alertmanager, do nothing on icinga?
[14:17:04] dcaro: for a generalized workflow I'd say ack on the system where the alert originated, i.e. alertmanager and/or icinga
[14:17:25] in this case it looks like we have both "MD RAID" in icinga and SmartNotHealthy in alertmanager
[14:17:35] so acking both
[14:18:54] ah, looks like it was auto-acked already, yeah
[14:22:33] I thought all alerts passed through alertmanager
[14:24:27] so there's no way (yet?) to handle the alerts on one system only? (e.g. alertmanager)
[14:26:10] that's correct, it's a work in progress
[14:26:34] at least some of the alerts pass through alertmanager :) (the source:icinga label), what happens if I ack one of those alerts on alertmanager only?
[14:41:28] Also, if anyone has the time, https://gerrit.wikimedia.org/r/c/operations/puppet/+/802040 needs review; filipp.o was helping but now he's on PTO. I can wait though if nobody has time.
[15:09:55] essentially nothing will happen, they are presented for viewing at a glance but aren't in the flow in terms of paging
[15:22:01] okok, that's a bit inconvenient xd, is that going to change in the short term? Is the plan to move that 'authority' to alertmanager or just wait until all the alerts migrate from icinga?
[15:36:30] both basically, although the short-term focus is to reduce icinga's paging alert footprint, which is being tracked in T305847
[15:37:15] that'd be useful, yes, as then there would be only one place for page acks (alertmanager)
[15:37:53] wmcs is somewhat of a special case in terms of paging alerts from icinga since IIRC the icinga contact used will generate pages by default, regardless of the critical flag. I'm not sure if there is a task yet to unpack that
[15:37:59] I see in the task though that WMCS (my team) alerts will be tackled later
[15:38:38] let me know if I can help in that space, getting that sorted out would be a huge benefit for us :)
[15:42:47] cool, thanks, I'm not sure offhand when work on the wmcs alerts migration is expected to start, but I made a note to bring it up at our next o11y meeting, and you will be in the loop for sure
[15:43:09] also will have a look at the patch
[15:50:32] hmmm... we are paging on almost everything (through the wmcs-team contactgroup), it's going to be tricky xd
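A side note on the "ack on the system where the alert originated" advice above: for alertmanager that amounts to creating a silence, which can also be done directly against the Alertmanager v2 API instead of through karma. The snippet below is a minimal sketch, not existing tooling; the alertmanager.example.org host, the 24-hour window, and the instance matcher are illustrative assumptions modelled on the cloudnet1004 case discussed above.

```python
# Minimal sketch: create an Alertmanager silence via the v2 API, the
# programmatic equivalent of acking/downtiming on the monitoring side.
# alertmanager.example.org is a placeholder host; the matcher and the
# 24-hour duration are illustrative only.
from datetime import datetime, timedelta, timezone

import requests

ALERTMANAGER = "https://alertmanager.example.org"  # hypothetical endpoint

now = datetime.now(timezone.utc)
silence = {
    "matchers": [
        # Silence everything for the affected host, mirroring the
        # instance=cloudnet1004:9100 example above.
        {"name": "instance", "value": "cloudnet1004:9100", "isRegex": False},
    ],
    "startsAt": now.isoformat(),
    "endsAt": (now + timedelta(hours=24)).isoformat(),
    "createdBy": "dcaro",
    "comment": "acked in VictorOps; silenced at the source so no new pages fire",
}

resp = requests.post(f"{ALERTMANAGER}/api/v2/silences", json=silence, timeout=10)
resp.raise_for_status()
print("silence id:", resp.json()["silenceID"])
```

The same operation is available from the command line with amtool silence add; either way the silence stops new notifications at the source, while resolving in victorops takes care of the existing incident.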
[21:13:31] Heya! Where can I view active alertmanager alert configurations? I know about alerts.wm.o but that seems to only show active alerts, without a way to "show all"...
[21:15:48] I'm thinking of something like the prometheus dashboard that lists alerts (I don't see any such functionality in Thanos) - I suspect that an alert or two might have broken expressions
[21:17:54] Most alert configurations live in the alerts repository: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/alerts/+/refs/heads/master
[21:18:47] As to alert states, alertmanager only knows about state changes (e.g. OK->WARN)
[21:19:29] It's up to whatever is observing the state change to notify alertmanager for routing, relay, and deduplication.
[21:20:28] For alerts-repository alerts, the site-local Prometheus (unless configured globally) observes those state changes.
[21:20:44] I hope that makes sense :)
[21:21:46] cwhite: Thanks. Really, I wanted to verify that the alert expression itself is registered in AM and whether the expression is broken in prometheus. It seems the best way to do that is to assume that everything in the alerts repo is registered and then use thanos/grafana to test the queries
[21:23:32] Specifically, I'm suspicious of https://gerrit.wikimedia.org/r/plugins/gitiles/operations/alerts/+/refs/heads/master/team-traffic/traffic.yaml since there are no "layer=tls" entries at all for cluster_layer_code:trafficserver_responses_total:rate5m, only "layer=backend"
[21:23:35] That's a safe assumption, with one caveat: the alerts repository is deployed regularly via Puppet
[21:24:00] I wasn't sure whether any team-traffic stuff was in operation yet, hence the question
[21:24:09] That makes sense, thanks for that info
[21:26:43] brett: that's a good catch. It seems the layer label value has changed since this alert was created last year.
[21:26:59] cwhite: Cool. I'll create a CR. Thanks for the help
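Both questions raised here (is the alert rule actually loaded, and which layer label values actually exist for the metric) can also be answered programmatically from the Prometheus HTTP API. A minimal sketch follows, assuming a hypothetical prometheus.example.org query endpoint; real hostnames, Thanos routing, and authentication are left out.

```python
# Minimal sketch: check which alerting rules a Prometheus instance has loaded
# and which "layer" label values exist for the recording rule discussed above.
# prometheus.example.org is a placeholder for the actual query endpoint.
import requests

PROM = "https://prometheus.example.org"  # hypothetical endpoint
METRIC = "cluster_layer_code:trafficserver_responses_total:rate5m"

# 1) List loaded alerting rules: /api/v1/rules reflects what the server has
#    actually registered, not just what sits in the alerts repo.
groups = requests.get(f"{PROM}/api/v1/rules", timeout=10).json()["data"]["groups"]
for group in groups:
    for rule in group["rules"]:
        if rule.get("type") == "alerting" and METRIC in rule.get("query", ""):
            print(f"{group['name']}: {rule['name']} -> {rule['query']}")

# 2) List the label sets that exist for the metric, to see whether any series
#    carry layer="tls" or only layer="backend".
series = requests.get(
    f"{PROM}/api/v1/series", params={"match[]": METRIC}, timeout=10
).json()["data"]
print("layer values:", sorted({s.get("layer", "<none>") for s in series}))
```

For just the label check, running count by (layer) (cluster_layer_code:trafficserver_responses_total:rate5m) in Thanos or Grafana gives the same answer interactively.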