[14:42:54] hey, I've been digging at a mystery with our edge traffic drop alerts @ drmrs (and other edge sites)
[14:43:07] to summarize the dots I've tried to connect:
[14:43:33] 1) the alert we see firing on IRC, for the specific case of text@drmrs:
[14:43:36] 13:34 < jinxer-wm> (EdgeTrafficDrop) resolved: 69% request drop in text@drmrs during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop -
[14:43:40] https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=drmrs&var-cache_type=text - https://alerts.wikimedia.org
[14:43:57] I know jinxer is driven by the AM infra in general
[14:44:47] there's a version of these monitors that's puppetized into icinga as well, but that puppetization explicitly only configures the existing 5 prod sites, not drmrs, so I know this alert isn't related to icinga in that sense.
[14:45:35] 2) the alerts repo has the rule in question, configured as the first of two entries here:
[14:45:38] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/alerts/+/refs/heads/master/team-traffic/traffic.yaml
[14:46:29] it seems to generically template on the "site" values, so that somehow led to it recently beginning to pick up "drmrs" as a site value and firing these alerts, once we started bringing up cp6xxx nodes in drmrs with real puppetization yesterday
[14:46:41] but:
[14:47:30] when I go to https://alerts.wikimedia.org/ , I can't figure out how to silence or view it. In both the main query/search bar, and in the "silence a rule" form, the site= metadata seems to only match the two core sites "eqiad" and "codfw"
[14:47:43] I can't ever get results to show up for anything at an edge site
[14:48:10] and I've been digging in the various related repos, and I don't seem to find any configuration that explicitly limits the UI to only knowing about site=eqiad|codfw, either.
[14:49:05] I suspect it's very possible I'm just failing at understanding how to use the UI, and that maybe because eqiad|codfw are the only sites with currently-active alerts, that limits what it shows me for site=completions. But I did clear the default @state=active filter.
[14:49:52] any ideas?
[14:53:09] (either that, or the UI is limited in its view to only the core sites, in some configuration I've failed to discover)
[14:56:59] I'm getting the feeling as I explore the alert UI more, that it has no knowledge of alerts that are not currently firing (meaning you can't match/see anything if you don't get there in time while it's active)?
[14:59:08] apparently in the silence UI, I can just manually specify site=drmrs. It's hard to know if that does what I think it does without any existing examples to match, but I'll do it anyways and see how it goes!
[21:49:04] bblack: I'm glad you were able to put a silence in place. IIUC, Alertmanager doesn't have knowledge of what is getting checked, but gets notifications from Prometheus that metrics are reading out of spec. The EdgeTrafficDrop alert splits the query by site. If `site=drmrs` exists, the query will pick it up and fire notifications to Alertmanager unless it is explicitly filtered
[21:49:07] out of the query. g.odog knows better than I do and may have a good idea on how to best handle this issue
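
For anyone unsure what a hand-entered site=drmrs silence actually does: a rough sketch of the matching idea, not the production setup. The payload fields follow the Alertmanager v2 silences API (matchers, startsAt, endsAt, createdBy, comment). The helper names, the 24-hour window, the "example-user" author and the alert label sets are all made up for illustration, and the real EdgeTrafficDrop alert may carry different labels. The point is that a silence is only a set of label matchers, so it can be created before any matching alert is firing.

```python
import re
from datetime import datetime, timedelta, timezone


def build_silence(matchers, hours, created_by, comment, now=None):
    """Build a silence payload shaped like the Alertmanager v2 API expects.

    `matchers` is a plain dict of label name -> exact value (hypothetical helper).
    """
    now = now or datetime.now(timezone.utc)
    return {
        "matchers": [
            {"name": name, "value": value, "isRegex": False, "isEqual": True}
            for name, value in matchers.items()
        ],
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(hours=hours)).isoformat(),
        "createdBy": created_by,
        "comment": comment,
    }


def silence_matches(silence, labels):
    """Local approximation of silence matching: every matcher must hold.

    A label missing from the alert is treated as the empty string, and
    regex matchers are anchored to the whole value.
    """
    for m in silence["matchers"]:
        actual = labels.get(m["name"], "")
        if m["isRegex"]:
            hit = re.fullmatch(m["value"], actual) is not None
        else:
            hit = actual == m["value"]
        if hit != m["isEqual"]:
            return False
    return True


# Assumed label sets, for illustration only.
silence = build_silence(
    {"alertname": "EdgeTrafficDrop", "site": "drmrs"},
    hours=24,
    created_by="example-user",
    comment="drmrs bring-up, traffic not yet expected",
)
drmrs_alert = {"alertname": "EdgeTrafficDrop", "site": "drmrs"}
eqiad_alert = {"alertname": "EdgeTrafficDrop", "site": "eqiad"}
print(silence_matches(silence, drmrs_alert), silence_matches(silence, eqiad_alert))
```

Extra labels on the alert do not stop a match; only the labels named in the silence are compared. Excluding the site in the alert rule's query itself would be the other route, as suggested above.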