[08:26:29] <_joe_> godog: so small update, the problem is as always dual-stack networking. prometheus-statsd-exporter binds to :::9125 for udp in the pod
[08:26:41] <_joe_> but k8s services only speak ipv4
[08:29:54] <_joe_> or at least, that's my last guess
[08:44:49] siiigh
[08:44:53] thank you for the update _joe_
[13:31:46] <_joe_> I now see metrics from mediawiki being collected out of mw-debug
[13:32:48] <_joe_> https://prometheus-eqiad.wikimedia.org/k8s/graph?g0.expr=mediawiki_pagestore_linkcache_accesses_total%7Bkubernetes_namespace%3D~%22.*mw-debug%22%7D&g0.tab=1&g0.stacked=0&g0.show_exemplars=0&g0.range_input=1h
[14:07:58] _joe_: excellent
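For context on the mismatch above, the two usual knobs are pinning the exporter's UDP listener to the IPv4 wildcard, or asking for a dual-stack Service. A minimal sketch only, assuming the stock prometheus/statsd_exporter `--statsd.listen-udp` flag and the upstream Kubernetes dual-stack Service fields; image, names, and selectors are illustrative, not the actual mw-debug chart:

```yaml
# Illustration only -- not what was deployed; names and image are assumptions.
# Option 1: make the exporter listen explicitly on the IPv4 wildcard so a
# v4-only Service can reach it (statsd_exporter's --statsd.listen-udp flag).
apiVersion: v1
kind: Pod
metadata:
  name: statsd-exporter-example
  labels:
    app: statsd-exporter-example
spec:
  containers:
    - name: statsd-exporter
      image: prom/statsd-exporter
      args:
        - --statsd.listen-udp=0.0.0.0:9125
      ports:
        - containerPort: 9125
          protocol: UDP
---
# Option 2: ask Kubernetes (1.20+) for a dual-stack Service, where the
# cluster has dual-stack networking enabled.
apiVersion: v1
kind: Service
metadata:
  name: statsd-exporter-example
spec:
  selector:
    app: statsd-exporter-example
  ipFamilyPolicy: PreferDualStack
  ports:
    - name: statsd-udp
      port: 9125
      protocol: UDP
```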
[18:34:50] does anyone know, can the alertmanager 'know' that it is sending a notification due to a silence that just expired? and add that as an annotation to the alert body?
[18:37:06] if I'm thinking about the, ah, "user journey" of receiving a page, it's really nice to know up-front that it's due to an expired silence
[18:39:22] .... wow I had never ever known about `kthxbye` until just now
[18:52:07] it's even on the wikis :) https://en.wiktionary.org/wiki/kthxbye
[18:52:37] bblack: I meant the software, not the slang
[18:52:39] :)
[18:53:13] ah https://github.com/prymitive/kthxbye
[18:53:21] I never saw that before. kinda hard to google for :)
[18:53:47] our documentation for it on wikitech is good
[19:32:25] \o Hi o11y, inflatador and I are working on some new alerts and we're having trouble with getting `{{$labels.topic}}` to plumb through properly. See patch here: https://gerrit.wikimedia.org/r/c/operations/alerts/+/1043198
[19:33:12] We're expecting `summary="CirrusSearch job topic eqiad.cirrussearch.update_pipeline.update.rc0 blahblahblah` but getting `summary="CirrusSearch job topic blahblahblah`
[19:35:23] I suspect there's something off with our `cirrussearch_test.yaml` entry for this alert (`CirrusSearchUpdatePipelineUnexpectedUpdateTopicMessageRateDrop`, we'll probably choose a less verbose name later :P) but I can't see anything obviously syntactically wrong when comparing to the known working alert `CirrusSearchJobQueueBacklogTooBig` which uses `{{$$labels.topic}}` in a similar way
[19:44:34] Okay I've partially rubber-ducked. Getting rid of the sum makes it plumb through properly
[19:47:20] So I think ultimately it's just a square peg / round hole type problem. Ultimately we're just summing eqiad and codfw's message rates so I think we'll just hardcode the summary to say something like `The summed message update rate of topics (eqiad|codfw).cirrussearch.update_pipeline.update.rc0 is too low`
[19:54:07] ryankemper: I suspect it's probably possible to write a promql rule that fires an alert if the per-cluster rate has dropped a lot *and* the globally-summed rate has dropped too, but it needs some trick I haven't thought of yet
[19:54:32] some examples of other scenarios https://www.robustperception.io/combining-alert-conditions/
[19:55:10] ah actually, what you need is the `on ()` trick from the last example
[20:13:38] nice, thanks cdanis
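To illustrate the two points from the thread above: keeping the label in the aggregation is what makes `{{ $labels.topic }}` visible to the annotation template (a bare `sum()` drops every label), and the `on ()` trick combines a per-topic condition with the globally-summed one. A sketch only: the metric name `cirrus_update_messages_total` and the thresholds are made up, not taken from the actual patch:

```yaml
# Illustrative only -- metric name and thresholds are stand-ins.
groups:
  - name: cirrussearch_example
    rules:
      # sum by (topic) keeps the topic label, so the template below can use it.
      - alert: CirrusSearchUpdateTopicRateLow
        expr: sum by (topic) (rate(cirrus_update_messages_total[5m])) < 1
        for: 15m
        annotations:
          summary: "CirrusSearch update rate for topic {{ $labels.topic }} is too low"

      # The `and on ()` trick from the robustperception post: only fire the
      # per-topic alert if the globally-summed rate has also dropped. This
      # works because the label-free sum() on the right yields exactly one
      # element, so every per-topic series on the left matches it.
      - alert: CirrusSearchUpdateTopicRateLowGlobalToo
        expr: |
          sum by (topic) (rate(cirrus_update_messages_total[5m])) < 1
          and on ()
          sum(rate(cirrus_update_messages_total[5m])) < 10
        for: 15m
        annotations:
          summary: "CirrusSearch update rate for topic {{ $labels.topic }} dropped, and so did the overall rate"
```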