[08:26:29] <_joe_> godog: so small update, the problem is as always dual-stack networking. prometheus-statsd-exporter binds to :::9125 for udp in the pod
[08:26:41] <_joe_> but k8s services only speak ipv4
[08:29:54] <_joe_> or at least, that's my last guess
[08:44:49] siiigh
[08:44:53] thank you for the update _joe_
[13:31:46] <_joe_> I now see metrics from mediawiki being collected out of mw-debug
[13:32:48] <_joe_> https://prometheus-eqiad.wikimedia.org/k8s/graph?g0.expr=mediawiki_pagestore_linkcache_accesses_total%7Bkubernetes_namespace%3D~%22.*mw-debug%22%7D&g0.tab=1&g0.stacked=0&g0.show_exemplars=0&g0.range_input=1h
[14:07:58] _joe_: excellent
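For context on the mismatch above, the two usual knobs are pinning the exporter's UDP listener to the IPv4 wildcard, or asking for a dual-stack Service. A minimal sketch only, assuming the stock prometheus/statsd_exporter `--statsd.listen-udp` flag and the upstream Kubernetes dual-stack Service fields; image, names, and selectors are illustrative, not the actual mw-debug chart:

```yaml
# Illustration only -- not what was deployed; names and image are assumptions.
# Option 1: make the exporter listen explicitly on the IPv4 wildcard so a
# v4-only Service can reach it (statsd_exporter's --statsd.listen-udp flag).
apiVersion: v1
kind: Pod
metadata:
  name: statsd-exporter-example
  labels:
    app: statsd-exporter-example
spec:
  containers:
    - name: statsd-exporter
      image: prom/statsd-exporter
      args:
        - --statsd.listen-udp=0.0.0.0:9125
      ports:
        - containerPort: 9125
          protocol: UDP
---
# Option 2: ask Kubernetes (1.20+) for a dual-stack Service, where the
# cluster has dual-stack networking enabled.
apiVersion: v1
kind: Service
metadata:
  name: statsd-exporter-example
spec:
  selector:
    app: statsd-exporter-example
  ipFamilyPolicy: PreferDualStack
  ports:
    - name: statsd-udp
      port: 9125
      protocol: UDP
```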
[18:34:50] does anyone know, can the alertmanager 'know' that it is sending a notification due to a silence that just expired? and add that as an annotation to the alert body?
[18:37:06] if I'm thinking about the, ah, "user journey" of receiving a page, it's really nice to know up-front that it's due to an expired silence
[18:39:22] .... wow I had never ever known about `kthxbye` until just now
[18:52:07] it's even on the wikis :) https://en.wiktionary.org/wiki/kthxbye
[18:52:37] bblack: I meant the software, not the slang
[18:52:39] :)
[18:53:13] ah https://github.com/prymitive/kthxbye
[18:53:21] I never saw that before. kinda hard to google for :)
[18:53:47] our documentation for it on wikitech is good
[19:32:25] \o Hi o11y, inflatador and I are working on some new alerts and we're having trouble with getting `{{$labels.topic}}` to plumb through properly. See patch here: https://gerrit.wikimedia.org/r/c/operations/alerts/+/1043198
[19:33:12] We're expecting `summary="CirrusSearch job topic eqiad.cirrussearch.update_pipeline.update.rc0 blahblahblah` but getting `summary="CirrusSearch job topic blahblahblah`
[19:35:23] I suspect there's something off with our `cirrussearch_test.yaml` entry for this alert (`CirrusSearchUpdatePipelineUnexpectedUpdateTopicMessageRateDrop`, we'll probably choose a less verbose name later :P) but I can't see anything obviously syntactically wrong when comparing to the known working alert `CirrusSearchJobQueueBacklogTooBig` which uses `{{$$labels.topic}}` in a similar way
[19:44:34] Okay I've partially rubber-ducked. Getting rid of the sum makes it plumb through properly
[19:47:20] So I think ultimately it's just a square peg / round hole type problem. Ultimately we're just summing eqiad and codfw's message rates so I think we'll just hardcode the summary to say something like `The summed message update rate of topics (eqiad|codfw).cirrussearch.update_pipeline.update.rc0 is too low`
[19:54:07] ryankemper: I suspect it's probably possible to write a promql rule that fires an alert if the per-cluster rate has dropped a lot *and* the globally-summed rate has dropped too, but it needs some trick I haven't thought of yet
[19:54:32] some examples of other scenarios https://www.robustperception.io/combining-alert-conditions/
[19:55:10] ah actually, what you need is the `on ()` trick from the last example
[20:13:38] nice, thanks cdanis
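To illustrate the two points from the thread above: keeping the label in the aggregation is what makes `{{ $labels.topic }}` visible to the annotation template (a bare `sum()` drops every label), and the `on ()` trick combines a per-topic condition with the globally-summed one. A sketch only: the metric name `cirrus_update_messages_total` and the thresholds are made up, not taken from the actual patch:

```yaml
# Illustrative only -- metric name and thresholds are stand-ins.
groups:
  - name: cirrussearch_example
    rules:
      # sum by (topic) keeps the topic label, so the template below can use it.
      - alert: CirrusSearchUpdateTopicRateLow
        expr: sum by (topic) (rate(cirrus_update_messages_total[5m])) < 1
        for: 15m
        annotations:
          summary: "CirrusSearch update rate for topic {{ $labels.topic }} is too low"

      # The `and on ()` trick from the robustperception post: only fire the
      # per-topic alert if the globally-summed rate has also dropped. This
      # works because the label-free sum() on the right yields exactly one
      # element, so every per-topic series on the left matches it.
      - alert: CirrusSearchUpdateTopicRateLowGlobalToo
        expr: |
          sum by (topic) (rate(cirrus_update_messages_total[5m])) < 1
          and on ()
          sum(rate(cirrus_update_messages_total[5m])) < 10
        for: 15m
        annotations:
          summary: "CirrusSearch update rate for topic {{ $labels.topic }} dropped, and so did the overall rate"
```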