[13:01:44] elukey: ok to merge your alertmanager patch?
[13:05:13] elukey: I'm merging it
[13:06:00] thanks! (and sorry)
[13:06:15] np
[14:16:36] taavi: I caught your commit on puppet merge, ok to proceed?
[14:16:42] effie: yes please
[14:16:46] cheers
[14:42:15] When debugging a Prometheus metric, compared to how it was with Graphite, there are quite a lot of labels injected after the fact. Is there a list of these "reserved" labels that basically can't be used?
[14:42:43] e.g. when exploring in Grafana, I spent a good 5 minutes waiting for slow queries to load one by one while identifying each label to exclude: sum(mediawiki_WikimediaEvents_authmanager_success_total) without (instance, app, job, kubernetes_pod_name, kubernetes_namespace, prometheus, site, cluster, pod_template_hash, routed_via, release)
[14:43:37] maybe those should be grouped under a common prefix (apart from, I guess, semi-standard ones like "instance", "site", and "cluster").
[14:44:02] or maybe removed in the case of mediawiki/statsd-exporter if we don't need them. That'd presumably reduce cardinality by quite a bit.
[14:45:31] I think the biggest offender in terms of cardinality would be kubernetes_pod_name
[14:46:07] The above produces ~43 time series currently
[14:46:20] e.g. when I inverted it to `by()`
[14:46:21] sum(mediawiki_parseroutputaccess_cache) by (instance, app, job, kubernetes_pod_name, kubernetes_namespace, prometheus, site, cluster, pod_template_hash, routed_via, release)
[14:46:29] (last 2 days)
[14:46:43] that's a fairly big multiplier
[14:47:31] Raine: yeah, although it's 1 per k8s host afaik
[14:47:38] but yeah that will have churn
[14:47:45] it's not a constant number over time
[14:47:56] No, kubernetes_pod_name is one per k8s pod
[14:47:59] constant amount, but not static.
[14:48:03] So a few thousand
[14:48:12] hundreds
[14:48:13] claime: we're both right
[14:48:15] same difference
[14:48:33] statsd-exporter isn't 1 per mw pod, right?
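(A side note on the queries above: rather than excluding labels one by one, per-label series counts make the cardinality contributor obvious. A sketch in PromQL, using the metric name quoted in the log; the set of labels to check is illustrative:)

```promql
# Total series for the metric (the log above reports ~43)
count(mediawiki_WikimediaEvents_authmanager_success_total)

# Distinct values per suspect label; the label with the most distinct
# values is the main cardinality contributor (per the discussion,
# kubernetes_pod_name, with ~38 of the ~43 series)
count(count by (kubernetes_pod_name) (mediawiki_WikimediaEvents_authmanager_success_total))
count(count by (instance) (mediawiki_WikimediaEvents_authmanager_success_total))
```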
[14:48:43] No, it's 1 per namespace
[14:49:18] Ah right
[14:49:36] Not 1, a couple per namespace
[14:50:23] not all metrics are emitted from all clusters, so it's a little fuzzy to get a total, but kubernetes_pod_name contributes 38 of the 43 it seems.
[14:50:35] so yeah, the others all correlate very well, which makes sense; they don't multiply
[14:50:36] But yeah, I confused it with the other metrics where the kubernetes_pod_name label is per mediawiki pod, but here it's per statsd-exporter pod
[14:50:50] the same pod name isn't going to exist elsewhere since it's globally unique
[14:52:37] to a first approximation instance == pod template == pod name, over recent history, given high uptime.
[14:52:50] so not so easy to reduce
[14:52:56] each has about 38-40
[14:53:44] 10.194.141.247:9102 / statsd-exporter-prometheus-586469b67-9g7tq / 586469b67
[14:54:34] personally I wouldn't mind if all three disappeared, but I worry `instance` might be relied upon somewhere as an assumed thing that always exists?
[14:55:04] again, specifically for metrics from the mediawiki app only; anything infra-related I'm sure they're useful :)
[14:55:59] or maybe the copy that goes from prom/ops to thanos could drop them, if we want to keep them short-term but not queried by default in the common case.
[17:18:21] the "netbox reports" icinga check has been alerting for > 7 days
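(On the 14:55:59 idea of dropping these labels for MediaWiki application metrics: if done at scrape time rather than on the prom/ops → thanos copy, a Prometheus `metric_relabel_configs` stanza along these lines could do it. This is a sketch only; the job name and the exact label list are illustrative, not taken from the actual config:)

```yaml
scrape_configs:
  - job_name: mediawiki-statsd-exporter   # hypothetical job name
    metric_relabel_configs:
      # labeldrop removes any label whose name matches the regex from
      # every scraped series; here we drop the labels discussed above
      # as redundant for MediaWiki app metrics. Semi-standard labels
      # like instance/site/cluster are deliberately left alone.
      - action: labeldrop
        regex: (kubernetes_pod_name|pod_template_hash|routed_via)
```

(Note that dropping labels only reduces cardinality if the remaining label sets stay unique; as observed in the log, instance, pod template, and pod name correlate almost 1:1 here, so they would need to be dropped together.)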