[13:01:44] elukey: ok to merge your alertmanager patch?
[13:05:13] elukey: I'm merging it
[13:06:00] thanks! (and sorry)
[13:06:15] np
[14:16:36] taavi: I caught your commit on puppet merge, ok to proceed?
[14:16:42] effie: yes please
[14:16:46] cheers
[14:42:15] When debugging a Prometheus metric, compared to how it was with Graphite, there are quite a lot of labels injected after the fact. Is there a list of these "reserved" labels that basically can't be used?
[14:42:43] e.g. when exploring in Grafana, I spent a good 5 minutes waiting for slow queries to load one by one while identifying each label to exclude: sum(mediawiki_WikimediaEvents_authmanager_success_total) without (instance, app, job, kubernetes_pod_name, kubernetes_namespace, prometheus, site, cluster, pod_template_hash, routed_via, release)
[14:43:37] maybe those should be grouped under a common prefix (apart from, I guess, semi-standard ones like "instance", "site", and "cluster").
[14:44:02] or maybe removed in the case of mediawiki/statsd-exporter if we don't need them. That'd presumably reduce cardinality by quite a bit.
[14:45:31] I think the biggest offender in terms of cardinality would be kubernetes_pod_name
[14:46:07] The above produces ~43 time series currently
[14:46:20] e.g. when I inverted it to `by()`
[14:46:21] sum(mediawiki_parseroutputaccess_cache) by (instance, app, job, kubernetes_pod_name, kubernetes_namespace, prometheus, site, cluster, pod_template_hash, routed_via, release)
[14:46:29] (last 2 days)
[14:46:43] that's a fairly big multiplier
[14:47:31] Raine: yeah, although it's 1 per k8s host afaik
[14:47:38] but yeah that will have churn
[14:47:45] it's not a constant number over time
[14:47:56] No, kubernetes_pod_name is one per k8s pod
[14:47:59] constant amount, but not static.
[14:48:03] So a few thousand
[14:48:12] hundreds
[14:48:13] claime: we're both right
[14:48:15] same difference
[14:48:33] statsd-exporter isn't 1 per mw pod, right?
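(A side note on the queries above: rather than excluding labels one by one, per-label series counts make the cardinality contributor obvious. A sketch in PromQL, using the metric name quoted in the log; the set of labels to check is illustrative:)

```promql
# Total series for the metric (the log above reports ~43)
count(mediawiki_WikimediaEvents_authmanager_success_total)

# Distinct values per suspect label; the label with the most distinct
# values is the main cardinality contributor (per the discussion,
# kubernetes_pod_name, with ~38 of the ~43 series)
count(count by (kubernetes_pod_name) (mediawiki_WikimediaEvents_authmanager_success_total))
count(count by (instance) (mediawiki_WikimediaEvents_authmanager_success_total))
```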
[14:48:43] No, it's 1 per namespace
[14:49:18] Ah right
[14:49:36] Not 1, a couple per namespace
[14:50:23] not all metrics are emitted from all clusters, so it's a little fuzzy to get a total, but kubernetes_pod_name contributes 38 of the 43 it seems.
[14:50:35] so yeah, the others all correlate very well, which makes sense; they don't multiply
[14:50:36] But yeah, I confused it with the other metrics where the kubernetes_pod_name label is per mediawiki pod, but here it's per statsd-exporter pod
[14:50:50] the same pod name isn't going to exist elsewhere since it's globally unique
[14:52:37] to a first approximation instance == pod template == pod name, over recent history, given high uptime.
[14:52:50] so not so easy to reduce
[14:52:56] each has about 38-40
[14:53:44] 10.194.141.247:9102 / statsd-exporter-prometheus-586469b67-9g7tq / 586469b67
[14:54:34] personally I wouldn't mind if all three disappeared, but I worry `instance` might be relied upon somewhere as an assumed thing that always exists?
[14:55:04] again, specifically for metrics from the mediawiki app only; anything infra-related I'm sure they're useful :)
[14:55:59] or maybe the copy that goes from prom/ops to thanos could drop them, if we want to keep them short-term but not queried by default in the common case.
[17:18:21] the "netbox reports" icinga check has been alerting for > 7 days
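(On the 14:55:59 idea of dropping these labels for MediaWiki application metrics: if done at scrape time rather than on the prom/ops → thanos copy, a Prometheus `metric_relabel_configs` stanza along these lines could do it. This is a sketch only; the job name and the exact label list are illustrative, not taken from the actual config:)

```yaml
scrape_configs:
  - job_name: mediawiki-statsd-exporter   # hypothetical job name
    metric_relabel_configs:
      # labeldrop removes any label whose name matches the regex from
      # every scraped series; here we drop the labels discussed above
      # as redundant for MediaWiki app metrics. Semi-standard labels
      # like instance/site/cluster are deliberately left alone.
      - action: labeldrop
        regex: (kubernetes_pod_name|pod_template_hash|routed_via)
```

(Note that dropping labels only reduces cardinality if the remaining label sets stay unique; as observed in the log, instance, pod template, and pod name correlate almost 1:1 here, so they would need to be dropped together.)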