[14:35:44] we now have metrics reflecting whether we have diffs between HEAD and the live kubernetes state for admin_ng, per k8s cluster
[14:35:44] https://thanos.wikimedia.org/graph?g0.expr=helmfile_admin_ng_pending_changes&g0.tab=0&g0.stacked=0&g0.range_input=1m&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
[14:36:19] note: the metric for cluster=staging is at 3.0, because helmfile -e staging exits with exit code 3, meaning "environment does not exist"
[14:36:47] nice!
[14:36:54] can we exclude staging?
[14:37:11] from puppet you mean?
[14:37:45] yeah...don't run helmfile for it I mean
[14:38:21] I was wondering whether `staging` should still be returned by `k8s::fetch_cluster_groups()` in the first place. But since it somehow is, we could definitely add an `if $cluster == "staging"` condition somewhere
[14:38:38] and delete the associated timer and service
[14:42:00] yes, it has to be returned by fetch_cluster_groups() as it's a valid cluster in some context
[14:42:24] ah, actually, `staging` does not appear in hieradata/common/kubernetes.yaml, but it's the alias for staging-eqiad atm
[14:42:26] ack
[14:42:26] but it's actually an alias to either staging-eqiad or staging-codfw
[14:43:05] function k8s::fetch_clusters has an argument to not return it, but ofc. it also makes some sense to use the existing loop in helmfile.pp
[14:46:20] brouberol: I am about to test your metric :)
[14:46:40] also, I didn't even realize thanos had a UI
[14:47:24] for now, the metric is only exported daily, because we're ultimately going to hook alerting on top of it. When you have a pending diff, ping me and I'll re-run the timers
[14:47:55] we don't want/need to have this metric change too fast IMO, as we only want to alert about diffs that were forgotten about and have been pending for >=1d
[14:48:23] I think that's probably better handled at the alertmanager level?
[14:48:59] you can write `for: 1d` in the alert definition
[14:49:15] which does exactly what it sounds like
[14:50:01] I think we don't run helmfile too often...but once an hour maybe?
[14:50:08] that's fair. We could emit the metric hourly I guess
[14:50:13] there you go
[14:50:16] eheh
[14:51:32] I'll prep a series of patches to absent all resources related to the staging cluster, and another one to change the timer frequency
[15:07:59] I've assigned the first part to you jayme. No rush whatsoever
[19:20:46] brouberol: https://i.imgur.com/yFYH9MU.png works btw :)
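
For reference, the `for: 1d` suggestion from the discussion could look roughly like the rule below. This is only a sketch, not the rule that was actually deployed. The metric name, the `cluster` label, and `for: 1d` come from the conversation. The group name, alert name, severity label, and summary text are made up for illustration. The `> 0` comparison assumes the metric is 0 when there is no pending diff, and the `cluster!="staging"` matcher is just one possible way to ignore the alias that currently reports 3.0.

```yaml
# Hypothetical Prometheus alerting rule; a sketch, not the deployed config.
groups:
  - name: helmfile_admin_ng                      # assumed group name
    rules:
      - alert: HelmfileAdminNgPendingChanges     # assumed alert name
        # Assumes a value of 0 means "no diff between HEAD and live state".
        # staging is excluded because it reports 3.0 (helmfile exit code 3).
        expr: helmfile_admin_ng_pending_changes{cluster!="staging"} > 0
        # Fire only once the condition has held continuously for one day.
        for: 1d
        labels:
          severity: warning                      # assumed severity
        annotations:
          summary: "admin_ng has had pending changes on {{ $labels.cluster }} for over a day"
```

With the delay expressed in the rule itself, the metric can be emitted hourly without alerting on diffs that get applied within the day.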