[14:35:44] <brouberol>	 we now have metrics reflecting whether we have diffs between HEAD and the live kubernetes state for admin_ng, per k8s cluster  
[14:35:44] <brouberol>	 https://thanos.wikimedia.org/graph?g0.expr=helmfile_admin_ng_pending_changes&g0.tab=0&g0.stacked=0&g0.range_input=1m&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
[14:36:19] <brouberol>	 note: the metric for cluster=staging is at 3.0 because `helmfile -e staging` exits with code 3, meaning "environment does not exist"
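To illustrate the note above, here is a minimal sketch of how an exit code could end up as the sample value, assuming the exporter writes the Prometheus text exposition format. The function name `format_metric` is hypothetical, and this is not the actual script behind the timer; only the metric name and the `cluster` label come from the conversation.

```python
def format_metric(cluster: str, exit_code: int) -> str:
    """Render one sample in Prometheus text exposition format.

    Assumption: the helmfile exit code is exported unchanged as the
    sample value, which is why staging shows up as 3.0.
    """
    return f'helmfile_admin_ng_pending_changes{{cluster="{cluster}"}} {float(exit_code)}'


if __name__ == "__main__":
    print(format_metric("staging", 3))
```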
[14:36:47] <jayme>	 nice!
[14:36:54] <jayme>	 can we exclude staging?
[14:37:11] <brouberol>	 from puppet you mean?
[14:37:45] <jayme>	 yeah...don't run helmfile for it I mean
[14:38:21] <brouberol>	 I was wondering whether `staging` should still be returned by `k8s::fetch_cluster_groups()` in the first place. But since it somehow is, we could definitely add an `if $cluster == "staging"` condition somewhere
[14:38:38] <brouberol>	 and delete the associated timer and service
[14:42:00] <jayme>	 yes, it has to be returned by fetch_cluster_groups() as it's a valid cluster in some context
[14:42:24] <brouberol>	 ah, actually, `staging` does not appear in hieradata/common/kubernetes.yaml, but it's the alias for staging-eqiad atm
[14:42:26] <brouberol>	 ack
[14:42:26] <jayme>	 but it's actually an alias to either staging-eqiad or staging-codfw
[14:43:05] <jayme>	 function k8s::fetch_clusters has an argument to not return it, but ofc. it also makes some sense to use the existing loop in helmfile.pp
[14:46:20] <cdanis>	 brouberol: I am about to test your metric :)
[14:46:40] <cdanis>	 also, I didn't even realize thanos had a UI
[14:47:24] <brouberol>	 for now, the metric is only exported daily, because we're ultimately going to hook alerting on top of it. When you have a pending diff, ping me and I'll re-run the timers
[14:47:55] <brouberol>	 we don't want/need to have this metric change too fast IMO, as we only want to alert about diffs that were forgotten about and have been pending for >=1d
[14:48:23] <cdanis>	 I think that's probably better handled at the alertmanager level?
[14:48:59] <cdanis>	 you can write `for: 1d` in the alert definition
[14:49:15] <cdanis>	 which does exactly what it sounds like
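In Prometheus alerting rules, `for` keeps an alert pending until its expression has been true at every evaluation for the given duration. A simplified model of that behaviour, as a sketch only: the function name `alert_fires` and the sample layout are hypothetical, and real rule evaluation happens at a fixed interval inside Prometheus rather than over a list like this.

```python
from datetime import datetime, timedelta


def alert_fires(samples, hold_for=timedelta(days=1)):
    """Return True if the value has been non-zero, without interruption,
    for at least `hold_for` up to the most recent sample.

    `samples` is a time-ordered list of (timestamp, value) pairs.
    """
    pending_since = None
    for timestamp, value in samples:
        if value > 0:
            # Condition holds: remember when it first became true.
            if pending_since is None:
                pending_since = timestamp
        else:
            # Condition cleared: the pending timer resets.
            pending_since = None
    if pending_since is None:
        return False
    return samples[-1][0] - pending_since >= hold_for


if __name__ == "__main__":
    start = datetime(2024, 1, 1)
    hourly = [(start + timedelta(hours=h), 1.0) for h in range(25)]
    print(alert_fires(hourly))
```

With this model an hourly metric still only alerts on diffs that stay pending for a full day, because any sample at 0 in between resets the timer.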
[14:50:01] <jayme>	 I think we don't run helmfile too often...but once an hour maybe?
[14:50:08] <brouberol>	 that's fair. We could emit the metric hourly I guess
[14:50:13] <brouberol>	 there you go
[14:50:16] <jayme>	 eheh
[14:51:32] <brouberol>	 I'll prep a series of patches to absent all resources related to the staging cluster, and another one to change the timer frequency
[15:07:59] <brouberol>	 I've assigned the first part to you jayme. No rush whatsoever
[19:20:46] <cdanis>	 brouberol: https://i.imgur.com/yFYH9MU.png works btw :)