[13:13:36] godog: where is the acl for the alertmanager api i.e. http://alertmanager-eqiad.wikimedia.org/
[13:13:46] i'd like to add the puppetdbs to the list
[13:16:51] jbond: yes, check out hieradata/common/profile/alertmanager/api.yaml
[13:16:57] godog: cheers
[13:17:03] I'm curious as to the why jbond ?
[13:19:01] godog: re T345909. i'm looking to move that script to the puppetdbs as most of the data is from there. however currently the script uses spicerack to look for downtimed hosts.
[13:19:16] i was thinking of changing that logic so that instead of looking for downtimed hosts in icinga
[13:19:23] it looks for silenced hosts in alertmanager
[13:23:56] jbond: got it, actually now that I'm thinking about it, what is the check looking for? could we replace it with an alertmanager alert instead, based on a prometheus expression with self-reported puppet agent data from the hosts themselves ?
[13:24:18] simpler, and already covers the downtimed host case automatically
[13:26:21] godog: hmm possibly. what the script currently does is look for hosts where every puppet run in the last 24 hours caused a change
[13:26:32] and then filters out any hosts that are downtimed
[13:26:57] let's see
[13:27:06] so yes i guess just a check where the puppet status was changed for the last 24 hours
[13:27:59] yeah maybe puppet_agent_resources_changed
[13:29:34] yes that looks good
[13:32:03] sadly I can't compare the results with what the old check would emit, but yeah
[13:33:44] anyways something to think about jbond, could simplify things a bit if the alert expression is reliable
[13:34:47] in other words I don't know what "failing" hosts currently are
[13:35:10] godog: yes thanks i'll send something shortly
[13:35:35] godog: yes i'll try and get that out of the script in a bit
[13:35:42] cheers
[13:36:08] godog: how do i update this so I can see it calculated over the last 24h?
[13:36:11] https://prometheus-eqiad.wikimedia.org/ops/classic/graph?g0.range_input=1h&g0.expr=puppet_agent_resources_changed%20%3E%201&g0.tab=1
[13:37:35] jbond: depends on what calculation you want to do, for example avg_over_time(puppet_agent_resources_changed[1d])
[13:37:38] or sum_over_time
[13:38:14] hmm i wanted to ensure we have no values that were 0 in the last 24 hours
[13:38:27] so neither sum nor avg would really work
[13:38:54] min_over_time(puppet_agent_resources_changed[1d]) > 0
[13:38:57] sth like this?
[13:39:10] yes that looks good thanks i'll have a play with that
[13:39:29] sure np
[14:27:41] jbond: very cool re: dropping check_puppet_run_changes
[14:27:58] far easier this way
[14:28:08] yes this is much better thanks for the pointer
[14:29:06] np
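
Note on the expression settled on above: a rough Python sketch of why min_over_time fits the "every run in the last 24h caused a change" check while avg_over_time and sum_over_time do not. The sample lists and helper names are hypothetical, and it assumes puppet_agent_resources_changed reports the number of resources changed by the most recent run (0 for a clean run). Real Prometheus evaluates these functions over scraped samples in the range window, not over a Python list.

```python
from typing import Sequence


def min_over_time(samples: Sequence[float]) -> float:
    """Smallest sample in the window, like PromQL min_over_time(metric[1d])."""
    return min(samples)


def avg_over_time(samples: Sequence[float]) -> float:
    """Mean of the samples in the window, like PromQL avg_over_time."""
    return sum(samples) / len(samples)


def sum_over_time(samples: Sequence[float]) -> float:
    """Total of the samples in the window, like PromQL sum_over_time."""
    return sum(samples)


def every_run_changed(samples: Sequence[float]) -> bool:
    """Mirror of: min_over_time(puppet_agent_resources_changed[1d]) > 0."""
    return min_over_time(samples) > 0


# Hypothetical 24h windows of puppet_agent_resources_changed for three hosts.
always_changing = [3, 1, 2, 1, 4]  # every run changed something
mostly_changing = [3, 0, 2, 1, 4]  # one clean run in the window
never_changing = [0, 0, 0, 0, 0]   # healthy host, no changes at all

if __name__ == "__main__":
    for name, window in [
        ("always_changing", always_changing),
        ("mostly_changing", mostly_changing),
        ("never_changing", never_changing),
    ]:
        print(
            name,
            "min>0:", every_run_changed(window),
            "avg>0:", avg_over_time(window) > 0,
            "sum>0:", sum_over_time(window) > 0,
        )
```

A single clean run pulls the minimum down to 0, so only hosts with no zero sample in the window match. avg and sum stay positive for the host with one clean run, which is why they cannot tell it apart from a host that changes on every run.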