[08:56:58] godog: sorry to disturb your sprint week but could I get your review on https://gerrit.wikimedia.org/r/c/operations/puppet/+/902303/? thanks
[09:19:42] vgutierrez: sure no worries, will take a look
[09:27:05] cheers
[09:31:31] vgutierrez: {{done}}, happy to discuss here too
[09:32:27] is it feasible to implement your approach today?
[09:32:44] good question, checking real quick
[09:33:06] context being https://phabricator.wikimedia.org/T332796 I wanna have something in place before the weekend
[09:35:37] vgutierrez: I think it should be doable by adding a distro comparison to modules/prometheus/templates/etc/default/prometheus-node-exporter-0.17.erb and appending the options for >= bullseye
[09:37:48] I realize that's not the news you were looking for :) I'm ok to go with your solution too but bear in mind that we'll need to undo that at some point (i.e. metric and alerts)
[09:38:07] no problem
[09:39:04] godog: does the option fail on pre-bullseye?
[09:40:19] volans: indeed
[09:40:30] :(
[09:41:34] yeah that's a bummer, the good news is that bullseye ships node-exporter >= 1.x so at least we can expect some stability there
[09:41:41] FSVO good
[09:51:11] ack
[09:51:29] to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/901220 is there anything I should do post-merge?
[09:52:29] volans: no action needed no
[09:52:42] great, thx
[09:56:57] godog: something like https://gerrit.wikimedia.org/r/c/operations/puppet/+/902310 :?
[10:00:01] vgutierrez: yeah exactly, left a comment but LGTM
[10:00:09] metrics?
[10:00:33] https://github.com/prometheus-community/systemd_exporter#systemd-versions isn't accurate then?
[10:01:07] we're talking about node-exporter not systemd-exporter
[10:01:11] oh my fault
[10:02:56] looking good! I'll kick off PCC and I think we're good to go
[10:03:10] cheers
[10:13:13] vgutierrez: looks like it is working
[10:13:17] yep
[10:13:46] \o/
[10:26:15] godog: and as a corollary https://gerrit.wikimedia.org/r/c/operations/alerts/+/902312/
[10:39:15] vgutierrez: neat!
[10:39:34] godog: any idea of why CI is torturing me?
[10:39:43] checking
[10:40:30] vgutierrez: you have an extra label in the test, layer
[10:40:47] the rest looks good
[10:41:04] that seems better than "assert 1 == 0 failed"
[10:41:05] ;P
[10:41:50] lol
[10:42:01] yeah I was reading a couple of lines up
[16:41:51] Hello friends, I can't find an alert that should have fired, and was hoping someone could help me look. I merged this alert yesterday (https://gerrit.wikimedia.org/r/c/operations/alerts/+/902052), and set up a condition that should cause it to fire, as the graph shows the condition being met (https://grafana.wikimedia.org/goto/PtqQpOfVz?orgId=1). Did I miss something obvious?
[16:43:26] eoghan: the prometheus="k8s" label is added on the thanos layer, it won't match when the alert queries are done on an individual prometheus instance
[16:46:52] eoghan: yeah what taavi said, my apologies for the pitfall of thanos vs prometheus
[16:47:17] I'll make it a bit more prominent in https://wikitech.wikimedia.org/wiki/Alertmanager
[16:47:27] I wonder if we could lint for that somehow
[16:47:45] Aha, I see.
[16:47:58] A linter would be great, if that was possible
[16:49:05] yeah a lint for that would definitely be the nail in the coffin
[16:50:29] So to be sure I understand, what I'm missing is the `# deploy-tag` line?
[16:50:53] Actually, no, that's there
[16:51:01] eoghan: no, you got the right tag but a stray prometheus="k8s" label matcher in there
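A minimal sketch of the shape being described here, assuming the standard Prometheus rule-file layout used in operations/alerts; the group name, alert name, metric, threshold, and labels are illustrative placeholders rather than eoghan's actual rule:

```yaml
# deploy-tag: k8s
#
# The deploy-tag above is what scopes this rule to the k8s Prometheus instance
# (and not k8s-staging); the expression itself must not carry a
# prometheus="k8s" matcher, because that label is only added at the Thanos
# layer and never matches when the rule is evaluated on an individual
# Prometheus instance.
groups:
  - name: example_group            # illustrative group name
    rules:
      - alert: ExampleErrorRateHigh
        # Placeholder metric and threshold; note the absence of prometheus="k8s".
        expr: sum(rate(example_errors_total[5m])) > 0
        for: 5m
        labels:
          team: example            # placeholder routing labels
          severity: warning
        annotations:
          summary: Example rule scoped via deploy-tag instead of a prometheus= matcher
```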
[16:51:38] Oh, right! I see it now. So just removing that should be good?
[16:51:38] eoghan: running the query on the prometheus k8s web interface will yield the results your alert will see
[16:51:41] https://prometheus-eqiad.wikimedia.org/k8s/classic/graph
[16:52:09] eoghan: yeah that's correct
[16:52:13] Cool, thanks!
[16:52:19] I get it now
[16:54:28] Thanks taavi and godog!
[16:54:56] sure no problem -- thanks for reaching out eoghan
[16:55:11] If one of you could cast your eyes on https://gerrit.wikimedia.org/r/c/operations/alerts/+/902438 please?
[16:55:28] re: linting that'd be great, I think to be accurate though it would need to happen in golang to be able to parse the query back into an AST and look at the labels
[17:07:24] Ah, without that filter though, that alert will trigger for the staging cluster
[17:07:44] Which is why I had suggested it in the first place
[17:08:00] Any other way of achieving the same result?
[17:10:11] akosiaris: the alert is deployed to the prometheus k8s instance only via # deploy-tag: k8s at the top, so that should work
[17:10:16] i.e. not k8s-staging
[17:10:38] gotta go now but happy to resume tomorrow
[17:11:31] Ah good point
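A rough sketch of the lint idea godog floats at 16:55:28, assuming the upstream PromQL parser package github.com/prometheus/prometheus/promql/parser; this is not an existing tool in the alerts CI, just an illustration of parsing an alert expression back into an AST and flagging matchers on the Thanos-only prometheus label:

```go
// Hypothetical linter sketch: walk the PromQL AST of an alert expression and
// warn about matchers on the "prometheus" label, which only exists once the
// Thanos layer adds it and therefore never matches on an individual
// Prometheus instance.
package main

import (
	"fmt"
	"os"

	"github.com/prometheus/prometheus/promql/parser"
)

// lintExpr returns one message per matcher on the Thanos-only "prometheus" label.
func lintExpr(expr string) []string {
	node, err := parser.ParseExpr(expr)
	if err != nil {
		return []string{fmt.Sprintf("parse error: %v", err)}
	}
	var problems []string
	parser.Inspect(node, func(n parser.Node, _ []parser.Node) error {
		if vs, ok := n.(*parser.VectorSelector); ok {
			for _, m := range vs.LabelMatchers {
				if m.Name == "prometheus" {
					problems = append(problems, fmt.Sprintf(
						"selector %s matches on %s, which is only added at the Thanos layer", vs, m))
				}
			}
		}
		return nil
	})
	return problems
}

func main() {
	// Example expression with the kind of stray matcher discussed above.
	for _, p := range lintExpr(`sum(rate(example_errors_total{prometheus="k8s"}[5m])) > 0`) {
		fmt.Fprintln(os.Stderr, "WARNING:", p)
	}
}
```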