[06:16:09] jbond: there are still some hosts alerting after the revert of that bad patch, what should be the fix for those? just restart the affected exporter?
[06:35:46] jbond: I just saw the issue with one particular host, es1024 is...The last Puppet run was at Tue Dec 20 11:03:16 UTC 2022 (1171 minutes ago). Puppet is disabled. investigate prometheus_puppet_agent_stats.service - jbond
[06:35:49] is that expected?
[06:36:10] can I enable that?
[10:29:59] marostegui: i have enabled es1024 where there any others?
[10:30:30] jbond: I think so, double check icinga
[10:31:33] marostegui: i dont see anything
[10:31:48] jbond: Great then
[15:03:53] CI on my CRs is failing because of something that is entirely unrelated to the work I'm doing - https://integration.wikimedia.org/ci/job/operations-puppet-tests-buster-docker/56699/console is this a known issue? profile::wmcs::cloudlb::haproxy on Debian 11 seems unhappy
[15:09:00] arturo: ^ probably caused by https://gerrit.wikimedia.org/r/c/operations/puppet/+/868726 ?
[15:18:46] fix sent https://gerrit.wikimedia.org/r/c/operations/puppet/+/870562
[15:23:00] thank you <3
[15:24:35] Emperor: fix is merged
[15:28:25] ta, have rebased my changes and pushed them again
[15:28:33] [oops, and now I have a meeting]
[15:47:59] CI says yes \o/
[15:58:03] Emperor jbond moritzm, thanks, sorry
[15:58:45] I think the CI should have catch that in the original patch that introduced the problem, no? isn't that the point of the CI?
[16:02:15] jhathaway: I'm seeing a puppet failure that I think is from https://gerrit.wikimedia.org/r/c/operations/puppet/+/770960, are you on top of that already?
[16:02:39] andrewbogott: I am not, thanks for pointing it out, what is the failure?
[16:03:09] It's on a Stretch host:
[16:03:12] https://www.irccloud.com/pastebin/K41BBn3T/
[16:03:25] I haven't investigated at all, just going by 'who touched it last'
[16:03:52] labstore1005.eqiad.wmnet
[16:03:58] yup that is my fault, and now the mystery of why we didn't have that patch is solved!!
[16:04:31] I'll revert and add a note!
[16:05:16] andrewbogott: for future reference how did you see that problem, so I can monitor for issues?
[16:06:02] the puppet failure showed up on alert manager. It happened to be on a host that's marked with team=wmcs which I tend to check frequently
[16:06:10] thanks!
[16:06:23] what is the host, so I can verify the fix?
[16:06:30] labstore1005.eqiad.wmnet
[16:06:45] ah, you already mentioned that, sorry
[16:06:49] Among the last of the Stretch hosts but I think some other teams are still running other Stretch things.
[16:07:05] yeah this probably only broke on streth boxes
[16:07:09] *stretch
[16:11:46] in addition to labstore1005/1005 the other stretch hosts are ms-fe[12]009 and maps*
[16:12:08] sorry, thumbor*, not maps*
[16:12:20] 12 in total
[16:14:23] jhathaway: thumbor* should see it's k8s rollout in January (which allows the decom of the current hosts( and the swiftrepl replacement is also WIP, so depending on timing for labstore1004/1005 you could also wait and reapply as-is next month
[16:15:24] moritzm: thanks, I'm working on a patch now, with a temporary fix, and I'll add a note to revert once all the stretch boxes are gone
[16:15:48] ack
[16:28:14] andrewbogott: care to review, https://gerrit.wikimedia.org/r/c/operations/puppet/+/870640
[23:39:06] TIL about subquery syntax in Prometheus, enabling queries like "what is the highest rate[5m] in the last 24h", e.g. for display in a singlestat at https://grafana.wikimedia.org/d/000000066/resourceloader.
[23:39:50] max_over_time( sum(rate(varnish_resourceloader_resp{site=~"$site"}[5m])) [24h:5m] )
[23:40:52] which translates to query `sum(rate[5m]))`, act like it's a 24h graph with 5m steps between dots (the subquery), and then apply a max() to that looking back.
[23:41:02] If you do it as a singlestate/instant it's actually reasonably fast.
[23:41:32] Can be made faster by increasing the inner and outer 5m so less spot checks are done and thus shorter spikes more hidden.
[23:41:58] but [24h:5m] is pretty quick already, comparable to what a typical graph would plot/fetch anyway
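A rough way to picture what the `[24h:5m]` subquery does, and why widening the inner and outer 5m hides short spikes, is the small Python simulation below. It is only a sketch under stated assumptions, not how Prometheus is implemented: `rate` here is a plain average over per-second counts (the real `rate()` works on counter samples and extrapolates), evaluation points are assumed to be multiples of the step, and the traffic numbers and all function names are made up for illustration.

```python
from typing import List, Sequence


def rate(per_second: Sequence[int], t: int, window: int) -> float:
    """Average per-second rate over the `window` seconds ending at time t.

    Simplified stand-in for rate(...[window]); not the real Prometheus algorithm.
    """
    lo = max(0, t - window)
    return sum(per_second[lo:t]) / window


def eval_points(end: int, range_s: int, step: int) -> List[int]:
    """Timestamps at which the inner query is evaluated for [range_s:step].

    Assumption: points are multiples of `step` within (end - range_s, end].
    Exact boundary handling in Prometheus may differ by version.
    """
    first = ((end - range_s) // step + 1) * step
    return list(range(first, end + 1, step))


def max_over_subquery(per_second: Sequence[int], end: int, range_s: int,
                      step: int, window: int) -> float:
    """max_over_time( rate(...[window]) [range_s:step] ), evaluated at `end`."""
    return max(rate(per_second, t, window) for t in eval_points(end, range_s, step))


DAY = 24 * 60 * 60

# Hypothetical traffic: a flat 10 req/s all day, plus one 60-second spike at 610 req/s.
traffic = [10] * DAY
for second in range(30000, 30060):
    traffic[second] = 610

fine = max_over_subquery(traffic, DAY, DAY, step=300, window=300)      # like [24h:5m] with rate[5m]
coarse = max_over_subquery(traffic, DAY, DAY, step=3600, window=3600)  # like [24h:1h] with rate[1h]

print(f"5m/5m peak: {fine} req/s, 1h/1h peak: {coarse} req/s")
```

With these made-up numbers the 5m variant does 288 inner evaluations and reports a peak of 130 req/s, while the 1h variant does 24 evaluations and reports 20 req/s: fewer spot checks, but the one-minute spike is averaged almost entirely away.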