[00:05:44] hey o11y folks -- I'm going over the SLO data for the Dec-Feb reporting quarter, and it looks like we missed the logstash latency SLO in eqiad https://grafana-rw.wikimedia.org/d/slo-Logstash/logstash-slo-s?orgId=1&from=1669881600000&to=1677657599000
[00:06:05] there was a single event on Feb 11 that put us over our quarterly budget for slow requests
[00:07:10] codfw was a near miss, with some smaller events on Jan 25, 26, and 29-30, but it did meet the SLO
[00:08:14] I'd like to grab 30m to chat about what happens next -- who should I invite to that? :) I know herron was working on that SLO, anyone else from the team want to be there?
[00:09:05] Hi Reuven, I'd like to attend the meeting as well.
[00:10:15] 👍
[00:19:08] (also just to say the hopefully-obvious out loud: it's exactly like incident reviews, nothing is anyone's fault and nobody's in trouble, just something to understand and learn from)
[00:21:16] Nice, I like the blameless perspective. It makes this a safe space for learning and sharing knowledge.
[00:37:48] Happy to meet as well.
[00:41:22] I'll listen in too, for my own benefit :-)
[00:45:35] I'll just invite the whole team then, keep it simple :D consider it optional all around, no pressure from me as long as there's somebody to talk to
[00:49:40] sgtm, thanks!
[14:06:41] thanks for organizing, r.zl
[15:14:47] godog: for T329073, did you (or anyone else) do anything to prometheus1005 that would stop it from sending alerts? we're troubleshooting some WMCS alerts during the maintenance, and our current theory is that it was depooled from queries/LVS but still managed to send some alerts out
[15:42:13] taavi: I didn't, no, though that's a good point re: alerting -- it would definitely fire alerts from its POV
[15:50:27] (filing a task)
[15:53:07] taavi: https://phabricator.wikimedia.org/T331449 ^ I'll bring it up at the team meeting tomorrow too
[15:56:03] thank you! mystery solved then
[15:56:51] do you have a sample alert I could look at? if it mis-fired, then I'm pretty sure it must have been that
[15:58:43] we had several NodeDown alerts for cloudvirt hosts
[16:01:38] ack, thank you (in meeting)
[17:24:33] sorry, my machine locked up, be right back
[17:46:35] thanks for meeting <3 I just filed https://phabricator.wikimedia.org/T331461 as a starting point, but it's pretty bare, feel free to edit/rewrite
[20:48:23] does this CI failure for operations/alerts ring any bells for anyone? https://integration.wikimedia.org/ci/job/alerts-pipeline-test/805/console
[21:08:37] urandom: Not for me. Looks like something within Jenkins? Recent changes to alerts and integration-config don't look suspicious.
[21:31:16] cwhite: it's not a very helpful error message :/
[21:31:25] at least not to me
[21:38:21] Same. Doesn't seem very alerts-related either. :)
[22:11:30] see if it keeps doing it if you repeat it by putting "recheck" on the change
[22:11:45] if it's consistent, I think it's a task for releng
[22:14:13] https://integration.wikimedia.org/ci/job/alerts-pipeline-test/ seems to show that it already fails at "configure", before actually running the test
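
For context on the error-budget arithmetic discussed at 00:06:05, here is a minimal sketch of how a single large event can exhaust a quarterly budget of slow requests. The SLO target and request counts below are hypothetical, not the actual Logstash SLO definition or the real Feb 11 numbers.

    # Illustrative error-budget arithmetic (hypothetical numbers, not the
    # actual Logstash latency SLO): given an SLO target and counts of slow
    # vs. total requests, report how much of the quarterly budget is spent.

    def budget_spent(slo_target: float, slow: int, total: int) -> float:
        """Fraction of the error budget consumed; 1.0 means fully spent."""
        allowed = (1 - slo_target) * total  # slow requests the budget allows
        return slow / allowed if allowed else float("inf")

    # e.g. a 99.5% latency SLO over 200M requests in the quarter allows 1M
    # slow requests; a single event producing 1.2M slow requests overspends
    # the budget even if every other day was clean.
    print(budget_spent(slo_target=0.995, slow=1_200_000, total=200_000_000))  # 1.2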
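
On the prometheus1005 question (T331449): one way to keep a depooled-but-still-running Prometheus from paging during maintenance is to pre-create a silence in Alertmanager covering the alerts it would fire from its own point of view. The sketch below uses the standard Alertmanager v2 silences API; the Alertmanager URL, the matcher labels, and the helper name are assumptions for illustration, not a record of what was done in this case (the alerts seen here were NodeDown for cloudvirt hosts).

    # A minimal sketch, assuming a reachable Alertmanager and that the
    # affected alerts carry an "instance" label; pre-silences NodeDown
    # alerts for a maintenance window via the v2 silences API.
    from datetime import datetime, timedelta, timezone

    import requests

    ALERTMANAGER = "http://alertmanager.example.org:9093"  # hypothetical host

    def silence_node_down(hosts_regex: str, hours: float, author: str, comment: str) -> str:
        """Create a silence for NodeDown on matching instances; returns the silence ID."""
        now = datetime.now(timezone.utc)
        payload = {
            "matchers": [
                {"name": "alertname", "value": "NodeDown", "isRegex": False},
                {"name": "instance", "value": hosts_regex, "isRegex": True},
            ],
            "startsAt": now.isoformat(),
            "endsAt": (now + timedelta(hours=hours)).isoformat(),
            "createdBy": author,
            "comment": comment,
        }
        resp = requests.post(f"{ALERTMANAGER}/api/v2/silences", json=payload, timeout=10)
        resp.raise_for_status()
        return resp.json()["silenceID"]

    # e.g. silence_node_down(r"cloudvirt.*", 4, "taavi", "T331449: prometheus1005 maintenance")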