[09:42:19] good morning! I have some questions regarding the mgmt alerts in AM
[09:42:55] good morning volans, sure, shoot
[09:43:34] apparently mr1-ulsfo has died or is in any case unreachable and Icinga is reporting it for the few related alerts that remained there, but there is no alert from alertmanager. I've checked the configuration in the alerts repo and if I'm reading it correctly it will open a task to dcops after 12h of unreachability
[09:44:09] yes IIRC 12h is correct
[09:44:38] 1) is there a way I can see the failing but not yet alerting checks in AM or elsewhere? Basically the equivalent of the SOFT alerts in icinga
[09:45:58] 2) what's the plan for the rest of the mgmt alerts? Because while for a single host I can guess a task to dcops after a few hours might be ok, it surely doesn't seem ok for a whole mr1 down.
[09:47:03] 3) maybe the 12h is a bit too much? it risks that by the time it alerts we're out of business hours for the onsite people and might mean more than 24h before anyone could look at it, but this is just fine-tuning
[09:48:50] ok so re: 1) yes you can query for alert state 'pending', e.g. here https://w.wiki/6PrB
[09:49:12] that's thanos for convenience, but you can issue the same query to e.g. https://prometheus-ulsfo.wikimedia.org/ops
[09:49:50] but not on alerts.w.o?
[09:50:01] I get stuff with q=%40state%3Dpending
[09:50:28] but I don't see them
[09:51:03] can you post a link? I tried @state=pending but I don't get results
[09:51:21] https://alerts.wikimedia.org/?q=%40state%3Dpending
[09:51:36] I get some alerts but I don't see the ulsfo mgmt ones
[09:51:51] yeah I don't think that's a thing, i.e. the results are the same without the query
[09:52:16] alerts.w.o shows only firing alerts, alertmanager doesn't know about pending alerts
[09:52:17] @state=active is not the same as alertstate in thanos?
[09:52:23] ok, confusing
[09:53:38] I disagree but let's move to 2)
[09:54:35] I'm checking how we are probing mr1-ulsfo for its availability
[09:54:59] that's in icinga atm AFAICT
[09:55:13] and it's a host, so I guess ping first of all
[09:57:04] indeed, so ATM we are not probing mr from prometheus yet afaics
[09:57:19] which of course explains why there's no alert for it yet
[09:57:52] yes, I wasn't expecting one, I was wondering if there was a plan to move all mgmt probing to AM
[09:59:11] yeah I'd like that eventually for sure, for network devices we can have different thresholds or even different alerts really
[09:59:30] ack
[09:59:48] * volans has a meeting, replies might be delayed
[10:00:10] ok! re: 3) I don't feel strongly about the threshold, we can tweak it as needed
[10:00:31] I went with 12h as a safe value but shorter is fine too
[10:00:47] feel free to send a review, happy to review
[10:11:44] I've created T330989 for the mr bits
[14:00:02] cdanis et al: FYI Service[vo-escalate.service] is causing a change on every puppet run on alert2001: ensure changed 'stopped' to 'running' (corrective)
[14:04:00] indeed, I remember running into that issue when deploying vo-escalate, I wonder if the right thing is to use a systemd::timer::job instead
[14:16:18] hah jbond already provided, thank you!
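[editor's note] The 'pending' state mentioned at 09:48 is Prometheus-side only: a rule whose expression is failing but whose for: window has not yet elapsed shows up in the built-in ALERTS metric with alertstate="pending" and is never sent to Alertmanager, which is the Icinga-SOFT equivalent asked about in 1). Below is a minimal sketch of pulling those from the Prometheus HTTP API; the exact API path under the /ops instance and the use of the requests library are assumptions, not part of the original discussion.

```python
# Editor's sketch (not from the original log): list rules that are failing but
# still inside their "for:" window. Prometheus exposes these via the built-in
# ALERTS metric with alertstate="pending"; Alertmanager never receives them.
# The API path under /ops is assumed from the URL quoted in the log.
import requests

PROM = "https://prometheus-ulsfo.wikimedia.org/ops"  # per-site instance from the log

resp = requests.get(
    f"{PROM}/api/v1/query",
    params={"query": 'ALERTS{alertstate="pending"}'},
    timeout=10,
)
resp.raise_for_status()

# Each result is a label set describing one pending alert instance.
for series in resp.json()["data"]["result"]:
    labels = series["metric"]
    print(labels.get("alertname"), labels.get("instance", ""))
```

The same query can be issued against the Thanos frontend linked above, which is what the https://w.wiki/6PrB shortlink points at per the conversation.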
(reviewing)
[14:18:01] no probs :)
[14:33:55] I don’t know why I didn’t use timer::job in the first place tbh
[14:36:07] likely because you didn't write that puppet, I did :)
[14:37:51] 😅
[14:39:32] wildly beside the point but I wonder if we'll ever have the strength or motivation to stop bolting stuff onto systemd::timer::job
[15:44:40] cwhite: o/ thanks for the review :) Can I +2+merge freely or do I need to do some follow-up later on? (like reloads etc..)
[15:45:16] The ORES patch? I went ahead and deployed it. Nothing left to do :)
[15:45:51] ah sorry, I saw a +1! you are the best, thanks <3
[15:49:04] Thanks for the patch!
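[editor's note] On the @state=pending confusion at 09:50-09:52: Alertmanager only stores alerts it has already received, and its v2 API reports each alert's status.state as one of unprocessed, active or suppressed. There is no pending state to filter on, which is why the karma query on alerts.w.o returned the same results as no filter at all. A small sketch of inspecting those states follows, assuming the requests library and a placeholder Alertmanager host, since the real endpoint is not given in the log.

```python
# Editor's sketch (not from the original log): show which states Alertmanager
# actually tracks. The host below is a placeholder, not the real endpoint.
import collections
import requests

AM = "https://alertmanager.example.org"  # hypothetical host

alerts = requests.get(f"{AM}/api/v2/alerts", timeout=10).json()

# Each alert carries status.state = "unprocessed" | "active" | "suppressed";
# "pending" is a Prometheus rule state and never appears here.
print(collections.Counter(a["status"]["state"] for a in alerts))
```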