[12:36:34] godog: quick question, what is the stat on prometheus that tracks if a server is up/down?
[12:38:16] dcaro: hey, the "up" metric I think is likely what you are after, what is the context though?
[12:38:33] godog: that seems to have only VMs
[12:38:48] dcaro: up{job="node"}
[12:38:53] godog: I want to create an alert that triggers when a certain number of hosts are up/down
[12:39:02] 👍 let me check!
[12:39:15] pretty sure 'up' has physical hosts too
[12:39:17] I can certainly see hardware servers with that on thanos
[12:39:49] oh my, I'm using the wrong thanos xd
[12:40:12] it's inevitable
[12:40:54] dcaro: probably the JobUnavailable alert can be used as a template/guide for what you're looking for
[12:41:19] noice!
[12:42:24] If I want to add more than one alert, should I create more than one yaml file? or should I aggregate similar ones in the same yaml?
[12:42:29] (under the alerts repo)
[12:43:09] oh, I see JobUnavailable has more than one too
[12:43:35] yeah you can totally create more alerts per yaml file, we've grouped more or less around the software the alerts are for, or general "area"
[12:50:39] what's the relevance of the group?
[12:50:49] (it shows as a label in the alert or something?)
[13:00:12] doesn't show up in the alert, no; it is for prometheus to know which alerts can be evaluated concurrently and which can't
[13:05:00] should I split them in any specific way to not waste resources?
[13:05:14] (should I avoid groups with many alerts, for example?)
[13:09:43] dcaro: I think for now many alerts in a group is fine
[13:10:11] ok, so the semantics are that alerts in a group are evaluated sequentially, and groups are evaluated concurrently, afaics
[13:10:49] 👍 thanks!
[13:10:58] np!
[13:54:26] godog: I'm having some issues with the tests :/, I'm trying to test the first alert in the patch here: https://gerrit.wikimedia.org/r/c/operations/alerts/+/811999
[13:54:35] but getting nothing, any hints?
[13:55:41] (I changed from percentage to count for testing)
[14:01:24] oh, the site label has to be in the query too
[14:01:38] okok, found it, nm, thanks for listening!! :)
[14:05:09] dcaro: haha! for sure, the rubber-duck technique hardly ever fails
[14:28:50] godog: oh, would you mind if I add a test to the alerts repo that checks that there's a runbook for every wmcs alert and that it exists? (doing a GET request to it)
[14:32:27] dcaro: for sure! sounds like a good idea
[14:32:42] perhaps even not limited to wmcs alerts, seems good to have in general
[14:33:31] are you sure you want to enforce having runbooks on all the alerts? (I'm ok with that though, just making sure)
[14:36:23] ah, I misread what you wrote: I'd parsed that as *if* there's a runbook, then check that it is valid
[14:36:49] that I can do too :)
[14:37:08] (as a general test, then add one for wmcs to ensure there's one on every alert)
[14:37:31] ok! yeah, that could work better, ATM dashboard/runbook is optional/warning anyways
[14:37:35] thank you
[14:37:41] 👍
[14:38:01] bbiab
[15:20:24] godog: I added some parallelization so the requests are not so slow
[16:03:08] dcaro: nice! thank you
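
A minimal sketch of the kind of alert discussed above, built on the "up" metric (1 when Prometheus could scrape a target, 0 otherwise) and aggregating by (site) so the site label ends up on the resulting alert. The alert name, threshold, severity, and summary here are hypothetical, not taken from the actual patch:

```yaml
groups:
  - name: host_availability
    rules:
      # "up" is 1 when the scrape succeeded, 0 when it failed, so the
      # ratio below is the fraction of node-exporter hosts reachable.
      - alert: TooManyHostsDown
        expr: |
          sum by (site) (up{job="node"})
            / count by (site) (up{job="node"}) < 0.90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "More than 10% of hosts in {{ $labels.site }} are down"
```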
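
To illustrate the grouping semantics worked out at 13:00–13:10 (rules inside a group are evaluated sequentially, in order, at the group's interval; separate groups are evaluated independently of each other), here is a hypothetical two-group rule file. Both alert names and the recording rule are invented:

```yaml
groups:
  # These two groups are evaluated independently of each other;
  # within each group, the rules run one after another.
  - name: capacity
    rules:
      - alert: HighCPUUsage
        expr: instance:cpu_usage:rate5m > 0.9   # hypothetical recording rule
        for: 10m
  - name: availability
    rules:
      - alert: HostDown
        expr: up{job="node"} == 0
        for: 5m
```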
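
The tests mentioned around 13:54 are presumably promtool-style unit tests. A sketch of one, showing the gotcha hit at 14:01: every label the rule expression aggregates or matches on (here, site) must also appear on the synthetic input series, otherwise the expression matches nothing and the test silently produces no alerts. File names, values, and timings are invented:

```yaml
rule_files:
  - host_availability.yaml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # The site label must be present on these series: the alert
      # aggregates by (site), so series without it match nothing.
      - series: 'up{job="node", site="eqiad", instance="host1"}'
        values: '0x15'
      - series: 'up{job="node", site="eqiad", instance="host2"}'
        values: '1x15'
    alert_rule_test:
      # 1 of 2 hosts is down (50% < 90%), and 10m > the 5m "for"
      # clause, so the alert should be firing by eval_time.
      - eval_time: 10m
        alertname: TooManyHostsDown
        exp_alerts:
          - exp_labels:
              severity: warning
              site: eqiad
            exp_annotations:
              summary: "More than 10% of hosts in eqiad are down"
```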
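
And a sketch in Python of the runbook-checking test with the parallelization added at 15:20. The directory layout, the annotations.runbook field name, and the worker count are assumptions about the repo, not what the actual change does:

```python
#!/usr/bin/env python3
"""Check that every runbook URL referenced by an alert rule responds.

A sketch, assuming rule files are Prometheus YAML with an
annotations.runbook URL; field names and layout are guesses.
"""
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

import requests
import yaml


def runbook_urls(alerts_dir: Path) -> set[str]:
    """Collect runbook annotation URLs from every rule file."""
    urls = set()
    for rule_file in alerts_dir.glob("**/*.yaml"):
        data = yaml.safe_load(rule_file.read_text())
        for group in data.get("groups", []):
            for rule in group.get("rules", []):
                url = rule.get("annotations", {}).get("runbook")
                if url:
                    urls.add(url)
    return urls


def check(url: str) -> tuple[str, bool]:
    """GET the runbook and report whether it exists."""
    try:
        resp = requests.get(url, timeout=10)
        return url, resp.ok
    except requests.RequestException:
        return url, False


def main() -> None:
    urls = runbook_urls(Path("."))
    # Parallelize the GETs so checking many runbooks isn't slow.
    with ThreadPoolExecutor(max_workers=10) as pool:
        for url, ok in pool.map(check, urls):
            if not ok:
                print(f"BROKEN runbook: {url}")


if __name__ == "__main__":
    main()
```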