[12:36:34] godog: quick question, what is the stat on prometheus that tracks if a server is up/down?
[12:38:16] dcaro: hey, the "up" metric I think is likely what you are after, what is the context though?
[12:38:33] godog: that seems to have only VMs
[12:38:48] dcaro: up{job="node"}
[12:38:53] godog: I want to create an alert that triggers when a certain number of hosts are up/down
[12:39:02] 👍 let me check!
[12:39:15] pretty sure 'up' has physical hosts too
[12:39:17] I can certainly see hardware servers with that on thanos
[12:39:49] oh my, I'm using the wrong thanos xd
[12:40:12] it's inevitable
[12:40:54] dcaro: probably the JobUnavailable alert can be used as a template/guide for what you're looking for
[12:41:19] noice!
[12:42:24] If I want to add more than one alert, should I create more than one yaml file? or should I aggregate similar ones in the same yaml?
[12:42:29] (under the alerts repo)
[12:43:09] oh, I see JobUnavailable has more than one too
[12:43:35] yeah you can totally create more alerts per yaml file, we've grouped more or less around the software the alerts are for, or general "area"
[12:50:39] what's the relevance of the group?
[12:50:49] (it shows as a label in the alert or something?)
[13:00:12] doesn't show up in the alert, no; it is for prometheus to know which alerts can be evaluated concurrently and which can't
[13:05:00] should I split them in any specific way to not waste resources?
[13:05:14] (should I avoid groups with many alerts, for example?)
[13:09:43] dcaro: I think for now many alerts in a group is fine
[13:10:11] ok, so the semantics are that alerts in a group are evaluated sequentially, and groups are evaluated concurrently, afaics
[13:10:49] 👍 thanks!
[13:10:58] np!
[13:54:26] godog: I'm having some issues with the tests :/, I'm trying to test the first alert in the patch here: https://gerrit.wikimedia.org/r/c/operations/alerts/+/811999
[13:54:35] but getting nothing, any hints?
[13:55:41] (I changed from percentage to count for testing)
[14:01:24] oh, the site label has to be in the query too
[14:01:38] okok, found it, nm, thanks for listening!! :)
[14:05:09] dcaro: haha! for sure, the rubber-duck technique hardly ever fails
[14:28:50] godog: oh, would you mind if I add a test to the alerts repo that checks that there's a runbook for every wmcs alert and that it exists? (doing a GET request to it)
[14:32:27] dcaro: for sure! sounds like a good idea
[14:32:42] perhaps even not limited to wmcs alerts, seems good to have in general
[14:33:31] are you sure you want to enforce having runbooks on all the alerts? (I'm ok with that though, just making sure)
[14:36:23] ah, I misread what you wrote: I'd parsed that as *if* there's a runbook, then check that it is valid
[14:36:49] that I can do too :)
[14:37:08] (as a general test, then add one for wmcs to ensure there's one on every alert)
[14:37:31] ok! yeah, that could work better, ATM dashboard/runbook is optional/warning anyways
[14:37:35] thank you
[14:37:41] 👍
[14:38:01] bbiab
[15:20:24] godog: I added some parallelization so the requests are not so slow
[16:03:08] dcaro: nice! thank you
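
A minimal sketch of the kind of alert discussed above, built on the "up" metric (1 when Prometheus could scrape a target, 0 otherwise) and aggregating by (site) so the site label ends up on the resulting alert. The alert name, threshold, severity, and summary here are hypothetical, not taken from the actual patch:

```yaml
groups:
  - name: host_availability
    rules:
      # "up" is 1 when the scrape succeeded, 0 when it failed, so the
      # ratio below is the fraction of node-exporter hosts reachable.
      - alert: TooManyHostsDown
        expr: |
          sum by (site) (up{job="node"})
            / count by (site) (up{job="node"}) < 0.90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "More than 10% of hosts in {{ $labels.site }} are down"
```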
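
To illustrate the grouping semantics worked out at 13:00–13:10 (rules inside a group are evaluated sequentially, in order, at the group's interval; separate groups are evaluated independently of each other), here is a hypothetical two-group rule file. Both alert names and the recording rule are invented:

```yaml
groups:
  # These two groups are evaluated independently of each other;
  # within each group, the rules run one after another.
  - name: capacity
    rules:
      - alert: HighCPUUsage
        expr: instance:cpu_usage:rate5m > 0.9   # hypothetical recording rule
        for: 10m
  - name: availability
    rules:
      - alert: HostDown
        expr: up{job="node"} == 0
        for: 5m
```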
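
The tests mentioned around 13:54 are presumably promtool-style unit tests. A sketch of one, showing the gotcha hit at 14:01: every label the rule expression aggregates or matches on (here, site) must also appear on the synthetic input series, otherwise the expression matches nothing and the test silently produces no alerts. File names, values, and timings are invented:

```yaml
rule_files:
  - host_availability.yaml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # The site label must be present on these series: the alert
      # aggregates by (site), so series without it match nothing.
      - series: 'up{job="node", site="eqiad", instance="host1"}'
        values: '0x15'
      - series: 'up{job="node", site="eqiad", instance="host2"}'
        values: '1x15'
    alert_rule_test:
      # 1 of 2 hosts is down (50% < 90%), and 10m > the 5m "for"
      # clause, so the alert should be firing by eval_time.
      - eval_time: 10m
        alertname: TooManyHostsDown
        exp_alerts:
          - exp_labels:
              severity: warning
              site: eqiad
            exp_annotations:
              summary: "More than 10% of hosts in eqiad are down"
```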
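
And a sketch in Python of the runbook-checking test with the parallelization added at 15:20. The directory layout, the annotations.runbook field name, and the worker count are assumptions about the repo, not what the actual change does:

```python
#!/usr/bin/env python3
"""Check that every runbook URL referenced by an alert rule responds.

A sketch, assuming rule files are Prometheus YAML with an
annotations.runbook URL; field names and layout are guesses.
"""
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

import requests
import yaml


def runbook_urls(alerts_dir: Path) -> set[str]:
    """Collect runbook annotation URLs from every rule file."""
    urls = set()
    for rule_file in alerts_dir.glob("**/*.yaml"):
        data = yaml.safe_load(rule_file.read_text())
        for group in data.get("groups", []):
            for rule in group.get("rules", []):
                url = rule.get("annotations", {}).get("runbook")
                if url:
                    urls.add(url)
    return urls


def check(url: str) -> tuple[str, bool]:
    """GET the runbook and report whether it exists."""
    try:
        resp = requests.get(url, timeout=10)
        return url, resp.ok
    except requests.RequestException:
        return url, False


def main() -> None:
    urls = runbook_urls(Path("."))
    # Parallelize the GETs so checking many runbooks isn't slow.
    with ThreadPoolExecutor(max_workers=10) as pool:
        for url, ok in pool.map(check, urls):
            if not ok:
                print(f"BROKEN runbook: {url}")


if __name__ == "__main__":
    main()
```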