[09:24:06] godog: o/ when you have time, we have a question: is it possible to re-use the same rule with varying severity levels based on thresholds? (https://gerrit.wikimedia.org/r/c/operations/alerts/+/762902/1/team-search-platform/rdf_streaming_updater.yaml#71)
[09:24:41] dcausse: yes it is possible and encouraged
[09:25:36] dcausse: sorry, I misread what you wrote, I thought it was multiple rules with different severities
[09:27:03] it'd be: how to best represent the puppet function "monitoring::check_prometheus", which defines separate warning and critical thresholds
[09:28:01] would it need separate rules, with one like: value > warning_threshold and value < critical_threshold?
[09:30:36] yeah, it'd be two rules with the same name and different severities and thresholds; no need to "exclude" the warning threshold though, because if the same alert (name, instance, etc) is firing at critical and warning, the warning one is inhibited
[09:31:50] specifically by this https://github.com/wikimedia/puppet/blob/production/modules/alertmanager/templates/alertmanager.yml.erb#L333
[09:33:58] thanks!
[09:37:33] sure np, I've added an entry to alertmanager's FAQ
[21:34:34] lmata: godog: inflatador and I are doing elasticsearch training this week so we won't be able to make the search:o11y alertmanager meeting tomorrow. is there a time next week that works for you guys? (Monday is a US holiday)
[21:35:34] I was looking at 6pm UTC but it looks like that conflicts with the `EMs: Community of Practice` meeting you have, lmata. So maybe 5pm UTC on Thursday?
[21:36:14] I can skip the EMs meeting
[21:36:24] Let me move it around
[21:36:52] `s/6pm UTC/6pm UTC tuesday` in my message above
[21:37:48] lmata: okay great, 18:00 UTC on Tues should work for me/inflatador, hopefully godog can make it too
[22:16:32] hey, could you help me find instance=etherpad1003 in the alertmanager UI? I want to confirm whether a specific alert is fixed.. but it's like I can only see active alerts.. so the non-existence means it is fixed, I guess
[22:17:18] how do I search, though, for an instance= that is currently not alerting?
[22:21:40] we have @state=unprocessed or @state=suppressed
[22:21:52] but I don't think that covers "non-alerting" checks
[22:22:52] https://alerts.wikimedia.org/?q=instance%3Detherpad1003&q=state%3Dunprocessed&q=state%3Dsupressed
[22:23:27] Filippo might be the best person to ask for this
[22:24:53] lmata: that link is one of the things I tried, yea, but can't find the host
[22:25:15] my Icinga mindset wants to link to the current check, whether it's green or red
[22:25:23] and be able to paste it on a ticket to say "fixed"
[22:25:38] I will ask tomorrow at a more European time
[22:25:41] thank you
[22:26:04] also.. I had used a silence on that check
[22:26:19] but when I search with silence_author=~dzahn I don't see that now
[22:27:24] kind of wish I could just use https://alerts.wikimedia.org/?q=instance%3Detherpad1003 and get the host overview regardless of state, if I remove all other filters.. I would expect that to happen, actually
[22:28:19] makes sense, and is a natural ask. I don't know if this is supported by Karma, but I can follow up and see if I can get you a better answer
[22:28:20] :D
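For reference, the two-rules-with-one-name pattern discussed around 09:30 might look roughly like the sketch below. The alert name, metric, thresholds, and `for` duration are illustrative placeholders, not the actual rdf_streaming_updater.yaml rule; the point is two rules sharing one alert name, differing only in threshold and `severity` label.

```yaml
groups:
  - name: example
    rules:
      # warning-level rule; metric name and thresholds are placeholders
      - alert: RdfStreamingUpdaterHighLag
        expr: my_consumer_lag_seconds > 600
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "consumer lag above warning threshold"
      # same alert name, higher threshold, critical severity
      - alert: RdfStreamingUpdaterHighLag
        expr: my_consumer_lag_seconds > 1800
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "consumer lag above critical threshold"
```

The inhibition mentioned at 09:30/09:31 is standard Alertmanager behaviour; conceptually it is an inhibit rule along these lines (the actual production rule is the one in the linked alertmanager.yml.erb, which may differ in detail):

```yaml
inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: ['alertname', 'instance']
```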
[22:28:21] anyways, will chat later. I think I just resolved https://phabricator.wikimedia.org/T301872 anyways
[22:28:34] since I managed to build the new prometheus-exporter there
[22:28:39] that's what made me want to confirm
[22:28:51] but I can already confirm with curl that it outputs metrics
[22:29:10] thanks lmata, sounds good! :) have a good one
[22:29:20] ack, you too!
[22:29:28] if it's interesting, we do office hours as part of our team meeting on Wed
[22:29:53] if you want to drop by and chat a bit more about this
[22:30:03] ah! noted :)
[22:30:08] alright
[22:30:28] no pressure, let me know if it's interesting :D
[22:40:02] honestly it's a bit early for me. I would say the ideal forms of communication (for non-major things) are all those that don't require realtime/timezones and are public. a mailing list, for example.
[22:40:58] Completely understand, happy to accommodate a thread instead
[22:41:20] mutante: in a pinch you could throw back port 9900 via ssh to a prometheus server and have a look at the alerts there to see if the one in question is still active; checking now, it looks like yeah, etherpad1003 has cleared
[22:42:00] that's erm.. not a super great UX solution though :)
[22:42:49] herron: thank you! :) that was nice though
[22:42:57] so far I just had this:
[22:42:58] curl etherpad1003:9198/metrics -v
[22:43:08] but that doesn't show that the alert is actually cleared
[22:45:54] mutante: cheers np
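The port-forward workaround mentioned at 22:41 would look something like the sketch below. The prometheus host name is a placeholder, and this assumes port 9900 is the Prometheus web UI port referred to above.

```sh
# Forward the remote Prometheus web UI (port 9900, per the tip above) to localhost.
# The host name here is illustrative only.
ssh -N -L 9900:localhost:9900 prometheus1005.eqiad.wmnet
# Then browse http://localhost:9900/alerts to check whether the alert is still firing.
```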