[18:13:05] I know now how to replace all my Icinga check_http and check_tcp check variations with Prometheus/Alertmanager. But what I don't know yet is how you usually replace one of the NRPE process checks, i.e. whether a certain process is running.
[18:14:53] but then again, it's a good time to wonder whether that's actually useful to check. maybe it's smarter to reduce it to just monitoring that a public endpoint is up, and not worry about processes. there are always some legit cases though, like "is zuul_merger running" for CI
[18:20:38] yeah, whenever possible it will be better to check something that's more indicative of service health than pgrep. I've so many times seen a process that is running but in a broken state
[18:25:07] Yeah, I am writing something like this on a ticket for my team: that we should question each one. But there will be some left that we do want to keep, I think, and for those the question remains what would replace that kind of check.
[18:25:38] one example where I will just remove it: we're already checking that https is up on Etherpad, so I don't care to additionally know whether the process is running
[18:26:05] one example where I think we need to keep it: checking that clamd has not crashed on VRTS
[18:55:09] we may be able to find a Prometheus exporter for clamd that'd give up/down status along with some additional metrics too
[18:55:39] and in general s/clamd/$service
[19:01:58] oh, _that_ specific.. hm, ok!
[19:02:25] well, there's going to be zuul, zuul-merger, jenkins, gerrit, etc.
[19:02:31] I made a fresh ticket :)
[19:19:51] there are also systemd failed-unit checks in place, which could probably supersede various process-name NRPE checks. but yeah, the more specific the better
[19:21:38] good points. thank you
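
For the leftover "is this process running" cases discussed above, one possible replacement is a small script that feeds node_exporter's textfile collector. The following is only a rough sketch under stated assumptions, not something proposed in the conversation: the metric name `process_check_up` and its `name` label are hypothetical, it assumes a Linux `/proc` layout, and it matches on `/proc/<pid>/comm`, which the kernel truncates to 15 characters.

```python
from pathlib import Path


def running_process_names(proc_root="/proc"):
    """Collect the command names of all processes listed under proc_root.

    Assumes the Linux /proc layout: one numeric directory per PID, each
    holding a `comm` file with the (at most 15 character) command name.
    """
    names = set()
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # not a PID directory (e.g. /proc/self, /proc/uptime)
        try:
            names.add((entry / "comm").read_text().strip())
        except OSError:
            # The process exited between listing and reading; skip it.
            continue
    return names


def render_metrics(wanted, running):
    """Render Prometheus text-format lines: 1 if the name is running, else 0.

    `process_check_up` is a hypothetical metric name chosen for this sketch.
    Process names are assumed to be plain (no quotes or backslashes), so no
    label escaping is done here.
    """
    lines = [
        "# HELP process_check_up Whether a process with this name is running.",
        "# TYPE process_check_up gauge",
    ]
    for name in sorted(wanted):
        value = 1 if name in running else 0
        lines.append(f'process_check_up{{name="{name}"}} {value}')
    return "\n".join(lines) + "\n"
```

A cron job or systemd timer could call `render_metrics(["clamd", "zuul-merger"], running_process_names())` and write the result to a `.prom` file in the directory passed to node_exporter's `--collector.textfile.directory` flag. An alert could then fire on an expression such as `process_check_up == 0`. As noted in the chat, this only shows that the process exists, not that it is healthy, so a service-specific exporter or endpoint check is preferable where one is available.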