[09:17:34] godog: ^ any ideas on the two points above?
[09:19:04] dcaro: checking
[09:20:05] dcaro: no auth for the api ATM no, though yeah what you did (amtool) makes sense to me
[09:21:39] dcaro: for the second point, you could even have a timer that does the check and exports the result as a metric file for node-exporter to pick up (and then alert on it as you mentioned), could you give a little more context on the problem though?
[09:25:37] godog: the context is that I have an internal openstack api serving on http that sometimes gets hung (non-responsive), so I want to have a dummy check that does a curl type of request from the same host to make sure I get a response and alert when that does not happen
[09:26:02] (and I don't want to add a new nagios check if possible, as a pre-step for moving stuff from icinga to prometheus/alertmanager xd)
[09:27:21] thanks for not adding new nagios checks :) appreciate it!
[09:27:53] yeah the small script executed periodically that drops metrics seems reasonable to me, searching for examples now
[09:28:33] basically all modules/prometheus/manifests/node_*.pp
[09:28:57] also, is there a defined process/way to move things out of icinga and into alertmanager/prometheus? (I might want to get an effort running there, at least for any alerts that page us)
[09:30:05] basically "open a task with checkboxes for each alert/class of alerts; port one alert; goto 10"
[09:30:43] xd
[09:30:51] e.g. what I'm doing at https://phabricator.wikimedia.org/T305847
[09:30:56] that's useful yep
[09:31:06] (/me would like to see how to port them)
[09:31:35] is it essentially creating a script that exports to prometheus, then adding a prometheus filter for it?
[09:31:40] s/filter/alert/
[09:32:21] ah yeah, step 0 for me has been "can we delete it instead of porting?", then "are there better / higher layer ways of achieving the same?", then "do we have metrics already?"
[09:32:36] and then yeah last resort is what you described with the script and all
[09:33:24] a good example is etcd-mirror, where the previous page was a regexp on /lag to check for lag
[09:33:34] the new way is etcd-mirror itself exporting lag as a metric
[09:33:42] i.e. https://phabricator.wikimedia.org/T309546
[09:34:05] ack, sounds reasonable yes :)
[09:35:40] gotta run an errand, bbiab
[09:35:55] 👍 thanks!
[10:47:52] dcaro: sth else that occurred to me, if the standard systemd unit failure alerts would be sufficient, you could make the periodic unit fail when the api doesn't respond
[10:48:21] might be simpler, we have sth similar for rsyslog, though that also auto-remediates and restarts rsyslog
[10:53:02] does the systemd information go to prometheus?
[11:05:47] it does, node-exporter exposes unit status as a metric
[11:06:05] I'll be tackling the same as part of the pages migration, haven't looked yet but it is possible
[11:07:01] one of our biggest metrics actually IIRC
[11:07:20] e.g. node_systemd_unit_state{cluster="thanos", instance="thanos-fe1001:9100", job="node", name="cron.service", prometheus="ops", site="eqiad", state="active", type="simple"}
[11:08:26] nice!
[15:04:56] godog: can we add a link from the alertmanager alert entry to the icinga alert (at least to the search of the alertname) so it's easier to silence?
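
[editor's note] A minimal sketch of the periodic check discussed above (a script run from a systemd timer that drops a metric file for node-exporter's textfile collector). The API URL, metric name and textfile directory below are placeholders, not taken from the actual setup:

    #!/usr/bin/env python3
    """Periodic HTTP probe that writes a metric file for node-exporter's
    textfile collector. URL, metric name and directory are hypothetical."""
    import os
    import tempfile
    import urllib.error
    import urllib.request

    API_URL = "http://localhost:8080/healthz"    # placeholder internal API endpoint
    TEXTFILE_DIR = "/var/lib/prometheus/node.d"  # assumed textfile collector directory
    METRIC = "openstack_internal_api_up"         # hypothetical metric name


    def check_api(url: str, timeout: float = 10.0) -> int:
        """Return 1 if the API answers with a 2xx within the timeout, else 0."""
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return 1 if 200 <= resp.status < 300 else 0
        except (urllib.error.URLError, OSError):
            return 0


    def write_metric(value: int) -> None:
        """Write the .prom file atomically so node-exporter never sees a partial file."""
        content = (
            f"# HELP {METRIC} Whether the internal API answered an HTTP probe.\n"
            f"# TYPE {METRIC} gauge\n"
            f"{METRIC} {value}\n"
        )
        fd, tmp_path = tempfile.mkstemp(dir=TEXTFILE_DIR, suffix=".tmp")
        with os.fdopen(fd, "w") as tmp:
            tmp.write(content)
        os.chmod(tmp_path, 0o644)  # make it readable by the node-exporter user
        os.rename(tmp_path, os.path.join(TEXTFILE_DIR, f"{METRIC}.prom"))


    if __name__ == "__main__":
        write_metric(check_api(API_URL))

A Prometheus alert can then fire when the metric is 0, or when it is absent/stale, which also covers the check script itself breaking.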
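[editor's note] And a sketch of the simpler "make the unit fail" variant mentioned at 10:47: the probe exits non-zero when the API doesn't answer, so the timer's service unit ends up failed and the standard systemd unit failure alerting (driven by node_systemd_unit_state from node-exporter) covers it. The URL is again a placeholder:

    #!/usr/bin/env python3
    """Probe for the 'make the unit fail' variant: exit non-zero when the API
    does not respond, so the systemd service unit is marked as failed."""
    import sys
    import urllib.error
    import urllib.request

    API_URL = "http://localhost:8080/healthz"  # placeholder internal API endpoint

    try:
        with urllib.request.urlopen(API_URL, timeout=10) as resp:
            sys.exit(0 if 200 <= resp.status < 300 else 1)
    except (urllib.error.URLError, OSError):
        sys.exit(1)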
[16:11:17] dcaro: yeah the link is there, a little bit hidden perhaps, if you click an alert "timestamp" there's a dropdown with the "source" of the alert
[17:08:15] Ahhh, the "4 hours ago" thing (right beside the collapsing button), could we change the name from 'wikimedia.org' to 'icinga' or similar?
[17:08:34] (I'll do it if you agree and it's possible xd)