[09:17:34] godog: ^ any ideas on the two points above?
[09:19:04] dcaro: checking
[09:20:05] dcaro: no auth for the api ATM no, though yeah what you did (amtool) makes sense to me
[09:21:39] dcaro: for the second point, you could even have a timer that does the check and exports the result as a metric file for node-exporter to pick up (and then alert on it as you mentioned), could you give a little more context on the problem though?
[09:25:37] godog: the context is that I have an internal openstack api serving on http that sometimes gets hung (non-responsive), so I want to have a dummy check that does a curl type of request from the same host to make sure I get a response and alert when that does not happen
[09:26:02] (and I don't want to add a new nagios check if possible, as a pre-step for moving stuff from icinga to prometheus/alertmanager xd)
[09:27:21] thanks for not adding new nagios checks :) appreciate it!
[09:27:53] yeah the small script executed periodically that drops metrics seems reasonable to me, searching for examples now
[09:28:33] basically all modules/prometheus/manifests/node_*.pp
[09:28:57] also, is there a defined process/way to move things out of icinga and into alertmanager/prometheus? (I might want to get an effort running there, at least for any alerts that page us)
[09:30:05] basically "open a task with checkboxes for each alert/class of alerts; port one alert; goto 10"
[09:30:43] xd
[09:30:51] e.g. what I'm doing at https://phabricator.wikimedia.org/T305847
[09:30:56] that's useful yep
[09:31:06] (/me would like to see how to port them)
[09:31:35] is it essentially creating a script that exports to prometheus, then adding a prometheus filter for it?
[09:31:40] s/filter/alert/
[09:32:21] ah yeah, step 0 for me has been "can we delete it instead of porting?", then "are there better / higher layer ways of achieving the same?", then "do we have metrics already?"
[09:32:36] and then yeah last resort is what you described with the script and all
[09:33:24] a good example is etcd-mirror, where the previous page was a regexp on /lag to check for lag
[09:33:34] the new way is etcd-mirror itself exporting lag as a metric
[09:33:42] i.e. https://phabricator.wikimedia.org/T309546
[09:34:05] ack, sounds reasonable yes :)
[09:35:40] gotta run an errand, bbiab
[09:35:55] 👍 thanks!
[10:47:52] dcaro: sth else that occurred to me, if the standard systemd unit failure alerts would be sufficient, you could make the periodic unit fail when the api doesn't respond
[10:48:21] might be simpler, we have sth similar for rsyslog, though that also auto-remediates and restarts rsyslog
[10:53:02] does the systemd information go to prometheus?
[11:05:47] it does, node-exporter exposes unit status as a metric
[11:06:05] I'll be tackling the same as part of the pages migration, haven't looked yet but it is possible
[11:07:01] one of our biggest metrics actually IIRC
[11:07:20] e.g. node_systemd_unit_state{cluster="thanos", instance="thanos-fe1001:9100", job="node", name="cron.service", prometheus="ops", site="eqiad", state="active", type="simple"}
[11:08:26] nice!
[15:04:56] godog: can we add a link from the alertmanager alert entry to the icinga alert (at least to the search of the alertname) so it's easier to silence?
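
[editor's note] A minimal sketch of the periodic check discussed above (a script run from a systemd timer that drops a metric file for node-exporter's textfile collector). The API URL, metric name and textfile directory below are placeholders, not taken from the actual setup:

    #!/usr/bin/env python3
    """Periodic HTTP probe that writes a metric file for node-exporter's
    textfile collector. URL, metric name and directory are hypothetical."""
    import os
    import tempfile
    import urllib.error
    import urllib.request

    API_URL = "http://localhost:8080/healthz"    # placeholder internal API endpoint
    TEXTFILE_DIR = "/var/lib/prometheus/node.d"  # assumed textfile collector directory
    METRIC = "openstack_internal_api_up"         # hypothetical metric name


    def check_api(url: str, timeout: float = 10.0) -> int:
        """Return 1 if the API answers with a 2xx within the timeout, else 0."""
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return 1 if 200 <= resp.status < 300 else 0
        except (urllib.error.URLError, OSError):
            return 0


    def write_metric(value: int) -> None:
        """Write the .prom file atomically so node-exporter never sees a partial file."""
        content = (
            f"# HELP {METRIC} Whether the internal API answered an HTTP probe.\n"
            f"# TYPE {METRIC} gauge\n"
            f"{METRIC} {value}\n"
        )
        fd, tmp_path = tempfile.mkstemp(dir=TEXTFILE_DIR, suffix=".tmp")
        with os.fdopen(fd, "w") as tmp:
            tmp.write(content)
        os.chmod(tmp_path, 0o644)  # make it readable by the node-exporter user
        os.rename(tmp_path, os.path.join(TEXTFILE_DIR, f"{METRIC}.prom"))


    if __name__ == "__main__":
        write_metric(check_api(API_URL))

A Prometheus alert can then fire when the metric is 0, or when it is absent/stale, which also covers the check script itself breaking.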
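[editor's note] And a sketch of the simpler "make the unit fail" variant mentioned at 10:47: the probe exits non-zero when the API doesn't answer, so the timer's service unit ends up failed and the standard systemd unit failure alerting (driven by node_systemd_unit_state from node-exporter) covers it. The URL is again a placeholder:

    #!/usr/bin/env python3
    """Probe for the 'make the unit fail' variant: exit non-zero when the API
    does not respond, so the systemd service unit is marked as failed."""
    import sys
    import urllib.error
    import urllib.request

    API_URL = "http://localhost:8080/healthz"  # placeholder internal API endpoint

    try:
        with urllib.request.urlopen(API_URL, timeout=10) as resp:
            sys.exit(0 if 200 <= resp.status < 300 else 1)
    except (urllib.error.URLError, OSError):
        sys.exit(1)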
[16:11:17] dcaro: yeah the link is there, a little bit hidden perhaps, if you click an alert "timestamp" there's a dropdown with the "source" of the alert
[17:08:15] Ahhh, the "4 hours ago" thing (right beside the collapsing button), could we change the name from 'wikimedia.org' to 'icinga' or similar?
[17:08:34] (I'll do it if you agree and it's possible xd)