[00:23:24] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: arclamp_compress_logs.service on arclamp2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:28:24] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: arclamp_compress_logs.service on arclamp2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:33:24] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: arclamp_compress_logs.service on arclamp2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:38:24] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: arclamp_compress_logs.service on arclamp2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:43:24] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: arclamp_compress_logs.service on arclamp2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:48:24] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: arclamp_compress_logs.service on arclamp2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:53:24] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: arclamp_compress_logs.service on arclamp2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:58:24] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: arclamp_compress_logs.service on arclamp2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:03:24] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: arclamp_compress_logs.service on arclamp2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:07:55] <cwhite>	 !log clear up some space on arclamp2001 to allow arclamp_compress_logs to complete
[01:07:55] <stashbot>	 cwhite: Not expecting to hear !log here
[01:13:24] <jinxer-wm>	 RESOLVED: [3x] SystemdUnitFailed: arclamp_compress_logs.service on arclamp2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:35:24] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: generate-mysqld-exporter-config.service on prometheus1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:07:10] <godog>	 cwhite: re: arclamp I added 70G to /srv on both hosts
[07:09:08] <godog>	 bd808: yes, the bot is an instance of https://github.com/knyar/phalerts that runs on alertmanager hosts in production, all the puppet bits are ready and what's missing I'd say is deploy phalerts on metricsinfra (I think) in cloudvps and create a phab bot account
[07:09:16] <godog>	 cc taavi ^
[07:10:30] <godog>	 or maybe phalerts is already deployed on metricsinfra? don't know
[07:27:12] <taavi>	 yeah, not currently setup there but would be possible without too much effort
[08:20:24] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: generate-mysqld-exporter-config.service on prometheus1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:30:24] <jinxer-wm>	 RESOLVED: [2x] SystemdUnitFailed: generate-mysqld-exporter-config.service on prometheus1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:51:40] <jinxer-wm>	 FIRING: [2x] LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[12:01:40] <jinxer-wm>	 RESOLVED: [2x] LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[13:23:06] <sukhe>	 hi folks. I am wondering what/how differentiates an alert from going to email (from the alerting host) vs to IRC and further to a pag.e
[13:23:16] <sukhe>	 the reason for asking is that yesterday, this alert fired: ** PROBLEM alert - dns3003/gdnsd checkconf #pag<>e is CRITICAL **
[13:23:27] <sukhe>	 (this was expected and so is the fact that it is a paging alert)
[13:24:07] <sukhe>	 but there no actual page, just an email. in setting this up, I did add critical => true and reading the module, it should have made it a paging alert
[13:24:19] <sukhe>	 what am I missing from here? the alert itself is in modules/gdnsd/manifests/monitor_conf.pp
[13:26:44] <sukhe>	 though on closer inspection, this is the only nrpe::monitor_service alert which has critical set to true. everything else that calls this module has it set to false so now I am thinking if I set this up incorrectly. should the decription have an explicit #pag.e?
[13:27:03] <lmata>	 sukhe: hi! do you have a timestamp for that alert? it could have coincided with an alert ingestion incident in Splunk https://status.victorops.com/incidents/47wsks3pq639
[13:27:45] <sukhe>	 lmata: ah let me check.
[13:28:36] <sukhe>	 16:37 Eastern, so that would be 14:37 MDT.
[13:29:06] <sukhe>	 and Splunk implemented the fix at 15:25 MDT
[13:29:41] <sukhe>	 I guess that would be it then?
[13:29:56] <sukhe>	 >  This incident affected: Splunk On-Call Alert Processing & Integrations (Alert Ingestion - Inbound email). 
[13:30:31] <godog>	 I'm about to jump in a meeting, though yes we did send emails
[13:30:32] <godog>	 https://phabricator.wikimedia.org/P76223
[13:31:33] <sukhe>	 godog: thanks, not urgent for sure. but this means that emails were sent from our side (which they were) but not ingested for the actual reporting?
[13:31:39] <sukhe>	 (at their end)
[13:45:50] <godog>	 sukhe: that's my understanding too
[13:46:12] <sukhe>	 cool, thanks! just making sure the alert was set up the right way then because I thought I didn't do that. but that adds up.
[13:46:15] <sukhe>	 thanks folks!
[13:46:37] <godog>	 sure np, can confirm the alert is working as expected
[13:46:57] <sukhe>	 thanks!
[13:47:07] <godog>	 I can't resist though: please consider working on moving away from paging via emails, i.e. move to prometheus/alertmanager
[13:47:15] <sukhe>	 :P
[13:47:39] <sukhe>	 you should not resist. we are not adding any new alerts there but yeah, there are a bunch of pending ones in Icinga
[13:47:48] <sukhe>	 I will be more mindful of moving them 
[13:47:56] <godog>	 thank you! appreciate it sukhe 
[18:37:33] <brett>	 At the risk of receiving ire from o11y... What's the endpoint for using Grafana's API?
[19:36:52] <denisse>	 brett: Why would you receive ire for asking that question? :o
[19:37:08] <denisse>	 You can access the API from https://grafana.wikimedia.org/api/
[21:12:57] <brett>	 denisse: I'm getting a 404 at that endpoint
[21:13:19] <denisse>	 brett: How are you trying to use it?
[21:13:23] <denisse>	 It works fine for me.
[21:13:36] <brett>	 oh, the root api endpoint just doesn't return anything. Gotcha
[21:14:28] <denisse>	 Yes, so for example, you'd use https://grafana.wikimedia.org/api/search to search for stuff. You can find more info on the official documentation: https://grafana.com/docs/grafana/latest/developers/http_api/
[21:15:04] <brett>	 Thank you!