[00:23:24] FIRING: [3x] SystemdUnitFailed: arclamp_compress_logs.service on arclamp2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:28:24] FIRING: [5x] SystemdUnitFailed: arclamp_compress_logs.service on arclamp2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:33:24] FIRING: [5x] SystemdUnitFailed: arclamp_compress_logs.service on arclamp2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:38:24] FIRING: [5x] SystemdUnitFailed: arclamp_compress_logs.service on arclamp2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:43:24] FIRING: [5x] SystemdUnitFailed: arclamp_compress_logs.service on arclamp2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:48:24] FIRING: [4x] SystemdUnitFailed: arclamp_compress_logs.service on arclamp2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:53:24] FIRING: [4x] SystemdUnitFailed: arclamp_compress_logs.service on arclamp2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:58:24] FIRING: [4x] SystemdUnitFailed: arclamp_compress_logs.service on arclamp2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:03:24] FIRING: [4x] SystemdUnitFailed: arclamp_compress_logs.service on arclamp2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:07:55] !log clear up some space on arclamp2001 to allow arclamp_compress_logs to complete
[01:07:55] cwhite: Not expecting to hear !log here
[01:13:24] RESOLVED: [3x] SystemdUnitFailed: arclamp_compress_logs.service on arclamp2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:35:24] FIRING: [2x] SystemdUnitFailed: generate-mysqld-exporter-config.service on prometheus1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:07:10] cwhite: re: arclamp I added 70G to /srv on both hosts
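The failure mode above is the disk filling up: arclamp_compress_logs could not finish until space was freed and /srv was grown. A minimal sketch of a compress-logs job that refuses to run without headroom; the directory and threshold are assumptions for illustration, not the actual arclamp script:

```python
#!/usr/bin/env python3
"""Compress logs only if there is disk headroom -- a sketch, not the real job."""
import gzip
import shutil
import sys
from pathlib import Path

LOG_DIR = Path("/srv/arclamp/logs")  # hypothetical path
MIN_FREE_BYTES = 5 * 1024**3         # hypothetical threshold: 5 GiB

def main() -> int:
    free = shutil.disk_usage(LOG_DIR).free
    if free < MIN_FREE_BYTES:
        # A non-zero exit from a oneshot unit is the kind of thing that
        # surfaces as a SystemdUnitFailed alert.
        print(f"only {free / 1024**3:.1f} GiB free under {LOG_DIR}, refusing to run",
              file=sys.stderr)
        return 1
    for log in sorted(LOG_DIR.glob("*.log")):
        with open(log, "rb") as src, gzip.open(f"{log}.gz", "wb") as dst:
            shutil.copyfileobj(src, dst)
        log.unlink()  # drop the uncompressed original once the .gz is written
    return 0

if __name__ == "__main__":
    sys.exit(main())
```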
[07:09:08] bd808: yes, the bot is an instance of https://github.com/knyar/phalerts that runs on alertmanager hosts in production. all the puppet bits are ready; what's missing, I'd say, is to deploy phalerts on metricsinfra (I think) in cloudvps and create a phab bot account
[07:09:16] cc taavi ^
[07:10:30] or maybe phalerts is already deployed on metricsinfra? don't know
[07:27:12] yeah, not currently set up there, but it would be possible without too much effort
[08:20:24] FIRING: [2x] SystemdUnitFailed: generate-mysqld-exporter-config.service on prometheus1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:30:24] RESOLVED: [2x] SystemdUnitFailed: generate-mysqld-exporter-config.service on prometheus1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:51:40] FIRING: [2x] LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[12:01:40] RESOLVED: [2x] LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[13:23:06] hi folks. I am wondering what differentiates an alert that goes to email (from the alerting host) from one that goes to IRC and, further, to a page
[13:23:16] the reason for asking is that yesterday, this alert fired: ** PROBLEM alert - dns3003/gdnsd checkconf #page is CRITICAL **
[13:23:27] (this was expected, and so is the fact that it is a paging alert)
[13:24:07] but there was no actual page, just an email. in setting this up, I did add critical => true, and reading the module, that should have made it a paging alert
[13:24:19] what am I missing here? the alert itself is in modules/gdnsd/manifests/monitor_conf.pp
[13:26:44] though on closer inspection, this is the only nrpe::monitor_service alert that has critical set to true. everything else that calls this module has it set to false, so now I am wondering whether I set this up incorrectly. should the description have an explicit #page?
[13:27:03] sukhe: hi! do you have a timestamp for that alert? it could have coincided with an alert ingestion incident in Splunk https://status.victorops.com/incidents/47wsks3pq639
[13:27:45] lmata: ah let me check.
[13:28:36] 16:37 Eastern, so that would be 14:37 MDT.
[13:29:06] and Splunk implemented the fix at 15:25 MDT
[13:29:41] I guess that would be it then?
[13:29:56] > This incident affected: Splunk On-Call Alert Processing & Integrations (Alert Ingestion - Inbound email).
[13:30:31] I'm about to jump into a meeting, though yes we did send emails
[13:30:32] https://phabricator.wikimedia.org/P76223
[13:31:33] godog: thanks, not urgent for sure. but this means that emails were sent from our side (which they were) but not ingested for the actual reporting?
[13:31:39] (at their end)
[13:45:50] sukhe: that's my understanding too
[13:46:12] cool, thanks! just making sure the alert was set up the right way, because I was worried I hadn't done that. but that adds up.
[13:46:15] thanks folks!
[13:46:37] sure np, can confirm the alert is working as expected
[13:46:57] thanks!
[13:47:07] I can't resist though: please consider working on moving away from paging via emails, i.e. moving to prometheus/alertmanager
[13:47:15] :P
[13:47:39] you should not resist. we are not adding any new alerts there, but yeah, there are a bunch of pending ones in Icinga
[13:47:48] I will be more mindful of moving them
[13:47:56] thank you! appreciate it sukhe
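The alerts.wikimedia.org links throughout this log are Alertmanager queries, and the suggestion above is to page from Prometheus/Alertmanager rather than via Icinga email. A minimal sketch of pulling currently firing alerts by name from an Alertmanager v2 API; the endpoint below is a placeholder, not a production address:

```python
#!/usr/bin/env python3
"""Sketch: list firing alerts by name via an Alertmanager v2 API."""
import requests

ALERTMANAGER = "http://alertmanager.example.org:9093"  # placeholder endpoint

def firing_alerts(alertname: str) -> list:
    # GET /api/v2/alerts accepts label matchers via the "filter" query parameter.
    resp = requests.get(
        f"{ALERTMANAGER}/api/v2/alerts",
        params={"filter": f'alertname="{alertname}"', "active": "true"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    for alert in firing_alerts("SystemdUnitFailed"):
        print(alert["startsAt"], alert["labels"])
```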
[18:37:33] At the risk of receiving ire from o11y... What's the endpoint for using Grafana's API?
[19:36:52] brett: Why would you receive ire for asking that question? :o
[19:37:08] You can access the API from https://grafana.wikimedia.org/api/
[21:12:57] denisse: I'm getting a 404 at that endpoint
[21:13:19] brett: How are you trying to use it?
[21:13:23] It works fine for me.
[21:13:36] oh, the root API endpoint just doesn't return anything. Gotcha
[21:14:28] Yes, so for example, you'd use https://grafana.wikimedia.org/api/search to search for stuff. You can find more info in the official documentation: https://grafana.com/docs/grafana/latest/developers/http_api/
[21:15:04] Thank you!
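A minimal sketch of the /api/search usage described above, assuming anonymous read access to grafana.wikimedia.org; if authentication is required, pass an API token in an Authorization: Bearer header per the Grafana HTTP API docs linked above:

```python
#!/usr/bin/env python3
"""Sketch: search Grafana dashboards via the HTTP API's /api/search endpoint."""
import requests

GRAFANA = "https://grafana.wikimedia.org"

def search_dashboards(query: str) -> list:
    # The bare /api/ root is not itself an endpoint (hence the 404 above);
    # /api/search returns matching dashboards and folders as JSON.
    resp = requests.get(
        f"{GRAFANA}/api/search",
        params={"query": query, "type": "dash-db"},
        # headers={"Authorization": "Bearer <token>"},  # uncomment if auth is needed
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    for dash in search_dashboards("systemd"):
        print(dash["title"], "->", GRAFANA + dash["url"])
```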