[00:12:25] FIRING: [8x] SystemdUnitFailed: arclamp_compress_logs.service on arclamp1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:17:25] FIRING: [8x] SystemdUnitFailed: arclamp_compress_logs.service on arclamp1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:52:25] FIRING: [9x] SystemdUnitFailed: arclamp_compress_logs.service on arclamp1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:17:25] FIRING: [8x] SystemdUnitFailed: arclamp_compress_logs.service on arclamp1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:22:25] FIRING: [8x] SystemdUnitFailed: arclamp_compress_logs.service on arclamp1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:17:25] FIRING: [8x] SystemdUnitFailed: arclamp_compress_logs.service on arclamp1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:22:25] FIRING: [9x] SystemdUnitFailed: arclamp_compress_logs.service on arclamp1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:12:25] FIRING: [8x] SystemdUnitFailed: arclamp_compress_logs.service on arclamp1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:17:25] FIRING: [8x] SystemdUnitFailed: arclamp_compress_logs.service on arclamp1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:12:25] FIRING: [8x] SystemdUnitFailed: arclamp_compress_logs.service on arclamp1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:17:25] FIRING: [8x] SystemdUnitFailed: arclamp_compress_logs.service on arclamp1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:22:25] FIRING: [8x] SystemdUnitFailed: arclamp_compress_logs.service on arclamp1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:17:25] FIRING: [8x] SystemdUnitFailed: arclamp_compress_logs.service on arclamp1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:22:25] FIRING: [8x] SystemdUnitFailed: arclamp_compress_logs.service on arclamp1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:17:40] FIRING: [8x] SystemdUnitFailed: arclamp_compress_logs.service on arclamp1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:22:25] FIRING: [9x] SystemdUnitFailed: arclamp_compress_logs.service on arclamp1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:49:31] I'll look into those ^ [07:00:26] arclamp hosts ran out of space, that would explain [07:02:13] I'll silence the alerts for now [07:17:19] filed as T392791 and I doubt I'll have time to look at it today [07:17:20] T392791: arclamp hosts ran out of space: big excimer daily/hourly logs - https://phabricator.wikimedia.org/T392791 [11:27:12] godog: wikibase had an infinite recursion bug that manifested itself as terrifying towers in excimer - T374230 [11:27:13] T374230: Frequent 500 Errors and Timeouts When Adding Statements to New Item or Lexeme-typed Properties - https://phabricator.wikimedia.org/T374230 [11:27:28] I think this might have swollen the average size of data gathered [12:18:09] mszabo: hah! that might explain it indeed, thank you for the pointer [12:21:01] although the lack of compression for sure doesnt help either :) [12:21:39] heheh it is a bunch of things together indeed [13:11:35] FIRING: ThanosSidecarNoConnectionToStartedPrometheus: Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus [13:16:35] RESOLVED: ThanosSidecarNoConnectionToStartedPrometheus: Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus [15:22:22] I took a look at the arclamp_generate_metrics alert and deleted .tmp files along with '*excimer-wall*' from the time the issue happened, `/srv` now has 128G (52%) free space. [15:38:40] nice, thank you denisse ! [22:04:43] FIRING: BenthosKafkaConsumerLag: Too many messages in jumbo-eqiad for group benthos-webrequest_live - TODO - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=jumbo-eqiad&var-datasource=eqiad%20prometheus/ops&var-consumer_group=benthos-webrequest_live - https://alerts.wikimedia.org/?q=alertname%3DBenthosKafkaConsumerLag [22:09:43] RESOLVED: BenthosKafkaConsumerLag: Too many messages in jumbo-eqiad for group benthos-webrequest_live - TODO - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=jumbo-eqiad&var-datasource=eqiad%20prometheus/ops&var-consumer_group=benthos-webrequest_live - https://alerts.wikimedia.org/?q=alertname%3DBenthosKafkaConsumerLag