[00:01:23] RECOVERY - Check unit status of drop_event on an-launcher1002 is OK: OK: Status of the systemd unit drop_event https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:13:29] PROBLEM - Check unit status of drop_event on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit drop_event https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:19:41] PROBLEM - Check unit status of refinery-drop-webrequest-raw-partitions on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refinery-drop-webrequest-raw-partitions https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:50:09] RECOVERY - Check unit status of drop-predictions-actor_label-hourly on an-launcher1002 is OK: OK: Status of the systemd unit drop-predictions-actor_label-hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:03:49] PROBLEM - Check unit status of drop-predictions-actor_label-hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit drop-predictions-actor_label-hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[04:50:13] RECOVERY - Check unit status of drop-predictions-actor_label-hourly on an-launcher1002 is OK: OK: Status of the systemd unit drop-predictions-actor_label-hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[05:03:53] PROBLEM - Check unit status of drop-predictions-actor_label-hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit drop-predictions-actor_label-hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[06:14:18] (DruidSegmentsUnavailable) firing: More than 10 segments have been unavailable for mediawiki_history_reduced_2022_09 on the druid_public Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_public&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable
[06:34:18] (DruidSegmentsUnavailable) resolved: More than 10 segments have been unavailable for mediawiki_history_reduced_2022_09 on the druid_public Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_public&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable
[08:13:04] !log apt-get clean on an-airflow1001 to free some space on the root partition
[08:13:05] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:15:17] RECOVERY - Check unit status of refinery-drop-webrequest-raw-partitions on an-launcher1002 is OK: OK: Status of the systemd unit refinery-drop-webrequest-raw-partitions https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[08:17:36] airflow1001 may need some extra disk/space, the root partition is getting filled up
[08:17:50] not something to do now but maybe during the coming week
[08:17:57] cc: btullis: --^ :)
[08:29:01] PROBLEM - Check unit status of refinery-drop-webrequest-raw-partitions on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refinery-drop-webrequest-raw-partitions https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[12:15:41] RECOVERY - Check unit status of refinery-drop-webrequest-raw-partitions on an-launcher1002 is OK: OK: Status of the systemd unit refinery-drop-webrequest-raw-partitions https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[12:29:23] PROBLEM - Check unit status of refinery-drop-webrequest-raw-partitions on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refinery-drop-webrequest-raw-partitions https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:15:49] RECOVERY - Check unit status of refinery-drop-webrequest-raw-partitions on an-launcher1002 is OK: OK: Status of the systemd unit refinery-drop-webrequest-raw-partitions https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:29:31] PROBLEM - Check unit status of refinery-drop-webrequest-raw-partitions on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refinery-drop-webrequest-raw-partitions https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[17:22:12] (VarnishkafkaNoMessages) firing: varnishkafka on cp2033 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp2033%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[17:27:12] (VarnishkafkaNoMessages) resolved: varnishkafka on cp2033 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp2033%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[21:20:34] PROBLEM - eventgate-analytics-external validation error rate too high on alert1001 is CRITICAL: 3.006 gt 2 https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos
[21:30:01] RECOVERY - eventgate-analytics-external validation error rate too high on alert1001 is OK: (C)2 gt (W)1 gt 0.3466 https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos
[22:00:47] PROBLEM - eventgate-analytics-external validation error rate too high on alert1001 is CRITICAL: 2.724 gt 2 https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos
[22:05:29] RECOVERY - eventgate-analytics-external validation error rate too high on alert1001 is OK: (C)2 gt (W)1 gt 0.4882 https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos