[00:07:27] <jinxer-wm>	 (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[00:12:08] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad max lag in last 10 minutes on alert1001 is CRITICAL: 1.22e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_jumbo-eqiad
[00:32:42] <btullis>	 I've checked the Grafana link and it looks like the consumer lag above is going in the right direction.
[00:32:55] <btullis>	 https://usercontent.irccloud-cdn.com/file/xKLjYaQy/_20230930_013157.JPG
[00:40:16] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad max lag in last 10 minutes on alert1001 is OK: (C)1e+05 gt (W)1e+04 gt 872 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_jumbo-eqiad
[02:02:59] <jinxer-wm>	 (SystemdUnitFailed) firing: wikidatardf-truthy-dumps.service Failed on snapshot1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:39:06] <jinxer-wm>	 (HdfsCorruptBlocks) firing: HDFS corrupt blocks detected on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_corrupt_blocks - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCorruptBlocks
[03:02:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:05:06] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:15:28] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:17:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:28:45] <jinxer-wm>	 (SystemdUnitCrashLoop) firing: (5)  crashloop on kafka-jumbo1002:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[06:39:06] <jinxer-wm>	 (HdfsCorruptBlocks) firing: HDFS corrupt blocks detected on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_corrupt_blocks - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCorruptBlocks
[07:17:59] <jinxer-wm>	 (SystemdUnitFailed) firing: wikidatardf-truthy-dumps.service Failed on snapshot1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:29:00] <jinxer-wm>	 (SystemdUnitCrashLoop) firing: (5)  crashloop on kafka-jumbo1002:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[10:39:06] <jinxer-wm>	 (HdfsCorruptBlocks) firing: HDFS corrupt blocks detected on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_corrupt_blocks - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCorruptBlocks
[11:17:59] <jinxer-wm>	 (SystemdUnitFailed) firing: wikidatardf-truthy-dumps.service Failed on snapshot1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:33:45] <jinxer-wm>	 (SystemdUnitCrashLoop) firing: (5)  crashloop on kafka-jumbo1002:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[14:39:06] <jinxer-wm>	 (HdfsCorruptBlocks) firing: HDFS corrupt blocks detected on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_corrupt_blocks - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCorruptBlocks
[15:17:59] <jinxer-wm>	 (SystemdUnitFailed) firing: wikidatardf-truthy-dumps.service Failed on snapshot1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:34:00] <jinxer-wm>	 (SystemdUnitCrashLoop) firing: (5)  crashloop on kafka-jumbo1002:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[15:40:45] <wikibugs>	 10Data-Platform-SRE: Partition reassignment on kafka-jumbo negatively impacting mw-page-content-change-enrich - https://phabricator.wikimedia.org/T347676 (10brouberol) The topic reassignment finished on Sep 30th, 2PM UTC. {F37866398}
[16:16:52] <wikibugs>	 10Data-Engineering, 10Event-Platform, 10Platform Team Initiatives (New Hook System): Update EventBus to use the new HookContainer/HookRunner system - https://phabricator.wikimedia.org/T346539 (10Umherirrender) a:05Umherirrender→03None
[17:43:27] <jinxer-wm>	 (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[19:22:43] <jinxer-wm>	 (SystemdUnitFailed) firing: wikidatardf-truthy-dumps.service Failed on snapshot1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:28:27] <jinxer-wm>	 (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[19:34:00] <jinxer-wm>	 (SystemdUnitCrashLoop) firing: (5)  crashloop on kafka-jumbo1002:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[23:22:59] <jinxer-wm>	 (SystemdUnitFailed) firing: wikidatardf-truthy-dumps.service Failed on snapshot1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:34:00] <jinxer-wm>	 (SystemdUnitCrashLoop) firing: (5)  crashloop on kafka-jumbo1002:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop