[00:22:19] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:22:48] (SystemdUnitFailed) firing: (10) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:30:29] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:32:48] (SystemdUnitFailed) firing: (10) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:50:15] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:52:48] (SystemdUnitFailed) firing: (10) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:00:29] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:02:48] (SystemdUnitFailed) firing: (10) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:05:16] 
PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:07:48] (SystemdUnitFailed) firing: (10) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:15:19] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:17:48] (SystemdUnitFailed) firing: (10) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:21:05] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:22:48] (SystemdUnitFailed) firing: (10) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:30:19] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:32:48] (SystemdUnitFailed) firing: (10) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:36:15] PROBLEM - 
Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:37:48] (SystemdUnitFailed) firing: (10) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:45:11] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:47:48] (SystemdUnitFailed) firing: (10) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:50:41] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:52:48] (SystemdUnitFailed) firing: (10) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:01:01] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:02:48] (SystemdUnitFailed) firing: (10) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:06:53] PROBLEM - Check 
systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:07:49] (SystemdUnitFailed) firing: (10) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:16:33] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:17:48] (SystemdUnitFailed) firing: (10) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:50:57] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:52:48] (SystemdUnitFailed) firing: (10) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:54:09] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [03:00:47] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:02:48] (SystemdUnitFailed) firing: (10) produce_canary_events.service Failed on an-launcher1002:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:07:23] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:07:48] (SystemdUnitFailed) firing: (10) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:15:37] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:17:48] (SystemdUnitFailed) firing: (10) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:47:19] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [04:08:35] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [04:40:31] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, 
WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [04:52:48] (SystemdUnitFailed) firing: (10) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:53:43] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:00:13] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:01:41] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [05:02:48] (SystemdUnitFailed) firing: (10) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:06:39] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:07:48] (SystemdUnitFailed) firing: (10) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:16:03] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:17:48] 
(SystemdUnitFailed) firing: (10) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:33:23] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [05:44:03] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [05:51:03] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:52:48] (SystemdUnitFailed) firing: (10) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:00:45] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:02:48] (SystemdUnitFailed) firing: (10) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:07:23] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:07:48] (SystemdUnitFailed) firing: (10) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:15:37] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:17:48] (SystemdUnitFailed) firing: (10) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:21:57] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:22:48] (SystemdUnitFailed) firing: (10) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:30:30] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:32:48] (SystemdUnitFailed) firing: (10) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:34:54] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:37:48] (SystemdUnitFailed) firing: (10) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:46:14] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:47:48] (SystemdUnitFailed) firing: (10) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:50:32] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:52:48] (SystemdUnitFailed) firing: (10) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:00:35] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:02:48] (SystemdUnitFailed) firing: (10) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:05:07] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:07:48] (SystemdUnitFailed) firing: (10) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:15:21] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:17:48] (SystemdUnitFailed) firing: (10) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:37:24] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [07:37:48] (SystemdUnitFailed) firing: (10) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:44:33] Data-Engineering: Check home/HDFS leftovers of ktsouroupidou - https://phabricator.wikimedia.org/T335012 (MoritzMuehlenhoff) [07:47:48] (SystemdUnitFailed) firing: (10) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:58:26] RECOVERY - MegaRAID on analytics1068 
is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [08:07:26] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:07:48] (SystemdUnitFailed) firing: (10) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:08:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1002:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1002:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [08:15:38] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:17:48] (SystemdUnitFailed) firing: (10) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:22:12] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:22:48] (SystemdUnitFailed) firing: (10) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:23:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1002:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1002:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [08:30:52] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:32:48] (SystemdUnitFailed) firing: (10) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:35:18] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:37:48] (SystemdUnitFailed) firing: (10) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:46:01] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:47:48] (SystemdUnitFailed) firing: (10) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:49:27] Data-Engineering, observability, Event-Platform Value Stream (Sprint 
11), Patch-For-Review: Produce requests to eventgate-logging-external in eqiad occasionally fail. - https://phabricator.wikimedia.org/T334510 (elukey) I think that this is related to the new kafka logging configs (see T326419).... [08:50:23] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:52:48] (SystemdUnitFailed) firing: (10) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:01:09] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:02:30] hey folks, I should have fixed the produce-canary failures (https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/909954) [09:02:36] let's see if it re-happens again [09:02:48] (SystemdUnitFailed) firing: (10) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:03:16] thanks a million elukey :) [09:03:26] joal: bonjour! [09:03:30] <3 [09:04:29] elukey: That's fantastic! Many thanks. [09:05:44] Data-Engineering, observability, Event-Platform Value Stream (Sprint 11), Patch-For-Review: Produce requests to eventgate-logging-external in eqiad occasionally fail. - https://phabricator.wikimedia.org/T334510 (elukey) Yep I can't reproduce the timeout issue anymore, it was definitely due to the... 
[09:10:02] elukey: I think that there are some features of the common templates we can use as well, to remove the need to specify brokers individually by setting `values.kafka.allowed_clusters` [09:10:02] https://github.com/wikimedia/operations-deployment-charts/blob/master/modules/base/networkpolicy_1.0.0.tpl#L32-L48 [09:30:29] Data-Engineering, observability, Event-Platform Value Stream (Sprint 11): Produce requests to eventgate-logging-external in eqiad occasionally fail. - https://phabricator.wikimedia.org/T334510 (BTullis) Thanks so much for troubleshooting this @elukey. It occurs to me that we might even be able to mak... [09:32:15] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [09:34:01] (PS1) Gerrit maintenance bot: Add fat.wikipedia to pageview allowlist [analytics/refinery] - https://gerrit.wikimedia.org/r/909774 (https://phabricator.wikimedia.org/T335019) [09:34:22] btullis: yes yes definitely, I think that eventgate-logging-external is one of the special configs that needs to be migrated over [09:34:49] the other problem is that the observability team didn't sync before changing the brokers, we'll need to be better at this (but it happens) [09:35:55] Oh, I see. So it's only this deployment of eventgate? Not all of them. I see now. 
[09:37:46] I still see kafka-jumbo defined in full here: https://github.com/wikimedia/operations-deployment-charts/blob/master/helmfile.d/services/eventgate-analytics-external/values.yaml#L74-L185 [09:40:03] I don't recall if Andrew already opened a task for this, but there were talks about migrating these deployments over to the new configs [09:41:31] Data-Engineering: Update eventgate helm chart to use automatic kafka egress networkpolicies - https://phabricator.wikimedia.org/T335024 (BTullis) [09:43:11] Data-Engineering, observability, Event-Platform Value Stream (Sprint 11): Produce requests to eventgate-logging-external in eqiad occasionally fail. - https://phabricator.wikimedia.org/T334510 (BTullis) Follow-up ticket created here: {T335024} [09:44:30] elukey: Thanks so much for the help. I've created a follow-up ticket, but we can always merge it in if we find any duplicates. [10:02:49] super [10:11:41] btullis: one qs - would it be ok for my team to "steal" the last two GPUs on hadoop (like you did in https://phabricator.wikimedia.org/T318696) to add them on Lift Wing? [10:11:48] so that we could start testing serving models with them [10:12:40] elukey: Totally fine by me 👍 [10:13:36] The more they get used, the better, as far as I'm concerned. [10:13:50] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [10:13:54] I am close to getting the 4 that you installed working on dse [10:14:03] fingers crossed [10:14:24] Awesome. Following closely. :) [10:56:25] Data-Engineering, Data-Persistence, IP Masking: Adding user_is_temp to the user table - https://phabricator.wikimedia.org/T333223 (Ladsgroup) A note on how long it takes to run a schema change: A similar one took 23 days: T333332 I could have gone faster though. Noting that schema change is different... 
[11:25:48] (CR) Joal: [V: +2 C: +2] "Merging for next deploy" [analytics/refinery] - https://gerrit.wikimedia.org/r/909774 (https://phabricator.wikimedia.org/T335019) (owner: Gerrit maintenance bot) [11:38:18] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:59:29] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [12:04:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1002:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1002:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [12:09:37] Data-Engineering, Data-Persistence, IP Masking: Adding user_is_temp to the user table - https://phabricator.wikimedia.org/T333223 (Ottomata) I'm mostly looking for a decision ASAP so we can adjust our data model now rather than later when people are actually using it. :) [12:23:09] Data-Engineering-Planning, Data Pipelines (Sprint 11): Support for moving data from HDFS to public http file server - https://phabricator.wikimedia.org/T317167 (JAllemandou) Starting point for a discussion on how this should be implemented. We currently have 2 ways of publishing data on the web: **1 -... 
[12:26:40] Data-Engineering, Event-Platform Value Stream (Sprint 11), Patch-For-Review: mediawiki-event-enrichment: issue async requests from MapFunction context - https://phabricator.wikimedia.org/T332948 (gmodena) We have a seemingly stable job running on YARN. Using [[ https://gitlab.wikimedia.org/repos/data... [12:31:20] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [12:34:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1002:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1002:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [12:41:56] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [13:04:17] (SystemdUnitFailed) firing: (9) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:11:24] Data-Engineering, observability, Event-Platform Value Stream (Sprint 11): Produce requests to eventgate-logging-external in eqiad occasionally fail. - https://phabricator.wikimedia.org/T334510 (CDanis) Is there anything to be done here around making this easier to diagnose? Better logging (whether i... 
[13:14:17] 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 11): Q4 eventutilities-python should bundle java deps. - https://phabricator.wikimedia.org/T327251 (10JArguello-WMF) [13:14:45] 10Data-Engineering, 10Event-Platform Value Stream (Sprint 11), 10Patch-For-Review: mediawiki-event-enrichment: issue async requests from MapFunction context - https://phabricator.wikimedia.org/T332948 (10JArguello-WMF) [13:15:59] 10Quarry, 10Regression: Quarry queries do not finish - https://phabricator.wikimedia.org/T334903 (10IKhitron) >>! In T334903#8788998, @IKhitron wrote: > Maybe you should kill the 33 still running queries. Anyone? Nine querries are still running, and using resources. Thank you. [13:41:04] 10Quarry, 10Regression: Quarry queries do not finish - https://phabricator.wikimedia.org/T334903 (10rook) @IKhitron oh I wouldn't worry about it, I don't suspect that the queries are still actually running, this behavior is tracked in T278583 "running" isn't a reflection of a query actually doing any processin... 
[14:00:39] 10Data-Engineering, 10Event-Platform Value Stream, 10Metrics-Platform-Planning, 10Product-Analytics, 10WMF-Architecture-Team: Major (API) versioning of Event Platform streams - https://phabricator.wikimedia.org/T332212 (10prabhat) Thanks for clarifications, @Ottomata [14:32:48] (SystemdUnitFailed) firing: (10) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:34:29] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:41:23] 10Data-Engineering, 10Event-Platform Value Stream: Update eventgate helm chart to use automatic kafka egress networkpolicies - https://phabricator.wikimedia.org/T335024 (10Ottomata) [14:42:23] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [14:43:39] 10Data-Engineering, 10Event-Platform Value Stream, 10Metrics-Platform-Planning, 10Product-Analytics, 10WMF-Architecture-Team: Major (API) versioning of Event Platform streams - https://phabricator.wikimedia.org/T332212 (10pmiazga) I have couple of questions regarding this topic: - when an event occurs -... 
[14:45:41] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:47:48] (SystemdUnitFailed) firing: (10) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:52:24] 10Data-Engineering, 10observability, 10Event-Platform Value Stream (Sprint 11): Produce requests to eventgate-logging-external in eqiad occasionally fail. - https://phabricator.wikimedia.org/T334510 (10Ottomata) In hindsight, The log messages from eventgate were pretty clear. Timeout errors talking to Kafka... [14:52:59] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [14:54:39] 10Data-Engineering-Planning, 10Data Pipelines (Sprint 11): Support for moving data from HDFS to public http file server - https://phabricator.wikimedia.org/T317167 (10Ottomata) I'd go with Option 2 for this if we can. Option 1 is nice, but I think putting things on dumps.wikimedia.org can and should require m... 
[14:57:58] 10Data-Engineering, 10Data-Persistence, 10Discovery-Search, 10Infrastructure-Foundations, and 8 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (10ayounsi) [14:58:46] 10Data-Engineering, 10Data-Persistence, 10Discovery-Search, 10Infrastructure-Foundations, and 8 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (10ayounsi) [14:59:57] 10Data-Engineering, 10DBA, 10Discovery-Search, 10Infrastructure-Foundations, and 8 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (10Marostegui) [15:00:42] 10Data-Engineering, 10Event-Platform Value Stream: Update eventgate helm chart to use automatic kafka egress networkpolicies - https://phabricator.wikimedia.org/T335024 (10Ottomata) Related: {T253058} [15:15:22] 10Data-Engineering, 10Event-Platform Value Stream (sprint 12): Update eventgate helm chart to use automatic kafka egress networkpolicies - https://phabricator.wikimedia.org/T335024 (10lbowmaker) [15:16:26] 10Data-Engineering, 10SRE, 10SRE Observability, 10Event-Platform Value Stream (sprint 12): Grant IdempotentWrite Kafka Cluster ACL to User:ANONYOUS in all Kafka clusters - https://phabricator.wikimedia.org/T334733 (10lbowmaker) [15:16:46] 10Data-Engineering, 10Event-Platform Value Stream (sprint 12): eventutilities-python should support using Kafka TLS ports - https://phabricator.wikimedia.org/T331526 (10lbowmaker) [15:17:50] 10Data-Engineering-Planning, 10Event-Platform Value Stream (sprint 12), 10Patch-For-Review: Use new PageUndeleteComplete hook to emit mediawiki.page_change undelete event - https://phabricator.wikimedia.org/T328308 (10lbowmaker) [15:24:44] 10Data-Engineering-Planning, 10Event-Platform Value Stream: [NEEDS GROOMING] Fix eventutilities-python linting - https://phabricator.wikimedia.org/T328547 (10Ottomata) 05Open→03Resolved a:03Ottomata [15:24:47] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy 
WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [15:25:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1002:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1002:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [15:30:13] 10Data-Engineering, 10DBA, 10Discovery-Search, 10Infrastructure-Foundations, and 8 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (10MoritzMuehlenhoff) [15:30:49] hello a-team, is this the right place to inquire about a possible EventBus issue? [15:31:00] 10Data-Engineering, 10Event-Platform Value Stream: mediawiki-event-enrichment and event enrichment job repo templating should bundle schema repos - https://phabricator.wikimedia.org/T335045 (10Ottomata) [15:31:13] herzog: Please do. [15:31:50] hello btullis - I'm investigating T333227 and T333899 now that I have logstash access [15:31:51] T333899: Investigate if TranslationNotification's DigestEmailer.php is really sending emails and what happens to them - https://phabricator.wikimedia.org/T333899 [15:31:51] T333227: Translation notifications via email digest may be broken - https://phabricator.wikimedia.org/T333227 [15:32:00] and I'm seeing a recurring eventbus issue that may be the cause [15:32:16] the message is Non-scalar value found in the event [15:32:39] See e.g. 
https://logstash.wikimedia.org/goto/0e6ccf3334a12e905c8efdacfaaf3ec8 [15:32:58] I was wondering if that is the reason the digest emailer may not be working [15:33:53] so I'd like to kindly ask for help understanding what that error is about and if it may be the cause for the feature being broken [15:42:04] herzog: Interesting, thank you for sharing the details. I have to confess that I'm not immediately certain. Sending email via EventBus isn't something that I've come across before. It will take some time to understand the nature of the issue. [15:42:34] btullis: would you say this merits its own Task? [15:43:14] I can post some details via Phatality that may be useful for the team if needed [15:43:26] 10Data-Engineering, 10Event-Platform Value Stream, 10Metrics-Platform-Planning, 10Product-Analytics, 10WMF-Architecture-Team: Major (API) versioning of Event Platform streams - https://phabricator.wikimedia.org/T332212 (10Ottomata) > when an event occurs - are we going to generate events in all versions... [15:46:00] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [15:51:27] herzog: Let's see if anyone shows up who has more experience of how an email sent via EventBus /should/ work. [15:52:10] btullis: thanks - I wonder if ottomata or olja_ (per mw / wikitech docs) would - in any case, waiting for now as instructed :) [15:52:22] o/ i don't know very much about how eventbus + jobqueue work together [15:52:27] but let's see... [15:52:42] non scalar in event eh...that sounds like a json serialization problem [15:52:49] if it helps https://logstash.wikimedia.org/goto/349fae3fb2bf8de32866ad4d4abb4ddf [15:54:15] this error started in January 2023, unless Logstash records don't go past that, it may not be the whole root cause of the failure but it may be?
[15:54:19] i don't know :) [15:54:43] Yeah, so it looks like it should be validated by eventgate with this schema: https://github.com/wikimedia/schemas-event-primary/blob/master/jsonschema/mediawiki/job/1.0.0.yaml [15:56:47] the digestemail sender is broken from some time ago too per https://phabricator.wikimedia.org/P46012 but I am still learning to navigate through the logstash ocean :) [15:56:56] validation happens at eventgate, this is a stack from mediawiki [15:57:27] hm but [15:57:33] raw_events looks fine [15:57:36] serialized to json just fine [15:58:14] or...unless this is the validation error message from eventgate...i don't think so though [15:58:42] would be nice if there was a php stack trace here [16:00:04] logged from [16:00:04] https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/EventBus/+/refs/heads/master/includes/EventBus.php#369 [16:00:28] ah wait [16:00:49] prop_val_type [16:00:49] MailAddress [16:00:58] to field is being set to an instance of MailAddress [16:01:07] which is failing because MailAddress is not a scalar [16:02:03] MailAddress's toString does work as expected though [16:02:21] herzog: i don't know where MailAddress is being passed to eventbus->send() [16:02:25] but wherever that is happening [16:02:38] https://gerrit.wikimedia.org/g/mediawiki/core/+/a4da635e8a2f00fd317cc458b6787cba17efc452/HISTORY#7894 [16:02:39] probably if you just call $mailAddress->toString() instead, it'll work [16:02:40] meeting times!
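(Annotation, not part of the channel log.) The diagnosis above is that an object, rather than a string or number, ended up in the `to` field of the event, and that converting it to a string before sending would probably avoid the "Non-scalar value found in the event" error. A minimal Python sketch of that idea follows. It is not the EventBus PHP code: the `MailAddress` class, its `to_string` method, the `find_non_scalars` helper, and the event layout are all hypothetical stand-ins, and treating `None` as an acceptable leaf value is an assumption of this sketch.

```python
import json
from dataclasses import dataclass

# Leaf types this sketch accepts in an event (assumption, see lead-in).
SCALAR_TYPES = (str, int, float, bool, type(None))


@dataclass
class MailAddress:
    """Hypothetical stand-in for an address object with a string form."""
    address: str
    name: str = ""

    def to_string(self) -> str:
        return f"{self.name} <{self.address}>" if self.name else self.address


def find_non_scalars(value, path=""):
    """Return (path, type name) for every leaf value that is not a scalar."""
    problems = []
    if isinstance(value, dict):
        for key, item in value.items():
            child = f"{path}.{key}" if path else str(key)
            problems += find_non_scalars(item, child)
    elif isinstance(value, (list, tuple)):
        for index, item in enumerate(value):
            problems += find_non_scalars(item, f"{path}[{index}]")
    elif not isinstance(value, SCALAR_TYPES):
        problems.append((path, type(value).__name__))
    return problems


to = MailAddress("user@example.org", "Example User")

# Passing the object itself reproduces the kind of problem seen in the log.
broken_event = {"type": "exampleJob", "params": {"to": to, "subject": "Digest"}}

# Converting to a string first leaves only scalar leaves in the event.
fixed_event = {"type": "exampleJob", "params": {"to": to.to_string(), "subject": "Digest"}}

print(find_non_scalars(broken_event))
print(find_non_scalars(fixed_event))
print(json.dumps(fixed_event))
```

Reporting the path and type of the offending value mirrors the `prop_val_type` detail that made the problem visible in the log; whether the real fix belongs in the caller or in the job parameters is left to the task herzog files.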
[16:03:00] I'll see where this is used in TranslationNotifications [16:03:42] it's at https://gerrit.wikimedia.org/g/mediawiki/extensions/TranslationNotifications/+/4f4d6e5091aead5c1393bf2b20ca445457816d30/includes/Jobs/TranslationNotificationsEmailJob.php#10 [16:03:57] and https://gerrit.wikimedia.org/g/mediawiki/extensions/TranslationNotifications/+/4f4d6e5091aead5c1393bf2b20ca445457816d30/scripts/DigestEmailer.php#200 [16:04:33] I'll file a task so I don't forget [16:11:49] oh [16:11:52] Failed executing job: sendMail Special:MyLanguage/Main_Page to= [16:15:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1002:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1002:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [16:16:42] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:21:06] (03PS1) 10Phuedx: Remove deprecated all_settings streamconfigs param [analytics/refinery] - 10https://gerrit.wikimedia.org/r/910046 (https://phabricator.wikimedia.org/T286344) [16:47:51] (03CR) 10Ottomata: [C: 03+2] Remove deprecated all_settings streamconfigs param [analytics/refinery] - 10https://gerrit.wikimedia.org/r/910046 (https://phabricator.wikimedia.org/T286344) (owner: 10Phuedx) [16:48:02] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Remove deprecated all_settings streamconfigs param [analytics/refinery] - 10https://gerrit.wikimedia.org/r/910046 (https://phabricator.wikimedia.org/T286344) (owner: 10Phuedx) [16:59:04] RECOVERY - MegaRAID on analytics1068 is OK: OK: 
optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [17:08:59] 10Data-Engineering, 10DBA, 10Discovery-Search, 10Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (10colewhite) [17:09:37] 10Data-Engineering, 10DBA, 10Discovery-Search, 10Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (10colewhite) [18:22:37] 10Data-Engineering, 10Advanced-Search, 10All-and-every-Wikisource, 10ArticlePlaceholder, and 65 others: Remove unnecessary targets definitions - https://phabricator.wikimedia.org/T328497 (10Jdlrobson) Graph done in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Graph/+/902213 [18:27:23] 10Data-Engineering-Planning, 10Data Pipelines (Sprint 11): Support for moving data from HDFS to public http file server - https://phabricator.wikimedia.org/T317167 (10JAllemandou) >>! In T317167#8793461, @Ottomata wrote: > I'd go with Option 2 for this if we can. Great - let's go for that :) > For option 2,... [18:49:17] (SystemdUnitFailed) firing: (9) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:17:50] 10Data-Engineering-Planning, 10Data Pipelines (Sprint 11): Support for moving data from HDFS to public http file server - https://phabricator.wikimedia.org/T317167 (10Ottomata) Yeah but I guess it is the same problem for the stat box synced data too :/ [19:36:32] 10Data-Engineering-Planning, 10Event-Platform Value Stream, 10MW-1.40-notes (1.40.0-wmf.24; 2023-02-20), 10MW-1.41-notes (1.41.0-wmf.2; 2023-03-27), and 2 others: Remove StreamConfig::INTERNAL_SETTINGS logic from EventStreamConfig and do it in EventLogging client ... 
- https://phabricator.wikimedia.org/T286344 [19:56:13] (DiskSpace) firing: Disk space stat1007:9100:/ 5.999% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=stat1007 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [19:56:39] ^^ Looking [19:59:40] 10Data-Engineering: Low disk space on stat1007 - https://phabricator.wikimedia.org/T335069 (10BTullis) [20:00:41] 10Data-Engineering: Low disk space on stat1007 - https://phabricator.wikimedia.org/T335069 (10BTullis) p:05Triage→03High [20:06:13] (DiskSpace) resolved: Disk space stat1007:9100:/ 5.896% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=stat1007 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [20:12:07] 10Data-Engineering: Low disk space on stat1007 - https://phabricator.wikimedia.org/T335069 (10BTullis) It seems that most of the space is taken up in `/tmp` and is related to airflow development, along with a bit of flink development. {F36956823,width=50%} {F36956825,width=50%} Maybe @mforns or @gmodena or @xcol... [20:13:12] mforns: xcollazo: You wouldn't be able to do a bit of spring cleaning in `/tmp` on stat1007 would you, if either of you is around? [20:18:44] 10Quarry, 10Regression: Quarry queries do not finish - https://phabricator.wikimedia.org/T334903 (10IKhitron) I see. Thank you. [20:28:32] 10Data-Engineering-Planning, 10Event-Platform Value Stream, 10MW-1.40-notes (1.40.0-wmf.24; 2023-02-20), 10MW-1.41-notes (1.41.0-wmf.2; 2023-03-27), and 2 others: Remove StreamConfig::INTERNAL_SETTINGS logic from EventStreamConfig and do it in EventLogging client ... 
- https://phabricator.wikimedia.org/T286344 [20:39:27] (03PS1) 10Mforns: Migrate unique devices druid loading queries to Airflow/SparkSQL [analytics/refinery] - 10https://gerrit.wikimedia.org/r/910092 (https://phabricator.wikimedia.org/T334096) [20:44:32] (03PS1) 10Mforns: Fix HiveToDruid to allow for non-partitioned source tables. [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/910094 (https://phabricator.wikimedia.org/T334096) [22:49:17] (SystemdUnitFailed) firing: (9) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed