[00:04:45] (SystemdUnitCrashLoop) firing: (15) crashloop on kafka-jumbo1001:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [00:58:25] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad average message consume rate in last 30m on alert1001 is OK: (C)0 le (W)100 le 1477 https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_jumbo-eqiad [01:01:09] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1012 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [01:01:31] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1010 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [01:02:41] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1010 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [01:04:55] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [01:07:19] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad average message produce rate in last 30m on alert1001 is OK: (C)0 le (W)100 le 1435 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_jumbo-eqiad [01:07:41] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1014 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [01:09:01] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1002 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [01:13:53] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1014 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [01:17:42] (SystemdUnitFailed) firing: monitor_refine_event_test.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:18:27] PROBLEM - 
Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1014 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [01:19:25] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad max lag in last 10 minutes on alert1001 is CRITICAL: 5.573e+06 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_jumbo-eqiad [01:19:43] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1014 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [01:30:31] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1012 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [01:31:28] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1010 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [01:31:47] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [01:32:42] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:32:43] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:32:47] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1002 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [01:34:29] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1010 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [01:34:41] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [01:36:35] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1012 is CRITICAL: PROCS CRITICAL: 0 processes with command 
name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [01:36:43] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1008 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [01:36:45] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1004 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [01:37:28] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [01:37:39] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1012 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [01:37:51] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1004 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [01:38:19] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1006 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [01:38:19] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [01:38:33] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1007 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [01:38:33] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad average message consume rate in last 30m on alert1001 is CRITICAL: 0 le 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_jumbo-eqiad [01:39:23] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1006 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [01:39:23] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1001 is OK: PROCS OK: 1 process with 
command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [01:40:33] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad average message produce rate in last 30m on alert1001 is CRITICAL: 0 le 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_jumbo-eqiad [01:45:31] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1013 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [01:49:04] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1015 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [01:49:24] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1005 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [01:49:44] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1013 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [01:50:24] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1005 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [01:52:12] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1015 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [02:07:54] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [02:08:16] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1012 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [02:09:18] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1007 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [02:09:40] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1012 is OK: PROCS OK: 1 process 
with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [02:13:24] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1014 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [02:14:50] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1014 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [02:21:14] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1004 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [02:22:14] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1004 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [02:22:36] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1011 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [02:24:08] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1014 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [02:27:02] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1011 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [02:27:22] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1014 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [02:28:04] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [02:28:04] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1006 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [02:28:46] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex 
args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [02:29:10] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1006 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [02:29:10] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1001 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [02:29:50] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1008 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [02:32:52] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [02:33:52] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1008 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [02:34:30] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad average message produce rate in last 30m on alert1001 is OK: (C)0 le (W)100 le 3047 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_jumbo-eqiad [02:36:56] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [02:38:02] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1008 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [02:44:20] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1006 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [02:44:20] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [02:45:38] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1006 is OK: PROCS OK: 1 process with command name java, 
regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [02:45:38] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1001 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [02:51:14] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad average message consume rate in last 30m on alert1001 is OK: (C)0 le (W)100 le 664.9 https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_jumbo-eqiad [03:10:16] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad average message consume rate in last 30m on alert1001 is CRITICAL: 0 le 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_jumbo-eqiad [03:11:38] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad average message produce rate in last 30m on alert1001 is CRITICAL: 0 le 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_jumbo-eqiad [04:05:00] (SystemdUnitCrashLoop) firing: (15) crashloop on kafka-jumbo1001:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [04:37:27] (MediawikiPageContentChangeEnrichJobManagerNotRunning) firing: ... [04:37:27] mw_page_content_change_enrich in eqiad is not running - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=eqiad%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichJobManagerNotRunning [04:39:27] (MediawikiPageContentChangeEnrichTaskManagerNotRunning) firing: ... [04:39:27] The mw-page-content-change-enrich Flink cluster in eqiad has no registered TaskManagers - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=eqiad%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=All - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichTaskManagerNotRunning [05:32:57] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:08:19] Hi elukey btullis I'm seeing lots of kafka alerts from the past few hours please help out. [07:13:27] (MediawikiPageContentChangeEnrichHighKafkaConsumerLag) firing: ... 
[07:13:27] High Kafka consumer lag for mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichHighKafkaConsumerLag [07:20:24] stevemunene: I can have a look [07:23:41] Thanks brouberol [07:26:05] the mirrormaker processes start/exit in a loop, with multiple instances of this error [07:26:11] [mirrormaker-thread-6] Mirror maker thread failure due to org.apache.kafka.common.KafkaException: Unexpected error from SyncGroup: The server experienced an unexpected error when processing the reques [07:28:27] (MediawikiPageContentChangeEnrichHighKafkaConsumerLag) resolved: ... [07:28:27] High Kafka consumer lag for mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichHighKafkaConsumerLag [07:29:31] that's not much to go on. I'll try to dig through the other kafka logfiles, to see if I can find the "other side" of that failure [07:33:12] I'm not seeing any under-replicated partition, the cluster seems to be in a good state [07:33:27] (MediawikiPageContentChangeEnrichHighKafkaConsumerLag) firing: ... [07:33:27] High Kafka consumer lag for mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichHighKafkaConsumerLag [07:36:04] hello folks [07:37:07] seeing weird stuff on the wdqs updater@eqiad and it's only using kafka-main [07:37:37] o/ what weird stuff? :) [07:38:36] I'm not seeing any errors in the broker logs themselves [07:38:38] plenty of "reconciliation" events, something that should not happen, still looking at possible causes but if mirrormaker is having issues that could explain some of it [07:38:53] dcausse ah so the issue could be coming from kafka-main itself? [07:39:30] brouberol: possibly, the app I'm looking at only touches kafka-main and is dependent on mirrormaker [07:39:49] hi folks o/ [07:40:13] so the issue seems to be with mirror maker on jumbo, pulling from main [07:40:15] might be unrelated but the coincidence is worth looking into perhaps [07:40:33] for sure. 
jumbo looks fine, afaict [07:41:11] For info: we deployed EventGate with new schemas yesterday evening with gmodena [07:41:12] the only weird thing that I see on main-eqiad is [07:41:13] https://grafana-rw.wikimedia.org/d/000000027/kafka?forceLogin&from=now-3h&orgId=1&to=now&var-datasource=thanos&var-kafka_cluster=main-eqiad&var-cluster=kafka_main&var-kafka_broker=All&var-disk_device=All&viewPanel=47 [07:41:26] at around 6:45 UTC [07:41:47] but the issue with mirror maker started way before [07:42:12] ~20UTC yesterday [07:42:17] And there is some synchronicity between our dpeloy and the start of the mirror-maker errors [07:42:51] nothing on the analytics SAL for yesterday https://sal.toolforge.org/analytics [07:43:27] (MediawikiPageContentChangeEnrichHighKafkaConsumerLag) resolved: ... [07:43:27] High Kafka consumer lag for mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichHighKafkaConsumerLag [07:43:31] ah wait folks [07:43:41] https://grafana-rw.wikimedia.org/d/000000521/kafka-mirrormaker?orgId=1&refresh=5m&from=now-2d&to=now [07:44:05] ok yes definitely some weird error at 20 UTC [07:44:11] see all the mirror maker graphs [07:44:38] Wow - no mirror-maker :( [07:45:09] This coincides with our EventGate deploy, but I don't understand how it could have impacted :( [07:45:26] also with rdf-streaming-updater on dse [07:45:37] ack [07:46:32] ok if I roll restart all mirror makers on jumbo? [07:46:43] works for me [07:46:46] elukey: be my guest, but they keep restarting in a loop [07:46:57] so I'm guessing systemd is already doing that for you [07:46:59] elukey: it's weird, it seems it's only eqiad, not codfw [07:47:27] (MediawikiPageContentChangeEnrichHighKafkaConsumerLag) firing: ... 
[07:47:27] High Kafka consumer lag for mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichHighKafkaConsumerLag [07:47:39] brouberol: yes yes it is just to see if anything improves, I don't have a lot of confidence that it will solve much [07:47:47] !log roll restart mirror maker instances on kafka jumbo [07:47:51] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [07:48:00] joal: the other graphs are, IIUC, for main-eqiad -> codfw [07:48:05] so kafka main clusters [07:48:58] it shouldn't be jumbo-main kafka related since we'd see alerts from other producers/consumers in theory [07:49:14] The mirror-name tells that we have issues for main-eqiad-to-jumbo-eqiad, not for main-codfw-to-jumbo-codfw [07:49:18] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad average message produce rate in last 30m on alert1001 is OK: (C)0 le (W)100 le 2316 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_jumbo-eqiad [07:49:24] main-codfw-to-jumbo-eqiad sorry [07:49:26] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad average message consume rate in last 30m on alert1001 is OK: (C)0 le (W)100 le 2108 https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_jumbo-eqiad [07:50:05] joal: iirc we don't have main codfw to jumbo, only main eqiad to jumbo [07:50:14] actually we do! [07:50:15] I'm seeing these kinds of errors on all (at least the first 3 I checked) mirror maker logs over in kafka-main.eqiad: [Consumer clientId=kafka-mirror-main-codfw_to_main-eqiad-9, groupId=kafka-mirror-main-codfw_to_main-eqiad] Node 2003 was unable to process the fetch request with (sessionId=1434757511, epoch=5645926): INVALID_FETCH_SESSION_EPOCH. [07:50:52] wow - corrupted kafka topc? [07:51:53] dcausse: can we stop rdf-streaming-updater on dse for a little while? [07:52:02] just to remove some variables [07:52:06] elukey: sure, lemme do this [07:52:20] thanks :) [07:52:32] joal: what did you deploy for eventgate? [07:52:49] joal: I'm not sure. Google is not very helpful there [07:54:21] rdf-streaming-updater@dse-k8s should be gone [07:54:27] thanks! [07:54:29] This was the eventgate change, as far as I understand. https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/960610 [07:54:34] elukey: We deployed the 4 instances on staging/eqiad/codfw in this order, with a new docker-image for new schemas. [07:55:12] Sorry, that was the wrong link. [07:55:25] elukey: As I understand, we also bumped on 3 other instances a change that was only on eventgate-main that you (ML) baked [07:55:28] Need coffee. 
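A fleet-wide grep via cumin is one way to see how widespread the INVALID_FETCH_SESSION_EPOCH / SyncGroup errors are, rather than spot-checking the first few brokers by hand; this is only a sketch, and both the kafka-main host glob and the assumption that the mirror maker instances there log under /var/log/kafka/ are unverified:

  # Count matches per host, then show the most recent occurrence with its timestamp.
  sudo cumin 'kafka-main1*' \
    'grep -c INVALID_FETCH_SESSION_EPOCH /var/log/kafka/*.log || true'
  sudo cumin 'kafka-main1*' \
    'grep -h INVALID_FETCH_SESSION_EPOCH /var/log/kafka/*.log | tail -n 1 || true'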
[07:56:38] joal: okok to understand everything, maybe the deployments were a coincidence [07:57:04] elukey: no problem at all - I dislike synchronicity [07:57:55] from https://grafana-rw.wikimedia.org/d/000000521/kafka-mirrormaker?orgId=1&from=1695756942589&to=1695758862050&var-datasource=eqiad%20prometheus%2Fops&var-lag_datasource=eqiad%20prometheus%2Fops&var-mirror_name=main-eqiad_to_jumbo-eqiad it seems that the time was 19:50 UTC [07:58:02] sharp dip [07:58:43] indeed. Does that line up with a deployment somehow? [07:59:06] I was trying to match one, eventgate and rdfstreaming updater kinda match but nothing moreafaics [07:59:10] at least from sAL [08:00:01] maybe it is something related to a particular message that mirror maker processed at around that time? [08:02:00] I need to be AFK for kids - I'll try to have a quick look here soon [08:02:04] sorry folks [08:02:13] it all right, nothing is on fire :) [08:03:00] maybe we could raise the logs on say jumbo1001 to debug and see what happens [08:03:24] looking in mm-main logs around that time, I see a bunch of [08:03:24] [Consumer clientId=kafka-mirror-main-codfw_to_main-eqiad-11, groupId=kafka-mirror-main-codfw_to_main-eqiad] Revoking previously assigned partitions [...] [08:03:24] [2023-09-26 19:50:02,452] 2285859644 [mirrormaker-thread-11] INFO org.apache.kafka.clients.consumer.internals.AbstractCoordinator - [Consumer clientId=kafka-mirror-main-codfw_to_main-eqiad-11, groupId=kafka-mirror-main-codfw_to_main-eqiad] (Re-)joining group [08:03:52] yeah but this is main codfw to main eqiad right? That seems working fine [08:04:01] after which, I see many such messages [08:04:01] [Consumer clientId=kafka-mirror-main-codfw_to_main-eqiad-11, groupId=kafka-mirror-main-codfw_to_main-eqiad] Resetting offset for partition codfw.change-prop.partitioned.mediawiki.job.refreshLinks-6 to offset 42043423. [08:04:01] [Consumer clientId=kafka-mirror-main-codfw_to_main-eqiad-11, groupId=kafka-mirror-main-codfw_to_main-eqiad] Resetting offset for partition codfw.cpjobqueue.retry.mediawiki.job.deletePage-0 to offset 3. [08:04:01] [Consumer clientId=kafka-mirror-main-codfw_to_main-eqiad-11, groupId=kafka-mirror-main-codfw_to_main-eqiad] Resetting offset for partition codfw.change-prop.retry.mediawiki.job.enqueue-0 to offset 0. [08:04:16] ah, let me check real quick [08:04:42] oh yeah, that's right, it's right there in the name [08:05:00] (SystemdUnitCrashLoop) firing: (15) crashloop on kafka-jumbo1001:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [08:07:17] trying one thing on the nodes [08:07:20] stop all mirror makers [08:07:23] bring up only one [08:07:27] (MediawikiPageContentChangeEnrichHighKafkaConsumerLag) resolved: ... [08:07:27] High Kafka consumer lag for mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichHighKafkaConsumerLag [08:08:24] I've arrived at my desk. Sorry for being late to the party. Reading backscroll now. 
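While everything is stopped and instances are brought back one by one, a quick fleet-wide status check shows which hosts actually have the mirror running; a sketch using the unit name from the systemd alerts, with the host glob as an assumption:

  # ActiveState per jumbo broker for the mirror maker unit (active/inactive/failed).
  sudo cumin 'kafka-jumbo1*' \
    'systemctl is-active kafka-mirror-main-eqiad_to_jumbo-eqiad@0.service || true'
  # On a single host, systemd also records how many times the unit has restarted.
  sudo cumin 'kafka-jumbo1001*' \
    'systemctl show -p ActiveState,NRestarts kafka-mirror-main-eqiad_to_jumbo-eqiad@0.service'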
[08:08:36] !log stop all mirror maker on jumbo, start only one on jumbo1001 [08:08:38] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:08:47] ok the instance seems up and running [08:08:53] it is not crashing afaics [08:09:04] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1012 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [08:09:19] yeah yeah sorry [08:09:27] ok so I see some fetch rate in the metrics [08:09:36] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [08:09:36] I'll wait a bit and then start the one on 1002 [08:09:38] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1006 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [08:09:41] sorry for the noise folks [08:10:25] downtimed jumbo nodes [08:10:44] brouberol: so one instance works [08:11:15] !log start kafka mirror on jumbo1002 [08:11:16] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:11:47] this is really weird [08:11:52] 1002 works fine afaics [08:12:32] I won't lie, I don't understand much atm ^^ [08:12:37] Shall I make an incident doc? [08:13:28] I'm really intrigued by the INVALID_FETCH_SESSION_EPOCH error that you saw. [08:13:59] !log slowly start mirror maker on one instance at the time on all jumbo nodes [08:14:00] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:14:31] brouberol: me too, but my impression is that mirror maker ended up in a weird state in which multiple attempts to form the consumer group were made, all failing [08:14:58] with one start at the time it is more clear how new consumers enter the cgroup, and who is the leader etc.. [08:15:15] there must be a weird jira/bug related to this somewhere [08:15:32] if planets are aligned in a certain way, you run kafka mirror version X and kafka version Y, etc.. [08:17:06] right, that I do understand, thanks! [08:18:13] np! These are all speculations, I don't have a solid idea what's happening [08:18:22] slowly bring up all mirrors [08:18:28] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1006 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [08:19:05] quick question elukey: By restarting all mirror-makers, ahve we lost the history of not-yet mirrored messages? [08:19:51] joal: in theory no, they are all consumer groups with offset stored in kafka __offset topic/partitions etc.. 
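Whether the group really resumes from its last committed position can be read back from the source cluster (main-eqiad), where the mirror maker's offsets are stored in the __consumer_offsets topic; a sketch with the stock kafka-consumer-groups CLI that appears later in this log, using a main-eqiad broker as bootstrap (host and port are assumptions beyond what is quoted elsewhere in the log):

  # Per-partition committed offset, log-end offset and lag for the mirror group.
  # If nothing was lost, CURRENT-OFFSET is preserved across the restarts and LAG drains.
  kafka-consumer-groups --bootstrap-server kafka-main1003.eqiad.wmnet:9092 \
    --describe --group kafka-mirror-main-eqiad_to_jumbo-eqiad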
[08:19:52] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1009 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [08:19:57] they should restart from where they stopped [08:20:10] ack elukey - thanks :) [08:21:43] That's unless they have been stopped for longer than the retention period of the topic, but in this case they are nowhere near it. [08:22:11] ah yes yes that is the nuclear case :D [08:22:18] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1012 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [08:22:43] all mirror makers up [08:25:16] sigh they are failing again [08:25:24] Gah! [08:25:45] maybe there is a critical mass for the cgroup size [08:25:58] because they were superstable until I re-enabled all of them [08:26:03] https://www.irccloud.com/pastebin/MRsQgFAp/ [08:26:23] I'm going to make an incident doc. I'll be the IC. [08:27:11] This is the exception I saw and logged at 09:26 indeed [08:27:29] do you mind if I try to leave say only 5 mirror makers running? [08:27:32] to see if it re-happens [08:27:35] and it's super generic, so Google doesn't tell us much [08:27:36] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad max lag in last 10 minutes on alert1001 is OK: (C)1e+05 gt (W)1e+04 gt 1 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_jumbo-eqiad [08:27:47] elukey: That seems fine to me. [08:28:34] !log `elukey@cumin1001:~$ sudo cumin 'kafka-jumbo10[06-15]*' 'systemctl stop kafka-mirror'` [08:28:35] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:29:03] !log `elukey@cumin1001:~$ sudo cumin 'kafka-jumbo10[01-05]*' 'systemctl start kafka-mirror' -b 1 -s 30` [08:29:05] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:30:49] https://issues.apache.org/jira/browse/KAFKA-5407 seems close to what we are seeing [08:31:48] even Here's an incident doc. https://docs.google.com/document/d/1tZ0t9KGXRu_dELvsvvM4FIrF7bT_qDL1_BUBgSz-_eA/edit - I'll keep adding to it with what we've found so far. [08:32:35] ack! [08:32:43] we are running with mirror makers on 1001->1005 [08:33:11] elukey: let me search for org.apache.kafka.common.errors.RecordTooLargeException in our logs [08:33:53] yeah they talk about fetch size / max message bytes, that we increased recently [08:33:56] it may be a good lead [08:34:15] now I am curious to see if 5 mirrors stay up [08:34:49] sudo cumin 'kafka-jumbo*' 'grep org.apache.kafka.common.errors.RecordTooLargeException /var/log/kafka/*.log' [08:34:49] -> no matches [08:35:44] the mm consumer settings say max.partition.fetch.bytes=12058624 [08:35:52] that in theory is the correct value [08:36:20] however, I have some matches on kafka-main! [08:36:22] paste inc [08:36:26] ah! [08:37:42] (MediawikiPageContentChangeEnrichJobManagerNotRunning) firing: ... 
[08:37:42] mw_page_content_change_enrich in eqiad is not running - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=eqiad%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichJobManagerNotRunning [08:38:23] brouberol@kafka-main1003:~$ grep -l org.apache.kafka.common.errors.RecordTooLargeException /var/log/kafka/*.log [08:38:30] (03CR) 10Elukey: "Hi Jeff! The change looks sane, can you add more context on the commit msg about what led to it, use case etc.." [analytics/kafkatee] - 10https://gerrit.wikimedia.org/r/961174 (owner: 10Jgreen) [08:38:41] match found in /var/log/kafka/server.log [08:38:57] (thats kafka-main1003.eqiad.wmnet) [08:39:27] (MediawikiPageContentChangeEnrichTaskManagerNotRunning) firing: ... [08:39:27] The mw-page-content-change-enrich Flink cluster in eqiad has no registered TaskManagers - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=eqiad%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=All - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichTaskManagerNotRunning [08:39:34] last match: [2023-09-27 08:28:05,100] ERROR [GroupMetadataManager brokerId=1003] Appending metadata message for group kafka-mirror-main-eqiad_to_jumbo-eqiad generation 19944 failed due to org.apache.kafka.common.errors.RecordTooLargeException, returning UNKNOWN error code to the client (kafka.coordinator.group.GroupMetadataManager) [08:40:08] the mirrors are still running, something that makes me very puzzled [08:41:05] * brouberol will be back in 5-10min [08:42:26] So we're still on 5 mirror-maker processes elukey? Is that right? [08:43:02] btullis: exactly yes, they are all fine [08:43:41] I'll wait another 10 mins and I'll start a couple more [08:43:59] we didn't have troubles with 9 brokers, so I suspect it may be a weird bug due to the fact that we have 15 [08:45:13] it is true that our mirro maker is not exactly the latest and greatest :D [08:48:31] starting 1006 1007 1008 1009 [08:48:37] +4 [08:49:53] FWIW, we have message.max.bytes=10485760 on kafka jumbo and message.max.bytes=4194304 on kafka-main [08:50:06] yeah we didn't bump it everywhere [08:50:12] so if we're mirroring one into the other, we might be hitting message size limits [08:50:34] the mirroring is from main to jumbo, one way [08:50:46] gotcha, in that case that;s ok [08:51:07] but the record too large that you found is an indication that some producer do sends bigger messages to main too [08:51:14] so we'll have to follow up with service ops :D [08:51:25] or maybe it was just a test [08:51:39] 9 mirrors running btullis [08:51:57] Ack, thanks. [08:52:02] so you're thinking the RecordTooLargeException is unrelated w/ the mirrormaker issues? [08:52:43] I am inclined to think this way, but I am not 100% sure. The fact that all works fine with X mirrors is too weird [08:53:03] what do you think? [08:53:58] I actually think that they are related. 
I had an idea to check when the first RecordTooLargeException issue occurred [08:54:15] brouberol@kafka-main1003:~$ grep org.apache.kafka.common.errors.RecordTooLargeException /var/log/kafka/server.log | head -n 1 [08:54:15] [2023-09-26 19:54:46,039] ERROR [GroupMetadataManager brokerId=1003] Appending metadata message for group kafka-mirror-main-eqiad_to_jumbo-eqiad generation 15610 failed due to org.apache.kafka.common.errors.RecordTooLargeException, returning UNKNOWN error code to the client (kafka.coordinator.group.GroupMetadataManager) [08:54:18] yeah if you find more data to support the link let's keep investigating [08:54:34] ah! [08:54:42] which matches with the MM issue, I believe [08:55:28] ok so maybe they keep running until they find a big message that causes the trouble [08:55:31] so we might be dealing with some kind of poison pill message that can't be mirrored [08:55:46] right, my thoughts exactly [08:55:47] okok this would make sense [08:55:54] I am convinced [08:56:22] Great. [08:56:28] as to why 5 MM instances work and not 10, I'm not 100% sure. They should all replicate the total amount of partitions either way. [08:56:53] yeah but if you are right we could be using 100, they'd fail when the message too big hits them [08:57:16] probably our version of mirror maker doesn't deal well with errors from main [08:57:29] so, if this is right, and we are seeing record too large on main [08:57:31] true. The thing I don't understand is why *all* MM instances were failing [08:57:41] I was really struck by this graph that I just discovered as well. https://usercontent.irccloud-cdn.com/file/WcNFV9f5/image.png [08:57:47] https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad%20prometheus%2Fops&var-lag_datasource=eqiad%20prometheus%2Fops&var-mirror_name=main-eqiad_to_jumbo-eqiad&orgId=1&refresh=5m&from=now-24h&to=now [08:57:49] ah, maybe because they crashed, and reassigned their partitions to others [08:57:49] it means that some producer to main (unrelated to jumbo) sometimes produces big messages [08:58:06] so everyone kept handing over the hawt potato [08:58:42] one thing could be that the big message is skipped after the crash [08:59:23] one thing which is unlikely to be relevant since the main Kafka service runs on Java 8, OpenJDK 11 was updated yesterday on buster hosts (and -jumbo are on it still). just mentioning it for completeness, maybe there's some tool which uses the default Java 11 and caused this [08:59:26] the kafka main broker sends the error to the mirror maker (on jumbo) consumer group, that is old and gets into a weird state for $reasons [08:59:27] I'm not sure it would though, because the consumer offset would still point to before that message, as it wouldn't have been able to commit [08:59:40] moritzm: ack thanks! [09:00:06] I had thought that the mirror makers were just restarting message mirroring every time they crashed, but there were no messages passed. There was a blip at 00:58 and again at 02:35 and then a surge at 08:08 [09:00:15] brouberol: yes true, but maybe the message from main makes it commit the new offset, but it doesn't avoid the crash [09:00:41] btullis: so far they are processing, this is the weird thing [09:01:57] Yes, very weird. [09:02:05] so the theories are two: [09:02:37] 1) a big message from kafka main causes the consumer group of mirror maker on jumbo to get into a weird state. Somehow after a slow restart (1by1) the cluster gets up. 
[09:03:02] 2) the number of mirror maker instances in the cgroup causes instability upon receiving $certain-events [09:03:11] or both :D [09:03:28] lemme know if you have other ideas [09:04:23] that matches my understanding as well [09:05:18] btullis: added 3 hours of downtime to jumbo [09:06:50] Thanks. I am suspicious of mw_page_content_change_enrich - This seems to have crashed overnight too. https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=eqiad%20prometheus%2Fk8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich&from=now-24h&to=now [09:07:24] If I recall correctly, this was the application for which we increased the maximum message size in kafka-jumbo. [09:08:17] I may be barking up the wrong tree if this isn't publishing anything at all to kafka-main. It could just be a coincidence. [09:10:22] I think I understood my problem, definitely related to mirrormaker eqiad->jumbo issues, it caused many late events detected by the rdf-streaming-updater@dse-k8s that in turn shipped many reconciliation events that were wrongly processed by the production job running in wikikube@eqiad. [09:10:33] tl/dr, Issue is mainly on my side: the flink production job should not treat "reconcile" events emitted by the test job running in dse-k8s [09:10:48] dcausse: we can restart rdf-streaming-updater if you want! [09:10:54] Ah, maybe it was just an issue dealing with the high consumer lag, causing that app to crash. So a symptom rather than a cause. [09:11:21] elukey: actually I don't want to restart it yet :) [09:11:45] dcausse: ack :D [09:11:56] this mirrormaker issue uncovered a problem :) [09:12:06] dcausse: Thanks for the update. [09:14:27] (MediawikiPageContentChangeEnrichTaskManagerNotRunning) resolved: ... [09:14:27] The mw-page-content-change-enrich Flink cluster in eqiad has no registered TaskManagers - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=eqiad%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=All - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichTaskManagerNotRunning [09:16:34] elukey: Are we still running stable with 9 brokers for now? [09:16:44] btullis: I was about to write, yes [09:17:03] what do you think it is best? Shall we leave those running for say 2/3 hours until after lunch? [09:17:15] then we try to add more [09:17:30] FWIW, I'm workign on packaging topicmappr as a deb (PR was merged this morning, I still have to publish the deb), that will allow me to start evacuating kafka-jumbo10[06-10] [09:17:40] brouberol: <3 [09:17:55] Can we set up some kind of log capture for a RecordTooLargeException on kafka-main brokers? [09:17:57] meaning that in the short/medium term, our broker count should go back to the original 9 [09:19:31] btullis: we can grep later on via cumin, like brouberol did [09:21:25] Yep, fair enough. [09:22:13] ok I am going afk in a bit for errands, not sure if the 3 hours of downtime will be enough, maybe we can add more or revisit later on if you folks are online [09:22:31] didn't want to add too many hours since other issues may fall through [09:22:37] Yep, sure thing. Thanks elukey. [09:23:40] I have pinged g.modena, just because that's the only other app I can think of that is working with larger than normal kafka messages recently. There's a chance it could be related somehow. 
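For the log capture question just above, the fleet-wide grep already used against kafka-jumbo can simply be pointed at the main brokers and re-run periodically; a sketch, with the host glob as an assumption and server.log as the file where the earlier match was found:

  # Per-broker count of RecordTooLargeException plus the most recent occurrence,
  # to spot any new oversized record after the 08:28 one.
  sudo cumin 'kafka-main1*' \
    'grep -c org.apache.kafka.common.errors.RecordTooLargeException /var/log/kafka/server.log || true'
  sudo cumin 'kafka-main1*' \
    'grep org.apache.kafka.common.errors.RecordTooLargeException /var/log/kafka/server.log | tail -n 1 || true'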
[09:25:35] btullis ack [09:26:03] do you know on which topic these RecordTooLargeException errors are thrown? [09:26:23] they do not appear in mw-page-content-change-enrich logs [09:27:23] there are issues with mw-page-content-change-enrich in eqiad (reported in mailing list and slack), but they are not RecordTooLargeException related (only canary events go through) [09:27:29] Not yet. We only see them on kafka-main1003. I'm looking more closely at the logs now, to see if we can find out which topic or topics are involved. [09:27:43] https://logstash.wikimedia.org/app/discover#/doc/0fade920-6712-11eb-8327-370b46f9e7a5/ecs-k8s-1-1.11.0-6-2023.39?id=Xqro1IoBYI7nmYcaEY_P [09:28:18] btullis we don't produce in kafka main [09:28:26] we consume from it, and produce in jumbo [09:29:17] OK, thanks. I think you're in the clear then. It was only a hunch. [09:29:49] btullis ahah. I've been guilty of large records before, no worries :). Thanks for the ping. [09:32:18] I can look into your ConfigMapLock issue later, but it sounds a little like a k8s control plane kind of error, so perhaps #wikimedia-k8s-sig might have a pointer, especially as it's wikikube-eqiad. [09:32:57] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:36:34] * elukey afk! errands + lunch [09:39:46] btullis no rush. I'll ping wikimedia-k8s-sig [10:11:18] We haven't seen any more large message errors on kafka-main1003 since 08:28 this morning. [10:11:24] https://www.irccloud.com/pastebin/apQlkULG/ [10:28:19] I am intrigued as to why this command to list consumer groups on kafka-main isn't working. I would expect this to show me the mirrormaker consumer groups, even if nothing else. `kafka-consumer-groups --bootstrap-server localhost:9092 --list` [10:28:34] Am I doing something wrong? It works on kafka-jumbo. [11:01:45] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform: [NEEDS GROOMING] stream processing: we should have automated integration tests on staging - https://phabricator.wikimedia.org/T347472 (10gmodena) [11:09:09] 10Data-Platform-SRE: Apply partman recipe patch again and see if it affects unrelated reimages - https://phabricator.wikimedia.org/T347434 (10MoritzMuehlenhoff) I think it was your globbing pattern which might have caused the issue, the following should work: cloudelastic100[7-9]|cloudelastic101[0-2]) echo pa...
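On the "which topic" question: the error quoted earlier comes from GroupMetadataManager and reads "Appending metadata message for group kafka-mirror-main-eqiad_to_jumbo-eqiad ... failed", so one hedged reading is that the oversized record is the group's own metadata (member subscriptions and assignments) being written to __consumer_offsets on kafka-main, rather than a mirrored data message; that record grows with the number of group members, which would also fit 5 mirror maker instances staying up while 10-15 crash at SyncGroup. A sketch for checking the size limits in play; note that older kafka-configs versions only accept --zookeeper instead of --bootstrap-server, and the server.properties path is an assumption:

  # Any per-topic override (max.message.bytes) on __consumer_offsets on kafka-main:
  kafka-configs --bootstrap-server kafka-main1003.eqiad.wmnet:9092 \
    --entity-type topics --entity-name __consumer_offsets --describe
  # Broker-wide default, assuming the usual config location:
  grep message.max.bytes /etc/kafka/server.properties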
[11:44:18] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10EventStreams, 10Event-Platform: eventgate: eventstreams: update nodejs and OS - https://phabricator.wikimedia.org/T347477 (10gmodena) [11:49:13] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10EventStreams, 10Event-Platform: eventgate: eventstreams: update nodejs and OS - https://phabricator.wikimedia.org/T347477 (10BTullis) [11:49:15] 10Data-Engineering, 10Data-Platform-SRE, 10Event-Platform: Upgrade eventgate Docker image to Bullseye and nodejs 12 - https://phabricator.wikimedia.org/T343510 (10BTullis) [11:52:23] 10Data-Platform-SRE, 10SRE, 10ops-eqiad: Degraded RAID on dbstore1005 - https://phabricator.wikimedia.org/T347449 (10BTullis) a:03BTullis [11:55:22] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform: [NEEDS GROOMING] schema services should be moved to k8s - https://phabricator.wikimedia.org/T347421 (10gmodena) [12:05:25] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1011 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [12:05:27] PROBLEM - Check systemd state on kafka-jumbo1014 is CRITICAL: CRITICAL - degraded: The following units failed: kafka-mirror-main-eqiad_to_jumbo-eqiad@0.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:05:29] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1010 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [12:05:31] PROBLEM - Check systemd state on kafka-jumbo1012 is CRITICAL: CRITICAL - degraded: The following units failed: kafka-mirror-main-eqiad_to_jumbo-eqiad@0.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:05:39] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1014 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [12:05:40] 10Data-Platform-SRE, 10SRE, 10ops-eqiad: Degraded RAID on dbstore1005 - https://phabricator.wikimedia.org/T347449 (10BTullis) I've verified the above and can confirm that the two slots 1 and 4 are no longer visible to `megacli` ` btullis@dbstore1005:~$ sudo megacli -PDList -a0|grep "Slot Number" Slot Number:... [12:05:49] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1012 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [12:05:53] Ah, more mirror-maker failures. 
[12:05:55] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1015 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [12:06:03] PROBLEM - Check systemd state on kafka-jumbo1013 is CRITICAL: CRITICAL - degraded: The following units failed: kafka-mirror-main-eqiad_to_jumbo-eqiad@0.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:06:03] PROBLEM - Check systemd state on kafka-jumbo1011 is CRITICAL: CRITICAL - degraded: The following units failed: kafka-mirror-main-eqiad_to_jumbo-eqiad@0.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:06:09] PROBLEM - Check systemd state on kafka-jumbo1010 is CRITICAL: CRITICAL - degraded: The following units failed: kafka-mirror-main-eqiad_to_jumbo-eqiad@0.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:07:28] expired downtime from earlier debugging? [12:07:42] (SystemdUnitFailed) firing: (8) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:08:37] moritzm: Oh thanks. Silly me. [12:16:11] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1013 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [12:16:40] 10Data-Engineering, 10Data-Platform-SRE: [INCIDENT] kafka-jumbo mirrormaker from main-eqiad crashes associated with RecordTooLargeException errors - https://phabricator.wikimedia.org/T347481 (10BTullis) [12:16:59] 10Data-Engineering, 10Data-Platform-SRE: [INCIDENT] kafka-jumbo mirrormaker from main-eqiad crashes associated with RecordTooLargeException errors - https://phabricator.wikimedia.org/T347481 (10BTullis) a:03BTullis [12:18:23] 10Data-Engineering, 10Data-Platform-SRE: [INCIDENT] kafka-jumbo mirrormaker from main-eqiad crashes associated with RecordTooLargeException errors - https://phabricator.wikimedia.org/T347481 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=4ecdec37-d094-4187-9394-a9b8086d33b7) set by btullis... [12:18:32] !log added 3 more hours downtime to kafka-jumbo101[0-5].eqiad.wmnet [12:18:34] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [12:23:59] I suggest that we restart the remaining mirrormaker processes then. brouberol, stevemunene would you agree? [12:26:08] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform: [SPIKE] Should we introduce static typing to Event Platform nodejs codebases? - https://phabricator.wikimedia.org/T345389 (10gmodena) [12:26:27] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Replace RAID controller battery on an-worker1086 - https://phabricator.wikimedia.org/T347287 (10BTullis) Just adding here, the server didn't boot successfully. [12:27:44] btullis: fine by me 👍 [12:28:13] brouberol: o/ [12:28:17] btullis: o/ [12:28:17] back [12:28:22] how are things? [12:28:27] Ah, you're back. Good :-0 [12:28:36] I mean good you're back, [12:28:43] yes yes :) [12:28:54] shall we start one at a time?
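For reference, a minimal sketch of how the extra downtime logged at 12:18 is typically set from a cumin host; the cookbook flags shown here are assumptions rather than a transcript of the command behind the Icinga/Alertmanager silence above.

# Illustrative only: downtime the affected brokers for three more hours (flags assumed).
sudo cookbook sre.hosts.downtime --hours 3 \
  --reason "kafka-jumbo mirrormaker crashloop debugging - T347481" \
  'kafka-jumbo101[0-5].eqiad.wmnet'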
[12:29:00] maybe waiting 10/15 mins between each start [12:29:03] things seem to be steady at the moment, with the 9/15 mirrormakers [12:29:40] There haven't been any more of the RecordTooLargeException errors on kafka-main1003 or any other broker. [12:30:34] ok to start mm on 1010? [12:31:00] there are two possibilities [12:31:01] Yes. In one sense I think that there might be value in starting them all at once. [12:31:30] btullis: yeah I did it earlier on though and it failed.. [12:31:39] what I am thinking is [12:31:58] 1) after starting the Xth broker we run into trouble again, and we know the critical mass (if any) [12:32:20] 2) we start them all and nothing happens, which is an indication that we are probably waiting for a big message to come [12:32:25] does it make sense? [12:32:35] (started on 1010) [12:32:45] RECOVERY - Check systemd state on kafka-jumbo1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:33:16] Yes, makes sense. I wasn't accounting for the possibility of 1) so much, but you're right, it will be a good indicator. [12:33:27] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1010 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [12:33:56] There is a ticket now, so we can log our actions to SAL and to T347481 [12:33:57] T347481: [INCIDENT] kafka-jumbo mirrormaker from main-eqiad crashes associated with RecordTooLargeException errors - https://phabricator.wikimedia.org/T347481 [12:34:18] I can make a summary of what I did, since some stuff is not in SAL [12:35:29] Yeah, feel free to add/edit anything. Trying to write the timeline on the Google Doc and copy/paste the backscroll from IRC is horrible :-) [12:37:42] (MediawikiPageContentChangeEnrichJobManagerNotRunning) firing: ... [12:37:42] mw_page_content_change_enrich in eqiad is not running - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=eqiad%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichJobManagerNotRunning [12:37:50] 10Data-Engineering, 10Data-Platform-SRE: [INCIDENT] kafka-jumbo mirrormaker from main-eqiad crashes associated with RecordTooLargeException errors - https://phabricator.wikimedia.org/T347481 (10elukey) Two theories for the moment: 1) For some reason, a big message is sent to kafka main (triggering a RecordToo... [12:38:17] started 1011 [12:38:29] RECOVERY - Check systemd state on kafka-jumbo1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:38:32] (03PS2) 10Jgreen: Fix mismatched allocation error from fdopen/pclose to fdopen/fclose. This is to resolve a "mismatched-dealloc" error that blocked packaging a deb for Bookworm.
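The one-instance-at-a-time restart plan discussed at 12:29 could be scripted roughly as follows. This is only an illustrative sketch (in practice the instances were started by hand, with the dashboards checked between starts); the host list and unit name are taken from the alerts above, and the pause mirrors the 10/15 minute delay mentioned in the plan.

# Illustrative staged restart of the MirrorMaker units, one broker at a time (sketch only).
for host in kafka-jumbo1010 kafka-jumbo1011 kafka-jumbo1012 kafka-jumbo1013 kafka-jumbo1014 kafka-jumbo1015; do
  ssh "${host}.eqiad.wmnet" 'sudo systemctl start kafka-mirror-main-eqiad_to_jumbo-eqiad@0.service'
  echo "started mirrormaker on ${host}; waiting before the next one"
  sleep 900   # ~15 minutes between starts, per the plan above
done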
[analytics/kafkatee] - 10https://gerrit.wikimedia.org/r/961174 [12:38:33] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1011 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [12:38:53] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 2), 10Event-Platform: Can we import metrics from logstash to promethues? - https://phabricator.wikimedia.org/T347484 (10gmodena) [12:39:04] (03CR) 10Elukey: [C: 03+1] Fix mismatched allocation error from fdopen/pclose to fdopen/fclose. This is to resolve a "mismatched-dealloc" error that blocked packaging [analytics/kafkatee] - 10https://gerrit.wikimedia.org/r/961174 (owner: 10Jgreen) [12:39:34] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform: Can we import metrics from logstash to promethues? - https://phabricator.wikimedia.org/T347484 (10gmodena) [12:39:50] 10Data-Platform-SRE, 10SRE, 10ops-eqiad: Degraded RAID on dbstore1005 - https://phabricator.wikimedia.org/T347449 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=195bf9c0-3e24-446f-ba90-48d15ed5d628) set by btullis@cumin1001 for 0:20:00 on 1 host(s) and their services with reason: Cold bo... [12:40:44] !log cold rebooting dbstore1005 to see if it sees two missing disks for T347449 [12:40:56] (03CR) 10Jgreen: Fix mismatched allocation error from fdopen/pclose to fdopen/fclose. This is to resolve a "mismatched-dealloc" error that blocked packaging (031 comment) [analytics/kafkatee] - 10https://gerrit.wikimedia.org/r/961174 (owner: 10Jgreen) [12:48:21] 10Data-Platform-SRE, 10SRE, 10ops-eqiad: Degraded RAID on dbstore1005 - https://phabricator.wikimedia.org/T347449 (10BTullis) I have cold booted it and the missing slots have come back. ` btullis@dbstore1005:~$ sudo megacli -PDList -a0|grep "Slot Number" Slot Number: 0 Slot Number: 1 Slot Number: 2 Slot Numb... [12:49:52] 1012 up [12:50:31] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1012 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [12:50:41] RECOVERY - Check systemd state on kafka-jumbo1012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:51:14] 10Data-Platform-SRE, 10SRE, 10ops-eqiad: Degraded RAID on dbstore1005 - https://phabricator.wikimedia.org/T347449 (10BTullis) Ok, it's rebuilding automatically. ` btullis@dbstore1005:~$ sudo megacli -PDList -aall|grep 'Firmware state' Firmware state: Online, Spun Up Firmware state: Rebuild Firmware state: On... [12:51:20] elukey: ack, thanks. 
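On the dbstore1005 rebuild noted at 12:51, a small sketch of how the rebuild progress could be followed with MegaCLI; the enclosure:slot value is a placeholder and would need to match the drives that came back.

# Placeholder [E:S] value; substitute the real enclosure:slot of the rebuilding drive.
sudo megacli -PDRbld -ShowProg -PhysDrv '[32:1]' -a0
# Or simply poll the drive states until everything reports "Online, Spun Up" again.
sudo megacli -PDList -aall | grep 'Firmware state'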
[12:53:41] looks good so far [12:54:03] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1013 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [12:54:10] 1013 up [12:54:45] RECOVERY - Check systemd state on kafka-jumbo1013 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:56:06] 10Data-Platform-SRE: Install kafka-kit binaries on kafka brokers - https://phabricator.wikimedia.org/T346764 (10brouberol) https://gitlab.wikimedia.org/repos/sre/kafka-kit/-/merge_requests/2 was required to successfully build the debian package on `build2001.codfw.wmnet` [12:58:39] 1014 up [12:58:43] RECOVERY - Check systemd state on kafka-jumbo1014 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:59:11] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1014 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [13:00:42] 1015 up [13:00:46] ah! [13:00:48] it failed! [13:00:53] Oh! [13:00:55] what the hell.. [13:01:21] I stopped 1015 [13:01:40] will the others recover? [13:01:59] https://usercontent.irccloud-cdn.com/file/QcNks280/image.png [13:02:52] yeah they seem to be recovering [13:03:13] How did you see them fail so quickly? Were you tailing all of the logs? [13:03:25] yeah I tailed 1015's mm log [13:03:33] and I am checking the mm grafana dashboards [13:03:44] the fetch request rate went up again [13:07:39] Starting build #87 for job analytics-refinery-update-jars-docker [13:08:03] (03PS1) 10Maven-release-user: Add refinery-source jars for v0.2.23 to artifacts [analytics/refinery] - 10https://gerrit.wikimedia.org/r/960683 [13:08:03] Project analytics-refinery-update-jars-docker build #87: 09SUCCESS in 24 sec: https://integration.wikimedia.org/ci/job/analytics-refinery-update-jars-docker/87/ [13:08:32] ok so whatever it is, it is triggered by 1015 [13:08:59] at the beginning I thought it was maybe related to some topic/partition assigned to 1015 containing big messages [13:09:13] Or is it triggered by the 15th mirrormaker (whichever that is)? [13:09:14] but in theory once an instance is down the others should take over [13:09:18] (03CR) 10Aqu: [V: 03+2 C: 03+2] "Deployment train." [analytics/refinery] - 10https://gerrit.wikimedia.org/r/960683 (owner: 10Maven-release-user) [13:09:33] btullis: ah right, we could stop on 1001, start on 1015 [13:09:40] doing it [13:09:49] Yes, exactly. That would be a useful test. [13:10:16] stopped 1010, waiting a bit and then starting on 1015 [13:10:29] this is a weird one :D [13:10:58] It's a good job that we like challenges.
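A rough sketch of the tail-and-watch approach described at 13:03; journalctl is shown here as an assumption (the MirrorMaker log was actually tailed from its log file, whose exact path is not given in the backscroll).

# Start the next instance and watch its unit output for the failure signature (sketch).
sudo systemctl start kafka-mirror-main-eqiad_to_jumbo-eqiad@0.service
sudo journalctl -fu kafka-mirror-main-eqiad_to_jumbo-eqiad@0.service \
  | grep -iE 'RecordTooLargeException|shutting down'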
:-) [13:11:19] started on 1015 [13:11:27] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1015 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [13:11:34] seems stable [13:11:48] ok so the 15th mirror maker instance triggers some weirdness [13:12:42] (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:12:46] !log Deployment weekly train of analytics-refinery (included new refinery-source version) [13:12:48] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:12:59] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [13:13:17] PROBLEM - Check systemd state on kafka-jumbo1001 is CRITICAL: CRITICAL - degraded: The following units failed: kafka-mirror-main-eqiad_to_jumbo-eqiad@0.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:14:53] 10Data-Engineering, 10Data-Platform-SRE: [INCIDENT] kafka-jumbo mirrormaker from main-eqiad crashes associated with RecordTooLargeException errors - https://phabricator.wikimedia.org/T347481 (10elukey) It seems that the 15th mirror maker instance triggers the issue (it is independent which one, after the 14th... [13:14:58] updated the task [13:31:41] https://groups.google.com/g/confluent-platform/c/OdfoZMSewpU?pli=1 [13:31:48] we are not the only ones having this issue [13:31:53] or, that had this issue [13:32:14] https://issues.apache.org/jira/browse/KAFKA-7149 [13:32:19] "We observed that when we have high number of partitions, instances or stream-threads, assignment-data size grows too fast and we start getting below RecordTooLargeException at kafka-broker." [13:32:28] (03PS1) 10Aqu: Update scap deployment in Hadoop test [analytics/refinery/scap] - 10https://gerrit.wikimedia.org/r/961390 [13:34:25] 10Data-Engineering, 10Data-Platform-SRE: [INCIDENT] kafka-jumbo mirrormaker from main-eqiad crashes associated with RecordTooLargeException errors - https://phabricator.wikimedia.org/T347481 (10elukey) ` [2023-09-27 13:01:02,991] ERROR [GroupMetadataManager brokerId=1003] Appending metadata message for group k... 
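Given KAFKA-7149 above, the suspicion is that the consumer-group assignment metadata outgrows the broker's record size limit once the 15th instance joins the group. A quick way to sanity-check the limits involved might look like this; the flags and file path are assumptions (older Kafka versions may require --zookeeper instead of --bootstrap-server), not a confirmed runbook.

# Broker-wide record size cap, if set explicitly in the config file (path assumed).
grep -i 'message.max.bytes' /etc/kafka/server.properties
# Per-topic override on the offsets topic that stores group metadata, if any.
kafka-configs --bootstrap-server localhost:9092 --entity-type topics \
  --entity-name __consumer_offsets --describe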
[13:43:29] 10Data-Platform-SRE, 10Data Engineering and Event Platform Team: Scap deployment on Hadoop test cluster broken - https://phabricator.wikimedia.org/T347491 (10Antoine_Quhen) [13:44:03] (03CR) 10Btullis: [C: 03+1] Update scap deployment in Hadoop test [analytics/refinery/scap] - 10https://gerrit.wikimedia.org/r/961390 (owner: 10Aqu) [13:44:14] !log Deployed refinery using scap, then deployed onto hdfs [13:44:16] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:46:57] 10Data-Platform-SRE: Install kafka-kit binaries on kafka brokers - https://phabricator.wikimedia.org/T346764 (10brouberol) The build instructions are as follows: ` ssh build2001.codfw.wmnet git clone https://gitlab.wikimedia.org/repos/sre/kafka-kit.git cd kafka-kit DIST=bookworm-wikimedia pdebuild ` This will... [13:47:01] 10Data-Platform-SRE, 10Data Engineering and Event Platform Team: Scap deployment on Hadoop test cluster broken - https://phabricator.wikimedia.org/T347491 (10BTullis) Oh, this is also related to the fact that git-fat is not generally present on bullseye machines. https://github.com/wikimedia/operations-puppet/b... [13:47:25] 10Data-Platform-SRE: Find/fix logstash logging for rdf-streaming-updater - https://phabricator.wikimedia.org/T345668 (10bking) Per IRC conversation with @dcausse , it seems I was checking the wrong indices... "ecs*" is where the Kubernetes logging "lives." (See also T336076). Thus, I'm closing out this ticket. [13:47:36] btullis: at this point I think it is a bug in our version of kafka + mirror maker, but they suggest increasing `message.max.bytes` on the brokers, and kafka-main is kinda out of scope [13:47:43] 10Data-Platform-SRE: Find/fix logstash logging for rdf-streaming-updater - https://phabricator.wikimedia.org/T345668 (10bking) 05Open→03Invalid [13:48:02] so maybe we could add a workaround in puppet to avoid deploying mm to all jumbo brokers, as a stopgap [13:49:18] 10Data-Platform-SRE, 10Data Engineering and Event Platform Team: Scap deployment on Hadoop test cluster broken - https://phabricator.wikimedia.org/T347491 (10BTullis) I have manually installed `git-fat` onto an-test-client1002 so that you should be able to complete the deployment. ` btullis@an-test-client1002:~... [13:49:57] elukey: I was coming to the same conclusion myself. I don't really want to be suggesting that we change anything in kafka-main, if we can avoid it. [13:50:46] exactly yes [13:50:55] in the role we could add a simple if [13:51:09] maybe we can avoid the instances that brouberol is going to decom [13:51:15] We're going to be decommissioning kafka-jumbo100[1-6] very shortly, so maybe a systemd mask for the service on one of these nodes would be fine. [13:51:46] puppet will try to restore the previous state though [13:51:57] IIRC the mask is not enough [13:52:20] Yes I meant a mask via puppet, or a disable, something like that. [13:52:27] ah okok! [13:52:38] we could avoid the profile directly in theory [13:54:08] Yep, sounds good. Shall I make a patch? [13:55:02] creating one, 1 min [13:59:19] btullis: https://gerrit.wikimedia.org/r/c/operations/puppet/+/961397 [13:59:53] not proud of it but it is easy enough [14:02:55] btullis: going into a meeting, if you want to merge + rollout please feel free to [14:05:11] elukey: Will do.
[14:07:53] !log deploying kafka-mirror-maker exclusion patch to kafka-jumbo100[1-6] [14:07:56] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:07:59] 10Data-Engineering, 10Data-Platform-SRE, 10SRE Observability: Install a Prometheus connector for Presto, pointed at thanos-query - https://phabricator.wikimedia.org/T347430 (10lmata) This looks super interesting, moving to radar for when we need to help out. [14:12:38] !log re-enabled and ran puppet on the rest of kafka-jumbo to bring the mirror-makers back to where they should be. [14:12:39] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:14:07] !log removing downtime for kafka-jumbo [14:14:08] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:15:46] 10Data-Platform-SRE: Install kafka-kit binaries on kafka brokers - https://phabricator.wikimedia.org/T346764 (10brouberol) I produced a deb specifically for bullseye by running the following, on `build2001.codfw.wmnet`: ` sed -i 's/4.2.1/4.2.1-1~wmf11u1/' debian/control DIST=bullseye-wikimedia pdebuild ` I then... [14:17:37] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10EventStreams, 10Event-Platform: eventgate: eventstreams: services should use common logging schema - https://phabricator.wikimedia.org/T347498 (10gmodena) [14:21:00] 10Data-Platform-SRE: Install kafka-kit binaries on kafka brokers - https://phabricator.wikimedia.org/T346764 (10brouberol) I'm tempted to say that we shouldn't waste time on trying to build for `buster` because: - having `topicmappr` installed on a single machine is enough - only 6 hosts are running on buster `... [14:21:51] 10Data-Platform-SRE: Install kafka-kit binaries on kafka brokers - https://phabricator.wikimedia.org/T346764 (10MoritzMuehlenhoff) Using the bullseye binary on buster isn't an option either, it depends on glibc 2.29, while buster has 2.28. So unless we patch out the use of io.StringWriter for the buster build, m... [14:23:50] 10Data-Platform-SRE: Install kafka-kit binaries on kafka brokers - https://phabricator.wikimedia.org/T346764 (10brouberol) I 100% agree @MoritzMuehlenhoff. This is a nice-to-have, and most of our buster brokers are due for decommissioning anyway. Oh and I just want to get on the record and salute Moritz for the... [14:38:05] I am resolving the incident. [14:40:41] 10Data-Platform-SRE: Install kafka-kit binaries on kafka brokers - https://phabricator.wikimedia.org/T346764 (10brouberol) I rebuilt the bookworm package with `Standards-Version: 4.2.1-1~wmf12u1` ` brouberol@apt1001:~$ grep Standards-Version -r bookworm-amd64/kafka-kit_4.2.1-1.dsc:Standards-Version: 4.2.1-1~wm... [14:45:55] joal: Would you have any idea why event_default and eventlogging show no data on this gobblin dashboard please? https://grafana.wikimedia.org/d/pAQaJwEnk/gobblin?orgId=1&refresh=15m&from=now-24h&to=now&var-gobblin_job_name=event_default&var-kafka_topic=All [14:46:37] I was looking to see if we could see any impact (i.e. delay) on downstream pipeline processing after the kafka incident.
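After the exclusion-patch rollout and puppet run logged at 14:07-14:12, a quick fleet-wide check could confirm the intended end state; the cumin invocations below are offered only as an illustrative sketch, and the query syntax is an assumption.

# Excluded brokers (due for decommissioning) should no longer run the MirrorMaker unit...
sudo cumin 'kafka-jumbo100[1-6].eqiad.wmnet' 'systemctl is-active kafka-mirror-main-eqiad_to_jumbo-eqiad@0.service || true'
# ...while the remaining brokers should report it active again after the puppet run.
sudo cumin 'kafka-jumbo101[0-5].eqiad.wmnet' 'systemctl is-active kafka-mirror-main-eqiad_to_jumbo-eqiad@0.service'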
[14:56:44] !log Deploy latest Airflow DAGs to analytics instance [14:56:45] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:00:49] 10Data-Platform-SRE: Install kafka-kit binaries on kafka brokers - https://phabricator.wikimedia.org/T346764 (10brouberol) ` root@apt1001:/home/brouberol/bookworm-amd64# reprepro ls kafka-kit kafka-kit | 4.2.1-1~wmf11u1 | bullseye-wikimedia | amd64, source kafka-kit | 4.2.1-1~wmf12u1 | bookworm-wikimedia | amd64... [15:01:11] * btullis I published a draft of the Incident doc: https://wikitech.wikimedia.org/wiki/Incidents/2023-09-27_Kafka-jumbo_mirror-makers - Will finish it later. Feel free to add/amend it as you like. [15:03:05] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 2), 10Event-Platform: mediawiki.page_content_change.v1 topic should be partitioned. - https://phabricator.wikimedia.org/T345806 (10gmodena) a:05gmodena→03None [15:03:36] 10Data-Platform-SRE: Install kafka-kit binaries on kafka brokers - https://phabricator.wikimedia.org/T346764 (10brouberol) ` brouberol@sretest1001:~$ sudo apt-get install kafka-kit Reading package lists... Done Building dependency tree... Done Reading state information... Done The following package was automatic... [15:03:50] 10Data-Engineering, 10Data-Platform-SRE: [INCIDENT] kafka-jumbo mirrormaker from main-eqiad crashes associated with RecordTooLargeException errors - https://phabricator.wikimedia.org/T347481 (10BTullis) I have published a draft of the incident doc: https://wikitech.wikimedia.org/wiki/Incidents/2023-09-27_Kafka... [15:26:50] FYI, kafka-kit is now available as a deb package for both bullseye and bookworm (my warmest thanks to moritzm for his invaluable help). Here's a small puppet PR making it available for kafka brokers https://gerrit.wikimedia.org/r/c/operations/puppet/+/961405 [15:27:13] 10Data-Platform-SRE, 10Wikidata-Query-Service: WDQS graph split: load data from dumps into new hosts - https://phabricator.wikimedia.org/T347504 (10bking) [15:27:32] 10Data-Platform-SRE, 10Wikidata-Query-Service: WDQS graph split: load data from dumps into new hosts - https://phabricator.wikimedia.org/T347504 (10bking) [15:34:34] 10Data-Platform-SRE, 10Wikidata-Query-Service: Prepare new WDQS hosts for graph splitting - https://phabricator.wikimedia.org/T347505 (10bking) [15:35:17] 10Data-Platform-SRE, 10Wikidata-Query-Service: Prepare new WDQS hosts for graph splitting - https://phabricator.wikimedia.org/T347505 (10bking) [15:35:20] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service: WDQS graph split: load data from dumps into new hosts - https://phabricator.wikimedia.org/T347504 (10bking) [15:37:42] 10Data-Platform-SRE: Service implementation for wdqs202[3-5].codfw.wmnet - https://phabricator.wikimedia.org/T345475 (10bking) [15:37:42] (SystemdUnitFailed) firing: (4) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:38:31] 10Data-Platform-SRE, 10Wikidata-Query-Service: Prepare new WDQS hosts for graph splitting - https://phabricator.wikimedia.org/T347505 (10bking) [15:38:36] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service: WDQS graph split: load data from dumps into new hosts - https://phabricator.wikimedia.org/T347504 (10bking) [15:38:53] 10Data-Platform-SRE, 10Wikidata-Query-Service: Prepare new WDQS hosts for graph splitting - 
https://phabricator.wikimedia.org/T347505 (10bking) [15:39:00] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service: WDQS graph split: load data from dumps into new hosts - https://phabricator.wikimedia.org/T347504 (10bking) [15:39:30] 10Data-Platform-SRE, 10Wikidata-Query-Service: Prepare new WDQS hosts for graph splitting - https://phabricator.wikimedia.org/T347505 (10bking) [15:41:59] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service: WDQS graph split: load data from dumps into new hosts - https://phabricator.wikimedia.org/T347504 (10bking) [15:47:42] (SystemdUnitFailed) firing: (4) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:47:47] 10Data-Platform-SRE, 10Patch-For-Review: [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (10Stevemunene) [15:49:57] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service: Prepare new WDQS hosts for graph splitting - https://phabricator.wikimedia.org/T347505 (10bking) [16:00:01] 10Data-Platform-SRE, 10Patch-For-Review: [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (10Stevemunene) We were unblocked on the` analytics-wmde` admin group and user and were able to create the admin group after all the right approvals. So we are actively back in progres... [16:33:38] 10Data-Engineering: Generate a list of Superset users affected by changes to IP masking/temp users - https://phabricator.wikimedia.org/T347510 (10OSefu-WMF) [16:38:00] 10Data-Platform-SRE, 10Dumps-Generation, 10cloud-services-team, 10Patch-For-Review: clouddumps100[12] puppet alert: "Puppet performing a change on every puppet run" - https://phabricator.wikimedia.org/T346165 (10BTullis) Although I prepared a patch to fix the wikidatawiki issue, I'm not sure it's necessary... [17:04:47] 10Data-Platform-SRE: Service implementation for wdqs202[3-5].codfw.wmnet - https://phabricator.wikimedia.org/T345475 (10RKemper) [17:05:10] 10Data-Platform-SRE: Service implementation for wdqs202[3-5].codfw.wmnet - https://phabricator.wikimedia.org/T345475 (10RKemper) These hosts are in service as of yesterday. [17:07:30] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service: WDQS graph split: load data from dumps into new hosts - https://phabricator.wikimedia.org/T347504 (10bking) @dcausse couple of questions: - Are we OK to start the data load as soon as these hosts are in production? - Does each host need its data loa... [17:12:34] 10Data-Platform-SRE: Apply partman recipe patch again and see if it affects unrelated reimages - https://phabricator.wikimedia.org/T347434 (10bking) 05Open→03Resolved a:03bking Thanks! Per @Eevans comment in IRC, it looks like his problem persists even after this was removed. So the recipe might be wrong,... [17:12:37] 10Data-Platform-SRE: Write new partman recipe for cloudelastic - https://phabricator.wikimedia.org/T342463 (10bking) [17:43:19] 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install stat1011.eqiad.wmnet - https://phabricator.wikimedia.org/T342454 (10Jclark-ctr) @BTullis did you have updates on Partitioning/Raid: for task? 
[18:01:43] 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install stat1011.eqiad.wmnet - https://phabricator.wikimedia.org/T342454 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host stat1011.eqiad.wmnet with OS bullseye [18:11:56] 10Data-Platform-SRE, 10Dumps-Generation, 10cloud-services-team, 10Patch-For-Review: clouddumps100[12] puppet alert: "Puppet performing a change on every puppet run" - https://phabricator.wikimedia.org/T346165 (10BTullis) I haven't quite worked out precisely where the `/srv/dumps/xmldatadumps/public/other/m... [18:47:35] 10Data-Platform-SRE, 10SRE, 10ops-eqiad: Degraded RAID on dbstore1005 - https://phabricator.wikimedia.org/T347449 (10Jclark-ctr) @BTullis this server is out of warranty i do not have any 1.6tb drives available. i do have 1.9tb we can use if needed [18:57:41] 10Data-Platform-SRE: Troubleshoot mw-page-content-change-enrich and flink-operator - https://phabricator.wikimedia.org/T347521 (10bking) [19:10:49] 10Data-Platform-SRE: Package kafka-kit binaries (topicmappr, metricsfetcher, ...) as a debian-package - https://phabricator.wikimedia.org/T346763 (10brouberol) See https://phabricator.wikimedia.org/T346764 for details about how `kafka-kit` was debian-packaged. [19:11:22] 10Data-Platform-SRE: Package kafka-kit binaries (topicmappr, metricsfetcher, ...) as a debian-package - https://phabricator.wikimedia.org/T346763 (10brouberol) a:03brouberol [19:29:49] 10Data-Engineering, 10MediaWiki-extensions-EventLogging, 10Platform Team Initiatives (New Hook System): Update EventLogging to use the new HookContainer/HookRunner system - https://phabricator.wikimedia.org/T346540 (10Umherirrender) a:03Umherirrender [19:47:58] (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:55:08] 10Data-Platform-SRE: Troubleshoot mw-page-content-change-enrich and flink-operator - https://phabricator.wikimedia.org/T347521 (10bking) Operational steps taken so far: - Staging -- `helmfile -e staging -i destroy` + `helmfile -e staging -i apply` --- **Motivation:** I noticed that the flinkdeployment resource... [21:24:32] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): qlever dblp endpoint for wikidata federated query nomination - https://phabricator.wikimedia.org/T339347 (10Hannah_Bast) @dcausse @Gehel @WolfgangFahl QLever can now also produce `application/sparql-results+xml`.... [21:42:13] (03PS11) 10Milimetric: Create a job to dump XML/SQL MW history files to HDFS [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/938941 (https://phabricator.wikimedia.org/T335862) (owner: 10Aqu) [21:49:39] (03CR) 10CI reject: [V: 04-1] Create a job to dump XML/SQL MW history files to HDFS [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/938941 (https://phabricator.wikimedia.org/T335862) (owner: 10Aqu) [23:02:42] (SystemdUnitFailed) firing: (4) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:32:43] (03CR) 10Jdlrobson: [C: 03+1] "LGTM. Will test once the WikimediaEvents patch is further along." 
[schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/960141 (https://phabricator.wikimedia.org/T346106) (owner: 10Kimberly Sarabia)