[00:22:51] (SystemdUnitFailed) firing: (8) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:32:51] (SystemdUnitFailed) firing: (8) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:52:51] (SystemdUnitFailed) firing: (8) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:02:51] (SystemdUnitFailed) firing: (8) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:04:18] (03CR) 10Krinkle: [C: 03+2] Add first input delay schema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/907871 (https://phabricator.wikimedia.org/T332012) (owner: 10Lgaulia) [01:04:56] (03Merged) 10jenkins-bot: Add first input delay schema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/907871 (https://phabricator.wikimedia.org/T332012) (owner: 10Lgaulia) [01:16:29] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:17:51] (SystemdUnitFailed) firing: (8) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:37:19] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:37:51] (SystemdUnitFailed) firing: (7) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:45:25] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:47:51] (SystemdUnitFailed) firing: (7) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:06:05] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:07:51] (SystemdUnitFailed) firing: (7) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:15:47] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:17:51] (SystemdUnitFailed) firing: (7) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:05:53] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:07:51] (SystemdUnitFailed) firing: (7) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:15:35] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:17:51] (SystemdUnitFailed) firing: (7) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:36:41] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:37:51] (SystemdUnitFailed) firing: (7) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:46:23] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:47:51] (SystemdUnitFailed) firing: (7) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:21:15] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:22:51] (SystemdUnitFailed) firing: (7) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:30:45] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:32:51] (SystemdUnitFailed) firing: (7) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:35:54] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:37:51] (SystemdUnitFailed) firing: (7) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:45:14] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:47:51] (SystemdUnitFailed) firing: (7) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:51:10] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:52:51] (SystemdUnitFailed) firing: (7) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:00:30] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:02:51] (SystemdUnitFailed) firing: (7) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:20:56] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:22:51] (SystemdUnitFailed) firing: (7) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:30:36] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:32:51] (SystemdUnitFailed) firing: (7) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:51:41] 10Data-Engineering, 10Data-Services, 10cloud-services-team: Drop several views from ptwikisource - https://phabricator.wikimedia.org/T332596 (10Marostegui) [08:06:10] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:07:51] (SystemdUnitFailed) firing: (7) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:15:48] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:17:41] 10Data-Engineering, 10SRE, 10ops-eqiad, 10Patch-For-Review: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333091 (10elukey) @Jclark-ctr progress! I was able to reimage, but the two disks in the flex bay seem in `Firmware state: Unconfigured(good), Spun Up`, so the OS got installed on... 
[08:17:51] (SystemdUnitFailed) firing: (7) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:18:14] has anybody checked --^ [08:18:39] it seems that the timer fails to produce canary events to the logging external instance [08:22:16] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:22:51] (SystemdUnitFailed) firing: (7) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:26:09] ah I see https://phabricator.wikimedia.org/T334510 [08:30:35] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:31:14] !log About to deploy analytics/refinery in production [08:31:15] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:32:51] (SystemdUnitFailed) firing: (7) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:41:18] 10Data-Engineering, 10Event-Platform Value Stream, 10observability: Produce requests to eventgate-logging-external in eqiad occasionally fail. - https://phabricator.wikimedia.org/T334510 (10elukey) I see also a lot of 503s logged by the `eventgate-production-tls-proxy` container (to support the issues that w... 
[09:05:41] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:07:51] (SystemdUnitFailed) firing: (7) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:15:48] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:17:51] (SystemdUnitFailed) firing: (7) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:24:59] !log About to migrate refine webrequest from Oozie to Airflow
[09:25:00] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:06:42] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:07:48] Hi elukey, I've seen your updates on an-worker1132, currently declining/closing the tasks I had created on decommissioning it from the Hadoop cluster. Thanks
[10:07:51] (SystemdUnitFailed) firing: (7) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:09:38] steve_munene: Hi! In theory we'll have to reimage the node again once DCops fixes the issues, and after that everything should be fine.
[10:10:04] do you have any questions about what happened in the task? I can try to explain in more detail what seems to have happened
[10:10:12] (at least what I understood)
[10:15:22] Yes, please do. Was it a delayed update that caused the initial degraded RAID?
[10:16:22] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:17:51] (SystemdUnitFailed) firing: (7) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:21:32] steve_munene: so on every worker we have 2x480GB SSDs that are configured in RAID1 (hardware), so they usually appear to the OS as /dev/sda
[10:21:44] the rest of the disks are HDDs, 4TB each
[10:22:03] due to a lot of weird details about Dell, we expose each of them as a single raid0 device
[10:22:21] those are Virtual Disks in megacli's nomenclature
[10:22:31] and they go from /dev/sdb to /dev/sdm
[10:23:09] now for some reason the host booted with the two SSDs not configured in raid1, and the rest of the HDDs started from sda rather than sdb
[10:23:19] so the OS got installed on one of the 4TB HDDs
[10:23:46] what I asked John in DCops is to configure the RAID1 so that we'll get /dev/sda as we expect again, and after that we'll be able to reimage
[10:31:50] Got it, thanks elukey
[10:35:53] Data-Engineering, DBA, Infrastructure-Foundations, Machine-Learning-Team, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (Marostegui) @ayounsi we are placing new DB hosts in production, can you run the same query you ran to gather the affected DBs just in c...
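The disk layout described in the conversation above (SSD RAID1 expected as /dev/sda, 4TB HDDs as /dev/sdb to /dev/sdm) can be turned into a small pre-reimage sanity check. The following is only a sketch under stated assumptions, not the tooling actually used on these hosts: `BlockDevice`, `check_os_disk`, the 1TB size threshold and the idea that the rotational flag is reported reliably behind the RAID controller are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass

GB = 10**9
TB = 10**12


@dataclass(frozen=True)
class BlockDevice:
    """Hypothetical record for one block device as seen by the OS."""
    name: str         # kernel name without /dev/, e.g. "sda"
    size_bytes: int   # capacity reported by the OS
    rotational: bool  # True for spinning disks; assumed to be reported reliably


def check_os_disk(devices, os_device="sda", max_os_size=1 * TB):
    """Return a list of problems; an empty list means the layout looks as expected."""
    by_name = {dev.name: dev for dev in devices}
    os_dev = by_name.get(os_device)
    if os_dev is None:
        return [f"/dev/{os_device} is missing"]
    problems = []
    if os_dev.rotational:
        problems.append(f"/dev/{os_device} is a rotational disk, expected the SSD mirror")
    if os_dev.size_bytes > max_os_size:
        problems.append(f"/dev/{os_device} is larger than expected for the SSD mirror")
    return problems


# Expected layout: the SSD RAID1 as sda, the 4TB HDDs as sdb..sdm.
healthy = [BlockDevice("sda", 480 * GB, False)] + [
    BlockDevice(f"sd{letter}", 4 * TB, True) for letter in "bcdefghijklm"
]
# Layout described in the chat: SSDs not exposed, HDDs enumerated from sda.
broken = [BlockDevice(f"sd{letter}", 4 * TB, True) for letter in "abcdefghijkl"]

print(check_os_disk(healthy))
print(check_os_disk(broken))
```

On a real host the device list would have to come from something like lsblk or sysfs; the check itself only encodes the expectation stated in the chat, namely that the OS disk should be the small SSD mirror and not one of the 4TB HDDs.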
[11:25:18] (03CR) 10AikoChou: Add event schema for ML classification change on current page state (031 comment) [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/905965 (https://phabricator.wikimedia.org/T331401) (owner: 10AikoChou) [11:51:20] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:52:51] (SystemdUnitFailed) firing: (7) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:00:58] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:02:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1002:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1002:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [12:02:51] (SystemdUnitFailed) firing: (7) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:22:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1002:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1002:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage 
[12:31:22] 10Data-Engineering, 10DBA, 10Infrastructure-Foundations, 10Machine-Learning-Team, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10ayounsi) [12:31:52] 10Data-Engineering, 10DBA, 10Infrastructure-Foundations, 10Machine-Learning-Team, and 8 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10ayounsi) [12:32:06] 10Data-Engineering-Planning, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10ayounsi) [12:34:47] 10Data-Engineering, 10DBA, 10Infrastructure-Foundations, 10Machine-Learning-Team, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10ayounsi) [12:35:06] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:35:40] 10Data-Engineering, 10DBA, 10Infrastructure-Foundations, 10Machine-Learning-Team, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10ayounsi) >>! In T333377#8775126, @Marostegui wrote: > @ayounsi we are placing new DB hosts in production, can you run the same query yo... [12:37:51] (SystemdUnitFailed) firing: (7) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:38:26] 10Data-Engineering, 10DBA, 10Infrastructure-Foundations, 10Machine-Learning-Team, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10Marostegui) Thank you, nothing changes from our DB side! 
[12:45:24] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:47:51] (SystemdUnitFailed) firing: (7) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:58:43] 10Data-Engineering, 10Event-Platform Value Stream, 10EventStreams, 10Patch-For-Review: Include image/file changes in page-links-change - https://phabricator.wikimedia.org/T333497 (10Ottomata) Wow thanks Isaac, very helpful. I'm also inclined to treat these as separate streams then. We could however use a... [13:02:04] 10Data-Engineering, 10Event-Platform Value Stream, 10Machine-Learning-Team: Create new mediawiki.page_links_change stream based on fragment/mediawiki/state/change/page - https://phabricator.wikimedia.org/T331399 (10Ottomata) [13:02:18] 10Data-Engineering, 10Event-Platform Value Stream, 10Machine-Learning-Team: Create new mediawiki.page_links_change stream based on fragment/mediawiki/state/change/page - https://phabricator.wikimedia.org/T331399 (10Ottomata) When we do this, we should also consider all the other kinds of mediawiki 'links'.... [13:02:47] (03CR) 10Awight: "Just wondering, is it necessary to collect this information or could we alternatively log user_id=0 the same as for IP anonymous users?" 
[schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/907960 (https://phabricator.wikimedia.org/T332437) (owner: 10Bartosz Dziewoński) [13:36:01] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:37:51] (SystemdUnitFailed) firing: (8) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:45:41] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:47:51] (SystemdUnitFailed) firing: (8) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:52:51] (SystemdUnitFailed) firing: (9) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:54:55] (03PS3) 10Thiemo Kreuz (WMDE): Update Editing team schemas for IP masking [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/907960 (https://phabricator.wikimedia.org/T332437) (owner: 10Bartosz Dziewoński) [13:57:46] 10Data-Engineering-Planning, 10Data Pipelines, 10Product-Analytics: Add log_search to monthly sqoop list - https://phabricator.wikimedia.org/T332621 (10lbowmaker) [13:59:11] 10Data-Engineering-Planning, 10Data Pipelines: Airflow skein hook shouldn't fail when not managing to gather yarn logs - 
https://phabricator.wikimedia.org/T332215 (lbowmaker)
[14:17:31] (CR) Ottomata: Created development/geoip/network_latency 1.0.0 schema (5 comments) [schemas/event/primary] - https://gerrit.wikimedia.org/r/907508 (https://phabricator.wikimedia.org/T334417) (owner: Jameel Kaisar)
[14:18:09] (CR) Ottomata: Add event schema for ML classification change on current page state (1 comment) [schemas/event/primary] - https://gerrit.wikimedia.org/r/905965 (https://phabricator.wikimedia.org/T331401) (owner: AikoChou)
[14:23:15] (CR) Ottomata: Created development/geoip/network_latency 1.0.0 schema (1 comment) [schemas/event/primary] - https://gerrit.wikimedia.org/r/907508 (https://phabricator.wikimedia.org/T334417) (owner: Jameel Kaisar)
[14:30:38] !log re-ran airflow task compute_pageview_actor_hourly for dag pageview_actor_hourly for 2023-04-12T08->09
[14:30:39] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:39:00] !log cleared airflow task aggregate_pageview_actor_to_pageview_hourly from dag pageview_hourly for 2023-04-12T08->09
[14:39:01] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:50:38] !log cleared airflow task aggregrate_pageview_to_projectview from projectview_hourly dag for 2023-04-12T08->09
[14:50:39] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:51:47] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:52:51] (SystemdUnitFailed) firing: (9) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:01:37] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - 
running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:02:51] (SystemdUnitFailed) firing: (9) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:03:03] 10Data-Engineering, 10SRE, 10ops-eqiad, 10Patch-For-Review: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333091 (10Jclark-ctr) Created raid 1 for 2 ssd @elukey [15:06:18] 10Analytics-Radar, 10Data-Engineering-Icebox, 10Machine-Learning-Team, 10Patch-For-Review: Upgrade ROCm to 4.5 - https://phabricator.wikimedia.org/T295661 (10elukey) First attempt on dse-k8s-worker1001 ended up in some errors, among them: ` The following packages have unmet dependencies: rocm-llvm : Dep... [15:10:35] 10Data-Engineering-Planning, 10Data Pipelines, 10Infrastructure-Foundations, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 11): Also intake Network Error Logging events into the Analytics Data Lake - https://phabricator.wikimedia.org/T304373 (10CDanis) Hi! Just wanted to know if anyone on DE had... 
[15:12:46] steve_munene: o/ so it may be good to exclude an-worker1132 from HDFS at this point, so we can completely reimage it (extra disks too)
[15:13:20] once the node is good we can remove it from the exclude list
[15:15:55] I +1ed the change, if you want we can try it tomorrow
[15:16:17] !log cleared airflow task aggregate_projectview_geographically from dag projectview_geo for 2023-04-12T08->09
[15:16:18] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:29:10] Data-Engineering-Planning, Data Pipelines, Infrastructure-Foundations, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 11): Also intake Network Error Logging events into the Analytics Data Lake - https://phabricator.wikimedia.org/T304373 (Ottomata) No near plans, please proceed, I will review!
[15:40:26] (PS2) Snwachukwu: Update pageview hourly and daily druid tables. [analytics/refinery] - https://gerrit.wikimedia.org/r/906595 (https://phabricator.wikimedia.org/T334224)
[15:42:12] Analytics-Radar, Data-Engineering-Icebox, Machine-Learning-Team, Patch-For-Review: Upgrade ROCm to 4.5 - https://phabricator.wikimedia.org/T295661 (elukey) Had a chat with Moritz, and here some relevant readings: * https://github.com/RadeonOpenCompute/ROCm/issues/1125 * https://github.com/ROCm-D...
[16:01:19] PROBLEM - eventgate-analytics-external validation error rate too high on alert2001 is CRITICAL: 2.037 gt 2 https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos
[16:10:17] Data-Engineering, Event-Platform Value Stream, observability: Produce requests to eventgate-logging-external in eqiad occasionally fail. - https://phabricator.wikimedia.org/T334510 (Ottomata) > maybe it is worth to update all instances with the same code Done, still seeing same errors. 
FWIW, the im...
[16:10:44] Data-Engineering, observability, Event-Platform Value Stream (Sprint 11): Produce requests to eventgate-logging-external in eqiad occasionally fail. - https://phabricator.wikimedia.org/T334510 (Ottomata)
[16:14:07] Ack, thanks elukey that does indeed seem to be the safer option.
[16:23:59] RECOVERY - eventgate-analytics-external validation error rate too high on alert2001 is OK: (C)2 gt (W)1 gt 0.9611 https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos
[16:27:31] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:27:51] (SystemdUnitFailed) firing: (9) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:30:37] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:32:51] (SystemdUnitFailed) firing: (9) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:40:31] Data-Engineering-Planning, SRE-swift-storage, Event-Platform Value Stream (Sprint 11): Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (Ottomata) > Flink doc does suggest that their k8s HA implementation could wor...
[17:22:51] (SystemdUnitFailed) firing: (9) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:25:09] (PS5) Jameel Kaisar: Created development/geoip/network_latency 1.0.0 schema [schemas/event/primary] - https://gerrit.wikimedia.org/r/907508 (https://phabricator.wikimedia.org/T334417)
[17:25:13] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:30:03] (PS6) Jameel Kaisar: Created development/geoip/network_latency 1.0.0 schema [schemas/event/primary] - https://gerrit.wikimedia.org/r/907508 (https://phabricator.wikimedia.org/T334417)
[17:30:07] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:31:12] (CR) Jameel Kaisar: Created development/geoip/network_latency 1.0.0 schema (5 comments) [schemas/event/primary] - https://gerrit.wikimedia.org/r/907508 (https://phabricator.wikimedia.org/T334417) (owner: Jameel Kaisar)
[17:32:51] (SystemdUnitFailed) firing: (9) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:32:59] (CR) Jameel Kaisar: Created development/geoip/network_latency 1.0.0 schema (1 comment) [schemas/event/primary] - https://gerrit.wikimedia.org/r/907508 (https://phabricator.wikimedia.org/T334417) (owner: Jameel Kaisar)
[17:34:40] (CR) Joal: "One nit" [analytics/refinery] - https://gerrit.wikimedia.org/r/906595 (https://phabricator.wikimedia.org/T334224) (owner: Snwachukwu)
[17:41:46] (CR) Ottomata: Created development/geoip/network_latency 1.0.0 schema (1 comment) [schemas/event/primary] - https://gerrit.wikimedia.org/r/907508 (https://phabricator.wikimedia.org/T334417) (owner: Jameel Kaisar)
[17:50:38] (PS7) Jameel Kaisar: Created development/geoip/network_latency 1.0.0 schema [schemas/event/primary] - https://gerrit.wikimedia.org/r/907508 (https://phabricator.wikimedia.org/T334417)
[17:52:51] (SystemdUnitFailed) firing: (9) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:54:33] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:55:41] (CR) Jameel Kaisar: Created development/geoip/network_latency 1.0.0 schema (1 comment) [schemas/event/primary] - https://gerrit.wikimedia.org/r/907508 (https://phabricator.wikimedia.org/T334417) (owner: Jameel Kaisar)
[18:00:45] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:01:31] (CR) Ottomata: Created development/geoip/network_latency 1.0.0 schema (1 comment) [schemas/event/primary] - https://gerrit.wikimedia.org/r/907508 (https://phabricator.wikimedia.org/T334417) (owner: Jameel Kaisar)
[18:02:51] (SystemdUnitFailed) firing: (9) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:50:39] (PS8) Jameel Kaisar: Created development/geoip/network_latency 1.0.0 schema [schemas/event/primary] - https://gerrit.wikimedia.org/r/907508 (https://phabricator.wikimedia.org/T334417)
[18:52:57] (CR) Jameel Kaisar: Created development/geoip/network_latency 1.0.0 schema (1 comment) [schemas/event/primary] - https://gerrit.wikimedia.org/r/907508 (https://phabricator.wikimedia.org/T334417) (owner: Jameel Kaisar)
[19:35:27] (CR) Ottomata: [C: +1] "LGTM then! We can always bikeshed more later since we are using the 'development' directory." [schemas/event/primary] - https://gerrit.wikimedia.org/r/907508 (https://phabricator.wikimedia.org/T334417) (owner: Jameel Kaisar)
[19:35:57] (CR) Ottomata: [C: +1] Created development/geoip/network_latency 1.0.0 schema (1 comment) [schemas/event/primary] - https://gerrit.wikimedia.org/r/907508 (https://phabricator.wikimedia.org/T334417) (owner: Jameel Kaisar)
[19:37:08] (PS9) Jameel Kaisar: Created development/network/probe 1.0.0 schema [schemas/event/primary] - https://gerrit.wikimedia.org/r/907508 (https://phabricator.wikimedia.org/T334417)
[19:40:44] (PS3) Snwachukwu: Update pageview hourly and daily druid tables. [analytics/refinery] - https://gerrit.wikimedia.org/r/906595 (https://phabricator.wikimedia.org/T334224)
[19:57:07] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:57:51] (SystemdUnitFailed) firing: (9) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:00:15] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:02:51] (SystemdUnitFailed) firing: (9) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:09:35] (CR) Bartosz Dziewoński: [C: +1] "Thanks Thiemo!" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/907960 (https://phabricator.wikimedia.org/T332437) (owner: Bartosz Dziewoński)
[20:10:39] (CR) Bartosz Dziewoński: [C: +1] Update Editing team schemas for IP masking (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/907960 (https://phabricator.wikimedia.org/T332437) (owner: Bartosz Dziewoński)
[20:12:49] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:12:51] (SystemdUnitFailed) firing: (9) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:15:59] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:17:51] (SystemdUnitFailed) firing: (9) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:36:47] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:37:51] (SystemdUnitFailed) firing: (9) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:46:27] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:47:51] (SystemdUnitFailed) firing: (9) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:22:51] (SystemdUnitFailed) firing: (9) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:24:40] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:32:51] (SystemdUnitFailed) firing: (9) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:37:51] (SystemdUnitFailed) firing: (9) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:47:34] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:47:51] (SystemdUnitFailed) firing: (9) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:07:51] (SystemdUnitFailed) firing: (9) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:09:10] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:15:28] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:17:51] (SystemdUnitFailed) firing: (9) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:56:46] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:57:51] (SystemdUnitFailed) firing: (9) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:01:30] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:02:51] (SystemdUnitFailed) firing: (9) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:13:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1002:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1002:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[23:38:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1002:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1002:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage