[00:20:35] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:30:13] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:15:45] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:45:25] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:58:59] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:08:35] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:14:21] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:18:23] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:26:00] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:35:39] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:43:21] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:51:03] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:00:37] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:10:15] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:23:41] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:41:01] RECOVERY - Check systemd state on an-worker1132 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:53:22] hi team, Naé is sick today, I'll keep her at home and will be mostly unavailable :S
[09:05:35] <3
[10:53:16] Data-Engineering, Observability-Alerting, User-fgiunchedi: Migrate zookeeper prometheus checks from Icinga to Alertmanager - https://phabricator.wikimedia.org/T309012 (fgiunchedi) Update on this: the SRE bits are done, what's left are zk alerts for 'analytics' Prometheus instance, namely for druid an...
[11:23:07] (PS1) Barakat Ajadi: PaintTiming: Move painttiming to navtiming schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/901169 (https://phabricator.wikimedia.org/T328256)
[12:36:12] Data-Engineering-Planning, Shared-Data-Infrastructure: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (BTullis)
[12:37:06] Data-Engineering-Planning, Shared-Data-Infrastructure: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (BTullis)
[12:37:09] Data-Engineering-Planning, Shared-Data-Infrastructure, Epic: Upgrade the Data Engineering infrastructure to Debian Bullseye - https://phabricator.wikimedia.org/T288804 (BTullis)
[12:38:00] Data-Engineering-Planning, Shared-Data-Infrastructure, Epic: Upgrade the Data Engineering infrastructure to Debian Bullseye - https://phabricator.wikimedia.org/T288804 (BTullis)
[12:39:55] Data-Engineering-Planning, Shared-Data-Infrastructure: Refresh hadoop coordinators an-coord100[1-2] with an-coord[3-4] - https://phabricator.wikimedia.org/T332572 (BTullis)
[12:40:44] Data-Engineering-Planning, Shared-Data-Infrastructure, Epic: Upgrade the Data Engineering infrastructure to Debian Bullseye - https://phabricator.wikimedia.org/T288804 (BTullis)
[12:41:55] Data-Engineering-Planning, Shared-Data-Infrastructure: Upgrade hadoop master to bullseye - https://phabricator.wikimedia.org/T332573 (BTullis)
[12:42:19] Data-Engineering-Planning, Shared-Data-Infrastructure, Epic: Upgrade the Data Engineering infrastructure to Debian Bullseye - https://phabricator.wikimedia.org/T288804 (BTullis)
[12:56:56] As part of the bullseye upgrade for the analytics cluster, we will lose python3.7. https://gerrit.wikimedia.org/r/c/operations/puppet/+/901196 as I'm not aware of a requirement to forward-port it.
[12:57:34] Data-Engineering-Planning, Shared-Data-Infrastructure: Upgrade hadoop standby master to bullseye - https://phabricator.wikimedia.org/T332578 (BTullis)
[12:58:16] Data-Engineering-Planning, Shared-Data-Infrastructure, Epic: Upgrade the Data Engineering infrastructure to Debian Bullseye - https://phabricator.wikimedia.org/T288804 (BTullis)
[13:02:58] Data-Engineering-Planning, Shared-Data-Infrastructure: Upgrade an-launcher1002 to bullseye - https://phabricator.wikimedia.org/T332580 (BTullis)
[13:03:21] Data-Engineering-Planning, Shared-Data-Infrastructure: Upgrade an-launcher1002 to bullseye - https://phabricator.wikimedia.org/T332580 (BTullis)
[13:06:47] Data-Engineering-Planning, Shared-Data-Infrastructure, Epic: Upgrade the Data Engineering infrastructure to Debian Bullseye - https://phabricator.wikimedia.org/T288804 (BTullis)
[13:28:20] Data-Engineering-Planning, Shared-Data-Infrastructure: Upgrade an-test-druid1001 to bullseye - https://phabricator.wikimedia.org/T332584 (BTullis)
[13:30:58] Data-Engineering-Planning, Shared-Data-Infrastructure, Epic: Upgrade the Data Engineering infrastructure to Debian Bullseye - https://phabricator.wikimedia.org/T288804 (BTullis)
[13:35:13] Data-Engineering-Planning, Shared-Data-Infrastructure: Upgrade the druid-public cluster to bullseye - https://phabricator.wikimedia.org/T332589 (BTullis)
[13:38:05] Data-Engineering-Planning, Shared-Data-Infrastructure, Epic: Upgrade the Data Engineering infrastructure to Debian Bullseye - https://phabricator.wikimedia.org/T288804 (BTullis)
[14:43:53] Data-Engineering, Observability-Alerting, Patch-For-Review, User-fgiunchedi: Migrate Kafka prometheus alerts from Icinga to Alertmanager - https://phabricator.wikimedia.org/T309010 (fgiunchedi) Open→Resolved a:fgiunchedi The Prometheus Kafka alerts have been migrated from Puppet / Ici...
[14:48:24] Data-Engineering-Planning, Shared-Data-Infrastructure: Upgrade the druid-analytics cluster to bullseye - https://phabricator.wikimedia.org/T332604 (BTullis)
[14:54:21] Data-Engineering-Planning, Patch-For-Review, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10): Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (BTullis) We have excluded spark2 and python 3.7 from bullseye builds. The `hadoop-yarn-nodemanager` service is...
[15:22:34] Data-Engineering-Planning, Event-Platform Value Stream, SRE-OnFire, serviceops: Incident: 2022-12-09 api appserver worker starvation - https://phabricator.wikimedia.org/T324994 (Joe) Removing the sustainability tag as it doesn't seem like there is any related actionable here. @Clement_Goubert if...
[15:25:31] Data-Engineering-Planning, Event-Platform Value Stream, SRE-OnFire, SRE-Sprint-Week-Sustainability-March2023, and 2 others: Uneven CPU throttling of eventgate-analytics under load - https://phabricator.wikimedia.org/T325068 (Volans)
[15:31:21] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10): Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (xcollazo) > Also the anacoda-wmf package isn't available in bullseye As per https://wikitech.wikimedia.org/wiki/Data_Engineering/Syste...
[16:13:49] Data-Engineering-Planning, Data Pipelines, Discovery-Search (Current work): Migrate popularity_score.py from airflow 1 to airflow 2 - https://phabricator.wikimedia.org/T329877 (EBernhardson) needs https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/304 to properly pass t...
[17:26:24] Data-Engineering-Planning, Data Pipelines (Sprint 11): Spark Streaming Dumps POC: Backfill metadata table - https://phabricator.wikimedia.org/T323642 (JArguello-WMF)
[17:40:35] Data-Engineering, Product-Analytics: Add log_search to monthly sqoop list - https://phabricator.wikimedia.org/T332621 (nettrom_WMF)
[17:43:23] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:47:13] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:03:04] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:09:42] Data-Engineering, Data Pipelines (sprint 10): Differential privacy airflow-dags merge request - https://phabricator.wikimedia.org/T330234 (JArguello-WMF) Open→Resolved
[18:15:34] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:22:02] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:40:28] RECOVERY - Check systemd state on an-worker1132 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:03:58] Data-Engineering-Planning, Data Pipelines: Review Superset permissions and assign roles as appropriate - https://phabricator.wikimedia.org/T328457 (SNowick_WMF) Hi @BTullis can you please add @JTannerWMF (jtanner) to the `sql_lab` role so that she can access queries and dashboards? Thank you.