[00:00:18] Data-Engineering, Data-Platform-SRE, Observability-Alerting: Reduce IRC/alert noise associated with monitor_refine_ systemd timers from alertmanager - https://phabricator.wikimedia.org/T337052 (BTullis)
[00:01:03] Data-Platform-SRE, Product-Analytics: Allow connections to presto UI port - https://phabricator.wikimedia.org/T331455 (BTullis)
[00:02:04] Data-Platform-SRE, Product-Analytics: Fix presto kerberos support for system users - https://phabricator.wikimedia.org/T292072 (BTullis)
[00:02:57] Data-Engineering, Data-Platform-SRE, Data Engineering and Event Platform Team, Discovery-Search, Event-Platform: Set up multi DC Kafka stretch cluster - https://phabricator.wikimedia.org/T340492 (BTullis)
[00:03:23] Data-Engineering, Data-Platform-SRE: Send a critical alert to data-engineering if produce_canary_events isn't running correctly - https://phabricator.wikimedia.org/T337055 (BTullis)
[00:04:02] Data-Engineering, Data-Platform-SRE, Patch-For-Review: Add checksumming of miniconda installer - https://phabricator.wikimedia.org/T337271 (BTullis)
[00:04:33] Data-Engineering, Data-Platform-SRE: spark3 in yarn master mode exhibits warnings when the HDFS namenodes are in the failed over state - https://phabricator.wikimedia.org/T338137 (BTullis)
[00:05:17] Data-Engineering, Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines, Patch-For-Review: Enable libmamba by default for conda environment solving - https://phabricator.wikimedia.org/T337258 (BTullis)
[00:06:09] Data-Engineering, Data-Platform-SRE, AQS2.0: Finalize the multi-dc configuration of AQS (nodejs) in codfw - https://phabricator.wikimedia.org/T331115 (BTullis)
[00:07:17] Data-Engineering: ***New Tasks Above*** - https://phabricator.wikimedia.org/T328026 (BTullis) Stalled→Invalid No longer required.
[00:08:05] Data-Engineering, Data-Platform-SRE, Data-Persistence-Backup: Evaluate possible solutions to backup Analytics Hadoop's HDFS data - https://phabricator.wikimedia.org/T277015 (BTullis)
[00:09:17] Data-Engineering, Data-Platform-SRE, Product-Analytics, Epic: Reconstruct Hive & Hadoop permissions for shared database - https://phabricator.wikimedia.org/T288983 (BTullis)
[00:10:05] Data-Engineering, Data-Platform-SRE: SLF4J logspam when using hadoop command-line clients - https://phabricator.wikimedia.org/T276240 (BTullis)
[00:10:45] Data-Engineering, Data-Platform-SRE, Data Engineering and Event Platform Team: SPIKE - Will Hadoop 3 container support help us for Airflow deployment pipelines? - https://phabricator.wikimedia.org/T288247 (BTullis)
[00:11:09] Data-Platform-SRE: Create aggregate alarms for Hadoop daemons running on worker nodes - https://phabricator.wikimedia.org/T287027 (BTullis)
[00:11:41] Data-Platform-SRE: Review recurrent Hadoop worker disk saturation events - https://phabricator.wikimedia.org/T265487 (BTullis)
[00:12:27] Data-Engineering: --NEWLY ADDED ABOVE -- - https://phabricator.wikimedia.org/T304608 (BTullis) Stalled→Invalid No longer required.
[00:13:03] Data-Platform-SRE, Product-Analytics: kerberos::systemd_timer should have a smarter default for syslog_identifier - https://phabricator.wikimedia.org/T302533 (BTullis)
[00:13:47] Data-Engineering, Data-Platform-SRE, Data Engineering and Event Platform Team, SRE, and 4 others: Upgrade Kafka to 2.x or 3.x - https://phabricator.wikimedia.org/T300102 (BTullis)
[00:14:55] Data-Platform-SRE, Infrastructure-Foundations, Product-Analytics, SRE, and 2 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (BTullis)
[00:15:47] Data-Engineering, Data-Platform-SRE, Data-Services, cloud-services-team: Some wikibase tables not available in commonswiki_p - https://phabricator.wikimedia.org/T298452 (BTullis)
[00:17:56] Data-Engineering, Data-Platform-SRE, Research-Freezer: [Open question] Improve bot identification at scale - https://phabricator.wikimedia.org/T138207 (BTullis)
[00:18:45] Data-Platform-SRE: Verify if Superset can authenticate to Druid via TLS/Kerberos - https://phabricator.wikimedia.org/T250487 (BTullis)
[00:19:13] Data-Platform-SRE: Try to improve the LDAP integration for Superset user account creation - https://phabricator.wikimedia.org/T297120 (BTullis)
[00:20:01] Data-Engineering, Data-Platform-SRE, Epic: Data Catalog MVP - https://phabricator.wikimedia.org/T299910 (BTullis)
[00:20:35] Data-Platform-SRE: Add authentication and encryption to Druid Analytics clients - https://phabricator.wikimedia.org/T250484 (BTullis)
[00:21:12] Data-Platform-SRE: Defining a better authentication scheme for Druid and Presto - https://phabricator.wikimedia.org/T241189 (BTullis)
[00:22:10] Data-Engineering, Data-Platform-SRE, Research-Freezer, WMF-Legal, User-Elukey: Enable layered data-access and sharing for a new form of collaboration - https://phabricator.wikimedia.org/T245833 (BTullis)
[00:22:39] Data-Engineering, Data-Platform-SRE, superset.wikimedia.org: Superset Timeout Logging - https://phabricator.wikimedia.org/T294772 (BTullis)
[00:23:42] Data-Platform-SRE, superset.wikimedia.org: Superset annotation text overlaps illegibly - https://phabricator.wikimedia.org/T279738 (BTullis)
[00:24:23] Data-Engineering, Data-Platform-SRE, superset.wikimedia.org: Running into errors while adding Hive table to Superset dataset - https://phabricator.wikimedia.org/T284604 (BTullis)
[00:25:04] Data-Platform-SRE, superset.wikimedia.org: Fix the LDAP integration and Superset user account creation. - https://phabricator.wikimedia.org/T298647 (BTullis)
[00:26:07] Data-Platform-SRE, Product-Analytics: Investigate easier methods for WMF staff to access Superset - https://phabricator.wikimedia.org/T258962 (BTullis)
[00:27:37] Data-Engineering, Data-Platform-SRE: 503 on Superset (reproducible) - https://phabricator.wikimedia.org/T322525 (BTullis)
[00:28:34] Data-Platform-SRE, Observability-Metrics, SRE, superset.wikimedia.org: statsd and gunicorn metrics for superset - https://phabricator.wikimedia.org/T293761 (BTullis)
[00:29:24] Data-Platform-SRE: [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (BTullis)
[00:31:25] Data-Engineering, Data-Platform-SRE, Data-Catalog: Re-enable Public Druid metadata ingestion - https://phabricator.wikimedia.org/T311547 (BTullis)
[02:35:05] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:36:43] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:45:41] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:51:43] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:48:49] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:51:43] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:00:57] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:06:43] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:06:57] (SystemdUnitFailed) firing: wmf_auto_restart_airflow-scheduler@analytics_test.service Failed on an-test-client1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:11:43] (SystemdUnitFailed) firing: wmf_auto_restart_airflow-scheduler@analytics_test.service Failed on an-test-client1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:11:43] (SystemdUnitFailed) firing: wmf_auto_restart_airflow-scheduler@analytics_test.service Failed on an-test-client1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:41:46] Data-Engineering, Data-Platform-SRE, SRE, observability, and 2 others: Kafka 2.x Upgrade Plan - https://phabricator.wikimedia.org/T302610 (BTullis)
[18:41:52] Data-Engineering, Data-Platform-SRE, Data Engineering and Event Platform Team, SRE, and 4 others: Upgrade Kafka to 2.x or 3.x - https://phabricator.wikimedia.org/T300102 (BTullis)
[21:11:43] (SystemdUnitFailed) firing: wmf_auto_restart_airflow-scheduler@analytics_test.service Failed on an-test-client1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed