[00:02:34] (DiskSpace) firing: (12) Disk space an-worker1144:9100:/var/lib/hadoop/data/l 5.603% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[00:03:44] PROBLEM - Check systemd state on an-airflow1007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-node-textfile-prometheus-check-certificate-expiry.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:04:14] PROBLEM - Check systemd state on an-airflow1006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-node-textfile-prometheus-check-certificate-expiry.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:04:22] (SystemdUnitFailed) firing: (15) prometheus-node-textfile-prometheus-check-certificate-expiry.service Failed on an-airflow1006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:22:34] (DiskSpace) firing: (9) Disk space an-worker1153:9100:/var/lib/hadoop/data/c 5.849% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-worker1153 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[00:27:34] (DiskSpace) firing: (10) Disk space an-worker1153:9100:/var/lib/hadoop/data/c 5.815% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-worker1153 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[00:47:35] (DiskSpace) firing: (11) Disk space an-worker1135:9100:/var/lib/hadoop/data/b 5.835% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[01:17:35] (DiskSpace) firing: (12) Disk space an-worker1112:9100:/var/lib/hadoop/data/j 5.786% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[01:19:22] (SystemdUnitFailed) firing: (15) prometheus-node-textfile-prometheus-check-certificate-expiry.service Failed on an-airflow1006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:22:34] (DiskSpace) firing: (14) Disk space an-worker1107:9100:/var/lib/hadoop/data/j 5.718% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[01:27:34] (DiskSpace) firing: (19) Disk space an-worker1107:9100:/var/lib/hadoop/data/j 5.58% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[01:29:34] (HdfsCapacityRemainingPercent) firing: Alarmingly low free space on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Capacity_Remaining - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=106&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCapacityRemainingPercent
[01:32:34] (DiskSpace) firing: (30) Disk space an-worker1107:9100:/var/lib/hadoop/data/e 5.823% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[01:37:34] (DiskSpace) firing: (34) Disk space an-worker1107:9100:/var/lib/hadoop/data/e 5.656% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[01:42:34] (DiskSpace) firing: (37) Disk space an-worker1107:9100:/var/lib/hadoop/data/e 5.487% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[01:52:35] (DiskSpace) firing: (38) Disk space an-worker1107:9100:/var/lib/hadoop/data/e 5.173% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[01:57:34] (DiskSpace) firing: (37) Disk space an-worker1107:9100:/var/lib/hadoop/data/e 5.046% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[02:02:34] (DiskSpace) firing: (38) Disk space an-worker1107:9100:/var/lib/hadoop/data/e 4.9% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[02:07:34] (DiskSpace) firing: (37) Disk space an-worker1107:9100:/var/lib/hadoop/data/e 4.788% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[02:12:36] (DiskSpace) resolved: (35) Disk space an-worker1107:9100:/var/lib/hadoop/data/e 4.788% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[05:20:46] (SystemdUnitFailed) firing: (14) prometheus-node-textfile-prometheus-check-certificate-expiry.service Failed on an-airflow1006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:29:35] (HdfsCapacityRemainingPercent) firing: Alarmingly low free space on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Capacity_Remaining - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=106&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCapacityRemainingPercent
[06:01:48] (PS1) TChin: Use zstd compression for aqs_hourly [analytics/refinery] - https://gerrit.wikimedia.org/r/993478 (https://phabricator.wikimedia.org/T352669)
[06:39:17] (HdfsCapacityRemainingPercent) resolved: Alarmingly low free space on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Capacity_Remaining - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=106&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCapacityRemainingPercent
[08:07:27] (CR) Joal: [C: +1] "Thanks Thomas :)" [analytics/refinery] - https://gerrit.wikimedia.org/r/993478 (https://phabricator.wikimedia.org/T352669) (owner: TChin)
[08:59:59] !log reimaging an-tool1008, causing unavailability of the yarn.wikimedia.org UI for the duration of the op - T349399
[09:00:02] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:00:02] T349399: Migrate yarn.wikimedia.org to bullseye - https://phabricator.wikimedia.org/T349399
[09:00:33] Data-Engineering, Data-Platform-SRE (2024.01.22 - 2024.02.11), Patch-For-Review: Migrate hue.wikimedia.org to bullseye - https://phabricator.wikimedia.org/T349400 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brouberol@cumin1002 for host an-tool1008.eqiad.wmnet with OS bull...
[09:20:46] (SystemdUnitFailed) firing: (14) prometheus-node-textfile-prometheus-check-certificate-expiry.service Failed on an-airflow1006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:30:19] Data-Engineering, Data-Platform-SRE (2024.01.22 - 2024.02.11), Patch-For-Review: Migrate hue.wikimedia.org to bullseye - https://phabricator.wikimedia.org/T349400 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brouberol@cumin1002 for host an-tool1008.eqiad.wmnet with OS bullseye...
[09:30:52] Data-Engineering, Data-Platform-SRE (2024.01.22 - 2024.02.11): Migrate yarn.wikimedia.org to bullseye - https://phabricator.wikimedia.org/T349399 (brouberol) Open→Resolved
[09:30:54] Data-Platform-SRE, Epic: Upgrade the Data Engineering infrastructure to Debian Bullseye - https://phabricator.wikimedia.org/T288804 (brouberol)
[09:31:52] !log yarn.wikimedia.org is back
[09:31:53] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:34:20] Data-Engineering, MediaWiki-extensions-EventLogging, Metrics Platform Backlog, Data Products (Data Products Sprint 08), Patch-For-Review: Add and export MetricsClient#isStreamInSample() - https://phabricator.wikimedia.org/T352966 (CodeReviewBot) phuedx merged https://gitlab.wikimedia.org/repo...
[09:35:10] Data-Engineering, MediaWiki-extensions-EventLogging, Metrics Platform Backlog, Data Products (Data Products Sprint 08), Patch-For-Review: Add and export MetricsClient#isStreamInSample() - https://phabricator.wikimedia.org/T352966 (phuedx)
[09:35:30] Data-Engineering, MediaWiki-extensions-EventLogging, Metrics Platform Backlog, Data Products (Data Products Sprint 08), Patch-For-Review: Add and export MetricsClient#isStreamInSample() - https://phabricator.wikimedia.org/T352966 (phuedx)
[09:49:28] Data-Engineering (Sprint 7), Discovery-Search, Image-Suggestions: Search dag image_suggestions_weekly failed waiting for analytics_platform_eng.image_suggestions_search_index_delta/snapshot=2024-01-15 - https://phabricator.wikimedia.org/T356030 (Antoine_Quhen)
[09:53:58] Data-Platform-SRE: Write a cookbook to check the age of all Java processes associated with the Hadoop clusters - https://phabricator.wikimedia.org/T355886 (Gehel) p:Triage→High
[09:56:47] Data-Platform-SRE, Data Pipelines: Evaluate the PRESTO_EXPAND_DATA feature flag in superset - https://phabricator.wikimedia.org/T340144 (Gehel) p:Triage→High
[09:57:25] Data-Platform-SRE (2024.01.22 - 2024.02.11), Data Pipelines: Evaluate the PRESTO_EXPAND_DATA feature flag in superset - https://phabricator.wikimedia.org/T340144 (Gehel)
[09:59:39] !log starting a scap deployment of analytics airflow dags
[09:59:40] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:09:34] RECOVERY - Check systemd state on an-airflow1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:09:42] RECOVERY - Check systemd state on an-airflow1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:10:46] (SystemdUnitFailed) firing: (14) prometheus-node-textfile-prometheus-check-certificate-expiry.service Failed on an-airflow1006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:14:11] Data-Engineering, Data-Platform-SRE (2024.01.22 - 2024.02.11), Patch-For-Review: Migrate hue.wikimedia.org to bullseye - https://phabricator.wikimedia.org/T349400 (brouberol) Hue has proven to be quite tricky to build for bullseye. @MoritzMuehlenhoff suggested that as https://phabricator.wikimedia.or...
[10:14:22] (SystemdUnitFailed) firing: (14) prometheus-node-textfile-prometheus-check-certificate-expiry.service Failed on an-airflow1006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:16:10] Data-Platform-SRE (2024.01.22 - 2024.02.11), Patch-For-Review: Upgrade Airflow instances to Bullseye - https://phabricator.wikimedia.org/T335261 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host an-airflow1005.eqiad.wmnet with OS bullseye
[10:17:40] !log upgrading an-airflow1005 (search) to bullseye for T335261
[10:17:44] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:17:44] T335261: Upgrade Airflow instances to Bullseye - https://phabricator.wikimedia.org/T335261
[10:29:04] Data-Platform-SRE, Epic: Upgrade the Data Engineering infrastructure to Debian Bullseye - https://phabricator.wikimedia.org/T288804 (brouberol)
[10:34:52] Data-Platform-SRE (2024.01.22 - 2024.02.11), Infrastructure-Foundations, Puppet-Core, SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[10:35:08] Data-Platform-SRE (2024.01.22 - 2024.02.11), Patch-For-Review: Wmfdata should connect to Presto using the analytics-presto CNAME - https://phabricator.wikimedia.org/T345482 (BTullis) That patch to `wmfdata-python` has been [[https://github.com/wikimedia/wmfdata-python/commit/205b726bea22ff0328a67a88291a0...
[10:36:35] Data-Platform-SRE (2024.01.22 - 2024.02.11), Patch-For-Review: Wmfdata should connect to Presto using the analytics-presto CNAME - https://phabricator.wikimedia.org/T345482 (BTullis) p:Medium→High
[10:37:53] Data-Platform-SRE (2024.01.22 - 2024.02.11), Patch-For-Review: Bring an-coord100[3-4] into service - https://phabricator.wikimedia.org/T336045 (BTullis)
[10:37:57] Data-Platform-SRE (2024.01.22 - 2024.02.11), Patch-For-Review: Wmfdata should connect to Presto using the analytics-presto CNAME - https://phabricator.wikimedia.org/T345482 (BTullis)
[10:56:21] Data-Platform-SRE (2024.01.22 - 2024.02.11), Patch-For-Review: Upgrade Airflow instances to Bullseye - https://phabricator.wikimedia.org/T335261 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host an-airflow1005.eqiad.wmnet with OS bullseye completed: - an-airf...
[11:09:07] (Abandoned) Btullis: Add the scap targets for the new hadoop coordinators [analytics/refinery/scap] - https://gerrit.wikimedia.org/r/980396 (https://phabricator.wikimedia.org/T336045) (owner: Btullis)
[11:11:57] Data-Engineering (Sprint 7), Discovery-Search, Image-Suggestions: Search dag image_suggestions_weekly failed waiting for analytics_platform_eng.image_suggestions_search_index_delta/snapshot=2024-01-15 - https://phabricator.wikimedia.org/T356030 (Antoine_Quhen) I've added the missing partitions in the...
[11:12:10] Data-Engineering (Sprint 7), Discovery-Search, Image-Suggestions: Search dag image_suggestions_weekly failed waiting for analytics_platform_eng.image_suggestions_search_index_delta/snapshot=2024-01-15 - https://phabricator.wikimedia.org/T356030 (Antoine_Quhen)
[11:12:21] Data-Engineering (Sprint 7), Discovery-Search, Image-Suggestions: Search dag image_suggestions_weekly failed waiting for analytics_platform_eng.image_suggestions_search_index_delta/snapshot=2024-01-15 - https://phabricator.wikimedia.org/T356030 (Antoine_Quhen)
[11:17:51] Data-Platform-SRE (2024.01.22 - 2024.02.11), Infrastructure-Foundations, Puppet-Core, SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[11:27:39] Data-Platform-SRE (2024.01.22 - 2024.02.11), Patch-For-Review: Bring an-worker11[57-75] into service - https://phabricator.wikimedia.org/T353776 (Stevemunene)
[11:29:51] Data-Platform-SRE (2024.01.22 - 2024.02.11), Patch-For-Review: Bring an-worker11[57-75] into service - https://phabricator.wikimedia.org/T353776 (Stevemunene) The `sre.hadoop.init-hadoop-workers` fails in creating new partitions. running the cookbook for `an-worker1157` fails with the details below ` Cr...
[12:19:37] Data-Engineering (Sprint 7): [Data Quality] Provide documentation for Data Quality Metrics on Wikitech - https://phabricator.wikimedia.org/T355624 (lbowmaker) Open→Resolved
[12:19:39] Data-Engineering (Sprint 7): [Event Platform] mw-page-content-change-enrich: increase producer max.request.size - https://phabricator.wikimedia.org/T355426 (lbowmaker) Open→Resolved
[12:19:41] Data-Engineering (Sprint 7): [Refine System] Define a concept and an approach for refactoring the Refine system - https://phabricator.wikimedia.org/T354696 (lbowmaker) Open→Resolved
[12:19:43] Data-Engineering (Sprint 7), Data-Platform-SRE (2024.01.22 - 2024.02.11): [Iceberg Migration] Define sensor concept and implementation plan - https://phabricator.wikimedia.org/T354695 (lbowmaker) Open→Resolved
[12:19:45] Data-Engineering (Sprint 7): [Data Quality][Webrequest] Log severity level of alerts generated by refinery - https://phabricator.wikimedia.org/T354568 (lbowmaker) Open→Resolved
[12:19:47] Data-Engineering (Sprint 7), Spike: [Data Quality] [SPIKE] Can we migrate the anomaly detection job to DeeQu checks - https://phabricator.wikimedia.org/T354566 (lbowmaker) Open→Resolved
[12:19:49] Data-Engineering, Epic: [Data Quality] SDS3.3 - Logging, Monitoring and Alerting Improvements for Data Quality Incidents - https://phabricator.wikimedia.org/T345912 (lbowmaker)
[12:19:51] Data-Engineering (Sprint 7): [Dataset Config Store] [SPIKE] Investigate existing backend solutions - https://phabricator.wikimedia.org/T354558 (lbowmaker) Open→Resolved
[12:19:53] Data-Engineering, Epic: Dataset Config Store - https://phabricator.wikimedia.org/T354557 (lbowmaker)
[12:19:55] Data-Engineering (Sprint 7): [Data Quality] Finalize Data Quality Metrics Schema - https://phabricator.wikimedia.org/T352683 (lbowmaker) Open→Resolved
[12:19:57] Data-Engineering (Sprint 7), Patch-For-Review: [Data Quality] Metrics Alerting - https://phabricator.wikimedia.org/T352685 (lbowmaker) Open→Resolved
[12:19:59] Data-Engineering (Sprint 7), Patch-For-Review: [Data Quality] Develop Airflow post processing instrumentation to collect and log configurable data metrics - https://phabricator.wikimedia.org/T349763 (lbowmaker)
[12:20:04] Data-Engineering (Sprint 7), Patch-For-Review: [Data Quality] Develop Airflow post processing instrumentation to collect and log configurable data metrics - https://phabricator.wikimedia.org/T349763 (lbowmaker)
[12:20:14] Data-Engineering (Sprint 7), Patch-For-Review: [Data Quality] Move MetricsExporter to refinery-spark - https://phabricator.wikimedia.org/T352688 (lbowmaker) Open→Resolved
[12:20:18] Data-Engineering (Sprint 7), Patch-For-Review: [Data Quality] Develop Airflow post processing instrumentation to collect and log configurable data metrics - https://phabricator.wikimedia.org/T349763 (lbowmaker)
[12:20:22] Data-Engineering, Epic: [Iceberg Migration] Apache Iceberg Migration - https://phabricator.wikimedia.org/T333013 (lbowmaker)
[12:20:26] Data-Platform-SRE, Epic: [Epic] define a strategy around alerting for Data Platform SRE and implement it - https://phabricator.wikimedia.org/T345698 (lbowmaker)
[12:20:30] Data-Engineering (Sprint 7), Patch-For-Review: [Iceberg Migration] Migrate interlanguage tables to Iceberg - https://phabricator.wikimedia.org/T352671 (lbowmaker) Open→Resolved
[12:20:34] Data-Engineering (Sprint 7), Data Pipelines, Discovery-Search, Java-Scala-Standardization, Patch-For-Review: [Maintenance] We should have a top level maven parent pom based on wikimedia-discovery-discovery-parent-pom, - https://phabricator.wikimedia.org/T309097 (lbowmaker) Open→Resolved
[12:20:38] Data-Engineering (Sprint 7), Data-Platform-SRE (2024.01.01 - 2024.01.21), Observability-Metrics, Patch-For-Review: Configure Airflow to send metrics to Prometheus - https://phabricator.wikimedia.org/T343232 (lbowmaker) Open→Resolved
[12:20:53] Data-Engineering (Sprint 8), Patch-For-Review: [Data Quality] Develop Airflow post processing instrumentation to collect and log configurable data metrics - https://phabricator.wikimedia.org/T349763 (lbowmaker)
[12:21:03] Data-Engineering (Sprint 8): [Iceberg Migration] Migrate browser_general tables to Iceberg - https://phabricator.wikimedia.org/T352670 (lbowmaker)
[12:22:22] Data-Platform-SRE (2024.01.22 - 2024.02.11), Patch-For-Review: Bring an-worker11[57-75] into service - https://phabricator.wikimedia.org/T353776 (BTullis) It looks like the operating system can only see one disk. ` btullis@an-worker1157:~$ lsblk NAME MAJ:MIN RM SIZE RO TYPE MOU...
[12:22:41] Data-Engineering (Sprint 8): [Maintenance] Safeguard VarnishKafka to HAProxy analytics transition - https://phabricator.wikimedia.org/T354694 (lbowmaker)
[12:22:52] Data-Engineering (Sprint 8): [Data Quality] Define concept for Alerting in coordination with SRE - https://phabricator.wikimedia.org/T351093 (lbowmaker)
[12:23:10] Data-Engineering (Sprint 8), Patch-For-Review: [Iceberg Migration] Migrate aqs hourly tables to Iceberg - https://phabricator.wikimedia.org/T352669 (lbowmaker)
[12:23:30] Data-Engineering (Sprint 8): [Maintenance] Migrate ReportUpdater browser queries to Airflow - https://phabricator.wikimedia.org/T354552 (lbowmaker)
[12:23:35] Data-Engineering (Sprint 8): [Iceberg Migration] Migrate session length tables to Iceberg - https://phabricator.wikimedia.org/T352672 (lbowmaker)
[12:23:40] Data-Engineering (Sprint 8), Data-Platform-SRE, SRE Observability: [Data Platform] Install a Prometheus connector for Presto, pointed at thanos-query - https://phabricator.wikimedia.org/T347430 (lbowmaker)
[12:23:46] Data-Engineering (Sprint 8), Data-Platform-SRE, Event-Platform: [Event Platform] Define Flink k8s operator SLO - https://phabricator.wikimedia.org/T345914 (lbowmaker)
[12:23:48] Data-Engineering (Sprint 8), Data Products, Structured-Data-Backlog: [Maintenance] Set up deletion jobs for Structured Data's data pipelines - https://phabricator.wikimedia.org/T347561 (lbowmaker)
[12:23:51] Data-Engineering (Sprint 8), Discovery-Search, Image-Suggestions: Search dag image_suggestions_weekly failed waiting for analytics_platform_eng.image_suggestions_search_index_delta/snapshot=2024-01-15 - https://phabricator.wikimedia.org/T356030 (lbowmaker)
[12:23:53] Data-Engineering (Sprint 8): [Data Quality] Implement basic data quality metrics for MW history - https://phabricator.wikimedia.org/T354692 (lbowmaker)
[12:23:55] Data-Engineering (Sprint 8), Event-Platform: [Event Platform] eventutilites-python: improve consistency guarantees of async process functions - https://phabricator.wikimedia.org/T347282 (lbowmaker)
[12:23:57] Data-Engineering (Sprint 8): [Iceberg Migration] Migrate pageview tables to Iceberg - https://phabricator.wikimedia.org/T347690 (lbowmaker)
[12:24:11] Data-Engineering (Sprint 8): [Dataset Config Store] - Proof of Concept - https://phabricator.wikimedia.org/T355542 (lbowmaker)
[12:24:29] Data-Engineering (Sprint 8): [Dataset Config Store] - Proof of Concept - https://phabricator.wikimedia.org/T355542 (lbowmaker)
[12:24:37] Data-Engineering (Sprint 8), Data-Platform-SRE, Patch-For-Review: [Data Platform] Test Alluxio as cache layer for Presto - https://phabricator.wikimedia.org/T266641 (lbowmaker)
[12:24:48] Data-Engineering (Sprint 8), Machine-Learning-Team, Wikimedia Enterprise, Epic, Event-Platform: [Event Platform] Implement PoC Event-Driven Data Pipeline for Revert Risk Model Scores using Event Platform Capabilities - https://phabricator.wikimedia.org/T338792 (lbowmaker)
[12:25:12] Data-Engineering (Sprint 8): [Iceberg Migration] Implement mechanism for automatic Iceberg data deletion and optimization - https://phabricator.wikimedia.org/T338065 (lbowmaker)
[12:25:43] Data-Engineering (Sprint 8): [Maintenance] Delete sanitized events removed from sanitization list - https://phabricator.wikimedia.org/T347586 (lbowmaker)
[12:25:53] Data-Engineering (Sprint 8): [Maintenance] Delete sanitized events removed from sanitization list - https://phabricator.wikimedia.org/T347586 (lbowmaker)
[12:26:19] Data-Engineering: [Maintenance] Add a deletion job for `hdfs_usage` data - https://phabricator.wikimedia.org/T348774 (lbowmaker)
[12:26:38] Data-Engineering (Sprint 8): [Maintenance] Add a deletion job for `hdfs_usage` data - https://phabricator.wikimedia.org/T348774 (lbowmaker)
[12:27:35] Data-Engineering (Sprint 8): Data Quality Issue: Wikitext History Job fail / rerun in Airflow - https://phabricator.wikimedia.org/T342911 (lbowmaker)
[12:27:39] Data-Engineering (Sprint 8): Data Quality Issue: Wikitext History Job fail / rerun in Airflow - https://phabricator.wikimedia.org/T342911 (lbowmaker)
[12:30:54] Data-Engineering, Event-Platform: [Event Platform] eventutilites-python: improve consistency guarantees of async process functions - https://phabricator.wikimedia.org/T347282 (lbowmaker)
[12:33:07] Data-Platform-SRE (2024.01.22 - 2024.02.11), Patch-For-Review: Bring an-worker11[57-75] into service - https://phabricator.wikimedia.org/T353776 (BTullis) @Stevemunene - Here is a command to run. It looks like this happened during the last batch of workers as well. https://phabricator.wikimedia.org/T3437...
[12:33:27] Data-Platform-SRE (2024.01.22 - 2024.02.11), Patch-For-Review: Bring an-worker11[57-75] into service - https://phabricator.wikimedia.org/T353776 (BTullis) I suppose we could add it to the `sre.hadoop.init-hadoop-workers` cookbook, if it doesn't find 12 data drives. What do you think?
[12:37:32] Data-Engineering (Sprint 8): NEW BUG REPORT 12 new wikis missing from the mediawiki_history dataset - https://phabricator.wikimedia.org/T349743 (lbowmaker)
[12:56:13] Data-Platform-SRE (2024.01.22 - 2024.02.11), Patch-For-Review: Bring an-worker11[57-75] into service - https://phabricator.wikimedia.org/T353776 (Stevemunene) Thanks @BTullis , We can add it to the cookbook for future reference. Did some further reading on the RAID Configuration Input Options used from...
[13:17:15] Data-Platform-SRE (2024.01.22 - 2024.02.11), Data Pipelines, Patch-For-Review: Evaluate the PRESTO_EXPAND_DATA feature flag in superset - https://phabricator.wikimedia.org/T340144 (BTullis) I have just enabled this for superset-next and announced it to our users. ` btullis@an-tool1005:~$ sudo run-pup...
[13:21:18] Data-Engineering (Sprint 8): [Dataset Config Store] - Proof of Concept - https://phabricator.wikimedia.org/T355542 (lbowmaker)
[14:15:47] (SystemdUnitFailed) firing: (12) refinery-sqoop-mediawiki-production-daily.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:34:21] (PS1) Joal: Update sqoop list adding new wikis [analytics/refinery] - https://gerrit.wikimedia.org/r/994178 (https://phabricator.wikimedia.org/T349743)
[14:35:18] (CR) Gmodena: [C: +1] Update sqoop list adding new wikis [analytics/refinery] - https://gerrit.wikimedia.org/r/994178 (https://phabricator.wikimedia.org/T349743) (owner: Joal)
[14:36:15] (CR) TChin: Update sqoop list adding new wikis (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/994178 (https://phabricator.wikimedia.org/T349743) (owner: Joal)
[14:37:26] (PS2) Joal: Update sqoop list adding new wikis [analytics/refinery] - https://gerrit.wikimedia.org/r/994178 (https://phabricator.wikimedia.org/T349743)
[14:38:14] (CR) Joal: Update sqoop list adding new wikis (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/994178 (https://phabricator.wikimedia.org/T349743) (owner: Joal)
[14:39:58] (CR) TChin: [C: +1] Update sqoop list adding new wikis [analytics/refinery] - https://gerrit.wikimedia.org/r/994178 (https://phabricator.wikimedia.org/T349743) (owner: Joal)
[14:56:27] Data-Engineering, Data-Persistence-Backup, Infrastructure-Foundations, bacula, netbox: Convert Netbox data (PostgresQL) longterm storage backups (bacula) into full backups rather than incrementals - https://phabricator.wikimedia.org/T316655 (Volans) Adding #data-engineering as the change will...
[15:09:28] (CR) Mforns: Add query to load MediaWiki snapshot to Cassandra AQS config table (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/989558 (https://phabricator.wikimedia.org/T352948) (owner: Mforns)
[15:30:46] Data-Platform-SRE (2024.01.22 - 2024.02.11), Data Pipelines: Evaluate the PRESTO_EXPAND_DATA feature flag in superset - https://phabricator.wikimedia.org/T340144 (BTullis) a:brouberol Reassigning this ticket to @brouberol as he has identified a bug in the Superset codebase relating to this display of...
[15:35:39] Data-Engineering, Data-Platform-SRE: [Iceberg Migration] P.O.C. on Iceberg sensor using Postgres table to keep status of updates - https://phabricator.wikimedia.org/T340466 (BTullis) Adding #data-platform-sre - We can make the underlying database and give you a single user with full permissions any time...
[15:38:22] Data-Engineering (Sprint 8), Patch-For-Review: NEW BUG REPORT 12 new wikis missing from the mediawiki_history dataset - https://phabricator.wikimedia.org/T349743 (lbowmaker)
[16:09:51] Data-Platform-SRE (2024.01.22 - 2024.02.11), Patch-For-Review: Bring an-worker11[57-75] into service - https://phabricator.wikimedia.org/T353776 (Stevemunene) Adding to this, had to run the cookbook `sre.hadoop.init-hadoop-workers` to install megacli first on all the hosts then pass the megacli command t...
[16:10:47] Data-Platform-SRE (2024.01.22 - 2024.02.11), Patch-For-Review: Bring an-worker11[57-75] into service - https://phabricator.wikimedia.org/T353776 (Stevemunene)
[16:13:17] (EventgateValidationErrors) firing: ...
[16:13:17] eventgate-analytics-external stream eventlogging_EditAttemptStep validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors
[16:14:48] Data-Platform-SRE: Decommission an-worker10[78-95] & an-worker1116 - https://phabricator.wikimedia.org/T353784 (Gehel) Let's not decommission those servers right now, we are low disk space on the hadoop cluster at the moment.
[16:17:33] Data-Engineering: [Iceberg Migration] Implement mechanism for automatic Iceberg data deletion and optimization - https://phabricator.wikimedia.org/T338065 (lbowmaker)
[16:21:07] Data-Engineering, Data Pipelines: Refactor and migrate navigationtiming to Airflow - https://phabricator.wikimedia.org/T356192 (lbowmaker)
[16:21:29] Data-Engineering (Sprint 8), Data Pipelines: Refactor and migrate navigationtiming to Airflow - https://phabricator.wikimedia.org/T356192 (lbowmaker)
[16:28:17] (EventgateValidationErrors) resolved: ...
[16:28:17] eventgate-analytics-external stream eventlogging_EditAttemptStep validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors
[16:44:21] Data-Platform-SRE (2024.01.22 - 2024.02.11): Fix "requests triggering circuit breakers" Elastic alert - https://phabricator.wikimedia.org/T355795 (Gehel) a:bking→None
[16:48:09] Data-Engineering, Data-Platform-SRE: [Iceberg Migration] P.O.C. on Iceberg sensor using Postgres table to keep status of updates - https://phabricator.wikimedia.org/T340466 (BTullis) Would you like a separate database for the `analytics_test_hadoop` cluster, or will one database be enough for both clusters?
[16:48:25] Data-Engineering, Data-Platform-SRE: [Iceberg Migration] P.O.C. on Iceberg sensor using Postgres table to keep status of updates - https://phabricator.wikimedia.org/T340466 (Gehel) p:Triage→High
[16:48:47] Data-Engineering, Data-Platform-SRE (2024.01.22 - 2024.02.11): [Iceberg Migration] P.O.C. on Iceberg sensor using Postgres table to keep status of updates - https://phabricator.wikimedia.org/T340466 (Gehel)
[16:50:18] Data-Platform-SRE (2024.01.22 - 2024.02.11), Discovery-Search (Current work): Cirrus-streaming-updater test: validate relforge indices are correctly updated - https://phabricator.wikimedia.org/T350186 (EBernhardson) We followed this data over time and it seemed to stay in line. We've now progressed from...
[16:54:02] Data-Platform-SRE (2024.01.22 - 2024.02.11): ProbeDown - https://phabricator.wikimedia.org/T355272 (RKemper) Open→Resolved a:RKemper [17:07:55] (CR) Mforns: Add query to load MediaWiki snapshot to Cassandra AQS config table (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/989558 (https://phabricator.wikimedia.org/T352948) (owner: Mforns) [17:12:52] Data-Platform-SRE (2024.01.22 - 2024.02.11), Data Pipelines: Evaluate the PRESTO_EXPAND_DATA feature flag in superset - https://phabricator.wikimedia.org/T340144 (brouberol) We have sent a patch [[ https://github.com/apache/superset/pull/26892 | upstream ]] for this bug. In the meantime, we can pull in t... [17:51:03] (CR) Eevans: [C: +1] Add query to load MediaWiki snapshot to Cassandra AQS config table (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/989558 (https://phabricator.wikimedia.org/T352948) (owner: Mforns) [18:03:49] (PS1) Btullis: Fix an issue with displaying nested columns from presto [analytics/superset/deploy] - https://gerrit.wikimedia.org/r/994213 (https://phabricator.wikimedia.org/T340144) [18:15:47] (SystemdUnitFailed) firing: (12) refinery-sqoop-mediawiki-production-daily.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:40:39] Data-Engineering (Sprint 8): [Data Quality] Define concept for Alerting in coordination with SRE - https://phabricator.wikimedia.org/T351093 (Ahoelzl) SRE presented a work document for Alert Review: https://docs.google.com/document/d/1PQKabMx9qoAKQS6qlHJDs2z2B_Bum_KqLYRaZ1pzXGc/edit [18:46:52] !log deployed latest DAG changes to analytics Airflow instance [18:46:53] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:48:44] !log ran the following commands to create a
production test dump folder: [18:48:44] kerberos-run-command hdfs hdfs dfs -mkdir /wmf/data/archive/content_dump_test [18:48:44] kerberos-run-command hdfs hdfs dfs -chown analytics /wmf/data/archive/content_dump_test [18:48:44] kerberos-run-command hdfs hdfs dfs -chgrp analytics-privatedata-users /wmf/data/archive/content_dump_test [18:48:44] Bug: T346278 [18:48:45] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:48:46] T346278: Implement an Airflow job that runs and publishes the XML dumps - https://phabricator.wikimedia.org/T346278 [19:13:07] Data-Engineering (Sprint 8): [Maintenance] Safeguard VarnishKafka to HAProxy analytics transition - https://phabricator.wikimedia.org/T354694 (Ahoelzl) [19:13:24] Data-Engineering (Sprint 8): [Data Quality] Define concept for Alerting in coordination with SRE - https://phabricator.wikimedia.org/T351093 (Ahoelzl) [19:13:36] Data-Engineering (Sprint 8): [Data Quality] Define concept for Alerting in coordination with SRE - https://phabricator.wikimedia.org/T351093 (Ahoelzl) [19:14:03] Data-Engineering (Sprint 8): [Iceberg Migration] Migrate browser_general tables to Iceberg - https://phabricator.wikimedia.org/T352670 (Ahoelzl) [19:14:15] Data-Engineering (Sprint 8), Patch-For-Review: [Iceberg Migration] Migrate aqs hourly tables to Iceberg - https://phabricator.wikimedia.org/T352669 (Ahoelzl) [19:14:59] Data-Engineering (Sprint 8), Patch-For-Review: [Data Quality] Develop Airflow post processing instrumentation to collect and log configurable data metrics - https://phabricator.wikimedia.org/T349763 (Ahoelzl) [19:56:06] Data-Engineering: Add `event.app_donor_experience` fields to event sanitization allowlist - https://phabricator.wikimedia.org/T356214 (SNowick_WMF) [20:02:33] Data-Platform-SRE (2024.01.22 - 2024.02.11), Patch-For-Review: Migrate Search Platform-owned hosts to Puppet 7 - https://phabricator.wikimedia.org/T354959 (MoritzMuehlenhoff)
[20:03:06] Data-Platform-SRE (2024.01.22 - 2024.02.11), Infrastructure-Foundations, Puppet-Core, SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff) [20:24:09] Data-Platform-SRE (2024.01.22 - 2024.02.11), Patch-For-Review: Wmfdata should connect to Presto using the analytics-presto CNAME - https://phabricator.wikimedia.org/T345482 (nshahquinn-wmf) @BTullis I started working on the release process but realized we should remove the Urllib3 version pin now. I've p... [20:53:39] Data-Engineering: Add `event.app_donor_experience` fields to event sanitization allowlist - https://phabricator.wikimedia.org/T356214 (lbowmaker) Please submit a patch with your change here. You will have to specify how you want each field to be treated. This will be quicker than if we have to schedule the w... [21:07:12] Data-Platform-SRE (2024.01.22 - 2024.02.11), Patch-For-Review: Wmfdata should connect to Presto using the analytics-presto CNAME - https://phabricator.wikimedia.org/T345482 (xcollazo) >>! In T345482#9499926, @nshahquinn-wmf wrote: > @BTullis I started working on the release process but realized we should... [21:38:00] Data-Platform-SRE (2024.01.22 - 2024.02.11), Patch-For-Review: Change TLS/load balancer configuration for cloudelastic - https://phabricator.wikimedia.org/T355720 (taavi) https://etherpad.wikimedia.org/p/cloudelastic-T355617 proposes changing the current traffic ingress method from direct access to the L...
[22:15:47] (SystemdUnitFailed) firing: (12) refinery-sqoop-mediawiki-production-daily.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:47:11] Data-Platform-SRE (2024.01.22 - 2024.02.11), observability: Change data platform-related IRC channels to improve communication - https://phabricator.wikimedia.org/T352783 (Ottomata) Copypasting comment from Alerts Review doc: `#wikimedia-search` and `#wikimedia-analytics` are (at least historically) als... [23:52:50] Analytics-Radar, Data-Engineering, Data Products, Metrics Platform Backlog: mw.user.generateRandomSessionId should return a UUID - https://phabricator.wikimedia.org/T266813 (Ottomata)