[00:19:20] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:19:22] (SystemdUnitFailed) firing: (13) refinery-sqoop-mediawiki-production-daily.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:30:06] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:34:22] (SystemdUnitFailed) firing: (13) refinery-sqoop-mediawiki-production-daily.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:24:33] (HdfsCapacityRemainingPercent) firing: Alarmingly low free space on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Capacity_Remaining - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=106&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCapacityRemainingPercent [01:40:43] (SystemdUnitFailed) firing: (13) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:29:16] (HdfsCapacityRemainingPercent) firing: Alarmingly low free space on the analytics-hadoop HDFS cluster. 
- https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Capacity_Remaining - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=106&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCapacityRemainingPercent [05:40:44] (SystemdUnitFailed) firing: (13) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:43:34] (DiskSpace) firing: Disk space an-worker1153:9100:/var/lib/hadoop/data/b 5.985% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-worker1153 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [07:48:34] (DiskSpace) resolved: Disk space an-worker1153:9100:/var/lib/hadoop/data/b 5.974% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-worker1153 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [07:53:34] (DiskSpace) firing: Disk space an-worker1153:9100:/var/lib/hadoop/data/b 5.878% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-worker1153 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [07:58:34] (DiskSpace) firing: (5) Disk space an-worker1114:9100:/var/lib/hadoop/data/b 5.769% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [08:03:34] (DiskSpace) firing: (13) Disk space an-worker1114:9100:/var/lib/hadoop/data/b 5.699% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [08:08:34] (DiskSpace) 
firing: (23) Disk space an-worker1112:9100:/var/lib/hadoop/data/f 5.613% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [08:13:35] (DiskSpace) firing: (26) Disk space an-worker1112:9100:/var/lib/hadoop/data/d 5.875% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [08:18:34] (DiskSpace) firing: (26) Disk space an-worker1112:9100:/var/lib/hadoop/data/d 5.801% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [08:23:35] (DiskSpace) firing: (26) Disk space an-worker1112:9100:/var/lib/hadoop/data/d 5.837% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [08:28:35] (DiskSpace) firing: (24) Disk space an-worker1112:9100:/var/lib/hadoop/data/d 5.811% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [08:28:52] 10Data-Engineering, 10Discovery-Search, 10Image-Suggestions: Search dag image_suggestions_weekly failed waiting for analytics_platform_eng.image_suggestions_search_index_delta/snapshot=2024-01-15 - https://phabricator.wikimedia.org/T356030 (10dcausse) [08:30:02] 10Data-Engineering, 10Discovery-Search, 10Image-Suggestions: Search dag image_suggestions_weekly failed waiting for analytics_platform_eng.image_suggestions_search_index_delta/snapshot=2024-01-15 - https://phabricator.wikimedia.org/T356030 (10dcausse) [08:32:10] 10Data-Engineering, 10Discovery-Search, 10Image-Suggestions: Search dag image_suggestions_weekly failed waiting for analytics_platform_eng.image_suggestions_search_index_delta/snapshot=2024-01-15 - https://phabricator.wikimedia.org/T356030 (10dcausse) [08:33:35] (DiskSpace) firing: (24) Disk space an-worker1112:9100:/var/lib/hadoop/data/d 5.815% free - 
https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [08:38:35] (DiskSpace) firing: (24) Disk space an-worker1112:9100:/var/lib/hadoop/data/d 5.811% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [08:43:34] (DiskSpace) firing: (26) Disk space an-worker1112:9100:/var/lib/hadoop/data/d 5.815% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [08:48:34] (DiskSpace) firing: (23) Disk space an-worker1114:9100:/var/lib/hadoop/data/d 5.851% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [08:53:34] (DiskSpace) firing: (25) Disk space an-worker1114:9100:/var/lib/hadoop/data/d 5.857% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [08:58:34] (DiskSpace) firing: (26) Disk space an-worker1114:9100:/var/lib/hadoop/data/d 5.857% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [09:13:35] (DiskSpace) firing: (24) Disk space an-worker1114:9100:/var/lib/hadoop/data/d 5.854% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [09:18:36] (DiskSpace) firing: (25) Disk space an-worker1114:9100:/var/lib/hadoop/data/d 5.854% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [09:28:35] (DiskSpace) firing: (26) Disk space an-worker1114:9100:/var/lib/hadoop/data/d 5.854% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [09:29:33] (HdfsCapacityRemainingPercent) firing: Alarmingly low free space on the analytics-hadoop HDFS cluster. 
- https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Capacity_Remaining - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=106&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCapacityRemainingPercent [09:40:45] (SystemdUnitFailed) firing: (13) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:42:30] 10Data-Engineering, 10Data-Platform-SRE (2024.01.22 - 2024.02.11): Migrate hue.wikimedia.org to bullseye - https://phabricator.wikimedia.org/T349400 (10brouberol) a:03brouberol [09:53:34] (DiskSpace) firing: (12) Disk space an-worker1114:9100:/var/lib/hadoop/data/e 5.953% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [10:07:24] 10Data-Platform-SRE (2024.01.22 - 2024.02.11): Bring an-worker11[57-75] into service - https://phabricator.wikimedia.org/T353776 (10Stevemunene) a:03Stevemunene [10:18:00] 10Data-Platform-SRE (2024.01.22 - 2024.02.11): Upgrade Airflow instances to Bullseye - https://phabricator.wikimedia.org/T335261 (10BTullis) [10:25:13] 10Data-Engineering (Sprint 8): [Iceberg Migration] Implement mechanism for automatic Iceberg data deletion and optimization - https://phabricator.wikimedia.org/T338065 (10Antoine_Quhen) I would like to add rewrite_manifests to the list of maintenance actions: * rewrite_data_files * expire snapshots * **rewrite_m... 
[10:28:34] (DiskSpace) firing: (12) Disk space an-worker1114:9100:/var/lib/hadoop/data/e 5.924% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [10:33:35] (DiskSpace) resolved: (12) Disk space an-worker1114:9100:/var/lib/hadoop/data/e 5.924% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [10:38:52] 10Data-Platform-SRE (2024.01.22 - 2024.02.11): Upgrade Airflow instances to Bullseye - https://phabricator.wikimedia.org/T335261 (10BTullis) I'm proposing to start with an-airflow1007, since it appears not to be used for anything yet. I have checked with the users in the `#wmf-wmde` Slack channel. [10:46:47] !log upgrading an-airflow1007 to bullseye for T335261 [10:46:50] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:46:51] T335261: Upgrade Airflow instances to Bullseye - https://phabricator.wikimedia.org/T335261 [10:47:09] 10Data-Platform-SRE (2024.01.22 - 2024.02.11): Upgrade Airflow instances to Bullseye - https://phabricator.wikimedia.org/T335261 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host an-airflow1007.eqiad.wmnet with OS bullseye [10:58:10] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Wikidata, 10Discovery-Search (Current work): Enable cross federation between experimental WDQS endpoints - https://phabricator.wikimedia.org/T355888 (10Gehel) [10:58:33] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Wikidata, 10Discovery-Search (Current work): Enable cross federation between experimental WDQS endpoints - https://phabricator.wikimedia.org/T355888 (10Gehel) p:05Triage→03High [11:28:09] 10Data-Platform-SRE (2024.01.22 - 2024.02.11): Upgrade Airflow instances to Bullseye - https://phabricator.wikimedia.org/T335261 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host 
an-airflow1007.eqiad.wmnet with OS bullseye completed: - an-airflow1007 (**PASS**) -... [11:44:52] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Infrastructure-Foundations, 10Puppet-Core, 10SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [11:51:41] 10Data-Platform-SRE (2024.01.22 - 2024.02.11): Upgrade Airflow instances to Bullseye - https://phabricator.wikimedia.org/T335261 (10BTullis) [12:20:49] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Infrastructure-Foundations, 10Puppet-Core, 10SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [12:24:30] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Patch-For-Review: Bring an-worker11[57-75] into service - https://phabricator.wikimedia.org/T353776 (10Stevemunene) [12:28:17] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Patch-For-Review: Bring an-worker11[57-75] into service - https://phabricator.wikimedia.org/T353776 (10Stevemunene) [12:51:02] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Patch-For-Review: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617 (10cmooney) @bking I see that cloudelastic1010 seems to be happy on its new IP/hostname? Glad that it seems to have gone well. I am mostly on leave t... 
[13:06:31] !log I'm starting the reimaging process of an-tool1009.eqiad.wmnet, which will cause unavailability of hue.wikimedia.org while it runs - T349400 [13:06:33] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:06:34] T349400: Migrate hue.wikimedia.org to bullseye - https://phabricator.wikimedia.org/T349400 [13:07:30] 10Data-Engineering, 10Data-Platform-SRE (2024.01.22 - 2024.02.11): Migrate hue.wikimedia.org to bullseye - https://phabricator.wikimedia.org/T349400 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brouberol@cumin1002 for host an-tool1009.eqiad.wmnet with OS bullseye [13:23:25] 10Data-Platform-SRE (2024.01.22 - 2024.02.11): Upgrade Airflow instances to Bullseye - https://phabricator.wikimedia.org/T335261 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host an-airflow1006.eqiad.wmnet with OS bullseye [13:29:33] (HdfsCapacityRemainingPercent) firing: Alarmingly low free space on the analytics-hadoop HDFS cluster. 
- https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Capacity_Remaining - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=106&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCapacityRemainingPercent [13:33:34] 10Data-Engineering, 10Discovery-Search, 10Image-Suggestions: Search dag image_suggestions_weekly failed waiting for analytics_platform_eng.image_suggestions_search_index_delta/snapshot=2024-01-15 - https://phabricator.wikimedia.org/T356030 (10dcausse) [13:40:45] (SystemdUnitFailed) firing: (13) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:55:39] 10Data-Engineering (Sprint 7), 10Data-Platform-SRE (2024.01.22 - 2024.02.11): [Iceberg Migration] Define sensor concept and implementation plan - https://phabricator.wikimedia.org/T354695 (10Antoine_Quhen) https://docs.google.com/document/d/1upAje5lMawu4X6seRxI8Lx7YN-oHEzcEcm2fO6E5OH0/edit [14:02:09] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Patch-For-Review: Bring an-worker11[57-75] into service - https://phabricator.wikimedia.org/T353776 (10Stevemunene) [14:10:57] 10Data-Platform-SRE (2024.01.22 - 2024.02.11): Upgrade Airflow instances to Bullseye - https://phabricator.wikimedia.org/T335261 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host an-airflow1006.eqiad.wmnet with OS bullseye completed: - an-airflow1006 (**PASS**) -... [14:31:57] 10Data-Engineering (Sprint 7), 10Data Products, 10Structured-Data-Backlog: [Maintenance] Set up deletion jobs for Structured Data's data pipelines - https://phabricator.wikimedia.org/T347561 (10JAllemandou) Thanks a lot for not forgetting about this ticket @mfossati :) the Data Engineering team is on the roa... 
[14:37:28] 10Data-Engineering, 10Data-Platform-SRE (2024.01.22 - 2024.02.11): Migrate hue.wikimedia.org to bullseye - https://phabricator.wikimedia.org/T349400 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brouberol@cumin1002 for host an-tool1009.eqiad.wmnet with OS bullseye executed with errors:... [14:39:16] 10Data-Platform-SRE (2024.01.22 - 2024.02.11): Upgrade Airflow instances to Bullseye - https://phabricator.wikimedia.org/T335261 (10BTullis) [14:41:21] 10Data-Engineering (Sprint 7), 10Spike: [Data Quality] [SPIKE] Can we migrate the anomaly detection job to DeeQu checks - https://phabricator.wikimedia.org/T354566 (10gmodena) Exploratory code is available at: https://gitlab.wikimedia.org/-/snippets/113 We can migrate the current AD job to the new DQ stack by... [14:41:40] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Infrastructure-Foundations, 10Puppet-Core, 10SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [14:46:15] 10Data-Engineering (Sprint 7): [Iceberg Migration] Migrate session length tables to Iceberg - https://phabricator.wikimedia.org/T352672 (10Antoine_Quhen) a:03Antoine_Quhen [14:54:37] (03PS5) 10TChin: Add iceberg version of interlanguage_navigation table [analytics/refinery] - 10https://gerrit.wikimedia.org/r/986839 (https://phabricator.wikimedia.org/T352671) [14:54:46] (03CR) 10TChin: "Done" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/986839 (https://phabricator.wikimedia.org/T352671) (owner: 10TChin) [15:07:50] 10Data-Engineering, 10Data-Platform-SRE (2024.01.22 - 2024.02.11): Migrate hue.wikimedia.org to bullseye - https://phabricator.wikimedia.org/T349400 (10brouberol) We found out that the `hue` package hadn't been built for bullseye. We're going to revert `an-tool1009` to Buster until we can build a `hue` package... 
[15:17:04] 10Data-Engineering, 10Data-Platform-SRE (2024.01.22 - 2024.02.11): Migrate hue.wikimedia.org to bullseye - https://phabricator.wikimedia.org/T349400 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brouberol@cumin1002 for host an-tool1009.eqiad.wmnet with OS buster [15:58:16] 10Data-Engineering, 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Patch-For-Review: Migrate hue.wikimedia.org to bullseye - https://phabricator.wikimedia.org/T349400 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brouberol@cumin1002 for host an-tool1009.eqiad.wmnet with OS buster c... [16:23:56] 10Data-Platform-SRE: Fix IPv6 service IP ranges for all Kubernetes clusters - https://phabricator.wikimedia.org/T353705 (10joanna_borun) [16:38:37] brouberol: FYI there is an old run of sre.hadoop.roll-restart-masters test on cumin1001 from you that is there pending user input [16:52:34] (DiskSpace) firing: Disk space an-worker1153:9100:/var/lib/hadoop/data/d 5.778% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-worker1153 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [16:56:16] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10SRE, 10decommission-hardware, 10ops-eqiad: decommission druid1006.eqiad.wmnet - https://phabricator.wikimedia.org/T354743 (10VRiley-WMF) a:03VRiley-WMF [16:56:53] 10Data-Platform-SRE (2024.01.22 - 2024.02.11): Decommission druid100[4-6] - https://phabricator.wikimedia.org/T336043 (10VRiley-WMF) [16:56:57] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10SRE, 10decommission-hardware, 10ops-eqiad: decommission druid1006.eqiad.wmnet - https://phabricator.wikimedia.org/T354743 (10VRiley-WMF) 05Open→03Resolved [16:57:22] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10SRE, 10decommission-hardware, 10ops-eqiad: decommission druid1006.eqiad.wmnet - https://phabricator.wikimedia.org/T354743 (10VRiley-WMF) 
This has been removed and decommissioned [16:57:34] (DiskSpace) firing: (2) Disk space an-worker1153:9100:/var/lib/hadoop/data/b 5.503% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-worker1153 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [17:02:34] (DiskSpace) firing: (4) Disk space an-worker1153:9100:/var/lib/hadoop/data/b 5.777% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-worker1153 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [17:04:03] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10SRE, 10decommission-hardware, 10ops-eqiad: decommission druid1005.eqiad.wmnet - https://phabricator.wikimedia.org/T354742 (10VRiley-WMF) a:03VRiley-WMF [17:04:15] (03PS6) 10TChin: Add iceberg version of interlanguage_navigation table [analytics/refinery] - 10https://gerrit.wikimedia.org/r/986839 (https://phabricator.wikimedia.org/T352671) [17:05:05] volans: oh my bad. I’m currently afk (on phone only). If you’re able, feel free to kill at will. If not, I will do it when back at a keyboard. Thanks again! [17:07:14] 10Data-Platform-SRE (2024.01.22 - 2024.02.11): Decommission druid100[4-6] - https://phabricator.wikimedia.org/T336043 (10VRiley-WMF) [17:07:29] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10SRE, 10decommission-hardware, 10ops-eqiad: decommission druid1005.eqiad.wmnet - https://phabricator.wikimedia.org/T354742 (10VRiley-WMF) This server has been removed and decommissioned. 
[17:07:29] brouberol: no hurry, it's not doing anything :) I'll leave it to you at your own convenience [17:07:35] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10SRE, 10decommission-hardware, 10ops-eqiad: decommission druid1005.eqiad.wmnet - https://phabricator.wikimedia.org/T354742 (10VRiley-WMF) 05Open→03Resolved [17:12:34] (DiskSpace) firing: (5) Disk space an-worker1153:9100:/var/lib/hadoop/data/b 5.472% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-worker1153 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [17:17:34] (DiskSpace) firing: (5) Disk space an-worker1153:9100:/var/lib/hadoop/data/b 5.431% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-worker1153 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [17:22:34] (DiskSpace) firing: (10) Disk space an-worker1153:9100:/var/lib/hadoop/data/b 5.404% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-worker1153 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [17:27:34] (DiskSpace) firing: (11) Disk space an-worker1153:9100:/var/lib/hadoop/data/b 5.758% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-worker1153 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [17:29:34] (HdfsCapacityRemainingPercent) firing: Alarmingly low free space on the analytics-hadoop HDFS cluster. 
- https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Capacity_Remaining - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=106&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCapacityRemainingPercent [17:31:45] (03PS7) 10TChin: Add iceberg version of interlanguage_navigation table [analytics/refinery] - 10https://gerrit.wikimedia.org/r/986839 (https://phabricator.wikimedia.org/T352671) [17:33:54] (03CR) 10Joal: [C: 03+1] "LGTM!" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/986839 (https://phabricator.wikimedia.org/T352671) (owner: 10TChin) [17:37:34] (DiskSpace) firing: (12) Disk space an-worker1153:9100:/var/lib/hadoop/data/b 5.758% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-worker1153 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [17:44:22] (SystemdUnitFailed) firing: (13) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:45:03] 10Data-Platform-SRE (2024.01.22 - 2024.02.11): Decommission druid100[4-6] - https://phabricator.wikimedia.org/T336043 (10VRiley-WMF) [17:45:12] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10SRE, 10decommission-hardware, 10ops-eqiad: decommission druid1004.eqiad.wmnet - https://phabricator.wikimedia.org/T354741 (10VRiley-WMF) 05Open→03Resolved a:03VRiley-WMF [17:45:24] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10SRE, 10decommission-hardware, 10ops-eqiad: decommission druid1004.eqiad.wmnet - https://phabricator.wikimedia.org/T354741 (10VRiley-WMF) This server has been removed and decommissioned [17:45:47] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10SRE, 10decommission-hardware, 10ops-eqiad: 
decommission druid1004.eqiad.wmnet - https://phabricator.wikimedia.org/T354741 (10VRiley-WMF) [17:47:34] (DiskSpace) firing: (11) Disk space an-worker1153:9100:/var/lib/hadoop/data/d 5.426% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-worker1153 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [18:07:35] (DiskSpace) firing: (7) Disk space an-worker1113:9100:/var/lib/hadoop/data/i 5.941% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [18:12:34] (DiskSpace) firing: (8) Disk space an-worker1113:9100:/var/lib/hadoop/data/i 5.938% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [18:22:34] (DiskSpace) firing: (9) Disk space an-worker1113:9100:/var/lib/hadoop/data/i 5.945% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [18:42:34] (DiskSpace) firing: (10) Disk space an-worker1113:9100:/var/lib/hadoop/data/i 5.941% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [18:47:34] (DiskSpace) firing: (11) Disk space an-worker1113:9100:/var/lib/hadoop/data/i 5.934% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [18:50:03] volans: done [18:51:41] thx a lot! 
:) [18:57:34] (DiskSpace) firing: (11) Disk space an-worker1113:9100:/var/lib/hadoop/data/i 5.931% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [18:58:29] don't mention it [19:07:44] 10Data-Engineering (Sprint 7), 10Patch-For-Review: [Iceberg Migration] Migrate interlanguage tables to Iceberg - https://phabricator.wikimedia.org/T352671 (10Ahoelzl) [19:12:35] (DiskSpace) firing: (12) Disk space an-worker1113:9100:/var/lib/hadoop/data/i 5.934% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [19:52:35] (DiskSpace) firing: (15) Disk space an-worker1113:9100:/var/lib/hadoop/data/i 5.927% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [19:57:34] (DiskSpace) firing: (16) Disk space an-worker1113:9100:/var/lib/hadoop/data/i 5.927% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [20:12:35] (DiskSpace) firing: (16) Disk space an-worker1113:9100:/var/lib/hadoop/data/i 5.927% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [20:23:12] 10Data-Engineering (Sprint 7), 10Patch-For-Review: [Data Quality] Develop Airflow post processing instrumentation to collect and log configurable data metrics - https://phabricator.wikimedia.org/T349763 (10CodeReviewBot) joal merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_reque... 
[20:37:34] (DiskSpace) firing: (17) Disk space an-worker1120:9100:/var/lib/hadoop/data/g 5.082% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [20:42:34] (DiskSpace) firing: (17) Disk space an-worker1120:9100:/var/lib/hadoop/data/g 4.405% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [20:42:50] (DiskSpace) firing: (17) Disk space an-worker1120:9100:/var/lib/hadoop/data/g 4.405% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [20:47:34] (DiskSpace) firing: (17) Disk space an-worker1120:9100:/var/lib/hadoop/data/g 5.143% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [20:52:35] (DiskSpace) firing: (17) Disk space an-worker1120:9100:/var/lib/hadoop/data/g 4.405% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [21:15:07] 10Data-Engineering: NEW BUG REPORT 12 new wikis missing from the mediawiki_history dataset - https://phabricator.wikimedia.org/T349743 (10CMyrick-WMF) There are 4 more wikis missing from mediawiki_history, which brings the updated list of missing wikis (including those Neil listed above) to: ` zghwiki bjnwiki... [21:22:34] (DiskSpace) firing: (11) Disk space an-worker1132:9100:/var/lib/hadoop/data/j 2.729% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [21:29:34] (HdfsCapacityRemainingPercent) firing: Alarmingly low free space on the analytics-hadoop HDFS cluster. 
- https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Capacity_Remaining - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=106&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCapacityRemainingPercent [21:30:20] 10Data-Engineering, 10Patch-For-Review: Data Quality Issue: Wikitext History Job fail / rerun in Airflow - https://phabricator.wikimedia.org/T342911 (10CodeReviewBot) xcollazo merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/595 Don't retry convert_history_xml_to_parquet. [21:35:26] 10Data-Engineering: Data Quality Issue: Wikitext History Job fail / rerun in Airflow - https://phabricator.wikimedia.org/T342911 (10xcollazo) Since we'd rather have a failed job than bad data, as a stop gap measure, we have merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/... [21:42:34] (DiskSpace) firing: (12) Disk space an-worker1132:9100:/var/lib/hadoop/data/j 2.665% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [21:45:45] (SystemdUnitFailed) firing: (13) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:47:34] (DiskSpace) firing: (13) Disk space an-worker1132:9100:/var/lib/hadoop/data/j 2.658% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [21:52:34] (DiskSpace) firing: (13) Disk space an-worker1132:9100:/var/lib/hadoop/data/j 2.651% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [22:07:35] (DiskSpace) firing: (13) Disk space an-worker1132:9100:/var/lib/hadoop/data/j 2.624% free - 
https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [22:17:35] (DiskSpace) firing: (15) Disk space an-worker1114:9100:/var/lib/hadoop/data/i 5.992% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [22:27:35] (DiskSpace) firing: (15) Disk space an-worker1114:9100:/var/lib/hadoop/data/i 5.983% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [22:32:35] (DiskSpace) firing: (15) Disk space an-worker1114:9100:/var/lib/hadoop/data/i 5.983% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [22:42:34] (DiskSpace) firing: (13) Disk space an-worker1114:9100:/var/lib/hadoop/data/e 5.793% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [22:47:34] (DiskSpace) firing: (14) Disk space an-worker1114:9100:/var/lib/hadoop/data/c 5.995% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [22:52:34] (DiskSpace) firing: (18) Disk space an-worker1114:9100:/var/lib/hadoop/data/d 5.933% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [22:57:35] (DiskSpace) firing: (20) Disk space an-worker1114:9100:/var/lib/hadoop/data/d 5.929% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [23:02:35] (DiskSpace) firing: (21) Disk space an-worker1114:9100:/var/lib/hadoop/data/d 5.892% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [23:12:34] (DiskSpace) firing: (21) Disk space an-worker1114:9100:/var/lib/hadoop/data/d 5.771% free - 
https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [23:17:36] (DiskSpace) firing: (22) Disk space an-worker1114:9100:/var/lib/hadoop/data/d 5.861% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [23:22:35] (DiskSpace) firing: (24) Disk space an-worker1114:9100:/var/lib/hadoop/data/d 5.861% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [23:27:35] (DiskSpace) firing: (24) Disk space an-worker1114:9100:/var/lib/hadoop/data/d 5.861% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [23:32:35] (DiskSpace) firing: (25) Disk space an-worker1114:9100:/var/lib/hadoop/data/d 5.863% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [23:37:34] (DiskSpace) firing: (22) Disk space an-worker1114:9100:/var/lib/hadoop/data/c 5.621% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [23:42:34] (DiskSpace) firing: (18) Disk space an-worker1112:9100:/var/lib/hadoop/data/j 5.957% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [23:47:35] (DiskSpace) firing: (16) Disk space an-worker1112:9100:/var/lib/hadoop/data/j 5.963% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [23:52:34] (DiskSpace) firing: (14) Disk space an-worker1114:9100:/var/lib/hadoop/data/c 5.621% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [23:57:34] (DiskSpace) firing: (13) Disk space an-worker1114:9100:/var/lib/hadoop/data/j 5.93% free - 
https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace