[00:14:37] Data-Engineering: Improve pageview automated traffic detection heuristics - https://phabricator.wikimedia.org/T280565 (Mayakp.wiki) Another issue discovered recently T355608 which could benefit from improving automated bot detection.
[00:31:13] Data-Engineering (Sprint 7), Patch-For-Review: [Iceberg Migration] Migrate browser_general tables to Iceberg - https://phabricator.wikimedia.org/T352670 (CodeReviewBot) ebysans merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/576 Update browser_general dag to gene...
[01:50:38] (SystemdUnitFailed) firing: (12) refinery-sqoop-mediawiki-production-daily.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:40:15] (HdfsCapacityRemainingPercent) firing: Alarmingly low free space on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Capacity_Remaining - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=106&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCapacityRemainingPercent
[03:00:16] (HdfsCapacityRemainingPercent) resolved: Alarmingly low free space on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Capacity_Remaining - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=106&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCapacityRemainingPercent
[03:10:15] (HdfsCapacityRemainingPercent) firing: Alarmingly low free space on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Capacity_Remaining - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=106&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCapacityRemainingPercent
[03:30:15] (HdfsCapacityRemainingPercent) resolved: Alarmingly low free space on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Capacity_Remaining - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=106&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCapacityRemainingPercent
[03:55:50] (DiskSpace) firing: Disk space stat1005:9100:/ 2.067% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=stat1005 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[04:04:15] (HdfsCapacityRemainingPercent) firing: Alarmingly low free space on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Capacity_Remaining - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=106&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCapacityRemainingPercent
[04:09:15] (HdfsCapacityRemainingPercent) resolved: Alarmingly low free space on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Capacity_Remaining - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=106&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCapacityRemainingPercent
[05:04:20] (SystemdUnitFailed) firing: (14) refinery-sqoop-mediawiki-production-daily.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:05:41] PROBLEM - Check systemd state on clouddb1015 is CRITICAL: CRITICAL - degraded: The following units failed: check-private-data.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:06:23] PROBLEM - Check systemd state on clouddb1019 is CRITICAL: CRITICAL - degraded: The following units failed: check-private-data.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:13:33] PROBLEM - Check systemd state on clouddb1021 is CRITICAL: CRITICAL - degraded: The following units failed: check-private-data.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:30:15] (HdfsCapacityRemainingPercent) firing: Alarmingly low free space on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Capacity_Remaining - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=106&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCapacityRemainingPercent
[07:55:51] (DiskSpace) firing: Disk space stat1005:9100:/ 2.05% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=stat1005 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[08:41:27] Data-Platform-SRE (2024.01.22 - 2024.02.11), Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): Create 3 microsites for wdqs full graph, main graph, & scholarly articles - https://phabricator.wikimedia.org/T354658 (Gehel) Open→Resolved
[08:41:33] Data-Platform-SRE (2024.01.22 - 2024.02.11), Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): Expose 3 new dedicated WDQS endpoints - https://phabricator.wikimedia.org/T351650 (Gehel)
[08:41:43] Data-Platform-SRE (2024.01.22 - 2024.02.11), Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): Expose 3 new dedicated WDQS endpoints - https://phabricator.wikimedia.org/T351650 (Gehel) Open→Resolved
[08:50:15] (HdfsCapacityRemainingPercent) resolved: Alarmingly low free space on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Capacity_Remaining - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=106&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCapacityRemainingPercent
[09:02:15] (HdfsCapacityRemainingPercent) firing: Alarmingly low free space on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Capacity_Remaining - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=106&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCapacityRemainingPercent
[09:05:38] (SystemdUnitFailed) firing: (14) refinery-sqoop-mediawiki-production-daily.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:30:27] (CR) Phuedx: [C: +1] Remove trvwikisource from scoop list [analytics/refinery] - https://gerrit.wikimedia.org/r/992944 (owner: Aqu)
[09:35:53] Data-Platform-SRE (2024.01.22 - 2024.02.11), decommission-hardware: decommission druid1006.eqiad.wmnet - https://phabricator.wikimedia.org/T354743 (Stevemunene)
[09:36:17] Data-Platform-SRE (2024.01.22 - 2024.02.11), decommission-hardware: decommission druid1006.eqiad.wmnet - https://phabricator.wikimedia.org/T354743 (Stevemunene) a:Stevemunene→None
[09:36:59] Data-Platform-SRE (2024.01.22 - 2024.02.11), decommission-hardware: decommission druid1005.eqiad.wmnet - https://phabricator.wikimedia.org/T354742 (Stevemunene)
[09:37:38] Data-Platform-SRE (2024.01.22 - 2024.02.11), decommission-hardware: decommission druid1004.eqiad.wmnet - https://phabricator.wikimedia.org/T354741 (Stevemunene)
[09:37:56] Data-Platform-SRE (2024.01.22 - 2024.02.11), decommission-hardware: decommission druid1004.eqiad.wmnet - https://phabricator.wikimedia.org/T354741 (Stevemunene) a:Stevemunene→None
[09:38:40] Data-Platform-SRE (2024.01.22 - 2024.02.11), decommission-hardware: decommission druid1005.eqiad.wmnet - https://phabricator.wikimedia.org/T354742 (Stevemunene) a:Stevemunene→None
[09:39:08] Data-Platform-SRE (2024.01.22 - 2024.02.11): Decommission druid100[4-6] - https://phabricator.wikimedia.org/T336043 (Stevemunene)
[09:40:33] Data-Platform-SRE (2024.01.22 - 2024.02.11): Decommission druid100[4-6] - https://phabricator.wikimedia.org/T336043 (Stevemunene) All SRE steps have been completed and the hosts have been decommissioned and handed over to dc ops for the final step.
[09:52:15] (HdfsCapacityRemainingPercent) resolved: Alarmingly low free space on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Capacity_Remaining - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=106&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCapacityRemainingPercent
[09:55:22] Data-Platform-SRE (2024.01.22 - 2024.02.11): Check log rotation settings on airflow instances - https://phabricator.wikimedia.org/T339015 (Stevemunene) Current airflow logs are managed by a systemd timer job that runs everyday at 0300HRS UTC and deletes any logs older than 90 days. However, this does not del...
[10:04:15] (HdfsCapacityRemainingPercent) firing: Alarmingly low free space on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Capacity_Remaining - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=106&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCapacityRemainingPercent
[10:14:15] (HdfsCapacityRemainingPercent) resolved: Alarmingly low free space on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Capacity_Remaining - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=106&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCapacityRemainingPercent
[10:21:15] (HdfsCapacityRemainingPercent) firing: Alarmingly low free space on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Capacity_Remaining - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=106&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCapacityRemainingPercent
[10:28:17] stevemunene: about T336043, could you add a link to the decommission tickets for DC-Ops?
[10:28:17] T336043: Decommission druid100[4-6] - https://phabricator.wikimedia.org/T336043
[10:29:01] Oh sorry, it's already on the sub tasks
[10:30:08] stevemunene: and to make sure, have you updated https://docs.google.com/spreadsheets/d/1Obj5ozGQYl7Zei0MBLELVD8eDGqqsF_t9T3ZbrOsmZg/edit#gid=0 as well?
[10:30:39] I think that those subtasks need reassigning to j.clark-ctr and the ops-eqiad tag adding, otherwise dc-ops won't see them.
[10:31:34] Data-Platform-SRE (2024.01.22 - 2024.02.11): Decommission druid100[4-6] - https://phabricator.wikimedia.org/T336043 (Gehel) Open→Resolved
[10:36:57] Data-Engineering, Wikidata, Wikidata-Termbox, serviceops, and 2 others: Migrate Termbox SSR from Node 16 to 18 - https://phabricator.wikimedia.org/T355685 (akosiaris) >>! In T355685#9484621, @Lucas_Werkmeister_WMDE wrote: >>>! In T355685#9484091, @akosiaris wrote: >> My high level suggestion woul...
[10:44:20] (SystemdUnitFailed) firing: (15) refinery-sqoop-mediawiki-production-daily.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:45:38] (SystemdUnitFailed) firing: (15) refinery-sqoop-mediawiki-production-daily.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:50:35] (DiskSpace) resolved: Disk space stat1005:9100:/ 1.999% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=stat1005 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[10:53:23] Data-Engineering, Wikidata, Wikidata-Termbox, serviceops, and 2 others: Migrate Termbox SSR from Node 16 to 18 - https://phabricator.wikimedia.org/T355685 (Lucas_Werkmeister_WMDE) >>! In T355685#9490204, @akosiaris wrote: >> Would it be possible to have just one helm release, but have Test Wikida...
[11:46:22] Data-Platform-SRE (2024.01.22 - 2024.02.11), decommission-hardware, ops-eqiad: decommission druid1006.eqiad.wmnet - https://phabricator.wikimedia.org/T354743 (BTullis)
[11:46:24] Data-Platform-SRE (2024.01.22 - 2024.02.11), decommission-hardware, ops-eqiad: decommission druid1005.eqiad.wmnet - https://phabricator.wikimedia.org/T354742 (BTullis)
[11:46:28] Data-Platform-SRE (2024.01.22 - 2024.02.11), decommission-hardware, ops-eqiad: decommission druid1004.eqiad.wmnet - https://phabricator.wikimedia.org/T354741 (BTullis)
[11:48:30] Data-Platform-SRE (2024.01.22 - 2024.02.11): Decommission druid100[4-6] - https://phabricator.wikimedia.org/T336043 (BTullis) Thanks for the decommissions @Stevemunene - I just added the #ops-eqiad tag to the subtasks to help make sure that they are seen by the right team.
[11:49:35] (DiskSpace) firing: Disk space stat1005:9100:/ 2.394% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=stat1005 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[12:56:15] (HdfsCapacityRemainingPercent) resolved: Alarmingly low free space on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Capacity_Remaining - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=106&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCapacityRemainingPercent
[13:24:15] (HdfsCapacityRemainingPercent) firing: Alarmingly low free space on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Capacity_Remaining - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=106&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCapacityRemainingPercent
[13:45:45] Data-Engineering, Wikidata, Wikidata-Termbox, serviceops, and 2 others: Migrate Termbox SSR from Node 16 to 18 - https://phabricator.wikimedia.org/T355685 (akosiaris) >>! In T355685#9490230, @Lucas_Werkmeister_WMDE wrote: >>>! In T355685#9490204, @akosiaris wrote: >>> Would it be possible to have...
[14:20:48] (CR) Joal: "One last nit, then good to go" [analytics/refinery] - https://gerrit.wikimedia.org/r/986839 (https://phabricator.wikimedia.org/T352671) (owner: TChin)
[14:49:20] (SystemdUnitFailed) firing: (14) refinery-sqoop-mediawiki-production-daily.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:02:32] Data-Engineering, Wikidata, Wikidata-Termbox, serviceops, and 2 others: Migrate Termbox SSR from Node 16 to 18 - https://phabricator.wikimedia.org/T355685 (Lucas_Werkmeister_WMDE) Thanks a lot – I’ve added some of that information at https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service#D...
[15:25:35] Data-Engineering, Wikidata, Wikidata-Termbox, serviceops, and 2 others: Migrate Termbox SSR from Node 16 to 18 - https://phabricator.wikimedia.org/T355685 (akosiaris) >>! In T355685#9490871, @Lucas_Werkmeister_WMDE wrote: > Thanks a lot – I’ve added some of that information at https://wikitech.wi...
[15:37:00] RECOVERY - Check systemd state on clouddb1015 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:38:01] I have reached out to mnz who is making heavy use of /tmp on stat1005, to see if he can move these files beneath /srv.
[15:39:20] (SystemdUnitFailed) firing: (14) refinery-sqoop-mediawiki-production-daily.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:45:16] Data-Engineering, Wikidata, Wikidata-Termbox, serviceops, and 2 others: Migrate Termbox SSR from Node 16 to 18 - https://phabricator.wikimedia.org/T355685 (Lucas_Werkmeister_WMDE) >>! In T355685#9490969, @akosiaris wrote: > Definitely different task. I am also not at all sure right now that the t...
[15:49:17] Data-Engineering, Wikidata, Wikidata-Termbox, serviceops, and 2 others: Migrate Termbox SSR from Node 16 to 18 - https://phabricator.wikimedia.org/T355685 (akosiaris) >>! In T355685#9491033, @Lucas_Werkmeister_WMDE wrote: >>>! In T355685#9490969, @akosiaris wrote: >> Definitely different task. I...
[15:49:49] (DiskSpace) firing: Disk space stat1005:9100:/ 2.368% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=stat1005 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[15:51:04] RECOVERY - Check systemd state on clouddb1019 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:51:24] RECOVERY - Check systemd state on clouddb1021 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:54:20] (SystemdUnitFailed) firing: (14) refinery-sqoop-mediawiki-production-daily.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:55:46] Analytics, AQS2.0, Tech-Docs-Team, Data Products (Epics Timeline), and 3 others: AQS 2.0 documentation - https://phabricator.wikimedia.org/T288664 (apaskulin)
[16:01:33] Data-Engineering, Product-Analytics, Wmfdata-Python: Support querying a range of hourly data partitions - https://phabricator.wikimedia.org/T294654 (mpopov) @nettrom_WMF Thank you for sharing that code! I recently used it in T353666 and it was very helpful! Just wanted to show my appreciation.
[16:33:44] Data-Platform-SRE (2024.01.22 - 2024.02.11), Patch-For-Review: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by bking@cumin2002 for hosts: `cloudelastic1010.wikimedia.org` - cloudelastic1010.wikim...
[16:35:09] Data-Platform-SRE (2024.01.22 - 2024.02.11), Patch-For-Review: Wmfdata should connect to Presto using the analytics-presto CNAME - https://phabricator.wikimedia.org/T345482 (BTullis) I've been testing out various approaches on this task and I have received great help from @brouberol for which I am very g...
[16:37:47] Data-Platform-SRE (2024.01.22 - 2024.02.11), Patch-For-Review: Bring an-coord100[3-4] into service - https://phabricator.wikimedia.org/T336045 (BTullis) This is now waiting on: https://github.com/wikimedia/wmfdata-python/pull/50 and {T345482}. Once that is merged, we will create a new conda-analytics pac...
[16:43:49] Data-Platform-SRE (2024.01.22 - 2024.02.11): Upgrade Airflow instances to Bullseye - https://phabricator.wikimedia.org/T335261 (BTullis) a:BTullis
[17:14:30] Data-Engineering (Sprint 7), Data Products, Structured-Data-Backlog: [Maintenance] Set up deletion jobs for Structured Data's data pipelines - https://phabricator.wikimedia.org/T347561 (mfossati) @lbowmaker @JAllemandou , I was thinking that perhaps we could implement these deletion jobs as tasks in...
[17:17:58] Data-Platform-SRE (2024.01.22 - 2024.02.11), Patch-For-Review: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host cloudelastic1010.eqiad.wmnet with OS bullseye
[17:22:19] Data-Platform-SRE (2024.01.22 - 2024.02.11), Patch-For-Review: Wmfdata should connect to Presto using the analytics-presto CNAME - https://phabricator.wikimedia.org/T345482 (brouberol) @BTullis My pleasure!
[17:24:30] (HdfsCapacityRemainingPercent) firing: Alarmingly low free space on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Capacity_Remaining - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=106&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCapacityRemainingPercent
[17:37:57] Data-Engineering, Patch-For-Review: Data Quality Issue: Wikitext History Job fail / rerun in Airflow - https://phabricator.wikimedia.org/T342911 (CodeReviewBot) xcollazo opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/595 Don't retry convert_history_xml_to_parquet.
[17:54:59] Data-Engineering, CX-cxserver, Citoid, Content-Transform-Team-WIP, and 10 others: Migrate node-based services in production to node18 - https://phabricator.wikimedia.org/T349118 (Jdforrester-WMF)
[17:57:45] Data-Platform-SRE (2024.01.22 - 2024.02.11), Patch-For-Review: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host cloudelastic1010.eqiad.wmnet with OS bullseye completed:...
[19:49:50] (DiskSpace) firing: Disk space stat1005:9100:/ 2.341% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=stat1005 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[19:55:38] (SystemdUnitFailed) firing: (12) refinery-sqoop-mediawiki-production-daily.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:24:31] (HdfsCapacityRemainingPercent) firing: Alarmingly low free space on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Capacity_Remaining - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=106&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCapacityRemainingPercent
[23:49:50] (DiskSpace) firing: Disk space stat1005:9100:/ 2.318% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=stat1005 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[23:55:38] (SystemdUnitFailed) firing: (12) refinery-sqoop-mediawiki-production-daily.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed