[00:03:16] (HdfsCapacityRemainingPercent) resolved: Alarmingly low free space on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Capacity_Remaining - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=106&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCapacityRemainingPercent
[00:12:59] Data-Platform-SRE (2024.01.22 - 2024.02.11): Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin2002 for host elastic2103.codfw.wmnet with OS bullseye completed: - elastic2103 (**PASS**)...
[00:28:54] Analytics, AQS2.0, Tech-Docs-Team, Data Products (Epics Timeline), and 3 others: AQS 2.0 documentation - https://phabricator.wikimedia.org/T288664 (apaskulin)
[00:31:54] Analytics, AQS2.0, Tech-Docs-Team, Data Products (Epics Timeline), and 3 others: AQS 2.0 documentation - https://phabricator.wikimedia.org/T288664 (apaskulin)
[01:20:38] (SystemdUnitFailed) firing: (11) user-runtime-dir@24065.service Failed on an-test-client1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:22:51] (DiskSpace) firing: Disk space stat1005:9100:/ 2.664% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=stat1005 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[05:03:58] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: refinery-sqoop-mediawiki-production-daily.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:04:20] (SystemdUnitFailed) firing: (12) refinery-sqoop-mediawiki-production-daily.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:22:52] (DiskSpace) firing: Disk space stat1005:9100:/ 2.65% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=stat1005 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[05:47:20] Data-Engineering, MediaWiki-extensions-EventLogging, Metrics Platform Backlog, Data Products (Data Products Sprint 08), Patch-For-Review: Add and export MetricsClient#isStreamInSample() - https://phabricator.wikimedia.org/T352966 (SGupta-WMF)
[09:05:38] (SystemdUnitFailed) firing: (12) refinery-sqoop-mediawiki-production-daily.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:15:15] Data-Platform-SRE (2024.01.22 - 2024.02.11), CirrusSearch, Discovery-Search, serviceops: Requesting permission to enable kafka log compaction for page_rerender on kafka-main - https://phabricator.wikimedia.org/T354794 (Joe) It generally seems ok, but a few considerations: * kafka-main is much sma...
[09:19:22] Data-Platform-SRE (2024.01.22 - 2024.02.11), CirrusSearch, Discovery-Search, serviceops: Requesting permission to enable kafka log compaction for page_rerender on kafka-main - https://phabricator.wikimedia.org/T354794 (brouberol) > I also want to note that this doesn't solve the long-standing iss...
[09:22:52] (DiskSpace) firing: Disk space stat1005:9100:/ 2.667% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=stat1005 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[09:49:20] (SystemdUnitFailed) firing: (12) refinery-sqoop-mediawiki-production-daily.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:50:38] (SystemdUnitFailed) firing: (12) refinery-sqoop-mediawiki-production-daily.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:23:05] Data-Engineering, Data-Platform-SRE (2024.01.22 - 2024.02.11): analytics/refinery scap deploy on test cluster fails with permission error - https://phabricator.wikimedia.org/T354703 (BTullis) @Antoine_Quhen - are you happy for me to try a refinery deployment to the test cluster at any time, to try to inv...
[11:29:13] Data-Platform-SRE, Data Pipelines: Evaluate the PRESTO_EXPAND_DATA feature flag in superset - https://phabricator.wikimedia.org/T340144 (BTullis) Resolved→Open Re-opening this ticket to track the work to enable the feature flag.
[11:34:26] Data-Platform-SRE, Data Pipelines: Evaluate the PRESTO_EXPAND_DATA feature flag in superset - https://phabricator.wikimedia.org/T340144 (BTullis) Another test of this feature flag has been requested by @jallemandou and @lbowmaker (here: T335356#9486565 and [[https://wikimedia.slack.com/archives/CSV483812...
[11:38:07] Data-Platform-SRE (2024.01.22 - 2024.02.11), Patch-For-Review: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617 (cmooney)
[11:38:23] Data-Platform-SRE (2024.01.22 - 2024.02.11), Patch-For-Review: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617 (cmooney)
[11:40:39] Data-Platform-SRE (2024.01.22 - 2024.02.11), Patch-For-Review: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617 (cmooney)
[11:41:17] Data-Platform-SRE (2024.01.22 - 2024.02.11), Patch-For-Review: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617 (cmooney)
[12:57:58] btullis, brouberol: for https://gerrit.wikimedia.org/r/c/operations/puppet/+/989090 I've deployed the change on schema2004 and restarted nginx, it looks all good to me, but if you want to check as well, let me know? otherwise I'd apply this to the other schema hosts in a bit by re-enabling puppet
[13:18:28] Data-Platform-SRE, Data-Platform: NEW BUG REPORT: Error querying content_gap_metrics tables from Presto/Superset - https://phabricator.wikimedia.org/T355859 (lbowmaker)
[13:21:58] Data-Platform-SRE: Write a cookbook to check the age of all Java processes associated with the Hadoop clusters - https://phabricator.wikimedia.org/T355886 (BTullis)
[13:22:52] (DiskSpace) firing: Disk space stat1005:9100:/ 2.633% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=stat1005 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[13:41:54] Data-Platform-SRE, Data-Platform: NEW BUG REPORT: Error querying content_gap_metrics tables from Presto/Superset - https://phabricator.wikimedia.org/T355859 (BTullis) @Iflorez - Are you able to fill in any more of the information about the urgency and/or impact of this bug please, since you mentioned tha...
[13:50:38] (SystemdUnitFailed) firing: (12) refinery-sqoop-mediawiki-production-daily.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:50:56] Data-Platform-SRE, Wikidata-Query-Service: Enable cross federation between experimental WDQS endpoints - https://phabricator.wikimedia.org/T355888 (dcausse)
[13:51:25] Data-Platform-SRE, Wikidata-Query-Service: Enable cross federation between experimental WDQS endpoints - https://phabricator.wikimedia.org/T355888 (dcausse)
[13:51:29] Data-Platform-SRE (2024.01.22 - 2024.02.11), Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): Expose 3 new dedicated WDQS endpoints - https://phabricator.wikimedia.org/T351650 (dcausse)
[13:56:28] Data-Platform-SRE, Data-Platform: NEW BUG REPORT: Error querying content_gap_metrics tables from Presto/Superset - https://phabricator.wikimedia.org/T355859 (lbowmaker) Thanks @BTullis. I just tagged your team for awareness as this could be a blocker for moving forward with the next Superset version. @f...
[13:58:21] Data-Platform-SRE, Data-Platform: NEW BUG REPORT: Error querying content_gap_metrics tables from Presto/Superset - https://phabricator.wikimedia.org/T355859 (BTullis) > Thanks @BTullis. I just tagged your team for awareness as this could be a blocker for moving forward with the next Superset version. Oh...
[14:16:19] Data-Platform-SRE, Data-Platform: NEW BUG REPORT: Error querying content_gap_metrics tables from Presto/Superset - https://phabricator.wikimedia.org/T355859 (fkaelin) The issue seems that the superset ui for these queries can't render nested parquet structures, e.g. the metrics column contains a set of s...
[14:24:12] Data-Platform-SRE, Data-Platform: NEW BUG REPORT: Error querying content_gap_metrics tables from Presto/Superset - https://phabricator.wikimedia.org/T355859 (fkaelin) Also for reference, at some point I created a template superset dashboard which mirrors the `content_gap_metric` hive tables - here https:...
[14:33:00] Data-Platform-SRE, Data Pipelines: Evaluate the PRESTO_EXPAND_DATA feature flag in superset - https://phabricator.wikimedia.org/T340144 (BTullis) a:BTullis→None
[14:36:22] Data-Platform-SRE (2024.01.22 - 2024.02.11), DC-Ops: Unable to reimage elastic2088 and elastic2094 to bullseye - https://phabricator.wikimedia.org/T355830 (bking)
[14:37:04] Data-Platform-SRE (2024.01.22 - 2024.02.11), DC-Ops, ops-codfw: Unable to reimage elastic2088 and elastic2094 to bullseye - https://phabricator.wikimedia.org/T355830 (bking)
[14:48:11] Data-Platform-SRE, Data-Platform: NEW BUG REPORT: Error querying content_gap_metrics tables from Presto/Superset - https://phabricator.wikimedia.org/T355859 (BTullis) Thanks @fkaelin - I think that the display of these nested structures might be improved by {T340144}. We're hoping to enable this for Supe...
[14:57:35] (DiskSpace) resolved: Disk space stat1005:9100:/ 2.618% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=stat1005 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[14:57:49] Data-Engineering (Sprint 7): Fix refinery-source.refinery-core.Utilities::getValueForKey - https://phabricator.wikimedia.org/T355391 (Antoine_Quhen) Open→Resolved
[14:58:53] Data-Engineering, Epic: [Iceberg Migration] Apache Iceberg Migration - https://phabricator.wikimedia.org/T333013 (Antoine_Quhen)
[14:59:00] Data-Engineering (Sprint 7): [Airflow Migration] Migrate Airflow Druid Jobs to Unique Devices Iceberg tables - https://phabricator.wikimedia.org/T347879 (Antoine_Quhen) Open→Resolved
[15:18:19] (PS1) Aqu: Remove trvwikisource from scoop list [analytics/refinery] - https://gerrit.wikimedia.org/r/992944
[15:21:48] Data-Platform-SRE (2024.01.22 - 2024.02.11), Infrastructure-Foundations, Puppet-Core, SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[15:22:00] Data-Platform-SRE (2024.01.22 - 2024.02.11): Migrate Search Platform-owned hosts to Puppet 7 - https://phabricator.wikimedia.org/T354959 (MoritzMuehlenhoff)
[15:22:32] Analytics, AQS2.0, Tech-Docs-Team, Data Products (Epics Timeline), and 3 others: AQS 2.0 documentation - https://phabricator.wikimedia.org/T288664 (apaskulin)
[15:35:19] Data-Platform-SRE (2024.01.22 - 2024.02.11): Migrate Search Platform-owned hosts to Puppet 7 - https://phabricator.wikimedia.org/T354959 (MoritzMuehlenhoff)
[15:39:03] Data-Platform-SRE (2024.01.22 - 2024.02.11), Infrastructure-Foundations, Puppet-Core, SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[15:55:34] (DiskSpace) firing: Disk space stat1005:9100:/ 2.102% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=stat1005 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[16:04:45] JFTR, schema* migration for nginx scheme is completed, only need to cleanup the old transition packages, will do that tomorrow
[16:11:55] moritzm: Nice, thanks.
[16:59:31] (PS4) TChin: Add iceberg version of interlanguage_navigation table [analytics/refinery] - https://gerrit.wikimedia.org/r/986839 (https://phabricator.wikimedia.org/T352671)
[16:59:52] (CR) TChin: "TBD: lz4 compression" [analytics/refinery] - https://gerrit.wikimedia.org/r/986839 (https://phabricator.wikimedia.org/T352671) (owner: TChin)
[17:46:17] (PS7) Mforns: Add query to load MediaWiki snapshot to Cassandra AQS config table [analytics/refinery] - https://gerrit.wikimedia.org/r/989558 (https://phabricator.wikimedia.org/T352948)
[17:50:38] (SystemdUnitFailed) firing: (12) refinery-sqoop-mediawiki-production-daily.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:59:22] Data-Engineering, Movement-Insights, Traffic, Patch-For-Review: Identify and label prefetch proxy data in our traffic - https://phabricator.wikimedia.org/T346463 (dr0ptp4kt) It's back up and running. The following query is producing results. Note that the `is_goog_isp` field is mainly for helping...
[17:59:49] (CR) Eevans: "For posterity sake, bikeshedding naming might be better tracked on the ticket; I updated the ticket [here](https://phabricator.wikimedia.o" [analytics/refinery] - https://gerrit.wikimedia.org/r/989558 (https://phabricator.wikimedia.org/T352948) (owner: Mforns)
[18:36:27] Data-Platform-SRE, Wikidata, Wikidata-Query-Service: Move whitelist.txt from WDQS deploy repo into puppet and rename it to "allow list" - https://phabricator.wikimedia.org/T343856 (Physikerwelt)
[18:38:12] (CR) Mforns: Add query to load MediaWiki snapshot to Cassandra AQS config table (2 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/989558 (https://phabricator.wikimedia.org/T352948) (owner: Mforns)
[19:11:36] (PS8) Mforns: Add query to load MediaWiki snapshot to Cassandra AQS config table [analytics/refinery] - https://gerrit.wikimedia.org/r/989558 (https://phabricator.wikimedia.org/T352948)
[19:13:05] Data-Engineering: Data Quality Issue: Wikitext History Job fail / rerun in Airflow - https://phabricator.wikimedia.org/T342911 (WDoranWMF)
[19:13:08] (PS9) Mforns: Add query to load MediaWiki snapshot to Cassandra AQS config table [analytics/refinery] - https://gerrit.wikimedia.org/r/989558 (https://phabricator.wikimedia.org/T352948)
[19:55:35] (DiskSpace) firing: Disk space stat1005:9100:/ 2.077% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=stat1005 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[21:50:38] (SystemdUnitFailed) firing: (12) refinery-sqoop-mediawiki-production-daily.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:48:31] Data-Platform-SRE (2024.01.22 - 2024.02.11): Service implementation for cloudelastic1007-1010 - https://phabricator.wikimedia.org/T351354 (RKemper) Old masters are no longer master-eligible. They're still participating in the actual cluster; we're holding off on the physical decom until T355617 is done
[23:13:59] Data-Engineering, Cassandra: Replace use of the aqs.config table with Dataset Config Store - https://phabricator.wikimedia.org/T355911 (Eevans)
[23:14:11] Data-Engineering, Cassandra: Replace use of the aqs.config table with Dataset Config Store - https://phabricator.wikimedia.org/T355911 (Eevans) p:Triage→Medium
[23:52:34] (CR) Eevans: Add query to load MediaWiki snapshot to Cassandra AQS config table (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/989558 (https://phabricator.wikimedia.org/T352948) (owner: Mforns)
[23:55:49] (DiskSpace) firing: Disk space stat1005:9100:/ 2.053% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=stat1005 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace