[01:27:08] Data-Engineering (Q4 2025 April 1st - June 30th): [OpsWeek] wmf_content.inconsistent_rows_of_mediawiki_content_history_v1 hoarding metadata - https://phabricator.wikimedia.org/T393405#10794441 (xcollazo) New snapshot counts: ` spark.sql(""" SELECT trunc(committed_at, 'month') as week, count(1) as coun...
[01:38:32] Data-Engineering (Q4 2025 April 1st - June 30th): [OpsWeek] Avoid ingestion delays by marking Gobblin's SimpleSkein job as essential - https://phabricator.wikimedia.org/T393397#10794467 (xcollazo) Skein jobs now being marked as `essential`: {F59707543}
[05:37:15] FIRING: HdfsRpcQueueLength: RPC call queue length on the analytics-hadoop cluster is too high. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Namenode_RPC_length_queue - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=54&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsRpcQueueLength
[06:12:15] RESOLVED: HdfsRpcQueueLength: RPC call queue length on the analytics-hadoop cluster is too high. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Namenode_RPC_length_queue - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=54&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsRpcQueueLength
[09:39:36] Helloooo! I'm trying to find this data, https://wikitech.wikimedia.org/wiki/API_Gateway#Logs_and_analytics but looking in superset I see nothing! :D Anyone know what I am missing? =] I was looking in event.api_gateway_request but it seems empty
[09:43:03] Data-Engineering, Data-Platform-SRE, superset.wikimedia.org: Frequent filter timeouts in superset UI - https://phabricator.wikimedia.org/T393097#10795077 (JAllemandou) I concur with @BTullis analysis: the fields queried by the filters load relatively high-cardinality data. I can think of multiple way...
[09:49:15] FIRING: HdfsRpcQueueLength: RPC call queue length on the analytics-hadoop cluster is too high. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Namenode_RPC_length_queue - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=54&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsRpcQueueLength
[09:54:15] RESOLVED: HdfsRpcQueueLength: RPC call queue length on the analytics-hadoop cluster is too high. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Namenode_RPC_length_queue - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=54&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsRpcQueueLength
[10:18:42] Data-Engineering, Data-Engineering-Radar, Discovery-Search, Infrastructure-Foundations, Data-Platform-SRE (2025.05.02 - 2025.05.23): Elasticsearch dependency upgrade in spicerack - https://phabricator.wikimedia.org/T390860#10795138 (Volans) For now I've prepared a patch to exclude elasticsear...
[12:02:55] (CR) Joal: [V:+2 C:+2] "Merging for deploy" [analytics/refinery] - https://gerrit.wikimedia.org/r/1140566 (https://phabricator.wikimedia.org/T387454) (owner: Joal)
[12:05:03] !log Deploying refinery using scap
[12:05:05] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:08:04] Data-Engineering-Radar, Data-Platform-SRE, Traffic, MediaWiki-Platform-Team (Radar), Patch-For-Review: Replicate current low-message alerting from VarnishKafka - https://phabricator.wikimedia.org/T391810#10795581 (Gehel) p:Triage→High
[12:08:46] Data-Engineering-Radar, BDC-Implementation, Epic: [Trino] additional worker nodes for eqiad - https://phabricator.wikimedia.org/T392131#10795583 (Gehel) Removing DPE SRE as it does not seem that we need to be involved.
[12:08:47] !log Pause webrequest airflow jobs until schema is changed
[12:08:48] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:09:14] Data-Engineering-Radar, BDC-Implementation, Epic: [MinIO] Improve cluster to minimum system configuration for production - https://phabricator.wikimedia.org/T392112#10795586 (Gehel) Removing DPE SRE as it does not look like we need to be involved.
[12:09:59] Data-Engineering-Radar, BDC-Implementation, Epic: EPIC: MinIO implementation - https://phabricator.wikimedia.org/T392090#10795588 (Gehel) Removing DPE SRE as it does not look like we need to be involved.
[12:10:15] Data-Engineering-Radar, BDC-Implementation, Epic: EPIC: Trino implementation - https://phabricator.wikimedia.org/T392093#10795590 (Gehel) Removing DPE SRE as it does not look like we need to be involved.
[12:13:59] !log Deploy refinery to HDFS
[12:14:01] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:18:53] Data-Engineering-Radar, Traffic, Data-Platform-SRE (2025.05.02 - 2025.05.23), MediaWiki-Platform-Team (Radar), Patch-For-Review: Replicate current low-message alerting from VarnishKafka - https://phabricator.wikimedia.org/T391810#10795627 (Gehel)
[12:21:05] Data-Engineering, Data-Platform-SRE, Product-Analytics: Allow curl commands from Airflow BashOperator - https://phabricator.wikimedia.org/T392288#10795646 (brouberol) Open→Declined I've thought about it some more, and I don't think we have such a strong case in favor of adding `curl` to t...
[12:21:35] Data-Engineering, cloud-services-team, Cloud-VPS, IPv6, Patch-For-Review: Add new WMCS IP ranges to analytics - https://phabricator.wikimedia.org/T392468#10795649 (Gehel) Removing DPE SRE as it does not look like we need to be directly involved. Please add us back if needed!
[12:21:51] Data-Engineering, Product-Analytics, Data-Platform-SRE (2025.05.02 - 2025.05.23): Allow curl commands from Airflow BashOperator - https://phabricator.wikimedia.org/T392288#10795651 (Gehel)
[12:25:58] !log Update webrequest schemas (raw + refined) and restart webrequest_sampled_live druid indexation with new field
[12:26:00] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:26:09] !log Unpause airflow webrequest jobs
[12:26:11] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:30:45] Data-Engineering, Dumps-Generation, Data-Platform-SRE (2025.05.02 - 2025.05.23): The 20250401 dumps haven't started on time because the mediawikiwiki dump from 20250320 is looping - https://phabricator.wikimedia.org/T390839#10795673 (Gehel)
[13:10:33] Data-Engineering (Q4 2025 April 1st - June 30th), Traffic, DPE HAProxy Migration, Patch-For-Review: Add HAproxy termination field to webrequest - https://phabricator.wikimedia.org/T387454#10795886 (JAllemandou)
[13:19:50] Data-Engineering, Machine-Learning-Team, Research, Event-Platform: Emit revision revert risk scores as a stream and expose in EventStreams API - https://phabricator.wikimedia.org/T326179#10795913 (isarantopoulos) In progress→Resolved
[13:37:09] Data-Engineering (Q4 2025 April 1st - June 30th), Traffic, DPE HAProxy Migration: Add HAproxy termination field to webrequest - https://phabricator.wikimedia.org/T387454#10796058 (JAllemandou)
[13:37:22] Data-Engineering (Q4 2025 April 1st - June 30th), Traffic, DPE HAProxy Migration: Add HAproxy termination field to webrequest - https://phabricator.wikimedia.org/T387454#10796059 (JAllemandou) This is done :)
[13:42:28] Data-Engineering (Q4 2025 April 1st - June 30th), Traffic, DPE HAProxy Migration: Add HAproxy termination field to webrequest - https://phabricator.wikimedia.org/T387454#10796080 (Fabfur) Many thanks for all this!
[14:15:30] Data-Engineering (Q4 2025 April 1st - June 30th), Experimentation Lab (Experiment Platform Sprint 6), Patch-For-Review: FY 24-25 SDS 2.4.9 CDN Synthetic Beacon: EventGate & Varnish: update to receive events from beacon event v2 - https://phabricator.wikimedia.org/T391959#10796224 (Ottomata) > Maybe w...
[15:58:16] Data-Engineering, Data-Platform-SRE, superset.wikimedia.org: Frequent filter timeouts in superset UI - https://phabricator.wikimedia.org/T393097#10796707 (hnowlan) p:Unbreak!→Medium Thanks a lot for looking at this and offering solutions! I'm downgrading the priority accordingly, and we'll ad...
[16:18:34] Data-Engineering, Data-Platform-SRE, superset.wikimedia.org: Frequent filter timeouts in superset UI - https://phabricator.wikimedia.org/T393097#10796828 (Volans) [I was OOO for the last two weeks, just heard about this now] The only changes that have been made recently to the dashboard can be groupe...
[16:54:22] Data-Engineering, Product-Analytics, Data-Platform-SRE (2025.05.02 - 2025.05.23): Allow curl commands from Airflow BashOperator - https://phabricator.wikimedia.org/T392288#10796998 (xcollazo) > I don't think we have such a strong case in favor of adding curl to the airflow docker image. Fair eno...
[17:28:31] Data-Engineering (Q4 2025 April 1st - June 30th), Patch-For-Review: [OpsWeek] wmf_content.inconsistent_rows_of_mediawiki_content_history_v1 hoarding metadata - https://phabricator.wikimedia.org/T393405#10797175 (xcollazo) Just ran this in prod: ` sudo -u analytics bash kerberos-run-command analytics spa...
[17:30:02] !log Set "write.metadata.delete-after-commit.enabled" on wmf_content.inconsistent_rows_of_mediawiki_content_history_v1. T393405#10797175.
[17:30:04] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:30:05] T393405: [OpsWeek] wmf_content.inconsistent_rows_of_mediawiki_content_history_v1 hoarding metadata - https://phabricator.wikimedia.org/T393405
[17:36:33] Data-Engineering (Q4 2025 April 1st - June 30th), Dumps-Generation, Data-Platform-SRE (2025.05.02 - 2025.05.23): The 20250401 dumps haven't started on time because the mediawikiwiki dump from 20250320 is looping - https://phabricator.wikimedia.org/T390839#10797220 (Ahoelzl)
[17:37:41] Data-Engineering (Q4 2025 April 1st - June 30th): Update airflow-dags wmf_airflow_common library rename analytics to main - https://phabricator.wikimedia.org/T393111#10797224 (Ahoelzl)
[17:39:39] Data-Engineering (Q4 2025 April 1st - June 30th): Update airflow-dags wmf_airflow_common library rename analytics to main - https://phabricator.wikimedia.org/T393111#10797232 (Ahoelzl)
[17:41:38] Data-Engineering, SRE, SRE-Access-Requests: Requesting access to for - https://phabricator.wikimedia.org/T393066#10797241 (Milimetric) Approved as well, for the `analytics-privatedata-users` group, as per [[ https://phabricator.wikimedia.org/source/operations-puppet/browse/pr...
[17:43:33] Data-Engineering, Data-Engineering-Radar, SRE, SRE-Access-Requests: Requesting access to for - https://phabricator.wikimedia.org/T393066#10797257 (Ahoelzl)
[17:45:57] Data-Engineering, Data-Engineering-Radar, SRE, SRE-Access-Requests, Data-Platform-SRE (2025.05.02 - 2025.05.23): Requesting access to for - https://phabricator.wikimedia.org/T393066#10797266 (BTullis) a:BTullis I'll pick up the puppet change for this ticket.
[17:50:50] Data-Engineering (Q4 2025 April 1st - June 30th), Patch-For-Review: [OpsWeek] wmf_content.inconsistent_rows_of_mediawiki_content_history_v1 hoarding metadata - https://phabricator.wikimedia.org/T393405#10797303 (xcollazo) We figured that what the vast majority of the file size is because of `.metadata.js...
[17:53:31] Data-Engineering (Q4 2025 April 1st - June 30th), Data-Platform-SRE, superset.wikimedia.org: Frequent filter timeouts in superset UI - https://phabricator.wikimedia.org/T393097#10797322 (Ahoelzl)