[00:00:18] Data-Engineering (Q4 2025 April 1st - June 30th): Refine to Hive with Airflow – Switch Over day scripts - https://phabricator.wikimedia.org/T392696#11002785 (Ahoelzl) a:Antoine_Quhen
[00:01:01] Data-Engineering (Q4 2025 April 1st - June 30th): Refine to Hive with Airflow – Post-Migration Cleanup - https://phabricator.wikimedia.org/T392698#11002786 (Ahoelzl) a:Antoine_Quhen
[00:01:02] Data-Engineering (Q4 2025 April 1st - June 30th): Refine to Hive with Airflow – Post-Migration Cleanup - https://phabricator.wikimedia.org/T392698#11002787 (Ahoelzl) a:Antoine_Quhen→None
[00:01:19] Data-Engineering (Q4 2025 April 1st - June 30th), Observability-Tracing, Event-Platform: EventGate: Enable OpenTelemetry Propagation - https://phabricator.wikimedia.org/T391353#11002788 (Ahoelzl) a:tchin
[00:02:01] Data-Engineering, MediaWiki-DomainEvents, Epic, Event-Platform, and 2 others: Hypothesis 5.2.13: EventBus Adoption of Domain Events - https://phabricator.wikimedia.org/T391254#11002792 (Ahoelzl)
[00:03:21] Data-Engineering (Q4 2025 April 1st - June 30th): Move more of refine_hive_hourly dag logic into RefineConfiguration - https://phabricator.wikimedia.org/T375064#11002796 (Ahoelzl) a:Antoine_Quhen
[00:04:05] Data-Engineering (Q4 2025 April 1st - June 30th): Refine to Hive with Airflow – Post-Migration Cleanup - https://phabricator.wikimedia.org/T392698#11002801 (Ahoelzl) a:Antoine_Quhen
[00:04:10] Data-Engineering (Q4 2025 April 1st - June 30th): Refine to Hive with Airflow – Update Refine Documentation on Wikitech - https://phabricator.wikimedia.org/T392697#11002802 (Ahoelzl) a:Antoine_Quhen
[00:04:48] Data-Engineering (Q4 2025 April 1st - June 30th), Structured-Data-Backlog: Bump memory to enable large artifacts sync on HDFS - https://phabricator.wikimedia.org/T348958#11002803 (Ahoelzl) a:xcollazo
[00:05:32] Data-Engineering (Q4 2025 April 1st - June 30th): Analyze temporary hourly webrequest traffic loss - https://phabricator.wikimedia.org/T399312#11002806 (Ahoelzl) a:Ahoelzl
[00:06:15] Data-Engineering (Q4 2025 April 1st - June 30th), Dumps-Generation: wikidatawiki fails dumps of the wbt_* tables, also lagging on XML Dumps - https://phabricator.wikimedia.org/T396125#11002808 (Ahoelzl) a:xcollazo
[03:18:35] FIRING: [2x] AlertLintProblem: Linting problems found for HaproxyKafkaDeliveryErrors - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[06:38:38] Data-Engineering, DBA, Patch-For-Review, Schema-change-in-production: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249#11003211 (Marostegui)
[06:42:17] Data-Engineering, DBA, Patch-For-Review, Schema-change-in-production: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249#11003223 (Marostegui)
[07:18:35] FIRING: [2x] AlertLintProblem: Linting problems found for HaproxyKafkaDeliveryErrors - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[07:56:10] Data-Engineering, service-utils: Migrate and re-deploy eventgate using new service-utils - https://phabricator.wikimedia.org/T361768#11003333 (akosiaris) >>! In T361768#10995844, @tchin wrote: >>>! In T361768#10995389, @akosiaris wrote: >>>>! In T361768#10459087, @Ahoelzl wrote: >>> This includes upgrade...
[09:15:20] Data-Engineering, Data-Engineering-Radar, Infrastructure-Foundations, SRE, Data-Platform-SRE (2025.07.05 - 2025.07.25): Rebuild Spark images with Bookworm / bullseye-backports deprecation - https://phabricator.wikimedia.org/T390139#11003583 (BTullis) Open→Resolved This is now done...
[11:18:35] FIRING: [2x] AlertLintProblem: Linting problems found for HaproxyKafkaDeliveryErrors - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[14:09:09] Data-Engineering, DBA, GlobalBlocking, Trust and Safety Product Team, Schema-change: globalblocks table: SQL in extension and production have different type for gb_address - https://phabricator.wikimedia.org/T395669#11004605 (Ottomata) cc also @Milimetric @mforns ^
[14:17:01] Data-Engineering: Airflow: MariaDB to Hive to operator - https://phabricator.wikimedia.org/T395554#11004630 (Ottomata) @BTullis in the non k8s world, I would suggest to use Spark in Yarn to query MariaDB. But, in k8s world where airflow tasks can be isolated, perhaps this request is fine? We should still en...
[14:25:09] Data-Engineering, observability, Observability-Metrics, Event-Platform: Produce MediaWiki client emitted operational metrics into Event Platform, allowing them to be queried in the WMF Data Lake with SQL - https://phabricator.wikimedia.org/T390328#11004649 (Ottomata) Interesting. There is a [[ ht...
[14:32:41] Data-Engineering, observability, Observability-Metrics, Event-Platform: Produce MediaWiki client emitted operational metrics into Event Platform, allowing them to be queried in the WMF Data Lake with SQL - https://phabricator.wikimedia.org/T390328#11004685 (Ottomata)
[14:36:54] Data-Engineering: Requesting a kerberos identity - htriedman - https://phabricator.wikimedia.org/T398501#11004697 (Htriedman) Following up on this — any chance I could get someone to take a quick look?
[14:41:01] Data-Engineering, EventStreams: SSE events from offline DC topics - https://phabricator.wikimedia.org/T396564#11004729 (Ottomata) [[ https://wikimedia.slack.com/archives/CSV483812/p1748951692761429 | Relevant Slack thread ]]
[14:43:13] Data-Engineering, SRE-Access-Requests: Requesting a kerberos identity - htriedman - https://phabricator.wikimedia.org/T398501#11004737 (MoritzMuehlenhoff)
[14:45:19] Data-Engineering, SRE-Access-Requests: Requesting a kerberos identity - htriedman - https://phabricator.wikimedia.org/T398501#11004755 (ssingh) ` sukhe@krb1002:~$ sudo manage_principals.py reset-password htriedman --email_address=htriedman-ctr@wikimedia.org Password reset successfully. Successfully sent...
[14:47:42] Data-Engineering, EventStreams: SSE events from offline DC topics - https://phabricator.wikimedia.org/T396564#11004762 (Ottomata) ` curl 'https://stream.wikimedia.org/v2/stream/rdf-streaming-updater.mutation.v2' | grep id: # ... id: [{"topic":"eqiad.rdf-streaming-updater.mutation","partition":0,"timesta...
[14:54:42] Data-Engineering, EventStreams, Event-Platform: SSE events from offline DC topics - https://phabricator.wikimedia.org/T396564#11004826 (Ottomata)
[14:56:39] Data-Engineering, Data-Engineering-Radar, CheckUser, Trust and Safety Product Team, and 2 others: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' - https://phabricator.wikimedia.org/T395683#11004891 (Dreamy_Jazz) a:Dreamy_Jazz
[15:02:30] Data-Engineering, EventStreams, Event-Platform: SSE events from offline DC topics - https://phabricator.wikimedia.org/T396564#11004948 (Ottomata) I would investigate to see if `topics` is being passed properly from eventstreams to KafkaSSE. Reading KafkaSSE code, I think it should choose the right a...
[15:08:06] Data-Engineering, SRE-Access-Requests: Requesting a kerberos identity - htriedman - https://phabricator.wikimedia.org/T398501#11004989 (Htriedman) This seems to have worked! Thank you for the lightning-fast response time :)
[15:10:51] Data-Engineering, SRE-Access-Requests: Requesting a kerberos identity - htriedman - https://phabricator.wikimedia.org/T398501#11005013 (ssingh) Open→Resolved a:ssingh
[15:11:03] Data-Engineering-Roadmap, Epic: [EPIC] Datahub Improvements - https://phabricator.wikimedia.org/T369756#11005017 (Ottomata)
[15:19:07] FIRING: [2x] AlertLintProblem: Linting problems found for HaproxyKafkaDeliveryErrors - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[15:45:25] Data-Engineering-Roadmap, MediaWiki-DomainEvents, MW-Interfaces-Team, Epic: DomainEvents - Broadcasting and receiving cross-service events - https://phabricator.wikimedia.org/T379935#11005117 (Ottomata)
[15:45:53] Data-Engineering-Roadmap, MediaWiki-DomainEvents, MW-Interfaces-Team, Epic: DomainEvents - Broadcasting and receiving cross-service Integration events - https://phabricator.wikimedia.org/T379935#11005119 (Ottomata)
[15:48:34] Data-Engineering, Data-Engineering-Icebox, Data Pipelines: Add support for repository artifacts in Airflow - https://phabricator.wikimedia.org/T322690#11005129 (Ottomata) a:amastilovic
[15:48:37] Data-Engineering, DBA, GlobalBlocking, Trust and Safety Product Team, Schema-change: globalblocks table: SQL in extension and production have different type for gb_address - https://phabricator.wikimedia.org/T395669#11005128 (Milimetric) I don't know of anything we do with globalblocks but I...
[15:48:50] Data-Engineering, Data-Engineering-Icebox, Data Pipelines: Add support for repository artifacts in Airflow - https://phabricator.wikimedia.org/T322690#11005130 (Ottomata) @amastilovic should we resolve or decline this? Thanks!
[15:53:11] Data-Engineering, Data-Engineering-Icebox, Data Pipelines: HDFS utils on Airflow to handle actions on hdfs files - https://phabricator.wikimedia.org/T336771#11005142 (Ottomata) Open→Resolved a:Ottomata I'm going to resolve, as I am pretty sure we have the ability to do this. Please r...
[15:54:35] Data-Engineering, Event-Platform: Support topics without a schema in Flink Catalog - https://phabricator.wikimedia.org/T328232#11005149 (Ottomata) Open→Declined I'm okay with declining this for now. If we ever decide to invest in Flink SQL stuff, we can reopen.
[15:56:21] Data-Engineering, Event-Platform: Spark Streaming Dumps POC: Backfill content table - https://phabricator.wikimedia.org/T323641#11005161 (Ottomata) Open→Declined IIRC, this was part of a spike and is no longer being worked on.
[15:57:17] Data-Engineering, Data-Engineering-Radar, Data Pipelines, Patch-For-Review: Prototype Spark Streaming Job for Content Dumps - https://phabricator.wikimedia.org/T322326#11005165 (Ottomata) Open→Resolved I'm going to resolve this, even though we don't have a Spark vs Flink evaluation. T...
[15:57:36] Data-Engineering, Event-Platform: Spark Streaming Dumps POC: Update iceberg tables - https://phabricator.wikimedia.org/T323645#11005169 (Ottomata) Open→Declined Done as part of Dumps 2.
[16:08:12] Data-Engineering, Data-Engineering-Icebox, SRE Observability: [Data Platform] Install a Prometheus connector for Presto, pointed at thanos-query - https://phabricator.wikimedia.org/T347430#11005224 (Ottomata) Hey wow, just catching up! This is quite awesome! There is indeed overlap in intention...
[16:12:11] Data-Engineering, observability, Observability-Metrics, Event-Platform: Produce MediaWiki client emitted operational metrics into Event Platform, allowing them to be queried in the WMF Data Lake with SQL - https://phabricator.wikimedia.org/T390328#11005234 (Ottomata)
[16:13:52] Data-Engineering, observability, Observability-Metrics, Event-Platform: Produce MediaWiki client emitted operational metrics into Event Platform, allowing them to be queried in the WMF Data Lake with SQL - https://phabricator.wikimedia.org/T390328#11005248 (Ottomata) I just added Option 6: {T3474...
[17:04:05] Data-Engineering (Q4 2025 April 1st - June 30th), Patch-For-Review: Refine to Hive with Airflow – Handle Late-Arrived Events - https://phabricator.wikimedia.org/T370665#11005665 (Ottomata) This is incredible, thank you Joseph! I updated the description fields of the 2 problematic files panels in the Gra...
[17:06:52] Data-Engineering, Java-Scala-Standardization, Discovery-Search (2025.06.13 - 2025.07.04): Create Gitlab CI templates for JVM packages - https://phabricator.wikimedia.org/T386406#11005682 (amastilovic) Hey @pfischer, I'm not the original creator of the project and do not know its original intended pur...
[17:09:24] Data-Engineering (Q4 2025 April 1st - June 30th), Event-Platform: mediawiki.page_change.v1 should not contain events for undelete into existing pages. - https://phabricator.wikimedia.org/T395327#11005685 (Ottomata) Q: Is it possible for undelete into existing page to update the latest revision of that pa...
[18:20:11] (PS1) Aqu: Refine: Force partition creation at end of refine [analytics/refinery/source] - https://gerrit.wikimedia.org/r/1169720 (https://phabricator.wikimedia.org/T369845)
[18:28:57] Data-Engineering, DBA, Patch-For-Review, Schema-change-in-production: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249#11006042 (Marostegui)
[18:29:14] Data-Engineering, DBA, Patch-For-Review, Schema-change-in-production: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249#11006047 (Marostegui)
[18:30:59] Data-Engineering, DBA, Patch-For-Review, Schema-change-in-production: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249#11006066 (Marostegui)
[18:35:23] (CR) Mforns: [C:+1] "LGTM!" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/1169720 (https://phabricator.wikimedia.org/T369845) (owner: Aqu)
[18:51:31] since a couple days there are alerts about widespread puppet failures that affect what looks like basically all analytics machines.
[18:51:45] the pattern is that hadoop-yarn-nodemanager fails to be started by puppet
[18:51:54] then on next run it can start it.. and then it fails again
[18:52:27] the side issue seems to be that it doesnt effectively notify
[18:53:16] https://puppetboard.wikimedia.org/nodes?status=failed
[18:53:27] https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[18:54:22] hadoop-yarn-nodemanager.service: Main process exited, code=exited, status=255/EXCEPTION > but also -> ervice[hadoop-yarn-nodemanager]/ensure: ensure changed 'stopped' to 'running'
[18:56:26] mutante: yeah I did ping btullis about it and they said it was expected, at least on Friday
[18:56:36] not sure what the update is but yeah, thanks for following up
[18:56:52] thanks as well, sukhe
[19:19:07] FIRING: [2x] AlertLintProblem: Linting problems found for HaproxyKafkaDeliveryErrors - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[20:05:43] Data-Engineering, observability, Observability-Metrics, Event-Platform: Enable querying operational (prometheus) metrics via the WMF Data Platform - https://phabricator.wikimedia.org/T390328#11006411 (Ottomata)
[20:06:25] Data-Engineering, observability, Observability-Metrics, Event-Platform: Enable querying operational (prometheus) metrics via the WMF Data Platform - https://phabricator.wikimedia.org/T390328#11006416 (Ottomata) I've updated the task name and description to avoid the event platform based solution....
[21:36:15] Data-Engineering (Q4 2025 April 1st - June 30th): Revive data engineering alert metrics dashboard - https://phabricator.wikimedia.org/T399518#11006803 (Ahoelzl) Requested **data-engineering-alerts-monitor** bot email. http://wikimediainternal.zendesk.com/hc/requests/114662
[21:44:34] Data-Engineering, DBA, MediaWiki-Page-derived-data, MW-1.45-notes (1.45.0-wmf.11; 2025-07-22), Patch-For-Review: Normalize categorylinks table - https://phabricator.wikimedia.org/T299951#11006851 (Zabe)
[22:47:12] Data-Engineering (Q4 2025 April 1st - June 30th): Define Event Platform Essentail Work FY26 - https://phabricator.wikimedia.org/T399661 (Ahoelzl) NEW
[22:47:25] Data-Engineering (Q4 2025 April 1st - June 30th): Define Event Platform Essentail Work FY26 - https://phabricator.wikimedia.org/T399661#11007206 (Ahoelzl) https://docs.google.com/document/d/1uleeCLHJGyXG1jq4RmjFP_z4Y292eykuGMgLBudQ6tI/edit?tab=t.0#heading=h.y53osp7osgn3
[23:19:25] FIRING: [2x] AlertLintProblem: Linting problems found for HaproxyKafkaDeliveryErrors - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem