[07:47:50] hey folks
[07:48:05] there is an alarm for high RPC activity for the hdfs namenode
[07:48:27] from the hdfs audit log on an-master1004 it seems all coming from analytics airflow, mostly for src=/wmf/data/wmf_content/mediawiki_content_history_v1/data/wiki_id=enwiki/...
[07:48:33] cc: joal --^
[07:48:56] Hi elukey - thanks for the ping!
[07:49:36] o/ bonjour
[07:50:53] elukey: this job needs to work with a lot of files... I'll talk with xcollazo about how we could reduce the number of RPC calls, but I'm not sure we'll be able to reduce the load
[07:51:36] yes yes it may be ok, IIRC in the past we had some outages for HDFS this is why I pinged :
[07:51:39] :)
[07:51:52] I am not aware of the current specs so probably the limits are too low
[08:13:58] I remember the problem :)
[08:22:07] yes yes I know you remember EVERYTHING :D
[08:25:15] hmf, I wish (or maybe not? :D)
[09:34:44] (PS5) Aqu: Refine deterministic transform deduplication [analytics/refinery/source] - https://gerrit.wikimedia.org/r/1140134 (https://phabricator.wikimedia.org/T369845)
[09:35:18] (CR) Aqu: [C:+2] Backport: Refine deterministic transform deduplication [analytics/refinery/source] (0.2.49) - https://gerrit.wikimedia.org/r/1140141 (https://phabricator.wikimedia.org/T369845) (owner: Aqu)
[09:45:05] (Merged) jenkins-bot: Backport: Refine deterministic transform deduplication [analytics/refinery/source] (0.2.49) - https://gerrit.wikimedia.org/r/1140141 (https://phabricator.wikimedia.org/T369845) (owner: Aqu)
[09:54:55] (CR) Aqu: [C:+2] Refine deterministic transform deduplication [analytics/refinery/source] - https://gerrit.wikimedia.org/r/1140134 (https://phabricator.wikimedia.org/T369845) (owner: Aqu)
[10:08:09] (Merged) jenkins-bot: Refine deterministic transform deduplication [analytics/refinery/source] - https://gerrit.wikimedia.org/r/1140134 (https://phabricator.wikimedia.org/T369845) (owner: Aqu)
[10:11:33] Starting build #31 for job analytics-refinery-maven-release
[10:31:13] Project analytics-refinery-maven-release build #31: SUCCESS in 19 min: https://integration.wikimedia.org/ci/job/analytics-refinery-maven-release/31/
[10:31:18] Starting build #32 for job analytics-refinery-maven-release
[10:49:19] Project analytics-refinery-maven-release build #32: SUCCESS in 18 min: https://integration.wikimedia.org/ci/job/analytics-refinery-maven-release/32/
[11:34:04] Starting build #28 for job analytics-refinery-update-jars
[11:35:31] (PS1) Maven-release-user: Add refinery-source jars for v0.2.61 to artifacts [analytics/refinery] - https://gerrit.wikimedia.org/r/1141870
[11:35:33] Project analytics-refinery-update-jars build #28: SUCCESS in 1 min 29 sec: https://integration.wikimedia.org/ci/job/analytics-refinery-update-jars/28/
[11:35:36] Starting build #29 for job analytics-refinery-update-jars
[11:35:39] Project analytics-refinery-update-jars build #29: FAILURE in 3.3 sec: https://integration.wikimedia.org/ci/job/analytics-refinery-update-jars/29/
[11:44:29] Starting build #30 for job analytics-refinery-update-jars
[11:45:00] (PS1) Maven-release-user: Add refinery-source jars for v0.2.49.4 to artifacts [analytics/refinery] - https://gerrit.wikimedia.org/r/1141873
[11:45:01] Yippee, build fixed!
[11:45:01] Project analytics-refinery-update-jars build #30: FIXED in 32 sec: https://integration.wikimedia.org/ci/job/analytics-refinery-update-jars/30/
[11:48:20] (CR) Aqu: [C:+2] Add refinery-source jars for v0.2.61 to artifacts [analytics/refinery] - https://gerrit.wikimedia.org/r/1141870 (owner: Maven-release-user)
[11:48:25] (CR) Aqu: [V:+2 C:+2] Add refinery-source jars for v0.2.61 to artifacts [analytics/refinery] - https://gerrit.wikimedia.org/r/1141870 (owner: Maven-release-user)
[11:51:52] (PS2) Aqu: Add refinery-source jars for v0.2.49.4 to artifacts [analytics/refinery] - https://gerrit.wikimedia.org/r/1141873 (owner: Maven-release-user)
[11:53:13] (PS3) Aqu: Add refinery-source jars for v0.2.49.4 to artifacts [analytics/refinery] - https://gerrit.wikimedia.org/r/1141873 (owner: Maven-release-user)
[11:53:23] (CR) Aqu: [C:+2] Add refinery-source jars for v0.2.49.4 to artifacts [analytics/refinery] - https://gerrit.wikimedia.org/r/1141873 (owner: Maven-release-user)
[11:53:25] (CR) Aqu: [V:+2 C:+2] Add refinery-source jars for v0.2.49.4 to artifacts [analytics/refinery] - https://gerrit.wikimedia.org/r/1141873 (owner: Maven-release-user)
[12:00:47] !log Deploying new artifacts in analytics/refinery 0.2.29.4 and 0.2.61
[12:00:49] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:51:14] Data-Engineering, Data-Engineering-Radar, Discovery-Search, Infrastructure-Foundations, Data-Platform-SRE (2025-05-02 - 2025-05-23): Elasticsearch dependency upgrade in spicerack - https://phabricator.wikimedia.org/T390860#10791904 (Gehel)
[12:51:21] Data-Engineering, Data-Platform-SRE (2025-05-02 - 2025-05-23), Documentation: https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log should be on Wikitech - https://phabricator.wikimedia.org/T387878#10791902 (Gehel)
[12:51:49] Data-Engineering, Data-Platform-SRE (2025-05-02 - 2025-05-23): Airflow UI sometimes shows no response for a DAG run task with many mapped tasks - https://phabricator.wikimedia.org/T381479#10791921 (Gehel)
[12:52:05] Data-Engineering, Data-Engineering-Radar, CirrusSearch, Structured Data Engineering, and 3 others: Migrate image recommendation to use page_weighted_tags_changed stream - https://phabricator.wikimedia.org/T372912#10791916 (Gehel)
[12:53:43] Analytics-Data-Problem, Discovery-Search, serviceops, Data-Platform-SRE (2025-05-02 - 2025-05-23): Search Update Pipeline requests to Action API are logged as coming from 127.0.0.1 - https://phabricator.wikimedia.org/T388855#10791953 (Gehel)
[12:53:59] Data-Engineering, Data-Engineering-Radar, Infrastructure-Foundations, SRE, Data-Platform-SRE (2025-05-02 - 2025-05-23): Rebuild Spark images with Bookworm / bullseye-backports deprecation - https://phabricator.wikimedia.org/T390139#10791951 (Gehel)
[13:36:02] Data-Engineering (Q4 2025 April 1st - June 30th), DPE-Mediawiki-Content: [Dumps 2] Investigate reasons for remaining inconsistencies - https://phabricator.wikimedia.org/T385112#10792211 (xcollazo) The `2025-05-01` and `2025-05-02` runs of `spark_process_reconciliation_events` job, from `mw_content_merge_...
[14:21:21] Data-Engineering, Data-Engineering-Radar, CirrusSearch, Structured Data Engineering, and 3 others: Migrate image recommendation to use page_weighted_tags_changed stream - https://phabricator.wikimedia.org/T372912#10792541 (Gehel)
[14:21:35] Data-Engineering, Java-Scala-Standardization, Discovery-Search (2025.05.02 - 2025.05.23): Create Gitlab CI templates for JVM packages - https://phabricator.wikimedia.org/T386406#10792545 (Gehel)
[14:22:09] Data-Engineering, Data-Platform-SRE, Java-Scala-Standardization, Discovery-Search (2025.05.02 - 2025.05.23), Patch-For-Review: Migrate existing Java packages to deploying to Gitlab, including new version of parent pom, validation that all depen... - https://phabricator.wikimedia.org/T367405#10792571
[14:46:36] Analytics-Canonical-Data, Movement-Insights, Security-Team, WMF-Legal, SecTeam-Processed: Decide stewardship of the Country and Territory Protection List - https://phabricator.wikimedia.org/T381944#10792744 (sbassett) @nshahquinn-wmf - I recently met with @MFischer and we discussed this parti...
[14:51:54] Data-Engineering (Q4 2025 April 1st - June 30th), Essential-Work: Assess impact of schema changes of categorylinks, metadata, imagelinks on data pipelines - https://phabricator.wikimedia.org/T391527#10792756 (Ahoelzl) Waiting on input from traffic team.
[15:56:34] (PS2) Joal: Add HAProxy termination_state to webrequest [analytics/refinery] - https://gerrit.wikimedia.org/r/1140566 (https://phabricator.wikimedia.org/T387454)
[16:19:40] Data-Engineering (Q4 2025 April 1st - June 30th), Traffic, DPE HAProxy Migration, Patch-For-Review: Add HAproxy termination field to webrequest - https://phabricator.wikimedia.org/T387454#10793139 (JAllemandou)
[16:26:36] Data-Engineering (Q4 2025 April 1st - June 30th), Essential-Work, Event-Platform: Gobblin-wmf Gitlab migration and maintenance - https://phabricator.wikimedia.org/T370368#10793164 (amastilovic)
[16:32:57] Data-Engineering (Q4 2025 April 1st - June 30th), Essential-Work, Event-Platform: Gobblin-wmf Gitlab migration and maintenance - https://phabricator.wikimedia.org/T370368#10793200 (amastilovic)
[16:36:56] Data-Engineering (Q4 2025 April 1st - June 30th): [OpsWeek] Avoid ingestion delays by marking Gobblin's SimpleSkein job as essential - https://phabricator.wikimedia.org/T393397 (xcollazo) NEW
[17:06:28] Data-Engineering (Q4 2025 April 1st - June 30th), Patch-For-Review: [OpsWeek] Avoid ingestion delays by marking Gobblin's SimpleSkein job as essential - https://phabricator.wikimedia.org/T393397#10793353 (xcollazo) Open→In progress
[17:10:29] Data-Engineering (Q4 2025 April 1st - June 30th), Patch-For-Review: [OpsWeek] RefineSanitize fails to send emails - https://phabricator.wikimedia.org/T393202#10793359 (xcollazo) Open→In progress
[17:48:16] Data-Engineering (Q4 2025 April 1st - June 30th): [OpsWeek] wmf_content.inconsistent_rows_of_mediawiki_content_history_v1 hoarding metadata - https://phabricator.wikimedia.org/T393405 (xcollazo) NEW
[17:55:43] Data-Engineering (Q4 2025 April 1st - June 30th): [OpsWeek] wmf_content.inconsistent_rows_of_mediawiki_content_history_v1 hoarding metadata - https://phabricator.wikimedia.org/T393405#10793487 (xcollazo) Listing succeeds by bumping CLI mem: ` HADOOP_CLIENT_OPTS="-Xmx16g $HADOOP_CLIENT_OPTS" hdfs dfs -ls /wmf...
[17:56:31] Data-Engineering (Q4 2025 April 1st - June 30th): [OpsWeek] wmf_content.inconsistent_rows_of_mediawiki_content_history_v1 hoarding metadata - https://phabricator.wikimedia.org/T393405#10793490 (xcollazo) Issue seems to correlate directly with the amount of snapshots: ` spark.sql(""" SELECT count(1) as count...
[18:18:01] Data-Engineering (Q4 2025 April 1st - June 30th): [OpsWeek] wmf_content.inconsistent_rows_of_mediawiki_content_history_v1 hoarding metadata - https://phabricator.wikimedia.org/T393405#10793558 (xcollazo) Looks like we are accumulating about ~60k snapshots per month: ` spark.sql(""" SELECT trunc(committed_at,...
[18:56:22] Analytics-Canonical-Data, Movement-Insights, Security-Team, WMF-Legal, SecTeam-Processed: Decide stewardship of the Country and Territory Protection List - https://phabricator.wikimedia.org/T381944#10793675 (nshahquinn-wmf) @sbassett thank you very much! 😊