[02:40:53] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1003:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1003:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[04:45:53] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1003:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1003:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[07:23:20] (PS2) Joal: Fix druid unique-devices daily aggregated monthly [analytics/refinery] - https://gerrit.wikimedia.org/r/1015579 (https://phabricator.wikimedia.org/T361242) (owner: Milimetric)
[07:23:45] (CR) Joal: Fix druid unique-devices daily aggregated monthly (4 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/1015579 (https://phabricator.wikimedia.org/T361242) (owner: Milimetric)
[08:36:17] !log roll-restarting druid on test cluster for T356382
[08:36:19] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:53:17] dcausse: Good morning! Would you have a minute for me?
[08:53:30] joal: sure
[08:53:46] dcausse: thursday meeting meet?
[08:53:59] ok
[08:54:54] !log roll-restarting hadoop masters on test cluster for T356382
[08:54:56] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:59:30] Data-Engineering (Q4 2024 April 1st - June 30th): [Maintenance] Resolve long launch times for canary events on Airflow (30mins in total) - https://phabricator.wikimedia.org/T361499#9712516 (JAllemandou) a:JAllemandou
[09:09:03] btullis: Hi! I need a confirmation please :) In my mind we have removed GPUs from hadoop hosts - is that right?
[09:17:58] (CR) Joal: [C:+1] "LGTM :) Thanks a lot Aleks" [analytics/refinery] - https://gerrit.wikimedia.org/r/1018365 (owner: Aleksandar Mastilovic)
[09:22:27] Data-Engineering, Discovery-Search, Java-Scala-Standardization, Metrics Platform Backlog, and 2 others: Adapt gitlab pipelines for the new wmf-jvm-parent-pom - https://phabricator.wikimedia.org/T358841#9712584 (Gehel) Open→Declined
[09:23:06] (CR) Joal: [C:+1] "I don't understand why the previous version was wrong, but ok :)" [analytics/refinery] - https://gerrit.wikimedia.org/r/1019363 (owner: Aleksandar Mastilovic)
[09:25:42] (SparkHistoryTestServiceUnavailable) firing: ...
[09:25:42] spark-history-analytics-test-hadoop is unavailable on k8s-dse - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Spark_History#The_app_isn't_running - https://grafana.wikimedia.org/d/hyl18XgMk/kubernetes-container-details?orgId=1&var-datasource=eqiad%2Bprometheus/k8s-dse&var-namespace=spark-history-test&var-container=All - https://alerts.wikimedia.org/?q=alertname%3DSparkHistoryTestServiceUnavailable
[09:30:02] (PS2) Gehel: Sort pom.xml according to standard sortpom order. [analytics/refinery/source] - https://gerrit.wikimedia.org/r/1014516 (https://phabricator.wikimedia.org/T360219)
[09:30:02] (PS4) Gehel: Start using wmf-jvm-parent-pom. [analytics/refinery/source] - https://gerrit.wikimedia.org/r/1014517
[09:30:02] (PS3) Gehel: Remove duplication from parent pom. [analytics/refinery/source] - https://gerrit.wikimedia.org/r/1014546 (https://phabricator.wikimedia.org/T360219)
[09:30:03] (PS2) Gehel: Sort the dependencyManagement section according to sortPom. [analytics/refinery/source] - https://gerrit.wikimedia.org/r/1014564 (https://phabricator.wikimedia.org/T360219)
[09:30:03] (PS2) Gehel: Move version configuration of dependencies to main pom. [analytics/refinery/source] - https://gerrit.wikimedia.org/r/1014601 (https://phabricator.wikimedia.org/T360219)
[09:30:05] (PS2) Gehel: Sort some refinery modules according to sortPom. [analytics/refinery/source] - https://gerrit.wikimedia.org/r/1015035 (https://phabricator.wikimedia.org/T360219)
[09:30:09] (PS2) Gehel: Correct stlye issues with spotless. [analytics/refinery/source] - https://gerrit.wikimedia.org/r/1015075
[09:45:42] (SparkHistoryTestServiceUnavailable) resolved: ...
[09:45:42] spark-history-analytics-test-hadoop is unavailable on k8s-dse - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Spark_History#The_app_isn't_running - https://grafana.wikimedia.org/d/hyl18XgMk/kubernetes-container-details?orgId=1&var-datasource=eqiad%2Bprometheus/k8s-dse&var-namespace=spark-history-test&var-container=All - https://alerts.wikimedia.org/?q=alertname%3DSparkHistoryTestServiceUnavailable
[09:50:11] (CR) CI reject: [V:-1] Remove duplication from parent pom. [analytics/refinery/source] - https://gerrit.wikimedia.org/r/1014546 (https://phabricator.wikimedia.org/T360219) (owner: Gehel)
[09:50:36] (CR) CI reject: [V:-1] Sort some refinery modules according to sortPom. [analytics/refinery/source] - https://gerrit.wikimedia.org/r/1015035 (https://phabricator.wikimedia.org/T360219) (owner: Gehel)
[10:03:53] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1003:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1003:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[10:18:15] (HdfsFSImageAge) firing: (2) The HDFS FSImage on analytics-test-hadoop:an-test-master1001:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-test-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge
[10:43:15] (HdfsFSImageAge) resolved: (2) The HDFS FSImage on analytics-test-hadoop:an-test-master1001:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-test-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge
[10:45:56] !log roll-restarting hadoop masters on the prod cluster for T356382
[10:45:58] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:04:36] We got another namenode failback failure when running the sre.hadoop.roll-restart-masters cookbook.
[11:05:41] !log sudo systemctl start hadoop-hdfs-namenode.service on an-master1003 after failed failback operation.
[11:05:43] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:06:48] This is the same pattern we've seen several times before. Migrating to the new nameservers and increasing the heap for the namenodes clearly hasn't worked.
[11:14:50] I have reopened https://phabricator.wikimedia.org/T310293 as a result.
[11:40:11] Data-Engineering, Growth-Team, MediaWiki-extensions-EventLogging, Metrics Platform Icebox, and 3 others: [EPIC] Deprecate EventLogging::logEvent() - https://phabricator.wikimedia.org/T318263#9712907 (Dreamy_Jazz)
[12:21:15] (HdfsCorruptBlocks) firing: HDFS corrupt blocks detected on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_corrupt_blocks - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCorruptBlocks
[12:40:58] Data-Engineering, Dumps-Generation, SRE, Data-Platform-SRE (2024.04.15 - 2024.05.05): Migrate Dumps Snapshot hosts from Buster to Bullseye - https://phabricator.wikimedia.org/T325228#9713057 (Gehel)
[12:41:49] Data-Engineering, Data-Platform-SRE (2024.04.15 - 2024.05.05), Patch-For-Review: Migrate matomo to Debian bullseye (or bookworm) - https://phabricator.wikimedia.org/T349397#9713055 (Gehel)
[12:45:37] (CR) Aqu: "We should reference the iceberg table in the comment usage and in Airflow, accessible from the database wmf_readership not wmf." [analytics/refinery] - https://gerrit.wikimedia.org/r/1015579 (https://phabricator.wikimedia.org/T361242) (owner: Milimetric)
[13:13:53] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1003:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1003:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[13:30:57] (CR) Mforns: Productionize CommonsCategoryGraphBuilder for CIM project (9 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/1015013 (https://phabricator.wikimedia.org/T358681) (owner: Mforns)
[14:11:04] Data-Engineering (Q4 2024 April 1st - June 30th): [Datasets Config][Spike] Understand and document the details and conflicts between Datasets Config, Refine refactor, Dynamic EventStreamConfig, and Metrics Platform Instrumentation Configurator - https://phabricator.wikimedia.org/T361853#9713508 (Ottomata)
[14:11:07] Data-Engineering (Q4 2024 April 1st - June 30th), Patch-For-Review: [Refine refactoring] Extract refine schema management into a dedicated tool - https://phabricator.wikimedia.org/T356762#9713509 (Ottomata)
[14:16:15] (HdfsCorruptBlocks) resolved: HDFS corrupt blocks detected on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_corrupt_blocks - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCorruptBlocks
[14:32:46] Data-Engineering, Data Products, FY2023-24-WE 2.1 Typography and palette customizations, Patch-For-Review, Web-Team-Backlog (FY2023-24 Q4 Sprint 2): Update Sample Rates for Metrics Platform Events - https://phabricator.wikimedia.org/T361962#9713629 (phuedx) >>! In T361962#9697087, @phuedx wro...
[14:44:39] Data-Engineering (Q4 2024 April 1st - June 30th), Spike: [SPIKE] Evaluate and document solutions for table-management tooling - https://phabricator.wikimedia.org/T360969#9713705 (lbowmaker)
[14:52:17] (PS3) Joal: Fix druid unique-devices daily aggregated monthly [analytics/refinery] - https://gerrit.wikimedia.org/r/1015579 (https://phabricator.wikimedia.org/T361242) (owner: Milimetric)
[14:53:36] (CR) Joal: "Done in usage-comments, airflow already uses the new iceberg tables." [analytics/refinery] - https://gerrit.wikimedia.org/r/1015579 (https://phabricator.wikimedia.org/T361242) (owner: Milimetric)
[15:02:25] Data-Engineering, Data Products (Data Products Sprint 13): Past edits increase in wmf.edit_hourly with every new snapshot - https://phabricator.wikimedia.org/T355182#9713798 (mpopov) @VirginiaPoundstone Howdy! This is not a blocking anything. Thanks for checking!
[15:04:14] Data-Engineering, Data-Services: Expose more properties to the user_properties_anon table on Wiki Replicas - https://phabricator.wikimedia.org/T226162#9713805 (odimitrijevic) a:odimitrijevic→None
[15:11:48] joal: re the fifo queue, confirmed with Erik we should not be using it
[15:25:17] Data-Engineering (Q4 2024 April 1st - June 30th), Data-Platform, Patch-For-Review: Unique devices tables have missing or incorrect data for January and February 2024 - https://phabricator.wikimedia.org/T361242#9713999 (MNeisler) I did a quick check and confirmed that the missing daily data issue appe...
[15:47:18] Data-Engineering, Data-Services: Expose more properties to the user_properties_anon table on Wiki Replicas - https://phabricator.wikimedia.org/T226162#9714113 (odimitrijevic) Open→Declined With the migration to liftwing these settings are no longer applicable. cc @calbon
[16:08:35] (CR) Aqu: [C:+2] Fix druid unique-devices daily aggregated monthly [analytics/refinery] - https://gerrit.wikimedia.org/r/1015579 (https://phabricator.wikimedia.org/T361242) (owner: Milimetric)
[16:18:56] (PS3) Aleksandar Mastilovic: Browser report queries updates [analytics/refinery] - https://gerrit.wikimedia.org/r/1018365
[16:24:10] (PS2) Aleksandar Mastilovic: Corrected version of pingback's version.hql [analytics/refinery] - https://gerrit.wikimedia.org/r/1019363
[16:32:45] (PS4) Aleksandar Mastilovic: WMCS HQL scripts [analytics/refinery] - https://gerrit.wikimedia.org/r/1017161
[16:39:30] (CR) Joal: [V:+2 C:+2] "Merging for tomorrow's deploy" [analytics/refinery] - https://gerrit.wikimedia.org/r/1015579 (https://phabricator.wikimedia.org/T361242) (owner: Milimetric)
[16:45:37] (PS1) Mforns: Add queries to format commons impact metrics data as dumps [analytics/refinery] - https://gerrit.wikimedia.org/r/1019845 (https://phabricator.wikimedia.org/T358701)
[16:58:46] (PS3) Aleksandar Mastilovic: Corrected version of pingback's version.hql [analytics/refinery] - https://gerrit.wikimedia.org/r/1019363
[17:00:42] (CR) Aleksandar Mastilovic: "The ORDER BY clause in "total_agg" CTE was being applied to the whole UNION ALL result, as opposed to just the latest results. This caused" [analytics/refinery] - https://gerrit.wikimedia.org/r/1019363 (owner: Aleksandar Mastilovic)
[17:27:20] (PS5) Aleksandar Mastilovic: WMCS HQL scripts [analytics/refinery] - https://gerrit.wikimedia.org/r/1017161
[17:27:55] (CR) Aleksandar Mastilovic: "Comments addressed - TYVM!" [analytics/refinery] - https://gerrit.wikimedia.org/r/1017161 (owner: Aleksandar Mastilovic)
[18:40:31] (PS1) Gmodena: refinery-job: add webrequest instrumentation. [analytics/refinery/source] - https://gerrit.wikimedia.org/r/1019867
[18:45:55] (CR) CI reject: [V:-1] refinery-job: add webrequest instrumentation. [analytics/refinery/source] - https://gerrit.wikimedia.org/r/1019867 (owner: Gmodena)
[18:49:55] Data-Engineering, Data Products, Observability-Logging, Traffic, Patch-For-Review: Move analytics log from Varnish to HAProxy - https://phabricator.wikimedia.org/T351117#9715083 (CodeReviewBot) gmodena opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/65...
[18:49:55] (PS2) Mforns: Add queries to format commons impact metrics data as dumps [analytics/refinery] - https://gerrit.wikimedia.org/r/1019845 (https://phabricator.wikimedia.org/T358701)
[18:52:31] (PS3) Mforns: Add queries to format commons impact metrics data as dumps [analytics/refinery] - https://gerrit.wikimedia.org/r/1019845 (https://phabricator.wikimedia.org/T358701)
[18:57:14] (CR) Aleksandar Mastilovic: [V:+2 C:+2] "Ready to merge." [analytics/refinery] - https://gerrit.wikimedia.org/r/1018365 (owner: Aleksandar Mastilovic)
[19:16:42] Data-Engineering, Community-Tech, Multiblocks, Data Products (Data Products Sprint 09), Event-Platform: Investigate if the new 'Multiblocks' user blocks feature affects the mediawiki.user-blocks-change event stream - https://phabricator.wikimedia.org/T356597#9715172 (VirginiaPoundstone...
[19:23:52] Data-Engineering, MediaWiki-extensions-EventLogging, Data Products (Data Products Sprint 10): Migrate EventLogging to JSDoc - https://phabricator.wikimedia.org/T357444#9715237 (VirginiaPoundstone) Open→Resolved
[19:23:56] Data-Engineering, MediaWiki-extensions-EventLogging, Metrics Platform Backlog, Data Products (Data Products Sprint 10), Technical-Debt: Fix public documentation for mw.eventLog.submit() and dispatch() - https://phabricator.wikimedia.org/T357003#9715242 (VirginiaPoundstone) Open→...
[19:24:13] Data-Engineering, Metrics Platform Backlog, Data Products (Data Products Sprint 10): [SPIKE] Draft of Mediawiki extension proposal for Metrics Platform Instrumentation (& Experimentation) - https://phabricator.wikimedia.org/T355599#9715247 (VirginiaPoundstone) Open→Resolved
[19:24:21] Data-Engineering, Metrics Platform Backlog, Data Products (Data Products Sprint 10), Spike: [SPIKE] Remove mentions of MetricsClient#dispatch() and the monoschema from documentation - https://phabricator.wikimedia.org/T355046#9715248 (VirginiaPoundstone) Open→Resolved
[19:26:12] Data-Engineering, Data-Engineering-Wikistats, Data Products: Missing contributor stats for Singapore - https://phabricator.wikimedia.org/T344624#9715269 (VirginiaPoundstone)
[19:37:53] (CR) Joal: [C:+1] "LGTM! Merge at will" [analytics/refinery] - https://gerrit.wikimedia.org/r/1019363 (owner: Aleksandar Mastilovic)
[19:38:25] thanks a lot dcausse for the confirmation :)
[19:42:22] (CR) Joal: "Two nits - I have not checked the logic, I believe you have tested the results are good :)" [analytics/refinery] - https://gerrit.wikimedia.org/r/1017161 (owner: Aleksandar Mastilovic)
[19:43:52] (CR) Joal: [C:+1] "Actually, one nit :)" [analytics/refinery] - https://gerrit.wikimedia.org/r/1019363 (owner: Aleksandar Mastilovic)
[19:44:12] (PS16) Xcollazo: Clean up and parameterize SQL code for Common Impact Metrics. [analytics/refinery] - https://gerrit.wikimedia.org/r/1016796 (https://phabricator.wikimedia.org/T358681)
[19:46:20] (CR) Xcollazo: [V:+2] "Patch set 16 solves the below comments, as well as the issue where we were emitting wiki database names instead of canonical names for the d" [analytics/refinery] - https://gerrit.wikimedia.org/r/1016796 (https://phabricator.wikimedia.org/T358681) (owner: Xcollazo)
[20:41:31] Data-Engineering, Data Pipelines, SRE, Traffic-Icebox: Mobile redirects drop provenance parameters - https://phabricator.wikimedia.org/T252227#9715528 (mpopov) >>! In T252227#9655162, @dr0ptp4kt wrote: > Okay, if I understand correctly, then the idea would be to... > > 1. Continue "allowing" tag...
[20:46:44] (PS6) Aleksandar Mastilovic: WMCS HQL scripts [analytics/refinery] - https://gerrit.wikimedia.org/r/1017161
[20:47:34] (CR) Aleksandar Mastilovic: "OK I've removed the ORDER BYs. I've tested the queries with and without these clauses and the results are indeed the same." [analytics/refinery] - https://gerrit.wikimedia.org/r/1017161 (owner: Aleksandar Mastilovic)
[20:51:12] (CR) Mforns: Clean up and parameterize SQL code for Common Impact Metrics. (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/1016796 (https://phabricator.wikimedia.org/T358681) (owner: Xcollazo)
[21:38:04] (CR) Aleksandar Mastilovic: [V:+2 C:+2] "Ready to merge" [analytics/refinery] - https://gerrit.wikimedia.org/r/1017161 (owner: Aleksandar Mastilovic)
[21:47:22] (Abandoned) Aleksandar Mastilovic: WMCS unpivoted data HQL script [analytics/refinery] - https://gerrit.wikimedia.org/r/1018331 (owner: Aleksandar Mastilovic)
[21:52:35] (PS1) Aleksandar Mastilovic: WMCS pivoted query update [analytics/refinery] - https://gerrit.wikimedia.org/r/1019932
[23:02:26] Data-Engineering, Growth-Team, MediaWiki-extensions-EventLogging, Metrics Platform Icebox, and 3 others: [EPIC] Deprecate EventLogging::logEvent() - https://phabricator.wikimedia.org/T318263#9715773 (Etonkovidova) a:Etonkovidova→None
[23:48:31] Data-Engineering, Cassandra, Data-Persistence: Encrypt Airflow connections to AQS Cassandra - https://phabricator.wikimedia.org/T362181#9715883 (Eevans) p:Triage→Medium