[00:01:05] (KafkaReplicationFactorTooLow) firing: ... [00:01:05] Kafka topic eqiad.mediawiki.job.LoginNotifyPurgeSeen replication factor is too low on main-eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor - https://grafana.wikimedia.org/d/000000234/kafka-by-topic?var-kafka_cluster=main-eqiad&var-kafka_broker=All&var-topic=eqiad.mediawiki.job.LoginNotifyPurgeSeen&viewPanel=40 - https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow [00:06:05] (KafkaReplicationFactorTooLow) resolved: ... [00:06:05] Kafka topic eqiad.mediawiki.job.LoginNotifyPurgeSeen replication factor is too low on main-eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor - https://grafana.wikimedia.org/d/000000234/kafka-by-topic?var-kafka_cluster=main-eqiad&var-kafka_broker=All&var-topic=eqiad.mediawiki.job.LoginNotifyPurgeSeen&viewPanel=40 - https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow [00:10:58] 06Data-Engineering, 06Data Products, 06Movement-Insights, 10Movement-Metrics: Data Quality Issue: Wikitext History Job fail / rerun in Airflow - https://phabricator.wikimedia.org/T342911#9660060 (10lbowmaker) [00:11:56] 10Data-Engineering (Q4 2024 April 1st - June 30th), 06Data Products, 06Movement-Insights, 10Movement-Metrics: Data Quality Issue: Wikitext History Job fail / rerun in Airflow - https://phabricator.wikimedia.org/T342911#9660062 (10lbowmaker) [00:12:35] 10Data-Engineering (Sprint 9), 10Data Pipelines: [Refine refactoring] Refactor and migrate navigationtiming to Airflow - https://phabricator.wikimedia.org/T356192#9660065 (10lbowmaker) [00:13:33] 06Data-Engineering, 06Data Products, 10MediaWiki-extensions-WikimediaEvents, 06Web-Team-Backlog, 13Patch-For-Review: Update mediawiki.web_ui_actions Stream Config - https://phabricator.wikimedia.org/T360955#9660066 (10lbowmaker) [00:15:18] 06Data-Engineering, 10Data Pipelines, 10Data-Catalog: Spike: 
Integrate Spark with DataHub - https://phabricator.wikimedia.org/T306896#9660068 (10lbowmaker) [00:20:55] 06Data-Engineering: [Developer Experience] Implement CI hql Linting - https://phabricator.wikimedia.org/T360967 (10lbowmaker) 03NEW [00:23:30] 06Data-Engineering, 07Spike: [Developer Experience] [SPIKE] Investigate process to automate deployment of hdfs artifacts - https://phabricator.wikimedia.org/T360968 (10lbowmaker) 03NEW [00:29:02] 06Data-Engineering, 07Spike: [SPIKE] Investigate OpenHouse as a data lake management tool - https://phabricator.wikimedia.org/T360969 (10lbowmaker) 03NEW [02:01:04] (03PS4) 10Snwachukwu: Mediawiki History Data Quality Metrics [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/1008934 (https://phabricator.wikimedia.org/T354692) [02:01:19] 10Quarry, 10ChangeProp, 06collaboration-services, 10GitLab, and 9 others: Figure out a plan to move forward with regarding Redis License changes - https://phabricator.wikimedia.org/T360596#9660226 (10tstarling) PhpRedis is getting behind KeyDB with [[https://github.com/phpredis/phpredis/issues/2466|#2466]]... 
[02:04:32] (03PS5) 10Snwachukwu: Mediawiki History Data Quality Metrics [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/1008934 (https://phabricator.wikimedia.org/T354692) [04:25:53] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1003:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1003:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [07:20:53] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1003:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1003:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [10:13:43] 10Data-Engineering (Sprint 9), 13Patch-For-Review: [Dataset Config Store] Deploy poc to dse-k8s - https://phabricator.wikimedia.org/T357434#9660734 (10tchin) a:03tchin [10:39:07] 06Data-Engineering, 10Data-Platform-SRE (2024.03.25 - 2024.04.14), 13Patch-For-Review: Update the From: addresses of all email from DPE pipelines so that they use routable addresses - https://phabricator.wikimedia.org/T358675#9660816 (10BTullis) I have now deployed this patch to refinery, so that all refiner... 
[10:39:36] 06Data-Engineering, 10Data-Platform-SRE (2024.03.25 - 2024.04.14), 13Patch-For-Review: 14Monitor the availability of the superset deployments - 14https://phabricator.wikimedia.org/T356484#9660817 (10brouberol) 05Open→03Resolved [10:40:44] 06Data-Engineering, 10Data-Platform-SRE (2024.03.25 - 2024.04.14), 07Epic: 14Migrate the Analytics Superset instances to our DSE Kubernetes cluster - 14https://phabricator.wikimedia.org/T347710#9660819 (10brouberol) 05Open→03Resolved a:03brouberol [10:41:56] (03CR) 10Gmodena: "LGTM but could you maybe:" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/1008934 (https://phabricator.wikimedia.org/T354692) (owner: 10Snwachukwu) [10:42:55] 06Data-Engineering, 10superset.wikimedia.org, 10Data-Platform-SRE (2024.03.25 - 2024.04.14): 14Superset Timeout Logging - 14https://phabricator.wikimedia.org/T294772#9660831 (10BTullis) 05Open→03Declined 14I'll be bold and decline this ticket, but please feel free to reopen it if anyone feels strong... [10:49:47] btullis: hello! [10:51:39] btullis: I've seen you have closed T280905, and reading the ticket makes me wonder how much it would cost to migrate the services running on those coords to k8s - just an idea/question - let me know if I'm crazy :) [10:51:40] T280905: Analytics coordinator failover improvements - https://phabricator.wikimedia.org/T280905 [11:11:40] (03CR) 10Joal: [C:03+1] "I have not triple checked the code logic but I trust you have. 
The `TEMPORARY VIEW `approach is great :) If it's been tested and validated" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/1006970 (owner: 10Aleksandar Mastilovic) [11:13:39] (03CR) 10Joal: [V:03+2 C:03+2] "Merge for later deploy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/1011434 (https://phabricator.wikimedia.org/T360303) (owner: 10Gerrit maintenance bot) [11:13:54] 10Quarry, 10ChangeProp, 06collaboration-services, 10GitLab, and 9 others: Figure out a plan to move forward with regarding Redis License changes - https://phabricator.wikimedia.org/T360596#9660965 (10larissagaulia) [11:14:41] (03CR) 10Joal: [V:03+2 C:03+2] "Merging for later deploy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/1011437 (https://phabricator.wikimedia.org/T360310) (owner: 10Gerrit maintenance bot) [11:24:46] joal: Good questions and not crazy at all. The an-coord100[3-4] machines *are* basically stateless now. [11:28:56] right btullis - maybe some next-steps migration when you guys will have done what's already on your plate :) [11:29:56] Would you like to meet for a few minutes now to talk about the pros and cons? Or put something in the diary? [11:32:17] Migrating Hive (server2 and metastore) to k8s would likely be fairly easy at this point. Probably quite quick. [11:34:37] Migrating Presto to dse-k8s is potentially very interesting because there are so many more combinations of how it could be done. [11:35:20] btullis: already in meeting, many meetings today - I'll ping you later :) [11:35:32] joal: Ack. Any time. [11:43:54] brouberol: there's an uncommitted private puppet change for an-tool1010, I believe that's you? [11:44:16] oops, sorry. on it [11:44:45] brouberol: There's an alert in #wikimedia-data-platform-alerts about superset-staging on dse-k8s. Is that due to a re-deployment or similar? Is it expected? [11:45:30] taavi: fixed [11:45:39] thank you! 
[11:45:43] np [11:46:24] btullis: I found out that superset-next has no helm releases, for some unexplained reason [11:46:45] I've had to perform manual helm commands to redeploy it, and before I did, I think this alert triggered [11:47:06] I need to log off for now, but I'll investigate in the afternoon. Superset-next is running now [11:47:15] Cool. Was just wondering. See ya later. [12:03:33] jayme: brouberol: Are you happy for me to do an admin_ng apply on dse-k8s at the moment? I see a pending change for T360612 so thought it best to check. [12:03:34] T360612: Add redis (rdb) instances to external-services - https://phabricator.wikimedia.org/T360612 [12:05:38] btullis: it's good to deploy [12:05:49] Thanks. [13:19:53] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1003:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1003:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [13:24:40] 10Data-Engineering (Sprint 9), 06Data-Platform, 06Movement-Insights: Add movement insights group/users to MWH denormalize job alerts - https://phabricator.wikimedia.org/T357472#9661407 (10JAllemandou) Done using airflow variable mechanism. 
[13:44:24] Sorry, I’m afk for a bit, for a personal errand [13:44:53] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1003:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1003:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [13:45:56] 06Data-Engineering: Migrate the anomaly detection job to DeeQu checks - https://phabricator.wikimedia.org/T361014 (10lbowmaker) 03NEW [13:50:33] 06Data-Engineering: [Data Quality] Migrate the anomaly detection job to DeeQu checks - https://phabricator.wikimedia.org/T361014#9661577 (10lbowmaker) [13:53:35] 06Data-Engineering: [Data Quality] Migrate MWHistoryChecker to DeeQu checks - https://phabricator.wikimedia.org/T361016 (10lbowmaker) 03NEW [13:57:45] 06Data-Engineering, 10Event-Platform, 07Spike: [SPIKE] Can we express Event Platform configs in config store? - https://phabricator.wikimedia.org/T361017 (10gmodena) 03NEW [13:58:18] (03PS1) 10Gehel: Sort pom.xml according to standard sortpom order. [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/1014516 (https://phabricator.wikimedia.org/T360219) [13:58:19] (03PS1) 10Gehel: Start using wmf-jvm-parent-pom. [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/1014517 [13:59:51] (03CR) 10Gehel: "There are test failures that seem related to invalid symlinks in refinery-hive/test/resources (the maxmind databases). I vaguely remember " [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/1014517 (owner: 10Gehel) [14:00:37] joal: I did some work to migrate refinery to the new parent pom (well... start using it). I'm having trouble with symlinks (see comment on https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/1014517). Does this ring a bell? [14:05:21] (03CR) 10CI reject: [V:04-1] Start using wmf-jvm-parent-pom. 
[analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/1014517 (owner: 10Gehel) [14:31:24] (03CR) 10Gehel: "This is probably related to https://issues.apache.org/jira/browse/MRESOURCES-237" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/1014517 (owner: 10Gehel) [14:33:09] gehel: Hi! thank you for doing this - I can recall issues with the maxmind testing, but I can't remember symlink being a part of it :( [14:41:10] * brouberol is back [14:45:52] btullis: wrt T353774, is there anything to do apart from running the decommissioning cookbook & removing hiera/secrets for the associated hosts? [14:45:53] T353774: Decom an-coord100[1-2] - https://phabricator.wikimedia.org/T353774 [15:04:44] brouberol: I don't /think/ so :-) I think that they're all ready for the cookbook. You could update the spreadsheet to rub out two more buster servers: https://docs.google.com/spreadsheets/d/1Obj5ozGQYl7Zei0MBLELVD8eDGqqsF_t9T3ZbrOsmZg/edit#gid=0 [15:05:10] nice! On it [15:05:20] do you have a preferred one I should keep for last? [15:05:26] Or any last words? [15:06:31] Nope, fire at will :-) [15:09:31] * brouberol nods and presses the red button [15:12:48] (03PS2) 10Gehel: Start using wmf-jvm-parent-pom. [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/1014517 [15:14:12] joal: I found the issue ^ [15:14:25] joal: note that this isn't entirely ready to be review just yet. [15:15:55] 06Data-Engineering, 10Data-Platform-SRE (2024.03.25 - 2024.04.14): Deprecate Hue and stop the services - https://phabricator.wikimedia.org/T341895#9661955 (10brouberol) @lbowmaker I'm ready to kill the hue server. I've heard from the [[ https://wikimedia.slack.com/archives/CLKDS4MG9/p1693594875458559?thread_ts... 
[15:38:14] * brouberol for I have become the destroyer of an-coords [15:39:26] 💥 [15:39:43] and for the final rites: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1014533?tab=checks [15:47:31] (03PS2) 10Kimberly Sarabia: Desktopwebuiactionstracking: add missing `show` to action enum [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/1008928 (https://phabricator.wikimedia.org/T359182) (owner: 10DLynch) [15:55:01] 06Data-Engineering, 10Data-Platform-SRE (2024.03.25 - 2024.04.14): Deprecate Hue and stop the services - https://phabricator.wikimedia.org/T341895#9662087 (10lbowmaker) >>! In T341895#9529415, @lbowmaker wrote: > The new version of Superset allows nested data types to be visualized and we have spoken with user... [15:55:43] (03PS10) 10Gmodena: development: add webrequest schema [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/983898 (https://phabricator.wikimedia.org/T314956) (owner: 10Ottomata) [15:56:24] (03CR) 10CI reject: [V:04-1] development: add webrequest schema [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/983898 (https://phabricator.wikimedia.org/T314956) (owner: 10Ottomata) [15:58:16] 06Data-Engineering, 07Spike: [Status Store] [SPIKE] Investigate and document approach for Iceberg Sensors - https://phabricator.wikimedia.org/T360922#9662101 (10Ahoelzl) [16:03:14] (03PS11) 10Gmodena: development: add webrequest schema [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/983898 (https://phabricator.wikimedia.org/T314956) (owner: 10Ottomata) [16:03:40] (03CR) 10CI reject: [V:04-1] development: add webrequest schema [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/983898 (https://phabricator.wikimedia.org/T314956) (owner: 10Ottomata) [16:10:30] (03CR) 10Joal: Start using wmf-jvm-parent-pom. (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/1014517 (owner: 10Gehel) [16:10:38] one comment --^ Good catch gehel ! 
[16:14:08] (03PS1) 10Gehel: Remove duplication from parent pom. [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/1014546 (https://phabricator.wikimedia.org/T360219) [16:15:51] 10Data-Engineering (Sprint 9): Airflow mapped tasks UI & metrics - https://phabricator.wikimedia.org/T357430#9662196 (10Antoine_Quhen) The Airflow PR has been merged and should be released in Airflow 2.9 in April. [16:17:28] 10Data-Engineering (Sprint 9): [Refine Refactoring] Orchestrate Airflow execution of navigationtiming from config store - https://phabricator.wikimedia.org/T356360#9662207 (10Antoine_Quhen) 5 datasets are being refined as a POC on the prod cluster. 2 on the test cluster. [16:18:06] 10Data-Engineering (Sprint 9), 10Data Pipelines: 14[Refine refactoring] Refactor and migrate navigationtiming to Airflow - 14https://phabricator.wikimedia.org/T356192#9662214 (10Antoine_Quhen) 05Open→03Resolved a:03Antoine_Quhen [16:18:24] 10Data-Engineering (Sprint 9), 06Data-Platform, 06Movement-Insights: Add movement insights group/users to MWH denormalize job alerts - https://phabricator.wikimedia.org/T357472#9662221 (10Ahoelzl) a:03JAllemandou [16:20:09] 06Data-Engineering, 10Data-Platform-SRE (2024.03.25 - 2024.04.14): Deprecate Hue and stop the services - https://phabricator.wikimedia.org/T341895#9662233 (10brouberol) a:03brouberol [16:22:58] (03PS3) 10Gehel: Start using wmf-jvm-parent-pom. [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/1014517 [16:22:58] (03PS2) 10Gehel: Remove duplication from parent pom. [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/1014546 (https://phabricator.wikimedia.org/T360219) [16:24:07] (03PS12) 10Gmodena: development: add webrequest schema [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/983898 (https://phabricator.wikimedia.org/T314956) (owner: 10Ottomata) [16:24:18] (03CR) 10Gehel: Start using wmf-jvm-parent-pom. 
(031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/1014517 (owner: 10Gehel) [16:24:41] (03CR) 10CI reject: [V:04-1] development: add webrequest schema [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/983898 (https://phabricator.wikimedia.org/T314956) (owner: 10Ottomata) [16:26:51] (03CR) 10CI reject: [V:04-1] Remove duplication from parent pom. [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/1014546 (https://phabricator.wikimedia.org/T360219) (owner: 10Gehel) [16:27:19] (03PS13) 10Gmodena: development: add webrequest schema [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/983898 (https://phabricator.wikimedia.org/T314956) (owner: 10Ottomata) [16:28:00] (03CR) 10CI reject: [V:04-1] development: add webrequest schema [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/983898 (https://phabricator.wikimedia.org/T314956) (owner: 10Ottomata) [16:31:38] (03PS14) 10Gmodena: development: add webrequest schema [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/983898 (https://phabricator.wikimedia.org/T314956) (owner: 10Ottomata) [16:32:03] (03CR) 10CI reject: [V:04-1] development: add webrequest schema [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/983898 (https://phabricator.wikimedia.org/T314956) (owner: 10Ottomata) [16:44:11] (03PS1) 10Jennifer Ebe: Update Eqiad Targets [analytics/refinery/scap] - 10https://gerrit.wikimedia.org/r/1014561 [16:45:32] (03CR) 10Milimetric: [C:03+2] Update Eqiad Targets [analytics/refinery/scap] - 10https://gerrit.wikimedia.org/r/1014561 (owner: 10Jennifer Ebe) [16:45:52] (03CR) 10Milimetric: [V:03+2 C:03+2] Update Eqiad Targets [analytics/refinery/scap] - 10https://gerrit.wikimedia.org/r/1014561 (owner: 10Jennifer Ebe) [17:05:20] (03PS1) 10Gehel: Sort the dependencyManagement section according to sortPom. 
[analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/1014564 (https://phabricator.wikimedia.org/T360219) [17:51:02] 06Data-Engineering, 07Spike: [Developer Experience] [SPIKE] Investigate process to automate deployment of folders and artifacts to HDFS - https://phabricator.wikimedia.org/T360968#9662682 (10JAllemandou) [17:55:51] 06Data-Engineering: [Airflow] SparkSqlOperator fails when executing via Skein with master=local - https://phabricator.wikimedia.org/T359435#9662688 (10JAllemandou) We currently have use-cases doing this exactly that work. there must have been another issue than the pone described here. I think this ticket is inv... [18:26:03] (03CR) 10Esanders: [C:03+2] Desktopwebuiactionstracking: add missing `show` to action enum [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/1008928 (https://phabricator.wikimedia.org/T359182) (owner: 10DLynch) [18:27:05] (03Merged) 10jenkins-bot: Desktopwebuiactionstracking: add missing `show` to action enum [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/1008928 (https://phabricator.wikimedia.org/T359182) (owner: 10DLynch) [18:37:41] 06Data-Engineering, 10Observability-Logging, 06Traffic, 10Event-Platform, 13Patch-For-Review: Remove extra fields currently sent to Kafka - https://phabricator.wikimedia.org/T360642#9662826 (10Ottomata) >> meta.id > Do you know who set these fields with the current webrequest flow? It isn't set for curr... [18:50:48] (03CR) 10Ottomata: development: add webrequest schema (031 comment) [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/983898 (https://phabricator.wikimedia.org/T314956) (owner: 10Ottomata) [19:02:33] (03CR) 10Gehel: "Done" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/1014517 (owner: 10Gehel) [19:47:24] (03PS1) 10Gehel: Move version configuration of dependencies to main pom. 
[analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/1014601 (https://phabricator.wikimedia.org/T360219) [19:48:10] joal: for when you have time, that string of 5 CR should be ready for review and should be good enough to have refinery start using our new parent pom. [20:01:47] \o/ [20:01:52] Will review tomorrow :) [21:45:54] 06Data-Engineering, 06Product-Analytics, 10Wmfdata-Python: Remove "master" terminology from wmfdata-python - https://phabricator.wikimedia.org/T272220#9663787 (10Uzume) I am not really against removal of "slave" terms as there and usually plenty of other more precise words that can be used that are unrelated...