[02:17:23] Data-Engineering-Radar, Traffic, Patch-For-Review: Lock-in Varnish and VarnishKafka versions - https://phabricator.wikimedia.org/T304617#9682674 (CodeReviewBot) sukhe opened https://gitlab.wikimedia.org/repos/sre/varnishkafka/-/merge_requests/3 Release 1.1.0-4
[06:01:53] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1003:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1003:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[07:06:53] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1003:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1003:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[07:52:42] (CR) Aqu: [C:+2] Updating changelog to prepare next deployment [analytics/refinery/source] - https://gerrit.wikimedia.org/r/1016440 (owner: Santiago Faci)
[08:03:46] (Merged) jenkins-bot: Updating changelog to prepare next deployment [analytics/refinery/source] - https://gerrit.wikimedia.org/r/1016440 (owner: Santiago Faci)
[08:07:15] Starting build #139 for job analytics-refinery-maven-release-docker
[08:11:25] aqu: ^ that might be the first release since we switched to the new pom.xml. Let me know if there are any issues!
[08:23:34] Project analytics-refinery-maven-release-docker build #139: SUCCESS in 16 min: https://integration.wikimedia.org/ci/job/analytics-refinery-maven-release-docker/139/
[08:32:45] Starting build #100 for job analytics-refinery-update-jars-docker
[08:32:50] Project analytics-refinery-update-jars-docker build #100: FAILURE in 4.6 sec: https://integration.wikimedia.org/ci/job/analytics-refinery-update-jars-docker/100/
[08:34:10] Starting build #101 for job analytics-refinery-update-jars-docker
[08:34:14] Project analytics-refinery-update-jars-docker build #101: STILL FAILING in 4.6 sec: https://integration.wikimedia.org/ci/job/analytics-refinery-update-jars-docker/101/
[08:37:09] Starting build #102 for job analytics-refinery-update-jars-docker
[08:37:13] Project analytics-refinery-update-jars-docker build #102: STILL FAILING in 4.3 sec: https://integration.wikimedia.org/ci/job/analytics-refinery-update-jars-docker/102/
[08:50:06] (KafkaReplicationFactorTooLow) firing: (873) Kafka topic PlatformEvent_v1 replication factor is too low on jumbo-eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor - https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow
[08:55:06] (KafkaReplicationFactorTooLow) resolved: (873) Kafka topic PlatformEvent_v1 replication factor is too low on jumbo-eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor - https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow
[09:23:12] puppet has been disabled on schema1003 for nine days with a reference to T360412; given that the task is resolved, I'll re-enable puppet there
[09:23:13] T360412: Phase out cergen for Data Platform services - https://phabricator.wikimedia.org/T360412
[09:34:07] Data-Engineering: Create per-wiki user preference metrics - https://phabricator.wikimedia.org/T361684#9683478 (TheresNoTime)
[09:35:47] Data-Engineering: Create per-wiki
user preference metrics - https://phabricator.wikimedia.org/T361684#9683487 (TheresNoTime)
[09:38:06] Data-Engineering: Create per-wiki user preference metrics - https://phabricator.wikimedia.org/T361684#9683492 (TheresNoTime)
[09:59:01] (PS45) Cyndywikime: Add analytics for Impressions, Success and Abandonment of account creation [schemas/event/secondary] - https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273)
[10:00:16] (CR) CI reject: [V:-1] Add analytics for Impressions, Success and Abandonment of account creation [schemas/event/secondary] - https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) (owner: Cyndywikime)
[11:05:51] (PS46) Cyndywikime: Add analytics for Impressions, Success and Abandonment of account creation [schemas/event/secondary] - https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273)
[11:46:02] !log disable puppet on `an-test-client1002` to test new conda-analytics version T356231
[11:46:04] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:46:05] T356231: Package versions in Conda-Analytics are not pinned - https://phabricator.wikimedia.org/T356231
[12:16:31] Data-Engineering, Data Products, Metrics Platform Backlog: Create per-wiki user preference metrics - https://phabricator.wikimedia.org/T361684#9684061 (lbowmaker)
[12:52:03] Data-Engineering (Q4 2024 April 1st - June 30th), Structured-Data-Backlog: Make HTML Dumps available in hadoop - https://phabricator.wikimedia.org/T305688#9684217 (lbowmaker)
[13:07:57] noticing we have puppet failures across stat10[06-10] as well as on an-test-client1002 related to T361266 and their running processes.
[13:07:58] T361266: Offboard Michael Grosse from WMF systems - https://phabricator.wikimedia.org/T361266
[13:20:27] PROBLEM - Check if active EventStreams endpoint is delivering messages.
on alert1001 is CRITICAL: CRITICAL: No EventStreams message was consumed from https://stream.wikimedia.org/v2/stream/recentchange within 10 seconds. https://wikitech.wikimedia.org/wiki/Event_Platform/EventStreams/Administration
[13:50:17] RECOVERY - Check if active EventStreams endpoint is delivering messages. on alert1001 is OK: OK: An EventStreams message was consumed from https://stream.wikimedia.org/v2/stream/recentchange within 10 seconds. https://wikitech.wikimedia.org/wiki/Event_Platform/EventStreams/Administration
[14:19:06] (PS1) Xcollazo: WIP: Clean up and parameterize SQL code for Common Impact Metrics. [analytics/refinery] - https://gerrit.wikimedia.org/r/1016796 (https://phabricator.wikimedia.org/T358681)
[15:37:36] (PS1) Aqu: Add CLI to create or update Iceberg tables [analytics/refinery/source] - https://gerrit.wikimedia.org/r/1016808 (https://phabricator.wikimedia.org/T356762)
[15:37:44] (CR) CI reject: [V:-1] Add CLI to create or update Iceberg tables [analytics/refinery/source] - https://gerrit.wikimedia.org/r/1016808 (https://phabricator.wikimedia.org/T356762) (owner: Aqu)
[15:40:03] (PS2) Aqu: Add CLI to create or update Iceberg tables [analytics/refinery/source] - https://gerrit.wikimedia.org/r/1016808 (https://phabricator.wikimedia.org/T356762)
[16:59:06] (KafkaReplicationFactorTooLow) firing: (929) Kafka topic DataHubUpgradeHistory_v1 replication factor is too low on jumbo-eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor - https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow
[17:04:07] (KafkaReplicationFactorTooLow) resolved: (929) Kafka topic DataHubUpgradeHistory_v1 replication factor is too low on jumbo-eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor - https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow
[17:21:04] Data-Engineering (Sprint 9), Data-Platform,
Movement-Insights: Add movement insights group/users to MWH denormalize job alerts - https://phabricator.wikimedia.org/T357472#9685377 (Rmaung) Please add cg-data@wikimedia.org to the notification list! Thank you!!
[19:16:53] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1003:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1003:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[20:47:05] (KafkaReplicationFactorTooLow) firing: ...
[20:47:05] Kafka topic codfw.mediawiki.product_metrics.wikifunctions_ui replication factor is too low on jumbo-eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor - https://grafana.wikimedia.org/d/000000234/kafka-by-topic?var-kafka_cluster=jumbo-eqiad&var-kafka_broker=All&var-topic=codfw.mediawiki.product_metrics.wikifunctions_ui&viewPanel=40 - ...
[20:47:05] https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow
[20:52:05] (KafkaReplicationFactorTooLow) resolved: ...
[20:52:05] Kafka topic codfw.mediawiki.product_metrics.wikifunctions_ui replication factor is too low on jumbo-eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor - https://grafana.wikimedia.org/d/000000234/kafka-by-topic?var-kafka_cluster=jumbo-eqiad&var-kafka_broker=All&var-topic=codfw.mediawiki.product_metrics.wikifunctions_ui&viewPanel=40 - ...
[20:52:05] https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow
[21:16:53] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1003:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1003:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[21:46:23] Data-Engineering, Data-Platform, Movement-Insights: Add movement insights group/users to MWH denormalize job alerts - https://phabricator.wikimedia.org/T357472#9686306 (nshahquinn-wmf) Resolved→Open Reopening and moving back to triage since @Rmaung has an additional request.
[21:46:35] Data-Engineering, Data-Platform: Add movement insights group/users to MWH denormalize job alerts - https://phabricator.wikimedia.org/T357472#9686310 (nshahquinn-wmf)
[22:11:53] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1003:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1003:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[22:56:53] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1003:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1003:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[23:09:12] Data-Engineering, Data Pipelines, Spike: Refine Investigation - https://phabricator.wikimedia.org/T296529#9686517 (Ottomata) Open→Declined Being bold, reopen if needed.
[23:15:52] (CR) Ottomata: "This is cool!"
[analytics/refinery/source] - https://gerrit.wikimedia.org/r/1016808 (https://phabricator.wikimedia.org/T356762) (owner: Aqu)
[23:20:39] (CR) Ottomata: Add CLI to create or update Iceberg tables (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/1016808 (https://phabricator.wikimedia.org/T356762) (owner: Aqu)
[23:33:58] Data-Engineering (Q4 2024 April 1st - June 30th): Re-implement service runner to better support metrics and debugging - https://phabricator.wikimedia.org/T360924#9686561 (Ahoelzl)
[23:33:58] (CR) Ottomata: Extract RefineSingleApp code from Refine (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/1003745 (https://phabricator.wikimedia.org/T356363) (owner: Joal)
[23:35:42] (CR) Ottomata: Add CLI to create or update Iceberg tables (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/1016808 (https://phabricator.wikimedia.org/T356762) (owner: Aqu)
[23:37:38] Data-Engineering (Q4 2024 April 1st - June 30th): Replace service runner with a simplified library to better support metrics and debugging - https://phabricator.wikimedia.org/T360924#9686569 (Ahoelzl)
[23:40:53] Data-Engineering, Patch-For-Review: [NEEDS GROOMING][SPIKE] Extract refine schema management into a dedicated tool - https://phabricator.wikimedia.org/T356762#9686574 (Ottomata) Just read [[ https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/1003745 | Antione's patch ]] and I think I'm missing...
[23:41:04] (CR) Ottomata: Add CLI to create or update Iceberg tables (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/1016808 (https://phabricator.wikimedia.org/T356762) (owner: Aqu)
[23:41:10] (CR) Ottomata: Extract RefineSingleApp code from Refine (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/1003745 (https://phabricator.wikimedia.org/T356363) (owner: Joal)
[23:49:30] Data-Engineering (Q4 2024 April 1st - June 30th): [Dataset Config Store] Setup initial CI checks - https://phabricator.wikimedia.org/T357468#9686578 (Ahoelzl) a:tchin