[05:25:53] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1003:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1003:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[07:10:53] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1003:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1003:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[08:32:24] * brouberol waves good morning!
[08:32:42] btullis: I'm available at your convenience if you want to start migrating superset-next for real
[09:51:22] Great! Let's do it. We can chat in the sync.
[10:34:06] Data-Platform-SRE ( 2024.02.12 - 2024.03.03), Patch-For-Review: Serve Superset static assets from an optimised container - https://phabricator.wikimedia.org/T357890#9575805 (BTullis) Open→Resolved
[10:34:08] Data-Engineering, Data-Platform-SRE, Epic: Migrate the Analytics Superset instances to our DSE Kubernetes cluster - https://phabricator.wikimedia.org/T347710#9575806 (BTullis)
[10:48:00] Data-Platform-SRE ( 2024.02.12 - 2024.03.03), DC-Ops, SRE, ops-eqiad, Patch-For-Review: Q#:rack/setup/install an-redacteddb1001 - https://phabricator.wikimedia.org/T355571#9575848 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host an-redacteddb...
[10:48:53] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1003:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1003:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[11:10:39] btullis: I've redeployed superset-next service in dse-k8s. However, I'm still seeing 404s on static assets when visiting https://superset-next-k8s.wikimedia.org/login/, and the only requests I see on the nginx side are GET /login, GET /health and GET /static/assets/manifest.json
[11:11:18] brouberol: OK, right in the middle of the presto coordinator migration at the moment. Will look in a sec.
[11:11:29] How did you figure out the requests for statics were routed to wikikube?
[11:11:31] no worries
[11:17:18] brouberol: you need to list the domain in `cache::alternate_domains` in hiera, or otherwise it's treated as a mediawiki domain with mediawiki-like static assets which are handled separately
[11:17:52] oh, thank you! I completely missed that
[11:23:57] Data-Engineering, Data-Platform-SRE, Epic: Declare the superset domains as alternate domains - https://phabricator.wikimedia.org/T358479#9575976 (brouberol)
[11:24:24] Data-Platform-SRE, Epic: Upgrade the Data Engineering infrastructure to Debian Bullseye - https://phabricator.wikimedia.org/T288804#9575994 (BTullis)
[11:24:27] Data-Platform-SRE (2024.01.22 - 2024.02.11), Patch-For-Review: Wmfdata should connect to Presto using the analytics-presto CNAME - https://phabricator.wikimedia.org/T345482#9575993 (BTullis)
[11:24:38] Data-Platform-SRE ( 2024.02.12 - 2024.03.03), Patch-For-Review: Bring an-coord100[3-4] into service - https://phabricator.wikimedia.org/T336045#9575995 (BTullis)
[11:25:42] Data-Platform-SRE ( 2024.02.12 - 2024.03.03), Patch-For-Review: Bring an-coord100[3-4] into service - https://phabricator.wikimedia.org/T336045#9575990 (BTullis) Open→Resolved We have carried out the presto coordinator migration and all went as planned. {F42144114,width=40%} Although we initial...
[11:27:39] Data-Engineering, Data-Platform-SRE, Epic: Remove all resources associated with the superset-(next-)k8s.wimedia.org domains - https://phabricator.wikimedia.org/T358480#9576003 (brouberol)
[11:27:54] Thanks again taavi. I had no idea about that either :-)
[11:28:03] Data-Engineering, Data-Platform-SRE ( 2024.02.12 - 2024.03.03), Epic, Patch-For-Review: Declare the superset domains as alternate domains - https://phabricator.wikimedia.org/T358479#9576014 (brouberol)
[11:30:41] Data-Engineering, Data-Platform-SRE ( 2024.02.12 - 2024.03.03): Create saved views for the superset deployment logs - https://phabricator.wikimedia.org/T356485#9576029 (brouberol) Open→Resolved
[11:30:44] Data-Engineering, Data-Platform-SRE, Epic: Migrate the Analytics Superset instances to our DSE Kubernetes cluster - https://phabricator.wikimedia.org/T347710#9576030 (brouberol)
[11:33:41] Data-Platform-SRE ( 2024.02.12 - 2024.03.03): Check home/HDFS leftovers of andyrussg - https://phabricator.wikimedia.org/T338234#9576034 (BTullis) a:BTullis I've had confirmation from Dylan Kozlowski that it's fine to delete the data.
[11:41:35] Data-Platform-SRE ( 2024.02.12 - 2024.03.03): Check home/HDFS leftovers of andyrussg - https://phabricator.wikimedia.org/T338234#9576043 (BTullis) Open→Resolved Removing home directory files: ` btullis@cumin1002:~$ sudo cumin 'C:profile::analytics::cluster::client or C:profile::hadoop::master or C:pr...
[11:42:49] Data-Platform-SRE ( 2024.02.12 - 2024.03.03), DC-Ops, SRE, ops-eqiad: Q#:rack/setup/install an-redacteddb1001 - https://phabricator.wikimedia.org/T355571#9576047 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host an-redacteddb1001.eqiad.wmnet with OS b...
[11:44:11] Data-Platform-SRE ( 2024.02.12 - 2024.03.03), DC-Ops, SRE, ops-eqiad: Q#:rack/setup/install an-redacteddb1001 - https://phabricator.wikimedia.org/T355571#9576050 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host an-redacteddb1001.eqiad.wmnet with...
[11:54:46] Data-Platform-SRE ( 2024.02.12 - 2024.03.03), Patch-For-Review: Bring stat1011 into service - https://phabricator.wikimedia.org/T354526#9576091 (BTullis) Open→Resolved
[11:54:48] Data-Engineering, Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install stat1011.eqiad.wmnet - https://phabricator.wikimedia.org/T342454#9576092 (BTullis)
[12:18:53] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1003:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1003:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[12:33:11] taavi: thanks again, we're now able to load our statics. I don't know how I would have found what needed to be done
[13:24:38] Data-Engineering, MediaWiki-extensions-EventLogging, Data Products (Data Products Sprint 10), Patch-For-Review: Migrate EventLogging to JSDoc - https://phabricator.wikimedia.org/T357444#9576231 (phuedx)
[13:25:18] Data-Engineering, MediaWiki-extensions-EventLogging, Data Products (Data Products Sprint 10), Patch-For-Review: Migrate EventLogging to JSDoc - https://phabricator.wikimedia.org/T357444#9576233 (phuedx) a:apaskulin
[13:33:13] Data-Engineering, Data-Platform-SRE ( 2024.02.12 - 2024.03.03): [superset-k8s] Find a solution for the requestctl-generator html page - https://phabricator.wikimedia.org/T356490#9576250 (brouberol) Now that the reverse proxy container is in place, with the statics on the filesystem, we should add the req...
[14:18:19] Data-Engineering, Data-Platform-SRE, Epic: Remove all resources associated with the superset-(next-)k8s.wimedia.org domains - https://phabricator.wikimedia.org/T358480#9576414 (Gehel) p:Triage→Medium
[14:19:27] Data-Engineering, Data-Platform-SRE ( 2024.02.12 - 2024.03.03): Declare the superset domains as alternate domains - https://phabricator.wikimedia.org/T358479#9576421 (Gehel) p:Triage→High
[14:49:51] Data-Engineering, MW-Interfaces-Team: Getting added_lines data - https://phabricator.wikimedia.org/T331150#9576526 (lbowmaker) @MW-Interfaces-Team - do we have an API to get added_lines? @Leaderboard - Our team doesn’t plan on implementing this stream anytime soon but you could make use of this existing...
[14:52:22] Data-Engineering, Data Products, Observability-Logging, Traffic, Patch-For-Review: Change HAProxy log-format to support missing information - https://phabricator.wikimedia.org/T358105#9576534 (Fabfur) Applied to cp4037 the new log format to check eventual errors
[15:04:57] Data-Engineering, Data-Platform-SRE, Observability-Alerting: Explore the use of Airflow notifiers for more flexible DAG failure handling - https://phabricator.wikimedia.org/T343234#9576579 (BTullis) p:Low→Medium
[15:09:02] Data-Engineering (Sprint 9), Data Products, Structured-Data-Backlog: [Maintenance] Set up deletion jobs for Structured Data's data pipelines - https://phabricator.wikimedia.org/T347561#9576590 (mfossati) >>! In T347561#9494134, @JAllemandou wrote: > Thanks a log for not forgetting about this ticket @...
[17:09:03] Data-Engineering, Data-Platform-SRE ( 2024.02.12 - 2024.03.03): Declare the superset domains as alternate domains - https://phabricator.wikimedia.org/T358479#9577244 (brouberol) Open→Resolved
[17:09:08] Data-Engineering, Data-Platform-SRE, Epic: Migrate the Analytics Superset instances to our DSE Kubernetes cluster - https://phabricator.wikimedia.org/T347710#9577245 (brouberol)
[17:55:01] (PS4) Aleksandar Mastilovic: Add HQL file for CX report [analytics/refinery] - https://gerrit.wikimedia.org/r/1003928
[19:09:03] (GobblinKafkaRecordsExtractedNotEqualRecordsExpected) firing: Gobblin job event_default ingested an unexpected number of records for a Kafka topic partition. ...
[19:09:03] - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin - https://grafana.wikimedia.org/d/pAQaJwEnk/gobblin?orgId=1&var-gobblin_job_name=event_default&var-kafka_topic=eqiad.mediawiki.cirrussearch.page_rerender.v1&viewPanel=4 - https://alerts.wikimedia.org/?q=alertname%3DGobblinKafkaRecordsExtractedNotEqualRecordsExpected
[19:12:06] Data-Engineering (Sprint 9), ChangeProp, observability, service-runner, Event-Platform: Upgrade prom-client in NodeJS service-runner and enable collectDefaultMetrics - https://phabricator.wikimedia.org/T350180#9577828 (gmodena) a:gmodena
[19:27:15] (CR) Joal: "Minimal changes - should be ready for tomorrow :)" [analytics/refinery] - https://gerrit.wikimedia.org/r/1003928 (owner: Aleksandar Mastilovic)
[20:09:04] (GobblinKafkaRecordsExtractedNotEqualRecordsExpected) resolved: Gobblin job event_default ingested an unexpected number of records for a Kafka topic partition. ...
[20:09:04] - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin - https://grafana.wikimedia.org/d/pAQaJwEnk/gobblin?orgId=1&var-gobblin_job_name=event_default&var-kafka_topic=eqiad.mediawiki.cirrussearch.page_rerender.v1&viewPanel=4 - https://alerts.wikimedia.org/?q=alertname%3DGobblinKafkaRecordsExtractedNotEqualRecordsExpected
[21:58:53] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1003:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1003:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage