[01:00:41] Analytics, AQS2.0, Tech-Docs-Team, API Platform (AQS 2.0 Roadmap), and 6 others: AQS 2.0 documentation - https://phabricator.wikimedia.org/T288664 (apaskulin)
[01:28:51] (HdfsCorruptBlocks) resolved: HDFS corrupt blocks detected on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_corrupt_blocks - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCorruptBlocks
[06:12:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[06:47:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[07:18:46] good morning! btullis, do you know at what interval is the hdfs rebalancer systemd timer supposed to kick in? I'm still seeing pretty widespread disk usage between new and old workers.
[07:23:33] Data-Platform-SRE, Discovery-Search (Current work): Add Zookeeper config to 'rdf-streaming-updater' test service on DSE cluster - https://phabricator.wikimedia.org/T344614 (elukey) >>! In T344614#9161898, @bking wrote: > The flink-app in dse-k8s is healthy again, but I have no evidence that it's talking...
[07:37:37] Data-Platform-SRE, Discovery-Search (Current work): Add Zookeeper config to 'rdf-streaming-updater' test service on DSE cluster - https://phabricator.wikimedia.org/T344614 (elukey) The flink cluster in eqiad looks healthy: ` elukey@flink-zk1001:~$ echo "srvr" | nc localhost 2181 Zookeeper version: 3.8.0...
[07:55:40] Data-Platform-SRE, Discovery-Search (Current work): Add Zookeeper config to 'rdf-streaming-updater' test service on DSE cluster - https://phabricator.wikimedia.org/T344614 (elukey) @bking another thing to verify: ` root@deploy1002:~# kubectl logs flink-app-wdqs-54cd5c5567-zjq7r -n rdf-streaming-updater...
[08:12:06] Data-Engineering, Data Engineering and Event Platform Team, Event-Platform: [TEMPLATE] Onboard request for APPLICATION NAME to Event Platform - https://phabricator.wikimedia.org/T346207 (gmodena)
[08:15:45] Data-Engineering, Data Engineering and Event Platform Team (Sprint 2), Event-Platform: Document the onboarding journey on Event Platfrom - https://phabricator.wikimedia.org/T345193 (gmodena) I moved the Google doc draft to https://wikitech.wikimedia.org/wiki/Event_Platform/Onboarding and created a ph...
[08:21:28] Data-Engineering, Data Engineering and Event Platform Team, Event-Platform: [TEMPLATE] Onboard request for APPLICATION NAME to Event Platform - https://phabricator.wikimedia.org/T346207 (gmodena)
[08:27:28] brouberol: It looks like 06:00 every day, I think: https://github.com/wikimedia/operations-puppet/blob/production/modules/profile/manifests/hadoop/balancer.pp#L25
[08:28:13] On the production cluster it runs on an-launcher1002
[08:29:14] https://www.irccloud.com/pastebin/qDzUchQm/
[08:29:53] I think it'll take on the order of days to weeks to finish rebalancing, but it should keep moving.
[08:30:39] Data-Platform-SRE, Discovery-Search: Migrate apifeatureusage hosts to Bullseye or later - https://phabricator.wikimedia.org/T346053 (Gehel)
[08:30:39] Data-Platform-SRE, Epic: [Epic] Migrate all Search Platform servers to Debian Bullseye - https://phabricator.wikimedia.org/T323921 (Gehel)
[08:32:48] Data-Engineering, Data-Platform-SRE, Dumps-Generation: Alert for snapshot101[4567] not in mediawiki-installation dsh group - https://phabricator.wikimedia.org/T345907 (BTullis) Open→Resolved a:BTullis I have merged this patch, so I believe that we can close this.
[08:47:27] oh that much? Well it makes sense, given the data size
[08:48:09] Data-Platform-SRE, Discovery-Search (Current work): Restore dse-k8s' rdf-streaming-updater from savepoint/improve bootstrapping process - https://phabricator.wikimedia.org/T345957 (Gehel)
[08:48:11] Data-Platform-SRE: Troubleshoot rdf-streaming-updater/dse-k8s cluster - https://phabricator.wikimedia.org/T346048 (Gehel)
[08:48:14] it seems to be working: the new nodes are past 1TB of data, and they weren't close to 1TB when I asked the question
[09:10:40] hi folks!
[09:11:02] I'd need to run a query for webrequest (one hour of text data), what's the best tool nowadays?
[09:11:07] spark.sql?
[09:13:09] Data-Platform-SRE, Patch-For-Review: Write a cookbook for rolling reboot/restart of datahubsearch servers - https://phabricator.wikimedia.org/T344798 (brouberol) Open→In progress I'm working on some more improvements before we can test.
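A minimal sketch of the webrequest lookup discussed here, assuming the `wmf.webrequest` table and the partition columns named in the conversation (`webrequest_source`, `year`, `month`, `day`, `hour`) plus a `uri_host` column. The helper `build_webrequest_query` is hypothetical and only assembles the SQL string; the `spark.sql(...)` call is left as a comment because it needs a pyspark session started on YARN.

```python
def build_webrequest_query(source, year, month, day, hour, uri_host, limit=10):
    """Assemble a query that pins every partition column of wmf.webrequest.

    Hypothetical helper for illustration; table and column names are taken
    from the chat, everything else is an assumption.
    """
    for value in (source, uri_host):
        # Refuse quotes instead of guessing at the engine's escaping rules.
        if "'" in value:
            raise ValueError(f"unsupported character in literal: {value!r}")
    return (
        "SELECT * FROM wmf.webrequest "
        f"WHERE webrequest_source = '{source}' "
        f"AND year = {int(year)} AND month = {int(month)} "
        f"AND day = {int(day)} AND hour = {int(hour)} "
        f"AND uri_host = '{uri_host}' "
        f"LIMIT {int(limit)}"
    )


# Values mirror the one-hour lookup from the conversation.
query = build_webrequest_query("text", 2023, 9, 10, 12, "stream.wikimedia.org")
print(query)

# Inside a pyspark shell the session object `spark` is predefined:
# spark.sql(query).show()
```

Pinning all of the partition columns is what keeps the scan down to a single hour of data rather than the whole table.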
[09:14:11] I tried something like (with pyspark3)
[09:14:12] spark.sql("SELECT * FROM wmf.webrequest where webrequest_source = 'text' and year=2023 and month=9 and day=10 and hour=12 and uri_host = 'stream.wikimedia.org' LIMIT 10").show()
[09:14:25] but I got a ton of errors
[09:27:39] of course I need https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Cluster/Spark#Start_a_spark_shell_in_yarn
[09:27:57] brouberol: I'll be 1' late, bio break
[09:30:46] elukey: What about SQL Lab in superset, or just the presto CLI?
[09:32:19] never used them, are they stable now?
[09:33:45] Yeah, should be. They're not co-located with the data like spark on yarn, so there is more network traffic as a result, but I'd say that they're stable.
[09:35:24] Try this: https://superset.wikimedia.org/superset/sqllab/?savedQueryId=764
[09:39:34] ack will do, at the moment I am using pyspark --master yarn etc.. and it works nicely
[09:39:52] 👍
[09:53:43] Data-Platform-SRE, Discovery-Search (Current work): Restore dse-k8s' rdf-streaming-updater from savepoint/improve bootstrapping process - https://phabricator.wikimedia.org/T345957 (dcausse) @bking thanks! I can confirm that the job is running fine, the dashboards show some activity the test stream `wdqs_...
[10:00:53] Data-Engineering, Data Pipelines (Sprint 14), Data Products (Sprint 00), Google-Chrome-User-Agent-Deprecation, Product-Analytics (Kanban): [SPIKE] Model impact of User-Agent deprecation on top line metrics - https://phabricator.wikimedia.org/T336084 (Milimetric) It makes sense, @mforns, it's...
[10:19:18] Data-Engineering, cloud-services-team: dbproxy1018 alert for two instances down - https://phabricator.wikimedia.org/T346012 (taavi) Open→Resolved a:taavi I reloaded haproxy on that host.
[10:20:44] Data-Engineering, cloud-services-team: dbproxy1018 alert for two instances down - https://phabricator.wikimedia.org/T346012 (dr0ptp4kt) Thanks @taavi . I see it reflected now, with it sending traffic to clouddb1017. ` dr0ptp4kt@tools-sgebastion-10:~$ mariadb --defaults-file=$HOME/replica.my.cnf -h enwik...
[10:35:26] Data-Engineering, cloud-services-team: dbproxy1018 alert for two instances down - https://phabricator.wikimedia.org/T346012 (dr0ptp4kt)
[10:37:34] Data-Platform-SRE, Discovery-Search (Current work): Add Zookeeper config to 'rdf-streaming-updater' test service on DSE cluster - https://phabricator.wikimedia.org/T344614 (elukey) @bking I had a chat with @dcausse and from https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/deployment/ha/zoo...
[10:41:54] Data-Platform-SRE, Discovery-Search (Current work): Add Zookeeper config to 'rdf-streaming-updater' test service on DSE cluster - https://phabricator.wikimedia.org/T344614 (gmodena) > We set high-availability.type, that is not supported in 1.16. Moreover we should also set the cluster-id too. FWIW we ha...
[10:48:07] Data-Engineering, Data Engineering and Event Platform Team (Sprint 2), Event-Platform: Define Flink k8s operator SLO - https://phabricator.wikimedia.org/T345914 (gmodena)
[11:03:47] o/ I'm looking into the referrer_daily SLA miss and I can't see any obvious differences between the run on 2023-09-12 and others in the same month
[11:10:03] Also, the SLA is 6 hours (https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blob/main/analytics/dags/referrer/referrer_daily_dag.py#L46) and it took ~50 minutes for the DAG to run :/
[11:34:46] (CR) Btullis: [V: +2 C: +2] Use sudo with git in refinery_deploy_to_hdfs [analytics/refinery] - https://gerrit.wikimedia.org/r/950195 (https://phabricator.wikimedia.org/T334493) (owner: Btullis)
[11:35:06] (CR) Btullis: [V: +2 C: +2] Use sudo with git in refinery_deploy_to_hdfs (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/950195 (https://phabricator.wikimedia.org/T334493) (owner: Btullis)
[11:45:31] Data-Platform-SRE, Patch-For-Review: analytics/refinery deployment broken at refinery-deploy-to-hdfs - https://phabricator.wikimedia.org/T334493 (BTullis) This is now merged, but we're waiting for the first refinery-deploy after this, to validate whether or not it works as expected. I have added a note t...
[12:29:07] Data-Engineering, Data-Platform-SRE, Data Engineering and Event Platform Team (Sprint 2), Event-Platform: Define Flink k8s operator SLO - https://phabricator.wikimedia.org/T345914 (gmodena)
[12:39:55] Data-Engineering, Data-Platform-SRE, Data Engineering and Event Platform Team, Event-Platform: [SPIKE] What happens to deployed Flink clusters if the k8s operator goes down? - https://phabricator.wikimedia.org/T346231 (gmodena)
[12:41:28] Data-Engineering, Data-Platform-SRE, Data Engineering and Event Platform Team (Sprint 2), Event-Platform: Define Flink k8s operator SLO - https://phabricator.wikimedia.org/T345914 (gmodena) @dcausse @bking I moved the google doc draft to https://wikitech.wikimedia.org/wiki/Event_Platform/SLO/Fli...
[12:41:57] Data-Engineering, Data-Platform-SRE, Data Engineering and Event Platform Team (Sprint 2), Event-Platform: Define Flink k8s operator SLO - https://phabricator.wikimedia.org/T345914 (gmodena)
[13:46:14] Data-Platform-SRE, GitLab (CI & Job Runners): Request access to trusted runners for data-engineering/spark - https://phabricator.wikimedia.org/T346244 (BTullis)
[13:46:33] Data-Platform-SRE, GitLab (CI & Job Runners): Request access to trusted runners for data-engineering/spark - https://phabricator.wikimedia.org/T346244 (BTullis) a:BTullis
[14:05:43] Data-Engineering, Data Pipelines (Sprint 14), Data Products (Sprint 00), Google-Chrome-User-Agent-Deprecation, Product-Analytics (Kanban): [SPIKE] Model impact of User-Agent deprecation on top line metrics - https://phabricator.wikimedia.org/T336084 (VirginiaPoundstone) a:mforns→Virgin...
[14:05:55] Data-Engineering, Data Pipelines (Sprint 14), Data Products (Sprint 01), Google-Chrome-User-Agent-Deprecation, Product-Analytics (Kanban): [SPIKE] Model impact of User-Agent deprecation on top line metrics - https://phabricator.wikimedia.org/T336084 (VirginiaPoundstone)
[14:11:09] Data-Engineering, Data Pipelines (Sprint 14), Data Products (Sprint 01), Google-Chrome-User-Agent-Deprecation, Product-Analytics (Kanban): [SPIKE] Model impact of User-Agent deprecation on top line metrics - https://phabricator.wikimedia.org/T336084 (VirginiaPoundstone) a:VirginiaPoundston...
[14:12:04] (PS6) Peter Fischer: cirrussearch: add fetch_failure schema [schemas/event/primary] - https://gerrit.wikimedia.org/r/854572 (https://phabricator.wikimedia.org/T317609) (owner: DCausse)
[14:12:29] (CR) CI reject: [V: -1] cirrussearch: add fetch_failure schema [schemas/event/primary] - https://gerrit.wikimedia.org/r/854572 (https://phabricator.wikimedia.org/T317609) (owner: DCausse)
[14:13:33] Data-Engineering, Data Engineering and Event Platform Team, Event-Platform: [TEMPLATE] Onboard request for APPLICATION NAME to Event Platform - https://phabricator.wikimedia.org/T346207 (gmodena)
[14:47:34] Data-Engineering, Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install dbstore100[89] - https://phabricator.wikimedia.org/T342862 (Jhancock.wm) raids configured
[14:50:07] Data-Platform-SRE, GitLab (CI & Job Runners): Request access to trusted runners for data-engineering/spark - https://phabricator.wikimedia.org/T346244 (BTullis) I've been following the instruction here: https://wikitech.wikimedia.org/wiki/GitLab/Gitlab_Runner/Trusted_Runners#Request_access_to_Trusted_Run...
[14:50:54] Data-Platform-SRE, GitLab (CI & Job Runners): Request access to trusted runners for data-engineering/spark - https://phabricator.wikimedia.org/T346244 (BTullis)
[14:50:56] Data-Platform-SRE, Patch-For-Review: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (BTullis)
[14:53:26] Data-Platform-SRE, Release-Engineering-Team, serviceops, GitLab (CI & Job Runners): Request access to trusted runners for data-engineering/spark - https://phabricator.wikimedia.org/T346244 (BTullis) a:BTullis→None Removing myself as the assignee, since it appears that I will need assistan...
[14:54:18] Data-Platform-SRE, Release-Engineering-Team, serviceops, GitLab (CI & Job Runners): Request access to trusted runners for data-engineering/spark - https://phabricator.wikimedia.org/T346244 (dancy) >>! In T346244#9164100, @BTullis wrote: > It appears that I do not have the required rights to creat...
[14:54:31] Data-Engineering, Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines, Patch-For-Review: Enable libmamba by default for conda environment solving - https://phabricator.wikimedia.org/T337258 (CodeReviewBot) btullis merged https://gitlab.wikimedia.org/repos/data-engineer...
[14:56:22] Data-Platform-SRE, Release-Engineering-Team, serviceops, GitLab (CI & Job Runners): Request access to trusted runners for data-engineering/spark - https://phabricator.wikimedia.org/T346244 (BTullis) >>! In T346244#9164113, @dancy wrote: >>>! In T346244#9164100, @BTullis wrote: >> It appears that...
[14:58:21] Data-Platform-SRE, Release-Engineering-Team, serviceops, GitLab (CI & Job Runners), Patch-For-Review: Request access to trusted runners for data-engineering/spark - https://phabricator.wikimedia.org/T346244 (CodeReviewBot) dancy opened https://gitlab.wikimedia.org/repos/releng/gitlab-trusted-...
[14:58:31] Data-Platform-SRE, Release-Engineering-Team, serviceops, GitLab (CI & Job Runners), Patch-For-Review: Request access to trusted runners for data-engineering/spark - https://phabricator.wikimedia.org/T346244 (CodeReviewBot) dancy merged https://gitlab.wikimedia.org/repos/releng/gitlab-trusted-...
[15:01:39] Data-Platform-SRE, Release-Engineering-Team, serviceops, GitLab (CI & Job Runners), Patch-For-Review: Request access to trusted runners for data-engineering/spark - https://phabricator.wikimedia.org/T346244 (dancy) Open→Resolved a:dancy You're all set.
[15:01:45] Data-Platform-SRE, Patch-For-Review: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (dancy)
[15:20:05] I _think_ https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/956916 is ready for a final review. I had to revert back to curl-ing the opensearch API from the local node itself, as the logstash-{eqiad,codfw} API is unreachable from cumin hosts.
[15:21:05] Data-Engineering, Data Engineering and Event Platform Team, Event-Platform: [NEEDS GROOMING][SPIKE] Can we identify indicators to inform an SLO for event emission and intake? - https://phabricator.wikimedia.org/T345195 (gmodena)
[15:26:24] brouberol: that's something that can probably be changed easily, the cumin hosts are more than trusted hosts, I'd ask o11y if that's feasible
[15:26:29] my 2 centa
[15:26:32] *cents
[15:33:18] (PS7) Peter Fischer: cirrussearch: add fetch_failure schema [schemas/event/primary] - https://gerrit.wikimedia.org/r/854572 (https://phabricator.wikimedia.org/T317609) (owner: DCausse)
[15:33:38] (CR) CI reject: [V: -1] cirrussearch: add fetch_failure schema [schemas/event/primary] - https://gerrit.wikimedia.org/r/854572 (https://phabricator.wikimedia.org/T317609) (owner: DCausse)
[15:38:49] gotcha, thank you! I've asked in #wikimedia-observability
[15:47:48] (PS6) Peter Fischer: Adapt schema to meet latest requirements. [schemas/event/primary] - https://gerrit.wikimedia.org/r/951829 (https://phabricator.wikimedia.org/T325315)
[15:48:16] (CR) CI reject: [V: -1] Adapt schema to meet latest requirements. [schemas/event/primary] - https://gerrit.wikimedia.org/r/951829 (https://phabricator.wikimedia.org/T325315) (owner: Peter Fischer)
[15:49:14] (PS7) Peter Fischer: Adapt schema to meet latest requirements. [schemas/event/primary] - https://gerrit.wikimedia.org/r/951829 (https://phabricator.wikimedia.org/T325315)
[15:52:04] (PS8) Peter Fischer: cirrussearch: add fetch_failure schema [schemas/event/primary] - https://gerrit.wikimedia.org/r/854572 (https://phabricator.wikimedia.org/T317609) (owner: DCausse)
[16:00:50] Data-Engineering, Data-Platform-SRE, Discovery-Search (Current work), Event-Platform: Test common operations in the flink operator/k8s/Flink ZK environment - https://phabricator.wikimedia.org/T342149 (bking) Some observed behavior from T344614 , Flink-app will start when HA is misconfigured, wh...
[16:04:20] Data-Engineering, Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines: Enable libmamba by default for conda environment solving - https://phabricator.wikimedia.org/T337258 (BTullis) I have finished building conda-analytics version 0.0.19 and it is now on apt.wikimedia.org,...
[16:11:12] Data-Engineering, Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines: Enable libmamba by default for conda environment solving - https://phabricator.wikimedia.org/T337258 (BTullis) Updated the apt repositories on the test cluster with: ` btullis@cumin1001:~$ sudo cumin A:...
[16:25:04] Data-Engineering, Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines: Enable libmamba by default for conda environment solving - https://phabricator.wikimedia.org/T337258 (BTullis) It's a good start, showing that I was able to activate the base environment and that the ma...
[16:34:44] (SystemdUnitCrashLoop) firing: jupyterhub-conda.service crashloop on an-test-client1002:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[16:37:48] Data-Engineering, Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines: Enable libmamba by default for conda environment solving - https://phabricator.wikimedia.org/T337258 (BTullis) OK, less good results from jupyterhub. First of all, I just tried connecting to the jupyter...
[17:05:15] Analytics, AQS2.0, Tech-Docs-Team, API Platform (AQS 2.0 Roadmap), and 6 others: AQS 2.0 documentation - https://phabricator.wikimedia.org/T288664 (apaskulin)
[17:14:57] (PS2) Sharvaniharan: Minor change to stream name [schemas/event/secondary] - https://gerrit.wikimedia.org/r/956937
[17:15:58] (CR) Sharvaniharan: "Good catch @Mikhail. Done." [schemas/event/secondary] - https://gerrit.wikimedia.org/r/956937 (owner: Sharvaniharan)
[17:34:50] Data-Platform-SRE, Release-Engineering-Team, serviceops, GitLab (CI & Job Runners): Request access to trusted runners for data-engineering/spark - https://phabricator.wikimedia.org/T346244 (dancy) >>! In T346244#9164100, @BTullis wrote: > I've been following the instruction here: https://wikitech...
[17:46:23] Data-Engineering, Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines, Patch-For-Review: Enable libmamba by default for conda environment solving - https://phabricator.wikimedia.org/T337258 (CodeReviewBot) btullis opened https://gitlab.wikimedia.org/repos/data-engineer...
[18:46:58] Data-Platform-SRE: Migrate search-loader hosts to Bullseye or later - https://phabricator.wikimedia.org/T346039 (bking)
[19:08:02] Data-Platform-SRE: Migrate search-loader hosts to Bullseye or later - https://phabricator.wikimedia.org/T346039 (bking)
[20:34:59] (SystemdUnitCrashLoop) firing: jupyterhub-conda.service crashloop on an-test-client1002:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[21:27:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[21:54:32] Data-Platform-SRE, Discovery-Search (Current work): Retune enwiki_content shard settings - https://phabricator.wikimedia.org/T343820 (RKemper) The reindex, even though we had to terminate it before it finished, had already gotten to `enwiki_content`. So this is done. Here's what the value on both eqiad a...
[21:56:27] (MediawikiPageContentChangeEnrichHighKafkaConsumerLag) firing: ...
[21:56:27] High Kafka consumer lag for mw_page_content_change_enrich in eqiad - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=eqiad%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichHighKafkaConsumerLag
[22:02:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[22:06:27] (MediawikiPageContentChangeEnrichHighKafkaConsumerLag) resolved: ...
[22:06:27] High Kafka consumer lag for mw_page_content_change_enrich in eqiad - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=eqiad%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichHighKafkaConsumerLag
[22:30:59] Data-Platform-SRE, Discovery-Search (Current work): Retune enwiki_content shard settings - https://phabricator.wikimedia.org/T343820 (RKemper)