[03:33:31] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[03:33:31] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[07:27:00] Data-Engineering, Data Engineering and Event Platform Team, Wikidata, Wikidata-Query-Service, and 2 others: Five deleted Wikidata items pertaining to Wikimedia category pages still present in the Query Service - https://phabricator.wikimedia.org/T342593 (dcausse)
[07:33:31] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[07:33:31] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[07:47:19] (PS12) Peter Fischer: Provide internal schema for CirrusSearch update-pipeline updates. [schemas/event/primary] - https://gerrit.wikimedia.org/r/856507 (https://phabricator.wikimedia.org/T317202)
[07:47:52] (CR) Peter Fischer: "Thanks, Erik!" [schemas/event/primary] - https://gerrit.wikimedia.org/r/856507 (https://phabricator.wikimedia.org/T317202) (owner: Peter Fischer)
[08:30:29] Data-Engineering, Data-Platform-SRE, Patch-For-Review: Get datahub-staging.wikimedia.org working with the staging deployment of datahub - https://phabricator.wikimedia.org/T343236 (BTullis) We are probably going to have to ask for some guidance from the Service Ops team here, since I don't know of an...
[09:51:42] Data-Engineering, Data-Platform-SRE, Patch-For-Review: Get datahub-staging.wikimedia.org working with the staging deployment of datahub - https://phabricator.wikimedia.org/T343236 (JMeybohm) IIUC you want to expose a staging service to the public internet. That is nothing we ever considered to do whi...
[09:52:02] Data-Engineering, Data-Platform-SRE, serviceops-radar, Patch-For-Review: Get datahub-staging.wikimedia.org working with the staging deployment of datahub - https://phabricator.wikimedia.org/T343236 (JMeybohm)
[10:40:17] Data-Engineering, Data-Platform-SRE, AQS2.0: Finalize the multi-dc configuration of AQS (nodejs) in codfw - https://phabricator.wikimedia.org/T331115 (hnowlan) As part of T342213 I kinda jumped the gun and [[ https://gerrit.wikimedia.org/r/c/operations/dns/+/943616 | added aqs.discovery.wmnet ]] reco...
[11:33:31] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[11:33:31] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[11:33:33] Data-Engineering: Mediarequests top articles: should use a disallow filter just like top articles - https://phabricator.wikimedia.org/T343793 (Milimetric)
[13:30:24] Data-Engineering, Data Engineering and Event Platform Team, Machine-Learning-Team, Wikimedia Enterprise, Event-Platform: Implement PoC Event-Driven Data Pipeline for Revert Risk Model Scores using Event Platform Capabilities - https://phabricator.wikimedia.org/T338792 (JArguello-WMF)
[13:59:56] Data-Engineering, Data-Platform-SRE, serviceops-radar, Patch-For-Review: Get datahub-staging.wikimedia.org working with the staging deployment of datahub - https://phabricator.wikimedia.org/T343236 (BTullis) >>! In T343236#9076462, @JMeybohm wrote: > IIUC you want to expose a staging service to t...
[14:03:16] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host an-worker1082.eqiad.wmnet with OS bullseye
[15:03:26] btullis, stevemunene: weekly meeting in https://meet.google.com/rnb-jtio-dcy
[15:04:01] (CR) Ebernhardson: [C: +2] Provide internal schema for CirrusSearch update-pipeline updates. [schemas/event/primary] - https://gerrit.wikimedia.org/r/856507 (https://phabricator.wikimedia.org/T317202) (owner: Peter Fischer)
[15:04:16] Sorry, be there now.
[15:04:29] (Merged) jenkins-bot: Provide internal schema for CirrusSearch update-pipeline updates. [schemas/event/primary] - https://gerrit.wikimedia.org/r/856507 (https://phabricator.wikimedia.org/T317202) (owner: Peter Fischer)
[15:19:12] Data-Platform-SRE, Discovery-Search (Current work), Patch-For-Review: Provision Zookeeper Cluster for storing Flink HA data - https://phabricator.wikimedia.org/T341792 (bking) Per today's Data Platform meeting, Ben provided an example of [[ https://gerrit.wikimedia.org/r/plugins/gitiles/operations/al...
[15:28:17] Data-Platform-SRE, Discovery-Search: Retune enwiki_content shard settings - https://phabricator.wikimedia.org/T343820 (RKemper)
[15:33:31] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[15:33:31] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[15:44:06] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host an-worker1082.eqiad.wmnet with OS bullseye completed: - an-worker1082 (**PASS**) - Downtimed on Icinga/Alertmanag...
[15:44:54] Data-Platform-SRE: Create conda .deb and docker image - https://phabricator.wikimedia.org/T304450 (Gehel) Open→Resolved
[15:45:52] Data-Platform-SRE: Decide on installation details for new ceph cluster - https://phabricator.wikimedia.org/T326945 (Gehel) Open→Resolved
[15:45:54] Data-Platform-SRE, Epic: Install Ceph Cluster for Data Engineering - https://phabricator.wikimedia.org/T324660 (Gehel)
[15:52:14] Data-Platform-SRE, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): qlever dblp endpoint for wikidata federated query nomination - https://phabricator.wikimedia.org/T339347 (RKemper) Looks like we lost track of this a bit. @bking and I can work this this week.
[15:52:22] Data-Platform-SRE: Migrate WDQS and WCQS servers to Debian Bullseye - https://phabricator.wikimedia.org/T343124 (bking) a:bking
[15:53:53] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host an-worker1083.eqiad.wmnet with OS bullseye
[15:54:32] Data-Platform-SRE: analytics/refinery deployment broken at refinery-deploy-to-hdfs - https://phabricator.wikimedia.org/T334493 (BTullis) a:BTullis→None
[16:00:30] stevemunene: It's not super clear to me what you need from Kara on T340648. Or is it on the related T342546 ?
[16:00:31] T340648: [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648
[16:03:36] Data-Engineering, Product-Analytics: Conda analytics environments breakage - https://phabricator.wikimedia.org/T343823 (nettrom_WMF)
[16:10:47] Data-Platform-SRE, Discovery-Search (Current work), Patch-For-Review: Provision Zookeeper Cluster for storing Flink HA data - https://phabricator.wikimedia.org/T341792 (BTullis) It might be the case that the alerts are already in place, since we have the metrics (for the eqiad cluster) [[https://graf...
[16:10:59] Data-Engineering, Product-Analytics: Conda analytics environments breakage - https://phabricator.wikimedia.org/T343823 (nettrom_WMF)
[16:13:22] gehel: Just the ssh key mentioned here https://phabricator.wikimedia.org/T342546#9054613 since all the other requirements have been met.
[16:17:57] stevemunene: ack, I'll check with Kara
[16:18:15] Thanks gehel
[16:21:37] Data-Platform-SRE, Discovery-Search (Current work), Patch-For-Review: Provision Zookeeper Cluster for storing Flink HA data - https://phabricator.wikimedia.org/T341792 (BTullis) I suppose that it might also depend a bit on how far we should go with: {T342578} Perhaps we want to have this zookeeper c...
[17:22:09] Data-Platform-SRE, Epic: [Epic] Migrate all Search Platform servers to Debian Bullseye - https://phabricator.wikimedia.org/T323921 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wcqs1001.eqiad.wmnet with OS bullseye
[17:24:26] Data-Platform-SRE, Epic: [Epic] Migrate all Search Platform servers to Debian Bullseye - https://phabricator.wikimedia.org/T323921 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wcqs1002.eqiad.wmnet with OS bullseye
[17:24:47] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host an-worker1083.eqiad.wmnet with OS bullseye completed: - an-worker1083 (**PASS**) - Downtimed on Icinga/Alertmanag...
[17:56:35] Data-Platform-SRE, Epic: [Epic] Migrate all Search Platform servers to Debian Bullseye - https://phabricator.wikimedia.org/T323921 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wcqs1001.eqiad.wmnet with OS bullseye completed: - wcqs1001 (**WARN**) - Downt...
[18:10:53] Data-Engineering: NEW BUG REPORT fiwiki’s section-level image suggestions aren’t generated in production - https://phabricator.wikimedia.org/T343844 (mfossati)
[18:37:31] Data-Platform-SRE: Confirm TLS certificate monitoring is in place for Search Platform-owned domains - https://phabricator.wikimedia.org/T343761 (Gehel)
[19:33:32] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[19:33:32] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[20:33:11] Data-Platform-SRE, Discovery-Search: Move whitelist.txt from WDQS deploy repo into puppet - https://phabricator.wikimedia.org/T343856 (bking)
[20:34:05] Data-Platform-SRE, Discovery-Search: Move whitelist.txt from WDQS deploy repo into puppet - https://phabricator.wikimedia.org/T343856 (bking)
[20:49:46] Data-Platform-SRE, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): qlever dblp endpoint for wikidata federated query nomination - https://phabricator.wikimedia.org/T339347 (bking) @WolfgangFahl We've whitelisted the endpoints, but [[ https://w.wiki/6q2i | the query you linked a...
[20:57:24] Data-Platform-SRE, Epic: [Epic] Migrate all Search Platform servers to Debian Bullseye - https://phabricator.wikimedia.org/T323921 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wcqs1002.eqiad.wmnet with OS bullseye executed with errors: - wcqs1002 (**FAIL**...
[21:02:23] Data-Platform-SRE, Epic: [Epic] Migrate all Search Platform servers to Debian Bullseye - https://phabricator.wikimedia.org/T323921 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wcqs1003.eqiad.wmnet with OS bullseye
[21:46:09] Data-Platform-SRE, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): qlever dblp endpoint for wikidata federated query nomination - https://phabricator.wikimedia.org/T339347 (EBernhardson) >>! In T339347#9078729, @bking wrote: > @WolfgangFahl We've whitelisted the endpoints, but...
[21:46:28] Data-Platform-SRE, Epic: [Epic] Migrate all Search Platform servers to Debian Bullseye - https://phabricator.wikimedia.org/T323921 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wcqs1003.eqiad.wmnet with OS bullseye executed with errors: - wcqs1003 (**FAIL**...
[22:53:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1002:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1002:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[23:33:32] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[23:33:32] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability