[01:15:56] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:18:12] (SystemdUnitFailed) resolved: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:28:42] (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:30:28] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:53:42] (MediawikiPageContentChangeEnrichAvailability) firing: ... 
[03:53:42] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability [05:21:00] 10Quarry, 10Tool-tsreports: Quarry-TSreports feature parity - https://phabricator.wikimedia.org/T78549 (10Frostly) [05:28:42] (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:22:09] 10Data-Platform-SRE, 10DBA, 10cloud-services-team, 10Patch-For-Review: Migrate wiki replicas (clouddb*) hosts to MariaDB 10.6 - https://phabricator.wikimedia.org/T334651 (10Marostegui) [06:23:10] 10Data-Platform-SRE, 10DBA, 10cloud-services-team, 10Patch-For-Review: Migrate wiki replicas (clouddb*) hosts to MariaDB 10.6 - https://phabricator.wikimedia.org/T334651 (10Marostegui) 05Open→03Resolved This is done [06:32:11] 10Quarry, 10superset.wmcloud.org, 10cloud-services-team (FY2023/2024-Q1): Replace Quarry with an installation of Superset - https://phabricator.wikimedia.org/T169452 (10Frostly) >>! In T169452#8952000, @Stuartyeates wrote: > I've opened a superset discussion at https://github.com/apache/superset/discussions/... 
[06:42:12] 10Data-Engineering, 10Data-Engineering-Wikistats, 10Data Pipelines, 10Data Products, and 4 others: Merge ks-Arab and ks-Deva to ks - https://phabricator.wikimedia.org/T314476 (10Nikerabbit) 05Open→03In progress [07:49:09] 10Data-Engineering, 10Growth-Team, 10MediaWiki-extensions-EventLogging, 10Notifications, and 2 others: Decommission the EchoMail and EchoInteraction instruments - https://phabricator.wikimedia.org/T344167 (10Osnard) Thank you very much for reaching out to us. The provided patch (https://gerrit.wikimedia.or... [07:53:42] (MediawikiPageContentChangeEnrichAvailability) firing: ... [07:53:42] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability [07:58:57] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10Stevemunene) Picking this up from `an-worker1108` [08:27:01] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1108.eqiad.wmnet with OS bullseye [08:30:24] 10Data-Platform-SRE, 10Patch-For-Review: [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (10AndrewTavis_WMDE) Thanks for the efforts on this, @Stevemunene! Please let us know if there's anything needed on our end :) [08:46:50] stevemunene: I just realized we did not have our usual daily sync this morning. 
You can update the etherpad async and I'll have a look later if that's helpful [08:47:22] stevemunene, btullis: I won't be able to join the SRE sync today either :/ [08:51:58] sure, np gehel [09:00:34] gehel, all fine by me. [09:05:31] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1108.eqiad.wmnet with OS bullseye completed: - an-worker1108 (**PASS**) - Downtimed on Icinga/Alertm... [09:28:42] (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:38:42] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:43:42] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:51:14] 10Data-Platform-SRE, 10Patch-For-Review: Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (10SLyngshede-WMF) @BTullis I think that was one of my regressions with my updated script. I think I stripped out the /default/rack bit, but added it back in. The old script has this sl... 
[09:58:42] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1109.eqiad.wmnet with OS bullseye
[10:03:38] Data-Platform-SRE, Patch-For-Review: Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (BTullis) >>! In T329363#9105491, @SLyngshede-WMF wrote: > @BTullis I think that was one of my regressions with my updated script. I think I stripped out the /default/rack bit, but add...
[10:33:37] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1110.eqiad.wmnet with OS bullseye
[10:33:42] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:38:43] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1109.eqiad.wmnet with OS bullseye completed: - an-worker1109 (**PASS**) - Downtimed on Icinga/Alertm...
[10:43:51] Data-Platform-SRE, Patch-For-Review: Grant all authenticated users access to SQL Lab in Superset - https://phabricator.wikimedia.org/T328457 (BTullis) I have updated all roles on the staging database and logged in. Everything appears to work correctly. ` MariaDB [superset_staging]> select * from ab_role;...
[10:44:48] Data-Platform-SRE, Patch-For-Review: Grant all authenticated users access to SQL Lab in Superset - https://phabricator.wikimedia.org/T328457 (BTullis) Updated the production Superset database. ` MariaDB [superset_staging]> use superset_production; Reading table information for completion of table and col...
[10:47:20] Data-Platform-SRE, Patch-For-Review: Grant all authenticated users access to SQL Lab in Superset - https://phabricator.wikimedia.org/T328457 (BTullis) Open→Resolved I believe that this ticket is now done, so I'll resolve it. I'll let some of our more active Superset users know about the change an...
[10:48:42] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:57:04] Data-Engineering, Data-Platform-SRE, DBA, Data-Services: Prepare and check storage layer for suwikisource - https://phabricator.wikimedia.org/T343547 (BTullis) I believe that this is resolved now. ` btullis@tools-sgebastion-10:~$ sql suwikisource Reading table information for completion of table...
[10:57:14] Data-Engineering, Data-Platform-SRE, DBA, Data-Services: Prepare and check storage layer for suwikisource - https://phabricator.wikimedia.org/T343547 (BTullis) Open→Resolved a:BTullis
[11:03:42] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:12:55] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1110.eqiad.wmnet with OS bullseye completed: - an-worker1110 (**PASS**) - Downtimed on Icinga/Alertm...
[11:18:42] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:46:33] Data-Engineering, Data-Platform-SRE, DBA, Data-Services: Prepare and check storage layer for blkwiktionary - https://phabricator.wikimedia.org/T343541 (BTullis) Open→Resolved a:BTullis I believe that this is complete now. ` btullis@tools-sgebastion-10:~$ sql blkwiktionary Reading tabl...
[11:53:42] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[11:53:42] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[12:55:59] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1111.eqiad.wmnet with OS bullseye
[12:56:24] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1112.eqiad.wmnet with OS bullseye
[13:04:37] Analytics-Radar, Data-Engineering-Icebox, MediaWiki-Core-AuthManager, Privacy Engineering, MediaWiki-Platform-Team (Radar): Clear site data on MediaWiki log out - https://phabricator.wikimedia.org/T179752 (Krinkle)
[13:23:29] (PS5) Peter Fischer: cirrussearch: add fetch_failure schema [schemas/event/primary] - https://gerrit.wikimedia.org/r/854572 (https://phabricator.wikimedia.org/T317609) (owner: DCausse)
[13:28:02] (CR) Peter Fischer: cirrussearch: add fetch_failure schema (1 comment) [schemas/event/primary] - https://gerrit.wikimedia.org/r/854572 (https://phabricator.wikimedia.org/T317609) (owner: DCausse)
[13:35:36] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1111.eqiad.wmnet with OS bullseye completed: - an-worker1111 (**PASS**) - Downtimed on Icinga/Alertm...
[13:39:47] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1112.eqiad.wmnet with OS bullseye completed: - an-worker1112 (**PASS**) - Downtimed on Icinga/Alertm... [13:44:26] 10Data-Engineering, 10cloud-services-team, 10Cloud-Services-Origin-User: WMCS-roots pageing responibilities - https://phabricator.wikimedia.org/T344608 (10Marostegui) I am not sure what's the current status for wikireplicas alerts. I do know they do alert on IRC, but I am not sure if they already page #cloud... [13:45:28] 10Data-Engineering, 10cloud-services-team, 10Cloud-Services-Origin-User: WMCS-roots paging responsibilities - https://phabricator.wikimedia.org/T344608 (10jbond) [13:48:01] (03CR) 10Joal: [C: 04-1] Use sudo with git in refinery_deploy_to_hdfs (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/950195 (https://phabricator.wikimedia.org/T334493) (owner: 10Btullis) [13:49:30] (03CR) 10Btullis: Use sudo with git in refinery_deploy_to_hdfs (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/950195 (https://phabricator.wikimedia.org/T334493) (owner: 10Btullis) [14:03:42] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:13:42] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:23:42] (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:33:11] 10Data-Platform-SRE, 10Discovery-Search: Add Zookeeper config to 'flink-app' test service - https://phabricator.wikimedia.org/T344614 (10bking) [14:33:17] 10Data-Platform-SRE, 10Discovery-Search: Add Zookeeper config to 'flink-app' test service - https://phabricator.wikimedia.org/T344614 (10bking) [14:33:19] 10Data-Engineering, 10Data-Platform-SRE, 10Discovery-Search (Current work), 10Event-Platform: Test common operations in the flink operator/k8s/Flink ZK environment - https://phabricator.wikimedia.org/T342149 (10bking) [14:33:21] 10Data-Platform-SRE, 10Discovery-Search (Current work), 10Epic: [EPIC] Deployment of the Search Update Pipeline on Flink / k8s - https://phabricator.wikimedia.org/T340548 (10bking) [14:33:27] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, 10serviceops-radar, and 2 others: Migrate the wdqs streaming updater flink jobs to flink-k8s-operator deployment model - https://phabricator.wikimedia.org/T326409 (10bking) [14:33:40] 10Data-Platform-SRE, 10Discovery-Search: Add Zookeeper config to 'flink-app' test service - https://phabricator.wikimedia.org/T344614 (10bking) [15:05:26] 10Data-Engineering, 10Discovery-Search, 10serviceops-radar, 10Data Engineering and Event Platform Team (Sprint 1), 10Event-Platform: [NEEDS GROOMING] Store Flink HA metadata in Zookeeper - https://phabricator.wikimedia.org/T331283 (10lbowmaker) [15:09:40] 10Data-Platform-SRE, 10Discovery-Search, 10Wikidata, 10Wikidata-Query-Service: Rename usages of whitelist to allowlist in query service rdf repo - https://phabricator.wikimedia.org/T344284 (10Gehel) p:05Triage→03High [15:10:03] 10Data-Engineering, 10cloud-services-team, 10Cloud-Services-Origin-User: WMCS-roots paging responsibilities - https://phabricator.wikimedia.org/T344608 (10fnegri) > That aside i 
wondered if we should have some general policy on uses granted wmcs-roots i.e. should they be added to the batphone paging group in... [15:10:26] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): Rename usages of whitelist to allowlist in query service rdf repo - https://phabricator.wikimedia.org/T344284 (10Gehel) [15:12:15] 10Data-Platform-SRE, 10serviceops, 10Discovery-Search (Current work): Requesting permission to use kafka-main cluster to transport CirrusSearch updates - https://phabricator.wikimedia.org/T341625 (10Gehel) [15:12:31] (03CR) 10Phuedx: "This many files must have been tricky to manage. Nice!" [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/950208 (https://phabricator.wikimedia.org/T343557) (owner: 10Clare Ming) [15:13:39] 10Data-Platform-SRE, 10Discovery-Search: Add Zookeeper config to 'flink-app' test service - https://phabricator.wikimedia.org/T344614 (10Gehel) p:05Triage→03High [15:13:48] 10Data-Platform-SRE, 10Discovery-Search (Current work): Add Zookeeper config to 'flink-app' test service - https://phabricator.wikimedia.org/T344614 (10Gehel) [15:14:25] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service: Align throttling configuration naming for WDQC / WCQS - https://phabricator.wikimedia.org/T344413 (10Gehel) p:05Triage→03Medium [15:16:32] 10Data-Engineering, 10cloud-services-team, 10Cloud-Services-Origin-User: WMCS-roots paging responsibilities - https://phabricator.wikimedia.org/T344608 (10Marostegui) @fnegri keep in mind that wikireplicas do not page for SRE as well. The most recent alert I can think of are the ones related to the last outa... 
[15:17:21] 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team, 10Discovery-Search (Current work), and 2 others: Rollout Elasticsearch extra plugins package and restart cluster to apply - https://phabricator.wikimedia.org/T344366 (10Gehel) [15:18:56] 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team, 10Discovery-Search (Current work), and 2 others: Rollout Elasticsearch extra plugins package and restart cluster to apply - https://phabricator.wikimedia.org/T344366 (10bking) We did get the package to build. [15:19:45] 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team, 10Discovery-Search (Current work), and 2 others: Rollout Elasticsearch extra plugins package and restart cluster to apply - https://phabricator.wikimedia.org/T344366 (10Gehel) [15:20:54] 10Data-Engineering: ProduceCanaryEvents job should be scheduled by Airflow - https://phabricator.wikimedia.org/T341229 (10lbowmaker) [15:22:57] 10Data-Engineering, 10Data-Catalog, 10Data Engineering and Event Platform Team (Sprint 1), 10Event-Platform: Event Platform and DataHub Integration - https://phabricator.wikimedia.org/T318863 (10lbowmaker) [15:23:16] 10Data-Engineering, 10Data-Catalog, 10Data Engineering and Event Platform Team (Sprint 1), 10Event-Platform: Event Platform and DataHub Integration - https://phabricator.wikimedia.org/T318863 (10lbowmaker) a:03tchin [15:31:16] 10Data-Platform-SRE: Multiple DAGs on platform_eng instance failing on Spark Skein operators with ConnectionError - https://phabricator.wikimedia.org/T344617 (10xcollazo) [15:31:28] 10Data-Platform-SRE: Multiple DAGs on platform_eng instance failing on Spark Skein operators with ConnectionError - https://phabricator.wikimedia.org/T344617 (10xcollazo) p:05Triage→03High [15:31:49] 10Data-Platform-SRE: Multiple DAGs on platform_eng instance failing on Spark Skein operators with ConnectionError - https://phabricator.wikimedia.org/T344617 
(10xcollazo) Marking high priority due to production pipeline being compromised. [15:32:02] 10Data-Platform-SRE: Multiple DAGs on platform_eng instance failing on Spark Skein operators with ConnectionError - https://phabricator.wikimedia.org/T344617 (10BTullis) a:03BTullis [15:32:19] 10Data-Platform-SRE: Multiple DAGs on platform_eng instance failing on Spark Skein operators with ConnectionError - https://phabricator.wikimedia.org/T344617 (10xcollazo) CC @Htriedman [15:32:27] 10Data-Platform-SRE: Multiple DAGs on platform_eng instance failing on Spark Skein operators with ConnectionError - https://phabricator.wikimedia.org/T344617 (10BTullis) p:05High→03Unbreak! [15:35:35] 10Data-Platform-SRE: Multiple DAGs on platform_eng instance failing on Spark Skein operators with ConnectionError - https://phabricator.wikimedia.org/T344617 (10BTullis) For reference: https://wikimedia.slack.com/archives/C02291Z9YQY/p1676919972260169?thread_ts=1676915698.659289&cid=C02291Z9YQY and {T329398} I'... [15:36:17] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): Rename usages of whitelist to allowlist in query service rdf repo - https://phabricator.wikimedia.org/T344284 (10Gehel) [15:38:43] 10Data-Platform-SRE, 10Discovery-Search (Current work): Add Zookeeper config to 'flink-app' test service - https://phabricator.wikimedia.org/T344614 (10Gehel) [15:42:19] 10Data-Platform-SRE: Multiple DAGs on platform_eng instance failing on Spark Skein operators with ConnectionError - https://phabricator.wikimedia.org/T344617 (10BTullis) I have regenerated the skein certificates. ` btullis@marlin:~/wmf/archiva$ ssh an-airflow1004.eqiad.wmnet btullis@an-airflow1004:~$ sudo su - a... 
[15:42:36] 10Data-Platform-SRE: Multiple DAGs on platform_eng instance failing on Spark Skein operators with ConnectionError - https://phabricator.wikimedia.org/T344617 (10BTullis) p:05Unbreak!→03High [15:46:28] PROBLEM - Check systemd state on an-presto1002 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:47:12] PROBLEM - Presto Server on an-presto1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down [15:48:42] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:49:51] 10Data-Platform-SRE, 10Discovery-Search (Current work): Add Zookeeper config to 'flink-app' test service - https://phabricator.wikimedia.org/T344614 (10bking) [[ https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/ha/zookeeper_ha/#example-configuration | Example configuration from the Flink we... [15:49:58] RECOVERY - Presto Server on an-presto1002 is OK: PROCS OK: 1 process with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down [15:50:06] 10Data-Platform-SRE: Multiple DAGs on platform_eng instance failing on Spark Skein operators with ConnectionError - https://phabricator.wikimedia.org/T344617 (10BTullis) @xcollazo - Could you see if this fixes the immediate issue please? 
[15:50:38] RECOVERY - Check systemd state on an-presto1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:53:42] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:53:43] (MediawikiPageContentChangeEnrichAvailability) firing: ... [15:53:43] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability [15:57:53] 10Data-Platform-SRE, 10Discovery-Search (Current work): Add Zookeeper config to 'rdf-streaming updater' test service on DSE cluster - https://phabricator.wikimedia.org/T344614 (10bking) [16:03:07] 10Data-Platform-SRE: Multiple DAGs on platform_eng instance failing on Spark Skein operators with ConnectionError - https://phabricator.wikimedia.org/T344617 (10xcollazo) Thanks @BTullis! All runs of `test_generic_artifact_deployment_dag` are now green, and `country_project_page_daily_dag` is running its first f... [16:04:28] 10Data-Platform-SRE: Multiple DAGs on platform_eng instance failing on Spark Skein operators with ConnectionError - https://phabricator.wikimedia.org/T344617 (10BTullis) Great! I'm sorry that this affected you. I will make sure that we get a handle on {T329398} because it got away from me. It's an annual time-bo... 
[16:07:23] Data-Engineering, Data-Engineering-Wikistats: Missing contributor stats for Singapore - https://phabricator.wikimedia.org/T344624 (Robertsky)
[16:18:22] Data-Platform-SRE: Automate elastic plugin pkg build process - https://phabricator.wikimedia.org/T303011 (bking) Per email conversation with @MoritzMuehlenhoff , work is underway for a Foundation wide Debian package building service. [[ https://app.asana.com/0/1204893496929527/1204930274838121 | Work is trac...
[16:18:27] Data-Platform-SRE: Automate elastic plugin pkg build process - https://phabricator.wikimedia.org/T303011 (bking) Open→Resolved
[16:25:51] Data-Platform-SRE, Discovery-Search (Current work): Add Zookeeper config to 'rdf-streaming-updater' test service on DSE cluster - https://phabricator.wikimedia.org/T344614 (bking)
[16:31:04] Data-Platform-SRE: Multiple DAGs on platform_eng instance failing on Spark Skein operators with ConnectionError - https://phabricator.wikimedia.org/T344617 (xcollazo) Open→Resolved >Will close this once the first production rerun succeeds. Success. > I will make sure that we get a handle on T329398...
[16:32:09] Data-Platform-SRE: Puppetize Skein certificate generation - https://phabricator.wikimedia.org/T329398 (xcollazo)
[16:38:50] Data-Platform-SRE: Multiple DAGs on platform_eng instance failing on Spark Skein operators with ConnectionError - https://phabricator.wikimedia.org/T344617 (Htriedman) Thanks for taking care of this @xcollazo and @BTullis! really appreciate you catching this while I was OOO
[16:41:47] Data-Engineering: Check home/HDFS leftovers of tmtl.io contractors - https://phabricator.wikimedia.org/T340942 (Htriedman) Hi @BTullis! All of these Tumult Labs folks were working in more of an advisory role — even if their directories contain some uncommitted changes, you can delete them and remove their us...
[17:15:11] 10Data-Engineering, 10cloud-services-team, 10Cloud-Services-Origin-User: WMCS-roots paging responsibilities - https://phabricator.wikimedia.org/T344608 (10fnegri) I found that alert in Logstash, attached is the alert JSON data from Logstash. `"team": "sre"` means that it didn't page WMCS, but shouldn't it h... [17:21:45] 10Data-Platform-SRE, 10Discovery-Search (Current work): Add Zookeeper config to 'rdf-streaming-updater' test service on DSE cluster - https://phabricator.wikimedia.org/T344614 (10bking) We have a couple of different places for rdf-streaming-updater test config within the [[ https://gerrit.wikimedia.org/g/opera... [17:23:47] (03CR) 10Mforns: "I understood we were going to split by web and app, but just that, no further dimensions. Do you not agree? I imagined it as just 3 "monof" [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/950208 (https://phabricator.wikimedia.org/T343557) (owner: 10Clare Ming) [17:57:50] PROBLEM - Check systemd state on an-presto1002 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:58:36] PROBLEM - Presto Server on an-presto1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down [17:58:42] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:19:52] RECOVERY - Presto Server on an-presto1002 is OK: PROCS OK: 1 process with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down [18:20:34] RECOVERY - Check systemd 
state on an-presto1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:20:40] 10Data-Engineering, 10Data Products, 10Data Pipelines (Sprint 14), 10Google-Chrome-User-Agent-Deprecation, 10Product-Analytics (Kanban): Model impact of User-Agent deprecation on top line metrics - https://phabricator.wikimedia.org/T336084 (10VirginiaPoundstone) from @Mayakp.wiki > ...the major concern... [18:21:01] 10Data-Engineering, 10Data Products, 10Data Pipelines (Sprint 14), 10Google-Chrome-User-Agent-Deprecation, 10Product-Analytics (Kanban): Model impact of User-Agent deprecation on top line metrics - https://phabricator.wikimedia.org/T336084 (10VirginiaPoundstone) [18:21:32] 10Data-Engineering, 10Data Products, 10Data Pipelines (Sprint 14), 10Google-Chrome-User-Agent-Deprecation, 10Product-Analytics (Kanban): Model impact of User-Agent deprecation on top line metrics - https://phabricator.wikimedia.org/T336084 (10VirginiaPoundstone) a:05Mayakp.wiki→03Milimetric [18:23:42] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:11:46] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-codfw: Q1:rack/setup/install wdqs20[23-25].codfw.wmnet - https://phabricator.wikimedia.org/T342659 (10Papaul) [19:13:11] 10Data-Engineering, 10Dumps 2.0, 10Data Products (Sprint 0): Develop Dumps Triage Runbook - https://phabricator.wikimedia.org/T343325 (10xcollazo) [19:32:34] 10Data-Platform-SRE: Migrate WDQS and WCQS servers to Debian Bullseye - https://phabricator.wikimedia.org/T343124 (10bking) Since wdqs1010 is in an unreachable state after an attempted reimage, I'm going to update firmware on wdqs10[06-09] before attempting their reimages (wdqs10[03-05] are already scheduled for... 
[19:51:37] Data-Platform-SRE, DC-Ops, SRE, ops-codfw: Q1:rack/setup/install wdqs20[23-25].codfw.wmnet - https://phabricator.wikimedia.org/T342659 (Papaul)
[19:53:43] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[19:53:43] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[20:07:55] (PS1) Sharvaniharan: Updating some more documentation for the new Android schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/951181
[20:21:22] Data-Platform-SRE: Migrate WDQS and WCQS servers to Debian Bullseye - https://phabricator.wikimedia.org/T343124 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs1008.eqiad.wmnet with OS bullseye
[20:24:26] Data-Platform-SRE: Migrate WDQS and WCQS servers to Debian Bullseye - https://phabricator.wikimedia.org/T343124 (bking) DRAC firmware updates have been staged on wdqs1006, 1007, and 1009. We need to reboot these hosts before we start the reimage, so the firmware is actually updated. wdqs1008 has had its fir...
[20:48:24] Data-Platform-SRE, DC-Ops, SRE, ops-codfw, Patch-For-Review: Q1:rack/setup/install wdqs20[23-25].codfw.wmnet - https://phabricator.wikimedia.org/T342659 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host wdqs2024.codfw.wmnet with OS bullseye
[20:50:07] Data-Platform-SRE, DC-Ops, SRE, ops-codfw, Patch-For-Review: Q1:rack/setup/install wdqs20[23-25].codfw.wmnet - https://phabricator.wikimedia.org/T342659 (Papaul)
[20:50:21] (PS2) Clare Ming: Add Metrics Platform fragments by entity, platform [schemas/event/secondary] - https://gerrit.wikimedia.org/r/950208 (https://phabricator.wikimedia.org/T343557)
[20:50:50] (CR) Clare Ming: Add Metrics Platform fragments by entity, platform (8 comments) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/950208 (https://phabricator.wikimedia.org/T343557) (owner: Clare Ming)
[20:58:02] Data-Platform-SRE, DC-Ops, SRE, ops-codfw, Patch-For-Review: Q1:rack/setup/install wdqs20[23-25].codfw.wmnet - https://phabricator.wikimedia.org/T342659 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host wdqs2025.codfw.wmnet with OS bullseye
[20:59:09] Data-Platform-SRE: Migrate WDQS and WCQS servers to Debian Bullseye - https://phabricator.wikimedia.org/T343124 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs1008.eqiad.wmnet with OS bullseye completed: - wdqs1008 (**WARN**) - Downtimed on Icinga/Alertman...
[21:05:48] Data-Platform-SRE, DC-Ops, SRE, ops-codfw, Patch-For-Review: Q1:rack/setup/install wdqs20[23-25].codfw.wmnet - https://phabricator.wikimedia.org/T342659 (Papaul) @Jhancock.wm when you are back onsite can you please check the network cable for wdqs2023 both 10G nic's are showing down. Thanks
[21:35:48] Data-Platform-SRE, DC-Ops, SRE, ops-codfw: Q1:rack/setup/install wdqs20[23-25].codfw.wmnet - https://phabricator.wikimedia.org/T342659 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host wdqs2024.codfw.wmnet with OS bullseye executed with errors: - wdqs2...
[21:45:37] Data-Platform-SRE, DC-Ops, SRE, ops-codfw: Q1:rack/setup/install wdqs20[23-25].codfw.wmnet - https://phabricator.wikimedia.org/T342659 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host wdqs2025.codfw.wmnet with OS bullseye executed with errors: - wdqs2...
[22:20:16] (CR) Clare Ming: Add Metrics Platform fragments by entity, platform (2 comments) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/950208 (https://phabricator.wikimedia.org/T343557) (owner: Clare Ming)
[22:23:43] (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:41:38] Data-Platform-SRE, DC-Ops, SRE, ops-codfw: Q1:rack/setup/install wdqs20[23-25].codfw.wmnet - https://phabricator.wikimedia.org/T342659 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host wdqs2024.codfw.wmnet with OS bullseye
[22:45:53] (PS1) Clare Ming: Add Metrics Platform fragments by platform only [schemas/event/secondary] - https://gerrit.wikimedia.org/r/951191 (https://phabricator.wikimedia.org/T343557)
[22:46:20] (CR) CI reject: [V: -1] Add Metrics Platform fragments by platform only [schemas/event/secondary] - https://gerrit.wikimedia.org/r/951191 (https://phabricator.wikimedia.org/T343557) (owner: Clare Ming)
[22:49:11] (PS2) Clare Ming: Add Metrics Platform fragments by platform only [schemas/event/secondary] - https://gerrit.wikimedia.org/r/951191 (https://phabricator.wikimedia.org/T343557)
[22:52:36] Data-Platform-SRE, DC-Ops, SRE, ops-codfw: Q1:rack/setup/install wdqs20[23-25].codfw.wmnet - https://phabricator.wikimedia.org/T342659 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host wdqs2025.codfw.wmnet with OS bullseye
[22:53:45] (CR) Clare Ming: "This patch removes the entity split grouping -- because of the inclusion of various other fragments, the only data object groupings preser" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/951191 (https://phabricator.wikimedia.org/T343557) (owner: Clare Ming)
[22:57:50] (CR) Clare Ming: "@phuedx @mforns please look at https://gerrit.wikimedia.org/r/c/schemas/event/secondary/+/951191" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/950208 (https://phabricator.wikimedia.org/T343557) (owner: Clare Ming)
[23:08:57] Data-Platform-SRE, DC-Ops, SRE, ops-codfw: Q1:rack/setup/install wdqs20[23-25].codfw.wmnet - https://phabricator.wikimedia.org/T342659 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host wdqs2024.codfw.wmnet with OS bullseye completed: - wdqs2024 (**PASS...
[23:39:19] Data-Platform-SRE, DC-Ops, SRE, ops-codfw: Q1:rack/setup/install wdqs20[23-25].codfw.wmnet - https://phabricator.wikimedia.org/T342659 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host wdqs2025.codfw.wmnet with OS bullseye completed: - wdqs2025 (**PASS...
[23:39:56] Data-Platform-SRE, DC-Ops, SRE, ops-codfw: Q1:rack/setup/install wdqs20[23-25].codfw.wmnet - https://phabricator.wikimedia.org/T342659 (Papaul)
[23:53:43] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[23:53:43] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability