[01:48:16] Data-Engineering: Mediarequests "Not found" error message is confusing - https://phabricator.wikimedia.org/T343945 (Dominicbm)
[02:15:21] Data-Engineering: Method for per-file cumulative total in mediarequests API - https://phabricator.wikimedia.org/T343947 (Dominicbm)
[02:43:15] Data-Engineering: Method for per-file cumulative total in mediarequests API - https://phabricator.wikimedia.org/T343947 (Dominicbm)
[03:33:34] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[03:33:34] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[07:33:34] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[07:33:35] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[08:22:24] (PS4) DCausse: Add mediawiki/cirrussearch/page_rerender [schemas/event/primary] - https://gerrit.wikimedia.org/r/935697 (https://phabricator.wikimedia.org/T325565)
[08:22:55] (CR) CI reject: [V: -1] Add mediawiki/cirrussearch/page_rerender [schemas/event/primary] - https://gerrit.wikimedia.org/r/935697 (https://phabricator.wikimedia.org/T325565) (owner: DCausse)
[08:55:49] (PS5) DCausse: Add mediawiki/cirrussearch/page_rerender [schemas/event/primary] - https://gerrit.wikimedia.org/r/935697 (https://phabricator.wikimedia.org/T325565)
[08:55:51] (PS1) DCausse: cirrussearch: fragments should be under /fragment [schemas/event/primary] - https://gerrit.wikimedia.org/r/947793
[08:57:04] !log paused all dags on all airflow instances
[08:57:05] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:57:18] !log stopped all airflow-scheduler services
[08:57:19] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:58:38] !log rebooting an-db1001
[08:58:39] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:00:56] PROBLEM - Checks that the local airflow scheduler for airflow @search is working properly on an-airflow1005 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-search /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1005.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[09:10:24] RECOVERY - Checks that the local airflow scheduler for airflow @search is working properly on an-airflow1005 is OK: OK: /usr/bin/env AIRFLOW_HOME=/srv/airflow-search /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1005.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[09:18:41] !un-paused all DAGs and restarted all airflow schedulers
[09:54:55] Data-Engineering, Data-Platform-SRE, serviceops-radar, Patch-For-Review: Get datahub-staging.wikimedia.org working with the staging deployment of datahub - https://phabricator.wikimedia.org/T343236 (Stevemunene) >>! In T343236#9079448, @BTullis wrote: >>>! In T343236#9079372, @JMeybohm wrote: >>...
[10:04:55] Data-Engineering, Data-Platform-SRE, serviceops-radar, Patch-For-Review: Get datahub-staging.wikimedia.org working with the staging deployment of datahub - https://phabricator.wikimedia.org/T343236 (BTullis) >>! In T343236#9082978, @Stevemunene wrote: > Could we proceed with the OIDC test with th...
[10:06:38] I'm planning to do a rolling reboot of kafka-jumbo today.
[10:08:16] Oops, scratch that last message. It's not necessary. These are the insetup kafka-jumbo nodes and I already rebooted them.
[10:16:33] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host an-worker1090.eqiad.wmnet with OS bullseye
[10:55:56] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host an-worker1090.eqiad.wmnet with OS bullseye completed: - an-worker1090 (**PASS**) - Downtimed on Icinga/Alertmanag...
[11:14:04] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host an-worker1091.eqiad.wmnet with OS bullseye
[11:33:02] Data-Platform-SRE, Patch-For-Review: Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host an-test-coord1001.eqiad.wmnet with OS bullseye
[11:33:35] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[11:33:35] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[12:06:59] Data-Platform-SRE, Infrastructure-Foundations, Mail: kerberos manage_principals.py emails go to spam - https://phabricator.wikimedia.org/T318155 (BTullis) >>! In T318155#9007691, @MoritzMuehlenhoff wrote: > Maybe a very quick fix with immediate impact is to simply move away from using the local host...
[12:08:48] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host an-worker1091.eqiad.wmnet with OS bullseye completed: - an-worker1091 (**PASS**) - Downtimed on Icinga/Alertmanag...
[12:22:06] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host an-worker1092.eqiad.wmnet with OS bullseye
[12:33:51] Data-Platform-SRE, Patch-For-Review: Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (BTullis) OK, there's still an error on an-test-coord1001 because of a conflict over `python-is-python2`, which is added by hive and `python-is-python3`, which is added by presto-serve...
[12:50:42] Data-Platform-SRE, Discovery-Search: Move whitelist.txt from WDQS deploy repo into puppet - https://phabricator.wikimedia.org/T343856 (Gehel) >>! In T343856#9080017, @Reedy wrote: > Would be a good time to potentially rename the file in support of {T254646}. Yes!
[12:51:16] Data-Platform-SRE, Wikidata-Query-Service: Move whitelist.txt from WDQS deploy repo into puppet - https://phabricator.wikimedia.org/T343856 (Gehel)
[13:22:22] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host an-worker1092.eqiad.wmnet with OS bullseye completed: - an-worker1092 (**PASS**) - Downtimed on Icinga/Alertmanag...
[13:50:43] (SystemdUnitFailed) firing: (6) hive-metastore.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:55:42] (SystemdUnitFailed) firing: (7) hive-metastore.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:01:26] Data-Platform-SRE, Patch-For-Review: Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host an-test-coord1001.eqiad.wmnet with OS bullseye completed: - an-test-coord1001 (**WARN**)...
[14:09:15] Data-Engineering, Data Engineering and Event Platform Team, Machine-Learning-Team, Wikimedia Enterprise, and 2 others: Implement PoC Event-Driven Data Pipeline for Revert Risk Model Scores using Event Platform Capabilities - https://phabricator.wikimedia.org/T338792 (JArguello-WMF)
[14:10:42] (SystemdUnitFailed) firing: (7) hive-metastore.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:15:42] (SystemdUnitFailed) resolved: hive-server2.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:18:29] Data-Platform-SRE, Patch-For-Review: Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (BTullis) Hive isn't happy with the MariaDB connector instead of the MySQL connector. We knew that this was a possibility, but now this is confirmed. I edited `/etc/hive/conf/hive-sit...
[15:33:35] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[15:33:35] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[15:54:01] Data-Platform-SRE, serviceops, Discovery-Search (Current work): Requesting permission to use kafka-main cluster to transport CirrusSearch updates - https://phabricator.wikimedia.org/T341625 (pfischer) @elukey, thank you for your work on rebalancing and optimising kafka! We're glad to hear that our in...
[16:07:27] (CR) Ebernhardson: [C: +2] cirrussearch: fragments should be under /fragment [schemas/event/primary] - https://gerrit.wikimedia.org/r/947793 (owner: DCausse)
[16:07:56] (Merged) jenkins-bot: cirrussearch: fragments should be under /fragment [schemas/event/primary] - https://gerrit.wikimedia.org/r/947793 (owner: DCausse)
[16:15:24] Data-Platform-SRE, Patch-For-Review: Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (BTullis) I've now been able to copy the libmysql-java package to bullseye. ` btullis@apt1001:~$ sudo -i reprepro -C component/libmysql-java copy bullseye-wikimedia buster-wikimedia li...
[16:30:42] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-mysqld-exporter.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:31:44] (CR) Ebernhardson: [C: +2] Add mediawiki/cirrussearch/page_rerender [schemas/event/primary] - https://gerrit.wikimedia.org/r/935697 (https://phabricator.wikimedia.org/T325565) (owner: DCausse)
[16:32:15] (Merged) jenkins-bot: Add mediawiki/cirrussearch/page_rerender [schemas/event/primary] - https://gerrit.wikimedia.org/r/935697 (https://phabricator.wikimedia.org/T325565) (owner: DCausse)
[16:40:42] (SystemdUnitFailed) firing: (2) hive-server2.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:45:42] (SystemdUnitFailed) firing: (2) hive-server2.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:55:42] (SystemdUnitFailed) resolved: (2) hive-server2.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:59:44] !log re-enabled airflow jobs on analytics_test instance
[16:59:46] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:13:05] Data-Platform-SRE: Migrate WDQS and WCQS servers to Debian Bullseye - https://phabricator.wikimedia.org/T343124 (bking) WCQS is now completely on Bullseye, next step is to determine which WDQS hosts need to be upgraded (we don't want to bother upgrading the hosts that'll be retired soon).
[17:47:44] Data-Platform-SRE, Infrastructure-Foundations, Mail: kerberos manage_principals.py emails go to spam - https://phabricator.wikimedia.org/T318155 (BTullis) Open→Resolved a:BTullis I haven't yet tested this, but I'll be on the lookout for any improvements the next time we have to create a...
[17:48:23] Data-Platform-SRE, Patch-For-Review: Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (BTullis)
[17:50:49] Data-Platform-SRE, Patch-For-Review: Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (BTullis) I think that this is all finished now, except for the issue with all servers appearing in the default rack, which we're going to fix with: https://gerrit.wikimedia.org/r/c/op...
[18:21:37] Data-Platform-SRE, Epic: [Epic] Migrate all Search Platform servers to Debian Bullseye - https://phabricator.wikimedia.org/T323921 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs2007.codfw.wmnet with OS bullseye
[18:36:42] Data-Platform-SRE, KaiOS-Wikipedia-app (Discovery): Implement depool (source only) and keep-downtime options on data-transfer cookbook - https://phabricator.wikimedia.org/T340793 (Gehel)
[18:36:54] Data-Platform-SRE: Implement depool (source only) and keep-downtime options on data-transfer cookbook - https://phabricator.wikimedia.org/T340793 (Gehel)
[18:47:42] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:47:58] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:56:36] Data-Platform-SRE, Scap, Wikidata, Wikidata-Query-Service, Release-Engineering-Team (Priority Backlog 📥): wdqs: replace git-fat with git-lfs - https://phabricator.wikimedia.org/T316876 (Gehel)
[19:00:08] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:02:42] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:33:35] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[19:33:35] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[20:18:43] Data-Platform-SRE, Infrastructure-Foundations, SRE, vm-requests: codfw: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T343715 (bking) In the interest of moving this forward, I'm going to go ahead and start provisioning these VMs. If there is a resource shortage in CODFW (or o...
[21:21:49] Data-Platform-SRE, Epic: [Epic] Migrate all Search Platform servers to Debian Bullseye - https://phabricator.wikimedia.org/T323921 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs2007.codfw.wmnet with OS bullseye executed with errors: - wdqs2007 (**FAIL**...
[21:41:27] Data-Platform-SRE: Ensure Data Platform SREs have a contact group in puppet/alerting - https://phabricator.wikimedia.org/T342578 (bking)
[23:33:35] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[23:33:41] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability