[01:48:16] <wikibugs>	 Data-Engineering: Mediarequests "Not found" error message is confusing - https://phabricator.wikimedia.org/T343945 (Dominicbm)
[02:15:21] <wikibugs>	 Data-Engineering: Method for per-file cumulative total in mediarequests API - https://phabricator.wikimedia.org/T343947 (Dominicbm)
[02:43:15] <wikibugs>	 Data-Engineering: Method for per-file cumulative total in mediarequests API - https://phabricator.wikimedia.org/T343947 (Dominicbm)
[03:33:34] <jinxer-wm>	 (MediawikiPageContentChangeEnrichAvailability) firing: ...
[03:33:34] <jinxer-wm>	 Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[07:33:34] <jinxer-wm>	 (MediawikiPageContentChangeEnrichAvailability) firing: ...
[07:33:35] <jinxer-wm>	 Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[08:22:24] <wikibugs>	 (PS4) DCausse: Add mediawiki/cirrussearch/page_rerender [schemas/event/primary] - https://gerrit.wikimedia.org/r/935697 (https://phabricator.wikimedia.org/T325565)
[08:22:55] <wikibugs>	 (CR) CI reject: [V: -1] Add mediawiki/cirrussearch/page_rerender [schemas/event/primary] - https://gerrit.wikimedia.org/r/935697 (https://phabricator.wikimedia.org/T325565) (owner: DCausse)
[08:55:49] <wikibugs>	 (PS5) DCausse: Add mediawiki/cirrussearch/page_rerender [schemas/event/primary] - https://gerrit.wikimedia.org/r/935697 (https://phabricator.wikimedia.org/T325565)
[08:55:51] <wikibugs>	 (PS1) DCausse: cirrussearch: fragments should be under /fragment [schemas/event/primary] - https://gerrit.wikimedia.org/r/947793
[08:57:04] <btullis>	 !log paused all dags on all airflow instances
[08:57:05] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:57:18] <btullis>	 !log stopped all airflow-scheduler services
[08:57:19] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:58:38] <btullis>	 !log rebooting an-db1001
[08:58:39] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:00:56] <icinga-wm>	 PROBLEM - Checks that the local airflow scheduler for airflow @search is working properly on an-airflow1005 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-search /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1005.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[09:10:24] <icinga-wm>	 RECOVERY - Checks that the local airflow scheduler for airflow @search is working properly on an-airflow1005 is OK: OK: /usr/bin/env AIRFLOW_HOME=/srv/airflow-search /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1005.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[09:18:41] <btullis>	 !un-paused all DAGs and restarted all airflow schedulers
[09:54:55] <wikibugs>	 Data-Engineering, Data-Platform-SRE, serviceops-radar, Patch-For-Review: Get datahub-staging.wikimedia.org working with the staging deployment of datahub - https://phabricator.wikimedia.org/T343236 (Stevemunene) >>! In T343236#9079448, @BTullis wrote: >>>! In T343236#9079372, @JMeybohm wrote: >>...
[10:04:55] <wikibugs>	 Data-Engineering, Data-Platform-SRE, serviceops-radar, Patch-For-Review: Get datahub-staging.wikimedia.org working with the staging deployment of datahub - https://phabricator.wikimedia.org/T343236 (BTullis) >>! In T343236#9082978, @Stevemunene wrote: > Could we proceed with the OIDC test with th...
[10:06:38] <btullis>	 I'm planning to do a rolling reboot of kafka-jumbo today.
[10:08:16] <btullis>	 Oops, scratch that last message. It's not necessary. These are the insetup kafka-jumbo nodes and I already rebooted them.
[10:16:33] <wikibugs>	 Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host an-worker1090.eqiad.wmnet with OS bullseye
[10:55:56] <wikibugs>	 Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host an-worker1090.eqiad.wmnet with OS bullseye completed: - an-worker1090 (**PASS**)   - Downtimed on Icinga/Alertmanag...
[11:14:04] <wikibugs>	 Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host an-worker1091.eqiad.wmnet with OS bullseye
[11:33:02] <wikibugs>	 Data-Platform-SRE, Patch-For-Review: Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host an-test-coord1001.eqiad.wmnet with OS bullseye
[11:33:35] <jinxer-wm>	 (MediawikiPageContentChangeEnrichAvailability) firing: ...
[11:33:35] <jinxer-wm>	 Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[12:06:59] <wikibugs>	 Data-Platform-SRE, Infrastructure-Foundations, Mail: kerberos  manage_principals.py emails go to spam - https://phabricator.wikimedia.org/T318155 (BTullis) >>! In T318155#9007691, @MoritzMuehlenhoff wrote: > Maybe a very quick fix with immediate impact is to simply move away from using the local host...
[12:08:48] <wikibugs>	 Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host an-worker1091.eqiad.wmnet with OS bullseye completed: - an-worker1091 (**PASS**)   - Downtimed on Icinga/Alertmanag...
[12:22:06] <wikibugs>	 Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host an-worker1092.eqiad.wmnet with OS bullseye
[12:33:51] <wikibugs>	 Data-Platform-SRE, Patch-For-Review: Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (BTullis) OK, there's still an error on an-test-coord1001 because of a conflict over `python-is-python2`, which is added by hive and `python-is-python3`, which is added by presto-serve...
[12:50:42] <wikibugs>	 Data-Platform-SRE, Discovery-Search: Move whitelist.txt from WDQS deploy repo into puppet - https://phabricator.wikimedia.org/T343856 (Gehel) >>! In T343856#9080017, @Reedy wrote: > Would be a good time to potentially rename the file in support of {T254646}.  Yes!
[12:51:16] <wikibugs>	 Data-Platform-SRE, Wikidata-Query-Service: Move whitelist.txt from WDQS deploy repo into puppet - https://phabricator.wikimedia.org/T343856 (Gehel)
[13:22:22] <wikibugs>	 Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host an-worker1092.eqiad.wmnet with OS bullseye completed: - an-worker1092 (**PASS**)   - Downtimed on Icinga/Alertmanag...
[13:50:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (6) hive-metastore.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:55:42] <jinxer-wm>	 (SystemdUnitFailed) firing: (7) hive-metastore.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:01:26] <wikibugs>	 Data-Platform-SRE, Patch-For-Review: Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host an-test-coord1001.eqiad.wmnet with OS bullseye completed: - an-test-coord1001 (**WARN**)...
[14:09:15] <wikibugs>	 Data-Engineering, Data Engineering and Event Platform Team, Machine-Learning-Team, Wikimedia Enterprise, and 2 others: Implement PoC Event-Driven Data Pipeline for Revert Risk Model Scores using Event Platform Capabilities - https://phabricator.wikimedia.org/T338792 (JArguello-WMF)
[14:10:42] <jinxer-wm>	 (SystemdUnitFailed) firing: (7) hive-metastore.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:15:42] <jinxer-wm>	 (SystemdUnitFailed) resolved: hive-server2.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:18:29] <wikibugs>	 Data-Platform-SRE, Patch-For-Review: Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (BTullis) Hive isn't happy with the MariaDB connector instead of the MySQL connector. We knew that this was a possibility, but now this is confirmed.  I edited `/etc/hive/conf/hive-sit...
[15:33:35] <jinxer-wm>	 (MediawikiPageContentChangeEnrichAvailability) firing: ...
[15:33:35] <jinxer-wm>	 Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[15:54:01] <wikibugs>	 Data-Platform-SRE, serviceops, Discovery-Search (Current work): Requesting permission to use kafka-main cluster to transport CirrusSearch updates - https://phabricator.wikimedia.org/T341625 (pfischer) @elukey, thank you for your work on rebalancing and optimising kafka! We're glad to hear that our in...
[16:07:27] <wikibugs>	 (CR) Ebernhardson: [C: +2] cirrussearch: fragments should be under /fragment [schemas/event/primary] - https://gerrit.wikimedia.org/r/947793 (owner: DCausse)
[16:07:56] <wikibugs>	 (Merged) jenkins-bot: cirrussearch: fragments should be under /fragment [schemas/event/primary] - https://gerrit.wikimedia.org/r/947793 (owner: DCausse)
[16:15:24] <wikibugs>	 Data-Platform-SRE, Patch-For-Review: Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (BTullis) I've now been able to copy the libmysql-java package to bullseye. ` btullis@apt1001:~$ sudo -i reprepro -C component/libmysql-java copy bullseye-wikimedia buster-wikimedia li...
[16:30:42] <jinxer-wm>	 (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-mysqld-exporter.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:31:44] <wikibugs>	 (CR) Ebernhardson: [C: +2] Add mediawiki/cirrussearch/page_rerender [schemas/event/primary] - https://gerrit.wikimedia.org/r/935697 (https://phabricator.wikimedia.org/T325565) (owner: DCausse)
[16:32:15] <wikibugs>	 (Merged) jenkins-bot: Add mediawiki/cirrussearch/page_rerender [schemas/event/primary] - https://gerrit.wikimedia.org/r/935697 (https://phabricator.wikimedia.org/T325565) (owner: DCausse)
[16:40:42] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) hive-server2.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:45:42] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) hive-server2.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:55:42] <jinxer-wm>	 (SystemdUnitFailed) resolved: (2) hive-server2.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:59:44] <btullis>	 !log re-enabled airflow jobs on analytics_test instance
[16:59:46] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:13:05] <wikibugs>	 Data-Platform-SRE: Migrate WDQS and WCQS servers to Debian Bullseye - https://phabricator.wikimedia.org/T343124 (bking) WCQS is now completely on Bullseye, next step is to determine which WDQS hosts need to be upgraded (we don't want to bother upgrading the hosts that'll be retired soon).
[17:47:44] <wikibugs>	 Data-Platform-SRE, Infrastructure-Foundations, Mail: kerberos  manage_principals.py emails go to spam - https://phabricator.wikimedia.org/T318155 (BTullis) Open→Resolved a:BTullis I haven't yet tested this, but I'll be on the lookout for any improvements the next time we have to create a...
[17:48:23] <wikibugs>	 Data-Platform-SRE, Patch-For-Review: Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (BTullis)
[17:50:49] <wikibugs>	 Data-Platform-SRE, Patch-For-Review: Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (BTullis) I think that this is all finished now, except for the issue with all servers appearing in the default rack, which we're going to fix with: https://gerrit.wikimedia.org/r/c/op...
[18:21:37] <wikibugs>	 Data-Platform-SRE, Epic: [Epic] Migrate all Search Platform servers to Debian Bullseye - https://phabricator.wikimedia.org/T323921 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs2007.codfw.wmnet with OS bullseye
[18:36:42] <wikibugs>	 Data-Platform-SRE, KaiOS-Wikipedia-app (Discovery): Implement depool (source only) and keep-downtime options on data-transfer cookbook - https://phabricator.wikimedia.org/T340793 (Gehel)
[18:36:54] <wikibugs>	 Data-Platform-SRE: Implement depool (source only) and keep-downtime options on data-transfer cookbook - https://phabricator.wikimedia.org/T340793 (Gehel)
[18:47:42] <jinxer-wm>	 (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:47:58] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:56:36] <wikibugs>	 Data-Platform-SRE, Scap, Wikidata, Wikidata-Query-Service, Release-Engineering-Team (Priority Backlog 📥): wdqs: replace git-fat with git-lfs - https://phabricator.wikimedia.org/T316876 (Gehel)
[19:00:08] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:02:42] <jinxer-wm>	 (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:33:35] <jinxer-wm>	 (MediawikiPageContentChangeEnrichAvailability) firing: ...
[19:33:35] <jinxer-wm>	 Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[20:18:43] <wikibugs>	 Data-Platform-SRE, Infrastructure-Foundations, SRE, vm-requests: codfw: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T343715 (bking) In the interest of moving this forward, I'm going to go ahead and start provisioning these VMs.  If there is a resource shortage in CODFW (or o...
[21:21:49] <wikibugs>	 Data-Platform-SRE, Epic: [Epic] Migrate all Search Platform servers to Debian Bullseye - https://phabricator.wikimedia.org/T323921 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs2007.codfw.wmnet with OS bullseye executed with errors: - wdqs2007 (**FAIL**...
[21:41:27] <wikibugs>	 Data-Platform-SRE: Ensure Data Platform SREs have a contact group in puppet/alerting - https://phabricator.wikimedia.org/T342578 (bking)
[23:33:35] <jinxer-wm>	 (MediawikiPageContentChangeEnrichAvailability) firing: ...
[23:33:41] <jinxer-wm>	 Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability