[00:00:31] <jinxer-wm>	 (DatasourceError) firing: Nonwrite HTTP requests with primary DB connections alert - https://grafana.wikimedia.org/alerting/grafana/4tAKSjJVz/view - https://wikitech.wikimedia.org/wiki/Monitoring/DatasourceError   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceError
[00:04:31] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[00:09:31] <wikibugs>	 (PS3) Jdlrobson: Special wiki wordmarks and taglines [mediawiki-config] - https://gerrit.wikimedia.org/r/960121 (https://phabricator.wikimedia.org/T341250)
[00:10:31] <jinxer-wm>	 (DatasourceError) resolved: Nonwrite HTTP requests with primary DB connections alert - https://grafana.wikimedia.org/alerting/grafana/4tAKSjJVz/view - https://wikitech.wikimedia.org/wiki/Monitoring/DatasourceError   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceError
[00:11:10] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3316', diff saved to https://phabricator.wikimedia.org/P52675 and previous config saved to /var/cache/conftool/dbconfig/20230927-001109-arnaudb.json
[00:13:15] <wikibugs>	 (PS1) Jdlrobson: New projects default to Vector 2022 [mediawiki-config] - https://gerrit.wikimedia.org/r/961260 (https://phabricator.wikimedia.org/T347444)
[00:26:16] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3316', diff saved to https://phabricator.wikimedia.org/P52676 and previous config saved to /var/cache/conftool/dbconfig/20230927-002616-arnaudb.json
[00:28:03] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - restbase-https_7443: Servers restbase2023.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[00:28:03] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - restbase-https_7443: Servers restbase2023.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[00:29:08] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:29:59] <jinxer-wm>	 (SwaggerProbeHasFailures) firing: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[00:30:41] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase2020.codfw.wmnet with OS bullseye
[00:30:48] <wikibugs>	 SRE, Cassandra, Data-Persistence, Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host restbase2020.codfw.wmnet with OS bullseye completed: - restbase20...
[00:30:55] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[00:34:59] <jinxer-wm>	 (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy   - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[00:35:19] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - restbase-https_7443: Servers restbase2023.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[00:38:15] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[00:38:15] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[00:38:20] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/960677
[00:38:26] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/960677 (owner: TrainBranchBot)
[00:39:59] <jinxer-wm>	 (SwaggerProbeHasFailures) resolved: (2) Not all openapi/swagger endpoints returned healthy   - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[00:41:23] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3316 (T343198)', diff saved to https://phabricator.wikimedia.org/P52677 and previous config saved to /var/cache/conftool/dbconfig/20230927-004122-arnaudb.json
[00:41:24] <wikibugs>	 SRE, Cassandra, Data-Persistence, Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (Eevans)
[00:41:25] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1224.eqiad.wmnet with reason: Maintenance
[00:41:31] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[00:41:38] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1224.eqiad.wmnet with reason: Maintenance
[00:41:45] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1224 (T343198)', diff saved to https://phabricator.wikimedia.org/P52678 and previous config saved to /var/cache/conftool/dbconfig/20230927-004144-arnaudb.json
[00:42:54] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2022.codfw.wmnet with OS bullseye
[00:43:02] <wikibugs>	 SRE, Cassandra, Data-Persistence, Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host restbase2022.codfw.wmnet with OS bullseye
[00:46:53] <icinga-wm>	 PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:47:01] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - restbase-https_7443: Servers restbase2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[00:47:03] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - restbase-https_7443: Servers restbase2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[00:53:14] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/960677 (owner: TrainBranchBot)
[00:58:33] <wikibugs>	 (CR) Ssingh: pyrra add service dns entries (1 comment) [dns] - https://gerrit.wikimedia.org/r/961132 (https://phabricator.wikimedia.org/T302995) (owner: Herron)
[00:59:28] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase2022.codfw.wmnet with reason: host reimage
[01:01:09] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1012 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[01:01:31] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1010 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[01:02:41] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1010 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[01:02:41] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase2022.codfw.wmnet with reason: host reimage
[01:04:55] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[01:07:41] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1014 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[01:09:01] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1002 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[01:13:53] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1014 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[01:15:15] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1224 (T343198)', diff saved to https://phabricator.wikimedia.org/P52679 and previous config saved to /var/cache/conftool/dbconfig/20230927-011514-arnaudb.json
[01:15:23] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[01:18:27] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1014 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[01:19:43] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1014 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[01:25:26] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase2022.codfw.wmnet with OS bullseye
[01:25:33] <wikibugs>	 SRE, Cassandra, Data-Persistence, Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host restbase2022.codfw.wmnet with OS bullseye completed: - restbase20...
[01:25:52] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.remove-downtime for restbase2022.codfw.wmnet
[01:25:53] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for restbase2022.codfw.wmnet
[01:26:54] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2025.codfw.wmnet with OS bullseye
[01:27:03] <wikibugs>	 SRE, Cassandra, Data-Persistence, Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host restbase2025.codfw.wmnet with OS bullseye
[01:27:36] <wikibugs>	 SRE, Cassandra, Data-Persistence, Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (Eevans)
[01:28:20] <wikibugs>	 (CR) Herron: pyrra add service dns entries (1 comment) [dns] - https://gerrit.wikimedia.org/r/961132 (https://phabricator.wikimedia.org/T302995) (owner: Herron)
[01:30:21] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1224', diff saved to https://phabricator.wikimedia.org/P52680 and previous config saved to /var/cache/conftool/dbconfig/20230927-013020-arnaudb.json
[01:30:31] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1012 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[01:31:27] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1010 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[01:31:47] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[01:32:35] <wikibugs>	 (CR) Ssingh: [C: +1] pyrra add service dns entries (1 comment) [dns] - https://gerrit.wikimedia.org/r/961132 (https://phabricator.wikimedia.org/T302995) (owner: Herron)
[01:32:43] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:32:47] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1002 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[01:34:29] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1010 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[01:34:41] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[01:36:35] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1012 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[01:36:43] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1008 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[01:36:45] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1004 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[01:37:27] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[01:37:39] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1012 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[01:37:51] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1004 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[01:38:19] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1006 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[01:38:19] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[01:38:33] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1007 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[01:39:23] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1006 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[01:39:23] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1001 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[01:43:23] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase2025.codfw.wmnet with reason: host reimage
[01:45:28] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1224', diff saved to https://phabricator.wikimedia.org/P52681 and previous config saved to /var/cache/conftool/dbconfig/20230927-014527-arnaudb.json
[01:45:31] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1013 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[01:45:58] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase2025.codfw.wmnet with reason: host reimage
[01:49:04] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1015 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[01:49:24] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1005 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[01:49:44] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1013 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[01:50:24] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1005 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[01:52:12] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1015 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[02:00:35] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1224 (T343198)', diff saved to https://phabricator.wikimedia.org/P52682 and previous config saved to /var/cache/conftool/dbconfig/20230927-020034-arnaudb.json
[02:00:36] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1225.eqiad.wmnet with reason: Maintenance
[02:00:39] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1225.eqiad.wmnet with reason: Maintenance
[02:00:45] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[02:05:29] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db1234.eqiad.wmnet with OS bullseye
[02:05:40] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db1234.eqiad.wmnet with OS bullseye
[02:06:34] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db1235.eqiad.wmnet with OS bullseye
[02:06:41] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db1235.eqiad.wmnet with OS bullseye
[02:07:29] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db1236.eqiad.wmnet with OS bullseye
[02:07:36] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db1236.eqiad.wmnet with OS bullseye
[02:07:54] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[02:08:16] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1012 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[02:08:21] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db1237.eqiad.wmnet with OS bullseye
[02:08:29] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db1237.eqiad.wmnet with OS bullseye
[02:09:10] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db1238.eqiad.wmnet with OS bullseye
[02:09:16] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db1238.eqiad.wmnet with OS bullseye
[02:09:18] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1007 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[02:09:40] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1012 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[02:09:56] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db1239.eqiad.wmnet with OS bullseye
[02:10:03] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db1239.eqiad.wmnet with OS bullseye
[02:10:07] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase2025.codfw.wmnet with OS bullseye
[02:10:13] <wikibugs>	 SRE, Cassandra, Data-Persistence, Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host restbase2025.codfw.wmnet with OS bullseye completed: - restbase20...
[02:10:45] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db1240.eqiad.wmnet with OS bullseye
[02:10:52] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db1240.eqiad.wmnet with OS bullseye
[02:11:02] <wikibugs>	 SRE, Cassandra, Data-Persistence, Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (Eevans)
[02:11:15] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.remove-downtime for restbase2025.codfw.wmnet
[02:11:15] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for restbase2025.codfw.wmnet
[02:11:38] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db1241.eqiad.wmnet with OS bullseye
[02:11:44] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db1241.eqiad.wmnet with OS bullseye
[02:13:24] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1014 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[02:14:50] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1014 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[02:18:13] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1234.eqiad.wmnet with reason: host reimage
[02:19:26] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1235.eqiad.wmnet with reason: host reimage
[02:20:21] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1236.eqiad.wmnet with reason: host reimage
[02:21:09] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1237.eqiad.wmnet with reason: host reimage
[02:21:14] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1004 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[02:21:25] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1234.eqiad.wmnet with reason: host reimage
[02:21:56] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1238.eqiad.wmnet with reason: host reimage
[02:22:14] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1004 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[02:22:36] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1011 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[02:22:43] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1239.eqiad.wmnet with reason: host reimage
[02:23:33] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1240.eqiad.wmnet with reason: host reimage
[02:23:57] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1235.eqiad.wmnet with reason: host reimage
[02:24:08] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1014 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[02:24:29] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1241.eqiad.wmnet with reason: host reimage
[02:26:02] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1237.eqiad.wmnet with reason: host reimage
[02:27:02] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1011 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[02:27:22] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1014 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[02:28:04] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[02:28:04] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1006 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[02:28:26] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1238.eqiad.wmnet with reason: host reimage
[02:28:46] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[02:29:10] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1006 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[02:29:10] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1001 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[02:29:50] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1008 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[02:30:21] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on db1236.eqiad.wmnet with reason: host reimage
[02:32:43] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on db1239.eqiad.wmnet with reason: host reimage
[02:32:52] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[02:33:26] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1241.eqiad.wmnet with reason: host reimage
[02:33:34] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on db1240.eqiad.wmnet with reason: host reimage
[02:33:52] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1008 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[02:36:56] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[02:37:31] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[02:38:02] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1008 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[02:38:03] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[02:38:45] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:38:47] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[02:38:48] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1234.eqiad.wmnet with OS bullseye
[02:38:54] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db1234.eqiad.wmnet with OS bullseye completed: - db1234 (**PASS**)   - Removed f...
[02:39:13] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[02:40:17] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[02:40:17] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1235.eqiad.wmnet with OS bullseye
[02:40:24] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db1235.eqiad.wmnet with OS bullseye completed: - db1235 (**PASS**)   - Removed f...
[02:41:32] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[02:42:42] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[02:42:43] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1237.eqiad.wmnet with OS bullseye
[02:42:50] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db1237.eqiad.wmnet with OS bullseye completed: - db1237 (**PASS**)   - Removed f...
[02:43:03] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[02:43:20] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[02:44:20] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1006 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[02:44:20] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[02:44:32] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[02:44:32] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1238.eqiad.wmnet with OS bullseye
[02:44:40] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db1238.eqiad.wmnet with OS bullseye completed: - db1238 (**PASS**)   - Removed f...
[02:45:22] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[02:45:38] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1006 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[02:45:38] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1001 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[02:46:13] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[02:46:35] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[02:46:36] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1236.eqiad.wmnet with OS bullseye
[02:46:42] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db1236.eqiad.wmnet with OS bullseye completed: - db1236 (**WARN**)   - Removed f...
[02:47:25] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[02:47:26] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1239.eqiad.wmnet with OS bullseye
[02:47:32] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db1239.eqiad.wmnet with OS bullseye completed: - db1239 (**WARN**)   - Removed f...
[02:49:42] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[02:50:27] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[02:50:42] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[02:50:43] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1240.eqiad.wmnet with OS bullseye
[02:50:50] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db1240.eqiad.wmnet with OS bullseye completed: - db1240 (**WARN**)   - Removed f...
[02:51:27] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[02:51:28] <icinga-wm>	 RECOVERY - Check systemd state on gitlab2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:51:28] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1241.eqiad.wmnet with OS bullseye
[02:51:34] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db1241.eqiad.wmnet with OS bullseye completed: - db1241 (**WARN**)   - Removed f...
[02:53:06] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (Jhancock.wm)
[03:08:45] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:26:16] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[04:13:09] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance
[04:13:12] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance
[04:36:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[04:41:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (10) High Kubernetes API latency (LIST certificaterequests) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[04:48:24] <icinga-wm>	 PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 1 (ldap-rw2001), Fresh: 124 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[04:49:34] <jinxer-wm>	 (KubernetesAPILatency) firing: (7) High Kubernetes API latency (LIST certificaterequests) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[04:54:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (8) High Kubernetes API latency (LIST certificaterequests) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[05:44:45] <wikibugs>	 (PS2) Giuseppe Lavagetto: Allow setting values for jsonschema entities [software/conftool] - https://gerrit.wikimedia.org/r/909203
[05:44:47] <wikibugs>	 (PS1) Giuseppe Lavagetto: Add X-Known-Client support [software/conftool] - https://gerrit.wikimedia.org/r/961272
[05:47:41] <wikibugs>	 (CR) CI reject: [V: -1] Add X-Known-Client support [software/conftool] - https://gerrit.wikimedia.org/r/961272 (owner: Giuseppe Lavagetto)
[05:48:36] <wikibugs>	 (PS1) Ilias Sarantopoulos: ml-services: revert CORS settings in app [deployment-charts] - https://gerrit.wikimedia.org/r/961273
[05:49:47] <wikibugs>	 (CR) Ilias Sarantopoulos: [C: +2] ml-services: revert CORS settings in app [deployment-charts] - https://gerrit.wikimedia.org/r/961273 (owner: Ilias Sarantopoulos)
[05:49:58] <jinxer-wm>	 (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (2) wdqs1009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[05:49:58] <jinxer-wm>	 (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (2) wdqs1012:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[05:50:37] <wikibugs>	 (Merged) jenkins-bot: ml-services: revert CORS settings in app [deployment-charts] - https://gerrit.wikimedia.org/r/961273 (owner: Ilias Sarantopoulos)
[05:50:58] <jinxer-wm>	 (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (3) wdqs1008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[05:53:17] <logmsgbot>	 !log isaranto@deploy2002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' .
[05:53:52] <logmsgbot>	 !log isaranto@deploy2002 helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' .
[05:54:32] <logmsgbot>	 !log isaranto@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' .
[05:54:58] <jinxer-wm>	 (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (6) wdqs1006:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[06:00:06] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230927T0600)
[06:04:17] <jinxer-wm>	 (PoolcounterFullQueues) firing: Full queues for poolcounter2003:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:07:46] <jinxer-wm>	 (ProbeDown) firing: Service etherpad1003:7443 has failed probes (http_etherpad_envoy_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#etherpad1003:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:09:17] <jinxer-wm>	 (PoolcounterFullQueues) resolved: Full queues for poolcounter2003:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:12:46] <jinxer-wm>	 (ProbeDown) resolved: Service etherpad1003:7443 has failed probes (http_etherpad_envoy_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#etherpad1003:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:17:25] <icinga-wm>	 ACKNOWLEDGEMENT - MegaRAID on dbstore1005 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T347449 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[06:17:31] <wikibugs>	 SRE, ops-eqiad: Degraded RAID on dbstore1005 - https://phabricator.wikimedia.org/T347449 (ops-monitoring-bot)
[06:32:21] <jinxer-wm>	 (ProbeDown) firing: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:34:58] <jinxer-wm>	 (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (6) wdqs1006:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[06:35:58] <jinxer-wm>	 (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (3) wdqs1008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[06:40:58] <jinxer-wm>	 (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (3) wdqs1008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[06:40:59] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2097.codfw.wmnet with reason: Maintenance
[06:41:13] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2097.codfw.wmnet with reason: Maintenance
[06:50:14] <logmsgbot>	 !log gmodena@deploy2002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply
[06:50:19] <logmsgbot>	 !log gmodena@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply
[06:50:24] <logmsgbot>	 !log gmodena@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[06:54:58] <jinxer-wm>	 (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (2) wdqs1006:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[06:57:31] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on ldap-rw[1001,2001].wikimedia.org with reason: WIP
[06:57:48] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on ldap-rw[1001,2001].wikimedia.org with reason: WIP
[06:59:23] <wikibugs>	 (PS1) Slyngshede: SSH Key mgmt: Allow multiple SSH keys to be stored in LDAP. [software/bitu] - https://gerrit.wikimedia.org/r/961278
[07:00:05] <jouncebot>	 Amir1, Urbanecm, and taavi: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC morning backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230927T0700).
[07:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:04:58] <jinxer-wm>	 (RdfStreamingUpdaterHighConsumerUpdateLag) resolved: (2) wdqs1009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[07:04:58] <jinxer-wm>	 (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (2) wdqs1006:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[07:05:58] <jinxer-wm>	 (RdfStreamingUpdaterHighConsumerUpdateLag) resolved: wdqs1008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[07:08:45] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:09:58] <jinxer-wm>	 (RdfStreamingUpdaterHighConsumerUpdateLag) resolved: (2) wdqs1006:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[07:17:21] <jinxer-wm>	 (ProbeDown) resolved: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:26:16] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[07:39:52] <Emperor>	 !log repool ms-fe2009
[07:39:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:50] <icinga-wm>	 PROBLEM - Host mr1-eqiad.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[07:48:50] <icinga-wm>	 PROBLEM - Host mr1-eqiad.oob is DOWN: PING CRITICAL - Packet loss = 100%
[07:54:20] <icinga-wm>	 RECOVERY - Host mr1-eqiad.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 0.67 ms
[08:00:02] <icinga-wm>	 RECOVERY - Host mr1-eqiad.oob is UP: PING OK - Packet loss = 0%, RTA = 0.59 ms
[08:09:04] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1012 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[08:09:36] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[08:09:38] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1006 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[08:09:47] <elukey>	 this is me sorry --^
[08:10:19] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on 15 hosts with reason: Kafka mirror issues on jumbo
[08:10:42] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 15 hosts with reason: Kafka mirror issues on jumbo
[08:11:45] <wikibugs>	 (PS1) Vgutierrez: hiera: Move HAProxy 2.7 experiments to cp4051 [puppet] - https://gerrit.wikimedia.org/r/961333 (https://phabricator.wikimedia.org/T317799)
[08:11:57] <wikibugs>	 (PS1) Majavah: P:toolforge::instance: decrease priority of access rule [puppet] - https://gerrit.wikimedia.org/r/961334 (https://phabricator.wikimedia.org/T288406)
[08:13:04] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/957766 (https://phabricator.wikimedia.org/T342154) (owner: BCornwall)
[08:13:45] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job jmx_kafka_mirrormaker in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:18:28] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1006 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[08:18:57] <wikibugs>	 (CR) Vgutierrez: [V: +1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43638/console" [puppet] - https://gerrit.wikimedia.org/r/961333 (https://phabricator.wikimedia.org/T317799) (owner: Vgutierrez)
[08:19:08] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job jmx_kafka_mirrormaker in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:19:31] <wikibugs>	 (CR) Vgutierrez: [V: +1 C: +2] hiera: Move HAProxy 2.7 experiments to cp4051 [puppet] - https://gerrit.wikimedia.org/r/961333 (https://phabricator.wikimedia.org/T317799) (owner: Vgutierrez)
[08:19:52] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1009 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[08:21:25] <vgutierrez>	 !log update HAProxy to version 2.7.10 in cp4051 - T317799
[08:21:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:32] <stashbot>	 T317799: Rate limiting for hotlinked images - https://phabricator.wikimedia.org/T317799
[08:22:18] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1012 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[08:23:40] <icinga-wm>	 PROBLEM - Host mr1-eqiad.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[08:23:50] <icinga-wm>	 PROBLEM - Host mr1-eqiad.oob is DOWN: PING CRITICAL - Packet loss = 100%
[08:24:22] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/961092 (owner: 10Majavah)
[08:28:22] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 15 hosts with reason: Kafka mirror issues on jumbo
[08:28:35] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 15 hosts with reason: Kafka mirror issues on jumbo
[08:29:12] <icinga-wm>	 RECOVERY - Host mr1-eqiad.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 2.76 ms
[08:31:08] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/961334 (https://phabricator.wikimedia.org/T288406) (owner: 10Majavah)
[08:32:15] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/960163 (owner: 10Majavah)
[08:33:31] <wikibugs>	 (03CR) 10Marostegui: [C: 04-1] "Please add:" [puppet] - 10https://gerrit.wikimedia.org/r/961068 (https://phabricator.wikimedia.org/T347381) (owner: 10Majavah)
[08:34:08] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job jmx_kafka_mirrormaker in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:34:39] <btullis>	 We have an ongoing incident affecting kafka-jumbo mirror makers that we're handling in #wikimedia-analytics - I am the IC. No user-facing impact at the moment.
[08:34:55] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] dnsrecusor: Remove labs-ip-alias-dump icinga check [puppet] - 10https://gerrit.wikimedia.org/r/960163 (owner: 10Majavah)
[08:35:02] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Add per-output queue monitoring for Juniper network devices - https://phabricator.wikimedia.org/T326322 (10ayounsi)
[08:35:02] <icinga-wm>	 RECOVERY - Host mr1-eqiad.oob is UP: PING OK - Packet loss = 0%, RTA = 2.09 ms
[08:37:00] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] P:toolforge::instance: decrease priority of access rule [puppet] - 10https://gerrit.wikimedia.org/r/961334 (https://phabricator.wikimedia.org/T288406) (owner: 10Majavah)
[08:37:18] <wikibugs>	 (03CR) 10Majavah: [V: 03+1 C: 03+2] galera: Fix some ordering issues [puppet] - 10https://gerrit.wikimedia.org/r/961092 (owner: 10Majavah)
[08:37:50] <wikibugs>	 (03CR) 10Vgutierrez: varnish: allow PURGE requests only from dedicated socket (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/960112 (https://phabricator.wikimedia.org/T347192) (owner: 10Fabfur)
[08:38:11] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] "LGTM, feel free to ignore the nits" [puppet] - 10https://gerrit.wikimedia.org/r/960164 (owner: 10Majavah)
[08:39:26] <icinga-wm>	 RECOVERY - Check systemd state on ldap-rw2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:40:45] <wikibugs>	 (03PS26) 10Fabfur: varnish: allow PURGE requests only from dedicated socket [puppet] - 10https://gerrit.wikimedia.org/r/960112 (https://phabricator.wikimedia.org/T347192)
[08:40:47] <wikibugs>	 (03CR) 10Fabfur: varnish: allow PURGE requests only from dedicated socket (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/960112 (https://phabricator.wikimedia.org/T347192) (owner: 10Fabfur)
[08:41:15] <wikibugs>	 10SRE, 10ops-eqiad: Degraded RAID on dbstore1005 - https://phabricator.wikimedia.org/T347449 (10Marostegui) ` [2574910.962214] megaraid_sas 0000:18:00.0: scanning for scsi0... [2574910.962794] megaraid_sas 0000:18:00.0: 1244 (749109359s/0x0001/CRIT) - VD 00/0 is now DEGRADED [2575154.297980] megaraid_sas 0000:...
[08:42:43] <wikibugs>	 (03PS7) 10Vgutierrez: haproxy: Add support for filter bwlim-(in|out) [puppet] - 10https://gerrit.wikimedia.org/r/928541 (https://phabricator.wikimedia.org/T317799)
[08:42:45] <wikibugs>	 (03PS12) 10Vgutierrez: hiera: Test HAProxy bw limits per URL on cp4051 [puppet] - 10https://gerrit.wikimedia.org/r/928548 (https://phabricator.wikimedia.org/T317799)
[08:44:26] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mathoid: apply
[08:44:29] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mathoid: apply
[08:44:37] <logmsgbot>	 !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/mathoid: apply
[08:44:40] <logmsgbot>	 !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/mathoid: apply
[08:44:52] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install pc101[56] - https://phabricator.wikimedia.org/T342164 (10Marostegui)
[08:44:55] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install pc201[56] - https://phabricator.wikimedia.org/T342163 (10Marostegui)
[08:46:37] <wikibugs>	 10ops-codfw, 10DBA: db2109 crashed - https://phabricator.wikimedia.org/T347318 (10Marostegui) @Jhancock.wm were you able to see anything?
[08:47:07] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch cloudgw/codfw1dev to profile::firewall [puppet] - 10https://gerrit.wikimedia.org/r/958905 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[08:49:08] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job jmx_kafka_mirrormaker in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:52:20] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install pc101[56] - https://phabricator.wikimedia.org/T342164 (10Marostegui) Thank you!
[08:53:18] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+1] maps: remove per-host healthchck [puppet] - 10https://gerrit.wikimedia.org/r/961062 (https://phabricator.wikimedia.org/T314118) (owner: 10Filippo Giunchedi)
[08:53:37] <wikibugs>	 (03PS2) 10Majavah: wiki-replicas.sql: Drop grants for old labstore hosts [puppet] - 10https://gerrit.wikimedia.org/r/961067
[08:53:39] <wikibugs>	 (03PS2) 10Majavah: Allow cloudcontrol1005 and 1007 to connect to wiki replicas [puppet] - 10https://gerrit.wikimedia.org/r/961068 (https://phabricator.wikimedia.org/T347381)
[08:53:58] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/928548 (https://phabricator.wikimedia.org/T317799) (owner: 10Vgutierrez)
[08:54:59] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on db2109.codfw.wmnet with reason: Host crashed
[08:55:03] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2109.codfw.wmnet with reason: Host crashed
[08:55:08] <wikibugs>	 10ops-codfw, 10DBA: db2109 crashed - https://phabricator.wikimedia.org/T347318 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=068d8793-7777-446c-b4d2-653f3aae2433) set by marostegui@cumin1001 for 7 days, 0:00:00 on 1 host(s) and their services with reason: Host crashed ` db2109.codfw.wmnet `
[08:55:30] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Drive host network config from Netbox, and move away from ifupdown - https://phabricator.wikimedia.org/T347411 (10cmooney) >>! In T347411#9200644, @jbond wrote: > We may be able to use redfish to get this information (although i couldn't find it from a quick look) and the u...
[08:56:44] <wikibugs>	 (03PS27) 10Fabfur: varnish: allow PURGE requests only from dedicated socket [puppet] - 10https://gerrit.wikimedia.org/r/960112 (https://phabricator.wikimedia.org/T347192)
[08:57:31] <wikibugs>	 ops-codfw, DBA: db2109 crashed - https://phabricator.wikimedia.org/T347318 (Marostegui) p:Triage→Medium
[09:00:08] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] hiera: Test HAProxy bw limits per URL on cp4051 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/928548 (https://phabricator.wikimedia.org/T317799) (owner: 10Vgutierrez)
[09:00:50] <wikibugs>	 (03PS1) 10Majavah: cr-labs: Permit dbproxy access for wiki replica metadata database [homer/public] - 10https://gerrit.wikimedia.org/r/961336 (https://phabricator.wikimedia.org/T347381)
[09:02:55] <wikibugs>	 (03PS1) 10Clément Goubert: mw-on-k8s: Raise idle worker alerting threshold to 50% [alerts] - 10https://gerrit.wikimedia.org/r/961337 (https://phabricator.wikimedia.org/T346422)
[09:05:08] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on 15 hosts with reason: Kafka mirror issues on jumbo
[09:05:21] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on 15 hosts with reason: Kafka mirror issues on jumbo
[09:06:14] <wikibugs>	 (03CR) 10Vgutierrez: varnish: allow PURGE requests only from dedicated socket (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/960112 (https://phabricator.wikimedia.org/T347192) (owner: 10Fabfur)
[09:06:44] <jinxer-wm>	 (HaproxyUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[09:06:55] <wikibugs>	 (03PS1) 10Marostegui: install_server: Do not reimage pc1015 [puppet] - 10https://gerrit.wikimedia.org/r/961339
[09:06:58] <jynus>	 vgutierrez:  ^
[09:07:06] <jynus>	 haproxy paged
[09:07:15] <jynus>	 acking 
[09:07:28] <jynus>	 ah, Cathal was faster :-D
[09:07:56] <wikibugs>	 (03PS1) 10Muehlenhoff: clouwgw: Update ordering for the variant using profile::firewall [puppet] - 10https://gerrit.wikimedia.org/r/961340 (https://phabricator.wikimedia.org/T336497)
[09:08:03] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage pc1015 [puppet] - 10https://gerrit.wikimedia.org/r/961339 (owner: 10Marostegui)
[09:08:09] <logmsgbot>	 !log gmodena@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply
[09:08:15] <logmsgbot>	 !log gmodena@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[09:08:21] <jinxer-wm>	 (ProbeDown) firing: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:08:24] <wikibugs>	 (03PS1) 10Clément Goubert: mw-api-ext, mw-web: raise replicas for traffic bump [deployment-charts] - 10https://gerrit.wikimedia.org/r/961341 (https://phabricator.wikimedia.org/T346422)
[09:10:04] <logmsgbot>	 !log gmodena@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply
[09:10:07] <logmsgbot>	 !log gmodena@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[09:11:44] <jinxer-wm>	 (HaproxyUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[09:12:21] <wikibugs>	 (03CR) 10Clément Goubert: "Just in case." [deployment-charts] - 10https://gerrit.wikimedia.org/r/961341 (https://phabricator.wikimedia.org/T346422) (owner: 10Clément Goubert)
[09:12:24] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/961340 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[09:12:50] <wikibugs>	 (03PS2) 10Clément Goubert: trafficserver: move 8% of traffic to mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/957858 (https://phabricator.wikimedia.org/T346422) (owner: 10Giuseppe Lavagetto)
[09:13:21] <jinxer-wm>	 (ProbeDown) resolved: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:13:34] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] clouwgw: Update ordering for the variant using profile::firewall [puppet] - 10https://gerrit.wikimedia.org/r/961340 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[09:14:43] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] clouwgw: Update ordering for the variant using profile::firewall [puppet] - 10https://gerrit.wikimedia.org/r/961340 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[09:14:57] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on db2109.codfw.wmnet with reason: Host crashed
[09:15:01] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2109.codfw.wmnet with reason: Host crashed
[09:15:02] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] wiki-replicas.sql: Drop grants for old labstore hosts [puppet] - 10https://gerrit.wikimedia.org/r/961067 (owner: 10Majavah)
[09:15:05] <wikibugs>	 10ops-codfw, 10DBA: db2109 crashed - https://phabricator.wikimedia.org/T347318 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=68fd1013-aa8f-4502-bb60-c027808c1750) set by marostegui@cumin1001 for 7 days, 0:00:00 on 1 host(s) and their services with reason: Host crashed ` db2109.codfw.wmnet `
[09:15:33] <wikibugs>	 (03PS1) 10Vgutierrez: haproxy: Disable varnish connection limit [puppet] - 10https://gerrit.wikimedia.org/r/961343 (https://phabricator.wikimedia.org/T310609)
[09:15:43] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] trafficserver: move 8% of traffic to mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/957858 (https://phabricator.wikimedia.org/T346422) (owner: 10Giuseppe Lavagetto)
[09:16:01] <wikibugs>	 (03PS2) 10Vgutierrez: haproxy: Disable varnish connection limit [puppet] - 10https://gerrit.wikimedia.org/r/961343 (https://phabricator.wikimedia.org/T310609)
[09:18:36] <jinxer-wm>	 (ProbeDown) firing: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:19:07] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43639/console" [puppet] - 10https://gerrit.wikimedia.org/r/961343 (https://phabricator.wikimedia.org/T310609) (owner: 10Vgutierrez)
[09:19:30] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] haproxy: Disable varnish connection limit [puppet] - 10https://gerrit.wikimedia.org/r/961343 (https://phabricator.wikimedia.org/T310609) (owner: 10Vgutierrez)
[09:22:28] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] Allow cloudcontrol1005 and 1007 to connect to wiki replicas [puppet] - 10https://gerrit.wikimedia.org/r/961068 (https://phabricator.wikimedia.org/T347381) (owner: 10Majavah)
[09:23:00] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] Allow cloudcontrol1005 and 1007 to connect to wiki replicas [puppet] - 10https://gerrit.wikimedia.org/r/961068 (https://phabricator.wikimedia.org/T347381) (owner: 10Majavah)
[09:23:36] <jinxer-wm>	 (ProbeDown) resolved: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:26:29] <wikibugs>	 (03PS1) 10Majavah: hieradata: move maintain-dbusers to cloudcontrol1005 [puppet] - 10https://gerrit.wikimedia.org/r/961345 (https://phabricator.wikimedia.org/T347381)
[09:28:52] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] cr-labs: Permit dbproxy access for wiki replica metadata database [homer/public] - 10https://gerrit.wikimedia.org/r/961336 (https://phabricator.wikimedia.org/T347381) (owner: 10Majavah)
[09:29:15] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] cr-labs: Permit dbproxy access for wiki replica metadata database [homer/public] - 10https://gerrit.wikimedia.org/r/961336 (https://phabricator.wikimedia.org/T347381) (owner: 10Majavah)
[09:29:48] <wikibugs>	 (03Merged) 10jenkins-bot: cr-labs: Permit dbproxy access for wiki replica metadata database [homer/public] - 10https://gerrit.wikimedia.org/r/961336 (https://phabricator.wikimedia.org/T347381) (owner: 10Majavah)
[09:29:59] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] hieradata: move maintain-dbusers to cloudcontrol1005 [puppet] - 10https://gerrit.wikimedia.org/r/961345 (https://phabricator.wikimedia.org/T347381) (owner: 10Majavah)
[09:33:01] <taavi>	 !log update CR firewall policy, gerrit 961336
[09:33:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:36:01] <jayme>	 !log cordoning kubernetes1013 for debug purposes
[09:36:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:36:09] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] hieradata: move maintain-dbusers to cloudcontrol1005 [puppet] - 10https://gerrit.wikimedia.org/r/961345 (https://phabricator.wikimedia.org/T347381) (owner: 10Majavah)
[09:36:19] <wikibugs>	 (03CR) 10Fabfur: varnish: allow PURGE requests only from dedicated socket (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/960112 (https://phabricator.wikimedia.org/T347192) (owner: 10Fabfur)
[09:39:27] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] mw-on-k8s: Raise idle worker alerting threshold to 50% [alerts] - 10https://gerrit.wikimedia.org/r/961337 (https://phabricator.wikimedia.org/T346422) (owner: 10Clément Goubert)
[09:40:43] <wikibugs>	 (03Merged) 10jenkins-bot: mw-on-k8s: Raise idle worker alerting threshold to 50% [alerts] - 10https://gerrit.wikimedia.org/r/961337 (https://phabricator.wikimedia.org/T346422) (owner: 10Clément Goubert)
[09:43:31] <claime>	 !log Bumping mw-on-k8s traffic to 8% - T346422
[09:43:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:43:38] <stashbot>	 T346422: Move 10% of mediawiki external requests to mw on k8s - https://phabricator.wikimedia.org/T346422
[09:44:33] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] trafficserver: move 8% of traffic to mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/957858 (https://phabricator.wikimedia.org/T346422) (owner: 10Giuseppe Lavagetto)
[09:45:11] <claime>	 jynus, topranks, heads up ^
[09:45:25] <topranks>	 claime: thanks :)
[09:46:28] <jynus>	 thanks
[09:47:57] <wikibugs>	 (03PS1) 10Majavah: hieradata: move Galera primary to cloudcontrol1007 [puppet] - 10https://gerrit.wikimedia.org/r/961348 (https://phabricator.wikimedia.org/T346891)
[09:47:59] <wikibugs>	 (03PS1) 10Majavah: hieradata: move prometheus-openstack-exporter to cloudcontrol1007 [puppet] - 10https://gerrit.wikimedia.org/r/961349 (https://phabricator.wikimedia.org/T346891)
[09:48:03] <logmsgbot>	 !log jayme@cumin1001 conftool action : set/pooled=no; selector: name=kubernetes1013.*
[09:49:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:49:59] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10LDAP, 10Patch-For-Review: Migrate the r/w LDAP servers to Bookworm and MDB storage - https://phabricator.wikimedia.org/T331699 (10jcrespo) I got an alert about ldap-rw2001 failing its backups (probably expected during setup), but wanted to give a heads up.
[09:50:42] <wikibugs>	 (03PS1) 10Jcrespo: this is a test patch - ignore [puppet] - 10https://gerrit.wikimedia.org/r/961350
[09:50:52] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] hieradata: move Galera primary to cloudcontrol1007 [puppet] - 10https://gerrit.wikimedia.org/r/961348 (https://phabricator.wikimedia.org/T346891) (owner: 10Majavah)
[09:51:08] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] hieradata: move prometheus-openstack-exporter to cloudcontrol1007 [puppet] - 10https://gerrit.wikimedia.org/r/961349 (https://phabricator.wikimedia.org/T346891) (owner: 10Majavah)
[09:51:31] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] hieradata: move Galera primary to cloudcontrol1007 [puppet] - 10https://gerrit.wikimedia.org/r/961348 (https://phabricator.wikimedia.org/T346891) (owner: 10Majavah)
[09:51:38] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] hieradata: move prometheus-openstack-exporter to cloudcontrol1007 [puppet] - 10https://gerrit.wikimedia.org/r/961349 (https://phabricator.wikimedia.org/T346891) (owner: 10Majavah)
[09:52:00] <wikibugs>	 (03Abandoned) 10Jcrespo: this is a test patch - ignore [puppet] - 10https://gerrit.wikimedia.org/r/961350 (owner: 10Jcrespo)
[09:54:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:57:21] <jinxer-wm>	 (ProbeDown) firing: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:58:13] <wikibugs>	 (03PS1) 10Clément Goubert: mw-on-k8s: Remove wikidata exception [puppet] - 10https://gerrit.wikimedia.org/r/961351 (https://phabricator.wikimedia.org/T290536)
[09:59:05] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: Add X-Known-Client support [software/conftool] - 10https://gerrit.wikimedia.org/r/961272
[09:59:19] <wikibugs>	 (03PS1) 10Muehlenhoff: Make the dbconfig settings conditional on the hdb backend [puppet] - 10https://gerrit.wikimedia.org/r/961352
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230927T1000)
[10:01:55] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add X-Known-Client support [software/conftool] - 10https://gerrit.wikimedia.org/r/961272 (owner: 10Giuseppe Lavagetto)
[10:02:21] <jinxer-wm>	 (ProbeDown) resolved: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:03:50] <wikibugs>	 (03PS28) 10Fabfur: varnish: allow PURGE requests also from dedicated socket [puppet] - 10https://gerrit.wikimedia.org/r/960112 (https://phabricator.wikimedia.org/T347192)
[10:07:19] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/961352 (owner: 10Muehlenhoff)
[10:09:22] <wikibugs>	 (03PS5) 10Jbond: sre.puppet.sync-netbox-hiera: Add puppetservers to sync [cookbooks] - 10https://gerrit.wikimedia.org/r/961155 (https://phabricator.wikimedia.org/T347410)
[10:11:19] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] mw-on-k8s: Remove wikidata exception [puppet] - 10https://gerrit.wikimedia.org/r/961351 (https://phabricator.wikimedia.org/T290536) (owner: 10Clément Goubert)
[10:11:40] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] mw-on-k8s: Remove wikidata exception [puppet] - 10https://gerrit.wikimedia.org/r/961351 (https://phabricator.wikimedia.org/T290536) (owner: 10Clément Goubert)
[10:13:13] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:13:23] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/961155 (https://phabricator.wikimedia.org/T347410) (owner: 10Jbond)
[10:13:36] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] sre.puppet.sync-netbox-hiera: Add puppetservers to sync [cookbooks] - 10https://gerrit.wikimedia.org/r/961155 (https://phabricator.wikimedia.org/T347410) (owner: 10Jbond)
[10:14:01] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:14:43] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:15:43] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: Add X-Known-Client support [software/conftool] - 10https://gerrit.wikimedia.org/r/961272
[10:15:49] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:18:32] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add X-Known-Client support [software/conftool] - 10https://gerrit.wikimedia.org/r/961272 (owner: 10Giuseppe Lavagetto)
[10:19:47] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:20:43] <wikibugs>	 SRE, Infrastructure-Foundations, Puppet-Infrastructure, Patch-For-Review, Puppet (Puppet 7.0): update netbox sync to also sync to puppetservers - https://phabricator.wikimedia.org/T347410 (jbond) Open→Resolved a:jbond Cookbook has now been updated
[10:20:49] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10jbond)
[10:22:03] <wikibugs>	 (03PS4) 10Giuseppe Lavagetto: Add X-Known-Client support [software/conftool] - 10https://gerrit.wikimedia.org/r/961272
[10:22:19] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:22:53] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50567 bytes in 0.069 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:23:00] <wikibugs>	 (03PS1) 10Clément Goubert: mw-web: Raise main replicas to 22 [deployment-charts] - 10https://gerrit.wikimedia.org/r/961353 (https://phabricator.wikimedia.org/T346422)
[10:23:21] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.288 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:24:17] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] mw-web: Raise main replicas to 22 [deployment-charts] - 10https://gerrit.wikimedia.org/r/961353 (https://phabricator.wikimedia.org/T346422) (owner: 10Clément Goubert)
[10:25:03] <wikibugs>	 (03Merged) 10jenkins-bot: mw-web: Raise main replicas to 22 [deployment-charts] - 10https://gerrit.wikimedia.org/r/961353 (https://phabricator.wikimedia.org/T346422) (owner: 10Clément Goubert)
[10:27:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[10:27:14] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply
[10:27:23] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply
[10:27:30] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[10:27:45] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[10:32:09] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[10:32:29] <wikibugs>	 (03PS1) 10Ilias Sarantopoulos: ml-services: allow empty boolean query param in ores legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/961355 (https://phabricator.wikimedia.org/T347193)
[10:36:07] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C: 03+2] ml-services: allow empty boolean query param in ores legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/961355 (https://phabricator.wikimedia.org/T347193) (owner: 10Ilias Sarantopoulos)
[10:37:23] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: allow empty boolean query param in ores legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/961355 (https://phabricator.wikimedia.org/T347193) (owner: 10Ilias Sarantopoulos)
[10:38:00] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T343198)', diff saved to https://phabricator.wikimedia.org/P52683 and previous config saved to /var/cache/conftool/dbconfig/20230927-103800-arnaudb.json
[10:38:10] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[10:39:18] <logmsgbot>	 !log isaranto@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' .
[10:39:44] <logmsgbot>	 !log isaranto@deploy2002 helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' .
[10:40:12] <logmsgbot>	 !log isaranto@deploy2002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' .
[10:40:38] <wikibugs>	 (PS29) Fabfur: varnish: allow PURGE requests also from dedicated socket [puppet] - https://gerrit.wikimedia.org/r/960112 (https://phabricator.wikimedia.org/T347192)
[10:41:58] <wikibugs>	 (PS1) Giuseppe Lavagetto: Release 2.3.2 [software/conftool] - https://gerrit.wikimedia.org/r/961356
[10:43:07] <wikibugs>	 (PS1) Clément Goubert: mw-web: Raise main replicas to 25 [deployment-charts] - https://gerrit.wikimedia.org/r/961357 (https://phabricator.wikimedia.org/T346422)
[10:43:23] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] Add X-Known-Client support [software/conftool] - https://gerrit.wikimedia.org/r/961272 (owner: Giuseppe Lavagetto)
[10:44:03] <wikibugs>	 (CR) Jbond: [C: +2] backups: Add new filesets for puppetservers [puppet] - https://gerrit.wikimedia.org/r/961179 (https://phabricator.wikimedia.org/T347390) (owner: Jbond)
[10:44:18] <wikibugs>	 (CR) Clément Goubert: [C: +2] mw-web: Raise main replicas to 25 [deployment-charts] - https://gerrit.wikimedia.org/r/961357 (https://phabricator.wikimedia.org/T346422) (owner: Clément Goubert)
[10:44:23] <wikibugs>	 (CR) Jbond: [C: +2] puppetserver: add backups [puppet] - https://gerrit.wikimedia.org/r/961177 (https://phabricator.wikimedia.org/T347390) (owner: Jbond)
[10:45:00] <wikibugs>	 (Merged) jenkins-bot: mw-web: Raise main replicas to 25 [deployment-charts] - https://gerrit.wikimedia.org/r/961357 (https://phabricator.wikimedia.org/T346422) (owner: Clément Goubert)
[10:45:38] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[10:45:50] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[10:46:04] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply
[10:46:13] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply
[10:47:50] <wikibugs>	 (PS1) Jbond: puppetserver: we use the backup profile for backups [puppet] - https://gerrit.wikimedia.org/r/961359 (https://phabricator.wikimedia.org/T347390)
[10:48:09] <wikibugs>	 (CR) Jbond: [C: +2] puppetserver: we use the backup profile for backups [puppet] - https://gerrit.wikimedia.org/r/961359 (https://phabricator.wikimedia.org/T347390) (owner: Jbond)
[10:48:18] <wikibugs>	 (Merged) jenkins-bot: Add X-Known-Client support [software/conftool] - https://gerrit.wikimedia.org/r/961272 (owner: Giuseppe Lavagetto)
[10:48:45] <wikibugs>	 (PS1) Muehlenhoff: Switch main cloudgw hosts to profile::firewall [puppet] - https://gerrit.wikimedia.org/r/961360 (https://phabricator.wikimedia.org/T336497)
[10:49:08] <wikibugs>	 (CR) Volans: sre.hosts.reimage: Suggest install-console for troubleshooting (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/956082 (https://phabricator.wikimedia.org/T345778) (owner: Bking)
[10:50:35] <icinga-wm>	 RECOVERY - Check systemd state on puppetmaster1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:53:07] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P52684 and previous config saved to /var/cache/conftool/dbconfig/20230927-105306-arnaudb.json
[10:53:10] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[10:53:47] <wikibugs>	 SRE-tools, Infrastructure-Foundations, Spicerack, cloud-services-team: [spicerack] Add remote command output to log file - https://phabricator.wikimedia.org/T347093 (Volans) >>! In T347093#9188497, @fnegri wrote: > Is there a task where I can learn more about this?  I don't think we have one open...
[10:55:47] <hauskater>	 Hi. 80k+ logstash errors in the last hour for cewiki alone re a maintenance script
[10:55:59] <hauskater>	 88k+*
[10:56:19] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] Release 2.3.2 [software/conftool] - https://gerrit.wikimedia.org/r/961356 (owner: Giuseppe Lavagetto)
[10:57:53] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/961360 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[10:58:10] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[10:58:11] <wikibugs>	 (Abandoned) Volans: Install hosts: fallback to drmrs [cookbooks] - https://gerrit.wikimedia.org/r/933434 (https://phabricator.wikimedia.org/T340465) (owner: Volans)
[10:59:39] <wikibugs>	 (Merged) jenkins-bot: Release 2.3.2 [software/conftool] - https://gerrit.wikimedia.org/r/961356 (owner: Giuseppe Lavagetto)
[11:04:10] <wikibugs>	 (PS1) Clément Goubert: mw-on-k8s: Fold canaries into global php-fpm idle alert [alerts] - https://gerrit.wikimedia.org/r/961362 (https://phabricator.wikimedia.org/T346422)
[11:04:32] <wikibugs>	 (Abandoned) Jbond: WIP:puppet: Add support for puppetserver v7 [software/spicerack] - https://gerrit.wikimedia.org/r/936782 (owner: Jbond)
[11:04:47] <wikibugs>	 (Abandoned) Jbond: puppet: Add versions method which will return the version of the agnts [software/spicerack] - https://gerrit.wikimedia.org/r/936781 (https://phabricator.wikimedia.org/T341496) (owner: Jbond)
[11:07:39] <icinga-wm>	 RECOVERY - Check systemd state on puppetserver1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:08:14] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P52685 and previous config saved to /var/cache/conftool/dbconfig/20230927-110813-arnaudb.json
[11:10:07] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: cloudgw: refresh puppet namespace [puppet] - https://gerrit.wikimedia.org/r/961363 (https://phabricator.wikimedia.org/T347469)
[11:12:17] <wikibugs>	 (PS2) Jbond: wikimedia.org: drop puppetboard-next [dns] - https://gerrit.wikimedia.org/r/961135 (https://phabricator.wikimedia.org/T347286)
[11:12:26] <wikibugs>	 (CR) Arturo Borrero Gonzalez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/961363 (https://phabricator.wikimedia.org/T347469) (owner: Arturo Borrero Gonzalez)
[11:12:36] <wikibugs>	 (CR) Jbond: [C: +2] puppetboard-next: drop domain as services have now been migrated [puppet] - https://gerrit.wikimedia.org/r/961114 (https://phabricator.wikimedia.org/T347286) (owner: Jbond)
[11:12:40] <wikibugs>	 (CR) Jbond: [C: +2] puppetboard-next: drop domain as services have now been migrated [puppet] - https://gerrit.wikimedia.org/r/961119 (https://phabricator.wikimedia.org/T347286) (owner: Jbond)
[11:12:51] <wikibugs>	 (CR) Jbond: [C: +2] wikimedia.org: drop puppetboard-next [dns] - https://gerrit.wikimedia.org/r/961135 (https://phabricator.wikimedia.org/T347286) (owner: Jbond)
[11:14:50] <wikibugs>	 (PS2) Arturo Borrero Gonzalez: cloudgw: refresh puppet namespace [puppet] - https://gerrit.wikimedia.org/r/961363 (https://phabricator.wikimedia.org/T347469)
[11:14:53] <wikibugs>	 (CR) Arturo Borrero Gonzalez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/961363 (https://phabricator.wikimedia.org/T347469) (owner: Arturo Borrero Gonzalez)
[11:17:02] <wikibugs>	 (PS1) Muehlenhoff: cloudgw: Remove profile::openstack::base::cloudgw::firewall_profile [puppet] - https://gerrit.wikimedia.org/r/961365 (https://phabricator.wikimedia.org/T336497)
[11:17:50] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm, the old puppetdb's are in the insetup role now" [puppet] - https://gerrit.wikimedia.org/r/959754 (https://phabricator.wikimedia.org/T329529) (owner: Muehlenhoff)
[11:18:43] <wikibugs>	 (PS3) Arturo Borrero Gonzalez: cloudgw: refresh puppet namespace [puppet] - https://gerrit.wikimedia.org/r/961363 (https://phabricator.wikimedia.org/T347469)
[11:19:09] <wikibugs>	 (CR) CI reject: [V: -1] cloudgw: refresh puppet namespace [puppet] - https://gerrit.wikimedia.org/r/961363 (https://phabricator.wikimedia.org/T347469) (owner: Arturo Borrero Gonzalez)
[11:19:11] <wikibugs>	 (CR) Arturo Borrero Gonzalez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/961363 (https://phabricator.wikimedia.org/T347469) (owner: Arturo Borrero Gonzalez)
[11:20:35] <icinga-wm>	 PROBLEM - puppetboard-next.wikimedia.org requires authentication on puppetboard1003 is CRITICAL: HTTP CRITICAL: Status line output matched HTTP/1.1 302 - header ocation: https://idp.wikim... not found on https://puppetboard-next.wikimedia.org:443/ - 580 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[11:22:18] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST certificaterequests) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:23:20] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T343198)', diff saved to https://phabricator.wikimedia.org/P52686 and previous config saved to /var/cache/conftool/dbconfig/20230927-112320-arnaudb.json
[11:23:22] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2182.codfw.wmnet with reason: Maintenance
[11:23:29] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[11:23:36] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2182.codfw.wmnet with reason: Maintenance
[11:23:43] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2182 (T343198)', diff saved to https://phabricator.wikimedia.org/P52687 and previous config saved to /var/cache/conftool/dbconfig/20230927-112342-arnaudb.json
[11:24:40] <wikibugs>	 (CR) Arturo Borrero Gonzalez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/961363 (https://phabricator.wikimedia.org/T347469) (owner: Arturo Borrero Gonzalez)
[11:24:46] <wikibugs>	 (CR) Arturo Borrero Gonzalez: "recheck" [puppet] - https://gerrit.wikimedia.org/r/961363 (https://phabricator.wikimedia.org/T347469) (owner: Arturo Borrero Gonzalez)
[11:26:16] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[11:26:21] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2117.codfw.wmnet with reason: Maintenance
[11:26:35] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2117.codfw.wmnet with reason: Maintenance
[11:26:41] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2117 (T343198)', diff saved to https://phabricator.wikimedia.org/P52688 and previous config saved to /var/cache/conftool/dbconfig/20230927-112640-arnaudb.json
[11:26:44] <wikibugs>	 (PS4) Arturo Borrero Gonzalez: cloudgw: refresh puppet namespace [puppet] - https://gerrit.wikimedia.org/r/961363 (https://phabricator.wikimedia.org/T347469)
[11:26:52] <wikibugs>	 (CR) Arturo Borrero Gonzalez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/961363 (https://phabricator.wikimedia.org/T347469) (owner: Arturo Borrero Gonzalez)
[11:26:56] <wikibugs>	 (PS2) Jbond: bacula: update bacula config to trust the pki and puppet ca's [puppet] - https://gerrit.wikimedia.org/r/937445 (https://phabricator.wikimedia.org/T347390)
[11:26:58] <wikibugs>	 (PS1) Majavah: wiki-replicas: Add CREATE USER and GRANT OPTION to labsdbadmin [puppet] - https://gerrit.wikimedia.org/r/961366 (https://phabricator.wikimedia.org/T347381)
[11:27:18] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST certificaterequests) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:27:38] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/961365 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[11:27:40] <wikibugs>	 (CR) Arturo Borrero Gonzalez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/961363 (https://phabricator.wikimedia.org/T347469) (owner: Arturo Borrero Gonzalez)
[11:28:07] <wikibugs>	 (CR) jenkins-bot: cloudgw: refresh puppet namespace [puppet] - https://gerrit.wikimedia.org/r/961363 (https://phabricator.wikimedia.org/T347469) (owner: Arturo Borrero Gonzalez)
[11:32:18] <wikibugs>	 (CR) Jbond: "Please review" [puppet] - https://gerrit.wikimedia.org/r/937445 (https://phabricator.wikimedia.org/T347390) (owner: Jbond)
[11:42:36] <wikibugs>	 (CR) Jcrespo: [C: +1] "This looks good to me, although I would like to be around to test when deployed, to make sure backups and recoveries work as usual. I thin" [puppet] - https://gerrit.wikimedia.org/r/937445 (https://phabricator.wikimedia.org/T347390) (owner: Jbond)
[11:43:19] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [software/bitu] - https://gerrit.wikimedia.org/r/961278 (owner: Slyngshede)
[11:43:36] <icinga-wm>	 PROBLEM - puppetboard-next.wikimedia.org requires authentication on puppetboard2003 is CRITICAL: HTTP CRITICAL: Status line output matched HTTP/1.1 302 - header ocation: https://idp.wikim... not found on https://puppetboard-next.wikimedia.org:443/ - 580 bytes in 0.176 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[11:45:26] <wikibugs>	 (CR) Muehlenhoff: puppetdb: Select the custom nginx provider with no additional modules (1 comment) [puppet] - https://gerrit.wikimedia.org/r/959754 (https://phabricator.wikimedia.org/T329529) (owner: Muehlenhoff)
[11:46:24] <wikibugs>	 (PS1) Majavah: maintain-dbusers: just log to stdout [puppet] - https://gerrit.wikimedia.org/r/961368
[11:46:26] <wikibugs>	 (PS1) Majavah: maintain-dbusers: don't bother querying PAWS user names [puppet] - https://gerrit.wikimedia.org/r/961369
[11:47:19] <wikibugs>	 (PS2) Muehlenhoff: puppetdb: Select the custom nginx provider with no additional modules [puppet] - https://gerrit.wikimedia.org/r/959754 (https://phabricator.wikimedia.org/T329529)
[11:48:27] <wikibugs>	 (PS4) Majavah: dnsrecursor: remove need to run labs-ip-alias-dump twice [puppet] - https://gerrit.wikimedia.org/r/960164
[11:48:57] <wikibugs>	 (PS5) Majavah: dnsrecursor: remove need to run labs-ip-alias-dump twice [puppet] - https://gerrit.wikimedia.org/r/960164
[11:49:33] <wikibugs>	 (CR) CI reject: [V: -1] maintain-dbusers: don't bother querying PAWS user names [puppet] - https://gerrit.wikimedia.org/r/961369 (owner: Majavah)
[11:49:52] <wikibugs>	 (PS5) Arturo Borrero Gonzalez: cloudgw: refresh puppet namespace [puppet] - https://gerrit.wikimedia.org/r/961363 (https://phabricator.wikimedia.org/T347469)
[11:50:06] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/959754 (https://phabricator.wikimedia.org/T329529) (owner: Muehlenhoff)
[11:50:17] <wikibugs>	 (CR) Arturo Borrero Gonzalez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/961363 (https://phabricator.wikimedia.org/T347469) (owner: Arturo Borrero Gonzalez)
[11:50:47] <wikibugs>	 (CR) CI reject: [V: -1] cloudgw: refresh puppet namespace [puppet] - https://gerrit.wikimedia.org/r/961363 (https://phabricator.wikimedia.org/T347469) (owner: Arturo Borrero Gonzalez)
[11:50:50] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:50:50] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:51:58] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50568 bytes in 0.195 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:51:58] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.259 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:52:23] <wikibugs>	 SRE, ops-eqiad, Data-Platform-SRE: Degraded RAID on dbstore1005 - https://phabricator.wikimedia.org/T347449 (BTullis) a:BTullis
[11:53:14] <wikibugs>	 (PS2) Majavah: maintain-dbusers: don't bother querying PAWS user names [puppet] - https://gerrit.wikimedia.org/r/961369
[11:54:06] <wikibugs>	 (CR) Fabfur: varnish: allow PURGE requests also from dedicated socket (1 comment) [puppet] - https://gerrit.wikimedia.org/r/960112 (https://phabricator.wikimedia.org/T347192) (owner: Fabfur)
[11:56:54] <wikibugs>	 (CR) Vgutierrez: [C: +2] hiera: Test HAProxy bw limits per URL on cp4051 [puppet] - https://gerrit.wikimedia.org/r/928548 (https://phabricator.wikimedia.org/T317799) (owner: Vgutierrez)
[11:57:11] <wikibugs>	 (CR) Vgutierrez: [C: +2] haproxy: Add support for filter bwlim-(in|out) [puppet] - https://gerrit.wikimedia.org/r/928541 (https://phabricator.wikimedia.org/T317799) (owner: Vgutierrez)
[11:57:42] <wikibugs>	 (PS6) Arturo Borrero Gonzalez: cloudgw: refresh puppet namespace [puppet] - https://gerrit.wikimedia.org/r/961363 (https://phabricator.wikimedia.org/T347469)
[11:58:07] <wikibugs>	 (CR) Arturo Borrero Gonzalez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/961363 (https://phabricator.wikimedia.org/T347469) (owner: Arturo Borrero Gonzalez)
[11:58:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[11:58:36] <wikibugs>	 (CR) CI reject: [V: -1] cloudgw: refresh puppet namespace [puppet] - https://gerrit.wikimedia.org/r/961363 (https://phabricator.wikimedia.org/T347469) (owner: Arturo Borrero Gonzalez)
[12:00:29] <wikibugs>	 (PS7) Arturo Borrero Gonzalez: cloudgw: refresh puppet namespace [puppet] - https://gerrit.wikimedia.org/r/961363 (https://phabricator.wikimedia.org/T347469)
[12:00:44] <wikibugs>	 (CR) Arturo Borrero Gonzalez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/961363 (https://phabricator.wikimedia.org/T347469) (owner: Arturo Borrero Gonzalez)
[12:01:39] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +1] Switch main cloudgw hosts to profile::firewall [puppet] - https://gerrit.wikimedia.org/r/961360 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[12:02:10] <wikibugs>	 (PS1) Kevin Bazira: ml-services: fix recommendation-api-ng readiness probe failure [deployment-charts] - https://gerrit.wikimedia.org/r/960681 (https://phabricator.wikimedia.org/T347475)
[12:03:09] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[12:03:32] <wikibugs>	 (CR) Slyngshede: [V: +2 C: +2] SSH Key mgmt: Allow multiple SSH keys to be stored in LDAP. [software/bitu] - https://gerrit.wikimedia.org/r/961278 (owner: Slyngshede)
[12:05:19] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Switch main cloudgw hosts to profile::firewall [puppet] - https://gerrit.wikimedia.org/r/961360 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[12:05:25] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1011 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[12:05:27] <icinga-wm>	 PROBLEM - Check systemd state on kafka-jumbo1014 is CRITICAL: CRITICAL - degraded: The following units failed: kafka-mirror-main-eqiad_to_jumbo-eqiad@0.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:05:29] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1010 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[12:05:31] <icinga-wm>	 PROBLEM - Check systemd state on kafka-jumbo1012 is CRITICAL: CRITICAL - degraded: The following units failed: kafka-mirror-main-eqiad_to_jumbo-eqiad@0.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:05:39] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1014 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[12:05:40] <wikibugs>	 SRE, ops-eqiad, Data-Platform-SRE: Degraded RAID on dbstore1005 - https://phabricator.wikimedia.org/T347449 (BTullis) I've verified the above and can confirm that the two slots 1 and 4 are no longer visible to `megacli` ` btullis@dbstore1005:~$ sudo megacli -PDList -a0|grep "Slot Number" Slot Number:...
[12:05:49] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1012 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[12:05:55] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1015 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[12:06:03] <icinga-wm>	 PROBLEM - Check systemd state on kafka-jumbo1013 is CRITICAL: CRITICAL - degraded: The following units failed: kafka-mirror-main-eqiad_to_jumbo-eqiad@0.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:06:03] <icinga-wm>	 PROBLEM - Check systemd state on kafka-jumbo1011 is CRITICAL: CRITICAL - degraded: The following units failed: kafka-mirror-main-eqiad_to_jumbo-eqiad@0.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:06:09] <icinga-wm>	 PROBLEM - Check systemd state on kafka-jumbo1010 is CRITICAL: CRITICAL - degraded: The following units failed: kafka-mirror-main-eqiad_to_jumbo-eqiad@0.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:07:46] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] mw-on-k8s: Fold canaries into global php-fpm idle alert [alerts] - https://gerrit.wikimedia.org/r/961362 (https://phabricator.wikimedia.org/T346422) (owner: Clément Goubert)
[12:08:16] <wikibugs>	 (CR) Clément Goubert: [C: +2] mw-on-k8s: Fold canaries into global php-fpm idle alert [alerts] - https://gerrit.wikimedia.org/r/961362 (https://phabricator.wikimedia.org/T346422) (owner: Clément Goubert)
[12:09:32] <wikibugs>	 (Merged) jenkins-bot: mw-on-k8s: Fold canaries into global php-fpm idle alert [alerts] - https://gerrit.wikimedia.org/r/961362 (https://phabricator.wikimedia.org/T346422) (owner: Clément Goubert)
[12:10:33] <wikibugs>	 (PS1) Slyngshede: Navbar: Show SSH and attributes in menu. [software/bitu] - https://gerrit.wikimedia.org/r/961371
[12:11:11] <wikibugs>	 (CR) Slyngshede: [V: +2 C: +2] Navbar: Show SSH and attributes in menu. [software/bitu] - https://gerrit.wikimedia.org/r/961371 (owner: Slyngshede)
[12:11:28] <wikibugs>	 (CR) David Caro: [C: +1] "Nice" [puppet] - https://gerrit.wikimedia.org/r/961369 (owner: Majavah)
[12:11:42] <wikibugs>	 (PS2) Muehlenhoff: cloudgw: Remove profile::openstack::base::cloudgw::firewall_profile [puppet] - https://gerrit.wikimedia.org/r/961365 (https://phabricator.wikimedia.org/T336497)
[12:13:40] <wikibugs>	 (CR) Jbond: [C: +1] puppetdb: Select the custom nginx provider with no additional modules [puppet] - https://gerrit.wikimedia.org/r/959754 (https://phabricator.wikimedia.org/T329529) (owner: Muehlenhoff)
[12:14:22] <wikibugs>	 (PS8) Arturo Borrero Gonzalez: cloudgw: refresh puppet namespace [puppet] - https://gerrit.wikimedia.org/r/961363 (https://phabricator.wikimedia.org/T347469)
[12:14:33] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/961365 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[12:14:46] <wikibugs>	 (CR) Arturo Borrero Gonzalez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/961363 (https://phabricator.wikimedia.org/T347469) (owner: Arturo Borrero Gonzalez)
[12:16:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[12:16:11] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1013 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[12:17:20] <wikibugs>	 (PS1) Vgutierrez: haproxy: Fix filter bwlim syntax [puppet] - https://gerrit.wikimedia.org/r/961373 (https://phabricator.wikimedia.org/T317799)
[12:18:02] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on 6 hosts with reason: Still running on 9 mirrormaker processes from main-eqiad to jumbo
[12:18:03] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +1] cloudgw: Remove profile::openstack::base::cloudgw::firewall_profile [puppet] - https://gerrit.wikimedia.org/r/961365 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[12:18:20] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on 6 hosts with reason: Still running on 9 mirrormaker processes from main-eqiad to jumbo
[12:18:53] <wikibugs>	 (CR) Vgutierrez: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43647/console" [puppet] - https://gerrit.wikimedia.org/r/961373 (https://phabricator.wikimedia.org/T317799) (owner: Vgutierrez)
[12:19:31] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:20:03] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:20:15] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:20:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:21:09] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[12:21:11] <wikibugs>	 (PS2) Vgutierrez: haproxy: Fix filter bwlim syntax [puppet] - https://gerrit.wikimedia.org/r/961373 (https://phabricator.wikimedia.org/T317799)
[12:22:32] <wikibugs>	 (CR) Muehlenhoff: [C: +2] cloudgw: Remove profile::openstack::base::cloudgw::firewall_profile [puppet] - https://gerrit.wikimedia.org/r/961365 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[12:23:16] <wikibugs>	 (CR) Vgutierrez: "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43648/console" [puppet] - https://gerrit.wikimedia.org/r/961373 (https://phabricator.wikimedia.org/T317799) (owner: Vgutierrez)
[12:24:17] <wikibugs>	 (PS3) Vgutierrez: haproxy: Fix filter bwlim syntax [puppet] - https://gerrit.wikimedia.org/r/961373 (https://phabricator.wikimedia.org/T317799)
[12:24:56] <wikibugs>	 (PS9) Arturo Borrero Gonzalez: cloudgw: refresh puppet namespace [puppet] - https://gerrit.wikimedia.org/r/961363 (https://phabricator.wikimedia.org/T347469)
[12:25:22] <wikibugs>	 (CR) Arturo Borrero Gonzalez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/961363 (https://phabricator.wikimedia.org/T347469) (owner: Arturo Borrero Gonzalez)
[12:25:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:25:40] <wikibugs>	 (CR) Vgutierrez: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43649/console" [puppet] - https://gerrit.wikimedia.org/r/961373 (https://phabricator.wikimedia.org/T317799) (owner: Vgutierrez)
[12:26:27] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Platform-SRE: Replace RAID controller battery on an-worker1086 - https://phabricator.wikimedia.org/T347287 (BTullis) Just adding here, the server didn't boot successfully.
[12:26:54] <wikibugs>	 (CR) Vgutierrez: [V: +1 C: +2] haproxy: Fix filter bwlim syntax [puppet] - https://gerrit.wikimedia.org/r/961373 (https://phabricator.wikimedia.org/T317799) (owner: Vgutierrez)
[12:29:33] <wikibugs>	 (PS1) Slyngshede: C:idm::deployment ensure that rq service is enabled for git installs [puppet] - https://gerrit.wikimedia.org/r/961375
[12:32:45] <icinga-wm>	 RECOVERY - Check systemd state on kafka-jumbo1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:33:12] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/961363 (https://phabricator.wikimedia.org/T347469) (owner: Arturo Borrero Gonzalez)
[12:33:27] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1010 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[12:33:54] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] cloudgw: refresh puppet namespace [puppet] - https://gerrit.wikimedia.org/r/961363 (https://phabricator.wikimedia.org/T347469) (owner: Arturo Borrero Gonzalez)
[12:36:04] <wikibugs>	 (CR) Majavah: [C: +2] maintain-dbusers: just log to stdout [puppet] - https://gerrit.wikimedia.org/r/961368 (owner: Majavah)
[12:36:17] <wikibugs>	 (CR) Majavah: [C: +2] maintain-dbusers: don't bother querying PAWS user names [puppet] - https://gerrit.wikimedia.org/r/961369 (owner: Majavah)
[12:37:31] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 9.455 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:38:29] <icinga-wm>	 RECOVERY - Check systemd state on kafka-jumbo1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:38:33] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1011 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[12:38:42] <wikibugs>	 (PS2) Slyngshede: C:idm::deployment ensure that rq service is enabled for git installs [puppet] - https://gerrit.wikimedia.org/r/961375
[12:39:38] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 0:20:00 on dbstore1005.eqiad.wmnet with reason: Cold booting to see if it sees two missing disks
[12:39:41] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:20:00 on dbstore1005.eqiad.wmnet with reason: Cold booting to see if it sees two missing disks
[12:39:50] <wikibugs>	 SRE, ops-eqiad, Data-Platform-SRE: Degraded RAID on dbstore1005 - https://phabricator.wikimedia.org/T347449 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=195bf9c0-3e24-446f-ba90-48d15ed5d628) set by btullis@cumin1001 for 0:20:00 on 1 host(s) and their services with reason: Cold bo...
[12:43:09] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:44:11] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.278 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:45:45] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: cloudgw: load nf_conntrack sysctl settings later [puppet] - https://gerrit.wikimedia.org/r/961376 (https://phabricator.wikimedia.org/T347469)
[12:45:49] <wikibugs>	 (CR) Thiemo Kreuz (WMDE): New projects default to Vector 2022 (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/961260 (https://phabricator.wikimedia.org/T347444) (owner: Jdlrobson)
[12:48:21] <wikibugs>	 SRE, ops-eqiad, Data-Platform-SRE: Degraded RAID on dbstore1005 - https://phabricator.wikimedia.org/T347449 (BTullis) I have cold booted it and the missing slots have come back. ` btullis@dbstore1005:~$ sudo megacli -PDList -a0|grep "Slot Number" Slot Number: 0 Slot Number: 1 Slot Number: 2 Slot Numb...
[12:50:31] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1012 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[12:50:41] <icinga-wm>	 RECOVERY - Check systemd state on kafka-jumbo1012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:51:14] <wikibugs>	 SRE, ops-eqiad, Data-Platform-SRE: Degraded RAID on dbstore1005 - https://phabricator.wikimedia.org/T347449 (BTullis) Ok, it's rebuilding automatically. ` btullis@dbstore1005:~$ sudo megacli -PDList -aall|grep 'Firmware state' Firmware state: Online, Spun Up Firmware state: Rebuild Firmware state: On...
[12:53:45] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:54:03] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1013 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[12:54:13] <icinga-wm>	 PROBLEM - Check systemd state on gitlab1004 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-config-backup-gitlab1003.wikimedia.org.timer,rsync-config-backup-gitlab2002.wikimedia.org.timer,rsync-data-backup-gitlab1003.wikimedia.org.timer,rsync-data-backup-gitlab2002.wikimedia.org.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:54:45] <icinga-wm>	 RECOVERY - Check systemd state on kafka-jumbo1013 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:57:25] <wikibugs>	 (PS1) Lucas Werkmeister (WMDE): Add label for Wikifunctions in “other projects” sidebar section [extensions/WikimediaMessages] (wmf/1.41.0-wmf.27) - https://gerrit.wikimedia.org/r/961217 (https://phabricator.wikimedia.org/T342857)
[12:57:27] <wikibugs>	 (PS1) Muehlenhoff: cloudgw: Don't override conntrack settings from firewall profile [puppet] - https://gerrit.wikimedia.org/r/961377 (https://phabricator.wikimedia.org/T336497)
[12:58:07] <wikibugs>	 (PS1) Elukey: modules: duplicate ingress:istio_1.0.2 to 1.0.3 [deployment-charts] - https://gerrit.wikimedia.org/r/961378
[12:58:09] <wikibugs>	 (PS1) Elukey: modules: add CORS policy to Istio Ingress' virtual services [deployment-charts] - https://gerrit.wikimedia.org/r/961379
[12:58:25] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50568 bytes in 0.447 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:58:43] <icinga-wm>	 RECOVERY - Check systemd state on kafka-jumbo1014 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:58:50] <wikibugs>	 SRE, MW-on-K8s, Traffic, serviceops, and 2 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (Jdforrester-WMF)
[12:59:00] <wikibugs>	 (CR) CI reject: [V: -1] modules: add CORS policy to Istio Ingress' virtual services [deployment-charts] - https://gerrit.wikimedia.org/r/961379 (owner: Elukey)
[12:59:11] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1014 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[12:59:24] <wikibugs>	 SRE, Abstract Wikipedia team, MW-on-K8s, Traffic, and 4 others: Migrate functions-orchestrator service to mw-api-int - https://phabricator.wikimedia.org/T347397 (Jdforrester-WMF) Open→In progress
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230927T1300).
[13:00:05] <jouncebot>	 houseofm and Lucas_WMDE: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:11] <Lucas_WMDE>	 o/
[13:00:17] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:04:17] <Lucas_WMDE>	 no HouseOfM yet, I’ll start the gate-and-submit for my backport then
[13:04:40] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C: +2] Add label for Wikifunctions in “other projects” sidebar section [extensions/WikimediaMessages] (wmf/1.41.0-wmf.27) - https://gerrit.wikimedia.org/r/961217 (https://phabricator.wikimedia.org/T342857) (owner: Lucas Werkmeister (WMDE))
[13:08:12] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/961377 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[13:11:27] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1015 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[13:11:56] <wikibugs>	 (PS1) JMeybohm: wikifunctions: Allow orchestrator to connecto to mw-api-int pods [deployment-charts] - https://gerrit.wikimedia.org/r/961383 (https://phabricator.wikimedia.org/T347397)
[13:12:59] <aqu>	 !log Deployment weekly train of analytics-refinery (+new source version)
[13:12:59] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[13:13:03] <Lucas_WMDE>	 FTR, my backport will merge in ca. 6 minutes and will then take quite a while to sync (as it touches i18n)
[13:13:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:13:11] <Lucas_WMDE>	 so if anyone wants to scap something else first, let me know ^^
[13:13:17] <wikibugs>	 SRE, Infrastructure-Foundations: Drive host network config from Netbox, and move away from ifupdown - https://phabricator.wikimedia.org/T347411 (ayounsi) It's great to see momentum on this recurring pain point!  To add to it, we could have the hosts boot up with only a v6 SLAAC IP (decommission the DHCP)...
[13:13:17] <icinga-wm>	 PROBLEM - Check systemd state on kafka-jumbo1001 is CRITICAL: CRITICAL - degraded: The following units failed: kafka-mirror-main-eqiad_to_jumbo-eqiad@0.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:13:36] <wikibugs>	 SRE, Infrastructure-Foundations: Drive host network config from Netbox, and move away from ifupdown - https://phabricator.wikimedia.org/T347411 (cmooney) I had a little play with the redfish api and the PCIe info is available.  Unfortunately Linux predictable interface names still seem about as [[ https:...
[13:14:21] <icinga-wm>	 RECOVERY - Check systemd state on gitlab1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:14:54] <wikibugs>	 (CR) CDanis: [C: +1] haproxy: Fix filter bwlim syntax [puppet] - https://gerrit.wikimedia.org/r/961373 (https://phabricator.wikimedia.org/T317799) (owner: Vgutierrez)
[13:16:24] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/WikimediaMessages] (wmf/1.41.0-wmf.27) - https://gerrit.wikimedia.org/r/961217 (https://phabricator.wikimedia.org/T342857) (owner: Lucas Werkmeister (WMDE))
[13:17:04] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2017.codfw.wmnet with OS bullseye
[13:17:17] <wikibugs>	 SRE, Cassandra, Data-Persistence, Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host restbase2017.codfw.wmnet with OS bullseye
[13:17:28] <logmsgbot>	 !log aqu@deploy2002 Started deploy [analytics/refinery@223be0f]: Regular analytics weekly train [analytics/refinery@223be0fb]
[13:17:31] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host lists1004.mgmt.eqiad.wmnet with reboot policy FORCED
[13:17:34] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install lists1004.eqiad.wmnet - https://phabricator.wikimedia.org/T342374 (Jclark-ctr)
[13:18:13] <wikibugs>	 (CR) JMeybohm: [C: +2] wikifunctions: Allow orchestrator to connecto to mw-api-int pods [deployment-charts] - https://gerrit.wikimedia.org/r/961383 (https://phabricator.wikimedia.org/T347397) (owner: JMeybohm)
[13:18:22] <wikibugs>	 (Merged) jenkins-bot: Add label for Wikifunctions in “other projects” sidebar section [extensions/WikimediaMessages] (wmf/1.41.0-wmf.27) - https://gerrit.wikimedia.org/r/961217 (https://phabricator.wikimedia.org/T342857) (owner: Lucas Werkmeister (WMDE))
[13:18:56] <Lucas_WMDE>	 huh, why is scap backport’s git output showing a bunch of “new branch” for wmf.28
[13:18:56] <wikibugs>	 (Merged) jenkins-bot: wikifunctions: Allow orchestrator to connecto to mw-api-int pods [deployment-charts] - https://gerrit.wikimedia.org/r/961383 (https://phabricator.wikimedia.org/T347397) (owner: JMeybohm)
[13:19:03] <Lucas_WMDE>	 isn’t it already deployed to group0?
[13:19:29] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:961217|Add label for Wikifunctions in “other projects” sidebar section (T342857)]]
[13:19:39] <stashbot>	 T342857: Add wikidata support for wikifunctionswiki - https://phabricator.wikimedia.org/T342857
[13:20:23] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host lists1004.mgmt.eqiad.wmnet with reboot policy FORCED
[13:20:54] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host lists1004.mgmt.eqiad.wmnet with reboot policy FORCED
[13:21:02] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: sync
[13:21:06] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: sync
[13:21:47] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifunctions: sync
[13:21:51] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: sync
[13:24:26] <logmsgbot>	 !log aqu@deploy2002 Finished deploy [analytics/refinery@223be0f]: Regular analytics weekly train [analytics/refinery@223be0fb] (duration: 06m 58s)
[13:25:11] <wikibugs>	 SRE, Cassandra, Data-Persistence, Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (Eevans)
[13:25:16] <wikibugs>	 (PS5) Herron: pyrra: add public dns entries [dns] - https://gerrit.wikimedia.org/r/961133 (https://phabricator.wikimedia.org/T302995)
[13:25:24] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/wikifunctions: sync
[13:26:02] <logmsgbot>	 !log aqu@deploy2002 Started deploy [analytics/refinery@223be0f] (thin): Regular analytics weekly train THIN [analytics/refinery@223be0fb]
[13:26:12] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: sync
[13:26:13] <logmsgbot>	 !log aqu@deploy2002 Finished deploy [analytics/refinery@223be0f] (thin): Regular analytics weekly train THIN [analytics/refinery@223be0fb] (duration: 00m 10s)
[13:26:15] <logmsgbot>	 !log aqu@deploy2002 Started deploy [analytics/refinery@223be0f] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@223be0fb]
[13:26:31] <logmsgbot>	 !log aqu@deploy2002 deploy aborted: Regular analytics weekly train TEST [analytics/refinery@223be0fb] (duration: 00m 16s)
[13:26:32] <wikibugs>	 SRE, Infrastructure-Foundations: Drive host network config from Netbox, and move away from ifupdown - https://phabricator.wikimedia.org/T347411 (cmooney) >>! In T347411#9203208, @ayounsi wrote: > To add to it, we could have the hosts boot up with only a v6 SLAAC IP (decommission the DHCP) and then get th...
[13:26:53] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host lists1004.mgmt.eqiad.wmnet with reboot policy FORCED
[13:28:35] <wikibugs>	 (PS1) Jclark-ctr: add mossbe1003 to site.pp [puppet] - https://gerrit.wikimedia.org/r/961389 (https://phabricator.wikimedia.org/T342675)
[13:29:32] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review: Add per-output queue monitoring for Juniper network devices - https://phabricator.wikimedia.org/T326322 (ayounsi) To keep it somewhere for later, on Dell SONiC it should be on the `/openconfig-qos:qos/interfaces` path. Grouping it by sour...
[13:29:41] <wikibugs>	 (CR) Jclark-ctr: [C: +2] add mossbe1003 to site.pp [puppet] - https://gerrit.wikimedia.org/r/961389 (https://phabricator.wikimedia.org/T342675) (owner: Jclark-ctr)
[13:30:07] <Lucas_WMDE>	 (still running build-and-push-container-images…)
[13:30:18] <wikibugs>	 SRE, Traffic, Patch-For-Review: Varnish mobile redirection misses some sites - https://phabricator.wikimedia.org/T344175 (Fabfur) Open→Stalled
[13:31:18] <wikibugs>	 SRE, Traffic, Patch-For-Review: Varnish mobile redirection misses some sites - https://phabricator.wikimedia.org/T344175 (Fabfur) Hi, @nshahquinn if it's ok on your side I'll consider this as completed
[13:32:03] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:32:42] <wikibugs>	 SRE, Abstract Wikipedia team, MW-on-K8s, Traffic, and 4 others: Migrate functions-orchestrator service to mw-api-int - https://phabricator.wikimedia.org/T347397 (JMeybohm) a:JMeybohm It took me a while to figure this out, sorry. Due to wikifunctions having more strict firewall rules in genera...
[13:33:24] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase2017.codfw.wmnet with reason: host reimage
[13:35:09] <logmsgbot>	 !log aqu@deploy2002 Started deploy [analytics/refinery@223be0f] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@223be0fb]
[13:36:39] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase2017.codfw.wmnet with reason: host reimage
[13:36:39] <logmsgbot>	 !log fnegri@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcumin2001.codfw.wmnet with OS bullseye
[13:36:47] <wikibugs>	 (CR) Jforrester: "Aha! Nice find." [deployment-charts] - https://gerrit.wikimedia.org/r/961383 (https://phabricator.wikimedia.org/T347397) (owner: JMeybohm)
[13:37:03] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:37:47] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Backport for [[gerrit:961217|Add label for Wikifunctions in “other projects” sidebar section (T342857)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[13:37:54] <Lucas_WMDE>	 testing
[13:38:02] <stashbot>	 T342857: Add wikidata support for wikifunctionswiki - https://phabricator.wikimedia.org/T342857
[13:38:10] <Lucas_WMDE>	 yup, seems to work on the enwiki main page
[13:38:12] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Continuing with sync
[13:38:38] <wikibugs>	 SRE, Traffic: Simplify maintenance of DNS/NTP hosts to reduce toil around reboots, reimages, and other work - https://phabricator.wikimedia.org/T347054 (ssingh) Thanks for the feedback everyone! I was waiting so that we can get most of the comments in before replying; responses inline:  >>! In T347054#91...
[13:38:51] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Upgrade asw1-eqsin - https://phabricator.wikimedia.org/T332395 (ayounsi) FYI, the `mgmt_junos` bug (also present on the fasw) might not be fixed by an upgrade, but maybe with the solution exposed in https://www.reddit.com/r/Juniper/comments/mvq8hf/comment/j7gd...
[13:40:23] <wikibugs>	 (PS3) Slyngshede: C:idm::deployment ensure that rq service is enabled for git installs [puppet] - https://gerrit.wikimedia.org/r/961375
[13:40:24] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host lists1004.mgmt.eqiad.wmnet with reboot policy FORCED
[13:40:48] <wikibugs>	 (CR) CI reject: [V: -1] C:idm::deployment ensure that rq service is enabled for git installs [puppet] - https://gerrit.wikimedia.org/r/961375 (owner: Slyngshede)
[13:41:29] <wikibugs>	 (PS5) Herron: pyrra: add trafficserver mapping [puppet] - https://gerrit.wikimedia.org/r/961128 (https://phabricator.wikimedia.org/T302995)
[13:41:51] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:rack/setup/install cp1[098-113] - https://phabricator.wikimedia.org/T342159 (VRiley-WMF) Updating naming as per requested.   cp1100 - cp1115
[13:42:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 47.04% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[13:42:16] <wikibugs>	 (PS4) Slyngshede: C:idm::deployment ensure that rq service is enabled for git installs [puppet] - https://gerrit.wikimedia.org/r/961375
[13:43:22] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host lists1004.mgmt.eqiad.wmnet with reboot policy FORCED
[13:43:43] <logmsgbot>	 !log aqu@deploy2002 Finished deploy [analytics/refinery@223be0f] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@223be0fb] (duration: 08m 33s)
[13:44:22] <aqu>	 !log Deployed refinery using scap, then deployed onto hdfs
[13:44:24] <wikibugs>	 (PS6) Herron: pyrra: add trafficserver mapping [puppet] - https://gerrit.wikimedia.org/r/961128 (https://phabricator.wikimedia.org/T302995)
[13:44:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:44:41] <wikibugs>	 SRE, Traffic: Simplify maintenance of DNS/NTP hosts to reduce toil around reboots, reimages, and other work - https://phabricator.wikimedia.org/T347054 (ssingh) Looking at `10.3.0.0/24` [[ https://netbox.wikimedia.org/ipam/prefixes/97/ip-addresses/ | in Netbox ]]:  I plan to reserve `10.3.0.8/32` for `nt...
[13:46:04] <wikibugs>	 SRE, MW-on-K8s, Traffic, Wikidata, and 2 others: Serve Wikidata traffic via Kubernetes - https://phabricator.wikimedia.org/T347493 (Lucas_Werkmeister_WMDE)
[13:46:35] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host lists1004.eqiad.wmnet with OS bullseye
[13:46:42] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install lists1004.eqiad.wmnet - https://phabricator.wikimedia.org/T342374 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host lists1004.eqiad.wmnet with OS bullseye
[13:47:09] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 45.39% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[13:47:29] <wikibugs>	 (CR) Jbond: [C: +2] bacula: update bacula config to trust the pki and puppet ca's [puppet] - https://gerrit.wikimedia.org/r/937445 (https://phabricator.wikimedia.org/T347390) (owner: Jbond)
[13:48:04] <wikibugs>	 (CR) Cathal Mooney: [C: +1] "LGTM!" [dns] - https://gerrit.wikimedia.org/r/957745 (https://phabricator.wikimedia.org/T307357) (owner: Arturo Borrero Gonzalez)
[13:48:33] <jinxer-wm>	 (KubernetesAPILatency) firing: (5) High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:49:26] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:961217|Add label for Wikifunctions in “other projects” sidebar section (T342857)]] (duration: 29m 56s)
[13:49:33] <stashbot>	 T342857: Add wikidata support for wikifunctionswiki - https://phabricator.wikimedia.org/T342857
[13:50:38] <Lucas_WMDE>	 still no HouseOfM, so I guess that config change will have to be rescheduled yet again :(
[13:50:57] <Lucas_WMDE>	 !log UTC afternoon backport+config window done
[13:51:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:51:06] <icinga-wm>	 PROBLEM - DPKG on sretest1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[13:51:07] * Lucas_WMDE done deploying
[13:51:30] <wikibugs>	 SRE, Traffic: Simplify maintenance of DNS/NTP hosts to reduce toil around reboots, reimages, and other work - https://phabricator.wikimedia.org/T347054 (ayounsi) Thanks, as this VIP won't be critical we can skip the static routes and only allocate `10.3.0.8/32`.  The existing "Reserved for XXX (backup st...
[13:51:39] <logmsgbot>	 !log fnegri@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcumin2001.codfw.wmnet with reason: host reimage
[13:53:33] <jinxer-wm>	 (KubernetesAPILatency) resolved: (4) High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:53:34] <wikibugs>	 (PS1) JMeybohm: admin_nd: Don't allow uncached api access from wikifunctions [deployment-charts] - https://gerrit.wikimedia.org/r/961394 (https://phabricator.wikimedia.org/T347397)
[13:53:36] <wikibugs>	 (PS1) JMeybohm: admin_ng/wikikube: Allow pods to use DNS over TCP [deployment-charts] - https://gerrit.wikimedia.org/r/961395
[13:53:49] <wikibugs>	 SRE, Traffic: Simplify maintenance of DNS/NTP hosts to reduce toil around reboots, reimages, and other work - https://phabricator.wikimedia.org/T347054 (ssingh) >>! In T347054#9203404, @ayounsi wrote: > Thanks, as this VIP won't be critical we can skip the static routes and only allocate `10.3.0.8/32`. >...
[13:53:59] <wikibugs>	 SRE, Traffic: Allow purged to specify buffer length - https://phabricator.wikimedia.org/T346874 (Fabfur) a:Fabfur
[13:54:20] <logmsgbot>	 !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcumin2001.codfw.wmnet with reason: host reimage
[13:55:16] <wikibugs>	 (PS5) Slyngshede: C:idm::deployment ensure that rq service is enabled for git installs [puppet] - https://gerrit.wikimedia.org/r/961375
[13:55:18] <wikibugs>	 SRE, Traffic: Simplify maintenance of DNS/NTP hosts to reduce toil around reboots, reimages, and other work - https://phabricator.wikimedia.org/T347054 (ssingh) ` sukhe@re0.cr2-codfw# show routing-options static  /* Anycast recdns - backup route */ route 10.3.0.0/30 {     next-hop 208.80.153.77;     read...
[13:56:14] <wikibugs>	 (CR) Slyngshede: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43660/console" [puppet] - https://gerrit.wikimedia.org/r/961375 (owner: Slyngshede)
[13:56:42] <wikibugs>	 (PS7) Herron: pyrra: add trafficserver mapping [puppet] - https://gerrit.wikimedia.org/r/961128 (https://phabricator.wikimedia.org/T302995)
[13:57:21] <wikibugs>	 (PS1) Elukey: role::kafka::jumbo: exclude kafka-jumbo100[1-6] from Mirror Maker [puppet] - https://gerrit.wikimedia.org/r/961397 (https://phabricator.wikimedia.org/T347481)
[13:57:57] <wikibugs>	 SRE, Traffic: Add README and build-specific Dockerfile to purged - https://phabricator.wikimedia.org/T347021 (Fabfur) Open→Resolved a:Fabfur Done with   * https://gerrit.wikimedia.org/r/c/operations/software/purged/+/958477 * https://gerrit.wikimedia.org/r/c/operations/software/purged/+/959049
[13:57:59] <wikibugs>	 (CR) JMeybohm: [C: +2] admin_nd: Don't allow uncached api access from wikifunctions [deployment-charts] - https://gerrit.wikimedia.org/r/961394 (https://phabricator.wikimedia.org/T347397) (owner: JMeybohm)
[13:58:01] <wikibugs>	 (CR) JMeybohm: [C: +2] admin_ng/wikikube: Allow pods to use DNS over TCP [deployment-charts] - https://gerrit.wikimedia.org/r/961395 (owner: JMeybohm)
[13:58:28] <wikibugs>	 (CR) Slyngshede: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43662/console" [puppet] - https://gerrit.wikimedia.org/r/961375 (owner: Slyngshede)
[13:58:52] <wikibugs>	 SRE, Traffic: Allow purged to specify buffer length - https://phabricator.wikimedia.org/T346874 (Fabfur) Open→Stalled Waiting for actual deployment to definitely closing this task
[13:58:55] <claime>	 jouncebot: nowandnext
[13:58:56] <jouncebot>	 For the next 0 hour(s) and 1 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230927T1300)
[13:58:56] <jouncebot>	 In 0 hour(s) and 1 minute(s): Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230927T1400)
[13:58:58] <wikibugs>	 (CR) Elukey: [V: +1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43663/console" [puppet] - https://gerrit.wikimedia.org/r/961397 (https://phabricator.wikimedia.org/T347481) (owner: Elukey)
[13:59:19] <wikibugs>	 (CR) Elukey: role::kafka::jumbo: exclude kafka-jumbo100[1-6] from Mirror Maker [puppet] - https://gerrit.wikimedia.org/r/961397 (https://phabricator.wikimedia.org/T347481) (owner: Elukey)
[14:00:05] <jouncebot>	 Deploy window Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230927T1400)
[14:00:18] <wikibugs>	 (CR) Slyngshede: [V: +1 C: +2] C:idm::deployment ensure that rq service is enabled for git installs [puppet] - https://gerrit.wikimedia.org/r/961375 (owner: Slyngshede)
[14:00:33] <wikibugs>	 (Merged) jenkins-bot: admin_nd: Don't allow uncached api access from wikifunctions [deployment-charts] - https://gerrit.wikimedia.org/r/961394 (https://phabricator.wikimedia.org/T347397) (owner: JMeybohm)
[14:00:36] <wikibugs>	 (Merged) jenkins-bot: admin_ng/wikikube: Allow pods to use DNS over TCP [deployment-charts] - https://gerrit.wikimedia.org/r/961395 (owner: JMeybohm)
[14:00:49] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase2017.codfw.wmnet with OS bullseye
[14:00:51] <wikibugs>	 10SRE, 10Traffic: Simplify maintenance of DNS/NTP hosts to reduce toil around reboots, reimages, and other work - https://phabricator.wikimedia.org/T347054 (10ssingh) For `route 10.3.0.0/30` above, `next-hop 208.80.153.77` is actually the old authdns host, so we are clearly not keeping the static routes update...
[14:01:05] <wikibugs>	 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host restbase2017.codfw.wmnet with OS bullseye completed: - restbase20...
[14:01:54] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C: 03+1] ml-services: fix recommendation-api-ng readiness probe failure [deployment-charts] - 10https://gerrit.wikimedia.org/r/960681 (https://phabricator.wikimedia.org/T347475) (owner: 10Kevin Bazira)
[14:01:59] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] "Looks good, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/961397 (https://phabricator.wikimedia.org/T347481) (owner: 10Elukey)
[14:04:31] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "this is not enough! the sysctl file is still deployed." [puppet] - 10https://gerrit.wikimedia.org/r/961377 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[14:05:20] <icinga-wm>	 PROBLEM - Check systemd state on idm-test1001 is CRITICAL: CRITICAL - degraded: The following units failed: rq-bitu.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:05:27] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] role::kafka::jumbo: exclude kafka-jumbo100[1-6] from Mirror Maker [puppet] - 10https://gerrit.wikimedia.org/r/961397 (https://phabricator.wikimedia.org/T347481) (owner: 10Elukey)
[14:05:40] <wikibugs>	 (03PS6) 10Arturo Borrero Gonzalez: wikimediacloud.org: decom ns-recursor0.openstack.eqiad1.wikimediacloud.org [dns] - 10https://gerrit.wikimedia.org/r/957745 (https://phabricator.wikimedia.org/T307357)
[14:06:31] <_joe_>	 !log updating conftool everywhere
[14:06:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:08:01] <logmsgbot>	 !log kamila@cumin1001 START - Cookbook sre.discovery.datacenter pool all services in eqiad: Datacenter Switchover: eqiad repool - T345263
[14:08:08] <stashbot>	 T345263: September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345263
[14:08:10] <wikibugs>	 10SRE, 10Data-Persistence, 10Performance-Team, 10serviceops, 10Datacenter-Switchover: September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345263 (10ops-monitoring-bot) kamila@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all services in eqiad: Datacenter Switchover: e...
[14:08:29] <wikibugs>	 (03PS1) 10Slyngshede: C:idm::deployment Use bitu cmd for systemd service [puppet] - 10https://gerrit.wikimedia.org/r/961398
[14:08:42] <logmsgbot>	 !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcumin2001.codfw.wmnet with OS bullseye
[14:10:33] <wikibugs>	 (03CR) 10Slyngshede: [C: 03+2] C:idm::deployment Use bitu cmd for systemd service [puppet] - 10https://gerrit.wikimedia.org/r/961398 (owner: 10Slyngshede)
[14:10:38] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.remove-downtime for restbase2017.codfw.wmnet
[14:10:38] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for restbase2017.codfw.wmnet
[14:11:36] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (10MatthewVernon) As an example of this hardware being configured as JBOD - T326352
[14:12:05] <wikibugs>	 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10Eevans)
[14:12:42] <icinga-wm>	 RECOVERY - Check systemd state on idm-test1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:12:52] <wikibugs>	 10SRE, 10Traffic: Simplify maintenance of DNS/NTP hosts to reduce toil around reboots, reimages, and other work - https://phabricator.wikimedia.org/T347054 (10ayounsi) Thanks, I opened {T347494} to get rid of them. You can use 10.3.0.2/32 for the NTP VIP.
[14:13:52] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.remove-downtime for 15 hosts
[14:13:57] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 15 hosts
[14:15:49] <wikibugs>	 (03CR) 10Vgutierrez: varnish: allow PURGE requests also from dedicated socket (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/960112 (https://phabricator.wikimedia.org/T347192) (owner: 10Fabfur)
[14:16:21] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10User-aborrero, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudcontrol100[8-10]-dev cloudnet100[7-8]-dev - https://phabricator.wikimedia.org/T342455 (10aborrero)
[14:16:30] <wikibugs>	 (03PS1) 10Muehlenhoff: firewall: Also move the sysctl under the manage_nf_conntrack conditional [puppet] - 10https://gerrit.wikimedia.org/r/961400 (https://phabricator.wikimedia.org/T336497)
[14:16:39] <wikibugs>	 (03PS2) 10Slyngshede: Enable SSH key management for all users. [software/bitu] - 10https://gerrit.wikimedia.org/r/959211
[14:18:50] <wikibugs>	 10SRE, 10observability: Icinga contact for dr0ptp4kt - https://phabricator.wikimedia.org/T346688 (10lmata) hi @dr0ptp4kt   Can you submit a patch with this info? we can happily review it when ready. cc/ @herron will be your point of contact.
[14:19:13] <wikibugs>	 10SRE, 10observability, 10SRE Observability (FY2023/2024-Q2): Icinga contact for dr0ptp4kt - https://phabricator.wikimedia.org/T346688 (10lmata)
[14:20:09] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10User-aborrero, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudcontrol100[8-10]-dev cloudnet100[7-8]-dev - https://phabricator.wikimedia.org/T342455 (10dcaro) Wouldn't in make sense to start on 1001-dev? (otherwise it seems that 1007-dev should exist, or will...
[14:21:34] <icinga-wm>	 RECOVERY - DPKG on sretest1001 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[14:22:31] <claime>	 !log Repooling eqiad services in progress - T345263
[14:22:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:22:38] <stashbot>	 T345263: September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345263
[14:23:23] <wikibugs>	 (03PS30) 10Fabfur: varnish: allow PURGE requests also from dedicated socket [puppet] - 10https://gerrit.wikimedia.org/r/960112 (https://phabricator.wikimedia.org/T347192)
[14:23:37] <wikibugs>	 (03CR) 10Fabfur: varnish: allow PURGE requests also from dedicated socket (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/960112 (https://phabricator.wikimedia.org/T347192) (owner: 10Fabfur)
[14:25:39] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/961400 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[14:29:14] <logmsgbot>	 !log kamila@cumin1001 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) pool all services in eqiad: Datacenter Switchover: eqiad repool - T345263
[14:29:22] <stashbot>	 T345263: September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345263
[14:29:23] <wikibugs>	 10SRE, 10Data-Persistence, 10Performance-Team, 10serviceops, 10Datacenter-Switchover: September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345263 (10ops-monitoring-bot) kamila@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all services in eqiad: Datacenter Switchover: e...
[14:29:27] <icinga-wm>	 PROBLEM - Check systemd state on gitlab1003 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:30:59] <moritzm>	 !log Added Arnaud to pwstore and removed Jeff (frtech SREs no longer need/use it)
[14:31:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:31:38] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['restbase2018.codfw.wmnet']
[14:33:43] <icinga-wm>	 PROBLEM - Host restbase2018 is DOWN: PING CRITICAL - Packet loss = 100%
[14:33:58] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] "great job 😊" [puppet] - 10https://gerrit.wikimedia.org/r/960112 (https://phabricator.wikimedia.org/T347192) (owner: 10Fabfur)
[14:36:13] <wikibugs>	 (03CR) 10Muehlenhoff: cloudgw: Don't override conntrack settings from firewall profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/961377 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[14:38:50] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['restbase2018.codfw.wmnet']
[14:38:55] <icinga-wm>	 RECOVERY - Host restbase2018 is UP: PING WARNING - Packet loss = 75%, RTA = 73.27 ms
[14:40:19] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2018.codfw.wmnet with OS bullseye
[14:40:27] <wikibugs>	 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host restbase2018.codfw.wmnet with OS bullseye
[14:43:35] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host lists1004.eqiad.wmnet with OS bullseye
[14:43:40] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install lists1004.eqiad.wmnet - https://phabricator.wikimedia.org/T342374 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host lists1004.eqiad.wmnet with OS bullseye executed with errors: - lists1004...
[14:44:39] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/961400 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[14:46:37] <icinga-wm>	 RECOVERY - Check systemd state on gitlab1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:49:40] <wikibugs>	 10SRE, 10Traffic: Simplify maintenance of DNS/NTP hosts to reduce toil around reboots, reimages, and other work - https://phabricator.wikimedia.org/T347054 (10cmooney) Is there any reason we can't announce the "unicast" IPs in BGP too?  I can't really see a good reason that any static routes are needed here.
[14:50:11] <wikibugs>	 (03PS1) 10FNegri: Add new cloud restricted bastion [puppet] - 10https://gerrit.wikimedia.org/r/961401 (https://phabricator.wikimedia.org/T340241)
[14:50:13] <wikibugs>	 (03PS1) 10FNegri: Remove old cloud restricted bastion [puppet] - 10https://gerrit.wikimedia.org/r/961402 (https://phabricator.wikimedia.org/T340241)
[14:51:18] <wikibugs>	 (03CR) 10Fabfur: [C: 03+2] varnish: allow PURGE requests also from dedicated socket (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/960112 (https://phabricator.wikimedia.org/T347192) (owner: 10Fabfur)
[14:53:29] <icinga-wm>	 RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:54:55] <icinga-wm>	 RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:55:24] <wikibugs>	 10SRE, 10Traffic: Simplify maintenance of DNS/NTP hosts to reduce toil around reboots, reimages, and other work - https://phabricator.wikimedia.org/T347054 (10ssingh) >>! In T347054#9203737, @cmooney wrote: > Is there any reason we can't announce the "unicast" IPs in BGP too?  I can't really see a good reason...
[14:55:47] <logmsgbot>	 !log xcollazo@deploy2002 Started deploy [airflow-dags/analytics@49e3804]: Deploy latest Airflow DAGs to analytics instance
[14:56:29] <logmsgbot>	 !log xcollazo@deploy2002 Finished deploy [airflow-dags/analytics@49e3804]: Deploy latest Airflow DAGs to analytics instance (duration: 00m 42s)
[14:57:08] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'sync'.
[14:58:05] <logmsgbot>	 !log fnegri@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcumin1001.eqiad.wmnet with OS bullseye
[14:58:18] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST pods) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:58:37] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase2018.codfw.wmnet with reason: host reimage
[14:59:17] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
[14:59:30] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[14:59:54] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[15:00:02] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[15:01:02] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[15:01:40] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'.
[15:02:24] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[15:02:30] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase2018.codfw.wmnet with reason: host reimage
[15:03:18] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST pods) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:04:58] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.dns.netbox
[15:05:00] <wikibugs>	 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (10JMeybohm)
[15:05:31] <wikibugs>	 10SRE, 10Abstract Wikipedia team, 10MW-on-K8s, 10Traffic, and 4 others: Migrate functions-orchestrator service to mw-api-int - https://phabricator.wikimedia.org/T347397 (10JMeybohm) 05In progress→03Resolved Direct access to mw-api is forbidden now. wikifunctions still working
[15:06:38] <logmsgbot>	 !log fnegri@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcumin1001.eqiad.wmnet with reason: host reimage
[15:07:04] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add ntp.anycast.wmnet - sukhe@cumin2002"
[15:07:21] <wikibugs>	 (03CR) 10Hashar: "I have found another way which is to use a hiera value that is passed to the various profiles:" [puppet] - 10https://gerrit.wikimedia.org/r/908604 (https://phabricator.wikimedia.org/T333143) (owner: 10Hashar)
[15:07:36] <wikibugs>	 (03PS5) 10Hashar: gerrit: remove duplicate $gerrit_site definition [puppet] - 10https://gerrit.wikimedia.org/r/908604 (https://phabricator.wikimedia.org/T333143)
[15:07:49] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add ntp.anycast.wmnet - sukhe@cumin2002"
[15:07:49] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:08:18] <jinxer-wm>	 (KubernetesAPILatency) firing: (4) High Kubernetes API latency (LIST certificaterequests) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:09:10] <logmsgbot>	 !log gmodena@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply
[15:09:13] <logmsgbot>	 !log gmodena@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[15:09:18] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.dns.wipe-cache ntp.anycast.wmnet on all recursors
[15:09:21] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ntp.anycast.wmnet on all recursors
[15:09:46] <logmsgbot>	 !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcumin1001.eqiad.wmnet with reason: host reimage
[15:10:03] <wikibugs>	 (03PS2) 10Muehlenhoff: Make the dbconfig settings conditional on the hdb backend [puppet] - 10https://gerrit.wikimedia.org/r/961352 (https://phabricator.wikimedia.org/T292942)
[15:10:39] <wikibugs>	 (03CR) 10Hashar: "The spec for `profile::gerrit::migration` fails to find `profile::gerrit::gerrit_site`." [puppet] - 10https://gerrit.wikimedia.org/r/908604 (https://phabricator.wikimedia.org/T333143) (owner: 10Hashar)
[15:10:52] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] gerrit: remove duplicate $gerrit_site definition [puppet] - 10https://gerrit.wikimedia.org/r/908604 (https://phabricator.wikimedia.org/T333143) (owner: 10Hashar)
[15:12:49] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:12:50] <wikibugs>	 (03CR) 10Kevin Bazira: [C: 03+2] "Thanks for the review." [deployment-charts] - 10https://gerrit.wikimedia.org/r/960681 (https://phabricator.wikimedia.org/T347475) (owner: 10Kevin Bazira)
[15:13:18] <jinxer-wm>	 (KubernetesAPILatency) resolved: (4) High Kubernetes API latency (LIST certificaterequests) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:13:41] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.280 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:13:41] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: fix recommendation-api-ng readiness probe failure [deployment-charts] - 10https://gerrit.wikimedia.org/r/960681 (https://phabricator.wikimedia.org/T347475) (owner: 10Kevin Bazira)
[15:14:05] <wikibugs>	 (03CR) 10Hashar: "With `PUPPET_DEBUG=1`:" [puppet] - 10https://gerrit.wikimedia.org/r/908604 (https://phabricator.wikimedia.org/T333143) (owner: 10Hashar)
[15:14:54] <wikibugs>	 (03PS2) 10Jgiannelos: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/960671 (owner: 10PipelineBot)
[15:17:01] <wikibugs>	 (03CR) 10Jgiannelos: [C: 03+2] wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/960671 (owner: 10PipelineBot)
[15:17:04] <wikibugs>	 (03CR) 10Hashar: gerrit: remove duplicate $gerrit_site definition (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/908604 (https://phabricator.wikimedia.org/T333143) (owner: 10Hashar)
[15:17:25] <icinga-wm>	 PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[15:17:47] <wikibugs>	 (03Merged) 10jenkins-bot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/960671 (owner: 10PipelineBot)
[15:19:15] <wikibugs>	 (03CR) 10Muehlenhoff: On Bookworm ship ppolicy.schema via Puppet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/961066 (https://phabricator.wikimedia.org/T331699) (owner: 10Muehlenhoff)
[15:19:59] <wikibugs>	 10SRE, 10ops-codfw, 10User-aborrero, 10cloud-services-team (Hardware): cloudswitch: codfw: figure out procurement - https://phabricator.wikimedia.org/T346724 (10cmooney) >>! In T346724#9182178, @cmooney wrote: > I've spec'd the 'Advanced 2' license here.  That supports EVPN/VXLAN, which at this stage would...
[15:20:41] <wikibugs>	 (03PS1) 10Jbond: puppet::expose_certs: automatically include the ca chain on puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/961406 (https://phabricator.wikimedia.org/T340741)
[15:21:31] <wikibugs>	 (03PS6) 10Hashar: gerrit: remove duplicate $gerrit_site definition [puppet] - 10https://gerrit.wikimedia.org/r/908604 (https://phabricator.wikimedia.org/T333143)
[15:23:06] <wikibugs>	 (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/961405 (https://phabricator.wikimedia.org/T346764) (owner: 10Brouberol)
[15:23:10] <logmsgbot>	 !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcumin1001.eqiad.wmnet with OS bullseye
[15:24:00] <logmsgbot>	 !log kevinbazira@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[15:24:30] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/961401 (https://phabricator.wikimedia.org/T340241) (owner: 10FNegri)
[15:25:26] <wikibugs>	 (03PS3) 10Brouberol: [kafka] Install kafka-kit on bullseye/bookworm brokers [puppet] - 10https://gerrit.wikimedia.org/r/961405 (https://phabricator.wikimedia.org/T346764)
[15:25:58] <Reedy>	 jouncebot: nowandnext
[15:25:58] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 34 minute(s)
[15:25:58] <jouncebot>	 In 1 hour(s) and 34 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230927T1700)
[15:26:14] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/961400 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[15:26:16] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[15:28:43] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase2018.codfw.wmnet with OS bullseye
[15:28:51] <wikibugs>	 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host restbase2018.codfw.wmnet with OS bullseye completed: - restbase20...
[15:29:19] <logmsgbot>	 !log dancy@deploy2002 Installing scap version "4.63.0" for 598 hosts
[15:30:19] <logmsgbot>	 !log dancy@deploy2002 Installation of scap version "4.63.0" completed for 598 hosts
[15:30:50] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM see comment for possible improvement" [puppet] - 10https://gerrit.wikimedia.org/r/908604 (https://phabricator.wikimedia.org/T333143) (owner: 10Hashar)
[15:31:41] <wikibugs>	 (03CR) 10Btullis: [kafka] Install kafka-kit on bullseye/bookworm brokers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/961405 (https://phabricator.wikimedia.org/T346764) (owner: 10Brouberol)
[15:33:15] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] "Looks good." [puppet] - 10https://gerrit.wikimedia.org/r/959222 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene)
[15:33:26] <wikibugs>	 (03PS2) 10Jbond: puppet::expose_certs: automatically include the ca chain on puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/961406 (https://phabricator.wikimedia.org/T340741)
[15:33:36] <wikibugs>	 (03PS4) 10Brouberol: [kafka] Install kafka-kit on bullseye/bookworm brokers [puppet] - 10https://gerrit.wikimedia.org/r/961405 (https://phabricator.wikimedia.org/T346764)
[15:33:42] <wikibugs>	 (03CR) 10Brouberol: [kafka] Install kafka-kit on bullseye/bookworm brokers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/961405 (https://phabricator.wikimedia.org/T346764) (owner: 10Brouberol)
[15:35:24] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43670/console" [puppet] - 10https://gerrit.wikimedia.org/r/961406 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond)
[15:35:47] <wikibugs>	 (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/961405 (https://phabricator.wikimedia.org/T346764) (owner: 10Brouberol)
[15:35:54] <wikibugs>	 (03CR) 10Stevemunene: [V: 03+1 C: 03+2] airflow-wmde: Remove statsd analytics-wmde user [puppet] - 10https://gerrit.wikimedia.org/r/959222 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene)
[15:39:20] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] [kafka] Install kafka-kit on bullseye/bookworm brokers [puppet] - 10https://gerrit.wikimedia.org/r/961405 (https://phabricator.wikimedia.org/T346764) (owner: 10Brouberol)
[15:40:44] <wikibugs>	 (03PS2) 10Elukey: modules: add CORS policy to Istio Ingress' virtual services [deployment-charts] - 10https://gerrit.wikimedia.org/r/961379
[15:41:03] <jinxer-wm>	 (KubernetesAPILatency) firing: (4) High Kubernetes API latency (LIST configurations) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:41:08] <wikibugs>	 (03PS3) 10Jbond: puppet::expose_certs: automatically include the ca chain on puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/961406 (https://phabricator.wikimedia.org/T340741)
[15:41:31] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] modules: add CORS policy to Istio Ingress' virtual services [deployment-charts] - 10https://gerrit.wikimedia.org/r/961379 (owner: 10Elukey)
[15:41:33] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.remove-downtime for restbase2018.codfw.wmnet
[15:41:34] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for restbase2018.codfw.wmnet
[15:42:02] <wikibugs>	 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10Eevans)
[15:42:44] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43671/console" [puppet] - 10https://gerrit.wikimedia.org/r/961406 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond)
[15:42:51] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] Remove old cloud restricted bastion [puppet] - 10https://gerrit.wikimedia.org/r/961402 (https://phabricator.wikimedia.org/T340241) (owner: 10FNegri)
[15:43:32] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2023.codfw.wmnet with OS bullseye
[15:43:38] <wikibugs>	 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host restbase2023.codfw.wmnet with OS bullseye
[15:44:17] <icinga-wm>	 RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[15:46:03] <jinxer-wm>	 (KubernetesAPILatency) resolved: (4) High Kubernetes API latency (LIST configurations) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:48:05] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase1016.eqiad.wmnet
[15:49:03] <logmsgbot>	 !log eevans@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts restbase1016.eqiad.wmnet
[15:49:20] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase1017.eqiad.wmnet
[15:49:30] <logmsgbot>	 !log eevans@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts restbase1017.eqiad.wmnet
[15:49:43] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase1018.eqiad.wmnet
[15:49:54] <logmsgbot>	 !log eevans@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts restbase1018.eqiad.wmnet
[15:50:44] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase1019.eqiad.wmnet
[15:50:48] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts restbase1019.eqiad.wmnet
[15:51:09] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase1022.eqiad.wmnet
[15:51:14] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts restbase1022.eqiad.wmnet
[15:51:19] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase1025.eqiad.wmnet
[15:51:24] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts restbase1025.eqiad.wmnet
[15:52:00] <wikibugs>	 (03CR) 10Brouberol: [C: 03+2] [kafka] Install kafka-kit on bullseye/bookworm brokers [puppet] - 10https://gerrit.wikimedia.org/r/961405 (https://phabricator.wikimedia.org/T346764) (owner: 10Brouberol)
[15:52:17] <wikibugs>	 (03PS1) 10Kamila Součková: Revert "traffic: Depool eqiad from user traffic for switchover" [dns] - 10https://gerrit.wikimedia.org/r/961221
[15:52:53] <wikibugs>	 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10Eevans)
[15:53:11] <wikibugs>	 (03PS2) 10Kamila Součková: Revert "traffic: Depool eqiad from user traffic for switchover" [dns] - 10https://gerrit.wikimedia.org/r/961221
[15:53:29] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase1031.eqiad.wmnet
[15:53:33] <logmsgbot>	 !log eevans@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts restbase1031.eqiad.wmnet
[15:53:40] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase1031.eqiad.wmnet
[15:53:47] <logmsgbot>	 !log eevans@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts restbase1031.eqiad.wmnet
[15:54:06] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase1031.eqiad.wmnet
[15:54:11] <logmsgbot>	 !log eevans@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts restbase1031.eqiad.wmnet
[15:54:22] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] aptrepo: Add Bookworm HAProxy third party repos [puppet] - 10https://gerrit.wikimedia.org/r/957766 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall)
[15:55:24] <logmsgbot>	 !log reedy@deploy2002 Started scap: (no justification provided)
[15:55:48] <wikibugs>	 (03CR) 10FNegri: [C: 03+2] Add new cloud restricted bastion [puppet] - 10https://gerrit.wikimedia.org/r/961401 (https://phabricator.wikimedia.org/T340241) (owner: 10FNegri)
[15:56:07] <wikibugs>	 (03CR) 10BCornwall: [C: 03+2] aptrepo: Add Bookworm HAProxy third party repos [puppet] - 10https://gerrit.wikimedia.org/r/957766 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall)
[15:57:00] <wikibugs>	 (03CR) 10Kamila Součková: [C: 03+2] Revert "traffic: Depool eqiad from user traffic for switchover" [dns] - 10https://gerrit.wikimedia.org/r/961221 (owner: 10Kamila Součková)
[15:59:00] <wikibugs>	 (03PS2) 10Jdlrobson: New projects default to Vector 2022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961260 (https://phabricator.wikimedia.org/T347444)
[15:59:05] <wikibugs>	 (03CR) 10Jdlrobson: New projects default to Vector 2022 (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961260 (https://phabricator.wikimedia.org/T347444) (owner: 10Jdlrobson)
[16:00:45] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase2023.codfw.wmnet with reason: host reimage
[16:02:03] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:02:47] <logmsgbot>	 !log reedy@deploy2002 Finished scap: (no justification provided) (duration: 07m 22s)
[16:03:58] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase2023.codfw.wmnet with reason: host reimage
[16:07:04] <jinxer-wm>	 (KubernetesAPILatency) resolved: (9) High Kubernetes API latency (LIST blockaffinities) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:09:27] <kamila_>	 !log Pooled back eqiad for traffic after the DC switchover (T345263)
[16:09:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:09:36] <stashbot>	 T345263: September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345263
[16:10:13] <wikibugs>	 (03PS1) 10Btullis: Change the owner:group of the wikidatawiki entities link [puppet] - 10https://gerrit.wikimedia.org/r/961412 (https://phabricator.wikimedia.org/T346165)
[16:11:10] <wikibugs>	 (03PS2) 10Btullis: Change the owner:group of the wikidatawiki entities link [puppet] - 10https://gerrit.wikimedia.org/r/961412 (https://phabricator.wikimedia.org/T346165)
[16:11:24] <wikibugs>	 (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/961412 (https://phabricator.wikimedia.org/T346165) (owner: 10Btullis)
[16:12:11] <wikibugs>	 (03PS1) 10Jforrester: mw-on-k8s: Serve 100% of wikifunctions.org traffic [puppet] - 10https://gerrit.wikimedia.org/r/961413 (https://phabricator.wikimedia.org/T347509)
[16:12:18] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 04-1] "This needs a default domain, otherwise specifying a project like 'O{project:tools}' gets us "Caught BadRequest exception: Expecting to fin" [software/cumin] - 10https://gerrit.wikimedia.org/r/868814 (https://phabricator.wikimedia.org/T321349) (owner: 10Andrew Bogott)
[16:14:47] <wikibugs>	 10SRE, 10MW-on-K8s, 10Traffic, 10Wikidata, and 2 others: Serve Wikidata traffic via Kubernetes - https://phabricator.wikimedia.org/T347493 (10Jdforrester-WMF) Possibly now solved by https://gerrit.wikimedia.org/r/c/operations/puppet/+/961351 ?
[16:16:27] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1019.eqiad.wmnet with OS bullseye
[16:16:43] <wikibugs>	 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host restbase1019.eqiad.wmnet with OS bullseye
[16:17:34] <wikibugs>	 10SRE, 10ops-codfw, 10User-aborrero, 10cloud-services-team (Hardware): cloudswitch: codfw: figure out procurement - https://phabricator.wikimedia.org/T346724 (10cmooney) I should also add that unless we are provisioning new racks, any rack allocated for this will already have a switch in it.  So we should...
[16:20:34] <wikibugs>	 (03CR) 10Btullis: "Well, this change looks like it should work, but I wonder if the other option would simply be to remove the 'entities' symlink from dumpsd" [puppet] - 10https://gerrit.wikimedia.org/r/961412 (https://phabricator.wikimedia.org/T346165) (owner: 10Btullis)
[16:24:34] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T343198)', diff saved to https://phabricator.wikimedia.org/P52692 and previous config saved to /var/cache/conftool/dbconfig/20230927-162433-arnaudb.json
[16:24:34] <logmsgbot>	 !log dduvall@deploy2002 Started scap: (no justification provided)
[16:24:40] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[16:27:25] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] rsyslog: switch the endpoints to use the PKI system [puppet] - 10https://gerrit.wikimedia.org/r/956481 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond)
[16:28:38] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase2023.codfw.wmnet with OS bullseye
[16:28:45] <wikibugs>	 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host restbase2023.codfw.wmnet with OS bullseye completed: - restbase20...
[16:28:51] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] rsyslog: update to use pki certificates [puppet] - 10https://gerrit.wikimedia.org/r/936763 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond)
[16:29:04] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.remove-downtime for restbase2023.codfw.wmnet
[16:29:04] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for restbase2023.codfw.wmnet
[16:29:36] <wikibugs>	 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10Eevans)
[16:31:39] <logmsgbot>	 !log eevans@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host restbase1019.eqiad.wmnet with OS bullseye
[16:31:46] <wikibugs>	 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host restbase1019.eqiad.wmnet with OS bullseye executed with errors: -...
[16:32:06] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['restbase1019.eqiad.wmnet']
[16:34:11] <wikibugs>	 (03PS1) 10Jbond: Revert "rsyslog: update to use pki certificates" [puppet] - 10https://gerrit.wikimedia.org/r/961224
[16:34:21] <wikibugs>	 (03CR) 10Jbond: "Sep 27 16:34:09 centrallog2002 rsyslogd[285425]: invalid cert info: peer provided 1 certificate(s). Certificate 1 info: certificate valid " [puppet] - 10https://gerrit.wikimedia.org/r/961224 (owner: 10Jbond)
[16:35:10] <wikibugs>	 (03PS8) 10Stevemunene: airflow-wmde: configure wmde airflow instance [puppet] - 10https://gerrit.wikimedia.org/r/940938 (https://phabricator.wikimedia.org/T340648)
[16:36:52] <wikibugs>	 (03PS7) 10Stevemunene: airflow-wmde: Create scap deployment source for wmde [puppet] - 10https://gerrit.wikimedia.org/r/940939 (https://phabricator.wikimedia.org/T340648)
[16:39:16] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['restbase1019.eqiad.wmnet']
[16:39:40] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P52693 and previous config saved to /var/cache/conftool/dbconfig/20230927-163940-arnaudb.json
[16:39:50] <wikibugs>	 (03PS8) 10Stevemunene: airflow-wmde: Create scap deployment source for wmde [puppet] - 10https://gerrit.wikimedia.org/r/940939 (https://phabricator.wikimedia.org/T340648)
[16:42:49] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1019.eqiad.wmnet with OS bullseye
[16:42:57] <wikibugs>	 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host restbase1019.eqiad.wmnet with OS bullseye
[16:44:18] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] wiki-replicas: Add CREATE USER and GRANT OPTION to labsdbadmin [puppet] - 10https://gerrit.wikimedia.org/r/961366 (https://phabricator.wikimedia.org/T347381) (owner: 10Majavah)
[16:44:44] <wikibugs>	 (03CR) 10Stevemunene: airflow-wmde: Create scap deployment source for wmde (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/940939 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene)
[16:45:14] <wikibugs>	 (03CR) 10Stevemunene: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/949019 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene)
[16:48:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:52:50] <logmsgbot>	 !log dduvall@deploy2002 Finished scap: (no justification provided) (duration: 28m 15s)
[16:53:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (5) High Kubernetes API latency (PATCH events) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:53:45] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:54:47] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P52694 and previous config saved to /var/cache/conftool/dbconfig/20230927-165446-arnaudb.json
[16:55:39] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase1019.eqiad.wmnet with reason: host reimage
[16:56:12] <wikibugs>	 (03PS1) 10Jelto: gitlab failover: tweak settings for gitlab-backup command [cookbooks] - 10https://gerrit.wikimedia.org/r/961418 (https://phabricator.wikimedia.org/T345590)
[16:58:04] <wikibugs>	 10SRE, 10Data-Persistence-Backup, 10Infrastructure-Foundations, 10Puppet (Puppet 7.0): Migrate bacula to pki.discovery.wmnet - https://phabricator.wikimedia.org/T341664 (10jcrespo) I double checked and so far, backup, recovery and restores with the puppet master key still work as expected :-D.
[16:58:46] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase1019.eqiad.wmnet with reason: host reimage
[16:58:52] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] gitlab failover: tweak settings for gitlab-backup command [cookbooks] - 10https://gerrit.wikimedia.org/r/961418 (https://phabricator.wikimedia.org/T345590) (owner: 10Jelto)
[17:00:06] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230927T1700)
[17:00:27] <wikibugs>	 (03PS2) 10Jelto: gitlab failover: tweak settings for gitlab-backup command [cookbooks] - 10https://gerrit.wikimedia.org/r/961418 (https://phabricator.wikimedia.org/T345590)
[17:02:26] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host ncredir6002.drmrs.wmnet with OS bookworm
[17:02:36] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host ncredir6002.drmrs.wmnet with OS bookworm
[17:02:41] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] gitlab failover: tweak settings for gitlab-backup command [cookbooks] - 10https://gerrit.wikimedia.org/r/961418 (https://phabricator.wikimedia.org/T345590) (owner: 10Jelto)
[17:03:28] <wikibugs>	 (03PS3) 10Jelto: gitlab failover: tweak settings for gitlab-backup command [cookbooks] - 10https://gerrit.wikimedia.org/r/961418 (https://phabricator.wikimedia.org/T345590)
[17:05:34] <wikibugs>	 (03CR) 10Herron: "we've left the original" [puppet] - 10https://gerrit.wikimedia.org/r/961224 (owner: 10Jbond)
[17:06:32] <wikibugs>	 (03CR) 10Herron: Revert "rsyslog: update to use pki certificates" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/961224 (owner: 10Jbond)
[17:09:08] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:09:53] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T343198)', diff saved to https://phabricator.wikimedia.org/P52695 and previous config saved to /var/cache/conftool/dbconfig/20230927-170953-arnaudb.json
[17:09:55] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2124.codfw.wmnet with reason: Maintenance
[17:10:08] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2124.codfw.wmnet with reason: Maintenance
[17:10:08] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[17:10:14] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2124 (T343198)', diff saved to https://phabricator.wikimedia.org/P52696 and previous config saved to /var/cache/conftool/dbconfig/20230927-171014-arnaudb.json
[17:18:45] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:23:11] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir6002.drmrs.wmnet with reason: host reimage
[17:23:54] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase1019.eqiad.wmnet with OS bullseye
[17:24:01] <wikibugs>	 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host restbase1019.eqiad.wmnet with OS bullseye completed: - restbase10...
[17:25:49] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir6002.drmrs.wmnet with reason: host reimage
[17:34:40] <wikibugs>	 (03PS1) 10Lucas Werkmeister: commonswiki: Add $wgExternalLinksDomainGaps for another domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961433 (https://phabricator.wikimedia.org/T341000)
[17:36:09] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host moss-be1003.eqiad.wmnet with OS bullseye
[17:36:17] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be1003 - https://phabricator.wikimedia.org/T342675 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host moss-be1003.eqiad.wmnet with OS bullseye
[17:36:49] <logmsgbot>	 !log jgreen@cumin1001 START - Cookbook sre.dns.netbox
[17:36:51] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be1003 - https://phabricator.wikimedia.org/T342675 (10Jclark-ctr)
[17:37:49] <wikibugs>	 (03PS2) 10Lucas Werkmeister: commonswiki: Add $wgExternalLinksDomainGaps for another domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961433 (https://phabricator.wikimedia.org/T341000)
[17:38:12] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.remove-downtime for restbase1019.eqiad.wmnet
[17:38:13] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for restbase1019.eqiad.wmnet
[17:38:35] <wikibugs>	 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10Eevans)
[17:39:00] <logmsgbot>	 !log jgreen@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove host frauth2001.frack.codfw.wmnet from DNS for decommissioning - jgreen@cumin1001"
[17:39:48] <logmsgbot>	 !log jgreen@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove host frauth2001.frack.codfw.wmnet from DNS for decommissioning - jgreen@cumin1001"
[17:39:48] <logmsgbot>	 !log jgreen@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:42:37] <wikibugs>	 10ops-codfw, 10decommission-hardware: decommission frauth2001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T340153 (10Jgreen) a:03Papaul
[17:43:19] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install stat1011.eqiad.wmnet - https://phabricator.wikimedia.org/T342454 (10Jclark-ctr) @BTullis  did you have updates on Partitioning/Raid: for task?
[17:49:45] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir6002.drmrs.wmnet with OS bookworm
[17:49:55] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host ncredir6002.drmrs.wmnet with OS bookworm completed: - ncredir6002 (**PASS**)   - Downtimed on Ici...
[17:51:31] <wikibugs>	 (03PS2) 10Jforrester: mw-on-k8s: Serve 100% of wikifunctions.org traffic [puppet] - 10https://gerrit.wikimedia.org/r/961413 (https://phabricator.wikimedia.org/T347509)
[17:51:38] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host stat1011.mgmt.eqiad.wmnet with reboot policy FORCED
[17:51:56] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host stat1011.mgmt.eqiad.wmnet with reboot policy FORCED
[17:52:30] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase2026.codfw.wmnet
[17:52:33] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['stat1011']
[17:52:43] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['stat1011']
[17:52:48] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['stat1011']
[17:53:05] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2026.codfw.wmnet
[17:53:25] <logmsgbot>	 !log bking@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply
[17:53:28] <logmsgbot>	 !log bking@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[17:53:48] <logmsgbot>	 !log bking@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply
[17:53:51] <logmsgbot>	 !log bking@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[17:57:07] <jinxer-wm>	 (ProbeDown) firing: Service restbase-https:7443 has failed probes (http_restbase-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#restbase-https:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:58:31] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['stat1011']
[17:59:26] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install lists1004.eqiad.wmnet - https://phabricator.wikimedia.org/T342374 (10RobH) 05Open→03In progress a:03RobH
[18:00:06] <jouncebot>	 dduvall and brennen: May I have your attention please! Train log triage with CPT. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230927T1800)
[18:00:06] <jouncebot>	 dduvall and brennen: I, the Bot under the Fountain, call upon thee, The Deployer, to do MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230927T1800).
[18:00:49] <wikibugs>	 (03PS8) 10Herron: pyrra: add trafficserver mapping [puppet] - 10https://gerrit.wikimedia.org/r/961128 (https://phabricator.wikimedia.org/T302995)
[18:01:08] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase2027.codfw.wmnet
[18:01:29] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10User-aborrero, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudcontrol100[8-10]-dev cloudnet100[7-8]-dev - https://phabricator.wikimedia.org/T342455 (10nskaggs) It's recommended that existing names not be reused.  See https://wikitech.wikimedia.org/wiki/SRE/In...
[18:01:30] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host stat1011.eqiad.wmnet with OS bullseye
[18:01:43] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install stat1011.eqiad.wmnet - https://phabricator.wikimedia.org/T342454 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host stat1011.eqiad.wmnet with OS bullseye
[18:04:44] <wikibugs>	 (03PS1) 10TrainBranchBot: group1 wikis to 1.41.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961439 (https://phabricator.wikimedia.org/T345889)
[18:04:46] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.41.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961439 (https://phabricator.wikimedia.org/T345889) (owner: 10TrainBranchBot)
[18:05:12] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2026.codfw.wmnet
[18:05:13] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts restbase2026.codfw.wmnet
[18:06:41] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install lists1004.eqiad.wmnet - https://phabricator.wikimedia.org/T342374 (10RobH) Found error in partitioning, discussing with John.
[18:07:06] <wikibugs>	 (03PS9) 10Herron: pyrra: add trafficserver mapping [puppet] - 10https://gerrit.wikimedia.org/r/961128 (https://phabricator.wikimedia.org/T302995)
[18:07:07] <jinxer-wm>	 (ProbeDown) resolved: Service restbase-https:7443 has failed probes (http_restbase-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#restbase-https:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:07:40] <logmsgbot>	 !log eevans@cumin1001 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts restbase2027.codfw.wmnet
[18:07:46] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase2027.codfw.wmnet
[18:08:29] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2026.codfw.wmnet with OS bullseye
[18:08:37] <wikibugs>	 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host restbase2026.codfw.wmnet with OS bullseye
[18:09:35] <wikibugs>	 (03Merged) 10jenkins-bot: group1 wikis to 1.41.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961439 (https://phabricator.wikimedia.org/T345889) (owner: 10TrainBranchBot)
[18:09:39] <wikibugs>	 (03PS1) 10RobH: set lists1004 partman info [puppet] - 10https://gerrit.wikimedia.org/r/961440 (https://phabricator.wikimedia.org/T342374)
[18:10:21] <wikibugs>	 (03CR) 10RobH: [C: 03+2] set lists1004 partman info [puppet] - 10https://gerrit.wikimedia.org/r/961440 (https://phabricator.wikimedia.org/T342374) (owner: 10RobH)
[18:10:55] <wikibugs>	 (03CR) 10Herron: pyrra: add trafficserver mapping (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/961128 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron)
[18:11:52] <logmsgbot>	 !log eevans@cumin1001 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts restbase2027.codfw.wmnet
[18:13:14] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install lists1004.eqiad.wmnet - https://phabricator.wikimedia.org/T342374 (10RobH) a:05RobH→03Jclark-ctr
[18:14:55] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host lists1004.eqiad.wmnet with OS bullseye
[18:15:02] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install lists1004.eqiad.wmnet - https://phabricator.wikimedia.org/T342374 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host lists1004.eqiad.wmnet with OS bullseye
[18:15:08] <wikibugs>	 (03CR) 10Herron: [V: 03+1] "PCC SUCCESS (CORE_DIFF 16): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43677/console" [puppet] - 10https://gerrit.wikimedia.org/r/961128 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron)
[18:15:53] <brett>	 !log disabling puppet on apt1001 for a quick test of CR 957766's effectiveness
[18:15:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:17:34] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:17:51] <logmsgbot>	 !log dduvall@deploy2002 rebuilt and synchronized wikiversions files: group1 wikis to 1.41.0-wmf.28  refs T345889
[18:17:57] <stashbot>	 T345889: 1.41.0-wmf.28 deployment blockers - https://phabricator.wikimedia.org/T345889
[18:19:45] <brett>	 !log re-enabling puppet on apt1001 from a quick test of CR 957766's effectiveness
[18:19:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:20:48] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host ncredir6001.drmrs.wmnet with OS bookworm
[18:20:59] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host ncredir6001.drmrs.wmnet with OS bookworm
[18:22:35] <jinxer-wm>	 (KubernetesAPILatency) firing: (28) High Kubernetes API latency (GET blockaffinities) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:22:59] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10BCornwall)
[18:23:15] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (10BBlack)
[18:24:38] <logmsgbot>	 !log dduvall@deploy2002 Synchronized php: group1 wikis to 1.41.0-wmf.28  refs T345889 (duration: 06m 46s)
[18:24:44] <stashbot>	 T345889: 1.41.0-wmf.28 deployment blockers - https://phabricator.wikimedia.org/T345889
[18:24:56] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase2026.codfw.wmnet with reason: host reimage
[18:26:09] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:26:29] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:27:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:27:35] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase2026.codfw.wmnet with reason: host reimage
[18:28:07] <jinxer-wm>	 (ProbeDown) firing: Service restbase-https:7443 has failed probes (http_restbase-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#restbase-https:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:28:46] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:28:54] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:31:08] <wikibugs>	 (PS1) Majavah: Take cloudcontrol1006 out of service [puppet] - https://gerrit.wikimedia.org/r/961442 (https://phabricator.wikimedia.org/T346891)
[18:31:10] <wikibugs>	 (PS1) Majavah: Cleanup remains of haproxy-on-cloudcontrol [puppet] - https://gerrit.wikimedia.org/r/961443
[18:32:58] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:33:07] <jinxer-wm>	 (ProbeDown) resolved: Service restbase-https:7443 has failed probes (http_restbase-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#restbase-https:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:34:00] <wikibugs>	 (CR) CI reject: [V: -1] Take cloudcontrol1006 out of service [puppet] - https://gerrit.wikimedia.org/r/961442 (https://phabricator.wikimedia.org/T346891) (owner: Majavah)
[18:34:04] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 5.241 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:36:15] <wikibugs>	 (CR) AOkoth: [C: +1] gitlab failover: tweak settings for gitlab-backup command [cookbooks] - https://gerrit.wikimedia.org/r/961418 (https://phabricator.wikimedia.org/T345590) (owner: Jelto)
[18:37:57] <wikibugs>	 (PS1) Ssingh: aptrepo: s/haproxy28-bookworm/haproxy [puppet] - https://gerrit.wikimedia.org/r/961444
[18:39:11] <wikibugs>	 (CR) Ssingh: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43678/console" [puppet] - https://gerrit.wikimedia.org/r/961444 (owner: Ssingh)
[18:39:49] <sukhe>	 !log disable puppet on O:apt_repo
[18:39:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:40:22] <wikibugs>	 (CR) Ssingh: [V: +1 C: +2] aptrepo: s/haproxy28-bookworm/haproxy [puppet] - https://gerrit.wikimedia.org/r/961444 (owner: Ssingh)
[18:41:15] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host lists1004.eqiad.wmnet with OS bullseye
[18:41:23] <logmsgbot>	 !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host lists1004.eqiad.wmnet with OS bullseye
[18:41:23] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install lists1004.eqiad.wmnet - https://phabricator.wikimedia.org/T342374 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host lists1004.eqiad.wmnet with OS bullseye
[18:41:28] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install lists1004.eqiad.wmnet - https://phabricator.wikimedia.org/T342374 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host lists1004.eqiad.wmnet with OS bullseye executed with errors: - lists1004 (...
[18:42:07] <jinxer-wm>	 (ProbeDown) firing: Service restbase-https:7443 has failed probes (http_restbase-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#restbase-https:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:42:55] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host lists1004.eqiad.wmnet with OS bullseye
[18:43:01] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install lists1004.eqiad.wmnet - https://phabricator.wikimedia.org/T342374 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host lists1004.eqiad.wmnet with OS bullseye
[18:43:04] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50567 bytes in 0.083 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:43:15] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir6001.drmrs.wmnet with reason: host reimage
[18:45:26] <sukhe>	 !log re-enable puppet on O:apt_repo
[18:45:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:46:32] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir6001.drmrs.wmnet with reason: host reimage
[18:47:07] <jinxer-wm>	 (ProbeDown) resolved: Service restbase-https:7443 has failed probes (http_restbase-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#restbase-https:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:47:35] <wikibugs>	 SRE, ops-eqiad, Data-Platform-SRE: Degraded RAID on dbstore1005 - https://phabricator.wikimedia.org/T347449 (Jclark-ctr) @BTullis this server is out of warranty i do not have any 1.6tb drives available. i do have 1.9tb we can use  if needed
[18:48:49] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase2026.codfw.wmnet with OS bullseye
[18:48:56] <wikibugs>	 SRE, Cassandra, Data-Persistence, Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host restbase2026.codfw.wmnet with OS bullseye completed: - restbase20...
[18:49:05] <wikibugs>	 SRE, Cassandra, Data-Persistence, Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (Eevans) a:Eevans
[18:49:25] <wikibugs>	 SRE, Cassandra, Data-Persistence, Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (Eevans)
[18:50:17] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase2027.codfw.wmnet
[18:50:54] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2027.codfw.wmnet
[18:51:53] <wikibugs>	 (CR) Ssingh: [V: +1 C: +2] "Ie10249983a6b5f2d98cc40b6734da103c836349c was merged but it was throwing up an error so we decided to do try this. Adding for post-merge-r" [puppet] - https://gerrit.wikimedia.org/r/961444 (owner: Ssingh)
[18:55:07] <jinxer-wm>	 (ProbeDown) firing: Service restbase-https:7443 has failed probes (http_restbase-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#restbase-https:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:55:53] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on lists1004.eqiad.wmnet with reason: host reimage
[18:56:25] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host moss-be1003.eqiad.wmnet with OS bullseye
[18:56:31] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install moss-be1003 - https://phabricator.wikimedia.org/T342675 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host moss-be1003.eqiad.wmnet with OS bullseye executed with erro...
[18:59:01] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lists1004.eqiad.wmnet with reason: host reimage
[18:59:31] <wikibugs>	 (CR) Herron: [V: +1 C: +2] pyrra: add trafficserver mapping [puppet] - https://gerrit.wikimedia.org/r/961128 (https://phabricator.wikimedia.org/T302995) (owner: Herron)
[19:00:07] <jinxer-wm>	 (ProbeDown) resolved: Service restbase-https:7443 has failed probes (http_restbase-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#restbase-https:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:00:27] <wikibugs>	 (PS6) Herron: pyrra: add public dns entries [dns] - https://gerrit.wikimedia.org/r/961133 (https://phabricator.wikimedia.org/T302995)
[19:01:44] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2027.codfw.wmnet
[19:01:45] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts restbase2027.codfw.wmnet
[19:02:08] <wikibugs>	 (CR) Herron: [C: +2] pyrra: add public dns entries [dns] - https://gerrit.wikimedia.org/r/961133 (https://phabricator.wikimedia.org/T302995) (owner: Herron)
[19:03:29] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2027.codfw.wmnet with OS bullseye
[19:03:38] <wikibugs>	 SRE, Cassandra, Data-Persistence, Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host restbase2027.codfw.wmnet with OS bullseye
[19:03:46] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:05:50] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase1020.eqiad.wmnet
[19:05:55] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts restbase1020.eqiad.wmnet
[19:06:50] <logmsgbot>	 !log bking@deploy2002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply
[19:07:03] <logmsgbot>	 !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[19:07:50] <logmsgbot>	 !log bking@deploy2002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply
[19:08:11] <logmsgbot>	 !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[19:09:07] <jinxer-wm>	 (ProbeDown) firing: Service restbase-https:7443 has failed probes (http_restbase-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#restbase-https:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:09:58] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir6001.drmrs.wmnet with OS bookworm
[19:10:08] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host ncredir6001.drmrs.wmnet with OS bookworm completed: - ncredir6001 (**PASS**)   - Downtimed on Ici...
[19:13:11] <wikibugs>	 (PS1) BBlack: traffic hosts: use broader regexes everywhere [puppet] - https://gerrit.wikimedia.org/r/961460
[19:14:43] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - robh@cumin1001"
[19:18:15] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - robh@cumin1001"
[19:18:15] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lists1004.eqiad.wmnet with OS bullseye
[19:18:21] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install lists1004.eqiad.wmnet - https://phabricator.wikimedia.org/T342374 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host lists1004.eqiad.wmnet with OS bullseye completed: - lists1004 (**PASS**)...
[19:19:06] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install lists1004.eqiad.wmnet - https://phabricator.wikimedia.org/T342374 (RobH)
[19:19:07] <jinxer-wm>	 (ProbeDown) resolved: Service restbase-https:7443 has failed probes (http_restbase-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#restbase-https:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:19:25] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install lists1004.eqiad.wmnet - https://phabricator.wikimedia.org/T342374 (RobH) In progress→Resolved OS imaged, system ready for service owners.
[19:22:03] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase2027.codfw.wmnet with reason: host reimage
[19:23:07] <jinxer-wm>	 (ProbeDown) firing: Service restbase-https:7443 has failed probes (http_restbase-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#restbase-https:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:23:34] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (BCornwall)
[19:24:32] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase2027.codfw.wmnet with reason: host reimage
[19:26:16] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[19:26:24] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host ncredir4002.ulsfo.wmnet with OS bookworm
[19:26:34] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host ncredir4002.ulsfo.wmnet with OS bookworm
[19:28:07] <jinxer-wm>	 (ProbeDown) resolved: Service restbase-https:7443 has failed probes (http_restbase-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#restbase-https:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:30:07] <jinxer-wm>	 (ProbeDown) firing: Service restbase-https:7443 has failed probes (http_restbase-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#restbase-https:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:33:46] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:34:22] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['moss-be1003']
[19:35:07] <jinxer-wm>	 (ProbeDown) resolved: Service restbase-https:7443 has failed probes (http_restbase-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#restbase-https:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:35:41] <inflatador>	 !log bking@deploy2002 deleting flink-operator leader pod to force failover T347521
[19:35:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:35:54] <stashbot>	 T347521: Troubleshoot mw-page-content-change-enrich and flink-operator - https://phabricator.wikimedia.org/T347521
[19:37:07] <jinxer-wm>	 (ProbeDown) firing: Service restbase-https:7443 has failed probes (http_restbase-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#restbase-https:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:38:03] <logmsgbot>	 !log bking@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply
[19:38:06] <logmsgbot>	 !log bking@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[19:39:15] <logmsgbot>	 !log bking@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply
[19:39:17] <logmsgbot>	 !log bking@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[19:40:40] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['moss-be1003']
[19:42:07] <jinxer-wm>	 (ProbeDown) resolved: Service restbase-https:7443 has failed probes (http_restbase-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#restbase-https:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:47:36] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir4002.ulsfo.wmnet with reason: host reimage
[19:50:54] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir4002.ulsfo.wmnet with reason: host reimage
[19:52:11] <logmsgbot>	 !log ebernhardson@deploy2002 Started deploy [airflow-dags/search@c6454a9]: update rdf tools jar to .131
[19:52:39] <logmsgbot>	 !log ebernhardson@deploy2002 Finished deploy [airflow-dags/search@c6454a9]: update rdf tools jar to .131 (duration: 00m 28s)
[19:54:21] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase2027.codfw.wmnet with OS bullseye
[19:54:29] <wikibugs>	 SRE, Cassandra, Data-Persistence, Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host restbase2027.codfw.wmnet with OS bullseye completed: - restbase20...
[19:59:58] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host moss-be1003.eqiad.wmnet with OS bullseye
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: Time to snap out of that daydream and deploy UTC late backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230927T2000).
[20:00:05] <jouncebot>	 lucaswerkmeister and Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:06] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install moss-be1003 - https://phabricator.wikimedia.org/T342675 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host moss-be1003.eqiad.wmnet with OS bullseye
[20:00:08] <lucaswerkmeister>	 o/
[20:01:01] <cjming>	 o/ i can deploy
[20:01:34] <wikibugs>	 (PS3) Clare Ming: commonswiki: Add $wgExternalLinksDomainGaps for another domain [mediawiki-config] - https://gerrit.wikimedia.org/r/961433 (https://phabricator.wikimedia.org/T341000) (owner: Lucas Werkmeister)
[20:02:45] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/961433 (https://phabricator.wikimedia.org/T341000) (owner: Lucas Werkmeister)
[20:02:50] <lucaswerkmeister>	 \o/
[20:03:28] <wikibugs>	 (Merged) jenkins-bot: commonswiki: Add $wgExternalLinksDomainGaps for another domain [mediawiki-config] - https://gerrit.wikimedia.org/r/961433 (https://phabricator.wikimedia.org/T341000) (owner: Lucas Werkmeister)
[20:03:33] <Jdlrobson>	 here
[20:03:46] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[20:03:54] <logmsgbot>	 !log cjming@deploy2002 Started scap: Backport for [[gerrit:961433|commonswiki: Add $wgExternalLinksDomainGaps for another domain (T341000)]]
[20:05:17] <logmsgbot>	 !log cjming@deploy2002 lucaswerkmeister and cjming: Backport for [[gerrit:961433|commonswiki: Add $wgExternalLinksDomainGaps for another domain (T341000)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:05:22] <cjming>	 lucaswerkmeister: are you able to test?
[20:05:25] <lucaswerkmeister>	 yup, one moment
[20:07:24] <lucaswerkmeister>	 eh… it doesn’t make the API request faster as I had hoped, but it doesn’t make it slower either
[20:07:40] <lucaswerkmeister>	 and the result doesn’t change either (which is expected, but also nice)
[20:07:52] <lucaswerkmeister>	 I’d say let’s deploy it anyway and I’ll check up with Amir later whether we should keep it or not
[20:08:01] <lucaswerkmeister>	 but I don’t think it needs to be reverted right now, it shouldn’t hurt 🤷
[20:08:08] <cjming>	 sounds good - syncing
[20:08:10] <lucaswerkmeister>	 thanks
[20:08:13] <logmsgbot>	 !log cjming@deploy2002 lucaswerkmeister and cjming: Continuing with sync
[20:10:07] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir4002.ulsfo.wmnet with OS bookworm
[20:10:14] <cjming>	 hi Jdlrobson: can your 3 go out together?
[20:10:17] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host ncredir4002.ulsfo.wmnet with OS bookworm completed: - ncredir4002 (**PASS**)   - Downtimed on Ici...
[20:14:12] <Jdlrobson>	 cjming: the logos can
[20:14:17] <logmsgbot>	 !log cjming@deploy2002 Finished scap: Backport for [[gerrit:961433|commonswiki: Add $wgExternalLinksDomainGaps for another domain (T341000)]] (duration: 10m 23s)
[20:14:31] <Jdlrobson>	 let's do https://gerrit.wikimedia.org/r/c/961260/ separately
[20:14:45] <cjming>	 ok!
[20:15:42] <wikibugs>	 (PS4) Clare Ming: Special wiki wordmarks and taglines [mediawiki-config] - https://gerrit.wikimedia.org/r/960121 (https://phabricator.wikimedia.org/T341250) (owner: Jdlrobson)
[20:16:12] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host ncredir4001.ulsfo.wmnet with OS bookworm
[20:16:24] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host ncredir4001.ulsfo.wmnet with OS bookworm
[20:16:42] <wikibugs>	 (CR) Clare Ming: [C: +2] Special wiki wordmarks and taglines [mediawiki-config] - https://gerrit.wikimedia.org/r/960121 (https://phabricator.wikimedia.org/T341250) (owner: Jdlrobson)
[20:17:24] <wikibugs>	 (Merged) jenkins-bot: Special wiki wordmarks and taglines [mediawiki-config] - https://gerrit.wikimedia.org/r/960121 (https://phabricator.wikimedia.org/T341250) (owner: Jdlrobson)
[20:17:40] <wikibugs>	 (PS2) Clare Ming: Add wordmark for li wikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/961238 (https://phabricator.wikimedia.org/T341258) (owner: Jdlrobson)
[20:19:15] <wikibugs>	 (CR) Clare Ming: [C: +2] Add wordmark for li wikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/961238 (https://phabricator.wikimedia.org/T341258) (owner: Jdlrobson)
[20:20:03] <wikibugs>	 (Merged) jenkins-bot: Add wordmark for li wikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/961238 (https://phabricator.wikimedia.org/T341258) (owner: Jdlrobson)
[20:21:02] <logmsgbot>	 !log cjming@deploy2002 Started scap: Backport for [[gerrit:960121|Special wiki wordmarks and taglines (T341250)]], [[gerrit:961238|Add wordmark for li wikinews (T341258)]]
[20:21:09] <stashbot>	 T341258: Provide wordmarks for Wikinews projects - https://phabricator.wikimedia.org/T341258
[20:21:10] <stashbot>	 T341250: Design: Provide wordmarks and taglines for Wikimedia special projects - https://phabricator.wikimedia.org/T341250
[20:22:27] <logmsgbot>	 !log cjming@deploy2002 jdlrobson and cjming: Backport for [[gerrit:960121|Special wiki wordmarks and taglines (T341250)]], [[gerrit:961238|Add wordmark for li wikinews (T341258)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:22:30] <cjming>	 Jdlrobson: logos patches are ready to test if you're able
[20:23:19] <brett>	 !log update haproxy 2.6 and 2.8 into bookworm archives with reprepro - T342154
[20:23:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:23:24] <stashbot>	 T342154: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154
[20:23:46] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[20:24:09] <Jdlrobson>	 cjming: on it
[20:24:50] <Jdlrobson>	 cjming: LGTM!
[20:24:54] <Jdlrobson>	 cjming: please sync
[20:24:57] <cjming>	 yay - syncing
[20:25:01] <logmsgbot>	 !log cjming@deploy2002 jdlrobson and cjming: Continuing with sync
[20:27:26] <cjming>	 Jdlrobson: should your vector 2022 default patch be rebased on top of master or parent change?
[20:27:38] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on restbase2027.codfw.wmnet with reason: Repairing/rebuilding Cassandra instances
[20:27:40] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on restbase2027.codfw.wmnet with reason: Repairing/rebuilding Cassandra instances
[20:27:42] <Jdlrobson>	 master
[20:28:00] <wikibugs>	 (PS3) Clare Ming: New projects default to Vector 2022 [mediawiki-config] - https://gerrit.wikimedia.org/r/961260 (https://phabricator.wikimedia.org/T347444) (owner: Jdlrobson)
[20:28:34] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:28:58] <wikibugs>	 (CR) Majavah: "recheck" [puppet] - https://gerrit.wikimedia.org/r/961442 (https://phabricator.wikimedia.org/T346891) (owner: Majavah)
[20:29:09] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[20:30:55] <logmsgbot>	 !log cjming@deploy2002 Finished scap: Backport for [[gerrit:960121|Special wiki wordmarks and taglines (T341250)]], [[gerrit:961238|Add wordmark for li wikinews (T341258)]] (duration: 09m 52s)
[20:31:03] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/961260 (https://phabricator.wikimedia.org/T347444) (owner: Jdlrobson)
[20:31:08] <stashbot>	 T341258: Provide wordmarks for Wikinews projects - https://phabricator.wikimedia.org/T341258
[20:31:09] <stashbot>	 T341250: Design: Provide wordmarks and taglines for Wikimedia special projects - https://phabricator.wikimedia.org/T341250
[20:32:36] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir4001.ulsfo.wmnet with reason: host reimage
[20:33:14] <wikibugs>	 (Merged) jenkins-bot: New projects default to Vector 2022 [mediawiki-config] - https://gerrit.wikimedia.org/r/961260 (https://phabricator.wikimedia.org/T347444) (owner: Jdlrobson)
[20:33:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:33:39] <logmsgbot>	 !log cjming@deploy2002 Started scap: Backport for [[gerrit:961260|New projects default to Vector 2022 (T347444)]]
[20:33:44] <stashbot>	 T347444: New projects should get Vector 2022 skin - https://phabricator.wikimedia.org/T347444
[20:34:15] <logmsgbot>	 !log bking@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply
[20:34:18] <logmsgbot>	 !log bking@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[20:35:03] <logmsgbot>	 !log cjming@deploy2002 cjming and jdlrobson: Backport for [[gerrit:961260|New projects default to Vector 2022 (T347444)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:35:08] <cjming>	 Jdlrobson: logos patches are live! vector 2022 patch is ready to test
[20:35:17] <Jdlrobson>	 cjming: okay looking! wish me luck :)
[20:35:42] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir4001.ulsfo.wmnet with reason: host reimage
[20:36:16] * cjming wishes luck to Jdlrobson
[20:37:19] <Jdlrobson>	 cjming: okay it looks like something's gone wrong with that patch. Looks like the dblist is not working... 
[20:37:39] <cjming>	 bummer - do we need to revert?
[20:37:51] <Jdlrobson>	 probably... I'm just looking to see if I've missed something obvious
[20:38:59] <Jdlrobson>	 yeh looks like dblists-index.php didn't get updated
[20:39:15] <cjming>	 is it a caching thing?
[20:41:19] <Jdlrobson>	 cjming: one line fix (facepalm)
[20:42:08] <cjming>	 Jdlrobson: i can continue sync and we can deploy your fix?
[20:42:31] <cjming>	 or revert, add your fix, and redeploy
[20:42:36] <wikibugs>	 (PS1) Jdlrobson: Populate the legacy-vector dblist [mediawiki-config] - https://gerrit.wikimedia.org/r/961469 (https://phabricator.wikimedia.org/T347444)
[20:42:50] <Jdlrobson>	 we need to add this fix before syncing
[20:43:07] <Jdlrobson>	 I'm okay to revert and squash this or apply this patch on top of the previous one - what's better?
[20:43:23] <Jdlrobson>	 but we definitely shouldn't sync what's currently on the debug servers
[20:43:32] <Jdlrobson>	 the German Wikipedians will not be happy haha
[20:43:59] <cjming>	 i'll stop scap backporting current patch, we'll merge your fix, and then scap both
[20:44:06] <logmsgbot>	 !log cjming@deploy2002 Sync cancelled.
[20:44:32] <wikibugs>	 (CR) Clare Ming: [C: +2] Populate the legacy-vector dblist [mediawiki-config] - https://gerrit.wikimedia.org/r/961469 (https://phabricator.wikimedia.org/T347444) (owner: Jdlrobson)
[20:45:16] <wikibugs>	 (Merged) jenkins-bot: Populate the legacy-vector dblist [mediawiki-config] - https://gerrit.wikimedia.org/r/961469 (https://phabricator.wikimedia.org/T347444) (owner: Jdlrobson)
[20:45:59] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host moss-be1003.eqiad.wmnet with OS bullseye
[20:46:20] <logmsgbot>	 !log cjming@deploy2002 Started scap: Backport for [[gerrit:961260|New projects default to Vector 2022 (T347444)]], [[gerrit:961469|Populate the legacy-vector dblist (T347444)]]
[20:46:25] <stashbot>	 T347444: New projects should get Vector 2022 skin - https://phabricator.wikimedia.org/T347444
[20:47:40] <logmsgbot>	 !log cjming@deploy2002 jdlrobson and cjming: Backport for [[gerrit:961260|New projects default to Vector 2022 (T347444)]], [[gerrit:961469|Populate the legacy-vector dblist (T347444)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:47:43] <cjming>	 Jdlrobson: how about now?
[20:48:22] <Jdlrobson>	 cjming: looking :)
[20:48:42] <Jdlrobson>	 cjming: this is looking much more promising. Give me a few more minutes :)
[20:50:33] <Jdlrobson>	 cjming: yeh this looks good. Sync away.
[20:50:41] <cjming>	 nice! syncing
[20:50:44] <logmsgbot>	 !log cjming@deploy2002 jdlrobson and cjming: Continuing with sync
[20:54:23] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir4001.ulsfo.wmnet with OS bookworm
[20:54:33] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host ncredir4001.ulsfo.wmnet with OS bookworm completed: - ncredir4001 (**PASS**)   - Downtimed on Ici...
[20:55:22] <wikibugs>	 (03PS1) 10Jdlrobson: Drop the desktop improvements dblist group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961471 (https://phabricator.wikimedia.org/T347444)
[20:56:09] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Drop the desktop improvements dblist group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961471 (https://phabricator.wikimedia.org/T347444) (owner: 10Jdlrobson)
[20:56:34] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST blockaffinities) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:56:38] <wikibugs>	 (03PS2) 10Jdlrobson: Drop the desktop improvements dblist group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961471 (https://phabricator.wikimedia.org/T347444)
[20:57:26] <logmsgbot>	 !log cjming@deploy2002 Finished scap: Backport for [[gerrit:961260|New projects default to Vector 2022 (T347444)]], [[gerrit:961469|Populate the legacy-vector dblist (T347444)]] (duration: 11m 05s)
[20:57:30] <cjming>	 Jdlrobson: should be live :)
[20:57:31] <stashbot>	 T347444: New projects should get Vector 2022 skin - https://phabricator.wikimedia.org/T347444
[20:58:16] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10BCornwall)
[20:59:42] <cjming>	 !log end of UTC late backport window
[20:59:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:00:05] <jouncebot>	 Deploy window Wikifunction Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230927T2100)
[21:00:16] <Jdlrobson>	 thanks a bunch cjming 
[21:00:25] <Jdlrobson>	 sorry for the blip in the deploy :)
[21:00:35] <cjming>	 you're welcome - no worries :)
[21:01:26] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host ncredir5002.eqsin.wmnet with OS bookworm
[21:01:35] <jinxer-wm>	 (KubernetesAPILatency) resolved: (12) High Kubernetes API latency (LIST blockaffinities) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:01:36] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host ncredir5002.eqsin.wmnet with OS bookworm
[21:08:07] <icinga-wm>	 PROBLEM - Host mr1-esams.oob IPv6 is DOWN: CRITICAL - Destination Unreachable (2a00:1188:5:e::4)
[21:09:09] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[21:11:28] <lucaswerkmeister>	 thanks from me too cjming!
[21:11:30] <lucaswerkmeister>	 (bit late ^^)
[21:11:47] <cjming>	 lucaswerkmeister: np! yw :)
[21:13:31] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:14:53] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:16:09] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50567 bytes in 0.108 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:16:15] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.270 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:23:46] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[21:27:22] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host moss-be1003.eqiad.wmnet with OS bullseye
[21:27:30] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be1003 - https://phabricator.wikimedia.org/T342675 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host moss-be1003.eqiad.wmnet with OS bullseye
[21:35:46] <wikibugs>	 (03PS1) 10Bking: cloudelastic: new partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/961478 (https://phabricator.wikimedia.org/T342463)
[21:39:46] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T343198)', diff saved to https://phabricator.wikimedia.org/P52697 and previous config saved to /var/cache/conftool/dbconfig/20230927-213946-arnaudb.json
[21:39:52] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[21:43:55] <wikibugs>	 (03CR) 10Ebernhardson: [C: 03+1] cloudelastic: new partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/961478 (https://phabricator.wikimedia.org/T342463) (owner: 10Bking)
[21:45:02] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir5002.eqsin.wmnet with reason: host reimage
[21:48:14] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir5002.eqsin.wmnet with reason: host reimage
[21:54:09] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[21:54:53] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P52698 and previous config saved to /var/cache/conftool/dbconfig/20230927-215452-arnaudb.json
[22:03:00] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host moss-be1003.eqiad.wmnet with OS bullseye
[22:03:20] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host moss-be1003.eqiad.wmnet with OS bullseye
[22:03:36] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be1003 - https://phabricator.wikimedia.org/T342675 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host moss-be1003.eqiad.wmnet with OS bullseye
[22:09:59] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P52699 and previous config saved to /var/cache/conftool/dbconfig/20230927-220959-arnaudb.json
[22:15:36] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T343198)', diff saved to https://phabricator.wikimedia.org/P52700 and previous config saved to /var/cache/conftool/dbconfig/20230927-221536-arnaudb.json
[22:15:42] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[22:18:46] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:21:25] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[22:22:55] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir5002.eqsin.wmnet with OS bookworm
[22:23:06] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host ncredir5002.eqsin.wmnet with OS bookworm completed: - ncredir5002 (**PASS**)   - Downtimed on Ici...
[22:25:05] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T343198)', diff saved to https://phabricator.wikimedia.org/P52701 and previous config saved to /var/cache/conftool/dbconfig/20230927-222505-arnaudb.json
[22:25:11] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[22:25:30] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10BCornwall)
[22:28:23] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 143, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[22:28:27] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[22:29:15] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host moss-be1003.eqiad.wmnet with OS bullseye
[22:30:43] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P52702 and previous config saved to /var/cache/conftool/dbconfig/20230927-223042-arnaudb.json
[22:42:54] <wikibugs>	 (03PS3) 10Jdlrobson: wordmarks/taglines for Wiktionary projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960147 (https://phabricator.wikimedia.org/T341257)
[22:45:49] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P52703 and previous config saved to /var/cache/conftool/dbconfig/20230927-224548-arnaudb.json
[22:48:43] <wikibugs>	 10SRE, 10Cloud-VPS, 10Toolforge: At least one commons tool timing out - https://phabricator.wikimedia.org/T347532 (10Brycehughes)
[22:49:12] <wikibugs>	 10SRE, 10Cloud-VPS, 10Toolforge: At least one commons tool timing out - https://phabricator.wikimedia.org/T347532 (10Brycehughes)
[23:00:55] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T343198)', diff saved to https://phabricator.wikimedia.org/P52704 and previous config saved to /var/cache/conftool/dbconfig/20230927-230055-arnaudb.json
[23:00:57] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2129.codfw.wmnet with reason: Maintenance
[23:01:02] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[23:01:11] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2129.codfw.wmnet with reason: Maintenance
[23:01:17] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2129 (T343198)', diff saved to https://phabricator.wikimedia.org/P52705 and previous config saved to /var/cache/conftool/dbconfig/20230927-230117-arnaudb.json
[23:03:13] <icinga-wm>	 PROBLEM - Check systemd state on snapshot1008 is CRITICAL: CRITICAL - degraded: The following units failed: wikidatardf-truthy-dumps.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:09:01] <wikibugs>	 (03PS4) 10Jdlrobson: wordmarks/taglines for Wiktionary projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960147 (https://phabricator.wikimedia.org/T341257)
[23:15:23] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:16:45] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:17:56] <wikibugs>	 10SRE, 10Cloud-VPS, 10Toolforge: At least one commons tool timing out - https://phabricator.wikimedia.org/T347533 (10Brycehughes)
[23:18:33] <wikibugs>	 10SRE, 10Cloud-VPS, 10Toolforge: At least one commons tool timing out - https://phabricator.wikimedia.org/T347532 (10Brycehughes) 05Open→03Invalid
[23:18:40] <wikibugs>	 (03PS1) 10Jdlrobson: Wikimedia special project logo updates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961484
[23:19:30] <wikibugs>	 10SRE, 10Cloud-VPS, 10Toolforge: At least one commons tool timing out - https://phabricator.wikimedia.org/T347532 (10Brycehughes) Closed and re-filed as a bug report. No idea if that matters.
[23:19:31] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50568 bytes in 0.129 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:19:35] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.283 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:23:34] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host moss-be1003.eqiad.wmnet with OS bullseye
[23:23:43] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be1003 - https://phabricator.wikimedia.org/T342675 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host moss-be1003.eqiad.wmnet with OS bullseye
[23:26:16] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[23:36:14] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on moss-be1003.eqiad.wmnet with reason: host reimage
[23:39:20] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on moss-be1003.eqiad.wmnet with reason: host reimage
[23:54:53] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[23:55:55] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[23:56:00] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host moss-be1003.eqiad.wmnet with OS bullseye
[23:56:06] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be1003 - https://phabricator.wikimedia.org/T342675 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host moss-be1003.eqiad.wmnet with OS bullseye completed: - moss-...
[23:56:51] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be1003 - https://phabricator.wikimedia.org/T342675 (10Jclark-ctr) 05Open→03Resolved