[00:00:31] (DatasourceError) firing: Nonwrite HTTP requests with primary DB connections alert - https://grafana.wikimedia.org/alerting/grafana/4tAKSjJVz/view - https://wikitech.wikimedia.org/wiki/Monitoring/DatasourceError - https://alerts.wikimedia.org/?q=alertname%3DDatasourceError [00:04:31] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:09:31] (PS3) Jdlrobson: Special wiki wordmarks and taglines [mediawiki-config] - https://gerrit.wikimedia.org/r/960121 (https://phabricator.wikimedia.org/T341250) [00:10:31] (DatasourceError) resolved: Nonwrite HTTP requests with primary DB connections alert - https://grafana.wikimedia.org/alerting/grafana/4tAKSjJVz/view - https://wikitech.wikimedia.org/wiki/Monitoring/DatasourceError - https://alerts.wikimedia.org/?q=alertname%3DDatasourceError [00:11:10] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3316', diff saved to https://phabricator.wikimedia.org/P52675 and previous config saved to /var/cache/conftool/dbconfig/20230927-001109-arnaudb.json [00:13:15] (PS1) Jdlrobson: New projects default to Vector 2022 [mediawiki-config] - https://gerrit.wikimedia.org/r/961260 (https://phabricator.wikimedia.org/T347444) [00:26:16] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3316', diff saved to https://phabricator.wikimedia.org/P52676 and previous config saved to /var/cache/conftool/dbconfig/20230927-002616-arnaudb.json [00:28:03] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - restbase-https_7443: Servers restbase2023.codfw.wmnet are marked 
down but pooled https://wikitech.wikimedia.org/wiki/PyBal [00:28:03] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - restbase-https_7443: Servers restbase2023.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [00:29:08] (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:29:59] (SwaggerProbeHasFailures) firing: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [00:30:41] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase2020.codfw.wmnet with OS bullseye [00:30:48] SRE, Cassandra, Data-Persistence, Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host restbase2020.codfw.wmnet with OS bullseye completed: - restbase20... 
[00:30:55] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [00:34:59] (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [00:35:19] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - restbase-https_7443: Servers restbase2023.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [00:38:15] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [00:38:15] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [00:38:20] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/960677 [00:38:26] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/960677 (owner: TrainBranchBot) [00:39:59] (SwaggerProbeHasFailures) resolved: (2) Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [00:41:23] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3316 (T343198)', diff saved to https://phabricator.wikimedia.org/P52677 and previous config saved to /var/cache/conftool/dbconfig/20230927-004122-arnaudb.json [00:41:24] SRE, Cassandra, Data-Persistence, Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (Eevans) [00:41:25] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1224.eqiad.wmnet with reason: Maintenance [00:41:31] T343198: Add pl_target_id column to pagelinks in production - 
https://phabricator.wikimedia.org/T343198 [00:41:38] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1224.eqiad.wmnet with reason: Maintenance [00:41:45] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1224 (T343198)', diff saved to https://phabricator.wikimedia.org/P52678 and previous config saved to /var/cache/conftool/dbconfig/20230927-004144-arnaudb.json [00:42:54] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2022.codfw.wmnet with OS bullseye [00:43:02] SRE, Cassandra, Data-Persistence, Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host restbase2022.codfw.wmnet with OS bullseye [00:46:53] PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:47:01] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - restbase-https_7443: Servers restbase2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [00:47:03] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - restbase-https_7443: Servers restbase2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [00:53:14] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/960677 (owner: TrainBranchBot) [00:58:33] (CR) Ssingh: pyrra add service dns entries (1 comment) [dns] - https://gerrit.wikimedia.org/r/961132 (https://phabricator.wikimedia.org/T302995) (owner: Herron) [00:59:28] !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase2022.codfw.wmnet with reason: 
host reimage [01:01:09] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1012 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [01:01:31] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1010 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [01:02:41] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1010 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [01:02:41] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase2022.codfw.wmnet with reason: host reimage [01:04:55] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [01:07:41] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1014 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [01:09:01] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1002 is OK: PROCS OK: 1 process with command name java, regex args 
kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [01:13:53] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1014 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [01:15:15] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1224 (T343198)', diff saved to https://phabricator.wikimedia.org/P52679 and previous config saved to /var/cache/conftool/dbconfig/20230927-011514-arnaudb.json [01:15:23] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [01:18:27] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1014 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [01:19:43] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1014 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [01:25:26] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase2022.codfw.wmnet with OS bullseye [01:25:33] SRE, Cassandra, Data-Persistence, Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host restbase2022.codfw.wmnet with OS bullseye completed: - restbase20... 
[01:25:52] !log eevans@cumin1001 START - Cookbook sre.hosts.remove-downtime for restbase2022.codfw.wmnet [01:25:53] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for restbase2022.codfw.wmnet [01:26:54] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2025.codfw.wmnet with OS bullseye [01:27:03] SRE, Cassandra, Data-Persistence, Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host restbase2025.codfw.wmnet with OS bullseye [01:27:36] SRE, Cassandra, Data-Persistence, Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (Eevans) [01:28:20] (CR) Herron: pyrra add service dns entries (1 comment) [dns] - https://gerrit.wikimedia.org/r/961132 (https://phabricator.wikimedia.org/T302995) (owner: Herron) [01:30:21] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1224', diff saved to https://phabricator.wikimedia.org/P52680 and previous config saved to /var/cache/conftool/dbconfig/20230927-013020-arnaudb.json [01:30:31] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1012 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [01:31:27] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1010 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [01:31:47] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1002 is CRITICAL: PROCS CRITICAL: 0 processes 
with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [01:32:35] (CR) Ssingh: [C: +1] pyrra add service dns entries (1 comment) [dns] - https://gerrit.wikimedia.org/r/961132 (https://phabricator.wikimedia.org/T302995) (owner: Herron) [01:32:43] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:32:47] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1002 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [01:34:29] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1010 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [01:34:41] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [01:36:35] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1012 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [01:36:43] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1008 is OK: PROCS OK: 1 process 
with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [01:36:45] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1004 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [01:37:27] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [01:37:39] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1012 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [01:37:51] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1004 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [01:38:19] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1006 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [01:38:19] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args 
kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [01:38:33] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1007 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [01:39:23] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1006 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [01:39:23] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1001 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [01:43:23] !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase2025.codfw.wmnet with reason: host reimage [01:45:28] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1224', diff saved to https://phabricator.wikimedia.org/P52681 and previous config saved to /var/cache/conftool/dbconfig/20230927-014527-arnaudb.json [01:45:31] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1013 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [01:45:58] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase2025.codfw.wmnet with reason: host reimage [01:49:04] PROBLEM - Kafka MirrorMaker 
main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1015 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [01:49:24] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1005 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [01:49:44] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1013 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [01:50:24] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1005 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [01:52:12] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1015 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [02:00:35] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1224 (T343198)', diff saved to https://phabricator.wikimedia.org/P52682 and previous config saved to /var/cache/conftool/dbconfig/20230927-020034-arnaudb.json [02:00:36] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1225.eqiad.wmnet with reason: Maintenance [02:00:39] !log arnaudb@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1225.eqiad.wmnet with reason: Maintenance [02:00:45] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [02:05:29] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db1234.eqiad.wmnet with OS bullseye [02:05:40] SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db1234.eqiad.wmnet with OS bullseye [02:06:34] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db1235.eqiad.wmnet with OS bullseye [02:06:41] SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db1235.eqiad.wmnet with OS bullseye [02:07:29] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db1236.eqiad.wmnet with OS bullseye [02:07:36] SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db1236.eqiad.wmnet with OS bullseye [02:07:54] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [02:08:16] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1012 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties 
https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [02:08:21] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db1237.eqiad.wmnet with OS bullseye [02:08:29] SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db1237.eqiad.wmnet with OS bullseye [02:09:10] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db1238.eqiad.wmnet with OS bullseye [02:09:16] SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db1238.eqiad.wmnet with OS bullseye [02:09:18] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1007 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [02:09:40] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1012 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [02:09:56] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db1239.eqiad.wmnet with OS bullseye [02:10:03] SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db1239.eqiad.wmnet with OS bullseye [02:10:07] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase2025.codfw.wmnet with 
OS bullseye [02:10:13] SRE, Cassandra, Data-Persistence, Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host restbase2025.codfw.wmnet with OS bullseye completed: - restbase20... [02:10:45] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db1240.eqiad.wmnet with OS bullseye [02:10:52] SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db1240.eqiad.wmnet with OS bullseye [02:11:02] SRE, Cassandra, Data-Persistence, Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (Eevans) [02:11:15] !log eevans@cumin1001 START - Cookbook sre.hosts.remove-downtime for restbase2025.codfw.wmnet [02:11:15] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for restbase2025.codfw.wmnet [02:11:38] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db1241.eqiad.wmnet with OS bullseye [02:11:44] SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db1241.eqiad.wmnet with OS bullseye [02:13:24] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1014 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [02:14:50] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1014 is OK: PROCS OK: 1 process with command name java, regex args 
kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [02:18:13] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1234.eqiad.wmnet with reason: host reimage [02:19:26] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1235.eqiad.wmnet with reason: host reimage [02:20:21] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1236.eqiad.wmnet with reason: host reimage [02:21:09] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1237.eqiad.wmnet with reason: host reimage [02:21:14] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1004 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [02:21:25] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1234.eqiad.wmnet with reason: host reimage [02:21:56] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1238.eqiad.wmnet with reason: host reimage [02:22:14] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1004 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [02:22:36] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1011 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [02:22:43] !log jhancock@cumin2002 START - Cookbook 
sre.hosts.downtime for 2:00:00 on db1239.eqiad.wmnet with reason: host reimage [02:23:33] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1240.eqiad.wmnet with reason: host reimage [02:23:57] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1235.eqiad.wmnet with reason: host reimage [02:24:08] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1014 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [02:24:29] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1241.eqiad.wmnet with reason: host reimage [02:26:02] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1237.eqiad.wmnet with reason: host reimage [02:27:02] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1011 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [02:27:22] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1014 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [02:28:04] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [02:28:04] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on 
kafka-jumbo1006 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [02:28:26] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1238.eqiad.wmnet with reason: host reimage [02:28:46] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [02:29:10] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1006 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [02:29:10] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1001 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [02:29:50] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1008 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [02:30:21] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on db1236.eqiad.wmnet with reason: host reimage [02:32:43] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on db1239.eqiad.wmnet with reason: host reimage [02:32:52] PROBLEM - Kafka 
MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [02:33:26] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1241.eqiad.wmnet with reason: host reimage [02:33:34] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on db1240.eqiad.wmnet with reason: host reimage [02:33:52] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1008 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [02:36:56] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [02:37:31] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [02:38:02] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1008 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [02:38:03] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:38:45] (JobUnavailable) firing: (5) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:38:47] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [02:38:48] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1234.eqiad.wmnet with OS bullseye [02:38:54] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db1234.eqiad.wmnet with OS bullseye completed: - db1234 (**PASS**) - Removed f... [02:39:13] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [02:40:17] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [02:40:17] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1235.eqiad.wmnet with OS bullseye [02:40:24] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db1235.eqiad.wmnet with OS bullseye completed: - db1235 (**PASS**) - Removed f... 
[02:41:32] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [02:42:42] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [02:42:43] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1237.eqiad.wmnet with OS bullseye [02:42:50] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db1237.eqiad.wmnet with OS bullseye completed: - db1237 (**PASS**) - Removed f... [02:43:03] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:43:20] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [02:44:20] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1006 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [02:44:20] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties 
https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [02:44:32] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [02:44:32] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1238.eqiad.wmnet with OS bullseye [02:44:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db1238.eqiad.wmnet with OS bullseye completed: - db1238 (**PASS**) - Removed f... [02:45:22] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [02:45:38] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1006 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [02:45:38] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1001 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [02:46:13] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [02:46:35] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [02:46:36] !log jhancock@cumin2002 END (PASS) - Cookbook 
sre.hosts.reimage (exit_code=0) for host db1236.eqiad.wmnet with OS bullseye [02:46:42] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db1236.eqiad.wmnet with OS bullseye completed: - db1236 (**WARN**) - Removed f... [02:47:25] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [02:47:26] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1239.eqiad.wmnet with OS bullseye [02:47:32] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db1239.eqiad.wmnet with OS bullseye completed: - db1239 (**WARN**) - Removed f... 
[02:49:42] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [02:50:27] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [02:50:42] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [02:50:43] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1240.eqiad.wmnet with OS bullseye [02:50:50] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db1240.eqiad.wmnet with OS bullseye completed: - db1240 (**WARN**) - Removed f... [02:51:27] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [02:51:28] RECOVERY - Check systemd state on gitlab2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:51:28] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1241.eqiad.wmnet with OS bullseye [02:51:34] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db1241.eqiad.wmnet with OS bullseye completed: - db1241 (**WARN**) - Removed f... 
[02:53:06] SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (Jhancock.wm) [03:08:45] (JobUnavailable) firing: (5) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:26:16] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [04:13:09] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [04:13:12] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [04:36:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [04:41:34] (KubernetesAPILatency) resolved: (10) High Kubernetes API latency (LIST certificaterequests) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [04:48:24] PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 1 (ldap-rw2001), Fresh: 124 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [04:49:34] (KubernetesAPILatency) firing: (7) High Kubernetes API latency (LIST certificaterequests) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [04:54:34] (KubernetesAPILatency) resolved: (8) High Kubernetes API latency (LIST certificaterequests) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [05:44:45] (03PS2) 10Giuseppe Lavagetto: Allow setting values for jsonschema entities [software/conftool] - 10https://gerrit.wikimedia.org/r/909203 [05:44:47] (03PS1) 10Giuseppe Lavagetto: Add X-Known-Client support [software/conftool] - 10https://gerrit.wikimedia.org/r/961272 [05:47:41] (03CR) 10CI reject: [V: 04-1] Add X-Known-Client support [software/conftool] - 10https://gerrit.wikimedia.org/r/961272 (owner: 10Giuseppe Lavagetto) [05:48:36] (03PS1) 10Ilias Sarantopoulos: ml-services: revert CORS settings in app [deployment-charts] - 10https://gerrit.wikimedia.org/r/961273 [05:49:47] (03CR) 10Ilias Sarantopoulos: [C: 03+2] ml-services: revert CORS settings in app [deployment-charts] - 10https://gerrit.wikimedia.org/r/961273 (owner: 10Ilias Sarantopoulos) [05:49:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (2) wdqs1009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [05:49:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (2) wdqs1012:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [05:50:37] (03Merged) 
jenkins-bot: ml-services: revert CORS settings in app [deployment-charts] - https://gerrit.wikimedia.org/r/961273 (owner: Ilias Sarantopoulos) [05:50:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (3) wdqs1008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [05:53:17] !log isaranto@deploy2002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . [05:53:52] !log isaranto@deploy2002 helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . [05:54:32] !log isaranto@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . [05:54:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (6) wdqs1006:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [06:00:06] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230927T0600) [06:04:17] (PoolcounterFullQueues) firing: Full queues for poolcounter2003:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:07:46] (ProbeDown) firing: Service etherpad1003:7443 has failed probes (http_etherpad_envoy_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#etherpad1003:7443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:09:17] (PoolcounterFullQueues) resolved: Full queues for poolcounter2003:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:12:46] (ProbeDown) resolved: Service etherpad1003:7443 has failed probes (http_etherpad_envoy_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#etherpad1003:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:17:25] ACKNOWLEDGEMENT - MegaRAID on dbstore1005 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T347449 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [06:17:31] SRE, ops-eqiad: Degraded RAID on dbstore1005 - https://phabricator.wikimedia.org/T347449 (ops-monitoring-bot) [06:32:21] (ProbeDown) firing: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:34:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (6) wdqs1006:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [06:35:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (3) 
wdqs1008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [06:40:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (3) wdqs1008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [06:40:59] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2097.codfw.wmnet with reason: Maintenance [06:41:13] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2097.codfw.wmnet with reason: Maintenance [06:50:14] !log gmodena@deploy2002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [06:50:19] !log gmodena@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [06:50:24] !log gmodena@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply [06:54:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (2) wdqs1006:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [06:57:31] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on ldap-rw[1001,2001].wikimedia.org with reason: WIP [06:57:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on ldap-rw[1001,2001].wikimedia.org with 
reason: WIP [06:59:23] (PS1) Slyngshede: SSH Key mgmt: Allow multiple SSH keys to be stored in LDAP. [software/bitu] - https://gerrit.wikimedia.org/r/961278 [07:00:05] Amir1, Urbanecm, and taavi: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC morning backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230927T0700). [07:00:05] No Gerrit patches in the queue for this window AFAICS. [07:04:58] (RdfStreamingUpdaterHighConsumerUpdateLag) resolved: (2) wdqs1009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [07:04:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (2) wdqs1006:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [07:05:58] (RdfStreamingUpdaterHighConsumerUpdateLag) resolved: wdqs1008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [07:08:45] (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:09:58] (RdfStreamingUpdaterHighConsumerUpdateLag) resolved: (2) wdqs1006:9101 has fallen behind applying 
updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [07:17:21] (ProbeDown) resolved: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:26:16] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [07:39:52] !log repool ms-fe2009 [07:39:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:48:50] PROBLEM - Host mr1-eqiad.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [07:48:50] PROBLEM - Host mr1-eqiad.oob is DOWN: PING CRITICAL - Packet loss = 100% [07:54:20] RECOVERY - Host mr1-eqiad.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 0.67 ms [08:00:02] RECOVERY - Host mr1-eqiad.oob is UP: PING OK - Packet loss = 0%, RTA = 0.59 ms [08:09:04] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1012 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [08:09:36] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties 
https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [08:09:38] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1006 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [08:09:47] this is me sorry --^ [08:10:19] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on 15 hosts with reason: Kafka mirror issues on jumbo [08:10:42] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 15 hosts with reason: Kafka mirror issues on jumbo [08:11:45] (03PS1) 10Vgutierrez: hiera: Move HAProxy 2.7 experiments to cp4051 [puppet] - 10https://gerrit.wikimedia.org/r/961333 (https://phabricator.wikimedia.org/T317799) [08:11:57] (03PS1) 10Majavah: P:toolforge::instance: decrease priority of access rule [puppet] - 10https://gerrit.wikimedia.org/r/961334 (https://phabricator.wikimedia.org/T288406) [08:13:04] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/957766 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [08:13:45] (JobUnavailable) firing: (4) Reduced availability for job jmx_kafka_mirrormaker in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:18:28] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1006 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [08:18:57] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43638/console" [puppet] - 10https://gerrit.wikimedia.org/r/961333 (https://phabricator.wikimedia.org/T317799) (owner: 10Vgutierrez) [08:19:08] (JobUnavailable) firing: (4) Reduced availability for job jmx_kafka_mirrormaker in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:19:31] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] hiera: Move HAProxy 2.7 experiments to cp4051 [puppet] - 10https://gerrit.wikimedia.org/r/961333 (https://phabricator.wikimedia.org/T317799) (owner: 10Vgutierrez) [08:19:52] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1009 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [08:21:25] !log update HAProxy to version 2.7.10 in cp4051 - T317799 [08:21:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:32] T317799: Rate limiting for hotlinked images - https://phabricator.wikimedia.org/T317799 [08:22:18] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1012 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [08:23:40] PROBLEM - Host mr1-eqiad.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [08:23:50] PROBLEM - Host mr1-eqiad.oob is DOWN: PING CRITICAL - Packet loss = 100% [08:24:22] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/961092 (owner: 10Majavah) [08:28:22] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 15 hosts with 
reason: Kafka mirror issues on jumbo [08:28:35] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 15 hosts with reason: Kafka mirror issues on jumbo [08:29:12] RECOVERY - Host mr1-eqiad.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 2.76 ms [08:31:08] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/961334 (https://phabricator.wikimedia.org/T288406) (owner: 10Majavah) [08:32:15] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/960163 (owner: 10Majavah) [08:33:31] (03CR) 10Marostegui: [C: 04-1] "Please add:" [puppet] - 10https://gerrit.wikimedia.org/r/961068 (https://phabricator.wikimedia.org/T347381) (owner: 10Majavah) [08:34:08] (JobUnavailable) firing: (4) Reduced availability for job jmx_kafka_mirrormaker in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:34:39] We have an ongoing incident affecting kafka-jumbo mirror makers that we're handling in #wikimedia-analytics - I am the IC. No user-facing impact at the moment. 
[08:34:55] (03CR) 10Majavah: [C: 03+2] dnsrecusor: Remove labs-ip-alias-dump icinga check [puppet] - 10https://gerrit.wikimedia.org/r/960163 (owner: 10Majavah) [08:35:02] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Add per-output queue monitoring for Juniper network devices - https://phabricator.wikimedia.org/T326322 (10ayounsi) [08:35:02] RECOVERY - Host mr1-eqiad.oob is UP: PING OK - Packet loss = 0%, RTA = 2.09 ms [08:37:00] (03CR) 10Majavah: [C: 03+2] P:toolforge::instance: decrease priority of access rule [puppet] - 10https://gerrit.wikimedia.org/r/961334 (https://phabricator.wikimedia.org/T288406) (owner: 10Majavah) [08:37:18] (03CR) 10Majavah: [V: 03+1 C: 03+2] galera: Fix some ordering issues [puppet] - 10https://gerrit.wikimedia.org/r/961092 (owner: 10Majavah) [08:37:50] (03CR) 10Vgutierrez: varnish: allow PURGE requests only from dedicated socket (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/960112 (https://phabricator.wikimedia.org/T347192) (owner: 10Fabfur) [08:38:11] (03CR) 10David Caro: [C: 03+1] "LGTM, feel free to ignore the nits" [puppet] - 10https://gerrit.wikimedia.org/r/960164 (owner: 10Majavah) [08:39:26] RECOVERY - Check systemd state on ldap-rw2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:40:45] (03PS26) 10Fabfur: varnish: allow PURGE requests only from dedicated socket [puppet] - 10https://gerrit.wikimedia.org/r/960112 (https://phabricator.wikimedia.org/T347192) [08:40:47] (03CR) 10Fabfur: varnish: allow PURGE requests only from dedicated socket (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/960112 (https://phabricator.wikimedia.org/T347192) (owner: 10Fabfur) [08:41:15] 10SRE, 10ops-eqiad: Degraded RAID on dbstore1005 - https://phabricator.wikimedia.org/T347449 (10Marostegui) ` [2574910.962214] megaraid_sas 0000:18:00.0: scanning for scsi0... 
[2574910.962794] megaraid_sas 0000:18:00.0: 1244 (749109359s/0x0001/CRIT) - VD 00/0 is now DEGRADED [2575154.297980] megaraid_sas 0000:... [08:42:43] (03PS7) 10Vgutierrez: haproxy: Add support for filter bwlim-(in|out) [puppet] - 10https://gerrit.wikimedia.org/r/928541 (https://phabricator.wikimedia.org/T317799) [08:42:45] (03PS12) 10Vgutierrez: hiera: Test HAProxy bw limits per URL on cp4051 [puppet] - 10https://gerrit.wikimedia.org/r/928548 (https://phabricator.wikimedia.org/T317799) [08:44:26] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mathoid: apply [08:44:29] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mathoid: apply [08:44:37] !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/mathoid: apply [08:44:40] !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/mathoid: apply [08:44:52] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install pc101[56] - https://phabricator.wikimedia.org/T342164 (10Marostegui) [08:44:55] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install pc201[56] - https://phabricator.wikimedia.org/T342163 (10Marostegui) [08:46:37] 10ops-codfw, 10DBA: db2109 crashed - https://phabricator.wikimedia.org/T347318 (10Marostegui) @Jhancock.wm were you able to see anything? [08:47:07] (03CR) 10Muehlenhoff: [C: 03+2] Switch cloudgw/codfw1dev to profile::firewall [puppet] - 10https://gerrit.wikimedia.org/r/958905 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [08:49:08] (JobUnavailable) firing: (4) Reduced availability for job jmx_kafka_mirrormaker in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:52:20] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install pc101[56] - https://phabricator.wikimedia.org/T342164 (10Marostegui) Thank you! 
[08:53:18] (03CR) 10Effie Mouzeli: [C: 03+1] maps: remove per-host healthchck [puppet] - 10https://gerrit.wikimedia.org/r/961062 (https://phabricator.wikimedia.org/T314118) (owner: 10Filippo Giunchedi) [08:53:37] (03PS2) 10Majavah: wiki-replicas.sql: Drop grants for old labstore hosts [puppet] - 10https://gerrit.wikimedia.org/r/961067 [08:53:39] (03PS2) 10Majavah: Allow cloudcontrol1005 and 1007 to connect to wiki replicas [puppet] - 10https://gerrit.wikimedia.org/r/961068 (https://phabricator.wikimedia.org/T347381) [08:53:58] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/928548 (https://phabricator.wikimedia.org/T317799) (owner: 10Vgutierrez) [08:54:59] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on db2109.codfw.wmnet with reason: Host crashed [08:55:03] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2109.codfw.wmnet with reason: Host crashed [08:55:08] 10ops-codfw, 10DBA: db2109 crashed - https://phabricator.wikimedia.org/T347318 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=068d8793-7777-446c-b4d2-653f3aae2433) set by marostegui@cumin1001 for 7 days, 0:00:00 on 1 host(s) and their services with reason: Host crashed ` db2109.codfw.wmnet ` [08:55:30] 10SRE, 10Infrastructure-Foundations: Drive host network config from Netbox, and move away from ifupdown - https://phabricator.wikimedia.org/T347411 (10cmooney) >>! In T347411#9200644, @jbond wrote: > We may be able to use redfish to get this information (although i couldn't find it from a quick look) and the u... 
[08:56:44] (03PS27) 10Fabfur: varnish: allow PURGE requests only from dedicated socket [puppet] - 10https://gerrit.wikimedia.org/r/960112 (https://phabricator.wikimedia.org/T347192) [08:57:31] 10ops-codfw, 10DBA: db2109 crashed - https://phabricator.wikimedia.org/T347318 (10Marostegui) p:05Triage→03Medium [09:00:08] (03CR) 10Vgutierrez: [C: 03+1] hiera: Test HAProxy bw limits per URL on cp4051 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/928548 (https://phabricator.wikimedia.org/T317799) (owner: 10Vgutierrez) [09:00:50] (03PS1) 10Majavah: cr-labs: Permit dbproxy access for wiki replica metadata database [homer/public] - 10https://gerrit.wikimedia.org/r/961336 (https://phabricator.wikimedia.org/T347381) [09:02:55] (03PS1) 10Clément Goubert: mw-on-k8s: Raise idle worker alerting threshold to 50% [alerts] - 10https://gerrit.wikimedia.org/r/961337 (https://phabricator.wikimedia.org/T346422) [09:05:08] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on 15 hosts with reason: Kafka mirror issues on jumbo [09:05:21] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on 15 hosts with reason: Kafka mirror issues on jumbo [09:06:14] (03CR) 10Vgutierrez: varnish: allow PURGE requests only from dedicated socket (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/960112 (https://phabricator.wikimedia.org/T347192) (owner: 10Fabfur) [09:06:44] (HaproxyUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [09:06:55] (03PS1) 10Marostegui: install_server: Do not reimage pc1015 [puppet] - 10https://gerrit.wikimedia.org/r/961339 [09:06:58] vgutierrez: ^ [09:07:06] haproxy paged [09:07:15] acking [09:07:28] ah, Cathal was faster :-D [09:07:56] (03PS1) 10Muehlenhoff: clouwgw: 
Update ordering for the variant using profile::firewall [puppet] - 10https://gerrit.wikimedia.org/r/961340 (https://phabricator.wikimedia.org/T336497) [09:08:03] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage pc1015 [puppet] - 10https://gerrit.wikimedia.org/r/961339 (owner: 10Marostegui) [09:08:09] !log gmodena@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [09:08:15] !log gmodena@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply [09:08:21] (ProbeDown) firing: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:08:24] (03PS1) 10Clément Goubert: mw-api-ext, mw-web: raise replicas for traffic bump [deployment-charts] - 10https://gerrit.wikimedia.org/r/961341 (https://phabricator.wikimedia.org/T346422) [09:10:04] !log gmodena@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [09:10:07] !log gmodena@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply [09:11:44] (HaproxyUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [09:12:21] (03CR) 10Clément Goubert: "Just in case." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/961341 (https://phabricator.wikimedia.org/T346422) (owner: 10Clément Goubert) [09:12:24] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/961340 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [09:12:50] (03PS2) 10Clément Goubert: trafficserver: move 8% of traffic to mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/957858 (https://phabricator.wikimedia.org/T346422) (owner: 10Giuseppe Lavagetto) [09:13:21] (ProbeDown) resolved: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:13:34] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] clouwgw: Update ordering for the variant using profile::firewall [puppet] - 10https://gerrit.wikimedia.org/r/961340 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [09:14:43] (03CR) 10Muehlenhoff: [C: 03+2] clouwgw: Update ordering for the variant using profile::firewall [puppet] - 10https://gerrit.wikimedia.org/r/961340 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [09:14:57] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on db2109.codfw.wmnet with reason: Host crashed [09:15:01] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2109.codfw.wmnet with reason: Host crashed [09:15:02] (03CR) 10Majavah: [C: 03+2] wiki-replicas.sql: Drop grants for old labstore hosts [puppet] - 10https://gerrit.wikimedia.org/r/961067 (owner: 10Majavah) [09:15:05] 10ops-codfw, 10DBA: db2109 crashed - https://phabricator.wikimedia.org/T347318 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=68fd1013-aa8f-4502-bb60-c027808c1750) set by marostegui@cumin1001 
for 7 days, 0:00:00 on 1 host(s) and their services with reason: Host crashed ` db2109.codfw.wmnet ` [09:15:33] (03PS1) 10Vgutierrez: haproxy: Disable varnish connection limit [puppet] - 10https://gerrit.wikimedia.org/r/961343 (https://phabricator.wikimedia.org/T310609) [09:15:43] (03CR) 10Clément Goubert: [C: 03+1] trafficserver: move 8% of traffic to mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/957858 (https://phabricator.wikimedia.org/T346422) (owner: 10Giuseppe Lavagetto) [09:16:01] (03PS2) 10Vgutierrez: haproxy: Disable varnish connection limit [puppet] - 10https://gerrit.wikimedia.org/r/961343 (https://phabricator.wikimedia.org/T310609) [09:18:36] (ProbeDown) firing: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:19:07] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43639/console" [puppet] - 10https://gerrit.wikimedia.org/r/961343 (https://phabricator.wikimedia.org/T310609) (owner: 10Vgutierrez) [09:19:30] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] haproxy: Disable varnish connection limit [puppet] - 10https://gerrit.wikimedia.org/r/961343 (https://phabricator.wikimedia.org/T310609) (owner: 10Vgutierrez) [09:22:28] (03CR) 10Majavah: [C: 03+2] Allow cloudcontrol1005 and 1007 to connect to wiki replicas [puppet] - 10https://gerrit.wikimedia.org/r/961068 (https://phabricator.wikimedia.org/T347381) (owner: 10Majavah) [09:23:00] (03CR) 10Marostegui: [C: 03+1] Allow cloudcontrol1005 and 1007 to connect to wiki replicas [puppet] - 10https://gerrit.wikimedia.org/r/961068 (https://phabricator.wikimedia.org/T347381) (owner: 10Majavah) [09:23:36] (ProbeDown) resolved: (2) Service mirror1001:443 has failed probes 
(http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:26:29] (03PS1) 10Majavah: hieradata: move maintain-dbusers to cloudcontrol1005 [puppet] - 10https://gerrit.wikimedia.org/r/961345 (https://phabricator.wikimedia.org/T347381) [09:28:52] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] cr-labs: Permit dbproxy access for wiki replica metadata database [homer/public] - 10https://gerrit.wikimedia.org/r/961336 (https://phabricator.wikimedia.org/T347381) (owner: 10Majavah) [09:29:15] (03CR) 10Majavah: [C: 03+2] cr-labs: Permit dbproxy access for wiki replica metadata database [homer/public] - 10https://gerrit.wikimedia.org/r/961336 (https://phabricator.wikimedia.org/T347381) (owner: 10Majavah) [09:29:48] (03Merged) 10jenkins-bot: cr-labs: Permit dbproxy access for wiki replica metadata database [homer/public] - 10https://gerrit.wikimedia.org/r/961336 (https://phabricator.wikimedia.org/T347381) (owner: 10Majavah) [09:29:59] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] hieradata: move maintain-dbusers to cloudcontrol1005 [puppet] - 10https://gerrit.wikimedia.org/r/961345 (https://phabricator.wikimedia.org/T347381) (owner: 10Majavah) [09:33:01] !log update CR firewall policy, gerrit 961336 [09:33:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:01] !log cordoning kubernetes1013 for debug purposes [09:36:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:09] (03CR) 10Majavah: [C: 03+2] hieradata: move maintain-dbusers to cloudcontrol1005 [puppet] - 10https://gerrit.wikimedia.org/r/961345 (https://phabricator.wikimedia.org/T347381) (owner: 10Majavah) [09:36:19] (03CR) 10Fabfur: varnish: allow PURGE requests only from dedicated socket (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/960112
(https://phabricator.wikimedia.org/T347192) (owner: 10Fabfur) [09:39:27] (03CR) 10Clément Goubert: [C: 03+2] mw-on-k8s: Raise idle worker alerting threshold to 50% [alerts] - 10https://gerrit.wikimedia.org/r/961337 (https://phabricator.wikimedia.org/T346422) (owner: 10Clément Goubert) [09:40:43] (03Merged) 10jenkins-bot: mw-on-k8s: Raise idle worker alerting threshold to 50% [alerts] - 10https://gerrit.wikimedia.org/r/961337 (https://phabricator.wikimedia.org/T346422) (owner: 10Clément Goubert) [09:43:31] !log Bumping mw-on-k8s traffic to 8% - T346422 [09:43:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:38] T346422: Move 10% of mediawiki external requests to mw on k8s - https://phabricator.wikimedia.org/T346422 [09:44:33] (03CR) 10Clément Goubert: [C: 03+2] trafficserver: move 8% of traffic to mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/957858 (https://phabricator.wikimedia.org/T346422) (owner: 10Giuseppe Lavagetto) [09:45:11] jynus, topranks, heads up ^ [09:45:25] claime: thanks :) [09:46:28] thanks [09:47:57] (03PS1) 10Majavah: hieradata: move Galera primary to cloudcontrol1007 [puppet] - 10https://gerrit.wikimedia.org/r/961348 (https://phabricator.wikimedia.org/T346891) [09:47:59] (03PS1) 10Majavah: hieradata: move prometheus-openstack-exporter to cloudcontrol1007 [puppet] - 10https://gerrit.wikimedia.org/r/961349 (https://phabricator.wikimedia.org/T346891) [09:48:03] !log jayme@cumin1001 conftool action : set/pooled=no; selector: name=kubernetes1013.* [09:49:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:49:59] 10SRE, 10Infrastructure-Foundations, 10LDAP, 10Patch-For-Review: Migrate the r/w LDAP servers to Bookworm and MDB storage - https://phabricator.wikimedia.org/T331699 
(10jcrespo) I got an alert about ldap-rw2001 failing its backups (probably expected during setup), but wanted to give a heads up. [09:50:42] (03PS1) 10Jcrespo: this is a test patch - ignore [puppet] - 10https://gerrit.wikimedia.org/r/961350 [09:50:52] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] hieradata: move Galera primary to cloudcontrol1007 [puppet] - 10https://gerrit.wikimedia.org/r/961348 (https://phabricator.wikimedia.org/T346891) (owner: 10Majavah) [09:51:08] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] hieradata: move prometheus-openstack-exporter to cloudcontrol1007 [puppet] - 10https://gerrit.wikimedia.org/r/961349 (https://phabricator.wikimedia.org/T346891) (owner: 10Majavah) [09:51:31] (03CR) 10Majavah: [C: 03+2] hieradata: move Galera primary to cloudcontrol1007 [puppet] - 10https://gerrit.wikimedia.org/r/961348 (https://phabricator.wikimedia.org/T346891) (owner: 10Majavah) [09:51:38] (03CR) 10Majavah: [C: 03+2] hieradata: move prometheus-openstack-exporter to cloudcontrol1007 [puppet] - 10https://gerrit.wikimedia.org/r/961349 (https://phabricator.wikimedia.org/T346891) (owner: 10Majavah) [09:52:00] (03Abandoned) 10Jcrespo: this is a test patch - ignore [puppet] - 10https://gerrit.wikimedia.org/r/961350 (owner: 10Jcrespo) [09:54:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:57:21] (ProbeDown) firing: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:58:13] (03PS1) 10Clément Goubert: mw-on-k8s: Remove wikidata exception [puppet] - 10https://gerrit.wikimedia.org/r/961351 
(https://phabricator.wikimedia.org/T290536) [09:59:05] (03PS2) 10Giuseppe Lavagetto: Add X-Known-Client support [software/conftool] - 10https://gerrit.wikimedia.org/r/961272 [09:59:19] (03PS1) 10Muehlenhoff: Make the dbconfig settings conditional on the hdb backend [puppet] - 10https://gerrit.wikimedia.org/r/961352 [10:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230927T1000) [10:01:55] (03CR) 10CI reject: [V: 04-1] Add X-Known-Client support [software/conftool] - 10https://gerrit.wikimedia.org/r/961272 (owner: 10Giuseppe Lavagetto) [10:02:21] (ProbeDown) resolved: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:03:50] (03PS28) 10Fabfur: varnish: allow PURGE requests also from dedicated socket [puppet] - 10https://gerrit.wikimedia.org/r/960112 (https://phabricator.wikimedia.org/T347192) [10:07:19] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/961352 (owner: 10Muehlenhoff) [10:09:22] (03PS5) 10Jbond: sre.puppet.sync-netbox-hiera: Add puppetservers to sync [cookbooks] - 10https://gerrit.wikimedia.org/r/961155 (https://phabricator.wikimedia.org/T347410) [10:11:19] (03CR) 10Giuseppe Lavagetto: [C: 03+1] mw-on-k8s: Remove wikidata exception [puppet] - 10https://gerrit.wikimedia.org/r/961351 (https://phabricator.wikimedia.org/T290536) (owner: 10Clément Goubert) [10:11:40] (03CR) 10Clément Goubert: [C: 03+2] mw-on-k8s: Remove wikidata exception [puppet] - 10https://gerrit.wikimedia.org/r/961351 (https://phabricator.wikimedia.org/T290536) (owner: 10Clément Goubert) [10:13:13] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:13:23] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/961155 (https://phabricator.wikimedia.org/T347410) (owner: 10Jbond) [10:13:36] (03CR) 10Jbond: [C: 03+2] sre.puppet.sync-netbox-hiera: Add puppetservers to sync [cookbooks] - 10https://gerrit.wikimedia.org/r/961155 (https://phabricator.wikimedia.org/T347410) (owner: 10Jbond) [10:14:01] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:14:43] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:15:43] (03PS3) 10Giuseppe Lavagetto: Add X-Known-Client support [software/conftool] - 10https://gerrit.wikimedia.org/r/961272 [10:15:49] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:18:32] (03CR) 10CI reject: [V: 04-1] Add X-Known-Client support [software/conftool] - 10https://gerrit.wikimedia.org/r/961272 (owner: 10Giuseppe Lavagetto) [10:19:47] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:20:43] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): update netbox sync to also sync to puppetservers - https://phabricator.wikimedia.org/T347410 (10jbond) 05Open→03Resolved a:03jbond Cookbook has now been updated [10:20:49] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10jbond) [10:22:03] (03PS4) 10Giuseppe Lavagetto: Add X-Known-Client support [software/conftool] - 10https://gerrit.wikimedia.org/r/961272 [10:22:19] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:22:53] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50567 bytes in 0.069 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:23:00] (03PS1) 10Clément Goubert: mw-web: Raise main replicas to 22 [deployment-charts] - 10https://gerrit.wikimedia.org/r/961353 (https://phabricator.wikimedia.org/T346422) [10:23:21] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.288 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:24:17] (03CR) 10Clément Goubert: [C: 03+2] mw-web: Raise main replicas to 22 [deployment-charts] - 10https://gerrit.wikimedia.org/r/961353 (https://phabricator.wikimedia.org/T346422) (owner: 10Clément Goubert) [10:25:03] (03Merged) 10jenkins-bot: mw-web: Raise main replicas to 22 [deployment-charts] - 10https://gerrit.wikimedia.org/r/961353 (https://phabricator.wikimedia.org/T346422) (owner: 10Clément Goubert) [10:27:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:27:14] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [10:27:23] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [10:27:30] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [10:27:45] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [10:32:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:32:29] (03PS1) 10Ilias Sarantopoulos: ml-services: allow empty boolean query param in ores legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/961355 (https://phabricator.wikimedia.org/T347193) [10:36:07] (03CR) 10Ilias Sarantopoulos: [C: 03+2] ml-services: allow empty boolean query param in ores legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/961355 (https://phabricator.wikimedia.org/T347193) (owner: 10Ilias Sarantopoulos) [10:37:23] (03Merged) 10jenkins-bot: ml-services: allow empty boolean query param in ores legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/961355 (https://phabricator.wikimedia.org/T347193) (owner: 10Ilias Sarantopoulos) [10:38:00] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T343198)', diff saved to https://phabricator.wikimedia.org/P52683 and previous config saved to /var/cache/conftool/dbconfig/20230927-103800-arnaudb.json [10:38:10] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [10:39:18] !log isaranto@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . [10:39:44] !log isaranto@deploy2002 helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . [10:40:12] !log isaranto@deploy2002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
[10:40:38] (03PS29) 10Fabfur: varnish: allow PURGE requests also from dedicated socket [puppet] - 10https://gerrit.wikimedia.org/r/960112 (https://phabricator.wikimedia.org/T347192) [10:41:58] (03PS1) 10Giuseppe Lavagetto: Release 2.3.2 [software/conftool] - 10https://gerrit.wikimedia.org/r/961356 [10:43:07] (03PS1) 10Clément Goubert: mw-web: Raise main replicas to 25 [deployment-charts] - 10https://gerrit.wikimedia.org/r/961357 (https://phabricator.wikimedia.org/T346422) [10:43:23] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Add X-Known-Client support [software/conftool] - 10https://gerrit.wikimedia.org/r/961272 (owner: 10Giuseppe Lavagetto) [10:44:03] (03CR) 10Jbond: [C: 03+2] backups: Add new filesets for puppetservers [puppet] - 10https://gerrit.wikimedia.org/r/961179 (https://phabricator.wikimedia.org/T347390) (owner: 10Jbond) [10:44:18] (03CR) 10Clément Goubert: [C: 03+2] mw-web: Raise main replicas to 25 [deployment-charts] - 10https://gerrit.wikimedia.org/r/961357 (https://phabricator.wikimedia.org/T346422) (owner: 10Clément Goubert) [10:44:23] (03CR) 10Jbond: [C: 03+2] puppetserver: add backups [puppet] - 10https://gerrit.wikimedia.org/r/961177 (https://phabricator.wikimedia.org/T347390) (owner: 10Jbond) [10:45:00] (03Merged) 10jenkins-bot: mw-web: Raise main replicas to 25 [deployment-charts] - 10https://gerrit.wikimedia.org/r/961357 (https://phabricator.wikimedia.org/T346422) (owner: 10Clément Goubert) [10:45:38] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [10:45:50] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [10:46:04] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [10:46:13] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [10:47:50] (03PS1) 10Jbond: puppetserver: we use the backup profile for backups [puppet] - 10https://gerrit.wikimedia.org/r/961359 (https://phabricator.wikimedia.org/T347390) [10:48:09] (03CR) 
10Jbond: [C: 03+2] puppetserver: we use the backup profile for backups [puppet] - 10https://gerrit.wikimedia.org/r/961359 (https://phabricator.wikimedia.org/T347390) (owner: 10Jbond) [10:48:18] (03Merged) 10jenkins-bot: Add X-Known-Client support [software/conftool] - 10https://gerrit.wikimedia.org/r/961272 (owner: 10Giuseppe Lavagetto) [10:48:45] (03PS1) 10Muehlenhoff: Switch main cloudgw hosts to profile::firewall [puppet] - 10https://gerrit.wikimedia.org/r/961360 (https://phabricator.wikimedia.org/T336497) [10:49:08] (03CR) 10Volans: sre.hosts.reimage: Suggest install-console for troubleshooting (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/956082 (https://phabricator.wikimedia.org/T345778) (owner: 10Bking) [10:50:35] RECOVERY - Check systemd state on puppetmaster1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:53:07] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P52684 and previous config saved to /var/cache/conftool/dbconfig/20230927-105306-arnaudb.json [10:53:10] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:53:47] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10cloud-services-team: [spicerack] Add remote command output to log file - https://phabricator.wikimedia.org/T347093 (10Volans) >>! In T347093#9188497, @fnegri wrote: > Is there a task where I can learn more about this? I don't think we have one open... [10:55:47] Hi. 
80k+ logstash errors in the last hour for cewiki alone re a maintenance script [10:55:59] 88k+* [10:56:19] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Release 2.3.2 [software/conftool] - 10https://gerrit.wikimedia.org/r/961356 (owner: 10Giuseppe Lavagetto) [10:57:53] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/961360 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [10:58:10] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:58:11] (03Abandoned) 10Volans: Install hosts: fallback to drmrs [cookbooks] - 10https://gerrit.wikimedia.org/r/933434 (https://phabricator.wikimedia.org/T340465) (owner: 10Volans) [10:59:39] (03Merged) 10jenkins-bot: Release 2.3.2 [software/conftool] - 10https://gerrit.wikimedia.org/r/961356 (owner: 10Giuseppe Lavagetto) [11:04:10] (03PS1) 10Clément Goubert: mw-on-k8s: Fold canaries into global php-fpm idle alert [alerts] - 10https://gerrit.wikimedia.org/r/961362 (https://phabricator.wikimedia.org/T346422) [11:04:32] (03Abandoned) 10Jbond: WIP:puppet: Add support for puppetserver v7 [software/spicerack] - 10https://gerrit.wikimedia.org/r/936782 (owner: 10Jbond) [11:04:47] (03Abandoned) 10Jbond: puppet: Add versions method which will return the version of the agnts [software/spicerack] - 10https://gerrit.wikimedia.org/r/936781 (https://phabricator.wikimedia.org/T341496) (owner: 10Jbond) [11:07:39] RECOVERY - Check systemd state on puppetserver1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:08:14] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff 
saved to https://phabricator.wikimedia.org/P52685 and previous config saved to /var/cache/conftool/dbconfig/20230927-110813-arnaudb.json [11:10:07] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: refresh puppet namespace [puppet] - 10https://gerrit.wikimedia.org/r/961363 (https://phabricator.wikimedia.org/T347469) [11:12:17] (03PS2) 10Jbond: wikimedia.org: drop puppetboard-next [dns] - 10https://gerrit.wikimedia.org/r/961135 (https://phabricator.wikimedia.org/T347286) [11:12:26] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/961363 (https://phabricator.wikimedia.org/T347469) (owner: 10Arturo Borrero Gonzalez) [11:12:36] (03CR) 10Jbond: [C: 03+2] puppetboard-next: drop domain as services have now been migrated [puppet] - 10https://gerrit.wikimedia.org/r/961114 (https://phabricator.wikimedia.org/T347286) (owner: 10Jbond) [11:12:40] (03CR) 10Jbond: [C: 03+2] puppetboard-next: drop domain as services have now been migrated [puppet] - 10https://gerrit.wikimedia.org/r/961119 (https://phabricator.wikimedia.org/T347286) (owner: 10Jbond) [11:12:51] (03CR) 10Jbond: [C: 03+2] wikimedia.org: drop puppetboard-next [dns] - 10https://gerrit.wikimedia.org/r/961135 (https://phabricator.wikimedia.org/T347286) (owner: 10Jbond) [11:14:50] (03PS2) 10Arturo Borrero Gonzalez: cloudgw: refresh puppet namespace [puppet] - 10https://gerrit.wikimedia.org/r/961363 (https://phabricator.wikimedia.org/T347469) [11:14:53] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/961363 (https://phabricator.wikimedia.org/T347469) (owner: 10Arturo Borrero Gonzalez) [11:17:02] (03PS1) 10Muehlenhoff: cloudgw: Remove profile::openstack::base::cloudgw::firewall_profile [puppet] - 10https://gerrit.wikimedia.org/r/961365 (https://phabricator.wikimedia.org/T336497) [11:17:50] (03CR) 10Jbond: [C: 03+1] "lgtm, the old puppetdb's are in the insetup role now" [puppet] - 10https://gerrit.wikimedia.org/r/959754 
(https://phabricator.wikimedia.org/T329529) (owner: 10Muehlenhoff) [11:18:43] (03PS3) 10Arturo Borrero Gonzalez: cloudgw: refresh puppet namespace [puppet] - 10https://gerrit.wikimedia.org/r/961363 (https://phabricator.wikimedia.org/T347469) [11:19:09] (03CR) 10CI reject: [V: 04-1] cloudgw: refresh puppet namespace [puppet] - 10https://gerrit.wikimedia.org/r/961363 (https://phabricator.wikimedia.org/T347469) (owner: 10Arturo Borrero Gonzalez) [11:19:11] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/961363 (https://phabricator.wikimedia.org/T347469) (owner: 10Arturo Borrero Gonzalez) [11:20:35] PROBLEM - puppetboard-next.wikimedia.org requires authentication on puppetboard1003 is CRITICAL: HTTP CRITICAL: Status line output matched HTTP/1.1 302 - header ocation: https://idp.wikim... not found on https://puppetboard-next.wikimedia.org:443/ - 580 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [11:22:18] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST certificaterequests) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:23:20] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T343198)', diff saved to https://phabricator.wikimedia.org/P52686 and previous config saved to /var/cache/conftool/dbconfig/20230927-112320-arnaudb.json [11:23:22] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2182.codfw.wmnet with reason: Maintenance [11:23:29] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [11:23:36] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2182.codfw.wmnet with reason: Maintenance [11:23:43] !log 
arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2182 (T343198)', diff saved to https://phabricator.wikimedia.org/P52687 and previous config saved to /var/cache/conftool/dbconfig/20230927-112342-arnaudb.json
[11:24:40] (CR) Arturo Borrero Gonzalez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/961363 (https://phabricator.wikimedia.org/T347469) (owner: Arturo Borrero Gonzalez)
[11:24:46] (CR) Arturo Borrero Gonzalez: "recheck" [puppet] - https://gerrit.wikimedia.org/r/961363 (https://phabricator.wikimedia.org/T347469) (owner: Arturo Borrero Gonzalez)
[11:26:16] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[11:26:21] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2117.codfw.wmnet with reason: Maintenance
[11:26:35] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2117.codfw.wmnet with reason: Maintenance
[11:26:41] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2117 (T343198)', diff saved to https://phabricator.wikimedia.org/P52688 and previous config saved to /var/cache/conftool/dbconfig/20230927-112640-arnaudb.json
[11:26:44] (PS4) Arturo Borrero Gonzalez: cloudgw: refresh puppet namespace [puppet] - https://gerrit.wikimedia.org/r/961363 (https://phabricator.wikimedia.org/T347469)
[11:26:52] (CR) Arturo Borrero Gonzalez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/961363 (https://phabricator.wikimedia.org/T347469) (owner: Arturo Borrero Gonzalez)
[11:26:56] (PS2) Jbond: bacula: update bacula config to trust the pki and puppet ca's [puppet] - https://gerrit.wikimedia.org/r/937445 (https://phabricator.wikimedia.org/T347390)
[11:26:58] (PS1)
Majavah: wiki-replicas: Add CREATE USER and GRANT OPTION to labsdbadmin [puppet] - https://gerrit.wikimedia.org/r/961366 (https://phabricator.wikimedia.org/T347381)
[11:27:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST certificaterequests) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:27:38] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/961365 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[11:27:40] (CR) Arturo Borrero Gonzalez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/961363 (https://phabricator.wikimedia.org/T347469) (owner: Arturo Borrero Gonzalez)
[11:28:07] (CR) jenkins-bot: cloudgw: refresh puppet namespace [puppet] - https://gerrit.wikimedia.org/r/961363 (https://phabricator.wikimedia.org/T347469) (owner: Arturo Borrero Gonzalez)
[11:32:18] (CR) Jbond: "Please review" [puppet] - https://gerrit.wikimedia.org/r/937445 (https://phabricator.wikimedia.org/T347390) (owner: Jbond)
[11:42:36] (CR) Jcrespo: [C: +1] "This looks good to me, although I would like to be around to test when deployed, to make sure backups and recoveries work as usual. I thin" [puppet] - https://gerrit.wikimedia.org/r/937445 (https://phabricator.wikimedia.org/T347390) (owner: Jbond)
[11:43:19] (CR) Muehlenhoff: [C: +1] "Looks good" [software/bitu] - https://gerrit.wikimedia.org/r/961278 (owner: Slyngshede)
[11:43:36] PROBLEM - puppetboard-next.wikimedia.org requires authentication on puppetboard2003 is CRITICAL: HTTP CRITICAL: Status line output matched HTTP/1.1 302 - header ocation: https://idp.wikim...
not found on https://puppetboard-next.wikimedia.org:443/ - 580 bytes in 0.176 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[11:45:26] (CR) Muehlenhoff: puppetdb: Select the custom nginx provider with no additional modules (1 comment) [puppet] - https://gerrit.wikimedia.org/r/959754 (https://phabricator.wikimedia.org/T329529) (owner: Muehlenhoff)
[11:46:24] (PS1) Majavah: maintain-dbusers: just log to stdout [puppet] - https://gerrit.wikimedia.org/r/961368
[11:46:26] (PS1) Majavah: maintain-dbusers: don't bother querying PAWS user names [puppet] - https://gerrit.wikimedia.org/r/961369
[11:47:19] (PS2) Muehlenhoff: puppetdb: Select the custom nginx provider with no additional modules [puppet] - https://gerrit.wikimedia.org/r/959754 (https://phabricator.wikimedia.org/T329529)
[11:48:27] (PS4) Majavah: dnsrecursor: remove need to run labs-ip-alias-dump twice [puppet] - https://gerrit.wikimedia.org/r/960164
[11:48:57] (PS5) Majavah: dnsrecursor: remove need to run labs-ip-alias-dump twice [puppet] - https://gerrit.wikimedia.org/r/960164
[11:49:33] (CR) CI reject: [V: -1] maintain-dbusers: don't bother querying PAWS user names [puppet] - https://gerrit.wikimedia.org/r/961369 (owner: Majavah)
[11:49:52] (PS5) Arturo Borrero Gonzalez: cloudgw: refresh puppet namespace [puppet] - https://gerrit.wikimedia.org/r/961363 (https://phabricator.wikimedia.org/T347469)
[11:50:06] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/959754 (https://phabricator.wikimedia.org/T329529) (owner: Muehlenhoff)
[11:50:17] (CR) Arturo Borrero Gonzalez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/961363 (https://phabricator.wikimedia.org/T347469) (owner: Arturo Borrero Gonzalez)
[11:50:47] (CR) CI reject: [V: -1] cloudgw: refresh puppet namespace [puppet] - https://gerrit.wikimedia.org/r/961363
(https://phabricator.wikimedia.org/T347469) (owner: Arturo Borrero Gonzalez)
[11:50:50] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:50:50] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:51:58] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50568 bytes in 0.195 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:51:58] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.259 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:52:23] SRE, ops-eqiad, Data-Platform-SRE: Degraded RAID on dbstore1005 - https://phabricator.wikimedia.org/T347449 (BTullis) a:BTullis
[11:53:14] (PS2) Majavah: maintain-dbusers: don't bother querying PAWS user names [puppet] - https://gerrit.wikimedia.org/r/961369
[11:54:06] (CR) Fabfur: varnish: allow PURGE requests also from dedicated socket (1 comment) [puppet] - https://gerrit.wikimedia.org/r/960112 (https://phabricator.wikimedia.org/T347192) (owner: Fabfur)
[11:56:54] (CR) Vgutierrez: [C: +2] hiera: Test HAProxy bw limits per URL on cp4051 [puppet] - https://gerrit.wikimedia.org/r/928548 (https://phabricator.wikimedia.org/T317799) (owner: Vgutierrez)
[11:57:11] (CR) Vgutierrez: [C: +2] haproxy: Add support for filter bwlim-(in|out) [puppet] - https://gerrit.wikimedia.org/r/928541 (https://phabricator.wikimedia.org/T317799) (owner: Vgutierrez)
[11:57:42] (PS6) Arturo Borrero Gonzalez: cloudgw: refresh puppet namespace [puppet] - https://gerrit.wikimedia.org/r/961363 (https://phabricator.wikimedia.org/T347469)
[11:58:07] (CR) Arturo Borrero Gonzalez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/961363
(https://phabricator.wikimedia.org/T347469) (owner: Arturo Borrero Gonzalez)
[11:58:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[11:58:36] (CR) CI reject: [V: -1] cloudgw: refresh puppet namespace [puppet] - https://gerrit.wikimedia.org/r/961363 (https://phabricator.wikimedia.org/T347469) (owner: Arturo Borrero Gonzalez)
[12:00:29] (PS7) Arturo Borrero Gonzalez: cloudgw: refresh puppet namespace [puppet] - https://gerrit.wikimedia.org/r/961363 (https://phabricator.wikimedia.org/T347469)
[12:00:44] (CR) Arturo Borrero Gonzalez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/961363 (https://phabricator.wikimedia.org/T347469) (owner: Arturo Borrero Gonzalez)
[12:01:39] (CR) Arturo Borrero Gonzalez: [C: +1] Switch main cloudgw hosts to profile::firewall [puppet] - https://gerrit.wikimedia.org/r/961360 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[12:02:10] (PS1) Kevin Bazira: ml-services: fix recommendation-api-ng readiness probe failure [deployment-charts] - https://gerrit.wikimedia.org/r/960681 (https://phabricator.wikimedia.org/T347475)
[12:03:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[12:03:32] (CR) Slyngshede: [V: +2 C: +2] SSH Key mgmt: Allow multiple SSH keys to be stored in LDAP.
[software/bitu] - https://gerrit.wikimedia.org/r/961278 (owner: Slyngshede)
[12:05:19] (CR) Muehlenhoff: [C: +2] Switch main cloudgw hosts to profile::firewall [puppet] - https://gerrit.wikimedia.org/r/961360 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[12:05:25] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1011 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[12:05:27] PROBLEM - Check systemd state on kafka-jumbo1014 is CRITICAL: CRITICAL - degraded: The following units failed: kafka-mirror-main-eqiad_to_jumbo-eqiad@0.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:05:29] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1010 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[12:05:31] PROBLEM - Check systemd state on kafka-jumbo1012 is CRITICAL: CRITICAL - degraded: The following units failed: kafka-mirror-main-eqiad_to_jumbo-eqiad@0.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:05:39] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1014 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[12:05:40] SRE, ops-eqiad, Data-Platform-SRE: Degraded RAID on dbstore1005 - https://phabricator.wikimedia.org/T347449 (BTullis) I've verified the above and can confirm that the two slots 1 and 4 are no longer visible to `megacli` `
btullis@dbstore1005:~$ sudo megacli -PDList -a0|grep "Slot Number" Slot Number:...
[12:05:49] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1012 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[12:05:55] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1015 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[12:06:03] PROBLEM - Check systemd state on kafka-jumbo1013 is CRITICAL: CRITICAL - degraded: The following units failed: kafka-mirror-main-eqiad_to_jumbo-eqiad@0.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:06:03] PROBLEM - Check systemd state on kafka-jumbo1011 is CRITICAL: CRITICAL - degraded: The following units failed: kafka-mirror-main-eqiad_to_jumbo-eqiad@0.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:06:09] PROBLEM - Check systemd state on kafka-jumbo1010 is CRITICAL: CRITICAL - degraded: The following units failed: kafka-mirror-main-eqiad_to_jumbo-eqiad@0.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:07:46] (CR) Giuseppe Lavagetto: [C: +1] mw-on-k8s: Fold canaries into global php-fpm idle alert [alerts] - https://gerrit.wikimedia.org/r/961362 (https://phabricator.wikimedia.org/T346422) (owner: Clément Goubert)
[12:08:16] (CR) Clément Goubert: [C: +2] mw-on-k8s: Fold canaries into global php-fpm idle alert [alerts] - https://gerrit.wikimedia.org/r/961362 (https://phabricator.wikimedia.org/T346422) (owner: Clément Goubert)
[12:09:32] (Merged) jenkins-bot: mw-on-k8s: Fold canaries into global php-fpm idle alert
[alerts] - https://gerrit.wikimedia.org/r/961362 (https://phabricator.wikimedia.org/T346422) (owner: Clément Goubert)
[12:10:33] (PS1) Slyngshede: Navbar: Show SSH and attributes in menu. [software/bitu] - https://gerrit.wikimedia.org/r/961371
[12:11:11] (CR) Slyngshede: [V: +2 C: +2] Navbar: Show SSH and attributes in menu. [software/bitu] - https://gerrit.wikimedia.org/r/961371 (owner: Slyngshede)
[12:11:28] (CR) David Caro: [C: +1] "Nice" [puppet] - https://gerrit.wikimedia.org/r/961369 (owner: Majavah)
[12:11:42] (PS2) Muehlenhoff: cloudgw: Remove profile::openstack::base::cloudgw::firewall_profile [puppet] - https://gerrit.wikimedia.org/r/961365 (https://phabricator.wikimedia.org/T336497)
[12:13:40] (CR) Jbond: [C: +1] puppetdb: Select the custom nginx provider with no additional modules [puppet] - https://gerrit.wikimedia.org/r/959754 (https://phabricator.wikimedia.org/T329529) (owner: Muehlenhoff)
[12:14:22] (PS8) Arturo Borrero Gonzalez: cloudgw: refresh puppet namespace [puppet] - https://gerrit.wikimedia.org/r/961363 (https://phabricator.wikimedia.org/T347469)
[12:14:33] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/961365 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[12:14:46] (CR) Arturo Borrero Gonzalez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/961363 (https://phabricator.wikimedia.org/T347469) (owner: Arturo Borrero Gonzalez)
[12:16:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[12:16:11] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1013 is CRITICAL:
PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[12:17:20] (PS1) Vgutierrez: haproxy: Fix filter bwlim syntax [puppet] - https://gerrit.wikimedia.org/r/961373 (https://phabricator.wikimedia.org/T317799)
[12:18:02] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on 6 hosts with reason: Still running on 9 mirrormaker processes from main-eqiad to jumbo
[12:18:03] (CR) Arturo Borrero Gonzalez: [C: +1] cloudgw: Remove profile::openstack::base::cloudgw::firewall_profile [puppet] - https://gerrit.wikimedia.org/r/961365 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[12:18:20] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on 6 hosts with reason: Still running on 9 mirrormaker processes from main-eqiad to jumbo
[12:18:53] (CR) Vgutierrez: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43647/console" [puppet] - https://gerrit.wikimedia.org/r/961373 (https://phabricator.wikimedia.org/T317799) (owner: Vgutierrez)
[12:19:31] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:20:03] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:20:15] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:20:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s -
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:21:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[12:21:11] (PS2) Vgutierrez: haproxy: Fix filter bwlim syntax [puppet] - https://gerrit.wikimedia.org/r/961373 (https://phabricator.wikimedia.org/T317799)
[12:22:32] (CR) Muehlenhoff: [C: +2] cloudgw: Remove profile::openstack::base::cloudgw::firewall_profile [puppet] - https://gerrit.wikimedia.org/r/961365 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[12:23:16] (CR) Vgutierrez: "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43648/console" [puppet] - https://gerrit.wikimedia.org/r/961373 (https://phabricator.wikimedia.org/T317799) (owner: Vgutierrez)
[12:24:17] (PS3) Vgutierrez: haproxy: Fix filter bwlim syntax [puppet] - https://gerrit.wikimedia.org/r/961373 (https://phabricator.wikimedia.org/T317799)
[12:24:56] (PS9) Arturo Borrero Gonzalez: cloudgw: refresh puppet namespace [puppet] - https://gerrit.wikimedia.org/r/961363 (https://phabricator.wikimedia.org/T347469)
[12:25:22] (CR) Arturo Borrero Gonzalez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/961363 (https://phabricator.wikimedia.org/T347469) (owner: Arturo Borrero Gonzalez)
[12:25:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:25:40] (CR) Vgutierrez: [V: +1] "PCC SUCCESS (CORE_DIFF 1):
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43649/console" [puppet] - https://gerrit.wikimedia.org/r/961373 (https://phabricator.wikimedia.org/T317799) (owner: Vgutierrez)
[12:26:27] SRE, ops-eqiad, DC-Ops, Data-Platform-SRE: Replace RAID controller battery on an-worker1086 - https://phabricator.wikimedia.org/T347287 (BTullis) Just adding here, the server didn't boot successfully.
[12:26:54] (CR) Vgutierrez: [V: +1 C: +2] haproxy: Fix filter bwlim syntax [puppet] - https://gerrit.wikimedia.org/r/961373 (https://phabricator.wikimedia.org/T317799) (owner: Vgutierrez)
[12:29:33] (PS1) Slyngshede: C:idm::deployment ensure that rq service is enabled for git installs [puppet] - https://gerrit.wikimedia.org/r/961375
[12:32:45] RECOVERY - Check systemd state on kafka-jumbo1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:33:12] (CR) Muehlenhoff: [C: +1] "Looks good!"
[puppet] - https://gerrit.wikimedia.org/r/961363 (https://phabricator.wikimedia.org/T347469) (owner: Arturo Borrero Gonzalez)
[12:33:27] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1010 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[12:33:54] (CR) Arturo Borrero Gonzalez: [C: +2] cloudgw: refresh puppet namespace [puppet] - https://gerrit.wikimedia.org/r/961363 (https://phabricator.wikimedia.org/T347469) (owner: Arturo Borrero Gonzalez)
[12:36:04] (CR) Majavah: [C: +2] maintain-dbusers: just log to stdout [puppet] - https://gerrit.wikimedia.org/r/961368 (owner: Majavah)
[12:36:17] (CR) Majavah: [C: +2] maintain-dbusers: don't bother querying PAWS user names [puppet] - https://gerrit.wikimedia.org/r/961369 (owner: Majavah)
[12:37:31] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 9.455 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:38:29] RECOVERY - Check systemd state on kafka-jumbo1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:38:33] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1011 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[12:38:42] (PS2) Slyngshede: C:idm::deployment ensure that rq service is enabled for git installs [puppet] - https://gerrit.wikimedia.org/r/961375
[12:39:38] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 0:20:00 on dbstore1005.eqiad.wmnet with reason: Cold booting to see if it sees two missing disks
[12:39:41] !log
btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:20:00 on dbstore1005.eqiad.wmnet with reason: Cold booting to see if it sees two missing disks
[12:39:50] SRE, ops-eqiad, Data-Platform-SRE: Degraded RAID on dbstore1005 - https://phabricator.wikimedia.org/T347449 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=195bf9c0-3e24-446f-ba90-48d15ed5d628) set by btullis@cumin1001 for 0:20:00 on 1 host(s) and their services with reason: Cold bo...
[12:43:09] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:44:11] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.278 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:45:45] (PS1) Arturo Borrero Gonzalez: cloudgw: load nf_conntrack sysctl settings later [puppet] - https://gerrit.wikimedia.org/r/961376 (https://phabricator.wikimedia.org/T347469)
[12:45:49] (CR) Thiemo Kreuz (WMDE): New projects default to Vector 2022 (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/961260 (https://phabricator.wikimedia.org/T347444) (owner: Jdlrobson)
[12:48:21] SRE, ops-eqiad, Data-Platform-SRE: Degraded RAID on dbstore1005 - https://phabricator.wikimedia.org/T347449 (BTullis) I have cold booted it and the missing slots have come back. ` btullis@dbstore1005:~$ sudo megacli -PDList -a0|grep "Slot Number" Slot Number: 0 Slot Number: 1 Slot Number: 2 Slot Numb...
[12:50:31] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1012 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[12:50:41] RECOVERY - Check systemd state on kafka-jumbo1012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:51:14] SRE, ops-eqiad, Data-Platform-SRE: Degraded RAID on dbstore1005 - https://phabricator.wikimedia.org/T347449 (BTullis) Ok, it's rebuilding automatically. ` btullis@dbstore1005:~$ sudo megacli -PDList -aall|grep 'Firmware state' Firmware state: Online, Spun Up Firmware state: Rebuild Firmware state: On...
[12:53:45] (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:54:03] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1013 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[12:54:13] PROBLEM - Check systemd state on gitlab1004 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-config-backup-gitlab1003.wikimedia.org.timer,rsync-config-backup-gitlab2002.wikimedia.org.timer,rsync-data-backup-gitlab1003.wikimedia.org.timer,rsync-data-backup-gitlab2002.wikimedia.org.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:54:45] RECOVERY - Check systemd state on kafka-jumbo1013 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:57:25] (PS1)
Lucas Werkmeister (WMDE): Add label for Wikifunctions in “other projects” sidebar section [extensions/WikimediaMessages] (wmf/1.41.0-wmf.27) - https://gerrit.wikimedia.org/r/961217 (https://phabricator.wikimedia.org/T342857)
[12:57:27] (PS1) Muehlenhoff: cloudgw: Don't override conntrack settings from firewall profile [puppet] - https://gerrit.wikimedia.org/r/961377 (https://phabricator.wikimedia.org/T336497)
[12:58:07] (PS1) Elukey: modules: duplicate ingress:istio_1.0.2 to 1.0.3 [deployment-charts] - https://gerrit.wikimedia.org/r/961378
[12:58:09] (PS1) Elukey: modules: add CORS policy to Istio Ingress' virtual services [deployment-charts] - https://gerrit.wikimedia.org/r/961379
[12:58:25] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50568 bytes in 0.447 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:58:43] RECOVERY - Check systemd state on kafka-jumbo1014 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:58:50] SRE, MW-on-K8s, Traffic, serviceops, and 2 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (Jdforrester-WMF)
[12:59:00] (CR) CI reject: [V: -1] modules: add CORS policy to Istio Ingress' virtual services [deployment-charts] - https://gerrit.wikimedia.org/r/961379 (owner: Elukey)
[12:59:11] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1014 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[12:59:24] SRE, Abstract Wikipedia team, MW-on-K8s, Traffic, and 4 others: Migrate functions-orchestrator service to mw-api-int - https://phabricator.wikimedia.org/T347397 (Jdforrester-WMF) Open→In progress
[13:00:05] RoanKattouw, Lucas_WMDE,
Urbanecm, awight, TheresNoTime, and taavi: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230927T1300).
[13:00:05] houseofm and Lucas_WMDE: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:11] o/
[13:00:17] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:04:17] no HouseOfM yet, I’ll start the gate-and-submit for my backport then
[13:04:40] (CR) Lucas Werkmeister (WMDE): [C: +2] Add label for Wikifunctions in “other projects” sidebar section [extensions/WikimediaMessages] (wmf/1.41.0-wmf.27) - https://gerrit.wikimedia.org/r/961217 (https://phabricator.wikimedia.org/T342857) (owner: Lucas Werkmeister (WMDE))
[13:08:12] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/961377 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[13:11:27] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1015 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[13:11:56] (PS1) JMeybohm: wikifunctions: Allow orchestrator to connecto to mw-api-int pods [deployment-charts] - https://gerrit.wikimedia.org/r/961383 (https://phabricator.wikimedia.org/T347397)
[13:12:59] !log Deployment weekly train of analytics-refinery (+new source version)
[13:12:59] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex
args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[13:13:03] FTR, my backport will merge in ca. 6 minutes and will then take quite a while to sync (as it touches i18n)
[13:13:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:13:11] so if anyone wants to scap something else first, let me know ^^
[13:13:17] SRE, Infrastructure-Foundations: Drive host network config from Netbox, and move away from ifupdown - https://phabricator.wikimedia.org/T347411 (ayounsi) It's great to see momentum on this recurring pain point! To add to it, we could have the hosts boot up with only a v6 SLAAC IP (decommission the DHCP)...
[13:13:17] PROBLEM - Check systemd state on kafka-jumbo1001 is CRITICAL: CRITICAL - degraded: The following units failed: kafka-mirror-main-eqiad_to_jumbo-eqiad@0.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:13:36] SRE, Infrastructure-Foundations: Drive host network config from Netbox, and move away from ifupdown - https://phabricator.wikimedia.org/T347411 (cmooney) I had a little play with the redfish api and the PCIe info is available. Unfortunately Linux predictable interface names still seem about as [[ https:...
[13:14:21] RECOVERY - Check systemd state on gitlab1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:14:54] (CR) CDanis: [C: +1] haproxy: Fix filter bwlim syntax [puppet] - https://gerrit.wikimedia.org/r/961373 (https://phabricator.wikimedia.org/T317799) (owner: Vgutierrez)
[13:16:24] (CR) TrainBranchBot: [C: +2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/WikimediaMessages] (wmf/1.41.0-wmf.27) - https://gerrit.wikimedia.org/r/961217 (https://phabricator.wikimedia.org/T342857) (owner: Lucas Werkmeister (WMDE))
[13:17:04] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2017.codfw.wmnet with OS bullseye
[13:17:17] SRE, Cassandra, Data-Persistence, Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host restbase2017.codfw.wmnet with OS bullseye
[13:17:28] !log aqu@deploy2002 Started deploy [analytics/refinery@223be0f]: Regular analytics weekly train [analytics/refinery@223be0fb]
[13:17:31] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host lists1004.mgmt.eqiad.wmnet with reboot policy FORCED
[13:17:34] SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install lists1004.eqiad.wmnet - https://phabricator.wikimedia.org/T342374 (Jclark-ctr)
[13:18:13] (CR) JMeybohm: [C: +2] wikifunctions: Allow orchestrator to connecto to mw-api-int pods [deployment-charts] - https://gerrit.wikimedia.org/r/961383 (https://phabricator.wikimedia.org/T347397) (owner: JMeybohm)
[13:18:22] (Merged) jenkins-bot: Add label for Wikifunctions in “other projects” sidebar section [extensions/WikimediaMessages] (wmf/1.41.0-wmf.27) - https://gerrit.wikimedia.org/r/961217 (https://phabricator.wikimedia.org/T342857) (owner: Lucas Werkmeister
(WMDE)) [13:18:56] huh, why is scap backport’s git output showing a bunch of “new branch” for wmf.28 [13:18:56] (03Merged) 10jenkins-bot: wikifunctions: Allow orchestrator to connecto to mw-api-int pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/961383 (https://phabricator.wikimedia.org/T347397) (owner: 10JMeybohm) [13:19:03] isn’t it already deployed to group0? [13:19:29] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:961217|Add label for Wikifunctions in “other projects” sidebar section (T342857)]] [13:19:39] T342857: Add wikidata support for wikifunctionswiki - https://phabricator.wikimedia.org/T342857 [13:20:23] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host lists1004.mgmt.eqiad.wmnet with reboot policy FORCED [13:20:54] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host lists1004.mgmt.eqiad.wmnet with reboot policy FORCED [13:21:02] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: sync [13:21:06] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: sync [13:21:47] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifunctions: sync [13:21:51] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: sync [13:24:26] !log aqu@deploy2002 Finished deploy [analytics/refinery@223be0f]: Regular analytics weekly train [analytics/refinery@223be0fb] (duration: 06m 58s) [13:25:11] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10Eevans) [13:25:16] (03PS5) 10Herron: pyrra: add public dns entries [dns] - 10https://gerrit.wikimedia.org/r/961133 (https://phabricator.wikimedia.org/T302995) [13:25:24] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/wikifunctions: sync [13:26:02] !log aqu@deploy2002 Started deploy [analytics/refinery@223be0f] (thin): Regular analytics 
weekly train THIN [analytics/refinery@223be0fb] [13:26:12] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: sync [13:26:13] !log aqu@deploy2002 Finished deploy [analytics/refinery@223be0f] (thin): Regular analytics weekly train THIN [analytics/refinery@223be0fb] (duration: 00m 10s) [13:26:15] !log aqu@deploy2002 Started deploy [analytics/refinery@223be0f] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@223be0fb] [13:26:31] !log aqu@deploy2002 deploy aborted: Regular analytics weekly train TEST [analytics/refinery@223be0fb] (duration: 00m 16s) [13:26:32] 10SRE, 10Infrastructure-Foundations: Drive host network config from Netbox, and move away from ifupdown - https://phabricator.wikimedia.org/T347411 (10cmooney) >>! In T347411#9203208, @ayounsi wrote: > To add to it, we could have the hosts boot up with only a v6 SLAAC IP (decommission the DHCP) and then get th... [13:26:53] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host lists1004.mgmt.eqiad.wmnet with reboot policy FORCED [13:28:35] (03PS1) 10Jclark-ctr: add mossbe1003 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/961389 (https://phabricator.wikimedia.org/T342675) [13:29:32] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Add per-output queue monitoring for Juniper network devices - https://phabricator.wikimedia.org/T326322 (10ayounsi) To keep it somewhere for later, on Dell SONiC it should be on the `/openconfig-qos:qos/interfaces` path. Grouping it by sour... 
[13:29:41] (03CR) 10Jclark-ctr: [C: 03+2] add mossbe1003 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/961389 (https://phabricator.wikimedia.org/T342675) (owner: 10Jclark-ctr) [13:30:07] (still running build-and-push-container-images…) [13:30:18] 10SRE, 10Traffic, 10Patch-For-Review: Varnish mobile redirection misses some sites - https://phabricator.wikimedia.org/T344175 (10Fabfur) 05Open→03Stalled [13:31:18] 10SRE, 10Traffic, 10Patch-For-Review: Varnish mobile redirection misses some sites - https://phabricator.wikimedia.org/T344175 (10Fabfur) Hi, @nshahquinn if it's ok on your side I'll consider this as completed [13:32:03] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:32:42] 10SRE, 10Abstract Wikipedia team, 10MW-on-K8s, 10Traffic, and 4 others: Migrate functions-orchestrator service to mw-api-int - https://phabricator.wikimedia.org/T347397 (10JMeybohm) a:03JMeybohm It took me a while to figure this out, sorry. Due to wikifunctions having more strict firewall rules in genera... [13:33:24] !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase2017.codfw.wmnet with reason: host reimage [13:35:09] !log aqu@deploy2002 Started deploy [analytics/refinery@223be0f] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@223be0fb] [13:36:39] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase2017.codfw.wmnet with reason: host reimage [13:36:39] !log fnegri@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcumin2001.codfw.wmnet with OS bullseye [13:36:47] (03CR) 10Jforrester: "Aha! Nice find." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/961383 (https://phabricator.wikimedia.org/T347397) (owner: 10JMeybohm) [13:37:03] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:37:47] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Backport for [[gerrit:961217|Add label for Wikifunctions in “other projects” sidebar section (T342857)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [13:37:54] testing [13:38:02] T342857: Add wikidata support for wikifunctionswiki - https://phabricator.wikimedia.org/T342857 [13:38:10] yup, seems to work on the enwiki main page [13:38:12] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Continuing with sync [13:38:38] 10SRE, 10Traffic: Simplify maintenance of DNS/NTP hosts to reduce toil around reboots, reimages, and other work - https://phabricator.wikimedia.org/T347054 (10ssingh) Thanks for the feedback everyone! I was waiting so that we can get most of the comments in before replying; responses inline: >>! In T347054#91... [13:38:51] 10SRE, 10Infrastructure-Foundations, 10netops: Upgrade asw1-eqsin - https://phabricator.wikimedia.org/T332395 (10ayounsi) FYI, the `mgmt_junos` bug (also present on the fasw) might not be fixed by an upgrade, but maybe with the solution exposed in https://www.reddit.com/r/Juniper/comments/mvq8hf/comment/j7gd... 
[13:40:23] (03PS3) 10Slyngshede: C:idm::deployment ensure that rq service is enabled for git installs [puppet] - 10https://gerrit.wikimedia.org/r/961375 [13:40:24] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host lists1004.mgmt.eqiad.wmnet with reboot policy FORCED [13:40:48] (03CR) 10CI reject: [V: 04-1] C:idm::deployment ensure that rq service is enabled for git installs [puppet] - 10https://gerrit.wikimedia.org/r/961375 (owner: 10Slyngshede) [13:41:29] (03PS5) 10Herron: pyrra: add trafficserver mapping [puppet] - 10https://gerrit.wikimedia.org/r/961128 (https://phabricator.wikimedia.org/T302995) [13:41:51] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp1[098-113] - https://phabricator.wikimedia.org/T342159 (10VRiley-WMF) Updating naming as per requested. cp1100 - cp1115 [13:42:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 47.04% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:42:16] (03PS4) 10Slyngshede: C:idm::deployment ensure that rq service is enabled for git installs [puppet] - 10https://gerrit.wikimedia.org/r/961375 [13:43:22] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host lists1004.mgmt.eqiad.wmnet with reboot policy FORCED [13:43:43] !log aqu@deploy2002 Finished deploy [analytics/refinery@223be0f] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@223be0fb] (duration: 08m 33s) [13:44:22] !log Deployed refinery using scap, then deployed onto hdfs [13:44:24] (03PS6) 10Herron: pyrra: add trafficserver mapping [puppet] - 10https://gerrit.wikimedia.org/r/961128 (https://phabricator.wikimedia.org/T302995) [13:44:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:41] 
10SRE, 10Traffic: Simplify maintenance of DNS/NTP hosts to reduce toil around reboots, reimages, and other work - https://phabricator.wikimedia.org/T347054 (10ssingh) Looking at `10.3.0.0/24` [[ https://netbox.wikimedia.org/ipam/prefixes/97/ip-addresses/ | in Netbox ]]: I plan to reserve `10.3.0.8/32` for `nt... [13:46:04] 10SRE, 10MW-on-K8s, 10Traffic, 10Wikidata, and 2 others: Serve Wikidata traffic via Kubernetes - https://phabricator.wikimedia.org/T347493 (10Lucas_Werkmeister_WMDE) [13:46:35] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host lists1004.eqiad.wmnet with OS bullseye [13:46:42] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install lists1004.eqiad.wmnet - https://phabricator.wikimedia.org/T342374 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host lists1004.eqiad.wmnet with OS bullseye [13:47:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 45.39% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:47:29] (03CR) 10Jbond: [C: 03+2] bacula: update bacula config to trust the pki and puppet ca's [puppet] - 10https://gerrit.wikimedia.org/r/937445 (https://phabricator.wikimedia.org/T347390) (owner: 10Jbond) [13:48:04] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" 
[dns] - 10https://gerrit.wikimedia.org/r/957745 (https://phabricator.wikimedia.org/T307357) (owner: 10Arturo Borrero Gonzalez) [13:48:33] (KubernetesAPILatency) firing: (5) High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:49:26] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:961217|Add label for Wikifunctions in “other projects” sidebar section (T342857)]] (duration: 29m 56s) [13:49:33] T342857: Add wikidata support for wikifunctionswiki - https://phabricator.wikimedia.org/T342857 [13:50:38] still no HouseOfM, so I guess that config change will have to be rescheduled yet again :( [13:50:57] !log UTC afternoon backport+config window done [13:51:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:06] PROBLEM - DPKG on sretest1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [13:51:07] * Lucas_WMDE done deploying [13:51:30] 10SRE, 10Traffic: Simplify maintenance of DNS/NTP hosts to reduce toil around reboots, reimages, and other work - https://phabricator.wikimedia.org/T347054 (10ayounsi) Thanks, as this VIP won't be critical we can skip the static routes and only allocate `10.3.0.8/32`. The existing "Reserved for XXX (backup st... 
[13:51:39] !log fnegri@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcumin2001.codfw.wmnet with reason: host reimage [13:53:33] (KubernetesAPILatency) resolved: (4) High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:53:34] (03PS1) 10JMeybohm: admin_nd: Don't allow uncached api access from wikifunctions [deployment-charts] - 10https://gerrit.wikimedia.org/r/961394 (https://phabricator.wikimedia.org/T347397) [13:53:36] (03PS1) 10JMeybohm: admin_ng/wikikube: Allow pods to use DNS over TCP [deployment-charts] - 10https://gerrit.wikimedia.org/r/961395 [13:53:49] 10SRE, 10Traffic: Simplify maintenance of DNS/NTP hosts to reduce toil around reboots, reimages, and other work - https://phabricator.wikimedia.org/T347054 (10ssingh) >>! In T347054#9203404, @ayounsi wrote: > Thanks, as this VIP won't be critical we can skip the static routes and only allocate `10.3.0.8/32`. >... [13:53:59] 10SRE, 10Traffic: Allow purged to specify buffer length - https://phabricator.wikimedia.org/T346874 (10Fabfur) a:03Fabfur [13:54:20] !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcumin2001.codfw.wmnet with reason: host reimage [13:55:16] (03PS5) 10Slyngshede: C:idm::deployment ensure that rq service is enabled for git installs [puppet] - 10https://gerrit.wikimedia.org/r/961375 [13:55:18] 10SRE, 10Traffic: Simplify maintenance of DNS/NTP hosts to reduce toil around reboots, reimages, and other work - https://phabricator.wikimedia.org/T347054 (10ssingh) ` sukhe@re0.cr2-codfw# show routing-options static /* Anycast recdns - backup route */ route 10.3.0.0/30 { next-hop 208.80.153.77; read... 
[13:56:14] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43660/console" [puppet] - 10https://gerrit.wikimedia.org/r/961375 (owner: 10Slyngshede) [13:56:42] (03PS7) 10Herron: pyrra: add trafficserver mapping [puppet] - 10https://gerrit.wikimedia.org/r/961128 (https://phabricator.wikimedia.org/T302995) [13:57:21] (03PS1) 10Elukey: role::kafka::jumbo: exclude kafka-jumbo100[1-6] from Mirror Maker [puppet] - 10https://gerrit.wikimedia.org/r/961397 (https://phabricator.wikimedia.org/T347481) [13:57:57] 10SRE, 10Traffic: Add README and build-specific Dockerfile to purged - https://phabricator.wikimedia.org/T347021 (10Fabfur) 05Open→03Resolved a:03Fabfur Done with * https://gerrit.wikimedia.org/r/c/operations/software/purged/+/958477 * https://gerrit.wikimedia.org/r/c/operations/software/purged/+/959049 [13:57:59] (03CR) 10JMeybohm: [C: 03+2] admin_nd: Don't allow uncached api access from wikifunctions [deployment-charts] - 10https://gerrit.wikimedia.org/r/961394 (https://phabricator.wikimedia.org/T347397) (owner: 10JMeybohm) [13:58:01] (03CR) 10JMeybohm: [C: 03+2] admin_ng/wikikube: Allow pods to use DNS over TCP [deployment-charts] - 10https://gerrit.wikimedia.org/r/961395 (owner: 10JMeybohm) [13:58:28] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43662/console" [puppet] - 10https://gerrit.wikimedia.org/r/961375 (owner: 10Slyngshede) [13:58:52] 10SRE, 10Traffic: Allow purged to specify buffer length - https://phabricator.wikimedia.org/T346874 (10Fabfur) 05Open→03Stalled Waiting for actual deployment to definitely closing this task [13:58:55] jouncebot: nowandnext [13:58:56] For the next 0 hour(s) and 1 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230927T1300) [13:58:56] In 0 hour(s) and 1 minute(s): Wikifunction Services UTC Afternoon 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230927T1400) [13:58:58] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43663/console" [puppet] - 10https://gerrit.wikimedia.org/r/961397 (https://phabricator.wikimedia.org/T347481) (owner: 10Elukey) [13:59:19] (03CR) 10Elukey: role::kafka::jumbo: exclude kafka-jumbo100[1-6] from Mirror Maker [puppet] - 10https://gerrit.wikimedia.org/r/961397 (https://phabricator.wikimedia.org/T347481) (owner: 10Elukey) [14:00:05] Deploy window Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230927T1400) [14:00:18] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] C:idm::deployment ensure that rq service is enabled for git installs [puppet] - 10https://gerrit.wikimedia.org/r/961375 (owner: 10Slyngshede) [14:00:33] (03Merged) 10jenkins-bot: admin_nd: Don't allow uncached api access from wikifunctions [deployment-charts] - 10https://gerrit.wikimedia.org/r/961394 (https://phabricator.wikimedia.org/T347397) (owner: 10JMeybohm) [14:00:36] (03Merged) 10jenkins-bot: admin_ng/wikikube: Allow pods to use DNS over TCP [deployment-charts] - 10https://gerrit.wikimedia.org/r/961395 (owner: 10JMeybohm) [14:00:49] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase2017.codfw.wmnet with OS bullseye [14:00:51] 10SRE, 10Traffic: Simplify maintenance of DNS/NTP hosts to reduce toil around reboots, reimages, and other work - https://phabricator.wikimedia.org/T347054 (10ssingh) For `route 10.3.0.0/30` above, `next-hop 208.80.153.77` is actually the old authdns host, so we are clearly not keeping the static routes update... 
[14:01:05] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host restbase2017.codfw.wmnet with OS bullseye completed: - restbase20... [14:01:54] (03CR) 10Ilias Sarantopoulos: [C: 03+1] ml-services: fix recommendation-api-ng readiness probe failure [deployment-charts] - 10https://gerrit.wikimedia.org/r/960681 (https://phabricator.wikimedia.org/T347475) (owner: 10Kevin Bazira) [14:01:59] (03CR) 10Btullis: [C: 03+1] "Looks good, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/961397 (https://phabricator.wikimedia.org/T347481) (owner: 10Elukey) [14:04:31] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "this is not enough! the sysctl file is still deployed." [puppet] - 10https://gerrit.wikimedia.org/r/961377 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [14:05:20] PROBLEM - Check systemd state on idm-test1001 is CRITICAL: CRITICAL - degraded: The following units failed: rq-bitu.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:05:27] (03CR) 10Btullis: [C: 03+2] role::kafka::jumbo: exclude kafka-jumbo100[1-6] from Mirror Maker [puppet] - 10https://gerrit.wikimedia.org/r/961397 (https://phabricator.wikimedia.org/T347481) (owner: 10Elukey) [14:05:40] (03PS6) 10Arturo Borrero Gonzalez: wikimediacloud.org: decom ns-recursor0.openstack.eqiad1.wikimediacloud.org [dns] - 10https://gerrit.wikimedia.org/r/957745 (https://phabricator.wikimedia.org/T307357) [14:06:31] <_joe_> !log updating conftool everywhere [14:06:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:01] !log kamila@cumin1001 START - Cookbook sre.discovery.datacenter pool all services in eqiad: Datacenter Switchover: eqiad repool - T345263 [14:08:08] T345263: September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345263 
[14:08:10] 10SRE, 10Data-Persistence, 10Performance-Team, 10serviceops, 10Datacenter-Switchover: September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345263 (10ops-monitoring-bot) kamila@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all services in eqiad: Datacenter Switchover: e... [14:08:29] (03PS1) 10Slyngshede: C:idm::deployment Use bitu cmd for systemd service [puppet] - 10https://gerrit.wikimedia.org/r/961398 [14:08:42] !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcumin2001.codfw.wmnet with OS bullseye [14:10:33] (03CR) 10Slyngshede: [C: 03+2] C:idm::deployment Use bitu cmd for systemd service [puppet] - 10https://gerrit.wikimedia.org/r/961398 (owner: 10Slyngshede) [14:10:38] !log eevans@cumin1001 START - Cookbook sre.hosts.remove-downtime for restbase2017.codfw.wmnet [14:10:38] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for restbase2017.codfw.wmnet [14:11:36] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (10MatthewVernon) As an example of this hardware being configured as JBOD - T326352 [14:12:05] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10Eevans) [14:12:42] RECOVERY - Check systemd state on idm-test1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:12:52] 10SRE, 10Traffic: Simplify maintenance of DNS/NTP hosts to reduce toil around reboots, reimages, and other work - https://phabricator.wikimedia.org/T347054 (10ayounsi) Thanks, I opened {T347494} to get rid of them. You can use 10.3.0.2/32 for the NTP VIP. 
[14:13:52] !log btullis@cumin1001 START - Cookbook sre.hosts.remove-downtime for 15 hosts [14:13:57] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 15 hosts [14:15:49] (03CR) 10Vgutierrez: varnish: allow PURGE requests also from dedicated socket (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/960112 (https://phabricator.wikimedia.org/T347192) (owner: 10Fabfur) [14:16:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10User-aborrero, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudcontrol100[8-10]-dev cloudnet100[7-8]-dev - https://phabricator.wikimedia.org/T342455 (10aborrero) [14:16:30] (03PS1) 10Muehlenhoff: firewall: Also move the sysctl under the manage_nf_conntrack conditional [puppet] - 10https://gerrit.wikimedia.org/r/961400 (https://phabricator.wikimedia.org/T336497) [14:16:39] (03PS2) 10Slyngshede: Enable SSH key management for all users. [software/bitu] - 10https://gerrit.wikimedia.org/r/959211 [14:18:50] 10SRE, 10observability: Icinga contact for dr0ptp4kt - https://phabricator.wikimedia.org/T346688 (10lmata) hi @dr0ptp4kt Can you submit a patch with this info? we can happily review it when ready. cc/ @herron will be your point of contact. [14:19:13] 10SRE, 10observability, 10SRE Observability (FY2023/2024-Q2): Icinga contact for dr0ptp4kt - https://phabricator.wikimedia.org/T346688 (10lmata) [14:20:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10User-aborrero, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudcontrol100[8-10]-dev cloudnet100[7-8]-dev - https://phabricator.wikimedia.org/T342455 (10dcaro) Wouldn't in make sense to start on 1001-dev? (otherwise it seems that 1007-dev should exist, or will... 
[14:21:34] RECOVERY - DPKG on sretest1001 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [14:22:31] !log Repooling eqiad services in progress - T345263 [14:22:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:38] T345263: September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345263 [14:23:23] (03PS30) 10Fabfur: varnish: allow PURGE requests also from dedicated socket [puppet] - 10https://gerrit.wikimedia.org/r/960112 (https://phabricator.wikimedia.org/T347192) [14:23:37] (03CR) 10Fabfur: varnish: allow PURGE requests also from dedicated socket (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/960112 (https://phabricator.wikimedia.org/T347192) (owner: 10Fabfur) [14:25:39] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/961400 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [14:29:14] !log kamila@cumin1001 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) pool all services in eqiad: Datacenter Switchover: eqiad repool - T345263 [14:29:22] T345263: September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345263 [14:29:23] 10SRE, 10Data-Persistence, 10Performance-Team, 10serviceops, 10Datacenter-Switchover: September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345263 (10ops-monitoring-bot) kamila@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all services in eqiad: Datacenter Switchover: e... 
[14:29:27] PROBLEM - Check systemd state on gitlab1003 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:30:59] !log Added Arnaud to pwstore and removed Jeff (frtech SREs no longer need/use it) [14:31:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:38] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['restbase2018.codfw.wmnet'] [14:33:43] PROBLEM - Host restbase2018 is DOWN: PING CRITICAL - Packet loss = 100% [14:33:58] (03CR) 10Vgutierrez: [C: 03+1] "great job 😊" [puppet] - 10https://gerrit.wikimedia.org/r/960112 (https://phabricator.wikimedia.org/T347192) (owner: 10Fabfur) [14:36:13] (03CR) 10Muehlenhoff: cloudgw: Don't override conntrack settings from firewall profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/961377 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [14:38:50] !log eevans@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['restbase2018.codfw.wmnet'] [14:38:55] RECOVERY - Host restbase2018 is UP: PING WARNING - Packet loss = 75%, RTA = 73.27 ms [14:40:19] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2018.codfw.wmnet with OS bullseye [14:40:27] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host restbase2018.codfw.wmnet with OS bullseye [14:43:35] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host lists1004.eqiad.wmnet with OS bullseye [14:43:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install lists1004.eqiad.wmnet - https://phabricator.wikimedia.org/T342374 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host lists1004.eqiad.wmnet with OS bullseye executed with errors: - lists1004... [14:44:39] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/961400 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [14:46:37] RECOVERY - Check systemd state on gitlab1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:49:40] 10SRE, 10Traffic: Simplify maintenance of DNS/NTP hosts to reduce toil around reboots, reimages, and other work - https://phabricator.wikimedia.org/T347054 (10cmooney) Is there any reason we can't announce the "unicast" IPs in BGP too? I can't really see a good reason that any static routes are needed here. [14:50:11] (03PS1) 10FNegri: Add new cloud restricted bastion [puppet] - 10https://gerrit.wikimedia.org/r/961401 (https://phabricator.wikimedia.org/T340241) [14:50:13] (03PS1) 10FNegri: Remove old cloud restricted bastion [puppet] - 10https://gerrit.wikimedia.org/r/961402 (https://phabricator.wikimedia.org/T340241) [14:51:18] (03CR) 10Fabfur: [C: 03+2] varnish: allow PURGE requests also from dedicated socket (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/960112 (https://phabricator.wikimedia.org/T347192) (owner: 10Fabfur) [14:53:29] RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:54:55] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:55:24] 10SRE, 10Traffic: Simplify maintenance of DNS/NTP hosts to reduce toil around reboots, reimages, and other work - https://phabricator.wikimedia.org/T347054 (10ssingh) >>! 
In T347054#9203737, @cmooney wrote: > Is there any reason we can't announce the "unicast" IPs in BGP too? I can't really see a good reason... [14:55:47] !log xcollazo@deploy2002 Started deploy [airflow-dags/analytics@49e3804]: Deploy latest Airflow DAGs to analytics instance [14:56:29] !log xcollazo@deploy2002 Finished deploy [airflow-dags/analytics@49e3804]: Deploy latest Airflow DAGs to analytics instance (duration: 00m 42s) [14:57:08] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [14:58:05] !log fnegri@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcumin1001.eqiad.wmnet with OS bullseye [14:58:18] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST pods) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:58:37] !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase2018.codfw.wmnet with reason: host reimage [14:59:17] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. [14:59:30] !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [14:59:54] !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [15:00:02] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [15:01:02] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [15:01:40] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [15:02:24] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[15:02:30] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase2018.codfw.wmnet with reason: host reimage [15:03:18] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST pods) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:04:58] !log sukhe@cumin2002 START - Cookbook sre.dns.netbox [15:05:00] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (10JMeybohm) [15:05:31] 10SRE, 10Abstract Wikipedia team, 10MW-on-K8s, 10Traffic, and 4 others: Migrate functions-orchestrator service to mw-api-int - https://phabricator.wikimedia.org/T347397 (10JMeybohm) 05In progress→03Resolved Direct access to mw-api is forbidden now. wikifunctions still working [15:06:38] !log fnegri@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcumin1001.eqiad.wmnet with reason: host reimage [15:07:04] !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add ntp.anycast.wmnet - sukhe@cumin2002" [15:07:21] (03CR) 10Hashar: "I have found another way which is to use a hiera value that is passed to the various profiles:" [puppet] - 10https://gerrit.wikimedia.org/r/908604 (https://phabricator.wikimedia.org/T333143) (owner: 10Hashar) [15:07:36] (03PS5) 10Hashar: gerrit: remove duplicate $gerrit_site definition [puppet] - 10https://gerrit.wikimedia.org/r/908604 (https://phabricator.wikimedia.org/T333143) [15:07:49] !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add ntp.anycast.wmnet - sukhe@cumin2002" [15:07:49] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) 
[15:08:18] (KubernetesAPILatency) firing: (4) High Kubernetes API latency (LIST certificaterequests) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:09:10] !log gmodena@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [15:09:13] !log gmodena@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply [15:09:18] !log sukhe@cumin2002 START - Cookbook sre.dns.wipe-cache ntp.anycast.wmnet on all recursors [15:09:21] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ntp.anycast.wmnet on all recursors [15:09:46] !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcumin1001.eqiad.wmnet with reason: host reimage [15:10:03] (03PS2) 10Muehlenhoff: Make the dbconfig settings conditional on the hdb backend [puppet] - 10https://gerrit.wikimedia.org/r/961352 (https://phabricator.wikimedia.org/T292942) [15:10:39] (03CR) 10Hashar: "The spec for `profile::gerrit::migration` fails to find `profile::gerrit::gerrit_site`." [puppet] - 10https://gerrit.wikimedia.org/r/908604 (https://phabricator.wikimedia.org/T333143) (owner: 10Hashar) [15:10:52] (03CR) 10CI reject: [V: 04-1] gerrit: remove duplicate $gerrit_site definition [puppet] - 10https://gerrit.wikimedia.org/r/908604 (https://phabricator.wikimedia.org/T333143) (owner: 10Hashar) [15:12:49] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:12:50] (03CR) 10Kevin Bazira: [C: 03+2] "Thanks for the review." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/960681 (https://phabricator.wikimedia.org/T347475) (owner: 10Kevin Bazira) [15:13:18] (KubernetesAPILatency) resolved: (4) High Kubernetes API latency (LIST certificaterequests) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:13:41] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.280 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:13:41] (03Merged) 10jenkins-bot: ml-services: fix recommendation-api-ng readiness probe failure [deployment-charts] - 10https://gerrit.wikimedia.org/r/960681 (https://phabricator.wikimedia.org/T347475) (owner: 10Kevin Bazira) [15:14:05] (03CR) 10Hashar: "With `PUPPET_DEBUG=1`:" [puppet] - 10https://gerrit.wikimedia.org/r/908604 (https://phabricator.wikimedia.org/T333143) (owner: 10Hashar) [15:14:54] (03PS2) 10Jgiannelos: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/960671 (owner: 10PipelineBot) [15:17:01] (03CR) 10Jgiannelos: [C: 03+2] wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/960671 (owner: 10PipelineBot) [15:17:04] (03CR) 10Hashar: gerrit: remove duplicate $gerrit_site definition (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/908604 (https://phabricator.wikimedia.org/T333143) (owner: 10Hashar) [15:17:25] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [15:17:47] (03Merged) 10jenkins-bot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/960671 (owner: 10PipelineBot) [15:19:15] (03CR) 10Muehlenhoff: On Bookworm ship ppolicy.schema via Puppet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/961066 (https://phabricator.wikimedia.org/T331699) (owner: 10Muehlenhoff) [15:19:59] 10SRE, 10ops-codfw, 10User-aborrero, 10cloud-services-team (Hardware): cloudswitch: codfw: figure out procurement - https://phabricator.wikimedia.org/T346724 (10cmooney) >>! In T346724#9182178, @cmooney wrote: > I've spec'd the 'Advanced 2' license here. That supports EVPN/VXLAN, which at this stage would... [15:20:41] (03PS1) 10Jbond: puppet::expose_certs: automatically include the ca chain on puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/961406 (https://phabricator.wikimedia.org/T340741) [15:21:31] (03PS6) 10Hashar: gerrit: remove duplicate $gerrit_site definition [puppet] - 10https://gerrit.wikimedia.org/r/908604 (https://phabricator.wikimedia.org/T333143) [15:23:06] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/961405 (https://phabricator.wikimedia.org/T346764) (owner: 10Brouberol) [15:23:10] !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcumin1001.eqiad.wmnet with OS bullseye [15:24:00] !log kevinbazira@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . 
[15:24:30] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/961401 (https://phabricator.wikimedia.org/T340241) (owner: 10FNegri) [15:25:26] (03PS3) 10Brouberol: [kafka] Install kafka-kit on bullseye/bookworm brokers [puppet] - 10https://gerrit.wikimedia.org/r/961405 (https://phabricator.wikimedia.org/T346764) [15:25:58] jouncebot: nowandnext [15:25:58] No deployments scheduled for the next 1 hour(s) and 34 minute(s) [15:25:58] In 1 hour(s) and 34 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230927T1700) [15:26:14] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/961400 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [15:26:16] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [15:28:43] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase2018.codfw.wmnet with OS bullseye [15:28:51] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host restbase2018.codfw.wmnet with OS bullseye completed: - restbase20... 
[15:29:19] !log dancy@deploy2002 Installing scap version "4.63.0" for 598 hosts [15:30:19] !log dancy@deploy2002 Installation of scap version "4.63.0" completed for 598 hosts [15:30:50] (03CR) 10Jbond: [C: 03+1] "LGTM see comment for possible improvement" [puppet] - 10https://gerrit.wikimedia.org/r/908604 (https://phabricator.wikimedia.org/T333143) (owner: 10Hashar) [15:31:41] (03CR) 10Btullis: [kafka] Install kafka-kit on bullseye/bookworm brokers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/961405 (https://phabricator.wikimedia.org/T346764) (owner: 10Brouberol) [15:33:15] (03CR) 10Btullis: [C: 03+1] "Looks good." [puppet] - 10https://gerrit.wikimedia.org/r/959222 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [15:33:26] (03PS2) 10Jbond: puppet::expose_certs: automatically include the ca chain on puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/961406 (https://phabricator.wikimedia.org/T340741) [15:33:36] (03PS4) 10Brouberol: [kafka] Install kafka-kit on bullseye/bookworm brokers [puppet] - 10https://gerrit.wikimedia.org/r/961405 (https://phabricator.wikimedia.org/T346764) [15:33:42] (03CR) 10Brouberol: [kafka] Install kafka-kit on bullseye/bookworm brokers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/961405 (https://phabricator.wikimedia.org/T346764) (owner: 10Brouberol) [15:35:24] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43670/console" [puppet] - 10https://gerrit.wikimedia.org/r/961406 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [15:35:47] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/961405 (https://phabricator.wikimedia.org/T346764) (owner: 10Brouberol) [15:35:54] (03CR) 10Stevemunene: [V: 03+1 C: 03+2] airflow-wmde: Remove statsd analytics-wmde user [puppet] - 10https://gerrit.wikimedia.org/r/959222 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) 
[15:39:20] (03CR) 10Btullis: [C: 03+1] [kafka] Install kafka-kit on bullseye/bookworm brokers [puppet] - 10https://gerrit.wikimedia.org/r/961405 (https://phabricator.wikimedia.org/T346764) (owner: 10Brouberol) [15:40:44] (03PS2) 10Elukey: modules: add CORS policy to Istio Ingress' virtual services [deployment-charts] - 10https://gerrit.wikimedia.org/r/961379 [15:41:03] (KubernetesAPILatency) firing: (4) High Kubernetes API latency (LIST configurations) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:41:08] (03PS3) 10Jbond: puppet::expose_certs: automatically include the ca chain on puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/961406 (https://phabricator.wikimedia.org/T340741) [15:41:31] (03CR) 10CI reject: [V: 04-1] modules: add CORS policy to Istio Ingress' virtual services [deployment-charts] - 10https://gerrit.wikimedia.org/r/961379 (owner: 10Elukey) [15:41:33] !log eevans@cumin1001 START - Cookbook sre.hosts.remove-downtime for restbase2018.codfw.wmnet [15:41:34] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for restbase2018.codfw.wmnet [15:42:02] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10Eevans) [15:42:44] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43671/console" [puppet] - 10https://gerrit.wikimedia.org/r/961406 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [15:42:51] (03CR) 10Jbond: [C: 03+1] Remove old cloud restricted bastion [puppet] - 10https://gerrit.wikimedia.org/r/961402 (https://phabricator.wikimedia.org/T340241) (owner: 10FNegri) [15:43:32] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host 
restbase2023.codfw.wmnet with OS bullseye [15:43:38] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host restbase2023.codfw.wmnet with OS bullseye [15:44:17] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [15:46:03] (KubernetesAPILatency) resolved: (4) High Kubernetes API latency (LIST configurations) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:48:05] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase1016.eqiad.wmnet [15:49:03] !log eevans@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts restbase1016.eqiad.wmnet [15:49:20] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase1017.eqiad.wmnet [15:49:30] !log eevans@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts restbase1017.eqiad.wmnet [15:49:43] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase1018.eqiad.wmnet [15:49:54] !log eevans@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts restbase1018.eqiad.wmnet [15:50:44] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase1019.eqiad.wmnet [15:50:48] !log eevans@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts restbase1019.eqiad.wmnet [15:51:09] !log 
eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase1022.eqiad.wmnet [15:51:14] !log eevans@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts restbase1022.eqiad.wmnet [15:51:19] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase1025.eqiad.wmnet [15:51:24] !log eevans@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts restbase1025.eqiad.wmnet [15:52:00] (03CR) 10Brouberol: [C: 03+2] [kafka] Install kafka-kit on bullseye/bookworm brokers [puppet] - 10https://gerrit.wikimedia.org/r/961405 (https://phabricator.wikimedia.org/T346764) (owner: 10Brouberol) [15:52:17] (03PS1) 10Kamila Součková: Revert "traffic: Depool eqiad from user traffic for switchover" [dns] - 10https://gerrit.wikimedia.org/r/961221 [15:52:53] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10Eevans) [15:53:11] (03PS2) 10Kamila Součková: Revert "traffic: Depool eqiad from user traffic for switchover" [dns] - 10https://gerrit.wikimedia.org/r/961221 [15:53:29] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase1031.eqiad.wmnet [15:53:33] !log eevans@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts restbase1031.eqiad.wmnet [15:53:40] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase1031.eqiad.wmnet [15:53:47] !log eevans@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts restbase1031.eqiad.wmnet [15:54:06] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase1031.eqiad.wmnet [15:54:11] !log eevans@cumin1001 END (FAIL) - Cookbook 
sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts restbase1031.eqiad.wmnet [15:54:22] (03CR) 10Vgutierrez: [C: 03+1] aptrepo: Add Bookworm HAProxy third party repos [puppet] - 10https://gerrit.wikimedia.org/r/957766 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [15:55:24] !log reedy@deploy2002 Started scap: (no justification provided) [15:55:48] (03CR) 10FNegri: [C: 03+2] Add new cloud restricted bastion [puppet] - 10https://gerrit.wikimedia.org/r/961401 (https://phabricator.wikimedia.org/T340241) (owner: 10FNegri) [15:56:07] (03CR) 10BCornwall: [C: 03+2] aptrepo: Add Bookworm HAProxy third party repos [puppet] - 10https://gerrit.wikimedia.org/r/957766 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [15:57:00] (03CR) 10Kamila Součková: [C: 03+2] Revert "traffic: Depool eqiad from user traffic for switchover" [dns] - 10https://gerrit.wikimedia.org/r/961221 (owner: 10Kamila Součková) [15:59:00] (03PS2) 10Jdlrobson: New projects default to Vector 2022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961260 (https://phabricator.wikimedia.org/T347444) [15:59:05] (03CR) 10Jdlrobson: New projects default to Vector 2022 (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961260 (https://phabricator.wikimedia.org/T347444) (owner: 10Jdlrobson) [16:00:45] !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase2023.codfw.wmnet with reason: host reimage [16:02:03] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:02:47] !log reedy@deploy2002 Finished scap: (no justification provided) (duration: 07m 22s) [16:03:58] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
restbase2023.codfw.wmnet with reason: host reimage [16:07:04] (KubernetesAPILatency) resolved: (9) High Kubernetes API latency (LIST blockaffinities) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:09:27] !log Pooled back eqiad for traffic after the DC switchover (T345263) [16:09:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:36] T345263: September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345263 [16:10:13] (03PS1) 10Btullis: Change the owner:group of the wikidatawiki entities link [puppet] - 10https://gerrit.wikimedia.org/r/961412 (https://phabricator.wikimedia.org/T346165) [16:11:10] (03PS2) 10Btullis: Change the owner:group of the wikidatawiki entities link [puppet] - 10https://gerrit.wikimedia.org/r/961412 (https://phabricator.wikimedia.org/T346165) [16:11:24] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/961412 (https://phabricator.wikimedia.org/T346165) (owner: 10Btullis) [16:12:11] (03PS1) 10Jforrester: mw-on-k8s: Serve 100% of wikifunctions.org traffic [puppet] - 10https://gerrit.wikimedia.org/r/961413 (https://phabricator.wikimedia.org/T347509) [16:12:18] (03CR) 10Andrew Bogott: [C: 04-1] "This needs a default domain, otherwise specifying a project like 'O{project:tools}' gets us "Caught BadRequest exception: Expecting to fin" [software/cumin] - 10https://gerrit.wikimedia.org/r/868814 (https://phabricator.wikimedia.org/T321349) (owner: 10Andrew Bogott) [16:14:47] 10SRE, 10MW-on-K8s, 10Traffic, 10Wikidata, and 2 others: Serve Wikidata traffic via Kubernetes - https://phabricator.wikimedia.org/T347493 (10Jdforrester-WMF) Possibly now solved by https://gerrit.wikimedia.org/r/c/operations/puppet/+/961351 ? 
[16:16:27] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1019.eqiad.wmnet with OS bullseye [16:16:43] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host restbase1019.eqiad.wmnet with OS bullseye [16:17:34] 10SRE, 10ops-codfw, 10User-aborrero, 10cloud-services-team (Hardware): cloudswitch: codfw: figure out procurement - https://phabricator.wikimedia.org/T346724 (10cmooney) I should also add that unless we are provisioning new racks, any rack allocated for this will already have a switch in it. So we should... [16:20:34] (03CR) 10Btullis: "Well, this change looks like it should work, but I wonder if the other option would simply be to remove the 'entities` symlink from dumpsd" [puppet] - 10https://gerrit.wikimedia.org/r/961412 (https://phabricator.wikimedia.org/T346165) (owner: 10Btullis) [16:24:34] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T343198)', diff saved to https://phabricator.wikimedia.org/P52692 and previous config saved to /var/cache/conftool/dbconfig/20230927-162433-arnaudb.json [16:24:34] !log dduvall@deploy2002 Started scap: (no justification provided) [16:24:40] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [16:27:25] (03CR) 10Jbond: [C: 03+2] rsyslog: switch the endpoints to use the PKI system [puppet] - 10https://gerrit.wikimedia.org/r/956481 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [16:28:38] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase2023.codfw.wmnet with OS bullseye [16:28:45] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host restbase2023.codfw.wmnet with OS bullseye completed: - restbase20... [16:28:51] (03CR) 10Jbond: [C: 03+2] rsyslog: update to use pki certificates [puppet] - 10https://gerrit.wikimedia.org/r/936763 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [16:29:04] !log eevans@cumin1001 START - Cookbook sre.hosts.remove-downtime for restbase2023.codfw.wmnet [16:29:04] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for restbase2023.codfw.wmnet [16:29:36] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10Eevans) [16:31:39] !log eevans@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host restbase1019.eqiad.wmnet with OS bullseye [16:31:46] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host restbase1019.eqiad.wmnet with OS bullseye executed with errors: -... [16:32:06] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['restbase1019.eqiad.wmnet'] [16:34:11] (03PS1) 10Jbond: Revert "rsyslog: update to use pki certificates" [puppet] - 10https://gerrit.wikimedia.org/r/961224 [16:34:21] (03CR) 10Jbond: "Sep 27 16:34:09 centrallog2002 rsyslogd[285425]: invalid cert info: peer provided 1 certificate(s). 
Certificate 1 info: certificate valid " [puppet] - 10https://gerrit.wikimedia.org/r/961224 (owner: 10Jbond) [16:35:10] (03PS8) 10Stevemunene: airflow-wmde: configure wmde airflow instance [puppet] - 10https://gerrit.wikimedia.org/r/940938 (https://phabricator.wikimedia.org/T340648) [16:36:52] (03PS7) 10Stevemunene: airflow-wmde: Create scap deployment source for wmde [puppet] - 10https://gerrit.wikimedia.org/r/940939 (https://phabricator.wikimedia.org/T340648) [16:39:16] !log eevans@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['restbase1019.eqiad.wmnet'] [16:39:40] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P52693 and previous config saved to /var/cache/conftool/dbconfig/20230927-163940-arnaudb.json [16:39:50] (03PS8) 10Stevemunene: airflow-wmde: Create scap deployment source for wmde [puppet] - 10https://gerrit.wikimedia.org/r/940939 (https://phabricator.wikimedia.org/T340648) [16:42:49] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1019.eqiad.wmnet with OS bullseye [16:42:57] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host restbase1019.eqiad.wmnet with OS bullseye [16:44:18] (03CR) 10Marostegui: [C: 03+2] wiki-replicas: Add CREATE USER and GRANT OPTION to labsdbadmin [puppet] - 10https://gerrit.wikimedia.org/r/961366 (https://phabricator.wikimedia.org/T347381) (owner: 10Majavah) [16:44:44] (03CR) 10Stevemunene: airflow-wmde: Create scap deployment source for wmde (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/940939 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [16:45:14] (03CR) 10Stevemunene: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/949019 
(https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [16:48:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:52:50] !log dduvall@deploy2002 Finished scap: (no justification provided) (duration: 28m 15s) [16:53:34] (KubernetesAPILatency) resolved: (5) High Kubernetes API latency (PATCH events) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:53:45] (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:54:47] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P52694 and previous config saved to /var/cache/conftool/dbconfig/20230927-165446-arnaudb.json [16:55:39] !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase1019.eqiad.wmnet with reason: host reimage [16:56:12] (03PS1) 10Jelto: gitlab failover: tweak settings for gitlab-backup command [cookbooks] - 10https://gerrit.wikimedia.org/r/961418 (https://phabricator.wikimedia.org/T345590) [16:58:04] 10SRE, 10Data-Persistence-Backup, 10Infrastructure-Foundations, 10Puppet (Puppet 7.0): Migrate bacula to pki.discovery.wmnet - https://phabricator.wikimedia.org/T341664 (10jcrespo) I double checked an so far, backup, recovery and restores with the puppet master key still work as expected :-D. 
[16:58:46] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase1019.eqiad.wmnet with reason: host reimage [16:58:52] (03CR) 10CI reject: [V: 04-1] gitlab failover: tweak settings for gitlab-backup command [cookbooks] - 10https://gerrit.wikimedia.org/r/961418 (https://phabricator.wikimedia.org/T345590) (owner: 10Jelto) [17:00:06] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230927T1700) [17:00:27] (03PS2) 10Jelto: gitlab failover: tweak settings for gitlab-backup command [cookbooks] - 10https://gerrit.wikimedia.org/r/961418 (https://phabricator.wikimedia.org/T345590) [17:02:26] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host ncredir6002.drmrs.wmnet with OS bookworm [17:02:36] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host ncredir6002.drmrs.wmnet with OS bookworm [17:02:41] (03CR) 10CI reject: [V: 04-1] gitlab failover: tweak settings for gitlab-backup command [cookbooks] - 10https://gerrit.wikimedia.org/r/961418 (https://phabricator.wikimedia.org/T345590) (owner: 10Jelto) [17:03:28] (03PS3) 10Jelto: gitlab failover: tweak settings for gitlab-backup command [cookbooks] - 10https://gerrit.wikimedia.org/r/961418 (https://phabricator.wikimedia.org/T345590) [17:05:34] (03CR) 10Herron: "we've left the original" [puppet] - 10https://gerrit.wikimedia.org/r/961224 (owner: 10Jbond) [17:06:32] (03CR) 10Herron: Revert "rsyslog: update to use pki certificates" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/961224 (owner: 10Jbond) [17:09:08] (JobUnavailable) firing: (4) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:09:53] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T343198)', diff saved to https://phabricator.wikimedia.org/P52695 and previous config saved to /var/cache/conftool/dbconfig/20230927-170953-arnaudb.json [17:09:55] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2124.codfw.wmnet with reason: Maintenance [17:10:08] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2124.codfw.wmnet with reason: Maintenance [17:10:08] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [17:10:14] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2124 (T343198)', diff saved to https://phabricator.wikimedia.org/P52696 and previous config saved to /var/cache/conftool/dbconfig/20230927-171014-arnaudb.json [17:18:45] (JobUnavailable) firing: (4) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:23:11] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir6002.drmrs.wmnet with reason: host reimage [17:23:54] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase1019.eqiad.wmnet with OS bullseye [17:24:01] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host restbase1019.eqiad.wmnet with OS bullseye completed: - restbase10... 
[17:25:49] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir6002.drmrs.wmnet with reason: host reimage [17:34:40] (03PS1) 10Lucas Werkmeister: commonswiki: Add $wgExternalLinksDomainGaps for another domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961433 (https://phabricator.wikimedia.org/T341000) [17:36:09] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host moss-be1003.eqiad.wmnet with OS bullseye [17:36:17] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be1003 - https://phabricator.wikimedia.org/T342675 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host moss-be1003.eqiad.wmnet with OS bullseye [17:36:49] !log jgreen@cumin1001 START - Cookbook sre.dns.netbox [17:36:51] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be1003 - https://phabricator.wikimedia.org/T342675 (10Jclark-ctr) [17:37:49] (03PS2) 10Lucas Werkmeister: commonswiki: Add $wgExternalLinksDomainGaps for another domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961433 (https://phabricator.wikimedia.org/T341000) [17:38:12] !log eevans@cumin1001 START - Cookbook sre.hosts.remove-downtime for restbase1019.eqiad.wmnet [17:38:13] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for restbase1019.eqiad.wmnet [17:38:35] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10Eevans) [17:39:00] !log jgreen@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove host frauth2001.frack.codfw.wmnet from DNS for decommissioning - jgreen@cumin1001" [17:39:48] !log jgreen@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered 
by cookbooks.sre.dns.netbox: Remove host frauth2001.frack.codfw.wmnet from DNS for decommissioning - jgreen@cumin1001" [17:39:48] !log jgreen@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:42:37] 10ops-codfw, 10decommission-hardware: decommission frauth2001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T340153 (10Jgreen) a:03Papaul [17:43:19] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install stat1011.eqiad.wmnet - https://phabricator.wikimedia.org/T342454 (10Jclark-ctr) @BTullis did you have updates on Partitioning/Raid: for task? [17:49:45] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir6002.drmrs.wmnet with OS bookworm [17:49:55] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host ncredir6002.drmrs.wmnet with OS bookworm completed: - ncredir6002 (**PASS**) - Downtimed on Ici... 
[17:51:31] (03PS2) 10Jforrester: mw-on-k8s: Serve 100% of wikifunctions.org traffic [puppet] - 10https://gerrit.wikimedia.org/r/961413 (https://phabricator.wikimedia.org/T347509) [17:51:38] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host stat1011.mgmt.eqiad.wmnet with reboot policy FORCED [17:51:56] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host stat1011.mgmt.eqiad.wmnet with reboot policy FORCED [17:52:30] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase2026.codfw.wmnet [17:52:33] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['stat1011'] [17:52:43] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['stat1011'] [17:52:48] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['stat1011'] [17:53:05] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2026.codfw.wmnet [17:53:25] !log bking@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [17:53:28] !log bking@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply [17:53:48] !log bking@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [17:53:51] !log bking@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply [17:57:07] (ProbeDown) firing: Service restbase-https:7443 has failed probes (http_restbase-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#restbase-https:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:58:31] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts 
['stat1011'] [17:59:26] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install lists1004.eqiad.wmnet - https://phabricator.wikimedia.org/T342374 (10RobH) 05Open→03In progress a:03RobH [18:00:06] dduvall and brennen: May I have your attention please! Train log triage with CPT. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230927T1800) [18:00:06] dduvall and brennen: I, the Bot under the Fountain, call upon thee, The Deployer, to do MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230927T1800). [18:00:49] (03PS8) 10Herron: pyrra: add trafficserver mapping [puppet] - 10https://gerrit.wikimedia.org/r/961128 (https://phabricator.wikimedia.org/T302995) [18:01:08] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase2027.codfw.wmnet [18:01:29] 10SRE, 10ops-eqiad, 10DC-Ops, 10User-aborrero, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudcontrol100[8-10]-dev cloudnet100[7-8]-dev - https://phabricator.wikimedia.org/T342455 (10nskaggs) It's recommended that existing names not be reused. See https://wikitech.wikimedia.org/wiki/SRE/In... 
[18:01:30] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host stat1011.eqiad.wmnet with OS bullseye [18:01:43] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install stat1011.eqiad.wmnet - https://phabricator.wikimedia.org/T342454 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host stat1011.eqiad.wmnet with OS bullseye [18:04:44] (03PS1) 10TrainBranchBot: group1 wikis to 1.41.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961439 (https://phabricator.wikimedia.org/T345889) [18:04:46] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.41.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961439 (https://phabricator.wikimedia.org/T345889) (owner: 10TrainBranchBot) [18:05:12] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2026.codfw.wmnet [18:05:13] !log eevans@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts restbase2026.codfw.wmnet [18:06:41] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install lists1004.eqiad.wmnet - https://phabricator.wikimedia.org/T342374 (10RobH) Found error in partitioning, discussing with John. 
[18:07:06] (03PS9) 10Herron: pyrra: add trafficserver mapping [puppet] - 10https://gerrit.wikimedia.org/r/961128 (https://phabricator.wikimedia.org/T302995) [18:07:07] (ProbeDown) resolved: Service restbase-https:7443 has failed probes (http_restbase-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#restbase-https:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:07:40] !log eevans@cumin1001 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts restbase2027.codfw.wmnet [18:07:46] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase2027.codfw.wmnet [18:08:29] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2026.codfw.wmnet with OS bullseye [18:08:37] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host restbase2026.codfw.wmnet with OS bullseye [18:09:35] (03Merged) 10jenkins-bot: group1 wikis to 1.41.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961439 (https://phabricator.wikimedia.org/T345889) (owner: 10TrainBranchBot) [18:09:39] (03PS1) 10RobH: set lists1004 partman info [puppet] - 10https://gerrit.wikimedia.org/r/961440 (https://phabricator.wikimedia.org/T342374) [18:10:21] (03CR) 10RobH: [C: 03+2] set lists1004 partman info [puppet] - 10https://gerrit.wikimedia.org/r/961440 (https://phabricator.wikimedia.org/T342374) (owner: 10RobH) [18:10:55] (03CR) 10Herron: pyrra: add trafficserver mapping (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/961128 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [18:11:52] !log eevans@cumin1001 END (ERROR) - Cookbook sre.hardware.upgrade-firmware 
(exit_code=97) upgrade firmware for hosts restbase2027.codfw.wmnet [18:13:14] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install lists1004.eqiad.wmnet - https://phabricator.wikimedia.org/T342374 (10RobH) a:05RobH→03Jclark-ctr [18:14:55] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host lists1004.eqiad.wmnet with OS bullseye [18:15:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install lists1004.eqiad.wmnet - https://phabricator.wikimedia.org/T342374 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host lists1004.eqiad.wmnet with OS bullseye [18:15:08] (03CR) 10Herron: [V: 03+1] "PCC SUCCESS (CORE_DIFF 16): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43677/console" [puppet] - 10https://gerrit.wikimedia.org/r/961128 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [18:15:53] !log disabling puppet on apt1001 for a quick test of CR 957766's effectiveness [18:15:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:34] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:17:51] !log dduvall@deploy2002 rebuilt and synchronized wikiversions files: group1 wikis to 1.41.0-wmf.28 refs T345889 [18:17:57] T345889: 1.41.0-wmf.28 deployment blockers - https://phabricator.wikimedia.org/T345889 [18:19:45] !log re-enabling puppet on apt1001 from a quick test of CR 957766's effectiveness [18:19:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:48] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host ncredir6001.drmrs.wmnet with OS bookworm [18:20:59] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - 
https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host ncredir6001.drmrs.wmnet with OS bookworm [18:22:35] (KubernetesAPILatency) firing: (28) High Kubernetes API latency (GET blockaffinities) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:22:59] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10BCornwall) [18:23:15] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (10BBlack) [18:24:38] !log dduvall@deploy2002 Synchronized php: group1 wikis to 1.41.0-wmf.28 refs T345889 (duration: 06m 46s) [18:24:44] T345889: 1.41.0-wmf.28 deployment blockers - https://phabricator.wikimedia.org/T345889 [18:24:56] !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase2026.codfw.wmnet with reason: host reimage [18:26:09] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:26:29] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:27:34] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:27:35] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase2026.codfw.wmnet with reason: host reimage [18:28:07] (ProbeDown) firing: Service restbase-https:7443 has failed probes (http_restbase-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#restbase-https:7443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:28:46] (JobUnavailable) firing: (4) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:28:54] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:31:08] (03PS1) 10Majavah: Take cloudcontrol1006 out of service [puppet] - 10https://gerrit.wikimedia.org/r/961442 (https://phabricator.wikimedia.org/T346891) [18:31:10] (03PS1) 10Majavah: Cleanup remains of haproxy-on-cloudcontrol [puppet] - 10https://gerrit.wikimedia.org/r/961443 [18:32:58] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:33:07] (ProbeDown) resolved: Service restbase-https:7443 has failed probes (http_restbase-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#restbase-https:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:34:00] (03CR) 10CI reject: [V: 04-1] Take cloudcontrol1006 out of service [puppet] - 10https://gerrit.wikimedia.org/r/961442 (https://phabricator.wikimedia.org/T346891) (owner: 10Majavah) [18:34:04] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 5.241 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:36:15] (03CR) 10AOkoth: [C: 03+1] gitlab failover: tweak settings for gitlab-backup command [cookbooks] - 10https://gerrit.wikimedia.org/r/961418 (https://phabricator.wikimedia.org/T345590) (owner: 10Jelto) [18:37:57] (03PS1) 10Ssingh: aptrepo: s/haproxy28-bookworm/haproxy [puppet] - 10https://gerrit.wikimedia.org/r/961444 [18:39:11] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43678/console" [puppet] - 10https://gerrit.wikimedia.org/r/961444 (owner: 10Ssingh) [18:39:49] !log disable puppet on O:apt_repo [18:39:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:22] (03CR) 10Ssingh: [V: 03+1 C: 03+2] aptrepo: s/haproxy28-bookworm/haproxy [puppet] - 10https://gerrit.wikimedia.org/r/961444 (owner: 10Ssingh) [18:41:15] !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host lists1004.eqiad.wmnet with OS bullseye [18:41:23] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host lists1004.eqiad.wmnet with OS bullseye [18:41:23] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install lists1004.eqiad.wmnet - https://phabricator.wikimedia.org/T342374 
(10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host lists1004.eqiad.wmnet with OS bullseye [18:41:28] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install lists1004.eqiad.wmnet - https://phabricator.wikimedia.org/T342374 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host lists1004.eqiad.wmnet with OS bullseye executed with errors: - lists1004 (... [18:42:07] (ProbeDown) firing: Service restbase-https:7443 has failed probes (http_restbase-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#restbase-https:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:42:55] !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host lists1004.eqiad.wmnet with OS bullseye [18:43:01] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install lists1004.eqiad.wmnet - https://phabricator.wikimedia.org/T342374 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host lists1004.eqiad.wmnet with OS bullseye [18:43:04] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50567 bytes in 0.083 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:43:15] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir6001.drmrs.wmnet with reason: host reimage [18:45:26] !log re-enable puppet on O:apt_repo [18:45:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:32] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir6001.drmrs.wmnet with reason: host reimage [18:47:07] (ProbeDown) resolved: Service restbase-https:7443 has failed probes (http_restbase-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#restbase-https:7443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:47:35] 10SRE, 10ops-eqiad, 10Data-Platform-SRE: Degraded RAID on dbstore1005 - https://phabricator.wikimedia.org/T347449 (10Jclark-ctr) @BTullis this server is out of warranty i do not have any 1.6tb drives available. i do have 1.9tb we can use if needed [18:48:49] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase2026.codfw.wmnet with OS bullseye [18:48:56] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host restbase2026.codfw.wmnet with OS bullseye completed: - restbase20... [18:49:05] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10Eevans) a:03Eevans [18:49:25] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10Eevans) [18:50:17] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase2027.codfw.wmnet [18:50:54] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2027.codfw.wmnet [18:51:53] (03CR) 10Ssingh: [V: 03+1 C: 03+2] "Ie10249983a6b5f2d98cc40b6734da103c836349c was merged but it was throwing up an error so we decided to do try this. 
Adding for post-merge-r" [puppet] - 10https://gerrit.wikimedia.org/r/961444 (owner: 10Ssingh) [18:55:07] (ProbeDown) firing: Service restbase-https:7443 has failed probes (http_restbase-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#restbase-https:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:55:53] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on lists1004.eqiad.wmnet with reason: host reimage [18:56:25] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host moss-be1003.eqiad.wmnet with OS bullseye [18:56:31] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be1003 - https://phabricator.wikimedia.org/T342675 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host moss-be1003.eqiad.wmnet with OS bullseye executed with erro... 
[18:59:01] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lists1004.eqiad.wmnet with reason: host reimage [18:59:31] (03CR) 10Herron: [V: 03+1 C: 03+2] pyrra: add trafficserver mapping [puppet] - 10https://gerrit.wikimedia.org/r/961128 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [19:00:07] (ProbeDown) resolved: Service restbase-https:7443 has failed probes (http_restbase-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#restbase-https:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:00:27] (03PS6) 10Herron: pyrra: add public dns entries [dns] - 10https://gerrit.wikimedia.org/r/961133 (https://phabricator.wikimedia.org/T302995) [19:01:44] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2027.codfw.wmnet [19:01:45] !log eevans@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts restbase2027.codfw.wmnet [19:02:08] (03CR) 10Herron: [C: 03+2] pyrra: add public dns entries [dns] - 10https://gerrit.wikimedia.org/r/961133 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [19:03:29] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2027.codfw.wmnet with OS bullseye [19:03:38] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host restbase2027.codfw.wmnet with OS bullseye [19:03:46] (JobUnavailable) firing: (4) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:05:50] 
!log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase1020.eqiad.wmnet [19:05:55] !log eevans@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts restbase1020.eqiad.wmnet [19:06:50] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [19:07:03] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [19:07:50] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [19:08:11] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [19:09:07] (ProbeDown) firing: Service restbase-https:7443 has failed probes (http_restbase-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#restbase-https:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:09:58] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir6001.drmrs.wmnet with OS bookworm [19:10:08] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host ncredir6001.drmrs.wmnet with OS bookworm completed: - ncredir6001 (**PASS**) - Downtimed on Ici... 
[19:13:11] (03PS1) 10BBlack: traffic hosts: use broader regexes everywhere [puppet] - 10https://gerrit.wikimedia.org/r/961460 [19:14:43] !log robh@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - robh@cumin1001" [19:18:15] !log robh@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - robh@cumin1001" [19:18:15] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lists1004.eqiad.wmnet with OS bullseye [19:18:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install lists1004.eqiad.wmnet - https://phabricator.wikimedia.org/T342374 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host lists1004.eqiad.wmnet with OS bullseye completed: - lists1004 (**PASS**)... [19:19:06] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install lists1004.eqiad.wmnet - https://phabricator.wikimedia.org/T342374 (10RobH) [19:19:07] (ProbeDown) resolved: Service restbase-https:7443 has failed probes (http_restbase-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#restbase-https:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:19:25] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install lists1004.eqiad.wmnet - https://phabricator.wikimedia.org/T342374 (10RobH) 05In progress→03Resolved OS imaged, system ready for service owners. 
[19:22:03] !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase2027.codfw.wmnet with reason: host reimage [19:23:07] (ProbeDown) firing: Service restbase-https:7443 has failed probes (http_restbase-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#restbase-https:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:23:34] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10BCornwall) [19:24:32] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase2027.codfw.wmnet with reason: host reimage [19:26:16] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [19:26:24] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host ncredir4002.ulsfo.wmnet with OS bookworm [19:26:34] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host ncredir4002.ulsfo.wmnet with OS bookworm [19:28:07] (ProbeDown) resolved: Service restbase-https:7443 has failed probes (http_restbase-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#restbase-https:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:30:07] (ProbeDown) firing: Service restbase-https:7443 has failed probes (http_restbase-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#restbase-https:7443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:33:46] (JobUnavailable) firing: (4) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:34:22] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['moss-be1003'] [19:35:07] (ProbeDown) resolved: Service restbase-https:7443 has failed probes (http_restbase-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#restbase-https:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:35:41] !log bking@deploy2002 deleting flink-operator leader pod to force failover T347521 [19:35:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:54] T347521: Troubleshoot mw-page-content-change-enrich and flink-operator - https://phabricator.wikimedia.org/T347521 [19:37:07] (ProbeDown) firing: Service restbase-https:7443 has failed probes (http_restbase-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#restbase-https:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:38:03] !log bking@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [19:38:06] !log bking@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply [19:39:15] !log bking@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [19:39:17] !log bking@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply 
[19:40:40] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['moss-be1003'] [19:42:07] (ProbeDown) resolved: Service restbase-https:7443 has failed probes (http_restbase-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#restbase-https:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:47:36] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir4002.ulsfo.wmnet with reason: host reimage [19:50:54] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir4002.ulsfo.wmnet with reason: host reimage [19:52:11] !log ebernhardson@deploy2002 Started deploy [airflow-dags/search@c6454a9]: update rdf tools jar to .131 [19:52:39] !log ebernhardson@deploy2002 Finished deploy [airflow-dags/search@c6454a9]: update rdf tools jar to .131 (duration: 00m 28s) [19:54:21] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase2027.codfw.wmnet with OS bullseye [19:54:29] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host restbase2027.codfw.wmnet with OS bullseye completed: - restbase20... [19:59:58] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host moss-be1003.eqiad.wmnet with OS bullseye [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: Time to snap out of that daydream and deploy UTC late backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230927T2000). [20:00:05] lucaswerkmeister and Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:06] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be1003 - https://phabricator.wikimedia.org/T342675 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host moss-be1003.eqiad.wmnet with OS bullseye [20:00:08] o/ [20:01:01] o/ i can deploy [20:01:34] (03PS3) 10Clare Ming: commonswiki: Add $wgExternalLinksDomainGaps for another domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961433 (https://phabricator.wikimedia.org/T341000) (owner: 10Lucas Werkmeister) [20:02:45] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961433 (https://phabricator.wikimedia.org/T341000) (owner: 10Lucas Werkmeister) [20:02:50] \o/ [20:03:28] (03Merged) 10jenkins-bot: commonswiki: Add $wgExternalLinksDomainGaps for another domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961433 (https://phabricator.wikimedia.org/T341000) (owner: 10Lucas Werkmeister) [20:03:33] here [20:03:46] (JobUnavailable) firing: (4) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:03:54] !log cjming@deploy2002 Started scap: Backport for [[gerrit:961433|commonswiki: Add $wgExternalLinksDomainGaps for another domain (T341000)]] [20:05:17] !log cjming@deploy2002 lucaswerkmeister and cjming: Backport for [[gerrit:961433|commonswiki: Add $wgExternalLinksDomainGaps for another domain (T341000)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:05:22] lucaswerkmeister: are you able to test? 
[20:05:25] yup, one moment [20:07:24] eh… it doesn’t make the API request faster as I had hoped, but it doesn’t make it slower either [20:07:40] and the result doesn’t change either (which is expected, but also nice) [20:07:52] I’d say let’s deploy it anyway and I’ll check up with Amir later whether we should keep it or not [20:08:01] but I don’t think it needs to be reverted right now, it shouldn’t hurt 🤷 [20:08:08] sounds good - syncing [20:08:10] thanks [20:08:13] !log cjming@deploy2002 lucaswerkmeister and cjming: Continuing with sync [20:10:07] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir4002.ulsfo.wmnet with OS bookworm [20:10:14] hi Jdlrobson: can your 3 go out together? [20:10:17] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host ncredir4002.ulsfo.wmnet with OS bookworm completed: - ncredir4002 (**PASS**) - Downtimed on Ici... [20:14:12] cjming: the logos can [20:14:17] !log cjming@deploy2002 Finished scap: Backport for [[gerrit:961433|commonswiki: Add $wgExternalLinksDomainGaps for another domain (T341000)]] (duration: 10m 23s) [20:14:31] let's do https://gerrit.wikimedia.org/r/c/961260/ separately [20:14:45] ok! 
[20:15:42] (03PS4) 10Clare Ming: Special wiki wordmarks and taglines [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960121 (https://phabricator.wikimedia.org/T341250) (owner: 10Jdlrobson) [20:16:12] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host ncredir4001.ulsfo.wmnet with OS bookworm [20:16:24] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host ncredir4001.ulsfo.wmnet with OS bookworm [20:16:42] (03CR) 10Clare Ming: [C: 03+2] Special wiki wordmarks and taglines [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960121 (https://phabricator.wikimedia.org/T341250) (owner: 10Jdlrobson) [20:17:24] (03Merged) 10jenkins-bot: Special wiki wordmarks and taglines [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960121 (https://phabricator.wikimedia.org/T341250) (owner: 10Jdlrobson) [20:17:40] (03PS2) 10Clare Ming: Add wordmark for li wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961238 (https://phabricator.wikimedia.org/T341258) (owner: 10Jdlrobson) [20:19:15] (03CR) 10Clare Ming: [C: 03+2] Add wordmark for li wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961238 (https://phabricator.wikimedia.org/T341258) (owner: 10Jdlrobson) [20:20:03] (03Merged) 10jenkins-bot: Add wordmark for li wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961238 (https://phabricator.wikimedia.org/T341258) (owner: 10Jdlrobson) [20:21:02] !log cjming@deploy2002 Started scap: Backport for [[gerrit:960121|Special wiki wordmarks and taglines (T341250)]], [[gerrit:961238|Add wordmark for li wikinews (T341258)]] [20:21:09] T341258: Provide wordmarks for Wikinews projects - https://phabricator.wikimedia.org/T341258 [20:21:10] T341250: Design: Provide wordmarks and taglines for Wikimedia special projects - https://phabricator.wikimedia.org/T341250 [20:22:27] !log 
cjming@deploy2002 jdlrobson and cjming: Backport for [[gerrit:960121|Special wiki wordmarks and taglines (T341250)]], [[gerrit:961238|Add wordmark for li wikinews (T341258)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:22:30] Jdlrobson: logos patches are ready to test if you're able [20:23:19] !log update haproxy 2.6 and 2.8 into bookworm archives with reprepro - T342154 [20:23:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:24] T342154: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 [20:23:46] (JobUnavailable) firing: (4) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:24:09] cjming: on it [20:24:50] cjming: LGTM! [20:24:54] cjming: please sync [20:24:57] yay - syncing [20:25:01] !log cjming@deploy2002 jdlrobson and cjming: Continuing with sync [20:27:26] Jdlrobson: should your vector 2022 default patch be rebased on top of master or parent change? 
[20:27:38] !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on restbase2027.codfw.wmnet with reason: Repairing/rebuilding Cassandra instances [20:27:40] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on restbase2027.codfw.wmnet with reason: Repairing/rebuilding Cassandra instances [20:27:42] master [20:28:00] (03PS3) 10Clare Ming: New projects default to Vector 2022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961260 (https://phabricator.wikimedia.org/T347444) (owner: 10Jdlrobson) [20:28:34] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:28:58] (03CR) 10Majavah: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/961442 (https://phabricator.wikimedia.org/T346891) (owner: 10Majavah) [20:29:09] (JobUnavailable) firing: (4) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:30:55] !log cjming@deploy2002 Finished scap: Backport for [[gerrit:960121|Special wiki wordmarks and taglines (T341250)]], [[gerrit:961238|Add wordmark for li wikinews (T341258)]] (duration: 09m 52s) [20:31:03] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961260 (https://phabricator.wikimedia.org/T347444) (owner: 10Jdlrobson) [20:31:08] T341258: Provide wordmarks for Wikinews projects - https://phabricator.wikimedia.org/T341258 [20:31:09] T341250: Design: Provide wordmarks and taglines for Wikimedia special projects - https://phabricator.wikimedia.org/T341250 [20:32:36] !log 
brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir4001.ulsfo.wmnet with reason: host reimage [20:33:14] (03Merged) 10jenkins-bot: New projects default to Vector 2022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961260 (https://phabricator.wikimedia.org/T347444) (owner: 10Jdlrobson) [20:33:34] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:33:39] !log cjming@deploy2002 Started scap: Backport for [[gerrit:961260|New projects default to Vector 2022 (T347444)]] [20:33:44] T347444: New projects should get Vector 2022 skin - https://phabricator.wikimedia.org/T347444 [20:34:15] !log bking@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [20:34:18] !log bking@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply [20:35:03] !log cjming@deploy2002 cjming and jdlrobson: Backport for [[gerrit:961260|New projects default to Vector 2022 (T347444)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:35:08] Jdlrobson: logos patches are live! vector 2022 patch is ready to test [20:35:17] cjming: okay looking! wish me luck :) [20:35:42] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir4001.ulsfo.wmnet with reason: host reimage [20:36:16] * cjming wishes luck to Jdlrobson [20:37:19] cjming: okay it looks like something's gone wrong with that patch. Looks like the dblist is not working... [20:37:39] bummer - do we need to revert? [20:37:51] probably... im just looking to see if I've missed something obvious [20:38:59] yeh looks like dblists-index.php didn't get updated [20:39:15] is it a caching thing? [20:41:19] cjming: one line fix (facepalm) [20:42:08] Jdlrobson: i can continue sync and we can deploy your fix? 
[20:42:31] or revert, add your fix, and redeploy [20:42:36] (03PS1) 10Jdlrobson: Populate the legacy-vector dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961469 (https://phabricator.wikimedia.org/T347444) [20:42:50] we need to add this fix before syncing [20:43:07] I'm okay to revert and squash this or apply this patch on top of the previous one - what's better? [20:43:23] but we definitely shouldn't sync what's currently on the debug servers [20:43:32] the German Wikipedians will not be happy haha [20:43:59] i'll stop scap backporting current patch, we'll merge your fix, and then scap both [20:44:06] !log cjming@deploy2002 Sync cancelled. [20:44:32] (03CR) 10Clare Ming: [C: 03+2] Populate the legacy-vector dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961469 (https://phabricator.wikimedia.org/T347444) (owner: 10Jdlrobson) [20:45:16] (03Merged) 10jenkins-bot: Populate the legacy-vector dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961469 (https://phabricator.wikimedia.org/T347444) (owner: 10Jdlrobson) [20:45:59] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host moss-be1003.eqiad.wmnet with OS bullseye [20:46:20] !log cjming@deploy2002 Started scap: Backport for [[gerrit:961260|New projects default to Vector 2022 (T347444)]], [[gerrit:961469|Populate the legacy-vector dblist (T347444)]] [20:46:25] T347444: New projects should get Vector 2022 skin - https://phabricator.wikimedia.org/T347444 [20:47:40] !log cjming@deploy2002 jdlrobson and cjming: Backport for [[gerrit:961260|New projects default to Vector 2022 (T347444)]], [[gerrit:961469|Populate the legacy-vector dblist (T347444)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:47:43] Jdlrobson: how about now? [20:48:22] cjming: looking :) [20:48:42] cjming: this is looking much more promising. Give me a few more minutes :) [20:50:33] cjming: yeh this looks good. Sync away. [20:50:41] nice! 
syncing [20:50:44] !log cjming@deploy2002 jdlrobson and cjming: Continuing with sync [20:54:23] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir4001.ulsfo.wmnet with OS bookworm [20:54:33] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host ncredir4001.ulsfo.wmnet with OS bookworm completed: - ncredir4001 (**PASS**) - Downtimed on Ici... [20:55:22] (03PS1) 10Jdlrobson: Drop the desktop improvements dblist group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961471 (https://phabricator.wikimedia.org/T347444) [20:56:09] (03CR) 10CI reject: [V: 04-1] Drop the desktop improvements dblist group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961471 (https://phabricator.wikimedia.org/T347444) (owner: 10Jdlrobson) [20:56:34] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST blockaffinities) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:56:38] (03PS2) 10Jdlrobson: Drop the desktop improvements dblist group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961471 (https://phabricator.wikimedia.org/T347444) [20:57:26] !log cjming@deploy2002 Finished scap: Backport for [[gerrit:961260|New projects default to Vector 2022 (T347444)]], [[gerrit:961469|Populate the legacy-vector dblist (T347444)]] (duration: 11m 05s) [20:57:30] Jdlrobson: should be live :) [20:57:31] T347444: New projects should get Vector 2022 skin - https://phabricator.wikimedia.org/T347444 [20:58:16] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10BCornwall) [20:59:42] !log end of UTC late backport window [20:59:45] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:05] Deploy window Wikifunction Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230927T2100) [21:00:16] thanks a bunch cjming [21:00:25] sorry for the blip in the deploy :) [21:00:35] you're welcome - no worries :) [21:01:26] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host ncredir5002.eqsin.wmnet with OS bookworm [21:01:35] (KubernetesAPILatency) resolved: (12) High Kubernetes API latency (LIST blockaffinities) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:01:36] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host ncredir5002.eqsin.wmnet with OS bookworm [21:08:07] PROBLEM - Host mr1-esams.oob IPv6 is DOWN: CRITICAL - Destination Unreachable (2a00:1188:5:e::4) [21:09:09] (JobUnavailable) firing: (4) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:11:28] thanks from me too cjming! [21:11:30] (bit late ^^) [21:11:47] lucaswerkmeister: np! 
yw :) [21:13:31] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:14:53] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:16:09] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50567 bytes in 0.108 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:16:15] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.270 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:23:46] (JobUnavailable) firing: (4) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:27:22] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host moss-be1003.eqiad.wmnet with OS bullseye [21:27:30] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be1003 - https://phabricator.wikimedia.org/T342675 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host moss-be1003.eqiad.wmnet with OS bullseye [21:35:46] (03PS1) 10Bking: cloudelastic: new partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/961478 (https://phabricator.wikimedia.org/T342463) [21:39:46] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T343198)', diff saved to https://phabricator.wikimedia.org/P52697 and previous config saved to /var/cache/conftool/dbconfig/20230927-213946-arnaudb.json [21:39:52] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [21:43:55] (03CR) 10Ebernhardson: [C: 03+1] cloudelastic: new partman recipe 
[puppet] - 10https://gerrit.wikimedia.org/r/961478 (https://phabricator.wikimedia.org/T342463) (owner: 10Bking) [21:45:02] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir5002.eqsin.wmnet with reason: host reimage [21:48:14] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir5002.eqsin.wmnet with reason: host reimage [21:54:09] (JobUnavailable) firing: (4) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:54:53] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P52698 and previous config saved to /var/cache/conftool/dbconfig/20230927-215452-arnaudb.json [22:03:00] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host moss-be1003.eqiad.wmnet with OS bullseye [22:03:20] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host moss-be1003.eqiad.wmnet with OS bullseye [22:03:36] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be1003 - https://phabricator.wikimedia.org/T342675 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host moss-be1003.eqiad.wmnet with OS bullseye [22:09:59] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P52699 and previous config saved to /var/cache/conftool/dbconfig/20230927-220959-arnaudb.json [22:15:36] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T343198)', diff saved to https://phabricator.wikimedia.org/P52700 and previous config saved to /var/cache/conftool/dbconfig/20230927-221536-arnaudb.json [22:15:42] T343198: Add pl_target_id 
column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [22:18:46] (JobUnavailable) firing: (4) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:21:25] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:22:55] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir5002.eqsin.wmnet with OS bookworm [22:23:06] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host ncredir5002.eqsin.wmnet with OS bookworm completed: - ncredir5002 (**PASS**) - Downtimed on Ici... 
[22:25:05] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T343198)', diff saved to https://phabricator.wikimedia.org/P52701 and previous config saved to /var/cache/conftool/dbconfig/20230927-222505-arnaudb.json [22:25:11] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [22:25:30] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10BCornwall) [22:28:23] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 143, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:28:27] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:29:15] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host moss-be1003.eqiad.wmnet with OS bullseye [22:30:43] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P52702 and previous config saved to /var/cache/conftool/dbconfig/20230927-223042-arnaudb.json [22:42:54] (03PS3) 10Jdlrobson: wordmarks/taglines for Wiktionary projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960147 (https://phabricator.wikimedia.org/T341257) [22:45:49] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P52703 and previous config saved to /var/cache/conftool/dbconfig/20230927-224548-arnaudb.json [22:48:43] 10SRE, 10Cloud-VPS, 10Toolforge: At least one commons tool timing out - https://phabricator.wikimedia.org/T347532 (10Brycehughes) [22:49:12] 10SRE, 10Cloud-VPS, 10Toolforge: At least one commons tool timing out - 
https://phabricator.wikimedia.org/T347532 (10Brycehughes) [23:00:55] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T343198)', diff saved to https://phabricator.wikimedia.org/P52704 and previous config saved to /var/cache/conftool/dbconfig/20230927-230055-arnaudb.json [23:00:57] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2129.codfw.wmnet with reason: Maintenance [23:01:02] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [23:01:11] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2129.codfw.wmnet with reason: Maintenance [23:01:17] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2129 (T343198)', diff saved to https://phabricator.wikimedia.org/P52705 and previous config saved to /var/cache/conftool/dbconfig/20230927-230117-arnaudb.json [23:03:13] PROBLEM - Check systemd state on snapshot1008 is CRITICAL: CRITICAL - degraded: The following units failed: wikidatardf-truthy-dumps.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:09:01] (03PS4) 10Jdlrobson: wordmarks/taglines for Wiktionary projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960147 (https://phabricator.wikimedia.org/T341257) [23:15:23] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:16:45] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:17:56] 10SRE, 10Cloud-VPS, 10Toolforge: At least one commons tool timing out - https://phabricator.wikimedia.org/T347533 (10Brycehughes) [23:18:33] 10SRE, 10Cloud-VPS, 10Toolforge: At least one commons tool timing out - https://phabricator.wikimedia.org/T347532 (10Brycehughes) 05Open→03Invalid [23:18:40] (03PS1) 10Jdlrobson: 
Wikimedia special project logo updates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961484 [23:19:30] 10SRE, 10Cloud-VPS, 10Toolforge: At least one commons tool timing out - https://phabricator.wikimedia.org/T347532 (10Brycehughes) Closed and re-filed as a bug report. No idea if that matters. [23:19:31] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50568 bytes in 0.129 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:19:35] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.283 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:23:34] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host moss-be1003.eqiad.wmnet with OS bullseye [23:23:43] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be1003 - https://phabricator.wikimedia.org/T342675 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host moss-be1003.eqiad.wmnet with OS bullseye [23:26:16] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [23:36:14] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on moss-be1003.eqiad.wmnet with reason: host reimage [23:39:20] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on moss-be1003.eqiad.wmnet with reason: host reimage [23:54:53] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [23:55:55] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered 
by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [23:56:00] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host moss-be1003.eqiad.wmnet with OS bullseye [23:56:06] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be1003 - https://phabricator.wikimedia.org/T342675 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host moss-be1003.eqiad.wmnet with OS bullseye completed: - moss-... [23:56:51] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be1003 - https://phabricator.wikimedia.org/T342675 (10Jclark-ctr) 05Open→03Resolved