[00:13:18] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: apply JRE updates - bking@cumin1001 - T329957
[00:13:22] <stashbot>	 T329957: Restart Elastic/Blazegraph services to pick up JRE updates - https://phabricator.wikimedia.org/T329957
[00:26:00] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[00:27:46] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[00:42:14] <wikibugs>	 (PS1) Dzahn: add metafo records for planet [dns] - https://gerrit.wikimedia.org/r/891730 (https://phabricator.wikimedia.org/T330091)
[00:42:28] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) timed out before a response was received: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[00:44:18] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[00:45:53] <wikibugs>	 (PS1) Dzahn: add metafo records for people.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/891731 (https://phabricator.wikimedia.org/T330091)
[00:47:33] <wikibugs>	 (PS2) Dzahn: add metafo records for people.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/891731 (https://phabricator.wikimedia.org/T330091)
[00:47:54] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[00:47:55] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs2013.codfw.wmnet with OS bullseye
[00:48:02] <wikibugs>	 SRE, ops-codfw, DC-Ops, Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host wdqs2013.codfw.wmnet with OS bullseye completed: - wdqs2013 (**PA...
[00:49:08] <wikibugs>	 (PS1) Dzahn: drop people.eqiad.wmnet service alias [dns] - https://gerrit.wikimedia.org/r/891732
[00:51:14] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:51:18] <icinga-wm>	 PROBLEM - puppet last run on puppetdb2003 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[00:51:20] <icinga-wm>	 PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:03:02] <wikibugs>	 (PS2) Krinkle: Remove redundant wgOriginTrials and wgFeaturePolicyReportOnly settings [mediawiki-config] - https://gerrit.wikimedia.org/r/891314
[01:03:04] <wikibugs>	 (PS1) Krinkle: Move etcd.php from wmf-config/ to src/ [mediawiki-config] - https://gerrit.wikimedia.org/r/891733
[01:03:06] <wikibugs>	 (PS1) Krinkle: noc: Clarify menu labels [mediawiki-config] - https://gerrit.wikimedia.org/r/891734
[01:03:43] <wikibugs>	 (PS2) Krinkle: drop people.eqiad.wmnet service alias [dns] - https://gerrit.wikimedia.org/r/891732 (owner: Dzahn)
[01:04:12] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs2014.codfw.wmnet with OS bullseye
[01:04:20] <wikibugs>	 SRE, ops-codfw, DC-Ops, Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host wdqs2014.codfw.wmnet with OS bullseye
[01:06:52] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs2015.codfw.wmnet with OS bullseye
[01:06:58] <wikibugs>	 SRE, ops-codfw, DC-Ops, Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host wdqs2015.codfw.wmnet with OS bullseye
[01:10:03] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs2016.codfw.wmnet with OS bullseye
[01:10:09] <wikibugs>	 SRE, ops-codfw, DC-Ops, Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host wdqs2016.codfw.wmnet with OS bullseye
[01:13:30] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs2017.codfw.wmnet with OS bullseye
[01:13:36] <wikibugs>	 SRE, ops-codfw, DC-Ops, Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host wdqs2017.codfw.wmnet with OS bullseye
[01:24:54] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs2014.codfw.wmnet with reason: host reimage
[01:28:03] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs2014.codfw.wmnet with reason: host reimage
[01:30:46] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs2016.codfw.wmnet with reason: host reimage
[01:33:50] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs2016.codfw.wmnet with reason: host reimage
[01:34:22] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs2017.codfw.wmnet with reason: host reimage
[01:37:27] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs2017.codfw.wmnet with reason: host reimage
[01:43:11] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[01:48:04] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wdqs2015
[01:48:06] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wdqs2015
[01:49:30] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wdqs2015
[01:49:32] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wdqs2015
[01:49:46] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[01:50:07] <icinga-wm>	 RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 62, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:50:23] <icinga-wm>	 RECOVERY - Router interfaces on cr3-esams is OK: OK: host 91.198.174.245, interfaces up: 84, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:50:29] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wdqs2015
[01:51:07] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wdqs2015
[01:51:12] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wdqs2015
[01:51:46] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wdqs2015
[01:53:05] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[01:53:45] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[01:53:46] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs2016.codfw.wmnet with OS bullseye
[01:53:46] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[01:53:47] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs2014.codfw.wmnet with OS bullseye
[01:53:51] <wikibugs>	 SRE, ops-codfw, DC-Ops, Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host wdqs2016.codfw.wmnet with OS bullseye completed: - wdqs2016 (**PA...
[01:53:56] <wikibugs>	 SRE, ops-codfw, DC-Ops, Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host wdqs2014.codfw.wmnet with OS bullseye completed: - wdqs2014 (**PA...
[01:55:35] <wikibugs>	 SRE, ops-codfw, DC-Ops, Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (Papaul) @Jhancock.wm when you are back on site can you please check wdqs2015 it looks like i have no network cable connected to it.  Thanks
[01:56:17] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[01:56:18] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs2017.codfw.wmnet with OS bullseye
[01:56:26] <wikibugs>	 SRE, ops-codfw, DC-Ops, Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host wdqs2017.codfw.wmnet with OS bullseye completed: - wdqs2017 (**PA...
[02:02:46] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs2018.codfw.wmnet with OS bullseye
[02:02:54] <wikibugs>	 SRE, ops-codfw, DC-Ops, Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host wdqs2018.codfw.wmnet with OS bullseye
[02:03:10] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs2015.codfw.wmnet with OS bullseye
[02:03:17] <wikibugs>	 SRE, ops-codfw, DC-Ops, Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host wdqs2015.codfw.wmnet with OS bullseye executed with errors: - wdq...
[02:05:09] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs2019.codfw.wmnet with OS bullseye
[02:05:21] <wikibugs>	 SRE, ops-codfw, DC-Ops, Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host wdqs2019.codfw.wmnet with OS bullseye
[02:06:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:18:03] <logmsgbot>	 !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wdqs2018.codfw.wmnet with OS bullseye
[02:18:10] <wikibugs>	 SRE, ops-codfw, DC-Ops, Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host wdqs2018.codfw.wmnet with OS bullseye executed with errors: - wdq...
[02:18:22] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs2018.codfw.wmnet with OS bullseye
[02:18:29] <wikibugs>	 SRE, ops-codfw, DC-Ops, Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host wdqs2018.codfw.wmnet with OS bullseye
[02:21:45] <jinxer-wm>	 (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:41:54] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs2020.codfw.wmnet with OS bullseye
[02:42:00] <wikibugs>	 SRE, ops-codfw, DC-Ops, Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host wdqs2020.codfw.wmnet with OS bullseye
[02:45:15] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs2018.codfw.wmnet with reason: host reimage
[02:48:27] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs2018.codfw.wmnet with reason: host reimage
[02:51:09] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[02:53:38] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs2019.codfw.wmnet with reason: host reimage
[02:56:48] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs2019.codfw.wmnet with reason: host reimage
[02:58:11] <logmsgbot>	 !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wdqs2020.codfw.wmnet with OS bullseye
[02:58:17] <wikibugs>	 SRE, ops-codfw, DC-Ops, Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host wdqs2020.codfw.wmnet with OS bullseye executed with errors: - wdq...
[02:58:34] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs2020.codfw.wmnet with OS bullseye
[02:58:41] <wikibugs>	 SRE, ops-codfw, DC-Ops, Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host wdqs2020.codfw.wmnet with OS bullseye
[03:04:43] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[03:08:04] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs2021.codfw.wmnet with OS bullseye
[03:08:12] <wikibugs>	 SRE, ops-codfw, DC-Ops, Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host wdqs2021.codfw.wmnet with OS bullseye
[03:12:00] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[03:12:01] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs2018.codfw.wmnet with OS bullseye
[03:12:08] <wikibugs>	 SRE, ops-codfw, DC-Ops, Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host wdqs2018.codfw.wmnet with OS bullseye completed: - wdqs2018 (**PA...
[03:12:31] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[03:13:50] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[03:13:51] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs2019.codfw.wmnet with OS bullseye
[03:13:57] <wikibugs>	 SRE, ops-codfw, DC-Ops, Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host wdqs2019.codfw.wmnet with OS bullseye completed: - wdqs2019 (**PA...
[03:28:57] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs2021.codfw.wmnet with reason: host reimage
[03:32:08] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs2021.codfw.wmnet with reason: host reimage
[03:47:17] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs2020.codfw.wmnet with OS bullseye
[03:47:23] <wikibugs>	 SRE, ops-codfw, DC-Ops, Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host wdqs2020.codfw.wmnet with OS bullseye executed with errors: - wdq...
[03:47:41] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[03:49:44] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[03:49:45] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs2021.codfw.wmnet with OS bullseye
[03:49:51] <wikibugs>	 SRE, ops-codfw, DC-Ops, Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host wdqs2021.codfw.wmnet with OS bullseye completed: - wdqs2021 (**PA...
[03:50:19] <wikibugs>	 SRE, ops-codfw, DC-Ops, Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (Papaul)
[06:51:09] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230224T0700)
[07:10:58] <wikibugs>	 (CR) Slyngshede: SUL account linking (4 comments) [software/bitu] - https://gerrit.wikimedia.org/r/888003 (https://phabricator.wikimedia.org/T320807) (owner: Slyngshede)
[07:30:33] <wikibugs>	 (PS5) Slyngshede: SUL account linking [software/bitu] - https://gerrit.wikimedia.org/r/888003 (https://phabricator.wikimedia.org/T320807)
[07:39:21] <wikibugs>	 (CR) Alexandros Kosiaris: "Couple of minor comments, otherwise LGTM" [puppet] - https://gerrit.wikimedia.org/r/890299 (https://phabricator.wikimedia.org/T288867) (owner: RLazarus)
[07:50:28] <icinga-wm>	 PROBLEM - Disk space on kubestagetcd1006 is CRITICAL: DISK CRITICAL - free space: / 711 MB (3% inode=95%): /tmp 711 MB (3% inode=95%): /var/tmp 711 MB (3% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=kubestagetcd1006&var-datasource=eqiad+prometheus/ops
[07:52:18] <elukey>	 !log rm /var/log/{syslog,messages,user.log}.1 on kubetcd1006 to free up space - T329717
[07:52:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:23] <stashbot>	 T329717: Switch wikikube-staging (codfw and eqiad) etcd clusters to use PKI - https://phabricator.wikimedia.org/T329717
[07:56:00] <wikibugs>	 (PS1) Elukey: role::etcd::v3::kubernetes::staging: move certs to PKI [puppet] - https://gerrit.wikimedia.org/r/891749 (https://phabricator.wikimedia.org/T329717)
[07:56:54] <elukey>	 akosiaris: o/ if you have a moment --^
[07:57:37] <wikibugs>	 (CR) Elukey: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39822/console" [puppet] - https://gerrit.wikimedia.org/r/891749 (https://phabricator.wikimedia.org/T329717) (owner: Elukey)
[07:58:50] <elukey>	 mmm weird kubestagetcd in codfw shows no changes..
[07:59:55] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on 8 hosts with reason: Downtime DSE workers for cluster upgrade
[08:00:05] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230224T0800)
[08:00:14] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on 8 hosts with reason: Downtime DSE workers for cluster upgrade
[08:04:35] <wikibugs>	 (PS6) Elukey: role::dse_k8s::{master,worker}: update settings to k8s 1.23 [puppet] - https://gerrit.wikimedia.org/r/891280 (https://phabricator.wikimedia.org/T330261)
[08:06:02] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.k8s.upgrade-cluster Upgrade K8s version: Upgrade to k8s 1.23
[08:09:23] <wikibugs>	 (CR) Elukey: [C: +2] role::dse_k8s::{master,worker}: update settings to k8s 1.23 [puppet] - https://gerrit.wikimedia.org/r/891280 (https://phabricator.wikimedia.org/T330261) (owner: Elukey)
[08:09:25] <wikibugs>	 (PS1) Slyngshede: LOGIN: Add custom WikiMedia SSO login page. [software/bitu] - https://gerrit.wikimedia.org/r/891797
[08:10:44] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - dse-k8s-ctrl_6443: Servers dse-k8s-ctrl1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:10:47] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.ganeti.reimage for host dse-k8s-ctrl1001.eqiad.wmnet with OS bullseye
[08:11:06] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - dse-k8s-ctrl_6443: Servers dse-k8s-ctrl1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:11:16] <icinga-wm>	 RECOVERY - Disk space on kubestagetcd1006 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=kubestagetcd1006&var-datasource=eqiad+prometheus/ops
[08:14:54] <wikibugs>	 (CR) Elukey: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39823/console" [puppet] - https://gerrit.wikimedia.org/r/891749 (https://phabricator.wikimedia.org/T329717) (owner: Elukey)
[08:21:34] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-ctrl1001.eqiad.wmnet with reason: host reimage
[08:24:01] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-ctrl1001.eqiad.wmnet with reason: host reimage
[08:25:34] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good! At some point we should add a brief introduction doc which explains which part is WMF-specific and what local config changes a" [software/bitu] - https://gerrit.wikimedia.org/r/891797 (owner: Slyngshede)
[08:26:21] <wikibugs>	 (CR) Elukey: [V: +1] "Not sure why but pcc shows me a diff only for the eqiad nodes, not for the codfw ones.." [puppet] - https://gerrit.wikimedia.org/r/891749 (https://phabricator.wikimedia.org/T329717) (owner: Elukey)
[08:32:30] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:36:28] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:36:37] <wikibugs>	 SRE, SRE-Access-Requests, Research: Requesting access to analytics-privatedata-users for Bruno Scarone - https://phabricator.wikimedia.org/T330364 (SLyngshede-WMF) p:Triage→Medium
[08:37:48] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host dse-k8s-ctrl1001.eqiad.wmnet with OS bullseye
[08:38:42] <wikibugs>	 (PS1) Slyngshede: Access to analytics-privatedata-users for bscarone [puppet] - https://gerrit.wikimedia.org/r/891798 (https://phabricator.wikimedia.org/T330364)
[08:40:06] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.ganeti.reimage for host dse-k8s-ctrl1002.eqiad.wmnet with OS bullseye
[08:43:32] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +1] role::etcd::v3::kubernetes::staging: move certs to PKI [puppet] - https://gerrit.wikimedia.org/r/891749 (https://phabricator.wikimedia.org/T329717) (owner: Elukey)
[08:49:46] <wikibugs>	 SRE, SRE-Access-Requests, Research, Patch-For-Review: Requesting access to analytics-privatedata-users for Bruno Scarone - https://phabricator.wikimedia.org/T330364 (SLyngshede-WMF) Approval required by: @Ottomata  or @odimitrijevic
[08:51:39] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-ctrl1002.eqiad.wmnet with reason: host reimage
[08:54:07] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-ctrl1002.eqiad.wmnet with reason: host reimage
[08:54:31] <wikibugs>	 SRE, SRE-Access-Requests, Research, Patch-For-Review: Requesting access to analytics-privatedata-users for Bruno Scarone - https://phabricator.wikimedia.org/T330364 (SLyngshede-WMF) @KFrancis Given that this is a reactivation of an account, I would assume that Bruno at some point signed an NDA, b...
[08:54:39] <wikibugs>	 (CR) Filippo Giunchedi: "Bummer re: runtimerandomizedextrasec 😞" [puppet] - https://gerrit.wikimedia.org/r/891394 (https://phabricator.wikimedia.org/T327161) (owner: Cwhite)
[08:56:52] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users and Kerberos identity for RMaung - https://phabricator.wikimedia.org/T330335 (SLyngshede-WMF) p:Triage→Medium
[08:57:29] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-private-data for Fgoodwin - https://phabricator.wikimedia.org/T329404 (SLyngshede-WMF) Open→Resolved p:Triage→Medium
[08:57:55] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment for Sergio Gimeno - https://phabricator.wikimedia.org/T330070 (SLyngshede-WMF)
[08:58:14] <wikibugs>	 (CR) Elukey: [V: +1 C: +2] role::etcd::v3::kubernetes::staging: move certs to PKI [puppet] - https://gerrit.wikimedia.org/r/891749 (https://phabricator.wikimedia.org/T329717) (owner: Elukey)
[08:58:15] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment for Sergio Gimeno - https://phabricator.wikimedia.org/T330070 (SLyngshede-WMF) p:Triage→Medium
[09:01:46] <wikibugs>	 (PS7) Vgutierrez: acme_chief: support several passive hosts [puppet] - https://gerrit.wikimedia.org/r/888652 (https://phabricator.wikimedia.org/T321309)
[09:08:41] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host dse-k8s-ctrl1002.eqiad.wmnet with OS bullseye
[09:08:43] <elukey>	 !log rm /var/log/{syslog,messages,user.log}.1 on kubetcd1005 to free up space - T329717
[09:08:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:08:48] <stashbot>	 T329717: Switch wikikube-staging (codfw and eqiad) etcd clusters to use PKI - https://phabricator.wikimedia.org/T329717
[09:08:56] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "FWIW this change gave me an idea for PCC feature request at https://phabricator.wikimedia.org/T330484" [puppet] - https://gerrit.wikimedia.org/r/891392 (owner: Cwhite)
[09:09:18] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1001.eqiad.wmnet with OS bullseye
[09:10:43] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1002.eqiad.wmnet with OS bullseye
[09:11:08] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1003.eqiad.wmnet with OS bullseye
[09:11:36] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1004.eqiad.wmnet with OS bullseye
[09:12:45] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1005.eqiad.wmnet with OS bullseye
[09:13:03] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye
[09:13:21] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1007.eqiad.wmnet with OS bullseye
[09:13:49] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1008.eqiad.wmnet with OS bullseye
[09:20:11] <wikibugs>	 (CR) Elukey: "Looks good but I have a question - when did we agree to extend the scope of the DSE experiment to the rdf-streaming-updater? What is the p" [deployment-charts] - https://gerrit.wikimedia.org/r/891577 (https://phabricator.wikimedia.org/T302494) (owner: Bking)
[09:26:57] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1005.eqiad.wmnet with reason: host reimage
[09:27:26] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1006.eqiad.wmnet with reason: host reimage
[09:27:33] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker1008.eqiad.wmnet with OS bullseye
[09:27:39] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1007.eqiad.wmnet with reason: host reimage
[09:29:29] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1005.eqiad.wmnet with reason: host reimage
[09:31:55] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1006.eqiad.wmnet with reason: host reimage
[09:32:09] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1001.eqiad.wmnet with reason: host reimage
[09:33:48] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1003.eqiad.wmnet with reason: host reimage
[09:33:52] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: developer-portal: Switch state to production [puppet] - 10https://gerrit.wikimedia.org/r/891502 (https://phabricator.wikimedia.org/T297140)
[09:33:54] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: prometheus: Explaining prometheus.io/port annotation [puppet] - 10https://gerrit.wikimedia.org/r/891800
[09:34:26] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1007.eqiad.wmnet with reason: host reimage
[09:35:09] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1002.eqiad.wmnet with reason: host reimage
[09:37:01] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on dse-k8s-worker1001.eqiad.wmnet with reason: host reimage
[09:37:05] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1003.eqiad.wmnet with reason: host reimage
[09:37:06] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1004.eqiad.wmnet with reason: host reimage
[09:38:10] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] role::dse_k8s::{master,worker}: update settings to k8s 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/891280 (https://phabricator.wikimedia.org/T330261) (owner: 10Elukey)
[09:39:35] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1002.eqiad.wmnet with reason: host reimage
[09:39:56] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] developer-portal: Switch state to production [puppet] - 10https://gerrit.wikimedia.org/r/891502 (https://phabricator.wikimedia.org/T297140) (owner: 10Alexandros Kosiaris)
[09:40:27] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] prometheus: Explaining prometheus.io/port annotation [puppet] - 10https://gerrit.wikimedia.org/r/891800 (owner: 10Alexandros Kosiaris)
[09:42:03] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1004.eqiad.wmnet with reason: host reimage
[09:47:29] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations: Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10MoritzMuehlenhoff)
[09:48:12] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations: Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10MoritzMuehlenhoff)
[09:48:16] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Setup an initial bookworm host pair with Puppetdb 7 - https://phabricator.wikimedia.org/T321783 (10MoritzMuehlenhoff)
[09:48:20] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations: Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10MoritzMuehlenhoff) p:05Triage→03Medium
[09:58:03] <wikibugs>	 (03PS1) 10Volans: icinga: uniform code and add test [software/spicerack] - 10https://gerrit.wikimedia.org/r/891803
[09:58:12] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] icinga: uniform code and add test [software/spicerack] - 10https://gerrit.wikimedia.org/r/891803 (owner: 10Volans)
[09:59:17] <wikibugs>	 (03PS2) 10Volans: icinga: uniform code and add test [software/spicerack] - 10https://gerrit.wikimedia.org/r/891803
[10:00:21] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:01:04] <wikibugs>	 (03CR) 10Btullis: analytics: rename postgres DB user for search platform (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/891587 (https://phabricator.wikimedia.org/T327970) (owner: 10Bking)
[10:02:37] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] icinga: uniform code and add test [software/spicerack] - 10https://gerrit.wikimedia.org/r/891803 (owner: 10Volans)
[10:05:54] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Research, 10Patch-For-Review: Requesting access to analytics-privatedata-users for Bruno Scarone - https://phabricator.wikimedia.org/T330364 (10MoritzMuehlenhoff) >>! In T330364#8643473, @SLyngshede-WMF wrote: > @KFrancis Given that this is a reactivation of an account, I wo...
[10:06:01] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1001.eqiad.wmnet with OS bullseye
[10:07:33] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1003.eqiad.wmnet with OS bullseye
[10:08:43] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Research, 10Patch-For-Review: Requesting access to analytics-privatedata-users for Bruno Scarone - https://phabricator.wikimedia.org/T330364 (10SLyngshede-WMF)
[10:09:21] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: check_webrequest_partitions.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:09:48] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] "I'll merge this now. Hope that's OK bking." [puppet] - 10https://gerrit.wikimedia.org/r/891587 (https://phabricator.wikimedia.org/T327970) (owner: 10Bking)
[10:10:32] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1002.eqiad.wmnet with OS bullseye
[10:12:20] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1004.eqiad.wmnet with OS bullseye
[10:13:53] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reimage for host urldownloader2003.wikimedia.org with OS bullseye
[10:14:06] <wikibugs>	 (03PS3) 10Volans: icinga: uniform code and add test [software/spicerack] - 10https://gerrit.wikimedia.org/r/891803
[10:14:07] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Migrate the URL downloaders to Bullseye - https://phabricator.wikimedia.org/T329945 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by jmm@cumin2002 for host urldownloader2003.wikimedia.org with OS bullseye
[10:18:01] <wikibugs>	 (03PS1) 10Btullis: Ensure that the airflow database names match existing conventions [puppet] - 10https://gerrit.wikimedia.org/r/891804 (https://phabricator.wikimedia.org/T319440)
[10:20:39] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users and Kerberos identity for RMaung - https://phabricator.wikimedia.org/T330335 (10SLyngshede-WMF)
[10:21:31] <icinga-wm>	 RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:22:09] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 81, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:29:01] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1008.eqiad.wmnet with OS bullseye
[10:31:01] <logmsgbot>	 !log elukey@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'.
[10:31:04] <logmsgbot>	 !log elukey@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'.
[10:31:17] <logmsgbot>	 !log elukey@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'.
[10:31:18] <logmsgbot>	 !log elukey@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'.
[10:32:17] <logmsgbot>	 !log elukey@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'.
[10:32:19] <logmsgbot>	 !log elukey@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'.
[10:32:28] <moritzm>	 !log installing emacs security updates on bullseye
[10:32:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:32:37] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] admin_ng: upgrade the DSE cluster to k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/891284 (https://phabricator.wikimedia.org/T330261) (owner: 10Elukey)
[10:35:23] <logmsgbot>	 !log elukey@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'.
[10:35:25] <logmsgbot>	 !log elukey@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'.
[10:35:45] <logmsgbot>	 !log elukey@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'.
[10:35:46] <logmsgbot>	 !log elukey@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'.
[10:35:55] <logmsgbot>	 !log elukey@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'.
[10:35:57] <logmsgbot>	 !log elukey@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'.
[10:36:02] <elukey>	 sorry a bit of spam :)
[10:40:15] <logmsgbot>	 !log elukey@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'.
[10:40:19] <logmsgbot>	 !log elukey@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'.
[10:41:47] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on urldownloader2003.wikimedia.org with reason: host reimage
[10:42:30] <wikibugs>	 (03CR) 10Slyngshede: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/891803 (owner: 10Volans)
[10:44:08] <logmsgbot>	 !log elukey@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'.
[10:44:44] <logmsgbot>	 !log elukey@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'.
[10:44:47] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:44:53] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on urldownloader2003.wikimedia.org with reason: host reimage
[10:45:17] <logmsgbot>	 !log elukey@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'.
[10:45:27] <logmsgbot>	 !log elukey@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'.
[10:46:33] <logmsgbot>	 !log elukey@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'.
[10:46:43] <logmsgbot>	 !log elukey@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'.
[10:52:19] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker1005.eqiad.wmnet with OS bullseye
[10:58:23] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host urldownloader2003.wikimedia.org with OS bullseye
[10:58:28] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Migrate the URL downloaders to Bullseye - https://phabricator.wikimedia.org/T329945 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by jmm@cumin2002 for host urldownloader2003.wikimedia.org with OS bullseye completed: - urldownloader2003 (**PASS**)   -...
[10:59:47] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye
[10:59:53] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker1008.eqiad.wmnet with OS bullseye
[11:02:12] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker1007.eqiad.wmnet with OS bullseye
[11:13:25] <claime>	   /19
[11:13:54] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reimage for host urldownloader2004.wikimedia.org with OS bullseye
[11:14:00] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Migrate the URL downloaders to Bullseye - https://phabricator.wikimedia.org/T329945 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by jmm@cumin2002 for host urldownloader2004.wikimedia.org with OS bullseye
[11:17:29] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Ensure that the airflow database names match existing conventions [puppet] - 10https://gerrit.wikimedia.org/r/891804 (https://phabricator.wikimedia.org/T319440) (owner: 10Btullis)
[11:31:52] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations: Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10jbond) > and puppetmaster[12]004 as the new puppetserver backends, This is one detail im a bit hazy on, the normal way to do things in puppet now is to have multiple puppet compilers and one...
[11:37:50] <logmsgbot>	 !log dcaro@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1003.eqiad.wmnet with OS bullseye
[11:41:52] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on urldownloader2004.wikimedia.org with reason: host reimage
[11:44:57] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on urldownloader2004.wikimedia.org with reason: host reimage
[11:49:27] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10MoritzMuehlenhoff)
[11:49:34] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10MoritzMuehlenhoff) p:05Triage→03Medium
[11:54:06] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] "let's merge this on Monday" [puppet] - 10https://gerrit.wikimedia.org/r/863332 (https://phabricator.wikimedia.org/T323944) (owner: 10Ssingh)
[11:58:26] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host urldownloader2004.wikimedia.org with OS bullseye
[11:58:37] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] "I removed the incorrectly named databases and roles with the following commands in the psql command line on an-db1001." [puppet] - 10https://gerrit.wikimedia.org/r/891804 (https://phabricator.wikimedia.org/T319440) (owner: 10Btullis)
[11:59:36] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] Update airflow conf compatibility with airflow 2.5.0 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene)
[11:59:46] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Migrate the URL downloaders to Bullseye - https://phabricator.wikimedia.org/T329945 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by jmm@cumin2002 for host urldownloader2004.wikimedia.org with OS bullseye completed: - urldownloader2004 (**PASS**)   -...
[12:00:15] <wikibugs>	 (03PS1) 10Superpes15: [eswiki] Create new 'templateeditor' usergroup and protection level [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891814 (https://phabricator.wikimedia.org/T330470)
[12:01:00] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] [eswiki] Create new 'templateeditor' usergroup and protection level [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891814 (https://phabricator.wikimedia.org/T330470) (owner: 10Superpes15)
[12:11:02] <wikibugs>	 (03CR) 10Superpes15: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891814 (https://phabricator.wikimedia.org/T330470) (owner: 10Superpes15)
[12:13:00] <wikibugs>	 (03CR) 10MarcoAurelio: [C: 03+1] [eswiki] Create new 'templateeditor' usergroup and protection level [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891814 (https://phabricator.wikimedia.org/T330470) (owner: 10Superpes15)
[12:22:58] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:23:12] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:24:42] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49567 bytes in 4.592 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:24:56] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 2.915 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:26:06] <logmsgbot>	 !log dcaro@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1003.eqiad.wmnet with OS bullseye
[12:31:46] <wikibugs>	 (03PS1) 10Jbond: do not merge: [puppet] - 10https://gerrit.wikimedia.org/r/891816
[12:33:17] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39824/console" [puppet] - 10https://gerrit.wikimedia.org/r/891816 (owner: 10Jbond)
[12:33:28] <wikibugs>	 (03PS1) 10Muehlenhoff: Blacklist f2fs [puppet] - 10https://gerrit.wikimedia.org/r/891817
[12:34:22] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] do not merge: [puppet] - 10https://gerrit.wikimedia.org/r/891816 (owner: 10Jbond)
[12:35:54] <wikibugs>	 (03PS1) 10KartikMistry: Content Translation: Adjust the global limit for unedited MT to 95% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891818 (https://phabricator.wikimedia.org/T330482)
[12:38:02] <wikibugs>	 10SRE, 10Data-Engineering-Radar, 10Infrastructure-Foundations, 10User-MoritzMuehlenhoff: Replace firejail use in superset with native systemd features - https://phabricator.wikimedia.org/T258700 (10JArguello-WMF)
[12:38:15] <wikibugs>	 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10netops: Automate ingestion of netflow event stream - https://phabricator.wikimedia.org/T248865 (10JArguello-WMF)
[12:39:04] <wikibugs>	 10SRE, 10Analytics-Radar, 10Data-Engineering-Planning, 10Event-Platform Value Stream, 10serviceops-radar: Consider Julie for managing Kafka settings, perhaps even integrating with Event Stream Config - https://phabricator.wikimedia.org/T276088 (10JArguello-WMF)
[12:39:15] <wikibugs>	 (03PS2) 10Aklapper: Remove redirect for pk.wikimedia.org (Pakistan) [puppet] - 10https://gerrit.wikimedia.org/r/887980 (https://phabricator.wikimedia.org/T328596)
[12:48:19] <wikibugs>	 (03Abandoned) 10Jbond: do not merge: [puppet] - 10https://gerrit.wikimedia.org/r/891816 (owner: 10Jbond)
[12:53:34] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[12:54:04] <icinga-wm>	 PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[12:59:54] <wikibugs>	 (03PS1) 10Jbond: sre.hardware.upgrade-firmware:  update bios regex [cookbooks] - 10https://gerrit.wikimedia.org/r/891820
[13:23:17] <wikibugs>	 (03PS1) 10Majavah: P:toolforge::k8s::haproxy: allow simplifying the backend array [puppet] - 10https://gerrit.wikimedia.org/r/891825
[13:23:19] <wikibugs>	 (03PS1) 10Majavah: P:toolforge::k8s::haproxy: remove support for hash node list [puppet] - 10https://gerrit.wikimedia.org/r/891826
[13:23:21] <wikibugs>	 (03PS1) 10Majavah: P:toolforge::k8s::haproxy: add api gateway load balancer [puppet] - 10https://gerrit.wikimedia.org/r/891827 (https://phabricator.wikimedia.org/T329443)
[13:26:52] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] P:toolforge::k8s::haproxy: add api gateway load balancer [puppet] - 10https://gerrit.wikimedia.org/r/891827 (https://phabricator.wikimedia.org/T329443) (owner: 10Majavah)
[13:27:36] <wikibugs>	 (03PS2) 10Majavah: P:toolforge::k8s::haproxy: add api gateway load balancer [puppet] - 10https://gerrit.wikimedia.org/r/891827 (https://phabricator.wikimedia.org/T329443)
[13:41:20] <wikibugs>	 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jan-Mar-2023), 10Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (10Trizek-WMF)
[13:55:04] <wikibugs>	 (03PS1) 10Muehlenhoff: Add d-i config for Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/891832 (https://phabricator.wikimedia.org/T330495)
[13:55:25] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add d-i config for Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/891832 (https://phabricator.wikimedia.org/T330495) (owner: 10Muehlenhoff)
[13:57:58] <wikibugs>	 (03PS2) 10Muehlenhoff: Add d-i config for Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/891832 (https://phabricator.wikimedia.org/T330495)
[13:58:30] <wikibugs>	 (03PS1) 10Majavah: Set OATHAuthMultipleDevicesMigrationStage to MIGRATION_OLD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891833 (https://phabricator.wikimedia.org/T242031)
[13:58:59] <wikibugs>	 (03PS1) 10Btullis: Update the SSH configuration to add the keys to the agent on first use [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/891834
[14:01:02] <wikibugs>	 (03PS1) 10Esanders: Disable VectorPromoteAddTopic on production wikis initially [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891836 (https://phabricator.wikimedia.org/T267444)
[14:01:07] <wikibugs>	 (03PS1) 10Jaime Nuche: scap: add required Python3 venv package [puppet] - 10https://gerrit.wikimedia.org/r/891837 (https://phabricator.wikimedia.org/T329622)
[14:02:22] <wikibugs>	 (03PS5) 10Jcrespo: Preparing for release 0.1.6 [software/mediabackups] - 10https://gerrit.wikimedia.org/r/889646 (https://phabricator.wikimedia.org/T327157)
[14:03:16] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] Add unit tests & coverage report [software/mediabackups] - 10https://gerrit.wikimedia.org/r/885428 (owner: 10Jcrespo)
[14:04:04] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] Preparing for release 0.1.6 [software/mediabackups] - 10https://gerrit.wikimedia.org/r/889646 (https://phabricator.wikimedia.org/T327157) (owner: 10Jcrespo)
[14:09:02] <wikibugs>	 (03PS1) 10Muehlenhoff: Add bookworm pxelinux.cfg [puppet] - 10https://gerrit.wikimedia.org/r/891839 (https://phabricator.wikimedia.org/T330495)
[14:09:30] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add bookworm pxelinux.cfg [puppet] - 10https://gerrit.wikimedia.org/r/891839 (https://phabricator.wikimedia.org/T330495) (owner: 10Muehlenhoff)
[14:10:20] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1005.eqiad.wmnet with OS bullseye
[14:14:06] <wikibugs>	 (03PS1) 10Muehlenhoff: Skip boot.txt for SPDX check [puppet] - 10https://gerrit.wikimedia.org/r/891840
[14:15:22] <wikibugs>	 (03PS2) 10Muehlenhoff: Add bookworm pxelinux.cfg [puppet] - 10https://gerrit.wikimedia.org/r/891839 (https://phabricator.wikimedia.org/T330495)
[14:15:50] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add bookworm pxelinux.cfg [puppet] - 10https://gerrit.wikimedia.org/r/891839 (https://phabricator.wikimedia.org/T330495) (owner: 10Muehlenhoff)
[14:16:39] <wikibugs>	 (03PS1) 10EoghanGaffney: Split out a new class for phabricator::config [puppet] - 10https://gerrit.wikimedia.org/r/891841
[14:17:00] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Split out a new class for phabricator::config [puppet] - 10https://gerrit.wikimedia.org/r/891841 (owner: 10EoghanGaffney)
[14:17:52] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye
[14:22:00] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1007.eqiad.wmnet with OS bullseye
[14:23:08] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1008.eqiad.wmnet with OS bullseye
[14:23:50] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker1005.eqiad.wmnet with OS bullseye
[14:24:07] <elukey>	 sigh
[14:28:55] <wikibugs>	 (03PS2) 10EoghanGaffney: Split out a new class for phabricator::config [puppet] - 10https://gerrit.wikimedia.org/r/891841
[14:29:17] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Split out a new class for phabricator::config [puppet] - 10https://gerrit.wikimedia.org/r/891841 (owner: 10EoghanGaffney)
[14:31:26] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye
[14:31:51] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs2020.codfw.wmnet with OS bullseye
[14:31:58] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host wdqs2020.codfw.wmnet with OS bullseye
[14:34:39] <wikibugs>	 (03PS3) 10EoghanGaffney: Split out a new class for phabricator::config [puppet] - 10https://gerrit.wikimedia.org/r/891841
[14:35:53] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1007.eqiad.wmnet with reason: host reimage
[14:36:20] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker1008.eqiad.wmnet with OS bullseye
[14:36:45] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Split out a new class for phabricator::config [puppet] - 10https://gerrit.wikimedia.org/r/891841 (owner: 10EoghanGaffney)
[14:37:16] <wikibugs>	 (03PS4) 10EoghanGaffney: Split out a new class for phabricator::config [puppet] - 10https://gerrit.wikimedia.org/r/891841
[14:38:49] <wikibugs>	 (03PS1) 10Jbond: differ: add more types to the exclusion of core types [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/891844 (https://phabricator.wikimedia.org/T330484)
[14:39:01] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1007.eqiad.wmnet with reason: host reimage
[14:41:07] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] differ: add more types to the exclusion of core types [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/891844 (https://phabricator.wikimedia.org/T330484) (owner: 10Jbond)
[14:42:04] <wikibugs>	 (03PS2) 10Jbond: differ: add more types to the exclusion of core types [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/891844 (https://phabricator.wikimedia.org/T330484)
[14:42:42] <wikibugs>	 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jan-Mar-2023), 10Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (10Trizek-WMF) I checked all translations regarding time and date. I had to fix all of them manually, at least the one...
[14:43:28] <wikibugs>	 (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39825/console" [puppet] - 10https://gerrit.wikimedia.org/r/891841 (owner: 10EoghanGaffney)
[14:44:31] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] sre.hardware.upgrade-firmware:  update bios regex [cookbooks] - 10https://gerrit.wikimedia.org/r/891820 (owner: 10Jbond)
[14:44:42] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] differ: add more types to the exclusion of core types [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/891844 (https://phabricator.wikimedia.org/T330484) (owner: 10Jbond)
[14:46:09] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:47:03] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/891817 (owner: 10Muehlenhoff)
[14:47:40] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/891840 (owner: 10Muehlenhoff)
[14:49:51] <logmsgbot>	 !log jbond@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs2013']
[14:50:00] <logmsgbot>	 !log jbond@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['wdqs2013']
[14:50:03] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on dse-k8s-worker1007.eqiad.wmnet with reason: Cluster half broken, in the middle of upgrading
[14:50:07] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on dse-k8s-worker1007.eqiad.wmnet with reason: Cluster half broken, in the middle of upgrading
[14:50:34] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker1007.eqiad.wmnet with OS bullseye
[14:50:45] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.k8s.upgrade-cluster (exit_code=0) Upgrade K8s version: Upgrade to k8s 1.23
[14:50:49] <wikibugs>	 (03PS5) 10EoghanGaffney: Split out a new class for phabricator::config [puppet] - 10https://gerrit.wikimedia.org/r/891841
[14:51:03] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM (untested tho)" [puppet] - 10https://gerrit.wikimedia.org/r/891832 (https://phabricator.wikimedia.org/T330495) (owner: 10Muehlenhoff)
[14:51:11] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Split out a new class for phabricator::config [puppet] - 10https://gerrit.wikimedia.org/r/891841 (owner: 10EoghanGaffney)
[14:51:41] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "FWIW +1 to the idea, thank you for the improvement!" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/891844 (https://phabricator.wikimedia.org/T330484) (owner: 10Jbond)
[14:52:35] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs2020.codfw.wmnet with reason: host reimage
[14:52:58] <wikibugs>	 (03PS2) 10Muehlenhoff: Skip boot.txt for SPDX check [puppet] - 10https://gerrit.wikimedia.org/r/891840
[14:54:10] <wikibugs>	 (03PS3) 10Jbond: differ: add more types to the exclusion of core types [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/891844 (https://phabricator.wikimedia.org/T330484)
[14:54:12] <wikibugs>	 (03PS1) 10Jbond: fix formating [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/891847
[14:54:47] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:54:58] <jinxer-wm>	 (KubernetesCalicoDown) firing: (2) dse-k8s-worker1006.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[14:55:40] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs2020.codfw.wmnet with reason: host reimage
[14:56:54] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] fix formating [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/891847 (owner: 10Jbond)
[14:57:57] <wikibugs>	 10SRE, 10Data-Engineering, 10Event-Platform Value Stream, 10Service-deployment-requests: New Service Request mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T330507 (10Ottomata)
[14:59:20] <wikibugs>	 10SRE, 10Data-Engineering, 10Event-Platform Value Stream, 10Service-deployment-requests: New Service Request mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T330507 (10Ottomata)
[15:02:19] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Research, 10Patch-For-Review: Requesting access to analytics-privatedata-users for Bruno Scarone - https://phabricator.wikimedia.org/T330364 (10Ottomata) Approved.
[15:04:24] <wikibugs>	 (03CR) 10Ottomata: [C: 03+1] dse-k8s: raise memory for rdf-streaming-updater (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/891577 (https://phabricator.wikimedia.org/T302494) (owner: 10Bking)
[15:06:07] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Skip boot.txt for SPDX check [puppet] - 10https://gerrit.wikimedia.org/r/891840 (owner: 10Muehlenhoff)
[15:06:41] <wikibugs>	 (03PS3) 10Muehlenhoff: Add bookworm pxelinux.cfg [puppet] - 10https://gerrit.wikimedia.org/r/891839 (https://phabricator.wikimedia.org/T330495)
[15:06:43] <wikibugs>	 (03PS1) 10Ayounsi: Re-image: clear DHCP cache sooner [cookbooks] - 10https://gerrit.wikimedia.org/r/891848 (https://phabricator.wikimedia.org/T306421)
[15:06:53] <wikibugs>	 (03CR) 10Ottomata: [C: 03+1] "If Steve is okay with this, then I think it is ready for merge." [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene)
[15:07:09] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add bookworm pxelinux.cfg [puppet] - 10https://gerrit.wikimedia.org/r/891839 (https://phabricator.wikimedia.org/T330495) (owner: 10Muehlenhoff)
[15:08:23] <wikibugs>	 (03PS2) 10Ayounsi: Re-image: clear DHCP cache sooner [cookbooks] - 10https://gerrit.wikimedia.org/r/891848 (https://phabricator.wikimedia.org/T306421)
[15:10:35] <wikibugs>	 (03CR) 10DCausse: dse-k8s: raise memory for rdf-streaming-updater (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/891577 (https://phabricator.wikimedia.org/T302494) (owner: 10Bking)
[15:11:28] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[15:11:47] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm some optional nits" [puppet] - 10https://gerrit.wikimedia.org/r/891593 (owner: 10David Caro)
[15:11:56] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "new url downloaders - jmm@cumin2002 - T329945"
[15:12:01] <stashbot>	 T329945: Migrate the URL downloaders to Bullseye - https://phabricator.wikimedia.org/T329945
[15:21:20] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "new url downloaders - jmm@cumin2002 - T329945"
[15:21:24] <stashbot>	 T329945: Migrate the URL downloaders to Bullseye - https://phabricator.wikimedia.org/T329945
[15:22:49] <wikibugs>	 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations, 10Patch-For-Review: Update makevm to include completion of the installation with the puppet runs - https://phabricator.wikimedia.org/T306661 (10MoritzMuehlenhoff) Should we also extend the cookbook to run sre.puppet.sync-netbox-hiera? Or at least print a...
[15:22:59] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[15:23:00] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs2020.codfw.wmnet with OS bullseye
[15:23:06] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host wdqs2020.codfw.wmnet with OS bullseye completed: - wdqs2020 (**PA...
[15:23:45] <wikibugs>	 (03CR) 10Elukey: dse-k8s: raise memory for rdf-streaming-updater (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/891577 (https://phabricator.wikimedia.org/T302494) (owner: 10Bking)
[15:25:14] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] "It looks good to me, the reimages of the hosts in row E/F are already broken so it is worth a try. As specified in IRC I'd like somebody f" [cookbooks] - 10https://gerrit.wikimedia.org/r/891848 (https://phabricator.wikimedia.org/T306421) (owner: 10Ayounsi)
[15:33:44] <wikibugs>	 (03CR) 10Hashar: [C: 03+1] "+1 then it looks like the failure is the deployment user does not have the proper sudo rule to act as `kibana`?" [puppet] - 10https://gerrit.wikimedia.org/r/888740 (https://phabricator.wikimedia.org/T329688) (owner: 10Cwhite)
[15:40:06] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] Re-image: clear DHCP cache sooner [cookbooks] - 10https://gerrit.wikimedia.org/r/891848 (https://phabricator.wikimedia.org/T306421) (owner: 10Ayounsi)
[15:51:23] <wikibugs>	 (03PS1) 10Nicolas Fraison: hive: add gc logs to hiveservers and hive-metastore [puppet] - 10https://gerrit.wikimedia.org/r/891850 (https://phabricator.wikimedia.org/T303168)
[15:52:01] <wikibugs>	 (03CR) 10Ahmon Dancy: scap: add required Python3 venv package (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/891837 (https://phabricator.wikimedia.org/T329622) (owner: 10Jaime Nuche)
[15:52:24] <wikibugs>	 (03PS2) 10Nicolas Fraison: hive: add gc logs to hiveservers and hive-metastore [puppet] - 10https://gerrit.wikimedia.org/r/891850 (https://phabricator.wikimedia.org/T303168)
[15:54:21] <wikibugs>	 (03CR) 10Nicolas Fraison: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39828/console" [puppet] - 10https://gerrit.wikimedia.org/r/891850 (https://phabricator.wikimedia.org/T303168) (owner: 10Nicolas Fraison)
[16:07:13] <icinga-wm>	 RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:08:13] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 81, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:14:21] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "I can't find where port 30002 is being used today:" [puppet] - 10https://gerrit.wikimedia.org/r/891825 (owner: 10Majavah)
[16:14:33] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frbast1002, frmon1002, frpig1002 - https://phabricator.wikimedia.org/T319460 (10Jclark-ctr) frbast1002 port 9 frmon1002 port 11 frpig1002 port 1
[16:14:55] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frbast1002, frmon1002, frpig1002 - https://phabricator.wikimedia.org/T319460 (10Jclark-ctr)
[16:15:16] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: P:toolforge::k8s::haproxy: remove support for hash node list (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/891826 (owner: 10Majavah)
[16:15:36] <wikibugs>	 (03CR) 10Majavah: P:toolforge::k8s::haproxy: allow simplifying the backend array (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/891825 (owner: 10Majavah)
[16:16:36] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: P:toolforge::k8s::haproxy: add api gateway load balancer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/891827 (https://phabricator.wikimedia.org/T329443) (owner: 10Majavah)
[16:19:59] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] P:toolforge::k8s::haproxy: allow simplifying the backend array (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/891825 (owner: 10Majavah)
[16:21:37] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] P:toolforge::k8s::haproxy: allow simplifying the backend array [puppet] - 10https://gerrit.wikimedia.org/r/891825 (owner: 10Majavah)
[16:21:54] <wikibugs>	 (03PS1) 10Btullis: Add dummy keydata for the new ceph admin user [labs/private] - 10https://gerrit.wikimedia.org/r/891854 (https://phabricator.wikimedia.org/T328123)
[16:22:44] <wikibugs>	 (03PS2) 10Majavah: P:toolforge::k8s::haproxy: remove support for hash node list [puppet] - 10https://gerrit.wikimedia.org/r/891826
[16:22:46] <wikibugs>	 (03PS3) 10Majavah: P:toolforge::k8s::haproxy: add api gateway load balancer [puppet] - 10https://gerrit.wikimedia.org/r/891827 (https://phabricator.wikimedia.org/T329443)
[16:22:50] <wikibugs>	 (03CR) 10Btullis: [V: 03+2 C: 03+2] Add dummy keydata for the new ceph admin user [labs/private] - 10https://gerrit.wikimedia.org/r/891854 (https://phabricator.wikimedia.org/T328123) (owner: 10Btullis)
[16:22:53] <wikibugs>	 (03CR) 10Majavah: P:toolforge::k8s::haproxy: remove support for hash node list (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/891826 (owner: 10Majavah)
[16:23:13] <wikibugs>	 (03CR) 10Majavah: P:toolforge::k8s::haproxy: add api gateway load balancer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/891827 (https://phabricator.wikimedia.org/T329443) (owner: 10Majavah)
[16:33:35] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: / (spec from root) is CRITICAL: Test spec from root returned the unexpected status 503 (expecting: 200): /api (bad URL) is CRITICAL: Test bad URL returned the unexpected status 503 (expecting: 404) https://wikitech.wikimedia.org/wiki/Citoid
[16:34:18] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "Only +1, cannot merge & babysit myself now." [puppet] - 10https://gerrit.wikimedia.org/r/891826 (owner: 10Majavah)
[16:35:25] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[16:45:23] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1003.eqiad.wmnet with OS buster
[16:45:26] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "Only +1, cannot merge & babysit the rollout myself now." [puppet] - 10https://gerrit.wikimedia.org/r/891827 (https://phabricator.wikimedia.org/T329443) (owner: 10Majavah)
[16:48:32] <wikibugs>	 (03PS4) 10Majavah: P:toolforge::k8s::haproxy: add api gateway load balancer [puppet] - 10https://gerrit.wikimedia.org/r/891827 (https://phabricator.wikimedia.org/T329443)
[16:55:54] <wikibugs>	 (03PS3) 10RLazarus: mediawiki-cache-warmup: Rewrite in Python [puppet] - 10https://gerrit.wikimedia.org/r/890299 (https://phabricator.wikimedia.org/T288867)
[16:57:06] <logmsgbot>	 !log fab@deploy1002 Started deploy [airflow-dags/research@5edcd7b]: (no justification provided)
[16:57:32] <logmsgbot>	 !log fab@deploy1002 Finished deploy [airflow-dags/research@5edcd7b]: (no justification provided) (duration: 00m 26s)
[16:57:47] <wikibugs>	 (03CR) 10RLazarus: mediawiki-cache-warmup: Rewrite in Python (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/890299 (https://phabricator.wikimedia.org/T288867) (owner: 10RLazarus)
[17:03:40] <logmsgbot>	 !log fab@deploy1002 Started deploy [airflow-dags/research@5edcd7b]: (no justification provided)
[17:03:51] <logmsgbot>	 !log fab@deploy1002 Finished deploy [airflow-dags/research@5edcd7b]: (no justification provided) (duration: 00m 10s)
[17:05:46] <wikibugs>	 (03PS1) 10BCornwall: ntp/eqsin: set to dns6002 [dns] - 10https://gerrit.wikimedia.org/r/891861
[17:06:47] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] ntp/eqsin: set to dns6002 [dns] - 10https://gerrit.wikimedia.org/r/891861 (owner: 10BCornwall)
[17:07:17] <wikibugs>	 (03CR) 10BCornwall: [C: 03+2] ntp/eqsin: set to dns6002 [dns] - 10https://gerrit.wikimedia.org/r/891861 (owner: 10BCornwall)
[17:08:25] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1003.eqiad.wmnet with OS bullseye
[17:08:34] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1003.eqiad.wmnet with OS bullseye
[17:08:45] <wikibugs>	 (03PS2) 10Jaime Nuche: scap: add required Python3 venv package [puppet] - 10https://gerrit.wikimedia.org/r/891837 (https://phabricator.wikimedia.org/T329622)
[17:09:23] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1003.eqiad.wmnet with OS bullseye
[17:09:30] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1003.eqiad.wmnet with OS bullseye
[17:12:40] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1004.eqiad.wmnet with OS bullseye
[17:15:31] <wikibugs>	 (03PS1) 10Ssingh: ntp/eqsin: set ntp.eqsin to dns5003 [dns] - 10https://gerrit.wikimedia.org/r/891862
[17:16:57] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[17:17:26] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] "Missed this in the review but for posterity, it should read ntp/drmrs as it is dns6002." [dns] - 10https://gerrit.wikimedia.org/r/891861 (owner: 10BCornwall)
[17:25:34] <wikibugs>	 (03PS1) 10EoghanGaffney: Switch active gitlab host from gitlab1004 (eqiad) -> gitlab2002 (codfw) [puppet] - 10https://gerrit.wikimedia.org/r/891863 (https://phabricator.wikimedia.org/T329931)
[17:34:59] <wikibugs>	 (03PS1) 10EoghanGaffney: Lower TTL on gitlab records to 300 seconds to facilitate failover [dns] - 10https://gerrit.wikimedia.org/r/891886 (https://phabricator.wikimedia.org/T329931)
[17:39:05] <wikibugs>	 (03Abandoned) 10Ssingh: hiera: update Traffic cloud instances hieradata for digicert-2022 [puppet] - 10https://gerrit.wikimedia.org/r/856955 (owner: 10Ssingh)
[17:39:43] <wikibugs>	 (03PS4) 10Ssingh: hiera: enable haproxy systemd hardening on cp4045 [puppet] - 10https://gerrit.wikimedia.org/r/863332 (https://phabricator.wikimedia.org/T323944)
[17:46:18] <wikibugs>	 (03PS1) 10EoghanGaffney: Update records for gitlab from gitlab1004 -> gitlab2002 [dns] - 10https://gerrit.wikimedia.org/r/891888 (https://phabricator.wikimedia.org/T329931)
[17:57:40] <wikibugs>	 (03PS2) 10EoghanGaffney: Switch active gitlab host from gitlab1004 (eqiad) -> gitlab2002 (codfw) [puppet] - 10https://gerrit.wikimedia.org/r/891863 (https://phabricator.wikimedia.org/T329931)
[17:59:44] <wikibugs>	 (03CR) 10RLazarus: [C: 03+2] "Merging per IRC discussion, thanks for the review!" [puppet] - 10https://gerrit.wikimedia.org/r/890299 (https://phabricator.wikimedia.org/T288867) (owner: 10RLazarus)
[18:00:53] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1004.eqiad.wmnet with OS bullseye
[18:03:56] <wikibugs>	 (03CR) 10Dzahn: "no offense, but I think it's better to remove myself as reviewer than to give the impression that I am going to merge them." [puppet] - 10https://gerrit.wikimedia.org/r/527933 (https://phabricator.wikimedia.org/T36217) (owner: 10Fomafix)
[18:06:25] <wikibugs>	 (03CR) 10Dzahn: "I'm afraid to get these merged 2 things are needed: a) some link to community consensus that this is agreed that can be pasted   b) contac" [puppet] - 10https://gerrit.wikimedia.org/r/527909 (https://phabricator.wikimedia.org/T25216) (owner: 10Fomafix)
[18:09:02] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer
[18:09:53] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer
[18:10:16] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
[18:13:24] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer
[18:14:57] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host dns6001.wikimedia.org with OS bullseye
[18:15:13] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host dns6001.wikimedia.org with OS bullseye
[18:19:00] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
[18:20:31] <icinga-wm>	 PROBLEM - Host 2a02:ec80:600:1:185:15:58:5 is DOWN: CRITICAL - Destination Unreachable (2a02:ec80:600:1:185:15:58:5)
[18:20:42] <wikibugs>	 (03PS1) 10Dzahn: service-catalog: add planet service [puppet] - 10https://gerrit.wikimedia.org/r/891894
[18:20:51] <icinga-wm>	 PROBLEM - BFD status on asw1-b12-drmrs.mgmt is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:20:59] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer
[18:21:03] <icinga-wm>	 PROBLEM - BGP status on asw1-b12-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:24:35] <wikibugs>	 (03PS1) 10Dzahn: service-catalog: add people service [puppet] - 10https://gerrit.wikimedia.org/r/891895
[18:24:41] <icinga-wm>	 PROBLEM - Recursive DNS on 185.15.58.5 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS
[18:26:13] <wikibugs>	 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10conftool, and 2 others: Scap deploy failed to depool codfw servers - https://phabricator.wikimedia.org/T327041 (10Papaul)
[18:26:42] <RhinosF1>	 Is drmrs known?
[18:26:54] <wikibugs>	 10SRE, 10SRE-OnFire, 10ops-codfw, 10Sustainability (Incident Followup): asw-b2-codfw down - https://phabricator.wikimedia.org/T327001 (10Papaul) 05Open→03Resolved switch received and added to Netbox
[18:28:06] <RhinosF1>	 sukhe: see asw1-b12-drmrs alert. Or anyone else from traffic.
[18:28:15] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (10Papaul)
[18:30:52] <sukhe>	 r
[18:31:11] <RhinosF1>	 r?
[18:31:13] <sukhe>	 RhinosF1: thanks, expected because of the dnsreimaging
[18:31:20] <RhinosF1>	 sukhe: cool
[18:32:05] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs2022.codfw.wmnet with OS bullseye
[18:32:12] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host wdqs2022.codfw.wmnet with OS bullseye
[18:32:16] <sukhe>	 dns6001 specifically that brett is doing
[18:32:45] <brett>	 oh, shit, sorry, I missed this message
[18:32:50] <brett>	 Sorry for any alarm
[18:33:21] <sukhe>	 np, we can't avoid these alerts anyway :)
[18:33:21] <RhinosF1>	 It’s good
[18:34:51] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns6001.wikimedia.org with reason: host reimage
[18:37:56] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns6001.wikimedia.org with reason: host reimage
[18:38:21] <logmsgbot>	 !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wdqs2022.codfw.wmnet with OS bullseye
[18:38:28] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host wdqs2022.codfw.wmnet with OS bullseye executed with errors: - wdq...
[18:39:04] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (10Papaul) @Jhancock.wm can you also check the network cable on wdqs2022.
[18:43:04] <wikibugs>	 (03CR) 10RLazarus: [C: 03+1] sre.switchdc.mediawiki: use python warmup script [cookbooks] - 10https://gerrit.wikimedia.org/r/891520 (https://phabricator.wikimedia.org/T288867) (owner: 10Clément Goubert)
[18:44:06] <icinga-wm>	 PROBLEM - Recursive DNS on 185.15.58.5 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS
[18:48:34] <sukhe>	 ^expected
[18:48:54] <icinga-wm>	 RECOVERY - Recursive DNS on 185.15.58.5 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS
[18:51:34] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[18:54:42] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for new ms-fe and thanos nodes - pt1979@cumin2002"
[18:54:58] <jinxer-wm>	 (KubernetesCalicoDown) firing: dse-k8s-worker1007.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1007.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[18:55:49] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for new ms-fe and thanos nodes - pt1979@cumin2002"
[18:55:49] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:56:13] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ms-fe2013.mgmt.codfw.wmnet with reboot policy FORCED
[18:57:56] <icinga-wm>	 RECOVERY - BFD status on asw1-b12-drmrs.mgmt is OK: OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:58:10] <icinga-wm>	 RECOVERY - BGP status on asw1-b12-drmrs.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:00:27] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns6001.wikimedia.org with OS bullseye
[19:00:37] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host dns6001.wikimedia.org with OS bullseye completed: - dns6001 (**PASS**)   - Downtimed on Icinga/Al...
[19:02:57] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ms-fe2014.mgmt.codfw.wmnet with reboot policy FORCED
[19:04:02] <icinga-wm>	 RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[19:04:19] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host thanos-fe2004.mgmt.codfw.wmnet with reboot policy FORCED
[19:05:18] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:09:49] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall)
[19:10:16] <wikibugs>	 (03CR) 10BCornwall: [C: 03+2] ntp/eqsin: set ntp.eqsin to dns5003 [dns] - 10https://gerrit.wikimedia.org/r/891862 (owner: 10Ssingh)
[19:10:40] <wikibugs>	 (03PS1) 10BCornwall: Revert "ntp/eqsin: set to dns6002" [dns] - 10https://gerrit.wikimedia.org/r/891636
[19:11:01] <wikibugs>	 (03PS2) 10BCornwall: Revert "ntp/eqsin: set to dns6002" [dns] - 10https://gerrit.wikimedia.org/r/891636
[19:11:02] <icinga-wm>	 RECOVERY - Check that envoy is running on idm2001 is OK: OK - envoyproxy.service is active https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[19:11:16] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-fe2013.mgmt.codfw.wmnet with reboot policy FORCED
[19:11:18] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-fe2014.mgmt.codfw.wmnet with reboot policy FORCED
[19:12:29] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] Revert "ntp/eqsin: set to dns6002" [dns] - 10https://gerrit.wikimedia.org/r/891636 (owner: 10BCornwall)
[19:14:05] <wikibugs>	 (03CR) 10BCornwall: [C: 03+2] Revert "ntp/eqsin: set to dns6002" [dns] - 10https://gerrit.wikimedia.org/r/891636 (owner: 10BCornwall)
[19:14:56] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-fe2013']
[19:14:58] <logmsgbot>	 !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['ms-fe2013']
[19:15:06] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-fe2013']
[19:18:02] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host dns3002.wikimedia.org with OS bullseye
[19:18:13] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host dns3002.wikimedia.org with OS bullseye
[19:18:24] <brett>	 Doing some esams dns reimaging
[19:18:53] <mutante>	 ack, thx
[19:18:57] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-fe2014']
[19:19:39] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-fe2004.mgmt.codfw.wmnet with reboot policy FORCED
[19:20:32] <icinga-wm>	 PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:20:58] <icinga-wm>	 PROBLEM - Host ns2-v4 is DOWN: PING CRITICAL - Packet loss = 100%
[19:21:34] <icinga-wm>	 PROBLEM - Host 2620:0:862:1:91:198:174:62 is DOWN: PING CRITICAL - Packet loss = 100%
[19:21:48] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['thanos-fe2004']
[19:23:02] <icinga-wm>	 PROBLEM - BFD status on cr3-esams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:23:34] <icinga-wm>	 PROBLEM - BGP status on cr3-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:23:50] <icinga-wm>	 PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:24:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job pdnsrec in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:25:48] <icinga-wm>	 PROBLEM - Recursive DNS on 91.198.174.62 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS
[19:28:25] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['ms-fe2013']
[19:29:38] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['ms-fe2014']
[19:29:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job haproxy in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:30:14] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:rack/setup/install ms-fe2013 - ms-fe2014, thanos-fe2004 - https://phabricator.wikimedia.org/T326848 (10Papaul) @jbond  ` poweredge-r450: picking DellDriverCategory.BIOS update file We have found multiple entries please pick from the list below: 0: /srv/...
[19:32:58] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['thanos-fe2004']
[19:33:24] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-fe2013']
[19:34:45] <jinxer-wm>	 (JobUnavailable) resolved: (2) Reduced availability for job haproxy in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:35:13] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-fe2014']
[19:36:24] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['thanos-fe2004']
[19:36:27] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
[19:36:35] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:rack/setup/install ms-fe2013 - ms-fe2014, thanos-fe2004 - https://phabricator.wikimedia.org/T326848 (10Papaul)
[20:06:52] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
[20:08:40] <wikibugs>	 (03PS1) 10Bking: wdqs.data-transfer: completely remove defunct argument [cookbooks] - 10https://gerrit.wikimedia.org/r/891899 (https://phabricator.wikimedia.org/T323096)
[20:11:34] <icinga-wm>	 PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[20:11:47] <logmsgbot>	 !log brett@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dns3002.wikimedia.org with OS bullseye
[20:11:56] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host dns3002.wikimedia.org with OS bullseye executed with errors: - dns3002 (**FAIL**)   - Downtimed o...
[20:12:50] <wikibugs>	 (03PS2) 10Bking: wdqs.data-transfer: completely remove defunct argument [cookbooks] - 10https://gerrit.wikimedia.org/r/891899 (https://phabricator.wikimedia.org/T323096)
[20:13:14] <icinga-wm>	 RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[20:25:15] <wikibugs>	 (03CR) 10AOkoth: [C: 03+1] clamd.conf: Remove some config entries [puppet] - 10https://gerrit.wikimedia.org/r/890799 (https://phabricator.wikimedia.org/T330129) (owner: 10Muehlenhoff)
[20:32:59] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host dns3002.wikimedia.org with OS bullseye
[20:33:09] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host dns3002.wikimedia.org with OS bullseye
[20:33:48] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['ms-fe2013']
[20:33:50] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['thanos-fe2004']
[20:33:55] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['ms-fe2014']
[20:44:45] <wikibugs>	 (PS3) Bking: wdqs.data-transfer: completely remove defunct argument [cookbooks] - https://gerrit.wikimedia.org/r/891899 (https://phabricator.wikimedia.org/T323096)
[20:45:56] <logmsgbot>	 !log fab@deploy1002 Started deploy [airflow-dags/research@5edcd7b]: (no justification provided)
[20:46:01] <wikibugs>	 (CR) Ryan Kemper: [C: +1] wdqs.data-transfer: completely remove defunct argument [cookbooks] - https://gerrit.wikimedia.org/r/891899 (https://phabricator.wikimedia.org/T323096) (owner: Bking)
[20:46:16] <logmsgbot>	 !log fab@deploy1002 Finished deploy [airflow-dags/research@5edcd7b]: (no justification provided) (duration: 00m 19s)
[20:49:29] <wikibugs>	 (PS4) Bking: wdqs.data-transfer: completely replace defunct argument [cookbooks] - https://gerrit.wikimedia.org/r/891899 (https://phabricator.wikimedia.org/T323096)
[20:52:03] <wikibugs>	 (CR) Bking: [C: +2] wdqs.data-transfer: completely replace defunct argument [cookbooks] - https://gerrit.wikimedia.org/r/891899 (https://phabricator.wikimedia.org/T323096) (owner: Bking)
[20:52:34] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns3002.wikimedia.org with reason: host reimage
[20:53:50] <wikibugs>	 (Merged) jenkins-bot: wdqs.data-transfer: completely replace defunct argument [cookbooks] - https://gerrit.wikimedia.org/r/891899 (https://phabricator.wikimedia.org/T323096) (owner: Bking)
[20:55:46] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns3002.wikimedia.org with reason: host reimage
[20:55:56] <icinga-wm>	 PROBLEM - Disk space on people2002 is CRITICAL: DISK CRITICAL - free space: / 2871 MB (3% inode=88%): /tmp 2871 MB (3% inode=88%): /var/tmp 2871 MB (3% inode=88%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=people2002&var-datasource=codfw+prometheus/ops
[20:57:09] <mutante>	 ^ yea, it's me :)
[20:57:24] <mutante>	 i got away with 97% and gotta clean it up somehow
[20:58:59] <logmsgbot>	 !log fab@deploy1002 Started deploy [airflow-dags/research@5edcd7b]: (no justification provided)
[20:59:10] <logmsgbot>	 !log fab@deploy1002 Finished deploy [airflow-dags/research@5edcd7b]: (no justification provided) (duration: 00m 10s)
[21:03:38] <icinga-wm>	 RECOVERY - Host ns2-v4 is UP: PING OK - Packet loss = 0%, RTA = 80.97 ms
[21:05:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job haproxy in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[21:06:00] <mutante>	 !log ganeti2021 - adding a virtual 20G disk to people2002 - to temp get some space for backups and syncing T330091
[21:06:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:06:05] <stashbot>	 T330091: Switchover People and Planet services to codfw - https://phabricator.wikimedia.org/T330091
[21:10:45] <jinxer-wm>	 (JobUnavailable) resolved: (2) Reduced availability for job haproxy in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[21:11:36] <icinga-wm>	 PROBLEM - Host ns2-v4 is DOWN: PING CRITICAL - Packet loss = 100%
[21:12:52] <icinga-wm>	 RECOVERY - Host ns2-v4 is UP: PING OK - Packet loss = 0%, RTA = 81.05 ms
[21:14:50] <icinga-wm>	 RECOVERY - BFD status on cr2-esams is OK: OK: UP: 18 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:14:56] <icinga-wm>	 RECOVERY - BGP status on cr3-esams is OK: BGP OK - up: 20, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[21:15:30] <icinga-wm>	 RECOVERY - BFD status on cr3-esams is OK: OK: UP: 17 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:17:49] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns3002.wikimedia.org with OS bullseye
[21:17:58] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host dns3002.wikimedia.org with OS bullseye completed: - dns3002 (**WARN**)   - Removed from Puppet an...
[21:19:21] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (BCornwall)
[21:19:52] <mutante>	 !log rebooting people2002
[21:19:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:24:37] <wikibugs>	 (PS1) Papaul: Ad new ms-fe and thanos-fe node to site.pp [puppet] - https://gerrit.wikimedia.org/r/891914 (https://phabricator.wikimedia.org/T326848)
[21:25:56] <brett>	 Okay, dns work in esams is done for the week
[21:26:12] <mutante>	 !log people2002 - performing the usual dance when device names changed after editing virtual hardware (s/ens13/ens14 in /etc/network/interfaces ... reboot)
[21:26:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:26:17] <RhinosF1>	 enjoy your weekend brett 
[21:26:28] <brett>	 Thanks! You too :)
[21:26:36] <RhinosF1>	 :)
[21:31:54] <wikibugs>	 (PS2) Papaul: Ad new ms-fe and thanos-fe node to site.pp [puppet] - https://gerrit.wikimedia.org/r/891914 (https://phabricator.wikimedia.org/T326848)
[21:34:33] <wikibugs>	 (CR) Papaul: [C: +2] Ad new ms-fe and thanos-fe node to site.pp [puppet] - https://gerrit.wikimedia.org/r/891914 (https://phabricator.wikimedia.org/T326848) (owner: Papaul)
[21:37:16] <icinga-wm>	 RECOVERY - Disk space on people2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=people2002&var-datasource=codfw+prometheus/ops
[21:54:13] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ms-fe2013.codfw.wmnet with OS bullseye
[21:54:25] <wikibugs>	 SRE, SRE-swift-storage, ops-codfw, DC-Ops, Patch-For-Review: Q3:rack/setup/install ms-fe2013 - ms-fe2014, thanos-fe2004 - https://phabricator.wikimedia.org/T326848 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-fe2013.codfw.wmnet with OS...
[22:12:19] <wikibugs>	 (PS1) Dzahn: peopleweb: add bacula file set srv-wikimedia [puppet] - https://gerrit.wikimedia.org/r/891920
[22:15:40] <wikibugs>	 (PS2) Dzahn: peopleweb: add bacula file set srv-wikimedia [puppet] - https://gerrit.wikimedia.org/r/891920
[22:15:58] <wikibugs>	 (PS3) Dzahn: peopleweb: add bacula file set srv-org-wikimedia [puppet] - https://gerrit.wikimedia.org/r/891920
[22:18:42] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-fe2013.codfw.wmnet with reason: host reimage
[22:21:51] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-fe2013.codfw.wmnet with reason: host reimage
[22:40:01] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[22:49:24] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[22:49:25] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe2013.codfw.wmnet with OS bullseye
[22:49:31] <wikibugs>	 SRE, SRE-swift-storage, ops-codfw, DC-Ops: Q3:rack/setup/install ms-fe2013 - ms-fe2014, thanos-fe2004 - https://phabricator.wikimedia.org/T326848 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-fe2013.codfw.wmnet with OS bullseye completed: - ms-f...
[22:52:38] <wikibugs>	 SRE, SRE-swift-storage, ops-codfw, DC-Ops: Q3:rack/setup/install ms-fe2013 - ms-fe2014, thanos-fe2004 - https://phabricator.wikimedia.org/T326848 (Papaul)
[22:54:58] <jinxer-wm>	 (KubernetesCalicoDown) firing: dse-k8s-worker1007.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1007.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[23:15:49] <mutante>	 !log people2002 - for each user who has a public_html dir that is not empty (for pubdir in $(find . -name public_html -type d -not -empty); ..); rsync it from people1003 with --delete (rsync -avp rsync://people1003.eqiad.wmnet/people-home/${pubdiruser}/public_html/ /home/${pubdiruser}/public_html/); T330091
[23:15:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:15:56] <stashbot>	 T330091: Switchover People and Planet services to codfw - https://phabricator.wikimedia.org/T330091
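Note on the rsync step logged at 23:15:49: the loop body in that message is elided ("..") and is left as-is. The following is only a minimal dry-run sketch of the described idea, assuming one public_html directory per user directly under /home (the logged `find` searches recursively) and assuming `--delete` is passed as the message says, although the quoted command omits it. The function names are hypothetical; nothing here is the script that was actually run.

```python
from pathlib import Path

# Source module path as quoted in the log message.
RSYNC_SOURCE = "rsync://people1003.eqiad.wmnet/people-home"


def nonempty_public_html_users(home_root):
    """Return user names whose <home_root>/<user>/public_html is a non-empty dir.

    Assumption: only one directory level is checked, unlike the recursive find.
    """
    users = []
    for pubdir in sorted(Path(home_root).glob("*/public_html")):
        if pubdir.is_dir() and any(pubdir.iterdir()):
            users.append(pubdir.parent.name)
    return users


def rsync_command(user, home_root="/home"):
    """Build (but do not run) the rsync argv for one user's public_html."""
    return [
        "rsync", "-avp", "--delete",  # --delete assumed from the log wording
        f"{RSYNC_SOURCE}/{user}/public_html/",
        f"{home_root}/{user}/public_html/",
    ]


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for user in nonempty_public_html_users("/home"):
        print(" ".join(rsync_command(user)))
```

Printing the commands first makes it possible to review which home directories would be touched before anything is synced with `--delete`.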
[23:31:28] <wikibugs>	 (PS1) Dzahn: httpbb: add /~cdanis/sremap to test URLs for people.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/891927 (https://phabricator.wikimedia.org/T330091)
[23:32:16] <wikibugs>	 (CR) Dzahn: [C: +2] "[deploy1002:~] $ httpbb --hosts people2002.codfw.wmnet /tmp/test_people.yaml" [puppet] - https://gerrit.wikimedia.org/r/891927 (https://phabricator.wikimedia.org/T330091) (owner: Dzahn)
[23:33:42] <wikibugs>	 (PS2) Dzahn: httpbb: add /~cdanis/sremap to test URLs for people.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/891927 (https://phabricator.wikimedia.org/T330091)
[23:34:07] <wikibugs>	 (CR) Dzahn: [V: +2] httpbb: add /~cdanis/sremap to test URLs for people.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/891927 (https://phabricator.wikimedia.org/T330091) (owner: Dzahn)
[23:37:40] <wikibugs>	 (CR) Dzahn: [C: +2] peopleweb: switch rsync source and dest between eqiad and codfw [puppet] - https://gerrit.wikimedia.org/r/891382 (https://phabricator.wikimedia.org/T330091) (owner: Dzahn)
[23:37:47] <wikibugs>	 (PS2) Arlolra: Enabled native gallery editing in Parsoid [mediawiki-config] - https://gerrit.wikimedia.org/r/889257 (https://phabricator.wikimedia.org/T329662)
[23:41:53] <wikibugs>	 (CR) Dzahn: "I have rsynced any public_html dir that was not empty but have not touched anything outside public_html dirs. Also we have httpbb tests an" [dns] - https://gerrit.wikimedia.org/r/891381 (https://phabricator.wikimedia.org/T330091) (owner: Dzahn)
[23:44:00] <wikibugs>	 (PS4) Dzahn: switch peopleweb from eqiad to codfw [dns] - https://gerrit.wikimedia.org/r/891381 (https://phabricator.wikimedia.org/T330091)
[23:44:26] <wikibugs>	 (CR) Dzahn: [C: +2] "also double checked people2002 is on the SANs of the TLS cert for envoy" [dns] - https://gerrit.wikimedia.org/r/891381 (https://phabricator.wikimedia.org/T330091) (owner: Dzahn)
[23:45:30] <wikibugs>	 (CR) Dzahn: [C: +2] "[deploy1002:~] $ httpbb --hosts people2002.codfw.wmnet /srv/deployment/httpbb-tests/people/test_people.yaml" [dns] - https://gerrit.wikimedia.org/r/891381 (https://phabricator.wikimedia.org/T330091) (owner: Dzahn)
[23:51:40] <wikibugs>	 SRE, Data-Persistence, serviceops, Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (Dzahn)