[00:13:18] !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: apply JRE updates - bking@cumin1001 - T329957 [00:13:22] T329957: Restart Elastic/Blazegraph services to pick up JRE updates - https://phabricator.wikimedia.org/T329957 [00:26:00] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [00:27:46] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [00:42:14] (03PS1) 10Dzahn: add metafo records for planet [dns] - 10https://gerrit.wikimedia.org/r/891730 (https://phabricator.wikimedia.org/T330091) [00:42:28] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) timed out before a response was received: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [00:44:18] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [00:45:53] (03PS1) 10Dzahn: add metafo records for people.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/891731 (https://phabricator.wikimedia.org/T330091) [00:47:33] (03PS2) 10Dzahn: add metafo records for people.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/891731 (https://phabricator.wikimedia.org/T330091) [00:47:54] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [00:47:55] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs2013.codfw.wmnet with OS bullseye [00:48:02] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - 
https://phabricator.wikimedia.org/T326689 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host wdqs2013.codfw.wmnet with OS bullseye completed: - wdqs2013 (**PA... [00:49:08] (03PS1) 10Dzahn: drop people.eqiad.wmnet service alias [dns] - 10https://gerrit.wikimedia.org/r/891732 [00:51:14] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:51:18] PROBLEM - puppet last run on puppetdb2003 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [00:51:20] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:03:02] (03PS2) 10Krinkle: Remove redundant wgOriginTrials and wgFeaturePolicyReportOnly settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891314 [01:03:04] (03PS1) 10Krinkle: Move etcd.php from wmf-config/ to src/ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891733 [01:03:06] (03PS1) 10Krinkle: noc: Clarify menu labels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891734 [01:03:43] (03PS2) 10Krinkle: drop people.eqiad.wmnet service alias [dns] - 10https://gerrit.wikimedia.org/r/891732 (owner: 10Dzahn) [01:04:12] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs2014.codfw.wmnet with OS bullseye [01:04:20] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host wdqs2014.codfw.wmnet with OS bullseye [01:06:52] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for 
host wdqs2015.codfw.wmnet with OS bullseye [01:06:58] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host wdqs2015.codfw.wmnet with OS bullseye [01:10:03] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs2016.codfw.wmnet with OS bullseye [01:10:09] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host wdqs2016.codfw.wmnet with OS bullseye [01:13:30] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs2017.codfw.wmnet with OS bullseye [01:13:36] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host wdqs2017.codfw.wmnet with OS bullseye [01:24:54] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs2014.codfw.wmnet with reason: host reimage [01:28:03] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs2014.codfw.wmnet with reason: host reimage [01:30:46] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs2016.codfw.wmnet with reason: host reimage [01:33:50] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs2016.codfw.wmnet with reason: host reimage [01:34:22] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs2017.codfw.wmnet with reason: host reimage [01:37:27] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs2017.codfw.wmnet with reason: host 
reimage [01:43:11] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [01:48:04] !log pt1979@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wdqs2015 [01:48:06] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wdqs2015 [01:49:30] !log pt1979@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wdqs2015 [01:49:32] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wdqs2015 [01:49:46] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [01:50:07] RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 62, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:50:23] RECOVERY - Router interfaces on cr3-esams is OK: OK: host 91.198.174.245, interfaces up: 84, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:50:29] !log pt1979@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wdqs2015 [01:51:07] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wdqs2015 [01:51:12] !log pt1979@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wdqs2015 [01:51:46] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wdqs2015 [01:53:05] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [01:53:45] !log pt1979@cumin2002 END (FAIL) - 
Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [01:53:46] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs2016.codfw.wmnet with OS bullseye [01:53:46] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [01:53:47] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs2014.codfw.wmnet with OS bullseye [01:53:51] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host wdqs2016.codfw.wmnet with OS bullseye completed: - wdqs2016 (**PA... [01:53:56] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host wdqs2014.codfw.wmnet with OS bullseye completed: - wdqs2014 (**PA... [01:55:35] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (10Papaul) @Jhancock.wm when you are back on site can you please check wdqs2015 it looks like i have no network cable connected to it. 
Thanks [01:56:17] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [01:56:18] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs2017.codfw.wmnet with OS bullseye [01:56:26] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host wdqs2017.codfw.wmnet with OS bullseye completed: - wdqs2017 (**PA... [02:02:46] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs2018.codfw.wmnet with OS bullseye [02:02:54] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host wdqs2018.codfw.wmnet with OS bullseye [02:03:10] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs2015.codfw.wmnet with OS bullseye [02:03:17] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host wdqs2015.codfw.wmnet with OS bullseye executed with errors: - wdq... 
[02:05:09] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs2019.codfw.wmnet with OS bullseye [02:05:21] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host wdqs2019.codfw.wmnet with OS bullseye [02:06:45] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:18:03] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wdqs2018.codfw.wmnet with OS bullseye [02:18:10] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host wdqs2018.codfw.wmnet with OS bullseye executed with errors: - wdq... 
[02:18:22] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs2018.codfw.wmnet with OS bullseye [02:18:29] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host wdqs2018.codfw.wmnet with OS bullseye [02:21:45] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:41:54] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs2020.codfw.wmnet with OS bullseye [02:42:00] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host wdqs2020.codfw.wmnet with OS bullseye [02:45:15] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs2018.codfw.wmnet with reason: host reimage [02:48:27] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs2018.codfw.wmnet with reason: host reimage [02:51:09] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:53:38] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs2019.codfw.wmnet with reason: host reimage [02:56:48] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs2019.codfw.wmnet with reason: host reimage [02:58:11] 
!log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wdqs2020.codfw.wmnet with OS bullseye [02:58:17] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host wdqs2020.codfw.wmnet with OS bullseye executed with errors: - wdq... [02:58:34] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs2020.codfw.wmnet with OS bullseye [02:58:41] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host wdqs2020.codfw.wmnet with OS bullseye [03:04:43] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [03:08:04] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs2021.codfw.wmnet with OS bullseye [03:08:12] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host wdqs2021.codfw.wmnet with OS bullseye [03:12:00] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [03:12:01] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs2018.codfw.wmnet with OS bullseye [03:12:08] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host wdqs2018.codfw.wmnet with OS bullseye completed: - wdqs2018 (**PA... [03:12:31] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [03:13:50] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [03:13:51] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs2019.codfw.wmnet with OS bullseye [03:13:57] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host wdqs2019.codfw.wmnet with OS bullseye completed: - wdqs2019 (**PA... [03:28:57] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs2021.codfw.wmnet with reason: host reimage [03:32:08] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs2021.codfw.wmnet with reason: host reimage [03:47:17] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs2020.codfw.wmnet with OS bullseye [03:47:23] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host wdqs2020.codfw.wmnet with OS bullseye executed with errors: - wdq... 
[03:47:41] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [03:49:44] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [03:49:45] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs2021.codfw.wmnet with OS bullseye [03:49:51] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host wdqs2021.codfw.wmnet with OS bullseye completed: - wdqs2021 (**PA... [03:50:19] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (10Papaul) [06:51:09] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:00:04] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230224T0700) [07:10:58] (03CR) 10Slyngshede: SUL account linking (034 comments) [software/bitu] - 10https://gerrit.wikimedia.org/r/888003 (https://phabricator.wikimedia.org/T320807) (owner: 10Slyngshede) [07:30:33] (03PS5) 10Slyngshede: SUL account linking [software/bitu] - 10https://gerrit.wikimedia.org/r/888003 (https://phabricator.wikimedia.org/T320807) [07:39:21] (03CR) 10Alexandros Kosiaris: "Couple of minor comments, otherwise LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/890299 (https://phabricator.wikimedia.org/T288867) (owner: 
10RLazarus) [07:50:28] PROBLEM - Disk space on kubestagetcd1006 is CRITICAL: DISK CRITICAL - free space: / 711 MB (3% inode=95%): /tmp 711 MB (3% inode=95%): /var/tmp 711 MB (3% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=kubestagetcd1006&var-datasource=eqiad+prometheus/ops [07:52:18] !log rm /var/log/{syslog,messages,user.log}.1 on kubetcd1006 to free up space - T329717 [07:52:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:23] T329717: Switch wikikube-staging (codfw and eqiad) etcd clusters to use PKI - https://phabricator.wikimedia.org/T329717 [07:56:00] (03PS1) 10Elukey: role::etcd::v3::kubernetes::staging: move certs to PKI [puppet] - 10https://gerrit.wikimedia.org/r/891749 (https://phabricator.wikimedia.org/T329717) [07:56:54] akosiaris: o/ if you have a moment --^ [07:57:37] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39822/console" [puppet] - 10https://gerrit.wikimedia.org/r/891749 (https://phabricator.wikimedia.org/T329717) (owner: 10Elukey) [07:58:50] mmm weird kubestagetcd in codfw shows no changes.. [07:59:55] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on 8 hosts with reason: Downtime DSE workers for cluster upgrade [08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230224T0800) [08:00:14] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on 8 hosts with reason: Downtime DSE workers for cluster upgrade [08:04:35] (03PS6) 10Elukey: role::dse_k8s::{master,worker}: update settings to k8s 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/891280 (https://phabricator.wikimedia.org/T330261) [08:06:02] !log elukey@cumin1001 START - Cookbook sre.k8s.upgrade-cluster Upgrade K8s version: Upgrade to k8s 1.23 [08:09:23] (03CR) 10Elukey: [C: 03+2] role::dse_k8s::{master,worker}: update settings to k8s 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/891280 (https://phabricator.wikimedia.org/T330261) (owner: 10Elukey) [08:09:25] (03PS1) 10Slyngshede: LOGIN: Add custom WikiMedia SSO login page. [software/bitu] - 10https://gerrit.wikimedia.org/r/891797 [08:10:44] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - dse-k8s-ctrl_6443: Servers dse-k8s-ctrl1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:10:47] !log elukey@cumin1001 START - Cookbook sre.ganeti.reimage for host dse-k8s-ctrl1001.eqiad.wmnet with OS bullseye [08:11:06] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - dse-k8s-ctrl_6443: Servers dse-k8s-ctrl1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:11:16] RECOVERY - Disk space on kubestagetcd1006 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=kubestagetcd1006&var-datasource=eqiad+prometheus/ops [08:14:54] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39823/console" [puppet] - 10https://gerrit.wikimedia.org/r/891749 (https://phabricator.wikimedia.org/T329717) (owner: 10Elukey) [08:21:34] !log 
elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-ctrl1001.eqiad.wmnet with reason: host reimage [08:24:01] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-ctrl1001.eqiad.wmnet with reason: host reimage [08:25:34] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good! At some point we should add a brief introduction doc which explains which part is WMF-specific and what local config changes a" [software/bitu] - 10https://gerrit.wikimedia.org/r/891797 (owner: 10Slyngshede) [08:26:21] (03CR) 10Elukey: [V: 03+1] "Not sure why but pcc shows me a diff only for the eqiad nodes, not for the codfw ones.." [puppet] - 10https://gerrit.wikimedia.org/r/891749 (https://phabricator.wikimedia.org/T329717) (owner: 10Elukey) [08:32:30] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:36:28] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:36:37] 10SRE, 10SRE-Access-Requests, 10Research: Requesting access to analytics-privatedata-users for Bruno Scarone - https://phabricator.wikimedia.org/T330364 (10SLyngshede-WMF) p:05Triage→03Medium [08:37:48] !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host dse-k8s-ctrl1001.eqiad.wmnet with OS bullseye [08:38:42] (03PS1) 10Slyngshede: Access to analytics-privatedata-users for bscarone [puppet] - 10https://gerrit.wikimedia.org/r/891798 (https://phabricator.wikimedia.org/T330364) [08:40:06] !log elukey@cumin1001 START - Cookbook sre.ganeti.reimage for host dse-k8s-ctrl1002.eqiad.wmnet with OS bullseye [08:43:32] (03CR) 10Alexandros Kosiaris: [C: 03+1] role::etcd::v3::kubernetes::staging: move certs to PKI [puppet] - 10https://gerrit.wikimedia.org/r/891749 (https://phabricator.wikimedia.org/T329717) (owner: 10Elukey) [08:49:46] 10SRE, 10SRE-Access-Requests, 10Research, 
10Patch-For-Review: Requesting access to analytics-privatedata-users for Bruno Scarone - https://phabricator.wikimedia.org/T330364 (10SLyngshede-WMF) Approval required by: @Ottomata or @odimitrijevic [08:51:39] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-ctrl1002.eqiad.wmnet with reason: host reimage [08:54:07] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-ctrl1002.eqiad.wmnet with reason: host reimage [08:54:31] 10SRE, 10SRE-Access-Requests, 10Research, 10Patch-For-Review: Requesting access to analytics-privatedata-users for Bruno Scarone - https://phabricator.wikimedia.org/T330364 (10SLyngshede-WMF) @KFrancis Given that this is a reactivation of an account, I would assume that Bruno at some point signed an NDA, b... [08:54:39] (03CR) 10Filippo Giunchedi: "Bummer re: runtimerandomizedextrasec 😞" [puppet] - 10https://gerrit.wikimedia.org/r/891394 (https://phabricator.wikimedia.org/T327161) (owner: 10Cwhite) [08:56:52] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users and Kerberos identity for RMaung - https://phabricator.wikimedia.org/T330335 (10SLyngshede-WMF) p:05Triage→03Medium [08:57:29] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-private-data for Fgoodwin - https://phabricator.wikimedia.org/T329404 (10SLyngshede-WMF) 05Open→03Resolved p:05Triage→03Medium [08:57:55] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment for Sergio Gimeno - https://phabricator.wikimedia.org/T330070 (10SLyngshede-WMF) [08:58:14] (03CR) 10Elukey: [V: 03+1 C: 03+2] role::etcd::v3::kubernetes::staging: move certs to PKI [puppet] - 10https://gerrit.wikimedia.org/r/891749 (https://phabricator.wikimedia.org/T329717) (owner: 10Elukey) [08:58:15] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment for Sergio Gimeno - https://phabricator.wikimedia.org/T330070 (10SLyngshede-WMF) 
p:05Triage→03Medium [09:01:46] (03PS7) 10Vgutierrez: acme_chief: support several passive hosts [puppet] - 10https://gerrit.wikimedia.org/r/888652 (https://phabricator.wikimedia.org/T321309) [09:08:41] !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host dse-k8s-ctrl1002.eqiad.wmnet with OS bullseye [09:08:43] !log rm /var/log/{syslog,messages,user.log}.1 on kubetcd1005 to free up space - T329717 [09:08:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:48] T329717: Switch wikikube-staging (codfw and eqiad) etcd clusters to use PKI - https://phabricator.wikimedia.org/T329717 [09:08:56] (03CR) 10Filippo Giunchedi: [C: 03+1] "FWIW this change gave me an idea for PCC feature request at https://phabricator.wikimedia.org/T330484" [puppet] - 10https://gerrit.wikimedia.org/r/891392 (owner: 10Cwhite) [09:09:18] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1001.eqiad.wmnet with OS bullseye [09:10:43] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1002.eqiad.wmnet with OS bullseye [09:11:08] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1003.eqiad.wmnet with OS bullseye [09:11:36] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1004.eqiad.wmnet with OS bullseye [09:12:45] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1005.eqiad.wmnet with OS bullseye [09:13:03] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye [09:13:21] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1007.eqiad.wmnet with OS bullseye [09:13:49] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1008.eqiad.wmnet with OS bullseye [09:20:11] (03CR) 10Elukey: "Looks good but I have a question - when did we agree to extend the scope of the DSE experiment to 
the rdf-streaming-updater? What is the p" [deployment-charts] - 10https://gerrit.wikimedia.org/r/891577 (https://phabricator.wikimedia.org/T302494) (owner: 10Bking) [09:26:57] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1005.eqiad.wmnet with reason: host reimage [09:27:26] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1006.eqiad.wmnet with reason: host reimage [09:27:33] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker1008.eqiad.wmnet with OS bullseye [09:27:39] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1007.eqiad.wmnet with reason: host reimage [09:29:29] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1005.eqiad.wmnet with reason: host reimage [09:31:55] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1006.eqiad.wmnet with reason: host reimage [09:32:09] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1001.eqiad.wmnet with reason: host reimage [09:33:48] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1003.eqiad.wmnet with reason: host reimage [09:33:52] (03PS2) 10Alexandros Kosiaris: developer-portal: Switch state to production [puppet] - 10https://gerrit.wikimedia.org/r/891502 (https://phabricator.wikimedia.org/T297140) [09:33:54] (03PS1) 10Alexandros Kosiaris: prometheus: Explaining prometheus.io/port annotation [puppet] - 10https://gerrit.wikimedia.org/r/891800 [09:34:26] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1007.eqiad.wmnet with reason: host reimage [09:35:09] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1002.eqiad.wmnet with reason: host reimage [09:37:01] !log elukey@cumin1001 
END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on dse-k8s-worker1001.eqiad.wmnet with reason: host reimage [09:37:05] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1003.eqiad.wmnet with reason: host reimage [09:37:06] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1004.eqiad.wmnet with reason: host reimage [09:38:10] (03CR) 10Btullis: [C: 03+1] role::dse_k8s::{master,worker}: update settings to k8s 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/891280 (https://phabricator.wikimedia.org/T330261) (owner: 10Elukey) [09:39:35] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1002.eqiad.wmnet with reason: host reimage [09:39:56] (03CR) 10Alexandros Kosiaris: [C: 03+2] developer-portal: Switch state to production [puppet] - 10https://gerrit.wikimedia.org/r/891502 (https://phabricator.wikimedia.org/T297140) (owner: 10Alexandros Kosiaris) [09:40:27] (03CR) 10Alexandros Kosiaris: [C: 03+2] prometheus: Explaining prometheus.io/port annotation [puppet] - 10https://gerrit.wikimedia.org/r/891800 (owner: 10Alexandros Kosiaris) [09:42:03] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1004.eqiad.wmnet with reason: host reimage [09:47:29] 10Puppet, 10SRE, 10Infrastructure-Foundations: Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10MoritzMuehlenhoff) [09:48:12] 10Puppet, 10SRE, 10Infrastructure-Foundations: Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10MoritzMuehlenhoff) [09:48:16] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Setup an initial bookworm host pair with Puppetdb 7 - https://phabricator.wikimedia.org/T321783 (10MoritzMuehlenhoff) [09:48:20] 10Puppet, 10SRE, 10Infrastructure-Foundations: Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10MoritzMuehlenhoff) 
p:05Triage→03Medium [09:58:03] (03PS1) 10Volans: icinga: uniform code and add test [software/spicerack] - 10https://gerrit.wikimedia.org/r/891803 [09:58:12] (03CR) 10CI reject: [V: 04-1] icinga: uniform code and add test [software/spicerack] - 10https://gerrit.wikimedia.org/r/891803 (owner: 10Volans) [09:59:17] (03PS2) 10Volans: icinga: uniform code and add test [software/spicerack] - 10https://gerrit.wikimedia.org/r/891803 [10:00:21] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:01:04] (03CR) 10Btullis: analytics: rename postgres DB user for search platform (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/891587 (https://phabricator.wikimedia.org/T327970) (owner: 10Bking) [10:02:37] (03CR) 10CI reject: [V: 04-1] icinga: uniform code and add test [software/spicerack] - 10https://gerrit.wikimedia.org/r/891803 (owner: 10Volans) [10:05:54] 10SRE, 10SRE-Access-Requests, 10Research, 10Patch-For-Review: Requesting access to analytics-privatedata-users for Bruno Scarone - https://phabricator.wikimedia.org/T330364 (10MoritzMuehlenhoff) >>! In T330364#8643473, @SLyngshede-WMF wrote: > @KFrancis Given that this is a reactivation of an account, I wo... 
[10:06:01] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1001.eqiad.wmnet with OS bullseye [10:07:33] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1003.eqiad.wmnet with OS bullseye [10:08:43] 10SRE, 10SRE-Access-Requests, 10Research, 10Patch-For-Review: Requesting access to analytics-privatedata-users for Bruno Scarone - https://phabricator.wikimedia.org/T330364 (10SLyngshede-WMF) [10:09:21] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: check_webrequest_partitions.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:09:48] (03CR) 10Btullis: [C: 03+2] "I'll merge this now. Hope that's OK bking." [puppet] - 10https://gerrit.wikimedia.org/r/891587 (https://phabricator.wikimedia.org/T327970) (owner: 10Bking) [10:10:32] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1002.eqiad.wmnet with OS bullseye [10:12:20] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1004.eqiad.wmnet with OS bullseye [10:13:53] !log jmm@cumin2002 START - Cookbook sre.ganeti.reimage for host urldownloader2003.wikimedia.org with OS bullseye [10:14:06] (03PS3) 10Volans: icinga: uniform code and add test [software/spicerack] - 10https://gerrit.wikimedia.org/r/891803 [10:14:07] 10SRE, 10Infrastructure-Foundations: Migrate the URL downloaders to Bullseye - https://phabricator.wikimedia.org/T329945 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by jmm@cumin2002 for host urldownloader2003.wikimedia.org with OS bullseye [10:18:01] (03PS1) 10Btullis: Ensure that the airflow database names match existing conventions [puppet] - 10https://gerrit.wikimedia.org/r/891804 (https://phabricator.wikimedia.org/T319440) [10:20:39] 10SRE, 10SRE-Access-Requests: Requesting access to 
analytics-privatedata-users and Kerberos identity for RMaung - https://phabricator.wikimedia.org/T330335 (10SLyngshede-WMF) [10:21:31] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:22:09] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 81, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:29:01] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1008.eqiad.wmnet with OS bullseye [10:31:01] !log elukey@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'. [10:31:04] !log elukey@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'. [10:31:17] !log elukey@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'. [10:31:18] !log elukey@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'. [10:32:17] !log elukey@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'. [10:32:19] !log elukey@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'. [10:32:28] !log installing emacs security updates on bullseye [10:32:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:37] (03CR) 10Elukey: [C: 03+2] admin_ng: upgrade the DSE cluster to k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/891284 (https://phabricator.wikimedia.org/T330261) (owner: 10Elukey) [10:35:23] !log elukey@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'. [10:35:25] !log elukey@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'. [10:35:45] !log elukey@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'. [10:35:46] !log elukey@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'. 
[10:35:55] !log elukey@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'. [10:35:57] !log elukey@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'. [10:36:02] sorry a bit of spam :) [10:40:15] !log elukey@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'. [10:40:19] !log elukey@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'. [10:41:47] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on urldownloader2003.wikimedia.org with reason: host reimage [10:42:30] (03CR) 10Slyngshede: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/891803 (owner: 10Volans) [10:44:08] !log elukey@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'. [10:44:44] !log elukey@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'. [10:44:47] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:44:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on urldownloader2003.wikimedia.org with reason: host reimage [10:45:17] !log elukey@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'. [10:45:27] !log elukey@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'. [10:46:33] !log elukey@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'. [10:46:43] !log elukey@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'. 
[10:52:19] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker1005.eqiad.wmnet with OS bullseye [10:58:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host urldownloader2003.wikimedia.org with OS bullseye [10:58:28] 10SRE, 10Infrastructure-Foundations: Migrate the URL downloaders to Bullseye - https://phabricator.wikimedia.org/T329945 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by jmm@cumin2002 for host urldownloader2003.wikimedia.org with OS bullseye completed: - urldownloader2003 (**PASS**) -... [10:59:47] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye [10:59:53] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker1008.eqiad.wmnet with OS bullseye [11:02:12] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker1007.eqiad.wmnet with OS bullseye [11:13:25] /19 [11:13:54] !log jmm@cumin2002 START - Cookbook sre.ganeti.reimage for host urldownloader2004.wikimedia.org with OS bullseye [11:14:00] 10SRE, 10Infrastructure-Foundations: Migrate the URL downloaders to Bullseye - https://phabricator.wikimedia.org/T329945 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by jmm@cumin2002 for host urldownloader2004.wikimedia.org with OS bullseye [11:17:29] (03CR) 10Btullis: [C: 03+2] Ensure that the airflow database names match existing conventions [puppet] - 10https://gerrit.wikimedia.org/r/891804 (https://phabricator.wikimedia.org/T319440) (owner: 10Btullis) [11:31:52] 10Puppet, 10SRE, 10Infrastructure-Foundations: Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10jbond) > and puppetmaster[12]004 as the new puppetserver backends, This is one detail im a bit hazy on, the normal way to do things in puppet now is to have multiple puppet compilers and 
one... [11:37:50] !log dcaro@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1003.eqiad.wmnet with OS bullseye [11:41:52] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on urldownloader2004.wikimedia.org with reason: host reimage [11:44:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on urldownloader2004.wikimedia.org with reason: host reimage [11:49:27] 10SRE, 10Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10MoritzMuehlenhoff) [11:49:34] 10SRE, 10Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10MoritzMuehlenhoff) p:05Triage→03Medium [11:54:06] (03CR) 10Vgutierrez: [C: 03+1] "let's merge this on Monday" [puppet] - 10https://gerrit.wikimedia.org/r/863332 (https://phabricator.wikimedia.org/T323944) (owner: 10Ssingh) [11:58:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host urldownloader2004.wikimedia.org with OS bullseye [11:58:37] (03CR) 10Btullis: [C: 03+2] "I removed the incorrectly named databases and roles with the following commands in the psql command line on an-db1001." [puppet] - 10https://gerrit.wikimedia.org/r/891804 (https://phabricator.wikimedia.org/T319440) (owner: 10Btullis) [11:59:36] (03CR) 10Btullis: [C: 03+1] Update airflow conf compatibility with airflow 2.5.0 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [11:59:46] 10SRE, 10Infrastructure-Foundations: Migrate the URL downloaders to Bullseye - https://phabricator.wikimedia.org/T329945 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by jmm@cumin2002 for host urldownloader2004.wikimedia.org with OS bullseye completed: - urldownloader2004 (**PASS**) -... 
[12:00:15] (03PS1) 10Superpes15: [eswiki] Create new 'templateeditor' usergroup and protection level [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891814 (https://phabricator.wikimedia.org/T330470) [12:01:00] (03CR) 10CI reject: [V: 04-1] [eswiki] Create new 'templateeditor' usergroup and protection level [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891814 (https://phabricator.wikimedia.org/T330470) (owner: 10Superpes15) [12:11:02] (03CR) 10Superpes15: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891814 (https://phabricator.wikimedia.org/T330470) (owner: 10Superpes15) [12:13:00] (03CR) 10MarcoAurelio: [C: 03+1] [eswiki] Create new 'templateeditor' usergroup and protection level [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891814 (https://phabricator.wikimedia.org/T330470) (owner: 10Superpes15) [12:22:58] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:23:12] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:24:42] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49567 bytes in 4.592 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:24:56] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 2.915 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:26:06] !log dcaro@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1003.eqiad.wmnet with OS bullseye [12:31:46] (03PS1) 10Jbond: do not merge: [puppet] - 10https://gerrit.wikimedia.org/r/891816 [12:33:17] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39824/console" [puppet] - 
10https://gerrit.wikimedia.org/r/891816 (owner: 10Jbond) [12:33:28] (03PS1) 10Muehlenhoff: Blacklist f2fs [puppet] - 10https://gerrit.wikimedia.org/r/891817 [12:34:22] (03CR) 10CI reject: [V: 04-1] do not merge: [puppet] - 10https://gerrit.wikimedia.org/r/891816 (owner: 10Jbond) [12:35:54] (03PS1) 10KartikMistry: Content Translation: Adjust the global limit for unedited MT to 95% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891818 (https://phabricator.wikimedia.org/T330482) [12:38:02] 10SRE, 10Data-Engineering-Radar, 10Infrastructure-Foundations, 10User-MoritzMuehlenhoff: Replace firejail use in superset with native systemd features - https://phabricator.wikimedia.org/T258700 (10JArguello-WMF) [12:38:15] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10netops: Automate ingestion of netflow event stream - https://phabricator.wikimedia.org/T248865 (10JArguello-WMF) [12:39:04] 10SRE, 10Analytics-Radar, 10Data-Engineering-Planning, 10Event-Platform Value Stream, 10serviceops-radar: Consider Julie for managing Kafka settings, perhaps even integrating with Event Stream Config - https://phabricator.wikimedia.org/T276088 (10JArguello-WMF) [12:39:15] (03PS2) 10Aklapper: Remove redirect for pk.wikimedia.org (Pakistan) [puppet] - 10https://gerrit.wikimedia.org/r/887980 (https://phabricator.wikimedia.org/T328596) [12:48:19] (03Abandoned) 10Jbond: do not merge: [puppet] - 10https://gerrit.wikimedia.org/r/891816 (owner: 10Jbond) [12:53:34] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:54:04] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:59:54] (03PS1) 10Jbond: sre.hardware.upgrade-firmware: 
update bios regex [cookbooks] - 10https://gerrit.wikimedia.org/r/891820 [13:23:17] (03PS1) 10Majavah: P:toolforge::k8s::haproxy: allow simplifying the backend array [puppet] - 10https://gerrit.wikimedia.org/r/891825 [13:23:19] (03PS1) 10Majavah: P:toolforge::k8s::haproxy: remove support for hash node list [puppet] - 10https://gerrit.wikimedia.org/r/891826 [13:23:21] (03PS1) 10Majavah: P:toolforge::k8s::haproxy: add api gateway load balancer [puppet] - 10https://gerrit.wikimedia.org/r/891827 (https://phabricator.wikimedia.org/T329443) [13:26:52] (03CR) 10CI reject: [V: 04-1] P:toolforge::k8s::haproxy: add api gateway load balancer [puppet] - 10https://gerrit.wikimedia.org/r/891827 (https://phabricator.wikimedia.org/T329443) (owner: 10Majavah) [13:27:36] (03PS2) 10Majavah: P:toolforge::k8s::haproxy: add api gateway load balancer [puppet] - 10https://gerrit.wikimedia.org/r/891827 (https://phabricator.wikimedia.org/T329443) [13:41:20] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jan-Mar-2023), 10Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (10Trizek-WMF) [13:55:04] (03PS1) 10Muehlenhoff: Add d-i config for Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/891832 (https://phabricator.wikimedia.org/T330495) [13:55:25] (03CR) 10CI reject: [V: 04-1] Add d-i config for Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/891832 (https://phabricator.wikimedia.org/T330495) (owner: 10Muehlenhoff) [13:57:58] (03PS2) 10Muehlenhoff: Add d-i config for Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/891832 (https://phabricator.wikimedia.org/T330495) [13:58:30] (03PS1) 10Majavah: Set OATHAuthMultipleDevicesMigrationStage to MIGRATION_OLD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891833 (https://phabricator.wikimedia.org/T242031) [13:58:59] (03PS1) 10Btullis: Update the SSH configuration to add the keys to the agent on first use [debs/wmf-sre-laptop] - 
10https://gerrit.wikimedia.org/r/891834 [14:01:02] (03PS1) 10Esanders: Disable VectorPromoteAddTopic on production wikis initially [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891836 (https://phabricator.wikimedia.org/T267444) [14:01:07] (03PS1) 10Jaime Nuche: scap: add required Python3 venv package [puppet] - 10https://gerrit.wikimedia.org/r/891837 (https://phabricator.wikimedia.org/T329622) [14:02:22] (03PS5) 10Jcrespo: Preparing for release 0.1.6 [software/mediabackups] - 10https://gerrit.wikimedia.org/r/889646 (https://phabricator.wikimedia.org/T327157) [14:03:16] (03CR) 10Jcrespo: [C: 03+2] Add unit tests & coverage report [software/mediabackups] - 10https://gerrit.wikimedia.org/r/885428 (owner: 10Jcrespo) [14:04:04] (03CR) 10Jcrespo: [C: 03+2] Preparing for release 0.1.6 [software/mediabackups] - 10https://gerrit.wikimedia.org/r/889646 (https://phabricator.wikimedia.org/T327157) (owner: 10Jcrespo) [14:09:02] (03PS1) 10Muehlenhoff: Add bookworm pxelinux.cfg [puppet] - 10https://gerrit.wikimedia.org/r/891839 (https://phabricator.wikimedia.org/T330495) [14:09:30] (03CR) 10CI reject: [V: 04-1] Add bookworm pxelinux.cfg [puppet] - 10https://gerrit.wikimedia.org/r/891839 (https://phabricator.wikimedia.org/T330495) (owner: 10Muehlenhoff) [14:10:20] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1005.eqiad.wmnet with OS bullseye [14:14:06] (03PS1) 10Muehlenhoff: Skip boot.txt for SPDX check [puppet] - 10https://gerrit.wikimedia.org/r/891840 [14:15:22] (03PS2) 10Muehlenhoff: Add bookworm pxelinux.cfg [puppet] - 10https://gerrit.wikimedia.org/r/891839 (https://phabricator.wikimedia.org/T330495) [14:15:50] (03CR) 10CI reject: [V: 04-1] Add bookworm pxelinux.cfg [puppet] - 10https://gerrit.wikimedia.org/r/891839 (https://phabricator.wikimedia.org/T330495) (owner: 10Muehlenhoff) [14:16:39] (03PS1) 10EoghanGaffney: Split out a new class for phabricator::config [puppet] - 10https://gerrit.wikimedia.org/r/891841 [14:17:00] (03CR) 
10CI reject: [V: 04-1] Split out a new class for phabricator::config [puppet] - 10https://gerrit.wikimedia.org/r/891841 (owner: 10EoghanGaffney) [14:17:52] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye [14:22:00] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1007.eqiad.wmnet with OS bullseye [14:23:08] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1008.eqiad.wmnet with OS bullseye [14:23:50] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker1005.eqiad.wmnet with OS bullseye [14:24:07] sigh [14:28:55] (03PS2) 10EoghanGaffney: Split out a new class for phabricator::config [puppet] - 10https://gerrit.wikimedia.org/r/891841 [14:29:17] (03CR) 10CI reject: [V: 04-1] Split out a new class for phabricator::config [puppet] - 10https://gerrit.wikimedia.org/r/891841 (owner: 10EoghanGaffney) [14:31:26] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye [14:31:51] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs2020.codfw.wmnet with OS bullseye [14:31:58] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host wdqs2020.codfw.wmnet with OS bullseye [14:34:39] (03PS3) 10EoghanGaffney: Split out a new class for phabricator::config [puppet] - 10https://gerrit.wikimedia.org/r/891841 [14:35:53] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1007.eqiad.wmnet with reason: host reimage [14:36:20] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker1008.eqiad.wmnet with OS bullseye [14:36:45] (03CR) 10CI reject: [V: 04-1] Split 
out a new class for phabricator::config [puppet] - 10https://gerrit.wikimedia.org/r/891841 (owner: 10EoghanGaffney) [14:37:16] (03PS4) 10EoghanGaffney: Split out a new class for phabricator::config [puppet] - 10https://gerrit.wikimedia.org/r/891841 [14:38:49] (03PS1) 10Jbond: differ: add more types to the exclusion of core types [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/891844 (https://phabricator.wikimedia.org/T330484) [14:39:01] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1007.eqiad.wmnet with reason: host reimage [14:41:07] (03CR) 10CI reject: [V: 04-1] differ: add more types to the exclusion of core types [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/891844 (https://phabricator.wikimedia.org/T330484) (owner: 10Jbond) [14:42:04] (03PS2) 10Jbond: differ: add more types to the exclusion of core types [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/891844 (https://phabricator.wikimedia.org/T330484) [14:42:42] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jan-Mar-2023), 10Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (10Trizek-WMF) I checked all translations regarding time and date. I had to fix all of them manually, at least the one... 
[14:43:28] (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39825/console" [puppet] - 10https://gerrit.wikimedia.org/r/891841 (owner: 10EoghanGaffney) [14:44:31] (03CR) 10Jbond: [C: 03+2] sre.hardware.upgrade-firmware: update bios regex [cookbooks] - 10https://gerrit.wikimedia.org/r/891820 (owner: 10Jbond) [14:44:42] (03CR) 10CI reject: [V: 04-1] differ: add more types to the exclusion of core types [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/891844 (https://phabricator.wikimedia.org/T330484) (owner: 10Jbond) [14:46:09] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:47:03] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/891817 (owner: 10Muehlenhoff) [14:47:40] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/891840 (owner: 10Muehlenhoff) [14:49:51] !log jbond@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs2013'] [14:50:00] !log jbond@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['wdqs2013'] [14:50:03] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on dse-k8s-worker1007.eqiad.wmnet with reason: Cluster half broken, in the middle of upgrading [14:50:07] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on dse-k8s-worker1007.eqiad.wmnet with reason: Cluster half broken, in the middle of upgrading [14:50:34] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker1007.eqiad.wmnet with OS bullseye [14:50:45] !log elukey@cumin1001 END 
(PASS) - Cookbook sre.k8s.upgrade-cluster (exit_code=0) Upgrade K8s version: Upgrade to k8s 1.23 [14:50:49] (03PS5) 10EoghanGaffney: Split out a new class for phabricator::config [puppet] - 10https://gerrit.wikimedia.org/r/891841 [14:51:03] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM (untested tho)" [puppet] - 10https://gerrit.wikimedia.org/r/891832 (https://phabricator.wikimedia.org/T330495) (owner: 10Muehlenhoff) [14:51:11] (03CR) 10CI reject: [V: 04-1] Split out a new class for phabricator::config [puppet] - 10https://gerrit.wikimedia.org/r/891841 (owner: 10EoghanGaffney) [14:51:41] (03CR) 10Filippo Giunchedi: [C: 03+1] "FWIW +1 to the idea, thank you for the improvement!" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/891844 (https://phabricator.wikimedia.org/T330484) (owner: 10Jbond) [14:52:35] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs2020.codfw.wmnet with reason: host reimage [14:52:58] (03PS2) 10Muehlenhoff: Skip boot.txt for SPDX check [puppet] - 10https://gerrit.wikimedia.org/r/891840 [14:54:10] (03PS3) 10Jbond: differ: add more types to the exclusion of core types [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/891844 (https://phabricator.wikimedia.org/T330484) [14:54:12] (03PS1) 10Jbond: fix formating [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/891847 [14:54:47] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:54:58] (KubernetesCalicoDown) firing: (2) dse-k8s-worker1006.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:55:40] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 2:00:00 on wdqs2020.codfw.wmnet with reason: host reimage [14:56:54] (03CR) 10Jbond: [C: 03+2] fix formating [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/891847 (owner: 10Jbond) [14:57:57] 10SRE, 10Data-Engineering, 10Event-Platform Value Stream, 10Service-deployment-requests: New Service Request mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T330507 (10Ottomata) [14:59:20] 10SRE, 10Data-Engineering, 10Event-Platform Value Stream, 10Service-deployment-requests: New Service Request mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T330507 (10Ottomata) [15:02:19] 10SRE, 10SRE-Access-Requests, 10Research, 10Patch-For-Review: Requesting access to analytics-privatedata-users for Bruno Scarone - https://phabricator.wikimedia.org/T330364 (10Ottomata) Approved. [15:04:24] (03CR) 10Ottomata: [C: 03+1] dse-k8s: raise memory for rdf-streaming-updater (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/891577 (https://phabricator.wikimedia.org/T302494) (owner: 10Bking) [15:06:07] (03CR) 10Muehlenhoff: [C: 03+2] Skip boot.txt for SPDX check [puppet] - 10https://gerrit.wikimedia.org/r/891840 (owner: 10Muehlenhoff) [15:06:41] (03PS3) 10Muehlenhoff: Add bookworm pxelinux.cfg [puppet] - 10https://gerrit.wikimedia.org/r/891839 (https://phabricator.wikimedia.org/T330495) [15:06:43] (03PS1) 10Ayounsi: Re-image: clear DHCP cache sooner [cookbooks] - 10https://gerrit.wikimedia.org/r/891848 (https://phabricator.wikimedia.org/T306421) [15:06:53] (03CR) 10Ottomata: [C: 03+1] "If Steve is okay with this, then I think it is ready for merge." 
[puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [15:07:09] (03CR) 10CI reject: [V: 04-1] Add bookworm pxelinux.cfg [puppet] - 10https://gerrit.wikimedia.org/r/891839 (https://phabricator.wikimedia.org/T330495) (owner: 10Muehlenhoff) [15:08:23] (03PS2) 10Ayounsi: Re-image: clear DHCP cache sooner [cookbooks] - 10https://gerrit.wikimedia.org/r/891848 (https://phabricator.wikimedia.org/T306421) [15:10:35] (03CR) 10DCausse: dse-k8s: raise memory for rdf-streaming-updater (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/891577 (https://phabricator.wikimedia.org/T302494) (owner: 10Bking) [15:11:28] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [15:11:47] (03CR) 10Jbond: [C: 03+1] "lgtm some optional nits" [puppet] - 10https://gerrit.wikimedia.org/r/891593 (owner: 10David Caro) [15:11:56] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "new url downloaders - jmm@cumin2002 - T329945" [15:12:01] T329945: Migrate the URL downloaders to Bullseye - https://phabricator.wikimedia.org/T329945 [15:21:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "new url downloaders - jmm@cumin2002 - T329945" [15:21:24] T329945: Migrate the URL downloaders to Bullseye - https://phabricator.wikimedia.org/T329945 [15:22:49] 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations, 10Patch-For-Review: Update makevm to include completion of the installation with the puppet runs - https://phabricator.wikimedia.org/T306661 (10MoritzMuehlenhoff) Should we also extend the cookbook to run sre.puppet.sync-netbox-hiera? Or at least print a... 
[15:22:59] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [15:23:00] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs2020.codfw.wmnet with OS bullseye [15:23:06] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host wdqs2020.codfw.wmnet with OS bullseye completed: - wdqs2020 (**PA... [15:23:45] (03CR) 10Elukey: dse-k8s: raise memory for rdf-streaming-updater (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/891577 (https://phabricator.wikimedia.org/T302494) (owner: 10Bking) [15:25:14] (03CR) 10Elukey: [C: 03+1] "It looks good to me, the reimages of the hosts in row E/F are already broken so it is worth a try. As specified in IRC I'd like somebody f" [cookbooks] - 10https://gerrit.wikimedia.org/r/891848 (https://phabricator.wikimedia.org/T306421) (owner: 10Ayounsi) [15:33:44] (03CR) 10Hashar: [C: 03+1] "+1 then it looks like the failure is the deployment user does not have the proper sudo rule to act as `kibana`?" 
[puppet] - 10https://gerrit.wikimedia.org/r/888740 (https://phabricator.wikimedia.org/T329688) (owner: 10Cwhite) [15:40:06] (03CR) 10Btullis: [C: 03+1] Re-image: clear DHCP cache sooner [cookbooks] - 10https://gerrit.wikimedia.org/r/891848 (https://phabricator.wikimedia.org/T306421) (owner: 10Ayounsi) [15:51:23] (03PS1) 10Nicolas Fraison: hive: add gc logs to hiveservers and hive-metastore [puppet] - 10https://gerrit.wikimedia.org/r/891850 (https://phabricator.wikimedia.org/T303168) [15:52:01] (03CR) 10Ahmon Dancy: scap: add required Python3 venv package (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/891837 (https://phabricator.wikimedia.org/T329622) (owner: 10Jaime Nuche) [15:52:24] (03PS2) 10Nicolas Fraison: hive: add gc logs to hiveservers and hive-metastore [puppet] - 10https://gerrit.wikimedia.org/r/891850 (https://phabricator.wikimedia.org/T303168) [15:54:21] (03CR) 10Nicolas Fraison: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39828/console" [puppet] - 10https://gerrit.wikimedia.org/r/891850 (https://phabricator.wikimedia.org/T303168) (owner: 10Nicolas Fraison) [16:07:13] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:08:13] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 81, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:14:21] (03CR) 10Arturo Borrero Gonzalez: "I can't find where port 30002 is being used today:" [puppet] - 10https://gerrit.wikimedia.org/r/891825 (owner: 10Majavah) [16:14:33] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frbast1002, frmon1002, frpig1002 - https://phabricator.wikimedia.org/T319460 (10Jclark-ctr) frbast1002 port 9 frmon1002 port 11 
frpig1002 port 1 [16:14:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frbast1002, frmon1002, frpig1002 - https://phabricator.wikimedia.org/T319460 (10Jclark-ctr) [16:15:16] (03CR) 10Arturo Borrero Gonzalez: P:toolforge::k8s::haproxy: remove support for hash node list (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/891826 (owner: 10Majavah) [16:15:36] (03CR) 10Majavah: P:toolforge::k8s::haproxy: allow simplifying the backend array (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/891825 (owner: 10Majavah) [16:16:36] (03CR) 10Arturo Borrero Gonzalez: P:toolforge::k8s::haproxy: add api gateway load balancer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/891827 (https://phabricator.wikimedia.org/T329443) (owner: 10Majavah) [16:19:59] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] P:toolforge::k8s::haproxy: allow simplifying the backend array (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/891825 (owner: 10Majavah) [16:21:37] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] P:toolforge::k8s::haproxy: allow simplifying the backend array [puppet] - 10https://gerrit.wikimedia.org/r/891825 (owner: 10Majavah) [16:21:54] (03PS1) 10Btullis: Add dummy keydata for the new ceph admin user [labs/private] - 10https://gerrit.wikimedia.org/r/891854 (https://phabricator.wikimedia.org/T328123) [16:22:44] (03PS2) 10Majavah: P:toolforge::k8s::haproxy: remove support for hash node list [puppet] - 10https://gerrit.wikimedia.org/r/891826 [16:22:46] (03PS3) 10Majavah: P:toolforge::k8s::haproxy: add api gateway load balancer [puppet] - 10https://gerrit.wikimedia.org/r/891827 (https://phabricator.wikimedia.org/T329443) [16:22:50] (03CR) 10Btullis: [V: 03+2 C: 03+2] Add dummy keydata for the new ceph admin user [labs/private] - 10https://gerrit.wikimedia.org/r/891854 (https://phabricator.wikimedia.org/T328123) (owner: 10Btullis) [16:22:53] (03CR) 10Majavah: P:toolforge::k8s::haproxy: remove support for hash node list (031 comment) 
[puppet] - 10https://gerrit.wikimedia.org/r/891826 (owner: 10Majavah) [16:23:13] (03CR) 10Majavah: P:toolforge::k8s::haproxy: add api gateway load balancer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/891827 (https://phabricator.wikimedia.org/T329443) (owner: 10Majavah) [16:33:35] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: / (spec from root) is CRITICAL: Test spec from root returned the unexpected status 503 (expecting: 200): /api (bad URL) is CRITICAL: Test bad URL returned the unexpected status 503 (expecting: 404) https://wikitech.wikimedia.org/wiki/Citoid [16:34:18] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "Only +1, cannot merge & babysit myself now." [puppet] - 10https://gerrit.wikimedia.org/r/891826 (owner: 10Majavah) [16:35:25] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [16:45:23] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1003.eqiad.wmnet with OS buster [16:45:26] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "Only +1, cannot merge & babysit the rollout myself now." 
[puppet] - 10https://gerrit.wikimedia.org/r/891827 (https://phabricator.wikimedia.org/T329443) (owner: 10Majavah) [16:48:32] (03PS4) 10Majavah: P:toolforge::k8s::haproxy: add api gateway load balancer [puppet] - 10https://gerrit.wikimedia.org/r/891827 (https://phabricator.wikimedia.org/T329443) [16:55:54] (03PS3) 10RLazarus: mediawiki-cache-warmup: Rewrite in Python [puppet] - 10https://gerrit.wikimedia.org/r/890299 (https://phabricator.wikimedia.org/T288867) [16:57:06] !log fab@deploy1002 Started deploy [airflow-dags/research@5edcd7b]: (no justification provided) [16:57:32] !log fab@deploy1002 Finished deploy [airflow-dags/research@5edcd7b]: (no justification provided) (duration: 00m 26s) [16:57:47] (03CR) 10RLazarus: mediawiki-cache-warmup: Rewrite in Python (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/890299 (https://phabricator.wikimedia.org/T288867) (owner: 10RLazarus) [17:03:40] !log fab@deploy1002 Started deploy [airflow-dags/research@5edcd7b]: (no justification provided) [17:03:51] !log fab@deploy1002 Finished deploy [airflow-dags/research@5edcd7b]: (no justification provided) (duration: 00m 10s) [17:05:46] (03PS1) 10BCornwall: ntp/eqsin: set to dns6002 [dns] - 10https://gerrit.wikimedia.org/r/891861 [17:06:47] (03CR) 10Ssingh: [C: 03+1] ntp/eqsin: set to dns6002 [dns] - 10https://gerrit.wikimedia.org/r/891861 (owner: 10BCornwall) [17:07:17] (03CR) 10BCornwall: [C: 03+2] ntp/eqsin: set to dns6002 [dns] - 10https://gerrit.wikimedia.org/r/891861 (owner: 10BCornwall) [17:08:25] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1003.eqiad.wmnet with OS bullseye [17:08:34] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1003.eqiad.wmnet with OS bullseye [17:08:45] (03PS2) 10Jaime Nuche: scap: add required Python3 venv package [puppet] - 10https://gerrit.wikimedia.org/r/891837 (https://phabricator.wikimedia.org/T329622) [17:09:23] !log jclark@cumin1001 START - Cookbook 
sre.hosts.reimage for host cloudcephosd1003.eqiad.wmnet with OS bullseye [17:09:30] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1003.eqiad.wmnet with OS bullseye [17:12:40] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1004.eqiad.wmnet with OS bullseye [17:15:31] (03PS1) 10Ssingh: ntp/eqsin: set ntp.eqsin to dns5003 [dns] - 10https://gerrit.wikimedia.org/r/891862 [17:16:57] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [17:17:26] (03CR) 10Ssingh: [C: 03+1] "Missed this in the review but for posterity, it should read ntp/drmrs as it is dns6002." [dns] - 10https://gerrit.wikimedia.org/r/891861 (owner: 10BCornwall) [17:25:34] (03PS1) 10EoghanGaffney: Switch active gitlab host from gitlab1004 (eqiad) -> gitlab2002 (codfw) [puppet] - 10https://gerrit.wikimedia.org/r/891863 (https://phabricator.wikimedia.org/T329931) [17:34:59] (03PS1) 10EoghanGaffney: Lower TTL on gitlab records to 300 seconds to facilitate failover [dns] - 10https://gerrit.wikimedia.org/r/891886 (https://phabricator.wikimedia.org/T329931) [17:39:05] (03Abandoned) 10Ssingh: hiera: update Traffic cloud instances hieradata for digicert-2022 [puppet] - 10https://gerrit.wikimedia.org/r/856955 (owner: 10Ssingh) [17:39:43] (03PS4) 10Ssingh: hiera: enable haproxy systemd hardening on cp4045 [puppet] - 10https://gerrit.wikimedia.org/r/863332 (https://phabricator.wikimedia.org/T323944) [17:46:18] (03PS1) 10EoghanGaffney: Update records for gitlab from gitlab1004 -> gitlab2002 [dns] - 10https://gerrit.wikimedia.org/r/891888 (https://phabricator.wikimedia.org/T329931) [17:57:40] (03PS2) 10EoghanGaffney: Switch active gitlab host from gitlab1004 (eqiad) -> gitlab2002 (codfw) [puppet] - 10https://gerrit.wikimedia.org/r/891863 (https://phabricator.wikimedia.org/T329931) [17:59:44] (03CR) 
10RLazarus: [C: 03+2] "Merging per IRC discussion, thanks for the review!" [puppet] - 10https://gerrit.wikimedia.org/r/890299 (https://phabricator.wikimedia.org/T288867) (owner: 10RLazarus) [18:00:53] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1004.eqiad.wmnet with OS bullseye [18:03:56] (03CR) 10Dzahn: "no offense, but I think it's better to remove myself as reviewer than to give the impression that I am going to merge them." [puppet] - 10https://gerrit.wikimedia.org/r/527933 (https://phabricator.wikimedia.org/T36217) (owner: 10Fomafix) [18:06:25] (03CR) 10Dzahn: "I'm afraid to get these merged 2 things are needed: a) some link to community consensus that this is agreed that can be pasted b) contac" [puppet] - 10https://gerrit.wikimedia.org/r/527909 (https://phabricator.wikimedia.org/T25216) (owner: 10Fomafix) [18:09:02] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [18:09:53] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [18:10:16] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [18:13:24] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [18:14:57] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host dns6001.wikimedia.org with OS bullseye [18:15:13] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host dns6001.wikimedia.org with OS bullseye [18:19:00] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [18:20:31] PROBLEM - Host 2a02:ec80:600:1:185:15:58:5 is DOWN: CRITICAL - Destination Unreachable (2a02:ec80:600:1:185:15:58:5) [18:20:42] (03PS1) 10Dzahn: service-catalog: add planet service [puppet] - 10https://gerrit.wikimedia.org/r/891894 [18:20:51] PROBLEM - BFD status on asw1-b12-drmrs.mgmt is CRITICAL: CRIT: Down: 1 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:20:59] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [18:21:03] PROBLEM - BGP status on asw1-b12-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:24:35] (03PS1) 10Dzahn: service-catalog: add people service [puppet] - 10https://gerrit.wikimedia.org/r/891895 [18:24:41] PROBLEM - Recursive DNS on 185.15.58.5 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [18:26:13] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10conftool, and 2 others: Scap deploy failed to depool codfw servers - https://phabricator.wikimedia.org/T327041 (10Papaul) [18:26:42] Is drmrs known? [18:26:54] 10SRE, 10SRE-OnFire, 10ops-codfw, 10Sustainability (Incident Followup): asw-b2-codfw down - https://phabricator.wikimedia.org/T327001 (10Papaul) 05Open→03Resolved switch received and added to Netbox [18:28:06] sukhe: see asw1-b12-drmrs alert. Or anyone else from traffic. [18:28:15] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (10Papaul) [18:30:52] r [18:31:11] r?
[18:31:13] RhinosF1: thanks, expected because of the dnsreimaging [18:31:20] sukhe: cool [18:32:05] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs2022.codfw.wmnet with OS bullseye [18:32:12] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host wdqs2022.codfw.wmnet with OS bullseye [18:32:16] dns6001 specifically that brett is doing [18:32:45] oh, shit, sorry, I missed this message [18:32:50] Sorry for any alarm [18:33:21] np, we can't avoid these alerts anyway :) [18:33:21] It’s good [18:34:51] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns6001.wikimedia.org with reason: host reimage [18:37:56] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns6001.wikimedia.org with reason: host reimage [18:38:21] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wdqs2022.codfw.wmnet with OS bullseye [18:38:28] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host wdqs2022.codfw.wmnet with OS bullseye executed with errors: - wdq... [18:39:04] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (10Papaul) @Jhancock.wm can you also check the network cable on wdqs2022.
[18:43:04] (03CR) 10RLazarus: [C: 03+1] sre.switchdc.mediawiki: use python warmup script [cookbooks] - 10https://gerrit.wikimedia.org/r/891520 (https://phabricator.wikimedia.org/T288867) (owner: 10Clément Goubert) [18:44:06] PROBLEM - Recursive DNS on 185.15.58.5 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [18:48:34] ^expected [18:48:54] RECOVERY - Recursive DNS on 185.15.58.5 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [18:51:34] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [18:54:42] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for new ms-fe and thanos nodes - pt1979@cumin2002" [18:54:58] (KubernetesCalicoDown) firing: dse-k8s-worker1007.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1007.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [18:55:49] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for new ms-fe and thanos nodes - pt1979@cumin2002" [18:55:49] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:56:13] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ms-fe2013.mgmt.codfw.wmnet with reboot policy FORCED [18:57:56] RECOVERY - BFD status on asw1-b12-drmrs.mgmt is OK: OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:58:10] RECOVERY - BGP status on asw1-b12-drmrs.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:00:27] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host
dns6001.wikimedia.org with OS bullseye [19:00:37] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host dns6001.wikimedia.org with OS bullseye completed: - dns6001 (**PASS**) - Downtimed on Icinga/Al... [19:02:57] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ms-fe2014.mgmt.codfw.wmnet with reboot policy FORCED [19:04:02] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [19:04:19] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host thanos-fe2004.mgmt.codfw.wmnet with reboot policy FORCED [19:05:18] RECOVERY - Check systemd state on wdqs1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:09:49] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [19:10:16] (03CR) 10BCornwall: [C: 03+2] ntp/eqsin: set ntp.eqsin to dns5003 [dns] - 10https://gerrit.wikimedia.org/r/891862 (owner: 10Ssingh) [19:10:40] (03PS1) 10BCornwall: Revert "ntp/eqsin: set to dns6002" [dns] - 10https://gerrit.wikimedia.org/r/891636 [19:11:01] (03PS2) 10BCornwall: Revert "ntp/eqsin: set to dns6002" [dns] - 10https://gerrit.wikimedia.org/r/891636 [19:11:02] RECOVERY - Check that envoy is running on idm2001 is OK: OK - envoyproxy.service is active https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [19:11:16] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-fe2013.mgmt.codfw.wmnet with reboot policy FORCED [19:11:18] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-fe2014.mgmt.codfw.wmnet with reboot policy FORCED [19:12:29] 
(03CR) 10Ssingh: [C: 03+1] Revert "ntp/eqsin: set to dns6002" [dns] - 10https://gerrit.wikimedia.org/r/891636 (owner: 10BCornwall) [19:14:05] (03CR) 10BCornwall: [C: 03+2] Revert "ntp/eqsin: set to dns6002" [dns] - 10https://gerrit.wikimedia.org/r/891636 (owner: 10BCornwall) [19:14:56] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-fe2013'] [19:14:58] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['ms-fe2013'] [19:15:06] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-fe2013'] [19:18:02] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host dns3002.wikimedia.org with OS bullseye [19:18:13] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host dns3002.wikimedia.org with OS bullseye [19:18:24] Doing some esams dns reimaging [19:18:53] ack, thx [19:18:57] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-fe2014'] [19:19:39] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-fe2004.mgmt.codfw.wmnet with reboot policy FORCED [19:20:32] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:20:58] PROBLEM - Host ns2-v4 is DOWN: PING CRITICAL - Packet loss = 100% [19:21:34] PROBLEM - Host 2620:0:862:1:91:198:174:62 is DOWN: PING CRITICAL - Packet loss = 100% [19:21:48] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['thanos-fe2004'] [19:23:02] PROBLEM - BFD status on cr3-esams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status 
[19:23:34] PROBLEM - BGP status on cr3-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:23:50] PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:24:45] (JobUnavailable) firing: Reduced availability for job pdnsrec in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:25:48] PROBLEM - Recursive DNS on 91.198.174.62 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [19:28:25] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['ms-fe2013'] [19:29:38] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['ms-fe2014'] [19:29:45] (JobUnavailable) firing: (2) Reduced availability for job haproxy in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:30:14] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:rack/setup/install ms-fe2013 - ms-fe2014, thanos-fe2004 - https://phabricator.wikimedia.org/T326848 (10Papaul) @jbond ` poweredge-r450: picking DellDriverCategory.BIOS update file We have found multiple entries please pick from the list below: 0: /srv/... 
[19:32:58] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['thanos-fe2004'] [19:33:24] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-fe2013'] [19:34:45] (JobUnavailable) resolved: (2) Reduced availability for job haproxy in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:35:13] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-fe2014'] [19:36:24] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['thanos-fe2004'] [19:36:27] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [19:36:35] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:rack/setup/install ms-fe2013 - ms-fe2014, thanos-fe2004 - https://phabricator.wikimedia.org/T326848 (10Papaul) [20:06:52] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [20:08:40] (03PS1) 10Bking: wdqs.data-transfer: completely remove defunct argument [cookbooks] - 10https://gerrit.wikimedia.org/r/891899 (https://phabricator.wikimedia.org/T323096) [20:11:34] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) 
timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [20:11:47] !log brett@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dns3002.wikimedia.org with OS bullseye [20:11:56] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host dns3002.wikimedia.org with OS bullseye executed with errors: - dns3002 (**FAIL**) - Downtimed o... [20:12:50] (03PS2) 10Bking: wdqs.data-transfer: completely remove defunct argument [cookbooks] - 10https://gerrit.wikimedia.org/r/891899 (https://phabricator.wikimedia.org/T323096) [20:13:14] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [20:25:15] (03CR) 10AOkoth: [C: 03+1] clamd.conf: Remove some config entries [puppet] - 10https://gerrit.wikimedia.org/r/890799 (https://phabricator.wikimedia.org/T330129) (owner: 10Muehlenhoff) [20:32:59] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host dns3002.wikimedia.org with OS bullseye [20:33:09] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host dns3002.wikimedia.org with OS bullseye [20:33:48] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['ms-fe2013'] [20:33:50] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['thanos-fe2004'] [20:33:55] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['ms-fe2014'] [20:44:45] (03PS3) 10Bking: wdqs.data-transfer: completely remove defunct argument [cookbooks] - 10https://gerrit.wikimedia.org/r/891899 
(https://phabricator.wikimedia.org/T323096) [20:45:56] !log fab@deploy1002 Started deploy [airflow-dags/research@5edcd7b]: (no justification provided) [20:46:01] (03CR) 10Ryan Kemper: [C: 03+1] wdqs.data-transfer: completely remove defunct argument [cookbooks] - 10https://gerrit.wikimedia.org/r/891899 (https://phabricator.wikimedia.org/T323096) (owner: 10Bking) [20:46:16] !log fab@deploy1002 Finished deploy [airflow-dags/research@5edcd7b]: (no justification provided) (duration: 00m 19s) [20:49:29] (03PS4) 10Bking: wdqs.data-transfer: completely replace defunct argument [cookbooks] - 10https://gerrit.wikimedia.org/r/891899 (https://phabricator.wikimedia.org/T323096) [20:52:03] (03CR) 10Bking: [C: 03+2] wdqs.data-transfer: completely replace defunct argument [cookbooks] - 10https://gerrit.wikimedia.org/r/891899 (https://phabricator.wikimedia.org/T323096) (owner: 10Bking) [20:52:34] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns3002.wikimedia.org with reason: host reimage [20:53:50] (03Merged) 10jenkins-bot: wdqs.data-transfer: completely replace defunct argument [cookbooks] - 10https://gerrit.wikimedia.org/r/891899 (https://phabricator.wikimedia.org/T323096) (owner: 10Bking) [20:55:46] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns3002.wikimedia.org with reason: host reimage [20:55:56] PROBLEM - Disk space on people2002 is CRITICAL: DISK CRITICAL - free space: / 2871 MB (3% inode=88%): /tmp 2871 MB (3% inode=88%): /var/tmp 2871 MB (3% inode=88%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=people2002&var-datasource=codfw+prometheus/ops [20:57:09] ^ yea, it's me :) [20:57:24] i got away with 97% and gotta clean it up somehow [20:58:59] !log fab@deploy1002 Started deploy [airflow-dags/research@5edcd7b]: (no justification provided) [20:59:10] !log fab@deploy1002 Finished deploy [airflow-dags/research@5edcd7b]: (no 
justification provided) (duration: 00m 10s) [21:03:38] RECOVERY - Host ns2-v4 is UP: PING OK - Packet loss = 0%, RTA = 80.97 ms [21:05:45] (JobUnavailable) firing: (2) Reduced availability for job haproxy in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:06:00] !log ganeti2021 - adding a virtual 20G disk to people2002 - to temp get some space for backups and syncing T330091 [21:06:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:05] T330091: Switchover People and Planet services to codfw - https://phabricator.wikimedia.org/T330091 [21:10:45] (JobUnavailable) resolved: (2) Reduced availability for job haproxy in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:11:36] PROBLEM - Host ns2-v4 is DOWN: PING CRITICAL - Packet loss = 100% [21:12:52] RECOVERY - Host ns2-v4 is UP: PING OK - Packet loss = 0%, RTA = 81.05 ms [21:14:50] RECOVERY - BFD status on cr2-esams is OK: OK: UP: 18 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:14:56] RECOVERY - BGP status on cr3-esams is OK: BGP OK - up: 20, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:15:30] RECOVERY - BFD status on cr3-esams is OK: OK: UP: 17 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:17:49] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns3002.wikimedia.org with OS bullseye [21:17:58] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by 
brett@cumin2002 for host dns3002.wikimedia.org with OS bullseye completed: - dns3002 (**WARN**) - Removed from Puppet an... [21:19:21] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [21:19:52] !log rebooting people2002 [21:19:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:37] (03PS1) 10Papaul: Ad new ms-fe and thanos-fe node to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/891914 (https://phabricator.wikimedia.org/T326848) [21:25:56] Okay, dns work in esams is done for the week [21:26:12] !log people2002 - performing the usual dance when device names changed after editing virtual hardware (s/ens13/ens14 in /etc/network/interfaces ... reboot) [21:26:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:17] enjoy your weekend brett [21:26:28] Thanks! You too :) [21:26:36] :) [21:31:54] (03PS2) 10Papaul: Ad new ms-fe and thanos-fe node to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/891914 (https://phabricator.wikimedia.org/T326848) [21:34:33] (03CR) 10Papaul: [C: 03+2] Ad new ms-fe and thanos-fe node to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/891914 (https://phabricator.wikimedia.org/T326848) (owner: 10Papaul) [21:37:16] RECOVERY - Disk space on people2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=people2002&var-datasource=codfw+prometheus/ops [21:54:13] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ms-fe2013.codfw.wmnet with OS bullseye [21:54:25] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: Q3:rack/setup/install ms-fe2013 - ms-fe2014, thanos-fe2004 - https://phabricator.wikimedia.org/T326848 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-fe2013.codfw.wmnet with OS... 
[22:12:19] (03PS1) 10Dzahn: peopleweb: add bacula file set srv-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/891920 [22:15:40] (03PS2) 10Dzahn: peopleweb: add bacula file set srv-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/891920 [22:15:58] (03PS3) 10Dzahn: peopleweb: add bacula file set srv-org-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/891920 [22:18:42] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-fe2013.codfw.wmnet with reason: host reimage [22:21:51] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-fe2013.codfw.wmnet with reason: host reimage [22:40:01] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [22:49:24] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [22:49:25] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe2013.codfw.wmnet with OS bullseye [22:49:31] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:rack/setup/install ms-fe2013 - ms-fe2014, thanos-fe2004 - https://phabricator.wikimedia.org/T326848 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-fe2013.codfw.wmnet with OS bullseye completed: - ms-f... 
[22:52:38] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:rack/setup/install ms-fe2013 - ms-fe2014, thanos-fe2004 - https://phabricator.wikimedia.org/T326848 (10Papaul) [22:54:58] (KubernetesCalicoDown) firing: dse-k8s-worker1007.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1007.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [23:15:49] !log people2002 - for each user who has a public_html dir that is not empty (for pubdir in $(find . -name public_html -type d -not -empty); ..); rsync it from people1003 with --delete (rsync -avp rsync://people1003.eqiad.wmnet/people-home/${pubdiruser}/public_html/ /home/${pubdiruser}/public_html/); T330091 [23:15:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:15:56] T330091: Switchover People and Planet services to codfw - https://phabricator.wikimedia.org/T330091 [23:31:28] (03PS1) 10Dzahn: httpbb: add /~cdanis/sremap to test URLs for people.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/891927 (https://phabricator.wikimedia.org/T330091) [23:32:16] (03CR) 10Dzahn: [C: 03+2] "[deploy1002:~] $ httpbb --hosts people2002.codfw.wmnet /tmp/test_people.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/891927 (https://phabricator.wikimedia.org/T330091) (owner: 10Dzahn) [23:33:42] (03PS2) 10Dzahn: httpbb: add /~cdanis/sremap to test URLs for people.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/891927 (https://phabricator.wikimedia.org/T330091) [23:34:07] (03CR) 10Dzahn: [V: 03+2] httpbb: add /~cdanis/sremap to test URLs for people.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/891927 (https://phabricator.wikimedia.org/T330091) (owner: 10Dzahn) [23:37:40] (03CR) 10Dzahn: [C: 03+2] peopleweb: switch rsync source and dest between eqiad and codfw [puppet] - 
10https://gerrit.wikimedia.org/r/891382 (https://phabricator.wikimedia.org/T330091) (owner: 10Dzahn) [23:37:47] (03PS2) 10Arlolra: Enabled native gallery editing in Parsoid [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889257 (https://phabricator.wikimedia.org/T329662) [23:41:53] (03CR) 10Dzahn: "I have rsynced any public_html dir that was not empty but have not touched anything outside public_html dirs. Also we have httpbb tests an" [dns] - 10https://gerrit.wikimedia.org/r/891381 (https://phabricator.wikimedia.org/T330091) (owner: 10Dzahn) [23:44:00] (03PS4) 10Dzahn: switch peopleweb from eqiad to codfw [dns] - 10https://gerrit.wikimedia.org/r/891381 (https://phabricator.wikimedia.org/T330091) [23:44:26] (03CR) 10Dzahn: [C: 03+2] "also double checked people2002 is on the SANs of the TLS cert for envoy" [dns] - 10https://gerrit.wikimedia.org/r/891381 (https://phabricator.wikimedia.org/T330091) (owner: 10Dzahn) [23:45:30] (03CR) 10Dzahn: [C: 03+2] "[deploy1002:~] $ httpbb --hosts people2002.codfw.wmnet /srv/deployment/httpbb-tests/people/test_people.yaml" [dns] - 10https://gerrit.wikimedia.org/r/891381 (https://phabricator.wikimedia.org/T330091) (owner: 10Dzahn) [23:51:40] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Dzahn)