[00:00:34] (DatasourceError) firing: Nonwrite HTTP requests with primary DB writes alert - https://grafana.wikimedia.org/alerting/grafana/4p0FIj1Vkz/view - https://wikitech.wikimedia.org/wiki/Monitoring/DatasourceError - https://alerts.wikimedia.org/?q=alertname%3DDatasourceError [00:05:42] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host pc1016.mgmt.eqiad.wmnet with reboot policy FORCED [00:06:18] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['pc1016'] [00:06:19] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [00:07:39] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['pc1016'] [00:07:39] !log jclark@cumin1001 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['pc1016'] [00:07:43] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['pc1015'] [00:08:38] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['pc1015'] [00:08:57] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host pc1015.mgmt.eqiad.wmnet with reboot policy FORCED [00:08:59] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host pc1015.mgmt.eqiad.wmnet with reboot policy FORCED [00:10:45] SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install pc101[56] - https://phabricator.wikimedia.org/T342164 (Jclark-ctr) [00:10:49] (DatasourceError) resolved: Nonwrite HTTP requests with primary DB writes alert - https://grafana.wikimedia.org/alerting/grafana/4p0FIj1Vkz/view - https://wikitech.wikimedia.org/wiki/Monitoring/DatasourceError - https://alerts.wikimedia.org/?q=alertname%3DDatasourceError [00:11:43] SRE, ops-eqiad,
DC-Ops, Data-Persistence: Q1:rack/setup/install pc101[56] - https://phabricator.wikimedia.org/T342164 (Jclark-ctr) @VRiley-WMF if you have a screen going, please check to see if it's in the process of doing something on pc1015 spicerack.dhcp.DHCPError: Snippet /etc/dhcp/automation/mg... [00:15:32] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['pc1016'] [00:27:47] (PS1) Krinkle: search-grafana-dashboards: add support for searching alert metadata [software] - https://gerrit.wikimedia.org/r/959366 [00:36:24] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [00:38:27] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/958975 [00:38:33] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/958975 (owner: TrainBranchBot) [00:38:45] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt db12[34-49] - jclark@cumin1001" [00:39:37] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt db12[34-49] - jclark@cumin1001" [00:39:37] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [00:41:24] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host db1234 [00:41:31] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host db1235 [00:42:25] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1235 [00:42:45] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1234 [00:43:07] !log jclark@cumin1001 START - Cookbook
sre.network.configure-switch-interfaces for host db1236 [00:43:23] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host db1238 [00:43:27] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host db1239 [00:44:15] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1239 [00:44:23] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1236 [00:44:35] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1238 [00:44:38] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host db1240 [00:44:42] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host db1241 [00:44:45] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host db1242 [00:45:36] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1241 [00:45:42] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host db1243 [00:45:59] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1240 [00:46:01] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1242 [00:46:04] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host db1244 [00:46:07] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host db1245 [00:46:11] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1243 [00:46:17] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host db1246 [00:46:19] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units 
failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:46:40] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1244 [00:46:48] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host db1247 [00:46:50] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1245 [00:46:58] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host db1248 [00:47:11] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1247 [00:47:15] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1246 [00:47:51] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host db1249 [00:48:02] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1248 [00:49:06] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1249 [00:50:37] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host db1236.mgmt.eqiad.wmnet with reboot policy FORCED [00:50:39] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host db1234.mgmt.eqiad.wmnet with reboot policy FORCED [00:50:40] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host db1235.mgmt.eqiad.wmnet with reboot policy FORCED [00:50:42] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host db1237.mgmt.eqiad.wmnet with reboot policy FORCED [00:50:44] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host db1238.mgmt.eqiad.wmnet with reboot policy FORCED [00:50:46] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host db1239.mgmt.eqiad.wmnet with reboot policy FORCED [00:51:21] PROBLEM - restbase endpoints 
health on restbase1031 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:52:35] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:53:32] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/958975 (owner: TrainBranchBot) [00:54:01] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:55:39] RECOVERY - restbase endpoints health on restbase1031 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:04:31] (PS2) Krinkle: search-grafana-dashboards: add support for searching alert metadata [software] - https://gerrit.wikimedia.org/r/959366 (https://phabricator.wikimedia.org/T345190) [01:09:12] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1236.mgmt.eqiad.wmnet with reboot policy FORCED [01:09:23] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1238.mgmt.eqiad.wmnet with reboot policy FORCED [01:09:37] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1239.mgmt.eqiad.wmnet with reboot policy FORCED [01:09:56] !log jclark@cumin1001 END (PASS) - Cookbook
sre.hosts.provision (exit_code=0) for host db1237.mgmt.eqiad.wmnet with reboot policy FORCED [01:10:00] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host db1240.mgmt.eqiad.wmnet with reboot policy FORCED [01:10:01] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host db1241.mgmt.eqiad.wmnet with reboot policy FORCED [01:10:05] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host db1242.mgmt.eqiad.wmnet with reboot policy FORCED [01:10:19] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host db1243.mgmt.eqiad.wmnet with reboot policy FORCED [01:11:03] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1234.mgmt.eqiad.wmnet with reboot policy FORCED [01:11:09] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-web_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:11:22] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host db1244.mgmt.eqiad.wmnet with reboot policy FORCED [01:17:04] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1235.mgmt.eqiad.wmnet with reboot policy FORCED [01:17:29] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host db1249.mgmt.eqiad.wmnet with reboot policy FORCED [01:19:07] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [01:28:35] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1240.mgmt.eqiad.wmnet with reboot policy FORCED [01:28:37] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1242.mgmt.eqiad.wmnet with reboot policy FORCED [01:29:21] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision 
(exit_code=0) for host db1241.mgmt.eqiad.wmnet with reboot policy FORCED [01:29:24] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1243.mgmt.eqiad.wmnet with reboot policy FORCED [01:29:41] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host db1247.mgmt.eqiad.wmnet with reboot policy FORCED [01:29:43] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host db1248.mgmt.eqiad.wmnet with reboot policy FORCED [01:30:40] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host db1245.mgmt.eqiad.wmnet with reboot policy FORCED [01:30:42] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host db1246.mgmt.eqiad.wmnet with reboot policy FORCED [01:31:20] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1244.mgmt.eqiad.wmnet with reboot policy FORCED [01:35:32] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1249.mgmt.eqiad.wmnet with reboot policy FORCED [01:39:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [01:44:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [01:46:05] RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 62, down: 0, dormant: 0, excluded: 0, unused: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:46:51] RECOVERY - Router interfaces on cr1-esams is OK: OK: host 185.15.59.128, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:47:14] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1247.mgmt.eqiad.wmnet with reboot policy FORCED [01:48:29] PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:48:38] SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (Jclark-ctr) @VRiley-WMF db1247 fails to provision; please check cables / serial number in the morning [01:48:52] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1246.mgmt.eqiad.wmnet with reboot policy FORCED [01:48:55] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1248.mgmt.eqiad.wmnet with reboot policy FORCED [01:49:33] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1234'] [01:49:39] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1236'] [01:49:42] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1237'] [01:49:46] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1238'] [01:50:26] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1235'] [01:50:57] !log jclark@cumin1001 END (PASS) - Cookbook
sre.hosts.provision (exit_code=0) for host db1245.mgmt.eqiad.wmnet with reboot policy FORCED [01:51:38] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1239'] [01:52:31] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:52:45] RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:53:57] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:58:35] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1234'] [01:58:42] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1240'] [01:58:52] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1235'] [01:58:57] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1241'] [01:59:08] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1236'] [01:59:10] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1238'] [01:59:18] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1242'] [01:59:28] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1243'] [01:59:32] !log 
jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1237'] [01:59:41] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1244'] [01:59:59] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1239'] [02:00:31] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1245'] [02:02:07] (CR) Andrew Bogott: [C: +2] Add new servers db1226-33 [puppet] - https://gerrit.wikimedia.org/r/959359 (https://phabricator.wikimedia.org/T342176) (owner: Jclark-ctr) [02:07:15] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1240'] [02:07:24] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1241'] [02:07:36] (JobUnavailable) firing: (8) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:07:51] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1243'] [02:08:01] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:08:05] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1244'] [02:08:42] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1246'] [02:08:46] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1248'] [02:08:53] !log jclark@cumin1001 START - Cookbook
sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1249'] [02:09:00] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1242'] [02:09:10] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1245'] [02:11:11] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [02:11:47] PROBLEM - restbase endpoints health on restbase1033 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:13:05] SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (Jclark-ctr) [02:13:13] RECOVERY - restbase endpoints health on restbase1033 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:17:52] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1246'] [02:17:53] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1249'] [02:17:55] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1248'] [02:22:37] (JobUnavailable) firing: (8) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:28:11] PROBLEM - Check systemd state on dumpsdata1006
is CRITICAL: CRITICAL - degraded: The following units failed: cleanup_tmpdumps.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:42:36] (JobUnavailable) firing: (7) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:44:55] PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:46:21] RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:48:27] (PrometheusRuleEvaluationFailures) firing: Prometheus rule evaluation failures (instance titan2001:17902) - https://wikitech.wikimedia.org/wiki/Prometheus - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=codfw%20prometheus%2Fops - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures [02:53:27] (PrometheusRuleEvaluationFailures) resolved: Prometheus rule evaluation failures (instance titan2001:17902) - https://wikitech.wikimedia.org/wiki/Prometheus - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=codfw%20prometheus%2Fops - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures [02:57:03] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [03:02:03] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [03:04:51] (PS1) Andrew Bogott: pdns_server: make the webserver address configurable [puppet] - https://gerrit.wikimedia.org/r/959377 (https://phabricator.wikimedia.org/T346385) [03:04:54] (PS1) Andrew Bogott: Update pdns web server to use private IPs [puppet] - https://gerrit.wikimedia.org/r/959378 (https://phabricator.wikimedia.org/T346385) [03:04:56] (PS1) Andrew Bogott: designate pools.yaml: contact pdns webserver on private IP [puppet] - https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385) [03:05:23] (CR) CI reject: [V: -1] pdns_server: make the webserver address configurable [puppet] - https://gerrit.wikimedia.org/r/959377 (https://phabricator.wikimedia.org/T346385) (owner: Andrew Bogott) [03:11:24] (PS2) Andrew Bogott: pdns_server: make the webserver address configurable [puppet] - https://gerrit.wikimedia.org/r/959377 (https://phabricator.wikimedia.org/T346385) [03:11:26] (PS2) Andrew Bogott: Update pdns web server to use private IPs [puppet] - https://gerrit.wikimedia.org/r/959378 (https://phabricator.wikimedia.org/T346385) [03:11:28] (PS2) Andrew Bogott: designate pools.yaml: contact pdns webserver on private IP [puppet] - https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385) [03:13:47] (CR) CI reject: [V: -1] pdns_server: make the webserver address configurable [puppet] - https://gerrit.wikimedia.org/r/959377 (https://phabricator.wikimedia.org/T346385) (owner: Andrew Bogott) [03:15:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki
mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [03:20:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [03:27:39] (KubernetesAPILatency) firing: (12) High Kubernetes API latency (LIST certificaterequests) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [03:32:39] (KubernetesAPILatency) resolved: (12) High Kubernetes API latency (LIST certificaterequests) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [03:40:11] (CR) Andrew Bogott: [C: -1] designate pools.yaml: contact pdns webserver on private IP (1 comment) [puppet] - https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385) (owner: Andrew Bogott) [03:42:11] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:43:29] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.279 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:53:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST
replicasets) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [03:58:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST replicasets) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [04:15:43] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 143, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:16:23] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:01:01] PROBLEM - restbase endpoints health on restbase1031 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:02:29] RECOVERY - restbase endpoints health on restbase1031 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:08:13] PROBLEM - restbase endpoints health on restbase1031 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:09:41] RECOVERY - restbase endpoints 
health on restbase1031 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:14:09] (MediaWikiLatencyExceeded) firing: Average latency high: codfw mw-web (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [05:17:37] SRE, Data-Persistence, Performance-Team, serviceops, Datacenter-Switchover: September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345263 (Marostegui) [05:17:53] !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [05:19:09] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw mw-web (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [05:24:07] !log isaranto@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[05:27:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [05:27:41] (PS1) Marostegui: Revert "Add new servers db1226-33" [puppet] - https://gerrit.wikimedia.org/r/959294 [05:28:06] (CR) CI reject: [V: -1] Revert "Add new servers db1226-33" [puppet] - https://gerrit.wikimedia.org/r/959294 (owner: Marostegui) [05:28:39] (PS2) Marostegui: Revert "Add new servers db1226-33" [puppet] - https://gerrit.wikimedia.org/r/959294 [05:29:47] (CR) Marostegui: [C: +2] Revert "Add new servers db1226-33" [puppet] - https://gerrit.wikimedia.org/r/959294 (owner: Marostegui) [05:32:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [05:33:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [05:38:09] (PHPFPMTooBusy) resolved: (2) Not enough idle PHP-FPM workers for Mediawiki mw-api-ext (canary) at codfw - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [05:38:41] !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace
'revscoring-articlequality' for release 'main' . [05:40:09] !log isaranto@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [05:44:35] !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [05:44:52] !log isaranto@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [05:45:11] 10SRE, 10Data-Persistence, 10Performance-Team, 10serviceops, 10Datacenter-Switchover: September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345263 (10Marostegui) [05:47:40] !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [05:47:55] !log isaranto@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [05:49:53] !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [05:51:38] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [05:52:03] !log isaranto@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
[05:53:09] (MediaWikiLatencyExceeded) firing: Average latency high: codfw mw-web (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [05:56:38] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230921T0600) [06:00:05] kormat, marostegui, and Amir1: #bothumor My software never has bugs. It just develops random features. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230921T0600). 
[06:01:38] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [06:03:09] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw mw-web (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [06:03:13] PROBLEM - Check systemd state on idm1001 is CRITICAL: CRITICAL - degraded: The following units failed: sync_bitu_username_block.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:04:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [06:09:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [06:11:09] (MediaWikiLatencyExceeded) firing: Average latency high: codfw mw-web (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [06:16:09] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw mw-web (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [06:17:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web (main) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [06:20:11] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mw-cli-wrapper: fix own dc reference in Beta Cluster [puppet] - 10https://gerrit.wikimedia.org/r/935448 (owner: 10Krinkle) [06:22:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web (main) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [06:24:42] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] mcrouter: Specify missing CXXFLAGS (031 comment) [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/860584 (owner: 10TK-999) [06:27:52] (03PS1) 10Giuseppe Lavagetto: mcrouter: new version [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/959386 [06:28:12] !log taavi@cumin1001 START - Cookbook sre.dns.netbox [06:30:39] !log taavi@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix cloudsw cloud-private records - taavi@cumin1001" [06:31:28] !log 
taavi@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix cloudsw cloud-private records - taavi@cumin1001" [06:31:28] !log taavi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [06:38:59] (03CR) 10Muehlenhoff: [C: 03+1] ferm: fix ferm-status on container bullseye instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/959236 (https://phabricator.wikimedia.org/T344868) (owner: 10JHathaway) [06:40:16] (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [06:43:09] (JobUnavailable) firing: (6) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:45:16] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [06:48:25] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected 
status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [06:49:51] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [06:49:55] PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [06:51:21] RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [06:51:23] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 2915 [06:52:02] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 2915 [06:52:49] PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [06:54:15] RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:00:04] Amir1, apergos, and jnuche: #bothumor I � Unicode. All rise for UTC morning backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230921T0700). [07:00:04] abijeet: A patch you scheduled for UTC morning backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:21] * kart_ will deploy ^^ patch. 
[07:00:24] morning! no trainees are signed up for the window today. [07:00:37] ok well that answers my next question about self-deploys and whatnot :-D [07:00:42] :) [07:00:47] happy deploying, kart_ ! [07:01:01] note there's a merge conflict showing in gerrit [07:01:08] I will wait for abijeet. He will join in a few minutes. Meanwhile, rebasing patch. [07:01:10] Yes. [07:01:20] (03PS4) 10KartikMistry: Enable MinT translation service on Meta-Wiki - rollout #5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/958406 (https://phabricator.wikimedia.org/T341445) (owner: 10Abijeet Patro) [07:01:49] 👍 [07:04:34] o/ [07:05:00] scap backport no longer listed at, https://deploy-commands.toolforge.org/bacc/958406 ? <-- apergos ? [07:05:21] yeah, I expect because the deployment is running out of codfw there's some little hiccup [07:05:39] we can ask Amir or you can just go ahead with the old procedure [07:05:56] Amir1: ^ [07:06:12] I'm fine with old procedure too. [07:07:35] kart_: `scap backport` is still fine [07:08:00] taavi: thanks. Going with it then :) [07:08:30] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kartik@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/958406 (https://phabricator.wikimedia.org/T341445) (owner: 10Abijeet Patro) [07:09:13] (03Merged) 10jenkins-bot: Enable MinT translation service on Meta-Wiki - rollout #5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/958406 (https://phabricator.wikimedia.org/T341445) (owner: 10Abijeet Patro) [07:09:25] (03PS12) 10Slyngshede: C:idm:jobs Use bitu command for systemd jobs. 
[puppet] - 10https://gerrit.wikimedia.org/r/959155 [07:10:32] !log kartik@deploy2002 Started scap: Backport for [[gerrit:958406|Enable MinT translation service on Meta-Wiki - rollout #5 (T341445)]] [07:10:39] T341445: Enable MinT for translatable pages - https://phabricator.wikimedia.org/T341445 [07:11:10] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43437/console" [puppet] - 10https://gerrit.wikimedia.org/r/959155 (owner: 10Slyngshede) [07:11:32] (03CR) 10CI reject: [V: 04-1] C:idm:jobs Use bitu command for systemd jobs. [puppet] - 10https://gerrit.wikimedia.org/r/959155 (owner: 10Slyngshede) [07:12:54] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43438/console" [puppet] - 10https://gerrit.wikimedia.org/r/959155 (owner: 10Slyngshede) [07:19:09] (03PS4) 10Kevin Bazira: ml-services: update recommendation-api-ng image [deployment-charts] - 10https://gerrit.wikimedia.org/r/956017 (https://phabricator.wikimedia.org/T339890) [07:19:25] (03PS13) 10Slyngshede: C:idm:jobs Use bitu command for systemd jobs. [puppet] - 10https://gerrit.wikimedia.org/r/959155 [07:22:16] scap seems stuck at `07:12:20 K8s images build/push output redirected to /home/kartik/scap-image-build-and-push-log` step like last time (ie super slow) [07:22:20] (03CR) 10CI reject: [V: 04-1] C:idm:jobs Use bitu command for systemd jobs. [puppet] - 10https://gerrit.wikimedia.org/r/959155 (owner: 10Slyngshede) [07:22:25] Do we have a bug/task about this? [07:22:54] "Finished build-and-push-container-images (duration: 10m 12s)" [07:25:56] (03PS14) 10Slyngshede: C:idm:jobs Use bitu command for systemd jobs. [puppet] - 10https://gerrit.wikimedia.org/r/959155 [07:29:32] (03CR) 10Slyngshede: C:idm:jobs Use bitu command for systemd jobs. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/959155 (owner: 10Slyngshede) [07:30:26] I don't know, kart_, if you didn't find one in phab then it might actually be stuck [07:31:05] it might be rebuilding images slowly, just give it a little more time I guess [07:31:05] (SwiftTooManyMediaUploads) firing: Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [07:32:11] !log kartik@deploy2002 kartik and abi: Backport for [[gerrit:958406|Enable MinT translation service on Meta-Wiki - rollout #5 (T341445)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [07:32:18] T341445: Enable MinT for translatable pages - https://phabricator.wikimedia.org/T341445 [07:33:26] apergos: yeah. Slow. [07:33:41] abijeet: Can you test patch debug servers? [07:33:47] welp. you've got the whole window at least [07:33:48] on debug* [07:33:48] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/959155 (owner: 10Slyngshede) [07:34:18] apergos: Deploying just 1 patch in 1 hour is quite slow :D [07:34:28] ayup :-D [07:36:34] kart_, on it [07:38:00] kart_, looks good [07:38:29] nice. Deploying.. 
[07:38:32] !log kartik@deploy2002 kartik and abi: Continuing with sync [07:39:03] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:40:31] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:41:02] (03CR) 10Vgutierrez: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/957748 (https://phabricator.wikimedia.org/T340027) (owner: 10AOkoth) [07:41:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [07:42:03] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:43:05] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 144, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:43:17] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:43:21] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:45:55] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.305 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:45:59] RECOVERY - mailman archives on 
lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50567 bytes in 0.100 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:46:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [07:46:35] PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:47:59] RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:49:20] (03CR) 10Slyngshede: [C: 03+2] C:idm:jobs Use bitu command for systemd jobs. 
[puppet] - 10https://gerrit.wikimedia.org/r/959155 (owner: 10Slyngshede) [07:50:21] PROBLEM - restbase endpoints health on restbase1030 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:51:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [07:51:45] RECOVERY - restbase endpoints health on restbase1030 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:52:34] !log kartik@deploy2002 Finished scap: Backport for [[gerrit:958406|Enable MinT translation service on Meta-Wiki - rollout #5 (T341445)]] (duration: 42m 01s) [07:52:41] T341445: Enable MinT for translatable pages - https://phabricator.wikimedia.org/T341445 [07:53:09] We are done, abijeet :) [07:53:53] (03CR) 10Vgutierrez: [C: 04-1] Release 0.21+deb11u1 for bullseye (032 comments) [software/purged] - 10https://gerrit.wikimedia.org/r/959328 (owner: 10Fabfur) [07:53:59] RECOVERY - Check systemd state on idm1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:54:24] kart_, thanks [07:55:58] (03CR) 10Majavah: [C: 03+2] dynamicproxy: use a mariadb backend [puppet] - 10https://gerrit.wikimedia.org/r/928459 (https://phabricator.wikimedia.org/T316982) (owner: 10Majavah) [07:58:04] (03CR) 10Fabfur: Release 0.21+deb11u1 for bullseye (032 comments) [software/purged] - 10https://gerrit.wikimedia.org/r/959328 (owner: 10Fabfur) [07:58:55] 
PROBLEM - restbase endpoints health on restbase1031 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:59:55] (03CR) 10Vgutierrez: [C: 04-1] Release 0.21+deb11u1 for bullseye (031 comment) [software/purged] - 10https://gerrit.wikimedia.org/r/959328 (owner: 10Fabfur) [08:00:04] brennen and jnuche: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230921T0800). [08:00:21] RECOVERY - restbase endpoints health on restbase1031 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [08:03:48] kart_: taavi apergos My apologies, I moved the service to gitlab yesterday, I guess the commits are not properly moved, I'll check [08:04:36] okey dokey [08:04:59] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Add configuration for the new kubernetes node in codfw [homer/public] - 10https://gerrit.wikimedia.org/r/958489 (https://phabricator.wikimedia.org/T345709) (owner: 10Giuseppe Lavagetto) [08:05:39] (03Merged) 10jenkins-bot: Add configuration for the new kubernetes node in codfw [homer/public] - 10https://gerrit.wikimedia.org/r/958489 (https://phabricator.wikimedia.org/T345709) (owner: 10Giuseppe Lavagetto) [08:06:18] (03PS2) 10Fabfur: Release 0.21+deb11u1 for bullseye [software/purged] - 10https://gerrit.wikimedia.org/r/959328 [08:06:48] (03CR) 10Fabfur: Release 0.21+deb11u1 for bullseye (031 comment) [software/purged] - 10https://gerrit.wikimedia.org/r/959328 (owner: 10Fabfur) [08:07:12] 10SRE, 10Bitu, 10Infrastructure-Foundations: IDM milestone 2 "Initial limited deployment" - https://phabricator.wikimedia.org/T320603 (10SLyngshede-WMF) [08:07:32] 
10SRE, 10Bitu, 10Infrastructure-Foundations: Build Debian packages for Bookworm - https://phabricator.wikimedia.org/T340721 (10SLyngshede-WMF) 05Open→03Resolved a:03SLyngshede-WMF [08:10:16] !log brouberol@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply [08:10:32] !log brouberol@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply [08:10:44] Amir1: No problem. Thanks for update! [08:11:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [08:12:18] !log redeploying eventgate-analytics-external in staging T336041 [08:12:20] !log brouberol@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply [08:12:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:24] T336041: Bring kafka-jumbo10[09-15] into service - https://phabricator.wikimedia.org/T336041 [08:12:32] (03CR) 10Filippo Giunchedi: "I am +1 on the idea, wrt to the parameter I seemed to remember we did something similar for http probes, there are some heuristic in wmfli" [puppet] - 10https://gerrit.wikimedia.org/r/958973 (https://phabricator.wikimedia.org/T346893) (owner: 10Cwhite) [08:12:40] !log brouberol@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply [08:13:21] !log redeploying eventstreams-internal in staging T336041 [08:13:22] (03CR) 10Filippo Giunchedi: [C: 03+1] titan: add pyrra/slo envoy/cfssl config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/956902 (owner: 10Herron) [08:13:24] !log brouberol@deploy2002 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply [08:13:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:51] !log brouberol@deploy2002 helmfile [staging] DONE 
helmfile.d/services/eventstreams-internal: apply [08:14:10] !log redeploying mw-page-content-change-enrich in staging T336041 [08:14:13] (03PS1) 10Slyngshede: C:idm::deployment Create staticfiles directory [puppet] - 10https://gerrit.wikimedia.org/r/959605 [08:14:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:16] !log brouberol@deploy2002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [08:14:19] !log brouberol@deploy2002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [08:14:29] (noop) [08:14:41] RECOVERY - Check systemd state on gitlab1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:15:15] (03CR) 10Vgutierrez: [C: 03+1] Release 0.21+deb11u1 for bullseye [software/purged] - 10https://gerrit.wikimedia.org/r/959328 (owner: 10Fabfur) [08:16:38] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43439/console" [puppet] - 10https://gerrit.wikimedia.org/r/959605 (owner: 10Slyngshede) [08:17:48] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43440/console" [puppet] - 10https://gerrit.wikimedia.org/r/959605 (owner: 10Slyngshede) [08:17:51] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602 [08:17:51] ctive - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - 
kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64 [08:17:51] : Active - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw [08:17:51] 2/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernet [08:17:51] , AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Active - ku [08:17:51] -codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:18:15] (03CR) 10Slyngshede: 
[V: 03+1 C: 03+2] C:idm::deployment Create staticfiles directory [puppet] - 10https://gerrit.wikimedia.org/r/959605 (owner: 10Slyngshede) [08:18:33] (03PS5) 10Elukey: slo_template: hardcode time window for SLO dashboards [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/956878 (https://phabricator.wikimedia.org/T346144) [08:20:01] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS [08:20:01] v4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernet [08:20:01] , AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connec [08:20:01] rnetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - 
kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/ [08:20:01] nnect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-cod [08:20:02] 602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:22:14] I guess that these are the new wikikube nodes [08:22:29] jayme --^ [08:23:22] yeah, sorry [08:23:24] cc _joe_ [08:24:57] ack thanks jayme [08:25:13] _joe_: 👋 Clément mentioned yesterday that you ran into an issue with scap doing a helm rollback when it shouldn't [08:25:20] is there a task for that with the details? [08:27:33] 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudcontrol1006: move to new network setup - https://phabricator.wikimedia.org/T346891 (10taavi) [08:27:38] 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudcontrol1007: move to new network setup - https://phabricator.wikimedia.org/T346892 (10taavi) [08:28:06] <_joe_> jnuche: yeah, not now please :) [08:28:44] <_joe_> jnuche: I'll ping you when I'm done with what I am doing rn :) [08:28:59] PROBLEM - Host registry2003 is DOWN: PING CRITICAL - Packet loss = 100% [08:29:11] <_joe_> uh [08:29:12] _joe_: sure, talk to you later :) [08:29:13] _joe_: that you? 
[08:29:13] <_joe_> jayme: ^^ [08:29:15] 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudcontrol1007: move to new network setup - https://phabricator.wikimedia.org/T346892 (10taavi) [08:29:16] <_joe_> nope [08:29:20] wtf [08:29:25] PROBLEM - Host mwdebug2001 is DOWN: PING CRITICAL - Packet loss = 100% [08:29:31] <_joe_> uh oh [08:29:39] <_joe_> this looks like a ganeti host [08:29:59] <_joe_> jayme: can you check that? [08:30:03] on it [08:30:33] (03CR) 10Muehlenhoff: [C: 03+2] Switch ganeti-test to nftables [puppet] - 10https://gerrit.wikimedia.org/r/958400 (owner: 10Muehlenhoff) [08:30:53] (03PS1) 10Filippo Giunchedi: thanos: bump max open files for query/rule/compact [puppet] - 10https://gerrit.wikimedia.org/r/959674 (https://phabricator.wikimedia.org/T346950) [08:32:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [08:33:16] block drbd0: We did not send a P_BARRIER for 301516ms > ko-count (7) * timeout (60 * 0.1s); drbd kernel thread blocked? [08:33:55] moritzm: you have a minute? ^ [08:34:01] ganeti2030 [08:35:58] jynus: having a look [08:36:06] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10User-aborrero: cookbook: sre.hosts.decommission: also remove logical volumes (to allow rename while reimage) - https://phabricator.wikimedia.org/T346875 (10aborrero) We are about to run the procedure again for {T346892} in case you want to test/observe/re... 
[08:36:52] wrong j.ynus :) [08:37:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [08:37:10] oh yes :-) [08:37:24] drbd0,3,7 fwiw [08:37:32] (03PS1) 10Slyngshede: C:idm::deployment Fix path to static files. [puppet] - 10https://gerrit.wikimedia.org/r/959676 [08:37:36] (JobUnavailable) firing: (8) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:38:51] PROBLEM - Check unit status of netbox_ganeti_codfw_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:39:24] (03PS2) 10Slyngshede: C:idm::deployment Fix path to static files. [puppet] - 10https://gerrit.wikimedia.org/r/959676 [08:40:21] moritzm: there were some " peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )" that probably started this [08:40:30] jayme: I'll depool registry2003, ok? 
[08:40:35] yeah [08:40:43] depooled [08:40:47] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43441/console" [puppet] - 10https://gerrit.wikimedia.org/r/959676 (owner: 10Slyngshede) [08:40:53] PROBLEM - nova-compute proc minimum on cloudvirt-wdqs1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:41:05] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:41:21] (03PS1) 10Majavah: Remove most references to cloudcontrol1007 [puppet] - 10https://gerrit.wikimedia.org/r/959677 (https://phabricator.wikimedia.org/T346892) [08:41:34] the other two VMs on ganeti2030 are an inactive LDAP replica and mwdebug2001 [08:42:07] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43442/console" [puppet] - 10https://gerrit.wikimedia.org/r/959676 (owner: 10Slyngshede) [08:42:17] also depooled mwdebug2001 (except for those who pick it manually in the browser extension) [08:42:17] RECOVERY - nova-compute proc minimum on cloudvirt-wdqs1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:42:33] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Remove most references to cloudcontrol1007 [puppet] - 10https://gerrit.wikimedia.org/r/959677 (https://phabricator.wikimedia.org/T346892) (owner: 10Majavah) [08:42:44] (03CR) 10Majavah: [C: 03+2] Remove most references to cloudcontrol1007 [puppet] - 10https://gerrit.wikimedia.org/r/959677 (https://phabricator.wikimedia.org/T346892) (owner: 10Majavah) [08:43:19] moritzm: ok to merge your 
patch? (nftables for ganeti-test) [08:43:23] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] C:idm::deployment Fix path to static files. [puppet] - 10https://gerrit.wikimedia.org/r/959676 (owner: 10Slyngshede) [08:45:24] (03PS6) 10Elukey: slo_template: hardcode time window for SLO dashboards [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/956878 (https://phabricator.wikimedia.org/T346144) [08:46:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [08:48:28] (03PS5) 10Kevin Bazira: ml-services: update recommendation-api-ng image [deployment-charts] - 10https://gerrit.wikimedia.org/r/956017 (https://phabricator.wikimedia.org/T339890) [08:48:54] taavi: please do, sorry got distracted by the ganeti server [08:49:32] (03CR) 10Elukey: [C: 03+1] ml-services: update recommendation-api-ng image [deployment-charts] - 10https://gerrit.wikimedia.org/r/956017 (https://phabricator.wikimedia.org/T339890) (owner: 10Kevin Bazira) [08:51:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [08:51:24] !log taavi@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudcontrol1007.wikimedia.org [08:52:59] PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article 
returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [08:53:03] (03CR) 10Kevin Bazira: [C: 03+2] ml-services: update recommendation-api-ng image [deployment-charts] - 10https://gerrit.wikimedia.org/r/956017 (https://phabricator.wikimedia.org/T339890) (owner: 10Kevin Bazira) [08:53:50] (03Merged) 10jenkins-bot: ml-services: update recommendation-api-ng image [deployment-charts] - 10https://gerrit.wikimedia.org/r/956017 (https://phabricator.wikimedia.org/T339890) (owner: 10Kevin Bazira) [08:54:23] RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [08:55:22] (03PS1) 10Slyngshede: C:idm::deployment update UWSGI base path to fit repo layout. [puppet] - 10https://gerrit.wikimedia.org/r/959680 [08:56:39] (03CR) 10Slyngshede: [C: 03+2] C:idm::deployment update UWSGI base path to fit repo layout. [puppet] - 10https://gerrit.wikimedia.org/r/959680 (owner: 10Slyngshede) [08:57:06] !log taavi@cumin1001 START - Cookbook sre.dns.netbox [08:58:00] (03CR) 10Elukey: "@Herron: I added some wording based on the suggestion that you made on the task. 
You'd need to scroll down a little to read it all, but I " [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/956878 (https://phabricator.wikimedia.org/T346144) (owner: 10Elukey) [08:58:56] (03PS1) 10Giuseppe Lavagetto: docker_registry_ha: fetch the list of nodes from the static list [puppet] - 10https://gerrit.wikimedia.org/r/959681 [08:59:25] !log taavi@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudcontrol1007.wikimedia.org decommissioned, removing all IPs except the asset tag one - taavi@cumin1001" [09:00:28] !log taavi@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudcontrol1007.wikimedia.org decommissioned, removing all IPs except the asset tag one - taavi@cumin1001" [09:00:28] !log taavi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:00:28] !log taavi@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudcontrol1007.wikimedia.org [09:00:40] 10SRE, 10ops-eqiad, 10Patch-For-Review, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudcontrol1007: move to new network setup - https://phabricator.wikimedia.org/T346892 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by taavi@cumin1001 for hosts: `cloudcontrol1007.wikimed... 
[09:00:53] 10SRE, 10ops-eqiad, 10Patch-For-Review, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudcontrol1007: move to new network setup - https://phabricator.wikimedia.org/T346892 (10taavi) [09:00:57] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43443/console" [puppet] - 10https://gerrit.wikimedia.org/r/959681 (owner: 10Giuseppe Lavagetto) [09:01:24] (03PS1) 10Jelto: gitlab: delay restore timer 30 minutes [puppet] - 10https://gerrit.wikimedia.org/r/959683 [09:03:07] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on ganeti2030.codfw.wmnet with reason: Fixup DRBD [09:03:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ganeti2030.codfw.wmnet with reason: Fixup DRBD [09:04:19] (03CR) 10JMeybohm: [C: 03+1] docker_registry_ha: fetch the list of nodes from the static list [puppet] - 10https://gerrit.wikimedia.org/r/959681 (owner: 10Giuseppe Lavagetto) [09:06:58] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] docker_registry_ha: fetch the list of nodes from the static list [puppet] - 10https://gerrit.wikimedia.org/r/959681 (owner: 10Giuseppe Lavagetto) [09:07:24] (03CR) 10Klausman: "This change is ready for review." [deployment-charts] - 10https://gerrit.wikimedia.org/r/959684 (https://phabricator.wikimedia.org/T334182) (owner: 10Klausman) [09:08:16] !log kevinbazira@deploy1002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . 
[09:08:52] (03PS1) 10Ayounsi: Remove OSPF between eqsin and eqdfw [homer/public] - 10https://gerrit.wikimedia.org/r/959685 (https://phabricator.wikimedia.org/T344888) [09:09:31] RECOVERY - Check unit status of netbox_ganeti_codfw_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:11:31] (03CR) 10Ayounsi: [C: 03+2] Remove OSPF between eqsin and eqdfw [homer/public] - 10https://gerrit.wikimedia.org/r/959685 (https://phabricator.wikimedia.org/T344888) (owner: 10Ayounsi) [09:12:01] (03PS1) 10Fabfur: varnish: add comment to better explain mobile redirection regex [puppet] - 10https://gerrit.wikimedia.org/r/959686 (https://phabricator.wikimedia.org/T344175) [09:12:12] (03Merged) 10jenkins-bot: Remove OSPF between eqsin and eqdfw [homer/public] - 10https://gerrit.wikimedia.org/r/959685 (https://phabricator.wikimedia.org/T344888) (owner: 10Ayounsi) [09:17:37] (JobUnavailable) firing: (8) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:17:38] (03CR) 10Fabfur: [C: 03+2] Release 0.21+deb11u1 for bullseye [software/purged] - 10https://gerrit.wikimedia.org/r/959328 (owner: 10Fabfur) [09:20:33] RECOVERY - Host mwdebug2001 is UP: PING OK - Packet loss = 0%, RTA = 31.89 ms [09:24:01] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:26:08] (03PS1) 10Jelto: gitlab: remove deprecated grafana feature [puppet] - 10https://gerrit.wikimedia.org/r/959689 [09:27:22] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10User-aborrero: cookbook: sre.hosts.decommission: also remove logical volumes (to allow rename while reimage) - 
https://phabricator.wikimedia.org/T346875 (10Volans) Which partman recipe do you use? Does it include `modules/install_server/files/autoinstall/... [09:27:43] !log brouberol@deploy2002 helmfile [eqiad] START helmfile.d/services/eventstreams-internal: apply [09:27:49] (03CR) 10Vgutierrez: [C: 03+1] varnish: add comment to better explain mobile redirection regex (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/959686 (https://phabricator.wikimedia.org/T344175) (owner: 10Fabfur) [09:28:00] !log remove GRE tunnel between eqsin and eqdfw - T344888 [09:28:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:13] !log brouberol@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventstreams-internal: apply [09:28:19] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10User-aborrero: cookbook: sre.hosts.decommission: also remove logical volumes (to allow rename while reimage) - https://phabricator.wikimedia.org/T346875 (10Volans) And we don't see the same issue on plain reimages, where we don't even run wipefs. [09:30:13] !log brouberol@deploy2002 helmfile [codfw] START helmfile.d/services/eventstreams-internal: apply [09:30:24] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43444/console" [puppet] - 10https://gerrit.wikimedia.org/r/959689 (owner: 10Jelto) [09:30:42] !log brouberol@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventstreams-internal: apply [09:30:59] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10User-aborrero: cookbook: sre.hosts.decommission: also remove logical volumes (to allow rename while reimage) - https://phabricator.wikimedia.org/T346875 (10aborrero) Yes, apparently, we do! `lang=shell-session $ git grep cloudservices modules/install_ser... 
[09:31:09] PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 61, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:31:43] PROBLEM - Router interfaces on cr1-esams is CRITICAL: CRITICAL: host 185.15.59.128, interfaces up: 77, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:31:44] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43445/console" [puppet] - 10https://gerrit.wikimedia.org/r/959683 (owner: 10Jelto) [09:32:36] !log jmm@cumin2002 START - Cookbook sre.ganeti.resource-report [09:32:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.resource-report (exit_code=0) [09:32:48] (03PS2) 10Fabfur: varnish: add comment to better explain mobile redirection regex [puppet] - 10https://gerrit.wikimedia.org/r/959686 (https://phabricator.wikimedia.org/T344175) [09:32:57] !log brouberol@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply [09:33:01] (03CR) 10Fabfur: varnish: add comment to better explain mobile redirection regex (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/959686 (https://phabricator.wikimedia.org/T344175) (owner: 10Fabfur) [09:33:42] !log brouberol@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: apply [09:34:17] !log brouberol@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-analytics: apply [09:35:08] !log brouberol@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics: apply [09:35:23] RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 62, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:35:57] RECOVERY - Router interfaces on cr1-esams is OK: OK: host 
185.15.59.128, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:36:06] !log brouberol@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply [09:36:35] (03PS1) 10Jelto: peopleweb: switch rsync source and dest between eqiad and codfw [puppet] - 10https://gerrit.wikimedia.org/r/959690 (https://phabricator.wikimedia.org/T345618) [09:36:47] !log brouberol@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply [09:36:52] (03PS3) 10Fabfur: varnish: add comment to better explain mobile redirection regex [puppet] - 10https://gerrit.wikimedia.org/r/959686 (https://phabricator.wikimedia.org/T344175) [09:37:04] (03CR) 10CI reject: [V: 04-1] peopleweb: switch rsync source and dest between eqiad and codfw [puppet] - 10https://gerrit.wikimedia.org/r/959690 (https://phabricator.wikimedia.org/T345618) (owner: 10Jelto) [09:38:12] !log brouberol@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply [09:38:39] 10SRE, 10Bitu, 10Infrastructure-Foundations: Build-out for self service - https://phabricator.wikimedia.org/T320801 (10SLyngshede-WMF) [09:38:46] !log brouberol@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: apply [09:38:56] 10SRE, 10Bitu, 10Infrastructure-Foundations, 10wikitech.wikimedia.org, 10Patch-For-Review: Switch developer account creation to Bitu - https://phabricator.wikimedia.org/T345226 (10SLyngshede-WMF) 05Open→03Resolved [09:39:01] 10SRE, 10Bitu, 10Infrastructure-Foundations: Build-out for self service - https://phabricator.wikimedia.org/T320801 (10SLyngshede-WMF) [09:39:03] 10SRE, 10Bitu, 10Infrastructure-Foundations: Implement "Forgot my username" feature - https://phabricator.wikimedia.org/T340636 (10SLyngshede-WMF) 05In progress→03Resolved [09:39:05] (03PS1) 10Ayounsi: Remove DNS include for former eqsin-eqdfw GRE tunnel [dns] 
- 10https://gerrit.wikimedia.org/r/959691 (https://phabricator.wikimedia.org/T344888) [09:39:52] (03CR) 10Jelto: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/959690 (https://phabricator.wikimedia.org/T345618) (owner: 10Jelto) [09:40:31] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [09:40:33] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [09:40:52] 10SRE, 10Bitu, 10Infrastructure-Foundations: Update IDM servers to Bookworm - https://phabricator.wikimedia.org/T340722 (10SLyngshede-WMF) 05Open→03Resolved a:03SLyngshede-WMF [09:40:53] RECOVERY - Host registry2003 is UP: PING OK - Packet loss = 0%, RTA = 31.76 ms [09:40:55] 10SRE, 10Bitu, 10Infrastructure-Foundations: IDM milestone 2 "Initial limited deployment" - https://phabricator.wikimedia.org/T320603 (10SLyngshede-WMF) [09:41:15] PROBLEM - Docker registry HTTPS interface on registry2003 is CRITICAL: connect to address 10.192.0.39 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/Docker [09:41:27] PROBLEM - Docker registry HTTPS interface certificate expiry on registry2003 is CRITICAL: connect to address 10.192.0.39 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/Docker [09:41:51] PROBLEM - Check systemd state on registry2003 is CRITICAL: CRITICAL - degraded: The following units failed: docker-registry-ha-jwt.service,ifup@ens13.service,nginx.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:41:55] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:42:05] !log kevinbazira@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . 
[09:42:22] (03CR) 10CI reject: [V: 04-1] Remove DNS include for former eqsin-eqdfw GRE tunnel [dns] - 10https://gerrit.wikimedia.org/r/959691 (https://phabricator.wikimedia.org/T344888) (owner: 10Ayounsi) [09:42:37] (JobUnavailable) firing: (7) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:42:41] RECOVERY - Docker registry HTTPS interface on registry2003 is OK: HTTP OK: HTTP/1.1 200 OK - 3745 bytes in 0.311 second response time https://wikitech.wikimedia.org/wiki/Docker [09:42:53] RECOVERY - Docker registry HTTPS interface certificate expiry on registry2003 is OK: OK - Certificate docker-registry.discovery.wmnet will expire on Mon 26 Aug 2024 02:52:23 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Docker [09:43:17] RECOVERY - Check systemd state on registry2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:43:26] (03PS2) 10Jelto: peopleweb: switch rsync source and dest between eqiad and codfw [puppet] - 10https://gerrit.wikimedia.org/r/959690 (https://phabricator.wikimedia.org/T345618) [09:44:39] 10SRE, 10Bitu, 10Infrastructure-Foundations: Create a mockup and involve designers - https://phabricator.wikimedia.org/T320802 (10SLyngshede-WMF) p:05Triage→03Medium [09:44:41] jayme: registry2003 is back, can I just pool it back or do we need to trigger some resync or so? 
[09:45:19] moritzm: repool is fine [09:45:20] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43447/console" [puppet] - 10https://gerrit.wikimedia.org/r/959690 (https://phabricator.wikimedia.org/T345618) (owner: 10Jelto) [09:45:29] 10SRE, 10Bitu, 10Infrastructure-Foundations: Create a mockup and involve designers - https://phabricator.wikimedia.org/T320802 (10SLyngshede-WMF) Design has been handed over, but will require implementation by us, as this is not part of what is being offered by the design team. [09:46:01] jayme: ack, repooled now [09:46:05] 10SRE, 10Bitu, 10Infrastructure-Foundations: Create a mockup and involve designers - https://phabricator.wikimedia.org/T320802 (10SLyngshede-WMF) 05In progress→03Resolved a:03SLyngshede-WMF [09:46:08] 10SRE, 10Bitu, 10Infrastructure-Foundations: Build-out for self service - https://phabricator.wikimedia.org/T320801 (10SLyngshede-WMF) [09:46:15] 10SRE, 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (10BTullis) @Jclark-ctr When would you like to move an-tool1010? It is the single host behind superset.wikimedia.org so I'd like to give our users a little bit of notice if it's going to be dow... [09:48:39] 10SRE, 10Infrastructure-Foundations, 10netops: Include Netbox Anycast IPs in Capirca host definitions - https://phabricator.wikimedia.org/T347016 (10cmooney) p:05Triage→03Low [09:48:48] 10SRE, 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (10akosiaris) @Robh, Switchover was done yesterday, we are now in codfw for the next 6 months, deploy1002 is no longer used. It can be powered off and moved whenever #ops-eqiad feels like it. 
[09:49:38] (03PS1) 10Jelto: switch peopleweb from eqiad to codfw [dns] - 10https://gerrit.wikimedia.org/r/959693 (https://phabricator.wikimedia.org/T345618) [09:51:15] !log disable puppet on kubernetes[2025-2053].codfw.wmnet [09:51:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:19] (03PS1) 10Cathal Mooney: Include Anycast IPs in Netbox capirca host definitions [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/959694 (https://phabricator.wikimedia.org/T347016) [09:53:21] (03CR) 10Effie Mouzeli: [C: 03+2] wikikube: put the new codfw nodes in production [puppet] - 10https://gerrit.wikimedia.org/r/958487 (https://phabricator.wikimedia.org/T345709) (owner: 10Giuseppe Lavagetto) [09:53:33] (03PS7) 10Elukey: slo_template: hardcode time window for SLO dashboards [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/956878 (https://phabricator.wikimedia.org/T346144) [09:53:35] (03PS1) 10Elukey: slo_definitions: improve Lift Wing's service SLO/SLI calculations [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/959695 (https://phabricator.wikimedia.org/T327620) [09:55:59] !log kevinbazira@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [09:57:12] (03PS1) 10Muehlenhoff: os-reports: Stop configuring a puppetdb server and switch to discovery record [puppet] - 10https://gerrit.wikimedia.org/r/959696 (https://phabricator.wikimedia.org/T342214) [09:57:30] (03CR) 10Ayounsi: "overall lgtm, one small comment." 
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/959694 (https://phabricator.wikimedia.org/T347016) (owner: 10Cathal Mooney) [09:57:34] (03CR) 10Klausman: [C: 03+1] slo_definitions: improve Lift Wing's service SLO/SLI calculations [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/959695 (https://phabricator.wikimedia.org/T327620) (owner: 10Elukey) [09:59:00] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10User-aborrero: cookbook: sre.hosts.decommission: also remove logical volumes (to allow rename while reimage) - https://phabricator.wikimedia.org/T346875 (10aborrero) Could the problem be related to the rename? Just a theory The renames we have been condu... [09:59:30] (03CR) 10CI reject: [V: 04-1] os-reports: Stop configuring a puppetdb server and switch to discovery record [puppet] - 10https://gerrit.wikimedia.org/r/959696 (https://phabricator.wikimedia.org/T342214) (owner: 10Muehlenhoff) [10:00:05] mvolz: Time to snap out of that daydream and deploy Services – Citoid / Zotero. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230921T1000). [10:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230921T1000) [10:00:08] (03PS2) 10Cathal Mooney: Include Anycast IPs in Netbox capirca host definitions [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/959694 (https://phabricator.wikimedia.org/T347016) [10:00:13] (03CR) 10Klausman: "Actually, on second thought: should we really include 3xx and 4xx in the latency SLO?" 
[grafana-grizzly] - 10https://gerrit.wikimedia.org/r/959695 (https://phabricator.wikimedia.org/T327620) (owner: 10Elukey) [10:00:18] (03CR) 10JMeybohm: [C: 03+1] kubernetes: default partman recipe for nodes [puppet] - 10https://gerrit.wikimedia.org/r/958463 (owner: 10Giuseppe Lavagetto) [10:00:21] (03CR) 10Cathal Mooney: Include Anycast IPs in Netbox capirca host definitions (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/959694 (https://phabricator.wikimedia.org/T347016) (owner: 10Cathal Mooney) [10:00:23] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core: Refactor P:base::firewall to pull host directly from puppetdb - https://phabricator.wikimedia.org/T300957 (10MoritzMuehlenhoff) I think we can close this; the new profile:: ferm::service and ferm::service offer opt-in name resolution on the Puppet server si... [10:00:56] (03CR) 10Effie Mouzeli: [C: 03+2] kubernetes: default partman recipe for nodes [puppet] - 10https://gerrit.wikimedia.org/r/958463 (owner: 10Giuseppe Lavagetto) [10:01:27] (03PS3) 10Cathal Mooney: Include Anycast IPs in Netbox capirca host definitions [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/959694 (https://phabricator.wikimedia.org/T347016) [10:01:29] (03CR) 10Ayounsi: [C: 03+1] Include Anycast IPs in Netbox capirca host definitions [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/959694 (https://phabricator.wikimedia.org/T347016) (owner: 10Cathal Mooney) [10:01:35] (03PS2) 10Muehlenhoff: os-reports: Stop configuring a puppetdb server and switch to discovery record [puppet] - 10https://gerrit.wikimedia.org/r/959696 (https://phabricator.wikimedia.org/T342214) [10:01:47] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Package for Debian Bookworm [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/959212 (https://phabricator.wikimedia.org/T346762) (owner: 10FNegri) [10:04:09] PROBLEM - Router interfaces on cr1-esams is CRITICAL: CRITICAL: host 185.15.59.128, interfaces up: 77, down: 1, dormant: 0, 
excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:04:20] (03CR) 10Cathal Mooney: [C: 03+2] Include Anycast IPs in Netbox capirca host definitions [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/959694 (https://phabricator.wikimedia.org/T347016) (owner: 10Cathal Mooney) [10:04:57] PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 61, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:05:04] (03Merged) 10jenkins-bot: Include Anycast IPs in Netbox capirca host definitions [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/959694 (https://phabricator.wikimedia.org/T347016) (owner: 10Cathal Mooney) [10:07:41] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [10:08:23] RECOVERY - Router interfaces on cr1-esams is OK: OK: host 185.15.59.128, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:08:25] (03PS1) 10Btullis: Update the nginx regex for archiva [puppet] - 10https://gerrit.wikimedia.org/r/959698 (https://phabricator.wikimedia.org/T318962) [10:08:30] jouncebot: nowandnext [10:08:30] For the next 0 hour(s) and 51 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230921T1000) [10:08:30] For the next 0 hour(s) and 51 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230921T1000) [10:08:30] In 1 hour(s) and 51 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230921T1200) [10:09:11] RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 62, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down 
[10:09:20] !log ayounsi@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove eqsin-eqdfw tunnel - ayounsi@cumin1001" [10:09:23] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:09:55] (03PS2) 10Ladsgroup: Enable pagelinks write both in testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957339 (https://phabricator.wikimedia.org/T345732) [10:09:57] (03CR) 10Ladsgroup: [C: 03+2] Enable pagelinks write both in testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957339 (https://phabricator.wikimedia.org/T345732) (owner: 10Ladsgroup) [10:10:24] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove eqsin-eqdfw tunnel - ayounsi@cumin1001" [10:10:24] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:10:33] (03CR) 10Ayounsi: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/959691 (https://phabricator.wikimedia.org/T344888) (owner: 10Ayounsi) [10:11:08] (03CR) 10Volans: [C: 03+1] "LGTM, not sure if John has a better suggestion to maybe keep it in the config and let puppet sets it from hiera" [puppet] - 10https://gerrit.wikimedia.org/r/959696 (https://phabricator.wikimedia.org/T342214) (owner: 10Muehlenhoff) [10:11:31] 10SRE: Add README and build-specific Dockerfile to purged - https://phabricator.wikimedia.org/T347021 (10Fabfur) [10:11:44] (03Merged) 10jenkins-bot: Enable pagelinks write both in testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957339 (https://phabricator.wikimedia.org/T345732) (owner: 10Ladsgroup) [10:11:57] (03PS3) 10Fabfur: Release 0.21+deb11u1 for bullseye [software/purged] - 10https://gerrit.wikimedia.org/r/959328 
(https://phabricator.wikimedia.org/T346874) [10:12:55] !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:957339|Enable pagelinks write both in testwiki (T345732)]] [10:13:01] T345732: Turn on write both for beta and production - https://phabricator.wikimedia.org/T345732 [10:13:20] (03CR) 10Fabfur: Release 0.21+deb11u1 for bullseye (031 comment) [software/purged] - 10https://gerrit.wikimedia.org/r/959328 (https://phabricator.wikimedia.org/T346874) (owner: 10Fabfur) [10:13:48] (03CR) 10Muehlenhoff: os-reports: Stop configuring a puppetdb server and switch to discovery record (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/959696 (https://phabricator.wikimedia.org/T342214) (owner: 10Muehlenhoff) [10:13:50] (03CR) 10Volans: [C: 03+1] "LGTM, I've seen the removals on Netbox." [dns] - 10https://gerrit.wikimedia.org/r/959691 (https://phabricator.wikimedia.org/T344888) (owner: 10Ayounsi) [10:13:59] (03CR) 10Fabfur: [V: 03+2 C: 03+2] Release 0.21+deb11u1 for bullseye [software/purged] - 10https://gerrit.wikimedia.org/r/959328 (https://phabricator.wikimedia.org/T346874) (owner: 10Fabfur) [10:14:22] (03CR) 10Ayounsi: [C: 03+2] Remove DNS include for former eqsin-eqdfw GRE tunnel [dns] - 10https://gerrit.wikimedia.org/r/959691 (https://phabricator.wikimedia.org/T344888) (owner: 10Ayounsi) [10:16:23] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:16:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [10:17:11] !log cmooney@cumin1001 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary [10:17:15] !log cmooney@cumin1001 END (PASS) 
- Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary [10:17:29] 10SRE, 10Bitu, 10Infrastructure-Foundations: Implement "Forgot my username" feature - https://phabricator.wikimedia.org/T340636 (10MoritzMuehlenhoff) We have "Forgot your password?", but not yet "Forgot your username?" ? [10:17:33] !log cmooney@cumin1001 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary [10:17:36] !log cmooney@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary [10:17:54] !log cmooney@cumin1001 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [10:18:00] !log cmooney@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [10:18:18] (03CR) 10Jbond: [C: 03+2] git-sync-upstream: Fix environment when setting gitusers [puppet] - 10https://gerrit.wikimedia.org/r/955291 (https://phabricator.wikimedia.org/T345702) (owner: 10Jbond) [10:19:01] !log isaranto@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [10:19:28] 10SRE, 10Bitu, 10Infrastructure-Foundations, 10wikitech.wikimedia.org, 10Patch-For-Review: Switch developer account creation to Bitu - https://phabricator.wikimedia.org/T345226 (10MoritzMuehlenhoff) Gerrit still links to "https://wikitech.wikimedia.org/w/index.php?title=Special:CreateAccount&returnto=Ger... [10:20:34] 10SRE, 10Infrastructure-Foundations, 10netops: Include Netbox Anycast IPs in Capirca host definitions - https://phabricator.wikimedia.org/T347016 (10cmooney) 05Open→03Resolved Script updated and re-run, seems fine. 
[10:21:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [10:25:10] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2002.codfw.wmnet [10:25:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti-test2002.codfw.wmnet [10:26:13] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:26:31] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2003.codfw.wmnet [10:27:05] !log set max repeaters = 20 on asw2-a-eqiad - T346759 [10:27:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:11] T346759: Investigate and deploy 'max-repeaters = 20' to all librenms devices - https://phabricator.wikimedia.org/T346759 [10:27:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti-test2003.codfw.wmnet [10:29:15] (03PS2) 10Btullis: Update the nginx regex for archiva [puppet] - 10https://gerrit.wikimedia.org/r/959698 (https://phabricator.wikimedia.org/T318962) [10:31:25] 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudcontrol1007: move to new network setup - https://phabricator.wikimedia.org/T346892 (10taavi) a:05taavi→03Jclark-ctr hi @Jclark-ctr! This server has been powered off and can be moved at any time to `E4`. thanks! 
[10:32:06] (03CR) 10Jbond: [C: 04-1] [BETA HACK] Attempt to secure Puppet DB better (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/941476 (owner: 10Krinkle) [10:32:33] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:34:51] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:957339|Enable pagelinks write both in testwiki (T345732)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [10:34:59] T345732: Turn on write both for beta and production - https://phabricator.wikimedia.org/T345732 [10:35:43] 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudcontrol1007: move to new network setup - https://phabricator.wikimedia.org/T346892 (10aborrero) [10:36:04] (03Abandoned) 10Jgiannelos: mobileapps: Bump to latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/933399 (owner: 10Jgiannelos) [10:36:10] 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudcontrol1006: move to new network setup - https://phabricator.wikimedia.org/T346891 (10aborrero) [10:36:30] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [10:36:53] 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudcontrol1006: move to new network setup - https://phabricator.wikimedia.org/T346891 (10taavi) This is the active`maintain-dbusers` server at the moment. Moving that requires updating the database grants and firewall rules. 
[10:40:00] (03PS2) 10Elukey: slo_definitions: improve Lift Wing's service SLO/SLI availability [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/959695 (https://phabricator.wikimedia.org/T327620) [10:40:02] (03PS8) 10Elukey: slo_template: hardcode time window for SLO dashboards [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/956878 (https://phabricator.wikimedia.org/T346144) [10:41:05] (03PS1) 10Arturo Borrero Gonzalez: policies/cr-labs: refresh openstack API endpoints [homer/public] - 10https://gerrit.wikimedia.org/r/959706 (https://phabricator.wikimedia.org/T346948) [10:42:25] (03CR) 10Jbond: "lgtm minor post merge comment (no need to fix)" [puppet] - 10https://gerrit.wikimedia.org/r/957803 (https://phabricator.wikimedia.org/T337570) (owner: 10Dduvall) [10:42:41] !log elukey@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [10:44:46] (03CR) 10Majavah: [C: 04-1] "The term you're removing is needed for example for the cloudcontrol1006->cloudcontrol1005 flow that is still needed." 
[homer/public] - 10https://gerrit.wikimedia.org/r/959706 (https://phabricator.wikimedia.org/T346948) (owner: 10Arturo Borrero Gonzalez) [10:46:07] (03PS2) 10Arturo Borrero Gonzalez: policies/cr-labs: refresh openstack API endpoints [homer/public] - 10https://gerrit.wikimedia.org/r/959706 (https://phabricator.wikimedia.org/T346948) [10:47:12] (03CR) 10Majavah: [C: 03+1] policies/cr-labs: refresh openstack API endpoints [homer/public] - 10https://gerrit.wikimedia.org/r/959706 (https://phabricator.wikimedia.org/T346948) (owner: 10Arturo Borrero Gonzalez) [10:47:27] (PrometheusRuleEvaluationFailures) firing: (8) Prometheus rule evaluation failures (instance prometheus2005:9900) - https://wikitech.wikimedia.org/wiki/Prometheus - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=codfw%20prometheus%2Fops - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures [10:48:43] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRI [10:48:43] est Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:48:48] !log installing flac security updates [10:48:51] (03CR) 10Jbond: [C: 03+1] sre.maps.reboot: Retire legacy cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/958478 (https://phabricator.wikimedia.org/T317855) (owner: 10Muehlenhoff) [10:48:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:22] !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:957339|Enable pagelinks write both in 
testwiki (T345732)]] (duration: 36m 27s) [10:49:29] T345732: Turn on write both for beta and production - https://phabricator.wikimedia.org/T345732 [10:49:57] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:51:31] (03PS3) 10Elukey: slo_definitions: improve Lift Wing's service SLO/SLI availability [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/959695 (https://phabricator.wikimedia.org/T327620) [10:51:33] (03PS9) 10Elukey: slo_template: hardcode time window for SLO dashboards [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/956878 (https://phabricator.wikimedia.org/T346144) [10:52:27] (PrometheusRuleEvaluationFailures) resolved: (8) Prometheus rule evaluation failures (instance prometheus2005:9900) - https://wikitech.wikimedia.org/wiki/Prometheus - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=codfw%20prometheus%2Fops - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures [10:53:37] (03CR) 10Klausman: [C: 03+1] slo_definitions: improve Lift Wing's service SLO/SLI availability [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/959695 (https://phabricator.wikimedia.org/T327620) (owner: 10Elukey) [10:54:39] !log installing c-ares security updates [10:54:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:21] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:57:03] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1191.eqiad.wmnet with reason: Maintenance [10:57:17] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1191.eqiad.wmnet with reason: Maintenance [10:57:23] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling 
db1191 (T343198)', diff saved to https://phabricator.wikimedia.org/P52550 and previous config saved to /var/cache/conftool/dbconfig/20230921-105723-arnaudb.json [10:57:30] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [11:00:18] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST networkpolicies) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:00:20] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:02:26] (03CR) 10Jbond: [C: 03+1] "lgtm some minor nits" [puppet] - 10https://gerrit.wikimedia.org/r/958931 (owner: 10David Caro) [11:03:25] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/959179 (owner: 10Muehlenhoff) [11:04:27] (03CR) 10Cathal Mooney: [C: 03+1] "I think as it's temporary it's ok. 
If it were a more permanent thing we might want to lock down to specific port numbers but we can hopef" [homer/public] - 10https://gerrit.wikimedia.org/r/959706 (https://phabricator.wikimedia.org/T346948) (owner: 10Arturo Borrero Gonzalez) [11:05:05] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] policies/cr-labs: refresh openstack API endpoints [homer/public] - 10https://gerrit.wikimedia.org/r/959706 (https://phabricator.wikimedia.org/T346948) (owner: 10Arturo Borrero Gonzalez) [11:05:15] (03CR) 10Jbond: [C: 03+1] "nice, thanks" [puppet] - 10https://gerrit.wikimedia.org/r/959238 (https://phabricator.wikimedia.org/T337970) (owner: 10JHathaway) [11:05:18] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (LIST networkpolicies) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:05:31] (03CR) 10FNegri: [C: 04-1] d/changelog: bump version (032 comments) [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/959316 (owner: 10David Caro) [11:05:58] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/959241 (https://phabricator.wikimedia.org/T344868) (owner: 10JHathaway) [11:06:35] (03PS2) 10Jbond: puppetserver: fix perma-diff on /var/lib/puppet/ssl [puppet] - 10https://gerrit.wikimedia.org/r/959235 (https://phabricator.wikimedia.org/T337970) (owner: 10JHathaway) [11:07:08] (03PS1) 10Kevin Bazira: ml-services: update recommendation-api-ng port [deployment-charts] - 10https://gerrit.wikimedia.org/r/958976 (https://phabricator.wikimedia.org/T347015) [11:08:31] !log merging homer CR firewall patch https://gerrit.wikimedia.org/r/c/operations/homer/public/+/959706 for T346948 [11:08:35] (03CR) 10Jbond: [C: 04-1] "unless im missing something no longer required as i sent a patch fr this last week" [puppet] - 10https://gerrit.wikimedia.org/r/959235 
(https://phabricator.wikimedia.org/T337970) (owner: 10JHathaway) [11:08:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:38] T346948: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 [11:10:17] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/959234 (https://phabricator.wikimedia.org/T344868) (owner: 10JHathaway) [11:10:51] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/959232 (https://phabricator.wikimedia.org/T346842) (owner: 10JHathaway) [11:12:36] (03PS1) 10Ladsgroup: Turn on write both for pagelinks in largest s3 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959710 (https://phabricator.wikimedia.org/T345732) [11:15:33] (03CR) 10Majavah: [C: 04-1] "in the PCC it looks like the 'firewall' wrapper tries to install ferm even on hosts which don't have profile::firewall applied" [puppet] - 10https://gerrit.wikimedia.org/r/959179 (owner: 10Muehlenhoff) [11:18:18] (03CR) 10Jbond: [C: 03+1] "lgtm but see comment" [puppet] - 10https://gerrit.wikimedia.org/r/959225 (https://phabricator.wikimedia.org/T337970) (owner: 10JHathaway) [11:19:05] (03CR) 10Ladsgroup: [C: 03+2] Turn on write both for pagelinks in largest s3 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959710 (https://phabricator.wikimedia.org/T345732) (owner: 10Ladsgroup) [11:19:20] !log jnuche@deploy2002 Started deploy [releng/jenkins-deploy@c579111] (releasing): (no justification provided) [11:19:47] (03Merged) 10jenkins-bot: Turn on write both for pagelinks in largest s3 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959710 (https://phabricator.wikimedia.org/T345732) (owner: 10Ladsgroup) [11:20:26] !log jnuche@deploy2002 Finished deploy [releng/jenkins-deploy@c579111] (releasing): (no justification provided) (duration: 01m 05s) [11:21:00] !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:959710|Turn on write both for pagelinks in largest s3 
wikis (T345732)]] [11:21:06] T345732: Turn on write both for beta and production - https://phabricator.wikimedia.org/T345732 [11:21:16] (KubernetesRsyslogDown) firing: (3) rsyslog on kubernetes2026:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [11:22:49] (03CR) 10Jbond: "change lgtm but im not sure its the right thing to do see comment" [puppet] - 10https://gerrit.wikimedia.org/r/959230 (https://phabricator.wikimedia.org/T346842) (owner: 10JHathaway) [11:24:07] (03PS1) 10Ladsgroup: Enable Url shortener in sidebar in all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959713 (https://phabricator.wikimedia.org/T267921) [11:24:13] (03CR) 10Majavah: puppet agent: protect against missing client bucket path (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/959225 (https://phabricator.wikimedia.org/T337970) (owner: 10JHathaway) [11:24:37] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/959226 (https://phabricator.wikimedia.org/T346842) (owner: 10JHathaway) [11:24:46] (03CR) 10CI reject: [V: 04-1] Enable Url shortener in sidebar in all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959713 (https://phabricator.wikimedia.org/T267921) (owner: 10Ladsgroup) [11:26:16] (KubernetesRsyslogDown) firing: (3) rsyslog on kubernetes2026:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [11:28:03] (03CR) 10Jbond: "lgtm but possibly missing a unmask?" 
[puppet] - 10https://gerrit.wikimedia.org/r/959228 (https://phabricator.wikimedia.org/T346842) (owner: 10JHathaway) [11:28:27] !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart-reboot rolling restart_daemons on A:maps-replica-codfw [11:31:16] (KubernetesRsyslogDown) resolved: (3) rsyslog on kubernetes2026:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [11:33:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart-reboot (exit_code=0) rolling restart_daemons on A:maps-replica-codfw [11:35:07] PROBLEM - MD RAID on kubernetes2028 is CRITICAL: CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [11:36:31] (03CR) 10Muehlenhoff: profile::cumin::cloud_target: Avoid Ferm-specific syntax (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/959179 (owner: 10Muehlenhoff) [11:39:08] !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart-reboot rolling restart_daemons on A:maps-replica-eqiad [11:42:46] 10SRE, 10SRE-Access-Requests: Requesting Analytics access for Surbhi Gupta - https://phabricator.wikimedia.org/T335657 (10BTullis) It looks like @SGupta-WMF hasn't been added to the `wmf` LDAP group so I'm adding a note onto this ticket, rather than a new LDAP access request ticket. As per https://wikitech.wik... 
[11:43:25] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:959710|Turn on write both for pagelinks in largest s3 wikis (T345732)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [11:43:28] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [11:43:32] T345732: Turn on write both for beta and production - https://phabricator.wikimedia.org/T345732 [11:45:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart-reboot (exit_code=0) rolling restart_daemons on A:maps-replica-eqiad [11:46:06] (03CR) 10Majavah: [C: 04-1] profile::cumin::cloud_target: Avoid Ferm-specific syntax (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/959179 (owner: 10Muehlenhoff) [11:46:27] (PrometheusRuleEvaluationFailures) firing: (2) Prometheus rule evaluation failures (instance prometheus2005:9900) - https://wikitech.wikimedia.org/wiki/Prometheus - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=codfw%20prometheus%2Fops - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures [11:47:56] (03CR) 10Muehlenhoff: profile::cumin::cloud_target: Avoid Ferm-specific syntax (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/959179 (owner: 10Muehlenhoff) [11:48:22] (03Abandoned) 10Brouberol: [mw-page-content-change-enrich] Add kafka-jumbo1010 to config [deployment-charts] - 10https://gerrit.wikimedia.org/r/958499 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol) [11:48:32] 10SRE, 10Infrastructure-Foundations, 10netops: Audit cloud filters on CR in respect of new cloud-private and public VIP networks - https://phabricator.wikimedia.org/T347030 (10cmooney) p:05Triage→03Medium [11:48:40] (03Abandoned) 10Brouberol: [eventstream-internal] Add kafka-jumbo1010 to config [deployment-charts] - 10https://gerrit.wikimedia.org/r/958498 
(https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol) [11:48:44] (03Abandoned) 10Brouberol: [eventgate-analytics] Add kafka-jumbo1010 to config [deployment-charts] - 10https://gerrit.wikimedia.org/r/958497 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol) [11:50:16] (KubernetesRsyslogDown) firing: rsyslog on kubernetes2036:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes2036 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [11:50:34] (KubernetesAPILatency) firing: High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:51:27] (PrometheusRuleEvaluationFailures) resolved: (8) Prometheus rule evaluation failures (instance prometheus2005:9900) - https://wikitech.wikimedia.org/wiki/Prometheus - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=codfw%20prometheus%2Fops - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures [11:51:31] 10SRE, 10Infrastructure-Foundations: Migrate the KDCs to Bullseye - https://phabricator.wikimedia.org/T331695 (10MoritzMuehlenhoff) 05Open→03Resolved This is complete, all KDCs are on Bullseye by now. 
[11:52:05] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:52:16] (KubernetesRsyslogDown) firing: (3) rsyslog on kubernetes2030:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [11:52:34] (KubernetesCalicoDown) firing: (3) kubernetes2030.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [11:53:25] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:54:11] (03PS1) 10JMeybohm: k8s: Fix dependencies for resources requiring kube user [puppet] - 10https://gerrit.wikimedia.org/r/959722 [11:55:16] (KubernetesRsyslogDown) firing: (14) rsyslog on kubernetes2036:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [11:55:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:56:35] (03PS1) 10Cathal Mooney: Allow all traffic from cloud-public addresses through CRs [homer/public] - 10https://gerrit.wikimedia.org/r/959723 [11:56:38] (03PS1) 10Cathal Mooney: Allow all traffic from cloud-public IPs in from cloudsw [homer/public] - 10https://gerrit.wikimedia.org/r/959724 
(https://phabricator.wikimedia.org/T347030) [11:57:13] (03Abandoned) 10Cathal Mooney: Allow all traffic from cloud-public IPs in from cloudsw [homer/public] - 10https://gerrit.wikimedia.org/r/959724 (https://phabricator.wikimedia.org/T347030) (owner: 10Cathal Mooney) [11:57:15] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 12): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43449/console" [puppet] - 10https://gerrit.wikimedia.org/r/959722 (owner: 10JMeybohm) [11:57:28] (03Abandoned) 10Cathal Mooney: Allow all traffic from cloud-public addresses through CRs [homer/public] - 10https://gerrit.wikimedia.org/r/959723 (owner: 10Cathal Mooney) [11:57:45] !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:959710|Turn on write both for pagelinks in largest s3 wikis (T345732)]] (duration: 36m 44s) [11:57:51] T345732: Turn on write both for beta and production - https://phabricator.wikimedia.org/T345732 [11:57:55] 10SRE, 10Infrastructure-Foundations, 10Observability-Metrics, 10netops: Investigate and deploy 'max-repeaters = 20' to all librenms devices - https://phabricator.wikimedia.org/T346759 (10ayounsi) a:03ayounsi [11:58:01] 10SRE, 10Infrastructure-Foundations, 10Observability-Metrics, 10netops: Investigate and deploy 'max-repeaters = 20' to all librenms devices - https://phabricator.wikimedia.org/T346759 (10ayounsi) 05Open→03Declined Thanks, I spent a bit more time on that. Bumping `max-repeaters` to 20 didn't change a t... 
[11:58:17] (03PS2) 10Ladsgroup: Enable Url shortener in sidebar in all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959713 (https://phabricator.wikimedia.org/T267921) [11:59:10] !log jmm@cumin2002 START - Cookbook sre.aqs.roll-restart-reboot rolling restart_daemons on A:aqs-codfw [11:59:38] (03PS1) 10Cathal Mooney: Allow all traffic from cloud-public IPs in from cloudsw [homer/public] - 10https://gerrit.wikimedia.org/r/959726 (https://phabricator.wikimedia.org/T347030) [12:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230921T1200) [12:00:16] (KubernetesRsyslogDown) firing: (14) rsyslog on kubernetes2036:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:01:24] (03PS2) 10Cathal Mooney: Allow all traffic from cloud-public IPs in from cloudsw [homer/public] - 10https://gerrit.wikimedia.org/r/959726 (https://phabricator.wikimedia.org/T347030) [12:02:16] (KubernetesRsyslogDown) resolved: (3) rsyslog on kubernetes2030:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:02:21] (03PS3) 10Cathal Mooney: Allow all traffic from cloud-public IPs in from cloudsw [homer/public] - 10https://gerrit.wikimedia.org/r/959726 (https://phabricator.wikimedia.org/T347030) [12:02:34] (KubernetesCalicoDown) resolved: (3) kubernetes2030.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [12:03:25] !log cordon kubernetes2028 to reimage [12:03:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:01] (03PS18) 10Brouberol: Define a script in charge of checking the kafka broker in sync status [puppet] - 
10https://gerrit.wikimedia.org/r/959162 (https://phabricator.wikimedia.org/T346741) [12:05:06] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: 1 VM request for apt-staging - https://phabricator.wikimedia.org/T347032 (10eoghan) [12:05:16] (KubernetesRsyslogDown) resolved: (14) rsyslog on kubernetes2036:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:06:42] (03PS2) 10JMeybohm: k8s: Fix dependencies for resources requiring kube user [puppet] - 10https://gerrit.wikimedia.org/r/959722 [12:08:56] (03CR) 10Jbond: "see comments i think a better path would be nice" [puppet] - 10https://gerrit.wikimedia.org/r/959231 (https://phabricator.wikimedia.org/T346842) (owner: 10JHathaway) [12:09:59] (03CR) 10Ayounsi: [C: 03+1] Allow all traffic from cloud-public IPs in from cloudsw [homer/public] - 10https://gerrit.wikimedia.org/r/959726 (https://phabricator.wikimedia.org/T347030) (owner: 10Cathal Mooney) [12:10:50] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/959224 (https://phabricator.wikimedia.org/T344868) (owner: 10JHathaway) [12:10:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.aqs.roll-restart-reboot (exit_code=0) rolling restart_daemons on A:aqs-codfw [12:11:15] (03CR) 10Cathal Mooney: [C: 03+2] Allow all traffic from cloud-public IPs in from cloudsw [homer/public] - 10https://gerrit.wikimedia.org/r/959726 (https://phabricator.wikimedia.org/T347030) (owner: 10Cathal Mooney) [12:14:59] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/959227 (owner: 10JHathaway) [12:16:27] (PrometheusRuleEvaluationFailures) firing: (2) Prometheus rule evaluation failures (instance prometheus2005:9900) - https://wikitech.wikimedia.org/wiki/Prometheus - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=codfw%20prometheus%2Fops - 
https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures [12:16:57] 10SRE, 10Infrastructure-Foundations, 10netops: Users management on SONiC - https://phabricator.wikimedia.org/T338028 (10ayounsi) 05Open→03Resolved a:03ayounsi This is done for now, more improvements to come from Dell, tracked in T342673. [12:17:02] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Add Dell switches support to Homer/Cookbooks - https://phabricator.wikimedia.org/T320638 (10ayounsi) [12:18:43] 10SRE, 10Infrastructure-Foundations, 10netops: cr2-esams:FPC0 Parity error - https://phabricator.wikimedia.org/T318783 (10ayounsi) @cmooney I think this can be closed? [12:19:43] 10SRE, 10Infrastructure-Foundations, 10Stewards-and-global-tools, 10collaboration-services, 10vm-requests: 1 VMs requested for stewards - https://phabricator.wikimedia.org/T344164 (10jbond) >>! In T344164#9184243, @Urbanecm wrote: > Hi @LSobanski, @taavi mentioned to me privately that if we want the stew... 
[12:20:43] !log depooled cp1090.eqiad.wmnet to test new purged package version (T346874) [12:20:49] (CR) Arturo Borrero Gonzalez: [C: +1] Allow all traffic from cloud-public IPs in from cloudsw [homer/public] - https://gerrit.wikimedia.org/r/959726 (https://phabricator.wikimedia.org/T347030) (owner: Cathal Mooney) [12:20:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:50] T346874: Allow purged to specify buffer length - https://phabricator.wikimedia.org/T346874 [12:21:15] !log milimetric@deploy2002 Started deploy [analytics/aqs/deploy@041016f] (aqs): Enable etags on all AQS 1.0 endpoints [12:21:27] (PrometheusRuleEvaluationFailures) resolved: (8) Prometheus rule evaluation failures (instance prometheus2005:9900) - https://wikitech.wikimedia.org/wiki/Prometheus - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=codfw%20prometheus%2Fops - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures [12:21:48] (PS1) Muehlenhoff: firewall::service: Handle the use of the define on systems w/o P:firewall [puppet] - https://gerrit.wikimedia.org/r/959730 (https://phabricator.wikimedia.org/T336497) [12:22:15] !log jmm@cumin2002 START - Cookbook sre.aqs.roll-restart-reboot rolling restart_daemons on A:aqs-eqiad [12:23:16] (Merged) jenkins-bot: Allow all traffic from cloud-public IPs in from cloudsw [homer/public] - https://gerrit.wikimedia.org/r/959726 (https://phabricator.wikimedia.org/T347030) (owner: Cathal Mooney) [12:24:30] (PS3) JMeybohm: k8s: Fix dependencies for resources requiring kube user [puppet] - https://gerrit.wikimedia.org/r/959722 [12:25:12] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes2028.codfw.wmnet with OS bullseye [12:25:14] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/959730 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff) [12:26:12] (CR) Muehlenhoff: 
[C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/959227 (owner: JHathaway) [12:26:26] SRE, Infrastructure-Foundations, Stewards-and-global-tools, collaboration-services, vm-requests: 1 VMs requested for stewards - https://phabricator.wikimedia.org/T344164 (ayounsi) Indeed and hosts on public IPs have a much larger attack surface so they should be a last resort option. The ircb... [12:27:01] (CR) JMeybohm: [V: +1] "PCC SUCCESS (CORE_DIFF 12): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43450/console" [puppet] - https://gerrit.wikimedia.org/r/959722 (owner: JMeybohm) [12:27:28] (CR) Jbond: [C: -1] "we also need a hiera value in cloud.yaml" [puppet] - https://gerrit.wikimedia.org/r/959229 (https://phabricator.wikimedia.org/T346842) (owner: JHathaway) [12:27:30] (CR) Ilias Sarantopoulos: ml-services: update recommendation-api-ng port (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/958976 (https://phabricator.wikimedia.org/T347015) (owner: Kevin Bazira) [12:28:16] (KubernetesRsyslogDown) firing: rsyslog on kubernetes2053:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes2053 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:29:57] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/958962 (owner: Muehlenhoff) [12:31:39] !log milimetric@deploy2002 Finished deploy [analytics/aqs/deploy@041016f] (aqs): Enable etags on all AQS 1.0 endpoints (duration: 10m 23s) [12:33:09] SRE, ops-codfw, User-aborrero, cloud-services-team (Hardware): cloudswitch: codfw: figure out procurement - https://phabricator.wikimedia.org/T346724 (aborrero) Thanks! 
[12:33:16] (KubernetesRsyslogDown) resolved: rsyslog on kubernetes2053:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes2053 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:33:21] (PS4) JMeybohm: k8s: Fix dependencies for resources requiring kube user [puppet] - https://gerrit.wikimedia.org/r/959722 [12:33:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.aqs.roll-restart-reboot (exit_code=0) rolling restart_daemons on A:aqs-eqiad [12:34:15] SRE, ops-codfw, User-aborrero, User-dcaro, cloud-services-team (Hardware): cloud: codfw: decide on new ceph cluster details - https://phabricator.wikimedia.org/T346725 (dcaro) [12:34:25] SRE, ops-codfw: Q1:rack/setup/install kubernetes2054 - https://phabricator.wikimedia.org/T345650 (Jhancock.wm) [12:35:13] SRE, ops-codfw: Q1:rack/setup/install kubernetes2054 - https://phabricator.wikimedia.org/T345650 (Jhancock.wm) ran into an issue with og racking plan. power cables don't reach that far. [12:35:50] (CR) Muehlenhoff: [C: +2] sre.maps.reboot: Retire legacy cookbook [cookbooks] - https://gerrit.wikimedia.org/r/958478 (https://phabricator.wikimedia.org/T317855) (owner: Muehlenhoff) [12:36:10] (CR) JMeybohm: [V: +1] "PCC SUCCESS (CORE_DIFF 12): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43451/console" [puppet] - https://gerrit.wikimedia.org/r/959722 (owner: JMeybohm) [12:36:56] SRE, SRE-tools, Infrastructure-Foundations, Spicerack, Patch-For-Review: Migrate existing cookbooks related to rolling restarts/reboots to SREBatchBase - https://phabricator.wikimedia.org/T317855 (MoritzMuehlenhoff) [12:37:31] SRE, Infrastructure-Foundations, Stewards-and-global-tools, collaboration-services, vm-requests: 1 VMs requested for stewards - https://phabricator.wikimedia.org/T344164 (Urbanecm) >>! 
In T344164#9186750, @jbond wrote: >>>! In T344164#9184243, @Urbanecm wrote: >> Hi @LSobanski, @taavi mention... [12:37:58] (CR) AOkoth: [C: +2] ats: add ticket-test [puppet] - https://gerrit.wikimedia.org/r/957748 (https://phabricator.wikimedia.org/T340027) (owner: AOkoth) [12:38:39] (PS1) Ayounsi: Block inbound RAs on the routers [homer/public] - https://gerrit.wikimedia.org/r/959732 (https://phabricator.wikimedia.org/T334916) [12:42:02] SRE, ops-codfw, User-aborrero, User-dcaro, cloud-services-team (Hardware): cloud: codfw: decide on new ceph cluster details - https://phabricator.wikimedia.org/T346725 (dcaro) [12:42:38] SRE, Infrastructure-Foundations, Stewards-and-global-tools, collaboration-services, vm-requests: 1 VMs requested for stewards - https://phabricator.wikimedia.org/T344164 (ayounsi) FYI, the underlying IRC library seems to support proxies https://github.com/aatxe/irc#configuring-irc-clients [12:42:42] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2028.codfw.wmnet with reason: host reimage [12:43:36] SRE, Infrastructure-Foundations, vm-requests: Site: 1 VM request for apt-staging - https://phabricator.wikimedia.org/T347032 (eoghan) [12:45:03] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:45:52] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2028.codfw.wmnet with reason: host reimage [12:46:27] (PrometheusRuleEvaluationFailures) firing: (3) Prometheus rule evaluation failures (instance prometheus2005:9900) - https://wikitech.wikimedia.org/wiki/Prometheus - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=codfw%20prometheus%2Fops - 
https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures [12:46:49] PROBLEM - restbase endpoints health on restbase1031 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:50:03] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:50:53] RECOVERY - restbase endpoints health on restbase1031 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:51:09] SRE, Infrastructure-Foundations, vm-requests: Site: 1 VM request for apt-staging - https://phabricator.wikimedia.org/T347032 (MoritzMuehlenhoff) Looks good! 
[12:51:27] (PrometheusRuleEvaluationFailures) resolved: (8) Prometheus rule evaluation failures (instance prometheus2005:9900) - https://wikitech.wikimedia.org/wiki/Prometheus - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=codfw%20prometheus%2Fops - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures [12:52:58] (CR) Brouberol: Define a script in charge of checking the kafka broker in sync status (1 comment) [puppet] - https://gerrit.wikimedia.org/r/959162 (https://phabricator.wikimedia.org/T346741) (owner: Brouberol) [12:55:37] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 175, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:58:03] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230921T1300) [13:00:05] Tchanders: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:16] \o/ [13:00:27] I can probably deploy in a few minutes [13:00:27] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:00:35] (might also do a backport) [13:01:02] Lucas_WMDE I'll deploy mine (with help from James_F) - thanks [13:01:31] ok! 
[13:02:50] (PS2) Tchanders: Enable partial action blocks on mediawikiwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/956467 (https://phabricator.wikimedia.org/T332733) [13:02:55] (PS2) Tchanders: Enable partial action blocks on commonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/956465 (https://phabricator.wikimedia.org/T339878) [13:02:59] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 175, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:03:48] (CR) TrainBranchBot: [C: +2] "Approved by tchanders@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/956465 (https://phabricator.wikimedia.org/T339878) (owner: Tchanders) [13:03:52] SRE, Infrastructure-Foundations, netops: cr2-esams:FPC0 Parity error - https://phabricator.wikimedia.org/T318783 (cmooney) @ayounsi yeah I think so, the RMA is complete as far as Juniper is concerned and we are no longer using the old card. It's unclear to me if the new card has been received in cod... 
[13:04:12] (PrometheusRuleEvaluationFailures) firing: (8) Prometheus rule evaluation failures (instance prometheus2005:9900) - https://wikitech.wikimedia.org/wiki/Prometheus - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=codfw%20prometheus%2Fops - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures [13:04:28] (Merged) jenkins-bot: Enable partial action blocks on commonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/956465 (https://phabricator.wikimedia.org/T339878) (owner: Tchanders) [13:04:31] PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:04:52] !log tchanders@deploy2002 Started scap: Backport for [[gerrit:956465|Enable partial action blocks on commonswiki (T339878)]] [13:04:59] T339878: Enable partial action blocks on Commons - https://phabricator.wikimedia.org/T339878 [13:05:42] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2028.codfw.wmnet with OS bullseye [13:05:49] RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:05:57] (PrometheusRuleEvaluationFailures) resolved: (8) Prometheus rule evaluation failures (instance prometheus2005:9900) - https://wikitech.wikimedia.org/wiki/Prometheus - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=codfw%20prometheus%2Fops - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures [13:06:58] (CR) Kevin Bazira: ml-services: update recommendation-api-ng port (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/958976 
(https://phabricator.wikimedia.org/T347015) (owner: Kevin Bazira) [13:08:19] !log disabled puppet on cp1090 for T346874 [13:08:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:26] T346874: Allow purged to specify buffer length - https://phabricator.wikimedia.org/T346874 [13:09:11] SRE, ops-codfw, User-aborrero, User-dcaro, cloud-services-team (Hardware): cloud: prepare codfw for expansion (racks, switches, ceph) - https://phabricator.wikimedia.org/T346661 (dcaro) [13:12:33] (CR) Majavah: [V: +2 C: +2] add fake novaproxy passwords [labs/private] - https://gerrit.wikimedia.org/r/928477 (owner: Majavah) [13:16:28] SRE, Data-Persistence, Performance-Team, serviceops, Datacenter-Switchover: September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345263 (Jelto) [13:19:18] (CR) Elukey: [V: +2 C: +2] slo_definitions: improve Lift Wing's service SLO/SLI availability [grafana-grizzly] - https://gerrit.wikimedia.org/r/959695 (https://phabricator.wikimedia.org/T327620) (owner: Elukey) [13:19:26] (CR) Jbond: ferm: fix ferm-status on container bullseye instances (1 comment) [puppet] - https://gerrit.wikimedia.org/r/959236 (https://phabricator.wikimedia.org/T344868) (owner: JHathaway) [13:19:26] Tchanders: can you ping me when you’re done? [13:19:48] (PS1) Lucas Werkmeister (WMDE): SpecialUndelete: Do not clone RequestContext [core] (wmf/1.41.0-wmf.27) - https://gerrit.wikimedia.org/r/959304 (https://phabricator.wikimedia.org/T346995) [13:19:53] I’d like to backport ^ that [13:19:57] Lucas_WMDE: Sure, it's being very slow... 
[13:20:05] np, it’s not urgent [13:20:23] * Lucas_WMDE adds it to the wiki page in the meantime [13:22:51] (CR) Elukey: [V: +2 C: +2] "elukey@grafana1002:/srv/grafana-grizzly$ grr apply slo_dashboards.jsonnet" [grafana-grizzly] - https://gerrit.wikimedia.org/r/959695 (https://phabricator.wikimedia.org/T327620) (owner: Elukey) [13:24:05] (CR) Elukey: slo_template: hardcode time window for SLO dashboards (1 comment) [grafana-grizzly] - https://gerrit.wikimedia.org/r/956878 (https://phabricator.wikimedia.org/T346144) (owner: Elukey) [13:25:00] !log mwmaint2002: `mwscript extensions/Translate/scripts/moveTranslatableBundle.php --wiki=metawiki 'Private Incident Reporting System/Updates' 'Incident Reporting System/Updates' 'Martin Urbanec' --reason 'per [[:phab:T347019|request]]'` (T347019) [13:25:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:07] T347019: Request to move translatable page: Private Incident Reporting System/Updates - https://phabricator.wikimedia.org/T347019 [13:25:57] !log tchanders@deploy2002 tchanders: Backport for [[gerrit:956465|Enable partial action blocks on commonswiki (T339878)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [13:26:03] T339878: Enable partial action blocks on Commons - https://phabricator.wikimedia.org/T339878 [13:26:32] !log tchanders@deploy2002 tchanders: Continuing with sync [13:27:18] (CR) Muehlenhoff: "As exhibited by https://gerrit.wikimedia.org/r/c/operations/puppet/+/959179" [puppet] - https://gerrit.wikimedia.org/r/959730 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff) [13:27:22] (CR) Effie Mouzeli: [C: +2] conftool: add new k8s nodes [puppet] - https://gerrit.wikimedia.org/r/958488 (https://phabricator.wikimedia.org/T345709) (owner: Giuseppe Lavagetto) [13:27:31] !log 
vriley@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host pc1015.mgmt.eqiad.wmnet with reboot policy FORCED [13:28:39] !log vriley@cumin1001 START - Cookbook sre.hosts.provision for host pc1015.mgmt.eqiad.wmnet with reboot policy FORCED [13:28:47] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/958973 (https://phabricator.wikimedia.org/T346893) (owner: Cwhite) [13:29:46] (PS1) Muehlenhoff: Bump to 6.6.12 [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/959738 [13:30:36] !log vriley@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host pc1015.mgmt.eqiad.wmnet with reboot policy FORCED [13:30:39] (PS1) Majavah: firewall: add 'none' provider [puppet] - https://gerrit.wikimedia.org/r/959739 [13:30:57] (CR) Majavah: "another alternative would be adding a 'none' provider like in https://gerrit.wikimedia.org/r/c/operations/puppet/+/959739/" [puppet] - https://gerrit.wikimedia.org/r/959730 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff) [13:33:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:34:52] !log vriley@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host pc1015 [13:34:54] !log vriley@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host pc1015 [13:36:02] !log vriley@cumin1001 START - Cookbook sre.hosts.provision for host pc1015.mgmt.eqiad.wmnet with reboot policy FORCED [13:36:08] SRE, Traffic: Simplify maintenance of DNS/NTP hosts to reduce toil around reboots, reimages, and other work - https://phabricator.wikimedia.org/T347054 (ssingh) [13:36:21] (PS3) Btullis: Update the nginx regex for archiva [puppet] - 
https://gerrit.wikimedia.org/r/959698 (https://phabricator.wikimedia.org/T318962) [13:36:37] (CR) Muehlenhoff: firewall::service: Handle the use of the define on systems w/o P:firewall (1 comment) [puppet] - https://gerrit.wikimedia.org/r/959730 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff) [13:37:30] SRE, Infrastructure-Foundations, Puppet-Core: Refactor P:base::firewall to pull host directly from puppetdb - https://phabricator.wikimedia.org/T300957 (jbond) [13:37:32] (PS2) Kevin Bazira: ml-services: update recommendation-api-ng image [deployment-charts] - https://gerrit.wikimedia.org/r/958976 (https://phabricator.wikimedia.org/T347015) [13:37:34] (CR) Bking: [C: +1] Update the nginx regex for archiva [puppet] - https://gerrit.wikimedia.org/r/959698 (https://phabricator.wikimedia.org/T318962) (owner: Btullis) [13:37:35] SRE, Traffic: Simplify maintenance of DNS/NTP hosts to reduce toil around reboots, reimages, and other work - https://phabricator.wikimedia.org/T347054 (ssingh) p:Triage→Medium [13:37:43] (CR) Btullis: [C: +2] Update the nginx regex for archiva [puppet] - https://gerrit.wikimedia.org/r/959698 (https://phabricator.wikimedia.org/T318962) (owner: Btullis) [13:37:58] !log vriley@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host pc1015.mgmt.eqiad.wmnet with reboot policy FORCED [13:38:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:38:52] SRE, Infrastructure-Foundations, Puppet-Core: Refactor P:base::firewall to pull host directly from puppetdb - https://phabricator.wikimedia.org/T300957 (jbond) @MoritzMuehlenhoff I have updated the description a bit. 
This task is more about removing the lists in hiera and instead relying on puppetdb... [13:39:57] !log tchanders@deploy2002 Finished scap: Backport for [[gerrit:956465|Enable partial action blocks on commonswiki (T339878)]] (duration: 35m 04s) [13:40:04] T339878: Enable partial action blocks on Commons - https://phabricator.wikimedia.org/T339878 [13:40:14] Phew! Onto the next one... [13:40:15] 35 minutes o_O [13:40:21] and that was only the first of two right? :/ [13:40:36] (CR) TrainBranchBot: [C: +2] "Approved by tchanders@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/956467 (https://phabricator.wikimedia.org/T332733) (owner: Tchanders) [13:40:55] (PS3) Jforrester: Enable partial action blocks on mediawikiwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/956467 (https://phabricator.wikimedia.org/T332733) (owner: Tchanders) [13:41:04] (CR) TrainBranchBot: "Approved by tchanders@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/956467 (https://phabricator.wikimedia.org/T332733) (owner: Tchanders) [13:42:13] PROBLEM - Check systemd state on archiva1002 is CRITICAL: CRITICAL - degraded: The following units failed: nginx.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:42:32] Lucas_WMDE: Yeah [13:42:36] (Merged) jenkins-bot: Enable partial action blocks on mediawikiwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/956467 (https://phabricator.wikimedia.org/T332733) (owner: Tchanders) [13:42:59] !log tchanders@deploy2002 Started scap: Backport for [[gerrit:956467|Enable partial action blocks on mediawikiwiki (T332733)]] [13:43:06] T332733: Deploy action blocks on mediawikiwiki - https://phabricator.wikimedia.org/T332733 [13:43:10] (JobUnavailable) firing: (6) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:43:17] PROBLEM - HTTPS on archiva1002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Archiva [13:43:49] good luck with this one then… [13:43:51] jouncebot: next [13:43:51] In 2 hour(s) and 16 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230921T1600) [13:44:01] ok, I guess I can do my backport after the window ends if necessary [13:44:26] (PS1) Muehlenhoff: durum: Select the custom nginx provider with no additional modules [puppet] - https://gerrit.wikimedia.org/r/959749 (https://phabricator.wikimedia.org/T329529) [13:44:38] (PS2) Muehlenhoff: durum: Select the custom nginx provider with no additional modules [puppet] - https://gerrit.wikimedia.org/r/959749 (https://phabricator.wikimedia.org/T329529) [13:44:56] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/959749 (https://phabricator.wikimedia.org/T329529) (owner: Muehlenhoff) [13:45:11] Lucas_WMDE We were hoping that was an outlier and this one will be much quicker... Wishing I'd written the 2 configs in one patch though... [13:45:12] Lucas_WMDE: Yeah, for prod UBNs just go for it. [13:45:48] (CR) JMeybohm: [V: +1] "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/959722 (owner: JMeybohm) [13:46:03] Tchanders: `scap backport` lets you deploy multiple patches at once, too [13:46:12] (CR) Jbond: [C: -1] "i don't think this will work (if you tested it then i can remove the +1). 
the problem is that the puppetdb-api micro service dose not sup" [puppet] - https://gerrit.wikimedia.org/r/959696 (https://phabricator.wikimedia.org/T342214) (owner: Muehlenhoff) [13:46:14] (PS1) Btullis: Revert "Update the nginx regex for archiva" [puppet] - https://gerrit.wikimedia.org/r/959766 [13:46:30] (/me was hoping to deploy something too, but scap seems to be in a different mood today) [13:46:30] taavi: Ah, good to know! [13:46:44] The i18n build / docker build is triggering each time. [13:46:59] (PS2) Btullis: Revert "Update the nginx regex for archiva" [puppet] - https://gerrit.wikimedia.org/r/959766 (https://phabricator.wikimedia.org/T318962) [13:47:46] (CR) Filippo Giunchedi: prometheus: add service_name_override parameter (2 comments) [puppet] - https://gerrit.wikimedia.org/r/958973 (https://phabricator.wikimedia.org/T346893) (owner: Cwhite) [13:47:48] (CR) Btullis: [C: +2] Revert "Update the nginx regex for archiva" [puppet] - https://gerrit.wikimedia.org/r/959766 (https://phabricator.wikimedia.org/T318962) (owner: Btullis) [13:48:37] (PS2) Majavah: P:installserver::proxy: move templates under profile/ [puppet] - https://gerrit.wikimedia.org/r/955699 [13:48:41] (PS2) Majavah: P:icinga: move files under profile/ [puppet] - https://gerrit.wikimedia.org/r/955698 [13:49:07] RECOVERY - Check systemd state on archiva1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:49:08] (PS3) Majavah: P:installserver::proxy: move templates under profile/ [puppet] - https://gerrit.wikimedia.org/r/955699 [13:50:11] RECOVERY - HTTPS on archiva1002 is OK: SSL OK - Certificate archiva.wikimedia.org valid until 2023-11-29 22:21:23 +0000 (expires in 69 days) https://wikitech.wikimedia.org/wiki/Analytics/Systems/Archiva [13:51:09] (CR) Jbond: [C: +1] puppet agent: protect against missing client bucket path (1 comment) [puppet] - 
https://gerrit.wikimedia.org/r/959225 (https://phabricator.wikimedia.org/T337970) (owner: JHathaway) [13:51:32] (CR) Majavah: [C: +2] P:icinga: move files under profile/ [puppet] - https://gerrit.wikimedia.org/r/955698 (owner: Majavah) [13:51:46] (CR) Muehlenhoff: os-reports: Stop configuring a puppetdb server and switch to discovery record (1 comment) [puppet] - https://gerrit.wikimedia.org/r/959696 (https://phabricator.wikimedia.org/T342214) (owner: Muehlenhoff) [13:51:49] (CR) Jbond: [C: +1] "lgtm" [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/959738 (owner: Muehlenhoff) [13:51:57] (CR) Majavah: [C: +2] P:installserver::proxy: move templates under profile/ [puppet] - https://gerrit.wikimedia.org/r/955699 (owner: Majavah) [13:52:03] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:52:54] (PS3) Muehlenhoff: durum: Select the custom nginx provider with no additional modules [puppet] - https://gerrit.wikimedia.org/r/959749 (https://phabricator.wikimedia.org/T329529) [13:53:10] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/959749 (https://phabricator.wikimedia.org/T329529) (owner: Muehlenhoff) [13:53:41] !log vriley@cumin1001 START - Cookbook sre.hosts.provision for host pc1015.mgmt.eqiad.wmnet with reboot policy FORCED [13:53:43] (CR) Muehlenhoff: [V: +2 C: +2] Bump to 6.6.12 [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/959738 (owner: Muehlenhoff) [13:54:19] (PS1) Btullis: Fix the archiva nginx regex [puppet] - https://gerrit.wikimedia.org/r/959751 (https://phabricator.wikimedia.org/T318962) [13:54:49] (CR) Bking: [C: +1] Fix the archiva nginx 
regex [puppet] - https://gerrit.wikimedia.org/r/959751 (https://phabricator.wikimedia.org/T318962) (owner: Btullis) [13:55:46] (CR) Btullis: [C: +2] Fix the archiva nginx regex [puppet] - https://gerrit.wikimedia.org/r/959751 (https://phabricator.wikimedia.org/T318962) (owner: Btullis) [13:57:03] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:58:15] (PS1) Btullis: Revert "Fix the archiva nginx regex" [puppet] - https://gerrit.wikimedia.org/r/959767 [13:58:27] (PS1) Jbond: systemd::timer::job: improve documentation to ensure path_exists [puppet] - https://gerrit.wikimedia.org/r/959753 [13:58:35] (PS2) Btullis: Revert "Fix the archiva nginx regex" [puppet] - https://gerrit.wikimedia.org/r/959767 (https://phabricator.wikimedia.org/T318962) [13:59:02] (CR) Muehlenhoff: "(After merging libnginx-mod-http-echo needs to be removed with Cumin)" [puppet] - https://gerrit.wikimedia.org/r/959749 (https://phabricator.wikimedia.org/T329529) (owner: Muehlenhoff) [13:59:39] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2013.codfw.wmnet with OS bullseye [13:59:47] SRE, Cassandra, Data-Persistence, Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host restbase2013.codfw.wmnet with OS bullseye [14:00:47] (CR) Btullis: [C: +2] Revert "Fix the archiva nginx regex" [puppet] - https://gerrit.wikimedia.org/r/959767 (https://phabricator.wikimedia.org/T318962) (owner: Btullis) [14:01:53] SRE, Infrastructure-Foundations, Patch-For-Review: Adapt profile::nginx to new packaging scheme 
introduced in Bookworm - https://phabricator.wikimedia.org/T329529 (MoritzMuehlenhoff) [14:02:01] SRE, Infrastructure-Foundations, Patch-For-Review: Adapt profile::nginx to new packaging scheme introduced in Bookworm - https://phabricator.wikimedia.org/T329529 (MoritzMuehlenhoff) p:Triage→Medium [14:03:05] (CR) Elukey: [C: +1] ml-services: update recommendation-api-ng image (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/958976 (https://phabricator.wikimedia.org/T347015) (owner: Kevin Bazira) [14:03:24] !log tchanders@deploy2002 tchanders: Backport for [[gerrit:956467|Enable partial action blocks on mediawikiwiki (T332733)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [14:03:32] T332733: Deploy action blocks on mediawikiwiki - https://phabricator.wikimedia.org/T332733 [14:03:33] (CR) Majavah: puppet agent: protect against missing client bucket path (1 comment) [puppet] - https://gerrit.wikimedia.org/r/959225 (https://phabricator.wikimedia.org/T337970) (owner: JHathaway) [14:03:45] (CR) Kevin Bazira: [C: +2] ml-services: update recommendation-api-ng image [deployment-charts] - https://gerrit.wikimedia.org/r/958976 (https://phabricator.wikimedia.org/T347015) (owner: Kevin Bazira) [14:04:06] !log tchanders@deploy2002 tchanders: Continuing with sync [14:04:39] (Merged) jenkins-bot: ml-services: update recommendation-api-ng image [deployment-charts] - https://gerrit.wikimedia.org/r/958976 (https://phabricator.wikimedia.org/T347015) (owner: Kevin Bazira) [14:04:57] (PS1) Muehlenhoff: puppetdb: Select the custom nginx provider with no additional modules [puppet] - https://gerrit.wikimedia.org/r/959754 (https://phabricator.wikimedia.org/T329529) [14:07:37] (JobUnavailable) firing: (8) Reduced availability for job cloud_dev_pdns in 
ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:07:38] !log kevinbazira@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [14:09:15] (03CR) 10Jbond: [C: 04-1] "see inline" [puppet] - 10https://gerrit.wikimedia.org/r/959730 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [14:10:34] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:11:16] (03CR) 10Jbond: "lgtm, minor nit inline" [puppet] - 10https://gerrit.wikimedia.org/r/959739 (owner: 10Majavah) [14:12:49] 10SRE, 10Bitu, 10Infrastructure-Foundations: Implement "Forgot my username" feature - https://phabricator.wikimedia.org/T340636 (10SLyngshede-WMF) Ah, we're missing a link to the page: https://idm.wikimedia.org/wikimedia/whoami/ It actually does work, but where to put it? 
[14:13:40] (03PS1) 10Elukey: Delete the fastapi-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/959757 [14:14:11] (03CR) 10Jbond: [C: 03+1] puppet agent: protect against missing client bucket path (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/959225 (https://phabricator.wikimedia.org/T337970) (owner: 10JHathaway) [14:14:26] !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase2013.codfw.wmnet with reason: host reimage [14:15:34] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:16:53] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase2013.codfw.wmnet with reason: host reimage [14:17:01] !log tchanders@deploy2002 Finished scap: Backport for [[gerrit:956467|Enable partial action blocks on mediawikiwiki (T332733)]] (duration: 34m 01s) [14:17:37] (JobUnavailable) firing: (8) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:17:48] Lucas_WMDE, taavi: I'm done, in case you want to keep going beyond the window [14:18:15] taavi: what did you want to deploy? [14:18:16] jouncebot: now [14:18:16] No deployments scheduled for the next 1 hour(s) and 41 minute(s) [14:19:04] 10SRE, 10Traffic: Simplify maintenance of DNS/NTP hosts to reduce toil around reboots, reimages, and other work - https://phabricator.wikimedia.org/T347054 (10ayounsi) > NTP automation: Even if Debian Installer supports a comma-separated list of NTP servers (to be tested?), some special appliances (like PDUs) o... 
[14:19:28] oops, I need to update my script to SSH to some other servers now [14:19:36] * Lucas_WMDE has seen the big big “do not use this server” banner on mwmaint1002 [14:21:06] 10SRE, 10Bitu, 10Infrastructure-Foundations: Implement "Forgot my username" feature - https://phabricator.wikimedia.org/T340636 (10MoritzMuehlenhoff) >>! In T340636#9187408, @SLyngshede-WMF wrote: > Ah, we're missing a link to the page: https://idm.wikimedia.org/wikimedia/whoami/ > > It actually does work,... [14:21:10] 10SRE, 10Infrastructure-Foundations, 10netbox, 10netops: Netbox Juniper report - https://phabricator.wikimedia.org/T306238 (10ayounsi) > In the set up the team asked for a couple more items. Can you also share the “aud” (audience) & cid (clientId)values from the ID token? [14:21:29] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/959754 (https://phabricator.wikimedia.org/T329529) (owner: 10Muehlenhoff) [14:21:29] PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 61, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:21:31] PROBLEM - Router interfaces on cr1-esams is CRITICAL: CRITICAL: host 185.15.59.128, interfaces up: 77, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:22:28] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/959753 (owner: 10Jbond) [14:24:12] alright, I’ll go ahead with my backport then… [14:24:47] (probably would’ve been smart to at least +2 it ahead of time 🤦) [14:24:56] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [core] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/959304 (https://phabricator.wikimedia.org/T346995) (owner: 10Lucas Werkmeister (WMDE)) [14:25:31] (03CR) 10Jbond: [C: 03+2] systemd::timer::job: 
improve documentation to ensure path_exists [puppet] - 10https://gerrit.wikimedia.org/r/959753 (owner: 10Jbond) [14:26:17] (03PS8) 10Majavah: dynamicproxy: clarify that 'project name' was actually project_id all along [puppet] - 10https://gerrit.wikimedia.org/r/956925 (https://phabricator.wikimedia.org/T343158) (owner: 10Andrew Bogott) [14:26:19] (03PS9) 10Majavah: P:wmcs::novaproxy: enable keepalived for HA [puppet] - 10https://gerrit.wikimedia.org/r/829289 (https://phabricator.wikimedia.org/T316982) [14:26:38] (03PS10) 10Majavah: P:wmcs::novaproxy: enable keepalived for HA [puppet] - 10https://gerrit.wikimedia.org/r/829289 (https://phabricator.wikimedia.org/T316982) [14:28:20] (03CR) 10Majavah: "I took the liberty of updating 'project_id' to 'openstack_id' to make it more clear which one is referring to the openstack id field and w" [puppet] - 10https://gerrit.wikimedia.org/r/956925 (https://phabricator.wikimedia.org/T343158) (owner: 10Andrew Bogott) [14:31:20] !log kevinbazira@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . 
[14:31:22] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43454/console" [puppet] - 10https://gerrit.wikimedia.org/r/829289 (https://phabricator.wikimedia.org/T316982) (owner: 10Majavah) [14:31:56] !log imported cas 6.6.12+wmf11u1 to apt.wikimedia.org [14:32:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:50] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jul-Sep-2023), 10Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (10Trizek-WMF) [14:34:30] (03PS2) 10David Caro: d/changelog: bump version [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/959316 [14:35:03] (03CR) 10David Caro: "Feel free to take the patch over anytime if you want, you should probably put your name there instead of mine xd" [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/959316 (owner: 10David Caro) [14:37:57] (03PS2) 10Samtar: .well-known: Add F-Droid signature to assetlinks.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959327 (https://phabricator.wikimedia.org/T346951) [14:39:16] (PHPFPMTooBusy) firing: Not enough idle php7.4-fpm.service workers for Mediawiki parsoid at codfw #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?from=now-3h&orgId=1&to=now&var-cluster=parsoid&var-site=codfw&viewPanel=64 - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:39:16] (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:39:23] (03CR) 10Samtar: .well-known: Add 
F-Droid signature to assetlinks.json (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959327 (https://phabricator.wikimedia.org/T346951) (owner: 10Samtar) [14:39:34] (03Merged) 10jenkins-bot: SpecialUndelete: Do not clone RequestContext [core] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/959304 (https://phabricator.wikimedia.org/T346995) (owner: 10Lucas Werkmeister (WMDE)) [14:40:03] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:959304|SpecialUndelete: Do not clone RequestContext (T346995)]] [14:40:07] (ProbeDown) firing: Service parsoid-php:443 has failed probes (http_parsoid-php_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#parsoid-php:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:40:10] T346995: LogicException viewing diffs between deleted revisions: "RequestContext should not be cloned, use DerivativeContext instead." 
- https://phabricator.wikimedia.org/T346995 [14:40:11] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:40:31] PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Could not fetch url http://10.64.48.125:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein: Timeout on connection while downloading http://10.64.48.125:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%2 [14:40:31] 2FMCS%2FTest%2FFrankenstein https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:40:41] 10Puppet, 10Infrastructure-Foundations, 10Wikimedia-production-error: logspam-watch doesn’t handle normalized exceptions well - https://phabricator.wikimedia.org/T347064 (10Lucas_Werkmeister_WMDE) [14:40:56] what's up parsoid? 
[14:41:03] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL [14:41:03] not fetch url http://10.64.0.147:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein: Timeout on connection while downloading http://10.64.0.147:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:41:13] 10Puppet, 10Infrastructure-Foundations, 10Wikimedia-production-error: logspam-watch doesn’t handle normalized exceptions well - https://phabricator.wikimedia.org/T347064 (10Lucas_Werkmeister_WMDE) (Not sure which tags `logspam-watch` belongs to, I grabbed some relevant-seeming ones from older tasks.) 
[14:41:21] PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Could not fetch url http://10.192.32.16:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein: Timeout on connection while downloading http://10.192.32.16:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%2 [14:41:21] 2FMCS%2FTest%2FFrankenstein https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:41:46] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase2013.codfw.wmnet with OS bullseye [14:41:47] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - parsoid-php_443: Servers parse2011.codfw.wmnet, parse2017.codfw.wmnet, parse2006.codfw.wmnet, parse2019.codfw.wmnet, parse2005.codfw.wmnet, parse2007.codfw.wmnet, parse2020.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:41:53] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host restbase2013.codfw.wmnet with OS bullseye completed: - restbase20... [14:42:07] (ProbeDown) firing: Service parsoid-php:443 has failed probes (http_parsoid-php_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#parsoid-php:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:42:07] thanks for acking. I'm still checking. Maybe also capacity issues due to the switchover? 
[14:42:19] PROBLEM - restbase endpoints health on restbase1030 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:42:19] PROBLEM - restbase endpoints health on restbase1031 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Could not fetch url http://10.64.0.165:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein: Timeout on connection while downloading http://10.64.0.165:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28W [14:42:19] MCS%2FTest%2FFrankenstein https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:42:23] PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Could not fetch url http://10.192.48.120:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein: Timeout on connection while downloading http://10.192.48.120:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_ [14:42:23] 9%2FMCS%2FTest%2FFrankenstein https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:42:27] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Could not fetch url http://10.64.0.100:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein: Timeout on connection while downloading http://10.64.0.100:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28W [14:42:27] MCS%2FTest%2FFrankenstein 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:42:27] PROBLEM - restbase endpoints health on restbase1033 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Could not fetch url http://10.64.48.71:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein: Timeout on connection while downloading http://10.64.48.71:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28W [14:42:27] MCS%2FTest%2FFrankenstein https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:42:44] 10Puppet, 10Infrastructure-Foundations, 10Wikimedia-production-error: logspam-watch doesn’t handle normalized exceptions well - https://phabricator.wikimedia.org/T347064 (10Lucas_Werkmeister_WMDE) The message seems to be normalized correctly in Logstash, at least: {F37746900} [14:42:57] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Could not fetch url http://10.64.48.97:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein: Timeout on connection while downloading http://10.64.48.97:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28W [14:42:57] MCS%2FTest%2FFrankenstein https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:43:11] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:43:15] RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:43:43] RECOVERY - restbase endpoints health on restbase1030 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:43:47] RECOVERY - restbase endpoints health on restbase2018 is 
OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:43:47] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:43:47] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:43:49] PROBLEM - restbase endpoints health on restbase2023 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Could not fetch url http://10.192.48.141:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein: Timeout on connection while downloading http://10.192.48.141:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_ [14:43:49] 9%2FMCS%2FTest%2FFrankenstein https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:44:03] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Could not fetch url http://10.192.48.67:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein: Timeout on connection while downloading http://10.192.48.67:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%2 [14:44:03] 2FMCS%2FTest%2FFrankenstein https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:44:19] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:44:31] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Could not fetch url 
http://10.192.32.118:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein: Timeout on connection while downloading http://10.192.32.118:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_ [14:44:31] 9%2FMCS%2FTest%2FFrankenstein https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:45:09] RECOVERY - restbase endpoints health on restbase2023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:45:11] RECOVERY - restbase endpoints health on restbase1033 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:45:23] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:45:29] RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:45:45] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:45:51] RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:46:00] 10SRE, 10Traffic: Simplify maintenance of DNS/NTP hosts to reduce toil around reboots, reimages, and other work - https://phabricator.wikimedia.org/T347054 (10cmooney) >>! In T347054#9187423, @ayounsi wrote: > we need to have a "catch-all" option. Good point! We had a call about this task a short time ago a... [14:46:05] jelto: I don't see a corresponding increase in load, but an increase in latency. Something downstream, maybe? 
[14:46:20] cwhite: look in -sre :) [14:46:23] (03PS3) 10FNegri: d/changelog: bump version [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/959316 (owner: 10David Caro) [14:46:27] RECOVERY - restbase endpoints health on restbase1031 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:47:07] (ProbeDown) resolved: Service parsoid-php:443 has failed probes (http_parsoid-php_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#parsoid-php:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:47:23] (03CR) 10JMeybohm: [C: 03+1] Delete the fastapi-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/959757 (owner: 10Elukey) [14:48:47] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T343198)', diff saved to https://phabricator.wikimedia.org/P52553 and previous config saved to /var/cache/conftool/dbconfig/20230921-144847-arnaudb.json [14:48:54] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [14:49:16] (PHPFPMTooBusy) resolved: Not enough idle php7.4-fpm.service workers for Mediawiki parsoid at codfw #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?from=now-3h&orgId=1&to=now&var-cluster=parsoid&var-site=codfw&viewPanel=64 - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:49:16] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:49:22] (03CR) 10David Caro: [C: 
03+1] d/changelog: bump version [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/959316 (owner: 10David Caro) [14:49:24] (03CR) 10FNegri: d/changelog: bump version (031 comment) [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/959316 (owner: 10David Caro) [14:49:49] 10Puppet, 10Infrastructure-Foundations, 10Wikimedia-production-error: logspam-watch doesn’t handle normalized exceptions well - https://phabricator.wikimedia.org/T347064 (10dancy) `logspam-watch` works by reading the /srv/mw-log/{exception,error}.log files which only have the final error message (no template... [14:50:07] (ProbeDown) resolved: Service parsoid-php:443 has failed probes (http_parsoid-php_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#parsoid-php:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:52:10] (03PS1) 10Muehlenhoff: Add profile::firewall::provider: none for roles where P:firewall is not applied [puppet] - 10https://gerrit.wikimedia.org/r/959759 (https://phabricator.wikimedia.org/T336497) [14:53:19] (03CR) 10Muehlenhoff: firewall: add 'none' provider (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/959739 (owner: 10Majavah) [14:54:59] RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 62, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:54:59] RECOVERY - Router interfaces on cr1-esams is OK: OK: host 185.15.59.128, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:55:03] (03CR) 10FNegri: [C: 03+1] designate nova_fixed_multi: create A recs using project_id and project_name [puppet] - 10https://gerrit.wikimedia.org/r/957371 (https://phabricator.wikimedia.org/T343158) (owner: 10Andrew Bogott) [14:55:16] cwhite, jelto: my backport is still 
ongoing (currently in sync-testservers-k8s), let me know if I should interrupt it [14:55:33] (03CR) 10Majavah: firewall: add 'none' provider (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/959739 (owner: 10Majavah) [14:56:30] Lucas_WMDE: Proceed as you like. The situation from earlier is under control. [14:56:38] good, thanks! [14:56:39] Lucas_WMDE: As far as I can tell we are fine again [14:57:43] (03PS1) 10Kevin Bazira: ml-services: update recommendation-api-ng readiness_probe [deployment-charts] - 10https://gerrit.wikimedia.org/r/958978 (https://phabricator.wikimedia.org/T347015) [14:59:38] (03CR) 10Elukey: [C: 03+1] ml-services: update recommendation-api-ng readiness_probe [deployment-charts] - 10https://gerrit.wikimedia.org/r/958978 (https://phabricator.wikimedia.org/T347015) (owner: 10Kevin Bazira) [15:00:30] (03CR) 10Kevin Bazira: [C: 03+2] ml-services: update recommendation-api-ng readiness_probe [deployment-charts] - 10https://gerrit.wikimedia.org/r/958978 (https://phabricator.wikimedia.org/T347015) (owner: 10Kevin Bazira) [15:00:37] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Backport for [[gerrit:959304|SpecialUndelete: Do not clone RequestContext (T346995)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [15:00:39] (03PS1) 10JMeybohm: Revert "Revert "Revert "mediawiki: Reduce requests for canaries""" [deployment-charts] - 10https://gerrit.wikimedia.org/r/959768 [15:00:44] testing… [15:00:46] T346995: LogicException viewing diffs between deleted revisions: "RequestContext should not be cloned, use DerivativeContext instead." 
- https://phabricator.wikimedia.org/T346995 [15:00:49] hello [15:01:29] (03Merged) 10jenkins-bot: ml-services: update recommendation-api-ng readiness_probe [deployment-charts] - 10https://gerrit.wikimedia.org/r/958978 (https://phabricator.wikimedia.org/T347015) (owner: 10Kevin Bazira) [15:01:29] seems to work yay [15:01:31] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Continuing with sync [15:03:26] (03PS1) 10Muehlenhoff: Failover idp.w.o to idp2002 [dns] - 10https://gerrit.wikimedia.org/r/959760 [15:03:54] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P52554 and previous config saved to /var/cache/conftool/dbconfig/20230921-150353-arnaudb.json [15:04:17] (03PS1) 10Elukey: profile::trafficserver::backend: switch ores traffic to ores-legacy [puppet] - 10https://gerrit.wikimedia.org/r/959762 (https://phabricator.wikimedia.org/T341696) [15:05:04] !log kevinbazira@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . 
[15:06:46] !log eevans@cumin1001 START - Cookbook sre.hosts.remove-downtime for restbase2013.codfw.wmnet [15:06:47] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for restbase2013.codfw.wmnet [15:07:22] (03PS2) 10Muehlenhoff: Failover idp.w.o to idp2002 [dns] - 10https://gerrit.wikimedia.org/r/959760 [15:07:24] (03CR) 10Elukey: "The backend TLS cert for ores-legacy.discovery.wmnet already has a SAN for ores.wikimedia.org:" [puppet] - 10https://gerrit.wikimedia.org/r/959762 (https://phabricator.wikimedia.org/T341696) (owner: 10Elukey) [15:09:49] (03PS1) 10JMeybohm: mw-api-ext/mw-web: Raise main replicas to 16 in codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/959764 [15:09:55] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43455/console" [puppet] - 10https://gerrit.wikimedia.org/r/959762 (https://phabricator.wikimedia.org/T341696) (owner: 10Elukey) [15:10:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on 18 hosts with reason: Schema change [15:10:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on 18 hosts with reason: Schema change [15:11:07] (03CR) 10Muehlenhoff: [C: 03+2] Failover idp.w.o to idp2002 [dns] - 10https://gerrit.wikimedia.org/r/959760 (owner: 10Muehlenhoff) [15:12:08] PROBLEM - Check systemd state on an-worker1118 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:12:14] !log dbmaint on s8@eqiad (T343198) [15:12:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:20] PROBLEM - Hadoop NodeManager on an-worker1118 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:12:20] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [15:12:49] (03PS2) 10Elukey: profile::trafficserver::backend: switch ores traffic to ores-legacy [puppet] - 10https://gerrit.wikimedia.org/r/959762 (https://phabricator.wikimedia.org/T341696) [15:13:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on 22 hosts with reason: Schema change [15:14:04] (03Abandoned) 10Mforns: analytics::refinery::job::eventlogging_to_druid: Default to deploy-mode cluster [puppet] - 10https://gerrit.wikimedia.org/r/895228 (owner: 10Mforns) [15:14:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on 22 hosts with reason: Schema change [15:14:11] (03CR) 10Muehlenhoff: firewall::service: Handle the use of the define on systems w/o P:firewall (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/959730 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [15:14:17] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:959304|SpecialUndelete: Do not clone RequestContext (T346995)]] (duration: 34m 13s) [15:14:22] (03Abandoned) 10Muehlenhoff: firewall::service: Handle the use of the define on systems w/o P:firewall [puppet] - 10https://gerrit.wikimedia.org/r/959730 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [15:14:23] T346995: LogicException viewing diffs between deleted revisions: "RequestContext should not be cloned, use DerivativeContext instead." 
- https://phabricator.wikimedia.org/T346995 [15:14:25] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43456/console" [puppet] - 10https://gerrit.wikimedia.org/r/959762 (https://phabricator.wikimedia.org/T341696) (owner: 10Elukey) [15:15:15] * Lucas_WMDE done [15:15:29] taavi: as far as I’m concerned you can deploy now if you want [15:15:30] jouncebot: next [15:15:30] In 0 hour(s) and 44 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230921T1600) [15:15:40] that should be just about enough time for one more deploy at the current speed ._. [15:15:52] I have a meeting in 15min unfortunately :/ [15:15:56] ah ok [15:16:13] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10Eevans) [15:16:15] is it something I could deploy for you or do you need to test it? [15:16:38] I'd prefer to do it myself, but thanks for the offer [15:16:41] ok [15:16:50] RECOVERY - Check systemd state on an-worker1118 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:16:55] (messing with both the 2fa database storage and wikitech authentication configuration) [15:17:02] RECOVERY - Hadoop NodeManager on an-worker1118 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:17:33] (03CR) 10Mforns: "Some months have passed since we moved all Druid loading jobs to Airflow. I believe this can be merged (if appropriate), no?" 
[puppet] - 10https://gerrit.wikimedia.org/r/906667 (https://phabricator.wikimedia.org/T334095) (owner: 10Mforns) [15:19:01] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P52555 and previous config saved to /var/cache/conftool/dbconfig/20230921-151900-arnaudb.json [15:20:04] !log installing php7.3 security updates (as packaged in Debian Buster) [15:20:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:16] (03CR) 10JMeybohm: [C: 03+2] Revert "Revert "Revert "mediawiki: Reduce requests for canaries""" [deployment-charts] - 10https://gerrit.wikimedia.org/r/959768 (owner: 10JMeybohm) [15:20:23] (03CR) 10JMeybohm: [C: 03+2] mw-api-ext/mw-web: Raise main replicas to 16 in codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/959764 (owner: 10JMeybohm) [15:21:12] (03Merged) 10jenkins-bot: Revert "Revert "Revert "mediawiki: Reduce requests for canaries""" [deployment-charts] - 10https://gerrit.wikimedia.org/r/959768 (owner: 10JMeybohm) [15:21:22] (03Merged) 10jenkins-bot: mw-api-ext/mw-web: Raise main replicas to 16 in codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/959764 (owner: 10JMeybohm) [15:22:36] !log jayme@deploy2002 Started scap: (no justification provided) [15:25:06] !log jayme@deploy2002 Finished scap: (no justification provided) (duration: 02m 29s) [15:28:14] (03PS1) 10Elukey: ml-services: enable the ingress module for recommendation-api-ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/959789 (https://phabricator.wikimedia.org/T347015) [15:28:32] (03PS2) 10Elukey: ml-services: enable the ingress module for recommendation-api-ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/959789 (https://phabricator.wikimedia.org/T347015) [15:29:45] (03PS1) 10Bking: dse-k8s: Trigger flink-app savepoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/959790 (https://phabricator.wikimedia.org/T342149) [15:30:51] 
(03CR) 10Elukey: [C: 03+2] ml-services: enable the ingress module for recommendation-api-ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/959789 (https://phabricator.wikimedia.org/T347015) (owner: 10Elukey) [15:30:53] (03CR) 10Elukey: [V: 03+2 C: 03+2] ml-services: enable the ingress module for recommendation-api-ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/959789 (https://phabricator.wikimedia.org/T347015) (owner: 10Elukey) [15:33:20] (03CR) 10DCausse: [C: 03+1] dse-k8s: Trigger flink-app savepoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/959790 (https://phabricator.wikimedia.org/T342149) (owner: 10Bking) [15:33:43] !log elukey@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [15:34:07] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T343198)', diff saved to https://phabricator.wikimedia.org/P52556 and previous config saved to /var/cache/conftool/dbconfig/20230921-153406-arnaudb.json [15:34:09] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1194.eqiad.wmnet with reason: Maintenance [15:34:18] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [15:34:22] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1194.eqiad.wmnet with reason: Maintenance [15:34:22] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:34:28] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1194 (T343198)', diff saved to https://phabricator.wikimedia.org/P52557 and previous config saved to /var/cache/conftool/dbconfig/20230921-153428-arnaudb.json [15:35:15] (03PS1) 10JMeybohm: Revert "mw-on-k8s: Lower traffic to 3%" [puppet] - 10https://gerrit.wikimedia.org/r/959769 [15:36:42] (03PS2) 10JMeybohm: Revert "mw-on-k8s: 
Lower traffic to 3%" [puppet] - 10https://gerrit.wikimedia.org/r/959769 (https://phabricator.wikimedia.org/T341780) [15:37:26] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Revert "mw-on-k8s: Lower traffic to 3%" [puppet] - 10https://gerrit.wikimedia.org/r/959769 (https://phabricator.wikimedia.org/T341780) (owner: 10JMeybohm) [15:39:17] (03CR) 10JMeybohm: [C: 03+2] Revert "mw-on-k8s: Lower traffic to 3%" [puppet] - 10https://gerrit.wikimedia.org/r/959769 (https://phabricator.wikimedia.org/T341780) (owner: 10JMeybohm) [15:42:29] (03Abandoned) 10JMeybohm: mw-api-ext: Raise main replicas to 12 [deployment-charts] - 10https://gerrit.wikimedia.org/r/959248 (owner: 10Clément Goubert) [15:43:55] (03PS1) 10Ilias Sarantopoulos: ml-services: deploy kserve 0.11 to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/959797 (https://phabricator.wikimedia.org/T346446) [15:48:28] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:51:40] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:52:44] (03PS1) 10Ahmon Dancy: logspam.pl: Consolidate another database-related message [puppet] - 10https://gerrit.wikimedia.org/r/959802 (https://phabricator.wikimedia.org/T347064) [15:53:06] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:53:52] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:54:06] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 5/5 UP 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:55:16] RECOVERY - BFD status on cr4-ulsfo is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:00:05] jbond and rzl: It is that lovely time of the day again! You are hereby commanded to deploy Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230921T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:02:26] !log eoghan@cumin1001 START - Cookbook sre.ganeti.makevm for new host apt-staging2001.codfw.wmnet [16:02:28] !log eoghan@cumin1001 START - Cookbook sre.dns.netbox [16:03:05] (03PS1) 10Giuseppe Lavagetto: modules: add base.statsd [deployment-charts] - 10https://gerrit.wikimedia.org/r/959803 (https://phabricator.wikimedia.org/T343025) [16:03:52] (03CR) 10CI reject: [V: 04-1] modules: add base.statsd [deployment-charts] - 10https://gerrit.wikimedia.org/r/959803 (https://phabricator.wikimedia.org/T343025) (owner: 10Giuseppe Lavagetto) [16:04:13] (03CR) 10Filippo Giunchedi: prometheus: add service_name_override parameter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/958973 (https://phabricator.wikimedia.org/T346893) (owner: 10Cwhite) [16:08:06] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/959759 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [16:08:40] !log eoghan@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM apt-staging2001.codfw.wmnet - eoghan@cumin1001" [16:09:28] !log eoghan@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM apt-staging2001.codfw.wmnet - eoghan@cumin1001" [16:09:28] !log eoghan@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:09:28] !log eoghan@cumin1001 START - Cookbook 
sre.dns.wipe-cache apt-staging2001.codfw.wmnet on all recursors [16:09:32] !log eoghan@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) apt-staging2001.codfw.wmnet on all recursors [16:09:51] (03CR) 10Jbond: firewall: add 'none' provider (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/959739 (owner: 10Majavah) [16:09:58] !log eoghan@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM apt-staging2001.codfw.wmnet - eoghan@cumin1001" [16:10:46] !log eoghan@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM apt-staging2001.codfw.wmnet - eoghan@cumin1001" [16:11:22] !log eoghan@cumin1001 START - Cookbook sre.hosts.reimage for host apt-staging2001.codfw.wmnet with OS bookworm [16:11:29] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: 1 VM request for apt-staging - https://phabricator.wikimedia.org/T347032 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eoghan@cumin1001 for host apt-staging2001.codfw.wmnet with OS bookworm [16:12:17] (03PS2) 10Majavah: firewall: add 'none' provider [puppet] - 10https://gerrit.wikimedia.org/r/959739 [16:13:05] (03CR) 10Jbond: [C: 03+1] "lgtm but lets wait for moritz as well just in case" [puppet] - 10https://gerrit.wikimedia.org/r/959739 (owner: 10Majavah) [16:13:07] (03CR) 10Majavah: "This won't have any effect without adding P:firewall to the affected roles, right?" 
[puppet] - 10https://gerrit.wikimedia.org/r/959759 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [16:14:23] (03PS4) 10Majavah: P:wmcs::metricsinfra: add config for custom blackbox scraping [puppet] - 10https://gerrit.wikimedia.org/r/956038 (https://phabricator.wikimedia.org/T288067) [16:14:38] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 145, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:15:22] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:15:28] (03PS2) 10EoghanGaffney: gitlab: Add unlock command to gitlab-backup script [puppet] - 10https://gerrit.wikimedia.org/r/954916 [16:15:32] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:16:38] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:17:11] (03CR) 10Majavah: [C: 03+2] P:wmcs::metricsinfra: add config for custom blackbox scraping [puppet] - 10https://gerrit.wikimedia.org/r/956038 (https://phabricator.wikimedia.org/T288067) (owner: 10Majavah) [16:17:22] (03CR) 10Majavah: [C: 03+2] P:wmcs::metricsinfra: add config for custom blackbox scraping (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/956038 (https://phabricator.wikimedia.org/T288067) (owner: 10Majavah) [16:17:24] (03PS3) 10Cwhite: prometheus: use alias rather than service_name when present [puppet] - 10https://gerrit.wikimedia.org/r/958973 (https://phabricator.wikimedia.org/T346893) [16:17:28] RECOVERY - Router interfaces on cr2-codfw is 
OK: OK: host 208.80.153.193, interfaces up: 146, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:17:42] (03CR) 10Jbond: [C: 03+1] Add profile::firewall::provider: none for roles where P:firewall is not applied (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/959759 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [16:18:12] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:18:22] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:18:47] 10SRE-tools, 10Spicerack, 10cloud-services-team: [spicerack] Add remote command output to log file - https://phabricator.wikimedia.org/T347093 (10fnegri) [16:23:57] (03CR) 10Lucas Werkmeister (WMDE): "Not sure this is a good idea, to be honest. 
We can do it as a temporary workaround, if we think that the non-normalized messages clutter l" [puppet] - 10https://gerrit.wikimedia.org/r/959802 (https://phabricator.wikimedia.org/T347064) (owner: 10Ahmon Dancy) [16:24:54] (03CR) 10Cwhite: "PCC OK: https://puppet-compiler.wmflabs.org/output/958973/43457/" [puppet] - 10https://gerrit.wikimedia.org/r/958973 (https://phabricator.wikimedia.org/T346893) (owner: 10Cwhite) [16:26:15] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2014.codfw.wmnet with OS bullseye [16:26:34] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host restbase2014.codfw.wmnet with OS bullseye [16:26:46] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/958973 (https://phabricator.wikimedia.org/T346893) (owner: 10Cwhite) [16:28:30] (03CR) 10EoghanGaffney: [C: 03+2] gitlab: Add unlock command to gitlab-backup script [puppet] - 10https://gerrit.wikimedia.org/r/954916 (owner: 10EoghanGaffney) [16:29:18] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10cloud-services-team: [spicerack] Add remote command output to log file - https://phabricator.wikimedia.org/T347093 (10Volans) You can find in the logs the command and the exit code, but that's correct the output of a remote command is not automatica... 
[16:29:40] (03PS1) 10EoghanGaffney: Add new apt-staging host to insetup role [puppet] - 10https://gerrit.wikimedia.org/r/959807 [16:31:03] (03PS1) 10Jbond: git-sync-upstream: invalid use of os.geteuid() [puppet] - 10https://gerrit.wikimedia.org/r/959808 [16:31:19] (03PS4) 10Cwhite: prometheus: use alias rather than service_name when present [puppet] - 10https://gerrit.wikimedia.org/r/958973 (https://phabricator.wikimedia.org/T346893) [16:31:35] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (10VRiley-WMF) @Jclark-ctr db1247 - Verified serial numbers. Verified port and CableID. All seem to be correct. Swapped out cable with a new cable. old: db1247 - D 3. U 04.... [16:32:28] (03CR) 10Cwhite: [C: 03+2] prometheus: use alias rather than service_name when present [puppet] - 10https://gerrit.wikimedia.org/r/958973 (https://phabricator.wikimedia.org/T346893) (owner: 10Cwhite) [16:33:09] (03CR) 10Cwhite: [C: 03+2] sre: use 'up' for swagger probes failures too [alerts] - 10https://gerrit.wikimedia.org/r/959206 (https://phabricator.wikimedia.org/T346893) (owner: 10Filippo Giunchedi) [16:34:03] (03CR) 10Majavah: [C: 03+1] git-sync-upstream: invalid use of os.geteuid() [puppet] - 10https://gerrit.wikimedia.org/r/959808 (owner: 10Jbond) [16:34:22] (03Merged) 10jenkins-bot: sre: use 'up' for swagger probes failures too [alerts] - 10https://gerrit.wikimedia.org/r/959206 (https://phabricator.wikimedia.org/T346893) (owner: 10Filippo Giunchedi) [16:35:21] (03CR) 10Jbond: [C: 03+2] git-sync-upstream: invalid use of os.geteuid() [puppet] - 10https://gerrit.wikimedia.org/r/959808 (owner: 10Jbond) [16:38:27] 10SRE: Icinga contact for dr0ptp4kt - https://phabricator.wikimedia.org/T346688 (10dr0ptp4kt) Thanks @Peachey88 for the Description update for readability! [16:40:13] (03CR) 10Herron: [C: 03+1] "Looks good thank you Luca!" 
[grafana-grizzly] - 10https://gerrit.wikimedia.org/r/956878 (https://phabricator.wikimedia.org/T346144) (owner: 10Elukey) [16:42:30] !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase2014.codfw.wmnet with reason: host reimage [16:44:12] (03CR) 10Herron: "+1 for managing these limits, although I'm not sure yet what the managed values should look like. Added a few notes on the task" [puppet] - 10https://gerrit.wikimedia.org/r/959674 (https://phabricator.wikimedia.org/T346950) (owner: 10Filippo Giunchedi) [16:45:33] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase2014.codfw.wmnet with reason: host reimage [16:45:59] !log vriley@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host pc1015.mgmt.eqiad.wmnet with reboot policy FORCED [16:50:01] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10cloud-services-team: [spicerack] Add remote command output to log file - https://phabricator.wikimedia.org/T347093 (10fnegri) I see your point of avoiding to spam the logs, but I still think it can be useful in some situations. Maybe the output coul... 
[16:50:16] PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:50:29] (03CR) 10Jbond: "LGTM ill merge next week" [puppet] - 10https://gerrit.wikimedia.org/r/956983 (owner: 10Tim Starling) [16:50:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:51:20] RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:55:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:57:04] (03CR) 10Herron: [C: 03+2] titan: add pyrra/slo envoy/cfssl config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/956902 (owner: 10Herron) [16:57:08] PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:58:24] RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:00:05] Deploy window MediaWiki 
infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230921T1700) [17:02:16] PROBLEM - restbase endpoints health on restbase1032 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:03:36] RECOVERY - restbase endpoints health on restbase1032 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:07:37] (PuppetConstantChange) firing: (2) Puppet performing a change on every puppet run - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [17:10:53] (03CR) 10Jbond: [C: 04-1] "lgtm just a minor issue with the gid and additional approvals" [puppet] - 10https://gerrit.wikimedia.org/r/949001 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [17:16:22] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you! 
😊" [puppet] - 10https://gerrit.wikimedia.org/r/959674 (https://phabricator.wikimedia.org/T346950) (owner: 10Filippo Giunchedi) [17:18:39] (03CR) 10Jbond: [C: 03+1] gerrit: change scap user to gerrit-deploy [puppet] - 10https://gerrit.wikimedia.org/r/844998 (https://phabricator.wikimedia.org/T317412) (owner: 10Hashar) [17:19:06] (03CR) 10Bking: [C: 03+2] dse-k8s: Trigger flink-app savepoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/959790 (https://phabricator.wikimedia.org/T342149) (owner: 10Bking) [17:20:04] (03Merged) 10jenkins-bot: dse-k8s: Trigger flink-app savepoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/959790 (https://phabricator.wikimedia.org/T342149) (owner: 10Bking) [17:21:16] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase2014.codfw.wmnet with OS bullseye [17:21:22] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host restbase2014.codfw.wmnet with OS bullseye completed: - restbase20... [17:22:05] (03CR) 10Btullis: Define a script in charge of checking the kafka broker in sync status (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/959162 (https://phabricator.wikimedia.org/T346741) (owner: 10Brouberol) [17:25:28] (03CR) 10Btullis: [C: 03+1] "Looks good to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/959162 (https://phabricator.wikimedia.org/T346741) (owner: 10Brouberol) [17:27:32] !log eoghan@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host apt-staging2001.codfw.wmnet with OS bookworm [17:27:32] !log eoghan@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host apt-staging2001.codfw.wmnet [17:27:37] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: 1 VM request for apt-staging - https://phabricator.wikimedia.org/T347032 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eoghan@cumin1001 for host apt-staging2001.codfw.wmnet with OS bookworm executed with errors: - apt-staging2... [17:34:42] 10SRE, 10AQS2.0, 10Cassandra, 10serviceops, 10Service-deployment-requests: AQS 2.0 differentially private pageviews deploy API - https://phabricator.wikimedia.org/T343855 (10Eevans) To summarize a meeting between @Htriedman and myself: The current API attempts to return results for one of page ID (the m... 
[17:38:42] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10Eevans) [17:39:35] (03PS3) 10DDesouza: Update Reader Demographics 2 pilot survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956931 (https://phabricator.wikimedia.org/T345951) [17:39:59] (03PS4) 10DDesouza: Update Reader Demographics 2 pilot survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956931 (https://phabricator.wikimedia.org/T345951) [17:40:12] RECOVERY - Check systemd state on deploy2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:41:06] (03CR) 10CI reject: [V: 04-1] Update Reader Demographics 2 pilot survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956931 (https://phabricator.wikimedia.org/T345951) (owner: 10DDesouza) [17:41:17] !log eevans@cumin1001 START - Cookbook sre.hosts.remove-downtime for restbase2014.codfw.wmnet [17:41:17] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for restbase2014.codfw.wmnet [17:44:22] PROBLEM - Check systemd state on deploy2002 is CRITICAL: CRITICAL - degraded: The following units failed: imagecatalog_record.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:47:19] (03PS5) 10DDesouza: Update Reader Demographics 2 pilot survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956931 (https://phabricator.wikimedia.org/T345951) [17:48:15] (03PS1) 10DDesouza: Deploy Reader Demographics 2 pilot survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959826 (https://phabricator.wikimedia.org/T345951) [17:49:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db2149 (T346365)', diff saved to https://phabricator.wikimedia.org/P52558 and previous config saved to /var/cache/conftool/dbconfig/20230921-174934-ladsgroup.json [17:49:42] T346365: PHP Notice: Undefined index: DEFAULT - 
https://phabricator.wikimedia.org/T346365 [17:51:13] (03PS1) 10Jbond: templates/diffs: escape parameters [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/959827 [17:54:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2149 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P52559 and previous config saved to /var/cache/conftool/dbconfig/20230921-175444-ladsgroup.json [17:56:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db1166 (T346365)', diff saved to https://phabricator.wikimedia.org/P52560 and previous config saved to /var/cache/conftool/dbconfig/20230921-175634-ladsgroup.json [17:56:41] T346365: PHP Notice: Undefined index: DEFAULT - https://phabricator.wikimedia.org/T346365 [17:56:46] PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:58:10] RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:58:14] (03PS1) 10Jbond: do not merge: test change for pcc [puppet] - 10https://gerrit.wikimedia.org/r/959831 (https://phabricator.wikimedia.org/T346216) [17:58:33] !log xcollazo@deploy2002 Started deploy [airflow-dags/analytics@ddcc518]: Deploy latest DAGs to analytics Airflow instance [17:58:39] (03CR) 10CI reject: [V: 04-1] do not merge: test change for pcc [puppet] - 10https://gerrit.wikimedia.org/r/959831 (https://phabricator.wikimedia.org/T346216) (owner: 10Jbond) [17:59:13] !log xcollazo@deploy2002 Finished deploy [airflow-dags/analytics@ddcc518]: Deploy latest DAGs to analytics Airflow instance (duration: 00m 40s) [18:00:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repool db1166 (T346365)', diff saved to 
https://phabricator.wikimedia.org/P52561 and previous config saved to /var/cache/conftool/dbconfig/20230921-180003-ladsgroup.json [18:00:06] brennen and jnuche: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train - Utc-7+Utc-0 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230921T1800). [18:01:48] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43458/console" [puppet] - 10https://gerrit.wikimedia.org/r/959831 (https://phabricator.wikimedia.org/T346216) (owner: 10Jbond) [18:03:15] (03PS2) 10Jbond: do not merge: test change for pcc [puppet] - 10https://gerrit.wikimedia.org/r/959831 (https://phabricator.wikimedia.org/T346216) [18:03:37] o/ [18:04:31] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43459/console" [puppet] - 10https://gerrit.wikimedia.org/r/959831 (https://phabricator.wikimedia.org/T346216) (owner: 10Jbond) [18:05:21] !log train 1.41.0-wmf.27 (T345888): no current blockers, logs clean, rolling to group2 shortly. 
[18:05:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:30] T345888: 1.41.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T345888 [18:06:14] (03PS2) 10Jbond: templates/diffs: escape parameters [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/959827 [18:06:35] (03PS1) 10TrainBranchBot: group2 wikis to 1.41.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959833 (https://phabricator.wikimedia.org/T345888) [18:06:37] (03CR) 10TrainBranchBot: [C: 03+2] group2 wikis to 1.41.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959833 (https://phabricator.wikimedia.org/T345888) (owner: 10TrainBranchBot) [18:07:13] (03CR) 10CI reject: [V: 04-1] group2 wikis to 1.41.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959833 (https://phabricator.wikimedia.org/T345888) (owner: 10TrainBranchBot) [18:07:19] (03Merged) 10jenkins-bot: group2 wikis to 1.41.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959833 (https://phabricator.wikimedia.org/T345888) (owner: 10TrainBranchBot) [18:09:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2149 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P52562 and previous config saved to /var/cache/conftool/dbconfig/20230921-180949-ladsgroup.json [18:10:06] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:15:07] !log brennen@deploy2002 rebuilt and synchronized wikiversions files: group2 wikis to 1.41.0-wmf.27 refs T345888 [18:15:16] T345888: 1.41.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T345888 [18:18:09] (JobUnavailable) firing: (6) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets 
- https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:18:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install pc101[56] - https://phabricator.wikimedia.org/T342164 (10VRiley-WMF) @Jclark-ctr corrected the problem with the device. Also, I did close my screen on this device. Would you be able to try again? [18:21:58] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:22:23] (03CR) 10Jbond: "fyi i dropped this from the patches on the deployment prep puppet master" [puppet] - 10https://gerrit.wikimedia.org/r/935448 (owner: 10Krinkle) [18:24:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2149 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P52564 and previous config saved to /var/cache/conftool/dbconfig/20230921-182455-ladsgroup.json [18:31:27] jouncebot: nowandnext [18:31:28] For the next 1 hour(s) and 28 minute(s): MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230921T1800) [18:31:28] In 1 hour(s) and 28 minute(s): UTC late backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230921T2000) [18:31:45] brennen: is the coast clear for me to deploy a config change? [18:32:08] Amir1: yeah, logs are looking stable. go ahead. 
[18:32:24] awesome [18:32:29] (03PS3) 10Ladsgroup: Enable Url shortener in sidebar in all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959713 (https://phabricator.wikimedia.org/T267921) [18:32:33] (03CR) 10Ladsgroup: [C: 03+2] Enable Url shortener in sidebar in all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959713 (https://phabricator.wikimedia.org/T267921) (owner: 10Ladsgroup) [18:33:01] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959713 (https://phabricator.wikimedia.org/T267921) (owner: 10Ladsgroup) [18:33:41] (03Merged) 10jenkins-bot: Enable Url shortener in sidebar in all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959713 (https://phabricator.wikimedia.org/T267921) (owner: 10Ladsgroup) [18:33:57] !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:959713|Enable Url shortener in sidebar in all wikis (T267921)]] [18:34:09] T267921: Roll out the Toolbox link for URL Shortener in Wikimedia sites - https://phabricator.wikimedia.org/T267921 [18:36:39] (03PS1) 10Cwhite: prometheus: create swagger job variant [puppet] - 10https://gerrit.wikimedia.org/r/958981 (https://phabricator.wikimedia.org/T346893) [18:40:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2149 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P52565 and previous config saved to /var/cache/conftool/dbconfig/20230921-184000-ladsgroup.json [18:45:06] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:959713|Enable Url shortener in sidebar in all wikis (T267921)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [18:45:19] T267921: Roll out the Toolbox link for URL Shortener in Wikimedia sites - https://phabricator.wikimedia.org/T267921 [18:45:45] 
!log ladsgroup@deploy2002 ladsgroup: Continuing with sync [18:48:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:48:50] PROBLEM - restbase endpoints health on restbase1032 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:49:02] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:50:16] RECOVERY - restbase endpoints health on restbase1032 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:50:28] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:53:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:53:56] (03CR) 10Ssingh: [C: 03+1] "Thanks for the patch! Happy to take care of merging it on Monday." 
[puppet] - 10https://gerrit.wikimedia.org/r/959749 (https://phabricator.wikimedia.org/T329529) (owner: 10Muehlenhoff) [18:54:37] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10VRiley-WMF) an-master1003 - C 6. U 12. port 09 CableID 3193 an-master1004 - D 8. U 36. port 35 CableID 2013339101850 [18:54:45] !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:959713|Enable Url shortener in sidebar in all wikis (T267921)]] (duration: 20m 47s) [18:54:52] T267921: Roll out the Toolbox link for URL Shortener in Wikimedia sites - https://phabricator.wikimedia.org/T267921 [19:01:51] 10SRE, 10Traffic: Simplify maintenance of DNS/NTP hosts to reduce toil around reboots, reimages, and other work - https://phabricator.wikimedia.org/T347054 (10jbond) without looking at all the detail and edge cases i think the overall proposal sounds good. for the puppet side of things we should be able to us... [19:02:06] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for ahoelzl - https://phabricator.wikimedia.org/T345959 (10RLazarus) It seems like this fell through the cracks between last week's SRE clinic duty (mine) and this week's. Let me finish it up for you, thanks for your patience. 
[19:02:18] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for ahoelzl - https://phabricator.wikimedia.org/T345959 (10RLazarus) [19:02:23] (03PS1) 10RLazarus: admin: Add ahoelzl to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/959843 (https://phabricator.wikimedia.org/T345959) [19:06:00] (03PS2) 10RLazarus: admin: Add ahoelzl to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/959843 (https://phabricator.wikimedia.org/T345959) [19:13:02] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add codfw new switches - cmooney@cumin1001" [19:15:09] (03CR) 10Fabfur: admin: Add ahoelzl to analytics-privatedata-users (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/959843 (https://phabricator.wikimedia.org/T345959) (owner: 10RLazarus) [19:15:53] (03CR) 10Cwhite: "PCC OK: https://puppet-compiler.wmflabs.org/output/958981/43460/" [puppet] - 10https://gerrit.wikimedia.org/r/958981 (https://phabricator.wikimedia.org/T346893) (owner: 10Cwhite) [19:16:44] (03PS1) 10Dwisehaupt: Add DKIM record for fundraise-up (donation processor) [dns] - 10https://gerrit.wikimedia.org/r/959847 (https://phabricator.wikimedia.org/T345354) [19:17:47] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add codfw new switches - cmooney@cumin1001" [19:17:50] (03CR) 10Dwisehaupt: "DKIM record for the fundraise up pilot for your review." 
[dns] - 10https://gerrit.wikimedia.org/r/959847 (https://phabricator.wikimedia.org/T345354) (owner: 10Dwisehaupt) [19:18:32] (03PS1) 10Jdlrobson: WIP: Update wikiquote wordmarks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959848 (https://phabricator.wikimedia.org/T341260) [19:18:40] PROBLEM - Check systemd state on gitlab1004 is CRITICAL: CRITICAL - degraded: The following units failed: sync-gitlab-group-with-ldap.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:18:59] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T343198)', diff saved to https://phabricator.wikimedia.org/P52566 and previous config saved to /var/cache/conftool/dbconfig/20230921-191858-arnaudb.json [19:19:05] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [19:19:50] (03CR) 10Fabfur: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/959843 (https://phabricator.wikimedia.org/T345959) (owner: 10RLazarus) [19:24:20] (03CR) 10CI reject: [V: 04-1] WIP: Update wikiquote wordmarks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959848 (https://phabricator.wikimedia.org/T341260) (owner: 10Jdlrobson) [19:26:17] 10SRE, 10SRE-Access-Requests: Requesting access to deployment for dr0ptp4kt - https://phabricator.wikimedia.org/T347110 (10dr0ptp4kt) [19:26:21] (03CR) 10RLazarus: [C: 03+2] admin: Add ahoelzl to analytics-privatedata-users (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/959843 (https://phabricator.wikimedia.org/T345959) (owner: 10RLazarus) [19:29:59] (03PS1) 10Dr0ptp4kt: Re-enroll dr0ptp4kt in deployment group [puppet] - 10https://gerrit.wikimedia.org/r/959850 (https://phabricator.wikimedia.org/T347110) [19:31:28] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for ahoelzl - https://phabricator.wikimedia.org/T345959 (10RLazarus) 05Open→03Resolved a:03RLazarus Done: - Added you to the `wmf` LDAP group. 
- Added you to the #wmf-nda Phabricator project. - Created your shell user `ahoelzl... [19:31:33] (03CR) 10Jgreen: [C: 03+2] Add DKIM record for fundraise-up (donation processor) [dns] - 10https://gerrit.wikimedia.org/r/959847 (https://phabricator.wikimedia.org/T345354) (owner: 10Dwisehaupt) [19:32:21] (03CR) 10Hashar: [C: 03+1] "Welcome aboard! :)" [puppet] - 10https://gerrit.wikimedia.org/r/959850 (https://phabricator.wikimedia.org/T347110) (owner: 10Dr0ptp4kt) [19:34:05] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P52567 and previous config saved to /var/cache/conftool/dbconfig/20230921-193404-arnaudb.json [19:34:28] 10SRE, 10observability: Icinga contact for dr0ptp4kt - https://phabricator.wikimedia.org/T346688 (10dr0ptp4kt) [19:47:00] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [19:49:12] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P52568 and previous config saved to /var/cache/conftool/dbconfig/20230921-194911-arnaudb.json [19:54:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10bking) [19:57:54] (03PS1) 10Bking: elastic: introduce jbod-related config [puppet] - 10https://gerrit.wikimedia.org/r/959854 (https://phabricator.wikimedia.org/T231010) [19:59:58] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add reords for codfw test servers - cmooney@cumin1001" [20:00:05] brennen and TheresNoTime: #bothumor I � Unicode. All rise for UTC late backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230921T2000). [20:00:05] danisztls: A patch you scheduled for UTC late backport and config training is about to be deployed. 
Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:07] o/ [20:00:10] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/959854 (https://phabricator.wikimedia.org/T231010) (owner: 10Bking) [20:00:39] o/ [20:00:48] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add reords for codfw test servers - cmooney@cumin1001" [20:00:48] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:01:25] I can deploy if needed [20:02:04] o/ [20:02:15] i'm here, will do this one since we have a trainee. :) [20:02:19] TheresNoTime: ^ [20:02:24] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:02:32] brennen: ah good :) [20:02:50] * TheresNoTime hadn't updated fingerprints on this laptop anyways [20:03:50] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:03:57] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by brennen@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956931 (https://phabricator.wikimedia.org/T345951) (owner: 10DDesouza) [20:04:00] (03PS3) 10Krinkle: search-grafana-dashboards: add support for searching alert metadata [software] - 10https://gerrit.wikimedia.org/r/959366 (https://phabricator.wikimedia.org/T345190) [20:04:18] (03PS6) 10Brennen Bearnes: Update Reader Demographics 2 pilot survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956931 (https://phabricator.wikimedia.org/T345951) (owner: 10DDesouza) 
[20:04:18] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T343198)', diff saved to https://phabricator.wikimedia.org/P52569 and previous config saved to /var/cache/conftool/dbconfig/20230921-200417-arnaudb.json [20:04:20] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1202.eqiad.wmnet with reason: Maintenance [20:04:26] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [20:04:33] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1202.eqiad.wmnet with reason: Maintenance [20:04:40] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1202 (T343198)', diff saved to https://phabricator.wikimedia.org/P52570 and previous config saved to /var/cache/conftool/dbconfig/20230921-200439-arnaudb.json [20:05:04] (03CR) 10TrainBranchBot: "Approved by brennen@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956931 (https://phabricator.wikimedia.org/T345951) (owner: 10DDesouza) [20:05:47] (03Merged) 10jenkins-bot: Update Reader Demographics 2 pilot survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956931 (https://phabricator.wikimedia.org/T345951) (owner: 10DDesouza) [20:06:03] !log brennen@deploy2002 Started scap: Backport for [[gerrit:956931|Update Reader Demographics 2 pilot survey (T345951)]] [20:06:06] PROBLEM - Host ps1-d8-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [20:06:10] T345951: Deploy pilot on enwiki for Global Readers Demographic Survey - https://phabricator.wikimedia.org/T345951 [20:06:20] RECOVERY - Host ps1-d8-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.85 ms [20:10:26] danisztls: this will be on debug servers after a bit. the bot will ping. [20:10:29] i will also ping. [20:10:34] ok [20:10:38] redundancy in comms is most of my job. 
[20:11:01] xD [20:17:17] !log brennen@deploy2002 dani and brennen: Backport for [[gerrit:956931|Update Reader Demographics 2 pilot survey (T345951)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [20:17:24] T345951: Deploy pilot on enwiki for Global Readers Demographic Survey - https://phabricator.wikimedia.org/T345951 [20:17:36] lgtm [20:18:11] danisztls: cool, going ahead. [20:18:17] !log brennen@deploy2002 dani and brennen: Continuing with sync [20:20:20] brennen: thanks! [20:26:34] (KubernetesAPILatency) firing: (7) High Kubernetes API latency (LIST blockaffinities) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:27:40] !log brennen@deploy2002 Finished scap: Backport for [[gerrit:956931|Update Reader Demographics 2 pilot survey (T345951)]] (duration: 21m 36s) [20:27:48] T345951: Deploy pilot on enwiki for Global Readers Demographic Survey - https://phabricator.wikimedia.org/T345951 [20:28:05] danisztls: fin. [20:28:24] !log end of UTC late backport & config window [20:28:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:18] (03CR) 10Bking: [C: 04-1] "Let's wait to merge this until we have a larger discussion with the team." 
[puppet] - 10https://gerrit.wikimedia.org/r/959854 (https://phabricator.wikimedia.org/T231010) (owner: 10Bking) [20:31:34] (KubernetesAPILatency) resolved: (7) High Kubernetes API latency (LIST blockaffinities) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:42:56] PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:44:20] RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:46:02] PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:47:28] RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:53:34] PROBLEM - restbase endpoints health on restbase1030 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:55:00] RECOVERY - restbase endpoints health on restbase1030 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:07:21] (03PS1) 10Jdlrobson: WIP Provide wordmarks/taglines for Wikibooks projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959872 (https://phabricator.wikimedia.org/T341251) [21:08:01] (03CR) 10CI reject: [V: 04-1] WIP Provide wordmarks/taglines for Wikibooks projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959872 (https://phabricator.wikimedia.org/T341251) (owner: 10Jdlrobson) [21:08:10] (PuppetConstantChange) firing: (2) Puppet performing a change on every puppet run - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [21:10:36] (03PS1) 10Cathal Mooney: Support configuration of EVPN anycast GW on switches [homer/public] - 10https://gerrit.wikimedia.org/r/959873 (https://phabricator.wikimedia.org/T327938) [21:12:50] (03PS2) 10Cathal Mooney: Support configuration of EVPN anycast GW on switches [homer/public] - 10https://gerrit.wikimedia.org/r/959873 (https://phabricator.wikimedia.org/T327938) [21:15:04] RECOVERY - Check systemd state on gitlab1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:16:40] (03CR) 10Cathal Mooney: Support configuration of EVPN anycast GW on switches (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/959873 (https://phabricator.wikimedia.org/T327938) (owner: 10Cathal Mooney) [21:17:01] 10SRE, 10GrowthExperiments-Homepage, 10GrowthExperiments-ImpactModule, 10Growth-Team (Current Sprint), 10Performance Issue: RefreshUserImpactJob consumes too many file descriptors - https://phabricator.wikimedia.org/T344428 (10colewhite) mwmaint2002: ` $ ulimit -Hn 1048576 $ ulimit -Sn 1024 ` I didn't s... 
[21:19:18] PROBLEM - Check systemd state on gitlab1004 is CRITICAL: CRITICAL - degraded: The following units failed: sync-gitlab-group-with-ldap.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:20:22] (03CR) 10Cathal Mooney: [C: 03+1] "Good call!" [homer/public] - 10https://gerrit.wikimedia.org/r/959732 (https://phabricator.wikimedia.org/T334916) (owner: 10Ayounsi) [21:23:12] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM. Sorry this is so labour-intensive, we'll hopefully have that cookbook soon or implement some of those other options. Thanks." [homer/public] - 10https://gerrit.wikimedia.org/r/958809 (https://phabricator.wikimedia.org/T346714) (owner: 10Giuseppe Lavagetto) [21:29:22] 10SRE, 10GrowthExperiments-Homepage, 10GrowthExperiments-ImpactModule, 10Growth-Team (Current Sprint), 10Performance Issue: RefreshUserImpactJob consumes too many file descriptors - https://phabricator.wikimedia.org/T344428 (10Urbanecm_WMF) >>! In T344428#9189242, @colewhite wrote: > I didn't see anythin... 
[21:52:09] (03PS1) 10Jdlrobson: WIP: wordmarks for Wikinews projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959876 (https://phabricator.wikimedia.org/T341258) [21:52:14] PROBLEM - restbase endpoints health on restbase1032 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:53:40] RECOVERY - restbase endpoints health on restbase1032 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:55:34] (03PS3) 10Jdlrobson: WIP: Logos for special projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956502 (https://phabricator.wikimedia.org/T341242) [22:00:04] RECOVERY - Check systemd state on gitlab1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:04:14] PROBLEM - Check systemd state on gitlab1004 is CRITICAL: CRITICAL - degraded: The following units failed: sync-gitlab-group-with-ldap.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:08:35] ^^ I'll look into that. [22:11:10] (03PS1) 10Ahmon Dancy: sync-gitlab-group-with-ldap: Use --yes flag [puppet] - 10https://gerrit.wikimedia.org/r/959881 [22:12:49] (03CR) 10Ahmon Dancy: "Please deploy ASAP! sync-gitlab-group-with-ldap.service is currently failing and generating Icinga alerts." 
[puppet] - 10https://gerrit.wikimedia.org/r/959881 (owner: 10Ahmon Dancy) [22:18:10] (JobUnavailable) firing: (6) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:32:02] (03PS4) 10Jdlrobson: WIP: Logos for special projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956502 (https://phabricator.wikimedia.org/T341242) [22:32:41] (03CR) 10CI reject: [V: 04-1] WIP: Logos for special projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956502 (https://phabricator.wikimedia.org/T341242) (owner: 10Jdlrobson) [22:37:32] (03PS3) 10Ebernhardson: Draft: Pull some flink config down into the chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/959059 (https://phabricator.wikimedia.org/T346315) [22:37:36] (03CR) 10Ebernhardson: Draft: Pull some flink config down into the chart (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/959059 (https://phabricator.wikimedia.org/T346315) (owner: 10Ebernhardson) [22:37:40] (03PS2) 10Ebernhardson: Draft: flink-app: Provide kafka hosts as properties file [deployment-charts] - 10https://gerrit.wikimedia.org/r/959066 [22:37:44] (03CR) 10Ebernhardson: "great ideas, thanks! 
I've updated this to be opinionated about paths, and tried to do something about having a pre-defined storage bucket," [deployment-charts] - 10https://gerrit.wikimedia.org/r/959059 (https://phabricator.wikimedia.org/T346315) (owner: 10Ebernhardson) [22:51:16] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics and search resources for dr0ptp4kt - https://phabricator.wikimedia.org/T346694 (10colewhite) [22:52:19] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics and search resources for dr0ptp4kt - https://phabricator.wikimedia.org/T346694 (10colewhite) [22:53:07] (03CR) 10Cwhite: [C: 03+2] dr0ptp4kt WDQS, Search, Analytics access [puppet] - 10https://gerrit.wikimedia.org/r/958568 (https://phabricator.wikimedia.org/T346694) (owner: 10Dr0ptp4kt) [22:55:04] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics and search resources for dr0ptp4kt - https://phabricator.wikimedia.org/T346694 (10colewhite) 05Open→03Resolved a:03colewhite The group membership change has been deployed. Please feel free to reopen if you encounter any r... 
[22:58:38] PROBLEM - restbase endpoints health on restbase1032 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [23:00:04] RECOVERY - restbase endpoints health on restbase1032 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [23:10:53] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Aisha Khatun - https://phabricator.wikimedia.org/T346796 (10colewhite) [23:12:13] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Aisha Khatun - https://phabricator.wikimedia.org/T346796 (10colewhite) @MGerlach is there an expiry date for this contract renewal? [23:12:31] (03PS1) 10Cwhite: Restore access for akhatun [puppet] - 10https://gerrit.wikimedia.org/r/959771 (https://phabricator.wikimedia.org/T346796) [23:12:40] (03CR) 10CI reject: [V: 04-1] Restore access for akhatun [puppet] - 10https://gerrit.wikimedia.org/r/959771 (https://phabricator.wikimedia.org/T346796) (owner: 10Cwhite) [23:19:55] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment for dr0ptp4kt - https://phabricator.wikimedia.org/T347110 (10colewhite) [23:21:43] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment for dr0ptp4kt - https://phabricator.wikimedia.org/T347110 (10colewhite) ping: @thcipriani as approver for deployment group membership [23:22:46] 10SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users, deployment_members for Mabualruz - https://phabricator.wikimedia.org/T342535 (10colewhite) [23:26:07] (03PS1) 10Cwhite: admin: add mabualruz to deployment group [puppet] - 
10https://gerrit.wikimedia.org/r/958982 (https://phabricator.wikimedia.org/T342535) [23:29:14] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics_privatedata_users, deployment_members for Mabualruz - https://phabricator.wikimedia.org/T342535 (10colewhite) [23:30:04] RECOVERY - Check systemd state on gitlab1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:32:16] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:32:48] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:34:18] PROBLEM - Check systemd state on gitlab1004 is CRITICAL: CRITICAL - degraded: The following units failed: sync-gitlab-group-with-ldap.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:35:20] 10SRE, 10LDAP-Access-Requests: Migrate Bawolff from wmf ldap group to nda ldap group - https://phabricator.wikimedia.org/T346921 (10colewhite) 05Open→03Resolved a:03colewhite Migrated to nda ldap group. [23:37:48] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50567 bytes in 0.106 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:38:19] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment for dr0ptp4kt - https://phabricator.wikimedia.org/T347110 (10thcipriani) Approved! 
[23:38:20] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.276 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:39:01] (03PS2) 10Cwhite: Re-enroll dr0ptp4kt in deployment group [puppet] - 10https://gerrit.wikimedia.org/r/959850 (https://phabricator.wikimedia.org/T347110) (owner: 10Dr0ptp4kt) [23:39:46] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment for dr0ptp4kt - https://phabricator.wikimedia.org/T347110 (10colewhite) [23:40:11] (03CR) 10Cwhite: [C: 03+2] Re-enroll dr0ptp4kt in deployment group [puppet] - 10https://gerrit.wikimedia.org/r/959850 (https://phabricator.wikimedia.org/T347110) (owner: 10Dr0ptp4kt) [23:40:59] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment for dr0ptp4kt - https://phabricator.wikimedia.org/T347110 (10colewhite) 05Open→03Resolved a:03colewhite The group membership change has been deployed. Please feel free to reopen if you encounter any related issue. [23:58:10] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T343198)', diff saved to https://phabricator.wikimedia.org/P52572 and previous config saved to /var/cache/conftool/dbconfig/20230921-235810-arnaudb.json [23:58:17] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198