[00:01:36] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1290.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:02:55] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1289.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:07:32] vriley@cumin1003 provision (PID 4047452) is awaiting input [00:13:51] FIRING: [6x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from performance.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [00:14:01] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host db1290.eqiad.wmnet with OS bookworm [00:14:14] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11924012 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host db1290.eqiad.wmnet with OS bookworm [00:14:42] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1289.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:18:51] FIRING: [6x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from performance.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [00:19:43] PROBLEM - Swift https backend on ms-fe2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [00:19:43] PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [00:19:43] PROBLEM - Swift https backend on ms-fe2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Swift [00:19:59] PROBLEM - Swift https frontend on ms-fe2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [00:19:59] PROBLEM - Swift https frontend on ms-fe2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [00:20:03] PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [00:20:03] PROBLEM - Swift https frontend on ms-fe2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [00:20:03] PROBLEM - Swift https frontend on ms-fe2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [00:20:03] PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [00:20:03] PROBLEM - Swift https backend on ms-fe2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [00:20:03] PROBLEM - Swift https backend on ms-fe2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [00:20:03] PROBLEM - Swift https frontend on ms-fe2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [00:20:04] PROBLEM - Swift https frontend on ms-fe2020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [00:20:04] PROBLEM - Swift https backend on ms-fe2020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [00:20:05] PROBLEM - Swift https frontend on ms-fe2018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [00:20:05] PROBLEM - Swift https backend on ms-fe2018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Swift [00:20:33] RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 571 bytes in 0.188 second response time https://wikitech.wikimedia.org/wiki/Swift [00:20:33] RECOVERY - Swift https backend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.202 second response time https://wikitech.wikimedia.org/wiki/Swift [00:20:33] RECOVERY - Swift https backend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.192 second response time https://wikitech.wikimedia.org/wiki/Swift [00:20:49] RECOVERY - Swift https frontend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.173 second response time https://wikitech.wikimedia.org/wiki/Swift [00:20:51] RECOVERY - Swift https frontend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.180 second response time https://wikitech.wikimedia.org/wiki/Swift [00:20:53] RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.177 second response time https://wikitech.wikimedia.org/wiki/Swift [00:20:53] RECOVERY - Swift https frontend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.177 second response time https://wikitech.wikimedia.org/wiki/Swift [00:20:53] RECOVERY - Swift https backend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.193 second response time https://wikitech.wikimedia.org/wiki/Swift [00:20:53] RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.194 second response time https://wikitech.wikimedia.org/wiki/Swift [00:20:53] RECOVERY - Swift https frontend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Swift [00:20:53] RECOVERY - Swift https backend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.191 second response time https://wikitech.wikimedia.org/wiki/Swift [00:20:54] RECOVERY - Swift https frontend on ms-fe2016 is OK: HTTP OK: 
HTTP/1.1 200 OK - 360 bytes in 0.192 second response time https://wikitech.wikimedia.org/wiki/Swift [00:20:54] RECOVERY - Swift https frontend on ms-fe2020 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.187 second response time https://wikitech.wikimedia.org/wiki/Swift [00:20:55] RECOVERY - Swift https backend on ms-fe2020 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.212 second response time https://wikitech.wikimedia.org/wiki/Swift [00:20:55] RECOVERY - Swift https frontend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.189 second response time https://wikitech.wikimedia.org/wiki/Swift [00:20:56] RECOVERY - Swift https backend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.194 second response time https://wikitech.wikimedia.org/wiki/Swift [00:30:01] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host db1289.eqiad.wmnet with OS bookworm [00:30:14] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11924030 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host db1289.eqiad.wmnet with OS bookworm [00:36:57] ops-ulsfo, SRE, DC-Ops, Infrastructure-Foundations, and 2 others: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11924035 (Papaul) Open→Resolved The last BGP session between cr3 and asw1-23 is now up. We can now close this task. Thanks to all that did help... [00:38:17] ops-ulsfo, SRE, DC-Ops, Infrastructure-Foundations, and 2 others: FY2526 Q3 ulsfo: switch refresh - https://phabricator.wikimedia.org/T408510#11924038 (Papaul) Open→Resolved The ULSFO switch refresh is complete. Good to close this task. 
[00:38:51] FIRING: [5x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from performance.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [00:39:10] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1290.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:42:42] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1290.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:43:27] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1290.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:46:05] !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1289.eqiad.wmnet with reason: host reimage [00:48:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:49:46] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1289.eqiad.wmnet with reason: host reimage [00:50:37] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1290.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:58:51] RESOLVED: [4x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from performance.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [01:03:41] FIRING: [43x] SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:06:39] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [01:09:44] vriley@cumin1003 reimage (PID 4050668) is awaiting input [01:10:08] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1287506 [01:10:08] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1287506 (owner: TrainBranchBot) [01:10:15] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [01:10:16] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1289.eqiad.wmnet with OS bookworm [01:10:27] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11924053 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host db1289.eqiad.wmnet with OS bookworm completed: - db1289 (**PASS**) -... 
[01:11:01] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [01:12:01] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [01:14:16] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host db1290.eqiad.wmnet with OS bookworm [01:14:28] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11924055 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host db1290.eqiad.wmnet with OS bookworm [01:17:07] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11924059 (VRiley-WMF) [01:17:28] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11924060 (VRiley-WMF) [01:19:59] PROBLEM - Swift https frontend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [01:19:59] PROBLEM - Swift https frontend on ms-fe2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [01:19:59] PROBLEM - Swift https backend on ms-fe2021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [01:20:03] PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [01:20:03] PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Swift [01:20:03] PROBLEM - Swift https frontend on ms-fe2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [01:20:03] PROBLEM - Swift https frontend on ms-fe2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [01:20:03] PROBLEM - Swift https frontend on ms-fe2023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [01:20:03] PROBLEM - Swift https frontend on ms-fe2021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [01:20:03] PROBLEM - Swift https backend on ms-fe2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [01:20:04] PROBLEM - Swift https backend on ms-fe2020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [01:20:04] PROBLEM - Swift https frontend on ms-fe2020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [01:20:05] PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [01:20:09] PROBLEM - Swift https backend on ms-fe2023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [01:20:49] RECOVERY - Swift https frontend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Swift [01:20:49] RECOVERY - Swift https frontend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.178 second response time https://wikitech.wikimedia.org/wiki/Swift [01:20:51] RECOVERY - Swift https backend on ms-fe2021 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.199 second response time https://wikitech.wikimedia.org/wiki/Swift [01:20:53] RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP 
OK: HTTP/1.1 200 OK - 360 bytes in 0.180 second response time https://wikitech.wikimedia.org/wiki/Swift [01:20:53] RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.199 second response time https://wikitech.wikimedia.org/wiki/Swift [01:20:53] RECOVERY - Swift https frontend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.181 second response time https://wikitech.wikimedia.org/wiki/Swift [01:20:53] RECOVERY - Swift https frontend on ms-fe2021 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Swift [01:20:53] RECOVERY - Swift https frontend on ms-fe2023 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.188 second response time https://wikitech.wikimedia.org/wiki/Swift [01:20:53] RECOVERY - Swift https frontend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.188 second response time https://wikitech.wikimedia.org/wiki/Swift [01:20:53] RECOVERY - Swift https backend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.202 second response time https://wikitech.wikimedia.org/wiki/Swift [01:20:54] RECOVERY - Swift https frontend on ms-fe2020 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.189 second response time https://wikitech.wikimedia.org/wiki/Swift [01:20:54] RECOVERY - Swift https backend on ms-fe2020 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.200 second response time https://wikitech.wikimedia.org/wiki/Swift [01:20:55] RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.187 second response time https://wikitech.wikimedia.org/wiki/Swift [01:20:59] RECOVERY - Swift https backend on ms-fe2023 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.206 second response time https://wikitech.wikimedia.org/wiki/Swift [01:22:02] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1287506 (owner: TrainBranchBot) [02:00:38] !log 
mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image [02:06:48] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1290.eqiad.wmnet with OS bookworm [02:06:55] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11924120 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host db1290.eqiad.wmnet with OS bookworm executed with errors: - db1290 (**F... [02:07:28] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 06m 50s) [02:19:59] PROBLEM - Swift https frontend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [02:19:59] PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [02:19:59] PROBLEM - Swift https backend on ms-fe2019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [02:20:03] PROBLEM - Swift https frontend on ms-fe2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [02:20:03] PROBLEM - Swift https frontend on ms-fe2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [02:20:03] PROBLEM - Swift https frontend on ms-fe2019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [02:20:03] PROBLEM - Swift https backend on ms-fe2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [02:20:03] PROBLEM - Swift https frontend on ms-fe2020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [02:20:03] PROBLEM - Swift https backend on ms-fe2020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Swift [02:20:05] PROBLEM - Swift https backend on ms-fe2018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [02:20:05] PROBLEM - Swift https frontend on ms-fe2018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [02:20:05] PROBLEM - Swift https backend on ms-fe2022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [02:20:05] PROBLEM - Swift https frontend on ms-fe2022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [02:20:05] PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [02:20:49] RECOVERY - Swift https frontend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.192 second response time https://wikitech.wikimedia.org/wiki/Swift [02:20:49] RECOVERY - Swift https backend on ms-fe2019 is OK: HTTP OK: HTTP/1.1 200 OK - 571 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Swift [02:20:49] RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.175 second response time https://wikitech.wikimedia.org/wiki/Swift [02:20:53] RECOVERY - Swift https frontend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.187 second response time https://wikitech.wikimedia.org/wiki/Swift [02:20:53] RECOVERY - Swift https frontend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.174 second response time https://wikitech.wikimedia.org/wiki/Swift [02:20:53] RECOVERY - Swift https backend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.193 second response time https://wikitech.wikimedia.org/wiki/Swift [02:20:53] RECOVERY - Swift https frontend on ms-fe2019 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.184 second response time 
https://wikitech.wikimedia.org/wiki/Swift [02:20:53] RECOVERY - Swift https frontend on ms-fe2020 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.187 second response time https://wikitech.wikimedia.org/wiki/Swift [02:20:55] RECOVERY - Swift https backend on ms-fe2020 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.209 second response time https://wikitech.wikimedia.org/wiki/Swift [02:20:55] RECOVERY - Swift https backend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.202 second response time https://wikitech.wikimedia.org/wiki/Swift [02:20:55] RECOVERY - Swift https frontend on ms-fe2022 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Swift [02:20:55] RECOVERY - Swift https frontend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.179 second response time https://wikitech.wikimedia.org/wiki/Swift [02:20:55] RECOVERY - Swift https backend on ms-fe2022 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.191 second response time https://wikitech.wikimedia.org/wiki/Swift [02:20:55] RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 571 bytes in 0.183 second response time https://wikitech.wikimedia.org/wiki/Swift [02:24:38] FIRING: CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=59&orgId=1&from=now-6M&to=now&var-search_cluster=cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateT [02:31:39] !log vriley@cumin1003 START - Cookbook sre.dns.netbox [02:34:20] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [02:34:57] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1290.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell 
SCP reboot policy FORCED [02:37:15] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1290.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [03:02:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:19:59] PROBLEM - Swift https frontend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [03:19:59] PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [03:19:59] PROBLEM - Swift https frontend on ms-fe2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [03:20:03] PROBLEM - Swift https frontend on ms-fe2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [03:20:03] PROBLEM - Swift https backend on ms-fe2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [03:20:03] PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [03:20:03] PROBLEM - Swift https backend on ms-fe2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [03:20:03] PROBLEM - Swift https frontend on ms-fe2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [03:20:03] PROBLEM - Swift https backend on ms-fe2020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [03:20:03] PROBLEM - Swift https frontend on ms-fe2020 is CRITICAL: CRITICAL - 
Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [03:20:05] PROBLEM - Swift https backend on ms-fe2018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [03:20:05] PROBLEM - Swift https frontend on ms-fe2018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [03:20:05] PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [03:20:49] RECOVERY - Swift https frontend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.177 second response time https://wikitech.wikimedia.org/wiki/Swift [03:20:49] RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.180 second response time https://wikitech.wikimedia.org/wiki/Swift [03:20:49] RECOVERY - Swift https frontend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.188 second response time https://wikitech.wikimedia.org/wiki/Swift [03:20:53] RECOVERY - Swift https frontend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.178 second response time https://wikitech.wikimedia.org/wiki/Swift [03:20:53] RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.175 second response time https://wikitech.wikimedia.org/wiki/Swift [03:20:53] RECOVERY - Swift https frontend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.188 second response time https://wikitech.wikimedia.org/wiki/Swift [03:20:53] RECOVERY - Swift https backend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.195 second response time https://wikitech.wikimedia.org/wiki/Swift [03:20:53] RECOVERY - Swift https backend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 571 bytes in 0.192 second response time https://wikitech.wikimedia.org/wiki/Swift [03:20:53] RECOVERY - Swift https frontend on ms-fe2020 is OK: HTTP OK: HTTP/1.1 200 OK - 360 
bytes in 0.189 second response time https://wikitech.wikimedia.org/wiki/Swift [03:20:55] RECOVERY - Swift https backend on ms-fe2020 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.212 second response time https://wikitech.wikimedia.org/wiki/Swift [03:20:55] RECOVERY - Swift https frontend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.180 second response time https://wikitech.wikimedia.org/wiki/Swift [03:20:55] RECOVERY - Swift https backend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.203 second response time https://wikitech.wikimedia.org/wiki/Swift [03:20:55] RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 571 bytes in 0.185 second response time https://wikitech.wikimedia.org/wiki/Swift [03:34:35] PROBLEM - MariaDB Replica Lag: m2 on db2160 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 602.58 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [03:36:35] RECOVERY - MariaDB Replica Lag: m2 on db2160 is OK: OK slave_sql_lag Replication lag: 0.44 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [04:19:59] PROBLEM - Swift https frontend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [04:19:59] PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [04:19:59] PROBLEM - Swift https frontend on ms-fe2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [04:19:59] PROBLEM - Swift https frontend on ms-fe2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [04:19:59] PROBLEM - Swift https backend on ms-fe2019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [04:20:03] PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: CRITICAL 
- Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [04:20:03] PROBLEM - Swift https frontend on ms-fe2019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [04:20:03] PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [04:20:03] PROBLEM - Swift https backend on ms-fe2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [04:20:03] PROBLEM - Swift https frontend on ms-fe2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [04:20:03] PROBLEM - Swift https frontend on ms-fe2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [04:20:05] PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [04:20:49] RECOVERY - Swift https frontend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.185 second response time https://wikitech.wikimedia.org/wiki/Swift [04:20:49] RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.177 second response time https://wikitech.wikimedia.org/wiki/Swift [04:20:49] RECOVERY - Swift https frontend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.185 second response time https://wikitech.wikimedia.org/wiki/Swift [04:20:49] RECOVERY - Swift https frontend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Swift [04:20:51] RECOVERY - Swift https backend on ms-fe2019 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.194 second response time https://wikitech.wikimedia.org/wiki/Swift [04:20:53] RECOVERY - Swift https frontend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.189 second response time 
https://wikitech.wikimedia.org/wiki/Swift [04:20:53] RECOVERY - Swift https frontend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.181 second response time https://wikitech.wikimedia.org/wiki/Swift [04:20:53] RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.182 second response time https://wikitech.wikimedia.org/wiki/Swift [04:20:53] RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.198 second response time https://wikitech.wikimedia.org/wiki/Swift [04:20:53] RECOVERY - Swift https backend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.194 second response time https://wikitech.wikimedia.org/wiki/Swift [04:20:53] RECOVERY - Swift https frontend on ms-fe2019 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.180 second response time https://wikitech.wikimedia.org/wiki/Swift [04:20:55] RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 571 bytes in 0.187 second response time https://wikitech.wikimedia.org/wiki/Swift [04:36:43] (PS1) Dzahn: admin: upgrade mcollins from ldap_only to a-privatedata lvl 1 [puppet] - https://gerrit.wikimedia.org/r/1287516 (https://phabricator.wikimedia.org/T426348) [04:39:06] (CR) Dzahn: admin: upgrade mcollins from ldap_only to a-privatedata lvl 1 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1287516 (https://phabricator.wikimedia.org/T426348) (owner: Dzahn) [04:48:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:03:41] FIRING: [43x] SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:20:03] PROBLEM - Swift https backend on ms-fe2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [05:20:03] PROBLEM - Swift https frontend on ms-fe2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [05:20:03] PROBLEM - Swift https frontend on ms-fe2020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [05:20:03] PROBLEM - Swift https frontend on ms-fe2024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [05:20:03] PROBLEM - Swift https backend on ms-fe2020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [05:20:05] PROBLEM - Swift https frontend on ms-fe2022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [05:20:05] PROBLEM - Swift https backend on ms-fe2022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [05:20:05] PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [05:20:05] PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [05:20:43] PROBLEM - Swift https backend on ms-fe2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [05:20:53] RECOVERY - Swift https backend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.203 second response time https://wikitech.wikimedia.org/wiki/Swift [05:20:53] RECOVERY - Swift https frontend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.187 second response time https://wikitech.wikimedia.org/wiki/Swift [05:20:53] RECOVERY - Swift https frontend on ms-fe2020 is OK: HTTP 
OK: HTTP/1.1 200 OK - 360 bytes in 0.187 second response time https://wikitech.wikimedia.org/wiki/Swift [05:20:53] RECOVERY - Swift https backend on ms-fe2020 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.202 second response time https://wikitech.wikimedia.org/wiki/Swift [05:20:53] RECOVERY - Swift https frontend on ms-fe2024 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.179 second response time https://wikitech.wikimedia.org/wiki/Swift [05:20:55] RECOVERY - Swift https frontend on ms-fe2022 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.179 second response time https://wikitech.wikimedia.org/wiki/Swift [05:20:55] RECOVERY - Swift https backend on ms-fe2022 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.203 second response time https://wikitech.wikimedia.org/wiki/Swift [05:20:55] RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.192 second response time https://wikitech.wikimedia.org/wiki/Swift [05:20:55] RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 571 bytes in 0.181 second response time https://wikitech.wikimedia.org/wiki/Swift [05:21:33] RECOVERY - Swift https backend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.193 second response time https://wikitech.wikimedia.org/wiki/Swift [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260515T0600) [06:20:43] PROBLEM - Swift https backend on ms-fe2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:20:43] PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:21:33] RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.183 second response time https://wikitech.wikimedia.org/wiki/Swift [06:21:33] RECOVERY - Swift https backend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 
200 OK - 572 bytes in 0.196 second response time https://wikitech.wikimedia.org/wiki/Swift [06:23:40] (03CR) 10Slyngshede: admin: upgrade mcollins from ldap_only to a-privatedata lvl 1 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1287516 (https://phabricator.wikimedia.org/T426348) (owner: 10Dzahn) [06:24:38] FIRING: CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=59&orgId=1&from=now-6M&to=now&var-search_cluster=cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateT [06:26:13] (03PS4) 10Slyngshede: P:idp webauthn, with database backend [puppet] - 10https://gerrit.wikimedia.org/r/1282286 (https://phabricator.wikimedia.org/T372892) [06:26:32] (03CR) 10Slyngshede: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1282286 (https://phabricator.wikimedia.org/T372892) (owner: 10Slyngshede) [06:32:31] FIRING: Traffic bill over quota: Alert for device cr2-eqsin.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [06:47:35] PROBLEM - Swift https backend on ms-fe2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.171 second response time https://wikitech.wikimedia.org/wiki/Swift [06:47:35] PROBLEM - Swift https backend on ms-fe2015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.171 second response time https://wikitech.wikimedia.org/wiki/Swift [06:47:35] PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.180 second response time https://wikitech.wikimedia.org/wiki/Swift [06:47:43] PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Swift [06:47:51] PROBLEM - Swift https backend on ms-fe2019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.174 second response time https://wikitech.wikimedia.org/wiki/Swift [06:47:55] PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.171 second response time https://wikitech.wikimedia.org/wiki/Swift [06:47:55] PROBLEM - Swift https frontend on ms-fe2019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.178 second response time https://wikitech.wikimedia.org/wiki/Swift [06:47:55] PROBLEM - Swift https frontend on ms-fe2015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.172 second response time https://wikitech.wikimedia.org/wiki/Swift [06:47:55] PROBLEM - Swift https frontend on ms-fe2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.168 second response time https://wikitech.wikimedia.org/wiki/Swift [06:47:55] PROBLEM - Swift https backend on ms-fe2020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.181 second response time https://wikitech.wikimedia.org/wiki/Swift [06:47:55] PROBLEM - Swift https frontend on ms-fe2020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.180 second response time https://wikitech.wikimedia.org/wiki/Swift [06:47:55] PROBLEM - Swift https frontend on ms-fe2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.176 second response time https://wikitech.wikimedia.org/wiki/Swift [06:47:56] PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.173 second response time https://wikitech.wikimedia.org/wiki/Swift [06:47:56] PROBLEM - Swift https backend on ms-fe2018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.178 second response time 
https://wikitech.wikimedia.org/wiki/Swift [06:47:57] PROBLEM - Swift https frontend on ms-fe2022 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.177 second response time https://wikitech.wikimedia.org/wiki/Swift [06:47:57] PROBLEM - Swift https frontend on ms-fe2018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.180 second response time https://wikitech.wikimedia.org/wiki/Swift [06:47:57] FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:47:58] PROBLEM - Swift https backend on ms-fe2022 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.189 second response time https://wikitech.wikimedia.org/wiki/Swift [06:47:59] PROBLEM - Swift https frontend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:47:59] PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:47:59] PROBLEM - Swift https backend on ms-fe2021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:48:00] PROBLEM - Swift https frontend on ms-fe2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:48:00] PROBLEM - Swift https frontend on ms-fe2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:48:01] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2019.codfw.wmnet, ms-fe2021.codfw.wmnet, ms-fe2012.codfw.wmnet, ms-fe2018.codfw.wmnet, ms-fe2020.codfw.wmnet, ms-fe2023.codfw.wmnet, 
ms-fe2022.codfw.wmnet, ms-fe2016.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:48:01] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2013.codfw.wmnet, ms-fe2019.codfw.wmnet, ms-fe2009.codfw.wmnet, ms-fe2018.codfw.wmnet, ms-fe2020.codfw.wmnet, ms-fe2022.codfw.wmnet, ms-fe2010.codfw.wmnet, ms-fe2015.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:48:03] PROBLEM - Swift https backend on ms-fe2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:48:03] PROBLEM - Swift https backend on ms-fe2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:48:03] PROBLEM - Swift https frontend on ms-fe2021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:48:03] PROBLEM - Swift https backend on ms-fe2024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:48:04] PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:48:04] PROBLEM - Swift https frontend on ms-fe2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:48:05] PROBLEM - Swift https frontend on ms-fe2023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:48:05] PROBLEM - Swift https frontend on ms-fe2024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:48:06] PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:48:09] PROBLEM - Swift https backend on ms-fe2023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Swift [06:48:47] !ack [06:48:47] 7939 (ACKED) ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw) [06:48:51] omw [06:49:12] FIRING: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [06:49:13] FIRING: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [06:49:23] !ack [06:49:23] 7940 (ACKED) VarnishUnavailable global sre (varnish-upload thanos-rule@main) [06:49:23] 7941 (ACKED) HaproxyUnavailable cache_upload global sre (thanos-rule@main) [06:50:53] RECOVERY - Swift https frontend on ms-fe2020 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Swift [06:51:33] RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.191 second response time https://wikitech.wikimedia.org/wiki/Swift [06:51:37] RECOVERY - Swift https backend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 574 bytes in 3.104 second response time https://wikitech.wikimedia.org/wiki/Swift [06:51:43] RECOVERY - Swift https backend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 574 bytes in 8.663 second response time https://wikitech.wikimedia.org/wiki/Swift [06:51:51] FIRING: CoreRouterInterfaceDropPercent: Core router normal + high priority queue drops are high on cr3-eqsin:xe-0/1/3 (Peering: Equinix (Wikimedia-SG1-IX-00 Singapore, ... 
[06:51:51] MAC filter) {#1016}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#CoreRouterInterfaceDropPercent - https://grafana.wikimedia.org/d/5p97dAASz/queue-and-error-stats-by-network-device?var-site=eqsin+prometheus%2Fops&var-device=cr3-eqsin&var-interface=xe-0%2F1%2F3 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDropPercent [06:51:53] RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.174 second response time https://wikitech.wikimedia.org/wiki/Swift [06:51:55] RECOVERY - Swift https frontend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 362 bytes in 0.613 second response time https://wikitech.wikimedia.org/wiki/Swift [06:51:57] RECOVERY - Swift https backend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 574 bytes in 2.834 second response time https://wikitech.wikimedia.org/wiki/Swift [06:51:57] RECOVERY - Swift https frontend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 363 bytes in 2.875 second response time https://wikitech.wikimedia.org/wiki/Swift [06:51:57] RECOVERY - Swift https backend on ms-fe2020 is OK: HTTP OK: HTTP/1.1 200 OK - 574 bytes in 3.746 second response time https://wikitech.wikimedia.org/wiki/Swift [06:52:01] RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 574 bytes in 5.694 second response time https://wikitech.wikimedia.org/wiki/Swift [06:52:31] RESOLVED: Traffic bill over quota: Alert for device cr2-eqsin.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [06:52:57] RESOLVED: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:53:27] FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) 
#page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:53:37] !ack [06:53:38] 7942 (ACKED) ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw) [06:54:35] PROBLEM - Swift https backend on ms-fe2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.179 second response time https://wikitech.wikimedia.org/wiki/Swift [06:54:35] PROBLEM - Swift https backend on ms-fe2015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.179 second response time https://wikitech.wikimedia.org/wiki/Swift [06:54:35] PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.173 second response time https://wikitech.wikimedia.org/wiki/Swift [06:54:55] PROBLEM - Swift https backend on ms-fe2016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.182 second response time https://wikitech.wikimedia.org/wiki/Swift [06:54:55] PROBLEM - Swift https frontend on ms-fe2020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.183 second response time https://wikitech.wikimedia.org/wiki/Swift [06:54:55] PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.178 second response time https://wikitech.wikimedia.org/wiki/Swift [06:54:55] PROBLEM - Swift https backend on ms-fe2020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.180 second response time https://wikitech.wikimedia.org/wiki/Swift [06:54:55] PROBLEM - Swift https frontend on ms-fe2016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.182 second response time https://wikitech.wikimedia.org/wiki/Swift [06:54:55] PROBLEM - Swift https frontend on 
ms-fe2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.177 second response time https://wikitech.wikimedia.org/wiki/Swift [06:56:55] PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.173 second response time https://wikitech.wikimedia.org/wiki/Swift [06:57:04] (03CR) 10Brouberol: "The way I'm reading it, we're not retrofitting the PVC for ceph dumps data into the list of PVCs defined in values, right?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287426 (https://phabricator.wikimedia.org/T422179) (owner: 10Btullis) [06:58:51] FIRING: [3x] TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) #page - https://w.wiki/Gbyf - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [06:58:53] RECOVERY - Swift https backend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.199 second response time https://wikitech.wikimedia.org/wiki/Swift [06:59:04] !ack [06:59:05] 7943 (ACKED) [3x] TransitPeeringTransportOutSaturation network sre (gnmi) [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260515T0700) [07:00:05] RECOVERY - Swift https backend on ms-fe2023 is OK: HTTP OK: HTTP/1.1 200 OK - 574 bytes in 6.118 second response time https://wikitech.wikimedia.org/wiki/Swift [07:01:51] FIRING: [4x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [07:01:51] RESOLVED: CoreRouterInterfaceDropPercent: Core router normal + high priority queue drops are high on cr3-eqsin:xe-0/1/3 (Peering: Equinix (Wikimedia-SG1-IX-00 Singapore, ... [07:01:51] MAC filter) {#1016}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#CoreRouterInterfaceDropPercent - https://grafana.wikimedia.org/d/5p97dAASz/queue-and-error-stats-by-network-device?var-site=eqsin+prometheus%2Fops&var-device=cr3-eqsin&var-interface=xe-0%2F1%2F3 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDropPercent [07:01:55] PROBLEM - Swift https backend on ms-fe2016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.191 second response time https://wikitech.wikimedia.org/wiki/Swift [07:02:04] !ack [07:02:05] 7944 (ACKED) [4x] ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet) [07:02:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:02:59] PROBLEM - Swift https backend on ms-fe2023 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.182 second response time https://wikitech.wikimedia.org/wiki/Swift [07:03:33] RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.195 second response time 
https://wikitech.wikimedia.org/wiki/Swift [07:03:51] RECOVERY - Swift https frontend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.185 second response time https://wikitech.wikimedia.org/wiki/Swift [07:03:53] RECOVERY - Swift https frontend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.180 second response time https://wikitech.wikimedia.org/wiki/Swift [07:03:53] RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.173 second response time https://wikitech.wikimedia.org/wiki/Swift [07:03:53] RECOVERY - Swift https backend on ms-fe2020 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.195 second response time https://wikitech.wikimedia.org/wiki/Swift [07:03:53] RECOVERY - Swift https frontend on ms-fe2021 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Swift [07:03:53] RECOVERY - Swift https frontend on ms-fe2023 is OK: HTTP OK: HTTP/1.1 200 OK - 361 bytes in 0.194 second response time https://wikitech.wikimedia.org/wiki/Swift [07:03:53] RECOVERY - Swift https frontend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.178 second response time https://wikitech.wikimedia.org/wiki/Swift [07:03:55] RECOVERY - Swift https frontend on ms-fe2022 is OK: HTTP OK: HTTP/1.1 200 OK - 361 bytes in 0.189 second response time https://wikitech.wikimedia.org/wiki/Swift [07:03:55] RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.194 second response time https://wikitech.wikimedia.org/wiki/Swift [07:03:57] RECOVERY - Swift https backend on ms-fe2024 is OK: HTTP OK: HTTP/1.1 200 OK - 574 bytes in 2.316 second response time https://wikitech.wikimedia.org/wiki/Swift [07:03:57] RECOVERY - Swift https frontend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 363 bytes in 2.650 second response time https://wikitech.wikimedia.org/wiki/Swift [07:04:01] RECOVERY - PyBal backends health check on lvs2013 is OK: 
PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:04:33] RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.196 second response time https://wikitech.wikimedia.org/wiki/Swift [07:04:33] RECOVERY - Swift https backend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.192 second response time https://wikitech.wikimedia.org/wiki/Swift [07:04:49] RECOVERY - Swift https frontend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.180 second response time https://wikitech.wikimedia.org/wiki/Swift [07:04:49] RECOVERY - Swift https backend on ms-fe2019 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.194 second response time https://wikitech.wikimedia.org/wiki/Swift [07:04:49] RECOVERY - Swift https backend on ms-fe2021 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.194 second response time https://wikitech.wikimedia.org/wiki/Swift [07:04:51] RECOVERY - Swift https frontend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.187 second response time https://wikitech.wikimedia.org/wiki/Swift [07:04:51] RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 363 bytes in 1.599 second response time https://wikitech.wikimedia.org/wiki/Swift [07:04:53] RECOVERY - Swift https frontend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.185 second response time https://wikitech.wikimedia.org/wiki/Swift [07:04:53] RECOVERY - Swift https frontend on ms-fe2020 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Swift [07:04:53] RECOVERY - Swift https backend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.190 second response time https://wikitech.wikimedia.org/wiki/Swift [07:04:53] RECOVERY - Swift https frontend on ms-fe2019 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.181 second response time https://wikitech.wikimedia.org/wiki/Swift [07:04:53] RECOVERY - Swift https 
frontend on ms-fe2024 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Swift [07:04:53] RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.198 second response time https://wikitech.wikimedia.org/wiki/Swift [07:04:54] RECOVERY - Swift https backend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.200 second response time https://wikitech.wikimedia.org/wiki/Swift [07:04:55] RECOVERY - Swift https backend on ms-fe2022 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.201 second response time https://wikitech.wikimedia.org/wiki/Swift [07:04:59] RECOVERY - Swift https backend on ms-fe2023 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.215 second response time https://wikitech.wikimedia.org/wiki/Swift [07:05:01] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:05:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:06:55] PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.174 second response time https://wikitech.wikimedia.org/wiki/Swift [07:07:35] PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.172 second response time https://wikitech.wikimedia.org/wiki/Swift [07:07:35] PROBLEM - Swift https backend on ms-fe2015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.174 second response time https://wikitech.wikimedia.org/wiki/Swift [07:07:35] PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 
bytes in 1.173 second response time https://wikitech.wikimedia.org/wiki/Swift [07:07:49] PROBLEM - Swift https frontend on ms-fe2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.175 second response time https://wikitech.wikimedia.org/wiki/Swift [07:07:51] PROBLEM - Swift https backend on ms-fe2019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.171 second response time https://wikitech.wikimedia.org/wiki/Swift [07:07:51] PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.168 second response time https://wikitech.wikimedia.org/wiki/Swift [07:07:51] PROBLEM - Swift https frontend on ms-fe2017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.177 second response time https://wikitech.wikimedia.org/wiki/Swift [07:07:55] PROBLEM - Swift https frontend on ms-fe2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.175 second response time https://wikitech.wikimedia.org/wiki/Swift [07:07:55] PROBLEM - Swift https frontend on ms-fe2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.170 second response time https://wikitech.wikimedia.org/wiki/Swift [07:07:55] PROBLEM - Swift https backend on ms-fe2020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.184 second response time https://wikitech.wikimedia.org/wiki/Swift [07:07:55] PROBLEM - Swift https frontend on ms-fe2023 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.174 second response time https://wikitech.wikimedia.org/wiki/Swift [07:07:55] PROBLEM - Swift https backend on ms-fe2017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.187 second response time https://wikitech.wikimedia.org/wiki/Swift [07:07:55] PROBLEM - Swift https frontend on ms-fe2016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 
bytes in 1.178 second response time https://wikitech.wikimedia.org/wiki/Swift [07:07:55] PROBLEM - Swift https frontend on ms-fe2020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.180 second response time https://wikitech.wikimedia.org/wiki/Swift [07:07:56] PROBLEM - Swift https backend on ms-fe2016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.176 second response time https://wikitech.wikimedia.org/wiki/Swift [07:07:56] PROBLEM - Swift https frontend on ms-fe2015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.175 second response time https://wikitech.wikimedia.org/wiki/Swift [07:07:57] PROBLEM - Swift https frontend on ms-fe2019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.175 second response time https://wikitech.wikimedia.org/wiki/Swift [07:07:57] PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.171 second response time https://wikitech.wikimedia.org/wiki/Swift [07:07:58] PROBLEM - Swift https backend on ms-fe2022 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.182 second response time https://wikitech.wikimedia.org/wiki/Swift [07:07:59] PROBLEM - Swift https backend on ms-fe2021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [07:07:59] PROBLEM - Swift https backend on ms-fe2023 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.186 second response time https://wikitech.wikimedia.org/wiki/Swift [07:08:01] PROBLEM - Swift https frontend on ms-fe2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [07:08:01] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2019.codfw.wmnet, ms-fe2009.codfw.wmnet, ms-fe2021.codfw.wmnet, ms-fe2011.codfw.wmnet, 
ms-fe2012.codfw.wmnet, ms-fe2014.codfw.wmnet, ms-fe2010.codfw.wmnet, ms-fe2015.codfw.wmnet, ms-fe2016.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [07:08:01] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2013.codfw.wmnet, ms-fe2019.codfw.wmnet, ms-fe2009.codfw.wmnet, ms-fe2011.codfw.wmnet, ms-fe2018.codfw.wmnet, ms-fe2020.codfw.wmnet, ms-fe2023.codfw.wmnet, ms-fe2014.codfw.wmnet, ms-fe2022.codfw.wmnet, ms-fe2010.codfw.wmnet, ms-fe2015.codfw.wmnet, ms-fe2016.codfw.wmnet, ms-fe2017.codfw.wmnet are marked down but pooled ht [07:08:01] kitech.wikimedia.org/wiki/PyBal [07:08:51] FIRING: [4x] TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/0/1:2 (Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3}) #page - https://w.wiki/Gbyf - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [07:09:55] PROBLEM - Swift https frontend on ms-fe2022 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.186 second response time https://wikitech.wikimedia.org/wiki/Swift [07:10:55] PROBLEM - Swift https frontend on ms-fe2021 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.185 second response time https://wikitech.wikimedia.org/wiki/Swift [07:10:55] PROBLEM - Swift https backend on ms-fe2024 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.190 second response time https://wikitech.wikimedia.org/wiki/Swift [07:10:55] PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.178 second response time https://wikitech.wikimedia.org/wiki/Swift [07:10:58] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2010.codfw.wmnet with OS trixie [07:11:55] PROBLEM - Swift https frontend on ms-fe2024 is CRITICAL: 
HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.179 second response time https://wikitech.wikimedia.org/wiki/Swift [07:12:49] PROBLEM - Swift https frontend on ms-fe1009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.053 second response time https://wikitech.wikimedia.org/wiki/Swift [07:12:53] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe1013.eqiad.wmnet, ms-fe1017.eqiad.wmnet, ms-fe1012.eqiad.wmnet, ms-fe1021.eqiad.wmnet, ms-fe1010.eqiad.wmnet, ms-fe1024.eqiad.wmnet, ms-fe1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [07:12:57] PROBLEM - Swift https backend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.050 second response time https://wikitech.wikimedia.org/wiki/Swift [07:12:57] PROBLEM - Swift https backend on ms-fe1009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.062 second response time https://wikitech.wikimedia.org/wiki/Swift [07:12:57] PROBLEM - Swift https backend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.058 second response time https://wikitech.wikimedia.org/wiki/Swift [07:12:57] PROBLEM - Swift https frontend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.052 second response time https://wikitech.wikimedia.org/wiki/Swift [07:12:57] PROBLEM - Swift https frontend on ms-fe1016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.074 second response time https://wikitech.wikimedia.org/wiki/Swift [07:12:59] PROBLEM - Swift https backend on ms-fe1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [07:12:59] PROBLEM - Swift https frontend on ms-fe1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [07:12:59] 
PROBLEM - Swift https frontend on ms-fe1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [07:12:59] PROBLEM - Swift https backend on ms-fe1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [07:13:07] PROBLEM - Swift https frontend on ms-fe1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [07:13:19] PROBLEM - Swift https backend on ms-fe1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [07:13:27] FIRING: [2x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:13:35] PROBLEM - Swift https backend on ms-fe1020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.055 second response time https://wikitech.wikimedia.org/wiki/Swift [07:13:49] RECOVERY - Swift https frontend on ms-fe1013 is OK: HTTP OK: HTTP/1.1 200 OK - 363 bytes in 1.257 second response time https://wikitech.wikimedia.org/wiki/Swift [07:13:49] RECOVERY - Swift https frontend on ms-fe1009 is OK: HTTP OK: HTTP/1.1 200 OK - 363 bytes in 1.280 second response time https://wikitech.wikimedia.org/wiki/Swift [07:13:53] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe1013.eqiad.wmnet, ms-fe1018.eqiad.wmnet, ms-fe1023.eqiad.wmnet, ms-fe1017.eqiad.wmnet, ms-fe1021.eqiad.wmnet, ms-fe1015.eqiad.wmnet, ms-fe1010.eqiad.wmnet, ms-fe1024.eqiad.wmnet, ms-fe1020.eqiad.wmnet, ms-fe1009.eqiad.wmnet, ms-fe1012.eqiad.wmnet, ms-fe1019.eqiad.wmnet, ms-fe1016.eqiad.wmnet are marked down but pooled ht [07:13:53] kitech.wikimedia.org/wiki/PyBal [07:13:55] RECOVERY - Swift 
https backend on ms-fe1022 is OK: HTTP OK: HTTP/1.1 200 OK - 569 bytes in 5.332 second response time https://wikitech.wikimedia.org/wiki/Swift [07:13:57] RECOVERY - Swift https frontend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Swift [07:14:05] RECOVERY - Swift https frontend on ms-fe1016 is OK: HTTP OK: HTTP/1.1 200 OK - 363 bytes in 7.705 second response time https://wikitech.wikimedia.org/wiki/Swift [07:14:07] PROBLEM - Swift https frontend on ms-fe1020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [07:14:09] RECOVERY - Swift https backend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 567 bytes in 0.079 second response time https://wikitech.wikimedia.org/wiki/Swift [07:14:57] PROBLEM - Swift https backend on ms-fe1016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.061 second response time https://wikitech.wikimedia.org/wiki/Swift [07:14:59] PROBLEM - Swift https frontend on ms-fe1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [07:14:59] PROBLEM - Swift https backend on ms-fe1015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.071 second response time https://wikitech.wikimedia.org/wiki/Swift [07:15:05] RECOVERY - Swift https backend on ms-fe1009 is OK: HTTP OK: HTTP/1.1 200 OK - 569 bytes in 7.937 second response time https://wikitech.wikimedia.org/wiki/Swift [07:15:05] RECOVERY - Swift https frontend on ms-fe1020 is OK: HTTP OK: HTTP/1.1 200 OK - 363 bytes in 8.878 second response time https://wikitech.wikimedia.org/wiki/Swift [07:15:07] PROBLEM - Swift https frontend on ms-fe1015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [07:15:07] PROBLEM - Swift https frontend on ms-fe1017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Swift [07:15:35] RECOVERY - Swift https backend on ms-fe1020 is OK: HTTP OK: HTTP/1.1 200 OK - 569 bytes in 1.658 second response time https://wikitech.wikimedia.org/wiki/Swift [07:15:49] PROBLEM - Swift https frontend on ms-fe1021 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.049 second response time https://wikitech.wikimedia.org/wiki/Swift [07:15:59] RECOVERY - Swift https backend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 569 bytes in 2.098 second response time https://wikitech.wikimedia.org/wiki/Swift [07:15:59] RECOVERY - Swift https frontend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 363 bytes in 2.201 second response time https://wikitech.wikimedia.org/wiki/Swift [07:15:59] RECOVERY - Swift https backend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 569 bytes in 2.909 second response time https://wikitech.wikimedia.org/wiki/Swift [07:16:43] PROBLEM - Swift https backend on ms-fe1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [07:16:49] PROBLEM - Swift https frontend on ms-fe1009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.057 second response time https://wikitech.wikimedia.org/wiki/Swift [07:16:51] RECOVERY - Swift https frontend on ms-fe1022 is OK: HTTP OK: HTTP/1.1 200 OK - 363 bytes in 1.565 second response time https://wikitech.wikimedia.org/wiki/Swift [07:16:51] PROBLEM - Swift https backend on ms-fe1022 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.055 second response time https://wikitech.wikimedia.org/wiki/Swift [07:16:55] RECOVERY - Swift https frontend on ms-fe1021 is OK: HTTP OK: HTTP/1.1 200 OK - 363 bytes in 6.341 second response time https://wikitech.wikimedia.org/wiki/Swift [07:16:57] PROBLEM - Swift https frontend on ms-fe1016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.059 second response time 
https://wikitech.wikimedia.org/wiki/Swift [07:16:59] RECOVERY - Swift https backend on ms-fe1016 is OK: HTTP OK: HTTP/1.1 200 OK - 569 bytes in 1.959 second response time https://wikitech.wikimedia.org/wiki/Swift [07:16:59] PROBLEM - Swift https frontend on ms-fe1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [07:16:59] PROBLEM - Swift https backend on ms-fe1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [07:17:35] PROBLEM - Swift https backend on ms-fe1019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.060 second response time https://wikitech.wikimedia.org/wiki/Swift [07:17:51] RECOVERY - Swift https frontend on ms-fe1009 is OK: HTTP OK: HTTP/1.1 200 OK - 363 bytes in 2.329 second response time https://wikitech.wikimedia.org/wiki/Swift [07:17:57] (03CR) 10Elukey: [C:03+1] hadoop.reboot-workers: drop custom --dry-run [cookbooks] - 10https://gerrit.wikimedia.org/r/1287290 (https://phabricator.wikimedia.org/T411568) (owner: 10Ryan Kemper) [07:17:57] PROBLEM - Swift https backend on ms-fe1009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.053 second response time https://wikitech.wikimedia.org/wiki/Swift [07:18:27] FIRING: [2x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:18:37] RECOVERY - Swift https backend on ms-fe1019 is OK: HTTP OK: HTTP/1.1 200 OK - 569 bytes in 2.877 second response time https://wikitech.wikimedia.org/wiki/Swift [07:18:41] elukey@cumin1003 reimage (PID 4095083) is awaiting input [07:18:49] RECOVERY - Swift https backend on ms-fe1013 is OK: HTTP OK: HTTP/1.1 200 OK - 567 bytes in 0.070 second response time 
https://wikitech.wikimedia.org/wiki/Swift [07:18:49] PROBLEM - Swift https frontend on ms-fe1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.057 second response time https://wikitech.wikimedia.org/wiki/Swift [07:18:55] RECOVERY - Swift https backend on ms-fe1024 is OK: HTTP OK: HTTP/1.1 200 OK - 569 bytes in 6.807 second response time https://wikitech.wikimedia.org/wiki/Swift [07:18:57] RECOVERY - Swift https frontend on ms-fe1015 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.053 second response time https://wikitech.wikimedia.org/wiki/Swift [07:18:57] RECOVERY - Swift https backend on ms-fe1009 is OK: HTTP OK: HTTP/1.1 200 OK - 568 bytes in 0.435 second response time https://wikitech.wikimedia.org/wiki/Swift [07:18:57] PROBLEM - Swift https frontend on ms-fe1018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.055 second response time https://wikitech.wikimedia.org/wiki/Swift [07:18:57] FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:18:59] RECOVERY - Swift https backend on ms-fe1015 is OK: HTTP OK: HTTP/1.1 200 OK - 567 bytes in 0.100 second response time https://wikitech.wikimedia.org/wiki/Swift [07:19:03] RECOVERY - Swift https frontend on ms-fe1017 is OK: HTTP OK: HTTP/1.1 200 OK - 363 bytes in 6.974 second response time https://wikitech.wikimedia.org/wiki/Swift [07:19:43] RECOVERY - Swift https backend on ms-fe1023 is OK: HTTP OK: HTTP/1.1 200 OK - 569 bytes in 9.139 second response time https://wikitech.wikimedia.org/wiki/Swift [07:19:53] RECOVERY - Swift https frontend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 363 bytes in 4.836 second response time https://wikitech.wikimedia.org/wiki/Swift [07:19:53] RECOVERY - PyBal backends 
health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:19:57] RECOVERY - Swift https frontend on ms-fe1016 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.061 second response time https://wikitech.wikimedia.org/wiki/Swift [07:19:59] PROBLEM - Swift https frontend on ms-fe1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [07:20:07] PROBLEM - Swift https frontend on ms-fe1019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [07:20:49] RECOVERY - Swift https frontend on ms-fe1023 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.061 second response time https://wikitech.wikimedia.org/wiki/Swift [07:20:49] RECOVERY - Swift https frontend on ms-fe1021 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.057 second response time https://wikitech.wikimedia.org/wiki/Swift [07:20:49] RECOVERY - Swift https frontend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.055 second response time https://wikitech.wikimedia.org/wiki/Swift [07:20:49] RECOVERY - Swift https frontend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.180 second response time https://wikitech.wikimedia.org/wiki/Swift [07:20:51] RECOVERY - Swift https frontend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.191 second response time https://wikitech.wikimedia.org/wiki/Swift [07:20:53] RECOVERY - Swift https frontend on ms-fe2023 is OK: HTTP OK: HTTP/1.1 200 OK - 361 bytes in 0.208 second response time https://wikitech.wikimedia.org/wiki/Swift [07:20:53] RECOVERY - Swift https backend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.207 second response time https://wikitech.wikimedia.org/wiki/Swift [07:20:55] RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.201 second response time https://wikitech.wikimedia.org/wiki/Swift [07:20:55] RECOVERY - Swift https backend 
on ms-fe2022 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.200 second response time https://wikitech.wikimedia.org/wiki/Swift [07:20:57] RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 571 bytes in 0.190 second response time https://wikitech.wikimedia.org/wiki/Swift [07:21:09] RECOVERY - Swift https backend on ms-fe2023 is OK: HTTP OK: HTTP/1.1 200 OK - 574 bytes in 9.322 second response time https://wikitech.wikimedia.org/wiki/Swift [07:21:09] PROBLEM - Swift https backend on ms-fe1018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [07:21:33] RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 571 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Swift [07:21:33] RECOVERY - Swift https backend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.197 second response time https://wikitech.wikimedia.org/wiki/Swift [07:21:35] RECOVERY - Swift https backend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 573 bytes in 0.831 second response time https://wikitech.wikimedia.org/wiki/Swift [07:21:43] PROBLEM - Swift https backend on ms-fe1019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [07:21:49] RECOVERY - Swift https backend on ms-fe2021 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.222 second response time https://wikitech.wikimedia.org/wiki/Swift [07:21:51] RECOVERY - Swift https frontend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Swift [07:21:53] RECOVERY - Swift https frontend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.176 second response time https://wikitech.wikimedia.org/wiki/Swift [07:21:53] RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.182 second response time https://wikitech.wikimedia.org/wiki/Swift [07:21:53] RECOVERY - 
Swift https frontend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.182 second response time https://wikitech.wikimedia.org/wiki/Swift [07:21:53] RECOVERY - Swift https backend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 571 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Swift [07:21:53] RECOVERY - Swift https backend on ms-fe2020 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.199 second response time https://wikitech.wikimedia.org/wiki/Swift [07:21:53] RECOVERY - Swift https frontend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.173 second response time https://wikitech.wikimedia.org/wiki/Swift [07:21:53] RECOVERY - Swift https frontend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.183 second response time https://wikitech.wikimedia.org/wiki/Swift [07:21:54] RECOVERY - Swift https frontend on ms-fe2020 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.182 second response time https://wikitech.wikimedia.org/wiki/Swift [07:21:54] RECOVERY - Swift https frontend on ms-fe2021 is OK: HTTP OK: HTTP/1.1 200 OK - 361 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Swift [07:21:55] RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.195 second response time https://wikitech.wikimedia.org/wiki/Swift [07:21:55] RECOVERY - Swift https backend on ms-fe2024 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.199 second response time https://wikitech.wikimedia.org/wiki/Swift [07:21:56] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:21:56] RECOVERY - Swift https frontend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.179 second response time https://wikitech.wikimedia.org/wiki/Swift [07:21:57] RECOVERY - Swift https backend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.196 second response time 
https://wikitech.wikimedia.org/wiki/Swift [07:21:57] RECOVERY - Swift https backend on ms-fe2019 is OK: HTTP OK: HTTP/1.1 200 OK - 574 bytes in 7.536 second response time https://wikitech.wikimedia.org/wiki/Swift [07:21:58] RECOVERY - Swift https frontend on ms-fe2024 is OK: HTTP OK: HTTP/1.1 200 OK - 363 bytes in 3.879 second response time https://wikitech.wikimedia.org/wiki/Swift [07:21:59] RECOVERY - Swift https frontend on ms-fe1018 is OK: HTTP OK: HTTP/1.1 200 OK - 363 bytes in 2.622 second response time https://wikitech.wikimedia.org/wiki/Swift [07:22:01] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:22:01] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:22:03] RECOVERY - Swift https frontend on ms-fe2022 is OK: HTTP OK: HTTP/1.1 200 OK - 363 bytes in 8.319 second response time https://wikitech.wikimedia.org/wiki/Swift [07:22:05] RECOVERY - Swift https frontend on ms-fe1019 is OK: HTTP OK: HTTP/1.1 200 OK - 363 bytes in 7.107 second response time https://wikitech.wikimedia.org/wiki/Swift [07:22:05] RECOVERY - Swift https backend on ms-fe1018 is OK: HTTP OK: HTTP/1.1 200 OK - 569 bytes in 4.501 second response time https://wikitech.wikimedia.org/wiki/Swift [07:22:33] RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Swift [07:22:35] PROBLEM - Swift https backend on ms-fe1023 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.056 second response time https://wikitech.wikimedia.org/wiki/Swift [07:22:35] RECOVERY - Swift https backend on ms-fe1019 is OK: HTTP OK: HTTP/1.1 200 OK - 569 bytes in 1.811 second response time https://wikitech.wikimedia.org/wiki/Swift [07:22:49] RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes 
in 0.187 second response time https://wikitech.wikimedia.org/wiki/Swift [07:22:53] RECOVERY - Swift https frontend on ms-fe2019 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.181 second response time https://wikitech.wikimedia.org/wiki/Swift [07:23:27] RESOLVED: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:23:57] RECOVERY - Swift https backend on ms-fe1022 is OK: HTTP OK: HTTP/1.1 200 OK - 569 bytes in 6.465 second response time https://wikitech.wikimedia.org/wiki/Swift [07:23:57] RESOLVED: [2x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:24:12] RESOLVED: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [07:24:13] RESOLVED: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [07:24:37] RECOVERY - Swift https backend on ms-fe1023 is OK: HTTP OK: HTTP/1.1 200 OK - 569 bytes in 3.097 second response time https://wikitech.wikimedia.org/wiki/Swift [07:25:07] PROBLEM - Swift https frontend on ms-fe1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Swift [07:25:07] PROBLEM - Swift https backend on ms-fe1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [07:25:19] PROBLEM - Swift https backend on ms-fe1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [07:25:42] FIRING: [2x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:25:57] RECOVERY - Swift https frontend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.055 second response time https://wikitech.wikimedia.org/wiki/Swift [07:26:09] RECOVERY - Swift https backend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 568 bytes in 0.702 second response time https://wikitech.wikimedia.org/wiki/Swift [07:26:09] PROBLEM - Swift https backend on ms-fe1015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [07:26:43] PROBLEM - Swift https backend on ms-fe1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [07:26:51] FIRING: [7x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [07:26:59] PROBLEM - Swift https frontend on ms-fe1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [07:26:59] RECOVERY - Swift https backend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 569 bytes in 2.819 second response time https://wikitech.wikimedia.org/wiki/Swift [07:27:01] RECOVERY - Swift https backend on ms-fe1015 is OK: HTTP OK: HTTP/1.1 200 OK - 569 bytes in 1.961 
second response time https://wikitech.wikimedia.org/wiki/Swift [07:27:07] PROBLEM - Swift https backend on ms-fe1016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [07:27:07] PROBLEM - Swift https frontend on ms-fe1017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [07:27:33] RECOVERY - Swift https backend on ms-fe1021 is OK: HTTP OK: HTTP/1.1 200 OK - 567 bytes in 0.116 second response time https://wikitech.wikimedia.org/wiki/Swift [07:27:49] RECOVERY - Swift https frontend on ms-fe1022 is OK: HTTP OK: HTTP/1.1 200 OK - 362 bytes in 0.209 second response time https://wikitech.wikimedia.org/wiki/Swift [07:27:49] PROBLEM - Swift https backend on ms-fe1013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.049 second response time https://wikitech.wikimedia.org/wiki/Swift [07:27:57] RECOVERY - Swift https backend on ms-fe1016 is OK: HTTP OK: HTTP/1.1 200 OK - 568 bytes in 0.179 second response time https://wikitech.wikimedia.org/wiki/Swift [07:27:59] PROBLEM - Swift https frontend on ms-fe1016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.054 second response time https://wikitech.wikimedia.org/wiki/Swift [07:28:49] PROBLEM - Swift https frontend on ms-fe1013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.053 second response time https://wikitech.wikimedia.org/wiki/Swift [07:28:57] RECOVERY - Swift https frontend on ms-fe1017 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.068 second response time https://wikitech.wikimedia.org/wiki/Swift [07:28:59] PROBLEM - Swift https frontend on ms-fe1009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [07:29:49] RECOVERY - Swift https frontend on ms-fe1013 is OK: HTTP OK: HTTP/1.1 200 OK - 361 bytes in 0.069 second response time https://wikitech.wikimedia.org/wiki/Swift [07:29:49] PROBLEM - 
Swift https frontend on ms-fe1023 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.061 second response time https://wikitech.wikimedia.org/wiki/Swift [07:29:49] PROBLEM - Swift https frontend on ms-fe1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.058 second response time https://wikitech.wikimedia.org/wiki/Swift [07:29:53] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe1009.eqiad.wmnet, ms-fe1018.eqiad.wmnet, ms-fe1023.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [07:29:53] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe1018.eqiad.wmnet, ms-fe1017.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [07:29:59] RECOVERY - Swift https frontend on ms-fe1016 is OK: HTTP OK: HTTP/1.1 200 OK - 362 bytes in 0.773 second response time https://wikitech.wikimedia.org/wiki/Swift [07:30:19] PROBLEM - Swift https backend on ms-fe1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [07:30:49] RECOVERY - Swift https frontend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.057 second response time https://wikitech.wikimedia.org/wiki/Swift [07:31:49] RECOVERY - Swift https backend on ms-fe1013 is OK: HTTP OK: HTTP/1.1 200 OK - 567 bytes in 0.128 second response time https://wikitech.wikimedia.org/wiki/Swift [07:31:49] RECOVERY - Swift https frontend on ms-fe1023 is OK: HTTP OK: HTTP/1.1 200 OK - 362 bytes in 0.679 second response time https://wikitech.wikimedia.org/wiki/Swift [07:31:51] RECOVERY - Swift https frontend on ms-fe1009 is OK: HTTP OK: HTTP/1.1 200 OK - 363 bytes in 2.440 second response time https://wikitech.wikimedia.org/wiki/Swift [07:32:26] (03PS1) 10Kosta Harlan: api-gateway: Add Vary: Origin to api.wikimedia.org responses 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1287731 (https://phabricator.wikimedia.org/T426323) [07:32:57] PROBLEM - Swift https frontend on ms-fe1020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.071 second response time https://wikitech.wikimedia.org/wiki/Swift [07:33:43] PROBLEM - Swift https backend on ms-fe1019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [07:33:51] FIRING: [2x] TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/0/1:2 (Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3}) #page - https://w.wiki/Gbyf - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [07:33:53] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:33:53] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:34:03] RECOVERY - Swift https frontend on ms-fe1020 is OK: HTTP OK: HTTP/1.1 200 OK - 363 bytes in 6.612 second response time https://wikitech.wikimedia.org/wiki/Swift [07:34:09] RECOVERY - Swift https backend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 569 bytes in 1.412 second response time https://wikitech.wikimedia.org/wiki/Swift [07:34:33] RECOVERY - Swift https backend on ms-fe1019 is OK: HTTP OK: HTTP/1.1 200 OK - 567 bytes in 0.069 second response time https://wikitech.wikimedia.org/wiki/Swift [07:34:59] PROBLEM - Swift https frontend on ms-fe1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [07:35:49] RECOVERY - Swift https frontend on ms-fe1024 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.060 second response time https://wikitech.wikimedia.org/wiki/Swift [07:36:07] PROBLEM - Swift https backend on ms-fe1012 is CRITICAL: CRITICAL - 
Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [07:36:50] (03CR) 10Elukey: ferm: Absent the NRPE check when migrating from ferm to nftables (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1283620 (owner: 10Muehlenhoff) [07:37:57] RECOVERY - Swift https backend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 567 bytes in 0.089 second response time https://wikitech.wikimedia.org/wiki/Swift [07:37:59] PROBLEM - Swift https frontend on ms-fe1009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [07:38:27] (03CR) 10Elukey: "To keep archives happy - we decided to go ahead with this change once the cumin nodes will be on Trixie, so Spicerack will be able to leve" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1120500 (owner: 10Volans) [07:38:43] PROBLEM - Swift https backend on ms-fe1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [07:38:43] PROBLEM - Swift https backend on ms-fe1020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [07:38:49] RECOVERY - Swift https frontend on ms-fe1009 is OK: HTTP OK: HTTP/1.1 200 OK - 361 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/Swift [07:39:07] PROBLEM - Swift https frontend on ms-fe1018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [07:39:09] PROBLEM - Swift https backend on ms-fe1018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [07:39:37] RECOVERY - Swift https backend on ms-fe1020 is OK: HTTP OK: HTTP/1.1 200 OK - 569 bytes in 3.429 second response time https://wikitech.wikimedia.org/wiki/Swift [07:39:49] !log elukey@cumin1003 START - Cookbook sre.hosts.powercycle for host sretest2010 [07:40:27] FIRING: [3x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:40:35] RECOVERY - Swift https backend on ms-fe1023 is OK: HTTP OK: HTTP/1.1 200 OK - 569 bytes in 1.382 second response time https://wikitech.wikimedia.org/wiki/Swift [07:40:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [07:40:42] RESOLVED: [3x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:40:49] PROBLEM - Swift https frontend on ms-fe1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.056 second response time https://wikitech.wikimedia.org/wiki/Swift [07:40:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:40:53] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe1013.eqiad.wmnet, ms-fe1011.eqiad.wmnet, ms-fe1024.eqiad.wmnet, ms-fe1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [07:40:57] PROBLEM - Swift https backend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.050 second response time https://wikitech.wikimedia.org/wiki/Swift [07:40:57] RECOVERY - Swift https frontend on ms-fe1018 is OK: HTTP OK: HTTP/1.1 200 OK - 361 bytes in 0.082 second response time 
https://wikitech.wikimedia.org/wiki/Swift [07:40:59] PROBLEM - Swift https frontend on ms-fe1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [07:41:01] PROBLEM - Swift https frontend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 3.554 second response time https://wikitech.wikimedia.org/wiki/Swift [07:41:01] RECOVERY - Swift https backend on ms-fe1018 is OK: HTTP OK: HTTP/1.1 200 OK - 568 bytes in 0.479 second response time https://wikitech.wikimedia.org/wiki/Swift [07:41:37] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.powercycle (exit_code=0) for host sretest2010 [07:41:42] FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:41:49] RECOVERY - Swift https frontend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 363 bytes in 1.126 second response time https://wikitech.wikimedia.org/wiki/Swift [07:41:55] RECOVERY - Swift https frontend on ms-fe1021 is OK: HTTP OK: HTTP/1.1 200 OK - 363 bytes in 5.364 second response time https://wikitech.wikimedia.org/wiki/Swift [07:41:57] RECOVERY - Swift https backend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 567 bytes in 0.084 second response time https://wikitech.wikimedia.org/wiki/Swift [07:41:57] RECOVERY - Swift https frontend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.086 second response time https://wikitech.wikimedia.org/wiki/Swift [07:42:00] FIRING: SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [07:42:07] PROBLEM - Swift https backend on 
ms-fe1009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [07:42:12] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2010.codfw.wmnet with OS trixie [07:42:53] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:42:57] RECOVERY - Swift https backend on ms-fe1009 is OK: HTTP OK: HTTP/1.1 200 OK - 567 bytes in 0.074 second response time https://wikitech.wikimedia.org/wiki/Swift [07:42:59] PROBLEM - Swift https frontend on ms-fe1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [07:43:49] RECOVERY - Swift https frontend on ms-fe1023 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.053 second response time https://wikitech.wikimedia.org/wiki/Swift [07:45:27] RESOLVED: [2x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:46:51] RESOLVED: [4x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [07:53:51] RESOLVED: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr2-eqiad:xe-3/2/1 (Transport: cr1-esams:xe-0/0/7 (Colt, ... [07:53:51] 445419311 80ms 10Gbps wave) {#2013}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=eqiad+prometheus%2Fops&var-device=cr2-eqiad:9804&var-interface=xe-3%2F2%2F1 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [07:54:10] !ack [07:54:10] All incidents are already acked. 
[07:54:24] oh that's resolved, okay, whee. [07:54:53] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host kafka-logging1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [07:54:56] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-logging1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [07:55:45] !log mvernon@cumin2002 START - Cookbook sre.swift.convert-disks for host ms-be2064 [07:55:55] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host kafka-logging1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [07:55:59] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-logging1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [07:58:56] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host kafka-logging1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [07:58:59] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-logging1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [08:03:09] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host kafka-logging1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [08:03:13] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-logging1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [08:05:21] (03PS6) 10Elukey: sre.hosts.provision: add workaround for root user on X14 supermicros [cookbooks] - 10https://gerrit.wikimedia.org/r/1266257 (https://phabricator.wikimedia.org/T418929) [08:05:45] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host kafka-logging1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [08:13:22] 10SRE-swift-storage, 10Cloud-VPS (Quota-requests): Quota increase request for project swift - https://phabricator.wikimedia.org/T425975#11924337 (10MatthewVernon) Thanks :) [08:16:19] !log elukey@cumin1003 
END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kafka-logging1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [08:17:56] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host kafka-logging1006.eqiad.wmnet with OS trixie [08:18:07] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability, 13Patch-For-Review: Q4:rack/setup/install kafka-logging100[6-8] - https://phabricator.wikimedia.org/T418929#11924349 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1003 for host kafka-logging1006.eqiad.wmnet with OS tr... [08:18:51] mvernon@cumin2002 convert-disks (PID 2847391) is awaiting input [08:20:43] PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [08:20:43] PROBLEM - Swift https backend on ms-fe2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [08:20:43] PROBLEM - Swift https backend on ms-fe2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [08:20:43] PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [08:20:59] PROBLEM - Swift https frontend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [08:20:59] PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [08:21:01] PROBLEM - Swift https frontend on ms-fe2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [08:21:03] PROBLEM - Swift https frontend on ms-fe2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [08:21:03] PROBLEM - Swift https frontend on ms-fe2024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Swift [08:21:03] PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [08:21:03] PROBLEM - Swift https backend on ms-fe2024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [08:21:03] PROBLEM - Swift https frontend on ms-fe2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [08:21:05] PROBLEM - Swift https frontend on ms-fe2022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [08:21:05] PROBLEM - Swift https backend on ms-fe2022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [08:21:07] PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [08:21:19] 10ops-eqiad, 06SRE, 06DC-Ops: Q3 :rack/setup/install cloudvirt refresh - https://phabricator.wikimedia.org/T425088#11924352 (10fgiunchedi) @Jclark-ctr once T426180 is resolved and hosts can be reimaged, please rack as follows 1077 -> `C8` 1078 -> `D5` 1079 -> `E4` 1080 -> `F4` [08:21:33] RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 571 bytes in 0.181 second response time https://wikitech.wikimedia.org/wiki/Swift [08:21:33] RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.189 second response time https://wikitech.wikimedia.org/wiki/Swift [08:21:33] RECOVERY - Swift https backend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.197 second response time https://wikitech.wikimedia.org/wiki/Swift [08:21:33] RECOVERY - Swift https backend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.206 second response time https://wikitech.wikimedia.org/wiki/Swift [08:21:49] RECOVERY - Swift https frontend on ms-fe2014 is OK: HTTP 
OK: HTTP/1.1 200 OK - 360 bytes in 0.173 second response time https://wikitech.wikimedia.org/wiki/Swift [08:21:49] RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.177 second response time https://wikitech.wikimedia.org/wiki/Swift [08:21:51] RECOVERY - Swift https frontend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.183 second response time https://wikitech.wikimedia.org/wiki/Swift [08:21:53] RECOVERY - Swift https frontend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.178 second response time https://wikitech.wikimedia.org/wiki/Swift [08:21:53] RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.171 second response time https://wikitech.wikimedia.org/wiki/Swift [08:21:53] RECOVERY - Swift https frontend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.189 second response time https://wikitech.wikimedia.org/wiki/Swift [08:21:53] RECOVERY - Swift https frontend on ms-fe2024 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.185 second response time https://wikitech.wikimedia.org/wiki/Swift [08:21:53] RECOVERY - Swift https backend on ms-fe2024 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.195 second response time https://wikitech.wikimedia.org/wiki/Swift [08:21:55] RECOVERY - Swift https frontend on ms-fe2022 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Swift [08:21:55] RECOVERY - Swift https backend on ms-fe2022 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.201 second response time https://wikitech.wikimedia.org/wiki/Swift [08:21:57] RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.187 second response time https://wikitech.wikimedia.org/wiki/Swift [08:22:09] (03CR) 10AikoChou: "Thanks for updating the list! I'll deploy this to production on Monday." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1281588 (https://phabricator.wikimedia.org/T412830) (owner: 10Eevans) [08:28:02] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host kafka-logging1006.eqiad.wmnet with OS trixie [08:28:11] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability, 13Patch-For-Review: Q4:rack/setup/install kafka-logging100[6-8] - https://phabricator.wikimedia.org/T418929#11924364 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1003 for host kafka-logging1006.eqiad.wmnet with OS trixie... [08:31:07] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability, 13Patch-For-Review: Q4:rack/setup/install kafka-logging100[6-8] - https://phabricator.wikimedia.org/T418929#11924373 (10elukey) >>! In T418929#11909574, @elukey wrote: > weird error while doing pxe: > > ` >>>Checking Media Presence...... >>>Media Present...... [08:34:49] (03CR) 10Klausman: [C:03+1] ml-serve(grpc): step 1, etcd data for DNS Discovery [puppet] - 10https://gerrit.wikimedia.org/r/1283745 (https://phabricator.wikimedia.org/T424049) (owner: 10Dpogorzelski) [08:35:18] (03CR) 10Klausman: [C:03+1] ml-serve(grpc): step 2, add entry to service catalog [puppet] - 10https://gerrit.wikimedia.org/r/1283746 (https://phabricator.wikimedia.org/T424049) (owner: 10Dpogorzelski) [08:35:23] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Install new line card in cr2-eqiad slot 0, move card from slot 1 to cr1-eqiad slot 0 and configure - https://phabricator.wikimedia.org/T426343#11924396 (10cmooney) Actually now that I think of it we should probably combine this work wi... 
[08:35:33] (03CR) 10Klausman: [C:03+1] ml-serve(grpc): step 3, add service to k8s pools [puppet] - 10https://gerrit.wikimedia.org/r/1283747 (https://phabricator.wikimedia.org/T424049) (owner: 10Dpogorzelski) [08:36:58] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Install new line card in cr2-eqiad slot 0, move card from slot 1 to cr1-eqiad slot 0 and configure - https://phabricator.wikimedia.org/T426343#11924398 (10cmooney) [08:38:37] PROBLEM - Host db2218 #page is DOWN: PING CRITICAL - Packet loss = 100% [08:39:04] !ack [08:39:04] 7945 (ACKED) Host db2218 (paged) [08:39:09] (03CR) 10Klausman: docker_registry: allow multiple docker instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1287292 (https://phabricator.wikimedia.org/T420978) (owner: 10Elukey) [08:39:45] marostegui: ? [08:39:59] PROBLEM - MariaDB Replica IO: s7 on db2198 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db2218.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db2218.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:40:02] PROBLEM - MariaDB Replica IO: s7 #page on db2182 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db2218.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db2218.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:40:03] PROBLEM - MariaDB Replica IO: s7 #page on db2159 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db2218.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db2218.codfw.wmnet (110 Connection timed out) 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:40:04] PROBLEM - MariaDB Replica IO: s7 #page on db2168 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db2218.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db2218.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:40:04] PROBLEM - MariaDB Replica IO: s7 #page on db2208 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db2218.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db2218.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:40:05] PROBLEM - MariaDB Replica IO: s7 #page on db2220 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db2218.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db2218.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:40:06] PROBLEM - MariaDB Replica IO: s7 #page on db2222 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db2218.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db2218.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:40:07] PROBLEM - MariaDB Replica IO: s7 #page on db2221 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db2218.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on 
db2218.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:40:12] !ack [08:40:13] 7946 (ACKED) db2182 (paged)/MariaDB Replica IO: s7 (paged) [08:40:13] 7947 (ACKED) db2159 (paged)/MariaDB Replica IO: s7 (paged) [08:40:14] 7948 (ACKED) db2168 (paged)/MariaDB Replica IO: s7 (paged) [08:40:14] 7949 (ACKED) db2208 (paged)/MariaDB Replica IO: s7 (paged) [08:40:14] 7950 (ACKED) db2220 (paged)/MariaDB Replica IO: s7 (paged) [08:40:14] 7951 (ACKED) db2222 (paged)/MariaDB Replica IO: s7 (paged) [08:40:14] 7952 (ACKED) db2221 (paged)/MariaDB Replica IO: s7 (paged) [08:40:18] hm [08:40:42] oh that's downstream of db2218 having a bad time? [08:40:43] PROBLEM - MariaDB Replica IO: s7 on db2200 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db2218.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db2218.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:40:49] (03CR) 10Filippo Giunchedi: [C:03+1] corto: set default visibility to WMF-NDA [puppet] - 10https://gerrit.wikimedia.org/r/1287424 (https://phabricator.wikimedia.org/T426137) (owner: 10Hnowlan) [08:41:17] Looks like master down [08:41:27] !log mvernon@cumin2002 END (FAIL) - Cookbook sre.swift.convert-disks (exit_code=99) for host ms-be2064 [08:41:36] (03CR) 10Lucas Werkmeister (WMDE): "(Removing myself here, I can’t review the Windows-specific changes in `manage.py`. 
I still think those should be split into a separate cha" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1276284 (https://phabricator.wikimedia.org/T424128) (owner: 10WAN233) [08:41:44] !log mvernon@cumin2002 START - Cookbook sre.swift.convert-disks for host ms-be2065 [08:41:58] absent a link to a runbook of any kind, i'm not quite sure how to proceed with this one [08:42:00] FIRING: [2x] SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [08:42:03] bjensen: the switch port is up but we have no MAC address learnt for db2218 in codfw rack d6 [08:42:19] bjensen: I'd escalate to dba [08:42:26] We need data persistence for this one [08:42:33] ack, thanks [08:42:55] marostegui, Amir1 ^ [08:43:03] I'm on the serial console now, the host seems functional [08:43:13] RECOVERY - Host db2218 #page is UP: PING OK - Packet loss = 0%, RTA = 31.68 ms [08:43:16] Also cezmunsta for awareness [08:43:57] hmm... mac is learnt now and by the time I logged on to console gw was pingable [08:44:19] What’s the uptime? [08:44:26] sobanski: I'm off today but near my laptop, need help? 
[08:45:05] Simply depool it and downtime it and create a task [08:45:11] PROBLEM - MariaDB Events s7 on db2218 is CRITICAL: CRITICAL - Failed to query events: ERROR 2002 (HY000): Cant connect to local server through socket /run/mysqld/mysqld.sock (2) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler [08:45:12] PROBLEM - MariaDB Replica SQL: s7 #page on db2218 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:45:12] PROBLEM - mysqld processes on db2218 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [08:45:12] PROBLEM - pt-heartbeat-wikimedia process on db2218 is CRITICAL: PROCS CRITICAL: 0 processes with args pt-heartbeat-wikimedia https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23pt-heartbeat [08:45:13] PROBLEM - MariaDB Replica IO: s7 #page on db2218 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:45:19] !ack [08:45:20] 7953 (ACKED) db2218 (paged)/MariaDB Replica SQL: s7 (paged) [08:45:20] 7954 (ACKED) db2218 (paged)/MariaDB Replica IO: s7 (paged) [08:45:30] PROBLEM - MariaDB read only s7 #page on db2218 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [08:45:30] PROBLEM - MariaDB Event Scheduler s7 on db2218 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler [08:45:33] It's a master?
[08:45:37] marostegui: apologies, I haven’t checked [08:45:40] !ack [08:45:41] 7955 (ACKED) db2218 (paged)/MariaDB read only s7 (paged) [08:45:51] But yes, looks like it [08:45:53] looks like an intermediate in s7 [08:45:58] Yep [08:46:03] Ok going to my laptop [08:46:12] <3 [08:46:31] (03PS7) 10WAN233: change logo at zh-classical wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1276284 (https://phabricator.wikimedia.org/T424128) [08:47:02] PROBLEM - MariaDB Replica Lag: s7 #page on db2182 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 606.70 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:47:03] PROBLEM - MariaDB Replica Lag: s7 #page on db2208 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 607.46 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:47:04] PROBLEM - MariaDB Replica Lag: s7 #page on db2221 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 607.96 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:47:05] PROBLEM - MariaDB Replica Lag: s7 #page on db2222 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 607.99 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:47:05] PROBLEM - MariaDB Replica Lag: s7 #page on db2220 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 608.03 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:47:10] !ack [08:47:11] Can someone ack/silence all that? 
[08:47:11] 7956 (ACKED) db2182 (paged)/MariaDB Replica Lag: s7 (paged) [08:47:11] 7957 (ACKED) db2208 (paged)/MariaDB Replica Lag: s7 (paged) [08:47:12] 7958 (ACKED) db2221 (paged)/MariaDB Replica Lag: s7 (paged) [08:47:12] 7959 (ACKED) db2222 (paged)/MariaDB Replica Lag: s7 (paged) [08:47:12] 7960 (ACKED) db2220 (paged)/MariaDB Replica Lag: s7 (paged) [08:47:42] PROBLEM - MariaDB Replica Lag: s7 #page on db2159 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 648.34 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:47:43] PROBLEM - MariaDB Replica Lag: s7 #page on db2168 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 648.38 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:47:47] !ack [08:47:48] 7961 (ACKED) db2159 (paged)/MariaDB Replica Lag: s7 (paged) [08:48:09] marostegui: not sure what happened looking at the host logs [08:48:18] I can see on the network side the port bounced a few times: [08:48:24] https://www.irccloud.com/pastebin/qcCYyhwq/ [08:48:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:48:47] topranks: the host rebooted itself [08:49:13] ah ffs I logged on to 2216 via ssh.... 
hence the logs not showing anything to me :) [08:49:21] (03PS4) 10Filippo Giunchedi: Designate: move zookeeper config into hiera [puppet] - 10https://gerrit.wikimedia.org/r/1283000 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott) [08:49:22] ok well at least that makes the port bouncing make sense [08:49:25] thanks [08:50:22] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2220 to s7 master [puppet] - 10https://gerrit.wikimedia.org/r/1287820 (https://phabricator.wikimedia.org/T426380) [08:50:52] (03Abandoned) 10Abijeet Patro: Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1285770 (owner: 10L10n-bot) [08:50:57] (03Abandoned) 10Abijeet Patro: Localisation updates from https://translatewiki.net. [software/mailman-templates] - 10https://gerrit.wikimedia.org/r/1285773 (owner: 10L10n-bot) [08:51:10] (03PS8) 10WAN233: change logo at zh-classical wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1276284 (https://phabricator.wikimedia.org/T424128) [08:51:12] RECOVERY - mysqld processes on db2218 is OK: PROCS OK: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [08:51:12] RECOVERY - MariaDB Events s7 on db2218 is OK: OK - All 2 events in ops database are ENABLED https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler [08:51:32] RECOVERY - MariaDB read only s7 #page on db2218 is OK: Version 10.11.16-MariaDB-log, Uptime 71s, read_only: True, event_scheduler: True, 27.90 QPS, connection latency: 0.019407s, query latency: 0.001308s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [08:51:32] RECOVERY - MariaDB Event Scheduler s7 on db2218 is OK: Version 10.11.16-MariaDB-log, Uptime 71s, read_only: True, event_scheduler: True, 22.42 QPS, connection latency: 0.025435s, query latency: 0.001138s 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler [08:51:45] RECOVERY - MariaDB Replica IO: s7 on db2200 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:51:46] (03CR) 10WAN233: change logo at zh-classical wikipedia (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1276284 (https://phabricator.wikimedia.org/T424128) (owner: 10WAN233) [08:51:59] RECOVERY - MariaDB Replica IO: s7 on db2198 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:52:02] RECOVERY - MariaDB Replica IO: s7 #page on db2159 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:52:03] RECOVERY - MariaDB Replica IO: s7 #page on db2168 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:52:03] RECOVERY - MariaDB Replica IO: s7 #page on db2182 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:52:04] RECOVERY - MariaDB Replica IO: s7 #page on db2208 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:52:05] RECOVERY - MariaDB Replica IO: s7 #page on db2220 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:52:06] RECOVERY - MariaDB Replica IO: s7 #page on db2222 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:52:06] RECOVERY - MariaDB Replica IO: s7 #page on db2221 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:52:13] RECOVERY 
- MariaDB Replica SQL: s7 #page on db2218 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:52:13] RECOVERY - MariaDB Replica IO: s7 #page on db2218 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:52:14] PROBLEM - MariaDB Replica Lag: s7 #page on db2218 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 673.19 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:52:20] !ack [08:52:21] 7963 (ACKED) db2218 (paged)/MariaDB Replica Lag: s7 (paged) [08:52:23] (03CR) 10Filippo Giunchedi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1283000 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott) [08:54:13] RECOVERY - MariaDB Replica Lag: s7 #page on db2218 is OK: OK slave_sql_lag Replication lag: 0.04 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:54:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'Set db2220 with weight 0 T426380', diff saved to https://phabricator.wikimedia.org/P92551 and previous config saved to /var/cache/conftool/dbconfig/20260515-085420-marostegui.json [08:54:24] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 28 hosts with reason: Primary switchover s7 T426380 [08:54:25] T426380: Switchover s7 master (db2218 -> db2220) - https://phabricator.wikimedia.org/T426380 [08:54:43] RECOVERY - MariaDB Replica Lag: s7 #page on db2159 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:54:43] RECOVERY - MariaDB Replica Lag: s7 #page on db2168 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:55:01] RECOVERY - MariaDB Replica Lag: s7 #page on 
db2182 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:55:02] RECOVERY - MariaDB Replica Lag: s7 #page on db2208 is OK: OK slave_sql_lag Replication lag: 0.23 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:55:04] RECOVERY - MariaDB Replica Lag: s7 #page on db2221 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:55:05] RECOVERY - MariaDB Replica Lag: s7 #page on db2222 is OK: OK slave_sql_lag Replication lag: 3.81 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:55:06] RECOVERY - MariaDB Replica Lag: s7 #page on db2220 is OK: OK slave_sql_lag Replication lag: 3.90 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:55:12] RECOVERY - pt-heartbeat-wikimedia process on db2218 is OK: PROCS OK: 1 process with args pt-heartbeat-wikimedia https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23pt-heartbeat [08:55:19] (03CR) 10Marostegui: [C:03+2] mariadb: Promote db2220 to s7 master [puppet] - 10https://gerrit.wikimedia.org/r/1287820 (https://phabricator.wikimedia.org/T426380) (owner: 10Gerrit maintenance bot) [08:55:59] (03CR) 10A smart kitten: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1276284 (https://phabricator.wikimedia.org/T424128) (owner: 10WAN233) [08:56:42] !log Starting s7 codfw failover from db2218 to db2220 - T426380 [08:56:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:37] !log marostegui@cumin1003 dbctl commit (dc=all): 'Promote db2220 to s7 primary T426380', diff saved to https://phabricator.wikimedia.org/P92552 and previous config saved to /var/cache/conftool/dbconfig/20260515-085836-marostegui.json [09:00:00] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db2218 
T426380', diff saved to https://phabricator.wikimedia.org/P92553 and previous config saved to /var/cache/conftool/dbconfig/20260515-090000-marostegui.json [09:00:04] T426380: Switchover s7 master (db2218 -> db2220) - https://phabricator.wikimedia.org/T426380 [09:01:44] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2064.codfw.wmnet with OS bullseye [09:01:53] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11924497 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2064.codfw.wmnet with OS bullseye [09:01:54] PROBLEM - orchestrator resolve cache non-FQDNs on dborch1002 is CRITICAL: CRITICAL: 2 non-FQDN entries in orchestrator resolve cache: https://wikitech.wikimedia.org/wiki/Orchestrator [09:02:04] !log mvernon@cumin2002 START - Cookbook sre.hosts.move-vlan for host ms-be2064 [09:02:14] !log mvernon@cumin2002 START - Cookbook sre.dns.netbox [09:03:10] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2010.codfw.wmnet with OS trixie [09:03:41] FIRING: [43x] SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:03:44] (03PS1) 10Filippo Giunchedi: designate: use zk backend in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/1287822 (https://phabricator.wikimedia.org/T422646) [09:05:18] (03CR) 10Filippo Giunchedi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1287822 (https://phabricator.wikimedia.org/T422646) (owner: 10Filippo Giunchedi) [09:05:37] (03PS1) 10Marostegui: db2218: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1287823 [09:06:15] !log mvernon@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate 
netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-be2064 - mvernon@cumin2002" [09:06:17] (03CR) 10Marostegui: [C:03+2] db2218: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1287823 (owner: 10Marostegui) [09:06:20] !log mvernon@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-be2064 - mvernon@cumin2002" [09:06:20] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:06:21] !log mvernon@cumin2002 START - Cookbook sre.dns.wipe-cache ms-be2064.codfw.wmnet 56.32.192.10.in-addr.arpa 6.5.0.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [09:06:24] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ms-be2064.codfw.wmnet 56.32.192.10.in-addr.arpa 6.5.0.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [09:06:25] !log mvernon@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ms-be2064 [09:08:15] !log mvernon@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-be2064 [09:08:15] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host ms-be2064 [09:08:56] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on db2218.codfw.wmnet with reason: Host crashed T426383 [09:08:59] T426383: Investigate db2218 crash - https://phabricator.wikimedia.org/T426383 [09:09:41] 10ops-codfw, 06DBA, 06DC-Ops: Investigate db2218 crash - https://phabricator.wikimedia.org/T426383#11924545 (10Marostegui) Host has been switched over and it is a slave now. @Jhancock.wm @Papaul would it be possible to check/upgrade bios/firmware on this host before we repool it back? It can be rebooted any... 
[09:10:24] elukey@cumin1003 reimage (PID 4108711) is awaiting input [09:10:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [09:10:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:10:52] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2010.codfw.wmnet with OS trixie [09:10:59] 10ops-codfw, 06DBA, 06DC-Ops: Investigate db2218 crash - https://phabricator.wikimedia.org/T426383#11924565 (10ayounsi) [09:11:05] (03CR) 10Atsuko: "If I understood correctly, we just creating a PVC that airflow then can use, right?" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1287426 (https://phabricator.wikimedia.org/T422179) (owner: 10Btullis) [09:15:04] (03CR) 10Effie Mouzeli: "similar to I64475fafdae90bc55ff3e8046dda48b85217594d" [puppet] - 10https://gerrit.wikimedia.org/r/1286793 (https://phabricator.wikimedia.org/T418263) (owner: 10Effie Mouzeli) [09:15:09] (03CR) 10Effie Mouzeli: [C:03+2] mcrouter_wancache: add mc1068-mc1069 to production [puppet] - 10https://gerrit.wikimedia.org/r/1286793 (https://phabricator.wikimedia.org/T418263) (owner: 10Effie Mouzeli) [09:16:03] (03PS2) 10Elukey: docker_registry: allow multiple docker instances [puppet] - 10https://gerrit.wikimedia.org/r/1287292 (https://phabricator.wikimedia.org/T420978) [09:16:19] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1287292 (https://phabricator.wikimedia.org/T420978) (owner: 10Elukey) [09:17:36] (03CR) 10Elukey: docker_registry: allow multiple docker instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1287292 (https://phabricator.wikimedia.org/T420978) (owner: 10Elukey) [09:18:25] (03CR) 10Abijeet Patro: [V:03+2] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1287391 (owner: 10L10n-bot) [09:18:30] (03CR) 10Gmodena: Add support for creating arbitrary PVCs to mediawiki-dumps-legacy (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287426 (https://phabricator.wikimedia.org/T422179) (owner: 10Btullis) [09:19:04] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11924651 (10elukey) @wiki_willy sorry for the lag, didn't see your question! 
`00062974` [09:20:48] PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [09:20:48] PROBLEM - Swift https backend on ms-fe2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [09:20:48] PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [09:21:00] PROBLEM - Swift https frontend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [09:21:00] PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [09:21:04] PROBLEM - Swift https frontend on ms-fe2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [09:21:04] PROBLEM - Swift https frontend on ms-fe2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [09:21:04] PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [09:21:04] PROBLEM - Swift https frontend on ms-fe2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [09:21:04] PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [09:21:04] PROBLEM - Swift https backend on ms-fe2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [09:21:04] PROBLEM - Swift https backend on ms-fe2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [09:21:05] PROBLEM - Swift https frontend on ms-fe2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift 
[09:21:05] PROBLEM - Swift https frontend on ms-fe2020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [09:21:06] PROBLEM - Swift https backend on ms-fe2020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [09:21:06] PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [09:21:37] RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 571 bytes in 0.182 second response time https://wikitech.wikimedia.org/wiki/Swift [09:21:37] RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.198 second response time https://wikitech.wikimedia.org/wiki/Swift [09:21:37] RECOVERY - Swift https backend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.198 second response time https://wikitech.wikimedia.org/wiki/Swift [09:21:49] RECOVERY - Swift https frontend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.183 second response time https://wikitech.wikimedia.org/wiki/Swift [09:21:51] RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.178 second response time https://wikitech.wikimedia.org/wiki/Swift [09:21:53] RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.175 second response time https://wikitech.wikimedia.org/wiki/Swift [09:21:53] RECOVERY - Swift https frontend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.181 second response time https://wikitech.wikimedia.org/wiki/Swift [09:21:53] RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Swift [09:21:53] RECOVERY - Swift https frontend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.180 second response time 
https://wikitech.wikimedia.org/wiki/Swift [09:21:53] RECOVERY - Swift https frontend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Swift [09:21:53] RECOVERY - Swift https frontend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.181 second response time https://wikitech.wikimedia.org/wiki/Swift [09:21:53] RECOVERY - Swift https frontend on ms-fe2020 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Swift [09:21:54] RECOVERY - Swift https backend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.193 second response time https://wikitech.wikimedia.org/wiki/Swift [09:21:54] RECOVERY - Swift https backend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.196 second response time https://wikitech.wikimedia.org/wiki/Swift [09:21:55] RECOVERY - Swift https backend on ms-fe2020 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.204 second response time https://wikitech.wikimedia.org/wiki/Swift [09:21:57] RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.194 second response time https://wikitech.wikimedia.org/wiki/Swift [09:25:13] (03CR) 10Abijeet Patro: [V:03+2] Localisation updates from https://translatewiki.net. 
[software/mailman-templates] - 10https://gerrit.wikimedia.org/r/1287396 (owner: 10L10n-bot) [09:30:01] !log jiji@deploy1003 helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply [09:30:05] mvernon@cumin2002 convert-disks (PID 2847391) is awaiting input [09:30:09] !log mvernon@cumin2002 END (FAIL) - Cookbook sre.swift.convert-disks (exit_code=99) for host ms-be2065 [09:30:11] !log jiji@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply [09:30:41] !log jiji@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply [09:32:16] !log Migrate cr4-ulsfo link to asw1-23-ulsfo to tagged interface T424611 [09:32:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:19] T424611: POPs - free up 2xQSFP ports - https://phabricator.wikimedia.org/T424611 [09:32:23] (03PS2) 10Btullis: Add support for creating arbitrary PVCs to mediawiki-dumps-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287426 (https://phabricator.wikimedia.org/T422179) [09:32:25] !log mvernon@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be2064.codfw.wmnet with OS bullseye [09:32:33] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11924713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2064.codfw.wmnet with OS bullseye execu... [09:32:36] (03CR) 10Btullis: "That's correct. We will be mounting the PVC into the task pods using python code in the DAG to modify the pod spec and reference this PVC." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1287426 (https://phabricator.wikimedia.org/T422179) (owner: 10Btullis) [09:33:12] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2010.codfw.wmnet with OS trixie [09:35:43] (03CR) 10Brouberol: [C:03+1] Add support for creating arbitrary PVCs to mediawiki-dumps-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287426 (https://phabricator.wikimedia.org/T422179) (owner: 10Btullis) [09:36:33] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11924732 (10elukey) I keep getting this error when reimage: ` Server IP address is ...208.80.153.70 NBP filename is http://208.80.154.10/efiboot/snponly.efi NBP filesize is... [09:37:10] (03PS1) 10MVernon: swift: set ms-be206[4,5] to be new-style storage [puppet] - 10https://gerrit.wikimedia.org/r/1287826 (https://phabricator.wikimedia.org/T354872) [09:39:10] (03CR) 10CI reject: [V:04-1] swift: set ms-be206[4,5] to be new-style storage [puppet] - 10https://gerrit.wikimedia.org/r/1287826 (https://phabricator.wikimedia.org/T354872) (owner: 10MVernon) [09:40:04] (03PS2) 10MVernon: swift: set ms-be206[4,5] to be new-style storage [puppet] - 10https://gerrit.wikimedia.org/r/1287826 (https://phabricator.wikimedia.org/T354872) [09:40:24] elukey@cumin1003 reimage (PID 4112012) is awaiting input [09:40:28] (03CR) 10Klausman: [C:03+1] docker_registry: allow multiple docker instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1287292 (https://phabricator.wikimedia.org/T420978) (owner: 10Elukey) [09:42:03] (03CR) 10Gkyziridis: "> Also, should it be explicitly unversioned, just pointing to the version in the HEAD of the main branch?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286862 (https://phabricator.wikimedia.org/T426081) (owner: 10Gkyziridis) [09:45:58] (03CR) 10CWilliams: [C:03+2] swift: set ms-be206[4,5] to be new-style storage [puppet] - 10https://gerrit.wikimedia.org/r/1287826 (https://phabricator.wikimedia.org/T354872) (owner: 10MVernon) [09:48:03] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2010.codfw.wmnet with OS trixie [09:48:20] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2010.codfw.wmnet with OS trixie [09:49:39] !log jiji@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply [09:53:02] (03PS3) 10Tiziano Fogli: logstash/thanos-qfe: add event.hash and event.start [puppet] - 10https://gerrit.wikimedia.org/r/1287827 [09:54:24] (03CR) 10FNegri: [C:03+1] "Verified that the new key matches the one at https://ldap.toolforge.org/user/filippo" [puppet] - 10https://gerrit.wikimedia.org/r/1285742 (owner: 10Filippo Giunchedi) [09:55:36] elukey@cumin1003 reimage (PID 4112705) is awaiting input [09:56:05] !log Migrate cr3-ulsfo link to asw1-22-ulsfo to tagged interface T424611 [09:56:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:08] T424611: POPs - free up 2xQSFP ports - https://phabricator.wikimedia.org/T424611 [09:57:19] (03CR) 10MVernon: [C:03+2] swift: set ms-be206[4,5] to be new-style storage [puppet] - 10https://gerrit.wikimedia.org/r/1287826 (https://phabricator.wikimedia.org/T354872) (owner: 10MVernon) [09:58:51] (03CR) 10Filippo Giunchedi: [C:03+2] wmcs: update filippo cloudvps root key [puppet] - 10https://gerrit.wikimedia.org/r/1285742 (owner: 10Filippo Giunchedi) [09:59:43] (03PS1) 10Elukey: WIP: test reimage workaround for sretest2010 [cookbooks] - 10https://gerrit.wikimedia.org/r/1287829 [10:00:57] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2010.codfw.wmnet with OS trixie [10:01:00] 
!log aokoth@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Security Release - T426298 [10:01:00] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2010.codfw.wmnet with OS trixie [10:03:12] mvernon@cumin2002 reimage (PID 2916451) is awaiting input [10:04:14] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2064.codfw.wmnet with OS bullseye [10:04:27] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations, 13Patch-For-Review: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11924792 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2064.codfw.wm... [10:04:40] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [10:08:09] (03CR) 10Hnowlan: [C:04-1] "This change needs to happen in helmfile.d/services/rest-gateway/values.yaml as these APIs have been migrated to the rest-gateway. Thankful" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287731 (https://phabricator.wikimedia.org/T426323) (owner: 10Kosta Harlan) [10:10:18] !log Migrate ulsfo cr<->cr traffic to use path via switches not direct link T424611 [10:10:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:22] T424611: POPs - free up 2xQSFP ports - https://phabricator.wikimedia.org/T424611 [10:10:24] cmooney@cumin1003 netbox (PID 4115505) is awaiting input [10:12:23] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: modify entries for ulsfo router interfaces - cmooney@cumin1003" [10:12:29] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: modify entries for ulsfo router interfaces - cmooney@cumin1003" [10:12:29] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox 
(exit_code=0) [10:12:55] (03PS1) 10Cathal Mooney: ulsfo: enable ospf on new links via switches and set metric on direct [homer/public] - 10https://gerrit.wikimedia.org/r/1287831 (https://phabricator.wikimedia.org/T424611) [10:14:27] (03CR) 10CI reject: [V:04-1] ulsfo: enable ospf on new links via switches and set metric on direct [homer/public] - 10https://gerrit.wikimedia.org/r/1287831 (https://phabricator.wikimedia.org/T424611) (owner: 10Cathal Mooney) [10:16:50] (03CR) 10Cathal Mooney: "recheck" [homer/public] - 10https://gerrit.wikimedia.org/r/1287831 (https://phabricator.wikimedia.org/T424611) (owner: 10Cathal Mooney) [10:20:49] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2064.codfw.wmnet with reason: host reimage [10:20:59] PROBLEM - Swift https frontend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [10:21:03] PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [10:21:05] PROBLEM - Swift https frontend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [10:21:05] PROBLEM - Swift https frontend on ms-fe2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [10:21:05] PROBLEM - Swift https backend on ms-fe2024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [10:21:05] PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [10:21:05] PROBLEM - Swift https frontend on ms-fe2024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [10:21:05] PROBLEM - Swift https frontend on ms-fe2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [10:21:05] 
PROBLEM - Swift https frontend on ms-fe2018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [10:21:06] PROBLEM - Swift https backend on ms-fe2018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [10:21:06] PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [10:21:07] PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [10:21:07] PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [10:21:51] RECOVERY - Swift https frontend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.176 second response time https://wikitech.wikimedia.org/wiki/Swift [10:21:53] RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.178 second response time https://wikitech.wikimedia.org/wiki/Swift [10:21:55] RECOVERY - Swift https frontend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.181 second response time https://wikitech.wikimedia.org/wiki/Swift [10:21:55] RECOVERY - Swift https frontend on ms-fe2024 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Swift [10:21:55] RECOVERY - Swift https frontend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.181 second response time https://wikitech.wikimedia.org/wiki/Swift [10:21:55] RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.197 second response time https://wikitech.wikimedia.org/wiki/Swift [10:21:55] RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.202 second response time https://wikitech.wikimedia.org/wiki/Swift [10:21:55] RECOVERY - Swift https frontend on 
ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.181 second response time https://wikitech.wikimedia.org/wiki/Swift [10:21:55] RECOVERY - Swift https frontend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.191 second response time https://wikitech.wikimedia.org/wiki/Swift [10:21:56] RECOVERY - Swift https frontend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Swift [10:21:56] RECOVERY - Swift https backend on ms-fe2024 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.200 second response time https://wikitech.wikimedia.org/wiki/Swift [10:21:57] RECOVERY - Swift https backend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.200 second response time https://wikitech.wikimedia.org/wiki/Swift [10:21:57] RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 571 bytes in 0.192 second response time https://wikitech.wikimedia.org/wiki/Swift [10:22:42] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-toolhub-test: apply [10:22:46] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-toolhub-test: apply [10:23:55] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-toolhub-test: apply [10:23:59] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-toolhub-test: apply [10:24:38] FIRING: CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=59&orgId=1&from=now-6M&to=now&var-search_cluster=cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateT [10:24:56] !log mvernon@cumin2002 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2064.codfw.wmnet with reason: host reimage [10:25:18] 06SRE, 10corto, 10Incident Tooling, 13Patch-For-Review: Increase trusted volunteer's visibility into production incidents - https://phabricator.wikimedia.org/T426137#11924884 (10A_smart_kitten) >>! In T426137#11922867, @hnowlan wrote: > [...] but I think after an incident we should have to make a very good... [10:26:52] (03CR) 10Cathal Mooney: "I cannot work out this CI error, schema validation is failing saying the 'metric' should be a string? But all the others are ints and tha" [homer/public] - 10https://gerrit.wikimedia.org/r/1287831 (https://phabricator.wikimedia.org/T424611) (owner: 10Cathal Mooney) [10:28:10] (03PS2) 10Kosta Harlan: rest-gateway: Add Vary: Origin to CORS-enabled routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287731 (https://phabricator.wikimedia.org/T426323) [10:28:15] mvernon@cumin2002 reimage (PID 2916451) is awaiting input [10:28:34] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2010.codfw.wmnet with reason: host reimage [10:31:00] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-toolhub-test: apply [10:31:03] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-toolhub-test: apply [10:33:32] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Install new line card in cr2-eqiad slot 0, move card from slot 1 to cr1-eqiad slot 0 and configure - https://phabricator.wikimedia.org/T426343#11924902 (10cmooney) [10:33:59] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2010.codfw.wmnet with reason: host reimage [10:35:57] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2065.codfw.wmnet with OS bullseye [10:36:05] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw 
rows A-D - https://phabricator.wikimedia.org/T354872#11924907 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2065.codfw.wmnet with OS bullseye [10:36:16] !log mvernon@cumin2002 START - Cookbook sre.hosts.move-vlan for host ms-be2065 [10:36:38] !log mvernon@cumin2002 START - Cookbook sre.dns.netbox [10:40:38] !log mvernon@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-be2065 - mvernon@cumin2002" [10:40:43] !log mvernon@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-be2065 - mvernon@cumin2002" [10:40:43] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:40:44] !log mvernon@cumin2002 START - Cookbook sre.dns.wipe-cache ms-be2065.codfw.wmnet 167.48.192.10.in-addr.arpa 7.6.1.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [10:40:47] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ms-be2065.codfw.wmnet 167.48.192.10.in-addr.arpa 7.6.1.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [10:40:48] !log mvernon@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ms-be2065 [10:41:44] !log mvernon@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-be2065 [10:41:44] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host ms-be2065 [10:42:56] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2010.codfw.wmnet with OS trixie [10:43:24] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2010.codfw.wmnet with OS trixie [10:44:51] (03PS1) 10Btullis: Install conda-analytics-next to the production Hadoop cluster 
[puppet] - 10https://gerrit.wikimedia.org/r/1287837 (https://phabricator.wikimedia.org/T338057) [10:46:09] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2010.codfw.wmnet with OS trixie [10:49:16] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1287837 (https://phabricator.wikimedia.org/T338057) (owner: 10Btullis) [10:52:20] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-toolhub-test: apply [10:52:24] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-toolhub-test: apply [10:55:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [10:55:57] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2064.codfw.wmnet with OS bullseye [10:56:05] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11924926 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2064.codfw.wmnet with OS bullseye compl... [10:59:31] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2065.codfw.wmnet with reason: host reimage [11:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260515T0700) [11:00:04] jelto, arnoldokoth, mutante, and arnaudb: I, the Bot under the Fountain, call upon thee, The Deployer, to do GitLab version upgrades deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260515T1100). 
[11:01:10] aokoth@cumin1003 aokoth: The backup on gitlab1004 is complete, ready to proceed with upgrade. [11:02:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:04:16] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2065.codfw.wmnet with reason: host reimage [11:07:23] (03PS2) 10Cathal Mooney: ulsfo: enable ospf on new links via switches and set metric on direct [homer/public] - 10https://gerrit.wikimedia.org/r/1287831 (https://phabricator.wikimedia.org/T424611) [11:08:36] (03PS1) 10Muehlenhoff: Allow to tighten ptrace access [puppet] - 10https://gerrit.wikimedia.org/r/1287840 [11:08:47] (03CR) 10CI reject: [V:04-1] ulsfo: enable ospf on new links via switches and set metric on direct [homer/public] - 10https://gerrit.wikimedia.org/r/1287831 (https://phabricator.wikimedia.org/T424611) (owner: 10Cathal Mooney) [11:10:37] !log aokoth@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: Security Release - T426298 [11:13:26] FIRING: [44x] SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:14:50] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [11:20:27] cmooney@cumin1003 netbox (PID 4122531) is awaiting input [11:20:51] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr4-ulsfo:ae0 (Core: cr3-ulsfo:ae0) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr4-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [11:20:58] PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [11:21:06] PROBLEM - Swift https frontend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [11:21:12] PROBLEM - Swift https frontend on ms-fe2021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [11:21:12] PROBLEM - Swift https frontend on ms-fe2024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [11:21:12] PROBLEM - Swift https frontend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [11:21:12] PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [11:21:12] PROBLEM - Swift https frontend on ms-fe2019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [11:21:14] PROBLEM - Swift https frontend on ms-fe2020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [11:21:14] PROBLEM - Swift https backend on ms-fe2024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [11:21:14] PROBLEM - Swift https backend on ms-fe2020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [11:21:14] PROBLEM - Swift https backend on ms-fe2019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [11:21:14] PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Swift [11:21:14] PROBLEM - Swift https backend on ms-fe2021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [11:21:48] RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.215 second response time https://wikitech.wikimedia.org/wiki/Swift [11:21:56] RECOVERY - Swift https frontend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.179 second response time https://wikitech.wikimedia.org/wiki/Swift [11:22:02] RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.179 second response time https://wikitech.wikimedia.org/wiki/Swift [11:22:02] RECOVERY - Swift https frontend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.177 second response time https://wikitech.wikimedia.org/wiki/Swift [11:22:02] RECOVERY - Swift https frontend on ms-fe2021 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.183 second response time https://wikitech.wikimedia.org/wiki/Swift [11:22:02] RECOVERY - Swift https frontend on ms-fe2024 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.183 second response time https://wikitech.wikimedia.org/wiki/Swift [11:22:02] RECOVERY - Swift https frontend on ms-fe2019 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.180 second response time https://wikitech.wikimedia.org/wiki/Swift [11:22:04] RECOVERY - Swift https frontend on ms-fe2020 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.189 second response time https://wikitech.wikimedia.org/wiki/Swift [11:22:04] RECOVERY - Swift https backend on ms-fe2021 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.193 second response time https://wikitech.wikimedia.org/wiki/Swift [11:22:04] RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.194 second response time https://wikitech.wikimedia.org/wiki/Swift [11:22:04] RECOVERY - Swift https backend on ms-fe2020 is OK: HTTP OK: HTTP/1.1 200 OK - 
572 bytes in 0.202 second response time https://wikitech.wikimedia.org/wiki/Swift [11:22:04] RECOVERY - Swift https backend on ms-fe2019 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.207 second response time https://wikitech.wikimedia.org/wiki/Swift [11:22:04] RECOVERY - Swift https backend on ms-fe2024 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.207 second response time https://wikitech.wikimedia.org/wiki/Swift [11:22:49] (03CR) 10Cathal Mooney: [C:03+1] "LGTM. In the sense it defaults to false and thus leaves things as they are. In terms of what roles/hosts we might want to flip that to t" [puppet] - 10https://gerrit.wikimedia.org/r/1287840 (owner: 10Muehlenhoff) [11:24:11] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2065.codfw.wmnet with OS bullseye [11:24:17] ^^ swift problems are due to our friend in AWS (see -security) same hourly pattern [11:24:20] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11925020 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2065.codfw.wmnet with OS bullseye compl... 
[11:25:51] RESOLVED: [3x] CoreRouterInterfaceDown: Core router interface down - cr4-ulsfo:ae0 (Core: cr3-ulsfo:ae0) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr4-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [11:34:29] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-toolhub-test: apply [11:34:32] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-toolhub-test: apply [11:49:53] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11925070 (10MatthewVernon) [11:54:08] (03PS1) 10Btullis: Add a hadoop::spark35 profile and deploy it alongside hadoop::spark3 [puppet] - 10https://gerrit.wikimedia.org/r/1287855 (https://phabricator.wikimedia.org/T338057) [11:54:16] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1287855 (https://phabricator.wikimedia.org/T338057) (owner: 10Btullis) [11:54:28] (03PS3) 10Cathal Mooney: ulsfo: enable ospf on new links via switches and set metric on direct [homer/public] - 10https://gerrit.wikimedia.org/r/1287831 (https://phabricator.wikimedia.org/T424611) [11:55:51] (03CR) 10CI reject: [V:04-1] ulsfo: enable ospf on new links via switches and set metric on direct [homer/public] - 10https://gerrit.wikimedia.org/r/1287831 (https://phabricator.wikimedia.org/T424611) (owner: 10Cathal Mooney) [11:57:06] (03CR) 10Cathal Mooney: "It's now failing due to line 180, which hasn't even changed!" 
[homer/public] - 10https://gerrit.wikimedia.org/r/1287831 (https://phabricator.wikimedia.org/T424611) (owner: 10Cathal Mooney) [11:58:36] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-fe2009.codfw.wmnet [11:59:14] !log depool / restart swift / repool on ms-fe2010 ms-fe2012 [11:59:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:00] RECOVERY - orchestrator resolve cache non-FQDNs on dborch1002 is OK: OK: all orchestrator resolve cache entries are FQDNs https://wikitech.wikimedia.org/wiki/Orchestrator [12:02:00] RESOLVED: [2x] SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [12:02:43] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-fe2009.codfw.wmnet [12:18:22] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove IPs that had been used for ulsfo cr links from dns - cmooney@cumin1003" [12:18:50] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove IPs that had been used for ulsfo cr links from dns - cmooney@cumin1003" [12:18:50] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:19:55] (03PS1) 10Filippo Giunchedi: pontoon: add current roles and their hostname convention [puppet] - 10https://gerrit.wikimedia.org/r/1287863 [12:20:58] PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [12:20:58] PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [12:21:06] 
PROBLEM - Swift https frontend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [12:21:12] PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [12:21:12] PROBLEM - Swift https frontend on ms-fe2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [12:21:12] PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [12:21:12] PROBLEM - Swift https frontend on ms-fe2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [12:21:12] PROBLEM - Swift https frontend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [12:21:12] PROBLEM - Swift https frontend on ms-fe2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [12:21:12] PROBLEM - Swift https frontend on ms-fe2024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [12:21:14] PROBLEM - Swift https backend on ms-fe2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [12:21:14] PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [12:21:14] PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [12:21:14] PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [12:21:15] PROBLEM - Swift https backend on ms-fe2024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [12:21:22] (03CR) 
10Filippo Giunchedi: [C:03+2] pontoon: add current roles and their hostname convention [puppet] - 10https://gerrit.wikimedia.org/r/1287863 (owner: 10Filippo Giunchedi) [12:21:48] RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 571 bytes in 0.179 second response time https://wikitech.wikimedia.org/wiki/Swift [12:21:48] RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.193 second response time https://wikitech.wikimedia.org/wiki/Swift [12:21:56] RECOVERY - Swift https frontend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.177 second response time https://wikitech.wikimedia.org/wiki/Swift [12:22:02] RECOVERY - Swift https frontend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.174 second response time https://wikitech.wikimedia.org/wiki/Swift [12:22:02] RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.174 second response time https://wikitech.wikimedia.org/wiki/Swift [12:22:02] RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.177 second response time https://wikitech.wikimedia.org/wiki/Swift [12:22:02] RECOVERY - Swift https frontend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Swift [12:22:02] RECOVERY - Swift https frontend on ms-fe2024 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.190 second response time https://wikitech.wikimedia.org/wiki/Swift [12:22:02] RECOVERY - Swift https frontend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.183 second response time https://wikitech.wikimedia.org/wiki/Swift [12:22:02] RECOVERY - Swift https frontend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.196 second response time https://wikitech.wikimedia.org/wiki/Swift [12:22:04] RECOVERY - Swift https backend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.191 
second response time https://wikitech.wikimedia.org/wiki/Swift [12:22:04] RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.192 second response time https://wikitech.wikimedia.org/wiki/Swift [12:22:04] RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 571 bytes in 0.185 second response time https://wikitech.wikimedia.org/wiki/Swift [12:22:04] RECOVERY - Swift https backend on ms-fe2024 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.193 second response time https://wikitech.wikimedia.org/wiki/Swift [12:22:05] RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.202 second response time https://wikitech.wikimedia.org/wiki/Swift [12:22:58] (03CR) 10Ladsgroup: "I can't think of an easy way to do this. Have you looked at the archive index thingy?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270513 (https://phabricator.wikimedia.org/T416548) (owner: 10Zabe) [12:32:09] (03CR) 10Brouberol: [C:03+1] "Nice" [puppet] - 10https://gerrit.wikimedia.org/r/1287837 (https://phabricator.wikimedia.org/T338057) (owner: 10Btullis) [12:35:50] (03CR) 10Cathal Mooney: [C:03+2] Allow to tighten ptrace access [puppet] - 10https://gerrit.wikimedia.org/r/1287840 (owner: 10Muehlenhoff) [12:48:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:03:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:08:50] (03PS2) 10Kosta Harlan: hcaptcha: Override upstream Access-Control-Allow-Origin with '*' [puppet] - 
10https://gerrit.wikimedia.org/r/1286872 (https://phabricator.wikimedia.org/T426178) [13:08:50] (03PS2) 10Kosta Harlan: hcaptcha: Stop attempting to cache credentialed proxy endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1286873 (https://phabricator.wikimedia.org/T426178) [13:08:50] (03PS2) 10Kosta Harlan: hcaptcha: Remove ineffective http-level CORS add_headers [puppet] - 10https://gerrit.wikimedia.org/r/1286874 (https://phabricator.wikimedia.org/T426178) [13:11:14] (03CR) 10CI reject: [V:04-1] hcaptcha: Override upstream Access-Control-Allow-Origin with '*' [puppet] - 10https://gerrit.wikimedia.org/r/1286872 (https://phabricator.wikimedia.org/T426178) (owner: 10Kosta Harlan) [13:12:06] (03PS3) 10Kosta Harlan: hcaptcha: Override upstream Access-Control-Allow-Origin with '*' [puppet] - 10https://gerrit.wikimedia.org/r/1286872 (https://phabricator.wikimedia.org/T426178) [13:12:07] (03CR) 10CI reject: [V:04-1] hcaptcha: Stop attempting to cache credentialed proxy endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1286873 (https://phabricator.wikimedia.org/T426178) (owner: 10Kosta Harlan) [13:12:29] (03CR) 10CI reject: [V:04-1] hcaptcha: Remove ineffective http-level CORS add_headers [puppet] - 10https://gerrit.wikimedia.org/r/1286874 (https://phabricator.wikimedia.org/T426178) (owner: 10Kosta Harlan) [13:14:17] (03CR) 10CI reject: [V:04-1] hcaptcha: Override upstream Access-Control-Allow-Origin with '*' [puppet] - 10https://gerrit.wikimedia.org/r/1286872 (https://phabricator.wikimedia.org/T426178) (owner: 10Kosta Harlan) [13:21:08] PROBLEM - Swift https frontend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [13:21:12] PROBLEM - Swift https frontend on ms-fe2022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [13:21:12] PROBLEM - Swift https frontend on ms-fe2019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Swift [13:21:12] PROBLEM - Swift https frontend on ms-fe2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [13:21:14] PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [13:21:14] PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [13:21:14] PROBLEM - Swift https backend on ms-fe2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [13:21:14] PROBLEM - Swift https frontend on ms-fe2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [13:21:14] PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [13:21:14] PROBLEM - Swift https backend on ms-fe2022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [13:21:14] PROBLEM - Swift https frontend on ms-fe2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [13:21:15] PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [13:21:15] PROBLEM - Swift https backend on ms-fe2019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [13:21:48] (03PS3) 10Kosta Harlan: hcaptcha: Stop attempting to cache credentialed proxy endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1286873 (https://phabricator.wikimedia.org/T426178) [13:21:54] (03PS3) 10Kosta Harlan: hcaptcha: Remove ineffective http-level CORS add_headers [puppet] - 10https://gerrit.wikimedia.org/r/1286874 (https://phabricator.wikimedia.org/T426178) [13:21:58] RECOVERY - Swift https frontend on 
ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.171 second response time https://wikitech.wikimedia.org/wiki/Swift [13:22:02] RECOVERY - Swift https frontend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Swift [13:22:02] RECOVERY - Swift https frontend on ms-fe2022 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.187 second response time https://wikitech.wikimedia.org/wiki/Swift [13:22:02] RECOVERY - Swift https frontend on ms-fe2019 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.182 second response time https://wikitech.wikimedia.org/wiki/Swift [13:22:04] RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.170 second response time https://wikitech.wikimedia.org/wiki/Swift [13:22:04] RECOVERY - Swift https frontend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.181 second response time https://wikitech.wikimedia.org/wiki/Swift [13:22:04] RECOVERY - Swift https backend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.190 second response time https://wikitech.wikimedia.org/wiki/Swift [13:22:04] RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.188 second response time https://wikitech.wikimedia.org/wiki/Swift [13:22:04] RECOVERY - Swift https frontend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.181 second response time https://wikitech.wikimedia.org/wiki/Swift [13:22:04] RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.199 second response time https://wikitech.wikimedia.org/wiki/Swift [13:22:05] RECOVERY - Swift https backend on ms-fe2022 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.200 second response time https://wikitech.wikimedia.org/wiki/Swift [13:22:06] RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 571 bytes in 0.190 second response time 
https://wikitech.wikimedia.org/wiki/Swift [13:22:06] RECOVERY - Swift https backend on ms-fe2019 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.196 second response time https://wikitech.wikimedia.org/wiki/Swift [13:24:10] (03CR) 10CI reject: [V:04-1] hcaptcha: Stop attempting to cache credentialed proxy endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1286873 (https://phabricator.wikimedia.org/T426178) (owner: 10Kosta Harlan) [13:24:39] (03CR) 10CI reject: [V:04-1] hcaptcha: Remove ineffective http-level CORS add_headers [puppet] - 10https://gerrit.wikimedia.org/r/1286874 (https://phabricator.wikimedia.org/T426178) (owner: 10Kosta Harlan) [13:25:00] (03PS4) 10Kosta Harlan: hcaptcha: Override upstream Access-Control-Allow-Origin with '*' [puppet] - 10https://gerrit.wikimedia.org/r/1286872 (https://phabricator.wikimedia.org/T426178) [13:27:30] (03CR) 10Harroyo-wmf: [C:03+1] hcaptcha: Override upstream Access-Control-Allow-Origin with '*' [puppet] - 10https://gerrit.wikimedia.org/r/1286872 (https://phabricator.wikimedia.org/T426178) (owner: 10Kosta Harlan) [13:31:59] (03PS4) 10Kosta Harlan: hcaptcha: Stop attempting to cache credentialed proxy endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1286873 (https://phabricator.wikimedia.org/T426178) [13:32:00] (03CR) 10Dreamy Jazz: hcaptcha: Override upstream Access-Control-Allow-Origin with '*' (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1286872 (https://phabricator.wikimedia.org/T426178) (owner: 10Kosta Harlan) [13:32:08] (03PS4) 10Kosta Harlan: hcaptcha: Remove ineffective http-level CORS add_headers [puppet] - 10https://gerrit.wikimedia.org/r/1286874 (https://phabricator.wikimedia.org/T426178) [13:32:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:34:09] (03CR) 10CI reject: [V:04-1] hcaptcha: Stop attempting to cache credentialed proxy endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1286873 (https://phabricator.wikimedia.org/T426178) (owner: 10Kosta Harlan) [13:34:43] (03CR) 10CI reject: [V:04-1] hcaptcha: Remove ineffective http-level CORS add_headers [puppet] - 10https://gerrit.wikimedia.org/r/1286874 (https://phabricator.wikimedia.org/T426178) (owner: 10Kosta Harlan) [13:36:14] (03PS5) 10Kosta Harlan: hcaptcha: Remove ineffective http-level CORS add_headers [puppet] - 10https://gerrit.wikimedia.org/r/1286874 (https://phabricator.wikimedia.org/T426178) [13:36:24] (03PS5) 10Kosta Harlan: hcaptcha: Stop attempting to cache credentialed proxy endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1286873 (https://phabricator.wikimedia.org/T426178) [13:37:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:40:32] (03PS5) 10Kosta Harlan: hcaptcha: Override upstream Access-Control-Allow-Origin with '*' [puppet] - 10https://gerrit.wikimedia.org/r/1286872 (https://phabricator.wikimedia.org/T426178) [13:40:38] (03CR) 10Kosta Harlan: hcaptcha: Override upstream Access-Control-Allow-Origin with '*' (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1286872 (https://phabricator.wikimedia.org/T426178) (owner: 10Kosta Harlan) [13:42:34] (03PS6) 10Kosta Harlan: hcaptcha: Stop attempting to cache credentialed proxy endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1286873 (https://phabricator.wikimedia.org/T426178) [13:42:40] (03PS6) 10Kosta Harlan: hcaptcha: Remove ineffective http-level CORS add_headers [puppet] - 10https://gerrit.wikimedia.org/r/1286874 
(https://phabricator.wikimedia.org/T426178) [13:52:27] (03CR) 10Dreamy Jazz: [C:03+1] hcaptcha: Override upstream Access-Control-Allow-Origin with '*' [puppet] - 10https://gerrit.wikimedia.org/r/1286872 (https://phabricator.wikimedia.org/T426178) (owner: 10Kosta Harlan) [13:52:44] (03CR) 10Dreamy Jazz: [C:03+1] "(Seems fine from a non-SRE perspective)" [puppet] - 10https://gerrit.wikimedia.org/r/1286872 (https://phabricator.wikimedia.org/T426178) (owner: 10Kosta Harlan) [13:52:51] FIRING: [2x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in eqsin #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [13:59:31] here and having a look. [14:07:51] FIRING: [2x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in eqsin #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [14:13:40] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vrts, 10Znuny: VRTS is spammed with bounce e-mails and is going to break - https://phabricator.wikimedia.org/T408632#11925605 (10LSobanski) 05Open→03Resolved With the bounces queue and the junk queue alert in place I think this can... [14:13:41] 06SRE, 06Traffic: Nakavo - Rate Limiting Query - https://phabricator.wikimedia.org/T422872#11925607 (10Aklapper) > I will change the logic so that we always default to a thumbnail in this case, thank you for sharing. I will monitor the changes and hopefully no more 429 will show up @NakavoDev Hi, is there sti... 
[14:20:57] FIRING: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:21:08] PROBLEM - Swift https frontend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [14:21:12] PROBLEM - Swift https frontend on ms-fe2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [14:21:14] PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [14:21:14] PROBLEM - Swift https backend on ms-fe2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [14:21:14] PROBLEM - Swift https backend on ms-fe2020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [14:21:14] PROBLEM - Swift https frontend on ms-fe2020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [14:21:14] PROBLEM - Swift https frontend on ms-fe2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [14:21:16] PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [14:21:50] (03PS2) 10Btullis: Add a hadoop::spark35 profile and deploy it alongside hadoop::spark3 [puppet] - 10https://gerrit.wikimedia.org/r/1287855 (https://phabricator.wikimedia.org/T338057) [14:21:58] RECOVERY - Swift https frontend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.171 second response time https://wikitech.wikimedia.org/wiki/Swift [14:22:02] RECOVERY - Swift https frontend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 
0.187 second response time https://wikitech.wikimedia.org/wiki/Swift [14:22:04] RECOVERY - Swift https frontend on ms-fe2020 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.185 second response time https://wikitech.wikimedia.org/wiki/Swift [14:22:04] RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.183 second response time https://wikitech.wikimedia.org/wiki/Swift [14:22:04] RECOVERY - Swift https backend on ms-fe2020 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.196 second response time https://wikitech.wikimedia.org/wiki/Swift [14:22:04] RECOVERY - Swift https backend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.196 second response time https://wikitech.wikimedia.org/wiki/Swift [14:22:04] RECOVERY - Swift https frontend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.183 second response time https://wikitech.wikimedia.org/wiki/Swift [14:22:06] RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 571 bytes in 0.177 second response time https://wikitech.wikimedia.org/wiki/Swift [14:22:26] (03PS1) 10Bking: WIP: relforge: Switch to an OCI-image based profile [puppet] - 10https://gerrit.wikimedia.org/r/1287889 (https://phabricator.wikimedia.org/T425585) [14:22:45] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1287855 (https://phabricator.wikimedia.org/T338057) (owner: 10Btullis) [14:22:51] FIRING: [2x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in eqsin #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [14:23:06] (03CR) 10CI reject: [V:04-1] WIP: relforge: Switch to an OCI-image based profile [puppet] - 10https://gerrit.wikimedia.org/r/1287889 (https://phabricator.wikimedia.org/T425585) (owner: 10Bking) [14:24:34] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Investigate db2218 crash - 
https://phabricator.wikimedia.org/T426383#11925674 (10Jhancock.wm) a:03Jhancock.wm [14:24:38] FIRING: CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=59&orgId=1&from=now-6M&to=now&var-search_cluster=cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateT [14:25:57] RESOLVED: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:27:51] FIRING: [2x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in eqsin #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [14:32:51] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in eqsin #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqsin&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [14:35:31] (03PS1) 10Filippo Giunchedi: zookeeper: fail on empty myid [puppet] - 10https://gerrit.wikimedia.org/r/1287893 (https://phabricator.wikimedia.org/T422646) [14:43:15] (03PS1) 10Daimona Eaytoy: Store uncomputed references delta as null, not 0 [extensions/CampaignEvents] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1287895 (https://phabricator.wikimedia.org/T426002) [14:43:29] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 18 UTC 
afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/CampaignEvents] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1287895 (https://phabricator.wikimedia.org/T426002) (owner: 10Daimona Eaytoy) [14:43:33] (03CR) 10Ssingh: [C:03+2] hcaptcha: Override upstream Access-Control-Allow-Origin with '*' [puppet] - 10https://gerrit.wikimedia.org/r/1286872 (https://phabricator.wikimedia.org/T426178) (owner: 10Kosta Harlan) [14:47:10] (03PS3) 10Btullis: Add a hadoop::spark35 profile and deploy it alongside hadoop::spark3 [puppet] - 10https://gerrit.wikimedia.org/r/1287855 (https://phabricator.wikimedia.org/T338057) [14:47:23] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1287855 (https://phabricator.wikimedia.org/T338057) (owner: 10Btullis) [14:53:43] (03PS4) 10Cathal Mooney: ulsfo: enable ospf on new links via switches and set metric on direct [homer/public] - 10https://gerrit.wikimedia.org/r/1287831 (https://phabricator.wikimedia.org/T424611) [14:55:49] (03PS5) 10Cathal Mooney: ulsfo: enable ospf on new links via switches and set metric on direct [homer/public] - 10https://gerrit.wikimedia.org/r/1287831 (https://phabricator.wikimedia.org/T424611) [14:57:09] (03CR) 10CI reject: [V:04-1] Store uncomputed references delta as null, not 0 [extensions/CampaignEvents] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1287895 (https://phabricator.wikimedia.org/T426002) (owner: 10Daimona Eaytoy) [14:58:58] (03CR) 10Daimona Eaytoy: "recheck" [extensions/CampaignEvents] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1287895 (https://phabricator.wikimedia.org/T426002) (owner: 10Daimona Eaytoy) [15:00:22] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE, 13Patch-For-Review: Grant mcollins level 1 access to analytics-privatedata-users - https://phabricator.wikimedia.org/T426348#11925789 (10Ahoelzl) @Dzahn we would appreciate if this could be expedited, metrics sign off is blocked by this 
/ access to related... [15:02:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:02:48] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2026-04-24 - 2026-05-15), 13Patch-For-Review: Grant mcollins level 1 access to analytics-privatedata-users - https://phabricator.wikimedia.org/T426348#11925795 (10BTullis) a:03BTullis [15:09:00] (03PS1) 10Btullis: Add mcollins to analytics-privatedata-users without SSH access [puppet] - 10https://gerrit.wikimedia.org/r/1287902 (https://phabricator.wikimedia.org/T426348) [15:09:46] (03CR) 10CI reject: [V:04-1] Add mcollins to analytics-privatedata-users without SSH access [puppet] - 10https://gerrit.wikimedia.org/r/1287902 (https://phabricator.wikimedia.org/T426348) (owner: 10Btullis) [15:11:06] (03PS2) 10Btullis: Add mcollins to analytics-privatedata-users without SSH access [puppet] - 10https://gerrit.wikimedia.org/r/1287902 (https://phabricator.wikimedia.org/T426348) [15:12:22] (03CR) 10Dzahn: "duplicate https://gerrit.wikimedia.org/r/c/operations/puppet/+/1287516" [puppet] - 10https://gerrit.wikimedia.org/r/1287902 (https://phabricator.wikimedia.org/T426348) (owner: 10Btullis) [15:13:41] FIRING: [43x] SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:17:42] (03CR) 10Btullis: [C:03+1] admin: upgrade mcollins from ldap_only to a-privatedata lvl 1 [puppet] - 10https://gerrit.wikimedia.org/r/1287516 (https://phabricator.wikimedia.org/T426348) (owner: 10Dzahn) [15:18:18] (03Abandoned) 10Btullis: Add mcollins to analytics-privatedata-users without SSH access [puppet] - 
10https://gerrit.wikimedia.org/r/1287902 (https://phabricator.wikimedia.org/T426348) (owner: 10Btullis) [15:20:15] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2026-04-24 - 2026-05-15), 13Patch-For-Review: Grant mcollins level 1 access to analytics-privatedata-users - https://phabricator.wikimedia.org/T426348#11925843 (10BTullis) a:05BTullis→03Dzahn [15:21:13] PROBLEM - Swift https backend on ms-fe2018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [15:21:13] PROBLEM - Swift https frontend on ms-fe2018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [15:21:15] PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [15:21:15] PROBLEM - Swift https backend on ms-fe2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [15:21:15] PROBLEM - Swift https frontend on ms-fe2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [15:21:15] PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [15:21:15] PROBLEM - Swift https backend on ms-fe2021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [15:21:16] PROBLEM - Swift https backend on ms-fe2023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [15:21:18] (03PS1) 10Elukey: redfish: add add_account method for RedfishDell [software/spicerack] - 10https://gerrit.wikimedia.org/r/1287905 (https://phabricator.wikimedia.org/T426180) [15:21:34] (03CR) 10Dzahn: admin: upgrade mcollins from ldap_only to a-privatedata lvl 1 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1287516 (https://phabricator.wikimedia.org/T426348) (owner: 10Dzahn) [15:22:03] RECOVERY - 
Swift https backend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.200 second response time https://wikitech.wikimedia.org/wiki/Swift [15:22:03] RECOVERY - Swift https frontend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.180 second response time https://wikitech.wikimedia.org/wiki/Swift [15:22:05] RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.190 second response time https://wikitech.wikimedia.org/wiki/Swift [15:22:05] RECOVERY - Swift https frontend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.185 second response time https://wikitech.wikimedia.org/wiki/Swift [15:22:05] RECOVERY - Swift https backend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.204 second response time https://wikitech.wikimedia.org/wiki/Swift [15:22:05] RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.190 second response time https://wikitech.wikimedia.org/wiki/Swift [15:22:05] RECOVERY - Swift https backend on ms-fe2021 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.199 second response time https://wikitech.wikimedia.org/wiki/Swift [15:22:06] RECOVERY - Swift https backend on ms-fe2023 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.242 second response time https://wikitech.wikimedia.org/wiki/Swift [15:28:41] (03CR) 10Dzahn: [C:03+2] admin: upgrade mcollins from ldap_only to a-privatedata lvl 1 [puppet] - 10https://gerrit.wikimedia.org/r/1287516 (https://phabricator.wikimedia.org/T426348) (owner: 10Dzahn) [15:29:17] (03PS6) 10Cathal Mooney: ulsfo: enable ospf on new links via switches and set metric on direct [homer/public] - 10https://gerrit.wikimedia.org/r/1287831 (https://phabricator.wikimedia.org/T424611) [15:29:58] 06SRE, 06Data-Engineering (Q4 FS25/26 April 1st - June 30st), 10Event-Platform: Flink Page View: Create K8s resources - https://phabricator.wikimedia.org/T426425 (10JMonton-WMF) 03NEW [15:30:15] (03CR) 10Btullis: "I think 
that your option C)" [puppet] - 10https://gerrit.wikimedia.org/r/1287374 (https://phabricator.wikimedia.org/T425087) (owner: 10Btullis) [15:30:39] (03CR) 10CI reject: [V:04-1] ulsfo: enable ospf on new links via switches and set metric on direct [homer/public] - 10https://gerrit.wikimedia.org/r/1287831 (https://phabricator.wikimedia.org/T424611) (owner: 10Cathal Mooney) [15:33:21] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 152137704 and 3 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:34:21] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 2932992 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:35:22] (03PS1) 10JavierMonton: stream: webrequest-page-view [puppet] - 10https://gerrit.wikimedia.org/r/1287906 (https://phabricator.wikimedia.org/T426425) [15:35:52] (03CR) 10CI reject: [V:04-1] stream: webrequest-page-view [puppet] - 10https://gerrit.wikimedia.org/r/1287906 (https://phabricator.wikimedia.org/T426425) (owner: 10JavierMonton) [15:36:58] (03PS2) 10JavierMonton: stream: webrequest-page-view [puppet] - 10https://gerrit.wikimedia.org/r/1287906 (https://phabricator.wikimedia.org/T426425) [15:38:45] (03PS10) 10Bking: WIP: relforge: Switch to an OCI-image based profile [puppet] - 10https://gerrit.wikimedia.org/r/1287889 (https://phabricator.wikimedia.org/T425585) [15:40:01] (03PS4) 10Btullis: Add a hadoop::spark35 profile and deploy it alongside hadoop::spark3 [puppet] - 10https://gerrit.wikimedia.org/r/1287855 (https://phabricator.wikimedia.org/T338057) [15:40:27] (03PS11) 10Bking: WIP: relforge: Switch to an OCI-image based profile [puppet] - 10https://gerrit.wikimedia.org/r/1287889 (https://phabricator.wikimedia.org/T425585) [15:40:43] (03PS12) 10Bking: relforge: Switch to an OCI-image based profile [puppet] - 10https://gerrit.wikimedia.org/r/1287889 
(https://phabricator.wikimedia.org/T425585) [15:41:43] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1287855 (https://phabricator.wikimedia.org/T338057) (owner: 10Btullis) [15:41:51] (03PS7) 10Cathal Mooney: ulsfo: enable ospf on new links via switches and set metric on direct [homer/public] - 10https://gerrit.wikimedia.org/r/1287831 (https://phabricator.wikimedia.org/T424611) [15:43:17] (03CR) 10CI reject: [V:04-1] ulsfo: enable ospf on new links via switches and set metric on direct [homer/public] - 10https://gerrit.wikimedia.org/r/1287831 (https://phabricator.wikimedia.org/T424611) (owner: 10Cathal Mooney) [15:44:24] (03CR) 10Btullis: [C:03+2] Add support for creating arbitrary PVCs to mediawiki-dumps-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287426 (https://phabricator.wikimedia.org/T422179) (owner: 10Btullis) [15:44:39] (03PS8) 10Cathal Mooney: ulsfo: enable ospf on new links via switches and set metric on direct [homer/public] - 10https://gerrit.wikimedia.org/r/1287831 (https://phabricator.wikimedia.org/T424611) [15:44:52] (03PS4) 10Tiziano Fogli: logstash/thanos-qfe: add event.hash and event.start [puppet] - 10https://gerrit.wikimedia.org/r/1287827 [15:44:55] (03CR) 10Tiziano Fogli: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1287827 (owner: 10Tiziano Fogli) [15:46:34] (03Merged) 10jenkins-bot: Add support for creating arbitrary PVCs to mediawiki-dumps-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287426 (https://phabricator.wikimedia.org/T422179) (owner: 10Btullis) [15:49:15] (03PS5) 10Tiziano Fogli: logstash/thanos-qfe: add event.start [puppet] - 10https://gerrit.wikimedia.org/r/1287827 [15:49:25] (03CR) 10Brouberol: relforge: Switch to an OCI-image based profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1287889 (https://phabricator.wikimedia.org/T425585) (owner: 10Bking) [15:50:02] (03CR) 10Ebernhardson: [C:03+1] "conceptually, this should work for 
relforge. Not familiar with the actual implementation here." [puppet] - 10https://gerrit.wikimedia.org/r/1287889 (https://phabricator.wikimedia.org/T425585) (owner: 10Bking) [15:51:55] (03PS13) 10Bking: relforge: Switch to an OCI-image based profile [puppet] - 10https://gerrit.wikimedia.org/r/1287889 (https://phabricator.wikimedia.org/T425585) [15:52:51] (03CR) 10Bking: "Updated commit msg to make this more clear." [puppet] - 10https://gerrit.wikimedia.org/r/1287889 (https://phabricator.wikimedia.org/T425585) (owner: 10Bking) [16:00:30] !log dancy@deploy1003 Installing scap version "4.265.1" for 2 host(s) [16:02:28] !log dancy@deploy1003 Installation of scap version "4.265.1" completed for 2 hosts [16:02:54] (03CR) 10Xcollazo: "Can we please do a bit more testing on test cluster?" [puppet] - 10https://gerrit.wikimedia.org/r/1287837 (https://phabricator.wikimedia.org/T338057) (owner: 10Btullis) [16:09:23] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:33:30] FIRING: Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [16:34:23] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:36:20] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2026-04-24 - 2026-05-15): Grant mcollins level 1 access to analytics-privatedata-users - https://phabricator.wikimedia.org/T426348#11926377 (10Dzahn) [16:36:50] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2026-04-24 - 2026-05-15): Grant mcollins level 1 access to 
analytics-privatedata-users - https://phabricator.wikimedia.org/T426348#11926378 (10Dzahn) 05Open→03Resolved This is resolved. mcollins is in the group now. ` [an-master1003:~] $ id mcollin... [16:39:31] (03CR) 10Dzahn: "actually I am not sure why the reviewer-bot added me to this. maybe I should check my regexes in the reviewer-bot page." [puppet] - 10https://gerrit.wikimedia.org/r/1287889 (https://phabricator.wikimedia.org/T425585) (owner: 10Bking) [16:42:51] FIRING: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [16:43:30] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1287837 (https://phabricator.wikimedia.org/T338057) (owner: 10Btullis) [16:47:51] RESOLVED: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [16:49:55] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11926436 (10wiki_willy) Thanks @elukey, I went ahead and sent it over to Ken from Supermicro, so that he can try to push this along a bit quicker. >>! In T394357#11924651, @eluke... [16:51:11] (03PS1) 10Jgreen: Create frdb-analytics-write.wmnet CNAME to FR analytics db. [dns] - 10https://gerrit.wikimedia.org/r/1287917 (https://phabricator.wikimedia.org/T426448) [16:51:52] (03CR) 10CI reject: [V:04-1] Create frdb-analytics-write.wmnet CNAME to FR analytics db. 
[dns] - 10https://gerrit.wikimedia.org/r/1287917 (https://phabricator.wikimedia.org/T426448) (owner: 10Jgreen) [16:53:10] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [16:53:15] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [16:55:56] (03CR) 10Xcollazo: [C:03+1] "This all LGTM, happy to see both 3.1 and 3.5 can coexist." [puppet] - 10https://gerrit.wikimedia.org/r/1287855 (https://phabricator.wikimedia.org/T338057) (owner: 10Btullis) [16:58:52] (03Abandoned) 10Jgreen: Create frdb-analytics-write.wmnet CNAME to FR analytics db. [dns] - 10https://gerrit.wikimedia.org/r/1287917 (https://phabricator.wikimedia.org/T426448) (owner: 10Jgreen) [17:07:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in eqiad #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqiad&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [17:09:25] (03CR) 10Btullis: Add a hadoop::spark35 profile and deploy it alongside hadoop::spark3 (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1287855 (https://phabricator.wikimedia.org/T338057) (owner: 10Btullis) [17:14:52] (03CR) 10Cwhite: logstash/thanos-qfe: add event.start (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1287827 (owner: 10Tiziano Fogli) [17:17:51] RESOLVED: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [17:19:05] !incidents [17:19:06] 7970 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad) [17:19:06] 7968 (RESOLVED) [3x] ATSBackendErrorsHigh cache_upload sre 
(swift.discovery.wmnet) [17:19:06] 7964 (RESOLVED) [2x] ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet) [17:19:06] 7959 (RESOLVED) db2222 (paged)/MariaDB Replica Lag: s7 (paged) [17:19:06] 7960 (RESOLVED) db2220 (paged)/MariaDB Replica Lag: s7 (paged) [17:19:06] 7958 (RESOLVED) db2221 (paged)/MariaDB Replica Lag: s7 (paged) [17:19:07] 7956 (RESOLVED) db2182 (paged)/MariaDB Replica Lag: s7 (paged) [17:19:07] 7957 (RESOLVED) db2208 (paged)/MariaDB Replica Lag: s7 (paged) [17:19:08] 7962 (RESOLVED) db2168 (paged)/MariaDB Replica Lag: s7 (paged) [17:19:08] 7961 (RESOLVED) db2159 (paged)/MariaDB Replica Lag: s7 (paged) [17:19:09] 7963 (RESOLVED) db2218 (paged)/MariaDB Replica Lag: s7 (paged) [17:19:09] 7954 (RESOLVED) db2218 (paged)/MariaDB Replica IO: s7 (paged) [17:19:10] 7953 (RESOLVED) db2218 (paged)/MariaDB Replica SQL: s7 (paged) [17:19:10] 7951 (RESOLVED) db2222 (paged)/MariaDB Replica IO: s7 (paged) [17:19:11] 7950 (RESOLVED) db2220 (paged)/MariaDB Replica IO: s7 (paged) [17:19:11] 7949 (RESOLVED) db2208 (paged)/MariaDB Replica IO: s7 (paged) [17:19:12] 7948 (RESOLVED) db2168 (paged)/MariaDB Replica IO: s7 (paged) [17:19:12] 7947 (RESOLVED) db2159 (paged)/MariaDB Replica IO: s7 (paged) [17:19:13] 7952 (RESOLVED) db2221 (paged)/MariaDB Replica IO: s7 (paged) [17:19:13] 7946 (RESOLVED) db2182 (paged)/MariaDB Replica IO: s7 (paged) [17:19:14] 7955 (RESOLVED) db2218 (paged)/MariaDB read only s7 (paged) [17:19:14] 7945 (RESOLVED) Host db2218 (paged) [17:19:15] 7943 (RESOLVED) [3x] TransitPeeringTransportOutSaturation network sre (gnmi) [17:19:15] 7944 (RESOLVED) [4x] ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet) [17:19:16] 7941 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule@main) [17:19:16] 7940 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule@main) [17:19:17] 7942 (RESOLVED) ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw) [17:19:17] 7939 (RESOLVED) ProbeDown 
sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw) [17:19:18] 7938 (RESOLVED) [3x] ATSBackendErrorsHigh cache_text sre (performance.discovery.wmnet) [17:36:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in esams #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [17:44:07] (03PS1) 10Eevans: Add component/cassandra50 for Cassandra 5.0.x releases [puppet] - 10https://gerrit.wikimedia.org/r/1287923 (https://phabricator.wikimedia.org/T418419) [17:46:51] FIRING: [4x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [17:49:22] 06SRE, 06Data-Platform-SRE, 06Infrastructure-Foundations, 07Epic: Migrate Docker images running in Production away from Bullseye - https://phabricator.wikimedia.org/T416452#11926637 (10Eevans) [17:52:42] 06SRE, 06Data-Platform-SRE, 06Infrastructure-Foundations, 07Epic: Migrate Docker images running in Production away from Bullseye - https://phabricator.wikimedia.org/T416452#11926702 (10Eevans) [17:56:51] FIRING: [4x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [18:01:51] FIRING: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [18:03:30] FIRING: [2x] Outbound discards: Device 
asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [18:06:51] RESOLVED: [2x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [18:19:24] 10SRE-Access-Requests: Grant Access to analytics-privatedata-users for zsinger - https://phabricator.wikimedia.org/T426458 (10ZSinger-WMF) 03NEW [18:24:38] FIRING: CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=59&orgId=1&from=now-6M&to=now&var-search_cluster=cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateT [18:29:15] (03CR) 10Btullis: [C:03+1] Add max-batches option to cap the size of a wikibase RDF dump. [dumps] - 10https://gerrit.wikimedia.org/r/1286487 (https://phabricator.wikimedia.org/T425036) (owner: 10Lerickson) [18:41:46] (03CR) 10Cathal Mooney: "Merging patch, changes are all applied network wise for past few hours all working ok. 
https://phabricator.wikimedia.org/P92555" [homer/public] - 10https://gerrit.wikimedia.org/r/1287831 (https://phabricator.wikimedia.org/T424611) (owner: 10Cathal Mooney) [18:41:49] (03CR) 10Cathal Mooney: [C:03+2] ulsfo: enable ospf on new links via switches and set metric on direct [homer/public] - 10https://gerrit.wikimedia.org/r/1287831 (https://phabricator.wikimedia.org/T424611) (owner: 10Cathal Mooney) [18:44:14] (03Merged) 10jenkins-bot: ulsfo: enable ospf on new links via switches and set metric on direct [homer/public] - 10https://gerrit.wikimedia.org/r/1287831 (https://phabricator.wikimedia.org/T424611) (owner: 10Cathal Mooney) [18:47:30] (03PS1) 10Dzahn: zuul: let the launcher use the zuul user, not a separate one [puppet] - 10https://gerrit.wikimedia.org/r/1287933 (https://phabricator.wikimedia.org/T395938) [18:48:04] (03PS2) 10Dzahn: zuul: let the launcher use the zuul user, not a separate one [puppet] - 10https://gerrit.wikimedia.org/r/1287933 (https://phabricator.wikimedia.org/T395938) [18:57:37] (03CR) 10Dduvall: [C:03+1] zuul: let the launcher use the zuul user, not a separate one [puppet] - 10https://gerrit.wikimedia.org/r/1287933 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [19:02:25] RESOLVED: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:09:49] (03CR) 10Brouberol: [C:03+1] relforge: Switch to an OCI-image based profile [puppet] - 10https://gerrit.wikimedia.org/r/1287889 (https://phabricator.wikimedia.org/T425585) (owner: 10Bking) [19:13:41] FIRING: [43x] SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:13:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [19:15:32] (03CR) 10Dzahn: [C:03+2] zuul: let the launcher use the zuul user, not a separate one [puppet] - 10https://gerrit.wikimedia.org/r/1287933 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [19:18:20] !log vriley@cumin1003 START - Cookbook sre.dns.netbox [19:18:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [19:21:09] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:21:38] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host db1290 [19:22:39] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1290 [19:23:27] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1290.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:30:55] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1290.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:32:21] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host db1290.eqiad.wmnet with OS bookworm [19:32:32] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: 
Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11927030 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host db1290.eqiad.wmnet with OS bookworm [19:38:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:38:31] RESOLVED: Outbound discards: Device asw2-b-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [19:47:51] !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1290.eqiad.wmnet with reason: host reimage [19:50:46] (03CR) 10Srishakatux: Gender namespaces on Serbo-Croatian Wikipedia (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1285467 (https://phabricator.wikimedia.org/T425402) (owner: 10Acamicamacaraca) [19:53:13] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1290.eqiad.wmnet with reason: host reimage [20:09:18] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [20:12:24] vriley@cumin1003 reimage (PID 4175483) is awaiting input [20:12:58] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [20:12:59] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1290.eqiad.wmnet with OS bookworm [20:13:07] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11927126 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by 
vriley@cumin1003 for host db1290.eqiad.wmnet with OS bookworm completed: - db1290 (**PASS**) -... [20:13:57] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11927128 (10VRiley-WMF) 05Open→03Resolved @Marostegui this has been completed [20:31:55] (03PS1) 10Seddon: Revert "Enable wgTrackMediaRequestProvenance on remaining Wikipedias" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287940 [20:37:12] (03PS2) 10Seddon: Revert "Enable wgTrackMediaRequestProvenance on remaining Wikipedias" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287940 (https://phabricator.wikimedia.org/T425580) [20:38:03] (03PS3) 10Seddon: Revert "Enable wgTrackMediaRequestProvenance on remaining Wikipedias" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287940 (https://phabricator.wikimedia.org/T425580) [20:42:25] (03PS4) 10Subramanya Sastry: Revert "Enable wgTrackMediaRequestProvenance on remaining Wikipedias" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287940 (https://phabricator.wikimedia.org/T425580) (owner: 10Seddon) [20:45:37] (03CR) 10Pmiazga: [C:03+1] Revert "Enable wgTrackMediaRequestProvenance on remaining Wikipedias" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287940 (https://phabricator.wikimedia.org/T425580) (owner: 10Seddon) [20:45:59] (03CR) 10Tiziano Fogli: logstash/thanos-qfe: add event.start (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1287827 (owner: 10Tiziano Fogli) [20:54:03] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287940 (https://phabricator.wikimedia.org/T425580) (owner: 10Seddon) [20:55:07] (03Merged) 10jenkins-bot: Revert "Enable wgTrackMediaRequestProvenance on remaining Wikipedias" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287940 (https://phabricator.wikimedia.org/T425580) (owner: 10Seddon) [20:55:32] Seddon: Can you 
validate it against mw-debug, or should I just ship it and hope? [20:55:47] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1287940|Revert "Enable wgTrackMediaRequestProvenance on remaining Wikipedias" (T425580)]] [20:55:51] T425580: [Spike] [BUG] All images breaking on iOS - https://phabricator.wikimedia.org/T425580 [20:57:38] (03PS1) 10Atsuko: opensearch-cluster: full permission for anonymous users [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287920 (https://phabricator.wikimedia.org/T426073) [20:57:45] !log jforrester@deploy1003 jforrester, seddon: Backport for [[gerrit:1287940|Revert "Enable wgTrackMediaRequestProvenance on remaining Wikipedias" (T425580)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:58:58] @James_F I don't actually know. I don't have mw-debug set up and am not quite sure of the right way of testing. Dmitry might have a better idea. But it's a simple enough change that ship and hope is probably fine. [20:59:10] Ack, OK, let's ship it. [20:59:12] !log jforrester@deploy1003 jforrester, seddon: Continuing with deployment [20:59:26] Nothing looked broken in my manual desktop and mobile Web testing. [21:00:42] given that it's a revert to recent state it seems pretty low risk. [21:03:30] !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1287940|Revert "Enable wgTrackMediaRequestProvenance on remaining Wikipedias" (T425580)]] (duration: 07m 43s) [21:03:33] T425580: [Spike] [BUG] All images breaking on iOS - https://phabricator.wikimedia.org/T425580 [21:03:37] Seddon: And live.
[21:21:58] (PS1) Bking: bking: Replace my non-FIDO SSH key with a backup FIDO-backed key [puppet] - https://gerrit.wikimedia.org/r/1287947
[22:12:26] (CR) Ryan Kemper: [C:+2] archiva: block scraper UAs at nginx [puppet] - https://gerrit.wikimedia.org/r/1286536 (https://phabricator.wikimedia.org/T426114) (owner: Ryan Kemper)
[22:12:47] (CR) Ryan Kemper: [C:+2] archiva: block scraper UAs at nginx (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1286536 (https://phabricator.wikimedia.org/T426114) (owner: Ryan Kemper)
[22:24:38] FIRING: CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=59&orgId=1&from=now-6M&to=now&var-search_cluster=cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateT
[22:45:30] (CR) Ryan Kemper: [C:+1] bking: Replace my non-FIDO SSH key with a backup FIDO-backed key [puppet] - https://gerrit.wikimedia.org/r/1287947 (owner: Bking)
[23:13:41] FIRING: [43x] SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:18:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[23:38:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:40:19] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1287957
[23:40:19] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1287957 (owner: TrainBranchBot)
[23:51:29] (CR) CI reject: [V:-1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1287957 (owner: TrainBranchBot)