[07:16:25] netops, Infrastructure-Foundations, SRE, SRE-OnFire, and 2 others: Upgrade POPs asw to Junos 21 - https://phabricator.wikimedia.org/T316532 (ayounsi) From JTAC: > This message “Read-only file system” suggest file system issues. I found one case with same behavior and the upgrade had to do it with...
[07:17:47] netops, Infrastructure-Foundations, SRE, ops-eqiad: [eqiad] faulty VC optics - https://phabricator.wikimedia.org/T325803 (ayounsi) Open→Resolved All good, thanks a lot!
[10:39:30] netops, Infrastructure-Foundations, SRE, SRE-OnFire, and 2 others: Upgrade POPs asw to Junos 21 - https://phabricator.wikimedia.org/T316532 (ayounsi) fpc0 went back up fine, but fpc1 not so much... It's not fully booting and stuck at a busybox like shell. Root password works so that means the con...
[11:23:10] netops, Infrastructure-Foundations, SRE, SRE-OnFire, and 2 others: Upgrade POPs asw to Junos 21 - https://phabricator.wikimedia.org/T316532 (ayounsi) We tried to boot on the Recovery Junos (both 14 and 20) but the same error happened. Next step is onsite "format install" https://supportportal.ju...
[15:37:47] (VarnishPrometheusExporterDown) firing: (4) Varnish Exporter on instance cp5017:9331 is unreachable - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[15:41:21] netops, Infrastructure-Foundations, SRE, SRE-OnFire, and 2 others: Upgrade POPs asw to Junos 21 - https://phabricator.wikimedia.org/T316532 (ayounsi) > Next step is onsite "format install" https://supportportal.juniper.net/s/article/EX-QFX-Procedure-to-format-install-QFX5K-device-using-a-USB?lang...
[15:42:47] (VarnishPrometheusExporterDown) resolved: (16) Varnish Exporter on instance cp5017:9331 is unreachable - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[15:43:35] (PurgedHighEventLag) firing: High event process lag with purged on cp5024:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin%20prometheus/ops&var-instance=cp5024 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[15:44:35] (PurgedHighBacklogQueue) firing: (4) Large backlog queue for purged on cp5018:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighBacklogQueue
[15:45:25] ^ expected from a long network outage, hopefully it will just catch up on its own. We might even want to wait for purged to catch up before repooling.
[15:48:35] (PurgedHighEventLag) firing: (13) High event process lag with purged on cp5018:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[15:49:35] (PurgedHighBacklogQueue) firing: (5) Large backlog queue for purged on cp5018:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighBacklogQueue
[15:54:35] (PurgedHighBacklogQueue) firing: (4) Large backlog queue for purged on cp5018:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighBacklogQueue
[15:58:35] (PurgedHighEventLag) firing: (14) High event process lag with purged on cp5018:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[16:03:35] (PurgedHighEventLag) resolved: (12) High event process lag with purged on cp5018:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[16:14:35] (PurgedHighBacklogQueue) firing: (5) Large backlog queue for purged on cp5018:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighBacklogQueue
[16:24:35] (PurgedHighBacklogQueue) firing: (5) Large backlog queue for purged on cp5018:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighBacklogQueue
[16:29:35] (PurgedHighBacklogQueue) firing: (5) Large backlog queue for purged on cp5018:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighBacklogQueue
[16:30:40] I'm really not even sure why cp5018 is flapping and two others (cp5020 + cp5024) are still alerting fairly persistently. The linked graphs show it already having recovered on all of those.
[16:30:52] Will wait for a bit more settling time
[16:34:35] (PurgedHighBacklogQueue) firing: (4) Large backlog queue for purged on cp5018:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighBacklogQueue
[16:39:35] (PurgedHighBacklogQueue) firing: (5) Large backlog queue for purged on cp5018:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighBacklogQueue
[16:44:35] (PurgedHighBacklogQueue) resolved: (8) Large backlog queue for purged on cp5018:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighBacklogQueue
[16:53:56] netops, Infrastructure-Foundations, SRE, ops-eqsin, Wikimedia-Incident: asw1-eqsin: VC mastership change - https://phabricator.wikimedia.org/T323094 (ayounsi) Stalled→Resolved That's all done.
[16:54:03] netops, Infrastructure-Foundations, SRE, SRE-OnFire, and 2 others: Upgrade POPs asw to Junos 21 - https://phabricator.wikimedia.org/T316532 (ayounsi)
[16:54:31] netops, Infrastructure-Foundations, SRE, SRE-OnFire, and 2 others: Upgrade POPs asw to Junos 21 - https://phabricator.wikimedia.org/T316532 (ayounsi)
[17:57:07] Traffic, SRE: Remove IPSec/Strongswan from Puppet repository - https://phabricator.wikimedia.org/T326745 (BCornwall)
[17:57:27] Traffic, SRE: Remove IPSec/Strongswan from Puppet repository - https://phabricator.wikimedia.org/T326745 (BCornwall) p:Triage→Low
[18:04:33] Traffic, SRE: Remove IPSec/Strongswan from Puppet repository - https://phabricator.wikimedia.org/T326745 (BCornwall)
[18:04:37] purge backlogs seem stable now, gonna repool eqsin
[23:07:14] Traffic, DC-Ops, SRE, ops-codfw: Q3:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (RobH)
[23:07:37] Traffic, DC-Ops, SRE, ops-codfw: Q3:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (RobH)
[23:21:17] Traffic, DC-Ops, SRE, ops-codfw: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (RobH)