[08:37:52] netops, Infrastructure-Foundations, SRE, SRE-OnFire, and 2 others: Upgrade POPs asw to Junos 21 - https://phabricator.wikimedia.org/T316532 (ayounsi)
[10:25:16] (VarnishChildRestarted) firing: varnish-upload restarted on cp4046 - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/000000330/varnish-machine-stats?orgId=1&viewPanel=66&var-server=cp4046&datasource=ulsfo%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DVarnishChildRestarted
[10:28:58] * vgutierrez looking
[10:30:22] vgutierrez: fyi, I'm going to depool ulsfo in 15min or so, so it's fully depooled for the maintenance in ~2h
[10:30:46] ack
[10:35:16] (VarnishChildRestarted) resolved: varnish-upload restarted on cp4046 - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/000000330/varnish-machine-stats?orgId=1&viewPanel=66&var-server=cp4046&datasource=ulsfo%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DVarnishChildRestarted
[10:39:33] Traffic, SRE: oom killed varnish on cp4052 - https://phabricator.wikimedia.org/T325797 (Vgutierrez) cp4046 has been impacted by the same issue a few minutes ago
[10:55:42] Traffic, SRE, Patch-For-Review, Upstream: Review cp2041 and cp2042 running bullseye - https://phabricator.wikimedia.org/T325557 (Vgutierrez) both cp2041 and cp2042 look good to me. I haven't found any reason that would prevent upgrading to bullseye
[11:59:19] hello! I have a change that requires an LVS restart - would it be suitable to have this done some time in the next few hours? It's a change to the healthcheck URL used by thumbor https://gerrit.wikimedia.org/r/c/operations/puppet/+/880898 thanks!
[12:08:30] * vgutierrez looking
[12:09:48] hnowlan: yep, just make sure that the change is deployed first across the thumbor instances
[12:12:09] vgutierrez: yep, it is!
[12:14:56] hit lvs2010 first and double check that pybal is happy with the new endpoint
[12:15:47] besides pybal logs, curl http://127.0.0.1:9090/pools/thumbor_8800 can be useful
[12:41:35] (PurgedHighEventLag) firing: (2) High event process lag with purged on cp5025:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[12:44:37] netops, Infrastructure-Foundations, SRE, SRE-OnFire, Sustainability (Incident Followup): Upgrade POPs asw to Junos 21 - https://phabricator.wikimedia.org/T316532 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=795679e1-6c07-4196-8280-0cef7454587d) set by ayounsi@cumin1001 fo...
[12:46:35] (PurgedHighEventLag) resolved: (3) High event process lag with purged on cp5025:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[13:15:05] vgutierrez: just to be clear so I don't break anything, if I'm following https://wikitech.wikimedia.org/wiki/LVS#Deploy_a_change_to_an_existing_service, lvs2010 is the inactive host?
[13:55:52] hnowlan: sorry, lunch break
[13:55:56] hnowlan: yes, lvs2010 is the secondary host
[14:13:44] netops, Infrastructure-Foundations, SRE, SRE-OnFire, Sustainability (Incident Followup): Upgrade POPs asw to Junos 21 - https://phabricator.wikimedia.org/T316532 (ayounsi)
[14:41:56] netops, Infrastructure-Foundations, SRE: Upgrade network devices to Junos 20+ - https://phabricator.wikimedia.org/T316539 (ayounsi)
[14:42:11] netops, Infrastructure-Foundations, SRE, SRE-OnFire, Sustainability (Incident Followup): Upgrade POPs asw to Junos 21 - https://phabricator.wikimedia.org/T316532 (ayounsi) Open→Resolved a:ayounsi All done! fpc2 didn't like the first "blank" reboot and required a power cycle using...
[14:57:18] Traffic, Observability-Alerting: Move (or delete?) trafficserver restart count alert from icinga to alerts.git - https://phabricator.wikimedia.org/T327791 (fgiunchedi)
[15:07:32] Traffic, Observability-Alerting: Move (or delete?) trafficserver restart count alert from icinga to alerts.git - https://phabricator.wikimedia.org/T327791 (Vgutierrez) We should migrate this and not delete it as it's still useful
[16:58:24] Traffic, SRE, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ssingh)
[17:10:56] Traffic, SRE, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp5017.eqsin.wmnet with OS bullseye
[17:36:14] Traffic, SRE: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp5017.eqsin.wmnet with OS bullseye executed with errors: - cp5017 (**FAIL**) - Downtimed on Icinga/Alertmanager -...
[17:37:38] Traffic, SRE: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp5017.eqsin.wmnet with OS bullseye
[17:51:47] Traffic, MediaWiki-Core-HTTP-Cache, MediaWiki-REST-API, RESTbase Sunsetting, and 4 others: Determine http cache control and active purging for REST endpoints serving parsoid output - https://phabricator.wikimedia.org/T308424 (daniel)
[18:43:36] Traffic, SRE: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp5025.eqsin.wmnet with OS bullseye
[19:00:20] netops, Infrastructure-Foundations, SRE, ops-codfw: codfw: Relocate servers to make space for new switches in rowA and rowB - https://phabricator.wikimedia.org/T326564 (Papaul) We discuss this during today's meeting, we are going to put 1 spine in A1 and the other spine in A8. When we upgrade ro...
[19:10:32] netops, Infrastructure-Foundations, SRE, ops-codfw: codfw: Relocate servers to make space for new switches in rowA and rowB - https://phabricator.wikimedia.org/T326564 (Papaul) @BBlack Do you think you will have time for us to move lvs2007 this Thursday the 26th at 9:45am CT 2:45 pm UTC? Tha...
[19:15:24] netops, Infrastructure-Foundations, SRE, ops-codfw: codfw: Relocate servers to make space for new switches in rowA and rowB - https://phabricator.wikimedia.org/T326564 (BBlack) @Papaul - I can't make that slot for LVS, I have meetings a bit later that might get run over. @ssingh might be able t...
[19:17:44] netops, Infrastructure-Foundations, SRE, ops-codfw: codfw: Relocate servers to make space for new switches in rowA and rowB - https://phabricator.wikimedia.org/T326564 (ssingh) >>! In T326564#8554616, @BBlack wrote: > @Papaul - I can't make that slot for LVS, I have meetings a bit later that mig...
[19:19:11] Traffic, SRE: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp5025.eqsin.wmnet with OS bullseye executed with errors: - cp5025 (**FAIL**) - Downtimed on Icinga/Alertmanager -...
[19:19:31] Traffic, SRE: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp5025.eqsin.wmnet with OS bullseye
[19:20:17] netops, Infrastructure-Foundations, SRE, ops-codfw: codfw: Relocate servers to make space for new switches in rowA and rowB - https://phabricator.wikimedia.org/T326564 (Papaul) @BBlack @ssingh thank you. So the process is depool the server, power it down I move it and power it back no changes i...
[19:39:16] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp5032:9331 is unreachable - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[19:50:19] Traffic, DC-Ops: eqsin hosts are not rebooting when running sre.hosts.reimage cookbook - https://phabricator.wikimedia.org/T327812 (BCornwall)
[19:52:58] Traffic, DC-Ops, ops-eqsin: eqsin hosts are not rebooting when running sre.hosts.reimage cookbook - https://phabricator.wikimedia.org/T327812 (BCornwall)
[20:04:16] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp5032:9331 is unreachable - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[20:05:19] Traffic, SRE: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp5017.eqsin.wmnet with OS bullseye completed: - cp5017 (**PASS**) - Removed from Puppet and PuppetDB if present -...
[20:09:45] Traffic, SRE: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp6009.drmrs.wmnet with OS bullseye completed: - cp6009 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled Pu...
[20:29:03] Traffic, SRE: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp5025.eqsin.wmnet with OS bullseye completed: - cp5025 (**PASS**) - Removed from Puppet and PuppetDB if present -...
[20:40:55] Traffic, SRE: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (BCornwall)
[21:15:51] Traffic, SRE: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp6001.drmrs.wmnet with OS bullseye
[22:02:02] Traffic, SRE: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp6001.drmrs.wmnet with OS bullseye completed: - cp6001 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled Pu...
[22:02:34] Traffic, SRE: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ssingh)