[01:27:22] Traffic, Performance-Team, SRE, SRE-swift-storage, and 2 others: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (tstarling) I tried a warmup request followed by another request for the same page view, the second having MW logging enabled with [[https://gerrit.wikimedia....
[02:09:20] Traffic, Performance-Team, SRE, SRE-swift-storage, and 2 others: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (tstarling) I tested parse times with `ab -n10 -H'X-Forwarded-Proto: https' -X mw1441.eqiad.wmnet:80 'http://test2.wikipedia.org/w/api.php?action=parse&format...
[02:10:25] Traffic, Performance-Team, SRE, SRE-swift-storage, and 2 others: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (tstarling)
[02:18:32] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp5001:9331 is unreachable - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[06:18:32] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp5001:9331 is unreachable - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[08:01:56] (HAProxyEdgeTrafficDrop) firing: 66% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[08:06:56] (HAProxyEdgeTrafficDrop) resolved: 68% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[13:02:07] netops, Infrastructure-Foundations, SRE: Complete testing of SONiC NOS / Dell network gear and write up - https://phabricator.wikimedia.org/T310901 (cmooney) @ayounsi @Papaul I've done the first draft of the summary here: https://wikitech.wikimedia.org/wiki/Dell_Enterprise_Sonic_Evaluation Feel fre...
[13:16:02] netops, Infrastructure-Foundations, SRE: Complete testing of SONiC NOS / Dell network gear and write up - https://phabricator.wikimedia.org/T310901 (Papaul) @cmooney thanks for putting this together.
[13:39:38] (LVSHighCPU) firing: (8) The host lvs5002:9100 has at least its CPU 0 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs5002 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[13:44:38] (LVSHighCPU) resolved: (8) The host lvs5002:9100 has at least its CPU 0 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs5002 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[14:12:31] HTTPS, Traffic, SRE, serviceops, Abstract Wikipedia team (Phase λ – Launch): Get new edge & internal HTTPS certificates expanded to add wikifunctions.org and *.wikifunctions.org - https://phabricator.wikimedia.org/T313227 (Jdforrester-WMF) a:Jdforrester-WMF→None
[14:22:31] Traffic, DNS, SRE, Abstract Wikipedia team (Phase κ – Clean-up): Establish wikifunctions.org - https://phabricator.wikimedia.org/T275904 (Jdforrester-WMF)
[15:31:24] Traffic, DNS, SRE, Abstract Wikipedia team (Phase κ – Clean-up): Establish wikifunctions.org - https://phabricator.wikimedia.org/T275904 (BBlack) The existing google IP apparently doesn't even have TLS (just old port 80), so it defaults to an insecure site warning in Chrome. Google's public reso...
[16:26:56] sukhe: out of curiosity, are the performance issues improved?
[16:30:33] cdanis: some have been alleviated but the current persisting issue is about the increase in TTFB in 9.x
[16:30:39] https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&viewPanel=62&var-site=drmrs&var-instance=cp6016&var-layer=backend&from=1659198291536&to=1659580111861
[16:31:15] cp6016 is one of the upgraded hosts, but yeah, it's there in the three other ones as well that are running 9.x
[16:31:43] trying to get to the root cause of it and we have tried a bunch of things so far (obvious ones, missing configs, missing merged patches upstream, restarting to clear the cache etc.)
[16:31:56] suggestions for debugging most welcome of course
[16:32:18] same issue on cp6008, https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&viewPanel=62&var-site=drmrs&var-instance=cp6008&var-layer=backend&from=1659198291536&to=1659580111861
[16:33:23] thanks for the detail!
[16:34:18] this is the only _known_ issue so far with the upgrade that's preventing us from going further
[16:34:27] the other ones were mostly outdated configs and what not that we have fixed
[16:35:19] right
[21:43:58] Acme-chief, SRE, Patch-For-Review: acme-chief is down: ValueError: OCSP response status is not successful so the property has no value - https://phabricator.wikimedia.org/T282490 (BCornwall) Open→Resolved a:BCornwall @Dzahn I'm assuming you meant 0.3, which has long since been deployed. I...
[23:53:18] Traffic, Performance-Team, SRE, SRE-swift-storage, and 2 others: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (tstarling)