[00:54:57] (HAProxyEdgeTrafficDrop) firing: 60% request drop in text@esams during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=esams&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[00:59:56] (HAProxyEdgeTrafficDrop) resolved: 59% request drop in text@esams during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=esams&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[07:35:53] netops, Infrastructure-Foundations, SRE, fundraising-tech-ops: Upgrade fasw to Junos 21 - https://phabricator.wikimedia.org/T316542 (ayounsi)
[07:38:41] netops, Infrastructure-Foundations, SRE, fundraising-tech-ops: Upgrade fasw to Junos 21 - https://phabricator.wikimedia.org/T316542 (ayounsi)
[07:39:22] netops, Infrastructure-Foundations, SRE, fundraising-tech-ops: Upgrade fasw to Junos 21 - https://phabricator.wikimedia.org/T316542 (ayounsi)
[08:23:04] Acme-chief, SRE, Traffic-Icebox, Patch-For-Review: Let's Encrypt transitioning to ISRG's Root - https://phabricator.wikimedia.org/T263006 (Vgutierrez) Open→Resolved a:Vgutierrez
[08:38:07] Traffic, SRE: Implement SLI measurement for HAProxy - https://phabricator.wikimedia.org/T307898 (Vgutierrez) Open→Resolved a:Vgutierrez
[08:38:54] Traffic, SRE: Implement SLI measurement for ATS - https://phabricator.wikimedia.org/T316921 (Vgutierrez)
[08:39:13] Traffic, SRE: Implement SLI measurement for ATS - https://phabricator.wikimedia.org/T316921 (Vgutierrez) p:Triage→Medium
[08:44:28] Traffic, DC-Ops, SRE, ops-eqsin: cp5001 memory errors on DIMM A2 - https://phabricator.wikimedia.org/T314256 (Vgutierrez) @wiki_willy @RobH I'm assuming this host will be decommissioned rather than fixed considering that we are already working on refreshing eqsin?
[08:48:00] Traffic, DC-Ops, SRE, ops-eqsin: cp5001 memory errors on DIMM A2 - https://phabricator.wikimedia.org/T314256 (Vgutierrez) meanwhile I'll remove it from puppet, cause it's been a month since the host crashed and it already got pruned from puppetdb
[09:00:26] netops, Infrastructure-Foundations, SRE: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 (cmooney)
[14:18:22] Traffic: atsbackend.mtail doesn't track requests with cache read|write time set to -1 properly - https://phabricator.wikimedia.org/T316938 (Vgutierrez)
[14:18:40] Traffic: atsbackend.mtail doesn't track requests with cache read|write time set to -1 properly - https://phabricator.wikimedia.org/T316938 (Vgutierrez) Open→In progress p:Triage→Medium
[15:09:54] Traffic, SRE, Patch-For-Review: atsbackend.mtail doesn't track requests with cache read|write time set to -1 properly - https://phabricator.wikimedia.org/T316938 (Vgutierrez) In progress→Resolved after merging https://gerrit.wikimedia.org/r/829208 sli_total|sli_good counters seem sane: `vguti...
[15:31:47] netops, Infrastructure-Foundations, SRE, ops-eqiad: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (Papaul) @Jclark-ctr for more information: the MMF/MTP fibers ordered in https://phabricator.wikimedia.org/T313464 we want 1 fiber from rack c2 to rack a1 1 fi...
[15:55:16] bblack: whatever ended up happening with XKey-based purging?
[15:57:08] did it hit a technical snag of some kind? or is it still desirable, but simply not resourced?
[16:08:47] I lack the context for that, but if it would help reduce the huge amount of PURGE requests that we currently handle.. it would be great :)
[16:13:47] ori: still somewhat desirable I think. It has gotten complicated over time, since we'd need a shared x-key-like solution between both Varnish and ATS.
[16:15:05] also, arguably it might be of greater benefit for the effort to reduce purge volume in other ways. x-key could get us some single-digit divider for normal article purges, but finding a way that we can be comfortable eliminating (or at least reducing) the bulk template purges might be worth more.
[16:17:46] the simple idea there is that if purges from widely-used templates spool out over many hours (or possibly days) anyways, then (a) clearly they're ok to be async in nature in general + (b) we cap our cache TTLs at 24h, so maybe natural expiry is enough and we should just not send these towards the cdn purge kafka queue at all (or perhaps, only send up to the first N, so that we do handle smaller
[16:17:52] cases quickly)
[16:18:44] the complicating factor is whether we really trust the supposed 24h cap on cache TTLs. What we've seen in the past is that there's potentially some "lying" going on with 304 responses from MediaWiki in relevant cases.
[16:19:12] and so our caches might (even today) be hanging onto stale content longer than they should through 304 refreshes
[16:19:57] (the 304 will re-up the TTL, but there were some past tickets where we documented/explored this a bit, and apparently MW will do a 304-ok even though the article has been updated)
[16:22:44] (I vaguely recall there was some reasonable-ish argument for the behavior, but it still feels wrong to me)
[17:28:20] Traffic, DC-Ops, SRE, ops-eqsin: cp5001 memory errors on DIMM A2 - https://phabricator.wikimedia.org/T314256 (wiki_willy) Hi @Vgutierrez - yeah, probably makes more sense to replace than purchase a replacement part, since the new servers have already been ordered and are expected to arrive in Oct...
[18:30:40] Traffic, DC-Ops, SRE, ops-eqsin: cp5001 memory errors on DIMM A2 - https://phabricator.wikimedia.org/T314256 (RobH) >>! In T314256#8207835, @Vgutierrez wrote: > @wiki_willy @RobH I'm assuming this host will be decommissioned rather than fixed considering that we are already working in refreshing...
[20:42:56] (HAProxyEdgeTrafficDrop) firing: 28% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[20:47:56] (HAProxyEdgeTrafficDrop) resolved: (4) 41% request drop in text@eqiad during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
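
To illustrate the XKey-based purging idea raised at 15:55-16:15: instead of one PURGE per URL, cached objects are tagged with surrogate keys (for example a page ID), and a single purge of a key invalidates every URL variant sharing it. The sketch below is a minimal, hypothetical in-memory model of that secondary index; it is not Varnish's xkey vmod nor anything running on the Wikimedia CDN, and all names (SurrogateKeyCache, purge_key, "page:123") are illustrative only.

    from collections import defaultdict

    class SurrogateKeyCache:
        """Toy cache keyed by URL, with a secondary index from surrogate key
        (e.g. a page ID) to every cached URL variant tagged with it."""

        def __init__(self):
            self.objects = {}               # url -> cached body
            self.by_key = defaultdict(set)  # surrogate key -> set of urls

        def store(self, url, body, keys):
            self.objects[url] = body
            for key in keys:
                self.by_key[key].add(url)

        def purge_key(self, key):
            """Invalidate every variant sharing `key` with one purge,
            instead of issuing one PURGE request per URL."""
            urls = self.by_key.pop(key, set())
            for url in urls:
                self.objects.pop(url, None)
            return len(urls)

    # Example: desktop, Parsoid and summary variants of one article share a
    # key, so a single purge_key() call replaces several per-URL PURGEs.
    cache = SurrogateKeyCache()
    cache.store("/wiki/Foo", "<html>desktop</html>", keys=["page:123"])
    cache.store("/w/rest.php/v1/page/Foo/html", "<html>parsoid</html>", keys=["page:123"])
    cache.store("/api/rest_v1/page/summary/Foo", "{}", keys=["page:123"])
    print(cache.purge_key("page:123"))  # -> 3

This is where the "single-digit divider" mentioned above comes from: one key purge covers the handful of URL variants an article has, rather than a separate PURGE for each.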
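The "only send up to the first N" idea from 16:17 can be sketched the same way: when a widely-used template is edited, send purge events for at most N affected pages and let the remainder expire under the (assumed honest) 24h TTL cap. MAX_DIRECT_PURGES and send_purge below are hypothetical names; this is not how MediaWiki's cache-update jobs or the cdn purge kafka pipeline are actually wired.

    from typing import Callable, Iterable

    # Hypothetical threshold: edits touching fewer pages than this still get
    # prompt purges; larger fan-outs rely on natural TTL expiry instead.
    MAX_DIRECT_PURGES = 1000

    def enqueue_template_purges(affected_urls: Iterable[str],
                                send_purge: Callable[[str], None]) -> int:
        """Send at most MAX_DIRECT_PURGES purge events toward the CDN purge
        queue and return how many were sent; the rest are left to age out
        within the 24h cache TTL cap."""
        sent = 0
        for url in affected_urls:
            if sent >= MAX_DIRECT_PURGES:
                break  # remaining pages expire naturally within 24h
            send_purge(url)
            sent += 1
        return sent

The caveat from 16:18 still applies: capping is only safe if the 24h TTL cap really holds, i.e. if 304 revalidations are not quietly extending the life of stale content.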