[01:47:10] Traffic, MediaWiki-Core-HTTP-Cache, MediaWiki-Page-history, Performance-Team, and 2 others: History pages' caches not being invalidated after edits - https://phabricator.wikimedia.org/T317064 (Krinkle)
[02:14:15] Traffic, Performance-Team, SRE, SRE-swift-storage, and 3 others: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (tstarling) > Observe cross-DC database connection rate, analyse sources It's not necessary to use tcpdump since we can just look at SSL connection counts. I...
[03:16:00] Traffic, Performance-Team, SRE, SRE-swift-storage, and 3 others: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (tstarling) I made this [[https://grafana-rw.wikimedia.org/d/6fLyZKG4k/all-clusters-utilization|all clusters utilization]] dashboard so that I could easily se...
[03:17:51] Traffic, Performance-Team, SRE, SRE-swift-storage, and 3 others: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (tstarling)
[03:44:19] Traffic, MediaWiki-Core-HTTP-Cache, MediaWiki-Page-history, Performance-Team, and 2 others: History pages' caches not being invalidated after edits - https://phabricator.wikimedia.org/T317064 (ori) > I suspect this is fallout from the URL query sorting change (cc @ori) not invalidating the cache...
[03:44:47] vgutierrez: ^ ptal if you have a chance
[03:55:33] Traffic, Performance-Team, SRE, SRE-swift-storage, and 3 others: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (tstarling) MySQL cross-DC traffic is higher than expected, with 110 conns/s. Appserver CPU usage is fine. Mcrouter connection rates are fine.
[04:24:28] Traffic, Performance-Team, SRE, SRE-swift-storage, and 3 others: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (tstarling)
[04:29:49] Traffic, Performance-Team, SRE, SRE-swift-storage, and 3 others: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (tstarling) I captured cross-DC queries on the s3 master (db1157) using SHOW PROCESSLIST in a loop, once per second for 20 minutes. Out of 10 captured queries...
[04:32:28] Traffic, Performance-Team, SRE, SRE-swift-storage, and 3 others: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (tstarling) The serverIsReadOnly() cache key includes the DB hostname, so I should have done my calculation per section rather than globally.
[04:37:49] Traffic, Performance-Team, SRE, SRE-swift-storage, and 3 others: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (tstarling) | section | Cross-DC connection rate (req/s) | |--|--| | es4 | 0.00 | | es5 | 0.00 | | s1 | 9.21 | | s2 | 19.7 | | s3 | 53.2 | | s4 | 7.02 | | s5...
[06:15:03] netops, Infrastructure-Foundations, SRE: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 (ayounsi)
[07:50:01] Traffic, MediaWiki-Core-HTTP-Cache, MediaWiki-Page-history, Performance-Team, and 2 others: History pages' caches not being invalidated after edits - https://phabricator.wikimedia.org/T317064 (Vgutierrez) yes, your assessment is right @ori, query parameters are sorted before triggering the purge
[07:57:43] Traffic, MediaWiki-Core-HTTP-Cache, MediaWiki-Page-history, Performance-Team, and 2 others: History pages' caches not being invalidated after edits - https://phabricator.wikimedia.org/T317064 (Vgutierrez) @ori, my current theory (and it needs to be tested) is that varnish frontend purges the hist...
[07:58:56] (HAProxyEdgeTrafficDrop) firing: 69% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[07:59:04] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=7eb8120c-f8b6-4c79-8deb-b18a305a2353) set by ayounsi@cumin1001 for 2:00:00 on 1 host(s) and th...
[08:03:56] (HAProxyEdgeTrafficDrop) resolved: (2) 68% request drop in text@eqiad during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[08:09:11] (HAProxyEdgeTrafficDrop) firing: (2) 68% request drop in text@eqiad during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[08:45:56] (PyBalBGPUnstable) firing: (3) PyBal BGP sessions on instance lvs4005 are failing - https://alerts.wikimedia.org/?q=alertname%3DPyBalBGPUnstable
[08:46:45] expected... XioNoX is restarting cr3-ulsfo :)
[08:46:55] yep
[09:11:59] Traffic, MediaWiki-Core-HTTP-Cache, MediaWiki-Page-history, Performance-Team, and 2 others: History pages' caches not being invalidated after edits - https://phabricator.wikimedia.org/T317064 (Vgutierrez) I've just confirmed it in testwiki, first unauthenticated GET against `action=history` trigg...
[09:34:39] Traffic, MediaWiki-Core-HTTP-Cache, MediaWiki-Page-history, Performance-Team, and 2 others: History pages' caches not being invalidated after edits - https://phabricator.wikimedia.org/T317064 (Vgutierrez) ATS provides a similar feature to libvmod-querysort as part of the Cache Key manipulation pl...
[09:39:11] (HAProxyEdgeTrafficDrop) resolved: 68% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[09:48:24] Traffic, MediaWiki-Core-HTTP-Cache, MediaWiki-Page-history, Performance-Team, and 2 others: History pages' caches not being invalidated after edits - https://phabricator.wikimedia.org/T317064 (Joe) Another option is to do the query sorting for purges, which are a special case, in either: # media...
[10:30:56] (PyBalBGPUnstable) resolved: (3) PyBal BGP sessions on instance lvs4005 are failing - https://alerts.wikimedia.org/?q=alertname%3DPyBalBGPUnstable
[10:57:56] (PyBalBGPUnstable) firing: (3) PyBal BGP sessions on instance lvs4005 are failing - https://alerts.wikimedia.org/?q=alertname%3DPyBalBGPUnstable
[11:22:56] (PyBalBGPUnstable) resolved: (3) PyBal BGP sessions on instance lvs4005 are failing - https://alerts.wikimedia.org/?q=alertname%3DPyBalBGPUnstable
[11:40:56] (HAProxyEdgeTrafficDrop) firing: 60% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[11:45:56] (HAProxyEdgeTrafficDrop) resolved: 65% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[11:47:05] netops, Infrastructure-Foundations, SRE: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 (ayounsi) This has been quite eventful. To keep in mind that those upgrade need the !!no-validate!! knob, more details in the [[ https://www.juniper.net/documentation/us/en/software...
[11:49:56] (HAProxyEdgeTrafficDrop) firing: 68% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[11:54:56] (HAProxyEdgeTrafficDrop) resolved: 63% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[12:10:51] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 (ayounsi)
[12:44:56] (HAProxyEdgeTrafficDrop) firing: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[12:54:56] (HAProxyEdgeTrafficDrop) resolved: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[13:12:33] Hi folks, I'm clinician on duty this week, and there are a bunch of traffic phab items that need triage - T317011 T316337 T315911 T315676 T315064 T122097 would you mind setting a priority, please? Alternatively I'm happy to do either a) remove the SRE tag leaving just traffic or b) set them all as medium priority if you'd rather
[13:12:34] T315911: ATS Read While Writer feature is wrongly configured - https://phabricator.wikimedia.org/T315911
[13:12:34] T315676: Add DP cookie for pageview filtering - https://phabricator.wikimedia.org/T315676
[13:12:34] T317011: HTTP 500 against api.php?action=parse API on tr.wikipedia.org - https://phabricator.wikimedia.org/T317011
[13:12:35] T315064: metric discrepancies between ATS 9.x and ATS 8.x - https://phabricator.wikimedia.org/T315064
[13:12:36] T316337: Phabricator was logging out users repeatedly (2022-08-26) - https://phabricator.wikimedia.org/T316337
[13:12:36] T122097: Set expiry time for GeoIP cookies - https://phabricator.wikimedia.org/T122097
[13:13:53] sorry, also T317064
[13:13:54] T317064: History pages' caches not being invalidated after edits - https://phabricator.wikimedia.org/T317064
[13:18:38] Traffic, SRE: ATS isn't honoring the cache policy set in cache::alternate_domains on some cases - https://phabricator.wikimedia.org/T316545 (Vgutierrez) Open→Resolved
[13:22:19] Traffic, SRE, Patch-For-Review: ATS Read While Writer feature is wrongly configured - https://phabricator.wikimedia.org/T315911 (Vgutierrez) p:Triage→Medium
[13:25:02] Traffic, MediaWiki-Core-HTTP-Cache, MediaWiki-Page-history, Performance-Team, and 3 others: History pages' caches not being invalidated after edits - https://phabricator.wikimedia.org/T317064 (Vgutierrez) p:Triage→High
[13:34:20] Traffic, SRE, affects-Kiwix-and-openZIM: HTTP 500 against api.php?action=parse API on tr.wikipedia.org - https://phabricator.wikimedia.org/T317011 (Vgutierrez) p:Triage→Medium >>! In T317011#8212213, @Aklapper wrote: > Not sure which project tags to add when it comes to caching layers (?), as...
[13:40:26] Traffic, SRE, Upstream: metric discrepancies between ATS 9.x and ATS 8.x - https://phabricator.wikimedia.org/T315064 (Vgutierrez) p:Triage→Medium This could have been solved by T316938
[13:41:05] Thanks :)
[14:42:07] netops, Cloud-VPS, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Use vlan trunking instead of multiple physical interfaces - https://phabricator.wikimedia.org/T316114 (jbond) p:Triage→Medium
[14:43:35] Traffic, Phabricator, SRE, Wikimedia-Incident: Phabricator was logging out users repeatedly (2022-08-26) - https://phabricator.wikimedia.org/T316337 (Vgutierrez) p:Triage→Medium
[14:55:54] netops, Infrastructure-Foundations, SRE: Detect IP address collisions - https://phabricator.wikimedia.org/T189522 (ayounsi) p:Triage→Low
[15:04:25] netops, Infrastructure-Foundations, SRE: Upgrade management routers and switches to Junos 21 - https://phabricator.wikimedia.org/T316529 (Papaul) @ayounsi the mentioned : "All management routers are running Junos 20 except mr1-codfw and mr1-esams that are running 18." and "The current Junos recommen...
[15:08:43] netops, Infrastructure-Foundations, SRE: Upgrade management routers and switches to Junos 21 - https://phabricator.wikimedia.org/T316529 (ayounsi) Only those 2 from 18 to 21. 20 is recent enough.
[15:12:08] netops, Infrastructure-Foundations, SRE: Upgrade management routers and switches to Junos 21 - https://phabricator.wikimedia.org/T316529 (Papaul) Thanks
[15:31:39] Traffic, Phabricator, SRE, Wikimedia-Incident: Phabricator was logging out users repeatedly (2022-08-26) - https://phabricator.wikimedia.org/T316337 (hashar) I have posted the very few actions I have done on the incident documentation. Given the root cause was immediately found (trafficserver) an...
[15:45:28] Traffic, Phabricator, SRE, Wikimedia-Incident: Phabricator was logging out users repeatedly (2022-08-26) - https://phabricator.wikimedia.org/T316337 (jcrespo) > on the incident documentation Where? There is no incident doc yet (or I couldn't find one on Wikitech)
[16:07:30] netops, Infrastructure-Foundations, SRE: Upgrade management routers and switches to Junos 21 - https://phabricator.wikimedia.org/T316529 (Papaul)
[16:20:31] <_joe_> hi folks, I am inclined to merge my purged patch, can I trust y'all to package/deploy purged?
[16:33:08] probably :)
[16:33:43] (see also ori's note about the delimiter. package+deploy again for that or do it now?)
[16:34:22] _joe_: ^
[16:35:11] <_joe_> bblack: using the delimiter would make the code a bit more complicated actually, if we want to reproduce all features
[16:35:20] <_joe_> and significantly slower
[16:35:34] <_joe_> hence I'd go with the hotfix for now
[16:35:43] <_joe_> my 2 c
[16:36:07] <_joe_> but I can look into adding the support for all the cases that vmodquerysort supports, sure
[16:38:13] why more complicated?
[16:44:37] yeah the question is whether we're longer-term going down the road of keeping purged in sync with querysort+MW, or if this is a temp hack and we plan to instead fix MW and/or querysort to agree with each other about the desired normalized/canonicalized result
[16:59:01] I think it should be implemented in ATS
[17:02:56] logically, it makes sense for the CDN layer to handle this. it's unfortunate that it means we have to implement it twice, but I think that is a predictable consequence of having two caching proxies
[17:05:11] ideally we want to apply query normalization to as many services as possible, so it shouldn't require making modifications to the backend application
[17:20:57] in theory, if we do it "right" (for some value of right I guess), there should be one standard for what the most-canonical form of a given URI is.
[17:21:47] and internal things that generate URIs (like MW emitting purges, links, rel=canonical? but other cases too, like x-service traffic?) should generate them in their canonical form.
[17:22:36] and the front edge should implement normalization early in request processing to upgrade alternate variants to the canonical one before caching and/or forwarding further in, and then we're done.
[17:23:17] I think we're generally pretty close to that model historically, although I'm sure there's edge cases we ignore.
[17:24:34] the nice thing about that model is simplicity. one thing (the origin service, MW) defines what's canonical, and one other service implements normalization which aims (as best it can!) to normalize in the direction of canonicalness.
[17:25:33] it's the decision to let querysort invent a novel canonicalization different from MW that breaks things here. I remember it being discussed in the ticket, but I didn't really think of this fallout, at that time.
[17:28:01] we could "fix" this at either end: either give our varnish+querysort implementation some specific logic to emulate MW on whatever its canonical param ordering is for important cases....
[17:28:26] or have MW re-order what it considers canonical param order to better match normalizers in general (alpha-sort keys or whatever).
[17:28:54] bblack: I don't agree. MediaWiki considers ?title=X&action=Y URLs canonical, and we don't want to change that — it is more readable than the alternatives, and historically entrenched. At the same time, we don't want every backend to have to match that sort order
[17:29:33] I get the "as many services as possible" angle, but MW is by far our dominant/important case.
[17:30:01] bbiab, sorry.
[17:30:05] np!
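For readers following the canonical-form argument above, here is a minimal Python sketch of the general mismatch between an alpha-sorted cache key and MediaWiki's ?title=X&action=Y ordering. It is a toy model only: `sort_query`, the `example.org` URL and the dict-based cache are hypothetical stand-ins, not the behaviour of libvmod-querysort, purged or ATS, and the exact failure path on T317064 is truncated in this log.

```python
from urllib.parse import urlsplit, urlunsplit

def sort_query(url: str) -> str:
    """Sort raw query parameters alphabetically (hypothetical stand-in for a query sorter)."""
    parts = urlsplit(url)
    params = [p for p in parts.query.split("&") if p]
    return urlunsplit(parts._replace(query="&".join(sorted(params))))

# The ordering MediaWiki treats as canonical, per the discussion above (URL is illustrative).
mw_canonical = "https://example.org/w/index.php?title=Foo&action=history"
sorted_form = sort_query(mw_canonical)

# Toy cache keyed by the sorted URL, as a sorting cache layer might store it.
cache = {sorted_form: "cached history page"}

# A PURGE carrying the unsorted URL finds nothing to evict.
cache.pop(mw_canonical, None)
stale_after_unsorted_purge = sorted_form in cache

# A PURGE normalized the same way as lookups does evict the entry.
cache.pop(sort_query(mw_canonical), None)
stale_after_sorted_purge = sorted_form in cache
```

The point of the sketch is only that lookups and purges have to agree on one normalization, whichever layer ends up owning it.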
[17:30:45] the reason we're pursuing querysort at all is to increase cache hitrate, basically, which is mostly an internal concern of any given cache layer
[17:32:57] you could imagine an alternate implementation of querysort where we stashed the unsorted version of the URI as it arrived, and then only applied the sorting for cache lookup (+PURGE lookup), then reverted it before forwarding the request inwards, which would also avoid these problems by making the querysort an invisible action interior to varnish details that we do as a hitrate optimization
[17:33:25] in that case it wouldn't matter much which sort order we settled on
[17:34:55] but if we're transforming it permanently for the interior/origin-facing requests, if nothing else it's confusing to have that not be in canonical form. You're begging future questions like "Why does every history link read ?title=X&action=Y , yet when I'm tracing live requests or digging in analytics data, I can only find them as ?action=Y&title=X?"
[17:36:38] the nice thing about the stash+restore model is two cache layers could implement normalization as cache hitrate optimizations with implementations that don't even match, and that would be ok.
[17:37:50] not that I necessarily advocate that as the best solution either. it creates some odd future failure modes, potentially (if the cache + MW's idea of what transformations are legally equivalent ever differs, I could imagine a lot of confusion before someone tracks down that it was a logic bug in the hidden normalization rewrite inside varnish or ATS)
[17:41:46] anyways, it's PURGE that creates the loop and necessitates some model of canonical-ness. For a service that doesn't PURGE, it wouldn't matter as much, other than possible confusion.
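The stash+restore model described above can be sketched roughly as follows, assuming a toy in-memory cache. `ToyCache`, `cache_key` and `origin` are hypothetical names for illustration and say nothing about how varnish or ATS would implement it: the sorted URL is used only as the lookup and purge key, while the origin is always called with the URL as it arrived.

```python
from urllib.parse import urlsplit, urlunsplit

def cache_key(url: str) -> str:
    """Sorted form of the URL, used only inside the cache (hypothetical helper)."""
    parts = urlsplit(url)
    params = [p for p in parts.query.split("&") if p]
    return urlunsplit(parts._replace(query="&".join(sorted(params))))

class ToyCache:
    def __init__(self, fetch):
        self._fetch = fetch   # callable standing in for the origin
        self._store = {}

    def get(self, url: str) -> str:
        key = cache_key(url)
        if key not in self._store:
            # On a miss the origin sees the URL exactly as it arrived.
            self._store[key] = self._fetch(url)
        return self._store[key]

    def purge(self, url: str) -> bool:
        # Purges are normalized the same way as lookups.
        return self._store.pop(cache_key(url), None) is not None

seen_by_origin = []

def origin(url: str) -> str:
    seen_by_origin.append(url)
    return f"response #{len(seen_by_origin)}"

cache = ToyCache(origin)
first = cache.get("/w/index.php?title=Foo&action=history")
second = cache.get("/w/index.php?action=history&title=Foo")   # same cache object
purged = cache.purge("/w/index.php?title=Foo&action=history")
```

In this sketch both parameter orders share one cache object, the origin only ever sees the first URL in its original order, and a purge in either order evicts it, which is why the choice of sort order stops mattering.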
[17:42:46] as faidon mentioned earlier - xkey is another solution to the PURGE problem (but yeah, that's not trivial either!), and we're still left with the confusion problem of why various URIs look different at various points in our infra due to differing normalizations.
[17:45:56] but if this whole problem boils down, in practice today, to just a handful of important cases for canonical PURGE'd URIs, it's not hard to just hardcode the important cases, too
[17:46:32] (which we could do in purged, or we could just do it in the varnish querysort code with a flag to enable MediaWiki-isms)
[17:47:22] the ordering of title-vs-action for a history URI is easy. there's probably only a few more cases (roughly matching how many unique URIs we purge per article)
[17:48:50] netops, Infrastructure-Foundations, SRE, Patch-For-Review, cloud-services-team (Kanban): Remove 185.15.56.0/24 from network::external - https://phabricator.wikimedia.org/T265864 (Dzahn) >>! In T265864#6995696, @Legoktm wrote: > This will remove Cloud VPS from `wikimedia_nets`, which gets some...
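The "flag to enable MediaWiki-isms" idea from the 17:45-17:47 messages could look something like the sketch below. Everything here is an assumption for illustration: the `mediawiki_isms` flag, the `MW_PRIORITY` table and the choice to rank only `title` and `action` are hypothetical, and the full list of cases that would need hardcoding is not given in the log.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed priority table: title first, then action, everything else alphabetical.
MW_PRIORITY = {"title": 0, "action": 1}

def sort_query(url: str, mediawiki_isms: bool = False) -> str:
    """Sort query parameters, optionally keeping MediaWiki's title/action order."""
    parts = urlsplit(url)
    params = [p for p in parts.query.split("&") if p]

    def rank(param: str):
        name = param.split("=", 1)[0]
        # Without the flag every parameter has the same priority (plain alpha sort).
        prio = MW_PRIORITY.get(name, len(MW_PRIORITY)) if mediawiki_isms else 0
        return (prio, param)

    return urlunsplit(parts._replace(query="&".join(sorted(params, key=rank))))

plain = sort_query("/w/index.php?action=history&title=Foo&offset=5")
mw_style = sort_query("/w/index.php?action=history&title=Foo&offset=5", mediawiki_isms=True)
```

With the flag set, the normalized URL matches the ?title=X&action=Y form MediaWiki emits, so purges generated by MediaWiki would not need re-sorting for these cases.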
[17:55:04] separating out generic normalization from mediawiki-specifics was the model we were pursuing for encoding normalization, too (which is also still in an unfinished state, but it's doing some good in the shape it's in now)
[18:00:06] netops, Infrastructure-Foundations, SRE, ops-eqiad: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (Jclark-ctr) c2 <-- G2204190495000069 --> a1 c7 <-- G2204190495000136 --> a8 d2 <-- G2204190495000072 --> a1 d7 <-- G2204190495000097 --> a8
[18:25:33] Traffic, MediaWiki-Core-HTTP-Cache, MediaWiki-Page-history, SRE, and 3 others: History pages' caches not being invalidated after edits - https://phabricator.wikimedia.org/T317064 (aaron)
[19:09:33] netops, Infrastructure-Foundations, Observability-Alerting, SRE, SRE Observability (FY2022/2023-Q1): Ingest Cron and Root Alerts Into Logstash - https://phabricator.wikimedia.org/T274377 (lmata) p:Triage→Medium
[19:09:49] netops, Infrastructure-Foundations, Observability-Alerting, SRE, SRE Observability (FY2022/2023-Q1): Ingest Cron and Root Alerts Into Logstash - https://phabricator.wikimedia.org/T274377 (lmata) a:herron→andrea.denisse