[02:17:46] 10Traffic, 10Thumbor: Cannot download large (3GB) PDF files from commons - https://phabricator.wikimedia.org/T341755 (10Platonides) The file can be curled down from all dcs: ` for site in ulsfo eqiad codfw esams eqsin drmrs; do curl -o $site.pdf https://upload.wikimedia.org/wikipedia/commons/9/9f/ZHSY000097_%E...
[02:21:00] 10Traffic, 10Thumbor: Cannot download large (3GB) PDF files from commons - https://phabricator.wikimedia.org/T341755 (10Platonides) But it does fail if we also add `--limit-rate 1M`
[05:46:00] 10Traffic, 10netops, 10Infrastructure-Foundations: Test depool of drmrs - https://phabricator.wikimedia.org/T344968 (10ayounsi) p:05Triage→03High
[05:46:09] FYI: https://phabricator.wikimedia.org/T344968
[08:14:14] XioNoX: makes sense :)
[08:28:15] <_joe_> I made a plot of the backend requests from esams+drmrs compared to last week
[08:28:18] <_joe_> https://grafana.wikimedia.org/goto/_V17YUgSz?orgId=1
[08:28:23] <_joe_> this is just for mediawiki
[08:28:36] <_joe_> I don't think it's ok.
[08:45:57] 10Traffic, 10netops, 10Infrastructure-Foundations, 10SRE: Test depool of drmrs - https://phabricator.wikimedia.org/T344968 (10Joe) A couple things: first of all, the recent issues with repooling esams were mostly due to the insufficient caching in its backend, even more than starting with a completely cold...
[08:57:40] 10Traffic, 10netops, 10Infrastructure-Foundations, 10SRE: Test depool of drmrs - https://phabricator.wikimedia.org/T344968 (10KOfori) @Joe certainly not now with all the trouble but at some point when things have stabilized, we should do it.
[09:28:57] while the backend rps (atm) has not changed significantly compared to yesterday, we should consider depooling UK and DE from esams, at least until the pending optimisations we have previously discussed are deployed. The way backend traffic looks right now, it is possible that increased traffic (even due to natural events) will cause problems
[09:42:42] 10Traffic, 10SRE, 10SRE-swift-storage: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (10MatthewVernon) @Vgutierrez how hard is moving swift frontends to using envoy in place of nginx likely to be? If it could give us better visibility of what is going on in these occasional s...
[10:10:45] 10Traffic, 10SRE, 10SRE-swift-storage: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (10Vgutierrez) As a side effect of moving to envoy we would be getting https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1 data for swift. As stated in the task description the...
[10:11:14] _joe_: re https://phabricator.wikimedia.org/T344968#9119575 not saying we should do it right now, but once the issues are fixed
[10:15:28] <_joe_> XioNoX: the priority worried me a bit :)
[11:03:01] 10Traffic, 10Thumbor: Cannot download large (3GB) PDF files from commons - https://phabricator.wikimedia.org/T341755 (10Vgutierrez) A quick check on cp3081 shows the following results: * HAproxy closes the connection after 245 seconds and 327 seconds in a second test * varnish closes the connection after 353 s...
[11:10:32] 10Traffic, 10Thumbor: Cannot download large (3GB) PDF files from commons - https://phabricator.wikimedia.org/T341755 (10Vgutierrez) it seems that the ATS issue could be addressed by https://github.com/apache/trafficserver/pull/8083
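A minimal sketch of the rate-limited reproduction described at 02:17–02:21 and measured at 11:03, in the same shell style as the quoted loop. The per-site `--connect-to` pinning, the `upload-lb.<site>.wikimedia.org` endpoint names, and the file URL are assumptions; the real command and URL are truncated in the log above.

```
#!/usr/bin/env bash
# Hypothetical reproduction for T341755: throttle the download to 1 MB/s and
# record how long each edge site keeps the connection open before cutting it.
# FILE_URL is a placeholder; the real commons URL is truncated in the log above.
FILE_URL="https://upload.wikimedia.org/wikipedia/commons/9/9f/<large-3GB-file>.pdf"

for site in ulsfo eqiad codfw esams eqsin drmrs; do
    curl --silent --output "${site}.pdf" \
         --connect-to "upload.wikimedia.org::upload-lb.${site}.wikimedia.org:" \
         --limit-rate 1M \
         --write-out "${site}: HTTP %{http_code}, %{size_download} bytes in %{time_total}s\n" \
         "$FILE_URL"
done
```

At 1 MB/s a 3 GB transfer needs roughly 50 minutes, so the ~245 s and ~353 s cutoffs observed on cp3081 would terminate it long before completion.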
[13:00:40] _joe_: there's a whole lot of different factors going on in the graph you're looking at. The timing and method of re-introducing the cold caches aside (which could've been more-ideal, but ultimately did not actually melt anything), and regardless of the single-backend config issues as well: anytime we concentrate more users into fewer caches, it will reduce MW load. There's just more users sharing
[13:00:46] the same caches for greater overall efficiency. When you compare esams+drmrs now to drmrs-only, you're seeing some of that as well.
[13:01:39] <_joe_> bblack: that latter part can't justify a 40% raise in rps
[13:01:45] https://grafana.wikimedia.org/goto/BG-_GwRSk?orgId=1
[13:01:58] ^ if you look at the 30d view, you can see the dropoff when esams was first depooled from this effect
[13:02:26] if you're going to compare like that, it would be better to compare to how it was with both sites, before esams depooled
[13:04:24] <_joe_> ok then it's ~1k rps of net increase at peak, not 1.4k rps
[13:05:09] on the first day?
[13:05:15] <_joe_> no, now
[13:06:30] <_joe_> it's even clearer if you don't have grafana interpolating your bins and use [6h] or something like that in your vector
[13:07:11] <_joe_> anyways, I'm going to be on PTO for 2 weeks, but we'd appreciate it if moves around caching that impact the backends this much were coordinated with us.
[13:09:16] with the [6h], looking at ~now versus two weeks prior (have to go back a day to not already be in the esams original downing point)
[13:10:37] well I dunno, even that doesn't work. Basically we don't have a two-week span of clean data unless we go back to the first day when it was known-cold
[13:10:42] (trying to get same row)
[13:10:44] err, dow
[13:10:57] hmmm
[13:11:15] I could do now vs 3 weeks ago I guess
[13:12:11] 4656 for now, 4297 3w ago, at 12:40 UTC
[13:12:37] but we're not at today's peak yet either, it's just the most-current point in time in the 30d[6h] view
[13:14:05] yesterday's peak was 4886 @ 21:20, and 3 weeks back from that is 4537
[13:14:55] which is a ~7.7% bump, for the aggregate of the two sites, and only a couple days after starting up cold, and with the new single-backend stuff going on
[13:17:27] (in the peak, I might add. it might be different with whole-day averages or something)
[13:19:34] for whatever it may be worth, at [24h], the latest datapoint is 4049 @ 12:40, and -3w it's 3852, closer to a 5% bump
[13:21:46] if we include eqiad's requests as well (to get a more-complete picture of all ATS reqs heading towards eqiad mediawikis), at the latest [24h] point in time minus 3 weeks, we actually had a ~3.5% decrease, but then there's a lot of variables over 3 weeks introducing noise.
[13:24:08] <_joe_> if we use today, indeed the data are better. I was avoiding it as it's incomplete
[13:24:19] well even yesterday
[13:24:32] but we're dealing both with incomplete data, and with things improving by the day as we get longer-term cache fill
[13:24:44] <_joe_> there's another issue ofc
[13:24:50] and the lack of a clean -1w or -2w comparison due to the depool/repool timing :/
[13:24:56] <_joe_> on july 30th everyone is at work in europe
[13:25:03] <_joe_> right now maybe 30% of people are?
[13:25:13] <_joe_> so it's hard to make comparisons going back that much
[13:25:17] yeah I agree
[13:25:41] but for that matter, how many intervening deploys have changed things too, etc.
[13:25:55] it's hard to get good long-term comparisons that isolate one effect
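For reference, the arithmetic behind the bump figures quoted above, as a small shell sketch; the rps values are the ones read off the [6h] and [24h] Grafana views in the discussion, not fresh data.

```
# percentage change between a current backend rps reading and the one ~3 weeks prior
bump() { awk -v now="$1" -v ref="$2" 'BEGIN { printf "%+.1f%%\n", (now - ref) / ref * 100 }'; }

bump 4886 4537   # yesterday's [6h] peak vs 3 weeks earlier   -> +7.7%
bump 4049 3852   # latest [24h] datapoint vs 3 weeks earlier  -> +5.1%
```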
[13:27:15] regardless of the single-backend change, though: the point about "more total caches == more traffic" will keep applying.
[13:27:38] the next new regional edge DC we add will steal users from other existing caches and cause more backend traffic, because of less cache sharing.
[13:28:28] at least, potentially. you might see that effect dampened by the fact that the region gets better caching of its own language editions when not sharing as much space with what's popular in the US caches
[13:41:24] 10Traffic: Investigate why Traffic SLO Grafana dashboard has negative values on combined SLI - https://phabricator.wikimedia.org/T341606 (10herron) @BCornwall fwiw switching from "sli good" to "sli bad" does have the above in mind, namely by working with the small margin-of-error (by switching to calculation to...
[13:57:26] 10Traffic, 10Math, 10RESTBase Sunsetting, 10SRE: Determin the cause of x8 increase in requests to math endpoints between july 6 and August 3 - https://phabricator.wikimedia.org/T344329 (10Physikerwelt) p:05Triage→03Low I can't explain this. However, I think it is less critical. Previously, we had thr...
[14:12:00] 10Traffic, 10Math, 10RESTBase Sunsetting, 10SRE: Determine the cause of x8 increase in requests to math endpoints between july 6 and August 3 2023 - https://phabricator.wikimedia.org/T344329 (10Aklapper)
[17:43:37] 10Traffic, 10Thumbor: Cannot download large (3GB) PDF files from commons - https://phabricator.wikimedia.org/T341755 (10BCornwall)
[17:43:39] 10Traffic: Package and deploy ATS 9.2.1 - https://phabricator.wikimedia.org/T339134 (10BCornwall)
[18:16:27] 10Traffic, 10Infrastructure-Foundations, 10SRE: Reimaging lvs2012 fails as the host is unreachable from cumin2002 - https://phabricator.wikimedia.org/T336428 (10BCornwall) 05Open→03Resolved a:03BCornwall cumin2002 is able to ping both v4 and v6, so I'm going to mark this as closed. Thanks, everyone!
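A minimal sketch of the reachability check behind the T336428 closure above, assuming it is run from cumin2002; the FQDN is an assumption (lvs2012 would normally sit on the internal codfw domain), not taken from the task.

```
# hypothetical check: confirm lvs2012 answers on both address families
host="lvs2012.codfw.wmnet"   # assumed FQDN, not quoted in the log
ping -4 -c 3 "$host" && echo "IPv4 reachable"
ping -6 -c 3 "$host" && echo "IPv6 reachable"
```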