[00:36:57] (HAProxyEdgeTrafficDrop) firing: 66% request drop in text@drmrs during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=drmrs&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[00:41:56] (HAProxyEdgeTrafficDrop) resolved: 68% request drop in text@drmrs during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=drmrs&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[00:42:56] (HAProxyEdgeTrafficDrop) firing: 69% request drop in text@drmrs during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=drmrs&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[00:47:11] (HAProxyEdgeTrafficDrop) resolved: 68% request drop in text@drmrs during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=drmrs&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[00:47:56] (HAProxyEdgeTrafficDrop) firing: 69% request drop in text@drmrs during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=drmrs&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[00:52:11] (HAProxyEdgeTrafficDrop) resolved: 69% request drop in text@drmrs during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=drmrs&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[07:17:12] 10Traffic, 10netops, 10Infrastructure-Foundations, 10SRE, 10User-jbond: fetch_external_clouds_vendors_nets.py fails to update DigitalOcean network ranges - https://phabricator.wikimedia.org/T313206 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez DigitalOcean restored the CSV and it's now working as...
[07:17:20] 10netops, 10Infrastructure-Foundations, 10SRE, 10Traffic-Icebox, and 2 others: varnish filtering: should we automatically update public_cloud_nets - https://phabricator.wikimedia.org/T270391 (10Vgutierrez)
[07:17:26] 10netops, 10Infrastructure-Foundations: asw2-c5-eqiad crash - https://phabricator.wikimedia.org/T313382 (10ayounsi) p:05Triage→03High
[07:30:02] 10netops, 10Infrastructure-Foundations, 10SRE, 10Patch-For-Review: asw2-c5-eqiad crash - https://phabricator.wikimedia.org/T313382 (10Marostegui) Critical DB infra there: - dbproxy1020 (m3 current proxy): needs failover. - pc1013 active pc3 master: needs failover - db1181 s7 master: needs failover T313383...
[07:30:26] 10netops, 10Infrastructure-Foundations, 10SRE, 10Patch-For-Review: asw2-c5-eqiad crash - https://phabricator.wikimedia.org/T313382 (10Marostegui)
[07:31:47] 10netops, 10Infrastructure-Foundations, 10ops-eqiad: eqiad row C switch fabric recabling - https://phabricator.wikimedia.org/T313384 (10ayounsi) p:05Triage→03High
[07:33:31] 10netops, 10Infrastructure-Foundations, 10SRE, 10Patch-For-Review, 10cloud-services-team (Kanban): asw2-c5-eqiad crash - https://phabricator.wikimedia.org/T313382 (10Marostegui)
[07:47:57] 10netops, 10Infrastructure-Foundations, 10SRE, 10Patch-For-Review, 10cloud-services-team (Kanban): asw2-c5-eqiad crash - https://phabricator.wikimedia.org/T313382 (10ayounsi) This didn't get caught by monitoring. We have a LibreNMS alert that triggers when any "emergency" log is sent by a device, but loo...
[07:49:23] 10netops, 10Infrastructure-Foundations, 10ops-eqiad: eqiad row C switch fabric recabling - https://phabricator.wikimedia.org/T313384 (10Peachey88)
[08:14:09] 10netops, 10Infrastructure-Foundations, 10ops-eqiad: eqiad row C switch fabric recabling - https://phabricator.wikimedia.org/T313384 (10ayounsi)
[08:14:15] 10netops, 10Infrastructure-Foundations, 10ops-eqiad: eqiad row C switch fabric recabling - https://phabricator.wikimedia.org/T313384 (10ayounsi)
[08:14:21] 10netops, 10Infrastructure-Foundations, 10SRE, 10Patch-For-Review, 10cloud-services-team (Kanban): asw2-c5-eqiad crash - https://phabricator.wikimedia.org/T313382 (10ayounsi)
[08:43:40] 10netops, 10Infrastructure-Foundations, 10SRE, 10ops-eqiad: cr2-eqiad:FPC3 partial failure (PIC2/3) - https://phabricator.wikimedia.org/T312745 (10ayounsi) 05Resolved→03Open Since the replacement errors rate on one of the interfaces went though the roof: https://librenms.wikimedia.org/graphs/to=1658306...
[09:15:39] 10netops, 10Infrastructure-Foundations, 10SRE, 10Patch-For-Review, 10cloud-services-team (Kanban): asw2-c5-eqiad crash - https://phabricator.wikimedia.org/T313382 (10ayounsi) Opened high severity JTAC case 2022-0720-513915. In the meantime we need to discuss if we want to preemptively replace FPC5 with a...
[11:17:57] 10netops, 10Infrastructure-Foundations, 10SRE, 10Patch-For-Review, 10cloud-services-team (Kanban): asw2-c5-eqiad crash - https://phabricator.wikimedia.org/T313382 (10Marostegui) m3-master dbproxy has been failed over.
[11:34:50] 10netops, 10Infrastructure-Foundations, 10SRE, 10cloud-services-team (Kanban): asw2-c5-eqiad crash - https://phabricator.wikimedia.org/T313382 (10dcaro)
[13:20:24] 10netops, 10Infrastructure-Foundations, 10SRE, 10ops-eqiad: cr2-eqiad:FPC3 partial failure (PIC2/3) - https://phabricator.wikimedia.org/T312745 (10ayounsi) 05Open→03Resolved Nevermind, tracked in T313337
[13:49:45] 10HTTPS, 10Traffic, 10SRE, 10serviceops, 10Abstract Wikipedia team (Phase λ – Launch): Get new edge & internal HTTPS certificates expanded to add wikifunctions.org and *.wikifunctions.org - https://phabricator.wikimedia.org/T313227 (10BBlack) a:03Jdforrester-WMF Hi - the process for the public certs+DN...
[14:01:03] vgutierrez: you happy for me to deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/768766 (wikimedia_domains) now? anything specific i should be aware of
[14:01:14] 10HTTPS, 10Traffic, 10SRE, 10serviceops, 10Abstract Wikipedia team (Phase λ – Launch): Get new edge & internal HTTPS certificates expanded to add wikifunctions.org and *.wikifunctions.org - https://phabricator.wikimedia.org/T313227 (10Jdforrester-WMF) >>! In T313227#8091301, @BBlack wrote: > Hi - the pro...
[14:01:49] jbond: let's play it safe, disable puppet on A:cp, and test it in one node
[14:01:51] jbond: oh my god I love that patch
[14:02:20] vgutierrez: ack will do and thanks cdanis :)
[14:02:50] 10netops, 10Infrastructure-Foundations, 10SRE, 10ops-eqiad: eqiad row C switch fabric recabling - https://phabricator.wikimedia.org/T313384 (10cmooney) Agreed this is a good idea. I can see why it may have been "left alone" previously but given we'd had issues best to bite the bullet and do it. The 40G u...
[14:06:19] vgutierrez: what's the best thing to watch as an indicator of success/failure?
[14:08:02] so for maps, 403 rate
[14:08:25] ack thanks
[14:08:46] and a manual check on the HSTS header being delivered as usual
[14:11:12] ack will do
[14:18:36] not sure if already mentioned here: https://security.googleblog.com/2022/07/dns-over-http3-in-android.html
[14:20:10] I sent it to sukhe earlier today :)
[14:20:22] but no, didn't share it here, my bad, thanks elukey
[14:25:56] vgutierrez: fyi i'm reverting, it failed to validate the varnish reload check. i'll do some more testing in the vagrant box (which i had forgotten to do)
[15:03:16] (VarnishTrafficDrop) firing: Varnish traffic in eqsin has dropped 65.02862959596865% - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DVarnishTrafficDrop
[15:04:56] (HAProxyEdgeTrafficDrop) firing: 56% request drop in text@eqsin during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqsin&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[15:06:35] (PurgedHighEventLag) firing: (10) High event process lag with purged on cp5001:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[15:08:16] (VarnishTrafficDrop) resolved: (2) Varnish traffic in eqsin has dropped 54.745541229661825% - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DVarnishTrafficDrop
[15:09:56] (HAProxyEdgeTrafficDrop) resolved: 58% request drop in text@eqsin during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqsin&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[15:11:35] (PurgedHighEventLag) resolved: (24) High event process lag with purged on cp5001:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[15:13:27] vgutierrez: issue was a missing ')'. is it ok for me to give things another test? https://gerrit.wikimedia.org/r/c/operations/puppet/+/815728/1..3/modules/varnish/templates/wikimedia-frontend.vcl.erb
[15:13:34] not right now :)
[15:13:51] oh shit sorry
[15:14:09] * jbond will leave it until tomorrow
[15:30:19] ack, thanks :)
[18:39:03] bblack: when you get the chance, would love to know your thoughts on next steps for https://phabricator.wikimedia.org/T138093 (query param normalization). the tl;dr is that we now have a vmod that does this correctly, it's packaged, and deployed on beta. needs a strategy for rolling out to prod.
[19:05:43] ori: I'd guess like you said, X-Wikimedia-Debug is a first step (and will imply rolling out the vmod package, etc as well)
[19:06:07] from there it's a little thorny. Not sure if we want to take the risk of applying it to all misc domains, or narrow it to just mediawiki
[19:06:49] (also, we could look at data on upload cluster and see if it could help there, too. Maybe there are multiple re-orderings of image resizing/format params and such?)
[19:08:42] by all the misc domains, I mean e.g. phabricator and logstash and tendril and cxserver and the other hundred or so services we pay less attention to the semantics of
[19:09:41] I think in VCL we can limit it easily to the traditional text-cluster case (which means just mediawiki and RB+friends (the oids))
[19:11:37] actually I think the "friends" list is down to just cxserver now
[19:12:33] from there, I guess we could try all traffic on a single cache host or something, might be easier to deal with fallout that way than using a random traffic sample everywhere.
[19:15:59] other applications might be sensitive to the order of query parameters, or (if they're implemented in a language other than PHP) might handle duplicate parameters differently, so I'd be nervous about turning this on for misc domains
[19:19:43] it'd be interesting to look at uploads, yeah. I'm trying to think of a good way to analyze the potential impact. Need a way to compute the canonicalized query string and count seen variations over traffic log data
[19:23:49] so yeah, we can limit to the traditional-text cases I think, which I believe is just Mediawiki (appservers+api) + Restbase + cxserver now.
[19:24:10] and from there, if we want to exclude either of the latter two, that's pretty easy on hostname or path-regex
[19:26:21] your beta cluster patch is inside normalize_request
[19:27:03] in the actual "vcl_recv" in that file, the call sequence is basically "call normalize_request; call cluster_fe_vcl_switch;"
[19:27:54] everything after "call cluster_fe_vcl_switch" is only operating on MW/RB/cxserver, because everything else (misc) flipped over to a different VCL file at that point.
[19:28:44] so we might just need an extra sub right after it, say "normalize_request_nonmisc" or something, to park this in
[19:29:07] ack
[19:29:50] I'll summarize on the task
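For reference, a minimal hypothetical VCL fragment of the idea sketched above: a new sub parked right after "call cluster_fe_vcl_switch;", gated on X-Wikimedia-Debug as the first rollout step. The sub name is just the placeholder suggested in the conversation, and vmod_std's std.querysort() stands in for the dedicated normalization vmod (whose real name and API aren't part of this log), so treat it as an illustration rather than the deployed wikimedia-frontend.vcl.erb code:

    import std;

    # Hypothetical sketch only, not production VCL. std.querysort() stands in
    # for the query-normalization vmod discussed above.
    sub normalize_request_nonmisc {
        # First rollout step: only touch requests carrying X-Wikimedia-Debug.
        if (req.http.X-Wikimedia-Debug && req.url ~ "\?") {
            # Canonicalize query-parameter order so equivalent URLs share a
            # single cache object.
            set req.url = std.querysort(req.url);
        }
    }

    sub vcl_recv {
        # ... existing calls in the frontend VCL ...
        # call normalize_request;
        # call cluster_fe_vcl_switch;

        # Only MW / RestBase / cxserver traffic reaches this point; misc
        # domains have already switched to a different VCL file.
        call normalize_request_nonmisc;
    }

From there, widening the gate to all traffic on a single cache host (as suggested at 19:12) would just mean relaxing the X-Wikimedia-Debug condition on that host.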
[20:05:21] ori: seems like upload doesn't really use params commonly, other than for maps tiles, which isn't worth it
[20:05:34] upload uses path info for size/format
[20:06:03] e.g. /wikipedia/commons/thumb/d/d3/Jesu%C3%ADta_Barbosa_during_an_interview_in_January_2019_02.png/200px-Jesu%C3%ADta_Barbosa_during_an_interview_in_January_2019_02.png.webp
[20:06:19] which is probably more-sensible anyways :)
[20:06:51] there might be some other kinds of normalization that could be applied there, but it's not queries
[20:08:18] the one normalization pattern that stands out from staring at snippets of varnish logs, is the format extension on thumbnails
[20:10:17] e.g. using this as an example: /wikipedia/commons/thumb/f/f5/Flag_of_Cross_of_Burgundy.svg/46px-Flag_of_Cross_of_Burgundy.svg.png
[20:10:40] all that apparently matters for the format to convert to, is the final .foo
[20:10:51] but you get the same output from ending that URI with any of:
[20:11:04] [...]Burgundy.svg.png
[20:11:06] [...]Burgundy.png
[20:11:09] [...]Burgundy.svg.png.png
[20:11:14] [...]Burgundy.svg.png.asdf.xyz.png
[20:11:33] and there are obvious examples in short logs, of duplicates like that, e.g. URLs ending in .svg.jpg.jpg.jpg
[20:16:14] that's interesting
[20:17:25] we could maybe do a simple regex just for the easy/common case
[20:17:39] if it's a thumb uri and ends in .dupe.dupe, reduce the dupes
[20:17:42] for text requests, I came across a number of cases of code that generates URLs with duplicate parameters, so I figured query-sorting was superior to playing whack-a-mole
[20:17:56] yeah
[20:17:58] but the case you're citing now could conceivably be attributable to a single bug somewhere
[20:18:18] quite possibly!
[20:18:38] the norm seems to be ".svg.png" at the end, when the original was .svg
[20:18:53] but the .svg.png.png case seems common enough, not sure why
[20:20:45] hmmm let me dig some more
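For illustration, a rough VCL sketch of the ".dupe.dupe" reduction idea floated above, assuming the /<family>/<project>/thumb/ path layout from the examples; the sub name and the extension list are made up here, and (as it turns out just below) the duplicated suffixes are an artifact of internal rewriting rather than client traffic, so this is only a sketch of the idea, not something that was deployed:

    # Hypothetical helper, not deployed VCL. Would be called from vcl_recv on
    # the upload cluster only.
    sub normalize_thumb_dupe_ext {
        # Thumbnail paths only, per the /<family>/<project>/thumb/ examples.
        if (req.url ~ "^/[^/]+/[^/]+/thumb/") {
            # Collapse an immediately repeated trailing extension:
            # ".svg.png.png" -> ".svg.png", ".svg.jpg.jpg.jpg" -> ".svg.jpg";
            # a no-op when there is no repetition.
            set req.url = regsub(req.url, "(\.(?:png|jpe?g|webp|gif))\1+$", "\1");
        }
    }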
[20:22:19] no, this is fake, it's some internal rewriting for the webp "experiemtn"
[20:22:23] *experiment heh
[20:24:09] and other rewrites
[20:24:25] basically there's already a lot of VCL working on this problem, and it causes confusing log noise for ReqURL :)
[20:25:49] what's the webp experiment?
[20:26:29] is the wmf serving webp? cool if true
[20:28:09] https://phabricator.wikimedia.org/T269946
[20:28:25] also https://phabricator.wikimedia.org/T27611 + https://phabricator.wikimedia.org/T211661 are related
[20:28:54] gilles had it going as a conditional experiment, something like "if this image has been hit more than X times [is hot], and the UA advertises webp support, auto-convert to webp for them"
[20:29:34] and I think the experiment bogged down at some middling stage (might've been us bogging him down on priority, wouldn't surprise me), and now he's gone
[20:29:41] and it's still there in whatever state it was left in
[20:29:58] I'm pretty sure I knew about this and forgot about it
[20:30:11] there were some concerns. that third ticket is about cleaning up room from stale old thumbs to make room for more webp.
[20:30:38] and we were also at one point waiting for consumer webp support to ramp up (but pretty sure we're well past that point now)
[20:31:28] it's a clever/hacky way to auto-webp for some significant chunk of traffic where it makes the most sense
[20:31:35] the current VCL code I mean
[20:31:49] in the long run, it might be better to support it in a more-native way :)
[20:32:48] yeah that's the problem (or benefit, depending on your perspective) to solutions like this that capture most of the area under curve
[20:33:00] also webp conversion isn't universally reliable apparently, there's some code to fall back to jpeg or whatever on failure
[20:33:04] "we'll get to the long tail eventually" famous last words etc.
[20:33:54] https://github.com/wikimedia/puppet/blob/production/modules/varnish/templates/upload-frontend.inc.vcl.erb#L382
[20:37:01] how expensive is Swift storage space anyway
[20:37:44] tying webp to the unused-thumbnail-cleanup issue seems like a way of holding the former hostage in the hope that it motivates someone to work on the latter problem
[20:38:36] if only everyone reading this donated $2.75!
[20:38:38] yeah I donno. I could guess, but I know people that know things were involved in that discussion before
[20:38:52] apparently we store a lot of unused cruft, and space is at a premium
[20:39:08] 10netops, 10Infrastructure-Foundations, 10SRE, 10ops-eqiad: eqiad row C switch fabric recabling - https://phabricator.wikimedia.org/T313384 (10wiki_willy) a:03Jclark-ctr
[20:39:09] (and swift isn't a cache, it won't evict on its own)
[20:39:45] the whole architecture of how we store+serve media files deserves a serious rethink. It probably needed one years ago, even moreso now :)
[20:40:25] most of the recent work on it nibbles at the edges without shaking things up too much
[20:40:48] but we are storing a lot of cruft, and storing things in the wrong places for the wrong reasons, etc, I think
[20:41:53] thumbnail storage should be more like a cache
[20:42:32] (arguably, it could all be in the actual edge caches, if the thumbnailer scaled better for spikes, and maybe the caches had a little more storage, etc)
[20:43:22] anyways, I won't pretend to be able to re-design it on the spot here, I just know it smells and needs looking at someday. moving on! :)
[20:45:57] * ori plummets deeper and deeper into the Phabricator rabbit-hole
[20:53:09] (it looks like there was/is an actual crunch for swift space so this wasn't hostage-taking)
[21:24:48] 10Traffic, 10Performance-Team, 10SRE, 10SRE-swift-storage, 10Patch-For-Review: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (10ori) It looks like the maximum rate at which swift-object-expirer will issue deletes is configurable via [[ https://github.com/op...
[21:41:54] 10netops, 10Infrastructure-Foundations, 10SRE, 10cloud-services-team (Kanban): asw2-c5-eqiad crash - https://phabricator.wikimedia.org/T313382 (10nskaggs) >>! In T313382#8090176, @Marostegui wrote: > - dbproxy1018 and dbproxy1019 are active WMCS proxies, need to be handled by them cc @nskaggs (they should...
[22:02:56] (HAProxyEdgeTrafficDrop) firing: (3) 34% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[22:07:56] (HAProxyEdgeTrafficDrop) resolved: (3) 34% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop