[00:36:57] (HAProxyEdgeTrafficDrop) firing: 66% request drop in text@drmrs during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=drmrs&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[00:41:56] (HAProxyEdgeTrafficDrop) resolved: 68% request drop in text@drmrs during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=drmrs&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[00:42:56] (HAProxyEdgeTrafficDrop) firing: 69% request drop in text@drmrs during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=drmrs&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[00:47:11] (HAProxyEdgeTrafficDrop) resolved: 68% request drop in text@drmrs during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=drmrs&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[00:47:56] (HAProxyEdgeTrafficDrop) firing: 69% request drop in text@drmrs during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=drmrs&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[00:52:11] (HAProxyEdgeTrafficDrop) resolved: 69% request drop in text@drmrs during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=drmrs&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[07:17:12] 10Traffic, 10netops, 10Infrastructure-Foundations, 10SRE, 10User-jbond: fetch_external_clouds_vendors_nets.py fails to update DigitalOcean network ranges - https://phabricator.wikimedia.org/T313206 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez DigitalOcean restored the CSV and it's now working as...
[07:17:20] 10netops, 10Infrastructure-Foundations, 10SRE, 10Traffic-Icebox, and 2 others: varnish filtering: should we automatically update public_cloud_nets - https://phabricator.wikimedia.org/T270391 (10Vgutierrez)
[07:17:26] 10netops, 10Infrastructure-Foundations: asw2-c5-eqiad crash - https://phabricator.wikimedia.org/T313382 (10ayounsi) p:05Triage→03High
[07:30:02] 10netops, 10Infrastructure-Foundations, 10SRE, 10Patch-For-Review: asw2-c5-eqiad crash - https://phabricator.wikimedia.org/T313382 (10Marostegui) Critical DB infra there: - dbproxy1020 (m3 current proxy): needs failover. - pc1013 active pc3 master: needs failover - db1181 s7 master: needs failover T313383...
[07:30:26] 10netops, 10Infrastructure-Foundations, 10SRE, 10Patch-For-Review: asw2-c5-eqiad crash - https://phabricator.wikimedia.org/T313382 (10Marostegui)
[07:31:47] 10netops, 10Infrastructure-Foundations, 10ops-eqiad: eqiad row C switch fabric recabling - https://phabricator.wikimedia.org/T313384 (10ayounsi) p:05Triage→03High
[07:33:31] 10netops, 10Infrastructure-Foundations, 10SRE, 10Patch-For-Review, 10cloud-services-team (Kanban): asw2-c5-eqiad crash - https://phabricator.wikimedia.org/T313382 (10Marostegui)
[07:47:57] 10netops, 10Infrastructure-Foundations, 10SRE, 10Patch-For-Review, 10cloud-services-team (Kanban): asw2-c5-eqiad crash - https://phabricator.wikimedia.org/T313382 (10ayounsi) This didn't get caught by monitoring. We have a LibreNMS alert that triggers when any "emergency" log is sent by a device, but loo...
[07:49:23] 10netops, 10Infrastructure-Foundations, 10ops-eqiad: eqiad row C switch fabric recabling - https://phabricator.wikimedia.org/T313384 (10Peachey88)
[08:14:09] 10netops, 10Infrastructure-Foundations, 10ops-eqiad: eqiad row C switch fabric recabling - https://phabricator.wikimedia.org/T313384 (10ayounsi)
[08:14:15] 10netops, 10Infrastructure-Foundations, 10ops-eqiad: eqiad row C switch fabric recabling - https://phabricator.wikimedia.org/T313384 (10ayounsi)
[08:14:21] 10netops, 10Infrastructure-Foundations, 10SRE, 10Patch-For-Review, 10cloud-services-team (Kanban): asw2-c5-eqiad crash - https://phabricator.wikimedia.org/T313382 (10ayounsi)
[08:43:40] 10netops, 10Infrastructure-Foundations, 10SRE, 10ops-eqiad: cr2-eqiad:FPC3 partial failure (PIC2/3) - https://phabricator.wikimedia.org/T312745 (10ayounsi) 05Resolved→03Open Since the replacement errors rate on one of the interfaces went though the roof: https://librenms.wikimedia.org/graphs/to=1658306...
[09:15:39] 10netops, 10Infrastructure-Foundations, 10SRE, 10Patch-For-Review, 10cloud-services-team (Kanban): asw2-c5-eqiad crash - https://phabricator.wikimedia.org/T313382 (10ayounsi) Opened high severity JTAC case 2022-0720-513915. In the meantime we need to discuss if we want to preemptively replace FPC5 with a...
[11:17:57] 10netops, 10Infrastructure-Foundations, 10SRE, 10Patch-For-Review, 10cloud-services-team (Kanban): asw2-c5-eqiad crash - https://phabricator.wikimedia.org/T313382 (10Marostegui) m3-master dbproxy has been failed over.
[11:34:50] 10netops, 10Infrastructure-Foundations, 10SRE, 10cloud-services-team (Kanban): asw2-c5-eqiad crash - https://phabricator.wikimedia.org/T313382 (10dcaro)
[13:20:24] 10netops, 10Infrastructure-Foundations, 10SRE, 10ops-eqiad: cr2-eqiad:FPC3 partial failure (PIC2/3) - https://phabricator.wikimedia.org/T312745 (10ayounsi) 05Open→03Resolved Nevermind, tracked in T313337
[13:49:45] 10HTTPS, 10Traffic, 10SRE, 10serviceops, 10Abstract Wikipedia team (Phase λ – Launch): Get new edge & internal HTTPS certificates expanded to add wikifunctions.org and *.wikifunctions.org - https://phabricator.wikimedia.org/T313227 (10BBlack) a:03Jdforrester-WMF Hi - the process for the public certs+DN...
[14:01:03] vgutierrez: you happy for me to deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/768766 (wikimedia_domains) now? anything specific i should be aware of
[14:01:14] 10HTTPS, 10Traffic, 10SRE, 10serviceops, 10Abstract Wikipedia team (Phase λ – Launch): Get new edge & internal HTTPS certificates expanded to add wikifunctions.org and *.wikifunctions.org - https://phabricator.wikimedia.org/T313227 (10Jdforrester-WMF) >>! In T313227#8091301, @BBlack wrote: > Hi - the pro...
[14:01:49] jbond: let's play it safe, disable puppet on A:cp, and test it in one node
[14:01:51] jbond: oh my god I love that patch
[14:02:20] vgutierrez: ack will do and thanks cdanis :)
[14:02:50] 10netops, 10Infrastructure-Foundations, 10SRE, 10ops-eqiad: eqiad row C switch fabric recabling - https://phabricator.wikimedia.org/T313384 (10cmooney) Agreed this is a good idea. I can see why it may have been "left alone" previously but given we'd had issues best to bite the bullet and do it. The 40G u...
[14:06:19] vgutierrez: what's the best thing to watch as an indicator of success/failure?
[14:08:02] so for maps, 403 rate
[14:08:25] ack thanks
[14:08:46] and a manual check on the HSTS header being delivered as usual
[14:11:12] ack will do
[14:18:36] not sure if already mentioned here: https://security.googleblog.com/2022/07/dns-over-http3-in-android.html
[14:20:10] I sent it to sukhe earlier today :)
[14:20:22] but no, didn't share it here, my bad, thanks elukey
[14:25:56] vgutierrez: fyi i'm reverting, it failed to validate the varnish reload check. i'll do some more testing in the vagrant box (which i had forgotten to do)
[15:03:16] (VarnishTrafficDrop) firing: Varnish traffic in eqsin has dropped 65.02862959596865% - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DVarnishTrafficDrop
[15:04:56] (HAProxyEdgeTrafficDrop) firing: 56% request drop in text@eqsin during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqsin&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[15:06:35] (PurgedHighEventLag) firing: (10) High event process lag with purged on cp5001:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[15:08:16] (VarnishTrafficDrop) resolved: (2) Varnish traffic in eqsin has dropped 54.745541229661825% - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DVarnishTrafficDrop
[15:09:56] (HAProxyEdgeTrafficDrop) resolved: 58% request drop in text@eqsin during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqsin&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[15:11:35] (PurgedHighEventLag) resolved: (24) High event process lag with purged on cp5001:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[15:13:27] vgutierrez: issue was a missing ')'. is it ok for me to give things another test? https://gerrit.wikimedia.org/r/c/operations/puppet/+/815728/1..3/modules/varnish/templates/wikimedia-frontend.vcl.erb
[15:13:34] not right now :)
[15:13:51] oh shit sorry
[15:14:09] * jbond will leave it until tomorrow
[15:30:19] ack, thanks :)
[18:39:03] bblack: when you get the chance, would love to know your thoughts on next steps for https://phabricator.wikimedia.org/T138093 (query param normalization). the tl;dr is that we now have a vmod that does this correctly, it's packaged, and deployed on beta. needs a strategy for rolling out to prod.
[19:05:43] ori: I'd guess like you said, X-Wikimedia-Debug is a first step (and will imply rolling out the vmod package, etc as well)
[19:06:07] from there it's a little thorny. Not sure if we want to take the risk of applying it to all misc domains, or narrow it to just mediawiki
[19:06:49] (also, we could look at data on upload cluster and see if it could help there, too. Maybe there are multiple re-orderings of image resizing/format params and such?)
[19:08:42] by all the misc domains, I mean e.g. phabricator and logstash and tendril and cxserver and the other hundred or so services we pay less attention to the semantics of
[19:09:41] I think in VCL we can limit it easily to the traditional text-cluster case (which means just mediawiki and RB+friends (the oids))
[19:11:37] actually I think the "friends" list is down to just cxserver now
[19:12:33] from there, I guess we could try all traffic on a single cache host or something, might be easier to deal with fallout that way than using a random traffic sample everywhere.
[19:15:59] other applications might be sensitive to the order of query parameters, or (if they're implemented in a language other than PHP) might handle duplicate parameters differently, so I'd be nervous about turning this on for misc domains
[19:19:43] it'd be interesting to look at uploads, yeah. I'm trying to think of a good way to analyze the potential impact. Need a way to compute the canonicalized query string and count seen variations over traffic log data
[19:23:49] so yeah, we can limit to the traditional-text cases I think, which I believe is just Mediawiki (appservers+api) + Restbase + cxserver now.
[19:24:10] and from there, if we want to exclude either of the latter two, that's pretty easy on hostname or path-regex
[19:26:21] your beta cluster patch is inside normalize_request
[19:27:03] in the actual "vcl_recv" in that file, the call sequence is basically "call normalize_request; call cluster_fe_vcl_switch;"
[19:27:54] everything after "call cluster_fe_vcl_switch" is only operating on MW/RB/cxserver, because everything else (misc) flipped over to a different VCL file at that point.
[19:28:44] so we might just need an extra sub right after it, say "normalize_request_nonmisc" or something, to park this in
[19:29:07] ack
[19:29:50] I'll summarize on the task
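For reference, a minimal hypothetical VCL fragment of the idea sketched above: a new sub parked right after "call cluster_fe_vcl_switch;", gated on X-Wikimedia-Debug as the first rollout step. The sub name is just the placeholder suggested in the conversation, and vmod_std's std.querysort() stands in for the dedicated normalization vmod (whose real name and API aren't part of this log), so treat it as an illustration rather than the deployed wikimedia-frontend.vcl.erb code:

    import std;

    # Hypothetical sketch only, not production VCL. std.querysort() stands in
    # for the query-normalization vmod discussed above.
    sub normalize_request_nonmisc {
        # First rollout step: only touch requests carrying X-Wikimedia-Debug.
        if (req.http.X-Wikimedia-Debug && req.url ~ "\?") {
            # Canonicalize query-parameter order so equivalent URLs share a
            # single cache object.
            set req.url = std.querysort(req.url);
        }
    }

    sub vcl_recv {
        # ... existing calls in the frontend VCL ...
        # call normalize_request;
        # call cluster_fe_vcl_switch;

        # Only MW / RestBase / cxserver traffic reaches this point; misc
        # domains have already switched to a different VCL file.
        call normalize_request_nonmisc;
    }

From there, widening the gate to all traffic on a single cache host (as suggested at 19:12) would just mean relaxing the X-Wikimedia-Debug condition on that host.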
[20:05:21] ori: seems like upload doesn't really use params commonly, other than for maps tiles, which isn't worth it
[20:05:34] upload uses path info for size/format
[20:06:03] e.g. /wikipedia/commons/thumb/d/d3/Jesu%C3%ADta_Barbosa_during_an_interview_in_January_2019_02.png/200px-Jesu%C3%ADta_Barbosa_during_an_interview_in_January_2019_02.png.webp
[20:06:19] which is probably more-sensible anyways :)
[20:06:51] there might be some other kinds of normalization that could be applied there, but it's not queries
[20:08:18] the one normalization pattern that stands out from staring at snippets of varnish logs, is the format extension on thumbnails
[20:10:17] e.g. using this as an example: /wikipedia/commons/thumb/f/f5/Flag_of_Cross_of_Burgundy.svg/46px-Flag_of_Cross_of_Burgundy.svg.png
[20:10:40] all that apparently matters for the format to convert to, is the final .foo
[20:10:51] but you get the same output from ending that URI with any of:
[20:11:04] [...]Burgundy.svg.png
[20:11:06] [...]Burgundy.png
[20:11:09] [...]Burgundy.svg.png.png
[20:11:14] [...]Burgundy.svg.png.asdf.xyz.png
[20:11:33] and there are obvious examples in short logs, of duplicates like that, e.g. URLs ending in .svg.jpg.jpg.jpg
[20:16:14] that's interesting
[20:17:25] we could maybe do a simple regex just for the easy/common case
[20:17:39] if it's a thumb uri and ends in .dupe.dupe, reduce the dupes
[20:17:42] for text requests, I came across a number of cases of code that generates URLs with duplicate parameters, so I figured query-sorting was superior to playing whack-a-mole
[20:17:56] yeah
[20:17:58] but the case you're citing now could conceivably be attributable to a single bug somewhere
[20:18:18] quite possibly!
[20:18:38] the norm seems to be ".svg.png" at the end, when the original was .svg
[20:18:53] but the .svg.png.png case seems common enough, not sure why
[20:20:45] hmmm let me dig some more
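For illustration, a rough VCL sketch of the ".dupe.dupe" reduction idea floated above, assuming the /<family>/<project>/thumb/ path layout from the examples; the sub name and the extension list are made up here, and (as it turns out just below) the duplicated suffixes are an artifact of internal rewriting rather than client traffic, so this is only a sketch of the idea, not something that was deployed:

    # Hypothetical helper, not deployed VCL. Would be called from vcl_recv on
    # the upload cluster only.
    sub normalize_thumb_dupe_ext {
        # Thumbnail paths only, per the /<family>/<project>/thumb/ examples.
        if (req.url ~ "^/[^/]+/[^/]+/thumb/") {
            # Collapse an immediately repeated trailing extension:
            # ".svg.png.png" -> ".svg.png", ".svg.jpg.jpg.jpg" -> ".svg.jpg";
            # a no-op when there is no repetition.
            set req.url = regsub(req.url, "(\.(?:png|jpe?g|webp|gif))\1+$", "\1");
        }
    }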
[20:22:19] no, this is fake, it's some internal rewriting for the webp "experiemtn"
[20:22:23] *experiment heh
[20:24:09] and other rewrites
[20:24:25] basically there's already a lot of VCL working on this problem, and it causes confusing log noise for ReqURL :)
[20:25:49] what's the webp experiment?
[20:26:29] is the wmf serving webp? cool if true
[20:28:09] https://phabricator.wikimedia.org/T269946
[20:28:25] also https://phabricator.wikimedia.org/T27611 + https://phabricator.wikimedia.org/T211661 are related
[20:28:54] gilles had it going as a conditional experiment, something like "if this image has been hit more than X times [is hot], and the UA advertises webp support, auto-convert to webp for them"
[20:29:34] and I think the experiment bogged down at some middling stage (might've been us bogging him down on priority, wouldn't surprise me), and now he's gone
[20:29:41] and it's still there in whatever state it was left in
[20:29:58] I'm pretty sure I knew about this and forgot about it
[20:30:11] there were some concerns. that third ticket is about cleaning up room from stale old thumbs to make room for more webp.
[20:30:38] and we were also at one point waiting for consumer webp support to ramp up (but pretty sure we're well past that point now)
[20:31:28] it's a clever/hacky way to auto-webp for some significant chunk of traffic where it makes the most sense
[20:31:35] the current VCL code I mean
[20:31:49] in the long run, it might be better to support it in a more-native way :)
[20:32:48] yeah that's the problem (or benefit, depending on your perspective) to solutions like this that capture most of the area under curve
[20:33:00] also webp conversion isn't universally reliable apparently, there's some code to fall back to jpeg or whatever on failure
[20:33:04] "we'll get to the long tail eventually" famous last words etc.
[20:33:54] https://github.com/wikimedia/puppet/blob/production/modules/varnish/templates/upload-frontend.inc.vcl.erb#L382
[20:37:01] how expensive is Swift storage space anyway
[20:37:44] tying webp to the unused-thumbnail-cleanup issue seems like a way of holding the former hostage in the hope that it motivates someone to work on the latter problem
[20:38:36] if only everyone reading this donated $2.75!
[20:38:38] yeah I donno. I could guess, but I know people that know things were involved in that discussion before
[20:38:52] apparently we store a lot of unused cruft, and space is at a premium
[20:39:08] 10netops, 10Infrastructure-Foundations, 10SRE, 10ops-eqiad: eqiad row C switch fabric recabling - https://phabricator.wikimedia.org/T313384 (10wiki_willy) a:03Jclark-ctr
[20:39:09] (and swift isn't a cache, it won't evict on its own)
[20:39:45] the whole architecture of how we store+serve media files deserves a serious rethink. It probably needed one years ago, even moreso now :)
[20:40:25] most of the recent work on it nibbles at the edges without shaking things up too much
[20:40:48] but we are storing a lot of cruft, and storing things in the wrong places for the wrong reasons, etc, I think
[20:41:53] thumbnail storage should be more like a cache
[20:42:32] (arguably, it could all be in the actual edge caches, if the thumbnailer scaled better for spikes, and maybe the caches had a little more storage, etc)
[20:43:22] anyways, I won't pretend to be able to re-design it on the spot here, I just know it smells and needs looking at someday. moving on! :)
[20:45:57] * ori plummets deeper and deeper into the Phabricator rabbit-hole
[20:53:09] (it looks like there was/is an actual crunch for swift space so this wasn't hostage-taking)
[21:24:48] 10Traffic, 10Performance-Team, 10SRE, 10SRE-swift-storage, 10Patch-For-Review: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (10ori) It looks like the maximum rate at which swift-object-expirer will issue deletes is configurable via [[ https://github.com/op...
[21:41:54] 10netops, 10Infrastructure-Foundations, 10SRE, 10cloud-services-team (Kanban): asw2-c5-eqiad crash - https://phabricator.wikimedia.org/T313382 (10nskaggs) >>! In T313382#8090176, @Marostegui wrote: > - dbproxy1018 and dbproxy1019 are active WMCS proxies, need to be handled by them cc @nskaggs (they should...
[22:02:56] (HAProxyEdgeTrafficDrop) firing: (3) 34% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[22:07:56] (HAProxyEdgeTrafficDrop) resolved: (3) 34% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop