[11:35:38] Traffic, Data-Engineering, Event-Platform, SRE, and 2 others: Banner sampling leading to a relatively wide site outage (mostly esams) - https://phabricator.wikimedia.org/T303036 (jbond) p:Triage→Medium
[11:36:10] Traffic, DNS, SRE, Traffic-Icebox, Sustainability (Incident Followup): Automate DNS depools such that manual commits are not required - https://phabricator.wikimedia.org/T303219 (jbond) p:Triage→Medium
[11:41:28] Traffic, SRE, serviceops: upstream connect error or disconnect/reset before headers. reset reason: overflow - https://phabricator.wikimedia.org/T303305 (jbond) p:Triage→Medium
[11:41:50] Traffic, SRE, Patch-For-Review: OCSP staple validity alerts/warnings misfire - https://phabricator.wikimedia.org/T304047 (Vgutierrez) p:Triage→Medium
[11:42:57] Traffic, Data-Engineering, SRE, Trust-and-Safety, serviceops: Disable GeoIP Legacy Download - https://phabricator.wikimedia.org/T303464 (jbond) p:Triage→Medium
[11:50:56] Traffic, Performance-Team, SRE, serviceops: Potential navtiming_responseStart regression as of 13 Mar 2022 - https://phabricator.wikimedia.org/T303782 (jbond) p:Triage→Medium
[11:58:30] Traffic, netops, Infrastructure-Foundations, SRE, Patch-For-Review: drmrs: initial geodns configuration - https://phabricator.wikimedia.org/T304089 (jbond) p:Triage→Medium
[11:59:04] Traffic, SRE, serviceops: upstream connect error or disconnect/reset before headers. reset reason: overflow - https://phabricator.wikimedia.org/T303305 (Joe) Open→Resolved a:Joe This happened during an outage. That is the TLS terminator of the application servers (envoy) circuit-breaking...
[13:34:12] Traffic, SRE: Wikimedia domains unreachable (16 Mar 2022) - https://phabricator.wikimedia.org/T303903 (jbond) Open→Resolved a:jbond @AlexisJazz thanks for the report; it appears that there was a small blip in traffic when we turned on our new DRMRS PoP. It seems the issue lasted only a few sec...
[14:02:14] Traffic, Performance-Team, SRE, serviceops: Potential navtiming_responseStart regression as of 13 Mar 2022 - https://phabricator.wikimedia.org/T303782 (Vgutierrez) latest round of HAProxy reimages were performed between March 7th and March 8th: ` * 4d58564f87 - site: Reimage cp1083 as cache::text...
[14:30:23] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp1075.eqiad.wmnet with OS buster
[15:19:21] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp1075.eqiad.wmnet with OS buster com...
[15:50:46] Traffic, SRE: OCSP staple validity alerts/warnings misfire - https://phabricator.wikimedia.org/T304047 (Vgutierrez) Open→Resolved a:Vgutierrez Fix deployed, I'm closing the task now; feel free to reopen if the issue happens again. Thanks!
[16:12:12] bblack: unscientific study from RIPE data shows a clear latency improvement for probes in Spain to drmrs (vs. esams), and a small one for France
[16:12:44] https://atlas.ripe.net/measurements/39583587/#probes vs. https://atlas.ripe.net/measurements/39583586/#probes for France
[16:13:01] https://atlas.ripe.net/frames/measurements/39583608/#!probes vs. https://atlas.ripe.net/frames/measurements/39583609/#!probes for Spain
[16:13:34] method: ctrl + tab quickly between the 2 tabs
[16:16:23] I'd take Spain, it's roughly comparable in size
[16:16:36] (if you'd prefer that to France for an easy traffic bump, I mean)
[16:17:19] even the small improvement for France is worth it, so no preference, both are fine to me
[16:17:30] pageview stats say CY is ~4M, ES is ~40M, FR is ~60M (pageviews/month)
[16:17:44] and we will probably do both at some point anyway
[16:36:04] XioNoX: do you think you/someone will have time for some of the latency mapping stuff next Q?
[16:37:40] cdanis: probably not me
[16:37:46] ok!
[16:38:06] bblack: PT is also a clear win for drmrs - https://atlas.ripe.net/frames/measurements/39584477/#!probes vs. https://atlas.ripe.net/frames/measurements/39584478/#!probes
[16:42:55] CONGRATS ON DRMRS! that's awesome!
[16:43:24] cdanis: iirc what (re-)triggered the discussion and the quick doc I wrote was the discovery of https://github.com/Netflix/probnik by Joanna
[16:43:35] yeah!
[16:43:48] XioNoX: so I have some other work I want to do on getting NEL data into Hive
[16:43:48] cdanis: the dev said he was going to update it, but got promoted since and nothing happened
[16:43:52] ahahahahaha
[16:44:14] XioNoX: okay so I think we can use NEL as the ingress + some simple stuff in Analytics for the analysis
[16:44:25] and then we just have special subdomains for sampling latency info
[16:44:31] and use what exists of probnik to trigger requests to them
[16:45:05] yeah, the tricky part is to get latency data to other sites
[16:45:15] not sure I understand the part about NEL?
[16:54:10] so what I was thinking was
[16:54:25] * make special per-site subdomains like drmrs.latencyprobe.wikimedia.org or something
[16:54:35] * publish NEL policies for them that have a success_fraction of 1.0
[16:54:47] * javascript to conditionally initiate background fetches from them
[16:54:52] and then we get an elapsed_time as part of the report body
[16:55:42] yeah, that kind of thing could work!
[16:56:11] getting the data at all is kind of step 1, and then we have to design+automate the rest of the pipeline, too
[16:56:22] yeah
[16:56:38] this is as good a reminder as any to file a bug about getting NEL in the data lake as well
[16:56:42] (to ingest the historical data, give it some kind of recency-weighting, and turn it into something like a maxmind db)
[16:57:13] and layer it in with maxmind and/or our manual country-mapping or geodistance as a fallback for where we have little data
[16:57:43] there are some other projects that naturally fit together in the same scope as well, when considering how the engineering details play out
[16:58:25] (like, we'd really like to spin up experimenting with alt-svc redirection at the HTTP layer as well, and have it operate on the same dataset we use for geoip, but obviously it's very different lookup->execution code than the dns server)
[16:59:20] you could hack them together by having that part do an ednsc lookup via authdns, but that adds latency and frailty in a pretty hot path.
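A minimal sketch of the NEL-based probing idea laid out at 16:54-16:55, purely as an illustration: per-site probe subdomains publish a NEL policy with success_fraction 1.0, and client-side JavaScript conditionally fires a background fetch, so each sampled request yields a NEL report carrying an elapsed_time for that edge site. The report group name, intake endpoint, site list, and sampling rate below are assumptions, not anything deployed; only the drmrs.latencyprobe.wikimedia.org hostname pattern comes from the discussion itself.

```javascript
// Headers the probe subdomain would return, so that every sampled fetch to it
// generates a NEL report whose body includes elapsed_time (intake URL is a
// placeholder):
//
//   Report-To: {"group": "latency-probe", "max_age": 86400,
//               "endpoints": [{"url": "https://nel-intake.example.org/report"}]}
//   NEL: {"report_to": "latency-probe", "max_age": 86400, "success_fraction": 1.0}

// Client side: conditionally fire a background fetch from a small fraction of
// pageviews; the response body is irrelevant, the NEL report is the payload.
const PROBE_SITES = ['drmrs', 'esams', 'eqiad'];   // illustrative site list
const SAMPLE_RATE = 0.001;                         // illustrative sampling rate

if (Math.random() < SAMPLE_RATE) {
  const site = PROBE_SITES[Math.floor(Math.random() * PROBE_SITES.length)];
  // Cache-busting query string so the timing reflects a real round trip.
  fetch(`https://${site}.latencyprobe.wikimedia.org/probe?t=${Date.now()}`, {
    mode: 'no-cors',
    cache: 'no-store',
    credentials: 'omit',
  }).catch(() => { /* network failures also produce NEL reports; nothing to do */ });
}
```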
[16:59:44] better to just bring the data over to separate front edge code that's injecting the alt-svc response
[17:00:12] I hadn't realized that alt-svc was pretty broadly implemented
[17:00:24] we have no idea, we've never experimented
[17:00:35] or investigated, or whatever
[17:01:04] it's a years-old ticket to go look at that, because it could give us some wins when DNS points in the wrong direction (far-off recursor exits, lack of ednsc, etc)
[17:02:15] but the intended design is spot-on for this use-case anyways. the spec even talks about how the UA should continue with the current connection while establishing the new one, etc
[17:02:39] if it works, we could even use it for smoother depools of long-running connections to cache servers, too.
[17:03:01] (don't ask me how, but seems like it)
[17:03:18] whole sites full of them, I mean, not per-server
[17:04:27] yeah
[17:04:55] re: layering the data, one important property that stands out is that some kind of "confidence" rating might be good to have in the dataset
[17:05:39] because we can infer confidence from maxmind's fallback data too (if it was a very narrow/specific result with a small radius, or some generic "it's somewhere in the AP region" sort of geoip map result).
[17:06:05] and compare that to the confidence level of our NEL metrics (if it's got a consistent strong history of reports from that network, or little data or flapping data)
[17:06:31] and then do a little better than just assuming our data is always better than maxmind's
[17:07:49] anyways, random thoughts from reviving some of this out of the back of my head. The main point is, yeah, it would be nice to see efforts in these directions get some priority someday!
[17:10:23] https://caniuse.com/?search=alt-svc <- there's my 30 seconds of effort on current alt-svc status :)
[17:10:49] looks like Chrome is very recent, FF has had it longer
[17:11:21] we'd need to see if it even has the expected properties and works sanely, too
[17:13:45] our old ticket: https://phabricator.wikimedia.org/T208242
[18:35:45] Traffic, Performance-Team, SRE, serviceops: Potential navtiming_responseStart regression as of 13 Mar 2022 - https://phabricator.wikimedia.org/T303782 (Krinkle) a:Krinkle
[19:20:08] XioNoX: maybe let's try PT since it's smaller, then move up to ES and FR in size order? that should be plenty of traffic to get some "proof" and hold us over while we work on real mapping? maybe starting tomorrow?
[20:01:11] (and then we can keep iterating on what we think as we go of course, culminating in some failover testing)
[20:26:27] bblack: sounds good! ping me when you get online
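A hedged sketch of the alt-svc experiment floated at 16:58-17:02: front-edge code consults the same latency/geo dataset used for GeoDNS and, when the client landed somewhere worse than the data suggests, advertises the better site via an Alt-Svc header (RFC 7838), letting the UA keep its current connection while it establishes the new one. The lookup inputs and the hostname pattern are assumptions for illustration only.

```javascript
// Assumes some upstream lookup has already mapped this client's network to a
// preferred edge site; the text-lb.<site> hostname pattern is illustrative.
function altSvcHeader(bestSite, currentSite) {
  if (bestSite === currentSite) {
    return null;                         // already at the best site, say nothing
  }
  // RFC 7838 syntax: protocol-id="host:port"; ma=<seconds the hint stays fresh>
  return `h2="text-lb.${bestSite}.wikimedia.org:443"; ma=86400`;
}

// e.g. a request GeoDNS routed to esams, from a network that measures closer to
// drmrs, would be answered with:
//   Alt-Svc: h2="text-lb.drmrs.wikimedia.org:443"; ma=86400
```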
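And one possible shape of the "confidence" layering described at 17:04-17:06, again only as an assumption-laden sketch: trust the NEL-derived mapping for a network only when it has a strong, stable report history, and otherwise fall back to maxmind, itself weighted by how specific its answer was. The field names and thresholds are invented for illustration.

```javascript
// Sketch only: nelEntry aggregates NEL elapsed_time reports per client network,
// maxmindEntry is the GeoIP fallback; all fields and cutoffs are hypothetical.
function pickSite(nelEntry, maxmindEntry) {
  // Strong, non-flapping NEL history beats GeoIP; thin or unstable history doesn't.
  const nelConfidence =
    nelEntry && nelEntry.reportCount >= 500 && nelEntry.flapScore < 0.2 ? 1.0 : 0.0;
  // A narrow GeoIP result (small accuracy radius) is worth more than a vague
  // "somewhere in the region" answer.
  const maxmindConfidence = maxmindEntry.accuracyRadiusKm < 200 ? 0.7 : 0.3;

  return nelConfidence > maxmindConfidence ? nelEntry.site : maxmindEntry.site;
}
```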
[21:38:57] (EdgeTrafficDrop) firing: 67% request drop in text@drmrs during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=drmrs&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DEdgeTrafficDrop
[21:43:56] (EdgeTrafficDrop) resolved: 67% request drop in text@drmrs during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=drmrs&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DEdgeTrafficDrop
[22:47:57] (EdgeTrafficDrop) firing: 68% request drop in text@drmrs during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=drmrs&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DEdgeTrafficDrop
[22:52:56] (EdgeTrafficDrop) resolved: 68% request drop in text@drmrs during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=drmrs&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DEdgeTrafficDrop