[06:55:56] (EdgeTrafficDrop) firing: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[07:00:56] (EdgeTrafficDrop) resolved: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[07:03:11] 10Traffic, 10SRE: Move Varnish6 from component to main - https://phabricator.wikimedia.org/T302301 (10elukey) For varnishkafka, this is the problem: ` elukey@apt1001:/srv/wikimedia$ sudo reprepro lsbycomponent varnishkafka varnishkafka | 1.0.13-1 | stretch-wikimedia | main | amd64, source varnis...
[07:10:14] mmandere: o/ left a note in --^ for the varnishkafka package versions, lemme know what you think :)
[07:40:50] elukey: o/ thanks checking
[09:47:04] 10netops, 10Infrastructure-Foundations, 10SRE, 10SRE Observability (FY2021/2022-Q3), 10User-fgiunchedi: blackbox-exporter no icmp replies on prometheus1006 for a few services - https://phabricator.wikimedia.org/T302265 (10fgiunchedi) Prometheus doesn't run on VMs in eqiad/codfw (not sure if this fact was...
[12:41:59] Hi traffic, I'd like to add an LVS service (https://gerrit.wikimedia.org/r/c/operations/puppet/+/764733)
[13:18:00] vgutierrez: any objections to do low_traffic lvs_setup? Current backups being lvs1020 and lvs2010
[14:01:17] Please go ahead jayme
[14:03:49] ack
[14:30:48] 10netops, 10Infrastructure-Foundations, 10SRE, 10SRE Observability (FY2021/2022-Q3), 10User-fgiunchedi: blackbox-exporter no icmp replies on prometheus1006 for a few services - https://phabricator.wikimedia.org/T302265 (10cmooney) > As far as this task goes to me it still remains a mystery why it looks l...
[15:15:07] 10netops, 10Discovery, 10Infrastructure-Foundations, 10SRE: Speed up network connections for Elastic hosts - https://phabricator.wikimedia.org/T301577 (10bking) Per Cathal's feedback above, we are closing this ticket as he correctly stated "it represents significant risk for what seems to be scant benefit....
[15:15:45] 10netops, 10Discovery, 10Infrastructure-Foundations, 10SRE: Speed up network connections for Elastic hosts - https://phabricator.wikimedia.org/T301577 (10bking) 05Open→03Resolved
[16:34:18] 10netops, 10Infrastructure-Foundations, 10SRE, 10SRE Observability (FY2021/2022-Q3), 10User-fgiunchedi: blackbox-exporter no icmp replies on prometheus1006 for a few services - https://phabricator.wikimedia.org/T302265 (10BBlack) >>! In T302265#7731305, @fgiunchedi wrote: > The current pings from promet...
[17:20:53] <_joe_> bblack: regarding my work on dynamic bans, do you think we can consider having rules on the request body?
[17:21:02] <_joe_> or that risks being too expensive at the edge?
[17:21:31] <_joe_> we can by-default guard anything regarding a request body with a condition on the request content-length
[17:23:25] it's hard to categorically answer that, even
[17:23:33] filtering on method is easy
[17:23:54] CL as a header won't always be there even when there's a body (could be chunked or close-delimited, depending)
[17:24:03] <_joe_> right
[17:24:19] but there might be an easy VCL thing for "is there a req body"?
[17:24:48] <_joe_> I was deferring to you on that :)
[17:25:19] yeah scanning the wonderful docs
[17:25:43] <_joe_> you know, I still remember them
[17:25:50] <_joe_> and I think I haven't looked at those since varnish 4
[17:26:05] <_joe_> and I'd like to keep the streak going :P
[17:26:40] :P
[17:31:52] so, there's a bereq.body, but not a req.body
[17:32:11] and it's only present for "pass" disposition, not "miss"
[17:32:38] the bereq object would be available in e.g. vcl_backend_fetch, at the point where we're heading off towards the next layer of cache infra
[17:33:59] _joe_: are you thinking of putting these dynamic-bans in very early (before checking for hits), or only for the miss/pass case?
[17:34:18] with ratelimiters we've often gone for miss/pass-only, and this also reduces the amount of traffic that even sees that logic/code.
[17:34:22] <_joe_> for now it's only in miss/pass
[17:34:41] <_joe_> inside the cluster_fe_ratelimit function
[17:34:46] ok
[17:35:04] you could probably make arguments for both - maybe some ideal future variant would have the option of where to place it
[17:35:11] <_joe_> https://gerrit.wikimedia.org/r/c/operations/puppet/+/763557/6/modules/varnish/templates/text-frontend.inc.vcl.erb
[17:35:26] (e.g. some scenario where a cache hit on a large object was being used to saturate network links somewhere, possibly in our own network)
[17:36:12] <_joe_> yes
[17:36:39] <_joe_> but I'd assume that should be done once we have an engineering process that is not filled with footguns :)
[17:37:21] maybe!
[17:39:50] so yeah, the req body thing doesn't seem realistic, at least not this iteration
[17:40:41] req.method, which is already in the schema, will get us some related bits (e.g. block POST and such)
[17:41:11] honestly, I'm not even sure that varnish is capable of a meaningful cache hit on e.g. a GET with a client body.
[17:41:36] maybe? it's not an area I've thought a lot about lately
[17:43:03] we could/should have a preset switch through this mechanism that just heavily limits all miss/pass without any other filter, too.
[17:43:23] (I'm guessing that's implicitly possible with this scheme, by just not setting many filtering factors)
[17:44:31] s/many/any/
[17:50:56] (EdgeTrafficDrop) firing: 63% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org
[18:11:51] 10netops, 10Infrastructure-Foundations, 10SRE, 10Patch-For-Review: Configuration of New Switches Eqiad Rows E-F - https://phabricator.wikimedia.org/T299758 (10cmooney)
[18:20:56] (EdgeTrafficDrop) resolved: 69% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org
[22:18:58] 10netops, 10DC-Ops, 10Infrastructure-Foundations, 10SRE, 10ops-drmrs: Q3:(Need By: ASAP) rack/setup/install cr[12]-drmrs - https://phabricator.wikimedia.org/T300277 (10RobH) 05Open→03Resolved I closed out the ticket and this is now resolved.
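For context on the 17:20-17:44 dynamic-bans exchange above, here is a minimal, untested VCL sketch of the kind of miss/pass-side guard being discussed: filter on req.method, approximate "does this request carry a body?" from the Content-Length / Transfer-Encoding headers (since req.body does not exist on the client side), and throttle only what reaches the miss/pass path. It assumes the vsthrottle vmod from varnish-modules; the sub name, thresholds, key format, and the idea of calling it straight from vcl_recv are illustrative only, and this is not the actual cluster_fe_ratelimit code from the Gerrit change linked above.

```vcl
vcl 4.1;

import vsthrottle;

# No backend needed for this illustration.
backend default none;

# Hypothetical guard; in the real setup something like this would live in the
# miss/pass-only ratelimit hook rather than being called unconditionally.
sub dynamic_ban_guard {
    # Approximate "has a request body" from headers: Content-Length is absent
    # for chunked (or close-delimited) bodies, so also check Transfer-Encoding.
    if (req.method == "POST" &&
        (req.http.Content-Length || req.http.Transfer-Encoding ~ "(?i)chunked")) {
        # Per-client throttle; key, limit, and window are placeholders.
        if (vsthrottle.is_denied("dynaban:" + client.ip, 50, 10s)) {
            return (synth(429, "Too Many Requests"));
        }
    }
}

sub vcl_recv {
    call dynamic_ban_guard;
}
```

Dropping the method/body condition entirely turns the same throttle into the "preset switch" mentioned at 17:43, i.e. a blanket limit on all traffic that reaches the miss/pass path, with no other filtering factors set.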
[22:19:14] 10netops, 10DC-Ops, 10Infrastructure-Foundations, 10SRE, 10ops-drmrs: Q3:(Need By: ASAP) rack/setup/install cr[12]-drmrs - https://phabricator.wikimedia.org/T300277 (10RobH)
[22:42:23] 10Traffic, 10SRE, 10envoy, 10serviceops, 10Sustainability (Incident Followup): Raw "upstream connect error or disconnect/reset before headers. reset reason: overflow" error message shown to users during outage - https://phabricator.wikimedia.org/T287983 (10RLazarus) This came up again in T301507.