[06:55:56] (EdgeTrafficDrop) firing: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[07:00:56] (EdgeTrafficDrop) resolved: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[07:03:11] 10Traffic, 10SRE: Move Varnish6 from component to main - https://phabricator.wikimedia.org/T302301 (10elukey) For varnishkafka, this is the problem: ` elukey@apt1001:/srv/wikimedia$ sudo reprepro lsbycomponent varnishkafka varnishkafka | 1.0.13-1 | stretch-wikimedia | main | amd64, source varnis...
[07:10:14] mmandere: o/ left a note in --^ for the varnishkafka package versions, lemme know what you think :)
[07:40:50] elukey: o/ thanks checking
[09:47:04] 10netops, 10Infrastructure-Foundations, 10SRE, 10SRE Observability (FY2021/2022-Q3), 10User-fgiunchedi: blackbox-exporter no icmp replies on prometheus1006 for a few services - https://phabricator.wikimedia.org/T302265 (10fgiunchedi) Prometheus doesn't run on VMs in eqiad/codfw (not sure if this fact was...
[12:41:59] Hi traffic, I'd like to add an LVS service (https://gerrit.wikimedia.org/r/c/operations/puppet/+/764733)
[13:18:00] vgutierrez: any objections to do low_traffic lvs_setup? Current backups being lvs1020 and lvs2010
[14:01:17] Please go ahead jayme
[14:03:49] ack
[14:30:48] 10netops, 10Infrastructure-Foundations, 10SRE, 10SRE Observability (FY2021/2022-Q3), 10User-fgiunchedi: blackbox-exporter no icmp replies on prometheus1006 for a few services - https://phabricator.wikimedia.org/T302265 (10cmooney) > As far as this task goes to me it still remains a mystery why it looks l...
[15:15:07] 10netops, 10Discovery, 10Infrastructure-Foundations, 10SRE: Speed up network connections for Elastic hosts - https://phabricator.wikimedia.org/T301577 (10bking) Per Cathal's feedback above, we are closing this ticket as he correctly stated "it represents significant risk for what seems to be scant benefit....
[15:15:45] 10netops, 10Discovery, 10Infrastructure-Foundations, 10SRE: Speed up network connections for Elastic hosts - https://phabricator.wikimedia.org/T301577 (10bking) 05Open→03Resolved
[16:34:18] 10netops, 10Infrastructure-Foundations, 10SRE, 10SRE Observability (FY2021/2022-Q3), 10User-fgiunchedi: blackbox-exporter no icmp replies on prometheus1006 for a few services - https://phabricator.wikimedia.org/T302265 (10BBlack) >>! In T302265#7731305, @fgiunchedi wrote: > The current pings from promet...
[17:20:53] <_joe_> bblack: regarding my work on dynamic bans, do you think we can consider having rules on the request body?
[17:21:02] <_joe_> or that risks being too expensive at the edge?
[17:21:31] <_joe_> we can by-default guard anything regarding a request body with a condition on the request content-length
[17:23:25] it's hard to categorically answer that, even
[17:23:33] filtering on method is easy
[17:23:54] CL as a header won't always be there even when there's a body (could be chunked or close-delimited, depending)
[17:24:03] <_joe_> right
[17:24:19] but there might be an easy VCL thing for "is there a req body"?
[17:24:48] <_joe_> I was deferring to you on that :)
[17:25:19] yeah scanning the wonderful docs
[17:25:43] <_joe_> you know, I still remember them
[17:25:50] <_joe_> and I think I haven't looked at those since varnish 4
[17:26:05] <_joe_> and I'd like to keep the streak going :P
[17:26:40] :P
[17:31:52] so, there's a bereq.body, but not a req.body
[17:32:11] and it's only present for "pass" disposition, not "miss"
[17:32:38] the bereq object would be available in e.g. vcl_backend_fetch, at the point where we're heading off towards the next layer of cache infra
[17:33:59] _joe_: are you thinking of putting these dynamic-bans in very early (before checking for hits), or only for the miss/pass case?
[17:34:18] with ratelimiters we've often gone for miss/pass-only, and this also reduces the amount of traffic that even sees that logic/code.
[17:34:22] <_joe_> for now it's only in miss/pass
[17:34:41] <_joe_> inside the cluster_fe_ratelimit function
[17:34:46] ok
[17:35:04] you could probably make arguments for both - maybe some ideal future variant would have the option of where to place it
[17:35:11] <_joe_> https://gerrit.wikimedia.org/r/c/operations/puppet/+/763557/6/modules/varnish/templates/text-frontend.inc.vcl.erb
[17:35:26] (e.g. some scenario where a cache hit on a large object was being used to saturate network links somewhere, possibly in our own network)
[17:36:12] <_joe_> yes
[17:36:39] <_joe_> but I'd assume that should be done once we have an engineering process that is not filled with footguns :)
[17:37:21] maybe!
[17:39:50] so yeah, the req body thing doesn't seem realistic, at least not this iteration
[17:40:41] req.method, which is already in the schema, will get us some related bits (e.g. block POST and such)
[17:41:11] honestly, I'm not even sure that varnish is capable of a meaningful cache hit on e.g. a GET with a client body.
[17:41:36] maybe? it's not an area I've thought a lot about lately
[17:43:03] we could/should have a preset switch through this mechanism that just heavily limits all miss/pass without any other filter, too.
[17:43:23] (I'm guessing that's implicitly possible with this scheme, by just not setting many filtering factors)
[17:44:31] s/many/any/
[17:50:56] (EdgeTrafficDrop) firing: 63% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org
[18:11:51] 10netops, 10Infrastructure-Foundations, 10SRE, 10Patch-For-Review: Configuration of New Switches Eqiad Rows E-F - https://phabricator.wikimedia.org/T299758 (10cmooney)
[18:20:56] (EdgeTrafficDrop) resolved: 69% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org
[22:18:58] 10netops, 10DC-Ops, 10Infrastructure-Foundations, 10SRE, 10ops-drmrs: Q3:(Need By: ASAP) rack/setup/install cr[12]-drmrs - https://phabricator.wikimedia.org/T300277 (10RobH) 05Open→03Resolved I closed out the ticket and this is now resolved.
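For context on the 17:20-17:44 dynamic-bans exchange above, here is a minimal, untested VCL sketch of the kind of miss/pass-side guard being discussed: filter on req.method, approximate "does this request carry a body?" from the Content-Length / Transfer-Encoding headers (since req.body does not exist on the client side), and throttle only what reaches the miss/pass path. It assumes the vsthrottle vmod from varnish-modules; the sub name, thresholds, key format, and the idea of calling it straight from vcl_recv are illustrative only, and this is not the actual cluster_fe_ratelimit code from the Gerrit change linked above.

```vcl
vcl 4.1;

import vsthrottle;

# No backend needed for this illustration.
backend default none;

# Hypothetical guard; in the real setup something like this would live in the
# miss/pass-only ratelimit hook rather than being called unconditionally.
sub dynamic_ban_guard {
    # Approximate "has a request body" from headers: Content-Length is absent
    # for chunked (or close-delimited) bodies, so also check Transfer-Encoding.
    if (req.method == "POST" &&
        (req.http.Content-Length || req.http.Transfer-Encoding ~ "(?i)chunked")) {
        # Per-client throttle; key, limit, and window are placeholders.
        if (vsthrottle.is_denied("dynaban:" + client.ip, 50, 10s)) {
            return (synth(429, "Too Many Requests"));
        }
    }
}

sub vcl_recv {
    call dynamic_ban_guard;
}
```

Dropping the method/body condition entirely turns the same throttle into the "preset switch" mentioned at 17:43, i.e. a blanket limit on all traffic that reaches the miss/pass path, with no other filtering factors set.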
[22:19:14] 10netops, 10DC-Ops, 10Infrastructure-Foundations, 10SRE, 10ops-drmrs: Q3:(Need By: ASAP) rack/setup/install cr[12]-drmrs - https://phabricator.wikimedia.org/T300277 (10RobH)
[22:42:23] 10Traffic, 10SRE, 10envoy, 10serviceops, 10Sustainability (Incident Followup): Raw "upstream connect error or disconnect/reset before headers. reset reason: overflow" error message shown to users during outage - https://phabricator.wikimedia.org/T287983 (10RLazarus) This came up again in T301507.