[07:46:56] (HAProxyEdgeTrafficDrop) firing: 35% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[08:11:31] (HAProxyEdgeTrafficDrop) resolved: 59% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[08:12:36] (HAProxyEdgeTrafficDrop) firing: (2) 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[08:12:40] (HAProxyEdgeTrafficDrop) firing: (2) 64% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[08:13:06] (HAProxyEdgeTrafficDrop) resolved: (2) 68% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[08:30:25] bblack: ack re: ncredir, see also https://gerrit.wikimedia.org/r/c/operations/puppet/+/790269
[08:47:07] 10netops, 10DC-Ops, 10Infrastructure-Foundations, 10SRE, and 2 others: decommission atlas-esams - https://phabricator.wikimedia.org/T307026 (10ayounsi) a:03ayounsi
[09:26:19] vgutierrez: cheers!
[09:35:08] 10Traffic, 10Data-Engineering, 10Data-Engineering-Kanban, 10SRE, 10Patch-For-Review: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (10BTullis) I suggest that we increase the `connect_timeout` value for the `local_service` clust...
[10:03:11] 10Traffic: Implement SLI measurement for HAProxy - https://phabricator.wikimedia.org/T307898 (10Vgutierrez)
[11:04:37] btullis: ^^ that's the connect timeout on localhost between your TLS termination layer and eventgate?
[11:56:56] vgutierrez: yes, that's right. Under load I think that we're getting some requests timing out at 1s.
[11:59:01] It would definitely be better to fix why that's happening, but I think that's more difficult.
[12:14:44] right.. 1s with TLS between eqsin and eqiad/codfw could be potentially problematic under far-from-ideal scenarios
[12:23:07] Yeah, but this is definitely all on localhost. It's from the TLS envoy to its local main_app service, so I can see why 1s is the default. I think our eventgate is just sometimes slow under load.
[12:46:14] connect_timeout though is about the TCP connection
[12:46:34] if the TCP connection from localhost to localhost takes >1s then that is a pretty huge problem
[12:47:01] localhost -> localhost has 0 latency effectively, what problem would increasing that timeout solve?
[12:48:39] akosiaris: connection being queued
[12:48:48] the only thing that comes to mind is the nodejs event loop having so many events to process that the nodejs backlog fills up
[12:49:31] to which my answer would be... add capacity?
[12:50:10] vgutierrez: if the connection is queued though, it wouldn't be the connection timeout that is interesting to meddle with.
[12:50:33] as in queued means the TCP handshake has been accepted and it's now waiting on the nodejs event loop to handle it
[12:50:37] which works fine up to a point
[12:52:39] for example the route timeout or the idle timeout might be more interesting envoy timeouts to tinker with if we are in the state before the nodejs event loop backlog is full
[12:53:59] the route timeout is the only when envoy expects nodejs to answer with a full HTTP response. The idle one, I guess it's self explanatory?
[12:54:21] s/only/one/
[12:57:38] akosiaris: I think you're possibly quite right with 'the nodejs event loop having so many events to process that the nodejs backlog fills up'
[12:58:14] The thing is that we *did* souble the capacity recently and it made only a very limited difference to the latency of the requests: https://phabricator.wikimedia.org/T306181#7871031
[13:01:23] s/souble/double
[13:02:26] I'm very open to further ideas about how to diagnose and/or fix this though. At the moment we just have a relatively large number of requests timing out at 1 second, so I thought this might help.
[13:12:17] btullis: I'm very rusty on the details, but, these kinds of questions can be good applications for systemtap or for the ebpf tracing tools
[13:13:40] btullis: fwiw, it wouldn't hurt to try out the connect_timeout increase. But if it ends up solving the problem, I'd be surprised and would really love to know why.
[13:21:56] OK, thanks both. I'll look into stap and ebpf tracing for eventgate before doing anything else.
[13:23:03] there are a few ebpf tools that might be helpful 'out of the box' or with just small modification -- I'm thinking of `tcpaccept` and `tcpconnlat` in the standard bcc distribution
[13:24:18] feel free to manually install bpfcc on a host (just !log that you did)
[13:26:00] 10Traffic, 10Data-Engineering, 10Data-Engineering-Kanban, 10SRE, 10Patch-For-Review: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (10Ottomata) If that fixes it, at the very least it will give us some information about what is...
[13:49:10] 10Traffic, 10Data-Engineering, 10Data-Engineering-Kanban, 10SRE, 10Patch-For-Review: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (10BTullis) There's some discussion ongoing in [[https://wm-bot.wmflabs.org/libera_logs/%23wikim...
[13:49:52] cdanis: Thanks. I will let you know how I get on and definitely !log any manual changes.
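A possible complement to the ebpf tools mentioned above, and not something suggested in the log itself: the event-loop-backlog hypothesis can also be checked from inside the Node.js process with the built-in perf_hooks.monitorEventLoopDelay() (available since Node.js 11.10). The sketch below is only an illustration; the 20 ms sampling resolution and the once-a-minute log line are arbitrary choices, not eventgate code.

    // Hypothetical diagnostic sketch: sample Node.js event loop delay in-process.
    const { monitorEventLoopDelay } = require('perf_hooks');

    const histogram = monitorEventLoopDelay({ resolution: 20 }); // sample every 20 ms
    histogram.enable();

    // Log p50/p99/max delay once a minute. Sustained delays approaching the observed
    // request latency would point at event loop saturation rather than TCP connect time.
    setInterval(() => {
      const toMs = (ns) => (ns / 1e6).toFixed(1); // monitorEventLoopDelay reports nanoseconds
      console.log(
        `event loop delay ms: p50=${toMs(histogram.percentile(50))} ` +
        `p99=${toMs(histogram.percentile(99))} max=${toMs(histogram.max)}`
      );
      histogram.reset();
    }, 60000);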
[13:58:22] btullis: regarding that 1s in the quantiles in https://grafana-rw.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos&from=now-90d&to=now&viewPanel=79&forceLogin&editPanel=79
[13:58:35] it's the ceiling of the prometheus buckets
[13:58:41] that's why you see 1s and not more
[13:59:13] curl -s http://10.64.66.166:9102/metrics |grep express_router_request_duration_seconds_bucket |grep POST
[13:59:14] express_router_request_duration_seconds_bucket{le="0.01",service="eventgate-analytics-external",path="v1/events",method="POST",status="202"} 194981515
[13:59:14] express_router_request_duration_seconds_bucket{le="0.05",service="eventgate-analytics-external",path="v1/events",method="POST",status="202"} 194981515
[13:59:14] express_router_request_duration_seconds_bucket{le="0.1",service="eventgate-analytics-external",path="v1/events",method="POST",status="202"} 194981515
[13:59:14] express_router_request_duration_seconds_bucket{le="0.3",service="eventgate-analytics-external",path="v1/events",method="POST",status="202"} 194981515
[13:59:15] express_router_request_duration_seconds_bucket{le="1",service="eventgate-analytics-external",path="v1/events",method="POST",status="202"} 252402123
[13:59:16] express_router_request_duration_seconds_bucket{le="+Inf",service="eventgate-analytics-external",path="v1/events",method="POST",status="202"} 253557747
[13:59:37] sorry about the long paste, but as you can see the last bucket is le=1
[13:59:50] quantile calculation can't work past that one
[14:07:38] Oh, right. I see, thanks. It still suggests that there's a latency issue doesn't it? - you've just ruled out that the connect_timeout is related though, right?
[14:08:54] well.. it could be the connect_timeout or another one with a higher value
[14:21:52] yes, there probably is some latency issue there, not disagreeing there.
[14:24:34] however
[14:24:39] the 1s, is for GETs
[14:24:41] not POSTs
[14:24:56] oh wait, I just saw the dip
[14:25:03] GETs as well as POSTs
[14:30:50] ottomata: ^ you might wanna help btullis to add 1 more bucket to express_router_request_duration_seconds_bucket metric in eventgate
[14:30:53] say something like 10s
[14:31:24] a quick grep in eventgate-wikimedia/ and deployments-charts/ repos from yours truly did not yield back anything
[14:31:35] so, not sure how to add more buckets
[14:32:05] chances are it will help pin down more easily the actual problem
[14:40:49] i'm pretty sure express_router_request_duration_seconds_bucket would be a service-template-node metric
[14:41:00] which https://github.com/wikimedia/eventgate is forked from
[14:41:31] https://github.com/wikimedia/eventgate/blob/00a110630dfc853deb0da8bec6455fe1a2b0cb5d/lib/util.js#L92-L139
[14:42:03] i wonder, can we just add another bucket or two there? 5s, 10s?
[14:43:34] ottomata: I'd say go with adding another 1 at 10s. No need to explode prometheus metric cardinality yet
[14:43:40] k
[14:43:57] looks like makeMetric is just new PrometheusClient
[14:43:57] https://github.com/wikimedia/service-runner/blob/deb41fd568ec78ddf189bc5261bed1c5059c05d0/lib/metrics/prometheus.js#L113
[14:45:03] looks like we can just add them
[14:45:03] https://github.com/siimon/prom-client#histogram
[14:45:08] okay, will make a patch for ya btullis
[14:56:37] Thanks. I'm determined to get to the bottom of this mystery.
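As a rough illustration of the change discussed above, here is a minimal sketch of a prom-client Histogram declared with the extra 10s bucket. The metric name, label names, and existing bucket boundaries are taken from the curl paste; the 10s value is the one suggested in the conversation. The real patch would go through service-runner's makeMetric wrapper rather than constructing the Histogram directly, so this is only an approximation of the intent, not the actual eventgate code.

    const client = require('prom-client');

    // Buckets mirror the le values in the paste above (0.01 .. 1), plus the proposed
    // 10s ceiling. Requests slower than 1s then land in a real bucket instead of only
    // +Inf, so quantile estimates on the dashboard are no longer capped at 1s.
    const requestDuration = new client.Histogram({
      name: 'express_router_request_duration_seconds',
      help: 'Request duration handled by the router, in seconds',
      labelNames: ['service', 'path', 'method', 'status'],
      buckets: [0.01, 0.05, 0.1, 0.3, 1, 10]
    });

    // Example observation: a 2.3s POST now falls into the le="10" bucket.
    requestDuration.observe(
      { service: 'eventgate-analytics-external', path: 'v1/events', method: 'POST', status: '202' },
      2.3
    );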
[15:03:29] <3
[15:38:33] btullis: ergghhh, i need to npm publish, which requires 2FA on npm, but i have a new phone, but i can't remove (and readd) 2FA because the wikimedia npm org requires I have 2FA
[15:38:44] trying to figure out how to replace 2FA on npm
[16:01:01] 10Traffic, 10Data-Engineering, 10Data-Engineering-Kanban, 10SRE, 10Patch-For-Review: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (10akosiaris) Looking at https://grafana-rw.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&viewPane...
[17:22:02] 10netops, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10SRE: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (10JMeybohm)
[22:07:57] (HAProxyEdgeTrafficDrop) firing: 48% request drop in text@drmrs during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=drmrs&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[22:17:57] (HAProxyEdgeTrafficDrop) resolved: 59% request drop in text@drmrs during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=drmrs&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[22:56:57] (HAProxyEdgeTrafficDrop) firing: 67% request drop in text@drmrs during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=drmrs&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[23:11:57] (HAProxyEdgeTrafficDrop) resolved: 67% request drop in text@drmrs during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=drmrs&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop