[05:41:02] (HAProxyEdgeTrafficDrop) firing: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[05:43:55] Traffic: upstream connect error or disconnect/reset before headers. reset reason: overflow - https://phabricator.wikimedia.org/T307647 (AlexisJazz)
[05:45:56] (HAProxyEdgeTrafficDrop) firing: (2) 68% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[05:47:51] Traffic, Wikimedia-production-error: upstream connect error or disconnect/reset before headers. reset reason: overflow - https://phabricator.wikimedia.org/T307647 (AlexisJazz)
[05:56:01] (HAProxyEdgeTrafficDrop) firing: (2) 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[06:00:56] (HAProxyEdgeTrafficDrop) resolved: (2) 66% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[07:40:56] (HAProxyEdgeTrafficDrop) firing: 65% request drop in text@eqiad during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqiad&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[07:45:56] (HAProxyEdgeTrafficDrop) resolved: 68% request drop in text@eqiad during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqiad&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[07:56:08] bblack vgutierrez re: ncredir probe, should the service be non-paging altogether ? and/or the probedown page was useful/informative ?
[08:47:02] Traffic, SRE, Patch-For-Review, Upstream: HAProxy 2.4.16 shows internal errors on text cluster - https://phabricator.wikimedia.org/T307444 (Vgutierrez) Issue fixed by upstream on https://git.haproxy.org/?p=haproxy-2.4.git;a=commit;h=f9a0f51d3bfa37993935754508e7c88b2e69c9ed
[16:20:59] Traffic, Analytics, Data-Engineering, Data-Engineering-Kanban, and 3 others: Maxmind: GeoIP Download Failed - https://phabricator.wikimedia.org/T302864 (Dzahn)
[17:07:15] godog: yeah, I don't think it should be paging. it's fine to be an IRC alert though!
[17:21:03] been keeping an eye on various dashboards after merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/789219
[17:21:47] things look good overall -- there's a noticeable increase in CPU in esams but it's from ~0.8sec/sec to ~1.3sec/sec so it's not exactly concerning :)
[17:21:59] interestingly it also ... seemed to decrease ttfb on that host?
[17:23:37] cdanis: first I've seen of it, is there anything more than the ticket?
[17:23:59] bblack: a bunch of discussions with Valentin, but that's basically it
[17:24:04] I'll expand the ticket today
[17:24:37] ok thanks! :)
[17:25:13] the tldr is that, when we do see cachebusting attacks, they tend to come from a small number of IP addresses relative to overall traffic (on the order of hundreds or single-digit thousands) -- so identifying abusive behavior seems doable
[17:25:44] are there legit patterns that get caught in the crossfire?
[17:25:57] that patch has haproxy internally track, for each client IP, # of miss+pass/10 seconds, and # of new TCP connections/10 seconds
[17:26:01] eh maybe I shouldn't ask, I'm really not prepared to deep-dive on this at the moment! :)
[17:26:35] it doesn't do anything with the data yet, aside from I am dumping them to local disk every minute or so, and later we can start thinking up thresholds (which we'd only act upon if we knew we were saturating on traffic)
[17:28:06] of course I really want any future state of this to not impact legit traffic, but I also think you can argue that it isn't *so* bad if the impact is only during windows where we would be suffering anyway, and overall this makes those windows rarer and shorter
[17:28:43] but as for now, assuming I didn't introduce a glaring performance regression into these four machines around the fleet (which it looks like I definitely did not), there's no impact on any traffic
[17:29:39] sounds pretty useful anyways!
[17:31:27] <_joe_> cdanis: that's fantastic, I was worried esams traffic would expose any bottleneck in performance, apparently though there is none
[17:31:55] yeah I'm really happy about that, it's like a 75% increase in CPU usage from "very small" to "small"
[17:36:08] Traffic, Privacy Engineering, Research, SRE, and 3 others: wikiworkshop.org has Facebook button, external statcounter, https to http redirect - https://phabricator.wikimedia.org/T251732 (BBlack) >>! In T251732#7892986, @bmansurov wrote: > 2. Resolve the https -> http redirect issue (who should lo...
[17:36:33] memory consumption doesn't look appreciable at all, either -- even in esams we're using < 50k entries in a table where each entry takes ~60 bytes
[17:36:38] so.. under 3MB?
[17:37:25] (that's the `newconnrate` table, which is exactly what it sounds like; there's also a smaller `misspassrate` table that is, as you might guess, approximately 20-25% of the size of the former one)
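
(Annotation, not part of the log.) The per-client tracking described above can be sketched roughly as follows. This is a minimal Python illustration, not the patch itself and not how HAProxy implements its counters: the `RateTable` class, its methods, and the `table_bytes` helper are hypothetical names invented here. Only the 10-second window, the two table names, and the ~60 bytes per entry figure come from the discussion.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 10.0  # the 10-second period mentioned in the log


class RateTable:
    """Hypothetical per-client event counter over a sliding time window."""

    def __init__(self, window=WINDOW_SECONDS):
        self.window = window
        self._events = defaultdict(deque)  # client IP -> event timestamps

    def record(self, client_ip, now):
        """Note one event (e.g. a new TCP connection) for client_ip."""
        self._events[client_ip].append(now)

    def rate(self, client_ip, now):
        """Count events from client_ip within the last `window` seconds."""
        events = self._events.get(client_ip)
        if not events:
            return 0
        # Drop timestamps that have aged out of the window.
        while events and events[0] <= now - self.window:
            events.popleft()
        if not events:
            # Expire idle clients so the table stays small.
            del self._events[client_ip]
            return 0
        return len(events)

    def __len__(self):
        return len(self._events)


def table_bytes(entries, bytes_per_entry=60):
    """Rough table size, assuming ~60 bytes per entry as quoted in the log."""
    return entries * bytes_per_entry


# One table per tracked signal, named after the tables in the log.
newconnrate = RateTable()
misspassrate = RateTable()
```

With the figures quoted in the log, `table_bytes(50_000)` gives 3,000,000 bytes, which matches the "under 3MB" estimate for fewer than 50k entries.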