[09:31:03] We are seeing several toolforge tools getting connection resets from several wikis at a time for spans of a few minutes (ex. T356160), is there a way for me to see if that's some rate-limiting? (they are using the correct user-agent, and they did not substantially change the amount of queries themselves, but there might be noisy neighbors)
[09:31:04] T356160: Listeria bot sometimes gets stuck with 104 errors from Wikimedia APIs - https://phabricator.wikimedia.org/T356160
[09:46:16] <_joe_> dcaro: are they getting 429 responses?
[09:46:52] <_joe_> in general it's hard to answer if tasks don't have timestamps for events
[09:46:55] I think they are getting the connection reset directly, but let me ask
[09:47:25] <_joe_> yeah, without a timestamp of when that happens
[09:47:34] <_joe_> I can't correlate to events happening in production
[09:48:13] there are some timestamps here: T356163
[09:48:14] T356163: ChieBot: Intermittent connection reset by peer errors - https://phabricator.wikimedia.org/T356163
[09:48:34] <_joe_> without a date
[09:48:35] 11:54:05 PM (UTC I think, unless the tool is doing something weird)
[09:48:55] <_joe_> but I would suppose it might be during the attacks we've been seeing in the last few days
[09:49:12] <_joe_> dcaro: I'd suggest asking for the time and date of the events first
[09:49:14] I think it's from yesterday
[09:49:23] <_joe_> we've had a couple of issues over the weekend and a few yesterday too
[09:49:48] <_joe_> these tools connect to the edge in eqiad, correct?
[09:50:05] yep
[09:50:05] <_joe_> so I'd first check the health status of the traffic stack at the time in question
[09:50:17] <_joe_> as the "connection reset" probably comes from haproxy
[09:50:32] <_joe_> fabfur / vgutierrez ^^ did we change something in haproxy yesterday?
[09:50:50] nope AFAIK
[09:53:00] <_joe_> dcaro: ok so, once you have timestamps we can see if they correlate to the outages
[09:53:06] could we have pcaps or at least timestamps and source IPs?
[09:53:07] <_joe_> and dig a bit deeper
[09:53:32] <_joe_> vgutierrez: pcaps aren't something the tool authors can get themselves, so I assume not at that moment
[09:54:04] pcaps are going to be tricky, the IP would be the NATted IP for the cloud network (same as all the tools/cloud instances)
[09:55:04] is it a single IP or something like a /28?
[09:55:05] we don't NAT traffic to the wikis
[09:56:26] indeed.. I'm seeing 172.16. IPs hitting the wikis using UA Chiebot
[09:57:54] the amount of reported requests on turnilo seems kinda steady during the last month
[09:58:01] true, https://wikitech.wikimedia.org/wiki/News/CloudVPS_NAT_wikis is still on hold
[09:58:10] I'm guessing they keep retrying and they succeed?
[09:58:37] at times, they have to wait up to several minutes (sometimes >5) before it works
[09:59:08] when it's >5min they give up (one of the tools, at least)
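
A minimal sketch of the kind of client-side logging being asked for above (UTC timestamps for every event, telling HTTP 429 rate limiting apart from TCP connection resets, giving up after roughly 5 minutes), assuming the affected tools are Python and use the requests library; the tool name, User-Agent string and backoff values are illustrative, not taken from the tasks:

    import logging
    import time
    from datetime import datetime, timezone

    import requests

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("example-tool")  # hypothetical tool name

    session = requests.Session()
    # Descriptive User-Agent per the Wikimedia User-Agent policy (placeholder values)
    session.headers["User-Agent"] = (
        "example-tool/1.0 (https://example.toolforge.org; tool.example@toolforge.org)"
    )

    def fetch(url, params, max_wait=300):
        """Retry with backoff; give up after roughly 5 minutes of waiting."""
        delay, waited = 5, 0
        while True:
            ts = datetime.now(timezone.utc).isoformat()  # timestamp every event
            try:
                resp = session.get(url, params=params, timeout=60)
                if resp.status_code == 429:
                    log.warning("%s rate limited (429), Retry-After=%s",
                                ts, resp.headers.get("Retry-After"))
                else:
                    return resp
            except requests.ConnectionError as exc:
                # "Connection reset by peer" (errno 104) surfaces here
                log.warning("%s connection error: %s", ts, exc)
            if waited >= max_wait:
                raise RuntimeError("giving up after %ds of retries" % waited)
            time.sleep(delay)
            waited += delay
            delay = min(delay * 2, 60)  # exponential backoff, capped at 60s

Logs produced this way give exactly what _joe_ needs to correlate against production events: a date, a UTC time, and whether the failure was a 429 or a reset.
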
[10:01:37] from toolforge.. one IP == one tool?
[10:02:17] not really, one IP == one k8s worker, which might host many tools, and a tool might be spread across different workers too
[10:02:29] gotcha
[10:03:07] if you have the IPs, that might be interesting too; we are upgrading the fleet to use containerd, and that would allow us to track which workers they are running on
[10:03:34] if it's all the new workers, then it's probably a bullseye+containerd issue
[10:04:02] I can see the IPs for successful requests
[10:04:17] not for the ones experiencing L3 issues like in your task
[10:04:31] so you should track that on your side
[10:05:24] hmm, that might hint at things too, as the tools would not be moved to a different worker unless they failed completely (and retries might have worked too; I can cross-check retries that worked with timestamps to determine if the tool was having the issues when requesting from that IP)
[10:06:23] dcaro: https://w.wiki/8$mx
[10:06:32] thanks!
[10:06:42] take into account that the data is sampled 1/128
[10:07:20] but if we had that kind of issue at the cp nodes (haproxy) we should've seen external reports as well
[10:07:37] yep, this points to something earlier in the flow (probably on our side)
[10:07:50] so it looks to me like an internal networking issue between toolforge and the lvs
[10:09:43] gotta run to the hospital now.. ping fabfur in the meantime if you need something :)
[10:09:52] ack
[10:10:05] thanks! good luck!
[17:24:01] all good from EU oncall
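
A sketch (an assumption, not something done in the log) of how a tool maintainer could map a tool's pods to the Toolforge Kubernetes workers whose 172.16.x.x addresses show up in the webrequest data, using the kubernetes Python client; the "tool-<name>" namespace pattern and the tool name passed in are assumptions here:

    from kubernetes import client, config

    def worker_ips_for_tool(tool_name):
        """Print which k8s workers (and their IPs) a tool's pods are scheduled on."""
        # From inside a tool's own pod this would be config.load_incluster_config()
        config.load_kube_config()
        v1 = client.CoreV1Api()
        pods = v1.list_namespaced_pod(namespace=f"tool-{tool_name}")
        for pod in pods.items:
            # status.host_ip is the worker node's address, i.e. the source IP the
            # wikis see for this traffic, since it is not NATted
            print(pod.metadata.name, pod.spec.node_name, pod.status.host_ip)

    worker_ips_for_tool("chiebot")  # hypothetical invocation

Any request counts pulled from the sampled webrequest data would then need to be multiplied by 128 to estimate the real volume, per the 1/128 sampling noted above.
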