[09:31:03] We are seeing several toolforge tools getting connection resets from several wikis at a time for spans of a few minutes (ex. T356160), is there a way for me to see if that's some rate-limiting? (they are using the correct user-agent, and they did not substantially change the amount of queries themselves, but there might be noisy neighbors)
[09:31:04] T356160: Listeria bot sometimes gets stuck with 104 errors from Wikimedia APIs - https://phabricator.wikimedia.org/T356160
[09:46:16] <_joe_> dcaro: are they getting 429 responses?
[09:46:52] <_joe_> in general it's hard to answer if tasks don't have timestamps for events
[09:46:55] I think they are getting the connection reset directly, but let me ask
[09:47:25] <_joe_> yeah, without a timestamp of when that happens
[09:47:34] <_joe_> I can't correlate to events happening in production
[09:48:13] there are some timestamps here: T356163
[09:48:14] T356163: ChieBot: Intermittent connection reset by peer errors - https://phabricator.wikimedia.org/T356163
[09:48:34] <_joe_> without a date
[09:48:35] 11:54:05 PM (UTC I think, unless the tool is doing something weird)
[09:48:55] <_joe_> but I would suppose it might be during the attacks we've been seeing in the last few days
[09:49:12] <_joe_> dcaro: I'd suggest asking for the time and date of the events first
[09:49:14] I think it's from yesterday
[09:49:23] <_joe_> we've had a couple of issues over the weekend and a few yesterday too
[09:49:48] <_joe_> these tools connect to the edge in eqiad, correct?
[09:50:05] yep
[09:50:05] <_joe_> so I'd first check the health status of the traffic stack at the time in question
[09:50:17] <_joe_> as the "connection reset" probably comes from haproxy
[09:50:32] <_joe_> fabfur / vgutierrez ^^ did we change something in haproxy yesterday?
[09:50:50] nope AFAIK
[09:53:00] <_joe_> dcaro: ok so, once you have timestamps we can see if they correlate to the outages
[09:53:06] could we have pcaps or at least timestamps and source IPs?
[09:53:07] <_joe_> and dig a bit deeper
[09:53:32] <_joe_> vgutierrez: pcaps aren't something the tool authors can get themselves, so I assume not at that moment
[09:54:04] pcaps are going to be tricky, the IP would be the NATted IP for the cloud network (same as all the tools/cloud instances)
[09:55:04] is it a single IP or something like a /28?
[09:55:05] we don't NAT traffic to the wikis
[09:56:26] indeed.. I'm seeing 172.16. IPs hitting the wikis using UA Chiebot
[09:57:54] the amount of reported requests on turnilo seems kinda steady during the last month
[09:58:01] true, https://wikitech.wikimedia.org/wiki/News/CloudVPS_NAT_wikis is still on hold
[09:58:10] I'm guessing they keep retrying and they succeed?
[09:58:37] at times, they have to wait up to several minutes (sometimes >5) before it works
[09:59:08] when it's >5min they give up (one of the tools, at least)
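
A minimal sketch of the kind of client-side logging being asked for above (UTC timestamps for every event, telling HTTP 429 rate limiting apart from TCP connection resets, giving up after roughly 5 minutes), assuming the affected tools are Python and use the requests library; the tool name, User-Agent string and backoff values are illustrative, not taken from the tasks:

    import logging
    import time
    from datetime import datetime, timezone

    import requests

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("example-tool")  # hypothetical tool name

    session = requests.Session()
    # Descriptive User-Agent per the Wikimedia User-Agent policy (placeholder values)
    session.headers["User-Agent"] = (
        "example-tool/1.0 (https://example.toolforge.org; tool.example@toolforge.org)"
    )

    def fetch(url, params, max_wait=300):
        """Retry with backoff; give up after roughly 5 minutes of waiting."""
        delay, waited = 5, 0
        while True:
            ts = datetime.now(timezone.utc).isoformat()  # timestamp every event
            try:
                resp = session.get(url, params=params, timeout=60)
                if resp.status_code == 429:
                    log.warning("%s rate limited (429), Retry-After=%s",
                                ts, resp.headers.get("Retry-After"))
                else:
                    return resp
            except requests.ConnectionError as exc:
                # "Connection reset by peer" (errno 104) surfaces here
                log.warning("%s connection error: %s", ts, exc)
            if waited >= max_wait:
                raise RuntimeError("giving up after %ds of retries" % waited)
            time.sleep(delay)
            waited += delay
            delay = min(delay * 2, 60)  # exponential backoff, capped at 60s

Logs produced this way give exactly what _joe_ needs to correlate against production events: a date, a UTC time, and whether the failure was a 429 or a reset.
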
[10:01:37] from toolforge.. one IP == one tool?
[10:02:17] not really, one IP == one k8s worker, which might host many tools, and a tool might be spread across different workers too
[10:02:29] gotcha
[10:03:07] if you have the IPs, that might be interesting too; we are upgrading the fleet to use containerd, and that would allow us to track which workers they are running on
[10:03:34] if it's all the new workers, then it's probably a bullseye+containerd issue
[10:04:02] I can see the IPs for successful requests
[10:04:17] not for the ones experiencing L3 issues like in your task
[10:04:31] so you should track that on your side
[10:05:24] hmm, that might hint at things too, as the tools would not be moved to a different worker unless they failed completely (and retries might have worked too; I can cross-check retries that worked with timestamps to determine if the tool was having the issues when requesting from that IP)
[10:06:23] dcaro: https://w.wiki/8$mx
[10:06:32] thanks!
[10:06:42] take into account that the data is sampled 1/128
[10:07:20] but if we had that kind of issue at the cp nodes (haproxy) we should've seen external reports as well
[10:07:37] yep, this points to something earlier in the flow (probably on our side)
[10:07:50] so it looks to me like an internal networking issue between toolforge and the lvs
[10:09:43] gotta run to the hospital now.. ping fabfur in the meantime if you need something :)
[10:09:52] ack
[10:10:05] thanks! good luck!
[17:24:01] all good from EU oncall
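
A sketch (an assumption, not something done in the log) of how a tool maintainer could map a tool's pods to the Toolforge Kubernetes workers whose 172.16.x.x addresses show up in the webrequest data, using the kubernetes Python client; the "tool-<name>" namespace pattern and the tool name passed in are assumptions here:

    from kubernetes import client, config

    def worker_ips_for_tool(tool_name):
        """Print which k8s workers (and their IPs) a tool's pods are scheduled on."""
        # From inside a tool's own pod this would be config.load_incluster_config()
        config.load_kube_config()
        v1 = client.CoreV1Api()
        pods = v1.list_namespaced_pod(namespace=f"tool-{tool_name}")
        for pod in pods.items:
            # status.host_ip is the worker node's address, i.e. the source IP the
            # wikis see for this traffic, since it is not NATted
            print(pod.metadata.name, pod.spec.node_name, pod.status.host_ip)

    worker_ips_for_tool("chiebot")  # hypothetical invocation

Any request counts pulled from the sampled webrequest data would then need to be multiplied by 128 to estimate the real volume, per the 1/128 sampling noted above.
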