[07:50:22] do you know someone with exercise addiction
I am working with Safeguard on this request. It seems there are two different Employer of Records in Finland which would explain the differences.
[07:50:42] argh :/
[10:35:57] vgutierrez: hi! since yesterday we've been having issues connecting to wikis from some tools. I have a reproducer right now, from one of our workers (tools-k8s-worker-43), to commons.wikimedia.org; I'm doing a pcap of a curl request that hangs
[10:36:07] is there anything else I can get that will help debug?
[10:36:23] (I can give you access or debug live if you have time right now)
[10:36:44] mtr/traceroute to commons.wikimedia.org from the worker as well
[10:40:08] ack
[10:43:04] vgutierrez: mtrs look the same T356164
[10:43:05] T356164: [toolforge] several tools get periods of connection refused (104) when connecting to wikis - https://phabricator.wikimedia.org/T356164
[10:43:37] oh, but now it's working again
[10:43:43] so maybe the mtr was a bit late
[10:44:34] where is the pcap?
[10:45:30] vgutierrez: give me a sec, I think it might have some private info (captured all traffic), I'll send it to your email
[10:45:46] ack
[10:46:54] I think it might be too big xd
[10:47:36] it's 14G, I'll try to trim it by host, is that ok? (filtering by the commons IP)
[10:48:04] sure
[10:53:31] xd, still 3.9G
[10:57:27] uploading it to cloudcontrol1007.eqiad.wmnet:/root/only_commons.pcap.gz, vgutierrez you have access to that right?
[10:57:35] filter by stream? :)
[10:58:01] yes, I got access
[10:58:29] BTW, did you dump the TLS keys during the handshake? :)
[11:00:52] I did not do anything specific for it, so I'm guessing I did not? How should I have dumped them?
[11:01:58] SSLKEYLOGFILE=/root/sslkey.log curl [...]
[11:02:55] I'll take it into account for the next one, that would allow us to decode the https stream?
[11:03:22] yes
[11:04:03] awesome, got another worker having the issue, I'll try it there (and try to get a smaller pcap)
[11:08:08] dcaro: just use --local-port to a privileged port that you know for a fact is not being used
[11:08:14] and filter by that
[11:08:30] I added a dummy header xd, but that sounds easier to filter by
[11:08:51] a dummy header on a TLS communication?
[11:09:06] you'll end up with another huge pcap that requires processing
[11:09:43] ack, using the port
[11:16:32] vgutierrez: got it
[11:16:34] https://www.irccloud.com/pastebin/5TeBVliD/
[11:16:50] lovely :D
[11:18:57] dcaro: oh wow... the pcap shows retransmissions happening even during the TLS handshake
[11:19:09] dcaro: are you sure that mtr is happy?
[11:19:26] https://www.irccloud.com/pastebin/L2rdN8Cb/
[11:19:35] vgutierrez: that might be because I captured all interfaces
[11:19:48] (so it might be seeing it from the docker interface + the host interface)
[11:19:57] I can try using only the host one
[11:20:29] not only duped ones
[11:21:07] https://www.irccloud.com/pastebin/b5HDD0zZ/
[11:21:57] let me rerun, it's still having issues
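A minimal sketch of the capture workflow suggested above: pin curl to a known source port so the pcap is easy to filter, and log the TLS session keys so the stream can be decrypted later in Wireshark. The target IP is the one used in the mtr below; the port number and file names are arbitrary choices, and this is run as root (tcpdump and binding a source port below 1024 both need it).

    # capture only traffic to/from the target IP on the chosen source port
    tcpdump -i eth0 -w /tmp/commons.pcap 'host 208.80.154.224 and port 999' &
    # reproduce the hang while dumping TLS keys for later decryption
    SSLKEYLOGFILE=/root/sslkey.log curl -v --local-port 999 'https://commons.wikimedia.org/w/api.php'
    # stop tcpdump (kill %1) once the request has finished or hung long enough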
[11:22:38] dcaro: also let's run mtr on monitor mode... mtr -t -T -P 443
[11:22:59] and see if you're getting some lost packets in there as well
[11:23:12] (that should open the curses iface running till you stop it manually)
[11:23:28] okok, running `mtr -P 443 -t -T -w 208.80.154.224`
[11:24:00] looks clean
[11:24:05] https://www.irccloud.com/pastebin/HSBbIAFE/
[11:25:58] ack, let me know when you have the new pcap in place
[11:26:00] got another pcap, listening only on eth0
[11:26:04] uploading
[11:26:42] it's under `cloudcontrol1007.eqiad.wmnet:/tmp/only_eth0.tgz`
[11:26:46] got it
[11:27:43] the ssl decoding is awesome btw.
[11:27:53] looking better, I'm only seeing the burst of retransmissions at the end prior to haproxy RSTing the connection
[11:28:03] s/decoding/decryption/
[11:28:08] yep xd
[11:29:30] nice, python requests also generates the ssl key log file when the envvar is set, that will help a lot debugging stuff xd
[11:36:09] dcaro: you're welcome
[11:36:27] btw... the IP from that worker is always the same?
[11:37:28] yes, the ips get assigned on VM creation, and only change when rebuilding the VM
[11:40:58] dcaro: ack, can you tell me the x-cache response header from a working one?
[11:41:06] (on the same worker)
[11:41:23] that worker now works xd
[11:41:32] < x-cache: cp1106 miss, cp1106 pass
[11:41:46] so that host is hitting cp1106
[11:41:48] from that same worker, same url
[11:41:54] URL doesn't matter
[11:42:00] just source IP
[11:42:16] same ip too
[11:42:23] < x-client-ip: 172.16.6.13
[11:45:48] it stopped working again
[11:47:30] that URL is being used by the bot as well right?
[11:48:05] yes
[11:48:17] I can try to see if it happens with a different url
[11:48:41] right now I'm guessing it's still failing?
[11:48:45] just api.php without params also fails
[11:48:46] yep
[11:49:00] it did work for a minute though
[11:50:07] yeah... requests aren't going through the CDN node at all from that client
[11:51:41] that sounds like maybe routing issues? (as it works sometimes?)
[11:51:44] and now it's working again
[11:51:50] right?
[11:52:03] yep
[11:52:08] unless....
[11:52:23] Feb 07 11:41:23 cp1106 haproxy[1857819]: [debug] silent-drop_for_300s: type=ipv4 <172.16.6.13>
[11:52:57] the bot is so aggressive that it's triggering the silent-drop protection mechanism from HAProxy
[11:53:27] ohhh, that is interesting, and as it shares its IP with everything on that worker, any container running there would also see the blockage right?
[11:53:34] indeed
[11:53:50] has that limit changed lately? the user said it was working before
[11:53:57] (without any change on their side)
[11:53:57] fabfur & cdanis ^^
[11:54:19] dcaro: hmm maybe now the tool is on a busier node?
[11:54:38] AFAIK the silent-drop limits haven't changed lately
[11:55:22] I'll look into the node traffic, see if there's a congregation of high traffic tools or similar
[11:56:22] vgutierrez: why only on cp1106?
[11:56:43] volans: :?
[11:56:58] the silent drop
[11:57:08] cause the abusing IP is the same one
[11:57:16] and load balancers work like that
[11:57:41] yeah, I know, I mean: all the traffic comes only from that single cloud IP?
[11:58:04] all the traffic that's being silent-dropped by cp1106 comes from that IP
[11:59:00] is there anything that can be done to prevent that? (headers sent about traffic limit so the users can rate-limit themselves, increase limits for toolforge ips, ...)
[11:59:01] sigh... the most used UAs on that IP in the last day are python-requests/2.28.2 and 2.31.0... 6k requests the first one and 1.7k the second one
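A quick way to repeat the check above from the worker: dump the response headers and look at x-cache and x-client-ip to see which cache node and client IP the edge observes. The User-Agent string here is a made-up example of the kind of descriptive UA the policy discussed below asks for.

    curl -s -o /dev/null -D - \
      -A 'exampletool/1.0 (https://exampletool.toolforge.org; tools.exampletool@toolforge.org)' \
      'https://commons.wikimedia.org/w/api.php' | grep -iE '^x-(cache|client-ip):'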
[11:59:30] so you have two heavy unidentified users on that node
[12:00:08] I know one at least xd
[12:01:01] the python-requests/2.28.2 UA has the same amount of requests in the last 24h as the rest of the top 10 combined
[12:01:34] I kind of remember we had a policy in Toolforge that tools had to set an explicit UA. But can't find the docs at the moment
[12:01:47] it's definitely against our policies
[12:01:49] yes we do
[12:02:45] https://wikitech.wikimedia.org/wiki/Help:Toolforge/Banned for tools that were banned until they changed the user agent
[12:03:01] so we need some banning :)
[12:03:45] I hope a bit of asking might be enough https://wikitech.wikimedia.org/wiki/User:Legoktm/toolforge_library
[12:03:51] dcaro: but that page is about accessing toolforge. This problem here is about hitting the wikis from toolforge, no?
[12:03:54] there's libraries to help too
[12:04:00] dcaro: to be clear.. we silent drop if we see more than 500 concurrent requests from a single IP
[12:04:21] so we won't be relaxing that limit for toolforge
[12:04:34] that's ok
[12:04:41] tools need to behave :)
[12:04:59] so the limit is about concurrent requests, not rate-limiting then?
[12:05:15] that's right
[12:05:17] (as in, you can do many small quick requests, but 500 long-running ones will hit the limit)
[12:06:14] yeah.. if you spawn 1000 threads you get banned immediately but you can have 100 threads going as fast as they can and you won't trigger the silent-drop
[12:06:34] ack, that's useful
[12:06:56] you would trigger some 429s for sure
[12:06:59] but not the silent-drop
[12:07:21] vgutierrez: the silent drop is known to return HTTP 104 always?
[12:07:43] arturo: you're not seeing an HTTP 104
[12:08:15] but ECONNRESET 104 Connection reset by peer
[12:08:17] os error, not http
[12:08:18] (os error 104)
[12:08:26] right
[12:08:49] would that sum up different wikis? or is it per-wiki?
[12:09:03] it's per CDN node
[12:09:07] is that ECONNRESET the known behavior on the client side when the silent drop triggers?
[12:09:13] arturo: yes
[12:09:41] vgutierrez: ok, and last question, is just using a proper UA enough to avoid the limit?
[12:09:47] nope
[12:09:54] the limit is enforced by source IP
[12:10:36] maybe source IP and source port? or only src IP?
[12:10:40] source IP
[12:11:12] in HTTP/1.1 you cannot have concurrent requests if you don't have several TCP connections
[12:11:25] that requires 500 source ports AFAIK
[12:12:10] so, with the earlier question by dcaro
[12:12:30] I guess we can only have 500 connections per toolforge worker node for each CDN node
[12:13:02] and each worker node can run hundreds of tools/bots, so the available connections per tool/bot is even lower
[12:13:29] that's accurate
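A rough way to see how close a worker is to that threshold, assuming the same text-lb VIP as the mtr above: count the established HTTPS connections towards it. Since the 500-concurrent silent-drop is per source IP, this count is shared by every container on the worker.

    # established TCP connections from this worker to the CDN VIP (minus the header line)
    ss -tn state established dst 208.80.154.224 | tail -n +2 | wc -l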
[12:14:31] and each ip gets assigned one CDN node
[12:14:31] BTW, limit got tightened in late November
[12:14:53] embrace IPv6 and get one IP per tool? :)
[12:15:05] heh, do you have a magic wand? :-P
[12:15:55] IPv6 for toolforge is definitely something that has been on my radar for half a decade now
[12:16:03] it's weird though that the user had no issues on grid
[12:16:13] (note how half a decade sounds like more time than 5 years)
[12:16:25] (where the nodes were also shared)
[12:16:59] in that node you got one or two abusers
[12:17:31] I'd suggest that you solve that issue and probably the silent-drop will stop being triggered
[12:18:36] happy to continue this conversation then
[12:19:47] the user is opening >1k tcp connections at once
[12:19:49] https://www.irccloud.com/pastebin/A0VSqIUW/
[12:20:14] that seems like too much xd
[12:20:19] lol :)
[12:24:39] arturo: found the policy, it's a soft-redirect to https://meta.wikimedia.org/wiki/User-Agent_policy
[12:24:47] https://wikitech.wikimedia.org/wiki/User-agent_policy
[12:26:23] vgutierrez: thanks a lot!! that was very helpful :)
[12:26:57] dcaro: thanks
[12:26:59] vgutierrez: thanks!!
[13:58:21] confd service (using nrpe) is failing everywhere, is that part of the migration to systemdunitfailure? godog?
[13:58:33] *confd service alert
[13:58:54] https://alerts.wikimedia.org/?q=alertname%3Dconfd%20service
[14:00:50] oh, they are gone now :)
[14:01:10] "yes" to answer your question dcaro :)
[14:01:27] ack, thanks for moving off icinga! ;)
[14:01:49] thank you, baby steps
[14:41:46] cdanis: you got me digging up on BGP metrics for calico and well, it was depressing: https://docs.tigera.io/calico-enterprise/latest/operations/monitor/metrics/bgp-metrics
[14:42:05] apparently only on calico-enterprise 😞
[14:43:38] that's ...
[14:43:41] sigh
[14:43:50] ugh
[14:44:52] <_joe_> yeah sadly I'm not surprised
[14:45:17] <_joe_> and calico was by far the least "freemium" of these solutions
[14:45:21] yeah
[14:45:26] we can just run bird_exporter in the pod i guess?
[14:45:31] ("just")
[14:51:46] akosiaris: there's also an open bug on their github from 2019 about bgp metrics heh https://github.com/projectcalico/calico/issues/2369
[14:52:01] i love open core
[14:52:36] yeah I have an open task as well about exposing BGP graceful shutdown
[14:52:42] heh
[14:52:42] similar situation
[14:52:56] gonna need our own calico distribution
[14:53:17] there you go: https://github.com/projectcalico/calico/issues/4607
[14:53:27] May 12, 2021
[14:53:43] It would be nice if any support added by this PR is a compatible extension of what we already have in Enterprise.
[14:53:44] it became evident pretty quickly what I wanted was in the enterprise product
[14:53:46] 😂
[14:54:57] Do we ever reach out to these open-core vendors to see if they'd make a "donation" of a license to the WMF?
[14:55:07] claime: we wouldn't
[14:55:22] By policy I assume
[14:55:25] yes
[14:55:34] ack
[14:56:43] * Emperor mutters about open-core needing dumping
[14:56:48] On the positive side, I'm excited about the open-source forks of terraform and hashi vault... it seems like they will be OK with adding features that threaten the enterprise stuff
[14:57:14] I suppose we can put bird_exporter in a container and run it in the pod.
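In the meantime, one way to eyeball BGP session state by hand is to ask bird directly. This is a hedged sketch only, assuming the stock calico-node DaemonSet in kube-system and the usual bird control socket path under /var/run/calico/; names and paths may differ in our setup, and `calicoctl node status` gives a similar summary where calicoctl is installed.

    # BGP peer summary from one calico-node pod (bird.ctl is the IPv4 daemon's socket)
    kubectl -n kube-system exec ds/calico-node -c calico-node -- \
      birdcl -s /var/run/calico/bird.ctl show protocols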
[14:57:42] I'll have to double check the sockets thing, but it shouldn't be difficult
[14:57:52] https://i.pinimg.com/originals/15/f5/89/15f589b8111e210ac297e86737bb3a65.jpg
[14:57:56] we'll need to run 2 ofc, to make things fun
[14:58:01] one for IPv4 and one for IPv6
[14:58:21] claime: you are fast :P
[14:58:29] :p
[14:58:33] faster than that motorcycle ;P
[14:58:33] oh ffs
[14:58:45] :nod:
[14:58:51] inflatador: tbf my 30 year old horse is faster than that motorcycle
[14:59:18] claime: we can probably speed it up by adding a few more sidecars
[14:59:30] With motors in them
[14:59:33] On per wheel
[14:59:37] One*
[14:59:39] Yeah, it'll look like a drone on wheels
[14:59:40] just one more sidecar bruh it's gonna fix everything i promise
[14:59:42] one more sidecar
[14:59:58] * claime slaps pod
[15:00:04] This baby can fit so many sidecars
[15:00:06] * kamila_ was just finishing typing up https://phabricator.wikimedia.org/T356877
[15:08:30] <_joe_> tbh the right allegory for this mess is a mcmansion
[15:09:30] <_joe_> the sidecars are the quadruple garage
[15:13:56] * kamila_ is desperately trying to figure out how to insert their beloved 4-wheeled pile of rust into the conversation
[15:14:59] kamila_: take the previous joke, and rewrite it in rust? ;-)
[15:15:18] oh god why XD
[15:16:11] * Emperor bows
[15:17:16] <_joe_> kamila_: took me a minute to ensure we didn't have a member of the rust evangelism task force in our team
[15:17:43] <_joe_> "ah no they're talking about their car, phew"
[15:17:55] I should have thought of that :D
[15:19:02] I spent ~8h sanding rust last week and another horrible person already asked me why I did that given that rust was cool these days
[15:25:07] what's the car?
[15:26:15] nevermind :)
[15:29:54] was that an invitation to talk about my bad decisions? :D
[15:41:23] :)
[16:01:21] !incidents
[16:01:21] 4427 (UNACKED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad)
[16:01:22] 4424 (RESOLVED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad)
[16:01:25] sad times
[16:01:32] !ack 4427
[16:01:33] 4427 (ACKED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad)
[16:01:38] !ack 4424
[16:01:39] Attempt to ack incident 4424 failed.
[16:01:42] is that going to be another OOM?
[16:01:46] !incidents
[16:01:46] 4427 (ACKED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad)
[16:01:46] 4424 (RESOLVED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad)
[16:01:47] * Emperor goes to look at titan nodes
[16:02:28] both eqiad titan nodes slow/nonresponsive to ssh, so I'm guessing this is a repeat of yesterday.
[16:03:04] same time as yesterday
[16:03:14] i think we have a query of death in a cronjob that runs at 16:00
[16:03:36] <_joe_> possible
[16:04:03] <_joe_> cdanis: or someone is very regular in their thanos queries :D
[16:05:26] titan1* still nonresponsive
[16:05:49] yeah, ping response but ssh hangs for me too, probably time to reboot
[16:07:05] they did self-recover last time...
[16:07:44] <_joe_> just to clarify, there's an ongoing production issue and a couple teams are flying blind atm
[16:08:15] I'm getting a console on titan1002 atm, getting ready to hard reboot
[16:08:55] thanks
[16:09:46] herron: shall I do titan1001 via the reboot cookbook?
[16:10:50] Emperor: sure I'll focus on titan1002 for the moment
[16:10:51] * Emperor doing so
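For reference, a sketch of that step: the SRE reboot cookbooks are run from a cumin host. Exact arguments may vary (some cookbooks also take a reason or task ID), so treat this as illustrative rather than the exact invocation used here.

    # reboot a single host via the SRE cookbook, from a cumin host
    sudo cookbook sre.hosts.reboot-single titan1001.eqiad.wmnet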
[16:13:45] finally bios timeouts are done booting linux
[16:14:25] titan1002 reachable again
[16:14:31] titan1001 powercycling now
[16:16:33] does anyone have any automation that pokes Netbox to check for hosts that are past the 5-year cycle? I might put something together if not
[16:18:18] herron: I'll update T356788 to note the worse failure today
[16:18:19] T356788: thanos-query probedown due to OOM of both eqiad titan frontends - https://phabricator.wikimedia.org/T356788
[16:18:36] Emperor: thank you
[16:19:01] and I think we have some sleuthing to do re the 16:00 cron idea
[16:46:56] jynus: we have a puppet-merge clash. Do you want to merge mine?
[16:47:24] it is just 2 lines for statistics_servers, ok?
[16:47:40] Yep. Please go ahead.
[16:48:38] btullis: done
[16:48:46] Cheers.
[17:04:13] !incidents
[17:04:14] 4428 (UNACKED) [2x] ProbeDown sre (ip4 thumbor:8800 probes/service http_thumbor_ip4)
[17:04:14] 4427 (RESOLVED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad)
[17:04:27] !ack 4428
[17:04:28] 4428 (ACKED) [2x] ProbeDown sre (ip4 thumbor:8800 probes/service http_thumbor_ip4)
[17:05:00] herron: I just merged a change to remove the thumbor pools, must have broken something
[17:05:03] reverting
[17:05:10] hnowlan: ah! thank you
[17:05:49] ty, was just about to see if you were still around :)
[17:06:32] https://gerrit.wikimedia.org/r/c/operations/puppet/+/998447
[17:06:46] thumbor is not configured in lvs to use these pools
[17:07:25] unless I messed up the restarts
[17:08:57] pools are restored but I'm not seeing recovery
[17:10:23] might I need to restart pybal to get it to pick up the restored pools?
[17:10:27] hnowlan: they get re-added with weight=0
[17:10:28] by default
[17:10:36] sudo confctl select 'service=thumbor' get
[17:10:41] aha
[17:10:44] and inactive
[17:10:50] you have to pool them and set a weight
[17:11:15] my select gets both eqiad and codfw
[17:11:22] I see all with weight=0
[17:11:33] and pooled=inactive
[17:11:35] done, recovering now
[17:14:17] hnowlan: I'm not sure what the status was *before* merging the patch, but my suggestion would be to always depool hosts in a pool before removing it, allows for easier testing and rollback ;)
[17:15:07] volans: fair point
[17:15:53] just verified in lvs logs that the restart I did earlier didn't pick up the group change, which means that the service is using the thumbor group rather than kubesvc
[17:16:48] that explains it
[17:34:07] jayme: gentle reminder to not use cumin1001 but cumin1002,cumin2002 ;) [see Mori.tz's email from December for context]
[17:35:28] volans: oops, thanks. That probably got lost during my absence and muscle memory took over again
[17:35:38] :D
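To close the loop on the thumbor repool above, a hedged sketch of the conftool step; the hostname and weight are illustrative placeholders, not the values actually used.

    # pool one backend with an explicit weight, then verify the whole service
    sudo confctl select 'service=thumbor,name=thumbor1001.eqiad.wmnet' set/pooled=yes:weight=10
    sudo confctl select 'service=thumbor' get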