[04:04:09] FIRING: LVSHighRX: Excessive RX traffic on lvs2013:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2013 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[04:09:09] RESOLVED: LVSHighRX: Excessive RX traffic on lvs2013:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2013 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[12:22:13] Traffic: Upgrade haproxy to 2.8.13 on cp hosts - https://phabricator.wikimedia.org/T383111#10466213 (Vgutierrez)
[13:43:40] hello, in T383750 it seems that I'm reaching a limit enforced by Varnish: I'm unable to download more than 1.85GB at a time, and apache doesn't log anything special when the download stops. From this context, I guess there are two questions: could someone confirm that this limit is being reached? Is this a limit that we would be willing to adjust
[13:43:41] for that usage? it seems to be a bit outside of the initial scope of this host
[13:43:41] T383750: MinT: Fails to download models/files from peopleweb.discovery.wmnet - https://phabricator.wikimedia.org/T383750
[14:25:31] Traffic, collaboration-services, Language and Product Localization, MinT: MinT: Fails to download models/files from peopleweb.discovery.wmnet - https://phabricator.wikimedia.org/T383750#10466709 (LSobanski)
[15:14:35] Traffic, collaboration-services, Language and Product Localization, MinT: MinT: Fails to download models/files from peopleweb.discovery.wmnet - https://phabricator.wikimedia.org/T383750#10466932 (ABran-WMF) >>! In T383750#10466226, @KartikMistry wrote: > Also, I deployed MinT successfully on 07...
[15:18:15] arnaudb: how long does it take to fail? is there a consistent timeout cutoff?
[15:18:16] arnaudb: what limit are you referring to?
[15:18:42] also, in the cases where you're hitting peopleweb.discovery.wmnet you're not going through varnish
[15:18:55] timeouts are not applied on pass?
[15:18:56] hm
[15:19:23] connection level idle-traffic timeouts would apply on pass
[15:19:26] arnaudb: what URL are you targeting? if it's peopleweb.discovery.wmnet that's not behind the CDN
[15:19:31] pass is just about caching
[15:19:34] time wget --verbose --output-document /dev/null https://people.wikimedia.org/~arnaudb/model.bin.1
[15:19:35] it's an internal endpoint
[15:19:41] arnaudb: ack
[15:20:02] (does it work internally if you go direct against the discovery URL?)
[15:20:23] checking
[15:20:43] but also, in a more broad and general sense: hosting multi-gigabyte model files through our "text" cluster is not a great idea
[15:22:12] I definitely can reproduce the issue, it fails after 84 seconds
[15:22:42] unpopular opinion... personally I don't think we should use peopleweb for this use case
[15:22:47] bblack: +1 haha that question has been raised in the ticket already
[15:23:05] volans: +1 as well
[15:23:49] http://peopleweb.discovery.wmnet is not reachable via cumin hosts, where should I jump to be able to test internally?
[15:25:38] ah deploy1xx
[15:25:39] I tried it from an eqiad traffic cache with "wget", and can indeed load the whole thing. ~2.3GB xferred in ~10s though.
[15:25:44] a cp host :)
[15:25:51] so we don't reach any timeout because it's too fast there heh
[15:26:24] I should have upgraded my net to 10G, successfully retrieved from deploy1003 at 60MB/s
[15:26:36] not really
[15:26:46] the CDN doesn't allow single stream speeds over 30M/sec
[15:27:33] ah so I'm saturating my connection to the CDN with the download, the timeout is reached and apache logs a http/200
[15:27:48] I can reproduce against peopleweb.discovery.wmnet BTW
[15:27:57] curl -v -o /dev/null --limit-rate 25M -H 'Host: people.wikimedia.org' https://peopleweb.discovery.wmnet/~arnaudb/model.bin.1
[15:27:57] throttling the download?
[15:28:01] ack
[15:28:17] that triggers the issue as well at 1.6GB
[15:28:34] so, not varnish, the plot thickens
[15:28:42] just the opposite :)
[15:28:47] ?
[15:28:57] it's varnish serving the file on discovery?
[15:28:59] by discarding the CDN you already know what's the culprit :D
[15:29:19] * arnaudb jumps back to the traffic schema to identify the missing layer
[15:30:04] arnaudb: I'd bet you a beer in ATL that the culprit is the envoy timeouts in people2003
[15:30:32] ah so the wikitech schema was a bit outdated as it misses envoy
[15:30:43] https://wikitech.wikimedia.org/wiki/Global_traffic_routing#/media/File:WMF_Inbound_Text_Traffic_Diagram.svg (or at least the one I found haha)
[15:30:59] arnaudb: that's accurate and up to date
[15:31:16] envoy timeouts in *people2003*
[15:31:27] people2003.codfw.wmnet isn't part of the CDN
[15:31:43] it's the backend server behind peopleweb.discovery.wmnet
[15:32:05] peopleweb.discovery.wmnet is an alias for people2003.codfw.wmnet.
[15:32:10] ah ok I was missing the envoy component
[15:32:25] thanks for the pointer vgutierrez!
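(Annotation, not part of the channel log.) The exchange above turns on simple rate arithmetic: a fast internal transfer finishes before any time-based cutoff can fire, while a throttled one does not. The following is a minimal Python sketch of that reasoning, not anything run in the channel. The function names are hypothetical, sizes and rates are assumed to be decimal (1 GB = 10^9 bytes, 1 MB = 10^6 bytes), and the 84-second cutoff is only the failure time reported for one run, used here as a placeholder; the log does not establish the actual timeout value or where it is configured.

```python
def transfer_seconds(size_bytes: int, rate_bytes_per_s: int) -> float:
    """Seconds needed to move size_bytes at a steady rate."""
    if rate_bytes_per_s <= 0:
        raise ValueError("rate must be positive")
    return size_bytes / rate_bytes_per_s


def exceeds_cutoff(size_bytes: int, rate_bytes_per_s: int, cutoff_s: float) -> bool:
    """True if a steady-rate transfer would still be running at cutoff_s."""
    return transfer_seconds(size_bytes, rate_bytes_per_s) > cutoff_s


# Figures quoted in the log, read as decimal units (an assumption).
FILE_SIZE = 2_300_000_000        # "~2.3GB"
RATES = {
    "cp host (~2.3GB in ~10s)": 230_000_000,
    "deploy1003 (60MB/s)": 60_000_000,
    "curl --limit-rate 25M": 25_000_000,
}

# Placeholder cutoff: the "fails after 84 seconds" observation, NOT a
# confirmed configuration value.
CUTOFF_S = 84

if __name__ == "__main__":
    for label, rate in RATES.items():
        secs = transfer_seconds(FILE_SIZE, rate)
        flag = "would be cut off" if exceeds_cutoff(FILE_SIZE, rate, CUTOFF_S) else "finishes in time"
        print(f"{label}: {secs:.1f}s -> {flag}")
```

Under these assumptions only the throttled transfer outlasts the placeholder cutoff, which matches the pattern reported in the channel: the fast transfers succeed and the rate-limited one is interrupted.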
[15:32:28] np
[16:19:10] Traffic, collaboration-services, Language and Product Localization, MinT, Patch-For-Review: MinT: Fails to download models/files from peopleweb.discovery.wmnet - https://phabricator.wikimedia.org/T383750#10467299 (ABran-WMF) after merging the CR, running puppet and trying again to download th...
[16:31:14] Traffic, collaboration-services, Language and Product Localization, MinT: MinT: Fails to download models/files from peopleweb.discovery.wmnet - https://phabricator.wikimedia.org/T383750#10467357 (KartikMistry) It seems that at least one pod is running fine on eqiad? ` $ kube_env machinetranslat...
[16:47:02] Traffic, SRE: Define an event stream and schema for analytics pipeline ingestion - https://phabricator.wikimedia.org/T383392#10467480 (Ottomata)
[16:47:18] Traffic, SRE: Define an event stream and schema for haproxy_requestctl analytics pipeline ingestion - https://phabricator.wikimedia.org/T383392#10467481 (Ottomata)
[16:52:21] Traffic, SRE: Define an event stream and schema for haproxy_requestctl analytics pipeline ingestion - https://phabricator.wikimedia.org/T383392#10467505 (Ottomata) Hi! It looks like [[ https://gitlab.wikimedia.org/repos/data-engineering/schemas-event-secondary/-/commit/a4cc9ecad3d018487e7c215c605346b335...
[16:57:50] Traffic, collaboration-services, Language and Product Localization, MinT: MinT: Fails to download models/files from peopleweb.discovery.wmnet - https://phabricator.wikimedia.org/T383750#10467527 (LSobanski) @KartikMistry @santhosh As this is proving to be more complex, I suggest looking into the...
[16:59:43] Traffic, SRE: Refine add_is_wmf_domain TransformFunction fails if no source field exists - https://phabricator.wikimedia.org/T383914 (Ottomata) NEW
[16:59:49] Traffic, collaboration-services, Language and Product Localization, MinT: MinT: Fails to download models/files from peopleweb.discovery.wmnet - https://phabricator.wikimedia.org/T383750#10467542 (KartikMistry) >>! In T383750#10467527, @LSobanski wrote: > @KartikMistry @santhosh As this is proving...
[16:59:56] Traffic, SRE: Refine add_is_wmf_domain TransformFunction fails if no source field exists - https://phabricator.wikimedia.org/T383914#10467543 (Ottomata) p:Triage→High
[17:10:58] Traffic, SRE: Refine add_is_wmf_domain TransformFunction fails if no source field exists - https://phabricator.wikimedia.org/T383914#10467589 (Ottomata)
[17:11:33] Traffic, SRE: Refine add_is_wmf_domain TransformFunction fails if no source field exists - https://phabricator.wikimedia.org/T383914#10467592 (Ottomata)
[17:16:05] netops, Infrastructure-Foundations, SRE: Netbox: execute interface validator in provision script for switch interfaces - https://phabricator.wikimedia.org/T383915 (cmooney) NEW p:Triage→Low
[18:30:09] FIRING: LVSHighRX: Excessive RX traffic on lvs2013:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2013 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[18:47:18] ugh
[18:55:09] RESOLVED: LVSHighRX: Excessive RX traffic on lvs2013:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2013 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX