[11:10:17] Traffic, Commons: After ~1 minute file download is cut off - https://phabricator.wikimedia.org/T351876 (Vgutierrez) The file is over the CDN size threshold (1G) so it will hit swift every time that it needs to be fetched. Could it be related to the work done by @MatthewVernon on T317616
[11:25:34] Traffic, Commons: After ~1 minute file download is cut off - https://phabricator.wikimedia.org/T351876 (Vgutierrez) envoy sets an upstream response timeout by default at 65s (https://github.com/wikimedia/operations-puppet/blob/397c454bbad404c9667c6f63f86e993b1970af8a/modules/envoyproxy/manifests/tls_term...
[11:26:33] Traffic, Commons: After ~1 minute file download is cut off - https://phabricator.wikimedia.org/T351876 (MatthewVernon) I'm trying to see what the previous nginx-based timeout was, but it's code I'm unfamiliar with
[11:31:47] Traffic, Commons: After ~1 minute file download is cut off - https://phabricator.wikimedia.org/T351876 (MatthewVernon) I think, per [[https://github.com/wikimedia/operations-puppet/blob/397c454bbad404c9667c6f63f86e993b1970af8a/modules/tlsproxy/manifests/localssl.pp#L105 | modules/tlsproxy/manifests/local...
[11:34:18] Traffic, Commons: After ~1 minute file download is cut off - https://phabricator.wikimedia.org/T351876 (Vgutierrez) nginx didn't enforce a timeout for the whole request but just a timeout (180s) between reads from the server so that won't be enough. To mimic the behavior you need to set the response ti...
[11:35:06] Traffic, Commons: Download cut off (envoy response timeout at 65s) for Commons file over CDN size threshold (1GB) - https://phabricator.wikimedia.org/T351876 (Aklapper)
[11:36:02] Traffic, Commons: Download cut off (envoy response timeout at 65s) for Commons file over CDN size threshold (1GB) - https://phabricator.wikimedia.org/T351876 (MatthewVernon) >>! In T351876#9356447, @Vgutierrez wrote: > nginx didn't enforce a timeout for the whole request but just a timeout (180s) between...
[11:36:50] Traffic, Commons, SRE-swift-storage: Download cut off (envoy response timeout at 65s) for Commons file over CDN size threshold (1GB) - https://phabricator.wikimedia.org/T351876 (MatthewVernon) p:Triage→High
[11:54:55] Traffic, Commons, SRE-swift-storage: Download cut off (envoy response timeout at 65s) for Commons file over CDN size threshold (1GB) - https://phabricator.wikimedia.org/T351876 (MatthewVernon) ...but profile::tlsproxy::envoy doesn't have that configuration available as far as I can see...
[12:02:13] Traffic, Commons, SRE-swift-storage: Download cut off (envoy response timeout at 65s) for Commons file over CDN size threshold (1GB) - https://phabricator.wikimedia.org/T351876 (MatthewVernon) Let's add it as an optional parameter, and try and pass it through.
[12:54:18] Traffic, Beta-Cluster-Infrastructure, Beta-Cluster-reproducible: Beta cluster API timeout - https://phabricator.wikimedia.org/T351930 (AlexisJazz)
[12:54:25] Traffic, Beta-Cluster-Infrastructure, Beta-Cluster-reproducible: Beta cluster API timeout - https://phabricator.wikimedia.org/T351930 (AlexisJazz)
[12:54:57] Traffic, Beta-Cluster-Infrastructure, Beta-Cluster-reproducible: Beta cluster API timeout - https://phabricator.wikimedia.org/T351930 (AlexisJazz) As I don't know the cause I'm just tagging as both traffic and train blocker.
[12:55:33] Traffic, Beta-Cluster-Infrastructure, MediaWiki-Action-API, Beta-Cluster-reproducible: Beta cluster API timeout - https://phabricator.wikimedia.org/T351930 (AlexisJazz)
[12:57:23] Traffic, Beta-Cluster-Infrastructure, MediaWiki-Action-API, Beta-Cluster-reproducible: Beta cluster API timeout - https://phabricator.wikimedia.org/T351930 (AlexisJazz)
[13:11:51] Traffic, Commons, SRE-swift-storage: Download cut off (envoy response timeout at 65s) for Commons file over CDN size threshold (1GB) - https://phabricator.wikimedia.org/T351876 (AlexisJazz) >>! In T351876#9356445, @MatthewVernon wrote: > I think, per [[https://github.com/wikimedia/operations-puppet/b...
[13:14:39] Traffic, Commons, SRE-swift-storage: Download cut off (envoy response timeout at 65s) for Commons file over CDN size threshold (1GB) - https://phabricator.wikimedia.org/T351876 (MatthewVernon) @AlexisJazz it's a time for how long the connect has no data going over it "Each time an encode/decode event...
[13:27:39] Traffic, Commons, SRE-swift-storage: Download cut off (envoy response timeout at 65s) for Commons file over CDN size threshold (1GB) - https://phabricator.wikimedia.org/T351876 (MatthewVernon) Open→Resolved a:MatthewVernon I can confirm that I can now download this file, even though it ta...
[13:30:08] Traffic, Beta-Cluster-Infrastructure, Beta-Cluster-reproducible: HTTP 504 connection timeout error accessing MW API on Beta cluster - https://phabricator.wikimedia.org/T351930 (Aklapper)
[13:30:12] Traffic, Beta-Cluster-Infrastructure, Beta-Cluster-reproducible: HTTP 504 connection timeout error accessing MW API on Beta cluster - https://phabricator.wikimedia.org/T351930 (Aklapper)
[13:32:08] Traffic, Beta-Cluster-Infrastructure, Beta-Cluster-reproducible: HTTP 504 connection timeout error accessing MW API on Beta cluster - https://phabricator.wikimedia.org/T351930 (Aklapper) (Removed 1.42.0-wmf.13 as that's six weeks away, and MW-API tag as I'd not expect it to be related in a 504). Inte...
[15:23:46] Traffic, Observability-Metrics: Label value spam in ncredir_requests_total metric - https://phabricator.wikimedia.org/T351934 (fgiunchedi)
[15:43:50] Traffic, Observability-Metrics: Label value spam in ncredir_requests_total metric - https://phabricator.wikimedia.org/T351934 (fgiunchedi)
[15:44:11] Traffic, Beta-Cluster-Infrastructure, Beta-Cluster-reproducible: HTTP 504 connection timeout error accessing MW API on Beta cluster - https://phabricator.wikimedia.org/T351930 (Lucas_Werkmeister_WMDE) Both of them are working again for me now.
[16:23:41] Traffic, SRE, Patch-For-Review: Enable IPIP encapsulation for ncredir - https://phabricator.wikimedia.org/T351069 (Vgutierrez) @ayounsi what would be the required TCP MSS clamping values? per https://phabricator.wikimedia.org/T348837#9256494 It seems that around ~1400 bytes for both IPv4/IPv6 should...
[16:34:11] Traffic, Beta-Cluster-Infrastructure, Beta-Cluster-reproducible: HTTP 504 connection timeout error accessing MW API on Beta cluster - https://phabricator.wikimedia.org/T351930 (AlexisJazz) For me as well.
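
Note on T351876: the thread above contrasts two kinds of timeout, a limit on the whole upstream response (65s by default in envoy, per Vgutierrez) and a limit on the gap between reads (180s in the earlier nginx setup). The Python sketch below is only a toy replay of a download as a list of gaps between chunks. It is not envoy or nginx code, and the function name `transfer_outcome` and its parameters are hypothetical, chosen for illustration. It may help show why a steady but slow download fails under the first kind of limit and passes under the second.

```python
from typing import Iterable, Optional


def transfer_outcome(
    chunk_gaps: Iterable[float],
    response_timeout: Optional[float] = None,
    idle_timeout: Optional[float] = None,
) -> str:
    """Replay a download described by the seconds elapsed before each chunk.

    response_timeout: hypothetical cap on the total duration of the response.
    idle_timeout: hypothetical cap on the time between two chunks.
    Returns "complete" or a short reason for the cutoff.
    """
    elapsed = 0.0
    for gap in chunk_gaps:
        # The idle timer would fire after idle_timeout seconds of silence.
        idle_fires = idle_timeout is not None and gap > idle_timeout
        wait = idle_timeout if idle_fires else gap

        # The whole-response timer may fire first, however active the stream is.
        if response_timeout is not None and elapsed + wait > response_timeout:
            return f"cut off: response exceeded {response_timeout:g}s"
        if idle_fires:
            return f"cut off: idle for more than {idle_timeout:g}s"
        elapsed += gap
    return "complete"


# A 100-second download that delivers a chunk every half second.
slow_but_steady = [0.5] * 200

print(transfer_outcome(slow_but_steady, response_timeout=65))
print(transfer_outcome(slow_but_steady, idle_timeout=180))
```

With the 65s and 180s figures quoted in the thread, the first call reports a cutoff and the second completes, since no single gap comes close to 180s. Only a stalled stream, such as `[1, 200, 1]`, would trip the idle limit.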