[05:12:33] serviceops, SRE, Sustainability: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (Legoktm) I think I narrowed it down. If I upload using plain CLI curl, it finishes instantaneously: {P17633}. Now when I use a stripped down version of SwiftFileB...
[06:15:05] serviceops, SRE, Sustainability: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (Legoktm) If I use PHP's stream wrappers, literally: ` $opts = [ 'http' => [ 'method' => 'PUT', 'header' => $realHeaders, 'content' => $contents, ] ]; $ctx = st...
[06:16:56] serviceops, SRE, Sustainability: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (Legoktm) Also note that during the buster upgrade we did move from curl 7.52.1 to 7.64.0, so it could also be a regression from that. I haven't looked at the libcu...
[06:40:36] serviceops, SRE, Sustainability: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (Legoktm) I tried a few different variations on the PHP curl script, none of which made a difference: * Using `curl_setopt( $ch, CURLOPT_POSTFIELDS, $contents );`...
[07:45:02] serviceops, SRE, Sustainability: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (Legoktm) >>! In T275752#7467352, @Legoktm wrote: > The `stream_context_create` solution only works for files that fit under the memory limit, which is currently ~6...
[08:05:02] serviceops, SRE, Sustainability: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (Joe) Just to clarify from a previous comment from @Legoktm, the "fread" reported in his output is produced by curl when it sends the data to the client. So I'm not...
[08:45:27] serviceops, SRE, Sustainability: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (Joe) So, we can go deeper in what's wrong here, but for now I just tested switching the URL we call in @Legoktm's `test.php` to funnel the request via envoy, and t...
[08:52:02] hello folks
[08:53:13] I was checking if https://gerrit.wikimedia.org/r/c/operations/puppet/+/735577 could have been a way to add node-role.kubernetes.io/master to the ml-serve master nodes, but then I realized that the current label (node.kubernetes.io/disk-type=kvm) that should be there is not listed for any node
[08:53:18] meanwhile in the main cluster I see it
[08:54:04] does it ring any bell? Like things that we (as ML) have missed etc..
[08:54:59] (I am checking with kubectl get nodes --show-labels)
[08:57:19] serviceops, Citoid, VisualEditor, WMSE-Bug-Reporting-and-Translation-2021, and 2 others: Automatic citation generation using ISBN on Wikipedia doesn't work - https://phabricator.wikimedia.org/T294010 (Mvolz) >>! In T294010#7466225, @Dzahn wrote: > Can we ask for an exemption for Wikipedia? I thi...
[09:09:08] completely unrelated - with https://gerrit.wikimedia.org/r/c/operations/puppet/+/735583 ServiceOps becomes officially the owner of the kafka-main clusters (had a chat with Wolfgang), enjoy :) Jokes aside, count me for anything that needs to be done etc..
[09:10:04] serviceops, SRE, Sustainability: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (Joe) So, what I think we know at this point is: * The problem is completely within php / curl; curl from the command line or pretty much any other client behaves c...
[09:12:18] serviceops, Citoid, VisualEditor, WMSE-Bug-Reporting-and-Translation-2021, and 2 others: Automatic citation generation using ISBN on Wikipedia doesn't work - https://phabricator.wikimedia.org/T294010 (Mvolz) This paints a nice picture. https://grafana.wikimedia.org/d/NJkCVermz/citoid?orgId=1&from...
[09:25:53] serviceops, Citoid, VisualEditor, WMSE-Bug-Reporting-and-Translation-2021, and 2 others: Automatic citation generation using ISBN on Wikipedia doesn't work - https://phabricator.wikimedia.org/T294010 (Mvolz) a:Mvolz→None
[09:59:25] serviceops, SRE, Sustainability: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (akosiaris) {F34715054} Adding a statistical analysis of packet lengths in wireshark from 2 captures. Upper one, with 3323 packets in total is standard curl call,...
[10:12:23] serviceops, SRE, Sustainability: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (Xover) I don't think curl sends `Expect: 100-continue` for chunked transfers to begin with, and I don't think chunks need to be ack'ed before sending the next in H...
[10:24:53] serviceops, SRE, Sustainability: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (akosiaris) > Can we force cmdline curl to HTTP/2, or PHP libcurl to HTTP/1.1, to test that? Right on target! Thanks for noticing that, that's the issue. HTTP2 sho...
[10:27:21] serviceops, SRE, Sustainability: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (Joe) Yes, thanks for noticing @Xover; a posteriori, it's pretty obvious what is going on and it's interesting how much more inefficient using http2 is in this case...
[10:39:59] serviceops, Data-Persistence-Backup, GitLab (Infrastructure), Patch-For-Review, User-brennen: Backups for GitLab - https://phabricator.wikimedia.org/T274463 (Jelto) Thanks for the implementation of the restore script and the timer! When updating the documentation I had some additional though...
[10:48:59] serviceops, SRE, Sustainability: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (Xover) [[ https://github.com/curl/curl/pull/2709 | #2709 ]] landed in libcurl 7.62.0 and enabled HTTP/2 multiplexing by default when available, so Buster would ind...
[10:54:31] serviceops, SRE, Sustainability: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (Xover) >>! In T275752#7467725, @akosiaris wrote: > setting `curl_setopt( $ch, CURLOPT_HTTP_VERSION, CURL_HTTP_VERSION_1_1);` in the code fixes it Hmm. Is HTTP/2 a...
[10:57:00] serviceops, SRE, Sustainability: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (Joe) >>! In T275752#7467817, @Xover wrote: >>>! In T275752#7467725, @akosiaris wrote: >> setting `curl_setopt( $ch, CURLOPT_HTTP_VERSION, CURL_HTTP_VERSION_1_1);`...
[13:15:24] serviceops, SRE, Sustainability: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (Joe) While reading the code of MultiHttpClient, I found that is tries to support pipelining, which was removed from curl completely (and was already disabled in cu...
[13:18:51] serviceops, SRE, Sustainability: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (Reedy) >>! In T275752#7468119, @Joe wrote: > While reading the code of MultiHttpClient, I found that is tries to support pipelining, which was removed from curl co...
[14:59:05] serviceops, Citoid, VisualEditor, WMSE-Bug-Reporting-and-Translation-2021, and 2 others: Automatic citation generation using ISBN on Wikipedia doesn't work - https://phabricator.wikimedia.org/T294010 (akosiaris) >>! In T294010#7467613, @Mvolz wrote: > This paints a nice picture. https://grafana.w...
[15:01:59] serviceops, Citoid, VisualEditor, WMSE-Bug-Reporting-and-Translation-2021, and 2 others: Automatic citation generation using ISBN on Wikipedia doesn't work - https://phabricator.wikimedia.org/T294010 (akosiaris) p:Unbreak!→Low With that mitigation in place I am switching priority to Low f...
[15:17:13] serviceops, CirrusSearch, Discovery-Search, Infrastructure-Foundations, and 6 others: Half a million of CirrusSearch jobqueue execution errors per hour since 2021-09-30 16:02 - https://phabricator.wikimedia.org/T292291 (BBlack) I've rebased https://gerrit.wikimedia.org/r/c/operations/puppet/+/725...
[15:17:23] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Implement POC for istio ingress - https://phabricator.wikimedia.org/T290966 (JMeybohm)
[15:45:46] serviceops, SRE, Patch-For-Review, Sustainability: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (Xover) "[[ https://blog.cloudflare.com/delivering-http-2-upload-speed-improvements/ | Delivering HTTP/2 upload speed improvements ]]" from Cl...
[15:54:56] akosiaris: I added you as a reviewer on https://gerrit.wikimedia.org/r/c/mediawiki/core/+/735073 since I hear you're the new MultiHttpClient expert in SRE
[15:55:25] <_joe_> definitely
[16:21:50] serviceops, SRE: rename OTRS role/module/cumin aliases - https://phabricator.wikimedia.org/T293942 (Dzahn) a:Dzahn→Arnoldokoth
[16:22:15] serviceops, SRE: rename OTRS role/module/cumin aliases - https://phabricator.wikimedia.org/T293942 (Dzahn) Stalled→Open p:Triage→Low
[18:08:07] serviceops, Data-Persistence-Backup, GitLab (Infrastructure), Patch-For-Review, User-brennen: Backups for GitLab - https://phabricator.wikimedia.org/T274463 (Dzahn) @Jelto This makes sense to me. We talked about it a bit during our 1:1 we just had. I would say let's split it multiple parts. A...
[19:44:26] serviceops, SRE: rename OTRS role/module/cumin aliases - https://phabricator.wikimedia.org/T293942 (Dzahn) discussed in 1:1 with Arnold
[21:07:08] serviceops, SRE, MW-1.38-notes (1.38.0-wmf.7; 2021-11-02), Patch-For-Review, Sustainability: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (Legoktm) >>! In T275752#7467700, @Xover wrote: > But then, as Lego's output above shows, cmdlin...
[21:29:44] serviceops, SRE, MW-1.38-notes (1.38.0-wmf.7; 2021-11-02), Patch-For-Review, Sustainability: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (Legoktm) ` 2021-10-29 21:21:39 [296d3ab0-158d-47d4-b10b-25ae741838c7] mw1308 testwiki 1.38.0-wm...
[22:07:27] serviceops, SRE, MW-1.38-notes (1.38.0-wmf.6; 2021-10-26), Patch-For-Review, Sustainability: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (Legoktm) I watched the uploads for {T292769} and they took 7s and 9s for 714MB and 864 MB files...
[22:17:27] serviceops, SRE, MW-1.38-notes (1.38.0-wmf.6; 2021-10-26), Patch-For-Review, Sustainability: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (Legoktm)