[10:35:42] dcausse: The SUP was killed last night due to an unexpectedly large cirrusbuilddoc (5.6 MB). I had a quick look into the cirrus code, but it does not look like we log anything in case a doc is larger than the DocumentSizeLimiter profile (wmf_capped) requires. Should we investigate this?
[10:39:31] Ah, looks like this is caused inside the SUP: in case of a REV_BASED_UPDATE we pass the underlying doc twice: once as script source and once as upsert doc. :-(
[10:52:19] pfischer: seems like it's controlled by elasticsearch-bulk-flush-max-size
[10:53:02] but we seem to set both setMaxRecordSizeInBytes and setMaxBatchSizeInBytes with it
[10:54:12] max record size should perhaps be 4MiB*2 for the upsert case, and not sure about max_batch_size but probably a bit higher?
[10:57:39] cirrus logs the output size here: https://grafana-rw.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&viewPanel=58 so if there's a problem like one doc not respecting the max capacity I'm not 100% sure we would see it there
[10:58:01] happy to add more logs if you think it's necessary
[11:18:04] Thanks! IIRC ES, or the reverse proxy in front of it, enforces a max request content-length of ~4 MB
[11:18:30] So even if we increase the limit on our end, it would still fail.
[11:19:44] But if it’s ES/its proxy, cirrus should hit the same limitation
[11:20:31] if it writes a single upsert request with 2.8 MB * 2 (script param + upsert doc)
[11:21:31] hm... indeed, unless the SUP is traversing different network intermediaries it should have seen 4 MB*2 already for sure, and a couple of years ago there was no limit at all at the doc-building stage
[11:22:15] Is cirrus routed through envoy, too?
[11:22:26] yes, it should be
[11:23:34] there's a known 100 MB limit at the nginx and/or elastic level
[11:24:10] but given the recent mem usage circuit breakers added in elastic it might have changed
[11:24:52] pfischer: also, to understand: did the problem happen with the new sink you just implemented or with the old sink?
[11:26:01] That was with the new sink
[11:28:14] Hm, maybe I just remembered the wrong limit and it’s actually much higher. I just remember that we ran into 400 responses at some point, which led to the whole get-the-estimated-bulk-size-right problem that later led to the custom implementation.
[11:28:48] I’ll relaunch with a higher limit on Monday.
[11:29:23] ah, indeed, it's possible that the elastic http client has some limits which were not aligned with the rest of the stack?
[11:29:41] sure
[11:30:26] lunch
[11:30:28] https://phabricator.wikimedia.org/T353430 that was the original ticket and we hit > 100 MB, so that’s the per-request limit
[11:30:52] Dunno what happened, I’ll re-configure it Monday.
[11:31:00] Uh, Tuesday
[13:28:13] o/
[13:32:55] Forgot my co-working space doesn't open until 9 AM my time... so working from the car for the next 30m ;P
[13:37:49] lol
[15:01:30] \o
[15:04:58] o/
[15:19:27] going offline early, have a nice weekend
[15:51:47]
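
A minimal back-of-the-envelope sketch of the sizing issue discussed above, assuming the ~4 MiB wmf_capped document cap, the ~5.6 MB cirrusbuilddoc, the REV_BASED_UPDATE doubling (script param + upsert doc), and the ~100 MB per-request limit mentioned in the chat. The class and variable names are illustrative only, not SUP code or the real sink configuration API:

    // Hypothetical sketch: why a capped doc can still exceed the sink's record-size
    // limit when it is embedded twice in a scripted-upsert bulk action.
    public class UpsertSizeSketch {
        public static void main(String[] args) {
            long docCapBytes = 4L * 1024 * 1024;       // assumed wmf_capped limit (~4 MiB)
            long observedDocBytes = 5_600_000L;        // the ~5.6 MB cirrusbuilddoc from the incident
            // A REV_BASED_UPDATE carries the doc twice: once as the script parameter
            // and once as the upsert document, so the record roughly doubles.
            long effectiveRecordBytes = observedDocBytes * 2;
            long perRequestLimitBytes = 100L * 1024 * 1024; // known nginx/elastic per-request limit (~100 MB)

            System.out.printf("max record size should budget for >= %d bytes (2 x doc cap)%n",
                    docCapBytes * 2);
            System.out.printf("this doc's effective record: %d bytes%n", effectiveRecordBytes);
            System.out.printf("still under the ~100 MB per-request limit: %b%n",
                    effectiveRecordBytes < perRequestLimitBytes);
        }
    }

The takeaway matches the 10:54 message: for the upsert path, max record size should budget for roughly twice the capped document size, with max batch size at least that large, while any single bulk request must still stay under the per-request limit hit in T353430.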