[08:56:19] Trey314159: just a thought... Would it make sense to copy some of your notes in the standup etherpad to the phab ticket? For someone looking at Phab, it looks like no progress is being made. And since I'm adding links to the phab tickets from my weekly status update, it looks a bit incoherent.
[09:20:46] Yes, that's additional work, but hopefully not too much?
[09:24:00] pfischer: around? would you have 5' for a quick chat on SUP?
[10:05:05] Errand, back in 20'
[10:48:07] status update published: https://wikitech.wikimedia.org/wiki/Search_Platform/Weekly_Updates/2023-12-08
[11:04:08] lunch
[11:08:30] gehel: I’m around now
[11:15:35] I'll be there in 2'
[11:17:43] pfischer: meet.google.com/zyn-fcam-zja
[11:25:05] Maybe this afternoon?
[11:26:25] Hm, I’m sorry gehel: I don’t get notifications from IRCCloud once more.
[11:27:00] Let’s schedule a meeting.
[11:28:16] (I’m in the meet call now, but I’m happy to schedule another for later, too)
[14:14:26] o/
[14:31:16] o/
[16:00:10] \o
[16:01:31] o/
[16:10:14] as random guesses, going to deploy a patch to switch fetch to unordered. Not for long term, just to see if it can then keep the async op full-ish
[16:13:11] +1
[16:14:08] briefly looked at envoy and it does seem to be "unlimited" (50k concurrent requests if I understand the config correctly)
[16:14:30] yea i saw the same
[16:21:59] could maybe try kicking up the queue capacity there too, it's currently 25. We labeled it fetch retry queue, but i suspect this is also capturing pass-thru's. I wonder what the % of non-fetch events is
[16:23:29] indeed, hard to tell...
[16:25:07] other random note, i saw in the envoy config that it will retry a 5xx one time, so our retries might be doubled
[16:25:44] ah good point
[16:54:21] unordered roughly doubled throughput, from 25-30 up to ~60
[16:55:21] ok seems interesting to run unordered for rerenders I think
[16:55:52] 60 might be plausible with 25 concurrents, the pending requests counter now stays pegged at 25. Can try increasing the queue size with ordered to see if it can get there with enough buffer?
[16:56:13] sure
[17:16:11] now just weird :P Increased queue from 25 to 100, pending_requests reports low single digits, and then 60, then back to single digits. throughput fluxuates between 50 and 80
[17:16:33] i suppose the throughput fluxuation is the buffered events going out in order
[17:16:55] * ebernhardson notices there is no x in fluctuation
[17:16:56] yes this value might be very volatile
[17:17:16] Yeah, depends on when prometheus asks for it
[17:17:22] While implementing the http client metrics I noticed that the connection limit is applied per route.
[17:17:43] So it’s not an overall max.
[17:19:31] sure, this is where I'm not sure how it works, if by using envoy the client only considers one route
[17:19:32] oh, i also just noticed this name isn't that obvious. retryFetchQueueCapacity turns into CirrusDocFetcher.maxConcurrentConnection, so the increase to 100 didn't only change the buffer size but also the concurrent requests allowed
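
A minimal sketch of the ordered vs. unordered async fetch and the capacity knob being tuned above; the function and type names are hypothetical, not the actual SUP wiring:

```java
import java.util.concurrent.TimeUnit;
import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.async.AsyncFunction;

// Hypothetical wiring, for illustration only.
class AsyncFetchWiring {
    static <IN, OUT> DataStream<OUT> wireFetch(
            DataStream<IN> events, AsyncFunction<IN, OUT> fetch, boolean ordered) {
        // The last argument is the async operator's capacity (the 25 -> 100 change
        // above): how many elements may be in flight or buffered at once. It is a
        // back-pressure limit, separate from any HTTP-client connection limit.
        return ordered
            // Ordered: results are emitted in input order, so one slow fetch holds
            // back everything that completed behind it (head-of-line blocking).
            ? AsyncDataStream.orderedWait(events, fetch, 30, TimeUnit.SECONDS, 100)
            // Unordered: results are emitted as they complete, which is why the
            // switch roughly doubled throughput in the experiment above.
            : AsyncDataStream.unorderedWait(events, fetch, 30, TimeUnit.SECONDS, 100);
    }
}
```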
[17:20:23] That should only map to the connection pool size AFAIK
[17:20:28] indeed here there're two levels of "protection": the http client and the async operator
[17:22:25] in the wdqs updater there are two different settings for the asyncIO capacity and the connection pool and it sounds difficult to tune in retrospect
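
For reference, the per-route vs. overall distinction mentioned above looks like this in Apache HttpClient; a sketch with the classic blocking 4.x API and made-up limits, not the SUP's actual async client setup:

```java
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

// Illustration of total vs. per-route pool limits; the numbers are arbitrary.
class PooledClientExample {
    static CloseableHttpClient build() {
        PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
        cm.setMaxTotal(100);          // overall cap across all routes
        cm.setDefaultMaxPerRoute(50); // cap per route (scheme + host + port)
        // If every request goes through a local envoy sidecar, the client may only
        // ever see a single route, so the per-route limit is effectively the
        // global one.
        return HttpClients.custom().setConnectionManager(cm).build();
    }
}
```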
[17:23:04] hmm, i suppose i would expect we want to configure the one at a higher level, and expect whatever is lower level to have enough to always service what the higher level asks for
[17:26:30] yes, for wdqs now we have a "high-level" setting that says: don't do more than X concurrent requests to MW and it tries to automatically "tune" the rest (mainly take into account the flink parallelism) to achieve best throughput given the constraint
[17:27:13] here the bypass and reordering guarantees make this a bit more difficult to tune
[17:29:42] random reading suggests can implement a global semaphore, but that seems overkill :P
[17:30:00] :)
[17:30:41] dinner
[17:32:24] pushed the http metrics changes but haven't opened a PR yet
[17:37:13] thx, pushed also quick experiment to have 2 fetch operators, unsure if that's the right approach tho...
[17:38:25] dcausse: looks reasonable at first glance (only read the commit message)
[17:39:07] it's poorly tested, sadly :(
[17:39:21] but unsure how to test this kind of changes
[17:45:08] yea not super clear, read over it and seems reasonable. Needs a couple minor tweaks but i can do that and see how it works
[17:46:33] thanks!
[17:49:36] going offline, enjoy the time off and take care!
[17:50:30] thx!
[18:33:39] hmm, updated consumer says: Caused by: java.lang.IllegalArgumentException: Event is not a page re-rerender event
[18:38:44] * ebernhardson notices we probably need to filter the streams before passing in
[18:38:51] lunch, back in ~40
[18:53:06] hmm, it already has a filter :S
[18:59:32] so the actual error is slightly different, that says "not a page re-render", but the actual check was == REB_BASED_UPDATE. It seems like perhaps the bypass stopped working
[19:11:00] * ebernhardson wonders why he can't convince the IT to fail in the same way when it clearly has multiple kinds of updates
[19:17:31] * ebernhardson suspects stored state
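
The per-stream filtering mentioned above would look roughly like the sketch below; the event model and change-type names are hypothetical, not the real cirrus classes:

```java
import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.streaming.api.datastream.DataStream;

// Hypothetical event model, only to illustrate splitting the input before the
// two fetch operators.
class RerenderRouting {
    enum ChangeType { PAGE_RERENDER, REV_BASED_UPDATE }

    interface UpdateEvent { ChangeType changeType(); }

    static DataStream<UpdateEvent> onlyRerenders(DataStream<UpdateEvent> updates) {
        // Keep only re-render events for the re-render fetch operator; any other
        // change type should take the normal-fetch/bypass branch instead of
        // tripping the "not a page re-render" check.
        FilterFunction<UpdateEvent> isRerender =
            e -> e.changeType() == ChangeType.PAGE_RERENDER;
        return updates.filter(isRerender);
    }
}
```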
[19:56:29] inflatador: easy patch for ya to review https://gerrit.wikimedia.org/r/c/operations/puppet/+/981599/ these were still listed as insetup
[19:57:40] I'm having some trouble with the prometheus poller getting a 500 from wdqs1015 (ldf endpoint). I pcapped good and bad responses here: https://phabricator.wikimedia.org/P54325 , but I can't tell the difference. I do see some warnings about malformed queries in wdqs-blazegraph.log
[19:58:11] Headed to my appointment now but if anyone has time to look, it would be appreciated. Will be back in ~90m
[20:07:25] looked, but indeed those two requests look identical :S
[20:16:49] that exception is sadly not very descriptive, the final stack trace comes from a re-throw. It prints the prior stack trace so maybe a better error is in the server logs
[20:19:36] blazegraph must emit a bunch of stuff, asking journalctl for the full history has taken more than 30s now :P
[20:20:37] only half a gig since nov 18, i guess not as insane as it could be
[20:22:45] no good ideas from those logs either :(
[20:24:03] its like it tries to print the NullPointerException, but there is no stack trace attached to it :S
[20:45:23] inflatador: I think I sent it somewhere, but can't find my message back :(
[20:45:54] if the check goes through varnish, there is a rewrite from /ldf -> /bigdata/ldf
[20:45:58] https://github.com/wikimedia/operations-puppet/blob/production/hieradata/common/profile/trafficserver/backend.yaml#L201
[21:01:32] ebernhardson: can I assist you with the dual fetcher?
[21:03:01] pfischer: i pushed a new patch that restored the CombinedCirrusDocEndpoint to the non-rerender side, it's currently working. I suppose i should have it start again without that to verify the wrong events only got there via the state
[21:03:22] the throughput is about the same as before though, 80-ish/s
[21:04:44] Is that still lower than the producer output rate?
[21:06:34] pfischer: should be lower, the producer is reporting mid 20's per second to the Sink:_Writer operator. But this is also still limited to testwiki, frwiki and itwiki, which is less than 10% of total events
[21:07:53] i'm not sure what our target throughput should be for a single instance, we can of course have flink parallelize some operations. Right now i'm poking a bit to guess where the current throughput limits are, it's still the rerender-fetch but it's not obvious what about it
[21:08:38] the current queue size / parallel request limit is high enough that it doesn't look to be getting limited there
[21:09:11] Well, as long as we do not see back pressure we should be fine.
[21:09:33] it's currently at full backpressure, with the rerender 100% busy :)
[21:11:06] oh i guess not 100% anymore, right now its showing 80% busy and 20% backpressure in rerender, with the sink reporting 26% busy
[21:11:18] perhaps more obvious now because it's split, before that was all one task
[21:12:38] Oh, so we see back pressure in the producer app in consuming the page_rerender topic?
[21:13:36] Ah, that's what you meant with rerenders (the split fetcher)
[21:13:54] no, the producer app can't backpressure from that. I mean in the flink ui where it draws the graph of the different pieces of the computation and labels them with busy and backpressure %'s
[21:14:04] at least, i don't think the producer can be affected by the consumers
[21:14:43] yea, the split fetcher caused flink to make three new logical splits of the graph, so instead of one rerender+sink thing, we now have separate things for rerender and normal fetch, and a third for the sink
[21:15:07] maybe logical split is the wrong word, i dunno what we should call it
[21:15:39] Routing? Forging?
[21:17:54] hmm, looks like flink docs call them subgraphs
[21:18:26] maybe
[21:18:57] or maybe subtask
[21:20:41] oh, so subgraph is the top level name, and every subgraph has one or more subtasks (based on partitioning) that do the actual work
[21:20:47] Would bulk-reqyes
[21:21:08] Ups
[21:21:56] Would bulk-requesting the MW API buy us throughput?
[21:24:40] hmm, maybe a little but probably not much. The actual doc building doesn't do parallel building and most of the cost is probably running the wikitext parser
[21:27:05] I'm also not sure though that the limit is waiting on mw-api responses, while when we had the queue limit of 25 flink managed to fill it up, right now with the limit at 100 it fluctuates up and down as more work is done
[21:27:19] it doesn't seem to cap out there at least
[21:27:44] maybe thats simply the limits of parsing mass amounts of json and it needs to be parallelized, seems plausible
[21:30:00] although, i guess our metrics only show a median doc size of 10kb, 800kb of json isn't really mass amounts
[21:31:32] Okay, what does the flink UI say about the parallelism of the operators? Or would they go up only if the parallelism is increased explicitly or if we come in from Kafka from multiple partitions?
[21:36:37] back
[21:37:45] I'm not sure about in flink, but in spark (which is generally similar-ish) we only get additional parallelization if explicitly shuffling, otherwise the source parallelism goes all the way through
[21:39:23] apparently in flink we can set explicit parallelism values on an operator, and they have some form of auto-scaling available as well
[21:41:12] and it looks like we use the explicit parallelism in rdf-streaming-updater, not yet in cirrus. Might be the next place to look
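
Building on the async-fetch sketch above, explicit per-operator parallelism (what rdf-streaming-updater already does) would look roughly like this; the operator name and factor are made up:

```java
import java.util.concurrent.TimeUnit;
import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.async.AsyncFunction;

// Hypothetical wiring: scale out just the re-render fetch operator instead of
// relying on the source parallelism flowing through the whole pipeline.
class ParallelRerenderFetch {
    static <IN, OUT> DataStream<OUT> wire(DataStream<IN> rerenders, AsyncFunction<IN, OUT> fetch) {
        return AsyncDataStream
            .unorderedWait(rerenders, fetch, 30, TimeUnit.SECONDS, 100)
            .setParallelism(4)       // run 4 subtasks of this operator only
            .name("rerender-fetch"); // label shown in the Flink UI graph
    }
}
```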
[21:52:57] ryankemper ACK, just +1'd
[21:55:41] heh, consumer fell over with an error from elasticsearch: 413 Request Entity Too Large
[21:55:45] guess we need to limit that a little more
[22:08:14] hmm, no the default size should be 5MB, and i'm not seeing us set it anywhere :S
[22:09:01] i guess it's not clear if that really came from elastic, maybe envoy?
[22:09:38] I have a theory about the LDF stuff, BTW...if it has to maintain state and it's getting a bunch of requests at once from a poller, maybe it's getting confused?
[22:11:14] inflatador: hmm, maybe you could use something like `ab` to fire a bunch of requests in parallel to see if bursts of load cause the failure?
[22:11:25] although in theory ldf should be super cheap
[22:13:30] ebernhardson yeah, it's really strange. There's plenty of legit traffic hitting ldf at the same time and getting 200s
[22:15:37] Digging thru logstash ATM https://logstash.wikimedia.org/app/dashboards#/view/f3e709c0-a5f8-11ec-bf8e-43f1807d5bc2?_g=(filters:!((query:(match_phrase:(service.name:http_query_wikidata_org_ldf_ip4)))))&_a=h@88fc22a
[22:18:39] gehel sorry I haven't responded till now...the check doesn't go through varnish, it hits wdqs1015 directly
[22:19:08] also, go enjoy your weekend ;P
[22:19:28] hmm, curiously the tls-proxy logs for consumer-search only mention the mw api, not seeing any bulk's even though they are obviously going through
[22:20:29] oh, no they don't. We didn't send relforge through envoy because it isn't defined as a service. So that error had to come from elastic
[22:31:42] restarted consumer, instead of failing with the 413 error, it instead dies with: Caused by: java.lang.OutOfMemoryError: Java heap space
[22:31:44] fun :P
[22:35:12] school run, but will try and get it at least running when i get back
[23:03:11] back
[23:05:02] * ebernhardson also wonders if we should fix the elasticsearch sink so a 419 entity too large to a bulk endpoint splits up the bulk in two
[23:05:12] s/419/413/
[23:08:46] as for whats actually failing, nginx is the one rejecting it. It's saying that the body was 128,387,425 bytes
[23:13:39] oh...it looks like the elasticsearch sink, when estimating the amount of bytes added to a bulk by a scripted UpdateRequest, adds `sizeInBytes += request.script().getIdOrCode().length() * 2;`
[23:14:00] but it doesn't count the size of the parameters
[23:17:23] hmm, thats not even the sink, thats the elasticsearch client library :S
[23:17:38] ouch
[23:18:34] Hm, we could add a safety margin by specifying a max bulk size (in bytes) that is slightly lower than the actual maximum
[23:19:38] pfischer: not sure yet, the default limit was 5MB and it tried to send a 120MB+ bulk, i suspect it wouldn't be discriminative enough
[23:20:02] not sure i'll have time to figure this out today, i guess fun for you next week :)
[23:21:52] i guess can set it very low just to see if things get moving
[23:36:06] Hm, so far we only specify a limit in terms of the number of actions (1000) per bulk, but if 5mb is another limit, we can specify this.
[23:39:20] pfischer: tracing through it looks like 5MB is the default when not set, i added a patch to set it to 1MB
[23:39:22] BTW: http metrics PR is rebased. I explicitly set the overall maximum for the connection pool now and the default route maximum is half of that. I wonder if apache http uses the resolved route (local envoy proxy) or the original route (wiki domain) to associate the limit.
[23:40:03] from BulkProcessor.Builder
[23:40:33] sounds reasonable
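
The two flush limits being discussed (max actions, max estimated size) sit on the sink builder; a sketch assuming the Flink connector's Elasticsearch7SinkBuilder, with a made-up host and numbers, which may not match the actual cirrus wiring:

```java
import org.apache.flink.connector.elasticsearch.sink.Elasticsearch7SinkBuilder;
import org.apache.flink.connector.elasticsearch.sink.ElasticsearchEmitter;
import org.apache.flink.connector.elasticsearch.sink.ElasticsearchSink;
import org.apache.http.HttpHost;

// Illustration of where the bulk flush limits live; host and values are made up.
class SinkFlushLimits {
    static <T> ElasticsearchSink<T> build(ElasticsearchEmitter<T> emitter) {
        return new Elasticsearch7SinkBuilder<T>()
            .setBulkFlushMaxActions(1000) // flush after this many actions
            // Flush after roughly this many MB of *estimated* payload; since the
            // estimate for scripted updates ignores script parameters (see below),
            // the real body can still blow far past this limit.
            .setBulkFlushMaxSizeMb(1)
            .setHosts(new HttpHost("relforge.example.org", 9243, "https"))
            .setEmitter(emitter)
            .build();
    }
}
```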
[23:42:18] Okay so it seems we (ab-)use script parameters to an extent that the authors of ES did not anticipate; If necessary, I’ll propose another upstream patch to count in the script parameters too.
[23:44:42] naively, since a bulk request is always 2 lines of text, i guess i hoped they would convert them into bytes when added and figure it out that way
[23:44:53] BulkRequest.estimateSize() looks at the rendered source of the underlying actions
[23:46:37] All the BulkRequest.internalAdd methods do that calculation.
[23:47:05] pfischer: hmm, my local copy isn't seeing any reference to estimateSize in BulkRequest
[23:47:47] ebernhardson: estimatedSizeInBytes(), my bad
[23:48:20] That looks up sizeInBytes which in turn is incremented by the aforementioned internalAdd() methods.
[23:49:28] pfischer: ahh, right. I would have to do some testing to verify, but for internalAdd(UpdateRequest) it adds request.doc.source().length(), and request.upsertRequest().source().length(), i expect doc is not available, but i suppose upsertRequest should have the content
[23:50:16] but that would imply the estimate should only be off by about 50%, which doesn't match what we are seeing. I'm probably missing something still
[23:51:15] oh, rerender's dont get an upsert, and those are the majority
[23:51:37] Hm, question is: does the flink sink writer use the same estimates to trigger a flush
[23:52:22] pfischer: i suspect flush either happens due to checkpointing, or when one of those two limits (# actions, est action size in bytes) is reached
[23:52:46] but not completely sure
[23:55:41] Yes, I meant: The ElasticsearchSinkWriter does use an intermediate object called “requestIndexer” that accumulates the actions. At some point either limit is hit and only then a bulk request is created and passed to the bulk processor.
[23:56:12] That’s why I’m wondering how the writer concludes that the size limit has been reached.
[23:57:01] oh curious, i hadn't noticed that bit
[23:58:11] Might be that it uses an instance of BulkRequest under the hood that is incrementally built up by passing in actions via the internalAdd methods. In that case it should share the size estimation logic.
[23:58:41] The script size estimation is definitely faulty though, at least for the way we use it.
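
A rough sketch of a parameter-aware size estimate, to illustrate the gap rather than to propose the actual upstream fix (method names from the ES client as discussed above; the parameter counting is a crude approximation):

```java
import java.util.Map;
import org.elasticsearch.action.update.UpdateRequest;
import org.elasticsearch.script.Script;

// Estimates an UpdateRequest's bulk contribution, including script parameters,
// which the client's own estimate (quoted above) leaves out.
class UpdateRequestSizeEstimator {
    static long estimate(UpdateRequest request) {
        long size = 0;
        if (request.doc() != null && request.doc().source() != null) {
            size += request.doc().source().length();
        }
        if (request.upsertRequest() != null && request.upsertRequest().source() != null) {
            size += request.upsertRequest().source().length();
        }
        Script script = request.script();
        if (script != null) {
            size += script.getIdOrCode().length() * 2L;
            // Crude stand-in for the missing piece: count the parameter keys and
            // the string form of their values, since the params carry most of the
            // payload for scripted (re-render) updates.
            for (Map.Entry<String, Object> e : script.getParams().entrySet()) {
                size += e.getKey().length();
                Object value = e.getValue();
                size += value == null ? 0 : value.toString().length();
            }
        }
        return size;
    }
}
```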