[03:40:55] dr0ptp4kt: (not urgent) came across https://news.ycombinator.com/item?id=39907468 today. looks like they implement it with langchain, which I remember you playing around with during the offsite hackathon. check out their implementation here: https://github.com/princeton-nlp/SWE-agent/blob/main/config/default.yaml pretty cool stuff
[03:42:01] I like how many times they stress that the indentation needs to be correct. You can tell that the pesky LLM kept messing that up repeatedly haha
[08:40:49] gehel: I can’t make it to our 1:1 today, I proposed an alternative time slot
[08:56:20] pfischer: I moved it to tomorrow, but did not see your proposal before I moved. Hopefully the time works, otherwise move it again!
[09:57:33] lunch + errand
[11:05:16] dcausse: I just noticed that we run the cloudelastic SUP with 2 replicas but we do not use a keyed stream in the consumer. IIUC we risk losing the order of updates (within the scope of a page) since updates are shuffle-distributed among the two ES sinks, not based on keys (wiki ID + page ID).
[13:06:39] pfischer: I see a keyBy right before the fetchOperator
[13:08:37] unless we shuffle between the source and this operator we should be good, but I haven't checked closely
[13:20:45] hm... perhaps split-by-change-type is shuffling things?
[13:22:24] Okay, I was not sure if keyed streams stay keyed after being processed by a function; I’ll debug the graph.
[13:23:17] o/
[13:23:50] dcausse: SUP just failed due to failing name resolution: UnknownHostException flink-zk1001.eqiad.wmnet
[13:24:08] pfischer: there's an incident in progress
[13:24:28] see #wikimedia-operations
[13:25:18] dcausse: thanks!
[13:28:00] ryankemper: thanks! neat. and also funny. in other news I found that there's community langchain code for sparql. it's different from Hal's approach in some ways, and it needs more work for our stuff. but, after we get through adjusting the flink updates, automating the pipelines, and arranging the server split techniques for the graph split, i want to swing back around to that, as it is even more promising than the converged gemini and openai models.
[14:13:10] pfischer: did the SUP recover from the incident?
[14:13:55] I did restart it
[14:14:26] But forgot to check the Jobrunner logs
[14:16:37] Ah, looks like it recovered by itself; the restart was not necessary
[14:18:08] nice!
[15:56:08] workout, back in ~40
[17:02:18] sorry, been back
[17:45:48] dinner
[18:02:29] lunch, back in ~40
[18:42:36] back
[18:44:34] does anyone know if the glent indices are in scope for moving off the main cluster?
[18:45:07] I guess so, since apifeatureusage is
[18:45:24] inflatador: those should stay, they are ours
[18:45:27] they are used in prod search
[18:45:39] ebernhardson: ACK, forget I said anything then ;)
[18:50:04] does anyone know what the translatewiki indices are called?
[18:50:35] inflatador: ttmserver, but those aren't translatewiki exactly. They are the Translate extension.
[18:51:20] that could go either way... maybe moving it off would be reasonable
[18:53:01] Yeah, I don't know enough to have strong feelings
[20:00:00] random crazy idea: implement reindexing in airflow. Except it doesn't have access to mwscript or helmfile
[20:34:58] run airflow on the mwmaint servers ;P ?
[20:57:02] hmm, backfills are triggering OOM in the consumer
[21:00:07] could increasing parallelism help? Or do we just need bigger containers?
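A minimal Flink sketch of the ordering question above. This is not the SUP code itself; the UpdateEvent type, key format, and operators are invented for illustration. It shows how a keyBy on (wiki ID, page ID) placed before the fetch step keeps all updates for a page on one parallel subtask even with two replicas, and why a shuffle inserted after the keyBy would lose that ordering guarantee:

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class KeyedOrderingSketch {

    // Hypothetical event type standing in for a SUP update.
    public static class UpdateEvent {
        public String wikiId;
        public long pageId;
        public String changeType;

        public UpdateEvent() {}

        public UpdateEvent(String wikiId, long pageId, String changeType) {
            this.wikiId = wikiId;
            this.pageId = pageId;
            this.changeType = changeType;
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(2);

        DataStream<UpdateEvent> updates = env.fromElements(
                new UpdateEvent("enwiki", 42L, "revision"),
                new UpdateEvent("enwiki", 42L, "rerender"),
                new UpdateEvent("dewiki", 7L, "revision"));

        // Keying by (wiki ID, page ID) routes every update for a given page to the
        // same parallel subtask, so per-page ordering survives parallelism > 1.
        // Any rebalance()/shuffle() inserted after the keyBy would redistribute
        // records across subtasks and break that guarantee.
        updates
                .keyBy(new KeySelector<UpdateEvent, String>() {
                    @Override
                    public String getKey(UpdateEvent e) {
                        return e.wikiId + ":" + e.pageId;
                    }
                })
                .map(new MapFunction<UpdateEvent, String>() {
                    @Override
                    public String map(UpdateEvent e) {
                        return e.wikiId + "/" + e.pageId + " -> " + e.changeType;
                    }
                })
                .print();

        env.execute("keyed-ordering-sketch");
    }
}
```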
[21:00:57] i'm not sure, i suppose my suspicion would be that large batches of output bulk requests are being represented multiple times in memory
[21:03:41] oh, i bet it's because peter was still experimenting, so the working values are only in the live deploy and not in the helmfile yet
[21:04:05] we can get those thrown in the helmfile
[21:04:18] yea, i'll move them over
[21:07:48] I remember we did have that problem with rdf-streaming-updater when we had to do a really long backfill. David ended up doing the backfill in YARN
[21:32:37] ebernhardson: yes, sorry, I was trying to optimize a few parameters. I can persist them in the helm value files; it looks like it's running stably right now.
[21:32:50] pfischer: it's no worry, i'm writing a commit message for it now
[21:39:18] The cloudelastic release is running at full blast at the moment (5 replicas to match the kafka partitions). According to the lag recovery estimation, it would still take 2 to 6 days to catch up. I wonder why one partition lags behind 3 times as much as the remaining 4
[21:40:32] oh, i hadn't realized. will take a look
[21:41:00] https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?forceLogin&from=now-30m&orgId=1&to=now&var-datasource=eqiad%20prometheus%2Fk8s&var-flink_job_name=cirrus_streaming_updater_consumer_cloudelastic_eqiad&var-helm_release=consumer-cloudelastic&var-namespace=cirrus-streaming-updater&var-operator_name=All
[21:43:06] pfischer: oh, that's because chi is the "big" cluster. All the indices are bigger, and indexing into a bigger index takes longer
[21:44:39] Oh, okay. Good to know. I noticed the higher response times, but that correlated with the larger payloads, so I didn’t suspect any other cause.
[21:45:29] For chi we max out the configured max batch size of ~25 MB
[21:47:30] looks like it's all staying pretty busy, sink at ~80% and fetch at ~100%
[21:48:28] would be interesting to get flame graphs or some other profiling of the fetch operator, see what's so busy about it
[21:48:58] I also reduced the connection pool (and queue size) for the fetch operator at the same time. What’s curious: the back pressure is now caused by fetch (not fetch_rerender) and no longer by the ES sink. However, the smaller connection pool for fetch is only <20% used
[21:49:05] UI says "The flame graph feature is currently disabled (enable it by setting rest.flamegraph.enabled: true)", I guess i should look into that :)
[21:51:22] Busy does not mean the JVM is doing work. If we configure a max queue size and it’s full, flink considers the operator busy even though it does not accept new records/events/updates
[21:52:38] i suppose if the normal fetch is full, that could indicate its queue size for handling out-of-order events isn't big enough
[21:52:48] can we see that queue size?
[21:52:59] We’ve seen that before: if too many events are waiting to be re-fetched (due to an error), the queue remains full but not all connections are in use
[21:53:55] The out-of-order fetch (fetch_rerender) is fine, probably because out-of-order processing is acceptable there.
[21:54:01] flink flame graphs are experimental, they suggest not running them in prod. https://sap1ens.com/blog/2023/06/04/profiling-apache-flink-applications-with-async-profiler/ is an interesting option
[21:54:26] but indeed it might not be as obvious if it's all waiting rather than cpu busy
[21:56:09] I’ll change the connection-pool/queue-size ratio in favour of the fetch.
[21:56:44] Maybe coupling queue-size and connection-pool-size does not make that much sense.
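For context on the "busy but connections mostly idle" observation above, here is a generic Flink async I/O sketch, assuming the fetch operator is built on something like AsyncDataStream (the FakeFetch function, key strings, and all numbers are invented). The capacity argument is the in-flight queue that makes the operator report busy/back-pressured once it fills up, regardless of how many HTTP connections are actually in use:

```java
import java.util.Collections;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

public class AsyncCapacitySketch {

    // Stand-in for an async fetch against an external service.
    public static class FakeFetch extends RichAsyncFunction<String, String> {
        @Override
        public void asyncInvoke(String pageKey, ResultFuture<String> resultFuture) {
            CompletableFuture
                    .supplyAsync(() -> pageKey + ":fetched")
                    .thenAccept(doc -> resultFuture.complete(Collections.singleton(doc)));
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<String> pageKeys = env.fromElements("enwiki:42", "dewiki:7", "frwiki:9");

        // The last argument is the in-flight request capacity (the operator's queue).
        // Once that many requests are pending, the operator stops pulling input and
        // Flink reports it as busy / back-pressured, even if most of the underlying
        // HTTP connection pool is idle. orderedWait preserves input order;
        // unorderedWait emits results as they complete, which is what makes an
        // out-of-order fetch (like a rerender fetch) cheaper to keep flowing.
        AsyncDataStream
                .orderedWait(pageKeys, new FakeFetch(), 5_000, TimeUnit.MILLISECONDS, 10)
                .print();

        env.execute("async-capacity-sketch");
    }
}
```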
[22:00:23] yea, i'm thinking that might be the case; queue size perhaps depends on the rate of ingestion and the time it takes to complete a task
[22:03:59] far from scientific, but i reloaded the thread dumps for a few taskmanagers and the thread for the fetch operator is always parked inside addToWorkQueue(), suggesting capacity issues
[22:05:56] I also don't entirely know what it means, but taskmanager cpu usage reports plenty of throttling
[22:07:46] Sounds like we’re using more than what’s requested?
[22:08:36] probably, but not sure where that's set
[22:08:52] * pfischer is frustrated that `helmfile --set "key=0.4"` becomes `key: "0.4"`
[22:11:16] ebernhardson: app.taskManager.resource.cpu
[22:44:18] * pfischer just found out that `rev-based-fetch-retry-queue-capacity-ratio` cannot be overridden; the config code does not read it
[22:45:20] I’ll fix that tomorrow, probably decoupling queue-size and connection-pool-size as I go
[22:45:46] pfischer: sounds good, get some sleep!
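A rough sketch of the queue-sizing intuition from 22:00:23, with made-up numbers: if queue size is decoupled from the connection-pool size, it could be derived from the ingestion rate and the per-fetch latency (Little's law) plus some headroom, rather than from the number of connections:

```java
// Back-of-the-envelope sizing for an async fetch queue:
// capacity needed ≈ ingestion rate × time to complete a task, plus headroom
// for bursts and re-fetches. All numbers here are hypothetical.
public class QueueSizingSketch {
    public static void main(String[] args) {
        double updatesPerSecond = 400.0; // hypothetical ingestion rate into the fetch operator
        double fetchSeconds = 0.25;      // hypothetical average time per fetch
        double headroom = 2.0;           // safety factor for bursts and retries

        long capacity = (long) Math.ceil(updatesPerSecond * fetchSeconds * headroom);
        System.out.println("suggested fetch queue capacity: " + capacity); // prints 200
    }
}
```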