[07:04:10] o/
[07:04:48] This Friday I'm taking care of the kids. I'm around, but a bit spotty late morning/mid-afternoon. I'm online in the evening CEST.
[08:31:00] o/
[08:48:51] Just came across T383568
[08:48:51] T383568: wikidatawiki dump never started for 20250101 - https://phabricator.wikimedia.org/T383568
[08:49:02] does this impact us?
[08:56:14] hm... we use the RDF dumps, so hopefully not
[08:57:08] if they fail, it's our import_wikidata_ttl DAG that might start failing
[09:13:36] dcausse ack. Thanks for clarifying!
[09:21:27] dcausse I'm merging MR1032 and will restart mjolnir. But IMHO you hit the nail on the head with skein possibly ignoring the memoryOverhead setting.
[09:21:51] gmodena: ack
[09:22:01] this would explain why the job did not hit a memory ceiling in cluster mode
[09:22:05] yeah... curious that it did not hit us somewhere else...
[09:22:14] ah right
[09:22:17] the driver would be allocated a dedicated YARN container
[09:22:21] not a skein proxy
[09:22:25] yes, makes sense
[09:27:06] I vaguely remember brouberol (or was it btullis?) saying something about memory being configured in KiB somewhere and in kB in other places, leading to confusion about actual memory requirements. Might be related?
[09:27:37] I think that was joal
[09:30:02] Oh, that might be!
[09:34:23] gehel yep. That was indeed an issue. We rolled out a fix yesterday, and the behaviour (memory allocation) is now consistent across skein and Spark
[09:34:39] Ok, so still something else :(
[09:34:42] but the offending job still fails with out-of-memory issues
[09:34:55] Hi folks - sorry to be late to the party
[09:35:20] I don't have a historical view of what's been said before - could you please give me a summary?
[09:35:28] as a workaround, we bumped the driver memory
[09:36:02] https://wm-bot.wmflabs.org/libera_logs/%23wikimedia-search/20250117.txt
[09:36:33] joal we suspect skein is ignoring the memoryOverhead setting when allocating container resources. It sizes the container from spark.driver.memory alone.
[09:36:36] Thanks gehel
[09:36:52] I think you're absolutely right gmodena
[09:37:13] kudos to dcausse, he spotted it :D
[09:37:32] joal when the job used to run in cluster mode this was a non-issue. The driver would run in its own YARN container
[09:37:53] that's right
[09:38:06] I think the easiest solution for this case is to manually override the skein memory setting
[09:38:14] but in client mode, we proxy via skein and may be hitting the memory ceiling because of the unaccounted-for memory
[09:38:23] This is possible in Airflow IIRC
[09:38:26] joal ack
[09:39:27] I think it's overkill to make the skein hook recompute memory settings for driver+overhead, as only a very small minority of jobs use it (we use it regularly for executors, not for drivers)
[09:40:41] patch incoming
[09:57:24] all this fiddling with memory settings made me appreciate the snapshot testing approach in airflow-dags
[09:58:28] it can make for pretty big patches, but it's definitely nice to see the impact :)
[10:13:35] dcausse joal when you have a moment: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1034
[10:16:13] +1
[10:17:13] merging & testing
[10:18:25] brouberol the approval + autodeploy workflow is such a quality-of-life improvement. Thank you <3
[10:18:41] <3<3
[10:21:50] gmodena: note that pre-k8s this task had issues, so I would not be surprised if it fails again, but hopefully for different reasons this time, i.e. what I'm hoping to see is some memory pressure warnings on the Spark driver
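(A minimal sketch of the container-sizing arithmetic being discussed, in Python. The memory values are made up; the 10% / 384 MiB overhead default is Spark's documented cluster-mode behaviour, and the client-mode case only illustrates the suspected skein behaviour, not the actual hook code in airflow-dags.)

```python
# Sketch of the driver container sizing discussed above (assumed values).
def yarn_container_request_mb(driver_memory_mb, memory_overhead_mb=None):
    """What YARN is asked to reserve for the driver in cluster mode:
    spark.driver.memory plus spark.driver.memoryOverhead, the latter
    defaulting to 10% of the driver memory with a 384 MiB floor."""
    if memory_overhead_mb is None:
        memory_overhead_mb = max(384, int(driver_memory_mb * 0.10))
    return driver_memory_mb + memory_overhead_mb

driver_memory_mb = 4096      # hypothetical spark.driver.memory
memory_overhead_mb = 1024    # hypothetical spark.driver.memoryOverhead

# Cluster mode: the dedicated driver container gets memory + overhead.
print(yarn_container_request_mb(driver_memory_mb, memory_overhead_mb))  # 5120

# Suspected client-mode behaviour: the skein container is sized from
# spark.driver.memory alone, so the 1024 MiB overhead is never reserved
# and the driver can bump into the container's memory ceiling.
print(driver_memory_mb)  # 4096
```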
[10:22:03] fail or hang forever
[10:22:21] dcausse ack
[10:22:46] dcausse I reverted the driver memory increase to have a baseline
[10:22:56] yes, good idea
[10:23:02] now we should run with the same settings as we had in cluster mode
[10:23:09] on to the next error :D
[10:23:14] :\
[10:23:59] dcausse hopefully that becomes just a matter of changing the airflow pool
[10:24:20] sure
[10:24:49] so that COALESCE trick did not have the effect on file sizes we were hoping for...
[10:25:04] :(
[10:25:23] dcausse how bad is the file size increase, in terms of impact?
[10:25:28] will continue discussing it with Joseph
[10:26:01] gmodena: not sure that's a big deal... but since we were having space issues I just wondered...
[10:26:30] and was curious why simply switching engines could lead to files jumping from 170M to 240M
[10:27:56] I thought about changes to shuffle metadata generation (which at these small sizes is relatively significant), but moving from REPARTITION to COALESCE I'd expect it to go down
[10:28:11] restarted mjolnir with the skein change
[10:28:15] errand+lunch
[10:29:34] ++
[10:29:38] dcausse joal really curious to learn what you find out with the file sizes thing!
[10:29:47] sure
[12:25:12] I'm gonna do some proper testing on this, and will report :)
[13:30:13] status update published: https://wikitech.wikimedia.org/wiki/Search_Platform/Weekly_Updates/2025-01-17 (and on Slack)
[13:38:24] dcausse gehel the last two skein patches did the trick. The feature_selection mjolnir task completed, and the DAG is chugging along
[13:38:37] gmodena: nice!
[13:42:36] and a bit of grep-foo tells me the offending task was the only one in the search instance that was tuning the driver memory overhead
[13:43:17] executor memory overhead settings are managed directly by YARN, so we should be good
[14:05:01] gmodena: \o/
[14:12:35] o/
[14:20:52] o/
[15:27:48] alright, I think I managed to create a baseline for training MLR with instance weighting, basically by joining mjolnir.feature_vectors and mjolnir.labeled_query_page
[15:28:45] what's a bit tricky is that for LTR tasks, xgboost requires assigning a weight to a (query) group, not to a single instance
[15:30:14] meaning we can't select a cutoff "easy query value" for a single (query, page)
[15:30:36] oh right...
[15:31:16] I need to dig a bit and understand what this means in practice (= what the data looks like). But at least now I can run the whole pipeline end-to-end
[15:31:38] perhaps grouping per query and keeping the best "easy query value" first?
[15:31:58] dcausse that definitely seems a sensible start
[15:32:45] I'm afraid we'll end up with another hyperparameter re: which aggregation function to use in LTR :)
[15:33:30] possible :)
[15:35:15] but I think overall there might be optimizations to look into; I can't remember all the parameters we run in hyperopt, but perhaps not all of them are actually useful. Same for features, I'm sure we could trim them down a bit
[15:36:48] good point
[15:37:03] tbh so far I've only had a cursory look at the feature importance stats
[15:37:34] it would be great to at least document the process and the generated metrics
[16:41:16] yes agreed
[16:41:34] heading out, have a nice weekend
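(For the REPARTITION vs COALESCE file-size question above, a generic PySpark illustration of the difference between the two; the table name, output paths, and partition count are hypothetical, not the actual mjolnir job.)

```python
# Generic PySpark illustration of REPARTITION vs COALESCE before a write;
# the table name, output paths, and partition count are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.table("example_db.example_table")

# repartition() performs a full shuffle: rows are redistributed across the
# requested number of partitions, which changes row ordering and therefore
# how well columnar formats such as Parquet encode and compress each file.
df.repartition(10).write.mode("overwrite").parquet("/tmp/out_repartition")

# coalesce() only merges existing partitions without a shuffle: upstream row
# ordering is preserved, so with the same row count the output files can end
# up noticeably larger or smaller than the repartitioned ones.
df.coalesce(10).write.mode("overwrite").parquet("/tmp/out_coalesce")
```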
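(And for the LTR weighting discussion: a rough sketch of the "one weight per query group" idea in xgboost, aggregating a per-instance value with max as floated above. The columns query_id, label, and easy_query_value are hypothetical stand-ins for the joined mjolnir.feature_vectors / mjolnir.labeled_query_page data; in xgboost ranking tasks a weight applies to a whole group, which is why the aggregation function becomes another tunable choice.)

```python
# Rough sketch: one weight per query group for an xgboost LTR run.
# Column names (query_id, label, easy_query_value) are hypothetical stand-ins
# for the joined mjolnir.feature_vectors / mjolnir.labeled_query_page data.
import pandas as pd
import xgboost as xgb

df = pd.read_parquet("training_data.parquet")   # hypothetical training set
df = df.sort_values("query_id")                 # rows must be contiguous per group

feature_cols = [c for c in df.columns
                if c not in ("query_id", "label", "easy_query_value")]

# Group sizes tell xgboost where each query's documents start and end.
group_sizes = df.groupby("query_id", sort=False).size().to_numpy()

# One weight per group: here the max per-instance "easy query" value, but the
# aggregation (max / min / mean) is itself a tunable choice.
group_weights = (df.groupby("query_id", sort=False)["easy_query_value"]
                   .max()
                   .to_numpy())

dtrain = xgb.DMatrix(df[feature_cols], label=df["label"])
dtrain.set_group(group_sizes)      # per-query group sizes
dtrain.set_weight(group_weights)   # for ranking, one weight per group

booster = xgb.train({"objective": "rank:ndcg", "eta": 0.1},
                    dtrain, num_boost_round=100)
```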