[10:22:44] lunch
[12:44:56] o/
[12:52:13] ebernhardson: deployed your patch for process_sparql_query_hourly
[12:52:28] still seeing oom :/
[12:53:00] might perhaps be related to the parsing itself or the size of its output?
[12:55:34] 2024-07-12T08:00:00 is the run I use to test
[13:44:08] \o
[13:44:30] dcausse: :( i ran one of the failing hours in a notebook and it passed, was hoping it would transfer over
[13:44:42] o/
[13:44:55] dcausse: oh, i guess i changed to 4 cores and 16gb of memory per executor as well
[13:45:00] ah perhaps because of some settings it uses?
[13:45:03] instead of 8/16
[13:46:23] here it's --executor-cores 1 / --executor-memory 16g / --input-partitions 4
[13:46:44] hmm, i had 4 cores / 16gb memory / 16 input partitions
[13:47:01] i suppose i only ran the first one though, july 12th 07:00
[13:47:13] the stack says it's when writing but unsure if that's accurate
[13:47:45] application_1719935448343_542545 should be one that failed this morning
[13:47:58] hmm, failing while doing the final write is curious. At that point it should just be streaming data from shuffle servers to hdfs
[13:49:11] but yea, it says task 0 in stage 2, and stage 2 was the output
[13:50:05] maybe some giant query_info field?
[13:52:09] hmm, it would have to be really out there. i suppose usually here i would try boosting memory overhead
[13:52:29] with the theory that writing large files takes a lot of off-heap memory for something
[13:53:00] or can increase output partitions?
[13:54:06] sure
[13:57:33] yea we could probably even do 10 output partitions, i'm seeing hourly outputs of 2G+
[13:58:56] but really... writing a 2g file shouldn't blow out 16g of memory :S
[13:59:08] :/
[14:01:41] oh i also just realized that i had disabled adaptive optimizations (spark.sql.adaptive.enabled=false) and codegen (spark.sql.codegen=false) in my test runs, but had moved on from that idea and didn't turn them back on. Can do some testing and see if those had much effect
[14:02:54] i had been trying to convince spark to note move the isnotnull() call
[14:03:04] s/note/not/
[14:16:14] interesting, managed to fail my notebook after disabling those optimizations. will try one at a time
[14:17:37] annoyingly, it takes 15 minutes per test :P
[14:42:10] hmm, it worked with adaptive query execution disabled. But that doesn't make any sense :S
[14:42:32] maybe they've added more, but from the docs: as of Spark 3.0, there are three major features in AQE: coalescing post-shuffle partitions, converting sort-merge join to broadcast join, and skew join optimization
[14:42:53] i guess maybe the post-shuffle partition coalescing is doing something it shouldn't
[14:43:06] but we already go to 1 partition, so i wouldn't expect that to do much
[15:08:07] * inflatador joined the wrong mtg ;(
[16:01:39] errand, back in ~20
[16:20:06] back
[18:01:22] lunch, back in ~40
[18:51:49] * ebernhardson wishes more UIs had a 'not' choice... in airflow i don't want to see only queued, or only running, i want to see everything that is not success
[20:12:23] workout, back in ~40
[20:42:37] inflatador: might be 10' late to pairing, lunch running a lil late
[20:56:05] back
[20:56:05] ryankemper ACK, hit me up whenever
[22:20:53] meh, tiny containers are annoying. We don't have curl, wget, or even python in the flink containers, was hoping to test bare http connections to mw private apis without deploying the full app
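
Editor's note: the two executor configurations compared above (1 core / 16g / 4 input partitions vs. 4 cores / 16g / 16 input partitions), plus the memory-overhead bump floated as a mitigation, map onto standard Spark settings roughly as in the sketch below. This is not the job's actual setup: the app name and the 4g overhead figure are illustrative assumptions, and --input-partitions is a flag of process_sparql_query_hourly itself rather than a Spark option.

# Minimal sketch, not the actual job config: the executor sizing discussed
# above expressed as PySpark session settings.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("process_sparql_query_hourly-test")  # illustrative name
    .config("spark.executor.cores", "4")          # notebook run used 4 cores; the failing prod run used 1
    .config("spark.executor.memory", "16g")       # both runs used 16g of heap per executor
    # theory from the discussion: large output writes use off-heap memory,
    # so raise the overhead; 4g is an assumed example value, not a tested one
    .config("spark.executor.memoryOverhead", "4g")
    .getOrCreate()
)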
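
The "increase output partitions" idea (roughly 10 partitions for a 2G+ hourly output, so no single task writes the whole file) could look like the following sketch. result_df and the output path are placeholders, and the real job may write a different format or use dynamic partition overwrite.

# Sketch only: spread the final write across ~10 tasks instead of one,
# so a single executor never handles the whole 2G+ hourly output.
(
    result_df                                    # placeholder for the job's final DataFrame
    .repartition(10)                             # ~10 output files instead of 1
    .write
    .mode("overwrite")
    .parquet("hdfs:///path/to/hourly/output")    # placeholder path and format
)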
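
On the AQE angle: rather than disabling adaptive execution wholesale (spark.sql.adaptive.enabled=false, as in the test runs), Spark 3.0+ exposes the post-shuffle coalescing feature suspected above as its own switch. A sketch of the relevant knobs follows, assuming an existing SparkSession named spark; the skew-join line is only there to show the third feature named in the docs quote.

# Sketch of the AQE toggles discussed above (Spark 3.0+), on an existing session.
spark.conf.set("spark.sql.adaptive.enabled", "false")  # blunt switch: what the notebook test flipped

# Narrower alternative: keep AQE but turn off only post-shuffle partition
# coalescing, the feature suspected of folding the output into one huge write task.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "false")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")  # skew-join optimization, not implicated here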