[07:03:58] o/
[07:07:13] oof mjolnir's feature_vectors-norm_query-20180215-query_explorer
[07:07:14] is stuck again with a (seemingly) idle driver (https://yarn.wikimedia.org/proxy/application_1734703658237_856259/)
[07:28:09] and it failed last night around; the dangling app is a retry.
[08:19:00] popping out for a dentist appointment in 15 min
[08:36:20] o/
[08:39:58] thanks to your patch we can now see: "expected final status was 'SUCCEEDED', but got 'FAILED' instead"
[08:45:51] Diagnostics message: Max number of executor failures (818) reached
[08:51:09] seeing things like: 25/01/15 22:42:08 WARN YarnAllocator: Container from a bad node: container_e131_1734703658237_848200_01_000276 on host: analytics1076.eqiad.wmnet. Exit status: -1000. Diagnostics: [2025-01-15 22:42:05.154]java.io.IOException: Resource hdfs://analytics-hadoop/wmf/refinery/current/artifacts/refinery-hive-shaded.jar changed on src filesystem (expected 1733863769794, was
[08:51:11] 1736951952432
[09:39:39] apparently caused by a deploy of refinery, so expected; we might want to use airflow-deployed artifacts instead of relying on this current jar deployed by DE
[09:52:45] dcausse yep. The jar thing is not dag-specific
[09:52:54] +1 for airflow-deployed artifacts
[09:54:18] the driver currently seems stuck around this execution point: https://gitlab.wikimedia.org/repos/search-platform/mjolnir/-/blob/main/mjolnir/kafka/client.py?ref_type=heads#L116
[09:55:45] and fwiw, don't know if you saw it, but task 15 failed with a different error: py4j died during a df_features.cache().count()
[09:55:53] File "/var/lib/hadoop/data/k/yarn/local/usercache/analytics-search/appcache/application_1734703658237_848188/container_e131_1734703658237_848188_01_000001/venv/lib/python3.10/site-packages/mjolnir/transformations/feature_vectors.py",
[09:55:53] line 42,
[09:56:29] so... multiple things going on :(
[09:56:46] dcausse quick chat after the unmeeting?
[09:57:05] gmodena: sure!
[09:57:08] i'd also like to prep a patch for the drop_data task today, so we can speed up testing next week
[10:06:03] +1
[11:06:00] lunch+errand
[14:14:10] I claimed and opened an MR for T383870
[14:14:11] T383870: mjolnir should pin refinery jar version explicitely - https://phabricator.wikimedia.org/T383870
[14:14:36] thx!
[14:14:38] hope it's ok - it's a trivial mjolnir change I looked into earlier today with dcausse - should help stabilize the dag.
[14:14:46] +1
[14:15:53] o/
[14:27:32] o/
[15:23:43] spark driver <> skein memory parsing fixes have been deployed. I'll re-run the failing mjolnir task
[15:48:13] can anyone tell me what https://gerrit.wikimedia.org/r/plugins/gitiles/wikidata/query/rdf/+/refs/heads/master/dist/src/script/summarizeEvents.sh does? Just wondering if it is a dependency of wdqs-categories
[15:48:35] guess it is related to streaming-updater, but LMK
[15:50:40] inflatador: totally obsolete, it was a tool to gather usage statistics when wcqs was running on wmcs
[15:53:21] dcausse ACK, thanks! Just looking at helm chart stuff
[16:46:50] heading out early today, back later tonight
[17:12:34] workout, back in ~40
[17:50:45] well, my back isn't happy after that workout... taking the rest of the day off ;(
[17:55:12] inflatador: ouch! hope you feel better soon!
[18:39:01] inflatador :(
[19:05:45] Trey314159 inflatador ryankemper if I don't talk to you before then - enjoy the Monday off & long weekend
[19:05:55] Thanks!
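
For context on T383870 above, a minimal sketch of what pinning the refinery jar could look like in a PySpark session, assuming a hypothetical versioned artifact path; the actual mjolnir change and the real artifact location/version may differ:

```python
# Sketch only: reference a pinned refinery jar instead of the "current" deploy
# path, so the file (and its HDFS timestamp) cannot change under a running
# YARN application. The versioned path below is hypothetical.
from pyspark.sql import SparkSession

REFINERY_HIVE_JAR = (
    "hdfs://analytics-hadoop/wmf/refinery/artifacts/"
    "refinery-hive-shaded-0.2.28.jar"  # illustrative version, not the real pin
)

spark = (
    SparkSession.builder
    .appName("mjolnir-feature-vectors")
    # Ship the pinned jar with the job rather than reading .../current/...,
    # which changes whenever refinery is redeployed.
    .config("spark.jars", REFINERY_HIVE_JAR)
    .enableHiveSupport()
    .getOrCreate()
)
```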
[19:47:20] mjolnir is still unhappy
[19:47:21] 25/01/16 16:29:19 INFO skein.ApplicationMaster: Shutting down: Application driver failed with exit code 143. This is often due to the application master memory limit being exceeded. See the diagnostics for more information.
[20:03:10] I'm running the failing spark job manually (tmux ftw) to test some tweaks to spark driver memory size
[21:04:06] I created T383938
[21:04:07] T383938: Investigate and tune mjolnir resource allocation - https://phabricator.wikimedia.org/T383938
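
A rough sketch of the kind of resource tweaks T383938 is about, assuming the job is configured through a plain SparkConf; the memory values and the failure cap are illustrative starting points, not the settings mjolnir actually uses:

```python
# Sketch only: raise the driver / application master memory that exit code 143
# ("memory limit exceeded") points at, and cap executor failures so a bad run
# fails fast instead of accumulating hundreds of retries as seen above.
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = (
    SparkConf()
    .setAppName("mjolnir-feature-vectors")
    .set("spark.driver.memory", "8g")                # heap for the driver
    .set("spark.driver.memoryOverhead", "2g")        # off-heap headroom in the container
    .set("spark.yarn.max.executor.failures", "64")   # instead of the 800+ failures seen earlier
)

spark = SparkSession.builder.config(conf=conf).getOrCreate()
```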