[07:20:17] o/
[08:34:19] o/
[08:38:31] dcausse looks like running spark in client mode helped (at least with import_wikidata_ttl)
[08:39:21] gmodena: yes, indeed
[08:39:27] looking into errors of some other dags that failed this weekend
[08:39:49] image_suggestions_weekly failed after the spark job...
[08:40:11] OSError: HDFS connection failed, from pyarrow
[08:42:57] dcausse I wonder if that might be kerberos related
[08:43:44] no clue :/
[08:45:04] I see "Environment variable CLASSPATH not set!" right before, but not 100% sure that's related
[08:45:38] Errand, back in a few
[08:46:15] yes, that seems related: https://stackoverflow.com/questions/60954201/pyarrow-0-16-0-fs-hadoopfilesystem-throws-hdfs-connection-failed
[08:47:19] ah! good catch
[08:47:59] dcausse how do you access logs for the failing application? Airflow tells me to run `sudo -u airflow yarn logs -appOwner...` but I don't know from which host
[08:48:51] gmodena: that should work from any stat machine, but perhaps not with the airflow user but with analytics-search
[08:48:57] sudo -u analytics-search kerberos-run-command analytics-search yarn logs -appOwner analytics-search -applicationId application_1734703658237_778642 | less
[08:49:54] yes... does not seem to work with the "airflow" user
[08:50:07] dcausse ack! Other users are more restricted, and I did not want to spam sudo logs.
[08:50:37] dcausse yeah, user `airflow` was my first failed attempt :D
[09:01:34] dcausse I think it's triggered from https://gitlab.wikimedia.org/repos/search-platform/discolytics/-/blob/main/discolytics/cli/convert_to_esbulk.py?ref_type=heads#L693. That call to pyarrow happens outside of a SparkSession, which usually takes care of initializing CLASSPATH
[09:07:24] gmodena: ok, I'm not sure how to fix this tho
[09:13:45] dcausse wmf_airflow_common's hdfs_client hacks CLASSPATH into the python process right before calling pyarrow: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blob/main/wmf_airflow_common/util.py#L21
[09:14:20] ok
[09:15:23] but I assume this script used to work pre-k8s, right?
[09:18:10] gmodena: yes
[09:18:27] but possibly running in cluster mode the CLASSPATH is set?
[09:18:32] by spark
[09:19:21] i think that's it, it's the only change I see in the code path
[09:19:43] we had this "hack" in drop-dated-directories.py too and it does not run spark
[09:20:37] ack
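[Note: a minimal sketch of the CLASSPATH workaround discussed above, assuming the `hadoop` CLI is on the PATH; the linked wmf_airflow_common helper does the same thing from inside the Python process before pyarrow opens an HDFS connection. The script invocation is a placeholder, not the real entry point.]

```sh
# Export the full hadoop jar list so libhdfs (used by pyarrow) can load its
# classes when no SparkSession has set CLASSPATH for us.
export CLASSPATH="$(hadoop classpath --glob)"

# Then run the pyarrow-based step as usual.
python your_pyarrow_script.py  # illustrative placeholder
```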
[09:33:28] gmodena: if/when you have a moment: https://gitlab.wikimedia.org/repos/search-platform/discolytics/-/merge_requests/49
[09:34:00] looking into other issues, import_cirrus_indices & glent
[09:34:29] import_cirrus_indexes is REQUESTS_CA_BUNDLE not set
[09:34:41] and glent seems to be a spark driver mem issue
[09:39:06] dcausse anything I can help with?
[09:51:45] gmodena: for import_cirrus_indices & glent I'll have a patch in a sec; perhaps you could look into mjolnir, there's a failure at https://airflow-search.wikimedia.org/dags/mjolnir_weekly/grid?dag_run_id=scheduled__2025-01-03T18%3A42%3A00.449096%2B00%3A00&task_id=feature_vectors-norm_query-20180215-query_explorer&tab=logs for which I don't know the cause yet
[09:53:02] these logs do not even have the yarn appid :/
[09:54:57] dcausse sounds good!
[09:55:18] thx!
[10:57:40] lunch
[11:02:00] dcausse enjoy!
[11:02:23] mjolnir is in a weird state
[11:03:16] the skein app that launched `feature_vectors-norm_query-20180215-query_explorer` died. It did not even register/log a Spark application id. However, a spark job was submitted and has executed
[11:04:31] I'm not sure if it was launched from the failed skein app though. Investigating.
[11:05:52] Right now there is a `mjolnir_weekly__feature_vectors-norm_query-20180215-query_explorer__20250103` YARN application in RUNNING state https://yarn.wikimedia.org/cluster/app/application_1734703658237_762174
[11:06:04] Started: Sat Jan 11 13:42:18 +0000 2025
[11:06:47] this aligns with the failed airflow task https://airflow-search.wikimedia.org/dags/mjolnir_weekly/grid?execution_date=2025-01-03+18%3A42%3A00.449096%2B00%3A00&dag_run_id=scheduled__2025-01-03T18%3A42%3A00.449096%2B00%3A00&task_id=feature_vectors-norm_query-20180215-query_explorer&tab=logs
[11:12:01] the yarn application contains only a driver container (hanging); all other tasks seem to have completed (https://yarn.wikimedia.org/proxy/application_1734703658237_762174/jobs/). The application was started in client mode, so my guess would be this is some dangling skein container?
[11:14:41] FWIW I do see data in `mjolnir.feature_vectors` for all wikis for 2025-01-03 (the target date and output table for the job), but it's roughly half the number of records of the previous run. Can't assess if this was expected or not.
[11:15:19] lunch, then back looking into this.
[12:42:23] need to go pick up Lukas from school, they called to tell me he's sick. Back in 30
[13:07:02] gmodena: thanks for looking into it!
[13:17:05] back
[13:18:53] dcausse checking your glent & co changes right now
[13:20:20] re mjolnir: I'm unsure what to do. I'd be tempted to kill the yarn app and re-run the failed airflow task for now.
[13:27:40] gmodena: sure, the dangling driver seems similar to T383218 (which IIRC was caused by mem issues), but that was during feature selection, not feature collection
[13:27:40] T383218: Mjolnir is sometimes stuck in feature selection - https://phabricator.wikimedia.org/T383218
[13:30:27] dcausse I'll append to that phab task. I thought about driver issues, but the fact that tasks do seem to have completed threw me off. application_1734703658237_762174 has no worker task marked as active/running, just what looks like an idle driver process.
[13:35:39] yes, was looking at the driver stack trace and unsure what could be blocking it...
[13:36:23] perhaps stuck in python doing something with the results...
[13:37:21] the main thread is waiting for the PythonRunner at least
[13:46:27] dcausse ack. I saw that wait in the main thread, but my interpretation was py4j possibly getting stuck.
[13:48:38] i ssh-ed into the driver host, and I don't see any python process running (other than the spark-submit for the mjolnir app)
[13:49:40] :/
[13:49:53] I also realize that I don't understand how this skein submit step works. I was expecting the spark-submit command to be executed inside a yarn container, not on an an-worker. Maybe I'm looking at the wrong thing :|
[13:57:21] sparksubmitoperator with launcher=skein should have spark-submit run in a yarn app master for sure!
[14:00:12] ottomata ack!
[14:00:47] dcausse killing spark / re-running the failed airflow task
[14:00:55] gmodena: ack
[14:01:17] added a comment (with fewer typos :P) at https://phabricator.wikimedia.org/T383218#10453258
[14:08:56] SkeinHook Airflow SparkSkeinSubmitHook skein launcher mjolnir_weekly__feature_vectors-norm_query-20180215-query_explorer__20250103 application_1734703658237_813702 status: RUNNING - Waiting until finished.
[14:09:03] this looks better.
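[Note: a sketch of the cleanup just described, assuming the same kerberos wrapper shown earlier for `yarn logs` and a reachable standard Airflow 2.x CLI; the application id is the one from the log, and clearing via the UI "Clear" button is equivalent.]

```sh
# Kill the dangling driver-only YARN application as the owning user.
sudo -u analytics-search kerberos-run-command analytics-search \
  yarn application -kill application_1734703658237_762174

# Clear the failed task instance so the scheduler re-runs it.
airflow tasks clear mjolnir_weekly \
  -t 'feature_vectors-norm_query-20180215-query_explorer' \
  -s 2025-01-03 -e 2025-01-04 -y
```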
[14:15:16] o/
[14:18:29] hmmm, we're getting an SLO "ErrorBudgetBurn" alert for wdqs. Looking into it now https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[14:23:26] dcausse relaunching the airflow task did not help.
[14:24:20] and right now there seem to be multiple instances of the skein/spark applications running
[14:24:41] yarn application -list | grep feature_vectors-norm_query-20180215-query_explorer | wc -l
[14:24:42] 15
[14:32:32] ouch
[14:33:48] with mjolnir it's hard to know what's related to the move to k8s/client mode vs existing issues it had prior to the move
[15:00:51] dcausse let's touch base at triage?
[15:04:03] it seems that there's an issue between skein and airflow. Airflow thinks the skein app is down while it is actually running. It then keeps re-launching it, and this leads to multiple instances of the same application running.
[15:05:25] just had a chat with joal - he remembers a similar issue that should have been fixed a while ago. I need to check if we are hitting a regression.
[15:06:31] the dangling spark driver is a different problem; for that I need to have a look at the application logic (which I was planning to do anyway)
[15:08:45] ack
[15:09:00] gmodena: should we pause mjolnir in the meantime?
[15:11:03] the feature_vectors job is what collects features from elastic, so having many running is not great
[15:12:12] but since it's going through kafka it might not run them in parallel tho
[15:12:14] https://grafana-rw.wikimedia.org/d/000000616/elasticsearch-mjolnir-msearch?orgId=1&refresh=5m&from=now-7d&to=now
[15:14:47] dcausse +1 for pausing.
[15:14:51] ok
[15:23:37] I created https://phabricator.wikimedia.org/T383571
[15:25:36] thanks
[15:31:10] I stopped feature_vectors-norm_query-20180215-query_explorer and cleaned up all associated skein/spark apps
[15:38:27] ack
[16:57:38] errand, back in ~45
[18:06:36] back
[18:59:56] lunch, back in ~30
[19:19:38] ok, popularity_score, glent, image suggestions, import_cirrus_indexes are working or at least running; will see tomorrow what the next round of errors is...
[19:19:40] dinner
[19:31:57] {◕ ◡ ◕}
[20:36:12] seeing this issue with skein losing track of its spark app on import_cirrus_indexes_weekly; this is bad because we don't want hadoop to dump indices from our search cluster concurrently... pausing it...
[20:37:35] this is bad... 6 were running :(
[20:42:34] ;(
[20:45:24] actually it's not skein but the airflow scheduler losing track of skein
[20:50:49] same thing we were talking about at the mtg, right? Bah
[20:53:36] yes
[21:03:25] perhaps this https://github.com/apache/airflow/issues/39088 ?
[21:14:32] * inflatador reads the Kubernetes Executor docs again
[21:31:19] Looks like we're on airflow 2.10.3, I think that bug should already be fixed
[22:06:10] ryankemper I'm in pairing if you're around
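[Note: a sketch of the containment steps discussed above, counting duplicates the same way as the earlier `yarn application -list | grep | wc -l` check; the DAG name is the one from the log, `airflow dags pause` is the standard Airflow 2.x CLI command, and `<application_id>` is a placeholder for each duplicate found.]

```sh
# How many copies of the dump job are actually running?
yarn application -list -appStates RUNNING 2>/dev/null \
  | grep -c import_cirrus_indexes_weekly

# Stop airflow from launching more while the scheduler/skein issue is investigated...
airflow dags pause import_cirrus_indexes_weekly

# ...then kill each duplicate application reported by the listing above.
sudo -u analytics-search kerberos-run-command analytics-search \
  yarn application -kill <application_id>
```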