[07:20:17] o/
[08:34:19] o/
[08:38:31] dcausse looks like running spark in client mode helped (at least with import_wikidata_ttl)
[08:39:21] gmodena: yes, indeed
[08:39:27] looking into errors of some other dags that failed this weekend
[08:39:49] image_suggestions_weekly failed after the spark job...
[08:40:11] OSError: HDFS connection failed, from pyarrow
[08:42:57] dcausse I wonder if that might be kerberos related
[08:43:44] no clue :/
[08:45:04] I see "Environment variable CLASSPATH not set!" right before, but not 100% sure that's related
[08:45:38] Errand, back in a few
[08:46:15] yes, that seems related: https://stackoverflow.com/questions/60954201/pyarrow-0-16-0-fs-hadoopfilesystem-throws-hdfs-connection-failed
[08:47:19] ah! good catch
[08:47:59] dcausse how do you access logs for the failing application? Airflow tells me to run `sudo -u airflow yarn logs -appOwner...` but I don't know from which host
[08:48:51] gmodena: that should work from any stat machine, but perhaps not with the airflow user but with analytics-search
[08:48:57] sudo -u analytics-search kerberos-run-command analytics-search yarn logs -appOwner analytics-search -applicationId application_1734703658237_778642 | less
[08:49:54] yes... does not seem to work with the "airflow" user
[08:50:07] dcausse ack! Other users are more restricted, and I did not want to spam sudo logs.
[08:50:37] dcausse yeah, user `airflow` was my first failed attempt :D
[09:01:34] dcausse I think it's triggered from https://gitlab.wikimedia.org/repos/search-platform/discolytics/-/blob/main/discolytics/cli/convert_to_esbulk.py?ref_type=heads#L693. That call to pyarrow happens outside of a SparkSession, which usually takes care of initializing CLASSPATH
[09:07:24] gmodena: ok, I'm not sure how to fix this tho
[09:13:45] dcausse wmf_airflow_common's hdfs_client hacks CLASSPATH into the python process right before calling pyarrow: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blob/main/wmf_airflow_common/util.py#L21
[09:14:20] ok
[09:15:23] but I assume this script used to work pre-k8s, right?
[09:18:10] gmodena: yes
[09:18:27] but possibly running in cluster mode the CLASSPATH is set?
[09:18:32] by spark
[09:19:21] i think that's it, it's the only change I see in the code path
[09:19:43] we had this "hack" in drop-dated-directories.py too and it does not run spark
[09:20:37] ack
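[Note: a minimal sketch of the CLASSPATH workaround discussed above, assuming the `hadoop` CLI is on the PATH; the linked wmf_airflow_common helper does the same thing from inside the Python process before pyarrow opens an HDFS connection. The script invocation is a placeholder, not the real entry point.]

```sh
# Export the full hadoop jar list so libhdfs (used by pyarrow) can load its
# classes when no SparkSession has set CLASSPATH for us.
export CLASSPATH="$(hadoop classpath --glob)"

# Then run the pyarrow-based step as usual.
python your_pyarrow_script.py  # illustrative placeholder
```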
[09:33:28] gmodena: if/when you have a moment: https://gitlab.wikimedia.org/repos/search-platform/discolytics/-/merge_requests/49
[09:34:00] looking into other issues, import_cirrus_indices & glent
[09:34:29] import_cirrus_indexes is REQUESTS_CA_BUNDLE not set
[09:34:41] and glent seems to be a spark driver mem issue
[09:39:06] dcausse anything I can help with?
[09:51:45] gmodena: for import_cirrus_indices & glent I'll have a patch in a sec; perhaps you could look into mjolnir, there's a failure at https://airflow-search.wikimedia.org/dags/mjolnir_weekly/grid?dag_run_id=scheduled__2025-01-03T18%3A42%3A00.449096%2B00%3A00&task_id=feature_vectors-norm_query-20180215-query_explorer&tab=logs for which I don't know the cause yet
[09:53:02] these logs do not even have the yarn appid :/
[09:54:57] dcausse sounds good!
[09:55:18] thx!
[10:57:40] lunch
[11:02:00] dcausse enjoy!
[11:02:23] mjolnir is in a weird state
[11:03:16] the skein app that launched `feature_vectors-norm_query-20180215-query_explorer` died. It did not even register/log a Spark application id. However, a spark job was submitted and has executed
[11:04:31] I'm not sure if it was launched from the failed skein app though. Investigating.
[11:05:52] Right now there is a `mjolnir_weekly__feature_vectors-norm_query-20180215-query_explorer__20250103` YARN application in RUNNING state https://yarn.wikimedia.org/cluster/app/application_1734703658237_762174
[11:06:04] Started: Sat Jan 11 13:42:18 +0000 2025
[11:06:47] this aligns with the failed airflow task https://airflow-search.wikimedia.org/dags/mjolnir_weekly/grid?execution_date=2025-01-03+18%3A42%3A00.449096%2B00%3A00&dag_run_id=scheduled__2025-01-03T18%3A42%3A00.449096%2B00%3A00&task_id=feature_vectors-norm_query-20180215-query_explorer&tab=logs
[11:12:01] the yarn application contains only a driver container (hanging); all other tasks seem to have completed (https://yarn.wikimedia.org/proxy/application_1734703658237_762174/jobs/). The application was started in client mode, so my guess would be this is some dangling skein container?
[11:14:41] FWIW I do see data in `mjolnir.feature_vectors` for all wikis for 2025-01-03 (the target date and output table for the job), but it's roughly half the number of records of the previous run. Can't assess if this was expected or not.
[11:15:19] lunch, then back looking into this.
[12:42:23] need to go pick up Lukas from school, they called to tell me he's sick. Back in 30
[13:07:02] gmodena: thanks for looking into it!
[13:17:05] back
[13:18:53] dcausse checking your glent & co changes right now
[13:20:20] re mjolnir: I'm unsure what to do. I'd be tempted to kill the yarn app and re-run the failed airflow task for now.
[13:27:40] gmodena: sure, the dangling driver seems similar to T383218 (which IIRC was caused by mem issues), but that was during feature selection, not feature collection
[13:27:40] T383218: Mjolnir is sometimes stuck in feature selection - https://phabricator.wikimedia.org/T383218
[13:30:27] dcausse I'll append to that phab task. I thought about driver issues, but the fact that tasks do seem to have completed threw me off. application_1734703658237_762174 has no worker task marked as active/running, just what looks like an idle driver process.
[13:35:39] yes, was looking at the driver stack trace and unsure what could be blocking it...
[13:36:23] perhaps stuck in python doing something with the results...
[13:37:21] the main thread is waiting for the PythonRunner at least
[13:46:27] dcausse ack. I saw that wait in the main thread, but my interpretation was py4j possibly getting stuck.
[13:48:38] i ssh-ed into the driver host, and I don't see any python process running (other than the spark-submit for the mjolnir app)
[13:49:40] :/
[13:49:53] I also realize that I don't understand how this skein submit step works. I was expecting the spark-submit command to be executed inside a yarn container, not on an an-worker. Maybe I'm looking at the wrong thing :|
[13:57:21] sparksubmitoperator with launcher=skein should have spark-submit run in a yarn app master for sure!
[14:00:12] ottomata ack!
[14:00:47] dcausse killing spark / re-running the failed airflow task
[14:00:55] gmodena: ack
[14:01:17] added a comment (with fewer typos :P) at https://phabricator.wikimedia.org/T383218#10453258
[14:08:56] SkeinHook Airflow SparkSkeinSubmitHook skein launcher mjolnir_weekly__feature_vectors-norm_query-20180215-query_explorer__20250103 application_1734703658237_813702 status: RUNNING - Waiting until finished.
[14:09:03] this looks better.
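[Note: a sketch of the cleanup just described, assuming the same kerberos wrapper shown earlier for `yarn logs` and a reachable standard Airflow 2.x CLI; the application id is the one from the log, and clearing via the UI "Clear" button is equivalent.]

```sh
# Kill the dangling driver-only YARN application as the owning user.
sudo -u analytics-search kerberos-run-command analytics-search \
  yarn application -kill application_1734703658237_762174

# Clear the failed task instance so the scheduler re-runs it.
airflow tasks clear mjolnir_weekly \
  -t 'feature_vectors-norm_query-20180215-query_explorer' \
  -s 2025-01-03 -e 2025-01-04 -y
```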
[14:15:16] o/
[14:18:29] hmmm, we're getting an SLO "ErrorBudgetBurn" alert for wdqs. Looking into it now https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[14:23:26] dcausse relaunching the airflow task did not help.
[14:24:20] and right now there seem to be multiple instances of the skein/spark applications running
[14:24:41] yarn application -list | grep feature_vectors-norm_query-20180215-query_explorer | wc -l
[14:24:42] 15
[14:32:32] ouch
[14:33:48] with mjolnir it's hard to know what's related to the move to k8s/client mode vs existing issues it had prior to the move
[15:00:51] dcausse let's touch base at triage?
[15:04:03] it seems that there's an issue between skein and airflow. Airflow thinks the skein app is down while it is actually running. It then keeps re-launching it, and this leads to multiple instances of the same application running.
[15:05:25] just had a chat with joal - he remembers a similar issue that should have been fixed a while ago. I need to check if we are hitting a regression.
[15:06:31] the dangling spark driver is a different problem; for that I need to have a look at the application logic (which I was planning to do anyway)
[15:08:45] ack
[15:09:00] gmodena: should we pause mjolnir in the meantime?
[15:11:03] the feature_vectors job is what collects features from elastic, so having many running is not great
[15:12:12] but since it's going through kafka it might not run them in parallel tho
[15:12:14] https://grafana-rw.wikimedia.org/d/000000616/elasticsearch-mjolnir-msearch?orgId=1&refresh=5m&from=now-7d&to=now
[15:14:47] dcausse +1 for pausing.
[15:14:51] ok
[15:23:37] I created https://phabricator.wikimedia.org/T383571
[15:25:36] thanks
[15:31:10] I stopped feature_vectors-norm_query-20180215-query_explorer and cleaned up all associated skein/spark apps
[15:38:27] ack
[16:57:38] errand, back in ~45
[18:06:36] back
[18:59:56] lunch, back in ~30
[19:19:38] ok, popularity_score, glent, image suggestions, import_cirrus_indexes are working or at least running; will see tomorrow what the next round of errors is...
[19:19:40] dinner
[19:31:57] {◕ ◡ ◕}
[20:36:12] seeing this issue with skein losing track of its spark app on import_cirrus_indexes_weekly; this is bad because we don't want hadoop to dump indices from our search cluster concurrently... pausing it...
[20:37:35] this is bad... 6 were running :(
[20:42:34] ;(
[20:45:24] actually it's not skein but the airflow scheduler losing track of skein
[20:50:49] same thing we were talking about at the mtg, right? Bah
[20:53:36] yes
[21:03:25] perhaps this https://github.com/apache/airflow/issues/39088 ?
[21:14:32] * inflatador reads the Kubernetes Executor docs again
[21:31:19] Looks like we're on airflow 2.10.3, I think that bug should already be fixed
[22:06:10] ryankemper I'm in pairing if you're around
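[Note: a sketch of the containment steps discussed above, counting duplicates the same way as the earlier `yarn application -list | grep | wc -l` check; the DAG name is the one from the log, `airflow dags pause` is the standard Airflow 2.x CLI command, and `<application_id>` is a placeholder for each duplicate found.]

```sh
# How many copies of the dump job are actually running?
yarn application -list -appStates RUNNING 2>/dev/null \
  | grep -c import_cirrus_indexes_weekly

# Stop airflow from launching more while the scheduler/skein issue is investigated...
airflow dags pause import_cirrus_indexes_weekly

# ...then kill each duplicate application reported by the listing above.
sudo -u analytics-search kerberos-run-command analytics-search \
  yarn application -kill <application_id>
```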