[08:52:18] dcausse urgh. That livy behaviour is annoying
[08:55:04] drop_old_data_daily is failing with /usr/bin/env: ‘/usr/bin/python3’: No such file or directory
[08:55:08] gmodena: I think that's spark-submit doing things with backticks, so if the query is passed as a CLI arg it might break if not escaped beforehand, made some progress with the same workaround (https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1002)
[08:55:11] sigh...
[08:58:26] meh, wikidata rdf imports are also broken :(
[08:58:30] dcausse :|
[08:58:44] btullis re drop_old_data_daily on k8s: does the airflow image bundle refinery? Under the hood that dag uses some clean-up scripts that on an-launcher live under /srv/deployment/analytics/refinery/bin/
[09:00:39] for wikidata: seems to be skein: skein.exceptions.ConnectionError: Unable to connect to application
[09:00:49] but no clue why it's failing
[09:01:55] oof
[09:02:56] actually it's running, but it looks like airflow is unable to keep track of the yarn application
[09:03:22] application_1734703658237_728431 (skein) and application_1734703658237_728437 (spark job) are still running
[09:09:30] there's also this WARN in the logs: UserWarning: Skein global security credentials not found, writing now to '/home/airflow/.skein'.
[09:09:38] not sure if related though
[09:10:59] weird... there are other tasks using SparkSubmitOperator that succeeded
[09:11:11] dcausse how about trying to kill the spark job and re-run the airflow task? - just to check for possible transient network funkiness
[09:11:24] gmodena: sure
[09:13:40] killed
[09:14:03] deploying a quick fix to query-clicks and re-running it alongside the wikidata rdf import
[09:22:04] query-clicks running, restarted import_wikidata_ttl.munge_dumps
[09:23:24] got the same warning with "Skein global security credentials not found" but it seems to wait longer than the previous run
[09:23:46] last time it failed 3 sec after saying "Waiting until finished"
[09:27:04] ouch, in fact it failed this way multiple times and the previous runs are still running: https://yarn.wikimedia.org/cluster/app/application_1734703658237_727260 https://yarn.wikimedia.org/cluster/app/application_1734703658237_727165
[09:27:53] killed them
[09:28:13] but something's definitely very fragile here :(
[09:29:19] yep
[09:29:41] now it seems that application_1734703658237_730765 has been running fine for a couple of minutes
[09:32:17] yes, and airflow is keeping track of it properly so far
[09:32:32] opening a new doc to list the problems we've seen so far
[09:32:43] yep. heartbeat responses are coming in
[09:33:05] wondering if we're the first instance moving to k8s
[09:33:53] if not, there are probably many things we don't do the right way
[09:39:33] IIRC wmde and research have moved already
[09:40:47] Hi, sorry for the delay. Here now.
[09:41:48] gmodena: No, the airflow image does not bundle refinery at the moment. We have discussed it in the past and decided (I believe) that we would rather not bundle it, if we can avoid doing so.
[09:43:07] btullis IIRC the recommendation was to provide a custom image/overlay with refinery bundled
[09:43:11] What we would like to do instead is to create a specific docker image that contains refinery (and other data lake client tools) - then this image would be selected at the task level.
[09:43:26] ^^ Yes, exactly. But we haven't created that image yet.
[09:43:42] btullis ack
[09:43:52] We were unaware that there were any immediate use cases.
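A minimal sketch of the task-level image selection described at 09:43:11, assuming Airflow's KubernetesExecutor `pod_override` mechanism; the refinery-bundled image name is hypothetical, since per 09:43:26 that image has not been created yet:

```python
# Sketch only: select a (hypothetical) refinery-bundled image for one task via
# the KubernetesExecutor pod_override. Other tasks keep the default image.
from airflow.operators.bash import BashOperator
from kubernetes.client import models as k8s

drop_old_data = BashOperator(
    task_id="drop_old_data_daily",
    bash_command="refinery-drop-older-than --verbose ...",  # flags elided
    executor_config={
        "pod_override": k8s.V1Pod(
            spec=k8s.V1PodSpec(
                containers=[
                    k8s.V1Container(
                        name="base",  # the executor's main container is named "base"
                        image="docker-registry.wikimedia.org/airflow-refinery:latest",  # hypothetical
                    )
                ]
            )
        )
    },
)
```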
[09:45:08] how is airflow-analytics calling refinery-drop-older-than?
[09:45:33] The `Skein global security credentials not found, writing now to '/home/airflow/.skein'.` warning can be safely ignored. It will be seen on every invocation of skein.
[09:46:34] dcausse airflow-analytics is not on k8s yet AFAIK
[09:46:43] airflow-analytics is using a BashOperator too, with things like: /usr/bin/env PYTHONPATH=/srv/deployment/analytics/refinery/python /usr/bin/python3 /srv/deployment/analytics/refinery/bin/refinery-drop-older-than --verbose --database=wmf_staging '--tables=^(webrequest_frontend)$' --older-than=7 --allowed-interval=3 --execute=7516a2dddda5de7ad73f031294c35cc6
[09:46:55] gmodena: but it'll have the same issue
[09:48:00] dcausse yes, but that's something we discussed as a pre-req before moving to k8s (). I think the search use case fell through the cracks
[09:48:24] hm ok, should we revert then?
[09:55:45] Would it be realistic to run refinery-drop-older-than with a SkeinOperator, as a temporary workaround?
[10:00:16] I don't know... I can try to look at what's required
[10:01:05] https://docs.google.com/document/d/18YXTuVMAiaVwmxOktcIENrPCV_hHA_k4HEWdV2EWxvM/edit?usp=sharing (cc btullis, gmodena)
[10:01:41] I don't think SkeinOperator would necessarily help. The problem is that refinery is not properly packaged and we need a clone of the repo to access it. To make it work with skein "as is" we would have to hand-roll a conda image with the refinery scripts and hack around PYTHONPATH. At that point I'd rather 1. go the custom docker image route or 2. refactor the refinery python code into its own package.
[10:01:47] for reference: https://wikimedia.slack.com/archives/CSV483812/p1732277067342799?thread_ts=1732276680.848229&cid=CSV483812
[10:21:03] I am also chatting with joal about these issues in #data-platform-sre on Slack.
[13:51:01] weekly status update published: https://wikitech.wikimedia.org/wiki/Search_Platform/Weekly_Updates/2025-01-10
[13:51:58] dcausse / gmodena / btullis: where are we on analytics search? Do we need to scramble to fix this before the weekend?
[13:52:35] gehel: still trying things...
[13:53:06] how bad is it if we don't fix this before the weekend?
[13:55:32] for user-facing features: we won't push updated scores (popularity/incoming_links) to the search indices, image suggestions, and wdqs/wcqs reconciliation info
[13:56:55] addLink / Image Suggestions might be impacted? If that's the case, we should at least let the Growth team know.
[13:57:17] The rest seems much lower impact.
[13:57:49] Do we have a deadline after which we just roll back and try again next week?
[13:58:46] rollback might be hard because of all the changes I've made so far...
[13:59:02] :/
[13:59:04] I mean I can revert all the MRs I sent
[14:00:16] btullis, dcausse: do you have enough of a plan to have things under control before the end of the day? And not spend the weekend trying to fix it? Do we need to have a quick chat to see where we're at?
[14:00:17] rolling back to the scheduler running on the VM is also a bit tricky because the database has been migrated. We would probably have to dump and restore from k8s back to bare metal.
[14:00:54] yes thanks. I'm taking 30 mins for lunch now, but then back on it.
[14:03:12] I'm scheduling a quick check-in in 1h. If things are fixed by then or you think it is not needed, let me know.
[14:09:49] ack, will do.
[14:15:34] do we have the hdfs CLI in the image?
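For the SkeinOperator workaround floated at 09:55:45, a rough sketch against the raw skein API of what running refinery-drop-older-than on YARN could look like; the HDFS path to the refinery checkout is an assumption, and, as noted at 10:01:41, this still means shipping the repo and setting PYTHONPATH by hand:

```python
# Sketch only: submit refinery-drop-older-than as a skein application.
import skein

spec = skein.ApplicationSpec(
    name="refinery-drop-older-than",
    master=skein.Master(
        resources=skein.Resources(memory="2 GiB", vcores=1),
        # Assumed location of a refinery checkout on HDFS; skein localizes it
        # into the container's working directory under the key name.
        files={"refinery": "hdfs:///path/to/refinery"},
        script=(
            "export PYTHONPATH=refinery/python\n"
            "python3 refinery/bin/refinery-drop-older-than --verbose "
            "--database=wmf_staging '--tables=^(webrequest_frontend)$' "
            "--older-than=7 --allowed-interval=3"
            # --execute checksum omitted; it depends on the exact arguments
        ),
    ),
)

with skein.Client() as client:
    app_id = client.submit(spec)
    print(f"submitted {app_id}")
```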
[14:20:24] not sure, there are some references to HDFS in https://gitlab.wikimedia.org/repos/data-engineering/airflow/-/blob/main/blubber.yaml?ref_type=heads ...will check
[14:23:19] `/usr/bin/hdfs` is provided by the `hadoop-hdfs` deb package on the airflow VMs; I don't see it in the blubberfile
[14:24:13] :/
[14:25:36] dcausse there is a `/usr/bin/hdfs` in the airflow-scheduler container, I just checked, so it actually is in there
[14:25:49] ah cool
[14:55:00] I'll be late for the meeting. I'll ping you when I'm back
[15:02:33] and I'm back!
[17:16:15] heading out, have a nice weekend (will check the status of airflow later tonight once I'm back)
[17:17:01] later, have a great weekend
[18:48:12] lunch/errand, back in ~90
[20:28:46] back
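As an aside on the 14:15:34 question about binaries in the image, a tiny illustrative preflight check a DAG task could run to confirm the container ships the tools the refinery scripts shell out to; the binary list is just an example, and the missing /usr/bin/python3 failure at 08:55:04 is the kind of thing it would surface:

```python
# Sketch only: fail fast if expected CLI tools are absent from the image.
import shutil

for binary in ("python3", "hdfs"):
    path = shutil.which(binary)
    print(f"{binary}: {path or 'NOT FOUND'}")
    if path is None:
        raise SystemExit(f"{binary} missing from this image")
```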