[08:01:54] Seddon: glad to hear you made it work! :)
[09:44:00] going out for lunch
[11:45:30] dcausse: do you know why locally that cross-project search returns N>1 results per wiki but in production only returns 1 result per sister project
[12:08:13] errand
[12:45:28] dcausse: do you have a few minutes for a chat?
[12:45:47] meet.google.com/nvx-qpwy-hvn
[12:57:41] gehel: sure
[13:27:25] dcausse: any thoughts about my question above?
[13:27:56] Seddon: sure there's a setting for that, lemme see
[13:29:23] Seddon: I think you could set $wgCirrusSearchNumCrossProjectSearchResults = 1;
[13:29:38] * Seddon tests
[13:30:20] dcausse: you are a star
[13:30:41] yw!
[13:41:11] dcausse: Sam is around, want to jump back into that same meet?
[13:41:21] gehel: sure
[13:56:42] We messed up our vacation planning: both Erik and David are out next week :(
[13:57:10] oh, hm. is there anyone else who can pick that up? or at least who will be able to run a manual update next week so we can automate when they get back?
[13:57:13] Hopefully Erik can start the import today.
[13:58:46] I don't think anyone else on the team has enough familiarity with these kinds of things :(
[13:59:16] ok, I guess let's find out when ebernhardson gets in if there's any chance we can do the automation today
[13:59:24] otherwise our feature will be broken next week :(
[14:02:01] dcausse: any chance you can get the import started for T304954 ?
[14:02:02] T304954: Import data from hdfs to commonswiki_file - https://phabricator.wikimedia.org/T304954
[14:03:06] gehel: the data is not there yet I think
[14:03:47] or is there a snapshot for 2022-07-25 perhaps?
[14:05:53] it's not the import we need, I think we have the latest import - it's automating the import starting next week
[14:07:27] I think today we imported 2022-07-11 -> 2022-07-18, from when do we need to automate?
[14:07:56] checking hive
[14:08:09] I just pinged Marco and asked him to jump into the conversation to help clarify
[14:09:08] hmm..
I would have guessed 2022-07-25 to be present but not seeing this in hive
[14:12:43] o/
[14:13:36] hola
[14:15:35] mfossati_: do you want a link to how you schedule your dag?
[14:15:39] s/want/have
[14:16:46] what's this trailing underscore in my username? :-D
[14:17:13] there's another mfossati apparently :)
[14:17:25] try /nick mfossati
[14:17:46] interesting
[14:18:58] anyway, jumping in re the search indices update automation: the image suggestion Airflow DAG is scheduled to run on Mondays
[14:19:53] mfossati_: when do you think the 2022-07-25 partition will be made?
[14:19:57] we need to wait for the latest Wikidata snapshot to become available: its job should run on Monday as well, but usually it becomes available in Hive a few days later, it depends
[14:20:28] let me check
[14:20:54] we schedule on mondays too and rely on the sensor to retry perhaps
[14:21:33] yes, we do the same
[14:21:52] so it looks like Wikidata is at 2022-07-25
[14:22:15] let me now check our DAG
[14:24:19] I shall manually run the DAG, as it was interrupted due to the latest bug
[14:31:38] once the DAG is run, are we able to do the automation for future imports, including next week?
[14:32:02] ok running now, everything should be available in the next couple of hours or so
[14:32:40] almost 3 hours actually to get to the search indices
[14:34:27] dcausse: are you still there? you seem frozen
[14:35:52] yes, the automation should be doable for future runs
[14:36:18] I guess the real question is if we have time to finish the automation today
[14:39:59] If the latest pipeline run succeeds, then Erik will see the latest snapshot available at a reasonable time of his day today
[14:41:12] ok, fingers crossed!
[15:14:18] \o
[15:15:26] o/
[15:16:49] so if i understand right, we have our weekly shipping execution_date on sundays and we expect to ship that day. The image suggestions have their execution_date on mondays, but don't expect data to be ready until friday or so?
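The "schedule on mondays and rely on the sensor to retry" approach mentioned above can be sketched in plain Python. This is not the real Airflow sensor API, just an illustration of the poke-until-timeout behavior; `exists`, the interval, and the 7-day default timeout are illustrative:

```python
import time

def wait_for_partition(exists, poke_interval_s=60, timeout_s=7 * 24 * 3600):
    # Sketch of what an Airflow sensor does here: poke for a Hive
    # partition until it shows up or the (default 7-day) timeout
    # elapses. `exists` is a caller-supplied probe function.
    waited = 0
    while waited < timeout_s:
        if exists():
            return True
        time.sleep(poke_interval_s)
        waited += poke_interval_s
    return False
```

With a long enough timeout, a DAG scheduled on Monday can safely wait for a snapshot that only lands in Hive a few days later.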
so in theory we can wait for data paths based on date - 6d
[15:16:58] ebernhardson: rapidly pushed https://gerrit.wikimedia.org/r/c/wikimedia/discovery/analytics/+/820763 but I'm unsure about reusing the 'weekly' section so introduced a new entry in convert_to_es
[15:17:13] i suppose it's fine to have another one
[15:17:40] and relying on the default 7 days timeout of the sensor to align exec dates
[15:19:01] sigh, I can't seem to be able to run the tests locally :/
[15:19:15] using the tox-pyspark image?
[15:19:39] dcausse: tox-pyspark:0.7.0-s2 is what i'm using locally
[15:20:13] me too :/
[15:20:17] Erik: that's correct from the image suggestions side
[15:21:05] hm.. subgraph_and_query_mapping.meta.json: error: Error reading JSON file; you likely have a bad cache.
[15:21:37] since we heavily rely on Wikidata, we wait for its snapshot to become available via Airflow sensors
[15:22:01] mfossati_: makes sense, we have the same craziness trying to wait for wikidata pieces
[15:23:09] dcausse: just pulled the patch and ran, seems to work ok inside the image locally. Randomly guessing.. unstaged files it doesn't like? i dunno
[15:23:15] getting this, https://phabricator.wikimedia.org/P32293 not sure yet to understand what's wrong
[15:23:57] tried removing .tox and ~/.mypy_cache
[15:24:00] hmm, i suppose i'm re-using the existing .tox, lemme try letting it start from scratch
[15:24:48] dcausse: historically, sometimes pip makes changes that are incompatible and it needs a `pip install --upgrade pip` before things will work, but i haven't seen those kinds of errors in awhile and they were usually more to do with it not being able to figure out what to install
[15:25:09] ah, trying that then
[15:26:13] hm...
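The "date - 6d" idea above (a Sunday execution date on the shipping side lining up with the previous Monday's image-suggestions snapshot) amounts to a small date calculation. A hypothetical helper, not code from the patch:

```python
from datetime import datetime, timedelta

def snapshot_for_run(execution_date: datetime) -> str:
    # Hypothetical helper: the weekly shipping run executes on Sundays,
    # while the image-suggestions DAG runs on Mondays, so waiting on a
    # data path at "execution_date - 6 days" points a Sunday run at the
    # previous Monday's snapshot partition.
    snapshot = execution_date - timedelta(days=6)
    return snapshot.strftime("%Y-%m-%d")

# Sunday 2022-07-31 run -> Monday 2022-07-25 snapshot
print(snapshot_for_run(datetime(2022, 7, 31)))
```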
OSError: [Errno 13] Permission denied: '/home/nomoa'
[15:26:34] hrm
[15:26:56] using: docker run -it --rm -e XDG_CACHE_HOME=/src/.cache -v /etc/passwd:/etc/passwd:ro -v /etc/group:/etc/group:ro -v $PWD:/src:rw --user $(id -u):$(id -g) --entrypoint /bin/bash docker-registry.wikimedia.org/releng/tox-pyspark:0.7.0-s2
[15:27:32] dcausse: looks the same as I'm using, on a linux machine that should be aligning all the uid/gid for permissions appropriately. hmm
[15:27:48] it does not mount /home tho
[15:27:53] pip might want to install there
[15:28:32] oh, yea perhaps. I actually dunno if i've ever run the pip upgrade inside the container, but on the other hand i'm seeing it work fine locally without upgrading pip. Maybe try deleting the .cache directory?
[15:28:52] dcausse: .cache/pip/ might have a bad zip file i suppose
[15:29:15] yes dropped this .cache dir,
[15:29:45] seems to download stuff now
[15:30:19] another thing to try is `git clean -nfd` which should make it like a fresh clone and delete even gitignore'd things
[15:31:36] err, the -n is dry run, have to remove that to actually do anything, so `-nfd` to report what will be done, `-fd` to actually do it. might not be necessary (or could always re-clone to somewhere else and test)
[15:32:01] ok thanks! finally could get the fixture created
[16:00:11] ebernhardson: is it safe to merge https://gerrit.wikimedia.org/r/c/wikimedia/discovery/analytics/+/820763 today? And get that data pipeline automated before you and David are on vacation?
[16:01:36] gehel: we are looking at it now
[16:01:47] great! Thanks!
[16:27:10] mfossati_: is your job still running?
[16:27:31] trying to find something in yarn but not sure I can identify it
[16:30:29] ah must be https://yarn.wikimedia.org/cluster/app/application_1655808530211_285146
[16:32:52] dcausse: seems plausible, i think they trigger from here: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blob/main/platform_eng/dags/image-suggestions_dag.py
[16:33:57] thanks! gather_search_indices_dataset should be right after gather_cassandra_dataset
[16:44:16] I wish spark could know how many jobs are remaining
[16:47:03] yea it's pretty bad about telling whats coming up
[16:47:26] and yarn does not seem to keep a lot of old runs
[17:16:46] this wcqs thing is mysterious, here is a request that times out with a 500: curl -vv -XPOST http://localhost/some-invalid-url -d 'foo=bar'
[17:16:52] (from a wcqs server)
[17:17:37] there's a job wdqs_streaming_updater_reconcile_hourly it's waiting for some partitions (year=2022/month=08/day=04/hour=17) but they're not there...
[17:18:05] it's still waiting but I'm afraid this partition will never be around, canary events might have stopped during that period
[17:18:20] dcausse: some of the other hourly jobs that read eventgate also got stuck, in a manual check of partitions we had eqiad but not codfw, so i told airflow to pretend everything is fine (mark the wait_for_* tasks as success)
[17:18:34] since i'm expecting the codfw partitions to be empty
[17:18:41] I guess that's why you want to setup lower timeouts here :/
[17:19:28] i suppose in your case it's not as clear codfw will be empty, the updaters emit things independently ?
[17:19:29] ok thanks did that
[17:19:56] it emits stuff but rarely so it depends on canary events to be sent
[17:20:02] ahh ok
[17:21:13] if 7 days is the default timeout it's not great for hourly stuff
[17:21:31] and it depends on past so everything is stuck
[17:22:07] hmm, but the sla should still poke us that things aren't progressing as normal?
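The header-whitelisting idea floated later in the conversation (don't forward a POST's `Content-Length` to the body-less oauth subrequest, which leaves jetty waiting for a body that never arrives) could look roughly like this. The function name and the allow-list contents are hypothetical, not the actual wcqs/nginx configuration:

```python
def auth_subrequest_headers(request_headers: dict) -> dict:
    # Hypothetical allow-list sketch: forward only the headers the oauth
    # check actually needs, so body-related headers like Content-Length
    # never reach the subrequest.
    ALLOWED = {"cookie", "authorization", "x-forwarded-for"}
    return {k: v for k, v in request_headers.items() if k.lower() in ALLOWED}

print(auth_subrequest_headers(
    {"Cookie": "session=abc", "Content-Length": "7", "Authorization": "Bearer x"}
))
```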
[17:22:20] yes it did, just realized that now :)
[17:22:23] i suppose some of the sla's are too chatty though, i should extend the sla on subgraph_*
[17:22:59] makes sense
[17:30:43] going out for dinner, will check the imagesuggestion job later tonight
[17:30:48] kk
[17:34:36] oh! the timeout is (probably) because nginx is forwarding the `Content-Length` header to /oauth/check_auth, which causes jetty to wait around and never start the request, and then fail it because no content ever arrived
[17:35:04] wonder if we can whitelist the headers we send on to the oauth endpoints and avoid surprises
[17:51:48] it mostly means wcqs has never worked properly with POST'd queries, and the UI falls back to POST under a few circumstances
[18:11:03] early lunch
[19:05:25] back
[21:33:12] hmm activating the dag did not trigger it, I might have done something wrong
[21:38:48] triggered it manually
[21:54:26] hmm, probably should have, will try and look at why
[21:54:49] checked two examples and the data seems to have been shipped but I'm not convinced that automation will work
[21:55:19] sudo -u analytics-search airflow next_execution image_suggestions_weekly says 2022-07-31 00:00:00+00:00
[21:55:46] I'd have expected august 1st (monday) not a sunday
[21:56:50] triggered the first run with sudo -u analytics-search airflow trigger_dag -e "2022-07-25T00:00:00+00:00" image_suggestions_weekly
[21:57:22] hmm, indeed
[21:58:18] not sure why, thats certainly odd
[21:59:28] dcausse: oh, because @weekly is a specific cron alias string, we need to manually configure it
[21:59:41] dcausse: https://airflow.apache.org/docs/apache-airflow/1.10.1/scheduler.html - @weekly Run once a week at midnight on Sunday morning `0 0 * * 0`
[21:59:49] sigh...
[22:01:07] ok... I see why it did not trigger then...
[22:00:05] dcausse: i think we can timedelta(days=7) there instead
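The `@weekly` pitfall above can be demonstrated with a small weekday calculation: `@weekly` is the cron alias `0 0 * * 0` (midnight Sunday), while the image-suggestions DAG fires on `0 0 * * 1` (midnight Monday). The helper below only models the weekday field and the 2022-07-28 date is illustrative:

```python
from datetime import datetime, timedelta

def next_fire(after: datetime, cron_weekday: int) -> datetime:
    # Minimal model of a `0 0 * * <weekday>` schedule: find the next
    # midnight strictly after `after` whose weekday matches.
    # cron weekday numbering: 0 = Sunday, 1 = Monday.
    day = (after + timedelta(days=1)).replace(
        hour=0, minute=0, second=0, microsecond=0)
    while (day.isoweekday() % 7) != cron_weekday:
        day += timedelta(days=1)
    return day

after = datetime(2022, 7, 28)      # illustrative "now"
print(next_fire(after, 0))         # @weekly   -> Sunday 2022-07-31
print(next_fire(after, 1))         # 0 0 * * 1 -> Monday 2022-08-01
```

This matches the surprise in the log: `airflow next_execution` reported Sunday 2022-07-31 rather than the expected Monday.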
[22:01:17] or match whatever they did in the other repo, checking
[22:01:56] dcausse: they use `0 0 * * 1`, might as well copy them i suppose
[22:02:15] ok
[22:02:47] can delete the dag from airflow ui and see if it then creates it appropriately next time around, the actual update should all noop and not really matter that it gets shipped twice
[22:03:03] oh, but it will fail the first time because the part that writes out text files can't overwrite, has to be manually deleted from hdfs
[22:03:14] (i can do all that, i imagine it's late for you :)
[22:03:40] no worries :)
[22:05:47] I guess I might not need to re-run 2022-07-25? if next_execution is 2022-08-01 it should be fine?
[22:07:46] yes we can probably simply change the schedule and trust the next_execution date reports correctly
[22:09:48] ok shipping this small change first and I'll check
[22:11:38] the exec date in the fixture file is already a bit better
[22:12:42] oh should have looked closer at that, it was telling us but didn't notice
[22:13:47] might add a test that the first scheduled date matches the start_date, would error a bit more obviously. I suppose there might be reasons to not have them match but can't think of one atm
[22:17:06] me too for some reason I thought that exec dates were set in stone in the tests and didn't pay much attention
[22:18:24] they were initially, but then after trying to figure out the wikidata time stuff i changed it to always take the first execution date of the dag for fixture rendering
[22:19:46] oh right, that's quite helpful
[22:21:54] next_execution is correct now
[22:21:59] thanks for the help!
[22:22:33] np
[22:22:45] going offline, enjoy your vacations!
[23:03:44] taking off as well
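The test suggested at 22:13:47 (fail loudly when a DAG's first scheduled date can't match its start_date) could be sketched as a weekday check; the function name is hypothetical and real Airflow DAGs would need the full cron expression compared, not just the weekday:

```python
from datetime import datetime

def first_run_matches_start(start_date: datetime, cron_weekday: int) -> bool:
    # Sketch of the proposed test: for a weekly `0 0 * * <weekday>`
    # schedule, the first run lands on start_date only if start_date's
    # weekday matches the schedule's weekday (cron: 0 = Sunday).
    return (start_date.isoweekday() % 7) == cron_weekday

# Monday 2022-07-25 start with `0 0 * * 1` lines up;
# the same start with @weekly (`0 0 * * 0`) would not.
print(first_run_matches_start(datetime(2022, 7, 25), 1))
print(first_run_matches_start(datetime(2022, 7, 25), 0))
```

Such a check would have flagged the `@weekly` mismatch at fixture-rendering time instead of at the first missed trigger.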