[07:44:26] sigh... many dags failed their SLA and the SUP producer in eqiad is failing with "java.lang.UnsupportedOperationException: The end timestamp of an event-time window cannot become earlier than the current watermark by merging. Current watermark: 1714929806999 window: TimeWindow{start=1714929430000, end=1714929730000}"
[08:04:56] unsure if this will fix the issue: https://gitlab.wikimedia.org/repos/search-platform/cirrus-streaming-updater/-/merge_requests/124
[08:05:18] but I don't see where we tell the window to route such events to this side output
[08:29:57] w[cd]qs dags unblocked by marking the sensor as success; /wmf/data/raw/event/eqiad.rdf-streaming-updater.lapsed-action/year=2024/month=05/day=04/hour=12 appears to be missing, going to assume that canary events failed again
[08:33:26] image_suggestion is blocked because of https://phabricator.wikimedia.org/T350007
[08:43:19] dcausse: nice Monday morning! Need any help on this?
[08:59:57] gehel: not really, will try to deploy the fix for the SUP and see; for the other issues there's nothing we can do but wait
[09:48:48] sigh... this did not work, still seeing late events being merged and failing the window :(
[09:59:10] seems very similar to https://issues.apache.org/jira/browse/FLINK-22425
[09:59:33] going to pre-filter late events I guess...
[10:16:26] dcausse: Sorry, I missed the review request. I’m here to help now.
[10:16:58] pfischer: no worries! working on another fix
[10:17:36] You want to filter events right after we poll them from kafka?
[10:19:54] pfischer: I was planning to do it right before the window, basically do what "sideOutputLateData" is supposed to do, but manually
[10:22:57] I’m happy to do that.
[10:25:51] I have a quick WIP patch but I'm getting a test failure
[10:31:20] Hm, please let me know if I can help, I’m happy to pair.
[10:34:43] pfischer: https://gitlab.wikimedia.org/repos/search-platform/cirrus-streaming-updater/-/merge_requests/125 still WIP as I got a test failure locally in ConsumerApplicationIT which I haven't had time to investigate
[10:34:45] lunch
[10:39:22] 👀
[11:22:46] I only see ProducerGraphFactoryTest fail because of a serialisation issue (Map is generic and we explicitly disable generic type support)
[12:29:41] ah, it's because my OS runs with a French locale and it fails locally with "Connection reset by peer" != "Connexion ré-initialisée par le correspondant"
[12:36:24] build is green now
[13:04:36] o/
[13:11:19] o/
[13:32:01] pfischer: thanks for merging, will deploy this new image
[14:02:02] workout, back in ~40
[14:46:23] producer seems to be stable again
[14:47:54] seems like windows get assigned and merged before lateness is checked
[14:55:05] back
[15:02:59] dcausse: triage meeting...
[15:51:45] when backfilling, the new late-event filter seemed to have filtered out many events as late, so the dedup&merge window operator did not do as much work as we could have hoped... :(
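For readers following along: the manual pre-filter discussed above amounts to checking each event's timestamp against the current watermark before the event ever reaches the session window, and diverting anything already too late to a side output, roughly what sideOutputLateData would otherwise do after window assignment. Below is a minimal sketch of that idea in Flink Java; the class and parameter names are illustrative assumptions, not the actual cirrus-streaming-updater code:

    import org.apache.flink.streaming.api.functions.ProcessFunction;
    import org.apache.flink.util.Collector;
    import org.apache.flink.util.OutputTag;

    /** Diverts already-late events to a side output before they can reach a merging window. */
    public class LateEventFilter<T> extends ProcessFunction<T, T> {
        private final OutputTag<T> lateTag;
        private final long allowedLatenessMs;

        public LateEventFilter(OutputTag<T> lateTag, long allowedLatenessMs) {
            this.lateTag = lateTag;
            this.allowedLatenessMs = allowedLatenessMs;
        }

        @Override
        public void processElement(T event, Context ctx, Collector<T> out) {
            // currentWatermark() is the watermark seen so far by this subtask; an event
            // whose timestamp is already behind it (beyond the allowed lateness) is the
            // kind of record that makes the merging session window throw, so divert it.
            long watermark = ctx.timerService().currentWatermark();
            Long ts = ctx.timestamp();
            if (ts != null && ts + allowedLatenessMs <= watermark) {
                ctx.output(lateTag, event);
            } else {
                out.collect(event);
            }
        }
    }

Wired just before the keyBy/window step, the main output feeds the window as before and getSideOutput(lateTag) can be routed to whatever late-event handling is wanted. The backfill behaviour noted at 15:51 is the flip side of this approach: if the watermark races ahead of a slower source, the filter diverts many otherwise-mergeable events.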
[15:51:55] https://grafana-rw.wikimedia.org/explore?orgId=1&left=%7B%22datasource%22:%22000000017%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22datasource%22:%7B%22type%22:%22prometheus%22,%22uid%22:%22000000017%22%7D,%22editorMode%22:%22code%22,%22expr%22:%22rate%28flink_taskmanager_job_task_operator_deduplicated%7B%7D%5B10m%5D%29%22,%22legendFormat%22:%22__auto%22,%22range%22:true,%22instant%22:true%7D%5D,%22range%22:%7B%22from%22:%22now-3h%22,%22to%22:%22now%22%7D%7D
[15:51:58] hmm
[15:53:15] not sure I understand the windowing system well enough, but the quick fix I made is probably far from optimal
[15:56:03] sometimes I feel that not using this "high-level" operator would be easier and less error-prone; doing the dedup with a simple process function and a ListState seems easier to me than messing with these window settings
[16:05:30] But how is this metric alarming? Looking at the last 2 days, the deduplication rate is back to its former level.
[16:08:09] Does the event-time window span a true 5’ in a backfill scenario, or would it close once an event with an event-time > start-event-time + 5’ comes in?
[16:08:47] dcausse: a thought re: T349069, one we sort of discussed previously. what about having an airflow'd job that runs loadData.sh on a stats cluster node itself and keeps the last n produced journal files? i guess related to this is taking local backups of the journal files on the wdqs hosts (e.g., in some rotating fashion). or is it preferred to have the actual wdqs nodes doing the heavy lifting of the reload?
[16:08:48] T349069: Design and implement a WDQS data-reload mechanism that sources its data from HDFS instead of the snapshot servers - https://phabricator.wikimedia.org/T349069
[16:10:15] pfischer: I think during the backfill one of the streams (probably cirrus rerenders) had all its events considered late; this is "alarming" because many of those could have been dedupped I think
[16:10:16] pfischer: my understanding is it should be event-time based, so it should flush the old events when it sees a new event with a timestamp 5m later
[16:12:08] somehow the watermark got bumped while there were still events way before that new watermark, not sure how and when...
[16:12:45] probably still hard for flink to deal with streams with varying throughput
[16:15:51] mildly surprised the dedup rate is only 15-20%. Wonder how much trouble that's worth
[16:21:24] Yeah, I recently checked (and added the charts to the SUP dashboard). That’s when I wondered if we could shrink that window after all.
[16:27:16] i guess that's also higher than the metrics david linked earlier, which suggested we would only see 8.8% on edits. But that probably wasn't counting the related-tags events that also get folded in
[16:27:24] https://phabricator.wikimedia.org/T325672#8939471
[16:35:35] dcausse: The watermark should not be affected by a union of sources with different rates. It’s supposed to be the minimum across all streams.
[16:41:13] pfischer: exactly, but that does not seem to be what we saw during the backfill
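As a point of comparison for the 15:56 remark, a dedup built on a plain keyed process function with a ListState would look roughly like the sketch below: buffer events per key, flush them on a single event-time timer, and let downstream code merge the batch. This is only an illustration under assumed names and semantics (the flush interval, the generic key/event types, and the List<T> output are all made up for the example), not the SUP implementation:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.flink.api.common.state.ListState;
    import org.apache.flink.api.common.state.ListStateDescriptor;
    import org.apache.flink.api.common.state.ValueState;
    import org.apache.flink.api.common.state.ValueStateDescriptor;
    import org.apache.flink.api.common.typeinfo.TypeInformation;
    import org.apache.flink.api.common.typeinfo.Types;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
    import org.apache.flink.util.Collector;

    /** Buffers events per key and emits them as one batch per flush interval. */
    public class BufferedDedup<K, T> extends KeyedProcessFunction<K, T, List<T>> {
        private final long flushIntervalMs;
        private final TypeInformation<T> eventType;

        private transient ListState<T> buffer;
        private transient ValueState<Long> flushAt;

        public BufferedDedup(long flushIntervalMs, TypeInformation<T> eventType) {
            this.flushIntervalMs = flushIntervalMs;
            this.eventType = eventType;
        }

        @Override
        public void open(Configuration parameters) {
            buffer = getRuntimeContext().getListState(new ListStateDescriptor<>("buffer", eventType));
            flushAt = getRuntimeContext().getState(new ValueStateDescriptor<>("flushAt", Types.LONG));
        }

        @Override
        public void processElement(T event, Context ctx, Collector<List<T>> out) throws Exception {
            // Late events simply land in the current buffer; there is no window merge to break.
            buffer.add(event);
            // Assumes event-time timestamps are assigned upstream; schedule one flush per key.
            if (flushAt.value() == null && ctx.timestamp() != null) {
                long when = ctx.timestamp() + flushIntervalMs;
                ctx.timerService().registerEventTimeTimer(when);
                flushAt.update(when);
            }
        }

        @Override
        public void onTimer(long timestamp, OnTimerContext ctx, Collector<List<T>> out) throws Exception {
            List<T> events = new ArrayList<>();
            for (T e : buffer.get()) {
                events.add(e);
            }
            buffer.clear();
            flushAt.clear();
            if (!events.isEmpty()) {
                out.collect(events); // downstream deduplicates/merges the batch into one update
            }
        }
    }

Given the serialisation issue mentioned at 11:22 (generic type support is disabled), the List<T> output would also need explicit TypeInformation rather than falling back to generic/Kryo serialisation.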
[16:56:00] ebernhardson: Regarding https://phabricator.wikimedia.org/T358472: All MRs are closed and I tried to find any traces of image_suggestions_weekly in yarn but w/o success. I see many hourly jobs, though, that involve convert_to_esbulk. How can I tell that image_suggestions_weekly is running again?
[16:56:18] pfischer: hmm, lemme see
[16:57:10] pfischer: from the airflow ui it does look stuck, with the latest one failed. Need to clear the most recent one
[16:57:28] at http://localhost:8600/dags/image_suggestions_weekly/grid (assuming forwarding 8600 into an-airflow1005:8600)
[16:58:06] i'm sure i saw the probe skip indicator working somewhere else, i guess i forgot to check this one :S
[16:59:42] hmm, might be something up with the weekly one as well: http://localhost:8600/dags/transfer_to_es_weekly/grid . Suspect what happened here is we have depends_on_past=True, so the first time a new operator is added it has no past and has to be manually set to success
[17:00:07] i wonder why hourly is working though...
[17:20:11] ryankemper just got out of the incident review ritual for https://wikitech.wikimedia.org/wiki/Incidents/2024-04-26_Search_unavailable_for_some_eqiad_users . They had a few questions about SLOs that I wasn't sure about
[17:21:10] inflatador: ack, are the questions written down somewhere? or how should I follow up
[17:22:04] ryankemper I'll send you an email w/Reuven, he had the questions
[17:22:19] sounds good
[17:29:18] ebernhardson: thanks, will look into it after dinner
[18:33:24] beware that the image suggestion pipeline is currently broken because of https://gitlab.wikimedia.org/repos/structured-data/image-suggestions/-/merge_requests/40
[18:59:01] randomly curious: iran is the #2 country for related articles page views, between india and the united states. Seems fishy
[19:56:39] ryankemper: i saw earlier that the import finished. are these the correct boundaries of the run? i got this from a `head` on loadData.log and looking at the last modified time of loadData.log: May 4 02:59 UTC - May 6 16:41 UTC. i think maybe you had some other timer script or a scrollback, but wasn't sure.
[19:58:23] here's the cmd I used: `date | tee /home/ryankemper/loadData.log; time sudo /home/ryankemper/loadData.sh -n wdq -d /srv/T362920/nt_wd_schol -s 0 -e 0 2>&1 | tee -a /home/ryankemper/loadData.log; time sudo /home/ryankemper/loadData.sh -n wdq -d /srv/T362920/nt_wd_schol 2>&1 | tee -a /home/ryankemper/loadData.log`
[19:58:29] Last log line is `>File /srv/T362920/nt_wd_schol/part-01024-46f26ac6-0b21-4832-be79-d7c8709f33fb-c000.ttl.gz not found, terminating`, hmm
[19:59:42] dr0ptp4kt: btw do you remember what the `slightly-modified loadData.sh` in https://phabricator.wikimedia.org/P54284 is referring to? I changed the script to match the file format but that's mentioned explicitly later on in the paste, so I'm wondering if there's some other change I should have made
[20:02:57] dr0ptp4kt: scrollback is in the `ryankemper@wdqs2023` tmux session `data_reload`. looks like the whole scrollback is there, ~7000 lines
[20:05:08] oh and disregard my earlier `hmm` about it ending on 1024; it looks like the last file is `part-01023-46f26ac6-0b21-4832-be79-d7c8709f33fb-c000.txt.gz` so that is expected
[20:23:35] ryankemper: thx. i think it was just the filename pattern (there was a #commented-out FORMAT line and also an extra else branch to say that the file was found for the happy path, but otherwise i think it's all the same)
[20:43:51] ebernhardson: apparently convert_to_esbulk.py returns with a return code == 1. I tried to narrow down the cause but SPARK logs are only visible while the job is running, and when I try to run the spark3-submit command (as shown in the airflow logs) directly (on either stat1008 or an-airflow1005), it fails due to a missing venv. Is there a more convenient way to debug convert_to_es.py via spark3-submit?
[21:04:30] ryankemper: i added the comment to the tix about the running time. wow, that was a lot faster (it actually beat the gaming desktop head-to-head)! thanks! able to rinse and repeat, but with the additional buffer capacity piece? i'm going to be interested to see if this completes in under 2 days.
[21:11:42] oh, heh, my machine is in powersave according to the freshly installed cpufrequtils (the bios is bumped up, but i bet it has the same null effect). may have to try bumping that up for fun
[21:14:47] pfischer: hmm
[21:19:30] pfischer: so this is a bit silly, but... airflow logs have the id application_1713453355160_197949. Querying those logs gives the actual app id, application_1713453355160_197961. These logs are crazy verbose, but if you search for `YarnScheduler` it will jump to where the master is starting up. `^Container` will then jump to the next container after the master, and you can page up to see why
[21:19:32] the master died.
[21:20:13] in there it says `Output directory hdfs://analytics-hadoop/wmf/data/discovery/transfer_to_es/date=20240428/freq=weekly already exists`. Which is an annoying part of spark text file output: there is no overwrite option. Probably we should implement something to mimic an overwrite, because this happens every time
[21:20:37] (assuming transfer_to_es_weekly convert_to_esbulk for 2024-05-05)
[21:20:49] so only the first fail really failed
[21:22:22] as for something easier, there must be a way to load the conda env, or a facsimile, into jupyterlab
[21:25:05] The first invocation in airflow claims it worked, and the airflow log includes `INFO - Marking task as SUCCESS. dag_id=transfer_to_es_weekly, task_id=convert_to_esbulk, ...`, so perhaps it's been trying to re-run and failing simply because the first run put things in there? Not sure why it ran a second time if it worked the first time though
[21:26:47] now that pyarrow is reasonable and working we should be able to programmatically clear out that directory without much fuss to mimic an overwrite i guess
[21:28:51] Alright, so we would no longer rely on firing hdfs cli commands in favour of leveraging the pyarrow lib?
[21:40:39] pfischer: that's what i'm thinking at least, the pyarrow hdfs lib gives a nice api we can use in convert_to_esbulk just before writing, hopefully removing this class of error where it keeps trying to write to a place it already wrote
[21:41:01] the manual fix is to `hdfs dfs -rm ...` the path and re-run airflow
[22:11:25] ryankemper this might be the key to running buildkit on MacOS https://til.codeinthehole.com/posts/how-to-enable-docker-buildkit-in-macos-docker-desktop/ ... trying it now
[22:21:39] nope, still the same error. I pinged in dpe sre slack... guessing b-rouberol knows how to do it, will check in tomorrow
[23:26:22] dr0ptp4kt: yeah, will kick off the next iteration soon