[10:13:52] <dcausse>	 added a new entry at https://wikitech.wikimedia.org/wiki/Search/WeightedTags#Known_tag_families for "ext.pageassessments.project" context is T378868
[10:13:52] <stashbot>	 T378868: Allow searching articles by WikiProject - https://phabricator.wikimedia.org/T378868
[10:35:49] <dcausse>	 errand+lunch
[14:18:51] <inflatador>	 <o/
[14:20:17] <gehel>	 dcausse: a bit last minute, but I've invited you to the meeting with Web about Article Recommendations
[14:20:52] <dcausse>	 gehel: yes saw this, I should be around
[14:21:03] <dcausse>	 o/
[14:23:08] <gehel>	 thanks!
[14:51:26] <inflatador>	 dcausse might be a few min late to pairing
[14:56:56] <dcausse>	 inflatador: won't be able to attend pairing today :/
[15:00:54] <ebernhardson>	 \o
[15:02:17] <inflatador>	 dcausse ACK, np
[15:02:21] <inflatador>	 \moti wave2
[15:02:24] <inflatador>	 .o/
[15:08:06] <pfischer>	 o/
[15:30:52] <dcausse>	 o/
[16:33:48] <pfischer>	 I am currently looking into T372912, in particular, how to handle large volumes of image recommendations via AirFlow & Spark. So my naive approach would be to split a large dataset into chunks which are mapped to AirFlow tasks which are then processed one after another. Is that possible? I just saw examples of chaining tasks in a loop. My hope would be that this would reduce duplicates in case we have to re-run a DAG due 
[16:33:48] <pfischer>	 to some kind of failure, that is, if it’s possible to resume a DAG or at least re-run it with a different start-offset.
[16:33:48] <stashbot>	 T372912: Migrate image recommendation to use page_weighted_tags_changed stream - https://phabricator.wikimedia.org/T372912
[16:34:49] <ebernhardson>	 pfischer: hmm, how big are the tasks? I suppose normally i would do that level of orchestration in the spark app with executors=1 or some such
[16:35:06] <ebernhardson>	 but yea, doing it inside spark has different guarantees around retrys
[16:39:12] <pfischer>	 ebernhardson: IIUC on average they yield a batch of 90k rows per week. Sure, if it’s possible to achieve such split inside spark I would take that too. So that would look like AirFlow submits a Spark job pointed to a 90k row dataset and the spark job would then somehow split that?
[16:39:15] <ebernhardson>	 i suppose a difficulty of mapping tasks to chunks of the dataset is that airflow doesn't really like you to change the shape of your graph dynamically iirc.  I suppose that's what the loop is aiming to solve? I can't say i've seen loops, not sure how it would work 
[16:40:48] <pfischer>	 Yeah, the samples i saw simply extracted sub-lists from a list in a loop and created tasks for each sub-list. So the shape wouldn’t change once they have been created.
[16:44:06] <ebernhardson>	 pfischer: at a high level, what i've done in other things is to first .count() the dataset to get the size, use that to choose a number of partitions, then use a random repartitioner to get tasks of approx that size.  Unfortunately spark doesn't offer an easy way to know the number of executors, but can inspect config for our typical method of config (dynamic allocation)
[16:44:26] <ebernhardson>	 i think there is similar code in mjolnir...maybe, looking
[16:46:51] <pfischer>	 Okay, that sounds good. So if, with 5 partitions, we fail at partition 3: Could we retry from that partition on? With kafka under the hood we’d assume there’s only one executor with one core running a spark job.
[16:47:08] <ebernhardson>	 hmm, no not really. What mjolnir does instead is repartition things down to the maximum number of parallel processes it's willing to run, so 1 sometimes.
[16:48:17] <ebernhardson>	 in spark if a task fails it will be retried a few times. If it fails enough times it will fail the full application, which when restarted by airflow would start at the begining again. Unfortunately state sharing other than with large datasets isn't too typical 
[16:52:25] <ebernhardson>	 might be ok though, i suppose it depends why it fails? Historically spark things fail because they overuse memory, but hopefully that's not the case here since events should be small. Do validation failures fail the task or just skip?
[16:56:33] <pfischer>	 Hm, okay. Validation failures would fail the application it as of now.
[16:58:01] <ebernhardson>	 hmm, sadly spark doesn't have a nice side output / split like flink. Would have to ponder what to do with those
[17:15:39] <ottomata>	 pfischer: Xabriel and dumps 2 crew have been wondering very similar questions!  I don't think they have a solution, but yall should figure it out together :)
[17:44:12] <inflatador>	 back
[18:29:52] <dcausse>	 dinner
[19:28:05] <inflatador>	 lunch, back in ~40
[20:35:09] <inflatador>	 back
[20:45:43] <ebernhardson>	 somehow i didn't realize before...the existing opensearch deploy runs with the security plugin disabled, so no built-in tls.  We would need to build out the necessary bits for certificates if we want to use them. 
[20:46:02] <ebernhardson>	 Also not clear yet on how that works when migrating a cluster from elasticsearch -> opensearch+security. needs testing
[20:46:35] <ebernhardson>	 curiously, docs claim you can rolling restart elasticsearch->opensearch+security, but if opensearch has security disables you need a full-cluster restart to turn it on
[20:48:24] <inflatador>	 That's interesting...do we have a test env besides relforge where we can practice this? I can stand one up in WMCS if you think it'd be useful
[20:50:06] <ebernhardson>	 i'd probably use docker-compose, should work with tiny instances to make a cluster
[20:50:18] <ebernhardson>	 but maybe thats hard, haven't tried :P
[20:50:34] <inflatador>	 Probably easier than standing up something in WMCS ;P
[20:51:38] <inflatador>	 re: partman recipes, I'm wondering if preserving the data would work? As in, wouldn't the shards already get replaced during the ~45m or so that it takes to reimage the host?
[20:51:57] <ebernhardson>	 oh, i forgot how long it takes....
[20:52:16] <ebernhardson>	 yea, in that case it wouldn't matter. The timeout is configurable but i think it's only a few minutes
[20:54:11] <ebernhardson>	 we will just have to accept a very long rolling restart i guess
[20:56:02] <ebernhardson>	 school run, back in 20-ish
[20:56:30] <inflatador>	 I wonder how safe it would be to leave the cluster in a mixed OS/ES state...not that we want to keep it there forever, but I'd like to understand the risks a bit better
[21:17:50] <ebernhardson>	 back