[06:49:38] o/
[07:31:55] o/
[13:18:02] o/
[13:18:16] thanks dcausse for figuring out the quorum stuff! I went ahead and closed the task
[13:18:50] o/
[13:19:06] inflatador: np! I filed T402627 as a possible follow-up on our side
[13:19:07] T402627: Stop using auto_expand_replicas on indices hosted by the cirrussearch cluster - https://phabricator.wikimedia.org/T402627
[13:48:29] \o
[13:50:49] indeed, I doubt we need auto_expand_replicas; it can just be a constant replica count
[13:52:26] separately, how do we feel about dump file counts? I'm pretty sure I could get pyspark to emit a single file for each index, but do we still want 50gb dump files?
[13:52:55] I'm kinda waffling... the argument for large dump files would be that everything stays the same; we could even rename them to match what existed. But is that important?
[14:06:11] o/
[14:07:03] ebernhardson: not sure that's important, but 50M chunks is definitely not great
[14:08:06] how many files is that for commons with default spark settings?
[14:10:27] the full dump is 33k files with no adjustment, 28k files if I .coalesce(4000); a .coalesce(1000) fails. Just commonswiki I think was around 1300
[14:10:41] (might also suggest the source dump script could use bigger batches)
[14:11:24] I'm playing with a thing right now which can generate a single file each, although not sure if spark will blow up or not when asked to write a 50gb file. Hopefully it's doing reasonable batches, but dunno
[14:11:56] the main idea would be to load and process a bunch of dataframes in parallel, one for each index, instead of trying to do them all in a single dataframe with output partitioning
[14:13:53] and varying the number of files per wiki with some size guesstimator is probably hard?
[14:16:25] depends how we do it. We could read the cirrus_index_without_content table to get counts; possible, but seems a bit roundabout.
We could check `df.inputFiles()` and estimate size based on the number of files in the partition
[14:16:39] we could directly count the input, but since it's avro it would have to read the full dataset; it can't just return a count
[14:17:40] there's the _without_content table, which is parquet, but that requires another dataset dependency
[14:18:05] yeah, it can certainly count from there, but I suppose it just feels awkward to check a second table. We do know it should match, though
[14:18:34] if we split into files, what would be the target size? Maybe a gb-ish per file?
[14:19:21] yes? hopefully most small wikis would end up being one file
[14:21:12] also, do we care to generate actual bulk requests? If we're doing a breaking change anyway, could we just skip the first line and keep only the data?
[14:23:44] well... it's convenient to be able to just pipe the dump to elastic, and filtering those metadata lines seems easy enough
[14:24:38] hard to say... indeed it's much easier to not generate that format
[14:24:44] well, maybe not much, but a bit
[14:26:53] I'm fine either way; if that saves you some transformation steps I'm all for it
[14:28:48] the difference is I think we can mostly ask spark to format json (if we throw out extra_cols; getting those cols into the schema before serializing would be annoying)
[14:29:00] currently I use a python udf to format json, since it needs to be two lines
[14:29:37] which also makes it trivial to stick extra_cols back in, although unsure if that's needed. It's kinda nice because it means if we add cols but haven't (yet) put them in the hive schema, they are still dumped
[14:31:31] hm.. re extra_cols I don't have much of an opinion on this... would that be an "extra_cols" field, or would you put them back top-level?
[14:58:48] with native spark methods there would be an extra_cols field, and it would be a json-encoded string nested in a json-encoded string.
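[editor's note: the size-based "guesstimator" idea from the 14:13-14:19 exchange could be sketched roughly as below. This is a hypothetical illustration, not the actual dump job's code; the 1 GB target comes from the conversation, everything else (names, the `df.inputFiles()` estimate) is assumed.]

```python
# Hypothetical sketch: derive a per-wiki .coalesce() target from an
# estimated input size, aiming for roughly 1 GB per output file so
# small wikis end up as a single file.

GB = 1024 ** 3

def target_file_count(estimated_bytes: int, target_file_bytes: int = GB) -> int:
    """Return how many output files to coalesce to, never fewer than 1."""
    return max(1, round(estimated_bytes / target_file_bytes))

# With a live SparkSession, the estimate could come from the input file
# listing, e.g. (illustrative only):
#
#   est_bytes = sum(hdfs_size(path) for path in df.inputFiles())
#   df.coalesce(target_file_count(est_bytes)).write.text(out_path)
```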
To bring it into the top level would require making the schema include all the cols, so probably not worth it
[14:58:57] but when formatting from a pyspark udf we can do whatever; it receives the row and returns a string
[15:02:21] ebernhardson: I am somewhat stuck on testing RestExternalTaskSensor in Airflow. The test_dag_structure, in particular test_external_task_exists, assumes that the external_dag_id can be looked up in all_dag_ids, but 'all' appears scoped to search-related DAGs. Could this be extended to all dags in the repo?
[15:05:32] pfischer: hmm, yeah, it could be changed I suppose. At a general level the full list of dags is available in the pytest config
[15:06:59] pfischer: if you look in conftest.py you can see that instance_tasks filters from request.config._all_tasks; you need essentially the same but perhaps not filtered, but also including which instance it belongs to?
[15:54:25] heading out, have a nice weekend
[16:04:22] .o/
[16:04:30] workout, back in ~40
[17:25:28] sorry, back now
[18:10:15] lunch, back in ~30-45
[18:58:48] dump is currently running through wikidatawiki, will see how it goes... I'm up to a silly 32gb per executor, 2 cpus per executor. Maybe I can bring it down, but hoping to see it work on wikidata for once...
[19:56:54] the size differences are curious. The stage that reads avro and shuffles reads in twice as much data as it writes, and the stage that reads the shuffle and writes .txt.gz files again reads twice as much data as it writes, suggesting the output files are 1/4 the size of the avro inputs. Seems odd
[20:17:11] I think I'm pretty close to having a working opensearch-operator image. It's a pretty simple golang app, so a bit easier than the OS image itself ;)
[20:17:29] nice!
[20:18:59] yeah, tangentially related, but we might want to help Observability with the OpenSearch packaging... the version of 2.x they're running is pretty old.
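[editor's note: the two-line python udf mentioned around 14:29/14:58 could look something like the sketch below, which emits the elasticsearch bulk format (action metadata line, then the document) and folds extra_cols back into the top level. The field names (page_id, wiki, extra_cols) and the dict-based row are assumptions for illustration, not the real schema or udf.]

```python
import json

def to_bulk_lines(row: dict) -> str:
    """Format one document row as the two-line bulk format:
    an action/metadata line followed by the source document."""
    doc = dict(row)
    doc_id = doc.pop("page_id")   # assumed id column
    index = doc.pop("wiki")       # assumed index-name column
    # extra_cols arrives as a json-encoded string; merging it back in
    # means columns not (yet) in the hive schema still get dumped.
    extra = doc.pop("extra_cols", None)
    if extra:
        doc.update(json.loads(extra))
    action = {"index": {"_index": index, "_id": doc_id}}
    return json.dumps(action) + "\n" + json.dumps(doc)

# In spark this would be registered along the lines of
#   F.udf(to_bulk_lines, T.StringType())
# and applied per row before writing text output.
```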
I haven't looked at their deb pkg setup, but maybe we can work the same magic you did on the plugins repo ;P
[20:34:03] it might be similar-but-different to have a source repo; we'd probably have to look at how they build the package. But probably can
[20:47:22] yeah, it'll probably be easier since upstream provides packages
[20:49:30] ryankemper do you have anything for pairing? I'm just working on the opensearch k8s stuff
[20:59:17] inflatador_: nothing too interesting on my end, just looking at the query from the ticket about those wdqs visualizations that were failing sometimes
[21:01:26] ryankemper if you need a second set of eyes LMK, I'll be around for the next 30m-1h. Otherwise I'm OK w/ skipping
[22:05:01] OK, I've got the helm chart going with our custom OS image and custom operator image... there's still some weirdness around the bootstrap pod, but it's progress. Have a good weekend!