[08:34:32] cleared a bunch of airflow tasks waiting for cirrus backend requests... seems like the data arrived very late for some partition recently
[10:04:51] lunch + errand
[13:15:36] o/
[14:01:50] o/
[14:22:46] .o/
[15:09:30] ryankemper we're in standup if you're around
[15:37:58] heading out, back later tonight
[17:44:14] dcausse what is the oldest streaming updater data that might be useful? Context is T377772 ... thinking of just writing a script that will set TTLs for all the objects
[17:44:14] T377772: RdfStreamingUpdaterSpaceUsageTooHigh - https://phabricator.wikimedia.org/T377772
[17:54:26] Hi! I'm trying to test the behavior of the spark-kafka-sink in a highly parallel DAG. IIUC, doing so locally only gives me thread-wise parallel execution. So forcing a repartition to the value of spark.default.parallelism should result in those partitions being processed in parallel, right?
[17:55:20] I still see only one Kafka producer get created, which is what we want, but I don't trust my setup
[18:09:23] lunch, back in ~40
[18:41:42] https://docs.aws.amazon.com/AmazonS3/latest/userguide/intro-lifecycle-rules.html looks like S3 supports lifecycle rules; not sure if it's implemented in our object store, but I will check it out
[20:39:26] inflatador: unfortunately, with our current snapshot strategy it's hard to settle on a fixed TTL; some files might get re-used across checkpoints in a folder named "shared", IIRC...
[20:41:45] dcausse would a stupidly long TTL (say, a year) be acceptable?
[20:44:20] inflatador: we'd have to analyse some snapshots I think, or perhaps set up some alerts to force us to do a restart with savepoint every X months (a restart with savepoint should start a "fresh" checkpoint)
[20:46:21] dcausse ACK. I thought the snapshots weren't useful after the Kafka topic retention period?
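The lifecycle-rule idea discussed above (a year-long TTL on checkpoint objects) could be sketched roughly as follows. This is a minimal sketch: the rule ID, bucket prefix, and bucket name are hypothetical, and whether the object store in question implements S3 lifecycle rules at all is exactly the open question from the chat.

```python
import json

# A 1-year expiration rule (the "stupidly long TTL" idea). The prefix and
# rule ID here are hypothetical placeholders, not the real layout.
lifecycle_config = {
    "Rules": [
        {
            "ID": "expire-old-flink-checkpoints",
            "Status": "Enabled",
            "Filter": {"Prefix": "checkpoints/"},
            "Expiration": {"Days": 365},
        }
    ]
}

# With boto3 this would be applied roughly as (untested sketch):
#   boto3.client("s3", endpoint_url=...).put_bucket_lifecycle_configuration(
#       Bucket="some-bucket", LifecycleConfiguration=lifecycle_config)
print(json.dumps(lifecycle_config, indent=2))
```

Note the caveat raised later in the log: files under the "shared" folder can be re-used across checkpoints, so a blanket TTL risks deleting objects that are still referenced unless a restart with savepoint resets the checkpoint lineage first.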
[20:46:33] or checkpoints, I should say
[20:48:52] pfischer: I believe that you need to set --num-executors and/or --executor-cores to make sure that multiple partitions are processed in parallel. spark.default.parallelism or spark.sql.shuffle.partitions is, I believe, the default number of partitions that are created when doing distributed operations; if spark.sql.shuffle.partitions is 200 and --num-executors is 1, you'll get only one
[20:48:54] partition processed at a time
[20:51:25] inflatador: there are some files that are re-used across checkpoints; if you look at the checkpoint dir there should be a "shared" folder in the parent dir of the checkpoint itself. It's for these files that I believe it's hard to settle on a TTL. We can discuss this more tomorrow if you want?
[20:52:30] dcausse sure, no hurry
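The parallelism point made above boils down to arithmetic: the number of partitions processed simultaneously is bounded by the number of task slots (executors times cores per executor), not by the partition count itself. A small sketch of that reasoning, assuming the usual one-task-per-core Spark scheduling model (function names are my own, not Spark APIs):

```python
def max_concurrent_tasks(num_executors: int, executor_cores: int) -> int:
    """Upper bound on partitions processed at once: one task per core slot."""
    return num_executors * executor_cores

def wave_count(num_partitions: int, num_executors: int, executor_cores: int) -> int:
    """How many task 'waves' are needed to process every partition."""
    slots = max_concurrent_tasks(num_executors, executor_cores)
    return -(-num_partitions // slots)  # ceiling division

# The scenario from the chat: 200 shuffle partitions but a single
# one-core executor means only one partition runs at a time.
print(max_concurrent_tasks(1, 1))  # 1
print(wave_count(200, 1, 1))       # 200
print(wave_count(200, 4, 2))       # 25
```

So repartitioning to spark.default.parallelism only helps if there are enough executor cores to run those partitions concurrently; with a single slot, the repartition just changes how many sequential waves there are.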