[08:34:32] cleared a bunch of airflow tasks waiting for cirrus backend requests... seems like the data arrived very late for some partition recently
[10:04:51] lunch + errand
[13:15:36] o/
[14:01:50] o/
[14:22:46] .o/
[15:09:30] ryankemper we're in standup if you're around
[15:37:58] heading out, back later tonight
[17:44:14] dcausse what is the oldest streaming updater data that might be useful? Context is T377772 ... thinking of just writing a script that will set TTLs for all the objects
[17:44:14] T377772: RdfStreamingUpdaterSpaceUsageTooHigh - https://phabricator.wikimedia.org/T377772
[17:54:26] Hi! I'm trying to test the behavior of the spark-kafka-sink in a highly parallel DAG. IIUC, doing so locally only gives me thread-wise parallel execution. So forcing a repartition to the value of spark.default.parallelism should result in those partitions being processed in parallel, right?
[17:55:20] I still see only one Kafka producer get created, which is what we want, but I don't trust my setup
[18:09:23] lunch, back in ~40
[18:41:42] https://docs.aws.amazon.com/AmazonS3/latest/userguide/intro-lifecycle-rules.html looks like S3 supports lifecycle rules; not sure if it's implemented in our object store, but I will check it out
[20:39:26] inflatador: unfortunately, with our current snapshot strategy it's hard to settle on a fixed TTL; some files might get re-used across checkpoints in a folder named "shared", IIRC...
[20:41:45] dcausse would a stupidly long TTL (say, a year) be acceptable?
[20:44:20] inflatador: we'd have to analyse some snapshots I think, or perhaps set up some alerts to force us to do a restart with savepoint every X months (a restart with savepoint should start a "fresh" checkpoint)
[20:46:21] dcausse ACK. I thought the snapshots weren't useful after the Kafka topic retention period?
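The lifecycle-rule idea discussed above (a year-long TTL on checkpoint objects) could be sketched roughly as follows. This is a minimal sketch: the rule ID, bucket prefix, and bucket name are hypothetical, and whether the object store in question implements S3 lifecycle rules at all is exactly the open question from the chat.

```python
import json

# A 1-year expiration rule (the "stupidly long TTL" idea). The prefix and
# rule ID here are hypothetical placeholders, not the real layout.
lifecycle_config = {
    "Rules": [
        {
            "ID": "expire-old-flink-checkpoints",
            "Status": "Enabled",
            "Filter": {"Prefix": "checkpoints/"},
            "Expiration": {"Days": 365},
        }
    ]
}

# With boto3 this would be applied roughly as (untested sketch):
#   boto3.client("s3", endpoint_url=...).put_bucket_lifecycle_configuration(
#       Bucket="some-bucket", LifecycleConfiguration=lifecycle_config)
print(json.dumps(lifecycle_config, indent=2))
```

Note the caveat raised later in the log: files under the "shared" folder can be re-used across checkpoints, so a blanket TTL risks deleting objects that are still referenced unless a restart with savepoint resets the checkpoint lineage first.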
[20:46:33] or checkpoints, I should say
[20:48:52] pfischer: I believe that you need to set --num-executors and/or --executor-cores to make sure that multiple partitions are processed in parallel. spark.default.parallelism or spark.sql.shuffle.partitions is, I believe, the default number of partitions that are created when doing distributed operations; if spark.sql.shuffle.partitions is 200 and --num-executors is 1, you'll get only one
[20:48:54] partition processed at a time
[20:51:25] inflatador: there are some files that are re-used across checkpoints; if you look at the checkpoint dir there should be a "shared" folder in the parent dir of the checkpoint itself. It's for these files that I believe it's hard to settle on a TTL. We can discuss this more tomorrow if you want?
[20:52:30] dcausse sure, no hurry
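The parallelism point made above boils down to arithmetic: the number of partitions processed simultaneously is bounded by the number of task slots (executors times cores per executor), not by the partition count itself. A small sketch of that reasoning, assuming the usual one-task-per-core Spark scheduling model (function names are my own, not Spark APIs):

```python
def max_concurrent_tasks(num_executors: int, executor_cores: int) -> int:
    """Upper bound on partitions processed at once: one task per core slot."""
    return num_executors * executor_cores

def wave_count(num_partitions: int, num_executors: int, executor_cores: int) -> int:
    """How many task 'waves' are needed to process every partition."""
    slots = max_concurrent_tasks(num_executors, executor_cores)
    return -(-num_partitions // slots)  # ceiling division

# The scenario from the chat: 200 shuffle partitions but a single
# one-core executor means only one partition runs at a time.
print(max_concurrent_tasks(1, 1))  # 1
print(wave_count(200, 1, 1))       # 200
print(wave_count(200, 4, 2))       # 25
```

So repartitioning to spark.default.parallelism only helps if there are enough executor cores to run those partitions concurrently; with a single slot, the repartition just changes how many sequential waves there are.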