Core: Add writer for unordered position deletes#7692
Conversation
@Benchmark
@Threads(1)
public void writeUnpartitionedFanoutPositionDeleteWriterShuffled(Blackhole blackhole)
We should expect 5-15% overhead for the new buffering writer, which can still be beneficial for the job overall if we skip the local ordering for inserts and potentially avoid spilling. Note that this benchmark does not account for the cost of ordering records; it measures only write performance. We will use this writer only if fanout is enabled. We should also explore Puffin delete files that would persist bitmaps directly.
Benchmark Mode Cnt Score Error Units
ParquetWritersBenchmark.writeUnpartitionedClusteredPositionDeleteWriter ss 5 6.004 ± 0.185 s/op
ParquetWritersBenchmark.writeUnpartitionedFanoutPositionDeleteWriter ss 5 6.503 ± 0.171 s/op
ParquetWritersBenchmark.writeUnpartitionedFanoutPositionDeleteWriterShuffled ss 5 6.616 ± 0.204 s/op
We should also explore Puffin delete files that would persist bitmaps directly
+1
About memory overhead (not sure anything measures it now): it should be just the additional space of the map (data_file_path => bitmap)? Will there be cases, especially in fanout, where a writer writes deletes for many data files and starts to stress it?
I ran this benchmark (100 data files, 50k deletes each, 5 million deletes total) with a GC profiler and did not see anything bad. Issues will arise when there are lots of unique data files. That's unlikely, as we distribute by partition, and this writer will still be disabled by default, so users will have to opt in explicitly. It isn't perfect for sure, but there are reasonable use cases for it.
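To make the structure under discussion concrete, here is a minimal stdlib-only sketch of a buffering delete writer: deletes arrive in arbitrary order, are buffered per data file path, and are replayed in (path, position) order on close. This is an illustration, not Iceberg's code; the actual writer buffers positions in a Roaring64Bitmap per path, which is far more compact than the TreeSet used here.

```java
import java.util.Map;
import java.util.NavigableSet;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.function.BiConsumer;

// Simplified model of the buffering position delete writer: the map keyed by
// data file path is the extra memory discussed above.
class UnorderedDeleteBuffer {
  private final Map<String, NavigableSet<Long>> deletesByPath = new TreeMap<>();

  // Accepts deletes in any order.
  void delete(String path, long pos) {
    deletesByPath.computeIfAbsent(path, p -> new TreeSet<>()).add(pos);
  }

  // Replays buffered deletes sorted by (path, position), as the delete file
  // format requires. TreeMap iterates paths in order; TreeSet iterates
  // positions in order.
  void close(BiConsumer<String, Long> sortedWriter) {
    deletesByPath.forEach(
        (path, positions) -> positions.forEach(pos -> sortedWriter.accept(path, pos)));
  }
}
```

Memory grows with the number of unique data file paths, which matches the observation that many unique paths (unlikely when distributing by partition) is the stress case.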
Force-pushed 903e827 to 20ebae5.
singhpk234 left a comment:
LGTM, Thanks @aokolnychyi !
szehon-ho left a comment:
Looks good to me too, left a few comments.
}

public PositionDelete<R> set(CharSequence newPath, long newPos) {
  this.path = newPath;
Nit question: would it be cleaner to have this constructor delegate to the other one?
import org.roaringbitmap.longlong.Roaring64Bitmap;

/**
 * A position delete writer that is capable of handling unordered deletes without rows.
Nit: can we add Javadoc to the PositionDeleteWriter when we get a chance?
Will add in this PR.
this.positionDeleteRows =
    RandomData.generateSpark(DeleteSchemaUtil.pathPosSchema(), NUM_ROWS, 0L);
this.positionDeleteRows = generatePositionDeletes(false /* shuffle */);
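For context, here is a hypothetical sketch of what a `generatePositionDeletes(shuffle)` helper could look like; the class name, signature, and constants here are illustrative assumptions, not the benchmark's actual code:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Illustrative delete generation: positions increase by a fixed step within
// each data file; shuffling the combined list simulates unordered input for
// the shuffled benchmark variant.
class PositionDeleteGen {
  static List<long[]> generate(int numFiles, int deletesPerFile, int step, boolean shuffle) {
    List<long[]> rows = new ArrayList<>();
    for (int fileIndex = 0; fileIndex < numFiles; fileIndex++) {
      for (int i = 0; i < deletesPerFile; i++) {
        rows.add(new long[] {fileIndex, (long) i * step}); // {file id, position}
      }
    }
    if (shuffle) {
      Collections.shuffle(rows, new Random(0L)); // fixed seed keeps runs comparable
    }
    return rows;
  }
}
```

Passing `shuffle = false` yields rows already clustered by file and ordered by position, while `shuffle = true` exercises the unordered path the new writer targets.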
for (int pathIndex = 0; pathIndex < NUM_DATA_FILES_PER_POSITION_DELETE_FILE; pathIndex++) {
  UTF8String path = UTF8String.fromString("path/to/position/delete/file/" + UUID.randomUUID());
  int step = 10;
Why not just declare this outside the loop?
Thanks, @singhpk234 @szehon-ho!
This PR adds a position delete writer that can handle unordered position deletes. This writer should allow us to avoid a local sort for some MERGE operations. Specifically, consider MERGE operations where 90% of the data are inserts and the table is partitioned but no sort order is defined. Right now, we always request a local sort to order deletes. However, that sort is useless for inserts if no sort order is defined and the fanout writer is enabled. Moreover, ordering inserts may lead to a spill, which is expensive for wide tables and large tasks.