Core, Puffin: Add DV file writer#11476
Conversation
|
|
||
| private final OutputFileFactory fileFactory; | ||
| private final Function<CharSequence, PositionDeleteIndex> loadPreviousDeletes; | ||
| private final CharSequenceMap<Deletes> deletesByPath = CharSequenceMap.create(); |
There was a problem hiding this comment.
We need to discuss whether we should accept String or CharSequence in DV writers. We migrated to String for our new DataFile and DeleteFile fields but it is a different use case. We don't have to necessarily serialize these values. If an engine supports custom CharSequence objects, it may benefit from this API and skip the conversion.
It has no benefits for Spark, though. We have to convert to String.
There was a problem hiding this comment.
I think that String makes more sense for now because:
- to my knowledge, most of query engines support
String - if a query engine needs
CharSequence, the conversion is not very painful
There was a problem hiding this comment.
I'm not aware of any engine that would take advantage of it here, but I'm not sure it really causes any significant complexity to maintain as CharSequence. I'm fine either way.
There was a problem hiding this comment.
Avro Utf8 is the only example I remember.
There was a problem hiding this comment.
I agree. It's probably a good idea to use String. We can always move to CharSequence later with minimal changes.
There was a problem hiding this comment.
+1 on using String, which I've also mentioned in https://github.com/apache/iceberg/pull/11302/files#r1824137648
There was a problem hiding this comment.
Sounds good. Let's use String. It will have less overhead for engines like Spark that use String.
| import org.apache.iceberg.StructLike; | ||
|
|
||
| /** Copy the StructLike's values into a new one. It does not handle list or map values now. */ | ||
| class StructCopy implements StructLike { |
There was a problem hiding this comment.
I need to move this into a utility. Just not sure which at this point. Ideas are welcome.
There was a problem hiding this comment.
What about a "generic" IoUtil ?
There was a problem hiding this comment.
Does this even belong in io? This feels like it should be in types.TypeUtil
There was a problem hiding this comment.
I'd say it belongs to StructLikeUtil or something like this. Doesn't seem to belong to io.
There was a problem hiding this comment.
+1 for StructLikeUtil or similar.
@jbonofre, the io classes typically refer to FileIO and buffer management utils in Iceberg. I'm not quite sure why this is currently in the io package. I think it's probably better to not continue using io so that it's easier to find later.
There was a problem hiding this comment.
Good point. I like the StructLikeUtil name.
+1 for that. Thanks !
There was a problem hiding this comment.
Moved the logic to the util and deprecated this class not to break anyone.
| } | ||
|
|
||
| @TestTemplate | ||
| public void testBasicDVs() throws IOException { |
There was a problem hiding this comment.
I'll add more once we support DV reads.
| return new Blob( | ||
| StandardBlobTypes.DV_V1, | ||
| ImmutableList.of(MetadataColumns.ROW_POSITION.fieldId()), | ||
| -1 /* snapshot ID is inherited */, |
There was a problem hiding this comment.
As per V1 Puffin spec.
| * @param spec the data file partition spec | ||
| * @param partition the data file partition | ||
| */ | ||
| void write(CharSequence path, long pos, PartitionSpec spec, StructLike partition); |
There was a problem hiding this comment.
Little bit of a nit here, but does write make sense? We're actually marking for delete. FileAppender and most "Writers" use add, should this just be delete?
There was a problem hiding this comment.
Agree. What about DVFileWriter as the name? That seems OK but if there are alternatives, let me know.
There was a problem hiding this comment.
@danielcweeks from my standpoint, delete() would be confusing. As the class is named DeleteVectorFileWriter (yes, I would prefer a full name instead of DV), I think it makes sense to use write() there.
There was a problem hiding this comment.
In my view, delete makes sense now as previously we passed PositionDelete object and used the common writer abstraction. That said, I am OK either way.
I do want to keep the shorter class name. I use DV naming consistently throughout the code and it allows me to stay on one line most of the time. Our 100 character line limit is no joke!
52c0f37 to
e95cf2c
Compare
| return StructCopy.copy(struct); | ||
| } | ||
|
|
||
| private static class StructCopy implements StructLike { |
There was a problem hiding this comment.
Copied the existing class but made private. The old one is deprecated and will be removed in 1.9.
| this.tableDir = Files.createTempDirectory(temp, "junit").toFile(); | ||
| assertThat(tableDir.delete()).isTrue(); // created during table creation | ||
|
|
There was a problem hiding this comment.
| this.tableDir = Files.createTempDirectory(temp, "junit").toFile(); | |
| assertThat(tableDir.delete()).isTrue(); // created during table creation |
FYI this typically isn't needed, because tableDir is already initialized in the super class. See also #11460 where existing tests have been refactored to remove this unnecessary re-initialization
There was a problem hiding this comment.
Oh, good call. Missed that change!
There was a problem hiding this comment.
Also removed the metadata dir and empty line.
|
Thanks for reviewing, @nastra @danielcweeks @jbonofre @rdblue! |
|
Hi team, may I ask a dummy question? My confusion lies in:
Appreciate any answer! I could tell iceberg is evolving really fast, so as a newcomer I'm kinda lost in the (multiple) proposals and different versions of implementation. |
@dentiny yes this repository contains the java implementation. You might want to start browsing the code around |
Thank you for the reply and code pointer! |
This PR adds a base DV file writer with basic tests. More tests to come once we support DV reads.
This work is part of #11122.