Core: Support committing delete files with multiple specs#2985
protected PartitionSpec writeSpec() {
  Preconditions.checkState(spec != null,
      "Cannot determine partition spec: no data or delete files have been added");
protected PartitionSpec dataSpec() {
Renaming this does require touching more places, but I think keeping it as writeSpec would be confusing.
PartitionSpec fileSpec = ops.current().spec(file.specId());
List<DeleteFile> deleteFiles = newDeleteFiles.computeIfAbsent(file.specId(), specId -> Lists.newArrayList());
deleteFiles.add(file);
addedFilesSummary.addedFile(fileSpec, file);
The file spec is only used for partition summaries. I added a test that shows it works as expected.
szehon-ho left a comment
Took a look, and the change looks good to me.
addedFilesSummary.addedFile(writeSpec(), file);
Preconditions.checkNotNull(file, "Invalid delete file: null");
PartitionSpec fileSpec = ops.current().spec(file.specId());
List<DeleteFile> deleteFiles = newDeleteFiles.computeIfAbsent(file.specId(), specId -> Lists.newArrayList());
Not a big deal, but for me this would be easier to understand if it was deleteFilesForSpec
Sure, I'll update that. You refer to the map name, right?
Updated the map name.
    deleteFile(cachedNewDeleteManifest.path());
  }
}
this.cachedNewDeleteManifests.clear();
Do we need the explicit clear here? Are we just trying to free it up for GC early?
This logic seems a little different than it was previously?
Before
  if committed doesn't contain cachedNewDeleteManifest
    deleteFile()
    clear cachedNewDeleteManifest
After
  for any cachedNewDeleteManifest
    if committed doesn't contain cachedNewDeleteManifest
      deleteFile
  clear all cachedNewDeleteManifests
I'm still trying to understand the check here but it seems like we will clear out all manifests even if some of them are committed?
Seems like the equivalent would be something like
for (ManifestFile cachedNewDeleteManifest : cachedNewDeleteManifests) {
  if (!committed.contains(cachedNewDeleteManifest)) {
    deleteFile(cachedNewDeleteManifest.path());
    // Although this would be modifying the list as we iterate through it, you get the idea
    this.cachedNewDeleteManifests.remove(cachedNewDeleteManifest);
  }
}
You are right, I'll update this place.
Updated to use LinkedList and listIterator.
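For readers following along, removing only the uncommitted entries while iterating can be sketched roughly as below. This is a minimal, self-contained illustration and not the code in this PR: manifests are represented by plain path strings, `CachedManifestCleanup` and `cleanUncommitted` are hypothetical names, and instead of deleting anything the method returns the paths that would be deleted.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.ListIterator;
import java.util.Set;

// Hypothetical helper, not part of Iceberg: shows safe removal during iteration.
class CachedManifestCleanup {

  // Removes every cached path that was not committed and returns the removed paths.
  // Committed paths stay in the cached list.
  static List<String> cleanUncommitted(List<String> cached, Set<String> committed) {
    List<String> deleted = new LinkedList<>();
    ListIterator<String> iter = cached.listIterator();
    while (iter.hasNext()) {
      String path = iter.next();
      if (!committed.contains(path)) {
        deleted.add(path);   // stand-in for deleteFile(path)
        iter.remove();       // safe: removal goes through the iterator
      }
    }
    return deleted;
  }

  public static void main(String[] args) {
    List<String> cached = new LinkedList<>(Arrays.asList("m1.avro", "m2.avro", "m3.avro"));
    Set<String> committed = new HashSet<>(Arrays.asList("m2.avro"));
    List<String> deleted = cleanUncommitted(cached, committed);
    System.out.println("deleted: " + deleted);
    System.out.println("still cached: " + cached);
  }
}
```

Removing through the iterator avoids the `ConcurrentModificationException` that calling `remove` on the list inside a for-each loop would risk.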
44fdaad to b199536
This one is ready for another review round.
for (ManifestFile cachedNewDeleteManifest : cachedNewDeleteManifests) {
  deleteFile(cachedNewDeleteManifest.path());
}
cachedNewDeleteManifests.clear();
Minor: this will rewrite all delete manifests even if there is only one new delete file. I think it's fine to keep it simple for now since we don't expect this case very often, but it would be good to note in a comment that this is something we can improve.
Added a comment. I think this will be rare enough in the real world that it should be fine to optimize later.
@aokolnychyi, this looks good to me. I had a couple of minor comments, but merge when you're ready.
Thanks for reviewing, @szehon-ho @rdblue @RussellSpitzer!
This PR enables committing delete files that belong to different specs in a single operation. Previously, we only supported row deltas where all delete and data files were part of the same spec.
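As a rough illustration of the idea, delete files can be grouped by spec ID with `computeIfAbsent`, as in the snippets discussed in the review. The sketch below is self-contained and makes several assumptions: `DeleteFilesBySpec` and `SimpleDeleteFile` are hypothetical stand-ins (a delete file is reduced to a path and a spec ID), and plain `java.util` collections replace the project's own utilities.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not the PR's implementation: groups delete files by spec ID.
class DeleteFilesBySpec {

  // Simplified stand-in for a delete file: only a path and the ID of its partition spec.
  static final class SimpleDeleteFile {
    final String path;
    final int specId;

    SimpleDeleteFile(String path, int specId) {
      this.path = path;
      this.specId = specId;
    }
  }

  // Spec ID -> delete files written with that spec, in insertion order.
  private final Map<Integer, List<SimpleDeleteFile>> filesBySpec = new LinkedHashMap<>();

  void add(SimpleDeleteFile file) {
    if (file == null) {
      throw new IllegalArgumentException("Invalid delete file: null");
    }
    // Create the per-spec list on first use, then append to it.
    filesBySpec.computeIfAbsent(file.specId, specId -> new ArrayList<>()).add(file);
  }

  Map<Integer, List<SimpleDeleteFile>> grouped() {
    return filesBySpec;
  }

  public static void main(String[] args) {
    DeleteFilesBySpec groups = new DeleteFilesBySpec();
    groups.add(new SimpleDeleteFile("a-deletes.parquet", 0));
    groups.add(new SimpleDeleteFile("b-deletes.parquet", 1));
    groups.add(new SimpleDeleteFile("c-deletes.parquet", 0));
    groups.grouped().forEach((specId, files) ->
        System.out.println("spec " + specId + ": " + files.size() + " delete file(s)"));
  }
}
```

With files grouped this way, each spec's delete files can be handled separately, which is what allows a single operation to carry delete files from more than one spec.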