Core: Fix incorrect selection of incremental cleanup in expire snapshots#13614
Conversation
```diff
@@ -50,7 +50,7 @@ class IncrementalFileCleanup extends FileCleanupStrategy {
   @Override
   @SuppressWarnings({"checkstyle:CyclomaticComplexity", "MethodLength"})
   public void cleanFiles(TableMetadata beforeExpiration, TableMetadata afterExpiration) {
```
I think there's a bigger question of moving away from incremental cleanup and just always doing the reachability analysis in the core cleanup. This is what's done in Spark and other integrations (which is probably why this issue didn't surface for quite some time).
If I recall correctly, the original intent for keeping incremental cleanup was that, at the time, branching/tagging was a pretty new introduction and I didn't want to regress folks' expiration cleanup in the "average" case. However, even beyond branching/tagging there are other constructs on table metadata (like the ability to cherry-pick) that can make incremental cleanup difficult to do safely. Maybe in 2.0 we should consider just removing it and always doing a reachability analysis? cc @rdblue @RussellSpitzer @stevenzwu @aokolnychyi
stevenzwu left a comment:
I like this quick change to prevent corrupted state.
Nit on the commit msg: technically, we didn't fix the incorrect selection. Maybe something like this:
Core: Disable incremental cleanup in expire snapshots when there are multiple refs in the table
```diff
   @SuppressWarnings({"checkstyle:CyclomaticComplexity", "MethodLength"})
   public void cleanFiles(TableMetadata beforeExpiration, TableMetadata afterExpiration) {
-    if (afterExpiration.refs().size() > 1) {
+    if (beforeExpiration.refs().size() > 1 || afterExpiration.refs().size() > 1) {
```
Could this leak files if there is a rollback or a staged (but not referenced) commit?
For instance, if you have snapshots like this:
```
A --- B --- C --- D (main)
 `--- X (unreferenced)
```
If snapshot X should be cleaned up, Iceberg should not use the incremental strategy. Also, if X is not removed but B and C are, then incremental also can't be used because C may have removed files from X.
Instead of checking the number of refs, I think we need to check the history of the main branch. If there is any snapshot outside of the main branch before or after expiration, incremental cannot be used. There are cases where a ref in the main branch's history is safe, but I would probably keep the implementation simpler and avoid checking for things like branch overlap.
What I would do is:
- If there is more than one ref after expiration, use the reference set strategy (optionally, we could check that all refs are in the main branch's history and allow it)
- If there are any snapshots to be removed after expiration that were not the main branch's ancestors, use the reference set strategy
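The rules above can be sketched with a toy snapshot graph. This is a hypothetical illustration only: the id-to-parent map and all names below stand in for Iceberg's real `TableMetadata`/snapshot model and are not from the PR.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model: snapshots are ids, parents maps each id to its parent id.
public class IncrementalSafetyCheck {

    // Walk parent pointers from the main head to collect main's ancestry.
    static Set<Long> mainAncestry(Map<Long, Long> parents, long mainHead) {
        Set<Long> ancestors = new HashSet<>();
        Long id = mainHead;
        while (id != null) {
            ancestors.add(id);
            id = parents.get(id); // null once we pass the root snapshot
        }
        return ancestors;
    }

    // Incremental cleanup is only safe if every snapshot that survives
    // expiration is an ancestor of main (no snapshots outside main).
    static boolean canUseIncremental(
            Map<Long, Long> parents, long mainHead, Set<Long> afterExpiration) {
        return mainAncestry(parents, mainHead).containsAll(afterExpiration);
    }

    public static void main(String[] args) {
        // A(1) -- B(2) -- C(3) -- D(4) on main, with X(9) branched off A.
        Map<Long, Long> parents = new HashMap<>();
        parents.put(2L, 1L); // B -> A
        parents.put(3L, 2L); // C -> B
        parents.put(4L, 3L); // D -> C
        parents.put(9L, 1L); // X -> A

        // X survives expiration: incremental is unsafe. Prints false.
        System.out.println(canUseIncremental(parents, 4L, Set.of(1L, 2L, 3L, 4L, 9L)));
        // Only main ancestors survive: incremental is safe. Prints true.
        System.out.println(canUseIncremental(parents, 4L, Set.of(2L, 3L, 4L)));
    }
}
```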
> we could check that all refs are in the main branch's history and allow it
I agree. It's safest to only allow incremental cleanup when this condition is met. This should also make incremental cleanup safe for a large class of history manipulations that are going to be implemented in the future.
I think in the current implementation we're safe against staged snapshots being cleaned up since the incremental cleanup itself will delete if the manifests are not referenced in an ancestor.
That said, I generally agree that the clearest way to determine whether a reachability analysis should be performed is by analyzing whether any of the snapshots being removed are outside of the main ancestry.
```diff
     if (incrementalCleanup == null) {
-      incrementalCleanup = current.refs().size() == 1;
+      incrementalCleanup = base.refs().size() == 1 && current.refs().size() == 1;
```
If incrementalCleanup is explicitly enabled by withIncrementalCleanup(true), but on this line we already know that it's going to crash, does it make sense to fall back to a full scan with a warning, or just crash here?
I've upleveled some of the validation, but I do think if it's specified through the API option and it's incompatible with the state of the table we should just fail instead of falling back.
I do think that we should probably revisit removing the whole withIncremental option (it's not on the API, but there's complexity in having to be defensive against bad input for an option most users don't even specify). It was added for test comparisons a long time ago, but I think we know that reachability analysis is the safest, and we should probably just let the implementation hide the complexity of cleanup selection.
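The "fail instead of falling back" behavior discussed here can be sketched as below. All names (`resolveIncremental`, the parameters) are hypothetical stand-ins, not Iceberg's actual API; the point is only the shape of the decision: no explicit choice means pick the safe default, while an explicit-but-unsafe request fails loudly.

```java
// Sketch: resolve the incremental-cleanup flag from an optional user
// request plus a precomputed "is incremental safe for this table" bit.
public class CleanupSelection {

    // userRequested is null when the caller never specified the option.
    static boolean resolveIncremental(Boolean userRequested, boolean tableStateAllowsIncremental) {
        if (userRequested == null) {
            // No explicit choice: select the safe strategy automatically.
            return tableStateAllowsIncremental;
        }
        if (userRequested && !tableStateAllowsIncremental) {
            // Explicitly requested but unsafe: fail instead of silently
            // falling back to a full reachability scan.
            throw new IllegalArgumentException(
                "Cannot incrementally clean files: table state requires reachability analysis");
        }
        return userRequested;
    }
}
```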
… fix incremental to always just pick latest from current snapshot on metadata before expiration rather than just try and get the only ref
…not incremental for staged/rollback cases
```java
  @TestTemplate
  public void testExpireOlderThanWithRollbackAndMergedManifests() {
    assumeThat(incrementalCleanup).isFalse();
```
I didn't copy this case again for the incremental failure case, since it's no different from the case above beyond whether manifests are merged or not; the current approach blanket-prevents incremental cleanup in case of rollbacks, so both cases fail the same way.
```java
      cleanupStrategy.cleanFiles(base, current);
    }

  private void validateCleanupCanBeIncremental(TableMetadata current) {
```
Don't we also need to check that the only table ref is main? This checks that there's one ref and only snapshots behind currentSnapshot are expired. But if main doesn't exist, then allRemovedSnapshotsAreInMainAncestry is going to fail with an NPE.
We also need to ensure that there were no non-main snapshots before expiration. Those need to be cleaned up through reachability analysis.
I don't think we'd ever be in a state where only a non-main ref is defined, because at minimum the main branch would always be defined, even if it's empty. So the exactly-one check is OK, I think. I could make it more explicit, though, by comparing that it is main. base.currentSnapshot() would also not be null at this point, since we early-return on 355 if there are no snapshots in the metadata before expiration; there'd be nothing to clean up.
> We also need to ensure that there were no non-main snapshots before expiration. Those need to be cleaned up through reachability analysis.
The ancestry check in the method at 407 would cover this, because it checks that any removed snapshots are not outside the main ancestry.
Ah, but even if a snapshot is not removed we would still need reachability analysis if there's a non-main snapshot, since a staged but not removed snapshot could still reference a file that should not be cleaned up.
Ok, I've expanded this check to be more conservative, but I think it's guaranteed to always be correct.
Incremental cleanup can only be done:
- When there are no specified snapshots to remove
- When there are no non-main refs before OR after expiration (it's a bit overly conservative here, because I could have an aged-off tag on the main lineage which could be cleaned up incrementally. We could probably optimize this by complicating hasOnlyMainBranch a bit by checking the ref types, etc.)
- When there are no non-main snapshots before or after expiration
I've amended this so that we distinguish the validation from before expiration and after expiration a little bit.
Incremental cleanup can only be done when:
- No specified snapshots to remove
- Before expiration there are no branches other than main.
- After expiration there are no refs other than main
- When there are no non-main unreferenced snapshots before or after expiration.
The distinction of allowing tags before expiration is beneficial, since as tags on main's ancestry age out we can still clean up incrementally, assuming the other condition of no other non-main snapshots holds true. The check in 4 technically could already cover what's done in 2, but 2 is a lighter-weight check to verify first (it also helps to have a different error message here, imo), and we fall back to 4 for the staged case.
For the check that there are non-main snapshots, the case where the current snapshot is null is handled by simply checking whether the whole snapshots list is non-empty; if it is non-empty, there are non-main snapshots.
I think maybe we can do a little better. I think your argument for tags on main's ancestors is a good one. We can do that for branches, too.
What about these rules to determine if incremental is safe?
- No specified snapshots to remove
- No snapshots to remove that are not ancestors of main
- No snapshots after expiration that are not ancestors of main
I think the combination of 2 and 3 covers the branch and tag cases. Tags function like branches with just one snapshot, so reasoning about branches covers them. Any branch either has a snapshot outside of main or it doesn't. If it doesn't, then incremental cleanup will work. If there is a snapshot outside of main, then it is either going to be caught by 2 (if it ages off) or 3 (if it does not age off).
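The three rules can be sketched as a single predicate over a toy snapshot graph. Everything here is illustrative (an id-to-parent map instead of Iceberg's real `TableMetadata`; hypothetical names like `incrementalIsSafe`), not the PR's actual implementation.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class IncrementalRules {

    // Collect all ancestors of the main head by walking parent pointers.
    static Set<Long> ancestorsOfMain(Map<Long, Long> parents, long mainHead) {
        Set<Long> result = new HashSet<>();
        Long id = mainHead;
        while (id != null) {
            result.add(id);
            id = parents.get(id);
        }
        return result;
    }

    static boolean incrementalIsSafe(
            Map<Long, Long> parents,
            long mainHead,
            Set<Long> explicitlyRequestedRemovals, // e.g. expireSnapshotId(...) calls
            Set<Long> removedByExpiration,
            Set<Long> remainingAfterExpiration) {
        // Rule 1: no explicitly specified snapshots to remove.
        if (!explicitlyRequestedRemovals.isEmpty()) {
            return false;
        }
        Set<Long> mainAncestors = ancestorsOfMain(parents, mainHead);
        // Rule 2: every snapshot removed by expiration is an ancestor of main.
        // Rule 3: every snapshot surviving expiration is an ancestor of main.
        return mainAncestors.containsAll(removedByExpiration)
            && mainAncestors.containsAll(remainingAfterExpiration);
    }
}
```

A branch or tag whose snapshot lies outside main's ancestry makes either the removed set (rule 2) or the surviving set (rule 3) contain a non-ancestor, which is exactly the argument made above.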
After a lot of sketching out to prove to myself this works, I agree with the simplified algorithm. I went ahead and updated the implementation
…snapshots before or after expiration
```java
  }

  private boolean hasOnlyMainBranch(TableMetadata metadata) {
    return metadata.refs().size() == 1 && metadata.refs().containsKey(SnapshotRef.MAIN_BRANCH);
```
Ok, I forgot that we actually don't return a main ref at -1; no ref would be surfaced. I'll need to fix that as well.
…ther refs for after expiration. Handle empty case properly
```java
    table.manageSnapshots().createTag("TagA", table.currentSnapshot().snapshotId()).commit();
    waitUntilAfter(table.currentSnapshot().timestampMillis());
    RemoveSnapshots removeSnapshots = (RemoveSnapshots) table.expireSnapshots();
```
This had to change because now for this case the logic is more permissive. Previously any ref existing would lead to a failure with incremental cleanup, however with the new logic we'd allow incremental cleanup here since the tag is aged off on main. I updated the test so that it creates a branch and writes to the branch which is then aged off. The check that there are no snapshots being removed outside of main will fail in the new test.
```java
    Set<String> expectedDeleteFiles =
        ImmutableSet.of(table.currentSnapshot().manifestListLocation());
    String tag = "tag";
```
move expectedDeleteFiles closer to where it's actually used
```diff
     LOG.info("Committed snapshot changes");

-    if (cleanExpiredFiles) {
+    if (cleanExpiredFiles && !base.snapshots().isEmpty()) {
```
This seems odd. Shouldn't this happen automatically when cleanExpiredSnapshots() gathers the set of snapshots to clean?
Yeah, it does; I just updated it this way so that the logic short-circuits before that point.
This looks good to me. Thanks, @amogh-jahagirdar!

OK, thanks for the reviews @rdblue @sqd @stevenzwu! I'll go ahead and merge.
Fixes #13568
Currently in the core implementation of file cleanup as part of expire snapshots, it's possible that all non-main references are removed after a snapshot expiration; the decision between incremental cleanup and reachability-based cleanup is simply based on the table metadata after expiration having a single reference. However, this is not correct, since full reachability analysis is required if the table metadata before expiration had other refs.
Additionally, incremental cleanup would previously just get the first ref from the metadata before expiration, assuming there was only one; this can end up being some non-main branch or tag depending on the ordering of the refs() values, making the ancestry calculation for incremental cleanup incorrect.
The following changes are made in this PR:
- Use getOnlyElement so there's an explicit failure in case that invariant is somehow broken
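The getOnlyElement behavior mentioned above (presumably Guava's `Iterables.getOnlyElement`, which Iceberg bundles) fails loudly when the "exactly one element" invariant is broken, instead of silently taking the first ref. A minimal stand-alone equivalent, for illustration only:

```java
import java.util.Collection;
import java.util.Iterator;
import java.util.NoSuchElementException;

public class OnlyElement {

    // Return the sole element of a collection, or fail explicitly if the
    // collection is empty or has more than one element.
    static <T> T getOnlyElement(Collection<T> items) {
        Iterator<T> it = items.iterator();
        if (!it.hasNext()) {
            throw new NoSuchElementException("expected exactly one element, found none");
        }
        T first = it.next();
        if (it.hasNext()) {
            throw new IllegalArgumentException("expected exactly one element, found more");
        }
        return first;
    }
}
```

With a check like this, a refs map that unexpectedly contains a tag alongside main surfaces as an immediate error rather than an incorrect ancestry calculation.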