Skip to content

Core: Fix incorrect selection of incremental cleanup in expire snapshots#13614

Merged
amogh-jahagirdar merged 10 commits into
apache:mainfrom
amogh-jahagirdar:fix-expiration-cleanup
Jul 28, 2025
Merged

Core: Fix incorrect selection of incremental cleanup in expire snapshots#13614
amogh-jahagirdar merged 10 commits into
apache:mainfrom
amogh-jahagirdar:fix-expiration-cleanup

Conversation

@amogh-jahagirdar
Copy link
Copy Markdown
Contributor

@amogh-jahagirdar amogh-jahagirdar commented Jul 21, 2025

Fixes #13568

Currently in the core implementation of file cleanup as part of expires snapshots, it's possible that all non-main references are removed after a snapshot expiration; the decision to select incremental cleanup or the reachable cleanup is simply based on the table metadata after expiration having a single reference. However, this is not correct since full reachability is required if the table metadata before expiration had other refs.

Additionally, the incremental cleanup would currently just get the first ref from the metadata before expiration assuming there was only one, and this can just be some non-main branch or tag depending on the ordering in the refs() values, and then the ancestry calculation for incremental cleanup is incorrect.

The following changes are made in this PR:

  1. Incremental cleanup is performed only if table metadata before and after ("after" being the table metadata after the commit, it can include any intermediate changes after the expiration was committed) has only one ref (the main ref)
  2. Adding additional validation in Incremental cleanup and changing Iterables.getFirst in incremental cleanup to getOnlyElement so there's an explicit failure in case that violation somehow gets broken

@github-actions github-actions Bot added the core label Jul 21, 2025
@amogh-jahagirdar amogh-jahagirdar force-pushed the fix-expiration-cleanup branch from 58bcc2b to 850e04a Compare July 21, 2025 13:16
@@ -50,7 +50,7 @@ class IncrementalFileCleanup extends FileCleanupStrategy {
@Override
@SuppressWarnings({"checkstyle:CyclomaticComplexity", "MethodLength"})
public void cleanFiles(TableMetadata beforeExpiration, TableMetadata afterExpiration) {
Copy link
Copy Markdown
Contributor Author

@amogh-jahagirdar amogh-jahagirdar Jul 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's a bigger question of moving away from incremental cleanup and just always doing the reachability analysis in the core cleanup. This is what's done in spark and other integrations (which is probably why this issue didn't surfaced for quite some time).

If I recall correctly, the original intent for keeping incremental cleanup was that at the time branching/tagging was a pretty new introduction and I didn't want to regress folks expiration cleanup in the "average" case. However, even beyond branching/tagging there are other constructs on table metadata (like the ability to cherry pick) that can make incremental cleanup difficult to do safely, that maybe in 2.0 we should consider just removing it and always doing a reachability analysis? cc @rdblue @RussellSpitzer @stevenzwu @aokolnychyi

Copy link
Copy Markdown
Contributor

@stevenzwu stevenzwu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this quick change to prevent corrupted state.

Nit on the commit msg. Technically, we didn't fix the incorrect selection. Maybe sth like this.

Core: Disable incremental cleanup in expire snapshots when there are multiple refs in the table

@SuppressWarnings({"checkstyle:CyclomaticComplexity", "MethodLength"})
public void cleanFiles(TableMetadata beforeExpiration, TableMetadata afterExpiration) {
if (afterExpiration.refs().size() > 1) {
if (beforeExpiration.refs().size() > 1 || afterExpiration.refs().size() > 1) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this leak files if there is a rollback or a staged (but not referenced) commit?

For instance, if you have snapshots like this:

A --- B --- C --- D (main)
       `--- X (unreferenced)

If snapshot X should be cleaned up, Iceberg should not use the incremental strategy. Also, if X is not removed but B and C are, then incremental also can't be used because C may have removed files from X.

Instead of checking the number of refs, I think we need to check the history of the main branch. If there is any snapshot outside of the main branch before or after expiration, incremental cannot be used. There are cases where a ref is in the main branch's history that are safe, but I would probably keep the implementation simpler and avoid checking for things like branch overlap.

What I would do is:

  1. If there is more than one ref after expiration, use the reference set strategy (optionally, we could check that all refs are in the main branch's history and allow it)
  2. If there are any snapshots to be removed after expiration that were not the main branch's ancestors, use the reference set strategy

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could check that all refs are in the main branch's history and allow it

I agree. It's the safest to only allow incremental cleanup when this condition is met. This should also makes incremental cleanup safe for a large class of history manipulations that's going to get implemented in the future.

Copy link
Copy Markdown
Contributor Author

@amogh-jahagirdar amogh-jahagirdar Jul 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in the current implementation we're safe against staged snapshots being cleaned up since the incremental cleanup itself will delete if the manifests are not referenced in an ancestor.

That said, I generally agree that the clearest way to determine if a reachability analysis should be performed is by doing an analysis of if any of the snapshots being removed are outside of the main ancestry.


if (incrementalCleanup == null) {
incrementalCleanup = current.refs().size() == 1;
incrementalCleanup = base.refs().size() == 1 && current.refs().size() == 1;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If incrementalCleanup is explicitly enabled by withIncrementalCleanup(true), but on this line we already know that it's going to crash, does it make sense to fallback to full scan with a warning, or just crash here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've upleveled some of the validation, but I do think if it's specified through the API option and it's incompatible with the state of the table we should just fail instead of falling back.

I do think that we should probably revisit removing the whole withIncremental option (it's not on the API but there's complication to having to be defensive against bad input where most users don't even specify it). It was added for test comparisons a long time ago but I think we know that reachability is the safest and we should probably just let the implementation hide the complexity of cleanup selection.

… fix incremental to always just pick latest from current snapshot on metadata before expiration rather than just try and get the only ref
@amogh-jahagirdar amogh-jahagirdar force-pushed the fix-expiration-cleanup branch from be82964 to 3fedbe5 Compare July 22, 2025 16:45
Comment thread core/src/test/java/org/apache/iceberg/TestRemoveSnapshots.java
Comment thread core/src/main/java/org/apache/iceberg/IncrementalFileCleanup.java
Comment thread core/src/main/java/org/apache/iceberg/IncrementalFileCleanup.java
Comment thread core/src/test/java/org/apache/iceberg/TestRemoveSnapshots.java Outdated
Comment thread core/src/test/java/org/apache/iceberg/TestRemoveSnapshots.java Outdated

@TestTemplate
public void testExpireOlderThanWithRollbackAndMergedManifests() {
assumeThat(incrementalCleanup).isFalse();
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't copy this case again for the incremental failure case, since this case is no different than the case above beyond merging manifests or not; the current approach blanket prevents performing incremental cleanup in case of rollbacks so both cases just fail the same.

Comment thread core/src/main/java/org/apache/iceberg/RemoveSnapshots.java
@amogh-jahagirdar amogh-jahagirdar force-pushed the fix-expiration-cleanup branch from 62a79e5 to ba40cb2 Compare July 23, 2025 14:56
Copy link
Copy Markdown
Contributor

@stevenzwu stevenzwu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

cleanupStrategy.cleanFiles(base, current);
}

private void validateCleanupCanBeIncremental(TableMetadata current) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we also need to check that the only table ref is main? This checks that there's one ref and only snapshots behind currentSnapshot are expired. But if main doesn't exist, then allRemovedSnapshotsAreInMainAncestry is going to fail with an NPE.

We also need to ensure that there were no non-main snapshots before expiration. Those need to be cleaned up through reachability analysis.

Copy link
Copy Markdown
Contributor Author

@amogh-jahagirdar amogh-jahagirdar Jul 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we'd ever be in a state where only a non main ref is defined because at minimum the main branch would always be defined; even if it's empty. so the exactly one check is OK i think. I could make it more explicit though by comparing that it is main. base.currentSnapshot() would also not be null at this point since we early return on 355 if there are no snapshots in the metadata before, there'd be nothing to cleanup.

We also need to ensure that there were no non-main snapshots before expiration. Those need to be cleaned up through reachability analysis.

The ancestry check in the method in 407 would cover this because it checks that any removed snapshots are not outside the main ancestry

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah but even if a snapshot is not removed we still would need reachable analysis if there's a non-main snapshot since a staged but not removed snapshot could still reference a file that should still not be cleaned up

Copy link
Copy Markdown
Contributor Author

@amogh-jahagirdar amogh-jahagirdar Jul 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I've expanded this check to be more conservative but it's guaranteed to always be correct I think:

Incremental cleanup can only be done:

  1. When there's no specified snapshots to remove
  2. When there's no non main refs before OR after expiration (it's a bit overly conservative here because I could have an aged off tag on the main lineage which could be cleaned up incrementally. Could probably optimize this by complicating the hasOnlyMainBranch a bit by checking the ref types etc)
  3. When there are no non main snapshots before or after expiration

Copy link
Copy Markdown
Contributor Author

@amogh-jahagirdar amogh-jahagirdar Jul 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've amended this so that we distinguish the validation from before expiration and after expiration a little bit.

Incremental cleanup can only be done

  1. No specified snapshots to remove
  2. Before expiration there are no branches other than main.
  3. After expiration there are no refs other than main
  4. When there are no non-main unreferenced snapshots before or after expiration.

The distinction for allowing tags before expiration is beneficial since as any tags on the ancestry of main age out we can still incrementally cleanup, assuming the other conditions of no other non-main snapshots holds true. The check in 4 technically could already cover what's done in 2, but 2 is a lighter weight check to verify first (also helps to have a different error message here imo) and we fall back to 4 for the staged case

For the check where there are non main snapshots it handles the case where the current snapshot is null by simply checking if the whole snapshots list is not empty. If it's not empty then this means there are non-main snapshots.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think maybe we can do a little better. I think your argument for tags main's ancestors is a good one. We can do that for branches, too.

What about these rules to determine if incremental is safe?

  1. No specified snapshots to remove
  2. No snapshots to remove that are not ancestors of main
  3. No snapshots after expiration that are not ancestors of main

I think the combination of 2 and 3 cover branch and tag cases. Tags function like branches with just one snapshot, so reasoning about branches will cover them. Any branch either has a snapshot outside of main or it doesn't. If it doesn't, then incremental cleanup will work. If there is a snapshot outside of main, then it is either going to be caught be 2 (if it ages off) or 3 (if it does not age off).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After a lot of sketching out to prove to myself this works, I agree with the simplified algorithm. I went ahead and updated the implementation

Comment thread core/src/main/java/org/apache/iceberg/RemoveSnapshots.java Outdated
@amogh-jahagirdar amogh-jahagirdar force-pushed the fix-expiration-cleanup branch 3 times, most recently from d26cec8 to b033644 Compare July 23, 2025 22:52
Comment thread core/src/main/java/org/apache/iceberg/RemoveSnapshots.java Outdated
@amogh-jahagirdar amogh-jahagirdar force-pushed the fix-expiration-cleanup branch 2 times, most recently from 876b4f4 to 86887f7 Compare July 24, 2025 00:23
@amogh-jahagirdar amogh-jahagirdar force-pushed the fix-expiration-cleanup branch from 86887f7 to 562a5ee Compare July 24, 2025 00:25
}

private boolean hasOnlyMainBranch(TableMetadata metadata) {
return metadata.refs().size() == 1 && metadata.refs().containsKey(SnapshotRef.MAIN_BRANCH);
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I forgot that we actually don't return a main ref with -1, no ref would be surfaced. I'll need to fix that as well

@amogh-jahagirdar amogh-jahagirdar force-pushed the fix-expiration-cleanup branch 2 times, most recently from 98e1da1 to 8bbf7b0 Compare July 24, 2025 03:47
Comment thread core/src/main/java/org/apache/iceberg/RemoveSnapshots.java Outdated
@amogh-jahagirdar amogh-jahagirdar force-pushed the fix-expiration-cleanup branch from 8bbf7b0 to f7b4b9a Compare July 24, 2025 05:10
Comment thread core/src/test/java/org/apache/iceberg/TestRemoveSnapshots.java Outdated
…ther refs for after expiration. Handle empty case properly
@amogh-jahagirdar amogh-jahagirdar force-pushed the fix-expiration-cleanup branch from f7b4b9a to 8cea44f Compare July 24, 2025 15:33
@amogh-jahagirdar amogh-jahagirdar force-pushed the fix-expiration-cleanup branch from 90a7c4b to 7ab58c5 Compare July 25, 2025 02:25
Comment on lines -1112 to -1114
table.manageSnapshots().createTag("TagA", table.currentSnapshot().snapshotId()).commit();
waitUntilAfter(table.currentSnapshot().timestampMillis());
RemoveSnapshots removeSnapshots = (RemoveSnapshots) table.expireSnapshots();
Copy link
Copy Markdown
Contributor Author

@amogh-jahagirdar amogh-jahagirdar Jul 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This had to change because now for this case the logic is more permissive. Previously any ref existing would lead to a failure with incremental cleanup, however with the new logic we'd allow incremental cleanup here since the tag is aged off on main. I updated the test so that it creates a branch and writes to the branch which is then aged off. The check that there are no snapshots being removed outside of main will fail in the new test.

Comment on lines +1889 to +1891
Set<String> expectedDeleteFiles =
ImmutableSet.of(table.currentSnapshot().manifestListLocation());
String tag = "tag";
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move expectedDeleteFiles closer to where it's actually used

LOG.info("Committed snapshot changes");

if (cleanExpiredFiles) {
if (cleanExpiredFiles && !base.snapshots().isEmpty()) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems odd. Shouldn't this happen automatically when cleanExpiredSnapshots() gathers the set of snapshots to clean?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it does, I just updated it this way so that the logic short circuits before that point.

@rdblue
Copy link
Copy Markdown
Contributor

rdblue commented Jul 28, 2025

This look good to me. Thanks, @amogh-jahagirdar!

@amogh-jahagirdar
Copy link
Copy Markdown
Contributor Author

OK thanks for the reviews @rdblue @sqd @stevenzwu ! I'll go ahead and merge

@amogh-jahagirdar amogh-jahagirdar merged commit f7e6a27 into apache:main Jul 28, 2025
42 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Expiring snapshot can erroneously delete data files that are still referenced

4 participants