Skip to content

Spark: RewriteTablePath: filter content files by snapshotId#12885

Merged
szehon-ho merged 4 commits into
apache:mainfrom
dramaticlly:dr-filter-snapshot-id
Jun 13, 2025
Merged

Spark: RewriteTablePath: filter content files by snapshotId#12885
szehon-ho merged 4 commits into
apache:mainfrom
dramaticlly:dr-filter-snapshot-id

Conversation

@dramaticlly
Copy link
Copy Markdown
Contributor

@dramaticlly dramaticlly commented Apr 24, 2025

RewriteTablePath incremental mode limits rewrite/copy plan to those between two snapshots. However, the returned copy plan ignores this limitation and includes all referenced files by these snapshots, even those not modified between the two snapshots. This leads to unnecessary copies. This patch allows of rewriting table path to use snapshots to filter both

  1. added_snapshot_id in ManifestFile
  2. snapshot_id in ManifestEntry

It help avoid repeated copy of same data files when manifest rewrite happens in between incremental copy. Manifest rewrite will only change added snapshot id for manifest, but data files remain the same and it shall not be copied over repeatedly.

If you can help take a look @flyrain @szehon-ho

@dramaticlly dramaticlly force-pushed the dr-filter-snapshot-id branch from 4f257c7 to e61e572 Compare April 24, 2025 06:22
@dramaticlly dramaticlly changed the title Spark 3.5: RewriteTablePath: filter content files by snapshotId Spark: RewriteTablePath: filter content files by snapshotId Apr 24, 2025
@github-actions
Copy link
Copy Markdown

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

@github-actions github-actions Bot added the stale label May 25, 2025
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 1, 2025

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions Bot closed this Jun 1, 2025
@flyrain flyrain reopened this Jun 2, 2025
@github-actions github-actions Bot removed the stale label Jun 3, 2025
Copy link
Copy Markdown
Member

@szehon-ho szehon-ho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks mostly good to me.

Note: can improve the pr description a bit to give the context, something like, "RewriteTablePath incremental mode limits rewrite/copy plan to those between two snapshots. However, the returned copy plan ignores this limitation and includes all referenced files by these snapshots, even those not modified between the two snapshots. This leads to unnecessary copies".

Comment thread core/src/main/java/org/apache/iceberg/RewriteTablePathUtil.java Outdated
Comment thread core/src/main/java/org/apache/iceberg/RewriteTablePathUtil.java Outdated
Comment thread .palantir/revapi.yml Outdated
…ing by snapshotId

This help avoid repeated copy of data files when manifest rewrite happens
@dramaticlly dramaticlly force-pushed the dr-filter-snapshot-id branch from e61e572 to dda923c Compare June 11, 2025 05:33
Comment thread core/src/main/java/org/apache/iceberg/RewriteTablePathUtil.java Outdated
Comment thread core/src/main/java/org/apache/iceberg/RewriteTablePathUtil.java Outdated
Comment thread core/src/main/java/org/apache/iceberg/RewriteTablePathUtil.java Outdated
Comment thread core/src/main/java/org/apache/iceberg/RewriteTablePathUtil.java Outdated
Comment thread core/src/main/java/org/apache/iceberg/RewriteTablePathUtil.java Outdated
Comment thread core/src/main/java/org/apache/iceberg/RewriteTablePathUtil.java Outdated
Comment thread core/src/main/java/org/apache/iceberg/RewriteTablePathUtil.java Outdated
@szehon-ho szehon-ho merged commit 62d9ff5 into apache:main Jun 13, 2025
42 checks passed
@szehon-ho
Copy link
Copy Markdown
Member

Merged, thanks @dramaticlly !

@dramaticlly
Copy link
Copy Markdown
Contributor Author

Thanks @szehon-ho for the detailed review and merge!

eric-maynard pushed a commit to eric-maynard/iceberg that referenced this pull request Jun 18, 2025
devendra-nr pushed a commit to devendra-nr/iceberg that referenced this pull request Dec 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants