Skip to content

Spark: Use bulk deletes in rewrite manifests action#10343

Merged
amogh-jahagirdar merged 1 commit into
apache:mainfrom
amogh-jahagirdar:rewrite-manifests-bulk-deletes
Jun 17, 2024
Merged

Spark: Use bulk deletes in rewrite manifests action#10343
amogh-jahagirdar merged 1 commit into
apache:mainfrom
amogh-jahagirdar:rewrite-manifests-bulk-deletes

Conversation

@amogh-jahagirdar
Copy link
Copy Markdown
Contributor

This change uses bulk deletes when possible when cleaning up files as part of any failure or as part of the staging logic for v1 tables.

@github-actions github-actions Bot added the spark label May 17, 2024
@amogh-jahagirdar amogh-jahagirdar force-pushed the rewrite-manifests-bulk-deletes branch from 9c20bbf to 0add66c Compare May 17, 2024 03:01
Copy link
Copy Markdown
Contributor

@nastra nastra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the change makes sense to me. I wonder if we should have a variation of SparkCleanupUtil.deleteFiles for this, where we have some additional logging/exception handling when e.g. bulk deletes are only partially applied (inside SparkCleanupUtil.bulkDelete

Copy link
Copy Markdown
Contributor

@dramaticlly dramaticlly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also noticed that there's deleteFiles in BaseSparkAction which batch files by group of 10k, but I think it's more for snapshot expiration or purge

@amogh-jahagirdar amogh-jahagirdar force-pushed the rewrite-manifests-bulk-deletes branch 2 times, most recently from ece6a88 to f51bf8e Compare June 9, 2024 01:03
@amogh-jahagirdar amogh-jahagirdar force-pushed the rewrite-manifests-bulk-deletes branch from f51bf8e to d10a86f Compare June 9, 2024 01:07
@amogh-jahagirdar
Copy link
Copy Markdown
Contributor Author

Forgot I had this PR up :) thanks for the reviews @nastra @dramaticlly ! I went with just using deleteFiles from the BaseSparkAction, that will also log on failed deletes. SparkCleanupUtil is package private and not accessible from the action implementation. I'll go ahead and merge this

@amogh-jahagirdar amogh-jahagirdar merged commit d8f26ca into apache:main Jun 17, 2024
jasonf20 pushed a commit to jasonf20/iceberg that referenced this pull request Aug 4, 2024
sasankpagolu pushed a commit to sasankpagolu/iceberg that referenced this pull request Oct 27, 2024
zachdisc pushed a commit to zachdisc/iceberg that referenced this pull request Dec 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants