Core: Handle statistics file clean up from expireSnapshots#6090
Conversation
|
cc: @findepi, @rdblue, @szehon-ho |
|
|
I think we should change the label of the |
Sorry, I still didn't get how the query engine will figure out the statistics file for the current snapshot (when the snapshot is reused). @rdblue: What do you think about this? |
13ae3e0 to
be54044
Compare
be54044 to
36e99a4
Compare
911ed3d to
7d9dae0
Compare
|
@rdblue, @findepi, @amogh-jahagirdar: Handled the comments. Please take a look at it again. |
| return (RemoveSnapshots) removeSnapshots.withIncrementalCleanup(incrementalCleanup); | ||
| } | ||
|
|
||
| private StatisticsFile writeStatsFileForCurrentSnapshot(Table table, File statsLocation) |
There was a problem hiding this comment.
I don't think there's a reason to pass table to this method. I think this should accept a String location, a FileIO, and a snapshot ID.
This should also not use File for writing.
There was a problem hiding this comment.
This was taken from existing code.
https://github.com/apache/iceberg/blob/master/spark/v3.3/spark/src/test/java/org/apache/iceberg/spark/actions/TestRemoveOrphanFilesAction.java#L917
I will modify the current testcase.
| table.newAppend().appendFile(FILE_B).commit(); | ||
| // Note: RewriteDataFiles can reuse statistics files across operations. | ||
| // This test reuses stats for append just to mimic this scenario without having to run | ||
| // RewriteDataFiles. |
There was a problem hiding this comment.
Does this actually happen in RewriteDataFiles? I don't think that the same stats file should be added more than once. It's a good idea to make sure it doesn't, but that should not be the behavior of built-in operations.
There was a problem hiding this comment.
@findepi has mentioned about reusing the stats file. I think we should not allow it because concurrent operations can add extra stats during rewrite operation.
We don't have any engine integration with stats in this repo. So, I mentioned "can reuse" instead of "will resue"
There was a problem hiding this comment.
I think this is inaccurate then. It should be enough to state that in the even that a snapshot file is for some reason reused, we want to detect that it is still referenced and not delete it from the file system.
|
Thanks, @ajantha-bhat! I made some comments in tests to fix. |
15f9794 to
fad3899
Compare
Thanks for the review. I have addressed the comments. |
|
If the changes are ok, please merge this PR. So that I can rebase #6091 and make it ready for review. |
| } | ||
| } | ||
|
|
||
| private StatisticsFile reuseStatsForCurrentSnapshot( |
There was a problem hiding this comment.
This is not for the "current" snapshot because the snapshot ID is being passed in.
When there are problems that need to be fixed in multiple places, I might just mention it once to avoid unnecessary repetition. So to keep PRs moving faster, you should always look for similar cases that also need to be fixed.
There was a problem hiding this comment.
ACK.
Apologies for the back and forth. This was induced during refactoring.
rdblue
left a comment
There was a problem hiding this comment.
@ajantha-bhat, looks like there are just two more things to fix. Thanks!
fad3899 to
e0f51fd
Compare
Done. Thanks for the review. |
|
@jackye1995: Can you please consider this for the 1.2.0 release? |
| } | ||
|
|
||
| private void commitStats(Table table, StatisticsFile statisticsFile) { | ||
| table.updateStatistics().setStatistics(statisticsFile.snapshotId(), statisticsFile).commit(); |
There was a problem hiding this comment.
I find it odd that a single line like this is in a separate method. Seems like this could be inlined and would make the tests more readable.
|
Thanks, @ajantha-bhat. Good to have this in. |
Currently, Statistics files are safeguarded against orphan_files cleanup. But they are never cleaned up from table metadata and from the storage once the snapshots are expired/deleted.
Hence, this PR adds a change to handle the Statistics file cleanup during expire_snapshot.
Note that this is just for API level clean up (
table#expireSnapshots)Clean-up from expired snapshots spark action/procedure will be built on top of it in a follow-up PR.