Spec: Document Snapshot Summary Optional Fields for Standardization#11660
Conversation
HonahX
left a comment
There was a problem hiding this comment.
@RussellSpitzer Thanks for reviewing! I moved the whole section to Appendix G and update field descriptions accordingly
kevinjqliu
left a comment
There was a problem hiding this comment.
LGTM! Thanks for adding this new appendix section
|
@HonahX I think we are good on this one, I don't think we had any negative feedback from the dev list. So I think we are good to merge if you are ready. |
| | **`deleted-duplicate-files`** | Number of duplicate files deleted (duplicates are files recorded more than once in the manifest) | | ||
| | **`changed-partition-count`** | Number of partitions with files added or removed in the snapshot | | ||
|
|
||
| ### Other Fields |
There was a problem hiding this comment.
Other fields also include engine-name, engine-version, app-id.
There was a problem hiding this comment.
I was on the edge of including these since they are not constants in SnapshotSummary.java. engine-name and engine-version seem fine as they’re unambiguous and included in the view's summary spec: https://iceberg.apache.org/view-spec/#summary. However, app-id feels too generic and poorly defined, so I’ve left it out.
I've added rows for engine-name and engine-version. Please let me know what you think.
There was a problem hiding this comment.
Spark adds the spark.app.id to the snapshot summary as seen here
https://iceberg.apache.org/docs/1.6.0/spark-queries/#snapshots
There was a problem hiding this comment.
Both spark.app.id and app-id can exist in snapshot summary when using spark:
{spark.app.id=local-1737052579245,
...
engine-version=3.5.2,
app-id=local-1737052579245,
engine-name=spark,
...
I personally prefer spark.app.id since it clearly indicates that this field relates to Spark. I think it’s best to let each engine decide what engine-specific information to include in the summary and how to name it, rather than setting a general recommendation here.
There was a problem hiding this comment.
IMHO, We should skip anything that's engine specific since the goal is to provide definitions for things that are universal and may be used by different engines. So we don't need spark.app.id here because no other engine needs a consistent definition of that.
add engine-name and engine-version
# Conflicts: # format/spec.md
|
Thanks @kevinjqliu @RussellSpitzer @sfc-gh-aixu @danielcweeks @rdblue for reviewing! The vote has passed: https://lists.apache.org/thread/mz01jwt69osqxhx9d3dd9xzncv9yncd0 I will go ahead and merge it. |
This PR introduces a new section, "Optional Snapshot Summary Fields", in the table spec under Appendix F to document optional fields in the snapshot summary, including metrics, and other fields such as Write-Audit-Publish (WAP)-related fields. The goal is to establish a clear standard for these fields, ensuring consistent naming and usage across implementations while reducing ambiguity and improving compatibility.
Proposal Here: https://docs.google.com/document/d/1Gt1ZOXVXK60IGdlmt4QlyRzaZ1iCVyYUBfMJCsiz14I/edit?usp=sharing