Skip to content

Spec: Document Snapshot Summary Optional Fields for Standardization#11660

Merged
HonahX merged 11 commits into
apache:mainfrom
HonahX:standardize_snapshot_summary_fields
Jan 24, 2025
Merged

Spec: Document Snapshot Summary Optional Fields for Standardization#11660
HonahX merged 11 commits into
apache:mainfrom
HonahX:standardize_snapshot_summary_fields

Conversation

@HonahX
Copy link
Copy Markdown
Contributor

@HonahX HonahX commented Nov 26, 2024

This PR introduces a new section, "Optional Snapshot Summary Fields", in the table spec under Appendix F to document optional fields in the snapshot summary, including metrics, and other fields such as Write-Audit-Publish (WAP)-related fields. The goal is to establish a clear standard for these fields, ensuring consistent naming and usage across implementations while reducing ambiguity and improving compatibility.

Proposal Here: https://docs.google.com/document/d/1Gt1ZOXVXK60IGdlmt4QlyRzaZ1iCVyYUBfMJCsiz14I/edit?usp=sharing

@github-actions github-actions Bot added the Specification Issues that may introduce spec changes. label Nov 26, 2024
@HonahX HonahX linked an issue Nov 26, 2024 that may be closed by this pull request
6 tasks
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Copy link
Copy Markdown
Contributor Author

@HonahX HonahX left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RussellSpitzer Thanks for reviewing! I moved the whole section to Appendix G and update field descriptions accordingly

Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
@HonahX HonahX marked this pull request as ready for review January 3, 2025 01:06
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Copy link
Copy Markdown
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for adding this new appendix section

@RussellSpitzer
Copy link
Copy Markdown
Member

@HonahX I think we are good on this one, I don't think we had any negative feedback from the dev list. So I think we are good to merge if you are ready.

Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
| **`deleted-duplicate-files`** | Number of duplicate files deleted (duplicates are files recorded more than once in the manifest) |
| **`changed-partition-count`** | Number of partitions with files added or removed in the snapshot |

### Other Fields
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other fields also include engine-name, engine-version, app-id.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was on the edge of including these since they are not constants in SnapshotSummary.java. engine-name and engine-version seem fine as they’re unambiguous and included in the view's summary spec: https://iceberg.apache.org/view-spec/#summary. However, app-id feels too generic and poorly defined, so I’ve left it out.

I've added rows for engine-name and engine-version. Please let me know what you think.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spark adds the spark.app.id to the snapshot summary as seen here
https://iceberg.apache.org/docs/1.6.0/spark-queries/#snapshots

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both spark.app.id and app-id can exist in snapshot summary when using spark:

{spark.app.id=local-1737052579245, 
...
engine-version=3.5.2, 
app-id=local-1737052579245, 
engine-name=spark, 
...

I personally prefer spark.app.id since it clearly indicates that this field relates to Spark. I think it’s best to let each engine decide what engine-specific information to include in the summary and how to name it, rather than setting a general recommendation here.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO, We should skip anything that's engine specific since the goal is to provide definitions for things that are universal and may be used by different engines. So we don't need spark.app.id here because no other engine needs a consistent definition of that.

add engine-name and engine-version
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md
Copy link
Copy Markdown
Contributor

@sfc-gh-aixu sfc-gh-aixu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@HonahX
Copy link
Copy Markdown
Contributor Author

HonahX commented Jan 24, 2025

Thanks @kevinjqliu @RussellSpitzer @sfc-gh-aixu @danielcweeks @rdblue for reviewing!

The vote has passed: https://lists.apache.org/thread/mz01jwt69osqxhx9d3dd9xzncv9yncd0

I will go ahead and merge it.

@HonahX HonahX merged commit c0c1b15 into apache:main Jan 24, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Specification Issues that may introduce spec changes.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Document Snapshot Summary Optional Fields for Standardization

7 participants