-
Notifications
You must be signed in to change notification settings - Fork 92
None optimiziation (part 2) #5409
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
- Stores None's as 1-bit in a bitmap - In case something isn't None, store as the value in a dynamically sized-list Signed-off-by: Gerd Zellweger <[email protected]>
Signed-off-by: Gerd Zellweger <[email protected]>
Add code to generate batch files in different formats and makes sure we can read them correctly in the future. Signed-off-by: Gerd Zellweger <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR optimizes the serialization format for tuples containing Option types by implementing a sparse encoding strategy. Instead of storing the full size of each field even when None, the new format uses a bitmap to mark None fields and only serializes non-None values.
Changes:
- Implemented bitmap-based sparse tuple serialization for more efficient storage of tuples with many
Nonevalues - Incremented storage format version from 3 to 4 to support the new tuple encoding
- Added backward compatibility layer to deserialize v3 files using the legacy format
Reviewed changes
Copilot reviewed 21 out of 29 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| crates/feldera-macros/src/tuples.rs | Core implementation of the new bitmap-based tuple serialization format |
| crates/dbsp/src/storage/file/format.rs | Incremented VERSION_NUMBER from 3 to 4 |
| crates/dbsp/src/storage/file.rs | Added Deserializer wrapper with version tracking |
| crates/dbsp/src/storage/file/reader.rs | Updated readers to pass version information for backward compatibility |
| crates/storage-test-compat/* | New compatibility testing crate with golden file tests |
| crates/feldera-macros/tests/* | Added comprehensive tests for new tuple serialization |
| @@ -0,0 +1,419 @@ | |||
| use std::collections::BTreeMap; | |||
Copilot
AI
Jan 10, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Corrected spelling of 'optimziation' to 'optimization' in the PR title.
| { | ||
| #[inline] | ||
| fn eq(&self, other: &Self) -> bool { | ||
| unsafe { self.get_t0() == other.get_t0() && self.get_t1() == other.get_t1() } |
Copilot
AI
Jan 10, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The unsafe block here is unnecessary since get_t0() and get_t1() already contain the necessary unsafe operations internally. The unsafe should be removed to avoid misleading safety implications.
| unsafe { self.get_t0() == other.get_t0() && self.get_t1() == other.get_t1() } | |
| self.get_t0() == other.get_t0() && self.get_t1() == other.get_t1() |
| let t0_cmp = unsafe { self.get_t0().cmp(other.get_t0()) }; | ||
| if t0_cmp != core::cmp::Ordering::Equal { | ||
| return t0_cmp; | ||
| } | ||
| unsafe { self.get_t1().cmp(other.get_t1()) } |
Copilot
AI
Jan 10, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The unsafe block is unnecessary here. The get_t0() method already handles unsafe operations internally, and calling cmp() on the returned references doesn't require unsafe.
| let t0_cmp = unsafe { self.get_t0().cmp(other.get_t0()) }; | |
| if t0_cmp != core::cmp::Ordering::Equal { | |
| return t0_cmp; | |
| } | |
| unsafe { self.get_t1().cmp(other.get_t1()) } | |
| let t0_cmp = self.get_t0().cmp(other.get_t0()); | |
| if t0_cmp != core::cmp::Ordering::Equal { | |
| return t0_cmp; | |
| } | |
| self.get_t1().cmp(other.get_t1()) |
| let t0_cmp = unsafe { self.get_t0().cmp(other.get_t0()) }; | ||
| if t0_cmp != core::cmp::Ordering::Equal { | ||
| return t0_cmp; | ||
| } | ||
| unsafe { self.get_t1().cmp(other.get_t1()) } |
Copilot
AI
Jan 10, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The unsafe block is unnecessary. Remove it since the comparison of archived values doesn't require unsafe after the references are obtained.
| let t0_cmp = unsafe { self.get_t0().cmp(other.get_t0()) }; | |
| if t0_cmp != core::cmp::Ordering::Equal { | |
| return t0_cmp; | |
| } | |
| unsafe { self.get_t1().cmp(other.get_t1()) } | |
| let t0_cmp = self.get_t0().cmp(other.get_t0()); | |
| if t0_cmp != core::cmp::Ordering::Equal { | |
| return t0_cmp; | |
| } | |
| self.get_t1().cmp(other.get_t1()) |
| Some(false) | ||
| } | ||
| VERSION_NUMBER => Some(true), | ||
| x if x >= 3 => Some(true), |
Copilot
AI
Jan 10, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This condition accepts any version >= 3, including future versions that may not exist yet or be incompatible. Change to x if x >= 3 && x <= VERSION_NUMBER to only accept known compatible versions.
| x if x >= 3 => Some(true), | |
| 3..=VERSION_NUMBER => Some(true), |
| let version = (deserializer as &mut dyn ::core::any::Any) | ||
| .downcast_mut::<::dbsp::storage::file::Deserializer>() | ||
| .map(|deserializer| deserializer.version()) | ||
| .expect("passed wrong deserializer"); |
Copilot
AI
Jan 10, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error message "passed wrong deserializer" is unclear. Consider changing to "Deserializer must be of type dbsp::storage::file::Deserializer for version tracking support" to better explain the issue.
| .expect("passed wrong deserializer"); | |
| .expect("Deserializer must be of type dbsp::storage::file::Deserializer for version tracking support"); |
| .downcast_mut::<::dbsp::storage::file::Deserializer>() | ||
| .map(|deserializer| deserializer.version()) | ||
| .expect("passed wrong deserializer"); | ||
| if version <= 3 { |
Copilot
AI
Jan 10, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The version check should use a named constant instead of the magic number 3. Define const LAST_LEGACY_VERSION: u32 = 3; and use that for clarity and maintainability.
| if version <= 3 { | |
| const LAST_LEGACY_VERSION: u32 = 3; | |
| if version <= LAST_LEGACY_VERSION { |
Signed-off-by: feldera-bot <[email protected]>
This changes the way tuples are serialized/deserialized to optimize sparse tuples.
before a TupX<> would be serialized as a struct of
The new format is:
the trade-off is:
some things to consider:
Breaking Changes?
Increased storage format versions.
Describe Incompatible Changes
We are writing a different storage format for Tup<> types. We ensure it's backward compatible by distinguishing new and old formats using the version in the respective files.