Skip to content

Conversation

@gz
Copy link
Contributor

@gz gz commented Jan 10, 2026

This changes the way tuples are serialized/deserialized to optimize sparse tuples.

before a TupX<> would be serialized as a struct of

struct ArchivedTupX
  t1: T1
  t2: T2
etc.

The new format is:

struct NewArchviedTup

  bitmap: [u64; len]
  rel_ptrs: ArchivedVec<RelPtrI32>
  <serialized form of all items that are not None (bit in bitmap corresponding to field not set) follows here
   with rel_ptrs pointing to them>

the trade-off is:

  • if data is sparse, this is great because we reduce a None value to 1bit
  • if the data is not sparse it adds overhead: (a rel_ptr which is 4 byte per field that is not none + the bitmap which is 8 byte for every 64 tup entries)

some things to consider:

  • Potential improvements could be to compress the RelPtrI32 further to a I16 (by storing diff from current to next(?))
  • Some smart way to get rid of rel_ptrs entirely (I doubt it can be done?)
  • an adaptive serialization/deserialization (only use this format if data is really sparse)

Breaking Changes?

  • Storage Format / Checkpoints

Increased storage format versions.

Describe Incompatible Changes

We are writing a different storage format for Tup<> types. We ensure it's backward compatible by distinguishing new and old formats using the version in the respective files.

gz added 6 commits January 10, 2026 14:16
- Stores None's as 1-bit in a bitmap
- In case something isn't None, store as the value in a
  dynamically sized-list

Signed-off-by: Gerd Zellweger <[email protected]>
Signed-off-by: Gerd Zellweger <[email protected]>
Add code to generate batch files in different formats and makes sure
we can read them correctly in the future.

Signed-off-by: Gerd Zellweger <[email protected]>
Copilot AI review requested due to automatic review settings January 10, 2026 22:28
@gz gz changed the title None optimziation part 2 None optimiziation (part 2) Jan 10, 2026
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR optimizes the serialization format for tuples containing Option types by implementing a sparse encoding strategy. Instead of storing the full size of each field even when None, the new format uses a bitmap to mark None fields and only serializes non-None values.

Changes:

  • Implemented bitmap-based sparse tuple serialization for more efficient storage of tuples with many None values
  • Incremented storage format version from 3 to 4 to support the new tuple encoding
  • Added backward compatibility layer to deserialize v3 files using the legacy format

Reviewed changes

Copilot reviewed 21 out of 29 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
crates/feldera-macros/src/tuples.rs Core implementation of the new bitmap-based tuple serialization format
crates/dbsp/src/storage/file/format.rs Incremented VERSION_NUMBER from 3 to 4
crates/dbsp/src/storage/file.rs Added Deserializer wrapper with version tracking
crates/dbsp/src/storage/file/reader.rs Updated readers to pass version information for backward compatibility
crates/storage-test-compat/* New compatibility testing crate with golden file tests
crates/feldera-macros/tests/* Added comprehensive tests for new tuple serialization

@@ -0,0 +1,419 @@
use std::collections::BTreeMap;
Copy link

Copilot AI Jan 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Corrected spelling of 'optimziation' to 'optimization' in the PR title.

Copilot uses AI. Check for mistakes.
{
#[inline]
fn eq(&self, other: &Self) -> bool {
unsafe { self.get_t0() == other.get_t0() && self.get_t1() == other.get_t1() }
Copy link

Copilot AI Jan 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The unsafe block here is unnecessary since get_t0() and get_t1() already contain the necessary unsafe operations internally. The unsafe should be removed to avoid misleading safety implications.

Suggested change
unsafe { self.get_t0() == other.get_t0() && self.get_t1() == other.get_t1() }
self.get_t0() == other.get_t0() && self.get_t1() == other.get_t1()

Copilot uses AI. Check for mistakes.
Comment on lines +563 to +567
let t0_cmp = unsafe { self.get_t0().cmp(other.get_t0()) };
if t0_cmp != core::cmp::Ordering::Equal {
return t0_cmp;
}
unsafe { self.get_t1().cmp(other.get_t1()) }
Copy link

Copilot AI Jan 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The unsafe block is unnecessary here. The get_t0() method already handles unsafe operations internally, and calling cmp() on the returned references doesn't require unsafe.

Suggested change
let t0_cmp = unsafe { self.get_t0().cmp(other.get_t0()) };
if t0_cmp != core::cmp::Ordering::Equal {
return t0_cmp;
}
unsafe { self.get_t1().cmp(other.get_t1()) }
let t0_cmp = self.get_t0().cmp(other.get_t0());
if t0_cmp != core::cmp::Ordering::Equal {
return t0_cmp;
}
self.get_t1().cmp(other.get_t1())

Copilot uses AI. Check for mistakes.
Comment on lines +563 to +567
let t0_cmp = unsafe { self.get_t0().cmp(other.get_t0()) };
if t0_cmp != core::cmp::Ordering::Equal {
return t0_cmp;
}
unsafe { self.get_t1().cmp(other.get_t1()) }
Copy link

Copilot AI Jan 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The unsafe block is unnecessary. Remove it since the comparison of archived values doesn't require unsafe after the references are obtained.

Suggested change
let t0_cmp = unsafe { self.get_t0().cmp(other.get_t0()) };
if t0_cmp != core::cmp::Ordering::Equal {
return t0_cmp;
}
unsafe { self.get_t1().cmp(other.get_t1()) }
let t0_cmp = self.get_t0().cmp(other.get_t0());
if t0_cmp != core::cmp::Ordering::Equal {
return t0_cmp;
}
self.get_t1().cmp(other.get_t1())

Copilot uses AI. Check for mistakes.
Some(false)
}
VERSION_NUMBER => Some(true),
x if x >= 3 => Some(true),
Copy link

Copilot AI Jan 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This condition accepts any version >= 3, including future versions that may not exist yet or be incompatible. Change to x if x >= 3 && x <= VERSION_NUMBER to only accept known compatible versions.

Suggested change
x if x >= 3 => Some(true),
3..=VERSION_NUMBER => Some(true),

Copilot uses AI. Check for mistakes.
let version = (deserializer as &mut dyn ::core::any::Any)
.downcast_mut::<::dbsp::storage::file::Deserializer>()
.map(|deserializer| deserializer.version())
.expect("passed wrong deserializer");
Copy link

Copilot AI Jan 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error message "passed wrong deserializer" is unclear. Consider changing to "Deserializer must be of type dbsp::storage::file::Deserializer for version tracking support" to better explain the issue.

Suggested change
.expect("passed wrong deserializer");
.expect("Deserializer must be of type dbsp::storage::file::Deserializer for version tracking support");

Copilot uses AI. Check for mistakes.
.downcast_mut::<::dbsp::storage::file::Deserializer>()
.map(|deserializer| deserializer.version())
.expect("passed wrong deserializer");
if version <= 3 {
Copy link

Copilot AI Jan 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The version check should use a named constant instead of the magic number 3. Define const LAST_LEGACY_VERSION: u32 = 3; and use that for clarity and maintainability.

Suggested change
if version <= 3 {
const LAST_LEGACY_VERSION: u32 = 3;
if version <= LAST_LEGACY_VERSION {

Copilot uses AI. Check for mistakes.
Signed-off-by: feldera-bot <[email protected]>
@gz gz requested review from blp and removed request for blp January 10, 2026 22:34
@gz gz marked this pull request as draft January 11, 2026 19:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants