
Add DeleteRange Support with Range Tombstones#99

Merged
feichai0017 merged 7 commits into feichai0017:main from nothiny:fix/deleterange
Mar 10, 2026

Conversation

@nothiny (Contributor) commented Feb 28, 2026

Summary

Implements efficient range deletion using LSM-level range tombstones, eliminating the need for expensive point deletes during large key-range cleanup operations.

Motivation

Previously, deleting a large key range required issuing individual point deletes for each key, resulting in:

  • High write amplification
  • Increased compaction overhead
  • Poor performance for bulk deletions (e.g., TTL sweeps, retention policies)

Changes

API Layer

  • Added DB.DeleteRange(start, end []byte) for default column family
  • Added DB.DeleteRangeCF(cf, start, end []byte) for specific column families
  • Range semantics: [start, end) (inclusive start, exclusive end)

Storage Layer

  • Introduced BitRangeDelete flag to mark range tombstone entries
  • Range tombstones stored as: Entry.Key=start, Entry.Value=end, Meta=BitRangeDelete
  • Added Entry.IsRangeDelete() and Entry.RangeEnd() helper methods
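
The storage encoding above can be sketched with a toy mirror of these helpers. The flag value, struct layout, and the absence of internal-key/CF encoding are simplifications for illustration, not NoKV's actual definitions:

```go
package main

import (
	"bytes"
	"fmt"
)

// A range tombstone is an ordinary entry whose Meta carries the
// BitRangeDelete flag, with Key = range start and Value = range end.
const BitRangeDelete byte = 1 << 3 // concrete flag value is an assumption

type Entry struct {
	Key, Value []byte
	Meta       byte
}

func (e *Entry) IsRangeDelete() bool { return e.Meta&BitRangeDelete != 0 }
func (e *Entry) RangeEnd() []byte    { return e.Value }

// KeyInRange reports whether key falls in [start, end):
// inclusive start, exclusive end, matching the API semantics above.
func KeyInRange(key, start, end []byte) bool {
	return bytes.Compare(key, start) >= 0 && bytes.Compare(key, end) < 0
}

func main() {
	rt := &Entry{Key: []byte("key1"), Value: []byte("key3"), Meta: BitRangeDelete}
	fmt.Println(rt.IsRangeDelete())                                // true
	fmt.Println(KeyInRange([]byte("key1"), rt.Key, rt.RangeEnd())) // true
	fmt.Println(KeyInRange([]byte("key3"), rt.Key, rt.RangeEnd())) // false: end is exclusive
}
```
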

Read Path

  • Get operations check if keys are covered by range tombstones via isKeyCoveredByRangeTombstone()
  • Iterator properly filters out keys covered by range tombstones
  • Range tombstone entries themselves are hidden from user-facing iterators

Compaction

  • Range tombstones dropped at max level (no longer needed)
  • Keys covered by range tombstones are discarded during compaction
  • Reduces space amplification over time

WAL & Recovery

  • Range tombstones persisted to WAL
  • Correctly recovered on database restart

Usage Example

// Delete all keys in range [key1, key3)
err := db.DeleteRange([]byte("key1"), []byte("key3"))

// key1, key2 deleted; key3 remains
_, err = db.Get([]byte("key1")) // ErrKeyNotFound
_, err = db.Get([]byte("key2")) // ErrKeyNotFound
_, err = db.Get([]byte("key3")) // OK

// Works with iterators too
iter := db.NewIterator(nil)
for iter.Rewind(); iter.Valid(); iter.Next() {
    // Only key3 and beyond will be returned
}

Notes

  • Read performance may be impacted with many active range tombstones (future optimization opportunity)
  • Transaction support for DeleteRange not yet implemented

Summary by CodeRabbit

  • New Features
    • Range delete API: delete key ranges (with column-family variants) and hide covered keys from reads/iterations.
  • Bug Fixes
    • Prevented resurrection of tombstoned keys during compaction and after restart; fixed visibility edge-cases between point writes and range deletes.
  • Performance
    • Compaction and flush paths now track range tombstones to avoid unnecessary I/O and maintain correct visibility.
  • Tests
    • Extensive unit and integration tests for range deletes, recovery, compaction, and iterator behavior.

@codecov-commenter commented Feb 28, 2026

Codecov Report

❌ Patch coverage is 69.44444% with 66 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
lsm/lsm.go 43.58% 15 Missing and 7 partials ⚠️
lsm/executor.go 67.64% 7 Missing and 4 partials ⚠️
lsm/memtable.go 72.22% 7 Missing and 3 partials ⚠️
lsm/range_tombstone.go 84.21% 6 Missing and 3 partials ⚠️
kv/entry.go 0.00% 6 Missing ⚠️
iterator.go 20.00% 2 Missing and 2 partials ⚠️
db.go 91.66% 1 Missing and 1 partial ⚠️
lsm/levels.go 86.66% 1 Missing and 1 partial ⚠️


Copilot AI (Contributor) left a comment

Pull request overview

Adds DeleteRange support to NoKV by representing range deletions as LSM-level “range tombstone” entries and teaching the read path, iterators, and compaction to honor and eventually drop them.

Changes:

  • Introduces DB.DeleteRange / DB.DeleteRangeCF APIs that write range tombstones using a new BitRangeDelete entry meta flag.
  • Updates read path (Get + DBIterator) to hide range tombstone records and filter keys covered by range tombstones.
  • Updates compaction to track range tombstones and drop covered keys (and drop tombstones at the max level), plus adds dedicated tests for correctness and WAL recovery.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
db.go Adds DeleteRange APIs, range-tombstone coverage checks on Get, and ensures range tombstones stay inline in LSM (not vlog).
iterator.go Filters out range tombstone entries and skips keys covered by range tombstones during iteration.
lsm/executor.go Applies range tombstones during compaction to drop covered keys and drop tombstones at max level.
kv/const.go Adds BitRangeDelete meta flag.
kv/entry.go Adds helpers IsRangeDelete, RangeEnd, and KeyInRange.
db_deleterange_test.go Adds tests for core semantics, validation, CF isolation, compaction interaction, and WAL recovery.


lsm/executor.go Outdated
Comment on lines +1002 to +1016
var rangeTombstones []*kv.Entry
isMaxLevel := cd.nextLevel != nil && cd.nextLevel.levelNum == lm.opt.MaxLevelNum

for ; it.Valid(); it.Next() {
	entry := it.Item().Entry()
	key := entry.Key
	isExpired := IsDeletedOrExpired(entry)

	if entry.IsRangeDelete() {
		if isMaxLevel {
			updateStats(entry)
			continue
		}
		rangeTombstones = append(rangeTombstones, entry)
	}
Copilot AI commented Feb 28, 2026

rangeTombstones stores *kv.Entry pointers returned by the iterator. Table/block iterators reuse a single kv.Entry instance as they advance, so the pointers captured here can later point to mutated data, breaking coverage checks. Store a stable copy of the tombstone’s start/end (e.g., copy key/value bytes into a dedicated struct) instead of keeping the iterator-owned entry pointer.
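
The fix this comment asks for can be sketched with a small owned struct; the type and function names here are illustrative, not code from the PR:

```go
package main

import "fmt"

// Instead of retaining the iterator-owned *Entry (whose backing buffers are
// reused as the iterator advances), copy the tombstone's start/end bytes into
// a small struct the compaction owns.
type rangeTombstone struct {
	start, end []byte
}

func copyTombstone(key, value []byte) rangeTombstone {
	return rangeTombstone{
		start: append([]byte(nil), key...),   // deep copy: safe after iterator buffer reuse
		end:   append([]byte(nil), value...), // deep copy of the range end
	}
}

func main() {
	buf := []byte("a0")
	rt := copyTombstone(buf, []byte("a9"))
	buf[0] = 'z' // simulate the iterator mutating its reused buffer
	fmt.Println(string(rt.start)) // still "a0": the copy is unaffected
}
```
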

lsm/executor.go Outdated
Comment on lines +1034 to +1037
_, userKey, version := kv.SplitInternalKey(key)
for _, rt := range rangeTombstones {
_, rtStart, rtVersion := kv.SplitInternalKey(rt.Key)
if rtVersion >= version && kv.KeyInRange(userKey, rtStart, rt.RangeEnd()) {
Copilot AI commented Feb 28, 2026

Coverage filtering ignores column family: SplitInternalKey results for both the current key and the range tombstone discard the returned CF, so a tombstone from one CF could incorrectly discard keys from another CF during compaction. Include CF equality in the predicate before applying KeyInRange.

Suggested change
-	_, userKey, version := kv.SplitInternalKey(key)
-	for _, rt := range rangeTombstones {
-		_, rtStart, rtVersion := kv.SplitInternalKey(rt.Key)
-		if rtVersion >= version && kv.KeyInRange(userKey, rtStart, rt.RangeEnd()) {
+	cf, userKey, version := kv.SplitInternalKey(key)
+	for _, rt := range rangeTombstones {
+		rtCF, rtStart, rtVersion := kv.SplitInternalKey(rt.Key)
+		if cf == rtCF && rtVersion >= version && kv.KeyInRange(userKey, rtStart, rt.RangeEnd()) {

db.go Outdated
Comment on lines +624 to +655
func (db *DB) isKeyCoveredByRangeTombstone(cf kv.ColumnFamily, userKey []byte, version uint64) bool {
	opt := &utils.Options{IsAsc: true}
	iters := db.lsm.NewIterators(opt)
	for _, it := range iters {
		if it == nil {
			continue
		}
		it.Rewind()
		for it.Valid() {
			item := it.Item()
			if item == nil {
				it.Next()
				continue
			}
			e := item.Entry()
			if e == nil || !e.IsRangeDelete() {
				it.Next()
				continue
			}
			rangeCF, rangeStart, rangeVersion := kv.SplitInternalKey(e.Key)
			if rangeCF != cf || rangeVersion < version {
				it.Next()
				continue
			}
			rangeEnd := e.RangeEnd()
			if kv.KeyInRange(userKey, rangeStart, rangeEnd) {
				return true
			}
			it.Next()
		}
		it.Close()
	}
Copilot AI commented Feb 28, 2026

isKeyCoveredByRangeTombstone returns true from inside the iterator loop without closing the current iterator (and any remaining iterators), which can leak resources/file handles. Ensure iterators are closed on all exit paths (e.g., defer it.Close() per iterator, and close previously created iterators before returning).
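
The close-on-all-exit-paths pattern the comment suggests can be sketched with a single deferred cleanup over the iterator slice; the iterator interface here is a stand-in, not NoKV's:

```go
package main

import "fmt"

type iterator interface {
	Close() error
}

type fakeIter struct{ closed *int }

func (f *fakeIter) Close() error { *f.closed++; return nil }

// scan closes every iterator via one deferred loop, so an early return
// inside the body cannot leak handles.
func scan(iters []iterator) (found bool) {
	defer func() {
		for _, it := range iters {
			if it != nil {
				_ = it.Close() // runs on every return path, early or not
			}
		}
	}()
	for range iters {
		return true // early return: the defer still closes everything
	}
	return false
}

func main() {
	closed := 0
	iters := []iterator{&fakeIter{&closed}, &fakeIter{&closed}}
	scan(iters)
	fmt.Println(closed) // 2: both iterators closed despite the early return
}
```
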

db.go Outdated
Comment on lines +447 to +451
// Store range tombstone: Entry.Key=start, Entry.Value=end
entry := kv.NewEntryWithCF(cf, start, end)
entry.Meta = kv.BitRangeDelete
defer entry.DecrRef()
return db.setEntry(entry)
Copilot AI commented Feb 28, 2026

Range tombstones are written using the same internal-key scheme as point entries (Entry.Key=start + fixed nonTxnMaxVersion). This collides with normal writes to the start key: a subsequent Set(start, ...) will overwrite the tombstone record and can unintentionally remove the deletion for other keys in the range. To preserve correct semantics, encode range tombstones under a distinct internal key kind/prefix so they can coexist with point keys at the same start key.

db.go Outdated
Comment on lines +624 to +633
func (db *DB) isKeyCoveredByRangeTombstone(cf kv.ColumnFamily, userKey []byte, version uint64) bool {
opt := &utils.Options{IsAsc: true}
iters := db.lsm.NewIterators(opt)
for _, it := range iters {
if it == nil {
continue
}
it.Rewind()
for it.Valid() {
item := it.Item()
Copilot AI commented Feb 28, 2026

This helper scans from Rewind() across every LSM iterator for each lookup, which is O(total-keys) per Get/iterator step and can become a major regression once tombstones exist. Consider indexing active range tombstones (per-CF) or at least short-circuiting the scan based on key ordering; also consider using Seek to avoid full scans when possible.
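
One direction for the suggested indexing, as a rough sketch: keep active tombstones in a start-sorted slice and binary-search per lookup. Names are hypothetical, and without a true interval tree this shortcut is only exact when ranges do not overlap; it shows the direction, not NoKV's actual collector:

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

type interval struct{ start, end []byte }

// tombstoneIndex replaces the O(total-keys) LSM scan with an O(log n)
// search over a start-sorted slice of [start, end) intervals.
type tombstoneIndex struct{ sorted []interval }

func (ix *tombstoneIndex) add(start, end []byte) {
	ix.sorted = append(ix.sorted, interval{start, end})
	sort.Slice(ix.sorted, func(i, j int) bool {
		return bytes.Compare(ix.sorted[i].start, ix.sorted[j].start) < 0
	})
}

// covered reports whether key falls in some [start, end).
func (ix *tombstoneIndex) covered(key []byte) bool {
	// First interval whose start is > key; the candidate is the one before it.
	i := sort.Search(len(ix.sorted), func(i int) bool {
		return bytes.Compare(ix.sorted[i].start, key) > 0
	})
	return i > 0 && bytes.Compare(key, ix.sorted[i-1].end) < 0
}

func main() {
	ix := &tombstoneIndex{}
	ix.add([]byte("a0"), []byte("a9"))
	fmt.Println(ix.covered([]byte("a5"))) // true
	fmt.Println(ix.covered([]byte("b0"))) // false
}
```
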

Comment on lines +194 to +203
// Skip range tombstone entries themselves
if src.IsRangeDelete() {
	return false
}
iter.entry = *src
cf, userKey, ts := kv.SplitInternalKey(iter.entry.Key)
// Check if this key is covered by a range tombstone
if iter.db != nil && iter.db.isKeyCoveredByRangeTombstone(cf, userKey, ts) {
	return false
}
Copilot AI commented Feb 28, 2026

materialize calls db.isKeyCoveredByRangeTombstone for every entry produced by the iterator, which currently performs a full LSM scan per key. This can make iteration O(n²) and can also amplify the resource-leak risk if iterators aren’t always closed. Consider reworking iterator-side filtering to share a precomputed/seekable tombstone view rather than doing per-key full scans.

Comment on lines +18 to +25
// Test 1: Basic deletion with [start, end) semantics
db.Set([]byte("a"), []byte("1"))
db.Set([]byte("b"), []byte("2"))
db.Set([]byte("c"), []byte("3"))

if err := db.DeleteRange([]byte("a"), []byte("c")); err != nil {
	t.Fatal(err)
}
Copilot AI commented Feb 28, 2026

The test ignores errors from writes/deletes in multiple places (e.g., db.Set, db.DeleteRange, db.Del). If any of these operations fails, later assertions may produce misleading results. Prefer asserting require.NoError/t.Fatal on these calls to keep failures actionable.

lsm/executor.go Outdated
Comment on lines +1011 to +1015
if isMaxLevel {
updateStats(entry)
continue
}
rangeTombstones = append(rangeTombstones, entry)
Copilot AI commented Feb 28, 2026

In max-level compactions, range tombstones are currently skipped (continue) and not recorded in rangeTombstones, so covered keys won’t be filtered and the deletion can be lost (keys may reappear after compaction). Consider still using max-level tombstones to discard covered keys (but avoid emitting the tombstone record into the output tables).

Suggested change
-	if isMaxLevel {
-		updateStats(entry)
-		continue
-	}
-	rangeTombstones = append(rangeTombstones, entry)
+	// Track all range tombstones so they can filter subsequent keys.
+	rangeTombstones = append(rangeTombstones, entry)
+	if isMaxLevel {
+		// At max level, use the tombstone to discard covered keys but do not
+		// emit the tombstone itself into the output tables.
+		updateStats(entry)
+		continue
+	}

@coderabbitai coderabbitai bot commented Feb 28, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Adds range-tombstone (DeleteRange/DeleteRangeCF) support: new Tombstone entry type, in-memory and on-disk collector, memtable and LSM integration for tombstone recording and coverage checks, iterator/readpath suppression of covered keys, compaction handling to preserve/apply tombstones, and extensive tests for behavior and recovery.

Changes

  • DB API & iterator (db.go, iterator.go): Added DeleteRange/DeleteRangeCF APIs and input validation; iterator now holds db *DB and skips range-tombstone entries and keys covered by active tombstones; loadBorrowedEntry updated to treat range-delete entries as not-found.
  • KV entry defs (kv/const.go, kv/entry.go): Introduced BitRangeDelete flag, Entry.IsRangeDelete(), Entry.RangeEnd(), and the KeyInRange() helper for [start, end) checks.
  • Memtable changes (lsm/memtable.go): Store memtable-local range tombstones and per-key WAL sequence offsets; record tombstones on writes and replay; add coverage checks using version or WAL sequence.
  • LSM integration (lsm/lsm.go, lsm/levels.go): Added RangeTombstoneCollector wiring into levelManager and LSM; public APIs to check coverage and count; flush path collects tombstones and registers them with the collector; rebuild at startup.
  • Compaction executor (lsm/executor.go): Preserve tombstones across builder splits, track an in-memory rangeTombstone list during subcompact, skip/record covered entries, and trigger a rebuild of the tombstone index after max-level compaction.
  • Range tombstone subsystem (lsm/range_tombstone.go, lsm/range_tombstone_test.go): New RangeTombstone type and RangeTombstoneCollector with Add, Rebuild, IsKeyCovered, Count; rebuild from SSTables; unit tests for collector behavior.
  • Compaction/read tests (lsm/lsm_test.go, db_deleterange_test.go): Added regression and integration tests covering max-level compaction resurrection scenarios, DeleteRange semantics, CF isolation, overlapping ranges, WAL recovery, and visibility edge cases.

Sequence Diagram(s)

sequenceDiagram
    participant App as Application
    participant DB as DB
    participant Mem as Memtable
    participant LSM as LSM
    participant Col as RangeTombstoneCollector

    App->>DB: DeleteRangeCF(cf, start, end)
    DB->>DB: validate start<end
    DB->>Mem: write BitRangeDelete entry (Key=start, Value=end)
    Mem->>Mem: record mem-range-tombstone + wal seq
    Mem-->>DB: ack
    Note over Mem,LSM: On flush
    Mem->>LSM: flush entries incl. range deletes
    LSM->>Col: Add(range tombstone)
    Col-->>LSM: stored
sequenceDiagram
    participant App as Application
    participant DB as DB
    participant Mem as Memtable
    participant LSM as LSM
    participant Col as RangeTombstoneCollector
    participant Comp as CompactionEngine

    App->>DB: GetCF(cf, key)
    DB->>Mem: check memtable coverage (isKeyCoveredByRangeTombstone)
    alt covered in memtable
        Mem-->>DB: covered -> ErrKeyNotFound
    else not covered
        DB->>LSM: get entry
        LSM->>Col: IsKeyCovered(cf, key, entryVersion)
        alt covered by collector
            Col-->>LSM: true
            LSM-->>DB: ErrKeyNotFound
        else
            LSM-->>DB: return entry
        end
    end

    Note over Comp,LSM: During compaction
    Comp->>Comp: collect range deletes, skip covered entries
    Comp->>Col: rebuild/register tombstones after max-level rewrite

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 I nibble bytes from start to end,
With tombstones placed, old keys suspend.
Compactions hum and collectors grow,
No stray key springs from long ago.
A hop, a test, range cleared—Bravo!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 41.94% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately and concisely describes the primary change: adding DeleteRange functionality with range tombstones as the core implementation mechanism.



@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@db_deleterange_test.go`:
- Line 16: Tests in db_deleterange_test.go call functions that return errors
which are currently ignored (e.g., db.Close() and various setup calls); add a
small test helper like mustNoErr(t, err) that calls t.Helper() and t.Fatalf on
non-nil errors, then replace all unchecked error-returning calls (the ones
flagged around lines with db.Close() and the other setup/operation calls) by
checking their error via mustNoErr(t, err) (or call mustNoErr immediately after
the call) so every error is handled and tests fail fast with a clear message.

In `@db.go`:
- Around line 624-657: The current isKeyCoveredByRangeTombstone function
performs a full LSM iterator scan (db.lsm.NewIterators) per lookup which is
expensive; instead implement and use a CF-scoped in-memory range-tombstone index
(e.g., an interval tree or sorted interval list) built/updated from memtables
and SSTable metadata and query it in O(log n) from isKeyCoveredByRangeTombstone;
replace the iterator scan in isKeyCoveredByRangeTombstone with a lookup against
that index (keyed by cf and version/sequence) and ensure the index is updated
during memtable inserts, memtable flushes and SSTable load/compaction routines
so range tombstones are visible to Get and iterator materialization without
scanning the LSM.
- Around line 624-655: The function isKeyCoveredByRangeTombstone leaks iterator
resources by returning early without closing the current iterator and remaining
iterators from db.lsm.NewIterators; change the logic so that before any early
return (including when kv.KeyInRange(...) causes a true result) you close the
current iterator and all remaining iterators, and propagate or at least log any
errors returned by it.Close(); ensure the cleanup runs regardless of the return
path (e.g., use defer per-iterator or a single deferred cleanup over the
iterators slice) while keeping the checks around kv.SplitInternalKey and
e.IsRangeDelete() intact.

In `@lsm/executor.go`:
- Around line 1001-1004: The local variable rangeTombstones is being
reinitialized each time addKeys (or the builder chunk loop) runs, causing
tombstones to be lost when a builder splits; instead persist and reuse the slice
across chunks by moving the declaration and initialization out of the per-chunk
scope (or into the builder/compaction state) so entries accumulate between
successive addKeys invocations; update references around the blocks that declare
var rangeTombstones []*kv.Entry (and the related uses near tableKr and
isMaxLevel) so they append to the shared slice rather than resetting it,
ensuring tombstones cover keys across split chunks.
- Around line 1034-1038: The code currently discards column-family info when
checking range tombstone coverage; capture and compare CFs before marking a key
as covered. Change the kv.SplitInternalKey call in the loop to capture the CF
(e.g., cf, userKey, version := kv.SplitInternalKey(key)) and likewise extract
the CF for each tombstone (e.g., rtCF, rtStart, rtVersion :=
kv.SplitInternalKey(rt.Key)) and only set covered = true when cf == rtCF &&
rtVersion >= version && kv.KeyInRange(userKey, rtStart, rt.RangeEnd()). This
ensures a tombstone in one column-family does not affect keys in another.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between aafe6ba and 5aeb588.

📒 Files selected for processing (6)
  • db.go
  • db_deleterange_test.go
  • iterator.go
  • kv/const.go
  • kv/entry.go
  • lsm/executor.go

@feichai0017 (Owner)

Thanks for the contribution. The feature direction is good, but I think we should fix a few blocking issues before merge:

  • compaction range coverage must include a CF match,
  • avoid storing *kv.Entry tombstones (the iterator entry is reused),
  • keep tombstone state across builder splits in subcompact,
  • fix iterator cleanup and the current full-scan read-path overhead,
  • resolve lint (errcheck) failures.

After these are addressed, this PR should be in much better shape for merge.

@nothiny (Contributor, Author) commented Mar 1, 2026

Thanks for the contribution. The feature direction is good, but I think we should fix a few blocking issues before merge:

compaction range coverage must include CF match, avoid storing *kv.Entry tombstones (iterator entry is reused), keep tombstone state across builder splits in subcompact, fix iterator cleanup + current full-scan read-path overhead, resolve lint (errcheck) failures. After these are addressed, this PR should be in a much better shape for merge.

Thank you for the detailed review and the helpful suggestions! I'll address the blocking issues you've listed:

  • compaction range coverage must include CF match,
  • avoid storing *kv.Entry tombstones (iterator entry is reused),
  • keep tombstone state across builder splits in subcompact,
  • fix iterator cleanup + current full-scan read-path overhead,
  • resolve lint (errcheck) failures.

Regarding the performance point (iterator cleanup and full-scan read-path overhead), I'll fix the immediate issues as required. However, to keep the changes focused and avoid scope creep, I plan to add a TODO comment in the code to highlight potential further optimizations in that area. That way, we can track it as a future improvement.

I'll update the PR shortly. Thanks again for the feedback!

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (1)
lsm/executor.go (1)

1051-1065: CF matching and coverage check logic looks correct.

The implementation properly:

  1. Extracts CF from both the entry (cf) and tombstone (rt.cf)
  2. Compares CF before applying coverage: rt.cf == cf
  3. Checks version ordering: rt.version >= version (newer tombstone shadows older entry)
  4. Validates range coverage: kv.KeyInRange(userKey, rt.start, rt.end)

This addresses the previous review concern about cross-CF tombstone application.

Consider adding a TODO for performance optimization.

The O(entries × tombstones) complexity may become a concern with many active range tombstones. Per the PR objectives and reviewer feedback, a TODO was planned for read-path optimization—consider adding one here for the compaction path as well.


💡 Optional: Add TODO for future optimization
 			if !entry.IsRangeDelete() {
 				covered := false
 				cf, userKey, version := kv.SplitInternalKey(key)
+				// TODO: Optimize range tombstone lookup. Consider:
+				// - Interval tree for O(log n) coverage checks
+				// - Pruning tombstones whose range is entirely before current key
 				for _, rt := range rangeTombstones {
 					// Check CF match and version/range coverage.
 					if rt.cf == cf && rt.version >= version && kv.KeyInRange(userKey, rt.start, rt.end) {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lsm/executor.go` around lines 1051 - 1065, The CF-matching and coverage loop
is correct but can be costly O(entries × rangeTombstones); add a clear TODO near
the loop that mentions optimizing the compaction/read-path for many active range
tombstones and suggests future approaches (e.g., index by CF or use interval
tree) so maintainers know to revisit; place the TODO adjacent to the code using
entry.IsRangeDelete(), kv.SplitInternalKey(...), rangeTombstones, and
updateStats to make it easy to find.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@lsm/executor.go`:
- Around line 1051-1065: The CF-matching and coverage loop is correct but can be
costly O(entries × rangeTombstones); add a clear TODO near the loop that
mentions optimizing the compaction/read-path for many active range tombstones
and suggests future approaches (e.g., index by CF or use interval tree) so
maintainers know to revisit; place the TODO adjacent to the code using
entry.IsRangeDelete(), kv.SplitInternalKey(...), rangeTombstones, and
updateStats to make it easy to find.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5aeb588 and dd78266.

📒 Files selected for processing (2)
  • db.go
  • lsm/executor.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • db.go

@feichai0017 (Owner) commented Mar 3, 2026

Thanks for the PR — really appreciate the work here. The overall direction looks good, but I found a few issues that I think should be addressed before merge:

  1. Visibility bug after DeleteRange + rewrite inside the same range

    • Files: db.go , lsm/executor.go
    • Current coverage check relies on rangeVersion >= version.
    • In the non-transactional path, writes use the same nonTxnMaxVersion, so write order is not represented.
    • This can incorrectly hide newer point writes under older range tombstones.
    • Repro:
      • Set(a1, old)
      • DeleteRange(a0, a9)
      • Set(a1, new)
      • Get(a1) returns Key not found (expected new).
  2. Potential key resurrection risk in max-level compaction

    • File: lsm/executor.go (isMaxLevel branch)
    • Range tombstones are dropped immediately in the max-level path.
    • But subcompaction is range-sharded, and a shard may not fully cover the tombstone span.
    • If the tombstone is dropped too early, older covered point keys outside that shard can become visible again.
  3. Branch currently does not merge cleanly with main

    • There is a conflict in iterator.go when merging this PR into current main.
    • Please rebase/merge from latest main, resolve conflicts, and rerun full tests.

Happy to re-review after these are addressed — this feature is valuable, and getting these semantics right will make it much safer to land.
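
The visibility repro in point 1 can be modeled outside NoKV with a toy versioned log; `nonTxnMaxVersion`, the `op` type, and the `get` logic below are stand-ins built to mirror the comment's description, not code from the PR:

```go
package main

import "fmt"

// When every non-transactional write carries the same version, the coverage
// predicate "tombstone version >= entry version" cannot tell a later Set from
// an earlier one, so the rewrite of a1 is wrongly hidden.
type op struct {
	version    uint64
	isRange    bool
	start, end string
	key, val   string
}

func get(ops []op, key string) (string, bool) {
	var val string
	var valVer uint64
	var found bool
	for _, o := range ops {
		if !o.isRange && o.key == key {
			val, valVer, found = o.val, o.version, true
		}
	}
	if !found {
		return "", false
	}
	for _, o := range ops {
		if o.isRange && o.start <= key && key < o.end && o.version >= valVer {
			return "", false // tombstone shadows the write — even a later one
		}
	}
	return val, true
}

func main() {
	const nonTxnMaxVersion = 42 // every non-txn op shares this version
	ops := []op{
		{version: nonTxnMaxVersion, key: "a1", val: "old"},
		{version: nonTxnMaxVersion, isRange: true, start: "a0", end: "a9"},
		{version: nonTxnMaxVersion, key: "a1", val: "new"}, // written AFTER the delete
	}
	_, ok := get(ops, "a1")
	fmt.Println(ok) // false: the newer write is lost, matching the reported bug
}
```
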

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

♻️ Duplicate comments (1)
db_deleterange_test.go (1)

15-16: ⚠️ Potential issue | 🟠 Major

Check every error-returning DB call in these tests.

The new cases still ignore Close, Set, SetCF, Del, DeleteRange, and DeleteRangeCF errors in multiple places. That preserves the current errcheck failure and can turn setup failures into misleading assertions.

🧪 Proposed fix pattern
+func mustNoErr(t *testing.T, err error) {
+	t.Helper()
+	if err != nil {
+		t.Fatal(err)
+	}
+}
+
 func TestDeleteRangeCore(t *testing.T) {
 	opt := getTestOptions(t.TempDir())
 	db := Open(opt)
-	defer db.Close()
+	t.Cleanup(func() { mustNoErr(t, db.Close()) })
 
-	db.Set([]byte("a"), []byte("1"))
-	db.Set([]byte("b"), []byte("2"))
-	db.Set([]byte("c"), []byte("3"))
+	mustNoErr(t, db.Set([]byte("a"), []byte("1")))
+	mustNoErr(t, db.Set([]byte("b"), []byte("2")))
+	mustNoErr(t, db.Set([]byte("c"), []byte("3")))
 
-	db.DeleteRange([]byte("key1"), []byte("key3"))
+	mustNoErr(t, db.DeleteRange([]byte("key1"), []byte("key3")))

Also applies to: 19-21, 38-42, 55-56, 70-71, 87-87, 109-114, 128-137, 153-157, 179-186, 208-216, 234-238

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@db_deleterange_test.go` around lines 15 - 16, Every error-returning DB call
in these tests (Open/Close, Set, SetCF, Del, DeleteRange, DeleteRangeCF) must
have their errors captured and asserted instead of being ignored; change
statements like db := Open(opt); defer db.Close() and calls to
Set/SetCF/Del/DeleteRange/DeleteRangeCF to capture the returned error (err :=
db.Set(...), err := db.SetCF(...), etc.) and assert no error (use
require.NoError(t, err) or t.Fatalf on err), and change defer db.Close() to
defer a closure that checks and fails the test if db.Close() returns an error
(defer func(){ if err := db.Close(); err != nil { t.Fatalf(...) } }()). Apply
this to all occurrences referencing Open, Close, Set, SetCF, Del, DeleteRange,
and DeleteRangeCF in the test file.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@lsm/executor.go`:
- Around line 552-556: The rebuild vs incremental-update race: serialize
rebuildRangeTombstones()/Rebuild() with concurrent rtCollector.Add() from flush
paths by introducing a shared synchronization point (e.g., a mutex or RWLock)
around lm.rtCollector mutations; in practice, wrap the full rebuild performed in
runCompactDef() (rebuildRangeTombstones()/Rebuild()) in the same lock you use
when calling rtCollector.Add() in lsm/levels.go, or alternatively change the
flush path to build a new collector and swap it under the same lock so no
tombstones can be lost during the swap. Ensure you reference lm.rtCollector,
runCompactDef(), rebuildRangeTombstones(), Rebuild(), and rtCollector.Add() when
adding the locking to guarantee both code paths use the same synchronization.

In `@lsm/memtable.go`:
- Around line 213-221: The memRangeTombstone created for live writes in the
setBatch() path is missing its version, causing isKeyCoveredByRangeTombstone()
to skip version ordering for new tombstones; populate memRangeTombstone.version
when constructing the tombstone (use the memtable's current version field—e.g.,
m.version or the same version value used/restored in openMemTable()) so new
range-deletes have the correct version ordering, mirroring how openMemTable()
restores version for recovered tombstones.
- Around line 131-137: GetMemTables() only pins the memtable index but not
keyWalOffset or rangeTombstones, allowing close() to clear them while a pinned
reader is still using them; change the cleanup so tombstone metadata is retained
until the last pinned reader releases the memtable: introduce/use a
memtable-owned refcount (incremented by GetMemTables pin and decremented by the
reader unpin) and move the calls keyWalOffset.Clear() and the rangeTombstones =
nil assignment (currently under rtMu in close()) to run only when that refcount
reaches zero (or add a dedicated release method that does this when
refcount==0), leaving close() to mark the memtable closed but not free tombstone
metadata prematurely.

In `@lsm/range_tombstone.go`:
- Around line 13-18: Range tombstones currently only store Version which loses
tie-break ordering after flush/rebuild; add a stable tie-breaker field (e.g.,
WalSeq or Seq uint64) to RangeTombstone and persist it so equality of Version
can be resolved deterministically across restarts, then update IsKeyCovered()
and any logic in memTable.isKeyCoveredByRangeTombstone() to compare (Version,
Seq) instead of just Version (use Seq when Versions are equal) so flushed
tombstones keep the same ordering as in-memory walSeqKey; ensure
serialization/deserialization and any constructors/builders for RangeTombstone
(and code paths at the locations mentioned around lines ~39-52 and ~111-117) are
updated to read/write the new field.
- Around line 55-59: The Add method on RangeTombstoneCollector currently appends
caller-owned RangeTombstone slices directly, allowing external mutation of
Start/End; change RangeTombstoneCollector.Add(rt RangeTombstone) to take
ownership by making defensive copies of rt.Start and rt.End (e.g., allocate new
byte slices and copy contents) before locking and appending so the collector
stores its own slices; update any tests if needed and keep this behavior
consistent with rebuildRangeTombstones/Rebuild which already safe-copies.

---

Duplicate comments:
In `@db_deleterange_test.go`:
- Around line 15-16: Every error-returning DB call in these tests (Open/Close,
Set, SetCF, Del, DeleteRange, DeleteRangeCF) must have their errors captured and
asserted instead of being ignored; change statements like db := Open(opt); defer
db.Close() and calls to Set/SetCF/Del/DeleteRange/DeleteRangeCF to capture the
returned error (err := db.Set(...), err := db.SetCF(...), etc.) and assert no
error (use require.NoError(t, err) or t.Fatalf on err), and change defer
db.Close() to defer a closure that checks and fails the test if db.Close()
returns an error (defer func(){ if err := db.Close(); err != nil { t.Fatalf(...)
} }()). Apply this to all occurrences referencing Open, Close, Set, SetCF, Del,
DeleteRange, and DeleteRangeCF in the test file.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e8a53532-8152-45ce-91bd-0025926bf553

📥 Commits

Reviewing files that changed from the base of the PR and between dd78266 and f3d2dfc.

📒 Files selected for processing (12)
  • db.go
  • db_deleterange_test.go
  • iterator.go
  • lsm/executor.go
  • lsm/levels.go
  • lsm/lsm.go
  • lsm/lsm_test.go
  • lsm/memtable.go
  • lsm/range_tombstone.go
  • lsm/range_tombstone_test.go
  • utils/art.go
  • utils/skiplist.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • iterator.go

Comment on lines +552 to +556
	// After max-level compaction, range tombstone layout may change.
	// Rebuild the in-memory range tombstone index to keep read visibility correct.
	if cd.nextLevel != nil && cd.nextLevel.levelNum == lm.opt.MaxLevelNum-1 && lm.rtCollector != nil {
		lm.rebuildRangeTombstones()
	}

⚠️ Potential issue | 🔴 Critical

Serialize collector rebuilds with concurrent flush updates.

runCompactDef() now does a full rebuildRangeTombstones(), but lsm/levels.go still mutates the same collector incrementally with rtCollector.Add(). If a flush installs a tombstone after the rebuild snapshot is taken but before Rebuild() swaps the slice, that tombstone is dropped from the in-memory index and covered keys can become visible again. These paths need a shared synchronization point, or a single-source rebuild/update strategy.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lsm/executor.go` around lines 552 - 556, The rebuild vs incremental-update
race: serialize rebuildRangeTombstones()/Rebuild() with concurrent
rtCollector.Add() from flush paths by introducing a shared synchronization point
(e.g., a mutex or RWLock) around lm.rtCollector mutations; in practice, wrap the
full rebuild performed in runCompactDef() (rebuildRangeTombstones()/Rebuild())
in the same lock you use when calling rtCollector.Add() in lsm/levels.go, or
alternatively change the flush path to build a new collector and swap it under
the same lock so no tombstones can be lost during the swap. Ensure you reference
lm.rtCollector, runCompactDef(), rebuildRangeTombstones(), Rebuild(), and
rtCollector.Add() when adding the locking to guarantee both code paths use the
same synchronization.
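One way to realize the shared synchronization point this comment asks for, as a sketch: a single mutex owned by the collector guards both the incremental `Add()` from the flush path and the full rebuild swap, so a flush landing mid-rebuild cannot be dropped. Names and the `RebuildWith` signature are illustrative, not the PR's actual API.

```go
package main

import (
	"fmt"
	"sync"
)

type tombstone struct{ start, end string }

// collector serializes incremental adds and full rebuild swaps with one
// mutex, so neither path can observe or produce a half-swapped slice.
type collector struct {
	mu         sync.Mutex
	tombstones []tombstone
}

func (c *collector) Add(t tombstone) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.tombstones = append(c.tombstones, t)
}

// RebuildWith runs the scan and swaps in its result under the same lock
// used by Add, excluding concurrent flush-time additions for the duration.
func (c *collector) RebuildWith(scan func() []tombstone) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.tombstones = scan()
}

func main() {
	c := &collector{}
	c.Add(tombstone{"a", "m"})
	c.RebuildWith(func() []tombstone { return []tombstone{{"a", "z"}} })
	c.Add(tombstone{"k", "p"}) // cannot interleave with the swap above
	fmt.Println(len(c.tombstones))
}
```

Holding the lock across the scan trades rebuild latency for correctness; the alternative mentioned in the comment (build a new collector, swap under the lock) shortens the critical section at the cost of a copy.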

lsm/memtable.go Outdated
Comment on lines +131 to +137
	// Release the keyWalOffset side map and range tombstone cache to prevent
	// unbounded memory growth after flush. Range tombstones have already been
	// transferred to rtCollector by the flush path before close is called.
	m.keyWalOffset.Clear()
	m.rtMu.Lock()
	m.rangeTombstones = nil
	m.rtMu.Unlock()

⚠️ Potential issue | 🔴 Critical

Keep tombstone metadata alive until the last pinned reader is done.

GetMemTables() pins the memtable index, not keyWalOffset or rangeTombstones. A concurrent flush can remove an immutable, call close(), and clear both before an in-flight read finishes its tombstone check, which reopens the same-version visibility bug on that pinned table.

Possible minimal fix
 func (m *memTable) close() error {
 	if m == nil {
 		return nil
 	}
-	// Release the keyWalOffset side map and range tombstone cache to prevent
-	// unbounded memory growth after flush. Range tombstones have already been
-	// transferred to rtCollector by the flush path before close is called.
-	m.keyWalOffset.Clear()
-	m.rtMu.Lock()
-	m.rangeTombstones = nil
-	m.rtMu.Unlock()
 	return nil
 }

Longer term, if you still want early reclamation, this cleanup needs to move behind a memtable-owned refcount rather than close().

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
	// Release the keyWalOffset side map and range tombstone cache to prevent
	// unbounded memory growth after flush. Range tombstones have already been
	// transferred to rtCollector by the flush path before close is called.
	m.keyWalOffset.Clear()
	m.rtMu.Lock()
	m.rangeTombstones = nil
	m.rtMu.Unlock()
func (m *memTable) close() error {
	if m == nil {
		return nil
	}
	return nil
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lsm/memtable.go` around lines 131 - 137, GetMemTables() only pins the
memtable index but not keyWalOffset or rangeTombstones, allowing close() to
clear them while a pinned reader is still using them; change the cleanup so
tombstone metadata is retained until the last pinned reader releases the
memtable: introduce/use a memtable-owned refcount (incremented by GetMemTables
pin and decremented by the reader unpin) and move the calls keyWalOffset.Clear()
and the rangeTombstones = nil assignment (currently under rtMu in close()) to
run only when that refcount reaches zero (or add a dedicated release method that
does this when refcount==0), leaving close() to mark the memtable closed but not
free tombstone metadata prematurely.
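The refcount approach the prompt suggests can be sketched like this (a single-goroutine illustration; field and method names are assumptions, and real code would need the `closed` flag published under the same synchronization as the refcount):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// memTable sketch: tombstone metadata is freed only when the last pinned
// reader releases the table, never directly in close().
type memTable struct {
	refs            int64
	closed          bool
	rangeTombstones []string
}

// Pin is called by GetMemTables before handing the table to a reader.
func (m *memTable) Pin() { atomic.AddInt64(&m.refs, 1) }

// Unpin is called when the reader is done; the last release after close()
// performs the deferred cleanup.
func (m *memTable) Unpin() {
	if atomic.AddInt64(&m.refs, -1) == 0 && m.closed {
		m.rangeTombstones = nil
	}
}

// close marks the table closed but leaves cleanup to the last Unpin
// if any reader still holds a pin.
func (m *memTable) close() {
	m.closed = true
	if atomic.LoadInt64(&m.refs) == 0 {
		m.rangeTombstones = nil
	}
}

func main() {
	m := &memTable{rangeTombstones: []string{"[a0,a9)"}}
	m.Pin()
	m.close()
	fmt.Println(m.rangeTombstones != nil) // still alive: a reader is pinned
	m.Unpin()
	fmt.Println(m.rangeTombstones == nil) // freed on the last release
}
```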

Comment on lines +13 to +18
type RangeTombstone struct {
	CF      kv.ColumnFamily
	Start   []byte // inclusive
	End     []byte // exclusive
	Version uint64
}

⚠️ Potential issue | 🔴 Critical

Flushed tombstones still lose equal-version ordering.

memTable.isKeyCoveredByRangeTombstone() uses walSeqKey to break ties when versions are equal, but RangeTombstone and IsKeyCovered() persist only Version. In the non-transactional path, point writes and range tombstones share the same version, so after flush/rebuild a deleted key can become visible again because rt.Version > version is false for the exact case that still needs ordering.

This needs a stable ordering signal that survives flush/rebuild, not just the version.

Also applies to: 39-52, 111-117

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lsm/range_tombstone.go` around lines 13 - 18, Range tombstones currently only
store Version which loses tie-break ordering after flush/rebuild; add a stable
tie-breaker field (e.g., WalSeq or Seq uint64) to RangeTombstone and persist it
so equality of Version can be resolved deterministically across restarts, then
update IsKeyCovered() and any logic in memTable.isKeyCoveredByRangeTombstone()
to compare (Version, Seq) instead of just Version (use Seq when Versions are
equal) so flushed tombstones keep the same ordering as in-memory walSeqKey;
ensure serialization/deserialization and any constructors/builders for
RangeTombstone (and code paths at the locations mentioned around lines ~39-52
and ~111-117) are updated to read/write the new field.
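The (Version, Seq) comparison the prompt asks for can be sketched as follows. The `Seq` field name and the `covers` helper are illustrative assumptions; the PR would persist whatever tie-breaker it chooses alongside the tombstone:

```go
package main

import "fmt"

// RangeTombstone with a persisted sequence tie-breaker, so equal-version
// ordering survives flush/rebuild.
type RangeTombstone struct {
	Start, End string
	Version    uint64
	Seq        uint64 // stable WAL/write sequence, persisted with the tombstone
}

// covers reports whether rt hides a write at (version, seq): a newer
// tombstone version wins outright; on equal versions the sequence breaks
// the tie, mirroring the in-memory walSeqKey ordering.
func (rt RangeTombstone) covers(key string, version, seq uint64) bool {
	if key < rt.Start || key >= rt.End {
		return false
	}
	if rt.Version != version {
		return rt.Version > version
	}
	return rt.Seq > seq
}

func main() {
	rt := RangeTombstone{Start: "a0", End: "a9", Version: 7, Seq: 10}
	fmt.Println(rt.covers("a1", 7, 5))  // older same-version write: covered
	fmt.Println(rt.covers("a1", 7, 12)) // newer same-version write: visible
}
```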

Comment on lines +55 to +59
// Add appends a single tombstone to the collector.
func (c *RangeTombstoneCollector) Add(rt RangeTombstone) {
	c.mu.Lock()
	c.tombstones = append(c.tombstones, rt)
	c.mu.Unlock()

🛠️ Refactor suggestion | 🟠 Major

Make Add() take ownership of Start and End.

rebuildRangeTombstones() safe-copies before Rebuild(), but Add() still stores caller-owned slices verbatim. One iterator-backed caller here is enough to reintroduce mutated tombstone boundaries.

Suggested fix
 func (c *RangeTombstoneCollector) Add(rt RangeTombstone) {
+	rt.Start = kv.SafeCopy(nil, rt.Start)
+	rt.End = kv.SafeCopy(nil, rt.End)
 	c.mu.Lock()
 	c.tombstones = append(c.tombstones, rt)
 	c.mu.Unlock()
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lsm/range_tombstone.go` around lines 55 - 59, The Add method on
RangeTombstoneCollector currently appends caller-owned RangeTombstone slices
directly, allowing external mutation of Start/End; change
RangeTombstoneCollector.Add(rt RangeTombstone) to take ownership by making
defensive copies of rt.Start and rt.End (e.g., allocate new byte slices and copy
contents) before locking and appending so the collector stores its own slices;
update any tests if needed and keep this behavior consistent with
rebuildRangeTombstones/Rebuild which already safe-copies.

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (2)
lsm/levels.go (1)

212-228: Consider logging or handling malformed range tombstone keys.

When kv.SplitInternalKey(entry.Key) fails for a range tombstone entry (Line 219-220), the code continues without adding the tombstone to newTombstones, but the entry is still added to the SSTable via builder.AddKey(entry) on Line 229. This creates an inconsistency where the SSTable contains a range tombstone that the in-memory collector doesn't know about.

While this is unlikely to happen with well-formed data, a defensive approach would be to either skip adding malformed tombstones to the builder entirely, or log a warning.

Proposed fix
 		if entry != nil && entry.IsRangeDelete() && lm.rtCollector != nil {
 			cf, start, version, ok := kv.SplitInternalKey(entry.Key)
 			if !ok {
+				// Skip malformed range tombstone entries entirely
 				continue
 			}
 			newTombstones = append(newTombstones, RangeTombstone{
 				CF:      cf,
 				Start:   kv.SafeCopy(nil, start),
 				End:     kv.SafeCopy(nil, entry.RangeEnd()),
 				Version: version,
 			})
 		}
+		if entry == nil {
+			continue
+		}
 		builder.AddKey(entry)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lsm/levels.go` around lines 212 - 228, The loop currently continues when
kv.SplitInternalKey(entry.Key) fails for a range-delete entry but still allows
builder.AddKey(entry) later, causing SSTables to include tombstones the
in-memory collector (lm.rtCollector / newTombstones) doesn't know about; modify
the logic so that if entry.IsRangeDelete() and kv.SplitInternalKey(...) returns
ok==false you either (a) skip adding that entry to the builder (avoid calling
builder.AddKey(entry)), or (b) emit a warning log including identifying info
(e.g., entry.Key, CF info if available) before continuing, and ensure
newTombstones and builder remain consistent; update the code paths around
newTombstones, kv.SplitInternalKey, entry.IsRangeDelete, and builder.AddKey to
implement one of these fixes.
db.go (1)

443-453: Range tombstone coverage checking could be more robust with sequence-based ordering throughout the LSM.

The current implementation uses memtable sequence numbers to determine write order during coverage checks, but falls back to version-based comparison for flushed range tombstones. Since all non-transactional operations share nonTxnMaxVersion, the version comparison alone cannot distinguish write ordering for flushed tombstones—though the current code avoids the bug by checking rt.Version > version, which returns false when both have the same version.

This is fragile: the correctness depends on memtable-level sequence tracking plus the implicit assumption that flushed tombstones don't incorrectly cover newer writes. The test TestDeleteRangeVisibilityBug validates the expected behavior for the scenario Set("a1", old) → DeleteRange("a0","a9") → Set("a1", new) → Get("a1"), but relying on version-based checks for flushed tombstones is not a robust long-term solution.

Consider propagating write-sequence information through flushing or using a fully sequence-based ordering mechanism to eliminate the version-based logic for tombstone coverage checks.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@db.go` around lines 443 - 453, The range-tombstone coverage check is fragile
because DeleteRangeCF creates entries with nonTxnMaxVersion and relies on
Version-based comparisons; instead propagate a write sequence and use it for
ordering: extend kv.NewInternalEntry / the kv.Entry to carry a sequence number
(set by DeleteRangeCF when creating the tombstone) and ensure the flush path
preserves that sequence on flushed range tombstones, then change the
tombstone-coverage logic to compare entry.Sequence (or the new sequence field)
rather than relying on Version; update DeleteRangeCF to populate the sequence
(and any memtable/flush plumbing) and ensure ApplyInternalEntries and any
coverage checks use sequence-based ordering consistently.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@db.go`:
- Around line 443-453: The range-tombstone coverage check is fragile because
DeleteRangeCF creates entries with nonTxnMaxVersion and relies on Version-based
comparisons; instead propagate a write sequence and use it for ordering: extend
kv.NewInternalEntry / the kv.Entry to carry a sequence number (set by
DeleteRangeCF when creating the tombstone) and ensure the flush path preserves
that sequence on flushed range tombstones, then change the tombstone-coverage
logic to compare entry.Sequence (or the new sequence field) rather than relying
on Version; update DeleteRangeCF to populate the sequence (and any
memtable/flush plumbing) and ensure ApplyInternalEntries and any coverage checks
use sequence-based ordering consistently.

In `@lsm/levels.go`:
- Around line 212-228: The loop currently continues when
kv.SplitInternalKey(entry.Key) fails for a range-delete entry but still allows
builder.AddKey(entry) later, causing SSTables to include tombstones the
in-memory collector (lm.rtCollector / newTombstones) doesn't know about; modify
the logic so that if entry.IsRangeDelete() and kv.SplitInternalKey(...) returns
ok==false you either (a) skip adding that entry to the builder (avoid calling
builder.AddKey(entry)), or (b) emit a warning log including identifying info
(e.g., entry.Key, CF info if available) before continuing, and ensure
newTombstones and builder remain consistent; update the code paths around
newTombstones, kv.SplitInternalKey, entry.IsRangeDelete, and builder.AddKey to
implement one of these fixes.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: cd14551a-d2fc-4ae6-b688-7cc83e01e9f1

📥 Commits

Reviewing files that changed from the base of the PR and between f3d2dfc and c6bbef2.

📒 Files selected for processing (10)
  • db.go
  • db_deleterange_test.go
  • iterator.go
  • kv/entry.go
  • lsm/executor.go
  • lsm/levels.go
  • lsm/lsm.go
  • lsm/lsm_test.go
  • lsm/memtable.go
  • lsm/range_tombstone.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • lsm/range_tombstone.go

@nothiny
Contributor Author

nothiny commented Mar 7, 2026

Thanks for the PR — really appreciate the work here. The overall direction looks good, but I found a few issues that I think should be addressed before merge:

  1. Visibility bug after DeleteRange + rewrite inside the same range

    • Files: db.go , lsm/executor.go

    • Current coverage check relies on rangeVersion >= version.

    • In the non-transactional path, writes use the same nonTxnMaxVersion, so write order is not represented.

    • This can incorrectly hide newer point writes under older range tombstones.

    • Repro:

      • Set(a1, old)
      • DeleteRange(a0, a9)
      • Set(a1, new)
      • Get(a1) returns Key not found (expected new).
  2. Potential key resurrection risk in max-level compaction

    • File: lsm/executor.go (isMaxLevel branch)
    • Range tombstones are dropped immediately in the max-level path.
    • But subcompaction is range-sharded, and a shard may not fully cover the tombstone span.
    • If the tombstone is dropped too early, older covered point keys outside that shard can become visible again.
  3. Branch currently does not merge cleanly with main

    • There is a conflict in iterator.go when merging this PR into current main.
    • Please rebase/merge from latest main, resolve conflicts, and rerun full tests.

Happy to re-review after these are addressed — this feature is valuable, and getting these semantics right will make it much safer to land.

Thanks for the detailed review — I addressed each of the issues you called out.

  1. Visibility bug after DeleteRange + rewrite inside the same range

You were right that comparing only rangeVersion >= version is insufficient for the non-transactional path, because point writes and range
tombstones both use nonTxnMaxVersion, so version alone does not encode write order.

I fixed this by introducing explicit in-mem write ordering for memtable entries:

  • lsm/memtable.go now tracks a per-entry WAL sequence (segmentID + walOffset) in keyWalOffset
  • memtable range tombstones now also retain that sequence metadata
  • coverage checks now use:
    • version ordering when versions differ
    • WAL sequence ordering when versions are equal
  • read/iterator paths pass the original internal key through so the coverage check can resolve the correct write sequence

Relevant paths:

  • lsm/memtable.go
  • db.go:646
  • iterator.go:275

This fixes the repro:

  • Set(a1, old)
  • DeleteRange(a0, a9)
  • Set(a1, new)
  • Get(a1) now returns new as expected.
  2. Potential key resurrection in max-level compaction

Agreed — dropping range tombstones immediately in the max-level path is unsafe when compaction is range-sharded.

I fixed this in lsm/executor.go by preserving range tombstones through the max-level compaction scan instead of dropping them eagerly. The
compaction logic now:

  • copies range tombstones into an in-memory list for the active compaction window
  • keeps them across builder splits
  • filters covered point keys against that list
  • avoids dropping tombstones before it is safe to do so

Relevant area:

  • lsm/executor.go:1041

This prevents older covered keys outside a partial shard rewrite from becoming visible again.

  3. Merge conflict with current main

Also addressed. I merged from the latest origin/main, resolved the iterator.go conflict, and reran the test suite against the updated
branch.

Validation run:

  • go test -count=1 ./...
  • go test -race -count=1 . ./lsm ./percolator ./raftstore/kv ./utils

So at this point the branch is rebased/merged onto current main, builds cleanly, passes the full test suite, and passes targeted race
testing on the affected subsystems.

Happy to push the updated branch for another look.
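The (segmentID, walOffset) ordering described in point 1 can be sketched as a total order on writes within a memtable's lifetime. This is an illustration of the comparison, not the PR's exact `walSeqKey` definition:

```go
package main

import "fmt"

// walSeqKey sketch: segment IDs grow as the WAL rotates, offsets grow
// within a segment, so the pair gives a total write order even when all
// non-transactional writes share the same version.
type walSeqKey struct {
	segmentID uint64
	walOffset uint64
}

// less reports whether a was written before b.
func (a walSeqKey) less(b walSeqKey) bool {
	if a.segmentID != b.segmentID {
		return a.segmentID < b.segmentID
	}
	return a.walOffset < b.walOffset
}

func main() {
	oldSet := walSeqKey{1, 100}   // Set(a1, old)
	rangeDel := walSeqKey{1, 200} // DeleteRange(a0, a9)
	newSet := walSeqKey{2, 50}    // Set(a1, new) after a WAL rotation

	// The tombstone covers the older write but not the newer one,
	// fixing the repro from the review.
	fmt.Println(oldSet.less(rangeDel))
	fmt.Println(rangeDel.less(newSet))
}
```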

nothiny and others added 3 commits March 10, 2026 10:31
…tion

- move range tombstone coverage checks into the LSM read path
- track memtable tombstones with WAL ordering for same-version reads
- rebuild flushed tombstone state on startup, flush and max-level compaction
- keep max-level tombstones during partial rewrites and add regressions
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (2)
lsm/executor.go (1)

556-560: ⚠️ Potential issue | 🔴 Critical

Serialize collector rebuilds with flush-time updates.

rebuildRangeTombstones() rebuilds from a snapshot and swaps the collector, while lsm/levels.go still appends flushed tombstones via rtCollector.Add(). A flush that lands between the scan and Rebuild() can be dropped from the in-memory index, making covered keys visible until the next rebuild.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lsm/executor.go` around lines 556 - 560, rebuildRangeTombstones() can race
with concurrent flush-time rtCollector.Add() calls causing tombstones to be
lost; serialize rebuilds with flush updates by performing the collector swap
under the same lock used for flush-time adds (or by adding an atomic swap method
on the collector). Concretely, change lm.rebuildRangeTombstones() to acquire the
flush/collector mutex (the same lock used around rtCollector.Add()), take the
snapshot and replace the collector while holding that lock, or add a
ReplaceCollectorAtomic(old,new) on rtCollector and call that from
rebuildRangeTombstones() so no rtCollector.Add() can run concurrently and no
flush can be dropped.
lsm/range_tombstone.go (1)

13-18: ⚠️ Potential issue | 🔴 Critical

Persist the equal-version tie-breaker in the collector.

The collector still reduces ordering to Version only. After flush/rebuild, equal-version point writes and range tombstones become indistinguishable again, so a newer point write can be hidden or an older key can resurface. RangeTombstone needs the persisted WAL/sequence tie-breaker, IsKeyCovered() needs to compare (Version, Seq), and rebuildRangeTombstones() needs to restore that field from disk too.

Also applies to: 39-52, 111-121

🧹 Nitpick comments (1)
db_deleterange_test.go (1)

247-291: Add an on-disk equal-version regression.

These tests cover WAL recovery and same-version visibility separately, but not the case after the tombstone leaves the memtable. Please add a Set -> DeleteRange -> force flush/rotation -> reopen/Get path so the SST/collector ordering is exercised too.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@db_deleterange_test.go` around lines 247 - 291, Update
TestDeleteRangeVisibilityBug to exercise the on-disk path: after calling
mustSet(... "old") and mustDeleteRange(...), force the tombstone out of the
memtable (e.g., trigger a memtable flush/rotation — use the existing helpers
such as mustClose/Open to persist memtables to SSTs or any db.Flush/db.Rotate
helper if available), then perform the newer mustSet(... "new"), close/reopen
the DB with Open, and finally call db.Get on "a1" to assert the value is "new";
reference TestDeleteRangeVisibilityBug, mustSet, mustDeleteRange,
mustClose/Open, and db.Get to locate and implement the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@lsm/executor.go`:
- Around line 1030-1031: The split/subcompaction currently only accumulates
tombstones seen after the worker starts, so wide tombstones that begin before
kr.Left get skipped and cease covering later splits; update the split
initialization to seed rangeTombstones with any existing rangeTombstone whose
span overlaps the split's left bound (kr.Left) before scanning forward.
Concretely: when starting a new builder split/subcompaction (the code around
where rangeTombstones is declared and where code seeks past tombstones at
kr.Left), perform a lookup/scan for tombstones with start<=kr.Left and
end>kr.Left and append them into rangeTombstones so they are carried into the
split, then proceed with the existing forward accumulation logic.

In `@lsm/lsm.go`:
- Around line 89-115: checkRangeTombstone currently breaks same-version ties
using WAL sequence from memTables (walSeqForKey) but falls back to
rtCollector.IsKeyCovered which only checks Version, so equal-version ordering
can be lost; persist the range-tombstone WAL/sequence when writing SSTs and
reconstruct that sequence into the collector during rebuild, then update
checkRangeTombstone to pass the tombstone sequence (or have
rtCollector.IsKeyCovered consult stored tombstone sequences) so the collector
can compare entrySeq vs tombstoneSeq the same way
memTable.isKeyCoveredByRangeTombstone does; relevant symbols:
checkRangeTombstone, memTable.walSeqForKey,
memTable.isKeyCoveredByRangeTombstone, levels.rtCollector and
rtCollector.IsKeyCovered — ensure SST writer, rebuild logic, and rtCollector
store and expose tombstone sequence metadata and use it in IsKeyCovered.

---

Duplicate comments:
In `@lsm/executor.go`:
- Around line 556-560: rebuildRangeTombstones() can race with concurrent
flush-time rtCollector.Add() calls causing tombstones to be lost; serialize
rebuilds with flush updates by performing the collector swap under the same lock
used for flush-time adds (or by adding an atomic swap method on the collector).
Concretely, change lm.rebuildRangeTombstones() to acquire the flush/collector
mutex (the same lock used around rtCollector.Add()), take the snapshot and
replace the collector while holding that lock, or add a
ReplaceCollectorAtomic(old,new) on rtCollector and call that from
rebuildRangeTombstones() so no rtCollector.Add() can run concurrently and no
flush can be dropped.

---

Nitpick comments:
In `@db_deleterange_test.go`:
- Around line 247-291: Update TestDeleteRangeVisibilityBug to exercise the
on-disk path: after calling mustSet(... "old") and mustDeleteRange(...), force
the tombstone out of the memtable (e.g., trigger a memtable flush/rotation — use
the existing helpers such as mustClose/Open to persist memtables to SSTs or any
db.Flush/db.Rotate helper if available), then perform the newer mustSet(...
"new"), close/reopen the DB with Open, and finally call db.Get on "a1" to assert
the value is "new"; reference TestDeleteRangeVisibilityBug, mustSet,
mustDeleteRange, mustClose/Open, and db.Get to locate and implement the change.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 39f58b65-3bc2-4e50-9b2b-0ac08eca8f4a

📥 Commits

Reviewing files that changed from the base of the PR and between c6bbef2 and db95615.

📒 Files selected for processing (12)
  • db.go
  • db_deleterange_test.go
  • iterator.go
  • kv/const.go
  • kv/entry.go
  • lsm/executor.go
  • lsm/levels.go
  • lsm/lsm.go
  • lsm/lsm_test.go
  • lsm/memtable.go
  • lsm/range_tombstone.go
  • lsm/range_tombstone_test.go
🚧 Files skipped from review as they are similar to previous changes (3)
  • kv/entry.go
  • iterator.go
  • lsm/memtable.go

lsm/executor.go Outdated
Comment on lines +1030 to +1031
// Keep tombstone state across builder splits.
var rangeTombstones []rangeTombstone

⚠️ Potential issue | 🔴 Critical

Seed each split with tombstones that overlap its left bound.

rangeTombstones only accumulates tombstones seen after this worker starts scanning. On later subcompactions, Line 1103 seeks past tombstones whose start key is before kr.Left, so a wide tombstone can stop covering keys in subsequent splits and stale keys can be emitted.

Also applies to: 1101-1118

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lsm/executor.go` around lines 1030 - 1031, The split/subcompaction currently
only accumulates tombstones seen after the worker starts, so wide tombstones
that begin before kr.Left get skipped and cease covering later splits; update
the split initialization to seed rangeTombstones with any existing
rangeTombstone whose span overlaps the split's left bound (kr.Left) before
scanning forward. Concretely: when starting a new builder split/subcompaction
(the code around where rangeTombstones is declared and where code seeks past
tombstones at kr.Left), perform a lookup/scan for tombstones with start<=kr.Left
and end>kr.Left and append them into rangeTombstones so they are carried into
the split, then proceed with the existing forward accumulation logic.
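The seeding step the prompt describes can be sketched as a pre-scan over the known tombstones before the shard's forward pass begins. The `seedSplit` name and string keys are illustrative; the real code operates on internal keys and the shard's `kr.Left` bound:

```go
package main

import "fmt"

type rangeTombstone struct{ start, end string }

// seedSplit collects every tombstone that begins at or before the shard's
// left bound but still extends past it, so wide tombstones keep covering
// keys in later subcompaction splits instead of being seeked past.
func seedSplit(all []rangeTombstone, left string) []rangeTombstone {
	var seeded []rangeTombstone
	for _, rt := range all {
		if rt.start <= left && rt.end > left {
			seeded = append(seeded, rt)
		}
	}
	return seeded
}

func main() {
	all := []rangeTombstone{
		{"a", "z"}, // wide tombstone spanning many shards
		{"a", "c"}, // ends before this shard; safe to skip
	}
	// A shard whose key range starts at "m" must still see the wide tombstone.
	seeded := seedSplit(all, "m")
	fmt.Println(len(seeded))
}
```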

lsm/lsm.go Outdated
Comment on lines +89 to +115
func (lsm *LSM) checkRangeTombstone(cf kv.ColumnFamily, userKey []byte, entryVersion uint64, internalKey []byte, tables []*memTable) bool {
	// Find the entry's write sequence from the side map.
	entrySeq, seqFound := walSeqKey{}, false
	for _, mt := range tables {
		if mt == nil {
			continue
		}
		if seq, ok := mt.walSeqForKey(internalKey); ok {
			entrySeq = seq
			seqFound = true
			break
		}
	}
	// Check memtable tombstones (O(M) where M = number of range tombstones).
	for _, mt := range tables {
		if mt == nil {
			continue
		}
		if mt.isKeyCoveredByRangeTombstone(cf, userKey, entryVersion, entrySeq, seqFound) {
			return true
		}
	}
	// Check flushed range tombstones via collector (version-based).
	if lsm.levels == nil || lsm.levels.rtCollector == nil {
		return false
	}
	return lsm.levels.rtCollector.IsKeyCovered(cf, userKey, entryVersion)

⚠️ Potential issue | 🔴 Critical

The SST fallback still loses equal-version ordering.

The memtable path uses the WAL sequence to break same-version ties, but the fallback here only asks the collector about Version. Once both records are read from SSTs, a Set(k) written before a DeleteRange(...) at the same nonTxnMaxVersion can become visible again, because the collector cannot tell that the tombstone is newer. Persist the tombstone sequence and consult it in the collector/rebuild path too.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lsm/lsm.go` around lines 89 - 115: checkRangeTombstone breaks same-version ties using the WAL sequence from memtables (walSeqForKey), but the fallback rtCollector.IsKeyCovered checks only Version, so equal-version ordering can be lost. Persist the range tombstone's WAL sequence when writing SSTs and reconstruct it into the collector during rebuild, then update checkRangeTombstone to pass the tombstone sequence (or have rtCollector.IsKeyCovered consult stored tombstone sequences) so the collector can compare entrySeq against tombstoneSeq the same way memTable.isKeyCoveredByRangeTombstone does. Relevant symbols: checkRangeTombstone, memTable.walSeqForKey, memTable.isKeyCoveredByRangeTombstone, levels.rtCollector, and rtCollector.IsKeyCovered. Ensure the SST writer, rebuild logic, and rtCollector store and expose tombstone sequence metadata and use it in IsKeyCovered.
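A minimal sketch of the suggested sequence-aware coverage check (the `storedTombstone` fields below are hypothetical; NoKV's persisted tombstone schema may differ). The idea is that when versions tie, the WAL sequence decides which record is newer, mirroring what the memtable path already does:

```go
package main

import (
	"bytes"
	"fmt"
)

// storedTombstone is a hypothetical persisted record: the [start, end) span
// plus the version and WAL sequence it was written at.
type storedTombstone struct {
	start, end []byte
	version    uint64
	seq        uint64
}

// covers reports whether the tombstone hides an entry, breaking equal-version
// ties with the WAL sequence instead of relying on Version alone.
func (t storedTombstone) covers(key []byte, entryVersion, entrySeq uint64) bool {
	if bytes.Compare(key, t.start) < 0 || bytes.Compare(key, t.end) >= 0 {
		return false // key outside the tombstone's span
	}
	if t.version != entryVersion {
		return t.version > entryVersion
	}
	// Equal versions (e.g. both at nonTxnMaxVersion): the later WAL write wins.
	return t.seq > entrySeq
}

func main() {
	// Set(k) at seq 10, then DeleteRange covering k at seq 11, same version.
	rt := storedTombstone{[]byte("a"), []byte("z"), 100, 11}
	fmt.Println(rt.covers([]byte("k"), 100, 10)) // true: tombstone is newer
	fmt.Println(rt.covers([]byte("k"), 100, 12)) // false: entry written later
}
```

Once the collector stores this sequence, IsKeyCovered can take (version, seq) pairs and the SST fallback stops reordering same-version writes.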

@feichai0017 feichai0017 merged commit cc9aad0 into feichai0017:main Mar 10, 2026
6 checks passed
@github-project-automation github-project-automation bot moved this from Todo to Done in NoKV Delivery Mar 10, 2026