perf(l1): optimize storage layer for block execution#6207
Conversation
Hold a RwLockReadGuard across read transactions instead of re-acquiring per get(). Also replace Mutex with RwLock for trie_cache and last_computed_flatkeyvalue to allow concurrent readers. Combined, these changes reduce block execution time by ~9.3% (57ms → 52ms) on block 24443168 (442 txs, 37.5M gas) benchmarked on a 32-core AMD server.
Hold a pre-acquired read view in BackendTrieDB for the entire trie traversal. InMemoryBackend now wraps its Database in Arc, so begin_read() clones the Arc (O(1)) and releases the lock immediately. InMemoryReadTx and InMemoryLocked hold an owned Arc<Database> snapshot — all subsequent gets are lock-free HashMap lookups with no RwLock contention. BackendTrieDB now acquires a single read view in its constructor and reuses it for all get() calls during the trie traversal. This eliminates the per-node-lookup Box allocation and lock acquisition that previously happened ~8000+ times per block. StorageReadView gains Send + Sync bounds and begin_read() returns a 'static view, enabling BackendTrieDB to own the read view.
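The snapshotting scheme described above can be sketched as a stand-alone model. This is a simplified illustration, not the actual ethrex code: the Database alias, key types, and method names here are stand-ins for the real types.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

type Database = HashMap<Vec<u8>, Vec<u8>>;

// The outer RwLock only guards the pointer swap; readers clone the
// inner Arc and drop the lock immediately, so gets never take a lock.
struct InMemoryBackend {
    inner: Arc<RwLock<Arc<Database>>>,
}

// Owned snapshot: every get() is a plain lock-free HashMap lookup.
struct InMemoryReadTx {
    snapshot: Arc<Database>,
}

impl InMemoryBackend {
    fn begin_read(&self) -> InMemoryReadTx {
        // O(1): clones the Arc, not the map; the read guard is
        // released at the end of the statement.
        let snapshot = self.inner.read().unwrap().clone();
        InMemoryReadTx { snapshot }
    }

    fn write(&self, key: Vec<u8>, value: Vec<u8>) {
        let mut guard = self.inner.write().unwrap();
        // Copy-on-write: clones the map only if a reader still
        // holds the previous Arc snapshot.
        let db = Arc::make_mut(&mut *guard);
        db.insert(key, value);
    }
}

impl InMemoryReadTx {
    fn get(&self, key: &[u8]) -> Option<&Vec<u8>> {
        self.snapshot.get(key)
    }
}

fn main() {
    let backend = InMemoryBackend {
        inner: Arc::new(RwLock::new(Arc::new(Database::new()))),
    };
    let tx0 = backend.begin_read(); // snapshot taken before the write
    backend.write(b"k".to_vec(), b"v".to_vec());
    let tx1 = backend.begin_read(); // snapshot taken after the write
    assert!(tx0.get(b"k").is_none()); // old snapshot is unaffected
    assert_eq!(tx1.get(b"k").map(|v| v.as_slice()), Some(b"v".as_slice()));
}
```

The snapshot semantics also explain the later test fixes in this PR: a reader constructed before a write never observes that write.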
Change begin_read() to return Arc<dyn StorageReadView> instead of Box, allowing the read view to be shared across multiple BackendTrieDB instances. In get_storage_at_root (the SLOAD hot path), pre-acquire the read view, trie cache, and last_written once and share them between the state trie and storage trie opens. This eliminates per-query duplicate RwLock acquisitions and Arc allocations.
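The sharing this enables can be sketched as follows; the trait shape, table parameter, and DummyView implementation are illustrative stand-ins for the real backends, showing only why `Arc<dyn StorageReadView>` lets one snapshot back both trie opens.

```rust
use std::sync::Arc;

// Send + Sync bounds are what make Arc sharing across threads legal.
trait StorageReadView: Send + Sync {
    fn get(&self, table: &str, key: &[u8]) -> Option<Vec<u8>>;
}

// Hypothetical trie handle that holds a shared read view.
struct BackendTrieDB {
    read_view: Arc<dyn StorageReadView>,
    table: &'static str,
}

impl BackendTrieDB {
    fn get(&self, key: &[u8]) -> Option<Vec<u8>> {
        self.read_view.get(self.table, key)
    }
}

struct DummyView;
impl StorageReadView for DummyView {
    fn get(&self, table: &str, key: &[u8]) -> Option<Vec<u8>> {
        Some(format!("{table}:{}", key.len()).into_bytes())
    }
}

fn main() {
    // One snapshot, shared by the state trie and the storage trie:
    // cloning the Arc is O(1), unlike re-running begin_read() twice.
    let view: Arc<dyn StorageReadView> = Arc::new(DummyView);
    let state_trie = BackendTrieDB { read_view: view.clone(), table: "state" };
    let storage_trie = BackendTrieDB { read_view: view, table: "storage" };
    assert_eq!(state_trie.get(b"ab"), Some(b"state:2".to_vec()));
    assert_eq!(storage_trie.get(b"ab"), Some(b"storage:2".to_vec()));
}
```

With `Box<dyn StorageReadView>` the second trie open would have needed its own snapshot; with `Arc` both opens observe the same consistent view.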
apply_prefix was called on every trie node lookup, creating 3 Vec allocations each time (from_bytes + append_new + concat). Pre-computing the prefix nibbles once in the TrieWrapper constructor reduces this to 1 allocation (just concat) per lookup. Also adds TrieWrapper::new() constructor to encapsulate this.
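The allocation counting can be illustrated with a toy Nibbles type; this is a stand-in for the real trie type, and only the allocation structure (three allocations per call vs one) is meant to match.

```rust
// Toy stand-in for the trie's Nibbles type.
#[derive(Clone, Debug, PartialEq)]
struct Nibbles(Vec<u8>);

impl Nibbles {
    fn from_bytes(b: &[u8]) -> Self {
        // allocation #1
        Nibbles(b.iter().flat_map(|x| [x >> 4, x & 0x0f]).collect())
    }
    fn append_new(&self, n: u8) -> Self {
        // allocation #2
        let mut v = self.0.clone();
        v.push(n);
        Nibbles(v)
    }
    fn concat(&self, other: &Nibbles) -> Nibbles {
        // allocation #3 — the only one that must happen per lookup
        let mut v = Vec::with_capacity(self.0.len() + other.0.len());
        v.extend_from_slice(&self.0);
        v.extend_from_slice(&other.0);
        Nibbles(v)
    }
}

// Before: 3 allocations on every trie node lookup.
fn apply_prefix(prefix: &[u8; 32], key: &Nibbles) -> Nibbles {
    Nibbles::from_bytes(prefix).append_new(17).concat(key)
}

struct TrieWrapper {
    prefix_nibbles: Option<Nibbles>,
}

impl TrieWrapper {
    // After: from_bytes + append_new run once, in the constructor.
    fn new(prefix: Option<[u8; 32]>) -> Self {
        TrieWrapper {
            prefix_nibbles: prefix.map(|p| Nibbles::from_bytes(&p).append_new(17)),
        }
    }
    // Each lookup now pays only the final concat.
    fn make_key(&self, key: Nibbles) -> Nibbles {
        match &self.prefix_nibbles {
            Some(p) => p.concat(&key),
            None => key,
        }
    }
}

fn main() {
    let prefix = [0xab; 32];
    let w = TrieWrapper::new(Some(prefix));
    let key = Nibbles::from_bytes(&[0x12]);
    // Same key bytes either way; only the allocation count differs.
    assert_eq!(w.make_key(key.clone()), apply_prefix(&prefix, &key));
}
```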
🤖 Kimi Code Review

Review Summary

This PR introduces significant performance optimizations to the storage layer, primarily through lock-contention reduction and allocation elimination on the trie read path.

Issues Found

1. Critical: Atomicity Violation in InMemoryBackend (in_memory.rs:33-34, 153-154, 168-169)

The use of Arc::make_mut on the write path:

    // Problem: Each write creates a new Arc, breaking atomicity
    let db_mut = Arc::make_mut(&mut *db);

Fix: Use …

2. Performance: Unnecessary Arc Cloning (in_memory.rs:44-48)

The in-memory backend clones the entire database on every read transaction, which is O(n) complexity. This defeats the purpose of the optimization:

    let snapshot = self.inner.read()?.clone(); // Clones entire database

Fix: Return a reference-based read view that holds the RwLock guard, or use a proper MVCC implementation.

3. Memory Safety: Lifetime Mismatch (api/mod.rs:37)

The trait signature change from …

4. Code Duplication (store.rs:2422-2599)

Multiple nearly identical …

5. Error Handling Inconsistency (layering.rs:195-210)

The …

Recommendations
The RocksDB backend changes look correct as RocksDB handles its own MVCC.

Automated review by Kimi (Moonshot AI) · custom prompt
🤖 Codex Code Review

Findings
Notes / Questions
Positive
If you want, I can scan for any callers that depend on deterministic prefix iteration order to confirm whether Point 1 is a real behavioral change.

Automated review by OpenAI Codex · custom prompt
🤖 Claude Code Review

PR Review: Performance Optimization of Storage Layer (-15.5%)

Summary

This PR implements four incremental optimizations to the storage hot path, targeting lock contention and allocation overhead identified through profiling. The changes are well-motivated and demonstrate clear performance wins (15.5% speedup, 18.2% throughput increase).

Critical Issues

1. Type inconsistency in InMemoryBackend (crates/storage/backend/in_memory.rs:11)

    type Table = FxHashMap<Vec<u8>, Vec<u8>>;
    type Database = HashMap<&'static str, Table>; // ← Standard HashMap, not FxHashMap

Issue: The outer Database alias still uses the standard HashMap while the inner Table uses FxHashMap.

Fix:

    type Database = FxHashMap<&'static str, Table>;

Impact: Minor performance regression for table name lookups, though table count is small so impact is minimal in practice.

2. Potential Arc clone overhead in get_storage_at_root (crates/storage/store.rs:2104-2106, 2118-2120)

The code clones the pre-acquired resources for each trie open:

    read_view.clone(),
    cache.clone(),
    last_written.clone(), // ← This is Vec<u8>, not Arc!

Issue: …

Recommendation: Change …

Impact: Moderate - each last_written.clone() copies the full Vec<u8>.

3. Missing Send + Sync bounds verification (crates/storage/api/mod.rs:58)

The trait now requires:

    pub trait StorageReadView: Send + Sync {

Verification needed: Ensure all concrete implementations satisfy these bounds.
Status: Appears correct, but worth confirming with a compile-time assertion or test if not already verified.

Performance Concerns

4. Arc::make_mut potential COW overhead (crates/storage/backend/in_memory.rs:33, 153, 169)

The RCU pattern uses:

    let db_mut = Arc::make_mut(&mut *db);

Concern: If any reader still holds a reference to the inner Arc<Database>, Arc::make_mut will deep-clone the entire database before the write.

Analysis: The design assumes writes are infrequent (background worker only, 2-3 times per block). Readers clone the inner Arc and release the RwLock immediately, so the window for contention is small. However, if a reader holds the snapshot for a long time while a write occurs, the entire DB gets cloned.

Recommendation: Add a comment documenting this behavior and the assumption that reads are short-lived. Consider profiling to verify COW isn't triggered frequently in practice.

5. Prefix nibble computation still allocates (crates/storage/layering.rs:207-208, 233-236, 241-244)

The optimization pre-computes prefix nibbles, but the application still allocates:

    let key = match &self.prefix_nibbles {
        Some(prefix) => prefix.concat(&key), // ← Still allocates a new Nibbles
        None => key,
    };

Issue: …

Potential optimization: If …

Impact: Lower priority - already reduced by 67%, further optimization requires trie library changes.

Code Quality Issues

6. Redundant field in BackendTrieDB (crates/storage/trie.rs:14-15)

The struct now holds both the backend and a read view:

    /// Reference to the storage backend (used only for writes)
db: Arc<dyn StorageBackend>,
/// Pre-acquired read view held for the lifetime of this struct.
    read_view: Arc<dyn StorageReadView>,

Question: Is the db field still needed now that reads go through read_view? Looking at usage, it appears to be kept only for the write path.

7. Removed apply_prefix usage not fully deleted (crates/storage/trie.rs:6, 111)

The import and function are still present but largely unused:

    use crate::layering::apply_prefix; // Line 6
    // ...
    fn make_key(&self, path: Nibbles) -> Vec<u8> {
        apply_prefix(self.address_prefix, path).into_vec() // Line 111
    }

While make_key still calls apply_prefix, the hot-path lookups no longer go through it.
Correctness

8. RwLock poisoning behavior (crates/storage/store.rs:2093-2094)

Lock errors map to StoreError::LockError:

    .read()
    .map_err(|_| StoreError::LockError)?

Question: What happens if a thread panics while holding a write lock? The RwLock will be poisoned, and all subsequent operations will fail. Is this the desired behavior, or should poisoned locks be recovered?

Analysis: For blockchain storage, fail-fast on poisoned locks is likely correct - if a storage write panicked, the database is potentially in an inconsistent state. The current behavior seems appropriate.

9. Trie cache RCU consistency (crates/storage/store.rs:2759-2760)

During trie updates, the cache is updated with RCU:

    let trie = Arc::new(trie_mut);
    *trie_cache.write().map_err(|_| StoreError::LockError)? = trie.clone();

Observation: After the write lock is released, readers will see the new cache state. There's a small window where readers might miss cache entries if they grabbed the old cache before the update but perform lookups after the database is updated.

Analysis: This is inherent to the RCU pattern and acceptable - readers will simply hit the database instead of the cache, which is correct (slower but not incorrect). The tradeoff is worth it for lock-free reads.

Minor Issues

10. Missing type documentation (crates/storage/backend/in_memory.rs:10-11)

The type aliases should document why FxHashMap is chosen:

    // Using FxHashMap for O(1) lookups vs BTreeMap's O(log n).
    // FxHash is faster than SipHash for small keys (byte vectors, table names).
    type Table = FxHashMap<Vec<u8>, Vec<u8>>;
    type Database = FxHashMap<&'static str, Table>; // Fixed: use FxHashMap here too

11. TrieWrapper::new could take ownership (crates/storage/layering.rs:201-215)

The constructor currently takes owned arguments:

    pub fn new(
        state_root: H256,
        inner: Arc<TrieLayerCache>,
        db: Box<dyn TrieDB>,
        prefix: Option<H256>,
    ) -> Self {
        let prefix_nibbles = prefix.map(|p| Nibbles::from_bytes(p.as_bytes()).append_new(17));
        // ...
    }

This is fine, but the …

Positive Observations
Recommendations

Must fix: …

Should fix: …

Nice to have: …

Future optimization: …

Security Assessment

No security vulnerabilities identified. The changes maintain the same correctness guarantees as before.

Verdict

APPROVE with minor fixes requested (Item 1 and optionally Item 2). The performance gains are significant and well-justified. The implementation is sound with only minor consistency issues to address.

Automated review by Claude (Anthropic) · custom prompt
Greptile Overview

Greptile Summary

This PR implements four incremental storage layer optimizations that collectively achieve a 15.5% reduction in block execution time through targeted elimination of lock contention, allocation overhead, and algorithmic complexity.

Key Changes: …

Performance Impact: …

Confidence Score: 4/5
| Filename | Overview |
|---|---|
| crates/storage/backend/in_memory.rs | Implemented RCU pattern with Arc<RwLock<Arc<Database>>>, replaced BTreeMap with FxHashMap for O(1) lookups, snapshots now clone inner Arc for lock-free reads |
| crates/storage/layering.rs | Pre-computes prefix nibbles in TrieWrapper::new() constructor to avoid repeated allocations on every trie node lookup, reducing allocations from 3 to 1 per lookup |
| crates/storage/store.rs | Changed trie_cache and last_computed_flatkeyvalue from Mutex to RwLock, added shared read view pattern in get_storage_at_root, new *_with_view methods enable resource sharing across multiple trie opens |
| crates/storage/trie.rs | Added read_view field to BackendTrieDB to hold pre-acquired read view, eliminating per-lookup allocations, new *_with_view constructors support sharing a single read view across multiple trie instances |
Sequence Diagram

```mermaid
sequenceDiagram
    participant App as Block Executor
    participant Store as Store
    participant Backend as StorageBackend
    participant TrieDB as BackendTrieDB
    participant Cache as TrieLayerCache
    Note over App,Cache: Optimized get_storage_at_root flow
    App->>Store: get_storage_at_root(state_root, address, storage_key)
    Note over Store: Pre-acquire shared resources (optimization #2 & #3)
    Store->>Backend: begin_read()
    Backend-->>Store: Arc<StorageReadView> (cloneable snapshot)
    Store->>Cache: trie_cache.read() (RwLock instead of Mutex)
    Cache-->>Store: Arc<TrieLayerCache> clone
    Store->>Store: last_written() (RwLock.read())
    Note over Store: Open state trie with shared resources
    Store->>TrieDB: new_for_accounts_with_view(backend, read_view, last_written)
    Note over TrieDB: Holds read_view for entire traversal (no per-lookup allocation)
    Store->>TrieDB: state_trie.get(account_hash)
    Note over TrieDB: Pre-computed prefix nibbles (optimization #4)
    TrieDB->>Cache: inner.get(state_root, prefixed_key)
    Cache-->>TrieDB: cache hit/miss
    alt cache miss
        TrieDB->>Backend: read_view.get() (lock-free with RCU)
        Backend-->>TrieDB: trie node data
    end
    TrieDB-->>Store: encoded_account
    Note over Store: Open storage trie reusing same read_view
    Store->>TrieDB: new_for_storages_with_view(backend, read_view.clone(), last_written)
    Store->>TrieDB: storage_trie.get(hashed_key)
    Note over TrieDB: Reuses held read_view (no new snapshot)
    TrieDB->>Cache: inner.get(state_root, prefixed_key)
    alt cache miss
        TrieDB->>Backend: read_view.get() (same snapshot)
        Backend-->>TrieDB: storage value
    end
    TrieDB-->>Store: storage_value
    Store-->>App: U256 result
    Note over App,Cache: Key optimizations:<br/>1. BTreeMap→FxHashMap (O(log n)→O(1))<br/>2. RCU pattern (lock-free reads)<br/>3. Shared read_view (no duplicate snapshots)<br/>4. Pre-computed prefix (1 alloc vs 3)
```
Last reviewed commit: d4555a6
Pull request overview
This PR optimizes the storage layer for block execution through four targeted performance improvements, achieving a 15.5% reduction in execution time. The optimizations target bottlenecks identified via profiling: data structure access patterns (BTreeMap→FxHashMap), lock contention (Mutex→RwLock), per-lookup allocations (RCU pattern + held read views), and repeated prefix computations.
Changes:
- Replaced Mutex with RwLock for read-heavy trie_cache and last_computed_flatkeyvalue to enable concurrent readers
- Implemented RCU pattern in InMemoryBackend with Arc<RwLock<Arc<Database>>> for lock-free reads via snapshots
- Modified storage API to return Arc<dyn StorageReadView> instead of Box, enabling cheap sharing of read views across multiple trie operations
- Added constructors for BackendTrieDB that accept pre-acquired shared read views to eliminate per-trie-open allocations
- Pre-computed prefix nibbles in TrieWrapper constructor to reduce allocations from 3 to 1 per trie node lookup
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| crates/storage/api/mod.rs | Changed StorageBackend::begin_read() to return Arc instead of Box, added Send + Sync bounds to StorageReadView trait |
| crates/storage/backend/rocksdb.rs | Updated begin_read() to return Arc |
| crates/storage/backend/in_memory.rs | Implemented RCU pattern with Arc<RwLock<Arc<Database>>>, switched Table from BTreeMap to FxHashMap, updated begin_read() for snapshot-based lock-free reads |
| crates/storage/trie.rs | Added BackendTrieDB constructors with shared read view support (*_with_view variants), changed read view field to Arc for sharing |
| crates/storage/layering.rs | Added TrieWrapper::new() constructor with pre-computed prefix nibbles, replaced apply_prefix calls with direct concat operations |
| crates/storage/store.rs | Converted trie_cache and last_computed_flatkeyvalue from Mutex to RwLock, added *_shared trie opening methods, updated get_storage_at_root to pre-acquire and share resources, updated all TrieWrapper instantiations to use new constructor |
Benchmark Block Execution Results Comparison Against Main
…2929 warm/cold tracking

Switch the internal data structures used for EIP-2929 accessed storage slot tracking from sorted tree structures to hash-based structures for O(1) lookups.

Changes:
- Substate.accessed_storage_slots: BTreeMap<Address, BTreeSet<H256>> → FxHashMap<Address, FxHashSet<H256>>
- VM.storage_original_values: BTreeMap<(Address, H256), U256> → FxHashMap<(Address, H256), U256>

Output boundaries that require determinism (make_access_list, get_accessed_storage_slots) create fresh BTreeMap/BTreeSet locally, preserving sorted output regardless of field type.
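The deterministic-output-boundary pattern from this commit can be sketched as below. To keep the example dependency-free, std HashMap/HashSet stand in for FxHashMap/FxHashSet (same API surface for this purpose), and the field and method names mirror the commit message but are illustrative.

```rust
use std::collections::{BTreeMap, BTreeSet, HashMap, HashSet};

type Address = [u8; 20];
type H256 = [u8; 32];

// Hash-based field: O(1) warm/cold membership checks on the hot path.
struct Substate {
    accessed_storage_slots: HashMap<Address, HashSet<H256>>,
}

impl Substate {
    fn mark_accessed(&mut self, addr: Address, slot: H256) {
        self.accessed_storage_slots.entry(addr).or_default().insert(slot);
    }

    // Output boundary: build fresh sorted structures locally, so callers
    // see deterministic (sorted) order regardless of the field's hash order.
    fn get_accessed_storage_slots(&self) -> BTreeMap<Address, BTreeSet<H256>> {
        self.accessed_storage_slots
            .iter()
            .map(|(addr, slots)| (*addr, slots.iter().copied().collect()))
            .collect()
    }
}

fn main() {
    let mut s = Substate { accessed_storage_slots: HashMap::new() };
    let addr = [1u8; 20];
    // Insert out of order; HashSet iteration order is unspecified.
    s.mark_accessed(addr, [9u8; 32]);
    s.mark_accessed(addr, [2u8; 32]);
    let sorted = s.get_accessed_storage_slots();
    let slots: Vec<H256> = sorted[&addr].iter().copied().collect();
    // The boundary restores sorted output regardless of hash order.
    assert_eq!(slots, vec![[2u8; 32], [9u8; 32]]);
}
```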
…eature flag

Add SloadCounters struct with 4 AtomicU64 counters to measure SLOAD cache effectiveness across the cache hierarchy:
- sload_l1_hit: value found in per-tx GenDB storage (L1 cache)
- sload_l2_hit: value found in CachingDatabase (L2 cross-tx cache)
- sload_l2_miss: fell through to Store/trie lookup
- sload_duplicate_miss_race: read miss became hit on write recheck (warmer race)

Counters are global statics (const-initialized, no LazyLock needed for atomics) and output via [PERF] log line alongside existing opcode timings. All behind #[cfg(feature = "perf_opcode_timings")] for zero overhead in production builds.
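A minimal sketch of the counter scheme described in the commit message. The #[cfg(feature = "perf_opcode_timings")] gating is omitted so the snippet runs standalone, only three of the four counters are shown, and the exact log format is illustrative.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Const-initialized globals: atomics need no LazyLock wrapper.
static SLOAD_L1_HIT: AtomicU64 = AtomicU64::new(0);
static SLOAD_L2_HIT: AtomicU64 = AtomicU64::new(0);
static SLOAD_L2_MISS: AtomicU64 = AtomicU64::new(0);

fn record_l1_hit() {
    // Relaxed is enough: counters are independent, no ordering needed.
    SLOAD_L1_HIT.fetch_add(1, Ordering::Relaxed);
}

fn record_l2_miss() {
    SLOAD_L2_MISS.fetch_add(1, Ordering::Relaxed);
}

// Emitted alongside the existing [PERF] opcode-timing output.
fn report() -> String {
    format!(
        "[PERF] sload_l1_hit={} sload_l2_hit={} sload_l2_miss={}",
        SLOAD_L1_HIT.load(Ordering::Relaxed),
        SLOAD_L2_HIT.load(Ordering::Relaxed),
        SLOAD_L2_MISS.load(Ordering::Relaxed),
    )
}

fn main() {
    record_l1_hit();
    record_l1_hit();
    record_l2_miss();
    println!("{}", report());
}
```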
…tion
Two coupled optimizations that reduce cold SLOAD overhead:
1. Block-scoped locked RocksDB snapshot (Phase B):
- New SharedLockedTrieDB wrapper reusing Arc<BackendTrieDBLocked>
- Single RocksDB snapshot created per block, shared across all trie reads
- Eliminates up to 16 heap allocations + transaction creations per cold SLOAD
- 5 new locked methods in Store mirroring existing unlocked variants
- Locked backend stored in StoreVmDatabase, auto-dropped at block end
2. Storage-root memoization (Phase C):
- Cache address→storage_root in StoreVmDatabase (Arc<Mutex<FxHashMap>>)
- On cache hit, skips state trie traversal entirely
- Reduces trie opens from 2N to N+1 per N unique slot reads per account
- New get_storage_at_storage_root_locked() takes pre-resolved root
- Non-existent accounts intentionally not cached (rare path, simple types)
Architecture after both optimizations:
get_storage_slot → check storage_root_cache
HIT: skip state trie → open storage trie (locked snapshot) → 8 node reads
MISS: open state trie (locked snapshot) → cache root → open storage trie → 8 reads
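The storage-root memoization (Phase C) above can be sketched as follows. This is an illustrative model, not the ethrex code: StoreVmDatabase here is a stand-in holding only the cache, and the resolve callback models the state-trie traversal that the cache hit skips.

```rust
use std::cell::Cell;
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

type Address = [u8; 20];
type H256 = [u8; 32];

// Block-scoped memo: address → storage_root, filled on first resolution.
struct StoreVmDatabase {
    storage_root_cache: Arc<Mutex<HashMap<Address, H256>>>,
}

impl StoreVmDatabase {
    // On a cache hit the state-trie walk (`resolve`) is skipped entirely,
    // turning 2 trie opens per slot read into 1.
    fn storage_root(&self, addr: Address, resolve: impl Fn(Address) -> H256) -> H256 {
        let mut cache = self.storage_root_cache.lock().unwrap();
        *cache.entry(addr).or_insert_with(|| resolve(addr))
    }
}

fn main() {
    let walks = Cell::new(0u32);
    let db = StoreVmDatabase {
        storage_root_cache: Arc::new(Mutex::new(HashMap::new())),
    };
    // Models an expensive state-trie traversal.
    let resolve = |_addr: Address| {
        walks.set(walks.get() + 1);
        [7u8; 32]
    };
    let addr = [1u8; 20];
    db.storage_root(addr, &resolve);
    db.storage_root(addr, &resolve);
    // Second read for the same account hits the memo: one traversal total.
    assert_eq!(walks.get(), 1);
}
```

For N unique slot reads against one account this is exactly the 2N → N+1 reduction in trie opens described above: one state-trie walk, then N storage-trie opens.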
…ot semantics

BackendTrieDB now snapshots at construction time (pre-acquired read view). Tests were writing then reading from the same instance, but the snapshot predated the writes. Use separate instances so the reader gets a fresh view.
    let account_nibbles = Nibbles::from_bytes(account.as_bytes());
    let last_computed_flatkeyvalue = self.last_written()?;
    Ok(&last_computed_flatkeyvalue[0..64] > account_nibbles.as_ref())
    &last_written[0..64] > account_nibbles.as_ref()
nit: &last_written[0..64] will panic if the slice is shorter than 64 bytes. The caller (self.last_written()) always returns ≥64 bytes (initialized from unwrap_or_else(|| vec![0u8; 64])), but this static method's &[u8] signature doesn't encode that invariant. Consider:

    last_written.get(0..64).is_some_and(|lw| lw > account_nibbles.as_ref())

    trie_mut.put_batch(parent_state_root, child_state_root, new_layer);
    let trie = Arc::new(trie_mut);
    *trie_cache.lock().map_err(|_| StoreError::LockError)? = trie.clone();
    *trie_cache.write().map_err(|_| StoreError::LockError)? = trie.clone();
nit: The .read() → clone → modify → .write() RCU pattern (here and again at line 2833) is safe because only the background worker thread calls apply_trie_updates. Worth adding a brief comment at the .write() sites to make this single-writer invariant explicit — e.g., // Single writer: only called from trie update worker thread.
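The read → clone → modify → write pattern this comment refers to can be sketched with a stand-in cache type (the real TrieLayerCache is more involved); as the nit notes, it is safe only under the single-writer invariant.

```rust
use std::sync::{Arc, RwLock};

// Stand-in for TrieLayerCache: an append-only list of layers.
#[derive(Clone, Default)]
struct TrieLayerCache {
    layers: Vec<u64>,
}

fn apply_trie_update(cache: &RwLock<Arc<TrieLayerCache>>, layer: u64) {
    // Read → clone → modify: build the next snapshot off-lock.
    let mut next = (**cache.read().unwrap()).clone();
    next.layers.push(layer);
    // Single writer: only one thread may run this read-modify-write
    // sequence, otherwise concurrent updates would be lost.
    *cache.write().unwrap() = Arc::new(next);
}

fn main() {
    let cache = RwLock::new(Arc::new(TrieLayerCache::default()));
    let before = cache.read().unwrap().clone(); // reader snapshot
    apply_trie_update(&cache, 42);
    // Readers keep their old Arc; new readers see the published snapshot.
    assert!(before.layers.is_empty());
    assert_eq!(cache.read().unwrap().layers, vec![42]);
}
```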
Benchmark Results Comparison

No significant difference was registered for any benchmark run.

Detailed Results

Benchmark Results: BubbleSort
Benchmark Results: ERC20Approval
Benchmark Results: ERC20Mint
Benchmark Results: ERC20Transfer
Benchmark Results: Factorial
Benchmark Results: FactorialRecursive
Benchmark Results: Fibonacci
Benchmark Results: FibonacciRecursive
Benchmark Results: ManyHashes
Benchmark Results: MstoreBench
Benchmark Results: Push
Benchmark Results: SstoreBench_no_opt
I feel like the addition of the SLOAD counters might belong in another PR.
…nd associated fixes

Reverts commits 3d279be..d0896cf (4 commits):
- d0896cf Run rustfmt
- e4a016c fix(test): use separate writer/reader in trie_db_tests for RCU snapshot semantics
- 8345d1d perf(l1): add block-scoped locked trie reads and storage-root memoization
- 3d279be perf(l1): add SLOAD attribution counters behind perf_opcode_timings feature flag
Two fixes:
- Format long FxHashMap type annotation in vm.rs to satisfy cargo fmt
- Fix trie_db tests that wrote via put_batch then read back on the same BackendTrieDB instance. With the RCU snapshot pattern, the read view is captured at construction time and won't see subsequent writes. Create a fresh BackendTrieDB after writing to get an updated snapshot.
Motivation
Profiling block execution on mainnet revealed that lock contention, per-lookup allocations, and O(log n) data structure access in the storage layer accounted for a significant portion of execution time. The storage hot path (get_storage_at_root → trie lookups) was hitting multiple bottlenecks on every trie node access.

Description

Four incremental optimizations to the storage layer, each targeting a specific bottleneck identified via perf profiling on a 32-core AMD server:

1. BTreeMap → FxHashMap + Mutex → RwLock
- Replace BTreeMap<Vec<u8>, Vec<u8>> with FxHashMap for in-memory storage tables — O(1) lookups instead of O(log n)
- Replace Mutex with RwLock for trie_cache and last_computed_flatkeyvalue — allows concurrent readers (only the background worker writes, 2-3 times per block)

2. RCU pattern + held read view
- InMemoryBackend: wrap the database as Arc<RwLock<Arc<Database>>> so readers clone the inner Arc (O(1)) and read entirely lock-free
- BackendTrieDB: acquire the read view once at construction and reuse it for all trie node lookups, eliminating ~8000+ Box allocations per block

3. Shared read view across trie opens
- Change begin_read() to return Arc<dyn StorageReadView> instead of Box<dyn StorageReadView>, enabling cheap Arc::clone() sharing
- In get_storage_at_root, pre-acquire the read view, trie cache, and last_written value once, then share across both the state trie and storage trie opens — eliminates duplicate RwLock acquisitions
- Add the StorageReadView: Send + Sync bound (required for Arc sharing)

4. Pre-compute prefix nibbles in TrieWrapper
- apply_prefix was called on every trie node lookup, creating 3 Vec allocations per call (from_bytes + append_new + concat)
- Pre-compute the prefix nibbles once in the TrieWrapper constructor, reducing to 1 allocation (just concat) per lookup
- Add a TrieWrapper::new() constructor to encapsulate prefix pre-computation

Benchmark Results (ethrex-replay)
30 runs each, back-to-back on same server (32-core AMD, ethrex-office-4):
14-15% improvement on heavy blocks, 8% on light blocks. Lock contention dropped from 8.1% to 1.4% of block executor time.
How to Test
- cargo test -p ethrex-storage
- cargo test -p ethrex-blockchain
- cargo clippy -p ethrex-storage -p ethrex-blockchain -- -D warnings