more algorithm-independent GC infra #7194
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the CodeRabbit settings.
📝 Walkthrough

Implements a CPython-style generational GC, refactors reference counting into a unified atomic state with deferred-drop support, adds a per-thread trashcan for deferred deallocation, and reorders interpreter shutdown and module finalization.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Interpreter
    participant VM as VirtualMachine
    participant GC as GcState
    participant Heap as HeapObjects
    participant Modules

    Interpreter->>VM: finalize()
    VM->>VM: run atexit handlers
    VM->>VM: set finalizing flag
    VM->>GC: collect_force(generation)
    GC->>Heap: build refs map & mark reachable
    GC->>Heap: identify unreachable, run weakref callbacks
    GC->>Heap: finalize unreachable (handle resurrection)
    VM->>Modules: finalize_remove_modules() -> Vec<(name, weak)>
    VM->>Modules: finalize_clear_module_dicts()
    VM->>GC: collect_force(generation) (post-removal)
    VM->>Modules: clear sys & builtins dicts
    VM->>GC: collect_force(generation) (final)
```
```mermaid
sequenceDiagram
    participant Dealloc as Deallocator
    participant Trashcan
    participant Queue as DeferredQueue
    participant GC as GcState

    Dealloc->>Trashcan: begin(obj, dealloc_fn)
    alt depth <= threshold
        Trashcan-->>Dealloc: allow immediate dealloc
        Dealloc->>GC: untrack object
        Dealloc->>Dealloc: tp_clear / drop children
        Dealloc->>Trashcan: end()
        Trashcan->>Queue: process deferred items iteratively
    else depth > threshold
        Trashcan-->>Dealloc: defer
        Trashcan->>Queue: push (obj, dealloc_fn)
        Dealloc-->>Trashcan: return (deferred)
    end
```
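The deferral flow in the diagram can be sketched as a standalone program. This is a hypothetical minimal version, not the PR's actual `Trashcan` API: a thread-local depth counter plus a deferred queue, with the threshold value assumed. The key property is that an arbitrarily deep deallocation chain uses only bounded native stack.

```rust
use std::cell::RefCell;

const TRASHCAN_THRESHOLD: usize = 8; // assumed limit; CPython uses 50

thread_local! {
    static DEPTH: RefCell<usize> = RefCell::new(0);
    static QUEUE: RefCell<Vec<Box<dyn FnOnce()>>> = RefCell::new(Vec::new());
    static FREED: RefCell<usize> = RefCell::new(0);
}

/// Run `dealloc` now if the dealloc chain is shallow, otherwise defer it.
/// The outermost call drains the deferred queue iteratively.
fn dealloc_with_trashcan(dealloc: Box<dyn FnOnce()>) {
    if DEPTH.with(|d| *d.borrow()) >= TRASHCAN_THRESHOLD {
        // Too deep: push (obj, dealloc_fn) and return without recursing.
        QUEUE.with(|q| q.borrow_mut().push(dealloc));
        return;
    }
    DEPTH.with(|d| *d.borrow_mut() += 1);
    dealloc(); // may recursively re-enter for child objects
    DEPTH.with(|d| *d.borrow_mut() -= 1);
    if DEPTH.with(|d| *d.borrow()) == 0 {
        // end(): process deferred items iteratively at the outermost frame
        while let Some(f) = QUEUE.with(|q| q.borrow_mut().pop()) {
            DEPTH.with(|d| *d.borrow_mut() += 1);
            f();
            DEPTH.with(|d| *d.borrow_mut() -= 1);
        }
    }
}

/// Simulate dropping a singly linked chain of `remaining + 1` nodes.
fn free_chain(remaining: usize) {
    dealloc_with_trashcan(Box::new(move || {
        FREED.with(|f| *f.borrow_mut() += 1);
        if remaining > 0 {
            free_chain(remaining - 1);
        }
    }));
}

fn run_demo() -> usize {
    free_chain(9_999); // deep enough to overflow the stack without deferral
    FREED.with(|f| *f.borrow())
}

fn main() {
    let freed = run_demo();
    assert_eq!(freed, 10_000);
    println!("freed {freed} nodes");
}
```

Without the depth check, dropping the 10,000-node chain would recurse 10,000 frames deep; with it, native recursion never exceeds the threshold.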
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
crates/vm/src/vm/interpreter.rs (1)
386-395: ⚠️ Potential issue | 🟡 Minor

Doc comment is out of sync with the actual shutdown sequence.
The numbered steps in the doc comment don't match the code:
| Doc step | Doc says | Code actually does |
| --- | --- | --- |
| 3 | Wait for thread shutdown | ✅ threading._shutdown |
| 4 | Set finalizing flag | ❌ atexit runs here |
| 5 | Call threading._shutdown() | ❌ duplicate of step 3; finalizing flag set here |
| 6 | Run atexit | ❌ GC collect_force here |
| 7 | GC pass and module cleanup | ✅ finalize_modules |
| 8 | Final GC pass | ❌ code does flush_std, no separate final GC |

Suggested doc fix
```diff
-    /// 1. Set finalizing flag (suppresses unraisable exceptions).
-    /// 1. Call threading._shutdown() to join non-daemon threads.
-    /// 1. Run atexit exit functions.
-    /// 1. GC pass and module cleanup (finalize_modules).
-    /// 1. Final GC pass.
+    /// 1. Run atexit exit functions.
+    /// 1. Set finalizing flag (suppresses unraisable exceptions from __del__).
+    /// 1. Forced GC collection pass (collect cycles while builtins are available).
+    /// 1. Module finalization (finalize_modules).
+    /// 1. Final stdout/stderr flush.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/vm/src/vm/interpreter.rs` around lines 386 - 395, Update the "Finalization steps" doc comment so it matches the actual shutdown sequence in this file: enumerate the real order (flush stdout/stderr via flush_std, handle exit exception -> exit code, call threading._shutdown to wait for threads, set the finalizing flag (suppresses unraisable exceptions), run atexit exit functions, run GC collect_force (where the doc previously claimed atexit ran elsewhere), call finalize_modules for module cleanup, and note any final flush/GC behavior; remove the duplicate threading._shutdown step and correct the step labels to reference the actual calls (threading._shutdown, atexit, collect_force, finalize_modules, flush_std) so the numbered list matches the code flow.

crates/vm/src/vm/mod.rs (1)
1122-1149: ⚠️ Potential issue | 🔴 Critical

`scopeguard::defer!` is placed after execution, defeating panic safety.

The tracing + execution block (lines 1122–1132) runs before the `scopeguard::defer!` guard is created (line 1136). If `f(frame)` or `trace_event` panics, the guard doesn't exist yet, so cleanup (restoring owner, popping exception, popping frame, decrementing recursion depth) never runs — leaving the VM in a corrupted state.

Compare with `with_frame_exc` (line 1056), where the guard is created before the execution block (line 1078). This method should follow the same pattern: SETUP → GUARD → EXECUTE.

🐛 Proposed fix: move scopeguard before execution
```diff
     let old_owner = frame.owner.swap(
         crate::frame::FrameOwner::Thread as i8,
         core::sync::atomic::Ordering::AcqRel,
     );
+    // Ensure cleanup on panic: restore owner, pop exception, frame chain, frames Vec,
+    // and recursion depth.
+    scopeguard::defer! {
+        frame.owner.store(old_owner, core::sync::atomic::Ordering::Release);
+        self.exceptions.borrow_mut().stack
+            .pop()
+            .expect("pop_exception() without nested exc stack");
+        crate::vm::thread::set_current_frame(old_frame);
+        self.frames.borrow_mut().pop();
+        #[cfg(feature = "threading")]
+        crate::vm::thread::pop_thread_frame();
+
+        self.recursion_depth.update(|d| d - 1);
+    }
+
     use crate::protocol::TraceEvent;
     let result = match self.trace_event(TraceEvent::Call, None) {
         Ok(()) => {
             let result = f(frame);
             if result.is_ok() {
                 let _ = self.trace_event(TraceEvent::Return, None);
             }
             result
         }
         Err(e) => Err(e),
     };
-    // Ensure cleanup on panic: restore owner, pop exception, frame chain, frames Vec,
-    // and recursion depth.
-    scopeguard::defer! {
-        frame.owner.store(old_owner, core::sync::atomic::Ordering::Release);
-        self.exceptions.borrow_mut().stack
-            .pop()
-            .expect("pop_exception() without nested exc stack");
-        crate::vm::thread::set_current_frame(old_frame);
-        self.frames.borrow_mut().pop();
-        #[cfg(feature = "threading")]
-        crate::vm::thread::pop_thread_frame();
-
-        self.recursion_depth.update(|d| d - 1);
-    }
-
     result
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/vm/src/vm/mod.rs` around lines 1122 - 1149, The defer cleanup (scopeguard::defer!) must be created before running the trace/execution so panics still trigger cleanup; move the scopeguard::defer! block to just after setting up the frame/owner/exception/recursion state and before calling self.trace_event(TraceEvent::Call, None) and f(frame). Ensure TraceEvent::Return is still emitted on successful result (keep the existing trace_event call after f(frame)), and preserve the same cleanup actions inside the guard (frame.owner.store(...), self.exceptions.borrow_mut().stack.pop(), crate::vm::thread::set_current_frame(old_frame), self.frames.borrow_mut().pop(), cfg thread pop_thread_frame(), and self.recursion_depth.update(|d| d - 1)) so any panic during trace_event or f(frame) will run the guard.
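The SETUP → GUARD → EXECUTE ordering can be demonstrated without the scopeguard crate, using a minimal Drop-based guard. This is a stand-in sketch, not the VM's actual types: `with_frame`, `Guard`, and the log are illustrative.

```rust
use std::cell::RefCell;
use std::panic::{catch_unwind, AssertUnwindSafe};

thread_local! {
    static LOG: RefCell<Vec<&'static str>> = RefCell::new(Vec::new());
}

/// Runs its closure when dropped, including during panic unwinding
/// (the same mechanism scopeguard::defer! relies on).
struct Guard<F: FnMut()>(F);
impl<F: FnMut()> Drop for Guard<F> {
    fn drop(&mut self) {
        (self.0)();
    }
}

fn with_frame(body: impl FnOnce()) {
    LOG.with(|l| l.borrow_mut().push("push frame")); // SETUP
    // GUARD is created *before* execution, so cleanup runs even if body panics.
    let _guard = Guard(|| LOG.with(|l| l.borrow_mut().push("pop frame")));
    body(); // EXECUTE
}

fn run_demo() -> Vec<&'static str> {
    // Panic inside the frame; the guard must still restore state.
    let _ = catch_unwind(AssertUnwindSafe(|| with_frame(|| panic!("boom"))));
    LOG.with(|l| l.borrow().clone())
}

fn main() {
    assert_eq!(run_demo(), vec!["push frame", "pop frame"]);
    println!("cleanup ran despite panic");
}
```

If the guard were instead constructed after `body()`, the panic would unwind past the point where the guard would have been created and "pop frame" would never be logged, which is exactly the hazard flagged above.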
🧹 Nitpick comments (4)
crates/common/src/refcount.rs (2)
43-46: Dead arithmetic: `/ COUNT` and `* COUNT` where `COUNT = 1`

`COUNT` is defined as `1`, so the division in `strong()` and the multiplication in `add_strong()` are no-ops that add noise. Simplify to direct mask/add:

♻️ Proposed simplification

```diff
 #[inline]
 fn strong(self) -> u32 {
-    ((self.inner & STRONG) / COUNT) as u32
+    (self.inner & STRONG) as u32
 }

 #[inline]
 fn add_strong(self, val: u32) -> Self {
-    Self::from_raw(self.inner + (val as usize) * COUNT)
+    Self::from_raw(self.inner + val as usize)
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/common/src/refcount.rs` around lines 43 - 46, The code does unnecessary arithmetic because COUNT is 1; simplify by removing the redundant / COUNT in strong() and * COUNT in add_strong(): in the strong() method use ((self.inner & STRONG) as u32) (or equivalent masking) to extract the strong count directly, and in add_strong() add the increment directly to the strong bits instead of multiplying by COUNT; update any related bit-manipulation comments to reflect the simplification and keep the STRONG and COUNT constants as needed (or remove COUNT if unused).
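The simplification is easy to check against a toy version of the packed state word. The bit layout and mask values here are assumptions for illustration, not the crate's actual constants:

```rust
// Hypothetical packed refcount word: low 30 bits hold the strong count,
// flag bits sit above the count field.
const STRONG: usize = (1 << 30) - 1; // assumed strong-count mask

#[derive(Copy, Clone, PartialEq, Debug)]
struct State {
    inner: usize,
}

impl State {
    fn from_raw(inner: usize) -> Self {
        Self { inner }
    }
    // After the simplification: a plain mask, no division by COUNT = 1.
    fn strong(self) -> u32 {
        (self.inner & STRONG) as u32
    }
    // After the simplification: a plain add, no multiplication by COUNT = 1.
    fn add_strong(self, val: u32) -> Self {
        Self::from_raw(self.inner + val as usize)
    }
}

fn main() {
    let leaked_flag = 1usize << 30; // assumed flag bit above the count field
    let s = State::from_raw(leaked_flag | 5);
    assert_eq!(s.strong(), 5); // flag bits are masked out of the count
    assert_eq!(s.add_strong(3).strong(), 8); // count arithmetic unchanged
    assert_eq!(s.add_strong(3).inner & leaked_flag, leaked_flag); // flag preserved
    println!("mask/add behave identically without COUNT");
}
```

Since the count occupies the low bits with unit stride, dividing or multiplying by `COUNT = 1` cannot change any value; the mask and add alone are sufficient.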
250-252: `#[inline]` on a `Vec`-allocating function is a misleading hint

`flush_deferred_drops` drains and collects into a new `Vec` on every call. The heap allocation prevents meaningful inlining and the attribute gives the compiler a misleading signal. Consider removing it, or if the goal is to inline the fast-path empty-queue check, restructure accordingly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/common/src/refcount.rs` around lines 250 - 252, The #[inline] attribute on flush_deferred_drops is misleading because the function always allocates a new Vec when draining, preventing meaningful inlining; remove the #[inline] on the flush_deferred_drops function or restructure it so the cheap fast-path (e.g., an early is_empty check on the deferred queue) is performed in a small #[inline] helper that returns early while the heavier Vec-allocating work remains in a non-inlined function; locate flush_deferred_drops and either drop the attribute or split out the empty-queue check into an inline helper to preserve the intended optimization.

crates/vm/src/gc_state.rs (2)
464-480: Traversal of referents is called twice over the collecting set — consider caching.

`gc_get_referent_ptrs()` is an `unsafe` call that traverses each object's references. It's invoked once in the gc_refs subtraction loop (lines 470–479) and again in the reachability traversal (lines 500–507) for every reachable object. For large object graphs this doubles the traversal work.

If `gc_get_referent_ptrs()` is non-trivial (e.g., it allocates or scans), caching the adjacency list (`HashMap<GcObjectPtr, Vec<GcObjectPtr>>`) during the first pass and reusing it in the second would halve the unsafe calls and improve collection latency.

Also applies to: 496-509
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/vm/src/gc_state.rs` around lines 464 - 480, The code currently calls unsafe obj.gc_get_referent_ptrs() twice per object; change to compute and store each object's adjacency once and reuse it: during the initial loop over collecting (the block that iterates over collecting, reads obj.gc_get_referent_ptrs(), and updates gc_refs) build a HashMap<GcObjectPtr, Vec<GcObjectPtr>> (or similar) keyed by gc_ptr and populate it from obj.gc_get_referent_ptrs(); then in the later reachability traversal (the code that walks reachable objects) read referents from that cache instead of calling gc_get_referent_ptrs() again. Update lookups to use the cache where you currently call gc_get_referent_ptrs(), and ensure you still respect the same alive/strong_count checks and membership tests against collecting and gc_refs.
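The suggested caching can be sketched with simplified types: plain `usize` pointers and pre-supplied referent lists standing in for `gc_get_referent_ptrs()`. The adjacency map is built once and reused by both passes of the CPython-style algorithm:

```rust
use std::collections::{HashMap, HashSet};

type Ptr = usize;

/// Input: (ptr, referents, strong_count). Output: pointers reachable only
/// through internal cycles, i.e. cyclic garbage.
fn find_cyclic_garbage(objects: &[(Ptr, Vec<Ptr>, u32)]) -> Vec<Ptr> {
    // Traverse each object's referents exactly once, caching the adjacency.
    let adj: HashMap<Ptr, Vec<Ptr>> =
        objects.iter().map(|(p, refs, _)| (*p, refs.clone())).collect();

    // Pass 1: gc_refs starts at the strong count; subtract internal refs.
    let mut gc_refs: HashMap<Ptr, i64> =
        objects.iter().map(|(p, _, sc)| (*p, *sc as i64)).collect();
    for refs in adj.values() {
        for r in refs {
            if let Some(c) = gc_refs.get_mut(r) {
                *c -= 1;
            }
        }
    }

    // Pass 2: objects with gc_refs > 0 have external references; everything
    // they reach is alive. Referents come from `adj`, not a second traversal.
    let mut stack: Vec<Ptr> = gc_refs
        .iter()
        .filter(|(_, c)| **c > 0)
        .map(|(p, _)| *p)
        .collect();
    let mut reachable: HashSet<Ptr> = stack.iter().copied().collect();
    while let Some(p) = stack.pop() {
        for r in adj.get(&p).into_iter().flatten() {
            if reachable.insert(*r) {
                stack.push(*r);
            }
        }
    }

    let mut dead: Vec<Ptr> = adj
        .keys()
        .filter(|p| !reachable.contains(*p))
        .copied()
        .collect();
    dead.sort();
    dead
}

fn main() {
    // 1 <-> 2 form a cycle with no external refs; 3 is held externally.
    let objs = vec![(1, vec![2], 1), (2, vec![1], 1), (3, vec![], 1)];
    assert_eq!(find_cyclic_garbage(&objs), vec![1, 2]);
    println!("cycle {{1, 2}} detected as garbage");
}
```

In the real code the cache trades memory (one `Vec<GcObjectPtr>` per collecting object) for halving the unsafe traversal calls, which is usually a good exchange during a stop-the-world pass.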
598-614: TOCTOU between `is_finalized` read and `mark_finalized` write — consider a single write-lock check.

The read lock at line 600 and write lock at line 609 are separate acquisitions. While the `collecting` mutex prevents concurrent GC, `untrack_object` (which removes from `finalized_objects`) could run between the two — potentially allowing a re-finalization of an object that was concurrently untracked and re-tracked. Consolidating into one write-lock acquisition eliminates the window:
Proposed consolidation

```diff
-    let already_finalized = if let Ok(finalized) = self.finalized_objects.read() {
-        finalized.contains(&ptr)
-    } else {
-        false
-    };
-
-    if !already_finalized {
-        // Mark as finalized BEFORE calling __del__
-        // This ensures is_finalized() returns True inside __del__
-        if let Ok(mut finalized) = self.finalized_objects.write() {
-            finalized.insert(ptr);
-        }
+    let newly_inserted = self
+        .finalized_objects
+        .write()
+        .map(|mut set| set.insert(ptr))
+        .unwrap_or(false);
+
+    if newly_inserted {
         obj_ref.try_call_finalizer();
     }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/vm/src/gc_state.rs` around lines 598 - 614, Consolidate the separate read and write lock around finalized_objects into a single write-lock per object: in the loop over unreachable_refs (using GcObjectPtr), acquire self.finalized_objects.write() once, check if the pointer is already present, and if not insert it; then drop the write lock and call obj_ref.try_call_finalizer() only when you performed the insert. This removes the TOCTOU window with untrack_object while preserving the behavior that the object is marked finalized before __del__ runs (is_finalized()/mark_finalized semantics).
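The consolidated pattern relies on `HashSet::insert` returning `true` only for the first caller, which turns check-and-mark into a single step under one lock. A minimal standalone sketch with a plain `usize` standing in for the object pointer:

```rust
use std::collections::HashSet;
use std::sync::RwLock;

/// Returns true exactly once per pointer: the caller that gets `true`
/// is the one that should run the finalizer. Check and mark happen
/// under a single write-lock acquisition, so there is no TOCTOU window.
fn mark_finalized_once(finalized: &RwLock<HashSet<usize>>, ptr: usize) -> bool {
    finalized
        .write()
        .map(|mut set| set.insert(ptr))
        .unwrap_or(false)
}

fn main() {
    let finalized = RwLock::new(HashSet::new());
    assert!(mark_finalized_once(&finalized, 0xDEAD)); // first caller wins
    assert!(!mark_finalized_once(&finalized, 0xDEAD)); // second caller skips
    assert!(mark_finalized_once(&finalized, 0xBEEF)); // independent object
    println!("finalizer would run exactly once per object");
}
```

Separate read-then-write acquisitions would let another thread mutate the set between the two locks; collapsing both into the single `insert` under the write lock is what closes the race.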
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/common/src/refcount.rs`:
- Around line 4-6: The layout comment mentions a "[1 bit: weaked]" flag that has
no corresponding implementation; either remove that token from the comment to
match the current code or add the missing bit definition and accessors. To fix,
decide whether to actually reserve the bit: if unused, update the comment to
drop "weaked" and adjust bit position explanations (remove any reference to
WEAKED/State::weaked); if intended, add a WEAKED constant (e.g., WEAKED_MASK)
and implement State::weaked(&self) -> bool plus setter/clearer logic where State
bits are manipulated (matching how destructed/leaked are handled), and ensure
weak_count/strong_count masks/shift constants are unchanged. Use existing
symbols like State, destructed, leaked, weak_count, strong_count, and follow the
pattern for other flag constants so the comment and code remain consistent.
- Around line 144-153: The dec() method currently does an unconditional
self.state.fetch_sub(COUNT, Ordering::SeqCst) then checks
State::from_raw(...).leaked(), which allows underflow to clear the LEAKED bit;
fix by detecting leaked() after the fetch and immediately restoring the
subtraction (call self.state.fetch_add(COUNT, Ordering::SeqCst)) before
returning false so the LEAKED flag and count aren't corrupted; update dec() to
perform the undo restore when old.leaked() is true and then return false
(optionally note you could replace with a CAS loop to avoid the transient
window).
- Around line 179-193: The deferred-drop thread-local and APIs must be compiled
only when the standard library is available: wrap the thread_local! block
(IN_DEFERRED_CONTEXT and DEFERRED_QUEUE), the DeferredDropGuard struct, and the
functions with_deferred_drops, try_defer_drop, and flush_deferred_drops with
#[cfg(feature = "std")] so they are excluded from no_std builds; ensure the
attribute is applied to each item (or to a surrounding mod) referencing those
symbols so the crate compiles without std just like refcount_overflow() does.
- Around line 99-110: Update the doc comment and safety semantics for pub fn
inc(&self) in refcount.rs: clarify that inc() assumes the caller already holds a
valid strong reference (e.g., used in to_owned()) and is not a resurrection path
— recommend safe_inc() for resurrecting from weak refs; replace the vague
comment "// The previous fetch_add created a permission to run decrement again"
with a brief protocol description (double fetch_add is an optimization to avoid
CAS when caller already had a strong ref and may race with a destructor that is
already in-flight), and make overflow handling consistent with inc_by() by
adding the same strong-count overflow check (use State::strong()/STRONG logic
and call refcount_overflow() on overflow) or, if you intentionally omit overflow
detection, document why it is safe to omit it; reference the symbols inc,
safe_inc, inc_by, State, COUNT, STRONG, destructed, and refcount_overflow in the
change.
In `@crates/vm/src/gc_state.rs`:
- Around line 439-444: The unconditional reset of generations[0].count loses
increments from concurrent track_object calls after gen_locks are dropped;
change the logic in the collection exit paths (where
generations[0].count.store(0, ...) is called) to either (A) switch maybe_collect
to use alloc_count as the authoritative allocation counter and reset alloc_count
there instead of touching generations[0].count, ensuring track_object increments
alloc_count, or (B) recompute gen0's true count after promotions by reading
generation_objects[0] (via its read lock) and store its length into
generations[0].count before releasing locks; update all early-return branches
(the places with generations[0].count.store(0, ...)) to use the chosen approach
so concurrent increments are not lost.
- Around line 734-758: promote_survivors currently removes an object from its
source generation and then silently skips adding it to next_gen if acquiring the
write lock for self.generation_objects[next_gen] fails, orphaning the object;
change the target-generation lock acquisition to propagate a poisoned-lock panic
instead of silently failing by replacing the conditional try-lock for next_gen
with an unconditional write() followed by .expect(...) (e.g. let mut next_set =
self.generation_objects[next_gen].write().expect("poisoned
generation_objects[next_gen] lock");) and then perform next_set.insert(ptr) and
the fetch_add on self.generations[next_gen].count, so that a poisoned lock
triggers a panic and the invariant that every tracked object is in exactly one
generation is preserved.
In `@crates/vm/src/vm/mod.rs`:
- Line 880: The doc comment above the block that clears the `sys` and `builtins`
dicts is mislabeled as "Phase 6" but `finalize_modules` actually treats that
block as Phase 7; update the comment to read "Phase 7" (or match the existing
phase numbering used in `finalize_modules`) so the phase labels are consistent
with the earlier Phase 6 GC collection and the surrounding phase comments.
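The undo-restore fix for `dec()` described in the refcount prompts above can be sketched against an assumed bit layout; the flag position and masks are illustrative, not the crate's real constants:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

const COUNT: usize = 1; // one strong reference
const LEAKED: usize = 1 << 31; // assumed LEAKED flag bit
const STRONG: usize = LEAKED - 1; // assumed strong-count mask

/// Decrement one strong reference. Returns true if this was the last one.
/// For leaked objects the decrement is immediately undone, so repeated
/// dec() calls can no longer underflow the count into the LEAKED bit.
fn dec(state: &AtomicUsize) -> bool {
    let old = state.fetch_sub(COUNT, Ordering::SeqCst);
    if old & LEAKED != 0 {
        // Restore the count; a leaked object is never dropped. (A CAS loop
        // would avoid even the transient window, at slightly higher cost.)
        state.fetch_add(COUNT, Ordering::SeqCst);
        return false;
    }
    (old & STRONG) == 1
}

fn main() {
    // Normal object with 2 strong refs: second dec() reports "last ref".
    let normal = AtomicUsize::new(2);
    assert!(!dec(&normal));
    assert!(dec(&normal));

    // Leaked object: state is left intact no matter how often dec() runs.
    let leaked = AtomicUsize::new(LEAKED | 1);
    for _ in 0..1_000 {
        assert!(!dec(&leaked));
    }
    assert_eq!(leaked.load(Ordering::SeqCst), LEAKED | 1);
    println!("LEAKED bit survives repeated dec()");
}
```

Without the `fetch_add` restore, 2^31 decrements on a leaked object would eventually borrow out of the count field and clear the flag bit, which is the corruption the prompt describes.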
---
Outside diff comments:
In `@crates/vm/src/vm/interpreter.rs`:
- Around line 386-395: Update the "Finalization steps" doc comment so it matches
the actual shutdown sequence in this file: enumerate the real order (flush
stdout/stderr via flush_std, handle exit exception -> exit code, call
threading._shutdown to wait for threads, set the finalizing flag (suppresses
unraisable exceptions), run atexit exit functions, run GC collect_force (where
the doc previously claimed atexit ran elsewhere), call finalize_modules for
module cleanup, and note any final flush/GC behavior; remove the duplicate
threading._shutdown step and correct the step labels to reference the actual
calls (threading._shutdown, atexit, collect_force, finalize_modules, flush_std)
so the numbered list matches the code flow.
In `@crates/vm/src/vm/mod.rs`:
- Around line 1122-1149: The defer cleanup (scopeguard::defer!) must be created
before running the trace/execution so panics still trigger cleanup; move the
scopeguard::defer! block to just after setting up the
frame/owner/exception/recursion state and before calling
self.trace_event(TraceEvent::Call, None) and f(frame). Ensure TraceEvent::Return
is still emitted on successful result (keep the existing trace_event call after
f(frame)), and preserve the same cleanup actions inside the guard
(frame.owner.store(...), self.exceptions.borrow_mut().stack.pop(),
crate::vm::thread::set_current_frame(old_frame), self.frames.borrow_mut().pop(),
cfg thread pop_thread_frame(), and self.recursion_depth.update(|d| d - 1)) so
any panic during trace_event or f(frame) will run the guard.
---
Nitpick comments:
In `@crates/common/src/refcount.rs`:
- Around line 43-46: The code does unnecessary arithmetic because COUNT is 1;
simplify by removing the redundant / COUNT in strong() and * COUNT in
add_strong(): in the strong() method use ((self.inner & STRONG) as u32) (or
equivalent masking) to extract the strong count directly, and in add_strong()
add the increment directly to the strong bits instead of multiplying by COUNT;
update any related bit-manipulation comments to reflect the simplification and
keep the STRONG and COUNT constants as needed (or remove COUNT if unused).
- Around line 250-252: The #[inline] attribute on flush_deferred_drops is
misleading because the function always allocates a new Vec when draining,
preventing meaningful inlining; remove the #[inline] on the flush_deferred_drops
function or restructure it so the cheap fast-path (e.g., an early is_empty check
on the deferred queue) is performed in a small #[inline] helper that returns
early while the heavier Vec-allocating work remains in a non-inlined function;
locate flush_deferred_drops and either drop the attribute or split out the
empty-queue check into an inline helper to preserve the intended optimization.
In `@crates/vm/src/gc_state.rs`:
- Around line 464-480: The code currently calls unsafe
obj.gc_get_referent_ptrs() twice per object; change to compute and store each
object's adjacency once and reuse it: during the initial loop over collecting
(the block that iterates over collecting, reads obj.gc_get_referent_ptrs(), and
updates gc_refs) build a HashMap<GcObjectPtr, Vec<GcObjectPtr>> (or similar)
keyed by gc_ptr and populate it from obj.gc_get_referent_ptrs(); then in the
later reachability traversal (the code that walks reachable objects) read
referents from that cache instead of calling gc_get_referent_ptrs() again.
Update lookups to use the cache where you currently call gc_get_referent_ptrs(),
and ensure you still respect the same alive/strong_count checks and membership
tests against collecting and gc_refs.
- Around line 598-614: Consolidate the separate read and write lock around
finalized_objects into a single write-lock per object: in the loop over
unreachable_refs (using GcObjectPtr), acquire self.finalized_objects.write()
once, check if the pointer is already present, and if not insert it; then drop
the write lock and call obj_ref.try_call_finalizer() only when you performed the
insert. This removes the TOCTOU window with untrack_object while preserving the
behavior that the object is marked finalized before __del__ runs
(is_finalized()/mark_finalized semantics).
```rust
if collecting.is_empty() {
    // Reset gen0 count even if nothing to collect
    self.generations[0].count.store(0, Ordering::SeqCst);
    self.generations[generation].update_stats(0, 0);
    return (0, 0);
}
```
Resetting gen0.count to 0 loses objects tracked during collection.
After dropping gen_locks (line 534), other threads can call track_object, which increments generations[0].count. The unconditional store(0) at line 718 (and early-exit paths at lines 441, 527, 554) discards those increments, causing maybe_collect (line 376) to under-count gen0 and potentially miss triggering the next collection.
Note that alloc_count (line 120) is already incremented in track_object but is never read — it looks like it was intended as a dedicated allocation counter for threshold checking (matching CPython's approach), but maybe_collect uses generations[0].count instead.
Consider either:
- Using `alloc_count` for threshold checking in `maybe_collect` and resetting it here, or
- Recomputing gen0 count from the actual set size after promotion:

```rust
let actual = self.generation_objects[0].read().map(|s| s.len()).unwrap_or(0);
self.generations[0].count.store(actual, Ordering::SeqCst);
```
Also applies to: 716-722
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@crates/vm/src/gc_state.rs` around lines 439 - 444, The unconditional reset of
generations[0].count loses increments from concurrent track_object calls after
gen_locks are dropped; change the logic in the collection exit paths (where
generations[0].count.store(0, ...) is called) to either (A) switch maybe_collect
to use alloc_count as the authoritative allocation counter and reset alloc_count
there instead of touching generations[0].count, ensuring track_object increments
alloc_count, or (B) recompute gen0's true count after promotions by reading
generation_objects[0] (via its read lock) and store its length into
generations[0].count before releasing locks; update all early-return branches
(the places with generations[0].count.store(0, ...)) to use the chosen approach
so concurrent increments are not lost.
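Option (A) above can be sketched with an atomic swap-based reset, so `track_object` increments racing with the reset land in the new epoch instead of being discarded. Field and method names here are illustrative, not the actual `GcState` layout:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// A dedicated allocation counter for gen0 threshold checks, reset with an
// atomic swap at the end of collection rather than an unconditional store(0).
struct Gen0 {
    alloc_count: AtomicUsize,
    threshold: usize,
}

impl Gen0 {
    fn track_object(&self) {
        self.alloc_count.fetch_add(1, Ordering::SeqCst);
    }

    fn should_collect(&self) -> bool {
        self.alloc_count.load(Ordering::SeqCst) >= self.threshold
    }

    /// Called when collection finishes: atomically take-and-reset, so an
    /// increment racing with the reset is counted toward the next cycle.
    fn end_collection(&self) -> usize {
        self.alloc_count.swap(0, Ordering::SeqCst)
    }
}

fn main() {
    let gen0 = Gen0 {
        alloc_count: AtomicUsize::new(0),
        threshold: 3,
    };
    for _ in 0..3 {
        gen0.track_object();
    }
    assert!(gen0.should_collect());
    // An allocation sneaking in "during" collection...
    gen0.track_object();
    assert_eq!(gen0.end_collection(), 4); // ...is observed, not lost
    assert!(!gen0.should_collect());
    println!("racing increment accounted for");
}
```

The `swap(0)` makes the reset and the read one atomic step; a separate `load` followed by `store(0)` would reintroduce the lost-increment window the comment describes.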
```rust
// Remove from current generation
for gen_idx in 0..=from_gen {
    if let Ok(mut gen_set) = self.generation_objects[gen_idx].write()
        && gen_set.remove(&ptr)
    {
        // Decrement count for source generation
        let count = self.generations[gen_idx].count.load(Ordering::SeqCst);
        if count > 0 {
            self.generations[gen_idx]
                .count
                .fetch_sub(1, Ordering::SeqCst);
        }

        // Add to next generation
        if let Ok(mut next_set) = self.generation_objects[next_gen].write()
            && next_set.insert(ptr)
        {
            // Increment count for target generation
            self.generations[next_gen]
                .count
                .fetch_add(1, Ordering::SeqCst);
        }
        break;
    }
}
```
🧩 Analysis chain
🏁 Script executed:

```shell
# Check how RwLock errors are handled across gc_state.rs
rg -n 'if let Ok\(' crates/vm/src/gc_state.rs
```

Repository: RustPython/RustPython
Length of output: 1394

🏁 Script executed:

```shell
# Get the full context of the promote_survivors function
sed -n '700,800p' crates/vm/src/gc_state.rs | cat -n
```

Repository: RustPython/RustPython
Length of output: 4494

🏁 Script executed:

```shell
# Check if there are any other error handling patterns or comments about lock poisoning
rg -n -i 'poison|panic|expect|unwrap' crates/vm/src/gc_state.rs | head -30
```

Repository: RustPython/RustPython
Length of output: 142

🏁 Script executed:

```shell
# Check the GcState struct definition and comments about the invariant
sed -n '1,200p' crates/vm/src/gc_state.rs | head -150
```

Repository: RustPython/RustPython
Length of output: 5096

🏁 Script executed:

```shell
# Look for any cleanup, recovery, or assertion code related to the invariant
rg -n 'tracked_objects|generation_objects' crates/vm/src/gc_state.rs | head -40
```

Repository: RustPython/RustPython
Length of output: 1219

🏁 Script executed:

```shell
# Check if there are comments about lock poisoning or error handling strategy
rg -B2 -A2 'if let Ok\(mut' crates/vm/src/gc_state.rs | head -60
```

Repository: RustPython/RustPython
Length of output: 2198

🏁 Script executed:

```shell
# Check if there are any panic!() or other operations that could poison locks
rg -n 'panic!|unwrap\(\)|expect\(' crates/vm/src/gc_state.rs
```

Repository: RustPython/RustPython
Length of output: 47

🏁 Script executed:

```shell
# Look at the track_object implementation to understand the full flow
sed -n '235,260p' crates/vm/src/gc_state.rs | cat -n
```

Repository: RustPython/RustPython
Length of output: 1142

🏁 Script executed:

```shell
# Check if there's any mechanism to repair or validate the invariant between tracked_objects and generation_objects
rg -n 'fn.*invariant|fn.*valid|fn.*check' crates/vm/src/gc_state.rs
```

Repository: RustPython/RustPython
Length of output: 47

🏁 Script executed:

```shell
# Check if there could be any panics during GC collection that might poison locks
# Look at the collection phase and see if there are any operations that could panic
sed -n '600,750p' crates/vm/src/gc_state.rs | cat -n
```

Repository: RustPython/RustPython
Length of output: 7501

🏁 Script executed:

```shell
# Check if PyObject operations (like gc_clear) could panic
rg -n 'gc_clear\|gc_has_clear' crates/vm/src/gc_state.rs
```

Repository: RustPython/RustPython
Length of output: 47

🏁 Script executed:

```shell
# Look for how RwLock is used and if there's documentation about error handling strategy
rg -B3 -A3 'RwLock' crates/vm/src/gc_state.rs | head -80
```

Repository: RustPython/RustPython
Length of output: 2130

🏁 Script executed:

```shell
# Check if try_call_finalizer could panic/poison locks
rg -n 'try_call_finalizer' crates/vm/src/
```

Repository: RustPython/RustPython
Length of output: 213

🏁 Script executed:

```shell
# Verify line numbers in the actual file - check the exact location of promote_survivors
sed -n '726,760p' crates/vm/src/gc_state.rs | cat -n
```

Repository: RustPython/RustPython
Length of output: 1652
`promote_survivors` silently loses objects if a write-lock acquisition fails.

If the write lock on `generation_objects[next_gen]` fails (lines 749–750), the object has already been removed from its source generation (lines 737–745) but is never added to the target — it becomes orphaned in `tracked_objects` with no generation, violating the invariant that every tracked object belongs to exactly one generation. Generational collection will never visit it again.

The `if let Ok(mut ...) = ...write()` pattern is used throughout this file (17 instances) and silently swallows poisoned-lock errors. While a panic during normal operations is rare, `try_call_finalizer()` is called during collection and could trigger one, poisoning locks. For `promote_survivors` the consequence is data loss; for other operations like `track_object`/`untrack_object` it causes inconsistent counts.

Use `.expect("...")` or `.unwrap()` on the target-generation insertion to propagate the poison panic and preserve the invariant.
Proposed fix
```diff
     // Add to next generation
-    if let Ok(mut next_set) = self.generation_objects[next_gen].write()
-        && next_set.insert(ptr)
-    {
-        // Increment count for target generation
-        self.generations[next_gen]
-            .count
-            .fetch_add(1, Ordering::SeqCst);
-    }
+    let mut next_set = self.generation_objects[next_gen]
+        .write()
+        .expect("failed to acquire lock for next generation");
+    if next_set.insert(ptr) {
+        // Increment count for target generation
+        self.generations[next_gen]
+            .count
+            .fetch_add(1, Ordering::SeqCst);
+    }
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```rust
// Remove from current generation
for gen_idx in 0..=from_gen {
    if let Ok(mut gen_set) = self.generation_objects[gen_idx].write()
        && gen_set.remove(&ptr)
    {
        // Decrement count for source generation
        let count = self.generations[gen_idx].count.load(Ordering::SeqCst);
        if count > 0 {
            self.generations[gen_idx]
                .count
                .fetch_sub(1, Ordering::SeqCst);
        }
        // Add to next generation
        let mut next_set = self.generation_objects[next_gen]
            .write()
            .expect("failed to acquire lock for next generation");
        if next_set.insert(ptr) {
            // Increment count for target generation
            self.generations[next_gen]
                .count
                .fetch_add(1, Ordering::SeqCst);
        }
        break;
    }
}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@crates/vm/src/gc_state.rs` around lines 734 - 758, promote_survivors
currently removes an object from its source generation and then silently skips
adding it to next_gen if acquiring the write lock for
self.generation_objects[next_gen] fails, orphaning the object; change the
target-generation lock acquisition to propagate a poisoned-lock panic instead of
silently failing by replacing the conditional try-lock for next_gen with an
unconditional write() followed by .expect(...) (e.g. let mut next_set =
self.generation_objects[next_gen].write().expect("poisoned
generation_objects[next_gen] lock");) and then perform next_set.insert(ptr) and
the fetch_add on self.generations[next_gen].count, so that a poisoned lock
triggers a panic and the invariant that every tracked object is in exactly one
generation is preserved.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
crates/vm/src/vm/interpreter.rs (1)
385-395: ⚠️ Potential issue | 🟡 Minor — Finalize step list is out of sync with the new order.
The doc still mentions a second _shutdown() step and a "Final GC pass" after finalize_modules, which no longer happens.
📝 Suggested comment update
- /// 1. Wait for thread shutdown (call threading._shutdown).
- /// 1. Set finalizing flag (suppresses unraisable exceptions).
- /// 1. Call threading._shutdown() to join non-daemon threads.
- /// 1. Run atexit exit functions.
- /// 1. GC pass and module cleanup (finalize_modules).
- /// 1. Final GC pass.
+ /// 1. Wait for thread shutdown (call threading._shutdown).
+ /// 1. Run atexit exit functions.
+ /// 1. Set finalizing flag (suppresses unraisable exceptions).
+ /// 1. GC pass and module cleanup (finalize_modules, which includes the final GC phase).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/vm/src/vm/interpreter.rs` around lines 385 - 395, Update the finalization doc comment in vm::interpreter.rs (the block describing the steps for finalizing the VM—see the comment above the finalize function) to match the current implementation: remove the duplicated threading._shutdown() step and remove the “Final GC pass” entry after finalize_modules, and reorder or renumber the listed steps so they accurately reflect the actual sequence (flush stdout/stderr; handle exit exception; wait for thread shutdown/set finalizing flag/call threading._shutdown to join non-daemon threads; run atexit functions; GC pass and module cleanup via finalize_modules). Ensure the comment references threading._shutdown and finalize_modules exactly as named.
🧹 Nitpick comments (1)
crates/vm/src/signal.rs (1)
66-75: Pre-existing: shared RefCell borrow held across Python invocation may panic.
signal_handlers is borrowed immutably at line 66 and kept alive through callable.invoke(...) at line 73. If the Python signal handler calls signal.signal() (which would borrow_mut() the same RefCell), the runtime borrow-check will panic.
IN_SIGNAL_HANDLER only prevents trigger_signals re-entry via check_signals; it does not guard against the Python-level signal.signal() call path mutating the RefCell. Consider dropping the borrow before invoking, or collecting the handlers to invoke into a local Vec first:
♻️ Suggested approach — snapshot handlers before invoking
- let signal_handlers = vm.signal_handlers.get().unwrap().borrow();
- for (signum, trigger) in TRIGGERS.iter().enumerate().skip(1) {
-     let triggered = trigger.swap(false, Ordering::Relaxed);
-     if triggered
-         && let Some(handler) = &signal_handlers[signum]
-         && let Some(callable) = handler.to_callable()
-     {
-         callable.invoke((signum, vm.ctx.none()), vm)?;
-     }
- }
+ let to_invoke: Vec<(usize, _)> = {
+     let signal_handlers = vm.signal_handlers.get().unwrap().borrow();
+     TRIGGERS
+         .iter()
+         .enumerate()
+         .skip(1)
+         .filter_map(|(signum, trigger)| {
+             if trigger.swap(false, Ordering::Relaxed) {
+                 signal_handlers[signum]
+                     .as_ref()
+                     .and_then(|h| h.to_callable())
+                     .map(|c| (signum, c))
+             } else {
+                 None
+             }
+         })
+         .collect()
+ }; // borrow released here
+ for (signum, callable) in to_invoke {
+     callable.invoke((signum, vm.ctx.none()), vm)?;
+ }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/vm/src/signal.rs` around lines 66 - 75, The current code holds an immutable RefCell borrow from vm.signal_handlers.get().unwrap().borrow() across the Python call (callable.invoke), which can panic if the handler calls signal.signal() (a borrow_mut); fix by snapshotting the handlers to invoke and dropping the RefCell borrow before any Python invocation: iterate TRIGGERS and for each triggered entry, read and clone/cloneable-convert the handler reference (or store indices/signum + handler.to_callable() results) into a local Vec (e.g., Vec<(usize, Callable)>) while the borrow is held, then drop the borrow and loop over that Vec to call callable.invoke((signum, vm.ctx.none()), vm) so no RefCell borrow is held during invocation; reference functions/fields: vm.signal_handlers.get(), signal_handlers.borrow(), TRIGGERS, trigger.swap, handler.to_callable(), and callable.invoke.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/common/src/refcount.rs`:
- Around line 120-138: safe_inc currently increments without checking
strong-count overflow and can wrap into weak bits; before attempting the
compare_exchange in safe_inc, detect the overflow the same way inc_by does: read
old via State::from_raw(self.state.load(...)), and if adding 1 would exceed the
maximum strong count (e.g., compare old.strong() against State's max or check
that old.add_strong(1) would overflow/wrap), return false; only proceed to
compute new_state = old.add_strong(1) and call
self.state.compare_exchange(old.as_raw(), new_state.as_raw(), ...) when the
increment is safe and old.destructed() is false. Ensure you reference
State::from_raw, destructed, add_strong, as_raw and the existing
compare_exchange call when making the change.
---
Outside diff comments:
In `@crates/vm/src/vm/interpreter.rs`:
- Around line 385-395: Update the finalization doc comment in vm::interpreter.rs
(the block describing the steps for finalizing the VM—see the comment above the
finalize function) to match the current implementation: remove the duplicated
threading._shutdown() step and remove the “Final GC pass” entry after
finalize_modules, and reorder or renumber the listed steps so they accurately
reflect the actual sequence (flush stdout/stderr; handle exit exception; wait
for thread shutdown/set finalizing flag/call threading._shutdown to join
non-daemon threads; run atexit functions; GC pass and module cleanup via
finalize_modules). Ensure the comment references threading._shutdown and
finalize_modules exactly as named.
---
Duplicate comments:
In `@crates/common/src/refcount.rs`:
- Around line 98-108: inc() must explicitly prevent strong-count overflow and
clarify the safety assumption; replace the current blind fetch_add pattern with
a compare-and-swap loop (e.g., using fetch_update or compare_exchange) on
self.state to read current State::from_raw, verify !val.destructed() and that
val.strong() + COUNT does not exceed the strong-capacity (call
refcount_overflow() if it would), then atomically store the incremented raw
state; also add an explicit comment or debug_assert in inc() that the caller
already holds a strong reference (the function relies on that invariant) and
remove the ambiguous double-fetch_add; keep references to State::from_raw,
COUNT, self.state, destructed(), strong(), and refcount_overflow() so reviewers
can locate the change.
In `@crates/vm/src/gc_state.rs`:
- Around line 733-755: promote_survivors currently removes `ptr` from the source
generation (`generation_objects[gen_idx]`) before trying to acquire the target
generation write lock (`generation_objects[next_gen].write()`), so if the
next-gen lock acquisition fails (poisoned) the object is lost; change the logic
in `promote_survivors` to either acquire the `next_gen` write lock before
removing from the source, or if you must remove first then handle the `Err` from
`generation_objects[next_gen].write()` by reinserting `ptr` into the original
`gen_set` and restoring the corresponding `generations[gen_idx].count` (and
return/continue) so the object is not orphaned; reference the symbols
`promote_survivors`, `generation_objects`, `generations`, `from_gen`,
`next_gen`, `gen_set`, `next_set`, and `ptr` when making the change.
- Around line 439-444: The unconditional self.generations[0].count.store(0,
Ordering::SeqCst) can clobber increments from concurrent track_object() calls;
remove the unconditional reset in early-return paths (e.g., the branch where
collecting.is_empty()) and only adjust the atomic counter when you actually
remove objects — perform an atomic adjustment (use
AtomicUsize::fetch_sub(amount, Ordering::SeqCst) when you know how many were
collected, or AtomicUsize::swap(0, Ordering::SeqCst) to atomically
grab-and-reset) in the collection code paths (referencing generations[0].count,
track_object(), and the methods that call update_stats) so concurrent increments
aren’t lost.
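The swap-based grab-and-reset named in this prompt can be sketched as a standalone counter; `Counter`, `track`, and `drain` are illustrative names, not the actual `gc_state.rs` API:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical stand-in for generations[0].count.
struct Counter(AtomicUsize);

impl Counter {
    // Called from track_object(): one new object entered generation 0.
    fn track(&self) {
        self.0.fetch_add(1, Ordering::SeqCst);
    }

    // Called from the collector: swap(0) reads and resets atomically, so
    // increments racing with the collection are kept for the next cycle
    // rather than clobbered by an unconditional store(0).
    fn drain(&self) -> usize {
        self.0.swap(0, Ordering::SeqCst)
    }
}
```

Unlike `store(0)`, the `swap` returns exactly the increments it erased, so nothing observed between the read and the reset can be lost.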
---
Nitpick comments:
In `@crates/vm/src/signal.rs`:
- Around line 66-75: The current code holds an immutable RefCell borrow from
vm.signal_handlers.get().unwrap().borrow() across the Python call
(callable.invoke), which can panic if the handler calls signal.signal() (a
borrow_mut); fix by snapshotting the handlers to invoke and dropping the RefCell
borrow before any Python invocation: iterate TRIGGERS and for each triggered
entry, read and clone/cloneable-convert the handler reference (or store
indices/signum + handler.to_callable() results) into a local Vec (e.g.,
Vec<(usize, Callable)>) while the borrow is held, then drop the borrow and loop
over that Vec to call callable.invoke((signum, vm.ctx.none()), vm) so no RefCell
borrow is held during invocation; reference functions/fields:
vm.signal_handlers.get(), signal_handlers.borrow(), TRIGGERS, trigger.swap,
handler.to_callable(), and callable.invoke.
  /// Returns true if successful
  #[inline]
  pub fn safe_inc(&self) -> bool {
-     self.strong
-         .fetch_update(AcqRel, Acquire, |prev| (prev != 0).then_some(prev + 1))
-         .is_ok()
+     let mut old = State::from_raw(self.state.load(Ordering::SeqCst));
+     loop {
+         if old.destructed() {
+             return false;
+         }
+         let new_state = old.add_strong(1);
+         match self.state.compare_exchange(
+             old.as_raw(),
+             new_state.as_raw(),
+             Ordering::SeqCst,
+             Ordering::SeqCst,
+         ) {
+             Ok(_) => return true,
+             Err(curr) => old = State::from_raw(curr),
+         }
+     }
Guard safe_inc() against strong-count overflow.
safe_inc() can overflow into the weak bits when the strong count is at max; align it with inc_by()’s overflow checks.
♻️ Suggested fix
pub fn safe_inc(&self) -> bool {
let mut old = State::from_raw(self.state.load(Ordering::SeqCst));
loop {
if old.destructed() {
return false;
}
+ if (old.strong() as usize) >= STRONG {
+ refcount_overflow();
+ }
let new_state = old.add_strong(1);
match self.state.compare_exchange(
old.as_raw(),
new_state.as_raw(),
Ordering::SeqCst,
Ordering::SeqCst,
) {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@crates/common/src/refcount.rs` around lines 120 - 138, safe_inc currently
increments without checking strong-count overflow and can wrap into weak bits;
before attempting the compare_exchange in safe_inc, detect the overflow the same
way inc_by does: read old via State::from_raw(self.state.load(...)), and if
adding 1 would exceed the maximum strong count (e.g., compare old.strong()
against State's max or check that old.add_strong(1) would overflow/wrap), return
false; only proceed to compute new_state = old.add_strong(1) and call
self.state.compare_exchange(old.as_raw(), new_state.as_raw(), ...) when the
increment is safe and old.destructed() is false. Ensure you reference
State::from_raw, destructed, add_strong, as_raw and the existing
compare_exchange call when making the change.
Actionable comments posted: 2
🧹 Nitpick comments (1)
crates/common/src/refcount.rs (1)
93-96: Consider #[must_use] on dec() and safe_inc(). Both return a bool that must be acted upon: ignoring dec()'s return value causes a memory leak (the object is never freed), and ignoring safe_inc()'s false return means proceeding with an invalid reference to a destructed object. Annotating them with #[must_use] surfaces these mistakes at compile time.
♻️ Proposed addition
+ #[must_use]
  pub fn safe_inc(&self) -> bool {
+ /// Decrement strong count. Returns true when count drops to 0.
+ #[must_use]
  pub fn dec(&self) -> bool {
Also applies to: 120-122, 141-143
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/common/src/refcount.rs` around lines 93 - 96, Add the #[must_use] attribute to the reference-counting methods that return a bool so callers cannot accidentally ignore their result: annotate the dec() and safe_inc() methods (and any other overloads/variants of these methods in the same module/class) with #[must_use] to force checking the returned boolean; locate the implementations by name (dec and safe_inc in the RefCount/State impl) and add the attribute above their signatures.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/vm/src/vm/mod.rs`:
- Around line 824-825: The doc comment above the module-dict clearing block is
mislabeled "Phase 4" while finalize_modules runs it as Phase 5; update the
comment to read "Phase 5: Clear module dicts in reverse import order. Skip
builtins and sys — those are cleared last in Phase 7." so the phase numbering
matches finalize_modules/GC sequencing.
- Around line 723-733: Update the doc comment above the method
finalize_clear_module_dicts to correct the phase label from "Phase 4" to "Phase
5" so it matches the sequencing in finalize_modules (where the first GC collect
is Phase 4 and the dict-clearing pass is Phase 5); locate the comment that
currently precedes the finalize_clear_module_dicts function and change the phase
number text only, leaving the rest of the explanation intact.
---
Duplicate comments:
In `@crates/common/src/refcount.rs`:
- Around line 120-139: safe_inc() can overflow the strong count and corrupt WEAK
bits because it computes new_state = old.add_strong(1) without checking limits;
before calling State::add_strong or performing compare_exchange, check that
old.strong() < STRONG (the max strong sentinel) and return false if not, so you
never attempt to increment past the strong limit; update the logic in safe_inc
(referencing safe_inc, State::strong(), State::add_strong(), STRONG and the
compare_exchange on self.state) to perform this guard and only compute/add the
strong count when it's safe.
- Around line 98-109: inc() currently increments the strong count
unconditionally and can overflow into the weak bits; mirror the overflow guard
used in inc_by(): read State::from_raw(self.state.fetch_add(COUNT,
Ordering::SeqCst)) into val, check if (val.strong() as usize) > STRONG - 1 and
call refcount_overflow() before performing the increment (or bail), and keep the
existing destructed() and permission re-grant logic (the second
self.state.fetch_add(COUNT, Ordering::SeqCst)) unchanged so the function
enforces the same strong-count ceiling as inc_by(); reference functions/values:
inc(), inc_by(), State, STRONG, COUNT, refcount_overflow, and
self.state.fetch_add.
- Around line 141-152: The current dec() unconditionally calls
self.state.fetch_sub(COUNT) which can underflow and clear the LEAKED bit;
replace that unconditional subtraction with an atomic CAS loop that first reads
state via self.state.load(Ordering::SeqCst), checks
State::from_raw(raw).leaked() and returns false if leaked, and otherwise
attempts a compare_exchange_weak from raw to raw - COUNT (using
Ordering::SeqCst) until it succeeds, then compute and return whether the
previous strong count was 1 (use State::from_raw(raw).strong() == 1); update
references to self.state.fetch_sub(COUNT) and keep using State::from_raw,
leaked(), and LEAKED in the logic.
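The CAS-loop shape this prompt asks for can be sketched in isolation; the bit layout below (top bit = LEAKED, COUNT = 1) is an assumption for illustration and need not match the real State encoding in refcount.rs:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Assumed layout: high bit marks LEAKED, low bits hold the strong count.
const LEAKED: usize = 1usize << (usize::BITS - 1);
const COUNT: usize = 1;

// Decrement via CAS loop: a leaked state is returned unchanged, so the
// LEAKED bit can never be cleared by an underflowing subtraction.
// Caller must hold a strong reference (count >= 1 when not leaked).
fn dec(state: &AtomicUsize) -> bool {
    let mut raw = state.load(Ordering::SeqCst);
    loop {
        if raw & LEAKED != 0 {
            return false;
        }
        match state.compare_exchange_weak(
            raw,
            raw - COUNT,
            Ordering::SeqCst,
            Ordering::SeqCst,
        ) {
            // True when this call dropped the strong count to zero.
            Ok(prev) => return (prev & !LEAKED) == 1,
            Err(curr) => raw = curr,
        }
    }
}
```

The leaked check happens before any mutation, which is what an unconditional fetch_sub cannot guarantee.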
In `@crates/vm/src/vm/mod.rs`:
- Line 880: The Phase 7 label comment in vm::mod.rs (the doc/comment "Phase 7:
Clear sys and builtins dicts last.") has been corrected; no code changes are
required—keep the existing comment as-is and approve the change around the Phase
7 section in the VM teardown logic (look for the Phase 7 label near the VM
cleanup functions in vm::mod.rs).
---
Nitpick comments:
In `@crates/common/src/refcount.rs`:
- Around line 93-96: Add the #[must_use] attribute to the reference-counting
methods that return a bool so callers cannot accidentally ignore their result:
annotate the dec() and safe_inc() methods (and any other overloads/variants of
these methods in the same module/class) with #[must_use] to force checking the
returned boolean; locate the implementations by name (dec and safe_inc in the
RefCount/State impl) and add the attribute above their signatures.
Actionable comments posted: 1
🧹 Nitpick comments (3)
crates/common/src/refcount.rs (2)
258-270: flush_deferred_drops is not panic-safe: remaining ops are silently discarded on panic.
Once the queue is drained into ops, any op() that panics unwinds through the for loop — the remaining Box<dyn FnOnce()> items in ops are dropped (not called). Since these closures are GC untrack operations, silently skipping them leaves the GC holding stale pointers to freed objects, which can cause use-after-free in subsequent collections.
♻️ Proposed panic-safe drain using a RAII call-on-drop wrapper
struct CallOnDrop(Option<Box<dyn FnOnce()>>);

impl Drop for CallOnDrop {
    fn drop(&mut self) {
        if let Some(f) = self.0.take() {
            f();
        }
    }
}

pub fn flush_deferred_drops() {
    DEFERRED_QUEUE.with(|q| {
        let ops: Vec<_> = q
            .borrow_mut()
            .drain(..)
            .map(|f| CallOnDrop(Some(f)))
            .collect();
        // Each op runs either in the loop or on drop during panic unwinding
        for mut wrapper in ops {
            if let Some(f) = wrapper.0.take() {
                f();
            }
        }
    });
}
Alternatively, a simpler approach is to catch panics per-op using std::panic::catch_unwind (requires the closures to be UnwindSafe), or to document that callers must ensure deferred closures do not panic.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/common/src/refcount.rs` around lines 258 - 270, flush_deferred_drops currently drains DEFERRED_QUEUE into a Vec and calls each op, but if any op() panics the remaining closures are dropped and never run (leading to GC stale pointers); fix by making the drain panic-safe: wrap each Box<dyn FnOnce()> into an RAII wrapper (e.g., struct CallOnDrop(Option<Box<dyn FnOnce()>>) with a Drop impl that calls the closure if still present), change the drain to map each closure into CallOnDrop(Some(f)) and collect that Vec, then iterate the Vec and take().call each wrapper to run closures normally—any closures not yet taken will be invoked from CallOnDrop::drop during panic unwinding, ensuring no deferred drop is silently skipped.
43-45: / COUNT is a no-op — simplify strong().
Since COUNT = 1, the division is always a no-op and only adds noise.
♻️ Proposed simplification
  fn strong(self) -> u32 {
-     ((self.inner & STRONG) / COUNT) as u32
+     (self.inner & STRONG) as u32
  }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/common/src/refcount.rs` around lines 43 - 45, The strong() method performs an unnecessary division by COUNT; update the strong function (fn strong(self) -> u32) to compute ((self.inner & STRONG) as u32) (or equivalent masking and casting) without dividing by COUNT since COUNT == 1, and ensure the result is still cast to u32; reference the symbols strong, self.inner, STRONG and COUNT when making the change.
crates/vm/src/object/core.rs (1)
434-452: Missing re-check for existing generic ref under the re-acquired lock.
Between releasing the first lock (line 421) and re-acquiring it (line 435), another thread may have allocated and inserted its own generic weakref. The is_generic insertion branch unconditionally overwrites self.generic without re-checking, resulting in two generic weakref nodes in the list. While this is memory-safe (both nodes are cleared when the object dies), it creates a redundant allocation that the two-phase refactor was intended to avoid.
♻️ Proposed fix — re-check under the re-acquired lock
  // Re-acquire lock for linked list insertion
  let _lock = weakref_lock::lock(obj as *const PyObject as usize);
+ // Re-check: another thread may have inserted a generic ref while we were
+ // allocating outside the lock. If so, reuse it and let `weak` drop.
+ if is_generic {
+     let generic_ptr = self.generic.load(Ordering::Relaxed);
+     if !generic_ptr.is_null() {
+         let generic = unsafe { &*generic_ptr };
+         if generic.0.ref_count.safe_inc() {
+             return unsafe { PyRef::from_raw(generic_ptr) };
+         }
+     }
+ }
+ // Insert into linked list under stripe lock
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/vm/src/object/core.rs` around lines 434 - 452, Between releasing and re-acquiring the weakref stripe lock (weakref_lock::lock) the code unconditionally treats the new weakref as generic and overwrites self.generic, which can allow two generic nodes; after re-acquiring the lock (before setting self.generic and self.head) re-check whether a generic already exists (inspect self.generic/head via NonNull::new or equivalent) and if one is present avoid marking this node as the generic head—insert it as a normal/basic weakref (update pointers accordingly) or update the existing generic pointer only if still None; ensure the check uses the same pointer types/ordering as the current insertion logic (node_ptr, WeakLink::pointers, self.head, self.generic) to preserve atomic consistency.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/vm/src/object/core.rs`:
- Around line 128-133: The end() function can underflow the per-thread
DEALLOC_DEPTH by doing d.get() - 1; change the decrement to a safe operation
(e.g., use saturating_sub or checked_sub with a debug_assert) so depth never
wraps to usize::MAX, then set the new depth and return it; reference
DEALLOC_DEPTH and end(), and keep begin() semantics intact while optionally
adding a debug_assert that ensures a matching begin() existed (and/or log an
error) instead of allowing silent release-mode wraparound that trips
TRASHCAN_LIMIT logic.
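A minimal sketch of the runtime-checked decrement; `DEPTH`, `begin`, and `end` here are stand-ins for the per-thread `DEALLOC_DEPTH` cell and the trashcan functions:

```rust
use std::cell::Cell;

thread_local! {
    // Stand-in for the per-thread DEALLOC_DEPTH counter.
    static DEPTH: Cell<usize> = Cell::new(0);
}

fn begin() {
    DEPTH.with(|d| d.set(d.get() + 1));
}

// checked_sub turns an unmatched end() into a loud panic in every build,
// instead of wrapping to usize::MAX in release mode and confusing the
// depth-limit check.
fn end() -> usize {
    DEPTH.with(|d| {
        let new = d
            .get()
            .checked_sub(1)
            .expect("trashcan::end called without matching begin");
        d.set(new);
        new
    })
}
```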
---
Duplicate comments:
In `@crates/common/src/refcount.rs`:
- Around line 148-157: The dec() method unconditionally calls
self.state.fetch_sub(COUNT, Ordering::SeqCst) before checking
State::from_raw(...).leaked(), which can underflow the packed counter for LEAKED
states; restore the prior value if leaked() is true by re-adding COUNT (e.g.
call self.state.fetch_add(COUNT, Ordering::SeqCst)) and return false, so update
dec() to fetch_sub, compute old = State::from_raw(...), check old.leaked() and
if true call fetch_add(COUNT) to undo the subtraction and return false; keep the
final return old.strong() == 1 as-is.
- Around line 98-109: The inc() implementation must add the same strong-count
overflow guard used by inc_by(): after computing val =
State::from_raw(self.state.fetch_add(COUNT, ...)) check whether (val.strong() as
usize) > STRONG - 1 (i.e. would overflow by adding 1) and call
refcount_overflow() if so, then preserve the existing destructed() check and the
conditional extra fetch_add for when val.strong() == 0; update function inc() to
mirror inc_by()’s overflow logic using the same COUNT, STRONG and State symbols.
In `@crates/vm/src/vm/mod.rs`:
- Around line 880-884: The phase label in the finalize_clear_sys_builtins_dict
function comment has been corrected to "Phase 7" already; no code changes are
required—leave the comment and the function
Self::module_clear_dict(&self.sys_module.dict(), self) and
Self::module_clear_dict(&self.builtins.dict(), self) as-is in
finalize_clear_sys_builtins_dict.
---
Nitpick comments:
In `@crates/common/src/refcount.rs`:
- Around line 258-270: flush_deferred_drops currently drains DEFERRED_QUEUE into
a Vec and calls each op, but if any op() panics the remaining closures are
dropped and never run (leading to GC stale pointers); fix by making the drain
panic-safe: wrap each Box<dyn FnOnce()> into an RAII wrapper (e.g., struct
CallOnDrop(Option<Box<dyn FnOnce()>>) with a Drop impl that calls the closure if
still present), change the drain to map each closure into CallOnDrop(Some(f))
and collect that Vec, then iterate the Vec and take().call each wrapper to run
closures normally—any closures not yet taken will be invoked from
CallOnDrop::drop during panic unwinding, ensuring no deferred drop is silently
skipped.
- Around line 43-45: The strong() method performs an unnecessary division by
COUNT; update the strong function (fn strong(self) -> u32) to compute
((self.inner & STRONG) as u32) (or equivalent masking and casting) without
dividing by COUNT since COUNT == 1, and ensure the result is still cast to u32;
reference the symbols strong, self.inner, STRONG and COUNT when making the
change.
In `@crates/vm/src/object/core.rs`:
- Around line 434-452: Between releasing and re-acquiring the weakref stripe
lock (weakref_lock::lock) the code unconditionally treats the new weakref as
generic and overwrites self.generic, which can allow two generic nodes; after
re-acquiring the lock (before setting self.generic and self.head) re-check
whether a generic already exists (inspect self.generic/head via NonNull::new or
equivalent) and if one is present avoid marking this node as the generic
head—insert it as a normal/basic weakref (update pointers accordingly) or update
the existing generic pointer only if still None; ensure the check uses the same
pointer types/ordering as the current insertion logic (node_ptr,
WeakLink::pointers, self.head, self.generic) to preserve atomic consistency.
Actionable comments posted: 3
🧹 Nitpick comments (3)
crates/common/src/refcount.rs (2)
94-96: get() uses SeqCst while is_leaked() uses Acquire — inconsistent ordering for reads.
Both are simple flag/count reads with no dependent write. Acquire (or even Relaxed for a non-synchronising diagnostic read like get()) is sufficient and consistent with is_leaked().
♻️ Suggested change
  pub fn get(&self) -> usize {
-     State::from_raw(self.state.load(Ordering::SeqCst)).strong() as usize
+     State::from_raw(self.state.load(Ordering::Acquire)).strong() as usize
  }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/common/src/refcount.rs` around lines 94 - 96, The get() method currently uses Ordering::SeqCst for the atomic load while is_leaked() uses Ordering::Acquire, creating inconsistent read orderings; change the ordering in get() (the call to self.state.load and subsequent State::from_raw(...).strong()) to use the same weaker ordering as is_leaked() (e.g., Ordering::Acquire, or Ordering::Relaxed if this is purely a diagnostic read) so both reads are consistent.
43-45: / COUNT division in State::strong() is a no-op.
Since COUNT = 1, (self.inner & STRONG) / COUNT is always identical to self.inner & STRONG. The division adds no value and slightly obscures intent.
♻️ Suggested simplification
  fn strong(self) -> u32 {
-     ((self.inner & STRONG) / COUNT) as u32
+     (self.inner & STRONG) as u32
  }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/common/src/refcount.rs` around lines 43 - 45, The State::strong() method performs a redundant division by COUNT (which equals 1) making "((self.inner & STRONG) / COUNT) as u32" a no-op; simplify the expression by removing the "/ COUNT" so the method returns "(self.inner & STRONG) as u32" (referencing State::strong, the inner field, STRONG and COUNT) to make the intent clearer.
crates/vm/src/object/core.rs (1)
157-166: Deferred objects invoke drop_slow_inner twice.
When trashcan::begin returns false, drop_slow_inner (finalizer + weakref clearing) has already run. The second default_dealloc invocation, issued from trashcan::end's drain loop, calls drop_slow_inner again. The double-call is safe — gc_finalized() prevents __del__ re-entry and wrl.clear is idempotent on an already-cleared list — but it is redundant work.
Consider moving trashcan::begin ahead of drop_slow_inner so that deferred objects are queued before any finalization occurs, and the finalizer runs exactly once when the item is eventually processed from the queue.
Verify each finding against the current code and only fix it if needed. In `@crates/vm/src/object/core.rs` around lines 157 - 166, Move the trashcan check before running finalizers: call trashcan::begin(obj, default_dealloc::<T>) at the top of default_dealloc and if it returns false, return immediately (so the object is queued) instead of running drop_slow_inner; only when trashcan::begin returns true should you call drop_slow_inner and later call trashcan::end (so objects deferred by trashcan::begin are queued before any finalizer/weakref work and drop_slow_inner is executed exactly once when processed).
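The defer-before-finalize ordering suggested here can be sketched with a toy dealloc; `LIMIT`, `QUEUE`, and `FINALIZED` are illustrative, and `u32` ids stand in for object pointers. Because the trashcan is consulted before any finalization, a deferred id's "finalizer" runs exactly once, when the queue is drained:

```rust
use std::cell::{Cell, RefCell};

const LIMIT: usize = 2; // stand-in for TRASHCAN_LIMIT

thread_local! {
    static DEPTH: Cell<usize> = Cell::new(0);
    static QUEUE: RefCell<Vec<u32>> = RefCell::new(Vec::new());
    static FINALIZED: RefCell<Vec<u32>> = RefCell::new(Vec::new());
}

fn dealloc(id: u32) {
    // Consult the trashcan first: past the depth limit, queue untouched.
    let deferred = DEPTH.with(|d| {
        if d.get() >= LIMIT {
            QUEUE.with(|q| q.borrow_mut().push(id));
            true
        } else {
            d.set(d.get() + 1);
            false
        }
    });
    if deferred {
        return; // no finalization has happened yet
    }
    FINALIZED.with(|f| f.borrow_mut().push(id)); // finalizer runs once
    // end(): drop the depth; at depth 0, drain anything queued meanwhile.
    DEPTH.with(|d| d.set(d.get() - 1));
    if DEPTH.with(|d| d.get()) == 0 {
        while let Some(next) = QUEUE.with(|q| q.borrow_mut().pop()) {
            dealloc(next);
        }
    }
}
```

In the real code the depth grows through recursive drops rather than manual `set` calls; the point is only the ordering of the trashcan check relative to finalization.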
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/common/src/refcount.rs`:
- Around line 208-219: The Drop impl for DeferredDropGuard currently always
calls flush_deferred_drops() which can panic during unwinding; update
DeferredDropGuard::drop to check std::thread::panicking() after
restoring IN_DEFERRED_CONTEXT.with(|in_ctx| in_ctx.set(self.was_in_context));
only call flush_deferred_drops() when !self.was_in_context &&
!std::thread::panicking() so flush is skipped during unwinding; keep restoring
IN_DEFERRED_CONTEXT unconditionally and do not change other logic.
- Around line 262-270: flush_deferred_drops currently drains DEFERRED_QUEUE and
calls each op, but if any op panics the rest are lost and callers can invoke it
while IN_DEFERRED_CONTEXT is true; update flush_deferred_drops to first assert
(or debug_assert) that IN_DEFERRED_CONTEXT is false to prevent reentrancy, then
take the ops vec and execute each op inside std::panic::catch_unwind so a panic
from one op does not skip others, and if a panic occurs re-push any remaining
ops back into DEFERRED_QUEUE (using DEFERRED_QUEUE.with) before resuming the
panic unwind; reference flush_deferred_drops, DEFERRED_QUEUE,
IN_DEFERRED_CONTEXT and ensure any untrack_object semantics are preserved by
requeuing unexecuted closures.
In `@crates/vm/src/object/core.rs`:
- Around line 441-449: In the re-check path where you return a resurrected
generic (the block using is_generic, generic_ptr, generic and calling
generic.0.ref_count.safe_inc()), clear the weak's wr_object before the early
return so drop of the local weak won't call unlink_weakref on an uninserted
node; i.e., set the weak/inner wr_object to null (use the same atomic store
semantics as other places, e.g., Ordering::Relaxed) on generic.0 (the PyWeak
inner) immediately prior to returning PyRef::from_raw(generic_ptr).
---
Duplicate comments:
In `@crates/common/src/refcount.rs`:
- Around line 148-157: The dec() implementation unconditionally does
self.state.fetch_sub(COUNT, Ordering::SeqCst), which can underflow and clear the
LEAKED bit; change dec() (and the fetch_sub usage) to use a CAS loop that loads
State::from_raw(self.state.load(...)), checks state.leaked() and if leaked
returns false without performing the subtraction, otherwise attempts a
compare_exchange to subtract COUNT (or retry on failure); ensure the code
references the COUNT constant and uses State::from_raw and leaked() to detect
and skip mutation when LEAKED is set so the LEAKED bit is never cleared by
underflow.
In `@crates/vm/src/object/core.rs`:
- Around line 128-135: The current debug_assert in end() (inside
DEALLOC_DEPTH.with in pub(super) unsafe fn end) prevents usize underflow only in
debug builds; change it to a runtime check so release builds can't underflow:
read d.get() into a local, if it's zero call panic! with "trashcan::end called
without matching begin" (or use checked_sub(1).unwrap_or_else(|| panic!(...)))
before calling d.set and returning the decremented depth. Ensure the check and
decrement happen inside the same DEALLOC_DEPTH.with closure so you update the
same thread-local value.
---
Nitpick comments:
In `@crates/common/src/refcount.rs`:
- Around line 94-96: The get() method currently uses Ordering::SeqCst for the
atomic load while is_leaked() uses Ordering::Acquire, creating inconsistent read
orderings; change the ordering in get() (the call to self.state.load and
subsequent State::from_raw(...).strong()) to use the same weaker ordering as
is_leaked() (e.g., Ordering::Acquire, or Ordering::Relaxed if this is purely a
diagnostic read) so both reads are consistent.
- Around line 43-45: The State::strong() method performs a redundant division by
COUNT (which equals 1) making "((self.inner & STRONG) / COUNT) as u32" a no-op;
simplify the expression by removing the "/ COUNT" so the method returns
"(self.inner & STRONG) as u32" (referencing State::strong, the inner field,
STRONG and COUNT) to make the intent clearer.
In `@crates/vm/src/object/core.rs`:
- Around line 157-166: Move the trashcan check before running finalizers: call
trashcan::begin(obj, default_dealloc::<T>) at the top of default_dealloc and if
it returns false, return immediately (so the object is queued) instead of
running drop_slow_inner; only when trashcan::begin returns true should you call
drop_slow_inner and later call trashcan::end (so objects deferred by
trashcan::begin are queued before any finalizer/weakref work and drop_slow_inner
is executed exactly once when processed).
#[cfg(feature = "std")]
impl Drop for DeferredDropGuard {
    fn drop(&mut self) {
        IN_DEFERRED_CONTEXT.with(|in_ctx| {
            in_ctx.set(self.was_in_context);
        });
        // Only flush if we're the outermost context
        if !self.was_in_context {
            flush_deferred_drops();
        }
    }
}
flush_deferred_drops() called from drop() during unwinding risks double-panic → process abort.
If the closure passed to with_deferred_drops panics, DeferredDropGuard::drop runs during stack unwinding and unconditionally calls flush_deferred_drops(). Should any queued operation also panic, Rust's "panic while panicking" rule causes an unconditional process abort, losing all panic context.
🛡️ Proposed fix — skip flush during unwinding
impl Drop for DeferredDropGuard {
fn drop(&mut self) {
IN_DEFERRED_CONTEXT.with(|in_ctx| {
in_ctx.set(self.was_in_context);
});
// Only flush if we're the outermost context
- if !self.was_in_context {
+ if !self.was_in_context && !std::thread::panicking() {
flush_deferred_drops();
}
}
}

Skipping the flush during unwinding means deferred untrack operations will be lost on panic, which is acceptable given GC state is already inconsistent in a panic scenario.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@crates/common/src/refcount.rs` around lines 208 - 219, The Drop impl for
DeferredDropGuard currently always calls flush_deferred_drops() which can panic
during unwinding; update DeferredDropGuard::drop to check
std::panic::panicking() (or equivalent) after restoring
IN_DEFERRED_CONTEXT.with(|in_ctx| in_ctx.set(self.was_in_context)); only call
flush_deferred_drops() when !self.was_in_context && !std::panic::panicking() so
flush is skipped during unwinding; keep restoring IN_DEFERRED_CONTEXT
unconditionally and do not change other logic.
pub fn flush_deferred_drops() {
    DEFERRED_QUEUE.with(|q| {
        // Take all queued operations
        let ops: Vec<_> = q.borrow_mut().drain(..).collect();
        // Execute them outside the borrow
        for op in ops {
            op();
        }
    });
}
flush_deferred_drops silently skips remaining ops if one panics, and is unsafe to call from within an active deferred context.
Two concerns:
1. Incomplete flush on panic: If `op()` at index `i` panics, the remaining ops in the local `ops` vec are dropped (not called). Any `untrack_object()` calls they contained will be silently lost, leaving the GC tracking table in an inconsistent state for the duration of the thread's life.
2. Callable within a deferred context: Since `flush_deferred_drops` is `pub`, a caller can invoke it while `IN_DEFERRED_CONTEXT` is `true`, prematurely draining the queue and defeating the deferral guarantee. Consider a `debug_assert` guard.
🛡️ Suggested hardening
pub fn flush_deferred_drops() {
+ debug_assert!(
+ !IN_DEFERRED_CONTEXT.with(|c| c.get()),
+ "flush_deferred_drops called from within an active deferred context"
+ );
DEFERRED_QUEUE.with(|q| {
// Take all queued operations
let ops: Vec<_> = q.borrow_mut().drain(..).collect();
// Execute them outside the borrow
for op in ops {
op();
}
});
}

For the panic-completeness concern, consider iterating with a RAII sentinel that re-pushes un-executed ops back to the queue on drop, though this adds complexity that may not be warranted for a GC untrack path.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@crates/common/src/refcount.rs` around lines 262 - 270, flush_deferred_drops
currently drains DEFERRED_QUEUE and calls each op, but if any op panics the rest
are lost and callers can invoke it while IN_DEFERRED_CONTEXT is true; update
flush_deferred_drops to first assert (or debug_assert) that IN_DEFERRED_CONTEXT
is false to prevent reentrancy, then take the ops vec and execute each op inside
std::panic::catch_unwind so a panic from one op does not skip others, and if a
panic occurs re-push any remaining ops back into DEFERRED_QUEUE (using
DEFERRED_QUEUE.with) before resuming the panic unwind; reference
flush_deferred_drops, DEFERRED_QUEUE, IN_DEFERRED_CONTEXT and ensure any
untrack_object semantics are preserved by requeuing unexecuted closures.
if is_generic {
    let generic_ptr = self.generic.load(Ordering::Relaxed);
    if !generic_ptr.is_null() {
        let generic = unsafe { &*generic_ptr };
        if generic.0.ref_count.safe_inc() {
            return unsafe { PyRef::from_raw(generic_ptr) };
        }
    }
}
🧩 Analysis chain
🏁 Script executed:
# Find the linked_list module and inspect Pointers::new()
fd "linked_list" -t f -e rs
🏁 Script executed:
# Read the actual code at the specified location in core.rs
cat -n crates/vm/src/object/core.rs | sed -n '410,450p'
🏁 Script executed:
# Get more context on the WeakRefList::add method to understand the full flow
cat -n crates/vm/src/object/core.rs | sed -n '400,460p'
🏁 Script executed:
# Read the Pointers struct and new() implementation in linked_list.rs
cat -n crates/common/src/linked_list.rs | head -100
🏁 Script executed:
# Search for Pointers::new() definition
rg "impl.*Pointers|fn new\(\)" crates/common/src/linked_list.rs -A 5
🏁 Script executed:
# Search for unlink_weakref function in core.rs to verify the corruption path
rg "fn unlink_weakref" crates/vm/src/object/core.rs -A 20
Dropping the uninserted weak in the re-check path corrupts wrl.head and causes a subsequent use-after-free.
Rust drops locals in reverse declaration order. _lock (line 437) is declared after weak (line 434), so on the return at line 446 the drop sequence is:
1. `_lock` dropped → stripe lock released.
2. `weak` dropped → `PyWeak::drop_inner` runs → acquires the stripe lock (succeeds; `_lock` already gone) → calls `unlink_weakref(wrl, node_ptr)`.
weak was never inserted into the list, so its Pointers fields are in their Pointers::new() initial state (both prev and next are None). unlink_weakref interprets prev == None as "this node is the head" and executes:
wrl.head.store(ptr::null_mut(), Ordering::Relaxed); // ← corrupts the list

This wipes wrl.head to null while the real list nodes are still alive. When obj is eventually deallocated, wrl.clear iterates from head == null and skips every existing weakref. Those weakrefs retain a non-null wr_object pointing to the now-freed obj memory. Their eventual drop_inner then acquires a stripe lock based on a dangling pointer and calls unlink_weakref on already-freed memory → use-after-free.
Fix: before the early return, null out wr_object so drop_inner exits at its first guard check:
🛡️ Proposed fix
if is_generic {
let generic_ptr = self.generic.load(Ordering::Relaxed);
if !generic_ptr.is_null() {
let generic = unsafe { &*generic_ptr };
if generic.0.ref_count.safe_inc() {
+ // `weak` was never inserted; nullify wr_object so drop_inner
+ // returns early and does not call unlink_weakref.
+ weak.wr_object.store(ptr::null_mut(), Ordering::Relaxed);
return unsafe { PyRef::from_raw(generic_ptr) };
}
}
}

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@crates/vm/src/object/core.rs` around lines 441 - 449, In the re-check path
where you return a resurrected generic (the block using is_generic, generic_ptr,
generic and calling generic.0.ref_count.safe_inc()), clear the weak's wr_object
before the early return so drop of the local weak won't call unlink_weakref on
an uninserted node; i.e., set the weak/inner wr_object to null (use the same
atomic store semantics as other places, e.g., Ordering::Relaxed) on generic.0
(the PyWeak inner) immediately prior to returning PyRef::from_raw(generic_ptr).
…n debug_assert, weakref generic re-check
📦 Library Dependencies

The following Lib/ modules were modified. Here are their dependencies and dependent tests:

- [ ] lib: cpython/Lib/asyncio, dependent tests: (7 tests)
- [ ] test: cpython/Lib/test/test_descr.py (TODO: 47), dependent tests: (no tests depend on descr)
- [x] test: cpython/Lib/test/test_faulthandler.py (TODO: 4), dependent tests: (1 tests)
- [x] lib: cpython/Lib/functools.py, dependent tests: (39 tests)
- [x] lib: cpython/Lib/inspect.py, dependent tests: (45 tests)
- [x] lib: cpython/Lib/io.py, dependent tests: (89 tests)
- [ ] lib: cpython/Lib/logging, dependent tests: (10 tests)
- [ ] lib: cpython/Lib/collections, dependent tests: (196 tests)
- [ ] lib: cpython/Lib/subprocess.py, dependent tests: (53 tests)
- [x] lib: cpython/Lib/threading.py, dependent tests: (113 tests)
- [x] lib: cpython/Lib/traceback.py, dependent tests: (35 tests)
- [x] lib: cpython/Lib/warnings.py, dependent tests: (49 tests)
- [x] lib: cpython/Lib/weakref.py, dependent tests: (158 tests)

Legend:
separated from #6849
Summary by CodeRabbit
New Features
Bug Fixes
Chores