[Stdlib] Skip redundant GIL acquires on Python->Mojo FFI hot path #6523
Draft
msaelices wants to merge 4 commits into
BEGIN_PUBLIC
[Stdlib] Skip redundant GIL acquire in `PythonObject.__del__`

`PythonObject.__del__` unconditionally wrapped its `Py_DecRef` in a `PyGILState_Ensure` / `PyGILState_Release` pair so the destructor could run safely even when the calling thread doesn't hold the GIL. In the common case the GIL *is* already held (the destructor runs inside a Python -> Mojo trampoline, or inside a `with Python()` block); the acquire/release pair is then just two extra C calls into CPython.

Microbenchmarks (CPython 3.12, single core via `taskset -c 2`):

    PyGILState_Ensure+Release pair, cached fn ptrs:  ~14.5 ns / pair
    PyGILState_Check, cached fn ptr:                 ~8.3 ns / call

So a single `PyGILState_Check` saves ~6 ns vs an `Ensure+Release` pair. For the FFI dispatch path of issue modular#6521 (Python -> Mojo function call overhead) we destroy two `PythonObject`s per call (the `self` arg and the args tuple), so this saves ~12 ns / call on the hot trampoline.

Expose `CPython.PyGILState_Check` and let `__del__` test the GIL once and fall back to `GILAcquired` only when the GIL really isn't held.
END_PUBLIC

Assisted-by: AI
Signed-off-by: Manuel Saelices <[email protected]>
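The check-then-fallback shape described in this commit can be sketched from Python with `ctypes.pythonapi`, which exposes the same CPython entry points. This is an illustration only, not the Mojo stdlib code: `decref_checked` is a hypothetical helper name, and because `ctypes.pythonapi` always calls with the GIL held, the fallback branch is shown for shape rather than exercised.

```python
import ctypes
import sys

api = ctypes.pythonapi  # PyDLL handle: calls are made with the GIL held

# Declare signatures for the CPython functions used below.
api.PyGILState_Check.argtypes = []
api.PyGILState_Check.restype = ctypes.c_int
api.PyGILState_Ensure.argtypes = []
api.PyGILState_Ensure.restype = ctypes.c_int  # PyGILState_STATE is a C enum
api.PyGILState_Release.argtypes = [ctypes.c_int]
api.PyGILState_Release.restype = None
api.Py_IncRef.argtypes = [ctypes.py_object]
api.Py_IncRef.restype = None
api.Py_DecRef.argtypes = [ctypes.py_object]
api.Py_DecRef.restype = None


def decref_checked(obj):
    """Hypothetical helper: drop one reference, acquiring the GIL only if needed."""
    if api.PyGILState_Check():
        # Common case: the GIL is already held, so one check replaces the pair.
        api.Py_DecRef(obj)
        return "fast"
    # Rare case: no GIL on this thread, so fall back to Ensure/Release.
    state = api.PyGILState_Ensure()
    try:
        api.Py_DecRef(obj)
    finally:
        api.PyGILState_Release(state)
    return "fallback"


obj = object()
api.Py_IncRef(obj)  # take an extra reference so the decref below is balanced
before = sys.getrefcount(obj)
path = decref_checked(obj)
after = sys.getrefcount(obj)
print(path, before - after)
```

The extra `Py_IncRef` keeps the object alive after the helper drops a reference; the reference-count difference confirms that exactly one reference was released.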
BEGIN_PUBLIC
[Stdlib] Drop the outer `GILAcquired` from `_py_c_function_wrapper`

When CPython dispatches a `PyMethodDef` entry via `METH_VARARGS` / `METH_VARARGS | METH_KEYWORDS`, the calling thread already holds the GIL. PyO3, pybind11, and nanobind all rely on this invariant and don't acquire the GIL again on entry. The Mojo binding wrapper, by contrast, wrapped the user-supplied function in an explicit `GILAcquired` context manager, which on every call hit `PyGILState_Ensure` (state bookkeeping in CPython) and `PyGILState_Release` on return.

Microbenchmark on the same machine as issue modular#6521 (CPython 3.12, `taskset -c 2`), running a Python timeit loop that calls a Mojo function via a `PythonModuleBuilder`-registered C trampoline:

    variant                                     ns / call
    ----------------------------------------    ---------
    inline pywrap + GIL                               123
    inline pywrap, no outer GIL (this patch)          106
    ----------------------------------------    ---------
    delta:                                          17 ns

So removing the outer pair saves ~17 ns / call (~8.5% of the current overhead). Combined with the conditional GIL handling in `PythonObject.__del__` (companion commit), the total saving on a single-arg trampoline is ~29 ns / call.
END_PUBLIC

Assisted-by: AI
Signed-off-by: Manuel Saelices <[email protected]>
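A per-call figure like the ones quoted in this commit message can be obtained with a `timeit` loop of roughly the following shape. This is a minimal sketch under assumptions, not the harness used for the PR: `noop` is a pure-Python stand-in for the Mojo-exported function (the real benchmark would call the function from the `PythonModuleBuilder`-built module), and `ns_per_call` is a hypothetical helper name.

```python
import timeit


def noop(x):
    # Pure-Python stand-in; swap in the function exported from the Mojo module.
    return x


def ns_per_call(fn, arg, number=200_000, repeat=5):
    """Best-of-`repeat` wall time per call, in nanoseconds."""
    # Passing the names through `globals` avoids an extra lambda call per iteration.
    timer = timeit.Timer("fn(arg)", globals={"fn": fn, "arg": arg})
    best = min(timer.repeat(repeat=repeat, number=number))
    return best / number * 1e9


if __name__ == "__main__":
    print(f"{ns_per_call(noop, 1):.1f} ns / call")
```

Taking the minimum over several repeats reduces scheduler noise; pinning the process to one core (the PR uses `taskset -c 2`) reduces it further. Only differences between variants measured the same way are meaningful, since the loop itself contributes overhead.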
@JoeLoser's stack covers items 2 + 3 from his decomposition (the changes in this PR), plus items 4 + 5 which I had identified but deferred. His after-numbers (≈22 ns noop / ≈42 ns add on M4) are dramatically better. Yielding, will close this PR once his stack lands publicly. Pivoting to item 6 (typed-arg fast paths) which he explicitly hasn't looked at.
This was referenced May 12, 2026
Linked issue
Related to #6521. This PR addresses the lowest-hanging-fruit portion of that issue: redundant GIL bookkeeping on the Python -> Mojo trampoline. The bigger structural wins (METH_FASTCALL, borrowed-PythonObject) are noted as follow-ups below.
Motivation
Issue #6521 reports that calling a trivial Mojo function from Python costs ~150-170 ns/call vs PyO3's ~17-24 ns/call. Reading the dispatch path and instrumenting it with `std.benchmark` shows the per-noop overhead breaks down roughly across these components on Mojo 1.0.0b2.dev2026051106 (bazel toolchain) + CPython 3.12, single core via `taskset -c 2`:

- `_dispatch[is_method]` chain
- `GILAcquired` in `_py_c_function_wrapper`
- `PythonObject.__del__` with `GILAcquired`
- `PythonObject.from_borrowed` incref/wrap

The two GIL-related entries together account for ~30 ns of work that the wrapper does on every call. That work is redundant: the GIL is already held by definition when CPython dispatches our `METH_VARARGS` trampoline, and the same applies to the destructor that runs as part of that trampoline.

What changed
Commit 1: Skip redundant GIL acquire in `PythonObject.__del__`

`PythonObject.__del__` unconditionally wrapped its `Py_DecRef` in a `PyGILState_Ensure` / `PyGILState_Release` pair so it could run safely even on a thread that doesn't hold the GIL. In the common case the GIL is already held (we're inside a Python -> Mojo trampoline, or inside a `with Python()` block); the pair is then just two extra C calls into CPython.

- `PyGILState_Ensure` + `PyGILState_Release` (pair): ~14.5 ns
- `PyGILState_Check`: ~8.3 ns

So ~6 ns saved per `__del__` whenever the GIL is already held. The FFI trampoline destroys 2 `PythonObject`s per call (`self` and the args tuple), so this is ~12 ns/call of saved overhead on the hot path.

Commit 2: Drop outer `GILAcquired` from `_py_c_function_wrapper`

When CPython dispatches via
`METH_VARARGS` / `METH_VARARGS | METH_KEYWORDS`, the calling thread already holds the GIL. PyO3, pybind11, and nanobind all rely on this invariant and do not re-acquire on entry. The Mojo wrapper, by contrast, opened an explicit `GILAcquired` block on every call.

Decomposed A/B bench, all running against the current stdlib (so this is apples-to-apples for the wrapper variants):

- `def noop(x): return x`
- `def_function` (current path)

`B - C = 17 ns` is the direct measurement of this commit's saving (123 ns/call with the outer GIL pair vs 106 ns/call without it, as recorded in the commit message).

Combined effect
With both commits applied, the FFI trampoline saves ~29 ns/call:

- `__del__` (commit 1): ~12 ns (calculated from cached-pointer primitive measurements; the second-order effect, fewer C calls per destructor, applies twice per FFI call)
- `_py_c_function_wrapper` (commit 2): ~17 ns (measured directly as `B - C`)

So for the `noop(x)` shape in issue #6521, expected improvement is ~170 → ~140 ns/call (~17%). For `add(a, b)` the relative improvement is smaller because the per-arg `Int(py=a)` conversion dominates the remaining overhead.
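The combined estimate can be re-derived from the figures quoted in this PR. The sketch below uses only those numbers; taking 170 ns as the baseline (the upper end of the range reported in the issue) is an assumption made for the calculation, and the variable names are illustrative.

```python
# Figures quoted in this PR, all in nanoseconds.
ensure_release_pair = 14.5      # PyGILState_Ensure + PyGILState_Release
gil_check = 8.3                 # single PyGILState_Check
objects_destroyed_per_call = 2  # `self` and the args tuple
with_outer_gil = 123            # inline pywrap + GIL
without_outer_gil = 106         # inline pywrap, no outer GIL
baseline = 170                  # assumed: upper end of the issue's ~150-170 ns range

# Commit 1: one check replaces one Ensure/Release pair per destroyed object.
del_saving = (ensure_release_pair - gil_check) * objects_destroyed_per_call

# Commit 2: measured directly as the difference between the two variants.
wrapper_saving = with_outer_gil - without_outer_gil

total_saving = del_saving + wrapper_saving
after = baseline - total_saving
relative = total_saving / baseline * 100

print(f"commit 1: ~{del_saving:.1f} ns, commit 2: {wrapper_saving} ns")
print(f"total: ~{total_saving:.1f} ns, {baseline} -> ~{after:.0f} ns (~{relative:.0f}%)")
```

The unrounded inputs give slightly more than the rounded ~12 ns and ~29 ns quoted above, which is why the projected figure lands near, rather than exactly on, ~140 ns/call.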