[Stdlib] Skip redundant GIL acquires on Python->Mojo FFI hot path#6523

Draft
msaelices wants to merge 4 commits into modular:main from msaelices:mojo-from-python-ffi-optimizations

Contributor

@msaelices msaelices commented May 11, 2026

Linked issue

Related to #6521. This PR addresses the lowest-hanging-fruit portion of that issue: redundant GIL bookkeeping on the Python -> Mojo trampoline. The bigger structural wins (METH_FASTCALL, borrowed-PythonObject) are noted as follow-ups below.

Motivation

Issue #6521 reports that calling a trivial Mojo function from Python costs ~150-170 ns/call vs PyO3's ~17-24 ns/call. Reading the dispatch path and instrumenting it with std.benchmark shows the per-noop overhead breaks down roughly like this on Mojo 1.0.0b2.dev2026051106 (bazel toolchain) + CPython 3.12, single core via taskset -c 2:

  Component                                      ns/call
  -------------------------------------------    ------------------------
  stdlib _dispatch[is_method] chain              ~76
  Outer GILAcquired in _py_c_function_wrapper    ~17
  PythonObject.__del__ with GILAcquired          ~42 (this PR shaves ~12)
  PythonObject.from_borrowed incref/wrap         ~20
  Tuple build by CPython (METH_VARARGS)          ~30-50

The two GIL-related rows together account for ~30 ns of redundant work on every call (~17 ns for the outer acquire plus the ~12 ns this PR removes from the destructors). The work is redundant because the GIL is already held by definition when CPython dispatches our METH_VARARGS trampoline, and the same applies to the destructor that runs as part of that trampoline.

What changed

Commit 1: Skip redundant GIL acquire in PythonObject.__del__

PythonObject.__del__ unconditionally wrapped its Py_DecRef in a PyGILState_Ensure / PyGILState_Release pair so it could run safely even on a thread that doesn't hold the GIL. In the common case the GIL is already held (we're inside a Python -> Mojo trampoline, or inside a with Python() block); the pair is then just two extra C calls into CPython.

  Primitive (cached fn ptr)                        ns / call
  ---------------------------------------------    ---------
  PyGILState_Ensure + PyGILState_Release (pair)    14.5
  PyGILState_Check                                 8.3

So ~6 ns saved per __del__. The FFI trampoline destroys 2 PythonObjects per call (self and the args tuple), so this is ~12 ns/call of saved overhead on the hot path.
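
The check-then-fall-back idea described above can be sketched from Python through `ctypes.pythonapi`. This is only an illustration of the pattern, not the Mojo stdlib code: `run_with_gil` is a hypothetical helper name, and because `ctypes.pythonapi` always calls with the GIL held, the fallback branch is shown for shape only and is not exercised here.

```python
import ctypes

# ctypes.pythonapi is a PyDLL handle: calls through it keep the GIL held.
api = ctypes.pythonapi

api.PyGILState_Check.restype = ctypes.c_int
api.PyGILState_Check.argtypes = []
api.PyGILState_Ensure.restype = ctypes.c_int  # PyGILState_STATE enum value
api.PyGILState_Ensure.argtypes = []
api.PyGILState_Release.restype = None
api.PyGILState_Release.argtypes = [ctypes.c_int]


def run_with_gil(action):
    """Run action(), acquiring the GIL only if this thread lacks it.

    Hypothetical helper: returns (path, result) so the caller can see
    which branch was taken.
    """
    if api.PyGILState_Check():
        # Fast path: one C call, no Ensure/Release bookkeeping.
        return "fast", action()
    # Slow path: the thread does not hold the GIL, so take it explicitly.
    state = api.PyGILState_Ensure()
    try:
        return "slow", action()
    finally:
        api.PyGILState_Release(state)


path, value = run_with_gil(lambda: 41 + 1)
print(path, value)
```

In the destructor case, `action` would stand in for the `Py_DecRef` call; the saving comes from replacing the unconditional Ensure/Release pair with the single check on the common path.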

Commit 2: Drop outer GILAcquired from _py_c_function_wrapper

When CPython dispatches via METH_VARARGS / METH_VARARGS | METH_KEYWORDS, the calling thread already holds the GIL. PyO3, pybind11, and nanobind all rely on this invariant and do not re-acquire on entry. The Mojo wrapper, by contrast, opened an explicit GILAcquired block on every call.

Decomposed A/B bench, all running against the current stdlib (so this is apples-to-apples for the wrapper variants):

  Variant (return arg unchanged, like def noop(x): return x)    ns / call
  ----------------------------------------------------------    ---------
  A: stdlib def_function (current path)                         198.9
  B: hand-rolled wrapper, PythonObject wraps + outer GIL        123.4
  C: hand-rolled wrapper, PythonObject wraps, no outer GIL      106.4
  D: raw, no PythonObject wraps, no GIL (lower bound)           63.8
  (Py lambda baseline)                                          94.6

B - C = 17 ns is the direct measurement of this commit's saving.
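
For readers who want to reproduce per-call figures of this kind, a minimal `timeit` harness could look like the sketch below. It is an assumption about the measurement shape, not the script used for the table: `ns_per_call` is a hypothetical helper, and a plain Python lambda stands in for the Mojo-exported function (it corresponds to the "Py lambda baseline" row).

```python
import timeit


def ns_per_call(fn, arg, number=200_000, repeat=5):
    """Best-of-`repeat` time per call of fn(arg), in nanoseconds."""
    # Passing fn and arg through `globals` avoids an extra wrapper call
    # inside the timed statement.
    timer = timeit.Timer("fn(arg)", globals={"fn": fn, "arg": arg})
    best = min(timer.repeat(repeat=repeat, number=number))
    return best / number * 1e9


# Stand-in for the function under test; a Mojo-exported noop would be
# imported from its extension module and passed in the same way.
noop = lambda x: x

print(f"{ns_per_call(noop, 1):.1f} ns/call")
```

Taking the minimum over several repeats reduces scheduler noise; pinning the process to one core (for example with `taskset -c 2`, as in the measurements above) helps for the same reason.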

Combined effect

With both commits applied, the FFI trampoline shaves ~29 ns/call:

  • Outer GIL removal (commit 2): 17 ns (measured directly)
  • Conditional GIL in __del__ (commit 1): ~12 ns (calculated from the cached-pointer primitive measurements: ~6 ns saved per destructor, and the destructor runs twice per FFI call)

So for the noop(x) shape in issue #6521, expected improvement is ~170 → ~140 ns/call (~17%). For add(a, b) the relative improvement is smaller because the per-arg Int(py=a) conversion dominates the remaining overhead.
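
The arithmetic behind these figures can be checked with a few lines of Python. The constants are the measurements quoted in this description; the variable names are only illustrative, and the unrounded results differ slightly from the rounded ~12 / ~29 / ~140 used in the text.

```python
# Measurements quoted in this PR description (ns).
ENSURE_RELEASE_PAIR_NS = 14.5   # PyGILState_Ensure + PyGILState_Release
CHECK_NS = 8.3                  # PyGILState_Check
DESTRUCTORS_PER_CALL = 2        # self and the args tuple
OUTER_GIL_NS = 17.0             # B - C from the A/B bench
BASELINE_NS = 170.0             # noop(x) figure from issue #6521

# Commit 1: saving per destructor, applied to both destroyed objects.
del_saving = (ENSURE_RELEASE_PAIR_NS - CHECK_NS) * DESTRUCTORS_PER_CALL

# Both commits together, then relative to the baseline.
total_saving = OUTER_GIL_NS + del_saving
expected_after = BASELINE_NS - total_saving
relative = total_saving / BASELINE_NS

print(f"__del__ saving: {del_saving:.1f} ns/call")
print(f"total saving:   {total_saving:.1f} ns/call")
print(f"expected after: {expected_after:.1f} ns/call")
print(f"relative:       {relative:.1%}")
```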

msaelices added 2 commits May 12, 2026 00:22
BEGIN_PUBLIC
[Stdlib] Skip redundant GIL acquire in `PythonObject.__del__`

`PythonObject.__del__` unconditionally wrapped its `Py_DecRef` in a
`PyGILState_Ensure` / `PyGILState_Release` pair so the destructor could
run safely even when the calling thread doesn't hold the GIL. In the
common case the GIL *is* already held (the destructor runs inside a
Python -> Mojo trampoline, or inside a `with Python()` block); the
acquire/release pair is then just two extra C calls into CPython.

Microbenchmarks (CPython 3.12, single core via `taskset -c 2`):

  PyGILState_Ensure+Release pair, cached fn ptrs:  ~14.5 ns / pair
  PyGILState_Check, cached fn ptr:                  ~8.3 ns / call

So a single `PyGILState_Check` saves ~6 ns vs an `Ensure+Release` pair.
For the FFI dispatch path of issue modular#6521 (Python -> Mojo function call
overhead) we destroy two `PythonObject`s per call (the `self` arg and
the args tuple), so this saves ~12 ns / call on the hot trampoline.

Expose `CPython.PyGILState_Check` and let `__del__` test the GIL once
and fall back to `GILAcquired` only when the GIL really isn't held.
END_PUBLIC

Assisted-by: AI
Signed-off-by: Manuel Saelices <[email protected]>
BEGIN_PUBLIC
[Stdlib] Drop the outer `GILAcquired` from `_py_c_function_wrapper`

When CPython dispatches a `PyMethodDef` entry via `METH_VARARGS` /
`METH_VARARGS | METH_KEYWORDS`, the calling thread already holds the
GIL. PyO3, pybind11, and nanobind all rely on this invariant and don't
acquire the GIL again on entry. The Mojo binding wrapper, by contrast,
wrapped the user-supplied function in an explicit `GILAcquired`
context manager, which on every call hit `PyGILState_Ensure` (state
bookkeeping in CPython) and `PyGILState_Release` on return.

Microbenchmark on the same machine as issue modular#6521 (CPython 3.12,
`taskset -c 2`), running a Python timeit loop that calls a Mojo
function via a `PythonModuleBuilder`-registered C trampoline:

  variant                                    ns / call
  ---------------------------------------    ---------
  inline pywrap + GIL                              123
  inline pywrap, no outer GIL  (this patch)        106
  ---------------------------------------    ---------
                                       delta:       17 ns

So removing the outer pair saves ~17 ns / call (~8.5% of the current
overhead). Combined with the conditional GIL handling in
`PythonObject.__del__` (companion commit), the total saving on a
single-arg trampoline is ~29 ns / call.
END_PUBLIC

Assisted-by: AI
Signed-off-by: Manuel Saelices <[email protected]>
Contributor Author

msaelices commented May 12, 2026

@JoeLoser's stack covers items 2 + 3 from his decomposition (the changes in this PR), plus items 4 + 5 which I had identified but deferred. His after-numbers (≈22 ns noop / ≈42 ns add on M4) are dramatically better. Yielding, will close this PR once his stack lands publicly. Pivoting to item 6 (typed-arg fast paths) which he explicitly hasn't looked at.


Labels

mojo-stdlib Tag for issues related to standard library waiting-on-review
