[Stdlib] Skip redundant GIL acquires on Python->Mojo FFI hot path by msaelices · Pull Request #6523 · modular/modular

msaelices · 2026-05-11T22:24:31Z

Linked issue

Related to #6521. This PR addresses the lowest-hanging-fruit portion of that issue: redundant GIL bookkeeping on the Python -> Mojo trampoline. The bigger structural wins (METH_FASTCALL, borrowed-PythonObject) are noted as follow-ups below.

Motivation

Issue #6521 reports that calling a trivial Mojo function from Python costs ~150-170 ns/call vs PyO3's ~17-24 ns/call. Reading the dispatch path and instrumenting it with std.benchmark shows the per-noop overhead breaks down roughly like this on Mojo 1.0.0b2.dev2026051106 (bazel toolchain) + CPython 3.12, single core via taskset -c 2:

| Component | ns/call |
| --- | --- |
| stdlib `_dispatch[is_method]` chain | ~76 |
| Outer `GILAcquired` in `_py_c_function_wrapper` | ~17 |
| 2× `PythonObject.__del__` with `GILAcquired` | ~42 (this PR shaves ~12) |
| `PythonObject.from_borrowed` incref/wrap | ~20 |
| Tuple build by CPython (METH_VARARGS) | ~30-50 |

The two GIL-related rows together contain ~30 ns of redundant work that the wrapper does on every call (~17 ns for the outer acquire plus the ~12 ns this PR shaves from the destructors). The work is redundant because the GIL is already held by definition when CPython dispatches our `METH_VARARGS` trampoline, and the same applies to the destructor that runs as part of that trampoline.

What changed

Commit 1: Skip redundant GIL acquire in `PythonObject.__del__`

`PythonObject.__del__` unconditionally wrapped its `Py_DecRef` in a `PyGILState_Ensure` / `PyGILState_Release` pair so it could run safely even on a thread that doesn't hold the GIL. In the common case the GIL is already held (we're inside a Python -> Mojo trampoline, or inside a `with Python()` block); the pair is then just two extra C calls into CPython.

| Primitive (cached fn ptr) | ns / call |
| --- | --- |
| `PyGILState_Ensure` + `PyGILState_Release` (pair) | 14.5 |
| `PyGILState_Check` | 8.3 |

So replacing the pair with a single `PyGILState_Check` saves ~6 ns per `__del__` when the GIL is already held. The FFI trampoline destroys 2 `PythonObject`s per call (`self` and the args tuple), so this is ~12 ns/call of saved overhead on the hot path.
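The check-then-fall-back pattern can be sketched from Python itself using `ctypes.pythonapi`. This is only an illustration of the control flow, not the Mojo stdlib code: `release_reference` and its `decref` callback are hypothetical stand-ins for the destructor body and `Py_DecRef`. Because `ctypes.pythonapi` calls always run with the GIL held, this sketch is expected to take the fast path on CPython.

```python
import ctypes

# ctypes.pythonapi is a PyDLL handle, so these calls are made with the GIL held.
_api = ctypes.pythonapi
_api.PyGILState_Check.restype = ctypes.c_int
_api.PyGILState_Check.argtypes = []
_api.PyGILState_Ensure.restype = ctypes.c_int  # PyGILState_STATE enum value
_api.PyGILState_Ensure.argtypes = []
_api.PyGILState_Release.restype = None
_api.PyGILState_Release.argtypes = [ctypes.c_int]


def release_reference(decref):
    """Hypothetical stand-in for a destructor body; `decref` plays the role of Py_DecRef."""
    if _api.PyGILState_Check():
        # GIL already held: one cheap check, no Ensure/Release bookkeeping.
        decref()
        return "fast path"
    # GIL not held: fall back to the full acquire/release pair.
    state = _api.PyGILState_Ensure()
    try:
        decref()
    finally:
        _api.PyGILState_Release(state)
    return "slow path"


calls = []
path = release_reference(lambda: calls.append("decref"))
print(path, calls)
```

The design trade-off is that the fast path pays for one `PyGILState_Check` instead of the pair, while the slow path now pays for the check plus the pair, which is acceptable only if the GIL-held case dominates.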

Commit 2: Drop outer `GILAcquired` from `_py_c_function_wrapper`

When CPython dispatches via `METH_VARARGS` / `METH_VARARGS | METH_KEYWORDS`, the calling thread already holds the GIL. PyO3, pybind11, and nanobind all rely on this invariant and do not re-acquire on entry. The Mojo wrapper, by contrast, opened an explicit `GILAcquired` block on every call.

Decomposed A/B bench, all running against the current stdlib (so this is apples-to-apples for the wrapper variants):

| Variant (return arg unchanged, like `def noop(x): return x`) | ns / call |
| --- | --- |
| A: stdlib `def_function` (current path) | 198.9 |
| B: hand-rolled wrapper, PythonObject wraps + outer GIL | 123.4 |
| C: hand-rolled wrapper, PythonObject wraps, no outer GIL | 106.4 |
| D: raw, no PythonObject wraps, no GIL (lower bound) | 63.8 |
| (Py lambda baseline) | 94.6 |

B - C = 17 ns is the direct measurement of this commit's saving.
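For readers who want to reproduce per-call figures of this kind, here is a minimal sketch of a timing harness, assuming Python's standard `timeit` module. The helper name `ns_per_call` is hypothetical, and a pure-Python lambda stands in for the Mojo-registered function, so this only reproduces something like the "Py lambda baseline" row; the PR's actual benchmark script is not shown in this page.

```python
import timeit


def ns_per_call(fn, arg, number=200_000, repeat=5):
    """Return the best-of-`repeat` cost of `fn(arg)` in nanoseconds per call.

    The figure includes timeit's own loop overhead, so it is an upper bound
    on the cost of the call itself.
    """
    timer = timeit.Timer("fn(arg)", globals={"fn": fn, "arg": arg})
    best_seconds = min(timer.repeat(repeat=repeat, number=number))
    return best_seconds / number * 1e9


# Stand-in for the extension function under test (assumption: the real bench
# would import the Mojo-built module and pass its `noop` here instead).
py_noop = lambda x: x

baseline_ns = ns_per_call(py_noop, 42)
print(f"Py lambda baseline: {baseline_ns:.1f} ns/call")
```

Taking the minimum over several repeats reduces scheduler noise; pinning the process to one core, as the PR does with `taskset -c 2`, helps for the same reason.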

Combined effect

With both commits applied, the FFI trampoline gets ~29 ns/call faster:

- Outer GIL removal (commit 2): 17 ns (measured directly)
- Conditional GIL in `__del__` (commit 1): ~12 ns (calculated from the cached-pointer primitive measurements; the per-destructor saving, fewer C calls, applies twice per FFI call)

So for the `noop(x)` shape in issue #6521, expected improvement is ~170 → ~140 ns/call (~17%). For `add(a, b)` the relative improvement is smaller because the per-arg `Int(py=a)` conversion dominates the remaining overhead.
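The arithmetic behind these estimates can be checked with a few lines of Python. All inputs are the figures quoted in this PR; the 170 ns starting point is the upper end of the range reported in issue #6521, and the variable names are illustrative only.

```python
# Measured primitives (ns), taken from the tables in this PR.
ENSURE_RELEASE_PAIR_NS = 14.5
GIL_CHECK_NS = 8.3
DESTRUCTORS_PER_CALL = 2  # `self` and the args tuple

# Wrapper variants B (outer GIL) and C (no outer GIL), in ns/call.
VARIANT_B_NS = 123.4
VARIANT_C_NS = 106.4

# Upper end of the per-call overhead reported in issue #6521.
BASELINE_NS = 170.0

per_del_saving_ns = ENSURE_RELEASE_PAIR_NS - GIL_CHECK_NS
del_saving_ns = DESTRUCTORS_PER_CALL * per_del_saving_ns
outer_gil_saving_ns = VARIANT_B_NS - VARIANT_C_NS
total_saving_ns = del_saving_ns + outer_gil_saving_ns
after_ns = BASELINE_NS - total_saving_ns
relative_pct = 100.0 * total_saving_ns / BASELINE_NS

print(f"per __del__:  {per_del_saving_ns:.1f} ns")    # 6.2
print(f"destructors:  {del_saving_ns:.1f} ns/call")   # 12.4
print(f"outer GIL:    {outer_gil_saving_ns:.1f} ns/call")  # 17.0
print(f"total:        {total_saving_ns:.1f} ns/call")  # 29.4
print(f"expected:     {BASELINE_NS:.0f} -> {after_ns:.1f} ns/call ({relative_pct:.1f}%)")
```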

BEGIN_PUBLIC
[Stdlib] Skip redundant GIL acquire in `PythonObject.__del__`

`PythonObject.__del__` unconditionally wrapped its `Py_DecRef` in a `PyGILState_Ensure` / `PyGILState_Release` pair so the destructor could run safely even when the calling thread doesn't hold the GIL. In the common case the GIL *is* already held (the destructor runs inside a Python -> Mojo trampoline, or inside a `with Python()` block); the acquire/release pair is then just two extra C calls into CPython.

Microbenchmarks (CPython 3.12, single core via `taskset -c 2`):

  PyGILState_Ensure+Release pair, cached fn ptrs: ~14.5 ns / pair
  PyGILState_Check, cached fn ptr: ~8.3 ns / call

So a single `PyGILState_Check` saves ~6 ns vs an `Ensure+Release` pair. For the FFI dispatch path of issue modular#6521 (Python -> Mojo function call overhead) we destroy two `PythonObject`s per call (the `self` arg and the args tuple), so this saves ~12 ns / call on the hot trampoline.

Expose `CPython.PyGILState_Check` and let `__del__` test the GIL once and fall back to `GILAcquired` only when the GIL really isn't held.
END_PUBLIC

Assisted-by: AI
Signed-off-by: Manuel Saelices <[email protected]>

BEGIN_PUBLIC
[Stdlib] Drop the outer `GILAcquired` from `_py_c_function_wrapper`

When CPython dispatches a `PyMethodDef` entry via `METH_VARARGS` / `METH_VARARGS | METH_KEYWORDS`, the calling thread already holds the GIL. PyO3, pybind11, and nanobind all rely on this invariant and don't acquire the GIL again on entry. The Mojo binding wrapper, by contrast, wrapped the user-supplied function in an explicit `GILAcquired` context manager, which on every call hit `PyGILState_Ensure` (state bookkeeping in CPython) and `PyGILState_Release` on return.

Microbenchmark on the same machine as issue modular#6521 (CPython 3.12, `taskset -c 2`), running a Python timeit loop that calls a Mojo function via a `PythonModuleBuilder`-registered C trampoline:

  variant                                    ns / call
  ---------------------------------------    ---------
  inline pywrap + GIL                        123
  inline pywrap, no outer GIL (this patch)   106
  ---------------------------------------    ---------
  delta:                                     17 ns

So removing the outer pair saves ~17 ns / call (~8.5% of the current overhead). Combined with the conditional GIL handling in `PythonObject.__del__` (companion commit), the total saving on a single-arg trampoline is ~29 ns / call.
END_PUBLIC

Assisted-by: AI
Signed-off-by: Manuel Saelices <[email protected]>

msaelices · 2026-05-12T06:41:38Z

@JoeLoser's stack covers items 2 + 3 from his decomposition (the changes in this PR), plus items 4 + 5 which I had identified but deferred. His after-numbers (≈22 ns noop / ≈42 ns add on M4) are dramatically better. Yielding, will close this PR once his stack lands publicly. Pivoting to item 6 (typed-arg fast paths) which he explicitly hasn't looked at.

Assisted-by: AI Signed-off-by: Manuel Saelices <[email protected]>

msaelices added 2 commits May 12, 2026 00:22

github-actions bot added the mojo-stdlib and waiting-on-review labels May 11, 2026

msaelices mentioned this pull request May 11, 2026

[Stdlib] Expose METH_FASTCALL via PythonModuleBuilder.def_py_c_fastcall_function #6524

Draft

4 tasks

JoeLoser self-assigned this May 12, 2026

This was referenced May 12, 2026

Mojo from Python per-call FFI overhead ~10x higher than PyO3 #6521

Open

[Stdlib] Typed-arg fast-path bindings: Int <-> PyLong #6526

Closed

msaelices added 2 commits May 12, 2026 19:52

Merge branch 'main' into mojo-from-python-ffi-optimizations

05e5274

[Stdlib] Trim verbose comments on GIL fast-path changes

ea6cc6a

Assisted-by: AI Signed-off-by: Manuel Saelices <[email protected]>

msaelices wants to merge 4 commits into modular:main from msaelices:mojo-from-python-ffi-optimizations
