[Stdlib] Expose METH_FASTCALL via PythonModuleBuilder.def_py_c_fastcall_function #6524
Draft
msaelices wants to merge 6 commits into
BEGIN_PUBLIC
[Stdlib] Skip redundant GIL acquire in `PythonObject.__del__`

`PythonObject.__del__` unconditionally wrapped its `Py_DecRef` in a `PyGILState_Ensure` / `PyGILState_Release` pair so the destructor could run safely even when the calling thread doesn't hold the GIL. In the common case the GIL *is* already held (the destructor runs inside a Python -> Mojo trampoline, or inside a `with Python()` block); the acquire/release pair is then just two extra C calls into CPython.

Microbenchmarks (CPython 3.12, single core via `taskset -c 2`):

    PyGILState_Ensure+Release pair, cached fn ptrs:  ~14.5 ns / pair
    PyGILState_Check, cached fn ptr:                 ~8.3 ns / call

So a single `PyGILState_Check` saves ~6 ns vs an `Ensure+Release` pair. For the FFI dispatch path of issue modular#6521 (Python -> Mojo function call overhead) we destroy two `PythonObject`s per call (the `self` arg and the args tuple), so this saves ~12 ns / call on the hot trampoline.

Expose `CPython.PyGILState_Check` and let `__del__` test the GIL once and fall back to `GILAcquired` only when the GIL really isn't held.
END_PUBLIC

Assisted-by: AI
Signed-off-by: Manuel Saelices <[email protected]>
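The check-then-fall-back pattern described in this commit can be sketched from Python with `ctypes`. This is only an illustration of the control flow, not the Mojo implementation: the `with_gil` helper and its string labels are hypothetical, while `PyGILState_Check`, `PyGILState_Ensure`, and `PyGILState_Release` are the real CPython C-API functions, reached here through `ctypes.pythonapi` (which always calls with the GIL held, so the fast branch is the one expected to run).

```python
import ctypes

# ctypes.pythonapi is a PyDLL handle: calls through it keep the GIL held.
api = ctypes.pythonapi

api.PyGILState_Check.argtypes = []
api.PyGILState_Check.restype = ctypes.c_int
api.PyGILState_Ensure.argtypes = []
api.PyGILState_Ensure.restype = ctypes.c_int  # PyGILState_STATE enum value
api.PyGILState_Release.argtypes = [ctypes.c_int]
api.PyGILState_Release.restype = None


def with_gil(action):
    """Run action(), acquiring the GIL only if it is not already held.

    Hypothetical helper: returns a label for the branch taken plus the result.
    """
    if api.PyGILState_Check():
        # Fast path: one check, no Ensure/Release pair.
        return "already-held", action()
    # Slow path: the calling thread does not hold the GIL.
    state = api.PyGILState_Ensure()
    try:
        return "acquired", action()
    finally:
        api.PyGILState_Release(state)


branch, value = with_gil(lambda: 42)
```

The design point is that the fast path pays for one call instead of two, at the cost of a slightly slower slow path (check plus ensure plus release).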
BEGIN_PUBLIC
[Stdlib] Drop the outer `GILAcquired` from `_py_c_function_wrapper`

When CPython dispatches a `PyMethodDef` entry via `METH_VARARGS` / `METH_VARARGS | METH_KEYWORDS`, the calling thread already holds the GIL. PyO3, pybind11, and nanobind all rely on this invariant and don't acquire the GIL again on entry.

The Mojo binding wrapper, by contrast, wrapped the user-supplied function in an explicit `GILAcquired` context manager, which on every call hit `PyGILState_Ensure` (state bookkeeping in CPython) and `PyGILState_Release` on return.

Microbenchmark on the same machine as issue modular#6521 (CPython 3.12, `taskset -c 2`), running a Python timeit loop that calls a Mojo function via a `PythonModuleBuilder`-registered C trampoline:

    variant                                     ns / call
    ----------------------------------------    ---------
    inline pywrap + GIL                            123
    inline pywrap, no outer GIL (this patch)       106
    ----------------------------------------    ---------
    delta:                                          17 ns

So removing the outer pair saves ~17 ns / call (~8.5% of the current overhead). Combined with the conditional GIL handling in `PythonObject.__del__` (companion commit), the total saving on a single-arg trampoline is ~29 ns / call.
END_PUBLIC

Assisted-by: AI
Signed-off-by: Manuel Saelices <[email protected]>
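As a quick sanity check, the ~29 ns / call total can be rebuilt from the figures quoted in the two GIL-related commit messages. The variable names below are illustrative; the numbers are copied from the messages, not measured again.

```python
# Figures quoted in the `PythonObject.__del__` commit message.
ensure_release_pair_ns = 14.5   # PyGILState_Ensure + PyGILState_Release
gil_check_ns = 8.3              # single PyGILState_Check
objects_destroyed_per_call = 2  # the `self` arg and the args tuple

# Saving from replacing the pair with a check in each destructor.
del_saving_ns = (ensure_release_pair_ns - gil_check_ns) * objects_destroyed_per_call

# Figures quoted in the `_py_c_function_wrapper` commit message.
with_outer_gil_ns = 123
without_outer_gil_ns = 106
wrapper_saving_ns = with_outer_gil_ns - without_outer_gil_ns

total_saving_ns = del_saving_ns + wrapper_saving_ns

print(f"__del__ saving:  ~{del_saving_ns:.1f} ns / call")
print(f"wrapper saving:  ~{wrapper_saving_ns} ns / call")
print(f"combined saving: ~{total_saving_ns:.1f} ns / call")
```

The two contributions are independent (one is in the destructor, the other in the wrapper entry), which is why they are simply added.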
BEGIN_PUBLIC
[Stdlib] Add `METH_FASTCALL` support to `PythonModuleBuilder`

Exposes the CPython `METH_FASTCALL` calling convention so extensions written with `PythonModuleBuilder` can opt out of the `METH_VARARGS`-style per-call tuple allocation.

What's added:

- `METH_FASTCALL = 0x80` flag (matches CPython's `Include/methodobject.h`).
- `PyCFunctionFast` type, the trampoline signature: `def(PyObjectPtr, UnsafePointer[PyObjectPtr], Py_ssize_t) -> PyObjectPtr`.
- `PyMethodDef.function_fastcall` static factory that builds a fastcall `PyMethodDef` with the right flag bits.
- `PythonModuleBuilder.def_py_c_fastcall_function`, a new public entry point parallel to `def_py_c_function` for users who want to ship a hand-rolled fastcall trampoline.

This is the raw-API surface; the higher-level `def_function[user_func]` registration still defaults to `METH_VARARGS`. Integrating fastcall into that high-level path requires a fastcall-aware variant of the `_python_func.mojo` dispatch chain and is left as a follow-up (this branch chains on top of the GIL-acquire cleanup PR; the next chained PR will add the high-level integration).

Microbench: a tight Python -> Mojo -> Python loop on a `noop(x)` function shows the raw fastcall trampoline at ~136 ns/iter vs the current `def_function` path at ~198 ns/iter (-62 ns, ~31% improvement). For `add(a, b)`, ~313 ns/iter vs ~383 ns/iter (-70 ns, ~18%). Numbers include the Mojo -> Python overhead the bench loop pays per iter, so the win attributable purely to the trampoline change is the delta. See the companion benchmark commit for the full bench setup.
END_PUBLIC

Assisted-by: AI
Signed-off-by: Manuel Saelices <[email protected]>
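To make the difference between the two calling conventions concrete, here is a toy Python model. It is not CPython's dispatcher and not the Mojo API: `dispatch`, `add_varargs`, and `add_fastcall` are hypothetical names, and a Python list stands in for the C argument array. Only the flag values are real (they match CPython's `Include/methodobject.h`).

```python
# Flag values as defined in CPython's Include/methodobject.h.
METH_VARARGS = 0x0001
METH_KEYWORDS = 0x0002
METH_FASTCALL = 0x0080


def dispatch(flags, func, self_obj, stack):
    """Toy model of handing positional arguments to a C-level function."""
    if flags & METH_FASTCALL:
        # fastcall: pass the existing argument array plus a count.
        return func(self_obj, stack, len(stack))
    if flags & METH_VARARGS:
        # varargs: pack the arguments into a fresh tuple on every call.
        return func(self_obj, tuple(stack))
    raise ValueError(f"unsupported flags: {flags:#x}")


def add_varargs(self_obj, args):
    # Shape of a METH_VARARGS handler: (self, args_tuple).
    a, b = args
    return a + b


def add_fastcall(self_obj, args, nargs):
    # Shape of a METH_FASTCALL handler: (self, args_array, nargs).
    if nargs != 2:
        raise TypeError(f"add() takes 2 arguments, got {nargs}")
    return args[0] + args[1]


print(dispatch(METH_VARARGS, add_varargs, None, [2, 3]))
print(dispatch(METH_FASTCALL, add_fastcall, None, [2, 3]))
```

The fastcall handler has to validate `nargs` itself, which is part of why wiring it into a high-level registration path takes more work than exposing the raw entry point.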
BEGIN_PUBLIC
[Stdlib] Add Python -> Mojo FFI hot path benchmark

`mojo/stdlib/benchmarks/python/bench_python_ffi.mojo` measures the Python -> Mojo round-trip for two trampolines side by side:

- A: the standard `PythonModuleBuilder.def_function[user_func]` path (METH_VARARGS, full `_py_c_function_wrapper` chain).
- B: a hand-rolled trampoline registered via `def_py_c_fastcall_function` (METH_FASTCALL).

Both `noop(x)` and `add(a, b)` shapes are covered, matching the reproduction in issue modular#6521.

The bench loop runs in Mojo, so each iteration pays a Mojo -> Python crossing in addition to the Python -> Mojo crossing we want to measure. The absolute numbers are therefore higher than what an external Python driver would see, but the delta between the two variants is meaningful and attributable to the trampoline style.

Run with:

    ./bazelw test //mojo/stdlib/benchmarks:python/bench_python_ffi.mojo.bench \
        --test_output=all
END_PUBLIC

Assisted-by: AI
Signed-off-by: Manuel Saelices <[email protected]>
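For comparison, an external Python driver of the kind mentioned above could be sketched with the standard `timeit` module. This is an assumption-laden stand-in rather than the benchmark in this PR: `min_ns_per_call` is a hypothetical helper, and a pure-Python `noop` takes the place of the Mojo-exported function, which is not available here. The default repeat and iteration counts mirror the "min across 5 repetitions of 1000 ops" reporting used in the PR description.

```python
import timeit


def noop(x):
    # Pure-Python placeholder for the Mojo-exported `noop(x)`.
    return x


def min_ns_per_call(fn, arg, repeats=5, number=1000):
    """Return the best-case nanoseconds per call across several repetitions."""
    timer = timeit.Timer(lambda: fn(arg))
    totals = timer.repeat(repeat=repeats, number=number)  # seconds per batch
    return min(totals) / number * 1e9


print(f"noop: {min_ns_per_call(noop, 1):.1f} ns / call")
```

Taking the minimum filters out scheduler noise; note that the `lambda` adds its own call overhead, so only differences between variants measured the same way are meaningful.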
@JoeLoser's stack covers item 1 from his decomposition (METH_FASTCALL), with the full integration into the high-level |
This was referenced May 12, 2026
The rest of the python bindings code uses 'wrapper' uniformly (_py_c_function_wrapper, _tp_dealloc_wrapper, _tp_repr_wrapper, _py_new_function_wrapper, ...). 'Trampoline' was jargon I introduced; swap it back to match the project's terminology.

Assisted-by: AI
Signed-off-by: Manuel Saelices <[email protected]>
Chained on top of #6523 (`mojo-from-python-ffi-optimizations`). Reviewing this PR with that one merged is the cleanest view of the cumulative speedup.

Linked issue
Related to #6521 — the next milestone in closing the Python -> Mojo FFI gap, after the GIL-acquire cleanup PR (#6523) was merged.
What this PR adds
Exposes `METH_FASTCALL` so extensions registered via `PythonModuleBuilder` can opt out of the per-call `METH_VARARGS` tuple allocation.

- `METH_FASTCALL = 0x80` constant in `_cpython.mojo`.
- `PyCFunctionFast` trampoline type: `def(PyObjectPtr, UnsafePointer[PyObjectPtr], Py_ssize_t) -> PyObjectPtr`.
- `PyMethodDef.function_fastcall` static factory.
- `PythonModuleBuilder.def_py_c_fastcall_function`: parallel to the existing `def_py_c_function`, but for fastcall-shaped trampolines.
- `mojo/stdlib/benchmarks/python/bench_python_ffi.mojo` benchmark comparing the two paths.

Bench results
`./bazelw test //mojo/stdlib/benchmarks:python/bench_python_ffi.mojo.bench --test_output=all`, Mojo `1.0.0b2.dev2026051106 (aff922aa)` bazel toolchain, CPython 3.13, single core:

[Table: columns `def_function` (`METH_VARARGS`) and `def_py_c_fastcall_function`; rows `noop(x)` and `add(a, b)`.]

(Min ns/iter across 5 repetitions of 1000 ops each. The bench loop runs in Mojo so each iter pays a Mojo -> Python crossing on top of the Python -> Mojo overhead being measured. The delta between A and B is what's attributable to the trampoline style.)
Scope and follow-ups
This PR adds the raw entrypoint that lets users hand-roll a fastcall trampoline. The higher-level `def_function[user_func]` registration still defaults to `METH_VARARGS`, because integrating fastcall into that path needs a fastcall-aware variant of the `_python_func.mojo` dispatch chain (the heavily-templated `_dispatch[is_method]` logic that does compile-time arity branching).

That integration is the next chained PR in this series. Once it lands, end-user code using plain `m.def_function[user_func]("name")` will pick up the win automatically.

Tests
- `./bazelw test //mojo/stdlib/benchmarks:python/bench_python_ffi.mojo.smoke`: passes.
- `./bazelw test //mojo/stdlib/test/python/...`: 40/40 pass.
- `./bazelw test //mojo/integration-test/python-extension-modules/...`: 22/22 pass.

Checklist

- `Assisted-by: AI` trailers in both commits.