[Stdlib] Expose METH_FASTCALL via PythonModuleBuilder.def_py_c_fastcall_function by msaelices · Pull Request #6524 · modular/modular

msaelices · 2026-05-11T23:08:14Z

Chained on top of #6523 (mojo-from-python-ffi-optimizations). Reviewing this PR with that one merged is the cleanest view of the cumulative speedup.

Linked issue

Related to #6521 — the next milestone in closing the Python -> Mojo FFI gap, after the GIL-acquire cleanup PR (#6523) was merged.

What this PR adds

Exposes METH_FASTCALL so extensions registered via PythonModuleBuilder can opt out of the per-call METH_VARARGS tuple allocation.

- METH_FASTCALL = 0x80 constant in _cpython.mojo.
- PyCFunctionFast trampoline type: def(PyObjectPtr, UnsafePointer[PyObjectPtr], Py_ssize_t) -> PyObjectPtr.
- PyMethodDef.function_fastcall static factory.
- New PythonModuleBuilder.def_py_c_fastcall_function, parallel to the existing def_py_c_function but for fastcall-shaped trampolines.
- mojo/stdlib/benchmarks/python/bench_python_ffi.mojo benchmark comparing the two paths.
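For readers less familiar with CPython's method flags, here is a minimal Python sketch of how the flag bits select the argument-passing shape. The `METH_VARARGS` and `METH_KEYWORDS` values come from CPython's `Include/methodobject.h`, not from this PR, and `MethFlags` and `calling_convention` are hypothetical names used only for illustration.

```python
from enum import IntFlag


class MethFlags(IntFlag):
    """Subset of the CPython PyMethodDef.ml_flags bits."""

    METH_VARARGS = 0x0001
    METH_KEYWORDS = 0x0002
    METH_FASTCALL = 0x0080  # the constant this PR adds to _cpython.mojo


def calling_convention(flags: MethFlags) -> str:
    """Hypothetical helper: describe the C-level argument shape for a flag set."""
    has_keywords = bool(flags & MethFlags.METH_KEYWORDS)
    if flags & MethFlags.METH_FASTCALL:
        # Arguments arrive as a C array plus a count; no tuple is built per call.
        return "(self, args_array, nargs" + (", kwnames)" if has_keywords else ")")
    if flags & MethFlags.METH_VARARGS:
        # Arguments arrive packed in a freshly allocated tuple.
        return "(self, args_tuple" + (", kwargs)" if has_keywords else ")")
    return "other"


print(calling_convention(MethFlags.METH_FASTCALL))
print(calling_convention(MethFlags.METH_VARARGS | MethFlags.METH_KEYWORDS))
```

The fastcall shape without keywords matches the `PyCFunctionFast` signature listed above: a self pointer, a pointer to the argument array, and a `Py_ssize_t` count.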

Bench results

Command: ./bazelw test //mojo/stdlib/benchmarks:python/bench_python_ffi.mojo.bench --test_output=all. Environment: Mojo 1.0.0b2.dev2026051106 (aff922aa) bazel toolchain, CPython 3.13, single core.

| Call shape | A: `def_function` (METH_VARARGS) | B: `def_py_c_fastcall_function` | Δ |
| --- | --- | --- | --- |
| `noop(x)` | ~198 ns/iter | ~136 ns/iter | −62 ns (~31%) |
| `add(a, b)` | ~383 ns/iter | ~313 ns/iter | −70 ns (~18%) |

(Min ns/iter across 5 repetitions of 1000 ops each. The bench loop runs in Mojo so each iter pays a Mojo -> Python crossing on top of the Python -> Mojo overhead being measured. The delta between A and B is what's attributable to the trampoline style.)
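The Δ column can be rechecked from the two timing columns. The short Python sketch below does that arithmetic; `delta` is a hypothetical helper name, and the inputs are the rounded figures from the table, so the percentages are approximate.

```python
def delta(baseline_ns: float, candidate_ns: float) -> tuple[float, float]:
    """Return (ns saved, percent saved relative to the baseline)."""
    saved = baseline_ns - candidate_ns
    return saved, 100.0 * saved / baseline_ns


# Rounded figures from the table: A is METH_VARARGS, B is METH_FASTCALL.
noop_saved, noop_pct = delta(198, 136)
add_saved, add_pct = delta(383, 313)

print(f"noop(x):   -{noop_saved} ns ({noop_pct:.1f}%)")
print(f"add(a, b): -{add_saved} ns ({add_pct:.1f}%)")
```

Because both columns include the same Mojo -> Python crossing, the absolute saving in nanoseconds is the more robust figure; the percentage would be larger against a baseline that excludes that crossing.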

Scope and follow-ups

This PR adds the raw entrypoint that lets users hand-roll a fastcall trampoline. The higher-level def_function[user_func] registration still defaults to METH_VARARGS, because integrating fastcall into that path needs a fastcall-aware variant of the _python_func.mojo dispatch chain (the heavily-templated _dispatch[is_method] logic that does compile-time arity branching).

That integration is the next chained PR in this series. Once it lands, end-user code using plain m.def_function[user_func]("name") will pick up the win automatically.

Tests

- ./bazelw test //mojo/stdlib/benchmarks:python/bench_python_ffi.mojo.smoke: passes.
- ./bazelw test //mojo/stdlib/test/python/...: 40/40 pass.
- ./bazelw test //mojo/integration-test/python-extension-modules/...: 22/22 pass.

Checklist

- Chained on top of [Stdlib] Skip redundant GIL acquires on Python->Mojo FFI hot path #6523 for clean cumulative review.
- Tests + benchmark added.
- Format hook deliberately skipped (mblack mis-parses some Mojo).
- AI assistance disclosed via Assisted-by: AI trailers in both commits.

BEGIN_PUBLIC
[Stdlib] Skip redundant GIL acquire in `PythonObject.__del__`

`PythonObject.__del__` unconditionally wrapped its `Py_DecRef` in a `PyGILState_Ensure` / `PyGILState_Release` pair so the destructor could run safely even when the calling thread doesn't hold the GIL. In the common case the GIL *is* already held (the destructor runs inside a Python -> Mojo trampoline, or inside a `with Python()` block); the acquire/release pair is then just two extra C calls into CPython.

Microbenchmarks (CPython 3.12, single core via `taskset -c 2`):

    PyGILState_Ensure+Release pair, cached fn ptrs:  ~14.5 ns / pair
    PyGILState_Check, cached fn ptr:                 ~8.3 ns / call

So a single `PyGILState_Check` saves ~6 ns vs an `Ensure+Release` pair. For the FFI dispatch path of issue modular#6521 (Python -> Mojo function call overhead) we destroy two `PythonObject`s per call (the `self` arg and the args tuple), so this saves ~12 ns / call on the hot trampoline.

Expose `CPython.PyGILState_Check` and let `__del__` test the GIL once and fall back to `GILAcquired` only when the GIL really isn't held.
END_PUBLIC

Assisted-by: AI
Signed-off-by: Manuel Saelices <[email protected]>
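The trade-off in this commit can be written as a small cost model. The Python sketch below is only an illustration: it assumes the two microbenchmark figures add linearly, and `del_cost_ns` and the break-even calculation are hypothetical additions that do not appear in the commit.

```python
ENSURE_RELEASE_NS = 14.5  # PyGILState_Ensure + PyGILState_Release pair (from the commit)
CHECK_NS = 8.3            # single PyGILState_Check call (from the commit)


def del_cost_ns(gil_held: bool, conditional: bool) -> float:
    """Modelled GIL-handling cost of one PythonObject.__del__ call."""
    if not conditional:
        # Old behaviour: always pay the Ensure/Release pair.
        return ENSURE_RELEASE_NS
    # New behaviour: check first, fall back to the pair only if the GIL is not held.
    return CHECK_NS if gil_held else CHECK_NS + ENSURE_RELEASE_NS


saving_per_del = del_cost_ns(True, conditional=False) - del_cost_ns(True, conditional=True)
saving_per_call = 2 * saving_per_del  # two PythonObjects destroyed per call

# Under this linear model, the check pays off once the GIL is already held
# in more than this fraction of destructor runs.
break_even_fraction = CHECK_NS / ENSURE_RELEASE_NS

print(f"saving per __del__: ~{saving_per_del:.1f} ns")
print(f"saving per call:    ~{saving_per_call:.1f} ns")
print(f"break-even GIL-held fraction: {break_even_fraction:.2f}")
```

The model also makes the cost of the uncommon path explicit: when the GIL is not held, the destructor pays the check on top of the pair, which the commit accepts because that path is rare.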

BEGIN_PUBLIC
[Stdlib] Drop the outer `GILAcquired` from `_py_c_function_wrapper`

When CPython dispatches a `PyMethodDef` entry via `METH_VARARGS` / `METH_VARARGS | METH_KEYWORDS`, the calling thread already holds the GIL. PyO3, pybind11, and nanobind all rely on this invariant and don't acquire the GIL again on entry.

The Mojo binding wrapper, by contrast, wrapped the user-supplied function in an explicit `GILAcquired` context manager, which on every call hit `PyGILState_Ensure` (state bookkeeping in CPython) and `PyGILState_Release` on return.

Microbenchmark on the same machine as issue modular#6521 (CPython 3.12, `taskset -c 2`), running a Python timeit loop that calls a Mojo function via a `PythonModuleBuilder`-registered C trampoline:

    variant                                    ns / call
    ---------------------------------------    ---------
    inline pywrap + GIL                        123
    inline pywrap, no outer GIL (this patch)   106
    ---------------------------------------    ---------
    delta:                                     17 ns

So removing the outer pair saves ~17 ns / call (~8.5% of the current overhead). Combined with the conditional GIL handling in `PythonObject.__del__` (companion commit), the total saving on a single-arg trampoline is ~29 ns / call.
END_PUBLIC

Assisted-by: AI
Signed-off-by: Manuel Saelices <[email protected]>
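The combined figure follows from adding the two commits' savings. The Python sketch below reproduces that sum. The implied baseline is an inference from the "~8.5%" figure rather than a number stated in the commit, and the variable names are hypothetical.

```python
with_outer_gil_ns = 123     # "inline pywrap + GIL" row
without_outer_gil_ns = 106  # "inline pywrap, no outer GIL (this patch)" row
del_saving_ns = 12          # per-call saving quoted in the companion __del__ commit

outer_gil_saving_ns = with_outer_gil_ns - without_outer_gil_ns
total_saving_ns = outer_gil_saving_ns + del_saving_ns

# Inference only: if 17 ns is ~8.5% of "the current overhead", that overhead
# is roughly 17 / 0.085 ns. The commit does not state this baseline directly.
implied_overhead_ns = outer_gil_saving_ns / 0.085

print(f"outer GIL pair: {outer_gil_saving_ns} ns / call")
print(f"combined:       ~{total_saving_ns} ns / call")
print(f"implied current overhead: ~{implied_overhead_ns:.0f} ns / call")
```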

BEGIN_PUBLIC
[Stdlib] Add `METH_FASTCALL` support to `PythonModuleBuilder`

Exposes the CPython METH_FASTCALL calling convention so extensions written with `PythonModuleBuilder` can opt out of the `METH_VARARGS`-style per-call tuple allocation.

What's added:

- `METH_FASTCALL = 0x80` flag (matches CPython's `Include/methodobject.h`).
- `PyCFunctionFast` type, the trampoline signature: `def(PyObjectPtr, UnsafePointer[PyObjectPtr], Py_ssize_t) -> PyObjectPtr`.
- `PyMethodDef.function_fastcall` static factory that builds a fastcall `PyMethodDef` with the right flag bits.
- `PythonModuleBuilder.def_py_c_fastcall_function`, a new public entry point parallel to `def_py_c_function` for users who want to ship a hand-rolled fastcall trampoline.

This is the raw-API surface; the higher-level `def_function[user_func]` registration still defaults to `METH_VARARGS`. Integrating fastcall into that high-level path requires a fastcall-aware variant of the `_python_func.mojo` dispatch chain and is left as a follow-up (this branch chains on top of the GIL-acquire cleanup PR; the next chained PR will add the high-level integration).

Microbench: a tight Python -> Mojo -> Python loop on a `noop(x)` function shows the raw fastcall trampoline at ~136 ns/iter vs the current `def_function` path at ~198 ns/iter (-62 ns, ~31% improvement). For `add(a, b)`, the figures are ~313 ns/iter vs ~383 ns/iter (-70 ns, ~18%). Numbers include the Mojo -> Python overhead the bench loop pays per iter, so the win attributable purely to the trampoline change is the delta. See the companion benchmark commit for the full bench setup.
END_PUBLIC

Assisted-by: AI
Signed-off-by: Manuel Saelices <[email protected]>

BEGIN_PUBLIC
[Stdlib] Add Python -> Mojo FFI hot path benchmark

`mojo/stdlib/benchmarks/python/bench_python_ffi.mojo` measures the Python -> Mojo round-trip for two trampolines side by side:

- A: the standard `PythonModuleBuilder.def_function[user_func]` path (METH_VARARGS, full `_py_c_function_wrapper` chain).
- B: a hand-rolled trampoline registered via `def_py_c_fastcall_function` (METH_FASTCALL).

Both `noop(x)` and `add(a, b)` shapes are covered, matching the reproduction in issue modular#6521.

The bench loop runs in Mojo, so each iteration pays a Mojo -> Python crossing in addition to the Python -> Mojo crossing we want to measure. The absolute numbers are therefore higher than what an external Python driver would see, but the delta between the two variants is meaningful and attributable to the trampoline style.

Run with:

    ./bazelw test //mojo/stdlib/benchmarks:python/bench_python_ffi.mojo.bench \
        --test_output=all
END_PUBLIC

Assisted-by: AI
Signed-off-by: Manuel Saelices <[email protected]>
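For comparison, an external Python driver using the same "minimum over repetitions" methodology could look like the sketch below. It is not the benchmark from this PR: `min_ns_per_iter` is a hypothetical helper, and the pure-Python `noop` and `add` functions are stand-ins for the Mojo-backed callables so that the block runs on its own.

```python
import time


def min_ns_per_iter(fn, *args, repetitions: int = 5, ops: int = 1000) -> float:
    """Return the minimum ns per call over `repetitions` runs of `ops` calls each."""
    best = float("inf")
    for _ in range(repetitions):
        start = time.perf_counter_ns()
        for _ in range(ops):
            fn(*args)
        elapsed = time.perf_counter_ns() - start
        best = min(best, elapsed / ops)
    return best


# Stand-ins: in a real run these would be the functions exported by the
# Mojo extension module (variant A or variant B).
def noop(x):
    return None


def add(a, b):
    return a + b


results = {
    "noop(x)": min_ns_per_iter(noop, 1),
    "add(a, b)": min_ns_per_iter(add, 1, 2),
}
for shape, ns in results.items():
    print(f"{shape}: ~{ns:.0f} ns/iter")
```

Taking the minimum rather than the mean filters out scheduler noise, which matters when the quantity of interest is a difference of a few tens of nanoseconds.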

msaelices · 2026-05-12T06:41:38Z

@JoeLoser's stack covers item 1 from his decomposition (METH_FASTCALL), with the full integration into the high-level def_function[user_func] that this PR explicitly deferred. Yielding, will close this PR once his stack lands publicly. Pivoting to item 6 (typed-arg fast paths) which he explicitly hasn't looked at.

The rest of the python bindings code uses 'wrapper' uniformly (_py_c_function_wrapper, _tp_dealloc_wrapper, _tp_repr_wrapper, _py_new_function_wrapper, ...). 'Trampoline' was jargon I introduced; swap it back to match the project's terminology.

Assisted-by: AI
Signed-off-by: Manuel Saelices <[email protected]>

msaelices added 4 commits May 12, 2026 00:22

github-actions bot added the mojo-stdlib and waiting-on-review labels May 11, 2026

JoeLoser self-assigned this May 12, 2026

This was referenced May 12, 2026:

Mojo from Python per-call FFI overhead ~10x higher than PyO3 #6521 (Open)
[Stdlib] Typed-arg fast-path bindings: Int <-> PyLong #6526 (Closed)

msaelices added 2 commits May 12, 2026 19:51

Merge branch 'main' into mojo-from-python-ffi-followup-fastcall

a71e00a

msaelices wants to merge 6 commits into modular:main from msaelices:mojo-from-python-ffi-followup-fastcall