[Stdlib] Expose METH_FASTCALL via PythonModuleBuilder.def_py_c_fastcall_function #6524

Draft
msaelices wants to merge 6 commits into modular:main from msaelices:mojo-from-python-ffi-followup-fastcall

Chained on top of #6523 (mojo-from-python-ffi-optimizations). Reviewing this PR with that one merged is the cleanest view of the cumulative speedup.

Linked issue

Related to #6521 — the next milestone in closing the Python -> Mojo FFI gap, after the GIL-acquire cleanup PR (#6523) was merged.

What this PR adds

Exposes METH_FASTCALL so extensions registered via PythonModuleBuilder can opt out of the per-call METH_VARARGS tuple allocation.

  • METH_FASTCALL = 0x80 constant in _cpython.mojo.
  • PyCFunctionFast trampoline type: def(PyObjectPtr, UnsafePointer[PyObjectPtr], Py_ssize_t) -> PyObjectPtr.
  • PyMethodDef.function_fastcall static factory.
  • New PythonModuleBuilder.def_py_c_fastcall_function — parallel to the existing def_py_c_function, but for fastcall-shaped trampolines.
  • mojo/stdlib/benchmarks/python/bench_python_ffi.mojo benchmark comparing the two paths.
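
To make the difference between the two calling conventions concrete, here is a small pure-Python model. It is only a sketch: it is neither the Mojo API added in this PR nor CPython's dispatch code. The flag values match CPython's `Include/methodobject.h`; the `dispatch` helper and both `add_*` functions are hypothetical names used for illustration.

```python
# Flag values as defined in CPython's Include/methodobject.h.
METH_VARARGS = 0x0001   # callee receives (self, args_tuple)
METH_FASTCALL = 0x0080  # callee receives (self, args_array, nargs)


def add_varargs(self, args):
    """METH_VARARGS shape: arguments arrive packed in a tuple."""
    a, b = args
    return a + b


def add_fastcall(self, args, nargs):
    """METH_FASTCALL shape: arguments arrive as an array plus a count."""
    if nargs != 2:
        raise TypeError(f"add() takes exactly 2 arguments ({nargs} given)")
    return args[0] + args[1]


def dispatch(func, flags, self, stack, nargs):
    """Hypothetical caller that already holds `nargs` arguments in `stack`."""
    if flags & METH_FASTCALL:
        # The existing argument storage is passed through unchanged.
        return func(self, stack, nargs)
    if flags & METH_VARARGS:
        # A fresh tuple is built on every call; this is the per-call
        # allocation that the fastcall convention avoids.
        return func(self, tuple(stack[:nargs]))
    raise SystemError("unsupported calling convention")


print(dispatch(add_varargs, METH_VARARGS, None, [2, 3], 2))
print(dispatch(add_fastcall, METH_FASTCALL, None, [2, 3], 2))
```

Both calls compute the same sum; only the fastcall branch does so without building an intermediate tuple.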

Bench results

./bazelw test //mojo/stdlib/benchmarks:python/bench_python_ffi.mojo.bench --test_output=all, Mojo 1.0.0b2.dev2026051106 (aff922aa) bazel toolchain, CPython 3.13, single core:

| Call shape | A: def_function (METH_VARARGS) | B: def_py_c_fastcall_function | Δ |
| --- | --- | --- | --- |
| noop(x) | ~198 ns/iter | ~136 ns/iter | −62 ns (~31%) |
| add(a, b) | ~383 ns/iter | ~313 ns/iter | −70 ns (~18%) |

(Min ns/iter across 5 repetitions of 1000 ops each. The bench loop runs in Mojo so each iter pays a Mojo -> Python crossing on top of the Python -> Mojo overhead being measured. The delta between A and B is what's attributable to the trampoline style.)
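
The Δ column can be reproduced from the two measured columns. The sketch below assumes, as the note above describes, that both variants pay the same Mojo -> Python crossing per iteration, so that cost cancels in the subtraction. The `summarize` helper is a hypothetical name; the numbers are copied from the results table.

```python
# Min ns/iter figures copied from the results table: (A: METH_VARARGS, B: METH_FASTCALL).
results = {
    "noop(x)": (198, 136),
    "add(a, b)": (383, 313),
}


def summarize(varargs_ns, fastcall_ns):
    """Return (ns saved per iteration, saving as a rounded percentage of A)."""
    saved = varargs_ns - fastcall_ns  # shared crossing cost cancels out here
    percent = round(100 * saved / varargs_ns)
    return saved, percent


for shape, (a_ns, b_ns) in results.items():
    saved, percent = summarize(a_ns, b_ns)
    print(f"{shape}: -{saved} ns (~{percent}%)")
```

Because the percentages are taken relative to the full measured time of variant A, which includes the shared crossing, they understate the relative improvement of the Python -> Mojo path on its own.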

Scope and follow-ups

This PR adds the raw entrypoint that lets users hand-roll a fastcall trampoline. The higher-level def_function[user_func] registration still defaults to METH_VARARGS, because integrating fastcall into that path needs a fastcall-aware variant of the _python_func.mojo dispatch chain (the heavily-templated _dispatch[is_method] logic that does compile-time arity branching).

That integration is the next chained PR in this series. Once it lands, end-user code using plain m.def_function[user_func]("name") will pick up the win automatically.

Tests

  • ./bazelw test //mojo/stdlib/benchmarks:python/bench_python_ffi.mojo.smoke — passes.
  • ./bazelw test //mojo/stdlib/test/python/... — 40/40 pass.
  • ./bazelw test //mojo/integration-test/python-extension-modules/... — 22/22 pass.

msaelices added 4 commits May 12, 2026 00:22
BEGIN_PUBLIC
[Stdlib] Skip redundant GIL acquire in `PythonObject.__del__`

`PythonObject.__del__` unconditionally wrapped its `Py_DecRef` in a
`PyGILState_Ensure` / `PyGILState_Release` pair so the destructor could
run safely even when the calling thread doesn't hold the GIL. In the
common case the GIL *is* already held (the destructor runs inside a
Python -> Mojo trampoline, or inside a `with Python()` block); the
acquire/release pair is then just two extra C calls into CPython.

Microbenchmarks (CPython 3.12, single core via `taskset -c 2`):

  PyGILState_Ensure+Release pair, cached fn ptrs:  ~14.5 ns / pair
  PyGILState_Check, cached fn ptr:                  ~8.3 ns / call

So a single `PyGILState_Check` saves ~6 ns vs an `Ensure+Release` pair.
For the FFI dispatch path of issue modular#6521 (Python -> Mojo function call
overhead) we destroy two `PythonObject`s per call (the `self` arg and
the args tuple), so this saves ~12 ns / call on the hot trampoline.

Expose `CPython.PyGILState_Check` and let `__del__` test the GIL once
and fall back to `GILAcquired` only when the GIL really isn't held.
END_PUBLIC

Assisted-by: AI
Signed-off-by: Manuel Saelices <[email protected]>
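
The check-then-fall-back pattern described in this commit can be sketched from Python with `ctypes`. This is not the Mojo implementation of `PythonObject.__del__`: it only calls the same three CPython C API functions (`PyGILState_Check`, `PyGILState_Ensure`, `PyGILState_Release`) through `ctypes.pythonapi`, and `run_with_gil` is a hypothetical helper name.

```python
import ctypes

# ctypes.pythonapi is a PyDLL handle, so calls through it keep the GIL held.
api = ctypes.pythonapi
api.PyGILState_Check.restype = ctypes.c_int
api.PyGILState_Ensure.restype = ctypes.c_int  # PyGILState_STATE is a C enum
api.PyGILState_Release.argtypes = [ctypes.c_int]
api.PyGILState_Release.restype = None


def run_with_gil(action):
    """Run `action`, acquiring the GIL only if this thread does not hold it."""
    if api.PyGILState_Check():
        # Common case: one cheap check, no Ensure/Release pair.
        return action(), "already-held"
    # Fallback: the full acquire/release pair.
    state = api.PyGILState_Ensure()
    try:
        return action(), "acquired"
    finally:
        api.PyGILState_Release(state)


value, path = run_with_gil(lambda: 41 + 1)
print(value, path)
```

Python code always runs with the GIL held, so this sketch only ever takes the first branch. The fallback matters for native callers such as a destructor running on a thread that has not acquired the GIL.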
BEGIN_PUBLIC
[Stdlib] Drop the outer `GILAcquired` from `_py_c_function_wrapper`

When CPython dispatches a `PyMethodDef` entry via `METH_VARARGS` /
`METH_VARARGS | METH_KEYWORDS`, the calling thread already holds the
GIL. PyO3, pybind11, and nanobind all rely on this invariant and don't
acquire the GIL again on entry. The Mojo binding wrapper, by contrast,
wrapped the user-supplied function in an explicit `GILAcquired`
context manager, which on every call hit `PyGILState_Ensure` (state
bookkeeping in CPython) and `PyGILState_Release` on return.

Microbenchmark on the same machine as issue modular#6521 (CPython 3.12,
`taskset -c 2`), running a Python timeit loop that calls a Mojo
function via a `PythonModuleBuilder`-registered C trampoline:

  variant                                    ns / call
  ---------------------------------------    ---------
  inline pywrap + GIL                              123
  inline pywrap, no outer GIL  (this patch)        106
  ---------------------------------------    ---------
                                       delta:       17 ns

So removing the outer pair saves ~17 ns / call (~8.5% of the current
overhead). Combined with the conditional GIL handling in
`PythonObject.__del__` (companion commit), the total saving on a
single-arg trampoline is ~29 ns / call.
END_PUBLIC

Assisted-by: AI
Signed-off-by: Manuel Saelices <[email protected]>
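
A Python-side timing loop of the kind mentioned in this commit could look like the sketch below. The commit does not show its harness, so everything here is an assumption: the `repeat` and `number` defaults are borrowed from the methodology stated in the PR description (minimum over 5 repetitions of 1000 calls), and `noop_stand_in` is a hypothetical pure-Python placeholder for the registered Mojo function.

```python
import timeit


def min_ns_per_call(fn, repeat=5, number=1000):
    """Return the fastest observed cost of one call to `fn`, in nanoseconds."""
    # timeit.repeat returns one total (in seconds) per batch of `number` calls.
    totals = timeit.repeat(fn, repeat=repeat, number=number)
    return min(totals) / number * 1e9


def noop_stand_in(x=None):
    """Hypothetical placeholder; a real run would call the extension function."""
    return x


print(f"{min_ns_per_call(noop_stand_in):.1f} ns/call")
```

Taking the minimum rather than the mean filters out scheduling noise, which matters when the differences being measured are in the tens of nanoseconds.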
BEGIN_PUBLIC
[Stdlib] Add `METH_FASTCALL` support to `PythonModuleBuilder`

Exposes the CPython METH_FASTCALL calling convention so extensions
written with `PythonModuleBuilder` can opt out of the
`METH_VARARGS`-style per-call tuple allocation.

What's added:

- `METH_FASTCALL = 0x80` flag (matches CPython's
  `Include/methodobject.h`).
- `PyCFunctionFast` type, the trampoline signature:
  `def(PyObjectPtr, UnsafePointer[PyObjectPtr], Py_ssize_t) -> PyObjectPtr`.
- `PyMethodDef.function_fastcall` static factory that builds a
  fastcall `PyMethodDef` with the right flag bits.
- `PythonModuleBuilder.def_py_c_fastcall_function`, a new public
  entry point parallel to `def_py_c_function` for users who want to
  ship a hand-rolled fastcall trampoline.

This is the raw-API surface — the higher-level
`def_function[user_func]` registration still defaults to
`METH_VARARGS`. Integrating fastcall into that high-level path
requires a fastcall-aware variant of the `_python_func.mojo`
dispatch chain and is left as a follow-up (this branch chains on top
of the GIL-acquire cleanup PR; the next chained PR will add the
high-level integration).

Microbench: a tight Python -> Mojo -> Python loop on a `noop(x)`
function shows the raw fastcall trampoline at ~136 ns/iter vs the
current `def_function` path at ~198 ns/iter (-62 ns, ~31%
improvement). For `add(a, b)`, it is ~313 ns/iter vs ~383 ns/iter
(-70 ns, ~18%). Numbers include the Mojo -> Python overhead the
bench loop pays per iter, so the win attributable purely to the
trampoline change is the delta. See the companion benchmark commit
for the full bench setup.
END_PUBLIC

Assisted-by: AI
Signed-off-by: Manuel Saelices <[email protected]>
BEGIN_PUBLIC
[Stdlib] Add Python -> Mojo FFI hot path benchmark

`mojo/stdlib/benchmarks/python/bench_python_ffi.mojo` measures the
Python -> Mojo round-trip for two trampolines side by side:

- A: the standard `PythonModuleBuilder.def_function[user_func]`
  path (METH_VARARGS, full `_py_c_function_wrapper` chain).
- B: a hand-rolled trampoline registered via
  `def_py_c_fastcall_function` (METH_FASTCALL).

Both `noop(x)` and `add(a, b)` shapes are covered, matching the
reproduction in issue modular#6521.

The bench loop runs in Mojo, so each iteration pays a Mojo -> Python
crossing in addition to the Python -> Mojo crossing we want to
measure. The absolute numbers are therefore higher than what an
external Python driver would see, but the delta between the two
variants is meaningful and attributable to the trampoline style.

Run with:

    ./bazelw test //mojo/stdlib/benchmarks:python/bench_python_ffi.mojo.bench \
      --test_output=all
END_PUBLIC

Assisted-by: AI
Signed-off-by: Manuel Saelices <[email protected]>
github-actions bot added the mojo-stdlib and waiting-on-review labels May 11, 2026
JoeLoser self-assigned this May 12, 2026

msaelices commented May 12, 2026

@JoeLoser's stack covers item 1 from his decomposition (METH_FASTCALL), with the full integration into the high-level def_function[user_func] that this PR explicitly deferred. Yielding, will close this PR once his stack lands publicly. Pivoting to item 6 (typed-arg fast paths) which he explicitly hasn't looked at.

msaelices added 2 commits May 12, 2026 19:51
The rest of the python bindings code uses 'wrapper' uniformly
(_py_c_function_wrapper, _tp_dealloc_wrapper, _tp_repr_wrapper,
_py_new_function_wrapper, ...). 'Trampoline' was jargon I introduced;
swap it back to match the project's terminology.

Assisted-by: AI
Signed-off-by: Manuel Saelices <[email protected]>