@dpgeorge dpgeorge commented Jan 6, 2026

Summary

This is an alternative to #17557 that aims to implement t-strings more efficiently (smaller code size) by leveraging the existing f-string parser in the lexer. It includes:

  • t-string parsing in py/lexer.c
  • new built-in __template__() function to construct t-string objects
  • new built-in Template and Interpolation classes which implement all the functionality from PEP 750
  • new built-in string module with templatelib sub-module, which contains the classes Template and Interpolation

This PR is built upon #18588.

The way it works is that an input t-string like:

t"hello {name:5}"

is converted character-by-character by the lexer/tokenizer to:

__template__(("hello ", "",), name, "name", None, "5")

(For reference, if it were an f-string it would be converted to "hello {:5}".format(name).)

Compared to #17557 which costs about +7400 bytes on stm32, this implementation costs +2844 bytes.

This is still a work-in-progress. It implements most of the t-string functionality including nested t-strings and f-strings, but there are a few corner cases yet to tidy up. I don't see any show stoppers though, and code size should hopefully not grow much more either.

Testing

All 16 tests from #17557 have been added here. So far 11 of them pass, and 1 is no longer relevant (it tests a runtime overflow limit that no longer exists).

Trade-offs and Alternatives

As an alternative to #17557, this PR shows a different way to achieve the same end result. #17557 starts up a new parser instance each time a t-string is encountered and recursively parses the t-string, whereas the implementation here just transforms the input characters. After all, t-strings (and f-strings) are really just syntactic sugar.

This adds code size, but if t-strings are not used there is very little execution overhead, and all of it is confined to the lexer.

The changes to py/lexer.c are mildly complex, but not really much more complex than the existing f-string logic. It's just a different way of transforming the input stream.
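As a rough illustration of what "transforming the input characters" means, the Python sketch below rewrites the text of one simple t-string literal into the text of a `__template__()` call. It is only a toy under stated assumptions, not the logic in py/lexer.c: the function name `rewrite_tstring` is made up, only double-quoted literals are accepted, and escaped braces, nested fields, and expressions containing `:` or `!` are not handled.

```python
def rewrite_tstring(src: str) -> str:
    """Rewrite the text of t"..." into the text of a __template__(...) call."""
    if not (src.startswith('t"') and src.endswith('"') and len(src) >= 3):
        raise ValueError('expected a t"..." literal')
    body = src[2:-1]

    strings = []   # literal segments between the fields
    fields = []    # (expression, conversion, format_spec) per field
    literal = ""
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "{":
            end = body.index("}", i)            # toy: no nested braces
            field = body[i + 1:end]
            expr, _, spec = field.partition(":")
            expr, _, conv = expr.partition("!")
            strings.append(literal)
            literal = ""
            fields.append((expr, conv or None, spec))
            i = end + 1
        else:
            literal += ch
            i += 1
    strings.append(literal)

    # First argument: tuple of literal segments (always one more than fields).
    args = ["(" + "".join(repr(s) + ", " for s in strings) + ")"]
    for expr, conv, spec in fields:
        # The expression is emitted twice: once as code, once as its source text.
        args += [expr, repr(expr), repr(conv), repr(spec)]
    return "__template__(" + ", ".join(args) + ")"


call_text = rewrite_tstring('t"hello {name:5}"')
```

For the input above, `call_text` is `__template__(('hello ', '', ), name, 'name', None, '5')`, which has the same shape as the call shown in the Summary. No second parser is involved: the output is ordinary source text that the normal tokenizer can consume.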

@dpgeorge dpgeorge added the py-core Relates to py/ directory in source label Jan 6, 2026
@dpgeorge dpgeorge force-pushed the py-implement-tstrings branch 2 times, most recently from 9395826 to 7c6e8e2 Compare January 6, 2026 13:26
codecov bot commented Jan 6, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 98.41%. Comparing base (26c1696) to head (e61ae6d).

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #18650      +/-   ##
==========================================
+ Coverage   98.38%   98.41%   +0.03%     
==========================================
  Files         171      172       +1     
  Lines       22298    22606     +308     
==========================================
+ Hits        21937    22247     +310     
+ Misses        361      359       -2     

dpgeorge and others added 10 commits January 7, 2026 11:59
This saves about 4 bytes on ARM Cortex-M, and about 50-60 bytes on x86-64.
It also allows the upcoming `vstr_ins_strn()` function to be inline as
well, and have less of a code-size impact when used.

Signed-off-by: Damien George <[email protected]>
This is now an easy function to define as inline, so it does not impact
code size unless it's used.

Signed-off-by: Damien George <[email protected]>
Having this check costs code size and execution time, and it's not
necessary: all callers of this function pass a non-zero value for
`byte_len` already.  And even if `byte_len` were zero, the code would still
perform correctly.

Signed-off-by: Damien George <[email protected]>
The null byte cannot exist in source code (per CPython), so use it to
indicate the end of the input stream (instead of `(mp_uint_t)-1`).  This
allows the cache chars (chr0/1/2 and their saved versions) to be 8-bit
bytes, making it clear that they are not `unichar` values.  It also saves a
bit of memory in the `mp_lexer_t` data structure.  (And in a future commit
allows the saved cache chars to be eliminated entirely by storing them in
a vstr instead.)

In order to keep code size down, the frequently used `chr0` is still of
type `uint32_t`.  Having it 32-bit means that machine instructions to load
it are smaller (it adds about +80 bytes to Thumb code if `chr0` is changed
to `uint8_t`).

Also add tests for invalid bytes in the input stream to make sure there are
no regressions in this regard.
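The sentinel idea can be sketched in a few lines of Python. This is an illustration only, not the lexer's C code: the class name `CharCache`, the constant `EOF`, and the choice to reject NUL bytes up front with `ValueError` are all assumptions made for the sketch. Only the `chr0`/`chr1`/`chr2` names come from the commit message.

```python
EOF = 0  # the null byte doubles as the end-of-input marker


class CharCache:
    """Three-byte lookahead window (chr0, chr1, chr2) over a byte source."""

    def __init__(self, data: bytes):
        if EOF in data:
            # Sketch-only choice: reject NUL so it can never be mistaken for EOF.
            raise ValueError("source cannot contain null bytes")
        self._it = iter(data)
        self.chr0 = next(self._it, EOF)
        self.chr1 = next(self._it, EOF)
        self.chr2 = next(self._it, EOF)

    def advance(self) -> int:
        """Shift the window by one byte and return the byte that was consumed."""
        consumed = self.chr0
        self.chr0 = self.chr1
        self.chr1 = self.chr2
        self.chr2 = next(self._it, EOF)
        return consumed


cache = CharCache(b"ab")
seen = []
while cache.chr0 != EOF:
    seen.append(cache.advance())
```

Because the marker is itself a byte value, every cached character fits in the range 0 to 255, which is the property that lets the cached characters be stored as 8-bit values.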

Signed-off-by: Damien George <[email protected]>
It turns out that it's relatively simple to support nested f-strings, which
is what this commit implements.

The way the MicroPython f-string parser works at the moment is:
1. it extracts the f-string arguments (things in curly braces) into a
   temporary buffer (a vstr)
2. once the f-string ends (reaches its closing quote) the lexer switches to
   tokenizing the temporary buffer
3. once the buffer is empty it switches back to the stream.

The temporary buffer can easily hold f-strings itself (ie nested f-strings)
and they can be re-parsed by the lexer using the same algorithm.  The only
thing stopping that from working is that the temporary buffer can't be
reused for the nested f-string because it's currently being parsed.

This commit fixes that by adding a second temporary buffer, which is the
"injection" buffer.  That allows an arbitrary number of nesting levels with a
simple modification to the original algorithm:
1. when an f-string is encountered the string is parsed and its arguments
   are extracted into `fstring_args`
2. when the f-string finishes, `fstring_args` is inserted into the current
   position in `inject_chrs` (which is the start of that buffer if no
   injection is ongoing)
3. `fstring_args` is now cleared and ready for any further f-strings
   (nested or not)
4. the lexer switches to `inject_chrs` if it's not already reading from it
5. if an f-string appeared inside the f-string then it is in `inject_chrs`
   and can be processed as before, extracting its arguments into
   `fstring_args`, which can then be inserted again into `inject_chrs`
6. once `inject_chrs` is exhausted (meaning that all levels of f-strings
   have been fully processed) the lexer switches back to tokenizing the
   stream.

Amazingly, this scheme supports arbitrary numbers of nestings of f-strings
using the same quote style.
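The six steps can be simulated with a small Python sketch that rewrites source text instead of producing tokens. It is a simplified model, not the C lexer: the class `FStringRewriter` and its methods are hypothetical, format specs and conversions are ignored, fields are delimited purely by counting braces, and any `f` followed by a quote is treated as the start of an f-string. Only the names `fstring_args` and `inject_chrs` are taken from the commit message.

```python
EOF = "\0"  # end-of-input marker for this sketch


class FStringRewriter:
    def __init__(self, source: str):
        self.stream, self.pos = source, 0
        self.inject_chrs, self.inject_pos = "", 0

    def peek(self, k: int = 0) -> str:
        """Look k characters ahead, reading inject_chrs before the stream."""
        pending = len(self.inject_chrs) - self.inject_pos
        if k < pending:
            return self.inject_chrs[self.inject_pos + k]
        i = self.pos + k - pending
        return self.stream[i] if i < len(self.stream) else EOF

    def advance(self) -> str:
        ch = self.peek()
        if self.inject_pos < len(self.inject_chrs):
            self.inject_pos += 1
            if self.inject_pos == len(self.inject_chrs):
                # Step 6: buffer exhausted, switch back to the stream.
                self.inject_chrs, self.inject_pos = "", 0
        elif ch != EOF:
            self.pos += 1
        return ch

    def read_fstring(self) -> str:
        quote = self.advance()
        literal, fstring_args = [quote], []
        while (ch := self.advance()) != quote:
            if ch == EOF:
                raise SyntaxError("unterminated f-string")
            if ch in "{}" and self.peek() == ch:
                literal.append(ch + self.advance())  # escaped {{ or }}
            elif ch == "{":
                # Step 1: copy the field verbatim, tracking only brace depth,
                # so quotes of a nested f-string are copied like any other char.
                depth, expr = 1, []
                while depth:
                    c = self.advance()
                    if c == EOF:
                        raise SyntaxError("unterminated f-string field")
                    depth += (c == "{") - (c == "}")
                    if depth:
                        expr.append(c)
                fstring_args.append("".join(expr))
                literal.append("{}")
            else:
                literal.append(ch)
        literal.append(quote)
        # Steps 2-3: insert the arguments at the current position of
        # inject_chrs; fstring_args is local, so it is fresh for the next one.
        args = ".format(" + ", ".join(fstring_args) + ")"
        p = self.inject_pos
        self.inject_chrs = self.inject_chrs[:p] + args + self.inject_chrs[p:]
        return "".join(literal)

    def rewrite(self) -> str:
        out = []
        while (ch := self.peek()) != EOF:
            # Steps 4-5: peek() already prefers inject_chrs, so a nested
            # f-string found there is handled by the same code path.
            if ch == "f" and self.peek(1) in ("'", '"'):
                self.advance()
                out.append(self.read_fstring())
            else:
                out.append(self.advance())
        return "".join(out)


rewritten = FStringRewriter('f"a{f"b{y}"}c"').rewrite()
print(rewritten)  # "a{}c".format("b{}".format(y))
```

Same-quote nesting works in this model because field extraction never looks at quote characters: the inner `f"b{y}"` is copied into the outer arguments as-is and is only interpreted as an f-string once it is read back from `inject_chrs`.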

This adds some code size and a bit more memory usage for the lexer.  In
particular for a single (non-nested) f-string it now makes an extra copy of
the `fstring_args` data, when copying it across to `inject_chrs`.
Otherwise, memory use only goes up with the complexity of nested f-strings.

Signed-off-by: Damien George <[email protected]>
This way, the use of `lex->fstring_args` is fully self contained within the
string literal parsing section of `mp_lexer_to_next()`.

Signed-off-by: Damien George <[email protected]>
@dpgeorge dpgeorge force-pushed the py-implement-tstrings branch from f103117 to 9916c48 Compare January 7, 2026 01:00
github-actions bot commented Jan 7, 2026

Code size report:

Reference:  github/workflows: Use same Ubuntu for code_size as ports_esp32. [26c1696]
Comparison: py/lexer: Improve t-string edge cases. [merge of e61ae6d]
  mpy-cross: +1832 +0.485% [incl +96(data)]
   bare-arm:   -12 -0.021% 
minimal x86:   -94 -0.050% 
   unix x64: +5768 +0.673% standard[incl +416(data)]
      stm32: +2964 +0.751% PYBV10
      esp32: +2972 +0.170% ESP32_GENERIC[incl +480(data)]
     mimxrt: +3000 +0.798% TEENSY40
        rp2: +2848 +0.310% RPI_PICO_W
       samd: +2956 +1.088% ADAFRUIT_ITSYBITSY_M4_EXPRESS
  qemu rv32: +3036 +0.666% VIRT_RV32

@dpgeorge dpgeorge force-pushed the py-implement-tstrings branch from 9916c48 to ac21499 Compare January 7, 2026 01:32
This now works in MicroPython.

Signed-off-by: Damien George <[email protected]>
Now OK in MicroPython.

Signed-off-by: Damien George <[email protected]>
Not worth supporting.

Signed-off-by: Damien George <[email protected]>
Not worth supporting.

Signed-off-by: Damien George <[email protected]>
Reusing the existing f-string parser in the lexer.

Signed-off-by: Damien George <[email protected]>
Signed-off-by: Damien George <[email protected]>
@dpgeorge dpgeorge force-pushed the py-implement-tstrings branch from ac21499 to e61ae6d Compare January 9, 2026 03:34