py: Implement PEP 750 t-strings using existing f-string parser (WIP) #18650
Draft: dpgeorge wants to merge 22 commits into micropython:master from dpgeorge:py-implement-tstrings
+3,999 −112
Conversation
Codecov Report: ✅ All modified and coverable lines are covered by tests.

```
@@            Coverage Diff             @@
##           master   #18650      +/-   ##
==========================================
+ Coverage   98.38%   98.41%   +0.03%
==========================================
  Files         171      172       +1
  Lines       22298    22606     +308
==========================================
+ Hits        21937    22247     +310
+ Misses        361      359       -2
```

View full report in Codecov by Sentry.
This saves about 4 bytes on ARM Cortex-M, and about 50-60 bytes on x86-64. It also allows the upcoming `vstr_ins_strn()` function to be inline as well, and have less of a code-size impact when used.

Signed-off-by: Damien George <[email protected]>

This is now an easy function to define as inline, so it does not impact code size unless it's used.

Signed-off-by: Damien George <[email protected]>
This check costs code size and execution time, and it's not necessary: all callers of this function already pass a non-zero value for `byte_len`. And even if `byte_len` were zero, the code would still perform correctly.

Signed-off-by: Damien George <[email protected]>
The null byte cannot exist in source code (per CPython), so use it to indicate the end of the input stream (instead of `(mp_uint_t)-1`). This allows the cache chars (chr0/1/2 and their saved versions) to be 8-bit bytes, making it clear that they are not `unichar` values. It also saves a bit of memory in the `mp_lexer_t` data structure. (And in a future commit allows the saved cache chars to be eliminated entirely by storing them in a vstr instead.)

In order to keep code size down, the frequently used `chr0` is still of type `uint32_t`. Having it 32-bit means that machine instructions to load it are smaller (changing `chr0` to `uint8_t` adds about +80 bytes to Thumb code).

Also add tests for invalid bytes in the input stream to make sure there are no regressions in this regard.

Signed-off-by: Damien George <[email protected]>
It turns out that it's relatively simple to support nested f-strings, which is what this commit implements.

The MicroPython f-string parser currently works as follows:

1. it extracts the f-string arguments (things in curly braces) into a temporary buffer (a vstr);
2. once the f-string ends (reaches its closing quote), the lexer switches to tokenizing the temporary buffer;
3. once the buffer is empty, it switches back to the stream.

The temporary buffer can easily hold f-strings itself (ie nested f-strings) and they can be re-parsed by the lexer using the same algorithm. The only thing stopping that from working is that the temporary buffer can't be reused for the nested f-string, because it's currently being parsed. This commit fixes that by adding a second temporary buffer, the "injection" buffer. That allows an arbitrary number of nestings with a simple modification to the original algorithm:

1. when an f-string is encountered, the string is parsed and its arguments are extracted into `fstring_args`;
2. when the f-string finishes, `fstring_args` is inserted at the current position in `inject_chrs` (which is the start of that buffer if no injection is ongoing);
3. `fstring_args` is now cleared and ready for any further f-strings (nested or not);
4. the lexer switches to `inject_chrs` if it's not already reading from it;
5. if an f-string appeared inside the f-string then it is in `inject_chrs` and can be processed as before, extracting its arguments into `fstring_args`, which can then be inserted again into `inject_chrs`;
6. once `inject_chrs` is exhausted (meaning that all levels of f-strings have been fully processed), the lexer switches back to tokenizing the stream.

Amazingly, this scheme supports arbitrary levels of nesting of f-strings using the same quote style.

This adds some code size and a bit more memory usage for the lexer. In particular, for a single (non-nested) f-string it now makes an extra copy of the `fstring_args` data when copying it across to `inject_chrs`. Otherwise, memory use only grows with the complexity of nested f-strings.

Signed-off-by: Damien George <[email protected]>
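The two-buffer scheme above can be sketched in Python. This is a hypothetical illustration, not the actual C implementation in `py/lexer.c`: `inject` stands in for `inject_chrs`, the extracted arguments play the role of `fstring_args`, and only double-quoted f-strings without escapes or format specs are handled. Rewritten text is spliced into the injection buffer so that nested f-strings are re-scanned by the same loop rather than by recursion.

```python
def transform(src):
    inject = []  # inject_chrs: characters pending re-tokenization
    pos = 0      # position in the real input stream

    def getc():
        nonlocal pos
        if inject:
            return inject.pop(0)
        if pos < len(src):
            pos += 1
            return src[pos - 1]
        return ""  # end of input

    out = []
    while True:
        c = getc()
        if c == "":
            break
        if c == "f":
            q = getc()
            if q != '"':
                # not an f-string: emit 'f' and push the lookahead back
                out.append(c)
                if q:
                    inject.insert(0, q)
                continue
            # inside an f-string: build the literal with {} placeholders
            # and extract the arguments (the role of fstring_args)
            lit, fargs = ['"'], []
            while True:
                c = getc()
                if c == '"':
                    lit.append('"')
                    break
                if c == "{":
                    depth, arg = 1, []
                    while depth:
                        c = getc()
                        if c == "{":
                            depth += 1
                        elif c == "}":
                            depth -= 1
                        if depth:
                            arg.append(c)
                    fargs.append("".join(arg))
                    lit.append("{}")
                else:
                    lit.append(c)
            # splice the rewritten text into the injection buffer; any
            # nested f-string inside the arguments is re-scanned here
            rewritten = "".join(lit) + ".format(" + ", ".join(fargs) + ")"
            inject[0:0] = list(rewritten)
        else:
            out.append(c)
    return "".join(out)
```

Note that `transform('f"{f"{y}"}"')` yields `'"{}".format("{}".format(y))'`, mirroring the claim above that nesting works even with the same quote style, because argument extraction only tracks brace depth.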
This way, the use of `lex->fstring_args` is fully self-contained within the string-literal parsing section of `mp_lexer_to_next()`.

Signed-off-by: Damien George <[email protected]>
Signed-off-by: Koudai Aono <[email protected]>
Signed-off-by: Koudai Aono <[email protected]>
Signed-off-by: Damien George <[email protected]>
Signed-off-by: Damien George <[email protected]>
Code size report:
Signed-off-by: Damien George <[email protected]>
This now works in MicroPython. Signed-off-by: Damien George <[email protected]>
Signed-off-by: Damien George <[email protected]>
Signed-off-by: Damien George <[email protected]>
Now OK in MicroPython. Signed-off-by: Damien George <[email protected]>
Signed-off-by: Damien George <[email protected]>
Not worth supporting. Signed-off-by: Damien George <[email protected]>
Not worth supporting. Signed-off-by: Damien George <[email protected]>
Reusing the existing f-string parser in the lexer. Signed-off-by: Damien George <[email protected]>
Signed-off-by: Damien George <[email protected]>
Signed-off-by: Damien George <[email protected]>
Signed-off-by: Damien George <[email protected]>
Summary
This is an alternative to #17557 which aims to implement t-strings in a more efficient way (less code size), leveraging the existing f-string parser in the lexer. It includes:
- changes to `py/lexer.c` to parse t-strings
- a `__template__()` function to construct t-string objects
- `Template` and `Interpolation` classes which implement all the functionality from PEP 750
- a `string` module with a `templatelib` sub-module, which contains the classes `Template` and `Interpolation`

This PR is built upon #18588.
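To show the shape of the objects a t-string evaluates to, here is a rough Python sketch of the PEP 750 interfaces named above. It is a simplified stand-in, not this PR's implementation: attribute names follow PEP 750, but validation and the conversion step (`!r`/`!s`/`!a`) are omitted.

```python
class Interpolation:
    def __init__(self, value, expression, conversion=None, format_spec=""):
        self.value = value            # the evaluated result of the expression
        self.expression = expression  # the source text inside the braces
        self.conversion = conversion  # None, or one of "r", "s", "a"
        self.format_spec = format_spec

class Template:
    def __init__(self, strings, interpolations):
        # static string parts; always one more than the interpolations
        self.strings = strings
        self.interpolations = interpolations

    def __iter__(self):
        # interleave static strings and interpolations, skipping
        # empty string parts
        for s, interp in zip(self.strings, self.interpolations):
            if s:
                yield s
            yield interp
        if self.strings[-1]:
            yield self.strings[-1]

# e.g. what a t-string like t"hello {name:5}" could conceptually
# evaluate to, and how a consumer might render it:
name = "world!"
t = Template(("hello ", ""), (Interpolation(name, "name", None, "5"),))
rendered = "".join(
    part if isinstance(part, str) else format(part.value, part.format_spec)
    for part in t
)
```

Unlike an f-string, the template defers formatting to the consumer, which is what allows libraries to inspect `expression` and `format_spec` before deciding how to render each value.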
The way it works is that the lexer/tokenizer converts an input t-string such as `t"hello {name:5}"` character by character into a call to the `__template__()` constructor. (For reference, the equivalent f-string would be converted to `"hello {:5}".format(name)`.)

Compared to #17557, which costs about +7400 bytes on stm32, this implementation costs +2844 bytes.
This is still a work-in-progress. It implements most of the t-string functionality, including nested t-strings and f-strings, but there are a few corner cases yet to tidy up. I don't see any show-stoppers, though, and code size should hopefully not grow much more.
Testing
All 16 tests from #17557 have been added here. So far 11 of them pass, and 1 is no longer relevant (testing runtime overflow limit which is no longer there).
Trade-offs and Alternatives
Being an alternative to #17557, it shows a different way to achieve the same end result. #17557 starts up a new parser instance each time a t-string is encountered and recursively parses the t-string, whereas the implementation here just transforms the input characters. After all, t-strings (and f-strings) are really just syntactic sugar.
This adds code size, but if t-strings are not used then there is very little execution overhead, all of which is contained within the lexer.
The changes to `py/lexer.c` are mildly complex, but not really much more complex than the existing f-string logic. It's just a different way of transforming the input stream.