Skip to content

Rethinking the import system #1516

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 6 commits into
base: master
Choose a base branch
from
Open

Conversation

Niki4tap
Copy link

@Niki4tap Niki4tap commented Mar 8, 2025

Hello Lark developers!

I've tried using this library in a personal project, and really liked it, thank you for it!

The problem

...except one thing, which was grammar composition. Lark docs make you think it has a nice module system where you can just import rules and terminals and everything magically works... but for some reason it doesn't.

Rules don't compose well since when you %import them, the grammar module that imported the rule is now responsible for parsing it since the rule has namespace of the current grammar module assigned to it, and funnily enough, every rule that it depends on still gets the imported grammar module prefix.

So if you had a layout like this:

# lib.lark

b: "b"
a: b
# main.lark
%import lib.a

start: a

Parse tree for "b" would look like:

start
	a
		lib__b

which makes approximately 0 sense to me 😄 (this has been discussed before in lark, and in my opinion the "official solution" came out looking as more of a workaround than something to be used and recommended to others)

Main grammar is supposed to parse imported terminal, but none of its children. Which is a very odd way of making the import system work. For example, if multiple grammar modules import the same terminal, they would all be responsible for parsing it, but not its children, which in transformer implementation terms means that importing transformers would have to inherit the implementation for parsing the imported terminal, and then on top of that get merge_transformers'd with the lib transformer (which also wouldn't be responsible for parsing one of its own rules?).

All of this makes working with multiple grammars very inconvenient.

The solution

%use directive.

More is explained in added docs, but here is a quick excerpt:

Imports terminals and rules just as %import does, but also puts the imported rule itself in its namespace and automatically aliases it.

Useful for importing rules that this chunk of grammar uses, but does not define a transformer for (isn't responsible for parsing it).

Changing %import to %use in the example above, and parsing same "b" will now result in:

start
	lib__a
		lib__b

This allows to simply compose transformers with merge_transformers and not worry about inheritance, mixins, or anything else for such a simple example!

Compatibility with %import

%use is currently incompatible with %import, i. e. you cannot do this:

# mixing both not allowed
%import a.b
%use a.c

or in the other direction, but you can import/use on different modules:

# this is ok!
%import a.b
%use b.c

The implementation

I am not going to lie, the implementation could use some design work, but right now it works as is.

Any comments or ideas on how to make it more robust will be appreciated as I am new to this codebase.

Currently it creates new group of aliases called use_aliases and uses that to rewrite all _defines, and that's about it. I am a bit worried it has some hidden bugs I haven't considered, but I hope for them to get ironed out during the code review :)

Other concerns

Of course, all of these issues could be solved with

...using inheritance? Or mixins? Python already has a few mechanisms (and libraries) for that sort of composition.1

But in my opinion, lark shouldn't impose a hell of mixins, inheritance, and a ton of other OOP boilerplate on its users just to do such a simple thing as basic grammar composition, especially when it can be "just fixed" in a hundred lines or so in lark itself, for basically no cost.

Another option would be to enforce manually prefixing imported rules, which doesn't scale. Like at all.

And I guess you could always just string together all files and implement all rules and terminals in one transformer, but then you would have to manually namespace all rules and terminals, and their uses, not just %imports.

Lastly...

This isn't supposed to be final, I'm open to comments and suggestions! The tests aren't supposed to be final either, I can write some more, the only one I added is just supposed to show how it all works :)

P.S.: I've also added minimal .editorconfig to the repository since my editor always wanted to convert everything to tabs, maybe it will also help other contributors

Footnotes

  1. https://github.com/lark-parser/lark/issues/531#issuecomment-593422234

@erezsh
Copy link
Member

erezsh commented Mar 8, 2025

Hi @Niki4tap !

It's possible that the current design isn't ideal. IIRC when I made the choice, I was trying to match the behavior of importing terminals, and I didn't take nested transformers into consideration.

The question remains, whether both behaviors are useful, or only the name spaced behavior is correct. Because if the latter is true, we should just change the behavior of import, instead of introducing a new keyword.

Regarding the implementation, I think it might be a bit over complicated. Unless I'm missing something, "use" can use the existing import mechanism, with the only difference of renaming the imported rule at the end.

@MegaIng thoughts?

@Niki4tap
Copy link
Author

Niki4tap commented Mar 8, 2025

Hi,

The question remains, whether both behaviors are useful, or only the name spaced behavior is correct. Because if the latter is true, we should just change the behavior of import, instead of introducing a new keyword.

Yeah I wasn't sure about this, hence the new keyword, and even if that's the case, I'm not sure about backwards compatibility, maybe something depends on the current behavior.

The renaming should allow for switching back to old behavior if we go with changing the %import, but it might be a bit too inconvenient. With two separate directives it should be fine.

Regarding the implementation, I think it might be a bit over complicated. Unless I'm missing something, "use" can use the existing import mechanism, with the only difference of renaming the imported rule at the end.

Yes, that's what I was trying to do, but I didn't find a way to rename the rule in the current grammar, as changing aliases would seemingly affect only imported grammars, so I felt like I had to implement that myself and that's what I did.

I'll look around some more but if you have any specific pointers, it would be appreciated.

Also I'll look at the CI fails in a bit.

@Niki4tap
Copy link
Author

Niki4tap commented Mar 8, 2025

Had to work around types for a bit, but I think CI should be green now.

@MegaIng
Copy link
Member

MegaIng commented Mar 8, 2025

I think we should at some point reconsider the entire import system. There are a few issues I can think of:

  • weird inconsistencies between importing rules and importing tokens. Some inconsistencies is ok, but it should be clear what they are.
  • Better handling of diamong-shaped imports - these are sometimes broken
  • Better interfacing for subgrammars - i.e. what is mentioned in this PR.

For the last part, I think something like this would be good:

class SubGrammarTransformer(Transformer):
    ...

class MainGrammarTransformer(Transformer):
    sub_grammar = SubGrammarTransformer

## or

class MainGrammarTransformer(Transformer):
    def __init__(self, ...):
        self.sub_grammar = SubGrammarTransformer()

And whatever renaming scheme we decide upon is handled internally.

I haven't read the code of this PR, but I don't like the general idea adding an additional directive - I think having two directives with similar behavior is a bad idea. Instead we should consider how we can make a breaking change here.

@erezsh
Copy link
Member

erezsh commented Mar 8, 2025

In the past we made breaking changes by adding a switch, such as legacy_imports = True, with a deprecation warning, and then once we release v2, say a few months down the line, we could switch it to False. (it doesn't have to be a huge release like v1 was).

I agree we should fix problems with diamond-shaped imports, but it doesn't have to be in this PR.

The biggest "fix" I'd like is for %ignore to be importable (by being local to rules), but that's a big and complicated task.

@Niki4tap
Copy link
Author

Niki4tap commented Mar 9, 2025

weird inconsistencies between importing rules and importing tokens. Some inconsistencies is ok, but it should be clear what they are.

Actually, I don't think I've noticed any?

Better handling of diamong-shaped imports - these are sometimes broken

Again, maybe the luck has spared me, but I think I have a couple in my project, and I haven't had any issues, except aforementioned namespacing troubles.

Better interfacing for subgrammars - i.e. what is mentioned in this PR.
For the last part, I think something like this would be good:

That's how it would look like if the proposed namespacing changes are applied:

class LeafGrammarTransformer(Transformer): ...

class SubGrammarTransformer(Transformer): ...

class MainGrammarTransformer(Transformer): ...

def make_transformer(...) -> Transformer:
	leaf_grammar_transformer = LeafGrammarTransformer()
	sub_grammar_transformer = merge_transformers(SubGrammarTransformer(), leaf = leaf_grammar_transformer)
	return merge_transformers(MainGrammarTransformer(), sub = sub_grammar_transformer, leaf = leaf_grammar_transformer)

And mostly how it looks like right now if your grammar is 1 level deep, if it's more, you still have to do this, but also mixin imported implementations, and manually namespace in the grammar.

I haven't read the code of this PR, but I don't like the general idea adding an additional directive - I think having two directives with similar behavior is a bad idea. Instead we should consider how we can make a breaking change here.

I'm completely fine with changing current %import behavior.

In the past we made breaking changes by adding a switch, such as legacy_imports = True, with a deprecation warning, and then once we release v2, say a few months down the line, we could switch it to False

Sounds fine!

I agree we should fix problems with diamond-shaped imports, but it doesn't have to be in this PR.

I can take a peek at it, if pointed to the problem directly, since I haven't encountered one in the wild.

The biggest "fix" I'd like is for %ignore to be importable (by being local to rules), but that's a big and complicated task.

Yeah, this sounds like a good idea, though I'm not sure how it would work, since %ignores don't define anything %importable. Also, yes, that sounds like a big task since that needs lexer integration and being huge is not the aim of this PR 😅


I can change the implementation of %import to do what %use does in this PR right now (automatic namespacing and aliasing of imported terminals), I think it would be a good start to rethinking the import system, and then we can see if the changes look good to merge or if any other ideas pop up.

Also would be nice to see what tests fail or trip up on this, since we probably want to consider those usecases.

@Niki4tap
Copy link
Author

Niki4tap commented Mar 9, 2025

Dove a bit deeper into the implementation, made the change to %import instead of adding a new directive, and rewrote the older implementation since I thought I would get a bit more insight into the codebase and would end up with a better version, but it seems like I've just reimplemented the old :p

Simple stuff works, but a few tests are failing (33 failed, 959 passed, 134 skipped), I've migrated most of them, but some have weird errors I couldn't quite trace back to the source.

I'll try to fix them as soon as I'll have some more time, but for now feel free to look at the implementation (I probably missed something important)

Failing test cases:

TestGrammar::test_import_custom_sources3
TestGrammar::test_list_grammar_imports
TestGrammar::test_override_rule

# these all fail with basically the same error
Test{Parser}{Lexer}::test_relative_import
Test{Parser}{Lexer}::test_relative_import_rename
Test{Parser}{Lexer}::test_relative_multi_import
Test{Parser}{Lexer}::test_error_with_interactive_parser

@Niki4tap Niki4tap changed the title Add %use directive Rethinking the import system Mar 9, 2025
@erezsh
Copy link
Member

erezsh commented Mar 9, 2025

We do want to be backwards compatible, and changing the tests kind of beats the purpose of having them in the first place :)

@Niki4tap
Copy link
Author

Niki4tap commented Mar 9, 2025

We do want to be backwards compatible, and changing the tests kind of beats the purpose of having them in the first place :)

Oh yeah, I forgot about that, we probably want to test the new implementation as well though, so I might need to do the same thing as with the lexers/parsers, but with old and new imports

@Niki4tap
Copy link
Author

A couple updates:

1. Added an option, and refactored some of the code (again), to allow for legacy imports.

2. Hacked together a solution to run tests that involve importing with both legacy imports and new ones.

With that, only 1 old test is failing, the cache one, I'll look into it hopefully tomorrow. Not sure why, but it seems like it's still trying to load the grammar with legacy_import option set to False, even though I pass it as True. Would love some pointers around the caching system, I did see the _LOAD_ALLOWED_OPTIONS in lark.py but adding in there didn't seem to do much.

3. Well, I've got good news and bad ones:
Good news is: I think I've figured out the reason other tests are failing.
Bad news is: it's complicated.

Imagine a grammar layout like this:

# lib.lark

a: "b"
# nested.lark

%import lib.a
# main.lark
start: a

%import nested.a

Important thing to note is that %import doesn't define a rule/terminal, it only declares it, and while the old system deals with this just fine since it rewrites the names of children items (rules/terminals), I've made the new one to rewrite the rules/terminals it defines instead, and expect the imported names to be correct.

So the new import system would expect definitions to look like:

start
	nested__a
		nested__lib__a

But what happens in reality is:

start
	nested__a # expected
	nested__lib__a # actual (declaration)
		nested__lib__a # (definition)

And it throws an error with nested_a used but not found.

From here I see 2 potential solutions:
1. Make %imports strong, i. e. each %import defines a rule/terminal, not just declare it
2. On grammar load, return defined rules/terminals separately from re-imported ones

I personally prefer the second one (although it might be a little complicated), but I'm not exactly sure about it, and I want to hear some opinions on it.

Related: This is also why I added start rule to some tests, while I don't think the second option would solve it, the first one would. But I think using an %imported rule as a start isn't a good practice anyways.

P. S. I've also added venv to .gitignore, was a little annoying that git kept trying to commit that

@Niki4tap Niki4tap force-pushed the use branch 2 times, most recently from 4368cfe to 61c12c5 Compare March 26, 2025 07:23
@Niki4tap
Copy link
Author

Hello again, I think I've fixed all of the tests now.

Turns out cache test wasn't an issue, but output was different due to defining the start rule.

Interactive parser test just needed the namespacing.

And I worked around the re-import issue by doing none of what I have proposed before, but linking new names to old definitions (override now overwrites the existing definition instead of creating a new one and replacing the old one).

With that all the tests are passing on my machine now and pre-commit also doesn't error so I think the CI should be green again.

Nothing to be done on my side for now, I think this should be ready for a review.

cc @erezsh

@Niki4tap
Copy link
Author

What a weird type error, I don't think I even touched that code in this pr, but should be fixed anyways.

@erezsh
Copy link
Member

erezsh commented Mar 26, 2025

Before I go deeply into the implementation, can you explain in a few words why it took changing so many lines? I mean, if the only difference in behavior is renaming the imported rule, why was it necessary to update so many different parts?

@Niki4tap
Copy link
Author

I've split up the changes across commits, so it would be more reviewable, if you're looking just at the diff number github gives you in the pr it hides a lot of things. I think most of the changes are test-related, the commit with the actual implementation has the diff numbers of +151 -30 and a lot of additions are because I kind of duplicated the load_grammar method to handle both old and new grammar.

@erezsh
Copy link
Member

erezsh commented Mar 26, 2025

For now, I'm talking mostly about the changes in load_grammar.py

I kind of duplicated the load_grammar method

That's part of what I'm asking. Why did you need to duplicate it?

@Niki4tap
Copy link
Author

Ah I guess I didn't necessarily need to, when I was starting out I thought it would be littered with if checks so I thought it would be cleaner to separate them, but looking at it now it doesn't look like a lot is changed, so I can probably unify them if that's preferred.

@erezsh
Copy link
Member

erezsh commented Mar 26, 2025

Yes, minimal code is preferred..

@Niki4tap
Copy link
Author

Unified the function and rebased.

Also set legacy_import to True by default and adapted a few tests I overlooked.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants