Skip to content

Improve lark-cython compatibility #1528

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 10 commits into
base: master
Choose a base branch
from

Conversation

ornariece
Copy link
Contributor

@ornariece ornariece commented Apr 22, 2025

This PR addresses two significant compatibility issues when using lark-cython:

1. Fix Token handling for lark-cython

In the standard lark implementation, Token inherits from str, allowing string methods to be called directly on Token objects. However, lark-cython implements Token differently - it doesn't inherit from str, which means string methods like rsplit() won't work on token objects.

Changes:

  • Replaced direct string method calls on Token instances with calls on token.value instead

2. Refactor postlexer handling architecture

The original implementation used a PostLexConnector class that acted as a wrapper around a lexer but only implemented part of the Lexer interface (the lex() method but not next_token()). This partial implementation worked in standard lark through duck typing but caused errors in lark-cython's more strict implementation.

Changes:

  • Removed the use of PostLexConnector class
  • Added a self.postlex attribute to store the postlexer separately
  • Preserves the stream-based design of PostLex while ensuring proper interface implementation

This approach is more consistent with the architectural intent of postlexers as stream processors, while avoiding type compatibility issues in lark-cython.

return self.postlex.process(tokens)

def __copy__(self):
return type(self)(self.lexer, copy(self.state), self.postlex)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we sure that the postlexer shouldn't be copied too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i would say so? i see it as some kind of processor, it holds no data per se about the current states. i could be wrong.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although, tbh, those shouldn't be instance attributes, but local variables inside of process. I think this postlexer design should already be broken on the current main if this copy method is called since lexer=PostLexConnector doesn't get copied either.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i see, i agree those don't really make sense as instance attributes. it makes sense that a postlexer instance should be agnostic to a stream of tokens. this matter is probably beyond the scope of this PR though.

@erezsh
Copy link
Member

erezsh commented Apr 23, 2025

The "Python type check" test should now work properly if you rebase over master.

@erezsh
Copy link
Member

erezsh commented Apr 29, 2025

Overall looks good, but I have a few questions -

  • Did you consider implementing PostLex fully instead? I saw you wrote "This approach is more consistent with the architectural intent of postlexers as stream processors", can you elaborate on that a bit more please?

  • In your implementation, PostLexThread can accept a None postlex. When would that happen?

Also, I still think copy() should make a copy of the postlex too, since it's not an object we have control of. (and if there's no data to copy, the cost is negligible anyway)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants