Rewrite Scala lexer for Scala 3 #1694
pygments/lexers/jvm.py
    idrest = '%s(?:%s|[0-9])*(?:(?<=_)%s)?' % (letter, letter, op)
    letter_letter_digit = '%s(?:%s|\\d)*' % (letter, letter)
    opchar = (u'[!#%&*+\\-\\/:<>=?@^|~\u00a6-\u00a7\u00a9\u00ac\u00ae\u00b0-\u00b1'
Please don't reintroduce the "u" prefix.
Also, are these derived from Unicode categories? If yes, please use the existing lists from pygments.unistring.
Fixed.
Those were indeed Unicode categories. They were carried over from the old lexer, but using pygments.unistring is much nicer.
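As a rough illustration of the suggestion above, character classes can be assembled from the category strings in pygments.unistring instead of hand-written \uXXXX ranges. This is a sketch, not the code from the PR: the names letter and opchar mirror the snippet under review, but the categories chosen here (Ll, Lu, Lo, Lt, Nl for letters; Sm, So for operator characters) and the ASCII characters listed explicitly are assumptions.

```python
import re

from pygments import unistring as uni

# Assumed category choices, for illustration only.
# uni.combine() concatenates the per-category strings, which are written
# so that they can be dropped inside a regex character class.
letter = '[_$' + uni.combine('Ll', 'Lu', 'Lo', 'Lt', 'Nl') + ']'
opchar = '[!#%&*\\-/:?~^' + uni.combine('Sm', 'So') + ']'

# No "u" prefix is needed: every str literal is Unicode in Python 3.
ident = re.compile('%s(?:%s|[0-9])*' % (letter, letter))
operator = re.compile('%s+' % opchar)
```

With these definitions, `ident` accepts names containing non-ASCII letters and `operator` accepts symbolic names built from ASCII and Unicode math symbols.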
pygments/lexers/jvm.py
    plainid = u'(?:%s|%s+)' % (idrest, opchar)
    backQuotedId = r'`[^`]+`'
    anyId = u'(?:%s|%s)' % (plainid, backQuotedId)
    endOfLineMaybeWithComment = r'(?=\s*(//.*|/\*(?!.*\*/\s*\S.*).*)?$)'
This seems pretty complex. In particular, the .*\*/ can consume the whole text because of DOTALL.
Is recognizing these comments really necessary? Keep in mind that Pygments is a highlighter: 100% accuracy is not required, and speed is more important.
Recognizing the end of the line is necessary, as end is a soft keyword that should only be highlighted in certain situations.
I think placing a comment after an end might happen, especially in educational material that explains this end syntax. However, the precision with which the regex matched block comments (all of this to avoid a false positive) was not necessary, and I was able to simplify it significantly.
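To make the trade-off concrete, here is a minimal sketch of a cheaper lookahead that accepts an end marker only when the rest of the line is blank or starts a comment. The pattern and the helper name `is_end_marker` are hypothetical and are not the regex that ended up in the PR. It deliberately does not check where a block comment closes, so a line such as `end Foo /* x */ bar` is accepted as a false positive.

```python
import re

# Hypothetical, simplified lookahead: only blanks may follow, then either
# the start of a comment or the end of the line. No DOTALL and no scan
# for the closing "*/", so the match can never run past the current line.
END_OF_LINE_MAYBE_WITH_COMMENT = r'(?=[ \t]*(?://|/\*|$))'

END_MARKER = re.compile(
    r'\bend[ \t]+(\w+)\b' + END_OF_LINE_MAYBE_WITH_COMMENT,
    re.MULTILINE,
)


def is_end_marker(line):
    """Return True if the line contains an `end <name>` marker at its tail."""
    return END_MARKER.search(line) is not None
```

Under this sketch, `end Foo` and `end Foo // closes Foo` count as markers, while `val end = 3` leaves end as an ordinary identifier.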
            (r'[{}()\[\];,.]', Punctuation),
            (r'(?<!:):(?!:)', Punctuation),
        ],
        'keywords': [
These should all use words(), which auto-escapes and optimizes the regex. The same goes for the operators.
Done. I've also changed the storage modifiers to use words().
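For readers unfamiliar with words(), the following sketch shows how it is typically used in a RegexLexer token table. The lexer class `TinyScalaLikeLexer` and its short keyword and operator lists are made up for this example and are far smaller than the real Scala lexer's; only `words`, `RegexLexer` and the token types come from Pygments.

```python
from pygments.lexer import RegexLexer, words
from pygments.token import Keyword, Name, Operator, Whitespace


class TinyScalaLikeLexer(RegexLexer):
    """Hypothetical toy lexer, only meant to demonstrate words()."""

    name = 'TinyScalaLike'

    tokens = {
        'root': [
            (r'\s+', Whitespace),
            # words() escapes regex metacharacters such as "|" itself and
            # builds one optimized alternation from the tuple.
            (words(('=>', '<-', '=', ':', '|')), Operator),
            # prefix/suffix keep "val" from matching inside "value".
            (words(('val', 'var', 'def', 'class', 'object'),
                   prefix=r'\b', suffix=r'\b'), Keyword),
            (r'\w+', Name),
        ],
    }


def lex(source):
    """Return the (token type, text) pairs produced for the source string."""
    return list(TinyScalaLikeLexer().get_tokens(source))
```

Calling `lex('val value = x')` should tag `val` as a keyword and `value` as a plain name, since the `\b` suffix prevents a partial match.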
LGTM now. Thanks!
With the upcoming release of Scala 3 (currently available as a milestone release at lampepfl/dotty), the Scala lexer needed to be updated for the new language version.
This new implementation is inspired by that of scala/vscode-scala-syntax. It supports both Scala 2 and Scala 3, which share a lot of syntax. Scala 3 can be written in either an indentation-based syntax, or a curly-brace-based syntax (which was the only syntax variant supported in Scala 2). This new lexer implementation supports both variants.
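A quick way to try the lexer on both syntax variants is sketched below. It assumes a Pygments installation that ships `ScalaLexer`; whether the installed version already contains the Scala 3 support from this PR depends on the release. The two sample programs and the helper `lex` are made up for illustration.

```python
from pygments.lexers import ScalaLexer

# Curly-brace-based syntax (valid in Scala 2 and Scala 3).
BRACES = '''\
object Main {
  def run(): Unit = {
    println("hi")
  }
}
'''

# Indentation-based syntax (Scala 3 only), with an `end` marker.
INDENTED = '''\
object Main:
  def run(): Unit =
    println("hi")
end Main
'''


def lex(source):
    """Return (token type name, text) pairs for a Scala source string."""
    return [(str(tok), text) for tok, text in ScalaLexer().get_tokens(source)]


if __name__ == '__main__':
    for source in (BRACES, INDENTED):
        for tok, text in lex(source):
            print(tok, repr(text))
```

Printing the token stream for each variant makes it easy to check by eye how soft keywords such as end are classified.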
Fixes #1035. Very likely fixes #1121, although I cannot definitively confirm this, as the link to the bug reproduction is dead.