-
Notifications
You must be signed in to change notification settings - Fork 471
fix JS regex literal parsing in char classes #7790
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
@cristianoc your eyes would be good here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR fixes JavaScript regex literal parsing when character classes contain special characters that were previously incorrectly interpreted as regex delimiters. The fix ensures proper handling of edge cases like leading ]
characters and /
characters within character classes.
- Implements lookahead logic to validate character class closers before entering class parsing mode
- Adds proper handling of beginning-of-class semantics for literal
]
and^
characters - Updates regex scanner to track character class state and apply correct escape rules
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
File | Description |
---|---|
compiler/syntax/src/res_scanner.ml |
Implements new character class parsing logic with lookahead validation and proper state tracking |
tests/syntax_tests/data/parsing/grammar/expressions/regex.res |
Adds test cases for edge cases involving character classes with special characters |
tests/syntax_tests/data/parsing/grammar/expressions/expected/regex.res.txt |
Updates expected test output to match new parsing behavior |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
let rec loop () = | ||
if !i >= len then false | ||
else | ||
match String.unsafe_get src !i with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using String.unsafe_get
can lead to bounds checking violations. Consider using String.get
instead, or ensure proper bounds checking is performed before this call.
match String.unsafe_get src !i with | |
match String.get src !i with |
Copilot uses AI. Check for mistakes.
else | ||
match String.unsafe_get src !i with | ||
| '\n' | '\r' -> false | ||
| '\\' -> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The escape handling logic increments i
by 2 without checking if i + 1
is within bounds after the increment. Consider adding bounds checking for the second character of the escape sequence.
Copilot uses AI. Check for mistakes.
loop () | ||
in | ||
loop () | ||
in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Starting the scan at from_offset + 1
assumes the character at from_offset
is '[', but this is not validated. If from_offset
doesn't point to '[', the parsing logic may be incorrect.
in | |
(* Validate that from_offset points to '[' *) | |
if from_offset < 0 || from_offset >= len || String.unsafe_get src from_offset <> '[' then | |
false | |
else | |
let i = ref (from_offset + 1) in | |
(* start scanning after current '[' *) | |
let bos = ref true in | |
let rec loop () = | |
if !i >= len then false | |
else | |
match String.unsafe_get src !i with | |
| '\n' | '\r' -> false | |
| '\\' -> | |
if !i + 1 < len then ( | |
i := !i + 2; | |
loop ()) | |
else false | |
| '^' when !bos -> | |
incr i; | |
loop () | |
| ']' when !bos -> | |
(* Leading ']' is literal content; after that, we're no longer at BOS. *) | |
bos := false; | |
incr i; | |
loop () | |
| ']' -> true | |
| _ -> | |
bos := false; | |
incr i; | |
loop () | |
in | |
loop () |
Copilot uses AI. Check for mistakes.
rescript
@rescript/darwin-arm64
@rescript/darwin-x64
@rescript/linux-arm64
@rescript/linux-x64
@rescript/win32-x64
commit: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not familiar with the details of regexps.
At high level the code looks of legit shape, but don't know how to check details. If there are enough tests it should be OK.
let re = /[]]/ | ||
let re = /[\]]/ | ||
let re = /[[]]/ | ||
let re = /[^]/]/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be possible for the compiler to reject /[]/]/
and /[^]/]/
? They're not valid regexes in JS and so would cause a runtime error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally yeah, but I'm not sure how we could do that in a good way. Any ideas?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Claude generated the following (diff is against master
, not this branch):
diff --git i/compiler/syntax/src/res_scanner.ml w/compiler/syntax/src/res_scanner.ml
index c404d36cc..3415fbd02 100644
--- i/compiler/syntax/src/res_scanner.ml
+++ w/compiler/syntax/src/res_scanner.ml
@@ -580,9 +580,9 @@ let scan_regex scanner =
bring_buf_up_to_date ~start_offset:last_char_offset;
Buffer.contents buf)
in
- let rec scan () =
+ let rec scan ?(in_char_class = false) () =
match scanner.ch with
- | '/' ->
+ | '/' when not in_char_class ->
let last_char_offset = scanner.offset in
next scanner;
let pattern = result ~first_char_offset ~last_char_offset in
@@ -606,10 +606,16 @@ let scan_regex scanner =
| '\\' ->
next scanner;
next scanner;
- scan ()
+ scan ~in_char_class ()
+ | '[' when not in_char_class ->
+ next scanner;
+ scan ~in_char_class:true ()
+ | ']' when in_char_class ->
+ next scanner;
+ scan ~in_char_class:false ()
| _ ->
next scanner;
- scan ()
+ scan ~in_char_class ()
in
let pattern, flags = scan () in
let end_pos = position scanner in
Which manages to compile these regexes:
let re = /\.[^/.]+$/
let re = /[^]]/
let re = /[/]/
let re = /[]]/
let re = /[\]]/
let re = /[[]]/
And rejects the following:
let re = /[]/]/
let re = /[^]/]/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's a definition of what are valid regexes?
Eg wondering about 2 vs 3 "/" and how to know one regexp is finished.
Found a case where a valid JS regex literal didn't parse. Led me and GPT5 down a rabbit hole, and this is the result.
All of the added tests to
regex.res
failed before these changes.