Skip to content

fix JS regex literal parsing in char classes #7790

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

zth
Copy link
Member

@zth zth commented Aug 22, 2025

Found a case where a valid JS regex literal didn't parse. Led me and GPT5 down a rabbit hole, and this is the result.

All of the added tests to regex.res failed before these changes.

@zth zth requested review from Copilot and cristianoc August 22, 2025 20:05
@zth
Copy link
Member Author

zth commented Aug 22, 2025

@cristianoc your eyes would be good here.
@glennsl if you're around, seeing as you did the original implementation, care to take a look?

Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes JavaScript regex literal parsing when character classes contain special characters that were previously incorrectly interpreted as regex delimiters. The fix ensures proper handling of edge cases like leading ] characters and / characters within character classes.

  • Implements lookahead logic to validate character class closers before entering class parsing mode
  • Adds proper handling of beginning-of-class semantics for literal ] and ^ characters
  • Updates regex scanner to track character class state and apply correct escape rules

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
compiler/syntax/src/res_scanner.ml Implements new character class parsing logic with lookahead validation and proper state tracking
tests/syntax_tests/data/parsing/grammar/expressions/regex.res Adds test cases for edge cases involving character classes with special characters
tests/syntax_tests/data/parsing/grammar/expressions/expected/regex.res.txt Updates expected test output to match new parsing behavior

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

let rec loop () =
if !i >= len then false
else
match String.unsafe_get src !i with
Copy link
Preview

Copilot AI Aug 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using String.unsafe_get can lead to bounds checking violations. Consider using String.get instead, or ensure proper bounds checking is performed before this call.

Suggested change
match String.unsafe_get src !i with
match String.get src !i with

Copilot uses AI. Check for mistakes.

else
match String.unsafe_get src !i with
| '\n' | '\r' -> false
| '\\' ->
Copy link
Preview

Copilot AI Aug 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The escape handling logic increments i by 2 without checking if i + 1 is within bounds after the increment. Consider adding bounds checking for the second character of the escape sequence.

Copilot uses AI. Check for mistakes.

loop ()
in
loop ()
in
Copy link
Preview

Copilot AI Aug 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Starting the scan at from_offset + 1 assumes the character at from_offset is '[', but this is not validated. If from_offset doesn't point to '[', the parsing logic may be incorrect.

Suggested change
in
(* Validate that from_offset points to '[' *)
if from_offset < 0 || from_offset >= len || String.unsafe_get src from_offset <> '[' then
false
else
let i = ref (from_offset + 1) in
(* start scanning after current '[' *)
let bos = ref true in
let rec loop () =
if !i >= len then false
else
match String.unsafe_get src !i with
| '\n' | '\r' -> false
| '\\' ->
if !i + 1 < len then (
i := !i + 2;
loop ())
else false
| '^' when !bos ->
incr i;
loop ()
| ']' when !bos ->
(* Leading ']' is literal content; after that, we're no longer at BOS. *)
bos := false;
incr i;
loop ()
| ']' -> true
| _ ->
bos := false;
incr i;
loop ()
in
loop ()

Copilot uses AI. Check for mistakes.

Copy link

pkg-pr-new bot commented Aug 22, 2025

Open in StackBlitz

rescript

npm i https://pkg.pr.new/rescript-lang/rescript@7790

@rescript/darwin-arm64

npm i https://pkg.pr.new/rescript-lang/rescript/@rescript/darwin-arm64@7790

@rescript/darwin-x64

npm i https://pkg.pr.new/rescript-lang/rescript/@rescript/darwin-x64@7790

@rescript/linux-arm64

npm i https://pkg.pr.new/rescript-lang/rescript/@rescript/linux-arm64@7790

@rescript/linux-x64

npm i https://pkg.pr.new/rescript-lang/rescript/@rescript/linux-x64@7790

@rescript/win32-x64

npm i https://pkg.pr.new/rescript-lang/rescript/@rescript/win32-x64@7790

commit: e0d818b

Copy link
Collaborator

@cristianoc cristianoc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not familiar with the details of regexps.
At high level the code looks of legit shape, but don't know how to check details. If there are enough tests it should be OK.

let re = /[]]/
let re = /[\]]/
let re = /[[]]/
let re = /[^]/]/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible for the compiler to reject /[]/]/ and /[^]/]/? They're not valid regexes in JS and so would cause a runtime error

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally yeah, but I'm not sure how we could do that in a good way. Any ideas?

Copy link
Member

@mediremi mediremi Aug 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude generated the following (diff is against master, not this branch):

diff --git i/compiler/syntax/src/res_scanner.ml w/compiler/syntax/src/res_scanner.ml
index c404d36cc..3415fbd02 100644
--- i/compiler/syntax/src/res_scanner.ml
+++ w/compiler/syntax/src/res_scanner.ml
@@ -580,9 +580,9 @@ let scan_regex scanner =
       bring_buf_up_to_date ~start_offset:last_char_offset;
       Buffer.contents buf)
   in
-  let rec scan () =
+  let rec scan ?(in_char_class = false) () =
     match scanner.ch with
-    | '/' ->
+    | '/' when not in_char_class ->
       let last_char_offset = scanner.offset in
       next scanner;
       let pattern = result ~first_char_offset ~last_char_offset in
@@ -606,10 +606,16 @@ let scan_regex scanner =
     | '\\' ->
       next scanner;
       next scanner;
-      scan ()
+      scan ~in_char_class ()
+    | '[' when not in_char_class ->
+      next scanner;
+      scan ~in_char_class:true ()
+    | ']' when in_char_class ->
+      next scanner;
+      scan ~in_char_class:false ()
     | _ ->
       next scanner;
-      scan ()
+      scan ~in_char_class ()
   in
   let pattern, flags = scan () in
   let end_pos = position scanner in

Which manages to compile these regexes:

let re = /\.[^/.]+$/
let re = /[^]]/
let re = /[/]/

let re = /[]]/
let re = /[\]]/
let re = /[[]]/

And rejects the following:

let re = /[]/]/
let re = /[^]/]/

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's a definition of what are valid regexes?
Eg wondering about 2 vs 3 "/" and how to know one regexp is finished.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants