fix JS regex literal parsing in char classes #7790

zth · 2025-08-22T20:05:00Z

Found a case where a valid JS regex literal didn't parse. Led me and GPT5 down a rabbit hole, and this is the result.

All of the tests added to regex.res failed before these changes.

zth · 2025-08-22T20:05:18Z

@cristianoc your eyes would be good here.
@glennsl if you're around, seeing as you did the original implementation, care to take a look?

Copilot

Pull Request Overview

This PR fixes JavaScript regex literal parsing when character classes contain special characters that were previously incorrectly interpreted as regex delimiters. The fix ensures proper handling of edge cases like leading ] characters and / characters within character classes.

- Implements lookahead logic to validate character class closers before entering class parsing mode
- Adds proper handling of beginning-of-class semantics for literal ] and ^ characters
- Updates regex scanner to track character class state and apply correct escape rules

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| `compiler/syntax/src/res_scanner.ml` | Implements new character class parsing logic with lookahead validation and proper state tracking |
| `tests/syntax_tests/data/parsing/grammar/expressions/regex.res` | Adds test cases for edge cases involving character classes with special characters |
| `tests/syntax_tests/data/parsing/grammar/expressions/expected/regex.res.txt` | Updates expected test output to match new parsing behavior |

Copilot · 2025-08-22T20:05:58Z

compiler/syntax/src/res_scanner.ml

+    let rec loop () =
+      if !i >= len then false
+      else
+        match String.unsafe_get src !i with


String.unsafe_get performs no bounds checking, so an out-of-range index would read invalid memory. Consider using String.get instead, or ensure the bounds are checked before this call.

Suggested change

-        match String.unsafe_get src !i with
+        match String.get src !i with
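
For context on the trade-off behind this suggestion, here is a minimal standalone sketch. It is not the scanner's code, and `peek_checked` and `peek_guarded` are hypothetical names used only for illustration. `String.get` raises `Invalid_argument` on an out-of-range index, while `String.unsafe_get` performs no check and is only safe behind an explicit guard such as the `!i >= len` test in the loop above.

```ocaml
(* Hypothetical helpers, for illustration only. *)

(* String.get checks bounds and raises Invalid_argument when i is out of range. *)
let peek_checked (src : string) (i : int) : char option =
  match String.get src i with
  | c -> Some c
  | exception Invalid_argument _ -> None

(* String.unsafe_get skips the check, so the caller must guard the index. *)
let peek_guarded (src : string) (i : int) : char option =
  if i < 0 || i >= String.length src then None
  else Some (String.unsafe_get src i)

let () =
  assert (peek_checked "ab" 1 = Some 'b');
  assert (peek_checked "ab" 2 = None);
  assert (peek_guarded "ab" 1 = Some 'b');
  assert (peek_guarded "ab" 2 = None)
```

Both helpers return the same results here; the difference is only where the bounds check happens.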

Copilot · 2025-08-22T20:05:58Z

compiler/syntax/src/res_scanner.ml

+      else
+        match String.unsafe_get src !i with
+        | '\n' | '\r' -> false
+        | '\\' ->


The escape handling increments i by 2 without checking that i + 1 is within bounds. Consider adding a bounds check for the second character of the escape sequence.

Copilot · 2025-08-22T20:05:58Z

compiler/syntax/src/res_scanner.ml

+          loop ()
+    in
+    loop ()
+  in


Starting the scan at from_offset + 1 assumes the character at from_offset is '[', but this is not validated. If from_offset doesn't point to '[', the parsing logic may be incorrect.

Suggested change

-  in
+    (* Validate that from_offset points to '[' *)
+    if from_offset < 0 || from_offset >= len || String.unsafe_get src from_offset <> '[' then
+      false
+    else
+      let i = ref (from_offset + 1) in
+      (* start scanning after current '[' *)
+      let bos = ref true in
+      let rec loop () =
+        if !i >= len then false
+        else
+          match String.unsafe_get src !i with
+          | '\n' | '\r' -> false
+          | '\\' ->
+            if !i + 1 < len then (
+              i := !i + 2;
+              loop ())
+            else false
+          | '^' when !bos ->
+            incr i;
+            loop ()
+          | ']' when !bos ->
+            (* Leading ']' is literal content; after that, we're no longer at BOS. *)
+            bos := false;
+            incr i;
+            loop ()
+          | ']' -> true
+          | _ ->
+            bos := false;
+            incr i;
+            loop ()
+      in
+      loop ()
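
To see what this lookahead answers for concrete inputs, here is a self-contained sketch of the same idea. The name `has_class_closer` and the choice to pass the state as arguments instead of refs are assumptions made for illustration, not the PR's code. One deliberate difference from the suggestion as written: an escape sequence also clears the beginning-of-class flag, so that `[\]]` is closed by its final `]`.

```ocaml
(* Hypothetical standalone version of the lookahead in the suggestion above.
   Returns true if the class opened at from_offset is closed by an unescaped
   ']' before the end of the line or of the input. *)
let has_class_closer (src : string) (from_offset : int) : bool =
  let len = String.length src in
  if from_offset < 0 || from_offset >= len || src.[from_offset] <> '[' then false
  else
    (* bos: still at the beginning of the class, where '^' and ']' are special *)
    let rec loop i bos =
      if i >= len then false
      else
        match src.[i] with
        | '\n' | '\r' -> false
        (* Assumption: an escape also ends the beginning-of-class state. *)
        | '\\' -> if i + 1 < len then loop (i + 2) false else false
        | '^' when bos -> loop (i + 1) true
        | ']' when bos -> loop (i + 1) false (* leading ']' is literal *)
        | ']' -> true
        | _ -> loop (i + 1) false
    in
    loop (from_offset + 1) true

let () =
  assert (has_class_closer "[]]/" 0);        (* leading ']' is literal, second closes *)
  assert (has_class_closer "[^]/]/" 0);      (* same after a leading '^' *)
  assert (has_class_closer "[\\]]" 0);       (* escaped ']' is skipped *)
  assert (not (has_class_closer "[]" 0));    (* the only ']' counts as content *)
  assert (not (has_class_closer "[a\n]" 0)); (* a newline ends the search *)
  assert (not (has_class_closer "a]" 0))     (* offset does not point at '[' *)
```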

pkg-pr-new · 2025-08-22T20:13:53Z

rescript: npm i https://pkg.pr.new/rescript-lang/rescript@7790
@rescript/darwin-arm64: npm i https://pkg.pr.new/rescript-lang/rescript/@rescript/darwin-arm64@7790
@rescript/darwin-x64: npm i https://pkg.pr.new/rescript-lang/rescript/@rescript/darwin-x64@7790
@rescript/linux-arm64: npm i https://pkg.pr.new/rescript-lang/rescript/@rescript/linux-arm64@7790
@rescript/linux-x64: npm i https://pkg.pr.new/rescript-lang/rescript/@rescript/linux-x64@7790
@rescript/win32-x64: npm i https://pkg.pr.new/rescript-lang/rescript/@rescript/win32-x64@7790

commit: e0d818b

cristianoc · 2025-08-23

I'm not familiar with the details of regexps.
At a high level the code looks legit, but I don't know how to check the details. If there are enough tests it should be OK.

mediremi · 2025-08-23T16:04:40Z

tests/syntax_tests/data/parsing/grammar/expressions/regex.res

+let re = /[]]/
+let re = /[\]]/
+let re = /[[]]/
+let re = /[^]/]/


Would it be possible for the compiler to reject /[]/]/ and /[^]/]/? They're not valid regexes in JS and so would cause a runtime error

zth · 2025-08-23

Ideally yeah, but I'm not sure how we could do that in a good way. Any ideas?

mediremi · 2025-08-23 (edited)

Claude generated the following (diff is against master, not this branch):

diff --git i/compiler/syntax/src/res_scanner.ml w/compiler/syntax/src/res_scanner.ml
index c404d36cc..3415fbd02 100644
--- i/compiler/syntax/src/res_scanner.ml
+++ w/compiler/syntax/src/res_scanner.ml
@@ -580,9 +580,9 @@ let scan_regex scanner =
     bring_buf_up_to_date ~start_offset:last_char_offset;
     Buffer.contents buf)
   in
-  let rec scan () =
+  let rec scan ?(in_char_class = false) () =
     match scanner.ch with
-    | '/' ->
+    | '/' when not in_char_class ->
       let last_char_offset = scanner.offset in
       next scanner;
       let pattern = result ~first_char_offset ~last_char_offset in
@@ -606,10 +606,16 @@ let scan_regex scanner =
     | '\\' ->
       next scanner;
       next scanner;
-      scan ()
+      scan ~in_char_class ()
+    | '[' when not in_char_class ->
+      next scanner;
+      scan ~in_char_class:true ()
+    | ']' when in_char_class ->
+      next scanner;
+      scan ~in_char_class:false ()
     | _ ->
       next scanner;
-      scan ()
+      scan ~in_char_class ()
   in
   let pattern, flags = scan () in
   let end_pos = position scanner in

Which manages to compile these regexes:

let re = /\.[^/.]+$/
let re = /[^]]/
let re = /[/]/
let re = /[]]/
let re = /[\]]/
let re = /[[]]/

And rejects the following:

let re = /[]/]/

let re = /[^]/]/
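
The accept and reject behavior listed above can be reproduced with a small standalone sketch of the flag-based approach. `regex_end` is a hypothetical helper that works on a plain string instead of the real scanner state, and treating a line break as "unterminated" is an assumption, since that part of `scan_regex` is not shown in the diff.

```ocaml
(* Hypothetical standalone sketch of the flag-based approach in the diff above.
   src is expected to start at the opening '/'. Returns the offset of the
   closing '/', or None if the literal is not terminated on its line. *)
let regex_end (src : string) : int option =
  let len = String.length src in
  let rec scan i in_char_class =
    if i >= len then None
    else
      match src.[i] with
      | '\n' | '\r' -> None (* assumption: a literal cannot span lines *)
      | '/' when not in_char_class -> Some i
      | '\\' -> scan (i + 2) in_char_class (* skip the escaped character *)
      | '[' when not in_char_class -> scan (i + 1) true
      | ']' when in_char_class -> scan (i + 1) false
      | _ -> scan (i + 1) in_char_class
  in
  if len = 0 || src.[0] <> '/' then None else scan 1 false

let () =
  (* A '/' inside a class does not end the literal. *)
  assert (regex_end "/[/]/" = Some 4);
  assert (regex_end {|/\.[^/.]+$/|} = Some 10);
  (* The first ']' closes the class, so the literal ends early and the
     trailing "]/" is left over for the parser to reject. *)
  assert (regex_end "/[]/]/" = Some 3);
  assert (regex_end "/[^]/]/" = Some 4);
  assert (regex_end "/[a" = None)
```

With this rule the first `]` after `[` always closes the class, which matches how JavaScript reads `[]` (an empty class) and `[^]` (its negation). That is why `/[]/]/` and `/[^]/]/` end at the first unbracketed `/` and leave `]/` behind.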

cristianoc · 2025-08-24

What's the definition of a valid regex?
E.g., I'm wondering about 2 vs. 3 "/" characters and how to know where one regexp ends.

fix JS regex literal parsing in char classes

e0d818b

zth requested review from Copilot and cristianoc August 22, 2025 20:05

Copilot AI reviewed Aug 22, 2025

cristianoc reviewed Aug 23, 2025

mediremi reviewed Aug 23, 2025
