@naitoh naitoh commented Sep 9, 2024

This adds String pattern support to non-head match methods such as StringScanner#scan_until.

Using a String as the pattern can improve match performance.
Here are the results of the included benchmark.
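A minimal usage sketch of what this enables, assuming a strscan build that already contains this change. The helper name `scan_until_literal` is hypothetical and not part of strscan; on older builds, where `scan_until` raises `TypeError` for a String, it falls back to an escaped Regexp so the sketch still runs.

```ruby
require "strscan"

# Hypothetical helper (not part of strscan): scan up to and including a
# literal substring, preferring the String pattern when it is accepted.
def scan_until_literal(scanner, literal)
  scanner.scan_until(literal)
rescue TypeError
  # Older strscan accepts only a Regexp here, so escape the literal instead.
  scanner.scan_until(Regexp.new(Regexp.escape(literal)))
end

scanner = StringScanner.new("name=value;rest")
matched = scan_until_literal(scanner, "=")

puts matched.inspect  # the text from the old position through the match
puts scanner.rest     # whatever follows the match
```

Both paths return the same substring; the String path only skips the regular expression engine.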

CRuby

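It shows that String as a pattern (string_var) is 1.18x faster than Regexp as a pattern (regexp). The literal string case, however, is slightly slower than both Regexp cases in this run.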

$ benchmark-driver benchmark/check_until.yaml
Warming up --------------------------------------
              regexp     9.403M i/s -      9.548M times in 1.015459s (106.35ns/i)
          regexp_var     9.162M i/s -      9.248M times in 1.009479s (109.15ns/i)
              string     8.966M i/s -      9.274M times in 1.034343s (111.54ns/i)
          string_var    11.051M i/s -     11.190M times in 1.012538s (90.49ns/i)
Calculating -------------------------------------
              regexp    10.319M i/s -     28.209M times in 2.733707s (96.91ns/i)
          regexp_var    10.032M i/s -     27.485M times in 2.739807s (99.68ns/i)
              string     9.681M i/s -     26.897M times in 2.778397s (103.30ns/i)
          string_var    12.162M i/s -     33.154M times in 2.726046s (82.22ns/i)

Comparison:
          string_var:  12161920.6 i/s 
              regexp:  10318949.7 i/s - 1.18x  slower
          regexp_var:  10031617.6 i/s - 1.21x  slower
              string:   9680843.7 i/s - 1.26x  slower
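For reference, the "x slower" factors can be recomputed from the rates in the comparison above. This is only a sketch of the arithmetic (fastest rate divided by each other rate), not code taken from benchmark-driver.

```ruby
# Iterations per second copied from the CRuby "Comparison" output above.
ips = {
  "string_var" => 12_161_920.6,
  "regexp"     => 10_318_949.7,
  "regexp_var" => 10_031_617.6,
  "string"     =>  9_680_843.7,
}

fastest_name, fastest_ips = ips.max_by { |_name, value| value }

# Each "x slower" figure is the fastest rate divided by that entry's rate.
slowdowns = ips.reject { |name, _value| name == fastest_name }
               .transform_values { |value| (fastest_ips / value).round(2) }

slowdowns.each do |name, factor|
  puts format("%-10s %.2fx slower than %s", name, factor, fastest_name)
end
```

Note that the 1.18x figure compares string_var with regexp; the string entry itself is the slowest of the four in this CRuby run.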

JRuby

It shows that String as a pattern (string) is 2.11x faster than Regexp as a pattern (regexp_var) and 2.27x faster than regexp.

$ benchmark-driver benchmark/check_until.yaml
Warming up --------------------------------------
              regexp     7.591M i/s -      7.544M times in 0.993780s (131.74ns/i)
          regexp_var     6.143M i/s -      6.125M times in 0.997038s (162.77ns/i)
              string    14.135M i/s -     14.079M times in 0.996067s (70.75ns/i)
          string_var    14.079M i/s -     14.057M times in 0.998420s (71.03ns/i)
Calculating -------------------------------------
              regexp     9.409M i/s -     22.773M times in 2.420268s (106.28ns/i)
          regexp_var    10.116M i/s -     18.430M times in 1.821820s (98.85ns/i)
              string    21.389M i/s -     42.404M times in 1.982519s (46.75ns/i)
          string_var    20.897M i/s -     42.237M times in 2.021187s (47.85ns/i)

Comparison:
              string:  21389191.1 i/s 
          string_var:  20897327.5 i/s - 1.02x  slower
          regexp_var:  10116464.7 i/s - 2.11x  slower
              regexp:   9409222.3 i/s - 2.27x  slower

See: https://github.com/jruby/jruby/blob/be7815ec02356a58891c8727bb448f0c6a826d96/core/src/main/java/org/jruby/util/StringSupport.java#L1706-L1736

@naitoh naitoh marked this pull request as draft September 9, 2024 09:44
@naitoh naitoh marked this pull request as ready for review September 9, 2024 09:54
naitoh commented Sep 9, 2024

As for the TruffleRuby error, I think the fix belongs on the TruffleRuby side: its check that raises this TypeError needs to be removed.

TypeError: wrong argument type String (expected Regexp)

https://github.com/oracle/truffleruby/blob/b555f5908772808791f0d1224b17a87b963cdab6/lib/truffle/strscan.rb#L320-L323
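Until every implementation accepts a String here, callers can probe for the behaviour at runtime. A rough sketch follows; `string_pattern_supported?` is a hypothetical name, and the only assumption is that unsupported implementations raise `TypeError` as in the message above.

```ruby
require "strscan"

# Hypothetical probe: true when this strscan accepts a String in scan_until,
# false when it raises TypeError (expected Regexp) instead.
def string_pattern_supported?
  StringScanner.new("a").scan_until("a")
  true
rescue TypeError
  false
end

supported = string_pattern_supported?
puts "String pattern in scan_until: #{supported ? 'supported' : 'Regexp only'}"
```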

@kou kou left a comment

Could you omit the tests that do not apply to TruffleRuby, like this?

omit("not implemented on TruffleRuby") if RUBY_ENGINE == "truffleruby"

@naitoh naitoh force-pushed the accept-string-as-a-pattern-at-non-head branch from 377d6b2 to cb4ce57 Compare September 12, 2024 13:04
naitoh commented Sep 12, 2024

> Could you omit the tests that do not apply to TruffleRuby

I omitted the tests that do not apply to TruffleRuby.
Thanks.

@naitoh naitoh requested a review from kou September 12, 2024 13:19
@naitoh naitoh force-pushed the accept-string-as-a-pattern-at-non-head branch 2 times, most recently from 454ac52 to 4a193b2 Compare September 13, 2024 14:36
@naitoh naitoh requested a review from kou September 13, 2024 14:46
@kou kou left a comment

Could you use a string that includes \0 in the test?

@naitoh naitoh force-pushed the accept-string-as-a-pattern-at-non-head branch from 4a193b2 to bab64a3 Compare September 13, 2024 21:21
naitoh commented Sep 13, 2024

> Could you use a string that includes \0 in the test?

OK.
I added this test case.
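A sketch of the kind of check this refers to, not the test that was actually committed: the source and the pattern both contain a NUL byte, so a search that stopped at "\0" would miss the match. The sample strings are made up for illustration.

```ruby
require "strscan"

source  = "Foo Bar\0Baz"
literal = "Bar\0"  # the pattern itself ends with a NUL byte

# Reference result through the Regexp path, which every strscan accepts.
by_regexp = StringScanner.new(source).scan_until(/Bar\0/)

# String path; older strscan raises TypeError here, so this sketch records nil.
by_string =
  begin
    StringScanner.new(source).scan_until(literal)
  rescue TypeError
    nil
  end

puts by_regexp.inspect
puts by_string.inspect
```

When the String path is available, both results should be the same substring, including the trailing NUL byte.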

@naitoh naitoh requested a review from kou September 13, 2024 21:30
naitoh and others added 2 commits September 14, 2024 06:50
Co-authored-by: Sutou Kouhei <kou@clear-code.com>
@naitoh naitoh force-pushed the accept-string-as-a-pattern-at-non-head branch from bab64a3 to 5b9bcda Compare September 13, 2024 21:51
@naitoh naitoh requested a review from kou September 13, 2024 21:58
@kou kou merged commit f9d96c4 into ruby:master Sep 14, 2024
37 checks passed
kou commented Sep 14, 2024

Thanks.

@naitoh naitoh deleted the accept-string-as-a-pattern-at-non-head branch September 14, 2024 10:57
matzbot pushed a commit to ruby/ruby that referenced this pull request Sep 17, 2024
(ruby/strscan#106)

ruby/strscan@f9d96c446a

Co-authored-by: Sutou Kouhei <kou@clear-code.com>
hsbt commented Sep 17, 2024

This change breaks some examples in ruby/spec. I skipped them in ruby/ruby@95f08f2.

kou pushed a commit that referenced this pull request Oct 16, 2024
…nc_get()` (#108)

- before: #106

## Why?

In `rb_strseq_index()`, the result of `rb_enc_check()` is used.

- https://github.com/ruby/ruby/blob/6c7209cd3788ceec01e504d99057f9d3b396be84/string.c#L4335-L4368
  > enc = rb_enc_check(str, sub);

  > return strseq_core(str_ptr, str_ptr_end, str_len, sub_ptr, sub_len, offset, enc);

- https://github.com/ruby/ruby/blob/6c7209cd3788ceec01e504d99057f9d3b396be84/string.c#L4309-L4318
```C
strseq_core(const char *str_ptr, const char *str_ptr_end, long str_len,
            const char *sub_ptr, long sub_len, long offset, rb_encoding *enc)
{
    const char *search_start = str_ptr;
    long pos, search_len = str_len - offset;

    for (;;) {
        const char *t;
        pos = rb_memsearch(sub_ptr, sub_len, search_start, search_len, enc);
```
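As a rough Ruby-level illustration of the encoding check quoted above: `Encoding.compatible?` reports which encoding two strings can share, and `String#index` raises when there is none. Relating these to `rb_enc_check()` is my reading of the quoted code, not something the commit states, and the sample strings are made up.

```ruby
utf8_ascii = "abc".encode(Encoding::UTF_8)  # ASCII-only, UTF-8
utf8_kana  = "\u3042"                       # non-ASCII, UTF-8
euc_kana   = "\u3042".encode("EUC-JP")      # non-ASCII, EUC-JP

# ASCII-only text is compatible with the other string's encoding.
shared = Encoding.compatible?(utf8_ascii, euc_kana)

# Two non-ASCII strings in different encodings share no encoding.
clash = Encoding.compatible?(utf8_kana, euc_kana)

# String#index performs the same kind of check and raises when it fails.
error =
  begin
    utf8_kana.index(euc_kana)
    nil
  rescue Encoding::CompatibilityError => e
    e
  end

puts shared.inspect
puts clash.inspect
puts error.class
```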

## Benchmark

It shows that String as a pattern (string_var) is 1.24x faster than Regexp as a pattern (regexp_var) and 1.31x faster than regexp.

```
$ benchmark-driver benchmark/check_until.yaml
Warming up --------------------------------------
              regexp     9.225M i/s -      9.328M times in 1.011068s (108.40ns/i)
          regexp_var     9.327M i/s -      9.413M times in 1.009214s (107.21ns/i)
              string     9.200M i/s -      9.355M times in 1.016840s (108.70ns/i)
          string_var    11.249M i/s -     11.255M times in 1.000578s (88.90ns/i)
Calculating -------------------------------------
              regexp     9.565M i/s -     27.676M times in 2.893476s (104.55ns/i)
          regexp_var    10.111M i/s -     27.982M times in 2.767496s (98.90ns/i)
              string    10.060M i/s -     27.600M times in 2.743465s (99.40ns/i)
          string_var    12.519M i/s -     33.746M times in 2.695615s (79.88ns/i)

Comparison:
          string_var:  12518707.2 i/s
          regexp_var:  10111089.6 i/s - 1.24x  slower
              string:  10060144.4 i/s - 1.24x  slower
              regexp:   9565124.4 i/s - 1.31x  slower
```
kou pushed a commit that referenced this pull request Oct 16, 2024
…currPtr()` (#109)

- before: #106

## Why?

Because they are identical.


https://github.com/ruby/strscan/blob/d31274f41b7c1e28f23d58cf7bfea03baa818cb7/ext/jruby/org/jruby/ext/strscan/RubyStringScanner.java#L267-L268


https://github.com/ruby/strscan/blob/d31274f41b7c1e28f23d58cf7bfea03baa818cb7/ext/jruby/org/jruby/ext/strscan/RubyStringScanner.java#L359-L361

## Benchmark

It shows that String as a pattern (string_var) is 2.33x faster than Regexp as a pattern (regexp_var) and 2.86x faster than regexp.

```
$ benchmark-driver benchmark/check_until.yaml
Warming up --------------------------------------
              regexp     7.421M i/s -      7.378M times in 0.994235s (134.75ns/i)
          regexp_var     7.302M i/s -      7.307M times in 1.000706s (136.95ns/i)
              string    12.715M i/s -     12.707M times in 0.999388s (78.65ns/i)
          string_var    13.575M i/s -     13.533M times in 0.996914s (73.66ns/i)
Calculating -------------------------------------
              regexp     8.287M i/s -     22.263M times in 2.686415s (120.67ns/i)
          regexp_var    10.180M i/s -     21.905M times in 2.151779s (98.23ns/i)
              string    20.148M i/s -     38.144M times in 1.893226s (49.63ns/i)
          string_var    23.695M i/s -     40.726M times in 1.718753s (42.20ns/i)

Comparison:
          string_var:  23694846.7 i/s
              string:  20147598.6 i/s - 1.18x  slower
          regexp_var:  10180018.3 i/s - 2.33x  slower
              regexp:   8287384.8 i/s - 2.86x  slower
```
naitoh added a commit to naitoh/strscan that referenced this pull request Dec 7, 2024
Added support for string pattern type in ruby#106.
And fix Success Return content.
kou pushed a commit that referenced this pull request Dec 7, 2024
Added support for string pattern type in
#106.
And fix Success Return content.
matzbot pushed a commit to ruby/ruby that referenced this pull request Dec 10, 2024
kou pushed a commit to ruby/rexml that referenced this pull request Dec 19, 2024
…ntil(string)` (#226)

## Why?
`StringScanner#check_until(string)` is faster than
`StringScanner#check_until(regex)`.

See:
- ruby/strscan#106
- ruby/strscan#111
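To try the comparison locally without benchmark-driver, a small stdlib sketch like the following may be enough. It is not the project's `benchmark/check_until.yaml`; the input text, the iteration count, and the helper name `time_check_until` are all assumptions made for illustration.

```ruby
require "benchmark"
require "strscan"

# Hypothetical helper: time repeated check_until calls with one pattern.
# check_until does not advance the scan pointer, so every call starts
# from the same position.
def time_check_until(source, pattern, iterations)
  scanner = StringScanner.new(source)
  Benchmark.realtime do
    iterations.times { scanner.check_until(pattern) }
  end
end

source     = "abcdefghijklmnopqrstuvwxyz" * 4
iterations = 100_000  # kept small so the sketch finishes quickly

timings = { regexp: time_check_until(source, /xyz/, iterations) }

begin
  timings[:string] = time_check_until(source, "xyz", iterations)
rescue TypeError
  # Older strscan rejects a String here; only the Regexp timing is reported.
end

timings.each { |label, seconds| puts format("%-6s %.4fs", label, seconds) }
```

A single wall-clock run like this is noisy, so the benchmark-driver figures in this thread remain the better reference.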

## Benchmark
```
RUBYLIB= BUNDLER_ORIG_RUBYLIB= /Users/naitoh/.rbenv/versions/3.3.4/bin/ruby -v -S benchmark-driver /Users/naitoh/ghq/github.com/naitoh/rexml/benchmark/parse.yaml
ruby 3.3.4 (2024-07-09 revision be1089c8ec) [arm64-darwin22]
Calculating -------------------------------------
                         before       after  before(YJIT)  after(YJIT)
                 dom     19.459      19.840        35.035       35.786 i/s -     100.000 times in 5.139034s 5.040369s 2.854304s 2.794367s
                 sax     30.057      30.026        52.986       53.716 i/s -     100.000 times in 3.326998s 3.330499s 1.887303s 1.861652s
                pull     33.777      34.415        62.294       64.020 i/s -     100.000 times in 2.960622s 2.905668s 1.605284s 1.562002s
              stream     33.789      34.003        60.174       60.411 i/s -     100.000 times in 2.959521s 2.940916s 1.661845s 1.655334s

Comparison:
                              dom
         after(YJIT):        35.8 i/s
        before(YJIT):        35.0 i/s - 1.02x  slower
               after:        19.8 i/s - 1.80x  slower
              before:        19.5 i/s - 1.84x  slower

                              sax
         after(YJIT):        53.7 i/s
        before(YJIT):        53.0 i/s - 1.01x  slower
              before:        30.1 i/s - 1.79x  slower
               after:        30.0 i/s - 1.79x  slower

                             pull
         after(YJIT):        64.0 i/s
        before(YJIT):        62.3 i/s - 1.03x  slower
               after:        34.4 i/s - 1.86x  slower
              before:        33.8 i/s - 1.90x  slower

                           stream
         after(YJIT):        60.4 i/s
        before(YJIT):        60.2 i/s - 1.00x  slower
               after:        34.0 i/s - 1.78x  slower
              before:        33.8 i/s - 1.79x  slower

```

- YJIT=ON : 1.00x - 1.03x faster
- YJIT=OFF : 1.00x - 1.02x faster