Skip to content

Benchmarks for System.Utf8String.Experimental #1445

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 58 commits into from
Aug 19, 2020
Merged

Conversation

pgovind
Copy link

@pgovind pgovind commented Aug 10, 2020

Putting it for initial review. The IsNormalized_GetIndexOfFirstNonAsciiChar and IsNormalized_WidenAsciiToUtf16 have been setup to really accelerate the perf since those benchmarks don't contain any non-ascii data. A more comprehensive benchmark would need some non-ascii cases as well.

@kunalspathak @carlossanlop

@GrabYourPitchforks
Copy link
Member

When I was benchmarking the UTF-8 code I used these texts from Project Gutenberg. These texts were chosen because they represent a nice variety of languages and non-ASCII character distributions.

  • 11.txt - English, all-ASCII, should stay entirely within fast paths
  • 11-0.txt - English, mostly ASCII with some rare non-ASCII chars, exercises that the occasional non-ASCII char doesn't kill our fast paths
  • 25249-0.txt - Chinese, exercises 3-byte scalar processing paths typical of East Asian languages
  • 30774-0.txt - Cyrillic, exercises a combination of ASCII and 2-byte scalar processing paths
  • 39251-0.txt - Greek, similar to the Cyrillic case but with a different distribution of ASCII and non-ASCII chars

The most important benchmark I ran over these tests was to load them into a ROS (remember strip the UTF-8 BOM if it exists!) and to run the "is this valid UTF-8?" method over them.

@GrabYourPitchforks
Copy link
Member

What I did not benchmark: emoji and other 4-byte sequences, as they don't occur commonly enough to merit the optimization work. But we did unit test and fuzz the hell out of it so we have high confidence that the code works. :)

@kunalspathak
Copy link
Contributor

What I did not benchmark: emoji and other 4-byte sequences, as they don't occur commonly enough to merit the optimization work. But we did unit test and fuzz the hell out of it so we have high confidence that the code works. :)

Thanks for the pointers @GrabYourPitchforks . We should do similar thing for AdvSimd as well. More important to be functionally correct than performant.

[ArgumentsSource(nameof(AsciiData))]
public void IsNormalized_WidenAsciiToUtf16(string expected)
{
Utf8Span span = new Utf8Span(new Utf8String(expected));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hhm, the return value is not used anywhere. Did you check what happens to the actual calls to ToCharArray()? Is your code even getting called?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is. I did explicitly check for that before running the benchmark :) Others may have the same question though, so I'll add some lines here to consume the result of ToCharArray in some way

adamsitnik and others added 4 commits August 12, 2020 09:18
…nation (dotnet#1447)

* remove volatile variable access from HashCode benchmarks

* remove volatile variable access from Linq benchmarks

* remove volatile variable access from Regex benchmarks

* remove volatile variable access from Benchstone benchmarks
Copy link
Member

@GrabYourPitchforks GrabYourPitchforks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll also probably want a benchmark that exercises the new Utf8String(byte[]) ctor.

[ArgumentsSource(nameof(NonAsciiData))]
public int ToUtf16(string expected)
{
Utf8Span span = new Utf8Span(new Utf8String(expected));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't call new Utf8String(string) in these benchmarks, as it performs a considerable amount of work and will throw off your numbers. Instead, have a benchmark dedicated just to calling this method. For all other benchmarks, use the [GlobalSetup] routine to pre-populate an instance field of type Utf8String.

In your [GlobalSetup] routine, you should check whether the first char of each string is '\uFEFF', and if so then strip that character off the front of the string (or just throw an exception from within your setup routine if you detect this). That will work around any stray BOM issues that might throw off the benchmark.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's the plan with GlobalSetup. This is still WIP. I'm hitting an exception that throws "System.ComponentModel.Win32Exception: 'The filename or extension is too long'" though, so I'm still trying to debug that.

Yup, I'll strip out the BOM

}

[Benchmark]
public int ToUtf16_nonascii_110()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If possible, consider having a name of method that you are testing in this benchmark?

public int ToUtf16_nonascii_110()
{
string str = nonascii_110;
Utf8Span span = new Utf8Span(nonascii_110_ustring);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we just put everything except last line in [GlobalSetup] so we want be allocating / triggering GC in every run and the measurements will be accurate?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's the first thing I tried :) The problem is that Utf8Span is a ref struct. I did check that the constructor for Utf8Span is cheap though. I don't see another way to initialize a Span here. @GrabYourPitchforks, do you think the Utf8Span constructor will skew our numbers here? From what I can tell it just get a reference to the first byte in the string and it's all on the stack, so I would expect it to be incredibly cheap. For reference, I'm trying to benchmark the Utf8Utility.TranscodeToUtf16 method here (which internally calls both ASCIIUtility.WidenAsciiToUtf16 and ASCIIUtility.WidenFourAsciiBytesToUtf16AndWriteToBuffer )

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it should matter. ToChars() just calls static Utf8.Utf16() method and return. So you can pass the same utf8Span() again and again.
Ref: https://source.dot.net/#System.Private.CoreLib/Utf8Span.Conversion.cs,103

[Benchmark]
public bool IsAscii_GetIndexOfFirstNonAsciiByte()
{
Utf8Span ascii_11_span = new Utf8Span(ascii_11_ustring);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here. Move the allocation out of benchmark code.

char[] second = ascii_11_span.ToCharArray();
if (returned[0] != second[0])
{
Console.WriteLine("Just a line to consume returned and second");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of Console.WriteLine() , just do return returned[0] == second[0];

dotnet-maestro bot and others added 2 commits August 14, 2020 04:53
[master] Update dependencies from dotnet/installer
- Updates:
  - Microsoft.Dotnet.Sdk.Internal: from 5.0.100-rc.1.20407.13 to 5.0.100-rc.1.20413.9
Comment on lines 110 to 124
<!--<None Update="11-0.txt">
<CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
</None>
<None Update="11.txt">
<CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
</None>
<None Update="25249-0.txt">
<CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
</None>
<None Update="30774-0.txt">
<CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
</None>
<None Update="39251-0.txt">
<CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
</None>-->
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove if unused.

Suggested change
<!--<None Update="11-0.txt">
<CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
</None>
<None Update="11.txt">
<CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
</None>
<None Update="25249-0.txt">
<CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
</None>
<None Update="30774-0.txt">
<CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
</None>
<None Update="39251-0.txt">
<CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
</None>-->

Copy link
Contributor

@kunalspathak kunalspathak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not related to this PR, but as per numbers posed in pgovind#1 , looks like we regressed in IsNormalized() EnglishMostlyAscii and Chinese. Do we know why or have a tracking issue?

}

[Benchmark]
public int ToChars() => new Utf8Span(_utf8).ToChars(_destination.Span);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adamsitnik - Should we avoid creating Utf8Span() inside this method but instead have it created globally?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kunalspathak this is what we do by default, but Utf8Span is a stack-only struct so we can't store it in the class field and BDN does not support ref structs as type that define benchmarks

https://github.com/dotnet/runtime/blob/b5705587347d29d79cec830dc22b389e1ad9a9e0/src/libraries/System.Private.CoreLib/src/System/Text/Utf8Span.cs#L20

@adamsitnik
Copy link
Member

I have no idea why GH shows 61 modified files and changes that have already been merged to master a while ago. We need to squash this PR.

@adamsitnik adamsitnik merged commit ba1df98 into dotnet:master Aug 19, 2020
@pgovind
Copy link
Author

pgovind commented Aug 19, 2020

Local WSL2 numbers on my machine using the latest RC1 runtime:

With and without AdvSimd:

| Faster                                                             | base/diff | Base Median (ns) | Diff Median (ns) | Modality|
| ------------------------------------------------------------------ | ---------:| ----------------:| ----------------:| -------- |
| System.Text.Perf_Utf8String.IsAscii(Input: Greek)                  |      3.23 |           249.47 |            77.29 |         |
| System.Text.Perf_Utf8String.IsAscii(Input: Chinese)                |      3.05 |           156.38 |            51.23 |         |
| System.Text.Perf_Utf8String.IsAscii(Input: EnglishAllAscii)        |      2.89 |         40824.14 |         14134.85 |         |
| System.Text.Perf_Utf8String.ToChars(Input: EnglishAllAscii)        |      2.73 |        163875.12 |         60134.38 |         |
| System.Text.Perf_Utf8String.IsNormalized(Input: Cyrillic)          |      2.24 |       1281092.41 |        573072.40 | several?|
| System.Text.Perf_Utf8String.IsAscii(Input: Cyrillic)               |      1.93 |           236.73 |           122.49 | several?|
| System.Text.Perf_Utf8String.ToChars(Input: Cyrillic)               |      1.50 |        214020.16 |        142740.49 |         |
| System.Text.Perf_Utf8String.IsNormalized(Input: EnglishAllAscii)   |      1.40 |        858855.47 |        615500.23 | several?|
| System.Text.Perf_Utf8String.ToCharArray(Input: EnglishMostlyAscii) |      1.35 |        584513.22 |        434575.87 |         |
| System.Text.Perf_Utf8String.IsAscii(Input: EnglishMostlyAscii)     |      1.18 |            10.84 |             9.16 |         |
| System.Text.Perf_Utf8String.IsNormalized(Input: Chinese)           |      1.17 |       1572500.21 |       1345411.33 | several?|
| System.Text.Perf_Utf8String.ToChars(Input: Chinese)                |      1.05 |        187545.45 |        179036.58 |         |

No file given

Technically we should be getting 20 results, but I get only 12 here. Probably because I see some output on my machine about some iterations being less than 100ms.

Now if we only produced an official ARM64 Windows SDK to reduce potential WSL kernel effects ...

@adamsitnik
Copy link
Member

Technically we should be getting 20 results, but I get only 12 here.

The ResultsComparer shows only improvements and regressions. Here it means that for the other 8 benchmarks there was no difference for the threshold that you have provided

@pgovind
Copy link
Author

pgovind commented Aug 20, 2020

for the threshold that you have provided

I actually gave a threshold of 0.000001% because I suspected the same thing :)

  1. Is there any other reason we would only see 12 results? I saw some messages in the log about iterations being less than 100ms (in 1 of the runs, not both).

  2. Also, does the modality matter? For ex: If the distribution in the baseline has a single peak, but the diff has 3 peaks (equal, less or more), can we still trust the results? Much of my statistics class is a distant memory now :)

@carlossanlop carlossanlop deleted the arm64 branch August 27, 2020 00:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants