-
Notifications
You must be signed in to change notification settings - Fork 281
Benchmarks for System.Utf8String.Experimental #1445
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
When I was benchmarking the UTF-8 code I used these texts from Project Gutenberg. These texts were chosen because they represent a nice variety of languages and non-ASCII character distributions.
The most important benchmark I ran over these tests was to load them into a ROS (remember strip the UTF-8 BOM if it exists!) and to run the "is this valid UTF-8?" method over them. |
What I did not benchmark: emoji and other 4-byte sequences, as they don't occur commonly enough to merit the optimization work. But we did unit test and fuzz the hell out of it so we have high confidence that the code works. :) |
Thanks for the pointers @GrabYourPitchforks . We should do similar thing for |
[ArgumentsSource(nameof(AsciiData))] | ||
public void IsNormalized_WidenAsciiToUtf16(string expected) | ||
{ | ||
Utf8Span span = new Utf8Span(new Utf8String(expected)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hhm, the return value is not used anywhere. Did you check what happens to the actual calls to ToCharArray()
? Is your code even getting called?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is. I did explicitly check for that before running the benchmark :) Others may have the same question though, so I'll add some lines here to consume the result of ToCharArray
in some way
…nation (dotnet#1447) * remove volatile variable access from HashCode benchmarks * remove volatile variable access from Linq benchmarks * remove volatile variable access from Regex benchmarks * remove volatile variable access from Benchstone benchmarks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You'll also probably want a benchmark that exercises the new Utf8String(byte[])
ctor.
[ArgumentsSource(nameof(NonAsciiData))] | ||
public int ToUtf16(string expected) | ||
{ | ||
Utf8Span span = new Utf8Span(new Utf8String(expected)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't call new Utf8String(string)
in these benchmarks, as it performs a considerable amount of work and will throw off your numbers. Instead, have a benchmark dedicated just to calling this method. For all other benchmarks, use the [GlobalSetup]
routine to pre-populate an instance field of type Utf8String
.
In your [GlobalSetup]
routine, you should check whether the first char of each string is '\uFEFF'
, and if so then strip that character off the front of the string (or just throw an exception from within your setup routine if you detect this). That will work around any stray BOM issues that might throw off the benchmark.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's the plan with GlobalSetup
. This is still WIP. I'm hitting an exception that throws "System.ComponentModel.Win32Exception: 'The filename or extension is too long'" though, so I'm still trying to debug that.
Yup, I'll strip out the BOM
src/benchmarks/micro/libraries/System.Utf8String.Experimental/Perf.Utf8String.cs
Outdated
Show resolved
Hide resolved
} | ||
|
||
[Benchmark] | ||
public int ToUtf16_nonascii_110() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If possible, consider having a name of method that you are testing in this benchmark?
public int ToUtf16_nonascii_110() | ||
{ | ||
string str = nonascii_110; | ||
Utf8Span span = new Utf8Span(nonascii_110_ustring); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we just put everything except last line in [GlobalSetup]
so we want be allocating / triggering GC in every run and the measurements will be accurate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's the first thing I tried :) The problem is that Utf8Span is a ref struct. I did check that the constructor for Utf8Span
is cheap though. I don't see another way to initialize a Span
here. @GrabYourPitchforks, do you think the Utf8Span
constructor will skew our numbers here? From what I can tell it just get a reference to the first byte in the string and it's all on the stack, so I would expect it to be incredibly cheap. For reference, I'm trying to benchmark the Utf8Utility.TranscodeToUtf16
method here (which internally calls both ASCIIUtility.WidenAsciiToUtf16
and ASCIIUtility.WidenFourAsciiBytesToUtf16AndWriteToBuffer
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it should matter. ToChars()
just calls static Utf8.Utf16()
method and return. So you can pass the same utf8Span()
again and again.
Ref: https://source.dot.net/#System.Private.CoreLib/Utf8Span.Conversion.cs,103
[Benchmark] | ||
public bool IsAscii_GetIndexOfFirstNonAsciiByte() | ||
{ | ||
Utf8Span ascii_11_span = new Utf8Span(ascii_11_ustring); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here. Move the allocation out of benchmark code.
char[] second = ascii_11_span.ToCharArray(); | ||
if (returned[0] != second[0]) | ||
{ | ||
Console.WriteLine("Just a line to consume returned and second"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of Console.WriteLine()
, just do return returned[0] == second[0];
[master] Update dependencies from dotnet/installer - Updates: - Microsoft.Dotnet.Sdk.Internal: from 5.0.100-rc.1.20407.13 to 5.0.100-rc.1.20413.9
<!--<None Update="11-0.txt"> | ||
<CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory> | ||
</None> | ||
<None Update="11.txt"> | ||
<CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory> | ||
</None> | ||
<None Update="25249-0.txt"> | ||
<CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory> | ||
</None> | ||
<None Update="30774-0.txt"> | ||
<CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory> | ||
</None> | ||
<None Update="39251-0.txt"> | ||
<CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory> | ||
</None>--> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove if unused.
<!--<None Update="11-0.txt"> | |
<CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory> | |
</None> | |
<None Update="11.txt"> | |
<CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory> | |
</None> | |
<None Update="25249-0.txt"> | |
<CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory> | |
</None> | |
<None Update="30774-0.txt"> | |
<CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory> | |
</None> | |
<None Update="39251-0.txt"> | |
<CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory> | |
</None>--> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not related to this PR, but as per numbers posed in pgovind#1 , looks like we regressed in IsNormalized()
EnglishMostlyAscii
and Chinese
. Do we know why or have a tracking issue?
} | ||
|
||
[Benchmark] | ||
public int ToChars() => new Utf8Span(_utf8).ToChars(_destination.Span); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@adamsitnik - Should we avoid creating Utf8Span()
inside this method but instead have it created globally?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kunalspathak this is what we do by default, but Utf8Span
is a stack-only struct so we can't store it in the class field and BDN does not support ref structs
as type that define benchmarks
…the type name less generic
I have no idea why GH shows 61 modified files and changes that have already been merged to master a while ago. We need to squash this PR. |
Local WSL2 numbers on my machine using the latest RC1 runtime: With and without AdvSimd:
Technically we should be getting 20 results, but I get only 12 here. Probably because I see some output on my machine about some iterations being less than 100ms. Now if we only produced an official ARM64 Windows SDK to reduce potential WSL kernel effects ... |
The ResultsComparer shows only improvements and regressions. Here it means that for the other 8 benchmarks there was no difference for the threshold that you have provided |
I actually gave a threshold of 0.000001% because I suspected the same thing :)
|
Putting it for initial review. The
IsNormalized_GetIndexOfFirstNonAsciiChar
andIsNormalized_WidenAsciiToUtf16
have been setup to really accelerate the perf since those benchmarks don't contain any non-ascii data. A more comprehensive benchmark would need some non-ascii cases as well.@kunalspathak @carlossanlop