
Updating Matrix4x4 to accelerate several functions using Arm.AdvSimd #40054


Merged: 4 commits into dotnet:master from the matrix4x4 branch, Aug 11, 2020

Conversation

@tannergooding (Member) commented on Jul 29, 2020:

This resolves #33565

@Dotnet-GitSync-Bot added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) on Jul 29, 2020
@tannergooding force-pushed the matrix4x4 branch 4 times, most recently from 91ca86f to ae70564, on July 30, 2020 at 18:31
// Vector<T> for the rel-ops covered here requires at least SSE2
assert(compIsaSupportedDebugOnly(InstructionSet_SSE2));

// Vector<T>, when 32-bytes, requires at least AVX2
assert(!isVectorT256 || compIsaSupportedDebugOnly(InstructionSet_AVX2));

if (compOpportunisticallyDependsOn(InstructionSet_SSE41))
@tannergooding (Member, Author) commented on Jul 30, 2020:

This logic was incorrect for conditional select, since it was no longer doing a bitwise operation, which is what the current software fallback and SSE2 implementations do. It was introduced in .NET 5 as part of the porting work.
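For context, a minimal sketch of the bitwise select pattern being referenced, matching the software fallback and the SSE2 path quoted later in this review; the helper name and the use of the Sse float overloads are illustrative, not the PR's code:

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Bitwise conditional select: each bit of the result comes from ifTrue where
// the corresponding selector bit is set, and from ifFalse where it is clear:
// (ifTrue & selector) | (ifFalse & ~selector).
static Vector128<float> ConditionalSelectBitwise(
    Vector128<float> selector, Vector128<float> ifTrue, Vector128<float> ifFalse)
{
    return Sse.Or(
        Sse.And(ifTrue, selector),
        Sse.AndNot(selector, ifFalse)); // AndNot computes (~selector) & ifFalse
}
```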

@tannergooding marked this pull request as ready for review on August 4, 2020 at 21:36
@tannergooding (Member, Author) commented:

CC @echesakovMSFT, @kunalspathak. This should be ready for review.

@tannergooding (Member, Author) commented:

Also CC @carlossanlop, @eiriktsarpalis, @pgovind.


if (AdvSimd.Arm64.IsSupported)
{
Vector128<uint> vResult = AdvSimd.CompareEqual(vector1, vector2).AsUInt32();
A reviewer commented:

Shouldn't we be able to just say return vResult == 0 here?

@tannergooding (Member, Author) replied:

No, the result is a vector and we need to convert it to a scalar int for comparison.
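As a minimal sketch of what that reduction can look like on ARM64 (the PR's actual code narrows the mask with zips; the helper name here is illustrative):

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

// Each lane of the mask is 0xFFFFFFFF (lanes compared equal) or 0x00000000.
// MinAcross yields the smallest lane, so it is all-ones only when every lane
// compared equal, which can then be tested as a scalar.
static bool AllLanesSet(Vector128<uint> mask)
{
    return AdvSimd.Arm64.MinAcross(mask).ToScalar() == uint.MaxValue;
}
```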

Sse2.And(ifTrue, selector),
Sse2.AndNot(selector, ifFalse)
);
Vector128<uint> vResult = AdvSimd.CompareEqual(vector1, vector2).AsUInt32();
A reviewer commented:

Same question here

@pgovind left a comment:

I'm good with the changes in S.P.CoreLib. Somebody else should sign off on the JIT change; I don't completely follow it.

@echesakov (Contributor) left a comment:

I looked over the code in S.P.C. I will do one more round at a later point.

AdvSimd.Store(&matrix.M21, AdvSimd.Arm64.ZipHigh(P00, P10));
AdvSimd.Store(&matrix.M31, AdvSimd.Arm64.ZipLow(P01, P11));
AdvSimd.Store(&matrix.M41, AdvSimd.Arm64.ZipHigh(P01, P11));

A contributor commented:

This should be a perfect case for using LD1/ST4 and LD4/ST1 (multiple registers) in the future:

LD1 { Vt.4S, Vt2.4S, Vt3.4S, Vt4.4S }, [Xn]
ST4 { Vt.4S, Vt2.4S, Vt3.4S, Vt4.4S }, [Xm]
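For reference, a sketch reconstructing the zip-based transpose that feeds the stores quoted above; the row loads and the derivation of P00/P10/P01/P11 are inferred from the quoted stores rather than copied from the PR, and this assumes the surrounding unsafe context:

```csharp
Vector128<float> M11 = AdvSimd.LoadVector128(&matrix.M11); // row 1
Vector128<float> M21 = AdvSimd.LoadVector128(&matrix.M21); // row 2
Vector128<float> M31 = AdvSimd.LoadVector128(&matrix.M31); // row 3
Vector128<float> M41 = AdvSimd.LoadVector128(&matrix.M41); // row 4

// Interleave rows 1/3 and 2/4, then interleave those results to complete
// the 4x4 transpose in two rounds of zips.
Vector128<float> P00 = AdvSimd.Arm64.ZipLow(M11, M31);  // m11 m31 m12 m32
Vector128<float> P01 = AdvSimd.Arm64.ZipHigh(M11, M31); // m13 m33 m14 m34
Vector128<float> P10 = AdvSimd.Arm64.ZipLow(M21, M41);  // m21 m41 m22 m42
Vector128<float> P11 = AdvSimd.Arm64.ZipHigh(M21, M41); // m23 m43 m24 m44

AdvSimd.Store(&matrix.M11, AdvSimd.Arm64.ZipLow(P00, P10)); // m11 m21 m31 m41
// The remaining three stores appear in the diff hunk quoted above.
```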

if (AdvSimd.IsSupported)
{
Vector128<float> zero = Vector128<float>.Zero;
AdvSimd.Store(&value.M11, AdvSimd.Subtract(zero, AdvSimd.LoadVector128(&value.M11)));
A contributor commented:

Consider using AdvSimd.Negate here
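That is, a sketch of the suggestion (not the merged code):

```csharp
// FNEG expresses the intent directly instead of subtracting from zero.
AdvSimd.Store(&value.M11, AdvSimd.Negate(AdvSimd.LoadVector128(&value.M11)));
```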


if (AdvSimd.Arm64.IsSupported)
{
Vector128<uint> vResult = AdvSimd.CompareEqual(vector1, vector2).AsUInt32();
A contributor commented:

I wonder if the following would be faster:

Vector128<uint> vResult = AdvSimd.CompareEqual(vector1, vector2).AsUInt32();
return (AdvSimd.MinPairwise(vResult, vResult).AsUInt64().ToScalar() == 0xFFFFFFFFFFFFFFFFUL);

@tannergooding (Member, Author) replied:

Possibly on ARM64. I copied the existing implementation from the DirectX Math Library, which has at least already been profiled/optimized for various scenarios.

A contributor commented:

Agree with @echesakovMSFT.

@tannergooding (Member, Author) replied:

Given the short timeframe to get this in, and that the existing code already shows gains, I'm going to wait until .NET 6 to do any additional refactorings/improvements. The existing code works and matches an already vetted implementation.

Vector128<uint> vResult = AdvSimd.CompareEqual(vector1, vector2).AsUInt32();

Vector64<byte> vResult0 = vResult.GetLower().AsByte();
Vector64<byte> vResult1 = vResult.GetUpper().AsByte();
A contributor commented:

Same as for Equal - could this be done the following way?

Vector128<uint> vResult = AdvSimd.CompareEqual(vector1, vector2).AsUInt32();
return (AdvSimd.MinPairwise(vResult, vResult).AsUInt64().ToScalar() != 0xFFFFFFFFFFFFFFFFUL);

A contributor commented:

Wondering why this can't just be return !Equal(vector1, vector2)?

@tannergooding (Member, Author) replied on Aug 10, 2020:

NaN generally represents an issue for normal floating-point inversion checks.
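A minimal scalar illustration of the NaN behavior being referenced:

```csharp
float nan = float.NaN;
Console.WriteLine(nan == nan); // False: NaN never compares equal, even to itself
Console.WriteLine(nan != nan); // True
// CompareEqual produces an all-zero lane for any NaN input, so code deriving
// "not equal" by inverting an equality mask must account for NaN lanes.
```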

@kunalspathak (Contributor) commented:

Thanks @tannergooding. Do you have performance numbers showing the improvements from this? Also, I don't see any benchmarks related to Matrix4x4 in dotnet/performance. Could you please add some benchmarks so we can get a .NET Core 3.1 vs. .NET 5.0 comparison?

@tannergooding (Member, Author) replied:

BenchmarkDotNet=v0.12.1.1405-nightly, OS=Windows 10.0.19041.388 (2004/May2020Update/20H1)
Microsoft SQ1 3.0 GHz, 1 CPU, 8 logical and 8 physical cores
.NET Core SDK=5.0.100-rc.1.20407.13
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.40416, CoreFX 5.0.20.40416), Arm64 RyuJIT
  Job-LDHERM : .NET Core 5.0 (CoreCLR 42.42.42.42424, CoreFX 42.42.42.42424), Arm64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  Arguments=/p:DebugType=portable  Toolchain=CoreRun
IterationTime=250.0000 ms  MaxIterationCount=20  MinIterationCount=15
WarmupCount=1
| Method | Job | Mean | Error | StdDev | Median | Min | Max | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|---|
| AddOperatorBenchmark | Preview 8 | 21.354 ns | 0.0510 ns | 0.0426 ns | 21.360 ns | 21.261 ns | 21.425 ns | - | - | - | - |
| AddOperatorBenchmark | PR | 15.030 ns | 0.0240 ns | 0.0213 ns | 15.029 ns | 14.999 ns | 15.063 ns | - | - | - | - |
| EqualityOperatorBenchmark | Preview 8 | 14.115 ns | 0.0396 ns | 0.0351 ns | 14.104 ns | 14.083 ns | 14.198 ns | - | - | - | - |
| EqualityOperatorBenchmark | PR | 12.964 ns | 0.0202 ns | 0.0168 ns | 12.964 ns | 12.938 ns | 12.997 ns | - | - | - | - |
| InequalityOperatorBenchmark | Preview 8 | 13.748 ns | 0.0254 ns | 0.0225 ns | 13.742 ns | 13.721 ns | 13.806 ns | - | - | - | - |
| InequalityOperatorBenchmark | PR | 9.906 ns | 0.0212 ns | 0.0188 ns | 9.905 ns | 9.882 ns | 9.936 ns | - | - | - | - |
| MultiplyByMatrixOperatorBenchmark | Preview 8 | 40.910 ns | 0.1037 ns | 0.0866 ns | 40.915 ns | 40.735 ns | 41.053 ns | - | - | - | - |
| MultiplyByMatrixOperatorBenchmark | PR | 18.500 ns | 0.0155 ns | 0.0138 ns | 18.500 ns | 18.471 ns | 18.524 ns | - | - | - | - |
| MultiplyByScalarOperatorBenchmark | Preview 8 | 14.749 ns | 0.0190 ns | 0.0169 ns | 14.749 ns | 14.728 ns | 14.788 ns | - | - | - | - |
| MultiplyByScalarOperatorBenchmark | PR | 12.375 ns | 0.0174 ns | 0.0154 ns | 12.374 ns | 12.347 ns | 12.408 ns | - | - | - | - |
| SubtractOperatorBenchmark | Preview 8 | 19.567 ns | 0.0198 ns | 0.0176 ns | 19.565 ns | 19.538 ns | 19.591 ns | - | - | - | - |
| SubtractOperatorBenchmark | PR | 15.421 ns | 0.0218 ns | 0.0194 ns | 15.420 ns | 15.390 ns | 15.456 ns | - | - | - | - |
| NegationOperatorBenchmark | Preview 8 | 14.668 ns | 0.0219 ns | 0.0204 ns | 14.661 ns | 14.634 ns | 14.703 ns | - | - | - | - |
| NegationOperatorBenchmark | PR | 11.426 ns | 0.0311 ns | 0.0260 ns | 11.423 ns | 11.367 ns | 11.466 ns | - | - | - | - |
| AddBenchmark | Preview 8 | 23.281 ns | 0.1212 ns | 0.1075 ns | 23.254 ns | 23.157 ns | 23.507 ns | - | - | - | - |
| AddBenchmark | PR | 18.093 ns | 0.0316 ns | 0.0264 ns | 18.084 ns | 18.061 ns | 18.162 ns | - | - | - | - |
| LerpBenchmark | Preview 8 | 24.464 ns | 0.0588 ns | 0.0521 ns | 24.467 ns | 24.342 ns | 24.539 ns | - | - | - | - |
| LerpBenchmark | PR | 14.437 ns | 0.0693 ns | 0.0614 ns | 14.448 ns | 14.259 ns | 14.515 ns | - | - | - | - |
| MultiplyByMatrixBenchmark | Preview 8 | 40.726 ns | 0.1982 ns | 0.1655 ns | 40.721 ns | 40.422 ns | 41.062 ns | - | - | - | - |
| MultiplyByMatrixBenchmark | PR | 21.933 ns | 0.0182 ns | 0.0162 ns | 21.933 ns | 21.902 ns | 21.967 ns | - | - | - | - |
| MultiplyByScalarBenchmark | Preview 8 | 17.390 ns | 0.0636 ns | 0.0564 ns | 17.387 ns | 17.323 ns | 17.518 ns | - | - | - | - |
| MultiplyByScalarBenchmark | PR | 13.020 ns | 0.0515 ns | 0.0457 ns | 13.039 ns | 12.909 ns | 13.051 ns | - | - | - | - |
| NegateBenchmark | Preview 8 | 17.081 ns | 0.0507 ns | 0.0449 ns | 17.061 ns | 17.023 ns | 17.182 ns | - | - | - | - |
| NegateBenchmark | PR | 13.472 ns | 0.0168 ns | 0.0158 ns | 13.472 ns | 13.448 ns | 13.500 ns | - | - | - | - |
| SubtractBenchmark | Preview 8 | 23.747 ns | 0.0463 ns | 0.0410 ns | 23.739 ns | 23.701 ns | 23.820 ns | - | - | - | - |
| SubtractBenchmark | PR | 18.497 ns | 0.3187 ns | 0.2981 ns | 18.586 ns | 17.935 ns | 18.799 ns | - | - | - | - |
| Transpose | Preview 8 | 16.591 ns | 0.1079 ns | 0.0901 ns | 16.547 ns | 16.488 ns | 16.773 ns | - | - | - | - |
| Transpose | PR | 12.493 ns | 0.1507 ns | 0.1336 ns | 12.543 ns | 12.175 ns | 12.567 ns | - | - | - | - |

Corresponding dotnet/performance PR: dotnet/performance#1442

@tannergooding (Member, Author) commented:

The rest of the perf benchmarks I added remained the same; they have opportunities to be optimized for x86/x64 and ARM64.

@@ -226,7 +226,7 @@ public static double BitIncrement(double x)
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static double CopySign(double x, double y)
{
-            if (Sse.IsSupported || AdvSimd.IsSupported)
+            if (Sse2.IsSupported || AdvSimd.IsSupported)
{
A contributor commented:

Was this a typo?

@tannergooding (Member, Author) replied:

No. There is little need to differentiate SSE vs. SSE2, since both are baseline for RyuJIT; this just simplifies the overall logic.

else
{
// Redundant test so we won't prejit remainder of this method on platforms without AdvSimd.
throw new PlatformNotSupportedException();
A contributor commented:

We are not optimizing Permute() with ARM64 intrinsics? Same for some of the other methods below, like Invert()?

@tannergooding (Member, Author) replied:

Permute is only used from the x86 code paths. ARM64 doesn't have a general Permute/Shuffle instruction, and the corresponding ARM64 implementations of functions like Invert differ quite significantly.
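For context, a sketch of what such an x86-only Permute helper might look like; the branch structure and instruction choices here are assumptions for illustration, not the PR's exact code:

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static Vector128<float> Permute(Vector128<float> value, byte control)
{
    if (Avx.IsSupported)
    {
        return Avx.Permute(value, control);        // vpermilps: one input, one control
    }
    else if (Sse.IsSupported)
    {
        return Sse.Shuffle(value, value, control); // shufps of the value with itself
    }
    else
    {
        // Redundant test so we won't prejit the remainder of this method
        // on unsupported platforms.
        throw new PlatformNotSupportedException();
    }
}
```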

if (AdvSimd.Arm64.IsSupported)
{
// This implementation is based on the DirectX Math Library XMMatrixTranspose method
// https://github.com/microsoft/DirectXMath/blob/master/Inc/DirectXMathMatrix.inl
A contributor commented:

nit: Have this comment at the top.

@tannergooding (Member, Author) replied:

I don't believe the comment applies to the x64 implementation.


// Repeat for the other 3 rows

Vector128<float> M21 = AdvSimd.LoadVector128(&value1.M21);
A contributor commented:

There is a lot of repeated code; consider extracting it into a function and passing the 4 rows of data to it.

@tannergooding (Member, Author) replied:

The matrix code is small and more susceptible to inlining differences than most. I can investigate extracting it for .NET 6.


@kunalspathak (Contributor) left a comment:

LGTM

@tannergooding merged commit 3642dee into dotnet:master on Aug 11, 2020
@karelz added this to the 5.0.0 milestone on Aug 18, 2020
@ghost locked as resolved and limited conversation to collaborators on Dec 8, 2020
Successfully merging this pull request may close the following issue:

[Arm64] Vectorize System.Numerics.Matrix4x4 using hardware intrinsics (#33565)