
Updating Matrix4x4 to accelerate several functions using Arm.AdvSimd #40054


Merged: 4 commits into dotnet:master from the matrix4x4 branch, Aug 11, 2020

Conversation

@tannergooding (Member) commented on Jul 29, 2020:

This resolves #33565

@Dotnet-GitSync-Bot added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) on Jul 29, 2020
@tannergooding force-pushed the matrix4x4 branch 4 times, most recently from 91ca86f to ae70564, on July 30, 2020 at 18:31
// Vector<T> for the rel-ops covered here requires at least SSE2
assert(compIsaSupportedDebugOnly(InstructionSet_SSE2));

// Vector<T>, when 32-bytes, requires at least AVX2
assert(!isVectorT256 || compIsaSupportedDebugOnly(InstructionSet_AVX2));

if (compOpportunisticallyDependsOn(InstructionSet_SSE41))
@tannergooding (Member, Author) commented on Jul 30, 2020:

This logic was incorrect for conditional select, since it was no longer doing a bitwise operation, which is what the current software fallback and SSE2 implementations do. It was introduced in .NET 5 as part of the porting work.
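For context, a minimal sketch of the bitwise select pattern being referenced, matching the software fallback and the SSE2 path quoted later in this review; the helper name and the use of the Sse float overloads are illustrative, not the PR's code:

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Bitwise conditional select: each bit of the result comes from ifTrue where
// the corresponding selector bit is set, and from ifFalse where it is clear:
// (ifTrue & selector) | (ifFalse & ~selector).
static Vector128<float> ConditionalSelectBitwise(
    Vector128<float> selector, Vector128<float> ifTrue, Vector128<float> ifFalse)
{
    return Sse.Or(
        Sse.And(ifTrue, selector),
        Sse.AndNot(selector, ifFalse)); // AndNot computes (~selector) & ifFalse
}
```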

@tannergooding marked this pull request as ready for review on August 4, 2020 at 21:36
@tannergooding (Member, Author) commented:

CC @echesakovMSFT, @kunalspathak. This should be ready for review.

@tannergooding (Member, Author) commented:

Also CC @carlossanlop, @eiriktsarpalis, @pgovind.


if (AdvSimd.Arm64.IsSupported)
{
Vector128<uint> vResult = AdvSimd.CompareEqual(vector1, vector2).AsUInt32();
A reviewer commented:

Shouldn't we be able to just say return vResult == 0 here?

@tannergooding (Member, Author) replied:

No, the result is a vector and we need to convert it to a scalar int for comparison.
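As a minimal sketch of what that reduction can look like on ARM64 (the PR's actual code narrows the mask with zips; the helper name here is illustrative):

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

// Each lane of the mask is 0xFFFFFFFF (lanes compared equal) or 0x00000000.
// MinAcross yields the smallest lane, so it is all-ones only when every lane
// compared equal, which can then be tested as a scalar.
static bool AllLanesSet(Vector128<uint> mask)
{
    return AdvSimd.Arm64.MinAcross(mask).ToScalar() == uint.MaxValue;
}
```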

Sse2.And(ifTrue, selector),
Sse2.AndNot(selector, ifFalse)
);
Vector128<uint> vResult = AdvSimd.CompareEqual(vector1, vector2).AsUInt32();
A reviewer commented:

Same question here

@pgovind left a comment:

I'm good with the changes in S.P.CoreLib. Somebody else should sign off on the JIT change; I don't completely follow it.

@echesakov (Contributor) left a comment:

I looked over the code in S.P.C. I will do one more round at a later point.

AdvSimd.Store(&matrix.M21, AdvSimd.Arm64.ZipHigh(P00, P10));
AdvSimd.Store(&matrix.M31, AdvSimd.Arm64.ZipLow(P01, P11));
AdvSimd.Store(&matrix.M41, AdvSimd.Arm64.ZipHigh(P01, P11));

A contributor commented:

This should be a perfect case for using LD1/ST4 and LD4/ST1 (multiple registers) in the future:

LD1 { Vt.4S, Vt2.4S, Vt3.4S, Vt4.4S }, [Xn]
ST4 { Vt.4S, Vt2.4S, Vt3.4S, Vt4.4S }, [Xm]
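For reference, a sketch reconstructing the zip-based transpose that feeds the stores quoted above; the row loads and the derivation of P00/P10/P01/P11 are inferred from the quoted stores rather than copied from the PR, and this assumes the surrounding unsafe context:

```csharp
Vector128<float> M11 = AdvSimd.LoadVector128(&matrix.M11); // row 1
Vector128<float> M21 = AdvSimd.LoadVector128(&matrix.M21); // row 2
Vector128<float> M31 = AdvSimd.LoadVector128(&matrix.M31); // row 3
Vector128<float> M41 = AdvSimd.LoadVector128(&matrix.M41); // row 4

// Interleave rows 1/3 and 2/4, then interleave those results to complete
// the 4x4 transpose in two rounds of zips.
Vector128<float> P00 = AdvSimd.Arm64.ZipLow(M11, M31);  // m11 m31 m12 m32
Vector128<float> P01 = AdvSimd.Arm64.ZipHigh(M11, M31); // m13 m33 m14 m34
Vector128<float> P10 = AdvSimd.Arm64.ZipLow(M21, M41);  // m21 m41 m22 m42
Vector128<float> P11 = AdvSimd.Arm64.ZipHigh(M21, M41); // m23 m43 m24 m44

AdvSimd.Store(&matrix.M11, AdvSimd.Arm64.ZipLow(P00, P10)); // m11 m21 m31 m41
// The remaining three stores appear in the diff hunk quoted above.
```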

if (AdvSimd.IsSupported)
{
Vector128<float> zero = Vector128<float>.Zero;
AdvSimd.Store(&value.M11, AdvSimd.Subtract(zero, AdvSimd.LoadVector128(&value.M11)));
A contributor commented:

Consider using AdvSimd.Negate here
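That is, a sketch of the suggestion (not the merged code):

```csharp
// FNEG expresses the intent directly instead of subtracting from zero.
AdvSimd.Store(&value.M11, AdvSimd.Negate(AdvSimd.LoadVector128(&value.M11)));
```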


if (AdvSimd.Arm64.IsSupported)
{
Vector128<uint> vResult = AdvSimd.CompareEqual(vector1, vector2).AsUInt32();
A contributor commented:

I wonder if the following would be faster:

Vector128<uint> vResult = AdvSimd.CompareEqual(vector1, vector2).AsUInt32();
return (AdvSimd.MinPairwise(vResult, vResult).AsUInt64().ToScalar() == 0xFFFFFFFFFFFFFFFFUL);

@tannergooding (Member, Author) replied:

Possibly on ARM64. I copied the existing implementation from the DirectX Math Library, which has at least already been profiled/optimized for various scenarios.

A contributor commented:

Agree with @echesakovMSFT.

@tannergooding (Member, Author) replied:

Given the short timeframe to get this in, and that the existing code already shows gains, I'm going to wait until .NET 6 to do any additional refactorings/improvements. The existing code works and matches an already vetted implementation.

Vector128<uint> vResult = AdvSimd.CompareEqual(vector1, vector2).AsUInt32();

Vector64<byte> vResult0 = vResult.GetLower().AsByte();
Vector64<byte> vResult1 = vResult.GetUpper().AsByte();
A contributor commented:

Same as for Equal - could this be done the following way?

Vector128<uint> vResult = AdvSimd.CompareEqual(vector1, vector2).AsUInt32();
return (AdvSimd.MinPairwise(vResult, vResult).AsUInt64().ToScalar() != 0xFFFFFFFFFFFFFFFFUL);

A contributor commented:

Wondering why this can't just be return !Equal(vector1, vector2)?

@tannergooding (Member, Author) replied on Aug 10, 2020:

NaN generally represents an issue for normal floating-point inversion checks.
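A minimal scalar illustration of the NaN behavior being referenced:

```csharp
float nan = float.NaN;
Console.WriteLine(nan == nan); // False: NaN never compares equal, even to itself
Console.WriteLine(nan != nan); // True
// CompareEqual produces an all-zero lane for any NaN input, so code deriving
// "not equal" by inverting an equality mask must account for NaN lanes.
```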

@kunalspathak (Contributor) commented:

Thanks @tannergooding. Do you have performance numbers showing the improvements from this? Also, I don't see any benchmarks related to Matrix4x4 in dotnet/performance. Could you please add some benchmarks so we can get a .NET Core 3.1 vs. .NET 5.0 comparison?

@tannergooding (Member, Author) replied:

BenchmarkDotNet=v0.12.1.1405-nightly, OS=Windows 10.0.19041.388 (2004/May2020Update/20H1)
Microsoft SQ1 3.0 GHz, 1 CPU, 8 logical and 8 physical cores
.NET Core SDK=5.0.100-rc.1.20407.13
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.40416, CoreFX 5.0.20.40416), Arm64 RyuJIT
  Job-LDHERM : .NET Core 5.0 (CoreCLR 42.42.42.42424, CoreFX 42.42.42.42424), Arm64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  Arguments=/p:DebugType=portable  Toolchain=CoreRun
IterationTime=250.0000 ms  MaxIterationCount=20  MinIterationCount=15
WarmupCount=1
| Method | Job | Mean | Error | StdDev | Median | Min | Max | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|---|
| AddOperatorBenchmark | Preview 8 | 21.354 ns | 0.0510 ns | 0.0426 ns | 21.360 ns | 21.261 ns | 21.425 ns | - | - | - | - |
| AddOperatorBenchmark | PR | 15.030 ns | 0.0240 ns | 0.0213 ns | 15.029 ns | 14.999 ns | 15.063 ns | - | - | - | - |
| EqualityOperatorBenchmark | Preview 8 | 14.115 ns | 0.0396 ns | 0.0351 ns | 14.104 ns | 14.083 ns | 14.198 ns | - | - | - | - |
| EqualityOperatorBenchmark | PR | 12.964 ns | 0.0202 ns | 0.0168 ns | 12.964 ns | 12.938 ns | 12.997 ns | - | - | - | - |
| InequalityOperatorBenchmark | Preview 8 | 13.748 ns | 0.0254 ns | 0.0225 ns | 13.742 ns | 13.721 ns | 13.806 ns | - | - | - | - |
| InequalityOperatorBenchmark | PR | 9.906 ns | 0.0212 ns | 0.0188 ns | 9.905 ns | 9.882 ns | 9.936 ns | - | - | - | - |
| MultiplyByMatrixOperatorBenchmark | Preview 8 | 40.910 ns | 0.1037 ns | 0.0866 ns | 40.915 ns | 40.735 ns | 41.053 ns | - | - | - | - |
| MultiplyByMatrixOperatorBenchmark | PR | 18.500 ns | 0.0155 ns | 0.0138 ns | 18.500 ns | 18.471 ns | 18.524 ns | - | - | - | - |
| MultiplyByScalarOperatorBenchmark | Preview 8 | 14.749 ns | 0.0190 ns | 0.0169 ns | 14.749 ns | 14.728 ns | 14.788 ns | - | - | - | - |
| MultiplyByScalarOperatorBenchmark | PR | 12.375 ns | 0.0174 ns | 0.0154 ns | 12.374 ns | 12.347 ns | 12.408 ns | - | - | - | - |
| SubtractOperatorBenchmark | Preview 8 | 19.567 ns | 0.0198 ns | 0.0176 ns | 19.565 ns | 19.538 ns | 19.591 ns | - | - | - | - |
| SubtractOperatorBenchmark | PR | 15.421 ns | 0.0218 ns | 0.0194 ns | 15.420 ns | 15.390 ns | 15.456 ns | - | - | - | - |
| NegationOperatorBenchmark | Preview 8 | 14.668 ns | 0.0219 ns | 0.0204 ns | 14.661 ns | 14.634 ns | 14.703 ns | - | - | - | - |
| NegationOperatorBenchmark | PR | 11.426 ns | 0.0311 ns | 0.0260 ns | 11.423 ns | 11.367 ns | 11.466 ns | - | - | - | - |
| AddBenchmark | Preview 8 | 23.281 ns | 0.1212 ns | 0.1075 ns | 23.254 ns | 23.157 ns | 23.507 ns | - | - | - | - |
| AddBenchmark | PR | 18.093 ns | 0.0316 ns | 0.0264 ns | 18.084 ns | 18.061 ns | 18.162 ns | - | - | - | - |
| LerpBenchmark | Preview 8 | 24.464 ns | 0.0588 ns | 0.0521 ns | 24.467 ns | 24.342 ns | 24.539 ns | - | - | - | - |
| LerpBenchmark | PR | 14.437 ns | 0.0693 ns | 0.0614 ns | 14.448 ns | 14.259 ns | 14.515 ns | - | - | - | - |
| MultiplyByMatrixBenchmark | Preview 8 | 40.726 ns | 0.1982 ns | 0.1655 ns | 40.721 ns | 40.422 ns | 41.062 ns | - | - | - | - |
| MultiplyByMatrixBenchmark | PR | 21.933 ns | 0.0182 ns | 0.0162 ns | 21.933 ns | 21.902 ns | 21.967 ns | - | - | - | - |
| MultiplyByScalarBenchmark | Preview 8 | 17.390 ns | 0.0636 ns | 0.0564 ns | 17.387 ns | 17.323 ns | 17.518 ns | - | - | - | - |
| MultiplyByScalarBenchmark | PR | 13.020 ns | 0.0515 ns | 0.0457 ns | 13.039 ns | 12.909 ns | 13.051 ns | - | - | - | - |
| NegateBenchmark | Preview 8 | 17.081 ns | 0.0507 ns | 0.0449 ns | 17.061 ns | 17.023 ns | 17.182 ns | - | - | - | - |
| NegateBenchmark | PR | 13.472 ns | 0.0168 ns | 0.0158 ns | 13.472 ns | 13.448 ns | 13.500 ns | - | - | - | - |
| SubtractBenchmark | Preview 8 | 23.747 ns | 0.0463 ns | 0.0410 ns | 23.739 ns | 23.701 ns | 23.820 ns | - | - | - | - |
| SubtractBenchmark | PR | 18.497 ns | 0.3187 ns | 0.2981 ns | 18.586 ns | 17.935 ns | 18.799 ns | - | - | - | - |
| Transpose | Preview 8 | 16.591 ns | 0.1079 ns | 0.0901 ns | 16.547 ns | 16.488 ns | 16.773 ns | - | - | - | - |
| Transpose | PR | 12.493 ns | 0.1507 ns | 0.1336 ns | 12.543 ns | 12.175 ns | 12.567 ns | - | - | - | - |

Corresponding dotnet/performance PR: dotnet/performance#1442

@tannergooding (Member, Author) commented:

The rest of the perf benchmarks I added remained the same; they have opportunities to be optimized for x86/x64 and ARM64.

@@ -226,7 +226,7 @@ public static double BitIncrement(double x)
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static double CopySign(double x, double y)
{
-            if (Sse.IsSupported || AdvSimd.IsSupported)
+            if (Sse2.IsSupported || AdvSimd.IsSupported)
{
A contributor commented:

Was this a typo?

@tannergooding (Member, Author) replied:

No. There is little need to differentiate SSE vs. SSE2, since both are baseline for RyuJIT; this just simplifies the overall logic.

else
{
// Redundant test so we won't prejit remainder of this method on platforms without AdvSimd.
throw new PlatformNotSupportedException();
A contributor commented:

We are not optimizing Permute() with ARM64 intrinsics? Same for some of the other methods below, like Invert()?

@tannergooding (Member, Author) replied:

Permute is only used from the x86 code paths. ARM64 doesn't have a general Permute/Shuffle instruction, and the corresponding ARM64 implementations of functions like Invert differ quite significantly.
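For context, a sketch of what such an x86-only Permute helper might look like; the branch structure and instruction choices here are assumptions for illustration, not the PR's exact code:

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static Vector128<float> Permute(Vector128<float> value, byte control)
{
    if (Avx.IsSupported)
    {
        return Avx.Permute(value, control);        // vpermilps: one input, one control
    }
    else if (Sse.IsSupported)
    {
        return Sse.Shuffle(value, value, control); // shufps of the value with itself
    }
    else
    {
        // Redundant test so we won't prejit the remainder of this method
        // on unsupported platforms.
        throw new PlatformNotSupportedException();
    }
}
```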

if (AdvSimd.Arm64.IsSupported)
{
// This implementation is based on the DirectX Math Library XMMatrixTranspose method
// https://github.com/microsoft/DirectXMath/blob/master/Inc/DirectXMathMatrix.inl
A contributor commented:

nit: Have this comment at the top.

@tannergooding (Member, Author) replied:

I don't believe the comment applies to the x64 implementation.


// Repeat for the other 3 rows

Vector128<float> M21 = AdvSimd.LoadVector128(&value1.M21);
A contributor commented:

There is a lot of repeated code; consider extracting it into a function and passing the 4 rows of data to it.

@tannergooding (Member, Author) replied:

The matrix code is small and more susceptible to inlining differences than most. I can investigate extracting it for .NET 6.


@kunalspathak (Contributor) left a comment:

LGTM

@tannergooding merged commit 3642dee into dotnet:master on Aug 11, 2020
@karelz added this to the 5.0.0 milestone on Aug 18, 2020
@ghost locked as resolved and limited conversation to collaborators on Dec 8, 2020
Successfully merging this pull request may close the following issue:

[Arm64] Vectorize System.Numerics.Matrix4x4 using hardware intrinsics (#33565)