-
Notifications
You must be signed in to change notification settings - Fork 5.1k
Updating Matrix4x4 to accelerate several functions using Arm.AdvSimd #40054
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
91ca86f
to
ae70564
Compare
// Vector<T> for the rel-ops covered here requires at least SSE2 | ||
assert(compIsaSupportedDebugOnly(InstructionSet_SSE2)); | ||
|
||
// Vector<T>, when 32-bytes, requires at least AVX2 | ||
assert(!isVectorT256 || compIsaSupportedDebugOnly(InstructionSet_AVX2)); | ||
|
||
if (compOpportunisticallyDependsOn(InstructionSet_SSE41)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was incorrect logic for conditional select as it was no longer doing a bitwise operation, which is what the current software fallback and SSE2 implementations are doing. It was introduced in .NET 5 as part of the port logic
ae70564
to
61250a6
Compare
CC. @echesakovMSFT, @kunalspathak. This should be ready for review |
Also CC. @carlossanlop, @eiriktsarpalis, @pgovind |
|
||
if (AdvSimd.Arm64.IsSupported) | ||
{ | ||
Vector128<uint> vResult = AdvSimd.CompareEqual(vector1, vector2).AsUInt32(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we be able to just say return vResult == 0
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, the result is a vector and we need to convert it to a scalar int for comparison.
Sse2.And(ifTrue, selector), | ||
Sse2.AndNot(selector, ifFalse) | ||
); | ||
Vector128<uint> vResult = AdvSimd.CompareEqual(vector1, vector2).AsUInt32(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same question here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm good with the changes in S.P.CoreLib. Somebody else should sign off on the jit change. I don't completely follow it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked over the code in S.P.C. I will do one more round at the later point.
AdvSimd.Store(&matrix.M21, AdvSimd.Arm64.ZipHigh(P00, P10)); | ||
AdvSimd.Store(&matrix.M31, AdvSimd.Arm64.ZipLow(P01, P11)); | ||
AdvSimd.Store(&matrix.M41, AdvSimd.Arm64.ZipHigh(P01, P11)); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be a perfect case for using LD1/ST4
and LD4/ST1
(multiple registers) in the future
LD1 { Vt.4S, Vt2.4S, Vt3.4S, Vt4.4S }, [Xn]
ST4 { Vt.4S, Vt2.4S, Vt3.4S, Vt4.4S }, [Xm]
if (AdvSimd.IsSupported) | ||
{ | ||
Vector128<float> zero = Vector128<float>.Zero; | ||
AdvSimd.Store(&value.M11, AdvSimd.Subtract(zero, AdvSimd.LoadVector128(&value.M11))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider using AdvSimd.Negate
here
|
||
if (AdvSimd.Arm64.IsSupported) | ||
{ | ||
Vector128<uint> vResult = AdvSimd.CompareEqual(vector1, vector2).AsUInt32(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if the following are faster?
Vector128<uint> vResult = AdvSimd.CompareEqual(vector1, vector2).AsUInt32();
return (AdvSimd.MinPairwise(vResult, vResult).AsUInt64().ToScalar() == 0xFFFFFFFFFFFFFFFFUL);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Possibly on ARM64. I copied the existing implementation from the DirectX Math Library, which at least been already profiled/optimized for various scenarios.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree with @echesakovMSFT.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the short timeframe to get this in and that the existing code already shows gains, I'm going to wait on doing any additional refactorings/improvements until .NET 6.
The existing code works and is what an already vetted implementation is using.
Vector128<uint> vResult = AdvSimd.CompareEqual(vector1, vector2).AsUInt32(); | ||
|
||
Vector64<byte> vResult0 = vResult.GetLower().AsByte(); | ||
Vector64<byte> vResult1 = vResult.GetUpper().AsByte(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as for Equal
- could this be done the following way?
Vector128<uint> vResult = AdvSimd.CompareEqual(vector1, vector2).AsUInt32();
return (AdvSimd.MinPairwise(vResult, vResult).AsUInt64().ToScalar() != 0xFFFFFFFFFFFFFFFFUL);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wondering, why this can't be just return !Equal(vector1, vector2)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NaN generally represents an issue for normal floating-point inversion checks
Thanks @tannergooding . Do you have performance improvements from this? Also, I don't see any benchmark related to |
Corresponding dotnet/performance PR: dotnet/performance#1442 |
The rest of the perf benchmarks I added remained the same they have opportunities to be optimized for x86/x64 and ARM64. |
@@ -226,7 +226,7 @@ public static double BitIncrement(double x) | |||
[MethodImpl(MethodImplOptions.AggressiveInlining)] | |||
public static double CopySign(double x, double y) | |||
{ | |||
if (Sse.IsSupported || AdvSimd.IsSupported) | |||
if (Sse2.IsSupported || AdvSimd.IsSupported) | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was this a typo?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. There is little need to differentiate on SSE vs SSE2 considering both are considered baseline to RyuJIT. This just simplifies the overall logic.
else | ||
{ | ||
// Redundant test so we won't prejit remainder of this method on platforms without AdvSimd. | ||
throw new PlatformNotSupportedException(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are not optimizing Permute()
with ARM64 intrinsics? Same for some of the other methods below like Invert()
,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Permute is only used from the x86 code paths. ARM doesn't have a Permute
/Shuffle
instruction and the corresponding ARM64 implementation for functions like Invert
differs quite significantly.
if (AdvSimd.Arm64.IsSupported) | ||
{ | ||
// This implementation is based on the DirectX Math Library XMMatrixTranspose method | ||
// https://github.com/microsoft/DirectXMath/blob/master/Inc/DirectXMathMatrix.inl |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Have comment at the top.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't believe the comment applies to the x64 implementation
|
||
// Repeat for the other 3 rows | ||
|
||
Vector128<float> M21 = AdvSimd.LoadVector128(&value1.M21); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lot of repeated code. consider extracting the code in a function and passing 4 rows data to it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The matrix code is small and is more susceptible to inlining differences than some. I can investigate extracting it for .NET 6
Vector128<uint> vResult = AdvSimd.CompareEqual(vector1, vector2).AsUInt32(); | ||
|
||
Vector64<byte> vResult0 = vResult.GetLower().AsByte(); | ||
Vector64<byte> vResult1 = vResult.GetUpper().AsByte(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wondering, why this can't be just return !Equal(vector1, vector2)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
This resolves #33565