Description
The heuristic the JIT uses to decide how to zero-initialize locals is very simplistic. In many cases, faster instruction sequences could be used.
Here is one example. An attempt was made to switch String.Split to use Spans (stephentoub/coreclr@500978f) in order to avoid int[] allocations. This resulted in several more temporary structs being allocated and zero-initialized, which made the following benchmark ~12% slower than the non-Span version:
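```csharp
using System;

public static class Program
{
    public static void Main()
    {
        DateTime start = DateTime.Now;
        // Split a short string 30 million times; the result is intentionally discarded.
        for (int i = 0; i < 30000000; ++i)
        {
            "abc,def,ghi".Split(',');
        }
        Console.WriteLine((DateTime.Now - start).TotalMilliseconds);
    }
}
```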
The current heuristic uses rep stosd in the prolog if the JIT needs to initialize 16 bytes of locals. (Strictly, the heuristic is slightly different if any structs larger than 24 bytes need to be initialized, but that is not relevant for this benchmark.) As an experiment, I changed the heuristic so that this benchmark uses mov instructions instead of rep stosd. With that change, the Span version recovers all of the performance it had lost relative to the array version.
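The decision described above can be modeled as a simple size threshold. This is a minimal sketch, not the JIT's code: the class name ZeroInitHeuristic, the ZeroInitStrategy enum, and the choice of "more than 16 bytes" as the cutoff are assumptions made for illustration, and the special case for structs larger than 24 bytes is not modeled.

```csharp
using System.Collections.Generic;

// Hypothetical model of the prolog zero-init choice; not the JIT's actual types.
public enum ZeroInitStrategy
{
    PerSlotMov, // xor rax, rax followed by one mov per 8-byte slot
    BlockInit   // lea rdi, ...; mov ecx, N; rep stosd
}

public static class ZeroInitHeuristic
{
    // Assumed cutoff: use block init once more than 16 bytes need zeroing.
    // The special case for structs larger than 24 bytes is deliberately omitted.
    public const int BlockInitThresholdBytes = 16;

    public static ZeroInitStrategy Choose(IReadOnlyList<int> slotSizesInBytes)
    {
        int totalBytes = 0;
        foreach (int size in slotSizesInBytes)
        {
            totalBytes += size;
        }

        return totalBytes > BlockInitThresholdBytes
            ? ZeroInitStrategy.BlockInit
            : ZeroInitStrategy.PerSlotMov;
    }
}
```

For the eight 8-byte slots that the mov sequence in this issue zeroes, Choose returns BlockInit; the experiment amounts to forcing PerSlotMov for this method regardless of the total.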
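Here are the two initialization sequences. With the current heuristic, the JIT emits rep stosd:

```asm
488BF1       mov      rsi, rcx        ; save rcx, since rep stosd uses ecx as its count
488D7D08     lea      rdi, [rbp+08H]  ; destination: start of the locals to zero
B912000000   mov      ecx, 18         ; count: 18 dwords = 72 bytes
33C0         xor      rax, rax        ; value to store: zero
F3AB         rep stosd
488BCE       mov      rcx, rsi        ; restore rcx
```

With the experimental change, it emits individual mov instructions instead:

```asm
33C0         xor      rax, rax
48894538     mov      qword ptr [rbp+38H], rax
48894540     mov      qword ptr [rbp+40H], rax
48894530     mov      qword ptr [rbp+30H], rax
48894528     mov      qword ptr [rbp+28H], rax
48894518     mov      qword ptr [rbp+18H], rax
48894520     mov      qword ptr [rbp+20H], rax
48894508     mov      qword ptr [rbp+08H], rax
48894548     mov      qword ptr [rbp+48H], rax
```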
The second sequence is faster than the first, but we can probably do even better with xmm registers, which can zero 16 bytes per store.
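A rough sketch of what an xmm-based sequence might look like, assuming the same contiguous 72-byte range starting at [rbp+08H] that the rep stosd sequence clears. The emitter below (XmmZeroInitSketch.Emit) is hypothetical, assumes non-negative frame offsets and sizes that are multiples of 8, and uses unaligned movdqu stores; it only produces instruction text in the style of the listings above and says nothing about what the JIT would actually emit.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical emitter: zeroes [rbp+startOffset, rbp+startOffset+sizeInBytes)
// with 16-byte xmm stores, then at most one 8-byte store for the tail.
public static class XmmZeroInitSketch
{
    public static List<string> Emit(int startOffset, int sizeInBytes)
    {
        if (startOffset < 0)
        {
            throw new ArgumentException("offset must be non-negative", nameof(startOffset));
        }
        if (sizeInBytes < 0 || sizeInBytes % 8 != 0)
        {
            throw new ArgumentException("size must be a non-negative multiple of 8", nameof(sizeInBytes));
        }

        var code = new List<string>();
        int offset = startOffset;
        int end = startOffset + sizeInBytes;

        if (sizeInBytes >= 16)
        {
            // Zero xmm0 once, then store 16 bytes at a time.
            code.Add("xorps xmm0, xmm0");
            for (; end - offset >= 16; offset += 16)
            {
                code.Add($"movdqu xmmword ptr [rbp+{offset:X2}H], xmm0");
            }
        }

        if (offset < end)
        {
            // Sizes are multiples of 8, so exactly one 8-byte slot can remain.
            code.Add("xor rax, rax");
            code.Add($"mov qword ptr [rbp+{offset:X2}H], rax");
        }

        return code;
    }
}
```

For Emit(8, 72) this produces four 16-byte stores and one 8-byte store (five stores for 72 bytes, where 8-byte movs alone would need nine). Whether that is faster in practice would still have to be measured.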
The JIT normally favors size over speed, so the compact block-init sequence may still be preferred in many cases, but we should at least use IBC data, when available, to drive this heuristic.
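One way profile data could feed into the choice is sketched below. This is purely illustrative: profile data is modeled as a nullable call count, and the names ProfileGuidedZeroInit and ZeroInitPreference, the 16-byte size cutoff, and the hotness threshold of 10,000 calls are all assumptions, not anything the JIT or IBC defines.

```csharp
// Hypothetical policy sketch; names and thresholds are assumptions.
public enum ZeroInitPreference
{
    Size,  // compact block init (rep stosd)
    Speed  // unrolled mov/xmm stores
}

public static class ProfileGuidedZeroInit
{
    public const int BlockInitThresholdBytes = 16;    // assumed, mirroring the current heuristic
    public const long HotCallCountThreshold = 10_000; // assumed cutoff for "hot"

    public static ZeroInitPreference Choose(int bytesToZero, long? profiledCallCount)
    {
        // Small amounts of zeroing already use unrolled stores today.
        if (bytesToZero <= BlockInitThresholdBytes)
        {
            return ZeroInitPreference.Speed;
        }

        // No profile data: keep the default bias toward code size.
        if (profiledCallCount is null)
        {
            return ZeroInitPreference.Size;
        }

        // Profile says the method is hot: trade code size for a faster prolog.
        return profiledCallCount.Value >= HotCallCountThreshold
            ? ZeroInitPreference.Speed
            : ZeroInitPreference.Size;
    }
}
```

Without profile data this keeps today's size-oriented behavior; only methods the profile marks as hot, such as String.Split in the benchmark above, would switch to the faster sequence.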
category:cq
theme:zero-init
skill-level:expert
cost:medium