[RyuJIT] Improve heuristic for zero-initialization of locals #8890

@erozenfeld

The heuristic the jit is using for deciding how to zero-initialize locals is very simplistic. In many cases faster sequences can be used.

Here is one example. An attempt was made to switch String.Split to use Spans (stephentoub/coreclr@500978f) in order to avoid int[] allocations. This resulted in several more temp structs being allocated and zero-initialized, which made this benchmark ~12% slower than the non-Span version:

    using System;

    public static class Program
    {
        public static void Main()
        {
            // Wall-clock timing of 30 million Split calls.
            DateTime start = DateTime.Now;

            for (int i = 0; i < 30000000; ++i)
            {
                "abc,def,ghi".Split(',');
            }

            Console.WriteLine((DateTime.Now - start).TotalMilliseconds);
        }
    }

The current heuristic uses rep stosd in the prolog if the jit needs to initialize 16 bytes of locals. (The heuristic is slightly different if any structs larger than 24 bytes need to be initialized, but that is not relevant for this benchmark.) As an experiment, I changed the heuristic so that this benchmark uses mov instructions instead of rep stosd. With that change we get all of the perf back compared to the array version.
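To make the shape of such a heuristic concrete, here is a minimal sketch of a size-threshold decision. It is not the JIT's actual code: `ZeroInitKind`, `chooseZeroInit`, and the `thresholdBytes` parameter are hypothetical names, and treating the threshold as inclusive is an assumption.

```cpp
#include <cstddef>

// Hypothetical names for illustration only; this is not the JIT's real code.
enum class ZeroInitKind { None, MovStores, BlockInit };

// Picks a zero-initialization strategy from the number of bytes of locals
// that must be zeroed in the prolog.
// Assumption: the threshold is inclusive. The issue text only says that
// rep stosd is used when 16 bytes of locals need initializing.
ZeroInitKind chooseZeroInit(std::size_t bytesToInit, std::size_t thresholdBytes)
{
    if (bytesToInit == 0)
    {
        return ZeroInitKind::None; // nothing to zero
    }
    return bytesToInit >= thresholdBytes ? ZeroInitKind::BlockInit  // rep stosd
                                         : ZeroInitKind::MovStores; // individual movs
}
```

With a threshold of 16, a 72-byte frame (18 dwords) selects the block init sequence. The experiment described above corresponds to raising the threshold so that the same frame selects individual mov stores.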

Here are the two initialization sequences.

The first one uses rep stosd:

       488BF1               mov      rsi, rcx
       488D7D08             lea      rdi, [rbp+08H]
       B912000000           mov      ecx, 18
       33C0                 xor      rax, rax
       F3AB                 rep stosd
       488BCE               mov      rcx, rsi

The second one uses mov instructions:

       33C0                 xor      rax, rax
       48894538             mov      qword ptr [rbp+38H], rax
       48894540             mov      qword ptr [rbp+40H], rax
       48894530             mov      qword ptr [rbp+30H], rax
       48894528             mov      qword ptr [rbp+28H], rax
       48894518             mov      qword ptr [rbp+18H], rax
       48894520             mov      qword ptr [rbp+20H], rax
       48894508             mov      qword ptr [rbp+08H], rax
       48894548             mov      qword ptr [rbp+48H], rax

While the second sequence is faster than the first one, we can probably do even better with xmm registers.
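As a rough illustration of the xmm idea, the sketch below zeroes a buffer with 16-byte SSE2 stores, so a contiguous 64-byte region needs four stores instead of eight qword movs. The function name `zeroWithXmm` is hypothetical. It assumes the region is contiguous and falls back to `memset` when SSE2 is unavailable. It shows the store pattern only, not what the JIT would emit.

```cpp
#include <cstddef>
#include <cstring>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define ZERO_INIT_HAVE_SSE2 1
#endif

// Hypothetical helper: zeroes `bytes` bytes starting at `dst`.
// Uses unaligned 16-byte stores where SSE2 is available and memset for
// any tail shorter than 16 bytes (or for the whole buffer otherwise).
inline void zeroWithXmm(void* dst, std::size_t bytes)
{
    unsigned char* p = static_cast<unsigned char*>(dst);
    std::size_t off = 0;
#ifdef ZERO_INIT_HAVE_SSE2
    const __m128i zero = _mm_setzero_si128(); // typically a pxor/xorps of one xmm register
    for (; off + 16 <= bytes; off += 16)
    {
        _mm_storeu_si128(reinterpret_cast<__m128i*>(p + off), zero);
    }
#endif
    std::memset(p + off, 0, bytes - off); // remaining tail, if any
}
```

For the 72-byte frame in this issue, that would be four 16-byte stores plus one 8-byte tail, assuming the slots can be treated as one contiguous range.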

The jit normally favors size over speed, so the block init sequence may be preferred in many cases, but we should at least use IBC data, when available, to drive this heuristic.
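The size-versus-speed trade-off can be sketched as follows. The code byte counts are taken from the encodings in the two listings in this issue (19 bytes for the rep stosd sequence, 34 bytes for the mov sequence). Everything else is an assumption: `pickInitSequence`, `hasProfileData`, and `isHot` are hypothetical stand-ins for whatever IBC data would actually expose.

```cpp
#include <cstddef>

// Encoded sizes summed from the two listings in this issue.
constexpr std::size_t kBlockInitCodeBytes = 3 + 4 + 5 + 2 + 2 + 3; // mov, lea, mov, xor, rep stosd, mov
constexpr std::size_t kMovStoresCodeBytes = 2 + 8 * 4;             // xor + eight 4-byte mov stores

enum class InitSequence { BlockInit, MovStores };

// Hypothetical policy: default to the smaller sequence, and switch to the
// faster but larger mov sequence only when profile data marks the method hot.
InitSequence pickInitSequence(bool hasProfileData, bool isHot)
{
    if (hasProfileData && isHot)
    {
        return InitSequence::MovStores; // pay 15 extra code bytes for speed
    }
    // No profile data, or a cold method: keep favoring size.
    static_assert(kBlockInitCodeBytes < kMovStoresCodeBytes,
                  "block init is the smaller sequence in this example");
    return InitSequence::BlockInit;
}
```

Under this policy a hot method such as the String.Split path in the benchmark would get the mov sequence, while cold methods would keep the smaller prolog.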

category:cq
theme:zero-init
skill-level:expert
cost:medium

Labels

JitUntriaged (CLR JIT issues needing additional triage)
area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI)
enhancement (Product code improvement that does NOT require public API changes/additions)
optimization
tenet-performance (Performance related issue)
