Description
C# source in this gist: https://gist.github.com/Zhentar/4ffb0a5d597c4c1e788d6007f1602b21
According to VTune, 5% of my execution time is in my function's prologue. This was unexpected because the prologue hadn't shown up in previous iterations (and my function body had unfortunately not improved at all).
Looking at the disassembly, I see:
LineEnumerator.MoveNext()
push rdi
push rsi
sub rsp,48h
mov rsi,rcx
lea rdi,[rsp+28h]
mov ecx,8
xor eax,eax
rep stos dword ptr [rdi]
mov rcx,rsi
mov rax,0F1CD0434ED23h
mov qword ptr [rsp+40h],rax
The rep stos dword in there seems rather odd - at the very least, it should be a rep stos qword with half as many iterations (although I'm not sure it would be any faster on my Skylake). But also, I don't think there's any x86 architecture for which a 32-byte rep stos is faster than a reasonable unrolled version, and the unrolled version wouldn't even be particularly large. And some of the comments in the JIT code seem to suggest that rep stos shouldn't ever be getting emitted.
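For illustration, here is a rough sketch of what an unrolled version of the same zeroing could look like. It is hand-written, not actual JIT output, and it assumes the same 32-byte region starting at [rsp+28h] as in the listing above (8 dwords = 4 qwords):

```asm
; Hypothetical hand-unrolled zeroing of the 32 bytes at [rsp+28h]..[rsp+47h].
; Not emitted by the JIT; shown only to indicate the size of the alternative.
xor     eax,eax                     ; zeroes all of rax
mov     qword ptr [rsp+28h],rax
mov     qword ptr [rsp+30h],rax
mov     qword ptr [rsp+38h],rax
mov     qword ptr [rsp+40h],rax     ; this slot is overwritten again by the later store to [rsp+40h]
```

Since this form doesn't use rdi or rcx for the zeroing, the mov rsi,rcx / mov rcx,rsi pair around the rep stos would no longer be needed. Whether the push rdi / push rsi could also be dropped depends on whether the rest of the function uses those registers, which the excerpt above doesn't show.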
category:cq
theme:optimization
skill-level:intermediate
cost:medium