[Patch] Advanced MOVZX optimisations
Original Reporter info from Mantis: CuriousKit @CuriousKit
- Reporter name: J. Gareth Moreton
Description:
This patch adds a long-range, in-depth optimisation routine for MOVZX instructions. It tracks the register sizes involved and attempts to remove or shrink instructions where possible.
Steps to reproduce:
Apply the patch and confirm correct compilation at -O2 and above.
Additional information:
Testing on i386 has been limited due to a pre-existing bug that, as of posting this patch, prevents building of i386-win32 at all.
----
Some examples of optimisations:
movzwl %dx,%ecx
shrl $8,%ecx
movzbl %cl,%ecx
The third instruction gets removed because the optimisation routine realises that the upper 3 bytes of %ecx are already zero.
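The reasoning for this first example can be checked by brute force. The following is only a sketch in Python, not the patch's code: zero_extend and final_movzbl_is_redundant are hypothetical helper names, and the register is modelled as a plain integer.

```python
def zero_extend(value, src_bits):
    """Model movzx: keep the low src_bits of value and clear everything above."""
    return value & ((1 << src_bits) - 1)


def final_movzbl_is_redundant():
    """Try every 16-bit %dx value; True if movzbl %cl,%ecx never changes %ecx."""
    for dx in range(1 << 16):
        ecx = zero_extend(dx, 16)       # movzwl %dx,%ecx : bits 16..31 are now zero
        ecx >>= 8                       # shrl $8,%ecx    : bits 8..31 are now zero
        if zero_extend(ecx, 8) != ecx:  # movzbl %cl,%ecx : would this change anything?
            return False
    return True


print(final_movzbl_is_redundant())
```

The optimiser presumably reaches the same conclusion by tracking which upper bits are known to be zero rather than by enumeration; the exhaustive loop is used here only because it is easy to verify.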
-
shrb $3,%al
movzbl %al,%eax
cmpb $27,%al
seteb %al
The movz/cmp pair gets swapped to minimise a pipeline stall, and the optimiser then identifies that the movzbl instruction can be removed because %eax doesn't get used afterwards.
-
shrb $3,%al
movzbl %al,%eax
movzbl %al,%eax
btl %eax,%edi
One of the two identical movzbl instructions gets removed.
-
movzwl -2(%rbx,%rax,2),%edi
# Peephole Optimization: Mov2Nop 3 done
# Peephole Optimization: %cx = %di; changed to minimise pipeline stall (MovXXX2MovXXX)
movzwl %di,%ecx
movzwl %di,%ecx gets changed to movl %edi,%ecx since the upper 16 bits of %edi are already zero, thus reducing instruction size slightly.
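As a rough illustration of why this replacement is safe and slightly smaller, here is a Python sketch. It is not taken from the patch: the helper names are hypothetical, and the byte encodings are written out by hand from the x86 opcode tables (0F B7 /r for movzx r32,r/m16 and 89 /r for mov r/m32,r32), so treat them as an assumption about what the assembler emits.

```python
def movzwl(src):
    """Model movzwl: the 32-bit result is the low 16 bits of the source."""
    return src & 0xFFFF


def movl(src):
    """Model movl: the 32-bit result is the full 32-bit source."""
    return src & 0xFFFFFFFF


def replacement_is_safe():
    """True if movl %edi,%ecx matches movzwl %di,%ecx for every loaded word."""
    for loaded in range(1 << 16):
        edi = movzwl(loaded)            # movzwl -2(%rbx,%rax,2),%edi
        if movzwl(edi) != movl(edi):    # movzwl %di,%ecx  vs  movl %edi,%ecx
            return False
    return True


# Hand-assembled encodings (assumed, see the note above).
MOVZWL_DI_ECX = bytes.fromhex("0fb7cf")  # movzwl %di,%ecx
MOVL_EDI_ECX = bytes.fromhex("89f9")     # movl %edi,%ecx

print(replacement_is_safe(), len(MOVZWL_DI_ECX), len(MOVL_EDI_ECX))
```

Under these assumptions the plain MOV is one byte shorter than the MOVZX it replaces.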
-
A more complex one that works on -O3 and -O4 (where the optimiser searches further ahead):
# Peephole Optimization: MovMovs/z2Mov/s/z done
movzbw %dl,%r8w
# Peephole Optimization: %r8b = %dl; changed to minimise pipeline stall (MovXXX2MovXXX)
movzbl %dl,%r9d
shrl $4,%r9d
# Peephole Optimization: And2Nop
leaq TC_$UNICODEDATA_$$_UC_TABLE_2(%rip),%r10
# Peephole Optimization: Lea2AddBase done
addq %r10,%rcx
movzwl (%rcx,%r9,2),%ecx
shlq $5,%rcx
# Peephole Optimization: Removed movs/z instruction and extended earlier write (MovMovs/z2Mov/s/z)
andw $15,%r8w
movzwl %r8w,%r8d
Becomes:
# Peephole Optimization: movzbw2movzbl
movzbl %dl,%r8d
# Peephole Optimization: %r8b = %dl; changed to minimise pipeline stall (MovXXX2MovXXX)
movzbl %dl,%r9d
shrl $4,%r9d
# Peephole Optimization: And2Nop
leaq TC_$UNICODEDATA_$$_UC_TABLE_2(%rip),%r10
# Peephole Optimization: Lea2AddBase done
addq %r10,%rcx
movzwl (%rcx,%r9,2),%ecx
shlq $5,%rcx
# Peephole Optimization: Removed movs/z instruction and extended earlier write (MovMovs/z2Mov/s/z)
andl $15,%r8d
# Peephole Optimization: Movzx2Nop 2
The initial movzbw %dl,%r8w becomes movzbl %dl,%r8d and the AND instruction is widened from 16-bit to 32-bit. Since the upper 16 bits of %r8d are then already zero, the movzwl %r8w,%r8d at the end can be removed.
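The effect on %r8 can be modelled with a small Python sketch. This is only an illustration under stated assumptions, not the patch's logic: write16, write32, before and after are hypothetical names, %r8 is modelled as a 64-bit integer, and only the three instructions that touch %r8 are simulated (the instructions in between do not read or write it).

```python
MASK64 = (1 << 64) - 1


def write16(reg, value):
    """16-bit register write: replaces bits 0..15 and preserves bits 16..63."""
    return (reg & ~0xFFFF & MASK64) | (value & 0xFFFF)


def write32(reg, value):
    """32-bit register write in 64-bit mode: bits 32..63 are zeroed."""
    return value & 0xFFFFFFFF


def before(r8, dl):
    r8 = write16(r8, dl & 0xFF)            # movzbw %dl,%r8w
    r8 = write16(r8, (r8 & 0xFFFF) & 15)   # andw $15,%r8w
    r8 = write32(r8, r8 & 0xFFFF)          # movzwl %r8w,%r8d
    return r8


def after(r8, dl):
    r8 = write32(r8, dl & 0xFF)                # movzbl %dl,%r8d
    r8 = write32(r8, (r8 & 0xFFFFFFFF) & 15)   # andl $15,%r8d
    return r8


def sequences_agree(initial_r8=0xDEADBEEFCAFEF00D):
    """True if both sequences leave the same %r8 for every possible %dl."""
    return all(before(initial_r8, dl) == after(initial_r8, dl) for dl in range(256))


print(sequences_agree())
```

The model starts from an arbitrary non-zero %r8 to show that the stale upper bits preserved by the 16-bit writes do not matter once the final 32-bit write has happened, which is why the widened two-instruction form gives the same result.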
----
There's still room for improvement. I'm seeing how well I can tie in regular MOV instructions with these optimisations.