[Patch] Miscellaneous x86 optimisations
Original Reporter info from Mantis: CuriousKit @CuriousKit
- Reporter name: J. Gareth Moreton
Description:
This patch contains some small, miscellaneous optimisations for x86 platforms:
- If "movzx %reg,%reg; shr x,%reg" is found (same register, just different sizes) and x is small enough (<= 7 for movzbl, for example), the "shr" instruction is moved before the "movzx", as this might allow further optimisation on the next pass.
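The reordering is safe because, for a byte value, an 8-bit logical shift followed by zero extension produces the same result as zero extension followed by a wider logical shift, as long as the shift count stays within the byte. A Python sketch of the arithmetic (helper names are illustrative, not compiler code):

```python
def movzbl_then_shr(al: int, x: int) -> int:
    """movzbl %al,%eax ; shrl $x,%eax"""
    eax = al & 0xFF              # zero-extend the byte to 32 bits
    return (eax >> x) & 0xFFFFFFFF

def shr_then_movzbl(al: int, x: int) -> int:
    """shrb $x,%al ; movzbl %al,%eax"""
    al = (al & 0xFF) >> x        # 8-bit logical shift first
    return al & 0xFF             # zero-extend afterwards

# The two orderings agree for every byte value and shift count 0..7.
assert all(movzbl_then_shr(v, x) == shr_then_movzbl(v, x)
           for v in range(256) for x in range(8))
```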
- If "movsx/d %reg,%reg; sar x,%reg" is found (same register, just different sizes) and x is small enough (<= 7 for movsbl, for example), the "sar" instruction is moved before the "movsx", as this might allow further optimisation on the next pass.
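The same argument holds for the signed case: an 8-bit arithmetic shift followed by sign extension equals sign extension followed by a wider arithmetic shift, for counts up to 7. Again a Python sketch with illustrative helper names:

```python
def sign_extend8(v: int) -> int:
    """Interpret the low byte of v as a signed 8-bit value (movsbl)."""
    v &= 0xFF
    return v - 0x100 if v & 0x80 else v

def movsbl_then_sar(al: int, x: int) -> int:
    """movsbl %al,%eax ; sarl $x,%eax"""
    return sign_extend8(al) >> x          # Python's >> on negative ints is arithmetic

def sar_then_movsbl(al: int, x: int) -> int:
    """sarb $x,%al ; movsbl %al,%eax"""
    return sign_extend8(sign_extend8(al) >> x)

assert all(movsbl_then_sar(v, x) == sar_then_movsbl(v, x)
           for v in range(256) for x in range(8))
```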
- If "and x,%reg; shr y,%reg" is found and the two instructions cover all the bits (e.g. "andb $248,%reg; shrb $3,%reg"), the "and" instruction is removed. The above "movzx; shr" optimisation allows this to happen much more frequently.
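The bit arithmetic behind the example is easy to check exhaustively: $248 is %11111000, so "shrb $3" discards exactly the three bits that the "and" cleared, making the mask redundant. A Python sketch (the general-rule helper is illustrative, not from the patch):

```python
MASK, SHIFT = 0xF8, 3   # andb $248,%reg ; shrb $3,%reg

# The shift discards every bit the mask clears, so the 'and' is redundant.
assert all(((v & MASK) >> SHIFT) == (v >> SHIFT) for v in range(256))

# In general the 'and' can go whenever the mask keeps every bit the shift keeps:
def and_is_redundant(mask: int, shift: int, width: int = 8) -> bool:
    kept = ((1 << width) - 1) >> shift << shift  # bits that survive the shift
    return (mask & kept) == kept

assert and_is_redundant(0xF8, 3)
assert not and_is_redundant(0xF0, 3)  # bit 3 survives the shift but is masked off
```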
- If "and x,%reg; movsx/d %reg,%reg" is found (same register, just different sizes) and the "and" instruction forces the sign bit to zero, the movsx/d instruction is changed to movzx. Currently this doesn't cause a speed improvement by itself, but it is part of a larger, more in-depth optimisation routine that is still in development and couldn't easily be removed from the patch. It is otherwise harmless, though.
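The movsx-to-movzx rewrite relies on sign extension and zero extension agreeing whenever the sign bit of the narrow value is zero, which the "and" mask guarantees. A Python sketch of the 16-bit case:

```python
def sign_extend16(v: int) -> int:
    """movswl: interpret the low 16 bits as signed."""
    v &= 0xFFFF
    return v - 0x10000 if v & 0x8000 else v

# Any mask with bit 15 clear forces the sign bit to zero, so sign
# extension (movswl) and zero extension (movzwl) give the same result.
MASK = 0x7FFF
assert all((sign_extend16(v & MASK) & 0xFFFFFFFF) == (v & MASK)
           for v in range(0x10000))
```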
- "and %reg,%reg" for the B, W and L sizes is now removed if the FLAGS register is not in use and the previous instruction wrote to %reg at the same size, since in that case "and %reg,%reg" serves only to zero the upper bits of the full 64-bit register, which the previous write has done already.
- (Pass 2) If "add %reg2,%reg1; mov/s/z x(%reg1),%reg1" is found (usually caused by the Lea2AddBase optimisation), it is changed to "mov/s/z x(%reg2,%reg1),%reg1", removing the "add" entirely; this not only reduces the instruction count but also eliminates a potential pipeline bottleneck.
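The transformation works because an x86 memory operand can add base, index and displacement itself, and because %reg1 is the load's destination, so the value the "add" computed in it is dead afterwards anyway. A Python sketch of the address arithmetic (a dictionary stands in for memory; the names are illustrative):

```python
MEM = {0x1010: 42}   # toy memory: address -> value

def before(reg1: int, reg2: int, x: int) -> int:
    """add %reg2,%reg1 ; mov x(%reg1),%reg1"""
    reg1 = reg1 + reg2          # the add clobbers %reg1 first...
    return MEM[reg1 + x]        # ...then it is used as the base

def after(reg1: int, reg2: int, x: int) -> int:
    """mov x(%reg2,%reg1),%reg1 -- base %reg2, index %reg1, disp x"""
    return MEM[reg2 + reg1 + x]

assert before(0x1000, 0x8, 0x8) == after(0x1000, 0x8, 0x8) == 42
```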
- (Post peephole) If "and x,%ax; movzwl %ax,%eax" is found (it has to be %ax/%eax) and x guarantees the sign bit is zero (i.e. x <= $7FFF), then "movzwl %ax,%eax" is changed to "cwtl" (or CWDE, depending on the asm mode). Note that "cwtl" is a shorter encoding of "movswl %ax,%eax", but since the sign bit is zero, it acts as a zero extension. This partly reverses the "and; movsx" to "and; movzx" change above if it didn't open up any new optimisations, but occasionally it makes an improvement by itself.
Steps to reproduce:
Apply the patch, then confirm correct compilation and small speed boosts.
Additional information:
Some of these optimisations, notably the ones based around movzx, are part of a larger, more in-depth optimisation routine that is under development, but do a good job by themselves.
x86_64-win64 has been fully tested with no regressions. i386-win32 requires further testing but is currently blocked due to an unrelated package compilation failure.
Mantis conversion info:
- Mantis ID: 38130
- OS: Microsoft Windows
- OS Build: 10 Home
- Build: r47557
- Platform: i386 and x86_64
- Version: 3.3.1
- Fixed in version: 3.3.1
- Fixed in revision: 47824 (#2a990b81)