[Patch] Advanced MOVZX optimisations

Original Reporter info from Mantis: CuriousKit @CuriousKit

Reporter name: J. Gareth Moreton

Description:

This patch contains a long-range, in-depth optimisation routine for MOVZX operations and the register sizes of related instructions; it attempts to remove or shrink instructions where possible.

Steps to reproduce:

Apply patch and confirm correct compilation on -O2 and above.

Additional information:

Testing on i386 has been limited due to a pre-existing bug that, as of posting this patch, prevents building of i386-win32 at all.

----

Some examples of optimisations:

	movzwl	%dx,%ecx
	shrl	$8,%ecx
	movzbl	%cl,%ecx

The third instruction gets removed because the optimisation routine realises that the upper 3 bytes of %ecx are already zero.
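
The reasoning can be checked with a small model of the three instructions. This is only an illustrative sketch in Python, not code from the patch; the helper names (movzwl, shrl, movzbl) are hypothetical and model just the register values, ignoring flags.

```python
def movzwl(src16):
    """Model of movzwl: zero-extend a 16-bit source into a 32-bit register."""
    return src16 & 0xFFFF

def shrl(value32, count):
    """Model of shrl: logical right shift of a 32-bit register."""
    return (value32 & 0xFFFFFFFF) >> count

def movzbl(src8):
    """Model of movzbl: zero-extend an 8-bit source into a 32-bit register."""
    return src8 & 0xFF

def third_instruction_is_redundant():
    # Try every possible value of %dx.
    for dx in range(0x10000):
        ecx = movzwl(dx)     # movzwl %dx,%ecx  (upper 16 bits now zero)
        ecx = shrl(ecx, 8)   # shrl $8,%ecx     (upper 24 bits now zero)
        # movzbl %cl,%ecx reads the low byte and must leave %ecx unchanged.
        if movzbl(ecx & 0xFF) != ecx:
            return False
    return True
```

Under this model the function returns True for every 16-bit input, which matches the claim that the final movzbl has no effect.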

	shrb	$3,%al
	movzbl	%al,%eax
	cmpb	$27,%al
	seteb	%al

The movz/cmp pair is swapped to minimise a pipeline stall, and the routine then identifies that the movzbl instruction can be removed because %eax doesn't get used afterwards.
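
As a rough illustration of why this removal depends on liveness, the sketch below models the sequence with and without the movzbl. It is an assumption-laden model rather than the patch's logic: the function names are hypothetical, only the register value and the zero flag are modelled, and the initial upper bytes of %eax are treated as arbitrary.

```python
def with_movzbl(eax):
    al = (eax & 0xFF) >> 3                     # shrb $3,%al
    eax = al                                   # movzbl %al,%eax (clears upper 3 bytes)
    zf = (eax & 0xFF) == 27                    # cmpb $27,%al
    return (eax & 0xFFFFFF00) | (1 if zf else 0)   # seteb %al

def without_movzbl(eax):
    al = (eax & 0xFF) >> 3                     # shrb $3,%al
    eax = (eax & 0xFFFFFF00) | al              # upper 3 bytes left untouched
    zf = (eax & 0xFF) == 27                    # cmpb $27,%al
    return (eax & 0xFFFFFF00) | (1 if zf else 0)   # seteb %al

def low_bytes_agree():
    # A few arbitrary upper-byte patterns combined with every value of %al.
    uppers = (0x00000000, 0xDEADBE00, 0xFFFFFF00)
    return all(
        (with_movzbl(u | b) & 0xFF) == (without_movzbl(u | b) & 0xFF)
        for u in uppers
        for b in range(256)
    )
```

In this model %al always ends up the same, while the upper bytes of %eax can differ, so dropping the movzbl is only safe when nothing later reads the full register.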

	shrb	$3,%al
	movzbl	%al,%eax
	movzbl	%al,%eax
	btl	%eax,%edi

Removes one of the movzbl instructions.

	movzwl	-2(%rbx,%rax,2),%edi
# Peephole Optimization: Mov2Nop 3 done
# Peephole Optimization: %cx = %di; changed to minimise pipeline stall (MovXXX2MovXXX)
	movzwl	%di,%ecx

movzwl %di,%ecx gets changed to movl %edi,%ecx since the upper 16 bits of %edi are already zero, thus reducing instruction size slightly.
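
A minimal sketch of the condition behind this rewrite, again using hypothetical helper names and modelling register values only: the two instructions agree whenever the upper 16 bits of %edi are known to be zero, and can disagree otherwise.

```python
def movzwl(src):
    """Model of movzwl: keep the low 16 bits, zero the rest."""
    return src & 0xFFFF

def movl(src):
    """Model of movl: copy all 32 bits."""
    return src & 0xFFFFFFFF

def equivalent_after_zero_extending_load():
    # %edi was produced by a movzwl load, so it holds some 16-bit value.
    for loaded in range(0x10000):
        edi = movzwl(loaded)
        # movzwl %di,%ecx versus movl %edi,%ecx
        if movzwl(edi) != movl(edi):
            return False
    return True

def differs_without_the_guarantee():
    # With non-zero upper bits the two instructions give different results.
    edi = 0x12345678
    return movzwl(edi) != movl(edi)
```

The second function shows why the rewrite has to be conditional on the earlier zero-extending write to %edi.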

A more complex one that works on -O3 and -O4 (which search further ahead):

# Peephole Optimization: MovMovs/z2Mov/s/z done
	movzbw	%dl,%r8w
# Peephole Optimization: %r8b = %dl; changed to minimise pipeline stall (MovXXX2MovXXX)
	movzbl	%dl,%r9d
	shrl	$4,%r9d
# Peephole Optimization: And2Nop
	leaq	TC_$UNICODEDATA_$$_UC_TABLE_2(%rip),%r10
# Peephole Optimization: Lea2AddBase done
	addq	%r10,%rcx
	movzwl	(%rcx,%r9,2),%ecx
	shlq	$5,%rcx
# Peephole Optimization: Removed movs/z instruction and extended earlier write (MovMovs/z2Mov/s/z)
	andw	$15,%r8w
	movzwl	%r8w,%r8d

Becomes:

# Peephole Optimization: movzbw2movzbl
	movzbl	%dl,%r8d
# Peephole Optimization: %r8b = %dl; changed to minimise pipeline stall (MovXXX2MovXXX)
	movzbl	%dl,%r9d
	shrl	$4,%r9d
# Peephole Optimization: And2Nop
	leaq	TC_$UNICODEDATA_$$_UC_TABLE_2(%rip),%r10
# Peephole Optimization: Lea2AddBase done
	addq	%r10,%rcx
	movzwl	(%rcx,%r9,2),%ecx
	shlq	$5,%rcx
# Peephole Optimization: Removed movs/z instruction and extended earlier write (MovMovs/z2Mov/s/z)
	andl	$15,%r8d
# Peephole Optimization: Movzx2Nop 2

The initial movzbw %dl,%r8w becomes movzbl %dl,%r8d and the AND instruction is expanded from 16-bit to 32-bit, which permits the removal of movzwl %r8w,%r8d at the end.
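
The effect on %r8 can be sketched as follows. This is an illustrative model only, assuming the standard x86_64 behaviour that 16-bit writes preserve the upper bits of the register and 32-bit writes zero the upper 32 bits; the helper names are hypothetical, flags are ignored, and the intervening instructions are left out because they do not touch %r8.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF

def write16(reg64, value):
    """16-bit register write: the upper 48 bits are preserved."""
    return (reg64 & ~0xFFFF & MASK64) | (value & 0xFFFF)

def write32(value):
    """32-bit register write on x86_64: the upper 32 bits become zero."""
    return value & 0xFFFFFFFF

def before(r8, dl):
    r8 = write16(r8, dl & 0xFF)            # movzbw %dl,%r8w
    r8 = write16(r8, (r8 & 0xFFFF) & 15)   # andw $15,%r8w
    r8 = write32(r8 & 0xFFFF)              # movzwl %r8w,%r8d
    return r8

def after(r8, dl):
    r8 = write32(dl & 0xFF)                # movzbl %dl,%r8d
    r8 = write32(r8 & 15)                  # andl $15,%r8d
    return r8

def sequences_agree():
    # Arbitrary prior contents of %r8 combined with every value of %dl.
    garbage = (0, MASK64, 0x0123456789ABCDEF)
    return all(
        before(g, dl) == after(g, dl)
        for g in garbage
        for dl in range(256)
    )
```

In this model both sequences leave %r8 holding %dl masked with 15, regardless of what %r8 contained beforehand.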

----

There's still room for improvement. I'm seeing how well I can tie in regular MOV instructions with these optimisations.

Mantis conversion info:

Mantis ID: 38294
OS: Microsoft Windows
OS Build: 10 Home
Build: r47977
Platform: i386 and x86_64
Version: 3.3.1
Fixed in version: 3.3.1
Fixed in revision: 48086 (#28efcfba), 48117 (#f42f6256)
