[Patch / Refactor] x86_64 optimizer overhaul
Original Reporter info from Mantis: CuriousKit @CuriousKit
- Reporter name: J. Gareth Moreton
Description:
This patch overhauls the x86_64 optimiser to minimise the number of passes required and to make it more intelligent. Preliminary tests show about a 5% speed increase when compiling Lazarus under -O1 and about a 15% speed increase under -O3. See the attached Metric.txt file for the timings.
To minimise the pass count, the pre-peephole, pass 1 and pass 2 stages have been merged, and the jump and MOV optimisations have been overhauled. One of the control cases is that a compilation under -O1 should not produce worse code than the trunk. As it turns out, in many cases the compiler produces better code, even though no genuinely new optimisation combinations have been introduced.
Additionally, for individual passes, the optimiser attempts to mark the end of function prologues so that it does not waste time on sequences that won't change.
The code isn't completely clean, as I have attempted to keep i386 separate from the changes, mostly as a control case to show that they don't affect other platforms. Once testing and implementation are successful for x86_64, I plan to port my changes over to i386.
(NOTE: Linux testing hasn't yet been overly successful due to configuration difficulties)
Steps to reproduce:
Apply patch and test on all platforms for successful compilation and correct machine code output of binaries.
Additional information:
Though not the intention, the rewriting of some of the optimisation routines has allowed for some additional space and size savings. A lot of the time, this just amounts to stripping out dead labels, which doesn't actually change the final binary size, but occasionally it can eliminate superfluous jumps and unnecessary alignment hints, which sometimes leads to further optimisations. For example, in "components/codetools/basiccodetools.pas" for Lazarus, under -O3 compilation, the overhauled optimiser is able to remove two additional branches in the CompareSubstrings function. Under the trunk, the segment is as follows:
...
.Lj2799:
movslq %r8d,%r9
subq %r9,%rdx
leaq 1(%rdx),%r9
cmpl %r9d,%r11d
jge .Lj2802
.p2align 2,,0
.p2align 1
movl %r11d,%r9d
.Lj2802:
movq %rcx,%rdx
testq %rcx,%rcx
je .Lj2803
movq -8(%rdx),%rdx
.Lj2803:
movslq %r10d,%rbx
subq %rbx,%rdx
addq $1,%rdx
cmpl %edx,%r11d
jge .Lj2806
.p2align 2,,0
.p2align 1
movl %r11d,%edx
.Lj2806:
movslq %r8d,%r8
...
Under the overhauled optimiser, the optimiser is able to see through the alignment hints and convert the two conditional branches that skip a register-to-register MOV into CMOV instructions:
...
.Lj2799:
movslq %r8d,%r9
subq %r9,%rdx
leaq 1(%rdx),%r9
cmpl %r9d,%r11d
cmovngel %r11d,%r9d
movq %rcx,%rdx
testq %rcx,%rcx
je .Lj2803
movq -8(%rdx),%rdx
.Lj2803:
movslq %r10d,%rbx
subq %rbx,%rdx
addq $1,%rdx
cmpl %edx,%r11d
cmovngel %r11d,%edx
movslq %r8d,%r8
...
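To illustrate the transformation shown in the two listings, here is a minimal sketch in Python. It is a toy model operating on lines of text, not the code from the patch, and every name in it (INVERSE, referenced_labels, jcc_mov_to_cmov, TRUNK) is hypothetical. It assumes the pattern is "Jcc label; optional .p2align hints; register-to-register MOV; label:", and it only drops the label when no other jump refers to it.

```python
import re

# Inverted condition suffixes. Only a few pairs are listed (assumption:
# a real optimiser would cover every condition code).
INVERSE = {"ge": "nge", "nge": "ge", "e": "ne", "ne": "e", "l": "nl", "nl": "l"}

def referenced_labels(lines):
    """Count how many jumps target each label."""
    refs = {}
    for line in lines:
        m = re.match(r"j\w+\s+(\S+)$", line)
        if m:
            refs[m.group(1)] = refs.get(m.group(1), 0) + 1
    return refs

def jcc_mov_to_cmov(lines):
    """Rewrite 'Jcc L; [.p2align ...]; MOV reg,reg; L:' as 'CMOVncc reg,reg'."""
    refs = referenced_labels(lines)
    out = []
    i = 0
    while i < len(lines):
        m = re.match(r"j(\w+)\s+(\S+)$", lines[i])
        if m and m.group(1) in INVERSE:
            cond, label = m.groups()
            j = i + 1
            # See through alignment hints between the jump and the MOV.
            while j < len(lines) and lines[j].startswith(".p2align"):
                j += 1
            if j + 1 < len(lines):
                # Only register-to-register moves qualify in this sketch;
                # a memory operand is left alone.
                mv = re.match(r"mov([lq])\s+(%\w+),(%\w+)$", lines[j])
                if mv and lines[j + 1] == label + ":":
                    size, src, dst = mv.groups()
                    out.append(f"cmov{INVERSE[cond]}{size} {src},{dst}")
                    i = j + 1              # now pointing at the label
                    if refs[label] == 1:   # label is dead once the jump is gone
                        i += 1
                    continue
        out.append(lines[i])
        i += 1
    return out

# The trunk listing from this report, one instruction per line.
TRUNK = """\
.Lj2799:
movslq %r8d,%r9
subq %r9,%rdx
leaq 1(%rdx),%r9
cmpl %r9d,%r11d
jge .Lj2802
.p2align 2,,0
.p2align 1
movl %r11d,%r9d
.Lj2802:
movq %rcx,%rdx
testq %rcx,%rcx
je .Lj2803
movq -8(%rdx),%rdx
.Lj2803:
movslq %r10d,%rbx
subq %rbx,%rdx
addq $1,%rdx
cmpl %edx,%r11d
jge .Lj2806
.p2align 2,,0
.p2align 1
movl %r11d,%edx
.Lj2806:
movslq %r8d,%r8""".splitlines()

if __name__ == "__main__":
    print("\n".join(jcc_mov_to_cmov(TRUNK)))
```

Run on the trunk listing, the sketch leaves the "je .Lj2803" branch in place, because the MOV it skips reads from memory, and rewrites the two "jge" branches, which matches the overhauled output quoted above.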
Mantis conversion info:
- Mantis ID: 34628
- OS: Microsoft Windows
- OS Build: 10 Professional
- Build: x86_64-win64
- Platform: x86_64
- Version: 3.3.1
- Monitored by: @engkin (engkin), Artem3213212 (Artem3213212), Dean Qin (Dean Qin), @xhajt03 (Tomas Hajny), Xor-el (Ugochukwu Mmaduekwe), Vincent (Vincent Snijders), @PascalRiekenberg (Pascal Riekenberg), Akira1364 (Akira1364), @MageSlayer (Denis Golovan)
- Target version: 3.3.1