View Issue Details
ID | Project | Category | View Status | Date Submitted | Last Update |
---|---|---|---|---|---|
0037389 | FPC | Compiler | public | 2020-07-19 17:23 | 2020-07-20 22:22 |
Reporter | J. Gareth Moreton | Assigned To | Florian | ||
Priority | normal | Severity | minor | Reproducibility | N/A |
Status | resolved | Resolution | fixed | ||
Platform | i386 and x86_64 | OS | Microsoft Windows | ||
Product Version | 3.3.1 | ||||
Fixed in Version | 3.3.1 | ||||
Summary | 0037389: [Patch] SHL-centric peephole optimisations | ||||
Description | The attached patch contains a pair of optimisations centred around the SHL instruction. ---- If a MOVSX, MOVSXD or MOVZX instruction is immediately followed by a SHL instruction, the extension instruction is removed if the SHL completely overwrites the extended bits, since the extension is unnecessary - e.g.: movslq %esi,%rsi; shlq $32,%rsi becomes just "shlq $32,%rsi". ---- If a SHL is followed by an AND (or by a MOV and an AND for 64-bit operands), and the FLAGS register isn't used afterwards, the AND instruction is removed if it has no effect on the value of the register - e.g.: shlq $32,%rsi; movq $-4294967296,%rax; andq %rax,%rsi becomes just "shlq $32,%rsi", since the constant $-4294967296 (0xFFFFFFFF00000000) does not change the value of %rsi: the SHL instruction ensures the lower 32 bits are already zero. | ||||
Steps To Reproduce | Apply the patch and confirm correct compilation, with reduced binary size and no loss of performance. | ||||
Additional Information | It's hard to determine the effect of false dependencies with the first optimisation. When extending from 32 bits to 64 bits there is no false dependency, because the upper 32 bits are guaranteed to be zero beforehand. In other situations, given that the SHL instruction depends on the MOVSX/MOVZX instruction, performance should be identical. If a SHL/AND (or SHL/MOV/AND) instruction group cannot be optimised because the mask changes the register value, the mask is nonetheless narrowed based on which bits are set to zero by the SHL instruction (e.g. when shifting left by 32 and then ANDing with 0x7FFFFFFF000000FF, the mask is changed to 0x7FFFFFFF00000000, because the lower 32 bits will always be zero). The aim of this is to enable other optimisations involving AND instructions. | ||||
Tags | compiler, i386, optimizations, patch, x86, x86_64 | ||||
Fixed in Revision | 45811 | ||||
FPCOldBugId | |||||
FPCTarget | - | ||||
Attached Files |
shl-optimisations.patch (8,025 bytes)

```diff
Index: compiler/x86/aoptx86.pas
===================================================================
--- compiler/x86/aoptx86.pas	(revision 45802)
+++ compiler/x86/aoptx86.pas	(working copy)
@@ -3396,10 +3396,15 @@
         TmpBool1,TmpBool2 : Boolean;
         tmpref : treference;
         hp1,hp2: tai;
+        mask: tcgint;
       begin
         Result:=false;
-        if MatchOpType(taicpu(p),top_const,top_reg) and
-          (taicpu(p).opsize in [S_L{$ifdef x86_64},S_Q{$endif x86_64}]) and
+
+        { All these optimisations work on "shl/sal const,%reg" }
+        if not MatchOpType(taicpu(p),top_const,top_reg) then
+          Exit;
+
+        if (taicpu(p).opsize in [S_L{$ifdef x86_64},S_Q{$endif x86_64}]) and
           (taicpu(p).oper[0]^.val <= 3) then
           { Changes "shl const, %reg32; add const/reg, %reg32" to one lea statement }
           begin
@@ -3511,8 +3516,7 @@
             end;
         end
 {$ifndef x86_64}
-      else if (current_settings.optimizecputype < cpu_Pentium2) and
-        MatchOpType(taicpu(p),top_const,top_reg) then
+      else if (current_settings.optimizecputype < cpu_Pentium2) then
         begin
           { changes "shl $1, %reg" to "add %reg, %reg", which is the same on a 386,
             but faster on a 486, and Tairable in both U and V pipes on the Pentium
@@ -3540,7 +3544,130 @@
         end
 {$endif x86_64}
-      ;
+      else if
+        GetNextInstruction(p, hp1) and (hp1.typ = ait_instruction) and MatchOpType(taicpu(hp1), top_const, top_reg) and
+        (
+          (
+            MatchInstruction(hp1, A_AND, [taicpu(p).opsize]) and
+            SetAndTest(hp1, hp2)
+{$ifdef x86_64}
+          ) or
+          (
+            MatchInstruction(hp1, A_MOV, [taicpu(p).opsize]) and
+            GetNextInstruction(hp1, hp2) and
+            MatchInstruction(hp2, A_AND, [taicpu(p).opsize]) and
+            MatchOpType(taicpu(hp2), top_reg, top_reg) and
+            (taicpu(hp1).oper[1]^.reg = taicpu(hp2).oper[0]^.reg)
+{$endif x86_64}
+          )
+        ) and
+        (taicpu(p).oper[1]^.reg = taicpu(hp2).oper[1]^.reg) then
+        begin
+          { Change:
+              shl x, %reg1
+              mov -(1<<x), %reg2
+              and %reg2, %reg1
+
+            Or:
+              shl x, %reg1
+              and -(1<<x), %reg1
+
+            To just:
+              shl x, %reg1
+
+            Since the and operation only zeroes bits that are already zero from the shl operation }
+          case taicpu(p).oper[0]^.val of
+            8:
+              mask:=$FFFFFFFFFFFFFF00;
+            16:
+              mask:=$FFFFFFFFFFFF0000;
+            32:
+              mask:=$FFFFFFFF00000000;
+            63:
+              { Constant pre-calculated to prevent overflow errors with Int64 }
+              mask:=$8000000000000000;
+            else
+              begin
+                if taicpu(p).oper[0]^.val >= 64 then
+                  { Shouldn't happen realistically, since the register
+                    is guaranteed to be set to zero at this point }
+                  mask := 0
+                else
+                  mask := -(Int64(1 shl taicpu(p).oper[0]^.val));
+              end;
+          end;
+
+          if taicpu(hp1).oper[0]^.val = mask then
+            begin
+              { Everything checks out, perform the optimisation, as long as
+                the FLAGS register isn't being used }
+              TransferUsedRegs(TmpUsedRegs);
+              UpdateUsedRegs(TmpUsedRegs, tai(p.next));
+
+{$ifdef x86_64}
+              if (hp1 <> hp2) then
+                begin
+                  { "shl/mov/and" version }
+                  UpdateUsedRegs(TmpUsedRegs, tai(hp1.next));
+
+                  { Don't do the optimisation if the FLAGS register is in use }
+                  if not(RegUsedAfterInstruction(NR_DEFAULTFLAGS, hp2, TmpUsedRegs)) then
+                    begin
+                      DebugMsg(SPeepholeOptimization + 'ShlMovAnd2Shl', p);
+                      { Don't remove the 'mov' instruction if its register is used elsewhere }
+                      if not(RegUsedAfterInstruction(taicpu(hp1).oper[1]^.reg, hp2, TmpUsedRegs)) then
+                        begin
+                          asml.Remove(hp1);
+                          hp1.Free;
+                          Result := True;
+                        end;
+
+                      { Only set Result to True if the 'mov' instruction was removed }
+                      asml.Remove(hp2);
+                      hp2.Free;
+                    end;
+                end
+              else
+{$endif x86_64}
+                begin
+                  { "shl/and" version }
+
+                  { Don't do the optimisation if the FLAGS register is in use }
+                  if not(RegUsedAfterInstruction(NR_DEFAULTFLAGS, hp1, TmpUsedRegs)) then
+                    begin
+                      DebugMsg(SPeepholeOptimization + 'ShlAnd2Shl', p);
+                      asml.Remove(hp1);
+                      hp1.Free;
+                      Result := True;
+                    end;
+                end;
+
+              Exit;
+            end
+          else {$ifdef x86_64}if (hp1 = hp2) then{$endif x86_64}
+            begin
+              { Even if the mask doesn't allow for its removal, we might be
+                able to optimise the mask for the "shl/and" version, which
+                may permit other peephole optimisations }
+{$ifdef DEBUG_AOPTCPU}
+              mask := taicpu(hp1).oper[0]^.val and mask;
+              if taicpu(hp1).oper[0]^.val <> mask then
+                begin
+                  DebugMsg(
+                    SPeepholeOptimization +
+                    'Changed mask from $' + debug_tostr(taicpu(hp1).oper[0]^.val) +
+                    ' to $' + debug_tostr(mask) +
+                    ' based on previous instruction (ShlAnd2ShlAnd)', hp1);
+                  taicpu(hp1).oper[0]^.val := mask;
+                end;
+{$else DEBUG_AOPTCPU}
+              { If debugging is off, just set the operand even if it's the same }
+              taicpu(hp1).oper[0]^.val := taicpu(hp1).oper[0]^.val and mask;
+{$endif DEBUG_AOPTCPU}
+            end;
+
+        end;
     end;
@@ -5357,6 +5484,35 @@
               hp1.Free;
             end;
         end
+      else if reg_and_hp1_is_instr and
+        (taicpu(p).oper[0]^.typ = top_reg) and
+        (
+          (taicpu(hp1).opcode = A_SHL) or (taicpu(hp1).opcode = A_SAL)
+        ) and
+        (taicpu(hp1).oper[0]^.typ = top_const) and
+        SuperRegistersEqual(taicpu(p).oper[0]^.reg, taicpu(p).oper[1]^.reg) and
+        MatchOperand(taicpu(hp1).oper[1]^, taicpu(p).oper[1]^.reg) and
+        { Minimum shift value allowed is the bit difference between the sizes }
+        (taicpu(hp1).oper[0]^.val >=
+          { Multiply by 8 because tcgsize2size returns bytes, not bits }
+          8 * (
+            tcgsize2size[reg_cgsize(taicpu(p).oper[1]^.reg)] -
+            tcgsize2size[reg_cgsize(taicpu(p).oper[0]^.reg)]
+          )
+        ) then
+        begin
+          { For:
+              movsx/movzx %reg1,%reg1 (same register, just different sizes)
+              shl/sal ##, %reg1
+
+            Remove the movsx/movzx instruction if the shift overwrites the
+            extended bits of the register (e.g. movslq %eax,%rax; shlq $32,%rax) }
+          DebugMsg(SPeepholeOptimization + 'MovxShl2Shl',p);
+          RemoveCurrentP(p, hp1);
+          Result := True;
+          Exit;
+        end
       else if taicpu(p).opcode=A_MOVZX then
         begin
           { removes superfluous And's after movzx's }
```
J. Gareth Moreton (note 0124172): i386-win32 and x86_64-win64 regression tests have passed for the combined patches of this issue and 0037390.
Marģers (note 0124175): I don't know how frequently the combination movslq %esi,%rsi; shlq $32,%rsi appears. It could be translated into shll $32,%esi with no harm, and for the first 8 registers the encoded instruction would be 1 byte shorter.
J. Gareth Moreton (note 0124180): The sequence of instructions sometimes appears when writing to some record types, it seems. Often the next instruction ORs in another 32-bit operand to fill up the lower 32 bits. There is potential to optimise this further with SHLD, but that would be easier to do with virtual registers, hence my e-mail about that on the FPC developers mailing list. As for your point, any operation on a 32-bit register zeroes the upper 32 bits of the corresponding 64-bit register (this is why there's no 32-to-64-bit zero-extension instruction). "shll $32,%esi" will set the entire register to zero.
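The MovxShl2Shl equivalence under discussion, together with the implicit zero-extension of 32-bit writes on x86-64, can be modelled in a few lines of C (function names are illustrative only, a sketch of the register semantics rather than the compiler code):

```c
#include <stdint.h>

/* Models: movslq %esi,%rsi ; shlq $32,%rsi */
static uint64_t movslq_then_shlq32(uint32_t esi)
{
    int64_t rsi = (int32_t)esi;   /* movslq: sign-extend 32 -> 64 bits */
    return (uint64_t)rsi << 32;   /* shlq $32: wipes out the extended bits */
}

/* Models: shlq $32,%rsi alone, where %rsi was produced by a 32-bit write.
   On x86-64, writing a 32-bit register implicitly zeroes the upper
   32 bits of the full 64-bit register. */
static uint64_t shlq32_only(uint32_t esi)
{
    uint64_t rsi = esi;           /* implicit zero-extension */
    return rsi << 32;
}
```

Both functions return the same value for every input, including negative 32-bit values, which is why the patch can delete the movslq.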
Marģers (note 0124184): You are right; I was thinking about SHR, while your patch is about SHL.
Florian (note 0124194): Thanks, applied in the meantime; I forgot to mark it as resolved.
Date Modified | Username | Field | Change |
---|---|---|---|
2020-07-19 17:23 | J. Gareth Moreton | New Issue | |
2020-07-19 17:23 | J. Gareth Moreton | File Added: shl-optimisations.patch | |
2020-07-19 17:40 | J. Gareth Moreton | Note Added: 0124172 | |
2020-07-19 17:40 | J. Gareth Moreton | Tag Attached: patch | |
2020-07-19 17:40 | J. Gareth Moreton | Tag Attached: compiler | |
2020-07-19 17:40 | J. Gareth Moreton | Tag Attached: optimizations | |
2020-07-19 17:40 | J. Gareth Moreton | Tag Attached: x86_64 | |
2020-07-19 17:40 | J. Gareth Moreton | Tag Attached: x86 | |
2020-07-19 17:40 | J. Gareth Moreton | Tag Attached: i386 | |
2020-07-19 20:09 | Marģers | Note Added: 0124175 | |
2020-07-20 02:07 | J. Gareth Moreton | Note Added: 0124180 | |
2020-07-20 10:23 | Marģers | Note Added: 0124184 | |
2020-07-20 22:22 | Florian | Assigned To | => Florian |
2020-07-20 22:22 | Florian | Status | new => resolved |
2020-07-20 22:22 | Florian | Resolution | open => fixed |
2020-07-20 22:22 | Florian | Fixed in Version | => 3.3.1 |
2020-07-20 22:22 | Florian | Fixed in Revision | => 45811 |
2020-07-20 22:22 | Florian | FPCTarget | => - |
2020-07-20 22:22 | Florian | Note Added: 0124194 |