View Issue Details

IDProjectCategoryView StatusLast Update
0037389FPCCompilerpublic2020-07-20 22:22
ReporterJ. Gareth Moreton Assigned ToFlorian  
PrioritynormalSeverityminorReproducibilityN/A
Status resolvedResolutionfixed 
Platformi386 and x86_64OSMicrosoft Windows 
Product Version3.3.1 
Fixed in Version3.3.1 
Summary0037389: [Patch] SHL-centric peephole optimisations
DescriptionThe patch attached contains a pair of optimisations centred around the SHL instruction.

----

If a MOVSX, MOVSXD or MOVZX instruction is immediately followed by a SHL instruction, it is removed if the SHL instruction completly overwrites the extended bits, since the instruction is unnecessary - e.g:

movslq %esi,%rsi
shlq $32,%rsi

becomes just "shlq $32,%rsi".

----

If a SHL is followed by an AND (or a MOV and an AND for 64-bit operands), and the FLAGS register isn't used afterwards, the AND instruction is removed if it has no effect on the value of the register - e.g:

shlq $32,%rsi
movq $-4294967296,%rax
andq %rax,%rsi

Becomes just "shlq $32,%rsi", since the constant $-4294967296, or 0xFFFFFFFF00000000, does not change the value of %rsi because the SHL instruction ensures the lower 32 bits are already zero.
Steps To ReproduceApply patch and confirm correct compilation with reduced binary size with no loss of performance.
Additional InformationIt's hard to determine the effect of false dependencies with the first optimisation. If expanding from 32-bit to 64-bit, there is no false dependency because the upper 32-bits are guaranteed to be set to zero beforehand. In other situations, given that the SHL instruction depends on the MOVS/ZX instruction, the performance should be identical.

If a SHL/AND (or SHL/MOV/AND) instruction group cannot be optimised because the mask changes the register value, the mask is nonetheless modified based on what bits are set to zero by the SHL instruction (e.g. if shifting left by 32 and then anding by 0x7FFFFFFF000000FF, it is changed to 0x7FFFFFFF00000000 because the lower 32 bits will always be zero). The aim of this is to potentially improve other optimisations involving AND instructions.
Tagscompiler, i386, optimizations, patch, x86, x86_64
Fixed in Revision45811
FPCOldBugId
FPCTarget-
Attached Files

Activities

J. Gareth Moreton

2020-07-19 17:23

developer  

shl-optimisations.patch (8,025 bytes)   
Index: compiler/x86/aoptx86.pas
===================================================================
--- compiler/x86/aoptx86.pas	(revision 45802)
+++ compiler/x86/aoptx86.pas	(working copy)
@@ -3396,10 +3396,15 @@
         TmpBool1,TmpBool2 : Boolean;
         tmpref : treference;
         hp1,hp2: tai;
+        mask:    tcgint;
       begin
         Result:=false;
-        if MatchOpType(taicpu(p),top_const,top_reg) and
-           (taicpu(p).opsize in [S_L{$ifdef x86_64},S_Q{$endif x86_64}]) and
+
+        { All these optimisations work on "shl/sal const,%reg" }
+        if not MatchOpType(taicpu(p),top_const,top_reg) then
+          Exit;
+
+        if (taicpu(p).opsize in [S_L{$ifdef x86_64},S_Q{$endif x86_64}]) and
            (taicpu(p).oper[0]^.val <= 3) then
           { Changes "shl const, %reg32; add const/reg, %reg32" to one lea statement }
           begin
@@ -3511,8 +3516,7 @@
               end;
           end
 {$ifndef x86_64}
-        else if (current_settings.optimizecputype < cpu_Pentium2) and
-          MatchOpType(taicpu(p),top_const,top_reg) then
+        else if (current_settings.optimizecputype < cpu_Pentium2) then
           begin
             { changes "shl $1, %reg" to "add %reg, %reg", which is the same on a 386,
               but faster on a 486, and Tairable in both U and V pipes on the Pentium
@@ -3540,7 +3544,130 @@
              end;
           end
 {$endif x86_64}
-          ;
+        else if
+          GetNextInstruction(p, hp1) and (hp1.typ = ait_instruction) and MatchOpType(taicpu(hp1), top_const, top_reg) and
+          (
+            (
+              MatchInstruction(hp1, A_AND, [taicpu(p).opsize]) and
+              SetAndTest(hp1, hp2)
+{$ifdef x86_64}
+            ) or
+            (
+              MatchInstruction(hp1, A_MOV, [taicpu(p).opsize]) and
+              GetNextInstruction(hp1, hp2) and
+              MatchInstruction(hp2, A_AND, [taicpu(p).opsize]) and
+              MatchOpType(taicpu(hp2), top_reg, top_reg) and
+              (taicpu(hp1).oper[1]^.reg = taicpu(hp2).oper[0]^.reg)
+{$endif x86_64}
+            )
+          ) and
+          (taicpu(p).oper[1]^.reg = taicpu(hp2).oper[1]^.reg) then
+          begin
+            { Change:
+                shl x, %reg1
+                mov -(1<<x), %reg2
+                and %reg2, %reg1
+
+              Or:
+                shl x, %reg1
+                and -(1<<x), %reg1
+
+              To just:
+                shl x, %reg1
+
+              Since the and operation only zeroes bits that are already zero from the shl operation
+            }
+            case taicpu(p).oper[0]^.val of
+               8:
+                 mask:=$FFFFFFFFFFFFFF00;
+               16:
+                 mask:=$FFFFFFFFFFFF0000;
+               32:
+                 mask:=$FFFFFFFF00000000;
+               63:
+                 { Constant pre-calculated to prevent overflow errors with Int64 }
+                 mask:=$8000000000000000;
+               else
+                 begin
+                   if taicpu(p).oper[0]^.val >= 64 then
+                     { Shouldn't happen realistically, since the register
+                       is guaranteed to be set to zero at this point }
+                     mask := 0
+                   else
+                     mask := -(Int64(1 shl taicpu(p).oper[0]^.val));
+                 end;
+            end;
+
+            if taicpu(hp1).oper[0]^.val = mask then
+              begin
+                { Everything checks out, perform the optimisation, as long as
+                  the FLAGS register isn't being used}
+                TransferUsedRegs(TmpUsedRegs);
+                UpdateUsedRegs(TmpUsedRegs, tai(p.next));
+
+{$ifdef x86_64}
+                if (hp1 <> hp2) then
+                  begin
+                    { "shl/mov/and" version }
+                    UpdateUsedRegs(TmpUsedRegs, tai(hp1.next));
+
+                    { Don't do the optimisation if the FLAGS register is in use }
+                    if not(RegUsedAfterInstruction(NR_DEFAULTFLAGS, hp2, TmpUsedRegs)) then
+                      begin
+                        DebugMsg(SPeepholeOptimization + 'ShlMovAnd2Shl', p);
+                        { Don't remove the 'mov' instruction if its register is used elsewhere }
+                        if not(RegUsedAfterInstruction(taicpu(hp1).oper[1]^.reg, hp2, TmpUsedRegs)) then
+                          begin
+                            asml.Remove(hp1);
+                            hp1.Free;
+                            Result := True;
+                          end;
+
+                        { Only set Result to True if the 'mov' instruction was removed }
+                        asml.Remove(hp2);
+                        hp2.Free;
+                      end;
+                  end
+                else
+{$endif x86_64}
+                  begin
+                    { "shl/and" version }
+
+                    { Don't do the optimisation if the FLAGS register is in use }
+                    if not(RegUsedAfterInstruction(NR_DEFAULTFLAGS, hp1, TmpUsedRegs)) then
+                      begin
+                        DebugMsg(SPeepholeOptimization + 'ShlAnd2Shl', p);
+                        asml.Remove(hp1);
+                        hp1.Free;
+                        Result := True;
+                      end;
+                  end;
+
+                Exit;
+              end
+            else {$ifdef x86_64}if (hp1 = hp2) then{$endif x86_64}
+              begin
+                { Even if the mask doesn't allow for its removal, we might be
+                  able to optimise the mask for the "shl/and" version, which
+                  may permit other peephole optimisations }
+{$ifdef DEBUG_AOPTCPU}
+                mask := taicpu(hp1).oper[0]^.val and mask;
+                if taicpu(hp1).oper[0]^.val <> mask then
+                  begin
+                    DebugMsg(
+                      SPeepholeOptimization +
+                      'Changed mask from $' + debug_tostr(taicpu(hp1).oper[0]^.val) +
+                      ' to $' + debug_tostr(mask) +
+                      'based on previous instruction (ShlAnd2ShlAnd)', hp1);
+                    taicpu(hp1).oper[0]^.val := mask;
+                  end;
+{$else DEBUG_AOPTCPU}
+                { If debugging is off, just set the operand even if it's the same }
+                taicpu(hp1).oper[0]^.val := taicpu(hp1).oper[0]^.val and mask;
+{$endif DEBUG_AOPTCPU}
+              end;
+
+          end;
       end;
 
 
@@ -5357,6 +5484,35 @@
                 hp1.Free;
               end;
           end
+        else if reg_and_hp1_is_instr and
+          (taicpu(p).oper[0]^.typ = top_reg) and
+          (
+            (taicpu(hp1).opcode = A_SHL) or (taicpu(hp1).opcode = A_SAL)
+          ) and
+          (taicpu(hp1).oper[0]^.typ = top_const) and
+          SuperRegistersEqual(taicpu(p).oper[0]^.reg, taicpu(p).oper[1]^.reg) and
+          MatchOperand(taicpu(hp1).oper[1]^, taicpu(p).oper[1]^.reg) and
+          { Minimum shift value allowed is the bit difference between the sizes }
+          (taicpu(hp1).oper[0]^.val >=
+            { Multiply by 8 because tcgsize2size returns bytes, not bits }
+            8 * (
+              tcgsize2size[reg_cgsize(taicpu(p).oper[1]^.reg)] -
+              tcgsize2size[reg_cgsize(taicpu(p).oper[0]^.reg)]
+            )
+          ) then
+          begin
+            { For:
+                movsx/movzx %reg1,%reg1 (same register, just different sizes)
+                shl/sal     ##,   %reg1
+
+              Remove the movsx/movzx instruction if the shift overwrites the
+              extended bits of the register (e.g. movslq %eax,%rax; shlq $32,%rax
+            }
+            DebugMsg(SPeepholeOptimization + 'MovxShl2Shl',p);
+            RemoveCurrentP(p, hp1);
+            Result := True;
+            Exit;
+          end
         else if taicpu(p).opcode=A_MOVZX then
           begin
             { removes superfluous And's after movzx's }
shl-optimisations.patch (8,025 bytes)   

J. Gareth Moreton

2020-07-19 17:40

developer   ~0124172

i386-win32 and x86_64-win64 regression tests have passed for the combined patches of this issue and 0037390

Marģers

2020-07-19 20:09

reporter   ~0124175

Don't know how frequent appears combination of
movslq %esi,%rsi
shlq $32,%rsi

it can be translated into
shll $32,%esi
with no harm, but for first 8 registers encoded instruction would be 1 byte shorter.

J. Gareth Moreton

2020-07-20 02:07

developer   ~0124180

The sequence of instructions sometimes appears when writing to some record types, it seems. Often the next instruction is to OR another 32-bit operand to fill up the lower 32 bits. There is potential to optimise this further with SHLD, but this would be easier to do with virtual registers, hence my email about that in the FPC developers mail list.

As for your point, any operations on 32-bit registers will zero the upper 32 bits of the corresponding 64-bit register (this is why there's no 32-to-64-bit zero extension instruction). "shll $32,%esi" will set the entire register to zero.

Marģers

2020-07-20 10:23

reporter   ~0124184

You are right. I was thinking about SHR while your patch is about SHL

Florian

2020-07-20 22:22

administrator   ~0124194

Thanks, applied meanwhile, I forgot to mark it as resolved.

Issue History

Date Modified Username Field Change
2020-07-19 17:23 J. Gareth Moreton New Issue
2020-07-19 17:23 J. Gareth Moreton File Added: shl-optimisations.patch
2020-07-19 17:40 J. Gareth Moreton Note Added: 0124172
2020-07-19 17:40 J. Gareth Moreton Tag Attached: patch
2020-07-19 17:40 J. Gareth Moreton Tag Attached: compiler
2020-07-19 17:40 J. Gareth Moreton Tag Attached: optimizations
2020-07-19 17:40 J. Gareth Moreton Tag Attached: x86_64
2020-07-19 17:40 J. Gareth Moreton Tag Attached: x86
2020-07-19 17:40 J. Gareth Moreton Tag Attached: i386
2020-07-19 20:09 Marģers Note Added: 0124175
2020-07-20 02:07 J. Gareth Moreton Note Added: 0124180
2020-07-20 10:23 Marģers Note Added: 0124184
2020-07-20 22:22 Florian Assigned To => Florian
2020-07-20 22:22 Florian Status new => resolved
2020-07-20 22:22 Florian Resolution open => fixed
2020-07-20 22:22 Florian Fixed in Version => 3.3.1
2020-07-20 22:22 Florian Fixed in Revision => 45811
2020-07-20 22:22 Florian FPCTarget => -
2020-07-20 22:22 Florian Note Added: 0124194