View Issue Details

IDProjectCategoryView StatusLast Update
0036511FPCCompilerpublic2020-01-04 15:24
ReporterJ. Gareth MoretonAssigned ToFlorian 
PrioritynormalSeverityminorReproducibilityN/A
Status resolvedResolutionfixed 
Platformi386 and x86_64OSMicrosoft WindowsOS Version10 Professional
Product Version3.3.1Product Buildr43854 
Target VersionFixed in Version3.3.1 
Summary0036511: [Patch] New 3*MOV -> XCHG optimisation
DescriptionThis patch adds a new optimisation into OptPass2MOV that looks for the following sequence:

mov %reg1,%reg2
mov %reg3,%reg1
mov %reg2,%reg3
(With %reg2 not used afterwards)

Change to

xchg %reg3,%reg1

There are a few restrictions though - see Additional Information.
Steps To ReproduceApply patch and confirm correct compilation and successful regression testing under i386 and x86_64 platforms.
Additional Information- Optimisation is done as xchg %reg3,%reg1 instead of xchg %reg1,%reg3 because XCHG is commutative and it means only having to change the middle MOV instruction's opcode and delete the rest, rather than modifying operands as well (i.e. it runs slightly faster!).
- Optimisation is undertaken in pass 2 so the MOVs can be optimised as deeply as possible first.
- Optimisation is always performed when optimising for size.
- On i8086, when optimising for speed, the optimisation is only performed if the optimiser's target is Pentium M.
- On i386, when optimising for speed, the optimisation is only performed if the optimiser's target is Pentium M or higher.
- On x86_64, when optimising for speed, the optimisation is only performed if the optimiser's target is Core I or higher.

The reason being is that these architectures were the first where XCHG's latency was reduced from 3 to 2, thereby allowing it to be equal in speed to the three MOV operations (the first two MOVs can run simultaneously because there isn't a dependency chain between them).
Tagscompiler, i386, optimizations, patch, x86, x86_64
Fixed in Revision43858
FPCOldBugId
FPCTarget-
Attached Files
  • xchg-patch.patch (3,155 bytes)
    Index: compiler/x86/aoptx86.pas
    ===================================================================
    --- compiler/x86/aoptx86.pas	(revision 43854)
    +++ compiler/x86/aoptx86.pas	(working copy)
    @@ -3230,6 +3230,24 @@
     
     
        function TX86AsmOptimizer.OptPass2MOV(var p : tai) : boolean;
    +
    +     function IsXCHGAcceptable: Boolean; inline;
    +       begin
    +         { Always accept if optimising for size }
    +         Result := (cs_opt_size in current_settings.optimizerswitches) or
    +           (
    +{$ifdef x86_64}
    +             { XCHG takes 3 cycles on AMD Athlon64 }
    +             (current_settings.optimizecputype >= cpu_core_i)
    +{$else x86_64}
    +             { From the Pentium M onwards, XCHG only has a latency of 2 rather
    +             than 3, so it becomes a saving compared to three MOVs with two of
    +             them able to execute simultaneously. [Kit] }
    +             (current_settings.optimizecputype >= cpu_PentiumM)
    +{$endif x86_64}
    +           );
    +       end;
    +
           var
            hp1,hp2: tai;
     {$ifdef x86_64}
    @@ -3310,6 +3328,49 @@
                 exit;
               end
             else if MatchOpType(taicpu(p),top_reg,top_reg) and
    +          IsXCHGAcceptable and
    +          { XCHG doesn't support 8-byte registers }
    +          (taicpu(p).opsize <> S_B) and
    +          MatchInstruction(hp1, A_MOV, []) and
    +          MatchOpType(taicpu(hp1),top_reg,top_reg) and
    +          (taicpu(hp1).oper[1]^.reg = taicpu(p).oper[0]^.reg) and
    +          GetNextInstruction(hp1, hp2) and
    +          MatchInstruction(hp2, A_MOV, []) and
    +          { Don't need to call MatchOpType for hp2 because the operand matches below cover for it }
    +          MatchOperand(taicpu(hp2).oper[0]^, taicpu(p).oper[1]^.reg) and
    +          MatchOperand(taicpu(hp2).oper[1]^, taicpu(hp1).oper[0]^.reg) then
    +          begin
    +            { mov %reg1,%reg2
    +              mov %reg3,%reg1        ->  xchg %reg3,%reg1
    +              mov %reg2,%reg3
    +              (%reg2 not used afterwards)
    +
    +              Note that xchg takes 3 cycles to execute, and generally mov's take
    +              only one cycle apiece, but the first two mov's can be executed in
    +              parallel, only taking 2 cycles overall.  Older processors should
    +              therefore only optimise for size. [Kit]
    +            }
    +            TransferUsedRegs(TmpUsedRegs);
    +            UpdateUsedRegs(TmpUsedRegs, tai(p.Next));
    +            UpdateUsedRegs(TmpUsedRegs, tai(hp1.Next));
    +
    +            if not RegUsedAfterInstruction(taicpu(p).oper[1]^.reg, hp2, TmpUsedRegs) then
    +              begin
    +                DebugMsg(SPeepholeOptimization + 'MovMovMov2XChg', p);
    +                AllocRegBetween(taicpu(hp2).oper[1]^.reg, p, hp1, UsedRegs);
    +                taicpu(hp1).opcode := A_XCHG;
    +
    +                asml.Remove(p);
    +                asml.Remove(hp2);
    +                p.Free;
    +                hp2.Free;
    +
    +                p := hp1;
    +                Result := True;
    +                Exit;
    +              end;
    +          end
    +        else if MatchOpType(taicpu(p),top_reg,top_reg) and
     {$ifdef x86_64}
               MatchInstruction(hp1,[A_MOV,A_MOVZX,A_MOVSX,A_MOVSXD],[]) and
     {$else x86_64}
    
    xchg-patch.patch (3,155 bytes)

Activities

J. Gareth Moreton

2020-01-03 23:13

developer  

xchg-patch.patch (3,155 bytes)
Index: compiler/x86/aoptx86.pas
===================================================================
--- compiler/x86/aoptx86.pas	(revision 43854)
+++ compiler/x86/aoptx86.pas	(working copy)
@@ -3230,6 +3230,24 @@
 
 
    function TX86AsmOptimizer.OptPass2MOV(var p : tai) : boolean;
+
+     function IsXCHGAcceptable: Boolean; inline;
+       begin
+         { Always accept if optimising for size }
+         Result := (cs_opt_size in current_settings.optimizerswitches) or
+           (
+{$ifdef x86_64}
+             { XCHG takes 3 cycles on AMD Athlon64 }
+             (current_settings.optimizecputype >= cpu_core_i)
+{$else x86_64}
+             { From the Pentium M onwards, XCHG only has a latency of 2 rather
+             than 3, so it becomes a saving compared to three MOVs with two of
+             them able to execute simultaneously. [Kit] }
+             (current_settings.optimizecputype >= cpu_PentiumM)
+{$endif x86_64}
+           );
+       end;
+
       var
        hp1,hp2: tai;
 {$ifdef x86_64}
@@ -3310,6 +3328,49 @@
             exit;
           end
         else if MatchOpType(taicpu(p),top_reg,top_reg) and
+          IsXCHGAcceptable and
+          { XCHG doesn't support 8-byte registers }
+          (taicpu(p).opsize <> S_B) and
+          MatchInstruction(hp1, A_MOV, []) and
+          MatchOpType(taicpu(hp1),top_reg,top_reg) and
+          (taicpu(hp1).oper[1]^.reg = taicpu(p).oper[0]^.reg) and
+          GetNextInstruction(hp1, hp2) and
+          MatchInstruction(hp2, A_MOV, []) and
+          { Don't need to call MatchOpType for hp2 because the operand matches below cover for it }
+          MatchOperand(taicpu(hp2).oper[0]^, taicpu(p).oper[1]^.reg) and
+          MatchOperand(taicpu(hp2).oper[1]^, taicpu(hp1).oper[0]^.reg) then
+          begin
+            { mov %reg1,%reg2
+              mov %reg3,%reg1        ->  xchg %reg3,%reg1
+              mov %reg2,%reg3
+              (%reg2 not used afterwards)
+
+              Note that xchg takes 3 cycles to execute, and generally mov's take
+              only one cycle apiece, but the first two mov's can be executed in
+              parallel, only taking 2 cycles overall.  Older processors should
+              therefore only optimise for size. [Kit]
+            }
+            TransferUsedRegs(TmpUsedRegs);
+            UpdateUsedRegs(TmpUsedRegs, tai(p.Next));
+            UpdateUsedRegs(TmpUsedRegs, tai(hp1.Next));
+
+            if not RegUsedAfterInstruction(taicpu(p).oper[1]^.reg, hp2, TmpUsedRegs) then
+              begin
+                DebugMsg(SPeepholeOptimization + 'MovMovMov2XChg', p);
+                AllocRegBetween(taicpu(hp2).oper[1]^.reg, p, hp1, UsedRegs);
+                taicpu(hp1).opcode := A_XCHG;
+
+                asml.Remove(p);
+                asml.Remove(hp2);
+                p.Free;
+                hp2.Free;
+
+                p := hp1;
+                Result := True;
+                Exit;
+              end;
+          end
+        else if MatchOpType(taicpu(p),top_reg,top_reg) and
 {$ifdef x86_64}
           MatchInstruction(hp1,[A_MOV,A_MOVZX,A_MOVSX,A_MOVSXD],[]) and
 {$else x86_64}
xchg-patch.patch (3,155 bytes)

Florian

2020-01-04 15:24

administrator   ~0120212

Thanks, commited.

Issue History

Date Modified Username Field Change
2020-01-03 22:30 J. Gareth Moreton New Issue
2020-01-03 22:30 J. Gareth Moreton File Added: xchg-patch.patch
2020-01-03 22:52 J. Gareth Moreton Tag Attached: patch
2020-01-03 22:52 J. Gareth Moreton Tag Attached: compiler
2020-01-03 22:52 J. Gareth Moreton Tag Attached: optimizations
2020-01-03 22:52 J. Gareth Moreton Tag Attached: x86_64
2020-01-03 22:52 J. Gareth Moreton Tag Attached: x86
2020-01-03 22:52 J. Gareth Moreton Tag Attached: i386
2020-01-03 23:04 J. Gareth Moreton Additional Information Updated View Revisions
2020-01-03 23:04 J. Gareth Moreton FPCTarget => -
2020-01-03 23:13 J. Gareth Moreton File Deleted: xchg-patch.patch
2020-01-03 23:13 J. Gareth Moreton File Added: xchg-patch.patch
2020-01-03 23:16 J. Gareth Moreton Description Updated View Revisions
2020-01-03 23:16 J. Gareth Moreton Additional Information Updated View Revisions
2020-01-04 15:24 Florian Assigned To => Florian
2020-01-04 15:24 Florian Status new => resolved
2020-01-04 15:24 Florian Resolution open => fixed
2020-01-04 15:24 Florian Fixed in Version => 3.3.1
2020-01-04 15:24 Florian Fixed in Revision => 43858
2020-01-04 15:24 Florian Note Added: 0120212