View Issue Details

ID: 0036551
Project: FPC
Category: Compiler
View Status: public
Last Update: 2020-03-21 00:23
Reporter: J. Gareth Moreton
Assigned To: Florian
Priority: low
Severity: minor
Reproducibility: N/A
Status: resolved
Resolution: fixed
Platform: i386 and x86_64
OS: Microsoft Windows
Product Version: 3.3.1
Fixed in Version: 3.3.1
Summary: 0036551: [Patch] EAX -> EDX:EAX sign extension shortcuts, and MOVSX shortcuts for AX register
Description: Find attached a couple of patches that search for instruction combinations that sign-extend EAX to EDX:EAX and attempt to convert them into more optimal forms via the CWDE/cwtl, CDQ/cltd, CDQE/cltq and CQO/cqto opcodes.

mov-cdq.patch (all optimisations take place in OptPass2MOV):

- movl %eax,%edx; sarl $31,%edx -> cltd
- movq %rax,%rdx; sarq $63,%rdx -> cqto (x86_64 only)

- movl %edx,%eax; sarl $31,%edx -> movl %edx,%eax; cltd (-Os only due to cltd's dependency on %eax)
- movq %rdx,%rax; sarq $63,%rdx -> movq %rdx,%rax; cqto (x86_64 only; -Os only due to cqto's dependency on %rax)

---

A more complex one: either of the following combinations (these sequences sign-extend a 32-bit integer to a 64-bit integer, so they are common around Int64 types under i386):

movl r/m,%edx; movl %edx,%eax; sarl $31,%edx
movl r/m,%eax; movl %eax,%edx; sarl $31,%edx
movl r/m,%edx; movl r/m,%eax; sarl $31,%edx
movl r/m,%eax; movl r/m,%edx; sarl $31,%edx

...become...

movl r/m,%eax <-- always %eax, even if it was %edx before
cltd

----

A very complex one, not performed on x86_64 (the first four instructions are the implementation of Abs() for LongInt on x86 platforms that don't have CMOV available; x86_64 always has CMOV):

movl %eax,%edx
sarl $31,%eax
xorl %eax,%edx
subl %eax,%edx
(instruction that uses %edx)
(%eax and %edx deallocated)

...becomes...

cltd
xorl %edx,%eax <-- note the registers have swapped
subl %edx,%eax
(instruction that uses %eax) <-- %eax rather than %edx

----

movsx-cdq.patch (all optimisations take place in a new PostPeepholeOptMOVSX routine):

movswl %ax,%eax -> cwtl
movslq %eax,%rax -> cltq (x86_64 only)
Steps To Reproduce: Apply the patches and confirm correct compilation, as well as a size saving with no penalty to speed (except when -Os is specified).
Additional Information: A 64-bit version of the "movl r/m" optimisation is present in mov-cdq.patch, but commented out. The 32-bit combinations arise when sign-extending a LongInt to Int64, among other situations; the equivalent instructions that sign-extend %rax to %rdx:%rax are not generated because there is no Int128 type, but the code is left in the patch in a commented-out state should the need for it arise in the future.

Extensive testing has been performed for -Os under i386-win32, i386-linux, x86_64-win64 and x86_64-linux.
Tags: compiler, i386, optimizations, patch, x86, x86_64
Fixed in Revision: 43917, 43918
FPCOldBugId:
FPCTarget: -
Attached Files

Activities

J. Gareth Moreton

2020-01-11 00:07

developer  

mov-cdq.patch (14,136 bytes)   
Index: compiler/x86/aoptx86.pas
===================================================================
--- compiler/x86/aoptx86.pas	(revision 43880)
+++ compiler/x86/aoptx86.pas	(working copy)
@@ -3363,9 +3363,10 @@
        end;
 
       var
-       hp1,hp2: tai;
-{$ifdef x86_64}
-       hp3: tai;
+       hp1,hp2, hp3: tai;
+{$ifndef x86_64}
+       hp4: tai;
+       OperIdx: Integer;
 {$endif x86_64}
       begin
         Result:=false;
@@ -3510,6 +3511,291 @@
             Result:=true;
             exit;
           end
+        else if MatchOpType(taicpu(p),top_reg,top_reg) and
+          MatchInstruction(hp1, A_SAR, []) then
+          begin
+            if MatchOperand(taicpu(hp1).oper[0]^, 31) then
+              begin
+                { the use of %edx also covers the opsize being S_L }
+                if MatchOperand(taicpu(hp1).oper[1]^, NR_EDX) then
+                  begin
+                    { Note it has to be specifically "movl %eax,%edx", and those specific sub-registers }
+                    if (taicpu(p).oper[0]^.reg = NR_EAX) and
+                      (taicpu(p).oper[1]^.reg = NR_EDX) then
+                      begin
+                        { Change:
+                            movl %eax,%edx
+                            sarl $31,%edx
+                          To:
+                            cltd
+                        }
+                        DebugMsg(SPeepholeOptimization + 'MovSar2Cltd', p);
+                        Asml.Remove(hp1);
+                        hp1.Free;
+                        taicpu(p).opcode := A_CDQ;
+                        taicpu(p).opsize := S_NO;
+                        taicpu(p).clearop(1);
+                        taicpu(p).clearop(0);
+                        taicpu(p).ops:=0;
+                        Result := True;
+                      end
+                    else if (cs_opt_size in current_settings.optimizerswitches) and
+                      (taicpu(p).oper[0]^.reg = NR_EDX) and
+                      (taicpu(p).oper[1]^.reg = NR_EAX) then
+                      begin
+                        { Change:
+                            movl %edx,%eax
+                            sarl $31,%edx
+                          To:
+                            movl %edx,%eax
+                            cltd
+
+                          Note that this creates a dependency between the two instructions,
+                            so only perform if optimising for size.
+                        }
+                        DebugMsg(SPeepholeOptimization + 'MovSar2MovCltd', p);
+                        taicpu(hp1).opcode := A_CDQ;
+                        taicpu(hp1).opsize := S_NO;
+                        taicpu(hp1).clearop(1);
+                        taicpu(hp1).clearop(0);
+                        taicpu(hp1).ops:=0;
+                      end;
+{$ifndef x86_64}
+                  end
+                { Don't bother if CMOV is supported, because a more optimal
+                  sequence would have been generated for the Abs() intrinsic }
+                else if not(CPUX86_HAS_CMOV in cpu_capabilities[current_settings.cputype]) and
+                  { the use of %eax also covers the opsize being S_L }
+                  MatchOperand(taicpu(hp1).oper[1]^, NR_EAX) and
+                  (taicpu(p).oper[0]^.reg = NR_EAX) and
+                  (taicpu(p).oper[1]^.reg = NR_EDX) and
+                  GetNextInstruction(hp1, hp2) and
+                  MatchInstruction(hp2, A_XOR, [S_L]) and
+                  MatchOperand(taicpu(hp2).oper[0]^, NR_EAX) and
+                  MatchOperand(taicpu(hp2).oper[1]^, NR_EDX) and
+
+                  GetNextInstruction(hp2, hp3) and
+                  MatchInstruction(hp3, A_SUB, [S_L]) and
+                  MatchOperand(taicpu(hp3).oper[0]^, NR_EAX) and
+                  MatchOperand(taicpu(hp3).oper[1]^, NR_EDX) then
+                  begin
+                    { Change:
+                        movl %eax,%edx
+                        sarl $31,%eax
+                        xorl %eax,%edx
+                        subl %eax,%edx
+                        (Instruction that uses %edx)
+                        (%eax deallocated)
+                        (%edx deallocated)
+                      To:
+                        cltd
+                        xorl %edx,%eax  <-- Note the registers have swapped
+                        subl %edx,%eax
+                        (Instruction that uses %eax) <-- %eax rather than %edx
+                    }
+
+                    TransferUsedRegs(TmpUsedRegs);
+                    UpdateUsedRegs(TmpUsedRegs, tai(p.Next));
+                    UpdateUsedRegs(TmpUsedRegs, tai(hp1.Next));
+                    UpdateUsedRegs(TmpUsedRegs, tai(hp2.Next));
+
+                    if not RegUsedAfterInstruction(NR_EAX, hp3, TmpUsedRegs) then
+                      begin
+                        if GetNextInstruction(hp3, hp4) and
+                          not RegModifiedByInstruction(NR_EDX, hp4) and
+                          not RegUsedAfterInstruction(NR_EDX, hp4, TmpUsedRegs) then
+                          begin
+                            DebugMsg(SPeepholeOptimization + 'abs() intrinsic optimisation', p);
+
+                            taicpu(p).opcode := A_CDQ;
+                            taicpu(p).clearop(1);
+                            taicpu(p).clearop(0);
+                            taicpu(p).ops:=0;
+
+                            AsmL.Remove(hp1);
+                            hp1.Free;
+
+                            taicpu(hp2).loadreg(0, NR_EDX);
+                            taicpu(hp2).loadreg(1, NR_EAX);
+
+                            taicpu(hp3).loadreg(0, NR_EDX);
+                            taicpu(hp3).loadreg(1, NR_EAX);
+
+                            AllocRegBetween(NR_EAX, hp3, hp4, TmpUsedRegs);
+                            { Convert references in the following instruction (hp4) from %edx to %eax }
+                            for OperIdx := 0 to taicpu(hp4).ops - 1 do
+                              with taicpu(hp4).oper[OperIdx]^ do
+                                case typ of
+                                  top_reg:
+                                    if reg = NR_EDX then
+                                      reg := NR_EAX;
+                                  top_ref:
+                                    begin
+                                      if ref^.base = NR_EDX then
+                                        ref^.base := NR_EAX;
+                                      if ref^.index = NR_EDX then
+                                        ref^.index := NR_EAX;
+                                    end;
+                                  else
+                                    { Do nothing };
+                                end;
+                          end;
+                      end;
+{$else x86_64}
+                  end;
+              end
+            else if MatchOperand(taicpu(hp1).oper[0]^, 63) and
+              { the use of %rdx also covers the opsize being S_Q }
+              MatchOperand(taicpu(hp1).oper[1]^, NR_RDX) then
+              begin
+                { Note it has to be specifically "movq %rax,%rdx", and those specific sub-registers }
+                if (taicpu(p).oper[0]^.reg = NR_RAX) and
+                  (taicpu(p).oper[1]^.reg = NR_RDX) then
+                  begin
+                    { Change:
+                        movq %rax,%rdx
+                        sarq $63,%rdx
+                      To:
+                        cqto
+                    }
+                    DebugMsg(SPeepholeOptimization + 'MovSar2Cqto', p);
+                    Asml.Remove(hp1);
+                    hp1.Free;
+                    taicpu(p).opcode := A_CQO;
+                    taicpu(p).opsize := S_NO;
+                    taicpu(p).clearop(1);
+                    taicpu(p).clearop(0);
+                    taicpu(p).ops:=0;
+                    Result := True;
+                  end
+                else if (cs_opt_size in current_settings.optimizerswitches) and
+                  (taicpu(p).oper[0]^.reg = NR_RDX) and
+                  (taicpu(p).oper[1]^.reg = NR_RAX) then
+                  begin
+                    { Change:
+                        movq %rdx,%rax
+                        sarq $63,%rdx
+                      To:
+                        movq %rdx,%rax
+                        cqto
+
+                      Note that this creates a dependency between the two instructions,
+                        so only perform if optimising for size.
+                    }
+                    DebugMsg(SPeepholeOptimization + 'MovSar2MovCqto', p);
+                    taicpu(hp1).opcode := A_CQO;
+                    taicpu(hp1).opsize := S_NO;
+                    taicpu(hp1).clearop(1);
+                    taicpu(hp1).clearop(0);
+                    taicpu(hp1).ops:=0;
+{$endif x86_64}
+                  end;
+              end;
+          end
+        else if MatchInstruction(hp1, A_MOV, []) and
+          (taicpu(hp1).oper[1]^.typ = top_reg) then
+          { Though "GetNextInstruction" could be factored out, along with
+            the instructions that depend on hp2, it is an expensive call that
+            should be delayed for as long as possible, hence we do cheaper
+            checks first that are likely to be False. [Kit] }
+          begin
+
+            if MatchOperand(taicpu(p).oper[1]^, NR_EDX) and
+              (
+                (
+                  (taicpu(hp1).oper[1]^.reg = NR_EAX) and
+                  (
+                    MatchOperand(taicpu(hp1).oper[0]^, taicpu(p).oper[0]^) or
+                    MatchOperand(taicpu(hp1).oper[0]^, NR_EDX)
+                  )
+                ) or
+                (
+                  (taicpu(hp1).oper[1]^.reg = NR_EDX) and
+                  (
+                    MatchOperand(taicpu(hp1).oper[0]^, taicpu(p).oper[0]^) or
+                    MatchOperand(taicpu(hp1).oper[0]^, NR_EAX)
+                  )
+                )
+              ) and
+              GetNextInstruction(hp1, hp2) and
+              MatchInstruction(hp2, A_SAR, []) and
+              MatchOperand(taicpu(hp2).oper[0]^, 31) then
+              begin
+                if MatchOperand(taicpu(hp2).oper[1]^, NR_EDX) then
+                  begin
+                    { Change:
+                        movl r/m,%edx         movl r/m,%eax         movl r/m,%edx         movl r/m,%eax
+                        movl %edx,%eax   or   movl %eax,%edx   or   movl r/m,%eax    or   movl r/m,%edx
+                        sarl $31,%edx         sarl $31,%edx         sarl $31,%edx         sarl $31,%edx
+                      To:
+                        movl r/m,%eax    <- Note the change in register
+                        cltd
+                    }
+                    DebugMsg(SPeepholeOptimization + 'MovMovSar2MovCltd', p);
+
+                    AllocRegBetween(NR_EAX, p, hp1, UsedRegs);
+                    taicpu(p).loadreg(1, NR_EAX);
+
+                    taicpu(hp1).opcode := A_CDQ;
+                    taicpu(hp1).clearop(1);
+                    taicpu(hp1).clearop(0);
+                    taicpu(hp1).ops:=0;
+
+                    AsmL.Remove(hp2);
+                    hp2.Free;
+(*
+{$ifdef x86_64}
+                  end
+                else if MatchOperand(taicpu(hp2).oper[1]^, NR_RDX) and
+                  { This code sequence does not get generated - however it might become useful
+                    if and when 128-bit signed integer types make an appearance, so the code
+                    is kept here for when it is eventually needed. [Kit] }
+                  (
+                    (
+                      (taicpu(hp1).oper[1]^.reg = NR_RAX) and
+                      (
+                        MatchOperand(taicpu(hp1).oper[0]^, taicpu(p).oper[0]^) or
+                        MatchOperand(taicpu(hp1).oper[0]^, NR_RDX)
+                      )
+                    ) or
+                    (
+                      (taicpu(hp1).oper[1]^.reg = NR_RDX) and
+                      (
+                        MatchOperand(taicpu(hp1).oper[0]^, taicpu(p).oper[0]^) or
+                        MatchOperand(taicpu(hp1).oper[0]^, NR_RAX)
+                      )
+                    )
+                  ) and
+                  GetNextInstruction(hp1, hp2) and
+                  MatchInstruction(hp2, A_SAR, [S_Q]) and
+                  MatchOperand(taicpu(hp2).oper[0]^, 63) and
+                  MatchOperand(taicpu(hp2).oper[1]^, NR_RDX) then
+                  begin
+                    { Change:
+                        movq r/m,%rdx         movq r/m,%rax         movq r/m,%rdx         movq r/m,%rax
+                        movq %rdx,%rax   or   movq %rax,%rdx   or   movq r/m,%rax    or   movq r/m,%rdx
+                        sarq $63,%rdx         sarq $63,%rdx         sarq $63,%rdx         sarq $63,%rdx
+                      To:
+                        movq r/m,%rax    <- Note the change in register
+                        cqto
+                    }
+                    DebugMsg(SPeepholeOptimization + 'MovMovSar2MovCqto', p);
+
+                    AllocRegBetween(NR_RAX, p, hp1, UsedRegs);
+                    taicpu(p).loadreg(1, NR_RAX);
+
+                    taicpu(hp1).opcode := A_CQO;
+                    taicpu(hp1).clearop(1);
+                    taicpu(hp1).clearop(0);
+                    taicpu(hp1).ops:=0;
+
+                    AsmL.Remove(hp2);
+                    hp2.Free;
+{$endif x86_64}
+*)
+                  end;
+              end;
+          end
         else if (taicpu(p).oper[0]^.typ = top_ref) and
           (hp1.typ = ait_instruction) and
           { while the GetNextInstruction(hp1,hp2) call could be factored out,

Florian

2020-01-11 09:56

administrator   ~0120322

One first comment: please do not add { do nothing } comments. Those are obvious comments and have no added value. The correct comment would be: { else clause added to avoid that the compiler complains about uncovered cases }, but I do not think this is needed either.

J. Gareth Moreton

2020-01-11 13:00

developer   ~0120324

Oh okay. I just felt they were a bit more informative than a lone semicolon: a less-informed programmer may be tempted to remove the else block, while the presence of the comment shows that it is deliberate.

J. Gareth Moreton

2020-01-11 15:12

developer   ~0120328

Updated movsx-cdq.patch to remove "{ Do nothing }" comments.

J. Gareth Moreton

2020-01-11 15:58

developer  

movsx-cdq.patch (3,262 bytes)   
Index: compiler/i386/aoptcpu.pas
===================================================================
--- compiler/i386/aoptcpu.pas	(revision 43888)
+++ compiler/i386/aoptcpu.pas	(working copy)
@@ -324,6 +324,8 @@
                    end;
                 A_TEST, A_OR:
                   Result:=PostPeepholeOptTestOr(p);
+                A_MOVSX:
+                  Result:=PostPeepholeOptMOVSX(p);
                 else
                   ;
               end;
Index: compiler/x86/aoptx86.pas
===================================================================
--- compiler/x86/aoptx86.pas	(revision 43880)
+++ compiler/x86/aoptx86.pas	(working copy)
@@ -88,6 +88,7 @@
         function PostPeepholeOptMovzx(var p : tai) : Boolean;
         function PostPeepholeOptXor(var p : tai) : Boolean;
 {$endif}
+        function PostPeepholeOptMOVSX(var p : tai) : boolean;
         function PostPeepholeOptCmp(var p : tai) : Boolean;
         function PostPeepholeOptTestOr(var p : tai) : Boolean;
         function PostPeepholeOptCall(var p : tai) : Boolean;
@@ -4747,6 +4748,50 @@
       end;
 
 
+    function TX86AsmOptimizer.PostPeepholeOptMOVSX(var p : tai) : boolean;
+      begin
+        Result := False;
+        if not MatchOpType(taicpu(p), top_reg, top_reg) then
+          Exit;
+
+        { Convert:
+            movswl %ax,%eax  -> cwtl
+            movslq %eax,%rax -> cdqe
+
+            NOTE: Don't convert movswl %al,%ax to cbw, because cbw and cwde
+              refer to the same opcode and depends only on the assembler's
+              current operand-size attribute. [Kit]
+        }
+        with taicpu(p) do
+          case opsize of
+            S_WL:
+              if (oper[0]^.reg = NR_AX) and (oper[1]^.reg = NR_EAX) then
+                begin
+                  DebugMsg(SPeepholeOptimization + 'Converted movswl %ax,%eax to cwtl', p);
+                  opcode := A_CWDE;
+                  clearop(0);
+                  clearop(1);
+                  ops := 0;
+                  Result := True;
+                end;
+{$ifdef x86_64}
+            S_LQ:
+              if (oper[0]^.reg = NR_EAX) and (oper[1]^.reg = NR_RAX) then
+                begin
+                  DebugMsg(SPeepholeOptimization + 'Converted movslq %eax,%rax to cltq', p);
+                  opcode := A_CDQE;
+                  clearop(0);
+                  clearop(1);
+                  ops := 0;
+                  Result := True;
+                end;
+{$endif x86_64}
+            else
+              ;
+          end;
+      end;
+
+
     function TX86AsmOptimizer.PostPeepholeOptCmp(var p : tai) : Boolean;
       begin
         Result:=false;
Index: compiler/x86_64/aoptcpu.pas
===================================================================
--- compiler/x86_64/aoptcpu.pas	(revision 43888)
+++ compiler/x86_64/aoptcpu.pas	(working copy)
@@ -174,6 +174,8 @@
               case taicpu(p).opcode of
                 A_MOV:
                   Result:=PostPeepholeOptMov(p);
+                A_MOVSX:
+                  Result:=PostPeepholeOptMOVSX(p);
                 A_MOVZX:
                   Result:=PostPeepholeOptMovzx(p);
                 A_CMP:

Florian

2020-01-12 10:21

administrator   ~0120352

Thanks, applied. Removed another "do nothing".

Issue History

Date Modified Username Field Change
2020-01-11 00:07 J. Gareth Moreton New Issue
2020-01-11 00:07 J. Gareth Moreton File Added: mov-cdq.patch
2020-01-11 00:07 J. Gareth Moreton File Added: movsx-cdq.patch
2020-01-11 00:08 J. Gareth Moreton Priority normal => low
2020-01-11 00:08 J. Gareth Moreton FPCTarget => -
2020-01-11 00:08 J. Gareth Moreton Tag Attached: patch
2020-01-11 00:08 J. Gareth Moreton Tag Attached: compiler
2020-01-11 00:08 J. Gareth Moreton Tag Attached: optimizations
2020-01-11 00:08 J. Gareth Moreton Tag Attached: x86_64
2020-01-11 00:08 J. Gareth Moreton Tag Attached: x86
2020-01-11 00:08 J. Gareth Moreton Tag Attached: i386
2020-01-11 00:11 J. Gareth Moreton Description Updated View Revisions
2020-01-11 00:14 J. Gareth Moreton Description Updated View Revisions
2020-01-11 00:22 J. Gareth Moreton Description Updated View Revisions
2020-01-11 09:56 Florian Note Added: 0120322
2020-01-11 13:00 J. Gareth Moreton Note Added: 0120324
2020-01-11 15:12 J. Gareth Moreton File Deleted: movsx-cdq.patch
2020-01-11 15:12 J. Gareth Moreton File Added: movsx-cdq.patch
2020-01-11 15:12 J. Gareth Moreton Note Added: 0120328
2020-01-11 15:58 J. Gareth Moreton File Deleted: movsx-cdq.patch
2020-01-11 15:58 J. Gareth Moreton File Added: movsx-cdq.patch
2020-01-12 10:21 Florian Assigned To => Florian
2020-01-12 10:21 Florian Status new => resolved
2020-01-12 10:21 Florian Resolution open => fixed
2020-01-12 10:21 Florian Fixed in Version => 3.3.1
2020-01-12 10:21 Florian Fixed in Revision => 43917,43918
2020-01-12 10:21 Florian Note Added: 0120352