View Issue Details

ID: 0038761
Project: FPC
Category: Compiler
View Status: public
Last Update: 2021-05-16 11:00
Reporter: J. Gareth Moreton
Assigned To:
Priority: low
Severity: minor
Reproducibility: N/A
Status: new
Resolution: open
Platform: i386 and x86_64
OS: Microsoft Windows
Product Version: 3.3.1
Summary: 0038761: [Patch] x86 JccMovJmpMov2MovSetcc improvement
Description: This patch changes the JccMovJmpMov2MovSetcc optimisation on i386 and x86-64 platforms to improve code generation:

- Moved from pass 2 to pass 1.
- Now favours "SETcc / MOVZX" where MOVZX is acceptable.
Steps To Reproduce: Apply the patch and confirm correct compilation and improved code generation (in the RTL, for example).
Additional Information: The move from pass 2 to pass 1 was made after it was noticed that "SETcc/TESTCmp/Jcc -> Jcc" missed combinations where the SETcc instruction was created in pass 2. Simply moving the SETcc optimisations into pass 2 caused them to perform worse for some reason. Running them in both pass 1 and pass 2 worked, but is a little questionable. This change addresses the problem more directly by moving the errant optimisation to pass 1, which also permits it to find more occurrences (again, for reasons uncertain).
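
For illustration only, here is a source-level view of what the peephole pattern computes. This is a minimal sketch and not part of the patch: the function names are hypothetical, and whether FPC actually emits the j<c>/mov/jmp/mov sequence for this exact source is an assumption. It only shows that the branchy form and the flag-to-value form give the same result, which is the property the SETcc rewrite relies on.

```pascal
program SetccEquivalenceSketch;
{$mode objfpc}

{ Branchy form: conceptually "j<c> / mov 1 / jmp / mov 0". }
function BranchForm(a, b: LongInt): LongInt;
begin
  if a > b then
    Result := 1
  else
    Result := 0;
end;

{ Branch-free form: conceptually "set<c>" followed by zero extension. }
function FlagForm(a, b: LongInt): LongInt;
begin
  Result := Ord(a > b);
end;

var
  a, b: LongInt;
  Mismatches: Integer;
begin
  Mismatches := 0;
  { Compare both forms over a small range of inputs. }
  for a := -2 to 2 do
    for b := -2 to 2 do
      if BranchForm(a, b) <> FlagForm(a, b) then
        Inc(Mismatches);
  WriteLn('mismatches: ', Mismatches);
end.
```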
Tags: compiler, i386, optimizations, patch, x86, x86_64
Fixed in Revision:
FPCOldBugId:
FPCTarget: -
Attached Files

Relationships

parent of 0038767 new [Patch] Additional SETcc optimisations 
Not all the children of this issue are yet resolved or closed.

Activities

J. Gareth Moreton

2021-04-16 07:21

developer  

JccMovJmpMov2MovSetcc-pass1.patch (9,164 bytes)   
Index: compiler/i386/aoptcpu.pas
===================================================================
--- compiler/i386/aoptcpu.pas	(revision 49207)
+++ compiler/i386/aoptcpu.pas	(working copy)
@@ -221,6 +221,8 @@
                   Result:=OptPass1MOVXX(p);
                 A_SETcc:
                   Result:=OptPass1SETcc(p);
+                A_Jcc:
+                  Result:=OptPass1Jcc(p);
                 else
                   ;
               end;
Index: compiler/x86/aoptx86.pas
===================================================================
--- compiler/x86/aoptx86.pas	(revision 49207)
+++ compiler/x86/aoptx86.pas	(working copy)
@@ -140,6 +140,7 @@
         function OptPass1PXor(var p : tai) : boolean;
         function OptPass1VPXor(var p: tai): boolean;
         function OptPass1Imul(var p : tai) : boolean;
+        function OptPass1Jcc(var p : tai) : boolean;
 
         function OptPass2Movx(var p : tai): Boolean;
         function OptPass2MOV(var p : tai) : boolean;
@@ -4597,7 +4598,122 @@
      end;
 
 
+   function TX86AsmOptimizer.OptPass1Jcc(var p : tai) : boolean;
+     var
+       hp1, hp2, hp3, hp4, hp5: tai;
+     begin
+       Result := False;
 
+       if not GetNextInstruction(p,hp1) or (hp1.typ <> ait_instruction) then
+         Exit;
+
+       {
+           convert
+           j<c>  .L1
+           mov   1,reg
+           jmp   .L2
+         .L1
+           mov   0,reg
+         .L2
+
+         into
+           mov   0,reg
+           set<not(c)> reg
+
+         take care of alignment and that the mov 0,reg is not converted into a xor as this
+         would destroy the flag contents
+
+         But prefer if movzx is acceptable
+           set<not(c)> reg
+           movzx       reg, reg
+       }
+
+       if MatchInstruction(hp1,A_MOV,[]) and
+         MatchOpType(taicpu(hp1),top_const,top_reg) and
+{$ifdef i386}
+         (
+         { Under i386, ESI, EDI, EBP and ESP
+           don't have an 8-bit representation }
+           not (getsupreg(taicpu(hp1).oper[1]^.reg) in [RS_ESI, RS_EDI, RS_EBP, RS_ESP])
+         ) and
+{$endif i386}
+         (taicpu(hp1).oper[0]^.val=1) and
+         GetNextInstruction(hp1,hp2) and
+         MatchInstruction(hp2,A_JMP,[]) and (taicpu(hp2).oper[0]^.ref^.refaddr=addr_full) and
+         GetNextInstruction(hp2,hp3) and
+         { skip align }
+         ((hp3.typ<>ait_align) or GetNextInstruction(hp3,hp3)) and
+         (hp3.typ=ait_label) and
+         (tasmlabel(taicpu(p).oper[0]^.ref^.symbol)=tai_label(hp3).labsym) and
+         (tai_label(hp3).labsym.getrefs=1) and
+         GetNextInstruction(hp3,hp4) and
+         MatchInstruction(hp4,A_MOV,[]) and
+         MatchOpType(taicpu(hp4),top_const,top_reg) and
+         (taicpu(hp4).oper[0]^.val=0) and
+         MatchOperand(taicpu(hp1).oper[1]^,taicpu(hp4).oper[1]^) and
+         GetNextInstruction(hp4,hp5) and
+         (hp5.typ=ait_label) and
+         (tasmlabel(taicpu(hp2).oper[0]^.ref^.symbol)=tai_label(hp5).labsym) and
+         (tai_label(hp5).labsym.getrefs=1) then
+         begin
+           AllocRegBetween(NR_FLAGS,p,hp2,UsedRegs);
+           DebugMsg(SPeepholeOptimization+'JccMovJmpMov2Setcc',p);
+           { Safe to remove labels because they only have a single reference each }
+           { remove last label }
+           RemoveInstruction(hp5);
+           { remove second label }
+           RemoveInstruction(hp3);
+           { if align is present remove it }
+           if GetNextInstruction(hp2,hp3) and (hp3.typ=ait_align) then
+             RemoveInstruction(hp3);
+
+           { Convert jump into SETcc instruction }
+           taicpu(hp2).opcode:=A_SETcc;
+           taicpu(hp2).opsize:=S_B;
+           taicpu(hp2).condition:=inverse_cond(taicpu(p).condition);
+           taicpu(hp2).loadreg(0,newreg(R_INTREGISTER,getsupreg(taicpu(hp4).oper[1]^.reg),R_SUBL));
+
+           if taicpu(hp1).opsize=S_B then
+             begin
+               RemoveInstruction(hp1);
+               RemoveInstruction(hp4);
+               hp1 := hp2; { For "RemoveCurrentP" below }
+             end
+           else
+             begin
+               if IsMOVZXAcceptable then
+                 begin
+                   taicpu(hp4).opcode:=A_MOVZX;
+
+                   case taicpu(hp1).opsize of
+                     S_W:
+                       taicpu(hp4).opsize:=S_BW;
+                     S_L, S_Q:
+                       taicpu(hp4).opsize:=S_BL;
+                     else
+                       InternalError(2021041601);
+                   end;
+
+                   RemoveInstruction(hp1);
+                   taicpu(hp4).loadreg(0,newreg(R_INTREGISTER,getsupreg(taicpu(hp4).oper[1]^.reg),R_SUBL));
+                   { Second operand is already the full register }
+
+                   hp1 := hp2; { For "RemoveCurrentP" below }
+                 end
+               else
+                 begin
+                   taicpu(hp1).loadconst(0,0);
+                   RemoveInstruction(hp4);
+                 end;
+             end;
+
+           RemoveCurrentP(p, hp1);
+           Result:=true;
+           exit;
+         end
+     end;
+
+
    function TX86AsmOptimizer.OptPass2MOV(var p : tai) : boolean;
 
      function IsXCHGAcceptable: Boolean; inline;
@@ -6085,76 +6201,6 @@
                     end;
 {$ifndef i8086}
                 end
-              {
-                  convert
-                  j<c>  .L1
-                  mov   1,reg
-                  jmp   .L2
-                .L1
-                  mov   0,reg
-                .L2
-
-                into
-                  mov   0,reg
-                  set<not(c)> reg
-
-                take care of alignment and that the mov 0,reg is not converted into a xor as this
-                would destroy the flag contents
-              }
-              else if MatchInstruction(hp1,A_MOV,[]) and
-                MatchOpType(taicpu(hp1),top_const,top_reg) and
-{$ifdef i386}
-                (
-                { Under i386, ESI, EDI, EBP and ESP
-                  don't have an 8-bit representation }
-                  not (getsupreg(taicpu(hp1).oper[1]^.reg) in [RS_ESI, RS_EDI, RS_EBP, RS_ESP])
-                ) and
-{$endif i386}
-                (taicpu(hp1).oper[0]^.val=1) and
-                GetNextInstruction(hp1,hp2) and
-                MatchInstruction(hp2,A_JMP,[]) and (taicpu(hp2).oper[0]^.ref^.refaddr=addr_full) and
-                GetNextInstruction(hp2,hp3) and
-                { skip align }
-                ((hp3.typ<>ait_align) or GetNextInstruction(hp3,hp3)) and
-                (hp3.typ=ait_label) and
-                (tasmlabel(taicpu(p).oper[0]^.ref^.symbol)=tai_label(hp3).labsym) and
-                (tai_label(hp3).labsym.getrefs=1) and
-                GetNextInstruction(hp3,hp4) and
-                MatchInstruction(hp4,A_MOV,[]) and
-                MatchOpType(taicpu(hp4),top_const,top_reg) and
-                (taicpu(hp4).oper[0]^.val=0) and
-                MatchOperand(taicpu(hp1).oper[1]^,taicpu(hp4).oper[1]^) and
-                GetNextInstruction(hp4,hp5) and
-                (hp5.typ=ait_label) and
-                (tasmlabel(taicpu(hp2).oper[0]^.ref^.symbol)=tai_label(hp5).labsym) and
-                (tai_label(hp5).labsym.getrefs=1) then
-                begin
-                  AllocRegBetween(NR_FLAGS,p,hp4,UsedRegs);
-                  DebugMsg(SPeepholeOptimization+'JccMovJmpMov2MovSetcc',p);
-                  { remove last label }
-                  RemoveInstruction(hp5);
-                  { remove second label }
-                  RemoveInstruction(hp3);
-                  { if align is present remove it }
-                  if GetNextInstruction(hp2,hp3) and (hp3.typ=ait_align) then
-                    RemoveInstruction(hp3);
-                  { remove jmp }
-                  RemoveInstruction(hp2);
-                  if taicpu(hp1).opsize=S_B then
-                    RemoveInstruction(hp1)
-                  else
-                    taicpu(hp1).loadconst(0,0);
-                  taicpu(hp4).opcode:=A_SETcc;
-                  taicpu(hp4).opsize:=S_B;
-                  taicpu(hp4).condition:=inverse_cond(taicpu(p).condition);
-                  taicpu(hp4).loadreg(0,newreg(R_INTREGISTER,getsupreg(taicpu(hp4).oper[1]^.reg),R_SUBL));
-                  taicpu(hp4).opercnt:=1;
-                  taicpu(hp4).ops:=1;
-                  taicpu(hp4).freeop(1);
-                  RemoveCurrentP(p);
-                  Result:=true;
-                  exit;
-                end
               else if CPUX86_HAS_CMOV in cpu_capabilities[current_settings.cputype] then
                 begin
                  { check for
Index: compiler/x86_64/aoptcpu.pas
===================================================================
--- compiler/x86_64/aoptcpu.pas	(revision 49207)
+++ compiler/x86_64/aoptcpu.pas	(working copy)
@@ -145,6 +145,8 @@
                 A_XORPD,
                 A_PXOR:
                   Result:=OptPass1PXor(p);
+                A_Jcc:
+                  Result:=OptPass1Jcc(p);
                 else
                   ;
               end;

J. Gareth Moreton

2021-04-16 07:23

developer   ~0130400

Improvements are seen both under -O2 and -O4, with no instances yet observed where the peephole optimizer performed worse under -O2 (due to the number of iterations of pass 1 being limited to two).

J. Gareth Moreton

2021-04-16 11:34

developer   ~0130402

(There is room to improve this further down the line, but one step at a time)

Florian

2021-04-16 17:43

administrator   ~0130409

Do you really think that using movzx is better? To be honest, I did not benchmark it, but I'd expect that using movzx increases latency, because the movzx depends on the set instruction, while the mov 0,... can be executed while the condition is evaluated:

test
mov
set

2 cycles

test
set
movzx

at least 3 cycles as all instructions depend on the one before.
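
The cycle counts above can be reproduced with a toy dependency model. This is only a sketch under loud assumptions: every instruction is given a latency of one cycle, independent instructions are assumed to issue together, and the record type and helper names are hypothetical. Real processors differ, particularly in how they handle partial register writes.

```pascal
program LatencySketch;
{$mode objfpc}{$H+}

type
  { One instruction in the toy model. Dep1/Dep2 are indices of earlier
    instructions whose results are needed, or -1 for none. }
  TInstr = record
    Name: string;
    Dep1, Dep2: Integer;
  end;

function Ins(const AName: string; ADep1, ADep2: Integer): TInstr;
begin
  Result.Name := AName;
  Result.Dep1 := ADep1;
  Result.Dep2 := ADep2;
end;

{ Length of the longest dependency chain, assuming one cycle per
  instruction and unlimited parallel issue (both are assumptions). }
function CriticalPath(const Seq: array of TInstr): Integer;
var
  Finish: array of Integer;
  i, Start: Integer;
begin
  Result := 0;
  SetLength(Finish, Length(Seq));
  for i := 0 to High(Seq) do
    begin
      Start := 0;
      if (Seq[i].Dep1 >= 0) and (Finish[Seq[i].Dep1] > Start) then
        Start := Finish[Seq[i].Dep1];
      if (Seq[i].Dep2 >= 0) and (Finish[Seq[i].Dep2] > Start) then
        Start := Finish[Seq[i].Dep2];
      Finish[i] := Start + 1;
      if Finish[i] > Result then
        Result := Finish[i];
    end;
end;

function MovSetLatency: Integer;
var
  Seq: array[0..2] of TInstr;
begin
  Seq[0] := Ins('test', -1, -1);
  Seq[1] := Ins('mov', -1, -1);  { independent of test }
  Seq[2] := Ins('set', 0, 1);    { needs the flags and the zeroed register }
  Result := CriticalPath(Seq);
end;

function SetMovzxLatency: Integer;
var
  Seq: array[0..2] of TInstr;
begin
  Seq[0] := Ins('test', -1, -1);
  Seq[1] := Ins('set', 0, -1);    { needs the flags }
  Seq[2] := Ins('movzx', 1, -1);  { needs the result of set }
  Result := CriticalPath(Seq);
end;

begin
  WriteLn('test/mov/set:   ', MovSetLatency, ' cycles');
  WriteLn('test/set/movzx: ', SetMovzxLatency, ' cycles');
end.
```

Under these assumptions the model gives 2 cycles for test/mov/set and 3 cycles for test/set/movzx, matching the estimate above. It says nothing about code size, which is a separate consideration.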

Florian

2021-04-16 17:44

administrator   ~0130411

Actually, the best would be to figure out if the mov can be moved further up, so it can be a xor. But this might be hard to do in a fail-safe way.

J. Gareth Moreton

2021-04-16 18:05

developer   ~0130412

Last edited: 2021-04-16 18:07

That's a good point actually. I didn't think about that combination. I only saw the pipeline stall of MOV followed by SET, forgetting that TEST and MOV can execute together. MOVZX is definitely the smaller encoding though, so I would favour that one for -Os.

As long as the processor is Pentium II or above, MOVZX will execute in one clock cycle, and of course, it was only introduced on the 80386 processor, although the utility function "IsMOVZXAcceptable" handles all of this.

For moving MOV before TEST/CMP, I believe this is perfectly safe so long as the destination register doesn't appear in TEST/CMP. For example, the swap is fine for "test %edx,%edx; mov $0,%al", but not for "test %eax,%eax; mov $0,%al"; this is a very easy condition to check ("not SuperRegistersEqual").
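
A self-contained sketch of that check, for illustration. It does not use the compiler's own types or its SuperRegistersEqual routine: registers are modelled as AT&T-style strings, only four super-registers are covered, and all names here are hypothetical. Unknown operands are treated as conflicting so that the toy check stays conservative.

```pascal
program SwapCheckSketch;
{$mode objfpc}{$H+}

type
  TSuperReg = (srNone, srA, srB, srC, srD);

{ Map a register name to its super-register (toy model, four registers). }
function SuperRegOf(const Reg: string): TSuperReg;
begin
  if (Reg = '%al') or (Reg = '%ah') or (Reg = '%ax') or
     (Reg = '%eax') or (Reg = '%rax') then
    Result := srA
  else if (Reg = '%bl') or (Reg = '%bh') or (Reg = '%bx') or
          (Reg = '%ebx') or (Reg = '%rbx') then
    Result := srB
  else if (Reg = '%cl') or (Reg = '%ch') or (Reg = '%cx') or
          (Reg = '%ecx') or (Reg = '%rcx') then
    Result := srC
  else if (Reg = '%dl') or (Reg = '%dh') or (Reg = '%dx') or
          (Reg = '%edx') or (Reg = '%rdx') then
    Result := srD
  else
    Result := srNone;
end;

{ True if a TEST/CMP operand might involve the MOV destination. }
function OperandConflicts(const Op: string; Dest: TSuperReg): Boolean;
begin
  if (Length(Op) > 0) and (Op[1] = '$') then
    Result := False                    { immediates never conflict }
  else if SuperRegOf(Op) = srNone then
    Result := True                     { unknown operand: be conservative }
  else
    Result := SuperRegOf(Op) = Dest;
end;

{ May "mov $const,MovDest" be hoisted above "test/cmp Op1,Op2"? }
function CanHoistMov(const Op1, Op2, MovDest: string): Boolean;
var
  Dest: TSuperReg;
begin
  Dest := SuperRegOf(MovDest);
  Result := (Dest <> srNone) and
            not OperandConflicts(Op1, Dest) and
            not OperandConflicts(Op2, Dest);
end;

begin
  WriteLn(CanHoistMov('%edx', '%edx', '%al'));  { TRUE: no shared register }
  WriteLn(CanHoistMov('%eax', '%eax', '%al'));  { FALSE: %al is part of %eax }
end.
```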

I'm in the process of further improving this optimisation anyway, so watch this space.

I'm glad we are having good discussions over optimisations!

J. Gareth Moreton

2021-05-03 21:56

developer   ~0130744

So I got my laptop repaired at last, and I can finally submit this improved patch. An example of how it optimises code in classes.s - before:

    jnl .Lj3786
    movl $-1,%edx
    jmp .Lj3787
    .p2align 4,,10
    .p2align 3
.Lj3786:
    testq %rax,%rax
    jng .Lj3789
    movl $1,%edx
    jmp .Lj3787
    .p2align 4,,10
    .p2align 3
.Lj3789:
    xorl %edx,%edx
.Lj3787:
# Peephole Optimization: MovxMov2Movx
    movslq %edx,%rbx
    jmp .Lj3791
    .p2align 4,,10
    .p2align 3
.Lj3784:

After:

    jnl .Lj3786
    movl $-1,%edx
# Peephole Optimization: Duplicated 1 assignment(s) and redirected jump
    movslq %edx,%rbx (<-- I'll find a way to properly optimise this later, so it becomes "movq $-1,%rbx")
    jmp .Lj3791
    .p2align 4,,10
    .p2align 3
.Lj3786:
# Peephole Optimization: Swapped test and mov instructions to improve optimisation potential
    xorl %edx,%edx (<-- This was "movl $0,%edx", and moving it before the "test" instruction allows it to be changed to "xor")
    testq %rax,%rax
# Peephole Optimization: J(c)Mov1JmpMov0 -> Set(~c)
    setgb %dl
# Peephole Optimization: MovxMov2Movx
    movslq %edx,%rbx (<-- I'll find a way to properly optimise this later)
    jmp .Lj3791
    .p2align 4,,10
    .p2align 3
.Lj3784:
JccMovJmpMov2MovSetcc-pass1-improved.patch (54,174 bytes)   
Index: compiler/aoptutils.pas
===================================================================
--- compiler/aoptutils.pas	(revision 49330)
+++ compiler/aoptutils.pas	(working copy)
@@ -35,6 +35,10 @@
     function MatchOpType(const p : taicpu; type0,type1,type2 : toptype) : Boolean;
 {$endif max_operands>2}
 
+    { skips all alignment fields and returns the next label (or non-align).
+      returns immediately with true if hp is a label }
+    function SkipAligns(hp: tai; out hp2: tai): boolean;
+
     { skips all labels and returns the next "real" instruction }
     function SkipLabels(hp: tai; out hp2: tai): boolean;
 
@@ -67,6 +71,21 @@
 {$endif max_operands>2}
 
 
+    { skips all alignment fields and returns the next label (or non-align).
+      Returns immediately with True if hp is a label }
+    function SkipAligns(hp: tai; out hp2: tai): boolean;
+      begin
+        while assigned(hp) and
+              (hp.typ in SkipInstr + [ait_label,ait_align]) Do
+          begin
+            { Check that the label is actually live }
+            if (hp.typ = ait_label) and tai_label(hp).labsym.is_used then
+              Break;
+            hp := tai(hp.next);
+          end;
+        SkipAligns := SetAndTest(hp, hp2);
+      end;
+
     { skips all labels and returns the next "real" instruction }
     function SkipLabels(hp: tai; out hp2: tai): boolean;
       begin
Index: compiler/cgutils.pas
===================================================================
--- compiler/cgutils.pas	(revision 49330)
+++ compiler/cgutils.pas	(working copy)
@@ -197,7 +197,7 @@
     { This routine verifies if two references are the same, and
        if so, returns TRUE, otherwise returns false.
     }
-    function references_equal(const sref,dref : treference) : boolean;
+    function references_equal(const sref,dref : treference) : boolean; inline;
 
     { tlocation handling }
 
@@ -262,7 +262,7 @@
       end;
 
 
-    function references_equal(const sref,dref : treference):boolean;
+    function references_equal(const sref,dref : treference):boolean; inline;
       begin
         references_equal:=CompareByte(sref,dref,sizeof(treference))=0;
       end;
Index: compiler/i386/aoptcpu.pas
===================================================================
--- compiler/i386/aoptcpu.pas	(revision 49330)
+++ compiler/i386/aoptcpu.pas	(working copy)
@@ -187,6 +187,8 @@
                   Result:=OptPass1SHLSAL(p);
                 A_SUB:
                   Result:=OptPass1Sub(p);
+                A_TEST:
+                  Result:=OptPass1Test(p);
                 A_MOVAPD,
                 A_MOVAPS,
                 A_MOVUPD,
@@ -221,6 +223,8 @@
                   Result:=OptPass1MOVXX(p);
                 A_SETcc:
                   Result:=OptPass1SETcc(p);
+                A_Jcc:
+                  Result:=OptPass1Jcc(p);
                 else
                   ;
               end;
Index: compiler/x86/aoptx86.pas
===================================================================
--- compiler/x86/aoptx86.pas	(revision 49330)
+++ compiler/x86/aoptx86.pas	(working copy)
@@ -140,6 +140,8 @@
         function OptPass1PXor(var p : tai) : boolean;
         function OptPass1VPXor(var p: tai): boolean;
         function OptPass1Imul(var p : tai) : boolean;
+        function OptPass1Jcc(var p : tai) : boolean;
+        function OptPass1Test(var p : tai) : boolean;
 
         function OptPass2Movx(var p : tai): Boolean;
         function OptPass2MOV(var p : tai) : boolean;
@@ -150,6 +152,10 @@
         function OptPass2SUB(var p: tai): Boolean;
         function OptPass2ADD(var p : tai): Boolean;
 
+        function CheckMemoryWrite(var first_mov, second_mov: taicpu): Boolean;
+        function CheckJumpMovTransferOpt(var p: tai; hp1: tai; LoopCount: Integer; out Count: Integer): Boolean;
+        procedure SwapMovCmp(var p, hp1: tai);
+
         function PostPeepholeOptMov(var p : tai) : Boolean;
         function PostPeepholeOptMovzx(var p : tai) : Boolean;
 {$ifdef x86_64} { These post-peephole optimisations only affect 64-bit registers. [Kit] }
@@ -2324,9 +2330,31 @@
                   end;
               end;
           end;
+
         { Next instruction is also a MOV ? }
         if MatchInstruction(hp1,A_MOV,[taicpu(p).opsize]) then
           begin
+            {
+              Change:
+  	        movb    %regb,(ref)
+  	        movb    $0,1(ref)
+  	        movb    $0,2(ref)
+  	        movb    $0,3(ref)
+
+              To:
+                movzbl  %regb,%regl
+                movl    %regl,(ref)
+            }
+
+            if (taicpu(p).opsize = S_B) and
+              (taicpu(p).oper[1]^.typ = top_ref) and
+              (taicpu(hp1).oper[1]^.typ = top_ref) and
+              CheckMemoryWrite(taicpu(p), taicpu(hp1)) then
+              begin
+                Result := True;
+                Exit;
+              end;
+
             if (taicpu(p).oper[1]^.typ = top_reg) and
               MatchOperand(taicpu(p).oper[1]^,taicpu(hp1).oper[0]^) then
               begin
@@ -4035,9 +4063,129 @@
       end;
 
 
+    function TX86AsmOptimizer.CheckMemoryWrite(var first_mov, second_mov: taicpu): Boolean;
+      var
+        CurrentRef: TReference;
+        FullReg: TRegister;
+        hp1, hp2: tai;
+      begin
+        Result := False;
+        if (first_mov.opsize <> S_B) or (second_mov.opsize <> S_B) then
+          Exit;
+
+        { We assume you've checked if the operand is actually a reference by
+          this point. If it isn't, you'll most likely get an access violation }
+        CurrentRef := first_mov.oper[1]^.ref^;
+
+        { Memory must be aligned }
+        if (CurrentRef.offset mod 4) <> 0 then
+          Exit;
+
+        Inc(CurrentRef.offset);
+        CurrentRef.alignment := 1; { Otherwise references_equal will return False }
+
+        if MatchOperand(second_mov.oper[0]^, 0) and
+          references_equal(second_mov.oper[1]^.ref^, CurrentRef) and
+          GetNextInstruction(second_mov, hp1) and
+          (hp1.typ = ait_instruction) and
+          (taicpu(hp1).opcode = A_MOV) and
+          MatchOpType(taicpu(hp1), top_const, top_ref) and
+          (taicpu(hp1).oper[0]^.val = 0) then
+          begin
+            Inc(CurrentRef.offset);
+            CurrentRef.alignment := taicpu(hp1).oper[1]^.ref^.alignment; { Otherwise references_equal might return False }
+
+            FullReg := newreg(R_INTREGISTER,getsupreg(first_mov.oper[0]^.reg), R_SUBD);
+
+            if references_equal(taicpu(hp1).oper[1]^.ref^, CurrentRef) then
+              begin
+                case taicpu(hp1).opsize of
+                  S_B:
+                    if GetNextInstruction(hp1, hp2) and
+                      MatchInstruction(taicpu(hp2), A_MOV, [S_B]) and
+                      MatchOpType(taicpu(hp2), top_const, top_ref) and
+                      (taicpu(hp2).oper[0]^.val = 0) then
+                      begin
+                        Inc(CurrentRef.offset);
+                        CurrentRef.alignment := 1; { Otherwise references_equal will return False }
+
+                        if references_equal(taicpu(hp2).oper[1]^.ref^, CurrentRef) and
+                          (taicpu(hp2).opsize = S_B) then
+                          begin
+                            RemoveInstruction(hp1);
+                            RemoveInstruction(hp2);
+
+                            first_mov.opsize := S_L;
+
+                            if first_mov.oper[0]^.typ = top_reg then
+                              begin
+                                DebugMsg(SPeepholeOptimization + 'MOVb/MOVb/MOVb/MOVb -> MOVZX/MOVl', first_mov);
+
+                                { Reuse second_mov as a MOVZX instruction }
+                                second_mov.opcode := A_MOVZX;
+                                second_mov.opsize := S_BL;
+                                second_mov.loadreg(0, first_mov.oper[0]^.reg);
+                                second_mov.loadreg(1, FullReg);
+
+                                first_mov.oper[0]^.reg := FullReg;
+
+                                asml.Remove(second_mov);
+                                asml.InsertBefore(second_mov, first_mov);
+                              end
+                            else
+                              { It's a value }
+                              begin
+                                DebugMsg(SPeepholeOptimization + 'MOVb/MOVb/MOVb/MOVb -> MOVl', first_mov);
+                                RemoveInstruction(second_mov);
+                              end;
+
+                            Result := True;
+                            Exit;
+                          end;
+                      end;
+                  S_W:
+                    begin
+                      RemoveInstruction(hp1);
+
+                      first_mov.opsize := S_L;
+
+                      if first_mov.oper[0]^.typ = top_reg then
+                        begin
+                          DebugMsg(SPeepholeOptimization + 'MOVb/MOVb/MOVw -> MOVZX/MOVl', first_mov);
+
+                          { Reuse second_mov as a MOVZX instruction }
+                          second_mov.opcode := A_MOVZX;
+                          second_mov.opsize := S_BL;
+                          second_mov.loadreg(0, first_mov.oper[0]^.reg);
+                          second_mov.loadreg(1, FullReg);
+
+                          first_mov.oper[0]^.reg := FullReg;
+
+                          asml.Remove(second_mov);
+                          asml.InsertBefore(second_mov, first_mov);
+                        end
+                      else
+                        { It's a value }
+                        begin
+                          DebugMsg(SPeepholeOptimization + 'MOVb/MOVb/MOVw -> MOVl', first_mov);
+                          RemoveInstruction(second_mov);
+                        end;
+
+                      Result := True;
+                      Exit;
+                    end;
+                  else
+                    ;
+                end;
+              end;
+          end;
+      end;
+
+
     function TX86AsmOptimizer.OptPass1SETcc(var p: tai): boolean;
       var
         hp1,hp2,next: tai; SetC, JumpC: TAsmCond; Unconditional: Boolean;
+        OperPtr: POper;
       begin
         Result:=false;
 
@@ -4113,19 +4261,75 @@
                 DebugMsg(SPeepholeOptimization + 'SETcc/TESTCmp/Jcc -> Jcc',p);
               end
             else if MatchInstruction(hp1, A_MOV, [S_B]) and
-              MatchOpType(taicpu(hp1),top_reg,top_reg) and
-              MatchOperand(taicpu(p).oper[0]^,taicpu(hp1).oper[0]^) then
+              { Writing to memory is allowed }
+              MatchOperand(taicpu(p).oper[0]^, taicpu(hp1).oper[0]^.reg) then
               begin
-                TransferUsedRegs(TmpUsedRegs);
-                UpdateUsedRegs(TmpUsedRegs, tai(p.Next));
-                if not RegUsedAfterInstruction(taicpu(p).oper[0]^.reg, hp1, TmpUsedRegs) then
+                {
+                  Watch out for sequences such as:
+
+                  set(c)b %regb
+                  movb    %regb,(ref)
+                  movb    $0,1(ref)
+                  movb    $0,2(ref)
+                  movb    $0,3(ref)
+
+                  Much more efficient to turn it into:
+                    movl    $0,%regl
+                    set(c)b %regb
+                    movl    %regl,(ref)
+
+                  Or:
+                    set(c)b %regb
+                    movzbl  %regb,%regl
+                    movl    %regl,(ref)
+                }
+                if (taicpu(hp1).oper[1]^.typ = top_ref) and
+                  GetNextInstruction(hp1, hp2) and
+                  MatchInstruction(hp2, A_MOV, [S_B]) and
+                  (taicpu(hp2).oper[1]^.typ = top_ref) and
+                  CheckMemoryWrite(taicpu(hp1), taicpu(hp2)) then
                   begin
-                    AllocRegBetween(taicpu(p).oper[0]^.reg,p,hp1,UsedRegs);
-                    taicpu(p).oper[0]^.reg:=taicpu(hp1).oper[1]^.reg;
-                    RemoveInstruction(hp1);
-                    DebugMsg(SPeepholeOptimization + 'SETcc/Mov -> SETcc',p);
-                    Result := true;
+                    { Don't do anything else except set Result to True }
+                  end
+                else
+                  begin
+                    TransferUsedRegs(TmpUsedRegs);
+                    UpdateUsedRegs(TmpUsedRegs, tai(p.Next));
+                    if RegUsedAfterInstruction(taicpu(p).oper[0]^.reg, hp1, TmpUsedRegs) then
+                      begin
+                        { Even if the register is still in use, we can minimise the
+                          pipeline stall by changing the MOV into another SETcc. }
+                        taicpu(hp1).opcode := A_SETcc;
+                        taicpu(hp1).condition := taicpu(p).condition;
+                        if taicpu(hp1).oper[1]^.typ = top_ref then
+                          begin
+                            { Swapping the operand pointers like this is probably a
+                              bit naughty, but it is far faster than using loadoper
+                              to transfer the reference from oper[1] to oper[0] if
+                              you take into account the extra procedure calls and
+                              the memory allocation and deallocation required }
+                            OperPtr := taicpu(hp1).oper[1];
+                            taicpu(hp1).oper[1] := taicpu(hp1).oper[0];
+                            taicpu(hp1).oper[0] := OperPtr;
+                          end
+                        else
+                          taicpu(hp1).oper[0]^.reg := taicpu(hp1).oper[1]^.reg;
+
+                        taicpu(hp1).clearop(1);
+                        taicpu(hp1).ops := 1;
+                        DebugMsg(SPeepholeOptimization + 'SETcc/Mov -> SETcc/SETcc',p);
+                      end
+                    else
+                      begin
+                        if taicpu(hp1).oper[1]^.typ = top_reg then
+                          AllocRegBetween(taicpu(hp1).oper[1]^.reg,p,hp1,UsedRegs);
+
+                        taicpu(p).loadoper(0, taicpu(hp1).oper[1]^);
+                        RemoveInstruction(hp1);
+                        DebugMsg(SPeepholeOptimization + 'SETcc/Mov -> SETcc',p);
+                      end
                   end;
+                Result := True;
               end;
           end;
       end;
@@ -4299,7 +4503,6 @@
          hp1, hp2: tai;
        begin
          Result:=false;
-
          if taicpu(p).oper[0]^.typ = top_const then
            begin
              { Though GetNextInstruction can be factored out, it is an expensive
@@ -4476,6 +4679,29 @@
                    end;
                end;
            end;
+
+         if (taicpu(p).oper[1]^.typ = top_reg) and
+           GetNextInstruction(p, hp1) and
+           MatchInstruction(hp1,A_MOV,[]) and
+           not RegInInstruction(taicpu(p).oper[1]^.reg, hp1) and
+           (
+             (taicpu(p).oper[0]^.typ <> top_reg) or
+             not RegInInstruction(taicpu(p).oper[0]^.reg, hp1)
+           ) then
+           begin
+             { If we have something like:
+                 cmp ###,%reg1
+                 mov 0,%reg2
+
+               And no registers are shared, move the MOV command to before the
+               comparison as this means it can be optimised without worrying
+               about the FLAGS register. (This combination is generated by
+               "J(c)Mov1JmpMov0 -> Set(~c)", among other things).
+             }
+             SwapMovCmp(p, hp1);
+             Result := True;
+             Exit;
+           end;
      end;
 
 
@@ -4597,9 +4823,459 @@
      end;
 
 
+   function TX86AsmOptimizer.OptPass1Jcc(var p : tai) : boolean;
+     var
+       hp1, hp2, hp3, hp4, hp5: tai;
+       ThisReg: TRegister;
+     begin
+       Result := False;
+       if not GetNextInstruction(p,hp1) or (hp1.typ <> ait_instruction) then
+         Exit;
 
-   function TX86AsmOptimizer.OptPass2MOV(var p : tai) : boolean;
+       {
+           convert
+           j<c>  .L1
+           mov   1,reg
+           jmp   .L2
+         .L1
+           mov   0,reg
+         .L2
 
+         into
+           mov   0,reg
+           set<not(c)> reg
+
+         take care of alignment and that the mov 0,reg is not converted into a xor as this
+         would destroy the flag contents
+
+         Use MOVZX if size is preferred, since while mov 0,reg is bigger, it can be
+         executed at the same time as a previous comparison.
+           set<not(c)> reg
+           movzx       reg, reg
+       }
+
+       if MatchInstruction(hp1,A_MOV,[]) and
+         (taicpu(hp1).oper[0]^.typ = top_const) and
+         (
+           (
+             (taicpu(hp1).oper[1]^.typ = top_reg)
+{$ifdef i386}
+             { Under i386, ESI, EDI, EBP and ESP
+               don't have an 8-bit representation }
+              and not (getsupreg(taicpu(hp1).oper[1]^.reg) in [RS_ESI, RS_EDI, RS_EBP, RS_ESP])
+
+{$endif i386}
+           ) or (
+{$ifdef i386}
+             (taicpu(hp1).oper[1]^.typ <> top_reg) and
+{$endif i386}
+             (taicpu(hp1).opsize = S_B)
+           )
+         ) and
+         GetNextInstruction(hp1,hp2) and
+         MatchInstruction(hp2,A_JMP,[]) and (taicpu(hp2).oper[0]^.ref^.refaddr=addr_full) and
+         GetNextInstruction(hp2,hp3) and
+         SkipAligns(hp3, hp3) and
+         (hp3.typ=ait_label) and
+         (tasmlabel(taicpu(p).oper[0]^.ref^.symbol)=tai_label(hp3).labsym) and
+         GetNextInstruction(hp3,hp4) and
+         MatchInstruction(hp4,A_MOV,[taicpu(hp1).opsize]) and
+         (taicpu(hp4).oper[0]^.typ = top_const) and
+         (
+           ((taicpu(hp1).oper[0]^.val = 0) and (taicpu(hp4).oper[0]^.val = 1)) or
+           ((taicpu(hp1).oper[0]^.val = 1) and (taicpu(hp4).oper[0]^.val = 0))
+         ) and
+         MatchOperand(taicpu(hp1).oper[1]^,taicpu(hp4).oper[1]^) and
+         GetNextInstruction(hp4,hp5) and
+         SkipAligns(hp5, hp5) and
+         (hp5.typ=ait_label) and
+         (tasmlabel(taicpu(hp2).oper[0]^.ref^.symbol)=tai_label(hp5).labsym) then
+         begin
+           if (taicpu(hp1).oper[0]^.val = 1) and (taicpu(hp4).oper[0]^.val = 0) then
+             taicpu(p).condition := inverse_cond(taicpu(p).condition);
+
+           tai_label(hp3).labsym.DecRefs;
+
+           { If this isn't the only reference to the middle label, we can
+             still make a saving - only that the first jump and everything
+             that follows will remain. }
+           if (tai_label(hp3).labsym.getrefs = 0) then
+             begin
+               if (taicpu(hp1).oper[0]^.val = 1) and (taicpu(hp4).oper[0]^.val = 0) then
+                 DebugMsg(SPeepholeOptimization + 'J(c)Mov1JmpMov0 -> Set(~c)',p)
+               else
+                 DebugMsg(SPeepholeOptimization + 'J(c)Mov0JmpMov1 -> Set(c)',p);
+
+               { remove jump, first label and second MOV (also catching any aligns) }
+               repeat
+                 if not GetNextInstruction(hp2, hp3) then
+                   InternalError(2021040810);
+
+                 RemoveInstruction(hp2);
+
+                 hp2 := hp3;
+               until hp2 = hp5;
+
+               { Don't decrement reference count before the removal loop
+                 above, otherwise GetNextInstruction won't stop on
+                 the label }
+               tai_label(hp5).labsym.DecRefs;
+             end
+           else
+             begin
+               if (taicpu(hp1).oper[0]^.val = 1) and (taicpu(hp4).oper[0]^.val = 0) then
+                 DebugMsg(SPeepholeOptimization + 'J(c)Mov1JmpMov0 -> Set(~c) (partial)',p)
+               else
+                 DebugMsg(SPeepholeOptimization + 'J(c)Mov0JmpMov1 -> Set(c) (partial)',p);
+             end;
+
+           taicpu(p).opcode:=A_SETcc;
+           taicpu(p).opsize:=S_B;
+           taicpu(p).is_jmp:=False;
+
+           if taicpu(hp1).opsize=S_B then
+             begin
+               taicpu(p).loadoper(0, taicpu(hp1).oper[1]^);
+               RemoveInstruction(hp1);
+             end
+           else
+             begin
+               { Will be a register because the size can't be S_B otherwise }
+               ThisReg := newreg(R_INTREGISTER,getsupreg(taicpu(hp1).oper[1]^.reg), R_SUBL);
+               taicpu(p).loadreg(0, ThisReg);
+
+               if (cs_opt_size in current_settings.optimizerswitches) and IsMOVZXAcceptable then
+                 begin
+                   case taicpu(hp1).opsize of
+                     S_W:
+                       taicpu(hp1).opsize := S_BW;
+                     S_L:
+                       taicpu(hp1).opsize := S_BL;
+{$ifdef x86_64}
+                     S_Q:
+                       begin
+                         taicpu(hp1).opsize := S_BL;
+                         { Change the destination register to 32-bit }
+                         taicpu(hp1).loadreg(1, newreg(R_INTREGISTER,getsupreg(ThisReg), R_SUBD));
+                       end;
+{$endif x86_64}
+                     else
+                       InternalError(2021040820);
+                   end;
+
+                   taicpu(hp1).opcode := A_MOVZX;
+                   taicpu(hp1).loadreg(0, ThisReg);
+                 end
+               else
+                 begin
+                   AllocRegBetween(NR_FLAGS,p,hp1,UsedRegs);
+
+                   { hp1 is already a MOV instruction with the correct register }
+                   taicpu(hp1).loadconst(0, 0);
+
+                   { Inserting it right before p will guarantee that the flags are also tracked }
+                   asml.Remove(hp1);
+                   asml.InsertBefore(hp1, p);
+                 end;
+             end;
+
+           Result:=true;
+           exit;
+         end
+     end;
+
+
+  function TX86AsmOptimizer.OptPass1Test(var p : tai) : boolean;
+    var
+      hp1: tai;
+    begin
+      Result := False;
+      if (taicpu(p).oper[1]^.typ = top_reg) and
+        GetNextInstruction(p, hp1) and
+        MatchInstruction(hp1,A_MOV,[]) and
+        not RegInInstruction(taicpu(p).oper[1]^.reg, hp1) and
+        (
+          (taicpu(p).oper[0]^.typ <> top_reg) or
+          not RegInInstruction(taicpu(p).oper[0]^.reg, hp1)
+        ) then
+        begin
+          { If we have something like:
+              test %reg1,%reg1
+              mov  0,%reg2
+
+            And no registers are shared (the two %reg1's can be different, as
+            long as neither of them is also %reg2), move the MOV command to
+            before the comparison as this means it can be optimised without
+            worrying about the FLAGS register. (This combination is generated
+            by "J(c)Mov1JmpMov0 -> Set(~c)", among other things).
+          }
+          SwapMovCmp(p, hp1);
+          Result := True;
+        end;
+    end;
+
+
+  function TX86AsmOptimizer.CheckJumpMovTransferOpt(var p: tai; hp1: tai; LoopCount: Integer; out Count: Integer): Boolean;
+    var
+      hp2, hp3, first_assignment: tai;
+      IncCount, OperIdx: Integer;
+      OrigLabel: TAsmLabel;
+    begin
+      Count := 0;
+      Result := False;
+      first_assignment := nil;
+      if (LoopCount >= 20) then
+        begin
+          { Guard against infinite loops }
+          Exit;
+        end;
+      if (taicpu(p).oper[0]^.typ <> top_ref) or
+        (taicpu(p).oper[0]^.ref^.refaddr <> addr_full) or
+        (taicpu(p).oper[0]^.ref^.base <> NR_NO) or
+        (taicpu(p).oper[0]^.ref^.index <> NR_NO) or
+        not (taicpu(p).oper[0]^.ref^.symbol is TAsmLabel) then
+        Exit;
+
+      OrigLabel := TAsmLabel(taicpu(p).oper[0]^.ref^.symbol);
+
+      {
+        change
+               jmp .L1
+               ...
+           .L1:
+               mov ##, ## ( multiple movs possible )
+               jmp/ret
+        into
+               mov ##, ##
+               jmp/ret
+      }
+
+      if not Assigned(hp1) then
+        begin
+          hp1 := GetLabelWithSym(OrigLabel);
+          if not Assigned(hp1) or not SkipLabels(hp1, hp1) then
+            Exit;
+
+        end;
+
+      hp2 := hp1;
+
+      while Assigned(hp2) do
+        begin
+          if Assigned(hp2) and (hp2.typ in [ait_label, ait_align]) then
+            SkipLabels(hp2,hp2);
+
+          if not Assigned(hp2) or (hp2.typ <> ait_instruction) then
+            Break;
+
+          case taicpu(hp2).opcode of
+            A_MOVSS:
+              begin
+                if taicpu(hp2).ops = 0 then
+                  { Wrong MOVSS }
+                  Break;
+                Inc(Count);
+                if Count >= 5 then
+                  { Too many to be worthwhile }
+                  Break;
+                GetNextInstruction(hp2, hp2);
+                Continue;
+              end;
+            A_MOV,
+            A_MOVD,
+            A_MOVQ,
+            A_MOVSX,
+{$ifdef x86_64}
+            A_MOVSXD,
+{$endif x86_64}
+            A_MOVZX,
+            A_MOVAPS,
+            A_MOVUPS,
+            A_MOVSD,
+            A_MOVAPD,
+            A_MOVUPD,
+            A_MOVDQA,
+            A_MOVDQU,
+            A_VMOVSS,
+            A_VMOVAPS,
+            A_VMOVUPS,
+            A_VMOVSD,
+            A_VMOVAPD,
+            A_VMOVUPD,
+            A_VMOVDQA,
+            A_VMOVDQU:
+              begin
+                Inc(Count);
+                if Count >= 5 then
+                  { Too many to be worthwhile }
+                  Break;
+                GetNextInstruction(hp2, hp2);
+                Continue;
+              end;
+            A_JMP:
+              begin
+                { Guard against infinite loops }
+                if taicpu(hp2).oper[0]^.ref^.symbol = OrigLabel then
+                  Exit;
+
+                { Analyse this jump first in case it also duplicates assignments }
+                if CheckJumpMovTransferOpt(hp2, nil, LoopCount + 1, IncCount) then
+                  begin
+                    { Something did change! }
+                    Result := True;
+
+                    Inc(Count, IncCount);
+                    if Count >= 5 then
+                      begin
+                        { Too many to be worthwhile }
+                        Exit;
+                      end;
+
+                    if MatchInstruction(hp2, [A_JMP, A_RET], []) then
+                      Break;
+                  end;
+
+                Result := True;
+                Break;
+              end;
+            A_RET:
+              begin
+                Result := True;
+                Break;
+              end;
+            else
+              Break;
+          end;
+        end;
+
+      if Result then
+        begin
+          { A count of zero can happen when CheckJumpMovTransferOpt is called recursively }
+          if Count = 0 then
+            begin
+              Result := False;
+              Exit;
+            end;
+
+          hp3 := p;
+          DebugMsg(SPeepholeOptimization + 'Duplicated ' + debug_tostr(Count) + ' assignment(s) and redirected jump', p);
+          while True do
+            begin
+              if Assigned(hp1) and (hp1.typ in [ait_label, ait_align]) then
+                SkipLabels(hp1,hp1);
+
+              if (hp1.typ <> ait_instruction) then
+                InternalError(2021040720);
+
+              case taicpu(hp1).opcode of
+                A_JMP:
+                  begin
+                    { Change the original jump to the new destination }
+                    OrigLabel.decrefs;
+                    taicpu(hp1).oper[0]^.ref^.symbol.increfs;
+                    taicpu(p).loadref(0, taicpu(hp1).oper[0]^.ref^);
+
+                    { Set p to the first duplicated assignment so it can get optimised if needs be }
+                    if not Assigned(first_assignment) then
+                      InternalError(2021040810)
+                    else
+                      p := first_assignment;
+
+                    Exit;
+                  end;
+                A_RET:
+                  begin
+                    { Now change the jump into a RET instruction }
+                    ConvertJumpToRET(p, hp1);
+
+                    { Set p to the first duplicated assignment so it can get optimised if needs be }
+                    if not Assigned(first_assignment) then
+                      InternalError(2021040811)
+                    else
+                      p := first_assignment;
+
+                    Exit;
+                  end;
+                else
+                  begin
+                    { Duplicate the MOV instruction }
+                    hp3:=tai(hp1.getcopy);
+                    if first_assignment = nil then
+                      first_assignment := hp3;
+
+                    asml.InsertBefore(hp3, p);
+
+                    { Make sure the compiler knows about any final registers written here }
+                    for OperIdx := 0 to taicpu(hp3).ops - 1 do
+                      with taicpu(hp3).oper[OperIdx]^ do
+                        begin
+                          case typ of
+                            top_ref:
+                              begin
+                                if (ref^.base <> NR_NO) and
+                                  (getsupreg(ref^.base) <> RS_ESP) and
+                                  (getsupreg(ref^.base) <> RS_EBP)
+                                  {$ifdef x86_64} and (ref^.base <> NR_RIP) {$endif x86_64}
+                                  then
+                                  AllocRegBetween(ref^.base, hp3, tai(p.Next), UsedRegs);
+                                if (ref^.index <> NR_NO) and
+                                  (getsupreg(ref^.index) <> RS_ESP) and
+                                  (getsupreg(ref^.index) <> RS_EBP)
+                                  {$ifdef x86_64} and (ref^.index <> NR_RIP) {$endif x86_64} and
+                                  (ref^.index <> ref^.base) then
+                                  AllocRegBetween(ref^.index, hp3, tai(p.Next), UsedRegs);
+                              end;
+                            top_reg:
+                              AllocRegBetween(reg, hp3, tai(p.Next), UsedRegs);
+                            else
+                              ;
+                          end;
+                        end;
+                  end;
+              end;
+
+              if not GetNextInstruction(hp1, hp1) then
+                { Should have dropped out earlier }
+                InternalError(2021040710);
+            end;
+        end;
+    end;
+
+
+  procedure TX86AsmOptimizer.SwapMovCmp(var p, hp1: tai);
+    var
+      hp2: tai;
+      X: Integer;
+    begin
+      asml.Remove(hp1);
+
+      { Try to insert after the last instruction where the FLAGS register is not yet in use }
+      if not GetLastInstruction(p, hp2) then
+        asml.InsertBefore(hp1, p)
+      else
+        asml.InsertAfter(hp1, hp2);
+
+      DebugMsg(SPeepholeOptimization + 'Swapped ' + debug_op2str(taicpu(p).opcode) + ' and mov instructions to improve optimisation potential', hp1);
+
+      for X := 0 to 1 do
+        case taicpu(hp1).oper[X]^.typ of
+          top_reg:
+            AllocRegBetween(taicpu(hp1).oper[X]^.reg, hp1, p, UsedRegs);
+          top_ref:
+            begin
+              if taicpu(hp1).oper[X]^.ref^.base <> NR_NO then
+                AllocRegBetween(taicpu(hp1).oper[X]^.ref^.base, hp1, p, UsedRegs);
+              if taicpu(hp1).oper[X]^.ref^.index <> NR_NO then
+                AllocRegBetween(taicpu(hp1).oper[X]^.ref^.index, hp1, p, UsedRegs);
+            end;
+          else
+            ;
+        end;
+    end;
+
+
+  function TX86AsmOptimizer.OptPass2MOV(var p : tai) : boolean;
+
      function IsXCHGAcceptable: Boolean; inline;
        begin
          { Always accept if optimising for size }
@@ -4619,13 +5295,156 @@
 
       var
         NewRef: TReference;
-       hp1,hp2,hp3: tai;
+        hp1, hp2, hp3, hp4: Tai;
 {$ifndef x86_64}
-       hp4: tai;
-       OperIdx: Integer;
+        OperIdx: Integer;
 {$endif x86_64}
-      begin
+        NewInstr : Taicpu;
+        NewAligh : Tai_align;
+        DestLabel: TAsmLabel;
+     begin
         Result:=false;
+
+        { This optimisation adds an instruction, so only do it for speed }
+        if not (cs_opt_size in current_settings.optimizerswitches) and
+          MatchOpType(taicpu(p), top_const, top_reg) and
+          (taicpu(p).oper[0]^.val = 0) then
+          begin
+
+            { To avoid compiler warning }
+            DestLabel := nil;
+
+            if (p.typ <> ait_instruction) or (taicpu(p).oper[1]^.typ <> top_reg) then
+              InternalError(2021040750);
+
+            if not GetNextInstructionUsingReg(p, hp1, taicpu(p).oper[1]^.reg) then
+              Exit;
+
+            case hp1.typ of
+              ait_label:
+                begin
+                  { Change:
+                      mov $0,%reg                     mov $0,%reg
+                    @Lbl1:                          @Lbl1:
+                      test %reg,%reg / cmp $0,%reg    test %reg,%reg / cmp $0,%reg
+                      je   @Lbl2                      jne  @Lbl2
+
+                    To:                             To:
+                      mov $0,%reg                     mov $0,%reg
+                      jmp  @Lbl2                      jmp  @Lbl3
+                      (align)                         (align)
+                    @Lbl1:                          @Lbl1:
+                      test %reg,%reg / cmp $0,%reg    test %reg,%reg / cmp $0,%reg
+                      je   @Lbl2                      jne  @Lbl2
+                                                    @Lbl3:   <-- Only if label exists
+
+                    (Not if it's optimised for size)
+                  }
+                  if not GetNextInstruction(hp1, hp2) then
+                    Exit;
+
+                  if not (cs_opt_size in current_settings.optimizerswitches) and
+                    (hp2.typ = ait_instruction) and
+                    (
+                      { Register sizes must exactly match }
+                      (
+                        (taicpu(hp2).opcode = A_CMP) and
+                        MatchOperand(taicpu(hp2).oper[0]^, 0) and
+                        MatchOperand(taicpu(hp2).oper[1]^, taicpu(p).oper[1]^.reg)
+                      ) or (
+                        (taicpu(hp2).opcode = A_TEST) and
+                        MatchOperand(taicpu(hp2).oper[0]^, taicpu(p).oper[1]^.reg) and
+                        MatchOperand(taicpu(hp2).oper[1]^, taicpu(p).oper[1]^.reg)
+                      )
+                    ) and GetNextInstruction(hp2, hp3) and
+                    (hp3.typ = ait_instruction) and
+                    (taicpu(hp3).opcode = A_JCC) and
+                    (taicpu(hp3).oper[0]^.typ=top_ref) and (taicpu(hp3).oper[0]^.ref^.refaddr=addr_full) and (taicpu(hp3).oper[0]^.ref^.base=NR_NO) and
+                    (taicpu(hp3).oper[0]^.ref^.index=NR_NO) and (taicpu(hp3).oper[0]^.ref^.symbol is tasmlabel) then
+                    begin
+                      { Check condition of jump }
+
+                      { Always true? }
+                      if condition_in(C_E, taicpu(hp3).condition) then
+                        begin
+                          { Copy label symbol and obtain matching label entry for the
+                            conditional jump, as this will be our destination }
+                          DestLabel := tasmlabel(taicpu(hp3).oper[0]^.ref^.symbol);
+                          DebugMsg(SPeepholeOptimization + 'Mov0LblCmp0Je -> Mov0JmpLblCmp0Je', p);
+                          Result := True;
+                        end
+
+                      { Always false? }
+                      else if condition_in(C_NE, taicpu(hp3).condition) and GetNextInstruction(hp3, hp2) then
+                        begin
+                          { This is only worth it if there's a jump to take }
+
+                          case hp2.typ of
+                            ait_instruction:
+                              begin
+                                if taicpu(hp2).opcode = A_JMP then
+                                  begin
+                                    DestLabel := tasmlabel(taicpu(hp2).oper[0]^.ref^.symbol);
+                                    { An unconditional jump follows the conditional jump which will always be false,
+                                      so use this jump's destination for the new jump }
+                                    DebugMsg(SPeepholeOptimization + 'Mov0LblCmp0Jne -> Mov0JmpLblCmp0Jne (with JMP)', p);
+                                    Result := True;
+                                  end
+                                else if taicpu(hp2).opcode = A_JCC then
+                                  begin
+                                    DestLabel := tasmlabel(taicpu(hp2).oper[0]^.ref^.symbol);
+                                    if condition_in(C_E, taicpu(hp2).condition) then
+                                      begin
+                                        { A second conditional jump follows the conditional jump which will always be false,
+                                          while the second jump is always True, so use this jump's destination for the new jump }
+                                        DebugMsg(SPeepholeOptimization + 'Mov0LblCmp0Jne -> Mov0JmpLblCmp0Jne (with second Jcc)', p);
+                                        Result := True;
+                                      end;
+
+                                    { Don't risk it if the jump isn't always true (Result remains False) }
+                                  end;
+                              end;
+                            else
+                              { If anything else don't optimise };
+                          end;
+                        end;
+
+                      if Result then
+                        begin
+                          { Just so we have something to insert as a parameter }
+                          reference_reset(NewRef, 1, []);
+                          NewInstr := taicpu.op_ref(A_JMP, S_NO, NewRef);
+
+                          { Now actually load the correct parameter }
+                          NewInstr.loadsymbol(0, DestLabel, 0);
+
+                          { Get instruction before original label (may not be p under -O3) }
+                          if not GetLastInstruction(hp1, hp2) then
+                            { Shouldn't fail here }
+                            InternalError(2021040701);
+
+                          DestLabel.increfs;
+
+                          AsmL.InsertAfter(NewInstr, hp2);
+                          { Add new alignment field }
+      (*                    AsmL.InsertAfter(
+                            cai_align.create_max(
+                              current_settings.alignment.jumpalign,
+                              current_settings.alignment.jumpalignskipmax
+                            ),
+                            NewInstr
+                          ); *)
+                        end;
+
+                      Exit;
+                    end;
+                end;
+              else
+                ;
+            end;
+
+          end;
+
         if not GetNextInstruction(p, hp1) then
           Exit;
 
@@ -5755,14 +6574,28 @@
 
     function TX86AsmOptimizer.OptPass2Jmp(var p : tai) : boolean;
       var
-        hp1, hp2, hp3: tai;
-        OperIdx: Integer;
+        hp1: tai;
+        Count: Integer;
+        OrigLabel: TAsmLabel;
       begin
-        result:=false;
+        result := False;
+
+        { Sometimes, the optimisations below can permit this }
+        RemoveDeadCodeAfterJump(p);
+
         if (taicpu(p).oper[0]^.typ=top_ref) and (taicpu(p).oper[0]^.ref^.refaddr=addr_full) and (taicpu(p).oper[0]^.ref^.base=NR_NO) and
           (taicpu(p).oper[0]^.ref^.index=NR_NO) and (taicpu(p).oper[0]^.ref^.symbol is tasmlabel) then
           begin
-            hp1:=getlabelwithsym(tasmlabel(taicpu(p).oper[0]^.ref^.symbol));
+            OrigLabel := TAsmLabel(taicpu(p).oper[0]^.ref^.symbol);
+
+            { Also a side-effect of optimisations }
+            if CollapseZeroDistJump(p, OrigLabel) then
+              begin
+                Result := True;
+                Exit;
+              end;
+
+            hp1 := GetLabelWithSym(OrigLabel);
             if (taicpu(p).condition=C_None) and assigned(hp1) and SkipLabels(hp1,hp1) and (hp1.typ = ait_instruction) then
               begin
                 case taicpu(hp1).opcode of
@@ -5780,58 +6613,35 @@
                       ConvertJumpToRET(p, hp1);
                       result:=true;
                     end;
-                  A_MOV:
-                    {
-                      change
-                             jmp .L1
-                             ...
-                         .L1:
-                             mov ##, ##
-                             ret
-                      into
-                             mov ##, ##
-                             ret
-                    }
-                    { This optimisation tends to increase code size if the pass 1 MOV optimisations aren't
-                      re-run, so only do this particular optimisation if optimising for speed or when
-                      optimisations are very in-depth. [Kit] }
-                    if (current_settings.optimizerswitches * [cs_opt_level3, cs_opt_size]) <> [cs_opt_size] then
+                  { Check any kind of direct assignment instruction }
+                  A_MOV,
+                  A_MOVD,
+                  A_MOVQ,
+                  A_MOVSX,
+{$ifdef x86_64}
+                  A_MOVSXD,
+{$endif x86_64}
+                  A_MOVZX,
+                  A_MOVAPS,
+                  A_MOVUPS,
+                  A_MOVSD,
+                  A_MOVAPD,
+                  A_MOVUPD,
+                  A_MOVDQA,
+                  A_MOVDQU,
+                  A_VMOVSS,
+                  A_VMOVAPS,
+                  A_VMOVUPS,
+                  A_VMOVSD,
+                  A_VMOVAPD,
+                  A_VMOVUPD,
+                  A_VMOVDQA,
+                  A_VMOVDQU:
+                    if ((current_settings.optimizerswitches * [cs_opt_level3, cs_opt_size]) <> [cs_opt_size]) and
+                      CheckJumpMovTransferOpt(p, hp1, 0, Count) then
                       begin
-                        GetNextInstruction(hp1, hp2);
-                        if not Assigned(hp2) then
-                          Exit;
-
-                        if (hp2.typ in [ait_label, ait_align]) then
-                          SkipLabels(hp2,hp2);
-                        if Assigned(hp2) and MatchInstruction(hp2, A_RET, [S_NO]) then
-                          begin
-                            { Duplicate the MOV instruction }
-                            hp3:=tai(hp1.getcopy);
-                            asml.InsertBefore(hp3, p);
-
-                            { Make sure the compiler knows about any final registers written here }
-                            for OperIdx := 0 to 1 do
-                              with taicpu(hp3).oper[OperIdx]^ do
-                                begin
-                                  case typ of
-                                    top_ref:
-                                      begin
-                                        if (ref^.base <> NR_NO) {$ifdef x86_64} and (ref^.base <> NR_RIP) {$endif x86_64} then
-                                          AllocRegBetween(ref^.base, hp3, tai(p.Next), UsedRegs);
-                                        if (ref^.index <> NR_NO) {$ifdef x86_64} and (ref^.index <> NR_RIP) {$endif x86_64} then
-                                          AllocRegBetween(ref^.index, hp3, tai(p.Next), UsedRegs);
-                                      end;
-                                    top_reg:
-                                      AllocRegBetween(reg, hp3, tai(p.Next), UsedRegs);
-                                    else
-                                      ;
-                                  end;
-                                end;
-
-                            { Now change the jump into a RET instruction }
-                            ConvertJumpToRET(p, hp2);
-                            result:=true;
-                          end;
+                        Result := True;
+                        Exit;
                       end;
                   else
                     ;
@@ -5866,9 +6676,9 @@
 
     function TX86AsmOptimizer.OptPass2Jcc(var p : tai) : boolean;
       var
-        hp1,hp2: tai;
+        hp1,hp2,hp3,hp4,hp5: tai;
 {$ifndef i8086}
-        hp3,hp4,hpmov2, hp5: tai;
+        hpmov2: tai;
         l : Longint;
         condition : TAsmCond;
 {$endif i8086}
@@ -5881,15 +6691,7 @@
         if GetNextInstruction(p,hp1) and (hp1.typ=ait_instruction) then
           begin
             symbol := TAsmLabel(taicpu(p).oper[0]^.ref^.symbol);
-
-            if GetNextInstruction(hp1,hp2) and
-              (
-                (hp2.typ=ait_label) or
-                { trick to skip align }
-                ((hp2.typ=ait_align) and GetNextInstruction(hp2,hp2) and (hp2.typ=ait_label))
-              ) and
-              (Tasmlabel(symbol) = Tai_label(hp2).labsym) and
-              (
+            if (
                 (
                   ((Taicpu(hp1).opcode=A_ADD) or (Taicpu(hp1).opcode=A_SUB)) and
                   MatchOptype(Taicpu(hp1),top_const,top_reg) and
@@ -5896,7 +6698,11 @@
                   (Taicpu(hp1).oper[0]^.val=1)
                 ) or
                 ((Taicpu(hp1).opcode=A_INC) or (Taicpu(hp1).opcode=A_DEC))
-              ) then
+              ) and
+              GetNextInstruction(hp1,hp2) and
+              SkipAligns(hp2, hp2) and
+              (hp2.typ = ait_label) and
+              (Tasmlabel(symbol) = Tai_label(hp2).labsym) then
              { jb @@1                            cmc
                inc/dec operand           -->     adc/sbb operand,0
                @@1:
@@ -6085,76 +6891,6 @@
                     end;
 {$ifndef i8086}
                 end
-              {
-                  convert
-                  j<c>  .L1
-                  mov   1,reg
-                  jmp   .L2
-                .L1
-                  mov   0,reg
-                .L2
-
-                into
-                  mov   0,reg
-                  set<not(c)> reg
-
-                take care of alignment and that the mov 0,reg is not converted into a xor as this
-                would destroy the flag contents
-              }
-              else if MatchInstruction(hp1,A_MOV,[]) and
-                MatchOpType(taicpu(hp1),top_const,top_reg) and
-{$ifdef i386}
-                (
-                { Under i386, ESI, EDI, EBP and ESP
-                  don't have an 8-bit representation }
-                  not (getsupreg(taicpu(hp1).oper[1]^.reg) in [RS_ESI, RS_EDI, RS_EBP, RS_ESP])
-                ) and
-{$endif i386}
-                (taicpu(hp1).oper[0]^.val=1) and
-                GetNextInstruction(hp1,hp2) and
-                MatchInstruction(hp2,A_JMP,[]) and (taicpu(hp2).oper[0]^.ref^.refaddr=addr_full) and
-                GetNextInstruction(hp2,hp3) and
-                { skip align }
-                ((hp3.typ<>ait_align) or GetNextInstruction(hp3,hp3)) and
-                (hp3.typ=ait_label) and
-                (tasmlabel(taicpu(p).oper[0]^.ref^.symbol)=tai_label(hp3).labsym) and
-                (tai_label(hp3).labsym.getrefs=1) and
-                GetNextInstruction(hp3,hp4) and
-                MatchInstruction(hp4,A_MOV,[]) and
-                MatchOpType(taicpu(hp4),top_const,top_reg) and
-                (taicpu(hp4).oper[0]^.val=0) and
-                MatchOperand(taicpu(hp1).oper[1]^,taicpu(hp4).oper[1]^) and
-                GetNextInstruction(hp4,hp5) and
-                (hp5.typ=ait_label) and
-                (tasmlabel(taicpu(hp2).oper[0]^.ref^.symbol)=tai_label(hp5).labsym) and
-                (tai_label(hp5).labsym.getrefs=1) then
-                begin
-                  AllocRegBetween(NR_FLAGS,p,hp4,UsedRegs);
-                  DebugMsg(SPeepholeOptimization+'JccMovJmpMov2MovSetcc',p);
-                  { remove last label }
-                  RemoveInstruction(hp5);
-                  { remove second label }
-                  RemoveInstruction(hp3);
-                  { if align is present remove it }
-                  if GetNextInstruction(hp2,hp3) and (hp3.typ=ait_align) then
-                    RemoveInstruction(hp3);
-                  { remove jmp }
-                  RemoveInstruction(hp2);
-                  if taicpu(hp1).opsize=S_B then
-                    RemoveInstruction(hp1)
-                  else
-                    taicpu(hp1).loadconst(0,0);
-                  taicpu(hp4).opcode:=A_SETcc;
-                  taicpu(hp4).opsize:=S_B;
-                  taicpu(hp4).condition:=inverse_cond(taicpu(p).condition);
-                  taicpu(hp4).loadreg(0,newreg(R_INTREGISTER,getsupreg(taicpu(hp4).oper[1]^.reg),R_SUBL));
-                  taicpu(hp4).opercnt:=1;
-                  taicpu(hp4).ops:=1;
-                  taicpu(hp4).freeop(1);
-                  RemoveCurrentP(p);
-                  Result:=true;
-                  exit;
-                end
               else if CPUX86_HAS_CMOV in cpu_capabilities[current_settings.cputype] then
                 begin
                  { check for
@@ -7530,6 +8266,8 @@
       var
         hp1: tai;
       begin
+        Result := False;
+
         { Detect:
             andw   x,  %ax (0 <= x < $8000)
             ...
@@ -7538,7 +8276,6 @@
           Change movzwl %ax,%eax to cwtl (shorter encoding for movswl %ax,%eax)
         }
 
-        Result := False;
         if MatchOpType(taicpu(p), top_const, top_reg) and
           (taicpu(p).oper[1]^.reg = NR_AX) and { This is also enough to determine that opsize = S_W }
           ((taicpu(p).oper[0]^.val and $7FFF) = taicpu(p).oper[0]^.val) and
@@ -7557,7 +8294,6 @@
             p := tai(p.Next);
             Result := True;
           end;
-
       end;
 
 
@@ -7685,6 +8421,7 @@
                   begin
                     RemoveCurrentP(p, hp2);
                     Result:=true;
+                    Exit;
                   end;
               end;
             A_SHL, A_SAL, A_SHR, A_SAR:
@@ -7701,6 +8438,7 @@
                   begin
                     RemoveCurrentP(p, hp2);
                     Result:=true;
+                    Exit;
                   end;
               end;
             A_DEC, A_INC, A_NEG:
@@ -7729,16 +8467,26 @@
                     end;
                     RemoveCurrentP(p, hp2);
                     Result:=true;
+                    Exit;
                   end;
               end
           else
-            { change "test  $-1,%reg" into "test %reg,%reg" }
-            if IsTestConstX and (taicpu(p).oper[1]^.typ=top_reg) then
-              taicpu(p).loadoper(0,taicpu(p).oper[1]^);
-          end { case }
+            ;
+          end; { case }
+
         { change "test  $-1,%reg" into "test %reg,%reg" }
-        else if IsTestConstX and (taicpu(p).oper[1]^.typ=top_reg) then
+        if IsTestConstX and (taicpu(p).oper[1]^.typ=top_reg) then
           taicpu(p).loadoper(0,taicpu(p).oper[1]^);
+
+        { Change "or %reg,%reg" to "test %reg,%reg" as OR generates a false dependency }
+        if MatchInstruction(p, A_OR, []) and
+          { Can only match if they're both registers }
+          MatchOperand(taicpu(p).oper[0]^, taicpu(p).oper[1]^) then
+          begin
+            DebugMsg(SPeepholeOptimization + 'or %reg,%reg -> test %reg,%reg to remove false dependency (Or2Test)', p);
+            taicpu(p).opcode := A_TEST;
+            { No need to set Result to True, as we've done all the optimisations we can }
+          end;
       end;
 
 
Index: compiler/x86_64/aoptcpu.pas
===================================================================
--- compiler/x86_64/aoptcpu.pas	(revision 49330)
+++ compiler/x86_64/aoptcpu.pas	(working copy)
@@ -135,6 +135,8 @@
                   result:=OptPass1FLD(p);
                 A_CMP:
                   result:=OptPass1Cmp(p);
+                A_TEST:
+                  Result:=OptPass1Test(p);
                 A_VPXORD,
                 A_VPXORQ,
                 A_VXORPS,
@@ -145,6 +147,8 @@
                 A_XORPD,
                 A_PXOR:
                   Result:=OptPass1PXor(p);
+                A_Jcc:
+                  Result:=OptPass1Jcc(p);
                 else
                   ;
               end;

J. Gareth Moreton

2021-05-16 11:00

developer   ~0130915

So now I'm getting an internal error that I'm surprised I missed before, or that wasn't raised earlier, because I tried to use AllocRegBetween with the flags register, which the function doesn't like.

I'm fixing up the patch to correct the problem and hopefully split the SETcc and JccMov optimisations apart again. Currently, though, I think the best fix is to allow AllocRegBetween to work with the flags register on x86 platforms, especially if instruction manipulation may otherwise cause the register not to be tracked properly.

Issue History

Date Modified Username Field Change
2021-04-16 07:21 J. Gareth Moreton New Issue
2021-04-16 07:21 J. Gareth Moreton File Added: JccMovJmpMov2MovSetcc-pass1.patch
2021-04-16 07:21 J. Gareth Moreton Priority normal => low
2021-04-16 07:21 J. Gareth Moreton FPCTarget => -
2021-04-16 07:21 J. Gareth Moreton Tag Attached: patch.compiler
2021-04-16 07:21 J. Gareth Moreton Tag Attached: optimizations
2021-04-16 07:21 J. Gareth Moreton Tag Attached: i386
2021-04-16 07:21 J. Gareth Moreton Tag Attached: x86
2021-04-16 07:21 J. Gareth Moreton Tag Attached: x86_64
2021-04-16 07:23 J. Gareth Moreton Note Added: 0130400
2021-04-16 11:34 J. Gareth Moreton Note Added: 0130402
2021-04-16 17:43 Florian Note Added: 0130409
2021-04-16 17:44 Florian Note Added: 0130411
2021-04-16 18:05 J. Gareth Moreton Note Added: 0130412
2021-04-16 18:06 J. Gareth Moreton Note Edited: 0130412 View Revisions
2021-04-16 18:07 J. Gareth Moreton Note Edited: 0130412 View Revisions
2021-04-16 22:27 J. Gareth Moreton Tag Detached: patch.compiler
2021-04-16 22:27 J. Gareth Moreton Tag Attached: patch
2021-04-16 22:27 J. Gareth Moreton Tag Attached: compiler
2021-05-03 21:56 J. Gareth Moreton Note Added: 0130744
2021-05-03 21:56 J. Gareth Moreton File Added: JccMovJmpMov2MovSetcc-pass1-improved.patch
2021-05-04 00:08 J. Gareth Moreton Relationship added parent of 0038767
2021-05-16 11:00 J. Gareth Moreton Note Added: 0130915