View Issue Details

ID: 0034628
Project: FPC
Category: Compiler
View Status: public
Last Update: 2019-11-07 10:13
Reporter: J. Gareth Moreton
Assigned To: J. Gareth Moreton
Priority: normal
Severity: minor
Reproducibility: N/A
Status: closed
Resolution: suspended
Platform: x86_64
OS: Microsoft Windows
OS Version: 10 Professional
Product Version: 3.3.1
Product Build: x86_64-win64
Target Version: 3.3.1
Fixed in Version:
Summary: 0034628: [Patch / Refactor] x86_64 optimizer overhaul
Description: This patch serves to overhaul the optimiser for x86_64 to minimise the number of passes required and to be more intelligent. Preliminary tests show about a 5% speed increase on an -O1 compilation of Lazarus and about a 15% speed increase for -O3. See the attached Metric.txt file showcasing the timings.

To minimise the pass count, the pre-peephole, pass 1 and pass 2 stages have been merged, and jump and MOV optimisations have been overhauled. One of the control cases is that a compilation under -O1 should not produce worse code than the trunk. It turns out, though, that in many cases the compiler produces better code even though no new optimization combinations have actually been introduced.

Additionally, for individual passes, the optimizer attempts to mark the end of function prologues so as to not waste time on sequences that won't change.

The code isn't completely clean as I have attempted to separate i386 from the changes, mostly as a control case to show it doesn't affect other platforms. Once testing and implementation are successful for x86_64, I plan to port my changes over to i386.

(NOTE: Linux testing hasn't yet been overly successful due to configuration difficulties)
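
The percentages quoted above can be reproduced from the averages recorded in the attached Metric.txt. The following is a minimal sketch in Free Pascal, assuming the four averages are copied by hand from that file. The program name and the SavingFraction helper are illustrative only and are not part of the patch.

```pascal
program SavingSketch;

{$mode objfpc}

{ Fraction of compile time saved: 1 - (y / x), where x is the trunk
  average and y is the average for the overhauled optimiser.
  Hypothetical helper, not part of the patch. }
function SavingFraction(const TrunkAvg, PatchedAvg: Double): Double;
begin
  Result := 1.0 - (PatchedAvg / TrunkAvg);
end;

begin
  { Averages in seconds, copied from Metric.txt (first run discarded) }
  WriteLn('-O3 saving: ', SavingFraction(128.367, 108.533) * 100.0:0:1, '%');
  WriteLn('-O1 saving: ', SavingFraction(132.533, 126.167) * 100.0:0:1, '%');
end.
```

The two results should come out at roughly 15% and 5%, in line with the figures given in the description.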
Steps To Reproduce: Apply patch and test on all platforms for successful compilation and correct machine code output of binaries.
Additional Information: Though not the intention, the rewriting of some of the optimisation routines has allowed for some additional space and size savings. A lot of the time, this just amounts to stripping out dead labels, which doesn't actually change the final binary size, but occasionally it can eliminate superfluous jumps and unnecessary alignment hints, which sometimes leads to further optimisations. For example, in "components/codetools/basiccodetools.pas" for Lazarus, under -O3 compilation, the overhauled optimiser is able to remove two additional branches in the CompareSubstrings function. Under the trunk, the segment is as follows:

...
.Lj2799:
    movslq %r8d,%r9
    subq %r9,%rdx
    leaq 1(%rdx),%r9
    cmpl %r9d,%r11d
    jge .Lj2802
    .p2align 2,,0
    .p2align 1
    movl %r11d,%r9d
.Lj2802:
    movq %rcx,%rdx
    testq %rcx,%rcx
    je .Lj2803
    movq -8(%rdx),%rdx
.Lj2803:
    movslq %r10d,%rbx
    subq %rbx,%rdx
    addq $1,%rdx
    cmpl %edx,%r11d
    jge .Lj2806
    .p2align 2,,0
    .p2align 1
    movl %r11d,%edx
.Lj2806:
    movslq %r8d,%r8
...

Under the overhauled optimiser, the loop is able to see through the alignment hints and convert the conditional branches into CMOV instructions:

...
.Lj2799:
    movslq %r8d,%r9
    subq %r9,%rdx
    leaq 1(%rdx),%r9
    cmpl %r9d,%r11d
    cmovngel %r11d,%r9d
    movq %rcx,%rdx
    testq %rcx,%rcx
    je .Lj2803
    movq -8(%rdx),%rdx
.Lj2803:
    movslq %r10d,%rbx
    subq %rbx,%rdx
    addq $1,%rdx
    cmpl %edx,%r11d
    cmovngel %r11d,%edx
    movslq %r8d,%r8
...
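
For readers less familiar with AT&T syntax, each cmpl/jge/movl triple in the listings above behaves like a compare followed by one conditional assignment. The Free Pascal sketch below shows source code of that general shape. It is an assumption for illustration only: ClampLength is a hypothetical function, not the actual code of CompareSubstrings, and whether a given compiler build emits a branch or a CMOV for it depends on the target CPU settings and optimisation level.

```pascal
program ClampSketch;

{$mode objfpc}

{ Hypothetical helper, not taken from basiccodetools.pas.
  If Limit is smaller than Len, Len is replaced by Limit; otherwise it
  is left alone.  This is one comparison guarding one register-sized
  assignment, the pattern that a CMOV instruction can express without
  a branch. }
function ClampLength(Len, Limit: LongInt): LongInt;
begin
  if Limit < Len then
    Len := Limit;
  Result := Len;
end;

begin
  WriteLn(ClampLength(10, 4));  { Limit is smaller, so Len is replaced }
  WriteLn(ClampLength(3, 4));   { Len is already within the limit }
end.
```

Compiling a test like this with -a keeps the generated assembler file, which makes it possible to compare the trunk and patched output in the same way as the listings above.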
Tags: 64-bit, compiler, optimization, patch, refactoring, x86, x86_64-win64
Fixed in Revision:
FPCOldBugId: 0
FPCTarget: -
Attached Files
  • Metric.txt (6,396 bytes)
    Compilation script:
    
    ppcx64 -Sc -Sg -Mobjfpc -FEC:\Users\NLO-012\Documents\Programming\lazarus -g- -Xs -FuC:\Users\NLO-012\Documents\Programming\fpc\packages\fcl-db\src\sqldb -FuC:\Users\NLO-012\Documents\Programming\fpc\packages\libtar\src -FuC:\Users\NLO-012\Documents\Programming\fpc\packages\fpmkunit\src -FuC:\Users\NLO-012\Documents\Programming\lazarus\packager -FuC:\Users\NLO-012\Documents\Programming\fpc\packages\fppkg\src -FuC:\Users\NLO-012\Documents\Programming\fpc\compiler\systems -FlC:\Users\NLO-012\Documents\Programming\fpc\units\x86_64-win64\rtl -FuC:\Users\NLO-012\Documents\Programming\fpc\rtl\win64 -FiC:\Users\NLO-012\Documents\Programming\fpc\rtl\inc -FiC:\Users\NLO-012\Documents\Programming\fpc\rtl\win -FiC:\Users\NLO-012\Documents\Programming\fpc\rtl\win64 -FiC:\Users\NLO-012\Documents\Programming\fpc\rtl\x86_64 -FiC:\Users\NLO-012\Documents\Programming\fpc\rtl\win\wininc -FuC:\Users\NLO-012\Documents\Programming\fpc\rtl\win -FiC:\Users\NLO-012\Documents\Programming\fpc\rtl\objpas\sysutils -FiC:\users\NLO-012\Documents\Programming\lazarus\ide\include -FuC:\Users\NLO-012\Documents\Programming\fpc\rtl\inc -FuC:\Users\NLO-012\Documents\Programming\fpc\rtl\objpas -FuC:\users\NLO-012\Documents\Programming\lazarus\lcl\interfaces\win32 -FuC:\users\NLO-012\Documents\Programming\lazarus\components\lazutils -FiC:\Users\NLO-012\Documents\Programming\fpc\rtl\objpas\classes -FuC:\users\NLO-012\Documents\Programming\fpc\packages\rtl-objpas\src\inc -FuC:\users\NLO-012\Documents\Programming\fpc\packages\fcl-base\src -FuC:\users\NLO-012\Documents\Programming\lazarus\lcl -FuC:\users\NLO-012\Documents\Programming\fpc\packages\fcl-image\src -FiC:\users\NLO-012\Documents\Programming\lazarus\lcl\include -FuC:\Users\NLO-012\Documents\Programming\fpc\packages\winunits-base\src -FuC:\Users\NLO-012\Documents\Programming\fpc\packages\rtl-objpas\src\win -FiC:\Users\NLO-012\Documents\Programming\fpc\packages\rtl-objpas\src\inc -FuC:\Users\NLO-012\Documents\Programming\fpc\packages\paszlib\src 
-FuC:\Users\NLO-012\Documents\Programming\fpc\packages\hash\src -FuC:\Users\NLO-012\Documents\Programming\fpc\packages\pasjpeg\src -FuC:\users\NLO-012\Documents\Programming\lazarus\lcl\widgetset -FuC:\users\NLO-012\Documents\Programming\lazarus\components\lazutils -FuC:\Users\NLO-012\Documents\Programming\fpc\packages\fcl-process\src -FiC:\Users\NLO-012\Documents\Programming\fpc\packages\fcl-process\src\win -FuC:\Users\NLO-012\Documents\Programming\fpc\packages\chm\src -FuC:\Users\NLO-012\Documents\Programming\fpc\packages\fcl-json\src -FuC:\users\NLO-012\Documents\Programming\lazarus\lcl\forms -FuC:\users\NLO-012\Documents\Programming\lazarus\components\codetools -FiC:\users\NLO-012\Documents\Programming\lazarus\ide\include\win64 -FuC:\users\NLO-012\Documents\Programming\lazarus\components\ideintf -FuC:\users\NLO-012\Documents\Programming\lazarus\components\lazcontrols -FuC:\users\NLO-012\Documents\Programming\lazarus\components\debuggerintf -FuC:\users\NLO-012\Documents\Programming\lazarus\debugger -FuC:\users\NLO-012\Documents\Programming\lazarus\components\synedit -FuC:\Users\NLO-012\Documents\Programming\fpc\packages\fcl-registry\src -FuC:\Users\NLO-012\Documents\Programming\fpc\packages\regexpr\src -FuC:\users\NLO-012\Documents\Programming\lazarus\packager\registration -FuC:\Users\NLO-012\Documents\Programming\fpc\packages\fcl-db\src\base -FuC:\users\NLO-012\Documents\Programming\lazarus\components\ideintf -FuC:\Users\NLO-012\Documents\Programming\fpc\packages\fcl-res\src -FuC:\users\NLO-012\Documents\Programming\lazarus\packager -FuC:\users\NLO-012\Documents\Programming\lazarus\designer -FuC:\users\NLO-012\Documents\Programming\lazarus\ide\frames -FuC:\Users\NLO-012\Documents\Programming\fpc\packages\fcl-xml\src -FuC:\Users\NLO-012\Documents\Programming\fpc\packages\fcl-extra\src\win -FuC:\Users\NLO-012\Documents\Programming\fpc\packages\winunits-jedi\src -FuC:\Users\NLO-012\Documents\Programming\fpc\packages\fcl-db\src\dbase 
-FiC:\Users\NLO-012\Documents\Programming\fpc\packages\fcl-process\src\winall -FiC:\Users\NLO-012\Documents\Programming\fpc\packages\fcl-base\src\win -FuC:\users\NLO-012\Documents\Programming\lazarus\components\lazdebuggergdbmi -FuC:\users\NLO-012\Documents\Programming\lazarus\debugger\frames -FuC:\users\NLO-012\Documents\Programming\lazarus\converter -FuC:\users\NLO-012\Documents\Programming\lazarus\packager\frames C:\Users\NLO-012\Documents\Programming\lazarus\ide\lazarus.pp -vs -a -B -O3
    
    Trunk build:
    Discard first run (extra time taken due to disk polling etc.)
    
    [161.777] 1285546 lines compiled, 161.8 sec, 9134400 bytes code, 788644 bytes data
    [134.719] 1285546 lines compiled, 134.7 sec, 9134400 bytes code, 788644 bytes data
    [124.336] 1285546 lines compiled, 124.3 sec, 9134400 bytes code, 788644 bytes data
    [126.129] 1285546 lines compiled, 126.1 sec, 9134400 bytes code, 788644 bytes data
    
    Average: 128.367s (x)
    
    Optimisation overhaul:
    Discard first run (extra time taken due to disk polling etc.)
    
    [117.906] 1285498 lines compiled, 117.9 sec, 9124736 bytes code, 788580 bytes data
    [109.793] 1285498 lines compiled, 109.8 sec, 9124736 bytes code, 788580 bytes data
    [109.480] 1285498 lines compiled, 109.5 sec, 9124736 bytes code, 788580 bytes data
    [106.266] 1285498 lines compiled, 106.3 sec, 9124736 bytes code, 788580 bytes data
    
    Average: 108.533s (y)
    
    Saving: 1 - (y/x) = 0.154505 = ~15% faster
    
    
    Changing -O3 to -O1...
    
    Trunk build:
    Discard first run (extra time taken due to disk polling etc.)
    
    [130.668] 1285571 lines compiled, 130.7 sec, 10196576 bytes code, 788996 bytes data
    [128.012] 1285571 lines compiled, 128.0 sec, 10196576 bytes code, 788996 bytes data
    [137.973] 1285571 lines compiled, 138.0 sec, 10196576 bytes code, 788996 bytes data
    [131.625] 1285571 lines compiled, 131.6 sec, 10196576 bytes code, 788996 bytes data
    
    Average: 132.533s (x)
    
    Optimisation overhaul:
    Discard first run (extra time taken due to disk polling etc.)
    
    [162.082] 1285498 lines compiled, 162.1 sec, 10182608 bytes code, 788836 bytes data
    [125.703] 1285498 lines compiled, 125.7 sec, 10182608 bytes code, 788836 bytes data
    [126.027] 1285498 lines compiled, 126.0 sec, 10182608 bytes code, 788836 bytes data
    [126.824] 1285498 lines compiled, 126.8 sec, 10182608 bytes code, 788836 bytes data
    
    Average: 126.167s (y)
    
    Saving: 1 - (y/x) = 0.048038 = ~5% faster
    
  • overhaul-base.patch (3,376 bytes)
    Index: compiler/aopt.pas
    ===================================================================
    --- compiler/aopt.pas	(revision 42345)
    +++ compiler/aopt.pas	(working copy)
    @@ -53,9 +53,9 @@
             { Builds a table with the locations of the labels in the TAsmList.
               Also fixes some RegDeallocs like "# %eax released; push (%eax)"  }
             Procedure BuildLabelTableAndFixRegAlloc;
    -        procedure clear;
           protected
             procedure pass_1;
    +        procedure clear;
           End;
           TAsmOptimizerClass = class of TAsmOptimizer;
     
    Index: compiler/aoptbase.pas
    ===================================================================
    --- compiler/aoptbase.pas	(revision 42345)
    +++ compiler/aoptbase.pas	(working copy)
    @@ -176,7 +176,7 @@
       End;
     
     
    -  function labelCanBeSkipped(p: tai_label): boolean;
    +  function labelCanBeSkipped(p: tai_label): boolean; inline;
       begin
         labelCanBeSkipped := not(p.labsym.is_used) or (p.labsym.labeltype<>alt_jump);
       end;
    Index: compiler/aoptobj.pas
    ===================================================================
    --- compiler/aoptobj.pas	(revision 42345)
    +++ compiler/aoptobj.pas	(working copy)
    @@ -371,6 +396,15 @@
     
            Function ArrayRefsEq(const r1, r2: TReference): Boolean;
     
    +       { Returns a pointer to the operand that contains the destination label }
    +       function JumpTargetOp(ai: taicpu): poper;
    +
    +       { Returns True if hp is any jump to a label }
    +       function IsJumpToLabel(hp: taicpu): boolean;
    +
    +       { Returns True if hp is an unconditional jump to a label }
    +       function IsJumpToLabelUncond(hp: taicpu): boolean;
    +
         { ***************************** Implementation **************************** }
     
       Implementation
    Index: compiler/aoptutils.pas
    ===================================================================
    --- compiler/aoptutils.pas	(revision 42345)
    +++ compiler/aoptutils.pas	(working copy)
    @@ -38,15 +38,22 @@
         { skips all labels and returns the next "real" instruction }
         function SkipLabels(hp: tai; var hp2: tai): boolean;
     
    +    { sets hp2 to hp and returns True if hp is not nil }
    +    function SetAndTest(const hp: tai; out hp2: tai): Boolean;
    +
       implementation
     
    -    function MatchOpType(const p : taicpu; type0: toptype) : Boolean;
    +    uses
    +      aasmbase;
    +
    +
    +    function MatchOpType(const p : taicpu; type0: toptype) : Boolean; inline;
           begin
             Result:=(p.ops=1) and (p.oper[0]^.typ=type0);
           end;
     
     
    -    function MatchOpType(const p : taicpu; type0,type1 : toptype) : Boolean;
    +    function MatchOpType(const p : taicpu; type0,type1 : toptype) : Boolean; inline;
           begin
             Result:=(p.ops=2) and (p.oper[0]^.typ=type0) and (p.oper[1]^.typ=type1);
           end;
    @@ -53,7 +60,7 @@
     
     
     {$if max_operands>2}
    -    function MatchOpType(const p : taicpu; type0,type1,type2 : toptype) : Boolean;
    +    function MatchOpType(const p : taicpu; type0,type1,type2 : toptype) : Boolean; inline;
           begin
             Result:=(p.ops=3) and (p.oper[0]^.typ=type0) and (p.oper[1]^.typ=type1) and (p.oper[2]^.typ=type2);
           end;
    @@ -78,6 +85,11 @@
               end;
           end;
     
    +    { sets hp2 to hp and returns True if hp is not nil }
    +    function SetAndTest(const hp: tai; out hp2: tai): Boolean; inline;
    +      begin
    +        hp2 := hp;
    +        Result := Assigned(hp);
    +      end;
     
     end.
    -
    
  • overhaul-global.patch (19,603 bytes)
    Index: compiler/aoptobj.pas
    ===================================================================
    --- compiler/aoptobj.pas	(revision 42345)
    +++ compiler/aoptobj.pas	(working copy)
    @@ -24,6 +24,8 @@
     }
     Unit AoptObj;
     
    +{ $DEFINE DEBUG_JUMP}
    +
       {$i fpcdefs.inc}
     
       { general, processor independent objects for use by the assembler optimizer }
    @@ -268,10 +270,21 @@
             Procedure CreateUsedRegs(var regs: TAllUsedRegs);
             Procedure ClearUsedRegs;
             Procedure UpdateUsedRegs(p : Tai);
    -        class procedure UpdateUsedRegs(var Regs: TAllUsedRegs; p: Tai);
    +        { Function always returns True.  Used so the method can be inserted into
    +          an if-block when paired with RegUsedAfterInstruction, say }
    +        class function UpdateUsedRegs(var Regs: TAllUsedRegs; p: Tai): Boolean;
             Function CopyUsedRegs(var dest : TAllUsedRegs) : boolean;
    +
    +        { If UpdateUsedRegsAndOptimize has read ahead, the result is one before
    +          the next valid entry (so "p.Next" returns what's expected).  If no
    +          reading ahead happened, then the result is equal to p. }
    +        function UpdateUsedRegsAndOptimize(p : Tai): Tai;
    +
             procedure RestoreUsedRegs(const Regs : TAllUsedRegs);
    -        procedure TransferUsedRegs(var dest: TAllUsedRegs);
    +
    +        { Function always returns True.  Used so the method can be inserted into
    +          an if-block when paired with RegUsedAfterInstruction, say }
    +        function TransferUsedRegs(var dest: TAllUsedRegs): Boolean;
             class Procedure ReleaseUsedRegs(const regs : TAllUsedRegs);
             class Function RegInUsedRegs(reg : TRegister;regs : TAllUsedRegs) : boolean;
             class Procedure IncludeRegInUsedRegs(reg : TRegister;var regs : TAllUsedRegs);
    @@ -351,6 +364,7 @@
             procedure RemoveDelaySlot(hp1: tai);
     
             { peephole optimizer }
    +        function GetFirstInstruction(const Start: tai; var p: tai): Boolean; virtual;
             procedure PrePeepHoleOpts; virtual;
             procedure PeepHoleOptPass1; virtual;
             procedure PeepHoleOptPass2; virtual;
    @@ -363,6 +377,17 @@
             function PeepHoleOptPass2Cpu(var p: tai): boolean; virtual;
             function PostPeepHoleOptsCpu(var p: tai): boolean; virtual;
     
    +        { Removes all instructions between an unconditional jump and the next label }
    +        procedure RemoveDeadCodeAfterJump(p: taicpu);
    +
    +        { If hp is a label, strip it if its reference count is zero.  Repeat until
    +          a non-label is found, or a label with a non-zero reference count.
    +          True is returned if something was stripped }
    +        function StripDeadLabels(hp: tai; var NextValid: tai): Boolean;
    +
    +        { Checks and removes "jmp @@lbl; @lbl". Returns True if the jump was removed }
    +        function CollapseZeroDistJump(var p: tai; hp1: tai; ThisLabel: TAsmLabel): Boolean;
    +
             { insert debug comments about which registers are read and written by
               each instruction. Useful for debugging the InstructionLoadsFromReg and
               other similar functions. }
    @@ -900,7 +934,81 @@
                 UsedRegs[i].Clear;
             end;
     
    +      { If UpdateUsedRegsAndOptimize has read ahead, the result is one before
    +        the next valid entry (so "p.Next" returns what's expected).  If no
    +        reading ahead happened, then the result is equal to p. }
    +      function TAOptObj.UpdateUsedRegsAndOptimize(p : Tai): Tai;
    +        var
    +          NotFirst: Boolean;
    +        begin
    +          { this code is based on TUsedRegs.Update to avoid multiple passes through the asmlist,
    +            the code is duplicated here }
     
    +          Result := p;
    +          if (p.typ in [ait_instruction, ait_label]) then
    +            begin
    +              if (p.next <> BlockEnd) and (tai(p.next).typ <> ait_instruction) then
    +                begin
    +                  { Advance one, otherwise the routine exits immediately and wastes time }
    +                  p := tai(p.Next);
    +                  NotFirst := True;
    +                end
    +              else
    +                { If the next entry is an instruction, nothing will be updated or
    +                  optimised here, so exit now to save time }
    +                Exit;
    +            end
    +          else
    +            NotFirst := False;
    +
    +          repeat
    +            while assigned(p) and
    +                  ((p.typ in (SkipInstr + [ait_align, ait_label] - [ait_RegAlloc])) or
    +                   ((p.typ = ait_marker) and
    +                    (tai_Marker(p).Kind in [mark_AsmBlockEnd,mark_NoLineInfoStart,mark_NoLineInfoEnd]))) do
    +                 begin
    +                   { Here's the optimise part }
    +                   if (p.typ in [ait_align, ait_label]) then
    +                     begin
    +                       if StripDeadLabels(p, p) then
    +                         begin
    +                           { Note, if the first instruction is stripped and is
    +                             the only one that gets removed, Result will now
    +                             contain a dangling pointer, so compensate for this. }
    +                           if not NotFirst then
    +                             Result := tai(p.Previous);
    +
    +                           Continue;
    +                         end;
    +
    +                       if ((p.typ = ait_label) and not labelCanBeSkipped(tai_label(p))) then
    +                         Break;
    +                     end;
    +
    +                   Result := p;
    +                   p := tai(p.next);
    +                 end;
    +            while assigned(p) and
    +                  (p.typ=ait_RegAlloc) Do
    +              begin
    +                case tai_regalloc(p).ratype of
    +                  ra_alloc :
    +                    Include(UsedRegs[getregtype(tai_regalloc(p).reg)].UsedRegs, getsupreg(tai_regalloc(p).reg));
    +                  ra_dealloc :
    +                    Exclude(UsedRegs[getregtype(tai_regalloc(p).reg)].UsedRegs, getsupreg(tai_regalloc(p).reg));				
    +                  else
    +                    { Do nothing };
    +                end;
    +                Result := p;
    +                p := tai(p.next);
    +              end;
    +            NotFirst := True;
    +          until not(assigned(p)) or
    +                (not(p.typ in SkipInstr + [ait_align]) and
    +                 not((p.typ = ait_label) and
    +                     labelCanBeSkipped(tai_label(p))));
    +        end;
    +
           procedure TAOptObj.UpdateUsedRegs(p : Tai);
             begin
               { this code is based on TUsedRegs.Update to avoid multiple passes through the asmlist,
    @@ -933,12 +1041,14 @@
             end;
     
     
    -      class procedure TAOptObj.UpdateUsedRegs(var Regs : TAllUsedRegs;p : Tai);
    +      class function TAOptObj.UpdateUsedRegs(var Regs : TAllUsedRegs;p : Tai): Boolean;
             var
               i : TRegisterType;
             begin
               for i:=low(TRegisterType) to high(TRegisterType) do
                 Regs[i].Update(p);
    +
    +          Result := True;
             end;
     
     
    @@ -964,7 +1074,7 @@
           end;
     
     
    -      procedure TAOptObj.TransferUsedRegs(var dest: TAllUsedRegs);
    +      function TAOptObj.TransferUsedRegs(var dest: TAllUsedRegs): Boolean;
           var
             i : TRegisterType;
           begin
    @@ -973,6 +1083,8 @@
               the only published means to modify the internal state en-masse. [Kit] }
             for i:=low(TRegisterType) to high(TRegisterType) do
               dest[i].Create_Regset(i, UsedRegs[i].GetUsedRegs);
    +
    +        Result := True;
           end;
     
     
    @@ -1338,17 +1450,33 @@
     
     
         function FindAnyLabel(hp: tai; var l: tasmlabel): Boolean;
    +      var
    +        next: tai;
           begin
             FindAnyLabel := false;
    -        while assigned(hp.next) and
    -              (tai(hp.next).typ in (SkipInstr+[ait_align])) Do
    -          hp := tai(hp.next);
    -        if assigned(hp.next) and
    -           (tai(hp.next).typ = ait_label) then
    +
    +        while True do
               begin
    -            FindAnyLabel := true;
    -            l := tai_label(hp.next).labsym;
    -          end
    +            while assigned(hp.next) and
    +                  (tai(hp.next).typ in (SkipInstr+[ait_align])) Do
    +              hp := tai(hp.next);
    +
    +            next := tai(hp.next);
    +            if assigned(next) and
    +              (tai(next).typ = ait_label) then
    +              begin
    +                l := tai_label(next).labsym;
    +                if not l.is_used then
    +                  begin
    +                    { Unsafe label }
    +                    hp := next;
    +                    Continue;
    +                  end;
    +
    +                FindAnyLabel := true;
    +              end;
    +            Exit;
    +          end;
           end;
     
     
    @@ -1414,7 +1542,230 @@
               execute before branch, so code stays correct if branch is removed. }
           end;
     
    +    { Search forward from BlockStart until we find the first instruction }
    +    function TAOptObj.GetFirstInstruction(const Start: tai; var p: tai): Boolean;
    +      begin
    +        Result := False;
    +        p := Start;
    +        while (p <> BlockEnd) do
    +          begin
    +            if (p.Typ = ait_instruction) then
    +              begin
    +                Result := True;
    +                Exit;
    +              end
    +            else
    +              begin
    +                UpdateUsedRegs(p);
    +                p := tai(p.Next);
    +              end;
    +          end;
    +      end;
     
    +    { Removes all instructions between an unconditional jump and the next label }
    +    procedure TAOptObj.RemoveDeadCodeAfterJump(p: taicpu);
    +      var
    +        hp1, hp2: tai;
    +      begin
    +        if not IsJumpToLabelUncond(p) then
    +          Exit;
    +
    +        { the following if-block removes all code between a jmp and the next label,
    +          because it can never be executed
    +        }
    +        while GetNextInstruction(p, hp1) and
    +              (hp1 <> BlockEnd) and
    +              (hp1.typ <> ait_label)
    +{$ifdef JVM}
    +              and (hp1.typ <> ait_jcatch)
    +{$endif}
    +              do
    +          if not(hp1.typ in ([ait_label]+skipinstr)) then
    +            begin
    +              if (hp1.typ = ait_instruction) and
    +                 taicpu(hp1).is_jmp and
    +                 (JumpTargetOp(taicpu(hp1))^.typ = top_ref) and
    +                 (JumpTargetOp(taicpu(hp1))^.ref^.symbol is TAsmLabel) then
    +                 TAsmLabel(JumpTargetOp(taicpu(hp1))^.ref^.symbol).decrefs;
    +              { don't kill start/end of assembler block,
    +                no-line-info-start/end etc }
    +              if (hp1.typ <> ait_marker) then
    +                begin
    +{$ifdef cpudelayslot}
    +                  if (hp1.typ=ait_instruction) and (taicpu(hp1).is_jmp) then
    +                    RemoveDelaySlot(hp1);
    +{$endif cpudelayslot}
    +                  if (hp1.typ = ait_align) then
    +                    begin
    +                      { Only remove the align if a label doesn't immediately follow }
    +                      if GetNextInstruction(hp1, hp2) and (hp2.typ = ait_label) then
    +                        { The label is unskippable }
    +                        Exit;
    +                    end;
    +                  asml.remove(hp1);
    +                  hp1.free;
    +                end
    +              else
    +                p:=taicpu(hp1);
    +            end
    +          else
    +            Break;
    +      end;
    +
    +    { If hp is a label, strip it if its reference count is zero.  Repeat until
    +      a non-label is found, or a label with a non-zero reference count.
    +      True is returned if something was stripped }
    +    function TAOptObj.StripDeadLabels(hp: tai; var NextValid: tai): Boolean;
    +      var
    +        tmp: tai;
    +        hp1: tai;
    +        CurrentAlign: tai;
    +      begin
    +        CurrentAlign := nil;
    +        Result := False;
    +        hp1 := hp;
    +        NextValid := hp;
    +
    +        { Stop if hp is an instruction, for example }
    +        while (hp1 <> BlockEnd) and (hp1.typ in [ait_label,ait_align]) do
    +          begin
    +            case hp1.typ of
    +              ait_label:
    +                begin
    +                  with tai_label(hp1).labsym do
    +                    if is_used or (bind <> AB_LOCAL) or (labeltype <> alt_jump) then
    +                      begin
    +                        { Valid label }
    +                        if Result then
    +                          NextValid := hp1;
    +                        Exit;
    +                      end;
    +
    +                  { Set tmp to the next valid entry }
    +                  tmp := tai(hp1.Next);
    +                  { Remove label }
    +                  AsmL.Remove(hp1);
    +                  hp1.Free;
    +
    +                  hp1 := tmp;
    +
    +                  Result := True;
    +                  Continue;
    +                end;
    +              { Also remove the align if it comes before an unused label }
    +              ait_align:
    +                begin
    +                  tmp := tai(hp1.Next);
    +
    +                  if (cs_debuginfo in current_settings.moduleswitches) or
    +                     (cs_use_lineinfo in current_settings.globalswitches) then
    +                     { Don't remove aligns if debuginfo is present }
    +                    begin
    +                      if (tmp.typ in [ait_label,ait_align]) then
    +                        begin
    +                          hp1 := tmp;
    +                          Continue;
    +                        end
    +                      else
    +                        Break;
    +                    end;
    +
    +                  if tmp = BlockEnd then
    +                    { End of block }
    +                    Exit;
    +
    +                  case tmp.typ of
    +                    ait_align: { Merge the aligns - we might as well }
    +                      begin
    +                        { Actually the correct operation here is not max, but
    +                          the least common multiple, but alignments are
    +                          strictly powers of two anyway, so the largest of the
    +                          two alignments is also the LCM. [Kit] }
    +                        tai_align_abstract(hp1).aligntype := max(tai_align_abstract(hp1).aligntype, tai_align_abstract(tmp).aligntype);
    +                        AsmL.Remove(tmp);
    +                        tmp.Free;
    +                        Result := True;
    +                        Continue;
    +                      end;
    +                    ait_label:
    +                      begin
    +                        { Signal that we can possibly delete this align entry }
    +                        CurrentAlign := hp1;
    +
    +                        with tai_label(tmp).labsym do
    +                          if is_used or (bind <> AB_LOCAL) or (labeltype <> alt_jump) then
    +                            begin
    +                              { Valid label }
    +                              if Result then
    +                                NextValid := hp1;
    +                              Exit;
    +                            end;
    +
    +                        { Remove label }
    +                        AsmL.Remove(tmp);
    +                        tmp.Free;
    +
    +                        Result := True;
    +
    +                        { Re-evaluate the align and see what follows }
    +                        Continue;
    +                      end
    +                    else
    +                      begin
    +                        { Set hp1 to the instruction after the align, because the
    +                          align might get deleted later and hence set NextValid
    +                          to a dangling pointer. [Kit] }
    +                        hp1 := tmp;
    +                        Break;
    +                      end;
    +                  end;
    +                end
    +              else
    +                Break;
    +            end;
    +            hp1 := tai(hp1.Next);
    +          end;
    +
    +        { hp1 will be the next valid entry }
    +        NextValid := hp1;
    +
    +        if Assigned(CurrentAlign) then
    +          begin
    +            { Remove the alignment field }
    +            AsmL.Remove(CurrentAlign);
    +            CurrentAlign.Free;
    +          end;
    +      end;
    +
    +    function TAOptObj.CollapseZeroDistJump(var p: tai; hp1: tai; ThisLabel: TAsmLabel): Boolean;
    +      var
    +        tmp: tai;
    +      begin
    +        Result := False;
    +
    +        { remove jumps to labels coming right after them }
    +        if FindLabel(ThisLabel, hp1) and
    +            { TODO: FIXME removing the first instruction fails}
    +            (p<>blockstart) then
    +          begin
    +            ThisLabel.decrefs;
    +
    +            tmp := tai(p.Next); { Might be an align before the label }
    +{$ifdef cpudelayslot}
    +            RemoveDelaySlot(p);
    +{$endif cpudelayslot}
    +            asml.remove(p);
    +            p.free;
    +
    +            StripDeadLabels(tmp, hp1);
    +
    +            p:=hp1;
    +            Result := True;
    +          end;
    +
    +    end;
    +
    +
         function TAOptObj.GetFinalDestination(hp: taicpu; level: longint): boolean;
           {traces sucessive jumps to their final destination and sets it, e.g.
            je l1                je l3
    Index: compiler/x86/aoptx86.pas
    ===================================================================
    --- compiler/x86/aoptx86.pas	(revision 42345)
    +++ compiler/x86/aoptx86.pas	(working copy)
    @@ -50,6 +58,12 @@
     
             procedure DebugMsg(const s : string; p : tai);inline;
     
    +        { TODO: This method is declared here so it can be more easily split away
    +          into a separate patch file - once fully implemented into the trunk, it
    +          can be moved with the other OptPass1 routines }
    +
    +        function OptPass1XOR(var p : tai) : boolean;
    +
             class function IsExitCode(p : tai) : boolean;
             class function isFoldableArithOp(hp1 : taicpu; reg : tregister) : boolean;
             procedure RemoveLastDeallocForFuncRes(p : tai);
    @@ -96,6 +105,7 @@
         function MatchInstruction(const instr: tai; const op1,op2: TAsmOp; const opsize: topsizes): boolean;
         function MatchInstruction(const instr: tai; const op1,op2,op3: TAsmOp; const opsize: topsizes): boolean;
         function MatchInstruction(const instr: tai; const ops: array of TAsmOp; const opsize: topsizes): boolean;
    +    function MatchInstruction(const instr: tai; const op: TAsmOp): boolean; inline;
     
         function MatchOperand(const oper: TOper; const reg: TRegister): boolean; inline;
         function MatchOperand(const oper: TOper; const a: tcgint): boolean; inline;
    @@ -119,6 +129,14 @@
         SPeepholeOptimization = '';
     {$endif DEBUG_AOPTCPU}
     
    +
    +  function debug_tostr(i: tcgint): string;
    +  function debug_regname(r: TRegister): string;
    +  function debug_operstr(oper: TOper): string;
    +  function debug_op2str(opcode: tasmop): string;
    +  function debug_opsize2str(opsize: topsize): string;
    +
    +
       implementation
     
         uses
    @@ -183,6 +204,14 @@
           end;
     
     
    +    function MatchInstruction(const instr: tai; const op: TAsmOp): boolean;
    +      begin
    +        result :=
    +          (instr.typ = ait_instruction) and
    +          (taicpu(instr).opcode = op);
    +      end;
    +
    +
         function MatchOperand(const oper: TOper; const reg: TRegister): boolean; inline;
           begin
             result := (oper.typ = top_reg) and (oper.reg = reg);
    @@ -1176,6 +1576,22 @@
           end;
     
     
    +    function TX86AsmOptimizer.OptPass1XOR(var p: tai): boolean;
    +      begin
    +        Result := False;
    +        if (taicpu(p).oper[0]^.typ = top_reg) and
    +           (taicpu(p).oper[1]^.typ = top_reg) and
    +           (taicpu(p).oper[0]^.reg = taicpu(p).oper[1]^.reg) then
    +         { temporarily change this to 'mov reg,0' to make it easier }
    +         { for the CSE. Will be changed back in the post-peephole stage }
    +          begin
    +            taicpu(p).opcode := A_MOV;
    +            taicpu(p).loadConst(0,0);
    +            Result := True;
    +          end;
    +      end;
    +
    +
         function TX86AsmOptimizer.OptPass1VOP(var p : tai) : boolean;
           var
             hp1 : tai;
    
    overhaul-global.patch (19,603 bytes)
  • overhaul-mov-refactor.patch (94,203 bytes)
    Index: compiler/x86/aoptx86.pas
    ===================================================================
    --- compiler/x86/aoptx86.pas	(revision 42345)
    +++ compiler/x86/aoptx86.pas	(working copy)
    @@ -1216,251 +1635,769 @@
           var
             hp1, hp2: tai;
             GetNextInstruction_p: Boolean;
    +        hp3: tai;
    +        HP_Result: Boolean;
             PreMessage, RegName1, RegName2, InputVal, MaskNum: string;
             NewSize: topsize;
    +      label
    +        MovCaseBlock_CheckNext, MovCaseBlock;
    +
    +        function MOVRefOptimize: Boolean;
    +          begin
    +            Result := False;
    +            if MatchOpType(taicpu(p),top_reg,top_reg) and
    +              MatchOpType(taicpu(hp1),top_ref,top_reg) and
    +            ((taicpu(hp1).oper[0]^.ref^.base = taicpu(p).oper[1]^.reg)
    +             or
    +             (taicpu(hp1).oper[0]^.ref^.index = taicpu(p).oper[1]^.reg)
    +              ) and
    +            (getsupreg(taicpu(hp1).oper[1]^.reg) = getsupreg(taicpu(p).oper[1]^.reg)) then
    +            { mov reg1, reg2
    +              mov/zx/sx (reg2, ..), reg2      to   mov/zx/sx (reg1, ..), reg2}
    +            begin
    +              if (taicpu(hp1).oper[0]^.ref^.base = taicpu(p).oper[1]^.reg) then
    +                taicpu(hp1).oper[0]^.ref^.base := taicpu(p).oper[0]^.reg;
    +              if (taicpu(hp1).oper[0]^.ref^.index = taicpu(p).oper[1]^.reg) then
    +                taicpu(hp1).oper[0]^.ref^.index := taicpu(p).oper[0]^.reg;
    +              DebugMsg(SPeepholeOptimization + 'MovMovXX2MoVXX 1 done',p);
    +              asml.remove(p);
    +              p.free;
    +              p := hp1;
    +              Result:=true;
    +            end;
    +          end;
    +
           begin
             Result:=false;
    +        repeat
     
    -        GetNextInstruction_p:=GetNextInstruction(p, hp1);
    +          GetNextInstruction_p := GetNextInstruction(p, hp1);
     
    -        {  remove mov reg1,reg1? }
    -        if MatchOperand(taicpu(p).oper[0]^,taicpu(p).oper[1]^)
    -        then
    -          begin
    -            DebugMsg(SPeepholeOptimization + 'Mov2Nop done',p);
    -            { take care of the register (de)allocs following p }
    -            UpdateUsedRegs(tai(p.next));
    -            asml.remove(p);
    -            p.free;
    -            p:=hp1;
    -            Result:=true;
    -            exit;
    -          end;
    +          { remove mov reg1,reg1? }
    +          if MatchOperand(taicpu(p).oper[0]^,taicpu(p).oper[1]^)
    +          then
    +            begin
    +              DebugMsg(SPeepholeOptimization + 'Mov2Nop done',p);
    +              { take care of the register (de)allocs following p }
    +              UpdateUsedRegsAndOptimize(tai(p.next));
    +              asml.remove(p);
    +              p.free;
    +              p:=hp1;
    +              Result:=true;
    +              if MatchInstruction(hp1, A_MOV) then
    +                Continue
    +              else
    +                exit;
    +            end;
     
    -        if GetNextInstruction_p and
    -          MatchInstruction(hp1,A_AND,[]) and
    -          (taicpu(p).oper[1]^.typ = top_reg) and
    -          MatchOpType(taicpu(hp1),top_const,top_reg) then
    -          begin
    -            if MatchOperand(taicpu(p).oper[1]^,taicpu(hp1).oper[1]^) then
    -              begin
    -                case taicpu(p).opsize of
    -                  S_L:
    -                    if (taicpu(hp1).oper[0]^.val = $ffffffff) then
    -                      begin
    -                        { Optimize out:
    -                            mov x, %reg
    -                            and ffffffffh, %reg
    -                        }
    -                        DebugMsg(SPeepholeOptimization + 'MovAnd2Mov 1 done',p);
    -                        asml.remove(hp1);
    -                        hp1.free;
    -                        Result:=true;
    -                        exit;
    -                      end;
    -                  S_Q: { TODO: Confirm if this is even possible }
    -                    if (taicpu(hp1).oper[0]^.val = $ffffffffffffffff) then
    -                      begin
    -                        { Optimize out:
    -                            mov x, %reg
    -                            and ffffffffffffffffh, %reg
    -                        }
    -                        DebugMsg(SPeepholeOptimization + 'MovAnd2Mov 2 done',p);
    -                        asml.remove(hp1);
    -                        hp1.free;
    -                        Result:=true;
    -                        exit;
    -                      end;
    +          if GetNextInstruction_p and
    +            MatchInstruction(hp1,A_JMP) then
    +            { Doing this optimisation here allows for some additional
    +              optimisations in the same pass.  This ensures that certain
    +              MOV optimisations are still performed under -O1. [Kit] }
    +            begin
    +              if GetNextInstruction(hp1, hp2) and CollapseZeroDistJump(hp1, hp2, TAsmLabel(taicpu(hp1).oper[0]^.ref^.symbol)) then
    +                begin
    +                  if tai(hp1).typ = ait_instruction then
    +                    { hp1 is now the next instruction }
    +                    GetNextInstruction_p := True
                       else
    -                    ;
    -                end;
    -              end
    -            else if (taicpu(p).oper[1]^.typ = top_reg) and (taicpu(hp1).oper[1]^.typ = top_reg) and
    -              (taicpu(p).oper[0]^.typ <> top_const) and { MOVZX only supports registers and memory, not immediates (use MOV for that!) }
    -              (getsupreg(taicpu(p).oper[1]^.reg) = getsupreg(taicpu(hp1).oper[1]^.reg))
    -              then
    +                    { Note, if hp1 lands on a label, it won't be skippable, so
    +                      Exit if that happens }
    +                    if (tai(hp1).typ in SkipInstr) then
    +                      GetNextInstruction_p := GetNextInstruction(hp1, hp1)
    +                    else
    +                      Exit;
    +                end
    +              else
    +                Exit;
    +            end;
    +
    +MovCaseBlock_CheckNext:
    +          { All the following optimisations require a next instruction }
    +          if not GetNextInstruction_p or (hp1.typ <> ait_instruction) then
    +            Exit;
    +
    +MovCaseBlock:
    +          case taicpu(hp1).opcode of
    +            { Optimisations where next instruction = XOR }
    +            A_XOR:
                   begin
    -                InputVal := debug_operstr(taicpu(p).oper[0]^);
    -                MaskNum := debug_tostr(taicpu(hp1).oper[0]^.val);
    +                { OptPass1XOR doesn't use register tracking, so no need to
    +                  update and restore the register array }
    +                HP_Result := OptPass1XOR(hp1);
     
    -                case taicpu(p).opsize of
    -                  S_B:
    -                    if (taicpu(hp1).oper[0]^.val = $ff) then
    -                      begin
    -                        { Convert:
    -                            movb x, %regl        movb x, %regl
    -                            andw ffh, %regw      andl ffh, %regd
    -                          To:
    -                            movzbw x, %regd      movzbl x, %regd
    +                if HP_Result then
    +                  goto MovCaseBlock;
    +              end;
    +            { Optimisations where next instruction = AND }
    +            A_AND:
    +              if (taicpu(p).oper[1]^.typ = top_reg) and
    +                MatchOpType(taicpu(hp1),top_const,top_reg) then
    +                begin
    +                  if MatchOperand(taicpu(p).oper[1]^,taicpu(hp1).oper[1]^) then
    +                    begin
    +                      case taicpu(p).opsize of
    +                        S_L:
    +                          if (taicpu(hp1).oper[0]^.val = $ffffffff) then
    +                            begin
    +                              { Optimize out:
    +                                  mov x, %reg
    +                                  and ffffffffh, %reg
    +                              }
    +                              DebugMsg(SPeepholeOptimization + 'MovAnd2Mov 1 done',p);
    +                              asml.remove(hp1);
    +                              hp1.free;
    +                              GetNextInstruction_p := GetNextInstruction(p, hp1);
    +                              goto MovCaseBlock_CheckNext;
    +                            end;
    +                        S_Q: { TODO: Confirm if this is even possible }
    +                          if (taicpu(hp1).oper[0]^.val = $ffffffffffffffff) then
    +                            begin
    +                              { Optimize out:
    +                                  mov x, %reg
    +                                  and ffffffffffffffffh, %reg
    +                              }
    +                              DebugMsg(SPeepholeOptimization + 'MovAnd2Mov 2 done',p);
    +                              asml.remove(hp1);
    +                              hp1.free;
    +                              GetNextInstruction_p := GetNextInstruction(p, hp1);
    +                              goto MovCaseBlock_CheckNext;
    +                            end;
    +                        else
    +                          { Do nothing };
    +                      end;
    +                    end
    +                  else if (taicpu(p).oper[1]^.typ = top_reg) and (taicpu(hp1).oper[1]^.typ = top_reg) and
    +                    (taicpu(p).oper[0]^.typ <> top_const) and { MOVZX only supports registers and memory, not immediates (use MOV for that!) }
    +                    (getsupreg(taicpu(p).oper[1]^.reg) = getsupreg(taicpu(hp1).oper[1]^.reg))
    +                    then
    +                    begin
    +                      InputVal := debug_operstr(taicpu(p).oper[0]^);
    +                      MaskNum := debug_tostr(taicpu(hp1).oper[0]^.val);
     
    -                          (Identical registers, just different sizes)
    -                        }
    -                        RegName1 := debug_regname(taicpu(p).oper[1]^.reg); { 8-bit register name }
    -                        RegName2 := debug_regname(taicpu(hp1).oper[1]^.reg); { 16/32-bit register name }
    +                      case taicpu(p).opsize of
    +                        S_B:
    +                          if (taicpu(hp1).oper[0]^.val = $ff) then
    +                            begin
    +                              { Convert:
    +                                  movb x, %regl        movb x, %regl
    +                                  andw ffh, %regw      andl ffh, %regd
    +                                To:
    +                                  movzbw x, %regd      movzbl x, %regd
     
    -                        case taicpu(hp1).opsize of
    -                          S_W: NewSize := S_BW;
    -                          S_L: NewSize := S_BL;
    +                                (Identical registers, just different sizes)
    +                              }
    +                              RegName1 := debug_regname(taicpu(p).oper[1]^.reg); { 8-bit register name }
    +                              RegName2 := debug_regname(taicpu(hp1).oper[1]^.reg); { 16/32-bit register name }
    +
    +                              case taicpu(hp1).opsize of
    +                                S_W: NewSize := S_BW;
    +                                S_L: NewSize := S_BL;
     {$ifdef x86_64}
    -                          S_Q: NewSize := S_BQ;
    +                                S_Q: NewSize := S_BQ;
     {$endif x86_64}
    +                                else
    +                                  InternalError(2018011510);
    +                              end;
    +                            end
                               else
    -                            InternalError(2018011510);
    -                        end;
    -                      end
    -                    else
    -                      NewSize := S_NO;
    -                  S_W:
    -                    if (taicpu(hp1).oper[0]^.val = $ffff) then
    -                      begin
    -                        { Convert:
    -                            movw x, %regw
    -                            andl ffffh, %regd
    -                          To:
    -                            movzwl x, %regd
    +                            NewSize := S_NO;
    +                        S_W:
    +                          if (taicpu(hp1).oper[0]^.val = $ffff) then
    +                            begin
    +                              { Convert:
    +                                  movw x, %regw
    +                                  andl ffffh, %regd
    +                                To:
    +                                  movzwl x, %regd
     
    -                          (Identical registers, just different sizes)
    -                        }
    -                        RegName1 := debug_regname(taicpu(p).oper[1]^.reg); { 16-bit register name }
    -                        RegName2 := debug_regname(taicpu(hp1).oper[1]^.reg); { 32-bit register name }
    +                                (Identical registers, just different sizes)
    +                              }
    +                              RegName1 := debug_regname(taicpu(p).oper[1]^.reg); { 16-bit register name }
    +                              RegName2 := debug_regname(taicpu(hp1).oper[1]^.reg); { 32-bit register name }
     
    -                        case taicpu(hp1).opsize of
    -                          S_L: NewSize := S_WL;
    +                              case taicpu(hp1).opsize of
    +                                S_L: NewSize := S_WL;
     {$ifdef x86_64}
    -                          S_Q: NewSize := S_WQ;
    +                                S_Q: NewSize := S_WQ;
     {$endif x86_64}
    +                                else
    +                                  InternalError(2018011511);
    +                              end;
    +                            end
                               else
    -                            InternalError(2018011511);
    +                            NewSize := S_NO;
    +                        else
    +                          NewSize := S_NO;
    +                      end;
    +
    +                      if NewSize <> S_NO then
    +                        begin
    +                          PreMessage := 'mov' + debug_opsize2str(taicpu(p).opsize) + ' ' + InputVal + ',' + RegName1;
    +
    +                          { The actual optimization }
    +                          taicpu(p).opcode := A_MOVZX;
    +                          taicpu(p).changeopsize(NewSize);
    +                          taicpu(p).oper[1]^ := taicpu(hp1).oper[1]^;
    +
    +                          { Safeguard if "and" is followed by a conditional command }
    +                          TransferUsedRegs(TmpUsedRegs);
    +                          UpdateUsedRegs(TmpUsedRegs,tai(p.next));
    +
    +                          if (RegUsedAfterInstruction(NR_DEFAULTFLAGS, hp1, TmpUsedRegs)) then
    +                            begin
    +                              { At this point, the "and" command is effectively equivalent to
    +                                "test %reg,%reg". This will be handled separately by the
    +                                Peephole Optimizer. [Kit] }
    +
    +                              DebugMsg(SPeepholeOptimization + PreMessage +
    +                                ' -> movz' + debug_opsize2str(NewSize) + ' ' + InputVal + ',' + RegName2, p);
    +                            end
    +                          else
    +                            begin
    +                              DebugMsg(SPeepholeOptimization + PreMessage + '; and' + debug_opsize2str(taicpu(hp1).opsize) + ' $' + MaskNum + ',' + RegName2 +
    +                                ' -> movz' + debug_opsize2str(NewSize) + ' ' + InputVal + ',' + RegName2, p);
    +
    +                              asml.Remove(hp1);
    +                              hp1.Free;
    +                            end;
    +
    +                          Result := True;
    +                          Exit;
    +
                             end;
    -                      end
    -                    else
    -                      NewSize := S_NO;
    -                  else
    -                    NewSize := S_NO;
    +                    end;
                     end;
     
    -                if NewSize <> S_NO then
    +            { Optimisations where next instruction = MOV }
    +            A_MOV:
    +              begin
    +                if taicpu(hp1).opsize = taicpu(p).opsize then
                       begin
    -                    PreMessage := 'mov' + debug_opsize2str(taicpu(p).opsize) + ' ' + InputVal + ',' + RegName1;
    +                    if (taicpu(p).oper[1]^.typ = top_reg) and
    +                      MatchOperand(taicpu(p).oper[1]^,taicpu(hp1).oper[0]^) then
    +                      begin
    +                        { we have
    +                            mov x, %treg
    +                            mov %treg, y
    +                        }
     
    -                    { The actual optimization }
    -                    taicpu(p).opcode := A_MOVZX;
    -                    taicpu(p).changeopsize(NewSize);
    -                    taicpu(p).oper[1]^ := taicpu(hp1).oper[1]^;
    +                        if not(RegInOp(taicpu(p).oper[1]^.reg,taicpu(hp1).oper[1]^)) then
    +                          begin
    +                            if (TransferUsedRegs(TmpUsedRegs) and
    +                              UpdateUsedRegs(TmpUsedRegs, tai(p.Next)) and
    +                              RegUsedAfterInstruction(taicpu(p).oper[1]^.reg, hp1, TmpUsedRegs)) then
    +                              begin
    +                                { we've got
     
    -                    { Safeguard if "and" is followed by a conditional command }
    -                    TransferUsedRegs(TmpUsedRegs);
    -                    UpdateUsedRegs(TmpUsedRegs,tai(p.next));
    +                                  mov x, %treg
    +                                  mov %treg, y
     
    -                    if (RegUsedAfterInstruction(NR_DEFAULTFLAGS, hp1, TmpUsedRegs)) then
    -                      begin
    -                        { At this point, the "and" command is effectively equivalent to
    -                          "test %reg,%reg". This will be handled separately by the
    -                          Peephole Optimizer. [Kit] }
    +                                  ... but %treg is used afterwards.  We can optimise this to minimise a pipeline stall:
     
    -                        DebugMsg(SPeepholeOptimization + PreMessage +
    -                          ' -> movz' + debug_opsize2str(NewSize) + ' ' + InputVal + ',' + RegName2, p);
    -                      end
    -                    else
    -                      begin
    -                        DebugMsg(SPeepholeOptimization + PreMessage + '; and' + debug_opsize2str(taicpu(hp1).opsize) + ' $' + MaskNum + ',' + RegName2 +
    -                          ' -> movz' + debug_opsize2str(NewSize) + ' ' + InputVal + ',' + RegName2, p);
    +                                  mov x, %treg
    +                                  mov x, y
     
    -                        asml.Remove(hp1);
    -                        hp1.Free;
    +                                  x must be a constant or a register, and y must also be a register.  It can work if x
    +                                  is a reference that doesn't contain %treg, but this ends up using an AGU as well
    +                                  as an ALU and harms hyperthreading and instruction throughput. [Kit]
    +                                }
    +                                if (taicpu(hp1).oper[1]^.typ = top_reg) and (taicpu(p).oper[0]^.typ <> top_ref) then
    +                                  begin
    +
    +                                    if (taicpu(p).oper[0]^.typ = top_reg) then
    +                                      begin
    +
    +                                        if (
    +                                          (taicpu(p).oper[0]^.reg = taicpu(hp1).oper[1]^.reg) or
    +                                          (taicpu(hp1).oper[0]^.reg = taicpu(hp1).oper[1]^.reg)
    +                                        ) then
    +                                        begin
    +                                          { If %treg = x or y, then remove the second MOV }
    +                                          DebugMsg(SPeepholeOptimization + 'MovMov2Mov 1a',p);
    +                                          asml.remove(hp1);
    +                                          hp1.free;
    +                                          GetNextInstruction_p := GetNextInstruction(p, hp1);
    +                                          goto MovCaseBlock_CheckNext;
    +                                        end;
    +
    +                                        { Make sure the optimizer is aware that register x is used for an extra instruction }
    +                                        if taicpu(p).oper[0]^.typ = top_reg then
    +                                          AllocRegBetween(taicpu(p).oper[0]^.reg, p, hp1, UsedRegs);
    +                                      end;
    +
    +                                    taicpu(hp1).loadOper(0,taicpu(p).oper[0]^);
    +                                    DebugMsg(SPeepholeOptimization + 'mov x, %reg; mov %reg, y -> mov x, %reg; mov x, y', p);
    +                                    { Don't need to set the Result to True because the change was done to the next command }
    +
    +                                  end;
    +                              end
    +                            else
    +                              begin
    +                                { we've got
    +
    +                                  mov x, %treg
    +                                  mov %treg, y
    +
    +                                  where %treg is not used afterwards }
    +                                case taicpu(p).oper[0]^.typ Of
    +                                  top_reg:
    +                                    begin
    +                                      { change
    +                                          mov %reg, %treg
    +                                          mov %treg, y
    +
    +                                          to
    +
    +                                          mov %reg, y
    +                                      }
    +                                      if taicpu(hp1).oper[1]^.typ=top_reg then
    +                                        AllocRegBetween(taicpu(hp1).oper[1]^.reg,p,hp1,usedregs);
    +                                      taicpu(p).loadOper(1,taicpu(hp1).oper[1]^);
    +                                      DebugMsg(SPeepholeOptimization + 'MovMov2Mov 2 done',p);
    +                                      asml.remove(hp1);
    +                                      hp1.free;
    +                                      Result := True;
    +                                      Continue;
    +                                    end;
    +                                  top_const:
    +                                    begin
    +                                      { change
    +                                          mov const, %treg
    +                                          mov %treg, y
    +
    +                                          to
    +
    +                                          mov const, y
    +                                      }
    +                                      if (taicpu(hp1).oper[1]^.typ=top_reg) or
    +                                        ((taicpu(p).oper[0]^.val>=low(longint)) and (taicpu(p).oper[0]^.val<=high(longint))) then
    +                                        begin
    +                                          if taicpu(hp1).oper[1]^.typ=top_reg then
    +                                            AllocRegBetween(taicpu(hp1).oper[1]^.reg,p,hp1,usedregs);
    +                                          taicpu(p).loadOper(1,taicpu(hp1).oper[1]^);
    +                                          DebugMsg(SPeepholeOptimization + 'MovMov2Mov 5 done',p);
    +                                          asml.remove(hp1);
    +                                          hp1.free;
    +                                          Result := True;
    +                                          Continue;
    +                                        end;
    +                                    end;
    +                                  top_ref:
    +                                    if (taicpu(hp1).oper[1]^.typ = top_reg) then
    +                                      begin
    +                                        { change
    +                                             mov mem, %treg
    +                                             mov %treg, %reg
    +
    +                                             to
    +
    +                                             mov mem, %reg
    +                                        }
    +                                        AllocRegBetween(taicpu(hp1).oper[1]^.reg,p,hp1,usedregs);
    +                                        taicpu(p).loadoper(1,taicpu(hp1).oper[1]^);
    +                                        DebugMsg(SPeepholeOptimization + 'MovMov2Mov 3 done',p);
    +                                        asml.remove(hp1);
    +                                        hp1.free;
    +                                        Result:=true;
    +                                        Continue;
    +                                      end;
    +                                  else
    +                                    InternalError(2019071001);
    +                                end;
    +                              end;
    +                          end;
                           end;
     
    -                    Result := True;
    -                    Exit;
    +                    if (taicpu(hp1).oper[0]^.typ = taicpu(p).oper[1]^.typ) and
    +                     (taicpu(hp1).oper[1]^.typ = taicpu(p).oper[0]^.typ) then
    +                        {  mov reg1, mem1     or     mov mem1, reg1
    +                           mov mem2, reg2            mov reg2, mem2}
    +                      begin
    +                        if OpsEqual(taicpu(hp1).oper[1]^,taicpu(p).oper[0]^) then
    +                          { mov reg1, mem1     or     mov mem1, reg1
    +                            mov mem2, reg1            mov reg2, mem1}
    +                          begin
    +                            if OpsEqual(taicpu(hp1).oper[0]^,taicpu(p).oper[1]^) then
    +                              { Removes the second statement from
    +                                mov reg1, mem1/reg2
    +                                mov mem1/reg2, reg1 }
    +                              begin
    +                                if taicpu(p).oper[0]^.typ=top_reg then
    +                                  AllocRegBetween(taicpu(p).oper[0]^.reg,p,hp1,usedregs);
    +                                DebugMsg(SPeepholeOptimization + 'MovMov2Mov 1',p);
    +                                asml.remove(hp1);
    +                                hp1.free;
    +                                Result:=true;
    +                                Continue;
    +                              end
    +                            else
    +                              begin
    +                                if (taicpu(p).oper[1]^.typ = top_ref) and
    +                                  { mov reg1, mem1
    +                                    mov mem2, reg1 }
    +                                   (taicpu(hp1).oper[0]^.ref^.refaddr = addr_no) and
    +                                   GetNextInstruction(hp1, hp2) and
    +                                   MatchInstruction(hp2,A_CMP,[taicpu(p).opsize]) and
    +                                   OpsEqual(taicpu(p).oper[1]^,taicpu(hp2).oper[0]^) and
    +                                   OpsEqual(taicpu(p).oper[0]^,taicpu(hp2).oper[1]^) and
    +                                   not (
    +                                     TransferUsedRegs(TmpUsedRegs) and
    +                                     UpdateUsedRegs(TmpUsedRegs, tai(hp1.next)) and
    +                                     RegUsedAfterInstruction(taicpu(p).oper[0]^.reg, hp2, TmpUsedRegs)
    +                                   ) then
    +                                   { change                   to
    +                                     mov reg1, mem1           mov reg1, mem1
    +                                     mov mem2, reg1           cmp reg1, mem2
    +                                     cmp mem1, reg1
    +                                   }
    +                                  begin
    +                                    asml.remove(hp2);
    +                                    hp2.free;
    +                                    taicpu(hp1).opcode := A_CMP;
    +                                    taicpu(hp1).loadref(1,taicpu(hp1).oper[0]^.ref^);
    +                                    taicpu(hp1).loadreg(0,taicpu(p).oper[0]^.reg);
    +                                    AllocRegBetween(taicpu(p).oper[0]^.reg,p,hp1,UsedRegs);
    +                                    DebugMsg(SPeepholeOptimization + 'MovMovCmp2MovCmp done',hp1);
    +                                  end;
    +                              end;
    +                          end
    +                        else if (taicpu(p).oper[1]^.typ=top_ref) and
    +                          OpsEqual(taicpu(hp1).oper[0]^,taicpu(p).oper[1]^) then
    +                          begin
    +                            AllocRegBetween(taicpu(p).oper[0]^.reg,p,hp1,UsedRegs);
    +                            taicpu(hp1).loadreg(0,taicpu(p).oper[0]^.reg);
    +                            DebugMsg(SPeepholeOptimization + 'MovMov2MovMov1 done',p);
    +                          end
    +                        else
    +                          begin
    +                            if MatchOpType(taicpu(p),top_ref,top_reg) and
    +                              MatchOperand(taicpu(p).oper[1]^,taicpu(hp1).oper[0]^) and
    +                              (taicpu(hp1).oper[1]^.typ = top_ref) and
    +                              GetNextInstruction(hp1, hp2) and
    +                              MatchInstruction(hp2,A_MOV,[taicpu(p).opsize]) and
    +                              MatchOpType(taicpu(hp2),top_ref,top_reg) and
    +                              RefsEqual(taicpu(hp2).oper[0]^.ref^, taicpu(hp1).oper[1]^.ref^)  then
    +                              if not RegInRef(taicpu(hp2).oper[1]^.reg,taicpu(hp2).oper[0]^.ref^) and
    +                                 not (
    +                                   TransferUsedRegs(TmpUsedRegs) and
    +                                   RegUsedAfterInstruction(taicpu(p).oper[1]^.reg,hp1,tmpUsedRegs)
    +                                 ) then
    +                                {   mov mem1, %reg1
    +                                    mov %reg1, mem2
    +                                    mov mem2, reg2
    +                                 to:
    +                                    mov mem1, reg2
    +                                    mov reg2, mem2}
    +                                begin
    +                                  AllocRegBetween(taicpu(hp2).oper[1]^.reg,p,hp2,usedregs);
    +                                  DebugMsg(SPeepholeOptimization + 'MovMovMov2MovMov 1 done',p);
    +                                  taicpu(p).loadoper(1,taicpu(hp2).oper[1]^);
    +                                  taicpu(hp1).loadoper(0,taicpu(hp2).oper[1]^);
    +                                  asml.remove(hp2);
    +                                  hp2.free;
    +                                end
    +{$ifdef i386}
    +                              { this is enabled for i386 only: the rules to create the reg sets below
    +                                are too complicated for x86-64, which makes this code too error prone
    +                                on that platform
    +                              }
    +                              else if (taicpu(p).oper[1]^.reg <> taicpu(hp2).oper[1]^.reg) and
    +                                not(RegInRef(taicpu(p).oper[1]^.reg,taicpu(p).oper[0]^.ref^)) and
    +                                not(RegInRef(taicpu(hp2).oper[1]^.reg,taicpu(hp2).oper[0]^.ref^)) then
    +                                {   mov mem1, reg1         mov mem1, reg1
    +                                    mov reg1, mem2         mov reg1, mem2
    +                                    mov mem2, reg2         mov mem2, reg1
    +                                 to:                    to:
    +                                    mov mem1, reg1         mov mem1, reg1
    +                                    mov mem1, reg2         mov reg1, mem2
    +                                    mov reg1, mem2
     
    +                                 or (if mem1 depends on reg1
    +                              and/or if mem2 depends on reg2)
    +                                 to:
    +                                     mov mem1, reg1
    +                                     mov reg1, mem2
    +                                     mov reg1, reg2
    +                                }
    +                                begin
    +                                  taicpu(hp1).loadRef(0,taicpu(p).oper[0]^.ref^);
    +                                  taicpu(hp1).loadReg(1,taicpu(hp2).oper[1]^.reg);
    +                                  taicpu(hp2).loadRef(1,taicpu(hp2).oper[0]^.ref^);
    +                                  taicpu(hp2).loadReg(0,taicpu(p).oper[1]^.reg);
    +                                  AllocRegBetween(taicpu(p).oper[1]^.reg,p,hp2,usedregs);
    +                                  if (taicpu(p).oper[0]^.ref^.base <> NR_NO) and
    +                                     (getsupreg(taicpu(p).oper[0]^.ref^.base) in [RS_EAX,RS_EBX,RS_ECX,RS_EDX,RS_ESI,RS_EDI]) then
    +                                    AllocRegBetween(taicpu(p).oper[0]^.ref^.base,p,hp2,usedregs);
    +                                  if (taicpu(p).oper[0]^.ref^.index <> NR_NO) and
    +                                     (getsupreg(taicpu(p).oper[0]^.ref^.index) in [RS_EAX,RS_EBX,RS_ECX,RS_EDX,RS_ESI,RS_EDI]) then
    +                                    AllocRegBetween(taicpu(p).oper[0]^.ref^.index,p,hp2,usedregs);
    +                                end
    +                              else if (taicpu(hp1).Oper[0]^.reg <> taicpu(hp2).Oper[1]^.reg) then
    +                                begin
    +                                  taicpu(hp2).loadReg(0,taicpu(hp1).Oper[0]^.reg);
    +                                  AllocRegBetween(taicpu(p).oper[1]^.reg,p,hp2,usedregs);
    +                                end
    +                              else
    +                                begin
    +                                  asml.remove(hp2);
    +                                  hp2.free;
    +                                end
    +{$endif i386}
    +                                ;
    +                          end;
    +                      end;
                       end;
    -              end;
    -          end
    -        else if GetNextInstruction_p and
    -          MatchInstruction(hp1,A_MOV,[]) and
    -          (taicpu(p).oper[1]^.typ = top_reg) and
    -          MatchOperand(taicpu(p).oper[1]^,taicpu(hp1).oper[0]^) then
    -          begin
    -            TransferUsedRegs(TmpUsedRegs);
    -            UpdateUsedRegs(TmpUsedRegs, tai(p.Next));
    -            { we have
    -                mov x, %treg
    -                mov %treg, y
    -            }
    -            if not(RegInOp(taicpu(p).oper[1]^.reg,taicpu(hp1).oper[1]^)) and
    -               not(RegUsedAfterInstruction(taicpu(p).oper[1]^.reg, hp1, TmpUsedRegs)) then
    -              { we've got
    +    (*          { movl [mem1],reg1
    +                  movl [mem1],reg2
     
    -                mov x, %treg
    -                mov %treg, y
    +                  to
     
    -                with %treg is not used after }
    -              case taicpu(p).oper[0]^.typ Of
    -                top_reg:
    +                  movl [mem1],reg1
    +                  movl reg1,reg2
    +                 }
    +                if (taicpu(p).oper[0]^.typ = top_ref) and
    +                  (taicpu(p).oper[1]^.typ = top_reg) and
    +                  (taicpu(hp1).oper[0]^.typ = top_ref) and
    +                  (taicpu(hp1).oper[1]^.typ = top_reg) and
    +                  (taicpu(p).opsize = taicpu(hp1).opsize) and
    +                  RefsEqual(TReference(taicpu(p).oper[0]^^),taicpu(hp1).oper[0]^^.ref^) and
    +                  (taicpu(p).oper[1]^.reg<>taicpu(hp1).oper[0]^^.ref^.base) and
    +                  (taicpu(p).oper[1]^.reg<>taicpu(hp1).oper[0]^^.ref^.index) then
    +                  taicpu(hp1).loadReg(0,taicpu(p).oper[1]^.reg)
    +                *)
    +
    +                {   movl const1,[mem1]
    +                    movl [mem1],reg1
    +
    +                    to
    +
    +                    movl const1,reg1
    +                    movl reg1,[mem1]
    +                }
    +                if MatchOpType(Taicpu(p),top_const,top_ref) and
    +                     MatchOpType(Taicpu(hp1),top_ref,top_reg) and
    +                     (taicpu(p).opsize = taicpu(hp1).opsize) and
    +                     RefsEqual(taicpu(hp1).oper[0]^.ref^,taicpu(p).oper[1]^.ref^) and
    +                     not(RegInRef(taicpu(hp1).oper[1]^.reg,taicpu(hp1).oper[0]^.ref^)) then
                       begin
    -                    { change
    -                        mov %reg, %treg
    -                        mov %treg, y
    +                    AllocRegBetween(taicpu(hp1).oper[1]^.reg,p,hp1,usedregs);
    +                    taicpu(hp1).loadReg(0,taicpu(hp1).oper[1]^.reg);
    +                    taicpu(hp1).loadRef(1,taicpu(p).oper[1]^.ref^);
    +                    taicpu(p).loadReg(1,taicpu(hp1).oper[0]^.reg);
    +                    taicpu(hp1).fileinfo := taicpu(p).fileinfo;
    +                    DebugMsg(SPeepholeOptimization + 'MovMov2MovMov 1',p);
    +                  end
    +                {
    +                  mov*  x,reg1
    +                  mov*  y,reg1
     
    -                        to
    +                  to
     
    -                        mov %reg, y
    -                    }
    -                    if taicpu(hp1).oper[1]^.typ=top_reg then
    -                      AllocRegBetween(taicpu(hp1).oper[1]^.reg,p,hp1,usedregs);
    -                    taicpu(p).loadOper(1,taicpu(hp1).oper[1]^);
    -                    DebugMsg(SPeepholeOptimization + 'MovMov2Mov 2 done',p);
    -                    asml.remove(hp1);
    -                    hp1.free;
    +                  mov*  y,reg1
    +                }
    +                else if (taicpu(p).oper[1]^.typ=top_reg) and
    +                  MatchOperand(taicpu(p).oper[1]^,taicpu(hp1).oper[1]^) and
    +                  not(RegInOp(taicpu(p).oper[1]^.reg,taicpu(hp1).oper[0]^)) then
    +                  begin
    +                    DebugMsg(SPeepholeOptimization + 'MovMov2Mov 4 done',p);
    +                    { take care of the register (de)allocs following p }
    +                    UpdateUsedRegs(tai(p.next));
    +                    asml.remove(p);
    +                    p.free;
    +                    p:=hp1;
                         Result:=true;
    -                    Exit;
    +                    Continue;
    +                  end
    +                else if MOVRefOptimize then
    +                  begin
    +                    Result := True;
    +                    if MatchInstruction(hp1, A_MOV) then
    +                      Continue
    +                    else
    +                      Exit;
                       end;
    -                top_const:
    -                  begin
    -                    { change
    -                        mov const, %treg
    -                        mov %treg, y
    +              end;
     
    -                        to
    +            { Optimisations where next instruction = LEA }
    +            A_LEA:
    +{$ifdef x86_64}
    +              if (taicpu(hp1).opsize in [S_L,S_Q]) then
    +{$else x86_64}
    +              if (taicpu(hp1).opsize = S_L) then
    +{$endif x86_64}
    +                begin
    +                  { Optimise the LEA into something more manageable if possible,
    +                    but this requires temporarily advancing the used register tracker }
    +                  TransferUsedRegs(TmpUsedRegs);
    +                  UpdateUsedRegs(tai(p.next));
     
    -                        mov const, y
    -                    }
    -                    if (taicpu(hp1).oper[1]^.typ=top_reg) or
    -                      ((taicpu(p).oper[0]^.val>=low(longint)) and (taicpu(p).oper[0]^.val<=high(longint))) then
    -                      begin
    -                        if taicpu(hp1).oper[1]^.typ=top_reg then
    -                          AllocRegBetween(taicpu(hp1).oper[1]^.reg,p,hp1,usedregs);
    -                        taicpu(p).loadOper(1,taicpu(hp1).oper[1]^);
    -                        DebugMsg(SPeepholeOptimization + 'MovMov2Mov 5 done',p);
    -                        asml.remove(hp1);
    -                        hp1.free;
    -                        Result:=true;
    -                        Exit;
    -                      end;
    -                  end;
    -                top_ref:
    -                  if (taicpu(hp1).oper[1]^.typ = top_reg) then
    +                  HP_Result := OptPass1LEA(hp1);
    +
    +                  { Restore proper state }
    +                  RestoreUsedRegs(TmpUsedRegs);
    +
    +                  if HP_Result then
                         begin
    -                      { change
    -                           mov mem, %treg
    -                           mov %treg, %reg
    +                      if (hp1 = BlockEnd) or (hp1.typ <> ait_instruction) then
    +                        begin
    +                          Result := True;
    +                          Exit;
    +                        end;
     
    -                           to
    +                      if (taicpu(hp1).opcode <> A_LEA) then
    +                        { Go back to the start of the case block if hp1 was changed into something other than LEA }
    +                        goto MovCaseBlock;
    +                  end;
     
    -                           mov mem, %reg"
    -                      }
    -                      taicpu(p).loadoper(1,taicpu(hp1).oper[1]^);
    -                      DebugMsg(SPeepholeOptimization + 'MovMov2Mov 3 done',p);
    -                      asml.remove(hp1);
    -                      hp1.free;
    +                  if MatchOpType(Taicpu(p),top_ref,top_reg) and
    +                   ((MatchReference(Taicpu(hp1).oper[0]^.ref^,Taicpu(hp1).oper[1]^.reg,Taicpu(p).oper[1]^.reg) and
    +                     (Taicpu(hp1).oper[0]^.ref^.base<>Taicpu(p).oper[1]^.reg)
    +                    ) or
    +                    (MatchReference(Taicpu(hp1).oper[0]^.ref^,Taicpu(p).oper[1]^.reg,Taicpu(hp1).oper[1]^.reg) and
    +                     (Taicpu(hp1).oper[0]^.ref^.index<>Taicpu(p).oper[1]^.reg)
    +                    )
    +                    { reg1 may not be used afterwards }
    +                  ) and not(RegUsedAfterInstruction(taicpu(p).oper[1]^.reg, hp1, TmpUsedRegs))
    +                  then
    +                    { mov reg1,ref
    +                      lea reg2,[reg1,reg2]
    +
    +                      to
    +
    +                      add reg2,ref}
    +                    begin
    +                      Taicpu(hp1).opcode:=A_ADD;
    +                      Taicpu(hp1).oper[0]^.ref^:=Taicpu(p).oper[0]^.ref^;
    +                      DebugMsg(SPeepholeOptimization + 'MovLea2Add done',hp1);
    +                      UpdateUsedRegs(tai(p.Next));
    +                      asml.remove(p);
    +                      p.free;
    +                      p:=hp1;
                           Result:=true;
                           Exit;
                         end;
    -                else
    -                  ;
    -              end;
    -          end
    -        else
    +                end;
    +
    +            { Optimisations where next instruction = TEST or = CMP }
    +            A_TEST, A_CMP:
    +              { change
    +                  mov reg1, mem1
    +                  test/cmp x, mem1
    +
    +                  to
    +
    +                  mov reg1, mem1
    +                  test/cmp x, reg1
    +              }
    +              if MatchOpType(taicpu(p),top_reg,top_ref) and
    +                (taicpu(hp1).opsize = taicpu(p).opsize) and
    +                (taicpu(hp1).oper[1]^.typ = top_ref) and
    +                RefsEqual(taicpu(p).oper[1]^.ref^, taicpu(hp1).oper[1]^.ref^) then
    +                begin
    +                  taicpu(hp1).loadreg(1,taicpu(p).oper[0]^.reg);
    +                  DebugMsg(SPeepholeOptimization + 'MovTestCmp2MovTestCmp 1',hp1);
    +                  AllocRegBetween(taicpu(p).oper[0]^.reg,p,hp1,usedregs);
    +                  { Structure of operations hasn't changed, so fall through the
    +                    case block to see what else can be done }
    +                end;
    +
    +            { Optimisations where next instruction = BTS or = BTR }
    +            A_BTS, A_BTR:
    +              if MatchInstruction(hp1,A_BTS,A_BTR,[Taicpu(p).opsize]) and
    +                MatchOperand(Taicpu(p).oper[0]^,0) and
    +                (Taicpu(p).oper[1]^.typ = top_reg) and
    +                MatchOperand(Taicpu(p).oper[1]^,Taicpu(hp1).oper[1]^) and
    +                GetNextInstruction(hp1, hp2) and
    +                MatchInstruction(hp2,A_OR,[Taicpu(p).opsize]) and
    +                MatchOperand(Taicpu(p).oper[1]^,Taicpu(hp2).oper[1]^) then
    +                { mov reg1,0
    +                  bts reg1,operand1             -->      mov reg1,operand2
    +                  or  reg1,operand2                      bts reg1,operand1}
    +                begin
    +                  Taicpu(hp2).opcode:=A_MOV;
    +                  asml.remove(hp1);
    +                  insertllitem(hp2,hp2.next,hp1);
    +                  asml.remove(p);
    +                  p.free;
    +                  p:=hp2;
    +
    +                  { hp2 is a MOV command, so it's safe to continue }
    +                  Continue;
    +                end;
    +
    +            { Optimisations where next instruction = MOVZX or = MOVSX or = MOVSXD }
    +            A_MOVZX, A_MOVSX {$ifdef x86_64}, A_MOVSXD{$endif x86_64}:
    +              if MatchOpType(taicpu(p),top_reg,top_reg) then
    +                begin
    +                  if MOVRefOptimize then
    +                    begin
    +                      Result := True;
    +                      if MatchInstruction(hp1, A_MOV) then
    +                        Continue
    +                      else
    +                        Exit;
    +                    end
    +                  else if MatchOpType(taicpu(hp1),top_reg,top_reg) and
    +                    (taicpu(hp1).oper[0]^.reg = taicpu(p).oper[1]^.reg) then
    +                    { mov reg1, reg2                mov reg1, reg2
    +                      movzx/sx reg2, reg3      to   movzx/sx reg1, reg3}
    +                    begin
    +                      taicpu(hp1).oper[0]^.reg := taicpu(p).oper[0]^.reg;
    +                      DebugMsg(SPeepholeOptimization + 'mov %reg1,%reg2; movzx/sx %reg2,%reg3 -> mov %reg1,%reg2; movzx/sx %reg1,%reg3',p);
    +
    +                      { Only remove the MOV command if reg2 isn't used afterwards,
    +                        or if supreg(reg3) = supreg(reg2). [Kit] }
    +
    +
    +                      if (getsupreg(taicpu(p).oper[1]^.reg) = getsupreg(taicpu(hp1).oper[1]^.reg)) or
    +                        not (
    +                          TransferUsedRegs(TmpUsedRegs) and
    +                          UpdateUsedRegs(TmpUsedRegs, tai(p.next)) and
    +                          UpdateUsedRegs(TmpUsedRegs, tai(hp1.next)) and
    +                          RegUsedAfterInstruction(taicpu(p).oper[1]^.reg, hp1, TmpUsedRegs)
    +                        )
    +                      then
    +                        begin
    +                          asml.remove(p);
    +                          p.free;
    +                          p := hp1;
    +                          Result:=true;
    +                        end;
    +
    +                      exit;
    +                    end;
    +                end;
    +
    +            { Last of the two-instruction optimisations: }
    +            else
    +              { leave out the mov from "mov reg, x(%frame_pointer); leave/ret" (with
    +                x >= RetOffset) as it doesn't do anything (it writes either to a
    +                parameter or to the temporary storage room for the function
    +                result)
    +              }
    +
    +              if IsExitCode(hp1) and
    +                MatchOpType(taicpu(p),top_reg,top_ref) and
    +                (taicpu(p).oper[1]^.ref^.base = current_procinfo.FramePointer) and
    +                not(assigned(current_procinfo.procdef.funcretsym) and
    +                   (taicpu(p).oper[1]^.ref^.offset < tabstractnormalvarsym(current_procinfo.procdef.funcretsym).localloc.reference.offset)) and
    +                (taicpu(p).oper[1]^.ref^.index = NR_NO) then
    +                begin
    +                  asml.remove(p);
    +                  p.free;
    +                  p:=hp1;
    +                  DebugMsg(SPeepholeOptimization + 'removed deadstore before leave/ret',p);
    +                  RemoveLastDeallocForFuncRes(p);
    +                  Result:=true;
    +                  exit;
    +                end;
    +
    +          end;
    +
    +          { Miscellaneous optimisations }
    +
               { Change
                  mov %reg1, %reg2
                  xxx %reg2, ???
    @@ -1472,9 +2409,10 @@
     
                  to avoid a write/read penalty
               }
    +
    +          { NOTE: Don't put this in the case block above, otherwise it won't be
    +            called if hp1.opcode = A_AND. [Kit] }
               if MatchOpType(taicpu(p),top_reg,top_reg) and
    -             GetNextInstruction(p,hp1) and
    -             (tai(hp1).typ = ait_instruction) and
                  (taicpu(hp1).ops >= 1) and
                  MatchOperand(taicpu(p).oper[1]^,taicpu(hp1).oper[0]^) then
                 { we have
    @@ -1496,12 +2434,12 @@
                     begin
                       TransferUsedRegs(TmpUsedRegs);
                       { reg1 will be used after the first instruction,
    -                    so update the allocation info                  }
    +                    so update the allocation info }
                       AllocRegBetween(taicpu(p).oper[0]^.reg,p,hp1,usedregs);
    -                  if GetNextInstruction(hp1, hp2) and
    -                     (hp2.typ = ait_instruction) and
    -                     taicpu(hp2).is_jmp and
    -                     not(RegUsedAfterInstruction(taicpu(hp1).oper[0]^.reg, hp1, TmpUsedRegs)) then
    +                  if not(RegUsedAfterInstruction(taicpu(hp1).oper[0]^.reg, hp1, TmpUsedRegs)) and
    +                    GetNextInstruction(hp1, hp2) and
    +                    (hp2.typ = ait_instruction) and
    +                    taicpu(hp2).is_jmp then
                           { change
     
                             mov %reg1, %reg2
    @@ -1516,11 +2454,12 @@
                           begin
                             taicpu(hp1).loadoper(0,taicpu(p).oper[0]^);
                             taicpu(hp1).loadoper(1,taicpu(p).oper[0]^);
    +                        taicpu(hp1).opcode := A_TEST; { Changing it now saves on some unnecessary processing later }
                             DebugMsg(SPeepholeOptimization + 'MovTestJxx2TestMov done',p);
                             asml.remove(p);
                             p.free;
                             p := hp1;
    -                        Exit;
    +                        Result := True;
                           end
                         else
                           { change
    @@ -1538,460 +2477,403 @@
                             taicpu(hp1).loadoper(0,taicpu(p).oper[0]^);
                             taicpu(hp1).loadoper(1,taicpu(p).oper[0]^);
                             DebugMsg(SPeepholeOptimization + 'MovTestJxx2MovTestJxx done',p);
    +                        { Don't need to set Result to true because the MOV itself wasn't changed }
                           end;
    -                end
    -            end
    -        else
    -          { leave out the mov from "mov reg, x(%frame_pointer); leave/ret" (with
    -            x >= RetOffset) as it doesn't do anything (it writes either to a
    -            parameter or to the temporary storage room for the function
    -            result)
    -          }
    -          if GetNextInstruction_p and
    -            (tai(hp1).typ = ait_instruction) then
    +                  Exit;
    +                end;
    +            end;
    +
    +          if (taicpu(p).oper[1]^.typ = top_reg) and GetNextInstruction(hp1, hp2) then
                 begin
    -              if IsExitCode(hp1) and
    -                MatchOpType(taicpu(p),top_reg,top_ref) and
    -                (taicpu(p).oper[1]^.ref^.base = current_procinfo.FramePointer) and
    -                not(assigned(current_procinfo.procdef.funcretsym) and
    -                   (taicpu(p).oper[1]^.ref^.offset < tabstractnormalvarsym(current_procinfo.procdef.funcretsym).localloc.reference.offset)) and
    -                (taicpu(p).oper[1]^.ref^.index = NR_NO) then
    +              if MatchInstruction(hp2,A_MOV) and
    +                (taicpu(hp2).oper[0]^.typ = top_reg) and
    +                (SuperRegistersEqual(taicpu(hp2).oper[0]^.reg,taicpu(p).oper[1]^.reg)) and
    +                (
    +{$ifdef x86_64}
    +                  (
    +                    { The upper 32 bits of a register are guaranteed to be set to zero if only the lower 32 bits are written }
    +                    (taicpu(hp1).opsize = S_Q) and (taicpu(p).opsize >= S_L) and (taicpu(hp2).opsize = taicpu(p).opsize) and
    +                    IsFoldableArithOp(taicpu(hp1), newreg(R_INTREGISTER, getsupreg(taicpu(p).oper[1]^.reg), R_SUBQ))
    +                  ) or
    +{$endif x86_64}
    +                  (
    +                    { This inequality works because S_NO, S_B, S_W, S_L and S_Q are
    +                    in sequential order, and a MOV cannot be of size S_NO. [Kit] }
    +                    (taicpu(hp2).opsize <= taicpu(p).opsize) and
    +                    (
    +                      (
    +                        (taicpu(hp1).opsize = S_L) and
    +                        IsFoldableArithOp(taicpu(hp1), newreg(R_INTREGISTER, getsupreg(taicpu(p).oper[1]^.reg), R_SUBD))
    +                      ) or
    +                      (
    +                        (taicpu(hp1).opsize = S_W) and
    +                        IsFoldableArithOp(taicpu(hp1), newreg(R_INTREGISTER, getsupreg(taicpu(p).oper[1]^.reg), R_SUBW))
    +                      ) or
    +                      (
    +                        (taicpu(hp1).opsize = S_B) and
    +                        IsFoldableArithOp(taicpu(hp1), newreg(R_INTREGISTER, getsupreg(taicpu(p).oper[1]^.reg), R_SUBL))
    +                      )
    +                    )
    +                  )
    +                ) then
                     begin
    -                  asml.remove(p);
    -                  p.free;
    -                  p:=hp1;
    -                  DebugMsg(SPeepholeOptimization + 'removed deadstore before leave/ret',p);
    -                  RemoveLastDeallocForFuncRes(p);
    -                  exit;
    -                end
    -              { change
    -                  mov reg1, mem1
    -                  test/cmp x, mem1
    -
    -                  to
    -
    -                  mov reg1, mem1
    -                  test/cmp x, reg1
    -              }
    -              else if MatchOpType(taicpu(p),top_reg,top_ref) and
    -                  MatchInstruction(hp1,A_CMP,A_TEST,[taicpu(p).opsize]) and
    -                  (taicpu(hp1).oper[1]^.typ = top_ref) and
    -                   RefsEqual(taicpu(p).oper[1]^.ref^, taicpu(hp1).oper[1]^.ref^) then
    -                  begin
    -                    taicpu(hp1).loadreg(1,taicpu(p).oper[0]^.reg);
    -                    DebugMsg(SPeepholeOptimization + 'MovTestCmp2MovTestCmp 1',hp1);
    -                    AllocRegBetween(taicpu(p).oper[0]^.reg,p,hp1,usedregs);
    -                  end;
    -            end;
    -
    -        { Next instruction is also a MOV ? }
    -        if GetNextInstruction_p and
    -          MatchInstruction(hp1,A_MOV,[taicpu(p).opsize]) then
    -          begin
    -            if (taicpu(hp1).oper[0]^.typ = taicpu(p).oper[1]^.typ) and
    -               (taicpu(hp1).oper[1]^.typ = taicpu(p).oper[0]^.typ) then
    -                {  mov reg1, mem1     or     mov mem1, reg1
    -                   mov mem2, reg2            mov reg2, mem2}
    -              begin
    -                if OpsEqual(taicpu(hp1).oper[1]^,taicpu(p).oper[0]^) then
    -                  { mov reg1, mem1     or     mov mem1, reg1
    -                    mov mem2, reg1            mov reg2, mem1}
    -                  begin
    -                    if OpsEqual(taicpu(hp1).oper[0]^,taicpu(p).oper[1]^) then
    -                      { Removes the second statement from
    -                        mov reg1, mem1/reg2
    -                        mov mem1/reg2, reg1 }
    -                      begin
    -                        if taicpu(p).oper[0]^.typ=top_reg then
    -                          AllocRegBetween(taicpu(p).oper[0]^.reg,p,hp1,usedregs);
    -                        DebugMsg(SPeepholeOptimization + 'MovMov2Mov 1',p);
    -                        asml.remove(hp1);
    -                        hp1.free;
    -                        Result:=true;
    -                        exit;
    -                      end
    -                    else
    -                      begin
    -                        TransferUsedRegs(TmpUsedRegs);
    -                        UpdateUsedRegs(TmpUsedRegs, tai(hp1.next));
    -                        if (taicpu(p).oper[1]^.typ = top_ref) and
    -                          { mov reg1, mem1
    -                            mov mem2, reg1 }
    -                           (taicpu(hp1).oper[0]^.ref^.refaddr = addr_no) and
    -                           GetNextInstruction(hp1, hp2) and
    -                           MatchInstruction(hp2,A_CMP,[taicpu(p).opsize]) and
    -                           OpsEqual(taicpu(p).oper[1]^,taicpu(hp2).oper[0]^) and
    -                           OpsEqual(taicpu(p).oper[0]^,taicpu(hp2).oper[1]^) and
    -                           not(RegUsedAfterInstruction(taicpu(p).oper[0]^.reg, hp2, TmpUsedRegs)) then
    -                           { change                   to
    -                             mov reg1, mem1           mov reg1, mem1
    -                             mov mem2, reg1           cmp reg1, mem2
    -                             cmp mem1, reg1
    -                           }
    -                          begin
    -                            asml.remove(hp2);
    -                            hp2.free;
    -                            taicpu(hp1).opcode := A_CMP;
    -                            taicpu(hp1).loadref(1,taicpu(hp1).oper[0]^.ref^);
    -                            taicpu(hp1).loadreg(0,taicpu(p).oper[0]^.reg);
    -                            AllocRegBetween(taicpu(p).oper[0]^.reg,p,hp1,UsedRegs);
    -                            DebugMsg(SPeepholeOptimization + 'MovMovCmp2MovCmp done',hp1);
    +                  if OpsEqual(taicpu(hp2).oper[1]^, taicpu(p).oper[0]^) then
    +                    { change   movq           reg/ref, reg2
    +                               add/sub/or/... reg3/$const, reg2
    +                               mov            reg2, reg/ref
    +                               dealloc        reg2
    +                      to
    +                               add/sub/or/... reg3/$const, reg/ref      }
    +                    begin
    +                      TransferUsedRegs(TmpUsedRegs);
    +                      UpdateUsedRegs(TmpUsedRegs, tai(p.next));
    +                      UpdateUsedRegs(TmpUsedRegs, tai(hp1.next));
    +                      If not(RegUsedAfterInstruction(taicpu(p).oper[1]^.reg,hp2,TmpUsedRegs)) then
    +                        begin
    +                          { by example:
    +                              movq    %rsi,%rax       movq    %rsi,%rax     p
    +                              decl    %eax            addl    %edx,%eax     hp1
    +                              movw    %ax,%si         movw    %ax,%si       hp2
    +                            ->
    +                              movq    %rsi,%rax       movq    %rsi,%rax     p
    +                              decw    %ax             addw    %dx,%ax       hp1
    +                              movw    %ax,%si         movw    %ax,%si       hp2
    +                          }
    +                          DebugMsg(SPeepholeOptimization + 'MovOpMov2Op ('+
    +                                debug_op2str(taicpu(p).opcode)+debug_opsize2str(taicpu(p).opsize)+' '+
    +                                debug_op2str(taicpu(hp1).opcode)+debug_opsize2str(taicpu(hp1).opsize)+' '+
    +                                debug_op2str(taicpu(hp2).opcode)+debug_opsize2str(taicpu(hp2).opsize)+')',p);
    +                          taicpu(hp1).changeopsize(taicpu(hp2).opsize);
    +                          {
    +                            ->
    +                              movq    %rsi,%rax       movq    %rsi,%rax     p
    +                              decw    %si             addw    %dx,%si       hp1
    +                              movw    %ax,%si         movw    %ax,%si       hp2
    +                          }
    +                          case taicpu(hp1).ops of
    +                            1:
    +                              begin
    +                                taicpu(hp1).loadoper(0, taicpu(hp2).oper[1]^);
    +                                if taicpu(hp1).oper[0]^.typ=top_reg then
    +                                  setsubreg(taicpu(hp1).oper[0]^.reg,getsubreg(taicpu(hp2).oper[0]^.reg));
    +                              end;
    +                            2:
    +                              begin
    +                                taicpu(hp1).loadoper(1, taicpu(hp2).oper[1]^);
    +                                if (taicpu(hp1).oper[0]^.typ=top_reg) and
    +                                  (taicpu(hp1).opcode<>A_SHL) and
    +                                  (taicpu(hp1).opcode<>A_SHR) and
    +                                  (taicpu(hp1).opcode<>A_SAR) then
    +                                  setsubreg(taicpu(hp1).oper[0]^.reg,getsubreg(taicpu(hp2).oper[0]^.reg));
    +                              end;
    +                            else
    +                              internalerror(2008042701);
                               end;
    -                      end;
    -                  end
    -                else if (taicpu(p).oper[1]^.typ=top_ref) and
    -                  OpsEqual(taicpu(hp1).oper[0]^,taicpu(p).oper[1]^) then
    -                  begin
    -                    AllocRegBetween(taicpu(p).oper[0]^.reg,p,hp1,UsedRegs);
    -                    taicpu(hp1).loadreg(0,taicpu(p).oper[0]^.reg);
    -                    DebugMsg(SPeepholeOptimization + 'MovMov2MovMov1 done',p);
    -                  end
    -                else
    -                  begin
    -                    TransferUsedRegs(TmpUsedRegs);
    -                    if GetNextInstruction(hp1, hp2) and
    -                      MatchOpType(taicpu(p),top_ref,top_reg) and
    -                      MatchOperand(taicpu(p).oper[1]^,taicpu(hp1).oper[0]^) and
    -                      (taicpu(hp1).oper[1]^.typ = top_ref) and
    -                      MatchInstruction(hp2,A_MOV,[taicpu(p).opsize]) and
    -                      MatchOpType(taicpu(hp2),top_ref,top_reg) and
    -                      RefsEqual(taicpu(hp2).oper[0]^.ref^, taicpu(hp1).oper[1]^.ref^)  then
    -                      if not RegInRef(taicpu(hp2).oper[1]^.reg,taicpu(hp2).oper[0]^.ref^) and
    -                         not(RegUsedAfterInstruction(taicpu(p).oper[1]^.reg,hp1,tmpUsedRegs)) then
    -                        {   mov mem1, %reg1
    -                            mov %reg1, mem2
    -                            mov mem2, reg2
    -                         to:
    -                            mov mem1, reg2
    -                            mov reg2, mem2}
    -                        begin
    -                          AllocRegBetween(taicpu(hp2).oper[1]^.reg,p,hp2,usedregs);
    -                          DebugMsg(SPeepholeOptimization + 'MovMovMov2MovMov 1 done',p);
    -                          taicpu(p).loadoper(1,taicpu(hp2).oper[1]^);
    -                          taicpu(hp1).loadoper(0,taicpu(hp2).oper[1]^);
    +                          {
    +                            ->
    +                              decw    %si             addw    %dx,%si       p
    +                          }
    +                          UpdateUsedRegs(tai(p.Next));
    +                          asml.remove(p);
                               asml.remove(hp2);
    -                          hp2.free;
    -                        end
    +                          p.Free;
    +                          hp2.Free;
    +                          p := hp1;
    +                          Result:=true;
    +                          Exit;
    +                        end;
    +                    end
    +                  else if (taicpu(hp2).oper[1]^.typ = top_reg) and
    +                    not(SuperRegistersEqual(taicpu(hp1).oper[0]^.reg,taicpu(hp2).oper[1]^.reg))
     {$ifdef i386}
    -                      { this is enabled for i386 only, as the rules to create the reg sets below
    -                        are too complicated for x86-64, so this makes this code too error prone
    -                        on x86-64
    -                      }
    -                      else if (taicpu(p).oper[1]^.reg <> taicpu(hp2).oper[1]^.reg) and
    -                        not(RegInRef(taicpu(p).oper[1]^.reg,taicpu(p).oper[0]^.ref^)) and
    -                        not(RegInRef(taicpu(hp2).oper[1]^.reg,taicpu(hp2).oper[0]^.ref^)) then
    -                        {   mov mem1, reg1         mov mem1, reg1
    -                            mov reg1, mem2         mov reg1, mem2
    -                            mov mem2, reg2         mov mem2, reg1
    -                         to:                    to:
    -                            mov mem1, reg1         mov mem1, reg1
    -                            mov mem1, reg2         mov reg1, mem2
    -                            mov reg1, mem2
    +                    { byte registers of esi, edi, ebp, esp are not available on i386 }
    +                    and (
    +                      (taicpu(hp2).opsize<>S_B) or
    +                      not (
    +                        (getsupreg(taicpu(p).oper[0]^.reg) in [RS_ESI,RS_EDI,RS_EBP,RS_ESP]) or
    +                        (getsupreg(taicpu(hp1).oper[0]^.reg) in [RS_ESI,RS_EDI,RS_EBP,RS_ESP])
    +                      )
    +                    )
    +{$endif i386}
    +                    then
    +                    { change   movq           reg/ref, reg2
    +                               add/sub/or/... regX/$const, reg2
    +                               mov            reg2, reg3
    +                               dealloc        reg2
    +                      to
    +                               movq           reg/ref, reg3
    +                               add/sub/or/... regX/$const, reg3
    +                    }
    +                    begin
    +                      TransferUsedRegs(TmpUsedRegs);
    +                      UpdateUsedRegs(TmpUsedRegs, tai(p.next));
    +                      UpdateUsedRegs(TmpUsedRegs, tai(hp1.next));
    +                      If not(RegUsedAfterInstruction(taicpu(p).oper[1]^.reg,hp2,TmpUsedRegs)) then
    +                        begin
    +                          { by example:
    +                              movswl  %si,%eax        movswl  %si,%eax      p
    +                              decl    %eax            addl    %edx,%eax     hp1
    +                              movw    %ax,%si         movw    %ax,%si       hp2
    +                            ->
    +                              movswl  %si,%eax        movswl  %si,%eax      p
    +                              decw    %ax             addw    %dx,%ax       hp1
    +                              movw    %ax,%si         movw    %ax,%si       hp2
    +                          }
    +                          DebugMsg(SPeepholeOptimization + 'MovOpMov2MovOp ('+
    +                                debug_op2str(taicpu(p).opcode)+debug_opsize2str(taicpu(p).opsize)+' '+
    +                                debug_op2str(taicpu(hp1).opcode)+debug_opsize2str(taicpu(hp1).opsize)+' '+
    +                                debug_op2str(taicpu(hp2).opcode)+debug_opsize2str(taicpu(hp2).opsize)+')',p);
    +                          taicpu(hp1).changeopsize(taicpu(hp2).opsize);
    +                          taicpu(p).changeopsize(taicpu(hp2).opsize);
    +                          if taicpu(p).oper[0]^.typ=top_reg then
    +                            setsubreg(taicpu(p).oper[0]^.reg,getsubreg(taicpu(hp2).oper[0]^.reg));
     
    -                         or (if mem1 depends on reg1
    -                      and/or if mem2 depends on reg2)
    -                         to:
    -                             mov mem1, reg1
    -                             mov reg1, mem2
    -                             mov reg1, reg2
    -                        }
    -                        begin
    -                          taicpu(hp1).loadRef(0,taicpu(p).oper[0]^.ref^);
    -                          taicpu(hp1).loadReg(1,taicpu(hp2).oper[1]^.reg);
    -                          taicpu(hp2).loadRef(1,taicpu(hp2).oper[0]^.ref^);
    -                          taicpu(hp2).loadReg(0,taicpu(p).oper[1]^.reg);
    -                          AllocRegBetween(taicpu(p).oper[1]^.reg,p,hp2,usedregs);
    -                          if (taicpu(p).oper[0]^.ref^.base <> NR_NO) and
    -                             (getsupreg(taicpu(p).oper[0]^.ref^.base) in [RS_EAX,RS_EBX,RS_ECX,RS_EDX,RS_ESI,RS_EDI]) then
    -                            AllocRegBetween(taicpu(p).oper[0]^.ref^.base,p,hp2,usedregs);
    -                          if (taicpu(p).oper[0]^.ref^.index <> NR_NO) and
    -                             (getsupreg(taicpu(p).oper[0]^.ref^.index) in [RS_EAX,RS_EBX,RS_ECX,RS_EDX,RS_ESI,RS_EDI]) then
    -                            AllocRegBetween(taicpu(p).oper[0]^.ref^.index,p,hp2,usedregs);
    -                        end
    -                      else if (taicpu(hp1).Oper[0]^.reg <> taicpu(hp2).Oper[1]^.reg) then
    -                        begin
    -                          taicpu(hp2).loadReg(0,taicpu(hp1).Oper[0]^.reg);
    -                          AllocRegBetween(taicpu(p).oper[1]^.reg,p,hp2,usedregs);
    -                        end
    -                      else
    -                        begin
    +                          taicpu(p).loadoper(1, taicpu(hp2).oper[1]^);
    +                          AllocRegBetween(taicpu(p).oper[1]^.reg,p,hp1,usedregs);
    +                          {
    +                            ->
    +                              movswl  %si,%eax        movswl  %si,%eax      p
    +                              decw    %si             addw    %dx,%si       hp1
    +                              movw    %ax,%si         movw    %ax,%si       hp2
    +                          }
    +                          case taicpu(hp1).ops of
    +                            1:
    +                              begin
    +                                taicpu(hp1).loadoper(0, taicpu(hp2).oper[1]^);
    +                                if taicpu(hp1).oper[0]^.typ=top_reg then
    +                                  setsubreg(taicpu(hp1).oper[0]^.reg,getsubreg(taicpu(hp2).oper[0]^.reg));
    +                              end;
    +                            2:
    +                              begin
    +                                taicpu(hp1).loadoper(1, taicpu(hp2).oper[1]^);
    +                                if (taicpu(hp1).oper[0]^.typ=top_reg) and
    +                                  (taicpu(hp1).opcode<>A_SHL) and
    +                                  (taicpu(hp1).opcode<>A_SHR) and
    +                                  (taicpu(hp1).opcode<>A_SAR) then
    +                                  setsubreg(taicpu(hp1).oper[0]^.reg,getsubreg(taicpu(hp2).oper[0]^.reg));
    +                              end;
    +                            else
    +                              internalerror(2018111801);
    +                          end;
    +                          {
    +                            ->
    +                              decw    %si             addw    %dx,%si       p
    +                          }
                               asml.remove(hp2);
    -                          hp2.free;
    -                        end
    -{$endif i386}
    -                        ;
    -                  end;
    -              end
    -(*          { movl [mem1],reg1
    -              movl [mem1],reg2
    +                          hp2.Free;
    +                          Continue;
    +                        end;
    +                    end;
    +{$ifdef x86_64}
    +                end
    +              else if (taicpu(p).opsize = S_L) and
    +                (
    +                  MatchInstruction(hp1, A_MOV) and
    +                  (taicpu(hp1).opsize = S_L) and
    +                  (taicpu(hp1).oper[1]^.typ = top_reg)
    +                ) and (
    +                  (tai(hp2).typ=ait_instruction) and
    +                  (taicpu(hp2).opsize = S_Q) and
    +                  (
    +                    (
    +                      MatchInstruction(hp2, A_ADD) and
    +                      (taicpu(hp2).opsize = S_Q) and
    +                      (taicpu(hp2).oper[0]^.typ = top_reg) and (taicpu(hp2).oper[1]^.typ = top_reg) and
    +                      (
    +                        (
    +                          (getsupreg(taicpu(hp2).oper[0]^.reg) = getsupreg(taicpu(p).oper[1]^.reg)) and
    +                          (getsupreg(taicpu(hp2).oper[1]^.reg) = getsupreg(taicpu(hp1).oper[1]^.reg))
    +                        ) or (
    +                          (getsupreg(taicpu(hp2).oper[0]^.reg) = getsupreg(taicpu(hp1).oper[1]^.reg)) and
    +                          (getsupreg(taicpu(hp2).oper[1]^.reg) = getsupreg(taicpu(p).oper[1]^.reg))
    +                        )
    +                      )
    +                    ) or (
    +                      MatchInstruction(hp2, A_LEA) and
    +                      (taicpu(hp2).oper[0]^.ref^.offset = 0) and
    +                      (taicpu(hp2).oper[0]^.ref^.scalefactor <= 1) and
    +                      (
    +                        (
    +                          (getsupreg(taicpu(hp2).oper[0]^.ref^.base) = getsupreg(taicpu(p).oper[1]^.reg)) and
    +                          (getsupreg(taicpu(hp2).oper[0]^.ref^.index) = getsupreg(taicpu(hp1).oper[1]^.reg))
    +                        ) or (
    +                          (getsupreg(taicpu(hp2).oper[0]^.ref^.base) = getsupreg(taicpu(hp1).oper[1]^.reg)) and
    +                          (getsupreg(taicpu(hp2).oper[0]^.ref^.index) = getsupreg(taicpu(p).oper[1]^.reg))
    +                        )
    +                      ) and (
    +                        (
    +                          (getsupreg(taicpu(hp2).oper[1]^.reg) = getsupreg(taicpu(hp1).oper[1]^.reg))
    +                        ) or (
    +                          (getsupreg(taicpu(hp2).oper[1]^.reg) = getsupreg(taicpu(p).oper[1]^.reg))
    +                        )
    +                      )
    +                    )
    +                  )
    +                ) and (
    +                  GetNextInstruction(hp2, hp3) and
    +                  MatchInstruction(hp3, A_SHR) and
    +                  (taicpu(hp3).opsize = S_Q) and
    +                  (taicpu(hp3).oper[0]^.typ = top_const) and (taicpu(hp3).oper[1]^.typ = top_reg) and
    +                  (taicpu(hp3).oper[0]^.val = 1) and
    +                  (taicpu(hp3).oper[1]^.reg = taicpu(hp2).oper[1]^.reg)
    +                ) then
    +                begin
    +                  { Change   movl    x,    reg1d         movl    x,    reg1d
    +                             movl    y,    reg2d         movl    y,    reg2d
    +                             addq    reg2q,reg1q   or    leaq    (reg1q,reg2q),reg1q
    +                             shrq    $1,   reg1q         shrq    $1,   reg1q
     
    -              to
    +                  ( reg1d and reg2d can be switched around in the first two instructions )
     
    -              movl [mem1],reg1
    -              movl reg1,reg2
    -             }
    -             else if (taicpu(p).oper[0]^.typ = top_ref) and
    -                (taicpu(p).oper[1]^.typ = top_reg) and
    -                (taicpu(hp1).oper[0]^.typ = top_ref) and
    -                (taicpu(hp1).oper[1]^.typ = top_reg) and
    -                (taicpu(p).opsize = taicpu(hp1).opsize) and
    -                RefsEqual(TReference(taicpu(p).oper[0]^^),taicpu(hp1).oper[0]^^.ref^) and
    -                (taicpu(p).oper[1]^.reg<>taicpu(hp1).oper[0]^^.ref^.base) and
    -                (taicpu(p).oper[1]^.reg<>taicpu(hp1).oper[0]^^.ref^.index) then
    -                taicpu(hp1).loadReg(0,taicpu(p).oper[1]^.reg)
    -              else*)
    +                    To       movl    x,    reg1d
    +                             addl    y,    reg1d
    +                             rcrl    $1,   reg1d
     
    -            {   movl const1,[mem1]
    -                movl [mem1],reg1
    +                    This corresponds to the common expression (x + y) shr 1, where
    +                    x and y are Cardinals (replacing "shr 1" with "div 2" produces
    +                    smaller code, but won't account for x + y causing an overflow). [Kit]
    +                  }
     
    -                to
    +                  if (getsupreg(taicpu(hp2).oper[1]^.reg) = getsupreg(taicpu(hp1).oper[1]^.reg)) then
    +                    { Change first MOV command to have the same register as the final output }
    +                    taicpu(p).oper[1]^.reg := taicpu(hp1).oper[1]^.reg
    +                  else
    +                    taicpu(hp1).oper[1]^.reg := taicpu(p).oper[1]^.reg;
     
    -                movl const1,reg1
    -                movl reg1,[mem1]
    -            }
    -            else if MatchOpType(Taicpu(p),top_const,top_ref) and
    -                 MatchOpType(Taicpu(hp1),top_ref,top_reg) and
    -                 (taicpu(p).opsize = taicpu(hp1).opsize) and
    -                 RefsEqual(taicpu(hp1).oper[0]^.ref^,taicpu(p).oper[1]^.ref^) and
    -                 not(RegInRef(taicpu(hp1).oper[1]^.reg,taicpu(hp1).oper[0]^.ref^)) then
    -              begin
    -                AllocRegBetween(taicpu(hp1).oper[1]^.reg,p,hp1,usedregs);
    -                taicpu(hp1).loadReg(0,taicpu(hp1).oper[1]^.reg);
    -                taicpu(hp1).loadRef(1,taicpu(p).oper[1]^.ref^);
    -                taicpu(p).loadReg(1,taicpu(hp1).oper[0]^.reg);
    -                taicpu(hp1).fileinfo := taicpu(p).fileinfo;
    -                DebugMsg(SPeepholeOptimization + 'MovMov2MovMov 1',p);
    -              end
    -            {
    -              mov*  x,reg1
    -              mov*  y,reg1
    +                  { Change second MOV command to an ADD command. This is easier than
    +                    converting the existing command because it means we don't have to
    +                    touch 'y', which might be a complicated reference, and also the
    +                    fact that the third command might either be ADD or LEA. [Kit] }
    +                  taicpu(hp1).opcode := A_ADD;
     
    -              to
    +                  { Delete old ADD/LEA instruction }
    +                  asml.remove(hp2);
    +                  hp2.free;
     
    -              mov*  y,reg1
    -            }
    -            else if (taicpu(p).oper[1]^.typ=top_reg) and
    -              MatchOperand(taicpu(p).oper[1]^,taicpu(hp1).oper[1]^) and
    -              not(RegInOp(taicpu(p).oper[1]^.reg,taicpu(hp1).oper[0]^)) then
    -              begin
    -                DebugMsg(SPeepholeOptimization + 'MovMov2Mov 4 done',p);
    -                { take care of the register (de)allocs following p }
    -                UpdateUsedRegs(tai(p.next));
    -                asml.remove(p);
    -                p.free;
    -                p:=hp1;
    -                Result:=true;
    -                exit;
    -              end;
    -          end
    +                  { Convert "shrq $1, reg1q" to "rcrl $1, reg1d" }
    +                  taicpu(hp3).opcode := A_RCR;
    +                  taicpu(hp3).changeopsize(S_L);
    +                  setsubreg(taicpu(hp3).oper[1]^.reg, R_SUBD);
    +{$endif x86_64}
    +                end;
    +            end;
     
    -        else if (taicpu(p).oper[1]^.typ = top_reg) and
    -          GetNextInstruction_p and
    -          (hp1.typ = ait_instruction) and
    -          GetNextInstruction(hp1, hp2) and
    -          MatchInstruction(hp2,A_MOV,[]) and
    -          (SuperRegistersEqual(taicpu(hp2).oper[0]^.reg,taicpu(p).oper[1]^.reg)) and
    -          (IsFoldableArithOp(taicpu(hp1), taicpu(p).oper[1]^.reg) or
    -           ((taicpu(p).opsize=S_L) and (taicpu(hp1).opsize=S_Q) and (taicpu(hp2).opsize=S_L) and
    -            IsFoldableArithOp(taicpu(hp1), newreg(R_INTREGISTER,getsupreg(taicpu(p).oper[1]^.reg),R_SUBQ)))
    -          ) then
    -          begin
    -            if OpsEqual(taicpu(hp2).oper[1]^, taicpu(p).oper[0]^) and
    -              (taicpu(hp2).oper[0]^.typ=top_reg) then
    -              { change   movsX/movzX    reg/ref, reg2
    -                         add/sub/or/... reg3/$const, reg2
    -                         mov            reg2 reg/ref
    -                         dealloc        reg2
    -                to
    -                         add/sub/or/... reg3/$const, reg/ref      }
    -              begin
    -                TransferUsedRegs(TmpUsedRegs);
    -                UpdateUsedRegs(TmpUsedRegs, tai(p.next));
    -                UpdateUsedRegs(TmpUsedRegs, tai(hp1.next));
    -                If not(RegUsedAfterInstruction(taicpu(p).oper[1]^.reg,hp2,TmpUsedRegs)) then
    -                  begin
    -                    { by example:
    -                        movswl  %si,%eax        movswl  %si,%eax      p
    -                        decl    %eax            addl    %edx,%eax     hp1
    -                        movw    %ax,%si         movw    %ax,%si       hp2
    -                      ->
    -                        movswl  %si,%eax        movswl  %si,%eax      p
    -                        decw    %eax            addw    %edx,%eax     hp1
    -                        movw    %ax,%si         movw    %ax,%si       hp2
    -                    }
    -                    DebugMsg(SPeepholeOptimization + 'MovOpMov2Op ('+
    -                          debug_op2str(taicpu(p).opcode)+debug_opsize2str(taicpu(p).opsize)+' '+
    -                          debug_op2str(taicpu(hp1).opcode)+debug_opsize2str(taicpu(hp1).opsize)+' '+
    -                          debug_op2str(taicpu(hp2).opcode)+debug_opsize2str(taicpu(hp2).opsize),p);
    -                    taicpu(hp1).changeopsize(taicpu(hp2).opsize);
    -                    {
    -                      ->
    -                        movswl  %si,%eax        movswl  %si,%eax      p
    -                        decw    %si             addw    %dx,%si       hp1
    -                        movw    %ax,%si         movw    %ax,%si       hp2
    -                    }
    -                    case taicpu(hp1).ops of
    -                      1:
    -                        begin
    -                          taicpu(hp1).loadoper(0, taicpu(hp2).oper[1]^);
    -                          if taicpu(hp1).oper[0]^.typ=top_reg then
    -                            setsubreg(taicpu(hp1).oper[0]^.reg,getsubreg(taicpu(hp2).oper[0]^.reg));
    -                        end;
    -                      2:
    -                        begin
    -                          taicpu(hp1).loadoper(1, taicpu(hp2).oper[1]^);
    -                          if (taicpu(hp1).oper[0]^.typ=top_reg) and
    -                            (taicpu(hp1).opcode<>A_SHL) and
    -                            (taicpu(hp1).opcode<>A_SHR) and
    -                            (taicpu(hp1).opcode<>A_SAR) then
    -                            setsubreg(taicpu(hp1).oper[0]^.reg,getsubreg(taicpu(hp2).oper[0]^.reg));
    -                        end;
    -                      else
    -                        internalerror(2008042701);
    -                    end;
    -                    {
    -                      ->
    -                        decw    %si             addw    %dx,%si       p
    -                    }
    -                    asml.remove(p);
    -                    asml.remove(hp2);
    -                    p.Free;
    -                    hp2.Free;
    -                    p := hp1;
    -                  end;
    -              end
    -            else if MatchOpType(taicpu(hp2),top_reg,top_reg) and
    -              not(SuperRegistersEqual(taicpu(hp1).oper[0]^.reg,taicpu(hp2).oper[1]^.reg)) and
    -              ((topsize2memsize[taicpu(hp1).opsize]<= topsize2memsize[taicpu(hp2).opsize]) or
    -               { opsize matters for these opcodes, we could probably work around this, but it is not worth the effort }
    -               ((taicpu(hp1).opcode<>A_SHL) and (taicpu(hp1).opcode<>A_SHR) and (taicpu(hp1).opcode<>A_SAR))
    +          if (taicpu(p).oper[0]^.typ = top_ref) and
    +            (
    +              (
    +                (taicpu(hp1).opcode=A_LEA) and
    +                (
    +                  (
    +                    MatchReference(taicpu(hp1).oper[0]^.ref^,taicpu(p).oper[1]^.reg,NR_INVALID) and
    +                    (taicpu(hp1).oper[0]^.ref^.index<>taicpu(p).oper[1]^.reg)
    +                  ) or (
    +                    MatchReference(taicpu(hp1).oper[0]^.ref^,NR_INVALID, taicpu(p).oper[1]^.reg) and
    +                    (taicpu(hp1).oper[0]^.ref^.base<>taicpu(p).oper[1]^.reg)
    +                  ) or
    +                  MatchReferenceWithOffset(taicpu(hp1).oper[0]^.ref^,taicpu(p).oper[1]^.reg,NR_NO) or
    +                  MatchReferenceWithOffset(taicpu(hp1).oper[0]^.ref^,NR_NO,taicpu(p).oper[1]^.reg)
    +                ) and
    +                { GetNextInstruction is not factored out so it is only called
    +                  when all the other independent conditional checks are True
    +                  (we also need access to hp2 for MatchOperand) }
    +                GetNextInstruction(hp1,hp2) and
    +                (
    +                  not RegUsedAfterInstruction(taicpu(p).oper[1]^.reg,hp1,UsedRegs) or
    +                  (
    +                    MatchInstruction(hp2,A_MOV) and
    +                    MatchOperand(taicpu(p).oper[1]^,taicpu(hp2).oper[0]^)
    +                  )
    +                )
    +              ) or (
    +                IsFoldableArithOp(taicpu(hp1),taicpu(p).oper[1]^.reg) and
    +                GetNextInstruction(hp1,hp2)
                   )
    -{$ifdef i386}
    -              { byte registers of esi, edi, ebp, esp are not available on i386 }
    -              and ((taicpu(hp2).opsize<>S_B) or not(getsupreg(taicpu(hp1).oper[0]^.reg) in [RS_ESI,RS_EDI,RS_EBP,RS_ESP]))
    -              and ((taicpu(hp2).opsize<>S_B) or not(getsupreg(taicpu(p).oper[0]^.reg) in [RS_ESI,RS_EDI,RS_EBP,RS_ESP]))
    -{$endif i386}
    -              then
    -              { change   movsX/movzX    reg/ref, reg2
    -                         add/sub/or/... regX/$const, reg2
    -                         mov            reg2, reg3
    -                         dealloc        reg2
    -                to
    -                         movsX/movzX    reg/ref, reg3
    -                         add/sub/or/... reg3/$const, reg3
    -              }
    -              begin
    -                TransferUsedRegs(TmpUsedRegs);
    -                UpdateUsedRegs(TmpUsedRegs, tai(p.next));
    -                UpdateUsedRegs(TmpUsedRegs, tai(hp1.next));
    -                If not(RegUsedAfterInstruction(taicpu(p).oper[1]^.reg,hp2,TmpUsedRegs)) then
    -                  begin
    -                    { by example:
    -                        movswl  %si,%eax        movswl  %si,%eax      p
    -                        decl    %eax            addl    %edx,%eax     hp1
    -                        movw    %ax,%si         movw    %ax,%si       hp2
    -                      ->
    -                        movswl  %si,%eax        movswl  %si,%eax      p
    -                        decw    %eax            addw    %edx,%eax     hp1
    -                        movw    %ax,%si         movw    %ax,%si       hp2
    -                    }
    -                    DebugMsg(SPeepholeOptimization + 'MovOpMov2MovOp ('+
    -                          debug_op2str(taicpu(p).opcode)+debug_opsize2str(taicpu(p).opsize)+' '+
    -                          debug_op2str(taicpu(hp1).opcode)+debug_opsize2str(taicpu(hp1).opsize)+' '+
    -                          debug_op2str(taicpu(hp2).opcode)+debug_opsize2str(taicpu(hp2).opsize),p);
    -                    { limit size of constants as well to avoid assembler errors, but
    -                      check opsize to avoid overflow when left shifting the 1 }
    -                    if (taicpu(p).oper[0]^.typ=top_const) and (topsize2memsize[taicpu(hp2).opsize]<=4) then
    -                      taicpu(p).oper[0]^.val:=taicpu(p).oper[0]^.val and ((qword(1) shl (topsize2memsize[taicpu(hp2).opsize]*8))-1);
    -                    taicpu(hp1).changeopsize(taicpu(hp2).opsize);
    -                    taicpu(p).changeopsize(taicpu(hp2).opsize);
    -                    if taicpu(p).oper[0]^.typ=top_reg then
    -                      setsubreg(taicpu(p).oper[0]^.reg,getsubreg(taicpu(hp2).oper[0]^.reg));
    -                    taicpu(p).loadoper(1, taicpu(hp2).oper[1]^);
    -                    AllocRegBetween(taicpu(p).oper[1]^.reg,p,hp1,usedregs);
    -                    {
    -                      ->
    -                        movswl  %si,%eax        movswl  %si,%eax      p
    -                        decw    %si             addw    %dx,%si       hp1
    -                        movw    %ax,%si         movw    %ax,%si       hp2
    -                    }
    -                    case taicpu(hp1).ops of
    -                      1:
    -                        begin
    -                          taicpu(hp1).loadoper(0, taicpu(hp2).oper[1]^);
    -                          if taicpu(hp1).oper[0]^.typ=top_reg then
    -                            setsubreg(taicpu(hp1).oper[0]^.reg,getsubreg(taicpu(hp2).oper[0]^.reg));
    -                        end;
    -                      2:
    -                        begin
    -                          taicpu(hp1).loadoper(1, taicpu(hp2).oper[1]^);
    -                          if (taicpu(hp1).oper[0]^.typ=top_reg) and
    -                            (taicpu(hp1).opcode<>A_SHL) and
    -                            (taicpu(hp1).opcode<>A_SHR) and
    -                            (taicpu(hp1).opcode<>A_SAR) then
    -                            setsubreg(taicpu(hp1).oper[0]^.reg,getsubreg(taicpu(hp2).oper[0]^.reg));
    -                        end;
    -                      else
    -                        internalerror(2018111801);
    -                    end;
    -                    {
    -                      ->
    -                        decw    %si             addw    %dx,%si       p
    -                    }
    -                    asml.remove(hp2);
    -                    hp2.Free;
    +            ) and
    +            MatchInstruction(hp2,A_MOV) and
    +            (taicpu(hp2).oper[1]^.typ = top_ref) and
    +            (
    +              MatchOperand(taicpu(hp1).oper[taicpu(hp1).ops-1]^,taicpu(hp2).oper[0]^)
    +{$ifdef x86_64}
    +              or (
    +                (taicpu(hp1).oper[taicpu(hp1).ops-1]^.typ = top_reg) and
    +                (taicpu(hp2).oper[0]^.typ = top_reg)
    +                { This is not an exact match, but because only 32 bits are read
    +                  from the reference, anything written to the upper 32 bits can
    +                  be considered discarded.  Inconsistencies will only occur if
    +                  a 64-bit variable is mapped onto a 32-bit variable using the
    +                  "absolute" keyword, which is generally not recommended. [Kit] }
    +                and SuperRegistersEqual(taicpu(hp1).oper[taicpu(hp1).ops-1]^.reg, taicpu(hp2).oper[0]^.reg)
    +                and (getsubreg(taicpu(p).oper[1]^.reg) = R_SUBD)
    +                and (getsubreg(taicpu(hp1).oper[taicpu(hp1).ops-1]^.reg) = R_SUBD)
    +                and (getsubreg(taicpu(hp2).oper[0]^.reg) = R_SUBQ)
    +              )
    +{$endif x86_64}
    +            ) then
    +            begin
    +              if RefsEqual(taicpu(hp2).oper[1]^.ref^,taicpu(p).oper[0]^.ref^) and
    +                not (
    +                  TransferUsedRegs(TmpUsedRegs) and
    +                  UpdateUsedRegs(TmpUsedRegs,tai(p.next)) and
    +                  UpdateUsedRegs(TmpUsedRegs,tai(hp1.next)) and
    +                  RegUsedAfterInstruction(taicpu(hp2).oper[0]^.reg,hp2,TmpUsedRegs)
    +                ) then
    +                { change   mov            (ref), reg
    +                           add/sub/or/... reg2/$const, reg
    +                           mov            reg, (ref)
    +                           # release reg
    +                  to       add/sub/or/... reg2/$const, (ref)    }
    +                begin
    +                  case taicpu(hp1).opcode of
    +                    A_INC,A_DEC,A_NOT,A_NEG :
    +                      taicpu(hp1).loadRef(0,taicpu(p).oper[0]^.ref^);
    +                    A_LEA :
    +                      begin
    +                        taicpu(hp1).opcode:=A_ADD;
    +                        taicpu(hp1).loadRef(1,taicpu(p).oper[0]^.ref^);
    +                        if (taicpu(hp1).oper[0]^.ref^.index<>taicpu(p).oper[1]^.reg) and (taicpu(hp1).oper[0]^.ref^.index<>NR_NO) then
    +                          taicpu(hp1).loadreg(0,taicpu(hp1).oper[0]^.ref^.index)
    +                        else if (taicpu(hp1).oper[0]^.ref^.base<>taicpu(p).oper[1]^.reg) and (taicpu(hp1).oper[0]^.ref^.base<>NR_NO) then
    +                          taicpu(hp1).loadreg(0,taicpu(hp1).oper[0]^.ref^.base)
    +                        else
    +                          begin
    +                            { Optimise for size if applicable }
    +                            if UseIncDec then
    +                              begin
    +                                case taicpu(hp1).oper[0]^.ref^.offset of
    +                                  1:
    +                                    begin
    +                                      taicpu(hp1).opcode:=A_INC;
    +                                      taicpu(hp1).loadRef(0,taicpu(p).oper[0]^.ref^);
    +                                      taicpu(hp1).ops := 1;
    +                                    end;
    +                                  -1:
    +                                    begin
    +                                      taicpu(hp1).opcode:=A_DEC;
    +                                      taicpu(hp1).loadRef(0,taicpu(p).oper[0]^.ref^);
    +                                      taicpu(hp1).ops := 1;
    +                                    end;
    +                                  else
    +                                    taicpu(hp1).loadconst(0,taicpu(hp1).oper[0]^.ref^.offset);
    +                                end;
    +                              end
    +                            else
    +                              taicpu(hp1).loadconst(0,taicpu(hp1).oper[0]^.ref^.offset);
    +                          end;
    +                        DebugMsg(SPeepholeOptimization + 'FoldLea done',hp1);
    +                      end;
    +                    else
    +                      taicpu(hp1).loadRef(1,taicpu(p).oper[0]^.ref^);
                       end;
    -              end;
    -          end
    -        else if GetNextInstruction_p and
    -          MatchInstruction(hp1,A_BTS,A_BTR,[Taicpu(p).opsize]) and
    -          GetNextInstruction(hp1, hp2) and
    -          MatchInstruction(hp2,A_OR,[Taicpu(p).opsize]) and
    -          MatchOperand(Taicpu(p).oper[0]^,0) and
    -          (Taicpu(p).oper[1]^.typ = top_reg) and
    -          MatchOperand(Taicpu(p).oper[1]^,Taicpu(hp1).oper[1]^) and
    -          MatchOperand(Taicpu(p).oper[1]^,Taicpu(hp2).oper[1]^) then
    -          { mov reg1,0
    -            bts reg1,operand1             -->      mov reg1,operand2
    -            or  reg1,operand2                      bts reg1,operand1}
    -          begin
    -            Taicpu(hp2).opcode:=A_MOV;
    -            asml.remove(hp1);
    -            insertllitem(hp2,hp2.next,hp1);
    -            asml.remove(p);
    -            p.free;
    -            p:=hp1;
    -          end
    -
    -        else if GetNextInstruction_p and
    -           MatchInstruction(hp1,A_LEA,[S_L]) and
    -           MatchOpType(Taicpu(p),top_ref,top_reg) and
    -           ((MatchReference(Taicpu(hp1).oper[0]^.ref^,Taicpu(hp1).oper[1]^.reg,Taicpu(p).oper[1]^.reg) and
    -             (Taicpu(hp1).oper[0]^.ref^.base<>Taicpu(p).oper[1]^.reg)
    -            ) or
    -            (MatchReference(Taicpu(hp1).oper[0]^.ref^,Taicpu(p).oper[1]^.reg,Taicpu(hp1).oper[1]^.reg) and
    -             (Taicpu(hp1).oper[0]^.ref^.index<>Taicpu(p).oper[1]^.reg)
    -            )
    -           ) then
    -           { mov reg1,ref
    -             lea reg2,[reg1,reg2]
    -
    -             to
    -
    -             add reg2,ref}
    -          begin
    -            TransferUsedRegs(TmpUsedRegs);
    -            { reg1 may not be used afterwards }
    -            if not(RegUsedAfterInstruction(taicpu(p).oper[1]^.reg, hp1, TmpUsedRegs)) then
    -              begin
    -                Taicpu(hp1).opcode:=A_ADD;
    -                Taicpu(hp1).oper[0]^.ref^:=Taicpu(p).oper[0]^.ref^;
    -                DebugMsg(SPeepholeOptimization + 'MovLea2Add done',hp1);
    -                asml.remove(p);
    -                p.free;
    -                p:=hp1;
    -              end;
    -          end;
    +                  asml.remove(p);
    +                  asml.remove(hp2);
    +                  p.free;
    +                  hp2.free;
    +                  p := hp1;
    +                  Result := True;
    +                end;
    +            end;
    +          Exit;
    +        until False;
           end;
     
     
    
  • overhaul-singlepass.patch (157,392 bytes)
    Index: compiler/aoptobj.pas
    ===================================================================
    --- compiler/aoptobj.pas	(revision 42345)
    +++ compiler/aoptobj.pas	(working copy)
    @@ -1429,96 +1780,150 @@
            to avoid endless loops with constructs such as "l5: ; jmp l5"           }
     
           var p1: tai;
    +          p2: tai;
               {$if not defined(MIPS) and not defined(riscv64) and not defined(riscv32) and not defined(JVM)}
    -          p2: tai;
    -          l: tasmlabel;
    +          p3: tai;
               {$endif}
    +          ThisLabel, l: tasmlabel;
     
           begin
    -        GetfinalDestination := false;
    +        GetFinalDestination := false;
             if level > 20 then
               exit;
    -        p1 := getlabelwithsym(tasmlabel(JumpTargetOp(hp)^.ref^.symbol));
    +
    +        ThisLabel := TAsmLabel(JumpTargetOp(hp)^.ref^.symbol);
    +        p1 := getlabelwithsym(ThisLabel);
             if assigned(p1) then
               begin
                 SkipLabels(p1,p1);
    -            if (tai(p1).typ = ait_instruction) and
    +            if (p1.typ = ait_instruction) and
                    (taicpu(p1).is_jmp) then
    -              if { the next instruction after the label where the jump hp arrives}
    -                 { is unconditional or of the same type as hp, so continue       }
    -                 IsJumpToLabelUncond(taicpu(p1))
    +              begin
    +                p2 := tai(p1.Next);
    +
    +                { Collapse any zero distance jumps we stumble across }
    +                while (p1<>blockstart) and CollapseZeroDistJump(p1, p2, TAsmLabel(JumpTargetOp(taicpu(p1))^.ref^.symbol)) do
    +                  begin
    +                    { TODO/FIXME: removing the first instruction fails }
    +                    if (p1.typ = ait_label) then
    +                      SkipLabels(p1, p1);
    +
    +                    if not Assigned(p1) then
    +                      { No more valid commands }
    +                      Exit;
    +
    +                    { Check to see that we are actually still at a jump }
    +                    if not ((tai(p1).typ = ait_instruction) and (taicpu(p1).is_jmp)) then
    +                      begin
    +                        { Required to ensure recursion works properly, but to also
    +                          return false if a jump isn't modified. [Kit] }
    +                        if level > 0 then GetFinalDestination := True;
    +                        Exit;
    +                      end;
    +
    +                    p2 := tai(p1.Next);
    +                    if p2 = BlockEnd then
    +                      Exit;
    +                  end;
    +
     {$if not defined(MIPS) and not defined(riscv64) and not defined(riscv32) and not defined(JVM)}
    -{ for MIPS, it isn't enough to check the condition; first operands must be same, too. }
    -                 or
    -                 conditions_equal(taicpu(p1).condition,hp.condition) or
    +                p3 := p2;
    +{$endif not MIPS and not RV64 and not RV32 and not JVM}
     
    -                 { the next instruction after the label where the jump hp arrives
    -                   is the opposite of hp (so this one is never taken), but after
    -                   that one there is a branch that will be taken, so perform a
    -                   little hack: set p1 equal to this instruction (that's what the
    -                   last SkipLabels is for, only works with short bool evaluation)}
    -                 (conditions_equal(taicpu(p1).condition,inverse_cond(hp.condition)) and
    -                  SkipLabels(p1,p2) and
    -                  (p2.typ = ait_instruction) and
    -                  (taicpu(p2).is_jmp) and
    -                   (IsJumpToLabelUncond(taicpu(p2)) or
    -                   (conditions_equal(taicpu(p2).condition,hp.condition))) and
    -                  SkipLabels(p1,p1))
    +                if { the next instruction after the label where the jump hp arrives}
    +                   { is unconditional or of the same type as hp, so continue       }
    +                   IsJumpToLabelUncond(taicpu(p1))
    +{$if not defined(MIPS) and not defined(riscv64) and not defined(riscv32) and not defined(JVM)}
    +  { for MIPS, it isn't enough to check the condition; first operands must be same, too. }
    +                   or
    +                   conditions_equal(taicpu(p1).condition,hp.condition) or
    +
    +                   { the next instruction after the label where the jump hp arrives
    +                     is the opposite of hp (so this one is never taken), but after
    +                     that one there is a branch that will be taken, so perform a
    +                     little hack: set p1 equal to this instruction }
    +                   (conditions_equal(taicpu(p1).condition,inverse_cond(hp.condition)) and
    +                     SkipLabels(p3,p2) and
    +                     (p2.typ = ait_instruction) and
    +                     (taicpu(p2).is_jmp) and
    +                       (IsJumpToLabelUncond(taicpu(p2)) or
    +                       (conditions_equal(taicpu(p2).condition,hp.condition))
    +                     ) and
    +                     SetAndTest(p2,p1)
    +                   )
     {$endif not MIPS and not RV64 and not RV32 and not JVM}
    -                 then
    -                begin
    -                  { quick check for loops of the form "l5: ; jmp l5 }
    -                  if (tasmlabel(JumpTargetOp(taicpu(p1))^.ref^.symbol).labelnr =
    -                       tasmlabel(JumpTargetOp(hp)^.ref^.symbol).labelnr) then
    -                    exit;
    -                  if not GetFinalDestination(taicpu(p1),succ(level)) then
    -                    exit;
    +                   then
    +                  begin
    +                    { quick check for loops of the form "l5: ; jmp l5" }
    +                    if (TAsmLabel(JumpTargetOp(taicpu(p1))^.ref^.symbol).labelnr = ThisLabel.labelnr) then
    +                      exit;
    +                    if not GetFinalDestination(taicpu(p1),succ(level)) then
    +                      exit;
    +
    +                    { NOTE: Do not move this before the "l5: ; jmp l5" check,
    +                      because GetFinalDestination may change the destination
    +                      label of p1. [Kit] }
    +
    +                    l := tasmlabel(JumpTargetOp(taicpu(p1))^.ref^.symbol);
    +
     {$if defined(aarch64)}
    -                  { can't have conditional branches to
    -                    global labels on AArch64, because the
    -                    offset may become too big }
    -                  if not(taicpu(hp).condition in [C_None,C_AL,C_NV]) and
    -                     (tasmlabel(JumpTargetOp(taicpu(p1))^.ref^.symbol).bind<>AB_LOCAL) then
    -                    exit;
    +                    { can't have conditional branches to
    +                      global labels on AArch64, because the
    +                      offset may become too big }
    +                    if not(taicpu(hp).condition in [C_None,C_AL,C_NV]) and
    +                       (l.bind<>AB_LOCAL) then
    +                      exit;
     {$endif aarch64}
    -                  tasmlabel(JumpTargetOp(hp)^.ref^.symbol).decrefs;
    -                  JumpTargetOp(hp)^.ref^.symbol:=JumpTargetOp(taicpu(p1))^.ref^.symbol;
    -                  tasmlabel(JumpTargetOp(hp)^.ref^.symbol).increfs;
    -                end
    +                    ThisLabel.decrefs;
    +                    JumpTargetOp(hp)^.ref^.symbol:=l;
    +                    l.increfs;
    +                    GetFinalDestination := True;
    +                    Exit;
    +                  end
     {$if not defined(MIPS) and not defined(riscv64) and not defined(riscv32) and not defined(JVM)}
    -              else
    -                if conditions_equal(taicpu(p1).condition,inverse_cond(hp.condition)) then
    -                  if not FindAnyLabel(p1,l) then
    +                else
    +                  if conditions_equal(taicpu(p1).condition,inverse_cond(hp.condition)) then
                         begin
    -      {$ifdef finaldestdebug}
    -                      insertllitem(asml,p1,p1.next,tai_comment.Create(
    -                        strpnew('previous label inserted'))));
    -      {$endif finaldestdebug}
    -                      current_asmdata.getjumplabel(l);
    -                      insertllitem(p1,p1.next,tai_label.Create(l));
    -                      tasmlabel(JumpTargetOp(hp)^.ref^.symbol).decrefs;
    -                      JumpTargetOp(hp)^.ref^.symbol := l;
    -                      l.increfs;
    -      {               this won't work, since the new label isn't in the labeltable }
    -      {               so it will fail the rangecheck. Labeltable should become a   }
    -      {               hashtable to support this:                                   }
    -      {               GetFinalDestination(asml, hp);                               }
    -                    end
    -                  else
    -                    begin
    -      {$ifdef finaldestdebug}
    -                      insertllitem(asml,p1,p1.next,tai_comment.Create(
    -                        strpnew('next label reused'))));
    -      {$endif finaldestdebug}
    -                      l.increfs;
    -                      tasmlabel(JumpTargetOp(hp)^.ref^.symbol).decrefs;
    -                      JumpTargetOp(hp)^.ref^.symbol := l;
    -                      if not GetFinalDestination(hp,succ(level)) then
    -                        exit;
    +                      if not FindAnyLabel(p1,l) then
    +                        begin
    +{$ifdef finaldestdebug}
    +                          insertllitem(asml,p1,p1.next,tai_comment.Create(
    +                            strpnew('previous label inserted'))));
    +{$endif finaldestdebug}
    +                          current_asmdata.getjumplabel(l);
    +                          insertllitem(p1,p1.next,tai_label.Create(l));
    +
    +                          ThisLabel.decrefs;
    +                          JumpTargetOp(hp)^.ref^.symbol := l;
    +                          l.increfs;
    +                          GetFinalDestination := True;
    +          {               this won't work, since the new label isn't in the labeltable }
    +          {               so it will fail the rangecheck. Labeltable should become a   }
    +          {               hashtable to support this:                                   }
    +          {               GetFinalDestination(asml, hp);                               }
    +                        end
    +                      else
    +                        begin
    +{$ifdef finaldestdebug}
    +                          insertllitem(asml,p1,p1.next,tai_comment.Create(
    +                            strpnew('next label reused'))));
    +{$endif finaldestdebug}
    +                          l.increfs;
    +                          ThisLabel.decrefs;
    +                          JumpTargetOp(hp)^.ref^.symbol := l;
    +                          if not GetFinalDestination(hp,succ(level)) then
    +                            exit;
    +                        end;
    +                      GetFinalDestination := True;
    +                      Exit;
                         end;
     {$endif not MIPS and not RV64 and not RV32 and not JVM}
    +              end;
               end;
    -        GetFinalDestination := true;
    +
    +        { Required to ensure recursion works properly, but to also
    +          return false if a jump isn't modified. [Kit] }
    +        if level > 0 then GetFinalDestination := True;
           end;
     
     
    Index: compiler/i386/aoptcpu.pas
    ===================================================================
    --- compiler/i386/aoptcpu.pas	(revision 42345)
    +++ compiler/i386/aoptcpu.pas	(working copy)
    @@ -34,12 +34,21 @@
           Aasmbase,aasmtai,aasmdata;
     
         Type
    +
    +      { TCpuAsmOptimizer }
    +
           TCpuAsmOptimizer = class(TX86AsmOptimizer)
    -        procedure Optimize; override;
    -        procedure PrePeepHoleOpts; override;
    -        procedure PeepHoleOptPass1; override;
    -        procedure PeepHoleOptPass2; override;
    +        function PeepHoleOptPass1Cpu(var p: tai): boolean; override;
             procedure PostPeepHoleOpts; override;
    +        function PostPeepHoleOptsCpu(var p : tai) : boolean; override;
    +
    +        { Optimizations specific to i386 }
    +        function OptPass1FSTPFISTP(var p : tai) : boolean;
    +        function OptPass1FLD(var p: tai): Boolean;
    +        function OptPass1PUSH(var p: tai): Boolean;
    +
    +        { The x86_64 version is very different }
    +        function PostPeepholeOptMovzx(var p : tai) : Boolean; inline;
           end;
     
         Var
    @@ -55,769 +64,423 @@
           aasmcfi,
           procinfo,
           cgutils,
    -      { units we should get rid off: }
    +      systems,
    +      { units we should get rid of: }
           symsym,symconst;
     
     
    -  { Checks if the register is a 32 bit general purpose register }
    -  function isgp32reg(reg: TRegister): boolean;
    +    { Checks if the register is a 32 bit general purpose register }
    +    function isgp32reg(reg: TRegister): boolean; inline;
    +      begin
    +        {$push}{$warnings off}
    +        isgp32reg:=(getregtype(reg)=R_INTREGISTER) and (getsupreg(reg)>=RS_EAX) and (getsupreg(reg)<=RS_EBX);
    +        {$pop}
    +      end;
    +
    +
    +  { converts a tinschange value to a tsuperregister }
    +  function tch2reg(ch: tinschange): tsuperregister;
    +    const
    +      ch2reg: array[CH_REAX..CH_REDI] of tsuperregister = (RS_EAX,RS_ECX,RS_EDX,RS_EBX,RS_ESP,RS_EBP,RS_ESI,RS_EDI);
         begin
    -      {$push}{$warnings off}
    -      isgp32reg:=(getregtype(reg)=R_INTREGISTER) and (getsupreg(reg)>=RS_EAX) and (getsupreg(reg)<=RS_EBX);
    -      {$pop}
    +      if (ch <= CH_REDI) then
    +        tch2reg := ch2reg[ch]
    +      else if (ch <= CH_WEDI) then
    +        tch2reg := ch2reg[tinschange(ord(ch) - ord(CH_REDI))]
    +      else if (ch <= CH_RWEDI) then
    +        tch2reg := ch2reg[tinschange(ord(ch) - ord(CH_WEDI))]
    +      else if (ch <= CH_MEDI) then
    +        tch2reg := ch2reg[tinschange(ord(ch) - ord(CH_RWEDI))]
    +      else
    +        InternalError(2016041901)
         end;
     
     
    -{ returns true if p contains a memory operand with a segment set }
    -function InsContainsSegRef(p: taicpu): boolean;
    -var
    -  i: longint;
    -begin
    -  result:=true;
    -  for i:=0 to p.opercnt-1 do
    -    if (p.oper[i]^.typ=top_ref) and
    -       (p.oper[i]^.ref^.segment<>NR_NO) then
    -      exit;
    -  result:=false;
    -end;
    +  { returns true if p contains a memory operand with a segment set }
    +  function InsContainsSegRef(p: taicpu): boolean;
    +    var
    +      i: longint;
    +    begin
    +      result:=true;
    +      for i:=0 to p.opercnt-1 do
    +        if (p.oper[i]^.typ=top_ref) and
    +           (p.oper[i]^.ref^.segment<>NR_NO) then
    +          exit;
    +      result:=false;
    +    end;
     
     
    -procedure TCPUAsmOptimizer.PrePeepHoleOpts;
    -var
    -  p: tai;
    -begin
    -  p := BlockStart;
    -  while (p <> BlockEnd) Do
    +  function TCpuAsmOptimizer.OptPass1FSTPFISTP(var p: tai): boolean;
    +    var
    +      hp1, hp2: tai;
         begin
    -      case p.Typ Of
    -        Ait_Instruction:
    -          begin
    -            if InsContainsSegRef(taicpu(p)) then
    +      Result := false;
    +
    +      if (taicpu(p).oper[0]^.typ = top_ref) and
    +         getNextInstruction(p, hp1) and
    +         (hp1.typ = ait_instruction) and
    +         (((taicpu(hp1).opcode = A_FLD) and
    +           (taicpu(p).opcode = A_FSTP)) or
    +          ((taicpu(p).opcode = A_FISTP) and
    +           (taicpu(hp1).opcode = A_FILD))) and
    +         (taicpu(hp1).oper[0]^.typ = top_ref) and
    +         (taicpu(hp1).opsize = taicpu(p).opsize) and
    +         RefsEqual(taicpu(p).oper[0]^.ref^, taicpu(hp1).oper[0]^.ref^) then
    +        begin
    +          { replacing fstp f;fld f by fst f is only valid for extended because of rounding }
    +          if (taicpu(p).opsize=S_FX) and
    +             getNextInstruction(hp1, hp2) and
    +             (hp2.typ = ait_instruction) and
    +             IsExitCode(hp2) and
    +             (taicpu(p).oper[0]^.ref^.base = current_procinfo.FramePointer) and
    +             not(assigned(current_procinfo.procdef.funcretsym) and
    +                 (taicpu(p).oper[0]^.ref^.offset < tabstractnormalvarsym(current_procinfo.procdef.funcretsym).localloc.reference.offset)) and
    +             (taicpu(p).oper[0]^.ref^.index = NR_NO) then
    +            begin
    +              asml.remove(p);
    +              asml.remove(hp1);
    +              p.free;
    +              hp1.free;
    +              p := hp2;
    +              removeLastDeallocForFuncRes(p);
    +              Result := true;
    +            end
    +          (* can't be done because the store operation rounds
    +          else
    +            { fst can't store an extended value! }
    +            if (taicpu(p).opsize <> S_FX) and
    +               (taicpu(p).opsize <> S_IQ) then
                   begin
    -                p := tai(p.next);
    -                continue;
    -              end;
    -            case taicpu(p).opcode Of
    -              A_IMUL:
    -                if PrePeepholeOptIMUL(p) then
    -                  Continue;
    -              A_SAR,A_SHR:
    -                if PrePeepholeOptSxx(p) then
    -                  continue;
    -              A_XOR:
    -                begin
    -                  if (taicpu(p).oper[0]^.typ = top_reg) and
    -                     (taicpu(p).oper[1]^.typ = top_reg) and
    -                     (taicpu(p).oper[0]^.reg = taicpu(p).oper[1]^.reg) then
    -                   { temporarily change this to 'mov reg,0' to make it easier }
    -                   { for the CSE. Will be changed back in pass 2              }
    -                    begin
    -                      taicpu(p).opcode := A_MOV;
    -                      taicpu(p).loadConst(0,0);
    -                    end;
    -                end;
    -              else
    -                ;
    -            end;
    -          end;
    -        else
    -          ;
    -      end;
    -      p := tai(p.next)
    +                if (taicpu(p).opcode = A_FSTP) then
    +                  taicpu(p).opcode := A_FST
    +                else taicpu(p).opcode := A_FIST;
    +                asml.remove(hp1);
    +                hp1.free;
    +              end
    +          *)
    +        end;
         end;
    -end;
     
    +  function TCpuAsmOptimizer.OptPass1FLD(var p: tai): Boolean;
    +    var
    +      hp1, hp2: tai;
    +    begin
    +      Result := False;
     
    -{ First pass of peephole optimizations }
    -procedure TCPUAsmOPtimizer.PeepHoleOptPass1;
    -
    -function WriteOk : Boolean;
    -  begin
    -    writeln('Ok');
    -    Result:=True;
    -  end;
    -
    -var
    -  p,hp1,hp2 : tai;
    -  hp3,hp4: tai;
    -  v:aint;
    -
    -  function GetFinalDestination(asml: TAsmList; hp: taicpu; level: longint): boolean;
    -  {traces sucessive jumps to their final destination and sets it, e.g.
    -   je l1                je l3
    -   <code>               <code>
    -   l1:       becomes    l1:
    -   je l2                je l3
    -   <code>               <code>
    -   l2:                  l2:
    -   jmp l3               jmp l3
    -
    -   the level parameter denotes how deeep we have already followed the jump,
    -   to avoid endless loops with constructs such as "l5: ; jmp l5"           }
    -
    -  var p1, p2: tai;
    -      l: tasmlabel;
    -
    -    function FindAnyLabel(hp: tai; var l: tasmlabel): Boolean;
    -    begin
    -      FindAnyLabel := false;
    -      while assigned(hp.next) and
    -            (tai(hp.next).typ in (SkipInstr+[ait_align])) Do
    -        hp := tai(hp.next);
    -      if assigned(hp.next) and
    -         (tai(hp.next).typ = ait_label) then
    +      if (taicpu(p).oper[0]^.typ = top_reg) and
    +        GetNextInstruction(p, hp1) and
    +        (hp1.typ = Ait_Instruction) and
    +         (taicpu(hp1).oper[0]^.typ = top_reg) and
    +        (taicpu(hp1).oper[1]^.typ = top_reg) and
    +        (taicpu(hp1).oper[0]^.reg = NR_ST) and
    +        (taicpu(hp1).oper[1]^.reg = NR_ST1) then
    +        { change                        to
    +            fld      reg               fxxx reg,st
    +            fxxxp    st, st1 (hp1)
    +          Remark: non-commutative operations must be reversed!
    +        }
             begin
    -          FindAnyLabel := true;
    -          l := tai_label(hp.next).labsym;
    +          if taicpu(hp1).opcode in [A_FMULP,A_FADDP,A_FSUBP,A_FDIVP,A_FSUBRP,A_FDIVRP] then
    +            begin
    +              case taicpu(hp1).opcode Of
    +                A_FADDP: taicpu(hp1).opcode := A_FADD;
    +                A_FMULP: taicpu(hp1).opcode := A_FMUL;
    +                A_FSUBP: taicpu(hp1).opcode := A_FSUBR;
    +                A_FSUBRP: taicpu(hp1).opcode := A_FSUB;
    +                A_FDIVP: taicpu(hp1).opcode := A_FDIVR;
    +                A_FDIVRP: taicpu(hp1).opcode := A_FDIV;
    +                else
    +                  InternalError(2019071010);
    +              end;
    +              taicpu(hp1).oper[0]^.reg := taicpu(p).oper[0]^.reg;
    +              taicpu(hp1).oper[1]^.reg := NR_ST;
    +              asml.remove(p);
    +              p.free;
    +              p := hp1;
    +              Result := True;
    +            end;
             end
    -    end;
    -
    -  begin
    -    GetfinalDestination := false;
    -    if level > 20 then
    -      exit;
    -    p1 := getlabelwithsym(tasmlabel(hp.oper[0]^.ref^.symbol));
    -    if assigned(p1) then
    -      begin
    -        SkipLabels(p1,p1);
    -        if (tai(p1).typ = ait_instruction) and
    -           (taicpu(p1).is_jmp) then
    -          if { the next instruction after the label where the jump hp arrives}
    -             { is unconditional or of the same type as hp, so continue       }
    -             (taicpu(p1).condition in [C_None,hp.condition]) or
    -             { the next instruction after the label where the jump hp arrives}
    -             { is the opposite of hp (so this one is never taken), but after }
    -             { that one there is a branch that will be taken, so perform a   }
    -             { little hack: set p1 equal to this instruction (that's what the}
    -             { last SkipLabels is for, only works with short bool evaluation)}
    -             ((taicpu(p1).condition = inverse_cond(hp.condition)) and
    -              SkipLabels(p1,p2) and
    -              (p2.typ = ait_instruction) and
    -              (taicpu(p2).is_jmp) and
    -              (taicpu(p2).condition in [C_None,hp.condition]) and
    -              SkipLabels(p1,p1)) then
    +      else
    +        if (taicpu(p).oper[0]^.typ = top_ref) and
    +           GetNextInstruction(p, hp2) and
    +           (hp2.typ = Ait_Instruction) and
    +           (taicpu(hp2).ops = 2) and
    +           (taicpu(hp2).oper[0]^.typ = top_reg) and
    +           (taicpu(hp2).oper[1]^.typ = top_reg) and
    +           (taicpu(p).opsize in [S_FS, S_FL]) and
    +           (taicpu(hp2).oper[0]^.reg = NR_ST) and
    +           (taicpu(hp2).oper[1]^.reg = NR_ST1) then
    +          if GetLastInstruction(p, hp1) and
    +             (hp1.typ = ait_Instruction) and
    +             ((taicpu(hp1).opcode = A_FLD) or
    +              (taicpu(hp1).opcode = A_FST)) and
    +             (taicpu(hp1).opsize = taicpu(p).opsize) and
    +             (taicpu(hp1).oper[0]^.typ = top_ref) and
    +             RefsEqual(taicpu(p).oper[0]^.ref^, taicpu(hp1).oper[0]^.ref^) then
                 begin
    -              { quick check for loops of the form "l5: ; jmp l5 }
    -              if (tasmlabel(taicpu(p1).oper[0]^.ref^.symbol).labelnr =
    -                   tasmlabel(hp.oper[0]^.ref^.symbol).labelnr) then
    -                exit;
    -              if not GetFinalDestination(asml, taicpu(p1),succ(level)) then
    -                exit;
    -              tasmlabel(hp.oper[0]^.ref^.symbol).decrefs;
    -              hp.oper[0]^.ref^.symbol:=taicpu(p1).oper[0]^.ref^.symbol;
    -              tasmlabel(hp.oper[0]^.ref^.symbol).increfs;
    -            end
    -          else
    -            if (taicpu(p1).condition = inverse_cond(hp.condition)) then
    -              if not FindAnyLabel(p1,l) then
    +              if ((taicpu(hp2).opcode = A_FMULP) or
    +                  (taicpu(hp2).opcode = A_FADDP)) then
    +              { change                      to
    +                  fld/fst   mem1  (hp1)       fld/fst   mem1
    +                  fld       mem1  (p)         fadd/
    +                  faddp/                       fmul     st, st
    +                  fmulp  st, st1 (hp2) }
                     begin
    -  {$ifdef finaldestdebug}
    -                  insertllitem(asml,p1,p1.next,tai_comment.Create(
    -                    strpnew('previous label inserted'))));
    -  {$endif finaldestdebug}
    -                  current_asmdata.getjumplabel(l);
    -                  insertllitem(p1,p1.next,tai_label.Create(l));
    -                  tasmlabel(taicpu(hp).oper[0]^.ref^.symbol).decrefs;
    -                  hp.oper[0]^.ref^.symbol := l;
    -                  l.increfs;
    -  {               this won't work, since the new label isn't in the labeltable }
    -  {               so it will fail the rangecheck. Labeltable should become a   }
    -  {               hashtable to support this:                                   }
    -  {               GetFinalDestination(asml, hp);                               }
    +                  asml.remove(p);
    +                  p.free;
    +                  p := hp1;
    +                  if (taicpu(hp2).opcode = A_FADDP) then
    +                    taicpu(hp2).opcode := A_FADD
    +                  else
    +                    taicpu(hp2).opcode := A_FMUL;
    +                  taicpu(hp2).oper[1]^.reg := NR_ST;
    +                  Result := True;
                     end
                   else
    +              { change              to
    +                  fld/fst mem1 (hp1)   fld/fst mem1
    +                  fld     mem1 (p)     fld      st}
                     begin
    -  {$ifdef finaldestdebug}
    -                  insertllitem(asml,p1,p1.next,tai_comment.Create(
    -                    strpnew('next label reused'))));
    -  {$endif finaldestdebug}
    -                  l.increfs;
    -                  hp.oper[0]^.ref^.symbol := l;
    -                  if not GetFinalDestination(asml, hp,succ(level)) then
    -                    exit;
    +                  taicpu(p).changeopsize(S_FL);
    +                  taicpu(p).loadreg(0,NR_ST);
                     end;
    -      end;
    -    GetFinalDestination := true;
    -  end;
     
    -begin
    -  p := BlockStart;
    -  ClearUsedRegs;
    -  while (p <> BlockEnd) Do
    -    begin
    -      UpDateUsedRegs(UsedRegs, tai(p.next));
    -      case p.Typ Of
    -        ait_instruction:
    -          begin
    -            current_filepos:=taicpu(p).fileinfo;
    -            if InsContainsSegRef(taicpu(p)) then
    -              begin
    -                p := tai(p.next);
    -                continue;
    -              end;
    -            { Handle Jmp Optimizations }
    -            if taicpu(p).is_jmp then
    -              begin
    -                { the following if-block removes all code between a jmp and the next label,
    -                  because it can never be executed }
    -                if (taicpu(p).opcode = A_JMP) then
    -                  begin
    -                    hp2:=p;
    -                    while GetNextInstruction(hp2, hp1) and
    -                          (hp1.typ <> ait_label) do
    -                      if not(hp1.typ in ([ait_label]+skipinstr)) then
    -                        begin
    -                          { don't kill start/end of assembler block,
    -                            no-line-info-start/end, cfi end, etc }
    -                          if not(hp1.typ in [ait_align,ait_marker]) and
    -                             ((hp1.typ<>ait_cfi) or
    -                              (tai_cfi_base(hp1).cfityp<>cfi_endproc)) then
    -                            begin
    -                              asml.remove(hp1);
    -                              hp1.free;
    -                            end
    -                          else
    -                            hp2:=hp1;
    -                        end
    -                      else break;
    -                    end;
    -                { remove jumps to a label coming right after them }
    -                if GetNextInstruction(p, hp1) then
    -                  begin
    -                    if FindLabel(tasmlabel(taicpu(p).oper[0]^.ref^.symbol), hp1) and
    -  { TODO: FIXME removing the first instruction fails}
    -                        (p<>blockstart) then
    -                      begin
    -                        hp2:=tai(hp1.next);
    -                        asml.remove(p);
    -                        p.free;
    -                        p:=hp2;
    -                        continue;
    -                      end
    -                    else
    -                      begin
    -                        if hp1.typ = ait_label then
    -                          SkipLabels(hp1,hp1);
    -                        if (tai(hp1).typ=ait_instruction) and
    -                            (taicpu(hp1).opcode=A_JMP) and
    -                            GetNextInstruction(hp1, hp2) and
    -                            FindLabel(tasmlabel(taicpu(p).oper[0]^.ref^.symbol), hp2) then
    -                          begin
    -                            if taicpu(p).opcode=A_Jcc then
    -                              begin
    -                                taicpu(p).condition:=inverse_cond(taicpu(p).condition);
    -                                tai_label(hp2).labsym.decrefs;
    -                                taicpu(p).oper[0]^.ref^.symbol:=taicpu(hp1).oper[0]^.ref^.symbol;
    -                                { when free'ing hp1, the ref. isn't decresed, so we don't
    -                                  increase it (FK)
    +            end
    +          else
    +            begin
    +              if taicpu(hp2).opcode in [A_FMULP,A_FADDP,A_FSUBP,A_FDIVP,A_FSUBRP,A_FDIVRP] then
    +          { change                        to
    +              fld      mem2    (p)        fxxx       mem2
    +              fxxxp    st, st1 (hp2)                      }
     
    -                                  taicpu(p).oper[0]^.ref^.symbol.increfs;
    -                                }
    -                                asml.remove(hp1);
    -                                hp1.free;
    -                                GetFinalDestination(asml, taicpu(p),0);
    -                              end
    -                            else
    -                              begin
    -                                GetFinalDestination(asml, taicpu(p),0);
    -                                p:=tai(p.next);
    -                                continue;
    -                              end;
    -                          end
    -                        else
    -                          GetFinalDestination(asml, taicpu(p),0);
    -                      end;
    +                begin
    +                  case taicpu(hp2).opcode Of
    +                    A_FADDP: taicpu(p).opcode := A_FADD;
    +                    A_FMULP: taicpu(p).opcode := A_FMUL;
    +                    A_FSUBP: taicpu(p).opcode := A_FSUBR;
    +                    A_FSUBRP: taicpu(p).opcode := A_FSUB;
    +                    A_FDIVP: taicpu(p).opcode := A_FDIVR;
    +                    A_FDIVRP: taicpu(p).opcode := A_FDIV;
    +                    else
    +                      InternalError(2019071011);
                       end;
    -              end
    -            else
    -            { All other optimizes }
    -              begin
    -                case taicpu(p).opcode Of
    -                  A_AND:
    -                    if OptPass1And(p) then
    -                      continue;
    -                  A_CMP:
    -                    begin
    -                      { cmp register,$8000                neg register
    -                        je target                 -->     jo target
    +                  asml.remove(hp2);
    +                  hp2.free;
    +                end;
    +            end;
    +    end;
     
    -                        .... only if register is deallocated before jump.}
    -                      case Taicpu(p).opsize of
    -                        S_B: v:=$80;
    -                        S_W: v:=$8000;
    -                        S_L: v:=aint($80000000);
    -                        else
    -                          internalerror(2013112905);
    -                      end;
    -                      if (taicpu(p).oper[0]^.typ=Top_const) and
    -                         (taicpu(p).oper[0]^.val=v) and
    -                         (Taicpu(p).oper[1]^.typ=top_reg) and
    -                         GetNextInstruction(p, hp1) and
    -                         (hp1.typ=ait_instruction) and
    -                         (taicpu(hp1).opcode=A_Jcc) and
    -                         (Taicpu(hp1).condition in [C_E,C_NE]) and
    -                         not(RegInUsedRegs(Taicpu(p).oper[1]^.reg, UsedRegs)) then
    -                        begin
    -                          Taicpu(p).opcode:=A_NEG;
    -                          Taicpu(p).loadoper(0,Taicpu(p).oper[1]^);
    -                          Taicpu(p).clearop(1);
    -                          Taicpu(p).ops:=1;
    -                          if Taicpu(hp1).condition=C_E then
    -                            Taicpu(hp1).condition:=C_O
    -                          else
    -                            Taicpu(hp1).condition:=C_NO;
    -                          continue;
    -                        end;
    -                      {
    -                      @@2:                              @@2:
    -                        ....                              ....
    -                        cmp operand1,0
    -                        jle/jbe @@1
    -                        dec operand1             -->      sub operand1,1
    -                        jmp @@2                           jge/jae @@2
    -                      @@1:                              @@1:
    -                        ...                               ....}
    -                      if (taicpu(p).oper[0]^.typ = top_const) and
    -                         (taicpu(p).oper[1]^.typ in [top_reg,top_ref]) and
    -                         (taicpu(p).oper[0]^.val = 0) and
    -                         GetNextInstruction(p, hp1) and
    -                         (hp1.typ = ait_instruction) and
    -                         (taicpu(hp1).is_jmp) and
    -                         (taicpu(hp1).opcode=A_Jcc) and
    -                         (taicpu(hp1).condition in [C_LE,C_BE]) and
    -                         GetNextInstruction(hp1,hp2) and
    -                         (hp2.typ = ait_instruction) and
    -                         (taicpu(hp2).opcode = A_DEC) and
    -                         OpsEqual(taicpu(hp2).oper[0]^,taicpu(p).oper[1]^) and
    -                         GetNextInstruction(hp2, hp3) and
    -                         (hp3.typ = ait_instruction) and
    -                         (taicpu(hp3).is_jmp) and
    -                         (taicpu(hp3).opcode = A_JMP) and
    -                         GetNextInstruction(hp3, hp4) and
    -                         FindLabel(tasmlabel(taicpu(hp1).oper[0]^.ref^.symbol),hp4) then
    -                        begin
    -                          taicpu(hp2).Opcode := A_SUB;
    -                          taicpu(hp2).loadoper(1,taicpu(hp2).oper[0]^);
    -                          taicpu(hp2).loadConst(0,1);
    -                          taicpu(hp2).ops:=2;
    -                          taicpu(hp3).Opcode := A_Jcc;
    -                          case taicpu(hp1).condition of
    -                            C_LE: taicpu(hp3).condition := C_GE;
    -                            C_BE: taicpu(hp3).condition := C_AE;
    -                            else
    -                              internalerror(2019050903);
    -                          end;
    -                          asml.remove(p);
    -                          asml.remove(hp1);
    -                          p.free;
    -                          hp1.free;
    -                          p := hp2;
    -                          continue;
    -                        end
    -                    end;
    -                  A_FLD:
    -                    if OptPass1FLD(p) then
    -                      continue;
    -                  A_FSTP,A_FISTP:
    -                    if OptPass1FSTP(p) then
    -                      continue;
    -                  A_LEA:
    -                    begin
    -                      if OptPass1LEA(p) then
    -                        continue;
    -                    end;
     
    -                  A_MOV:
    -                    begin
    -                      If OptPass1MOV(p) then
    -                        Continue;
    -                    end;
    +  function TCpuAsmOptimizer.OptPass1PUSH(var p: tai): Boolean;
    +    var
    +      hp1: tai;
    +    begin
    +      Result := False;
    +      if (taicpu(p).opsize = S_W) and
    +         (taicpu(p).oper[0]^.typ = Top_Const) and
    +         GetNextInstruction(p, hp1) and
    +         (tai(hp1).typ = ait_instruction) and
    +         (taicpu(hp1).opcode = A_PUSH) and
    +         (taicpu(hp1).oper[0]^.typ = Top_Const) and
    +         (taicpu(hp1).opsize = S_W) then
    +        begin
    +          taicpu(p).changeopsize(S_L);
    +          taicpu(p).loadConst(0,taicpu(p).oper[0]^.val shl 16 + word(taicpu(hp1).oper[0]^.val));
    +          asml.remove(hp1);
    +          hp1.free;
    +        end;
    +    end;
     
    -                  A_MOVSX,
    -                  A_MOVZX :
    -                    begin
    -                      If OptPass1Movx(p) then
    -                        Continue
    -                    end;
     
    -(* should not be generated anymore by the current code generator
    -                  A_POP:
    +    function TCpuAsmOptimizer.PostPeepholeOptMovzx(var p: tai): Boolean;
    +      var
    +        hp1: tai;
    +      begin
    +        { if register vars are on, it's possible there is code like }
    +        {   "cmpl $3,%eax; movzbl 8(%ebp),%ebx; je .Lxxx"           }
    +        { so we can't safely replace the movzx then with xor/mov,   }
    +        { since that would change the flags (JM)                    }
    +        Result := False;
    +        if not(cs_opt_regvar in current_settings.optimizerswitches) then
    +          begin
    +            if (taicpu(p).oper[1]^.typ = top_reg) then
    +              if (taicpu(p).oper[0]^.typ = top_reg)
    +                then
    +                  if (taicpu(p).opsize = S_BL) and
    +                    IsGP32Reg(taicpu(p).oper[1]^.reg) and
    +                    not(cs_opt_size in current_settings.optimizerswitches) and
    +                    (current_settings.optimizecputype = cpu_Pentium) then
    +                    {Change "movzbl %reg1, %reg2" to
    +                     "xorl %reg2, %reg2; movb %reg1, %reg2" for Pentium and
    +                     PentiumMMX}
                         begin
    -                      if target_info.system=system_i386_go32v2 then
    -                      begin
    -                        { Transform a series of pop/pop/pop/push/push/push to }
    -                        { 'movl x(%esp),%reg' for go32v2 (not for the rest,   }
    -                        { because I'm not sure whether they can cope with     }
    -                        { 'movl x(%esp),%reg' with x > 0, I believe we had    }
    -                        { such a problem when using esp as frame pointer (JM) }
    -                        if (taicpu(p).oper[0]^.typ = top_reg) then
    -                          begin
    -                            hp1 := p;
    -                            hp2 := p;
    -                            l := 0;
    -                            while getNextInstruction(hp1,hp1) and
    -                                  (hp1.typ = ait_instruction) and
    -                                  (taicpu(hp1).opcode = A_POP) and
    -                                  (taicpu(hp1).oper[0]^.typ = top_reg) do
    -                              begin
    -                                hp2 := hp1;
    -                                inc(l,4);
    -                              end;
    -                            getLastInstruction(p,hp3);
    -                            l1 := 0;
    -                            while (hp2 <> hp3) and
    -                                  assigned(hp1) and
    -                                  (hp1.typ = ait_instruction) and
    -                                  (taicpu(hp1).opcode = A_PUSH) and
    -                                  (taicpu(hp1).oper[0]^.typ = top_reg) and
    -                                  (taicpu(hp1).oper[0]^.reg.enum = taicpu(hp2).oper[0]^.reg.enum) do
    -                              begin
    -                                { change it to a two op operation }
    -                                taicpu(hp2).oper[1]^.typ:=top_none;
    -                                taicpu(hp2).ops:=2;
    -                                taicpu(hp2).opcode := A_MOV;
    -                                taicpu(hp2).loadoper(1,taicpu(hp1).oper[0]^);
    -                                reference_reset(tmpref);
    -                                tmpRef.base.enum:=R_INTREGISTER;
    -                                tmpRef.base.number:=NR_STACK_POINTER_REG;
    -                                convert_register_to_enum(tmpref.base);
    -                                tmpRef.offset := l;
    -                                taicpu(hp2).loadRef(0,tmpRef);
    -                                hp4 := hp1;
    -                                getNextInstruction(hp1,hp1);
    -                                asml.remove(hp4);
    -                                hp4.free;
    -                                getLastInstruction(hp2,hp2);
    -                                dec(l,4);
    -                                inc(l1);
    -                              end;
    -                            if l <> -4 then
    -                              begin
    -                                inc(l,4);
    -                                for l1 := l1 downto 1 do
    -                                  begin
    -                                    getNextInstruction(hp2,hp2);
    -                                    dec(taicpu(hp2).oper[0]^.ref^.offset,l);
    -                                  end
    -                              end
    -                          end
    -                        end
    -                      else
    -                        begin
    -                          if (taicpu(p).oper[0]^.typ = top_reg) and
    -                            GetNextInstruction(p, hp1) and
    -                            (tai(hp1).typ=ait_instruction) and
    -                            (taicpu(hp1).opcode=A_PUSH) and
    -                            (taicpu(hp1).oper[0]^.typ = top_reg) and
    -                            (taicpu(hp1).oper[0]^.reg.enum=taicpu(p).oper[0]^.reg.enum) then
    -                            begin
    -                              { change it to a two op operation }
    -                              taicpu(p).oper[1]^.typ:=top_none;
    -                              taicpu(p).ops:=2;
    -                              taicpu(p).opcode := A_MOV;
    -                              taicpu(p).loadoper(1,taicpu(p).oper[0]^);
    -                              reference_reset(tmpref);
    -                              TmpRef.base.enum := R_ESP;
    -                              taicpu(p).loadRef(0,TmpRef);
    -                              asml.remove(hp1);
    -                              hp1.free;
    -                            end;
    -                        end;
    -                    end;
    -*)
    -                  A_PUSH:
    -                    begin
    -                      if (taicpu(p).opsize = S_W) and
    -                         (taicpu(p).oper[0]^.typ = Top_Const) and
    -                         GetNextInstruction(p, hp1) and
    -                         (tai(hp1).typ = ait_instruction) and
    -                         (taicpu(hp1).opcode = A_PUSH) and
    -                         (taicpu(hp1).oper[0]^.typ = Top_Const) and
    -                         (taicpu(hp1).opsize = S_W) then
    -                        begin
    -                          taicpu(p).changeopsize(S_L);
    -                          taicpu(p).loadConst(0,taicpu(p).oper[0]^.val shl 16 + word(taicpu(hp1).oper[0]^.val));
    -                          asml.remove(hp1);
    -                          hp1.free;
    -                        end;
    -                    end;
    -                  A_SHL, A_SAL:
    -                    if OptPass1SHLSAL(p) then
    -                      Continue;
    -                  A_SUB:
    -                    if OptPass1Sub(p) then
    -                      continue;
    -                  A_VMOVAPS,
    -                  A_VMOVAPD:
    -                    if OptPass1VMOVAP(p) then
    -                      continue;
    -                  A_VDIVSD,
    -                  A_VDIVSS,
    -                  A_VSUBSD,
    -                  A_VSUBSS,
    -                  A_VMULSD,
    -                  A_VMULSS,
    -                  A_VADDSD,
    -                  A_VADDSS,
    -                  A_VANDPD,
    -                  A_VANDPS,
    -                  A_VORPD,
    -                  A_VORPS,
    -                  A_VXORPD,
    -                  A_VXORPS:
    -                    if OptPass1VOP(p) then
    -                      continue;
    -                  A_MULSD,
    -                  A_MULSS,
    -                  A_ADDSD,
    -                  A_ADDSS:
    -                    if OptPass1OP(p) then
    -                      continue;
    -                  A_MOVAPD,
    -                  A_MOVAPS:
    -                    if OptPass1MOVAP(p) then
    -                      continue;
    -                  A_VMOVSD,
    -                  A_VMOVSS,
    -                  A_MOVSD,
    -                  A_MOVSS:
    -                    if OptPass1MOVXX(p) then
    -                      continue;
    -                  A_SETcc:
    -                    begin
    -                      if OptPass1SETcc(p) then
    -                        continue;
    +                      hp1 := taicpu.op_reg_reg(A_XOR, S_L, taicpu(p).oper[1]^.reg, taicpu(p).oper[1]^.reg);
    +                      InsertLLItem(p.previous, p, hp1);
    +                      taicpu(p).opcode := A_MOV;
    +                      taicpu(p).changeopsize(S_B);
    +                      setsubreg(taicpu(p).oper[1]^.reg,R_SUBL);
                         end
    -                  else
    -                    ;
    -                end;
    -            end; { if is_jmp }
    -          end;
    +                else if (taicpu(p).oper[0]^.typ = top_ref) and
    +                  (taicpu(p).oper[0]^.ref^.base <> taicpu(p).oper[1]^.reg) and
    +                  (taicpu(p).oper[0]^.ref^.index <> taicpu(p).oper[1]^.reg) and
    +                  not(cs_opt_size in current_settings.optimizerswitches) and
    +                  IsGP32Reg(taicpu(p).oper[1]^.reg) and
    +                  (current_settings.optimizecputype = cpu_Pentium) and
    +                  (taicpu(p).opsize = S_BL) then
    +                  {changes "movzbl mem, %reg" to "xorl %reg, %reg; movb mem, %reg8" for
    +                    Pentium and PentiumMMX}
    +                  begin
    +                    hp1 := taicpu.Op_reg_reg(A_XOR, S_L, taicpu(p).oper[1]^.reg, taicpu(p).oper[1]^.reg);
    +                    taicpu(p).opcode := A_MOV;
    +                    taicpu(p).changeopsize(S_B);
    +                    setsubreg(taicpu(p).oper[1]^.reg,R_SUBL);
    +                    InsertLLItem(p.previous, p, hp1);
    +                  end;
    +          end
             else
               ;
           end;
    -      updateUsedRegs(UsedRegs,p);
    -      p:=tai(p.next);
    -    end;
    -end;
     
     
    -procedure TCPUAsmOptimizer.PeepHoleOptPass2;
    -var
    -  p : tai;
    -begin
    -  p := BlockStart;
    -  ClearUsedRegs;
    -  while (p <> BlockEnd) Do
    -    begin
    -      UpdateUsedRegs(UsedRegs, tai(p.next));
    -      case p.Typ Of
    -        Ait_Instruction:
    -          begin
    -            if InsContainsSegRef(taicpu(p)) then
    -              begin
    -                p := tai(p.next);
    -                continue;
    -              end;
    -            case taicpu(p).opcode Of
    -              A_Jcc:
    -                if OptPass2Jcc(p) then
    -                  continue;
    -              A_FSTP,A_FISTP:
    -                if OptPass1FSTP(p) then
    -                  continue;
    -              A_IMUL:
    -                if OptPass2Imul(p) then
    -                  continue;
    -              A_JMP:
    -                if OptPass2Jmp(p) then
    -                  continue;
    -              A_MOV:
    -                begin
    -                  if OptPass2MOV(p) then
    -                    continue;
    -                end
    -              else
    -                ;
    -            end;
    +    function TCpuAsmOptimizer.PeepHoleOptPass1Cpu(var p: tai): boolean;
    +      var
    +        Opcode: TAsmOp;
    +      begin
    +        result:=False;
    +        { p is known to be an instruction by this point }
    +
    +        { Use a local variable/register to reduce the number of pointer
    +          dereferences (the peephole optimiser would never optimise this
    +          by itself because the compiler has to consider the possibility
    +          of multi-threaded race hazards). [Kit] }
    +        Opcode := taicpu(p).opcode;
    +
    +        { Clever optimisation: MOV instructions appear disproportionally
    +          more frequently than any other instruction, so check for this
    +          opcode first and reduce the total number of comparisons
    +          required over the entire block. [Kit] }
    +        if Opcode = A_MOV then
    +          Result := OptPass1MOV(p)
    +        else
    +          case Opcode of
    +            A_PUSH:
    +              Result := OptPass1PUSH(p);
    +            A_AND:
    +              Result:=OptPass1AND(p);
    +            A_XOR:
    +              Result:=OptPass1XOR(p);
    +            A_MOVSX,
    +            A_MOVZX:
    +              Result:=OptPass1Movx(p);
    +            A_VMOVAPS,
    +            A_VMOVAPD,
    +            A_VMOVUPS,
    +            A_VMOVUPD:
    +              result:=OptPass1VMOVAP(p);
    +            A_MOVAPD,
    +            A_MOVAPS,
    +            A_MOVUPD,
    +            A_MOVUPS:
    +              result:=OptPass1MOVAP(p);
    +            A_VDIVSD,
    +            A_VDIVSS,
    +            A_VSUBSD,
    +            A_VSUBSS,
    +            A_VMULSD,
    +            A_VMULSS,
    +            A_VADDSD,
    +            A_VADDSS,
    +            A_VANDPD,
    +            A_VANDPS,
    +            A_VORPD,
    +            A_VORPS,
    +            A_VXORPD,
    +            A_VXORPS:
    +              result:=OptPass1VOP(p);
    +            A_MULSD,
    +            A_MULSS,
    +            A_ADDSD,
    +            A_ADDSS:
    +              result:=OptPass1OP(p);
    +            A_VMOVSD,
    +            A_VMOVSS,
    +            A_MOVSD,
    +            A_MOVSS:
    +              result:=OptPass1MOVXX(p);
    +            A_FSTP,A_FISTP:
    +              Result := OptPass1FSTPFISTP(p);
    +            A_FLD:
    +              Result := OptPass1FLD(p);
    +            A_LEA:
    +              result:=OptPass1LEA(p);
    +            A_SUB:
    +              result:=OptPass1Sub(p);
    +            A_SHL,A_SAL:
    +              result:=OptPass1SHLSAL(p);
    +            A_SHR,A_SAR:
    +              result:=OptPass1SHRSAR(p);
    +            A_SETcc:
    +              result:=OptPass1SETcc(p);
    +            A_IMUL:
    +              Result:=OptPass1Imul(p);
    +            A_JMP:
    +              Result:=OptPass1Jmp(p);
    +            A_Jcc:
    +              Result:=OptPass1Jcc(p);
    +            else
    +              { Do nothing };
               end;
    -        else
    -          ;
           end;
    -      p := tai(p.next)
    -    end;
    -end;
     
     
    -procedure TCPUAsmOptimizer.PostPeepHoleOpts;
    -var
    -  p,hp1: tai;
    -begin
    -  p := BlockStart;
    -  ClearUsedRegs;
    -  while (p <> BlockEnd) Do
    -    begin
    -      UpdateUsedRegs(UsedRegs, tai(p.next));
    -      case p.Typ Of
    -        Ait_Instruction:
    +    function TCpuAsmOptimizer.PostPeepHoleOptsCpu(var p: tai): boolean;
    +      begin
    +        Result := False;
    +        case taicpu(p).opcode Of
    +          A_CALL:
    +            Result := PostPeepHoleOptCall(p);
    +          A_LEA:
    +            Result := PostPeepholeOptLea(p);
    +          A_CMP:
    +            Result := PostPeepholeOptCmp(p);
    +          A_MOV:
    +            Result := PostPeepholeOptMov(p);
    +          A_TEST, A_OR:
    +            Result := PostPeepholeOptTestOr(p);
    +          A_MOVZX:
    +            Result := PostPeepholeOptMovzx(p);
    +          else
    +            { Do nothing };
    +        end;
    +      end;
    +
    +
    +    procedure TCpuAsmOptimizer.PostPeepHoleOpts;
    +      var
    +        p: tai;
    +      begin
    +        p := BlockStart;
    +        ClearUsedRegs;
    +        while (p <> BlockEnd) Do
               begin
    -            if InsContainsSegRef(taicpu(p)) then
    +            UpdateUsedRegs(UsedRegs, tai(p.next));
    +            if p.Typ = ait_Instruction then
                   begin
    -                p := tai(p.next);
    -                continue;
    +                if InsContainsSegRef(taicpu(p)) then
    +                  begin
    +                    p := tai(p.next);
    +                    continue;
    +                  end;
    +                if PostPeepHoleOptsCpu(p) then
    +                  Continue;
                   end;
    -            case taicpu(p).opcode Of
    -              A_CALL:
    -                if PostPeepHoleOptCall(p) then
    -                  Continue;
    -              A_LEA:
    -                if PostPeepholeOptLea(p) then
    -                  Continue;
    -              A_CMP:
    -                if PostPeepholeOptCmp(p) then
    -                  Continue;
    -              A_MOV:
    -                if PostPeepholeOptMov(p) then
    -                  Continue;
    -              A_MOVZX:
    -                { if register vars are on, it's possible there is code like }
    -                {   "cmpl $3,%eax; movzbl 8(%ebp),%ebx; je .Lxxx"           }
    -                { so we can't safely replace the movzx then with xor/mov,   }
    -                { since that would change the flags (JM)                    }
    -                if not(cs_opt_regvar in current_settings.optimizerswitches) then
    -                 begin
    -                  if (taicpu(p).oper[1]^.typ = top_reg) then
    -                    if (taicpu(p).oper[0]^.typ = top_reg)
    -                      then
    -                        case taicpu(p).opsize of
    -                          S_BL:
    -                            begin
    -                              if IsGP32Reg(taicpu(p).oper[1]^.reg) and
    -                                 not(cs_opt_size in current_settings.optimizerswitches) and
    -                                 (current_settings.optimizecputype = cpu_Pentium) then
    -                                  {Change "movzbl %reg1, %reg2" to
    -                                   "xorl %reg2, %reg2; movb %reg1, %reg2" for Pentium and
    -                                   PentiumMMX}
    -                                begin
    -                                  hp1 := taicpu.op_reg_reg(A_XOR, S_L,
    -                                              taicpu(p).oper[1]^.reg, taicpu(p).oper[1]^.reg);
    -                                  InsertLLItem(p.previous, p, hp1);
    -                                  taicpu(p).opcode := A_MOV;
    -                                  taicpu(p).changeopsize(S_B);
    -                                  setsubreg(taicpu(p).oper[1]^.reg,R_SUBL);
    -                                end;
    -                            end;
    -                          else
    -                            ;
    -                        end
    -                      else if (taicpu(p).oper[0]^.typ = top_ref) and
    -                          (taicpu(p).oper[0]^.ref^.base <> taicpu(p).oper[1]^.reg) and
    -                          (taicpu(p).oper[0]^.ref^.index <> taicpu(p).oper[1]^.reg) and
    -                          not(cs_opt_size in current_settings.optimizerswitches) and
    -                          IsGP32Reg(taicpu(p).oper[1]^.reg) and
    -                          (current_settings.optimizecputype = cpu_Pentium) and
    -                          (taicpu(p).opsize = S_BL) then
    -                        {changes "movzbl mem, %reg" to "xorl %reg, %reg; movb mem, %reg8" for
    -                          Pentium and PentiumMMX}
    -                        begin
    -                          hp1 := taicpu.Op_reg_reg(A_XOR, S_L, taicpu(p).oper[1]^.reg,
    -                                      taicpu(p).oper[1]^.reg);
    -                          taicpu(p).opcode := A_MOV;
    -                          taicpu(p).changeopsize(S_B);
    -                          setsubreg(taicpu(p).oper[1]^.reg,R_SUBL);
    -                          InsertLLItem(p.previous, p, hp1);
    -                        end;
    -                 end;
    -              A_TEST, A_OR:
    -                begin
    -                  if PostPeepholeOptTestOr(p) then
    -                    Continue;
    -                end;
    -              else
    -                ;
    -            end;
    +
    +            p := tai(p.next)
               end;
    -        else
    -          ;
    +        OptReferences;
           end;
    -      p := tai(p.next)
    -    end;
    -  OptReferences;
    -end;
     
     
    -Procedure TCpuAsmOptimizer.Optimize;
    -Var
    -  HP: Tai;
    -  pass: longint;
    -  slowopt, changed, lastLoop: boolean;
    -Begin
    -  slowopt := (cs_opt_level3 in current_settings.optimizerswitches);
    -  pass := 0;
    -  changed := false;
    -  repeat
    -     lastLoop :=
    -       not(slowopt) or
    -       (not changed and (pass > 2)) or
    -      { prevent endless loops }
    -       (pass = 4);
    -     changed := false;
    -   { Setup labeltable, always necessary }
    -     blockstart := tai(asml.first);
    -     pass_1;
    -   { Blockend now either contains an ait_marker with Kind = mark_AsmBlockStart, }
    -   { or nil                                                                }
    -     While Assigned(BlockStart) Do
    -       Begin
    -         if (cs_opt_peephole in current_settings.optimizerswitches) then
    -           begin
    -            if (pass = 0) then
    -              PrePeepHoleOpts;
    -              { Peephole optimizations }
    -               PeepHoleOptPass1;
    -              { Only perform them twice in the first pass }
    -               if pass = 0 then
    -                 PeepHoleOptPass1;
    -           end;
    -        { More peephole optimizations }
    -         if (cs_opt_peephole in current_settings.optimizerswitches) then
    -           begin
    -             PeepHoleOptPass2;
    -             if lastLoop then
    -               PostPeepHoleOpts;
    -           end;
    -
    -        { Continue where we left off, BlockEnd is either the start of an }
    -        { assembler block or nil                                         }
    -         BlockStart := BlockEnd;
    -         While Assigned(BlockStart) And
    -               (BlockStart.typ = ait_Marker) And
    -               (Tai_Marker(BlockStart).Kind = mark_AsmBlockStart) Do
    -           Begin
    -           { We stopped at an assembler block, so skip it }
    -            Repeat
    -              BlockStart := Tai(BlockStart.Next);
    -            Until (BlockStart.Typ = Ait_Marker) And
    -                  (Tai_Marker(Blockstart).Kind = mark_AsmBlockEnd);
    -           { Blockstart now contains a Tai_marker(mark_AsmBlockEnd) }
    -             If GetNextInstruction(BlockStart, HP) And
    -                ((HP.typ <> ait_Marker) Or
    -                 (Tai_Marker(HP).Kind <> mark_AsmBlockStart)) Then
    -             { There is no assembler block anymore after the current one, so }
    -             { optimize the next block of "normal" instructions              }
    -               pass_1
    -             { Otherwise, skip the next assembler block }
    -             else
    -               blockStart := hp;
    -           End;
    -       End;
    -     inc(pass);
    -  until lastLoop;
    -  dfa.free;
    -
    -End;
    -
    -
     begin
       casmoptimizer:=TCpuAsmOptimizer;
     end.
    Index: compiler/x86/aoptx86.pas
    ===================================================================
    --- compiler/x86/aoptx86.pas	(revision 42345)
    +++ compiler/x86/aoptx86.pas	(working copy)
    @@ -30,16 +30,24 @@
         uses
           globtype,
           cpubase,
    -      aasmtai,aasmcpu,
    +      aasmtai,aasmcpu,aasmdata,
           cgbase,cgutils,
           aopt,aoptobj;
     
         type
    +
           TX86AsmOptimizer = class(TAsmOptimizer)
             function RegLoadedWithNewValue(reg : tregister; hp : tai) : boolean; override;
             function InstructionLoadsFromReg(const reg : TRegister; const hp : tai) : boolean; override;
             function RegReadByInstruction(reg : TRegister; hp : tai) : boolean;
    +        procedure Optimize; override;
    +        procedure PeepHoleOptPass1; override;
    +        function GetFirstInstruction(const Start: tai; var p: tai): Boolean; override;
    +        constructor Create(_AsmL: TAsmList); override;
    +        destructor Destroy; override;
           protected
    +        StatePreserveRegs: TAllUsedRegs;
    +
             { checks whether loading a new value in reg1 overwrites the entirety of reg2 }
             function Reg1WriteOverwritesReg2Entirely(reg1, reg2: tregister): boolean;
             { checks whether reading the value in reg1 depends on the value of reg2. This
    @@ -56,8 +70,9 @@
     
             function DoSubAddOpt(var p : tai) : Boolean;
     
    -        function PrePeepholeOptSxx(var p : tai) : boolean;
    -        function PrePeepholeOptIMUL(var p : tai) : boolean;
    +        { - Below are optimisations common to both i386 and x86_64
    +          - See i386/aoptcpu.pas for i386-specific optimisations
    +          - See x86_64/aoptcpu.pas for x86_64-specific optimisations }
     
             function OptPass1AND(var p : tai) : boolean;
             function OptPass1VMOVAP(var p : tai) : boolean;
    @@ -71,24 +86,18 @@
             function OptPass1Sub(var p : tai) : boolean;
             function OptPass1SHLSAL(var p : tai) : boolean;
             function OptPass1SETcc(var p: tai): boolean;
    -        function OptPass1FSTP(var p: tai): boolean;
    -        function OptPass1FLD(var p: tai): boolean;
    +        function OptPass1SHRSAR(var p : tai) : boolean;
    +        function OptPass1Imul(var p : tai) : boolean;
    +        function OptPass1Jmp(var p : tai) : boolean;
    +        function OptPass1Jcc(var p : tai) : boolean;
    +        function OptPass1CMP(var p : tai) : boolean;
     
    -        function OptPass2MOV(var p : tai) : boolean;
    -        function OptPass2Imul(var p : tai) : boolean;
    -        function OptPass2Jmp(var p : tai) : boolean;
    -        function OptPass2Jcc(var p : tai) : boolean;
    +        function PostPeepholeOptMov(var p : tai) : Boolean; inline;
    +        function PostPeepholeOptCmp(var p : tai) : Boolean; inline;
    +        function PostPeepholeOptTestOr(var p : tai) : Boolean; inline;
    +        function PostPeepholeOptCall(var p : tai) : Boolean; inline;
    +        function PostPeepholeOptLea(var p : tai) : Boolean; inline;
     
    -        function PostPeepholeOptMov(var p : tai) : Boolean;
    -{$ifdef x86_64} { These post-peephole optimisations only affect 64-bit registers. [Kit] }
    -        function PostPeepholeOptMovzx(var p : tai) : Boolean;
    -        function PostPeepholeOptXor(var p : tai) : Boolean;
    -{$endif}
    -        function PostPeepholeOptCmp(var p : tai) : Boolean;
    -        function PostPeepholeOptTestOr(var p : tai) : Boolean;
    -        function PostPeepholeOptCall(var p : tai) : Boolean;
    -        function PostPeepholeOptLea(var p : tai) : Boolean;
    -
             procedure OptReferences;
           end;
     
    @@ -130,8 +148,11 @@
           aoptutils,
           symconst,symsym,
           cgx86,
    -      itcpugas;
    +      itcpugas,
    +      systems,
    +      aoptcpub;
     
    +
         function MatchInstruction(const instr: tai; const op: TAsmOp; const opsize: topsizes): boolean;
           begin
             result :=
    @@ -494,7 +523,459 @@
           end;
         end;
     
    +  procedure TX86AsmOptimizer.Optimize;
    +    var
    +      HP: tai;
    +    begin
    +      BlockStart := tai(AsmL.First);
    +      pass_1;
    +      while Assigned(BlockStart) do
    +        begin
     
    +          if (cs_opt_peephole in current_settings.optimizerswitches) then
    +            begin
    +              { Peephole optimizations }
    +              PeepHoleOptPass1;
    +              PostPeepHoleOpts;
    +            end;
    +          { free memory }
    +          clear;
    +          { continue where we left off, BlockEnd is either the start of an }
    +          { assembler block or nil}
    +          BlockStart := BlockEnd;
    +          While Assigned(BlockStart) And
    +                (BlockStart.typ = ait_Marker) And
    +                (tai_Marker(BlockStart).Kind = mark_AsmBlockStart) Do
    +            Begin
    +             { we stopped at an assembler block, so skip it    }
    +             While GetNextInstruction(BlockStart, BlockStart) And
    +                   ((BlockStart.Typ <> Ait_Marker) Or
    +                    (tai_Marker(Blockstart).Kind <> mark_AsmBlockEnd)) Do;
    +             { blockstart now contains a tai_marker(mark_AsmBlockEnd) }
    +             If GetNextInstruction(BlockStart, HP) And
    +                ((HP.typ <> ait_Marker) Or
    +                 (Tai_Marker(HP).Kind <> mark_AsmBlockStart)) Then
    +             { There is no assembler block anymore after the current one, so }
    +             { optimize the next block of "normal" instructions              }
    +               pass_1
    +             { Otherwise, skip the next assembler block }
    +             else
    +               blockStart := hp;
    +            end;
    +        end;
    +    end;
    +
    +  procedure TX86AsmOptimizer.PeepHoleOptPass1;
    +    var
    +      stoploop:boolean;
    +
    +      { If several labels are clustered together, retarget the jump to the last
    +        one that is still referenced }
    +      function CollapseLabelCluster(jump: tai; var lbltai: tai): TAsmLabel; inline;
    +        var
    +          LastLabel: TAsmLabel;
    +          hp2: tai;
    +        begin
    +          Result := tai_label(lbltai).labsym;
    +          LastLabel := Result;
    +          hp2 := tai(lbltai.next);
    +
    +          while (hp2 <> BlockEnd) and (hp2.typ in SkipInstr + [ait_align, ait_label]) do
    +            begin
    +
    +              if (hp2.typ = ait_label) and
    +                (tai_label(hp2).labsym.is_used) and
    +                (tai_label(hp2).labsym.labeltype = alt_jump) then
    +                LastLabel := tai_label(hp2).labsym;
    +
    +              hp2 := tai(hp2.next);
    +            end;
    +
    +          if (Result <> LastLabel) then
    +            begin
    +              Result.decrefs;
    +              JumpTargetOp(taicpu(jump))^.ref^.symbol := LastLabel;
    +              LastLabel.increfs;
    +              Result := LastLabel;
    +              lbltai := hp2;
    +            end;
    +        end;
    +
    +      function UnconditionalJumpShortcut(NCJLabel: TAsmLabel; NCJ: tai; level: Integer): TAsmLabel;
    +        var
    +          NewLabel: TAsmLabel;
    +          LabelTai, AfterLabel: tai;
    +        begin
    +          Result := nil;
    +          if level > 20 then Exit;
    +
    +          if not ((NCJ.typ=ait_instruction) and IsJumpToLabelUncond(taicpu(NCJ))) then
    +            Exit;
    +
    +          LabelTai := getlabelwithsym(NCJLabel);
    +          if not Assigned(LabelTai) then
    +            Exit;
    +
    +          SkipLabels(LabelTai, AfterLabel);
    +
    +          if (AfterLabel.typ=ait_instruction) and IsJumpToLabelUncond(taicpu(AfterLabel)) then
    +            begin
    +              NewLabel := TAsmLabel(JumpTargetOp(taicpu(AfterLabel))^.ref^.symbol);
    +
    +              if NCJLabel = NewLabel then
    +                { Identical jump }
    +                Exit;
    +
    +              Result := UnconditionalJumpShortcut(NewLabel, AfterLabel, succ(level));
    +              if not Assigned(Result) then
    +                Result := NewLabel;
    +
    +              NCJLabel.decrefs;
    +              JumpTargetOp(taicpu(NCJ))^.ref^.symbol := Result;
    +              Result.increfs;
    +            end;
    +        end;
    +
    +      function ConditionalJumpShortcut(CJLabel: TAsmLabel; var p: tai; hp1: tai): Boolean; inline;
    +        var
    +          hp2: tai;
    +          NCJLabel: TAsmLabel;
    +        begin
    +          Result := False;
    +
    +          StripDeadLabels(hp1, hp1);
    +
    +          if (hp1 <> BlockEnd) and
    +            (tai(hp1).typ=ait_instruction) and
    +            IsJumpToLabelUncond(taicpu(hp1)) then
    +            begin
    +
    +              NCJLabel := TAsmLabel(JumpTargetOp(taicpu(hp1))^.ref^.symbol);
    +
    +              if CJLabel = NCJLabel then
    +                begin
    +{$ifdef DEBUG_JUMP}
    +                  WriteLn('JUMP DEBUG: Short-circuited conditional jump');
    +{$endif DEBUG_JUMP}
    +                  { Both jumps go to the same label }
    +                  CJLabel.decrefs;
    +{$ifdef cpudelayslot}
    +                  RemoveDelaySlot(p);
    +{$endif cpudelayslot}
    +                  UpdateUsedRegs(tai(p.Next));
    +                  AsmL.Remove(p);
    +                  p.Free;
    +                  p := hp1;
    +
    +                  Result := True;
    +                  Exit;
    +                end;
    +
    +              { Do it now to get it out of the way and to aid the
    +                following optimisation }
    +              RemoveDeadCodeAfterJump(taicpu(hp1));
    +
    +              if GetNextInstruction(hp1, hp2) then
    +                begin
    +
    +                  if FindLabel(CJLabel, hp2) then
    +                    begin
    +                      { change the following jumps:
    +                          jmp<cond> CJLabel         jmp<cond_inverted> NCJLabel
    +                          jmp       NCJLabel >>>    <code>
    +                        CJLabel:                  NCJLabel:
    +                          <code>
    +                        NCJLabel:
    +                      }
    +{$if defined(arm) or defined(aarch64)}
    +                      if (taicpu(p).condition<>C_None)
    +{$if defined(aarch64)}
    +                      { can't have conditional branches to
    +                        global labels on AArch64, because the
    +                        offset may become too big }
    +                      and (NCJLabel.bind=AB_LOCAL)
    +{$endif aarch64}
    +                    then
    +                      begin
    +{$endif arm or aarch64}
    +{$ifdef DEBUG_JUMP}
    +                        WriteLn('JUMP DEBUG: Conditional jump optimisation');
    +{$endif DEBUG_JUMP}
    +                        taicpu(p).condition:=inverse_cond(taicpu(p).condition);
    +                        CJLabel.decrefs;
    +
    +                        JumpTargetOp(taicpu(p))^.ref^.symbol := JumpTargetOp(taicpu(hp1))^.ref^.symbol;
    +
    +                        { when freeing hp1, the reference count
    +                          isn't decreased, so don't increase }
    +{$ifdef cpudelayslot}
    +                        RemoveDelaySlot(hp1);
    +{$endif cpudelayslot}
    +                        asml.remove(hp1);
    +                        hp1.free;
    +
    +                        Result := True;
    +{$if defined(arm) or defined(aarch64)}
    +                      end;
    +{$endif arm or aarch64}
    +                    end
    +                  else if CollapseZeroDistJump(hp1, hp2, NCJLabel) then
    +                    Result := True;
    +                end;
    +            end;
    +
    +          if GetFinalDestination(taicpu(p),0) then
    +            stoploop := False;
    +
    +          Exit;
    +        end;
    +
    +
    +      function JumpOptimizations(var p: tai): Boolean; inline;
    +        var
    +          hp1, hp2: tai;
    +          ThisLabel: TAsmLabel;
    +          ThisPassResult: Boolean;
    +        begin
    +          Result := False;
    +          repeat
    +            ThisPassResult := False;
    +
    +            { Remove unreachable code between the jump and the next label }
    +            RemoveDeadCodeAfterJump(taicpu(p));
    +
    +            if GetNextInstruction(p, hp1) and (hp1 <> BlockEnd) then
    +              begin
    +                SkipEntryExitMarker(hp1,hp1);
    +                if (hp1 = BlockEnd) then
    +                  Exit;
    +
    +                ThisLabel := TAsmLabel(JumpTargetOp(taicpu(p))^.ref^.symbol);
    +
    +                { If there are multiple labels in a row, change the destination to the last one
    +                  in order to aid optimisation later }
    +                hp2 := getlabelwithsym(ThisLabel);
    +
    +                { getlabelwithsym returning nil occurs if a label is in a
    +                  different block (e.g. on the other side of an asm...end pair). }
    +                if Assigned(hp2) then
    +                  begin
    +                    ThisLabel := CollapseLabelCluster(p, hp2);
    +
    +                    if CollapseZeroDistJump(p, hp1, ThisLabel) then
    +                      begin
    +                        stoploop := False;
    +                        Result := True;
    +                        Continue;
    +                      end;
    +
    +                    if IsJumpToLabelUncond(taicpu(p)) then
    +                      ThisPassResult := Assigned(UnconditionalJumpShortcut(ThisLabel, p, 0))
    +                    else if (taicpu(p).opcode = aopt_condjmp) then
    +                      ThisPassResult := ConditionalJumpShortcut(ThisLabel, p, hp1);
    +                  end;
    +              end;
    +
    +            Result := Result or ThisPassResult;
    +          until not (ThisPassResult and (p.typ = ait_instruction) and IsJumpToLabel(taicpu(p)));
    +        end;
    +
    +    var
    +      p : tai;
    +      orig_instr: tasmop;
    +      StartPoint: tai;
    +      StartingRegs: TAllUsedRegs;
    +      FirstInstruction, OptLevel3: Boolean;
    +      loopcount: Integer;
    +
    +    begin
    +      { Very minor speed-up.  Reduce the chance of a memory stall and the
    +        requirement of using bitwise operations by only checking this flag once
    +        and storing a Boolean result on the stack. }
    +      OptLevel3 := (cs_opt_level3 in current_settings.optimizerswitches);
    +
    +      ClearUsedRegs;
    +
    +      { Search forward from BlockStart until we find the first instruction }
    +      if not GetFirstInstruction(BlockStart, StartPoint) then
    +        Exit;
    +
    +      { Preserve the register allocation state at StartPoint }
    +      if OptLevel3 then
    +        CopyUsedRegs(StartingRegs);
    +
    +      LoopCount := 5;
    +
    +      repeat
    +        stoploop:=true;
    +        p := StartPoint;
    +        FirstInstruction := True;
    +
    +        while (p <> BlockEnd) Do
    +          begin
    +            prefetch(p.Next);
    +
    +            case p.Typ Of
    +              ait_instruction:
    +                begin
    +                  orig_instr := taicpu(p).opcode;
    +                  {$ifdef DEBUG_OPTALLOC}
    +                  if p.Typ=ait_instruction then
    +                    InsertLLItem(tai(p.Previous),p,tai_comment.create(strpnew(GetAllocationString(UsedRegs))));
    +                  {$endif DEBUG_OPTALLOC}
    +
    +                  { orig_instr records the opcode before optimisation. If PeepHoleOptPass1Cpu leaves an instruction
    +                    with a different opcode at p, that instruction is re-examined; if the opcode is unchanged, the
    +                    optimiser assumes no further optimisations apply to it and moves on to the next instruction. }
    +
    +                  { Handle Jmp Optimizations first }
    +                  if IsJumpToLabel(taicpu(p)) and JumpOptimizations(p) then
    +                    begin
    +                      UpdateUsedRegs(p);
    +                      if FirstInstruction then
    +                        { Update StartPoint, since the old p was removed;
    +                          don't set FirstInstruction to False though, as
    +                          the new p might get removed too. }
    +                        StartPoint := p;
    +
    +                      Continue;
    +                    end;
    +
    +                  if PeepHoleOptPass1Cpu(p) then
    +                    begin
    +                      stoploop:=false;
    +                      if (p = BlockEnd) then
    +                        Continue;
    +
    +                      UpdateUsedRegs(p);
    +                      if FirstInstruction then
    +                        { Update StartPoint, since the old p was removed;
    +                          don't set FirstInstruction to False though, as
    +                          the new p might get removed too. }
    +                        StartPoint := p;
    +
    +                      if not MatchInstruction(p, orig_instr) then
    +                        continue;
    +                    end;
    +                end;
    +              else
    +                { Other optimizations }
    +                begin
    +                end;
    +            end;
    +            FirstInstruction := False;
    +            p := tai(UpdateUsedRegsAndOptimize(p).Next);
    +          end;
    +
    +        { Restore the register allocation state to what it was at StartPoint,
    +          ready for the next loop iteration. }
    +        if OptLevel3 and not stoploop then
    +          RestoreUsedRegs(StartingRegs);
    +
    +        Dec(loopcount);
    +
    +      until stoploop or not OptLevel3 or (loopcount <= 0);
    +      if (loopcount <= 0) and not stoploop then
    +        DebugMsg(SPeepholeOptimization + 'Possible infinite loop in peephole optimizer', BlockStart);
    +
    +      if OptLevel3 then
    +        ReleaseUsedRegs(StartingRegs);
    +    end;
    +
    +  constructor TX86AsmOptimizer.Create(_AsmL: TAsmList);
    +    begin
    +      inherited Create(_AsmL);
    +
    +      { Pooled object for preserving the used register state in OptPass1Jcc }
    +      CreateUsedRegs(StatePreserveRegs);
    +    end;
    +
    +  destructor TX86AsmOptimizer.Destroy;
    +    begin
    +      ReleaseUsedRegs(StatePreserveRegs);
    +      inherited Destroy;
    +    end;
    +
    +  { Search forward from Start until we find the first instruction }
    +  function TX86AsmOptimizer.GetFirstInstruction(const Start: tai; var p: tai): Boolean;
    +    begin
    +      p := Start;
    +      Result := False;
    +      while Assigned(p) and (p <> BlockEnd) do
    +        begin
    +          if (p.Typ = ait_seh_directive) then
    +            begin
    +              if (tai_seh_directive(p).kind = ash_endprologue) then
    +                { End of prologue }
    +                begin
    +                  UpdateUsedRegs(p);
    +                  Result := GetNextInstruction(p, p);
    +                  Exit;
    +                end
    +              else
    +                p := tai(p.Next);
    +            end
    +          else if (p.Typ = ait_regalloc) then
    +            begin
    +              UpdateUsedRegs(p);
    +              repeat
    +                p := tai(p.Next);
    +                { All of the nearby register allocations have been handled already }
    +              until not Assigned(p) or (p.Typ <> ait_regalloc);
    +            end
    +          else if (p.Typ <> ait_instruction) then
    +            begin
    +              p := tai(p.Next);
    +            end
    +          else if
    +            { Skip over instructions related to the function prologue }
    +            (taicpu(p).opcode = A_PUSH) or
    +            ((taicpu(p).opcode = A_LEA) and (taicpu(p).oper[1]^.typ = top_reg) and (getsupreg(taicpu(p).oper[1]^.reg) = RS_ESP)) or
    +            ((taicpu(p).opcode = A_SUB) and (taicpu(p).oper[1]^.typ = top_reg) and (getsupreg(taicpu(p).oper[1]^.reg) = RS_ESP)) or
    +            ((taicpu(p).opcode = A_MOV) and (taicpu(p).oper[0]^.typ = top_reg) and (
    +            { An alternative to PUSH: writing a register to a particular point on the stack }
    +              (
    +                { Preserving stack pointer }
    +                (taicpu(p).oper[1]^.typ = top_reg) and
    +                (getsupreg(taicpu(p).oper[0]^.reg) = RS_ESP) and (getsupreg(taicpu(p).oper[1]^.reg) = RS_EBP)
    +              ) or (
    +                (taicpu(p).oper[1]^.typ = top_ref) and (getsupreg(taicpu(p).oper[1]^.ref^.base) in [RS_ESP, RS_EBP])) and
    +                (
    +                  { If a scratch register is being written to the stack, it's likely preserving a parameter, so don't exclude }
    +                  not ((target_info.system in [system_i386_win32]) and (getsupreg(taicpu(p).oper[0]^.reg) in [RS_RAX, RS_RDX, RS_RCX])) and
    +                  not ((target_info.system in [system_x86_64_win64]) and (getsupreg(taicpu(p).oper[0]^.reg) in [RS_RAX, RS_RDX, RS_RCX, RS_R8, RS_R9, RS_R10, RS_R11])) and
    +                  not (((target_info.system in systems_linux) or (target_info.system in systems_android)) and (getsupreg(taicpu(p).oper[0]^.reg) in [RS_RDI, RS_RSI, RS_RAX, RS_RDX, RS_RCX, RS_R8, RS_R9, RS_R10, RS_R11]))
    +                )
    +              )
    +            ) or (
    +              { Writing XMM registers to the stack }
    +              (
    +                { Cannot use the "in" operator here as putting these opcodes
    +                  into a set causes compiler error e03074. [Kit] }
    +                (taicpu(p).opcode = A_MOVDQA) or
    +                (taicpu(p).opcode = A_MOVDQU) or
    +                (taicpu(p).opcode = A_VMOVDQA) or
    +                (taicpu(p).opcode = A_VMOVDQU)
    +              ) and
    +              (taicpu(p).oper[0]^.typ = top_reg) and
    +              (taicpu(p).oper[1]^.typ = top_ref) and (getsupreg(taicpu(p).oper[1]^.ref^.base) = RS_EBP) and
    +              (
    +                { If a scratch register is being written to the stack, it's likely preserving a parameter, so don't exclude }
    +                not (getsupreg(taicpu(p).oper[0]^.reg) in [RS_XMM0, RS_XMM1, RS_XMM2, RS_XMM3, RS_XMM4, RS_XMM5]) or
    +                (getsubreg(taicpu(p).oper[0]^.reg) <> R_SUBMMX)
    +              )
    +            ) then
    +              p := tai(p.Next)
    +
    +          else
    +            begin
    +              Result := True;
    +              Exit;
    +            end;
    +        end;
    +    end;
    +
    +
     {$ifdef DEBUG_AOPTCPU}
         procedure TX86AsmOptimizer.DebugMsg(const s: string;p : tai);
           begin
    @@ -645,7 +1126,7 @@
           end;
     
     
    -    function TX86AsmOptimizer.PrePeepholeOptSxx(var p : tai) : boolean;
    +    function TX86AsmOptimizer.OptPass1SHRSAR(var p : tai) : boolean;
           var
             hp1 : tai;
             l : TCGInt;
    @@ -659,7 +1140,7 @@
     
               either "sar/and", "shl/and" or just "and" depending on const1 and const2 }
             if GetNextInstruction(p, hp1) and
    -          MatchInstruction(hp1,A_SHL,[]) and
    +          MatchInstruction(hp1,A_SHL) and
               (taicpu(p).oper[0]^.typ = top_const) and
               (taicpu(hp1).oper[0]^.typ = top_const) and
               (taicpu(hp1).opsize = taicpu(p).opsize) and
    @@ -701,6 +1182,7 @@
                       else
                         Internalerror(2017050702)
                     end;
    +                Result := True;
                   end
                 else if (taicpu(p).oper[0]^.val = taicpu(hp1).oper[0]^.val) then
                   begin
    @@ -719,95 +1201,12 @@
                     end;
                     asml.remove(hp1);
                     hp1.free;
    +                Result := True;
                   end;
               end;
           end;
     
     
    -    function TX86AsmOptimizer.PrePeepholeOptIMUL(var p : tai) : boolean;
    -      var
    -        opsize : topsize;
    -        hp1 : tai;
    -        tmpref : treference;
    -        ShiftValue : Cardinal;
    -        BaseValue : TCGInt;
    -      begin
    -        result:=false;
    -        opsize:=taicpu(p).opsize;
    -        { changes certain "imul const, %reg"'s to lea sequences }
    -        if (MatchOpType(taicpu(p),top_const,top_reg) or
    -            MatchOpType(taicpu(p),top_const,top_reg,top_reg)) and
    -           (opsize in [S_L{$ifdef x86_64},S_Q{$endif x86_64}]) then
    -          if (taicpu(p).oper[0]^.val = 1) then
    -            if (taicpu(p).ops = 2) then
    -             { remove "imul $1, reg" }
    -              begin
    -                hp1 := tai(p.Next);
    -                asml.remove(p);
    -                DebugMsg(SPeepholeOptimization + 'Imul2Nop done',p);
    -                p.free;
    -                p := hp1;
    -                result:=true;
    -              end
    -            else
    -             { change "imul $1, reg1, reg2" to "mov reg1, reg2" }
    -              begin
    -                hp1 := taicpu.Op_Reg_Reg(A_MOV, opsize, taicpu(p).oper[1]^.reg,taicpu(p).oper[2]^.reg);
    -                InsertLLItem(p.previous, p.next, hp1);
    -                DebugMsg(SPeepholeOptimization + 'Imul2Mov done',p);
    -                p.free;
    -                p := hp1;
    -              end
    -          else if
    -           ((taicpu(p).ops <= 2) or
    -            (taicpu(p).oper[2]^.typ = Top_Reg)) and
    -           not(cs_opt_size in current_settings.optimizerswitches) and
    -           (not(GetNextInstruction(p, hp1)) or
    -             not((tai(hp1).typ = ait_instruction) and
    -                 ((taicpu(hp1).opcode=A_Jcc) and
    -                  (taicpu(hp1).condition in [C_O,C_NO])))) then
    -            begin
    -              {
    -                imul X, reg1, reg2 to
    -                  lea (reg1,reg1,Y), reg2
    -                  shl ZZ,reg2
    -                imul XX, reg1 to
    -                  lea (reg1,reg1,YY), reg1
    -                  shl ZZ,reg2
    -
    -                This optimziation makes sense for pretty much every x86, except the VIA Nano3000: it has IMUL latency 2, lea/shl pair as well,
    -                it does not exist as a separate optimization target in FPC though.
    -
    -                This optimziation can be applied as long as only two bits are set in the constant and those two bits are separated by
    -                at most two zeros
    -              }
    -              reference_reset(tmpref,1,[]);
    -              if (PopCnt(QWord(taicpu(p).oper[0]^.val))=2) and (BsrQWord(taicpu(p).oper[0]^.val)-BsfQWord(taicpu(p).oper[0]^.val)<=3) then
    -                begin
    -                  ShiftValue:=BsfQWord(taicpu(p).oper[0]^.val);
    -                  BaseValue:=taicpu(p).oper[0]^.val shr ShiftValue;
    -                  TmpRef.base := taicpu(p).oper[1]^.reg;
    -                  TmpRef.index := taicpu(p).oper[1]^.reg;
    -                  if not(BaseValue in [3,5,9]) then
    -                    Internalerror(2018110101);
    -                  TmpRef.ScaleFactor := BaseValue-1;
    -                  if (taicpu(p).ops = 2) then
    -                    hp1 := taicpu.op_ref_reg(A_LEA, opsize, TmpRef, taicpu(p).oper[1]^.reg)
    -                  else
    -                    hp1 := taicpu.op_ref_reg(A_LEA, opsize, TmpRef, taicpu(p).oper[2]^.reg);
    -                  AsmL.InsertAfter(hp1,p);
    -                  DebugMsg(SPeepholeOptimization + 'Imul2LeaShl done',p);
    -                  AsmL.Remove(p);
    -                  taicpu(hp1).fileinfo:=taicpu(p).fileinfo;
    -                  p.free;
    -                  p := hp1;
    -                  if ShiftValue>0 then
    -                    AsmL.InsertAfter(taicpu.op_const_reg(A_SHL, opsize, ShiftValue, taicpu(hp1).oper[1]^.reg),hp1);
    -              end;
    -            end;
    -      end;
    -
    -
         function TX86AsmOptimizer.RegLoadedWithNewValue(reg: tregister; hp: tai): boolean;
           var
             p: taicpu;
    @@ -944,7 +1343,7 @@
             hp2,hp3 : tai;
           begin
             { some x86-64 issue a NOP before the real exit code }
    -        if MatchInstruction(p,A_NOP,[]) then
    +        if MatchInstruction(p,A_NOP) then
               GetNextInstruction(p,p);
             result:=assigned(p) and (p.typ=ait_instruction) and
             ((taicpu(p).opcode = A_RET) or
    @@ -1054,7 +1453,7 @@
               GetNextInstruction(p, hp1) and
               (hp1.typ = ait_instruction) and
               GetNextInstruction(hp1, hp2) and
    -          MatchInstruction(hp2,taicpu(p).opcode,[]) and
    +          MatchInstruction(hp2,taicpu(p).opcode) and
               OpsEqual(taicpu(hp2).oper[1]^, taicpu(p).oper[0]^) and
               MatchOpType(taicpu(hp2),top_reg,top_reg) and
               MatchOperand(taicpu(hp2).oper[0]^,taicpu(p).oper[1]^) and
    @@ -1169,6 +1568,7 @@
                             asml.Remove(hp2);
                             hp2.Free;
                             p:=hp1;
    +                        Result := True;
                           end;
                       end;
                   end;
    @@ -1190,25 +1606,28 @@
                 V<Op>X   %mreg1,%mreg2,%mreg4
               ?
             }
    -        if GetNextInstruction(p,hp1) and
    -          { we mix single and double operations here because we assume that the compiler
    -            generates vmovapd only after double operations and vmovaps only after single operations }
    -          MatchInstruction(hp1,A_VMOVAPD,A_VMOVAPS,[S_NO]) and
    -          MatchOperand(taicpu(p).oper[2]^,taicpu(hp1).oper[0]^) and
    -          (taicpu(hp1).oper[1]^.typ=top_reg) then
    -          begin
    -            TransferUsedRegs(TmpUsedRegs);
    -            UpdateUsedRegs(TmpUsedRegs, tai(p.next));
    -            if not(RegUsedAfterInstruction(taicpu(hp1).oper[0]^.reg,hp1,TmpUsedRegs)
    -             ) then
    -              begin
    -                taicpu(p).loadoper(2,taicpu(hp1).oper[1]^);
    -                DebugMsg(SPeepholeOptimization + 'VOpVmov2VOp done',p);
    -                asml.Remove(hp1);
    -                hp1.Free;
    -                result:=true;
    -              end;
    -          end;
    +        repeat
    +          if GetNextInstruction(p,hp1) and
    +            { we mix single and double operations here because we assume that the compiler
    +              generates vmovapd only after double operations and vmovaps only after single operations }
    +            MatchInstruction(hp1,A_VMOVAPD,A_VMOVAPS,[S_NO]) and
    +            MatchOperand(taicpu(p).oper[2]^,taicpu(hp1).oper[0]^) and
    +            (taicpu(hp1).oper[1]^.typ=top_reg) then
    +            begin
    +              TransferUsedRegs(TmpUsedRegs);
    +              UpdateUsedRegs(TmpUsedRegs, tai(p.next));
    +              if not(RegUsedAfterInstruction(taicpu(hp1).oper[0]^.reg,hp1,TmpUsedRegs)
    +               ) then
    +                begin
    +                  taicpu(p).loadoper(2,taicpu(hp1).oper[1]^);
    +                  DebugMsg(SPeepholeOptimization + 'VOpVmov2VOp done',p);
    +                  asml.Remove(hp1);
    +                  hp1.Free;
    +                  Continue; { Can we do it again? }
    +                end;
    +            end;
    +          Exit;
    +        until False;
           end;
     
     
    @@ -2234,6 +3142,103 @@
           end;
     
     
    +    function TX86AsmOptimizer.OptPass1CMP(var p: tai): boolean;
    +      var
    +        hp1, hp2, hp3, hp4: tai;
    +        v: TCGInt; { using aint will cause problems when compiling on i8086 }
    +      begin
    +        Result := False;
    +
    +        { Though "GetNextInstruction" and the check to see if hp1 is A_Jcc could
    +          be factored out, it's better to do the cheap checks first to see if the
    +          CMP instruction fulfils the criteria before making the relatively
    +          expensive GetNextInstruction call. [Kit] }
    +        if (taicpu(p).oper[0]^.typ=Top_const) then
    +          begin
    +            { cmp $8000,%reg                    neg %reg
    +              je target                 -->     jo target
    +
    +              .... only if register is deallocated before jump.}
    +            case Taicpu(p).opsize of
    +              S_B: v:=$80;
    +              S_W: v:=$8000;
    +              S_L: v:=$80000000;
    +{$ifdef x86_64}
    +              S_Q: v:=$8000000000000000;
    +{$endif x86_64}
    +              else
    +                internalerror(2013112905);
    +            end;
    +
    +            if (taicpu(p).oper[0]^.val=v) and
    +              (Taicpu(p).oper[1]^.typ=top_reg) and
    +              GetNextInstruction(p, hp1) and
    +              (hp1.typ=ait_instruction) and
    +              (taicpu(hp1).opcode=A_Jcc) and
    +              (Taicpu(hp1).condition in [C_E,C_NE]) and
    +              not(RegInUsedRegs(Taicpu(p).oper[1]^.reg, UsedRegs)) then
    +            begin
    +              Taicpu(p).opcode:=A_NEG;
    +              Taicpu(p).loadoper(0,Taicpu(p).oper[1]^);
    +              Taicpu(p).clearop(1);
    +              Taicpu(p).ops:=1;
    +              if taicpu(hp1).condition=C_E then
    +                taicpu(hp1).condition := C_O
    +              else
    +                taicpu(hp1).condition := C_NO;
    +
    +              { No need to set Result to True because no other optimisations
    +                use or check for NEG }
    +            end;
    +            {
    +            @@2:                              @@2:
    +              ....                              ....
    +              cmp operand1,0
    +              jle/jbe @@1
    +              dec operand1             -->      sub operand1,1
    +              jmp @@2                           jge/jae @@2
    +            @@1:                              @@1:
    +              ...                               ....}
    +            if (taicpu(p).oper[1]^.typ in [top_reg,top_ref]) and
    +              (taicpu(p).oper[0]^.val = 0) and
    +              GetNextInstruction(p, hp1) and
    +              (hp1.typ = ait_instruction) and
    +              (taicpu(hp1).is_jmp) and
    +              (taicpu(hp1).opcode=A_Jcc) and
    +              (taicpu(hp1).condition in [C_LE,C_BE]) and
    +              GetNextInstruction(hp1,hp2) and
    +              (hp2.typ = ait_instruction) and
    +              (taicpu(hp2).opcode = A_DEC) and
    +              OpsEqual(taicpu(hp2).oper[0]^,taicpu(p).oper[1]^) and
    +              GetNextInstruction(hp2, hp3) and
    +              (hp3.typ = ait_instruction) and
    +              (taicpu(hp3).is_jmp) and
    +              (taicpu(hp3).opcode = A_JMP) and
    +              GetNextInstruction(hp3, hp4) and
    +              FindLabel(tasmlabel(taicpu(hp1).oper[0]^.ref^.symbol),hp4) then
    +            begin
    +              taicpu(hp2).Opcode := A_SUB;
    +              taicpu(hp2).loadoper(1,taicpu(hp2).oper[0]^);
    +              taicpu(hp2).loadConst(0,1);
    +              taicpu(hp2).ops:=2;
    +              taicpu(hp3).Opcode := A_Jcc;
    +
    +              if taicpu(hp1).condition=C_LE then
    +                taicpu(hp3).condition := C_GE
    +              else
    +                taicpu(hp3).condition := C_AE;
    +
    +              asml.remove(p);
    +              asml.remove(hp1);
    +              p.free;
    +              hp1.free;
    +              p := hp2;
    +              Result := True;
    +            end;
    +          end;
    +      end;
    +
    +
         function TX86AsmOptimizer.OptPass1Sub(var p : tai) : boolean;
     {$ifdef i386}
           var
    @@ -2426,7 +3447,7 @@
               (taicpu(p).oper[0]^.reg = taicpu(hp1).oper[0]^.reg) and
               (taicpu(hp1).oper[0]^.reg = taicpu(hp1).oper[1]^.reg) and
               GetNextInstruction(hp1, hp2) and
    -          MatchInstruction(hp2, A_Jcc, []) then
    +          MatchInstruction(hp2, A_Jcc) then
               { Change from:             To:
     
                 set(C) %reg              j(~C) label
    @@ -2476,403 +3497,112 @@
           end;
     
     
    -    function TX86AsmOptimizer.OptPass1FSTP(var p: tai): boolean;
    -      { returns true if a "continue" should be done after this optimization }
    -      var
    -        hp1, hp2: tai;
    +    function CanBeCMOV(p : tai) : boolean; inline;
           begin
    -        Result := false;
    -        if MatchOpType(taicpu(p),top_ref) and
    -           GetNextInstruction(p, hp1) and
    -           (hp1.typ = ait_instruction) and
    -           (((taicpu(hp1).opcode = A_FLD) and
    -             (taicpu(p).opcode = A_FSTP)) or
    -            ((taicpu(p).opcode = A_FISTP) and
    -             (taicpu(hp1).opcode = A_FILD))) and
    -           MatchOpType(taicpu(hp1),top_ref) and
    -           (taicpu(hp1).opsize = taicpu(p).opsize) and
    -           RefsEqual(taicpu(p).oper[0]^.ref^, taicpu(hp1).oper[0]^.ref^) then
    -          begin
    -            { replacing fstp f;fld f by fst f is only valid for extended because of rounding }
    -            if (taicpu(p).opsize=S_FX) and
    -               GetNextInstruction(hp1, hp2) and
    -               (hp2.typ = ait_instruction) and
    -               IsExitCode(hp2) and
    -               (taicpu(p).oper[0]^.ref^.base = current_procinfo.FramePointer) and
    -               not(assigned(current_procinfo.procdef.funcretsym) and
    -                   (taicpu(p).oper[0]^.ref^.offset < tabstractnormalvarsym(current_procinfo.procdef.funcretsym).localloc.reference.offset)) and
    -               (taicpu(p).oper[0]^.ref^.index = NR_NO) then
    -              begin
    -                asml.remove(p);
    -                asml.remove(hp1);
    -                p.free;
    -                hp1.free;
    -                p := hp2;
    -                RemoveLastDeallocForFuncRes(p);
    -                Result := true;
    -              end
    -            (* can't be done because the store operation rounds
    -            else
    -              { fst can't store an extended value! }
    -              if (taicpu(p).opsize <> S_FX) and
    -                 (taicpu(p).opsize <> S_IQ) then
    -                begin
    -                  if (taicpu(p).opcode = A_FSTP) then
    -                    taicpu(p).opcode := A_FST
    -                  else taicpu(p).opcode := A_FIST;
    -                  asml.remove(hp1);
    -                  hp1.free;
    -                end
    -            *)
    -          end;
    +         CanBeCMOV:=assigned(p) and
    +           MatchInstruction(p,A_MOV,[S_W,S_L,S_Q]) and
    +           { we can't use cmov ref,reg because
    +             ref could be nil and cmov still throws an exception
    +             if ref=nil but the mov isn't done (FK)
    +            or ((taicpu(p).oper[0]^.typ = top_ref) and
    +             (taicpu(p).oper[0]^.ref^.refaddr = addr_no))
    +           }
    +           MatchOpType(taicpu(p),top_reg,top_reg);
           end;
     
     
    -     function TX86AsmOptimizer.OptPass1FLD(var p : tai) : boolean;
    +    function TX86AsmOptimizer.OptPass1Imul(var p : tai) : boolean;
           var
    -       hp1, hp2: tai;
    +        opsize : topsize;
    +        hp1 : tai;
    +        tmpref : treference;
    +        ShiftValue : Cardinal;
    +        BaseValue : TCGInt;
           begin
             result:=false;
    -        if MatchOpType(taicpu(p),top_reg) and
    -           GetNextInstruction(p, hp1) and
    -           (hp1.typ = Ait_Instruction) and
    -           MatchOpType(taicpu(hp1),top_reg,top_reg) and
    -           (taicpu(hp1).oper[0]^.reg = NR_ST) and
    -           (taicpu(hp1).oper[1]^.reg = NR_ST1) then
    -           { change                        to
    -               fld      reg               fxxx reg,st
    -               fxxxp    st, st1 (hp1)
    -             Remark: non commutative operations must be reversed!
    -           }
    -          begin
    -              case taicpu(hp1).opcode Of
    -                A_FMULP,A_FADDP,
    -                A_FSUBP,A_FDIVP,A_FSUBRP,A_FDIVRP:
    -                  begin
    -                    case taicpu(hp1).opcode Of
    -                      A_FADDP: taicpu(hp1).opcode := A_FADD;
    -                      A_FMULP: taicpu(hp1).opcode := A_FMUL;
    -                      A_FSUBP: taicpu(hp1).opcode := A_FSUBR;
    -                      A_FSUBRP: taicpu(hp1).opcode := A_FSUB;
    -                      A_FDIVP: taicpu(hp1).opcode := A_FDIVR;
    -                      A_FDIVRP: taicpu(hp1).opcode := A_FDIV;
    -                      else
    -                        internalerror(2019050534);
    -                    end;
    -                    taicpu(hp1).oper[0]^.reg := taicpu(p).oper[0]^.reg;
    -                    taicpu(hp1).oper[1]^.reg := NR_ST;
    -                    asml.remove(p);
    -                    p.free;
    -                    p := hp1;
    -                    Result:=true;
    -                    exit;
    -                  end;
    -                else
    -                  ;
    -              end;
    -          end
    -        else
    -          if MatchOpType(taicpu(p),top_ref) and
    -             GetNextInstruction(p, hp2) and
    -             (hp2.typ = Ait_Instruction) and
    -             MatchOpType(taicpu(hp2),top_reg,top_reg) and
    -             (taicpu(p).opsize in [S_FS, S_FL]) and
    -             (taicpu(hp2).oper[0]^.reg = NR_ST) and
    -             (taicpu(hp2).oper[1]^.reg = NR_ST1) then
    -            if GetLastInstruction(p, hp1) and
    -              MatchInstruction(hp1,A_FLD,A_FST,[taicpu(p).opsize]) and
    -              MatchOpType(taicpu(hp1),top_ref) and
    -              RefsEqual(taicpu(p).oper[0]^.ref^, taicpu(hp1).oper[0]^.ref^) then
    -              if ((taicpu(hp2).opcode = A_FMULP) or
    -                  (taicpu(hp2).opcode = A_FADDP)) then
    -              { change                      to
    -                  fld/fst   mem1  (hp1)       fld/fst   mem1
    -                  fld       mem1  (p)         fadd/
    -                  faddp/                       fmul     st, st
    -                  fmulp  st, st1 (hp2) }
    +        opsize:=taicpu(p).opsize;
    +        { changes certain "imul const, %reg"'s to lea sequences }
    +        if (MatchOpType(taicpu(p),top_const,top_reg) or
    +            MatchOpType(taicpu(p),top_const,top_reg,top_reg)) and
    +{$ifdef x86_64}
    +           (opsize in [S_L,S_Q])
    +{$else x86_64}
    +           (opsize = S_L)
    +{$endif x86_64}
    +          then
    +          if (taicpu(p).oper[0]^.val = 1) then
    +            begin
    +              if (taicpu(p).ops = 2) then
    +               { remove "imul $1, reg" }
                     begin
    +                  hp1 := tai(p.Next);
                       asml.remove(p);
    -                  p.free;
    -                  p := hp1;
    -                  if (taicpu(hp2).opcode = A_FADDP) then
    -                    taicpu(hp2).opcode := A_FADD
    -                  else
    -                    taicpu(hp2).opcode := A_FMUL;
    -                  taicpu(hp2).oper[1]^.reg := NR_ST;
    +                  DebugMsg(SPeepholeOptimization + 'Imul2Nop done',p);
                     end
                   else
    -              { change              to
    -                  fld/fst mem1 (hp1)   fld/fst mem1
    -                  fld     mem1 (p)     fld      st}
    +               { change "imul $1, reg1, reg2" to "mov reg1, reg2" }
                     begin
    -                  taicpu(p).changeopsize(S_FL);
    -                  taicpu(p).loadreg(0,NR_ST);
    -                end
    -            else
    -              begin
    -                case taicpu(hp2).opcode Of
    -                  A_FMULP,A_FADDP,A_FSUBP,A_FDIVP,A_FSUBRP,A_FDIVRP:
    -            { change                        to
    -                fld/fst  mem1    (hp1)      fld/fst    mem1
    -                fld      mem2    (p)        fxxx       mem2
    -                fxxxp    st, st1 (hp2)                      }
    +                  hp1 := taicpu.Op_Reg_Reg(A_MOV, opsize, taicpu(p).oper[1]^.reg,taicpu(p).oper[2]^.reg);
    +                  InsertLLItem(p.previous, p.next, hp1);
    +                  DebugMsg(SPeepholeOptimization + 'Imul2Mov done',p);
    +                end;
     
    -                    begin
    -                      case taicpu(hp2).opcode Of
    -                        A_FADDP: taicpu(p).opcode := A_FADD;
    -                        A_FMULP: taicpu(p).opcode := A_FMUL;
    -                        A_FSUBP: taicpu(p).opcode := A_FSUBR;
    -                        A_FSUBRP: taicpu(p).opcode := A_FSUB;
    -                        A_FDIVP: taicpu(p).opcode := A_FDIVR;
    -                        A_FDIVRP: taicpu(p).opcode := A_FDIV;
    -                        else
    -                          internalerror(2019050533);
    -                      end;
    -                      asml.remove(hp2);
    -                      hp2.free;
    -                    end
    -                  else
    -                    ;
    -                end
    -              end
    -      end;
    +              p.free;
    +              p := hp1;
    +              Result := True;
    +              Exit;
    +            end
    +          else if
    +           ((taicpu(p).ops <= 2) or
    +            (taicpu(p).oper[2]^.typ = Top_Reg)) and
    +           not(cs_opt_size in current_settings.optimizerswitches) and
    +           (not(GetNextInstruction(p, hp1)) or
    +             not((tai(hp1).typ = ait_instruction) and
    +                 ((taicpu(hp1).opcode=A_Jcc) and
    +                  (taicpu(hp1).condition in [C_O,C_NO])))) then
    +            begin
    +              {
    +                imul X, reg1, reg2 to
    +                  lea (reg1,reg1,Y), reg2
    +                  shl ZZ,reg2
    +                imul XX, reg1 to
    +                  lea (reg1,reg1,YY), reg1
    +                  shl ZZ,reg1
     
    +                This optimization makes sense for pretty much every x86 except the VIA Nano3000, where IMUL has a latency of 2, the same as the lea/shl pair;
    +                that CPU does not exist as a separate optimization target in FPC though.
     
    -   function TX86AsmOptimizer.OptPass2MOV(var p : tai) : boolean;
    -      var
    -       hp1,hp2: tai;
    -{$ifdef x86_64}
    -       hp3: tai;
    -{$endif x86_64}
    -      begin
    -        Result:=false;
    -        if MatchOpType(taicpu(p),top_reg,top_reg) and
    -          GetNextInstruction(p, hp1) and
    -{$ifdef x86_64}
    -          MatchInstruction(hp1,A_MOVZX,A_MOVSX,A_MOVSXD,[]) and
    -{$else x86_64}
    -          MatchInstruction(hp1,A_MOVZX,A_MOVSX,[]) and
    -{$endif x86_64}
    -          MatchOpType(taicpu(hp1),top_reg,top_reg) and
    -          (taicpu(hp1).oper[0]^.reg = taicpu(p).oper[1]^.reg) then
    -          { mov reg1, reg2                mov reg1, reg2
    -            movzx/sx reg2, reg3      to   movzx/sx reg1, reg3}
    -          begin
    -            taicpu(hp1).oper[0]^.reg := taicpu(p).oper[0]^.reg;
    -            DebugMsg(SPeepholeOptimization + 'mov %reg1,%reg2; movzx/sx %reg2,%reg3 -> mov %reg1,%reg2;movzx/sx %reg1,%reg3',p);
    +                This optimization can be applied as long as only two bits are set in the constant and those two bits are separated by
    +                at most two zeros
    +              }
    +              reference_reset(tmpref,1,[]);
    +              if (PopCnt(QWord(taicpu(p).oper[0]^.val))=2) and (BsrQWord(taicpu(p).oper[0]^.val)-BsfQWord(taicpu(p).oper[0]^.val)<=3) then
    +                begin
    +                  ShiftValue:=BsfQWord(taicpu(p).oper[0]^.val);
    +                  BaseValue:=taicpu(p).oper[0]^.val shr ShiftValue;
    +                  TmpRef.base := taicpu(p).oper[1]^.reg;
    +                  TmpRef.index := taicpu(p).oper[1]^.reg;
    +                  if not(BaseValue in [3,5,9]) then
    +                    Internalerror(2018110101);
    +                  TmpRef.ScaleFactor := BaseValue-1;
    +                  if (taicpu(p).ops = 2) then
    +                    hp1 := taicpu.op_ref_reg(A_LEA, opsize, TmpRef, taicpu(p).oper[1]^.reg)
    +                  else
    +                    hp1 := taicpu.op_ref_reg(A_LEA, opsize, TmpRef, taicpu(p).oper[2]^.reg);
    +                  AsmL.InsertAfter(hp1,p);
    +                  DebugMsg(SPeepholeOptimization + 'Imul2LeaShl done',p);
    +                  AsmL.Remove(p);
    +                  taicpu(hp1).fileinfo:=taicpu(p).fileinfo;
    +                  p.free;
    +                  p := hp1;
    +                  if ShiftValue>0 then
    +                    AsmL.InsertAfter(taicpu.op_const_reg(A_SHL, opsize, ShiftValue, taicpu(hp1).oper[1]^.reg),hp1);
     
    -            { Don't remove the MOV command without first checking that reg2 isn't used afterwards,
    -              or unless supreg(reg3) = supreg(reg2)). [Kit] }
    -
    -            TransferUsedRegs(TmpUsedRegs);
    -            UpdateUsedRegs(TmpUsedRegs, tai(p.next));
    -
    -            if (getsupreg(taicpu(p).oper[1]^.reg) = getsupreg(taicpu(hp1).oper[1]^.reg)) or
    -              not RegUsedAfterInstruction(taicpu(p).oper[1]^.reg, hp1, TmpUsedRegs)
    -            then
    -              begin
    -                asml.remove(p);
    -                p.free;
    -                p := hp1;
    -                Result:=true;
    +                  { LEA won't get optimised, so no need to set Result to True }
    +                  Exit;
                   end;
    +            end;
     
    -            exit;
    -          end
    -        else if MatchOpType(taicpu(p),top_reg,top_reg) and
    -          GetNextInstruction(p, hp1) and
    -{$ifdef x86_64}
    -          MatchInstruction(hp1,[A_MOV,A_MOVZX,A_MOVSX,A_MOVSXD],[]) and
    -{$else x86_64}
    -          MatchInstruction(hp1,A_MOV,A_MOVZX,A_MOVSX,[]) and
    -{$endif x86_64}
    -          MatchOpType(taicpu(hp1),top_ref,top_reg) and
    -          ((taicpu(hp1).oper[0]^.ref^.base = taicpu(p).oper[1]^.reg)
    -           or
    -           (taicpu(hp1).oper[0]^.ref^.index = taicpu(p).oper[1]^.reg)
    -            ) and
    -          (getsupreg(taicpu(hp1).oper[1]^.reg) = getsupreg(taicpu(p).oper[1]^.reg)) then
    -          { mov reg1, reg2
    -            mov/zx/sx (reg2, ..), reg2      to   mov/zx/sx (reg1, ..), reg2}
    -          begin
    -            if (taicpu(hp1).oper[0]^.ref^.base = taicpu(p).oper[1]^.reg) then
    -              taicpu(hp1).oper[0]^.ref^.base := taicpu(p).oper[0]^.reg;
    -            if (taicpu(hp1).oper[0]^.ref^.index = taicpu(p).oper[1]^.reg) then
    -              taicpu(hp1).oper[0]^.ref^.index := taicpu(p).oper[0]^.reg;
    -            DebugMsg(SPeepholeOptimization + 'MovMovXX2MoVXX 1 done',p);
    -            asml.remove(p);
    -            p.free;
    -            p := hp1;
    -            Result:=true;
    -            exit;
    -          end
    -        else if (taicpu(p).oper[0]^.typ = top_ref) and
    -          GetNextInstruction(p,hp1) and
    -          (hp1.typ = ait_instruction) and
    -          { while the GetNextInstruction(hp1,hp2) call could be factored out,
    -            doing it separately in both branches allows to do the cheap checks
    -            with low probability earlier }
    -          ((IsFoldableArithOp(taicpu(hp1),taicpu(p).oper[1]^.reg) and
    -            GetNextInstruction(hp1,hp2) and
    -            MatchInstruction(hp2,A_MOV,[])
    -           ) or
    -           ((taicpu(hp1).opcode=A_LEA) and
    -             GetNextInstruction(hp1,hp2) and
    -             MatchInstruction(hp2,A_MOV,[]) and
    -            ((MatchReference(taicpu(hp1).oper[0]^.ref^,taicpu(p).oper[1]^.reg,NR_INVALID) and
    -             (taicpu(hp1).oper[0]^.ref^.index<>taicpu(p).oper[1]^.reg)
    -              ) or
    -             (MatchReference(taicpu(hp1).oper[0]^.ref^,NR_INVALID,
    -              taicpu(p).oper[1]^.reg) and
    -             (taicpu(hp1).oper[0]^.ref^.base<>taicpu(p).oper[1]^.reg)) or
    -             (MatchReferenceWithOffset(taicpu(hp1).oper[0]^.ref^,taicpu(p).oper[1]^.reg,NR_NO)) or
    -             (MatchReferenceWithOffset(taicpu(hp1).oper[0]^.ref^,NR_NO,taicpu(p).oper[1]^.reg))
    -            ) and
    -            ((MatchOperand(taicpu(p).oper[1]^,taicpu(hp2).oper[0]^)) or not(RegUsedAfterInstruction(taicpu(p).oper[1]^.reg,hp1,UsedRegs)))
    -           )
    -          ) and
    -          MatchOperand(taicpu(hp1).oper[taicpu(hp1).ops-1]^,taicpu(hp2).oper[0]^) and
    -          (taicpu(hp2).oper[1]^.typ = top_ref) then
    -          begin
    -            TransferUsedRegs(TmpUsedRegs);
    -            UpdateUsedRegs(TmpUsedRegs,tai(p.next));
    -            UpdateUsedRegs(TmpUsedRegs,tai(hp1.next));
    -            if (RefsEqual(taicpu(hp2).oper[1]^.ref^,taicpu(p).oper[0]^.ref^) and
    -              not(RegUsedAfterInstruction(taicpu(hp2).oper[0]^.reg,hp2,TmpUsedRegs))) then
    -              { change   mov            (ref), reg
    -                         add/sub/or/... reg2/$const, reg
    -                         mov            reg, (ref)
    -                         # release reg
    -                to       add/sub/or/... reg2/$const, (ref)    }
    -              begin
    -                case taicpu(hp1).opcode of
    -                  A_INC,A_DEC,A_NOT,A_NEG :
    -                    taicpu(hp1).loadRef(0,taicpu(p).oper[0]^.ref^);
    -                  A_LEA :
    -                    begin
    -                      taicpu(hp1).opcode:=A_ADD;
    -                      if (taicpu(hp1).oper[0]^.ref^.index<>taicpu(p).oper[1]^.reg) and (taicpu(hp1).oper[0]^.ref^.index<>NR_NO) then
    -                        taicpu(hp1).loadreg(0,taicpu(hp1).oper[0]^.ref^.index)
    -                      else if (taicpu(hp1).oper[0]^.ref^.base<>taicpu(p).oper[1]^.reg) and (taicpu(hp1).oper[0]^.ref^.base<>NR_NO) then
    -                        taicpu(hp1).loadreg(0,taicpu(hp1).oper[0]^.ref^.base)
    -                      else
    -                        taicpu(hp1).loadconst(0,taicpu(hp1).oper[0]^.ref^.offset);
    -                      taicpu(hp1).loadRef(1,taicpu(p).oper[0]^.ref^);
    -                      DebugMsg(SPeepholeOptimization + 'FoldLea done',hp1);
    -                    end
    -                  else
    -                    taicpu(hp1).loadRef(1,taicpu(p).oper[0]^.ref^);
    -                end;
    -                asml.remove(p);
    -                asml.remove(hp2);
    -                p.free;
    -                hp2.free;
    -                p := hp1
    -              end;
    -            Exit;
    -{$ifdef x86_64}
    -          end
    -        else if (taicpu(p).opsize = S_L) and
    -          (taicpu(p).oper[1]^.typ = top_reg) and
    -          (
    -            GetNextInstruction(p, hp1) and
    -            MatchInstruction(hp1, A_MOV,[]) and
    -            (taicpu(hp1).opsize = S_L) and
    -            (taicpu(hp1).oper[1]^.typ = top_reg)
    -          ) and (
    -            GetNextInstruction(hp1, hp2) and
    -            (tai(hp2).typ=ait_instruction) and
    -            (taicpu(hp2).opsize = S_Q) and
    -            (
    -              (
    -                MatchInstruction(hp2, A_ADD,[]) and
    -                (taicpu(hp2).opsize = S_Q) and
    -                (taicpu(hp2).oper[0]^.typ = top_reg) and (taicpu(hp2).oper[1]^.typ = top_reg) and
    -                (
    -                  (
    -                    (getsupreg(taicpu(hp2).oper[0]^.reg) = getsupreg(taicpu(p).oper[1]^.reg)) and
    -                    (getsupreg(taicpu(hp2).oper[1]^.reg) = getsupreg(taicpu(hp1).oper[1]^.reg))
    -                  ) or (
    -                    (getsupreg(taicpu(hp2).oper[0]^.reg) = getsupreg(taicpu(hp1).oper[1]^.reg)) and
    -                    (getsupreg(taicpu(hp2).oper[1]^.reg) = getsupreg(taicpu(p).oper[1]^.reg))
    -                  )
    -                )
    -              ) or (
    -                MatchInstruction(hp2, A_LEA,[]) and
    -                (taicpu(hp2).oper[0]^.ref^.offset = 0) and
    -                (taicpu(hp2).oper[0]^.ref^.scalefactor <= 1) and
    -                (
    -                  (
    -                    (getsupreg(taicpu(hp2).oper[0]^.ref^.base) = getsupreg(taicpu(p).oper[1]^.reg)) and
    -                    (getsupreg(taicpu(hp2).oper[0]^.ref^.index) = getsupreg(taicpu(hp1).oper[1]^.reg))
    -                  ) or (
    -                    (getsupreg(taicpu(hp2).oper[0]^.ref^.base) = getsupreg(taicpu(hp1).oper[1]^.reg)) and
    -                    (getsupreg(taicpu(hp2).oper[0]^.ref^.index) = getsupreg(taicpu(p).oper[1]^.reg))
    -                  )
    -                ) and (
    -                  (
    -                    (getsupreg(taicpu(hp2).oper[1]^.reg) = getsupreg(taicpu(hp1).oper[1]^.reg))
    -                  ) or (
    -                    (getsupreg(taicpu(hp2).oper[1]^.reg) = getsupreg(taicpu(p).oper[1]^.reg))
    -                  )
    -                )
    -              )
    -            )
    -          ) and (
    -            GetNextInstruction(hp2, hp3) and
    -            MatchInstruction(hp3, A_SHR,[]) and
    -            (taicpu(hp3).opsize = S_Q) and
    -            (taicpu(hp3).oper[0]^.typ = top_const) and (taicpu(hp2).oper[1]^.typ = top_reg) and
    -            (taicpu(hp3).oper[0]^.val = 1) and
    -            (taicpu(hp3).oper[1]^.reg = taicpu(hp2).oper[1]^.reg)
    -          ) then
    -          begin
    -            { Change   movl    x,    reg1d         movl    x,    reg1d
    -                       movl    y,    reg2d         movl    y,    reg2d
    -                       addq    reg2q,reg1q   or    leaq    (reg1q,reg2q),reg1q
    -                       shrq    $1,   reg1q         shrq    $1,   reg1q
    -
    -            ( reg1d and reg2d can be switched around in the first two instructions )
    -
    -              To       movl    x,    reg1d
    -                       addl    y,    reg1d
    -                       rcrl    $1,   reg1d
    -
    -              This corresponds to the common expression (x + y) shr 1, where
    -              x and y are Cardinals (replacing "shr 1" with "div 2" produces
    -              smaller code, but won't account for x + y causing an overflow). [Kit]
    -            }
    -
    -            if (getsupreg(taicpu(hp2).oper[1]^.reg) = getsupreg(taicpu(hp1).oper[1]^.reg)) then
    -              { Change first MOV command to have the same register as the final output }
    -              taicpu(p).oper[1]^.reg := taicpu(hp1).oper[1]^.reg
    -            else
    -              taicpu(hp1).oper[1]^.reg := taicpu(p).oper[1]^.reg;
    -
    -            { Change second MOV command to an ADD command. This is easier than
    -              converting the existing command because it means we don't have to
    -              touch 'y', which might be a complicated reference, and also the
    -              fact that the third command might either be ADD or LEA. [Kit] }
    -            taicpu(hp1).opcode := A_ADD;
    -
    -            { Delete old ADD/LEA instruction }
    -            asml.remove(hp2);
    -            hp2.free;
    -
    -            { Convert "shrq $1, reg1q" to "rcr $1, reg1d" }
    -            taicpu(hp3).opcode := A_RCR;
    -            taicpu(hp3).changeopsize(S_L);
    -            setsubreg(taicpu(hp3).oper[1]^.reg, R_SUBD);
    -{$endif x86_64}
    -          end;
    -      end;
    -
    -
    -    function TX86AsmOptimizer.OptPass2Imul(var p : tai) : boolean;
    -      var
    -        hp1 : tai;
    -      begin
    -        Result:=false;
             if (taicpu(p).ops >= 2) and
                ((taicpu(p).oper[0]^.typ = top_const) or
                 ((taicpu(p).oper[0]^.typ = top_ref) and (taicpu(p).oper[0]^.ref^.refaddr=addr_full))) and
    @@ -2881,7 +3611,7 @@
                 ((taicpu(p).oper[2]^.typ = top_reg) and
                  (taicpu(p).oper[2]^.reg = taicpu(p).oper[1]^.reg))) and
                GetLastInstruction(p,hp1) and
    -           MatchInstruction(hp1,A_MOV,[]) and
    +           MatchInstruction(hp1,A_MOV) and
                MatchOpType(taicpu(hp1),top_reg,top_reg) and
                ((taicpu(hp1).oper[1]^.reg = taicpu(p).oper[1]^.reg) or
                 ((taicpu(hp1).opsize=S_L) and (taicpu(p).opsize=S_Q) and SuperRegistersEqual(taicpu(hp1).oper[1]^.reg,taicpu(p).oper[1]^.reg))) then
    @@ -2898,6 +3628,10 @@
                     DebugMsg(SPeepholeOptimization + 'MovImul2Imul done',p);
                     asml.remove(hp1);
                     hp1.free;
    +                { Though p is still IMUL, the overhauled peephole optimiser
    +                  won't call OptPass1Imul again because the instruction type
    +                  hasn't changed (it's assumed that if p still has the same
    +                  instruction, no more optimisations can be done on it) }
                     result:=true;
                   end;
               end;
    @@ -2904,7 +3638,7 @@
           end;
     
     
    -    function TX86AsmOptimizer.OptPass2Jmp(var p : tai) : boolean;
    +    function TX86AsmOptimizer.OptPass1Jmp(var p : tai) : boolean;
           var
             hp1 : tai;
           begin
    @@ -2925,6 +3659,9 @@
                 if (taicpu(p).condition=C_None) and assigned(hp1) and SkipLabels(hp1,hp1) and
                   MatchInstruction(hp1,A_RET,[S_NO]) then
                   begin
    +                { This jump optimisation would be missed otherwise. [Kit] }
    +                RemoveDeadCodeAfterJump(taicpu(p));
    +
                     tasmlabel(taicpu(p).oper[0]^.ref^.symbol).decrefs;
                     taicpu(p).opcode:=A_RET;
                     taicpu(p).is_jmp:=false;
    @@ -2943,23 +3680,9 @@
           end;
     
     
    -    function CanBeCMOV(p : tai) : boolean;
    -      begin
    -         CanBeCMOV:=assigned(p) and
    -           MatchInstruction(p,A_MOV,[S_W,S_L,S_Q]) and
    -           { we can't use cmov ref,reg because
    -             ref could be nil and cmov still throws an exception
    -             if ref=nil but the mov isn't done (FK)
    -            or ((taicpu(p).oper[0]^.typ = top_ref) and
    -             (taicpu(p).oper[0]^.ref^.refaddr = addr_no))
    -           }
    -           MatchOpType(taicpu(p),top_reg,top_reg);
    -      end;
    -
    -
    -    function TX86AsmOptimizer.OptPass2Jcc(var p : tai) : boolean;
    +    function TX86AsmOptimizer.OptPass1Jcc(var p : tai) : boolean;
           var
    -        hp1,hp2,hp3,hp4,hpmov2: tai;
    +        hp1,hp2,hp3,hp4,hpmov1,hpmov2: tai;
             carryadd_opcode : TAsmOp;
             l : Longint;
             condition : TAsmCond;
    @@ -2967,339 +3690,349 @@
           begin
             result:=false;
             symbol:=nil;
    -        if GetNextInstruction(p,hp1) then
    -          begin
    -            symbol := TAsmLabel(taicpu(p).oper[0]^.ref^.symbol);
    +        if not GetNextInstruction(p,hp1) or (hp1.typ <> ait_instruction) then
    +          { No next instruction, so exit }
    +          Exit;
     
    -            if (hp1.typ=ait_instruction) and
    -               GetNextInstruction(hp1,hp2) and (hp2.typ=ait_label) and
    -               (Tasmlabel(symbol) = Tai_label(hp2).labsym) then
    -                 { jb @@1                            cmc
    -                   inc/dec operand           -->     adc/sbb operand,0
    -                   @@1:
    +        symbol := TAsmLabel(taicpu(p).oper[0]^.ref^.symbol);
     
    -                   ... and ...
    +        if (hp1.typ=ait_instruction) and
    +           GetNextInstruction(hp1,hp2) and (hp2.typ=ait_label) and
    +           (Tasmlabel(symbol) = Tai_label(hp2).labsym) then
    +             { jb @@1                            cmc
    +               inc/dec operand           -->     adc/sbb operand,0
    +               @@1:
     
    -                   jnb @@1
    -                   inc/dec operand           -->     adc/sbb operand,0
    -                   @@1: }
    +               ... and ...
    +
    +               jnb @@1
    +               inc/dec operand           -->     adc/sbb operand,0
    +               @@1: }
    +          begin
    +            carryadd_opcode:=A_NONE;
    +            if Taicpu(p).condition in [C_NAE,C_B] then
                   begin
    -                carryadd_opcode:=A_NONE;
    -                if Taicpu(p).condition in [C_NAE,C_B] then
    +                if Taicpu(hp1).opcode=A_INC then
    +                  carryadd_opcode:=A_ADC;
    +                if Taicpu(hp1).opcode=A_DEC then
    +                  carryadd_opcode:=A_SBB;
    +                if carryadd_opcode<>A_NONE then
                       begin
    -                    if Taicpu(hp1).opcode=A_INC then
    -                      carryadd_opcode:=A_ADC;
    -                    if Taicpu(hp1).opcode=A_DEC then
    -                      carryadd_opcode:=A_SBB;
    -                    if carryadd_opcode<>A_NONE then
    -                      begin
    -                        Taicpu(p).clearop(0);
    -                        Taicpu(p).ops:=0;
    -                        Taicpu(p).is_jmp:=false;
    -                        Taicpu(p).opcode:=A_CMC;
    -                        Taicpu(p).condition:=C_NONE;
    -                        Taicpu(hp1).ops:=2;
    -                        Taicpu(hp1).loadoper(1,Taicpu(hp1).oper[0]^);
    -                        Taicpu(hp1).loadconst(0,0);
    -                        Taicpu(hp1).opcode:=carryadd_opcode;
    -                        result:=true;
    -                        exit;
    -                      end;
    +                    Taicpu(p).clearop(0);
    +                    Taicpu(p).ops:=0;
    +                    Taicpu(p).is_jmp:=false;
    +                    Taicpu(p).opcode:=A_CMC;
    +                    Taicpu(p).condition:=C_NONE;
    +                    Taicpu(hp1).ops:=2;
    +                    Taicpu(hp1).loadoper(1,Taicpu(hp1).oper[0]^);
    +                    Taicpu(hp1).loadconst(0,0);
    +                    Taicpu(hp1).opcode:=carryadd_opcode;
    +                    result:=true;
    +                    exit;
                       end;
    -                if Taicpu(p).condition in [C_AE,C_NB] then
    +              end;
    +            if Taicpu(p).condition in [C_AE,C_NB] then
    +              begin
    +                if Taicpu(hp1).opcode=A_INC then
    +                  carryadd_opcode:=A_ADC;
    +                if Taicpu(hp1).opcode=A_DEC then
    +                  carryadd_opcode:=A_SBB;
    +                if carryadd_opcode<>A_NONE then
                       begin
    -                    if Taicpu(hp1).opcode=A_INC then
    -                      carryadd_opcode:=A_ADC;
    -                    if Taicpu(hp1).opcode=A_DEC then
    -                      carryadd_opcode:=A_SBB;
    -                    if carryadd_opcode<>A_NONE then
    -                      begin
    -                        asml.remove(p);
    -                        p.free;
    -                        Taicpu(hp1).ops:=2;
    -                        Taicpu(hp1).loadoper(1,Taicpu(hp1).oper[0]^);
    -                        Taicpu(hp1).loadconst(0,0);
    -                        Taicpu(hp1).opcode:=carryadd_opcode;
    -                        p:=hp1;
    -                        result:=true;
    -                        exit;
    -                      end;
    +                    asml.remove(p);
    +                    p.free;
    +                    Taicpu(hp1).ops:=2;
    +                    Taicpu(hp1).loadoper(1,Taicpu(hp1).oper[0]^);
    +                    Taicpu(hp1).loadconst(0,0);
    +                    Taicpu(hp1).opcode:=carryadd_opcode;
    +                    p:=hp1;
    +                    result:=true;
    +                    exit;
                       end;
                   end;
    +          end;
     
    -            if ((hp1.typ = ait_label) and (symbol = tai_label(hp1).labsym))
    -                or ((hp1.typ = ait_align) and GetNextInstruction(hp1, hp2) and (hp2.typ = ait_label) and (symbol = tai_label(hp2).labsym)) then
    +        if CPUX86_HAS_CMOV in cpu_capabilities[current_settings.cputype] then
    +          begin
    +            { check for
    +                   jCC   xxx
    +                   <several movs>
    +                xxx:
    +            }
    +            l:=0;
    +            { We already have hp1 from above }
    +
    +            { Look ahead with the register usage }
    +            TransferUsedRegs(StatePreserveRegs); { We can't use TmpUsedRegs because that's used by OptPass1MOV }
    +            UpdateUsedRegs(tai(p.Next));
    +
    +            hpmov1 := hp1;
    +            while (hp1 <> BlockEnd) and
    +              (hp1.typ = ait_instruction) and
    +              (taicpu(hp1).opcode = A_MOV) do
    +              { Will stop on labels }
                   begin
    -                { If Jcc is immediately followed by the label that it's supposed to jump to, remove it }
    -                DebugMsg(SPeepholeOptimization + 'Removed conditional jump whose destination was immediately after it', p);
    -                UpdateUsedRegs(hp1);
    -
    -                TAsmLabel(symbol).decrefs;
    -                { if the label refs. reach zero, remove any alignment before the label }
    -                if (hp1.typ = ait_align) then
    +                { Check whether the MOV can be optimised first }
    +                if OptPass1MOV(hp1) then
                       begin
    -                    UpdateUsedRegs(hp2);
    -                    if (TAsmLabel(symbol).getrefs = 0) then
    -                    begin
    -                      asml.Remove(hp1);
    -                      hp1.Free;
    -                    end;
    -                    hp1 := hp2; { Set hp1 to the label }
    +                    UpdateUsedRegs(hp1);
    +                    Continue;
                       end;
     
    -                asml.remove(p);
    -                p.free;
    +                if not CanBeCMOV(hp1) then
    +                  Break;
     
    -                if (TAsmLabel(symbol).getrefs = 0) then
    -                  begin
    -                    GetNextInstruction(hp1, p); { Instruction following the label }
    -                    asml.remove(hp1);
    -                    hp1.free;
    +                UpdateUsedRegs(tai(hp1.Next));
    +                inc(l);
    +                GetNextInstruction(hp1,hp1);
    +              end;
     
    -                    UpdateUsedRegs(p);
    -                    Result := True;
    -                  end
    -                else
    +            if (hp1 <> BlockEnd) then
    +              begin
    +                if FindLabel(tasmlabel(symbol),hp1) then
                       begin
    -                    { We don't need to set the result to True because we know hp1
    -                      is a label and won't trigger any optimisation routines. [Kit] }
    -                    p := hp1;
    -                  end;
    +                    if (l<=4) and (l>0) then
    +                      begin
    +                        condition:=inverse_cond(taicpu(p).condition);
    +                        repeat
    +                          taicpu(hpmov1).opcode:=A_CMOVcc;
    +                          taicpu(hpmov1).condition:=condition;
    +                          GetNextInstruction(hpmov1,hpmov1);
    +                        until not(CanBeCMOV(hpmov1));
     
    -                Exit;
    -              end;
    -          end;
    +                        { Don't decrement the reference count on the label yet, otherwise
    +                          GetNextInstruction might skip over the label if it drops to
    +                          zero. }
    +                        GetNextInstruction(hp1,hp2);
    +                        UpdateUsedRegs(tai(hp1.Next));
     
    -{$ifndef i8086}
    -        if CPUX86_HAS_CMOV in cpu_capabilities[current_settings.cputype] then
    -          begin
    -             { check for
    -                    jCC   xxx
    -                    <several movs>
    -                 xxx:
    -             }
    -             l:=0;
    -             GetNextInstruction(p, hp1);
    -             while assigned(hp1) and
    -               CanBeCMOV(hp1) and
    -               { stop on labels }
    -               not(hp1.typ=ait_label) do
    -               begin
    -                  inc(l);
    -                  GetNextInstruction(hp1,hp1);
    -               end;
    -             if assigned(hp1) then
    -               begin
    -                  if FindLabel(tasmlabel(symbol),hp1) then
    -                    begin
    -                      if (l<=4) and (l>0) then
    -                        begin
    -                          condition:=inverse_cond(taicpu(p).condition);
    -                          GetNextInstruction(p,hp1);
    -                          repeat
    -                            if not Assigned(hp1) then
    -                              InternalError(2018062900);
    +                        { if the label refs. reach zero, remove any alignment before the label }
    +                        if (hp1.typ = ait_align) and (hp2.typ = ait_label) then
    +                          begin
    +                            { Ref = 1 means it will drop to zero }
    +                            if (tasmlabel(symbol).getrefs=1) then
    +                              begin
    +                                asml.Remove(hp1);
    +                                hp1.Free;
    +                              end;
    +                          end
    +                        else
    +                          hp2 := hp1;
     
    -                            taicpu(hp1).opcode:=A_CMOVcc;
    -                            taicpu(hp1).condition:=condition;
    -                            UpdateUsedRegs(hp1);
    -                            GetNextInstruction(hp1,hp1);
    -                          until not(CanBeCMOV(hp1));
    +                        if not Assigned(hp2) then
    +                          InternalError(2018062910);
     
    -                          { Don't decrement the reference count on the label yet, otherwise
    -                            GetNextInstruction might skip over the label if it drops to
    -                            zero. }
    -                          GetNextInstruction(hp1,hp2);
    +                        if (hp2.typ <> ait_label) then
    +                          begin
    +                            { There's something other than CMOVs here.  Move the original jump
    +                              to right before this point, then break out.
     
    -                          { if the label refs. reach zero, remove any alignment before the label }
    -                          if (hp1.typ = ait_align) and (hp2.typ = ait_label) then
    -                            begin
    -                              { Ref = 1 means it will drop to zero }
    -                              if (tasmlabel(symbol).getrefs=1) then
    -                                begin
    -                                  asml.Remove(hp1);
    -                                  hp1.Free;
    -                                end;
    -                            end
    -                          else
    -                            hp2 := hp1;
    +                              Originally this was part of the above internal error, but it got
    +                              triggered on the bootstrapping process sometimes. Investigate. [Kit] }
     
    -                          if not Assigned(hp2) then
    -                            InternalError(2018062910);
    +                            asml.remove(p);
    +                            asml.insertbefore(p, hp2);
     
    -                          if (hp2.typ <> ait_label) then
    -                            begin
    -                              { There's something other than CMOVs here.  Move the original jump
    -                                to right before this point, then break out.
    +                            UpdateUsedRegs(p);
    +                            DebugMsg('Jcc/CMOVcc drop-out', p);
    +                            Result := True;
    +                            Exit;
    +                          end;
     
    -                                Originally this was part of the above internal error, but it got
    -                                triggered on the bootstrapping process sometimes. Investigate. [Kit] }
    -                              asml.remove(p);
    -                              asml.insertbefore(p, hp2);
    -                              DebugMsg('Jcc/CMOVcc drop-out', p);
    -                              UpdateUsedRegs(p);
    -                              Result := True;
    -                              Exit;
    -                            end;
    +                        UpdateUsedRegs(tai(hp2.Next));
     
    -                          { Now we can safely decrement the reference count }
    -                          tasmlabel(symbol).decrefs;
    +                        { Now we can safely decrement the reference count }
    +                        tasmlabel(symbol).decrefs;
     
    -                          { Remove the original jump }
    -                          asml.Remove(p);
    -                          p.Free;
    +                        { Remove the original jump }
    +                        asml.Remove(p);
    +                        p.Free;
     
    -                          GetNextInstruction(hp2, p); { Instruction after the label }
    +                        GetNextInstruction(hp2, p); { Instruction after the label }
     
    -                          { Remove the label if this is its final reference }
    -                          if (tasmlabel(symbol).getrefs=0) then
    -                            begin
    -                              asml.remove(hp2);
    -                              hp2.free;
    -                            end;
    +                        { Remove the label if this is its final reference }
    +                        if (tasmlabel(symbol).getrefs=0) then
    +                          begin
    +                            asml.remove(hp2);
    +                            hp2.free;
    +                          end;
     
    -                          if Assigned(p) then
    -                            begin
    -                              UpdateUsedRegs(p);
    -                              result:=true;
    -                            end;
    -                          exit;
    -                        end;
    -                    end
    -                  else
    -                    begin
    -                       { check further for
    -                              jCC   xxx
    -                              <several movs 1>
    -                              jmp   yyy
    -                      xxx:
    -                              <several movs 2>
    -                      yyy:
    -                       }
    -                      { hp2 points to jmp yyy }
    -                      hp2:=hp1;
    -                      { skip hp1 to xxx (or an align right before it) }
    -                      GetNextInstruction(hp1, hp1);
    +                        if Assigned(p) and (l > 0) then
    +                          result:=true;
     
    -                      if assigned(hp2) and
    -                        assigned(hp1) and
    -                        (l<=3) and
    -                        (hp2.typ=ait_instruction) and
    -                        (taicpu(hp2).is_jmp) and
    -                        (taicpu(hp2).condition=C_None) and
    -                        { real label and jump, no further references to the
    -                          label are allowed }
    -                        (tasmlabel(symbol).getrefs=1) and
    -                        FindLabel(tasmlabel(symbol),hp1) then
    -                         begin
    -                           l:=0;
    -                           { skip hp1 to <several moves 2> }
    -                           if (hp1.typ = ait_align) then
    -                             GetNextInstruction(hp1, hp1);
    +                        Exit;
    +                      end;
    +                  end
    +                else if (l<=3) and (hp1 <> BlockEnd) and (hp1.typ=ait_instruction) and (taicpu(hp1).opcode = A_JMP) then
    +                  begin
    +                    { check further for
    +                            jCC   xxx
    +                            <several movs 1>
    +                            jmp   yyy                <-- Unconditional jump only
    +                    xxx:
    +                            <several movs 2>
    +                    yyy:
    +                     }
    +                    { hp2 points to jmp yyy }
    +                    hp2:=hp1;
    +                    { skip hp1 to xxx (or an align right before it) }
    +                    GetNextInstruction(hp1, hp1);
     
    -                           GetNextInstruction(hp1, hpmov2);
    +                    { real label and jump, no further references to the
    +                      label are allowed }
    +                    if (hp1 <> BlockEnd) and (tasmlabel(symbol).getrefs=1) and
    +                      FindLabel(tasmlabel(symbol),hp1) then
    +                      begin
    +                        { Do the first batch of CMOVs }
    +                        condition:=inverse_cond(taicpu(p).condition);
    +                        repeat
    +                          taicpu(hpmov1).opcode:=A_CMOVcc;
    +                          taicpu(hpmov1).condition:=condition;
    +                          GetNextInstruction(hpmov1,hpmov1);
    +                        until (hpmov1 = BlockEnd) or
    +                          not(CanBeCMOV(hpmov1));
     
    -                           hp1 := hpmov2;
    -                           while assigned(hp1) and
    -                             CanBeCMOV(hp1) do
    +                        { It's safe to keep UsedRegs as is now, so save the
    +                          state while the other set of MOVs is dealt with }
    +                        TransferUsedRegs(StatePreserveRegs);
    +                        UpdateUsedRegs(tai(hp2.next));
    +                        l:=0;
    +                        { skip hp1 to <several moves 2> }
    +                        if (hp1.typ = ait_align) then
    +                          begin
    +                            UpdateUsedRegs(hp1);
    +                            GetNextInstruction(hp1, hp1);
    +                          end;
    +
    +                        UpdateUsedRegs(tai(hp1.Next));
    +                        GetNextInstruction(hp1, hpmov2);
    +
    +                        hp1 := hpmov2;
    +                        while assigned(hp1) and
    +                          (hp1.typ = ait_instruction) and
    +                          (taicpu(hp1).opcode = A_MOV) do
    +                          begin
    +                           { Check whether the MOV can be optimised first }
    +                           if OptPass1MOV(hp1) then
                                  begin
    -                               inc(l);
    -                               GetNextInstruction(hp1, hp1);
    +                               UpdateUsedRegs(hp1);
    +                               Continue;
                                  end;
    -                           { hp1 points to yyy (or an align right before it) }
    -                           hp3 := hp1;
    -                           if assigned(hp1) and
    -                             FindLabel(tasmlabel(taicpu(hp2).oper[0]^.ref^.symbol),hp1) then
    -                             begin
    -                                condition:=inverse_cond(taicpu(p).condition);
    -                                GetNextInstruction(p,hp1);
    -                                repeat
    -                                  taicpu(hp1).opcode:=A_CMOVcc;
    -                                  taicpu(hp1).condition:=condition;
    -                                  UpdateUsedRegs(hp1);
    -                                  GetNextInstruction(hp1,hp1);
    -                                until not(assigned(hp1)) or
    -                                  not(CanBeCMOV(hp1));
     
    -                                condition:=inverse_cond(condition);
    -                                hp1 := hpmov2;
    -                                { hp1 is now at <several movs 2> }
    -                                while Assigned(hp1) and CanBeCMOV(hp1) do
    -                                  begin
    -                                    taicpu(hp1).opcode:=A_CMOVcc;
    -                                    taicpu(hp1).condition:=condition;
    -                                    UpdateUsedRegs(hp1);
    -                                    GetNextInstruction(hp1,hp1);
    -                                  end;
    +                            if not CanBeCMOV(hp1) then
    +                              Break;
     
    -                                hp1 := p;
    +                            UpdateUsedRegs(tai(hp1.Next));
    +                            inc(l);
    +                            GetNextInstruction(hp1, hp1);
    +                          end;
    +                        { if yyy is the expected label, then hp1 points to it (or an align right before it) }
    +                        hp3 := hp1;
    +                        if assigned(hp1) and
    +                          FindLabel(tasmlabel(taicpu(hp2).oper[0]^.ref^.symbol),hp1) then
    +                          begin
     
    -                                { Get first instruction after label }
    -                                GetNextInstruction(hp3, p);
    +                            condition:=inverse_cond(condition);
    +                            hp1 := hpmov2;
    +                            { hp1 is now at <several movs 2> }
    +                            while Assigned(hp1) and CanBeCMOV(hp1) do
    +                              begin
    +                                taicpu(hp1).opcode:=A_CMOVcc;
    +                                taicpu(hp1).condition:=condition;
    +                                GetNextInstruction(hp1,hp1);
    +                              end;
     
    -                                if assigned(p) and (hp3.typ = ait_align) then
    -                                  GetNextInstruction(p, p);
    +                            hp1 := p;
     
    -                                { Don't dereference yet, as doing so will cause
    -                                  GetNextInstruction to skip the label and
    -                                  optional align marker. [Kit] }
    -                                GetNextInstruction(hp2, hp4);
    +                            UpdateUsedRegs(tai(hp3.Next));
     
    -                                { remove jCC }
    +                            { Get first instruction after label }
    +                            GetNextInstruction(hp3, p);
    +
    +                            if assigned(p) and (hp3.typ = ait_align) then
    +                              begin
    +                                UpdateUsedRegs(tai(p.Next));
    +                                GetNextInstruction(p, p);
    +                              end;
    +
    +                            { Don't dereference yet, as doing so will cause
    +                              GetNextInstruction to skip the label and
    +                              optional align marker. [Kit] }
    +                            GetNextInstruction(hp2, hp4);
    +
    +                            { remove jCC }
    +                            asml.remove(hp1);
    +                            hp1.free;
    +
    +                            { Remove label xxx (it will have a ref of zero due to the initial check) }
    +                            if (hp4.typ = ait_align) then
    +                              begin
    +                                { Account for alignment as well }
    +                                GetNextInstruction(hp4, hp1);
                                     asml.remove(hp1);
                                     hp1.free;
    +                              end;
     
    -                                { Remove label xxx (it will have a ref of zero due to the initial check }
    -                                if (hp4.typ = ait_align) then
    +                            asml.remove(hp4);
    +                            hp4.free;
    +
    +                            { Now we can safely decrement it }
    +                            tasmlabel(symbol).decrefs;
    +
    +                            { remove jmp }
    +                            symbol := taicpu(hp2).oper[0]^.ref^.symbol;
    +
    +                            asml.remove(hp2);
    +                            hp2.free;
    +
    +                            { Remove label yyy (and the optional alignment) if its reference will fall to zero }
    +                            if tasmlabel(symbol).getrefs = 1 then
    +                              begin
    +                                if (hp3.typ = ait_align) then
                                       begin
                                         { Account for alignment as well }
    -                                    GetNextInstruction(hp4, hp1);
    +                                    GetNextInstruction(hp3, hp1);
                                         asml.remove(hp1);
                                         hp1.free;
                                       end;
     
    -                                asml.remove(hp4);
    -                                hp4.free;
    +                                asml.remove(hp3);
    +                                hp3.free;
     
    -                                { Now we can safely decrement it }
    +                                { As before, now we can safely decrement it }
                                     tasmlabel(symbol).decrefs;
    +                              end;
     
    -                                { remove jmp }
    -                                symbol := taicpu(hp2).oper[0]^.ref^.symbol;
    +                            if Assigned(p) and (l > 0) then
    +                              result:=true;
     
    -                                asml.remove(hp2);
    -                                hp2.free;
    +                            Exit;
    +                          end
    +                        else
    +                          begin
    +                            { The first batch of MOVs was changed to CMOV instructions, but not the second }
     
    -                                { Remove label yyy (and the optional alignment) if its reference will fall to zero }
    -                                if tasmlabel(symbol).getrefs = 1 then
    -                                  begin
    -                                    if (hp3.typ = ait_align) then
    -                                      begin
    -                                        { Account for alignment as well }
    -                                        GetNextInstruction(hp3, hp1);
    -                                        asml.remove(hp1);
    -                                        hp1.free;
    -                                      end;
    +                            { remove jCC }
    +                            tasmlabel(symbol).decrefs;
    +                            asml.remove(p);
    +                            p.free;
     
    -                                    asml.remove(hp3);
    -                                    hp3.free;
    +                            p := hp2; { Set the current instruction to the JMP command, which is about to be changed... }
     
    -                                    { As before, now we can safely decrement it }
    -                                    tasmlabel(symbol).decrefs;
    -                                  end;
    +                            { Change the JMP to a jCC with the opposite condition }
    +                            taicpu(p).opcode:=A_Jcc;
    +                            taicpu(p).condition:=condition;
     
    -                                if Assigned(p) then
    -                                  begin
    -                                    UpdateUsedRegs(p);
    -                                    result:=true;
    -                                  end;
    -                                exit;
    -                             end;
    -                         end;
    -                    end;
    -               end;
    +                            Result := True;
    +                            { UsedRegs will be set to the state at the modified jump below... }
    +                          end;
    +                      end;
    +                  end;
    +              end;
    +
    +            { Restore UsedRegs state to the appropriate position }
    +            RestoreUsedRegs(StatePreserveRegs);
               end;
    -{$endif i8086}
           end;
     
     
    @@ -3974,75 +4739,6 @@
           end;
     
     
    -{$ifdef x86_64}
    -    function TX86AsmOptimizer.PostPeepholeOptMovzx(var p : tai) : Boolean;
    -      var
    -        PreMessage: string;
    -      begin
    -        Result := False;
    -        { Code size reduction by J. Gareth "Kit" Moreton }
    -        { Convert MOVZBQ and MOVZWQ to MOVZBL and MOVZWL respectively if it removes the REX prefix }
    -        if (taicpu(p).opsize in [S_BQ, S_WQ]) and
    -          (getsupreg(taicpu(p).oper[1]^.reg) in [RS_RAX, RS_RCX, RS_RDX, RS_RBX, RS_RSI, RS_RDI, RS_RBP, RS_RSP])
    -        then
    -          begin
    -            { Has 64-bit register name and opcode suffix }
    -            PreMessage := 'movz' + debug_opsize2str(taicpu(p).opsize) + ' ' + debug_operstr(taicpu(p).oper[0]^) + ',' + debug_regname(taicpu(p).oper[1]^.reg) + ' -> movz';
    -
    -            { The actual optimization }
    -            setsubreg(taicpu(p).oper[1]^.reg, R_SUBD);
    -            if taicpu(p).opsize = S_BQ then
    -              taicpu(p).changeopsize(S_BL)
    -            else
    -              taicpu(p).changeopsize(S_WL);
    -
    -            DebugMsg(SPeepholeOptimization + PreMessage +
    -              debug_opsize2str(taicpu(p).opsize) + ' ' + debug_operstr(taicpu(p).oper[0]^) + ',' + debug_regname(taicpu(p).oper[1]^.reg) + ' (removes REX prefix)', p);
    -          end;
    -      end;
    -
    -
    -    function TX86AsmOptimizer.PostPeepholeOptXor(var p : tai) : Boolean;
    -      var
    -        PreMessage, RegName: string;
    -      begin
    -        { Code size reduction by J. Gareth "Kit" Moreton }
    -        { change "xorq %reg,%reg" to "xorl %reg,%reg" for %rax, %rcx, %rdx, %rbx, %rsi, %rdi, %rbp and %rsp,
    -          as this removes the REX prefix }
    -
    -        Result := False;
    -        if not OpsEqual(taicpu(p).oper[0]^,taicpu(p).oper[1]^) then
    -          Exit;
    -
    -        if taicpu(p).oper[0]^.typ <> top_reg then
    -          { Should be impossible if both operands were equal, since one of XOR's operands must be a register }
    -          InternalError(2018011500);
    -
    -        case taicpu(p).opsize of
    -          S_Q:
    -            begin
    -              if (getsupreg(taicpu(p).oper[0]^.reg) in [RS_RAX, RS_RCX, RS_RDX, RS_RBX, RS_RSI, RS_RDI, RS_RBP, RS_RSP]) then
    -                begin
    -                  RegName := debug_regname(taicpu(p).oper[0]^.reg); { 64-bit register name }
    -                  PreMessage := 'xorq ' + RegName + ',' + RegName + ' -> xorl ';
    -
    -                  { The actual optimization }
    -                  setsubreg(taicpu(p).oper[0]^.reg, R_SUBD);
    -                  setsubreg(taicpu(p).oper[1]^.reg, R_SUBD);
    -                  taicpu(p).changeopsize(S_L);
    -
    -                  RegName := debug_regname(taicpu(p).oper[0]^.reg); { 32-bit register name }
    -
    -                  DebugMsg(SPeepholeOptimization + PreMessage + RegName + ',' + RegName + ' (removes REX prefix)', p);
    -                end;
    -            end;
    -          else
    -            ;
    -        end;
    -      end;
    -{$endif}
    -
    -
         procedure TX86AsmOptimizer.OptReferences;
           var
             p: tai;
    Index: compiler/x86_64/aoptcpu.pas
    ===================================================================
    --- compiler/x86_64/aoptcpu.pas	(revision 42345)
    +++ compiler/x86_64/aoptcpu.pas	(working copy)
    @@ -30,12 +30,16 @@
     uses cpubase, aasmtai, aopt, aoptx86;
     
     type
    +
    +  { TCpuAsmOptimizer }
    +
       TCpuAsmOptimizer = class(TX86AsmOptimizer)
    -    function PrePeepHoleOptsCpu(var p: tai): boolean; override;
         function PeepHoleOptPass1Cpu(var p: tai): boolean; override;
    -    function PeepHoleOptPass2Cpu(var p: tai): boolean; override;
    -    function PostPeepHoleOptsCpu(var p : tai) : boolean; override;
    -    procedure PostPeepHoleOpts; override;
    +    function PostPeepHoleOptsCpu(var p : tai): boolean; override;
    +
    +    { Optimisations specific to x86_64 }
    +    function PostPeepholeOptMovzx(var p : tai): Boolean; inline;
    +    function PostPeepholeOptXor(var p : tai): Boolean; inline;
       end;
     
     implementation
    @@ -42,132 +46,111 @@
     
     uses
       globals,
    -  aasmcpu;
    +  aasmcpu,
    +  cgbase,
    +  verbose;
     
    -    function TCpuAsmOptimizer.PrePeepHoleOptsCpu(var p : tai) : boolean;
    -      begin
    -        result := false;
    -        case p.typ of
    -          ait_instruction:
    -            begin
    -              case taicpu(p).opcode of
    -                A_IMUL:
    -                  result:=PrePeepholeOptIMUL(p);
    -                A_SAR,A_SHR:
    -                  result:=PrePeepholeOptSxx(p);
    -                else
    -                  ;
    -              end;
    -            end;
    -          else
    -            ;
    -        end;
    -      end;
     
    -
         function TCpuAsmOptimizer.PeepHoleOptPass1Cpu(var p: tai): boolean;
    +      var
    +        Opcode: TAsmOp;
           begin
    +        { p is known to be an instruction by this point }
    +
             result:=False;
    -        case p.typ of
    -          ait_instruction:
    -            begin
    -              case taicpu(p).opcode of
    -                A_AND:
    -                  Result:=OptPass1AND(p);
    -                A_MOV:
    -                  Result:=OptPass1MOV(p);
    -                A_MOVSX,
    -                A_MOVZX:
    -                  Result:=OptPass1Movx(p);
    -                A_VMOVAPS,
    -                A_VMOVAPD,
    -                A_VMOVUPS,
    -                A_VMOVUPD:
    -                  result:=OptPass1VMOVAP(p);
    -                A_MOVAPD,
    -                A_MOVAPS,
    -                A_MOVUPD,
    -                A_MOVUPS:
    -                  result:=OptPass1MOVAP(p);
    -                A_VDIVSD,
    -                A_VDIVSS,
    -                A_VSUBSD,
    -                A_VSUBSS,
    -                A_VMULSD,
    -                A_VMULSS,
    -                A_VADDSD,
    -                A_VADDSS,
    -                A_VANDPD,
    -                A_VANDPS,
    -                A_VORPD,
    -                A_VORPS,
    -                A_VXORPD,
    -                A_VXORPS:
    -                  result:=OptPass1VOP(p);
    -                A_MULSD,
    -                A_MULSS,
    -                A_ADDSD,
    -                A_ADDSS:
    -                  result:=OptPass1OP(p);
    -                A_VMOVSD,
    -                A_VMOVSS,
    -                A_MOVSD,
    -                A_MOVSS:
    -                  result:=OptPass1MOVXX(p);
    -                A_LEA:
    -                  result:=OptPass1LEA(p);
    -                A_SUB:
    -                  result:=OptPass1Sub(p);
    -                A_SHL,A_SAL:
    -                  result:=OptPass1SHLSAL(p);
    -                A_SETcc:
    -                  result:=OptPass1SETcc(p);
    -                A_FSTP,A_FISTP:
    -                  result:=OptPass1FSTP(p);
    -                A_FLD:
    -                  result:=OptPass1FLD(p);
    -                else
    -                  ;
    -              end;
    -            end;
    -          else
    -            ;
    -        end;
    +        { Use a local variable/register to reduce the number of pointer
    +          dereferences (the peephole optimiser would never optimise this
    +          by itself because the compiler has to consider the possibility
    +          of multi-threaded race hazards). [Kit] }
    +        Opcode := taicpu(p).opcode;
     
    +        { Clever optimisation: MOV instructions appear disproportionally
    +          more frequently than any other instruction, so check for this
    +          opcode first and reduce the total number of comparisons
    +          required over the entire block. [Kit] }
    +        if Opcode = A_MOV then
    +          Result := OptPass1MOV(p)
    +        else
    +          case Opcode of
    +            A_AND:
    +              Result:=OptPass1AND(p);
    +            A_MOVSX,
    +            A_MOVZX:
    +              Result:=OptPass1Movx(p);
    +            A_VMOVAPS,
    +            A_VMOVAPD,
    +            A_VMOVUPS,
    +            A_VMOVUPD:
    +              result:=OptPass1VMOVAP(p);
    +            A_MOVAPD,
    +            A_MOVAPS,
    +            A_MOVUPD,
    +            A_MOVUPS:
    +              result:=OptPass1MOVAP(p);
    +            A_VDIVSD,
    +            A_VDIVSS,
    +            A_VSUBSD,
    +            A_VSUBSS,
    +            A_VMULSD,
    +            A_VMULSS,
    +            A_VADDSD,
    +            A_VADDSS,
    +            A_VANDPD,
    +            A_VANDPS,
    +            A_VORPD,
    +            A_VORPS,
    +            A_VXORPD,
    +            A_VXORPS:
    +              result:=OptPass1VOP(p);
    +            A_MULSD,
    +            A_MULSS,
    +            A_ADDSD,
    +            A_ADDSS:
    +              result:=OptPass1OP(p);
    +            A_VMOVSD,
    +            A_VMOVSS,
    +            A_MOVSD,
    +            A_MOVSS:
    +              result:=OptPass1MOVXX(p);
    +            A_LEA:
    +              result:=OptPass1LEA(p);
    +            A_SUB:
    +              result:=OptPass1Sub(p);
    +            A_CMP:
    +              Result:=OptPass1CMP(p);
    +            A_SHL,A_SAL:
    +              result:=OptPass1SHLSAL(p);
    +            A_SHR,A_SAR:
    +              result:=OptPass1SHRSAR(p);
    +            A_SETcc:
    +              result:=OptPass1SETcc(p);
    +            A_IMUL:
    +              Result:=OptPass1Imul(p);
    +            A_JMP:
    +              Result:=OptPass1Jmp(p);
    +            A_Jcc:
    +              Result:=OptPass1Jcc(p);
    +            A_XOR:
    +              Result:=OptPass1XOR(p);
    +            else
    +              { Do nothing };
    +          end;
           end;
     
     
    -    function TCpuAsmOptimizer.PeepHoleOptPass2Cpu(var p : tai) : boolean;
    -      begin
    -        Result := False;
    -        case p.typ of
    -          ait_instruction:
    -            begin
    -              case taicpu(p).opcode of
    -                A_MOV:
    -                  Result:=OptPass2MOV(p);
    -                A_IMUL:
    -                  Result:=OptPass2Imul(p);
    -                A_JMP:
    -                  Result:=OptPass2Jmp(p);
    -                A_Jcc:
    -                  Result:=OptPass2Jcc(p);
    -                else
    -                  ;
    -              end;
    -            end;
    -          else
    -            ;
    -        end;
    -      end;
    -
    -
         function TCpuAsmOptimizer.PostPeepHoleOptsCpu(var p: tai): boolean;
    +      var
    +        i: Integer;
           begin
             result := false;
             case p.typ of
               ait_instruction:
                 begin
    +              { Optimise the references }
    +              for i:=0 to taicpu(p).ops-1 do
    +                if taicpu(p).oper[i]^.typ=top_ref then
    +                  optimize_ref(taicpu(p).oper[i]^.ref^,false);
    +
                   case taicpu(p).opcode of
                     A_MOV:
                       Result:=PostPeepholeOptMov(p);
    @@ -194,12 +177,66 @@
           end;
     
     
    -    procedure TCpuAsmOptimizer.PostPeepHoleOpts;
    +    function TCpuAsmOptimizer.PostPeepholeOptMovzx(var p: tai): Boolean;
    +      var
    +        PreMessage: string;
           begin
    -        inherited;
    -        OptReferences;
    +        Result := False;
    +        { Code size reduction by J. Gareth "Kit" Moreton }
    +        { Convert MOVZBQ and MOVZWQ to MOVZBL and MOVZWL respectively if it removes the REX prefix }
    +        if (taicpu(p).opsize in [S_BQ, S_WQ]) and
    +          (getsupreg(taicpu(p).oper[1]^.reg) in [RS_RAX, RS_RCX, RS_RDX, RS_RBX, RS_RSI, RS_RDI, RS_RBP, RS_RSP])
    +        then
    +          begin
    +            { Has 64-bit register name and opcode suffix }
    +            PreMessage := 'movz' + debug_opsize2str(taicpu(p).opsize) + ' ' + debug_operstr(taicpu(p).oper[0]^) + ',' + debug_regname(taicpu(p).oper[1]^.reg) + ' -> movz';
    +
    +            { The actual optimization }
    +            setsubreg(taicpu(p).oper[1]^.reg, R_SUBD);
    +            if taicpu(p).opsize = S_BQ then
    +              taicpu(p).changeopsize(S_BL)
    +            else
    +              taicpu(p).changeopsize(S_WL);
    +
    +            DebugMsg(SPeepholeOptimization + PreMessage +
    +              debug_opsize2str(taicpu(p).opsize) + ' ' + debug_operstr(taicpu(p).oper[0]^) + ',' + debug_regname(taicpu(p).oper[1]^.reg) + ' (removes REX prefix)', p);
    +          end;
           end;
     
    +
    +    function TCpuAsmOptimizer.PostPeepholeOptXor(var p : tai) : Boolean;
    +      var
    +        PreMessage, RegName: string;
    +      begin
    +        { Code size reduction by J. Gareth "Kit" Moreton }
    +        { change "xorq %reg,%reg" to "xorl %reg,%reg" for %rax, %rcx, %rdx, %rbx, %rsi, %rdi, %rbp and %rsp,
    +          as this removes the REX prefix }
    +
    +        Result := False;
    +        if not OpsEqual(taicpu(p).oper[0]^,taicpu(p).oper[1]^) then
    +          Exit;
    +
    +        if taicpu(p).oper[0]^.typ <> top_reg then
    +          { Should be impossible if both operands were equal, since one of XOR's operands must be a register }
    +          InternalError(2018011500);
    +
    +        if (taicpu(p).opsize = S_Q) and
    +          (getsupreg(taicpu(p).oper[0]^.reg) in [RS_RAX, RS_RCX, RS_RDX, RS_RBX, RS_RSI, RS_RDI, RS_RBP, RS_RSP]) then
    +          begin
    +            RegName := debug_regname(taicpu(p).oper[0]^.reg); { 64-bit register name }
    +            PreMessage := 'xorq ' + RegName + ',' + RegName + ' -> xorl ';
    +
    +            { The actual optimization }
    +            setsubreg(taicpu(p).oper[0]^.reg, R_SUBD);
    +            setsubreg(taicpu(p).oper[1]^.reg, R_SUBD);
    +            taicpu(p).changeopsize(S_L);
    +
    +            RegName := debug_regname(taicpu(p).oper[0]^.reg); { 32-bit register name }
    +
    +            DebugMsg(SPeepholeOptimization + PreMessage + RegName + ',' + RegName + ' (removes REX prefix)', p);
    +          end;
    +      end;
    +
     begin
       casmoptimizer := TCpuAsmOptimizer;
     end.
    
    overhaul-singlepass.patch (157,392 bytes)
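
    Note on the REX-prefix optimisations in the patch above: PostPeepholeOptXor and PostPeepholeOptMovzx only narrow the operand size when the destination is one of %rax, %rcx, %rdx, %rbx, %rsi, %rdi, %rbp or %rsp. The reason is that a 64-bit operand size needs a REX prefix byte, while %r8 to %r15 need one at any operand size, so narrowing saves nothing for them. The following is a minimal stand-alone sketch of that size rule for "xor reg,reg" only. The TReg type and both functions are hypothetical illustrations and are not part of the compiler sources.

    ```pascal
    program RexDemo;
    {$mode objfpc}{$H+}

    type
      { Hypothetical toy register set, not the compiler's tsuperregister }
      TReg = (rAX, rCX, rDX, rBX, rSP, rBP, rSI, rDI,
              r8, r9, r10, r11, r12, r13, r14, r15);

    { True when narrowing "xor reg,reg" from 64 to 32 bits drops the REX
      prefix. r8..r15 are only reachable through REX, so they gain nothing. }
    function NarrowingRemovesRex(Reg: TReg): Boolean;
    begin
      Result := Reg in [rAX..rDI];
    end;

    { Encoded length in bytes of "xor reg,reg": one opcode byte, one ModRM
      byte, plus a REX prefix for 64-bit operands or for r8..r15. }
    function XorLength(Reg: TReg; Is64Bit: Boolean): Integer;
    begin
      Result := 2;
      if Is64Bit or (Reg >= r8) then
        Inc(Result);
    end;

    var
      R: TReg;
    begin
      for R := Low(TReg) to High(TReg) do
        WriteLn(R, ': xorq = ', XorLength(R, True),
                   ' bytes, xorl = ', XorLength(R, False),
                   ' bytes, worth narrowing: ', NarrowingRemovesRex(R));
    end.
    ```

    The narrowing is safe because a write to a 32-bit register on x86_64 zeroes the upper 32 bits of the full register, so the xorl form still clears the whole 64-bit register.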
  • overhaul-standalone.patch (44,635 bytes)
    Index: compiler/x86/aoptx86.pas
    ===================================================================
    --- compiler/x86/aoptx86.pas	(revision 42345)
    +++ compiler/x86/aoptx86.pas	(working copy)
    @@ -1998,53 +2880,68 @@
        function TX86AsmOptimizer.OptPass1MOVXX(var p : tai) : boolean;
           var
             hp1 : tai;
    +        orig_instr: tasmop;
           begin
             Result:=false;
    -        if taicpu(p).ops <> 2 then
    -          exit;
    -        if GetNextInstruction(p,hp1) and
    -          MatchInstruction(hp1,taicpu(p).opcode,[taicpu(p).opsize]) and
    -          (taicpu(hp1).ops = 2) then
    -          begin
    -            if (taicpu(hp1).oper[0]^.typ = taicpu(p).oper[1]^.typ) and
    -               (taicpu(hp1).oper[1]^.typ = taicpu(p).oper[0]^.typ) then
    -                {  movXX reg1, mem1     or     movXX mem1, reg1
    -                   movXX mem2, reg2            movXX reg2, mem2}
    -              begin
    -                if OpsEqual(taicpu(hp1).oper[1]^,taicpu(p).oper[0]^) then
    -                  { movXX reg1, mem1     or     movXX mem1, reg1
    -                    movXX mem2, reg1            movXX reg2, mem1}
    -                  begin
    -                    if OpsEqual(taicpu(hp1).oper[0]^,taicpu(p).oper[1]^) then
    -                      begin
    -                        { Removes the second statement from
    -                          movXX reg1, mem1/reg2
    -                          movXX mem1/reg2, reg1
    -                        }
    -                        if taicpu(p).oper[0]^.typ=top_reg then
    -                          AllocRegBetween(taicpu(p).oper[0]^.reg,p,hp1,usedregs);
    -                        { Removes the second statement from
    -                          movXX mem1/reg1, reg2
    -                          movXX reg2, mem1/reg1
    -                        }
    -                        if (taicpu(p).oper[1]^.typ=top_reg) and
    -                          not(RegUsedAfterInstruction(taicpu(p).oper[1]^.reg,hp1,UsedRegs)) then
    -                          begin
    -                            asml.remove(p);
    -                            p.free;
    -                            GetNextInstruction(hp1,p);
    -                            DebugMsg(SPeepholeOptimization + 'MovXXMovXX2Nop 1 done',p);
    -                          end
    -                        else
    -                          DebugMsg(SPeepholeOptimization + 'MovXXMovXX2MoVXX 1 done',p);
    -                        asml.remove(hp1);
    -                        hp1.free;
    -                        Result:=true;
    -                        exit;
    -                      end
    -                end;
    -            end;
    -        end;
    +        repeat
    +          orig_instr := taicpu(p).opcode;
    +          if taicpu(p).ops <> 2 then
    +            exit;
    +          if GetNextInstruction(p,hp1) and
    +            MatchInstruction(hp1,orig_instr,[taicpu(p).opsize]) and
    +            (taicpu(hp1).ops = 2) then
    +            begin
    +              if (taicpu(hp1).oper[0]^.typ = taicpu(p).oper[1]^.typ) and
    +                 (taicpu(hp1).oper[1]^.typ = taicpu(p).oper[0]^.typ) then
    +                  {  movXX reg1, mem1     or     movXX mem1, reg1
    +                     movXX mem2, reg2            movXX reg2, mem2}
    +                begin
    +                  if OpsEqual(taicpu(hp1).oper[1]^,taicpu(p).oper[0]^) then
    +                    { movXX reg1, mem1     or     movXX mem1, reg1
    +                      movXX mem2, reg1            movXX reg2, mem1}
    +                    begin
    +                      if OpsEqual(taicpu(hp1).oper[0]^,taicpu(p).oper[1]^) then
    +                        begin
    +                          { Removes the second statement from
    +                            movXX reg1, mem1/reg2
    +                            movXX mem1/reg2, reg1
    +                          }
    +                          if taicpu(p).oper[0]^.typ=top_reg then
    +                            AllocRegBetween(taicpu(p).oper[0]^.reg,p,hp1,usedregs);
    +                          { Removes the second statement from
    +                            movXX mem1/reg1, reg2
    +                            movXX reg2, mem1/reg1
    +                          }
    +
    +
    +                          if (taicpu(p).oper[1]^.typ=top_reg) and
    +                            not(RegUsedAfterInstruction(taicpu(p).oper[1]^.reg,hp1,UsedRegs)) then
    +                            begin
    +                              DebugMsg(SPeepholeOptimization + 'MovXXMovXX2Nop 1 done',p);
    +                              asml.remove(p);
    +                              p.free;
    +                              { Advance p past hp1 while hp1 is still valid }
    +                              GetNextInstruction(hp1,p);
    +                              asml.remove(hp1);
    +                              hp1.free;
    +                              Result := True;
    +                              if Assigned(p) and MatchInstruction(p,orig_instr) then
    +                                Continue;
    +                            end
    +                          else
    +                            begin
    +                              DebugMsg(SPeepholeOptimization + 'MovXXMovXX2MoVXX 1 done',p);
    +                              asml.remove(hp1);
    +                              hp1.free;
    +                              Result := True;
    +                              Continue;
    +                            end;
    +                        end
    +                  end;
    +              end;
    +          end;
    +          Exit;
    +        until False;
           end;
     
     
    @@ -2062,26 +2959,30 @@
                 <Op>X    %mreg2,%mreg1
               ?
             }
    -        if GetNextInstruction(p,hp1) and
    -          { we mix single and double opperations here because we assume that the compiler
    -            generates vmovapd only after double operations and vmovaps only after single operations }
    -          MatchInstruction(hp1,A_MOVAPD,A_MOVAPS,[S_NO]) and
    -          MatchOperand(taicpu(p).oper[1]^,taicpu(hp1).oper[0]^) and
    -          MatchOperand(taicpu(p).oper[0]^,taicpu(hp1).oper[1]^) and
    -          (taicpu(p).oper[0]^.typ=top_reg) then
    -          begin
    -            TransferUsedRegs(TmpUsedRegs);
    -            UpdateUsedRegs(TmpUsedRegs, tai(p.next));
    -            if not(RegUsedAfterInstruction(taicpu(p).oper[1]^.reg,hp1,TmpUsedRegs)) then
    -              begin
    -                taicpu(p).loadoper(0,taicpu(hp1).oper[0]^);
    -                taicpu(p).loadoper(1,taicpu(hp1).oper[1]^);
    -                DebugMsg(SPeepholeOptimization + 'OpMov2Op done',p);
    -                asml.Remove(hp1);
    -                hp1.Free;
    -                result:=true;
    -              end;
    -          end;
    +        repeat
    +          if GetNextInstruction(p,hp1) and
    +            { we mix single and double operations here because we assume that the compiler
    +              generates vmovapd only after double operations and vmovaps only after single operations }
    +            MatchInstruction(hp1,A_MOVAPD,A_MOVAPS,[S_NO]) and
    +            MatchOperand(taicpu(p).oper[1]^,taicpu(hp1).oper[0]^) and
    +            MatchOperand(taicpu(p).oper[0]^,taicpu(hp1).oper[1]^) and
    +            (taicpu(p).oper[0]^.typ=top_reg) then
    +            begin
    +              TransferUsedRegs(TmpUsedRegs);
    +              UpdateUsedRegs(TmpUsedRegs, tai(p.next));
    +              if not(RegUsedAfterInstruction(taicpu(p).oper[1]^.reg,hp1,TmpUsedRegs)) then
    +                begin
    +                  taicpu(p).loadoper(0,taicpu(hp1).oper[0]^);
    +                  taicpu(p).loadoper(1,taicpu(hp1).oper[1]^);
    +                  DebugMsg(SPeepholeOptimization + 'OpMov2Op done',p);
    +                  asml.Remove(hp1);
    +                  hp1.Free;
    +                  result:=true;
    +                  Continue;
    +                end;
    +            end;
    +          Exit;
    +        until False;
           end;
     
     
    @@ -2091,96 +2992,103 @@
             l : ASizeInt;
           begin
             Result:=false;
    -        { removes seg register prefixes from LEA operations, as they
    -          don't do anything}
    -        taicpu(p).oper[0]^.ref^.Segment:=NR_NO;
    -        { changes "lea (%reg1), %reg2" into "mov %reg1, %reg2" }
    -        if (taicpu(p).oper[0]^.ref^.base <> NR_NO) and
    -           (taicpu(p).oper[0]^.ref^.index = NR_NO) and
    -           { do not mess with leas acessing the stack pointer }
    -           (taicpu(p).oper[1]^.reg <> NR_STACK_POINTER_REG) and
    -           (not(Assigned(taicpu(p).oper[0]^.ref^.Symbol))) then
    -          begin
    -            if (taicpu(p).oper[0]^.ref^.base <> taicpu(p).oper[1]^.reg) and
    -               (taicpu(p).oper[0]^.ref^.offset = 0) then
    -              begin
    -                hp1:=taicpu.op_reg_reg(A_MOV,taicpu(p).opsize,taicpu(p).oper[0]^.ref^.base,
    -                  taicpu(p).oper[1]^.reg);
    -                InsertLLItem(p.previous,p.next, hp1);
    -                DebugMsg(SPeepholeOptimization + 'Lea2Mov done',hp1);
    -                p.free;
    -                p:=hp1;
    -                Result:=true;
    -                exit;
    -              end
    -            else if (taicpu(p).oper[0]^.ref^.offset = 0) then
    -              begin
    -                hp1:=taicpu(p.Next);
    -                DebugMsg(SPeepholeOptimization + 'Lea2Nop done',p);
    -                asml.remove(p);
    -                p.free;
    -                p:=hp1;
    -                Result:=true;
    -                exit;
    -              end
    -            { continue to use lea to adjust the stack pointer,
    -              it is the recommended way, but only if not optimizing for size }
    -            else if (taicpu(p).oper[1]^.reg<>NR_STACK_POINTER_REG) or
    -              (cs_opt_size in current_settings.optimizerswitches) then
    -              with taicpu(p).oper[0]^.ref^ do
    -                if (base = taicpu(p).oper[1]^.reg) then
    -                  begin
    -                    l:=offset;
    -                    if (l=1) and UseIncDec then
    -                      begin
    -                        taicpu(p).opcode:=A_INC;
    -                        taicpu(p).loadreg(0,taicpu(p).oper[1]^.reg);
    -                        taicpu(p).ops:=1;
    -                        DebugMsg(SPeepholeOptimization + 'Lea2Inc done',p);
    -                      end
    -                    else if (l=-1) and UseIncDec then
    -                      begin
    -                        taicpu(p).opcode:=A_DEC;
    -                        taicpu(p).loadreg(0,taicpu(p).oper[1]^.reg);
    -                        taicpu(p).ops:=1;
    -                        DebugMsg(SPeepholeOptimization + 'Lea2Dec done',p);
    -                      end
    -                    else
    -                      begin
    -                        if (l<0) and (l<>-2147483648) then
    -                          begin
    -                            taicpu(p).opcode:=A_SUB;
    -                            taicpu(p).loadConst(0,-l);
    -                            DebugMsg(SPeepholeOptimization + 'Lea2Sub done',p);
    -                          end
    -                        else
    -                          begin
    -                            taicpu(p).opcode:=A_ADD;
    -                            taicpu(p).loadConst(0,l);
    -                            DebugMsg(SPeepholeOptimization + 'Lea2Add done',p);
    -                          end;
    -                      end;
    -                    Result:=true;
    -                    exit;
    -                  end;
    -          end;
    -        if GetNextInstruction(p,hp1) and
    -          MatchInstruction(hp1,A_MOV,[taicpu(p).opsize]) and
    -          MatchOperand(taicpu(p).oper[1]^,taicpu(hp1).oper[0]^) and
    -          MatchOpType(Taicpu(hp1),top_reg,top_reg) and
    -          (taicpu(p).oper[1]^.reg<>NR_STACK_POINTER_REG) then
    -          begin
    -            TransferUsedRegs(TmpUsedRegs);
    -            UpdateUsedRegs(TmpUsedRegs, tai(p.next));
    -            if not(RegUsedAfterInstruction(taicpu(p).oper[1]^.reg,hp1,TmpUsedRegs)) then
    -              begin
    -                taicpu(p).loadoper(1,taicpu(hp1).oper[1]^);
    -                DebugMsg(SPeepholeOptimization + 'LeaMov2Lea done',p);
    -                asml.Remove(hp1);
    -                hp1.Free;
    -                result:=true;
    -              end;
    -          end;
    +        repeat
    +          { removes seg register prefixes from LEA operations, as they
    +            don't do anything}
    +          taicpu(p).oper[0]^.ref^.Segment:=NR_NO;
    +          { changes "lea (%reg1), %reg2" into "mov %reg1, %reg2" }
    +          if (taicpu(p).oper[0]^.ref^.base <> NR_NO) and
    +             (taicpu(p).oper[0]^.ref^.index = NR_NO) and
    +             { do not mess with leas accessing the stack pointer }
    +             (taicpu(p).oper[1]^.reg <> NR_STACK_POINTER_REG) and
    +             (not(Assigned(taicpu(p).oper[0]^.ref^.Symbol))) then
    +            begin
    +              if (taicpu(p).oper[0]^.ref^.base <> taicpu(p).oper[1]^.reg) and
    +                 (taicpu(p).oper[0]^.ref^.offset = 0) then
    +                begin
    +                  hp1:=taicpu.op_reg_reg(A_MOV,taicpu(p).opsize,taicpu(p).oper[0]^.ref^.base,
    +                    taicpu(p).oper[1]^.reg);
    +                  InsertLLItem(p.previous,p.next, hp1);
    +                  DebugMsg(SPeepholeOptimization + 'Lea2Mov done',hp1);
    +                  p.free;
    +                  p:=hp1;
    +                  Result:=true;
    +                  exit;
    +                end
    +              else if (taicpu(p).oper[0]^.ref^.offset = 0) then
    +                begin
    +                  hp1:=taicpu(p.Next);
    +                  DebugMsg(SPeepholeOptimization + 'Lea2Nop done',p);
    +                  asml.remove(p);
    +                  p.free;
    +                  p:=hp1;
    +                  Result:=true;
    +                  if (hp1 <> BlockEnd) and MatchInstruction(hp1, A_LEA) then
    +                    Continue
    +                  else
    +                    Exit;
    +                end
    +              { continue to use lea to adjust the stack pointer,
    +                it is the recommended way, but only if not optimizing for size }
    +              else if (taicpu(p).oper[1]^.reg<>NR_STACK_POINTER_REG) or
    +                (cs_opt_size in current_settings.optimizerswitches) then
    +                with taicpu(p).oper[0]^.ref^ do
    +                  if (base = taicpu(p).oper[1]^.reg) then
    +                    begin
    +                      l:=offset;
    +                      if (l=1) and UseIncDec then
    +                        begin
    +                          taicpu(p).opcode:=A_INC;
    +                          taicpu(p).loadreg(0,taicpu(p).oper[1]^.reg);
    +                          taicpu(p).ops:=1;
    +                          DebugMsg(SPeepholeOptimization + 'Lea2Inc done',p);
    +                        end
    +                      else if (l=-1) and UseIncDec then
    +                        begin
    +                          taicpu(p).opcode:=A_DEC;
    +                          taicpu(p).loadreg(0,taicpu(p).oper[1]^.reg);
    +                          taicpu(p).ops:=1;
    +                          DebugMsg(SPeepholeOptimization + 'Lea2Dec done',p);
    +                        end
    +                      else
    +                        begin
    +                          if (l<0) and (l<>-2147483648) then
    +                            begin
    +                              taicpu(p).opcode:=A_SUB;
    +                              taicpu(p).loadConst(0,-l);
    +                              DebugMsg(SPeepholeOptimization + 'Lea2Sub done',p);
    +                            end
    +                          else
    +                            begin
    +                              taicpu(p).opcode:=A_ADD;
    +                              taicpu(p).loadConst(0,l);
    +                              DebugMsg(SPeepholeOptimization + 'Lea2Add done',p);
    +                            end;
    +                        end;
    +                      Result:=true;
    +                      exit;
    +                    end;
    +            end;
    +          if GetNextInstruction(p,hp1) and
    +            MatchInstruction(hp1,A_MOV,[taicpu(p).opsize]) and
    +            MatchOperand(taicpu(p).oper[1]^,taicpu(hp1).oper[0]^) and
    +            MatchOpType(Taicpu(hp1),top_reg,top_reg) and
    +            (taicpu(p).oper[1]^.reg<>NR_STACK_POINTER_REG) then
    +            begin
    +              TransferUsedRegs(TmpUsedRegs);
    +              UpdateUsedRegs(TmpUsedRegs, tai(p.next));
    +              if not(RegUsedAfterInstruction(taicpu(p).oper[1]^.reg,hp1,TmpUsedRegs)) then
    +                begin
    +                  taicpu(p).loadoper(1,taicpu(hp1).oper[1]^);
    +                  DebugMsg(SPeepholeOptimization + 'LeaMov2Lea done',p);
    +                  asml.Remove(hp1);
    +                  hp1.Free;
    +                  result:=true;
    +                  Continue;
    +                end;
    +            end;
    +          Exit;
    +        until False;
           end;
     
     
    @@ -2241,45 +3246,52 @@
     {$endif i386}
           begin
             Result:=false;
    -        { * change "subl $2, %esp; pushw x" to "pushl x"}
    -        { * change "sub/add const1, reg" or "dec reg" followed by
    -            "sub const2, reg" to one "sub ..., reg" }
    -        if MatchOpType(taicpu(p),top_const,top_reg) then
    -          begin
    +        repeat
    +          { * change "subl $2, %esp; pushw x" to "pushl x"}
    +          { * change "sub/add const1, reg" or "dec reg" followed by
    +              "sub const2, reg" to one "sub ..., reg" }
    +          if MatchOpType(taicpu(p),top_const,top_reg) then
    +            begin
     {$ifdef i386}
    -            if (taicpu(p).oper[0]^.val = 2) and
    -               (taicpu(p).oper[1]^.reg = NR_ESP) and
    -               { Don't do the sub/push optimization if the sub }
    -               { comes from setting up the stack frame (JM)    }
    -               (not(GetLastInstruction(p,hp1)) or
    -               not(MatchInstruction(hp1,A_MOV,[S_L]) and
    -                 MatchOperand(taicpu(hp1).oper[0]^,NR_ESP) and
    -                 MatchOperand(taicpu(hp1).oper[0]^,NR_EBP))) then
    -              begin
    -                hp1 := tai(p.next);
    -                while Assigned(hp1) and
    -                      (tai(hp1).typ in [ait_instruction]+SkipInstr) and
    -                      not RegReadByInstruction(NR_ESP,hp1) and
    -                      not RegModifiedByInstruction(NR_ESP,hp1) do
    -                  hp1 := tai(hp1.next);
    -                if Assigned(hp1) and
    -                  MatchInstruction(hp1,A_PUSH,[S_W]) then
    -                  begin
    -                    taicpu(hp1).changeopsize(S_L);
    -                    if taicpu(hp1).oper[0]^.typ=top_reg then
    -                      setsubreg(taicpu(hp1).oper[0]^.reg,R_SUBWHOLE);
    -                    hp1 := tai(p.next);
    -                    asml.remove(p);
    -                    p.free;
    -                    p := hp1;
    -                    Result:=true;
    -                    exit;
    -                  end;
    -              end;
    +              if (taicpu(p).oper[0]^.val = 2) and
    +                 (taicpu(p).oper[1]^.reg = NR_ESP) and
    +                 { Don't do the sub/push optimization if the sub }
    +                 { comes from setting up the stack frame (JM)    }
    +                 (not(GetLastInstruction(p,hp1)) or
    +                 not(MatchInstruction(hp1,A_MOV,[S_L]) and
    +                   MatchOperand(taicpu(hp1).oper[0]^,NR_ESP) and
    +                   MatchOperand(taicpu(hp1).oper[0]^,NR_EBP))) then
    +                begin
    +                  hp1 := tai(p.next);
    +                  while Assigned(hp1) and
    +                        (tai(hp1).typ in [ait_instruction]+SkipInstr) and
    +                        not RegReadByInstruction(NR_ESP,hp1) and
    +                        not RegModifiedByInstruction(NR_ESP,hp1) do
    +                    hp1 := tai(hp1.next);
    +                  if Assigned(hp1) and
    +                    MatchInstruction(hp1,A_PUSH,[S_W]) then
    +                    begin
    +                      taicpu(hp1).changeopsize(S_L);
    +                      if taicpu(hp1).oper[0]^.typ=top_reg then
    +                        setsubreg(taicpu(hp1).oper[0]^.reg,R_SUBWHOLE);
    +                      hp1 := tai(p.next);
    +                      asml.remove(p);
    +                      p.free;
    +                      p := hp1;
    +                      Result:=true;
    +                      exit;
    +                    end;
    +                end;
     {$endif i386}
    -            if DoSubAddOpt(p) then
    -              Result:=true;
    -          end;
    +              if DoSubAddOpt(p) then
    +                begin
    +                  Result:=true;
    +                  if (p <> BlockEnd) and MatchInstruction(p, A_SUB) then
    +                    Continue;
    +                end;
    +            end;
    +          Exit;
    +        until False;
           end;
     
     
    @@ -2365,6 +3377,7 @@
     {$endif x86_64}
                   then
                   begin
    +{$ifndef x86_64}
                     if not(TmpBool2) and
                         (taicpu(p).oper[0]^.val = 1) then
                       begin
    @@ -2372,11 +3385,17 @@
                           taicpu(p).oper[1]^.reg, taicpu(p).oper[1]^.reg)
                       end
                     else
    +{$endif x86_64}
                       hp1 := taicpu.op_ref_reg(A_LEA, taicpu(p).opsize, TmpRef,
                                   taicpu(p).oper[1]^.reg);
    -                InsertLLItem(p.previous, p.next, hp1);
    +
    +                hp2 := tai(p.next);
    +                InsertLLItem(p.previous, hp2, hp1);
    +                asml.Remove(p);
                     p.free;
                     p := hp1;
    +                UpdateUsedRegs(hp2);
    +                Result := True;
                   end;
               end
     {$ifndef x86_64}
    @@ -2393,6 +3412,7 @@
                       InsertLLItem(p.previous, p.next, hp1);
                       p.free;
                       p := hp1;
    +                  Result := True;
                     end
                { changes "shl $2, %reg" to "lea (,%reg,4), %reg"
                  "shl $3, %reg" to "lea (,%reg,8), %reg }
    @@ -2406,6 +3426,7 @@
                    InsertLLItem(p.previous, p.next, hp1);
                    p.free;
                    p := hp1;
    +               Result := True;
                  end;
               end
     {$endif x86_64}
    @@ -3306,14 +4039,17 @@
         function TX86AsmOptimizer.OptPass1Movx(var p : tai) : boolean;
           var
             hp1,hp2: tai;
    +        GetNextInstruction_p: Boolean;
           begin
             result:=false;
    +        GetNextInstruction_p := GetNextInstruction(p, hp1);
    +
             if (taicpu(p).oper[1]^.typ = top_reg) and
    -           GetNextInstruction(p,hp1) and
    +           GetNextInstruction_p and
                (hp1.typ = ait_instruction) and
                IsFoldableArithOp(taicpu(hp1),taicpu(p).oper[1]^.reg) and
                GetNextInstruction(hp1,hp2) and
    -           MatchInstruction(hp2,A_MOV,[]) and
    +           MatchInstruction(hp2,A_MOV) and
                (taicpu(hp2).oper[0]^.typ = top_reg) and
                OpsEqual(taicpu(hp2).oper[1]^,taicpu(p).oper[0]^) and
     {$ifdef i386}
    @@ -3374,7 +4110,7 @@
               begin
                 { removes superfluous And's after movzx's }
                 if (taicpu(p).oper[1]^.typ = top_reg) and
    -              GetNextInstruction(p, hp1) and
    +              GetNextInstruction_p and
                   (tai(hp1).typ = ait_instruction) and
                   (taicpu(hp1).opcode = A_AND) and
                   (taicpu(hp1).oper[0]^.typ = top_const) and
    @@ -3389,31 +4125,38 @@
                             asml.remove(hp1);
                             hp1.free;
                           end;
    -                    S_WL{$ifdef x86_64}, S_WQ{$endif x86_64}:
    -                      if (taicpu(hp1).oper[0]^.val = $ffff) then
    -                        begin
    -                          DebugMsg(SPeepholeOptimization + 'var5',p);
    -                          asml.remove(hp1);
    -                          hp1.free;
    +                  S_WL{$ifdef x86_64}, S_WQ{$endif x86_64}:
    +                    if (taicpu(hp1).oper[0]^.val = $ffff) then
    +                      begin
    +                        DebugMsg(SPeepholeOptimization + 'var5',p);
    +                        asml.remove(hp1);
    +                        hp1.free;
                             end;
     {$ifdef x86_64}
    -                    S_LQ:
    -                      if (taicpu(hp1).oper[0]^.val = $ffffffff) then
    -                        begin
    -                          if (cs_asm_source in current_settings.globalswitches) then
    -                            asml.insertbefore(tai_comment.create(strpnew(SPeepholeOptimization + 'var6')),p);
    -                          asml.remove(hp1);
    -                          hp1.Free;
    -                        end;
    +                  S_LQ:
    +                    if (taicpu(hp1).oper[0]^.val = $ffffffff) then
    +                      begin
    +                        if (cs_asm_source in current_settings.globalswitches) then
    +                          asml.insertbefore(tai_comment.create(strpnew(SPeepholeOptimization + 'var6')),p);
    +                        asml.remove(hp1);
    +                        hp1.Free;
    +                      end;
     {$endif x86_64}
                       else
    -                    ;
    +                  { Do nothing };
                     end;
    +
    +                { We need to get the new 'hp1' }
    +                GetNextInstruction_p := GetNextInstruction(p, hp1);
                   end;
    -            { changes some movzx constructs to faster synonims (all examples
    +            { changes some movzx constructs to faster synonyms (all examples
                   are given with eax/ax, but are also valid for other registers)}
                 if (taicpu(p).oper[1]^.typ = top_reg) then
                   if (taicpu(p).oper[0]^.typ = top_reg) then
    +
    +                { Don't blindly set Result to True, otherwise we might get
    +                  an infinite loop as AND and MOVZX convert to each other. }
    +
                     case taicpu(p).opsize of
                       S_BW:
                         begin
    @@ -3425,8 +4168,9 @@
                               taicpu(p).changeopsize(S_W);
                               taicpu(p).loadConst(0,$ff);
                               DebugMsg(SPeepholeOptimization + 'var7',p);
    +                          Result := MatchInstruction(hp1, A_AND, [S_W]) or Result;
                             end
    -                      else if GetNextInstruction(p, hp1) and
    +                      else if GetNextInstruction_p and
                             (tai(hp1).typ = ait_instruction) and
                             (taicpu(hp1).opcode = A_AND) and
                             (taicpu(hp1).oper[0]^.typ = top_const) and
    @@ -3440,6 +4184,7 @@
                               taicpu(p).changeopsize(S_W);
                               setsubreg(taicpu(p).oper[0]^.reg,R_SUBW);
                               taicpu(hp1).loadConst(0,taicpu(hp1).oper[0]^.val and $ff);
    +                          Result := True;
                             end;
                         end;
                       S_BL:
    @@ -3450,9 +4195,10 @@
                             begin
                               taicpu(p).opcode := A_AND;
                               taicpu(p).changeopsize(S_L);
    -                          taicpu(p).loadConst(0,$ff)
    +                          taicpu(p).loadConst(0,$ff);
    +                          Result := MatchInstruction(hp1, A_AND, [S_L]) or Result;
                             end
    -                      else if GetNextInstruction(p, hp1) and
    +                      else if GetNextInstruction_p and
                             (tai(hp1).typ = ait_instruction) and
                             (taicpu(hp1).opcode = A_AND) and
                             (taicpu(hp1).oper[0]^.typ = top_const) and
    @@ -3469,7 +4215,8 @@
                                 is invalid in assembler PM }
                               setsubreg(taicpu(p).oper[0]^.reg, R_SUBD);
                               taicpu(hp1).loadConst(0,taicpu(hp1).oper[0]^.val and $ff);
    -                        end
    +                          Result := True;
    +                        end;
                         end;
     {$ifndef i8086}
                       S_WL:
    @@ -3482,8 +4229,9 @@
                               taicpu(p).opcode := A_AND;
                               taicpu(p).changeopsize(S_L);
                               taicpu(p).loadConst(0,$ffff);
    +                          Result := MatchInstruction(hp1, A_AND, [S_L]) or Result;
                             end
    -                      else if GetNextInstruction(p, hp1) and
    +                      else if GetNextInstruction_p and
                             (tai(hp1).typ = ait_instruction) and
                             (taicpu(hp1).opcode = A_AND) and
                             (taicpu(hp1).oper[0]^.typ = top_const) and
    @@ -3500,6 +4248,7 @@
                                 is invalid in assembler PM }
                               setsubreg(taicpu(p).oper[0]^.reg, R_SUBD);
                               taicpu(hp1).loadConst(0,taicpu(hp1).oper[0]^.val and $ffff);
    +                          Result := True;
                             end;
                         end;
     {$endif i8086}
    @@ -3508,7 +4257,7 @@
                     end
                   else if (taicpu(p).oper[0]^.typ = top_ref) then
                       begin
    -                    if GetNextInstruction(p, hp1) and
    +                    if GetNextInstruction_p and
                           (tai(hp1).typ = ait_instruction) and
                           (taicpu(hp1).opcode = A_AND) and
                           MatchOpType(taicpu(hp1),top_const,top_reg) and
    @@ -3572,172 +4321,187 @@
           begin
             Result:=false;
     
    -        if GetNextInstruction(p, hp1) then
    -          begin
    -            if MatchOpType(taicpu(p),top_const,top_reg) and
    -              MatchInstruction(hp1,A_AND,[]) and
    -              MatchOpType(taicpu(hp1),top_const,top_reg) and
    -              (getsupreg(taicpu(p).oper[1]^.reg) = getsupreg(taicpu(hp1).oper[1]^.reg)) and
    -              { the second register must contain the first one, so compare their subreg types }
    -              (getsubreg(taicpu(p).oper[1]^.reg)<=getsubreg(taicpu(hp1).oper[1]^.reg)) and
    -              (abs(taicpu(p).oper[0]^.val and taicpu(hp1).oper[0]^.val)<$80000000) then
    -              { change
    -                  and const1, reg
    -                  and const2, reg
    -                to
    -                  and (const1 and const2), reg
    -              }
    -              begin
    -                taicpu(hp1).loadConst(0, taicpu(p).oper[0]^.val and taicpu(hp1).oper[0]^.val);
    -                DebugMsg(SPeepholeOptimization + 'AndAnd2And done',hp1);
    -                asml.remove(p);
    -                p.Free;
    -                p:=hp1;
    -                Result:=true;
    -                exit;
    -              end
    -            else if MatchOpType(taicpu(p),top_const,top_reg) and
    -              MatchInstruction(hp1,A_MOVZX,[]) and
    -              (taicpu(hp1).oper[0]^.typ = top_reg) and
    -              MatchOperand(taicpu(p).oper[1]^,taicpu(hp1).oper[1]^) and
    -              (getsupreg(taicpu(hp1).oper[0]^.reg)=getsupreg(taicpu(hp1).oper[1]^.reg)) and
    -               (((taicpu(p).opsize=S_W) and
    -                 (taicpu(hp1).opsize=S_BW)) or
    -                ((taicpu(p).opsize=S_L) and
    -                 (taicpu(hp1).opsize in [S_WL,S_BL]))
    +        repeat
    +
    +          if GetNextInstruction(p, hp1) and (hp1.typ = ait_instruction) then
    +            begin
    +              if MatchOpType(taicpu(p),top_const,top_reg) then
    +                case taicpu(hp1).opcode of
    +                  A_AND:
    +                    if MatchOpType(taicpu(hp1),top_const,top_reg) and
    +                      (getsupreg(taicpu(p).oper[1]^.reg) = getsupreg(taicpu(hp1).oper[1]^.reg)) and
    +                      { the second register must contain the first one, so compare their subreg types }
    +                      (getsubreg(taicpu(p).oper[1]^.reg)<=getsubreg(taicpu(hp1).oper[1]^.reg)) and
    +                      (abs(taicpu(p).oper[0]^.val and taicpu(hp1).oper[0]^.val)<$80000000) then
    +                      { change
    +                          and const1, reg
    +                          and const2, reg
    +                        to
    +                          and (const1 and const2), reg
    +                      }
    +                      begin
    +                        taicpu(hp1).loadConst(0, taicpu(p).oper[0]^.val and taicpu(hp1).oper[0]^.val);
    +                        DebugMsg(SPeepholeOptimization + 'AndAnd2And done',hp1);
    +                        asml.remove(p);
    +                        p.Free;
    +                        p:=hp1;
    +                        Result := True;
    +                        Continue; { p is still AND, so it's safe to re-enter the loop }
    +                      end;
    +                  A_MOVZX:
    +                    if (taicpu(hp1).oper[0]^.typ = top_reg) then
    +                      begin
    +
    +                        if MatchOperand(taicpu(p).oper[1]^,taicpu(hp1).oper[1]^) and
    +                        (getsupreg(taicpu(hp1).oper[0]^.reg)=getsupreg(taicpu(hp1).oper[1]^.reg)) and
    +                        (((taicpu(p).opsize=S_W) and
    +                         (taicpu(hp1).opsize=S_BW)) or
    +                        ((taicpu(p).opsize=S_L) and
    +                         (taicpu(hp1).opsize in [S_WL,S_BL]))
     {$ifdef x86_64}
    -                  or
    -                 ((taicpu(p).opsize=S_Q) and
    -                  (taicpu(hp1).opsize in [S_BQ,S_WQ]))
    +                          or
    +                         ((taicpu(p).opsize=S_Q) and
    +                          (taicpu(hp1).opsize in [S_BQ,S_WQ]))
     {$endif x86_64}
    -                ) then
    -                  begin
    -                    if (((taicpu(hp1).opsize) in [S_BW,S_BL{$ifdef x86_64},S_BQ{$endif x86_64}]) and
    -                        ((taicpu(p).oper[0]^.val and $ff)=taicpu(p).oper[0]^.val)
    -                         ) or
    -                       (((taicpu(hp1).opsize) in [S_WL{$ifdef x86_64},S_WQ{$endif x86_64}]) and
    -                        ((taicpu(p).oper[0]^.val and $ffff)=taicpu(p).oper[0]^.val))
    -                    then
    -                      begin
    -                        { Unlike MOVSX, MOVZX doesn't actually have a version that zero-extends a
    -                          32-bit register to a 64-bit register, or even a version called MOVZXD, so
    -                          code that tests for the presence of AND 0xffffffff followed by MOVZX is
    -                          wasted, and is indictive of a compiler bug if it were triggered. [Kit]
    +                        ) then
    +                          begin
    +                            if (((taicpu(hp1).opsize) in [S_BW,S_BL{$ifdef x86_64},S_BQ{$endif x86_64}]) and
    +                                ((taicpu(p).oper[0]^.val and $ff)=taicpu(p).oper[0]^.val)
    +                                 ) or
    +                               (((taicpu(hp1).opsize) in [S_WL{$ifdef x86_64},S_WQ{$endif x86_64}]) and
    +                                ((taicpu(p).oper[0]^.val and $ffff)=taicpu(p).oper[0]^.val))
    +                            then
    +                              begin
    +                                { Unlike MOVSX, MOVZX doesn't actually have a version that zero-extends a
    +                                  32-bit register to a 64-bit register, or even a version called MOVZXD, so
    +                                  code that tests for the presence of AND 0xffffffff followed by MOVZX is
    +                                  wasted, and is indicative of a compiler bug if it were triggered. [Kit]
     
    -                          NOTE: To zero-extend from 32 bits to 64 bits, simply use the standard MOV.
    -                        }
    -                        DebugMsg(SPeepholeOptimization + 'AndMovzToAnd done',p);
    +                                  NOTE: To zero-extend from 32 bits to 64 bits, simply use the standard MOV.
    +                                }
    +                                DebugMsg(SPeepholeOptimization + 'AndMovzToAnd done',p);
     
    -                        asml.remove(hp1);
    -                        hp1.free;
    -                        Exit;
    +                                asml.remove(hp1);
    +                                hp1.free;
    +                                Result := True;
    +                                Continue;
    +                              end;
    +                          end;
                           end;
    -                  end
    -            else if MatchOpType(taicpu(p),top_const,top_reg) and
    -              MatchInstruction(hp1,A_SHL,[]) and
    -              MatchOpType(taicpu(hp1),top_const,top_reg) and
    -              (getsupreg(taicpu(p).oper[1]^.reg)=getsupreg(taicpu(hp1).oper[1]^.reg)) then
    -              begin
    +                  A_SHL:
    +                    if MatchOpType(taicpu(hp1),top_const,top_reg) and
    +                      (getsupreg(taicpu(p).oper[1]^.reg)=getsupreg(taicpu(hp1).oper[1]^.reg)) then
    +                      begin
     {$ifopt R+}
     {$define RANGE_WAS_ON}
     {$R-}
     {$endif}
    -                { get length of potential and mask }
    -                MaskLength:=SizeOf(taicpu(p).oper[0]^.val)*8-BsrQWord(taicpu(p).oper[0]^.val)-1;
    +                        { get length of potential and mask }
    +                        MaskLength:=SizeOf(taicpu(p).oper[0]^.val)*8-BsrQWord(taicpu(p).oper[0]^.val)-1;
     
    -                { really a mask? }
    +                        { really a mask? }
     {$ifdef RANGE_WAS_ON}
     {$R+}
     {$endif}
    -                if (((QWord(1) shl MaskLength)-1)=taicpu(p).oper[0]^.val) and
    -                  { unmasked part shifted out? }
    -                  ((MaskLength+taicpu(hp1).oper[0]^.val)>=topsize2memsize[taicpu(hp1).opsize]) then
    -                  begin
    -                    DebugMsg(SPeepholeOptimization + 'AndShlToShl done',p);
    +                        if (((QWord(1) shl MaskLength)-1)=taicpu(p).oper[0]^.val) and
    +                          { unmasked part shifted out? }
    +                          ((MaskLength+taicpu(hp1).oper[0]^.val)>=topsize2memsize[taicpu(hp1).opsize]) then
    +                          begin
    +                            DebugMsg(SPeepholeOptimization + 'AndShlToShl done',p);
     
    -                    { take care of the register (de)allocs following p }
    -                    UpdateUsedRegs(tai(p.next));
    -                    asml.remove(p);
    -                    p.free;
    -                    p:=hp1;
    -                    Result:=true;
    -                    exit;
    -                  end;
    -              end
    -            else if MatchOpType(taicpu(p),top_const,top_reg) and
    -              MatchInstruction(hp1,A_MOVSX{$ifdef x86_64},A_MOVSXD{$endif x86_64},[]) and
    -              (taicpu(hp1).oper[0]^.typ = top_reg) and
    -              MatchOperand(taicpu(p).oper[1]^,taicpu(hp1).oper[1]^) and
    -              (getsupreg(taicpu(hp1).oper[0]^.reg)=getsupreg(taicpu(hp1).oper[1]^.reg)) and
    -               (((taicpu(p).opsize=S_W) and
    -                 (taicpu(hp1).opsize=S_BW)) or
    -                ((taicpu(p).opsize=S_L) and
    -                 (taicpu(hp1).opsize in [S_WL,S_BL]))
    +                            { take care of the register (de)allocs following p }
    +                            UpdateUsedRegs(tai(p.next));
    +                            asml.remove(p);
    +                            p.free;
    +                            p:=hp1;
    +                            Result:=true;
    +                            exit;
    +                          end;
    +                      end;
    +                  A_MOVSX{$ifdef x86_64},A_MOVSXD{$endif x86_64}:
    +                    if (taicpu(hp1).oper[0]^.typ = top_reg) and
    +                    MatchOperand(taicpu(p).oper[1]^,taicpu(hp1).oper[1]^) and
    +                    (getsupreg(taicpu(hp1).oper[0]^.reg)=getsupreg(taicpu(hp1).oper[1]^.reg)) and
    +                    (
    +                      (
    +                        (taicpu(p).opsize=S_W) and
    +                        (taicpu(hp1).opsize=S_BW)
    +                      ) or (
    +                        (taicpu(p).opsize=S_L) and
    +                        (taicpu(hp1).opsize in [S_WL,S_BL])
     {$ifdef x86_64}
    -                 or
    -                 ((taicpu(p).opsize=S_Q) and
    -                 (taicpu(hp1).opsize in [S_BQ,S_WQ,S_LQ]))
    +                      ) or (
    +                        (taicpu(p).opsize=S_Q) and
    +                        (taicpu(hp1).opsize in [S_BQ,S_WQ,S_LQ])
     {$endif x86_64}
    -                ) then
    -                  begin
    -                    if (((taicpu(hp1).opsize) in [S_BW,S_BL{$ifdef x86_64},S_BQ{$endif x86_64}]) and
    -                        ((taicpu(p).oper[0]^.val and $7f)=taicpu(p).oper[0]^.val)
    -                         ) or
    -                       (((taicpu(hp1).opsize) in [S_WL{$ifdef x86_64},S_WQ{$endif x86_64}]) and
    -                        ((taicpu(p).oper[0]^.val and $7fff)=taicpu(p).oper[0]^.val))
    +                      )
    +                    ) then
    +                      begin
    +                        if (((taicpu(hp1).opsize) in [S_BW,S_BL{$ifdef x86_64},S_BQ{$endif x86_64}]) and
    +                            ((taicpu(p).oper[0]^.val and $7f)=taicpu(p).oper[0]^.val)
    +                             ) or
    +                           (((taicpu(hp1).opsize) in [S_WL{$ifdef x86_64},S_WQ{$endif x86_64}]) and
    +                            ((taicpu(p).oper[0]^.val and $7fff)=taicpu(p).oper[0]^.val))
     {$ifdef x86_64}
    -                       or
    -                       (((taicpu(hp1).opsize)=S_LQ) and
    -                        ((taicpu(p).oper[0]^.val and $7fffffff)=taicpu(p).oper[0]^.val)
    -                       )
    +                           or
    +                           (((taicpu(hp1).opsize)=S_LQ) and
    +                            ((taicpu(p).oper[0]^.val and $7fffffff)=taicpu(p).oper[0]^.val)
    +                           )
     {$endif x86_64}
    -                       then
    -                       begin
    -                         DebugMsg(SPeepholeOptimization + 'AndMovsxToAnd',p);
    -                         asml.remove(hp1);
    -                         hp1.free;
    -                         Exit;
    -                       end;
    -                  end
    -            else if (taicpu(p).oper[1]^.typ = top_reg) and
    -              (hp1.typ = ait_instruction) and
    -              (taicpu(hp1).is_jmp) and
    -              (taicpu(hp1).opcode<>A_JMP) and
    -              not(RegInUsedRegs(taicpu(p).oper[1]^.reg,UsedRegs)) then
    -              begin
    -                { change
    -                    and x, reg
    -                    jxx
    -                  to
    -                    test x, reg
    -                    jxx
    -                  if reg is deallocated before the
    -                  jump, but only if it's a conditional jump (PFV)
    -                }
    -                taicpu(p).opcode := A_TEST;
    -                Exit;
    -              end;
    -          end;
    +                           then
    +                           begin
    +                             DebugMsg(SPeepholeOptimization + 'AndMovsxToAnd',p);
    +                             asml.remove(hp1);
    +                             hp1.free;
    +                             Result := True;
    +                             Continue;
    +                           end;
    +                      end;
    +                  else
    +                    { Do nothing };
    +                end;
     
    -        { Lone AND tests }
    -        if MatchOpType(taicpu(p),top_const,top_reg) then
    -          begin
    -            {
    -              - Convert and $0xFF,reg to and reg,reg if reg is 8-bit
    -              - Convert and $0xFFFF,reg to and reg,reg if reg is 16-bit
    -              - Convert and $0xFFFFFFFF,reg to and reg,reg if reg is 32-bit
    -            }
    -            if ((taicpu(p).oper[0]^.val = $FF) and (taicpu(p).opsize = S_B)) or
    -              ((taicpu(p).oper[0]^.val = $FFFF) and (taicpu(p).opsize = S_W)) or
    -              ((taicpu(p).oper[0]^.val = $FFFFFFFF) and (taicpu(p).opsize = S_L)) then
    -              begin
    -                taicpu(p).loadreg(0, taicpu(p).oper[1]^.reg)
    -              end;
    -          end;
    +              if (taicpu(p).oper[1]^.typ = top_reg) and
    +                (hp1.typ = ait_instruction) and
    +                (taicpu(hp1).is_jmp) and
    +                (taicpu(hp1).opcode<>A_JMP) and
    +                not(RegInUsedRegs(taicpu(p).oper[1]^.reg,UsedRegs)) then
    +                begin
    +                  { change
    +                      and x, reg
    +                      jxx
    +                    to
    +                      test x, reg
    +                      jxx
    +                    if reg is deallocated before the
    +                    jump, but only if it's a conditional jump (PFV)
    +                  }
    +                  taicpu(p).opcode := A_TEST;
    +                  Exit;
    +                end;
    +            end;
     
    +          { Lone AND tests }
    +          if MatchOpType(taicpu(p),top_const,top_reg) then
    +            begin
    +              {
    +                - Convert and $0xFF,reg to and reg,reg if reg is 8-bit
    +                - Convert and $0xFFFF,reg to and reg,reg if reg is 16-bit
    +                - Convert and $0xFFFFFFFF,reg to and reg,reg if reg is 32-bit
    +              }
    +              if ((taicpu(p).oper[0]^.val = $FF) and (taicpu(p).opsize = S_B)) or
    +                ((taicpu(p).oper[0]^.val = $FFFF) and (taicpu(p).opsize = S_W)) or
    +                ((taicpu(p).oper[0]^.val = $FFFFFFFF) and (taicpu(p).opsize = S_L)) then
    +                begin
    +                  taicpu(p).loadreg(0, taicpu(p).oper[1]^.reg)
    +                end;
    +            end;
    +
    +          Exit;
    +        until False;
    +
           end;
     
    -
         function TX86AsmOptimizer.PostPeepholeOptLea(var p : tai) : Boolean;
           begin
             Result:=false;
    
    overhaul-standalone.patch (44,635 bytes)
  • x86_64 Optimisation Specification.pdf (159,567 bytes)

Relationships

parent of 0036271 resolvedFlorian [Patch] Jump optimisations in code generator 

Activities

J. Gareth Moreton

2018-12-01 17:04

developer  

Metric.txt (6,396 bytes)
Compilation script:

ppcx64 -Sc -Sg -Mobjfpc -FEC:\Users\NLO-012\Documents\Programming\lazarus -g- -Xs -FuC:\Users\NLO-012\Documents\Programming\fpc\packages\fcl-db\src\sqldb -FuC:\Users\NLO-012\Documents\Programming\fpc\packages\libtar\src -FuC:\Users\NLO-012\Documents\Programming\fpc\packages\fpmkunit\src -FuC:\Users\NLO-012\Documents\Programming\lazarus\packager -FuC:\Users\NLO-012\Documents\Programming\fpc\packages\fppkg\src -FuC:\Users\NLO-012\Documents\Programming\fpc\compiler\systems -FlC:\Users\NLO-012\Documents\Programming\fpc\units\x86_64-win64\rtl -FuC:\Users\NLO-012\Documents\Programming\fpc\rtl\win64 -FiC:\Users\NLO-012\Documents\Programming\fpc\rtl\inc -FiC:\Users\NLO-012\Documents\Programming\fpc\rtl\win -FiC:\Users\NLO-012\Documents\Programming\fpc\rtl\win64 -FiC:\Users\NLO-012\Documents\Programming\fpc\rtl\x86_64 -FiC:\Users\NLO-012\Documents\Programming\fpc\rtl\win\wininc -FuC:\Users\NLO-012\Documents\Programming\fpc\rtl\win -FiC:\Users\NLO-012\Documents\Programming\fpc\rtl\objpas\sysutils -FiC:\users\NLO-012\Documents\Programming\lazarus\ide\include -FuC:\Users\NLO-012\Documents\Programming\fpc\rtl\inc -FuC:\Users\NLO-012\Documents\Programming\fpc\rtl\objpas -FuC:\users\NLO-012\Documents\Programming\lazarus\lcl\interfaces\win32 -FuC:\users\NLO-012\Documents\Programming\lazarus\components\lazutils -FiC:\Users\NLO-012\Documents\Programming\fpc\rtl\objpas\classes -FuC:\users\NLO-012\Documents\Programming\fpc\packages\rtl-objpas\src\inc -FuC:\users\NLO-012\Documents\Programming\fpc\packages\fcl-base\src -FuC:\users\NLO-012\Documents\Programming\lazarus\lcl -FuC:\users\NLO-012\Documents\Programming\fpc\packages\fcl-image\src -FiC:\users\NLO-012\Documents\Programming\lazarus\lcl\include -FuC:\Users\NLO-012\Documents\Programming\fpc\packages\winunits-base\src -FuC:\Users\NLO-012\Documents\Programming\fpc\packages\rtl-objpas\src\win -FiC:\Users\NLO-012\Documents\Programming\fpc\packages\rtl-objpas\src\inc -FuC:\Users\NLO-012\Documents\Programming\fpc\packages\paszlib\src 
-FuC:\Users\NLO-012\Documents\Programming\fpc\packages\hash\src -FuC:\Users\NLO-012\Documents\Programming\fpc\packages\pasjpeg\src -FuC:\users\NLO-012\Documents\Programming\lazarus\lcl\widgetset -FuC:\users\NLO-012\Documents\Programming\lazarus\components\lazutils -FuC:\Users\NLO-012\Documents\Programming\fpc\packages\fcl-process\src -FiC:\Users\NLO-012\Documents\Programming\fpc\packages\fcl-process\src\win -FuC:\Users\NLO-012\Documents\Programming\fpc\packages\chm\src -FuC:\Users\NLO-012\Documents\Programming\fpc\packages\fcl-json\src -FuC:\users\NLO-012\Documents\Programming\lazarus\lcl\forms -FuC:\users\NLO-012\Documents\Programming\lazarus\components\codetools -FiC:\users\NLO-012\Documents\Programming\lazarus\ide\include\win64 -FuC:\users\NLO-012\Documents\Programming\lazarus\components\ideintf -FuC:\users\NLO-012\Documents\Programming\lazarus\components\lazcontrols -FuC:\users\NLO-012\Documents\Programming\lazarus\components\debuggerintf -FuC:\users\NLO-012\Documents\Programming\lazarus\debugger -FuC:\users\NLO-012\Documents\Programming\lazarus\components\synedit -FuC:\Users\NLO-012\Documents\Programming\fpc\packages\fcl-registry\src -FuC:\Users\NLO-012\Documents\Programming\fpc\packages\regexpr\src -FuC:\users\NLO-012\Documents\Programming\lazarus\packager\registration -FuC:\Users\NLO-012\Documents\Programming\fpc\packages\fcl-db\src\base -FuC:\users\NLO-012\Documents\Programming\lazarus\components\ideintf -FuC:\Users\NLO-012\Documents\Programming\fpc\packages\fcl-res\src -FuC:\users\NLO-012\Documents\Programming\lazarus\packager -FuC:\users\NLO-012\Documents\Programming\lazarus\designer -FuC:\users\NLO-012\Documents\Programming\lazarus\ide\frames -FuC:\Users\NLO-012\Documents\Programming\fpc\packages\fcl-xml\src -FuC:\Users\NLO-012\Documents\Programming\fpc\packages\fcl-extra\src\win -FuC:\Users\NLO-012\Documents\Programming\fpc\packages\winunits-jedi\src -FuC:\Users\NLO-012\Documents\Programming\fpc\packages\fcl-db\src\dbase 
-FiC:\Users\NLO-012\Documents\Programming\fpc\packages\fcl-process\src\winall -FiC:\Users\NLO-012\Documents\Programming\fpc\packages\fcl-base\src\win -FuC:\users\NLO-012\Documents\Programming\lazarus\components\lazdebuggergdbmi -FuC:\users\NLO-012\Documents\Programming\lazarus\debugger\frames -FuC:\users\NLO-012\Documents\Programming\lazarus\converter -FuC:\users\NLO-012\Documents\Programming\lazarus\packager\frames C:\Users\NLO-012\Documents\Programming\lazarus\ide\lazarus.pp -vs -a -B -O3

Trunk build:
Discard first run (extra time taken due to disk polling etc.)

[161.777] 1285546 lines compiled, 161.8 sec, 9134400 bytes code, 788644 bytes data
[134.719] 1285546 lines compiled, 134.7 sec, 9134400 bytes code, 788644 bytes data
[124.336] 1285546 lines compiled, 124.3 sec, 9134400 bytes code, 788644 bytes data
[126.129] 1285546 lines compiled, 126.1 sec, 9134400 bytes code, 788644 bytes data

Average: 128.367s (x)

Optimisation overhaul:
Discard first run (extra time taken due to disk polling etc.)

[117.906] 1285498 lines compiled, 117.9 sec, 9124736 bytes code, 788580 bytes data
[109.793] 1285498 lines compiled, 109.8 sec, 9124736 bytes code, 788580 bytes data
[109.480] 1285498 lines compiled, 109.5 sec, 9124736 bytes code, 788580 bytes data
[106.266] 1285498 lines compiled, 106.3 sec, 9124736 bytes code, 788580 bytes data

Average: 108.533s (y)

Saving: 1 - (y/x) = 0.154505 = ~15% faster


Changing -O3 to -O1...

Trunk build:
Discard first run (extra time taken due to disk polling etc.)

[130.668] 1285571 lines compiled, 130.7 sec, 10196576 bytes code, 788996 bytes data
[128.012] 1285571 lines compiled, 128.0 sec, 10196576 bytes code, 788996 bytes data
[137.973] 1285571 lines compiled, 138.0 sec, 10196576 bytes code, 788996 bytes data
[131.625] 1285571 lines compiled, 131.6 sec, 10196576 bytes code, 788996 bytes data

Average: 132.533s (x)

Optimisation overhaul:
Discard first run (extra time taken due to disk polling etc.)

[162.082] 1285498 lines compiled, 162.1 sec, 10182608 bytes code, 788836 bytes data
[125.703] 1285498 lines compiled, 125.7 sec, 10182608 bytes code, 788836 bytes data
[126.027] 1285498 lines compiled, 126.0 sec, 10182608 bytes code, 788836 bytes data
[126.824] 1285498 lines compiled, 126.8 sec, 10182608 bytes code, 788836 bytes data

Average: 126.167s (y)

Saving: 1 - (y/x) = 0.048038 = ~5% faster

J. Gareth Moreton

2018-12-02 06:41

developer   ~0112311

Last edited: 2019-07-11 11:03


So following advice from Florian, I have split my submission into 5 separate patches so the components are easier to test individually. The compilation failure on Linux has also been fixed - my changes were reacting badly to a particular MOV optimisation that itself appears to be faulty and should be properly fixed later (search for my note "This optimisation seems to be flawed. It produces incorrect code" in compiler/x86/aoptx86.pas, on line 2377 once all five patches have been applied).

Additionally, I have refactored much of the code so the x86_64-specific stuff is better separated from platform-agnostic modules.

Patch and prerequisite information:


base - no prerequisites

Some rearrangement of methods and new base functions required for the overhaul.


global - no prerequisites

Changes to platform-agnostic code.


singlepass - requires base + global

The code that collapses pre-peephole, pass 1 and pass 2 into a single pass.


standalone - requires global

Some standalone peephole optimiser changes, made primarily so -O1 doesn't lose performance, though the optimisations work by themselves nonetheless.


mov-refactor - requires base + global

An overhaul of OptPass1MOV for x86_64 (the $ifdef's were too intertwined to easily separate this out from 64-32-split). Besides combining it with OptPass2MOV, it also attempts to rearrange the code so fewer conditions need to be checked (e.g. all situations where a MOV follows a MOV are now covered by a single branch).


64-32-split - requires base + global

Upgrades to the individual optimisation procedures for x86_64, but also isolating i386 from the changes at the same time (used as a control case). NOTE: Though this patch doesn't require standalone, singlepass or mov-refactor, optimisation will underperform without them.

J. Gareth Moreton

2018-12-06 17:53

developer   ~0112407

Last edited: 2018-12-06 17:54


Made a number of bug fixes for Linux and additional refactoring, including a further splitting of the patches, since OptPass1MOV got a lot of refactoring compared to everything else.

Additionally, the single pass loop now has a safety check and will break out if it iterates 5 times or more.
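
A safety check of this kind can be pictured roughly as follows. This is only a sketch under assumptions: MaxPasses, RunSinglePass and the PendingWork counter are hypothetical stand-ins for illustration and are not names taken from the patches.

```pascal
program PassLimitSketch;

{$mode objfpc}

const
  { Hypothetical cap, mirroring the "5 times or more" limit }
  MaxPasses = 5;

var
  { Stand-in for "optimisations that could still be applied" }
  PendingWork: Integer = 12;

{ Hypothetical single pass: returns True if it changed anything }
function RunSinglePass: Boolean;
begin
  Result := PendingWork > 0;
  if Result then
    Dec(PendingWork);
end;

var
  PassCount: Integer;
  Changed: Boolean;
begin
  PassCount := 0;
  repeat
    Inc(PassCount);
    Changed := RunSinglePass;
    { Stop when a pass changes nothing, or when the cap is reached }
  until (not Changed) or (PassCount >= MaxPasses);
  WriteLn('Stopped after ', PassCount, ' passes; work left: ', PendingWork);
end.
```

The cap trades a possibly missed late optimisation for a guarantee that the loop terminates even if two optimisations keep undoing each other.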

Re-download and reapply everything, especially as overhaul-base now contains a new method.

J. Gareth Moreton

2018-12-07 00:08

developer   ~0112414

Got some new timing metrics for compiling Lazarus under Windows. I haven't got them for Linux because I only have it on a virtual machine, which significantly skews the performance:

----

-O3: Trunk

[125.383] 1285571 lines compiled, 125.4 sec, 9137600 bytes code, 788740 bytes data
[122.078] 1285571 lines compiled, 122.1 sec, 9137600 bytes code, 788740 bytes data
[119.125] 1285571 lines compiled, 119.1 sec, 9137600 bytes code, 788740 bytes data

Avg. 122.195 sec. Binary size = 19,325,952 bytes

-O3: Overhaul

[103.133] 1285571 lines compiled, 103.1 sec, 9133968 bytes code, 788740 bytes data
[105.234] 1285571 lines compiled, 105.2 sec, 9133968 bytes code, 788740 bytes data
[104.906] 1285571 lines compiled, 104.9 sec, 9133968 bytes code, 788740 bytes data

Avg. 104.424 sec. Binary size = 19,322,368 bytes

Time improvement: 14.5%
Size improvement: 0.0185%

----

-O2: Trunk

[118.852] 1285571 lines compiled, 118.9 sec, 9103760 bytes code, 788996 bytes data
[120.266] 1285571 lines compiled, 120.3 sec, 9103760 bytes code, 788996 bytes data
[116.531] 1285571 lines compiled, 116.5 sec, 9103760 bytes code, 788996 bytes data

Avg. 118.550 sec. Binary size = 19,292,672 bytes

-O2: Overhaul

[100.875] 1285571 lines compiled, 100.9 sec, 9100096 bytes code, 788996 bytes data
[100.922] 1285571 lines compiled, 100.9 sec, 9100096 bytes code, 788996 bytes data
[101.813] 1285571 lines compiled, 101.8 sec, 9100096 bytes code, 788996 bytes data

Avg. 101.203 sec. Binary size = 19,289,088 bytes

Time improvement: 14.6%
Size improvement: 0.0186%

----

-O1: Trunk

[114.641] 1285571 lines compiled, 114.6 sec, 10196576 bytes code, 788996 bytes data
[112.734] 1285571 lines compiled, 112.7 sec, 10196576 bytes code, 788996 bytes data
[113.516] 1285571 lines compiled, 113.5 sec, 10196576 bytes code, 788996 bytes data

Avg. 113.630 sec. Binary size = 20,370,432 bytes

-O1: Overhaul

[99.711] 1285571 lines compiled, 99.7 sec, 10193536 bytes code, 788996 bytes data
[102.375] 1285571 lines compiled, 102.4 sec, 10193536 bytes code, 788996 bytes data
[102.211] 1285571 lines compiled, 102.2 sec, 10193536 bytes code, 788996 bytes data

Avg. 101.432 sec. Binary size = 20,370,360 bytes

Time improvement: 10.7%
Size improvement: 0.000353%

----

Note there are no actual new optimisations, so to speak. The size improvements come from more intelligent elimination of dead labels and the removal of unnecessary alignment hints, which allows some new branch optimisations to be found.
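
For anyone wanting to check the percentages, they appear to follow from the mean of the three runs and the formula 1 - (overhaul / trunk). The sketch below recomputes the -O3 figures; the Mean helper is a hypothetical name written for this example, and the numbers are copied from the runs above.

```pascal
program SavingSketch;

{$mode objfpc}

{ Mean of a list of timings in seconds }
function Mean(const Values: array of Double): Double;
var
  i: Integer;
begin
  Result := 0.0;
  for i := Low(Values) to High(Values) do
    Result := Result + Values[i];
  Result := Result / Length(Values);
end;

var
  Trunk, Overhaul: Double;
begin
  { -O3 timings copied from the runs listed above }
  Trunk := Mean([125.383, 122.078, 119.125]);
  Overhaul := Mean([103.133, 105.234, 104.906]);
  WriteLn('Time improvement: ', (1.0 - Overhaul / Trunk) * 100.0:0:1, '%');
  { Binary sizes in bytes, also copied from above }
  WriteLn('Size improvement: ', (1.0 - 19322368 / 19325952) * 100.0:0:4, '%');
end.
```

The output should come out close to the 14.5% and 0.0185% quoted for -O3.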

J. Gareth Moreton

2018-12-09 06:55

developer   ~0112457

Well, I got some feedback from Ryan Joseph on the mailing list. For his projects that take 20 seconds to compile on the trunk, the time difference is barely noticeable, which I guess is not the worst thing (it taking longer would be bad). It seems that the benefit of the overhaul only becomes apparent on large projects like Lazarus.

J. Gareth Moreton

2018-12-10 14:05

developer   ~0112479

This is becoming quite an involved bit of research! At the moment I'm looking for small savings that, by themselves, don't amount to much, but cumulatively, make a fair saving. These are things like skipping over the function prologue and epilogue and using a concept called 'object pooling' with TmpUsedRegs in order to cut back on the number of times it's created and destroyed.
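
To illustrate the object pooling idea in general terms: rather than constructing and freeing a scratch object every time it is needed, one instance is created up front and reset before each use. The sketch below is an assumption-laden toy; TScratchRegs and its methods are hypothetical and are not the compiler's actual TmpUsedRegs type.

```pascal
program PoolSketch;

{$mode objfpc}

type
  { Hypothetical stand-in for a register-tracking scratch object }
  TScratchRegs = class
  private
    FUsed: set of Byte;
  public
    procedure Clear;
    procedure MarkUsed(Reg: Byte);
    function IsUsed(Reg: Byte): Boolean;
  end;

procedure TScratchRegs.Clear;
begin
  FUsed := [];
end;

procedure TScratchRegs.MarkUsed(Reg: Byte);
begin
  Include(FUsed, Reg);
end;

function TScratchRegs.IsUsed(Reg: Byte): Boolean;
begin
  Result := Reg in FUsed;
end;

var
  Pooled: TScratchRegs;
  i: Integer;
begin
  { Allocate once, outside the hot loop ... }
  Pooled := TScratchRegs.Create;
  try
    for i := 1 to 1000 do
      begin
        { ... and reset it instead of calling Create/Free each iteration }
        Pooled.Clear;
        Pooled.MarkUsed(i mod 16);
        if not Pooled.IsUsed(i mod 16) then
          Halt(1);
      end;
  finally
    Pooled.Free;
  end;
  WriteLn('Reused one scratch object for 1000 iterations');
end.
```

The saving per use is tiny, which fits the point above that these changes only add up cumulatively.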

The hard part is splitting everything into bite-sized patches for easier evaluation.

J. Gareth Moreton

2018-12-23 23:14

developer   ~0112846

Updated overhaul-mov-refactor.patch to better optimise "mov add mov" combinations (since this was optimised before). It also contains an XOR 'deoptimisation' (which is reoptimised in the post-peephole stage) to make some other MOV optimisations easier (it couldn't be separated out of the patch).

J. Gareth Moreton

2019-01-22 12:47

developer   ~0113578

I've updated the patches to be compatible with the latest trunk version, and also ported the optimisations to i386, which actually simplifies things a fair bit. As a result, there is no longer a "64-32-split" patch.

Note... be careful not to apply the same patch twice, as it may add hunks multiple times and cause compiler errors. Otherwise, the order in which you apply the patches should not matter.

J. Gareth Moreton

2019-01-22 14:11

developer   ~0113581

NOTE: "overhaul-mov-refactor.patch" contains code that calls a function that is only implemented in "overhaul-singlepass.patch" (and "overhaul-mov-refactor.patch" doesn't otherwise depend on this patch).

As a result, the code segment in question is commented out with a TODO note to only re-enable it once the function is implemented. The code in question is located between lines 1651 and 1663 of "compiler/x86/aoptx86.pas" after all patches are applied.

The other alternative, to implement the function in both patches, caused merge conflicts. Apologies for the complexity of this.

J. Gareth Moreton

2019-01-22 14:34

developer   ~0113582

Fixed a range error in "overhaul-singlepass.patch" that gets triggered on cross-compilation to a 16-bit target.

J. Gareth Moreton

2019-01-22 15:52

developer   ~0113583

Some final tweaks made so 'make fullcycle OPT="-n -CriotR -gl"' completes without any errors.

J. Gareth Moreton

2019-02-22 21:27

developer   ~0114350

Removed a couple of unused variables from overhaul-singlepass.patch.

Florian

2019-02-22 22:58

administrator   ~0114352

Did you run regression tests?

J. Gareth Moreton

2019-02-22 23:15

developer   ~0114353

Last edited: 2019-02-22 23:16


I did, and I did find a bug elsewhere. Stand by though. I'm updating the patch files and doing another regression test.

("make fullcycle" found an error on the "aarch64" target)

J. Gareth Moreton

2019-02-22 23:18

developer   ~0114354

In the meantime, here's a draft of a specification I wrote for the design and implementation of this overhaul, which hopefully explains some of my choices.

J. Gareth Moreton

2019-02-22 23:37

developer   ~0114356

Updated patch files so they merge better with the current trunk, and fixed some bugs that caused cross compilation to fail.

J. Gareth Moreton

2019-02-23 16:51

developer   ~0114368

So I've finished running the regression tests. I've got a discrepancy on x86_64-win64 in that a different number of tests were run. I'll have to see what's going on and see if there are any actual regressions.

[x86_64-win64 (Overhaul)]
Total = 7617 (199:7418)
Total number of compilations = 4785 (180:4605)
Successfully compiled = 3451
Successfully failed = 1154
Compilation failures = 178
Compilation that did not fail while they should = 2
Total number of runs = 2832 (19:2813)
Successful runs = 2813
Failed runs = 19
Number units compiled = 149
Number program that should not be run = 467
Number of skipped tests = 498
Number of skipped graph tests = 10
Number of skipped interactive tests = 25
Number of skipped known bug tests = 7
Number of skipped tests for other versions = 4
Number of skipped tests for other cpus = 266
Number of skipped tests for other targets = 186


[x86_64-win64 (TRUNK)]
Total = 7814 (35:7779)
Total number of compilations = 4813 (22:4791)
Successfully compiled = 3637
Successfully failed = 1154
Compilation failures = 20
Compilation that did not fail while they should = 2
Total number of runs = 3001 (13:2988)
Successful runs = 2988
Failed runs = 13
Number units compiled = 162
Number program that should not be run = 471
Number of skipped tests = 505
Number of skipped graph tests = 10
Number of skipped interactive tests = 31
Number of skipped known bug tests = 7
Number of skipped tests for other versions = 4
Number of skipped tests for other cpus = 266
Number of skipped tests for other targets = 187

(Also updated the PDF file - there's a new subsection in it named "Label Clustering")

J. Gareth Moreton

2019-02-26 03:02

developer   ~0114453

Last edited: 2019-02-26 04:02


Updated the PDF file, since there are a few new changes to the optimisation loop in the overhaul that should probably be explained.

New patches will be uploaded tomorrow once the regression tests have finished running overnight (final verification... yesterday I only had one failure).

I also got some new metrics from compiling Lazarus:

-O3 Trunk

[79.124] 1291566 lines compiled, 79.1 sec, 9449568 bytes code, 796356 bytes data
[78.889] 1291566 lines compiled, 78.9 sec, 9449568 bytes code, 796356 bytes data
[78.930] 1291566 lines compiled, 78.9 sec, 9449568 bytes code, 796356 bytes data

Final size: 19,793,408

-O3 Overhaul

[65.296] 1291566 lines compiled, 65.3 sec, 9427248 bytes code, 796356 bytes data
[64.604] 1291566 lines compiled, 64.6 sec, 9427248 bytes code, 796356 bytes data
[65.002] 1291566 lines compiled, 65.0 sec, 9427248 bytes code, 796356 bytes data

Final size: 19,770,880

Speed saving: 17.74%
Size saving: 0.11%

----

-O2 Trunk

[71.011] 1291566 lines compiled, 71.0 sec, 9418304 bytes code, 796612 bytes data
[75.082] 1291566 lines compiled, 75.1 sec, 9418304 bytes code, 796612 bytes data
[75.249] 1291566 lines compiled, 75.2 sec, 9418304 bytes code, 796612 bytes data

Final size: 19,762,176

-O2 Overhaul

[62.208] 1291566 lines compiled, 62.2 sec, 9394256 bytes code, 796612 bytes data
[61.770] 1291566 lines compiled, 61.8 sec, 9394256 bytes code, 796612 bytes data
[62.039] 1291566 lines compiled, 62.0 sec, 9394256 bytes code, 796612 bytes data

Final size: 19,738,112

Speed saving: 15.96%
Size saving: 0.12%

----

The fact that the files are larger under -O3 compared to -O2 doesn't seem to be the fault of the peephole optimizer, but is definitely a notable anomaly that probably warrants further investigation. Note that for the size savings, most of the savings come from better jump optimisations rather than any kind of new specific optimisation.

J. Gareth Moreton

2019-02-26 14:39

developer   ~0114462

I've uploaded the patches so others can test, but it's not quite ready for merging because I've got a single failure on x86_64-win64 (webtbs/tw33417.pp) that I'm trying to reproduce and get to the bottom of.

There's a little bit of rearranging among the patches themselves, so the prerequisites are a little different:

overhaul-base - no prerequisites
overhaul-global - no prerequisites
overhaul-standalone - requires overhaul-global
overhaul-singlepass - requires overhaul-base + overhaul-global
overhaul-mov-refactor - requires overhaul-base + overhaul-global

Really, all five should be applied together to gain the full benefits.

J. Gareth Moreton

2019-02-26 15:53

developer   ~0114464

Actually, never mind! The failure seems to have been a system glitch of some kind - running the test again, it passed without incident. Everything's ready!

J. Gareth Moreton

2019-02-26 16:18

developer   ~0114465

Last edited: 2019-02-26 16:22


I should add that the overhaul also affects i386 - it's for the best because it removes some code duplication since i386 and x86_64 now share the same Peephole Optimizer (before, i386 had a rather convoluted one), with overridden methods to handle platform-specific optimisations.

J. Gareth Moreton

2019-02-26 17:33

developer   ~0114466

Unfortunately, I'm not out of the woods yet - there are some problems with building the compiler on Linux that I'm just trying to narrow down. Stand by.

J. Gareth Moreton

2019-02-26 22:06

developer   ~0114474

Fixed! I've updated the patches, though I still have to run the regression tests again to make sure nothing else exploded.

J. Gareth Moreton

2019-02-27 01:18

developer   ~0114480

Regression tests were successful.

J. Gareth Moreton

2019-02-28 03:56

developer   ~0114496

Last edited: 2019-02-28 04:16


Fixed a crash that only cropped up if code was built with debugging information. Also refactored OptPass1MOV and OptPass1Jcc a little bit to make them faster if no optimisations are made (it avoids calls to GetNextInstruction and UpdateUsedRegs whenever possible, since they're pretty expensive).

All should be okay now, but running regression tests again overnight.

Florian

2019-02-28 20:40

administrator   ~0114519

I am trying to review the patches one by one.

General:
Did you test and investigate if the single pass approach works for the other CPU targets? It is pretty invasive after all. Or can the other CPUs still use the old structure of passes?

overhaul-base.patch:
Is it really needed that aoptutils has to depend on aoptcpub? This is against the design/dependency chain principle of the aopt units.

J. Gareth Moreton

2019-02-28 21:47

developer   ~0114523

Last edited: 2019-02-28 21:48


Other CPUs still use the old structure of passes - only i386 and x86_64 use the new structure. And "make fullcycle" works.

As for overhaul-base.patch, I included aoptcpub because of moving IsJumpToLabelUncond to "compiler/aoptutils.pas" so it can be used elsewhere. Unfortunately, it references the platform-dependent "aopt_uncondjmp" constant. If it can't go in aoptutils because of the design dependency, where would be a good place to put it instead?

    { Returns True if hp is an unconditional jump to a label }
    function IsJumpToLabelUncond(hp: taicpu): boolean;
      begin
{$if defined(avr)}
        result:=(hp.opcode in aopt_uncondjmp) and // <-- constant defined in aoptcpub
{$else avr}
        result:=(hp.opcode=aopt_uncondjmp) and // <-- constant defined in aoptcpub
{$endif avr}
{$if defined(arm) or defined(aarch64)}
          (hp.condition=c_None) and
{$endif arm or aarch64}
{$if defined(riscv32) or defined(riscv64)}
          (hp.ops>0) and
          (hp.oper[0]^.reg=NR_X0) and
{$else riscv}
          (hp.ops>0) and
{$endif riscv}
          (JumpTargetOp(hp)^.typ = top_ref) and
          (JumpTargetOp(hp)^.ref^.symbol is TAsmLabel);
      end;

----

Otherwise, that function is the only reason why the unit depends on aoptcpub.

Florian

2019-02-28 21:52

administrator   ~0114526

What's the reason for moving it out of the aopt class?

J. Gareth Moreton

2019-02-28 22:16

developer   ~0114527

I might have overcomplicated that part, but the reason is that a number of the functions, like IsJumpToLabelUncond, are now used by "compiler/aoptobj.pas", which doesn't use aopt. Thinking about it, it's probably much simpler to keep those functions where they are (with the "inline" hints if necessary), declare them in the interface section and simply have aoptobj depend on aopt. Should I re-make overhaul-base to do this?

J. Gareth Moreton

2019-03-01 06:55

developer   ~0114532

Last edited: 2019-03-01 07:01


I've updated overhaul-base and overhaul-global so functions aren't moved around unnecessarily and cross-platform units don't use platform-specific units. Things feel a bit cleaner already.

(No actual function code was changed, just where the functions are located)

J. Gareth Moreton

2019-07-11 07:25

developer   ~0117163

Last edited: 2019-07-11 07:26


Updated patches to merge with current trunk. This includes fixing case blocks to contain "else" blocks, which are now required even if they're empty so that every possible value has a branch.
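
As a small illustration of the pattern (the TOpcode type and Describe function are made up for this example and do not appear in the patches), a case block that deliberately ignores some values gets an explicit empty "else" branch:

```pascal
program CaseElseSketch;

{$mode objfpc}

type
  { Hypothetical opcode set, for illustration only }
  TOpcode = (opMov, opAnd, opTest, opJmp);

function Describe(Op: TOpcode): string;
begin
  Result := 'unhandled';
  case Op of
    opMov:
      Result := 'mov';
    opAnd, opTest:
      Result := 'logical';
    else
      { Do nothing };
  end;
end;

begin
  WriteLn(Describe(opAnd));
  WriteLn(Describe(opJmp));
end.
```

Here opJmp falls through to the empty "else" branch, so Describe keeps its default result.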

Running the test suite on i386-win32 and x86_64-win64 did yield a couple of regressions, but these seemed to be false positives caused by my overzealous antivirus. Re-testing with the antivirus disabled allowed them to pass.



overhaul-base.patch (3,376 bytes)
Index: compiler/aopt.pas
===================================================================
--- compiler/aopt.pas	(revision 42345)
+++ compiler/aopt.pas	(working copy)
@@ -53,9 +53,9 @@
         { Builds a table with the locations of the labels in the TAsmList.
           Also fixes some RegDeallocs like "# %eax released; push (%eax)"  }
         Procedure BuildLabelTableAndFixRegAlloc;
-        procedure clear;
       protected
         procedure pass_1;
+        procedure clear;
       End;
       TAsmOptimizerClass = class of TAsmOptimizer;
 
Index: compiler/aoptbase.pas
===================================================================
--- compiler/aoptbase.pas	(revision 42345)
+++ compiler/aoptbase.pas	(working copy)
@@ -176,7 +176,7 @@
   End;
 
 
-  function labelCanBeSkipped(p: tai_label): boolean;
+  function labelCanBeSkipped(p: tai_label): boolean; inline;
   begin
     labelCanBeSkipped := not(p.labsym.is_used) or (p.labsym.labeltype<>alt_jump);
   end;
Index: compiler/aoptobj.pas
===================================================================
--- compiler/aoptobj.pas	(revision 42345)
+++ compiler/aoptobj.pas	(working copy)
@@ -371,6 +396,15 @@
 
        Function ArrayRefsEq(const r1, r2: TReference): Boolean;
 
+       { Returns a pointer to the operand that contains the destination label }
+       function JumpTargetOp(ai: taicpu): poper;
+
+       { Returns True if hp is any jump to a label }
+       function IsJumpToLabel(hp: taicpu): boolean;
+
+       { Returns True if hp is an unconditional jump to a label }
+       function IsJumpToLabelUncond(hp: taicpu): boolean;
+
     { ***************************** Implementation **************************** }
 
   Implementation
Index: compiler/aoptutils.pas
===================================================================
--- compiler/aoptutils.pas	(revision 42345)
+++ compiler/aoptutils.pas	(working copy)
@@ -38,15 +38,22 @@
     { skips all labels and returns the next "real" instruction }
     function SkipLabels(hp: tai; var hp2: tai): boolean;
 
+    { sets hp2 to hp and returns True if hp is not nil }
+    function SetAndTest(const hp: tai; out hp2: tai): Boolean;
+
   implementation
 
-    function MatchOpType(const p : taicpu; type0: toptype) : Boolean;
+    uses
+      aasmbase;
+
+
+    function MatchOpType(const p : taicpu; type0: toptype) : Boolean; inline;
       begin
         Result:=(p.ops=1) and (p.oper[0]^.typ=type0);
       end;
 
 
-    function MatchOpType(const p : taicpu; type0,type1 : toptype) : Boolean;
+    function MatchOpType(const p : taicpu; type0,type1 : toptype) : Boolean; inline;
       begin
         Result:=(p.ops=2) and (p.oper[0]^.typ=type0) and (p.oper[1]^.typ=type1);
       end;
@@ -53,7 +60,7 @@
 
 
 {$if max_operands>2}
-    function MatchOpType(const p : taicpu; type0,type1,type2 : toptype) : Boolean;
+    function MatchOpType(const p : taicpu; type0,type1,type2 : toptype) : Boolean; inline;
       begin
         Result:=(p.ops=3) and (p.oper[0]^.typ=type0) and (p.oper[1]^.typ=type1) and (p.oper[2]^.typ=type2);
       end;
@@ -78,6 +85,11 @@
           end;
       end;
 
+    { sets hp2 to hp and returns True if hp is not nil }
+    function SetAndTest(const hp: tai; out hp2: tai): Boolean; inline;
+      begin
+        hp2 := hp;
+        Result := Assigned(hp);
+      end;
 
 end.
-
overhaul-global.patch (19,603 bytes)
Index: compiler/aoptobj.pas
===================================================================
--- compiler/aoptobj.pas	(revision 42345)
+++ compiler/aoptobj.pas	(working copy)
@@ -24,6 +24,8 @@
 }
 Unit AoptObj;
 
+{ $DEFINE DEBUG_JUMP}
+
   {$i fpcdefs.inc}
 
   { general, processor independent objects for use by the assembler optimizer }
@@ -268,10 +270,21 @@
         Procedure CreateUsedRegs(var regs: TAllUsedRegs);
         Procedure ClearUsedRegs;
         Procedure UpdateUsedRegs(p : Tai);
-        class procedure UpdateUsedRegs(var Regs: TAllUsedRegs; p: Tai);
+        { Function always returns True.  Used so the method can be inserted into
+          an if-block when paired with RegUsedAfterInstruction, say }
+        class function UpdateUsedRegs(var Regs: TAllUsedRegs; p: Tai): Boolean;
         Function CopyUsedRegs(var dest : TAllUsedRegs) : boolean;
+
+        { If UpdateUsedRegsAndOptimize has read ahead, the result is one before
+          the next valid entry (so "p.Next" returns what's expected).  If no
+          reading ahead happened, then the result is equal to p. }
+        function UpdateUsedRegsAndOptimize(p : Tai): Tai;
+
         procedure RestoreUsedRegs(const Regs : TAllUsedRegs);
-        procedure TransferUsedRegs(var dest: TAllUsedRegs);
+
+        { Function always returns True.  Used so the method can be inserted into
+          an if-block when paired with RegUsedAfterInstruction, say }
+        function TransferUsedRegs(var dest: TAllUsedRegs): Boolean;
         class Procedure ReleaseUsedRegs(const regs : TAllUsedRegs);
         class Function RegInUsedRegs(reg : TRegister;regs : TAllUsedRegs) : boolean;
         class Procedure IncludeRegInUsedRegs(reg : TRegister;var regs : TAllUsedRegs);
@@ -351,6 +364,7 @@
         procedure RemoveDelaySlot(hp1: tai);
 
         { peephole optimizer }
+        function GetFirstInstruction(const Start: tai; var p: tai): Boolean; virtual;
         procedure PrePeepHoleOpts; virtual;
         procedure PeepHoleOptPass1; virtual;
         procedure PeepHoleOptPass2; virtual;
@@ -363,6 +377,17 @@
         function PeepHoleOptPass2Cpu(var p: tai): boolean; virtual;
         function PostPeepHoleOptsCpu(var p: tai): boolean; virtual;
 
+        { Removes all instructions between an unconditional jump and the next label }
+        procedure RemoveDeadCodeAfterJump(p: taicpu);
+
+        { If hp is a label, strip it if its reference count is zero.  Repeat until
+          a non-label is found, or a label with a non-zero reference count.
+          True is returned if something was stripped }
+        function StripDeadLabels(hp: tai; var NextValid: tai): Boolean;
+
+        { Checks and removes "jmp @@lbl; @lbl". Returns True if the jump was removed }
+        function CollapseZeroDistJump(var p: tai; hp1: tai; ThisLabel: TAsmLabel): Boolean;
+
         { insert debug comments about which registers are read and written by
           each instruction. Useful for debugging the InstructionLoadsFromReg and
           other similar functions. }
@@ -900,7 +934,81 @@
             UsedRegs[i].Clear;
         end;
 
+      { If UpdateUsedRegsAndOptimize has read ahead, the result is one before
+        the next valid entry (so "p.Next" returns what's expected).  If no
+        reading ahead happened, then the result is equal to p. }
+      function TAOptObj.UpdateUsedRegsAndOptimize(p : Tai): Tai;
+        var
+          NotFirst: Boolean;
+        begin
+          { this code is based on TUsedRegs.Update to avoid multiple passes through the asmlist,
+            the code is duplicated here }
 
+          Result := p;
+          if (p.typ in [ait_instruction, ait_label]) then
+            begin
+              if (p.next <> BlockEnd) and (tai(p.next).typ <> ait_instruction) then
+                begin
+                  { Advance one, otherwise the routine exits immediately and wastes time }
+                  p := tai(p.Next);
+                  NotFirst := True;
+                end
+              else
+                { If the next entry is an instruction, nothing will be updated or
+                  optimised here, so exit now to save time }
+                Exit;
+            end
+          else
+            NotFirst := False;
+
+          repeat
+            while assigned(p) and
+                  ((p.typ in (SkipInstr + [ait_align, ait_label] - [ait_RegAlloc])) or
+                   ((p.typ = ait_marker) and
+                    (tai_Marker(p).Kind in [mark_AsmBlockEnd,mark_NoLineInfoStart,mark_NoLineInfoEnd]))) do
+                 begin
+                   { Here's the optimise part }
+                   if (p.typ in [ait_align, ait_label]) then
+                     begin
+                       if StripDeadLabels(p, p) then
+                         begin
+                           { Note, if the first instruction is stripped and is
+                             the only one that gets removed, Result will now
+                             contain a dangling pointer, so compensate for this. }
+                           if not NotFirst then
+                             Result := tai(p.Previous);
+
+                           Continue;
+                         end;
+
+                       if ((p.typ = ait_label) and not labelCanBeSkipped(tai_label(p))) then
+                         Break;
+                     end;
+
+                   Result := p;
+                   p := tai(p.next);
+                 end;
+            while assigned(p) and
+                  (p.typ=ait_RegAlloc) Do
+              begin
+                case tai_regalloc(p).ratype of
+                  ra_alloc :
+                    Include(UsedRegs[getregtype(tai_regalloc(p).reg)].UsedRegs, getsupreg(tai_regalloc(p).reg));
+                  ra_dealloc :
+                    Exclude(UsedRegs[getregtype(tai_regalloc(p).reg)].UsedRegs, getsupreg(tai_regalloc(p).reg));				
+                  else
+                    { Do nothing };
+                end;
+                Result := p;
+                p := tai(p.next);
+              end;
+            NotFirst := True;
+          until not(assigned(p)) or
+                (not(p.typ in SkipInstr + [ait_align]) and
+                 not((p.typ = ait_label) and
+                     labelCanBeSkipped(tai_label(p))));
+        end;
+
       procedure TAOptObj.UpdateUsedRegs(p : Tai);
         begin
           { this code is based on TUsedRegs.Update to avoid multiple passes through the asmlist,
@@ -933,12 +1041,14 @@
         end;
 
 
-      class procedure TAOptObj.UpdateUsedRegs(var Regs : TAllUsedRegs;p : Tai);
+      class function TAOptObj.UpdateUsedRegs(var Regs : TAllUsedRegs;p : Tai): Boolean;
         var
           i : TRegisterType;
         begin
           for i:=low(TRegisterType) to high(TRegisterType) do
             Regs[i].Update(p);
+
+          Result := True;
         end;
 
 
@@ -964,7 +1074,7 @@
       end;
 
 
-      procedure TAOptObj.TransferUsedRegs(var dest: TAllUsedRegs);
+      function TAOptObj.TransferUsedRegs(var dest: TAllUsedRegs): Boolean;
       var
         i : TRegisterType;
       begin
@@ -973,6 +1083,8 @@
           the only published means to modify the internal state en-masse. [Kit] }
         for i:=low(TRegisterType) to high(TRegisterType) do
           dest[i].Create_Regset(i, UsedRegs[i].GetUsedRegs);
+
+        Result := True;
       end;
 
 
@@ -1338,17 +1450,33 @@
 
 
     function FindAnyLabel(hp: tai; var l: tasmlabel): Boolean;
+      var
+        next: tai;
       begin
         FindAnyLabel := false;
-        while assigned(hp.next) and
-              (tai(hp.next).typ in (SkipInstr+[ait_align])) Do
-          hp := tai(hp.next);
-        if assigned(hp.next) and
-           (tai(hp.next).typ = ait_label) then
+
+        while True do
           begin
-            FindAnyLabel := true;
-            l := tai_label(hp.next).labsym;
-          end
+            while assigned(hp.next) and
+                  (tai(hp.next).typ in (SkipInstr+[ait_align])) Do
+              hp := tai(hp.next);
+
+            next := tai(hp.next);
+            if assigned(next) and
+              (tai(next).typ = ait_label) then
+              begin
+                l := tai_label(next).labsym;
+                if not l.is_used then
+                  begin
+                    { Unsafe label }
+                    hp := next;
+                    Continue;
+                  end;
+
+                FindAnyLabel := true;
+              end;
+            Exit;
+          end;
       end;
 
 
@@ -1414,7 +1542,230 @@
           execute before branch, so code stays correct if branch is removed. }
       end;
 
+    { Search forward from BlockStart until we find the first instruction }
+    function TAOptObj.GetFirstInstruction(const Start: tai; var p: tai): Boolean;
+      begin
+        Result := False;
+        p := Start;
+        while (p <> BlockEnd) do
+          begin
+            if (p.Typ = ait_instruction) then
+              begin
+                Result := True;
+                Exit;
+              end
+            else
+              begin
+                UpdateUsedRegs(p);
+                p := tai(p.Next);
+              end;
+          end;
+      end;
 
+    { Removes all instructions between an unconditional jump and the next label }
+    procedure TAOptObj.RemoveDeadCodeAfterJump(p: taicpu);
+      var
+        hp1, hp2: tai;
+      begin
+        if not IsJumpToLabelUncond(p) then
+          Exit;
+
+        { the following if-block removes all code between a jmp and the next label,
+          because it can never be executed
+        }
+        while GetNextInstruction(p, hp1) and
+              (hp1 <> BlockEnd) and
+              (hp1.typ <> ait_label)
+{$ifdef JVM}
+              and (hp1.typ <> ait_jcatch)
+{$endif}
+              do
+          if not(hp1.typ in ([ait_label]+skipinstr)) then
+            begin
+              if (hp1.typ = ait_instruction) and
+                 taicpu(hp1).is_jmp and
+                 (JumpTargetOp(taicpu(hp1))^.typ = top_ref) and
+                 (JumpTargetOp(taicpu(hp1))^.ref^.symbol is TAsmLabel) then
+                 TAsmLabel(JumpTargetOp(taicpu(hp1))^.ref^.symbol).decrefs;
+              { don't kill start/end of assembler block,
+                no-line-info-start/end etc }
+              if (hp1.typ <> ait_marker) then
+                begin
+{$ifdef cpudelayslot}
+                  if (hp1.typ=ait_instruction) and (taicpu(hp1).is_jmp) then
+                    RemoveDelaySlot(hp1);
+{$endif cpudelayslot}
+                  if (hp1.typ = ait_align) then
+                    begin
+                      { Only remove the align if a label doesn't immediately follow }
+                      if GetNextInstruction(hp1, hp2) and (hp2.typ = ait_label) then
+                        { The label is unskippable }
+                        Exit;
+                    end;
+                  asml.remove(hp1);
+                  hp1.free;
+                end
+              else
+                p:=taicpu(hp1);
+            end
+          else
+            Break;
+      end;
+
+    { If hp is a label, strip it if its reference count is zero.  Repeat until
+      a non-label is found, or a label with a non-zero reference count.
+      True is returned if something was stripped }
+    function TAOptObj.StripDeadLabels(hp: tai; var NextValid: tai): Boolean;
+      var
+        tmp: tai;
+        hp1: tai;
+        CurrentAlign: tai;
+      begin
+        CurrentAlign := nil;
+        Result := False;
+        hp1 := hp;
+        NextValid := hp;
+
+        { Stop if hp is an instruction, for example }
+        while (hp1 <> BlockEnd) and (hp1.typ in [ait_label,ait_align]) do
+          begin
+            case hp1.typ of
+              ait_label:
+                begin
+                  with tai_label(hp1).labsym do
+                    if is_used or (bind <> AB_LOCAL) or (labeltype <> alt_jump) then
+                      begin
+                        { Valid label }
+                        if Result then
+                          NextValid := hp1;
+                        Exit;
+                      end;
+
+                  { Set tmp to the next valid entry }
+                  tmp := tai(hp1.Next);
+                  { Remove label }
+                  AsmL.Remove(hp1);
+                  hp1.Free;
+
+                  hp1 := tmp;
+
+                  Result := True;
+                  Continue;
+                end;
+              { Also remove the align if it comes before an unused label }
+              ait_align:
+                begin
+                  tmp := tai(hp1.Next);
+
+                  if (cs_debuginfo in current_settings.moduleswitches) or
+                     (cs_use_lineinfo in current_settings.globalswitches) then
+                     { Don't remove aligns if debuginfo is present }
+                    begin
+                      if (tmp.typ in [ait_label,ait_align]) then
+                        begin
+                          hp1 := tmp;
+                          Continue;
+                        end
+                      else
+                        Break;
+                    end;
+
+                  if tmp = BlockEnd then
+                    { End of block }
+                    Exit;
+
+                  case tmp.typ of
+                    ait_align: { Merge the aligns - we might as well }
+                      begin
+                        { Actually the correct operation here is not max, but
+                          the least common multiple, but alignments are
+                          strictly powers of two anyway, so the largest of the
+                          two alignments is also the LCM. [Kit] }
+                        tai_align_abstract(hp1).aligntype := max(tai_align_abstract(hp1).aligntype, tai_align_abstract(tmp).aligntype);
+                        AsmL.Remove(tmp);
+                        tmp.Free;
+                        Result := True;
+                        Continue;
+                      end;
+                    ait_label:
+                      begin
+                        { Signal that we can possibly delete this align entry }
+                        CurrentAlign := hp1;
+
+                        with tai_label(tmp).labsym do
+                          if is_used or (bind <> AB_LOCAL) or (labeltype <> alt_jump) then
+                            begin
+                              { Valid label }
+                              if Result then
+                                NextValid := hp1;
+                              Exit;
+                            end;
+
+                        { Remove label }
+                        AsmL.Remove(tmp);
+                        tmp.Free;
+
+                        Result := True;
+
+                        { Re-evaluate the align and see what follows }
+                        Continue;
+                      end
+                    else
+                      begin
+                        { Set hp1 to the instruction after the align, because the
+                          align might get deleted later and hence set NextValid
+                          to a dangling pointer. [Kit] }
+                        hp1 := tmp;
+                        Break;
+                      end;
+                  end;
+                end
+              else
+                Break;
+            end;
+            hp1 := tai(hp1.Next);
+          end;
+
+        { hp1 will be the next valid entry }
+        NextValid := hp1;
+
+        if Assigned(CurrentAlign) then
+          begin
+            { Remove the alignment field }
+            AsmL.Remove(CurrentAlign);
+            CurrentAlign.Free;
+          end;
+      end;
+
+    function TAOptObj.CollapseZeroDistJump(var p: tai; hp1: tai; ThisLabel: TAsmLabel): Boolean;
+      var
+        tmp: tai;
+      begin
+        Result := False;
+
+        { remove jumps to labels coming right after them }
+        if FindLabel(ThisLabel, hp1) and
+            { TODO: FIXME removing the first instruction fails}
+            (p<>blockstart) then
+          begin
+            ThisLabel.decrefs;
+
+            tmp := tai(p.Next); { Might be an align before the label }
+{$ifdef cpudelayslot}
+            RemoveDelaySlot(p);
+{$endif cpudelayslot}
+            asml.remove(p);
+            p.free;
+
+            StripDeadLabels(tmp, hp1);
+
+            p:=hp1;
+            Result := True;
+          end;
+
+    end;
+
+
     function TAOptObj.GetFinalDestination(hp: taicpu; level: longint): boolean;
       {traces sucessive jumps to their final destination and sets it, e.g.
        je l1                je l3
Index: compiler/x86/aoptx86.pas
===================================================================
--- compiler/x86/aoptx86.pas	(revision 42345)
+++ compiler/x86/aoptx86.pas	(working copy)
@@ -50,6 +58,12 @@
 
         procedure DebugMsg(const s : string; p : tai);inline;
 
+        { TODO: This method is declared here so it can be more easily split away
+          into a separate patch file - once fully implemented into the trunk, it
+          can be moved with the other OptPass1 routines }
+
+        function OptPass1XOR(var p : tai) : boolean;
+
         class function IsExitCode(p : tai) : boolean;
         class function isFoldableArithOp(hp1 : taicpu; reg : tregister) : boolean;
         procedure RemoveLastDeallocForFuncRes(p : tai);
@@ -96,6 +105,7 @@
     function MatchInstruction(const instr: tai; const op1,op2: TAsmOp; const opsize: topsizes): boolean;
     function MatchInstruction(const instr: tai; const op1,op2,op3: TAsmOp; const opsize: topsizes): boolean;
     function MatchInstruction(const instr: tai; const ops: array of TAsmOp; const opsize: topsizes): boolean;
+    function MatchInstruction(const instr: tai; const op: TAsmOp): boolean; inline;
 
     function MatchOperand(const oper: TOper; const reg: TRegister): boolean; inline;
     function MatchOperand(const oper: TOper; const a: tcgint): boolean; inline;
@@ -119,6 +129,14 @@
     SPeepholeOptimization = '';
 {$endif DEBUG_AOPTCPU}
 
+
+  function debug_tostr(i: tcgint): string;
+  function debug_regname(r: TRegister): string;
+  function debug_operstr(oper: TOper): string;
+  function debug_op2str(opcode: tasmop): string;
+  function debug_opsize2str(opsize: topsize): string;
+
+
   implementation
 
     uses
@@ -183,6 +204,14 @@
       end;
 
 
+    function MatchInstruction(const instr: tai; const op: TAsmOp): boolean;
+      begin
+        result :=
+          (instr.typ = ait_instruction) and
+          (taicpu(instr).opcode = op);
+      end;
+
+
     function MatchOperand(const oper: TOper; const reg: TRegister): boolean; inline;
       begin
         result := (oper.typ = top_reg) and (oper.reg = reg);
@@ -1176,6 +1576,22 @@
       end;
 
 
+    function TX86AsmOptimizer.OptPass1XOR(var p: tai): boolean;
+      begin
+        Result := False;
+        if (taicpu(p).oper[0]^.typ = top_reg) and
+           (taicpu(p).oper[1]^.typ = top_reg) and
+           (taicpu(p).oper[0]^.reg = taicpu(p).oper[1]^.reg) then
+         { temporarily change this to 'mov reg,0' to make it easier }
+         { for the CSE. Will be changed back in the post-peephole stage }
+          begin
+            taicpu(p).opcode := A_MOV;
+            taicpu(p).loadConst(0,0);
+            Result := True;
+          end;
+      end;
+
+
     function TX86AsmOptimizer.OptPass1VOP(var p : tai) : boolean;
       var
         hp1 : tai;
overhaul-mov-refactor.patch (94,203 bytes)
Index: compiler/x86/aoptx86.pas
===================================================================
--- compiler/x86/aoptx86.pas	(revision 42345)
+++ compiler/x86/aoptx86.pas	(working copy)
@@ -1216,251 +1635,769 @@
       var
         hp1, hp2: tai;
         GetNextInstruction_p: Boolean;
+        hp3: tai;
+        HP_Result: Boolean;
         PreMessage, RegName1, RegName2, InputVal, MaskNum: string;
         NewSize: topsize;
+      label
+        MovCaseBlock_CheckNext, MovCaseBlock;
+
+        function MOVRefOptimize: Boolean;
+          begin
+            Result := False;
+            if MatchOpType(taicpu(p),top_reg,top_reg) and
+              MatchOpType(taicpu(hp1),top_ref,top_reg) and
+            ((taicpu(hp1).oper[0]^.ref^.base = taicpu(p).oper[1]^.reg)
+             or
+             (taicpu(hp1).oper[0]^.ref^.index = taicpu(p).oper[1]^.reg)
+              ) and
+            (getsupreg(taicpu(hp1).oper[1]^.reg) = getsupreg(taicpu(p).oper[1]^.reg)) then
+            { mov reg1, reg2
+              mov/zx/sx (reg2, ..), reg2      to   mov/zx/sx (reg1, ..), reg2}
+            begin
+              if (taicpu(hp1).oper[0]^.ref^.base = taicpu(p).oper[1]^.reg) then
+                taicpu(hp1).oper[0]^.ref^.base := taicpu(p).oper[0]^.reg;
+              if (taicpu(hp1).oper[0]^.ref^.index = taicpu(p).oper[1]^.reg) then
+                taicpu(hp1).oper[0]^.ref^.index := taicpu(p).oper[0]^.reg;
+              DebugMsg(SPeepholeOptimization + 'MovMovXX2MoVXX 1 done',p);
+              asml.remove(p);
+              p.free;
+              p := hp1;
+              Result:=true;
+            end;
+          end;
+
       begin
         Result:=false;
+        repeat
 
-        GetNextInstruction_p:=GetNextInstruction(p, hp1);
+          GetNextInstruction_p := GetNextInstruction(p, hp1);
 
-        {  remove mov reg1,reg1? }
-        if MatchOperand(taicpu(p).oper[0]^,taicpu(p).oper[1]^)
-        then
-          begin
-            DebugMsg(SPeepholeOptimization + 'Mov2Nop done',p);
-            { take care of the register (de)allocs following p }
-            UpdateUsedRegs(tai(p.next));
-            asml.remove(p);
-            p.free;
-            p:=hp1;
-            Result:=true;
-            exit;
-          end;
+          { remove mov reg1,reg1? }
+          if MatchOperand(taicpu(p).oper[0]^,taicpu(p).oper[1]^)
+          then
+            begin
+              DebugMsg(SPeepholeOptimization + 'Mov2Nop done',p);
+              { take care of the register (de)allocs following p }
+              UpdateUsedRegsAndOptimize(tai(p.next));
+              asml.remove(p);
+              p.free;
+              p:=hp1;
+              Result:=true;
+              if MatchInstruction(hp1, A_MOV) then
+                Continue
+              else
+                exit;
+            end;
 
-        if GetNextInstruction_p and
-          MatchInstruction(hp1,A_AND,[]) and
-          (taicpu(p).oper[1]^.typ = top_reg) and
-          MatchOpType(taicpu(hp1),top_const,top_reg) then
-          begin
-            if MatchOperand(taicpu(p).oper[1]^,taicpu(hp1).oper[1]^) then
-              begin
-                case taicpu(p).opsize of
-                  S_L:
-                    if (taicpu(hp1).oper[0]^.val = $ffffffff) then
-                      begin
-                        { Optimize out:
-                            mov x, %reg
-                            and ffffffffh, %reg
-                        }
-                        DebugMsg(SPeepholeOptimization + 'MovAnd2Mov 1 done',p);
-                        asml.remove(hp1);
-                        hp1.free;
-                        Result:=true;
-                        exit;
-                      end;
-                  S_Q: { TODO: Confirm if this is even possible }
-                    if (taicpu(hp1).oper[0]^.val = $ffffffffffffffff) then
-                      begin
-                        { Optimize out:
-                            mov x, %reg
-                            and ffffffffffffffffh, %reg
-                        }
-                        DebugMsg(SPeepholeOptimization + 'MovAnd2Mov 2 done',p);
-                        asml.remove(hp1);
-                        hp1.free;
-                        Result:=true;
-                        exit;
-                      end;
+          if GetNextInstruction_p and
+            MatchInstruction(hp1,A_JMP) then
+            { Doing this optimisation here allows for some additional
+              optimisations in the same pass.  This ensures that certain
+              MOV optimisations are still performed under -O1. [Kit] }
+            begin
+              if GetNextInstruction(hp1, hp2) and CollapseZeroDistJump(hp1, hp2, TAsmLabel(taicpu(hp1).oper[0]^.ref^.symbol)) then
+                begin
+                  if tai(hp1).typ = ait_instruction then
+                    { hp1 is now the next instruction }
+                    GetNextInstruction_p := True
                   else
-                    ;
-                end;
-              end
-            else if (taicpu(p).oper[1]^.typ = top_reg) and (taicpu(hp1).oper[1]^.typ = top_reg) and
-              (taicpu(p).oper[0]^.typ <> top_const) and { MOVZX only supports registers and memory, not immediates (use MOV for that!) }
-              (getsupreg(taicpu(p).oper[1]^.reg) = getsupreg(taicpu(hp1).oper[1]^.reg))
-              then
+                    { Note, if hp1 lands on a label, it won't be skippable, so
+                      Exit if that happens }
+                    if (tai(hp1).typ in SkipInstr) then
+                      GetNextInstruction_p := GetNextInstruction(hp1, hp1)
+                    else
+                      Exit;
+                end
+              else
+                Exit;
+            end;
+
+MovCaseBlock_CheckNext:
+          { All the following optimisations require a next instruction }
+          if not GetNextInstruction_p or (hp1.typ <> ait_instruction) then
+            Exit;
+
+MovCaseBlock:
+          case taicpu(hp1).opcode of
+            { Optimisations where next instruction = XOR }
+            A_XOR:
               begin
-                InputVal := debug_operstr(taicpu(p).oper[0]^);
-                MaskNum := debug_tostr(taicpu(hp1).oper[0]^.val);
+                { OptPass1XOR doesn't use register tracking, so no need to
+                  update and restore the register array }
+                HP_Result := OptPass1XOR(hp1);
 
-                case taicpu(p).opsize of
-                  S_B:
-                    if (taicpu(hp1).oper[0]^.val = $ff) then
-                      begin
-                        { Convert:
-                            movb x, %regl        movb x, %regl
-                            andw ffh, %regw      andl ffh, %regd
-                          To:
-                            movzbw x, %regd      movzbl x, %regd
+                if HP_Result then
+                  goto MovCaseBlock;
+              end;
+            { Optimisations where next instruction = AND }
+            A_AND:
+              if (taicpu(p).oper[1]^.typ = top_reg) and
+                MatchOpType(taicpu(hp1),top_const,top_reg) then
+                begin
+                  if MatchOperand(taicpu(p).oper[1]^,taicpu(hp1).oper[1]^) then
+                    begin
+                      case taicpu(p).opsize of
+                        S_L:
+                          if (taicpu(hp1).oper[0]^.val = $ffffffff) then
+                            begin
+                              { Optimize out:
+                                  mov x, %reg
+                                  and ffffffffh, %reg
+                              }
+                              DebugMsg(SPeepholeOptimization + 'MovAnd2Mov 1 done',p);
+                              asml.remove(hp1);
+                              hp1.free;
+                              GetNextInstruction_p := GetNextInstruction(p, hp1);
+                              goto MovCaseBlock_CheckNext;
+                            end;
+                        S_Q: { TODO: Confirm if this is even possible }
+                          if (taicpu(hp1).oper[0]^.val = $ffffffffffffffff) then
+                            begin
+                              { Optimize out:
+                                  mov x, %reg
+                                  and ffffffffffffffffh, %reg
+                              }
+                              DebugMsg(SPeepholeOptimization + 'MovAnd2Mov 2 done',p);
+                              asml.remove(hp1);
+                              hp1.free;
+                              GetNextInstruction_p := GetNextInstruction(p, hp1);
+                              goto MovCaseBlock_CheckNext;
+                            end;
+                        else
+                          { Do nothing };
+                      end;
+                    end
+                  else if (taicpu(p).oper[1]^.typ = top_reg) and (taicpu(hp1).oper[1]^.typ = top_reg) and
+                    (taicpu(p).oper[0]^.typ <> top_const) and { MOVZX only supports registers and memory, not immediates (use MOV for that!) }
+                    (getsupreg(taicpu(p).oper[1]^.reg) = getsupreg(taicpu(hp1).oper[1]^.reg))
+                    then
+                    begin
+                      InputVal := debug_operstr(taicpu(p).oper[0]^);
+                      MaskNum := debug_tostr(taicpu(hp1).oper[0]^.val);
 
-                          (Identical registers, just different sizes)
-                        }
-                        RegName1 := debug_regname(taicpu(p).oper[1]^.reg); { 8-bit register name }
-                        RegName2 := debug_regname(taicpu(hp1).oper[1]^.reg); { 16/32-bit register name }
+                      case taicpu(p).opsize of
+                        S_B:
+                          if (taicpu(hp1).oper[0]^.val = $ff) then
+                            begin
+                              { Convert:
+                                  movb x, %regl        movb x, %regl
+                                  andw ffh, %regw      andl ffh, %regd
+                                To:
+                                  movzbw x, %regd      movzbl x, %regd
 
-                        case taicpu(hp1).opsize of
-                          S_W: NewSize := S_BW;
-                          S_L: NewSize := S_BL;
+                                (Identical registers, just different sizes)
+                              }
+                              RegName1 := debug_regname(taicpu(p).oper[1]^.reg); { 8-bit register name }
+                              RegName2 := debug_regname(taicpu(hp1).oper[1]^.reg); { 16/32-bit register name }
+
+                              case taicpu(hp1).opsize of
+                                S_W: NewSize := S_BW;
+                                S_L: NewSize := S_BL;
 {$ifdef x86_64}
-                          S_Q: NewSize := S_BQ;
+                                S_Q: NewSize := S_BQ;
 {$endif x86_64}
+                                else
+                                  InternalError(2018011510);
+                              end;
+                            end
                           else
-                            InternalError(2018011510);
-                        end;
-                      end
-                    else
-                      NewSize := S_NO;
-                  S_W:
-                    if (taicpu(hp1).oper[0]^.val = $ffff) then
-                      begin
-                        { Convert:
-                            movw x, %regw
-                            andl ffffh, %regd
-                          To:
-                            movzwl x, %regd
+                            NewSize := S_NO;
+                        S_W:
+                          if (taicpu(hp1).oper[0]^.val = $ffff) then
+                            begin
+                              { Convert:
+                                  movw x, %regw
+                                  andl ffffh, %regd
+                                To:
+                                  movzwl x, %regd
 
-                          (Identical registers, just different sizes)
-                        }
-                        RegName1 := debug_regname(taicpu(p).oper[1]^.reg); { 16-bit register name }
-                        RegName2 := debug_regname(taicpu(hp1).oper[1]^.reg); { 32-bit register name }
+                                (Identical registers, just different sizes)
+                              }
+                              RegName1 := debug_regname(taicpu(p).oper[1]^.reg); { 16-bit register name }
+                              RegName2 := debug_regname(taicpu(hp1).oper[1]^.reg); { 32-bit register name }
 
-                        case taicpu(hp1).opsize of
-                          S_L: NewSize := S_WL;
+                              case taicpu(hp1).opsize of
+                                S_L: NewSize := S_WL;
 {$ifdef x86_64}
-                          S_Q: NewSize := S_WQ;
+                                S_Q: NewSize := S_WQ;
 {$endif x86_64}
+                                else
+                                  InternalError(2018011511);
+                              end;
+                            end
                           else
-                            InternalError(2018011511);
+                            NewSize := S_NO;
+                        else
+                          NewSize := S_NO;
+                      end;
+
+                      if NewSize <> S_NO then
+                        begin
+                          PreMessage := 'mov' + debug_opsize2str(taicpu(p).opsize) + ' ' + InputVal + ',' + RegName1;
+
+                          { The actual optimization }
+                          taicpu(p).opcode := A_MOVZX;
+                          taicpu(p).changeopsize(NewSize);
+                          taicpu(p).oper[1]^ := taicpu(hp1).oper[1]^;
+
+                          { Safeguard if "and" is followed by a conditional command }
+                          TransferUsedRegs(TmpUsedRegs);
+                          UpdateUsedRegs(TmpUsedRegs,tai(p.next));
+
+                          if (RegUsedAfterInstruction(NR_DEFAULTFLAGS, hp1, TmpUsedRegs)) then
+                            begin
+                              { At this point, the "and" command is effectively equivalent to
+                                "test %reg,%reg". This will be handled separately by the
+                                Peephole Optimizer. [Kit] }
+
+                              DebugMsg(SPeepholeOptimization + PreMessage +
+                                ' -> movz' + debug_opsize2str(NewSize) + ' ' + InputVal + ',' + RegName2, p);
+                            end
+                          else
+                            begin
+                              DebugMsg(SPeepholeOptimization + PreMessage + '; and' + debug_opsize2str(taicpu(hp1).opsize) + ' $' + MaskNum + ',' + RegName2 +
+                                ' -> movz' + debug_opsize2str(NewSize) + ' ' + InputVal + ',' + RegName2, p);
+
+                              asml.Remove(hp1);
+                              hp1.Free;
+                            end;
+
+                          Result := True;
+                          Exit;
+
                         end;
-                      end
-                    else
-                      NewSize := S_NO;
-                  else
-                    NewSize := S_NO;
+                    end;
                 end;
 
-                if NewSize <> S_NO then
+            { Optimisations where next instruction = MOV }
+            A_MOV:
+              begin
+                if taicpu(hp1).opsize = taicpu(p).opsize then
                   begin
-                    PreMessage := 'mov' + debug_opsize2str(taicpu(p).opsize) + ' ' + InputVal + ',' + RegName1;
+                    if (taicpu(p).oper[1]^.typ = top_reg) and
+                      MatchOperand(taicpu(p).oper[1]^,taicpu(hp1).oper[0]^) then
+                      begin
+                        { we have
+                            mov x, %treg
+                            mov %treg, y
+                        }
 
-                    { The actual optimization }
-                    taicpu(p).opcode := A_MOVZX;
-                    taicpu(p).changeopsize(NewSize);
-                    taicpu(p).oper[1]^ := taicpu(hp1).oper[1]^;
+                        if not(RegInOp(taicpu(p).oper[1]^.reg,taicpu(hp1).oper[1]^)) then
+                          begin
+                            if (TransferUsedRegs(TmpUsedRegs) and
+                              UpdateUsedRegs(TmpUsedRegs, tai(p.Next)) and
+                              RegUsedAfterInstruction(taicpu(p).oper[1]^.reg, hp1, TmpUsedRegs)) then
+                              begin
+                                { we've got
 
-                    { Safeguard if "and" is followed by a conditional command }
-                    TransferUsedRegs(TmpUsedRegs);
-                    UpdateUsedRegs(TmpUsedRegs,tai(p.next));
+                                  mov x, %treg
+                                  mov %treg, y
 
-                    if (RegUsedAfterInstruction(NR_DEFAULTFLAGS, hp1, TmpUsedRegs)) then
-                      begin
-                        { At this point, the "and" command is effectively equivalent to
-                          "test %reg,%reg". This will be handled separately by the
-                          Peephole Optimizer. [Kit] }
+                                  ... but %treg is used afterwards.  We can optimise this to minimise a pipeline stall:
 
-                        DebugMsg(SPeepholeOptimization + PreMessage +
-                          ' -> movz' + debug_opsize2str(NewSize) + ' ' + InputVal + ',' + RegName2, p);
-                      end
-                    else
-                      begin
-                        DebugMsg(SPeepholeOptimization + PreMessage + '; and' + debug_opsize2str(taicpu(hp1).opsize) + ' $' + MaskNum + ',' + RegName2 +
-                          ' -> movz' + debug_opsize2str(NewSize) + ' ' + InputVal + ',' + RegName2, p);
+                                  mov x, %treg
+                                  mov x, y
 
-                        asml.Remove(hp1);
-                        hp1.Free;
+                                  x must be a constant or a register, and y must also be a register.  It can work if x
+                                  is a reference that doesn't contain %treg, but this ends up using an AGU as well
+                                  as an ALU and harms hyperthreading and instruction throughput. [Kit]
+                                }
+                                if (taicpu(hp1).oper[1]^.typ = top_reg) and (taicpu(p).oper[0]^.typ <> top_ref) then
+                                  begin
+
+                                    if (taicpu(p).oper[0]^.typ = top_reg) then
+                                      begin
+
+                                        if (
+                                          (taicpu(p).oper[0]^.reg = taicpu(hp1).oper[1]^.reg) or
+                                          (taicpu(hp1).oper[0]^.reg = taicpu(hp1).oper[1]^.reg)
+                                        ) then
+                                        begin
+                                          { If %treg = x or y, then remove the second MOV }
+                                          DebugMsg(SPeepholeOptimization + 'MovMov2Mov 1a',p);
+                                          asml.remove(hp1);
+                                          hp1.free;
+                                          GetNextInstruction_p := GetNextInstruction(p, hp1);
+                                          goto MovCaseBlock_CheckNext;
+                                        end;
+
+                                        { Make sure the optimizer is aware that register x is used for an extra instruction }
+                                        if taicpu(p).oper[0]^.typ = top_reg then
+                                          AllocRegBetween(taicpu(p).oper[0]^.reg, p, hp1, UsedRegs);
+                                      end;
+
+                                    taicpu(hp1).loadOper(0,taicpu(p).oper[0]^);
+                                    DebugMsg(SPeepholeOptimization + 'mov x, %reg; mov %reg, y -> mov x, %reg; mov x, y', p);
+                                    { Don't need to set the Result to True because the change was done to the next command }
+
+                                  end;
+                              end
+                            else
+                              begin
+                                { we've got
+
+                                  mov x, %treg
+                                  mov %treg, y
+
+                                  where %treg is not used afterwards }
+                                case taicpu(p).oper[0]^.typ Of
+                                  top_reg:
+                                    begin
+                                      { change
+                                          mov %reg, %treg
+                                          mov %treg, y
+
+                                          to
+
+                                          mov %reg, y
+                                      }
+                                      if taicpu(hp1).oper[1]^.typ=top_reg then
+                                        AllocRegBetween(taicpu(hp1).oper[1]^.reg,p,hp1,usedregs);
+                                      taicpu(p).loadOper(1,taicpu(hp1).oper[1]^);
+                                      DebugMsg(SPeepholeOptimization + 'MovMov2Mov 2 done',p);
+                                      asml.remove(hp1);
+                                      hp1.free;
+                                      Result := True;
+                                      Continue;
+                                    end;
+                                  top_const:
+                                    begin
+                                      { change
+                                          mov const, %treg
+                                          mov %treg, y
+
+                                          to
+
+                                          mov const, y
+                                      }
+                                      if (taicpu(hp1).oper[1]^.typ=top_reg) or
+                                        ((taicpu(p).oper[0]^.val>=low(longint)) and (taicpu(p).oper[0]^.val<=high(longint))) then
+                                        begin
+                                          if taicpu(hp1).oper[1]^.typ=top_reg then
+                                            AllocRegBetween(taicpu(hp1).oper[1]^.reg,p,hp1,usedregs);
+                                          taicpu(p).loadOper(1,taicpu(hp1).oper[1]^);
+                                          DebugMsg(SPeepholeOptimization + 'MovMov2Mov 5 done',p);
+                                          asml.remove(hp1);
+                                          hp1.free;
+                                          Result := True;
+                                          Continue;
+                                        end;
+                                    end;
+                                  top_ref:
+                                    if (taicpu(hp1).oper[1]^.typ = top_reg) then
+                                      begin
+                                        { change
+                                             mov mem, %treg
+                                             mov %treg, %reg
+
+                                             to
+
+                                             mov mem, %reg
+                                        }
+                                        AllocRegBetween(taicpu(hp1).oper[1]^.reg,p,hp1,usedregs);
+                                        taicpu(p).loadoper(1,taicpu(hp1).oper[1]^);
+                                        DebugMsg(SPeepholeOptimization + 'MovMov2Mov 3 done',p);
+                                        asml.remove(hp1);
+                                        hp1.free;
+                                        Result:=true;
+                                        Continue;
+                                      end;
+                                  else
+                                    InternalError(2019071001);
+                                end;
+                              end;
+                          end;
                       end;
 
-                    Result := True;
-                    Exit;
+                    if (taicpu(hp1).oper[0]^.typ = taicpu(p).oper[1]^.typ) and
+                     (taicpu(hp1).oper[1]^.typ = taicpu(p).oper[0]^.typ) then
+                        {  mov reg1, mem1     or     mov mem1, reg1
+                           mov mem2, reg2            mov reg2, mem2}
+                      begin
+                        if OpsEqual(taicpu(hp1).oper[1]^,taicpu(p).oper[0]^) then
+                          { mov reg1, mem1     or     mov mem1, reg1
+                            mov mem2, reg1            mov reg2, mem1}
+                          begin
+                            if OpsEqual(taicpu(hp1).oper[0]^,taicpu(p).oper[1]^) then
+                              { Removes the second statement from
+                                mov reg1, mem1/reg2
+                                mov mem1/reg2, reg1 }
+                              begin
+                                if taicpu(p).oper[0]^.typ=top_reg then
+                                  AllocRegBetween(taicpu(p).oper[0]^.reg,p,hp1,usedregs);
+                                DebugMsg(SPeepholeOptimization + 'MovMov2Mov 1',p);
+                                asml.remove(hp1);
+                                hp1.free;
+                                Result:=true;
+                                Continue;
+                              end
+                            else
+                              begin
+                                if (taicpu(p).oper[1]^.typ = top_ref) and
+                                  { mov reg1, mem1
+                                    mov mem2, reg1 }
+                                   (taicpu(hp1).oper[0]^.ref^.refaddr = addr_no) and
+                                   GetNextInstruction(hp1, hp2) and
+                                   MatchInstruction(hp2,A_CMP,[taicpu(p).opsize]) and
+                                   OpsEqual(taicpu(p).oper[1]^,taicpu(hp2).oper[0]^) and
+                                   OpsEqual(taicpu(p).oper[0]^,taicpu(hp2).oper[1]^) and
+                                   not (
+                                     TransferUsedRegs(TmpUsedRegs) and
+                                     UpdateUsedRegs(TmpUsedRegs, tai(hp1.next)) and
+                                     RegUsedAfterInstruction(taicpu(p).oper[0]^.reg, hp2, TmpUsedRegs)
+                                   ) then
+                                   { change                   to
+                                     mov reg1, mem1           mov reg1, mem1
+                                     mov mem2, reg1           cmp reg1, mem2
+                                     cmp mem1, reg1
+                                   }
+                                  begin
+                                    asml.remove(hp2);
+                                    hp2.free;
+                                    taicpu(hp1).opcode := A_CMP;
+                                    taicpu(hp1).loadref(1,taicpu(hp1).oper[0]^.ref^);
+                                    taicpu(hp1).loadreg(0,taicpu(p).oper[0]^.reg);
+                                    AllocRegBetween(taicpu(p).oper[0]^.reg,p,hp1,UsedRegs);
+                                    DebugMsg(SPeepholeOptimization + 'MovMovCmp2MovCmp done',hp1);
+                                  end;
+                              end;
+                          end
+                        else if (taicpu(p).oper[1]^.typ=top_ref) and
+                          OpsEqual(taicpu(hp1).oper[0]^,taicpu(p).oper[1]^) then
+                          begin
+                            AllocRegBetween(taicpu(p).oper[0]^.reg,p,hp1,UsedRegs);
+                            taicpu(hp1).loadreg(0,taicpu(p).oper[0]^.reg);
+                            DebugMsg(SPeepholeOptimization + 'MovMov2MovMov1 done',p);
+                          end
+                        else
+                          begin
+                            if MatchOpType(taicpu(p),top_ref,top_reg) and
+                              MatchOperand(taicpu(p).oper[1]^,taicpu(hp1).oper[0]^) and
+                              (taicpu(hp1).oper[1]^.typ = top_ref) and
+                              GetNextInstruction(hp1, hp2) and
+                              MatchInstruction(hp2,A_MOV,[taicpu(p).opsize]) and
+                              MatchOpType(taicpu(hp2),top_ref,top_reg) and
+                              RefsEqual(taicpu(hp2).oper[0]^.ref^, taicpu(hp1).oper[1]^.ref^)  then
+                              if not RegInRef(taicpu(hp2).oper[1]^.reg,taicpu(hp2).oper[0]^.ref^) and
+                                 not (
+                                   TransferUsedRegs(TmpUsedRegs) and
+                                   RegUsedAfterInstruction(taicpu(p).oper[1]^.reg,hp1,tmpUsedRegs)
+                                 ) then
+                                {   mov mem1, %reg1
+                                    mov %reg1, mem2
+                                    mov mem2, reg2
+                                 to:
+                                    mov mem1, reg2
+                                    mov reg2, mem2}
+                                begin
+                                  AllocRegBetween(taicpu(hp2).oper[1]^.reg,p,hp2,usedregs);
+                                  DebugMsg(SPeepholeOptimization + 'MovMovMov2MovMov 1 done',p);
+                                  taicpu(p).loadoper(1,taicpu(hp2).oper[1]^);
+                                  taicpu(hp1).loadoper(0,taicpu(hp2).oper[1]^);
+                                  asml.remove(hp2);
+                                  hp2.free;
+                                end
+{$ifdef i386}
+                              { this is enabled for i386 only, as the rules to create the reg sets below
+                                are too complicated for x86-64, so this makes this code too error prone
+                                on x86-64
+                              }
+                              else if (taicpu(p).oper[1]^.reg <> taicpu(hp2).oper[1]^.reg) and
+                                not(RegInRef(taicpu(p).oper[1]^.reg,taicpu(p).oper[0]^.ref^)) and
+                                not(RegInRef(taicpu(hp2).oper[1]^.reg,taicpu(hp2).oper[0]^.ref^)) then
+                                {   mov mem1, reg1         mov mem1, reg1
+                                    mov reg1, mem2         mov reg1, mem2
+                                    mov mem2, reg2         mov mem2, reg1
+                                 to:                    to:
+                                    mov mem1, reg1         mov mem1, reg1
+                                    mov mem1, reg2         mov reg1, mem2
+                                    mov reg1, mem2
 
+                                 or (if mem1 depends on reg1
+                                     and/or if mem2 depends on reg2)
+                                 to:
+                                     mov mem1, reg1
+                                     mov reg1, mem2
+                                     mov reg1, reg2
+                                }
+                                begin
+                                  taicpu(hp1).loadRef(0,taicpu(p).oper[0]^.ref^);
+                                  taicpu(hp1).loadReg(1,taicpu(hp2).oper[1]^.reg);
+                                  taicpu(hp2).loadRef(1,taicpu(hp2).oper[0]^.ref^);
+                                  taicpu(hp2).loadReg(0,taicpu(p).oper[1]^.reg);
+                                  AllocRegBetween(taicpu(p).oper[1]^.reg,p,hp2,usedregs);
+                                  if (taicpu(p).oper[0]^.ref^.base <> NR_NO) and
+                                     (getsupreg(taicpu(p).oper[0]^.ref^.base) in [RS_EAX,RS_EBX,RS_ECX,RS_EDX,RS_ESI,RS_EDI]) then
+                                    AllocRegBetween(taicpu(p).oper[0]^.ref^.base,p,hp2,usedregs);
+                                  if (taicpu(p).oper[0]^.ref^.index <> NR_NO) and
+                                     (getsupreg(taicpu(p).oper[0]^.ref^.index) in [RS_EAX,RS_EBX,RS_ECX,RS_EDX,RS_ESI,RS_EDI]) then
+                                    AllocRegBetween(taicpu(p).oper[0]^.ref^.index,p,hp2,usedregs);
+                                end
+                              else if (taicpu(hp1).Oper[0]^.reg <> taicpu(hp2).Oper[1]^.reg) then
+                                begin
+                                  taicpu(hp2).loadReg(0,taicpu(hp1).Oper[0]^.reg);
+                                  AllocRegBetween(taicpu(p).oper[1]^.reg,p,hp2,usedregs);
+                                end
+                              else
+                                begin
+                                  asml.remove(hp2);
+                                  hp2.free;
+                                end
+{$endif i386}
+                                ;
+                          end;
+                      end;
                   end;
-              end;
-          end
-        else if GetNextInstruction_p and
-          MatchInstruction(hp1,A_MOV,[]) and
-          (taicpu(p).oper[1]^.typ = top_reg) and
-          MatchOperand(taicpu(p).oper[1]^,taicpu(hp1).oper[0]^) then
-          begin
-            TransferUsedRegs(TmpUsedRegs);
-            UpdateUsedRegs(TmpUsedRegs, tai(p.Next));
-            { we have
-                mov x, %treg
-                mov %treg, y
-            }
-            if not(RegInOp(taicpu(p).oper[1]^.reg,taicpu(hp1).oper[1]^)) and
-               not(RegUsedAfterInstruction(taicpu(p).oper[1]^.reg, hp1, TmpUsedRegs)) then
-              { we've got
+    (*          { movl [mem1],reg1
+                  movl [mem1],reg2
 
-                mov x, %treg
-                mov %treg, y
+                  to
 
-                with %treg is not used after }
-              case taicpu(p).oper[0]^.typ Of
-                top_reg:
+                  movl [mem1],reg1
+                  movl reg1,reg2
+                 }
+                if (taicpu(p).oper[0]^.typ = top_ref) and
+                  (taicpu(p).oper[1]^.typ = top_reg) and
+                  (taicpu(hp1).oper[0]^.typ = top_ref) and
+                  (taicpu(hp1).oper[1]^.typ = top_reg) and
+                  (taicpu(p).opsize = taicpu(hp1).opsize) and
+                  RefsEqual(TReference(taicpu(p).oper[0]^^),taicpu(hp1).oper[0]^^.ref^) and
+                  (taicpu(p).oper[1]^.reg<>taicpu(hp1).oper[0]^^.ref^.base) and
+                  (taicpu(p).oper[1]^.reg<>taicpu(hp1).oper[0]^^.ref^.index) then
+                  taicpu(hp1).loadReg(0,taicpu(p).oper[1]^.reg)
+                *)
+
+                {   movl const1,[mem1]
+                    movl [mem1],reg1
+
+                    to
+
+                    movl const1,reg1
+                    movl reg1,[mem1]
+                }
+                if MatchOpType(Taicpu(p),top_const,top_ref) and
+                     MatchOpType(Taicpu(hp1),top_ref,top_reg) and
+                     (taicpu(p).opsize = taicpu(hp1).opsize) and
+                     RefsEqual(taicpu(hp1).oper[0]^.ref^,taicpu(p).oper[1]^.ref^) and
+                     not(RegInRef(taicpu(hp1).oper[1]^.reg,taicpu(hp1).oper[0]^.ref^)) then
                   begin
-                    { change
-                        mov %reg, %treg
-                        mov %treg, y
+                    AllocRegBetween(taicpu(hp1).oper[1]^.reg,p,hp1,usedregs);
+                    taicpu(hp1).loadReg(0,taicpu(hp1).oper[1]^.reg);
+                    taicpu(hp1).loadRef(1,taicpu(p).oper[1]^.ref^);
+                    taicpu(p).loadReg(1,taicpu(hp1).oper[0]^.reg);
+                    taicpu(hp1).fileinfo := taicpu(p).fileinfo;
+                    DebugMsg(SPeepholeOptimization + 'MovMov2MovMov 1',p);
+                  end
+                {
+                  mov*  x,reg1
+                  mov*  y,reg1
 
-                        to
+                  to
 
-                        mov %reg, y
-                    }
-                    if taicpu(hp1).oper[1]^.typ=top_reg then
-                      AllocRegBetween(taicpu(hp1).oper[1]^.reg,p,hp1,usedregs);
-                    taicpu(p).loadOper(1,taicpu(hp1).oper[1]^);
-                    DebugMsg(SPeepholeOptimization + 'MovMov2Mov 2 done',p);
-                    asml.remove(hp1);
-                    hp1.free;
+                  mov*  y,reg1
+                }
+                else if (taicpu(p).oper[1]^.typ=top_reg) and
+                  MatchOperand(taicpu(p).oper[1]^,taicpu(hp1).oper[1]^) and
+                  not(RegInOp(taicpu(p).oper[1]^.reg,taicpu(hp1).oper[0]^)) then
+                  begin
+                    DebugMsg(SPeepholeOptimization + 'MovMov2Mov 4 done',p);
+                    { take care of the register (de)allocs following p }
+                    UpdateUsedRegs(tai(p.next));
+                    asml.remove(p);
+                    p.free;
+                    p:=hp1;
                     Result:=true;
-                    Exit;
+                    Continue;
+                  end
+                else if MOVRefOptimize then
+                  begin
+                    Result := True;
+                    if MatchInstruction(hp1, A_MOV) then
+                      Continue
+                    else
+                      Exit;
                   end;
-                top_const:
-                  begin
-                    { change
-                        mov const, %treg
-                        mov %treg, y
+              end;
 
-                        to
+            { Optimisations where next instruction = LEA }
+            A_LEA:
+{$ifdef x86_64}
+              if (taicpu(hp1).opsize in [S_L,S_Q]) then
+{$else x86_64}
+              if (taicpu(hp1).opsize = S_L) then
+{$endif x86_64}
+                begin
+                  { Optimise the LEA into something more manageable if possible,
+                    but this requires temporarily advancing the used register tracker }
+                  TransferUsedRegs(TmpUsedRegs);
+                  UpdateUsedRegs(tai(p.next));
 
-                        mov const, y
-                    }
-                    if (taicpu(hp1).oper[1]^.typ=top_reg) or
-                      ((taicpu(p).oper[0]^.val>=low(longint)) and (taicpu(p).oper[0]^.val<=high(longint))) then
-                      begin
-                        if taicpu(hp1).oper[1]^.typ=top_reg then
-                          AllocRegBetween(taicpu(hp1).oper[1]^.reg,p,hp1,usedregs);
-                        taicpu(p).loadOper(1,taicpu(hp1).oper[1]^);
-                        DebugMsg(SPeepholeOptimization + 'MovMov2Mov 5 done',p);
-                        asml.remove(hp1);
-                        hp1.free;
-                        Result:=true;
-                        Exit;
-                      end;
-                  end;
-                top_ref:
-                  if (taicpu(hp1).oper[1]^.typ = top_reg) then
+                  HP_Result := OptPass1LEA(hp1);
+
+                  { Restore proper state }
+                  RestoreUsedRegs(TmpUsedRegs);
+
+                  if HP_Result then
                     begin
-                      { change
-                           mov mem, %treg
-                           mov %treg, %reg
+                      if (hp1 = BlockEnd) or (hp1.typ <> ait_instruction) then
+                        begin
+                          Result := True;
+                          Exit;
+                        end;
 
-                           to
+                      if (taicpu(hp1).opcode <> A_LEA) then
+                        { Go back to the start of the case block if hp1 was changed into something other than LEA }
+                        goto MovCaseBlock;
+                    end;
 
-                           mov mem, %reg"
-                      }
-                      taicpu(p).loadoper(1,taicpu(hp1).oper[1]^);
-                      DebugMsg(SPeepholeOptimization + 'MovMov2Mov 3 done',p);
-                      asml.remove(hp1);
-                      hp1.free;
+                  if MatchOpType(Taicpu(p),top_ref,top_reg) and
+                   ((MatchReference(Taicpu(hp1).oper[0]^.ref^,Taicpu(hp1).oper[1]^.reg,Taicpu(p).oper[1]^.reg) and
+                     (Taicpu(hp1).oper[0]^.ref^.base<>Taicpu(p).oper[1]^.reg)
+                    ) or
+                    (MatchReference(Taicpu(hp1).oper[0]^.ref^,Taicpu(p).oper[1]^.reg,Taicpu(hp1).oper[1]^.reg) and
+                     (Taicpu(hp1).oper[0]^.ref^.index<>Taicpu(p).oper[1]^.reg)
+                    )
+                    { reg1 may not be used afterwards }
+                  ) and not(RegUsedAfterInstruction(taicpu(p).oper[1]^.reg, hp1, TmpUsedRegs))
+                  then
+                    { mov reg1,ref
+                      lea reg2,[reg1,reg2]
+
+                      to
+
+                      add reg2,ref}
+                    begin
+                      Taicpu(hp1).opcode:=A_ADD;
+                      Taicpu(hp1).oper[0]^.ref^:=Taicpu(p).oper[0]^.ref^;
+                      DebugMsg(SPeepholeOptimization + 'MovLea2Add done',hp1);
+                      UpdateUsedRegs(tai(p.Next));
+                      asml.remove(p);
+                      p.free;
+                      p:=hp1;
                       Result:=true;
                       Exit;
                     end;
-                else
-                  ;
-              end;
-          end
-        else
+                end;
+
+            { Optimisations where next instruction = TEST or = CMP }
+            A_TEST, A_CMP:
+              { change
+                  mov reg1, mem1
+                  test/cmp x, mem1
+
+                  to
+
+                  mov reg1, mem1
+                  test/cmp x, reg1
+              }
+              if MatchOpType(taicpu(p),top_reg,top_ref) and
+                (taicpu(hp1).opsize = taicpu(p).opsize) and
+                (taicpu(hp1).oper[1]^.typ = top_ref) and
+                RefsEqual(taicpu(p).oper[1]^.ref^, taicpu(hp1).oper[1]^.ref^) then
+                begin
+                  taicpu(hp1).loadreg(1,taicpu(p).oper[0]^.reg);
+                  DebugMsg(SPeepholeOptimization + 'MovTestCmp2MovTestCmp 1',hp1);
+                  AllocRegBetween(taicpu(p).oper[0]^.reg,p,hp1,usedregs);
+                  { Structure of operations hasn't changed, so fall through the
+                    case block to see what else can be done }
+                end;
+
+            { Optimisations where next instruction = BTS or = BTR }
+            A_BTS, A_BTR:
+              if MatchInstruction(hp1,A_BTS,A_BTR,[Taicpu(p).opsize]) and
+                MatchOperand(Taicpu(p).oper[0]^,0) and
+                (Taicpu(p).oper[1]^.typ = top_reg) and
+                MatchOperand(Taicpu(p).oper[1]^,Taicpu(hp1).oper[1]^) and
+                GetNextInstruction(hp1, hp2) and
+                MatchInstruction(hp2,A_OR,[Taicpu(p).opsize]) and
+                MatchOperand(Taicpu(p).oper[1]^,Taicpu(hp2).oper[1]^) then
+                { mov reg1,0
+                  bts reg1,operand1             -->      mov reg1,operand2
+                  or  reg1,operand2                      bts reg1,operand1}
+                begin
+                  Taicpu(hp2).opcode:=A_MOV;
+                  asml.remove(hp1);
+                  insertllitem(hp2,hp2.next,hp1);
+                  asml.remove(p);
+                  p.free;
+                  p:=hp2;
+
+                  { hp2 is a MOV command, so it's safe to continue }
+                  Continue;
+                end;
+
+            { Optimisations where next instruction = MOVZX or = MOVSX or = MOVSXD }
+            A_MOVZX, A_MOVSX {$ifdef x86_64}, A_MOVSXD{$endif x86_64}:
+              if MatchOpType(taicpu(p),top_reg,top_reg) then
+                begin
+                  if MOVRefOptimize then
+                    begin
+                      Result := True;
+                      if MatchInstruction(hp1, A_MOV) then
+                        Continue
+                      else
+                        Exit;
+                    end
+                  else if MatchOpType(taicpu(hp1),top_reg,top_reg) and
+                    (taicpu(hp1).oper[0]^.reg = taicpu(p).oper[1]^.reg) then
+                    { mov reg1, reg2                mov reg1, reg2
+                      movzx/sx reg2, reg3      to   movzx/sx reg1, reg3}
+                    begin
+                      taicpu(hp1).oper[0]^.reg := taicpu(p).oper[0]^.reg;
+                      DebugMsg(SPeepholeOptimization + 'mov %reg1,%reg2; movzx/sx %reg2,%reg3 -> mov %reg1,%reg2; movzx/sx %reg1,%reg3',p);
+
+                      { Only remove the MOV command if reg2 isn't used afterwards,
+                        or if supreg(reg3) = supreg(reg2). [Kit] }
+
+
+                      if (getsupreg(taicpu(p).oper[1]^.reg) = getsupreg(taicpu(hp1).oper[1]^.reg)) or
+                        not (
+                          TransferUsedRegs(TmpUsedRegs) and
+                          UpdateUsedRegs(TmpUsedRegs, tai(p.next)) and
+                          UpdateUsedRegs(TmpUsedRegs, tai(hp1.next)) and
+                          RegUsedAfterInstruction(taicpu(p).oper[1]^.reg, hp1, TmpUsedRegs)
+                        )
+                      then
+                        begin
+                          asml.remove(p);
+                          p.free;
+                          p := hp1;
+                          Result:=true;
+                        end;
+
+                      exit;
+                    end;
+                end;
+
+            { Last of the two-instruction optimisations: }
+            else
+              { leave out the mov from "mov reg, x(%frame_pointer); leave/ret" (with
+                x >= RetOffset) as it doesn't do anything (it writes either to a
+                parameter or to the temporary storage room for the function
+                result)
+              }
+
+              if IsExitCode(hp1) and
+                MatchOpType(taicpu(p),top_reg,top_ref) and
+                (taicpu(p).oper[1]^.ref^.base = current_procinfo.FramePointer) and
+                not(assigned(current_procinfo.procdef.funcretsym) and
+                   (taicpu(p).oper[1]^.ref^.offset < tabstractnormalvarsym(current_procinfo.procdef.funcretsym).localloc.reference.offset)) and
+                (taicpu(p).oper[1]^.ref^.index = NR_NO) then
+                begin
+                  asml.remove(p);
+                  p.free;
+                  p:=hp1;
+                  DebugMsg(SPeepholeOptimization + 'removed deadstore before leave/ret',p);
+                  RemoveLastDeallocForFuncRes(p);
+                  Result:=true;
+                  exit;
+                end;
+
+          end;
+
+          { Miscellaneous optimisations }
+
           { Change
              mov %reg1, %reg2
              xxx %reg2, ???
@@ -1472,9 +2409,10 @@
 
              to avoid a write/read penalty
           }
+
+          { NOTE: Don't put this in the case block above, otherwise it won't be
+            called if hp1.opcode = A_AND. [Kit] }
           if MatchOpType(taicpu(p),top_reg,top_reg) and
-             GetNextInstruction(p,hp1) and
-             (tai(hp1).typ = ait_instruction) and
              (taicpu(hp1).ops >= 1) and
              MatchOperand(taicpu(p).oper[1]^,taicpu(hp1).oper[0]^) then
             { we have
@@ -1496,12 +2434,12 @@
                 begin
                   TransferUsedRegs(TmpUsedRegs);
                   { reg1 will be used after the first instruction,
-                    so update the allocation info                  }
+                    so update the allocation info }
                   AllocRegBetween(taicpu(p).oper[0]^.reg,p,hp1,usedregs);
-                  if GetNextInstruction(hp1, hp2) and
-                     (hp2.typ = ait_instruction) and
-                     taicpu(hp2).is_jmp and
-                     not(RegUsedAfterInstruction(taicpu(hp1).oper[0]^.reg, hp1, TmpUsedRegs)) then
+                  if not(RegUsedAfterInstruction(taicpu(hp1).oper[0]^.reg, hp1, TmpUsedRegs)) and
+                    GetNextInstruction(hp1, hp2) and
+                    (hp2.typ = ait_instruction) and
+                    taicpu(hp2).is_jmp then
                       { change
 
                         mov %reg1, %reg2
@@ -1516,11 +2454,12 @@
                       begin
                         taicpu(hp1).loadoper(0,taicpu(p).oper[0]^);
                         taicpu(hp1).loadoper(1,taicpu(p).oper[0]^);
+                        taicpu(hp1).opcode := A_TEST; { Changing it now saves on some unnecessary processing later }
                         DebugMsg(SPeepholeOptimization + 'MovTestJxx2TestMov done',p);
                         asml.remove(p);
                         p.free;
                         p := hp1;
-                        Exit;
+                        Result := True;
                       end
                     else
                       { change
@@ -1538,460 +2477,403 @@
                         taicpu(hp1).loadoper(0,taicpu(p).oper[0]^);
                         taicpu(hp1).loadoper(1,taicpu(p).oper[0]^);
                         DebugMsg(SPeepholeOptimization + 'MovTestJxx2MovTestJxx done',p);
+                        { Don't need to set Result to true because the MOV itself wasn't changed }
                       end;
-                end
-            end
-        else
-          { leave out the mov from "mov reg, x(%frame_pointer); leave/ret" (with
-            x >= RetOffset) as it doesn't do anything (it writes either to a
-            parameter or to the temporary storage room for the function
-            result)
-          }
-          if GetNextInstruction_p and
-            (tai(hp1).typ = ait_instruction) then
+                  Exit;
+                end;
+            end;
+
+          if (taicpu(p).oper[1]^.typ = top_reg) and GetNextInstruction(hp1, hp2) then
             begin
-              if IsExitCode(hp1) and
-                MatchOpType(taicpu(p),top_reg,top_ref) and
-                (taicpu(p).oper[1]^.ref^.base = current_procinfo.FramePointer) and
-                not(assigned(current_procinfo.procdef.funcretsym) and
-                   (taicpu(p).oper[1]^.ref^.offset < tabstractnormalvarsym(current_procinfo.procdef.funcretsym).localloc.reference.offset)) and
-                (taicpu(p).oper[1]^.ref^.index = NR_NO) then
+              if MatchInstruction(hp2,A_MOV) and
+                (taicpu(hp2).oper[0]^.typ = top_reg) and
+                (SuperRegistersEqual(taicpu(hp2).oper[0]^.reg,taicpu(p).oper[1]^.reg)) and
+                (
+{$ifdef x86_64}
+                  (
+                    { The upper 32 bits of a register are guaranteed to be zeroed when only the lower 32 bits are written }
+                    (taicpu(hp1).opsize = S_Q) and (taicpu(p).opsize >= S_L) and (taicpu(hp2).opsize = taicpu(p).opsize) and
+                    IsFoldableArithOp(taicpu(hp1), newreg(R_INTREGISTER, getsupreg(taicpu(p).oper[1]^.reg), R_SUBQ))
+                  ) or
+{$endif x86_64}
+                  (
+                    { This inequality works because S_NO, S_B, S_W, S_L and S_Q are
+                      in sequential order, and a MOV cannot be of size S_NO. [Kit] }
+                    (taicpu(hp2).opsize <= taicpu(p).opsize) and
+                    (
+                      (
+                        (taicpu(hp1).opsize = S_L) and
+                        IsFoldableArithOp(taicpu(hp1), newreg(R_INTREGISTER, getsupreg(taicpu(p).oper[1]^.reg), R_SUBD))
+                      ) or
+                      (
+                        (taicpu(hp1).opsize = S_W) and
+                        IsFoldableArithOp(taicpu(hp1), newreg(R_INTREGISTER, getsupreg(taicpu(p).oper[1]^.reg), R_SUBW))
+                      ) or
+                      (
+                        (taicpu(hp1).opsize = S_B) and
+                        IsFoldableArithOp(taicpu(hp1), newreg(R_INTREGISTER, getsupreg(taicpu(p).oper[1]^.reg), R_SUBL))
+                      )
+                    )
+                  )
+                ) then
                 begin
-                  asml.remove(p);
-                  p.free;
-                  p:=hp1;
-                  DebugMsg(SPeepholeOptimization + 'removed deadstore before leave/ret',p);
-                  RemoveLastDeallocForFuncRes(p);
-                  exit;
-                end
-              { change
-                  mov reg1, mem1
-                  test/cmp x, mem1
-
-                  to
-
-                  mov reg1, mem1
-                  test/cmp x, reg1
-              }
-              else if MatchOpType(taicpu(p),top_reg,top_ref) and
-                  MatchInstruction(hp1,A_CMP,A_TEST,[taicpu(p).opsize]) and
-                  (taicpu(hp1).oper[1]^.typ = top_ref) and
-                   RefsEqual(taicpu(p).oper[1]^.ref^, taicpu(hp1).oper[1]^.ref^) then
-                  begin
-                    taicpu(hp1).loadreg(1,taicpu(p).oper[0]^.reg);
-                    DebugMsg(SPeepholeOptimization + 'MovTestCmp2MovTestCmp 1',hp1);
-                    AllocRegBetween(taicpu(p).oper[0]^.reg,p,hp1,usedregs);
-                  end;
-            end;
-
-        { Next instruction is also a MOV ? }
-        if GetNextInstruction_p and
-          MatchInstruction(hp1,A_MOV,[taicpu(p).opsize]) then
-          begin
-            if (taicpu(hp1).oper[0]^.typ = taicpu(p).oper[1]^.typ) and
-               (taicpu(hp1).oper[1]^.typ = taicpu(p).oper[0]^.typ) then
-                {  mov reg1, mem1     or     mov mem1, reg1
-                   mov mem2, reg2            mov reg2, mem2}
-              begin
-                if OpsEqual(taicpu(hp1).oper[1]^,taicpu(p).oper[0]^) then
-                  { mov reg1, mem1     or     mov mem1, reg1
-                    mov mem2, reg1            mov reg2, mem1}
-                  begin
-                    if OpsEqual(taicpu(hp1).oper[0]^,taicpu(p).oper[1]^) then
-                      { Removes the second statement from
-                        mov reg1, mem1/reg2
-                        mov mem1/reg2, reg1 }
-                      begin
-                        if taicpu(p).oper[0]^.typ=top_reg then
-                          AllocRegBetween(taicpu(p).oper[0]^.reg,p,hp1,usedregs);
-                        DebugMsg(SPeepholeOptimization + 'MovMov2Mov 1',p);
-                        asml.remove(hp1);
-                        hp1.free;
-                        Result:=true;
-                        exit;
-                      end
-                    else
-                      begin
-                        TransferUsedRegs(TmpUsedRegs);
-                        UpdateUsedRegs(TmpUsedRegs, tai(hp1.next));
-                        if (taicpu(p).oper[1]^.typ = top_ref) and
-                          { mov reg1, mem1
-                            mov mem2, reg1 }
-                           (taicpu(hp1).oper[0]^.ref^.refaddr = addr_no) and
-                           GetNextInstruction(hp1, hp2) and
-                           MatchInstruction(hp2,A_CMP,[taicpu(p).opsize]) and
-                           OpsEqual(taicpu(p).oper[1]^,taicpu(hp2).oper[0]^) and
-                           OpsEqual(taicpu(p).oper[0]^,taicpu(hp2).oper[1]^) and
-                           not(RegUsedAfterInstruction(taicpu(p).oper[0]^.reg, hp2, TmpUsedRegs)) then
-                           { change                   to
-                             mov reg1, mem1           mov reg1, mem1
-                             mov mem2, reg1           cmp reg1, mem2
-                             cmp mem1, reg1
-                           }
-                          begin
-                            asml.remove(hp2);
-                            hp2.free;
-                            taicpu(hp1).opcode := A_CMP;
-                            taicpu(hp1).loadref(1,taicpu(hp1).oper[0]^.ref^);
-                            taicpu(hp1).loadreg(0,taicpu(p).oper[0]^.reg);
-                            AllocRegBetween(taicpu(p).oper[0]^.reg,p,hp1,UsedRegs);
-                            DebugMsg(SPeepholeOptimization + 'MovMovCmp2MovCmp done',hp1);
+                  if OpsEqual(taicpu(hp2).oper[1]^, taicpu(p).oper[0]^) then
+                    { change   movq           reg/ref, reg2
+                               add/sub/or/... reg3/$const, reg2
+                               mov            reg2, reg/ref
+                               dealloc        reg2
+                      to
+                               add/sub/or/... reg3/$const, reg/ref      }
+                    begin
+                      TransferUsedRegs(TmpUsedRegs);
+                      UpdateUsedRegs(TmpUsedRegs, tai(p.next));
+                      UpdateUsedRegs(TmpUsedRegs, tai(hp1.next));
+                      If not(RegUsedAfterInstruction(taicpu(p).oper[1]^.reg,hp2,TmpUsedRegs)) then
+                        begin
+                          { by example:
+                              movq    %rsi,%rax       movq    %rsi,%rax     p
+                              decl    %eax            addl    %edx,%eax     hp1
+                              movw    %ax,%si         movw    %ax,%si       hp2
+                            ->
+                              movq    %rsi,%rax       movq    %rsi,%rax     p
+                              decw    %ax             addw    %dx,%ax       hp1
+                              movw    %ax,%si         movw    %ax,%si       hp2
+                          }
+                          DebugMsg(SPeepholeOptimization + 'MovOpMov2Op ('+
+                                debug_op2str(taicpu(p).opcode)+debug_opsize2str(taicpu(p).opsize)+' '+
+                                debug_op2str(taicpu(hp1).opcode)+debug_opsize2str(taicpu(hp1).opsize)+' '+
+                                debug_op2str(taicpu(hp2).opcode)+debug_opsize2str(taicpu(hp2).opsize)+')',p);
+                          taicpu(hp1).changeopsize(taicpu(hp2).opsize);
+                          {
+                            ->
+                              movq    %rsi,%rax       movq    %rsi,%rax     p
+                              decw    %si             addw    %dx,%si       hp1
+                              movw    %ax,%si         movw    %ax,%si       hp2
+                          }
+                          case taicpu(hp1).ops of
+                            1:
+                              begin
+                                taicpu(hp1).loadoper(0, taicpu(hp2).oper[1]^);
+                                if taicpu(hp1).oper[0]^.typ=top_reg then
+                                  setsubreg(taicpu(hp1).oper[0]^.reg,getsubreg(taicpu(hp2).oper[0]^.reg));
+                              end;
+                            2:
+                              begin
+                                taicpu(hp1).loadoper(1, taicpu(hp2).oper[1]^);
+                                if (taicpu(hp1).oper[0]^.typ=top_reg) and
+                                  (taicpu(hp1).opcode<>A_SHL) and
+                                  (taicpu(hp1).opcode<>A_SHR) and
+                                  (taicpu(hp1).opcode<>A_SAR) then
+                                  setsubreg(taicpu(hp1).oper[0]^.reg,getsubreg(taicpu(hp2).oper[0]^.reg));
+                              end;
+                            else
+                              internalerror(2008042701);
                           end;
-                      end;
-                  end
-                else if (taicpu(p).oper[1]^.typ=top_ref) and
-                  OpsEqual(taicpu(hp1).oper[0]^,taicpu(p).oper[1]^) then
-                  begin
-                    AllocRegBetween(taicpu(p).oper[0]^.reg,p,hp1,UsedRegs);
-                    taicpu(hp1).loadreg(0,taicpu(p).oper[0]^.reg);
-                    DebugMsg(SPeepholeOptimization + 'MovMov2MovMov1 done',p);
-                  end
-                else
-                  begin
-                    TransferUsedRegs(TmpUsedRegs);
-                    if GetNextInstruction(hp1, hp2) and
-                      MatchOpType(taicpu(p),top_ref,top_reg) and
-                      MatchOperand(taicpu(p).oper[1]^,taicpu(hp1).oper[0]^) and
-                      (taicpu(hp1).oper[1]^.typ = top_ref) and
-                      MatchInstruction(hp2,A_MOV,[taicpu(p).opsize]) and
-                      MatchOpType(taicpu(hp2),top_ref,top_reg) and
-                      RefsEqual(taicpu(hp2).oper[0]^.ref^, taicpu(hp1).oper[1]^.ref^)  then
-                      if not RegInRef(taicpu(hp2).oper[1]^.reg,taicpu(hp2).oper[0]^.ref^) and
-                         not(RegUsedAfterInstruction(taicpu(p).oper[1]^.reg,hp1,tmpUsedRegs)) then
-                        {   mov mem1, %reg1
-                            mov %reg1, mem2
-                            mov mem2, reg2
-                         to:
-                            mov mem1, reg2
-                            mov reg2, mem2}
-                        begin
-                          AllocRegBetween(taicpu(hp2).oper[1]^.reg,p,hp2,usedregs);
-                          DebugMsg(SPeepholeOptimization + 'MovMovMov2MovMov 1 done',p);
-                          taicpu(p).loadoper(1,taicpu(hp2).oper[1]^);
-                          taicpu(hp1).loadoper(0,taicpu(hp2).oper[1]^);
+                          {
+                            ->
+                              decw    %si             addw    %dx,%si       p
+                          }
+                          UpdateUsedRegs(tai(p.Next));
+                          asml.remove(p);
                           asml.remove(hp2);
-                          hp2.free;
-                        end
+                          p.Free;
+                          hp2.Free;
+                          p := hp1;
+                          Result:=true;
+                          Exit;
+                        end;
+                    end
+                  else if (taicpu(hp2).oper[1]^.typ = top_reg) and
+                    not(SuperRegistersEqual(taicpu(hp1).oper[0]^.reg,taicpu(hp2).oper[1]^.reg))
 {$ifdef i386}
-                      { this is enabled for i386 only, as the rules to create the reg sets below
-                        are too complicated for x86-64, so this makes this code too error prone
-                        on x86-64
-                      }
-                      else if (taicpu(p).oper[1]^.reg <> taicpu(hp2).oper[1]^.reg) and
-                        not(RegInRef(taicpu(p).oper[1]^.reg,taicpu(p).oper[0]^.ref^)) and
-                        not(RegInRef(taicpu(hp2).oper[1]^.reg,taicpu(hp2).oper[0]^.ref^)) then
-                        {   mov mem1, reg1         mov mem1, reg1
-                            mov reg1, mem2         mov reg1, mem2
-                            mov mem2, reg2         mov mem2, reg1
-                         to:                    to:
-                            mov mem1, reg1         mov mem1, reg1
-                            mov mem1, reg2         mov reg1, mem2
-                            mov reg1, mem2
+                    { byte registers of esi, edi, ebp, esp are not available on i386 }
+                    and (
+                      (taicpu(hp2).opsize<>S_B) or
+                      not (
+                        (getsupreg(taicpu(p).oper[0]^.reg) in [RS_ESI,RS_EDI,RS_EBP,RS_ESP]) or
+                        (getsupreg(taicpu(hp1).oper[0]^.reg) in [RS_ESI,RS_EDI,RS_EBP,RS_ESP])
+                      )
+                    )
+{$endif i386}
+                    then
+                    { change   movq           reg/ref, reg2
+                               add/sub/or/... regX/$const, reg2
+                               mov            reg2, reg3
+                               dealloc        reg2
+                      to
+                               movq           reg/ref, reg3
+                               add/sub/or/... reg3/$const, reg3
+                    }
+                    begin
+                      TransferUsedRegs(TmpUsedRegs);
+                      UpdateUsedRegs(TmpUsedRegs, tai(p.next));
+                      UpdateUsedRegs(TmpUsedRegs, tai(hp1.next));
+                      If not(RegUsedAfterInstruction(taicpu(p).oper[1]^.reg,hp2,TmpUsedRegs)) then
+                        begin
+                          { by example:
+                              movswl  %si,%eax        movswl  %si,%eax      p
+                              decl    %eax            addl    %edx,%eax     hp1
+                              movw    %ax,%si         movw    %ax,%si       hp2
+                            ->
+                              movswl  %si,%eax        movswl  %si,%eax      p
+                              decw    %ax             addw    %dx,%ax       hp1
+                              movw    %ax,%si         movw    %ax,%si       hp2
+                          }
+                          DebugMsg(SPeepholeOptimization + 'MovOpMov2MovOp ('+
+                                debug_op2str(taicpu(p).opcode)+debug_opsize2str(taicpu(p).opsize)+' '+
+                                debug_op2str(taicpu(hp1).opcode)+debug_opsize2str(taicpu(hp1).opsize)+' '+
+                                debug_op2str(taicpu(hp2).opcode)+debug_opsize2str(taicpu(hp2).opsize)+')',p);
+                          taicpu(hp1).changeopsize(taicpu(hp2).opsize);
+                          taicpu(p).changeopsize(taicpu(hp2).opsize);
+                          if taicpu(p).oper[0]^.typ=top_reg then
+                            setsubreg(taicpu(p).oper[0]^.reg,getsubreg(taicpu(hp2).oper[0]^.reg));
 
-                         or (if mem1 depends on reg1
-                      and/or if mem2 depends on reg2)
-                         to:
-                             mov mem1, reg1
-                             mov reg1, mem2
-                             mov reg1, reg2
-                        }
-                        begin
-                          taicpu(hp1).loadRef(0,taicpu(p).oper[0]^.ref^);
-                          taicpu(hp1).loadReg(1,taicpu(hp2).oper[1]^.reg);
-                          taicpu(hp2).loadRef(1,taicpu(hp2).oper[0]^.ref^);
-                          taicpu(hp2).loadReg(0,taicpu(p).oper[1]^.reg);
-                          AllocRegBetween(taicpu(p).oper[1]^.reg,p,hp2,usedregs);
-                          if (taicpu(p).oper[0]^.ref^.base <> NR_NO) and
-                             (getsupreg(taicpu(p).oper[0]^.ref^.base) in [RS_EAX,RS_EBX,RS_ECX,RS_EDX,RS_ESI,RS_EDI]) then
-                            AllocRegBetween(taicpu(p).oper[0]^.ref^.base,p,hp2,usedregs);
-                          if (taicpu(p).oper[0]^.ref^.index <> NR_NO) and
-                             (getsupreg(taicpu(p).oper[0]^.ref^.index) in [RS_EAX,RS_EBX,RS_ECX,RS_EDX,RS_ESI,RS_EDI]) then
-                            AllocRegBetween(taicpu(p).oper[0]^.ref^.index,p,hp2,usedregs);
-                        end
-                      else if (taicpu(hp1).Oper[0]^.reg <> taicpu(hp2).Oper[1]^.reg) then
-                        begin
-                          taicpu(hp2).loadReg(0,taicpu(hp1).Oper[0]^.reg);
-                          AllocRegBetween(taicpu(p).oper[1]^.reg,p,hp2,usedregs);
-                        end
-                      else
-                        begin
+                          taicpu(p).loadoper(1, taicpu(hp2).oper[1]^);
+                          AllocRegBetween(taicpu(p).oper[1]^.reg,p,hp1,usedregs);
+                          {
+                            ->
+                              movswl  %si,%eax        movswl  %si,%eax      p
+                              decw    %si             addw    %dx,%si       hp1
+                              movw    %ax,%si         movw    %ax,%si       hp2
+                          }
+                          case taicpu(hp1).ops of
+                            1:
+                              begin
+                                taicpu(hp1).loadoper(0, taicpu(hp2).oper[1]^);
+                                if taicpu(hp1).oper[0]^.typ=top_reg then
+                                  setsubreg(taicpu(hp1).oper[0]^.reg,getsubreg(taicpu(hp2).oper[0]^.reg));
+                              end;
+                            2:
+                              begin
+                                taicpu(hp1).loadoper(1, taicpu(hp2).oper[1]^);
+                                if (taicpu(hp1).oper[0]^.typ=top_reg) and
+                                  (taicpu(hp1).opcode<>A_SHL) and
+                                  (taicpu(hp1).opcode<>A_SHR) and
+                                  (taicpu(hp1).opcode<>A_SAR) then
+                                  setsubreg(taicpu(hp1).oper[0]^.reg,getsubreg(taicpu(hp2).oper[0]^.reg));
+                              end;
+                            else
+                              internalerror(2018111801);
+                          end;
+                          {
+                            ->
+                              decw    %si             addw    %dx,%si       p
+                          }
                           asml.remove(hp2);
-                          hp2.free;
-                        end
-{$endif i386}
-                        ;
-                  end;
-              end
-(*          { movl [mem1],reg1
-              movl [mem1],reg2
+                          hp2.Free;
+                          Continue;
+                        end;
+                    end;
+{$ifdef x86_64}
+                end
+              else if (taicpu(p).opsize = S_L) and
+                (
+                  MatchInstruction(hp1, A_MOV) and
+                  (taicpu(hp1).opsize = S_L) and
+                  (taicpu(hp1).oper[1]^.typ = top_reg)
+                ) and (
+                  (tai(hp2).typ=ait_instruction) and
+                  (taicpu(hp2).opsize = S_Q) and
+                  (
+                    (
+                      MatchInstruction(hp2, A_ADD) and
+                      (taicpu(hp2).opsize = S_Q) and
+                      (taicpu(hp2).oper[0]^.typ = top_reg) and (taicpu(hp2).oper[1]^.typ = top_reg) and
+                      (
+                        (
+                          (getsupreg(taicpu(hp2).oper[0]^.reg) = getsupreg(taicpu(p).oper[1]^.reg)) and
+                          (getsupreg(taicpu(hp2).oper[1]^.reg) = getsupreg(taicpu(hp1).oper[1]^.reg))
+                        ) or (
+                          (getsupreg(taicpu(hp2).oper[0]^.reg) = getsupreg(taicpu(hp1).oper[1]^.reg)) and
+                          (getsupreg(taicpu(hp2).oper[1]^.reg) = getsupreg(taicpu(p).oper[1]^.reg))
+                        )
+                      )
+                    ) or (
+                      MatchInstruction(hp2, A_LEA) and
+                      (taicpu(hp2).oper[0]^.ref^.offset = 0) and
+                      (taicpu(hp2).oper[0]^.ref^.scalefactor <= 1) and
+                      (
+                        (
+                          (getsupreg(taicpu(hp2).oper[0]^.ref^.base) = getsupreg(taicpu(p).oper[1]^.reg)) and
+                          (getsupreg(taicpu(hp2).oper[0]^.ref^.index) = getsupreg(taicpu(hp1).oper[1]^.reg))
+                        ) or (
+                          (getsupreg(taicpu(hp2).oper[0]^.ref^.base) = getsupreg(taicpu(hp1).oper[1]^.reg)) and
+                          (getsupreg(taicpu(hp2).oper[0]^.ref^.index) = getsupreg(taicpu(p).oper[1]^.reg))
+                        )
+                      ) and (
+                        (
+                          (getsupreg(taicpu(hp2).oper[1]^.reg) = getsupreg(taicpu(hp1).oper[1]^.reg))
+                        ) or (
+                          (getsupreg(taicpu(hp2).oper[1]^.reg) = getsupreg(taicpu(p).oper[1]^.reg))
+                        )
+                      )
+                    )
+                  )
+                ) and (
+                  GetNextInstruction(hp2, hp3) and
+                  MatchInstruction(hp3, A_SHR) and
+                  (taicpu(hp3).opsize = S_Q) and
+                  (taicpu(hp3).oper[0]^.typ = top_const) and (taicpu(hp3).oper[1]^.typ = top_reg) and
+                  (taicpu(hp3).oper[0]^.val = 1) and
+                  (taicpu(hp3).oper[1]^.reg = taicpu(hp2).oper[1]^.reg)
+                ) then
+                begin
+                  { Change   movl    x,    reg1d         movl    x,    reg1d
+                             movl    y,    reg2d         movl    y,    reg2d
+                             addq    reg2q,reg1q   or    leaq    (reg1q,reg2q),reg1q
+                             shrq    $1,   reg1q         shrq    $1,   reg1q
 
-              to
+                  ( reg1d and reg2d can be switched around in the first two instructions )
 
-              movl [mem1],reg1
-              movl reg1,reg2
-             }
-             else if (taicpu(p).oper[0]^.typ = top_ref) and
-                (taicpu(p).oper[1]^.typ = top_reg) and
-                (taicpu(hp1).oper[0]^.typ = top_ref) and
-                (taicpu(hp1).oper[1]^.typ = top_reg) and
-                (taicpu(p).opsize = taicpu(hp1).opsize) and
-                RefsEqual(TReference(taicpu(p).oper[0]^^),taicpu(hp1).oper[0]^^.ref^) and
-                (taicpu(p).oper[1]^.reg<>taicpu(hp1).oper[0]^^.ref^.base) and
-                (taicpu(p).oper[1]^.reg<>taicpu(hp1).oper[0]^^.ref^.index) then
-                taicpu(hp1).loadReg(0,taicpu(p).oper[1]^.reg)
-              else*)
+                    To       movl    x,    reg1d
+                             addl    y,    reg1d
+                             rcrl    $1,   reg1d
 
-            {   movl const1,[mem1]
-                movl [mem1],reg1
+                    This corresponds to the common expression (x + y) shr 1, where
+                    x and y are Cardinals (replacing "shr 1" with "div 2" produces
+                    smaller code, but won't account for x + y causing an overflow). [Kit]
+                  }
 
-                to
+                  if (getsupreg(taicpu(hp2).oper[1]^.reg) = getsupreg(taicpu(hp1).oper[1]^.reg)) then
+                    { Change first MOV command to have the same register as the final output }
+                    taicpu(p).oper[1]^.reg := taicpu(hp1).oper[1]^.reg
+                  else
+                    taicpu(hp1).oper[1]^.reg := taicpu(p).oper[1]^.reg;
 
-                movl const1,reg1
-                movl reg1,[mem1]
-            }
-            else if MatchOpType(Taicpu(p),top_const,top_ref) and
-                 MatchOpType(Taicpu(hp1),top_ref,top_reg) and
-                 (taicpu(p).opsize = taicpu(hp1).opsize) and
-                 RefsEqual(taicpu(hp1).oper[0]^.ref^,taicpu(p).oper[1]^.ref^) and
-                 not(RegInRef(taicpu(hp1).oper[1]^.reg,taicpu(hp1).oper[0]^.ref^)) then
-              begin
-                AllocRegBetween(taicpu(hp1).oper[1]^.reg,p,hp1,usedregs);
-                taicpu(hp1).loadReg(0,taicpu(hp1).oper[1]^.reg);
-                taicpu(hp1).loadRef(1,taicpu(p).oper[1]^.ref^);
-                taicpu(p).loadReg(1,taicpu(hp1).oper[0]^.reg);
-                taicpu(hp1).fileinfo := taicpu(p).fileinfo;
-                DebugMsg(SPeepholeOptimization + 'MovMov2MovMov 1',p);
-              end
-            {
-              mov*  x,reg1
-              mov*  y,reg1
+                  { Change second MOV command to an ADD command. This is easier than
+                    converting the existing command because it means we don't have to
+                    touch 'y', which might be a complicated reference, and because
+                    the third command might be either ADD or LEA. [Kit] }
+                  taicpu(hp1).opcode := A_ADD;
 
-              to
+                  { Delete old ADD/LEA instruction }
+                  asml.remove(hp2);
+                  hp2.free;
 
-              mov*  y,reg1
-            }
-            else if (taicpu(p).oper[1]^.typ=top_reg) and
-              MatchOperand(taicpu(p).oper[1]^,taicpu(hp1).oper[1]^) and
-              not(RegInOp(taicpu(p).oper[1]^.reg,taicpu(hp1).oper[0]^)) then
-              begin
-                DebugMsg(SPeepholeOptimization + 'MovMov2Mov 4 done',p);
-                { take care of the register (de)allocs following p }
-                UpdateUsedRegs(tai(p.next));
-                asml.remove(p);
-                p.free;
-                p:=hp1;
-                Result:=true;
-                exit;
-              end;
-          end
+                  { Convert "shrq $1, reg1q" to "rcrl $1, reg1d" }
+                  taicpu(hp3).opcode := A_RCR;
+                  taicpu(hp3).changeopsize(S_L);
+                  setsubreg(taicpu(hp3).oper[1]^.reg, R_SUBD);
+{$endif x86_64}
+                end;
+            end;
 
-        else if (taicpu(p).oper[1]^.typ = top_reg) and
-          GetNextInstruction_p and
-          (hp1.typ = ait_instruction) and
-          GetNextInstruction(hp1, hp2) and
-          MatchInstruction(hp2,A_MOV,[]) and
-          (SuperRegistersEqual(taicpu(hp2).oper[0]^.reg,taicpu(p).oper[1]^.reg)) and
-          (IsFoldableArithOp(taicpu(hp1), taicpu(p).oper[1]^.reg) or
-           ((taicpu(p).opsize=S_L) and (taicpu(hp1).opsize=S_Q) and (taicpu(hp2).opsize=S_L) and
-            IsFoldableArithOp(taicpu(hp1), newreg(R_INTREGISTER,getsupreg(taicpu(p).oper[1]^.reg),R_SUBQ)))
-          ) then
-          begin
-            if OpsEqual(taicpu(hp2).oper[1]^, taicpu(p).oper[0]^) and
-              (taicpu(hp2).oper[0]^.typ=top_reg) then
-              { change   movsX/movzX    reg/ref, reg2
-                         add/sub/or/... reg3/$const, reg2
-                         mov            reg2 reg/ref
-                         dealloc        reg2
-                to
-                         add/sub/or/... reg3/$const, reg/ref      }
-              begin
-                TransferUsedRegs(TmpUsedRegs);
-                UpdateUsedRegs(TmpUsedRegs, tai(p.next));
-                UpdateUsedRegs(TmpUsedRegs, tai(hp1.next));
-                If not(RegUsedAfterInstruction(taicpu(p).oper[1]^.reg,hp2,TmpUsedRegs)) then
-                  begin
-                    { by example:
-                        movswl  %si,%eax        movswl  %si,%eax      p
-                        decl    %eax            addl    %edx,%eax     hp1
-                        movw    %ax,%si         movw    %ax,%si       hp2
-                      ->
-                        movswl  %si,%eax        movswl  %si,%eax      p
-                        decw    %eax            addw    %edx,%eax     hp1
-                        movw    %ax,%si         movw    %ax,%si       hp2
-                    }
-                    DebugMsg(SPeepholeOptimization + 'MovOpMov2Op ('+
-                          debug_op2str(taicpu(p).opcode)+debug_opsize2str(taicpu(p).opsize)+' '+
-                          debug_op2str(taicpu(hp1).opcode)+debug_opsize2str(taicpu(hp1).opsize)+' '+
-                          debug_op2str(taicpu(hp2).opcode)+debug_opsize2str(taicpu(hp2).opsize),p);
-                    taicpu(hp1).changeopsize(taicpu(hp2).opsize);
-                    {
-                      ->
-                        movswl  %si,%eax        movswl  %si,%eax      p
-                        decw    %si             addw    %dx,%si       hp1
-                        movw    %ax,%si         movw    %ax,%si       hp2
-                    }
-                    case taicpu(hp1).ops of
-                      1:
-                        begin
-                          taicpu(hp1).loadoper(0, taicpu(hp2).oper[1]^);
-                          if taicpu(hp1).oper[0]^.typ=top_reg then
-                            setsubreg(taicpu(hp1).oper[0]^.reg,getsubreg(taicpu(hp2).oper[0]^.reg));
-                        end;
-                      2:
-                        begin
-                          taicpu(hp1).loadoper(1, taicpu(hp2).oper[1]^);
-                          if (taicpu(hp1).oper[0]^.typ=top_reg) and
-                            (taicpu(hp1).opcode<>A_SHL) and
-                            (taicpu(hp1).opcode<>A_SHR) and
-                            (taicpu(hp1).opcode<>A_SAR) then
-                            setsubreg(taicpu(hp1).oper[0]^.reg,getsubreg(taicpu(hp2).oper[0]^.reg));
-                        end;
-                      else
-                        internalerror(2008042701);
-                    end;
-                    {
-                      ->
-                        decw    %si             addw    %dx,%si       p
-                    }
-                    asml.remove(p);
-                    asml.remove(hp2);
-                    p.Free;
-                    hp2.Free;
-                    p := hp1;
-                  end;
-              end
-            else if MatchOpType(taicpu(hp2),top_reg,top_reg) and
-              not(SuperRegistersEqual(taicpu(hp1).oper[0]^.reg,taicpu(hp2).oper[1]^.reg)) and
-              ((topsize2memsize[taicpu(hp1).opsize]<= topsize2memsize[taicpu(hp2).opsize]) or
-               { opsize matters for these opcodes, we could probably work around this, but it is not worth the effort }
-               ((taicpu(hp1).opcode<>A_SHL) and (taicpu(hp1).opcode<>A_SHR) and (taicpu(hp1).opcode<>A_SAR))
+          if (taicpu(p).oper[0]^.typ = top_ref) and
+            (
+              (
+                (taicpu(hp1).opcode=A_LEA) and
+                (
+                  (
+                    MatchReference(taicpu(hp1).oper[0]^.ref^,taicpu(p).oper[1]^.reg,NR_INVALID) and
+                    (taicpu(hp1).oper[0]^.ref^.index<>taicpu(p).oper[1]^.reg)
+                  ) or (
+                    MatchReference(taicpu(hp1).oper[0]^.ref^,NR_INVALID, taicpu(p).oper[1]^.reg) and
+                    (taicpu(hp1).oper[0]^.ref^.base<>taicpu(p).oper[1]^.reg)
+                  ) or
+                  MatchReferenceWithOffset(taicpu(hp1).oper[0]^.ref^,taicpu(p).oper[1]^.reg,NR_NO) or
+                  MatchReferenceWithOffset(taicpu(hp1).oper[0]^.ref^,NR_NO,taicpu(p).oper[1]^.reg)
+                ) and
+                { GetNextInstruction is not factored out so it is only called
+                  when all the other independent conditional checks are True
+                  (we also need access to hp2 for MatchOperand) }
+                GetNextInstruction(hp1,hp2) and
+                (
+                  not RegUsedAfterInstruction(taicpu(p).oper[1]^.reg,hp1,UsedRegs) or
+                  (
+                    MatchInstruction(hp2,A_MOV) and
+                    MatchOperand(taicpu(p).oper[1]^,taicpu(hp2).oper[0]^)
+                  )
+                )
+              ) or (
+                IsFoldableArithOp(taicpu(hp1),taicpu(p).oper[1]^.reg) and
+                GetNextInstruction(hp1,hp2)
               )
-{$ifdef i386}
-              { byte registers of esi, edi, ebp, esp are not available on i386 }
-              and ((taicpu(hp2).opsize<>S_B) or not(getsupreg(taicpu(hp1).oper[0]^.reg) in [RS_ESI,RS_EDI,RS_EBP,RS_ESP]))
-              and ((taicpu(hp2).opsize<>S_B) or not(getsupreg(taicpu(p).oper[0]^.reg) in [RS_ESI,RS_EDI,RS_EBP,RS_ESP]))
-{$endif i386}
-              then
-              { change   movsX/movzX    reg/ref, reg2
-                         add/sub/or/... regX/$const, reg2
-                         mov            reg2, reg3
-                         dealloc        reg2
-                to
-                         movsX/movzX    reg/ref, reg3
-                         add/sub/or/... reg3/$const, reg3
-              }
-              begin
-                TransferUsedRegs(TmpUsedRegs);
-                UpdateUsedRegs(TmpUsedRegs, tai(p.next));
-                UpdateUsedRegs(TmpUsedRegs, tai(hp1.next));
-                If not(RegUsedAfterInstruction(taicpu(p).oper[1]^.reg,hp2,TmpUsedRegs)) then
-                  begin
-                    { by example:
-                        movswl  %si,%eax        movswl  %si,%eax      p
-                        decl    %eax            addl    %edx,%eax     hp1
-                        movw    %ax,%si         movw    %ax,%si       hp2
-                      ->
-                        movswl  %si,%eax        movswl  %si,%eax      p
-                        decw    %eax            addw    %edx,%eax     hp1
-                        movw    %ax,%si         movw    %ax,%si       hp2
-                    }
-                    DebugMsg(SPeepholeOptimization + 'MovOpMov2MovOp ('+
-                          debug_op2str(taicpu(p).opcode)+debug_opsize2str(taicpu(p).opsize)+' '+
-                          debug_op2str(taicpu(hp1).opcode)+debug_opsize2str(taicpu(hp1).opsize)+' '+
-                          debug_op2str(taicpu(hp2).opcode)+debug_opsize2str(taicpu(hp2).opsize),p);
-                    { limit size of constants as well to avoid assembler errors, but
-                      check opsize to avoid overflow when left shifting the 1 }
-                    if (taicpu(p).oper[0]^.typ=top_const) and (topsize2memsize[taicpu(hp2).opsize]<=4) then
-                      taicpu(p).oper[0]^.val:=taicpu(p).oper[0]^.val and ((qword(1) shl (topsize2memsize[taicpu(hp2).opsize]*8))-1);
-                    taicpu(hp1).changeopsize(taicpu(hp2).opsize);
-                    taicpu(p).changeopsize(taicpu(hp2).opsize);
-                    if taicpu(p).oper[0]^.typ=top_reg then
-                      setsubreg(taicpu(p).oper[0]^.reg,getsubreg(taicpu(hp2).oper[0]^.reg));
-                    taicpu(p).loadoper(1, taicpu(hp2).oper[1]^);
-                    AllocRegBetween(taicpu(p).oper[1]^.reg,p,hp1,usedregs);
-                    {
-                      ->
-                        movswl  %si,%eax        movswl  %si,%eax      p
-                        decw    %si             addw    %dx,%si       hp1
-                        movw    %ax,%si         movw    %ax,%si       hp2
-                    }
-                    case taicpu(hp1).ops of
-                      1:
-                        begin
-                          taicpu(hp1).loadoper(0, taicpu(hp2).oper[1]^);
-                          if taicpu(hp1).oper[0]^.typ=top_reg then
-                            setsubreg(taicpu(hp1).oper[0]^.reg,getsubreg(taicpu(hp2).oper[0]^.reg));
-                        end;
-                      2:
-                        begin
-                          taicpu(hp1).loadoper(1, taicpu(hp2).oper[1]^);
-                          if (taicpu(hp1).oper[0]^.typ=top_reg) and
-                            (taicpu(hp1).opcode<>A_SHL) and
-                            (taicpu(hp1).opcode<>A_SHR) and
-                            (taicpu(hp1).opcode<>A_SAR) then
-                            setsubreg(taicpu(hp1).oper[0]^.reg,getsubreg(taicpu(hp2).oper[0]^.reg));
-                        end;
-                      else
-                        internalerror(2018111801);
-                    end;
-                    {
-                      ->
-                        decw    %si             addw    %dx,%si       p
-                    }
-                    asml.remove(hp2);
-                    hp2.Free;
+            ) and
+            MatchInstruction(hp2,A_MOV) and
+            (taicpu(hp2).oper[1]^.typ = top_ref) and
+            (
+              MatchOperand(taicpu(hp1).oper[taicpu(hp1).ops-1]^,taicpu(hp2).oper[0]^)
+{$ifdef x86_64}
+              or (
+                (taicpu(hp1).oper[taicpu(hp1).ops-1]^.typ = top_reg) and
+                (taicpu(hp2).oper[0]^.typ = top_reg)
+                { This is not an exact match, but because only 32 bits are read
+                  from the reference, anything written to the upper 32 bits can
+                  be considered discarded.  Inconsistencies will only occur if
+                  a 64-bit variable is mapped onto a 32-bit variable using the
+                  "absolute" keyword, which is generally not recommended. [Kit] }
+                and SuperRegistersEqual(taicpu(hp1).oper[taicpu(hp1).ops-1]^.reg, taicpu(hp2).oper[0]^.reg)
+                and (getsubreg(taicpu(p).oper[1]^.reg) = R_SUBD)
+                and (getsubreg(taicpu(hp1).oper[taicpu(hp1).ops-1]^.reg) = R_SUBD)
+                and (getsubreg(taicpu(hp2).oper[0]^.reg) = R_SUBQ)
+              )
+{$endif x86_64}
+            ) then
+            begin
+              if RefsEqual(taicpu(hp2).oper[1]^.ref^,taicpu(p).oper[0]^.ref^) and
+                not (
+                  TransferUsedRegs(TmpUsedRegs) and
+                  UpdateUsedRegs(TmpUsedRegs,tai(p.next)) and
+                  UpdateUsedRegs(TmpUsedRegs,tai(hp1.next)) and
+                  RegUsedAfterInstruction(taicpu(hp2).oper[0]^.reg,hp2,TmpUsedRegs)
+                ) then
+                { change   mov            (ref), reg
+                           add/sub/or/... reg2/$const, reg
+                           mov            reg, (ref)
+                           # release reg
+                  to       add/sub/or/... reg2/$const, (ref)    }
+                begin
+                  case taicpu(hp1).opcode of
+                    A_INC,A_DEC,A_NOT,A_NEG :
+                      taicpu(hp1).loadRef(0,taicpu(p).oper[0]^.ref^);
+                    A_LEA :
+                      begin
+                        taicpu(hp1).opcode:=A_ADD;
+                        taicpu(hp1).loadRef(1,taicpu(p).oper[0]^.ref^);
+                        if (taicpu(hp1).oper[0]^.ref^.index<>taicpu(p).oper[1]^.reg) and (taicpu(hp1).oper[0]^.ref^.index<>NR_NO) then
+                          taicpu(hp1).loadreg(0,taicpu(hp1).oper[0]^.ref^.index)
+                        else if (taicpu(hp1).oper[0]^.ref^.base<>taicpu(p).oper[1]^.reg) and (taicpu(hp1).oper[0]^.ref^.base<>NR_NO) then
+                          taicpu(hp1).loadreg(0,taicpu(hp1).oper[0]^.ref^.base)
+                        else
+                          begin
+                            { Optimise for size if applicable }
+                            if UseIncDec then
+                              begin
+                                case taicpu(hp1).oper[0]^.ref^.offset of
+                                  1:
+                                    begin
+                                      taicpu(hp1).opcode:=A_INC;
+                                      taicpu(hp1).loadRef(0,taicpu(p).oper[0]^.ref^);
+                                      taicpu(hp1).ops := 1;
+                                    end;
+                                  -1:
+                                    begin
+                                      taicpu(hp1).opcode:=A_DEC;
+                                      taicpu(hp1).loadRef(0,taicpu(p).oper[0]^.ref^);
+                                      taicpu(hp1).ops := 1;
+                                    end;
+                                  else
+                                    taicpu(hp1).loadconst(0,taicpu(hp1).oper[0]^.ref^.offset);
+                                end;
+                              end
+                            else
+                              taicpu(hp1).loadconst(0,taicpu(hp1).oper[0]^.ref^.offset);
+                          end;
+                        DebugMsg(SPeepholeOptimization + 'FoldLea done',hp1);
+                      end;
+                    else
+                      taicpu(hp1).loadRef(1,taicpu(p).oper[0]^.ref^);
                   end;
-              end;
-          end
-        else if GetNextInstruction_p and
-          MatchInstruction(hp1,A_BTS,A_BTR,[Taicpu(p).opsize]) and
-          GetNextInstruction(hp1, hp2) and
-          MatchInstruction(hp2,A_OR,[Taicpu(p).opsize]) and
-          MatchOperand(Taicpu(p).oper[0]^,0) and
-          (Taicpu(p).oper[1]^.typ = top_reg) and
-          MatchOperand(Taicpu(p).oper[1]^,Taicpu(hp1).oper[1]^) and
-          MatchOperand(Taicpu(p).oper[1]^,Taicpu(hp2).oper[1]^) then
-          { mov reg1,0
-            bts reg1,operand1             -->      mov reg1,operand2
-            or  reg1,operand2                      bts reg1,operand1}
-          begin
-            Taicpu(hp2).opcode:=A_MOV;
-            asml.remove(hp1);
-            insertllitem(hp2,hp2.next,hp1);
-            asml.remove(p);
-            p.free;
-            p:=hp1;
-          end
-
-        else if GetNextInstruction_p and
-           MatchInstruction(hp1,A_LEA,[S_L]) and
-           MatchOpType(Taicpu(p),top_ref,top_reg) and
-           ((MatchReference(Taicpu(hp1).oper[0]^.ref^,Taicpu(hp1).oper[1]^.reg,Taicpu(p).oper[1]^.reg) and
-             (Taicpu(hp1).oper[0]^.ref^.base<>Taicpu(p).oper[1]^.reg)
-            ) or
-            (MatchReference(Taicpu(hp1).oper[0]^.ref^,Taicpu(p).oper[1]^.reg,Taicpu(hp1).oper[1]^.reg) and
-             (Taicpu(hp1).oper[0]^.ref^.index<>Taicpu(p).oper[1]^.reg)
-            )
-           ) then
-           { mov reg1,ref
-             lea reg2,[reg1,reg2]
-
-             to
-
-             add reg2,ref}
-          begin
-            TransferUsedRegs(TmpUsedRegs);
-            { reg1 may not be used afterwards }
-            if not(RegUsedAfterInstruction(taicpu(p).oper[1]^.reg, hp1, TmpUsedRegs)) then
-              begin
-                Taicpu(hp1).opcode:=A_ADD;
-                Taicpu(hp1).oper[0]^.ref^:=Taicpu(p).oper[0]^.ref^;
-                DebugMsg(SPeepholeOptimization + 'MovLea2Add done',hp1);
-                asml.remove(p);
-                p.free;
-                p:=hp1;
-              end;
-          end;
+                  asml.remove(p);
+                  asml.remove(hp2);
+                  p.free;
+                  hp2.free;
+                  p := hp1;
+                  Result := True;
+                end;
+            end;
+          Exit;
+        until False;
       end;
 
 
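The (x + y) shr 1 peephole in the patch above relies on the carry flag: a 32-bit ADD leaves the 33rd bit of the sum in CF, and RCR by 1 rotates it back into bit 31. The following is a minimal sketch, not part of the patch, that models this in plain Pascal and compares it with the widened 64-bit computation. The names AddThenRcr, WideAverage and Samples are hypothetical and exist only for illustration.

```pascal
program AvgRcrSketch;

{$mode objfpc}

{ Hypothetical helper, not part of the patch: models
  "addl y, reg1d" followed by "rcrl $1, reg1d" in plain Pascal. }
function AddThenRcr(x, y: Cardinal): Cardinal;
var
  Sum, Carry: Cardinal;
begin
  { 32-bit ADD: keep only the low 32 bits of the sum }
  Sum := Cardinal((QWord(x) + QWord(y)) and $FFFFFFFF);
  { Carry flag: set when the 32-bit addition wrapped around }
  if Sum < x then
    Carry := 1
  else
    Carry := 0;
  { RCR by 1: shift right and rotate the carry into bit 31 }
  Result := (Sum shr 1) or (Carry shl 31);
end;

{ Reference: (x + y) shr 1 evaluated in 64 bits, so the
  intermediate sum cannot overflow }
function WideAverage(x, y: Cardinal): Cardinal;
begin
  Result := Cardinal(((QWord(x) + QWord(y)) shr 1) and $FFFFFFFF);
end;

const
  Samples: array[0..4] of Cardinal = (0, 1, $7FFFFFFF, $80000000, $FFFFFFFF);
var
  i, j: Integer;
  Mismatches: Integer;
begin
  Mismatches := 0;
  for i := Low(Samples) to High(Samples) do
    for j := Low(Samples) to High(Samples) do
      if AddThenRcr(Samples[i], Samples[j]) <> WideAverage(Samples[i], Samples[j]) then
        Inc(Mismatches);
  { Any non-zero count would indicate the two sequences disagree }
  WriteLn('mismatches: ', Mismatches);
end.
```

The sketch only checks a handful of boundary values; it illustrates why the transformation is plausible and is not a substitute for the compiler test suite.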
overhaul-singlepass.patch (157,392 bytes)
Index: compiler/aoptobj.pas
===================================================================
--- compiler/aoptobj.pas	(revision 42345)
+++ compiler/aoptobj.pas	(working copy)
@@ -1429,96 +1780,150 @@
        to avoid endless loops with constructs such as "l5: ; jmp l5"           }
 
       var p1: tai;
+          p2: tai;
           {$if not defined(MIPS) and not defined(riscv64) and not defined(riscv32) and not defined(JVM)}
-          p2: tai;
-          l: tasmlabel;
+          p3: tai;
           {$endif}
+          ThisLabel, l: tasmlabel;
 
       begin
-        GetfinalDestination := false;
+        GetFinalDestination := false;
         if level > 20 then
           exit;
-        p1 := getlabelwithsym(tasmlabel(JumpTargetOp(hp)^.ref^.symbol));
+
+        ThisLabel := TAsmLabel(JumpTargetOp(hp)^.ref^.symbol);
+        p1 := getlabelwithsym(ThisLabel);
         if assigned(p1) then
           begin
             SkipLabels(p1,p1);
-            if (tai(p1).typ = ait_instruction) and
+            if (p1.typ = ait_instruction) and
                (taicpu(p1).is_jmp) then
-              if { the next instruction after the label where the jump hp arrives}
-                 { is unconditional or of the same type as hp, so continue       }
-                 IsJumpToLabelUncond(taicpu(p1))
+              begin
+                p2 := tai(p1.Next);
+
+                { Collapse any zero distance jumps we stumble across }
+                while (p1<>blockstart) and CollapseZeroDistJump(p1, p2, TAsmLabel(JumpTargetOp(taicpu(p1))^.ref^.symbol)) do
+                  begin
+                    { TODO: FIXME removing the first instruction fails}
+                    if (p1.typ = ait_label) then
+                      SkipLabels(p1, p1);
+
+                    if not Assigned(p1) then
+                      { No more valid commands }
+                      Exit;
+
+                    { Check to see that we are actually still at a jump }
+                    if not ((tai(p1).typ = ait_instruction) and (taicpu(p1).is_jmp)) then
+                      begin
+                        { Required so that recursion works properly, while still
+                          returning False if a jump isn't modified. [Kit] }
+                        if level > 0 then GetFinalDestination := True;
+                        Exit;
+                      end;
+
+                    p2 := tai(p1.Next);
+                    if p2 = BlockEnd then
+                      Exit;
+                  end;
+
 {$if not defined(MIPS) and not defined(riscv64) and not defined(riscv32) and not defined(JVM)}
-{ for MIPS, it isn't enough to check the condition; first operands must be same, too. }
-                 or
-                 conditions_equal(taicpu(p1).condition,hp.condition) or
+                p3 := p2;
+{$endif not MIPS and not RV64 and not RV32 and not JVM}
 
-                 { the next instruction after the label where the jump hp arrives
-                   is the opposite of hp (so this one is never taken), but after
-                   that one there is a branch that will be taken, so perform a
-                   little hack: set p1 equal to this instruction (that's what the
-                   last SkipLabels is for, only works with short bool evaluation)}
-                 (conditions_equal(taicpu(p1).condition,inverse_cond(hp.condition)) and
-                  SkipLabels(p1,p2) and
-                  (p2.typ = ait_instruction) and
-                  (taicpu(p2).is_jmp) and
-                   (IsJumpToLabelUncond(taicpu(p2)) or
-                   (conditions_equal(taicpu(p2).condition,hp.condition))) and
-                  SkipLabels(p1,p1))
+                if { the next instruction after the label where the jump hp arrives}
+                   { is unconditional or of the same type as hp, so continue       }
+                   IsJumpToLabelUncond(taicpu(p1))
+{$if not defined(MIPS) and not defined(riscv64) and not defined(riscv32) and not defined(JVM)}
+  { for MIPS, it isn't enough to check the condition; first operands must be same, too. }
+                   or
+                   conditions_equal(taicpu(p1).condition,hp.condition) or
+
+                   { the next instruction after the label where the jump hp arrives
+                     is the opposite of hp (so this one is never taken), but after
+                     that one there is a branch that will be taken, so perform a
+                     little hack: set p1 equal to this instruction }
+                   (conditions_equal(taicpu(p1).condition,inverse_cond(hp.condition)) and
+                     SkipLabels(p3,p2) and
+                     (p2.typ = ait_instruction) and
+                     (taicpu(p2).is_jmp) and
+                       (IsJumpToLabelUncond(taicpu(p2)) or
+                       (conditions_equal(taicpu(p2).condition,hp.condition))
+                     ) and
+                     SetAndTest(p2,p1)
+                   )
 {$endif not MIPS and not RV64 and not RV32 and not JVM}
-                 then
-                begin
-                  { quick check for loops of the form "l5: ; jmp l5 }
-                  if (tasmlabel(JumpTargetOp(taicpu(p1))^.ref^.symbol).labelnr =
-                       tasmlabel(JumpTargetOp(hp)^.ref^.symbol).labelnr) then
-                    exit;
-                  if not GetFinalDestination(taicpu(p1),succ(level)) then
-                    exit;
+                   then
+                  begin
+                    { quick check for loops of the form "l5: ; jmp l5" }
+                    if (TAsmLabel(JumpTargetOp(taicpu(p1))^.ref^.symbol).labelnr = ThisLabel.labelnr) then
+                      exit;
+                    if not GetFinalDestination(taicpu(p1),succ(level)) then
+                      exit;
+
+                    { NOTE: Do not move this before the "l5: ; jmp l5" check,
+                      because GetFinalDestination may change the destination
+                      label of p1. [Kit] }
+
+                    l := tasmlabel(JumpTargetOp(taicpu(p1))^.ref^.symbol);
+
 {$if defined(aarch64)}
-                  { can't have conditional branches to
-                    global labels on AArch64, because the
-                    offset may become too big }
-                  if not(taicpu(hp).condition in [C_None,C_AL,C_NV]) and
-                     (tasmlabel(JumpTargetOp(taicpu(p1))^.ref^.symbol).bind<>AB_LOCAL) then
-                    exit;
+                    { can't have conditional branches to
+                      global labels on AArch64, because the
+                      offset may become too big }
+                    if not(taicpu(hp).condition in [C_None,C_AL,C_NV]) and
+                       (l.bind<>AB_LOCAL) then
+                      exit;
 {$endif aarch64}
-                  tasmlabel(JumpTargetOp(hp)^.ref^.symbol).decrefs;
-                  JumpTargetOp(hp)^.ref^.symbol:=JumpTargetOp(taicpu(p1))^.ref^.symbol;
-                  tasmlabel(JumpTargetOp(hp)^.ref^.symbol).increfs;
-                end
+                    ThisLabel.decrefs;
+                    JumpTargetOp(hp)^.ref^.symbol:=l;
+                    l.increfs;
+                    GetFinalDestination := True;
+                    Exit;
+                  end
 {$if not defined(MIPS) and not defined(riscv64) and not defined(riscv32) and not defined(JVM)}
-              else
-                if conditions_equal(taicpu(p1).condition,inverse_cond(hp.condition)) then
-                  if not FindAnyLabel(p1,l) then
+                else
+                  if conditions_equal(taicpu(p1).condition,inverse_cond(hp.condition)) then
                     begin
-      {$ifdef finaldestdebug}
-                      insertllitem(asml,p1,p1.next,tai_comment.Create(
-                        strpnew('previous label inserted'))));
-      {$endif finaldestdebug}
-                      current_asmdata.getjumplabel(l);
-                      insertllitem(p1,p1.next,tai_label.Create(l));
-                      tasmlabel(JumpTargetOp(hp)^.ref^.symbol).decrefs;
-                      JumpTargetOp(hp)^.ref^.symbol := l;
-                      l.increfs;
-      {               this won't work, since the new label isn't in the labeltable }
-      {               so it will fail the rangecheck. Labeltable should become a   }
-      {               hashtable to support this:                                   }
-      {               GetFinalDestination(asml, hp);                               }
-                    end
-                  else
-                    begin
-      {$ifdef finaldestdebug}
-                      insertllitem(asml,p1,p1.next,tai_comment.Create(
-                        strpnew('next label reused'))));
-      {$endif finaldestdebug}
-                      l.increfs;
-                      tasmlabel(JumpTargetOp(hp)^.ref^.symbol).decrefs;
-                      JumpTargetOp(hp)^.ref^.symbol := l;
-                      if not GetFinalDestination(hp,succ(level)) then
-                        exit;
+                      if not FindAnyLabel(p1,l) then
+                        begin
+{$ifdef finaldestdebug}
+                          insertllitem(asml,p1,p1.next,tai_comment.Create(
+                            strpnew('previous label inserted'))));
+{$endif finaldestdebug}
+                          current_asmdata.getjumplabel(l);
+                          insertllitem(p1,p1.next,tai_label.Create(l));
+
+                          ThisLabel.decrefs;
+                          JumpTargetOp(hp)^.ref^.symbol := l;
+                          l.increfs;
+                          GetFinalDestination := True;
+          {               this won't work, since the new label isn't in the labeltable }
+          {               so it will fail the rangecheck. Labeltable should become a   }
+          {               hashtable to support this:                                   }
+          {               GetFinalDestination(asml, hp);                               }
+                        end
+                      else
+                        begin
+{$ifdef finaldestdebug}
+                          insertllitem(asml,p1,p1.next,tai_comment.Create(
+                            strpnew('next label reused'))));
+{$endif finaldestdebug}
+                          l.increfs;
+                          ThisLabel.decrefs;
+                          JumpTargetOp(hp)^.ref^.symbol := l;
+                          if not GetFinalDestination(hp,succ(level)) then
+                            exit;
+                        end;
+                      GetFinalDestination := True;
+                      Exit;
                     end;
 {$endif not MIPS and not RV64 and not RV32 and not JVM}
+              end;
           end;
-        GetFinalDestination := true;
+
+        { Required so that recursion works properly, while still
+          returning False if a jump isn't modified. [Kit] }
+        if level > 0 then GetFinalDestination := True;
       end;
 
 
Index: compiler/i386/aoptcpu.pas
===================================================================
--- compiler/i386/aoptcpu.pas	(revision 42345)
+++ compiler/i386/aoptcpu.pas	(working copy)
@@ -34,12 +34,21 @@
       Aasmbase,aasmtai,aasmdata;
 
     Type
+
+      { TCpuAsmOptimizer }
+
       TCpuAsmOptimizer = class(TX86AsmOptimizer)
-        procedure Optimize; override;
-        procedure PrePeepHoleOpts; override;
-        procedure PeepHoleOptPass1; override;
-        procedure PeepHoleOptPass2; override;
+        function PeepHoleOptPass1Cpu(var p: tai): boolean; override;
         procedure PostPeepHoleOpts; override;
+        function PostPeepHoleOptsCpu(var p : tai) : boolean; override;
+
+        { Optimizations specific to i386 }
+        function OptPass1FSTPFISTP(var p : tai) : boolean;
+        function OptPass1FLD(var p: tai): Boolean;
+        function OptPass1PUSH(var p: tai): Boolean;
+
+        { The x86_64 version is very different }
+        function PostPeepholeOptMovzx(var p : tai) : Boolean; inline;
       end;
 
     Var
@@ -55,769 +64,423 @@
       aasmcfi,
       procinfo,
       cgutils,
-      { units we should get rid off: }
+      systems,
+      { units we should get rid of: }
       symsym,symconst;
 
 
-  { Checks if the register is a 32 bit general purpose register }
-  function isgp32reg(reg: TRegister): boolean;
+    { Checks if the register is a 32 bit general purpose register }
+    function isgp32reg(reg: TRegister): boolean; inline;
+      begin
+        {$push}{$warnings off}
+        isgp32reg:=(getregtype(reg)=R_INTREGISTER) and (getsupreg(reg)>=RS_EAX) and (getsupreg(reg)<=RS_EBX);
+        {$pop}
+      end;
+
+
+  { converts a tinschange value to the tsuperregister it refers to }
+  function tch2reg(ch: tinschange): tsuperregister;
+    const
+      ch2reg: array[CH_REAX..CH_REDI] of tsuperregister = (RS_EAX,RS_ECX,RS_EDX,RS_EBX,RS_ESP,RS_EBP,RS_ESI,RS_EDI);
     begin
-      {$push}{$warnings off}
-      isgp32reg:=(getregtype(reg)=R_INTREGISTER) and (getsupreg(reg)>=RS_EAX) and (getsupreg(reg)<=RS_EBX);
-      {$pop}
+      if (ch <= CH_REDI) then
+        tch2reg := ch2reg[ch]
+      else if (ch <= CH_WEDI) then
+        tch2reg := ch2reg[tinschange(ord(ch) - ord(CH_REDI))]
+      else if (ch <= CH_RWEDI) then
+        tch2reg := ch2reg[tinschange(ord(ch) - ord(CH_WEDI))]
+      else if (ch <= CH_MEDI) then
+        tch2reg := ch2reg[tinschange(ord(ch) - ord(CH_RWEDI))]
+      else
+        InternalError(2016041901)
     end;
 
 
-{ returns true if p contains a memory operand with a segment set }
-function InsContainsSegRef(p: taicpu): boolean;
-var
-  i: longint;
-begin
-  result:=true;
-  for i:=0 to p.opercnt-1 do
-    if (p.oper[i]^.typ=top_ref) and
-       (p.oper[i]^.ref^.segment<>NR_NO) then
-      exit;
-  result:=false;
-end;
+  { returns true if p contains a memory operand with a segment set }
+  function InsContainsSegRef(p: taicpu): boolean;
+    var
+      i: longint;
+    begin
+      result:=true;
+      for i:=0 to p.opercnt-1 do
+        if (p.oper[i]^.typ=top_ref) and
+           (p.oper[i]^.ref^.segment<>NR_NO) then
+          exit;
+      result:=false;
+    end;
 
 
-procedure TCPUAsmOptimizer.PrePeepHoleOpts;
-var
-  p: tai;
-begin
-  p := BlockStart;
-  while (p <> BlockEnd) Do
+  function TCpuAsmOptimizer.OptPass1FSTPFISTP(var p: tai): boolean;
+    var
+      hp1, hp2: tai;
     begin
-      case p.Typ Of
-        Ait_Instruction:
-          begin
-            if InsContainsSegRef(taicpu(p)) then
+      Result := false;
+
+      if (taicpu(p).oper[0]^.typ = top_ref) and
+         getNextInstruction(p, hp1) and
+         (hp1.typ = ait_instruction) and
+         (((taicpu(hp1).opcode = A_FLD) and
+           (taicpu(p).opcode = A_FSTP)) or
+          ((taicpu(p).opcode = A_FISTP) and
+           (taicpu(hp1).opcode = A_FILD))) and
+         (taicpu(hp1).oper[0]^.typ = top_ref) and
+         (taicpu(hp1).opsize = taicpu(p).opsize) and
+         RefsEqual(taicpu(p).oper[0]^.ref^, taicpu(hp1).oper[0]^.ref^) then
+        begin
+          { replacing fstp f;fld f by fst f is only valid for extended because of rounding }
+          if (taicpu(p).opsize=S_FX) and
+             getNextInstruction(hp1, hp2) and
+             (hp2.typ = ait_instruction) and
+             IsExitCode(hp2) and
+             (taicpu(p).oper[0]^.ref^.base = current_procinfo.FramePointer) and
+             not(assigned(current_procinfo.procdef.funcretsym) and
+                 (taicpu(p).oper[0]^.ref^.offset < tabstractnormalvarsym(current_procinfo.procdef.funcretsym).localloc.reference.offset)) and
+             (taicpu(p).oper[0]^.ref^.index = NR_NO) then
+            begin
+              asml.remove(p);
+              asml.remove(hp1);
+              p.free;
+              hp1.free;
+              p := hp2;
+              removeLastDeallocForFuncRes(p);
+              Result := true;
+            end
+          (* can't be done because the store operation rounds
+          else
+            { fst can't store an extended value! }
+            if (taicpu(p).opsize <> S_FX) and
+               (taicpu(p).opsize <> S_IQ) then
               begin
-                p := tai(p.next);
-                continue;
-              end;
-            case taicpu(p).opcode Of
-              A_IMUL:
-                if PrePeepholeOptIMUL(p) then
-                  Continue;
-              A_SAR,A_SHR:
-                if PrePeepholeOptSxx(p) then
-                  continue;
-              A_XOR:
-                begin
-                  if (taicpu(p).oper[0]^.typ = top_reg) and
-                     (taicpu(p).oper[1]^.typ = top_reg) and
-                     (taicpu(p).oper[0]^.reg = taicpu(p).oper[1]^.reg) then
-                   { temporarily change this to 'mov reg,0' to make it easier }
-                   { for the CSE. Will be changed back in pass 2              }
-                    begin
-                      taicpu(p).opcode := A_MOV;
-                      taicpu(p).loadConst(0,0);
-                    end;
-                end;
-              else
-                ;
-            end;
-          end;
-        else
-          ;
-      end;
-      p := tai(p.next)
+                if (taicpu(p).opcode = A_FSTP) then
+                  taicpu(p).opcode := A_FST
+                else taicpu(p).opcode := A_FIST;
+                asml.remove(hp1);
+                hp1.free;
+              end
+          *)
+        end;
     end;
-end;
 
+  function TCpuAsmOptimizer.OptPass1FLD(var p: tai): Boolean;
+    var
+      hp1, hp2: tai;
+    begin
+      Result := False;
 
-{ First pass of peephole optimizations }
-procedure TCPUAsmOPtimizer.PeepHoleOptPass1;
-
-function WriteOk : Boolean;
-  begin
-    writeln('Ok');
-    Result:=True;
-  end;
-
-var
-  p,hp1,hp2 : tai;
-  hp3,hp4: tai;
-  v:aint;
-
-  function GetFinalDestination(asml: TAsmList; hp: taicpu; level: longint): boolean;
-  {traces sucessive jumps to their final destination and sets it, e.g.
-   je l1                je l3
-   <code>               <code>
-   l1:       becomes    l1:
-   je l2                je l3
-   <code>               <code>
-   l2:                  l2:
-   jmp l3               jmp l3
-
-   the level parameter denotes how deeep we have already followed the jump,
-   to avoid endless loops with constructs such as "l5: ; jmp l5"           }
-
-  var p1, p2: tai;
-      l: tasmlabel;
-
-    function FindAnyLabel(hp: tai; var l: tasmlabel): Boolean;
-    begin
-      FindAnyLabel := false;
-      while assigned(hp.next) and
-            (tai(hp.next).typ in (SkipInstr+[ait_align])) Do
-        hp := tai(hp.next);
-      if assigned(hp.next) and
-         (tai(hp.next).typ = ait_label) then
+      if (taicpu(p).oper[0]^.typ = top_reg) and
+        GetNextInstruction(p, hp1) and
+        (hp1.typ = Ait_Instruction) and
+         (taicpu(hp1).oper[0]^.typ = top_reg) and
+        (taicpu(hp1).oper[1]^.typ = top_reg) and
+        (taicpu(hp1).oper[0]^.reg = NR_ST) and
+        (taicpu(hp1).oper[1]^.reg = NR_ST1) then
+        { change                        to
+            fld      reg               fxxx reg,st
+            fxxxp    st, st1 (hp1)
+          Remark: non-commutative operations must be reversed!
+        }
         begin
-          FindAnyLabel := true;
-          l := tai_label(hp.next).labsym;
+          if taicpu(hp1).opcode in [A_FMULP,A_FADDP,A_FSUBP,A_FDIVP,A_FSUBRP,A_FDIVRP] then
+            begin
+              case taicpu(hp1).opcode Of
+                A_FADDP: taicpu(hp1).opcode := A_FADD;
+                A_FMULP: taicpu(hp1).opcode := A_FMUL;
+                A_FSUBP: taicpu(hp1).opcode := A_FSUBR;
+                A_FSUBRP: taicpu(hp1).opcode := A_FSUB;
+                A_FDIVP: taicpu(hp1).opcode := A_FDIVR;
+                A_FDIVRP: taicpu(hp1).opcode := A_FDIV;
+                else
+                  InternalError(2019071010);
+              end;
+              taicpu(hp1).oper[0]^.reg := taicpu(p).oper[0]^.reg;
+              taicpu(hp1).oper[1]^.reg := NR_ST;
+              asml.remove(p);
+              p.free;
+              p := hp1;
+              Result := True;
+            end;
         end
-    end;
-
-  begin
-    GetfinalDestination := false;
-    if level > 20 then
-      exit;
-    p1 := getlabelwithsym(tasmlabel(hp.oper[0]^.ref^.symbol));
-    if assigned(p1) then
-      begin
-        SkipLabels(p1,p1);
-        if (tai(p1).typ = ait_instruction) and
-           (taicpu(p1).is_jmp) then
-          if { the next instruction after the label where the jump hp arrives}
-             { is unconditional or of the same type as hp, so continue       }
-             (taicpu(p1).condition in [C_None,hp.condition]) or
-             { the next instruction after the label where the jump hp arrives}
-             { is the opposite of hp (so this one is never taken), but after }
-             { that one there is a branch that will be taken, so perform a   }
-             { little hack: set p1 equal to this instruction (that's what the}
-             { last SkipLabels is for, only works with short bool evaluation)}
-             ((taicpu(p1).condition = inverse_cond(hp.condition)) and
-              SkipLabels(p1,p2) and
-              (p2.typ = ait_instruction) and
-              (taicpu(p2).is_jmp) and
-              (taicpu(p2).condition in [C_None,hp.condition]) and
-              SkipLabels(p1,p1)) then
+      else
+        if (taicpu(p).oper[0]^.typ = top_ref) and
+           GetNextInstruction(p, hp2) and
+           (hp2.typ = Ait_Instruction) and
+           (taicpu(hp2).ops = 2) and
+           (taicpu(hp2).oper[0]^.typ = top_reg) and
+           (taicpu(hp2).oper[1]^.typ = top_reg) and
+           (taicpu(p).opsize in [S_FS, S_FL]) and
+           (taicpu(hp2).oper[0]^.reg = NR_ST) and
+           (taicpu(hp2).oper[1]^.reg = NR_ST1) then
+          if GetLastInstruction(p, hp1) and
+             (hp1.typ = ait_Instruction) and
+             ((taicpu(hp1).opcode = A_FLD) or
+              (taicpu(hp1).opcode = A_FST)) and
+             (taicpu(hp1).opsize = taicpu(p).opsize) and
+             (taicpu(hp1).oper[0]^.typ = top_ref) and
+             RefsEqual(taicpu(p).oper[0]^.ref^, taicpu(hp1).oper[0]^.ref^) then
             begin
-              { quick check for loops of the form "l5: ; jmp l5 }
-              if (tasmlabel(taicpu(p1).oper[0]^.ref^.symbol).labelnr =
-                   tasmlabel(hp.oper[0]^.ref^.symbol).labelnr) then
-                exit;
-              if not GetFinalDestination(asml, taicpu(p1),succ(level)) then
-                exit;
-              tasmlabel(hp.oper[0]^.ref^.symbol).decrefs;
-              hp.oper[0]^.ref^.symbol:=taicpu(p1).oper[0]^.ref^.symbol;
-              tasmlabel(hp.oper[0]^.ref^.symbol).increfs;
-            end
-          else
-            if (taicpu(p1).condition = inverse_cond(hp.condition)) then
-              if not FindAnyLabel(p1,l) then
+              if ((taicpu(hp2).opcode = A_FMULP) or
+                  (taicpu(hp2).opcode = A_FADDP)) then
+              { change                      to
+                  fld/fst   mem1  (hp1)       fld/fst   mem1
+                  fld       mem1  (p)         fadd/
+                  faddp/                       fmul     st, st
+                  fmulp  st, st1 (hp2) }
                 begin
-  {$ifdef finaldestdebug}
-                  insertllitem(asml,p1,p1.next,tai_comment.Create(
-                    strpnew('previous label inserted'))));
-  {$endif finaldestdebug}
-                  current_asmdata.getjumplabel(l);
-                  insertllitem(p1,p1.next,tai_label.Create(l));
-                  tasmlabel(taicpu(hp).oper[0]^.ref^.symbol).decrefs;
-                  hp.oper[0]^.ref^.symbol := l;
-                  l.increfs;
-  {               this won't work, since the new label isn't in the labeltable }
-  {               so it will fail the rangecheck. Labeltable should become a   }
-  {               hashtable to support this:                                   }
-  {               GetFinalDestination(asml, hp);                               }
+                  asml.remove(p);
+                  p.free;
+                  p := hp1;
+                  if (taicpu(hp2).opcode = A_FADDP) then
+                    taicpu(hp2).opcode := A_FADD
+                  else
+                    taicpu(hp2).opcode := A_FMUL;
+                  taicpu(hp2).oper[1]^.reg := NR_ST;
+                  Result := True;
                 end
               else
+              { change              to
+                  fld/fst mem1 (hp1)   fld/fst mem1
+                  fld     mem1 (p)     fld      st}
                 begin
-  {$ifdef finaldestdebug}
-                  insertllitem(asml,p1,p1.next,tai_comment.Create(
-                    strpnew('next label reused'))));
-  {$endif finaldestdebug}
-                  l.increfs;
-                  hp.oper[0]^.ref^.symbol := l;
-                  if not GetFinalDestination(asml, hp,succ(level)) then
-                    exit;
+                  taicpu(p).changeopsize(S_FL);
+                  taicpu(p).loadreg(0,NR_ST);
                 end;
-      end;
-    GetFinalDestination := true;
-  end;
 
-begin
-  p := BlockStart;
-  ClearUsedRegs;
-  while (p <> BlockEnd) Do
-    begin
-      UpDateUsedRegs(UsedRegs, tai(p.next));
-      case p.Typ Of
-        ait_instruction:
-          begin
-            current_filepos:=taicpu(p).fileinfo;
-            if InsContainsSegRef(taicpu(p)) then
-              begin
-                p := tai(p.next);
-                continue;
-              end;
-            { Handle Jmp Optimizations }
-            if taicpu(p).is_jmp then
-              begin
-                { the following if-block removes all code between a jmp and the next label,
-                  because it can never be executed }
-                if (taicpu(p).opcode = A_JMP) then
-                  begin
-                    hp2:=p;
-                    while GetNextInstruction(hp2, hp1) and
-                          (hp1.typ <> ait_label) do
-                      if not(hp1.typ in ([ait_label]+skipinstr)) then
-                        begin
-                          { don't kill start/end of assembler block,
-                            no-line-info-start/end, cfi end, etc }
-                          if not(hp1.typ in [ait_align,ait_marker]) and
-                             ((hp1.typ<>ait_cfi) or
-                              (tai_cfi_base(hp1).cfityp<>cfi_endproc)) then
-                            begin
-                              asml.remove(hp1);
-                              hp1.free;
-                            end
-                          else
-                            hp2:=hp1;
-                        end
-                      else break;
-                    end;
-                { remove jumps to a label coming right after them }
-                if GetNextInstruction(p, hp1) then
-                  begin
-                    if FindLabel(tasmlabel(taicpu(p).oper[0]^.ref^.symbol), hp1) and
-  { TODO: FIXME removing the first instruction fails}
-                        (p<>blockstart) then
-                      begin
-                        hp2:=tai(hp1.next);
-                        asml.remove(p);
-                        p.free;
-                        p:=hp2;
-                        continue;
-                      end
-                    else
-                      begin
-                        if hp1.typ = ait_label then
-                          SkipLabels(hp1,hp1);
-                        if (tai(hp1).typ=ait_instruction) and
-                            (taicpu(hp1).opcode=A_JMP) and
-                            GetNextInstruction(hp1, hp2) and
-                            FindLabel(tasmlabel(taicpu(p).oper[0]^.ref^.symbol), hp2) then
-                          begin
-                            if taicpu(p).opcode=A_Jcc then
-                              begin
-                                taicpu(p).condition:=inverse_cond(taicpu(p).condition);
-                                tai_label(hp2).labsym.decrefs;
-                                taicpu(p).oper[0]^.ref^.symbol:=taicpu(hp1).oper[0]^.ref^.symbol;
-                                { when free'ing hp1, the ref. isn't decresed, so we don't
-                                  increase it (FK)
+            end
+          else
+            begin
+              if taicpu(hp2).opcode in [A_FMULP,A_FADDP,A_FSUBP,A_FDIVP,A_FSUBRP,A_FDIVRP] then
+          { change                        to
+              fld      mem2    (p)        fxxx       mem2
+              fxxxp    st, st1 (hp2)                      }
 
-                                  taicpu(p).oper[0]^.ref^.symbol.increfs;
-                                }
-                                asml.remove(hp1);
-                                hp1.free;
-                                GetFinalDestination(asml, taicpu(p),0);
-                              end
-                            else
-                              begin
-                                GetFinalDestination(asml, taicpu(p),0);
-                                p:=tai(p.next);
-                                continue;
-                              end;
-                          end
-                        else
-                          GetFinalDestination(asml, taicpu(p),0);
-                      end;
+                begin
+                  case taicpu(hp2).opcode Of
+                    A_FADDP: taicpu(p).opcode := A_FADD;
+                    A_FMULP: taicpu(p).opcode := A_FMUL;
+                    A_FSUBP: taicpu(p).opcode := A_FSUBR;
+                    A_FSUBRP: taicpu(p).opcode := A_FSUB;
+                    A_FDIVP: taicpu(p).opcode := A_FDIVR;
+                    A_FDIVRP: taicpu(p).opcode := A_FDIV;
+                    else
+                      InternalError(2019071011);
                   end;
-              end
-            else
-            { All other optimizes }
-              begin
-                case taicpu(p).opcode Of
-                  A_AND:
-                    if OptPass1And(p) then
-                      continue;
-                  A_CMP:
-                    begin
-                      { cmp register,$8000                neg register
-                        je target                 -->     jo target
+                  asml.remove(hp2);
+                  hp2.free;
+                end;
+            end;
+    end;
 
-                        .... only if register is deallocated before jump.}
-                      case Taicpu(p).opsize of
-                        S_B: v:=$80;
-                        S_W: v:=$8000;
-                        S_L: v:=aint($80000000);
-                        else
-                          internalerror(2013112905);
-                      end;
-                      if (taicpu(p).oper[0]^.typ=Top_const) and
-                         (taicpu(p).oper[0]^.val=v) and
-                         (Taicpu(p).oper[1]^.typ=top_reg) and
-                         GetNextInstruction(p, hp1) and
-                         (hp1.typ=ait_instruction) and
-                         (taicpu(hp1).opcode=A_Jcc) and
-                         (Taicpu(hp1).condition in [C_E,C_NE]) and
-                         not(RegInUsedRegs(Taicpu(p).oper[1]^.reg, UsedRegs)) then
-                        begin
-                          Taicpu(p).opcode:=A_NEG;
-                          Taicpu(p).loadoper(0,Taicpu(p).oper[1]^);
-                          Taicpu(p).clearop(1);
-                          Taicpu(p).ops:=1;
-                          if Taicpu(hp1).condition=C_E then
-                            Taicpu(hp1).condition:=C_O
-                          else
-                            Taicpu(hp1).condition:=C_NO;
-                          continue;
-                        end;
-                      {
-                      @@2:                              @@2:
-                        ....                              ....
-                        cmp operand1,0
-                        jle/jbe @@1
-                        dec operand1             -->      sub operand1,1
-                        jmp @@2                           jge/jae @@2
-                      @@1:                              @@1:
-                        ...                               ....}
-                      if (taicpu(p).oper[0]^.typ = top_const) and
-                         (taicpu(p).oper[1]^.typ in [top_reg,top_ref]) and
-                         (taicpu(p).oper[0]^.val = 0) and
-                         GetNextInstruction(p, hp1) and
-                         (hp1.typ = ait_instruction) and
-                         (taicpu(hp1).is_jmp) and
-                         (taicpu(hp1).opcode=A_Jcc) and
-                         (taicpu(hp1).condition in [C_LE,C_BE]) and
-                         GetNextInstruction(hp1,hp2) and
-                         (hp2.typ = ait_instruction) and
-                         (taicpu(hp2).opcode = A_DEC) and
-                         OpsEqual(taicpu(hp2).oper[0]^,taicpu(p).oper[1]^) and
-                         GetNextInstruction(hp2, hp3) and
-                         (hp3.typ = ait_instruction) and
-                         (taicpu(hp3).is_jmp) and
-                         (taicpu(hp3).opcode = A_JMP) and
-                         GetNextInstruction(hp3, hp4) and
-                         FindLabel(tasmlabel(taicpu(hp1).oper[0]^.ref^.symbol),hp4) then
-                        begin
-                          taicpu(hp2).Opcode := A_SUB;
-                          taicpu(hp2).loadoper(1,taicpu(hp2).oper[0]^);
-                          taicpu(hp2).loadConst(0,1);
-                          taicpu(hp2).ops:=2;
-                          taicpu(hp3).Opcode := A_Jcc;
-                          case taicpu(hp1).condition of
-                            C_LE: taicpu(hp3).condition := C_GE;
-                            C_BE: taicpu(hp3).condition := C_AE;
-                            else
-                              internalerror(2019050903);
-                          end;
-                          asml.remove(p);
-                          asml.remove(hp1);
-                          p.free;
-                          hp1.free;
-                          p := hp2;
-                          continue;
-                        end
-                    end;
-                  A_FLD:
-                    if OptPass1FLD(p) then
-                      continue;
-                  A_FSTP,A_FISTP:
-                    if OptPass1FSTP(p) then
-                      continue;
-                  A_LEA:
-                    begin
-                      if OptPass1LEA(p) then
-                        continue;
-                    end;
 
-                  A_MOV:
-                    begin
-                      If OptPass1MOV(p) then
-                        Continue;
-                    end;
+  function TCpuAsmOptimizer.OptPass1PUSH(var p: tai): Boolean;
+    var
+      hp1: tai;
+    begin
+      Result := False;
+      if (taicpu(p).opsize = S_W) and
+         (taicpu(p).oper[0]^.typ = Top_Const) and
+         GetNextInstruction(p, hp1) and
+         (tai(hp1).typ = ait_instruction) and
+         (taicpu(hp1).opcode = A_PUSH) and
+         (taicpu(hp1).oper[0]^.typ = Top_Const) and
+         (taicpu(hp1).opsize = S_W) then
+        begin
+          taicpu(p).changeopsize(S_L);
+          taicpu(p).loadConst(0,taicpu(p).oper[0]^.val shl 16 + word(taicpu(hp1).oper[0]^.val));
+          asml.remove(hp1);
+          hp1.free;
+        end;
+    end;
 
-                  A_MOVSX,
-                  A_MOVZX :
-                    begin
-                      If OptPass1Movx(p) then
-                        Continue
-                    end;
 
-(* should not be generated anymore by the current code generator
-                  A_POP:
+    function TCpuAsmOptimizer.PostPeepholeOptMovzx(var p: tai): Boolean;
+      var
+        hp1: tai;
+      begin
+        { if register vars are on, it's possible there is code like }
+        {   "cmpl $3,%eax; movzbl 8(%ebp),%ebx; je .Lxxx"           }
+        { so we can't safely replace the movzx then with xor/mov,   }
+        { since that would change the flags (JM)                    }
+        Result := False;
+        if not(cs_opt_regvar in current_settings.optimizerswitches) then
+          begin
+            if (taicpu(p).oper[1]^.typ = top_reg) then
+              if (taicpu(p).oper[0]^.typ = top_reg)
+                then
+                  if (taicpu(p).opsize = S_BL) and
+                    IsGP32Reg(taicpu(p).oper[1]^.reg) and
+                    not(cs_opt_size in current_settings.optimizerswitches) and
+                    (current_settings.optimizecputype = cpu_Pentium) then
+                    {Change "movzbl %reg1, %reg2" to
+                     "xorl %reg2, %reg2; movb %reg1, %reg2" for Pentium and
+                     PentiumMMX}
                     begin
-                      if target_info.system=system_i386_go32v2 then
-                      begin
-                        { Transform a series of pop/pop/pop/push/push/push to }
-                        { 'movl x(%esp),%reg' for go32v2 (not for the rest,   }
-                        { because I'm not sure whether they can cope with     }
-                        { 'movl x(%esp),%reg' with x > 0, I believe we had    }
-                        { such a problem when using esp as frame pointer (JM) }
-                        if (taicpu(p).oper[0]^.typ = top_reg) then
-                          begin
-                            hp1 := p;
-                            hp2 := p;
-                            l := 0;
-                            while getNextInstruction(hp1,hp1) and
-                                  (hp1.typ = ait_instruction) and
-                                  (taicpu(hp1).opcode = A_POP) and
-                                  (taicpu(hp1).oper[0]^.typ = top_reg) do
-                              begin
-                                hp2 := hp1;
-                                inc(l,4);
-                              end;
-                            getLastInstruction(p,hp3);
-                            l1 := 0;
-                            while (hp2 <> hp3) and
-                                  assigned(hp1) and
-                                  (hp1.typ = ait_instruction) and
-                                  (taicpu(hp1).opcode = A_PUSH) and
-                                  (taicpu(hp1).oper[0]^.typ = top_reg) and
-                                  (taicpu(hp1).oper[0]^.reg.enum = taicpu(hp2).oper[0]^.reg.enum) do
-                              begin
-                                { change it to a two op operation }
-                                taicpu(hp2).oper[1]^.typ:=top_none;
-                                taicpu(hp2).ops:=2;
-                                taicpu(hp2).opcode := A_MOV;
-                                taicpu(hp2).loadoper(1,taicpu(hp1).oper[0]^);
-                                reference_reset(tmpref);
-                                tmpRef.base.enum:=R_INTREGISTER;
-                                tmpRef.base.number:=NR_STACK_POINTER_REG;
-                                convert_register_to_enum(tmpref.base);
-                                tmpRef.offset := l;
-                                taicpu(hp2).loadRef(0,tmpRef);
-                                hp4 := hp1;
-                                getNextInstruction(hp1,hp1);
-                                asml.remove(hp4);
-                                hp4.free;
-                                getLastInstruction(hp2,hp2);
-                                dec(l,4);
-                                inc(l1);
-                              end;
-                            if l <> -4 then
-                              begin
-                                inc(l,4);
-                                for l1 := l1 downto 1 do
-                                  begin
-                                    getNextInstruction(hp2,hp2);
-                                    dec(taicpu(hp2).oper[0]^.ref^.offset,l);
-                                  end
-                              end
-                          end
-                        end
-                      else
-                        begin
-                          if (taicpu(p).oper[0]^.typ = top_reg) and
-                            GetNextInstruction(p, hp1) and
-                            (tai(hp1).typ=ait_instruction) and
-                            (taicpu(hp1).opcode=A_PUSH) and
-                            (taicpu(hp1).oper[0]^.typ = top_reg) and
-                            (taicpu(hp1).oper[0]^.reg.enum=taicpu(p).oper[0]^.reg.enum) then
-                            begin
-                              { change it to a two op operation }
-                              taicpu(p).oper[1]^.typ:=top_none;
-                              taicpu(p).ops:=2;
-                              taicpu(p).opcode := A_MOV;
-                              taicpu(p).loadoper(1,taicpu(p).oper[0]^);
-                              reference_reset(tmpref);
-                              TmpRef.base.enum := R_ESP;
-                              taicpu(p).loadRef(0,TmpRef);
-                              asml.remove(hp1);
-                              hp1.free;
-                            end;
-                        end;
-                    end;
-*)
-                  A_PUSH:
-                    begin
-                      if (taicpu(p).opsize = S_W) and
-                         (taicpu(p).oper[0]^.typ = Top_Const) and
-                         GetNextInstruction(p, hp1) and
-                         (tai(hp1).typ = ait_instruction) and
-                         (taicpu(hp1).opcode = A_PUSH) and
-                         (taicpu(hp1).oper[0]^.typ = Top_Const) and
-                         (taicpu(hp1).opsize = S_W) then
-                        begin
-                          taicpu(p).changeopsize(S_L);
-                          taicpu(p).loadConst(0,taicpu(p).oper[0]^.val shl 16 + word(taicpu(hp1).oper[0]^.val));
-                          asml.remove(hp1);
-                          hp1.free;
-                        end;
-                    end;
-                  A_SHL, A_SAL:
-                    if OptPass1SHLSAL(p) then
-                      Continue;
-                  A_SUB:
-                    if OptPass1Sub(p) then
-                      continue;
-                  A_VMOVAPS,
-                  A_VMOVAPD:
-                    if OptPass1VMOVAP(p) then
-                      continue;
-                  A_VDIVSD,
-                  A_VDIVSS,
-                  A_VSUBSD,
-                  A_VSUBSS,
-                  A_VMULSD,
-                  A_VMULSS,
-                  A_VADDSD,
-                  A_VADDSS,
-                  A_VANDPD,
-                  A_VANDPS,
-                  A_VORPD,
-                  A_VORPS,
-                  A_VXORPD,
-                  A_VXORPS:
-                    if OptPass1VOP(p) then
-                      continue;
-                  A_MULSD,
-                  A_MULSS,
-                  A_ADDSD,
-                  A_ADDSS:
-                    if OptPass1OP(p) then
-                      continue;
-                  A_MOVAPD,
-                  A_MOVAPS:
-                    if OptPass1MOVAP(p) then
-                      continue;
-                  A_VMOVSD,
-                  A_VMOVSS,
-                  A_MOVSD,
-                  A_MOVSS:
-                    if OptPass1MOVXX(p) then
-                      continue;
-                  A_SETcc:
-                    begin
-                      if OptPass1SETcc(p) then
-                        continue;
+                      hp1 := taicpu.op_reg_reg(A_XOR, S_L, taicpu(p).oper[1]^.reg, taicpu(p).oper[1]^.reg);
+                      InsertLLItem(p.previous, p, hp1);
+                      taicpu(p).opcode := A_MOV;
+                      taicpu(p).changeopsize(S_B);
+                      setsubreg(taicpu(p).oper[1]^.reg,R_SUBL);
                     end
-                  else
-                    ;
-                end;
-            end; { if is_jmp }
-          end;
+                else if (taicpu(p).oper[0]^.typ = top_ref) and
+                  (taicpu(p).oper[0]^.ref^.base <> taicpu(p).oper[1]^.reg) and
+                  (taicpu(p).oper[0]^.ref^.index <> taicpu(p).oper[1]^.reg) and
+                  not(cs_opt_size in current_settings.optimizerswitches) and
+                  IsGP32Reg(taicpu(p).oper[1]^.reg) and
+                  (current_settings.optimizecputype = cpu_Pentium) and
+                  (taicpu(p).opsize = S_BL) then
+                  {changes "movzbl mem, %reg" to "xorl %reg, %reg; movb mem, %reg8" for
+                    Pentium and PentiumMMX}
+                  begin
+                    hp1 := taicpu.Op_reg_reg(A_XOR, S_L, taicpu(p).oper[1]^.reg, taicpu(p).oper[1]^.reg);
+                    taicpu(p).opcode := A_MOV;
+                    taicpu(p).changeopsize(S_B);
+                    setsubreg(taicpu(p).oper[1]^.reg,R_SUBL);
+                    InsertLLItem(p.previous, p, hp1);
+                  end;
+          end
         else
           ;
       end;
-      updateUsedRegs(UsedRegs,p);
-      p:=tai(p.next);
-    end;
-end;
 
 
-procedure TCPUAsmOptimizer.PeepHoleOptPass2;
-var
-  p : tai;
-begin
-  p := BlockStart;
-  ClearUsedRegs;
-  while (p <> BlockEnd) Do
-    begin
-      UpdateUsedRegs(UsedRegs, tai(p.next));
-      case p.Typ Of
-        Ait_Instruction:
-          begin
-            if InsContainsSegRef(taicpu(p)) then
-              begin
-                p := tai(p.next);
-                continue;
-              end;
-            case taicpu(p).opcode Of
-              A_Jcc:
-                if OptPass2Jcc(p) then
-                  continue;
-              A_FSTP,A_FISTP:
-                if OptPass1FSTP(p) then
-                  continue;
-              A_IMUL:
-                if OptPass2Imul(p) then
-                  continue;
-              A_JMP:
-                if OptPass2Jmp(p) then
-                  continue;
-              A_MOV:
-                begin
-                  if OptPass2MOV(p) then
-                    continue;
-                end
-              else
-                ;
-            end;
+    function TCpuAsmOptimizer.PeepHoleOptPass1Cpu(var p: tai): boolean;
+      var
+        Opcode: TAsmOp;
+      begin
+        result:=False;
+        { p is known to be an instruction by this point }
+
+        { Use a local variable/register to reduce the number of pointer
+          dereferences (the peephole optimiser would never optimise this
+          by itself because the compiler has to consider the possibility
+          of multi-threaded race hazards). [Kit] }
+        Opcode := taicpu(p).opcode;
+
+        { MOV instructions appear disproportionately more often than any
+          other instruction, so check for this opcode first to reduce
+          the total number of comparisons required over the entire
+          block. [Kit] }
+        if Opcode = A_MOV then
+          Result := OptPass1MOV(p)
+        else
+          case Opcode of
+            A_PUSH:
+              Result := OptPass1PUSH(p);
+            A_AND:
+              Result:=OptPass1AND(p);
+            A_XOR:
+              Result:=OptPass1XOR(p);
+            A_MOVSX,
+            A_MOVZX:
+              Result:=OptPass1Movx(p);
+            A_VMOVAPS,
+            A_VMOVAPD,
+            A_VMOVUPS,
+            A_VMOVUPD:
+              result:=OptPass1VMOVAP(p);
+            A_MOVAPD,
+            A_MOVAPS,
+            A_MOVUPD,
+            A_MOVUPS:
+              result:=OptPass1MOVAP(p);
+            A_VDIVSD,
+            A_VDIVSS,
+            A_VSUBSD,
+            A_VSUBSS,
+            A_VMULSD,
+            A_VMULSS,
+            A_VADDSD,
+            A_VADDSS,
+            A_VANDPD,
+            A_VANDPS,
+            A_VORPD,
+            A_VORPS,
+            A_VXORPD,
+            A_VXORPS:
+              result:=OptPass1VOP(p);
+            A_MULSD,
+            A_MULSS,
+            A_ADDSD,
+            A_ADDSS:
+              result:=OptPass1OP(p);
+            A_VMOVSD,
+            A_VMOVSS,
+            A_MOVSD,
+            A_MOVSS:
+              result:=OptPass1MOVXX(p);
+            A_FSTP,A_FISTP:
+              Result := OptPass1FSTPFISTP(p);
+            A_FLD:
+              Result := OptPass1FLD(p);
+            A_LEA:
+              result:=OptPass1LEA(p);
+            A_SUB:
+              result:=OptPass1Sub(p);
+            A_SHL,A_SAL:
+              result:=OptPass1SHLSAL(p);
+            A_SHR,A_SAR:
+              result:=OptPass1SHRSAR(p);
+            A_SETcc:
+              result:=OptPass1SETcc(p);
+            A_IMUL:
+              Result:=OptPass1Imul(p);
+            A_JMP:
+              Result:=OptPass1Jmp(p);
+            A_Jcc:
+              Result:=OptPass1Jcc(p);
+            else
+              { Do nothing };
           end;
-        else
-          ;
       end;
-      p := tai(p.next)
-    end;
-end;
 
 
-procedure TCPUAsmOptimizer.PostPeepHoleOpts;
-var
-  p,hp1: tai;
-begin
-  p := BlockStart;
-  ClearUsedRegs;
-  while (p <> BlockEnd) Do
-    begin
-      UpdateUsedRegs(UsedRegs, tai(p.next));
-      case p.Typ Of
-        Ait_Instruction:
+    function TCpuAsmOptimizer.PostPeepHoleOptsCpu(var p: tai): boolean;
+      begin
+        Result := False;
+        case taicpu(p).opcode Of
+          A_CALL:
+            Result := PostPeepHoleOptCall(p);
+          A_LEA:
+            Result := PostPeepholeOptLea(p);
+          A_CMP:
+            Result := PostPeepholeOptCmp(p);
+          A_MOV:
+            Result := PostPeepholeOptMov(p);
+          A_TEST, A_OR:
+            Result := PostPeepholeOptTestOr(p);
+          A_MOVZX:
+            Result := PostPeepholeOptMovzx(p);
+          else
+            { Do nothing };
+        end;
+      end;
+
+
+    procedure TCpuAsmOptimizer.PostPeepHoleOpts;
+      var
+        p: tai;
+      begin
+        p := BlockStart;
+        ClearUsedRegs;
+        while (p <> BlockEnd) Do
           begin
-            if InsContainsSegRef(taicpu(p)) then
+            UpdateUsedRegs(UsedRegs, tai(p.next));
+            if p.Typ = ait_Instruction then
               begin
-                p := tai(p.next);
-                continue;
+                if InsContainsSegRef(taicpu(p)) then
+                  begin
+                    p := tai(p.next);
+                    continue;
+                  end;
+                if PostPeepHoleOptsCpu(p) then
+                  Continue;
               end;
-            case taicpu(p).opcode Of
-              A_CALL:
-                if PostPeepHoleOptCall(p) then
-                  Continue;
-              A_LEA:
-                if PostPeepholeOptLea(p) then
-                  Continue;
-              A_CMP:
-                if PostPeepholeOptCmp(p) then
-                  Continue;
-              A_MOV:
-                if PostPeepholeOptMov(p) then
-                  Continue;
-              A_MOVZX:
-                { if register vars are on, it's possible there is code like }
-                {   "cmpl $3,%eax; movzbl 8(%ebp),%ebx; je .Lxxx"           }
-                { so we can't safely replace the movzx then with xor/mov,   }
-                { since that would change the flags (JM)                    }
-                if not(cs_opt_regvar in current_settings.optimizerswitches) then
-                 begin
-                  if (taicpu(p).oper[1]^.typ = top_reg) then
-                    if (taicpu(p).oper[0]^.typ = top_reg)
-                      then
-                        case taicpu(p).opsize of
-                          S_BL:
-                            begin
-                              if IsGP32Reg(taicpu(p).oper[1]^.reg) and
-                                 not(cs_opt_size in current_settings.optimizerswitches) and
-                                 (current_settings.optimizecputype = cpu_Pentium) then
-                                  {Change "movzbl %reg1, %reg2" to
-                                   "xorl %reg2, %reg2; movb %reg1, %reg2" for Pentium and
-                                   PentiumMMX}
-                                begin
-                                  hp1 := taicpu.op_reg_reg(A_XOR, S_L,
-                                              taicpu(p).oper[1]^.reg, taicpu(p).oper[1]^.reg);
-                                  InsertLLItem(p.previous, p, hp1);
-                                  taicpu(p).opcode := A_MOV;
-                                  taicpu(p).changeopsize(S_B);
-                                  setsubreg(taicpu(p).oper[1]^.reg,R_SUBL);
-                                end;
-                            end;
-                          else
-                            ;
-                        end
-                      else if (taicpu(p).oper[0]^.typ = top_ref) and
-                          (taicpu(p).oper[0]^.ref^.base <> taicpu(p).oper[1]^.reg) and
-                          (taicpu(p).oper[0]^.ref^.index <> taicpu(p).oper[1]^.reg) and
-                          not(cs_opt_size in current_settings.optimizerswitches) and
-                          IsGP32Reg(taicpu(p).oper[1]^.reg) and
-                          (current_settings.optimizecputype = cpu_Pentium) and
-                          (taicpu(p).opsize = S_BL) then
-                        {changes "movzbl mem, %reg" to "xorl %reg, %reg; movb mem, %reg8" for
-                          Pentium and PentiumMMX}
-                        begin
-                          hp1 := taicpu.Op_reg_reg(A_XOR, S_L, taicpu(p).oper[1]^.reg,
-                                      taicpu(p).oper[1]^.reg);
-                          taicpu(p).opcode := A_MOV;
-                          taicpu(p).changeopsize(S_B);
-                          setsubreg(taicpu(p).oper[1]^.reg,R_SUBL);
-                          InsertLLItem(p.previous, p, hp1);
-                        end;
-                 end;
-              A_TEST, A_OR:
-                begin
-                  if PostPeepholeOptTestOr(p) then
-                    Continue;
-                end;
-              else
-                ;
-            end;
+
+            p := tai(p.next)
           end;
-        else
-          ;
+        OptReferences;
       end;
-      p := tai(p.next)
-    end;
-  OptReferences;
-end;
 
 
-Procedure TCpuAsmOptimizer.Optimize;
-Var
-  HP: Tai;
-  pass: longint;
-  slowopt, changed, lastLoop: boolean;
-Begin
-  slowopt := (cs_opt_level3 in current_settings.optimizerswitches);
-  pass := 0;
-  changed := false;
-  repeat
-     lastLoop :=
-       not(slowopt) or
-       (not changed and (pass > 2)) or
-      { prevent endless loops }
-       (pass = 4);
-     changed := false;
-   { Setup labeltable, always necessary }
-     blockstart := tai(asml.first);
-     pass_1;
-   { Blockend now either contains an ait_marker with Kind = mark_AsmBlockStart, }
-   { or nil                                                                }
-     While Assigned(BlockStart) Do
-       Begin
-         if (cs_opt_peephole in current_settings.optimizerswitches) then
-           begin
-            if (pass = 0) then
-              PrePeepHoleOpts;
-              { Peephole optimizations }
-               PeepHoleOptPass1;
-              { Only perform them twice in the first pass }
-               if pass = 0 then
-                 PeepHoleOptPass1;
-           end;
-        { More peephole optimizations }
-         if (cs_opt_peephole in current_settings.optimizerswitches) then
-           begin
-             PeepHoleOptPass2;
-             if lastLoop then
-               PostPeepHoleOpts;
-           end;
-
-        { Continue where we left off, BlockEnd is either the start of an }
-        { assembler block or nil                                         }
-         BlockStart := BlockEnd;
-         While Assigned(BlockStart) And
-               (BlockStart.typ = ait_Marker) And
-               (Tai_Marker(BlockStart).Kind = mark_AsmBlockStart) Do
-           Begin
-           { We stopped at an assembler block, so skip it }
-            Repeat
-              BlockStart := Tai(BlockStart.Next);
-            Until (BlockStart.Typ = Ait_Marker) And
-                  (Tai_Marker(Blockstart).Kind = mark_AsmBlockEnd);
-           { Blockstart now contains a Tai_marker(mark_AsmBlockEnd) }
-             If GetNextInstruction(BlockStart, HP) And
-                ((HP.typ <> ait_Marker) Or
-                 (Tai_Marker(HP).Kind <> mark_AsmBlockStart)) Then
-             { There is no assembler block anymore after the current one, so }
-             { optimize the next block of "normal" instructions              }
-               pass_1
-             { Otherwise, skip the next assembler block }
-             else
-               blockStart := hp;
-           End;
-       End;
-     inc(pass);
-  until lastLoop;
-  dfa.free;
-
-End;
-
-
 begin
   casmoptimizer:=TCpuAsmOptimizer;
 end.
Index: compiler/x86/aoptx86.pas
===================================================================
--- compiler/x86/aoptx86.pas	(revision 42345)
+++ compiler/x86/aoptx86.pas	(working copy)
@@ -30,16 +30,24 @@
     uses
       globtype,
       cpubase,
-      aasmtai,aasmcpu,
+      aasmtai,aasmcpu,aasmdata,
       cgbase,cgutils,
       aopt,aoptobj;
 
     type
+
       TX86AsmOptimizer = class(TAsmOptimizer)
         function RegLoadedWithNewValue(reg : tregister; hp : tai) : boolean; override;
         function InstructionLoadsFromReg(const reg : TRegister; const hp : tai) : boolean; override;
         function RegReadByInstruction(reg : TRegister; hp : tai) : boolean;
+        procedure Optimize; override;
+        procedure PeepHoleOptPass1; override;
+        function GetFirstInstruction(const Start: tai; var p: tai): Boolean; override;
+        constructor Create(_AsmL: TAsmList); override;
+        destructor Destroy; override;
       protected
+        StatePreserveRegs: TAllUsedRegs;
+
         { checks whether loading a new value in reg1 overwrites the entirety of reg2 }
         function Reg1WriteOverwritesReg2Entirely(reg1, reg2: tregister): boolean;
         { checks whether reading the value in reg1 depends on the value of reg2. This
@@ -56,8 +70,9 @@
 
         function DoSubAddOpt(var p : tai) : Boolean;
 
-        function PrePeepholeOptSxx(var p : tai) : boolean;
-        function PrePeepholeOptIMUL(var p : tai) : boolean;
+        { - Below are optimisations common to both i386 and x86_64
+          - See i386/aoptcpu.pas for i386-specific optimisations
+          - See x86_64/aoptcpu.pas for x86_64-specific optimisations }
 
         function OptPass1AND(var p : tai) : boolean;
         function OptPass1VMOVAP(var p : tai) : boolean;
@@ -71,24 +86,18 @@
         function OptPass1Sub(var p : tai) : boolean;
         function OptPass1SHLSAL(var p : tai) : boolean;
         function OptPass1SETcc(var p: tai): boolean;
-        function OptPass1FSTP(var p: tai): boolean;
-        function OptPass1FLD(var p: tai): boolean;
+        function OptPass1SHRSAR(var p : tai) : boolean;
+        function OptPass1Imul(var p : tai) : boolean;
+        function OptPass1Jmp(var p : tai) : boolean;
+        function OptPass1Jcc(var p : tai) : boolean;
+        function OptPass1CMP(var p : tai) : boolean;
 
-        function OptPass2MOV(var p : tai) : boolean;
-        function OptPass2Imul(var p : tai) : boolean;
-        function OptPass2Jmp(var p : tai) : boolean;
-        function OptPass2Jcc(var p : tai) : boolean;
+        function PostPeepholeOptMov(var p : tai) : Boolean; inline;
+        function PostPeepholeOptCmp(var p : tai) : Boolean; inline;
+        function PostPeepholeOptTestOr(var p : tai) : Boolean; inline;
+        function PostPeepholeOptCall(var p : tai) : Boolean; inline;
+        function PostPeepholeOptLea(var p : tai) : Boolean; inline;
 
-        function PostPeepholeOptMov(var p : tai) : Boolean;
-{$ifdef x86_64} { These post-peephole optimisations only affect 64-bit registers. [Kit] }
-        function PostPeepholeOptMovzx(var p : tai) : Boolean;
-        function PostPeepholeOptXor(var p : tai) : Boolean;
-{$endif}
-        function PostPeepholeOptCmp(var p : tai) : Boolean;
-        function PostPeepholeOptTestOr(var p : tai) : Boolean;
-        function PostPeepholeOptCall(var p : tai) : Boolean;
-        function PostPeepholeOptLea(var p : tai) : Boolean;
-
         procedure OptReferences;
       end;
 
@@ -130,8 +148,11 @@
       aoptutils,
       symconst,symsym,
       cgx86,
-      itcpugas;
+      itcpugas,
+      systems,
+      aoptcpub;
 
+
     function MatchInstruction(const instr: tai; const op: TAsmOp; const opsize: topsizes): boolean;
       begin
         result :=
@@ -494,7 +523,459 @@
       end;
     end;
 
+  procedure TX86AsmOptimizer.Optimize;
+    var
+      HP: tai;
+    begin
+      BlockStart := tai(AsmL.First);
+      pass_1;
+      while Assigned(BlockStart) do
+        begin
 
+          if (cs_opt_peephole in current_settings.optimizerswitches) then
+            begin
+              { Peephole optimizations }
+              PeepHoleOptPass1;
+              PostPeepHoleOpts;
+            end;
+          { free memory }
+          clear;
+          { continue where we left off, BlockEnd is either the start of an }
+          { assembler block or nil}
+          BlockStart := BlockEnd;
+          While Assigned(BlockStart) And
+                (BlockStart.typ = ait_Marker) And
+                (tai_Marker(BlockStart).Kind = mark_AsmBlockStart) Do
+            Begin
+             { we stopped at an assembler block, so skip it    }
+             While GetNextInstruction(BlockStart, BlockStart) And
+                   ((BlockStart.Typ <> Ait_Marker) Or
+                    (tai_Marker(Blockstart).Kind <> mark_AsmBlockEnd)) Do;
+             { blockstart now contains a tai_marker(mark_AsmBlockEnd) }
+             If GetNextInstruction(BlockStart, HP) And
+                ((HP.typ <> ait_Marker) Or
+                 (Tai_Marker(HP).Kind <> mark_AsmBlockStart)) Then
+             { There is no assembler block anymore after the current one, so }
+             { optimize the next block of "normal" instructions              }
+               pass_1
+             { Otherwise, skip the next assembler block }
+             else
+               blockStart := hp;
+            end;
+        end;
+    end;
+
+  procedure TX86AsmOptimizer.PeepHoleOptPass1;
+    var
+      stoploop:boolean;
+
+      { If a group of labels is clustered, change the jump to point to the last one
+        that is still referenced }
+      function CollapseLabelCluster(jump: tai; var lbltai: tai): TAsmLabel; inline;
+        var
+          LastLabel: TAsmLabel;
+          hp2: tai;
+        begin
+          Result := tai_label(lbltai).labsym;
+          LastLabel := Result;
+          hp2 := tai(lbltai.next);
+
+          while (hp2 <> BlockEnd) and (hp2.typ in SkipInstr + [ait_align, ait_label]) do
+            begin
+
+              if (hp2.typ = ait_label) and
+                (tai_label(hp2).labsym.is_used) and
+                (tai_label(hp2).labsym.labeltype = alt_jump) then
+                LastLabel := tai_label(hp2).labsym;
+
+              hp2 := tai(hp2.next);
+            end;
+
+          if (Result <> LastLabel) then
+            begin
+              Result.decrefs;
+              JumpTargetOp(taicpu(jump))^.ref^.symbol := LastLabel;
+              LastLabel.increfs;
+              Result := LastLabel;
+              lbltai := hp2;
+            end;
+        end;
+
+      function UnconditionalJumpShortcut(NCJLabel: TAsmLabel; NCJ: tai; level: Integer): TAsmLabel;
+        var
+          NewLabel: TAsmLabel;
+          LabelTai, AfterLabel: tai;
+        begin
+          Result := nil;
+          if level > 20 then Exit;
+
+          if not ((NCJ.typ=ait_instruction) and IsJumpToLabelUncond(taicpu(NCJ))) then
+            Exit;
+
+          LabelTai := getlabelwithsym(NCJLabel);
+          if not Assigned(LabelTai) then
+            Exit;
+
+          SkipLabels(LabelTai, AfterLabel);
+
+          if (AfterLabel.typ=ait_instruction) and IsJumpToLabelUncond(taicpu(AfterLabel)) then
+            begin
+              NewLabel := TAsmLabel(JumpTargetOp(taicpu(AfterLabel))^.ref^.symbol);
+
+              if NCJLabel = NewLabel then
+                { Identical jump }
+                Exit;
+
+              Result := UnconditionalJumpShortcut(NewLabel, AfterLabel, succ(level));
+              if not Assigned(Result) then
+                Result := NewLabel;
+
+              NCJLabel.decrefs;
+              JumpTargetOp(taicpu(NCJ))^.ref^.symbol := Result;
+              Result.increfs;
+            end;
+        end;
+
+      function ConditionalJumpShortcut(CJLabel: TAsmLabel; var p: tai; hp1: tai): Boolean; inline;
+        var
+          hp2: tai;
+          NCJLabel: TAsmLabel;
+        begin
+          Result := False;
+
+          StripDeadLabels(hp1, hp1);
+
+          if (hp1 <> BlockEnd) and
+            (tai(hp1).typ=ait_instruction) and
+            IsJumpToLabelUncond(taicpu(hp1)) then
+            begin
+
+              NCJLabel := TAsmLabel(JumpTargetOp(taicpu(hp1))^.ref^.symbol);
+
+              if CJLabel = NCJLabel then
+                begin
+{$ifdef DEBUG_JUMP}
+                  WriteLn('JUMP DEBUG: Short-circuited conditional jump');
+{$endif DEBUG_JUMP}
+                  { Both jumps go to the same label }
+                  CJLabel.decrefs;
+{$ifdef cpudelayslot}
+                  RemoveDelaySlot(p);
+{$endif cpudelayslot}
+                  UpdateUsedRegs(tai(p.Next));
+                  AsmL.Remove(p);
+                  p.Free;
+                  p := hp1;
+
+                  Result := True;
+                  Exit;
+                end;
+
+              { Do it now to get it out of the way and to aid the
+                following optimisation }
+              RemoveDeadCodeAfterJump(taicpu(hp1));
+
+              if GetNextInstruction(hp1, hp2) then
+                begin
+
+                  if FindLabel(CJLabel, hp2) then
+                    begin
+                      { change the following jumps:
+                          jmp<cond> CJLabel         jmp<cond_inverted> NCJLabel
+                          jmp       NCJLabel >>>    <code>
+                        CJLabel:                  NCJLabel:
+                          <code>
+                        NCJLabel:
+                      }
+{$if defined(arm) or defined(aarch64)}
+                      if (taicpu(p).condition<>C_None)
+{$if defined(aarch64)}
+                      { can't have conditional branches to
+                        global labels on AArch64, because the
+                        offset may become too big }
+                      and (NCJLabel.bind=AB_LOCAL)
+{$endif aarch64}
+                    then
+                      begin
+{$endif arm or aarch64}
+{$ifdef DEBUG_JUMP}
+                        WriteLn('JUMP DEBUG: Conditional jump optimisation');
+{$endif DEBUG_JUMP}
+                        taicpu(p).condition:=inverse_cond(taicpu(p).condition);
+                        CJLabel.decrefs;
+
+                        JumpTargetOp(taicpu(p))^.ref^.symbol := JumpTargetOp(taicpu(hp1))^.ref^.symbol;
+
+                        { when freeing hp1, the reference count
+                          isn't decreased, so don't increase }
+{$ifdef cpudelayslot}
+                        RemoveDelaySlot(hp1);
+{$endif cpudelayslot}
+                        asml.remove(hp1);
+                        hp1.free;
+
+                        Result := True;
+{$if defined(arm) or defined(aarch64)}
+                      end;
+{$endif arm or aarch64}
+                    end
+                  else if CollapseZeroDistJump(hp1, hp2, NCJLabel) then
+                    Result := True;
+                end;
+            end;
+
+          if GetFinalDestination(taicpu(p),0) then
+            stoploop := False;
+
+          Exit;
+        end;
+
+
+      function JumpOptimizations(var p: tai): Boolean; inline;
+        var
+          hp1, hp2: tai;
+          ThisLabel: TAsmLabel;
+          ThisPassResult: Boolean;
+        begin
+          Result := False;
+          repeat
+            ThisPassResult := False;
+
+            { Remove unreachable code between the jump and the next label }
+            RemoveDeadCodeAfterJump(taicpu(p));
+
+            if GetNextInstruction(p, hp1) and (hp1 <> BlockEnd) then
+              begin
+                SkipEntryExitMarker(hp1,hp1);
+                if (hp1 = BlockEnd) then
+                  Exit;
+
+                ThisLabel := TAsmLabel(JumpTargetOp(taicpu(p))^.ref^.symbol);
+
+                { If there are multiple labels in a row, change the destination to the last one
+                  in order to aid optimisation later }
+                hp2 := getlabelwithsym(ThisLabel);
+
+                { getlabelwithsym returning nil occurs if a label is in a
+                  different block (e.g. on the other side of an asm...end pair). }
+                if Assigned(hp2) then
+                  begin
+                    ThisLabel := CollapseLabelCluster(p, hp2);
+
+                    if CollapseZeroDistJump(p, hp1, ThisLabel) then
+                      begin
+                        stoploop := False;
+                        Result := True;
+                        Continue;
+                      end;
+
+                    if IsJumpToLabelUncond(taicpu(p)) then
+                      ThisPassResult := Assigned(UnconditionalJumpShortcut(ThisLabel, p, 0))
+                    else if (taicpu(p).opcode = aopt_condjmp) then
+                      ThisPassResult := ConditionalJumpShortcut(ThisLabel, p, hp1);
+                  end;
+              end;
+
+            Result := Result or ThisPassResult;
+          until not (ThisPassResult and (p.typ = ait_instruction) and IsJumpToLabel(taicpu(p)));
+        end;
+
+    var
+      p : tai;
+      orig_instr: tasmop;
+      StartPoint: tai;
+      StartingRegs: TAllUsedRegs;
+      FirstInstruction, OptLevel3: Boolean;
+      loopcount: Integer;
+
+    begin
+      { Very minor speed-up.  Reduce the chance of a memory stall and the
+        requirement of using bitwise operations by only checking this flag once
+        and storing a Boolean result on the stack. }
+      OptLevel3 := (cs_opt_level3 in current_settings.optimizerswitches);
+
+      ClearUsedRegs;
+
+      { Search forward from BlockStart until we find the first instruction }
+      if not GetFirstInstruction(BlockStart, StartPoint) then
+        Exit;
+
+      { Preserve the register allocation state at StartPoint }
+      if OptLevel3 then
+        CopyUsedRegs(StartingRegs);
+
+      LoopCount := 5;
+
+      repeat
+        stoploop:=true;
+        p := StartPoint;
+        FirstInstruction := True;
+
+        while (p <> BlockEnd) Do
+          begin
+            prefetch(p.Next);
+
+            case p.Typ Of
+              ait_instruction:
+                begin
+                  orig_instr := taicpu(p).opcode;
+                  {$ifdef DEBUG_OPTALLOC}
+                  if p.Typ=ait_instruction then
+                    InsertLLItem(tai(p.Previous),p,tai_comment.create(strpnew(GetAllocationString(UsedRegs))));
+                  {$endif DEBUG_OPTALLOC}
+
+                  { "MatchInstruction(p, orig_instr)" check below: if the opcode hasn't changed after a successful
+                    optimisation, the peephole optimiser assumes that no further optimisations can be done on that
+                    instruction and moves on instead of calling PeepHoleOptPass1Cpu on it again. }
+
+                  { Handle Jmp Optimizations first }
+                  if IsJumpToLabel(taicpu(p)) and JumpOptimizations(p) then
+                    begin
+                      UpdateUsedRegs(p);
+                      if FirstInstruction then
+                        { Update StartPoint, since the old p was removed;
+                          don't set FirstInstruction to False though, as
+                          the new p might get removed too. }
+                        StartPoint := p;
+
+                      Continue;
+                    end;
+
+                  if PeepHoleOptPass1Cpu(p) then
+                    begin
+                      stoploop:=false;
+                      if (p = BlockEnd) then
+                        Continue;
+
+                      UpdateUsedRegs(p);
+                      if FirstInstruction then
+                        { Update StartPoint, since the old p was removed;
+                          don't set FirstInstruction to False though, as
+                          the new p might get removed too. }
+                        StartPoint := p;
+
+                      if not MatchInstruction(p, orig_instr) then
+                        continue;
+                    end;
+                end;
+              else
+                { Other optimizations }
+                begin
+                end;
+            end;
+            FirstInstruction := False;
+            p := tai(UpdateUsedRegsAndOptimize(p).Next);
+          end;
+
+        { Restore the register allocation state to what it was at StartPoint,
+          ready for the next loop iteration. }
+        if OptLevel3 and not stoploop then
+          RestoreUsedRegs(StartingRegs);
+
+        Dec(loopcount);
+
+      until stoploop or not OptLevel3 or (loopcount <= 0);
+      if (loopcount <= 0) and not stoploop then
+        DebugMsg(SPeepholeOptimization + 'Possible infinite loop in peephole optimizer', BlockStart);
+
+      if OptLevel3 then
+        ReleaseUsedRegs(StartingRegs);
+    end;
+
+  constructor TX86AsmOptimizer.Create(_AsmL: TAsmList);
+    begin
+      inherited Create(_AsmL);
+
+      { Pooled object for preserving the used register state in OptPass1Jcc }
+      CreateUsedRegs(StatePreserveRegs);
+    end;
+
+  destructor TX86AsmOptimizer.Destroy;
+    begin
+      ReleaseUsedRegs(StatePreserveRegs);
+      inherited Destroy;
+    end;
+
+  { Search forward from Start until we find the first instruction }
+  function TX86AsmOptimizer.GetFirstInstruction(const Start: tai; var p: tai): Boolean;
+    begin
+      p := Start;
+      Result := False;
+      while Assigned(p) and (p <> BlockEnd) do
+        begin
+          if (p.Typ = ait_seh_directive) then
+            begin
+              if (tai_seh_directive(p).kind = ash_endprologue) then
+                { End of prologue }
+                begin
+                  UpdateUsedRegs(p);
+                  Result := GetNextInstruction(p, p);
+                  Exit;
+                end
+              else
+                p := tai(p.Next);
+            end
+          else if (p.Typ = ait_regalloc) then
+            begin
+              UpdateUsedRegs(p);
+              repeat
+                p := tai(p.Next);
+                { All of the nearby register allocations have been handled already }
+              until (p.Typ <> ait_regalloc);
+            end
+          else if (p.Typ <> ait_instruction) then
+            begin
+              p := tai(p.Next);
+            end
+          else if
+            { Skip over instructions related to the function prologue }
+            (taicpu(p).opcode = A_PUSH) or
+            ((taicpu(p).opcode = A_LEA) and (taicpu(p).oper[1]^.typ = top_reg) and (getsupreg(taicpu(p).oper[1]^.reg) = RS_ESP)) or
+            ((taicpu(p).opcode = A_SUB) and (taicpu(p).oper[1]^.typ = top_reg) and (getsupreg(taicpu(p).oper[1]^.reg) = RS_ESP)) or
+            ((taicpu(p).opcode = A_MOV) and (taicpu(p).oper[0]^.typ = top_reg) and (
+            { An alternative to PUSH: writing a register to a particular point on the stack }
+              (
+                { Preserving stack pointer }
+                (getsupreg(taicpu(p).oper[0]^.reg) = RS_ESP) and
+                (taicpu(p).oper[1]^.typ = top_reg) and (getsupreg(taicpu(p).oper[1]^.reg) = RS_EBP)
+              ) or (
+                (taicpu(p).oper[1]^.typ = top_ref) and (getsupreg(taicpu(p).oper[1]^.ref^.base) in [RS_ESP, RS_EBP])) and
+                (
+                  { If a scratch register is being written to the stack, it's likely preserving a parameter, so don't exclude }
+                  not ((target_info.system in [system_i386_win32]) and (getsupreg(taicpu(p).oper[0]^.reg) in [RS_RAX, RS_RDX, RS_RCX])) and
+                  not ((target_info.system in [system_x86_64_win64]) and (getsupreg(taicpu(p).oper[0]^.reg) in [RS_RAX, RS_RDX, RS_RCX, RS_R8, RS_R9, RS_R10, RS_R11])) and
+                  not (((target_info.system in systems_linux) or (target_info.system in systems_android)) and (getsupreg(taicpu(p).oper[0]^.reg) in [RS_RDI, RS_RSI, RS_RAX, RS_RDX, RS_RCX, RS_R8, RS_R9, RS_R10, RS_R11]))
+                )
+              )
+            ) or (
+              { Writing XMM registers to the stack }
+              (
+                { Cannot use the "in" operator here as putting these opcodes
+                  into a set causes compiler error e03074. [Kit] }
+                (taicpu(p).opcode = A_MOVDQA) or
+                (taicpu(p).opcode = A_MOVDQU) or
+                (taicpu(p).opcode = A_VMOVDQA) or
+                (taicpu(p).opcode = A_VMOVDQU)
+              ) and
+              (taicpu(p).oper[0]^.typ = top_reg) and
+              (taicpu(p).oper[1]^.typ = top_ref) and (getsupreg(taicpu(p).oper[1]^.ref^.base) = RS_EBP) and
+              (
+                { If a scratch register is being written to the stack, it's likely preserving a parameter, so don't exclude }
+                not (getsupreg(taicpu(p).oper[0]^.reg) in [RS_XMM0, RS_XMM1, RS_XMM2, RS_XMM3, RS_XMM4, RS_XMM5]) or
+                (getsubreg(taicpu(p).oper[0]^.reg) <> R_SUBMMX)
+              )
+            ) then
+              p := tai(p.Next)
+
+          else
+            begin
+              Result := True;
+              Exit;
+            end;
+        end;
+    end;
+
+
 {$ifdef DEBUG_AOPTCPU}
     procedure TX86AsmOptimizer.DebugMsg(const s: string;p : tai);
       begin
@@ -645,7 +1126,7 @@
       end;
 
 
-    function TX86AsmOptimizer.PrePeepholeOptSxx(var p : tai) : boolean;
+    function TX86AsmOptimizer.OptPass1SHRSAR(var p : tai) : boolean;
       var
         hp1 : tai;
         l : TCGInt;
@@ -659,7 +1140,7 @@
 
           either "sar/and", "shl/and" or just "and" depending on const1 and const2 }
         if GetNextInstruction(p, hp1) and
-          MatchInstruction(hp1,A_SHL,[]) and
+          MatchInstruction(hp1,A_SHL) and
           (taicpu(p).oper[0]^.typ = top_const) and
           (taicpu(hp1).oper[0]^.typ = top_const) and
           (taicpu(hp1).opsize = taicpu(p).opsize) and
@@ -701,6 +1182,7 @@
                   else
                     Internalerror(2017050702)
                 end;
+                Result := True;
               end
             else if (taicpu(p).oper[0]^.val = taicpu(hp1).oper[0]^.val) then
               begin
@@ -719,95 +1201,12 @@
                 end;
                 asml.remove(hp1);
                 hp1.free;
+                Result := True;
               end;
           end;
       end;
 
 
-    function TX86AsmOptimizer.PrePeepholeOptIMUL(var p : tai) : boolean;
-      var
-        opsize : topsize;
-        hp1 : tai;
-        tmpref : treference;
-        ShiftValue : Cardinal;
-        BaseValue : TCGInt;
-      begin
-        result:=false;
-        opsize:=taicpu(p).opsize;
-        { changes certain "imul const, %reg"'s to lea sequences }
-        if (MatchOpType(taicpu(p),top_const,top_reg) or
-            MatchOpType(taicpu(p),top_const,top_reg,top_reg)) and
-           (opsize in [S_L{$ifdef x86_64},S_Q{$endif x86_64}]) then
-          if (taicpu(p).oper[0]^.val = 1) then
-            if (taicpu(p).ops = 2) then
-             { remove "imul $1, reg" }
-              begin
-                hp1 := tai(p.Next);
-                asml.remove(p);
-                DebugMsg(SPeepholeOptimization + 'Imul2Nop done',p);
-                p.free;
-                p := hp1;
-                result:=true;
-              end
-            else
-             { change "imul $1, reg1, reg2" to "mov reg1, reg2" }
-              begin
-                hp1 := taicpu.Op_Reg_Reg(A_MOV, opsize, taicpu(p).oper[1]^.reg,taicpu(p).oper[2]^.reg);
-                InsertLLItem(p.previous, p.next, hp1);
-                DebugMsg(SPeepholeOptimization + 'Imul2Mov done',p);
-                p.free;
-                p := hp1;
-              end
-          else if
-           ((taicpu(p).ops <= 2) or
-            (taicpu(p).oper[2]^.typ = Top_Reg)) and
-           not(cs_opt_size in current_settings.optimizerswitches) and
-           (not(GetNextInstruction(p, hp1)) or
-             not((tai(hp1).typ = ait_instruction) and
-                 ((taicpu(hp1).opcode=A_Jcc) and
-                  (taicpu(hp1).condition in [C_O,C_NO])))) then
-            begin
-              {
-                imul X, reg1, reg2 to
-                  lea (reg1,reg1,Y), reg2
-                  shl ZZ,reg2
-                imul XX, reg1 to
-                  lea (reg1,reg1,YY), reg1
-                  shl ZZ,reg2
-
-                This optimziation makes sense for pretty much every x86, except the VIA Nano3000: it has IMUL latency 2, lea/shl pair as well,
-                it does not exist as a separate optimization target in FPC though.
-
-                This optimziation can be applied as long as only two bits are set in the constant and those two bits are separated by
-                at most two zeros
-              }
-              reference_reset(tmpref,1,[]);
-              if (PopCnt(QWord(taicpu(p).oper[0]^.val))=2) and (BsrQWord(taicpu(p).oper[0]^.val)-BsfQWord(taicpu(p).oper[0]^.val)<=3) then
-                begin
-                  ShiftValue:=BsfQWord(taicpu(p).oper[0]^.val);
-                  BaseValue:=taicpu(p).oper[0]^.val shr ShiftValue;
-                  TmpRef.base := taicpu(p).oper[1]^.reg;
-                  TmpRef.index := taicpu(p).oper[1]^.reg;
-                  if not(BaseValue in [3,5,9]) then
-                    Internalerror(2018110101);
-                  TmpRef.ScaleFactor := BaseValue-1;
-                  if (taicpu(p).ops = 2) then
-                    hp1 := taicpu.op_ref_reg(A_LEA, opsize, TmpRef, taicpu(p).oper[1]^.reg)
-                  else
-                    hp1 := taicpu.op_ref_reg(A_LEA, opsize, TmpRef, taicpu(p).oper[2]^.reg);
-                  AsmL.InsertAfter(hp1,p);
-                  DebugMsg(SPeepholeOptimization + 'Imul2LeaShl done',p);
-                  AsmL.Remove(p);
-                  taicpu(hp1).fileinfo:=taicpu(p).fileinfo;
-                  p.free;
-                  p := hp1;
-                  if ShiftValue>0 then
-                    AsmL.InsertAfter(taicpu.op_const_reg(A_SHL, opsize, ShiftValue, taicpu(hp1).oper[1]^.reg),hp1);
-              end;
-            end;
-      end;
-
-
     function TX86AsmOptimizer.RegLoadedWithNewValue(reg: tregister; hp: tai): boolean;
       var
         p: taicpu;
@@ -944,7 +1343,7 @@
         hp2,hp3 : tai;
       begin
         { some x86-64 issue a NOP before the real exit code }
-        if MatchInstruction(p,A_NOP,[]) then
+        if MatchInstruction(p,A_NOP) then
           GetNextInstruction(p,p);
         result:=assigned(p) and (p.typ=ait_instruction) and
         ((taicpu(p).opcode = A_RET) or
@@ -1054,7 +1453,7 @@
           GetNextInstruction(p, hp1) and
           (hp1.typ = ait_instruction) and
           GetNextInstruction(hp1, hp2) and
-          MatchInstruction(hp2,taicpu(p).opcode,[]) and
+          MatchInstruction(hp2,taicpu(p).opcode) and
           OpsEqual(taicpu(hp2).oper[1]^, taicpu(p).oper[0]^) and
           MatchOpType(taicpu(hp2),top_reg,top_reg) and
           MatchOperand(taicpu(hp2).oper[0]^,taicpu(p).oper[1]^) and
@@ -1169,6 +1568,7 @@
                         asml.Remove(hp2);
                         hp2.Free;
                         p:=hp1;
+                        Result := True;
                       end;
                   end;
               end;
@@ -1190,25 +1606,28 @@
             V<Op>X   %mreg1,%mreg2,%mreg4
           ?
         }
-        if GetNextInstruction(p,hp1) and
-          { we mix single and double operations here because we assume that the compiler
-            generates vmovapd only after double operations and vmovaps only after single operations }
-          MatchInstruction(hp1,A_VMOVAPD,A_VMOVAPS,[S_NO]) and
-          MatchOperand(taicpu(p).oper[2]^,taicpu(hp1).oper[0]^) and
-          (taicpu(hp1).oper[1]^.typ=top_reg) then
-          begin
-            TransferUsedRegs(TmpUsedRegs);
-            UpdateUsedRegs(TmpUsedRegs, tai(p.next));
-            if not(RegUsedAfterInstruction(taicpu(hp1).oper[0]^.reg,hp1,TmpUsedRegs)
-             ) then
-              begin
-                taicpu(p).loadoper(2,taicpu(hp1).oper[1]^);
-                DebugMsg(SPeepholeOptimization + 'VOpVmov2VOp done',p);
-                asml.Remove(hp1);
-                hp1.Free;
-                result:=true;
-              end;
-          end;
+        repeat
+          if GetNextInstruction(p,hp1) and
+            { we mix single and double operations here because we assume that the compiler
+              generates vmovapd only after double operations and vmovaps only after single operations }
+            MatchInstruction(hp1,A_VMOVAPD,A_VMOVAPS,[S_NO]) and
+            MatchOperand(taicpu(p).oper[2]^,taicpu(hp1).oper[0]^) and
+            (taicpu(hp1).oper[1]^.typ=top_reg) then
+            begin
+              TransferUsedRegs(TmpUsedRegs);
+              UpdateUsedRegs(TmpUsedRegs, tai(p.next));
+              if not(RegUsedAfterInstruction(taicpu(hp1).oper[0]^.reg,hp1,TmpUsedRegs)
+               ) then
+                begin
+                  taicpu(p).loadoper(2,taicpu(hp1).oper[1]^);
+                  DebugMsg(SPeepholeOptimization + 'VOpVmov2VOp done',p);
+                  asml.Remove(hp1);
+                  hp1.Free;
+                  Continue; { Can we do it again? }
+                end;
+            end;
+          Exit;
+        until False;
       end;
 
 
@@ -2234,6 +3142,103 @@
       end;
 
 
+    function TX86AsmOptimizer.OptPass1CMP(var p: tai): boolean;
+      var
+        hp1, hp2, hp3, hp4: tai;
+        v: TCGInt; { using aint will cause problems when compiling on i8086 }
+      begin
+        Result := False;
+
+        { Though "GetNextInstruction" and the check to see if hp1 is A_Jcc could
+          be factored out, it's better to do the cheap checks first to see if the
+          CMP instruction fulfils the criteria before making the relatively
+          expensive GetNextInstruction call. [Kit] }
+        if (taicpu(p).oper[0]^.typ=Top_const) then
+          begin
+            { cmp $8000,%reg                    neg %reg
+              je target                 -->     jo target
+
+              .... only if register is deallocated before jump.}
+            case Taicpu(p).opsize of
+              S_B: v:=$80;
+              S_W: v:=$8000;
+              S_L: v:=$80000000;
+{$ifdef x86_64}
+              S_Q: v:=$8000000000000000;
+{$endif x86_64}
+              else
+                internalerror(2013112905);
+            end;
+
+            if (taicpu(p).oper[0]^.val=v) and
+              (Taicpu(p).oper[1]^.typ=top_reg) and
+              GetNextInstruction(p, hp1) and
+              (hp1.typ=ait_instruction) and
+              (taicpu(hp1).opcode=A_Jcc) and
+              (Taicpu(hp1).condition in [C_E,C_NE]) and
+              not(RegInUsedRegs(Taicpu(p).oper[1]^.reg, UsedRegs)) then
+            begin
+              Taicpu(p).opcode:=A_NEG;
+              Taicpu(p).loadoper(0,Taicpu(p).oper[1]^);
+              Taicpu(p).clearop(1);
+              Taicpu(p).ops:=1;
+              if taicpu(hp1).condition=C_E then
+                taicpu(hp1).condition := C_O
+              else
+                taicpu(hp1).condition := C_NO;
+
+              { No need to set Result to True because no other optimisations
+                use or check for NEG }
+            end;
+            {
+            @@2:                              @@2:
+              ....                              ....
+              cmp operand1,0
+              jle/jbe @@1
+              dec operand1             -->      sub operand1,1
+              jmp @@2                           jge/jae @@2
+            @@1:                              @@1:
+              ...                               ....}
+            if (taicpu(p).oper[1]^.typ in [top_reg,top_ref]) and
+              (taicpu(p).oper[0]^.val = 0) and
+              GetNextInstruction(p, hp1) and
+              (hp1.typ = ait_instruction) and
+              (taicpu(hp1).is_jmp) and
+              (taicpu(hp1).opcode=A_Jcc) and
+              (taicpu(hp1).condition in [C_LE,C_BE]) and
+              GetNextInstruction(hp1,hp2) and
+              (hp2.typ = ait_instruction) and
+              (taicpu(hp2).opcode = A_DEC) and
+              OpsEqual(taicpu(hp2).oper[0]^,taicpu(p).oper[1]^) and
+              GetNextInstruction(hp2, hp3) and
+              (hp3.typ = ait_instruction) and
+              (taicpu(hp3).is_jmp) and
+              (taicpu(hp3).opcode = A_JMP) and
+              GetNextInstruction(hp3, hp4) and
+              FindLabel(tasmlabel(taicpu(hp1).oper[0]^.ref^.symbol),hp4) then
+            begin
+              taicpu(hp2).Opcode := A_SUB;
+              taicpu(hp2).loadoper(1,taicpu(hp2).oper[0]^);
+              taicpu(hp2).loadConst(0,1);
+              taicpu(hp2).ops:=2;
+              taicpu(hp3).Opcode := A_Jcc;
+
+              if taicpu(hp1).condition=C_LE then
+                taicpu(hp3).condition := C_GE
+              else
+                taicpu(hp3).condition := C_AE;
+
+              asml.remove(p);
+              asml.remove(hp1);
+              p.free;
+              hp1.free;
+              p := hp2;
+              Result := True;
+            end;
+          end;
+      end;
+
+
     function TX86AsmOptimizer.OptPass1Sub(var p : tai) : boolean;
 {$ifdef i386}
       var
@@ -2426,7 +3447,7 @@
           (taicpu(p).oper[0]^.reg = taicpu(hp1).oper[0]^.reg) and
           (taicpu(hp1).oper[0]^.reg = taicpu(hp1).oper[1]^.reg) and
           GetNextInstruction(hp1, hp2) and
-          MatchInstruction(hp2, A_Jcc, []) then
+          MatchInstruction(hp2, A_Jcc) then
           { Change from:             To:
 
             set(C) %reg              j(~C) label
@@ -2476,403 +3497,112 @@
       end;
 
 
-    function TX86AsmOptimizer.OptPass1FSTP(var p: tai): boolean;
-      { returns true if a "continue" should be done after this optimization }
-      var
-        hp1, hp2: tai;
+    function CanBeCMOV(p : tai) : boolean; inline;
       begin
-        Result := false;
-        if MatchOpType(taicpu(p),top_ref) and
-           GetNextInstruction(p, hp1) and
-           (hp1.typ = ait_instruction) and
-           (((taicpu(hp1).opcode = A_FLD) and
-             (taicpu(p).opcode = A_FSTP)) or
-            ((taicpu(p).opcode = A_FISTP) and
-             (taicpu(hp1).opcode = A_FILD))) and
-           MatchOpType(taicpu(hp1),top_ref) and
-           (taicpu(hp1).opsize = taicpu(p).opsize) and
-           RefsEqual(taicpu(p).oper[0]^.ref^, taicpu(hp1).oper[0]^.ref^) then
-          begin
-            { replacing fstp f;fld f by fst f is only valid for extended because of rounding }
-            if (taicpu(p).opsize=S_FX) and
-               GetNextInstruction(hp1, hp2) and
-               (hp2.typ = ait_instruction) and
-               IsExitCode(hp2) and
-               (taicpu(p).oper[0]^.ref^.base = current_procinfo.FramePointer) and
-               not(assigned(current_procinfo.procdef.funcretsym) and
-                   (taicpu(p).oper[0]^.ref^.offset < tabstractnormalvarsym(current_procinfo.procdef.funcretsym).localloc.reference.offset)) and
-               (taicpu(p).oper[0]^.ref^.index = NR_NO) then
-              begin
-                asml.remove(p);
-                asml.remove(hp1);
-                p.free;
-                hp1.free;
-                p := hp2;
-                RemoveLastDeallocForFuncRes(p);
-                Result := true;
-              end
-            (* can't be done because the store operation rounds
-            else
-              { fst can't store an extended value! }
-              if (taicpu(p).opsize <> S_FX) and
-                 (taicpu(p).opsize <> S_IQ) then
-                begin
-                  if (taicpu(p).opcode = A_FSTP) then
-                    taicpu(p).opcode := A_FST
-                  else taicpu(p).opcode := A_FIST;
-                  asml.remove(hp1);
-                  hp1.free;
-                end
-            *)
-          end;
+         CanBeCMOV:=assigned(p) and
+           MatchInstruction(p,A_MOV,[S_W,S_L,S_Q]) and
+           { we can't use cmov ref,reg because
+             ref could be nil and cmov still throws an exception
+             if ref=nil but the mov isn't done (FK)
+            or ((taicpu(p).oper[0]^.typ = top_ref) and
+             (taicpu(p).oper[0]^.ref^.refaddr = addr_no))
+           }
+           MatchOpType(taicpu(p),top_reg,top_reg);
       end;
 
 
-     function TX86AsmOptimizer.OptPass1FLD(var p : tai) : boolean;
+    function TX86AsmOptimizer.OptPass1Imul(var p : tai) : boolean;
       var
-       hp1, hp2: tai;
+        opsize : topsize;
+        hp1 : tai;
+        tmpref : treference;
+        ShiftValue : Cardinal;
+        BaseValue : TCGInt;
       begin
         result:=false;
-        if MatchOpType(taicpu(p),top_reg) and
-           GetNextInstruction(p, hp1) and
-           (hp1.typ = Ait_Instruction) and
-           MatchOpType(taicpu(hp1),top_reg,top_reg) and
-           (taicpu(hp1).oper[0]^.reg = NR_ST) and
-           (taicpu(hp1).oper[1]^.reg = NR_ST1) then
-           { change                        to
-               fld      reg               fxxx reg,st
-               fxxxp    st, st1 (hp1)
-             Remark: non commutative operations must be reversed!
-           }
-          begin
-              case taicpu(hp1).opcode Of
-                A_FMULP,A_FADDP,
-                A_FSUBP,A_FDIVP,A_FSUBRP,A_FDIVRP:
-                  begin
-                    case taicpu(hp1).opcode Of
-                      A_FADDP: taicpu(hp1).opcode := A_FADD;
-                      A_FMULP: taicpu(hp1).opcode := A_FMUL;
-                      A_FSUBP: taicpu(hp1).opcode := A_FSUBR;
-                      A_FSUBRP: taicpu(hp1).opcode := A_FSUB;
-                      A_FDIVP: taicpu(hp1).opcode := A_FDIVR;
-                      A_FDIVRP: taicpu(hp1).opcode := A_FDIV;
-                      else
-                        internalerror(2019050534);
-                    end;
-                    taicpu(hp1).oper[0]^.reg := taicpu(p).oper[0]^.reg;
-                    taicpu(hp1).oper[1]^.reg := NR_ST;
-                    asml.remove(p);
-                    p.free;
-                    p := hp1;
-                    Result:=true;
-                    exit;
-                  end;
-                else
-                  ;
-              end;
-          end
-        else
-          if MatchOpType(taicpu(p),top_ref) and
-             GetNextInstruction(p, hp2) and
-             (hp2.typ = Ait_Instruction) and
-             MatchOpType(taicpu(hp2),top_reg,top_reg) and
-             (taicpu(p).opsize in [S_FS, S_FL]) and
-             (taicpu(hp2).oper[0]^.reg = NR_ST) and
-             (taicpu(hp2).oper[1]^.reg = NR_ST1) then
-            if GetLastInstruction(p, hp1) and
-              MatchInstruction(hp1,A_FLD,A_FST,[taicpu(p).opsize]) and
-              MatchOpType(taicpu(hp1),top_ref) and
-              RefsEqual(taicpu(p).oper[0]^.ref^, taicpu(hp1).oper[0]^.ref^) then
-              if ((taicpu(hp2).opcode = A_FMULP) or
-                  (taicpu(hp2).opcode = A_FADDP)) then
-              { change                      to
-                  fld/fst   mem1  (hp1)       fld/fst   mem1
-                  fld       mem1  (p)         fadd/
-                  faddp/                       fmul     st, st
-                  fmulp  st, st1 (hp2) }
+        opsize:=taicpu(p).opsize;
+        { changes certain "imul const, %reg"'s to lea sequences }
+        if (MatchOpType(taicpu(p),top_const,top_reg) or
+            MatchOpType(taicpu(p),top_const,top_reg,top_reg)) and
+{$ifdef x86_64}
+           (opsize in [S_L,S_Q])
+{$else x86_64}
+           (opsize = S_L)
+{$endif x86_64}
+          then
+          if (taicpu(p).oper[0]^.val = 1) then
+            begin
+              if (taicpu(p).ops = 2) then
+               { remove "imul $1, reg" }
                 begin
+                  hp1 := tai(p.Next);
                   asml.remove(p);
-                  p.free;
-                  p := hp1;
-                  if (taicpu(hp2).opcode = A_FADDP) then
-                    taicpu(hp2).opcode := A_FADD
-                  else
-                    taicpu(hp2).opcode := A_FMUL;
-                  taicpu(hp2).oper[1]^.reg := NR_ST;
+                  DebugMsg(SPeepholeOptimization + 'Imul2Nop done',p);
                 end
               else
-              { change              to
-                  fld/fst mem1 (hp1)   fld/fst mem1
-                  fld     mem1 (p)     fld      st}
+               { change "imul $1, reg1, reg2" to "mov reg1, reg2" }
                 begin
-                  taicpu(p).changeopsize(S_FL);
-                  taicpu(p).loadreg(0,NR_ST);
-                end
-            else
-              begin
-                case taicpu(hp2).opcode Of
-                  A_FMULP,A_FADDP,A_FSUBP,A_FDIVP,A_FSUBRP,A_FDIVRP:
-            { change                        to
-                fld/fst  mem1    (hp1)      fld/fst    mem1
-                fld      mem2    (p)        fxxx       mem2
-                fxxxp    st, st1 (hp2)                      }
+                  hp1 := taicpu.Op_Reg_Reg(A_MOV, opsize, taicpu(p).oper[1]^.reg,taicpu(p).oper[2]^.reg);
+                  InsertLLItem(p.previous, p.next, hp1);
+                  DebugMsg(SPeepholeOptimization + 'Imul2Mov done',p);
+                end;
 
-                    begin
-                      case taicpu(hp2).opcode Of
-                        A_FADDP: taicpu(p).opcode := A_FADD;
-                        A_FMULP: taicpu(p).opcode := A_FMUL;
-                        A_FSUBP: taicpu(p).opcode := A_FSUBR;
-                        A_FSUBRP: taicpu(p).opcode := A_FSUB;
-                        A_FDIVP: taicpu(p).opcode := A_FDIVR;
-                        A_FDIVRP: taicpu(p).opcode := A_FDIV;
-                        else
-                          internalerror(2019050533);
-                      end;
-                      asml.remove(hp2);
-                      hp2.free;
-                    end
-                  else
-                    ;
-                end
-              end
-      end;
+              p.free;
+              p := hp1;
+              Result := True;
+              Exit;
+            end
+          else if
+           ((taicpu(p).ops <= 2) or
+            (taicpu(p).oper[2]^.typ = Top_Reg)) and
+           not(cs_opt_size in current_settings.optimizerswitches) and
+           (not(GetNextInstruction(p, hp1)) or
+             not((tai(hp1).typ = ait_instruction) and
+                 ((taicpu(hp1).opcode=A_Jcc) and
+                  (taicpu(hp1).condition in [C_O,C_NO])))) then
+            begin
+              {
+                imul X, reg1, reg2 to
+                  lea (reg1,reg1,Y), reg2
+                  shl ZZ,reg2
+                imul XX, reg1 to
+                  lea (reg1,reg1,YY), reg1
+                  shl ZZ,reg1
 
+                This optimization makes sense for pretty much every x86, except the VIA Nano3000: it has IMUL latency 2, lea/shl pair as well,
+                it does not exist as a separate optimization target in FPC though.
 
-   function TX86AsmOptimizer.OptPass2MOV(var p : tai) : boolean;
-      var
-       hp1,hp2: tai;
-{$ifdef x86_64}
-       hp3: tai;
-{$endif x86_64}
-      begin
-        Result:=false;
-        if MatchOpType(taicpu(p),top_reg,top_reg) and
-          GetNextInstruction(p, hp1) and
-{$ifdef x86_64}
-          MatchInstruction(hp1,A_MOVZX,A_MOVSX,A_MOVSXD,[]) and
-{$else x86_64}
-          MatchInstruction(hp1,A_MOVZX,A_MOVSX,[]) and
-{$endif x86_64}
-          MatchOpType(taicpu(hp1),top_reg,top_reg) and
-          (taicpu(hp1).oper[0]^.reg = taicpu(p).oper[1]^.reg) then
-          { mov reg1, reg2                mov reg1, reg2
-            movzx/sx reg2, reg3      to   movzx/sx reg1, reg3}
-          begin
-            taicpu(hp1).oper[0]^.reg := taicpu(p).oper[0]^.reg;
-            DebugMsg(SPeepholeOptimization + 'mov %reg1,%reg2; movzx/sx %reg2,%reg3 -> mov %reg1,%reg2;movzx/sx %reg1,%reg3',p);
+                This optimization can be applied as long as only two bits are set in the constant and those two bits are separated by
+                at most two zeros
+              }
+              reference_reset(tmpref,1,[]);
+              if (PopCnt(QWord(taicpu(p).oper[0]^.val))=2) and (BsrQWord(taicpu(p).oper[0]^.val)-BsfQWord(taicpu(p).oper[0]^.val)<=3) then
+                begin
+                  ShiftValue:=BsfQWord(taicpu(p).oper[0]^.val);
+                  BaseValue:=taicpu(p).oper[0]^.val shr ShiftValue;
+                  TmpRef.base := taicpu(p).oper[1]^.reg;
+                  TmpRef.index := taicpu(p).oper[1]^.reg;
+                  if not(BaseValue in [3,5,9]) then
+                    Internalerror(2018110101);
+                  TmpRef.ScaleFactor := BaseValue-1;
+                  if (taicpu(p).ops = 2) then
+                    hp1 := taicpu.op_ref_reg(A_LEA, opsize, TmpRef, taicpu(p).oper[1]^.reg)
+                  else
+                    hp1 := taicpu.op_ref_reg(A_LEA, opsize, TmpRef, taicpu(p).oper[2]^.reg);
+                  AsmL.InsertAfter(hp1,p);
+                  DebugMsg(SPeepholeOptimization + 'Imul2LeaShl done',p);
+                  AsmL.Remove(p);
+                  taicpu(hp1).fileinfo:=taicpu(p).fileinfo;
+                  p.free;
+                  p := hp1;
+                  if ShiftValue>0 then
+                    AsmL.InsertAfter(taicpu.op_const_reg(A_SHL, opsize, ShiftValue, taicpu(hp1).oper[1]^.reg),hp1);
 
-            { Don't remove the MOV command without first checking that reg2 isn't used afterwards,
-              or unless supreg(reg3) = supreg(reg2)). [Kit] }
-
-            TransferUsedRegs(TmpUsedRegs);
-            UpdateUsedRegs(TmpUsedRegs, tai(p.next));
-
-            if (getsupreg(taicpu(p).oper[1]^.reg) = getsupreg(taicpu(hp1).oper[1]^.reg)) or
-              not RegUsedAfterInstruction(taicpu(p).oper[1]^.reg, hp1, TmpUsedRegs)
-            then
-              begin
-                asml.remove(p);
-                p.free;
-                p := hp1;
-                Result:=true;
+                  { LEA won't get optimised, so no need to set Result to True }
+                  Exit;
               end;
+            end;
 
-            exit;
-          end
-        else if MatchOpType(taicpu(p),top_reg,top_reg) and
-          GetNextInstruction(p, hp1) and
-{$ifdef x86_64}
-          MatchInstruction(hp1,[A_MOV,A_MOVZX,A_MOVSX,A_MOVSXD],[]) and
-{$else x86_64}
-          MatchInstruction(hp1,A_MOV,A_MOVZX,A_MOVSX,[]) and
-{$endif x86_64}
-          MatchOpType(taicpu(hp1),top_ref,top_reg) and
-          ((taicpu(hp1).oper[0]^.ref^.base = taicpu(p).oper[1]^.reg)
-           or
-           (taicpu(hp1).oper[0]^.ref^.index = taicpu(p).oper[1]^.reg)
-            ) and
-          (getsupreg(taicpu(hp1).oper[1]^.reg) = getsupreg(taicpu(p).oper[1]^.reg)) then
-          { mov reg1, reg2
-            mov/zx/sx (reg2, ..), reg2      to   mov/zx/sx (reg1, ..), reg2}
-          begin
-            if (taicpu(hp1).oper[0]^.ref^.base = taicpu(p).oper[1]^.reg) then
-              taicpu(hp1).oper[0]^.ref^.base := taicpu(p).oper[0]^.reg;
-            if (taicpu(hp1).oper[0]^.ref^.index = taicpu(p).oper[1]^.reg) then
-              taicpu(hp1).oper[0]^.ref^.index := taicpu(p).oper[0]^.reg;
-            DebugMsg(SPeepholeOptimization + 'MovMovXX2MoVXX 1 done',p);
-            asml.remove(p);
-            p.free;
-            p := hp1;
-            Result:=true;
-            exit;
-          end
-        else if (taicpu(p).oper[0]^.typ = top_ref) and
-          GetNextInstruction(p,hp1) and
-          (hp1.typ = ait_instruction) and
-          { while the GetNextInstruction(hp1,hp2) call could be factored out,
-            doing it separately in both branches allows to do the cheap checks
-            with low probability earlier }
-          ((IsFoldableArithOp(taicpu(hp1),taicpu(p).oper[1]^.reg) and
-            GetNextInstruction(hp1,hp2) and
-            MatchInstruction(hp2,A_MOV,[])
-           ) or
-           ((taicpu(hp1).opcode=A_LEA) and
-             GetNextInstruction(hp1,hp2) and
-             MatchInstruction(hp2,A_MOV,[]) and
-            ((MatchReference(taicpu(hp1).oper[0]^.ref^,taicpu(p).oper[1]^.reg,NR_INVALID) and
-             (taicpu(hp1).oper[0]^.ref^.index<>taicpu(p).oper[1]^.reg)
-              ) or
-             (MatchReference(taicpu(hp1).oper[0]^.ref^,NR_INVALID,
-              taicpu(p).oper[1]^.reg) and
-             (taicpu(hp1).oper[0]^.ref^.base<>taicpu(p).oper[1]^.reg)) or
-             (MatchReferenceWithOffset(taicpu(hp1).oper[0]^.ref^,taicpu(p).oper[1]^.reg,NR_NO)) or
-             (MatchReferenceWithOffset(taicpu(hp1).oper[0]^.ref^,NR_NO,taicpu(p).oper[1]^.reg))
-            ) and
-            ((MatchOperand(taicpu(p).oper[1]^,taicpu(hp2).oper[0]^)) or not(RegUsedAfterInstruction(taicpu(p).oper[1]^.reg,hp1,UsedRegs)))
-           )
-          ) and
-          MatchOperand(taicpu(hp1).oper[taicpu(hp1).ops-1]^,taicpu(hp2).oper[0]^) and
-          (taicpu(hp2).oper[1]^.typ = top_ref) then
-          begin
-            TransferUsedRegs(TmpUsedRegs);
-            UpdateUsedRegs(TmpUsedRegs,tai(p.next));
-            UpdateUsedRegs(TmpUsedRegs,tai(hp1.next));
-            if (RefsEqual(taicpu(hp2).oper[1]^.ref^,taicpu(p).oper[0]^.ref^) and
-              not(RegUsedAfterInstruction(taicpu(hp2).oper[0]^.reg,hp2,TmpUsedRegs))) then
-              { change   mov            (ref), reg
-                         add/sub/or/... reg2/$const, reg
-                         mov            reg, (ref)
-                         # release reg
-                to       add/sub/or/... reg2/$const, (ref)    }
-              begin
-                case taicpu(hp1).opcode of
-                  A_INC,A_DEC,A_NOT,A_NEG :
-                    taicpu(hp1).loadRef(0,taicpu(p).oper[0]^.ref^);
-                  A_LEA :
-                    begin
-                      taicpu(hp1).opcode:=A_ADD;
-                      if (taicpu(hp1).oper[0]^.ref^.index<>taicpu(p).oper[1]^.reg) and (taicpu(hp1).oper[0]^.ref^.index<>NR_NO) then
-                        taicpu(hp1).loadreg(0,taicpu(hp1).oper[0]^.ref^.index)
-                      else if (taicpu(hp1).oper[0]^.ref^.base<>taicpu(p).oper[1]^.reg) and (taicpu(hp1).oper[0]^.ref^.base<>NR_NO) then
-                        taicpu(hp1).loadreg(0,taicpu(hp1).oper[0]^.ref^.base)
-                      else
-                        taicpu(hp1).loadconst(0,taicpu(hp1).oper[0]^.ref^.offset);
-                      taicpu(hp1).loadRef(1,taicpu(p).oper[0]^.ref^);
-                      DebugMsg(SPeepholeOptimization + 'FoldLea done',hp1);
-                    end
-                  else
-                    taicpu(hp1).loadRef(1,taicpu(p).oper[0]^.ref^);
-                end;
-                asml.remove(p);
-                asml.remove(hp2);
-                p.free;
-                hp2.free;
-                p := hp1
-              end;
-            Exit;
-{$ifdef x86_64}
-          end
-        else if (taicpu(p).opsize = S_L) and
-          (taicpu(p).oper[1]^.typ = top_reg) and
-          (
-            GetNextInstruction(p, hp1) and
-            MatchInstruction(hp1, A_MOV,[]) and
-            (taicpu(hp1).opsize = S_L) and
-            (taicpu(hp1).oper[1]^.typ = top_reg)
-          ) and (
-            GetNextInstruction(hp1, hp2) and
-            (tai(hp2).typ=ait_instruction) and
-            (taicpu(hp2).opsize = S_Q) and
-            (
-              (
-                MatchInstruction(hp2, A_ADD,[]) and
-                (taicpu(hp2).opsize = S_Q) and
-                (taicpu(hp2).oper[0]^.typ = top_reg) and (taicpu(hp2).oper[1]^.typ = top_reg) and
-                (
-                  (
-                    (getsupreg(taicpu(hp2).oper[0]^.reg) = getsupreg(taicpu(p).oper[1]^.reg)) and
-                    (getsupreg(taicpu(hp2).oper[1]^.reg) = getsupreg(taicpu(hp1).oper[1]^.reg))
-                  ) or (
-                    (getsupreg(taicpu(hp2).oper[0]^.reg) = getsupreg(taicpu(hp1).oper[1]^.reg)) and
-                    (getsupreg(taicpu(hp2).oper[1]^.reg) = getsupreg(taicpu(p).oper[1]^.reg))
-                  )
-                )
-              ) or (
-                MatchInstruction(hp2, A_LEA,[]) and
-                (taicpu(hp2).oper[0]^.ref^.offset = 0) and
-                (taicpu(hp2).oper[0]^.ref^.scalefactor <= 1) and
-                (
-                  (
-                    (getsupreg(taicpu(hp2).oper[0]^.ref^.base) = getsupreg(taicpu(p).oper[1]^.reg)) and
-                    (getsupreg(taicpu(hp2).oper[0]^.ref^.index) = getsupreg(taicpu(hp1).oper[1]^.reg))
-                  ) or (
-                    (getsupreg(taicpu(hp2).oper[0]^.ref^.base) = getsupreg(taicpu(hp1).oper[1]^.reg)) and
-                    (getsupreg(taicpu(hp2).oper[0]^.ref^.index) = getsupreg(taicpu(p).oper[1]^.reg))
-                  )
-                ) and (
-                  (
-                    (getsupreg(taicpu(hp2).oper[1]^.reg) = getsupreg(taicpu(hp1).oper[1]^.reg))
-                  ) or (
-                    (getsupreg(taicpu(hp2).oper[1]^.reg) = getsupreg(taicpu(p).oper[1]^.reg))
-                  )
-                )
-              )
-            )
-          ) and (
-            GetNextInstruction(hp2, hp3) and
-            MatchInstruction(hp3, A_SHR,[]) and
-            (taicpu(hp3).opsize = S_Q) and
-            (taicpu(hp3).oper[0]^.typ = top_const) and (taicpu(hp2).oper[1]^.typ = top_reg) and
-            (taicpu(hp3).oper[0]^.val = 1) and
-            (taicpu(hp3).oper[1]^.reg = taicpu(hp2).oper[1]^.reg)
-          ) then
-          begin
-            { Change   movl    x,    reg1d         movl    x,    reg1d
-                       movl    y,    reg2d         movl    y,    reg2d
-                       addq    reg2q,reg1q   or    leaq    (reg1q,reg2q),reg1q
-                       shrq    $1,   reg1q         shrq    $1,   reg1q
-
-            ( reg1d and reg2d can be switched around in the first two instructions )
-
-              To       movl    x,    reg1d
-                       addl    y,    reg1d
-                       rcrl    $1,   reg1d
-
-              This corresponds to the common expression (x + y) shr 1, where
-              x and y are Cardinals (replacing "shr 1" with "div 2" produces
-              smaller code, but won't account for x + y causing an overflow). [Kit]
-            }
-
-            if (getsupreg(taicpu(hp2).oper[1]^.reg) = getsupreg(taicpu(hp1).oper[1]^.reg)) then
-              { Change first MOV command to have the same register as the final output }
-              taicpu(p).oper[1]^.reg := taicpu(hp1).oper[1]^.reg
-            else
-              taicpu(hp1).oper[1]^.reg := taicpu(p).oper[1]^.reg;
-
-            { Change second MOV command to an ADD command. This is easier than
-              converting the existing command because it means we don't have to
-              touch 'y', which might be a complicated reference, and also the
-              fact that the third command might either be ADD or LEA. [Kit] }
-            taicpu(hp1).opcode := A_ADD;
-
-            { Delete old ADD/LEA instruction }
-            asml.remove(hp2);
-            hp2.free;
-
-            { Convert "shrq $1, reg1q" to "rcr $1, reg1d" }
-            taicpu(hp3).opcode := A_RCR;
-            taicpu(hp3).changeopsize(S_L);
-            setsubreg(taicpu(hp3).oper[1]^.reg, R_SUBD);
-{$endif x86_64}
-          end;
-      end;
-
-
-    function TX86AsmOptimizer.OptPass2Imul(var p : tai) : boolean;
-      var
-        hp1 : tai;
-      begin
-        Result:=false;
         if (taicpu(p).ops >= 2) and
            ((taicpu(p).oper[0]^.typ = top_const) or
             ((taicpu(p).oper[0]^.typ = top_ref) and (taicpu(p).oper[0]^.ref^.refaddr=addr_full))) and
@@ -2881,7 +3611,7 @@
             ((taicpu(p).oper[2]^.typ = top_reg) and
              (taicpu(p).oper[2]^.reg = taicpu(p).oper[1]^.reg))) and
            GetLastInstruction(p,hp1) and
-           MatchInstruction(hp1,A_MOV,[]) and
+           MatchInstruction(hp1,A_MOV) and
            MatchOpType(taicpu(hp1),top_reg,top_reg) and
            ((taicpu(hp1).oper[1]^.reg = taicpu(p).oper[1]^.reg) or
             ((taicpu(hp1).opsize=S_L) and (taicpu(p).opsize=S_Q) and SuperRegistersEqual(taicpu(hp1).oper[1]^.reg,taicpu(p).oper[1]^.reg))) then
@@ -2898,6 +3628,10 @@
                 DebugMsg(SPeepholeOptimization + 'MovImul2Imul done',p);
                 asml.remove(hp1);
                 hp1.free;
+                { Though p is still IMUL, the overhauled peephole optimiser
+                  won't call OptPass1Imul again because the instruction type
+                  hasn't changed (it's assumed that if p still has the same
+                  instruction, no more optimisations can be done on it) }
                 result:=true;
               end;
           end;
@@ -2904,7 +3638,7 @@
       end;
 
 
-    function TX86AsmOptimizer.OptPass2Jmp(var p : tai) : boolean;
+    function TX86AsmOptimizer.OptPass1Jmp(var p : tai) : boolean;
       var
         hp1 : tai;
       begin
@@ -2925,6 +3659,9 @@
             if (taicpu(p).condition=C_None) and assigned(hp1) and SkipLabels(hp1,hp1) and
               MatchInstruction(hp1,A_RET,[S_NO]) then
               begin
+                { This jump optimisation would be missed otherwise. [Kit] }
+                RemoveDeadCodeAfterJump(taicpu(p));
+
                 tasmlabel(taicpu(p).oper[0]^.ref^.symbol).decrefs;
                 taicpu(p).opcode:=A_RET;
                 taicpu(p).is_jmp:=false;
@@ -2943,23 +3680,9 @@
       end;
 
 
-    function CanBeCMOV(p : tai) : boolean;
-      begin
-         CanBeCMOV:=assigned(p) and
-           MatchInstruction(p,A_MOV,[S_W,S_L,S_Q]) and
-           { we can't use cmov ref,reg because
-             ref could be nil and cmov still throws an exception
-             if ref=nil but the mov isn't done (FK)
-            or ((taicpu(p).oper[0]^.typ = top_ref) and
-             (taicpu(p).oper[0]^.ref^.refaddr = addr_no))
-           }
-           MatchOpType(taicpu(p),top_reg,top_reg);
-      end;
-
-
-    function TX86AsmOptimizer.OptPass2Jcc(var p : tai) : boolean;
+    function TX86AsmOptimizer.OptPass1Jcc(var p : tai) : boolean;
       var
-        hp1,hp2,hp3,hp4,hpmov2: tai;
+        hp1,hp2,hp3,hp4,hpmov1,hpmov2: tai;
         carryadd_opcode : TAsmOp;
         l : Longint;
         condition : TAsmCond;
@@ -2967,339 +3690,349 @@
       begin
         result:=false;
         symbol:=nil;
-        if GetNextInstruction(p,hp1) then
-          begin
-            symbol := TAsmLabel(taicpu(p).oper[0]^.ref^.symbol);
+        if not GetNextInstruction(p,hp1) or (hp1.typ <> ait_instruction) then
+          { No next instruction, so exit }
+          Exit;
 
-            if (hp1.typ=ait_instruction) and
-               GetNextInstruction(hp1,hp2) and (hp2.typ=ait_label) and
-               (Tasmlabel(symbol) = Tai_label(hp2).labsym) then
-                 { jb @@1                            cmc
-                   inc/dec operand           -->     adc/sbb operand,0
-                   @@1:
+        symbol := TAsmLabel(taicpu(p).oper[0]^.ref^.symbol);
 
-                   ... and ...
+        if (hp1.typ=ait_instruction) and
+           GetNextInstruction(hp1,hp2) and (hp2.typ=ait_label) and
+           (Tasmlabel(symbol) = Tai_label(hp2).labsym) then
+             { jb @@1                            cmc
+               inc/dec operand           -->     adc/sbb operand,0
+               @@1:
 
-                   jnb @@1
-                   inc/dec operand           -->     adc/sbb operand,0
-                   @@1: }
+               ... and ...
+
+               jnb @@1
+               inc/dec operand           -->     adc/sbb operand,0
+               @@1: }
+          begin
+            carryadd_opcode:=A_NONE;
+            if Taicpu(p).condition in [C_NAE,C_B] then
               begin
-                carryadd_opcode:=A_NONE;
-                if Taicpu(p).condition in [C_NAE,C_B] then
+                if Taicpu(hp1).opcode=A_INC then
+                  carryadd_opcode:=A_ADC;
+                if Taicpu(hp1).opcode=A_DEC then
+                  carryadd_opcode:=A_SBB;
+                if carryadd_opcode<>A_NONE then
                   begin
-                    if Taicpu(hp1).opcode=A_INC then
-                      carryadd_opcode:=A_ADC;
-                    if Taicpu(hp1).opcode=A_DEC then
-                      carryadd_opcode:=A_SBB;
-                    if carryadd_opcode<>A_NONE then
-                      begin
-                        Taicpu(p).clearop(0);
-                        Taicpu(p).ops:=0;
-                        Taicpu(p).is_jmp:=false;
-                        Taicpu(p).opcode:=A_CMC;
-                        Taicpu(p).condition:=C_NONE;
-                        Taicpu(hp1).ops:=2;
-                        Taicpu(hp1).loadoper(1,Taicpu(hp1).oper[0]^);
-                        Taicpu(hp1).loadconst(0,0);
-                        Taicpu(hp1).opcode:=carryadd_opcode;
-                        result:=true;
-                        exit;
-                      end;
+                    Taicpu(p).clearop(0);
+                    Taicpu(p).ops:=0;
+                    Taicpu(p).is_jmp:=false;
+                    Taicpu(p).opcode:=A_CMC;
+                    Taicpu(p).condition:=C_NONE;
+                    Taicpu(hp1).ops:=2;
+                    Taicpu(hp1).loadoper(1,Taicpu(hp1).oper[0]^);
+                    Taicpu(hp1).loadconst(0,0);
+                    Taicpu(hp1).opcode:=carryadd_opcode;
+                    result:=true;
+                    exit;
                   end;
-                if Taicpu(p).condition in [C_AE,C_NB] then
+              end;
+            if Taicpu(p).condition in [C_AE,C_NB] then
+              begin
+                if Taicpu(hp1).opcode=A_INC then
+                  carryadd_opcode:=A_ADC;
+                if Taicpu(hp1).opcode=A_DEC then
+                  carryadd_opcode:=A_SBB;
+                if carryadd_opcode<>A_NONE then
                   begin
-                    if Taicpu(hp1).opcode=A_INC then
-                      carryadd_opcode:=A_ADC;
-                    if Taicpu(hp1).opcode=A_DEC then
-                      carryadd_opcode:=A_SBB;
-                    if carryadd_opcode<>A_NONE then
-                      begin
-                        asml.remove(p);
-                        p.free;
-                        Taicpu(hp1).ops:=2;
-                        Taicpu(hp1).loadoper(1,Taicpu(hp1).oper[0]^);
-                        Taicpu(hp1).loadconst(0,0);
-                        Taicpu(hp1).opcode:=carryadd_opcode;
-                        p:=hp1;
-                        result:=true;
-                        exit;
-                      end;
+                    asml.remove(p);
+                    p.free;
+                    Taicpu(hp1).ops:=2;
+                    Taicpu(hp1).loadoper(1,Taicpu(hp1).oper[0]^);
+                    Taicpu(hp1).loadconst(0,0);
+                    Taicpu(hp1).opcode:=carryadd_opcode;
+                    p:=hp1;
+                    result:=true;
+                    exit;
                   end;
               end;
+          end;
 
-            if ((hp1.typ = ait_label) and (symbol = tai_label(hp1).labsym))
-                or ((hp1.typ = ait_align) and GetNextInstruction(hp1, hp2) and (hp2.typ = ait_label) and (symbol = tai_label(hp2).labsym)) then
+        if CPUX86_HAS_CMOV in cpu_capabilities[current_settings.cputype] then
+          begin
+            { check for
+                   jCC   xxx
+                   <several movs>
+                xxx:
+            }
+            l:=0;
+            { We already have hp1 from above }
+
+            { Look ahead with the register usage }
+            TransferUsedRegs(StatePreserveRegs); { We can't use TmpUsedRegs because that's used by OptPass1MOV }
+            UpdateUsedRegs(tai(p.Next));
+
+            hpmov1 := hp1;
+            while (hp1 <> BlockEnd) and
+              (hp1.typ = ait_instruction) and
+              (taicpu(hp1).opcode = A_MOV) do
+              { Will stop on labels }
               begin
-                { If Jcc is immediately followed by the label that it's supposed to jump to, remove it }
-                DebugMsg(SPeepholeOptimization + 'Removed conditional jump whose destination was immediately after it', p);
-                UpdateUsedRegs(hp1);
-
-                TAsmLabel(symbol).decrefs;
-                { if the label refs. reach zero, remove any alignment before the label }
-                if (hp1.typ = ait_align) then
+                { Check first whether the MOV itself can be optimised }
+                if OptPass1MOV(hp1) then
                   begin
-                    UpdateUsedRegs(hp2);
-                    if (TAsmLabel(symbol).getrefs = 0) then
-                    begin
-                      asml.Remove(hp1);
-                      hp1.Free;
-                    end;
-                    hp1 := hp2; { Set hp1 to the label }
+                    UpdateUsedRegs(hp1);
+                    Continue;
                   end;
 
-                asml.remove(p);
-                p.free;
+                if not CanBeCMOV(hp1) then
+                  Break;
 
-                if (TAsmLabel(symbol).getrefs = 0) then
-                  begin
-                    GetNextInstruction(hp1, p); { Instruction following the label }
-                    asml.remove(hp1);
-                    hp1.free;
+                UpdateUsedRegs(tai(hp1.Next));
+                inc(l);
+                GetNextInstruction(hp1,hp1);
+              end;
 
-                    UpdateUsedRegs(p);
-                    Result := True;
-                  end
-                else
+            if (hp1 <> BlockEnd) then
+              begin
+                if FindLabel(tasmlabel(symbol),hp1) then
                   begin
-                    { We don't need to set the result to True because we know hp1
-                      is a label and won't trigger any optimisation routines. [Kit] }
-                    p := hp1;
-                  end;
+                    if (l<=4) and (l>0) then
+                      begin
+                        condition:=inverse_cond(taicpu(p).condition);
+                        repeat
+                          taicpu(hpmov1).opcode:=A_CMOVcc;
+                          taicpu(hpmov1).condition:=condition;
+                          GetNextInstruction(hpmov1,hpmov1);
+                        until not(CanBeCMOV(hpmov1));
 
-                Exit;
-              end;
-          end;
+                        { Don't decrement the reference count on the label yet, otherwise
+                          GetNextInstruction might skip over the label if it drops to
+                          zero. }
+                        GetNextInstruction(hp1,hp2);
+                        UpdateUsedRegs(tai(hp1.Next));
 
-{$ifndef i8086}
-        if CPUX86_HAS_CMOV in cpu_capabilities[current_settings.cputype] then
-          begin
-             { check for
-                    jCC   xxx
-                    <several movs>
-                 xxx:
-             }
-             l:=0;
-             GetNextInstruction(p, hp1);
-             while assigned(hp1) and
-               CanBeCMOV(hp1) and
-               { stop on labels }
-               not(hp1.typ=ait_label) do
-               begin
-                  inc(l);
-                  GetNextInstruction(hp1,hp1);
-               end;
-             if assigned(hp1) then
-               begin
-                  if FindLabel(tasmlabel(symbol),hp1) then
-                    begin
-                      if (l<=4) and (l>0) then
-                        begin
-                          condition:=inverse_cond(taicpu(p).condition);
-                          GetNextInstruction(p,hp1);
-                          repeat
-                            if not Assigned(hp1) then
-                              InternalError(2018062900);
+                        { if the label refs. reach zero, remove any alignment before the label }
+                        if (hp1.typ = ait_align) and (hp2.typ = ait_label) then
+                          begin
+                            { Ref = 1 means it will drop to zero }
+                            if (tasmlabel(symbol).getrefs=1) then
+                              begin
+                                asml.Remove(hp1);
+                                hp1.Free;
+                              end;
+                          end
+                        else
+                          hp2 := hp1;
 
-                            taicpu(hp1).opcode:=A_CMOVcc;
-                            taicpu(hp1).condition:=condition;
-                            UpdateUsedRegs(hp1);
-                            GetNextInstruction(hp1,hp1);
-                          until not(CanBeCMOV(hp1));
+                        if not Assigned(hp2) then
+                          InternalError(2018062910);
 
-                          { Don't decrement the reference count on the label yet, otherwise
-                            GetNextInstruction might skip over the label if it drops to
-                            zero. }
-                          GetNextInstruction(hp1,hp2);
+                        if (hp2.typ <> ait_label) then
+                          begin
+                            { There's something other than CMOVs here.  Move the original jump
+                              to right before this point, then break out.
 
-                          { if the label refs. reach zero, remove any alignment before the label }
-                          if (hp1.typ = ait_align) and (hp2.typ = ait_label) then
-                            begin
-                              { Ref = 1 means it will drop to zero }
-                              if (tasmlabel(symbol).getrefs=1) then
-                                begin
-                                  asml.Remove(hp1);
-                                  hp1.Free;
-                                end;
-                            end
-                          else
-                            hp2 := hp1;
+                              Originally this was part of the above internal error, but it got
+                              triggered on the bootstrapping process sometimes. Investigate. [Kit] }
 
-                          if not Assigned(hp2) then
-                            InternalError(2018062910);
+                            asml.remove(p);
+                            asml.insertbefore(p, hp2);
 
-                          if (hp2.typ <> ait_label) then
-                            begin
-                              { There's something other than CMOVs here.  Move the original jump
-                                to right before this point, then break out.
+                            UpdateUsedRegs(p);
+                            DebugMsg('Jcc/CMOVcc drop-out', p);
+                            Result := True;
+                            Exit;
+                          end;
 
-                                Originally this was part of the above internal error, but it got
-                                triggered on the bootstrapping process sometimes. Investigate. [Kit] }
-                              asml.remove(p);
-                              asml.insertbefore(p, hp2);
-                              DebugMsg('Jcc/CMOVcc drop-out', p);
-                              UpdateUsedRegs(p);
-                              Result := True;
-                              Exit;
-                            end;
+                        UpdateUsedRegs(tai(hp2.Next));
 
-                          { Now we can safely decrement the reference count }
-                          tasmlabel(symbol).decrefs;
+                        { Now we can safely decrement the reference count }
+                        tasmlabel(symbol).decrefs;
 
-                          { Remove the original jump }
-                          asml.Remove(p);
-                          p.Free;
+                        { Remove the original jump }
+                        asml.Remove(p);
+                        p.Free;
 
-                          GetNextInstruction(hp2, p); { Instruction after the label }
+                        GetNextInstruction(hp2, p); { Instruction after the label }
 
-                          { Remove the label if this is its final reference }
-                          if (tasmlabel(symbol).getrefs=0) then
-                            begin
-                              asml.remove(hp2);
-                              hp2.free;
-                            end;
+                        { Remove the label if this is its final reference }
+                        if (tasmlabel(symbol).getrefs=0) then
+                          begin
+                            asml.remove(hp2);
+                            hp2.free;
+                          end;
 
-                          if Assigned(p) then
-                            begin
-                              UpdateUsedRegs(p);
-                              result:=true;
-                            end;
-                          exit;
-                        end;
-                    end
-                  else
-                    begin
-                       { check further for
-                              jCC   xxx
-                              <several movs 1>
-                              jmp   yyy
-                      xxx:
-                              <several movs 2>
-                      yyy:
-                       }
-                      { hp2 points to jmp yyy }
-                      hp2:=hp1;
-                      { skip hp1 to xxx (or an align right before it) }
-                      GetNextInstruction(hp1, hp1);
+                        if Assigned(p) and (l > 0) then
+                          result:=true;
 
-                      if assigned(hp2) and
-                        assigned(hp1) and
-                        (l<=3) and
-                        (hp2.typ=ait_instruction) and
-                        (taicpu(hp2).is_jmp) and
-                        (taicpu(hp2).condition=C_None) and
-                        { real label and jump, no further references to the
-                          label are allowed }
-                        (tasmlabel(symbol).getrefs=1) and
-                        FindLabel(tasmlabel(symbol),hp1) then
-                         begin
-                           l:=0;
-                           { skip hp1 to <several moves 2> }
-                           if (hp1.typ = ait_align) then
-                             GetNextInstruction(hp1, hp1);
+                        Exit;
+                      end;
+                  end
+                else if (l<=3) and (hp1 <> BlockEnd) and (hp1.typ=ait_instruction) and (taicpu(hp1).opcode = A_JMP) then
+                  begin
+                    { check further for
+                            jCC   xxx
+                            <several movs 1>
+                            jmp   yyy                <-- Unconditional jump only
+                    xxx:
+                            <several movs 2>
+                    yyy:
+                     }
+                    { hp2 points to jmp yyy }
+                    hp2:=hp1;
+                    { skip hp1 to xxx (or an align right before it) }
+                    GetNextInstruction(hp1, hp1);
 
-                           GetNextInstruction(hp1, hpmov2);
+                    { real label and jump, no further references to the
+                      label are allowed }
+                    if (hp1 <> BlockEnd) and (tasmlabel(symbol).getrefs=1) and
+                      FindLabel(tasmlabel(symbol),hp1) then
+                      begin
+                        { Do the first batch of CMOVs }
+                        condition:=inverse_cond(taicpu(p).condition);
+                        repeat
+                          taicpu(hpmov1).opcode:=A_CMOVcc;
+                          taicpu(hpmov1).condition:=condition;
+                          GetNextInstruction(hpmov1,hpmov1);
+                        until (hpmov1 = BlockEnd) or
+                          not(CanBeCMOV(hpmov1));
 
-                           hp1 := hpmov2;
-                           while assigned(hp1) and
-                             CanBeCMOV(hp1) do
+                        { It's safe to keep UsedRegs as is now, so save the
+                          state while the other set of MOVs is dealt with }
+                        TransferUsedRegs(StatePreserveRegs);
+                        UpdateUsedRegs(tai(hp2.next));
+                        l:=0;
+                        { skip hp1 to <several moves 2> }
+                        if (hp1.typ = ait_align) then
+                          begin
+                            UpdateUsedRegs(hp1);
+                            GetNextInstruction(hp1, hp1);
+                          end;
+
+                        UpdateUsedRegs(tai(hp1.Next));
+                        GetNextInstruction(hp1, hpmov2);
+
+                        hp1 := hpmov2;
+                        while assigned(hp1) and
+                          (hp1.typ = ait_instruction) and
+                          (taicpu(hp1).opcode = A_MOV) do
+                          begin
+                           { Check first whether the MOV itself can be optimised }
+                           if OptPass1MOV(hp1) then
                              begin
-                               inc(l);
-                               GetNextInstruction(hp1, hp1);
+                               UpdateUsedRegs(hp1);
+                               Continue;
                              end;
-                           { hp1 points to yyy (or an align right before it) }
-                           hp3 := hp1;
-                           if assigned(hp1) and
-                             FindLabel(tasmlabel(taicpu(hp2).oper[0]^.ref^.symbol),hp1) then
-                             begin
-                                condition:=inverse_cond(taicpu(p).condition);
-                                GetNextInstruction(p,hp1);
-                                repeat
-                                  taicpu(hp1).opcode:=A_CMOVcc;
-                                  taicpu(hp1).condition:=condition;
-                                  UpdateUsedRegs(hp1);
-                                  GetNextInstruction(hp1,hp1);
-                                until not(assigned(hp1)) or
-                                  not(CanBeCMOV(hp1));
 
-                                condition:=inverse_cond(condition);
-                                hp1 := hpmov2;
-                                { hp1 is now at <several movs 2> }
-                                while Assigned(hp1) and CanBeCMOV(hp1) do
-                                  begin
-                                    taicpu(hp1).opcode:=A_CMOVcc;
-                                    taicpu(hp1).condition:=condition;
-                                    UpdateUsedRegs(hp1);
-                                    GetNextInstruction(hp1,hp1);
-                                  end;
+                            if not CanBeCMOV(hp1) then
+                              Break;
 
-                                hp1 := p;
+                            UpdateUsedRegs(tai(hp1.Next));
+                            inc(l);
+                            GetNextInstruction(hp1, hp1);
+                          end;
+                        { if yyy is the expected label, then hp1 points to it (or an align right before it) }
+                        hp3 := hp1;
+                        if assigned(hp1) and
+                          FindLabel(tasmlabel(taicpu(hp2).oper[0]^.ref^.symbol),hp1) then
+                          begin
 
-                                { Get first instruction after label }
-                                GetNextInstruction(hp3, p);
+                            condition:=inverse_cond(condition);
+                            hp1 := hpmov2;
+                            { hp1 is now at <several movs 2> }
+                            while Assigned(hp1) and CanBeCMOV(hp1) do
+                              begin
+                                taicpu(hp1).opcode:=A_CMOVcc;
+                                taicpu(hp1).condition:=condition;
+                                GetNextInstruction(hp1,hp1);
+                              end;
 
-                                if assigned(p) and (hp3.typ = ait_align) then
-                                  GetNextInstruction(p, p);
+                            hp1 := p;
 
-                                { Don't dereference yet, as doing so will cause
-                                  GetNextInstruction to skip the label and
-                                  optional align marker. [Kit] }
-                                GetNextInstruction(hp2, hp4);
+                            UpdateUsedRegs(tai(hp3.Next));
 
-                                { remove jCC }
+                            { Get first instruction after label }
+                            GetNextInstruction(hp3, p);
+
+                            if assigned(p) and (hp3.typ = ait_align) then
+                              begin
+                                UpdateUsedRegs(tai(p.Next));
+                                GetNextInstruction(p, p);
+                              end;
+
+                            { Don't dereference yet, as doing so will cause
+                              GetNextInstruction to skip the label and
+                              optional align marker. [Kit] }
+                            GetNextInstruction(hp2, hp4);
+
+                            { remove jCC }
+                            asml.remove(hp1);
+                            hp1.free;
+
+                            { Remove label xxx (it will have a ref of zero due to the initial check) }
+                            if (hp4.typ = ait_align) then
+                              begin
+                                { Account for alignment as well }
+                                GetNextInstruction(hp4, hp1);
                                 asml.remove(hp1);
                                 hp1.free;
+                              end;
 
-                                { Remove label xxx (it will have a ref of zero due to the initial check }
-                                if (hp4.typ = ait_align) then
+                            asml.remove(hp4);
+                            hp4.free;
+
+                            { Now we can safely decrement it }
+                            tasmlabel(symbol).decrefs;
+
+                            { remove jmp }
+                            symbol := taicpu(hp2).oper[0]^.ref^.symbol;
+
+                            asml.remove(hp2);
+                            hp2.free;
+
+                            { Remove label yyy (and the optional alignment) if its reference will fall to zero }
+                            if tasmlabel(symbol).getrefs = 1 then
+                              begin
+                                if (hp3.typ = ait_align) then
                                   begin
                                     { Account for alignment as well }
-                                    GetNextInstruction(hp4, hp1);
+                                    GetNextInstruction(hp3, hp1);
                                     asml.remove(hp1);
                                     hp1.free;
                                   end;
 
-                                asml.remove(hp4);
-                                hp4.free;
+                                asml.remove(hp3);
+                                hp3.free;
 
-                                { Now we can safely decrement it }
+                                { As before, now we can safely decrement it }
                                 tasmlabel(symbol).decrefs;
+                              end;
 
-                                { remove jmp }
-                                symbol := taicpu(hp2).oper[0]^.ref^.symbol;
+                            if Assigned(p) and (l > 0) then
+                              result:=true;
 
-                                asml.remove(hp2);
-                                hp2.free;
+                            Exit;
+                          end
+                        else
+                          begin
+                            { The first batch of MOVs was changed to CMOV instructions, but not the second }
 
-                                { Remove label yyy (and the optional alignment) if its reference will fall to zero }
-                                if tasmlabel(symbol).getrefs = 1 then
-                                  begin
-                                    if (hp3.typ = ait_align) then
-                                      begin
-                                        { Account for alignment as well }
-                                        GetNextInstruction(hp3, hp1);
-                                        asml.remove(hp1);
-                                        hp1.free;
-                                      end;
+                            { remove jCC }
+                            tasmlabel(symbol).decrefs;
+                            asml.remove(p);
+                            p.free;
 
-                                    asml.remove(hp3);
-                                    hp3.free;
+                            p := hp2; { Set the current instruction to the JMP command, which is about to be changed... }
 
-                                    { As before, now we can safely decrement it }
-                                    tasmlabel(symbol).decrefs;
-                                  end;
+                            { Change the JMP to a jCC with the opposite condition }
+                            taicpu(p).opcode:=A_Jcc;
+                            taicpu(p).condition:=condition;
 
-                                if Assigned(p) then
-                                  begin
-                                    UpdateUsedRegs(p);
-                                    result:=true;
-                                  end;
-                                exit;
-                             end;
-                         end;
-                    end;
-               end;
+                            Result := True;
+                            { UsedRegs will be set to the state at the modified jump below... }
+                          end;
+                      end;
+                  end;
+              end;
+
+            { Restore UsedRegs state to the appropriate position }
+            RestoreUsedRegs(StatePreserveRegs);
           end;
-{$endif i8086}
       end;
 
 
@@ -3974,75 +4739,6 @@
       end;
 
 
-{$ifdef x86_64}
-    function TX86AsmOptimizer.PostPeepholeOptMovzx(var p : tai) : Boolean;
-      var
-        PreMessage: string;
-      begin
-        Result := False;
-        { Code size reduction by J. Gareth "Kit" Moreton }
-        { Convert MOVZBQ and MOVZWQ to MOVZBL and MOVZWL respectively if it removes the REX prefix }
-        if (taicpu(p).opsize in [S_BQ, S_WQ]) and
-          (getsupreg(taicpu(p).oper[1]^.reg) in [RS_RAX, RS_RCX, RS_RDX, RS_RBX, RS_RSI, RS_RDI, RS_RBP, RS_RSP])
-        then
-          begin
-            { Has 64-bit register name and opcode suffix }
-            PreMessage := 'movz' + debug_opsize2str(taicpu(p).opsize) + ' ' + debug_operstr(taicpu(p).oper[0]^) + ',' + debug_regname(taicpu(p).oper[1]^.reg) + ' -> movz';
-
-            { The actual optimization }
-            setsubreg(taicpu(p).oper[1]^.reg, R_SUBD);
-            if taicpu(p).opsize = S_BQ then
-              taicpu(p).changeopsize(S_BL)
-            else
-              taicpu(p).changeopsize(S_WL);
-
-            DebugMsg(SPeepholeOptimization + PreMessage +
-              debug_opsize2str(taicpu(p).opsize) + ' ' + debug_operstr(taicpu(p).oper[0]^) + ',' + debug_regname(taicpu(p).oper[1]^.reg) + ' (removes REX prefix)', p);
-          end;
-      end;
-
-
-    function TX86AsmOptimizer.PostPeepholeOptXor(var p : tai) : Boolean;
-      var
-        PreMessage, RegName: string;
-      begin
-        { Code size reduction by J. Gareth "Kit" Moreton }
-        { change "xorq %reg,%reg" to "xorl %reg,%reg" for %rax, %rcx, %rdx, %rbx, %rsi, %rdi, %rbp and %rsp,
-          as this removes the REX prefix }
-
-        Result := False;
-        if not OpsEqual(taicpu(p).oper[0]^,taicpu(p).oper[1]^) then
-          Exit;
-
-        if taicpu(p).oper[0]^.typ <> top_reg then
-          { Should be impossible if both operands were equal, since one of XOR's operands must be a register }
-          InternalError(2018011500);
-
-        case taicpu(p).opsize of
-          S_Q:
-            begin
-              if (getsupreg(taicpu(p).oper[0]^.reg) in [RS_RAX, RS_RCX, RS_RDX, RS_RBX, RS_RSI, RS_RDI, RS_RBP, RS_RSP]) then
-                begin
-                  RegName := debug_regname(taicpu(p).oper[0]^.reg); { 64-bit register name }
-                  PreMessage := 'xorq ' + RegName + ',' + RegName + ' -> xorl ';
-
-                  { The actual optimization }
-                  setsubreg(taicpu(p).oper[0]^.reg, R_SUBD);
-                  setsubreg(taicpu(p).oper[1]^.reg, R_SUBD);
-                  taicpu(p).changeopsize(S_L);
-
-                  RegName := debug_regname(taicpu(p).oper[0]^.reg); { 32-bit register name }
-
-                  DebugMsg(SPeepholeOptimization + PreMessage + RegName + ',' + RegName + ' (removes REX prefix)', p);
-                end;
-            end;
-          else
-            ;
-        end;
-      end;
-{$endif}
-
-
     procedure TX86AsmOptimizer.OptReferences;
       var
         p: tai;
Index: compiler/x86_64/aoptcpu.pas
===================================================================
--- compiler/x86_64/aoptcpu.pas	(revision 42345)
+++ compiler/x86_64/aoptcpu.pas	(working copy)
@@ -30,12 +30,16 @@
 uses cpubase, aasmtai, aopt, aoptx86;
 
 type
+
+  { TCpuAsmOptimizer }
+
   TCpuAsmOptimizer = class(TX86AsmOptimizer)
-    function PrePeepHoleOptsCpu(var p: tai): boolean; override;
     function PeepHoleOptPass1Cpu(var p: tai): boolean; override;
-    function PeepHoleOptPass2Cpu(var p: tai): boolean; override;
-    function PostPeepHoleOptsCpu(var p : tai) : boolean; override;
-    procedure PostPeepHoleOpts; override;
+    function PostPeepHoleOptsCpu(var p : tai): boolean; override;
+
+    { Optimisations specific to x86_64 }
+    function PostPeepholeOptMovzx(var p : tai): Boolean; inline;
+    function PostPeepholeOptXor(var p : tai): Boolean; inline;
   end;
 
 implementation
@@ -42,132 +46,111 @@
 
 uses
   globals,
-  aasmcpu;
+  aasmcpu,
+  cgbase,
+  verbose;
 
-    function TCpuAsmOptimizer.PrePeepHoleOptsCpu(var p : tai) : boolean;
-      begin
-        result := false;
-        case p.typ of
-          ait_instruction:
-            begin
-              case taicpu(p).opcode of
-                A_IMUL:
-                  result:=PrePeepholeOptIMUL(p);
-                A_SAR,A_SHR:
-                  result:=PrePeepholeOptSxx(p);
-                else
-                  ;
-              end;
-            end;
-          else
-            ;
-        end;
-      end;
 
-
     function TCpuAsmOptimizer.PeepHoleOptPass1Cpu(var p: tai): boolean;
+      var
+        Opcode: TAsmOp;
       begin
+        { p is known to be an instruction by this point }
+
         result:=False;
-        case p.typ of
-          ait_instruction:
-            begin
-              case taicpu(p).opcode of
-                A_AND:
-                  Result:=OptPass1AND(p);
-                A_MOV:
-                  Result:=OptPass1MOV(p);
-                A_MOVSX,
-                A_MOVZX:
-                  Result:=OptPass1Movx(p);
-                A_VMOVAPS,
-                A_VMOVAPD,
-                A_VMOVUPS,
-                A_VMOVUPD:
-                  result:=OptPass1VMOVAP(p);
-                A_MOVAPD,
-                A_MOVAPS,
-                A_MOVUPD,
-                A_MOVUPS:
-                  result:=OptPass1MOVAP(p);
-                A_VDIVSD,
-                A_VDIVSS,
-                A_VSUBSD,
-                A_VSUBSS,
-                A_VMULSD,
-                A_VMULSS,
-                A_VADDSD,
-                A_VADDSS,
-                A_VANDPD,
-                A_VANDPS,
-                A_VORPD,
-                A_VORPS,
-                A_VXORPD,
-                A_VXORPS:
-                  result:=OptPass1VOP(p);
-                A_MULSD,
-                A_MULSS,
-                A_ADDSD,
-                A_ADDSS:
-                  result:=OptPass1OP(p);
-                A_VMOVSD,
-                A_VMOVSS,
-                A_MOVSD,
-                A_MOVSS:
-                  result:=OptPass1MOVXX(p);
-                A_LEA:
-                  result:=OptPass1LEA(p);
-                A_SUB:
-                  result:=OptPass1Sub(p);
-                A_SHL,A_SAL:
-                  result:=OptPass1SHLSAL(p);
-                A_SETcc:
-                  result:=OptPass1SETcc(p);
-                A_FSTP,A_FISTP:
-                  result:=OptPass1FSTP(p);
-                A_FLD:
-                  result:=OptPass1FLD(p);
-                else
-                  ;
-              end;
-            end;
-          else
-            ;
-        end;
+        { Use a local variable/register to reduce the number of pointer
+          dereferences (the peephole optimiser would never optimise this
+          by itself because the compiler has to consider the possibility
+          of multi-threaded race hazards). [Kit] }
+        Opcode := taicpu(p).opcode;
 
+        { Clever optimisation: MOV instructions appear disproportionally
+          more frequently than any other instruction, so check for this
+          opcode first and reduce the total number of comparisons
+          required over the entire block. [Kit] }
+        if Opcode = A_MOV then
+          Result := OptPass1MOV(p)
+        else
+          case Opcode of
+            A_AND:
+              Result:=OptPass1AND(p);
+            A_MOVSX,
+            A_MOVZX:
+              Result:=OptPass1Movx(p);
+            A_VMOVAPS,
+            A_VMOVAPD,
+            A_VMOVUPS,
+            A_VMOVUPD:
+              result:=OptPass1VMOVAP(p);
+            A_MOVAPD,
+            A_MOVAPS,
+            A_MOVUPD,
+            A_MOVUPS:
+              result:=OptPass1MOVAP(p);
+            A_VDIVSD,
+            A_VDIVSS,
+            A_VSUBSD,
+            A_VSUBSS,
+            A_VMULSD,
+            A_VMULSS,
+            A_VADDSD,
+            A_VADDSS,
+            A_VANDPD,
+            A_VANDPS,
+            A_VORPD,
+            A_VORPS,
+            A_VXORPD,
+            A_VXORPS:
+              result:=OptPass1VOP(p);
+            A_MULSD,
+            A_MULSS,
+            A_ADDSD,
+            A_ADDSS:
+              result:=OptPass1OP(p);
+            A_VMOVSD,
+            A_VMOVSS,
+            A_MOVSD,
+            A_MOVSS:
+              result:=OptPass1MOVXX(p);
+            A_LEA:
+              result:=OptPass1LEA(p);
+            A_SUB:
+              result:=OptPass1Sub(p);
+            A_CMP:
+              Result:=OptPass1CMP(p);
+            A_SHL,A_SAL:
+              result:=OptPass1SHLSAL(p);
+            A_SHR,A_SAR:
+              result:=OptPass1SHRSAR(p);
+            A_SETcc:
+              result:=OptPass1SETcc(p);
+            A_IMUL:
+              Result:=OptPass1Imul(p);
+            A_JMP:
+              Result:=OptPass1Jmp(p);
+            A_Jcc:
+              Result:=OptPass1Jcc(p);
+            A_XOR:
+              Result:=OptPass1XOR(p);
+            else
+              { Do nothing };
+          end;
       end;
 
 
-    function TCpuAsmOptimizer.PeepHoleOptPass2Cpu(var p : tai) : boolean;
-      begin
-        Result := False;
-        case p.typ of
-          ait_instruction:
-            begin
-              case taicpu(p).opcode of
-                A_MOV:
-                  Result:=OptPass2MOV(p);
-                A_IMUL:
-                  Result:=OptPass2Imul(p);
-                A_JMP:
-                  Result:=OptPass2Jmp(p);
-                A_Jcc:
-                  Result:=OptPass2Jcc(p);
-                else
-                  ;
-              end;
-            end;
-          else
-            ;
-        end;
-      end;
-
-
     function TCpuAsmOptimizer.PostPeepHoleOptsCpu(var p: tai): boolean;
+      var
+        i: Integer;
       begin
         result := false;
         case p.typ of
           ait_instruction:
             begin
+              { Optimise the references }
+              for i:=0 to taicpu(p).ops-1 do
+                if taicpu(p).oper[i]^.typ=top_ref then
+                  optimize_ref(taicpu(p).oper[i]^.ref^,false);
+
               case taicpu(p).opcode of
                 A_MOV:
                   Result:=PostPeepholeOptMov(p);
@@ -194,12 +177,66 @@
       end;
 
 
-    procedure TCpuAsmOptimizer.PostPeepHoleOpts;
+    function TCpuAsmOptimizer.PostPeepholeOptMovzx(var p: tai): Boolean;
+      var
+        PreMessage: string;
       begin
-        inherited;
-        OptReferences;
+        Result := False;
+        { Code size reduction by J. Gareth "Kit" Moreton }
+        { Convert MOVZBQ and MOVZWQ to MOVZBL and MOVZWL respectively if it removes the REX prefix }
+        if (taicpu(p).opsize in [S_BQ, S_WQ]) and
+          (getsupreg(taicpu(p).oper[1]^.reg) in [RS_RAX, RS_RCX, RS_RDX, RS_RBX, RS_RSI, RS_RDI, RS_RBP, RS_RSP])
+        then
+          begin
+            { Has 64-bit register name and opcode suffix }
+            PreMessage := 'movz' + debug_opsize2str(taicpu(p).opsize) + ' ' + debug_operstr(taicpu(p).oper[0]^) + ',' + debug_regname(taicpu(p).oper[1]^.reg) + ' -> movz';
+
+            { The actual optimization }
+            setsubreg(taicpu(p).oper[1]^.reg, R_SUBD);
+            if taicpu(p).opsize = S_BQ then
+              taicpu(p).changeopsize(S_BL)
+            else
+              taicpu(p).changeopsize(S_WL);
+
+            DebugMsg(SPeepholeOptimization + PreMessage +
+              debug_opsize2str(taicpu(p).opsize) + ' ' + debug_operstr(taicpu(p).oper[0]^) + ',' + debug_regname(taicpu(p).oper[1]^.reg) + ' (removes REX prefix)', p);
+          end;
       end;
 
+
+    function TCpuAsmOptimizer.PostPeepholeOptXor(var p : tai) : Boolean;
+      var
+        PreMessage, RegName: string;
+      begin
+        { Code size reduction by J. Gareth "Kit" Moreton }
+        { change "xorq %reg,%reg" to "xorl %reg,%reg" for %rax, %rcx, %rdx, %rbx, %rsi, %rdi, %rbp and %rsp,
+          as this removes the REX prefix }
+
+        Result := False;
+        if not OpsEqual(taicpu(p).oper[0]^,taicpu(p).oper[1]^) then
+          Exit;
+
+        if taicpu(p).oper[0]^.typ <> top_reg then
+          { Should be impossible if both operands were equal, since one of XOR's operands must be a register }
+          InternalError(2018011500);
+
+        if (taicpu(p).opsize = S_Q) and
+          (getsupreg(taicpu(p).oper[0]^.reg) in [RS_RAX, RS_RCX, RS_RDX, RS_RBX, RS_RSI, RS_RDI, RS_RBP, RS_RSP]) then
+          begin
+            RegName := debug_regname(taicpu(p).oper[0]^.reg); { 64-bit register name }
+            PreMessage := 'xorq ' + RegName + ',' + RegName + ' -> xorl ';
+
+            { The actual optimization }
+            setsubreg(taicpu(p).oper[0]^.reg, R_SUBD);
+            setsubreg(taicpu(p).oper[1]^.reg, R_SUBD);
+            taicpu(p).changeopsize(S_L);
+
+            RegName := debug_regname(taicpu(p).oper[0]^.reg); { 32-bit register name }
+
+            DebugMsg(SPeepholeOptimization + PreMessage + RegName + ',' + RegName + ' (removes REX prefix)', p);
+          end;
+      end;
+
 begin
   casmoptimizer := TCpuAsmOptimizer;
 end.
overhaul-singlepass.patch (157,392 bytes)

rd0x

2019-07-11 08:33

reporter   ~0117165

The patches don't seem to follow the 'code guidelines': they have x := y; instead of x:=y;
Maybe you can fix it with a regex:
(\w+)\s:=\s(\w+)
replace with
$1:=$2

J. Gareth Moreton

2019-07-11 09:10

developer   ~0117169

Admittedly yes, there are a few places where code guidelines aren't followed, even in the code that already existed. For the moment I just wanted to make it work with the trunk, since it had changed enough that merge conflicts resulted.

I'll make the convention fixes when Florian next gives feedback, since last time I broke a rule about which units can be included (a fix I hope I haven't reverted by mistake).

J. Gareth Moreton

2019-07-11 09:43

developer   ~0117172

At the same time it would be nice to know if others are getting speed savings and improved code generation, since that was my intention. I can think of other places where I can make improvements since I know a bit more about the node system now, on top of things I've listed at the end of the PDF (I forgot about some of those - it pays to write things down!).

J. Gareth Moreton

2019-07-11 11:20

developer   ~0117175

Fixed a bug where the compiler would enter an infinite loop if only overhaul-standalone.patch and its prerequisite overhaul-global.patch were applied.

overhaul-standalone.patch (44,635 bytes)
Index: compiler/x86/aoptx86.pas
===================================================================
--- compiler/x86/aoptx86.pas	(revision 42345)
+++ compiler/x86/aoptx86.pas	(working copy)
@@ -1998,53 +2880,68 @@
    function TX86AsmOptimizer.OptPass1MOVXX(var p : tai) : boolean;
       var
         hp1 : tai;
+        orig_instr: tasmop;
       begin
         Result:=false;
-        if taicpu(p).ops <> 2 then
-          exit;
-        if GetNextInstruction(p,hp1) and
-          MatchInstruction(hp1,taicpu(p).opcode,[taicpu(p).opsize]) and
-          (taicpu(hp1).ops = 2) then
-          begin
-            if (taicpu(hp1).oper[0]^.typ = taicpu(p).oper[1]^.typ) and
-               (taicpu(hp1).oper[1]^.typ = taicpu(p).oper[0]^.typ) then
-                {  movXX reg1, mem1     or     movXX mem1, reg1
-                   movXX mem2, reg2            movXX reg2, mem2}
-              begin
-                if OpsEqual(taicpu(hp1).oper[1]^,taicpu(p).oper[0]^) then
-                  { movXX reg1, mem1     or     movXX mem1, reg1
-                    movXX mem2, reg1            movXX reg2, mem1}
-                  begin
-                    if OpsEqual(taicpu(hp1).oper[0]^,taicpu(p).oper[1]^) then
-                      begin
-                        { Removes the second statement from
-                          movXX reg1, mem1/reg2
-                          movXX mem1/reg2, reg1
-                        }
-                        if taicpu(p).oper[0]^.typ=top_reg then
-                          AllocRegBetween(taicpu(p).oper[0]^.reg,p,hp1,usedregs);
-                        { Removes the second statement from
-                          movXX mem1/reg1, reg2
-                          movXX reg2, mem1/reg1
-                        }
-                        if (taicpu(p).oper[1]^.typ=top_reg) and
-                          not(RegUsedAfterInstruction(taicpu(p).oper[1]^.reg,hp1,UsedRegs)) then
-                          begin
-                            asml.remove(p);
-                            p.free;
-                            GetNextInstruction(hp1,p);
-                            DebugMsg(SPeepholeOptimization + 'MovXXMovXX2Nop 1 done',p);
-                          end
-                        else
-                          DebugMsg(SPeepholeOptimization + 'MovXXMovXX2MoVXX 1 done',p);
-                        asml.remove(hp1);
-                        hp1.free;
-                        Result:=true;
-                        exit;
-                      end
-                end;
-            end;
-        end;
+        repeat
+          orig_instr := taicpu(p).opcode;
+          if taicpu(p).ops <> 2 then
+            exit;
+          if GetNextInstruction(p,hp1) and
+            MatchInstruction(hp1,orig_instr,[taicpu(p).opsize]) and
+            (taicpu(hp1).ops = 2) then
+            begin
+              if (taicpu(hp1).oper[0]^.typ = taicpu(p).oper[1]^.typ) and
+                 (taicpu(hp1).oper[1]^.typ = taicpu(p).oper[0]^.typ) then
+                  {  movXX reg1, mem1     or     movXX mem1, reg1
+                     movXX mem2, reg2            movXX reg2, mem2}
+                begin
+                  if OpsEqual(taicpu(hp1).oper[1]^,taicpu(p).oper[0]^) then
+                    { movXX reg1, mem1     or     movXX mem1, reg1
+                      movXX mem2, reg1            movXX reg2, mem1}
+                    begin
+                      if OpsEqual(taicpu(hp1).oper[0]^,taicpu(p).oper[1]^) then
+                        begin
+                          { Removes the second statement from
+                            movXX reg1, mem1/reg2
+                            movXX mem1/reg2, reg1
+                          }
+                          if taicpu(p).oper[0]^.typ=top_reg then
+                            AllocRegBetween(taicpu(p).oper[0]^.reg,p,hp1,usedregs);
+                          { Removes the second statement from
+                            movXX mem1/reg1, reg2
+                            movXX reg2, mem1/reg1
+                          }
+
+
+                          if (taicpu(p).oper[1]^.typ=top_reg) and
+                            not(RegUsedAfterInstruction(taicpu(p).oper[1]^.reg,hp1,UsedRegs)) then
+                            begin
+                              asml.remove(p);
+                              p.free;
+                              asml.remove(hp1);
+                              hp1.free;
+                              Result := True;
+
+                              DebugMsg(SPeepholeOptimization + 'MovXXMovXX2Nop 1 done',p);
+
+                              if GetNextInstruction(hp1,p) and MatchInstruction(hp1,orig_instr) then
+                                Continue;
+                            end
+                          else
+                            begin
+                              DebugMsg(SPeepholeOptimization + 'MovXXMovXX2MoVXX 1 done',p);
+                              asml.remove(hp1);
+                              hp1.free;
+                              Result := True;
+                              Continue;
+                            end;
+                        end
+                  end;
+              end;
+          end;
+          Exit;
+        until False;
       end;
 
 
@@ -2062,26 +2959,30 @@
             <Op>X    %mreg2,%mreg1
           ?
         }
-        if GetNextInstruction(p,hp1) and
-          { we mix single and double opperations here because we assume that the compiler
-            generates vmovapd only after double operations and vmovaps only after single operations }
-          MatchInstruction(hp1,A_MOVAPD,A_MOVAPS,[S_NO]) and
-          MatchOperand(taicpu(p).oper[1]^,taicpu(hp1).oper[0]^) and
-          MatchOperand(taicpu(p).oper[0]^,taicpu(hp1).oper[1]^) and
-          (taicpu(p).oper[0]^.typ=top_reg) then
-          begin
-            TransferUsedRegs(TmpUsedRegs);
-            UpdateUsedRegs(TmpUsedRegs, tai(p.next));
-            if not(RegUsedAfterInstruction(taicpu(p).oper[1]^.reg,hp1,TmpUsedRegs)) then
-              begin
-                taicpu(p).loadoper(0,taicpu(hp1).oper[0]^);
-                taicpu(p).loadoper(1,taicpu(hp1).oper[1]^);
-                DebugMsg(SPeepholeOptimization + 'OpMov2Op done',p);
-                asml.Remove(hp1);
-                hp1.Free;
-                result:=true;
-              end;
-          end;
+        repeat
+          if GetNextInstruction(p,hp1) and
+            { we mix single and double operations here because we assume that the compiler
+              generates vmovapd only after double operations and vmovaps only after single operations }
+            MatchInstruction(hp1,A_MOVAPD,A_MOVAPS,[S_NO]) and
+            MatchOperand(taicpu(p).oper[1]^,taicpu(hp1).oper[0]^) and
+            MatchOperand(taicpu(p).oper[0]^,taicpu(hp1).oper[1]^) and
+            (taicpu(p).oper[0]^.typ=top_reg) then
+            begin
+              TransferUsedRegs(TmpUsedRegs);
+              UpdateUsedRegs(TmpUsedRegs, tai(p.next));
+              if not(RegUsedAfterInstruction(taicpu(p).oper[1]^.reg,hp1,TmpUsedRegs)) then
+                begin
+                  taicpu(p).loadoper(0,taicpu(hp1).oper[0]^);
+                  taicpu(p).loadoper(1,taicpu(hp1).oper[1]^);
+                  DebugMsg(SPeepholeOptimization + 'OpMov2Op done',p);
+                  asml.Remove(hp1);
+                  hp1.Free;
+                  result:=true;
+                  Continue;
+                end;
+            end;
+          Exit;
+        until False;
       end;
 
 
@@ -2091,96 +2992,103 @@
         l : ASizeInt;
       begin
         Result:=false;
-        { removes seg register prefixes from LEA operations, as they
-          don't do anything}
-        taicpu(p).oper[0]^.ref^.Segment:=NR_NO;
-        { changes "lea (%reg1), %reg2" into "mov %reg1, %reg2" }
-        if (taicpu(p).oper[0]^.ref^.base <> NR_NO) and
-           (taicpu(p).oper[0]^.ref^.index = NR_NO) and
-           { do not mess with leas acessing the stack pointer }
-           (taicpu(p).oper[1]^.reg <> NR_STACK_POINTER_REG) and
-           (not(Assigned(taicpu(p).oper[0]^.ref^.Symbol))) then
-          begin
-            if (taicpu(p).oper[0]^.ref^.base <> taicpu(p).oper[1]^.reg) and
-               (taicpu(p).oper[0]^.ref^.offset = 0) then
-              begin
-                hp1:=taicpu.op_reg_reg(A_MOV,taicpu(p).opsize,taicpu(p).oper[0]^.ref^.base,
-                  taicpu(p).oper[1]^.reg);
-                InsertLLItem(p.previous,p.next, hp1);
-                DebugMsg(SPeepholeOptimization + 'Lea2Mov done',hp1);
-                p.free;
-                p:=hp1;
-                Result:=true;
-                exit;
-              end
-            else if (taicpu(p).oper[0]^.ref^.offset = 0) then
-              begin
-                hp1:=taicpu(p.Next);
-                DebugMsg(SPeepholeOptimization + 'Lea2Nop done',p);
-                asml.remove(p);
-                p.free;
-                p:=hp1;
-                Result:=true;
-                exit;
-              end
-            { continue to use lea to adjust the stack pointer,
-              it is the recommended way, but only if not optimizing for size }
-            else if (taicpu(p).oper[1]^.reg<>NR_STACK_POINTER_REG) or
-              (cs_opt_size in current_settings.optimizerswitches) then
-              with taicpu(p).oper[0]^.ref^ do
-                if (base = taicpu(p).oper[1]^.reg) then
-                  begin
-                    l:=offset;
-                    if (l=1) and UseIncDec then
-                      begin
-                        taicpu(p).opcode:=A_INC;
-                        taicpu(p).loadreg(0,taicpu(p).oper[1]^.reg);
-                        taicpu(p).ops:=1;
-                        DebugMsg(SPeepholeOptimization + 'Lea2Inc done',p);
-                      end
-                    else if (l=-1) and UseIncDec then
-                      begin
-                        taicpu(p).opcode:=A_DEC;
-                        taicpu(p).loadreg(0,taicpu(p).oper[1]^.reg);
-                        taicpu(p).ops:=1;
-                        DebugMsg(SPeepholeOptimization + 'Lea2Dec done',p);
-                      end
-                    else
-                      begin
-                        if (l<0) and (l<>-2147483648) then
-                          begin
-                            taicpu(p).opcode:=A_SUB;
-                            taicpu(p).loadConst(0,-l);
-                            DebugMsg(SPeepholeOptimization + 'Lea2Sub done',p);
-                          end
-                        else
-                          begin
-                            taicpu(p).opcode:=A_ADD;
-                            taicpu(p).loadConst(0,l);
-                            DebugMsg(SPeepholeOptimization + 'Lea2Add done',p);
-                          end;
-                      end;
-                    Result:=true;
-                    exit;
-                  end;
-          end;
-        if GetNextInstruction(p,hp1) and
-          MatchInstruction(hp1,A_MOV,[taicpu(p).opsize]) and
-          MatchOperand(taicpu(p).oper[1]^,taicpu(hp1).oper[0]^) and
-          MatchOpType(Taicpu(hp1),top_reg,top_reg) and
-          (taicpu(p).oper[1]^.reg<>NR_STACK_POINTER_REG) then
-          begin
-            TransferUsedRegs(TmpUsedRegs);
-            UpdateUsedRegs(TmpUsedRegs, tai(p.next));
-            if not(RegUsedAfterInstruction(taicpu(p).oper[1]^.reg,hp1,TmpUsedRegs)) then
-              begin
-                taicpu(p).loadoper(1,taicpu(hp1).oper[1]^);
-                DebugMsg(SPeepholeOptimization + 'LeaMov2Lea done',p);
-                asml.Remove(hp1);
-                hp1.Free;
-                result:=true;
-              end;
-          end;
+        repeat
+          { removes seg register prefixes from LEA operations, as they
+            don't do anything}
+          taicpu(p).oper[0]^.ref^.Segment:=NR_NO;
+          { changes "lea (%reg1), %reg2" into "mov %reg1, %reg2" }
+          if (taicpu(p).oper[0]^.ref^.base <> NR_NO) and
+             (taicpu(p).oper[0]^.ref^.index = NR_NO) and
+             { do not mess with leas accessing the stack pointer }
+             (taicpu(p).oper[1]^.reg <> NR_STACK_POINTER_REG) and
+             (not(Assigned(taicpu(p).oper[0]^.ref^.Symbol))) then
+            begin
+              if (taicpu(p).oper[0]^.ref^.base <> taicpu(p).oper[1]^.reg) and
+                 (taicpu(p).oper[0]^.ref^.offset = 0) then
+                begin
+                  hp1:=taicpu.op_reg_reg(A_MOV,taicpu(p).opsize,taicpu(p).oper[0]^.ref^.base,
+                    taicpu(p).oper[1]^.reg);
+                  InsertLLItem(p.previous,p.next, hp1);
+                  DebugMsg(SPeepholeOptimization + 'Lea2Mov done',hp1);
+                  p.free;
+                  p:=hp1;
+                  Result:=true;
+                  exit;
+                end
+              else if (taicpu(p).oper[0]^.ref^.offset = 0) then
+                begin
+                  hp1:=taicpu(p.Next);
+                  DebugMsg(SPeepholeOptimization + 'Lea2Nop done',p);
+                  asml.remove(p);
+                  p.free;
+                  p:=hp1;
+                  Result:=true;
+                  if (hp1 <> BlockEnd) and MatchInstruction(hp1, A_LEA) then
+                    Continue
+                  else
+                    Exit;
+                end
+              { continue to use lea to adjust the stack pointer,
+                it is the recommended way, but only if not optimizing for size }
+              else if (taicpu(p).oper[1]^.reg<>NR_STACK_POINTER_REG) or
+                (cs_opt_size in current_settings.optimizerswitches) then
+                with taicpu(p).oper[0]^.ref^ do
+                  if (base = taicpu(p).oper[1]^.reg) then
+                    begin
+                      l:=offset;
+                      if (l=1) and UseIncDec then
+                        begin
+                          taicpu(p).opcode:=A_INC;
+                          taicpu(p).loadreg(0,taicpu(p).oper[1]^.reg);
+                          taicpu(p).ops:=1;
+                          DebugMsg(SPeepholeOptimization + 'Lea2Inc done',p);
+                        end
+                      else if (l=-1) and UseIncDec then
+                        begin
+                          taicpu(p).opcode:=A_DEC;
+                          taicpu(p).loadreg(0,taicpu(p).oper[1]^.reg);
+                          taicpu(p).ops:=1;
+                          DebugMsg(SPeepholeOptimization + 'Lea2Dec done',p);
+                        end
+                      else
+                        begin
+                          if (l<0) and (l<>-2147483648) then
+                            begin
+                              taicpu(p).opcode:=A_SUB;
+                              taicpu(p).loadConst(0,-l);
+                              DebugMsg(SPeepholeOptimization + 'Lea2Sub done',p);
+                            end
+                          else
+                            begin
+                              taicpu(p).opcode:=A_ADD;
+                              taicpu(p).loadConst(0,l);
+                              DebugMsg(SPeepholeOptimization + 'Lea2Add done',p);
+                            end;
+                        end;
+                      Result:=true;
+                      exit;
+                    end;
+            end;
+          if GetNextInstruction(p,hp1) and
+            MatchInstruction(hp1,A_MOV,[taicpu(p).opsize]) and
+            MatchOperand(taicpu(p).oper[1]^,taicpu(hp1).oper[0]^) and
+            MatchOpType(Taicpu(hp1),top_reg,top_reg) and
+            (taicpu(p).oper[1]^.reg<>NR_STACK_POINTER_REG) then
+            begin
+              TransferUsedRegs(TmpUsedRegs);
+              UpdateUsedRegs(TmpUsedRegs, tai(p.next));
+              if not(RegUsedAfterInstruction(taicpu(p).oper[1]^.reg,hp1,TmpUsedRegs)) then
+                begin
+                  taicpu(p).loadoper(1,taicpu(hp1).oper[1]^);
+                  DebugMsg(SPeepholeOptimization + 'LeaMov2Lea done',p);
+                  asml.Remove(hp1);
+                  hp1.Free;
+                  result:=true;
+                  Continue;
+                end;
+            end;
+          Exit;
+        until False;
       end;
 
 
@@ -2241,45 +3246,52 @@
 {$endif i386}
       begin
         Result:=false;
-        { * change "subl $2, %esp; pushw x" to "pushl x"}
-        { * change "sub/add const1, reg" or "dec reg" followed by
-            "sub const2, reg" to one "sub ..., reg" }
-        if MatchOpType(taicpu(p),top_const,top_reg) then
-          begin
+        repeat
+          { * change "subl $2, %esp; pushw x" to "pushl x"}
+          { * change "sub/add const1, reg" or "dec reg" followed by
+              "sub const2, reg" to one "sub ..., reg" }
+          if MatchOpType(taicpu(p),top_const,top_reg) then
+            begin
 {$ifdef i386}
-            if (taicpu(p).oper[0]^.val = 2) and
-               (taicpu(p).oper[1]^.reg = NR_ESP) and
-               { Don't do the sub/push optimization if the sub }
-               { comes from setting up the stack frame (JM)    }
-               (not(GetLastInstruction(p,hp1)) or
-               not(MatchInstruction(hp1,A_MOV,[S_L]) and
-                 MatchOperand(taicpu(hp1).oper[0]^,NR_ESP) and
-                 MatchOperand(taicpu(hp1).oper[0]^,NR_EBP))) then
-              begin
-                hp1 := tai(p.next);
-                while Assigned(hp1) and
-                      (tai(hp1).typ in [ait_instruction]+SkipInstr) and
-                      not RegReadByInstruction(NR_ESP,hp1) and
-                      not RegModifiedByInstruction(NR_ESP,hp1) do
-                  hp1 := tai(hp1.next);
-                if Assigned(hp1) and
-                  MatchInstruction(hp1,A_PUSH,[S_W]) then
-                  begin
-                    taicpu(hp1).changeopsize(S_L);
-                    if taicpu(hp1).oper[0]^.typ=top_reg then
-                      setsubreg(taicpu(hp1).oper[0]^.reg,R_SUBWHOLE);
-                    hp1 := tai(p.next);
-                    asml.remove(p);
-                    p.free;
-                    p := hp1;
-                    Result:=true;
-                    exit;
-                  end;
-              end;
+              if (taicpu(p).oper[0]^.val = 2) and
+                 (taicpu(p).oper[1]^.reg = NR_ESP) and
+                 { Don't do the sub/push optimization if the sub }
+                 { comes from setting up the stack frame (JM)    }
+                 (not(GetLastInstruction(p,hp1)) or
+                 not(MatchInstruction(hp1,A_MOV,[S_L]) and
+                   MatchOperand(taicpu(hp1).oper[0]^,NR_ESP) and
+                   MatchOperand(taicpu(hp1).oper[0]^,NR_EBP))) then
+                begin
+                  hp1 := tai(p.next);
+                  while Assigned(hp1) and
+                        (tai(hp1).typ in [ait_instruction]+SkipInstr) and
+                        not RegReadByInstruction(NR_ESP,hp1) and
+                        not RegModifiedByInstruction(NR_ESP,hp1) do
+                    hp1 := tai(hp1.next);
+                  if Assigned(hp1) and
+                    MatchInstruction(hp1,A_PUSH,[S_W]) then
+                    begin
+                      taicpu(hp1).changeopsize(S_L);
+                      if taicpu(hp1).oper[0]^.typ=top_reg then
+                        setsubreg(taicpu(hp1).oper[0]^.reg,R_SUBWHOLE);
+                      hp1 := tai(p.next);
+                      asml.remove(p);
+                      p.free;
+                      p := hp1;
+                      Result:=true;
+                      exit;
+                    end;
+                end;
 {$endif i386}
-            if DoSubAddOpt(p) then
-              Result:=true;
-          end;
+              if DoSubAddOpt(p) then
+                begin
+                  Result:=true;
+                  if (p <> BlockEnd) and MatchInstruction(p, A_SUB) then
+                    Continue;
+                end;
+            end;
+          Exit;
+        until False;
       end;
 
 
@@ -2365,6 +3377,7 @@
 {$endif x86_64}
               then
               begin
+{$ifndef x86_64}
                 if not(TmpBool2) and
                     (taicpu(p).oper[0]^.val = 1) then
                   begin
@@ -2372,11 +3385,17 @@
                       taicpu(p).oper[1]^.reg, taicpu(p).oper[1]^.reg)
                   end
                 else
+{$endif x86_64}
                   hp1 := taicpu.op_ref_reg(A_LEA, taicpu(p).opsize, TmpRef,
                               taicpu(p).oper[1]^.reg);
-                InsertLLItem(p.previous, p.next, hp1);
+
+                hp2 := tai(p.next);
+                InsertLLItem(p.previous, hp2, hp1);
+                asml.Remove(p);
                 p.free;
                 p := hp1;
+                UpdateUsedRegs(hp2);
+                Result := True;
               end;
           end
 {$ifndef x86_64}
@@ -2393,6 +3412,7 @@
                   InsertLLItem(p.previous, p.next, hp1);
                   p.free;
                   p := hp1;
+                  Result := True;
                 end
            { changes "shl $2, %reg" to "lea (,%reg,4), %reg"
              "shl $3, %reg" to "lea (,%reg,8), %reg }
@@ -2406,6 +3426,7 @@
                InsertLLItem(p.previous, p.next, hp1);
                p.free;
                p := hp1;
+               Result := True;
              end;
           end
 {$endif x86_64}
@@ -3306,14 +4039,17 @@
     function TX86AsmOptimizer.OptPass1Movx(var p : tai) : boolean;
       var
         hp1,hp2: tai;
+        GetNextInstruction_p: Boolean;
       begin
         result:=false;
+        GetNextInstruction_p := GetNextInstruction(p, hp1);
+
         if (taicpu(p).oper[1]^.typ = top_reg) and
-           GetNextInstruction(p,hp1) and
+           GetNextInstruction_p and
            (hp1.typ = ait_instruction) and
            IsFoldableArithOp(taicpu(hp1),taicpu(p).oper[1]^.reg) and
            GetNextInstruction(hp1,hp2) and
-           MatchInstruction(hp2,A_MOV,[]) and
+           MatchInstruction(hp2,A_MOV) and
            (taicpu(hp2).oper[0]^.typ = top_reg) and
            OpsEqual(taicpu(hp2).oper[1]^,taicpu(p).oper[0]^) and
 {$ifdef i386}
@@ -3374,7 +4110,7 @@
           begin
             { removes superfluous And's after movzx's }
             if (taicpu(p).oper[1]^.typ = top_reg) and
-              GetNextInstruction(p, hp1) and
+              GetNextInstruction_p and
               (tai(hp1).typ = ait_instruction) and
               (taicpu(hp1).opcode = A_AND) and
               (taicpu(hp1).oper[0]^.typ = top_const) and
@@ -3389,31 +4125,38 @@
                         asml.remove(hp1);
                         hp1.free;
                       end;
-                    S_WL{$ifdef x86_64}, S_WQ{$endif x86_64}:
-                      if (taicpu(hp1).oper[0]^.val = $ffff) then
-                        begin
-                          DebugMsg(SPeepholeOptimization + 'var5',p);
-                          asml.remove(hp1);
-                          hp1.free;
+                  S_WL{$ifdef x86_64}, S_WQ{$endif x86_64}:
+                    if (taicpu(hp1).oper[0]^.val = $ffff) then
+                      begin
+                        DebugMsg(SPeepholeOptimization + 'var5',p);
+                        asml.remove(hp1);
+                        hp1.free;
                         end;
 {$ifdef x86_64}
-                    S_LQ:
-                      if (taicpu(hp1).oper[0]^.val = $ffffffff) then
-                        begin
-                          if (cs_asm_source in current_settings.globalswitches) then
-                            asml.insertbefore(tai_comment.create(strpnew(SPeepholeOptimization + 'var6')),p);
-                          asml.remove(hp1);
-                          hp1.Free;
-                        end;
+                  S_LQ:
+                    if (taicpu(hp1).oper[0]^.val = $ffffffff) then
+                      begin
+                        if (cs_asm_source in current_settings.globalswitches) then
+                          asml.insertbefore(tai_comment.create(strpnew(SPeepholeOptimization + 'var6')),p);
+                        asml.remove(hp1);
+                        hp1.Free;
+                      end;
 {$endif x86_64}
                   else
-                    ;
+                  { Do nothing };
                 end;
+
+                { We need to get the new 'hp1' }
+                GetNextInstruction_p := GetNextInstruction(p, hp1);
               end;
-            { changes some movzx constructs to faster synonims (all examples
+            { changes some movzx constructs to faster synonyms (all examples
               are given with eax/ax, but are also valid for other registers)}
             if (taicpu(p).oper[1]^.typ = top_reg) then
               if (taicpu(p).oper[0]^.typ = top_reg) then
+
+                { Don't blindly set Result to True, otherwise we might get
+                  an infinite loop as AND and MOVZX convert to each other. }
+
                 case taicpu(p).opsize of
                   S_BW:
                     begin
@@ -3425,8 +4168,9 @@
                           taicpu(p).changeopsize(S_W);
                           taicpu(p).loadConst(0,$ff);
                           DebugMsg(SPeepholeOptimization + 'var7',p);
+                          Result := MatchInstruction(hp1, A_AND, [S_W]) or Result;
                         end
-                      else if GetNextInstruction(p, hp1) and
+                      else if GetNextInstruction_p and
                         (tai(hp1).typ = ait_instruction) and
                         (taicpu(hp1).opcode = A_AND) and
                         (taicpu(hp1).oper[0]^.typ = top_const) and
@@ -3440,6 +4184,7 @@
                           taicpu(p).changeopsize(S_W);
                           setsubreg(taicpu(p).oper[0]^.reg,R_SUBW);
                           taicpu(hp1).loadConst(0,taicpu(hp1).oper[0]^.val and $ff);
+                          Result := True;
                         end;
                     end;
                   S_BL:
@@ -3450,9 +4195,10 @@
                         begin
                           taicpu(p).opcode := A_AND;
                           taicpu(p).changeopsize(S_L);
-                          taicpu(p).loadConst(0,$ff)
+                          taicpu(p).loadConst(0,$ff);
+                          Result := MatchInstruction(hp1, A_AND, [S_L]) or Result;
                         end
-                      else if GetNextInstruction(p, hp1) and
+                      else if GetNextInstruction_p and
                         (tai(hp1).typ = ait_instruction) and
                         (taicpu(hp1).opcode = A_AND) and
                         (taicpu(hp1).oper[0]^.typ = top_const) and
@@ -3469,7 +4215,8 @@
                             is invalid in assembler PM }
                           setsubreg(taicpu(p).oper[0]^.reg, R_SUBD);
                           taicpu(hp1).loadConst(0,taicpu(hp1).oper[0]^.val and $ff);
-                        end
+                          Result := True;
+                        end;
                     end;
 {$ifndef i8086}
                   S_WL:
@@ -3482,8 +4229,9 @@
                           taicpu(p).opcode := A_AND;
                           taicpu(p).changeopsize(S_L);
                           taicpu(p).loadConst(0,$ffff);
+                          Result := MatchInstruction(hp1, A_AND, [S_L]) or Result;
                         end
-                      else if GetNextInstruction(p, hp1) and
+                      else if GetNextInstruction_p and
                         (tai(hp1).typ = ait_instruction) and
                         (taicpu(hp1).opcode = A_AND) and
                         (taicpu(hp1).oper[0]^.typ = top_const) and
@@ -3500,6 +4248,7 @@
                             is invalid in assembler PM }
                           setsubreg(taicpu(p).oper[0]^.reg, R_SUBD);
                           taicpu(hp1).loadConst(0,taicpu(hp1).oper[0]^.val and $ffff);
+                          Result := True;
                         end;
                     end;
 {$endif i8086}
@@ -3508,7 +4257,7 @@
                 end
               else if (taicpu(p).oper[0]^.typ = top_ref) then
                   begin
-                    if GetNextInstruction(p, hp1) and
+                    if GetNextInstruction_p and
                       (tai(hp1).typ = ait_instruction) and
                       (taicpu(hp1).opcode = A_AND) and
                       MatchOpType(taicpu(hp1),top_const,top_reg) and
@@ -3572,172 +4321,187 @@
       begin
         Result:=false;
 
-        if GetNextInstruction(p, hp1) then
-          begin
-            if MatchOpType(taicpu(p),top_const,top_reg) and
-              MatchInstruction(hp1,A_AND,[]) and
-              MatchOpType(taicpu(hp1),top_const,top_reg) and
-              (getsupreg(taicpu(p).oper[1]^.reg) = getsupreg(taicpu(hp1).oper[1]^.reg)) and
-              { the second register must contain the first one, so compare their subreg types }
-              (getsubreg(taicpu(p).oper[1]^.reg)<=getsubreg(taicpu(hp1).oper[1]^.reg)) and
-              (abs(taicpu(p).oper[0]^.val and taicpu(hp1).oper[0]^.val)<$80000000) then
-              { change
-                  and const1, reg
-                  and const2, reg
-                to
-                  and (const1 and const2), reg
-              }
-              begin
-                taicpu(hp1).loadConst(0, taicpu(p).oper[0]^.val and taicpu(hp1).oper[0]^.val);
-                DebugMsg(SPeepholeOptimization + 'AndAnd2And done',hp1);
-                asml.remove(p);
-                p.Free;
-                p:=hp1;
-                Result:=true;
-                exit;
-              end
-            else if MatchOpType(taicpu(p),top_const,top_reg) and
-              MatchInstruction(hp1,A_MOVZX,[]) and
-              (taicpu(hp1).oper[0]^.typ = top_reg) and
-              MatchOperand(taicpu(p).oper[1]^,taicpu(hp1).oper[1]^) and
-              (getsupreg(taicpu(hp1).oper[0]^.reg)=getsupreg(taicpu(hp1).oper[1]^.reg)) and
-               (((taicpu(p).opsize=S_W) and
-                 (taicpu(hp1).opsize=S_BW)) or
-                ((taicpu(p).opsize=S_L) and
-                 (taicpu(hp1).opsize in [S_WL,S_BL]))
+        repeat
+
+          if GetNextInstruction(p, hp1) and (hp1.typ = ait_instruction) then
+            begin
+              if MatchOpType(taicpu(p),top_const,top_reg) then
+                case taicpu(hp1).opcode of
+                  A_AND:
+                    if MatchOpType(taicpu(hp1),top_const,top_reg) and
+                      (getsupreg(taicpu(p).oper[1]^.reg) = getsupreg(taicpu(hp1).oper[1]^.reg)) and
+                      { the second register must contain the first one, so compare their subreg types }
+                      (getsubreg(taicpu(p).oper[1]^.reg)<=getsubreg(taicpu(hp1).oper[1]^.reg)) and
+                      (abs(taicpu(p).oper[0]^.val and taicpu(hp1).oper[0]^.val)<$80000000) then
+                      { change
+                          and const1, reg
+                          and const2, reg
+                        to
+                          and (const1 and const2), reg
+                      }
+                      begin
+                        taicpu(hp1).loadConst(0, taicpu(p).oper[0]^.val and taicpu(hp1).oper[0]^.val);
+                        DebugMsg(SPeepholeOptimization + 'AndAnd2And done',hp1);
+                        asml.remove(p);
+                        p.Free;
+                        p:=hp1;
+                        Result := True;
+                        Continue; { p is still AND, so it's safe to re-enter the loop }
+                      end;
+                  A_MOVZX:
+                    if (taicpu(hp1).oper[0]^.typ = top_reg) then
+                      begin
+
+                        if MatchOperand(taicpu(p).oper[1]^,taicpu(hp1).oper[1]^) and
+                        (getsupreg(taicpu(hp1).oper[0]^.reg)=getsupreg(taicpu(hp1).oper[1]^.reg)) and
+                        (((taicpu(p).opsize=S_W) and
+                         (taicpu(hp1).opsize=S_BW)) or
+                        ((taicpu(p).opsize=S_L) and
+                         (taicpu(hp1).opsize in [S_WL,S_BL]))
 {$ifdef x86_64}
-                  or
-                 ((taicpu(p).opsize=S_Q) and
-                  (taicpu(hp1).opsize in [S_BQ,S_WQ]))
+                          or
+                         ((taicpu(p).opsize=S_Q) and
+                          (taicpu(hp1).opsize in [S_BQ,S_WQ]))
 {$endif x86_64}
-                ) then
-                  begin
-                    if (((taicpu(hp1).opsize) in [S_BW,S_BL{$ifdef x86_64},S_BQ{$endif x86_64}]) and
-                        ((taicpu(p).oper[0]^.val and $ff)=taicpu(p).oper[0]^.val)
-                         ) or
-                       (((taicpu(hp1).opsize) in [S_WL{$ifdef x86_64},S_WQ{$endif x86_64}]) and
-                        ((taicpu(p).oper[0]^.val and $ffff)=taicpu(p).oper[0]^.val))
-                    then
-                      begin
-                        { Unlike MOVSX, MOVZX doesn't actually have a version that zero-extends a
-                          32-bit register to a 64-bit register, or even a version called MOVZXD, so
-                          code that tests for the presence of AND 0xffffffff followed by MOVZX is
-                          wasted, and is indictive of a compiler bug if it were triggered. [Kit]
+                        ) then
+                          begin
+                            if (((taicpu(hp1).opsize) in [S_BW,S_BL{$ifdef x86_64},S_BQ{$endif x86_64}]) and
+                                ((taicpu(p).oper[0]^.val and $ff)=taicpu(p).oper[0]^.val)
+                                 ) or
+                               (((taicpu(hp1).opsize) in [S_WL{$ifdef x86_64},S_WQ{$endif x86_64}]) and
+                                ((taicpu(p).oper[0]^.val and $ffff)=taicpu(p).oper[0]^.val))
+                            then
+                              begin
+                                { Unlike MOVSX, MOVZX doesn't actually have a version that zero-extends a
+                                  32-bit register to a 64-bit register, or even a version called MOVZXD, so
+                                  code that tests for the presence of AND 0xffffffff followed by MOVZX is
+                                  wasted, and is indicative of a compiler bug if it were triggered. [Kit]
 
-                          NOTE: To zero-extend from 32 bits to 64 bits, simply use the standard MOV.
-                        }
-                        DebugMsg(SPeepholeOptimization + 'AndMovzToAnd done',p);
+                                  NOTE: To zero-extend from 32 bits to 64 bits, simply use the standard MOV.
+                                }
+                                DebugMsg(SPeepholeOptimization + 'AndMovzToAnd done',p);
 
-                        asml.remove(hp1);
-                        hp1.free;
-                        Exit;
+                                asml.remove(hp1);
+                                hp1.free;
+                                Result := True;
+                                Continue;
+                              end;
+                          end;
                       end;
-                  end
-            else if MatchOpType(taicpu(p),top_const,top_reg) and
-              MatchInstruction(hp1,A_SHL,[]) and
-              MatchOpType(taicpu(hp1),top_const,top_reg) and
-              (getsupreg(taicpu(p).oper[1]^.reg)=getsupreg(taicpu(hp1).oper[1]^.reg)) then
-              begin
+                  A_SHL:
+                    if MatchOpType(taicpu(hp1),top_const,top_reg) and
+                      (getsupreg(taicpu(p).oper[1]^.reg)=getsupreg(taicpu(hp1).oper[1]^.reg)) then
+                      begin
 {$ifopt R+}
 {$define RANGE_WAS_ON}
 {$R-}
 {$endif}
-                { get length of potential and mask }
-                MaskLength:=SizeOf(taicpu(p).oper[0]^.val)*8-BsrQWord(taicpu(p).oper[0]^.val)-1;
+                        { get length of potential and mask }
+                        MaskLength:=SizeOf(taicpu(p).oper[0]^.val)*8-BsrQWord(taicpu(p).oper[0]^.val)-1;
 
-                { really a mask? }
+                        { really a mask? }
 {$ifdef RANGE_WAS_ON}
 {$R+}
 {$endif}
-                if (((QWord(1) shl MaskLength)-1)=taicpu(p).oper[0]^.val) and
-                  { unmasked part shifted out? }
-                  ((MaskLength+taicpu(hp1).oper[0]^.val)>=topsize2memsize[taicpu(hp1).opsize]) then
-                  begin
-                    DebugMsg(SPeepholeOptimization + 'AndShlToShl done',p);
+                        if (((QWord(1) shl MaskLength)-1)=taicpu(p).oper[0]^.val) and
+                          { unmasked part shifted out? }
+                          ((MaskLength+taicpu(hp1).oper[0]^.val)>=topsize2memsize[taicpu(hp1).opsize]) then
+                          begin
+                            DebugMsg(SPeepholeOptimization + 'AndShlToShl done',p);
 
-                    { take care of the register (de)allocs following p }
-                    UpdateUsedRegs(tai(p.next));
-                    asml.remove(p);
-                    p.free;
-                    p:=hp1;
-                    Result:=true;
-                    exit;
-                  end;
-              end
-            else if MatchOpType(taicpu(p),top_const,top_reg) and
-              MatchInstruction(hp1,A_MOVSX{$ifdef x86_64},A_MOVSXD{$endif x86_64},[]) and
-              (taicpu(hp1).oper[0]^.typ = top_reg) and
-              MatchOperand(taicpu(p).oper[1]^,taicpu(hp1).oper[1]^) and
-              (getsupreg(taicpu(hp1).oper[0]^.reg)=getsupreg(taicpu(hp1).oper[1]^.reg)) and
-               (((taicpu(p).opsize=S_W) and
-                 (taicpu(hp1).opsize=S_BW)) or
-                ((taicpu(p).opsize=S_L) and
-                 (taicpu(hp1).opsize in [S_WL,S_BL]))
+                            { take care of the register (de)allocs following p }
+                            UpdateUsedRegs(tai(p.next));
+                            asml.remove(p);
+                            p.free;
+                            p:=hp1;
+                            Result:=true;
+                            exit;
+                          end;
+                      end;
+                  A_MOVSX{$ifdef x86_64},A_MOVSXD{$endif x86_64}:
+                    if (taicpu(hp1).oper[0]^.typ = top_reg) and
+                    MatchOperand(taicpu(p).oper[1]^,taicpu(hp1).oper[1]^) and
+                    (getsupreg(taicpu(hp1).oper[0]^.reg)=getsupreg(taicpu(hp1).oper[1]^.reg)) and
+                    (
+                      (
+                        (taicpu(p).opsize=S_W) and
+                        (taicpu(hp1).opsize=S_BW)
+                      ) or (
+                        (taicpu(p).opsize=S_L) and
+                        (taicpu(hp1).opsize in [S_WL,S_BL])
 {$ifdef x86_64}
-                 or
-                 ((taicpu(p).opsize=S_Q) and
-                 (taicpu(hp1).opsize in [S_BQ,S_WQ,S_LQ]))
+                      ) or (
+                        (taicpu(p).opsize=S_Q) and
+                        (taicpu(hp1).opsize in [S_BQ,S_WQ,S_LQ])
 {$endif x86_64}
-                ) then
-                  begin
-                    if (((taicpu(hp1).opsize) in [S_BW,S_BL{$ifdef x86_64},S_BQ{$endif x86_64}]) and
-                        ((taicpu(p).oper[0]^.val and $7f)=taicpu(p).oper[0]^.val)
-                         ) or
-                       (((taicpu(hp1).opsize) in [S_WL{$ifdef x86_64},S_WQ{$endif x86_64}]) and
-                        ((taicpu(p).oper[0]^.val and $7fff)=taicpu(p).oper[0]^.val))
+                      )
+                    ) then
+                      begin
+                        if (((taicpu(hp1).opsize) in [S_BW,S_BL{$ifdef x86_64},S_BQ{$endif x86_64}]) and
+                            ((taicpu(p).oper[0]^.val and $7f)=taicpu(p).oper[0]^.val)
+                             ) or
+                           (((taicpu(hp1).opsize) in [S_WL{$ifdef x86_64},S_WQ{$endif x86_64}]) and
+                            ((taicpu(p).oper[0]^.val and $7fff)=taicpu(p).oper[0]^.val))
 {$ifdef x86_64}
-                       or
-                       (((taicpu(hp1).opsize)=S_LQ) and
-                        ((taicpu(p).oper[0]^.val and $7fffffff)=taicpu(p).oper[0]^.val)
-                       )
+                           or
+                           (((taicpu(hp1).opsize)=S_LQ) and
+                            ((taicpu(p).oper[0]^.val and $7fffffff)=taicpu(p).oper[0]^.val)
+                           )
 {$endif x86_64}
-                       then
-                       begin
-                         DebugMsg(SPeepholeOptimization + 'AndMovsxToAnd',p);
-                         asml.remove(hp1);
-                         hp1.free;
-                         Exit;
-                       end;
-                  end
-            else if (taicpu(p).oper[1]^.typ = top_reg) and
-              (hp1.typ = ait_instruction) and
-              (taicpu(hp1).is_jmp) and
-              (taicpu(hp1).opcode<>A_JMP) and
-              not(RegInUsedRegs(taicpu(p).oper[1]^.reg,UsedRegs)) then
-              begin
-                { change
-                    and x, reg
-                    jxx
-                  to
-                    test x, reg
-                    jxx
-                  if reg is deallocated before the
-                  jump, but only if it's a conditional jump (PFV)
-                }
-                taicpu(p).opcode := A_TEST;
-                Exit;
-              end;
-          end;
+                           then
+                           begin
+                             DebugMsg(SPeepholeOptimization + 'AndMovsxToAnd',p);
+                             asml.remove(hp1);
+                             hp1.free;
+                             Result := True;
+                             Continue;
+                           end;
+                      end;
+                  else
+                    { Do nothing };
+                end;
 
-        { Lone AND tests }
-        if MatchOpType(taicpu(p),top_const,top_reg) then
-          begin
-            {
-              - Convert and $0xFF,reg to and reg,reg if reg is 8-bit
-              - Convert and $0xFFFF,reg to and reg,reg if reg is 16-bit
-              - Convert and $0xFFFFFFFF,reg to and reg,reg if reg is 32-bit
-            }
-            if ((taicpu(p).oper[0]^.val = $FF) and (taicpu(p).opsize = S_B)) or
-              ((taicpu(p).oper[0]^.val = $FFFF) and (taicpu(p).opsize = S_W)) or
-              ((taicpu(p).oper[0]^.val = $FFFFFFFF) and (taicpu(p).opsize = S_L)) then
-              begin
-                taicpu(p).loadreg(0, taicpu(p).oper[1]^.reg)
-              end;
-          end;
+              if (taicpu(p).oper[1]^.typ = top_reg) and
+                (hp1.typ = ait_instruction) and
+                (taicpu(hp1).is_jmp) and
+                (taicpu(hp1).opcode<>A_JMP) and
+                not(RegInUsedRegs(taicpu(p).oper[1]^.reg,UsedRegs)) then
+                begin
+                  { change
+                      and x, reg
+                      jxx
+                    to
+                      test x, reg
+                      jxx
+                    if reg is deallocated before the
+                    jump, but only if it's a conditional jump (PFV)
+                  }
+                  taicpu(p).opcode := A_TEST;
+                  Exit;
+                end;
+            end;
 
+          { Lone AND tests }
+          if MatchOpType(taicpu(p),top_const,top_reg) then
+            begin
+              {
+                - Convert and $0xFF,reg to and reg,reg if reg is 8-bit
+                - Convert and $0xFFFF,reg to and reg,reg if reg is 16-bit
+                - Convert and $0xFFFFFFFF,reg to and reg,reg if reg is 32-bit
+              }
+              if ((taicpu(p).oper[0]^.val = $FF) and (taicpu(p).opsize = S_B)) or
+                ((taicpu(p).oper[0]^.val = $FFFF) and (taicpu(p).opsize = S_W)) or
+                ((taicpu(p).oper[0]^.val = $FFFFFFFF) and (taicpu(p).opsize = S_L)) then
+                begin
+                  taicpu(p).loadreg(0, taicpu(p).oper[1]^.reg)
+                end;
+            end;
+
+          Exit;
+        until False;
+
       end;
 
-
     function TX86AsmOptimizer.PostPeepholeOptLea(var p : tai) : Boolean;
       begin
         Result:=false;
overhaul-standalone.patch (44,635 bytes)
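
Several routines in the patch above wrap their body in "repeat ... Exit; until False" and call Continue after a successful rewrite, so that the same routine re-examines the instruction straight away instead of returning to the main pass. The standalone program below is a minimal sketch of that control flow only, using the AndAnd2And case as the example. It is not code from the patch: TInstr, TOp and FoldAnds are hypothetical names, and a plain array stands in for the compiler's linked list of tai entries.

```pascal
program AndFoldSketch;
{$mode objfpc}

{ Hypothetical stand-in for the compiler's instruction list; none of these
  names exist in the FPC sources. }
type
  TOp = (opAnd, opOther);
  TInstr = record
    Op: TOp;
    Mask: Cardinal;   { only meaningful when Op = opAnd }
  end;

{ Folds a run of consecutive ANDs starting at Idx into one instruction.
  After each fold the loop is re-entered rather than returning to the caller,
  mirroring the "Continue" calls in the patch. }
function FoldAnds(var Code: array of TInstr; var Count: Integer; Idx: Integer): Boolean;
var
  i: Integer;
begin
  Result := False;
  repeat
    if (Idx + 1 < Count) and
       (Code[Idx].Op = opAnd) and
       (Code[Idx + 1].Op = opAnd) then
      begin
        { and const1, reg; and const2, reg -> and (const1 and const2), reg }
        Code[Idx].Mask := Code[Idx].Mask and Code[Idx + 1].Mask;
        { remove the second AND by shifting the tail down one slot }
        for i := Idx + 1 to Count - 2 do
          Code[i] := Code[i + 1];
        Dec(Count);
        Result := True;
        Continue; { the entry at Idx is still an AND, so try again }
      end;
    Exit; { nothing matched: leave the loop and the function }
  until False;
end;

var
  Code: array[0..3] of TInstr;
  Count: Integer;
begin
  Code[0].Op := opAnd;   Code[0].Mask := $FFFF;
  Code[1].Op := opAnd;   Code[1].Mask := $0FF0;
  Code[2].Op := opAnd;   Code[2].Mask := $00FF;
  Code[3].Op := opOther; Code[3].Mask := 0;
  Count := 4;
  if FoldAnds(Code, Count, 0) then
    WriteLn('mask = ', Code[0].Mask, ', instructions left = ', Count);
end.
```

Because Continue jumps to the "until False" test, the loop body runs again, and only the trailing Exit ends the routine. In this sketch the three ANDs collapse into one within a single call, which is the saving in pass count that the description of this issue refers to.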

J. Gareth Moreton

2019-07-11 11:35

developer   ~0117177

Updated the specification to fix some spelling and grammar mistakes, and also included a new addition for future expansion.

J. Gareth Moreton

2019-07-11 11:37

developer  

x86_64 Optimisation Specification.pdf (159,567 bytes)

Akira1364

2019-07-21 22:49

reporter   ~0117339

Last edited: 2019-07-21 23:27


The patches, all together, do definitely give noticeably better results. There seems to be an issue somewhere with codegen at -O3 and higher, however, that results in "Internal error 2013102801" in some cases.

A "guaranteed reproducer" I've found is the "cldrparser" utility in the "utils/unicode" folder.

On x86_64 Windows, doing "fpc -O3 cldrparser.lpr", for me, always results in "helper.pas(2850,1) Fatal: Internal error 2013102801". (helper.pas being a unit used by the cldrparser tool, of course.)

J. Gareth Moreton

2019-07-22 01:00

developer   ~0117341

Ooo, thank you for the feedback, and it's good that it gives better results. I'll take a look to see what's triggering the internal error. Hopefully it's easily patched.

J. Gareth Moreton

2019-07-28 08:23

developer   ~0117449

Can you confirm that this still happens? I'm unable to reproduce the issue, both on my working branch and when applying the patches to the trunk. It links successfully.

[FPCDirectory]\utils\unicode>\pp\bin\x86_64-win64\fpc -O3 -B cldrparser.lpr
Free Pascal Compiler version 3.3.1 [2019/07/28] for x86_64
Copyright (c) 1993-2018 by Florian Klaempfl and others
Target OS: Win64 for x64
Compiling cldrparser.lpr
Compiling cldrhelper.pas
Compiling helper.pas
Compiling trie.pas
helper.pas(1264,34) Warning: Local variable "actualDataLen" does not seem to be initialized
helper.pas(1380,36) Warning: Local variable "actualDataLen" does not seem to be initialized
helper.pas(1522,41) Warning: Local variable "actualDataLen" does not seem to be initialized
helper.pas(1559,16) Note: Call to subroutine "function TPropRec.GetCategory:<enumeration type>;" marked as inline is not inlined
helper.pas(1559,29) Note: Call to subroutine "function TPropRec.GetCategory:<enumeration type>;" marked as inline is not inlined
helper.pas(1564,16) Note: Call to subroutine "function TPropRec.GetWhiteSpace:Boolean;" marked as inline is not inlined
helper.pas(1564,31) Note: Call to subroutine "function TPropRec.GetWhiteSpace:Boolean;" marked as inline is not inlined
helper.pas(1570,16) Note: Call to subroutine "function TPropRec.GetHangulSyllable:Boolean;" marked as inline is not inlined
helper.pas(1570,35) Note: Call to subroutine "function TPropRec.GetHangulSyllable:Boolean;" marked as inline is not inlined
helper.pas(1835,39) Warning: Local variable "actualNumLen" does not seem to be initialized
helper.pas(1834,40) Warning: Local variable "actualDataLen" does not seem to be initialized
helper.pas(1833,36) Warning: Local variable "actualPropLen" does not seem to be initialized
helper.pas(2674,38) Warning: Local variable "actualDataLen" does not seem to be initialized
helper.pas(3056,13) Warning: Local variable "p1" does not seem to be initialized
helper.pas(3566,3) Note: Local variable "locLine" not used
helper.pas(4021,6) Note: Call to subroutine "function TUCA_PropItemRec.IsWeightCompress_1:Boolean;" marked as inline is not inlined
helper.pas(4023,6) Note: Call to subroutine "function TUCA_PropItemRec.IsWeightCompress_2:Boolean;" marked as inline is not inlined
helper.pas(4043,10) Note: Call to subroutine "function TUCA_PropItemRec.IsWeightCompress_1:Boolean;" marked as inline is not inlined
helper.pas(4050,10) Note: Call to subroutine "function TUCA_PropItemRec.IsWeightCompress_2:Boolean;" marked as inline is not inlined
helper.pas(4066,8) Note: Call to subroutine "function TUCA_PropItemRec.IsWeightCompress_1:Boolean;" marked as inline is not inlined
helper.pas(4068,8) Note: Call to subroutine "function TUCA_PropItemRec.IsWeightCompress_2:Boolean;" marked as inline is not inlined
helper.pas(4073,6) Note: Call to subroutine "function TUCA_PropItemRec.GetContextual:Boolean;" marked as inline is not inlined
Writing Resource String Table file: helper.rsj
cldrhelper.pas(513,6) Note: Local variable "i" not used
cldrhelper.pas(775,12) Note: Call to subroutine "function TReorderUnit.IsVirtual:Boolean;" marked as inline is not inlined
cldrhelper.pas(778,11) Warning: Function result variable does not seem to be initialized
cldrhelper.pas(755,3) Warning: Function result variable does not seem to be initialized
cldrhelper.pas(1106,5) Warning: Function result variable does not seem to be initialized
cldrhelper.pas(1145,11) Note: Call to subroutine "function TReorderUnit.IsVirtual:Boolean;" marked as inline is not inlined
cldrhelper.pas(1519,19) Warning: Local variable "pwb" does not seem to be initialized
cldrhelper.pas(1551,49) Warning: Local variable "pt" does not seem to be initialized
cldrhelper.pas(1603,13) Note: Call to subroutine "function TReorderUnit.IsVirtual:Boolean;" marked as inline is not inlined
cldrhelper.pas(1615,12) Note: Call to subroutine "function TReorderUnit.IsVirtual:Boolean;" marked as inline is not inlined
cldrhelper.pas(1659,9) Note: Local variable "k" not used
cldrhelper.pas(1692,13) Note: Call to subroutine "function TReorderUnit.IsVirtual:Boolean;" marked as inline is not inlined
cldrhelper.pas(2791,35) Warning: Implicit string type conversion from "AnsiString" to "UnicodeString"
cldrhelper.pas(2792,41) Warning: Implicit string type conversion from "AnsiString" to "UnicodeString"
cldrhelper.pas(2793,42) Warning: Implicit string type conversion from "AnsiString" to "UnicodeString"
Writing Resource String Table file: cldrhelper.rsj
Compiling cldrtest.pas
Compiling unicodeset.pas
Compiling grbtree.pas
grbtree.pas(617,3) Note: Call to subroutine "procedure TRBTree<helper.TUnicodeCodePointArray,unicodeset.TUnicodeCodePointArrayComparator>.TreeFreeIterator(AItem:TRBTree$2$crcC7C812BB.PBaseIterator); Static;" marked as inline is not inlined
grbtree.pas(628,14) Note: Call to subroutine "function TRBTree<helper.TUnicodeCodePointArray,unicodeset.TUnicodeCodePointArrayComparator>.TreeIteratorMoveNext(AIterator:TRBTree$2$crcC7C812BB.PBaseIterator):^TRBTree$2$crcC7C812BB.TRBTreeNode; Static;" marked as inline is not inlined
grbtree.pas(638,14) Note: Call to subroutine "function TRBTree<helper.TUnicodeCodePointArray,unicodeset.TUnicodeCodePointArrayComparator>.TreeIteratorMovePrevious(AIterator:TRBTree$2$crcC7C812BB.PBaseIterator):^TRBTree$2$crcC7C812BB.TRBTreeNode; Static;" marked as inline is not inlined
grbtree.pas(643,13) Note: Call to subroutine "function TRBTree$2$crcC7C812BB.TIterator.GetCurrentNode:^TRBTree$2$crcC7C812BB.TRBTreeNode;" marked as inline is not inlined
grbtree.pas(364,11) Note: Call to subroutine "function TUnicodeCodePointArrayComparator.Compare(const A:TUnicodeCodePointArray;const B:TUnicodeCodePointArray):LongInt; Static;" marked as inline is not inlined
grbtree.pas(368,15) Note: Call to subroutine "function TUnicodeCodePointArrayComparator.Compare(const A:TUnicodeCodePointArray;const B:TUnicodeCodePointArray):LongInt; Static;" marked as inline is not inlined
grbtree.pas(422,13) Note: Call to subroutine "function TUnicodeCodePointArrayComparator.Compare(const A:TUnicodeCodePointArray;const B:TUnicodeCodePointArray):LongInt; Static;" marked as inline is not inlined
grbtree.pas(428,17) Note: Call to subroutine "function TUnicodeCodePointArrayComparator.Compare(const A:TUnicodeCodePointArray;const B:TUnicodeCodePointArray):LongInt; Static;" marked as inline is not inlined
grbtree.pas(380,1) Warning: Function result variable does not seem to be initialized
grbtree.pas(476,15) Note: Call to subroutine "function TUnicodeCodePointArrayComparator.Compare(const A:TUnicodeCodePointArray;const B:TUnicodeCodePointArray):LongInt; Static;" marked as inline is not inlined
grbtree.pas(479,11) Note: Call to subroutine "function TUnicodeCodePointArrayComparator.Compare(const A:TUnicodeCodePointArray;const B:TUnicodeCodePointArray):LongInt; Static;" marked as inline is not inlined
grbtree.pas(583,37) Note: Call to subroutine "function TUnicodeCodePointArrayComparator.Compare(const A:TUnicodeCodePointArray;const B:TUnicodeCodePointArray):LongInt; Static;" marked as inline is not inlined
grbtree.pas(584,35) Note: Call to subroutine "function TUnicodeCodePointArrayComparator.Compare(const A:TUnicodeCodePointArray;const B:TUnicodeCodePointArray):LongInt; Static;" marked as inline is not inlined
unicodeset.pas(171,4) Note: "array of const" not yet supported inside inline procedure/function
unicodeset.pas(198,3) Note: Call to subroutine "procedure TPatternParser.CheckEOF(ALength:LongInt);" marked as inline is not inlined
unicodeset.pas(206,5) Note: Call to subroutine "procedure TPatternParser.UnexpectedEOF;" marked as inline is not inlined
unicodeset.pas(333,9) Note: Call to subroutine "procedure TUnicodeSet.Add(AString:TUnicodeCodePointArray);" marked as inline is not inlined
unicodeset.pas(340,9) Note: Call to subroutine "procedure TUnicodeSet.AddRange(const AStart:LongWord;const AEnd:LongWord);" marked as inline is not inlined
unicodeset.pas(342,9) Note: Call to subroutine "procedure TUnicodeSet.Add(AChar:LongWord);" marked as inline is not inlined
unicodeset.pas(390,10) Note: Call to subroutine "function TUnicodeSet.Contains(const AChar:LongWord):Boolean;" marked as inline is not inlined
Writing Resource String Table file: unicodeset.rsj
Compiling cldrtxt.pas
cldrtxt.pas(556,5) Note: Local variable "kc" not used
cldrtxt.pas(556,9) Note: Local variable "k" not used
cldrtxt.pas(772,67) Warning: Local variable "lineLength" does not seem to be initialized
cldrtxt.pas(787,7) Warning: Local variable "linePos" does not seem to be initialized
cldrtxt.pas(771,10) Warning: Local variable "specialChararter" does not seem to be initialized
cldrtxt.pas(772,67) Warning: Local variable "line" of a managed type does not seem to be initialized
cldrtxt.pas(720,3) Warning: Function result variable does not seem to be initialized
cldrtxt.pas(689,7) Warning: Function result variable of a managed type does not seem to be initialized
cldrtxt.pas(358,24) Warning: Local variable "ks" of a managed type does not seem to be initialized
Compiling cldrxml.pas
cldrxml.pas(246,49) Warning: Implicit string type conversion with potential data loss from "UnicodeString" to "AnsiString"
cldrxml.pas(253,18) Warning: Implicit string type conversion with potential data loss from "UnicodeString" to "AnsiString"
cldrxml.pas(337,81) Warning: Implicit string type conversion from "ShortString" to "UnicodeString"
cldrxml.pas(352,44) Warning: Implicit string type conversion with potential data loss from "UnicodeString" to "AnsiString"
cldrxml.pas(400,29) Warning: Local variable "simpleCharTag" does not seem to be initialized
cldrxml.pas(236,3) Warning: Function result variable does not seem to be initialized
cldrxml.pas(518,44) Warning: Implicit string type conversion from "AnsiString" to "UnicodeString"
cldrxml.pas(585,35) Warning: Implicit string type conversion with potential data loss from "UnicodeString" to "AnsiString"
cldrxml.pas(586,17) Warning: Implicit string type conversion with potential data loss from "UnicodeString" to "AnsiString"
cldrxml.pas(917,35) Warning: Implicit string type conversion with potential data loss from "UnicodeString" to "AnsiString"
cldrxml.pas(918,30) Warning: Implicit string type conversion with potential data loss from "UnicodeString" to "AnsiString"
cldrxml.pas(984,26) Warning: Implicit string type conversion with potential data loss from "UnicodeString" to "AnsiString"
cldrxml.pas(985,25) Warning: Implicit string type conversion with potential data loss from "UnicodeString" to "AnsiString"
cldrxml.pas(986,29) Warning: Implicit string type conversion with potential data loss from "UnicodeString" to "AnsiString"
cldrxml.pas(1016,92) Warning: Implicit string type conversion from "AnsiString" to "UnicodeString"
cldrxml.pas(1046,26) Warning: Implicit string type conversion with potential data loss from "UnicodeString" to "AnsiString"
cldrxml.pas(1047,25) Warning: Implicit string type conversion with potential data loss from "UnicodeString" to "AnsiString"
cldrxml.pas(1048,29) Warning: Implicit string type conversion with potential data loss from "UnicodeString" to "AnsiString"
cldrxml.pas(1090,92) Warning: Implicit string type conversion from "AnsiString" to "UnicodeString"
cldrtest.pas(615,5) Warning: Function result variable of a managed type does not seem to be initialized
cldrtest.pas(798,42) Warning: Implicit string type conversion from "AnsiString" to "UnicodeString"
cldrtest.pas(842,35) Warning: Implicit string type conversion from "AnsiString" to "UnicodeString"
cldrtest.pas(4201,3) Note: Local variable "i" not used
cldrtest.pas(5670,3) Note: Local variable "imp" not used
cldrtest.pas(5719,3) Note: Local variable "imp" not used
Linking cldrparser.exe
17906 lines compiled, 4.3 sec, 612544 bytes code, 115556 bytes data
46 warning(s) issued
45 note(s) issued

J. Gareth Moreton

2019-07-28 08:45

developer   ~0117450

At least as it currently stands for me, if all the patches are applied, everything works perfectly under x86_64-win64. Anyone else able to reproduce it?

Akira1364

2019-07-29 15:01

reporter   ~0117488

I can't check at the moment, but I'll take another look at it later today.

J. Gareth Moreton

2019-07-30 01:47

developer   ~0117497

I do have a request... is it possible to test how the overhaul performs for i386-linux (I can't get around the configuration problems on my 64-bit machine) and x86_64-darwin (I don't have a suitable platform)?

Logic says that x86_64-darwin should perform the same as x86_64-linux, since both use the System V ABI and the like, but I can only speculate here and not run a full regression suite.

Non-Intel platforms should not be affected by the changes, but I guess this is asking for a lot given how expensive multi-platform testing can be.

Akira1364

2019-07-30 17:08

reporter   ~0117507

Ok, took another look: it seems that this was related to the fact I was building the compiler *itself* with -O3, and not the usual -O2. With the absolute default makefile settings for the compiler, doing fpc -O3 -B cldrparser.lpr does work. However, with an O3-built compiler, fpc -O3 -B cldrparser.lpr gives the error I described earlier.

Not sure what this indicates, exactly.

J. Gareth Moreton

2019-07-30 17:20

developer   ~0117508

Last edited: 2019-07-30 20:32

I'm not sure either, but it's fascinating to see that. I'll do some investigating. Thank you.

J. Gareth Moreton

2019-07-31 09:24

developer   ~0117516

Last edited: 2019-07-31 09:27

Still no luck. How are you building the compiler? This is my process ([FPCRoot] is my directory tree):

- cd /d [FPCRoot]\compiler
- make distclean all install DATA2INC=[FPCRoot]\utils\bin\x86_64-win64\data2inc.exe FPCOPT=-O3
- cd /d C:\PP\bin\x86_64-win64
- fpcmkcfg -d basepath=C:\PP\ -o C:\PP\bin\x86_64-win64\fpc.cfg
- cd /d [FPCRoot]\utils\unicode
- \pp\bin\x86_64-win64\fpc -O3 -B cldrparser.lpr

Compiles successfully.

Of course, there's a chance I'm doing something completely wrong and the -O3 option is getting overridden back to -O2 with my configuration. I apologise if I'm being dumb here.

J. Gareth Moreton

2019-08-11 19:15

developer   ~0117643

I'm very sorry, but I'm unable to reproduce your bug at all, so everything appears to be working.

What is your exact process for applying the patches and building the compiler? I really want to make the presentation perfect for Florian!

Florian

2019-08-17 18:15

administrator   ~0117719

overhaul-base.patch applied. Sorry for the slow progress.

J. Gareth Moreton

2019-08-18 14:07

developer   ~0117728

No worries. Thank you so much, Florian. Slowly but surely making sure the patches are all good and correct before applying!

Akira1364

2019-08-19 18:23

reporter   ~0117738

Last edited: 2019-08-19 21:59

> I'm very sorry, but I'm unable to reproduce your bug at all, so everything appears to be working.

I actually can't reproduce the bug anymore either (after the last week or so of revisions) so I'd say don't worry about it, I think.
Not sure what was going on exactly. An extreme edge case, whatever it was. Will let you know if I ever see it crop up again.

J. Gareth Moreton

2019-08-19 22:13

developer   ~0117740

Well, that's a relief! Thanks for your continued testing.

J. Gareth Moreton

2019-10-26 17:09

developer   ~0118853

Suspending this issue as it currently stands since Florian rejected the patches on the development mailing list, citing interdependencies and questionable implementation choices. Nevertheless, children of this issue will be created at a future date to implement the overhaul in cleaner and more bite-sized chunks.

Issue History

Date Modified Username Field Change
2018-12-01 17:04 J. Gareth Moreton New Issue
2018-12-01 17:04 J. Gareth Moreton File Added: x86_64-opt-overhaul.patch
2018-12-01 17:04 J. Gareth Moreton File Added: Metric.txt
2018-12-01 17:05 J. Gareth Moreton Description Updated View Revisions
2018-12-02 06:31 J. Gareth Moreton Tag Attached: 64-bit
2018-12-02 06:31 J. Gareth Moreton Tag Attached: compiler
2018-12-02 06:31 J. Gareth Moreton Tag Attached: optimization
2018-12-02 06:31 J. Gareth Moreton Tag Attached: patch
2018-12-02 06:31 J. Gareth Moreton Tag Attached: refactoring
2018-12-02 06:31 J. Gareth Moreton Tag Attached: x86
2018-12-02 06:31 J. Gareth Moreton Tag Attached: x86_64-win64
2018-12-02 06:31 J. Gareth Moreton File Deleted: x86_64-opt-overhaul.patch
2018-12-02 06:38 J. Gareth Moreton File Added: overhaul-base.patch
2018-12-02 06:38 J. Gareth Moreton File Added: overhaul-global.patch
2018-12-02 06:38 J. Gareth Moreton File Added: overhaul-singlepass.patch
2018-12-02 06:38 J. Gareth Moreton File Added: overhaul-standalone.patch
2018-12-02 06:39 J. Gareth Moreton File Added: overhaul-64-32-split.patch
2018-12-02 06:41 J. Gareth Moreton Note Added: 0112311
2018-12-06 17:45 J. Gareth Moreton File Deleted: overhaul-base.patch
2018-12-06 17:46 J. Gareth Moreton File Deleted: overhaul-global.patch
2018-12-06 17:46 J. Gareth Moreton File Deleted: overhaul-singlepass.patch
2018-12-06 17:46 J. Gareth Moreton File Deleted: overhaul-standalone.patch
2018-12-06 17:46 J. Gareth Moreton File Deleted: overhaul-64-32-split.patch
2018-12-06 17:46 J. Gareth Moreton File Added: overhaul-base.patch
2018-12-06 17:47 J. Gareth Moreton File Added: overhaul-global.patch
2018-12-06 17:47 J. Gareth Moreton File Added: overhaul-singlepass.patch
2018-12-06 17:48 J. Gareth Moreton File Added: overhaul-standalone.patch
2018-12-06 17:48 J. Gareth Moreton File Added: overhaul-mov-refactor.patch
2018-12-06 17:50 J. Gareth Moreton File Added: overhaul-64-32-split.patch
2018-12-06 17:52 J. Gareth Moreton Note Edited: 0112311 View Revisions
2018-12-06 17:53 J. Gareth Moreton Note Added: 0112407
2018-12-06 17:54 J. Gareth Moreton Note Edited: 0112407 View Revisions
2018-12-07 00:08 J. Gareth Moreton Note Added: 0112414
2018-12-09 06:55 J. Gareth Moreton Note Added: 0112457
2018-12-10 14:05 J. Gareth Moreton Note Added: 0112479
2018-12-23 23:11 J. Gareth Moreton File Deleted: overhaul-mov-refactor.patch
2018-12-23 23:12 J. Gareth Moreton File Added: overhaul-mov-refactor.patch
2018-12-23 23:14 J. Gareth Moreton Note Added: 0112846
2019-01-22 12:43 J. Gareth Moreton File Deleted: overhaul-base.patch
2019-01-22 12:43 J. Gareth Moreton File Deleted: overhaul-global.patch
2019-01-22 12:43 J. Gareth Moreton File Deleted: overhaul-singlepass.patch
2019-01-22 12:43 J. Gareth Moreton File Deleted: overhaul-standalone.patch
2019-01-22 12:44 J. Gareth Moreton File Deleted: overhaul-64-32-split.patch
2019-01-22 12:44 J. Gareth Moreton File Deleted: overhaul-mov-refactor.patch
2019-01-22 12:44 J. Gareth Moreton File Added: overhaul-base.patch
2019-01-22 12:44 J. Gareth Moreton File Added: overhaul-global.patch
2019-01-22 12:44 J. Gareth Moreton File Added: overhaul-singlepass.patch
2019-01-22 12:45 J. Gareth Moreton File Added: overhaul-standalone.patch
2019-01-22 12:45 J. Gareth Moreton File Added: overhaul-mov-refactor.patch
2019-01-22 12:47 J. Gareth Moreton Note Added: 0113578
2019-01-22 13:50 J. Gareth Moreton File Deleted: overhaul-global.patch
2019-01-22 13:50 J. Gareth Moreton File Added: overhaul-global.patch
2019-01-22 14:03 J. Gareth Moreton File Deleted: overhaul-singlepass.patch
2019-01-22 14:03 J. Gareth Moreton File Deleted: overhaul-mov-refactor.patch
2019-01-22 14:04 J. Gareth Moreton File Added: overhaul-singlepass.patch
2019-01-22 14:04 J. Gareth Moreton File Added: overhaul-mov-refactor.patch
2019-01-22 14:07 J. Gareth Moreton File Deleted: overhaul-mov-refactor.patch
2019-01-22 14:07 J. Gareth Moreton File Added: overhaul-mov-refactor.patch
2019-01-22 14:11 J. Gareth Moreton Note Added: 0113581
2019-01-22 14:32 J. Gareth Moreton File Deleted: overhaul-singlepass.patch
2019-01-22 14:33 J. Gareth Moreton File Added: overhaul-singlepass.patch
2019-01-22 14:34 J. Gareth Moreton Note Added: 0113582
2019-01-22 14:56 J. Gareth Moreton File Deleted: overhaul-global.patch
2019-01-22 14:56 J. Gareth Moreton File Added: overhaul-global.patch
2019-01-22 15:52 J. Gareth Moreton Note Added: 0113583
2019-02-22 21:26 J. Gareth Moreton File Deleted: overhaul-singlepass.patch
2019-02-22 21:27 J. Gareth Moreton File Added: overhaul-singlepass.patch
2019-02-22 21:27 J. Gareth Moreton Note Added: 0114350
2019-02-22 22:58 Florian Note Added: 0114352
2019-02-22 23:15 J. Gareth Moreton Note Added: 0114353
2019-02-22 23:16 J. Gareth Moreton Note Edited: 0114353 View Revisions
2019-02-22 23:17 J. Gareth Moreton File Added: x86_64 Optimisation Specification.pdf
2019-02-22 23:18 J. Gareth Moreton Note Added: 0114354
2019-02-22 23:34 J. Gareth Moreton File Deleted: overhaul-base.patch
2019-02-22 23:34 J. Gareth Moreton File Deleted: overhaul-standalone.patch
2019-02-22 23:34 J. Gareth Moreton File Deleted: overhaul-mov-refactor.patch
2019-02-22 23:35 J. Gareth Moreton File Deleted: overhaul-global.patch
2019-02-22 23:35 J. Gareth Moreton File Deleted: overhaul-singlepass.patch
2019-02-22 23:35 J. Gareth Moreton File Added: overhaul-base.patch
2019-02-22 23:35 J. Gareth Moreton File Added: overhaul-standalone.patch
2019-02-22 23:35 J. Gareth Moreton File Added: overhaul-global.patch
2019-02-22 23:36 J. Gareth Moreton File Added: overhaul-singlepass.patch
2019-02-22 23:36 J. Gareth Moreton File Added: overhaul-mov-refactor.patch
2019-02-22 23:37 J. Gareth Moreton Note Added: 0114356
2019-02-23 16:27 J. Gareth Moreton File Deleted: x86_64 Optimisation Specification.pdf
2019-02-23 16:28 J. Gareth Moreton File Added: x86_64 Optimisation Specification.pdf
2019-02-23 16:51 J. Gareth Moreton Note Added: 0114368
2019-02-23 16:52 J. Gareth Moreton File Added: i386-win32-regression.log
2019-02-23 16:52 J. Gareth Moreton File Added: x86_64-win64-regression.log
2019-02-26 02:57 J. Gareth Moreton File Deleted: i386-win32-regression.log
2019-02-26 02:57 J. Gareth Moreton File Deleted: x86_64-win64-regression.log
2019-02-26 02:57 J. Gareth Moreton File Deleted: x86_64 Optimisation Specification.pdf
2019-02-26 02:58 J. Gareth Moreton File Added: x86_64 Optimisation Specification.pdf
2019-02-26 03:02 J. Gareth Moreton Note Added: 0114453
2019-02-26 04:02 J. Gareth Moreton Note Edited: 0114453 View Revisions
2019-02-26 14:32 J. Gareth Moreton File Deleted: overhaul-base.patch
2019-02-26 14:33 J. Gareth Moreton File Deleted: overhaul-standalone.patch
2019-02-26 14:34 J. Gareth Moreton File Deleted: overhaul-global.patch
2019-02-26 14:35 J. Gareth Moreton File Deleted: overhaul-singlepass.patch
2019-02-26 14:35 J. Gareth Moreton File Deleted: overhaul-mov-refactor.patch
2019-02-26 14:35 J. Gareth Moreton File Added: overhaul-base.patch
2019-02-26 14:35 J. Gareth Moreton File Added: overhaul-global.patch
2019-02-26 14:36 J. Gareth Moreton File Added: overhaul-standalone.patch
2019-02-26 14:36 J. Gareth Moreton File Added: overhaul-singlepass.patch
2019-02-26 14:36 J. Gareth Moreton File Added: overhaul-mov-refactor.patch
2019-02-26 14:39 J. Gareth Moreton Note Added: 0114462
2019-02-26 15:53 J. Gareth Moreton Note Added: 0114464
2019-02-26 16:18 J. Gareth Moreton Note Added: 0114465
2019-02-26 16:22 J. Gareth Moreton Note Edited: 0114465 View Revisions
2019-02-26 17:33 J. Gareth Moreton Note Added: 0114466
2019-02-26 22:04 J. Gareth Moreton File Deleted: overhaul-standalone.patch
2019-02-26 22:05 J. Gareth Moreton File Deleted: overhaul-singlepass.patch
2019-02-26 22:05 J. Gareth Moreton File Deleted: overhaul-mov-refactor.patch
2019-02-26 22:05 J. Gareth Moreton File Added: overhaul-mov-refactor.patch
2019-02-26 22:05 J. Gareth Moreton File Added: overhaul-singlepass.patch
2019-02-26 22:06 J. Gareth Moreton File Added: overhaul-standalone.patch
2019-02-26 22:06 J. Gareth Moreton Note Added: 0114474
2019-02-27 01:18 J. Gareth Moreton Note Added: 0114480
2019-02-28 03:47 J. Gareth Moreton File Deleted: overhaul-base.patch
2019-02-28 03:47 J. Gareth Moreton File Deleted: overhaul-global.patch
2019-02-28 03:47 J. Gareth Moreton File Deleted: overhaul-mov-refactor.patch
2019-02-28 03:48 J. Gareth Moreton File Deleted: overhaul-singlepass.patch
2019-02-28 03:48 J. Gareth Moreton File Deleted: overhaul-standalone.patch
2019-02-28 03:49 J. Gareth Moreton File Added: overhaul-base.patch
2019-02-28 03:49 J. Gareth Moreton File Added: overhaul-global.patch
2019-02-28 03:49 J. Gareth Moreton File Added: overhaul-singlepass.patch
2019-02-28 03:50 J. Gareth Moreton File Added: overhaul-standalone.patch
2019-02-28 03:50 J. Gareth Moreton File Added: overhaul-mov-refactor.patch
2019-02-28 03:56 J. Gareth Moreton Note Added: 0114496
2019-02-28 04:16 J. Gareth Moreton Note Edited: 0114496 View Revisions
2019-02-28 20:40 Florian Note Added: 0114519
2019-02-28 21:47 J. Gareth Moreton Note Added: 0114523
2019-02-28 21:48 J. Gareth Moreton Note Edited: 0114523 View Revisions
2019-02-28 21:52 Florian Note Added: 0114526
2019-02-28 22:16 J. Gareth Moreton Note Added: 0114527
2019-03-01 06:52 J. Gareth Moreton File Deleted: overhaul-base.patch
2019-03-01 06:52 J. Gareth Moreton File Deleted: overhaul-global.patch
2019-03-01 06:53 J. Gareth Moreton File Added: overhaul-base.patch
2019-03-01 06:53 J. Gareth Moreton File Added: overhaul-global.patch
2019-03-01 06:55 J. Gareth Moreton Note Added: 0114532
2019-03-01 06:59 J. Gareth Moreton File Deleted: overhaul-base.patch
2019-03-01 07:00 J. Gareth Moreton File Added: overhaul-base.patch
2019-03-01 07:01 J. Gareth Moreton Note Edited: 0114532 View Revisions
2019-07-11 07:22 J. Gareth Moreton File Deleted: overhaul-singlepass.patch
2019-07-11 07:22 J. Gareth Moreton File Deleted: overhaul-base.patch
2019-07-11 07:22 J. Gareth Moreton File Deleted: overhaul-global.patch
2019-07-11 07:23 J. Gareth Moreton File Deleted: overhaul-mov-refactor.patch
2019-07-11 07:23 J. Gareth Moreton File Deleted: overhaul-standalone.patch
2019-07-11 07:25 J. Gareth Moreton File Added: overhaul-base.patch
2019-07-11 07:25 J. Gareth Moreton File Added: overhaul-global.patch
2019-07-11 07:25 J. Gareth Moreton File Added: overhaul-mov-refactor.patch
2019-07-11 07:25 J. Gareth Moreton File Added: overhaul-singlepass.patch
2019-07-11 07:25 J. Gareth Moreton File Added: overhaul-standalone.patch
2019-07-11 07:25 J. Gareth Moreton Note Added: 0117163
2019-07-11 07:26 J. Gareth Moreton Note Edited: 0117163 View Revisions
2019-07-11 08:33 rd0x Note Added: 0117165
2019-07-11 09:10 J. Gareth Moreton Note Added: 0117169
2019-07-11 09:43 J. Gareth Moreton Note Added: 0117172
2019-07-11 11:02 J. Gareth Moreton Note Edited: 0112311 View Revisions
2019-07-11 11:03 J. Gareth Moreton Note Edited: 0112311 View Revisions
2019-07-11 11:18 J. Gareth Moreton File Deleted: overhaul-standalone.patch
2019-07-11 11:20 J. Gareth Moreton File Added: overhaul-standalone.patch
2019-07-11 11:20 J. Gareth Moreton Note Added: 0117175
2019-07-11 11:34 J. Gareth Moreton File Deleted: x86_64 Optimisation Specification.pdf
2019-07-11 11:35 J. Gareth Moreton File Added: x86_64 Optimisation Specification.pdf
2019-07-11 11:35 J. Gareth Moreton Note Added: 0117177
2019-07-11 11:36 J. Gareth Moreton File Deleted: x86_64 Optimisation Specification.pdf
2019-07-11 11:37 J. Gareth Moreton File Added: x86_64 Optimisation Specification.pdf
2019-07-21 22:49 Akira1364 Note Added: 0117339
2019-07-21 23:27 Akira1364 Note Edited: 0117339 View Revisions
2019-07-22 01:00 J. Gareth Moreton Note Added: 0117341
2019-07-28 08:23 J. Gareth Moreton Note Added: 0117449
2019-07-28 08:45 J. Gareth Moreton Note Added: 0117450
2019-07-29 15:01 Akira1364 Note Added: 0117488
2019-07-30 01:47 J. Gareth Moreton Note Added: 0117497
2019-07-30 17:08 Akira1364 Note Added: 0117507
2019-07-30 17:20 J. Gareth Moreton Note Added: 0117508
2019-07-30 20:32 J. Gareth Moreton Note Edited: 0117508 View Revisions
2019-07-31 09:24 J. Gareth Moreton Note Added: 0117516
2019-07-31 09:27 J. Gareth Moreton Note Edited: 0117516 View Revisions
2019-08-11 19:15 J. Gareth Moreton Note Added: 0117643
2019-08-17 18:15 Florian Note Added: 0117719
2019-08-18 14:07 J. Gareth Moreton Note Added: 0117728
2019-08-19 18:23 Akira1364 Note Added: 0117738
2019-08-19 21:59 Akira1364 Note Edited: 0117738 View Revisions
2019-08-19 22:13 J. Gareth Moreton Note Added: 0117740
2019-10-26 17:09 J. Gareth Moreton Assigned To => J. Gareth Moreton
2019-10-26 17:09 J. Gareth Moreton Status new => closed
2019-10-26 17:09 J. Gareth Moreton Resolution open => suspended
2019-10-26 17:09 J. Gareth Moreton FPCTarget => -
2019-10-26 17:09 J. Gareth Moreton Note Added: 0118853
2019-11-06 15:09 J. Gareth Moreton Relationship added parent of 0036271