View Issue Details

ID: 0036608
Project: FPC
Category: Compiler
View Status: public
Last Update: 2020-02-01 22:53
Reporter: J. Gareth Moreton
Assigned To: Florian
Priority: normal
Severity: minor
Reproducibility: N/A
Status: resolved
Resolution: fixed
Platform: i386 and x86_64
OS: Microsoft Windows
OS Version: 10 Professional
Product Version: 3.3.1
Product Build: r44000
Target Version:
Fixed in Version: 3.3.1
Summary: 0036608: [Patch] x86 "OptPass1MOV" improvements - Part 2
Description: This patch introduces deeper optimisation of MOV operations and the instructions that immediately follow them, substituting registers that hold equal values in order to avoid pipeline stalls. It also does the work of many other optimisations in OptPass1MOV (these will be factored out later).

For example, one method in the Lazarus source used to compile into the following:

    pushq %rbp
    movq %rcx,%rbp
    movq -8(%rbp),%rax
    movb $0,2080(%rax)
    popq %rbp
    ret

With the new optimisation, the "movq -8(%rbp),%rax" instruction becomes "movq -8(%rcx),%rax", breaking the dependency chain between it and the preceding "movq %rcx,%rbp" instruction (future work may also determine that %rbp is not required at all, which would simplify the structured exception handling for this routine).
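
For reference, the resulting sequence (my reconstruction from the description above; only the load instruction changes) is:

    pushq %rbp
    movq %rcx,%rbp
    movq -8(%rcx),%rax
    movb $0,2080(%rax)
    popq %rbp
    ret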
Steps To Reproduce: Apply the patch and confirm that the i386 and x86_64 builds of the compiler compile correctly and that the test suite still passes (my own runs on i386-win32 and x86_64-win64 show no regressions).
Additional Information: Most of the optimisation code is in its own routine, named DeepMOVOpt, so that it can be reused later by the deeper MOV optimisations that use GetNextInstructionUsingReg.

Some existing optimisations appear to have become redundant. They will be stripped out in a future patch once it is confirmed that they are no longer needed; these include, but are not limited to:

- MovTest/Cmp/Or/AndJxx2Test/Cmp/Or/AndJxx when using CMP
- MovTest/Cmp/Or/AndJxx2MovTest/Cmp/Or/AndJxx when using CMP
- MovMov2Mov 2
- MovMov2Mov 4
- MovMov2Mov 6c
- Mov2Nop 2 (related to MovMov2Mov 6c)
- Some MOV/MOVZX and MOV/MOVSX optimisations.

Some "MovTestCmp2MovTestCmp 1" optimisations get missed because the reference in the MOV instruction gets its register changed (thus no longer matching the reference in the CMP instruction). This will get rectified when the GetNextInstructionUsingReg optimisations are extended to use DeepMOVOpt and thus change this reference as well.
Tags: compiler, i386, optimizations, patch, x86, x86_64
Fixed in Revision: 44086
FPCOldBugId:
FPCTarget: -
Attached Files
  • OptPass1MOV-logs.zip (67,878 bytes)
  • OptPass1MOV-deep-optimise-part-2.patch (15,145 bytes)
    Index: compiler/x86/aoptx86.pas
    ===================================================================
    --- compiler/x86/aoptx86.pas	(revision 44032)
    +++ compiler/x86/aoptx86.pas	(working copy)
    @@ -51,6 +51,18 @@
               depend on the value in AH). }
             function Reg1ReadDependsOnReg2(reg1, reg2: tregister): boolean;
     
    +        { Replaces all references to AOldReg in a memory reference to ANewReg }
    +        class function ReplaceRegisterInRef(var ref: TReference; const AOldReg, ANewReg: TRegister): Boolean; static;
    +
    +        { Replaces all references to AOldReg in an operand to ANewReg }
    +        class function ReplaceRegisterInOper(const p: taicpu; const OperIdx: Integer; const AOldReg, ANewReg: TRegister): Boolean; static;
    +
    +        { Replaces all references to AOldReg in an instruction to ANewReg,
    +          except where the register is being written }
    +        function ReplaceRegisterInInstruction(const p: taicpu; const AOldReg, ANewReg: TRegister): Boolean;
    +
    +        function DeepMOVOpt(const p_mov: taicpu; const hp: taicpu): Boolean;
    +
             procedure DebugMsg(const s : string; p : tai);inline;
     
             class function IsExitCode(p : tai) : boolean; static;
    @@ -1506,6 +1518,188 @@
           end;
     
     
    +    { Replaces all references to AOldReg in a memory reference to ANewReg }
    +    class function TX86AsmOptimizer.ReplaceRegisterInRef(var ref: TReference; const AOldReg, ANewReg: TRegister): Boolean;
    +      var
    +        OldSupReg: TSuperRegister;
    +        OldSubReg, MemSubReg: TSubRegister;
    +      begin
    +        Result := False;
    +        { For safety reasons, only check for exact register matches }
    +
    +        { Check base register }
    +        if (ref.base = AOldReg) then
    +          begin
    +            ref.base := ANewReg;
    +            Result := True;
    +          end;
    +
    +        { Check index register }
    +        if (ref.index = AOldReg) then
    +          begin
    +            ref.index := ANewReg;
    +            Result := True;
    +          end;
    +      end;
    +
    +
    +    { Replaces all references to AOldReg in an operand to ANewReg }
    +    class function TX86AsmOptimizer.ReplaceRegisterInOper(const p: taicpu; const OperIdx: Integer; const AOldReg, ANewReg: TRegister): Boolean;
    +      var
    +        OldSupReg, NewSupReg: TSuperRegister;
    +        OldSubReg, NewSubReg, MemSubReg: TSubRegister;
    +        OldRegType: TRegisterType;
    +        ThisOper: POper;
    +      begin
    +        ThisOper := p.oper[OperIdx]; { Faster to access overall }
    +        Result := False;
    +
    +        if (AOldReg = NR_NO) or (ANewReg = NR_NO) then
    +          InternalError(2020011801);
    +
    +        OldSupReg := getsupreg(AOldReg);
    +        OldSubReg := getsubreg(AOldReg);
    +        OldRegType := getregtype(AOldReg);
    +        NewSupReg := getsupreg(ANewReg);
    +        NewSubReg := getsubreg(ANewReg);
    +
    +        if OldRegType <> getregtype(ANewReg) then
    +          InternalError(2020011802);
    +
    +        if OldSubReg <> NewSubReg then
    +          InternalError(2020011803);
    +
    +        case ThisOper^.typ of
    +          top_reg:
    +            if (
    +              (ThisOper^.reg = AOldReg) or
    +                (
    +                  (OldRegType = R_INTREGISTER) and
    +                  (getsupreg(ThisOper^.reg) = OldSupReg) and
    +                  (getregtype(ThisOper^.reg) = R_INTREGISTER) and
    +                  (
    +                    (getsubreg(ThisOper^.reg) <= OldSubReg)
    +{$ifndef x86_64}
    +                    and (
    +                    { Under i386 and i8086, ESI, EDI, EBP and ESP
    +                      don't have an 8-bit representation }
    +                      (getsubreg(ThisOper^.reg) >= R_SUBW) or
    +                      not (NewSupReg in [RS_ESI, RS_EDI, RS_EBP, RS_ESP])
    +                    )
    +{$endif x86_64}
    +                  )
    +                )
    +              ) then
    +              begin
    +                ThisOper^.reg := newreg(getregtype(ANewReg), NewSupReg, getsubreg(p.oper[OperIdx]^.reg));;
    +                Result := True;
    +              end;
    +          top_ref:
    +            if ReplaceRegisterInRef(ThisOper^.ref^, AOldReg, ANewReg) then
    +              Result := True;
    +
    +          else
    +            ;
    +        end;
    +      end;
    +
    +
    +    { Replaces all references to AOldReg in an instruction to ANewReg }
    +    function TX86AsmOptimizer.ReplaceRegisterInInstruction(const p: taicpu; const AOldReg, ANewReg: TRegister): Boolean;
    +      const
    +        ReadFlag: array[0..3] of TInsChange = (Ch_Rop1, Ch_Rop2, Ch_Rop3, Ch_Rop4);
    +      var
    +        OperIdx: Integer;
    +      begin
    +        Result := False;
    +
    +        for OperIdx := 0 to p.ops - 1 do
    +          if (ReadFlag[OperIdx] in InsProp[p.Opcode].Ch) and
    +          { The shift and rotate instructions can only use CL }
    +          not (
    +            (OperIdx = 0) and
    +            { This second condition just helps to avoid unnecessarily
    +              calling MatchInstruction for 10 different opcodes }
    +            (p.oper[0]^.reg = NR_CL) and
    +            MatchInstruction(p, [A_RCL, A_RCR, A_ROL, A_ROR, A_SAL, A_SAR, A_SHL, A_SHLD, A_SHR, A_SHRD], [])
    +          ) then
    +            Result := ReplaceRegisterInOper(p, OperIdx, AOldReg, ANewReg) or Result;
    +      end;
    +
    +
    +    function TX86AsmOptimizer.DeepMOVOpt(const p_mov: taicpu; const hp: taicpu): Boolean;
    +      var
    +        CurrentReg, ReplaceReg: TRegister;
    +        SubReg: TSubRegister;
    +      begin
    +        Result := False;
    +
    +        ReplaceReg := taicpu(p_mov).oper[0]^.reg;
    +        CurrentReg := taicpu(p_mov).oper[1]^.reg;
    +
    +        case hp.opcode of
    +          A_FSTSW, A_FNSTSW,
    +          A_IN,   A_INS,  A_OUT,  A_OUTS,
    +          A_CMPS, A_LODS, A_MOVS, A_SCAS, A_STOS:
    +            { These routines have explicit operands, but they are restricted in
    +              what they can be (e.g. IN and OUT can only read from AL, AX or
    +              EAX. }
    +            Exit;
    +
    +          A_IMUL:
    +            begin
    +                { The 1-operand version writes to implicit registers
    +                  The 2-operand version reads from the first operator, and reads
    +                  from and writes to the second (equivalent to Ch_ROp1, ChRWOp2).
    +                  the 3-operand version reads from a register that it doesn't write to
    +                }
    +              case hp.ops of
    +                1:
    +                  if (
    +                    (
    +                      (hp.opsize = S_B) and (getsupreg(CurrentReg) <> RS_EAX)
    +                    ) or
    +                      not (getsupreg(CurrentReg) in [RS_EAX, RS_EDX])
    +                  ) and ReplaceRegisterInOper(hp, 0, CurrentReg, ReplaceReg) then
    +                    begin
    +                      Result := True;
    +                      DebugMsg(SPeepholeOptimization + debug_regname(CurrentReg) + ' = ' + debug_regname(ReplaceReg) + '; changed to minimise pipeline stall (MovIMul2MovIMul 1)', hp);
    +                      AllocRegBetween(ReplaceReg, p_mov, hp, UsedRegs);
    +                    end;
    +                2:
    +                  { Only modify the first parameter }
    +                  if ReplaceRegisterInOper(hp, 0, CurrentReg, ReplaceReg) then
    +                    begin
    +                      Result := True;
    +                      DebugMsg(SPeepholeOptimization + debug_regname(CurrentReg) + ' = ' + debug_regname(ReplaceReg) + '; changed to minimise pipeline stall (MovIMul2MovIMul 2)', hp);
    +                      AllocRegBetween(ReplaceReg, p_mov, hp, UsedRegs);
    +                    end;
    +                3:
    +                  { Only modify the second parameter }
    +                  if ReplaceRegisterInOper(hp, 1, CurrentReg, ReplaceReg) then
    +                    begin
    +                      Result := True;
    +                      DebugMsg(SPeepholeOptimization + debug_regname(CurrentReg) + ' = ' + debug_regname(ReplaceReg) + '; changed to minimise pipeline stall (MovIMul2MovIMul 3)', hp);
    +                      AllocRegBetween(ReplaceReg, p_mov, hp, UsedRegs);
    +                    end;
    +                else
    +                  InternalError(2020012901);
    +              end;
    +            end;
    +
    +          else
    +            if (hp.ops > 0) and
    +              ReplaceRegisterInInstruction(hp, CurrentReg, ReplaceReg) then
    +              begin
    +                Result := True;
    +                DebugMsg(SPeepholeOptimization + debug_regname(CurrentReg) + ' = ' + debug_regname(ReplaceReg) + '; changed to minimise pipeline stall (MovXXX2MovXXX)', hp);
    +
    +                AllocRegBetween(ReplaceReg, p_mov, hp, UsedRegs);
    +              end;
    +          end;
    +      end;
    +
    +
         function TX86AsmOptimizer.OptPass1MOV(var p : tai) : boolean;
           var
             hp1, hp2: tai;
    @@ -1536,6 +1730,146 @@
             if not GetNextInstruction_p or (hp1.typ <> ait_instruction) then
               Exit;
     
    +        { Look for:
    +            mov %reg1,%reg2
    +            ??? %reg2,r/m
    +          Change to:
    +            mov %reg1,%reg2
    +            ??? %reg1,r/m
    +        }
    +        if MatchOpType(taicpu(p), top_reg, top_reg) then
    +          begin
    +            CurrentReg := taicpu(p).oper[1]^.reg;
    +
    +            if RegReadByInstruction(CurrentReg, hp1) and
    +              DeepMOVOpt(taicpu(p), taicpu(hp1)) then
    +              begin
    +                TransferUsedRegs(TmpUsedRegs);
    +                UpdateUsedRegs(TmpUsedRegs, tai(p.Next));
    +
    +                if not RegUsedAfterInstruction(CurrentReg, hp1, TmpUsedRegs) and
    +                  { Just in case something didn't get modified (e.g. an
    +                    implicit register) }
    +                  not RegReadByInstruction(CurrentReg, hp1) then
    +                  begin
    +                    { We can remove the original MOV }
    +                    DebugMsg(SPeepholeOptimization + 'Mov2Nop 3 done',p);
    +                    Asml.Remove(p);
    +                    p.Free;
    +                    p := hp1;
    +
    +                    { TmpUsedRegs contains the results of "UpdateUsedRegs(tai(p.Next))" already,
    +                      so just restore it to UsedRegs instead of calculating it again }
    +                    RestoreUsedRegs(TmpUsedRegs);
    +                    Result := True;
    +                    Exit;
    +                  end;
    +
    +                { If we know a MOV instruction has become a null operation, we might as well
    +                  get rid of it now to save time. }
    +                if (taicpu(hp1).opcode = A_MOV) and
    +                  (taicpu(hp1).oper[1]^.typ = top_reg) and
    +                  SuperRegistersEqual(taicpu(hp1).oper[1]^.reg, taicpu(p).oper[0]^.reg) and
    +                  { Just being a register is enough to confirm it's a null operation }
    +                  (taicpu(hp1).oper[0]^.typ = top_reg) then
    +                  begin
    +
    +                    Result := True;
    +
    +                    { Speed-up to reduce a pipeline stall... if we had something like...
    +
    +                        movl %eax,%edx
    +                        movw %dx,%ax
    +
    +                      ... the second instruction would change to movw %ax,%ax, but
    +                      given that it is now %ax that's active rather than %eax,
    +                      penalties might occur due to a partial register write, so instead,
    +                      change it to a MOVZX instruction when optimising for speed.
    +                    }
    +                    if not (cs_opt_size in current_settings.optimizerswitches) and
    +{$ifdef i8086}
    +                      { MOVZX was only introduced on the 386 }
    +                      (current_settings.cputype >= cpu_386) and
    +{$endif i8086}
    +                      (
    +                        (taicpu(hp1).opsize < taicpu(p).opsize)
    +{$ifdef x86_64}
    +                        { operations already implicitly set the upper 64 bits to zero }
    +                        and not ((taicpu(hp1).opsize = S_L) and (taicpu(p).opsize = S_Q))
    +{$endif x86_64}
    +                      ) then
    +                      begin
    +                        CurrentReg := taicpu(hp1).oper[1]^.reg;
    +
    +                        DebugMsg(SPeepholeOptimization + 'Zero-extension to minimise pipeline stall (Mov2Movz)',hp1);
    +                        case taicpu(p).opsize of
    +                          S_W:
    +                            if taicpu(hp1).opsize = S_B then
    +                              taicpu(hp1).opsize := S_BW
    +                            else
    +                              InternalError(2020012911);
    +                          S_L{$ifdef x86_64}, S_Q{$endif x86_64}:
    +                            case taicpu(hp1).opsize of
    +                              S_B:
    +                                taicpu(hp1).opsize := S_BL;
    +                              S_W:
    +                                taicpu(hp1).opsize := S_WL;
    +                              else
    +                                InternalError(2020012912);
    +                            end;
    +                          else
    +                            InternalError(2020012910);
    +                        end;
    +
    +                        taicpu(hp1).opcode := A_MOVZX;
    +                        taicpu(hp1).oper[1]^.reg := newreg(getregtype(CurrentReg), getsupreg(CurrentReg), R_SUBD)
    +                      end
    +                    else
    +                      begin
    +                        GetNextInstruction_p := GetNextInstruction(hp1, hp2);
    +                        DebugMsg(SPeepholeOptimization + 'Mov2Nop 4 done',hp1);
    +                        asml.remove(hp1);
    +                        hp1.free;
    +
    +                        { The instruction after what was hp1 is now the immediate next instruction,
    +                          so we can continue to make optimisations if it's present }
    +                        if not GetNextInstruction_p or (hp2.typ <> ait_instruction) then
    +                          Exit;
    +
    +                        hp1 := hp2;
    +                      end;
    +                  end;
    +
    +              end;
    +          end;
    +
    +        { Depending on the DeepMOVOpt above, it may turn out that hp1 completely
    +          overwrites the original destination register.  e.g.
    +
    +          movl   %reg1d,%reg2d
    +          movslq %reg1d,%reg2q
    +
    +          In this case, we can remove the MOV
    +        }
    +        if (taicpu(p).oper[1]^.typ = top_reg) and
    +          MatchInstruction(hp1, [A_LEA, A_MOV, A_MOVSX, A_MOVZX{$ifdef x86_64}, A_MOVSXD{$endif x86_64}], []) and
    +          { The RegInOp check makes sure that movb r/m,%reg1b; movzbl %reg1b,%reg1l"
    +            and "movl r/m,%reg1; leal $1(%reg1,%reg2),%reg1" etc. are not incorrectly
    +            optimised }
    +          (taicpu(hp1).oper[1]^.typ = top_reg) and
    +          not RegInOp(taicpu(p).oper[1]^.reg, taicpu(hp1).oper[0]^) and
    +          Reg1WriteOverwritesReg2Entirely(taicpu(hp1).oper[1]^.reg, taicpu(p).oper[1]^.reg) then
    +          begin
    +            DebugMsg(SPeepholeOptimization + 'Mov2Nop 5 done',p);
    +            { take care of the register (de)allocs following p }
    +            UpdateUsedRegs(tai(p.next));
    +            asml.remove(p);
    +            p.free;
    +            p:=hp1;
    +            Result := True;
    +            Exit;
    +          end;
    +
             if (taicpu(hp1).opcode = A_AND) and
               (taicpu(p).oper[1]^.typ = top_reg) and
               MatchOpType(taicpu(hp1),top_const,top_reg) then
    

Activities

Florian

2020-01-23 21:51

administrator   ~0120691

x86_64-linux breaks at ppc1 with:

ports.inc(33,3) Error: Asm: [out reg16,reg8] invalid combination of opcode and operands
ports.inc(39,33) Error: Asm: [in reg8,reg16] invalid combination of opcode and operands
ports.inc(45,3) Error: Asm: [out reg16,reg16] invalid combination of opcode and operands
ports.inc(51,33) Error: Asm: [in reg16,reg16] invalid combination of opcode and operands
ports.inc(57,3) Error: Asm: [out reg16,reg32] invalid combination of opcode and operands
ports.inc(63,33) Error: Asm: [in reg32,reg16] invalid combination of opcode and operands
ports.inc(64,1) Fatal: There were 6 errors compiling module, stopping

J. Gareth Moreton

2020-01-24 08:23

developer   ~0120695

Aah, thanks Florian - I'll check it out. Always Linux! At least it's only ppc1... that should be easy enough to debug.

J. Gareth Moreton

2020-01-24 23:08

developer   ~0120723

Last edited: 2020-01-24 23:13

I found the problem. The optimisation was changing the registers in "out" instructions, which can only read the data from AL/AX/EAX and the port number from DX or an immediate. No such operation appears in Windows builds, which is why it escaped my notice. Also... there's no excuse for not testing on Linux properly, but I had problems booting my virtual machine due to disk space limitations!

Rather than treating IN and OUT as special cases, I'm developing a more general solution in case there are other opcodes I've missed.

J. Gareth Moreton

2020-01-24 23:36

developer   ~0120724

Well, so far I haven't been successful in finding a good general solution, because IN and OUT have explicit operands, but those operands exist only to determine the size of the operation.

IN can only take the form "in $imm8/%dx, %al/%ax/%eax", while OUT is the same but with the parameters reversed - different registers cannot be used even though the operands are explicit (see the examples below).
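
For illustration, a few of the accepted AT&T forms (these are my own sample encodings, not lines from the failing ports.inc):

    in   $0x60, %al     # port number from an immediate, data read into %al
    in   %dx, %eax      # port number from %dx, data read into %eax
    out  %al, $0x60     # data written from %al, port number from an immediate
    out  %eax, %dx      # data written from %eax, port number from %dx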

In the meantime, here's the patch with those two opcodes checked. I'll run a full test overnight to see if anything else gets caught. A more general solution would be nice, but InsProp isn't detailed enough to help with this one.

J. Gareth Moreton

2020-01-25 03:41

developer   ~0120725

Last edited: 2020-01-25 03:48

I've found 9 opcodes so far that have explicit operands but which are fixed in some way, and have updated the patch to include them.

There's no straightforward way to detect whether a particular opcode has this design. STOS, LODS and the like have particular flags set in their InsProp entries that imply an element of inflexibility (e.g. STOS has a lone memory operand but has the Ch_WMemEDI flag set, among others), but nothing concrete. I would like to suggest adding a new "TInsChange" value, named Ch_ExplicitOps, for instructions whose operands must take a very specific form and should not be blindly changed.

Further testing and refactoring may be required.

ADDENDUM: I removed a specific check for MUL, DIV and IDIV because the code was no different from the general case and hence was just duplication.

J. Gareth Moreton

2020-01-25 17:40

developer   ~0120736

Linux regression tests have passed. I've attached the logs for trunk and for the patched compiler, with timestamps stripped to make direct comparison easier. Any failures that appear are common to both.

OptPass1MOV-logs.zip (67,878 bytes)

Florian

2020-01-25 22:07

administrator   ~0120741

There is still something wrong: on i386-linux the compiler generates wrong floating point code when the patch is active, resulting in 0000064:0000050 extra failures in the test suite :(

J. Gareth Moreton

2020-01-25 22:11

reporter   ~0120742

Last edited: 2020-01-25 22:12

Oh crumbs - thanks! That gives me something to think about, since the old floating-point stack isn't really used under x86_64. Time to investigate. It's interesting that i386-win32 is fine though, so I wonder what the problem is this time.

Still, the biggest changes always produce the most problems. I'll take a look and get it fixed.

On another note, what do you think about the "Ch_Explicit" flag? Only a handful of instructions would have it, because it would mostly apply to the older CISC-style instructions rather than the more modern ones, which are usually more flexible in their register usage.

J. Gareth Moreton

2020-01-26 06:34

reporter   ~0120752

It's not going too well at the moment... I haven't been able to reproduce the failures. Running the tests on i386-linux, at least with the standard make process and no additional flags, yields no regressions. Is there something particular that you're doing?

Florian

2020-01-26 11:37

administrator   ~0120755

I am using -O4 -Cpcoreavx2

J. Gareth Moreton

2020-01-26 15:41

reporter   ~0120760

Aah, I didn't use -Cpcoreavx2; I'll give that a try... mind you, I'm not even sure if my computer supports AVX2.

nanobit

2020-01-26 16:28

reporter   ~0120762

I have no experience with it, but maybe the Intel SDE emulator can help to find more bugs.

J. Gareth Moreton

2020-01-26 20:26

reporter   ~0120771

Okay, some progress. For me, when trying to compile on i386-linux with -O4 and -CpCOREAVX2, I get a build error because ppc3 and ppc386 differ. Time to investigate.

J. Gareth Moreton

2020-01-27 03:18

reporter   ~0120775

Also reproduced on i386-win32, which will be easier for me to debug. It may take a while though, and 0036630 is going to be a higher priority.

I don't think these build errors are due to missing AVX2 support, since trunk builds without incident and invalid or unsupported opcodes usually raise exceptions.

J. Gareth Moreton

2020-01-29 11:56

reporter   ~0120792

I might have found the cause, but I won't be able to confirm it until later when I get back to my personal laptop. With most instructions that operate on the general-purpose registers, only the last operand (in AT&T mode) is written to, but the fairly recent MULX instruction breaks this rule: both the last and second-to-last operands (both registers) are written to. I suspect I'm not accounting for this, and hence the low 32 bits are being stored in the wrong register in some situations. I'll attempt to write a better solution to detect such operands.
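
For illustration, a sketch of the 32-bit AT&T form as I understand it (the register choices are mine, not from the failing code):

    mulx %ecx, %eax, %ebx    # %edx * %ecx: low 32 bits -> %eax, high 32 bits -> %ebx (two destinations written)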

J. Gareth Moreton

2020-01-30 02:00

reporter   ~0120799

It was the MULX instruction! And my general-purpose solution found some extra optimisations on top of that. I'm now running the Windows regression tests overnight to see if everything holds up, and will run the Linux tests tomorrow.

I also added an extra speed optimisation that I had noticed while studying the assembler dumps - occasionally you get the following sequence:

movl %eax,%edi
movw %di,%ax

Originally, my code changed %di to %ax, and the movw instruction was then removed completely because it had become a null operation ("mov %ax,%ax"). When optimising for speed, it will instead be changed to "movzwl %ax,%eax" or equivalent, which zeroes the unused part of the register and minimises partial-register write penalties (for i8086, this is only done on the 386 or above, the processor that introduced the MOVZX instruction). See the sketch below.
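
A sketch of the final sequence when optimising for speed, as I understand the description above (the register choice follows the example):

movl   %eax,%edi
movzwl %ax,%eax    # was "movw %di,%ax"; zero-extended to avoid a partial-register write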

J. Gareth Moreton

2020-01-31 01:35

reporter   ~0120813

Hopefully this patch has everything fixed! Windows tests have passed and i386-linux has passed for "-O4 -CpCOREAVX2". x86_64-linux is still running.

OptPass1MOV-deep-optimise-part-2.patch (15,145 bytes)

J. Gareth Moreton

2020-01-31 21:36

reporter   ~0120832

x86_64-linux has passed for -O4 -CpCOREAVX2.

Florian

2020-02-01 22:53

administrator   ~0120843

Thanks, works fine now.

Issue History

Date Modified Username Field Change
2020-01-20 09:44 J. Gareth Moreton New Issue
2020-01-20 09:44 J. Gareth Moreton File Added: OptPass1MOV-deep-optimise-part-2.patch
2020-01-20 09:45 J. Gareth Moreton Tag Attached: patch
2020-01-20 09:45 J. Gareth Moreton Tag Attached: compiler
2020-01-20 09:45 J. Gareth Moreton Tag Attached: optimizations
2020-01-20 09:45 J. Gareth Moreton Tag Attached: x86_64
2020-01-20 09:45 J. Gareth Moreton Tag Attached: x86
2020-01-20 09:45 J. Gareth Moreton Tag Attached: i386
2020-01-23 21:51 Florian Note Added: 0120691
2020-01-24 08:23 J. Gareth Moreton Note Added: 0120695
2020-01-24 23:08 J. Gareth Moreton Note Added: 0120723
2020-01-24 23:13 J. Gareth Moreton Note Edited: 0120723 View Revisions
2020-01-24 23:32 J. Gareth Moreton File Deleted: OptPass1MOV-deep-optimise-part-2.patch
2020-01-24 23:36 J. Gareth Moreton File Added: OptPass1MOV-deep-optimise-part-2.patch
2020-01-24 23:36 J. Gareth Moreton Note Added: 0120724
2020-01-25 03:31 J. Gareth Moreton File Deleted: OptPass1MOV-deep-optimise-part-2.patch
2020-01-25 03:41 J. Gareth Moreton File Added: OptPass1MOV-deep-optimise-part-2.patch
2020-01-25 03:41 J. Gareth Moreton Note Added: 0120725
2020-01-25 03:48 J. Gareth Moreton Note Edited: 0120725 View Revisions
2020-01-25 17:40 J. Gareth Moreton File Added: OptPass1MOV-logs.zip
2020-01-25 17:40 J. Gareth Moreton Note Added: 0120736
2020-01-25 22:07 Florian Note Added: 0120741
2020-01-25 22:11 J. Gareth Moreton Note Added: 0120742
2020-01-25 22:12 J. Gareth Moreton Note Edited: 0120742 View Revisions
2020-01-26 06:34 J. Gareth Moreton Note Added: 0120752
2020-01-26 11:37 Florian Note Added: 0120755
2020-01-26 15:41 J. Gareth Moreton Note Added: 0120760
2020-01-26 16:28 nanobit Note Added: 0120762
2020-01-26 20:26 J. Gareth Moreton Note Added: 0120771
2020-01-27 03:18 J. Gareth Moreton Note Added: 0120775
2020-01-29 11:56 J. Gareth Moreton Note Added: 0120792
2020-01-30 02:00 J. Gareth Moreton Note Added: 0120799
2020-01-30 02:00 J. Gareth Moreton File Deleted: OptPass1MOV-deep-optimise-part-2.patch
2020-01-31 01:35 J. Gareth Moreton File Added: OptPass1MOV-deep-optimise-part-2.patch
2020-01-31 01:35 J. Gareth Moreton Note Added: 0120813
2020-01-31 21:36 J. Gareth Moreton Note Added: 0120832
2020-02-01 22:53 Florian Assigned To => Florian
2020-02-01 22:53 Florian Status new => resolved
2020-02-01 22:53 Florian Resolution open => fixed
2020-02-01 22:53 Florian Fixed in Version => 3.3.1
2020-02-01 22:53 Florian Fixed in Revision => 44086
2020-02-01 22:53 Florian FPCTarget => -
2020-02-01 22:53 Florian Note Added: 0120843