View Issue Details

IDProjectCategoryView StatusLast Update
0036382FPCCompilerpublic2020-01-06 22:05
ReporterJ. Gareth Moreton Assigned ToFlorian  
PrioritynormalSeverityminorReproducibilityN/A
Status resolvedResolutionfixed 
Platformi386 and x86_64OSMicrosoft Windows 
Product Version3.3.1 
Summary0036382: [Patch] x86 "OptPass1MOV" improvements
Description"OptPass1MOV-Improvement.patch" applies some optimisations to the peephole optimisation function of the same name. Repeated checks to the "GetNextInstruction_p Boolean" variable are factored out, since all of the optimisations bar the first one require a succeeding instruction.

More importantly, the deep 'MovMov2Mov' optimisations have been improved to be slightly faster (by not checking the immediate next instruction, as that's done elsewhere), allowing it to be performed even if the temporary register remains in use (it just preserves the first instruction in this case) and allowing the optimiser to check the 2nd next instruction under -O1 and -O2, since "GetNextInstructionUsingReg" only checks one instruction ahead unless -O3 is specified (which means we can remove the -O3 check that was present in OptPass1MOV). This optimisation was only made possible thanks to the x86-specific implementation of "RegModifiedByInstruction" over at 0036376.

"x86_mov_call.patch" is a subtle performance boost for the compiler: Pass 1 of the peephole optimiser under i386 and x86_64 will now check if the instruction is a MOV before any other (it has been removed from the case block and added to its own if statement). The reason this has been done is because MOV instructions appear disproportionately more frequently than any other instruction. About 25% of all instructions are MOVs of some kind.
Steps To ReproduceApply patches and confirm correct compilation and successful regression testing under i386 and x86_64 platforms.
Additional InformationThe improved optimisations produce smaller and faster code and can sometimes reduce the number of passes required in the peephole optimizer.

Optimizer performance would be greatly improved if register virtualisation is delayed, since the MovMov2Mov optimisation can often remove registers completely, but which have already been reserved earlier when the registers were devirtualised. Also, the optimisation cannot be performed if the stack is used for temporary storage (due to a lack of free registers), something that wouldn't be a problem when the registers are still virtual and assumed to not be modified by another thread. This would be an area of future research.
Tagscompiler, i386, optimizations, patch, x86, x86_64
Fixed in Revision
FPCOldBugId
FPCTarget-
Attached Files

Relationships

child of 0036376 resolvedFlorian [Patch] x86 implementation of RegModifiedByInstruction 

Activities

Florian

2019-12-01 22:23

administrator   ~0119570

I tried it, but on i386-linux it freezes the test suite runs. Maybe it just triggers another bug but at least it does not seem to work right now. Does the full testsuite work for you on i386?

J. Gareth Moreton

2019-12-01 22:25

developer   ~0119571

No worries, thanks. I'll do some investigating. Can't get it perfect every time. Any particular test that it freezes on?

Florian

2019-12-01 22:40

administrator   ~0119572

No :( that's the weired part: the dotest program just stalls.

J. Gareth Moreton

2019-12-01 22:43

developer   ~0119573

That is definitely very weird. I'll let you know what I find and submit a fixed patch if needs be. I'll re-run all of my test suites overnight for Windows, and if that doesn't yield anything, I'll try to get things working with Linux. I'm trying to think what could possibly cause an infinite loop, unless I made a mistake in RegModifiedByInstruction.

Florian

2019-12-01 23:19

administrator   ~0119575

in r43623, I committed a modified RegModifiedByInstruction as the old implementation had some problems. But it does not fix the stall.

J. Gareth Moreton

2019-12-01 23:33

developer   ~0119576

What was the problem, curiously? I would check the diff, but my tests are running and I didn't pull today's changes first.

J. Gareth Moreton

2019-12-02 11:57

developer   ~0119577

Okay, on i386-win32 and x86_64-win64, there are no regressions in the test suite (logs are binary equal), so it's specific to Linux.

J. Gareth Moreton

2019-12-04 10:58

developer   ~0119608

Last edited: 2019-12-04 10:58

View 2 revisions

Hi Florian - can you confirm if the hang still happens? I built yesterday's trunk with the patch applied and the tests started running without a hitch (unless I messed up the script - "sudo make full TEST_FPC=/usr/local/lib/fpc/3.3.1/ppc386")

J. Gareth Moreton

2019-12-06 03:55

developer   ~0119650

I made a minor update to how the new optimisation checks for register usage. It should ultimately be a cheaper check while still achieving the same result. I am not sure if it fixes the crash on i386-linux (which I have yet to reproduce).

J. Gareth Moreton

2019-12-06 04:49

developer   ~0119651

Some interesting insight here... I tried the idea of an optimisation where if you write to a register, and the register is immediately deallocated afterwards, then you can remove the instruction because it is effectively non-functional. That's the theory... in practice, it causes an unspecified runtime error while running ppc2.

In other words, "if (taicpu(p).oper[1]^.typ = top_reg) and not RegUsedAfterInstruction(taicpu(p).oper[1]^.reg, p, UsedRegs) then" doesn't work.

My guess is that the registers aren't as cleanly tracked as they should be.

J. Gareth Moreton

2019-12-06 16:04

developer   ~0119663

Ignore that last message... turns out that RegUsedAfterInstruction always returns False if the instruction loads a new value to the register, and for MOV instructions to the function result, the register is often deallocated right before the instruction. Makes things a little bit complicated in that regard, but there might be ways to work around it, or otherwise adjust the positions of where registers are allocated and deallocated for more accurate optimisation, since there are situations where registers are written to needlessly.

J. Gareth Moreton

2019-12-16 02:28

developer   ~0119873

Last edited: 2019-12-16 02:30

View 2 revisions

Completed a full regression run with a number of patches combined, namely these two, the one over at 0036437, and a patch I sent Florian privately that addresses an internal error for x86_64-darwin - no regressions noted on i386-win32 and x86_64-win64. Will attempt i386-linux tonight.

J. Gareth Moreton

2019-12-16 15:57

developer   ~0119886

One new failure in i386-linux: webtbs/tw2377.pp - now to determine if that is due to my additions or me unintentionally pressing something on the keyboard during that test (it tests the Keyboard unit).

J. Gareth Moreton

2019-12-16 18:50

developer   ~0119888

Can't seem to reproduce the failure. Seemed to be a glitch elsewhere. Granted, my changes were to the peephole optimiser so a failure on this test is non-sensical anyway. Still, might need another party to test.

Florian

2019-12-16 22:14

administrator   ~0119891

It still stalls on i386-linux for me.

J. Gareth Moreton

2019-12-17 05:09

developer   ~0119893

What's your setup? Commands run and the versions of supplementary tools like binutils.

I did notice some stalls in the test run, but they also occurred on the trunk.

J. Gareth Moreton

2019-12-17 07:27

developer   ~0119894

Made some progress... building with OPT="-O4" causes ppc3 and ppc386 to differ, so something is getting corrupted somewhere.

J. Gareth Moreton

2019-12-18 20:49

developer   ~0119922

I found the code that causes the bootstrapping error, but I don't yet know WHY it's causing it, and I rather not disable it if I can help it. I'm investigating the assembler dumps to see if something gets removed that shouldn't be removed, but nothing is standing out yet.

J. Gareth Moreton

2019-12-22 02:55

developer   ~0120005

Finally found the problem that was causing the bootstrapping and test errors. Some register tracking was erroneous and causing incorrect optimisation on subsequent passing. Here is the fixed patch.

I have also provided a slightly more efficient GetNextInstructionUsingReg method that returns False if it couldn't find an instruction that used the relevant register, thereby removing the need to run other checks.

Regression tests on i386-win32 and x86_64-win64 have passed, and a run on i386-linux proved successful as well.

J. Gareth Moreton

2019-12-22 04:47

developer   ~0120006

After analysing the produced assembler files, I've noticed that there's still plenty of room for improvement, but I figure I should get this patch tested and confirmed first before I go on adding any new things.

Florian

2019-12-22 15:33

administrator   ~0120017

I do not like the GetNextInstructionUsingReg part for two reasons:
- it is different for all other platforms, so this is maximally confusing
- also its code is much harder to read
Please do not micro optimize this part of the compiler, it makes like really harder.

J. Gareth Moreton

2019-12-22 15:53

developer   ~0120019

Last edited: 2019-12-22 15:56

View 3 revisions

Understood. I reverted the method save for one small change:

the condition "is_calljmp(taicpu(Next).opcode)" has been changed to "(taicpu(Next).opcode = A_JMP)", because loosening this restriction means that OptPass1MOV can optimise instructions that appear after a conditional jump. If GetNextInstructionUsingReg stops on a conditional jump, then there is no practical way to make those peephole optimisations which, sometimes, can result in the reduction of used registers.

And the OptPass1MOV patch has been updated so it doesn't make any assumptions based on the rejected GetNextInstructionUsingReg method (like hp2 being of type ait_instruction).

ADDENDUM: If Next's opcode is A_CALL, this is already caught by the fact that its instruction flags have "Ch_All" set, hence causing RegInInstruction to return True.

OptPass1MOV-Improvement-v4.patch (21,148 bytes)   
Index: compiler/x86/aoptx86.pas
===================================================================
--- compiler/x86/aoptx86.pas	(revision 43706)
+++ compiler/x86/aoptx86.pas	(working copy)
@@ -1489,9 +1489,10 @@
     function TX86AsmOptimizer.OptPass1MOV(var p : tai) : boolean;
       var
         hp1, hp2: tai;
-        GetNextInstruction_p: Boolean;
+        GetNextInstruction_p, TempRegUsed: Boolean;
         PreMessage, RegName1, RegName2, InputVal, MaskNum: string;
         NewSize: topsize;
+        CurrentReg: TRegister;
       begin
         Result:=false;
 
@@ -1501,7 +1502,7 @@
         if MatchOperand(taicpu(p).oper[0]^,taicpu(p).oper[1]^)
         then
           begin
-            DebugMsg(SPeepholeOptimization + 'Mov2Nop done',p);
+            DebugMsg(SPeepholeOptimization + 'Mov2Nop 1 done',p);
             { take care of the register (de)allocs following p }
             UpdateUsedRegs(tai(p.next));
             asml.remove(p);
@@ -1511,8 +1512,11 @@
             exit;
           end;
 
-        if GetNextInstruction_p and
-          MatchInstruction(hp1,A_AND,[]) and
+        { All the next optimisations require a next instruction }
+        if not GetNextInstruction_p or (hp1.typ <> ait_instruction) then
+          Exit;
+
+        if (taicpu(hp1).opcode = A_AND) and
           (taicpu(p).oper[1]^.typ = top_reg) and
           MatchOpType(taicpu(hp1),top_const,top_reg) then
           begin
@@ -1651,12 +1655,12 @@
               end;
           end;
         { Next instruction is also a MOV ? }
-        if GetNextInstruction_p and
-          MatchInstruction(hp1,A_MOV,[taicpu(p).opsize]) then
+        if MatchInstruction(hp1,A_MOV,[taicpu(p).opsize]) then
           begin
             if (taicpu(p).oper[1]^.typ = top_reg) and
               MatchOperand(taicpu(p).oper[1]^,taicpu(hp1).oper[0]^) then
               begin
+                CurrentReg := taicpu(p).oper[1]^.reg;
                 TransferUsedRegs(TmpUsedRegs);
                 UpdateUsedRegs(TmpUsedRegs, tai(p.Next));
                 { we have
@@ -1663,78 +1667,113 @@
                     mov x, %treg
                     mov %treg, y
                 }
-                if not(RegInOp(taicpu(p).oper[1]^.reg,taicpu(hp1).oper[1]^)) and
-                   not(RegUsedAfterInstruction(taicpu(p).oper[1]^.reg, hp1, TmpUsedRegs)) then
-                  { we've got
+                if not(RegInOp(CurrentReg, taicpu(hp1).oper[1]^)) then
+                  if not(RegUsedAfterInstruction(CurrentReg, hp1, TmpUsedRegs)) then
+                    { we've got
 
-                    mov x, %treg
-                    mov %treg, y
+                      mov x, %treg
+                      mov %treg, y
 
-                    with %treg is not used after }
-                  case taicpu(p).oper[0]^.typ Of
-                    top_reg:
-                      begin
-                        { change
-                            mov %reg, %treg
-                            mov %treg, y
+                      with %treg is not used after }
+                    case taicpu(p).oper[0]^.typ Of
+                      top_reg:
+                        begin
+                          { change
+                              mov %reg, %treg
+                              mov %treg, y
 
-                            to
+                              to
 
-                            mov %reg, y
-                        }
-                        if taicpu(hp1).oper[1]^.typ=top_reg then
-                          AllocRegBetween(taicpu(hp1).oper[1]^.reg,p,hp1,usedregs);
-                        taicpu(p).loadOper(1,taicpu(hp1).oper[1]^);
-                        DebugMsg(SPeepholeOptimization + 'MovMov2Mov 2 done',p);
-                        asml.remove(hp1);
-                        hp1.free;
-                        Result:=true;
-                        Exit;
-                      end;
-                    top_const:
-                      begin
-                        { change
-                            mov const, %treg
-                            mov %treg, y
+                              mov %reg, y
+                          }
+                          if taicpu(hp1).oper[1]^.typ=top_reg then
+                            AllocRegBetween(taicpu(hp1).oper[1]^.reg,p,hp1,usedregs);
+                          taicpu(p).loadOper(1,taicpu(hp1).oper[1]^);
+                          DebugMsg(SPeepholeOptimization + 'MovMov2Mov 2 done',p);
+                          asml.remove(hp1);
+                          hp1.free;
+                          Result:=true;
+                          Exit;
+                        end;
+                      top_const:
+                        begin
+                          { change
+                              mov const, %treg
+                              mov %treg, y
 
-                            to
+                              to
 
-                            mov const, y
-                        }
-                        if (taicpu(hp1).oper[1]^.typ=top_reg) or
-                          ((taicpu(p).oper[0]^.val>=low(longint)) and (taicpu(p).oper[0]^.val<=high(longint))) then
+                              mov const, y
+                          }
+                          if (taicpu(hp1).oper[1]^.typ=top_reg) or
+                            ((taicpu(p).oper[0]^.val>=low(longint)) and (taicpu(p).oper[0]^.val<=high(longint))) then
+                            begin
+                              if taicpu(hp1).oper[1]^.typ=top_reg then
+                                AllocRegBetween(taicpu(hp1).oper[1]^.reg,p,hp1,usedregs);
+                              taicpu(p).loadOper(1,taicpu(hp1).oper[1]^);
+                              DebugMsg(SPeepholeOptimization + 'MovMov2Mov 5 done',p);
+                              asml.remove(hp1);
+                              hp1.free;
+                              Result:=true;
+                              Exit;
+                            end;
+                        end;
+                      top_ref:
+                        if (taicpu(hp1).oper[1]^.typ = top_reg) then
                           begin
-                            if taicpu(hp1).oper[1]^.typ=top_reg then
-                              AllocRegBetween(taicpu(hp1).oper[1]^.reg,p,hp1,usedregs);
-                            taicpu(p).loadOper(1,taicpu(hp1).oper[1]^);
-                            DebugMsg(SPeepholeOptimization + 'MovMov2Mov 5 done',p);
+                            { change
+                                 mov mem, %treg
+                                 mov %treg, %reg
+
+                                 to
+
+                                 mov mem, %reg"
+                            }
+                            taicpu(p).loadreg(1, taicpu(hp1).oper[1]^.reg);
+                            DebugMsg(SPeepholeOptimization + 'MovMov2Mov 3 done',p);
                             asml.remove(hp1);
                             hp1.free;
                             Result:=true;
                             Exit;
                           end;
-                      end;
-                    top_ref:
-                      if (taicpu(hp1).oper[1]^.typ = top_reg) then
+                      else
+                        { Do nothing };
+                    end
+                  else
+                    { %treg is used afterwards }
+
+                    case taicpu(p).oper[0]^.typ of
+                      top_const:
+                        if
+                          (
+                            not (cs_opt_size in current_settings.optimizerswitches) or
+                            (taicpu(hp1).opsize = S_B)
+                          ) and
+                          (
+                            (taicpu(hp1).oper[1]^.typ = top_reg) or
+                            ((taicpu(p).oper[0]^.val >= low(longint)) and (taicpu(p).oper[0]^.val <= high(longint)))
+                          ) then
+                          begin
+                            DebugMsg(SPeepholeOptimization + debug_operstr(taicpu(hp1).oper[0]^) + ' = $' + debug_tostr(taicpu(p).oper[0]^.val) + '; changed to minimise pipeline stall (MovMov2Mov 6b)',hp1);
+                            taicpu(hp1).loadconst(0, taicpu(p).oper[0]^.val);
+                          end;
+                      top_reg:
                         begin
-                          { change
-                               mov mem, %treg
-                               mov %treg, %reg
-
-                               to
-
-                               mov mem, %reg"
-                          }
-                          taicpu(p).loadoper(1,taicpu(hp1).oper[1]^);
-                          DebugMsg(SPeepholeOptimization + 'MovMov2Mov 3 done',p);
-                          asml.remove(hp1);
-                          hp1.free;
-                          Result:=true;
-                          Exit;
+                          DebugMsg(SPeepholeOptimization + debug_operstr(taicpu(hp1).oper[0]^) + ' = ' + debug_regname(taicpu(p).oper[0]^.reg) + '; changed to minimise pipeline stall (MovMov2Mov 6c)',hp1);
+                          AllocRegBetween(taicpu(p).oper[0]^.reg, p, hp1, UsedRegs);
+                          if MatchOperand(taicpu(hp1).oper[1]^, taicpu(p).oper[0]^.reg) then
+                            begin
+                              DebugMsg(SPeepholeOptimization + 'Mov2Nop 2 done',hp1);
+                              asml.remove(hp1);
+                              hp1.free;
+                              Result := True;
+                              Exit;
+                            end;
+                          taicpu(hp1).loadreg(0, taicpu(p).oper[0]^.reg);
                         end;
-                    else
-                      ;
-                  end;
+                      else
+                        { Do nothing };
+                    end;
               end;
             if (taicpu(hp1).oper[0]^.typ = taicpu(p).oper[1]^.typ) and
                (taicpu(hp1).oper[1]^.typ = taicpu(p).oper[0]^.typ) then
@@ -1934,14 +1973,17 @@
                 exit;
               end;
           end;
+
         { search further than the next instruction for a mov }
-        if (cs_opt_level3 in current_settings.optimizerswitches) and
+        if
           { check as much as possible before the expensive GetNextInstructionUsingReg call }
           (taicpu(p).oper[1]^.typ = top_reg) and
           (taicpu(p).oper[0]^.typ in [top_reg,top_const]) and
+          not RegModifiedByInstruction(taicpu(p).oper[1]^.reg, hp1) and
           { we work with hp2 here, so hp1 can be still used later on when
             checking for GetNextInstruction_p }
-          GetNextInstructionUsingReg(p,hp2,taicpu(p).oper[1]^.reg) and
+          { GetNextInstructionUsingReg only searches one instruction ahead unless -O3 is specified }
+          GetNextInstructionUsingReg(hp1,hp2,taicpu(p).oper[1]^.reg) and
           MatchInstruction(hp2,A_MOV,[]) and
           MatchOperand(taicpu(p).oper[1]^,taicpu(hp2).oper[0]^) and
           ((taicpu(p).oper[0]^.typ=top_const) or
@@ -1950,42 +1992,92 @@
            )
           ) then
           begin
-            TransferUsedRegs(TmpUsedRegs);
             { we have
                 mov x, %treg
                 mov %treg, y
             }
-            if not(RegInOp(taicpu(p).oper[1]^.reg,taicpu(hp2).oper[1]^)) and
-               not(RegUsedAfterInstruction(taicpu(p).oper[1]^.reg, hp2, TmpUsedRegs)) then
-              { we've got
 
-                mov x, %treg
-                mov %treg, y
+            TransferUsedRegs(TmpUsedRegs);
+            TmpUsedRegs[R_INTREGISTER].Update(tai(p.Next));
 
-                with %treg is not used after }
-              case taicpu(p).oper[0]^.typ Of
-                top_reg:
-                  begin
-                    { change
-                        mov %reg, %treg
-                        mov %treg, y
+            { We don't need to call UpdateUsedRegs for every instruction between
+              p and hp2 because the register we're concerned about will not
+              become deallocated (otherwise GetNextInstructionUsingReg would
+              have stopped at an earlier instruction). [Kit] }
 
-                        to
+            TempRegUsed :=
+              RegUsedAfterInstruction(taicpu(p).oper[1]^.reg, hp2, TmpUsedRegs) or
+              RegReadByInstruction(taicpu(p).oper[1]^.reg, hp1);
 
-                        mov %reg, y
-                    }
-                    AllocRegBetween(taicpu(p).oper[0]^.reg,p,hp2,usedregs);
-                    taicpu(hp2).loadOper(0,taicpu(p).oper[0]^);
-                    DebugMsg(SPeepholeOptimization + 'MovMov2Mov 6 done',p);
-                    { take care of the register (de)allocs following p }
-                    UpdateUsedRegs(tai(p.next));
-                    asml.remove(p);
-                    p.free;
-                    p:=hp1;
-                    Result:=true;
-                    Exit;
-                  end;
-                top_const:
+            case taicpu(p).oper[0]^.typ Of
+              top_reg:
+                begin
+                  { change
+                      mov %reg, %treg
+                      mov %treg, y
+
+                      to
+
+                      mov %reg, y
+                  }
+                  CurrentReg := taicpu(p).oper[0]^.reg; { Saves on a handful of pointer dereferences }
+                  RegName1 := debug_regname(taicpu(hp2).oper[0]^.reg);
+                  if taicpu(hp2).oper[1]^.reg = CurrentReg then
+                    begin
+                      { %reg = y - remove hp2 completely (doing it here instead of relying on
+                        the "mov %reg,%reg" optimisation might cut down on a pass iteration) }
+
+                      if TempRegUsed then
+                        begin
+                          DebugMsg(SPeepholeOptimization + debug_regname(CurrentReg) + ' = ' + RegName1 + '; removed unnecessary instruction (MovMov2MovNop 6b}',hp2);
+                          AllocRegBetween(CurrentReg, p, hp2, UsedRegs);
+                          asml.remove(hp2);
+                          hp2.Free;
+                        end
+                      else
+                        begin
+                          asml.remove(hp2);
+                          hp2.Free;
+
+                          { We can remove the original MOV too }
+                          DebugMsg(SPeepholeOptimization + 'MovMov2NopNop 6b done',p);
+                          { take care of the register (de)allocs following p }
+                          UpdateUsedRegs(tai(p.next));
+                          asml.remove(p);
+                          p.free;
+                          p:=hp1;
+                          Result:=true;
+                          Exit;
+                        end;
+                    end
+                  else
+                    begin
+                      AllocRegBetween(CurrentReg, p, hp2, UsedRegs);
+                      taicpu(hp2).loadReg(0, CurrentReg);
+                      if TempRegUsed then
+                        begin
+                          { Don't remove the first instruction if the temporary register is in use }
+                          DebugMsg(SPeepholeOptimization + RegName1 + ' = ' + debug_regname(CurrentReg) + '; changed to minimise pipeline stall (MovMov2Mov 6a}',hp2);
+
+                          { No need to set Result to True. If there's another instruction later on
+                            that can be optimised, it will be detected when the main Pass 1 loop
+                            reaches what is now hp2 and passes it through OptPass1MOV. [Kit] };
+                        end
+                      else
+                        begin
+                          DebugMsg(SPeepholeOptimization + 'MovMov2Mov 6 done',p);
+                          { take care of the register (de)allocs following p }
+                          UpdateUsedRegs(tai(p.next));
+                          asml.remove(p);
+                          p.free;
+                          p:=hp1;
+                          Result:=true;
+                          Exit;
+                        end;
+                    end;
+                end;
+              top_const:
+                if not (cs_opt_size in current_settings.optimizerswitches) or (taicpu(hp2).opsize = S_B) then
                   begin
                     { change
                         mov const, %treg
@@ -1998,15 +2090,29 @@
                     if (taicpu(hp2).oper[1]^.typ=top_reg) or
                       ((taicpu(p).oper[0]^.val>=low(longint)) and (taicpu(p).oper[0]^.val<=high(longint))) then
                       begin
+                        RegName1 := debug_regname(taicpu(hp2).oper[0]^.reg);
                         taicpu(hp2).loadOper(0,taicpu(p).oper[0]^);
-                        DebugMsg(SPeepholeOptimization + 'MovMov2Mov 7 done',p);
-                        { take care of the register (de)allocs following p }
-                        UpdateUsedRegs(tai(p.next));
-                        asml.remove(p);
-                        p.free;
-                        p:=hp1;
-                        Result:=true;
-                        Exit;
+
+                        if TempRegUsed then
+                          begin
+                            { Don't remove the first instruction if the temporary register is in use }
+                            DebugMsg(SPeepholeOptimization + RegName1 + ' = ' + debug_tostr(taicpu(p).oper[0]^.val) + '; changed to minimise pipeline stall (MovMov2Mov 7a)',hp2);
+
+                            { No need to set Result to True. If there's another instruction later on
+                              that can be optimised, it will be detected when the main Pass 1 loop
+                              reaches what is now hp2 and passes it through OptPass1MOV. [Kit] };
+                          end
+                        else
+                          begin
+                            DebugMsg(SPeepholeOptimization + 'MovMov2Mov 7 done',p);
+                            { take care of the register (de)allocs following p }
+                            UpdateUsedRegs(tai(p.next));
+                            asml.remove(p);
+                            p.free;
+                            p:=hp1;
+                            Result:=true;
+                            Exit;
+                          end;
                       end;
                   end;
                 else
@@ -2024,8 +2130,7 @@
 
            to avoid a write/read penalty
         }
-        if GetNextInstruction_p and
-           MatchOpType(taicpu(p),top_reg,top_reg) and
+        if MatchOpType(taicpu(p),top_reg,top_reg) and
            ((MatchInstruction(hp1,A_OR,A_AND,A_TEST,[]) and
             MatchOperand(taicpu(p).oper[1]^,taicpu(hp1).oper[0]^) and
             (taicpu(hp1).oper[1]^.typ = top_reg) and
@@ -2095,8 +2200,7 @@
           parameter or to the temporary storage room for the function
           result)
         }
-        if GetNextInstruction_p and
-          IsExitCode(hp1) and
+        if IsExitCode(hp1) and
           MatchOpType(taicpu(p),top_reg,top_ref) and
           (taicpu(p).oper[1]^.ref^.base = current_procinfo.FramePointer) and
           not(assigned(current_procinfo.procdef.funcretsym) and
@@ -2111,8 +2215,7 @@
             Result:=true;
             exit;
           end;
-        if GetNextInstruction_p and
-          MatchOpType(taicpu(p),top_reg,top_ref) and
+        if MatchOpType(taicpu(p),top_reg,top_ref) and
           MatchInstruction(hp1,A_CMP,A_TEST,[taicpu(p).opsize]) and
           (taicpu(hp1).oper[1]^.typ = top_ref) and
            RefsEqual(taicpu(p).oper[1]^.ref^, taicpu(hp1).oper[1]^.ref^) then
@@ -2131,8 +2234,7 @@
             AllocRegBetween(taicpu(p).oper[0]^.reg,p,hp1,usedregs);
             exit;
           end;
-        if GetNextInstruction_p and
-          (taicpu(p).oper[1]^.typ = top_reg) and
+        if (taicpu(p).oper[1]^.typ = top_reg) and
           (hp1.typ = ait_instruction) and
           GetNextInstruction(hp1, hp2) and
           MatchInstruction(hp2,A_MOV,[]) and
@@ -2289,8 +2391,7 @@
                   end;
               end;
           end;
-        if GetNextInstruction_p and
-          MatchInstruction(hp1,A_BTS,A_BTR,[Taicpu(p).opsize]) and
+        if MatchInstruction(hp1,A_BTS,A_BTR,[Taicpu(p).opsize]) and
           GetNextInstruction(hp1, hp2) and
           MatchInstruction(hp2,A_OR,[Taicpu(p).opsize]) and
           MatchOperand(Taicpu(p).oper[0]^,0) and
@@ -2311,8 +2412,7 @@
             exit;
           end;
 
-        if GetNextInstruction_p and
-           MatchInstruction(hp1,A_LEA,[S_L]) and
+        if MatchInstruction(hp1,A_LEA,[S_L]) and
            MatchOpType(Taicpu(p),top_ref,top_reg) and
            ((MatchReference(Taicpu(hp1).oper[0]^.ref^,Taicpu(hp1).oper[1]^.reg,Taicpu(p).oper[1]^.reg) and
              (Taicpu(hp1).oper[0]^.ref^.base<>Taicpu(p).oper[1]^.reg)

J. Gareth Moreton

2019-12-22 21:09

developer   ~0120027

Removed the "GetNextInstructionUsingReg" patch, as that needs further discussion.

J. Gareth Moreton

2019-12-24 00:48

developer   ~0120047

Does the latest patch work as intended? I plan to refine the optimisations further down the road, building on this patch once it's accepted.

Florian

2019-12-26 10:33

administrator   ~0120067

I tested it and on x86_64-linux it breaks test/terecs14 (I use -O4).

J. Gareth Moreton

2019-12-26 13:20

developer   ~0120070

Okay, I'll have a check. My virtual machine is only i386-linux currently. Still, it's an improvement over the last problems!

J. Gareth Moreton

2020-01-02 22:14

developer   ~0120189

I haven't been able to reproduce the failure on x86_64-linux. I've attached my log file for the OptPass1MOV-Improvement-v4.patch - the log is identical to the trunk when the compiler is built under -O4. I stripped oui the dates and times to make comparison easier for myself, and there are 15 failures, but these also appear in the trunk's log and so are not related to this patch.

My command lines were:

Building: sudo make clean all install OPT="-O4"
Testing; sudo make full TEST_FPC=/usr/local/lib/fpc/3.3.1/ppcx64

Am I doing something wrong?
linux-mov-refactor.log (372,863 bytes)

J. Gareth Moreton

2020-01-03 23:23

developer   ~0120201

Regenerated x86_mov_call.patch so it merges succesfully with the most recent trunk revision.
x86_mov_call.patch (1,949 bytes)   
Index: compiler/i386/aoptcpu.pas
===================================================================
--- compiler/i386/aoptcpu.pas	(revision 43854)
+++ compiler/i386/aoptcpu.pas	(working copy)
@@ -140,6 +140,15 @@
               current_filepos:=taicpu(p).fileinfo;
               if InsContainsSegRef(taicpu(p)) then
                 exit;
+                
+              { Subtle speed boost: MOV appears disproportionately more
+                frequently than any other instruction }
+              if taicpu(p).opcode = A_MOV then
+                begin
+                  Result:=OptPass1MOV(p);
+                  Exit;
+                end;
+                
               case taicpu(p).opcode Of
                 A_AND:
                   Result:=OptPass1And(p);
@@ -151,8 +160,6 @@
                   Result:=OptPass1FSTP(p);
                 A_LEA:
                   Result:=OptPass1LEA(p);
-                A_MOV:
-                  Result:=OptPass1MOV(p);
                 A_MOVSX,
                 A_MOVZX :
                   Result:=OptPass1Movx(p);
Index: compiler/x86_64/aoptcpu.pas
===================================================================
--- compiler/x86_64/aoptcpu.pas	(revision 43854)
+++ compiler/x86_64/aoptcpu.pas	(working copy)
@@ -71,11 +71,17 @@
         case p.typ of
           ait_instruction:
             begin
+              { Subtle speed boost: MOV appears disproportionately more
+                frequently than any other instruction }
+              if taicpu(p).opcode = A_MOV then
+                begin
+                  Result:=OptPass1MOV(p);
+                  Exit;
+                end;
+                
               case taicpu(p).opcode of
                 A_AND:
                   Result:=OptPass1AND(p);
-                A_MOV:
-                  Result:=OptPass1MOV(p);
                 A_MOVSX,
                 A_MOVZX:
                   Result:=OptPass1Movx(p);
x86_mov_call.patch (1,949 bytes)   

Florian

2020-01-06 22:05

administrator   ~0120242

I found the problem: X86AsmOptimizer.RegModifiedByInstruction did not work properly for IMUL, I fixed. As I couldn't measure any improvement by x86_mov_call.patch, I think it is not worth to apply it.

Issue History

Date Modified Username Field Change
2019-11-30 00:07 J. Gareth Moreton New Issue
2019-11-30 00:07 J. Gareth Moreton File Added: OptPass1MOV-Improvement.patch
2019-11-30 00:07 J. Gareth Moreton File Added: x86_mov_call.patch
2019-11-30 00:07 J. Gareth Moreton Relationship added child of 0036376
2019-11-30 00:08 J. Gareth Moreton Tag Attached: patch
2019-11-30 00:08 J. Gareth Moreton Tag Attached: compiler
2019-11-30 00:08 J. Gareth Moreton Tag Attached: optimizations
2019-11-30 00:08 J. Gareth Moreton Tag Attached: x86_64
2019-11-30 00:08 J. Gareth Moreton Tag Attached: x86
2019-11-30 00:08 J. Gareth Moreton Tag Attached: i386
2019-11-30 00:11 J. Gareth Moreton Additional Information Updated View Revisions
2019-11-30 00:11 J. Gareth Moreton FPCTarget => -
2019-11-30 00:12 J. Gareth Moreton Additional Information Updated View Revisions
2019-11-30 00:12 J. Gareth Moreton File Deleted: OptPass1MOV-Improvement.patch
2019-11-30 00:13 J. Gareth Moreton File Added: OptPass1MOV-Improvement.patch
2019-12-01 22:23 Florian Note Added: 0119570
2019-12-01 22:25 J. Gareth Moreton Note Added: 0119571
2019-12-01 22:40 Florian Note Added: 0119572
2019-12-01 22:43 J. Gareth Moreton Note Added: 0119573
2019-12-01 23:19 Florian Note Added: 0119575
2019-12-01 23:33 J. Gareth Moreton Note Added: 0119576
2019-12-02 11:57 J. Gareth Moreton Note Added: 0119577
2019-12-04 10:58 J. Gareth Moreton Note Added: 0119608
2019-12-04 10:58 J. Gareth Moreton Note Edited: 0119608 View Revisions
2019-12-06 03:55 J. Gareth Moreton File Added: OptPass1MOV-Improvement-v2.patch
2019-12-06 03:55 J. Gareth Moreton Note Added: 0119650
2019-12-06 04:49 J. Gareth Moreton Note Added: 0119651
2019-12-06 04:50 J. Gareth Moreton File Deleted: OptPass1MOV-Improvement.patch
2019-12-06 16:04 J. Gareth Moreton Note Added: 0119663
2019-12-16 02:28 J. Gareth Moreton Note Added: 0119873
2019-12-16 02:30 J. Gareth Moreton Note Edited: 0119873 View Revisions
2019-12-16 15:57 J. Gareth Moreton Note Added: 0119886
2019-12-16 18:50 J. Gareth Moreton Note Added: 0119888
2019-12-16 22:14 Florian Note Added: 0119891
2019-12-17 05:09 J. Gareth Moreton Note Added: 0119893
2019-12-17 07:27 J. Gareth Moreton Note Added: 0119894
2019-12-18 20:49 J. Gareth Moreton Note Added: 0119922
2019-12-22 02:31 J. Gareth Moreton File Deleted: OptPass1MOV-Improvement-v2.patch
2019-12-22 02:55 J. Gareth Moreton File Added: GetNextInstructionUsingReg-Improvement.patch
2019-12-22 02:55 J. Gareth Moreton File Added: OptPass1MOV-Improvement-v3.patch
2019-12-22 02:55 J. Gareth Moreton Note Added: 0120005
2019-12-22 04:47 J. Gareth Moreton Note Added: 0120006
2019-12-22 15:33 Florian Note Added: 0120017
2019-12-22 15:46 J. Gareth Moreton File Deleted: GetNextInstructionUsingReg-Improvement.patch
2019-12-22 15:46 J. Gareth Moreton File Deleted: OptPass1MOV-Improvement-v3.patch
2019-12-22 15:53 J. Gareth Moreton File Added: GetNextInstructionUsingReg-Improvement-v2.patch
2019-12-22 15:53 J. Gareth Moreton File Added: OptPass1MOV-Improvement-v4.patch
2019-12-22 15:53 J. Gareth Moreton Note Added: 0120019
2019-12-22 15:55 J. Gareth Moreton Note Edited: 0120019 View Revisions
2019-12-22 15:56 J. Gareth Moreton Note Edited: 0120019 View Revisions
2019-12-22 21:09 J. Gareth Moreton File Deleted: GetNextInstructionUsingReg-Improvement-v2.patch
2019-12-22 21:09 J. Gareth Moreton Note Added: 0120027
2019-12-24 00:48 J. Gareth Moreton Note Added: 0120047
2019-12-26 10:33 Florian Note Added: 0120067
2019-12-26 13:20 J. Gareth Moreton Note Added: 0120070
2020-01-02 22:14 J. Gareth Moreton File Added: linux-mov-refactor.log
2020-01-02 22:14 J. Gareth Moreton Note Added: 0120189
2020-01-03 23:22 J. Gareth Moreton File Deleted: x86_mov_call.patch
2020-01-03 23:23 J. Gareth Moreton File Added: x86_mov_call.patch
2020-01-03 23:23 J. Gareth Moreton Note Added: 0120201
2020-01-06 22:05 Florian Assigned To => Florian
2020-01-06 22:05 Florian Status new => resolved
2020-01-06 22:05 Florian Resolution open => fixed
2020-01-06 22:05 Florian Note Added: 0120242