View Issue Details

IDProjectCategoryView StatusLast Update
0038560FPCCompilerpublic2021-03-01 23:20
ReporterJ. Gareth Moreton Assigned ToFlorian  
PrioritylowSeveritytweakReproducibilityN/A
Status resolvedResolutionfixed 
Platformx86_64OSMicrosoft Windows 
Product Version3.3.1 
Fixed in Version3.3.1 
Summary0038560: [Patch] MOV/SHR reference optimisation (x64 only)
DescriptionThis patch optimises 64-bit reads from memory that then discard the lower 32-bits with SHR. That is, it replaces pairs such as:

movq (%rdx),%rax
shrq $32,%rax

With:

movl 4(%rdx),$eax

Shifts greater than 32 are also optimised, with the register size changed to 32-bit and 32 subtracted from the shift amount. This may open up deeper optimisations and are smaller in code size.
Steps To ReproduceApply patch and confirm correct compilation.
Additional InformationA nice cascade optimisation can be seen in the System unit:

    ...
    movq (%rcx),%rax
    shrq $32,%rax
    andl $-2147483648,%eax (presumably a TEST instruction appeared after this line that was removed in the post-peephole stage)
    jne .Lj1562
    ...

Becomes:

    ...
    je .Lj1562
# Peephole Optimization: movq (%rcx),%rax; shrq $32,%rax -> movl 4(%rcx),%eax (MovShr2Mov)
# Peephole Optimization: MovAndTest2Test done
    testl $-2147483648,4(%rcx)
    jne .Lj1562
    ...
Tagscompiler, optimization, patch, x86_64
Fixed in Revision48857
FPCOldBugId
FPCTarget-
Attached Files

Activities

J. Gareth Moreton

2021-03-01 00:15

developer  

mov-shr-optimisation.patch (2,490 bytes)   
Index: compiler/x86/aoptx86.pas
===================================================================
--- compiler/x86/aoptx86.pas	(revision 48842)
+++ compiler/x86/aoptx86.pas	(working copy)
@@ -3076,6 +3076,55 @@
             Result:=true;
             exit;
           end;
+
+{$ifdef x86_64}
+        { Convert:
+            movq x(ref),%reg64
+            shrq y,%reg64
+          To:
+            movq x+4(ref),%reg32
+            shrq y-32,%reg32 (Remove if y = 32)
+        }
+        if (taicpu(p).opsize = S_Q) and
+          (taicpu(p).oper[0]^.typ = top_ref) and { Second operand will be a register }
+          (taicpu(p).oper[0]^.ref^.offset <= $7FFFFFFB) and
+          MatchInstruction(hp1, A_SHR, [taicpu(p).opsize]) and
+          MatchOpType(taicpu(hp1), top_const, top_reg) and
+          (taicpu(hp1).oper[0]^.val >= 32) and
+          (taicpu(hp1).oper[1]^.reg = taicpu(p).oper[1]^.reg) then
+          begin
+            RegName1 := debug_regname(taicpu(hp1).oper[1]^.reg);
+            PreMessage := 'movq ' + debug_operstr(taicpu(p).oper[0]^) + ',' + RegName1 + '; ' +
+              'shrq $' + debug_tostr(taicpu(hp1).oper[0]^.val) + ',' + RegName1 + ' -> movl ';
+
+            { Convert to 32-bit }
+            setsubreg(taicpu(p).oper[1]^.reg, R_SUBD);
+            taicpu(p).opsize := S_L;
+
+            Inc(taicpu(p).oper[0]^.ref^.offset, 4);
+
+            PreMessage := PreMessage + debug_operstr(taicpu(p).oper[0]^) + ',' + debug_regname(taicpu(p).oper[1]^.reg);
+            if (taicpu(hp1).oper[0]^.val = 32) then
+              begin
+                DebugMsg(SPeepholeOptimization + PreMessage + ' (MovShr2Mov)', p);
+                RemoveInstruction(hp1);
+              end
+            else
+              begin
+                { This will potentially open up more arithmetic operations since
+                  the peephole optimizer now has a big hint that only the lower
+                  32 bits are currently in use (and opcodes are smaller in size) }
+                setsubreg(taicpu(hp1).oper[1]^.reg, R_SUBD);
+                taicpu(hp1).opsize := S_L;
+
+                Dec(taicpu(hp1).oper[0]^.val, 32);
+                DebugMsg(SPeepholeOptimization + PreMessage +
+                  '; shrl $' + debug_tostr(taicpu(hp1).oper[0]^.val) + ',' + debug_regname(taicpu(hp1).oper[1]^.reg) + ' (MovShr2MovShr)', p);
+              end;
+            Result := True;
+            Exit;
+          end;
+{$endif x86_64}
       end;
 
 
mov-shr-optimisation.patch (2,490 bytes)   

Benito van der Zander

2021-03-01 12:51

reporter   ~0129276

You could run a benchmark to see how much faster it is?

J. Gareth Moreton

2021-03-01 13:20

developer   ~0129281

i'll try to come up with something. For shifts greater than 32, there shouldn't be any loss or gain since the cycle count is identical - the only direct saving is that it takes fewer bytes to encode the instructions (I'll double-check, but it might only be one byte, and that's if the SHR instruction is NOT modifying R8-R15).

Florian

2021-03-01 21:44

administrator   ~0129293

Thanks, applied.

@Benito: It's more a matter of "bit by bit" or as we say in German "Mühsam ernährt sich das Eichhörnchen" (To feed as a squirrel is hard): overall the patch reduces the system unit by 48 bytes or 0.005 %.

J. Gareth Moreton

2021-03-01 23:20

developer   ~0129299

Not quite as good as my first ever patch that reduced the System unit by around 50KB (changes "movq x,%regq" to "movl x,%regd" where possible), but every little helps! Also, that 48 bytes includes the 16-byte alignment granularity. so it will be interesting to see exactly how many byte savings the individual instructions make.

I'm glad you agree with my philosophy, Florian. Small but frequent savings of just a handful of bytes or clock cycles eventually add up into something massive.

Issue History

Date Modified Username Field Change
2021-03-01 00:15 J. Gareth Moreton New Issue
2021-03-01 00:15 J. Gareth Moreton File Added: mov-shr-optimisation.patch
2021-03-01 00:16 J. Gareth Moreton Tag Attached: patch
2021-03-01 00:16 J. Gareth Moreton Tag Attached: compiler
2021-03-01 00:16 J. Gareth Moreton Tag Attached: optimization
2021-03-01 00:16 J. Gareth Moreton Tag Attached: x86_64
2021-03-01 00:18 J. Gareth Moreton Priority normal => low
2021-03-01 00:18 J. Gareth Moreton Severity minor => tweak
2021-03-01 00:18 J. Gareth Moreton FPCTarget => -
2021-03-01 12:51 Benito van der Zander Note Added: 0129276
2021-03-01 13:20 J. Gareth Moreton Note Added: 0129281
2021-03-01 21:44 Florian Assigned To => Florian
2021-03-01 21:44 Florian Status new => resolved
2021-03-01 21:44 Florian Resolution open => fixed
2021-03-01 21:44 Florian Fixed in Version => 3.3.1
2021-03-01 21:44 Florian Fixed in Revision => 48857
2021-03-01 21:44 Florian Note Added: 0129293
2021-03-01 23:20 J. Gareth Moreton Note Added: 0129299