View Issue Details

IDProjectCategoryView StatusLast Update
0038837FPCCompilerpublic2021-05-01 20:56
ReporterJ. Gareth Moreton Assigned ToFlorian  
PrioritylowSeveritytweakReproducibilityN/A
Status resolvedResolutionfixed 
Platformaarch64-linuxOSDebian GNU/Linux (Raspberry Pi) 
Product Version3.3.1 
Fixed in Version3.3.1 
Summary0038837: [Patch] AArch64 Improved speed and efficiency with constant generation
DescriptionAfter the introduction of "magic division", a new class of constants were revealed... reciprocals of small divisors that were massive 64-bit numbers but which could be encoded with an "orr/movk" pair, since often only the first word differed. This patch therefore permits the encoding of constants where copying the 3rd word onto the 1st word produced a value that's valid for AArch64's barrel shifter, and then using movk to correct the 1st word. For example, instead of encoding $AAAAAAAAAAAAAAAB (reciprocal of 3) as "movz reg,#0xAAAB; movk reg,#0xAAAA,lsl 16; movk reg,#0xAAAA,lsl 32; movk reg,#0xAAAA,lsl 48", it is instead encoded as "orr reg,xzr,#0xAAAAAAAAAAAAAAAA; movk reg,#0xAAAB". Cycle count is identical (2), but overall code size is 8 bytes smaller.

Additionally, 32-bit numbers are now encoded as a single ORR instruction (as the synthetic MOV instruction for clarity) where possible.
Steps To ReproduceApply patch and confirm correct compilation and slightly smaller code size.
Additional InformationSome minor speed-ups have been applied to the routine by rearranging some conditional checks, and to help follow convention and aid analysis of assembly dumps, ORR Instructions are encoded as MOV instructions when they appear by themselves. Also, in some situations, a 64-bit constant was applied to a 32-bit register (usually as the result of sign-extension) and would appear as such in the assembly dump - such constants are now truncated by the compiler to remove confusion and prevent potential future assembler errors due to the oversized value.
Tagsaarch64, compiler, optimization, patch
Fixed in Revision49321
FPCOldBugId
FPCTarget-
Attached Files

Activities

J. Gareth Moreton

2021-05-01 03:06

developer  

newconst.patch (5,599 bytes)   
Index: compiler/aarch64/cgcpu.pas
===================================================================
--- compiler/aarch64/cgcpu.pas	(revision 49298)
+++ compiler/aarch64/cgcpu.pas	(working copy)
@@ -583,7 +583,7 @@
         opc: tasmop;
         shift: byte;
         so: tshifterop;
-        reginited,doinverted: boolean;
+        reginited,doinverted,extendedsize: boolean;
         manipulated_a: tcgint;
         leftover_a: word;
       begin
@@ -590,6 +590,8 @@
 {$ifdef extdebug}
         list.concat(tai_comment.Create(strpnew('Generating constant ' + tostr(a) + ' / $' + hexstr(a, 16))));
 {$endif extdebug}
+        extendedsize := (size in [OS_64,OS_S64]);
+
         case a of
           { Small positive number }
           $0..$FFFF:
@@ -613,19 +615,50 @@
             end;
           else
             begin
+              if not extendedsize then
+                { Mostly so programmers don't get confused when they view the disassembly and
+                  'a' is sign-extended to 64-bit, say, but also avoids potential problems with
+                  third-party assemblers if the number is out of bounds for a given size }
+                a := Cardinal(a);
 
-              if size in [OS_64,OS_S64] then
+              { Check to see if a is a valid shifter constant that can be encoded in ORR as is }
+              if is_shifter_const(a,size) then
                 begin
-                  { Check to see if a is a valid shifter constant that can be encoded in ORR as is }
-                  if is_shifter_const(a,size) then
+                  { Use synthetic "MOV" instruction instead of "ORR reg,wzr,#a" (an alias),
+                    since AArch64 conventions prefer this, and it's clearer in the
+                    disassembly }
+                  list.concat(taicpu.op_reg_const(A_MOV,reg,a));
+                  Exit;
+                end;
+
+              { If the value of a fits into 32 bits, it's fastest to use movz/movk regardless }
+              if extendedsize and ((a shr 32) <> 0) then
+                begin
+                  { This determines whether this write can be performed with an ORR followed by MOVK
+                    by copying the 3nd word to the 1st word for the ORR constant, then overwriting
+                    the 1st word.  The alternative would require 4 instructions.  This sequence is
+                    common when division reciprocals are calculated (e.g. 3 produces AAAAAAAAAAAAAAAB). }
+                  leftover_a := word(a and $FFFF);
+                  manipulated_a := (a and $FFFFFFFFFFFF0000) or ((a shr 32) and $FFFF);
+                  { if manipulated_a = a, don't check, because is_shifter_const was already
+                    called for a and it returned False.  Reduces processing time. [Kit] }
+                  if (manipulated_a <> a) and is_shifter_const(manipulated_a, OS_64) then
                     begin
-                      list.concat(taicpu.op_reg_reg_const(A_ORR,reg,makeregsize(NR_XZR,size),a));
+                      { Encode value as:
+                          orr  reg,xzr,manipulated_a
+                          movk reg,#(leftover_a)
+
+                        Use "orr" instead of "mov" here for the assembly dump so it better
+                        implies that something special is happening with the number arrangement.
+                      }
+                      list.concat(taicpu.op_reg_reg_const(A_ORR, reg, NR_XZR, manipulated_a));
+                      list.concat(taicpu.op_reg_const(A_MOVK, reg, leftover_a));
                       Exit;
                     end;
 
                   { This determines whether this write can be performed with an ORR followed by MOVK
                     by copying the 2nd word to the 4th word for the ORR constant, then overwriting
-                    the 4th word (unless the word is.  The alternative would require 3 instructions }
+                    the 4th word.  The alternative would require 3 instructions }
                   leftover_a := word(a shr 48);
                   manipulated_a := (a and $0000FFFFFFFFFFFF);
 
@@ -642,13 +675,16 @@
                   manipulated_a := manipulated_a or (((a shr 16) and $FFFF) shl 48);
                   { if manipulated_a = a, don't check, because is_shifter_const was already
                     called for a and it returned False.  Reduces processing time. [Kit] }
-                  if (manipulated_a <> a) and is_shifter_const(manipulated_a, size) then
+                  if (manipulated_a <> a) and is_shifter_const(manipulated_a, OS_64) then
                     begin
                       { Encode value as:
                           orr  reg,xzr,manipulated_a
                           movk reg,#(leftover_a),lsl #48
+
+                        Use "orr" instead of "mov" here for the assembly dump so it better
+                        implies that something special is happening with the number arrangement.
                       }
-                      list.concat(taicpu.op_reg_reg_const(A_ORR, reg, makeregsize(NR_XZR, size), manipulated_a));
+                      list.concat(taicpu.op_reg_reg_const(A_ORR, reg, NR_XZR, manipulated_a));
                       shifterop_reset(so);
                       so.shiftmode := SM_LSL;
                       so.shiftimm := 48;
@@ -679,10 +715,7 @@
                   end;
                 end
               else
-                begin
-                  a:=cardinal(a);
-                  doinverted:=False;
-                end;
+                doinverted:=False;
             end;
         end;
 
newconst.patch (5,599 bytes)   

J. Gareth Moreton

2021-05-01 03:57

developer   ~0130696

Last edited: 2021-05-01 03:59

View 2 revisions

As a side note, MOV instructions with immediates will be much easier to find future optimisations for - for example, this segment from the System unit has so much potential:

    mov w3,#2146435072
    cmp w1,w3
    b.gt .Lj1541
.Lj1542:
    mov w3,#2146435072
    cmp w1,w3
    cset w3,eq
    cmp w2,#0
    cset w2,ne
    and w2,w2,w3
    cbz w2,.Lj1543

It's worth noting that the label .Lj1542 is not referenced by any branch instructions but still seems to be live for some reason. That will have to be checked at some point, since live labels that should be dead prevent a lot of optimisations. With that fixed, the second mov/cmp pair can be removed.

Florian

2021-05-01 20:56

administrator   ~0130710

Thanks, applied.

Issue History

Date Modified Username Field Change
2021-05-01 03:06 J. Gareth Moreton New Issue
2021-05-01 03:06 J. Gareth Moreton File Added: newconst.patch
2021-05-01 03:06 J. Gareth Moreton Tag Attached: patch
2021-05-01 03:06 J. Gareth Moreton Tag Attached: aarch64
2021-05-01 03:06 J. Gareth Moreton Tag Attached: optimization
2021-05-01 03:06 J. Gareth Moreton Tag Attached: compiler
2021-05-01 03:07 J. Gareth Moreton Priority normal => low
2021-05-01 03:07 J. Gareth Moreton Severity minor => tweak
2021-05-01 03:07 J. Gareth Moreton Description Updated View Revisions
2021-05-01 03:07 J. Gareth Moreton FPCTarget => -
2021-05-01 03:08 J. Gareth Moreton Description Updated View Revisions
2021-05-01 03:09 J. Gareth Moreton Description Updated View Revisions
2021-05-01 03:57 J. Gareth Moreton Note Added: 0130696
2021-05-01 03:59 J. Gareth Moreton Note Edited: 0130696 View Revisions
2021-05-01 20:56 Florian Assigned To => Florian
2021-05-01 20:56 Florian Status new => resolved
2021-05-01 20:56 Florian Resolution open => fixed
2021-05-01 20:56 Florian Fixed in Version => 3.3.1
2021-05-01 20:56 Florian Fixed in Revision => 49321
2021-05-01 20:56 Florian Note Added: 0130710