View Issue Details

IDProjectCategoryView StatusLast Update
0037554FPCCompilerpublic2020-08-16 06:43
ReporterJ. Gareth Moreton Assigned ToFlorian  
PriorityhighSeveritymajorReproducibilityalways
Status resolvedResolutionfixed 
Platformaarch64OSLinux (Raspberry Pi OS) 
Product Version3.0.4 
Summary0037554: AArch64 generates incorrect code for some immediate values
DescriptionSo I've been looking at improving some of the Arm-64 code generated when assigning values to variables, and I was a bit shocked to discover that a number of said assignments produce incorrect code, usually changing the immediate value to -1.
Steps To ReproduceWrite and run the following program:

program BadValue;
begin
  WriteLn($FFFFFFFBFFFFFFFF)
end.

Observe output to be -1, and the generated assembly language for the actual parameter to be something like "movn x2, #0x0, lsl 32" (this is meant to write something to the 33nd to 48th bits, but just writes all zeroes then inverts the entire register, producing $FFFFFFFFFFFFFFFF = -1)
Additional InformationI have fixed the bug and made a comprehensive test, but am waiting on the test suite to finish running. However, my fix is embedded among my general improvements to MOV commands.

This bug is present in FPC 3.0.4 and the trunk.
Tagsaarch64, compiler
Fixed in Revision46401,46402
FPCOldBugId
FPCTarget-
Attached Files

Activities

J. Gareth Moreton

2020-08-12 09:18

developer   ~0124784

My more comprehensive test:
tw37554.pp (6,114 bytes)   
{ %CPU=AARCH64 }

program tw37554;

{$mode objfpc}{$H+}

const
  CmpArray: array[0..88] of Int64 = (
    -1, $FFFFFFFF, -3000000, -2147483648, -131073,

    $FFFFFFFFFFFF0000, $FFFFFFFF0000FFFF, $FFFF0000FFFFFFFF, $0000FFFFFFFFFFFF,

    $FFFFFFFFFFFFFFFE, $FFFFFFFFFFFFFFFD, $FFFFFFFFFFFFFFFB, $FFFFFFFFFFFFFFF7,
    $FFFFFFFFFFFFFFEF, $FFFFFFFFFFFFFFDF, $FFFFFFFFFFFFFFBF, $FFFFFFFFFFFFFF7F,
    $FFFFFFFFFFFFFEFF, $FFFFFFFFFFFFFDFF, $FFFFFFFFFFFFFBFF, $FFFFFFFFFFFFF7FF,
    $FFFFFFFFFFFFEFFF, $FFFFFFFFFFFFDFFF, $FFFFFFFFFFFFBFFF, $FFFFFFFFFFFF7FFF,
    $FFFFFFFFFFFEFFFF, $FFFFFFFFFFFDFFFF, $FFFFFFFFFFFBFFFF, $FFFFFFFFFFF7FFFF,
    $FFFFFFFFFFEFFFFF, $FFFFFFFFFFDFFFFF, $FFFFFFFFFFBFFFFF, $FFFFFFFFFF7FFFFF,
    $FFFFFFFFFEFFFFFF, $FFFFFFFFFDFFFFFF, $FFFFFFFFFBFFFFFF, $FFFFFFFFF7FFFFFF,
    $FFFFFFFFEFFFFFFF, $FFFFFFFFDFFFFFFF, $FFFFFFFFBFFFFFFF, $FFFFFFFF7FFFFFFF,
    $FFFFFFFEFFFFFFFF, $FFFFFFFDFFFFFFFF, $FFFFFFFBFFFFFFFF, $FFFFFFF7FFFFFFFF,
    $FFFFFFEFFFFFFFFF, $FFFFFFDFFFFFFFFF, $FFFFFFBFFFFFFFFF, $FFFFFF7FFFFFFFFF,
    $FFFFFEFFFFFFFFFF, $FFFFFDFFFFFFFFFF, $FFFFFBFFFFFFFFFF, $FFFFF7FFFFFFFFFF,
    $FFFFEFFFFFFFFFFF, $FFFFDFFFFFFFFFFF, $FFFFBFFFFFFFFFFF, $FFFF7FFFFFFFFFFF,
    $FFFEFFFFFFFFFFFF, $FFFDFFFFFFFFFFFF, $FFFBFFFFFFFFFFFF, $FFF7FFFFFFFFFFFF,
    $FFEFFFFFFFFFFFFF, $FFDFFFFFFFFFFFFF, $FFBFFFFFFFFFFFFF, $FF7FFFFFFFFFFFFF,
    $FEFFFFFFFFFFFFFF, $FDFFFFFFFFFFFFFF, $FBFFFFFFFFFFFFFF, $F7FFFFFFFFFFFFFF,
    $EFFFFFFFFFFFFFFF, $DFFFFFFFFFFFFFFF, $BFFFFFFFFFFFFFFF, $7FFFFFFFFFFFFFFF,

    $FFFFFFFFFFFF1234, $FFFFFFFF1234FFFF, $FFFF1234FFFFFFFF, $1234FFFFFFFFFFFF,
    $FFFFFFFF12341234, $FFFF1234FFFF1234, $FFFF12341234FFFF, $FFFF123412341234,
    $FFFFFFFFFFFF0001, $FFFFFFFF0001FFFF, $FFFF0001FFFFFFFF, $0001FFFFFFFFFFFF,

    $0000000100000001, $0000000500000005, $0000AAAA0000AAAA, $0000FFFF0000FFFF
  );

var
  Fail: Boolean;

procedure CompareImmediate(CmpIndex: Integer; TestVal: Int64);
begin
  Write('Test ', CmpIndex, '; input constant: ', TestVal, '; comparing against: ', CmpArray[CmpIndex], ' - ');
  if TestVal = CmpArray[CmpIndex] then
    begin
      WriteLn('Pass');
      Exit;
    end;

  WriteLn('FAIL - expected ', CmpArray[CmpIndex]);
  Fail := True;
end;

begin
  Fail := False;

  CompareImmediate(0, -1);
  CompareImmediate(1, $FFFFFFFF);
  CompareImmediate(2, -3000000);
  CompareImmediate(3, -2147483648);
  CompareImmediate(4, -131073);

  CompareImmediate(5, $FFFFFFFFFFFF0000);
  CompareImmediate(6, $FFFFFFFF0000FFFF);
  CompareImmediate(7, $FFFF0000FFFFFFFF);
  CompareImmediate(8, $0000FFFFFFFFFFFF);

  CompareImmediate(9, $FFFFFFFFFFFFFFFE);
  CompareImmediate(10, $FFFFFFFFFFFFFFFD);
  CompareImmediate(11, $FFFFFFFFFFFFFFFB);
  CompareImmediate(12, $FFFFFFFFFFFFFFF7);

  CompareImmediate(13, $FFFFFFFFFFFFFFEF);
  CompareImmediate(14, $FFFFFFFFFFFFFFDF);
  CompareImmediate(15, $FFFFFFFFFFFFFFBF);
  CompareImmediate(16, $FFFFFFFFFFFFFF7F);

  CompareImmediate(17, $FFFFFFFFFFFFFEFF);
  CompareImmediate(18, $FFFFFFFFFFFFFDFF);
  CompareImmediate(19, $FFFFFFFFFFFFFBFF);
  CompareImmediate(20, $FFFFFFFFFFFFF7FF);

  CompareImmediate(21, $FFFFFFFFFFFFEFFF);
  CompareImmediate(22, $FFFFFFFFFFFFDFFF);
  CompareImmediate(23, $FFFFFFFFFFFFBFFF);
  CompareImmediate(24, $FFFFFFFFFFFF7FFF);

  CompareImmediate(25, $FFFFFFFFFFFEFFFF);
  CompareImmediate(26, $FFFFFFFFFFFDFFFF);
  CompareImmediate(27, $FFFFFFFFFFFBFFFF);
  CompareImmediate(28, $FFFFFFFFFFF7FFFF);

  CompareImmediate(29, $FFFFFFFFFFEFFFFF);
  CompareImmediate(30, $FFFFFFFFFFDFFFFF);
  CompareImmediate(31, $FFFFFFFFFFBFFFFF);
  CompareImmediate(32, $FFFFFFFFFF7FFFFF);

  CompareImmediate(33, $FFFFFFFFFEFFFFFF);
  CompareImmediate(34, $FFFFFFFFFDFFFFFF);
  CompareImmediate(35, $FFFFFFFFFBFFFFFF);
  CompareImmediate(36, $FFFFFFFFF7FFFFFF);

  CompareImmediate(37, $FFFFFFFFEFFFFFFF);
  CompareImmediate(38, $FFFFFFFFDFFFFFFF);
  CompareImmediate(39, $FFFFFFFFBFFFFFFF);
  CompareImmediate(40, $FFFFFFFF7FFFFFFF);

  CompareImmediate(41, $FFFFFFFEFFFFFFFF);
  CompareImmediate(42, $FFFFFFFDFFFFFFFF);
  CompareImmediate(43, $FFFFFFFBFFFFFFFF);
  CompareImmediate(44, $FFFFFFF7FFFFFFFF);

  CompareImmediate(45, $FFFFFFEFFFFFFFFF);
  CompareImmediate(46, $FFFFFFDFFFFFFFFF);
  CompareImmediate(47, $FFFFFFBFFFFFFFFF);
  CompareImmediate(48, $FFFFFF7FFFFFFFFF);

  CompareImmediate(49, $FFFFFEFFFFFFFFFF);
  CompareImmediate(50, $FFFFFDFFFFFFFFFF);
  CompareImmediate(51, $FFFFFBFFFFFFFFFF);
  CompareImmediate(52, $FFFFF7FFFFFFFFFF);

  CompareImmediate(53, $FFFFEFFFFFFFFFFF);
  CompareImmediate(54, $FFFFDFFFFFFFFFFF);
  CompareImmediate(55, $FFFFBFFFFFFFFFFF);
  CompareImmediate(56, $FFFF7FFFFFFFFFFF);

  CompareImmediate(57, $FFFEFFFFFFFFFFFF);
  CompareImmediate(58, $FFFDFFFFFFFFFFFF);
  CompareImmediate(59, $FFFBFFFFFFFFFFFF);
  CompareImmediate(60, $FFF7FFFFFFFFFFFF);

  CompareImmediate(61, $FFEFFFFFFFFFFFFF);
  CompareImmediate(62, $FFDFFFFFFFFFFFFF);
  CompareImmediate(63, $FFBFFFFFFFFFFFFF);
  CompareImmediate(64, $FF7FFFFFFFFFFFFF);

  CompareImmediate(65, $FEFFFFFFFFFFFFFF);
  CompareImmediate(66, $FDFFFFFFFFFFFFFF);
  CompareImmediate(67, $FBFFFFFFFFFFFFFF);
  CompareImmediate(68, $F7FFFFFFFFFFFFFF);

  CompareImmediate(69, $EFFFFFFFFFFFFFFF);
  CompareImmediate(70, $DFFFFFFFFFFFFFFF);
  CompareImmediate(71, $BFFFFFFFFFFFFFFF);
  CompareImmediate(72, $7FFFFFFFFFFFFFFF);

  CompareImmediate(73, $FFFFFFFFFFFF1234);
  CompareImmediate(74, $FFFFFFFF1234FFFF);
  CompareImmediate(75, $FFFF1234FFFFFFFF);
  CompareImmediate(76, $1234FFFFFFFFFFFF);

  CompareImmediate(77, $FFFFFFFF12341234);
  CompareImmediate(78, $FFFF1234FFFF1234);
  CompareImmediate(79, $FFFF12341234FFFF);
  CompareImmediate(80, $FFFF123412341234);

  CompareImmediate(81, $FFFFFFFFFFFF0001);
  CompareImmediate(82, $FFFFFFFF0001FFFF);
  CompareImmediate(83, $FFFF0001FFFFFFFF);
  CompareImmediate(84, $0001FFFFFFFFFFFF);

  CompareImmediate(85, $0000000100000001);
  CompareImmediate(86, $0000000500000005);
  CompareImmediate(87, $0000AAAA0000AAAA);
  CompareImmediate(88, $0000FFFF0000FFFF);

  { Spacing }
  WriteLn('');

  if Fail then
    Halt(1)
  else
    WriteLn('ok');
end.

tw37554.pp (6,114 bytes)   

J. Gareth Moreton

2020-08-12 20:53

developer   ~0124810

Last edited: 2020-08-12 20:58

View 4 revisions

The following patch contains correct constant generation as well as improvements overall. For example, the number "-3000000" is produced using fewer instructions:

movn x1,50879
movk x1,65490,lsl 16

...versus...

movz x1,14656
movk x1,65490,lsl 16
movk x1,65535,lsl 32
movk x1,65535,lsl 48

(#'s removed because the bug tracker interprets them as links)

I haven't been able to do much in terms of testing compiler speed because of how often my Raspberry PI gets CPU-throttled (and I'm still waiting to hear about my laptop), but I've attempted to structure the upgraded generation function to check common constants first (e.g. small signed values that fit into 16 bits).

NOTE: If you need a fix that's more to the point, the bug is due to the fact that the 'preva' value never gets updated in the first loop and hence is always equal to the first word.
mov-upgrade.patch (9,568 bytes)   
Index: compiler/aarch64/cgcpu.pas
===================================================================
--- compiler/aarch64/cgcpu.pas	(revision 46353)
+++ compiler/aarch64/cgcpu.pas	(working copy)
@@ -580,102 +580,165 @@
 
     procedure tcgaarch64.a_load_const_reg(list: TAsmList; size: tcgsize; a: tcgint; reg : tregister);
       var
-        preva: tcgint;
         opc: tasmop;
-        shift,maxshift: byte;
+        shift: byte;
         so: tshifterop;
-        reginited: boolean;
-        mask: tcgint;
+        reginited,doinverted: boolean;
+        manipulated_a: tcgint;
+        leftover_a: word;
       begin
-        { if we load a value into a 32 bit register, it is automatically
-          zero-extended to 64 bit }
-        if (hi(a)=0) and
-           (size in [OS_64,OS_S64]) then
-          begin
-            size:=OS_32;
-            reg:=makeregsize(reg,size);
-          end;
-        { values <= 32 bit are stored in a 32 bit register }
-        if not(size in [OS_64,OS_S64]) then
-          a:=cardinal(a);
+        case a of
+          { Small positive number }
+          $0..$FFFF:
+            begin
+              list.concat(taicpu.op_reg_const(A_MOVZ, reg, a));
+              Exit;
+            end;
+          { Small negative number }
+          -65536..-1:
+            begin
+              list.concat(taicpu.op_reg_const(A_MOVN, reg, Word(not a)));
+              Exit;
+            end;
+          { Can be represented as a negative number more compactly }
+          $FFFF0000..$FFFFFFFF:
+            begin
+              { if we load a value into a 32 bit register, it is automatically
+                zero-extended to 64 bit }
+              list.concat(taicpu.op_reg_const(A_MOVN, makeregsize(reg,OS_32), Word(not a)));
+              Exit;
+            end;
+          else
+            begin
 
-        if size in [OS_64,OS_S64] then
-          begin
-            mask:=-1;
-            maxshift:=64;
-          end
-        else
-          begin
-            mask:=$ffffffff;
-            maxshift:=32;
-          end;
-        { single movn enough? (to be extended) }
-        shift:=16;
-        preva:=a;
-        repeat
-          if (a shr shift)=(mask shr shift) then
-            begin
-              if shift=16 then
-                list.concat(taicpu.op_reg_const(A_MOVN,reg,not(word(preva))))
+              if size in [OS_64,OS_S64] then
+                begin
+                  { Check to see if a is a valid shifter constant that can be encoded in ORR as is }
+                  if is_shifter_const(a,size) then
+                    begin
+                      list.concat(taicpu.op_reg_reg_const(A_ORR,reg,makeregsize(NR_XZR,size),a));
+                      Exit;
+                    end;
+
+                  { This determines whether this write can be peformed with an ORR followed by MOVK
+                    by copying the 2nd word to the 4th word for the ORR constant, then overwriting
+                    the 4th word (unless the word is.  The alternative would require 3 instructions }
+                  leftover_a := word(a shr 48);
+                  manipulated_a := (a and $0000FFFFFFFFFFFF);
+
+                  if manipulated_a = $0000FFFFFFFFFFFF then
+                    begin
+                      { This is even better, as we can just use a single MOVN on the last word }
+                      shifterop_reset(so);
+                      so.shiftmode := SM_LSL;
+                      so.shiftimm := 48;
+                      list.concat(taicpu.op_reg_const_shifterop(A_MOVN, reg, word(not leftover_a), so));
+                      Exit;
+                    end;
+
+                  manipulated_a := manipulated_a or (((a shr 16) and $FFFF) shl 48);
+                  { if manipulated_a = a, don't check, because is_shifter_const was already
+                    called for a and it returned False.  Reduces processing time. [Kit] }
+                  if (manipulated_a <> a) and is_shifter_const(manipulated_a, size) then
+                    begin
+                      list.concat(taicpu.op_reg_reg_const(A_ORR, reg, makeregsize(NR_XZR, size), manipulated_a));
+                      if (leftover_a <> 0) then
+                        begin
+                          shifterop_reset(so);
+                          so.shiftmode := SM_LSL;
+                          so.shiftimm := 48;
+                          list.concat(taicpu.op_reg_const_shifterop(A_MOVK, reg, leftover_a, so));
+                        end;
+                      Exit;
+                    end;
+
+                  case a of
+                    { If a is in the given negative range, it can be stored
+                      more efficiently if it is inverted.  }
+                    TCgInt($FFFF000000000000)..-65537:
+                      begin
+                        { NOTE: This excluded range can be more efficiently
+                          stored as the first 16 bits followed by a shifter constant }
+                        case a of
+                          TCgInt($FFFF0000FFFF0000)..TCgInt($FFFF0000FFFFFFFF):
+                            doinverted := False
+                          else
+                            begin
+                              doinverted := True;
+                              a := not a;
+                            end;
+                        end;
+                      end;
+
+                    else
+                      doinverted := False;
+                  end;
+                end
               else
                 begin
-                  shifterop_reset(so);
-                  so.shiftmode:=SM_LSL;
-                  so.shiftimm:=shift-16;
-                  list.concat(taicpu.op_reg_const_shifterop(A_MOVN,reg,not(word(preva)),so));
+                  a:=cardinal(a);
+                  doinverted:=False;
                 end;
-              exit;
             end;
-          { only try the next 16 bits if the current one is all 1 bits, since
-            the movn will set all lower bits to 1 }
-          if word(a shr (shift-16))<>$ffff then
-            break;
-          inc(shift,16);
-        until shift=maxshift;
+        end;
+
         reginited:=false;
         shift:=0;
-        { can be optimized later to use more movn }
+
+        if doinverted then
+          opc:=A_MOVN
+        else
+          opc:=A_MOVZ;
+
         repeat
           { leftover is shifterconst? (don't check if we can represent it just
             as effectively with movz/movk, as this check is expensive) }
-          if ((shift<tcgsize2size[size]*(8 div 2)) and
-              (word(a)<>0) and
-              ((a shr 16)<>0)) and
-             is_shifter_const(a shl shift,size) then
+          if (word(a)<>0) then
             begin
-              if reginited then
-                list.concat(taicpu.op_reg_reg_const(A_ORR,reg,reg,a shl shift))
+
+              if not doinverted and
+                ((shift<tcgsize2size[size]*(8 div 2)) and
+                  ((a shr 16)<>0)) and
+                 is_shifter_const(a shl shift,size) then
+                begin
+                  if reginited then
+                    list.concat(taicpu.op_reg_reg_const(A_ORR,reg,reg,a shl shift))
+                  else
+                    list.concat(taicpu.op_reg_reg_const(A_ORR,reg,makeregsize(NR_XZR,size),a shl shift));
+
+                  exit;
+                end;
+
+              { set all 16 bit parts <> 0 }
+              if shift=0 then
+                begin
+                  list.concat(taicpu.op_reg_const(opc,reg,word(a)));
+                  reginited:=true;
+                end
               else
-                list.concat(taicpu.op_reg_reg_const(A_ORR,reg,makeregsize(NR_XZR,size),a shl shift));
-              exit;
+                begin
+                  shifterop_reset(so);
+                  so.shiftmode:=SM_LSL;
+                  so.shiftimm:=shift;
+                  if not reginited then
+                    begin
+                      list.concat(taicpu.op_reg_const_shifterop(opc,reg,word(a),so));
+                      reginited:=true;
+                    end
+                  else
+                    begin
+                      if doinverted then
+                        list.concat(taicpu.op_reg_const_shifterop(A_MOVK,reg,word(not a),so))
+                      else
+                        list.concat(taicpu.op_reg_const_shifterop(A_MOVK,reg,word(a),so));
+                    end;
+                end;
             end;
-          { set all 16 bit parts <> 0 }
-          if (word(a)<>0) or
-             ((shift=0) and
-              (a=0)) then
-            if shift=0 then
-              begin
-                list.concat(taicpu.op_reg_const(A_MOVZ,reg,word(a)));
-                reginited:=true;
-              end
-            else
-              begin
-                shifterop_reset(so);
-                so.shiftmode:=SM_LSL;
-                so.shiftimm:=shift;
-                if not reginited then
-                  begin
-                    opc:=A_MOVZ;
-                    reginited:=true;
-                  end
-                else
-                  opc:=A_MOVK;
-                list.concat(taicpu.op_reg_const_shifterop(opc,reg,word(a),so));
-              end;
-            preva:=a;
-            a:=a shr 16;
-           inc(shift,16);
-        until word(preva)=preva;
+
+          a:=a shr 16;
+          inc(shift,16);
+        until a = 0;
+
         if not reginited then
           internalerror(2014102702);
       end;
mov-upgrade.patch (9,568 bytes)   

Florian

2020-08-12 23:16

administrator   ~0124812

Thanks, applied.

J. Gareth Moreton

2020-08-16 06:41

developer   ~0124913

Last edited: 2020-08-16 06:43

View 2 revisions

I should probably add a minor extra note. I've discovered that Cortex-A72 processors and the like have a special bit of microcode that allows them to initialise constants faster than normal, namely a sequential"mov; movk lsl 16" pair, writing to the same register, can be executed in a single cycle, and similarly with a "movk lsl 32; movk lsl 48" pair. What this means is that a 32-bit constant can be initialised in just 1 cycle, and a 64-bit constant in 2 cycles, even if there are 3 or 4 instructions (movn doesn't have this special behaviour).

I double-checked my code, armed with this new knowledge, but learnt that all is fine, because the code won't ever produce a "movn; movk; movk" triplet (which would take 3 cycles), the few numbers that have both a "mov; movk" and a "movn; movk" representation will always use the former, and 32-bit numbers will only use movn if it can be used in isolation. Long story short, it shouldn't ever produce a sequence that will take longer to execute, but it will produce a shorter one if it exists.

Issue History

Date Modified Username Field Change
2020-08-12 09:16 J. Gareth Moreton New Issue
2020-08-12 09:17 J. Gareth Moreton Tag Attached: compiler
2020-08-12 09:17 J. Gareth Moreton Tag Attached: aarch64
2020-08-12 09:18 J. Gareth Moreton Note Added: 0124784
2020-08-12 09:18 J. Gareth Moreton File Added: tw37554.pp
2020-08-12 09:23 J. Gareth Moreton Priority normal => high
2020-08-12 09:23 J. Gareth Moreton Severity minor => major
2020-08-12 09:23 J. Gareth Moreton Steps to Reproduce Updated View Revisions
2020-08-12 09:23 J. Gareth Moreton FPCTarget => -
2020-08-12 09:24 J. Gareth Moreton Steps to Reproduce Updated View Revisions
2020-08-12 20:53 J. Gareth Moreton Note Added: 0124810
2020-08-12 20:53 J. Gareth Moreton File Added: mov-upgrade.patch
2020-08-12 20:54 J. Gareth Moreton Note Edited: 0124810 View Revisions
2020-08-12 20:55 J. Gareth Moreton Note Edited: 0124810 View Revisions
2020-08-12 20:58 J. Gareth Moreton Note Edited: 0124810 View Revisions
2020-08-12 23:16 Florian Assigned To => Florian
2020-08-12 23:16 Florian Status new => resolved
2020-08-12 23:16 Florian Resolution open => fixed
2020-08-12 23:16 Florian Fixed in Revision => 46401,46402
2020-08-12 23:16 Florian Note Added: 0124812
2020-08-16 06:41 J. Gareth Moreton Note Added: 0124913
2020-08-16 06:43 J. Gareth Moreton Note Edited: 0124913 View Revisions