View Issue Details

ID: 0033093
Project: FPC
Category: Compiler
View Status: public
Last Update: 2018-12-31 01:06
Reporter: Martok
Assigned To:
Priority: normal
Severity: minor
Reproducibility: N/A
Status: new
Resolution: open
Summary: 0033093: [Patch] Peephole optimizations for case with jumptables
Description: Building on top of 0033038, this patch adds a peephole optimization for the heads of "case Ordinal of" statements when checked jumptables are used.
The effect is that register loading and masking are unified into just 3 instructions with no latency between them (the same sequence most other compilers generate for jumptables), saving up to 4 cycles in microbenchmarks.

I have intentionally submitted a patch containing a commented-out check for the instruction after the AND on i386. I believe this check is unnecessary (as the comment above it explains), but have left it in, in case it is not. Need a second opinion on this.
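
As a minimal example for illustration, the kind of statement this targets looks like the program below; whether FPC actually selects a checked jumptable for it depends on the number and density of the labels and on the optimizer settings:

program casejt;

{ with a checked jumptable, the head of this case is roughly
  mov/cmp/jcc/and/jmp before the patch and movzx/cmp/jcc/jmp after it }
procedure Dispatch(b: Byte);
begin
  case b of
    0: Writeln('zero');
    1: Writeln('one');
    2: Writeln('two');
    3: Writeln('three');
    4: Writeln('four');
  end;
end;

begin
  Dispatch(2);
end.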
Tags: No tags attached.
Fixed in Revision:
FPCOldBugId:
FPCTarget:
Attached Files

Activities

Martok

2018-01-29 00:08

reporter  

Peephole_Casenode.patch (2,537 bytes)   
diff --git compiler/x86/aoptx86.pas compiler/x86/aoptx86.pas
index a70d541712..ddab24e713 100644
--- compiler/x86/aoptx86.pas
+++ compiler/x86/aoptx86.pas
@@ -1114,7 +1114,7 @@ unit aoptx86;
 
     function TX86AsmOptimizer.OptPass1MOV(var p : tai) : boolean;
       var
-        hp1, hp2: tai;
+        hp1, hp2, hp3, hp4: tai;
         TmpUsedRegs : TAllUsedRegs;
         GetNextInstruction_p: Boolean;
         PreMessage, RegName1, RegName2, InputVal, MaskNum: string;
@@ -1161,6 +1161,43 @@ unit aoptx86;
             exit;
           end;
 
+
+        { Convert
+            mov    w,%reg8                   p
+            cmp    x,%reg8                   hp1
+            jxx    y                         hp2
+            and    $0xff,sup(%reg8)          hp3
+          # jmp    z(sup(%reg8)*a)           different between i386 and x86_64, unchecked
+          To:
+            movzbl w,%reg
+            cmpl   x,%reg
+            jxx    y
+          # jmp    z(sup(%reg8)*a)
+          ( generated by casenode )
+        }
+        if ((taicpu(p).opsize = S_B) and (taicpu(p).oper[1]^.typ = top_reg)) and
+           (GetNextInstruction_p and MatchInstruction(hp1, A_CMP, [S_B]) and MatchOperand(taicpu(p).oper[1]^, taicpu(hp1).oper[1]^)) and
+           (GetNextInstruction(hp1, hp2) and MatchInstruction(hp2, A_Jcc, [])) and
+           (GetNextInstruction(hp2, hp3) and MatchInstruction(hp3, A_AND, [S_L]) and
+              ((taicpu(hp3).oper[0]^.typ = top_const) and (taicpu(hp3).oper[0]^.val = $ff)) and
+              ((taicpu(hp3).oper[1]^.typ = top_reg) and (getsupreg(taicpu(p).oper[1]^.reg) = getsupreg(taicpu(hp3).oper[1]^.reg)))
+            )(* and
+           (GetNextInstruction(hp3, hp4) and MatchInstruction(hp4, A_JMP, []) and
+              ((taicpu(hp4).oper[0]^.typ = top_ref) and (taicpu(hp4).oper[0]^.ref^.index = taicpu(hp3).oper[1]^.reg))
+            ) *)then begin
+          { replace with full register load }
+          taicpu(p).opcode := A_MOVZX;
+          taicpu(p).changeopsize(S_BL);
+          taicpu(p).oper[1]^.reg := taicpu(hp3).oper[1]^.reg;
+          { replace cmpb with cmpl }
+          taicpu(hp1).changeopsize(S_L);
+          taicpu(hp1).oper[1]^.reg := taicpu(p).oper[1]^.reg;
+          { remove extra and }
+          AsmL.Remove(hp3);
+          hp3.Free;
+          DebugMsg(SPeepholeOptimization + 'MovCaseNode 1 done',p);
+        end;
+
         if GetNextInstruction_p and
           MatchInstruction(hp1,A_AND,[]) and
           (taicpu(p).oper[1]^.typ = top_reg) and

Martok

2018-01-29 00:10

reporter   ~0106109

Added bonus: in all of my tests, this makes the execution time of the statement independent of jumptable_no_range; in other words, we get the check for free. Yay for correctly predicted speculative execution :)
Delphi+TP compatibility could now be restored at *no* extra cost, if the core team were so inclined.

Florian

2018-02-10 10:38

administrator   ~0106310

Why not do this at the node level, so that all architectures benefit from it?

Martok

2018-02-10 12:35

reporter   ~0106311

Would love to, but I don't think that is possible. This part of the codegen is only available in Tx86casenode in the first place.
But even there, all of the redundancy comes from loads from opcgsize to OS_INT that are each atomic on their own. Unless the hlcg gets a new method for "indirect branch into jumptable" that basically emits the optimized version of the first 30 lines of tx86casenode.genjumptable, I don't see how this could be done more cleanly.
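
Purely as a sketch of what such a hook might look like (the name and signature below are invented; nothing like this exists in the current hlcg):

{ hypothetical hlcg hook for an "indirect branch into jumptable";
  a generic implementation would widen indexreg to OS_INT, subtract
  min_, jump to elselabel if the value exceeds max_-min_ (when
  checkrange is set), and finally branch indirectly through table }
procedure thlcgobj.g_jumptable_branch(list: TAsmList; indexdef: tdef;
  indexreg: tregister; min_, max_: aint; table, elselabel: tasmlabel;
  checkrange: boolean); virtual;

Targets could then override it with an optimal sequence instead of relying on the peephole pass.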

Martok

2018-02-11 20:03

reporter  

0033093_Casenode_order.patch (3,676 bytes)   
commit 4397fd52aacedca291ce91ae77ce8bb0af04b249
Author: Martok <martok@martoks-place.de>
Date:   Sun Feb 11 19:47:09 2018 +0100

    Casenode: generate more optimizer-friendly code for jumptables

diff --git a/compiler/x86/nx86set.pas b/compiler/x86/nx86set.pas
index ac5d03b35b..be649a956c 100644
--- a/compiler/x86/nx86set.pas
+++ b/compiler/x86/nx86set.pas
@@ -106,20 +106,18 @@ implementation
         { This generates near pointers on i8086 }
         labeltyp:=aitconst_ptr;
         opcgsize:=def_cgsize(opsize);
+        { make it a 32bit register }
+        indexreg:=cg.makeregsize(current_asmdata.CurrAsmList,hregister,OS_INT);
+        cg.a_load_reg_reg(current_asmdata.CurrAsmList,opcgsize,OS_INT,hregister,indexreg);
         if not(jumptable_no_range) then
           begin
              { a <= x <= b <-> unsigned(x-a) <= (b-a) }
-             cg.a_op_const_reg(current_asmdata.CurrAsmList,OP_SUB,opcgsize,aint(min_),hregister);
+             cg.a_op_const_reg(current_asmdata.CurrAsmList,OP_SUB,OS_INT,aint(min_),indexreg);
              { case expr greater than max_ => goto elselabel }
-             cg.a_cmp_const_reg_label(current_asmdata.CurrAsmList,opcgsize,OC_A,aint(max_)-aint(min_),hregister,elselabel);
+             cg.a_cmp_const_reg_label(current_asmdata.CurrAsmList,OS_INT,OC_A,aint(max_)-aint(min_),indexreg,elselabel);
              min_:=0;
-             { do not sign extend when we load the index register, as we applied an offset above }
-             opcgsize:=tcgsize2unsigned[opcgsize];
           end;
         current_asmdata.getglobaldatalabel(table);
-        { make it a 32bit register }
-        indexreg:=cg.makeregsize(current_asmdata.CurrAsmList,hregister,OS_INT);
-        cg.a_load_reg_reg(current_asmdata.CurrAsmList,opcgsize,OS_INT,hregister,indexreg);
         { create reference }
         reference_reset_symbol(href,table,0,sizeof(pint),[]);
         href.offset:=(-aint(min_))*sizeof(aint);
diff --git a/compiler/x86_64/nx64set.pas b/compiler/x86_64/nx64set.pas
index 10ec06a7d5..ca5dcc7a68 100644
--- a/compiler/x86_64/nx64set.pas
+++ b/compiler/x86_64/nx64set.pas
@@ -103,12 +103,14 @@ implementation
 
         last:=min_;
         opcgsize:=def_cgsize(opsize);
+        indexreg:=cg.makeregsize(current_asmdata.CurrAsmList,hregister,OS_ADDR);
+        cg.a_load_reg_reg(current_asmdata.CurrAsmList,opcgsize,OS_ADDR,hregister,indexreg);
         if not(jumptable_no_range) then
           begin
              { a <= x <= b <-> unsigned(x-a) <= (b-a) }
-             cg.a_op_const_reg(current_asmdata.CurrAsmList,OP_SUB,opcgsize,aint(min_),hregister);
+             cg.a_op_const_reg(current_asmdata.CurrAsmList,OP_SUB,OS_ADDR,aint(min_),indexreg);
              { case expr greater than max_ => goto elselabel }
-             cg.a_cmp_const_reg_label(current_asmdata.CurrAsmList,opcgsize,OC_A,aint(max_)-aint(min_),hregister,elselabel);
+             cg.a_cmp_const_reg_label(current_asmdata.CurrAsmList,OS_ADDR,OC_A,aint(max_)-aint(min_),indexreg,elselabel);
              min_:=0;
              { do not sign extend when we load the index register, as we applied an offset above }
              opcgsize:=tcgsize2unsigned[opcgsize];
@@ -116,8 +118,6 @@ implementation
 
         { local label in order to avoid using GOT }
         current_asmdata.getlabel(tablelabel,alt_data);
-        indexreg:=cg.makeregsize(current_asmdata.CurrAsmList,hregister,OS_ADDR);
-        cg.a_load_reg_reg(current_asmdata.CurrAsmList,opcgsize,OS_ADDR,hregister,indexreg);
         { load table address }
         reference_reset_symbol(href,tablelabel,0,4,[]);
         basereg:=cg.getaddressregister(current_asmdata.CurrAsmList);

Martok

2018-02-11 20:06

reporter   ~0106340

Last edited: 2018-02-11 20:07


Actually, scratch that. I'm being stupid.
It can be solved in the node by loading the correct size first and then relying on a single (existing) peephole optimization. Better patch attached; feel free to change the title of this bug, since it's no longer a peephole patch. Still only for the two platforms, though.

One question though: is there any particular reason tx8664casenode uses OS_ADDR, while tx86casenode uses OS_INT for the index reg size?

Florian

2018-02-11 22:22

administrator   ~0106342

Well, I was thinking more of inserting a type conversion at the node level (if possible and needed) to make the case expression a natural-sized integer.

Martok

2018-02-12 14:32

reporter   ~0106349

That could only be done right before calling genjumptable; otherwise we would lose the shorter opcodes for jumptrees and -lists, right?
Are there platforms that do something useful with index registers of sizes other than their native int (i.e. a hypothetical "jmp [$rbx+$al*8]")? I believe that was a thing on some large-word mainframes, but I'm not so sure about platforms supported by FPC. If there are, we may not want to give that up too early.

Martok

2018-12-28 01:59

reporter  

0033093_Casenode_order_v2.patch (3,653 bytes)   
From 1ab83937a2e0f38603eb765b65e54cbed7660954 Mon Sep 17 00:00:00 2001
From: Martok <martok@martoks-place.de>
Date: Fri, 28 Dec 2018 01:00:24 +0100
Subject: Casenode: generate more optimizer-friendly code for jumptables


diff --git a/compiler/x86/nx86set.pas b/compiler/x86/nx86set.pas
index f1ba20c9fc..787f1e2452 100644
--- a/compiler/x86/nx86set.pas
+++ b/compiler/x86/nx86set.pas
@@ -119,6 +119,9 @@ implementation
 
         AlmostExhaustive := False;
 
+        { make it a 32bit register }
+        indexreg:=cg.makeregsize(current_asmdata.CurrAsmList,hregister,OS_ADDR);
+        cg.a_load_reg_reg(current_asmdata.CurrAsmList,opcgsize,OS_ADDR,hregister,indexreg);
         if not(jumptable_no_range) then
           begin
 
@@ -145,19 +148,14 @@ implementation
             else
               begin
                 { a <= x <= b <-> unsigned(x-a) <= (b-a) }
-                cg.a_op_const_reg(current_asmdata.CurrAsmList,OP_SUB,opcgsize,aint(min_),hregister);
+                cg.a_op_const_reg(current_asmdata.CurrAsmList,OP_SUB,OS_ADDR,aint(min_),indexreg);
                 { case expr greater than max_ => goto elselabel }
-                cg.a_cmp_const_reg_label(current_asmdata.CurrAsmList,opcgsize,OC_A,aint(max_)-aint(min_),hregister,elselabel);
+                cg.a_cmp_const_reg_label(current_asmdata.CurrAsmList,OS_ADDR,OC_A,Range,indexreg,elselabel);
                 min_:=0;
-                { do not sign extend when we load the index register, as we applied an offset above }
-                opcgsize:=tcgsize2unsigned[opcgsize];
               end;
           end;
 
         current_asmdata.getglobaldatalabel(table);
-        { make it a 32bit register }
-        indexreg:=cg.makeregsize(current_asmdata.CurrAsmList,hregister,OS_INT);
-        cg.a_load_reg_reg(current_asmdata.CurrAsmList,opcgsize,OS_INT,hregister,indexreg);
         { create reference }
         reference_reset_symbol(href,table,0,sizeof(pint),[]);
         href.offset:=(-aint(min_))*sizeof(aint);
diff --git a/compiler/x86_64/nx64set.pas b/compiler/x86_64/nx64set.pas
index 3f76a7d32d..e165de8f03 100644
--- a/compiler/x86_64/nx64set.pas
+++ b/compiler/x86_64/nx64set.pas
@@ -118,6 +118,9 @@ implementation
         AlmostExhaustive := False;
         oldmin := min_;
 
+        { make it a 64bit register }
+        indexreg:=cg.makeregsize(jtlist,hregister,OS_ADDR);
+        cg.a_load_reg_reg(jtlist,opcgsize,OS_ADDR,hregister,indexreg);
         if not(jumptable_no_range) then
           begin
 
@@ -144,19 +147,15 @@ implementation
             else
               begin
                 { a <= x <= b <-> unsigned(x-a) <= (b-a) }
-                cg.a_op_const_reg(jtlist,OP_SUB,opcgsize,aint(min_),hregister);
+                cg.a_op_const_reg(jtlist,OP_SUB,OS_ADDR,aint(min_),indexreg);
                 { case expr greater than max_ => goto elselabel }
-                cg.a_cmp_const_reg_label(jtlist,opcgsize,OC_A,Range,hregister,elselabel);
+                cg.a_cmp_const_reg_label(jtlist,OS_ADDR,OC_A,Range,indexreg,elselabel);
                 min_:=0;
-                { do not sign extend when we load the index register, as we applied an offset above }
-                opcgsize:=tcgsize2unsigned[opcgsize];
               end;
           end;
 
         { local label in order to avoid using GOT }
         current_asmdata.getlabel(tablelabel,alt_data);
-        indexreg:=cg.makeregsize(jtlist,hregister,OS_ADDR);
-        cg.a_load_reg_reg(jtlist,opcgsize,OS_ADDR,hregister,indexreg);
         { load table address }
         reference_reset_symbol(href,tablelabel,0,4,[]);
         basereg:=cg.getaddressregister(jtlist);

Martok

2018-12-28 02:03

reporter   ~0112931

Patch updated for after r40676. Still has the same effect.

J. Gareth Moreton

2018-12-28 04:18

developer   ~0112933

I was going to ask about doing it at the node level, because it's generally faster and cleaner overall than relying on a specific peephole optimisation that otherwise just patches up inefficient code generation. I probably should have looked into it before I uploaded my own patch, but I wasn't 100% certain of what it was doing.

Martok

2018-12-28 05:46

reporter   ~0112936

It's not really inefficient code; "mov x, reg; and $ff, reg" => "movzbl x, reg" is a rather common optimization pattern.

If you find a good solution, all the better! There's still the question of whether it makes sense to widen the operand that early, when the only benefit is in this particular generation path.

Jonas Maebe

2018-12-28 18:28

manager   ~0112961

Small comment about the patch: it should use OS_INT/OS_SINT instead of OS_ADDR. In general they are the same (even on i8086), but it's cleaner. If that code were ever translated to the high-level code generator, it would also be clearer which def should be used.

Martok

2018-12-29 15:25

reporter   ~0112982

In that case, either x86 or x86_64 is currently wrong. That's what I asked about in February ;-)

What's the formal difference between these? I was looking for "the size of a native register".

Jonas Maebe

2018-12-29 19:34

manager   ~0112987

OS_INT/OS_SINT is the size of a native integer register. OS_ADDR is the size of an address register. At the Pascal level it's the difference between NativeInt/NativeUInt on the one hand and Pointer/PtrInt/PtrUInt on the other (although not exactly, because OS_ADDR maps to OS_16 for i8086 while pointers are 32 bit there; this was probably done to avoid having to introduce a new OS_ADDROFFSET type, but it's not very clean/correct).
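
At the Pascal level, a quick way to see the distinction (on i386/x86_64 all four sizes coincide, so the difference only shows on targets like i8086):

program regsizes;
begin
  { NativeInt/NativeUInt correspond to the natural integer register
    size (OS_SINT/OS_INT), PtrInt/Pointer to the address size (OS_ADDR) }
  Writeln('NativeInt:  ', SizeOf(NativeInt));
  Writeln('NativeUInt: ', SizeOf(NativeUInt));
  Writeln('PtrInt:     ', SizeOf(PtrInt));
  Writeln('Pointer:    ', SizeOf(Pointer));
end.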

Martok

2018-12-31 01:06

reporter   ~0113029

In other places, the index is specifically checked to be R_INTREGISTER, so that makes sense logically. Patch updated.

Martok

2018-12-31 01:06

reporter  

0033093_Casenode_order_v3.patch (3,645 bytes)   
From 3334233216aa7720d451b1767822280cb1243e7c Mon Sep 17 00:00:00 2001
From: Martok <martok@martoks-place.de>
Date: Fri, 28 Dec 2018 01:00:24 +0100
Subject: Casenode: generate more optimizer-friendly code for jumptables


diff --git a/compiler/x86/nx86set.pas b/compiler/x86/nx86set.pas
index f1ba20c9fc..273d1db3a4 100644
--- a/compiler/x86/nx86set.pas
+++ b/compiler/x86/nx86set.pas
@@ -119,6 +119,9 @@ implementation
 
         AlmostExhaustive := False;
 
+        { make it a 32bit register }
+        indexreg:=cg.makeregsize(current_asmdata.CurrAsmList,hregister,OS_INT);
+        cg.a_load_reg_reg(current_asmdata.CurrAsmList,opcgsize,OS_INT,hregister,indexreg);
         if not(jumptable_no_range) then
           begin
 
@@ -145,19 +148,14 @@ implementation
             else
               begin
                 { a <= x <= b <-> unsigned(x-a) <= (b-a) }
-                cg.a_op_const_reg(current_asmdata.CurrAsmList,OP_SUB,opcgsize,aint(min_),hregister);
+                cg.a_op_const_reg(current_asmdata.CurrAsmList,OP_SUB,OS_INT,aint(min_),indexreg);
                 { case expr greater than max_ => goto elselabel }
-                cg.a_cmp_const_reg_label(current_asmdata.CurrAsmList,opcgsize,OC_A,aint(max_)-aint(min_),hregister,elselabel);
+                cg.a_cmp_const_reg_label(current_asmdata.CurrAsmList,OS_INT,OC_A,Range,indexreg,elselabel);
                 min_:=0;
-                { do not sign extend when we load the index register, as we applied an offset above }
-                opcgsize:=tcgsize2unsigned[opcgsize];
               end;
           end;
 
         current_asmdata.getglobaldatalabel(table);
-        { make it a 32bit register }
-        indexreg:=cg.makeregsize(current_asmdata.CurrAsmList,hregister,OS_INT);
-        cg.a_load_reg_reg(current_asmdata.CurrAsmList,opcgsize,OS_INT,hregister,indexreg);
         { create reference }
         reference_reset_symbol(href,table,0,sizeof(pint),[]);
         href.offset:=(-aint(min_))*sizeof(aint);
diff --git a/compiler/x86_64/nx64set.pas b/compiler/x86_64/nx64set.pas
index 3f76a7d32d..be2cce6e5b 100644
--- a/compiler/x86_64/nx64set.pas
+++ b/compiler/x86_64/nx64set.pas
@@ -118,6 +118,9 @@ implementation
         AlmostExhaustive := False;
         oldmin := min_;
 
+        { make it a 64bit register }
+        indexreg:=cg.makeregsize(jtlist,hregister,OS_INT);
+        cg.a_load_reg_reg(jtlist,opcgsize,OS_INT,hregister,indexreg);
         if not(jumptable_no_range) then
           begin
 
@@ -144,19 +147,15 @@ implementation
             else
               begin
                 { a <= x <= b <-> unsigned(x-a) <= (b-a) }
-                cg.a_op_const_reg(jtlist,OP_SUB,opcgsize,aint(min_),hregister);
+                cg.a_op_const_reg(jtlist,OP_SUB,OS_INT,aint(min_),indexreg);
                 { case expr greater than max_ => goto elselabel }
-                cg.a_cmp_const_reg_label(jtlist,opcgsize,OC_A,Range,hregister,elselabel);
+                cg.a_cmp_const_reg_label(jtlist,OS_INT,OC_A,Range,indexreg,elselabel);
                 min_:=0;
-                { do not sign extend when we load the index register, as we applied an offset above }
-                opcgsize:=tcgsize2unsigned[opcgsize];
               end;
           end;
 
         { local label in order to avoid using GOT }
         current_asmdata.getlabel(tablelabel,alt_data);
-        indexreg:=cg.makeregsize(jtlist,hregister,OS_ADDR);
-        cg.a_load_reg_reg(jtlist,opcgsize,OS_ADDR,hregister,indexreg);
         { load table address }
         reference_reset_symbol(href,tablelabel,0,4,[]);
         basereg:=cg.getaddressregister(jtlist);

Issue History

Date Modified Username Field Change
2018-01-29 00:08 Martok New Issue
2018-01-29 00:08 Martok File Added: Peephole_Casenode.patch
2018-01-29 00:10 Martok Note Added: 0106109
2018-02-10 10:38 Florian Note Added: 0106310
2018-02-10 12:35 Martok Note Added: 0106311
2018-02-11 20:03 Martok File Added: 0033093_Casenode_order.patch
2018-02-11 20:06 Martok Note Added: 0106340
2018-02-11 20:07 Martok Note Edited: 0106340 View Revisions
2018-02-11 22:22 Florian Note Added: 0106342
2018-02-12 14:32 Martok Note Added: 0106349
2018-12-28 01:59 Martok File Added: 0033093_Casenode_order_v2.patch
2018-12-28 02:03 Martok Note Added: 0112931
2018-12-28 04:18 J. Gareth Moreton Note Added: 0112933
2018-12-28 05:46 Martok Note Added: 0112936
2018-12-28 18:28 Jonas Maebe Note Added: 0112961
2018-12-29 15:25 Martok Note Added: 0112982
2018-12-29 19:34 Jonas Maebe Note Added: 0112987
2018-12-31 01:06 Martok Note Added: 0113029
2018-12-31 01:06 Martok File Added: 0033093_Casenode_order_v3.patch