View Issue Details
ID | Project | Category | View Status | Date Submitted | Last Update |
---|---|---|---|---|---|
0038527 | FPC | Compiler | public | 2021-02-21 23:50 | 2021-02-23 01:22 |
Reporter | runewalsh | Assigned To | J. Gareth Moreton | ||
Priority | normal | Severity | minor | Reproducibility | always |
Status | resolved | Resolution | fixed | ||
Platform | i386-win32, x86_64-win64 | ||||
Product Version | 3.3.1 | ||||
Summary | 0038527: Dreadful, mind-chilling CSE failure | ||||
Description | Code below breaks under -O2 or higher: function F returns just 4*n. -OoNOCSE makes it work. | ||||
Steps To Reproduce | function F(n: SizeUint): SizeUint; begin result := 4 * n + 4 * n; end; begin writeln('Reference F(5): ', 4 * 5 + 4 * 5); writeln(' Actual F(5): ', F(5)); end. | ||||
Tags | No tags attached. | ||||
Fixed in Revision | 48792 | ||||
FPCOldBugId | |||||
FPCTarget | - | ||||
Attached Files |
|
|
Bug report confirmed! This is indeed a really enormous error :-( Hopefully Gareth will be able to fix this rapidly... Pierre |
|
I'll get on it. |
|
I confess when I first saw the title, I thought it was one of Si Nicholson's spam bug reports about the Pascal language in general and how they want it to be more C-based. That aside, it doesn't look like CSE itself is at fault, but some that manages temporary references. When compiling with DEBUG_NODE_XML enabled, the following branches in the "firstpass" tree are generated: <statementn pos="4,1"> <blockn resultdef="$void" pos="4,1" flags="nf_pass1_done" complexity="0"> <statementn pos="4,1"> <nothingn resultdef="$void" pos="4,1" flags="nf_pass1_done" complexity="0" /> </statementn> <statementn pos="5,17"> <tempcreaten resultdef="$void" pos="5,17" flags="nf_pass1_done" complexity="0" id="$7C46BA70"> <typedef>QWord</typedef> <tempflags>ti_may_be_in_reg,ti_const,ti_cleanup_only</tempflags> <temptype>tt_persistent</temptype> <size>8</size> <tempinit> <assignn resultdef="$void" pos="4,1" flags="nf_pass1_done" complexity="3"> <temprefn resultdef="QWord" pos="4,1" flags="nf_pass1_done,nf_write" complexity="1" id="$7C46BA70"> <typedef>QWord</typedef> <tempflags>ti_may_be_in_reg,ti_const,ti_cleanup_only</tempflags> <temptype>tt_persistent</temptype> </temprefn> <shln resultdef="QWord" pos="5,17" flags="nf_pass1_done" complexity="1"> <loadn resultdef="QWord" pos="5,19" flags="nf_pass1_done" complexity="0"> <symbol>N</symbol> </loadn> <ordconstn resultdef="QWord" pos="5,15" flags="nf_pass1_done" complexity="0" rangecheck="FALSE"> <value>2</value> </ordconstn> </shln> </assignn> </tempinit> </tempcreaten> </statementn> </blockn> </statementn> <statementn pos="4,1"> <assignn resultdef="$void" pos="5,5" flags="nf_pass1_done" complexity="1"> <typeconvn resultdef="QWord" pos="5,5" flags="nf_pass1_done,nf_absolute" complexity="0" convtype="tc_equal"> <loadn resultdef="QWord" pos="5,12" flags="nf_pass1_done,nf_write" complexity="0"> <symbol>result</symbol> </loadn> </typeconvn> <addn resultdef="QWord" pos="5,21" flags="nf_pass1_done" complexity="1"> <temprefn resultdef="QWord" pos="5,17" flags="nf_pass1_done" complexity="1" id="$7C46BA70"> <typedef>QWord</typedef> <tempflags>ti_may_be_in_reg,ti_const,ti_cleanup_only</tempflags> <temptype>tt_persistent</temptype> </temprefn> <temprefn resultdef="QWord" pos="5,25" flags="nf_pass1_done" complexity="1" id="$7C46BA70"> <typedef>QWord</typedef> <tempflags>ti_may_be_in_reg,ti_const,ti_cleanup_only</tempflags> <temptype>tt_persistent</temptype> </temprefn> </addn> </assignn> </statementn> From those nodes alone, everything should work as expected: "N shl 2" (n * 4) is stored in a temporary store (they're unnamed, so the only way they can be referenced is via their pointer address... $7C46BA70 in this case). The answer is then added with itself and the result stored in Result. HOWEVER... I've just discovered that if the LeaLea2Lea peephole optimisation is disabled (alternatively, see what happens if you try to build the above code sample with "-OoNOPEEPHOLE"), the code compiles correctly. Looks like another bug that only rears its ugly head in a rare number of cases. It looks like LeaLea2Lea doesn't optimise the following pair of assembly instructions properly: leaq (,%rcx,4),%rax leaq (%rax,%rax),%rax (without LeaLea2Lea, the second instruction is converted into "addq %rax,%rax" by the Lea2AddBase optimisation) I don't recall writing this optimisation, but I'll see what I can figure out. |
|
Okay, I've made a patch that reworks the LeaLea2Lea optimisation and hpoefully fixes the bug (admittedly by adding a brand new optimisation!). There's room for improvement, I'm sure. Once I finish running the tests and confirm everything is good, I'll upload it. |
|
Tests on Windows pass without incident when run with -O4. lealea2lea-rework.patch (11,518 bytes)
Index: compiler/x86/aoptx86.pas =================================================================== --- compiler/x86/aoptx86.pas (revision 48789) +++ compiler/x86/aoptx86.pas (working copy) @@ -3360,67 +3360,164 @@ if (taicpu(p).oper[1]^.reg <> NR_STACK_POINTER_REG) and GetNextInstructionUsingReg(p,hp1,taicpu(p).oper[1]^.reg) then begin - { changes - lea offset1(regX), reg1 - lea offset2(reg1), reg1 - to - lea offset1+offset2(regX), reg1 } - + { Check common LEA/LEA conditions } if MatchInstruction(hp1,A_LEA,[taicpu(p).opsize]) and - MatchOperand(taicpu(p).oper[1]^,taicpu(hp1).oper[1]^) and - (taicpu(p).oper[0]^.ref^.relsymbol=nil) and - (taicpu(p).oper[0]^.ref^.segment=NR_NO) and - (taicpu(p).oper[0]^.ref^.symbol=nil) and - (((taicpu(hp1).oper[0]^.ref^.base=taicpu(p).oper[1]^.reg) and - (taicpu(p).oper[0]^.ref^.scalefactor <= 1) and - (taicpu(p).oper[0]^.ref^.index=NR_NO) and - (taicpu(p).oper[0]^.ref^.index=taicpu(hp1).oper[0]^.ref^.index) and - (taicpu(p).oper[0]^.ref^.scalefactor=taicpu(hp1).oper[0]^.ref^.scalefactor) - ) or - ((taicpu(hp1).oper[0]^.ref^.index=taicpu(p).oper[1]^.reg) and - (taicpu(p).oper[0]^.ref^.index=NR_NO) - ) or - ((taicpu(hp1).oper[0]^.ref^.base=taicpu(p).oper[1]^.reg) and - (taicpu(hp1).oper[0]^.ref^.scalefactor <= 1) and - ((taicpu(p).oper[0]^.ref^.base=NR_NO) or - ((taicpu(p).oper[0]^.ref^.base=taicpu(p).oper[0]^.ref^.base) and - (taicpu(p).oper[0]^.ref^.index=NR_NO) - ) - ) and - not(RegUsedBetween(taicpu(p).oper[0]^.ref^.index,p,hp1))) - ) and - not(RegUsedBetween(taicpu(p).oper[0]^.ref^.base,p,hp1)) and - (taicpu(p).oper[0]^.ref^.relsymbol=taicpu(hp1).oper[0]^.ref^.relsymbol) and - (taicpu(p).oper[0]^.ref^.segment=taicpu(hp1).oper[0]^.ref^.segment) and - (taicpu(p).oper[0]^.ref^.symbol=taicpu(hp1).oper[0]^.ref^.symbol) then + (taicpu(p).oper[1]^.reg = taicpu(hp1).oper[1]^.reg) and + (taicpu(p).oper[0]^.ref^.relsymbol = nil) and + (taicpu(p).oper[0]^.ref^.segment = NR_NO) and + (taicpu(p).oper[0]^.ref^.symbol = nil) and + (taicpu(hp1).oper[0]^.ref^.relsymbol = nil) and + (taicpu(hp1).oper[0]^.ref^.segment = NR_NO) and + (taicpu(hp1).oper[0]^.ref^.symbol = nil) and + ( + (taicpu(p).oper[0]^.ref^.base = NR_NO) or { Don't call RegUsedBetween unnecessarily } + not(RegUsedBetween(taicpu(p).oper[0]^.ref^.base,p,hp1)) + ) then begin - DebugMsg(SPeepholeOptimization + 'LeaLea2Lea done',p); - if taicpu(hp1).oper[0]^.ref^.index=taicpu(p).oper[1]^.reg then + { changes + lea (regX,scale), reg1 + lea offset(reg1,reg1), reg1 + to + lea offset(regX,scale*2), reg1 + + and + lea (regX,scale1), reg1 + lea offset(reg1,scale2), reg1 + to + lea offset(regX,scale1*scale2), reg1 + + ... so long as the final scale does not exceed 8 + + (Similarly, allow the first instruction to be "lea (regX,regX),reg1") + } + if (taicpu(p).oper[0]^.ref^.offset = 0) and + (taicpu(hp1).oper[0]^.ref^.index = taicpu(p).oper[1]^.reg) and + ( + ( + (taicpu(p).oper[0]^.ref^.base = NR_NO) + ) or ( + (taicpu(p).oper[0]^.ref^.scalefactor <= 1) and + ( + (taicpu(p).oper[0]^.ref^.base = taicpu(p).oper[0]^.ref^.index) and + not(RegUsedBetween(taicpu(p).oper[0]^.ref^.index, p, hp1)) + ) + ) + ) and ( + ( + { lea (reg1,scale2), reg1 variant } + (taicpu(hp1).oper[0]^.ref^.base = NR_NO) and + ( + ( + (taicpu(p).oper[0]^.ref^.base = NR_NO) and + (taicpu(hp1).oper[0]^.ref^.scalefactor * taicpu(p).oper[0]^.ref^.scalefactor <= 8) + ) or ( + { lea (regX,regX), reg1 variant } + (taicpu(p).oper[0]^.ref^.base <> NR_NO) and + (taicpu(hp1).oper[0]^.ref^.scalefactor <= 4) + ) + ) + ) or ( + { lea (reg1,reg1), reg1 variant } + (taicpu(hp1).oper[0]^.ref^.base = taicpu(p).oper[1]^.reg) and + (taicpu(hp1).oper[0]^.ref^.scalefactor <= 1) + ) + ) then begin - taicpu(hp1).oper[0]^.ref^.index:=taicpu(p).oper[0]^.ref^.base; - inc(taicpu(hp1).oper[0]^.ref^.offset,taicpu(p).oper[0]^.ref^.offset*max(taicpu(hp1).oper[0]^.ref^.scalefactor,1)); - { if the register is used as index and base, we have to increase for base as well - and adapt base } - if taicpu(hp1).oper[0]^.ref^.base=taicpu(p).oper[1]^.reg then + DebugMsg(SPeepholeOptimization + 'LeaLea2Lea 2 done',p); + + { Make everything homogeneous to make calculations easier } + if (taicpu(p).oper[0]^.ref^.base <> NR_NO) then begin - taicpu(hp1).oper[0]^.ref^.base:=taicpu(p).oper[0]^.ref^.base; - inc(taicpu(hp1).oper[0]^.ref^.offset,taicpu(p).oper[0]^.ref^.offset); + if taicpu(p).oper[0]^.ref^.index <> NR_NO then + { Convert lea (regX,regX),reg1 to lea (regX,2),reg1 } + taicpu(p).oper[0]^.ref^.scalefactor := 2 + else + taicpu(p).oper[0]^.ref^.index := taicpu(p).oper[0]^.ref^.base; + + taicpu(p).oper[0]^.ref^.base := NR_NO; end; + + if (taicpu(hp1).oper[0]^.ref^.base = NR_NO) then + begin + { Just to prevent miscalculations } + if (taicpu(hp1).oper[0]^.ref^.scalefactor = 0) then + taicpu(hp1).oper[0]^.ref^.scalefactor := taicpu(p).oper[0]^.ref^.scalefactor + else + taicpu(hp1).oper[0]^.ref^.scalefactor := taicpu(hp1).oper[0]^.ref^.scalefactor * taicpu(p).oper[0]^.ref^.scalefactor; + end + else + begin + taicpu(hp1).oper[0]^.ref^.base := NR_NO; + taicpu(hp1).oper[0]^.ref^.scalefactor := taicpu(p).oper[0]^.ref^.scalefactor * 2; + end; + + taicpu(hp1).oper[0]^.ref^.index := taicpu(p).oper[0]^.ref^.index; + RemoveCurrentP(p); + result:=true; + exit; end - else + + { changes + lea offset1(regX), reg1 + lea offset2(reg1), reg1 + to + lea offset1+offset2(regX), reg1 } + else if + ( + (taicpu(hp1).oper[0]^.ref^.index = taicpu(p).oper[1]^.reg) and + (taicpu(p).oper[0]^.ref^.index = NR_NO) + ) or ( + (taicpu(hp1).oper[0]^.ref^.base = taicpu(p).oper[1]^.reg) and + (taicpu(hp1).oper[0]^.ref^.scalefactor <= 1) and + ( + ( + (taicpu(p).oper[0]^.ref^.index = NR_NO) or + (taicpu(p).oper[0]^.ref^.base = NR_NO) + ) or ( + (taicpu(p).oper[0]^.ref^.scalefactor <= 1) and + ( + (taicpu(p).oper[0]^.ref^.index = NR_NO) or + ( + (taicpu(p).oper[0]^.ref^.index = taicpu(p).oper[0]^.ref^.base) and + ( + (taicpu(hp1).oper[0]^.ref^.index = NR_NO) or + (taicpu(hp1).oper[0]^.ref^.base = NR_NO) + ) + ) + ) + ) + ) + ) then begin - inc(taicpu(hp1).oper[0]^.ref^.offset,taicpu(p).oper[0]^.ref^.offset); - taicpu(hp1).oper[0]^.ref^.base:=taicpu(p).oper[0]^.ref^.base; + DebugMsg(SPeepholeOptimization + 'LeaLea2Lea 1 done',p); + + if taicpu(hp1).oper[0]^.ref^.index=taicpu(p).oper[1]^.reg then + begin + taicpu(hp1).oper[0]^.ref^.index:=taicpu(p).oper[0]^.ref^.base; + inc(taicpu(hp1).oper[0]^.ref^.offset,taicpu(p).oper[0]^.ref^.offset*max(taicpu(hp1).oper[0]^.ref^.scalefactor,1)); + { if the register is used as index and base, we have to increase for base as well + and adapt base } + if taicpu(hp1).oper[0]^.ref^.base=taicpu(p).oper[1]^.reg then + begin + taicpu(hp1).oper[0]^.ref^.base:=taicpu(p).oper[0]^.ref^.base; + inc(taicpu(hp1).oper[0]^.ref^.offset,taicpu(p).oper[0]^.ref^.offset); + end; + end + else + begin + inc(taicpu(hp1).oper[0]^.ref^.offset,taicpu(p).oper[0]^.ref^.offset); + taicpu(hp1).oper[0]^.ref^.base:=taicpu(p).oper[0]^.ref^.base; + end; + if taicpu(p).oper[0]^.ref^.index<>NR_NO then + begin + taicpu(hp1).oper[0]^.ref^.base:=taicpu(hp1).oper[0]^.ref^.index; + taicpu(hp1).oper[0]^.ref^.index:=taicpu(p).oper[0]^.ref^.index; + taicpu(hp1).oper[0]^.ref^.scalefactor:=taicpu(p).oper[0]^.ref^.scalefactor; + end; + RemoveCurrentP(p); + result:=true; + exit; end; - if taicpu(p).oper[0]^.ref^.index<>NR_NO then - begin - taicpu(hp1).oper[0]^.ref^.base:=taicpu(hp1).oper[0]^.ref^.index; - taicpu(hp1).oper[0]^.ref^.index:=taicpu(p).oper[0]^.ref^.index; - taicpu(hp1).oper[0]^.ref^.scalefactor:=taicpu(p).oper[0]^.ref^.scalefactor; - end; - RemoveCurrentP(p); - result:=true; - exit; end; { Change: |
|
I applied the patch provided by Gareth, after testing on x86_64-linux. Could you please test if the ug is fixed, and close the bug report if it is the case. Thanks to Gareth, Pierre |
|
Yay, it works, thank you, I was really scared. |
Date Modified | Username | Field | Change |
---|---|---|---|
2021-02-21 23:50 | runewalsh | New Issue | |
2021-02-22 00:04 | Pierre Muller | Note Added: 0129080 | |
2021-02-22 00:14 | J. Gareth Moreton | Assigned To | => J. Gareth Moreton |
2021-02-22 00:14 | J. Gareth Moreton | Status | new => assigned |
2021-02-22 00:14 | J. Gareth Moreton | Note Added: 0129081 | |
2021-02-22 01:52 | J. Gareth Moreton | Note Added: 0129082 | |
2021-02-22 01:53 | J. Gareth Moreton | Note Edited: 0129082 | View Revisions |
2021-02-22 01:53 | J. Gareth Moreton | Note Edited: 0129082 | View Revisions |
2021-02-22 01:54 | J. Gareth Moreton | Note Edited: 0129082 | View Revisions |
2021-02-22 02:14 | J. Gareth Moreton | Note Edited: 0129082 | View Revisions |
2021-02-22 12:29 | J. Gareth Moreton | Note Added: 0129095 | |
2021-02-22 22:49 | J. Gareth Moreton | Note Added: 0129110 | |
2021-02-22 22:49 | J. Gareth Moreton | File Added: lealea2lea-rework.patch | |
2021-02-22 22:49 | J. Gareth Moreton | Assigned To | J. Gareth Moreton => Pierre Muller |
2021-02-22 22:49 | J. Gareth Moreton | Status | assigned => feedback |
2021-02-22 22:49 | J. Gareth Moreton | FPCTarget | => - |
2021-02-23 00:18 | Pierre Muller | Assigned To | Pierre Muller => J. Gareth Moreton |
2021-02-23 00:18 | Pierre Muller | Status | feedback => resolved |
2021-02-23 00:18 | Pierre Muller | Resolution | open => fixed |
2021-02-23 00:18 | Pierre Muller | Fixed in Revision | => 48792 |
2021-02-23 00:18 | Pierre Muller | Note Added: 0129111 | |
2021-02-23 01:22 | runewalsh | Note Added: 0129112 |