View Issue Details

IDProjectCategoryView StatusLast Update
0035406FPCCompilerpublic2020-01-21 14:10
ReporterJ. Gareth Moreton Assigned To 
Status closedResolutionfixed 
PlatformCross-platform (x86_64 benefits)OSMicrosoft Windows 
Product Version3.3.1 
Target Version3.3.1Fixed in Version3.3.1 
Summary0035406: [Refactor] TEntryFile.getbyte() optimisation
DescriptionThe following patch improves the getbyte() method by a few cycles by simplifying a couple of expressions. The patch file is small enough that it should be self-explanatory, but:

- "entryidx+1>entry.size" is changed to "entryidx >= entry.size"
- "bufsize-bufidx>=1" is changed to "bufidx < bufsize"

This changes the layout of the method slightly when compared to getword() etc, but in this instance, because the size of a byte is pretty much a set standard, there should be no loss in maintainability and portability.

Steps To ReproduceApply patch and confirm identical behaviour of compiler, but with a slight speed and size improvement.
Additional InformationThe method in question is used when loading in PPU files.

On x86_64-win64, the compiled assembly language (under -O3) shows significant improvement; not counting possible memory stalls, I estimate an improvement of 1 cycle on each of the expressions (which includes parallel execution of independent instructions), and as can be seen by the raw machine code, a size reduction of 9 bytes on both expressions to result in an 18-byte saving, enough to shrink the size of the compiled module since that number is larger than the common 16-byte alignment granularity.

Old: "entryidx+1>entry.size":

000000010008C279 48634324 movslq 0x24(%rbx),%rax
000000010008C27D 488d4001 lea 0x1(%rax),%rax
000000010008C281 48635328 movslq 0x28(%rbx),%rdx
000000010008C285 4839d0 cmp %rdx,%rax
000000010008C288 7e0e jle 0x10008c298 <GETBYTE+40>

New: "entryidx >= entry.size"

000000010008C279 8b4328 mov 0x28(%rbx),%eax
000000010008C27C 3b4324 cmp 0x24(%rbx),%eax
000000010008C27F 7f0e jg 0x10008c28f <GETBYTE+31>


Old: "bufsize-bufidx>=1"

000000010008C298 48634314 movslq 0x14(%rbx),%rax
000000010008C29C 48635318 movslq 0x18(%rbx),%rdx
000000010008C2A0 4829d0 sub %rdx,%rax
000000010008C2A3 4883f801 cmp $0x1,%rax
000000010008C2A7 7c18 jl 0x10008c2c1 <GETBYTE+81>

New: "bufidx < bufsize"

000000010008C28F 8b4318 mov 0x18(%rbx),%eax
000000010008C292 3b4314 cmp 0x14(%rbx),%eax
000000010008C295 7d18 jge 0x10008c2af <GETBYTE+63>


These improvements will likely not be as profound on 32-bit platforms, but they should not be worse than before - for one thing, the optimisations result in 1 fewer register being allocated. The reason why x86_64 assembler is so convoluted is because of Object Pascal's habit of expanding intermediate expressions to the CPU word size (hence the use of MOVSLQ instructions to expand LongInts to 64-bit), and is something not easily simplified by the peephole optimiser, for example.
Tagscompiler, patch, refactoring
Fixed in Revision41924
Attached Files


J. Gareth Moreton

2019-04-18 01:31


getbyte-optimisation.patch (614 bytes)   
Index: compiler/entfile.pas
--- compiler/entfile.pas	(revision 41892)
+++ compiler/entfile.pas	(working copy)
@@ -600,13 +600,13 @@
 function tentryfile.getbyte:byte;
-  if entryidx+1>entry.size then
+  if entryidx >= entry.size then { Check to see if reading a byte would go beyond EOF }
-  if bufsize-bufidx>=1 then
+  if bufidx < bufsize then { Can we read the byte from the buffer directly? }
getbyte-optimisation.patch (614 bytes)   

J. Gareth Moreton

2019-04-18 04:55

developer   ~0115629

Last edited: 2019-04-18 05:11

View 5 revisions

To compare i386-win32 - under -O3 again:

Old: "entryidx+1>entry.size":

00479897 8b431c mov 0x1c(%ebx),%eax
0047989A 83c001 add $0x1,%eax
0047989D 3b4320 cmp 0x20(%ebx),%eax
004798A0 7e0d jle 0x4798af <TENTRYFILE__GETBYTE+31>

New: "entryidx >= entry.size"

00479897 8b4320 mov 0x20(%ebx),%eax
0047989A 3b431c cmp 0x1c(%ebx),%eax
0047989D 7f0d jg 0x4798ac <TENTRYFILE__GETBYTE+28>

(approx. 1 cycle saved; 3 bytes saved)


Old: "bufsize-bufidx>=1"

004798AF 8b430c mov 0xc(%ebx),%eax
004798B2 8b5310 mov 0x10(%ebx),%edx
004798B5 29d0 sub %edx,%eax
004798B7 83f801 cmp $0x1,%eax
004798BA 7c14 jl 0x4798d0 <TENTRYFILE__GETBYTE+64>

New: "bufidx < bufsize"

004798AC 8b4310 mov 0x10(%ebx),%eax
004798AF 3b430c cmp 0xc(%ebx),%eax
004798B2 7d1c jge 0x4798d0 <TENTRYFILE__GETBYTE+64>

(approx. 1 cycle saved; 5 bytes saved)


I've also noticed that, under -O1, the register allocator is rather inefficient, since it keeps overwriting %eax, which contains Self, and constantly has to save and restore itself from the stack. Under -O2 and above, it's copied into %ebx and that is used instead.

Additionally, there's potential for a peephole optimisation:

add $0x1,%reg / lea $1(%reg),%reg
cmp x,%reg
jle ###
(%reg not used after)

cmp x,%reg
jl ###

That would make the assembler on the first expression equal to the patched version (albeit with the arguments reversed, i.e. "y < x" instead of "x > y").


TL:DR; i386 benefits as well.

J. Gareth Moreton

2019-04-21 18:41

developer   ~0115709

Last edited: 2019-04-21 18:57

View 3 revisions

As discussed in another topic, there is some merit in the proposal to improve node-level optimisation or peephole optimisation in order to achieve the same performance gains as described above, without changing the actual source code, although I wonder if it's still worth applying the above changes while deeper optimisation undergoes further research.

It's worth noting that it would be difficult for an optimiser to equate "bufsize-bufidx>=1" with "bufidx < bufsize" because the former can suffer from problems with overflows. Even though the two variables should never contain negative numbers, the compiler does not know this, and because bufsize could logically contain -2,147,483,648 ($80000000) and bufidx contain 2,147,483,647 ($7FFFFFFF), this causes the two expressions to not be equivalent on 64-bit because of sign extension; implicitly, the expression becomes "(Int64(bufsize) - Int64(bufidx)) >= Int64(1)", which given the above numbers, becomes "$FFFFFFFF00000001 >= 1", which is False under signed comparison rules. Under 32-bit systems though, the subtraction overflows and the expression becomes "1 >= 1", which is True. Using "bufidx < bufsize" ensures the most optimal machine code is generated and is hopefully easy enough to understand.

J. Gareth Moreton

2020-01-21 14:10

developer   ~0120599

Code seems to be working as expected with no reported issues.

Issue History

Date Modified Username Field Change
2019-04-18 01:31 J. Gareth Moreton New Issue
2019-04-18 01:31 J. Gareth Moreton File Added: getbyte-optimisation.patch
2019-04-18 01:32 J. Gareth Moreton Tag Attached: compiler
2019-04-18 01:32 J. Gareth Moreton Tag Attached: patch
2019-04-18 01:32 J. Gareth Moreton Tag Attached: refactoring
2019-04-18 04:02 J. Gareth Moreton Additional Information Updated View Revisions
2019-04-18 04:55 J. Gareth Moreton Note Added: 0115629
2019-04-18 04:56 J. Gareth Moreton Note Edited: 0115629 View Revisions
2019-04-18 04:57 J. Gareth Moreton Note Edited: 0115629 View Revisions
2019-04-18 05:11 J. Gareth Moreton Note Edited: 0115629 View Revisions
2019-04-18 05:11 J. Gareth Moreton Note Edited: 0115629 View Revisions
2019-04-21 18:41 J. Gareth Moreton Note Added: 0115709
2019-04-21 18:56 J. Gareth Moreton Note Edited: 0115709 View Revisions
2019-04-21 18:57 J. Gareth Moreton Note Edited: 0115709 View Revisions
2019-04-22 15:04 Jonas Maebe Assigned To => Jonas Maebe
2019-04-22 15:04 Jonas Maebe Status new => assigned
2019-04-22 15:16 Jonas Maebe Status assigned => resolved
2019-04-22 15:16 Jonas Maebe Resolution open => fixed
2019-04-22 15:16 Jonas Maebe Fixed in Version => 3.3.1
2019-04-22 15:16 Jonas Maebe Fixed in Revision => 41924
2020-01-21 14:10 J. Gareth Moreton Assigned To Jonas Maebe =>
2020-01-21 14:10 J. Gareth Moreton Status resolved => closed
2020-01-21 14:10 J. Gareth Moreton FPCTarget => -
2020-01-21 14:10 J. Gareth Moreton Note Added: 0120599