[Refactor] TEntryFile.getbyte() optimisation
Original Reporter info from Mantis: CuriousKit @CuriousKit
- Reporter name: J. Gareth Moreton
Description:
The following patch shaves a few cycles off the getbyte() method by simplifying two expressions. The patch is small enough to be self-explanatory, but in short:
- "entryidx+1>entry.size" is changed to "entryidx >= entry.size"
- "bufsize-bufidx>=1" is changed to "bufidx < bufsize"
This makes the method's layout differ slightly from getword() etc., but since the size of a byte is effectively a fixed standard, there should be no loss of maintainability or portability.
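As a quick sanity check, the two rewrites are equivalent for LongInt operands as long as "entryidx+1" does not overflow. The following standalone sketch (not the actual TEntryFile.getbyte; the variable names are merely borrowed from the patch) exercises both pairs of conditions:

```pascal
program CondEquiv;
{$ASSERTIONS ON}
{ Standalone sketch only: these locals mimic the fields referenced in
  the patch; the real method lives in the compiler's PPU handling. }
var
  entryidx, entrysize, bufidx, bufsize: LongInt;
begin
  { entry bounds check: a+1 > b  <=>  a >= b (no overflow assumed) }
  for entryidx := 0 to 10 do
    for entrysize := 0 to 10 do
      Assert((entryidx + 1 > entrysize) = (entryidx >= entrysize));
  { buffer-space check: b - a >= 1  <=>  a < b }
  for bufidx := 0 to 10 do
    for bufsize := 0 to 10 do
      Assert((bufsize - bufidx >= 1) = (bufidx < bufsize));
  WriteLn('conditions are equivalent');
end.
```

The rewritten forms avoid the intermediate addition and subtraction entirely, which is what the compiler turns into the shorter instruction sequences shown below.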
Steps to reproduce:
Apply patch and confirm identical behaviour of compiler, but with a slight speed and size improvement.
Additional information:
The method in question is used when loading in PPU files.
On x86_64-win64, the compiled assembly (under -O3) shows a significant improvement. Not counting possible memory stalls, I estimate a saving of 1 cycle on each expression (accounting for parallel execution of independent instructions). As the raw machine code shows, each expression also shrinks by 9 bytes, for an 18-byte saving overall; since that exceeds the common 16-byte alignment granularity, it is enough to shrink the size of the compiled module.
Old: "entryidx+1>entry.size":
000000010008C279 48634324 movslq 0x24(%rbx),%rax
000000010008C27D 488d4001 lea 0x1(%rax),%rax
000000010008C281 48635328 movslq 0x28(%rbx),%rdx
000000010008C285 4839d0 cmp %rdx,%rax
000000010008C288 7e0e jle 0x10008c298 <GETBYTE+40>
New: "entryidx >= entry.size"
000000010008C279 8b4328 mov 0x28(%rbx),%eax
000000010008C27C 3b4324 cmp 0x24(%rbx),%eax
000000010008C27F 7f0e jg 0x10008c28f <GETBYTE+31>
----
Old: "bufsize-bufidx>=1"
000000010008C298 48634314 movslq 0x14(%rbx),%rax
000000010008C29C 48635318 movslq 0x18(%rbx),%rdx
000000010008C2A0 4829d0 sub %rdx,%rax
000000010008C2A3 4883f801 cmp $0x1,%rax
000000010008C2A7 7c18 jl 0x10008c2c1 <GETBYTE+81>
New: "bufidx < bufsize"
000000010008C28F 8b4318 mov 0x18(%rbx),%eax
000000010008C292 3b4314 cmp 0x14(%rbx),%eax
000000010008C295 7d18 jge 0x10008c2af <GETBYTE+63>
----
These improvements will likely not be as pronounced on 32-bit platforms, but they should be no worse than before; for one thing, the optimisations free up one register. The x86_64 assembly is so convoluted because Object Pascal expands intermediate expressions to the CPU word size (hence the MOVSLQ instructions that sign-extend LongInts to 64 bits), which is something the peephole optimiser, for example, cannot easily simplify away.
Mantis conversion info:
- Mantis ID: 35406
- OS: Microsoft Windows
- OS Build: 10 Professional
- Build: r41892
- Platform: Cross-platform (x86_64 benefits)
- Version: 3.3.1
- Fixed in version: 3.3.1
- Fixed in revision: 41924 (#55aeac44)
- Target version: 3.3.1