Optimizer does a worse job when inlining or with record temps.
Original Reporter info from Mantis: marco @marcoonthegit
- Reporter name: Marco van de Voort
Description:
Code generation inefficiency test: factor-3 twiddling in a mixed-radix (with Winograd factoring?) FFT.
The routines are functionally equivalent, but the generated assembler for FFT_3 is always longer than for FFT_3D.
The difference is that FFT_3 processes tcomplex records with inlined operations, while FFT_3D was
coded out in single operations and temps. The parameters are the same.
Note that the FFT_3D code has more (explicit) operations in Pascal (loading the vector into Single temps). Still, the optimizer
does a better job on it with -O4 (15% fewer instructions in 32-bit). In 64-bit AVX it is noticeable that the FFT_3D variant uses more XMM registers, while
the normal variant only uses xmm0, plus xmm1 for a few two-register instructions.
I assume the problem is either the complex temps or the inlined functions.
Delphi also suffers from this: FFT_3 hardly uses any registers other than XMM0 and XMM1, while FFT_3D does, and the difference is
even bigger (better optimization in Delphi?): FFT_3 is 596 bytes, FFT_3D is 384 bytes. The code is not 1:1 comparable, however, because Delphi does all operations in Double, so there are hordes of Single-to-Double and back conversions.
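For readers without the attachment, the two coding styles described above look roughly like the sketch below. This is NOT the code from ugly.dpr: it is a plain radix-3 butterfly without twiddle multiplication, and every name in it (the TComplex layout, CAdd, CSub, CScale, Butterfly3Rec, Butterfly3Flat) is hypothetical and chosen only for illustration. It should compile with a recent FPC, but it has not been checked to reproduce the same instruction-count difference.

```pascal
program inline_vs_temps;
{$mode objfpc}
{$inline on}

type
  // Hypothetical layout; the real tcomplex in ugly.dpr may differ.
  TComplex = record
    re, im: Single;
  end;

const
  HalfSqrt3: Single = 0.8660254;  // sqrt(3)/2

function CAdd(const a, b: TComplex): TComplex; inline;
begin
  Result.re := a.re + b.re;
  Result.im := a.im + b.im;
end;

function CSub(const a, b: TComplex): TComplex; inline;
begin
  Result.re := a.re - b.re;
  Result.im := a.im - b.im;
end;

function CScale(const a: TComplex; s: Single): TComplex; inline;
begin
  Result.re := a.re * s;
  Result.im := a.im * s;
end;

// Style 1: record temps and inlined helper functions (the FFT_3 style).
procedure Butterfly3Rec(var x0, x1, x2: TComplex);
var
  t, m, s: TComplex;
begin
  t := CAdd(x1, x2);
  s := CScale(CSub(x1, x2), HalfSqrt3);
  m := CSub(x0, CScale(t, 0.5));
  x0 := CAdd(x0, t);
  // x1 := m - i*s,  x2 := m + i*s
  x1.re := m.re + s.im;  x1.im := m.im - s.re;
  x2.re := m.re - s.im;  x2.im := m.im + s.re;
end;

// Style 2: everything loaded into Single temps and written out by hand
// (the FFT_3D style). More Pascal statements, same arithmetic.
procedure Butterfly3Flat(var x0, x1, x2: TComplex);
var
  x0r, x0i, x1r, x1i, x2r, x2i: Single;
  tr, ti, mr, mi, sr, si: Single;
begin
  x0r := x0.re;  x0i := x0.im;
  x1r := x1.re;  x1i := x1.im;
  x2r := x2.re;  x2i := x2.im;
  tr := x1r + x2r;                ti := x1i + x2i;
  sr := (x1r - x2r) * HalfSqrt3;  si := (x1i - x2i) * HalfSqrt3;
  mr := x0r - tr * 0.5;           mi := x0i - ti * 0.5;
  x0.re := x0r + tr;  x0.im := x0i + ti;
  x1.re := mr + si;   x1.im := mi - sr;
  x2.re := mr - si;   x2.im := mi + sr;
end;

function Same(const a, b: TComplex): Boolean;
begin
  Result := (Abs(a.re - b.re) < 1e-5) and (Abs(a.im - b.im) < 1e-5);
end;

var
  a0, a1, a2, b0, b1, b2: TComplex;
begin
  a0.re := 1;   a0.im := 2;
  a1.re := 3;   a1.im := -1;
  a2.re := -2;  a2.im := 0.5;
  b0 := a0;  b1 := a1;  b2 := a2;
  Butterfly3Rec(a0, a1, a2);
  Butterfly3Flat(b0, b1, b2);
  // Sanity check that both styles compute the same butterfly.
  if Same(a0, b0) and Same(a1, b1) and Same(a2, b2) then
    WriteLn('match')
  else
    WriteLn('MISMATCH');
end.
```

Compiling such a file with the same -al -O4 options as in the steps below and comparing the two procedures in the generated assembler would show whether the effect also appears on a reduced case.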
Steps to reproduce:
fpc -al -O4 -Opcoreavx2 -Cpcoreavx2 -Cfavx2 ugly.dpr (Windows 64-bit target)
Compare the generated assembler for FFT_3 and FFT_3D.
Additional information:
Not in my critical path, just a byproduct of some performance investigation.
Mantis conversion info:
- Mantis ID: 36324
- Monitored by: » @MageSlayer (Denis Golovan)