frac function is slow on AMD in Linux (but fast on Intel or in Windows)
Original Reporter info from Mantis: Artlav
- Reporter name: Artyom
Description:
frac function is about 20 times slower on Linux on AMD CPUs (tested on Ryzen 2600, Ryzen 3600 and Threadripper 3975WX) than on Windows on the same CPUs.
On Intel CPUs it's close to equally fast on every OS.
This only happens on x86_64; when compiled for i386 there is no difference in performance.
Digging into the RTL: on Windows it uses the fpc_frac_real that sits outside the FPC_HAS_TYPE_EXTENDED ifdef (in rtl/x86_64/math.inc), which is double-precision SSE code similar to frac_sse in my example.
On Linux it uses the one inside the ifdef, which is extended-precision x87 code built around fistpq.
So it comes down to Windows not supporting the extended type and thus getting a double SSE frac implementation, while Linux does support the extended type and thus uses the extended x87 frac implementation.
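A quick way to see which branch a given target takes is to test the same define from a small program. This is only a sketch: the program name ext_check is made up for illustration, and it only reports whether the define is set, not which frac the RTL actually links in.

```pascal
program ext_check;
begin
{$ifdef FPC_HAS_TYPE_EXTENDED}
  // Branch described above for Linux: a separate extended type exists
  writeln('FPC_HAS_TYPE_EXTENDED is defined, SizeOf(extended) = ', SizeOf(extended));
{$else}
  // Branch described above for Windows: no separate extended type
  writeln('FPC_HAS_TYPE_EXTENDED is not defined, SizeOf(double) = ', SizeOf(double));
{$endif}
end.
```

Building it for x86_64 Linux and for Win64 should show the difference the report relies on.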
And as far as I can find, AMD's implementation of the old 80-bit FPU operations is much, much slower than Intel's.
Given that frac is a fairly basic function and AMD CPUs are rapidly gaining popularity, this is a rather critical issue.
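As a stopgap, the logic of frac_sse from the example can also be written in plain Pascal. This is a minimal sketch: frac_double is a hypothetical helper, not an RTL function, and whether the compiler emits SSE code for trunc and the subtraction here is an assumption that should be checked in the generated assembly (for example with -al).

```pascal
program frac_workaround;

// Hypothetical helper, not part of the RTL: same checks as frac_sse.
function frac_double(const d: double): double;
var
  tmp: double;
  bits: qword absolute tmp; // view the raw bits of tmp
begin
  tmp := d;
  // Biased exponent >= 1023+52 means |d| >= 2^52, Inf or NaN:
  // no fractional bits are left, so return 0 like frac_sse does.
  if ((bits shr 52) and $7FF) >= 1075 then
    frac_double := 0
  else
    frac_double := d - trunc(d);
end;

begin
  writeln(frac_double(2.75):0:2);  // 0.75
  writeln(frac_double(-2.75):0:2); // -0.75
  writeln(frac_double(1e300):0:2); // 0.00
end.
```

Like frac_sse, this returns 0 for Inf and NaN rather than propagating them.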
Steps to reproduce:
Run this code:
//############################################################################//
{$ifdef mswindows}{$apptype console}{$endif}
program frac_tst;
//############################################################################//
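function frac_sse(const d:double):double;assembler;nostackframe;
asm
movq %xmm0, %rax          // raw bits of d
shr $48, %rax             // keep sign, exponent and top 4 mantissa bits
and $0x7ff0,%ax           // isolate the biased exponent
cmp $0x4330,%ax           // exponent of 2^52: no fractional bits from there on
jge .L0
cvttsd2si %xmm0, %rax     // truncate toward zero
cvtsi2sd %rax, %xmm4
subsd %xmm4, %xmm0        // d - trunc(d)
ret
.L0:
xorpd %xmm0, %xmm0        // |d| >= 2^52, Inf or NaN: return 0
end;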
//############################################################################//
procedure main;
var x:double;
i:longint; //integer is only 16 bits in the default FPC mode, too small for this loop
begin
write('System: ');
x:=0;
for i:=0 to 9999999 do x:=x+frac(i/10);
writeln(x:3:3);
write('Custom: ');
x:=0;
for i:=0 to 9999999 do x:=x+frac_sse(i/10);
writeln(x:3:3);
end;
//############################################################################//
begin
main;
{$ifdef mswindows}readln;{$endif}
end.
//############################################################################//
Tweak the iteration count until the run time is noticeable, then compare the time between platforms and between the system and custom frac.
On Intel or on Windows both are fast.
On AMD under Linux, the system one is 10-20 times slower.
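To get numbers instead of eyeballing the run time, the loop can be wrapped in a simple timer. A rough sketch, assuming GetTickCount64 from SysUtils is precise enough for runs of this length; the program name frac_timing and the constant N are made up for illustration, and only the system frac is timed here.

```pascal
program frac_timing;
uses SysUtils;
const
  N = 10000000; // same count as the loops in the report; raise it if the run is too short
var
  x: double;
  i: longint;
  t0: qword;
begin
  x := 0;
  t0 := GetTickCount64;
  for i := 0 to N - 1 do
    x := x + frac(i / 10);
  // Print the sum as well so the loop cannot be dropped as dead code
  writeln('System frac: ', GetTickCount64 - t0, ' ms, sum = ', x:3:3);
end.
```

The same pattern can be put around the frac_sse loop to compare the two on one machine.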
Mantis conversion info:
- Mantis ID: 39275
- Version: 3.3.1
- Monitored by: » @Alexey-T1 (CudaText man)