View Issue Details

ID: 0038547
Project: FPC
Category: Compiler
View Status: public
Last Update: 2021-02-27 12:57
Reporter: Si Nicholson
Assigned To: Florian
Priority: normal
Severity: minor
Reproducibility: always
Status: resolved
Resolution: won't fix
Summary: 0038547: methods containing assembly code not inlined by 'inline' keyword
Description: Was hoping to create a more user-friendly method of accessing advanced x86 / ARM SIMD functions...

  B16x4 = array [0..3] of Word;

  MM0_16 = object
  end;

  MM1_16 = object
  end;

  operator := (var words: B16x4) : MM0_16; inline;
  operator := (var words: B16x4) : MM1_16; inline;
  operator := (var words: MM0_16) : B16x4; inline;
  operator := (var words: MM1_16) : B16x4; inline;

  operator * (mmx: MM1_16) : MM0_16;
  operator * (mmx: MM0_16) : MM1_16;

implementation

operator := (var words: B16x4) : MM0_16;
begin asm movq mm0, [words] end; end;

operator := (var words: B16x4) : MM1_16;
begin asm movq mm1, [words] end; end;

operator := (var words: MM0_16) : B16x4;
begin asm movq [words], mm0 end; end;

operator := (var words: MM1_16) : B16x4;
begin asm movq [words], mm1 end; end;

operator * (words: MM0_16) : MM1_16;
begin asm pmullw mm0, mm1 end; end;

operator * (words: MM1_16) : MM0_16;
begin asm pmullw mm0, mm1 end; end;

.....

var x : MM0_16 = [1,2,3,4];
       y : MM1_16 = [5,6,7,8];

x:=x*y; // pmullw mm0, mm1

I've included an example with operator overloading and memory variables (rather than assignment from/to RAX, which would be handled by an RAX type in the same vein), but x.mul (y) or x.mulLQ (y) will not inline either. Intrinsics are OK, but inlined assembly methods would improve overall readability and ease of use in some cases. Personally I like assembly, but manually writing it out and inserting it inline makes for messy code. This is my example, but there are more specific reasons to implement inlined asm methods. The idea of assigning a variable to MMX and SSE registers is 'prettier'.
Tags: No tags attached.
Fixed in Revision:
FPCOldBugId:
FPCTarget: -
Attached Files

Activities

Florian

2021-02-26 21:15

administrator   ~0129179

Inlining assembler is either inefficient, very difficult or even impossible in the general case, so FPC won't support it. The consensus is that either FPC supports simple vector operations directly or intrinsics are used.
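
To illustrate the portable route mentioned above, here is a minimal sketch that expresses the reporter's 4 x 16-bit multiply in plain Pascal instead of assembler. The program name VecSketch and the operator body are assumptions made for illustration, not code from this report, and nothing is claimed here about whether FPC inlines or vectorises the loop on a given target.

```pascal
program VecSketch;
{$mode objfpc}
{$inline on}

type
  B16x4 = array[0..3] of Word;   // same type name as in the report

// Hypothetical element-wise multiply written in plain Pascal.
// Only the low 16 bits of each product are kept, which is what pmullw does.
operator * (const a, b: B16x4) r: B16x4; inline;
var
  i: Integer;
begin
  for i := 0 to 3 do
    r[i] := (LongWord(a[i]) * LongWord(b[i])) and $FFFF;
end;

const
  x: B16x4 = (1, 2, 3, 4);
  y: B16x4 = (5, 6, 7, 8);
var
  z: B16x4;
  i: Integer;
begin
  z := x * y;
  for i := 0 to 3 do
    Write(z[i], ' ');            // prints 5 12 21 32
  WriteLn;
end.
```

Because the body is ordinary Pascal, the compiler is free to choose registers and instructions itself, which it cannot do for an asm block.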

J. Gareth Moreton

2021-02-27 06:28

developer   ~0129193

I did try implementing this myself on x86_64 as a test case, but it was rejected because it broke the principle that raw assembly language should not be modified by the compiler (e.g. changing RET instructions into appropriate jumps when the routine is inlined), and because it would have to be reimplemented for each supported platform.

There is development and support for intrinsics, but a number of syntactic issues have not yet been resolved or decided upon. In the meantime I am experimenting with seeing how well I can make the compiler support vectorisation of common SSE and NEON operations, for example.

Benito van der Zander

2021-02-27 11:53

reporter   ~0129197

That's really bad.

Recently I profiled my string builder while appending 100 MB char by char. The results were:

386 ms: Pascal, no inline, (or inline with disabled optimizations)

210 ms: Pascal, inline

203 ms: assembly function, no inline


So inlining made the Pascal version almost twice as fast; inlined assembly would probably bring it down to something like 110 ms.


Baselines:

4000 ms: FPC's own string builder

200 ms: Java's string builder

30 ms: FillChar on pre allocated string

Florian

2021-02-27 12:07

administrator   ~0129198

> So inlining made the Pascal version almost twice as fast; inlined assembly would probably bring it down to something like 110 ms.

Unlikely in general. When inlining Pascal, the compiler can optimize register usage and instruction selection; for inlined assembler this is not possible. Inlined assembler routines are still bound to the calling conventions, as this is what the assembler code expects. So one basically gets rid of the call/ret and that's it. If Pascal code is much slower than the corresponding assembler code, it should be investigated why this is, and maybe it can be solved at either the code or the compiler level. In general, we do not consider inline assembler a solution, as FPC is a portable compiler. There are very few exceptions where we cannot use Pascal and need inline assembler, but these cases should be kept as few as possible.

Benito van der Zander

2021-02-27 12:57

reporter   ~0129199

> If Pascal code is much slower than the corresponding assembler code, it should be investigated why this is, and maybe it can be solved at either the code or the compiler level

I guess it is because it does not keep fields in registers.

Especially for a string builder, where there are only three things to do: check the current length against the buffer, write after the current length, and increment the current length. In assembly the current length can be read once, but the Pascal code reads it three times (however, putting the length in a temporary local variable did not help either).
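
The three steps described above can be sketched roughly as follows. This is not the reporter's actual string builder: TSketchBuilder, Init, Append and ToStr are hypothetical names, and the growth policy is an arbitrary assumption. The length is copied into a local once, which is the variant reported above as not helping.

```pascal
program AppendSketch;
{$mode objfpc}{$H+}
{$inline on}

type
  // Hypothetical minimal builder, for illustration only.
  TSketchBuilder = record
    Buffer: AnsiString;   // backing storage; Length(Buffer) is the capacity
    Count: SizeInt;       // number of characters written so far
  end;

procedure Init(out B: TSketchBuilder; Capacity: SizeInt);
begin
  B.Buffer := '';
  SetLength(B.Buffer, Capacity);
  B.Count := 0;
end;

procedure Append(var B: TSketchBuilder; C: AnsiChar); inline;
var
  N: SizeInt;
begin
  N := B.Count;                       // read the current length once
  if N >= Length(B.Buffer) then       // 1. check it against the buffer size
    SetLength(B.Buffer, 2 * Length(B.Buffer) + 16);
  B.Buffer[N + 1] := C;               // 2. write after the current length
  B.Count := N + 1;                   // 3. increment the current length
end;

function ToStr(const B: TSketchBuilder): AnsiString;
begin
  Result := Copy(B.Buffer, 1, B.Count);
end;

var
  SB: TSketchBuilder;
  I: Integer;
begin
  Init(SB, 4);
  for I := 1 to 10 do
    Append(SB, Chr(Ord('a') + I - 1));
  WriteLn(ToStr(SB));                 // prints abcdefghij
end.
```

Note that writing through an AnsiString index makes the compiler insert a uniqueness check on every store, so a real builder would more likely write through a pointer into the buffer; the sketch keeps the indexed form for readability.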

Issue History

Date Modified     Username               Field                 Change
2021-02-26 16:54  Si Nicholson           New Issue
2021-02-26 21:15  Florian                Assigned To           => Florian
2021-02-26 21:15  Florian                Status                new => resolved
2021-02-26 21:15  Florian                Resolution            open => won't fix
2021-02-26 21:15  Florian                FPCTarget             => -
2021-02-26 21:15  Florian                Note Added: 0129179
2021-02-27 06:28  J. Gareth Moreton      Note Added: 0129193
2021-02-27 11:53  Benito van der Zander  Note Added: 0129197
2021-02-27 12:07  Florian                Note Added: 0129198
2021-02-27 12:57  Benito van der Zander  Note Added: 0129199