View Issue Details

IDProjectCategoryView StatusLast Update
0010275FPCCompilerpublic2020-04-30 14:05
ReporterChris Rorden Assigned ToJonas Maebe  
PrioritynormalSeverityminorReproducibilityalways
Status closedResolutionreopened 
Product Version2.2.1 
Fixed in Version2.4.0 
Summary0010275: FPC 33% slower than Delphi4 for 2D Gaussian Blur
DescriptionI am offering a bounty for anyone who can help speed up my image smoothing routines. My software (www.mricro.com) applies 3D smoothing, filtering and interpolation to MR images. I have found FPC/Lazarus consistently slower than Delphi.

I think it is fantastic that FPC supports lots of platforms. I am not sure if it is my particular coding style or that FPC is not optimized for random arrays. However, I am offering a reward for anyone who can help boost the speed of this code to within 10% of Delphi4:
 http://www.sph.sc.edu/comd/rorden/mricron/bounty/
Additional Information
The slow code is in the routine
 procedure SmoothInput (lFWHM: integer);
The procdure is supplied with the Full-Width-Half-Maximum of the smoothing kernel. A look-up-table is generated for the kernel and then applied sequentially horizontally and vertically.
TagsNo tags attached.
Fixed in Revision
FPCOldBugId
FPCTarget-
Attached Files

Activities

Florian

2007-11-26 15:39

administrator   ~0016385

We are working on this, see r9330, r9333, r9334. On my machine, it is currently 1550 against 1350 with a slightly modified code.

Jonas Maebe

2007-12-09 21:09

manager   ~0016639

Last edited: 2007-12-09 21:13

You may want to consider to simply recompile the application for x86_64 (although you'll obviously need an OS which can run x86_64 binaries in that case). I don't have any Delphi nor Windows, but a test program calling the original SmoothInputAlign 1000000 times in a loop on my MacBook (Core 2 Duo 2.2Ghz) with the current 2.3.1 svn version gives:

firefly:~/fpc/test/opts jmaebe$ ppn10 -O-2ppentium4 -Cppentium4 -Cfsse2 sa
Free Pascal Compiler version 2.3.1 [2007/12/09] for i386
Copyright (c) 1993-2007 by Florian Klaempfl
Target OS: Darwin for i386
Compiling sa.pas
Assembling program
Linking sa
221 lines compiled, 0.1 sec
firefly:~/fpc/test/opts jmaebe$ time ./sa 1

real 0m5.871s
user 0m5.854s
sys 0m0.006s


firefly:~/fpc/test/opts jmaebe$ ppcx10 -O-2 sa
Free Pascal Compiler version 2.3.1 [2007/12/09] for x86_64
Copyright (c) 1993-2007 by Florian Klaempfl
Target OS: Darwin for x86_64
Compiling sa.pas
Assembling program
Linking sa
221 lines compiled, 0.1 sec
firefly:~/fpc/test/opts jmaebe$ time ./sa 1

real 0m3.473s
user 0m3.469s
sys 0m0.004s

Jonas Maebe

2007-12-09 21:21

manager   ~0016640

Last edited: 2007-12-10 12:45

One note: these results are not directly comparable with what you'd get under Linux/x86_64 or Win64, because on Darwin we use libc's math routines (such as exp). I don't know how fast our own implementation of exp is on x86_64 (and at least on i386, a third of the time is spent in that routine).

Edit: I was wrong, since darwin/x86_64 also supports extended, we don't use the libc math routines there either. So it should be comparable after all.

Jonas Maebe

2008-02-12 13:16

manager   ~0017740

Please try the following:

a) add "uses cmem;"
b) add the following routine near the top of your program, so it will be used instead of the version in the system unit:

{$ifdef fpc}
{$asmmode att}
    function exp(d : ValReal) : ValReal;assembler;
      var
        cw1,cw2: word;
      asm
        // comes from DJ GPP
        fldt d
        fldl2e
        fmulp %st,%st(1)
        fstcw CW1
        fstcw CW2
        fwait
        andw $0xf3ff,CW2
        orw $0x0400,CW2
        fldcw CW2
        fld %st(0)
        frndint
        fldcw CW1
        fxch %st(1)
        fsub %st(1),%st
        f2xm1
        fld1
        faddp %st,%st(1)
        fscale
        fstp %st(1)
     end;
{$endif}

c) compile with -O-2ppentium4 -Cppentium4 -Cfsse2


Explanation:
a) the memory allocation/freeing pattern of the benchmark causes the heap manager to allocate from the OS, subdivide in small blocks, and free memory blocks to the OS again at a very high rate. At least on Linux and Mac OS X, libc's memory manager is faster in this particular use-case.

In the current 2.3.1, you can configure how many memory blocks the memory manager keeps around (4 by default) before it frees them back to the OS. If you set it to 8 at the start of your program (MaxKeptOSChunks:=8), you'll get optimal performance on 2.3.1 (note: this does not work with 2.2.0).

While the performance of the default memory manager in 2.3.1 (keeping 4 blocks around) is already better than that of 2.2.0, I'm not sure whether the default value of MaxKeptOSChunks will be set higher than it currently is.

b) the only difference with the version in the 2.2.0 system unit, is that the one in the system unit contains an "fclex" instruction at the end. On a Pentium 4, this one instruction makes the benchmark 4 times slower. On an Athlon XP, it's 2 times. On a Core2 Duo, it's 5%. I don't know how big the impact will be on your Core Duo, but it won't hurt at least.

This change will be in a future release version.

c) I don't know whether this will improve performance a lot with 2.2.0, but it shouldn't hurt either (in 2.3.1, the effects are quite a bit better).


With the "MaxKeptOSChunks:=8" change to the program, FPC 2.3.1's version of the program (which uses the fclex-less version of the exp function from the system unit) is faster than Kylix 3 on a P4 and on an Athlon XP.

You may want to try a) and b) separately to see which one gives you the best speed gains (it's possible that these will already close the performance gap for the most part when using 2.2.0).

Jonas Maebe

2008-03-14 21:19

manager   ~0018335

No feedback from submitter.

Chris Rorden

2008-06-10 15:58

reporter   ~0020075

Jonas-

Sorry for the delay. I just tried your suggestions with Lazarus svn 15373, with the snapshot download Lazarus-0.9.25-fpc-2.2.3-20080610-win32.exe

I tried this on my Core2Duo T9300 (2.5GHz Penryn). The Delphi code clocked in at 1313, while the standard Lazarus ran at 2344. Once I appled all your suggestions, the code was marginally slower (2547). It is interesting that the Darwin code ran much faster with 64-bit compile, though I assume the Lazarus applications compiled for the Carbon widgetset will need to stay 32-bit. In any case, it sounds like this may be resolved when Lazarus gets the 2.4 compiler. Thanks for your suggestions.

Jonas Maebe

2008-06-10 22:26

manager   ~0020078

2.3.1 is (with cmem) under Mac OS X only 100ms faster than 2.2.3 on my machine (under Mac OS X; 3.6 vs 3.7 seconds for 1000000 calls of SmoothInputAlign in a loop). Either I've been looking at the wrong routine (SmoothInputAlign), or there are some serious performance problems with Microsoft libc memory allocator the way it is used in your program.

Chris Rorden

2008-06-10 23:43

reporter   ~0020079

I do not think it is just the getmem calls. The following code only calls getmem once (and therefore only assigns a small amount of memory on each loop), yet this new code requires 1625ms (63% of time compared to when always assigning memory arrays) if compiled with Lazarus and 750ms if compiled with Delphi (57% faster)...

procedure SmoothInputNoAlloc (lFWHM: integer; var lxra: singlep0; var lTempBuff: Singlep);
var
   lexpd,lcumgauss,lsigma: single;
   lcutoffvoxx,lI,lY,lX,lXi,lYi,lMin,lMax,lPos,lYPos: integer;

begin
        lXi := gSrcWid;
        lYi := gSrcHt;
        lsigma := (lFWHM)/sqrt(8*ln(2)); // % FWHM -> sigma
        lcutoffvoxx := round(6*lsigma); // % highest / lowest voxel to go out to
        lexpd := 2*lsigma*lsigma;
        lCumGauss := 0;
        for lI := 0 to lcutoffvoxx do begin
            lxra^[lI] := exp(-1*(lI*lI)/lexpd) ;
            lCumGauss := lCumGauss + lxra^[lI];
        end;
        lCumGauss := 2*lCumGauss - lxra^[0];
        if lCumGauss <> 0 then
           for lI := 0 to lcutoffvoxx do begin
            lxra^[lI] := lxra^[lI]/lCumGauss;
           end;

        //smooth horizontally
        for lY := 1 to lYi do begin
            lyPos := (lY-1)*lXi;
            for lX := 1 to lXi do begin
                lMin := lX - lCutoffVoxX;
                if lMin < 1 then lMin := 1;
                lMax := lX + lCutoffVoxX;
                if lMax > lXi then lMax := lXi;
                lCumGauss := 0;
                for lPos := lMin to lMax do
                    lCumGauss := lCumGauss + gBuff^[lPos+lYPos]*lXra^[abs(lX-lPos)] ;
                lTempBuff^[lX+lYPos] := (lCumGauss);
            end; {lX}
        end; {lY}
        //smooth vertically
        for lX := 1 to lXi do begin
            for lY := 1 to lYi do begin
                lMin := lY - lCutoffVoxX;
                if lMin < 1 then lMin := 1;
                lMax := lY + lCutoffVoxX;
                if lMax > lYi then lMax := lYi;
                lCumGauss := 0;
                for lPos := lMin to lMax do begin
                    lyPos := ((lPos-1)*lXi)+lX;
                    lCumGauss := lCumGauss + lTempBuff^[lYPos]*lXra^[abs(lY-lPos)] ;
                end;
                lyPos := (lY-1)*lXi;
                gSmoothBuff^[lX+lYPos] := round(lCumGauss);
            end; {lX}
        end; {lY}

end;

Jonas Maebe

2008-06-12 15:57

manager   ~0020107

Now I understand the problem. I had made a small separate test program, but did not initialise the gSrcWid and gSrcHt variables. So I was just benchmarking the setup code (the getmems/freemems and the exp call) rather than the inner loops.

After fixing this, I can reproduce it.

Martin Friebe

2008-06-25 22:47

manager   ~0020350

Last edited: 2008-06-26 01:01

I have made some changes to Smoothing (not yet SmoothAlign).

On my machine Smooth would take 2500 (pressing the test button) and is now down to 1700. below is the new code fragment, that needs to be replaced in the original source.

Apparently reducing the amount of variables used in the most inner loop, helps the "keep in registers" optimization (or so I guess).
Attempting the same pointer stuff, in some of the other/outer loops on the other hand was contra productive. Which is odd.


[edited below, du to a small bug / timing is still the same]


procedure SmoothInput (lFWHM: integer);
var
   lsigma,lexpd,lcumgauss: single;
   lcutoffvoxx,lI,lY,lX,lXi,lYi,lMin,lMax,lPos,lYPos: integer;
   lxra: singlep0;
   lTempBuff: Singlep;
   x : Bytep;
   x2, x3 : ^Byte; //xxx
   s1, s2 : ^Single;
   ldiff : integer;
begin
        lXi := gSrcWid;
        lYi := gSrcHt;
        lsigma := (lFWHM)/sqrt(8*ln(2)); // % FWHM -> sigma
        lcutoffvoxx := round(6*lsigma); // % highest / lowest voxel to go out to
           getmem(lxra,(lcutoffvoxx+1)*sizeof(single) * 2); // xxx 82
           getmem(lTempBuff,lXi*lYi*sizeof(single));
        lexpd := 2*lsigma*lsigma;
        lCumGauss := 0;
        for lI := 0 to lcutoffvoxx do begin
            lxra^[lcutoffvoxx + lI] := exp(-1*(lI*lI)/lexpd) ;
            lxra^[lcutoffvoxx - lI] := lxra^[lcutoffvoxx + lI] ; //xxx
            lCumGauss := lCumGauss + lxra^[lcutoffvoxx+lI];
        end;
        lCumGauss := 2*lCumGauss - lxra^[lcutoffvoxx];
        if lCumGauss <> 0 then
           for lI := 0 to lcutoffvoxx do begin
            lxra^[lcutoffvoxx + lI] := lxra^[lcutoffvoxx + lI]/lCumGauss;
            lxra^[lcutoffvoxx - lI] := lxra^[lcutoffvoxx + lI] ; //xxx
           end;

        //smooth horizontally
        x:=gBuff;
        for lY := 1 to lYi do begin
            lyPos := (lY-1)*lXi;
            for lX := 1 to lXi do begin
                lMin := lX - lCutoffVoxX;
                if lMin < 1 then lMin := 1;
                lMax := lX + lCutoffVoxX;
                if lMax > lXi then lMax := lXi;
                lCumGauss := 0;
                x2 := @(x^[lMin+lYPos]); // xxx
                s1 := @(lXra^[lcutoffvoxx + lX-lMin]); // xxx
                for lPos := lMin to lMax do // while lPos < lX do
                begin // xxx
                      lCumGauss := lCumGauss + (x2^ * s1^);
                      dec(s1);
                      inc(x2);
                end; // xxx
                lTempBuff^[lX+lYPos] := (lCumGauss);
            end; {lX}
        end; {lY}
        //smooth vertically
        for lX := 1 to lXi do begin
            for lY := 1 to lYi do begin
                lMin := lY - lCutoffVoxX;
                if lMin < 1 then lMin := 1;
                lMax := lY + lCutoffVoxX;
                if lMax > lYi then lMax := lYi;
                lCumGauss := 0;
                s1 := @(lXra^[lcutoffvoxx + lY-lMin]); // xxx
                s2 := @(lTempBuff^[((lMin-1)*lXi) + lX]);
                for lPos := lMin to lMax do begin
                    lCumGauss := lCumGauss + s2^ * s1^;
                    dec(s1);
                    inc(s2, lXi);
                end;
                lyPos := (lY-1)*lXi;
                gSmoothBuff^[lX+lYPos] := round(lCumGauss);
            end; {lX}
        end; {lY}
        //free memory
           freemem(lTempBuff);
           freemem(lxra);
end;

Martin Friebe

2008-06-25 22:49

manager   ~0020351

Just as a note: all my tests were done with
fpc2.3.1
Intel 32 bit core duo

Martin Friebe

2008-06-26 01:41

manager   ~0020357

Interesting.

replace "s1" in the 2nd(vertical) loops, by a new variable "s3".
On my PC this increased speed again. But I couldn't find any other variables, that would have benefited from a similar rename.
Interesting why fpc can optimize it better, if it's a variable that hasn't been used before.


                lCumGauss := 0;
                s3 := @(lXra^[lcutoffvoxx + lY-lMin]); // xxx
                s2 := @(lTempBuff^[((lMin-1)*lXi) + lX]);
                for lPos := lMin to lMax do
                begin // xxx
                    lCumGauss := lCumGauss + s2^ * s3^;
                    dec(s3);
                    inc(s2, lXi);
                end;

Martin Friebe

2008-06-26 10:07

manager   ~0020362

The best way to allow any optimizer to do its work, seems to be to cut the 2 loops out and place them in their own procedures. Gave me an other tiny speed burst. (I am not sure if it really brings much to reassign the procedure vars to local vars / I have a tiny bit of fluctation in the timing anyway, soo it's hard to say if its a few milliseconds faster or not)


procedure SmoothInput (lFWHM: integer);
  procedure l1(qlXi, qlYi, qlCutoffVoxX : integer;qx : Bytep;qlxra: singlep0;qlTempBuff: Singlep);
// procedure l1(lXi, lYi, lCutoffVoxX : integer; x : Bytep; lxra: singlep0; lTempBuff: Singlep);
  var
  lcumgauss: single;
   lI,lY,lX,lMin,lMax,lPos,lYPos: integer;
    s1 : ^Single;
    x2 : ^Byte; //xxx
    lXi, lYi, lCutoffVoxX : integer;x : Bytep;lxra: singlep0;lTempBuff: Singlep;
  begin
    lXi := qlXi; lYi:=qlYi; lCutoffVoxX:=qlCutoffVoxX;x:=qx ;lxra:=qlxra ;lTempBuff:=qlTempBuff;
    for lY := 1 to lYi do begin
        lyPos := (lY-1)*lXi;
        for lX := 1 to lXi do begin
            lMin := lX - lCutoffVoxX;
            if lMin < 1 then lMin := 1;
            lMax := lX + lCutoffVoxX;
            if lMax > lXi then lMax := lXi;
            lCumGauss := 0;
            x2 := @(x^[lMin+lYPos]); // xxx
            s1 := @(lXra^[lcutoffvoxx - (lX-lMin)]); // xxx
            for lPos := lMin to lMax do
            begin // xxx
                  lCumGauss := lCumGauss + (x2^ * s1^);
                  inc(s1);
                  inc(x2);
            end; // xxx
            lTempBuff^[lX+lYPos] := (lCumGauss);
        end; {lX}
    end; {lY}
  end;

  procedure l2(qlXi, qlYi, qlCutoffVoxX : integer;qx : Bytep;qlxra: singlep0;qlTempBuff: Singlep);
// procedure l2(lXi, lYi, lCutoffVoxX : integer; x : Bytep; lxra: singlep0; lTempBuff: Singlep);
  var
  lcumgauss: single;
   lI,lY,lX,lMin,lMax,lPos,lYPos: integer;
    s3 : ^Single;
    s2 : ^Single;
    lXi, lYi, lCutoffVoxX : integer;x : Bytep;lxra: singlep0;lTempBuff: Singlep;
  begin
   lXi := qlXi; lYi:=qlYi; lCutoffVoxX:=qlCutoffVoxX;x:=qx ;lxra:=qlxra ;lTempBuff:=qlTempBuff;
    for lX := 1 to lXi do begin
// lyPos:=lx;
        for lY := 1 to lYi do begin
            lMin := lY - lCutoffVoxX;
            if lMin < 1 then lMin := 1;
            lMax := lY + lCutoffVoxX;
            if lMax > lYi then lMax := lYi;
            lCumGauss := 0;
            s3 := @(lXra^[lcutoffvoxx + lY-lMin]); // xxx
            s2 := @(lTempBuff^[((lMin-1)*lXi) + lX]);
            for lPos := lMin to lMax do
            begin // xxx
                lCumGauss := lCumGauss + s2^ * s3^;
                dec(s3);
                inc(s2, lXi);
            end;
            lyPos := (lY-1)*lXi;
            gSmoothBuff^[lx+lYPos] := round(lCumGauss);
            //inc(lYPos, lXi);
        end; {lX}
    end; {lY}
  end;

  var
   lsigma,lexpd,lcumgauss: single;
   lcutoffvoxx,lI,lY,lX,lXi,lYi,lMin,lMax,lPos,lYPos: integer;
   llx, lly, llpos, llmax, llmin : integer;
   lxra: singlep0;
   lTempBuff: Singlep;
   x : Bytep;
   s1 : ^Single;

begin
        lXi := gSrcWid;
        lYi := gSrcHt;
        lsigma := (lFWHM)/sqrt(8*ln(2)); // % FWHM -> sigma
        lcutoffvoxx := round(6*lsigma); // % highest / lowest voxel to go out to
           getmem(lxra,(lcutoffvoxx+1)*sizeof(single) * 2); // xxx 82
           getmem(lTempBuff,lXi*lYi*sizeof(single));
        lexpd := 2*lsigma*lsigma;
        lCumGauss := 0;
        s1:=@(lxra^[lcutoffvoxx ]);
        for lI := 0 to lcutoffvoxx do begin
            s1^ := exp(-1*(lI*lI)/lexpd) ;
            lxra^[lcutoffvoxx - lI] := s1^ ; //xxx
            lCumGauss := lCumGauss + s1^;
            inc(s1);
        end;
        lCumGauss := 2*lCumGauss - lxra^[lcutoffvoxx];
        if lCumGauss <> 0 then
           for lI := 0 to lcutoffvoxx do begin
            lxra^[lcutoffvoxx + lI] := lxra^[lcutoffvoxx + lI]/lCumGauss;
            lxra^[lcutoffvoxx - lI] := lxra^[lcutoffvoxx + lI] ; //xxx
           end;

        //smooth horizontally
        x:=gBuff;
        l1(lXi, lYi, lCutoffVoxX ,x ,lxra,lTempBuff);
        //smooth vertically
        l2(lXi, lYi, lCutoffVoxX ,x ,lxra,lTempBuff);
        //free memory
           freemem(lTempBuff);
           freemem(lxra);
end;

Chris Rorden

2008-06-27 21:20

reporter   ~0020417

Martin -
 Congratulations. You won the $50 bounty. If you want to contact me (rorden#gwm.sc.edu). I would still like to keep the bounty open for someone who can add this to the compiler, as it would make maintaining and writing code a bit cleaner.
 Samuel Herzog also made a suggestion that speeds up this code, but did so both for Delphi and Lazarus. In any case, his solution is clever, so I will share it here, especially as it can be combined with Martin's technique. Essentially, he used the 'continue' option for Multiply+Adds where a zero was involved, so
 for lPos := lMin to lMax do
    lCumGauss := lCumGauss + gBuff^[lPos+lYPos]*lXra^[abs(lX-lPos)] ;
becomes
 for lPos := lMin to lMax do begin
    if gBuff^[lPos+lYPos] = 0 then continue;
    lCumGauss := lCumGauss + gBuff^[lPos+lYPos]*lXra^[abs(lX-lPos)] ;
 end;


Here are the timings for Delphi vs Lazarus on my Core2Duo (where MF is Martin's suggestions and SH is Sams):
 Standard 1281 vs 2360
 MF 1297 vs 1360 <- This wins my bounty!
 SH 1000 vs 1953
 MF+SH 954 vs 1250

Martin Friebe

2008-06-28 00:27

manager   ~0020425

Last edited: 2008-06-28 00:59

Great, very glad to hear that it helped.

As for the bounty, if I can I would donate it to the FreePascal Team (for a great free compiler).

And for the fpc-team, some ideas about the fpc optimizations (hope they are welcome):

Maybe the above solution can help finding a few possibilities to optimize "keep var in register" better? (afaik, it's 99% down to "var in register".

Most noticeable, it should be possible to optimize both loops separately. Putting them into separate procedures will force this, and get a better speed. WHY? Even in the same procedure body, they should optimize the same?

Also to few (or the wrong / in the wrong places) registers are used for this.
The Most inner loop (looking at the assembler)
            for lPos := lMin to lMax do
            begin // xxx
                lCumGauss := lCumGauss + s2^ * s3^;
                dec(s3);
                inc(s2, lXi);
            end;

s3, s2 are truly kept in edi,esi
lpos is in edx
lmax is NOT in registers
lCumGauss , lXi are NOT in a register

But neither EBX,ECX is used in this loop => so it would be possible to move them into registers? (I am not an Intel assembler expert, so I am not 100% sure on this possibility..)
lmax (as the end condition for the loop) isn't even needed. Given that lpos is not accessed, lpos could simply count down to 0 (but that's different kind of optimization)

Jonas Maebe

2008-06-28 01:17

manager   ~0020427

We know what the problem is since a long time: lack of ssa support in loops. Florian has been working on adding it, but it isn't easy.

Martin Friebe

2008-06-28 01:34

manager   ~0020428

Sorry, i didn't want to step on anyone, nor id I want to make it sound easy...

In the meantime, more optimization in pascal

type bp = ^BYTE;

in l2 / the vert smooth

            r1:= 4*lXI; // <<<<< actually better *sizeof(s2^) ?
            for lPos := lMin to lMax do
            begin // xxx
                lCumGauss := lCumGauss + s2^ * s3^;
                dec(s3);
                inc(bp(s2), r1); // <<<<< instead of: inc(s2, lXi);
            end;
            
This will avoid that lXi needs to be adjusted for the pointer size each time.

Jonas Maebe

2008-06-28 17:05

manager   ~0020443

No offence taken, it was just for your information.

Chris Rorden

2008-06-29 05:26

reporter   ~0020447

Jonas-

I appreciate compiler changes are challenging. Thanks for all your work. Can you clarify what you mean by SSA support? Do you mean SSE? If so, I do not think Delphi 4 (update 2, 1998), as Intel did not introduce SSE until 1999. It does seem like one difference is the way Martin points to the dynamic arrays, e.g. s1 : ^Single; versus my code that used array^[position]. Is there any chance this might explain the difference?

-c

Jonas Maebe

2008-06-29 09:33

manager   ~0020448

SSA means "static single assignment". See e.g. http://en.wikipedia.org/wiki/Static_single_assignment_form

Chris Rorden

2008-07-01 15:46

reporter   ~0020481

Jonas and Martin-

Thanks very much. This does make sense. Thanks again to all the FPC developers: this is a really versatile tool tool that has allowed me to support a wide range of platforms.

-chris

Florian

2008-07-01 16:32

administrator   ~0020482

And please leave this bug open, I still want to get the compiler to make the original code better using SSA.

Marco van de Voort

2009-11-08 17:51

manager   ~0032006

Any update how this fares with 2.5.1/2.3.1?

Jonas Maebe

2009-11-08 19:49

manager   ~0032010

Last edited: 2009-11-08 19:50

It won't change a thing (not a noticeable one anyway).

In fact, it may well be worse since the i386 assembler optimizer has been disabled by default.

Chris Rorden

2010-02-26 20:43

reporter   ~0034815

Just to point out that this problem still exists in the latest versions of Lazarus (0.9.29) FPC (2.4.1). I have updated the source code and compiled samples to include the suggestions of Martin and Samuel:
  http://www.cabiatl.com/mricro/bounty/
The page now contrasts the speed of the different algorithms (Delphi: FPC time in ms)
 Simple code 2359 : 1328ms
 Martin's code 1375 : 1297ms
 Martin+Samuel's code 1219 : 968ms

Chris Rorden

2010-08-02 01:47

reporter   ~0039872

Martin just suggesting putting {$FPUTYPE SSE2} at the top of the file (as I can assume all users have at least a Pentium 4 [circa 2001]) computer. With this and a few changes (described below), Lazarus/FPC now outperforms Delphi:
 Source Laz : D7
 Simple code 1766 : 1343ms
 Martin's code 1266 : 1282ms
 Martin+Samuel's code 828 : 937ms
However, note it is still slower for the Simple code - presumably the simple code will see a benefit when FPC adds support for SSA (I used FPC 2.4.1). Therefore, while FPC can produce faster code, I do think this sample provides a real-world test to see how future SSA support can speed up computations.

> 2 changes:
> 1) SmoothInputMFSH / l1
> lYPos is now incremented, instead of calculating a product each round.
> This is a very small improvement.
>
> 2a) SmoothInputMFSH / l1
> //for lPos := lMin to lMax do begin // xxx
> for lPos := lmax - lMin downto 0 do begin // xxx
>
> Compare with 0, does not need to store the end value => free a register
>
> 2b) SmoothInputMFSH / l2
> Oddly the same loop optimization here, would increase time => must be
> screwing up register allocation.


procedure SmoothInputMFSH (lFWHM: integer);
  procedure l1(qlXi, qlYi, qlCutoffVoxX : integer;qx : Bytep;qlxra: singlep0;qlTempBuff: Singlep);
  var
  lcumgauss: single;
   lY,lX,lMin,lMax,lPos,lYPos: integer;
    s1 : ^Single;
    x2 : ^Byte;
    lXi, lYi, lCutoffVoxX : integer;x : Bytep;lxra: singlep0;lTempBuff: Singlep;
  begin
    lXi := qlXi; lYi:=qlYi; lCutoffVoxX:=qlCutoffVoxX;x:=qx ;lxra:=qlxra ;lTempBuff:=qlTempBuff;
    lYPos := 0;
    for lY := 1 to lYi do begin
        for lX := 1 to lXi do begin
            lMin := lX - lCutoffVoxX;
            if lMin < 1 then lMin := 1;

            lMax := lX + lCutoffVoxX;
            if lMax > lXi then lMax := lXi;

            lCumGauss := 0;
            x2 := @(x^[lMin+lYPos]); // xxx
            s1 := @(lXra^[lcutoffvoxx - (lX-lMin)]); // xxx
            //for lPos := lMin to lMax do begin // xxx
            for lPos := lmax - lMin downto 0 do begin // xxx
                  if x2^ <> 0 then
                       lCumGauss := lCumGauss + (x2^ * s1^);
                  inc(s1);
                  inc(x2);
            end; // xxx
            lTempBuff^[lX+lYPos] := (lCumGauss);
        end; {lX}
       inc(lYPos, lxi);
    end; {lY}
  end;

  procedure l2(qlXi, qlYi, qlCutoffVoxX : integer;qlxra: singlep0;qlTempBuff: Singlep);
  var
  lcumgauss: single;
   lY,lX,lMin,lMax,lYPos: integer;
    s3 ,s3e : ^Single;
    s2 : ^Single;
    //x : Bytep;
    lXi, lYi, lCutoffVoxX : integer;lxra: singlep0;lTempBuff: Singlep;
  begin
   lXi := qlXi; lYi:=qlYi; lCutoffVoxX:=qlCutoffVoxX;lxra:=qlxra ;lTempBuff:=qlTempBuff; //x:=qx ;
    for lX := 1 to lXi do begin
        lYPos := 0;
        for lY := 1 to lYi do begin
            lMin := lY - lCutoffVoxX;
            if lMin < 1 then lMin := 1;
            lMax := lY + lCutoffVoxX;
            if lMax > lYi then lMax := lYi;
            lCumGauss := 0;
            s3 := @(lXra^[lcutoffvoxx + lY-lMin]);
            s3e := @(lXra^[lcutoffvoxx + lY-lMin - (lmax - lMin) - 1]);
            //s3e := s3 - (lmax - lMin) - 1;
            s2 := @(lTempBuff^[((lMin-1)*lXi) + lX]);
            //for lPos := lMin to lMax do
            while true do
            begin // xxx
                if s2^ <> 0 then
                   lCumGauss := lCumGauss + s2^ * s3^;
                dec(s3);
                if s3 = s3e then break;
                inc(s2, lXi);
            end;
            //lyPos := (lY-1)*lXi;
            gSmoothBuff^[lx+lYPos] := round(lCumGauss);
            inc(lYPos, lXi);
        end; {lX}
    end; {lY}
  end;

  var
   lsigma,lexpd,lcumgauss: single;
   lcutoffvoxx,lI,lXi,lYi: integer;
   lxra: singlep0;
   lTempBuff: Singlep;
   x : Bytep;
   s1 : ^Single;

begin
        lXi := gSrcWid;
        lYi := gSrcHt;
        lsigma := (lFWHM)/sqrt(8*ln(2)); // % FWHM -> sigma
        lcutoffvoxx := round(6*lsigma); // % highest / lowest voxel to go out to
           getmem(lxra,(lcutoffvoxx+1)*sizeof(single) * 2); // xxx 82
           getmem(lTempBuff,lXi*lYi*sizeof(single));
        lexpd := 2*lsigma*lsigma;
        lCumGauss := 0;
        s1:=@(lxra^[lcutoffvoxx ]);
        for lI := 0 to lcutoffvoxx do begin
            s1^ := exp(-1*(lI*lI)/lexpd) ;
            lxra^[lcutoffvoxx - lI] := s1^ ; //xxx
            lCumGauss := lCumGauss + s1^;
            inc(s1);
        end;
        lCumGauss := 2*lCumGauss - lxra^[lcutoffvoxx];
        if lCumGauss <> 0 then
           for lI := 0 to lcutoffvoxx do begin
            lxra^[lcutoffvoxx + lI] := lxra^[lcutoffvoxx + lI]/lCumGauss;
            lxra^[lcutoffvoxx - lI] := lxra^[lcutoffvoxx + lI] ; //xxx
           end;

        //smooth horizontally
        x:=gBuff;
        l1(lXi, lYi, lCutoffVoxX ,x ,lxra,lTempBuff);
        //smooth vertically
        l2(lXi, lYi, lCutoffVoxX ,lxra,lTempBuff);
        //free memory
           freemem(lTempBuff);
           freemem(lxra);
end;

Marco van de Voort

2014-11-04 15:49

manager   ~0078895

Seems the original link is no longer available. I tried to find a bounty/ directory on the new site, but failed.

samuel herzog

2014-11-04 19:49

reporter   ~0078901

Are you looking for this link?

http://www.cabiatl.com/mricro/bounty/

Chris Rorden

2020-04-30 13:29

reporter   ~0122556

I am going to close this issue as I no longer have Delphi 4 installed. I would be interested to know if recent releases of FPC support single static assignment (which Jonas Maebe thought would resolve the issue). If someone has addressed this, I am happy to provide the bounty. If someone wants me to dig up a copy of Delphi to test, I am happy to do so.

Jonas Maebe

2020-04-30 14:05

manager   ~0122559

It's resolved on platforms that are supported by the LLVM backend, but right now that's only
* macOS/x86-64
* Linux/x86-64
* Linux/AArch64
* Linux/ARM

Issue History

Date Modified Username Field Change
2007-11-26 15:33 Chris Rorden New Issue
2007-11-26 15:39 Florian Note Added: 0016385
2007-12-09 21:09 Jonas Maebe Note Added: 0016639
2007-12-09 21:13 Jonas Maebe Note Edited: 0016639
2007-12-09 21:21 Jonas Maebe Note Added: 0016640
2007-12-10 12:45 Jonas Maebe Note Edited: 0016640
2008-02-12 13:16 Jonas Maebe FPCTarget => -
2008-02-12 13:16 Jonas Maebe Note Added: 0017740
2008-02-12 13:16 Jonas Maebe Status new => feedback
2008-03-14 21:19 Jonas Maebe Status feedback => resolved
2008-03-14 21:19 Jonas Maebe Fixed in Version => 2.3.1
2008-03-14 21:19 Jonas Maebe Resolution open => fixed
2008-03-14 21:19 Jonas Maebe Assigned To => Jonas Maebe
2008-03-14 21:19 Jonas Maebe Note Added: 0018335
2008-06-10 15:58 Chris Rorden Status resolved => feedback
2008-06-10 15:58 Chris Rorden Resolution fixed => reopened
2008-06-10 15:58 Chris Rorden Note Added: 0020075
2008-06-10 22:26 Jonas Maebe Note Added: 0020078
2008-06-10 23:43 Chris Rorden Note Added: 0020079
2008-06-12 15:57 Jonas Maebe Note Added: 0020107
2008-06-25 22:47 Martin Friebe Note Added: 0020350
2008-06-25 22:49 Martin Friebe Note Added: 0020351
2008-06-26 01:01 Martin Friebe Note Edited: 0020350
2008-06-26 01:41 Martin Friebe Note Added: 0020357
2008-06-26 10:07 Martin Friebe Note Added: 0020362
2008-06-27 21:20 Chris Rorden Note Added: 0020417
2008-06-28 00:27 Martin Friebe Note Added: 0020425
2008-06-28 00:29 Martin Friebe Note Edited: 0020425
2008-06-28 00:59 Martin Friebe Note Edited: 0020425
2008-06-28 01:17 Jonas Maebe Note Added: 0020427
2008-06-28 01:34 Martin Friebe Note Added: 0020428
2008-06-28 17:05 Jonas Maebe Note Added: 0020443
2008-06-29 05:26 Chris Rorden Note Added: 0020447
2008-06-29 09:33 Jonas Maebe Note Added: 0020448
2008-07-01 15:46 Chris Rorden Note Added: 0020481
2008-07-01 16:32 Florian Note Added: 0020482
2009-11-08 17:51 Marco van de Voort Note Added: 0032006
2009-11-08 19:49 Jonas Maebe Note Added: 0032010
2009-11-08 19:50 Jonas Maebe Note Edited: 0032010
2010-02-26 20:43 Chris Rorden Note Added: 0034815
2010-08-02 01:47 Chris Rorden Note Added: 0039872
2014-11-04 15:49 Marco van de Voort Note Added: 0078895
2014-11-04 19:49 samuel herzog Note Added: 0078901
2020-04-30 13:29 Chris Rorden Status feedback => closed
2020-04-30 13:29 Chris Rorden Note Added: 0122556
2020-04-30 14:05 Jonas Maebe Note Added: 0122559