View Issue Details

ID: 0032637
Project: FPC
Category: Patch
View Status: public
Last Update: 2020-09-29 09:40
Reporter: J. Gareth Moreton
Assigned To:
Priority: normal
Severity: minor
Reproducibility: always
Status: new
Resolution: open
Platform: Win64
OS: Windows 7 (64-bit)
Product Version: 3.1.1
Summary: 0032637: AMD64 versions of FillWord, FillDWord and FillQWord are poorly optimised
Description: The implementations of FillWord, FillDWord and FillQWord are very poorly optimised on AMD64 platforms (e.g. Win64), falling back on general-purpose Pascal code. This may catch out programmers who expect these functions to be faster than FillChar (they are generally about the same speed) when initialising memory of a known type larger than a byte.

Find attached a patch that implements assembly language optimisations for Win64 and 64-bit Linux (System V ABI).
Steps To Reproduce: Use QueryPerformanceCounter or an equivalent high-resolution timer to measure the average running time of FillWord, FillDWord and FillQWord, then apply the patch and repeat the tests to (hopefully) see drastic improvements.
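The measurement described above could be sketched roughly as follows. This is only an illustration: the buffer size, the repeat count, the fill values and the helper name AverageFillMs are assumptions made for this sketch, and GetTickCount64 from SysUtils stands in for a high-resolution timer so that the program also builds on non-Windows targets.

```pascal
program FillBench;
{$mode objfpc}

uses
  SysUtils;

const
  BufBytes = 16 * 1024 * 1024; { 16 MiB: arbitrary size chosen for this sketch }
  Passes   = 50;               { arbitrary repeat count }

type
  TFillKind = (fkWord, fkDWord, fkQWord);

{ Hypothetical helper: returns the average wall-clock time of one fill of
  Bytes bytes, in milliseconds, using the RTL routine selected by Kind. }
function AverageFillMs(Buf: Pointer; Bytes: SizeInt; Kind: TFillKind;
  Repeats: Integer): Double;
var
  T0: QWord;
  I: Integer;
begin
  T0 := GetTickCount64;
  for I := 1 to Repeats do
    case Kind of
      fkWord:  FillWord(Buf^, Bytes div SizeOf(Word), $1234);
      fkDWord: FillDWord(Buf^, Bytes div SizeOf(DWord), $12345678);
      fkQWord: FillQWord(Buf^, Bytes div SizeOf(QWord), QWord($0123456789ABCDEF));
    end;
  Result := (GetTickCount64 - T0) / Repeats;
end;

var
  Buf: Pointer;
begin
  Buf := GetMem(BufBytes);
  try
    WriteLn('FillWord:  ', AverageFillMs(Buf, BufBytes, fkWord, Passes):0:3, ' ms per pass');
    WriteLn('FillDWord: ', AverageFillMs(Buf, BufBytes, fkDWord, Passes):0:3, ' ms per pass');
    WriteLn('FillQWord: ', AverageFillMs(Buf, BufBytes, fkQWord, Passes):0:3, ' ms per pass');
  finally
    FreeMem(Buf);
  end;
end.
```

Building and running this once against an unpatched RTL and once against a patched one gives the before/after comparison. GetTickCount64 is coarse, so the total work per measurement should be kept well above a few tens of milliseconds.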
Additional Information: The implementations make use of SSE2 (guaranteed to be present on x86-64 systems) and non-temporal hints ("movntdq") when filling large blocks of memory (in the attached patches, 0x80000 quadwords, i.e. 4 MiB, or larger). Smaller blocks make use of "rep stosq".
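As a rough Pascal illustration of two details in the attached assembly: the fill value is first replicated across a 64-bit register ("imul" by $0001000100010001 for words, "shl"/"or" for double words), and the store strategy is chosen by comparing a quadword count against $80000. The function and constant names below are invented for this sketch and do not appear in the patch.

```pascal
program BroadcastSketch;
{$mode objfpc}

const
  { Cut-off used by "cmp $0x80000" in the patch. It is compared against a
    count of quadwords, so it corresponds to $80000 * 8 bytes = 4 MiB. }
  NonTemporalMinQWords = $80000;

{ Replicates a 16-bit value into all four words of a quadword, as the
  multiplication by $0001000100010001 does in the patch. }
function BroadcastWord(Value: Word): QWord;
begin
  Result := QWord(Value) * QWord($0001000100010001);
end;

{ Replicates a 32-bit value into both halves of a quadword. }
function BroadcastDWord(Value: DWord): QWord;
begin
  Result := (QWord(Value) shl 32) or QWord(Value);
end;

{ True when a block of QWordCount quadwords would take the movntdq path. }
function UsesNonTemporalStores(QWordCount: SizeInt): Boolean;
begin
  Result := QWordCount >= NonTemporalMinQWords;
end;

begin
  { $ABCD * $0001000100010001 = $ABCDABCDABCDABCD }
  WriteLn(HexStr(BroadcastWord($ABCD), 16));
  WriteLn(HexStr(BroadcastDWord($DEADBEEF), 16));
  WriteLn('1 MiB block: ', UsesNonTemporalStores((1024 * 1024) div 8));
  WriteLn('4 MiB block: ', UsesNonTemporalStores((4 * 1024 * 1024) div 8));
end.
```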

The Windows versions have been thoroughly tested for correctness, including on memory not aligned to a 16-byte boundary, and with counts that are not a power of two, but the Linux versions have NOT been tested for correctness due to the submitter's inability to currently compile and test for Linux.
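A correctness check of the kind described could look something like the sketch below, which may also be useful to anyone able to test the Linux versions. It is not the submitter's actual test code: the helper name CheckFillWord, the guard size, the canary byte and the tested ranges are all assumptions. It fills regions that start at every even offset from a 16-byte boundary, with counts that are not powers of two, and verifies that the bytes around the region are left untouched.

```pascal
program FillWordCheck;
{$mode objfpc}

const
  Guard  = 64;  { guard bytes kept on each side of the filled region }
  Canary = $EE; { byte pattern that must survive outside the region }

{ Fills Count words starting Offset bytes past a 16-byte boundary, then
  verifies the filled words and the surrounding guard bytes.
  Offset must be even, per the alignment limitation of the routines. }
function CheckFillWord(Offset, Count: SizeInt; Value: Word): Boolean;
var
  Raw: PByte;
  Total, First, Last, I: SizeInt;
begin
  Total := Guard + 16 + Offset + Count * SizeOf(Word) + Guard;
  Raw := GetMem(Total);
  FillChar(Raw^, Total, Canary);

  { Index of the first filled byte: the next 16-byte boundary after the
    leading guard, plus Offset }
  First := SizeInt(((PtrUInt(Raw) + Guard + 15) and not PtrUInt(15))
           - PtrUInt(Raw)) + Offset;
  Last := First + Count * SizeOf(Word) - 1;

  FillWord(Raw[First], Count, Value);

  Result := True;
  { Bytes before and after the region must still hold the canary }
  for I := 0 to First - 1 do
    if Raw[I] <> Canary then Result := False;
  for I := Last + 1 to Total - 1 do
    if Raw[I] <> Canary then Result := False;
  { Every word inside the region must hold Value }
  for I := 0 to Count - 1 do
    if PWord(@Raw[First])[I] <> Value then Result := False;

  FreeMem(Raw);
end;

var
  Offset, Count: SizeInt;
  AllOk: Boolean;
begin
  AllOk := True;
  Offset := 0;
  while Offset < 16 do
  begin
    for Count := 0 to 70 do
      if not CheckFillWord(Offset, Count, $A55A) then
      begin
        WriteLn('Mismatch: offset=', Offset, ' count=', Count);
        AllOk := False;
      end;
    Inc(Offset, 2);
  end;
  if AllOk then
    WriteLn('All FillWord checks passed');
end.
```

The same pattern extends to FillDWord and FillQWord by stepping Offset in multiples of 4 or 8. Covering the non-temporal path would additionally need counts large enough to reach the $80000-quadword cut-off.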

Possibly requires additional code for proper stack unwinding during Structured Exception Handling in Windows due to the presence of "push %rdi" - Linux does not have this issue as the stack and non-volatile registers are not utilised.

Limitation:

The pointer to x must fall on a 2-byte, 4-byte and 8-byte boundary for FillWord, FillDWord and FillQWord respectively - failure to do so will likely raise an exception (caused by calling "movntdq" with misaligned memory). This limitation is fair because writing across a boundary in normal conditions (e.g. writing a Word to memory with an odd-numbered pointer) is highly unusual and normally deliberately contrived, since implicit and explicit memory assignment routines tend to put the memory block on a boundary that's relevant to the requested type, or to the machine word size.
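A caller that cannot be sure of the alignment could test the pointer before choosing a fill routine. The helper below is a hypothetical sketch (IsAlignedFor is not part of the RTL or of the patch) and assumes the element size is a power of two.

```pascal
program AlignmentCheck;
{$mode objfpc}

{ True when P is a multiple of ElemSize. ElemSize must be a power of two
  (2, 4 or 8 for FillWord, FillDWord and FillQWord respectively). }
function IsAlignedFor(P: Pointer; ElemSize: PtrUInt): Boolean;
begin
  Result := (PtrUInt(P) and (ElemSize - 1)) = 0;
end;

var
  Data: array[0..15] of QWord;
  Shifted: Pointer;
begin
  { One byte past the start of the array: a deliberately contrived pointer }
  Shifted := Pointer(PtrUInt(@Data[0]) + 1);

  WriteLn('Data aligned for QWord:   ', IsAlignedFor(@Data[0], SizeOf(QWord)));
  WriteLn('Shifted aligned for Word: ', IsAlignedFor(Shifted, SizeOf(Word)));

  { Only call the typed fill when the precondition holds; otherwise a
    byte-wise routine such as FillChar remains safe for byte patterns. }
  if IsAlignedFor(@Data[0], SizeOf(QWord)) then
    FillQWord(Data, Length(Data), 0);
end.
```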
Tags: optimizations
Fixed in Revision
FPCOldBugId
FPCTarget
Attached Files

Activities

J. Gareth Moreton

2017-11-03 03:23

developer  

x86_64_FillWord_FillDWord_FillQWord.patch (7,567 bytes)   
Index: rtl/x86_64/x86_64.inc
===================================================================
--- rtl/x86_64/x86_64.inc	(revision 37548)
+++ rtl/x86_64/x86_64.inc	(working copy)
@@ -466,6 +466,287 @@
   end;
 {$endif FPC_SYSTEM_HAS_FILLCHAR}
 
+{$ifndef FPC_SYSTEM_HAS_FILLWORD}
+{$define FPC_SYSTEM_HAS_FILLWORD}
+procedure FillWord(var x; count: SizeInt; value: Word); assembler; nostackframe;
+  asm
+{ win64: rcx dest, rdx count, r8w value
+  linux: rdi dest, rsi count, dx  value }
+  {$ifdef win64}
+    push    %rdi
+    movzwq  %r8w, %rax
+    mov     $0x0001000100010001, %r9
+    mov     %rcx, %rdi
+  {$else}
+    movzwq  %dx,  %rax
+    mov     $0x0001000100010001, %r9
+    mov     %dil, %cl
+  {$endif}
+    imul    %r9,  %rax
+
+    { Do some memory alignment first (it should be at least aligned to a 16-bit boundary already) }
+    and     $0xe, %cl
+    jz      .LAligned16
+    test    $0x2, %cl
+    jz      .LAligned4
+
+    mov     %r8w, (%rdi)
+    add     $0x2, %cl
+  {$ifdef win64}
+    dec     %rdx
+  {$else}
+    dec     %rsi
+  {$endif}
+    add     $0x2, %rdi
+
+  .LAligned4:
+    test    $0x4, %cl
+    jz      .LAligned8
+    mov     %eax, (%rdi)
+  {$ifdef win64}
+    sub     $0x2, %rdx
+  {$else}
+    sub     $0x2, %rsi
+  {$endif}
+    add     $0x4, %rdi
+    test    $0x8, %cl
+    jnz     .LAligned16 { Note that it's NOT zero here, because if "test $0x8, %cl" sets ZF here, then the memory block was originally 2, 4 or 6 bytes beyond the boundary }
+
+  .LAligned8:
+    mov     %rax, (%rdi)
+  {$ifdef win64}
+    sub     $0x4, %rdx
+  {$else}
+    sub     $0x4, %rsi
+  {$endif}
+    add     $0x8, %rdi
+
+  .LAligned16:
+    pinsrw  $0x0, %r8w,  %xmm0
+  {$ifdef win64}
+    mov     %dl,  %r10b
+    shr     $0x2, %rdx
+    and     $0x3, %r10b
+    mov     %rdx, %rcx
+    pshuflw $0x0, %xmm0, %xmm0
+    cmp     $0x80000, %rdx
+    jb      .LNoBlocks { Too small for the non-temporal hint to be worthwhile, so just use STOSQ }
+    pshufd  $0x0, %xmm0, %xmm0
+    and     $0x7, %rcx
+    shr     $0x3, %rdx
+  {$else}
+    mov     %sil, %r10b
+    shr     $0x2, %rsi
+    and     $0x3, %r10b
+    mov     %rsi, %rcx
+    pshuflw $0x0, %xmm0, %xmm0
+    cmp     $0x80000, %rsi
+    jb      .LNoBlocks { Too small for the non-temporal hint to be worthwhile, so just use STOSQ }
+    pshufd  $0x0, %xmm0, %xmm0
+    and     $0x7, %rcx
+    shr     $0x3, %rsi
+  {$endif}
+    jz      .LNoBlocks
+
+  { Write 64 bytes at a time using a non-temporal hint }
+  .LBlockLoop:
+    add     $0x40, %rdi
+    movntdq %xmm0, -0x40(%rdi)
+    movntdq %xmm0, -0x30(%rdi)
+  {$ifdef win64}
+    dec     %rdx
+  {$else}
+    dec     %rsi
+  {$endif}
+    movntdq %xmm0, -0x20(%rdi)
+    movntdq %xmm0, -0x10(%rdi)
+    jnz     .LBlockLoop
+    mfence
+
+  .LNoBlocks:
+    shr     $0x1,  %r10b
+    cld
+    rep     stosq
+    jnc     .LNoLooseWord
+    mov     %ax,  (%rdi)
+    lea     2(%rdi), %rdi
+
+  .LNoLooseWord:
+    jz      .LNoLooseDWord
+    mov     %eax, (%rdi)
+
+  .LNoLooseDWord:
+  {$ifdef win64}
+    pop     %rdi
+  {$endif}
+  end;
+{$endif FPC_SYSTEM_HAS_FILLWORD}
+
+{$ifndef FPC_SYSTEM_HAS_FILLDWORD}
+{$define FPC_SYSTEM_HAS_FILLDWORD}
+Procedure FillDWord(var x; count: SizeInt; value: DWord); assembler; nostackframe;
+  asm
+{ win64: rcx dest, rdx count, r8d value
+  linux: rdi dest, rsi count, edx value }
+  {$ifdef win64}
+    push    %rdi
+    mov     %r8d, %eax
+    mov     %rcx, %rdi
+    shl     $32,  %rax
+    or      %r8,  %rax
+  {$else}
+    mov     %edx, %eax
+    mov     %dil, %cl
+    shl     $32,  %rax
+    or      %rdx, %rax
+  {$endif}
+
+    { Do some memory alignment first (it should be at least aligned to a 32-bit boundary already) }
+    and     $0xc, %cl
+    jz      .LAligned16
+    test    $0x4, %cl
+    jz      .LAligned8
+    mov     %r8d, (%rdi)
+  {$ifdef win64}
+    dec     %rdx
+  {$else}
+    dec     %rsi
+  {$endif}
+    add     $0x4, %rdi
+    test    $0x8, %cl
+    jnz     .LAligned16 { Note that it's NOT zero here, because if TEST CL, $8 sets ZF here, then the memory block was originally 4 bytes beyond the boundary }
+
+  .LAligned8:
+    mov     %rax, (%rdi)
+  {$ifdef win64}
+    sub     $0x2, %rdx
+  {$else}
+    sub     $0x2, %rsi
+  {$endif}
+    add     $0x8, %rdi
+
+  .LAligned16:
+  {$ifdef win64}
+    movd    %r8d, %xmm0
+    shr     $0x1, %rdx
+    setc    %r10b
+    pshufd  $0x0, %xmm0, %xmm0
+    mov     %rdx, %rcx
+    cmp     $0x80000, %rdx
+    jb      .LNoBlocks { Too small for the non-temporal hint to be worthwhile, so just use STOSQ }
+    and     $0x7, %rcx
+    shr     $0x3, %rdx
+  {$else}
+    movd    %edx, %xmm0
+    shr     $0x1, %rsi
+    setc    %r10b
+    pshufd  $0x0, %xmm0, %xmm0
+    mov     %rsi, %rcx
+    cmp     $0x80000, %rsi
+    jb      .LNoBlocks { Too small for the non-temporal hint to be worthwhile, so just use STOSQ }
+    and     $0x7, %rcx
+    shr     $0x3, %rsi
+  {$endif}
+    jz      .LNoBlocks
+
+  { Write 64 bytes at a time using a non-temporal hint }
+  .LBlockLoop:
+    add     $0x40, %rdi
+    movntdq %xmm0, -0x40(%rdi)
+    movntdq %xmm0, -0x30(%rdi)
+  {$ifdef win64}
+    dec     %rdx
+  {$else}
+    dec     %rsi
+  {$endif}
+    movntdq %xmm0, -0x20(%rdi)
+    movntdq %xmm0, -0x10(%rdi)
+    jnz     .LBlockLoop
+    mfence
+
+  .LNoBlocks:
+    test    %r10b, %r10b
+    rep     stosq
+
+    jz      .LNoLooseDWord
+    mov     %eax, (%rdi)
+
+  .LNoLooseDWord:
+  {$ifdef win64}
+    pop     %rdi
+  {$endif}
+  end;
+{$endif FPC_SYSTEM_HAS_FILLDWORD}
+
+{$ifndef FPC_SYSTEM_HAS_FILLQWORD}
+{$define FPC_SYSTEM_HAS_FILLQWORD}
+procedure FillQWord(var x; count: SizeInt; value: QWord); assembler; nostackframe;
+  asm
+{ win64: rcx dest, rdx count, r8  value
+  linux: rdi dest, rsi count, rdx value }
+  {$ifdef win64}
+    push    %rdi
+    mov     %rcx, %rdi
+    test    $0x8, %cl
+  {$else}
+    test    $0x8, %dil
+  {$endif}
+
+  { Do some memory alignment first (it should be at least aligned to a 64-bit boundary already) }
+    jz      .LAligned16
+    mov     %r8,  (%rdi)
+  {$ifdef win64}
+    dec     %rdx
+  {$else}
+    dec     %rsi
+  {$endif}
+    add     $0x8, %rdi
+
+  .LAligned16:
+  {$ifdef win64}
+    movq    %r8,  %xmm0
+    mov     %r8,  %rax
+    cmp     $0x80000, %rdx
+    mov     %rdx, %rcx
+    jb      .LNoBlocks { Too small for the non-temporal hint to be worthwhile, so just use STOSQ }
+    pshufd  $0x44,%xmm0, %xmm0 { 01 00 01 00 - XMM0 will now contain two copies of R8 }
+    and     $0x7, %rcx
+    shr     $0x3, %rdx
+  {$else}
+    movq    %rdx, %xmm0
+    mov     %rdx, %rax
+    cmp     $0x80000, %rsi
+    mov     %rsi, %rcx
+    jb      .LNoBlocks { Too small for the non-temporal hint to be worthwhile, so just use STOSQ }
+    pshufd  $0x44,%xmm0, %xmm0 { 01 00 01 00 - XMM0 will now contain two copies of RDX }
+    and     $0x7, %rcx
+    shr     $0x3, %rsi
+  {$endif}
+    jz      .LNoBlocks
+
+    { Write 64 bytes at a time using a non-temporal hint }
+  .LBlockLoop:
+    add     $0x40, %rdi
+    movntdq %xmm0, -0x40(%rdi)
+    movntdq %xmm0, -0x30(%rdi)
+  {$ifdef win64}
+    dec     %rdx
+  {$else}
+    dec     %rsi
+  {$endif}
+    movntdq %xmm0, -0x20(%rdi)
+    movntdq %xmm0, -0x10(%rdi)
+    jnz     .LBlockLoop
+    mfence
+
+  .LNoBlocks:
+    rep     stosq
+  {$ifdef win64}
+    pop     %rdi
+  {$endif}
+  end;
+{$endif FPC_SYSTEM_HAS_FILLQWORD}
+
 {$ifndef FPC_SYSTEM_HAS_INDEXBYTE}
 {$define FPC_SYSTEM_HAS_INDEXBYTE}
 function IndexByte(Const buf;len:SizeInt;b:byte):SizeInt; assembler; nostackframe;

J. Gareth Moreton

2017-11-03 03:33

developer  

FIXED_x86_64_FillWord_FillDWord_FillQWord.patch (7,924 bytes)   
Index: rtl/x86_64/x86_64.inc
===================================================================
--- rtl/x86_64/x86_64.inc	(revision 37548)
+++ rtl/x86_64/x86_64.inc	(working copy)
@@ -466,6 +466,287 @@
   end;
 {$endif FPC_SYSTEM_HAS_FILLCHAR}
 
+{$ifndef FPC_SYSTEM_HAS_FILLWORD}
+{$define FPC_SYSTEM_HAS_FILLWORD}
+procedure FillWord(var x; count: SizeInt; value: Word); assembler; nostackframe;
+  asm
+{ win64: rcx dest, rdx count, r8w value
+  linux: rdi dest, rsi count, dx  value }
+  {$ifdef win64}
+    test    %rdx, %rdx
+    jz      .LZeroCount
+    push    %rdi
+    movzwq  %r8w, %rax
+    mov     $0x0001000100010001, %r9
+    mov     %rcx, %rdi
+  {$else}
+    test    %rsi, %rsi
+    jz      .LZeroCount
+    movzwq  %dx,  %rax
+    mov     $0x0001000100010001, %r9
+    mov     %dil, %cl
+  {$endif}
+    imul    %r9,  %rax
+
+    { Do some memory alignment first (it should be at least aligned to a 16-bit boundary already) }
+    and     $0xe, %cl
+    jz      .LAligned16
+    test    $0x2, %cl
+    jz      .LAligned4
+
+    mov     %r8w, (%rdi)
+    add     $0x2, %cl
+  {$ifdef win64}
+    dec     %rdx
+  {$else}
+    dec     %rsi
+  {$endif}
+    add     $0x2, %rdi
+
+  .LAligned4:
+    test    $0x4, %cl
+    jz      .LAligned8
+    mov     %eax, (%rdi)
+  {$ifdef win64}
+    sub     $0x2, %rdx
+  {$else}
+    sub     $0x2, %rsi
+  {$endif}
+    add     $0x4, %rdi
+    test    $0x8, %cl
+    jnz     .LAligned16 { Note that it's NOT zero here, because if "test $0x8, %cl" sets ZF here, then the memory block was originally 2, 4 or 6 bytes beyond the boundary }
+
+  .LAligned8:
+    mov     %rax, (%rdi)
+  {$ifdef win64}
+    sub     $0x4, %rdx
+  {$else}
+    sub     $0x4, %rsi
+  {$endif}
+    add     $0x8, %rdi
+
+  .LAligned16:
+    pinsrw  $0x0, %r8w,  %xmm0
+  {$ifdef win64}
+    mov     %dl,  %r10b
+    shr     $0x2, %rdx
+    and     $0x3, %r10b
+    mov     %rdx, %rcx
+    pshuflw $0x0, %xmm0, %xmm0
+    cmp     $0x80000, %rdx
+    jb      .LNoBlocks { Too small for the non-temporal hint to be worthwhile, so just use STOSQ }
+    pshufd  $0x0, %xmm0, %xmm0
+    and     $0x7, %rcx
+    shr     $0x3, %rdx
+  {$else}
+    mov     %sil, %r10b
+    shr     $0x2, %rsi
+    and     $0x3, %r10b
+    mov     %rsi, %rcx
+    pshuflw $0x0, %xmm0, %xmm0
+    cmp     $0x80000, %rsi
+    jb      .LNoBlocks { Too small for the non-temporal hint to be worthwhile, so just use STOSQ }
+    pshufd  $0x0, %xmm0, %xmm0
+    and     $0x7, %rcx
+    shr     $0x3, %rsi
+  {$endif}
+    jz      .LNoBlocks
+
+  { Write 64 bytes at a time using a non-temporal hint }
+  .LBlockLoop:
+    add     $0x40, %rdi
+    movntdq %xmm0, -0x40(%rdi)
+    movntdq %xmm0, -0x30(%rdi)
+  {$ifdef win64}
+    dec     %rdx
+  {$else}
+    dec     %rsi
+  {$endif}
+    movntdq %xmm0, -0x20(%rdi)
+    movntdq %xmm0, -0x10(%rdi)
+    jnz     .LBlockLoop
+    mfence
+
+  .LNoBlocks:
+    shr     $0x1,  %r10b
+    cld
+    rep     stosq
+    jnc     .LNoLooseWord
+    mov     %ax,  (%rdi)
+    lea     2(%rdi), %rdi
+
+  .LNoLooseWord:
+    jz      .LNoLooseDWord
+    mov     %eax, (%rdi)
+
+  .LNoLooseDWord:
+  {$ifdef win64}
+    pop     %rdi
+  {$endif}
+  .LZeroCount:
+  end;
+{$endif FPC_SYSTEM_HAS_FILLWORD}
+
+{$ifndef FPC_SYSTEM_HAS_FILLDWORD}
+{$define FPC_SYSTEM_HAS_FILLDWORD}
+Procedure FillDWord(var x; count: SizeInt; value: DWord); assembler; nostackframe;
+  asm
+{ win64: rcx dest, rdx count, r8d value
+  linux: rdi dest, rsi count, edx value }
+  {$ifdef win64}
+    test    %rdx, %rdx
+    jz      .LZeroCount
+    push    %rdi
+    mov     %r8d, %eax
+    mov     %rcx, %rdi
+    shl     $32,  %rax
+    or      %r8,  %rax
+  {$else}
+    test    %rsi, %rsi
+    jz      .LZeroCount
+    mov     %edx, %eax
+    mov     %dil, %cl
+    shl     $32,  %rax
+    or      %rdx, %rax
+  {$endif}
+
+    { Do some memory alignment first (it should be at least aligned to a 32-bit boundary already) }
+    and     $0xc, %cl
+    jz      .LAligned16
+    test    $0x4, %cl
+    jz      .LAligned8
+    mov     %r8d, (%rdi)
+  {$ifdef win64}
+    dec     %rdx
+  {$else}
+    dec     %rsi
+  {$endif}
+    add     $0x4, %rdi
+    test    $0x8, %cl
+    jnz     .LAligned16 { Note that it's NOT zero here, because if TEST CL, $8 sets ZF here, then the memory block was originally 4 bytes beyond the boundary }
+
+  .LAligned8:
+    mov     %rax, (%rdi)
+  {$ifdef win64}
+    sub     $0x2, %rdx
+  {$else}
+    sub     $0x2, %rsi
+  {$endif}
+    add     $0x8, %rdi
+
+  .LAligned16:
+  {$ifdef win64}
+    movd    %r8d, %xmm0
+    shr     $0x1, %rdx
+    setc    %r10b
+    pshufd  $0x0, %xmm0, %xmm0
+    mov     %rdx, %rcx
+    cmp     $0x80000, %rdx
+    jb      .LNoBlocks { Too small for the non-temporal hint to be worthwhile, so just use STOSQ }
+    and     $0x7, %rcx
+    shr     $0x3, %rdx
+  {$else}
+    movd    %edx, %xmm0
+    shr     $0x1, %rsi
+    setc    %r10b
+    pshufd  $0x0, %xmm0, %xmm0
+    mov     %rsi, %rcx
+    cmp     $0x80000, %rsi
+    jb      .LNoBlocks { Too small for the non-temporal hint to be worthwhile, so just use STOSQ }
+    and     $0x7, %rcx
+    shr     $0x3, %rsi
+  {$endif}
+    jz      .LNoBlocks
+
+  { Write 64 bytes at a time using a non-temporal hint }
+  .LBlockLoop:
+    add     $0x40, %rdi
+    movntdq %xmm0, -0x40(%rdi)
+    movntdq %xmm0, -0x30(%rdi)
+  {$ifdef win64}
+    dec     %rdx
+  {$else}
+    dec     %rsi
+  {$endif}
+    movntdq %xmm0, -0x20(%rdi)
+    movntdq %xmm0, -0x10(%rdi)
+    jnz     .LBlockLoop
+    mfence
+
+  .LNoBlocks:
+    test    %r10b, %r10b
+    rep     stosq
+
+    jz      .LNoLooseDWord
+    mov     %eax, (%rdi)
+
+  .LNoLooseDWord:
+  {$ifdef win64}
+    pop     %rdi
+  {$endif}
+  .LZeroCount:
+  end;
+{$endif FPC_SYSTEM_HAS_FILLDWORD}
+
+{$ifndef FPC_SYSTEM_HAS_FILLQWORD}
+{$define FPC_SYSTEM_HAS_FILLQWORD}
+procedure FillQWord(var x; count: SizeInt; value: QWord); assembler; nostackframe;
+  asm
+{ win64: rcx dest, rdx count, r8  value
+  linux: rdi dest, rsi count, rdx value }
+  {$ifdef win64}
+    test    %rdx, %rdx
+    jz      .LZeroCount
+    push    %rdi
+    mov     %rcx, %rdi
+    test    $0x8, %cl
+  {$else}
+    test    %rsi, %rsi
+    jz      .LZeroCount
+    test    $0x8, %dil
+  {$endif}
+
+  { Do some memory alignment first (it should be at least aligned to a 64-bit boundary already) }
+    jz      .LAligned16
+    mov     %r8,  (%rdi)
+  {$ifdef win64}
+    dec     %rdx
+  {$else}
+    dec     %rsi
+  {$endif}
+    add     $0x8, %rdi
+
+  .LAligned16:
+  {$ifdef win64}
+    movq    %r8,  %xmm0
+    mov     %r8,  %rax
+    cmp     $0x80000, %rdx
+    mov     %rdx, %rcx
+    jb      .LNoBlocks { Too small for the non-temporal hint to be worthwhile, so just use STOSQ }
+    pshufd  $0x44,%xmm0, %xmm0 { 01 00 01 00 - XMM0 will now contain two copies of R8 }
+    and     $0x7, %rcx
+    shr     $0x3, %rdx
+  {$else}
+    movq    %rdx, %xmm0
+    mov     %rdx, %rax
+    cmp     $0x80000, %rsi
+    mov     %rsi, %rcx
+    jb      .LNoBlocks { Too small for the non-temporal hint to be worthwhile, so just use STOSQ }
+    pshufd  $0x44,%xmm0, %xmm0 { 01 00 01 00 - XMM0 will now contain two copies of RDX }
+    and     $0x7, %rcx
+    shr     $0x3, %rsi
+  {$endif}
+    jz      .LNoBlocks
+
+    { Write 64 bytes at a time using a non-temporal hint }
+  .LBlockLoop:
+    add     $0x40, %rdi
+    movntdq %xmm0, -0x40(%rdi)
+    movntdq %xmm0, -0x30(%rdi)
+  {$ifdef win64}
+    dec     %rdx
+  {$else}
+    dec     %rsi
+  {$endif}
+    movntdq %xmm0, -0x20(%rdi)
+    movntdq %xmm0, -0x10(%rdi)
+    jnz     .LBlockLoop
+    mfence
+
+  .LNoBlocks:
+    rep     stosq
+  {$ifdef win64}
+    pop     %rdi
+  {$endif}
+  .LZeroCount:
+  end;
+{$endif FPC_SYSTEM_HAS_FILLQWORD}
+
 {$ifndef FPC_SYSTEM_HAS_INDEXBYTE}
 {$define FPC_SYSTEM_HAS_INDEXBYTE}
 function IndexByte(Const buf;len:SizeInt;b:byte):SizeInt; assembler; nostackframe;

J. Gareth Moreton

2017-11-03 03:36

developer   ~0103851

Last-minute fix. Forgot to check if count was zero when entering the procedure.

J. Gareth Moreton

2017-11-03 17:46

developer   ~0103856

I may have gotten the sanity check wrong. To prevent a change of behaviour, and because the count is signed, the lines that read the following...

test %rdx, %rdx
jz .LZeroCount

...should possibly be changed to...

cmp $0x0, %rdx
jle .LZeroCount

(Change %rdx to %rsi for Linux)

That way, if a negative number is passed into the function, it just drops out, instead of causing a buffer overrun.
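The difference between the two checks can be illustrated in Pascal terms. This is a sketch only; the function names are invented here and simply mirror the two instruction pairs quoted above.

```pascal
program CountGuard;
{$mode objfpc}

{ Mirrors "test %rdx, %rdx / jz": only a count of exactly zero exits early. }
function ExitsEarlyZeroOnly(Count: SizeInt): Boolean;
begin
  Result := Count = 0;
end;

{ Mirrors "cmp $0x0, %rdx / jle": zero and negative counts both exit early. }
function ExitsEarlySigned(Count: SizeInt): Boolean;
begin
  Result := Count <= 0;
end;

var
  Count: SizeInt;
begin
  Count := -1;
  WriteLn('zero-only check exits early: ', ExitsEarlyZeroOnly(Count));
  WriteLn('signed check exits early:    ', ExitsEarlySigned(Count));
  { The same bit pattern read as an unsigned element count, which is how
    the later shifts and "rep stosq" would effectively treat it }
  WriteLn('count read as unsigned:      ', SizeUInt(Count));
end.
```

With the zero-only check, a count of -1 falls through to the fill code, where its bit pattern acts as an enormous unsigned length; the signed comparison rejects it up front.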

J. Gareth Moreton

2017-11-05 15:06

developer   ~0103881

Time metrics: the average speed gain is about 50%. Because of the initial checks and the memory fencing, the best gains are found when initialising memory blocks of more than a megabyte.

J. Gareth Moreton

2017-11-26 13:05

developer  

EXCEPTION_x86_64_FillWord_FillDWord_FillQWord.patch (8,120 bytes)   
Index: rtl/x86_64/x86_64.inc
===================================================================
--- rtl/x86_64/x86_64.inc	(revision 37566)
+++ rtl/x86_64/x86_64.inc	(working copy)
@@ -466,6 +466,308 @@
   end;
 {$endif FPC_SYSTEM_HAS_FILLCHAR}
 
+{$ifndef FPC_SYSTEM_HAS_FILLWORD}
+{$define FPC_SYSTEM_HAS_FILLWORD}
+Procedure FillWord(var x; count: SizeInt; value: Word); assembler; nostackframe;
+  asm
+{ win64: rcx dest, rdx count, r8w value
+  linux: rdi dest, rsi count, dx  value }
+  {$ifdef win64}
+.seh_endprologue { No prologue actually present }
+    cmp     $0x0, %rdx
+    jle     .LZeroCount
+    push    %rdi
+.seh_pushreg %rdi
+    movzwq  %r8w, %rax
+    mov     $0x0001000100010001, %r9
+    mov     %rcx, %rdi
+  {$else}
+    cmp     $0x0, %rsi
+    jle     .LZeroCount
+    movzwq  %dx,  %rax
+    mov     $0x0001000100010001, %r9
+    mov     %dil, %cl
+  {$endif}
+    imul    %r9,  %rax
+
+    { Do some memory alignment first (it should be at least aligned to a 16-bit boundary already) }
+    and     $0xe, %cl
+    jz      .LAligned16
+    test    $0x2, %cl
+    jz      .LAligned4
+
+    mov     %r8w, (%rdi)
+    add     $0x2, %cl
+  {$ifdef win64}
+    dec     %rdx
+  {$else}
+    dec     %rsi
+  {$endif}
+    add     $0x2, %rdi
+
+  .LAligned4:
+    test    $0x4, %cl
+    jz      .LAligned8
+    mov     %eax, (%rdi)
+  {$ifdef win64}
+    sub     $0x2, %rdx
+  {$else}
+    sub     $0x2, %rsi
+  {$endif}
+    add     $0x4, %rdi
+    test    $0x8, %cl
+    jnz     .LAligned16 { Note that it's NOT zero here, because if "test $0x8, %cl" sets ZF here, then the memory block was originally 2, 4 or 6 bytes beyond the boundary }
+
+  .LAligned8:
+    mov     %rax, (%rdi)
+  {$ifdef win64}
+    sub     $0x4, %rdx
+  {$else}
+    sub     $0x4, %rsi
+  {$endif}
+    add     $0x8, %rdi
+
+  .LAligned16:
+    pinsrw  $0x0, %r8w,  %xmm0
+  {$ifdef win64}
+    mov     %dl,  %r10b
+    shr     $0x2, %rdx
+    and     $0x3, %r10b
+    mov     %rdx, %rcx
+    pshuflw $0x0, %xmm0, %xmm0
+    cmp     $0x80000, %rdx
+    jb      .LNoBlocks { Too small for the non-temporal hint to be worthwhile, so just use STOSQ }
+    pshufd  $0x0, %xmm0, %xmm0
+    and     $0x7, %rcx
+    shr     $0x3, %rdx
+  {$else}
+    mov     %sil, %r10b
+    shr     $0x2, %rsi
+    and     $0x3, %r10b
+    mov     %rsi, %rcx
+    pshuflw $0x0, %xmm0, %xmm0
+    cmp     $0x80000, %rsi
+    jb      .LNoBlocks { Too small for the non-temporal hint to be worthwhile, so just use STOSQ }
+    pshufd  $0x0, %xmm0, %xmm0
+    and     $0x7, %rcx
+    shr     $0x3, %rsi
+  {$endif}
+    jz      .LNoBlocks
+
+  { Write 64 bytes at a time using a non-temporal hint }
+  .LBlockLoop:
+    add     $0x40, %rdi
+    movntdq %xmm0, -0x40(%rdi)
+    movntdq %xmm0, -0x30(%rdi)
+  {$ifdef win64}
+    dec     %rdx
+  {$else}
+    dec     %rsi
+  {$endif}
+    movntdq %xmm0, -0x20(%rdi)
+    movntdq %xmm0, -0x10(%rdi)
+    jnz     .LBlockLoop
+    mfence
+
+  .LNoBlocks:
+    shr     $0x1,  %r10b
+    cld
+    rep     stosq
+    jnc     .LNoLooseWord
+    mov     %ax,  (%rdi)
+    lea     2(%rdi), %rdi
+
+  .LNoLooseWord:
+    jz      .LNoLooseDWord
+    mov     %eax, (%rdi)
+
+  .LNoLooseDWord:
+  {$ifdef win64}
+    pop     %rdi
+  {$endif}
+  .LZeroCount:
+  end;
+{$endif FPC_SYSTEM_HAS_FILLWORD}
+
+{$ifndef FPC_SYSTEM_HAS_FILLDWORD}
+{$define FPC_SYSTEM_HAS_FILLDWORD}
+Procedure FillDWord(var x; count: SizeInt; value: DWord); assembler; nostackframe;
+  asm
+{ win64: rcx dest, rdx count, r8d value
+  linux: rdi dest, rsi count, edx value }
+  {$ifdef win64}
+.seh_endprologue { No prologue actually present }
+    cmp     $0x0, %rdx
+    jle     .LZeroCount
+    push    %rdi
+.seh_pushreg %rdi
+    mov     %r8d, %eax
+    mov     %rcx, %rdi
+    shl     $32,  %rax
+    or      %r8,  %rax
+  {$else}
+    cmp     $0x0, %rsi
+    jle     .LZeroCount
+    mov     %edx, %eax
+    mov     %dil, %cl
+    shl     $32,  %rax
+    or      %rdx, %rax
+  {$endif}
+
+    { Do some memory alignment first (it should be at least aligned to a 32-bit boundary already) }
+    and     $0xc, %cl
+    jz      .LAligned16
+    test    $0x4, %cl
+    jz      .LAligned8
+    mov     %r8d, (%rdi)
+  {$ifdef win64}
+    dec     %rdx
+  {$else}
+    dec     %rsi
+  {$endif}
+    add     $0x4, %rdi
+    test    $0x8, %cl
+    jnz     .LAligned16 { Note that it's NOT zero here, because if TEST CL, $8 sets ZF here, then the memory block was originally 4 bytes beyond the boundary }
+
+  .LAligned8:
+    mov     %rax, (%rdi)
+  {$ifdef win64}
+    sub     $0x2, %rdx
+  {$else}
+    sub     $0x2, %rsi
+  {$endif}
+    add     $0x8, %rdi
+
+  .LAligned16:
+  {$ifdef win64}
+    movd    %r8d, %xmm0
+    shr     $0x1, %rdx
+    setc    %r10b
+    pshufd  $0x0, %xmm0, %xmm0
+    mov     %rdx, %rcx
+    cmp     $0x80000, %rdx
+    jb      .LNoBlocks { Too small for the non-temporal hint to be worthwhile, so just use STOSQ }
+    and     $0x7, %rcx
+    shr     $0x3, %rdx
+  {$else}
+    movd    %edx, %xmm0
+    shr     $0x1, %rsi
+    setc    %r10b
+    pshufd  $0x0, %xmm0, %xmm0
+    mov     %rsi, %rcx
+    cmp     $0x80000, %rsi
+    jb      .LNoBlocks { Too small for the non-temporal hint to be worthwhile, so just use STOSQ }
+    and     $0x7, %rcx
+    shr     $0x3, %rsi
+  {$endif}
+    jz      .LNoBlocks
+
+  { Write 64 bytes at a time using a non-temporal hint }
+  .LBlockLoop:
+    add     $0x40, %rdi
+    movntdq %xmm0, -0x40(%rdi)
+    movntdq %xmm0, -0x30(%rdi)
+  {$ifdef win64}
+    dec     %rdx
+  {$else}
+    dec     %rsi
+  {$endif}
+    movntdq %xmm0, -0x20(%rdi)
+    movntdq %xmm0, -0x10(%rdi)
+    jnz     .LBlockLoop
+    mfence
+
+  .LNoBlocks:
+    test    %r10b, %r10b
+    rep     stosq
+
+    jz      .LNoLooseDWord
+    mov     %eax, (%rdi)
+
+  .LNoLooseDWord:
+  {$ifdef win64}
+    pop     %rdi
+  {$endif}
+  .LZeroCount:
+  end;
+{$endif FPC_SYSTEM_HAS_FILLDWORD}
+
+{$ifndef FPC_SYSTEM_HAS_FILLQWORD}
+{$define FPC_SYSTEM_HAS_FILLQWORD}
+Procedure FillQWord(var x; count: SizeInt; value: QWord); assembler; nostackframe;
+  asm
+{ win64: rcx dest, rdx count, r8  value
+  linux: rdi dest, rsi count, rdx value }
+  {$ifdef win64}
+.seh_endprologue { No prologue actually present }
+    cmp     $0x0, %rdx
+    jle     .LZeroCount
+    push    %rdi
+.seh_pushreg %rdi	
+    mov     %rcx, %rdi
+    test    $0x8, %cl
+  {$else}
+    cmp     $0x0, %rsi
+    jle     .LZeroCount
+    test    $0x8, %dil
+  {$endif}
+
+  { Do some memory alignment first (it should be at least aligned to a 64-bit boundary already) }
+    jz      .LAligned16
+    mov     %r8,  (%rdi)
+  {$ifdef win64}
+    dec     %rdx
+  {$else}
+    dec     %rsi
+  {$endif}
+    add     $0x8, %rdi
+
+  .LAligned16:
+  {$ifdef win64}
+    movq    %r8,  %xmm0
+    mov     %r8,  %rax
+    cmp     $0x80000, %rdx
+    mov     %rdx, %rcx
+    jb      .LNoBlocks { Too small for the non-temporal hint to be worthwhile, so just use STOSQ }
+    pshufd  $0x44,%xmm0, %xmm0 { 01 00 01 00 - XMM0 will now contain two copies of R8 }
+    and     $0x7, %rcx
+    shr     $0x3, %rdx
+  {$else}
+    movq    %rdx, %xmm0
+    mov     %rdx, %rax
+    cmp     $0x80000, %rsi
+    mov     %rsi, %rcx
+    jb      .LNoBlocks { Too small for the non-temporal hint to be worthwhile, so just use STOSQ }
+    pshufd  $0x44,%xmm0, %xmm0 { 01 00 01 00 - XMM0 will now contain two copies of RDX }
+    and     $0x7, %rcx
+    shr     $0x3, %rsi
+  {$endif}
+    jz      .LNoBlocks
+
+    { Write 64 bytes at a time using a non-temporal hint }
+  .LBlockLoop:
+    add     $0x40, %rdi
+    movntdq %xmm0, -0x40(%rdi)
+    movntdq %xmm0, -0x30(%rdi)
+  {$ifdef win64}
+    dec     %rdx
+  {$else}
+    dec     %rsi
+  {$endif}
+    movntdq %xmm0, -0x20(%rdi)
+    movntdq %xmm0, -0x10(%rdi)
+    jnz     .LBlockLoop
+    mfence
+
+  .LNoBlocks:
+    rep     stosq
+  {$ifdef win64}
+    pop     %rdi
+  {$endif}
+  .LZeroCount:
+  end;
+{$endif FPC_SYSTEM_HAS_FILLQWORD}
+
 {$ifndef FPC_SYSTEM_HAS_INDEXBYTE}
 {$define FPC_SYSTEM_HAS_INDEXBYTE}
 function IndexByte(Const buf;len:SizeInt;b:byte):SizeInt; assembler; nostackframe;

J. Gareth Moreton

2017-11-26 13:06

developer   ~0104283

Uploaded patch with required .seh_pushreg fields, and corrected range checking so the routines instantly return if the count value is negative.

J. Gareth Moreton

2017-11-30 11:56

developer  

STACK_FRAME_x86_64_FillWord_FillDWord_FillQWord.patch (8,021 bytes)   
Index: rtl/x86_64/x86_64.inc
===================================================================
--- rtl/x86_64/x86_64.inc	(revision 37566)
+++ rtl/x86_64/x86_64.inc	(working copy)
@@ -466,6 +466,308 @@
   end;
 {$endif FPC_SYSTEM_HAS_FILLCHAR}
 
+{$ifndef FPC_SYSTEM_HAS_FILLWORD}
+{$define FPC_SYSTEM_HAS_FILLWORD}
+Procedure FillWord(var x; count: SizeInt; value: Word); assembler; nostackframe;
+  asm
+{ win64: rcx dest, rdx count, r8w value
+  linux: rdi dest, rsi count, dx  value }
+  {$ifdef win64}
+    push    %rdi
+.seh_pushreg %rdi
+.seh_endprologue
+    cmp     $0x0, %rdx
+    jle     .LZeroCount
+    movzwq  %r8w, %rax
+    mov     $0x0001000100010001, %r9
+    mov     %rcx, %rdi
+  {$else}
+    cmp     $0x0, %rsi
+    jle     .LZeroCount
+    movzwq  %dx,  %rax
+    mov     $0x0001000100010001, %r9
+    mov     %dil, %cl
+  {$endif}
+    imul    %r9,  %rax
+
+    { Do some memory alignment first (it should be at least aligned to a 16-bit boundary already) }
+    and     $0xe, %cl
+    jz      .LAligned16
+    test    $0x2, %cl
+    jz      .LAligned4
+
+    mov     %r8w, (%rdi)
+    add     $0x2, %cl
+  {$ifdef win64}
+    dec     %rdx
+  {$else}
+    dec     %rsi
+  {$endif}
+    add     $0x2, %rdi
+
+  .LAligned4:
+    test    $0x4, %cl
+    jz      .LAligned8
+    mov     %eax, (%rdi)
+  {$ifdef win64}
+    sub     $0x2, %rdx
+  {$else}
+    sub     $0x2, %rsi
+  {$endif}
+    add     $0x4, %rdi
+    test    $0x8, %cl
+    jnz     .LAligned16 { Note that it's NOT zero here, because if "test $0x8, %cl" sets ZF here, then the memory block was originally 2, 4 or 6 bytes beyond the boundary }
+
+  .LAligned8:
+    mov     %rax, (%rdi)
+  {$ifdef win64}
+    sub     $0x4, %rdx
+  {$else}
+    sub     $0x4, %rsi
+  {$endif}
+    add     $0x8, %rdi
+
+  .LAligned16:
+    pinsrw  $0x0, %r8w,  %xmm0
+  {$ifdef win64}
+    mov     %dl,  %r10b
+    shr     $0x2, %rdx
+    and     $0x3, %r10b
+    mov     %rdx, %rcx
+    pshuflw $0x0, %xmm0, %xmm0
+    cmp     $0x80000, %rdx
+    jb      .LNoBlocks { Too small for the non-temporal hint to be worthwhile, so just use STOSQ }
+    pshufd  $0x0, %xmm0, %xmm0
+    and     $0x7, %rcx
+    shr     $0x3, %rdx
+  {$else}
+    mov     %sil, %r10b
+    shr     $0x2, %rsi
+    and     $0x3, %r10b
+    mov     %rsi, %rcx
+    pshuflw $0x0, %xmm0, %xmm0
+    cmp     $0x80000, %rsi
+    jb      .LNoBlocks { Too small for the non-temporal hint to be worthwhile, so just use STOSQ }
+    pshufd  $0x0, %xmm0, %xmm0
+    and     $0x7, %rcx
+    shr     $0x3, %rsi
+  {$endif}
+    jz      .LNoBlocks
+
+  { Write 64 bytes at a time using a non-temporal hint }
+  .LBlockLoop:
+    add     $0x40, %rdi
+    movntdq %xmm0, -0x40(%rdi)
+    movntdq %xmm0, -0x30(%rdi)
+  {$ifdef win64}
+    dec     %rdx
+  {$else}
+    dec     %rsi
+  {$endif}
+    movntdq %xmm0, -0x20(%rdi)
+    movntdq %xmm0, -0x10(%rdi)
+    jnz     .LBlockLoop
+    mfence
+
+  .LNoBlocks:
+    shr     $0x1,  %r10b
+    cld
+    rep     stosq
+    jnc     .LNoLooseWord
+    mov     %ax,  (%rdi)
+    lea     2(%rdi), %rdi
+
+  .LNoLooseWord:
+    jz      .LNoLooseDWord
+    mov     %eax, (%rdi)
+
+  .LNoLooseDWord:
+  .LZeroCount:
+  {$ifdef win64}
+    pop     %rdi
+  {$endif}
+  end;
+{$endif FPC_SYSTEM_HAS_FILLWORD}
+
+{$ifndef FPC_SYSTEM_HAS_FILLDWORD}
+{$define FPC_SYSTEM_HAS_FILLDWORD}
+Procedure FillDWord(var x; count: SizeInt; value: DWord); assembler; nostackframe;
+  asm
+{ win64: rcx dest, rdx count, r8d value
+  linux: rdi dest, rsi count, edx value }
+  {$ifdef win64}
+    push    %rdi
+.seh_pushreg %rdi
+.seh_endprologue
+    cmp     $0x0, %rdx
+    jle     .LZeroCount
+    mov     %r8d, %eax
+    mov     %rcx, %rdi
+    shl     $32,  %rax
+    or      %r8,  %rax
+  {$else}
+    cmp     $0x0, %rsi
+    jle     .LZeroCount
+    mov     %edx, %eax
+    mov     %dil, %cl
+    shl     $32,  %rax
+    or      %rdx, %rax
+  {$endif}
+
+    { Do some memory alignment first (it should be at least aligned to a 32-bit boundary already) }
+    and     $0xc, %cl
+    jz      .LAligned16
+    test    $0x4, %cl
+    jz      .LAligned8
+    mov     %r8d, (%rdi)
+  {$ifdef win64}
+    dec     %rdx
+  {$else}
+    dec     %rsi
+  {$endif}
+    add     $0x4, %rdi
+    test    $0x8, %cl
+    jnz     .LAligned16 { Note that it's NOT zero here, because if TEST CL, $8 sets ZF here, then the memory block was originally 4 bytes beyond the boundary }
+
+  .LAligned8:
+    { This stores two DWords without re-checking the count, so at least two must remain }
+    mov     %rax, (%rdi)
+  {$ifdef win64}
+    sub     $0x2, %rdx
+  {$else}
+    sub     $0x2, %rsi
+  {$endif}
+    add     $0x8, %rdi
+
+  .LAligned16:
+  {$ifdef win64}
+    movd    %r8d, %xmm0
+    shr     $0x1, %rdx
+    setc    %r10b
+    pshufd  $0x0, %xmm0, %xmm0
+    mov     %rdx, %rcx
+    cmp     $0x80000, %rdx
+    jb      .LNoBlocks { Too small for the non-temporal hint to be worthwhile, so just use STOSQ }
+    and     $0x7, %rcx
+    shr     $0x3, %rdx
+  {$else}
+    movd    %edx, %xmm0
+    shr     $0x1, %rsi
+    setc    %r10b
+    pshufd  $0x0, %xmm0, %xmm0
+    mov     %rsi, %rcx
+    cmp     $0x80000, %rsi
+    jb      .LNoBlocks { Too small for the non-temporal hint to be worthwhile, so just use STOSQ }
+    and     $0x7, %rcx
+    shr     $0x3, %rsi
+  {$endif}
+    jz      .LNoBlocks
+
+  { Write 64 bytes at a time using a non-temporal hint }
+  .LBlockLoop:
+    add     $0x40, %rdi
+    movntdq %xmm0, -0x40(%rdi)
+    movntdq %xmm0, -0x30(%rdi)
+  {$ifdef win64}
+    dec     %rdx
+  {$else}
+    dec     %rsi
+  {$endif}
+    movntdq %xmm0, -0x20(%rdi)
+    movntdq %xmm0, -0x10(%rdi)
+    jnz     .LBlockLoop
+    mfence
+
+  .LNoBlocks:
+    test    %r10b, %r10b
+    rep     stosq
+
+    jz      .LNoLooseDWord
+    mov     %eax, (%rdi)
+
+  .LNoLooseDWord:
+  .LZeroCount:
+  {$ifdef win64}
+    pop     %rdi
+  {$endif}
+  end;
+{$endif FPC_SYSTEM_HAS_FILLDWORD}
+
+{$ifndef FPC_SYSTEM_HAS_FILLQWORD}
+{$define FPC_SYSTEM_HAS_FILLQWORD}
+Procedure FillQWord(var x; count: SizeInt; value: QWord); assembler; nostackframe;
+  asm
+{ win64: rcx dest, rdx count, r8  value
+  linux: rdi dest, rsi count, rdx value }
+  {$ifdef win64}
+    push    %rdi
+.seh_pushreg %rdi
+.seh_endprologue
+    cmp     $0x0, %rdx
+    jle     .LZeroCount
+    mov     %rcx, %rdi
+    test    $0x8, %cl
+  {$else}
+    cmp     $0x0, %rsi
+    jle     .LZeroCount
+    test    $0x8, %dil
+  {$endif}
+
+  { Do some memory alignment first (it should be at least aligned to a 64-bit boundary already) }
+    jz      .LAligned16
+  {$ifdef win64}
+    mov     %r8,  (%rdi)
+    dec     %rdx
+  {$else}
+    mov     %rdx, (%rdi) { The value arrives in RDX under the System V ABI, not R8 }
+    dec     %rsi
+  {$endif}
+    add     $0x8, %rdi
+
+  .LAligned16:
+  {$ifdef win64}
+    movq    %r8,  %xmm0
+    mov     %r8,  %rax
+    cmp     $0x80000, %rdx
+    mov     %rdx, %rcx
+    jb      .LNoBlocks { Too small for the non-temporal hint to be worthwhile, so just use STOSQ }
+    pshufd  $0x44,%xmm0, %xmm0 { 01 00 01 00 - XMM0 will now contain two copies of R8 }
+    and     $0x7, %rcx
+    shr     $0x3, %rdx
+  {$else}
+    movq    %rdx, %xmm0
+    mov     %rdx, %rax
+    cmp     $0x80000, %rsi
+    mov     %rsi, %rcx
+    jb      .LNoBlocks { Too small for the non-temporal hint to be worthwhile, so just use STOSQ }
+    pshufd  $0x44,%xmm0, %xmm0 { 01 00 01 00 - XMM0 will now contain two copies of RDX }
+    and     $0x7, %rcx
+    shr     $0x3, %rsi
+  {$endif}
+    jz      .LNoBlocks
+
+    { Write 64 bytes at a time using a non-temporal hint }
+  .LBlockLoop:
+    add     $0x40, %rdi
+    movntdq %xmm0, -0x40(%rdi)
+    movntdq %xmm0, -0x30(%rdi)
+  {$ifdef win64}
+    dec     %rdx
+  {$else}
+    dec     %rsi
+  {$endif}
+    movntdq %xmm0, -0x20(%rdi)
+    movntdq %xmm0, -0x10(%rdi)
+    jnz     .LBlockLoop
+    mfence
+
+  .LNoBlocks:
+    rep     stosq
+  .LZeroCount:
+  {$ifdef win64}
+    pop     %rdi
+  {$endif}
+  end;
+{$endif FPC_SYSTEM_HAS_FILLQWORD}
+
 {$ifndef FPC_SYSTEM_HAS_INDEXBYTE}
 {$define FPC_SYSTEM_HAS_INDEXBYTE}
 function IndexByte(Const buf;len:SizeInt;b:byte):SizeInt; assembler; nostackframe;

J. Gareth Moreton

2017-11-30 11:57

developer   ~0104364

Fixed stack frame hints

J. Gareth Moreton

2018-02-08 10:19

developer   ~0106282

Do these functions work as they should? (Note: apply ONLY the patch "STACK_FRAME_x86_64_FillWord_FillDWord_FillQWord.patch")

Marco van de Voort

2018-02-10 14:25

manager   ~0106314

An example of such unaligned access is zeroing out a field of a record that describes a wire (read: socket) or file layout. Such structures might not have the natural system alignment.

J. Gareth Moreton

2018-02-13 11:44

developer   ~0106369

Last edited: 2018-02-13 11:47

Does that include having, say, a 16-bit field that crosses a word boundary (i.e. sits on an odd-numbered offset)? Granted, I would assume that the entire record is on a boundary that is suitable to pass into FillWord... or in this case, I imagine FillChar is better just on account of the record size not being a clean multiple of a Word or DWord.

Does that mean these functions should be modified to allow for unaligned memory? (i.e. FillWord working with odd-numbered Pointers and FillDWord working with Pointers that aren't a multiple of 4)

NoName

2020-01-07 23:13

reporter   ~0120255

Just to give it a push - isn't everything aligned by default now?

Marco van de Voort

2020-01-08 07:49

manager   ~0120264

Yes. They should work for any pointer, and they should also work for non-zero arguments (in other words, where the high word <> the low word).
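Both requirements can be checked mechanically. The following is a minimal sketch, assuming a little-endian target such as x86-64; `FillWordIsCorrect` is a hypothetical helper, and only `FillWord` and `FillChar` are real RTL routines. It fills from deliberately misaligned start offsets with a value whose high and low bytes differ, then compares every byte, including the guard bytes around the filled range.

```pascal
program CheckUnalignedFill;
{$mode objfpc}

{ Hypothetical helper (not part of the patch): fills Count Words starting at
  Buf[StartAt] and checks every byte of the buffer, guard bytes included. }
function FillWordIsCorrect(StartAt, Count: SizeInt; Value: Word): Boolean;
var
  Buf: array[0..255] of Byte;
  I: SizeInt;
  Expected: Byte;
begin
  Result := False;
  FillChar(Buf, SizeOf(Buf), $EE);           { guard pattern }
  FillWord(Buf[StartAt], Count, Value);
  for I := 0 to High(Buf) do
  begin
    if (I < StartAt) or (I >= StartAt + Count * 2) then
      Expected := $EE                        { outside the filled range }
    else if ((I - StartAt) and 1) = 0 then
      Expected := Lo(Value)                  { little-endian: low byte first }
    else
      Expected := Hi(Value);
    if Buf[I] <> Expected then
      Exit;                                  { Result is still False }
  end;
  Result := True;
end;

var
  StartAt, Count: SizeInt;
begin
  for StartAt := 0 to 17 do
    for Count := 0 to 40 do
      if not FillWordIsCorrect(StartAt, Count, $12AB) then
      begin
        WriteLn('FillWord failed: start=', StartAt, ' count=', Count);
        Halt(1);
      end;
  WriteLn('FillWord OK for all tested offsets and counts');
end.
```

This only covers small counts. Exercising the non-temporal path in the patch would need buffers of a megabyte or more, and the same pattern would extend to FillDWord and FillQWord.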

J. Gareth Moreton

2020-01-08 17:30

developer   ~0120267

I might need to rewrite them in that case so they work with all pointer addresses, even if they are most likely going to be 8 or 16-byte aligned.

Marco van de Voort

2020-01-08 17:56

manager   ~0120268

Maybe add a test near the beginning so that fills of fewer than 10 to 16 bytes or so are handled with a simple rep stosw?

J. Gareth Moreton

2020-01-08 20:50

developer   ~0120277

Last edited: 2020-01-08 20:51

That would work, but you would also need one at the end, for 16 minus however many bytes you wrote at the beginning. For example, if you want to write 5 DWords but your pointer starts 1 byte past a 4-byte boundary, you'd write 3 bytes first, then 4 DWords (probably as a single XMM register if you can), then a final byte.

With the byte-sized FillChar, this is no problem, because you can broadcast the byte into all 16 bytes of an XMM register, but doing the same for Words, DWords and QWords is not so straightforward, as it depends on the alignment of the input pointer. Basically, if a = pointer alignment (mod 16) and s = unit size (2 for Word, 4 for DWord etc), and gcd(a, s) < s, then special treatment is required.
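The "special treatment" amounts to rotating the fill pattern by the number of head bytes modulo the unit size before doing the wide stores. The following is a portable sketch of that head/body/tail split for DWords, assuming nothing about the final assembly; `FillDWordAnyAlign` and `FillMatches` are hypothetical names, and a 16-byte `Move` stands in for one aligned XMM store.

```pascal
program UnalignedFillSketch;
{$mode objfpc}
{$pointermath on}

{ Hypothetical name: a portable model of the head/body/tail split,
  not the assembly routine from the patch. }
procedure FillDWordAnyAlign(Dest: PByte; Count: SizeInt; Value: DWord);
var
  Pat: array[0..3] of Byte;
  Block: array[0..15] of Byte;
  Total, Head, Phase, I: SizeInt;
begin
  if Count <= 0 then
    Exit;
  Move(Value, Pat, SizeOf(Pat));             { bytes of Value in memory order }
  Total := Count * 4;

  { Head: bytes needed to reach the next 16-byte boundary }
  Head := SizeInt((16 - (PtrUInt(Dest) and 15)) and 15);
  if Head > Total then
    Head := Total;
  for I := 0 to Head - 1 do
    Dest[I] := Pat[I and 3];

  { Body: the pattern resumes at phase Head mod 4, so rotate it once up front }
  Phase := Head and 3;
  for I := 0 to 15 do
    Block[I] := Pat[(I + Phase) and 3];
  I := Head;
  while I + 16 <= Total do
  begin
    Move(Block, Dest[I], 16);                { stands in for one aligned XMM store }
    Inc(I, 16);
  end;

  { Tail: bytes left over after the last full 16-byte block }
  while I < Total do
  begin
    Dest[I] := Pat[I and 3];
    Inc(I);
  end;
end;

{ Hypothetical checker: fills from Buf[StartAt] and compares every byte }
function FillMatches(StartAt, Count: SizeInt; Value: DWord): Boolean;
var
  Buf: array[0..255] of Byte;
  Pat: array[0..3] of Byte;
  I: SizeInt;
begin
  Move(Value, Pat, SizeOf(Pat));
  FillChar(Buf, SizeOf(Buf), $EE);           { guard pattern }
  FillDWordAnyAlign(@Buf[StartAt], Count, Value);
  Result := True;
  for I := 0 to High(Buf) do
    if (I < StartAt) or (I >= StartAt + Count * 4) then
      Result := Result and (Buf[I] = $EE)
    else
      Result := Result and (Buf[I] = Pat[(I - StartAt) and 3]);
end;

var
  StartAt, Count: SizeInt;
begin
  for StartAt := 0 to 16 do
    for Count := 0 to 40 do
      if not FillMatches(StartAt, Count, $44332211) then
      begin
        WriteLn('Mismatch: start=', StartAt, ' count=', Count);
        Halt(1);
      end;
  WriteLn('All unaligned DWord fills matched');
end.
```

When the start address is already a multiple of 4, Phase is 0 and the block is just four copies of the value, which is the case where gcd(a, s) = s and no rotation is needed.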

NoName

2020-02-17 21:32

reporter   ~0121151

Maybe the assembler implementations could be taken from http://blog.synopse.info/post/2020/02/17/New-move/fillchar-optimized-sse2/avx-asm-version

J. Gareth Moreton

2020-02-17 22:47

developer   ~0121153

Admittedly I've been a little bit distracted because I've been trying to get my MOVZX optimisations fixed.

J. Gareth Moreton

2020-09-29 09:40

developer   ~0125952

Thanks for bumping this Chris! I'll see about writing versions that can handle unaligned memory.

Issue History

Date Modified Username Field Change
2017-11-03 03:23 J. Gareth Moreton New Issue
2017-11-03 03:23 J. Gareth Moreton File Added: x86_64_FillWord_FillDWord_FillQWord.patch
2017-11-03 03:33 J. Gareth Moreton File Added: FIXED_x86_64_FillWord_FillDWord_FillQWord.patch
2017-11-03 03:36 J. Gareth Moreton Note Added: 0103851
2017-11-03 17:46 J. Gareth Moreton Note Added: 0103856
2017-11-05 15:06 J. Gareth Moreton Note Added: 0103881
2017-11-26 13:05 J. Gareth Moreton File Added: EXCEPTION_x86_64_FillWord_FillDWord_FillQWord.patch
2017-11-26 13:06 J. Gareth Moreton Note Added: 0104283
2017-11-30 11:56 J. Gareth Moreton File Added: STACK_FRAME_x86_64_FillWord_FillDWord_FillQWord.patch
2017-11-30 11:57 J. Gareth Moreton Note Added: 0104364
2017-12-31 15:00 J. Gareth Moreton Tag Attached: optimizations
2018-02-08 10:19 J. Gareth Moreton Note Added: 0106282
2018-02-10 14:25 Marco van de Voort Note Added: 0106314
2018-02-13 11:44 J. Gareth Moreton Note Added: 0106369
2018-02-13 11:45 J. Gareth Moreton Note Edited: 0106369
2018-02-13 11:47 J. Gareth Moreton Note Edited: 0106369
2018-02-13 11:47 J. Gareth Moreton Note Edited: 0106369
2020-01-07 23:13 NoName Note Added: 0120255
2020-01-08 07:49 Marco van de Voort Note Added: 0120264
2020-01-08 17:30 J. Gareth Moreton Note Added: 0120267
2020-01-08 17:56 Marco van de Voort Note Added: 0120268
2020-01-08 20:50 J. Gareth Moreton Note Added: 0120277
2020-01-08 20:51 J. Gareth Moreton Note Edited: 0120277
2020-02-17 21:32 NoName Note Added: 0121151
2020-02-17 22:47 J. Gareth Moreton Note Added: 0121153
2020-09-29 09:40 J. Gareth Moreton Note Added: 0125952