View Issue Details

IDProjectCategoryView StatusLast Update
0023428LazarusLazUtilspublic2013-06-14 15:19
ReporterBart BroersmaAssigned ToFelipe Monteiro de Carvalho 
PrioritynormalSeveritycrashReproducibilityalways
Status closedResolutionfixed 
Platformi386OSWindowOS VersionWin7
Product Version1.1 (SVN)Product Buildr39272 
Target Version1.2.0Fixed in Version1.1 (SVN) 
Summary0023428: Utf8UpperCase in combination with HeapTrace may crash program
DescriptionUtf8UpperCase(#$C9#$91#$C9#$91) can crash a program.
Steps To ReproduceSee attached demo
Additional InformationThis only seems to happen when:
1. HeapTrace is enabled (-gh)
2. The string contains codepoints that, when processed to their uppercase counterpart, make up more bytes than the original one: e.g the "upper" of #$C9#$91 = #$E2#$B#$AD
3. You need at least 2 of those codepoint in the string

Tried to get a backtrace, but got "no stack"...

C:\Users\Bart\LazarusProjecten\bugs\Utf8UC>gdb uc.exe
GNU gdb (GDB) 7.2
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "mingw32".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from C:\Users\Bart\LazarusProjecten\bugs\Utf8UC/uc.exe...done.
(gdb) run
Starting program: C:\Users\Bart\LazarusProjecten\bugs\Utf8UC/uc.exe
[New Thread 3760.0x334]
UTF8 seems valid
Marked memory at $00240D80 invalid
Wrong signature $50EA5AF2 instead of 61F7CA44
  $0040D6C8
  $004272EE TFORM1__BUTTON1CLICK, line 42 of main.pp
  $004F99B2 TCONTROL__CLICK, line 2718 of ./include/control.inc
  $00510C1F TBUTTONCONTROL__CLICK, line 56 of ./include/buttoncontrol.inc
  $0051120F TCUSTOMBUTTON__CLICK, line 175 of ./include/buttons.inc
  $00511861 TBUTTON__CLICK, line 355 of ./include/buttons.inc
  $00510B5A TBUTTONCONTROL__WMDEFAULTCLICKED, line 26 of ./include/buttoncontr
ol.inc
  $0040ADC6
  $004ED9FE TWINCONTROL__WNDPROC, line 5322 of ./include/wincontrol.inc
WARNING: TLCLComponent.Destroy with LCLRefCount>0. Hint: Maybe the component is
processing an event?
Heap dump by heaptrc unit
854 memory blocks allocated : 1543625/1545400
853 memory blocks freed : 1543612/1545384
1 unfreed memory blocks : 13
True heap size : 688128 (112 used in System startup)
True free heap : 687920
Should be : 687936
Marked memory at $00240D80 invalid
Wrong signature $50EA5AF2 instead of 61F7CA44
  $004141A7
  $0040D6C8
  $004272EE TFORM1__BUTTON1CLICK, line 42 of main.pp
  $004F99B2 TCONTROL__CLICK, line 2718 of ./include/control.inc
  $00510C1F TBUTTONCONTROL__CLICK, line 56 of ./include/buttoncontrol.inc
  $0051120F TCUSTOMBUTTON__CLICK, line 175 of ./include/buttons.inc
  $00511861 TBUTTON__CLICK, line 355 of ./include/buttons.inc
  $00510B5A TBUTTONCONTROL__WMDEFAULTCLICKED, line 26 of ./include/buttoncontr
ol.inc

Program exited with code 01.
(gdb) bt
No stack.
TagsNo tags attached.
Fixed in Revision
LazTarget-
Widgetset
Attached Files

Activities

2012-12-04 00:31

 

utf8uc.zip (2,283 bytes)

Bart Broersma

2012-12-04 12:20

developer   ~0064127

Last edited: 2012-12-04 12:21

Consider this piece of code:
        $C991:
        begin
          OutStr[OutCounter] := #$E2;
          OutStr[OutCounter+1]:= #$B1;
          OutStr[OutCounter+2]:= #$AD;
          NewCharLen := 3;
          CharProcessed := True;
        end;

At some point in time OutStr[OutCounter+2] may be beyond the length of Result (OutStr starts of as PChar(Result)), so we are writing in memory that we may possibly not own?
(I hate PChars, give me a Pascal string anytime and range-checking...)

I don't know if it is the proper solution, but reserving more memory for Result when the function starts seems to prevent this issue from happening:

begin
  // Start with the same string, and progressively modify
  Result:=AInStr;
  UniqueString(Result);
  //Worst case scenario ?
  SetLength(Result, 3*Length(AInStr)); <<----
  OutStr := PChar(Result);

The factor 3 is most certainly too much.
Single byte codepoints never become 3-byte codepoints because they are ASCII, so worst case is 2-byte -> 3-byte?
(I'm just not that familiar with UTF-8)

Bart Broersma

2012-12-04 12:23

developer   ~0064128

Last edited: 2012-12-04 12:24

Confirmed by Ocye on the Lazarus forum (http://forum.lazarus.freepascal.org/index.php/topic,19095.0.html).

Bart Broersma

2012-12-04 12:42

developer   ~0064131

Felipe: I'm assigning this to you, because I believe you did the most work on this. (If I'm mistaken, simply unassign.)

Felipe Monteiro de Carvalho

2012-12-04 14:12

developer   ~0064135

Please try rev 39437 and see if it fixes your issue.

Bart Broersma

2012-12-04 23:32

developer   ~0064151

I think it is fixed, at least the test case works.
I was just wondering. Your fix, in a worst case scenario, may call SetLength multiple times, but in best case not once. My fix calls SetLength always 1 time.
Speedwise does it matter?

I leave it up to you.

Felipe Monteiro de Carvalho

2012-12-05 10:37

developer   ~0064153

Yes, it matters a lot speed-wise. The most important is optimizing for strings which have no special chars which change size. And my solution wins in this case, which is by far the most important one.

A secondary consideration is indeed calling multiple times versus one time, but this is relevant to few strings and it is also even less relevant because the memory manager already optimizes for this, I think. I think it allocates more then what we ask, so many small increases rarely need multiple relocations.

Mattias Gaertner

2012-12-05 11:08

manager   ~0064154

The heap manager often allocates somewhat too much, because it rounds up to the next block size, but that is just a few bytes. An increase via SetLength aka Reallocmem checks if the space behind is allocated. If not, it simply increases the size - which is fast, if not it allocates a new block and copies the whole chunk - which is slow. There are some optimizations for small chunks, but they depend on platform, heap manager and may change in future.
So the best way is to allocate somewhat too much at the beginning, then increase exponentially and at the end decrease to the right size. As encoding conversions typically can only vary by a small factor and strings are small it is best to simply allocate the worst case at the beginning and then decrease at the end.

Issue History

Date Modified Username Field Change
2012-12-04 00:31 Bart Broersma New Issue
2012-12-04 00:31 Bart Broersma File Added: utf8uc.zip
2012-12-04 00:31 Bart Broersma LazTarget => -
2012-12-04 11:37 Bart Broersma Severity major => crash
2012-12-04 11:37 Bart Broersma Category LazReport => LazUtils
2012-12-04 12:20 Bart Broersma Note Added: 0064127
2012-12-04 12:21 Bart Broersma Note Edited: 0064127
2012-12-04 12:23 Bart Broersma Note Added: 0064128
2012-12-04 12:23 Bart Broersma Status new => confirmed
2012-12-04 12:24 Bart Broersma Note Edited: 0064128
2012-12-04 12:41 Bart Broersma Status confirmed => assigned
2012-12-04 12:41 Bart Broersma Assigned To => Felipe Monteiro de Carvalho
2012-12-04 12:42 Bart Broersma Note Added: 0064131
2012-12-04 14:12 Felipe Monteiro de Carvalho Note Added: 0064135
2012-12-04 14:13 Felipe Monteiro de Carvalho Status assigned => feedback
2012-12-04 23:32 Bart Broersma Note Added: 0064151
2012-12-05 10:37 Felipe Monteiro de Carvalho Status feedback => resolved
2012-12-05 10:37 Felipe Monteiro de Carvalho Fixed in Version => 1.1 (SVN)
2012-12-05 10:37 Felipe Monteiro de Carvalho Resolution open => fixed
2012-12-05 10:37 Felipe Monteiro de Carvalho Note Added: 0064153
2012-12-05 11:08 Mattias Gaertner Note Added: 0064154
2012-12-05 16:58 Bart Broersma Status resolved => closed