View Issue Details

ID: 0031243
Project: Lazarus
Category: Other
View Status: public
Last Update: 2017-01-25 20:39
Reporter: Rick Wills
Assigned To: Bart Broersma
Priority: normal
Severity: minor
Reproducibility: always
Status: closed
Resolution: no change required
Platform: Windows
OS: WinXP Pro
Summary: 0031243: UnicodeToUTF8 is improperly handling unicode values that are above 32767
Description: Using RichMemo (a wrapper for RichEdit), I am attempting to use the NotoSansPhoenician-Regular font, whose code point range starts at 67840. Calling UnicodeToUTF8 with 67840 produces no response. Calling it with $10900 puts the character on the screen, but the cursor then advances erratically. Phoenician is a right-to-left script, yet the cursor moves to the right, then left, then right, then left, and so on.

RichMemo1.SelText:=UnicodeToUTF8($10900);

Thaddy (forum member) says that he has found a bug in the ustrings.inc unit. He says that "UnicodeToUTF8Inline in LazUTF8 is buggy and CAN'T handle that code point. UnicodeToUTF8 calls UnicodeToUTF8Inline... It can't handle high surrogate pairs. The UnicodeToUTF8 routines from FPC itself, in ustrings, are correct and CAN handle that codepoint. These can handle high surrogate pairs."

The current ustrings.inc unit has...

//lazutf8 snippet:
    $800..$ffff:
      begin
        Result:=3;
        Buf[0]:=char(byte($e0 or (CodePoint shr 12)));
        Buf[1]:=char(byte((CodePoint shr 6) and $3f) or $80);
        Buf[2]:=char(byte(CodePoint and $3f) or $80);
      end;
    $10000..$10ffff:
      begin
        Result:=4;
        Buf[0]:=char(byte($f0 or (CodePoint shr 18)));
        Buf[1]:=char(byte((CodePoint shr 12) and $3f) or $80);
        Buf[2]:=char(byte((CodePoint shr 6) and $3f) or $80);
        Buf[3]:=char(byte(CodePoint and $3f) or $80);
      end;
  else
    Result:=0;

It should be...

// ustrings.inc snippet:
             $800..$d7ff,$e000..$ffff:
                begin
                  if j+2>=MaxDestBytes then
                    break;
                  Dest[j]:=char($e0 or (lw shr 12));
                  Dest[j+1]:=char($80 or ((lw shr 6) and $3f));
                  Dest[j+2]:=char($80 or (lw and $3f));
                  inc(j,3);
                end;
              $d800..$dbff:
                {High Surrogates}
                begin
                  if j+3>=MaxDestBytes then
                    break;
                  if (i+1<sourcechars) and
                     (word(Source[i+1]) >= $dc00) and
                     (word(Source[i+1]) <= $dfff) then
                    begin
                      { $d7c0 is ($d800 - ($10000 shr 10)) }
                      lw:=(longword(lw-$d7c0) shl 10) + (ord(source[i+1]) xor $dc00);
                      Dest[j]:=char($f0 or (lw shr 18));
                      Dest[j+1]:=char($80 or ((lw shr 12) and $3f));
                      Dest[j+2]:=char($80 or ((lw shr 6) and $3f));
                      Dest[j+3]:=char($80 or (lw and $3f));
                      inc(j,4);
                      inc(i);
                    end;
                end;
              end;
            inc(i);
          end;
Additional Information: Related forum discussion:
http://forum.lazarus.freepascal.org/index.php/topic,35455.0.html
Tags: No tags attached.
Fixed in Revision:
LazTarget: -
Widgetset:
Attached Files:

Activities

Thaddy de Koning

2017-01-17 15:58

reporter   ~0097551

Last edited: 2017-01-17 16:13

The bug is the other way around as is obvious from my forum post.
The bug is in LazUTF8.UnicodeToUTF8Inline, not in UStrings.

So this is not an FPC issue, but a Lazarus issue; LazUTF8.UnicodeToUTF8 cannot handle 4-byte extended code points, as is obvious from the sources.

So can someone move it to Lazarus?

Mattias Gaertner

2017-01-17 16:06

manager   ~0097552

The LazUtf8 function UnicodeToUTF8 converts a single UTF32 code point to UTF8.
The System function UnicodeToUTF8 converts a UTF16 string to UTF8.

So, basically both functions work as documented, but have misleading names, although both are consistent within their units.
Maybe the LazUTF8 function can be renamed (with a deprecation warning).
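
For readers following the thread, the difference between the two functions can be sketched as below. This is an illustration in Python rather than Pascal, and it is not the LazUTF8 or System source: the names `code_point_to_utf8` and `utf16_units_to_utf8` are hypothetical, and dropping lone surrogates is an assumption based only on the ustrings.inc fragment quoted in the description.

```python
from typing import Sequence


def code_point_to_utf8(cp: int) -> bytes:
    """Encode ONE code point (a UTF-32 value) as UTF-8.

    Hypothetical stand-in for the LazUTF8-style function.
    """
    if cp < 0:
        return b""
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp <= 0x10FFFF:
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return b""  # out of range, like Result:=0 in the quoted snippet


def utf16_units_to_utf8(units: Sequence[int]) -> bytes:
    """Encode a sequence of UTF-16 code units as UTF-8.

    Hypothetical stand-in for the System-style function.
    """
    out = bytearray()
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
            if has_low:
                # Same arithmetic as the quoted snippet:
                # $d7c0 is ($d800 - ($10000 shr 10)).
                cp = ((u - 0xD7C0) << 10) + (units[i + 1] ^ 0xDC00)
                out += code_point_to_utf8(cp)
                i += 1  # consume the low surrogate as well
            # Assumption: a lone high surrogate is dropped.
        elif 0xDC00 <= u <= 0xDFFF:
            pass  # Assumption: a lone low surrogate is dropped.
        else:
            out += code_point_to_utf8(u)
        i += 1
    return bytes(out)
```

With these definitions, `code_point_to_utf8(0x10900)` and `utf16_units_to_utf8([0xD802, 0xDD00])` describe the same character through two different input encodings, which is why a surrogate-pair branch is needed only in the second function.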

Thaddy de Koning

2017-01-17 16:21

reporter   ~0097553

Last edited: 2017-01-17 16:25

No. It seems LazUTF8.UnicodeToUTF8Inline fails on a codepoint higher than MaxWord=65535, so 67840 fails. ustrings.UnicodeToUTF8 passes for 67840.

But I guess it is Lazarus anyway, since they need to deprecate it.

I don't mind if they call the legacy function Unicode32ToUTF8.
Since UnicodeString is UTF-16 in FPC, that makes sense.

Mattias Gaertner

2017-01-17 16:32

manager   ~0097554

LazUTF8.UnicodeToUTF8Inline works here for 0..$fffff. Our test suite runs as well.

Did you only test in RichMemo or did you test the function directly?
Maybe the problem is in RichMemo?

Bart Broersma

2017-01-18 18:57

developer   ~0097579

From: http://www.utf8-chartable.de/unicode-utf8-table.pl
U+10900 𐤀 f0 90 a4 80 PHOENICIAN LETTER ALF

LazUtf8.UnicodeToUTF8($10900) gives: Length: 4 : $F0 $90 $A4 $80

So, the conversion seems correct to me.
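
The byte sequence can also be checked by hand. The following is a small Python sketch (not the LazUtf8 code) that applies the same shifts and masks as the 4-byte branch quoted in the description to U+10900, then compares the result with Python's built-in encoder as an independent reference.

```python
cp = 0x10900  # PHOENICIAN LETTER ALF, decimal 67840

b0 = 0xF0 | (cp >> 18)           # lead byte: bits 18..20 of the code point
b1 = 0x80 | ((cp >> 12) & 0x3F)  # continuation byte: bits 12..17
b2 = 0x80 | ((cp >> 6) & 0x3F)   # continuation byte: bits 6..11
b3 = 0x80 | (cp & 0x3F)          # continuation byte: bits 0..5

encoded = bytes([b0, b1, b2, b3])
print(" ".join(f"{b:02x}" for b in encoded))  # prints "f0 90 a4 80"

# Independent cross-check against Python's own UTF-8 encoder.
assert encoded == chr(cp).encode("utf-8")
```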

Bart Broersma

2017-01-21 19:59

developer   ~0097623

Last edited: 2017-01-21 20:02

I have compared the output of LazUtf8.UnicodeToUtf8() for codepoints U+$D800 to U+$E3FF (this includes the high surrogate pairs) with the tables on http://www.utf8-chartable.de/unicode-utf8-table.pl and I can see no errors in this range.
See attached sample application: unicodetoutf8.zip

Note that this table does not specify UTF8 sequences for U+$E400 to U+$F900, where LazUtf8.UnicodeToUtf8() does output something.

Bart Broersma

2017-01-22 01:19

developer   ~0097626

Last edited: 2017-01-22 01:35

FWIW: I also tested range U+$10900 to U+$10CFF, which includes all Phoenician codepoints, and I can see no errors in there.
And, for amusement, range U+$100000 to U+$1003FF (the last 1024 codepoints according to the reference website), also without errors.
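
A range comparison of this kind can be sketched as follows. This is not the attached sample project: it is a Python approximation in which `encode_code_point` is a hypothetical re-implementation of the shift-and-mask logic quoted in the description, and Python's built-in encoder stands in for the reference table. The `surrogatepass` error handler is used so that the reference also emits 3-byte sequences for U+D800 to U+DFFF, which is an assumption about how the 3-byte branch treats that range.

```python
def encode_code_point(cp: int) -> bytes:
    """Hypothetical shift-and-mask UTF-8 encoder for a single code point."""
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def mismatches(first: int, last: int) -> list:
    """Return every code point in [first, last] where the two encoders differ."""
    bad = []
    for cp in range(first, last + 1):
        expected = chr(cp).encode("utf-8", "surrogatepass")
        if encode_code_point(cp) != expected:
            bad.append(cp)
    return bad


# The three ranges mentioned in the notes above.
for first, last in [(0xD800, 0xE3FF), (0x10900, 0x10CFF), (0x100000, 0x1003FF)]:
    print(f"U+{first:X}..U+{last:X}: {len(mismatches(first, last))} mismatches")
```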

Updated sample project.

Bart Broersma

2017-01-22 01:35

developer  

unicodetoutf8.zip (89,646 bytes)

Bart Broersma

2017-01-22 01:39

developer   ~0097627

Please provide a codepoint for which the function LazUtf8.UnicodeToUTF8(CodePoint: cardinal): string; gives a wrong result.

State what it produces (the byte sequence), and what it should be.
For the latter also state your reference.

If no example is provided, this issue will be resolved as "no change required".

Rick Wills

2017-01-22 17:03

reporter   ~0097632

Last edited: 2017-01-22 17:10

Bart Broersma,

Thank you for investigating. I had presented two issues.

One was for the cursor not advancing properly. That was bogus, and due to an earlier attempt to manually advance the cursor when it was not advancing at all... another issue with my own code. I removed my manual code and it advanced properly. My code was fighting with it. My apology. I had gone through so many machinations that I had forgotten that I had placed the code within the function.

The other issue was that I could not use decimal values with the UnicodeToUTF8 function. It wants cardinal values, and Phoenician Unicode starts at values above that range. I was, however, able to enter hexadecimal values for the same, and it operated properly... until I saved and reloaded the RTF file. At that point RichEdit activated its Font Binding procedures. The result is Font Swapping. A large portion of the English text was reassigned as Hebrew text. RichEdit thinks that Hebrew includes English, and it likes Hebrew fonts better than English ones (never mind what the user wants). But its action is a result of choking on the Phoenician index.

I have not gotten back to you because I had assumed that the report was being closed... due to Mattias Gaertner's statement that it was likely RichMemo's problem, and not the UnicodeToUTF8 functionality.

I do agree with him at this point, though not that it is RichMemo itself, but is on account of the RichEdit driver that RichMemo relies on. RichEdit is choking on the high range Phoenician index.

Bart Broersma

2017-01-22 17:39

developer   ~0097633

> It wants cardinal values, and Phoenician Unicode starts at values above that range.

That makes no sense to me whatsoever.

High(Cardinal) = 4294967295 = $FFFFFFFF
Phoenician starts at $10900, which (at least in my time) is clearly less than $FFFFFFFF.

> I was, however, able to enter Hexadecimal values for the same

This makes even less sense.
To the compiler there is no difference between hexadecimal and decimal numbers.
They're just numbers.
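
The two points above can be illustrated with a few comparisons. This is a Python sketch rather than Pascal; the constant names are hypothetical, and the 16-bit limits are included only for contrast with the 32-bit limit quoted above.

```python
HIGH_CARDINAL = 0xFFFFFFFF  # 4294967295, the High(Cardinal) value quoted above
PHOENICIAN_START = 67840    # the decimal value from the report

# Decimal and hexadecimal literals denote the same number.
same_number = PHOENICIAN_START == 0x10900

# The value fits comfortably in an unsigned 32-bit range ...
fits_in_cardinal = 0 <= PHOENICIAN_START <= HIGH_CARDINAL

# ... but not in a signed or unsigned 16-bit range.
fits_in_signed_16 = PHOENICIAN_START <= 32767
fits_in_unsigned_16 = PHOENICIAN_START <= 0xFFFF

print(same_number, fits_in_cardinal, fits_in_signed_16, fits_in_unsigned_16)
```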

Anyhow, this has nothing to do with LazUtf8.UnicodeToUtf8() function.
Resolving as "no change required".

Please close.

Issue History

Date Modified     Username          Field                           Change
2017-01-17 14:49  Rick Wills        New Issue
2017-01-17 15:58  Thaddy de Koning  Note Added: 0097551
2017-01-17 16:06  Mattias Gaertner  Note Added: 0097552
2017-01-17 16:13  Thaddy de Koning  Note Edited: 0097551
2017-01-17 16:21  Thaddy de Koning  Note Added: 0097553
2017-01-17 16:24  Thaddy de Koning  Note Edited: 0097553
2017-01-17 16:25  Thaddy de Koning  Note Edited: 0097553
2017-01-17 16:32  Mattias Gaertner  Note Added: 0097554
2017-01-17 22:03  Jonas Maebe       Project                         FPC => Lazarus
2017-01-18 18:57  Bart Broersma     Note Added: 0097579
2017-01-21 19:59  Bart Broersma     Note Added: 0097623
2017-01-21 20:01  Bart Broersma     File Added: unicodetoutf8.zip
2017-01-21 20:02  Bart Broersma     Note Edited: 0097623
2017-01-22 01:19  Bart Broersma     Note Added: 0097626
2017-01-22 01:34  Bart Broersma     Note Edited: 0097626
2017-01-22 01:35  Bart Broersma     File Deleted: unicodetoutf8.zip
2017-01-22 01:35  Bart Broersma     File Added: unicodetoutf8.zip
2017-01-22 01:35  Bart Broersma     Note Edited: 0097626
2017-01-22 01:39  Bart Broersma     LazTarget                       => -
2017-01-22 01:39  Bart Broersma     Note Added: 0097627
2017-01-22 01:39  Bart Broersma     Assigned To                     => Bart Broersma
2017-01-22 01:39  Bart Broersma     Status                          new => feedback
2017-01-22 01:50  Bart Broersma     Product Version                 2.6.4 =>
2017-01-22 01:50  Bart Broersma     Additional Information Updated
2017-01-22 17:03  Rick Wills        Note Added: 0097632
2017-01-22 17:03  Rick Wills        Status                          feedback => assigned
2017-01-22 17:07  Rick Wills        Note Edited: 0097632
2017-01-22 17:10  Rick Wills        Note Edited: 0097632
2017-01-22 17:39  Bart Broersma     Note Added: 0097633
2017-01-22 17:39  Bart Broersma     Status                          assigned => resolved
2017-01-22 17:39  Bart Broersma     Resolution                      open => no change required
2017-01-25 20:39  Rick Wills        Status                          resolved => closed