UnicodeToUTF8 improperly handles Unicode values above 32767
Original Reporter info from Mantis: rick2691
- Reporter name: Rick Wills
Description:
Using RichMemo (a wrapper around RichEdit), I am trying to use the NotoSansPhoenician-Regular font, whose code point range starts at 67840. Calling UnicodeToUTF8 with 67840 produces no output. Calling it with $10900 displays the character on the screen, but the cursor then advances erratically. Phoenician is a right-to-left script, yet the cursor moves to the right, then left, then right, then left, and so on.
RichMemo1.SelText:=UnicodeToUTF8($10900);
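For reference, here is a minimal sketch of the byte sequence that a correct conversion of this code point should produce. EncodeCodePoint and TUtf8Buf are hypothetical names written for this report, not the LazUTF8 or RTL routines. The program only assumes a plain FPC installation with SysUtils. It also checks that the decimal literal 67840 and the hexadecimal literal $10900 denote the same value.

```pascal
program Utf8Expected;
{$mode objfpc}{$H+}

uses
  SysUtils;

type
  TUtf8Buf = array[0..3] of Byte;

{ Hypothetical helper for illustration; not the LazUTF8 or RTL routine.
  Writes the UTF-8 encoding of CodePoint to Buf and returns the byte count,
  or 0 for surrogate values and values above $10FFFF. }
function EncodeCodePoint(CodePoint: Cardinal; out Buf: TUtf8Buf): Integer;
begin
  Buf[0] := 0; Buf[1] := 0; Buf[2] := 0; Buf[3] := 0;
  case CodePoint of
    $0..$7F:
      begin
        Result := 1;
        Buf[0] := CodePoint;
      end;
    $80..$7FF:
      begin
        Result := 2;
        Buf[0] := $C0 or (CodePoint shr 6);
        Buf[1] := $80 or (CodePoint and $3F);
      end;
    $800..$D7FF, $E000..$FFFF:
      begin
        Result := 3;
        Buf[0] := $E0 or (CodePoint shr 12);
        Buf[1] := $80 or ((CodePoint shr 6) and $3F);
        Buf[2] := $80 or (CodePoint and $3F);
      end;
    $10000..$10FFFF:
      begin
        Result := 4;
        Buf[0] := $F0 or (CodePoint shr 18);
        Buf[1] := $80 or ((CodePoint shr 12) and $3F);
        Buf[2] := $80 or ((CodePoint shr 6) and $3F);
        Buf[3] := $80 or (CodePoint and $3F);
      end;
  else
    Result := 0;
  end;
end;

var
  Buf: TUtf8Buf;
  Len, I: Integer;
  Hex: string;
begin
  // The decimal and hexadecimal literals denote the same code point.
  WriteLn(67840 = $10900);          // prints TRUE

  Len := EncodeCodePoint($10900, Buf);
  Hex := '';
  for I := 0 to Len - 1 do
    Hex := Hex + IntToHex(Buf[I], 2);
  WriteLn(Len, ' bytes: ', Hex);    // prints 4 bytes: F090A480
end.
```

If the string passed to RichMemo contains the bytes F0 90 A4 80, the conversion itself produced the expected UTF-8, and the cursor behaviour would have to come from somewhere else.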
Thaddy (a forum member) says that he has found a bug in the LazUTF8 unit. He says that "UnicodeToUTF8Inline in LazUTF8 is buggy and CAN'T handle that code point. UnicodeToUTF8 calls UnicodeToUTF8Inline... It can't handle high surrogate pairs. The UnicodeToUTF8 routines from FPC itself, in ustrings, are correct and CAN handle that codepoint. These can handle high surrogate pairs."
The current LazUTF8 unit has...
//lazutf8 snippet:
    $800..$ffff:
      begin
        Result:=3;
        Buf[0]:=char(byte($e0 or (CodePoint shr 12)));
        Buf[1]:=char(byte((CodePoint shr 6) and $3f) or $80);
        Buf[2]:=char(byte(CodePoint and $3f) or $80);
      end;
    $10000..$10ffff:
      begin
        Result:=4;
        Buf[0]:=char(byte($f0 or (CodePoint shr 18)));
        Buf[1]:=char(byte((CodePoint shr 12) and $3f) or $80);
        Buf[2]:=char(byte((CodePoint shr 6) and $3f) or $80);
        Buf[3]:=char(byte(CodePoint and $3f) or $80);
      end;
  else
    Result:=0;
It should be...
// ustrings.inc snippet:
    $800..$d7ff,$e000..$ffff:
      begin
        if j+2>=MaxDestBytes then
          break;
        Dest[j]:=char($e0 or (lw shr 12));
        Dest[j+1]:=char($80 or ((lw shr 6) and $3f));
        Dest[j+2]:=char($80 or (lw and $3f));
        inc(j,3);
      end;
    $d800..$dbff:
      {High Surrogates}
      begin
        if j+3>=MaxDestBytes then
          break;
        if (i+1<sourcechars) and
           (word(Source[i+1]) >= $dc00) and
           (word(Source[i+1]) <= $dfff) then
          begin
            { $d7c0 is ($d800 - ($10000 shr 10)) }
            lw:=(longword(lw-$d7c0) shl 10) + (ord(source[i+1]) xor $dc00);
            Dest[j]:=char($f0 or (lw shr 18));
            Dest[j+1]:=char($80 or ((lw shr 12) and $3f));
            Dest[j+2]:=char($80 or ((lw shr 6) and $3f));
            Dest[j+3]:=char($80 or (lw and $3f));
            inc(j,4);
            inc(i);
          end;
      end;
  end;
  inc(i);
end;
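To make the surrogate arithmetic in the excerpt above concrete, the following sketch splits $10900 into its UTF-16 surrogate pair and recombines it with the same expression the excerpt uses ($d7c0 offset, shl 10, xor $dc00). The program and its variable names are illustrative assumptions, not code from either unit.

```pascal
program SurrogateDemo;
{$mode objfpc}{$H+}

uses
  SysUtils;

var
  CodePoint, HighSur, LowSur, Combined: LongInt;
begin
  CodePoint := $10900;

  // Split: subtract $10000, then the upper 10 bits go into the high
  // surrogate and the lower 10 bits into the low surrogate.
  HighSur := $D800 + ((CodePoint - $10000) shr 10);
  LowSur  := $DC00 + ((CodePoint - $10000) and $3FF);
  WriteLn(IntToHex(HighSur, 4), ' ', IntToHex(LowSur, 4));  // prints D802 DD00

  // Recombine using the same arithmetic as the ustrings.inc excerpt:
  // $D7C0 is $D800 - ($10000 shr 10).
  Combined := ((HighSur - $D7C0) shl 10) + (LowSur xor $DC00);
  WriteLn(IntToHex(Combined, 5));                           // prints 10900
end.
```

The surrogate branch applies when the input is a UTF-16 string, where $10900 arrives as the two code units D802 and DD00. A routine that takes a single code point as its argument receives $10900 directly and does not see surrogates.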
Additional information:
Related forum discussion:
http://forum.lazarus.freepascal.org/index.php/topic,35455.0.html
Mantis conversion info:
- Mantis ID: 31243
- OS: WinXP Pro
- OS Build: SP3
- Build: 49931
- Platform: Windows