Bug in UTF8FindNearestCharStart
Original Reporter info from Mantis: Bart @flyingsheep
- Reporter name: Bart Broersma
Description:
UTF8FindNearestCharStart returns the wrong result when BytePos points to $B8 in the 3-byte sequence $E0 $B8 $9A (which appears to be a valid codepoint: THAI CHARACTER BO BAIMAI, U+0E1A, see: http://unicode.scarfboy.com/?s=U%2b0E1A).
It returns an index pointing to $B8, where it should point to $E0 instead.
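The expected behaviour can be sketched as a backward scan over continuation bytes. The Python below is only an illustration of what the report expects, not the LazUtf8 implementation; the names `expected_length` and `nearest_char_start` are hypothetical, and positions are 0-based like the NCS values in the output further down.

```python
def expected_length(lead: int) -> int:
    """Sequence length announced by a UTF-8 lead byte, 0 if not a valid lead."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0  # continuation byte or invalid lead byte


def nearest_char_start(data: bytes, byte_pos: int) -> int:
    """Return the 0-based index of the lead byte of the sequence covering byte_pos.

    Assumption of this sketch: if no plausible lead byte covers byte_pos,
    byte_pos itself (clamped to the buffer) is returned.
    """
    if not data:
        return 0
    pos = max(0, min(byte_pos, len(data) - 1))
    start = pos
    # Step back over continuation bytes (10xxxxxx), at most 3 of them.
    while start > 0 and pos - start < 3 and (data[start] & 0xC0) == 0x80:
        start -= 1
    # Accept the candidate only if its announced length reaches byte_pos.
    if start + expected_length(data[start]) > pos:
        return start
    return pos


sample = bytes([0xC3, 0xA4, 0xE0, 0xB8, 0x9A])
for i in range(len(sample)):
    ncs = nearest_char_start(sample, i)
    print(f"{i + 1}: NCS={ncs} B={sample[ncs]:02X}")
```

With this sketch, position 3 (the $B8 byte) maps back to index 2, the $E0 lead byte, which is the result the report expects.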
Steps to reproduce:
Unzip and build the attached sample.
(The sample project has more code than needed for this test, but it will just run the test demonstrating the problem.)
It outputs:
C:\Users\Bart\LazarusProjecten\ConsoleProjecten\bugs\comparestr>compare
Windows: using LazUtf8
$C3 $A4 $E0 $B8 $9A
1: NCS=0 B=C3
2: NCS=0 B=C3
3: NCS=2 B=E0
4: NCS=3 B=B8 Expected: E0
5: NCS=2 B=E0
Additional information:
I was looking for a similar function in LazUtf8 that would return the start of the codepoint only if the codepoint was valid.
The sample project has a function Utf8FindCodepointStart(...): Boolean that does just that.
Run the TestUtf8FindCodepointStart procedure to see the difference in behaviour (with a string that also has invalid codepoints):
C:\Users\Bart\LazarusProjecten\ConsoleProjecten\bugs\comparestr>compare
Windows: using LazUtf8
$C3 $A4 $E0 $B8 $9A $81 $F0
1 C3 TRUE B=C3 CL=2 Cur-S=0 | TRUE B=C3 CL=2 Idx=1 | NCS=0 B=C3
2 A4 TRUE B=C3 CL=2 Cur-S=0 | TRUE B=C3 CL=2 Idx=1 | NCS=0 B=C3
3 E0 TRUE B=E0 CL=3 Cur-S=2 | TRUE B=E0 CL=3 Idx=3 | NCS=2 B=E0
4 B8 TRUE B=E0 CL=3 Cur-S=2 | TRUE B=E0 CL=3 Idx=3 | NCS=3 B=B8
5 9A TRUE B=E0 CL=3 Cur-S=2 | TRUE B=E0 CL=3 Idx=3 | NCS=2 B=E0
6 81 FALSE | FALSE | NCS=5 B=81
7 F0 FALSE | FALSE | NCS=6 B=F0
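The validating variant described above could look roughly like the following. This is a hedged Python sketch of the idea, not the code from the attached sample project: `find_codepoint_start` is a hypothetical name, it returns `None` where the Pascal function would return FALSE, and it relies on Python's strict UTF-8 decoder to judge validity.

```python
from typing import Optional, Tuple


def find_codepoint_start(data: bytes, byte_pos: int) -> Optional[Tuple[int, int]]:
    """Return (start, length) of the valid codepoint covering byte_pos, else None.

    Both byte_pos and start are 0-based byte indices.
    """
    if not 0 <= byte_pos < len(data):
        return None
    # Candidates are byte_pos and up to three bytes before it.
    for start in range(byte_pos, max(byte_pos - 4, -1), -1):
        lead = data[start]
        if (lead & 0xC0) == 0x80:
            continue  # continuation byte, keep looking further back
        if lead < 0x80:
            length = 1
        elif lead >> 5 == 0b110:
            length = 2
        elif lead >> 4 == 0b1110:
            length = 3
        elif lead >> 3 == 0b11110:
            length = 4
        else:
            return None  # not a legal lead byte
        if start + length <= byte_pos:
            return None  # the sequence ends before byte_pos
        try:
            # Strict decoding rejects truncated, overlong and surrogate forms.
            data[start:start + length].decode("utf-8")
        except UnicodeDecodeError:
            return None
        return start, length
    return None  # only continuation bytes were found


sample = bytes([0xC3, 0xA4, 0xE0, 0xB8, 0x9A, 0x81, 0xF0])
for i, b in enumerate(sample):
    print(i + 1, f"{b:02X}", find_codepoint_start(sample, i))
```

For the seven-byte test string this sketch reports a start for the first five bytes and nothing for the stray $81 and the truncated $F0, in line with the TRUE/FALSE column above.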
Mantis conversion info:
- Mantis ID: 29851
- OS: Windows
- OS Build: Win7
- Build: r51965
- Platform: i386
- Version: 1.7 (SVN)
- Fixed in version: 1.6.2
- Fixed in revision: r51973 (#b192fb97)
- Target version: 1.6.2