Bug in UTF8FindNearestCharStart

Original Reporter info from Mantis: Bart @flyingsheep

Reporter name: Bart Broersma

Description:

UTF8FindNearestCharStart returns a wrong result if BytePos points to $B8 in the 3-byte sequence $E0 $B8 $9A (which appears to be a valid codepoint: THAI CHARACTER BO BAIMAI, U+0E1A, see: http://unicode.scarfboy.com/?s=U%2b0E1A).

It returns an index pointing to $B8, where it should point to $E0 instead.
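For illustration, the expected behaviour can be sketched in Python as below. This is not the LazUtf8 implementation; the names utf8_lead_length and find_nearest_char_start are hypothetical, indices are 0-based like the NCS values in the output, and the rule "walk back over at most 3 continuation bytes, then accept the lead byte only if its declared length covers the position" is an assumption about what "nearest char start" should mean.

```python
def utf8_lead_length(b: int) -> int:
    """Return the sequence length declared by lead byte b, or 0 if b is not a lead byte."""
    if b < 0x80:
        return 1
    if 0xC0 <= b <= 0xDF:
        return 2
    if 0xE0 <= b <= 0xEF:
        return 3
    if 0xF0 <= b <= 0xF7:
        return 4
    return 0  # continuation byte ($80..$BF) or invalid lead ($F8..$FF)


def find_nearest_char_start(data: bytes, byte_pos: int) -> int:
    """Hypothetical helper: 0-based index of the lead byte of the sequence
    containing byte_pos, or byte_pos itself if no lead byte covers it."""
    if not 0 <= byte_pos < len(data):
        raise IndexError("byte_pos out of range")
    start = byte_pos
    # Walk back over at most 3 continuation bytes (10xxxxxx).
    while start > 0 and byte_pos - start < 3 and (data[start] & 0xC0) == 0x80:
        start -= 1
    # Accept the candidate only if its declared length reaches byte_pos.
    if start + utf8_lead_length(data[start]) > byte_pos:
        return start
    return byte_pos


sample = bytes([0xC3, 0xA4, 0xE0, 0xB8, 0x9A])
starts = [find_nearest_char_start(sample, i) for i in range(len(sample))]
```

With the sample bytes from this report, position 3 ($B8) maps to index 2 ($E0), which is the result the report expects.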

Steps to reproduce:

Unzip and build attached sample.
(The sample project has more code than needed for this test, but it will just run the test demonstrating the problem.)

It outputs:
C:\Users\Bart\LazarusProjecten\ConsoleProjecten\bugs\comparestr>compare
Windows: using LazUtf8
$C3 $A4 $E0 $B8 $9A
1: NCS=0 B=C3
2: NCS=0 B=C3
3: NCS=2 B=E0
4: NCS=3 B=B8 Expected: E0
5: NCS=2 B=E0

Additional information:

I was looking for a similar function in LazUtf8 that would return the start of the codepoint, but only if the codepoint was valid.
The sample project has a function Utf8FindCodepointStart(...): Boolean that does just that.

Run the TestUtf8FindCodepointStart procedure to see the difference in behaviour (with a string that also has invalid codepoints):

C:\Users\Bart\LazarusProjecten\ConsoleProjecten\bugs\comparestr>compare
Windows: using LazUtf8
$C3 $A4 $E0 $B8 $9A $81 $F0
 1 C3 TRUE B=C3 CL=2 Cur-S=0  |  TRUE B=C3 CL=2 Idx=1  |   NCS=0 B=C3
 2 A4 TRUE B=C3 CL=2 Cur-S=0  |  TRUE B=C3 CL=2 Idx=1  |   NCS=0 B=C3
 3 E0 TRUE B=E0 CL=3 Cur-S=2  |  TRUE B=E0 CL=3 Idx=3  |   NCS=2 B=E0
 4 B8 TRUE B=E0 CL=3 Cur-S=2  |  TRUE B=E0 CL=3 Idx=3  |   NCS=3 B=B8
 5 9A TRUE B=E0 CL=3 Cur-S=2  |  TRUE B=E0 CL=3 Idx=3  |   NCS=2 B=E0
 6 81 FALSE   |  FALSE   |   NCS=5 B=81
 7 F0 FALSE   |  FALSE   |   NCS=6 B=F0
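
The idea of a validating lookup can be sketched in Python as follows. This is not the Utf8FindCodepointStart from the attached sample, whose source is not shown here; the name find_codepoint_start is hypothetical, it returns None where the sample function would return False, and it relies on Python's strict UTF-8 decoder for validation, which is an assumption about what "valid" should cover (it rejects truncated sequences, stray continuation bytes, overlong forms and surrogates).

```python
from typing import Optional


def find_codepoint_start(data: bytes, byte_pos: int) -> Optional[int]:
    """Hypothetical helper: 0-based start of the valid code point that
    contains byte_pos, or None if byte_pos is not inside a valid one."""
    if not 0 <= byte_pos < len(data):
        return None
    # Look at byte_pos and at most 3 bytes before it.
    for start in range(byte_pos, max(byte_pos - 3, 0) - 1, -1):
        b = data[start]
        if (b & 0xC0) == 0x80:
            continue  # continuation byte, keep walking back
        # First non-continuation byte: read its declared length.
        if b < 0x80:
            length = 1
        elif b >= 0xF0:
            length = 4
        elif b >= 0xE0:
            length = 3
        else:
            length = 2
        end = start + length
        # The sequence must cover byte_pos and fit in the buffer.
        if end <= byte_pos or end > len(data):
            return None
        try:
            data[start:end].decode("utf-8")  # strict validation
        except UnicodeDecodeError:
            return None
        return start
    return None  # only continuation bytes were found


sample = bytes([0xC3, 0xA4, 0xE0, 0xB8, 0x9A, 0x81, 0xF0])
results = [find_codepoint_start(sample, i) for i in range(len(sample))]
```

For the 7-byte string above, the first five positions resolve to starts 0, 0, 2, 2, 2 (matching the Cur-S column), while the stray $81 and the truncated $F0 yield None, corresponding to the FALSE rows.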

Mantis conversion info:

Mantis ID: 29851
OS: Windows
OS Build: Win7
Build: r51965
Platform: i386
Version: 1.7 (SVN)
Fixed in version: 1.6.2
Fixed in revision: r51973 (#b192fb97)
Target version: 1.6.2