UTF8ToUnicode destroys the string when it encounters an invalid sequence

Original Reporter info from Mantis: JoshyFun @joshyfun

Reporter name: José Mejuto

Description:

The UTF8ToUnicode procedure in 'wustrings.inc' always returns a NULL/blank string when the source UTF-8 string contains an invalid sequence. I have checked the current trunk code and the same routine is there, so this problem should be present in all FPC RTL versions, not only 2.2.0.

Additional information:

Attached are a new implementation and the well-known UTF-8 stress test:

http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt

which this implementation passes (for 2-byte Unicode characters). The implementation can easily be extended to 4-, 5-, and 6-byte UTF-8 sequences, but since those cannot fit in two bytes, they are handled as invalid UTF-8 sequences.
The implementation also handles LF to CR+LF conversion (as specified in UTF8), but this has been disabled to preserve compatibility with the current code.

It has been tested on WinXP with FPC 2.2.0 and 2.2.2.

A constant in the code defines the mark used for an invalid UTF-8 sequence.
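
To illustrate the behaviour described above, here is a minimal sketch in Python. It is not the attached Pascal implementation. The function name utf8_to_ucs2_lenient, the constant INVALID_MARK and the choice of '?' as the mark are all assumptions made for this example. The sketch replaces each invalid sequence with the mark instead of blanking the whole result, and treats sequences of 4 or more bytes as invalid because they do not fit in two bytes.

```python
# Hypothetical mark for invalid input; the attached code defines its own constant.
INVALID_MARK = "?"


def utf8_to_ucs2_lenient(data: bytes, mark: str = INVALID_MARK) -> str:
    """Decode UTF-8 bytes, substituting `mark` for every invalid sequence.

    Only code points that fit in two bytes are accepted; longer sequences
    are consumed as a unit and replaced by a single mark.
    """
    out = []
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        i += 1
        if lead < 0x80:
            # Plain ASCII byte.
            out.append(chr(lead))
            continue
        if 0xC0 <= lead <= 0xDF:
            need, cp, min_cp = 1, lead & 0x1F, 0x80
        elif 0xE0 <= lead <= 0xEF:
            need, cp, min_cp = 2, lead & 0x0F, 0x800
        elif 0xF0 <= lead <= 0xFD:
            # 4-, 5- and 6-byte forms: skipped as a unit, never accepted.
            need = 3 if lead <= 0xF7 else 4 if lead <= 0xFB else 5
            cp, min_cp = 0, None
        else:
            # Stray continuation byte, or 0xFE / 0xFF.
            out.append(mark)
            continue
        got = 0
        while got < need and i < n and (data[i] & 0xC0) == 0x80:
            cp = (cp << 6) | (data[i] & 0x3F)
            i += 1
            got += 1
        ok = (
            got == need                       # sequence is not truncated
            and min_cp is not None            # sequence fits in two bytes
            and cp >= min_cp                  # not an overlong encoding
            and not 0xD800 <= cp <= 0xDFFF    # not a surrogate
        )
        out.append(chr(cp) if ok else mark)
    return "".join(out)
```

With this approach, a call such as utf8_to_ucs2_lenient(b"a\xffb") keeps the valid characters around the bad byte rather than returning an empty string.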

Mantis conversion info:

Mantis ID: 11791
Version: 2.2.0
Fixed in version: 2.4.0
Fixed in revision: 12902 (#d67dbcf0)
Monitored by: » @joshyfun (José Mejuto)
Target version: 2.4.0
