UTF8ToUnicode destroys the string when it encounters an invalid sequence
Original Reporter info from Mantis: JoshyFun @joshyfun
- Reporter name: José Mejuto
Description:
The UTF8ToUnicode procedure in 'wustrings.inc' always returns an empty (blanked) string when the source UTF-8 string contains an invalid sequence. I have checked the current trunk code and the same routine is there, so this problem should be present in all FPC RTL versions, not only 2.2.0.
Additional information:
Attached are a new implementation and the well-known UTF-8 stress test:
http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
which this implementation passes (for 2-byte Unicode characters). The implementation can easily be extended to 4-, 5- and 6-byte UTF-8 sequences, but since those cannot fit in two bytes they are handled as invalid UTF-8 sequences.
The implementation also handles the LF to CR+LF conversion (as specified in UTF8), but this has been disabled to preserve compatibility with the current code.
It has been tested on WinXP with FPC 2.2.0 and 2.2.2.
The code contains a constant that is used as the marker for an invalid UTF-8 sequence.
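For readers without the attachment, the requested behaviour can be sketched roughly as follows. This is a Python illustration only, not the attached Pascal implementation: the names `utf8_to_ucs2` and `INVALID_MARK`, the choice of '?' as the marker, and the policy of emitting one marker per malformed sequence are all assumptions made for the example.

```python
# Hypothetical marker value; the attached patch defines its own constant.
INVALID_MARK = 0x003F  # '?'

# Smallest code point that legitimately needs 1 or 2 continuation bytes;
# anything below that is an overlong encoding.
_MIN_CODE_POINT = {1: 0x80, 2: 0x800}


def utf8_to_ucs2(data, invalid_mark=INVALID_MARK):
    """Decode UTF-8 bytes into a list of 16-bit code units.

    Invalid input is replaced by invalid_mark instead of discarding
    the whole result.
    """
    out = []
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        i += 1
        if lead < 0x80:                    # plain ASCII
            out.append(lead)
            continue
        if lead < 0xC0 or lead > 0xFD:     # stray continuation byte, or FE/FF
            out.append(invalid_mark)
            continue
        # Number of continuation bytes announced by the lead byte.
        if lead < 0xE0:
            need = 1
        elif lead < 0xF0:
            need = 2
        elif lead < 0xF8:
            need = 3
        elif lead < 0xFC:
            need = 4
        else:
            need = 5
        cp = lead & (0x3F >> need)         # payload bits of the lead byte
        got = 0
        while got < need and i < n and (data[i] & 0xC0) == 0x80:
            cp = (cp << 6) | (data[i] & 0x3F)
            i += 1
            got += 1
        valid = (
            got == need                        # sequence is complete
            and need <= 2                      # result fits in 16 bits
            and cp >= _MIN_CODE_POINT[need]    # not an overlong form
            and not 0xD800 <= cp <= 0xDFFF     # not an encoded surrogate
        )
        out.append(cp if valid else invalid_mark)
    return out
```

With this sketch, `utf8_to_ucs2(b"a\xffb")` returns `[0x61, 0x3F, 0x62]`: the valid characters around the bad byte survive, which is the behaviour this report asks for in place of an empty result.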