View Issue Details
ID | Project | Category | View Status | Date Submitted | Last Update |
---|---|---|---|---|---|
0030622 | FPC | RTL | public | 2016-09-20 18:04 | 2017-01-11 15:02 |
Reporter | Tony Whyman | Assigned To | Jonas Maebe | ||
Priority | normal | Severity | minor | Reproducibility | always |
Status | assigned | Resolution | reopened | ||
Platform | Windows x64 | OS | Windows 7 | ||
Product Version | 3.0.0 | ||||
Summary | 0030622: SetCodePage transliterates to CP_NONE by deleting the string | ||||
Description | This seems to be a Windows only problem. If you call e.g. SetCodePage(s,CP_NONE,true); Then s always ends up empty. On Linux this works fine and "s" is unchanged. | ||||
Steps To Reproduce | Example Lazarus program attached. Just compile, run and enter some string as the test string. If the CP None checkbox is selected then "convert" gives an empty output. If WIN1252 checkbox is selected then the string is copied to the output. | ||||
Tags | No tags attached. | ||||
Fixed in Revision | |||||
FPCOldBugId | |||||
FPCTarget | |||||
Attached Files |
|
related to | 0025332 | closed | Jonas Maebe | RawByteStrings Drops data |
has duplicate | 0031200 | resolved | Jonas Maebe | Format '%'s returns blank when string has a codepage of CP_NONE |
|
|
|
Please attach a sample project that does NOT depend on Lazarus. 1. Not all fpc devels have Lazarus installed. 2. The bug may be in Lazarus and not fpc |
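A Lazarus-free reproduction could look roughly like the console program below. This is a minimal sketch, not the reporter's attached test: the program name and the test string are made up, and it only prints what SetCodePage leaves behind, so the reported difference between Windows and Linux would show up in the output.

```pascal
program cpnonetest;
{$mode objfpc}{$H+}

var
  s: RawByteString;
begin
  s := 'test';
  // Convert=True together with CP_NONE is the combination from the report.
  SetCodePage(s, CP_NONE, True);
  // The report says the string ends up empty on Windows and unchanged on Linux.
  Writeln('Length after SetCodePage: ', Length(s));
  Writeln('Dynamic code page: ', StringCodePage(s));
end.
```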
|
From http://wiki.freepascal.org/FPC_Unicode_support#Code_page_identifiers : "CP_NONE: this value indicates that no code page information has been associated with the string data. The result of any operation on a string that has this dynamic code page is undefined." |
|
I don't believe that "undefined results" should ever be used as an excuse for not checking a function parameter correctly. An undefined outcome is totally correct when there are race conditions or uninitialised variables involved. But when a function argument is inconsistent with the rest of the arguments or there is no clear action to be performed then at the very least an exception should be raised. In this case, we seem to have the worst of all possible outcomes with totally consistent results on both Linux and Windows - except that there is a different outcome on each platform, with Linux treating this case as a "no-op" and Windows petulantly trashing the string. Transliteration to CP_NONE does not make sense. Agreed. In the Linux world, simply ignoring the "convert" flag in this case makes sense and the call to SetCodePage with convert=true (this is also the default) can be interpreted as "set the code page to 'x' and transliterate if necessary". On the other hand, the Windows implementation seems to be behaving like a troll and silently trashing the string if you put a foot wrong, i.e. set CP_NONE with convert=true. Claiming "undefined results" is not correct in this case because a predictable outcome is perfectly possible. That should be either an exception or silently ignoring convert=true when the codepage is set to CP_NONE. |
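The "raise an exception" option from the note above can be sketched in user code as a thin wrapper around SetCodePage. The wrapper name SetCodePageStrict and the error message are hypothetical; nothing like this exists in the RTL, and this is only one way the check might be expressed.

```pascal
program strictsetcp;
{$mode objfpc}{$H+}

uses
  SysUtils;

// Hypothetical wrapper (not part of the RTL): refuses the one combination
// that has no clear meaning, i.e. transliterating to CP_NONE.
procedure SetCodePageStrict(var s: RawByteString; cp: TSystemCodePage;
  Convert: Boolean = True);
begin
  if Convert and (cp = CP_NONE) then
    raise Exception.Create('Cannot transliterate a string to CP_NONE');
  SetCodePage(s, cp, Convert);
end;

var
  s: RawByteString;
begin
  s := 'test';
  try
    SetCodePageStrict(s, CP_NONE, True);
  except
    on E: Exception do
      Writeln('Rejected: ', E.Message);
  end;
  // The wrapper raised before SetCodePage was called, so s was never modified.
  Writeln('Length: ', Length(s));
end.
```

Because the check happens before the RTL call, the behaviour of the wrapper does not depend on how the operating system's conversion routine treats CP_NONE.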
|
I also couldn't resist pointing out that on the same wiki page it says "As mentioned earlier, the results of operations on strings with the CP_NONE code page are undefined. As it does not make sense to define a type in the RTL whose behaviour is undefined, the behaviour of RawByteString is somewhat different than that of other AnsiString(X) types." and then goes on to define how a RawByteString with type CP_NONE is treated. The string parameter to SetCodePage is "RawByteString". |
|
'and then goes on to define how a RawByteString with type CP_NONE is treated. The string parameter to SetCodePage is "RawByteString".' Again: CP_NONE is invalid as dynamic codepage. The description is for strings that have the *declared* codepage CP_NONE. I'll change the parameter for setstringcodepage() so that passing CP_NONE will cause a range check warning/error (depending on the setting of {$R+/-}) |
|
Additionally, there is no specific checking for CP_NONE in the conversion routines. We just call the OS conversion routines, and they interpret it in the undefined way. |
|
Can't CP_NONE be handled as "do no conversion"? Sometimes it is useful to define string data as a "byte string" with no relation to any character code page. For example, Firebird has a NONE charset for the [var]char data type, which means that data is stored and fetched "as is" without any conversion. In this case, when we read data into a string buffer we set the code page to CP_NONE, indicating that we do not want any conversion, only to move raw data. |
|
Looking at the implementation of "SetCodePage" and "fpc_ansistr_to_ansistr", if the "source CP" is CP_NONE then only copying occurs (no conversion). Is it guaranteed that: SetCodePage(RawByteString(S), CP_NONE, False); SetCodePage(RawByteString(S), CP_ACP, True); will return S unaltered? |
|
Yes, but you get exactly the same behaviour and result with SetCodePage(RawByteString(S), CP_ACP, False); The "source CP" in fpc_ansistr_to_ansistr is the declared code page of the string, not the dynamic code page (which, as mentioned before, must never be CP_NONE). The reason I haven't fixed this issue yet as described earlier is because exactly the same can happen if you set another invalid code page number as dynamic code page of a string. This cannot be statically checked, since which code pages are supported depends on the operating system and its version. Since unsupported string conversions are not supposed to generate run time errors, and since adding such run time errors may break programs that (seem to) work under Delphi, there is no clean or easy way to resolve this. |
|
> The "source CP" in fpc_ansistr_to_ansistr is the declared code page of the string, not the dynamic code page When I look into the function fpc_AnsiStr_To_AnsiStr() in astrings.inc there is: orgcp:=TranslatePlaceholderCP(StringCodePage(S)); which I guess is the dynamic code page of the "source string", is it not? (On the other hand, "cp" is the declared code page of the "destination string".) I understand your argument about supported/unsupported code pages depending on the OS, but CP_NONE is a special case IMO and can be handled by replacing: 449 if (orgcp=cp) or (orgcp=CP_NONE) then with: 449 if (orgcp=cp) or (orgcp=CP_NONE) or (cp=CP_NONE) then But I do not insist on it, as long as I can fix my use case as follows: SetCodePage(RawByteString(S), FCodePage, FCodePage<>CP_NONE); where FCodePage is the required "destination CP". (I am aware that a string should not have the dynamic code page CP_NONE, but I need to somehow reflect the fact that the string holds raw binary data.) |
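The workaround quoted in the note above could be wrapped up as follows. This is a sketch only: the helper name ApplyCodePage and the surrounding program are hypothetical, while the SetCodePage call itself is the one from the note.

```pascal
program rawtag;
{$mode objfpc}{$H+}

// Hypothetical helper name; the body is the SetCodePage call from the note.
procedure ApplyCodePage(var S: AnsiString; FCodePage: TSystemCodePage);
begin
  // Only request a conversion when the destination is a real code page;
  // for CP_NONE the bytes are left alone and only the code page tag changes.
  SetCodePage(RawByteString(S), FCodePage, FCodePage <> CP_NONE);
end;

var
  S: AnsiString;
begin
  S := 'abc';
  ApplyCodePage(S, CP_NONE);
  // Print the length rather than the string itself, since writing the string
  // would go through a code page conversion again.
  Writeln(Length(S));
end.
```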
|
> I need somehow reflect fact that string holds raw binary data That is like saying that you need a way to somehow reflect that an array of byte contains data encoded using a particular code page. Strings by definition hold data that is encoded in a particular code page. |
|
LacaK, as you asked, here are some tests of RawByteString. All tests were made on Windows x86_64, in both Delphi (10.1 Berlin) and FPC 3.1.last

1. The CP_NONE const is defined in the Windows unit, equals zero, and is possibly not related to the other codepage constants. However, the Delphi RTL treats zero as "codepage is not defined".

2. A new RawByteString is created with DefaultSystemCodePage in normal mode and with CP_UTF8 in NEXTGEN mode

    {$MODE OBJFPC}{$H+}
    {$CODEPAGE UTF8}
    var
      s: rawbytestring;
    begin
      Writeln(StringCodePage(s) = DefaultSystemCodePage);
    end.

   Returns:
    TRUE (OK in FPC, since there is no NEXTGEN mode)

3. The same code in the Delphi RTL processes both AnsiString and RawByteString.

4. Changing the codepage without conversion does not break string data

    {$MODE OBJFPC}{$H+}
    {$CODEPAGE UTF8}
    var
      strr: RawByteString;
    begin
      strr := '123';
      SetCodePage(strr, CP_NONE, False);
      Writeln(strr = '123');
      Writeln(StringCodePage(strr) = CP_NONE);
    end.

   Returns:
    TRUE
    TRUE

5. Changing the codepage with conversion does not break string data either

    {$MODE OBJFPC}{$H+}
    {$CODEPAGE UTF8}
    var
      strr: RawByteString;
    begin
      strr := '123';
      SetCodePage(strr, CP_NONE, True);
      Writeln(strr = '123');
      Writeln(StringCodePage(strr) = CP_NONE);
    end.

   Returns:
    TRUE (Failed in FPC, strr is empty)
    TRUE (Failed in FPC, strr codepage is DefaultSystemCodePage)

6. Changing the codepage without conversion does not break non-Latin string data

    {$MODE OBJFPC}{$H+}
    {$CODEPAGE UTF8}
    const
      RAW_UTF8_DATA: AnsiString = #$d0#$9f#$d1#$80#$d0#$b8#$d0#$b2#$d0#$b5#$d1#$82;
    var
      str8: UTF8String;
      strr: RawByteString;
    begin
      str8 := 'Привет'; { "Hello" in Russian }
      Writeln(StringCodePage(str8) = CP_UTF8);
      strr := str8;
      Writeln(StringCodePage(strr) = CP_UTF8);
      SetCodePage(strr, CP_NONE, False);
      Writeln(strr = RAW_UTF8_DATA);
      Writeln(StringCodePage(strr) = CP_NONE);
    end.

   Returns:
    TRUE
    TRUE
    TRUE (Failed in FPC, strr is empty)
    TRUE

7. Changing the codepage with conversion does not break non-Latin string data and converts it properly

    {$MODE OBJFPC}{$H+}
    {$CODEPAGE UTF8}
    const
      RAW_1251_DATA: AnsiString = #$cf#$f0#$e8#$e2#$e5#$f2;
    var
      str8: UTF8String;
      strr: RawByteString;
    begin
      DefaultSystemCodePage := 1251;
      str8 := 'Привет'; { "Hello" in Russian }
      Writeln(StringCodePage(str8) = CP_UTF8);
      strr := str8;
      Writeln(StringCodePage(strr) = CP_UTF8);
      SetCodePage(strr, CP_NONE, True);
      Writeln(strr = RAW_1251_DATA);
      Writeln(StringCodePage(strr) = CP_NONE);
    end.

   Returns:
    TRUE
    TRUE
    TRUE (Failed in FPC, strr is empty)
    TRUE (Failed in FPC, strr codepage is 1251)

8. String concatenation works

    {$MODE OBJFPC}{$H+}
    {$CODEPAGE UTF8}
    var
      str1, str2: RawByteString;
    begin
      Str1 := '1';
      Str2 := '2';
      SetCodePage(RawByteString(Str1), CP_NONE, False);
      SetCodePage(RawByteString(Str2), CP_NONE, False);
      Str1 := Str1 + Str2;
      Writeln(Str1 = '12');
      Writeln(StringCodePage(Str1) = CP_NONE);
    end.

   Returns:
    TRUE (Failed in FPC, Str1 is empty)
    TRUE (Failed in FPC, Str1 codepage is DefaultSystemCodePage)

9. The codepage of the result string is the codepage of the first string

    {$MODE OBJFPC}{$H+}
    {$CODEPAGE UTF8}
    var
      str1, str2: RawByteString;
    begin
      Str1 := '1';
      Str2 := '2';
      SetCodePage(RawByteString(Str2), CP_NONE, False);
      Str1 := Str1 + Str2;
      Writeln(Str1 = '12');
      Writeln(StringCodePage(Str1) = DefaultSystemCodePage);
    end.

   Returns:
    TRUE (Failed in FPC, Str1 is "1")
    TRUE

10. But if the first string is empty, the result inherits the codepage of the second string

    {$MODE OBJFPC}{$H+}
    {$CODEPAGE UTF8}
    var
      str1, str2: RawByteString;
    begin
      Str1 := '';
      Str2 := '2';
      SetCodePage(RawByteString(Str2), CP_NONE, False);
      Str1 := Str1 + Str2;
      Writeln(Str1 = '2');
      Writeln(StringCodePage(Str1) = CP_NONE);
    end.

   Returns:
    TRUE (Failed in FPC, Str1 is empty)
    TRUE (Failed in FPC, Str1 codepage is DefaultSystemCodePage)

11. If both strings are empty, the result is an empty string with the default codepage

    {$MODE OBJFPC}{$H+}
    {$CODEPAGE UTF8}
    var
      str1, str2: RawByteString;
    begin
      Str1 := '';
      Str2 := '';
      SetCodePage(RawByteString(Str2), CP_NONE, False);
      Str1 := Str1 + Str2;
      Writeln(Str1 = '');
      Writeln(StringCodePage(Str1) = DefaultSystemCodePage);
    end.

   Returns:
    TRUE
    TRUE |
|
"CP_NONE const is defined in Windows unit" Well, it may be (as $0000, but ifdeffed), but it is cross-platform. It is defined in the system unit, in systemh.inc, as $FFFF: CP_NONE = $FFFF; // rawbytestring encoding There are duplicates in jsystem, which are superfluous. And this is plain wrong in wintypes: {$ifndef SYSTEMUNIT} CP_NONE = $0000; {$endif SYSTEMUNIT} |
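Which definition a given program actually sees can be checked with a few lines. This sketch assumes a program that uses only the system unit (no Windows unit in the uses clause), so the value printed is the one from systemh.inc quoted above; the program name is made up.

```pascal
program cpnonevalue;
{$mode objfpc}{$H+}

begin
  // With only the system unit in scope, CP_NONE resolves to the RTL constant.
  Writeln('CP_NONE = $', HexStr(CP_NONE, 4));
end.
```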
Date Modified | Username | Field | Change |
---|---|---|---|
2016-09-20 18:04 | Tony Whyman | New Issue | |
2016-09-20 18:04 | Tony Whyman | File Added: CodePageTest.zip | |
2016-09-20 18:14 | Bart Broersma | Note Added: 0094736 | |
2016-09-20 19:12 | Jonas Maebe | Note Added: 0094737 | |
2016-09-20 19:12 | Jonas Maebe | Status | new => resolved |
2016-09-20 19:12 | Jonas Maebe | Resolution | open => no change required |
2016-09-20 19:12 | Jonas Maebe | Assigned To | => Jonas Maebe |
2016-09-24 15:38 | Tony Whyman | Note Added: 0094797 | |
2016-09-24 15:38 | Tony Whyman | Status | resolved => feedback |
2016-09-24 15:38 | Tony Whyman | Resolution | no change required => reopened |
2016-09-24 15:51 | Tony Whyman | Note Added: 0094798 | |
2016-09-24 15:51 | Tony Whyman | Status | feedback => assigned |
2016-09-24 22:32 | Jonas Maebe | Note Added: 0094806 | |
2016-09-24 22:33 | Jonas Maebe | Note Added: 0094807 | |
2016-09-24 22:34 | Jonas Maebe | Relationship added | related to 0025332 |
2016-09-24 23:07 | Jonas Maebe | Note Edited: 0094806 | |
2016-09-26 07:29 | LacaK | Note Added: 0094817 | |
2016-12-29 10:02 | LacaK | Note Added: 0097141 | |
2016-12-29 18:22 | Jonas Maebe | Note Added: 0097154 | |
2016-12-30 09:46 | LacaK | Note Added: 0097163 | |
2017-01-06 21:06 | Jonas Maebe | Note Added: 0097341 | |
2017-01-07 12:05 | Jonas Maebe | Relationship added | has duplicate 0031200 |
2017-01-11 13:52 | Dmitriy Pomerantsev | Note Added: 0097412 | |
2017-01-11 13:53 | Dmitriy Pomerantsev | Note Edited: 0097412 | |
2017-01-11 13:54 | Dmitriy Pomerantsev | Note Edited: 0097412 | |
2017-01-11 14:55 | Thaddy de Koning | Note Added: 0097415 | |
2017-01-11 14:56 | Thaddy de Koning | Note Edited: 0097415 | |
2017-01-11 14:58 | Thaddy de Koning | Note Edited: 0097415 | |
2017-01-11 14:59 | Thaddy de Koning | Note Edited: 0097415 | |
2017-01-11 15:02 | Thaddy de Koning | Note Edited: 0097415 | |