View Issue Details

ID: 0030622
Project: FPC
Category: RTL
View Status: public
Last Update: 2017-01-11 15:02
Reporter: Tony Whyman
Assigned To: Jonas Maebe
Priority: normal
Severity: minor
Reproducibility: always
Status: assigned
Resolution: reopened
Platform: Windows x64
OS: Windows 7
Product Version: 3.0.0
Summary: 0030622: SetCodePage transliterates to CP_NONE by deleting the string
Description: This seems to be a Windows-only problem. If you call e.g.

SetCodePage(s,CP_NONE,true);

Then s always ends up empty.

On Linux this works fine and "s" is unchanged.
Steps To Reproduce: Example Lazarus program attached. Just compile, run and enter some string as the test string. If the CP None checkbox is selected then "convert" gives an empty output. If the WIN1252 checkbox is selected then the string is copied to the output.
Tags: No tags attached.

Relationships

related to 0025332 (closed, Jonas Maebe): RawByteStrings Drops data
has duplicate 0031200 (resolved, Jonas Maebe): Format '%'s returns blank when string has a codepage of CP_NONE

Activities

Tony Whyman

2016-09-20 18:04

reporter  

CodePageTest.zip (127,805 bytes)

Bart Broersma

2016-09-20 18:14

reporter   ~0094736

Please attach a sample project that does NOT depend on Lazarus.
1. Not all fpc devels have Lazarus installed.
2. The bug may be in Lazarus and not fpc

Jonas Maebe

2016-09-20 19:12

manager   ~0094737

From http://wiki.freepascal.org/FPC_Unicode_support#Code_page_identifiers : "CP_NONE: this value indicates that no code page information has been associated with the string data. The result of any operation on a string that has this dynamic code page is undefined."

Tony Whyman

2016-09-24 15:38

reporter   ~0094797

I don't believe that "undefined results" should ever be used as an excuse for not checking a function parameter correctly. An undefined outcome is totally correct when there are race conditions or uninitialised variables involved. But when a function argument is inconsistent with the rest of the arguments or there is no clear action to be performed then at the very least an exception should be raised.

In this case, we seem to have the worst of all possible outcomes with totally consistent results on both Linux and Windows - except that there is a different outcome on each platform, with Linux treating this case as a "no-op" and Windows petulantly trashing the string.

Transliteration to CP_NONE does not make sense. Agreed. In the Linux world, simply ignoring the "convert" flag in this case makes sense and the call to SetCodePage with convert=true (this is also the default) can be interpreted as "set the code page to "x" and transliterate if necessary". On the other hand, the Windows implementation seems to be behaving like a troll and silently trashing the string if you put a foot wrong i.e. set CP_NONE with convert=true.

Claiming "undefined results" is not correct in this case because a predictable outcome is perfectly possible. That should be either an exception or silently ignoring convert=true when the codepage is set to CP_NONE.
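
To make the "raise an exception" option concrete, here is a minimal sketch of a user-level wrapper. It is only an illustration under stated assumptions: SetCodePageStrict and ECodePageError are hypothetical names declared for this example, not part of the RTL, and this is not a proposed RTL patch.

```pascal
program StrictSetCodePageDemo;
{$MODE OBJFPC}{$H+}

uses
  SysUtils;

type
  { Hypothetical exception type, declared here for illustration only. }
  ECodePageError = class(Exception);

{ Hypothetical wrapper: refuses the combination that has no clear meaning
  (transliterating to CP_NONE) and forwards every other call unchanged. }
procedure SetCodePageStrict(var S: RawByteString; CodePage: TSystemCodePage;
  Convert: Boolean);
begin
  if Convert and (CodePage = CP_NONE) then
    raise ECodePageError.Create('Cannot transliterate to CP_NONE');
  SetCodePage(S, CodePage, Convert);
end;

var
  S: RawByteString;
begin
  S := 'test';
  try
    SetCodePageStrict(S, CP_NONE, True);
  except
    on E: ECodePageError do
      Writeln('Rejected: ', E.Message);
  end;
  { The wrapper raised before touching S, so the data is still intact. }
  Writeln(Length(S));
end.
```

The check happens before SetCodePage is called, so the outcome no longer depends on how the OS conversion routine interprets CP_NONE.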

Tony Whyman

2016-09-24 15:51

reporter   ~0094798

I also couldn't resist pointing out that on the same wiki page it says

"As mentioned earlier, the results of operations on strings with the CP_NONE code page are undefined. As it does not make sense to define a type in the RTL whose behaviour is undefined, the behaviour of RawByteString is somewhat different than that of other AnsiString(X) types."

and then goes on to define how a RawByteString with type CP_NONE is treated. The string parameter to SetCodePage is "RawByteString".

Jonas Maebe

2016-09-24 22:32

manager   ~0094806

Last edited: 2016-09-24 23:07

'and then goes on to define how a RawByteString with type CP_NONE is treated. The string parameter to SetCodePage is "RawByteString".'

Again: CP_NONE is invalid as dynamic codepage. The description is for strings that have the *declared* codepage CP_NONE.

I'll change the parameter for setstringcodepage() so that passing CP_NONE will cause a range check warning/error (depending on the setting of {$R+/-})

Jonas Maebe

2016-09-24 22:33

manager   ~0094807

Additionally, there is no specific checking for CP_NONE in the conversion routines. We just call the OS conversion routines, and they interpret it in the undefined way.

LacaK

2016-09-26 07:29

developer   ~0094817

Can't CP_NONE be handled as "do no conversion"?

Sometimes it is useful to define string data as a "byte string" with no relation to a character code page. For example, Firebird has the NONE charset for the [var]char data type, which means that data are put/get "as is" without any conversion.
In this case, when we read data into a string buffer we set the code page to CP_NONE, indicating that we do not want any conversion, only to move raw data.

LacaK

2016-12-29 10:02

developer   ~0097141

Looking at implementation of "SetCodePage" and "fpc_ansistr_to_ansistr" if there is "source CP" = CP_NONE then only copying occurs (no conversion). Is it guaranteed that:
  SetCodePage(RawByteString(S), CP_NONE, False);
  SetCodePage(RawByteString(S), CP_ACP, True);
will return unaltered S ?

Jonas Maebe

2016-12-29 18:22

manager   ~0097154

Yes, but you get exactly the same behaviour and result with
  SetCodePage(RawByteString(S), CP_ACP, False);

The "source CP" in fpc_ansistr_to_ansistr is the declared code page of the string, not the dynamic code page (which, as mentioned before, must never be CP_NONE).

The reason I haven't fixed this issue yet as described earlier is because exactly the same can happen if you set another invalid code page number as dynamic code page of a string. This cannot be statically checked, since which code pages are supported depends on the operating system and its version.

Since unsupported string conversions are not supposed to generate run time errors, and since adding such run time errors may break programs that (seem to) work under Delphi, there is no clean or easy way to resolve this.

LacaK

2016-12-30 09:46

developer   ~0097163

> The "source CP" in fpc_ansistr_to_ansistr is the declared code page of the string, not the dynamic code page

When I look into Function fpc_AnsiStr_To_AnsiStr() in astrings.inc there is:
  orgcp:=TranslatePlaceholderCP(StringCodePage(S));
Which I guess is the dynamic codepage of the "source string", isn't it? (On the other hand, "cp" is the declared codepage of the "destination string".)

I understand your argument about supported/unsupported codepages depending on the OS, but IMO CP_NONE is a special case that can be handled by replacing line 449:
  if (orgcp=cp) or (orgcp=CP_NONE) then
by
  if (orgcp=cp) or (orgcp=CP_NONE) or (cp=CP_NONE) then

But I do not insist on it, as long as I can fix my use case as follows:
  SetCodePage(RawByteString(S), FCodePage, FCodePage<>CP_NONE);
where FCodePage is the required "destination CP" (I am aware of the fact that a string should not contain the dynamic codepage CP_NONE, but I need somehow reflect fact that string holds raw binary data)
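
The workaround above could be wrapped in a small helper so that the guard lives in one place. This is a rough sketch only: SetCodePageRawAware is a hypothetical name made up for this example, and the behaviour of SetCodePage with CP_NONE is exactly what is disputed in this report, so no particular output is claimed.

```pascal
program RawAwareSetCodePageDemo;
{$MODE OBJFPC}{$H+}

{ Hypothetical helper (not part of the RTL): requests a conversion only
  when the destination code page is a real one. For CP_NONE the string
  is merely relabelled, which mirrors the call shown above. }
procedure SetCodePageRawAware(var S: RawByteString; CodePage: TSystemCodePage);
begin
  SetCodePage(S, CodePage, CodePage <> CP_NONE);
end;

var
  S: RawByteString;
begin
  S := 'abc';
  { Convert is False here, so no OS conversion routine should be involved. }
  SetCodePageRawAware(S, CP_NONE);
  Writeln(Length(S));
  Writeln(StringCodePage(S) = CP_NONE);
end.
```

Whether a dynamic code page of CP_NONE should be set at all is a separate question; the helper only avoids asking for a transliteration to it.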

Jonas Maebe

2017-01-06 21:06

manager   ~0097341

> I need somehow reflect fact that string holds raw binary data

That is like saying that you need a way to somehow reflect that an array of byte contains data encoded using a particular code page. Strings by definition hold data that is encoded in a particular code page.

Dmitriy Pomerantsev

2017-01-11 13:52

reporter   ~0097412

Last edited: 2017-01-11 13:54

LacaK, as you asked, here are some tests of RawByteString. All tests were made on Windows x86_64, both in Delphi (10.1 Berlin) and FPC 3.1.last

1. CP_NONE const is defined in Windows unit, equals zero, and is possibly not related to the other codepage constants.
However, the Delphi RTL treats zero as "codepage is not defined".

2. A new RawByteString is created with DefaultSystemCodePage in normal mode and with CP_UTF8 in NEXTGEN mode

{$MODE OBJFPC}{$H+}
{$CODEPAGE UTF8}

var
  s: rawbytestring;
begin
  Writeln(StringCodePage(s) = DefaultSystemCodePage);
end.

Returns:
TRUE (OK in FPC, since there is no NEXTGEN mode)

3. The Delphi RTL uses the same code to process AnsiString and RawByteString.

4. Codepage changing without conversion does not break string data

{$MODE OBJFPC}{$H+}
{$CODEPAGE UTF8}

var
  strr: RawByteString;
begin
  strr := '123';
  SetCodePage(strr, CP_NONE, False);
  Writeln(strr = '123');
  Writeln(StringCodePage(strr) = CP_NONE);
end.

Returns:
TRUE
TRUE

5. Codepage changing with conversion also does not break string data

{$MODE OBJFPC}{$H+}
{$CODEPAGE UTF8}

var
  strr: RawByteString;
begin
  strr := '123';
  SetCodePage(strr, CP_NONE, True);
  Writeln(strr = '123');
  Writeln(StringCodePage(strr) = CP_NONE);
end.

Returns:
TRUE (Failed in FPC, strr is empty)
TRUE (Failed in FPC, strr codepage is DefaultSystemCodePage)

6. Codepage changing without conversion does not break non-Latin string data

{$MODE OBJFPC}{$H+}
{$CODEPAGE UTF8}

const
  RAW_UTF8_DATA: AnsiString = #$d0#$9f#$d1#$80#$d0#$b8#$d0#$b2#$d0#$b5#$d1#$82;

var
  str8: UTF8String;
  strr: RawByteString;

begin
  str8 := 'Привет'; { "Hello" in Russian }
  Writeln(StringCodePage(str8) = CP_UTF8);
  strr := str8;
  Writeln(StringCodePage(strr) = CP_UTF8);
  SetCodePage(strr, CP_NONE, False);
  Writeln(strr = RAW_UTF8_DATA);
  Writeln(StringCodePage(strr) = CP_NONE);
end.

Returns:
TRUE
TRUE
TRUE (Failed in FPC, strr is empty)
TRUE

7. Codepage changing with conversion does not break non-Latin string data and properly converts it

{$MODE OBJFPC}{$H+}
{$CODEPAGE UTF8}

const
  RAW_1251_DATA: AnsiString = #$cf#$f0#$e8#$e2#$e5#$f2;

var
  str8: UTF8String;
  strr: RawByteString;

begin
  DefaultSystemCodePage := 1251;
  str8 := 'Привет'; { "Hello" in Russian }
  Writeln(StringCodePage(str8) = CP_UTF8);
  strr := str8;
  Writeln(StringCodePage(strr) = CP_UTF8);
  SetCodePage(strr, CP_NONE, True);
  Writeln(strr = RAW_1251_DATA);
  Writeln(StringCodePage(strr) = CP_NONE);
end.

Returns:
TRUE
TRUE
TRUE (Failed in FPC, strr is empty)
TRUE (Failed in FPC, strr codepage is 1251)

8. String concatenation works

{$MODE OBJFPC}{$H+}
{$CODEPAGE UTF8}

var
  str1, str2: RawByteString;

begin
  Str1 := '1';
  Str2 := '2';
  SetCodePage(RawByteString(Str1), CP_NONE, False);
  SetCodePage(RawByteString(Str2), CP_NONE, False);

  Str1 := Str1 + Str2;

  Writeln(Str1 = '12');
  Writeln(StringCodePage(Str1) = CP_NONE);
end.

Returns:
TRUE (Failed in FPC, Str1 is empty)
TRUE (Failed in FPC, Str1 codepage is DefaultSystemCodePage)

9. The result string's codepage is the codepage of the first string

{$MODE OBJFPC}{$H+}
{$CODEPAGE UTF8}

var
  str1, str2: RawByteString;

begin
  Str1 := '1';
  Str2 := '2';
  SetCodePage(RawByteString(Str2), CP_NONE, False);

  Str1 := Str1 + Str2;

  Writeln(Str1 = '12');
  Writeln(StringCodePage(Str1) = DefaultSystemCodePage);
end.

Returns:
TRUE (Failed in FPC, Str1 is "1")
TRUE

10. But if the first string is empty, the result inherits the second string's codepage

{$MODE OBJFPC}{$H+}
{$CODEPAGE UTF8}

var
  str1, str2: RawByteString;

begin
  Str1 := '';
  Str2 := '2';
  SetCodePage(RawByteString(Str2), CP_NONE, False);

  Str1 := Str1 + Str2;

  Writeln(Str1 = '2');
  Writeln(StringCodePage(Str1) = CP_NONE);
end.

Returns:
TRUE (Failed in FPC, Str1 is empty)
TRUE (Failed in FPC, Str1 codepage is DefaultSystemCodePage)

11. If both strings are empty, the result is an empty string with the default codepage

{$MODE OBJFPC}{$H+}
{$CODEPAGE UTF8}

var
  str1, str2: RawByteString;

begin
  Str1 := '';
  Str2 := '';
  SetCodePage(RawByteString(Str2), CP_NONE, False);

  Str1 := Str1 + Str2;

  Writeln(Str1 = '');
  Writeln(StringCodePage(Str1) = DefaultSystemCodePage);
end.

Returns:
TRUE
TRUE

Thaddy de Koning

2017-01-11 14:55

reporter   ~0097415

Last edited: 2017-01-11 15:02

"CP_NONE const is defined in Windows unit"
Well, it may be (as $0000, but ifdeffed) but it is X-platform.... It is in system....systemh.inc ($ffff) CP_NONE = $FFFF; // rawbytestring encoding

There are duplicates in jsystem... which is superfluous..
And this is plain wrong in wintypes:
{$ifndef SYSTEMUNIT}
  CP_NONE = $0000;
{$endif SYSTEMUNIT}

Issue History

Date Modified Username Field Change
2016-09-20 18:04 Tony Whyman New Issue
2016-09-20 18:04 Tony Whyman File Added: CodePageTest.zip
2016-09-20 18:14 Bart Broersma Note Added: 0094736
2016-09-20 19:12 Jonas Maebe Note Added: 0094737
2016-09-20 19:12 Jonas Maebe Status new => resolved
2016-09-20 19:12 Jonas Maebe Resolution open => no change required
2016-09-20 19:12 Jonas Maebe Assigned To => Jonas Maebe
2016-09-24 15:38 Tony Whyman Note Added: 0094797
2016-09-24 15:38 Tony Whyman Status resolved => feedback
2016-09-24 15:38 Tony Whyman Resolution no change required => reopened
2016-09-24 15:51 Tony Whyman Note Added: 0094798
2016-09-24 15:51 Tony Whyman Status feedback => assigned
2016-09-24 22:32 Jonas Maebe Note Added: 0094806
2016-09-24 22:33 Jonas Maebe Note Added: 0094807
2016-09-24 22:34 Jonas Maebe Relationship added related to 0025332
2016-09-24 23:07 Jonas Maebe Note Edited: 0094806
2016-09-26 07:29 LacaK Note Added: 0094817
2016-12-29 10:02 LacaK Note Added: 0097141
2016-12-29 18:22 Jonas Maebe Note Added: 0097154
2016-12-30 09:46 LacaK Note Added: 0097163
2017-01-06 21:06 Jonas Maebe Note Added: 0097341
2017-01-07 12:05 Jonas Maebe Relationship added has duplicate 0031200
2017-01-11 13:52 Dmitriy Pomerantsev Note Added: 0097412
2017-01-11 13:53 Dmitriy Pomerantsev Note Edited: 0097412
2017-01-11 13:54 Dmitriy Pomerantsev Note Edited: 0097412
2017-01-11 14:55 Thaddy de Koning Note Added: 0097415
2017-01-11 14:56 Thaddy de Koning Note Edited: 0097415
2017-01-11 14:58 Thaddy de Koning Note Edited: 0097415
2017-01-11 14:59 Thaddy de Koning Note Edited: 0097415
2017-01-11 15:02 Thaddy de Koning Note Edited: 0097415