View Issue Details

ID: 0021195
Project: FPC
Category: FCL
View Status: public
Last Update: 2013-02-14 11:08
Reporter: Mark Morgan Lloyd
Assigned To: Paul Ishenin
Priority: normal
Severity: major
Reproducibility: always
Status: closed
Resolution: fixed
Platform: x86 and others
OS: Linux
Product Version: 2.7.1
Fixed in Version: 3.0.0
Summary: 0021195: TStringList.IndexOf fails with widechars
Description: Attached program adds widechars to stringlist returning 0 through 4, but IndexOf always returns zero.
Steps To Reproduce: Run attached program.
Additional Information: I think this probably affects widechars above #$00ff, I didn't spot it while working with characters up to and including #$00f7.
Tags: No tags attached.
Fixed in Revision: 23613
Attached Files

Activities

2012-01-31 11:40

 

stringlisttest.pas (734 bytes)   
program stringListTest;

{$mode objfpc}{$H+}

uses Classes;

const   Iota= #$2373;                   (* ⍳  *)
        Rho= #$2374;                    (* ⍴  *)
        Omega= #$2375;                  (* ⍵  *)
        Alpha= #$237a;                  (* ⍺  *)

var	sl: TStringList;

begin
  sl := TStringList.Create;
  try
    WriteLn('Put iota at ', sl.Add(iota));
    WriteLn('Put rho at ', sl.Add(rho));
    WriteLn('Put omega at ', sl.Add(omega));
    WriteLn('Put alpha at ', sl.Add(alpha));

    WriteLn('Get iota from ', sl.IndexOf(iota));
    WriteLn('Get rho from ', sl.IndexOf(rho));
    WriteLn('Get omega from ', sl.IndexOf(omega));
    WriteLn('Get alpha from ', sl.IndexOf(alpha))
  finally
    sl.Free
  end
end.


Marco van de Voort

2012-01-31 13:24

manager   ~0056191

Last edited: 2012-01-31 13:25

The widechars are probably converted to single-byte chars, so the subsequent lookup fails.

TStringList is not Unicode-aware at the moment, so this behaviour is as expected.

For Lazarus-style manual UTF-8 usage, encode the widechar into an ansistring (containing UTF-8) and call IndexOf with that.

Jonas Maebe

2012-01-31 14:34

manager   ~0056193

The program works correctly with FPC 2.6.0 if cwstring is added to the uses clause (in an UTF-8 locale). Therefore it should also work correctly with FPC 2.7.1 if cwstring is added to the uses clause, but it doesn't.

Mark Morgan Lloyd

2012-02-01 10:08

reporter   ~0056213

Last edited: 2012-02-01 11:00

The program I was working on did have cwstring, apologies for not putting it in the example.

If cwstring is not used, then encoding the widestring using Utf8Encode() works. If it is used, then IndexOf() fails whether or not the widestring is explicitly converted to UTF-8.

Revised test program attached.

I think this is something that might have crept in in stages, since the program I'm working on (currently being compiled using 19886) appears to exhibit slightly different behaviour. However that might also depend on how many widestrings (irrespective of use of Utf8Encode) are being put in the stringlist before IndexOf() is called.

2012-02-01 11:02

 

stringlisttest2.pas (3,444 bytes)   
program stringListTest;

{$mode objfpc}{$H+}

uses { cwstring, } Classes;

const   Pound= #$00a3;                  (* £  *)
        LeftGuillemet= #$00ab;          (* «  *) LeftChevron= LeftGuillemet;
        RightGuillemet= #$00bb;         (* »  *) RightChevron= RightGuillemet;
        Negative= #$00af;               (* ¯  *)
        Multiply= #$00d7;               (* ×  *)
        Divide= #$00f7;                 (* ÷  *)
        LeftArrow= #$2190;              (* ←  *)
        UpArrow= #$2191;                (* ↑  *)
        RightArrow= #$2192;             (* →  *)
        DownArrow= #$2193;              (* ↓  *)
        Delta= #$2206;                  (* ∆  *)
        Del= #$2207;                    (* ∇  *)
        Epsilon= #$2208;                (* ∈  *)
        Jot= #$2218;                    (* ∘  *) SmallCircle= Jot;
        AndSymbol= #$2227;              (* ∧  *)
        OrSymbol= #$2228;               (* ∨  *)
        SouthCap= #$2229;               (* ∩  *)
        NorthCap= #$222a;               (* ∪  *)
        NotEqual= #$2260;               (* ≠  *)
        LessOrEqual= #$2264;            (* ≤  *)
        GreaterOrEqual= #$2265;         (* ≥  *)
        EastCap= #$2282;                (* ⊂  *)
        WestCap= #$2283;                (* ⊃  *)
        TBeam= #$22a4;                  (* ⊤  *)
        Ceiling= #$2308;                (* ⌈  *)
        Floor= #$230a;                  (* ⌊  *)
        IBeam= #$2336;                  (* ⌶  *)
        Comment= #$235d;                (* ⍝  *)
        Logarithm= #$235f;              (* ⍟  *)
        Iota= #$2373;                   (* ⍳  *)
        Rho= #$2374;                    (* ⍴  *)
        Omega= #$2375;                  (* ⍵  *)
        Alpha= #$237a;                  (* ⍺  *)
        Quad= #$2395;                   (* ⎕  *)
        Circle= #$25cb;                 (* ○  *) LargeCircle= Circle;

var	sl: TStringList;

begin
  sl := TStringList.Create;
  try
    WriteLn('--- WITHOUT Utf8Encode(): ---');
    WriteLn('Put ceiling at ', sl.Add(ceiling));
    WriteLn('Put floor at ', sl.Add(floor));
    WriteLn('Put iota at ', sl.Add(iota));
    WriteLn('Put rho at ', sl.Add(rho));
    WriteLn('Put omega at ', sl.Add(omega));
    WriteLn('Put alpha at ', sl.Add(alpha));

    WriteLn('Get ceiling from ', sl.IndexOf(ceiling));
    WriteLn('Get floor from ', sl.IndexOf(floor));
    WriteLn('Get iota from ', sl.IndexOf(iota));
    WriteLn('Get rho from ', sl.IndexOf(rho));
    WriteLn('Get omega from ', sl.IndexOf(omega));
    WriteLn('Get alpha from ', sl.IndexOf(alpha));
    sl.Clear;

    WriteLn('--- WITH Utf8Encode(): ------');
    WriteLn('Put ceiling at ', sl.Add(Utf8Encode(ceiling)));
    WriteLn('Put floor at ', sl.Add(Utf8Encode(floor)));
    WriteLn('Put iota at ', sl.Add(Utf8Encode(iota)));
    WriteLn('Put rho at ', sl.Add(Utf8Encode(rho)));
    WriteLn('Put omega at ', sl.Add(Utf8Encode(omega)));
    WriteLn('Put alpha at ', sl.Add(Utf8Encode(alpha)));

    WriteLn('Get ceiling from ', sl.IndexOf(Utf8Encode(ceiling)));
    WriteLn('Get floor from ', sl.IndexOf(Utf8Encode(floor)));
    WriteLn('Get iota from ', sl.IndexOf(Utf8Encode(iota)));
    WriteLn('Get rho from ', sl.IndexOf(Utf8Encode(rho)));
    WriteLn('Get omega from ', sl.IndexOf(Utf8Encode(omega)));
    WriteLn('Get alpha from ', sl.IndexOf(Utf8Encode(alpha)))
  finally
    sl.Free
  end
end.


Paul Ishenin

2012-02-14 08:40

developer   ~0056807

It does not work because none of those chars can be represented in the ansi codepage, so they are replaced by '?'.

The UTF-8 version works properly now, though.

Jonas Maebe

2012-02-14 10:57

manager   ~0056810

On Unix the ansi code page usually is utf-8, so that's all that's needed. I think you can resolve this report.

Mark Morgan Lloyd

2012-02-14 18:15

reporter   ~0056812

Once I knew there was a problem I worked around it by making sure that I was using UTF-8 rather than widestring.

If there are residual issues for widestring hopefully they're documented.

Jonas Maebe

2012-02-14 21:21

manager   ~0056821

The "residual widestring issue" is simply that tstringlist only works with ansistring. That is documented, in the sense that all declarations of that class use the "string" type and "string=ansistring" in that unit (string=widestring does not exist in current FPC versions).

Paul Ishenin

2012-02-15 01:08

developer   ~0056825

> On Unix the ansi code page usually is utf-8, so that's all that's needed. I think you can resolve this report.

The problem is that the conversion from WideString to AnsiString for TStringList.Add() is done by the compiler, and for the compiler the default codepage is "iso-8859-1" (unless another is set by {$codepage}, -Fc or {$mode}).

If I add {$codepage utf-8} at the top of the source file, I get a working example for the non-UTF-8 calls too.
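
A minimal sketch of that variant follows. It is not one of the attached files: the program name is made up, the directive is written as `utf8` (the spelling used in the FPC documentation), the source file is assumed to be saved as UTF-8, and `cwstring` is included on the assumption that a widestring manager is wanted on Unix for any conversion that still happens at run time.

```pascal
program StringListCodepageSketch;  // hypothetical name, not an attached file

{$mode objfpc}{$H+}
{$codepage utf8}  // declare the source file's codepage to the compiler

uses
  {$ifdef unix}cwstring,{$endif}  // assumption: widestring manager for run-time conversions
  Classes;

const
  Iota = #$2373;  (* ⍳ *)
  Rho  = #$2374;  (* ⍴ *)

var
  sl: TStringList;

begin
  sl := TStringList.Create;
  try
    // Each widechar constant is converted to an ansistring when passed to Add.
    WriteLn('Put iota at ', sl.Add(Iota));
    WriteLn('Put rho at ', sl.Add(Rho));

    // If the conversion is lossless, each lookup returns the index Add reported.
    WriteLn('Get iota from ', sl.IndexOf(Iota));
    WriteLn('Get rho from ', sl.IndexOf(Rho));
  finally
    sl.Free;
  end;
end.
```

Whether the lookups succeed still depends on the compiler revision and the run-time locale, so compare the "Get" indices with the "Put" indices rather than relying on a fixed expected output.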

Jonas Maebe

2012-03-09 16:11

manager   ~0057438

Last edited: 2012-03-10 22:58

I started writing an entry about this for User_Changes_Trunk, like this:

* '''Old behaviour''': If a constant widechar or wide/unicodestring had to be converted to an ansistring, this conversion would be performed at run time to whatever the ansi codepage of the program was. The reason is that it was impossible for the compiler to guess what the codepage at run time would be, and there was no way for the compiler to annotate the converted ansistring with the picked codepage so that it could be corrected at run time if required.
* '''New behaviour''': With the addition of codepage-aware ansistring support to the compiler, it is now possible to convert widestrings to ansistring at compile time. The compiler will use the codepage setting for the source file as destination codepage (default: codepage 437).
* '''Effect''': Passing a constant widechar/string to an ansistring parameter, or assigning a widechar/string to an ansistring, will result in data loss if the source code's codepage cannot represent the wide data.
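
The effect in the last bullet can be checked with a small round-trip test. This is a rough sketch rather than anything taken from the report: the program name is made up, `cwstring` is included on the assumption that a Unix widestring manager is wanted, and the result depends on the compiler version, the `{$codepage}` setting and the run-time locale, so no output is shown.

```pascal
program ConstantConversionSketch;  // hypothetical name

{$mode objfpc}{$H+}

uses
  {$ifdef unix}cwstring,{$endif}  // assumption: widestring manager on Unix
  SysUtils;

const
  Iota = #$2373;  (* ⍳, not representable in iso-8859-1 or codepage 437 *)

var
  narrow: AnsiString;
  wide: UnicodeString;

begin
  wide := Iota;    // stays UTF-16, nothing is lost
  narrow := Iota;  // wide to ansi conversion, possibly lossy

  // Converting the ansistring back shows whether the character survived.
  if UnicodeString(narrow) = wide then
    WriteLn('round trip preserved the character')
  else
    WriteLn('character was lost; narrow holds ', Length(narrow), ' byte(s)');
end.
```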

But then I came to the reason for the change, and I can't really think of a good one:

a) it may save some execution time (no conversion from utf-16 to DefaultSystemCodePage), but it can also increase it if the codepage at run time is not the same as the one of the source file (on Unix you now have to convert first from the source code's codepage to utf-16, and then from utf-16 to DefaultSystemCodePage; on Windows you may need only one OS call, but internally it probably does something similar because it's unlikely that they have tables for every possible combination of source and destination code page)

b) it may be Delphi-compatible, but is it likely to break any Delphi code if we delay the conversion of widestring constants to run time so that the string contains data encoded in DefaultSystemCodePage instead of in whatever the encoding of the source file was?

In summary, I think the current behaviour breaks existing code for no good reason.

Mark Morgan Lloyd

2012-03-09 18:28

reporter   ~0057445

When I raised the issue I think it was because I didn't understand what was going on; now that I see the detailed description I'm almost apologetic.

BUT if the behaviour is to be described in terms of PC-style codepages, could I please suggest that it's of paramount importance that the one or two characters in each that represent currency don't get mangled without warning.

Is there a compiler macro that allows code to determine the current {$codepage} setting?

Paul Ishenin

2013-02-14 10:04

developer   ~0065680

Since r23613 the compiler no longer converts some Unicode constants (>127) to ansistring at compile time (except to UTF-8 strings). This should solve the issue on Unix systems where the default system codepage is UTF-8.

Mark, there is no macro to determine the $codepage setting.
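
In the absence of a compile-time macro, the code pages actually in effect can at least be inspected at run time. The sketch below is an illustration only and assumes FPC 3.0 or later, where `DefaultSystemCodePage` and `StringCodePage` are available in the System unit; the program name is made up and `cwstring` is again an assumption for Unix targets. It uses the pound sign because currency characters were the concern raised above.

```pascal
program CodePageInspectSketch;  // hypothetical name

{$mode objfpc}{$H+}

uses
  {$ifdef unix}cwstring,{$endif}  // assumption: widestring manager on Unix
  SysUtils;

const
  Pound = #$00a3;  (* £ *)

var
  s: AnsiString;

begin
  s := Pound;  // widechar constant converted to ansistring

  // Code page the RTL uses for plain ansistrings at run time.
  WriteLn('DefaultSystemCodePage: ', DefaultSystemCodePage);

  // Code page recorded for this particular string.
  WriteLn('StringCodePage(s):     ', StringCodePage(s));

  // One byte suggests a single-byte encoding, two bytes suggests UTF-8.
  WriteLn('Length(s) in bytes:    ', Length(s));
end.
```

This reports the run-time state only; it does not reveal which `{$codepage}` value the source file was compiled with.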

Mark Morgan Lloyd

2013-02-14 11:08

reporter   ~0065682

Thanks.

Issue History

Date Modified Username Field Change
2012-01-31 11:40 Mark Morgan Lloyd New Issue
2012-01-31 11:40 Mark Morgan Lloyd File Added: stringlisttest.pas
2012-01-31 13:24 Marco van de Voort Note Added: 0056191
2012-01-31 13:25 Marco van de Voort Note Edited: 0056191
2012-01-31 13:25 Marco van de Voort Note Edited: 0056191
2012-01-31 14:34 Jonas Maebe Note Added: 0056193
2012-02-01 10:08 Mark Morgan Lloyd Note Added: 0056213
2012-02-01 11:00 Mark Morgan Lloyd Note Edited: 0056213
2012-02-01 11:02 Mark Morgan Lloyd File Added: stringlisttest2.pas
2012-02-14 08:40 Paul Ishenin Note Added: 0056807
2012-02-14 10:57 Jonas Maebe Note Added: 0056810
2012-02-14 18:15 Mark Morgan Lloyd Note Added: 0056812
2012-02-14 21:21 Jonas Maebe Note Added: 0056821
2012-02-15 01:08 Paul Ishenin Note Added: 0056825
2012-03-09 16:11 Jonas Maebe Note Added: 0057438
2012-03-09 18:28 Mark Morgan Lloyd Note Added: 0057445
2012-03-09 18:31 Jonas Maebe Note Edited: 0057438
2012-03-10 22:58 Jonas Maebe Note Edited: 0057438
2012-04-23 07:10 Paul Ishenin Status new => assigned
2012-04-23 07:10 Paul Ishenin Assigned To => Paul Ishenin
2013-02-14 10:02 Paul Ishenin Fixed in Revision => 23613
2013-02-14 10:04 Paul Ishenin Note Added: 0065680
2013-02-14 10:05 Paul Ishenin Status assigned => resolved
2013-02-14 10:05 Paul Ishenin Fixed in Version => 2.7.1
2013-02-14 10:05 Paul Ishenin Resolution open => fixed
2013-02-14 11:08 Mark Morgan Lloyd Note Added: 0065682
2013-02-14 11:08 Mark Morgan Lloyd Status resolved => closed