View Issue Details

IDProjectCategoryView StatusLast Update
0029848FPCRTLpublic2017-08-04 16:54
ReporterRolf Wetjen Assigned ToMichael Van Canneyt  
PrioritynormalSeverityminorReproducibilityalways
Status closedResolutionfixed 
PlatformWindowsOSWindows 7 64 
Product Version3.0.0 
Fixed in Version3.1.1 
Summary0029848: TStrings.LoadFromStream doesn't support a unicode stream
DescriptionTStrings.SetTextStr(const Value: string) should be able to check for unicode (BOM) and convert to UTF-8 if needed.
TagsNo tags attached.
Fixed in Revision34475
FPCOldBugId
FPCTarget
Attached Files

Relationships

related to 0027088 resolvedMichael Van Canneyt BOM chars loaded as text in TStrings.LoadFromFile 

Activities

Jonas Maebe

2016-03-16 16:08

manager   ~0091122

This should at most only be done if a special option is specified (which will probably be Delphi incompatible). Otherwise you will break existing programs that load ANSI streams that happen to start with the same characters as a BOM.

Jonas Maebe

2016-03-16 16:09

manager   ~0091123

The way this is handled in Delphi is by explicitly specifying an encoding: http://docs.embarcadero.com/products/rad_studio/delphiAndcpp2009/HelpUpdate2/EN/html/delphivclwin32/Classes_TStrings_LoadFromStream@TStream@TEncoding.html

I don't think we should add our own extensions.

Michael Van Canneyt

2016-03-16 16:18

administrator   ~0091124

@Jonas:
I was planning to handle it as in Delphi: specify a codepage for the stream.

But we'll need to add an option AsUTF8 or so to force it to load UTF8-encoded strings, otherwise the default conversion to the SystemDefaultCodePage will happen.

Rolf Wetjen

2016-03-18 11:19

reporter   ~0091164

Sorry to step in as I'm a Lazarus starter. Isn't is possible to detect the stream encoding (GuessEncoding for example) and then convert to UTF8?

Michael Van Canneyt

2016-03-18 11:23

administrator   ~0091166

Everything is possible, but not necessarily desirable :)

1. Guessing is simply bad practice.
2. Not everyone wants UTF8.
   Lazarus decided to use UTF8, but not everyone uses that.

Jonas Maebe

2016-03-18 11:27

manager   ~0091167

The default conversion to DefaultSystemCodePage should always happen, that is how this class works. With Lazarus nowadays setting DefaultSystemCodePage to CP_UTF8, that is even not a problem for most Lazarus users.

Ondrej Pokorny

2016-07-11 18:50

developer   ~0093657

Patches attached. One for TEncoding object and one for TStrings object.

I kept TStrings.LoadFromStream and .SaveToStream without the Encoding parameter backwards-compatible to FPC, which means they are not compatible to Delphi's behaviour. If you want to change this, I can do that as well.

Ondrej Pokorny

2016-07-11 18:50

developer  

TEncoding-ANSI-1.patch (7,279 bytes)   
Index: rtl/objpas/sysutils/sysencoding.inc
===================================================================
--- rtl/objpas/sysutils/sysencoding.inc	(revision 33943)
+++ rtl/objpas/sysutils/sysencoding.inc	(working copy)
@@ -22,6 +22,29 @@
   Result := FStandardEncodings[seAnsi];
 end;
 
+function TEncoding.GetAnsiBytes(const S: string): TBytes;
+begin
+  Result := GetAnsiBytes(S, 1, Length(S));
+end;
+
+function TEncoding.GetAnsiBytes(const S: string; CharIndex, CharCount: Integer
+  ): TBytes;
+begin
+  Result := GetAnsiBytes(Pointer(@S[CharIndex]), CharCount);
+end;
+
+function TEncoding.GetAnsiString(const Bytes: TBytes): string;
+begin
+  Result := GetAnsiString(Bytes, 0, Length(Bytes));
+end;
+
+function TEncoding.GetAnsiString(const Bytes: TBytes; ByteIndex,
+  ByteCount: Integer): string;
+begin
+  Result := GetAnsiString(Pointer(@Bytes[ByteIndex]), ByteCount);
+  SetCodePage(RawByteString(Result), DefaultSystemCodePage, False);
+end;
+
 class function TEncoding.GetASCII: TEncoding;
 begin
   if not Assigned(FStandardEncodings[seAscii]) then
@@ -393,6 +416,25 @@
   Result := TMBCSEncoding.Create(FCodePage, FMBToWCharFlags, FWCharToMBFlags);
 end;
 
+function TMBCSEncoding.GetAnsiBytes(Chars: PChar; CharCount: Integer): TBytes;
+var
+  S: RawByteString;
+begin
+  SetString(S, Chars, CharCount);
+  SetCodePage(S, DefaultSystemCodePage, False);
+  SetCodePage(S, GetCodePage, True);
+  SetLength(Result, Length(S));
+  if Length(S)>0 then
+    Move(S[1], Result[0], Length(S));
+end;
+
+function TMBCSEncoding.GetAnsiString(Bytes: PByte; ByteCount: Integer): string;
+begin
+  SetString(Result, Pointer(Bytes), ByteCount);
+  SetCodePage(RawByteString(Result), GetCodePage, False);
+  SetCodePage(RawByteString(Result), DefaultSystemCodePage, True);
+end;
+
 function TMBCSEncoding.GetMaxByteCount(CharCount: Integer): Integer;
 begin
   Result := CharCount;
@@ -515,6 +557,23 @@
   Result := TUnicodeEncoding.Create;
 end;
 
+function TUnicodeEncoding.GetAnsiBytes(Chars: PChar; CharCount: Integer
+  ): TBytes;
+var
+  U: UnicodeString;
+begin
+  widestringmanager.Ansi2UnicodeMoveProc(Chars, DefaultSystemCodePage, U, CharCount);
+  SetLength(Result, Length(U)*SizeOf(UnicodeChar));
+  if Length(Result)>0 then
+    Move(U[1], Result[0], Length(Result));
+end;
+
+function TUnicodeEncoding.GetAnsiString(Bytes: PByte; ByteCount: Integer
+  ): string;
+begin
+  widestringmanager.Unicode2AnsiMoveProc(PUnicodeChar(Bytes), RawByteString(Result), DefaultSystemCodePage, ByteCount div SizeOf(UnicodeChar));
+end;
+
 function TUnicodeEncoding.GetMaxByteCount(CharCount: Integer): Integer;
 begin
   Result := CharCount * SizeOf(UnicodeChar);
@@ -586,6 +645,28 @@
   Result := TBigEndianUnicodeEncoding.Create;
 end;
 
+function TBigEndianUnicodeEncoding.GetAnsiBytes(Chars: PChar; CharCount: Integer
+  ): TBytes;
+begin
+  Result := TEncoding.Unicode.GetAnsiBytes(Chars, CharCount);
+  Swap(Result);
+end;
+
+function TBigEndianUnicodeEncoding.GetAnsiString(Bytes: PByte;
+  ByteCount: Integer): string;
+var
+  B: TBytes;
+begin
+  if ByteCount=0 then
+    Exit('');
+
+  SetLength(B, ByteCount);
+  Move(Bytes^, B[0], ByteCount);
+  Swap(B);
+
+  Result := TEncoding.Unicode.GetAnsiString(PByte(@B[0]), ByteCount);
+end;
+
 function TBigEndianUnicodeEncoding.GetPreamble: TBytes;
 begin
   SetLength(Result, 2);
@@ -592,4 +673,21 @@
   Result[0] := $FE;
   Result[1] := $FF;
 end;
+
+procedure TBigEndianUnicodeEncoding.Swap(var B: TBytes);
+var
+  LastB, I: Integer;
+  C: Byte;
+begin
+  LastB := Length(B)-1;
+  I := 0;
+  while I < LastB do
+    begin
+      C := B[I];
+      B[I] := B[I+1];
+      B[I+1] := C;
+      Inc(I, 2);
+    end;
+end;
+
 {$endif VER2_4}
Index: rtl/objpas/sysutils/sysencodingh.inc
===================================================================
--- rtl/objpas/sysutils/sysencodingh.inc	(revision 33943)
+++ rtl/objpas/sysutils/sysencodingh.inc	(working copy)
@@ -48,6 +48,8 @@
     function GetBytes(Chars: PUnicodeChar; CharCount: Integer; Bytes: PByte; ByteCount: Integer): Integer; overload; virtual; abstract;
     function GetCharCount(Bytes: PByte; ByteCount: Integer): Integer; overload; virtual; abstract;
     function GetChars(Bytes: PByte; ByteCount: Integer; Chars: PUnicodeChar; CharCount: Integer): Integer; overload; virtual; abstract;
+    function GetAnsiBytes(Chars: PChar; CharCount: Integer): TBytes; virtual; abstract;
+    function GetAnsiString(Bytes: PByte; ByteCount: Integer): string; virtual; abstract;
     function GetCodePage: Cardinal; virtual; abstract;
     function GetEncodingName: UnicodeString; virtual; abstract;
   public
@@ -83,6 +85,10 @@
     function GetPreamble: TBytes; virtual; abstract;
     function GetString(const Bytes: TBytes): UnicodeString; overload;
     function GetString(const Bytes: TBytes; ByteIndex, ByteCount: Integer): UnicodeString; overload;
+    function GetAnsiBytes(const S: string): TBytes; overload;
+    function GetAnsiBytes(const S: string; CharIndex, CharCount: Integer): TBytes; overload;
+    function GetAnsiString(const Bytes: TBytes): string; overload;
+    function GetAnsiString(const Bytes: TBytes; ByteIndex, ByteCount: Integer): string; overload;
 
     property CodePage: Cardinal read GetCodePage;
     property EncodingName: UnicodeString read GetEncodingName;
@@ -109,6 +115,8 @@
     function GetBytes(Chars: PUnicodeChar; CharCount: Integer; Bytes: PByte; ByteCount: Integer): Integer; overload; override;
     function GetCharCount(Bytes: PByte; ByteCount: Integer): Integer; overload; override;
     function GetChars(Bytes: PByte; ByteCount: Integer; Chars: PUnicodeChar; CharCount: Integer): Integer; overload; override;
+    function GetAnsiBytes(Chars: PChar; CharCount: Integer): TBytes; override;
+    function GetAnsiString(Bytes: PByte; ByteCount: Integer): string; override;
     function GetCodePage: Cardinal; override;
     function GetEncodingName: UnicodeString; override;
   public
@@ -150,6 +158,8 @@
     function GetBytes(Chars: PUnicodeChar; CharCount: Integer; Bytes: PByte; ByteCount: Integer): Integer; overload; override;
     function GetCharCount(Bytes: PByte; ByteCount: Integer): Integer; overload; override;
     function GetChars(Bytes: PByte; ByteCount: Integer; Chars: PUnicodeChar; CharCount: Integer): Integer; overload; override;
+    function GetAnsiBytes(Chars: PChar; CharCount: Integer): TBytes; override;
+    function GetAnsiString(Bytes: PByte; ByteCount: Integer): string; override;
     function GetCodePage: Cardinal; override;
     function GetEncodingName: UnicodeString; override;
   public
@@ -164,8 +174,11 @@
 
   TBigEndianUnicodeEncoding = class(TUnicodeEncoding)
   strict protected
+    procedure Swap(var B: TBytes);
     function GetBytes(Chars: PUnicodeChar; CharCount: Integer; Bytes: PByte; ByteCount: Integer): Integer; overload; override;
     function GetChars(Bytes: PByte; ByteCount: Integer; Chars: PUnicodeChar; CharCount: Integer): Integer; overload; override;
+    function GetAnsiBytes(Chars: PChar; CharCount: Integer): TBytes; override;
+    function GetAnsiString(Bytes: PByte; ByteCount: Integer): string; override;
     function GetCodePage: Cardinal; override;
     function GetEncodingName: UnicodeString; override;
   public
TEncoding-ANSI-1.patch (7,279 bytes)   

Ondrej Pokorny

2016-07-11 18:51

developer  

TStrings-TEncoding-1.patch (8,522 bytes)   
Index: rtl/objpas/classes/classesh.inc
===================================================================
--- rtl/objpas/classes/classesh.inc	(revision 33943)
+++ rtl/objpas/classes/classesh.inc	(working copy)
@@ -599,6 +599,8 @@
 
   TStrings = class(TPersistent)
   private
+    FDefaultEncoding: TEncoding;
+    FEncoding: TEncoding;
     FSpecialCharsInited : boolean;
     FQuoteChar : Char;
     FDelimiter : Char;
@@ -609,10 +611,13 @@
     FSkipLastLineBreak : Boolean;
     FStrictDelimiter : Boolean;
     FLineBreak : String;
+    FWriteBOM: Boolean;
     function GetCommaText: string;
     function GetName(Index: Integer): string;
     function GetValue(const Name: string): string;
     Function GetLBS : TTextLineBreakStyle;
+    procedure SetDefaultEncoding(const ADefaultEncoding: TEncoding);
+    procedure SetEncoding(const AEncoding: TEncoding);
     Procedure SetLBS (AValue : TTextLineBreakStyle);
     procedure ReadData(Reader: TReader);
     procedure SetCommaText(const Value: string);
@@ -654,6 +659,7 @@
     Class Function GetNextLine (Const Value : String; Var S : String; Var P : Integer) : Boolean;
     Function GetNextLinebreak (Const Value : String; Var S : String; Var P : Integer) : Boolean;
   public
+    constructor Create;
     destructor Destroy; override;
     function Add(const S: string): Integer; virtual; overload;
     function AddObject(const S: string; AObject: TObject): Integer; virtual; overload;
@@ -681,11 +687,15 @@
     procedure Insert(Index: Integer; const S: string); virtual; abstract;
     procedure InsertObject(Index: Integer; const S: string;
       AObject: TObject);
-    procedure LoadFromFile(const FileName: string); virtual;
-    procedure LoadFromStream(Stream: TStream); virtual;
+    procedure LoadFromFile(const FileName: string); overload; virtual;
+    procedure LoadFromFile(const FileName: string; AEncoding: TEncoding); overload; virtual;
+    procedure LoadFromStream(Stream: TStream); overload; virtual;
+    procedure LoadFromStream(Stream: TStream; AEncoding: TEncoding); overload; virtual;
     procedure Move(CurIndex, NewIndex: Integer); virtual;
-    procedure SaveToFile(const FileName: string); virtual;
-    procedure SaveToStream(Stream: TStream); virtual;
+    procedure SaveToFile(const FileName: string); overload; virtual;
+    procedure SaveToFile(const FileName: string; AEncoding: TEncoding); overload; virtual;
+    procedure SaveToStream(Stream: TStream); overload; virtual;
+    procedure SaveToStream(Stream: TStream; AEncoding: TEncoding); overload; virtual;
     procedure SetText(TheText: PChar); virtual;
     procedure GetNameValue(Index : Integer; Out AName,AValue : String);
     function  ExtractName(Const S:String):String;
@@ -692,6 +702,8 @@
     Property TextLineBreakStyle : TTextLineBreakStyle Read GetLBS Write SetLBS;
     property Delimiter: Char read GetDelimiter write SetDelimiter;
     property DelimitedText: string read GetDelimitedText write SetDelimitedText;
+    property DefaultEncoding: TEncoding read FDefaultEncoding write SetDefaultEncoding;
+    property Encoding: TEncoding read FEncoding write SetEncoding;
     property LineBreak : string Read GetLineBreak write SetLineBreak;
     Property StrictDelimiter : Boolean Read FStrictDelimiter Write FStrictDelimiter;
     property QuoteChar: Char read GetQuoteChar write SetQuoteChar;
@@ -707,6 +719,7 @@
     property Text: string read GetTextStr write SetTextStr;
     property StringsAdapter: IStringsAdapter read FAdapter write SetStringsAdapter;
     Property SkipLastLineBreak : Boolean Read GetSkipLastLineBreak Write SetSkipLastLineBreak;
+    property WriteBOM: Boolean read FWriteBOM write FWriteBOM;
   end;
 
 { TStringList class }
Index: rtl/objpas/classes/stringl.inc
===================================================================
--- rtl/objpas/classes/stringl.inc	(revision 33943)
+++ rtl/objpas/classes/stringl.inc	(working copy)
@@ -112,6 +112,19 @@
   FDelimiter:=c;
 end;
 
+Procedure TStrings.SetEncoding(const AEncoding: TEncoding);
+begin
+  if (FEncoding<>nil) and not TEncoding.IsStandardEncoding(FEncoding) then
+    FEncoding.Free;
+
+  if TEncoding.IsStandardEncoding(AEncoding) then
+    FEncoding:=AEncoding
+  else if AEncoding<>nil then
+    FEncoding:=AEncoding.Clone
+  else
+    FEncoding:=nil;
+end;
+
 Function TStrings.GetDelimiter : Char;
 begin
   CheckSpecialChars;
@@ -434,6 +447,21 @@
 
 
 
+Procedure TStrings.SetDefaultEncoding(const ADefaultEncoding: TEncoding);
+begin
+  if (FDefaultEncoding<>nil) and not TEncoding.IsStandardEncoding(FDefaultEncoding) then
+    FDefaultEncoding.Free;
+
+  if TEncoding.IsStandardEncoding(ADefaultEncoding) then
+    FDefaultEncoding:=ADefaultEncoding
+  else if ADefaultEncoding<>nil then
+    FDefaultEncoding:=ADefaultEncoding.Clone
+  else
+    FDefaultEncoding:=TEncoding.Default;
+end;
+
+
+
 Procedure TStrings.SetValue(const Name, Value: string);
 
 Var L : longint;
@@ -679,10 +707,22 @@
 destructor TSTrings.Destroy;
 
 begin
+  if (FEncoding<>nil) and not TEncoding.IsStandardEncoding(FEncoding) then
+    FreeAndNil(FEncoding);
+  if (FDefaultEncoding<>nil) and not TEncoding.IsStandardEncoding(FDefaultEncoding) then
+    FreeAndNil(FDefaultEncoding);
+
   inherited destroy;
 end;
 
 
+constructor TStrings.Create;
+begin
+  inherited Create;
+  FDefaultEncoding:=TEncoding.Default;
+  FEncoding:=nil;
+  FWriteBOM:=True;
+end;
 
 Function TStrings.Add(const S: string): Integer;
 
@@ -784,6 +824,9 @@
       FNameValueSeparator:=S.FNameValueSeparator;
       FLBS:=S.FLBS;
       FLineBreak:=S.FLineBreak;
+      FWriteBOM:=S.FWriteBOM;
+      DefaultEncoding:=S.DefaultEncoding;
+      Encoding:=S.Encoding;
       AddStrings(S);
     finally
       EndUpdate;
@@ -939,6 +982,20 @@
 
 
 
+Procedure TStrings.LoadFromFile(const FileName: string; AEncoding: TEncoding);
+Var
+        TheStream : TFileStream;
+begin
+  TheStream:=TFileStream.Create(FileName,fmOpenRead or fmShareDenyWrite);
+  try
+    LoadFromStream(TheStream,AEncoding);
+  finally
+    TheStream.Free;
+  end;
+end;
+
+
+
 Procedure TStrings.LoadFromStream(Stream: TStream);
 {
    Borlands method is no good, since a pipe for
@@ -978,6 +1035,50 @@
 end;
 
 
+Procedure TStrings.LoadFromStream(Stream: TStream; AEncoding: TEncoding);
+{
+   Borlands method is no good, since a pipe for
+   instance doesn't have a size.
+   So we must do it the hard way.
+}
+Const
+  BufSize = 1024;
+  MaxGrow = 1 shl 29;
+
+Var
+  Buffer         : TBytes;
+  T              : string;
+  BytesRead,
+  BufLen,
+  I,BufDelta,
+  PreambleLength : Longint;
+begin
+  // reread into a buffer
+  beginupdate;
+  try
+    SetLength(Buffer,0);
+    BufLen:=0;
+    I:=1;
+    Repeat
+      BufDelta:=BufSize*I;
+      SetLength(Buffer,BufLen+BufDelta);
+      BytesRead:=Stream.Read(Buffer[BufLen],BufDelta);
+      inc(BufLen,BufDelta);
+      If I<MaxGrow then
+        I:=I shl 1;
+    Until BytesRead<>BufDelta;
+    SetLength(Buffer,BufLen-BufDelta+BytesRead);
+    PreambleLength:=TEncoding.GetBufferEncoding(Buffer,AEncoding,FDefaultEncoding);
+    T:=AEncoding.GetAnsiString(Buffer,PreambleLength,Length(Buffer)-PreambleLength);
+    Encoding:=AEncoding;
+    SetLength(Buffer,0);
+    SetTextStr(T);
+  finally
+    EndUpdate;
+  end;
+end;
+
+
 Procedure TStrings.Move(CurIndex, NewIndex: Integer);
 Var
   Obj : TObject;
@@ -1012,18 +1113,59 @@
 
 
 
+Procedure TStrings.SaveToFile(const FileName: string; AEncoding: TEncoding);
+
+Var TheStream : TFileStream;
+
+begin
+  TheStream:=TFileStream.Create(FileName,fmCreate);
+  try
+    SaveToStream(TheStream,AEncoding);
+  finally
+    TheStream.Free;
+  end;
+end;
+
+
+
 Procedure TStrings.SaveToStream(Stream: TStream);
 Var
   S : String;
 begin
-  S:=Text;
-  if S = '' then Exit;
-  Stream.WriteBuffer(Pointer(S)^,Length(S));
+  if Encoding<>nil then
+    SaveToStream(Stream,Encoding)
+  else
+  begin
+    S:=Text;
+    if S = '' then Exit;
+    Stream.WriteBuffer(Pointer(S)^,Length(S));
+  end;
 end;
 
 
 
 
+Procedure TStrings.SaveToStream(Stream: TStream; AEncoding: TEncoding);
+
+Var B : TBytes;
+
+begin
+  if AEncoding=nil then
+    AEncoding:=FDefaultEncoding;
+  if FWriteBOM then
+    begin
+      B:=AEncoding.GetPreamble;
+      if Length(B)>0 then
+        Stream.WriteBuffer(B[0],Length(B));
+    end;
+  B:=AEncoding.GetAnsiBytes(Text);
+  if Length(B)>0 then
+    Stream.WriteBuffer(B[0],Length(B));
+end;
+
+
+
+
 Procedure TStrings.SetText(TheText: PChar);
 
 Var S : String;
TStrings-TEncoding-1.patch (8,522 bytes)   

Michael Van Canneyt

2016-09-10 13:01

administrator   ~0094530

Applied the patches from Ondrej.

I did make Encoding read-only. The Embarcadero documentation states that this is a read-only property which specifies how the strings were read from stream.

Issue History

Date Modified Username Field Change
2016-03-16 13:05 Rolf Wetjen New Issue
2016-03-16 15:59 Michael Van Canneyt Assigned To => Michael Van Canneyt
2016-03-16 15:59 Michael Van Canneyt Status new => assigned
2016-03-16 16:08 Jonas Maebe Note Added: 0091122
2016-03-16 16:09 Jonas Maebe Note Added: 0091123
2016-03-16 16:18 Michael Van Canneyt Note Added: 0091124
2016-03-18 11:19 Rolf Wetjen Note Added: 0091164
2016-03-18 11:23 Michael Van Canneyt Note Added: 0091166
2016-03-18 11:27 Jonas Maebe Note Added: 0091167
2016-07-11 18:50 Ondrej Pokorny Note Added: 0093657
2016-07-11 18:50 Ondrej Pokorny File Added: TEncoding-ANSI-1.patch
2016-07-11 18:51 Ondrej Pokorny File Added: TStrings-TEncoding-1.patch
2016-09-10 13:01 Michael Van Canneyt Fixed in Revision => 34475
2016-09-10 13:01 Michael Van Canneyt Note Added: 0094530
2016-09-10 13:01 Michael Van Canneyt Status assigned => resolved
2016-09-10 13:01 Michael Van Canneyt Fixed in Version => 3.1.1
2016-09-10 13:01 Michael Van Canneyt Resolution open => fixed
2016-09-10 22:04 Michael Van Canneyt Relationship added related to 0027088
2017-08-04 16:54 Rolf Wetjen Status resolved => closed