After reading an empty csv file, TCSVDocument cannot read any other csv file containing valid data reporting it as empty

Original Reporter info from Mantis: sileno

Reporter name: Sileno Gödicke

Description:

Analysis and Cause of the Bug:
TCSVDocument holds a TCSVParser object in the FParser field, initially set to Nil. The parser is created by TCSVDocument at first use, in TCSVDocument.LoadFromStream(AStream: TStream) {at Line 412 of csvdocument.pp}, and it is only freed when the TCSVDocument is freed.
In TCSVDocument.LoadFromStream(AStream: TStream) the stream (filestream or other type of stream) to read from is passed to the parser calling TCSVParser.SetSource(AStream) {at Line 418 of csvdocument.pp}. In TCSVParser.SetSource(AStream) {Line 435 of csvreadwrite.pp} first a check is made to verify if this is a new stream, the stream to read from is stored in the FSourceStream field, and the parser is reset to the beginning of the stream calling TCSVParser.ResetParser {Line 454 of csvreadwrite.pp} which prepares the stream for the subsequent data readings, including the resetting of the EndOfFile flag.
Later, while reading data, the parser will reach the end of file and will set the EndOfFile flag {at Line 354 of csvreadwrite.pp}. From now on, the EndOfFile flag will remain set until it is reset by another call to ResetParser, which will happen only with a new SetSource call, i.e. there is a new file to read.

When reading a new file with TCSVDocument.LoadFromFile, a new filestream is created, data in the file are read with TCSVDocument.LoadFromStream(FileStream) and the filestream is destroyed. At this point the parser field FSourceStream is a dangling pointer, pointing to an invalid memory area, formerly assigned to the old filestream object. If we now try to read another csv file, LoadFromFile will create a new filestream which might or might not have the same memory address of the former filestream now destroyed. i.e. the new file stream might be created reusing the same memory space of the former filestream. When this happens, FParser.SetSource(AStream) compares the new filestream AStream to the old invalid pointer stored in FSourceStream {Line 437 of cvsreadwrite.pp: "If FSourceStream=AStream then exit;"} and since they are equal it will exit without calling ResetParser. As a consequence, the EndOfFile flag will remain set from the previous file access, preventing any further data reading from the new valid file.

I observed that, for reasons unclear to me, everytime a filestream is created and destroyed with no data read in between, either because the file is empty and zero byte read is reported by FFileStream.Read or because no data reading is attempted, the next filestream will be sistematically created at the same memory address. Anyway, one should make no assumption about what memory address will assigned to an object.
After an empty file is read, the next filestream will be everytime created at the same address, FParser.SetSource(AStream) will not reset the stream AStream, EndOfFile will stay set, and FParser will report the file as an empty file. Every following file reading attempt with TCSVDocument.LoadFromFile will create the filestream at the very same memory address since the former filestream was not read, FParser.SetSource(AStream) will not reset the stream again, EndOfFile will stay set and no data will be read. The story will continue forever, preventing data reading from any other file until the parser is destroyed.

The FSourceStream field of the parser stores the stream to read from but, when the stream it points to is managed (created and destroyed) by an external object or procedure, like in the case of TCSVDocument.LoadFromFile, there is no way to know inside the parser object if FSourceStream is still a valid pointer or it has become a dangling pointer, since objects are not reference counted. Actually, no assumption should be made about FSourceStream validity and therefore it should not be used for a check against other, (actually valid) stream pointers.

Bug Fix:
Two possible solutions are proposed as alternatives in the Additional Information section.

Steps to reproduce:

The bug is sistematically reproduceable as follows:

First read an empty csv file with TCSVDocument.LoadFromFile(const AFilename: String).
TCSVDocument.CSVText is now correctly an empty string;
TCSVDocument.RowCount is correctly zero;
TCSVDocument.ColCount[0] is correctly zero;

Now try to read another properly formatted csv file containing some data, again with TCSVDocument.LoadFromFile(const AFilename: String).
TCSVDocument.CSVText is now wrongly an empty string;
TCSVDocument.RowCount is wrongly zero;
TCSVDocument.ColCount[0] is wrongly zero;

After this step TCSVDocument will remain "stuck" in an EOF state till it is destroyed.

A simple test application showing the bug symptoms with two test datafiles is attached.
The empty csv file should be ANSI encoded i.e. zero byte length, if it is UTF8 encoded it will contain three bytes for the BOM and it will not trigger the issue because the BOM data will be read making this a non-empty file.

Tested on Win 7 and Win 10 with Lazarus 1.8.4 win32 and fpc 3.0.4

Additional information:

Bug Fix:
Two possible solutions are proposed as alternatives: a simple solution and a more correct solution.

Simple solution:
TCSVDocument is supposed to load the entire content of a file, or of a stream, in just one time and FParser is only used within the LoadFromStream procedure. At the end of the LoadFromFile or LoadFromStream procedures there is no reason/no need for having a parser object which is still set to a file (or stream) which possibly (certainly for the LoadFromFile procedure) does not exists anymore.
A simple solution is destroying the parser in TCSVDocument.LoadFromStream as soon as the file (or other stream) reading is completed. A new parser will be created in LoadFromStream when a new file (or stream) needs to be read.

Old Code {from Line 394 of csvdocument.pp}
procedure TCSVDocument.LoadFromStream(AStream: TStream);
var
  I, J, MaxCol: Integer;
begin
  Clear;

  if not Assigned(FParser) then
    FParser := TCSVParser.Create;

  FParser.AssignCSVProperties(Self);
  with FParser do
  begin
    SetSource(AStream);
    while ParseNextCell do
      Cells[CurrentCol, CurrentRow] := CurrentCellText;
  end;

  if FEqualColCountPerRow then
  begin
    MaxCol := MaxColCount - 1;
    for I := 0 to RowCount - 1 do
      for J := ColCount[I] to MaxCol do
        Cells[J, I] := '';
  end;
end;   

New Code {from Line 394 of csvdocument.pp}
procedure TCSVDocument.LoadFromStream(AStream: TStream);
var
  I, J, MaxCol: Integer;
begin
  Clear;

  FParser := TCSVParser.Create; // FParser is only used within this procedure

  FParser.AssignCSVProperties(Self);
  with FParser do
  begin
    SetSource(AStream);
    while ParseNextCell do
      Cells[CurrentCol, CurrentRow] := CurrentCellText;
  end;
  FreeAndNil(FParser);  // FParser is only used within this procedure

  if FEqualColCountPerRow then
  begin
    MaxCol := MaxColCount - 1;
    for I := 0 to RowCount - 1 do
      for J := ColCount[I] to MaxCol do
        Cells[J, I] := '';
  end;
end;

This simple solution is not computationally efficient, since we need to create and destroy a new parser object everytime we read a file, but it will not change anything in the parser code and its working logic.

A more correct solution:
Just delete Line 437 of cvsreadwrite.pp: "If FSourceStream=AStream then exit;" thus removing the new stream check in TCSVParser.SetSource(AStream: TStream).
This solution would remove the bug but it would slightly change the parser logic, though.
The current logic is: if the same stream the parser is currently reading from is assigned again to the parser with a TCSVParser.SetSource call, the parser will ignore the request, it will not reset the stream and it will continue the next data reading from where it was in the stream.
The new logic would be: any stream assigned to the parser (the current stream or a new stream) with a TCSVParser.SetSource call, will be reset and the next data reading will be from the stream beginning.
This is a more correct and efficient solution of the TCSVDocument.LoadFromFile bug but it is not clear to me if the change of logic could break some other application of the parser.

Debugging the FCL is not easy because it is not compiled with debug info. If you just copy csvreadwrite.pp and csvdocument.pp in the project directory and recompile it you will be able to follow the code in these two units with the debugger.

Mantis conversion info:

Mantis ID: 33885
OS: Windows
OS Build: 10
Platform: i386
Version: 3.0.4
Fixed in version: 3.1.1
Fixed in revision: 39325 (#398ab09c)
Monitored by: » sileno (Sileno Gödicke)
Target version: 3.2.0

To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information