View Issue Details

ID: 0024745
Project: FPC
Category: Database
View Status: public
Last Update: 2014-03-03 09:56
Reporter: Daniel Simões de Almeida
Assigned To: BigChimp
Priority: normal
Severity: minor
Reproducibility: always
Status: closed
Resolution: duplicate
Product Version: 2.7.1
Product Build: 20130427-win32
Target Version:
Fixed in Version:
Summary: 0024745: Losing data when saving Database fields with "Size" defined and UTF8 chars
Description: When the FieldDefs.Size property is defined and we try to save UTF8 text containing accented characters, the data is truncated.

This affects all Database components, including third-party components like ZeosDB.

  
Steps To Reproduce: Consider the TBufDataset table below:

  object BufDataset1: TBufDataset
    FileName = 'teste.dat'
    FieldDefs = <
      item
        Name = 'Cod'
        DataType = ftInteger
        Precision = 0
        Size = 5
      end
      item
        Name = 'Desc'
        DataType = ftString
        Precision = 0
        Size = 10
      end>
    left = 136
    top = 136
  end


// This code saves the data correctly:
  BufDataset1.Append;
  BufDataset1.Fields[0].AsInteger := 1;
  BufDataset1.Fields[1].AsString := '1234567890';
  BufDataset1.Post;

// This code loses the last 5 characters
// ('áéíóú' alone takes 10 bytes in UTF8, which fills Size = 10):
  BufDataset1.Append;
  BufDataset1.Fields[0].AsInteger := 2;
  BufDataset1.Fields[1].AsString := 'áéíóú12345';
  BufDataset1.Post;
Additional Information: I attached a demo to demonstrate the problem.
Tags: No tags attached.
Fixed in Revision
FPCOldBugId
FPCTarget
Attached Files

Relationships

duplicate of 0025801 (resolved, Michael Van Canneyt): TStringField may return wrong size for TStringField.DataSize
related to 0017376 (resolved, LacaK): TSQLite3Connection not show whole content for string field when the field is asia language

Activities

Daniel Simões de Almeida

2013-07-14 19:56

reporter  

dbSizeError.zip (3,597 bytes)

Daniel Simões de Almeida

2013-07-14 20:01

reporter   ~0068853

I'm using Lazarus from snapshot:
Lazarus-1.1-42056-fpc-2.7.1-20130711-win32.exe

Reinier Olislagers

2013-07-16 10:39

developer   ~0068882

Known problem: Size assumes 1 character = 1 byte, but UTF8 can use multiple bytes per character. Workaround for now: set Size = characters*4 (the maximum number of bytes per character in UTF8).
Still looking for the earlier bug report where this was reported.
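
The byte arithmetic behind this can be sketched as follows. This is only an illustration, not code from the report or from fcl-db: the helper Utf8CodePointCount and the program name are hypothetical, and the UTF8 bytes of 'áéíóú' are written as escapes so the result does not depend on the source file encoding.

```pascal
program Utf8FieldSize;

{$mode objfpc}{$H+}

// Hypothetical helper: counts UTF8 code points by skipping
// continuation bytes (bit pattern 10xxxxxx).
function Utf8CodePointCount(const S: AnsiString): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Length(S) do
    if (Ord(S[I]) and $C0) <> $80 then
      Inc(Result);
end;

const
  FieldSize = 10;             // Size from the FieldDefs in the report
  MaxBytesPerCodePoint = 4;   // UTF8 upper bound per code point

var
  Plain, Accented, Cut: AnsiString;
begin
  Plain := '1234567890';
  // 'áéíóú12345' as raw UTF8 bytes (each accented letter takes 2 bytes).
  Accented := #$C3#$A1#$C3#$A9#$C3#$AD#$C3#$B3#$C3#$BA'12345';

  Writeln('plain bytes: ', Length(Plain));                     // 10
  Writeln('accented bytes: ', Length(Accented));               // 15
  Writeln('accented code points: ', Utf8CodePointCount(Accented)); // 10

  // Cutting at FieldSize bytes keeps only the five accented letters.
  Cut := Copy(Accented, 1, FieldSize);
  Writeln('code points after cut: ', Utf8CodePointCount(Cut)); // 5

  // Size suggested by the workaround for 10 code points.
  Writeln('workaround size: ', FieldSize * MaxBytesPerCodePoint); // 40
end.
```

The cut happens to fall on a code point boundary here; with other input a byte-based cut could also split a multi-byte sequence.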

Thaddy de Koning

2013-07-16 13:24

reporter   ~0068886

Last edited: 2013-07-17 01:49

I think your workaround is not a workaround at all, but the only option to prevent truncation with UTF8. In theory, a change to a UTF8 field can *require* 4 bytes per character. Maybe a suggestion to change this into a documentation issue.

[edit]
A UTF8 fixed length field should have a fixed byte length that allows for the worst case UTF8 byte length, which is 4 bytes per UTF8 char.
If you want to have a fixed length UTF8 field, that is the price you pay.
The reporter should be aware of that. UTF8 <> ANSI, which has a one byte per char representation given a certain codepage.

If for a certain reason the one byte per char representation is important, store as ANSI and optionally store codepage information separately, either per entry by adding a codepage field or globally for a table or database.

I think this can be closed or deferred to documentation.

Reinier Olislagers

2013-07-17 10:51

developer   ~0068894

@Thaddy: thanks for your remark. However, my thoughts are that users want to define db field length in characters - they're mostly not interested in bytes. However, that's my opinion and this should be discussed by the db team.
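
To make the "length in characters" idea concrete, here is a minimal sketch of limiting a UTF8 string by code points instead of bytes. It is an assumption about what such behaviour could look like, not how fcl-db works; Utf8Prefix is a hypothetical helper name.

```pascal
program Utf8SafeTruncate;

{$mode objfpc}{$H+}

// Hypothetical helper: returns the longest prefix of S that holds at most
// MaxCodePoints code points, cutting only at a code point boundary.
function Utf8Prefix(const S: AnsiString; MaxCodePoints: Integer): AnsiString;
var
  I, Count: Integer;
begin
  Count := 0;
  I := 1;
  while I <= Length(S) do
  begin
    // A byte that is not a continuation byte (10xxxxxx) starts a code point.
    if (Ord(S[I]) and $C0) <> $80 then
    begin
      if Count = MaxCodePoints then
        Break;
      Inc(Count);
    end;
    Inc(I);
  end;
  Result := Copy(S, 1, I - 1);
end;

var
  Accented: AnsiString;
begin
  // 'áéíóú12345' as raw UTF8 bytes.
  Accented := #$C3#$A1#$C3#$A9#$C3#$AD#$C3#$B3#$C3#$BA'12345';

  // 10 code points fit, so all 15 bytes are kept.
  Writeln(Length(Utf8Prefix(Accented, 10)));  // 15
  // 3 code points are the first three accented letters, 6 bytes.
  Writeln(Length(Utf8Prefix(Accented, 3)));   // 6
end.
```

A field defined this way still needs up to 4 bytes of storage per code point, so the byte buffer behind it does not get smaller.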

Thaddy de Koning

2013-07-17 22:45

reporter   ~0068918

Last edited: 2013-07-17 22:50

A UTF8 fixed length field is Nirvana, I know. But it is a contradiction in terms.... In the context of databases. Unless you can multiply by 4.

Not interested is never! an excuse.
(sorry, may be removed but has technical content)

Reinier Olislagers

2014-03-02 09:49

developer   ~0073382

Bug describes unicode size issues; consolidating with 25801 [FPC] TStringField may return wrong size for TStringField.DataSize

Marco van de Voort

2014-03-02 22:58

manager   ~0073394

4 is the maximum size of a codepoint, not a character afaik. IIRC Thai script is an example of that. (too many accents that are combinable to create a codepoint entry for them all, so base character and accents are separate codepoints combined to one character)
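
A small sketch of that distinction, using byte escapes so the source encoding does not matter. The helper Utf8CodePointCount is hypothetical, and "one character" below means one user-perceived character as it is normally displayed, which is an assumption about rendering rather than something the code checks.

```pascal
program Utf8Clusters;

{$mode objfpc}{$H+}

// Hypothetical helper: counts UTF8 code points by skipping
// continuation bytes (bit pattern 10xxxxxx).
function Utf8CodePointCount(const S: AnsiString): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Length(S) do
    if (Ord(S[I]) and $C0) <> $80 then
      Inc(Result);
end;

var
  Latin, Thai: AnsiString;
begin
  // 'e' followed by U+0301 COMBINING ACUTE ACCENT: displayed as one letter.
  Latin := 'e'#$CC#$81;
  // Thai U+0E01 (base consonant) + U+0E35 (vowel sign) + U+0E49 (tone mark):
  // displayed as one stacked character.
  Thai := #$E0#$B8#$81#$E0#$B8#$B5#$E0#$B9#$89;

  Writeln('Latin: ', Length(Latin), ' bytes, ',
    Utf8CodePointCount(Latin), ' code points');  // Latin: 3 bytes, 2 code points
  Writeln('Thai: ', Length(Thai), ' bytes, ',
    Utf8CodePointCount(Thai), ' code points');   // Thai: 9 bytes, 3 code points
end.
```

The Thai example needs 9 bytes for what a user would count as one character, so a size of characters*4 can still be too small.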

Issue History

Date Modified Username Field Change
2013-07-14 19:56 Daniel Simões de Almeida New Issue
2013-07-14 19:56 Daniel Simões de Almeida File Added: dbSizeError.zip
2013-07-14 20:01 Daniel Simões de Almeida Note Added: 0068853
2013-07-16 10:35 Reinier Olislagers Relationship added related to 0017376
2013-07-16 10:39 Reinier Olislagers Note Added: 0068882
2013-07-16 13:24 Thaddy de Koning Note Added: 0068886
2013-07-16 13:24 Thaddy de Koning Note Edited: 0068886
2013-07-16 13:24 Thaddy de Koning Note Edited: 0068886
2013-07-17 01:48 Thaddy de Koning Note Edited: 0068886
2013-07-17 01:49 Thaddy de Koning Note Edited: 0068886
2013-07-17 10:51 Reinier Olislagers Note Added: 0068894
2013-07-17 22:45 Thaddy de Koning Note Added: 0068918
2013-07-17 22:49 Thaddy de Koning Note Edited: 0068918
2013-07-17 22:50 Thaddy de Koning Note Edited: 0068918
2014-03-02 09:45 Reinier Olislagers Relationship added related to 0025801
2014-03-02 09:49 Reinier Olislagers Note Added: 0073382
2014-03-02 09:52 Reinier Olislagers Relationship replaced duplicate of 0025801
2014-03-02 09:52 Reinier Olislagers Status new => resolved
2014-03-02 09:52 Reinier Olislagers Resolution open => duplicate
2014-03-02 09:52 Reinier Olislagers Assigned To => Reinier Olislagers
2014-03-02 22:58 Marco van de Voort Note Added: 0073394
2014-03-03 09:56 Reinier Olislagers Status resolved => closed