View Issue Details

IDProjectCategoryView StatusLast Update
0025801FPCDatabasepublic2018-06-28 11:51
ReporterJan BaumgartenAssigned ToMichael Van Canneyt 
PrioritynormalSeverityminorReproducibilityalways
Status resolvedResolutionfixed 
PlatformOSWindowsOS VersionWindows 7
Product Version2.6.2Product Build 
Target Version3.2.0Fixed in Version3.1.1 
Summary0025801: TStringField may return wrong size for TStringField.DataSize
DescriptionTStringField assumes that DataSize is always Size + 1 which is not always true for UTF-8 encoded Strings.

The FCL documentation states:
Size [...] is the declared maximum size of the string (in characters) [...].

So a TStringField where TStringField.Size is 5 should be able to store the String 'äöüßó', which has a Size of 10 bytes if it is encoded in UTF-8.
Unfortunately this is not possible with the current implementation of TStringField because in this case TStringField returns 6 for TStringField.GetDataSize which is too small for storing 10 Bytes of character information.

This leads to an ambiguity when using the TStringField.Size value for further processing because application programmers might check it to see wether a string fits into a field.
Tagsunicode, utf-8, utf8
Fixed in Revision
FPCOldBugId
FPCTarget
Attached Files

Relationships

related to 0017376 resolvedLacaK FPC TSQLite3Connection not show whole content for string field when the field is asia language 
has duplicate 0024745 closedReinier Olislagers FPC Losing data when saving Database fileds with "Size" defined and UTF8 chars 
has duplicate 0025698 resolvedReinier Olislagers Lazarus Maxlength of TDBEdit wrong when string fields in dataset are UTF8 
related to 0027766 assignedMichael Van Canneyt FPC TFieldDef does not support Multibyte character sets 

Activities

LacaK

2014-03-01 17:05

developer   ~0073376

Last edited: 2014-03-01 17:28

View 2 revisions

Related also to 0017376, 0024745, 0025698
This is well known problem, which IMO comes from old Delphi's time, where was assumed that string field holds SBCS string (1char=1byte).
When Lazarus starts using UTF-8 for DB-aware controls, users starts using also UTF-8 as client charset, so from DB backend commes UTF-8 encoded string data , which can not be properly stored in fields buffers.

As mentioned there are 2 properties for TField:
- Size which must be max.number of CHARACTERs - is used by DB-aware controls to set MaxLength to limit size of text written into control
- DataSize which must max.number of BYTEs - is used to allocate buffers to store strings (in field or in record buffer).

So only solutions, which I see is break hardcoded relation between Size and DataSize (DataSize:=Size+1) and :
- let set this two properties independently.
- or let DataSize:=Size*4+1 (to allocate sufficient buffer length)
- or introduce CharSet property at TField and TFieldDef level and derive from CharSet name required "character size"

(I will prepare draft for discussion)

Thaddy de Koning

2014-03-04 16:11

reporter   ~0073416

I think it is sufficient to have the front-end fields behave like fieldlength=maxfieldlength for the charset and trim.
If the database driver is able to store it more efficiently, that's a bonus.
Sqlite, f.e. is fieldlength agnostic in its storage.
But it is also a documentation issue.

Michael Van Canneyt

2014-03-05 15:48

administrator   ~0073481

The documentation is correct. That size=characters + 1 is correct.

TStringField is not designed for, and cannot be used to store, UTF-8 data.
The dataset must map UTF-8 data to widestring fields and perform a conversion,
not map it to tstringfield.

As for SQLite: it is completely unsuitable for use in Pascal.
It guarantees no data constraints or type safety. Anyone using it is on his own.

Jan Baumgarten

2014-03-06 00:01

reporter   ~0073492

Last edited: 2014-03-06 00:28

View 4 revisions

I think, I have to disagree. At least from a Delphi point of view. I think that TStringField in Delphi was designed to use what Embarcadero and Microsoft call AnsiStrings and TWideStringField was designed to use what Embarcadero and Microsoft call a WideString.

But even for AnsiString the assumption of one Character = 1 Byte is wrong because Microsofts supports what they call multi byte character sets (MBCS) like for example Shift-JIS and Big5 with their AnsiString API. And so in my opinion does Delphi.

WideString, as far as I understand it, always means UTF16. UTF8 fits more into the MBCS scheme of things and with this into the AnsiString, I think. Microsoft even assigned it the codpage number 65001 for compatibility reasons. Also UTF8String is a special version of an AnsiString with newer Unicode enabled versions of Delphi.

Maybe the problem is that the FCL TStringField tries to be agnostic to the encoding whereas the VCL TStringField probably never has been. When the VCL can ask Windows which maximum character width to expect, the FCL can not. So I think there should be a way to inform the FCL TStringField what maximum character width to expect or to reduce the Size property without reducing the DataSize property.

Michael Van Canneyt

2014-03-06 10:05

administrator   ~0073497

You are free to disagree. MBCS is just as much a hack.
The RTL and FCL have never supported it (neither has the VCL, just ask length of
a MBCS string, and you'll get the size)

For for ansistring 1 char = 1 byte, by definition. We do not support MBCS, and hence UTF-8 will not go into TStringField.

Jan Baumgarten

2014-03-06 15:06

reporter   ~0073514

If MBCS is a hack, then UTF-16 is a hack as well because it uses surrogate pairs for encoding characters that are not on the basic multilingual plane. And as UTF-8 and UTF-16 surely will stay for a while, it seems logical to have ways to deal with them.

TWideStringField has similar limitations when it comes to representing characters that do not fit into the basic multilingual plane as TStringField has, when it comes to representing MBCS.

Given, that TWideStringField introduces unnecessary conversions between UTF-16 and UTF-8 when used with the LCL and also doesn't support the whole Unicode character set, that may be stored in a UTF-8 encoded database, it seems like a second best fix.

So if you don't want to change TStringField and TWideStringField, maybe another solution is to have a TUTF8StringField and maybe a TUTF16StringField in the FCL. I could try to make a suggestion for an implementation.

It seems that this was already discussed five years ago but has not come to any conclusions yet: https://www.mail-archive.com/fpc-devel@lists.freepascal.org/msg14249.html

If this doesn't solve, developers of database access components will have three options - keep things as they are and maybe work with TStrigField - or - use TWideStringField and do not support the whole range of Unicode - or - have their own implementation of something like UTF8StringField with all the additional hassle that comes from having no support by the FCL...

Michael Van Canneyt

2014-03-06 15:22

administrator   ~0073516

I'm not going to enter a discussion of the various benefits of MCBS, UTF16 and whatnot. Suffice it to say that character sets outside the basic multilingual plane are not consistently supported in FPC. We'd have to revise the complete RTL/FCL/whatnot to do so.

But, as it happens, the creation of a TUTF8stringField and corresponding ftUTF8string is being looked at;

Additionally, we're looking at a property UnicodeConversion for TSQLConnection:

TUnicodeConversion = (ucWidestring,ucUTF8string,ucSizeBased);

Property UnicodeConversion : TUnicodeConversion;

This would control what is being done with unicode data, respectively:
- Always convert (utf16/utf8) to widestring
- Always convert (utf16/utf8) to new UTF8String type
- Convert UTF16 to widestring, UTF8 to new UTF8String data

In each case, converting UTF8 to ftString is simply wrong.

If you want to take a stab at the above proposal, I'm sure we'll accept patches.

Jan Baumgarten

2014-03-07 14:10

reporter   ~0073532

I will make a try for TUTF8StringField and depending on my success there I will try for TSQLConnection. As you talk about the new UTF8String type I assume that these things should be done for the trunk version of fpc.
I will set up a develoment environment during the weekend so I am sure you will see some mail by me on the FPC-Devel mailing list by the end of next week.

Stratis Aravias

2014-04-09 01:19

reporter   ~0074217

@Michael Van Canneyt (0073516)
what do you mean with "..outside the basic multiningual plane..", english only ?

Reinier Olislagers

2014-04-09 09:59

developer   ~0074220

@0073481:
"The documentation is correct. That size=characters + 1 is correct.

TStringField is not designed for, and cannot be used to store, UTF-8 data."

This *must* be documented somewhere then. Users cannot be expected to guess this.

Stratis Aravias

2014-04-09 11:18

reporter   ~0074224

Last edited: 2014-04-09 11:19

View 2 revisions

ok, i got it (or not ?)...
1. i should forget the TStringField for my case (MySql with utf8_general_ci database) and continue without it... till we have the TUTF8StringField. (in my case TStringField truncates the buf to size chars with buf[size]=#0; which results in getting 10 chars when i store 20 greek chars in a VARCHAR[20] field)

2. The TStringField.size documentation could (or should) clear this... and mention that you assume that 1 char=1 byte and developer should forget utf8 fields containing multibyte chars because data will be truncated in size bytes (as i see in source: buf[size]=#0;).

3. Anwser me (someone) please at least on this:
is it SAFE to read/write a_utf8_field from MySQL db with TField e.g.
  qry.getField('a_utf8_field').AsString
without truncation problems in MySQL ?
from my tests till now i see that i can, is this right or i missed something ?

merci...

LacaK

2014-04-09 14:04

developer   ~0074226

Can you please ask it in maillist or forum? As it may require more explanation.

Michael Van Canneyt

2014-04-09 14:31

administrator   ~0074228

@Reinier: The main problem is that it depends on the dataset descendant whether the TStringField can contain a UTF8 string. For example, SQLDB does try to allocate buffers of size 4*(declared char size).

@Stratis: The above means that SQLDB's TStringField will happily store the data of a UTF8 field (you have observed this), but you will be able to store a string in it (through AsString :=XYZ) that is too long for MySQL: SQLDB Will report that the size is OK, but when performing an update, mysql will truncate the data because the number of actual UT8 characters exceeds the declared size.

In short: you cannot make assumptions on max number number of characters (Size/Data etc may not be consistent) for UTF8.

That is another reason why we need the upcoming TUTF8Field: to perform the correct length checks prior to posting. This cannot be done correctly in TStringField as it is now.

Stratis Aravias

2014-04-09 15:26

reporter   ~0074229

a) I create a TStringField for read/write a MySQL utf8 VARCHAR[4] field.
When i try to read 'ωωωω' in a TStringField (e.g. qry_my_field: TStringField that is declared as VARCHAR[4] utf8 in db) i read/write 'ωω' values with TStringField.

b) Next, i delete the qry_my_field: TStringField and i use the qry.FieldByName('my_field').AsString (auto created field?) and i get 'ωωωω' which is ok. I can save it also, so am i safe ?

Reinier Olislagers

2014-04-10 10:45

developer   ~0074256

@Michael: that's fine. Then that documentation belongs at the dataset descendent level I suppose.

EgonHugeist/ZeosDevTeam

2014-09-18 01:16

reporter   ~0077343

Last edited: 2014-09-18 11:44

View 2 revisions

@Michael

A TUTF8Field? Why? TField.DataSize is size in Bytes. You should determine CharacterSet of a Column/ResultSet and align the DataSize/DisplaySize instead. Of course Delphi did assume 1Byte = 1Char where i do NOT see an issue here. Introducing a "upcoming" TUTF8Field would simply kill all compatibility. Do you plan such ideas for FireBird Big5 or other Far-East MBCS CharacterSets too?

I Hope Soo or your approach simply is wrong from beginning!

AFAIK is DB.pas not made to handle LCL related issues(where i think the most reports in this domain are comming from).

Just check some implementations i made on Zeos. We do NOT have trunkations because of TField.Size is aligned to MaxBytes of the Connection/Column-CharSet. Sometimes my TField.DisplaySize alignment leaks for OS's with MBCS DefaultSystemCodePage.

LacaK

2018-05-16 16:27

developer   ~0108344

This should be fixed in trunk, where CodePage aware TStringField exists

Michael Van Canneyt

2018-06-28 11:51

administrator   ~0109104

As remarked by Laco: the new support for codepage aware connections and TStringField should solve this.

Issue History

Date Modified Username Field Change
2014-03-01 13:18 Jan Baumgarten New Issue
2014-03-01 13:18 Jan Baumgarten Status new => assigned
2014-03-01 13:18 Jan Baumgarten Assigned To => Joost van der Sluis
2014-03-01 17:05 LacaK Note Added: 0073376
2014-03-01 17:28 LacaK Note Edited: 0073376 View Revisions
2014-03-02 09:45 Reinier Olislagers Tag Attached: unicode
2014-03-02 09:45 Reinier Olislagers Tag Attached: utf-8
2014-03-02 09:45 Reinier Olislagers Tag Attached: utf8
2014-03-02 09:45 Reinier Olislagers Relationship added related to 0017376
2014-03-02 09:45 Reinier Olislagers Relationship added related to 0024745
2014-03-02 09:45 Reinier Olislagers Relationship added related to 0025698
2014-03-02 09:52 Reinier Olislagers Relationship replaced has duplicate 0024745
2014-03-02 09:53 Reinier Olislagers Relationship replaced has duplicate 0025698
2014-03-04 16:11 Thaddy de Koning Note Added: 0073416
2014-03-05 15:48 Michael Van Canneyt Note Added: 0073481
2014-03-05 15:48 Michael Van Canneyt Assigned To Joost van der Sluis => Michael Van Canneyt
2014-03-06 00:01 Jan Baumgarten Note Added: 0073492
2014-03-06 00:06 Jan Baumgarten Note Edited: 0073492 View Revisions
2014-03-06 00:14 Jan Baumgarten Note Edited: 0073492 View Revisions
2014-03-06 00:28 Jan Baumgarten Note Edited: 0073492 View Revisions
2014-03-06 10:05 Michael Van Canneyt Note Added: 0073497
2014-03-06 15:06 Jan Baumgarten Note Added: 0073514
2014-03-06 15:22 Michael Van Canneyt Note Added: 0073516
2014-03-07 14:10 Jan Baumgarten Note Added: 0073532
2014-04-09 01:19 Stratis Aravias Note Added: 0074217
2014-04-09 09:59 Reinier Olislagers Note Added: 0074220
2014-04-09 11:18 Stratis Aravias Note Added: 0074224
2014-04-09 11:19 Stratis Aravias Note Edited: 0074224 View Revisions
2014-04-09 14:04 LacaK Note Added: 0074226
2014-04-09 14:31 Michael Van Canneyt Note Added: 0074228
2014-04-09 15:26 Stratis Aravias Note Added: 0074229
2014-04-10 10:45 Reinier Olislagers Note Added: 0074256
2014-09-18 01:16 EgonHugeist/ZeosDevTeam Note Added: 0077343
2014-09-18 11:44 EgonHugeist/ZeosDevTeam Note Edited: 0077343 View Revisions
2015-04-06 18:40 Michael Van Canneyt Relationship added related to 0027766
2018-05-16 16:27 LacaK Note Added: 0108344
2018-06-28 11:51 Michael Van Canneyt Note Added: 0109104
2018-06-28 11:51 Michael Van Canneyt Status assigned => resolved
2018-06-28 11:51 Michael Van Canneyt Fixed in Version => 3.1.1
2018-06-28 11:51 Michael Van Canneyt Resolution open => fixed
2018-06-28 11:51 Michael Van Canneyt Target Version => 3.2.0