View Issue Details
|ID||Project||Category||View Status||Date Submitted||Last Update|
|0025801||FPC||Database||public||2014-03-01 13:18||2018-06-28 11:51|
|Reporter||Jan Baumgarten||Assigned To||Michael Van Canneyt|
|Platform||OS||Windows||OS Version||Windows 7|
|Product Version||2.6.2||Product Build|
|Target Version||3.2.0||Fixed in Version||3.1.1|
|Summary||0025801: TStringField may return wrong size for TStringField.DataSize|
|Description||TStringField assumes that DataSize is always Size + 1 which is not always true for UTF-8 encoded Strings.|
The FCL documentation states:
Size [...] is the declared maximum size of the string (in characters) [...].
So a TStringField where TStringField.Size is 5 should be able to store the String 'äöüßó', which has a Size of 10 bytes if it is encoded in UTF-8.
Unfortunately this is not possible with the current implementation of TStringField because in this case TStringField returns 6 for TStringField.GetDataSize which is too small for storing 10 Bytes of character information.
This leads to an ambiguity when using the TStringField.Size value for further processing because application programmers might check it to see wether a string fits into a field.
|Tags||unicode, utf-8, utf8|
|Fixed in Revision|
|related to||0017376||resolved||LacaK||FPC||TSQLite3Connection not show whole content for string field when the field is asia language|
|has duplicate||0024745||closed||FPC||Losing data when saving Database fileds with "Size" defined and UTF8 chars|
|has duplicate||0025698||resolved||Lazarus||Maxlength of TDBEdit wrong when string fields in dataset are UTF8|
|related to||0027766||assigned||Michael Van Canneyt||FPC||TFieldDef does not support Multibyte character sets|
Related also to 0017376, 0024745, 0025698
This is well known problem, which IMO comes from old Delphi's time, where was assumed that string field holds SBCS string (1char=1byte).
When Lazarus starts using UTF-8 for DB-aware controls, users starts using also UTF-8 as client charset, so from DB backend commes UTF-8 encoded string data , which can not be properly stored in fields buffers.
As mentioned there are 2 properties for TField:
- Size which must be max.number of CHARACTERs - is used by DB-aware controls to set MaxLength to limit size of text written into control
- DataSize which must max.number of BYTEs - is used to allocate buffers to store strings (in field or in record buffer).
So only solutions, which I see is break hardcoded relation between Size and DataSize (DataSize:=Size+1) and :
- let set this two properties independently.
- or let DataSize:=Size*4+1 (to allocate sufficient buffer length)
- or introduce CharSet property at TField and TFieldDef level and derive from CharSet name required "character size"
(I will prepare draft for discussion)
I think it is sufficient to have the front-end fields behave like fieldlength=maxfieldlength for the charset and trim.
If the database driver is able to store it more efficiently, that's a bonus.
Sqlite, f.e. is fieldlength agnostic in its storage.
But it is also a documentation issue.
The documentation is correct. That size=characters + 1 is correct.
TStringField is not designed for, and cannot be used to store, UTF-8 data.
The dataset must map UTF-8 data to widestring fields and perform a conversion,
not map it to tstringfield.
As for SQLite: it is completely unsuitable for use in Pascal.
It guarantees no data constraints or type safety. Anyone using it is on his own.
I think, I have to disagree. At least from a Delphi point of view. I think that TStringField in Delphi was designed to use what Embarcadero and Microsoft call AnsiStrings and TWideStringField was designed to use what Embarcadero and Microsoft call a WideString.
But even for AnsiString the assumption of one Character = 1 Byte is wrong because Microsofts supports what they call multi byte character sets (MBCS) like for example Shift-JIS and Big5 with their AnsiString API. And so in my opinion does Delphi.
WideString, as far as I understand it, always means UTF16. UTF8 fits more into the MBCS scheme of things and with this into the AnsiString, I think. Microsoft even assigned it the codpage number 65001 for compatibility reasons. Also UTF8String is a special version of an AnsiString with newer Unicode enabled versions of Delphi.
Maybe the problem is that the FCL TStringField tries to be agnostic to the encoding whereas the VCL TStringField probably never has been. When the VCL can ask Windows which maximum character width to expect, the FCL can not. So I think there should be a way to inform the FCL TStringField what maximum character width to expect or to reduce the Size property without reducing the DataSize property.
You are free to disagree. MBCS is just as much a hack.
The RTL and FCL have never supported it (neither has the VCL, just ask length of
a MBCS string, and you'll get the size)
For for ansistring 1 char = 1 byte, by definition. We do not support MBCS, and hence UTF-8 will not go into TStringField.
If MBCS is a hack, then UTF-16 is a hack as well because it uses surrogate pairs for encoding characters that are not on the basic multilingual plane. And as UTF-8 and UTF-16 surely will stay for a while, it seems logical to have ways to deal with them.
TWideStringField has similar limitations when it comes to representing characters that do not fit into the basic multilingual plane as TStringField has, when it comes to representing MBCS.
Given, that TWideStringField introduces unnecessary conversions between UTF-16 and UTF-8 when used with the LCL and also doesn't support the whole Unicode character set, that may be stored in a UTF-8 encoded database, it seems like a second best fix.
So if you don't want to change TStringField and TWideStringField, maybe another solution is to have a TUTF8StringField and maybe a TUTF16StringField in the FCL. I could try to make a suggestion for an implementation.
It seems that this was already discussed five years ago but has not come to any conclusions yet: https://email@example.com/msg14249.html
If this doesn't solve, developers of database access components will have three options - keep things as they are and maybe work with TStrigField - or - use TWideStringField and do not support the whole range of Unicode - or - have their own implementation of something like UTF8StringField with all the additional hassle that comes from having no support by the FCL...
I'm not going to enter a discussion of the various benefits of MCBS, UTF16 and whatnot. Suffice it to say that character sets outside the basic multilingual plane are not consistently supported in FPC. We'd have to revise the complete RTL/FCL/whatnot to do so.
But, as it happens, the creation of a TUTF8stringField and corresponding ftUTF8string is being looked at;
Additionally, we're looking at a property UnicodeConversion for TSQLConnection:
TUnicodeConversion = (ucWidestring,ucUTF8string,ucSizeBased);
Property UnicodeConversion : TUnicodeConversion;
This would control what is being done with unicode data, respectively:
- Always convert (utf16/utf8) to widestring
- Always convert (utf16/utf8) to new UTF8String type
- Convert UTF16 to widestring, UTF8 to new UTF8String data
In each case, converting UTF8 to ftString is simply wrong.
If you want to take a stab at the above proposal, I'm sure we'll accept patches.
I will make a try for TUTF8StringField and depending on my success there I will try for TSQLConnection. As you talk about the new UTF8String type I assume that these things should be done for the trunk version of fpc.
I will set up a develoment environment during the weekend so I am sure you will see some mail by me on the FPC-Devel mailing list by the end of next week.
@Michael Van Canneyt (0073516)
what do you mean with "..outside the basic multiningual plane..", english only ?
"The documentation is correct. That size=characters + 1 is correct.
TStringField is not designed for, and cannot be used to store, UTF-8 data."
This *must* be documented somewhere then. Users cannot be expected to guess this.
ok, i got it (or not ?)...
1. i should forget the TStringField for my case (MySql with utf8_general_ci database) and continue without it... till we have the TUTF8StringField. (in my case TStringField truncates the buf to size chars with buf[size]=#0; which results in getting 10 chars when i store 20 greek chars in a VARCHAR field)
2. The TStringField.size documentation could (or should) clear this... and mention that you assume that 1 char=1 byte and developer should forget utf8 fields containing multibyte chars because data will be truncated in size bytes (as i see in source: buf[size]=#0;).
3. Anwser me (someone) please at least on this:
is it SAFE to read/write a_utf8_field from MySQL db with TField e.g.
without truncation problems in MySQL ?
from my tests till now i see that i can, is this right or i missed something ?
||Can you please ask it in maillist or forum? As it may require more explanation.|
@Reinier: The main problem is that it depends on the dataset descendant whether the TStringField can contain a UTF8 string. For example, SQLDB does try to allocate buffers of size 4*(declared char size).
@Stratis: The above means that SQLDB's TStringField will happily store the data of a UTF8 field (you have observed this), but you will be able to store a string in it (through AsString :=XYZ) that is too long for MySQL: SQLDB Will report that the size is OK, but when performing an update, mysql will truncate the data because the number of actual UT8 characters exceeds the declared size.
In short: you cannot make assumptions on max number number of characters (Size/Data etc may not be consistent) for UTF8.
That is another reason why we need the upcoming TUTF8Field: to perform the correct length checks prior to posting. This cannot be done correctly in TStringField as it is now.
a) I create a TStringField for read/write a MySQL utf8 VARCHAR field.
When i try to read 'ωωωω' in a TStringField (e.g. qry_my_field: TStringField that is declared as VARCHAR utf8 in db) i read/write 'ωω' values with TStringField.
b) Next, i delete the qry_my_field: TStringField and i use the qry.FieldByName('my_field').AsString (auto created field?) and i get 'ωωωω' which is ok. I can save it also, so am i safe ?
|@Michael: that's fine. Then that documentation belongs at the dataset descendent level I suppose.|
A TUTF8Field? Why? TField.DataSize is size in Bytes. You should determine CharacterSet of a Column/ResultSet and align the DataSize/DisplaySize instead. Of course Delphi did assume 1Byte = 1Char where i do NOT see an issue here. Introducing a "upcoming" TUTF8Field would simply kill all compatibility. Do you plan such ideas for FireBird Big5 or other Far-East MBCS CharacterSets too?
I Hope Soo or your approach simply is wrong from beginning!
AFAIK is DB.pas not made to handle LCL related issues(where i think the most reports in this domain are comming from).
Just check some implementations i made on Zeos. We do NOT have trunkations because of TField.Size is aligned to MaxBytes of the Connection/Column-CharSet. Sometimes my TField.DisplaySize alignment leaks for OS's with MBCS DefaultSystemCodePage.
||This should be fixed in trunk, where CodePage aware TStringField exists|
||As remarked by Laco: the new support for codepage aware connections and TStringField should solve this.|
|2014-03-01 13:18||Jan Baumgarten||New Issue|
|2014-03-01 13:18||Jan Baumgarten||Status||new => assigned|
|2014-03-01 13:18||Jan Baumgarten||Assigned To||=> Joost van der Sluis|
|2014-03-01 17:05||LacaK||Note Added: 0073376|
|2014-03-01 17:28||LacaK||Note Edited: 0073376||View Revisions|
||Tag Attached: unicode|
||Tag Attached: utf-8|
||Tag Attached: utf8|
||Relationship added||related to 0017376|
||Relationship added||related to 0024745|
||Relationship added||related to 0025698|
||Relationship replaced||has duplicate 0024745|
||Relationship replaced||has duplicate 0025698|
|2014-03-04 16:11||Thaddy de Koning||Note Added: 0073416|
|2014-03-05 15:48||Michael Van Canneyt||Note Added: 0073481|
|2014-03-05 15:48||Michael Van Canneyt||Assigned To||Joost van der Sluis => Michael Van Canneyt|
|2014-03-06 00:01||Jan Baumgarten||Note Added: 0073492|
|2014-03-06 00:06||Jan Baumgarten||Note Edited: 0073492||View Revisions|
|2014-03-06 00:14||Jan Baumgarten||Note Edited: 0073492||View Revisions|
|2014-03-06 00:28||Jan Baumgarten||Note Edited: 0073492||View Revisions|
|2014-03-06 10:05||Michael Van Canneyt||Note Added: 0073497|
|2014-03-06 15:06||Jan Baumgarten||Note Added: 0073514|
|2014-03-06 15:22||Michael Van Canneyt||Note Added: 0073516|
|2014-03-07 14:10||Jan Baumgarten||Note Added: 0073532|
|2014-04-09 01:19||Stratis Aravias||Note Added: 0074217|
||Note Added: 0074220|
|2014-04-09 11:18||Stratis Aravias||Note Added: 0074224|
|2014-04-09 11:19||Stratis Aravias||Note Edited: 0074224||View Revisions|
|2014-04-09 14:04||LacaK||Note Added: 0074226|
|2014-04-09 14:31||Michael Van Canneyt||Note Added: 0074228|
|2014-04-09 15:26||Stratis Aravias||Note Added: 0074229|
||Note Added: 0074256|
|2014-09-18 01:16||EgonHugeist/ZeosDevTeam||Note Added: 0077343|
|2014-09-18 11:44||EgonHugeist/ZeosDevTeam||Note Edited: 0077343||View Revisions|
|2015-04-06 18:40||Michael Van Canneyt||Relationship added||related to 0027766|
|2018-05-16 16:27||LacaK||Note Added: 0108344|
|2018-06-28 11:51||Michael Van Canneyt||Note Added: 0109104|
|2018-06-28 11:51||Michael Van Canneyt||Status||assigned => resolved|
|2018-06-28 11:51||Michael Van Canneyt||Fixed in Version||=> 3.1.1|
|2018-06-28 11:51||Michael Van Canneyt||Resolution||open => fixed|
|2018-06-28 11:51||Michael Van Canneyt||Target Version||=> 3.2.0|