View Issue Details

ID: 0033666
Project: FPC
Category: Compiler
View Status: public
Last Update: 2019-01-01 17:50
Reporter: engkin
Assigned To: Jonas Maebe
Priority: normal
Severity: minor
Reproducibility: always
Status: resolved
Resolution: fixed
Product Version:
Product Build:
Target Version: 3.2.0
Fixed in Version: 3.2.0
Summary: 0033666: Assigning one character to ANSI string gives wrong .ascii output
Description: Assigning a single character to a single-byte code page string can produce a wrong result if the character's UTF8 encoding is two or more bytes long.

It seems that the internal conversion from UTF8 to the target code page does not terminate the string properly, and it leaves some UTF8-related bytes at the end of the .ascii output.
Steps To Reproduce: Check the assembly file (-al) produced for the following program (also attached):

    program Project1;
     
    {$mode objfpc}{$H+}
    {$Codepage UTF8}
     
    type
      CP437String = type ansistring(437);
     
    var
      s_cpUTF8: string;
      s_cp437_1, s_cp437_2: CP437String;
    begin
      s_cpUTF8 := '║';
      s_cp437_1 := '║'; //<--- buggy
      s_cp437_2 := '║1';
    end.

s_cp437_1 receives wrong value:
_$PROJECT1$_Ld2:
   .ascii "\272?\221\000"

the correct value should be:
   .ascii "\272\000"

while the other two variables get correct values:
    _$PROJECT1$_Ld1:
       .ascii "\342\225\221\000"

    _$PROJECT1$_Ld3:
       .ascii "\2721\000"
Additional Information: One of the two wrong values in s_cp437_1 (\221) equals the last value in s_cpUTF8.

The example character U+2551 ║ is:
code page 437: 186 = &272
UTF8: $E2 $95 $91 = &342 &225 &221
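
The byte values above can be double-checked with a short sketch. Python is used here only as a neutral calculator, and `octal_escapes` is a hypothetical helper name, not something from FPC or the report.

```python
# The character from the report: U+2551 BOX DRAWINGS DOUBLE VERTICAL.
ch = "\u2551"


def octal_escapes(data: bytes) -> str:
    """Render each byte as a three-digit octal escape, like the report's .ascii lines."""
    return "".join(f"\\{b:03o}" for b in data)


utf8_bytes = ch.encode("utf-8")    # three bytes in UTF-8
cp437_bytes = ch.encode("cp437")   # one byte in code page 437

print(utf8_bytes.hex(" "), "->", octal_escapes(utf8_bytes))
print(cp437_bytes.hex(" "), "->", octal_escapes(cp437_bytes))
```

The octal forms are the ones that appear in the .ascii lines quoted in the steps to reproduce.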

I tested with FPC 3.0.4, Thaddy with trunk r38861.

Related forum post:
http://forum.lazarus.freepascal.org/index.php/topic,41095.0.html
Tags: No tags attached.
Fixed in Revision: 40637,40735
FPCOldBugId:
FPCTarget:
Attached Files
  • project1.pp (272 bytes)
    program Project1;
    
    {$apptype CONSOLE}
    {$mode objfpc}{$H+}
    {$Codepage UTF8}
    
    type
      CP437String = type String(437);
    
    var
      s_cpUTF8: string;
      s_cp437_1, s_cp437_2: CP437String;
    begin
      s_cpUTF8  := '║';
      s_cp437_1 := '║';
      s_cp437_2 := '║1';
    end.
    
  • changestringtype.patch (488 bytes)
    Index: ncon.pas
    ===================================================================
    --- ncon.pas	(revision 39074)
    +++ ncon.pas	(working copy)
    @@ -987,6 +987,7 @@
                                   setlengthwidestring(pw,l);
                                   ReAllocMem(value_str,l);
                                 end;
    +                          len:=l-1;
                               unicode2ascii(pw,value_str,cp1);
                               donewidestring(pw);
                             end
    
  • prjUTF8With4BUTF8.lpr (1,449 bytes)

Activities

engkin

2018-04-29 17:28

reporter  

project1.pp (272 bytes)

Marco van de Voort

2018-04-29 17:42

manager   ~0108068

- No platform is named, so we can't know what the default encoding is.
- This matters most on Windows, which requires additional code/units to change the default type to UTF8 (and no, $codepage utf8 is not enough for that).

engkin

2018-04-29 17:53

reporter   ~0108069

Last edited: 2018-04-29 17:59

I tested on Windows.
Thaddy tested on arm-linux.

The string representation in the generated file on both systems was UTF8 for s_cpUTF8:
       .ascii "\342\225\221\000"
Compare with the values in the Additional Information.
The value is in UTF8 because the file is saved as UTF8.

Thaddy de Koning

2018-04-30 15:47

reporter   ~0108076

So it seems not platform related.

engkin

2018-05-21 00:57

reporter  

changestringtype.patch (488 bytes)

engkin

2018-05-21 00:58

reporter   ~0108457

The problem is in tstringconstnode.changestringtype (unit ncon.pas), where the node length was not updated after the conversion.

Patch added.
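
A simplified model of a stale length, written in Python so it runs standalone. It is not the compiler's code: `emit_constant` and its `update_length` flag are hypothetical names, and the in-place overwrite of the buffer is an assumption about how an outdated length could expose leftover bytes. The sketch reproduces the stale trailing \221 from the report but not the middle `?` byte, whose origin it does not attempt to explain.

```python
def emit_constant(text: str, target: str = "cp437", update_length: bool = True) -> bytes:
    """Hypothetical model: convert a UTF-8 constant in place and emit `length` bytes."""
    utf8 = text.encode("utf-8")
    buf = bytearray(utf8)          # buffer initially holds the UTF-8 source bytes
    length = len(utf8)             # length recorded before the conversion

    converted = text.encode(target)
    for i, b in enumerate(converted):
        buf[i] = b                 # converted bytes overwrite the front of the buffer

    if update_length:
        length = len(converted)    # the update the report says was missing

    return bytes(buf[:length]) + b"\x00"   # emitted data plus terminator


buggy = emit_constant("\u2551", update_length=False)
fixed = emit_constant("\u2551", update_length=True)
print(buggy)   # starts with the CP437 byte, ends with a stale UTF-8 byte
print(fixed)
```

With the length updated, the model emits the single CP437 byte followed by the terminator, which matches the expected `\272\000`.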

Bart Broersma

2018-05-24 17:56

reporter   ~0108513

@Marco: can you please change "status" to unassigned as well?

BrunoK

2018-05-25 10:14

reporter  

prjUTF8With4BUTF8.lpr (1,449 bytes)

BrunoK

2018-05-25 10:15

reporter   ~0108524

I ran into a nearly identical, or at least probably related, case.

Strings containing a 4-byte UTF8 sequence show strange behaviour; see the attached project.

When viewed via lpDbgByteArray^ in the debugger, the first #240+#157+#132+#158 is in the byte array unchanged, but the 2nd is replaced with
  195,
  176,
  194,
  157,
  194,
  132,
  194,
  158

BrunoK

2018-05-25 15:56

reporter   ~0108528

These seem to be the compiler's decisions for the literal:
'240+157+132+158'

Takes 240, decides to do a CP1252 to UTF8 conversion -> 195,176
Takes 157, decides to do a CP1252 to UTF8 conversion -> 194,157

For the others it takes another path, not CP1250 or CP1252.
For '132' the conversion seems to be like in any of
ArrayISO_8859_1ToUTF8
ArrayISO_8859_15ToUTF8
ArrayISO_8859_2ToUTF8
ArrayCP874ToUTF8
                          that gives UTF8 -> 194,132
For '158' the conversion seems to be like for '132' but with some other additional code pages
                          that gives UTF8 -> 194,158

With AnsiToUtf8 it gives (decimal char): _195_176_194_157_226_128_158_197_190
There must be some logic to it, something like what engkin reported. The problem is that, for example, AnsiToUtf8(_240+_157+_132+_158), patched byte by byte into a RawByteString, first outputs _195_176_194_157 like the compiler, but the last bytes are different:
  _132 -> _226_128_158 (in many code pages)
  _158 -> _197_190 (only in 1250 and 1252)
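
For comparison, the AnsiToUtf8 values quoted above for bytes 132 and 158 are what a CP1252 interpretation gives. A minimal Python check, assuming Python's `cp1252` codec is an acceptable stand-in for the Windows code page; that codec leaves byte 157 unassigned, so that byte is skipped here. `cp1252_byte_to_utf8` is a hypothetical helper name.

```python
def cp1252_byte_to_utf8(value: int) -> list:
    """Decode one byte as CP1252 and return the UTF-8 bytes of the resulting character."""
    return list(bytes([value]).decode("cp1252").encode("utf-8"))


for value in (240, 132, 158):
    print(value, "->", cp1252_byte_to_utf8(value))
```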

It is puzzling why, at the beginning of the string (position 3 of my vCP1252Text), the compiler copies the values without transformation, while further on it starts to make more or less inconsistent choices.

It seems the problem is in ncon.pas, but I must admit that this is far beyond my understanding of the compiler.

FPC 3.0.4

Windows 10 / 32bit but I don't think it is relevant.

Hope it helps.

Thaddy de Koning

2018-05-25 18:05

reporter   ~0108531

Last edited: 2018-05-25 18:07

Indeed, this may be related.
And it is definitely not an OS or CPU issue (except possibly, but unlikely, 32 vs 64 bit).
engkin and I had exactly the same results on wildly differing architectures.

engkin

2018-05-26 07:06

reporter   ~0108535

@BrunoK
"...but the 2nd is replaced with
  195,
  176,
  194,
  157,
  194,
  132,
  194,
  158 "

Each char was treated as a WideChar and converted to UTF8:
Chr(240) considered #$00F0 then converted to UTF8: 195, 176 (Hex: $C3,$B0)
Chr(157) considered #$009D then converted to UTF8: 194, 157 (Hex: $C2,$9D)
Chr(132) considered #$0084 then converted to UTF8: 194, 132 (Hex: $C2,$84)
Chr(158) considered #$009E then converted to UTF8: 194, 158 (Hex: $C2,$9E)

This seems to be caused by using the Euro sign "€" before these characters.
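
The per-character conversion listed above can be checked with a short Python sketch (Python is used only as a neutral calculator; nothing in it is compiler code). It shows that the four bytes form a single 4-byte UTF8 sequence, and that treating each byte as its own character and re-encoding it gives the eight bytes seen in the debugger.

```python
raw = bytes([240, 157, 132, 158])

# Read as UTF-8, the four bytes are a single code point.
as_one_char = raw.decode("utf-8")
print(f"U+{ord(as_one_char):04X}")

# Treat every byte as a separate character with the same numeric value
# (Chr(240) -> U+00F0, and so on), then encode each one as UTF-8.
per_byte = b"".join(chr(b).encode("utf-8") for b in raw)
print(list(per_byte))
```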

Thaddy de Koning

2018-05-26 09:17

reporter   ~0108538

Last edited: 2018-05-26 09:24

Since the Euro sign is 0128, could it be an off-by-one issue?
E.g.: https://www.alt-codes.net/euro_alt_code.php

BrunoK

2018-05-26 11:10

reporter   ~0108539

@Mr Thaddy de Koning
Why do you always rush in with a simplistic comment?
If you had looked at lpDbgByteArray^ in the debugger, you would have found the 3-byte sequence for the € code point, dec/hex: 226:0xE2 130:0x82 172:0xAC.

As for the rest of this matter, I think things happen in the compiler's scanner.pas.

1° Line 5070: { four or more chars aren't handled } That settles one point.
2° Lines 5064-5066, 5105-5109: In general, for string literals, tscannerfile.readtoken will at some point start converting every character to UCS2 with ascii2unicode.

In conclusion: full UTF8 is not supported by FPC 3.0.4; that closes it for me.

Thaddy de Koning

2018-05-26 11:23

reporter   ~0108540

Last edited: 2018-05-26 11:27

That's not a simplistic conclusion:
I always try to reduce problems to their bare essence, which is often more difficult than giving complex code.
In this case I see a wrong code point in the Ansi code page converted to UTF8. Lazarus has full UTF8 support for "string". FPC itself has some support (UTF8String and string(cp_utf8)), but the basic "string" types are shortstring, ansistring and unicodestring. The latter is UTF16, not UTF8.
Note that all conversions from and to UTF8 work from UnicodeString without loss.
You are correct in assuming that UTF8 is not fully supported as type "string" in FPC 3.0.4, but that is documented. Lazarus provides a UTF8 "string" type.
You are also correct that there is a bug (probably at least two).

Testing should be taken seriously, as shown by engkin's complete but small code example and by my eliminating the platform as a factor in this report.

engkin

2018-08-19 00:39

reporter   ~0110141

Reminder regarding this bug:
1-The patch attached to it is still valid.
2-Anything added by BrunoK is not related to this bug and should be in a different bug report.

Jonas Maebe

2018-12-24 23:22

manager   ~0112866

Thanks for the patch, applied.

Issue History

Date Modified Username Field Change
2018-04-29 17:28 engkin New Issue
2018-04-29 17:28 engkin File Added: project1.pp
2018-04-29 17:42 Marco van de Voort Note Added: 0108068
2018-04-29 17:42 Marco van de Voort Assigned To => Marco van de Voort
2018-04-29 17:42 Marco van de Voort Status new => feedback
2018-04-29 17:53 engkin Note Added: 0108069
2018-04-29 17:53 engkin Status feedback => assigned
2018-04-29 17:59 engkin Note Edited: 0108069 View Revisions
2018-04-30 15:47 Thaddy de Koning Note Added: 0108076
2018-05-21 00:57 engkin File Added: changestringtype.patch
2018-05-21 00:58 engkin Note Added: 0108457
2018-05-24 10:51 Marco van de Voort Assigned To Marco van de Voort =>
2018-05-24 17:56 Bart Broersma Note Added: 0108513
2018-05-25 10:14 BrunoK File Added: prjUTF8With4BUTF8.lpr
2018-05-25 10:15 BrunoK Note Added: 0108524
2018-05-25 11:27 Marco van de Voort Assigned To => Marco van de Voort
2018-05-25 11:27 Marco van de Voort Status assigned => new
2018-05-25 11:27 Marco van de Voort Assigned To Marco van de Voort =>
2018-05-25 15:56 BrunoK Note Added: 0108528
2018-05-25 18:05 Thaddy de Koning Note Added: 0108531
2018-05-25 18:07 Thaddy de Koning Note Edited: 0108531 View Revisions
2018-05-26 07:06 engkin Note Added: 0108535
2018-05-26 09:17 Thaddy de Koning Note Added: 0108538
2018-05-26 09:24 Thaddy de Koning Note Edited: 0108538 View Revisions
2018-05-26 11:10 BrunoK Note Added: 0108539
2018-05-26 11:23 Thaddy de Koning Note Added: 0108540
2018-05-26 11:27 Thaddy de Koning Note Edited: 0108540 View Revisions
2018-08-19 00:39 engkin Note Added: 0110141
2018-12-24 23:22 Jonas Maebe Fixed in Revision => 40637
2018-12-24 23:22 Jonas Maebe Note Added: 0112866
2018-12-24 23:22 Jonas Maebe Status new => resolved
2018-12-24 23:22 Jonas Maebe Fixed in Version => 3.3.1
2018-12-24 23:22 Jonas Maebe Resolution open => fixed
2018-12-24 23:22 Jonas Maebe Assigned To => Jonas Maebe
2018-12-24 23:22 Jonas Maebe Target Version => 3.2.0
2019-01-01 17:50 Jonas Maebe Fixed in Revision 40637 => 40637,40735
2019-01-01 17:50 Jonas Maebe Fixed in Version 3.3.1 => 3.2.0