
ID: 0012198
Project: Lazarus
Category: LCL
View Status: public
Last Update: 2009-10-23 00:40
Reporter: Phil
Assigned To: Vincent Snijders
Priority: normal
Severity: major
Reproducibility: always
Status: closed
Resolution: fixed
Platform: PowerPC
OS: OS X
Target Version: 0.9.28
Fixed in Version: 0.9.27 (SVN)
Summary: 0012198: AnsiToUTF8 no longer works with Carbon?
Description: All of my apps that use AnsiToUTF8 to display ANSI characters are now broken.

See attached test app that used to work (2007). Now it displays "?". Works fine on Windows.

Thanks.

-Phil
Tags: No tags attached.
Fixed in Revision: 20727
LazTarget: 0.9.28
Widgetset: Carbon
Attached Files

Relationships

related to 0012211 (closed, Michael Van Canneyt) FPC: Document the meaning of 'ansi' in functions like AnsiToUTF8

Activities

2008-09-21 18:54

 

testunicode2.zip (3,172 bytes)

Phil

2008-09-21 19:13

reporter   ~0022352

I ran one of my apps that was compiled with FPC 2.2.2 and the 20080828 snapshot of Lazarus and it does not have this problem - i.e., ANSI chars display fine when converted with AnsiToUTF8. So the change that broke things was probably introduced in the last few weeks.

Thanks.

-Phil

Vincent Snijders

2008-09-22 21:46

manager   ~0022366

For now, the widestring manager is loaded (from the cwstring unit) in the lclproc unit, which is used by almost every LCL unit.

Therefore a correct AnsiToUTF8 conversion is now done (Ansi in this case means the system code set, which is probably UTF8 on Mac OS X).

Without loading the C widestring manager, a simple cast is done.

So in short, the code is now behaving as it should. #$B6 is not a valid UTF8 char.
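
To illustrate the last point: a lone byte $B6 is not well-formed UTF-8, while the character U+00B6 is encoded in UTF-8 as the two bytes $C2 $B6. A minimal Python sketch of that byte-level fact follows. Python is used only for illustration; the helper name is_valid_utf8 is made up here and is not part of the RTL or the LCL.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# A single $B6 byte is a UTF-8 continuation byte with no lead byte.
lone_byte = b"\xb6"

# The character U+00B6 needs two bytes in UTF-8.
pilcrow_utf8 = "\u00b6".encode("utf-8")

print(is_valid_utf8(lone_byte))
print(pilcrow_utf8.hex())
print(is_valid_utf8(pilcrow_utf8))
```

So if the source encoding is taken to be UTF-8, there is nothing sensible a converter can do with a bare $B6.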

Phil

2008-09-22 22:15

reporter   ~0022367

I don't understand this explanation.

Of course #$B6 is not a valid UTF8 character - it's an ANSI character, hence the conversion that's needed.

This was just a simple example. My apps use strings pulled from files that contain ANSI chars that need converting. This always worked before and now it doesn't. Plus it works on Windows.

Thanks.

-Phil

Vincent Snijders

2008-09-22 22:41

manager   ~0022368

Last edited: 2008-09-22 22:47

What is an ANSI character according to you?

I looked at http://lazarus-ccr.sourceforge.net/fpcdoc/rtl/system/ansitoutf8.html, which is not too clear.

I think the AnsiToUTF8 function defines it as the chars in an ansistring, i.e. having the system encoding, which is why it works on Windows.

Vincent Snijders

2008-09-22 22:58

manager   ~0022369

If you want to convert from a specific code page (not the system code page) to UTF8, use the function from the lconvencoding unit: http://lazarus-ccr.sourceforge.net/docs/lcl/lconvencoding/index-5.html
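
The idea of naming the source code page explicitly, rather than relying on the system one, can be sketched in Python as follows. This only illustrates the principle; the function to_utf8 is hypothetical and is not the lconvencoding API.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Convert raw bytes from an explicitly named encoding to UTF-8.

    Hypothetical helper: the caller states the source code page instead of
    letting the platform's current locale decide.
    """
    return raw.decode(source_encoding).encode("utf-8")


# "cafe" with an e-acute stored as a single CP1252 byte ($E9).
print(to_utf8(b"caf\xe9", "cp1252"))
```

Because the source encoding is an argument, the result is the same on every platform regardless of locale settings.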

Phil

2008-09-22 23:06

reporter   ~0022370

I've been operating under the impression that ANSI is the Latin-1 (1252) character set, hence the name AnsiString. The ANSI character set corresponds to the first 256 Unicode characters.

The documentation for AnsiToUtf8 looks wrong, since UTF8String is the same as AnsiString and is not a WideString as indicated. The rest of it tells me nothing.

I'll try CP1252ToUTF8 and see if that workaround helps any.

Thanks.

-Phil

Phil

2008-09-23 03:21

reporter   ~0022373

CP1252ToUTF8 works in place of AnsiToUTF8. However, I'm still not sure why AnsiToUTF8 doesn't work on my Mac.

LConvEncoding's GetSystemEncoding returns "ansi", not "utf8", on my Mac.

Of course, that's the only thing it would return. Looking at GetSystemEncoding source, I don't see how this applies at all to Mac. The code here makes sense only for Windows and Linux. Environment variables like "LC_ALL", etc. likely won't be defined on a Mac - certainly they're not defined on any Mac I've worked on. As a result the function just returns the default, "ansi".
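
A rough Python sketch of the behaviour described here, assuming an environment-variable-only lookup. The function name, the list of variables checked and the name normalisation are assumptions made for illustration; this is not the actual GetSystemEncoding source.

```python
import os


def guess_encoding_from_env(env=None, default="ansi"):
    """Guess an encoding name from locale environment variables only.

    Hypothetical stand-in for an env-var-based lookup: if no variable
    carries a ".codeset" suffix, the default is returned.
    """
    env = os.environ if env is None else env
    for name in ("LC_ALL", "LC_MESSAGES", "LANG"):  # assumed lookup order
        value = env.get(name, "")
        if "." in value:
            # "en_US.UTF-8@euro" -> "UTF-8" -> "utf8"
            codeset = value.split(".", 1)[1].split("@", 1)[0]
            return codeset.lower().replace("-", "")
    return default


print(guess_encoding_from_env({"LANG": "en_US.UTF-8"}))
print(guess_encoding_from_env({}))
```

With no variables defined, as on the Mac described above, such a lookup can only fall back to its default.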

Vincent Snijders

2008-09-23 09:14

manager   ~0022375

I added a request for updating the docs of AnsiToUTF8, so that they explain what "ansi" exactly means in AnsiToUTF8.

Vincent Snijders

2008-09-23 09:15

manager   ~0022376

What is the system encoding on Mac OS X?

Phil

2008-09-24 04:11

reporter   ~0022405

On my Mac this code:

uses
  MacOSAll;

begin
  WriteLn(CFStringGetSystemEncoding);
end.

returns 0, indicating MacRoman encoding.

Still, why did my apps that used AnsiToUTF8 work as recently as late August and then stop working? I'm using the same Mac and the same version of FPC (2.2.2), so it must be something in LCL/carbon widgetset that changed.

Thanks.

-Phil

Vincent Snijders

2008-09-24 08:25

manager   ~0022406

Last edited: 2008-09-24 08:31

The change is caused by the fact that the LCL now loads the widestring manager, causing a correct conversion from system encoding (UTF8 on Mac OS X) to UTF8.

Before, the LCL didn't load the widestring manager and the dummy widestring manager was used, which assumed that the system encoding consists of the first 256 Unicode characters and could convert to widechar simply by doing a cast: myunicodechar := WideChar(MyAnsiChar), i.e. it assumed ord(ansichar) = ord(unicodechar). The dummy widestring manager is no longer used for the conversion.
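
The cast-style conversion described above can be emulated in a few lines of Python. This is a sketch of the idea only, not the RTL's code, and cast_convert is a made-up name. It is equivalent to decoding as Latin-1, so it gives the wrong text for any other source encoding.

```python
def cast_convert(raw: bytes) -> str:
    """Emulate a cast-based conversion: every byte value becomes the
    Unicode code point with the same number (hypothetical helper)."""
    return "".join(chr(b) for b in raw)


# Identical to a Latin-1 decode for every possible byte value.
all_bytes = bytes(range(256))
print(cast_convert(all_bytes) == all_bytes.decode("latin-1"))

# UTF-8 input is mangled: the two bytes of an e-acute become two characters.
utf8_e_acute = "\u00e9".encode("utf-8")
print([hex(ord(c)) for c in cast_convert(utf8_e_acute)])
```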

So from the LCL point of view, AnsiToUTF8 and other widestring functions, such as conversion to upper and lower case (these functions are needed in the LCL), were broken without loading a widestring manager. The manager is therefore loaded in the LCL and not in user programs, so we don't depend on users doing that. Now this is fixed, so for the LCL there is no change required.

If your system encoding is not UTF8, but something else and #$B6 is a valid char in that encoding, then you may have found a bug in the AnsiToUTF8 function, which belongs to the RTL.

Therefore I move this issue to the FPC team.

Jonas Maebe

2008-09-24 10:39

manager   ~0022411

Last edited: 2008-09-24 13:13

As mentioned about CFStringGetSystemEncoding at http://developer.apple.com/documentation/CoreFoundation/Reference/CFStringRef/Reference/reference.html#//apple_ref/c/func/CFStringGetSystemEncoding :

"In most situations you will not want to use this function, however, because your primary interest will be your application's default text encoding."

The same goes for AnsiToUtf8: you generally don't want to use what happens to be the currently active code page (either at the GUI or at the Unix level) to convert strings to UTF8, except where the strings were actually received from said APIs. And in those cases you have to determine what the encoding of the strings returned by said APIs is before you can convert them.

All of the modern Carbon functions that interpret strings based on their encoding only accept CFStrings (and Cocoa APIs use the identical NSString). A CFString/NSString has its current encoding packed together with the payload, and you can easily convert it to whatever you want without having to know the original/source encoding.

Some older Mac OS APIs which do interpret the contents of their strings however accept/return plain C strings, and these may indeed (have to) be encoded using CFStringGetSystemEncoding().

FPC's widestring manager however uses the Unix-level locale functionality (type "man locale" in Terminal for more information) for determining the "ansi" encoding on all Unix-like platforms (also on Mac OS X).

Regardless of whether it would use that one or CFStringGetSystemEncoding(), some programs will break when using "ansistring" in general, because both ansi-encodings are unlikely to be identical. The OS cannot force them to be always the same because that would require one of these layers to get priority over the other one, breaking all sorts of expectations for the other layer (or users thereof).

The current widestring manager has no way to deal differently with two different kinds of ansistrings (it assumes all ansistrings are encoded in the same way), but this may be resolved with the unicode rewrite (although you'll probably still have to be careful to somehow correctly declare the encoding of each of your "ansistrings").

So at this point, the only way to correctly handle things is to let the widestring manager handle things when communicating with the Unix layer, and to explicitly convert things when communicating with non-CFString routines at the GUI layer (which should be a small minority for modern programs, although for older programs this may constitute a significant chunk).


BTW: the "ansi" in ansistring never meant "the Latin-1 (1252) character set", on any platform. Assuming that this is the case will also break on Linux and Windows if people are using a different code page.
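
A small Python sketch of why the code page matters: the same byte decodes to different characters under different 8-bit encodings, and even CP1252 and Latin-1 disagree in the $80..$9F range. The codec names are Python's; decode_under is a hypothetical helper, and the set of code pages shown is just an example.

```python
def decode_under(raw: bytes, codecs):
    """Decode the same bytes under several encodings (hypothetical helper)."""
    return {codec: raw.decode(codec) for codec in codecs}


# The byte from the test case, interpreted under four different code pages.
results = decode_under(b"\xb6", ("cp1252", "latin-1", "mac_roman", "iso8859_5"))
for codec, ch in results.items():
    print(f"{codec:10} U+{ord(ch):04X}")

# CP1252 and Latin-1 are not the same thing: $80 is the euro sign in
# CP1252 but a C1 control character in Latin-1.
print(b"\x80".decode("cp1252") == b"\x80".decode("latin-1"))
```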

Jonas Maebe

2008-09-25 10:25

manager   ~0022431

That said, GetSystemEncoding in lconvencoding.pas *is* wrong for *nix platforms. The reason is that while LANG, LC_ALL and LC_MESSAGES influence the locale settings, having different locale settings does not necessarily require those environment variables to be set.

Case in point: on Mac OS X 10.5 without any such environment variables set, the output of the "locale" program is:

$ locale
LANG=
LC_COLLATE="C"
LC_CTYPE="UTF-8"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL=


LC_CTYPE happens to be the setting which actually decides the charset mapping, so on Mac OS X 10.5 the default is UTF-8. On 10.4.x, the default still appears to be "C".

Anyway, lconvencoding should use the same method as the cwstring unit to determine the active character encoding on *nix platforms (first call setlocale(''), then call nl_langinfo(CODESET) -- and note that the first call to setlocale is mandatory according to POSIX and required on various systems, even though it isn't on Linux)
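
The two-step sequence above (setlocale first, then nl_langinfo(CODESET)) can be sketched in Python, whose locale module wraps the same C library calls on Unix-like systems. This is an illustration, not the cwstring code. Passing LC_CTYPE as the category and ignoring an unsupported locale are assumptions of this sketch, and locale.nl_langinfo is not available on Windows.

```python
import locale


def active_codeset() -> str:
    """Return the codeset of the locale selected by the environment.

    Hypothetical helper: adopt the user's locale first, then ask the C
    library which character encoding that locale uses.
    """
    try:
        # Step 1: adopt the locale from the environment ("" means "user default").
        locale.setlocale(locale.LC_CTYPE, "")
    except locale.Error:
        # The environment names a locale this system does not provide;
        # keep whatever locale is currently active.
        pass
    # Step 2: query the codeset of the now-active locale.
    return locale.nl_langinfo(locale.CODESET)


print(active_codeset())
```

Unlike a lookup that only reads environment variables, this asks the C library for the effective setting, so it also reports a codeset when LANG and LC_ALL are unset.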

Phil

2008-09-26 16:52

reporter   ~0022442

Here is what locale returns on OS X 10.4.11:

LANG=
LC_COLLATE="C"
LC_CTYPE="C"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL="C"

Thanks.

-Phil

Jonas Maebe

2009-04-25 12:06

reporter   ~0027066

Maybe one extra information point: on Mac OS X, all Unix-level OS interfaces for dealing with files always expect and return utf-8 encoded strings, regardless of the locale. All FPC rtl units simply pass through the "ansi"strings they get to the api's, so this means you should encode these "ansistrings" using utf-8.

This can only be properly solved once the Delphi 2009 string type is implemented in FPC, because if we would add explicit encoding from the current locale to utf-8 in the RTL, then a lot of existing programs could break.

Gilles HEMERY

2009-05-30 09:54

reporter   ~0028146

Same problem with Debian Lenny [Lazarus 0.9.26.2]

Vincent Snijders

2009-06-23 15:49

manager   ~0028701

The purpose of GetSystemEncoding in the lconvencoding unit is to determine what is a good default encoding for text files, to help the IDE guess the encoding of the text files it reads.

Can this be done in the way described in note 0022431, similar to the cwstring unit? Or is it better to hard code it to UTF8 on Mac OS X?

Jonas Maebe

2009-06-23 17:02

reporter   ~0028702

I think UTF-8 is the best default on Mac OS X. But maybe an alias of this routine called GetDefaultTextEncoding() should be added, so that GetSystemEncoding() can eventually be phased out (or maybe be used to refer to whatever encoding the system APIs expect).

The info from note 0022431 is more applicable to command line unix programs.

Vincent Snijders

2009-06-24 11:52

manager   ~0028708

I renamed GetSystemEncoding to GetDefaultTextEncoding and let it return UTF-8 on Mac OS X.

Issue History

Date Modified Username Field Change
2008-09-21 18:54 Phil New Issue
2008-09-21 18:54 Phil File Added: testunicode2.zip
2008-09-21 18:54 Phil Widgetset => Carbon
2008-09-21 19:13 Phil Note Added: 0022352
2008-09-21 20:14 Vincent Snijders LazTarget => 0.9.26
2008-09-21 20:14 Vincent Snijders Status new => acknowledged
2008-09-21 20:14 Vincent Snijders Target Version => 0.9.26
2008-09-22 21:46 Vincent Snijders Status acknowledged => resolved
2008-09-22 21:46 Vincent Snijders Resolution open => no change required
2008-09-22 21:46 Vincent Snijders Assigned To => Vincent Snijders
2008-09-22 21:46 Vincent Snijders Note Added: 0022366
2008-09-22 22:15 Phil Status resolved => assigned
2008-09-22 22:15 Phil Resolution no change required => reopened
2008-09-22 22:15 Phil Note Added: 0022367
2008-09-22 22:41 Vincent Snijders Note Added: 0022368
2008-09-22 22:47 Vincent Snijders Note Edited: 0022368
2008-09-22 22:58 Vincent Snijders Note Added: 0022369
2008-09-22 23:06 Phil Note Added: 0022370
2008-09-23 03:21 Phil Note Added: 0022373
2008-09-23 09:13 Vincent Snijders Relationship added related to 0012211
2008-09-23 09:14 Vincent Snijders Note Added: 0022375
2008-09-23 09:15 Vincent Snijders Note Added: 0022376
2008-09-24 04:11 Phil Note Added: 0022405
2008-09-24 08:25 Vincent Snijders Note Added: 0022406
2008-09-24 08:31 Vincent Snijders Note Edited: 0022406
2008-09-24 08:31 Vincent Snijders Project Lazarus => FPC
2008-09-24 08:31 Vincent Snijders Assigned To Vincent Snijders =>
2008-09-24 08:31 Vincent Snijders Status assigned => new
2008-09-24 08:31 Vincent Snijders Target Version 0.9.26 =>
2008-09-24 08:32 Vincent Snijders FPCOldBugId => 0
2008-09-24 08:32 Vincent Snijders Resolution reopened => open
2008-09-24 08:32 Vincent Snijders Category Widgetset => RTL
2008-09-24 08:32 Vincent Snijders Product Version 0.9.25 (SVN) => 2.2.2
2008-09-24 10:39 Jonas Maebe Status new => resolved
2008-09-24 10:39 Jonas Maebe Resolution open => no change required
2008-09-24 10:39 Jonas Maebe Assigned To => Jonas Maebe
2008-09-24 10:39 Jonas Maebe Note Added: 0022411
2008-09-24 13:13 Jonas Maebe Note Edited: 0022411
2008-09-25 10:25 Jonas Maebe Note Added: 0022431
2008-09-25 10:25 Jonas Maebe Status resolved => confirmed
2008-09-25 10:25 Jonas Maebe Project FPC => Lazarus
2008-09-25 10:36 Vincent Snijders LazTarget 0.9.26 => 0.9.28
2008-09-25 10:36 Vincent Snijders Assigned To Jonas Maebe =>
2008-09-25 10:36 Vincent Snijders Status confirmed => acknowledged
2008-09-25 10:36 Vincent Snijders Target Version => 0.9.27 (SVN)
2008-09-26 16:52 Phil Note Added: 0022442
2008-09-26 21:31 Vincent Snijders Target Version 0.9.27 (SVN) => 0.9.28
2009-04-21 20:42 Vincent Snijders Status acknowledged => assigned
2009-04-21 20:42 Vincent Snijders Assigned To => Vincent Snijders
2009-04-25 12:06 Jonas Maebe Note Added: 0027066
2009-05-30 09:54 Gilles HEMERY Note Added: 0028146
2009-06-07 11:50 Vincent Snijders Category RTL => LCL
2009-06-07 11:50 Vincent Snijders Product Version 2.2.2 =>
2009-06-23 15:49 Vincent Snijders Note Added: 0028701
2009-06-23 17:02 Jonas Maebe Note Added: 0028702
2009-06-24 11:52 Vincent Snijders Fixed in Revision => 20727
2009-06-24 11:52 Vincent Snijders Status assigned => resolved
2009-06-24 11:52 Vincent Snijders Fixed in Version => 0.9.27 (SVN)
2009-06-24 11:52 Vincent Snijders Resolution no change required => fixed
2009-06-24 11:52 Vincent Snijders Note Added: 0028708
2009-10-23 00:40 Marc Weustink Status resolved => closed