View Issue Details

IDProjectCategoryView StatusLast Update
0037155PatchesLCLpublic2020-07-12 10:58
Reportereastorwest Assigned ToJuha Manninen  
PrioritynormalSeverityminorReproducibilityalways
Status resolvedResolutionfixed 
Summary0037155: LConvEncoding miss KOI8-R characters
DescriptionLazarus 2.1 SVN r63258
LConvEncoding didnt work with all KOI8-R characters (missing chars from 0000128 to 0000191 and 0000255)
According to RFC 1489 (https://tools.ietf.org/html/rfc1489) all characters should be presented in array ArrayKOI8ToUTF8 (in file components/lazutils/commoncodepages.inc)
I made patch for this file (first ever from me)
Thanks for attention.

Sorry for my English, I'm using translator
TagsNo tags attached.
Fixed in Revisionr63354
LazTarget-
Widgetset
Attached Files

Relationships

related to 0036078 assignedJuha Manninen Lazarus Wish: split arrays (encoding pages) from LConvEncoding to .inc file 

Activities

eastorwest

2020-05-31 09:25

reporter  

mypatch.diff (4,971 bytes)   
Index: components/lazutils/commoncodepages.inc
===================================================================
--- components/lazutils/commoncodepages.inc	(revision 63258)
+++ components/lazutils/commoncodepages.inc	(working copy)
@@ -4545,70 +4545,71 @@
     '}',                // '}'
     '~',                // '~'
     #127,               // #127
-    '',                 // #128
-    '',                 // #129
-    '',                 // #130
-    '',                 // #131
-    '',                 // #132
-    '',                 // #133
-    '',                 // #134
-    '',                 // #135
-    '',                 // #136
-    '',                 // #137
-    '',                 // #138
-    '',                 // #139
-    '',                 // #140
-    '',                 // #141
-    '',                 // #142
-    '',                 // #143
-    '',                 // #144
-    '',                 // #145
-    '',                 // #146
-    '',                 // #147
-    '',                 // #148
-    '',                 // #149
-    '',                 // #150
-    '',                 // #151
-    '',                 // #152
-    '',                 // #153
-    '',                 // #154
-    '',                 // #155
-    '',                 // #156
-    '',                 // #157
-    '',                 // #158
-    '',                 // #159
-    '',                 // #160
-    '',                 // #161
-    '',                 // #162
-    '',                 // #163
-    '',                 // #164
-    '',                 // #165
-    '',                 // #166
-    '',                 // #167
-    '',                 // #168
-    '',                 // #169
-    '',                 // #170
-    '',                 // #171
-    '',                 // #172
-    '',                 // #173
-    '',                 // #174
-    '',                 // #175
-    '',                 // #176
-    '',                 // #177
-    '',                 // #178
-    '',                 // #179
-    '',                 // #180
-    '',                 // #181
-    '',                 // #182
-    '',                 // #183
-    '',                 // #184
-    '',                 // #185
-    '',                 // #186
-    '',                 // #187
-    '',                 // #188
-    '',                 // #189
-    '',                 // #190
-    '',                 // #191
+// RFC 1489
+    #226#148#128,       // #128
+    #226#148#130,       // #129
+    #226#148#140,       // #130
+    #226#148#144,       // #131
+    #226#148#148,       // #132
+    #226#148#152,       // #133
+    #226#148#156,       // #134
+    #226#148#164,       // #135
+    #226#148#172,       // #136
+    #226#148#180,       // #137
+    #226#148#188,       // #138
+    #226#150#128,       // #139
+    #226#150#132,       // #140
+    #226#150#136,       // #141
+    #226#150#140,       // #142
+    #226#150#144,       // #143
+    #226#150#145,       // #144
+    #226#150#146,       // #145
+    #226#150#147,       // #146
+    #226#140#160,       // #147
+    #226#150#160,       // #148
+    #226#136#153,       // #149
+    #226#136#154,       // #150
+    #226#137#136,       // #151
+    #226#137#164,       // #152
+    #226#137#165,       // #153
+    #194#160,           // #154
+    #226#140#161,       // #155
+    #194#176,           // #156
+    #194#178,           // #157
+    #194#183,           // #158
+    #195#183,           // #159
+    #226#149#144,       // #160
+    #226#149#145,       // #161
+    #226#149#146,       // #162
+    #209#145,           // #163
+    #226#149#147,       // #164
+    #226#149#148,       // #165
+    #226#149#149,       // #166
+    #226#149#150,       // #167
+    #226#149#151,       // #168
+    #226#149#152,       // #169
+    #226#149#153,       // #170
+    #226#149#154,       // #171
+    #226#149#155,       // #172
+    #226#149#156,       // #173
+    #226#149#157,       // #174
+    #226#149#158,       // #175
+    #226#149#159,       // #176
+    #226#149#160,       // #177
+    #226#149#161,       // #178
+    #208#129,           // #179
+    #226#149#162,       // #180
+    #226#149#163,       // #181
+    #226#149#164,       // #182
+    #226#149#165,       // #183
+    #226#149#166,       // #184
+    #226#149#167,       // #185
+    #226#149#168,       // #186
+    #226#149#169,       // #187
+    #226#149#170,       // #188
+    #226#149#171,       // #189
+    #226#149#172,       // #190
+    #194#169,           // #191
     #209#142,           // #192
     #208#176,           // #193
     #208#177,           // #194
@@ -4672,8 +4673,8 @@
     #208#173,           // #252
     #208#169,           // #253
     #208#167,           // #254
-    ''                  // #255
-  );
+    #208#170            // #255
+   );
 
   ArrayMacintoshToUTF8: TCharToUTF8Table = (
     #0,                 // #0
mypatch.diff (4,971 bytes)   

Mattias Gaertner

2020-05-31 09:53

manager   ~0123153

KOI8 <> KOI8-R

Maybe we need KOI8 as fallback for the other KOI8 encodings.

eastorwest

2020-05-31 11:06

reporter   ~0123154

This array "ArrayKOI8ToUTF8" consist only russian char codes (from C0 to FE) today.
I can make other encoding tables for other KOI8 also:
KOI8-U ukrainian
KOI8-RU russian-belarussian-ukrainian
KOI8-C central asia
KOI8-T Tajik
What will be better for LCL to do? I mean save as is, or make some new encodings.

eastorwest

2020-06-03 14:35

reporter   ~0123203

LConvEncoding already uses KOI8-R when converting from UTF8 to KOI8 (in OS Windows KOI8-R have code page 20866):

    function UTF8ToKOI8(const s: string; SetTargetCodePage: boolean): RawByteString;
    begin
      InternalUTF8ToCP(s,20866,SetTargetCodePage,{$IfDef UseSystemCPConv}nil{$else}@UnicodeToKOI8{$endif},Result);
    end;

eastorwest

2020-06-08 18:19

reporter   ~0123341

I made some test application with conversion from all encodings to UTF-8 (main for LCL) and back from UTF-8 to encoding.
As i can see some other single byte encodings also miss some characters when conversion.
Also i made changes in function UnicodeToKOI8 to make better conversion when it will be compiled. New patch is diff for 2 files now.
Note: UnicodeToKOI8U can not fully implement RFC2319 https://tools.ietf.org/rfc/rfc2319.txt without this patch, because KOI8-U and KOI8-RU is based on KOI8-R only.
LConvEncoding_test.zip (2,981 bytes)
mypatch2.zip (1,306 bytes)

Juha Manninen

2020-06-09 09:32

developer   ~0123350

I am trying to learn about the KOI codepages. I found these:
 https://en.wikipedia.org/wiki/KOI_character_encodings
 https://en.wikipedia.org/wiki/KOI-8
 https://en.wikipedia.org/wiki/KOI8-R

Q: Is the plain KOI-8 (without suffix) used anywhere?
My impression is that it is not used. I am thinking about Mattias' comment about KOI8 as fallback.

Unit LConvEncoding has
 function KOI8ToUTF8(const s: string): string; // russian cyrillic
and
 function UTF8ToKOI8(const s: string; SetTargetCodePage: boolean = false): RawByteString; // russian cyrillic
 function UTF8ToKOI8U(const s: string; SetTargetCodePage: boolean = false): RawByteString; // ukrainian cyrillic
 function UTF8ToKOI8RU(const s: string; SetTargetCodePage: boolean = false): RawByteString; // belarussian cyrillic
but none with suffix 'R'. Should the function names be changed as :
 KOI8ToUTF8 -> KOI8RToUTF8
 UTF8ToKOI8 -> UTF8ToKOI8R
?

eastorwest

2020-06-11 17:32

reporter   ~0123396

Thanks for reply @Juha.
You can also see
https://en.wikipedia.org/wiki/KOI8-U
https://en.wikipedia.org/wiki/KOI8-RU
KOI8 was developed to extend ASCII charset to support cyrillic characters and was included in old USSR state standard in 1974 year. Standard is not valid today.
"This standard has become the base for the later Internet standards such as KOI8-R, KOI8-U, KOI8-RU and all the other derivatives."
So pure (plain) KOI8 is old and it is not uses now. KOI8 is part of KOI8-R,KOI8-U,KOI8-RU char codes from 192 to 254.

KOI8-R was main cyrillic code page for Linux at the end of 1990-years to support russian cyrillic.
KOI8-U and KOI8-RU replaces few chars in KOI8-R to support ukrainian and belarussian cyrillic.

KOI8-R have RFC1489 https://tools.ietf.org/html/rfc1489 , and MS Windows code page number 20866.
KOI8-U have RFC2319 https://tools.ietf.org/html/rfc2319 , and MS Windows code page number 21866.
KOI8-RU have not RFC... I didnt find.

KOI8-R := KOI-8 + letters(ё,Ё,Ъ) + amount chars (─│┌┐└┘├┤┬┴┼▀▄█▌▐░▒▓⌠■∙√≈≤≥NBSP⌡°²·÷═║╒╓╔╕╖╗╘╙╚╛╜╝╞╟╠╡╢╣╤╥╦╧╨╩╪╫╬©)
KOI8-U := KOI8-R with replaced 8 chars (єЄіїІЇґҐ)
KOI8-RU := KOI8-R with replaced 8 + 11 chars (єЄіїІЇґҐ) + (ўЎ“”—№™»®«¤) == KOI8-U with replaced 11 chars (ўЎ“”—№™»®«¤)

Today LConvEncoding have only partially support KOI8-R named 'KOI8' in GetSupportedEncodings() list and have
not fully implemented UnicodeToKOI8R (named 'UnicodeToKOI8' without R) and ArrayKOI8RToUTF8 (named 'ArrayKOI8ToUTF8' without R).
I can extend patch to rename this function also if it will be usefull for developers so they will know about full encoding name 'KOI8-R'. Do Lazarus need it?

I saw history of LConvEncoding and this forum thread from 2013 year https://forum.lazarus.freepascal.org/index.php/topic,22663.msg134071.html#msg134071 about adding partially support encodings KOI8-U and KOI8-RU.
  
So LConvEncoding have only functions
function UTF8ToKOI8U(const s: string; SetTargetCodePage: boolean = false): RawByteString; // ukrainian cyrillic
function UTF8ToKOI8RU(const s: string; SetTargetCodePage: boolean = false): RawByteString; // belarussian cyrillic
to support additional two encodings KOI8-U and KOI8-RU. But there is no items 'KOI8-U' and 'KOI8-RU' in GetSupportedEncodings() list and there is no arrays ArrayKOI8UToUTF8 and ArrayKOI8RUToUTF8 (and not exists functions UnicodeToKOI8U and UnicodeToKOI8RU).
I can try add this arrays and functions for LConvEncoding also. Do Lazarus need it?

Juha Manninen

2020-06-11 22:53

developer   ~0123401

Yes please, all codepages that are still in use should be supported.

Your test program reveals errors in conversion. I think it should be used to improve the existing test suite TTestLConvEncoding (directory "test").
After building runtests project, you can run
 $ ./runtests --suite=TTestLConvEncoding
It shows no errors because the test is not comprehensive.
You could improve or rewrite it if you want.

eastorwest

2020-06-15 11:12

reporter   ~0123441

Patch with renaming KOI8 to KOI8-R and adding support for KOI8-U and KOI8-RU (array of chars and functions).
Since KOI8-RU have no standard or RFC then I have to use UnicodeToKOI8RU function (which depends on UnicodeToKOI8U and UnicodeToKOI8R).
I cannot compile runtests with FPC 3.0.4 and got error 'Cant determine which overloaded function to call' in bugs8432.pas line (122,54):
AssertEquals('Wrong ID1', QWord($100000000), ID1);
mypatch3.zip (4,590 bytes)

Juha Manninen

2020-06-15 12:14

developer   ~0123442

Last edited: 2020-06-15 12:15

View 2 revisions

I applied your patch 3 in r63354. Thanks!

> I cannot compile runtests with FPC 3.0.4 and got error 'Cant determine which overloaded function to call' ...
Are you testing with latest Lazarus trunk? I fixed such errors in r63331 less than a week ago, at June 9.

eastorwest

2020-06-15 18:17

reporter   ~0123445

I updated Lazarus from SVN 63354
'svn update' and rebuild Lazarus with command 'make clean all'.
'Lazarus IDE 2.1.0 r63354' still have this error.
I am using stable FPC 3.0.4. 'Free Pascal Compiler version 3.0.4 [2017/10/03] for x86_64'
Perhaps the error is related to FPC compiler.

Juha Manninen

2020-06-16 07:41

developer   ~0123450

Yes, the line "AssertEquals('Wrong ID1', QWord($100000000), ID1);" was the one I fixed by adding a typecast. Now FPC trunk accepts it.
It is so irritating the the ancient 3.0.4 is still in June 2020 the latest FPC release.
Even 3.2. will be ancient when it some day comes out. It was branched ages ago.
Anyway let's keep this issue open and return to it later.

CudaText man

2020-06-16 14:44

reporter   ~0123456

I added this patch to EncConv package and used/added KOI8* codepages to CudaText 1.105 (so you can test the work of this patch)

eastorwest

2020-07-04 12:39

reporter   ~0123746

Last edited: 2020-07-04 12:39

View 2 revisions

@Juha, After release FPC 3.2.0 i can build runtests without errors.
Tested two cases: empty char convertion and bad reverse (back) convertion char by char from 1 to 255 char codes. I think it is complete for single-byte encodings.
I found unit lazarus/test/lazutils/testlconvencoding.pas
Application runtests uses this unit for testing encodings:
  Test(EncodingCPIso1);
  Test(EncodingCPIso2);
  Test(EncodingCPIso15);
  Test(EncodingCP437);
  Test(EncodingCP850);
  Test(EncodingCP852);
  Test(EncodingCP866);
  Test(EncodingCP874);
  Test(EncodingCP1250);
  Test(EncodingCP1251);
  Test(EncodingCP1252);
  Test(EncodingCP1253);
  Test(EncodingCP1254);
  Test(EncodingCP1255);
  Test(EncodingCP1256);
  Test(EncodingCP1257);
  Test(EncodingCP1258);
but not for old EncodingCPKOI8 or for new EncodingCPKOI8R, EncodingCPKOI8U and EncodingCPKOI8RU.
We can add this encodings for testing by lines
  Test(EncodingCPKOI8R);
  Test(EncodingCPKOI8U);
  Test(EncodingCPKOI8RU);

eastorwest

2020-07-04 12:40

reporter   ~0123747

patch_test_lconvenc.diff (419 bytes)   
Index: test/lazutils/testlconvencoding.pas
===================================================================
--- test/lazutils/testlconvencoding.pas	(revision 63506)
+++ test/lazutils/testlconvencoding.pas	(working copy)
@@ -66,6 +66,9 @@
   Test(EncodingCP1256);
   Test(EncodingCP1257);
   Test(EncodingCP1258);
+  Test(EncodingCPKOI8R);
+  Test(EncodingCPKOI8U);
+  Test(EncodingCPKOI8RU);
 end;
 
 initialization
patch_test_lconvenc.diff (419 bytes)   

Juha Manninen

2020-07-05 08:56

developer   ~0123758

Applied the patch fro tests. Thanks.
I thought it requires more test code but fortunately no.

Issue History

Date Modified Username Field Change
2020-05-31 09:25 eastorwest New Issue
2020-05-31 09:25 eastorwest File Added: mypatch.diff
2020-05-31 09:53 Mattias Gaertner Note Added: 0123153
2020-05-31 11:06 eastorwest Note Added: 0123154
2020-06-03 14:35 eastorwest Note Added: 0123203
2020-06-08 18:19 eastorwest Note Added: 0123341
2020-06-08 18:19 eastorwest File Added: LConvEncoding_test.zip
2020-06-08 18:19 eastorwest File Added: mypatch2.zip
2020-06-09 09:32 Juha Manninen Note Added: 0123350
2020-06-09 09:36 Juha Manninen Assigned To => Juha Manninen
2020-06-09 09:36 Juha Manninen Status new => feedback
2020-06-09 09:36 Juha Manninen LazTarget => -
2020-06-11 17:32 eastorwest Note Added: 0123396
2020-06-11 17:32 eastorwest Status feedback => assigned
2020-06-11 22:53 Juha Manninen Note Added: 0123401
2020-06-15 11:12 eastorwest Note Added: 0123441
2020-06-15 11:12 eastorwest File Added: mypatch3.zip
2020-06-15 12:14 Juha Manninen Note Added: 0123442
2020-06-15 12:14 Juha Manninen Fixed in Revision => r63354
2020-06-15 12:15 Juha Manninen Note Edited: 0123442 View Revisions
2020-06-15 12:15 Juha Manninen Status assigned => feedback
2020-06-15 18:17 eastorwest Note Added: 0123445
2020-06-15 18:17 eastorwest Status feedback => assigned
2020-06-16 07:41 Juha Manninen Note Added: 0123450
2020-06-16 14:44 CudaText man Note Added: 0123456
2020-07-04 12:39 eastorwest Note Added: 0123746
2020-07-04 12:39 eastorwest Note Edited: 0123746 View Revisions
2020-07-04 12:40 eastorwest Note Added: 0123747
2020-07-04 12:40 eastorwest File Added: patch_test_lconvenc.diff
2020-07-05 08:56 Juha Manninen Status assigned => resolved
2020-07-05 08:56 Juha Manninen Resolution open => fixed
2020-07-05 08:56 Juha Manninen Note Added: 0123758
2020-07-12 10:58 Juha Manninen Relationship added related to 0036078