View Issue Details

ID: 0026477
Project: Lazarus
Category: Patch
View Status: public
Last Update: 2014-07-21 00:15
Reporter: Antônio Galvão
Assigned To: Martin Friebe
Priority: normal
Severity: minor
Reproducibility: N/A
Status: assigned
Resolution: open
Product Version:
Product Build:
Target Version:
Fixed in Version:
Summary: 0026477: Allow SynHighlighterAny to highlight UTF-8 chars
Description: This diff is intended to allow SynHighlighterAny to highlight text in a SynEdit used as an algorithm editor, in which words from the user's native natural language should be usable. It adds the Portuguese letters with diacritics, plus byte 195 (#195), the UTF-8 lead byte that precedes each of those accented letters.
Tags: No tags attached.
Fixed in Revision:
LazTarget: -
Widgetset:
Attached Files
  • SynHighlighterAny.diff (691 bytes)
    ***************
    *** 239,245 ****
        I: Char;
        idents:string;
      begin
    -   // added letters with diacritical signals
        idents:='_0123456789áàãâäéèêëéïóòõôöúüçabcdefghijklmnopqrstuvwxyzÁÀÃÂÄÉÈÊËÉÏÓÒÕÔÖÚÜÇABCDEFGHIJKLMNOPQRSTUVWXYZ-?!';
        for I := #0 to #255 do
        begin
    --- 239,244 ----
    ***************
    *** 370,376 ****
            '(': fProcTable[I] := @RoundOpenProc;
            ')': fProcTable[I] := @RoundCloseProc;
            '/': fProcTable[I] := @SlashProc;
    -       // added 195 char
            #1..#9, #11, #12, #14..#32, #195: fProcTable[I] := @SpaceProc;
            else fProcTable[I] := @UnknownProc;
          end;
    --- 369,374 ----
    
  • SynHighlighterAny_2.diff (1,918 bytes)
    ***************
    *** 239,245 ****
        I: Char;
        idents:string;
      begin
    !   idents:='_0123456789áàãâäéèêëéïóòõôöúüçabcdefghijklmnopqrstuvwxyzÁÀÃÂÄÉÈÊËÉÏÓÒÕÔÖÚÜÇABCDEFGHIJKLMNOPQRSTUVWXYZ-?!';
        for I := #0 to #255 do
        begin
          if pos(i,idents)>0 then identifiers[i]:=true
    --- 239,245 ----
        I: Char;
        idents:string;
      begin
    !   idents:='_0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ-?!';
        for I := #0 to #255 do
        begin
          if pos(i,idents)>0 then identifiers[i]:=true
    ***************
    *** 258,268 ****
        First := 0;
        Last := fKeywords.Count - 1;
        Result := False;
    !   Token := utf8UpperCase(AKeyword);
        while First <= Last do
        begin
          I := (First + Last) shr 1;
    !     Compare := AnsiCompareStr(utf8uppercase(fKeywords[i]), Token);
          if Compare = 0 then
          begin
            Result := True;
    --- 258,268 ----
        First := 0;
        Last := fKeywords.Count - 1;
        Result := False;
    !   Token := UpperCase(AKeyword);
        while First <= Last do
        begin
          I := (First + Last) shr 1;
    !     Compare := AnsiCompareStr(fKeywords[i], Token);
          if Compare = 0 then
          begin
            Result := True;
    ***************
    *** 369,375 ****
            '(': fProcTable[I] := @RoundOpenProc;
            ')': fProcTable[I] := @RoundCloseProc;
            '/': fProcTable[I] := @SlashProc;
    !       #1..#9, #11, #12, #14..#32, #195: fProcTable[I] := @SpaceProc;
            else fProcTable[I] := @UnknownProc;
          end;
          fProcTable[fStringDelimCh] := @StringProc;
    --- 369,375 ----
            '(': fProcTable[I] := @RoundOpenProc;
            ')': fProcTable[I] := @RoundCloseProc;
            '/': fProcTable[I] := @SlashProc;
    !       #1..#9, #11, #12, #14..#32: fProcTable[I] := @SpaceProc;
            else fProcTable[I] := @UnknownProc;
          end;
          fProcTable[fStringDelimCh] := @StringProc;
    

Activities

Antônio Galvão

2014-07-13 17:52

reporter  

SynHighlighterAny.diff (691 bytes)

Bart Broersma

2014-07-13 20:23

developer   ~0076210

Please move to Lazarus.

Antônio Galvão

2014-07-14 09:45

reporter  

SynHighlighterAny_2.diff (1,918 bytes)

Antônio Galvão

2014-07-14 09:46

reporter   ~0076212

The second file also allows keywords to be recognized properly by the IsKeyWord function.
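
A minimal sketch of why the case conversion matters for keyword matching, assuming keywords are stored as UTF-8. It uses only `UpperCase` from SysUtils, which converts ASCII letters only. The program name and the sample word are illustrative and not taken from the patch, and the bytes of each accented letter are written out explicitly so the example does not depend on the source file encoding.

```pascal
program AsciiUpperCaseDemo;
{$mode objfpc}{$H+}
uses
  SysUtils;
const
  // 'ação' as UTF-8: c-cedilla = C3 A7, a-tilde = C3 A3
  Lower: AnsiString = 'a'#$C3#$A7#$C3#$A3'o';
  // 'AÇÃO' as UTF-8: C-cedilla = C3 87, A-tilde = C3 83
  Upper: AnsiString = 'A'#$C3#$87#$C3#$83'O';
var
  Converted: AnsiString;
begin
  Converted := UpperCase(Lower);
  // The ASCII letters are converted ...
  WriteLn('First byte upper-cased: ', Converted[1] = 'A');
  // ... but the bytes of the accented letters are left untouched,
  // so the result does not equal the upper-case form of the word.
  WriteLn('Matches upper-case keyword: ', Converted = Upper);
end.
```

The second comparison is false, which is presumably why the patch switches both sides of the comparison to utf8UpperCase.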

Antônio Galvão

2014-07-14 22:32

reporter   ~0076216

Last edited: 2014-07-15 00:38

This solution was also tested with Chinese characters and works fine; you only need to add the characters you want to the IDENTS string.
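
The table filled from IDENTS is indexed by single bytes, so adding a multi-byte UTF-8 character marks each of its bytes separately. A minimal sketch of that effect, assuming the same `Pos`-based loop as in the diff; the program and variable names are illustrative.

```pascal
program IdentByteTable;
{$mode objfpc}{$H+}
var
  Identifiers: array[AnsiChar] of Boolean;
  Idents: AnsiString;
  C: AnsiChar;
begin
  // 'abc' plus U+00E1 (a with acute), written as its UTF-8 bytes C3 A1
  Idents := 'abc'#$C3#$A1;

  // Same idea as the loop in the diff: one table entry per byte value
  for C := #0 to #255 do
    Identifiers[C] := Pos(C, Idents) > 0;

  WriteLn('Lead byte C3 marked:     ', Identifiers[#$C3]);
  WriteLn('Trailing byte A1 marked: ', Identifiers[#$A1]);
  WriteLn('Lead byte C2 marked:     ', Identifiers[#$C2]);
end.
```

Byte A1 is also the trailing byte of other characters, for example U+00A1 (inverted exclamation mark, C2 A1), so a byte-wise table cannot tell such characters apart.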

Martin Friebe

2014-07-17 21:07

manager   ~0076246

Not all UTF-8 codepoints above 127 are "word chars".

There are semi-width spaces and various punctuation: http://www.fileformat.info/info/unicode/category/Zs/list.htm

Latin punctuation has "fullwidth" equivalents. Other languages (script languages) have their own punctuation.

If those are not accounted for, then after the patch, words in those languages will be highlighted only if there is a Latin non-word char on both sides, and otherwise not highlighted.

The patch changes the code from a "predictable wrong" behaviour to an "unpredictable wrong" behaviour (well, to be exact: harder to predict).

Martin Friebe

2014-07-18 10:53

manager   ~0076247

"utf8uppercase" is also problematic. In most languages "i" (lowercase dotted i) becomes "I" (uppercase dotless I). But some languages have an uppercase dotted I.
However, since we are not doing proper UTF-8 normalization, this is not a blocker.

Antônio Galvão

2014-07-18 18:33

reporter   ~0076255

Last edited: 2014-07-18 20:22

Is what you call "predictable wrong" behavior actually one that only takes into account the languages whose words have no diacritics? If so, I agree that the patch does not cover languages beyond the European ones (which are not only the Latin languages, since German, for example, also uses diacritics). And I agree that after the patch this behavior becomes unpredictable for those other languages, which changes nothing from what was happening before. Probably the class should be rewritten by them.

As far as I can find references, the dotted uppercase I is Turkish and corresponds to a dotted lowercase i, while the dotless uppercase I corresponds to a dotless lowercase i. So there are two i letters in Turkish, each with its own uppercase. Anyway, letting Turkish people write their own code seems more practical than trying to guess what should be done.

Martin Friebe

2014-07-20 14:39

manager   ~0076287

Both "Turkish `I`" and diacritics are about normalization. I indicated: not a blocker.

What I was concerned about is word boundaries.

If I am not mistaken about your patch, then you define all UTF-8 chars above 127 as "wordchar" (a letter in a word).

- If you did not, then I must have overlooked something.

- If you do:
That is not correct.
There are many spaces and punctuation in that range.

If someone uses "fullwidth" chars, then in "Latin, Other" the "," would be > 127 (and maybe even the space too). So the "," would be part of the word. Certainly not correct.

Roughly making 3 categories.

1) English: mostly working
2) European and similar languages, based on Latin, but with diacritics and similar. With your patch:
- accented chars no longer break the word
- punctuation is purely Latin
3) Other (Eastern) languages. They may include their own punctuation, and will be broken in the way I described.

True, (3) is broken, and remains broken.
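
The word boundary problem for category (3) can be sketched as follows. This is only an illustration under a hypothetical rule, "ASCII letters and every byte above 127 are word chars"; `IsWordByte` and `FirstWord` are made-up names and do not come from SynEdit or from the patch.

```pascal
program FullwidthCommaWord;
{$mode objfpc}{$H+}

// Hypothetical rule: ASCII letters and every byte above 127 count as word chars.
function IsWordByte(C: AnsiChar): Boolean;
begin
  Result := (C in ['A'..'Z', 'a'..'z']) or (Ord(C) > 127);
end;

// Returns the leading run of word bytes in S.
function FirstWord(const S: AnsiString): AnsiString;
var
  i: Integer;
begin
  i := 1;
  while (i <= Length(S)) and IsWordByte(S[i]) do
    Inc(i);
  Result := Copy(S, 1, i - 1);
end;

const
  // U+FF0C FULLWIDTH COMMA as UTF-8; all three bytes are above 127
  FullwidthComma: AnsiString = #$EF#$BC#$8C;
begin
  // The ASCII comma ends the word after 'Latin' (5 bytes)
  WriteLn(Length(FirstWord('Latin,Other')));
  // The fullwidth comma is swallowed, so the "word" spans all 13 bytes
  WriteLn(Length(FirstWord('Latin' + FullwidthComma + 'Other')));
end.
```

With the ASCII comma the word ends after 5 bytes; with the fullwidth comma the whole 13-byte string is treated as one word, so a keyword next to such punctuation would not be matched.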


--
Anyway that problem applies to a lot of parts in SynEdit.

I remember I added a similar patch elsewhere (word completion), iirc providing some ability to add wordbreak determination.

So I might add it, or add it with modification.

In any case unfortunately all my assignments are currently on hold, due to me being busy outside the Lazarus project.

Antônio Galvão

2014-07-21 00:04

reporter   ~0076301

Last edited: 2014-07-21 00:15

Yes, thanks.

P.S.: The UTF-8 chars I added to the IDENTS string are from Portuguese only.

Issue History

Date Modified Username Field Change
2014-07-13 17:52 Antônio Galvão New Issue
2014-07-13 17:52 Antônio Galvão File Added: SynHighlighterAny.diff
2014-07-13 20:23 Bart Broersma Note Added: 0076210
2014-07-13 22:25 Jonas Maebe Project FPC => Lazarus
2014-07-14 09:45 Antônio Galvão File Added: SynHighlighterAny_2.diff
2014-07-14 09:46 Antônio Galvão Note Added: 0076212
2014-07-14 22:32 Antônio Galvão Note Added: 0076216
2014-07-15 00:24 Antônio Galvão Note Edited: 0076216
2014-07-15 00:38 Antônio Galvão Note Edited: 0076216
2014-07-17 21:07 Martin Friebe LazTarget => -
2014-07-18 02:09 Martin Friebe Note Added: 0076246
2014-07-18 08:37 Martin Friebe Assigned To => Martin Friebe
2014-07-18 08:37 Martin Friebe Status new => feedback
2014-07-18 10:53 Martin Friebe Note Added: 0076247
2014-07-18 18:33 Antônio Galvão Note Added: 0076255
2014-07-18 18:33 Antônio Galvão Status feedback => assigned
2014-07-18 19:29 Antônio Galvão Note Edited: 0076255
2014-07-18 19:53 Antônio Galvão Note Edited: 0076255
2014-07-18 19:55 Antônio Galvão Note Edited: 0076255
2014-07-18 20:03 Antônio Galvão Note Edited: 0076255
2014-07-18 20:22 Antônio Galvão Note Edited: 0076255
2014-07-20 14:39 Martin Friebe Note Added: 0076287
2014-07-21 00:04 Antônio Galvão Note Added: 0076301
2014-07-21 00:15 Antônio Galvão Note Edited: 0076301
2014-07-21 00:15 Antônio Galvão Note Edited: 0076301