View Issue Details

IDProjectCategoryView StatusLast Update
0034084FPCPackagespublic2018-08-04 10:38
ReporterAlexey Tor.Assigned ToMichael Van Canneyt 
PrioritynormalSeverityminorReproducibilityN/A
Status resolvedResolutionfixed 
Product Version3.0.4Product Build 
Target Version3.2.0Fixed in Version3.1.1 
Summary0034084: RegExpr: proposal to support national letters in "\w"
Descriptionthis function is from ATSynEdit. it works for all Unicode letters:
russian
greek
german
japanese
...
it is optimized for ascii chars < 128.

uses UnicodeData;

 function IsCharWord(ch: WideChar): boolean;
 var
   NType: byte;
 begin
   case ch of
     '0'..'9',
     'a'..'z',
     'A'..'Z',
     '_':
       exit(true);
   end;

   if Ord(ch)<128 then
     exit(false)
   else
   if Ord(ch)>=LOW_SURROGATE_BEGIN then
     exit(false)
   else
   begin
     NType:= GetProps(Ord(ch))^.Category;
     Result:= (NType<=UGC_OtherNumber);
   end;
 end;


use it in RegExpr.pas. i did this in local copy of regexpr.pas:

- comment this var: fWordChars, and prop: WordChars
- replace all Pos(...., fWordChars) with call IsCharWord(..)
- one line will be weird: it calls
  EmitNNNNNNN(fWordChars)
  replace here fWordChars with const RegExprWordChars.

my test shows that CudaText editor now finds rus/greek/german letters by \w.
even with that call EmitNNNNN().
TagsNo tags attached.
Fixed in Revision39564
FPCOldBugId
FPCTarget
Attached Files
  • uni.diff (3,712 bytes)
    --- regexpr_orig.pas	2018-08-03 22:06:57.976880290 +0300
    +++ regexpr.pas	2018-08-04 10:12:49.000000000 +0300
    @@ -49,6 +49,7 @@
     interface
     
     {off $DEFINE DebugSynRegExpr}
    +{off $DEFINE UnicodeWordDetection}
     
     {$IFDEF FPC}
      {$MODE DELPHI} // Delphi-compatible mode in FreePascal
    @@ -292,7 +293,8 @@
         {$IFNDEF UniCode}
         fLineSeparatorsSet : set of REChar;
         {$ENDIF}
    -    Function IsSpaceChar(AChar : PRegExprChar) : Boolean; inline;
    +    function IsWordChar(AChar : REChar) : Boolean;
    +    function IsSpaceChar(AChar : PRegExprChar) : Boolean; inline;
     
         // Mark programm as having to be [re]compiled
         procedure InvalidateProgramm;
    @@ -661,6 +663,10 @@
     implementation
     
     {$IFDEF FPC}
    +{$IFDEF UnicodeWordDetection}
    +uses
    +  UnicodeData;
    +{$ENDIF}
     {$ELSE}
     uses
     {$IFDEF SYN_WIN32}
    @@ -1478,6 +1484,37 @@
     {==================== Compiler section =======================}
     {=============================================================}
     
    +{$IFDEF UnicodeWordDetection}
    +function TRegExpr.IsWordChar(AChar: REChar): Boolean;
    +var
    +  NType: byte;
    +begin
    +  case AChar of
    +    '0'..'9',
    +    'a'..'z',
    +    'A'..'Z',
    +    '_':
    +      exit(true);
    +  end;
    +
    +  if Ord(AChar)<128 then
    +    exit(false)
    +  else
    +  if Ord(AChar)>=LOW_SURROGATE_BEGIN then
    +    exit(false)
    +  else
    +  begin
    +    NType:= GetProps(Ord(AChar))^.Category;
    +    Result:= (NType<=UGC_OtherNumber);
    +  end;
    +end;
    +{$ELSE}
    +function TRegExpr.IsWordChar(AChar: REChar): Boolean; inline;
    +begin
    +  Result := Pos(AChar, fWordChars)>0;
    +end;
    +{$ENDIF}
    +
     function TRegExpr.IsSpaceChar(AChar: PRegExprChar): Boolean;
     begin
       Result:=Pos(AChar^,fSpaceChars)>0;
    @@ -2790,7 +2827,7 @@
         {$IFNDEF UseSetOfChar} //###0.929
         ANYLETTER:
           while (Result < TheMax) and
    -       (Pos (scan^, fWordChars) > 0) //###0.940
    +         IsWordChar(scan^) //###0.940
          {  ((scan^ >= 'a') and (scan^ <= 'z') !! I've forgotten (>='0') and (<='9')
            or (scan^ >= 'A') and (scan^ <= 'Z') or (scan^ = '_'))} do begin
             inc (Result);
    @@ -2798,7 +2835,7 @@
            end;
         NOTLETTER:
           while (Result < TheMax) and
    -       (Pos (scan^, fWordChars) <= 0)  //###0.940
    +         not IsWordChar(scan^)  //###0.940
          {   not ((scan^ >= 'a') and (scan^ <= 'z') !! I've forgotten (>='0') and (<='9')
              or (scan^ >= 'A') and (scan^ <= 'Z')
              or (scan^ = '_'))} do begin
    @@ -2930,11 +2967,11 @@
              BOUND:
              if (scan^ = BOUND)
               xor (
    -          ((reginput = fInputStart) or (Pos ((reginput - 1)^, fWordChars) <= 0))
    -            and (reginput^ <> #0) and (Pos (reginput^, fWordChars) > 0)
    +          ((reginput = fInputStart) or not IsWordChar((reginput - 1)^))
    +            and (reginput^ <> #0) and IsWordChar(reginput^)
                or
    -            (reginput <> fInputStart) and (Pos ((reginput - 1)^, fWordChars) > 0)
    -            and ((reginput^ = #0) or (Pos (reginput^, fWordChars) <= 0)))
    +            (reginput <> fInputStart) and IsWordChar((reginput - 1)^)
    +            and ((reginput^ = #0) or not IsWordChar(reginput^)))
               then EXIT;
     
              BOL: if reginput <> fInputStart
    @@ -3003,12 +3040,12 @@
                end;
              {$IFNDEF UseSetOfChar} //###0.929
              ANYLETTER: begin
    -            if (reginput^ = #0) or (Pos (reginput^, fWordChars) <= 0) //###0.943
    +            if (reginput^ = #0) or not IsWordChar(reginput^) //###0.943
                  then EXIT;
                 inc (reginput);
                end;
              NOTLETTER: begin
    -            if (reginput^ = #0) or (Pos (reginput^, fWordChars) > 0) //###0.943
    +            if (reginput^ = #0) or IsWordChar(reginput^) //###0.943
                  then EXIT;
                 inc (reginput);
                end;
    
    uni.diff (3,712 bytes)
  • tst-regex-pch.zip (41,154 bytes)

Activities

Michael Van Canneyt

2018-08-03 16:19

administrator   ~0109858

For backwards compatibility reasons we cannot do this as proposed.

A new property would need to be introduced that allows the class to choose between the existing WordChars property and your new "heuristic" algorithm.

What to do for ansistring version ?

Alexey Tor.

2018-08-03 17:38

reporter   ~0109861

For backward compat:

- add func IsCharWord
- add prop UnicodeWordDetection: bool
- inside IsCharWord add
   if fUnicodeWordDetection then
     {..my code..}
   else
     Result:= Pos(ch, fWordChars)>0;
- but must replace all calls to Pos(..., fWordChars) to new IsCharWord


for ansistring compat:
no problem too. use {$ifdef Unicode} in function IsCharWord, for new my code.

Michael Van Canneyt

2018-08-03 17:51

administrator   ~0109862

Care to supply a patch ?

Alexey Tor.

2018-08-04 09:21

reporter  

uni.diff (3,712 bytes)
--- regexpr_orig.pas	2018-08-03 22:06:57.976880290 +0300
+++ regexpr.pas	2018-08-04 10:12:49.000000000 +0300
@@ -49,6 +49,7 @@
 interface
 
 {off $DEFINE DebugSynRegExpr}
+{off $DEFINE UnicodeWordDetection}
 
 {$IFDEF FPC}
  {$MODE DELPHI} // Delphi-compatible mode in FreePascal
@@ -292,7 +293,8 @@
     {$IFNDEF UniCode}
     fLineSeparatorsSet : set of REChar;
     {$ENDIF}
-    Function IsSpaceChar(AChar : PRegExprChar) : Boolean; inline;
+    function IsWordChar(AChar : REChar) : Boolean;
+    function IsSpaceChar(AChar : PRegExprChar) : Boolean; inline;
 
     // Mark programm as having to be [re]compiled
     procedure InvalidateProgramm;
@@ -661,6 +663,10 @@
 implementation
 
 {$IFDEF FPC}
+{$IFDEF UnicodeWordDetection}
+uses
+  UnicodeData;
+{$ENDIF}
 {$ELSE}
 uses
 {$IFDEF SYN_WIN32}
@@ -1478,6 +1484,37 @@
 {==================== Compiler section =======================}
 {=============================================================}
 
+{$IFDEF UnicodeWordDetection}
+function TRegExpr.IsWordChar(AChar: REChar): Boolean;
+var
+  NType: byte;
+begin
+  case AChar of
+    '0'..'9',
+    'a'..'z',
+    'A'..'Z',
+    '_':
+      exit(true);
+  end;
+
+  if Ord(AChar)<128 then
+    exit(false)
+  else
+  if Ord(AChar)>=LOW_SURROGATE_BEGIN then
+    exit(false)
+  else
+  begin
+    NType:= GetProps(Ord(AChar))^.Category;
+    Result:= (NType<=UGC_OtherNumber);
+  end;
+end;
+{$ELSE}
+function TRegExpr.IsWordChar(AChar: REChar): Boolean; inline;
+begin
+  Result := Pos(AChar, fWordChars)>0;
+end;
+{$ENDIF}
+
 function TRegExpr.IsSpaceChar(AChar: PRegExprChar): Boolean;
 begin
   Result:=Pos(AChar^,fSpaceChars)>0;
@@ -2790,7 +2827,7 @@
     {$IFNDEF UseSetOfChar} //###0.929
     ANYLETTER:
       while (Result < TheMax) and
-       (Pos (scan^, fWordChars) > 0) //###0.940
+         IsWordChar(scan^) //###0.940
      {  ((scan^ >= 'a') and (scan^ <= 'z') !! I've forgotten (>='0') and (<='9')
        or (scan^ >= 'A') and (scan^ <= 'Z') or (scan^ = '_'))} do begin
         inc (Result);
@@ -2798,7 +2835,7 @@
        end;
     NOTLETTER:
       while (Result < TheMax) and
-       (Pos (scan^, fWordChars) <= 0)  //###0.940
+         not IsWordChar(scan^)  //###0.940
      {   not ((scan^ >= 'a') and (scan^ <= 'z') !! I've forgotten (>='0') and (<='9')
          or (scan^ >= 'A') and (scan^ <= 'Z')
          or (scan^ = '_'))} do begin
@@ -2930,11 +2967,11 @@
          BOUND:
          if (scan^ = BOUND)
           xor (
-          ((reginput = fInputStart) or (Pos ((reginput - 1)^, fWordChars) <= 0))
-            and (reginput^ <> #0) and (Pos (reginput^, fWordChars) > 0)
+          ((reginput = fInputStart) or not IsWordChar((reginput - 1)^))
+            and (reginput^ <> #0) and IsWordChar(reginput^)
            or
-            (reginput <> fInputStart) and (Pos ((reginput - 1)^, fWordChars) > 0)
-            and ((reginput^ = #0) or (Pos (reginput^, fWordChars) <= 0)))
+            (reginput <> fInputStart) and IsWordChar((reginput - 1)^)
+            and ((reginput^ = #0) or not IsWordChar(reginput^)))
           then EXIT;
 
          BOL: if reginput <> fInputStart
@@ -3003,12 +3040,12 @@
            end;
          {$IFNDEF UseSetOfChar} //###0.929
          ANYLETTER: begin
-            if (reginput^ = #0) or (Pos (reginput^, fWordChars) <= 0) //###0.943
+            if (reginput^ = #0) or not IsWordChar(reginput^) //###0.943
              then EXIT;
             inc (reginput);
            end;
          NOTLETTER: begin
-            if (reginput^ = #0) or (Pos (reginput^, fWordChars) > 0) //###0.943
+            if (reginput^ = #0) or IsWordChar(reginput^) //###0.943
              then EXIT;
             inc (reginput);
            end;
uni.diff (3,712 bytes)

Alexey Tor.

2018-08-04 09:23

reporter   ~0109870

Patch uni.diff: it adds define UnicodeWordDetection,
to test it:

- define UnicodeWordDetection
- define Unicode
- run attached test. It shows all words in angle brackets. With new define, it shows also Unicode words in brackets.

Alexey Tor.

2018-08-04 09:24

reporter  

tst-regex-pch.zip (41,154 bytes)

Michael Van Canneyt

2018-08-04 10:38

administrator   ~0109877

Thank you for the patch.

I changed and improved it somewhat.
- UnicodeWordDetection is by default defined for unicode mode.
- 2 separate functions
- Property UseUnicodeWordDetection
- IsWordChar calls IsUnicodeWordChar if UseUnicodeWordDetection is false.

For backwards compatibility, UseUnicodeWordDetection is by default false.

Modified your test program to demonstrate the new property.
It gives the same result if you run it as
"testwd 1"
I added is as example demowd.pp

Issue History

Date Modified Username Field Change
2018-08-03 15:49 Alexey Tor. New Issue
2018-08-03 16:17 Michael Van Canneyt Assigned To => Michael Van Canneyt
2018-08-03 16:17 Michael Van Canneyt Status new => assigned
2018-08-03 16:19 Michael Van Canneyt Note Added: 0109858
2018-08-03 16:19 Michael Van Canneyt Status assigned => feedback
2018-08-03 17:38 Alexey Tor. Note Added: 0109861
2018-08-03 17:38 Alexey Tor. Status feedback => assigned
2018-08-03 17:51 Michael Van Canneyt Note Added: 0109862
2018-08-04 09:21 Alexey Tor. File Added: uni.diff
2018-08-04 09:23 Alexey Tor. Note Added: 0109870
2018-08-04 09:24 Alexey Tor. File Added: tst-regex-pch.zip
2018-08-04 10:38 Michael Van Canneyt Fixed in Revision => 39564
2018-08-04 10:38 Michael Van Canneyt Note Added: 0109877
2018-08-04 10:38 Michael Van Canneyt Status assigned => resolved
2018-08-04 10:38 Michael Van Canneyt Fixed in Version => 3.1.1
2018-08-04 10:38 Michael Van Canneyt Resolution open => fixed
2018-08-04 10:38 Michael Van Canneyt Target Version => 3.2.0