View Issue Details

IDProjectCategoryView StatusLast Update
0018842LazarusIDEpublic2011-10-27 13:28
ReporterTomasz WieckowskiAssigned ToFelipe Monteiro de Carvalho 
PrioritynormalSeverityminorReproducibilityalways
Status closedResolutionfixed 
PlatformWin32OSWindows XPOS Version
Product Version0.9.31 (SVN)Product Build29452 
Target VersionFixed in Version0.9.31 (SVN) 
Summary0018842: [Patch] Find Dialog with UTF8 - case sensitive
DescriptionIDE Find function does'n respect "Case sensitive" option for utf8 text.

If I look for as ex. "średnia" (polish char) and "Case sensitive" is unchecked then it can't find "Średnia" in source editor.
TagsNo tags attached.
Fixed in Revision33031
LazTarget0.99.0
WidgetsetWin32/Win64
Attached Files
  • syneditsearch.patch (1,954 bytes)
    Index: components/synedit/syneditsearch.pp
    ===================================================================
    --- components/synedit/syneditsearch.pp	(revision 31963)
    +++ components/synedit/syneditsearch.pp	(working copy)
    @@ -424,6 +424,7 @@
       MinY: LongInt;
       IsFirstLine: boolean;
       SearchLineEndPos: LongInt;
    +  SearchForStr,
       FirstPattern: String;
       MaxPos: Integer;
       xStep: Integer;
    @@ -841,8 +842,10 @@
         end else
           FirstPattern:=copy(Pat,1,SearchLineEndPos-1);
       end;
    -  SearchFor:=PChar(FirstPattern);
    -  SearchLen:=length(FirstPattern);
    +  SearchForStr := FirstPattern;
    +  if not fSensitive then SearchForStr := UTF8UpperCase(SearchForStr);
    +  SearchFor:=PChar(SearchForStr);
    +  SearchLen:=Length(SearchForStr);
     
       if fRegExpr then begin
         RegExprEngine.ModifierI:=not fSensitive;
    @@ -865,6 +868,7 @@
       //          regex multi line whole word
       repeat
         LineStr:=Lines[y];
    +    if not fSensitive then LineStr := UTF8UpperCase(LineStr);
         LineLen:=length(LineStr);
         Line:=PChar(LineStr);
         if not IsFirstLine then begin
    @@ -919,12 +923,12 @@
             //DebugLn(['TSynEditSearch.FindNextOne x=',x,' MaxPos=',MaxPos,' Line="',Line,'"']);
             while (x>=0) and (x<=MaxPos) do begin
               //DebugLn(['TSynEditSearch.FindNextOne x=',x]);
    -          if (SearchLen=0) or (CompTable[Line[x]]=CompTable[SearchFor^]) then begin
    +          if (SearchLen=0) or (Line[x]=SearchFor^) then begin
                 //DebugLn(['TSynEditSearch.FindNextOne First character found x=',x,' Line[x]=',Line[x]]);
                 if (not fWhole)
                 or (x=0) or (not (Line[x-1] in FIdentChars)) then begin
                   i:=1;
    -              while (i<SearchLen) and (CompTable[Line[x+i]]=CompTable[SearchFor[i]])
    +              while (i<SearchLen) and (Line[x+i]=SearchFor[i])
                   do
                     inc(i);
                   //DebugLn(['TSynEditSearch.FindNextOne x=',x,' SearchLen=',SearchLen,' i=',i]);
    
    syneditsearch.patch (1,954 bytes)

Activities

2011-08-13 14:10

 

syneditsearch.patch (1,954 bytes)
Index: components/synedit/syneditsearch.pp
===================================================================
--- components/synedit/syneditsearch.pp	(revision 31963)
+++ components/synedit/syneditsearch.pp	(working copy)
@@ -424,6 +424,7 @@
   MinY: LongInt;
   IsFirstLine: boolean;
   SearchLineEndPos: LongInt;
+  SearchForStr,
   FirstPattern: String;
   MaxPos: Integer;
   xStep: Integer;
@@ -841,8 +842,10 @@
     end else
       FirstPattern:=copy(Pat,1,SearchLineEndPos-1);
   end;
-  SearchFor:=PChar(FirstPattern);
-  SearchLen:=length(FirstPattern);
+  SearchForStr := FirstPattern;
+  if not fSensitive then SearchForStr := UTF8UpperCase(SearchForStr);
+  SearchFor:=PChar(SearchForStr);
+  SearchLen:=Length(SearchForStr);
 
   if fRegExpr then begin
     RegExprEngine.ModifierI:=not fSensitive;
@@ -865,6 +868,7 @@
   //          regex multi line whole word
   repeat
     LineStr:=Lines[y];
+    if not fSensitive then LineStr := UTF8UpperCase(LineStr);
     LineLen:=length(LineStr);
     Line:=PChar(LineStr);
     if not IsFirstLine then begin
@@ -919,12 +923,12 @@
         //DebugLn(['TSynEditSearch.FindNextOne x=',x,' MaxPos=',MaxPos,' Line="',Line,'"']);
         while (x>=0) and (x<=MaxPos) do begin
           //DebugLn(['TSynEditSearch.FindNextOne x=',x]);
-          if (SearchLen=0) or (CompTable[Line[x]]=CompTable[SearchFor^]) then begin
+          if (SearchLen=0) or (Line[x]=SearchFor^) then begin
             //DebugLn(['TSynEditSearch.FindNextOne First character found x=',x,' Line[x]=',Line[x]]);
             if (not fWhole)
             or (x=0) or (not (Line[x-1] in FIdentChars)) then begin
               i:=1;
-              while (i<SearchLen) and (CompTable[Line[x+i]]=CompTable[SearchFor[i]])
+              while (i<SearchLen) and (Line[x+i]=SearchFor[i])
               do
                 inc(i);
               //DebugLn(['TSynEditSearch.FindNextOne x=',x,' SearchLen=',SearchLen,' i=',i]);
syneditsearch.patch (1,954 bytes)

Anton

2011-08-13 14:16

reporter   ~0050799

But there is a problem with "whole word" option, cause word boundary is determined as not(Line[x] in FIdentChars) [see syneditsearch.pp:574,569,929] where FIdentChars=['a'..'z','A'..'Z','0'..'9','_'], so all unicode characters become word boundaries

Martin Friebe

2011-10-06 15:45

manager   ~0052657

Can't currently test the patch, as UTF8UpperCase does not work on my platform => which also means that the patch will find random bit's of text (with completely different chars) on all platform with UTF8UpperCase failing.

so currently applying this patch, will break various platforms (due to a bug in UTF8UpperCase).


As a note from review:
Do all languages have all letter with either no upper/lower; or with exactly one opposite?

Then there is the problem with the Turkish "i"
http://www.i18nguy.com/unicode/turkish-i18n.html

And it should be case-folding, not mapping:
http://www.unicode.org/faq/casemap_charprop.html

Anyway, even without all this, the current patch will be an improvement.
I have some changes in mind (but can't test yet)

Instead of mapping each line to upper, it may be faster to map the search-text to both upper and lower, and compare each line[x] to the 2 search strings. Except that depends on CPU caches, so it may need some testing...

Felipe Monteiro de Carvalho

2011-10-06 16:04

developer   ~0052659

One easy solution is to just use lower case instead of uppercase. Use UTF8LowerCase from lazutf8 not from LCLProc.

> Then there is the problem with the Turkish "i"
> http://www.i18nguy.com/unicode/turkish-i18n.html [^]

Yep, that sucks. Clearly the upper/lower case needs an optional parameter to specify the locale. There we can do the special turkish stuff.

> And it should be case-folding, not mapping:
> http://www.unicode.org/faq/casemap_charprop.html [^]

I don't foresee much problem with that, usually the returned value is used only for comparison purposes and then discharted. One should not expect the input and output strings to be of the same size.

Martin Friebe

2011-10-07 13:04

manager   ~0052727

With current utf8upper/lower the string len is constant, unless a locale is given.

However, this is not guaranteed?
So changing the LineString to upper/lower may offset the found position?

Also converting always the entire string, may be very expensive. the search is used a lot (with every caret movement, in SynEditMarkupHighAll), so this may cause a performance loss.

Felipe Monteiro de Carvalho

2011-10-17 07:25

developer   ~0053070

Now both upper and lower from lazutils should should be working pretty well. Although lower is much more optimized for speed.

> With current utf8upper/lower the string len is constant, unless a locale is given.

It is not constant at all. Not only turkish changes the length, in the latest code many characters can change the length of the string... but why should be a problem? The number of characters almost always remains constant, and that is what matters (except while uppercasing ß => SS)

> Also converting always the entire string, may be very expensive. the search is used a lot (with every caret movement, in SynEditMarkupHighAll), so this may cause a performance loss.

Maybe we should just try and see if it really makes it much slower.

Felipe Monteiro de Carvalho

2011-10-17 07:28

developer   ~0053071

One issue with the patch is that it should use LowerCase not UpperCase IMO

Felipe Monteiro de Carvalho

2011-10-17 07:29

developer   ~0053072

Maybe we could apply with IFDEFs? Then it can be easily rolled out if it is inneficient.

Martin Friebe

2011-10-17 12:40

manager   ~0053090

Currently there are 2 (maybe 3) issues holding back: (please can we discuss on mailing list, unless an actual change to the patch is proposed? It is getting quite generic on this report)

1) usability of various utf8upper/lower methods

2) changes of character pos in utf8lower => even if not currently happening. It may in future, and then it will be more work to find, than to fix now

3) potential performance. Lower-casing all lines, unconditionally may have an impact (not measured yet).
The search is triggered on simply moving the caret, lowering it's performance might affect normal viewing text in the editor.

Felipe Monteiro de Carvalho

2011-10-17 13:30

developer   ~0053092

> 1) usability of various utf8upper/lower methods
> 2) changes of character pos in utf8lower => even if not currently happening. It may in future, and then it will be more work to find, than to fix now

I claim that these 2 issues do not exits, because I have implemented the *entire* lowercase Unicode table and there was no single instance of a character being lowercased to 2 or more chars.

One could say that maybe Unicode might release a new table in the future which presents this peculiarity, but that's rather speculative.

> 3) potential performance. Lower-casing all lines, unconditionally may have an impact (not measured yet).
> The search is triggered on simply moving the caret, lowering it's performance might affect normal viewing text in the editor.

It needs to search the entire unit each time the caret is moved? I agree that this might be a huge problem.

Felipe Monteiro de Carvalho

2011-10-17 14:28

developer   ~0053093

One solution might be having a parameter in the search routine: Do a fast search only for ASCII or a more precise search.

Felipe Monteiro de Carvalho

2011-10-22 21:31

developer   ~0053300

I implemented it as a new parameter in the routine which is off by default. It is on only when using TSynEdit.SearchReplaceEx. it is not used by the highlighter so it should have no speed penalty. And this bug was only about the IDE Find Dialog anyway, so it should be no problem to lack support in the highlighter for this.

Issue History

Date Modified Username Field Change
2011-02-28 10:35 Tomasz Wieckowski New Issue
2011-02-28 10:35 Tomasz Wieckowski Widgetset => Win32/Win64
2011-03-30 22:37 Vincent Snijders LazTarget => -
2011-03-30 22:37 Vincent Snijders Status new => acknowledged
2011-08-13 14:10 Anton File Added: syneditsearch.patch
2011-08-13 14:16 Anton Note Added: 0050799
2011-08-13 20:45 Maxim Ganetsky LazTarget - => 0.99.0
2011-08-13 20:45 Maxim Ganetsky Target Version => 0.99.0
2011-08-13 20:49 Maxim Ganetsky Status acknowledged => assigned
2011-08-13 20:49 Maxim Ganetsky Assigned To => Martin Friebe
2011-10-01 23:09 Felipe Monteiro de Carvalho Summary Find Dialog with UTF8 - case sensitive => [Patch] Find Dialog with UTF8 - case sensitive
2011-10-06 15:45 Martin Friebe Note Added: 0052657
2011-10-06 16:04 Felipe Monteiro de Carvalho Note Added: 0052659
2011-10-07 13:04 Martin Friebe Note Added: 0052727
2011-10-17 07:25 Felipe Monteiro de Carvalho Note Added: 0053070
2011-10-17 07:28 Felipe Monteiro de Carvalho Note Added: 0053071
2011-10-17 07:29 Felipe Monteiro de Carvalho Note Added: 0053072
2011-10-17 12:40 Martin Friebe Note Added: 0053090
2011-10-17 13:30 Felipe Monteiro de Carvalho Note Added: 0053092
2011-10-17 14:28 Felipe Monteiro de Carvalho Note Added: 0053093
2011-10-22 20:45 Felipe Monteiro de Carvalho Assigned To Martin Friebe => Felipe Monteiro de Carvalho
2011-10-22 21:31 Felipe Monteiro de Carvalho Fixed in Revision => 33031
2011-10-22 21:31 Felipe Monteiro de Carvalho Status assigned => resolved
2011-10-22 21:31 Felipe Monteiro de Carvalho Fixed in Version => 0.9.31 (SVN)
2011-10-22 21:31 Felipe Monteiro de Carvalho Resolution open => fixed
2011-10-22 21:31 Felipe Monteiro de Carvalho Note Added: 0053300
2011-10-27 13:28 Tomasz Wieckowski Status resolved => closed