View Issue Details
|ID||Project||Category||View Status||Date Submitted||Last Update|
|0018842||Lazarus||IDE||public||2011-02-28 10:35||2011-10-27 13:28|
|Reporter||Tomasz Wieckowski||Assigned To||Felipe Monteiro de Carvalho|
|Platform||Win32||OS||Windows XP||OS Version|
|Product Version||0.9.31 (SVN)||Product Build||29452|
|Target Version||Fixed in Version||0.9.31 (SVN)|
|Summary||0018842: [Patch] Find Dialog with UTF8 - case sensitive|
|Description||IDE Find function does'n respect "Case sensitive" option for utf8 text.|
If I look for as ex. "średnia" (polish char) and "Case sensitive" is unchecked then it can't find "Średnia" in source editor.
|Tags||No tags attached.|
|Fixed in Revision||33031|
syneditsearch.patch (1,954 bytes)
Index: components/synedit/syneditsearch.pp =================================================================== --- components/synedit/syneditsearch.pp (revision 31963) +++ components/synedit/syneditsearch.pp (working copy) @@ -424,6 +424,7 @@ MinY: LongInt; IsFirstLine: boolean; SearchLineEndPos: LongInt; + SearchForStr, FirstPattern: String; MaxPos: Integer; xStep: Integer; @@ -841,8 +842,10 @@ end else FirstPattern:=copy(Pat,1,SearchLineEndPos-1); end; - SearchFor:=PChar(FirstPattern); - SearchLen:=length(FirstPattern); + SearchForStr := FirstPattern; + if not fSensitive then SearchForStr := UTF8UpperCase(SearchForStr); + SearchFor:=PChar(SearchForStr); + SearchLen:=Length(SearchForStr); if fRegExpr then begin RegExprEngine.ModifierI:=not fSensitive; @@ -865,6 +868,7 @@ // regex multi line whole word repeat LineStr:=Lines[y]; + if not fSensitive then LineStr := UTF8UpperCase(LineStr); LineLen:=length(LineStr); Line:=PChar(LineStr); if not IsFirstLine then begin @@ -919,12 +923,12 @@ //DebugLn(['TSynEditSearch.FindNextOne x=',x,' MaxPos=',MaxPos,' Line="',Line,'"']); while (x>=0) and (x<=MaxPos) do begin //DebugLn(['TSynEditSearch.FindNextOne x=',x]); - if (SearchLen=0) or (CompTable[Line[x]]=CompTable[SearchFor^]) then begin + if (SearchLen=0) or (Line[x]=SearchFor^) then begin //DebugLn(['TSynEditSearch.FindNextOne First character found x=',x,' Line[x]=',Line[x]]); if (not fWhole) or (x=0) or (not (Line[x-1] in FIdentChars)) then begin i:=1; - while (i<SearchLen) and (CompTable[Line[x+i]]=CompTable[SearchFor[i]]) + while (i<SearchLen) and (Line[x+i]=SearchFor[i]) do inc(i); //DebugLn(['TSynEditSearch.FindNextOne x=',x,' SearchLen=',SearchLen,' i=',i]);
syneditsearch.patch (1,954 bytes)
||But there is a problem with "whole word" option, cause word boundary is determined as not(Line[x] in FIdentChars) [see syneditsearch.pp:574,569,929] where FIdentChars=['a'..'z','A'..'Z','0'..'9','_'], so all unicode characters become word boundaries|
Can't currently test the patch, as UTF8UpperCase does not work on my platform => which also means that the patch will find random bit's of text (with completely different chars) on all platform with UTF8UpperCase failing.
so currently applying this patch, will break various platforms (due to a bug in UTF8UpperCase).
As a note from review:
Do all languages have all letter with either no upper/lower; or with exactly one opposite?
Then there is the problem with the Turkish "i"
And it should be case-folding, not mapping:
Anyway, even without all this, the current patch will be an improvement.
I have some changes in mind (but can't test yet)
Instead of mapping each line to upper, it may be faster to map the search-text to both upper and lower, and compare each line[x] to the 2 search strings. Except that depends on CPU caches, so it may need some testing...
One easy solution is to just use lower case instead of uppercase. Use UTF8LowerCase from lazutf8 not from LCLProc.
> Then there is the problem with the Turkish "i"
> http://www.i18nguy.com/unicode/turkish-i18n.html [^]
Yep, that sucks. Clearly the upper/lower case needs an optional parameter to specify the locale. There we can do the special turkish stuff.
> And it should be case-folding, not mapping:
> http://www.unicode.org/faq/casemap_charprop.html [^]
I don't foresee much problem with that, usually the returned value is used only for comparison purposes and then discharted. One should not expect the input and output strings to be of the same size.
With current utf8upper/lower the string len is constant, unless a locale is given.
However, this is not guaranteed?
So changing the LineString to upper/lower may offset the found position?
Also converting always the entire string, may be very expensive. the search is used a lot (with every caret movement, in SynEditMarkupHighAll), so this may cause a performance loss.
Now both upper and lower from lazutils should should be working pretty well. Although lower is much more optimized for speed.
> With current utf8upper/lower the string len is constant, unless a locale is given.
It is not constant at all. Not only turkish changes the length, in the latest code many characters can change the length of the string... but why should be a problem? The number of characters almost always remains constant, and that is what matters (except while uppercasing ß => SS)
> Also converting always the entire string, may be very expensive. the search is used a lot (with every caret movement, in SynEditMarkupHighAll), so this may cause a performance loss.
Maybe we should just try and see if it really makes it much slower.
||One issue with the patch is that it should use LowerCase not UpperCase IMO|
||Maybe we could apply with IFDEFs? Then it can be easily rolled out if it is inneficient.|
Currently there are 2 (maybe 3) issues holding back: (please can we discuss on mailing list, unless an actual change to the patch is proposed? It is getting quite generic on this report)
1) usability of various utf8upper/lower methods
2) changes of character pos in utf8lower => even if not currently happening. It may in future, and then it will be more work to find, than to fix now
3) potential performance. Lower-casing all lines, unconditionally may have an impact (not measured yet).
The search is triggered on simply moving the caret, lowering it's performance might affect normal viewing text in the editor.
> 1) usability of various utf8upper/lower methods
> 2) changes of character pos in utf8lower => even if not currently happening. It may in future, and then it will be more work to find, than to fix now
I claim that these 2 issues do not exits, because I have implemented the *entire* lowercase Unicode table and there was no single instance of a character being lowercased to 2 or more chars.
One could say that maybe Unicode might release a new table in the future which presents this peculiarity, but that's rather speculative.
> 3) potential performance. Lower-casing all lines, unconditionally may have an impact (not measured yet).
> The search is triggered on simply moving the caret, lowering it's performance might affect normal viewing text in the editor.
It needs to search the entire unit each time the caret is moved? I agree that this might be a huge problem.
||One solution might be having a parameter in the search routine: Do a fast search only for ASCII or a more precise search.|
||I implemented it as a new parameter in the routine which is off by default. It is on only when using TSynEdit.SearchReplaceEx. it is not used by the highlighter so it should have no speed penalty. And this bug was only about the IDE Find Dialog anyway, so it should be no problem to lack support in the highlighter for this.|
|2011-02-28 10:35||Tomasz Wieckowski||New Issue|
|2011-02-28 10:35||Tomasz Wieckowski||Widgetset||=> Win32/Win64|
|2011-03-30 22:37||Vincent Snijders||LazTarget||=> -|
|2011-03-30 22:37||Vincent Snijders||Status||new => acknowledged|
|2011-08-13 14:10||Anton||File Added: syneditsearch.patch|
|2011-08-13 14:16||Anton||Note Added: 0050799|
|2011-08-13 20:45||Maxim Ganetsky||LazTarget||- => 0.99.0|
|2011-08-13 20:45||Maxim Ganetsky||Target Version||=> 0.99.0|
|2011-08-13 20:49||Maxim Ganetsky||Status||acknowledged => assigned|
|2011-08-13 20:49||Maxim Ganetsky||Assigned To||=> Martin Friebe|
|2011-10-01 23:09||Felipe Monteiro de Carvalho||Summary||Find Dialog with UTF8 - case sensitive => [Patch] Find Dialog with UTF8 - case sensitive|
|2011-10-06 15:45||Martin Friebe||Note Added: 0052657|
|2011-10-06 16:04||Felipe Monteiro de Carvalho||Note Added: 0052659|
|2011-10-07 13:04||Martin Friebe||Note Added: 0052727|
|2011-10-17 07:25||Felipe Monteiro de Carvalho||Note Added: 0053070|
|2011-10-17 07:28||Felipe Monteiro de Carvalho||Note Added: 0053071|
|2011-10-17 07:29||Felipe Monteiro de Carvalho||Note Added: 0053072|
|2011-10-17 12:40||Martin Friebe||Note Added: 0053090|
|2011-10-17 13:30||Felipe Monteiro de Carvalho||Note Added: 0053092|
|2011-10-17 14:28||Felipe Monteiro de Carvalho||Note Added: 0053093|
|2011-10-22 20:45||Felipe Monteiro de Carvalho||Assigned To||Martin Friebe => Felipe Monteiro de Carvalho|
|2011-10-22 21:31||Felipe Monteiro de Carvalho||Fixed in Revision||=> 33031|
|2011-10-22 21:31||Felipe Monteiro de Carvalho||Status||assigned => resolved|
|2011-10-22 21:31||Felipe Monteiro de Carvalho||Fixed in Version||=> 0.9.31 (SVN)|
|2011-10-22 21:31||Felipe Monteiro de Carvalho||Resolution||open => fixed|
|2011-10-22 21:31||Felipe Monteiro de Carvalho||Note Added: 0053300|
|2011-10-27 13:28||Tomasz Wieckowski||Status||resolved => closed|