View Issue Details

ID: 0029817
Project: Lazarus
Category: LazUtils
View Status: public
Last Update: 2016-03-12 19:31
Reporter: Bernard Marcelly
Assigned To: Bart Broersma
Priority: normal
Severity: minor
Reproducibility: always
Status: resolved
Resolution: fixed
Platform: PC
OS: Windows 7 Home
OS Version: SP1
Product Version: 1.6
Product Build:
Target Version: 1.8
Fixed in Version: 1.8
Summary: 0029817: Request: UTF8ReverseString, UTF8RPos

Description:
The RTL offers Pos() and RPos() functions, but they do not handle UTF8 strings.
Unit LazUTF8 offers UTF8Pos(), but it lacks UTF8RPos().

AnsiReverseString does not handle UTF8 strings, and there is no UTF8ReverseString.

Since ordinary strings are UTF8 strings in Lazarus 1.6, these routines would come in handy.
Additional Information:
// code proposal (tested OK with strings containing national characters)

  function UTF8ReverseString(p: PChar; const ByteCount: LongInt): string;
  var
    CharLen, rBytePos: LongInt;
  begin
    setlength(Result,ByteCount);
    rBytePos:= ByteCount+1;
    while (rBytePos > 1) do
      begin
        CharLen:=UTF8CharacterLength(p);
        dec(rBytePos, CharLen);
        Move(p^, Result[rBytePos], CharLen);
        inc(p,CharLen);
      end;
  end;

  function UTF8ReverseString(const AText: string): string; inline;
  begin
    Result:= UTF8ReverseString(PChar(AText), length(AText));
  end;

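  function UTF8RPos(const Substr, Source: string): integer;
  var
    RevSubstr, RevSource: string;
    pRev: integer;
  begin
    // Byte-level check first: no match means no reversal work is needed.
    if Pos(Substr, Source) = 0
    then
      Result:= 0
    else
      begin
        // The last match in Source is the first match in the reversed strings.
        RevSubstr:= UTF8ReverseString(Substr);
        RevSource:= UTF8ReverseString(Source);
        pRev:= UTF8Pos(RevSubstr, RevSource);
        // Map the codepoint index in the reversed string back to Source.
        Result:= UTF8Length(Source) -pRev -UTF8Length(Substr) +2;
      end;
  end;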
Tags: No tags attached.
Fixed in Revision: r51927
LazTarget: 1.8
Widgetset:
Attached Files

Activities

Juha Manninen

2016-03-11 20:18

developer   ~0090900

Last edited: 2016-03-11 20:19


Your UTF8RPos implementation is slow. It may be faster to find all occurrences and return the last one.
Assigning to Bart, the author of ReverseString.
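
A minimal sketch of the "find all occurrences and return the last one" idea, for comparison with the proposal above. It assumes the UTF8Pos overload in LazUTF8 that takes an optional StartPos argument; the name UTF8RPosByScan is hypothetical and not part of LazUTF8.

```pascal
program RPosByScanDemo;
{$mode objfpc}{$H+}

uses
  LazUTF8; // from the LazUtils package

// Hypothetical helper: returns the codepoint index of the last
// occurrence of Substr in Source, or 0 if there is none.
function UTF8RPosByScan(const Substr, Source: string): PtrInt;
var
  p: PtrInt;
begin
  Result := 0;
  if Substr = '' then
    Exit;
  p := UTF8Pos(Substr, Source);
  while p > 0 do
  begin
    Result := p;                          // remember the latest hit
    p := UTF8Pos(Substr, Source, p + 1);  // continue one codepoint further
  end;
end;

begin
  WriteLn(UTF8RPosByScan('ab', 'abcab'));
  // "café é" written with explicit UTF-8 bytes for é (C3 A9)
  WriteLn(UTF8RPosByScan(#$C3#$A9, 'caf'#$C3#$A9' '#$C3#$A9));
end.
```

Each repeated call has to skip StartPos codepoints from the beginning of Source again, so the cost grows with the number of matches.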

Bart Broersma

2016-03-11 23:37

developer   ~0090906

> Assigning to Bart, the author of ReverseString.
????

I did quite OK in "The Slowest Pascal ReverseString competition" (http://www.flyingsheep.nl/reversestring.htm), but I did not write fpc's ReverseString.
(That competition is still open for entries ...)

B.t.w. Handling decomposed characters will most likely fail in any UTF8ReverseString algorithm I can think of.
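
To illustrate the point about decomposed characters, here is a small sketch that reverses codepoint by codepoint, the same way as the proposal in the report. ReverseByCodepoint is a hypothetical local helper, and the input is "e" followed by U+0301 (combining acute accent) and "x", written as explicit UTF-8 bytes.

```pascal
program DecomposedReverseDemo;
{$mode objfpc}{$H+}

uses
  LazUTF8; // from the LazUtils package

// Hypothetical helper: reverses S one codepoint at a time.
// (UTF8CharacterLength may be flagged as deprecated in newer LazUTF8.)
function ReverseByCodepoint(const S: string): string;
var
  p: PChar;
  CharLen, rBytePos: Integer;
begin
  SetLength(Result, Length(S));
  p := PChar(S);
  rBytePos := Length(S) + 1;
  while rBytePos > 1 do
  begin
    CharLen := UTF8CharacterLength(p);
    Dec(rBytePos, CharLen);
    Move(p^, Result[rBytePos], CharLen);
    Inc(p, CharLen);
  end;
end;

var
  R: string;
begin
  // 'e' + U+0301 + 'x' renders as an accented e followed by x.
  R := ReverseByCodepoint('e'#$CC#$81'x');
  // The combining mark now follows 'x', so the accent moves to the x.
  WriteLn(R = 'x'#$CC#$81'e');
end.
```

The byte sequence stays valid UTF-8, but the combining mark ends up attached to a different base character than in the original text.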

jamie philbrook

2016-03-12 02:00

reporter   ~0090908

How about converting both strings with UTF8ToUTF32?

Then do the search using direct indexing, with no need to worry
about code points.

 When done, convert the results back to a UTF8 length.
You don't need to keep the UTF32 data around for long, so allocating
32-bit code points is not much of a problem.

 When doing the UTF8ToUTF32 conversion, one could reallocate in blocks along the
way.
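
A rough sketch of this UTF-32 idea, under stated assumptions: DecodeUTF8, TCodePoints and UTF8RPosViaUTF32 are hypothetical names, and the decoding relies on UTF8CharacterToUnicode from LazUTF8 (called UTF8CodepointToUnicode in newer versions). Instead of reallocating in blocks, it allocates the upper bound of one codepoint per byte once and trims afterwards.

```pascal
program RPosViaUTF32Demo;
{$mode objfpc}{$H+}

uses
  LazUTF8; // from the LazUtils package

type
  TCodePoints = array of Cardinal;

// Hypothetical helper: one 32-bit value per codepoint of S.
function DecodeUTF8(const S: string): TCodePoints;
var
  p, pEnd: PChar;
  CharLen, n: Integer;
begin
  SetLength(Result, Length(S)); // upper bound: one codepoint per byte
  n := 0;
  p := PChar(S);
  pEnd := p + Length(S);
  while p < pEnd do
  begin
    Result[n] := UTF8CharacterToUnicode(p, CharLen);
    if CharLen < 1 then
      CharLen := 1;             // defensive: always make progress
    Inc(p, CharLen);
    Inc(n);
  end;
  SetLength(Result, n);         // trim to the real codepoint count
end;

// Hypothetical helper: last occurrence as a 1-based codepoint index.
function UTF8RPosViaUTF32(const Substr, Source: string): PtrInt;
var
  Sub, Src: TCodePoints;
  i, j: Integer;
begin
  Result := 0;
  Sub := DecodeUTF8(Substr);
  Src := DecodeUTF8(Source);
  if (Length(Sub) = 0) or (Length(Sub) > Length(Src)) then
    Exit;
  // Scan candidate start positions from the end; the index is already
  // a codepoint index, so no conversion back is needed.
  for i := Length(Src) - Length(Sub) downto 0 do
  begin
    j := 0;
    while (j < Length(Sub)) and (Src[i + j] = Sub[j]) do
      Inc(j);
    if j = Length(Sub) then
      Exit(i + 1);
  end;
end;

begin
  WriteLn(UTF8RPosViaUTF32('ab', 'abcab'));
  WriteLn(UTF8RPosViaUTF32(#$C3#$A9, 'caf'#$C3#$A9' '#$C3#$A9));
end.
```

The trade-off is an extra allocation of up to four bytes per source byte for each call, in exchange for plain array indexing during the search.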

Bart Broersma

2016-03-12 14:51

developer   ~0090933

> It may be faster to find all occurrences and return the last one.
Tested with repeated calls to Utf8Pos: this is actually significantly slower, especially when the substring occurs more than once.

Bart Broersma

2016-03-12 15:01

developer   ~0090934

Thanks for the contribution.
Please test and close if OK.

Bernard Marcelly

2016-03-12 18:32

reporter   ~0090945

Better implementation for UTF8RPos (I think...)

function UTF8RPos(const Substr, Source: string): integer;
var
  rBytePos: integer;
begin
  rBytePos:= RPos(Substr, Source);
  if rBytePos > 0
  then
    Result:= UTF8Length(Copy(Source, 1, rBytePos -1)) +1
  else
    Result:= 0;
end;

Bart Broersma

2016-03-12 19:31

developer   ~0090947

In my tests that one is actually about 1.4 times slower.

Issue History

Date Modified Username Field Change
2016-03-11 18:46 Bernard Marcelly New Issue
2016-03-11 20:18 Juha Manninen Note Added: 0090900
2016-03-11 20:18 Juha Manninen Assigned To => Bart Broersma
2016-03-11 20:18 Juha Manninen Status new => assigned
2016-03-11 20:19 Juha Manninen Note Edited: 0090900
2016-03-11 23:37 Bart Broersma Note Added: 0090906
2016-03-12 02:00 jamie philbrook Note Added: 0090908
2016-03-12 14:51 Bart Broersma Note Added: 0090933
2016-03-12 15:01 Bart Broersma Fixed in Revision => r51927
2016-03-12 15:01 Bart Broersma LazTarget => 1.8
2016-03-12 15:01 Bart Broersma Note Added: 0090934
2016-03-12 15:01 Bart Broersma Status assigned => resolved
2016-03-12 15:01 Bart Broersma Fixed in Version => 1.8
2016-03-12 15:01 Bart Broersma Resolution open => fixed
2016-03-12 15:01 Bart Broersma Target Version => 1.8
2016-03-12 18:32 Bernard Marcelly Note Added: 0090945
2016-03-12 19:31 Bart Broersma Note Added: 0090947