View Issue Details

IDProjectCategoryView StatusLast Update
0038939LazarusLazUtilspublic2021-06-03 21:28
ReporterMichel Assigned ToMattias Gaertner  
PrioritynormalSeveritymajorReproducibilityalways
Status resolvedResolutionfixed 
Product Version2.0.13 (SVN) 
Fixed in Version2.1 (SVN) 
Summary0038939: LazUTF8 function UTF8FixBroken Endless loop
Descriptionp = 0000195 -- I hope the debugger gave me the right values.
p[1] = 0000195 -- I hope the debugger gave me the right values.

But I think you can see that if p is not incremented, there is an infinite loop.

{ fix any broken UTF8 sequences with spaces }
procedure UTF8FixBroken(P: PChar);
var
  c: cardinal;
begin
  if p=nil then exit;
 while p^<>#0 do begin
    if ord(p^)<%10000000 then begin
      // regular single byte character
      inc(p);
    end
    else if ord(p^)<%11000000 then begin
      // invalid
      p^:=' ';
      inc(p);
    end
    else if ((ord(p^) and %11100000) = %11000000) then begin
      // starts with %110 => should be 2 byte character
      if ((ord(p[1]) and %11000000) = %10000000) then begin
        c:=((ord(p^) and %00011111) shl 6);
           //or (ord(p[1]) and %00111111);
        if c<(1 shl 7) then
          p^:=' ' // fix XSS attack
        else
          inc(p,2)
      end
      else if p[1]<>#0 then <-- No counting up from p to here. Then back to while and on and so on endless.
        p^:=' ';
    end
TagsNo tags attached.
Fixed in Revisionr65163
LazTarget2.2
Widgetset
Attached Files

Activities

Bart Broersma

2021-05-27 20:28

developer   ~0131055

Indeed an endless loop if it encounters Chr(197)
There seem to be more occasions where p isn't incremented?

Bart Broersma

2021-05-27 20:52

developer   ~0131056

Last edited: 2021-05-29 10:48

View 2 revisions

> else if p[1]<>#0 then
The check for p[1]<>#0 seems wrong to me also.
At this point in the code we already have established (I think) that p^ (==p[0]) cannot be correct in any case, so it must be converted to a space?

jamie philbrook

2021-05-29 18:17

reporter   ~0131081

I also see another problem of "inc(p, 2)", there is no test to determine if there is a #0 on the first iteration before doing the second. you could be in a position where this string could be the start of another string that will be combined later to form a single string. So current it may simply jump over a #0 and then start looking at junk memory.

 and why is this code really needed, looks like someone's midnight candle burning attempt to fix a problem they thought was there?

Bart Broersma

2021-05-29 21:15

developer   ~0131082

The inc(p,2) occurs when a valid 2-byte codepoint is found.
If I understand the code correctly, if will skipp all the follwong else parts en come back to the while loop.
There a check for #0 is in place.
So this should be OK I think.

jamie philbrook

2021-05-29 23:08

reporter   ~0131083

Most likely, it just does not look clean from a quick look, its hard to fallow at the end..

jamie philbrook

2021-05-30 13:48

reporter   ~0131091

Last edited: 2021-05-30 13:51

View 2 revisions

Consider this as a replacement:

Procedure UTF8FixBrokenRevised(P:Pchar);
Var
  C:Cardinal;
Begin
  While P^ <> #0 Do
   Begin
    If Ord(P^) > $7F Then
     Begin
      If ORd(P^) < %11000000 Then P^:=' ' Else
       If ((Ord(P^)and %11100000) = %11000000) Then
       Begin
         If ((Ord(P[1]) and %11000000)=%10000000) Then
          Begin
            C := ((Ord(P^) and %00011111) shl 6);
            if C < (1 shl 7) Then P^:=' ' //Fix XSS Attack, what ever that is?
              Else Inc(P);
          end else
          if P[1] <> #0 Then P^ := ' ';
        end;
     end;
     Inc(P);
   End;
end;
 
I tested it with a char of 197 In the string, it does not lockup..

Bart Broersma

2021-05-30 18:49

developer   ~0131095

Last edited: 2021-05-30 18:50

View 2 revisions

This:

  c:=((ord(p^) and %00011111) shl 6);
           //or (ord(p[1]) and %00111111);
        if c<(1 shl 7) then

Basically tests if ord(p^) < 11000010
We already know that the byte starts with %110, so all this bitshifting, anding and then comparing to yet anothe bitshifted value seems a bit overkill?
Or did I overlook something?

Bart Broersma

2021-05-30 21:02

developer   ~0131098

According to https://en.wikipedia.org/wiki/UTF-8, a leading byte for a 4-byte codepoint can never be $F4 or greater.
Nevertheless Utf8FixBroken accepts the 4 byte sequence $F7 $BF $BF $BF.

Bart Broersma

2021-05-30 21:30

developer   ~0131099

Handling of possible 2-byte codepoints could be rewritten as:
    else if ((ord(p^) and %11100000) = %11000000) then begin
      // starts with %110 => should be 2 byte character
      //first byte must be >= %11000010  { $C2 }
      if (ord(p^)<%11000010) then begin
        p^:=' ';  // (C0 and C1) could be used only for a 2-byte encoding of a 7-bit ASCII character which should be encoded in 1 byte
        Inc(p);
      end
      else
      //second byte should be in range %10000000..%10111111
      if ((ord(p[1]) and %11000000) = %10000000) then begin
        inc(p,2)
      end
      else
      begin
        p^:=' ';
        Inc(p);
      end
    end


I changed the order of inspection: first fully (dis)qualify the first byte, only after that inspec the second one.

Bart Broersma

2021-06-03 21:28

developer   ~0131143

Please test and close if OK.

Issue History

Date Modified Username Field Change
2021-05-27 15:58 Michel New Issue
2021-05-27 20:28 Bart Broersma Status new => confirmed
2021-05-27 20:28 Bart Broersma LazTarget => -
2021-05-27 20:28 Bart Broersma Note Added: 0131055
2021-05-27 20:52 Bart Broersma Note Added: 0131056
2021-05-29 10:48 Bart Broersma Note Edited: 0131056 View Revisions
2021-05-29 18:17 jamie philbrook Note Added: 0131081
2021-05-29 21:15 Bart Broersma Note Added: 0131082
2021-05-29 23:08 jamie philbrook Note Added: 0131083
2021-05-30 13:48 jamie philbrook Note Added: 0131091
2021-05-30 13:51 jamie philbrook Note Edited: 0131091 View Revisions
2021-05-30 18:49 Bart Broersma Note Added: 0131095
2021-05-30 18:50 Bart Broersma Note Edited: 0131095 View Revisions
2021-05-30 21:02 Bart Broersma Note Added: 0131098
2021-05-30 21:30 Bart Broersma Note Added: 0131099
2021-06-03 21:28 Bart Broersma Assigned To => Mattias Gaertner
2021-06-03 21:28 Bart Broersma Status confirmed => resolved
2021-06-03 21:28 Bart Broersma Resolution open => fixed
2021-06-03 21:28 Bart Broersma Fixed in Version => 2.1 (SVN)
2021-06-03 21:28 Bart Broersma Fixed in Revision => r65163
2021-06-03 21:28 Bart Broersma LazTarget - => 2.2
2021-06-03 21:28 Bart Broersma Note Added: 0131143