Reporter: CudaText man
Assigned To: Juha Manninen
Status: resolved
Resolution: fixed
Platform: Ubuntu 16.4 gtk2
Product Version: 1.9 (SVN)
0031991: OI help area wrong for TCombobox.Style
Description: The picture shows that the area displays the wrong text and is missing the lists of values [2 lists in UL-LI tags].

FPDoc shows OK text in its area.
Fixed in Revision r55307, r55319, r55325, r55329, r55336
CudaText man

2017-06-09 19:20


Juha Manninen

2017-06-10 18:37

developer   ~0101002

Fixed, please test.

I see you don't use TurboPowerIProDsgn package which gives a nice HTML rendering for code help in editor hints and in OI Infobox.
Without it the text looks butt-ugly.

CudaText man

2017-06-10 18:47

reporter   ~0101004

Still not nice: there are too few line breaks here.
The 5 Combobox styles should be on 5 separate lines, as the picture shows.

CudaText man

2017-06-10 18:47


Juha Manninen

2017-06-10 19:07

developer   ~0101005

The formatting is totally screwed without HTML rendering. Fortunately TurboPowerIProDsgn works on every platform and it is installed by default.
It is now the "standard" way to look at code help.
Does it work well for you?

After my fix all list items from the original XML file are included, aren't they?
If you want to improve text rendering without HTML, please look at function HTMLToCaption() in unit IDEHelpManager.
It only strips the tags out and copies the text without any formatting.
For most people this is a low priority issue because HTML rendering works well.

If you plan to provide a patch then I can keep this issue open for a while. Otherwise it closes soon.
The task is not trivial. The code must partly do the same things that an HTML parser + renderer already does.

Juha Manninen

2017-06-10 19:29

developer   ~0101007

Another idea: there must be some "HTML to plain text" rendering engines out there. If you find one with a proper license we could integrate it.
It cannot show graphs or different font sizes but it could render text as nicely as possible.
Such code should not be very big. We don't want to bloat Lazarus with code that is almost never used. Remember, most people use the HTML rendering provided by TurboPowerIProDsgn.

After thinking a little I realized that even the HTMLToCaption() function could be improved easily without implementing any state machine.
Spaces could be removed after "p" tag, list items would force a newline etc...
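
As a rough illustration of that idea, here is a minimal sketch. It is not the actual HTMLToCaption() code; the function name SimpleHTMLToText and the '* ' bullet are assumptions made only for this example.

```pascal
program LiNewlineSketch;

{$mode objfpc}{$H+}

uses
  SysUtils;

// Hypothetical helper (not the IDE's HTMLToCaption): strips tags,
// collapses whitespace runs and starts a new line at <li>, <p> and <br>.
function SimpleHTMLToText(const aHTML: string): string;
var
  i, TagEnd, SpacePos: Integer;
  TagName: string;
  PendingSpace, LineStart: Boolean;
begin
  Result := '';
  PendingSpace := False;
  LineStart := True;
  i := 1;
  while i <= Length(aHTML) do
  begin
    case aHTML[i] of
      '<':
        begin
          // Find the closing '>' and extract the lowercased tag name.
          TagEnd := i + 1;
          while (TagEnd <= Length(aHTML)) and (aHTML[TagEnd] <> '>') do
            Inc(TagEnd);
          TagName := LowerCase(Copy(aHTML, i + 1, TagEnd - i - 1));
          SpacePos := Pos(' ', TagName);
          if SpacePos > 0 then
            SetLength(TagName, SpacePos - 1);   // drop attributes
          if (TagName = 'li') or (TagName = 'p') or (TagName = 'br') then
          begin
            if Result <> '' then
              Result := Result + LineEnding;
            if TagName = 'li' then
              Result := Result + '* ';
            PendingSpace := False;   // ignore whitespace around the break
            LineStart := True;
          end;
          i := TagEnd;               // position of '>' (or past the end)
        end;
      ' ', #9, #10, #13:
        // Remember at most one space, and none at the start of a line.
        PendingSpace := PendingSpace or not LineStart;
    else
      begin
        if PendingSpace then
          Result := Result + ' ';
        Result := Result + aHTML[i];
        PendingSpace := False;
        LineStart := False;
      end;
    end;
    Inc(i);
  end;
end;

begin
  WriteLn(SimpleHTMLToText(
    '<ul> <li>csDropDown</li>' + LineEnding + '   <li>csSimple</li> </ul>'));
end.
```

Each list item ends up on its own line, and the whitespace between tags does not leak into the output. Entities and nested indentation are deliberately left out of this sketch.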

CudaText man

2017-06-10 22:21


CudaText man

2017-06-10 22:22


Index: ide/idehelpmanager.pas
--- ide/idehelpmanager.pas	(revision 55311)
+++ ide/idehelpmanager.pas	(working copy)
@@ -380,8 +380,14 @@
   sp: LongInt;
   InHeader: Boolean;
   CurTagName: String;
+  cReplacerForLI = LineEnding+'<br>&nbsp;*&nbsp;';  
+  Result:=StringReplace(Result, '<li>', cReplacerForLI, [rfReplaceAll]);
+  Result:=StringReplace(Result, '<LI>', cReplacerForLI, [rfReplaceAll]);
   //debugln(['HTMLToCaption HTML="',Result,'"']);
CudaText man

2017-06-10 22:22

reporter   ~0101011

Thanks for the note about HTMLToCaption. I added a fix for the LI tag; the picture shows the result.

Juha Manninen

2017-06-11 20:43

developer   ~0101025

Actually HTMLToCaption() did more layouting than I remembered but it didn't work very well with lots of whitespace.
I ended up making a proper parser / renderer after all in r55319.
It is a general purpose class, not specific to the IDE help system, so I placed it in LazUtils package.
Now I feel I wasted a lot of time. Something in parsers is pulling me. Damn!

The parser is robust and can be easily extended. For example the attribute in <div class="title"> could be parsed and used.

Please test. How does it work?

CudaText man

2017-06-11 21:06

reporter   ~0101030

That was no small amount of work. Good...
It would be good to use "const" parameters in Render() and AddOutput(),
and to name the parameter "aStream".

CudaText man

2017-06-11 21:09

reporter   ~0101031

Wish: add a LineEnding property (defaulting to the OS LineEnding), so that #10 can be used.

CudaText man

2017-06-11 21:38

reporter   ~0101034

It may be slower, but it would be good to
delete HtmlEntity() and use simple post-processing instead:

s:=StringReplace(s, '....', '<', [rfReplaceAll]);

Juha Manninen

2017-06-11 22:21

developer   ~0101037

Why would StringReplace be good? It would be MUCH slower, you are right about that.
Did you notice that my renderer does not copy the same big memory areas many times? It copies only what is needed, char by char, exactly once.

LineEnding (with some other name) could be a useful property for somebody although not needed for the current use case.
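
To illustrate the difference, here is a minimal sketch of single-pass entity decoding. It is not the renderer's actual code; DecodeEntitiesOnePass and its tiny entity table are hypothetical and exist only for this example.

```pascal
program OnePassEntities;

{$mode objfpc}{$H+}

uses
  SysUtils;

// Hypothetical sketch: decodes a few entities while copying every
// character once, instead of one StringReplace() pass per entity.
function DecodeEntitiesOnePass(const aText: string): string;
const
  Names: array[0..3] of string = ('&lt;', '&gt;', '&amp;', '&quot;');
  Chars: array[0..3] of Char = ('<', '>', '&', '"');
var
  Src, Dst, k: Integer;
  Found: Boolean;
begin
  // Decoded text is never longer than the input: allocate once.
  SetLength(Result, Length(aText));
  Src := 1;
  Dst := 0;
  while Src <= Length(aText) do
  begin
    Found := False;
    if aText[Src] = '&' then
      for k := Low(Names) to High(Names) do
        if Copy(aText, Src, Length(Names[k])) = Names[k] then
        begin
          Inc(Dst);
          Result[Dst] := Chars[k];
          Inc(Src, Length(Names[k]));
          Found := True;
          Break;
        end;
    if not Found then
    begin
      // Ordinary char, or '&' without a known entity: copy verbatim.
      Inc(Dst);
      Result[Dst] := aText[Src];
      Inc(Src);
    end;
  end;
  SetLength(Result, Dst);   // cut off the unused tail
end;

begin
  WriteLn(DecodeEntitiesOnePass('a &lt; b &amp;&amp; c &xxx'));
end.
```

A StringReplace() call per entity would scan and reallocate the whole string each time, while this loop touches every input character once and writes into a single preallocated buffer.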

CudaText man

2017-06-12 06:33

reporter   ~0101043

wp

2017-06-12 09:47

developer   ~0101046

Just to consider: Extracting text from html would be a simple exercise for the fasthtmlparser

unit html2text;

{$mode objfpc}{$H+}

interface

uses
  Classes, SysUtils;

function ExtractTextFromHTML(const AHTMLText: String): String;

implementation

uses
  fasthtmlparser;

type
  // Collects the text fragments reported by the parser into one string.
  THTMLTextExtractor = class
  private
    FParser: THTMLParser;
    FText: String;
    procedure FoundTextHandler(AText: String);
  public
    constructor Create(AHTMLText: String);
    destructor Destroy; override;
    function Execute: String;
  end;

constructor THTMLTextExtractor.Create(AHTMLText: String);
begin
  FParser := THTMLParser.Create(AHTMLText);
  FParser.OnFoundText := @FoundTextHandler;
end;

destructor THTMLTextExtractor.Destroy;
begin
  FParser.Free;
  inherited;
end;

function THTMLTextExtractor.Execute: String;
begin
  FText := '';
  FParser.Exec;  // calls FoundTextHandler for every text fragment
  Result := FText;
end;

procedure THTMLTextExtractor.FoundTextHandler(AText: String);
begin
  if AText = '' then
    exit;

  // Remove multiple line breaks from text start
  if (AText[1] in [#10, #13]) then begin
    while (AText <> '') and (AText[1] in [#10, #13]) do
      Delete(AText, 1, 1);
    AText := LineEnding + AText;
    if AText = '' then
      exit;
  end;

  // ... and from text end
  if (AText[Length(AText)] in [#10, #13]) then begin
    while (AText <> '') and (AText[Length(AText)] in [#10, #13]) do
      Delete(AText, Length(AText), 1);
    AText := AText + LineEnding;
    if AText = '' then
      exit;
  end;

  FText := FText + AText;
end;

function ExtractTextFromHTML(const AHTMLText: String): String;
var
  extractor: THTMLTextExtractor;
begin
  extractor := THTMLTextExtractor.Create(AHTMLText);
  try
    Result := extractor.Execute;
  finally
    extractor.Free;
  end;
end;

end.


CudaText man

2017-06-12 10:55

reporter   ~0101052

+ 'A': // Link
+ Result:=AddOutput(' 👀');
+ '/A':
+ Result:=AddOutput('👀 ');
Eye chars?? This should be a property, and '[]' chars would be better, IMO.

CudaText man

2017-06-12 10:56

reporter   ~0101053

And a property for this char, please.

CudaText man

2017-06-12 10:58

reporter   ~0101055

+ Result:=AddOutput('&'); // Entity not found, add just '&'.
This needs a property, and a "?" char would be better.

Juha Manninen

2017-06-12 11:43

developer   ~0101057

I removed the eyes and added a TitleMark property in r55329.
Unicode Emojis give nice opportunities for layout. They are essentially graphics inside text.
IMO '🔹' looks good with a title.

'&' without entity is not legal HTML, but if one is encountered then it must be copied verbatim. Why would you change it to '?'
If input is '&xxx', output must also be '&xxx' and not '?xxx'.

@wp: Yes, I believe fasthtmlparser and SAX could be used. However the code does not only extract text from HTML, it also renders it within the confines of pure text output.
To my surprise I did not find such code.
My class is loosely based on the original HTMLToCaption() function by Mattias. The function copied large memory blocks repeatedly while removing tags and thus was slow with big HTML.
I was kind of carried away when making an optimized class.
BTW, your example code removes newlines but it should remove the excess spaces, too.
Delete(AText, 1, 1) inside a big loop is butt-slow. :)
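
A minimal sketch of the faster alternative: scan with indexes and copy once. The function name CollapseEdgeLineBreaks is hypothetical; it only mirrors the line break handling of the example above and does not remove excess spaces.

```pascal
program TrimBreaksSketch;

{$mode objfpc}{$H+}

uses
  SysUtils;

// Hypothetical helper: replaces any run of line breaks at the start and
// at the end of aText with a single LineEnding, without Delete() loops.
function CollapseEdgeLineBreaks(const aText: string): string;
var
  First, Last: Integer;
begin
  First := 1;
  Last := Length(aText);
  // Only move the indexes; no string data is copied while scanning.
  while (First <= Last) and (aText[First] in [#10, #13]) do
    Inc(First);
  while (Last >= First) and (aText[Last] in [#10, #13]) do
    Dec(Last);
  Result := Copy(aText, First, Last - First + 1);   // one copy of the middle
  if First > 1 then
    Result := LineEnding + Result;
  if (Last < Length(aText)) and (First <= Last) then
    Result := Result + LineEnding;
end;

begin
  Write(CollapseEdgeLineBreaks(#10#10'abc'#13#10));
end.
```

Every Delete(AText, 1, 1) moves the whole remaining string, so a long run of line breaks costs quadratic time. The index scan is linear and allocates the result once, plus at most two short concatenations.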

I am resolving this issue. The code can be discussed on mailing list or forum.
Patches can be added.

CudaText man

2017-06-12 13:10


Index: components/lazutils/html2textrender.pas
--- components/lazutils/html2textrender.pas	(revision 55332)
+++ components/lazutils/html2textrender.pas	(working copy)
@@ -31,11 +31,17 @@
     fHTML, fOutput: string;
     fMaxLines: integer;
-    fLineEndMark: String; // End of line, by default std. "LineEnding".
-    fTitleMark: String; // Text at start and end of title text, by default Unicode graph.
+    fLineEndMark: String; // End of line, by default standard LineEnding
+    fTitleMark: String; // Text at start/end of title text: <div class="title">...</div>
+    fHorzLine: String; // Text for <hr> tag
+    fLinkBegin: String; // Text before link, <a href="...">
+    fLinkEnd: String; // Text after link
+    fListItemMark: String; // Text for <li> items
+    fMoreMark: String; // Text to add if too many lines
     fInHeader, fInDivTitle: Boolean;
     fPendingSpace: Boolean;
     fPendingNewLineCnt: Integer;
+    fIndentSize: integer; // Increment (in spaces) for each nested HTML level
     fIndent: integer;
     fLineCnt, fHtmlLen: Integer;
     p: Integer;
@@ -53,6 +59,12 @@
     property LineEndMark: String read fLineEndMark write fLineEndMark;
     property TitleMark: String read fTitleMark write fTitleMark;
+    property HorzLineMark: String read fHorzLine write fHorzLine;
+    property LinkBeginMark: String read fLinkBegin write fLinkBegin;
+    property LinkEndMark: String read fLinkEnd write fLinkEnd;
+    property ListItemMark: String read fListItemMark write fListItemMark;
+    property MoreMark: String read fMoreMark write fMoreMark;
+    property IndentSize: integer read fIndentSize write fIndentSize;
@@ -68,6 +80,12 @@
   // These can be changed by user later.
+  fHorzLine:= '——————————————————';
+  fLinkBegin:='_';
+  fLinkEnd:='_';
+  fListItemMark:='* ';
+  fMoreMark:='...';
+  fIndentSize:=2;
 constructor THTML2TextRenderer.Create(const Stream: TStream);
@@ -122,13 +140,13 @@
     // Return False if max # of lines exceeded.
     if fLineCnt>fMaxLines then
-      fOutput:=fOutput+fLineEndMark+'...';
+      fOutput:=fOutput+fLineEndMark+fMoreMark;
   if fPendingNewLineCnt>0 then
-    fOutput:=fOutput+StringOfChar(' ',fIndent*2);
+    fOutput:=fOutput+StringOfChar(' ',fIndent*fIndentSize);
@@ -211,18 +229,18 @@
         // Don't leave empty lines before list item (not sure if this is good)
-        Result:=AddOutput('* ');
+        Result:=AddOutput(fListItemMark);
     'A':                             // Link
-        Result:=AddOutput(' _');
+        Result:=AddOutput(' '+fLinkBegin);
-        Result:=AddOutput('_ ');
+        Result:=AddOutput(fLinkEnd+' ');
-        Result:=AddOutput('——————————————————');
+        Result:=AddOutput(fHorzLine);
CudaText man

2017-06-12 13:10

reporter   ~0101058

I did some refactoring and added 6 new properties; patch attached.

Juha Manninen

2017-06-12 14:22

developer   ~0101063

Applied, although I don't find some of the properties very useful. For example who would want to change the '...' at the end of truncated output?
I renamed one property as IndentStep.

CudaText man

2017-06-12 15:58

reporter   ~0101064

There is a Unicode char for "3 dots" (the horizontal ellipsis, U+2026).

