Class TextExtractor

TextExtractor is used to analyze a PDF page and extract words and logical structures that are visible within a given region. The resulting list of lines and words can be traversed element by element or accessed as a string buffer. The class also includes utility methods to extract PDF text as HTML or XML.

Possible use case scenarios for TextExtractor include:

Converting PDF pages to text or XML for content repurposing.
Searching PDF pages for specific words or keywords.
Indexing large PDF repositories for content retrieval purposes (e.g. implementing a PDF search engine).
Classifying or summarizing PDF documents based on their text content.
Finding specific words for content editing purposes (such as splitting pages).

The main task of TextExtractor is to interpret PDF pages and offer a simple-to-use API to:

Normalize all text content to Unicode.
Extract inferred logical structure (word by word, line by line, or paragraph by paragraph).
Extract positioning information for every line, word, or glyph.
Extract style information (such as the font, font size, and font styles) for every line, word, or glyph.
Control the content analysis process. A number of options (such as removal of text obscured by images) are available to let the user tune the content recognition algorithms to their requirements.
Offer utility methods to convert PDF page content to text, XML, or HTML.

TextExtractor analyzes only the textual content of the page. This means that rasterized text (e.g. in scanned pages) or vectorized text (where glyphs are converted to path outlines) will not be recognized as text. Note that it is still possible to extract this content using the pdftron.PDF.ElementReader interface.

In some cases TextExtractor may extract text that does not appear to be on the visible page (e.g. when text is obscured by an image or a rectangle). In these situations it is possible to use processing flags such as 'e_remove_hidden_text' and 'e_no_invisible_text' to remove hidden text.
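
The two flags named above can be combined in a single call to Begin. The following is a minimal sketch, not the SDK's reference usage: the helper name ExtractVisibleText is hypothetical, and passing null as the clip rectangle to mean "whole page" is an assumption based on the parameter being described as optional.

```csharp
using pdftron.PDF;

public static class VisibleTextSketch
{
    // Hypothetical helper: returns the page text with hidden and
    // invisible text filtered out by the extractor.
    public static string ExtractVisibleText(Page page)
    {
        using (TextExtractor txt = new TextExtractor())
        {
            // Combine both processing flags with a bitwise OR.
            TextExtractor.ProcessingFlags flags =
                TextExtractor.ProcessingFlags.e_remove_hidden_text |
                TextExtractor.ProcessingFlags.e_no_invisible_text;

            // Assumption: a null clip rectangle means "read the whole page".
            txt.Begin(page, null, flags);
            return txt.GetAsText();
        }
    }
}
```

The using statement relies on TextExtractor implementing IDisposable, so native resources are released as soon as the text has been read.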

For full sample code, see the TextExtract sample project.

//... Initialize PDFNet ...
PDFDoc doc = new PDFDoc(filein);
doc.InitSecurityHandler();
// Read the first page of the document.
Page page = doc.GetPage(1);
TextExtractor txt = new TextExtractor();
// A null clip rectangle is assumed to mean "read the whole page".
txt.Begin(page, null, TextExtractor.ProcessingFlags.e_remove_hidden_text);
string text = txt.GetAsText();
// or traverse words one by one...
TextExtractor.Word word;
for (TextExtractor.Line line = txt.GetFirstLine(); line.IsValid(); line = line.GetNextLine()) {
    for (word = line.GetFirstWord(); word.IsValid(); word = word.GetNextWord()) {
        string w = word.GetString();
    }
}

Inheritance

object

TextExtractor

Implements

IDisposable

Inherited Members

object.Equals(object)

object.Equals(object, object)

object.GetHashCode()

object.GetType()

object.MemberwiseClone()

object.ReferenceEquals(object, object)

object.ToString()

Namespace: pdftron.PDF

Assembly: PDFTronDotNet.dll

Syntax

public class TextExtractor : IDisposable

Constructors

TextExtractor()

Creates a new TextExtractor.

Declaration

public TextExtractor()

Methods

Begin(Page)

Start reading the page.

Declaration

public void Begin(Page page)

Parameters

Type	Name	Description
Page	page	Page to read.

Begin(Page, Rect)

Start reading the page.

Declaration

public void Begin(Page page, Rect clip_ptr)

Parameters

Type	Name	Description
Page	page	Page to read.
Rect	clip_ptr	An optional clipping rectangle. This parameter can be used to selectively read text from a given rectangle.
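
As an illustration of the clipping parameter, the sketch below reads only the text inside a caller-supplied rectangle. The helper name ExtractRegion is hypothetical, and the assumption that the rectangle is given in the page's coordinate space (two opposite corners) is not stated in this reference.

```csharp
using pdftron.PDF;

public static class RegionTextSketch
{
    // Hypothetical helper: returns the text found inside the rectangle
    // with corners (x1, y1) and (x2, y2).
    public static string ExtractRegion(Page page, double x1, double y1, double x2, double y2)
    {
        using (TextExtractor txt = new TextExtractor())
        {
            // Assumption: the clip rectangle uses page coordinates.
            Rect clip = new Rect(x1, y1, x2, y2);
            txt.Begin(page, clip);
            return txt.GetAsText();
        }
    }
}
```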

Begin(Page, Rect, ProcessingFlags)

Start reading the page.

Declaration

public void Begin(Page page, Rect clip, TextExtractor.ProcessingFlags flags)

Parameters

Type	Name	Description
Page	page	Page to read.
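Rect	clip	An optional clipping rectangle. This parameter can be used to selectively read text from a given rectangle.
TextExtractor.ProcessingFlags	flags	A combination of ProcessingFlags used to control the text extraction algorithm.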

Destroy()

Declaration

public void Destroy()

Dispose()

Performs application-defined tasks associated with freeing, releasing, or resetting unmanaged resources.

Declaration

public void Dispose()

Dispose(bool)

Declaration

protected virtual void Dispose(bool disposing)

Parameters

Type	Name	Description
bool	disposing	true when called from Dispose(); false when called from the finalizer.

~TextExtractor()

Releases all resources used by the TextExtractor.

Declaration

protected ~TextExtractor()

GetAsText()

Get all words in the current selection as a single string.

Declaration

public string GetAsText()

Returns

Type	Description
string	The string containing all words in the current selection. Words will be separated with space (i.e. ' ') or new line (i.e. '\n') characters.

GetAsText(bool)

Get all words in the current selection as a single string.

Declaration

public string GetAsText(bool dehyphen)

Parameters

Type	Name	Description
bool	dehyphen	If true, finds and removes hyphens that split words across two lines. Hyphens are often used at the end of lines as an indicator that a word spans two lines. Hyphen detection enables removal of the hyphen character and merging of text runs to form a single word. This option has no effect on Tagged PDF files.

Returns

Type	Description
string	The string containing all words in the current selection. Words will be separated with space (i.e. ' ') or new line (i.e. '\n') characters.
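
To make the effect of the dehyphen option concrete, here is a simplified, self-contained illustration of what dehyphenation means for a string. It is not the library's algorithm: the Dehyphenate helper is hypothetical and only merges a letter, a hyphen, a line break, and a following letter into one word.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DehyphenSketch
{
    // Hypothetical helper: joins words that were split by a hyphen
    // at the end of a line, e.g. "extrac-\ntion" becomes "extraction".
    public static string Dehyphenate(string text)
    {
        return Regex.Replace(text, @"(\p{L})-\n(\p{L})", "$1$2");
    }

    public static void Main()
    {
        string raw = "Text extrac-\ntion is use-\nful.";
        // Prints: Text extraction is useful.
        Console.WriteLine(Dehyphenate(raw));
    }
}
```

With the SDK itself, the equivalent result would be requested by calling GetAsText(true) instead of post-processing the string.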

GetAsXML()

Get text content in the form of an XML string.

Declaration

public string GetAsXML()

Returns

Type	Description
string	The string containing XML output.

Remarks

XML output will be encoded in UTF-8 and will have the following structure:

<Page num="1" crop_box="0, 0, 612, 792" media_box="0, 0, 612, 792" rotate="0">
<Flow id="1">
<Para id="1">
<Line box="72, 708.075, 467.895, 10.02" style="font-family:Calibri; font-size:10.02; color: #000000;">
<Word box="72, 708.075, 30.7614, 10.02">PDFNet</Word>
<Word box="106.188, 708.075, 15.9318, 10.02">SDK</Word>
<Word box="125.617, 708.075, 6.22242, 10.02">is</Word>
...
</Line>
</Para>  
</Flow>
</Page>

The above XML output was generated by passing the following union of flags in the call to GetAsXML(): (TextExtractor.XMLOutputFlags.e_words_as_elements | TextExtractor.XMLOutputFlags.e_output_bbox | TextExtractor.XMLOutputFlags.e_output_style_info)

If no flags are specified, the default XML output looks as follows:

<Page num="1" crop_box="0, 0, 612, 792" media_box="0, 0, 612, 792" rotate="0">
<Flow id="1">
<Para id="1">
<Line>PDFNet SDK is an amazingly comprehensive, high-quality PDF developer toolkit...</Line>
<Line>levels. Using the PDFNet PDF library, ...</Line>
...
</Para>
</Flow>
</Page>
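
Because the result is plain XML, it can be post-processed with the standard System.Xml.Linq classes. The sketch below reads every Word element and its box attribute from a string shaped like the sample output above. The ReadWords helper is hypothetical, and the code makes no claim about what the four box numbers mean; it only parses them.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Xml.Linq;

public static class XmlWordsSketch
{
    // Hypothetical helper: returns the text and the numeric box values
    // of every <Word> element in the XML string.
    public static (string Text, double[] Box)[] ReadWords(string xml)
    {
        return XDocument.Parse(xml)
            .Descendants("Word")
            .Select(w => (
                w.Value,
                ((string)w.Attribute("box") ?? "")
                    .Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries)
                    .Select(s => double.Parse(s.Trim(), CultureInfo.InvariantCulture))
                    .ToArray()))
            .ToArray();
    }

    public static void Main()
    {
        // Shortened copy of the sample output, used here as test input.
        string xml =
            "<Page num=\"1\"><Flow id=\"1\"><Para id=\"1\"><Line>" +
            "<Word box=\"72, 708.075, 30.7614, 10.02\">PDFNet</Word>" +
            "<Word box=\"106.188, 708.075, 15.9318, 10.02\">SDK</Word>" +
            "</Line></Para></Flow></Page>";

        foreach (var (text, box) in ReadWords(xml))
        {
            Console.WriteLine(text + " has " + box.Length + " box values");
        }
    }
}
```

Parsing with CultureInfo.InvariantCulture keeps the decimal point handling independent of the machine's regional settings.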

GetAsXML(XMLOutputFlags)

Get text content in the form of an XML string.

Declaration

public string GetAsXML(TextExtractor.XMLOutputFlags flags)

Parameters

Type	Name	Description
TextExtractor.XMLOutputFlags	flags	Flags controlling XML output. For more information, see `TextExtractor.XMLOutputFlags`.

Returns

Type	Description
string	The string containing XML output.

Remarks

XML output will be encoded in UTF-8 and will have the following structure:

<Page num="1" crop_box="0, 0, 612, 792" media_box="0, 0, 612, 792" rotate="0">
<Flow id="1">
<Para id="1">
<Line box="72, 708.075, 467.895, 10.02" style="font-family:Calibri; font-size:10.02; color: #000000;">
<Word box="72, 708.075, 30.7614, 10.02">PDFNet</Word>
<Word box="106.188, 708.075, 15.9318, 10.02">SDK</Word>
<Word box="125.617, 708.075, 6.22242, 10.02">is</Word>
...
</Line>
</Para>
</Flow>
</Page>

The above XML output was generated by passing the following union of flags in the call to GetAsXML(): (TextExtractor.XMLOutputFlags.e_words_as_elements | TextExtractor.XMLOutputFlags.e_output_bbox | TextExtractor.XMLOutputFlags.e_output_style_info)

If no flags are specified, the default XML output looks as follows:

<Page num="1" crop_box="0, 0, 612, 792" media_box="0, 0, 612, 792" rotate="0">
<Flow id="1">
<Para id="1">
<Line>PDFNet SDK is an amazingly comprehensive, high-quality PDF developer toolkit...</Line>
<Line>levels. Using the PDFNet PDF library, ...</Line>
...
</Para>
</Flow>
</Page>
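
The flag combination mentioned in the remarks could be passed as follows. This is a sketch under assumptions: the helper name PageToXml is hypothetical, and the flag members are assumed to be nested in TextExtractor.XMLOutputFlags, as the parameter type suggests.

```csharp
using pdftron.PDF;

public static class XmlFlagsSketch
{
    // Hypothetical helper: returns per-word XML with bounding boxes
    // and style information for a single page.
    public static string PageToXml(Page page)
    {
        using (TextExtractor txt = new TextExtractor())
        {
            txt.Begin(page);
            return txt.GetAsXML(
                TextExtractor.XMLOutputFlags.e_words_as_elements |
                TextExtractor.XMLOutputFlags.e_output_bbox |
                TextExtractor.XMLOutputFlags.e_output_style_info);
        }
    }
}
```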

GetFirstLine()

Gets the first line of text on the selected page.

Declaration

public TextExtractor.Line GetFirstLine()

Returns

Type	Description
TextExtractor.Line	The first line of text on the selected page.

GetHighlights(CharRange[])

Get a Highlights object based on an array of character ranges.

Declaration

public Highlights GetHighlights(TextExtractor.CharRange[] char_ranges)

Parameters

Type	Name	Description
CharRange[]	char_ranges	An array of character ranges to be highlighted.

Returns

Type	Description
Highlights	A Highlights object containing the selected characters.

GetNumLines()

Gets the number of lines.

Declaration

public int GetNumLines()

Returns

Type	Description
int	The number of lines.

GetTextUnderAnnot(Annot)

Get all the characters that intersect an annotation.

Declaration

public string GetTextUnderAnnot(Annot annot)

Parameters

Type	Name	Description
Annot	annot	The annotation to intersect with.

Returns

Type	Description
string	The text that intersects the annotation.
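
A typical use is to collect the text covered by each annotation on a page, for example the words under highlight annotations. The sketch below assumes the Page.GetNumAnnots, Page.GetAnnot, and Annot.IsValid members, which are not described in this reference; the helper name is hypothetical.

```csharp
using System.Collections.Generic;
using pdftron.PDF;

public static class AnnotTextSketch
{
    // Hypothetical helper: returns the text under every valid
    // annotation on the page, in annotation order.
    public static List<string> TextUnderAnnots(Page page)
    {
        List<string> result = new List<string>();
        using (TextExtractor txt = new TextExtractor())
        {
            txt.Begin(page);

            // Assumption: annotations are enumerated by index on Page.
            int count = page.GetNumAnnots();
            for (int i = 0; i < count; ++i)
            {
                Annot annot = page.GetAnnot(i);
                if (!annot.IsValid()) continue;
                result.Add(txt.GetTextUnderAnnot(annot));
            }
        }
        return result;
    }
}
```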

GetWordCount()

Gets the word count.

Declaration

public int GetWordCount()

Returns

Type	Description
int	The number of words on the page.

SetOCGContext(Context)

Sets the Optional Content Group (OCG) context that should be used when extracting the text from the page. This function can be used to selectively extract text from optional content (such as PDF layers) based on the states of optional content groups in the given context.

Declaration

public void SetOCGContext(Context ctx)

Parameters

Type	Name	Description
Context	ctx	Optional Content Group (OCG) context, or null if all content on the page should be processed.
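
The following sketch shows one way an OCG context might be supplied so that extraction follows the document's default layer visibility. It rests on several assumptions not covered by this reference: that PDFDoc exposes HasOC and GetOCGConfig, that pdftron.PDF.OCG.Context can be constructed from that configuration, and that SetOCGContext should be called before Begin. The helper name is hypothetical.

```csharp
using pdftron.PDF;
using pdftron.PDF.OCG;

public static class LayerTextSketch
{
    // Hypothetical helper: extracts text using the document's default
    // optional content configuration, if the document has layers.
    public static string ExtractWithDefaultLayers(PDFDoc doc, Page page)
    {
        using (TextExtractor txt = new TextExtractor())
        {
            // Assumption: only build a context when the document
            // actually contains optional content.
            if (doc.HasOC())
            {
                Context ctx = new Context(doc.GetOCGConfig());
                txt.SetOCGContext(ctx);
            }

            txt.Begin(page);
            return txt.GetAsText();
        }
    }
}
```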
