#include <TextExtractor.h>

Public Types
enum	ProcessingFlags { e_no_ligature_exp = 1, e_no_dup_remove = 2, e_punct_break = 4, e_remove_hidden_text = 8, e_no_invisible_text = 16, e_no_watermarks = 128, e_extract_using_zorder = 256 }

enum	XMLOutputFlags { e_words_as_elements = 1, e_output_bbox = 2, e_output_style_info = 4 }

typedef pdftron::PDF::Style	Style

typedef pdftron::PDF::Word	Word

typedef pdftron::PDF::Line	Line

Public Member Functions
	TextExtractor ()

	~TextExtractor ()

void	Begin (Page page, const Rect *clip_ptr=0, UInt32 flags=0)

void	SetOCGContext (OCG::Context *ctx)

int	GetWordCount ()

void	SetRightToLeftLanguage (bool rtl)

bool	GetRightToLeftLanguage ()

UString	GetAsText (bool dehyphen=true)

void	GetAsText (UString &out_str, bool dehyphen=true)

UString	GetTextUnderAnnot (const Annot &annot)

void	GetTextUnderAnnot (UString &out_str, const Annot &annot)

UString	GetAsXML (UInt32 xml_output_flags=0)

void	GetAsXML (UString &out_xml, UInt32 xml_output_flags=0)

Highlights	GetHighlights (const std::vector< CharRange > &char_ranges)

Highlights	GetHighlights (const CharRange *char_ranges, size_t char_ranges_count)

int	GetNumLines ()

Line	GetFirstLine ()

void	Destroy ()

Detailed Description

TextExtractor is used to analyze a PDF page and extract words and logical structure within a given region. The resulting list of lines and words can be traversed element by element or accessed as a string buffer. The class also includes utility methods to extract PDF text as HTML or XML.

Possible use case scenarios for TextExtractor include:

Converting PDF pages to text or XML for content repurposing.
Searching PDF pages for specific words or keywords.
Indexing large PDF repositories for content retrieval purposes (e.g. implementing a PDF search engine).
Classifying or summarizing PDF documents based on their text content.
Finding specific words for content editing purposes (such as splitting pages based on keywords).

The main task of TextExtractor is to interpret PDF pages and offer a simple to use API to:

Normalize all text content to Unicode.
Extract inferred logical structure (word by word, line by line, or paragraph by paragraph).
Extract positioning information for every line, word, or glyph.
Extract style information (such as the font, font size, and font styles) for every line, word, or glyph.
Control the content analysis process. A number of options (such as removal of text obscured by images) are available to let users direct the content recognition algorithms to meet their requirements.
Offer utility methods to convert PDF page content to text, XML, or HTML.

Note: TextExtractor analyzes only the textual content of the page. Rasterized text (e.g. in scanned pages) and vectorized text (where glyphs are converted to path outlines) will therefore not be recognized as text. It is still possible to extract this content using the pdftron.PDF.ElementReader interface.

In some cases TextExtractor may extract text that does not appear to be on the visible page (e.g. when text is obscured by an image or a rectangle). In these situations it is possible to use processing flags such as 'e_remove_hidden_text' and 'e_no_invisible_text' to remove hidden text.

A sample use case (in C++):

    // ... Initialize PDFNet ...
    PDFDoc doc(filein);
    doc.InitSecurityHandler();
    Page page = *doc.PageBegin();
    TextExtractor txt;
    txt.Begin(page, 0, TextExtractor::e_remove_hidden_text);
    UString text;
    txt.GetAsText(text);
    // or traverse words one by one...
    TextExtractor::Line line = txt.GetFirstLine(), lend;
    TextExtractor::Word word, wend;
    for (; line!=lend; line=line.GetNextLine()) {
      for (word=line.GetFirstWord(); word!=wend; word=word.GetNextWord()) {
        text.Assign(word.GetString(), word.GetStringLen());
        cout << text << '\n';
      }
    }

A sample use case (in C#):

    // ... Initialize PDFNet ...
    PDFDoc doc = new PDFDoc(filein);
    doc.InitSecurityHandler();
    Page page = doc.PageBegin().Current();
    TextExtractor txt = new TextExtractor();
    // Pass null for the optional clipping rectangle (a literal 0 does not convert to a Rect in C#).
    txt.Begin(page, null, TextExtractor.ProcessingFlags.e_remove_hidden_text);
    string text = txt.GetAsText();
    // or traverse words one by one...
    TextExtractor.Word word;
    for (TextExtractor.Line line = txt.GetFirstLine(); line.IsValid(); line=line.GetNextLine()) {
      for (word=line.GetFirstWord(); word.IsValid(); word=word.GetNextWord()) {
        Console.WriteLine(word.GetString());
      }
    }

For full sample code, please see the TextExtract sample project.

Definition at line 116 of file TextExtractor.h.

Member Typedef Documentation

typedef pdftron::PDF::Line pdftron::PDF::TextExtractor::Line

Definition at line 121 of file TextExtractor.h.

typedef pdftron::PDF::Style pdftron::PDF::TextExtractor::Style

Definition at line 119 of file TextExtractor.h.

typedef pdftron::PDF::Word pdftron::PDF::TextExtractor::Word

Definition at line 120 of file TextExtractor.h.

Member Enumeration Documentation

enum pdftron::PDF::TextExtractor::ProcessingFlags

Processing options that can be passed to the Begin() method to direct the flow of content recognition algorithms.

Enumerator
e_no_ligature_exp
e_no_dup_remove
e_punct_break
e_remove_hidden_text
e_no_invisible_text
e_no_watermarks
e_extract_using_zorder

Definition at line 133 of file TextExtractor.h.
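
Because the enumerator values are distinct powers of two, several options can be combined with bitwise OR into the single UInt32 that Begin() accepts. The following is a minimal sketch of that idea, not SDK code: the enum is a local mirror of the values listed above so that the snippet compiles without the PDFNet headers, and the helper names HiddenTextMask and HasFlag are hypothetical.

```cpp
#include <cstdint>

// Local mirror of the values listed for TextExtractor::ProcessingFlags, so the
// sketch compiles without the PDFNet headers. Real code would use the SDK enum.
enum ProcessingFlags : std::uint32_t {
    e_no_ligature_exp      = 1,
    e_no_dup_remove        = 2,
    e_punct_break          = 4,
    e_remove_hidden_text   = 8,
    e_no_invisible_text    = 16,
    e_no_watermarks        = 128,
    e_extract_using_zorder = 256
};

// Hypothetical helper (not part of the SDK): OR the two hidden-text options
// into one mask.
inline std::uint32_t HiddenTextMask() {
    return static_cast<std::uint32_t>(e_remove_hidden_text) |
           static_cast<std::uint32_t>(e_no_invisible_text);
}

// Hypothetical helper (not part of the SDK): test whether a mask contains a flag.
inline bool HasFlag(std::uint32_t mask, ProcessingFlags flag) {
    return (mask & static_cast<std::uint32_t>(flag)) != 0;
}
```

With the SDK available, a mask built this way would be passed as the flags argument, for example txt.Begin(page, 0, mask).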

enum pdftron::PDF::TextExtractor::XMLOutputFlags

Flags controlling the structure of XML output in a call to GetAsXML().

Enumerator
e_words_as_elements
e_output_bbox
e_output_style_info

Definition at line 240 of file TextExtractor.h.

Constructor & Destructor Documentation

pdftron::PDF::TextExtractor::TextExtractor ( )

Constructor and destructor

pdftron::PDF::TextExtractor::~TextExtractor ( )

Member Function Documentation

void pdftron::PDF::TextExtractor::Begin	(	Page	page,
		const Rect *	clip_ptr = `0`,
		UInt32	flags = `0`
	)

Start reading the page.

Parameters

page	Page to read.
clip_ptr	A pointer to the optional clipping rectangle. This parameter can be used to selectively read text from a given rectangle.
flags	A bitwise combination of ProcessingFlags values used to control the text extraction algorithm.

void pdftron::PDF::TextExtractor::Destroy ( )

Frees the native memory of the object.

UString pdftron::PDF::TextExtractor::GetAsText ( bool dehyphen = true )

Get all words in the current selection as a single string.

Parameters

out_str	The string containing all words in the current selection (used by the overload that takes an output parameter). Words will be separated with space (i.e. ' ') or new line (i.e. '\n') characters.
dehyphen	If true, finds and removes hyphens that split words across two lines. Hyphens are often used at the end of lines as an indicator that a word spans two lines. Hyphen detection enables removal of the hyphen character and merging of text runs to form a single word. This option has no effect on Tagged PDF files.
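
To make the effect of dehyphen concrete, here is a deliberately naive sketch of the idea, operating on plain strings. It is not the SDK's algorithm, which is not described here: the function name NaiveDehyphenate is hypothetical, and it treats every line-final '-' as a split word, which a real detector would not do.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical illustration only: join lines of text, and when a line ends
// with '-', drop the hyphen and glue the next line on without a separator.
// Other lines are separated by a newline character.
std::string NaiveDehyphenate(const std::vector<std::string>& lines) {
    std::string out;
    for (std::size_t i = 0; i < lines.size(); ++i) {
        const std::string& line = lines[i];
        const bool last = (i + 1 == lines.size());
        if (!last && !line.empty() && line[line.size() - 1] == '-') {
            out.append(line, 0, line.size() - 1);  // copy the line without its hyphen
        } else {
            out += line;
            if (!last) out += '\n';
        }
    }
    return out;
}
```

For the input lines "text extrac-" and "tion works", the sketch yields "text extraction works", which is the kind of merge that dehyphen=true is described as performing.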

void pdftron::PDF::TextExtractor::GetAsText	(	UString &	out_str,
		bool	dehyphen = `true`
	)

UString pdftron::PDF::TextExtractor::GetAsXML ( UInt32 xml_output_flags = 0 )

Get text content in a form of an XML string.

Parameters

out_xml	The string containing XML output (used by the overload that takes an output parameter).
xml_output_flags	Flags controlling XML output. For more information, please see TextExtractor::XMLOutputFlags.

XML output will be encoded in UTF-8 and will have the following structure:

    <Page num="1" crop_box="0, 0, 612, 792" media_box="0, 0, 612, 792" rotate="0">
     <Flow id="1">
      <Para id="1">
       <Line box="72, 708.075, 467.895, 10.02" style="font-family:Calibri; font-size:10.02; color: #000000;">
         <Word box="72, 708.075, 30.7614, 10.02">PDFNet</Word>
         <Word box="106.188, 708.075, 15.9318, 10.02">SDK</Word>
         <Word box="125.617, 708.075, 6.22242, 10.02">is</Word>
         ...
       </Line>
      </Para>
     </Flow>
    </Page>

The above XML output was generated by passing the following union of flags in the call to GetAsXML(): (TextExtractor::e_words_as_elements | TextExtractor::e_output_bbox | TextExtractor::e_output_style_info)
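
Consumers of this output typically need the box attribute as numbers. The sketch below parses one box value into four doubles using only the standard library. The function name ParseBox is hypothetical, and the meaning of the four numbers is an assumption: the sample values suggest x, y, width, height (the fourth value matches the font size), but the documentation above does not state this.

```cpp
#include <array>
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical helper: parse a box attribute value such as
// "72, 708.075, 30.7614, 10.02" into four doubles.
// Returns false if fewer than four comma-separated numbers are found.
bool ParseBox(const std::string& attr, std::array<double, 4>& out) {
    std::istringstream in(attr);
    for (std::size_t i = 0; i < out.size(); ++i) {
        if (!(in >> out[i])) return false;            // expected a number
        if (i + 1 < out.size()) {
            char sep = 0;
            if (!(in >> sep) || sep != ',') return false;  // expected a comma
        }
    }
    return true;
}
```

Extracting the attribute text itself is best left to an XML parser. The sketch only covers the numeric part.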

In case 'xml_output_flags' was not specified, the default XML output would look as follows:

    <Page num="1" crop_box="0, 0, 612, 792" media_box="0, 0, 612, 792" rotate="0">
     <Flow id="1">
      <Para id="1">
       <Line>PDFNet SDK is an amazingly comprehensive, high-quality PDF developer toolkit...</Line>
       <Line>levels. Using the PDFNet PDF library, ...</Line>
       ...
     </Flow>
    </Page>

void pdftron::PDF::TextExtractor::GetAsXML	(	UString &	out_xml,
		UInt32	xml_output_flags = `0`
	)

Line pdftron::PDF::TextExtractor::GetFirstLine ( )

Returns: The first line of text on the selected page.

Note: To traverse the list of all text lines on the page, use line.GetNextLine(). To traverse the list of all words on a given line, use line.GetFirstWord() followed by word.GetNextWord().

Highlights pdftron::PDF::TextExtractor::GetHighlights ( const std::vector< CharRange > & char_ranges )

Get a Highlights object based on an array of character ranges.

Parameters

char_ranges an array of character ranges to be highlighted

Returns: a Highlights object containing the selected characters

Highlights pdftron::PDF::TextExtractor::GetHighlights	(	const CharRange *	char_ranges,
		size_t	char_ranges_count
	)

Get a Highlights object based on an array of character ranges.

Parameters

char_ranges	an array of character ranges to be highlighted
char_ranges_count	the number of ranges in the char_ranges array

Returns: a Highlights object containing the selected characters
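
One way to obtain character ranges is to search the extracted text for a keyword. The sketch below does this on a UTF-16 string using only the standard library. Everything in it is an assumption made for illustration: CharRangeSketch is a stand-in for the SDK's CharRange (its real field names are not given above), FindRanges is a hypothetical helper, and whether GetHighlights expects offsets measured against the GetAsText() output is not stated in this documentation.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Stand-in for the SDK's CharRange; the field names are an assumption.
struct CharRangeSketch {
    int index;   // offset of the first character of the match
    int length;  // number of characters in the match
};

// Hypothetical helper: collect every non-overlapping occurrence of `needle`
// in `text` as an (index, length) pair.
std::vector<CharRangeSketch> FindRanges(const std::u16string& text,
                                        const std::u16string& needle) {
    std::vector<CharRangeSketch> ranges;
    if (needle.empty()) return ranges;  // nothing to search for
    for (std::size_t pos = text.find(needle);
         pos != std::u16string::npos;
         pos = text.find(needle, pos + needle.size())) {
        CharRangeSketch r = { static_cast<int>(pos),
                              static_cast<int>(needle.size()) };
        ranges.push_back(r);
    }
    return ranges;
}
```

With the SDK, ranges found this way would be converted to real CharRange values and passed to either GetHighlights overload.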

int pdftron::PDF::TextExtractor::GetNumLines ( )

Returns: The number of lines of text on the selected page.

bool pdftron::PDF::TextExtractor::GetRightToLeftLanguage ( )

Returns: The directionality of the text extractor (true if right-to-left).

UString pdftron::PDF::TextExtractor::GetTextUnderAnnot ( const Annot & annot )

Get all the characters that intersect an annotation.

Parameters

annot The annotation to intersect with.

void pdftron::PDF::TextExtractor::GetTextUnderAnnot	(	UString &	out_str,
		const Annot &	annot
	)

int pdftron::PDF::TextExtractor::GetWordCount ( )

Returns: the number of words on the page.

void pdftron::PDF::TextExtractor::SetOCGContext ( OCG::Context * ctx )

Sets the Optional Content Group (OCG) context that should be used when processing the document. This function can be used to change the current OCG context. Optional content (such as PDF layers) will be selectively processed based on the states of optional content groups in the given context.

Parameters

ctx	Optional Content Group (OCG) context, or NULL if TextExtractor should process all content on the page.

void pdftron::PDF::TextExtractor::SetRightToLeftLanguage ( bool rtl )

Sets the directionality of the text extractor. Must be called before the processing of a page starts.

Parameters

rtl	If true, reverses the directionality of the TextExtractor algorithm.

The documentation for this class was generated from the following file:

PDF/TextExtractor.h
