com.pdftron.pdf
Class TextExtractor

java.lang.Object
  extended by com.pdftron.pdf.TextExtractor

public class TextExtractor
extends java.lang.Object

TextExtractor is used to analyze a PDF page and extract words and logical structures that are visible within a given region. The resulting list of lines and words can be traversed element by element or accessed as a string buffer. The class also includes utility methods to extract PDF text as HTML or XML.

Possible use case scenarios for TextExtractor include:

The main task of TextExtractor is to interpret PDF pages and offer a simple to use API to:

Note: TextExtractor analyzes only the textual content of the page. Rasterized text (e.g. in scanned pages) and vectorized text (where glyphs are converted to path outlines) will not be recognized as text. It is still possible to extract this content using the ElementReader interface.

In some cases TextExtractor may extract text that does not appear to be on the visible page (e.g. when text is obscured by an image or a rectangle). In these situations it is possible to use processing flags such as 'e_remove_hidden_text' and 'e_no_invisible_text' to remove hidden text.

A sample use case:

 ... Initialize PDFNet ...
 PDFDoc doc = new PDFDoc(filein);
 doc.initSecurityHandler();
 Page page = doc.getPage(1); // first page of the document
 TextExtractor txt = new TextExtractor();
 // No clipping rectangle (null): read the whole page.
 txt.begin(page, null, TextExtractor.e_remove_hidden_text);
 String text = txt.getAsText();
 // or traverse words one by one...
 TextExtractor.Word word;
 for (TextExtractor.Line line = txt.getFirstLine(); line.isValid(); line = line.getNextLine()) {
   for (word = line.getFirstWord(); word.isValid(); word = word.getNextWord()) {
     String w = word.getString();
   }
 }
 
 

For full sample code, please take a look at the TextExtract sample project.


Nested Class Summary
static class TextExtractor.Compat
          Compatibility layer API.
 class TextExtractor.Line
           
 class TextExtractor.Style
          A class representing predominant text style associated with a given Line, a Word, or a Glyph.
 class TextExtractor.Word
           
 
Field Summary
static int e_no_dup_remove
          Disables removing duplicated text that is frequently used to achieve visual effects of drop shadow and fake bold.
static int e_no_invisible_text
          Enables removing text that uses rendering mode 3 (i.e. invisible text).
static int e_no_ligature_exp
          Disables expanding of ligatures using a predefined mapping.
static int e_output_bbox
          Include bounding box information for each XML element.
static int e_output_style_info
          Include font and styling information.
static int e_punct_break
          Treat punctuation (e.g. full stop, comma, semicolon, etc.) as word break characters.
static int e_remove_hidden_text
          Enables removal of text that is obscured by images or rectangles.
static int e_words_as_elements
          Output words as XML elements instead of inline text.
 
Constructor Summary
TextExtractor()
          Constructor.
 
Method Summary
 void begin(Page page)
          Start reading the page.
 void begin(Page page, Rect clip_ptr)
          Start reading the page.
 void begin(Page page, Rect clip_ptr, int flags)
          Start reading the page.
 void destroy()
          Frees the native memory of the object.
 java.lang.String getAsText()
          Get all words in the current selection as a single string.
 java.lang.String getAsText(boolean dehyphen)
          Get all words in the current selection as a single string.
 java.lang.String getAsXML()
          Get text content in a form of an XML string.
 java.lang.String getAsXML(int xml_output_flags)
          Get text content in a form of an XML string.
 TextExtractor.Line getFirstLine()
          Get the first line.
 int getNumLines()
          Get the number of lines.
 boolean getRightToLeftLanguage()
          Checks if the text extractor works in right-to-left language mode.
 java.lang.String getTextUnderAnnot(Annot annot)
          Get all the characters that intersect an annotation.
 int getWordCount()
          Get the word count.
 void setRightToLeftLanguage(boolean right_2_left)
          Sets text extractor to work in right-to-left language mode.
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

e_no_ligature_exp

public static final int e_no_ligature_exp
Disables expanding of ligatures using a predefined mapping. Default ligatures are: fi, ff, fl, ffi, ffl, ch, cl, ct, ll, ss, fs, st, oe, OE.

See Also:
Constant Field Values
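
To illustrate what ligature expansion means for extracted text, here is a minimal standalone sketch. It does not use PDFNet and is not the library's mapping: the `LigatureSketch` class and its table are hypothetical, and the table covers only those ligatures from the list above that have a single Unicode code point.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class LigatureSketch {
    // Single-code-point ligatures and their plain-letter expansions.
    // This table is an assumption made for illustration only.
    private static final Map<Character, String> EXPANSIONS = new LinkedHashMap<>();
    static {
        EXPANSIONS.put('\uFB00', "ff");
        EXPANSIONS.put('\uFB01', "fi");
        EXPANSIONS.put('\uFB02', "fl");
        EXPANSIONS.put('\uFB03', "ffi");
        EXPANSIONS.put('\uFB04', "ffl");
        EXPANSIONS.put('\uFB06', "st");
        EXPANSIONS.put('\u0153', "oe");
        EXPANSIONS.put('\u0152', "OE");
    }

    /** Replaces each known ligature character with its letter sequence. */
    static String expand(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            String plain = EXPANSIONS.get(c);
            out.append(plain != null ? plain : String.valueOf(c));
        }
        return out.toString();
    }
}
```

Without expansion, a search for "find" would miss a word whose first two letters are stored as one ligature character, which is presumably why expansion is on unless e_no_ligature_exp is set.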

e_no_dup_remove

public static final int e_no_dup_remove
Disables removing duplicated text that is frequently used to achieve visual effects of drop shadow and fake bold.

See Also:
Constant Field Values

e_punct_break

public static final int e_punct_break
Treat punctuation (e.g. full stop, comma, semicolon, etc.) as word break characters.

See Also:
Constant Field Values

e_remove_hidden_text

public static final int e_remove_hidden_text
Enables removal of text that is obscured by images or rectangles. Since this option incurs a small performance penalty during text extraction, it is not enabled by default.

See Also:
Constant Field Values

e_no_invisible_text

public static final int e_no_invisible_text
Enables removing text that uses rendering mode 3 (i.e. invisible text). Invisible text is usually used in 'PDF Searchable Images' (i.e. scanned pages with a corresponding OCR text). As a result, invisible text will be extracted by default.

See Also:
Constant Field Values

e_words_as_elements

public static final int e_words_as_elements
Output words as XML elements instead of inline text.

See Also:
Constant Field Values

e_output_bbox

public static final int e_output_bbox
Include bounding box information for each XML element. The bounding box information will be stored as 'bbox' attribute.

See Also:
Constant Field Values

e_output_style_info

public static final int e_output_style_info
Include font and styling information.

See Also:
Constant Field Values
Constructor Detail

TextExtractor

public TextExtractor()
Constructor. Instantiate new TextExtractor.

Method Detail

destroy

public void destroy()
Frees the native memory of the object. This can be explicitly called to control the deallocation of native memory and avoid situations where the garbage collector does not free the object in a timely manner.


begin

public void begin(Page page)
Start reading the page.

Parameters:
page - Page to read.

begin

public void begin(Page page,
                  Rect clip_ptr)
Start reading the page.

Parameters:
page - Page to read.
clip_ptr - The optional clipping rectangle. This parameter can be used to selectively read text from a given rectangle.

begin

public void begin(Page page,
                  Rect clip_ptr,
                  int flags)
Start reading the page.

Parameters:
page - Page to read.
clip_ptr - The optional clipping rectangle. This parameter can be used to selectively read text from a given rectangle.
flags - A bitwise OR of processing flags (e.g. e_remove_hidden_text | e_no_invisible_text) used to control the text extraction algorithm.
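
Because the flags are plain int constants, several options are combined into one argument with bitwise OR. The sketch below shows the mechanics only. The `FlagSketch` class and its bit values are hypothetical placeholders, not the real values of the TextExtractor.e_* fields.

```java
final class FlagSketch {
    // Placeholder bit values for illustration only; real code should use
    // the TextExtractor.e_* constants instead of these stand-ins.
    static final int REMOVE_HIDDEN_TEXT = 1 << 0;
    static final int NO_INVISIBLE_TEXT  = 1 << 1;
    static final int PUNCT_BREAK        = 1 << 2;

    /** Returns true if every bit of 'flag' is present in 'flags'. */
    static boolean isSet(int flags, int flag) {
        return (flags & flag) == flag;
    }
}
```

For example, `FlagSketch.REMOVE_HIDDEN_TEXT | FlagSketch.NO_INVISIBLE_TEXT` yields a single int with both options set; with the real constants, such a value would be passed as the flags argument of begin.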

getWordCount

public int getWordCount()
Get the word count.

Returns:
the number of words on the page.

getAsText

public java.lang.String getAsText()
Get all words in the current selection as a single string.

Returns:
The string containing all words in the current selection. Words will be separated with space (i.e. ' ') or new line (i.e. '\n') characters.
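
Since the returned string separates words with spaces and lines with '\n', it can be split back into lines of words with plain string operations. A minimal sketch, assuming only the separators described above; `TextSplitSketch` and `linesOfWords` are hypothetical names, and the input is an ordinary string rather than a live TextExtractor result.

```java
import java.util.ArrayList;
import java.util.List;

final class TextSplitSketch {
    /** Splits text into lines on '\n' and each line into words on ' '. */
    static List<List<String>> linesOfWords(String text) {
        List<List<String>> lines = new ArrayList<>();
        for (String line : text.split("\n")) {
            List<String> words = new ArrayList<>();
            for (String word : line.split(" ")) {
                if (!word.isEmpty()) {
                    words.add(word); // skip empty tokens from repeated spaces
                }
            }
            if (!words.isEmpty()) {
                lines.add(words); // skip blank lines
            }
        }
        return lines;
    }
}
```

When positions or styles are needed as well, traversing TextExtractor.Line and TextExtractor.Word objects is the better fit, because the plain string carries no layout information.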

getAsText

public java.lang.String getAsText(boolean dehyphen)
Get all words in the current selection as a single string.

Parameters:
dehyphen - If true, finds and removes hyphens that split words across two lines. Hyphens are often used at the end of lines as an indicator that a word spans two lines. Hyphen detection enables removal of the hyphen character and merging of text runs to form a single word. This option has no effect on Tagged PDF files.
Returns:
The string containing all words in the current selection. Words will be separated with space (i.e. ' ') or new line (i.e. '\n') characters.
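
The effect of dehyphenation can be pictured with a simple text-level sketch. This is not the library's algorithm, which works on text runs. It is a hypothetical regex-based illustration (`DehyphenSketch` is an invented name) that merges a word only when a letter and hyphen end one line and a lowercase letter starts the next.

```java
final class DehyphenSketch {
    /**
     * Joins words split as "letter, hyphen, newline, lowercase letter".
     * A hyphen followed by an uppercase letter on the next line is left
     * alone, since it is less likely to be a line-break hyphen.
     */
    static String dehyphenate(String text) {
        return text.replaceAll("(\\p{L})-\\n(\\p{Ll})", "$1$2");
    }
}
```

A heuristic like this also merges genuinely hyphenated compounds that happen to break at the hyphen, which is one reason such behavior is exposed as an option rather than always applied.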

getTextUnderAnnot

public java.lang.String getTextUnderAnnot(Annot annot)
Get all the characters that intersect an annotation.

Parameters:
annot - The annotation to intersect with.

getAsXML

public java.lang.String getAsXML()
Get text content in a form of an XML string.

Note: This method returns the same as if calling getAsXML(0). Please see getAsXML(int) for more information.

Returns:
The string containing XML output.

getAsXML

public java.lang.String getAsXML(int xml_output_flags)
Get text content in a form of an XML string.

Note: XML output will be encoded in UTF-8 and will have the following structure:

 <Page num="1" crop_box="0, 0, 612, 792" media_box="0, 0, 612, 792" rotate="0">
  <Flow id="1">
   <Para id="1">
    <Line box="72, 708.075, 467.895, 10.02" style="font-family:Calibri; font-size:10.02; color: #000000;">
     <Word box="72, 708.075, 30.7614, 10.02">PDFNet</Word>
     <Word box="106.188, 708.075, 15.9318, 10.02">SDK</Word>
     <Word box="125.617, 708.075, 6.22242, 10.02">is</Word>
      ...
    </Line>
   </Para>
  </Flow>
 </Page>
 
 

The above XML output was generated by passing the following union of flags: e_words_as_elements | e_output_bbox | e_output_style_info.
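
The returned string can be consumed with any XML parser. The following is a minimal sketch using the JDK's DOM parser. It assumes the element name `Word` and the attribute name `box` exactly as they appear in the sample above; `XmlWords` and `parseWords` are hypothetical helpers, and the box value is kept as a raw string because its coordinate convention is not specified here.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class XmlWords {
    /** Returns "text @ box" for every Word element in the XML string. */
    static List<String> parseWords(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document dom = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList words = dom.getElementsByTagName("Word");
            List<String> out = new ArrayList<>();
            for (int i = 0; i < words.getLength(); i++) {
                Element w = (Element) words.item(i);
                out.add(w.getTextContent() + " @ " + w.getAttribute("box"));
            }
            return out;
        } catch (Exception e) {
            // Parsing failed, e.g. because the input is not well-formed XML.
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }
}
```

In an application, the argument would be the string returned by getAsXML with e_words_as_elements | e_output_bbox set; without e_words_as_elements there are no Word elements and the list is empty.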

In case 'xml_output_flags' was not specified, the default XML output would look as follows:

 <Page num="1" crop_box="0, 0, 612, 792" media_box="0, 0, 612, 792" rotate="0">
  <Flow id="1">
   <Para id="1">
    <Line>PDFNet SDK is an amazingly comprehensive, high-quality PDF developer toolkit...</Line>
    <Line>levels. Using the PDFNet PDF library, ...</Line>
     ...
   </Para>
  </Flow>
 </Page>
 
 

Parameters:
xml_output_flags - flags controlling XML output.

getNumLines

public int getNumLines()
Get the number of lines.

Returns:
The number of lines of text on the selected page.

getFirstLine

public TextExtractor.Line getFirstLine()
Get the first line.

Note: To traverse the list of all text lines on the page, use TextExtractor.Line.getNextLine(). To traverse the list of all words on a given line, use TextExtractor.Line.getFirstWord().

Returns:
The first line of text on the selected page.

getRightToLeftLanguage

public boolean getRightToLeftLanguage()
Checks if the text extractor works in right-to-left language mode.


setRightToLeftLanguage

public void setRightToLeftLanguage(boolean right_2_left)
Sets text extractor to work in right-to-left language mode.

Parameters:
right_2_left - If true, text extractor is set to right-to-left language mode.


© 2002-2018 PDFTron Systems Inc.