PDFTron's PDF2Text is an easy-to-use, multi-platform command-line
program for high-quality and efficient text extraction from PDF
documents. PDF2Text can be used to extract text from any PDF document as
Unicode text or as structured XML, while providing a wide range of output
styles and configuration options.
PDF2Text is offered as an easy-to-use command-line application and as a
software development component that can be used as a building block for
other client and server-based applications.
- Complete Unicode support. PDF2Text can process PDF files from any
part of the world (including Asian languages) and represent the
extracted text using UTF-8 or UTF-16. To improve Unicode output,
PDF2Text can recognize vendor-specific Unicode character assignments (in
the Private Use Area) and map them to the public Unicode area. Similarly,
Unicode ligatures and PDF-specific ligatures can be broken into a
sequence of individual Unicode characters. Characters that can't be
mapped to Unicode are predictably mapped to the Private Use Area.
- Intelligent Text Recognition. An intelligent text recognition and
logical structure engine recognizes words, lines, paragraphs, and
the reading order in PDF documents. The engine can remove duplicated
text commonly used to create drop shadows, or text that is obscured by
other page content. The text extractor also works with PDF
documents that contain rotated text or documents where the information
is presented in a random order or is scattered across the page.
- Highest Reliability and Robustness. PDF2Text was designed from the
ground up to run in high-throughput, server-based, and multi-threaded
applications. A regular and rigorous QA process sets high standards for
the reliability of all PDFTron products.
- Top Performance. Advanced text recognition and content analysis
algorithms, coupled with low memory usage and native code efficiency,
make PDF2Text the ideal choice for high-traffic servers as well as for
- Extracts text from any PDF document as plain text or as structured XML.
- Offers different Unicode text encoding (UTF-8 and UTF-16) options.
- Provides positioning, font, and styling information for every
paragraph, line, word, or glyph on a page.
- Offers options to control the level of detail and the formatting in
the output XML.
- Offers advanced options to control ligature expansion, hyphen
removal, and removal of duplicate text (which is sometimes used
for drop shadow effects).
- Allows extracting text from a clip rectangle or hiding text in
specific regions on a page.
- Option to remove hidden text or text that is obscured by other page
elements (such as images or rectangles).
- Support for all versions of the PDF format (PDF 1.0 to ISO 32000).
- Full support for encrypted documents (40 and 128 bit RC4 and 128 bit
- Supports automation and batch operation.
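The ligature expansion mentioned in the list above can be pictured as a simple character mapping. The following is only a minimal sketch of the idea, applied as a post-processing step to text that has already been extracted as UTF-8. It does not show how PDF2Text performs the expansion internally, and `expand_ligatures` is a hypothetical helper name, not a PDF2Text option.

```shell
#!/bin/sh
# Illustrative sketch only: expand the five Latin "f" ligatures
# (U+FB00 to U+FB04) into their individual letters.
# expand_ligatures is a hypothetical helper, not part of PDF2Text.
# Reads UTF-8 text on stdin and writes the expanded text to stdout.
expand_ligatures() {
    sed -e 's/ﬃ/ffi/g' \
        -e 's/ﬄ/ffl/g' \
        -e 's/ﬀ/ff/g' \
        -e 's/ﬁ/fi/g' \
        -e 's/ﬂ/fl/g'
}
```

For example, piping a line of extracted text through `expand_ligatures` would replace each single-character ligature with plain letters, which tends to make the text easier to search and index.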
Sample Use Case Scenarios
- Server-based, on-demand conversion of PDF documents to text format
- Extract text from a large PDF repository for text indexing or
content retrieval purposes (e.g. to implement a PDF search engine).
- Classify or summarize PDF documents based on their content. Find
specific words for content editing purposes (such as splitting pages
based on keywords).
- Convert PDF pages to text or XML for content repurposing.
- Search PDF pages for specific words or keywords and return their
positioning information (e.g. to highlight instances of a given
Operating Systems Supported
- At least 10 MB of free disk space.
- 2 GB of RAM.
echo "Example 1): Convert PDF to Text"
./pdf2text "PDFTron PDF2Text User Manual.pdf"
echo "Example 2): Convert PDF to Text for page 1 in wordlist format with bounding box"
./pdf2text -o test_out -a 1 -f wordlist --output_bbox "PDFTron PDF2Text User Manual.pdf"
echo "Example 3): Convert all PDF files in the current folder to XML for page 1 with bounding box"
./pdf2text -o test_out -a 1 -f xml --output_bbox *.pdf
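For batch operation over a folder of documents, the commands above could be wrapped in a small script. The sketch below uses only the `-o` and `-f xml` options that appear in the examples, and assumes that `-o` names the output folder. The function name `convert_dir` and the `PDF2TEXT_BIN` variable are hypothetical, and the reliance on a non-zero exit status to signal failure is also an assumption.

```shell
#!/bin/sh
# Illustrative sketch only: convert every PDF in a folder to XML, one
# file at a time, so a single bad document does not stop the batch.
# PDF2TEXT_BIN and convert_dir are hypothetical names, not part of PDF2Text.
PDF2TEXT_BIN="${PDF2TEXT_BIN:-./pdf2text}"

# Usage: convert_dir INPUT_DIR OUTPUT_DIR
# Returns 0 if every conversion succeeded, non-zero otherwise.
convert_dir() {
    in_dir=$1
    out_dir=$2
    failed=0

    mkdir -p "$out_dir" || return 1

    for pdf in "$in_dir"/*.pdf; do
        # When nothing matches, the pattern is left unexpanded; skip it.
        [ -e "$pdf" ] || continue

        # Assumption: pdf2text exits with a non-zero status on failure.
        if ! "$PDF2TEXT_BIN" -o "$out_dir" -f xml "$pdf"; then
            echo "conversion failed: $pdf" >&2
            failed=$((failed + 1))
        fi
    done

    [ "$failed" -eq 0 ]
}
```

Calling `convert_dir ./incoming ./xml_out` (both folder names are placeholders) would attempt every PDF in `./incoming` and report failures on standard error, which suits unattended runs better than a single `*.pdf` invocation.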
For developers who are looking for a software development component to
integrate into their application, PDFTron also offers PDFNet SDK, an
easy-to-use, yet powerful software component for extracting text from
PDF documents. PDFNet SDK is available as a plain "C DLL" and can be
easily accessed from any programming language (including C#, VB.NET,
C/C++, Java, VB6, Perl, Python, Ruby, Delphi, etc.). PDFNet SDK is
PDFTron's own comprehensive PDF library. If you require
rasterization or additional PDF functionality, please
check out PDFNet SDK (http://www.pdftron.com/pdfnet) or contact a
PDFTron representative for more information.