PDF to Text Command Line Extraction

PDFTron's PDF2Text is an easy-to-use, multi-platform command-line program for high-quality, efficient text extraction from PDF documents. PDF2Text can extract text from any PDF document as Unicode text or as structured XML, and it provides a wide range of output styles and configuration options.

PDF2Text is offered as an easy-to-use command-line application and as a software development component that can be used as a building block for other client and server-based applications.

Download PDF2Text

Why PDF2Text?

  • Complete Unicode support. PDF2Text can process PDF files from any part of the world (including Asian languages) and represent the extracted text using UTF-8 or UTF-16. To improve Unicode output, PDF2Text can recognize vendor-specific Unicode character assignments (in the Private Use Area) and map them to the public Unicode area. Similarly, Unicode ligatures and PDF-specific ligatures can be broken into a sequence of individual Unicode characters. Characters that can't be mapped to Unicode are predictably mapped into the Private Use Area.
  • Intelligent Text Recognition. An intelligent text recognition and logical structure engine recognizes words, lines, paragraphs, and the reading order in PDF documents. The engine can remove duplicated text commonly used to create drop shadows, as well as text that is obscured by other page content. The text extractor also works with PDF documents that contain rotated text, and with documents where the information is presented in a random order or is scattered across the page.
  • Highest Reliability and Robustness. PDF2Text was designed from the ground up to run in high-throughput, server-based, and multi-threaded applications. A regular and rigorous QA process sets high standards for the reliability of all PDFTron products.
  • Top Performance. Advanced text recognition and content analysis algorithms, coupled with low memory usage and native code efficiency, make PDF2Text a good choice for high-traffic servers as well as for interactive applications.

Key Functions

  • Extracts text from any PDF document as plain text or as structured XML.
  • Offers different Unicode text encoding (UTF-8 and UTF-16) options.
  • Provides positioning, font, and styling information for every paragraph, line, word, or glyph on a page.
  • Offers options to control the level of detail and the formatting in the output XML.
  • Offers advanced options to control ligature expansion and hyphen removal, and to remove duplicate text (which is sometimes used for drop shadow effects).
  • Allows text extraction from within a clip rectangle, or hiding text in specific regions of a page.
  • Option to remove hidden text or text that is obscured by other page elements (such as images or rectangles).
  • Support for all versions of the PDF format (PDF 1.0 to ISO 32000).
  • Full support for encrypted documents (40-bit and 128-bit RC4, and 128-bit AES).
  • Supports automation and batch operation.

Sample Use Case Scenarios

  • Server-based, on-demand conversion of PDF documents to text format files.
  • Extract text from a large PDF repository for text indexing or content retrieval purposes (e.g. to implement a PDF search engine).
  • Classify or summarize PDF documents based on their content. Find specific words for content editing purposes (such as splitting pages based on keywords, etc.).
  • Convert PDF pages to text or XML for content repurposing.
  • Search PDF pages for specific words or keywords and return their positioning information (e.g. to highlight instances of a given word).

Operating Systems Supported

  • Windows, Linux and Mac.

System Requirements

  • At least 10 MB of free disk space.
  • 2 GB of RAM.

Examples

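#!/bin/sh
# Example 1: extract text from a single PDF using the default settings.
echo "Example 1): Convert PDF to Text"
./pdf2text "PDFTron PDF2Text User Manual.pdf"
echo
# Example 2: extract page 1 as a word list with bounding boxes, writing to test_out.
echo "Example 2): Convert PDF to Text for page 1 in wordlist format with bounding box"
./pdf2text -o test_out -a 1 -f wordlist --output_bbox "PDFTron PDF2Text User Manual.pdf"
echo
# Example 3: same options, but XML output for every PDF in the current directory.
echo "Example 3): Convert all PDFs in the current directory to XML for page 1 with bounding box"
./pdf2text -o test_out -a 1 -f xml --output_bbox *.pdf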
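For unattended batch jobs, it can be useful to run the converter once per file so that a single problem document does not stop the whole run. The following is a minimal sketch of that idea, not an official PDFTron script. It uses only the -o and -f options shown in the examples above, and it makes several assumptions: that -o names an output directory, that pdf2text exits with a non-zero status when a conversion fails, and that the binary location can be supplied through a PDF2TEXT variable. The convert_dir function name is hypothetical.

```shell
#!/bin/sh
# Sketch: convert each PDF in a directory separately and report failures.
# Assumption: PDF2TEXT holds the path to the pdf2text binary.
PDF2TEXT="${PDF2TEXT:-./pdf2text}"

# convert_dir IN_DIR OUT_DIR  (hypothetical helper, not part of PDF2Text)
# Returns 0 if every file converted, non-zero otherwise.
convert_dir() {
    in_dir=$1
    out_dir=$2
    failed=0

    # Assumption: -o expects an output directory, so create it up front.
    mkdir -p "$out_dir" || return 1

    for pdf in "$in_dir"/*.pdf; do
        # When no file matches, the pattern is left unexpanded; skip it.
        [ -e "$pdf" ] || continue

        # Assumption: a non-zero exit status signals a failed conversion.
        if ! "$PDF2TEXT" -o "$out_dir" -f xml "$pdf"; then
            echo "failed: $pdf" >&2
            failed=$((failed + 1))
        fi
    done

    echo "files failed: $failed" >&2
    [ "$failed" -eq 0 ]
}
```

A call such as convert_dir ./incoming ./test_out would then process every PDF in ./incoming. Passing *.pdf directly, as in Example 3, is simpler; the loop only adds per-file error reporting.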

PDFNet SDK

For developers who are looking for a software development component to integrate into their application, PDFTron also offers PDFNet SDK, an easy-to-use yet powerful software component for extracting text from PDF documents. PDFNet SDK is available as a plain "C DLL" and can be accessed from any programming language (including C#, VB.NET, C/C++, Java, VB6, Perl, Python, Ruby, Delphi, etc.). PDFNet SDK is PDFTron's own comprehensive PDF library. If you require rasterization or additional PDF functionality, please check out PDFNet SDK (http://www.pdftron.com/pdfnet) or contact a PDFTron representative for more information.
