PDFGenie (Command Line Tool for PDF Text & Table Extraction)

PDFTron's PDFGenie is a simple-to-use utility that can extract tables and text from existing PDF documents as HTML or XML. It is available as a command-line tool.

Please visit our Online Demo to see this live in your browser.

PDF is popular for its ability to display and print exactly the same on different computers. However, PDF documents are usually missing information for features such as paragraphs, tables, figures, headers, and footers. The lack of 'document structure' or 'logical structure' information makes it difficult to edit files or to view documents on small screens, or to extract meaningful data from a PDF. In a sense, the content becomes 'trapped'.

PDFGenie can recover tables, text, and reading order from existing PDFs. The conversion analyzes the content of existing PDF files and extracts their logical structure in order to produce a reflowable HTML or XML document. This analysis and extraction of text and table features makes the tool well suited to industries such as finance, law, and healthcare.

Like other PDFTron products, PDFGenie does not rely on other third-party software. PDFGenie can be used in server environments or as a batch conversion process.

The PDFGenie command-line utility is based on PDFNet SDK core technology, which is available for integration into third-party solutions and applications. For more information about PDFNet SDK, please contact a PDFTron representative or visit http://www.pdftron.com/pdfnet.

Why PDFGenie?

  • Complete Unicode support. PDFGenie can process PDF files from any part of the world (including Asian languages) and represent the extracted text using UTF-8 or UTF-16. To improve Unicode output, PDFGenie can recognize vendor-specific Unicode character assignments (in the Private Use Area) and map them to the public Unicode area. Similarly, Unicode ligatures and PDF-specific ligatures can be broken into a sequence of individual Unicode characters. Characters that can't be mapped to Unicode are predictably mapped into the Private Use Area.

  • Intelligent Text and Structure Recognition. An intelligent text recognition and logical structure engine recognizes tables, words, lines, paragraphs, and the reading order in PDF documents. The engine can remove duplicated text commonly used for drop shadows, as well as text that is obscured by other page content. The text extractor also handles PDF documents that contain rotated text, or in which the information is stored in an arbitrary order or scattered across the page.

  • Highest Reliability and Robustness. PDFGenie was designed from the ground up to run in high-throughput, server-based, and multi-threaded applications. A regular and rigorous QA process sets high standards for the reliability of all PDFTron products.

  • Top Performance. Advanced text recognition and content analysis algorithms, coupled with low memory usage and native code efficiency, make PDFGenie a strong choice for high-traffic servers as well as for interactive applications.
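
To make the ligature expansion mentioned under Unicode support concrete, here is a minimal sketch of the idea in plain shell. It is not PDFGenie's implementation, and the function name expand_ligatures is hypothetical. It assumes UTF-8 text on standard input and only handles the five Latin presentation-form ligatures U+FB00 to U+FB04.

```shell
#!/bin/sh
# Illustration only: split Latin ligature code points into plain letters.
# The ligatures are built from their UTF-8 bytes (octal escapes) so the
# script itself stays pure ASCII.
LIG_FF=$(printf '\357\254\200')    # U+FB00 (ff)
LIG_FI=$(printf '\357\254\201')    # U+FB01 (fi)
LIG_FL=$(printf '\357\254\202')    # U+FB02 (fl)
LIG_FFI=$(printf '\357\254\203')   # U+FB03 (ffi)
LIG_FFL=$(printf '\357\254\204')   # U+FB04 (ffl)

# Reads UTF-8 text on stdin, writes the expanded text to stdout.
expand_ligatures() {
    sed -e "s/$LIG_FFI/ffi/g" \
        -e "s/$LIG_FFL/ffl/g" \
        -e "s/$LIG_FF/ff/g" \
        -e "s/$LIG_FI/fi/g" \
        -e "s/$LIG_FL/fl/g"
}
```

For example, text containing the single code point U+FB01 followed by "nancial" comes out as the eight separate letters of "financial", which is what a search index or spell checker usually expects.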

Key Functions

  • Extracts text and tables from any PDF document to HTML or structured XML.

  • Offers different Unicode text encoding (UTF-8 and UTF-16) options.

  • Provides positioning, font, and styling information for every paragraph, line, word, or glyph on a page.

  • Offers options to control the level of detail and the formatting in the output XML.

  • Offers advanced options to control ligature expansion, hyphen removal, and the removal of duplicate text (which is sometimes used for drop-shadow effects).

  • Option to remove hidden text or text that is obscured by other page elements (such as images or rectangles).

  • Ability to annotate structural segments of pages for debugging and analysis.

  • Support for all versions of the PDF format (PDF 1.0 to ISO 32000).

  • Full support for encrypted documents (40-bit and 128-bit RC4, and 128-bit AES).

  • Supports automation and batch operation.

Sample Use Case Scenarios

  • Server-based, on-demand conversion of PDF documents to XML format files.

  • Extract text and tables from a large PDF repository for text indexing or content retrieval purposes (e.g. to implement a PDF search engine).

  • Classify or summarize PDF documents based on their content. Find specific words for content editing purposes (such as splitting pages based on keywords).

  • Convert PDF pages to XML for content repurposing or accessibility viewing.

  • Search PDF pages for specific words or keywords and return their positioning information (e.g. to highlight instances of a given word).
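
As a rough illustration of the indexing and keyword-search scenarios above, the sketch below lists already-converted files that mention a keyword. It relies only on standard find, grep, and sort. The function name files_mentioning is hypothetical, and the assumption that converted output sits in a directory as .html or .xml files is ours, not something this page specifies.

```shell
#!/bin/sh
# Sketch: list converted HTML/XML files under a directory that contain
# a keyword (case-insensitive, matched as a fixed string, not a regex).
# Assumption: conversion output was saved with .html or .xml extensions.
files_mentioning() {
    keyword="$1"
    dir="$2"
    find "$dir" -type f \( -name '*.html' -o -name '*.xml' \) \
        -exec grep -ilF -- "$keyword" {} + | sort
}
```

A call such as `files_mentioning "invoice" ./converted` prints one matching path per line. Because the search runs over raw markup, a keyword that happens to equal a tag or attribute name will also match; a real indexer would parse the XML first.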

Operating Systems Supported

  • Windows, Linux and Mac.

System Requirements

  • At least 10 MB of free disk space.

Example

#!/bin/sh
echo "Example 1) Convert PDF to HTML or XML using default options"
pdfgenie *.pdf
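
The example above hands every PDF to a single invocation. For unattended batch runs it can help to convert one file at a time so that a single problematic document is reported without stopping the rest. The following is a minimal sketch under stated assumptions: it uses only the plain `pdfgenie <file>` form shown above with default options, it assumes the tool exits with a non-zero status on failure, and the names convert_all and PDFGENIE are hypothetical.

```shell
#!/bin/sh
# Sketch: convert each PDF in a directory separately and report failures.
# Assumption: PDFGENIE names the converter binary (default: pdfgenie on
# PATH) and the binary returns a non-zero exit status on error.
PDFGENIE="${PDFGENIE:-pdfgenie}"

convert_all() {
    dir="$1"
    failed=0
    for f in "$dir"/*.pdf; do
        [ -e "$f" ] || continue        # the glob matched nothing
        if "$PDFGENIE" "$f"; then
            echo "ok: $f"
        else
            echo "failed: $f" >&2
            failed=$((failed + 1))
        fi
    done
    # Succeed only if every conversion succeeded.
    [ "$failed" -eq 0 ]
}
```

Calling `convert_all ./incoming` prints one status line per file and returns a non-zero status if any conversion failed, which makes it easy to hook into cron jobs or other schedulers.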