PDFTron's PDFGenie is a simple-to-use utility that can extract tables and text from existing PDF documents as HTML or XML. It is available as a command-line tool.
PDFGenie is now deprecated and has been replaced by PDFTron.ai, the next generation of table extraction and document understanding. PDFTron.ai uses advanced deep learning to recognize tables in PDF documents.
PDF is popular for its ability to display and print exactly the same on different computers. However, PDF documents are usually missing information for features such as paragraphs, tables, figures, headers, and footers. The lack of 'document structure' or 'logical structure' information makes it difficult to edit files or to view documents on small screens, or to extract meaningful data from a PDF. In a sense, the content becomes 'trapped'.
PDFGenie can recover tables, text, and reading order from existing PDFs. The conversion analyzes the content of a PDF file and extracts its logical structure to produce a reflowable HTML or XML document. This analysis and extraction of text and table features makes the tool well suited to industries such as finance, law, and healthcare.
Complete Unicode support. PDFGenie can process PDF files from any
part of the world (including Asian languages) and represent the
extracted text using UTF-8 or UTF-16. To improve Unicode output,
PDFGenie can recognize vendor-specific Unicode character assignments (in
the Private Use Area) and map them to the public Unicode area. Similarly,
Unicode ligatures and PDF-specific ligatures can be broken into a
sequence of individual Unicode characters. Characters that can't be
mapped to Unicode are predictably mapped in the Private Use Area.
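
The ligature expansion and Private Use Area handling described above can be illustrated on already-extracted text. The following is a minimal Python sketch using only the standard library; it is not PDFGenie's implementation, and the function names `expand_ligatures` and `private_use_chars` are hypothetical helpers introduced for illustration.

```python
import unicodedata


def expand_ligatures(text: str) -> str:
    """Replace Latin ligature code points with their component letters."""
    out = []
    for ch in text:
        # U+FB00..U+FB06 are the Latin ligatures in the Alphabetic
        # Presentation Forms block (ff, fi, fl, ffi, ffl, long s-t, st).
        if 0xFB00 <= ord(ch) <= 0xFB06:
            # Compatibility decomposition splits the ligature into letters.
            out.append(unicodedata.normalize("NFKD", ch))
        else:
            out.append(ch)
    return "".join(out)


def private_use_chars(text: str) -> list[str]:
    """Return the characters whose Unicode general category is 'Co' (private use)."""
    return [ch for ch in text if unicodedata.category(ch) == "Co"]


if __name__ == "__main__":
    sample = "\ufb01nal o\ufb03ce \ue000"
    print(expand_ligatures(sample))
    print([hex(ord(ch)) for ch in private_use_chars(sample)])
```

Restricting the decomposition to the ligature range, rather than normalizing the whole string, leaves other compatibility characters (such as superscripts or full-width forms) untouched.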
Intelligent Text and Structure Recognition. An intelligent text recognition and
logical structure engine is used to recognize tables, words, lines, paragraphs, and
the reading order in PDF documents. The engine can remove duplicated
text commonly used to create drop shadows, or text that is obscured by other
page content. The text extractor also handles PDF
documents that contain rotated text or documents where the information
is presented in a random order or is scattered across the page.
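
To make the idea of duplicate-text removal concrete, here is a minimal Python sketch of one plausible heuristic: treat a word as a drop-shadow duplicate when an identical word has already been seen at almost the same position. This is an assumption for illustration only, not the algorithm PDFGenie uses; the `Word` record, the `drop_shadow_duplicates` function, and the `tolerance` value are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """A hypothetical extracted word with the origin of its bounding box."""
    text: str
    x: float
    y: float


def drop_shadow_duplicates(words, tolerance=1.5):
    """Keep the first of any identical words whose origins lie within `tolerance` units."""
    kept = []
    for w in words:
        is_duplicate = any(
            k.text == w.text
            and abs(k.x - w.x) <= tolerance
            and abs(k.y - w.y) <= tolerance
            for k in kept
        )
        if not is_duplicate:
            kept.append(w)
    return kept


if __name__ == "__main__":
    page = [
        Word("Total", 100.0, 200.0),
        Word("Total", 100.8, 199.4),  # slightly offset copy, as in a drop shadow
        Word("Total", 300.0, 200.0),  # same text elsewhere on the page is kept
    ]
    for word in drop_shadow_duplicates(page):
        print(word)
```

The pairwise comparison is quadratic in the number of words, which is acceptable for a single page; a spatial index would be the natural refinement for larger inputs.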
Highest Reliability and Robustness. PDFGenie was designed from the
ground up to run in high-throughput, server-based, and multi-threaded
applications. A regular and rigorous QA process sets high standards for
the reliability of all PDFTron products.
Top Performance. Advanced text recognition and content analysis
algorithms, coupled with low memory usage and native code efficiency,
make PDFGenie the ideal choice for high-traffic servers as well as for
Extracts text and tables from any PDF document to HTML or structured XML.
Offers different Unicode text encoding (UTF-8 and UTF-16) options.
Provides positioning, font, and styling information for every
Paragraph, Line, Word, or Glyph on a page.
Offers options to control the level of detail and the formatting in
the output XML.
Offers advanced options to control ligature expansion, hyphen
removal, and removal of duplicate text (which is sometimes used
for drop shadow effects).
Option to remove hidden text or text that is obscured by other page
elements (such as images or rectangles).
Ability to annotate structural segments of pages for debugging and analysis.
Support for all versions of the PDF format (PDF 1.0 to ISO 32000).
Full support for encrypted documents (40- and 128-bit RC4 and 128-bit
Supports automation and batch operation.
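
As an illustration of what the hyphen removal option in the list above refers to, the following minimal Python sketch rejoins words that were split across line ends in extracted text. It is a simplified heuristic written for this example, not PDFGenie's behavior; the function name `join_hyphenated` is hypothetical.

```python
def join_hyphenated(lines):
    """Merge a line ending in '-' with the next line when that line starts with a lowercase letter."""
    merged = []
    for line in lines:
        stripped = line.strip()
        if merged and merged[-1].endswith("-") and stripped[:1].islower():
            # Drop the trailing hyphen and glue the continuation onto the previous line.
            merged[-1] = merged[-1][:-1] + stripped
        else:
            merged.append(stripped)
    return merged


if __name__ == "__main__":
    for line in join_hyphenated(["The extrac-", "tion engine", "is fast."]):
        print(line)
```

A heuristic this simple also joins genuine compounds that happen to break at their hyphen (for example `low-` followed by `memory`), so a production approach would typically consult a dictionary before dropping the hyphen.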
Sample Use Case Scenarios
Server-based, on-demand conversion of PDF documents to XML format
Extract text and tables from a large PDF repository for text indexing or
content retrieval purposes (e.g. to implement a PDF search engine).
Classify or summarize PDF documents based on their content. Find
specific words for content editing purposes (such as splitting pages
based on keywords, etc.).
Convert PDF pages to XML for content repurposing or accessibility viewing.
Search PDF pages for specific words or keywords and return their
positioning information (e.g. to highlight instances of a given
Operating Systems Supported
Windows, Linux and Mac.
At least 10 MB of free disk space.
echo "Example 1) Convert PDF to HTML or XML using default options"