Intuitive page content extraction based on a concept of graphical elements
Use TextExtractor (pdftron.PDF.TextExtractor), a high-quality and efficient text recognition engine, to extract structured Unicode text, including style and positioning information, from any PDF document. The API is simple to use and offers numerous advanced options for handling hidden or duplicated text, ligature expansion, and more.
Low-level text extraction (including positioning information for text runs and individual characters)
Complete access to the graphics state (for color spaces and colorants, dash properties, etc.)
Full access to fonts, including glyph outlines
Image extraction. All compression filters allowed in PDF are supported, and images can optionally be extracted in RAW format
Image color-conversion and normalization filters
Full access to marked content (e.g., used in tagged PDF documents to preserve logical structure or to mark transparency groups)
Full access to page form fields and annotations
Extract embedded fonts, ICC color profiles, U3D streams, embedded files, etc.
Access to a document's metadata
High-level Logical Structure API and support for 'Tagged' PDF documents
Extract and render PDF layers (also known as Optional Content Groups, or OCGs)
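To make the ligature expansion option mentioned above more concrete, here is a minimal sketch of the idea using only the Java standard library. It does not call the PDFTron API and is not how TextExtractor is implemented; the class name LigatureExpansion and the method expandLigatures are hypothetical, chosen for illustration. The sketch assumes that "expansion" means replacing the Latin ligature code points U+FB00 through U+FB06 with their component letters, which Unicode compatibility normalization (NFKC) does.

```java
import java.text.Normalizer;

// Hypothetical illustration class; not part of any SDK.
public class LigatureExpansion {

    // Replaces Latin ligature code points (U+FB00 to U+FB06, e.g. the
    // single-glyph "fi" and "ffi") with their component letters.
    // All other characters are copied through unchanged.
    static String expandLigatures(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c >= '\uFB00' && c <= '\uFB06') {
                // NFKC applies the compatibility decomposition of the ligature.
                out.append(Normalizer.normalize(String.valueOf(c), Normalizer.Form.NFKC));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Input contains the "fi" ligature (U+FB01) and the "ffi" ligature (U+FB03).
        System.out.println(expandLigatures("\uFB01le of\uFB03ce")); // prints "file office"
    }
}
```

Restricting the normalization to the ligature range, rather than normalizing the whole string, leaves other compatibility characters (such as superscripts or full-width forms) untouched, which is usually what a search or indexing pipeline wants from extracted text.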
JavaScript PDF parsing
Integrate Quickly Into Your Project
PDFTron’s Java PDF Conversion Library can be set up quickly with popular package managers and a few lines of code.
Tools and Utilities
PDF2Text
A command-line tool for text extraction from PDF documents.