PDF Parsing & Content Extraction Library

Access every part of a PDF, including images, fonts, structured text and tables, bookmarks, and metadata, for advanced content repurposing and indexing in your web, mobile, desktop, and server applications.

Text Extraction

Convert PDFs into readable Unicode text, regardless of language or font. Extract characters, words, fonts, and form fields. Populate a full-text search engine to search across a set of documents.
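
To illustrate the full-text search use case, here is a minimal sketch of an inverted index built over text that has already been extracted. It uses only the Python standard library and does not call the SDK; the file names, sample strings, and the helper names tokenize, build_index, and search are assumptions made for this example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into Unicode word tokens."""
    return re.findall(r"\w+", text.lower())


def build_index(documents):
    """Map each token to the set of document names that contain it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for token in tokenize(text):
            index[token].add(name)
    return index


def search(index, query):
    """Return the sorted names of documents that contain every query token."""
    tokens = tokenize(query)
    if not tokens:
        return []
    hits = set.intersection(*(index.get(token, set()) for token in tokens))
    return sorted(hits)


# Hypothetical extracted text, keyed by file name (not real SDK output).
extracted = {
    "invoice.pdf": "Invoice total: 420 EUR, due in 30 days",
    "report.pdf": "Quarterly report: total revenue grew",
    "notiz.pdf": "Überweisung fällig in 30 Tagen",
}
index = build_index(extracted)
```

Because tokens are matched as Unicode words, a query such as search(index, "überweisung") finds the German document just as search(index, "total revenue") finds the report. A production search engine would add stemming, ranking, and phrase queries on top of this.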

Metadata Extraction

Analyze PDFs at a low level. Grab the PDF version, author information, timestamps, and anything else hidden away in the file.
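
As a rough illustration of what low-level metadata looks like, the sketch below reads the version from a PDF header and converts a PDF date string (the D:YYYYMMDDHHmmSS form used by entries such as CreationDate) into a Python datetime. It is standard-library Python, not the SDK's API; the function names header_version and parse_pdf_date are assumptions made for this example.

```python
import re
from datetime import datetime, timedelta, timezone

# The header looks like "%PDF-1.7". Note that a document catalog may declare
# a later version than the header; this sketch reads the header only.
HEADER_RE = re.compile(rb"%PDF-(\d\.\d)")

# D:YYYYMMDDHHmmSSOHH'mm' where every part after the year is optional
# and O is "+", "-", or "Z".
DATE_RE = re.compile(
    r"^(?:D:)?(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:([+\-Z])(?:(\d{2})'?(?:(\d{2})'?)?)?)?$"
)


def header_version(data):
    """Return the version declared in the file header, e.g. '1.7', or None."""
    match = HEADER_RE.search(data[:1024])
    return match.group(1).decode("ascii") if match else None


def parse_pdf_date(value):
    """Convert a PDF date string into a datetime (naive if no offset is given)."""
    match = DATE_RE.match(value.strip())
    if match is None:
        raise ValueError(f"not a PDF date string: {value!r}")
    year, month, day, hour, minute, second, sign, tz_h, tz_m = match.groups()
    tzinfo = None
    if sign == "Z":
        tzinfo = timezone.utc
    elif sign in ("+", "-"):
        offset = timedelta(hours=int(tz_h or 0), minutes=int(tz_m or 0))
        tzinfo = timezone(-offset if sign == "-" else offset)
    # Missing month and day default to 1; missing time fields default to 0.
    return datetime(int(year), int(month or 1), int(day or 1),
                    int(hour or 0), int(minute or 0), int(second or 0),
                    tzinfo=tzinfo)
```

For example, parse_pdf_date("D:20240131123045-05'00'") yields a timezone-aware datetime five hours behind UTC, which can then be indexed or compared like any other timestamp.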

Annotation Extraction

Serialize annotations into the industry-standard XFDF format (compatible with most PDF viewers). Enable users to edit annotations without modifying the underlying document. Share annotations with other users to enable real-time collaboration. Create a summary of all annotations.
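
Because XFDF is plain XML, an annotation summary can be built with any XML parser once the annotations have been exported. The sketch below uses Python's standard library rather than the SDK; the sample XFDF string and the function name summarize_annotations are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

XFDF_NS = "{http://ns.adobe.com/xfdf/}"

# Hypothetical XFDF export containing a sticky note and a highlight.
SAMPLE_XFDF = """<xfdf xmlns="http://ns.adobe.com/xfdf/" xml:space="preserve">
  <annots>
    <text page="0" rect="100,700,120,720" title="alice" name="a1">
      <contents>Check this figure</contents>
    </text>
    <highlight page="2" rect="72,500,300,515" title="bob" name="a2"/>
  </annots>
</xfdf>"""


def summarize_annotations(xfdf_text):
    """Return one dict per annotation: type, page, author, and comment text."""
    root = ET.fromstring(xfdf_text)
    annots = root.find(f"{XFDF_NS}annots")
    summary = []
    if annots is None:
        return summary
    for element in annots:
        contents = element.findtext(f"{XFDF_NS}contents", default="")
        summary.append({
            # The element name is the annotation type (text, highlight, ...).
            "type": element.tag.split("}", 1)[-1],
            # XFDF page indices are zero-based.
            "page": int(element.get("page", "0")),
            # The "title" attribute conventionally carries the author label.
            "author": element.get("title", ""),
            "text": contents.strip(),
        })
    return summary


for item in summarize_annotations(SAMPLE_XFDF):
    print(f'page {item["page"] + 1}: {item["type"]} by {item["author"]}')
```

The same approach works in reverse: because the annotations live in a separate XFDF file, they can be edited or shared without touching the underlying PDF.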

Table Data Extraction

Detect tables and programmatically extract their contents as XML or HTML.
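
Once a table has been exported as HTML, its cells can be loaded into rows for further processing. The sketch below is standard-library Python and assumes a simple, non-nested table; the sample markup is hypothetical and is not actual SDK output, and the class name TableRows is an assumption made for this example.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # list of text chunks while inside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self.rows[-1].append("".join(self._cell).strip())
            self._cell = None


def table_to_rows(html):
    """Parse one simple HTML table into a list of rows of cell strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical exported table.
sample = (
    "<table><tr><th>Item</th><th>Qty</th></tr>"
    "<tr><td>Bolt &amp; nut</td><td> 12 </td></tr></table>"
)
rows = table_to_rows(sample)
```

The resulting list of rows can be written out with the csv module or loaded into a database. Nested tables and merged cells would need extra handling that this sketch leaves out.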

Image Extraction

Extract individual images or graphics embedded within a PDF, or convert pages into images.

3D Data Extraction

Unwrap U3D, PRC, or STEP files embedded within PDF documents for display in a 3D viewer.

Font Extraction

Retrieve Type1, OpenType, TrueType, Type3, and CID fonts embedded in the PDF. Find font names, font sizes, and the path data for individual glyphs.

Form Field Extraction

Serialize forms in the industry-standard XFDF format to extract, edit, or insert form field data.
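
Form data exported as XFDF can be read with any XML parser. The sketch below flattens the fields into a dictionary using Python's standard library, joining nested field names with a period. It does not use the SDK; the sample XFDF and the function name read_fields are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

XFDF_NS = "{http://ns.adobe.com/xfdf/}"

# Hypothetical XFDF export with a nested field and a multi-value field.
SAMPLE_FORM = """<xfdf xmlns="http://ns.adobe.com/xfdf/" xml:space="preserve">
  <fields>
    <field name="name"><value>Jane Doe</value></field>
    <field name="address">
      <field name="city"><value>Oslo</value></field>
    </field>
    <field name="colors"><value>red</value><value>blue</value></field>
  </fields>
</xfdf>"""


def read_fields(xfdf_text):
    """Return {fully qualified field name: value, or list of values}."""
    root = ET.fromstring(xfdf_text)
    values = {}

    def walk(parent, prefix):
        for field in parent.findall(f"{XFDF_NS}field"):
            name = prefix + field.get("name", "")
            found = [v.text or "" for v in field.findall(f"{XFDF_NS}value")]
            if found:
                # Keep a single value as a string, several values as a list.
                values[name] = found[0] if len(found) == 1 else found
            # Child fields extend the parent's name: address -> address.city
            walk(field, name + ".")

    fields = root.find(f"{XFDF_NS}fields")
    if fields is not None:
        walk(fields, "")
    return values


form_data = read_fields(SAMPLE_FORM)
```

Editing or inserting data follows the same pattern in reverse: change the text of a value element, serialize the tree again, and import the XFDF back into the form.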

Search Multiple Documents

Programmatically search across multiple documents at predefined locations. Extract information and metadata from a set of documents.

  • Extract digital signatures (timestamps, etc.)
  • Intuitive page content extraction based on a concept of graphical elements
  • High-quality and efficient text recognition engine (pdftron.PDF.TextExtractor). TextExtractor extracts structured Unicode text, including style and positioning information, from any PDF document. The API is simple to use and offers a number of advanced options for handling hidden or duplicated text, ligature expansion, etc.
  • Low-level text extraction (including positioning information for text runs and individual characters)
  • Complete access to the graphics state (for color spaces and colorants, dash properties, etc)
  • Full access to fonts, including glyph outlines
  • Image extraction. All compression filters allowed in PDF are supported, and images can optionally be extracted in RAW format
  • Image color-conversion and normalization filters
  • Full access to marked content (e.g. used in tagged PDF documents to preserve logical structure or to mark transparency groups)
  • Full access to page form fields and annotations
  • Extraction of embedded fonts, ICC color profiles, U3D streams, embedded files, etc.
  • Access to a document's metadata
  • High-level Logical Structure API and support for 'Tagged' PDF documents
  • Extract and render PDF layers (also known as Optional Content Groups, or OCGs)
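
The ligature expansion option mentioned in the list above matters for search: a word typeset with the single glyph "ﬁ" will not match a query typed as "fi" unless the ligature is expanded. The sketch below shows the idea in standard-library Python. It is not the TextExtractor implementation, and the function name expand_ligatures is an assumption made for this example.

```python
import unicodedata


def expand_ligatures(text):
    """Replace Latin ligature characters (U+FB00 to U+FB06) with plain letters.

    Only the ligature block is decomposed, so other characters such as
    superscripts or full-width forms are left exactly as extracted.
    """
    out = []
    for ch in text:
        if "\ufb00" <= ch <= "\ufb06":
            out.append(unicodedata.normalize("NFKD", ch))
        else:
            out.append(ch)
    return "".join(out)


# "\ufb01" is the fi ligature and "\ufb03" is the ffi ligature.
raw = "\ufb01nal o\ufb03ce"
clean = expand_ligatures(raw)
```

Here clean is the plain string "final office", which a search index or spell checker can match directly.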

