Text extraction using PDFTron SDK

In this document
Extracting text data from a PDF document
Text runs
TextExtractor class
PDFGenie

Extracting text data from a PDF document

When you read a PDF document with the ElementReader class, the text often arrives in fragments. For example, a sentence that reads "This is a sample sentence." may come back as two elements: "T" and "his is a sample sentence.". This happens because text objects in a PDF document are not always cleanly organized into words, sentences, or paragraphs. The ElementReader class returns Element objects exactly as they are defined in the PDF page content stream.
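To make the fragmentation concrete, here is a minimal sketch in plain Python that does not use the PDFTron SDK. It assumes a hand-written, simplified content stream in which the sentence is shown by two separate Tj operators; the stream text and the helper name `extract_tj_strings` are hypothetical and exist only for illustration. Real content streams also use escaped characters, hex strings, TJ arrays, and font-specific encodings, none of which this sketch handles.

```python
import re

# Hypothetical, simplified page content stream: the sentence is split
# across two Tj (show text) operators, as a PDF producer might write it.
CONTENT_STREAM = (
    "BT /F1 12 Tf 72 700 Td "
    "(T) Tj (his is a sample sentence.) Tj "
    "ET"
)

def extract_tj_strings(stream):
    """Return the literal string operand of each Tj operator, in stream order.

    Assumption: operands are simple literal strings without nested
    parentheses or escapes, which is not true of PDF files in general.
    """
    return re.findall(r"\(([^()]*)\)\s*Tj", stream)

fragments = extract_tj_strings(CONTENT_STREAM)
print(fragments)           # one entry per Tj operator
print("".join(fragments))  # the sentence only appears after joining
```

Reading the stream element by element yields the two fragments separately; the full sentence has to be reassembled by the caller.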

Text runs

An element of type e_text corresponds directly to a Tj operator in the PDF page content stream. Each e_text element represents a text run: a sequence of text glyphs that share the same font and graphics attributes. For example, if each letter of a single word is set in a different font, each letter is a separate text run. A text run can also contain multiple words separated by spaces. The PDF format does not guarantee that text runs appear in reading order.
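The relationship between fonts and text runs can be sketched as follows, again in plain Python and independent of the PDFTron SDK. The sketch assumes glyphs are given as hypothetical `(character, font name)` pairs and groups consecutive glyphs that share a font. It only shows the upper bound on run length: a PDF producer is free to split text with the same font into more runs than this.

```python
from itertools import groupby

def split_into_runs(glyphs):
    """Group consecutive (char, font) pairs that share a font into runs.

    Returns a list of (font, text) tuples. This is an illustration of the
    "same font and attributes" rule, not how any SDK builds its elements.
    """
    runs = []
    for font, group in groupby(glyphs, key=lambda glyph: glyph[1]):
        runs.append((font, "".join(char for char, _ in group)))
    return runs

# One font throughout: a single run may hold several words and spaces.
same_font = [(char, "Helvetica") for char in "two words"]

# A different font for every letter: every letter becomes its own run.
mixed_fonts = [("P", "Helvetica"), ("D", "Times-Bold"), ("F", "Courier")]

print(split_into_runs(same_font))
print(split_into_runs(mixed_fonts))
```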

TextExtractor class

In short, extracting text with ElementReader is not guaranteed to return data in the expected reading order. The most straightforward way to extract words and text from text runs is the pdftron.PDF.TextExtractor class, as shown in the TextExtract sample project (TextExtract Sample).

TextExtractor assembles words, lines, and paragraphs, removes duplicate strings, reconstructs the reading order of the text, and so on. With TextExtractor you can also obtain the bounding box of each word, line, or paragraph, along with style information such as font and color. You can use this information to search for the corresponding text elements with ElementReader.
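To give a feel for what reconstructing reading order involves, here is a toy sketch in plain Python. It is not the algorithm TextExtractor uses and does not call the PDFTron SDK. The `Run` type, the `reading_order_lines` helper, and both tolerance values are hypothetical. The sketch assumes each text run comes with a bounding box position in PDF coordinates (origin at the bottom left, y increasing upward), groups runs into lines by baseline, orders them left to right, and joins runs that touch.

```python
from dataclasses import dataclass

@dataclass
class Run:
    text: str
    x: float      # left edge of the run's bounding box
    y: float      # baseline position; larger y is higher on the page
    width: float  # horizontal extent of the run

def reading_order_lines(runs, line_tolerance=2.0, gap_tolerance=1.0):
    """Return one string per line, top to bottom, left to right."""
    # Higher baselines first, then left to right within the same baseline.
    ordered = sorted(runs, key=lambda r: (-r.y, r.x))

    # Runs whose baselines are within line_tolerance share a line.
    lines = []
    for run in ordered:
        if lines and abs(lines[-1][-1].y - run.y) <= line_tolerance:
            lines[-1].append(run)
        else:
            lines.append([run])

    result = []
    for line in lines:
        line.sort(key=lambda r: r.x)
        text = line[0].text
        for prev, cur in zip(line, line[1:]):
            gap = cur.x - (prev.x + prev.width)
            # Touching runs belong to one word; a wider gap means a space.
            text += ("" if gap <= gap_tolerance else " ") + cur.text
        result.append(text)
    return result

# Runs listed in content stream order, which is not reading order here.
runs = [
    Run("his is a sample sentence.", x=80, y=700, width=140),
    Run("Second line.", x=72, y=684, width=70),
    Run("T", x=72, y=700, width=8),
]
print(reading_order_lines(runs))
```

A production extractor also has to deal with rotated text, multiple columns, duplicated strings, and right-to-left scripts, which is why using TextExtractor is preferable to hand-rolled logic like this.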