Text extraction

You can extract text from a document without rendering it by using the loadPageText API.

const wvElement = document.getElementById('viewer');
WebViewer({ ...options }, wvElement)
  .then(instance => {
    const pageIndex = 0; // Extract the text of the first page
    // Assumes a document has already finished loading in the viewer
    const doc = instance.docViewer.getDocument();

    // Accepts a 0-based page index
    doc.loadPageText(pageIndex, (text) => {
      // .. do something with text
    });
  });
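
To read more than one page, the callback-style call can be wrapped in a Promise and repeated per page. The following is a minimal sketch, not the library's own helper: loadPageTextAsync and loadAllPageText are hypothetical names, and it assumes the document object exposes getPageCount() alongside the loadPageText(pageIndex, callback) call used above.

```javascript
// Wrap the callback-style loadPageText(pageIndex, callback) in a Promise.
function loadPageTextAsync(doc, pageIndex) {
  return new Promise((resolve) => {
    doc.loadPageText(pageIndex, (text) => resolve(text));
  });
}

// Load the text of every page, one page at a time, in page order.
// Assumption: doc.getPageCount() returns the number of pages in the document.
async function loadAllPageText(doc) {
  const pages = [];
  const pageCount = doc.getPageCount();
  for (let pageIndex = 0; pageIndex < pageCount; pageIndex++) {
    pages.push(await loadPageTextAsync(doc, pageIndex));
  }
  return pages; // pages[0] holds the text of the first page
}
```

Inside the WebViewer callback this could be used as `loadAllPageText(doc).then(pages => { /* ... */ })`. Pages are loaded sequentially to keep the sketch simple.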

Advanced text extraction

To perform advanced text extraction from a PDF document, use the TextExtractor class.

This is only available with the Full API.
const doc = await PDFNet.PDFDoc.createFromURL(filename);
const firstPage = await doc.getPage(1); // Page numbers are 1-based here

const txt = await PDFNet.TextExtractor.create();
const rect = new PDFNet.Rect(0, 0, 612, 794);
txt.begin(firstPage, rect); // Read the page.

// Extract words one by one.
for (let line = await txt.getFirstLine(); await line.isValid(); line = await line.getNextLine()) {
  for (let word = await line.getFirstWord(); await word.isValid(); word = await word.getNextWord()) {
    const text = await word.getString();
    // .. do something with text
  }
}

Read a PDF File (Parse & Extract Text)
Full sample code which illustrates the basic text extraction capabilities.

About extracting text

When we use the ElementReader class to read elements from a PDF document, we often get partial data. For example, suppose we are trying to extract the sentence "This is a sample sentence." from a PDF document. We could end up with two elements: "T" and "his is a sample sentence.". This happens because text objects in a PDF document are not always cleanly organized into words, sentences, or paragraphs. The ElementReader class returns Element objects exactly as they are defined in the PDF page content stream.

Text runs

An element of type e_text corresponds directly to a Tj element in the PDF document. Each e_text element represents a text run: a sequence of text glyphs that use the same font and graphics attributes. For example, if each letter of a single word is set in a different font, each letter is a separate text run. You may also encounter text runs that contain multiple words separated by spaces. The PDF format does not guarantee that text is stored in reading order.
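
To make the ordering problem concrete, here is a deliberately naive sketch that rebuilds lines from runs by position alone. It is an illustration only and not how any PDFTron class works: the { text, x, y } run objects, the naiveReadingOrder name, and the lineTolerance parameter are all hypothetical, and the sketch assumes PDF-style coordinates where y grows upward.

```javascript
// Group runs into lines by vertical position, then order each line left to right.
// Each run is a hypothetical plain object: { text, x, y }, with y growing upward.
function naiveReadingOrder(runs, lineTolerance = 2) {
  // Highest runs first, since the top of the page has the largest y.
  const byHeight = [...runs].sort((a, b) => b.y - a.y);

  const lines = [];
  for (const run of byHeight) {
    const current = lines[lines.length - 1];
    // A run joins the current line when its baseline is close to the line's first run.
    if (current && Math.abs(current[0].y - run.y) <= lineTolerance) {
      current.push(run);
    } else {
      lines.push([run]);
    }
  }

  // Within a line, read left to right and join the run text as-is.
  return lines.map((line) =>
    line.sort((a, b) => a.x - b.x).map((run) => run.text).join('')
  );
}

// Runs listed in content-stream order, which here differs from reading order.
const runs = [
  { text: 'his is a sample sentence.', x: 80, y: 700 },
  { text: 'Second line', x: 72, y: 680 },
  { text: 'T', x: 72, y: 700 },
];
const orderedLines = naiveReadingOrder(runs);
// orderedLines is ['This is a sample sentence.', 'Second line']
```

A position-only approach like this breaks down on multi-column layouts, rotated text, and runs that need a space inserted between them.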

TextExtractor class

In short, using an ElementReader to extract text from a PDF document is not guaranteed to return the data in reading order. The most straightforward way to extract words and text from text runs is the pdftron.PDF.TextExtractor class, as shown in the TextExtract sample project.

TextExtractor assembles words, lines, and paragraphs, removes duplicate strings, and reconstructs the reading order of the text. Using TextExtractor, you can also obtain the bounding box of each word, line, or paragraph, along with style information such as font and color. This information can be used to search for the corresponding text elements using ElementReader.
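
As a rough sketch of collecting words together with their bounding boxes, the line and word iteration shown earlier on this page can be extended as follows. The collectWordBoxes name is hypothetical. The iteration calls (getFirstLine, getFirstWord, isValid, getNextWord, getNextLine, getString) come from the snippet above, while word.getBBox() and the x1/y1/x2/y2 fields on its result are assumptions that should be checked against the API reference.

```javascript
// Collect every word with its bounding box from an extractor that has already
// been started on a page with txt.begin(...).
async function collectWordBoxes(txt) {
  const words = [];
  for (let line = await txt.getFirstLine(); await line.isValid(); line = await line.getNextLine()) {
    for (let word = await line.getFirstWord(); await word.isValid(); word = await word.getNextWord()) {
      // Assumption: getBBox() resolves to a rectangle with x1, y1, x2, y2 fields.
      const box = await word.getBBox();
      words.push({
        text: await word.getString(),
        x1: box.x1,
        y1: box.y1,
        x2: box.x2,
        y2: box.y2,
      });
    }
  }
  return words; // In the order the extractor yields lines and words
}
```

The resulting array can then be filtered by position, for example to keep only the words that fall inside a region of interest on the page.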


One of the more difficult tasks with a PDF document is extracting tabular data. PDFGenie can extract tables, text, and reading order from existing PDF documents as HTML or XML output. See our detailed blog post to learn more about PDFGenie, or visit our PDFGenie documentation.
