
Text extraction

Extracting text from a PDF in UWP

This guide shows how to extract text from a PDF document with the TextExtractor class.

The reading order for text extraction is not defined in the ISO PDF standard. In fact, a typical PDF file has no concept of sentences, paragraphs, tables, or anything similar. Each PDF vendor is therefore left to its own design, and different vendors will extract text with some differences. As a result, the extracted order is not guaranteed to match the order that a typical user reading the document would follow.

The expected reading orders of a magazine, a newspaper article, and an academic article are all quite different, yet a PDF carries no semantic information that distinguishes them: an extractor sees only the placement and ordering of text in the document. Different users may also have different expectations of the correct reading order.

PDFDoc doc = new PDFDoc(filename);
Page page = doc.GetPage(1);

TextExtractor textExtractor = new TextExtractor();
textExtractor.Begin(page); // Read the page.

// Extract words one by one.
TextExtractorWord word;
for (TextExtractorLine line = textExtractor.GetFirstLine(); line.IsValid(); line = line.GetNextLine())
{
    for (word = line.GetFirstWord(); word.IsValid(); word = word.GetNextWord())
    {
        // Process the current word here, e.g. word.GetString();
    }
}
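
The reading-order caveat described above can be illustrated with a small sketch. It is not SDK code: the word records, their coordinates, and the `columnSplitX` parameter are all hypothetical, and the two functions are simply two plausible ordering strategies applied to the same words on a two-column page.

```javascript
// Hypothetical word records: text plus the top-left corner of its bounding box,
// with y increasing down the page. Illustrative values only, not SDK output.
const words = [
  { text: 'Left-1',  x: 50,  y: 100 },
  { text: 'Right-1', x: 320, y: 100 },
  { text: 'Left-2',  x: 50,  y: 120 },
  { text: 'Right-2', x: 320, y: 120 },
];

// Strategy 1: read straight across the page, row by row.
function rowMajorOrder(items) {
  return [...items].sort((a, b) => a.y - b.y || a.x - b.x).map(w => w.text);
}

// Strategy 2: split the page into two columns at columnSplitX and read
// each column top to bottom before moving to the next.
function columnOrder(items, columnSplitX) {
  const column = w => (w.x < columnSplitX ? 0 : 1);
  return [...items]
    .sort((a, b) => column(a) - column(b) || a.y - b.y || a.x - b.x)
    .map(w => w.text);
}

console.log(rowMajorOrder(words));    // [ 'Left-1', 'Right-1', 'Left-2', 'Right-2' ]
console.log(columnOrder(words, 200)); // [ 'Left-1', 'Left-2', 'Right-1', 'Right-2' ]
```

Both results are defensible. Which one is "correct" depends on whether the page is a table row or a two-column article, and the PDF itself does not say.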

Read a PDF File sample
Full sample code which illustrates the basic text extraction capabilities.

Extract text under an annotation

To extract the text that lies under an annotation in the document:

PDFDoc doc = new PDFDoc(filename);
Page page = doc.GetPage(1);
Annot annotation = page.GetAnnot(0);

TextExtractor txt = new TextExtractor();
txt.Begin(page); // Read the page.
string textData = txt.GetTextUnderAnnot(annotation);
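
How GetTextUnderAnnot selects text internally is not described here. Conceptually, though, "text under an annotation" can be thought of as the words whose bounding boxes overlap the annotation rectangle. The sketch below shows that idea only; the `intersects` and `textUnderRect` helpers, the record shapes, and the coordinates are hypothetical and are not part of the SDK.

```javascript
// Rectangles are { x1, y1, x2, y2 } with x1 <= x2 and y1 <= y2.
// Two rectangles overlap when they overlap on both axes.
function intersects(a, b) {
  return a.x1 < b.x2 && b.x1 < a.x2 && a.y1 < b.y2 && b.y1 < a.y2;
}

// Keep the words whose bounding box overlaps the rectangle, in input order.
function textUnderRect(words, rect) {
  return words
    .filter(w => intersects(w.bbox, rect))
    .map(w => w.text)
    .join(' ');
}

// Hypothetical extracted words and annotation rectangle.
const pageWords = [
  { text: 'Hello',  bbox: { x1: 10, y1: 700, x2: 40, y2: 712 } },
  { text: 'world',  bbox: { x1: 45, y1: 700, x2: 75, y2: 712 } },
  { text: 'footer', bbox: { x1: 10, y1: 20,  x2: 50, y2: 32 } },
];
const annotRect = { x1: 0, y1: 690, x2: 100, y2: 720 };

console.log(textUnderRect(pageWords, annotRect)); // Hello world
```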

About extracting text

When we use the ElementReader class to read elements from a PDF document, we are often faced with partial data. For example, suppose we are attempting to extract the sentence "This is a sample sentence." from a PDF document. We could end up with two elements: "T" and "his is a sample sentence.". This is possible because text objects in a PDF document are not always cleanly organized into words, sentences, or paragraphs. The ElementReader class returns Element objects exactly as they are defined in the PDF page content stream.

Text runs

An element of type e_text corresponds directly to a Tj operator in the PDF content stream. Each e_text element represents a text run: a sequence of text glyphs that share the same font and graphics attributes. For example, if each letter of a single word is set in a different font, then each letter is a separate text run. You may also encounter text runs that contain multiple words separated by spaces. The PDF format does not guarantee that text runs appear in reading order.
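
To make the fragmentation concrete, here is a minimal sketch of stitching adjacent text runs back together. It assumes runs on a single baseline described by a hypothetical `{ text, x, width }` shape, and a hypothetical `maxGap` threshold below which two runs are treated as continuous text. It is a simplified illustration, not how any particular extractor works.

```javascript
// Merge runs on one baseline when the horizontal gap between them is small.
function mergeRuns(runs, maxGap) {
  const sorted = [...runs].sort((a, b) => a.x - b.x);
  const merged = [];
  for (const run of sorted) {
    const last = merged[merged.length - 1];
    if (last && run.x - (last.x + last.width) <= maxGap) {
      // Continue the previous run: append the text and extend its width.
      last.text += run.text;
      last.width = run.x + run.width - last.x;
    } else {
      // Start a new run (copied so the input is not modified).
      merged.push({ ...run });
    }
  }
  return merged;
}

// The split sentence from the example above, as two hypothetical runs.
const runs = [
  { text: 'T', x: 72, width: 6 },
  { text: 'his is a sample sentence.', x: 78, width: 120 },
];

console.log(mergeRuns(runs, 1).map(r => r.text)); // [ 'This is a sample sentence.' ]
```

A real page also needs handling for multiple baselines, rotated text, and inferred spaces between words, which is part of what a dedicated text extractor takes care of.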

TextExtractor class

All of this means that using ElementReader to extract text from a PDF document is not guaranteed to return data in reading order. The most straightforward way to extract words and text from text runs is the pdftron.PDF.TextExtractor class, as shown in the TextExtract sample project.

TextExtractor will assemble words, lines, and paragraphs, remove duplicate strings, reconstruct text reading order, etc. Using TextExtractor you can also obtain bounding boxes for each word, line, or paragraph (along with style information such as font, color, etc). This information can be used to search for corresponding text elements using ElementReader.
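
As an illustration of one of these clean-up steps, the sketch below removes duplicate strings. Some PDFs draw the same text twice at nearly the same position, for example to imitate bold or a shadow. The `removeDuplicateWords` helper, the `{ text, x, y }` record shape, and the `tolerance` parameter are assumptions made for this example and say nothing about how TextExtractor is implemented.

```javascript
// Drop a word when an identical string was already kept at (almost) the same position.
function removeDuplicateWords(words, tolerance) {
  const kept = [];
  for (const w of words) {
    const isDuplicate = kept.some(k =>
      k.text === w.text &&
      Math.abs(k.x - w.x) <= tolerance &&
      Math.abs(k.y - w.y) <= tolerance);
    if (!isDuplicate) kept.push(w);
  }
  return kept;
}

// Hypothetical words as they might appear in a content stream.
const drawn = [
  { text: 'Title', x: 100,   y: 700 },
  { text: 'Title', x: 100.3, y: 700 }, // same word redrawn with a slight offset
  { text: 'Title', x: 100,   y: 400 }, // same word elsewhere on the page: kept
];

console.log(removeDuplicateWords(drawn, 0.5).length); // 2
```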


One of the more difficult document processing tasks is extracting tabular or article data from a PDF document. PDFTron.AI can extract tables, text, and reading order from existing PDF documents in the form of HTML output. It can also identify articles and mark them as annotations on the PDF.

The production version of PDFTron.AI will be a Docker installation that you deploy on-premises. You can trial the software using the online REST API endpoints described below.

Please visit PDFTron.AI to learn more about using artificial intelligence for document understanding, or read our related blog on PDF article extraction.

Table recognition

The REST API demo is a POST request to the table recognition endpoint. It returns HTML and XFDF in its response.

Please visit our online table recognition demo to try out the PDFTron.AI tool in the browser.

Below is the list of headers the API accepts. Each entry gives the header's purpose, followed by its accepted values.

Enable to force OCR on a document: true or false
Language used in the document for object content recognition: any language code, e.g. eng, fra
The page to start table recognition at: a page number, e.g. 1, 2
The page to end table recognition at: a page number, e.g. 1, 2
Set to true to output HTML: true or false
Set to true to output a DOCX file: true or false
Set to true to output an XLSX file: true or false
Set to true to return XFDF output: true or false
Set to true to return JSON output: true or false

Here's an example code snippet for uploading a PDF to the demo using the API endpoint:

const file = new File([fileData], 'mypdf.pdf');
const xhttp = new XMLHttpRequest();
xhttp.onreadystatechange = () => this.handleResp(xhttp, originalFile, 'local');
const endpoint = ''; // set to the table recognition endpoint URL
xhttp.open('POST', endpoint, true);
xhttp.setRequestHeader("Content-type", "application/json");
xhttp.setRequestHeader("File-Name", originalName || 'mypdf.pdf');
xhttp.send(file); // send the PDF as the request body

Article extraction

The REST API demo is a POST request to the article extraction endpoint. It returns a PDF in its response.

Please visit our online article extraction demo to try out the PDFTron.AI tool in the browser.

Here's an example code snippet for uploading a PDF to the demo using the API endpoint:

fetch(endpoint, { // endpoint: the article extraction endpoint URL
  method: 'POST',
  body: originalFile,
  headers: {
    "File-Name": originalName || 'mypdf.pdf'
  }
})
  .then(resp => resp.blob())
  .then(this.handleResp); // handle the binary PDF blob
