Dec 8 2022

How to Extract Text from PDFs Using Python

by John Chow

This tutorial explains how to extract text from PDFs for machine learning using Python and the PDFTron SDK.

In this case, “extract text from a PDF” doesn’t mean just a paragraph or two from a single document — it means extracting text from possibly thousands of PDFs, using automation and batch processing.

In this tutorial, you will:

  • Use the PDFTron SDK to run bulk text extraction on your PDFs and automate the process.
  • Use Python scripts to specify what information to extract, where to extract it from, and where to send the extracted data.

The tutorial provides a code sample for a very basic text extraction using a Python script with the PDFTron SDK. We’ll also cover methods you can use to extract all text or even specific text in a PDF. Finally, this tutorial will touch on other data, such as metadata and images, which you can extract from a PDF using Python. 

For more information, check out the documentation.

Prerequisites 

To start using Python and the PDFTron SDK, you need the following: 

  • A Python environment (PDFTron supports both Python 2 and Python 3)
  • A free PDFTron account, so you can: 
    * Get a trial license key  
    * Download the PDFTron SDK 
    * Download samples and sample code  
  • The PDFTron text extraction demo (optional)

Step 1: Get Started 

Follow these steps to get started: 

  1. Go to the Download Center to create a PDFTron account or sign in to an existing one.
  2. Choose your operating system—Windows, Linux, or macOS.  
  3. Click Reveal to get a trial key. 
  4. In the Download section, select Python as the language. 
  5. Download the SDK package for Python 2 or Python 3.
  6. (Optional) In the Get Started section, download the Python 2 or 3 Guide to get the Precompiled Python & PDF library integration. The guide will help you run PDFTron samples and integrate a free trial of the PDFTron SDK into Python applications. Your free trial includes unlimited trial usage and support from solution engineers.
    You can then download and run the PDFTron SDK and samples.

You can also visit the Python page or the Python PDF Content Extraction Library.

Step 2: Extract Text from a PDF Using Python 

Now that your Python environment is set up and you’ve downloaded the PDFTron SDK, let’s extract some text. 

Run the following code sample for a very basic text extraction using a Python script with the PDFTron SDK: 

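# Import path for the pip-installed Python 3 package; adjust it for your setup.
from PDFNetPython3.PDFNetPython import PDFDoc, PDFNet, TextExtractor

PDFNet.Init("YOUR_LICENSE_KEY")  # Trial key from Step 1
doc = PDFDoc(filename)  # filename: path to the PDF you want to read
page = doc.GetPage(1)  # Pages are numbered from 1
txt = TextExtractor()
txt.Begin(page)  # Read the page
line = txt.GetFirstLine()
while line.IsValid():
    word = line.GetFirstWord()
    while word.IsValid():
        print(word.GetString())  # Text of the current word
        word = word.GetNextWord()
    line = line.GetNextLine()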
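
The loop above visits one word at a time. As a minimal sketch, the same traversal can gather the words into one string per line. `page_lines` is a hypothetical helper name, and it relies only on the methods already used above (`GetFirstLine`, `GetFirstWord`, `IsValid`, `GetString`, `GetNextWord`, `GetNextLine`). It assumes `txt` is a `TextExtractor` on which `Begin(page)` has already been called.

```python
def page_lines(txt):
    """Return the text of each line on the page as a list of strings.

    txt is assumed to be a TextExtractor that has already read a page
    with txt.Begin(page). page_lines is a hypothetical helper name.
    """
    lines = []
    line = txt.GetFirstLine()
    while line.IsValid():
        words = []
        word = line.GetFirstWord()
        while word.IsValid():
            words.append(word.GetString())
            word = word.GetNextWord()
        # Join the words of this line with single spaces
        lines.append(" ".join(words))
        line = line.GetNextLine()
    return lines
```

Returning a list instead of printing makes the result easier to pass on to a file, a database, or a later processing step.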

Where to Send Extracted Text 

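Next, decide what to do with the extracted text. You can save it to a text file or store it in a database. The following function walks the page content with an element reader (ElementReader in the SDK) and prints each text run along with its bounding box. Replace the print calls with writes to the destination you choose.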

def dumpAllText(reader):
    # Walk every element on the page. reader is an ElementReader on which
    # Begin(page) has already been called.
    element = reader.Next()
    while element is not None:
        element_type = element.GetType()
        if element_type == Element.e_text_begin:
            print("Text Block Begin")
        elif element_type == Element.e_text_end:
            print("Text Block End")
        elif element_type == Element.e_text:
            # Print the bounding box, then the text it contains
            bbox = element.GetBBox()
            print("BBox: " + str(bbox.GetX1()) + ", " + str(bbox.GetY1()) + ", "
                  + str(bbox.GetX2()) + ", " + str(bbox.GetY2()))
            print(element.GetTextString())
        elif element_type == Element.e_text_new_line:
            print("New Line")
        elif element_type == Element.e_form:
            # A form holds nested content, so process it recursively
            reader.FormBegin()
            dumpAllText(reader)
            reader.End()
        element = reader.Next()
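
As a rough sketch of the two destinations mentioned in this section, a text file and a database, the following uses only Python's standard library. The function names, the `extracted_text` table, and its columns are assumptions made for illustration. They are not part of the PDFTron SDK.

```python
import sqlite3
from pathlib import Path


def save_to_text_file(text, path):
    """Write extracted text to a UTF-8 text file."""
    Path(path).write_text(text, encoding="utf-8")


def save_to_database(conn, source, page_number, text):
    """Store one page of extracted text in a SQLite table.

    The table name and columns are hypothetical; adapt them to your schema.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS extracted_text ("
        "source TEXT, page_number INTEGER, content TEXT)"
    )
    conn.execute(
        "INSERT INTO extracted_text (source, page_number, content) "
        "VALUES (?, ?, ?)",
        (source, page_number, text),
    )
    conn.commit()
```

For example, calling `save_to_database(sqlite3.connect("extracted.db"), "invoice.pdf", 1, text)` once per page would add one row per page, which keeps the source file and page number alongside the text for later queries.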

Extract Text from a Specific Region of a PDF Page 

You can also use a utility method to extract all text content from a specific region, such as a rectangle on a PDF page. This is useful if you’re extracting text from multiple documents that share the same layout, like invoices or forms. The rectangle coordinates are expressed in the PDF user (page) coordinate system.

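def ReadTextFromRect(page, pos, reader):
    # pos is the target rectangle; reader is an ElementReader.
    # RectTextSearch is a helper that is not shown in this tutorial.
    reader.Begin(page)
    srch_str = RectTextSearch(reader, pos)
    reader.End()
    return srch_str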
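
The idea behind a helper like `RectTextSearch` can be sketched without the SDK: keep only the text whose bounding box overlaps the target rectangle. This is an illustration of the technique, not the SDK's own implementation. It works on plain tuples rather than SDK objects, and the `(x1, y1, x2, y2, text)` layout and both function names are assumptions made for this sketch. Coordinates are taken to be in PDF user space, where by default the origin is at the bottom left of the page.

```python
def rects_intersect(a, b):
    """Return True if rectangles a and b overlap.

    Each rectangle is (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
    Rectangles that only touch at an edge or corner do not count.
    """
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    return ax1 < bx2 and bx1 < ax2 and ay1 < by2 and by1 < ay2


def text_in_rect(items, region):
    """Join the text of every item whose bounding box overlaps region.

    items is an iterable of (x1, y1, x2, y2, text) tuples; this layout
    is an assumption for the sketch, not an SDK type.
    """
    hits = [text for (x1, y1, x2, y2, text) in items
            if rects_intersect((x1, y1, x2, y2), region)]
    return "\n".join(hits)
```

Because invoices and forms from the same template place a given field in the same spot, one rectangle per field can be reused across the whole batch.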

Full Text Extraction 

To see a code sample for full text extraction, find the TextExtract sample and click Python. You can also download more code samples.
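
Full text extraction is most useful when applied to a whole folder of documents, which is the batch scenario described at the start of this tutorial. The following is a hedged sketch of a batch driver that uses only the standard library. `extract_folder` is a hypothetical name, and `extract_text` is a hypothetical callable that you supply, for example a function built around the TextExtractor code from Step 2. It takes a PDF path and returns a string.

```python
from pathlib import Path


def extract_folder(folder, extract_text):
    """Run extract_text on every .pdf file under folder.

    extract_text is a hypothetical callable: it receives a pathlib.Path
    and returns the document's text. Returns (results, errors), where
    results maps each file path to its text and errors maps each failed
    file path to the error message.
    """
    results, errors = {}, {}
    # rglob also searches subfolders; sorting keeps the order predictable
    for pdf_path in sorted(Path(folder).rglob("*.pdf")):
        try:
            results[str(pdf_path)] = extract_text(pdf_path)
        except Exception as exc:  # record the failure and keep going
            errors[str(pdf_path)] = str(exc)
    return results, errors
```

Recording failures instead of raising means one damaged file does not stop a run over thousands of documents. You can review the `errors` dictionary afterwards.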

Extracting Other Data from a PDF 

In addition to simple text, you can also extract other data from a PDF using Python, including:

  • Digital signatures 
  • Intuitive page content based on a concept of graphical elements 
  • Structured Unicode text, including style and positioning information, from any PDF using the text recognition engine (pdftron.PDF.TextExtractor)
  • Metadata, embedded fonts, ICC color profiles, U3D streams, and embedded files

Conclusion  

In this tutorial, you extracted data for machine learning with Python and the PDFTron SDK. You then used the scripts to decide where to send extracted data. 

To learn more about text extraction, check out our WebViewer showcase to try out the PDF text extractor. The demo uses JavaScript, but the results are like what you’d see using Python.

There is more you can do with PDFs using Python, including:

  • Splitting or merging documents page by page 
  • Cropping pages 
  • Merging multiple pages into a single page 
  • Extracting text from PDF 
  • Rotating PDF pages 
  • Merging PDFs 
  • Splitting PDFs 
  • Adding watermark to PDF pages 
  • Encrypting and decrypting PDF files and more! 

If you have any questions or features you would like to see next, do not hesitate to reach out to us directly.

JOHN CHOW

Product Manager
