Dec 8 2022
by John Chow
This tutorial explains how to extract text from a PDF using Python and the PDFTron SDK, for use in machine learning.
In this case, “extract text from a PDF” doesn’t mean just a paragraph or two from a single document — it means extracting text from possibly thousands of PDFs, using automation and batch processing.
In this tutorial, you will:
The tutorial provides a code sample for a very basic text extraction using a Python script with the PDFTron SDK. We’ll also cover methods you can use to extract all text or even specific text in a PDF. Finally, this tutorial will touch on other data, such as metadata and images, which you can extract from a PDF using Python.
For more information, check out the
To start using Python and the PDFTron SDK, you need the following:
Follow these steps to get started:
You can also visit the Python
Now that your Python environment is set up and you’ve downloaded the PDFTron SDK, let’s extract some text.
Run the following code sample for a very basic text extraction using a Python script with the PDFTron SDK:
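# Import path assumes the PDFNetPython3 pip package; adjust it to match your install.
# Initialize the SDK with your license key before running this code.
from PDFNetPython3.PDFNetPython import PDFDoc, TextExtractor

doc = PDFDoc(filename)  # filename: path to the input PDF
page = doc.GetPage(1)  # page numbers start at 1
txt = TextExtractor()
txt.Begin(page)  # Read the page
line = txt.GetFirstLine()
while line.IsValid():
    word = line.GetFirstWord()
    while word.IsValid():
        text = word.GetString()  # the text of the current word
        word = word.GetNextWord()
    line = line.GetNextLine()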
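To apply the same extraction to a whole folder of PDFs, as described in the introduction, you can wrap the per-document code in a function and call it in a loop. The following is a minimal sketch using only the Python standard library. The names `extract_folder` and `extract_text` are hypothetical and not part of the PDFTron SDK; `extract_text` stands in for whatever function you write around the sample above.

```python
from pathlib import Path
from typing import Callable, Dict


def extract_folder(folder: str, extract_text: Callable[[str], str]) -> Dict[str, str]:
    """Run extract_text on every .pdf file under folder and collect the results.

    extract_text is a placeholder for your own function that takes a file
    path and returns the text of that document.
    """
    results: Dict[str, str] = {}
    for pdf_path in sorted(Path(folder).rglob("*.pdf")):
        try:
            results[str(pdf_path)] = extract_text(str(pdf_path))
        except Exception as exc:
            # One unreadable file should not stop the rest of the batch.
            print(f"Skipping {pdf_path}: {exc}")
    return results
```

Passing the extractor in as an argument keeps the loop independent of the SDK, so you can test the batch logic with a stub before running it on thousands of files.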
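Next, decide what to do with the extracted text. You can save it to a text file or to a database. The following function walks the elements of a page and prints each text run along with its bounding box; replace the print calls with your own output code to send the text elsewhere.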
# Import path assumes the PDFNetPython3 pip package; adjust it to match your install.
from PDFNetPython3.PDFNetPython import Element

def dumpAllText(reader):
    # reader is an ElementReader that has already begun reading a page
    element = reader.Next()
    while element is not None:
        elem_type = element.GetType()
        if elem_type == Element.e_text_begin:
            print("Text Block Begin")
        elif elem_type == Element.e_text_end:
            print("Text Block End")
        elif elem_type == Element.e_text:
            bbox = element.GetBBox()
            print("BBox: " + str(bbox.GetX1()) + ", " + str(bbox.GetY1()) + ", "
                  + str(bbox.GetX2()) + ", " + str(bbox.GetY2()))
            textString = element.GetTextString()
            print(textString)
        elif elem_type == Element.e_text_new_line:
            print("New Line")
        elif elem_type == Element.e_form:
            # Recurse into form XObjects, which can contain their own text
            reader.FormBegin()
            dumpAllText(reader)
            reader.End()
        element = reader.Next()
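If you would rather store the text in a database than print it, one option is SQLite, which ships with Python. The sketch below is an illustration only: the function name `save_text`, the table name `extracted_text`, and its columns are assumptions made for this example, not something the PDFTron SDK provides.

```python
import sqlite3


def save_text(db_path: str, source: str, page_number: int, text: str) -> None:
    """Store one page of extracted text in a SQLite table, creating it if needed."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS extracted_text ("
                "source TEXT, page INTEGER, body TEXT, "
                "PRIMARY KEY (source, page))"
            )
            # INSERT OR REPLACE means rerunning a batch overwrites old rows
            # instead of duplicating them.
            conn.execute(
                "INSERT OR REPLACE INTO extracted_text VALUES (?, ?, ?)",
                (source, page_number, text),
            )
    finally:
        conn.close()
```

You would call `save_text` once per page, for example in place of `print(textString)` after collecting the text runs for that page.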
You can even use a utility method to extract all text content from a specific region, like a rectangle on a PDF page. This is useful if you’re extracting text from multiple documents that share the same layout, like invoices or forms. The rectangle coordinates are expressed in the PDF user (page) coordinate system.
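def ReadTextFromRect(page, pos, reader):
    # pos is the rectangle to search; reader is an ElementReader
    reader.Begin(page)
    # RectTextSearch is a helper that is not defined in this snippet
    srch_str = RectTextSearch(reader, pos)
    reader.End()
    return srch_str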
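To make the region idea concrete, here is a small, SDK-independent sketch of the overlap test behind this kind of search. It is not how the SDK or `RectTextSearch` is implemented; `Box` and `text_in_region` are hypothetical names for this example. It assumes the default PDF user space, where the origin is at the bottom-left of the page and one unit is 1/72 of an inch.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle: (x1, y1) is bottom-left, (x2, y2) is top-right."""
    x1: float
    y1: float
    x2: float
    y2: float

    def intersects(self, other: "Box") -> bool:
        # Two rectangles overlap only if they overlap on both axes.
        return (self.x1 < other.x2 and other.x1 < self.x2
                and self.y1 < other.y2 and other.y1 < self.y2)


def text_in_region(runs: Iterable[Tuple[Box, str]], region: Box) -> str:
    """Join the text of every run whose bounding box overlaps region."""
    return "\n".join(text for box, text in runs if box.intersects(region))


# Example: keep only the text in the top part of a US Letter page (612 x 792).
runs = [
    (Box(72, 700, 200, 712), "Header text"),
    (Box(72, 100, 200, 112), "Footer text"),
]
print(text_in_region(runs, Box(0, 650, 612, 792)))  # prints "Header text"
```

Because documents with the same layout place a given field at the same coordinates, you can define the region once and reuse it across the whole batch.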
To see a code sample for full text extraction, go to
In addition to simple text, you can also extract data from a PDF using Python, including:
In this tutorial, you extracted data for machine learning with Python and the PDFTron SDK. You then used the scripts to decide where to send extracted data.
To learn more about text extraction, visit
You can also visit the
If you have any questions or features you would like to see next, do not hesitate to
JOHN CHOW
Product Manager