Nov 24 2022
by Valerie Yates
So, a PDF is a PDF, right? On the surface, that might seem so, but PDF content, such as graphics and text, can belong to different data types. Depending on how your PDF was created, you will have a vector- or raster-based PDF on your hands. And that can make all the difference in your ability to interact with, measure, and search the content.
Telling vector from raster PDFs may not be possible with a simple visual inspection, as the differences are not immediately evident. So, in this post, we share a few tips & tactics to help you determine what's inside your PDFs.
We’ll look at the most important content in PDFs from an interactive standpoint: graphics and text.
Graphics in a PDF document are embedded as one of two types of data: raster or vector.
Text is an interesting case: it can be an exception to the saying, "If it quacks like a duck..." Text characters in a PDF might look like text but not be "true" text elements. If text is defined by raster dots or vector line segments only, the text isn’t machine-readable text. This type of "text" is essentially an image, and you can’t search, highlight, or edit characters in an image. For text to be true searchable text, it needs an invisible layer of
Ask yourself whether you’ve ever tried to do something with a PDF, unsuccessfully. Perhaps you couldn’t do any of the following:
The reason is probably that you had a raster PDF on your hands.
This is something of a simplification, but most PDFs are vector files, although you can also save PDFs as raster files. Each graphics type is built for specific purposes, so choosing between vector- and raster-based PDFs comes down to knowing which is best for the job you want to do.
You can easily figure out whether the content of your PDF is vector or raster using a few tricks or tools:
Your best bet for most situations is to zoom in to a detailed part of a PDF, to greater than 800% magnification. Vector PDF file content will look clear and smooth at any magnification while raster PDF content will become blurrier and more pixelated the more it’s zoomed.
At 6400% magnification, vector content remains clear and sharp.
At 6400% magnification, raster content is jagged and blurry.
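To see why the zoom test works, here is a toy sketch in plain Python. It is not how a real PDF viewer renders pages; the grid, the diagonal line, and the function names are all illustrative assumptions. A raster can only repeat the pixels it already has, while vector content is re-drawn from its geometry at whatever resolution you ask for.

```python
def rasterize_line(size):
    """Render the vector line y = x onto a size x size grid (1 = ink)."""
    return [[1 if x == y else 0 for x in range(size)] for y in range(size)]


def upscale_nearest(pixels, factor):
    """Zoom a raster by repeating every pixel factor x factor times."""
    out = []
    for row in pixels:
        wide = [p for p in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def show(pixels):
    """Turn a grid into ASCII art so the difference is visible in a terminal."""
    return "\n".join("".join("#" if p else "." for p in row) for row in pixels)


small = rasterize_line(4)                  # the "scanned" 4 x 4 image
zoomed_raster = upscale_nearest(small, 4)  # 16 x 16, built from old pixels
zoomed_vector = rasterize_line(16)         # 16 x 16, re-drawn from geometry

print(show(zoomed_raster))
print()
print(show(zoomed_vector))
```

The zoomed raster shows a staircase of 4 x 4 blocks, while the re-rendered vector line stays one pixel wide. Real viewers usually smooth the blocks with interpolation, which is why zoomed raster content tends to look blurry as well as jagged.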
If you can’t display text using a file menu option (such as Show All Text), then the text in the PDF is a raster image or vector lines without character encodings. Therefore, it is not “real” machine-readable text. Raster “text” is an image in which you cannot search text, select text, or edit text. The same goes for vector lines used to create the appearance of characters but without encodings.
Text above is not searchable, because content is a raster image.
PDFs created from character-based programs (e.g., Word and Excel) almost always contain real text.
Text above can be highlighted, because the content is vector with characters encoded.
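If you would rather check for real text with a script, the idea can be sketched with the Python standard library alone. This is a rough heuristic under stated assumptions, not a PDF parser: it assumes streams are either uncompressed or Flate-compressed, it scans raw bytes with regular expressions, and `has_text_operators` and `_fake_pdf` are hypothetical names made up for this example. It looks for the PDF text-showing operators (`Tj`, `TJ`) inside a `BT` text object and skips image objects.

```python
import re
import zlib

OBJ_RE = re.compile(rb"\d+\s+\d+\s+obj(.*?)endobj", re.DOTALL)
STREAM_RE = re.compile(rb"(.*?)stream\r?\n(.*)endstream", re.DOTALL)
TEXT_RE = re.compile(rb"\bBT\b.*?\b(?:Tj|TJ)\b", re.DOTALL)


def has_text_operators(pdf_bytes):
    """Heuristic: True if any non-image stream shows text with Tj or TJ."""
    for obj in OBJ_RE.finditer(pdf_bytes):
        match = STREAM_RE.match(obj.group(1))
        if match is None:
            continue  # this object carries no stream
        header, raw = match.groups()
        if re.search(rb"/Subtype\s*/Image\b", header):
            continue  # raster image data, not page content
        try:
            # Assumption: compressed streams use Flate (zlib) only.
            decoded = zlib.decompressobj().decompress(raw)
        except zlib.error:
            decoded = raw  # treat as an uncompressed content stream
        if TEXT_RE.search(decoded):
            return True
    return False


def _fake_pdf(header, payload):
    """Hypothetical helper: one stream object, not a complete valid PDF."""
    return (b"%PDF-1.4\n1 0 obj\n<< " + header + b" >>\nstream\n"
            + payload + b"\nendstream\nendobj\n")


text_page = _fake_pdf(b"/Filter /FlateDecode",
                      zlib.compress(b"BT /F1 12 Tf 72 720 Td (Hello) Tj ET"))
vector_page = _fake_pdf(b"", b"0 0 m 100 100 l S")  # only line drawing
scanned_page = _fake_pdf(b"/Subtype /Image /Width 1 /Height 1", b"\x00BT (x) Tj")

print(has_text_operators(text_page))
print(has_text_operators(vector_page))
print(has_text_operators(scanned_page))
```

Treat a positive result as a hint rather than proof: a page can use text operators with a font that lacks usable character encodings, in which case the text still won't search or copy correctly. For anything beyond a quick check, a proper PDF library is the safer choice.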
You can run optical character recognition (
Simple methods like zooming in are great if your PDF contains a huge image, such as a drawing of a building, and you just want to quickly verify its contents.
But what if you want to look at different elements on an individual level or across many pages?
Some tools can help shed light on what’s image content quickly, so you can work efficiently.
Adobe Acrobat Pro lets you audit space usage to determine whether a PDF contains a lot of image content.
You can also use a tool like
Using a low-level PDF editor provides insight into what’s inside a PDF.
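As a rough illustration of what such an inspection looks at, the sketch below estimates how much of a file is taken up by image objects. It is a simplified assumption-laden example, not the method any particular tool uses: it scans raw bytes for objects whose dictionary declares `/Subtype /Image`, ignores inline images, and `image_share` is a hypothetical name chosen for this post.

```python
import re

OBJ_RE = re.compile(rb"\d+\s+\d+\s+obj(.*?)endobj", re.DOTALL)
IMAGE_RE = re.compile(rb"/Subtype\s*/Image\b")


def image_share(pdf_bytes):
    """Return (image_bytes, total_bytes, fraction) for a PDF held in memory.

    Sizes are approximate: the count includes the line break after the
    'stream' keyword and before 'endstream'.
    """
    image_bytes = 0
    for obj in OBJ_RE.finditer(pdf_bytes):
        header, sep, rest = obj.group(1).partition(b"stream")
        if not sep or not IMAGE_RE.search(header):
            continue  # no stream, or the stream is not an image
        end = rest.rfind(b"endstream")
        image_bytes += len(rest if end == -1 else rest[:end])
    total = len(pdf_bytes)
    fraction = image_bytes / total if total else 0.0
    return image_bytes, total, fraction


# Made-up sample: one 400-byte image object plus a small catalog object.
sample = (b"%PDF-1.4\n"
          b"1 0 obj\n<< /Type /XObject /Subtype /Image /Width 20 /Height 20 "
          b"/Length 400 >>\nstream\n" + b"\x00" * 400 + b"\nendstream\nendobj\n"
          b"2 0 obj\n<< /Type /Catalog >>\nendobj\n")

img, total, frac = image_share(sample)
print(f"{img} of {total} bytes ({frac:.0%}) are image data")
```

To try it on a real file, read it with `open(path, "rb").read()` and pass the bytes in. A high fraction suggests a scanned or otherwise raster-heavy PDF, though a vector drawing with one large embedded photo would score high too.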
Downsampling is the process of reducing the resolution of an image, usually done during compression to shrink the PDF file size. Downsampling visibly lowers the quality of raster content, which makes its presence obvious, while vector content is unaffected.
You can downsample a PDF programmatically. Use the PDFTron SDK and
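To make the effect of downsampling concrete, here is a minimal pure-Python sketch that is independent of any SDK. It assumes a grayscale image stored as a list of rows and uses a simple box filter (averaging each block of pixels); production tools typically offer several resampling filters, and the `downsample` function here is purely illustrative.

```python
def downsample(pixels, factor):
    """Average each factor x factor block of a grayscale image into one pixel.

    Rows and columns that do not fill a complete block are dropped.
    """
    height, width = len(pixels), len(pixels[0])
    out = []
    for top in range(0, height - factor + 1, factor):
        row = []
        for left in range(0, width - factor + 1, factor):
            block = [pixels[y][x]
                     for y in range(top, top + factor)
                     for x in range(left, left + factor)]
            row.append(sum(block) // len(block))
        out.append(row)
    return out


# A 4 x 4 checkerboard: the finest detail a raster of this size can hold.
checker = [[0, 255, 0, 255],
           [255, 0, 255, 0],
           [0, 255, 0, 255],
           [255, 0, 255, 0]]

small = downsample(checker, 2)
print(small)  # [[127, 127], [127, 127]]
```

The checkerboard collapses into flat gray: a quarter of the pixels, and none of the original detail. That loss is exactly what makes raster content stand out after downsampling, since vector graphics and real text are described by geometry and character codes rather than pixels.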
We hope that this post has shed light on the types of data contained in PDFs and how to figure out what your PDF contains – raster, vector, or text.
If you’d like to deepen your understanding of PDF rendering and PDF viewer libraries, check out our articles:
This guide shows you your options to build a Flutter PDF viewer and your potential best path forward towards a professional solution.
This blog discusses the three options for embedding PDF files or a PDF viewer in a website that are available to you, starting with the simplest and ending with the PDF viewing bells and whistles.
A tutorial on how to extract text from a PDF using Python and the PDFTron SDK for machine learning.