
Set OCR workflows: output, language & quality in PHP

To make a searchable PDF from an image, run OCR on it to add a layer of invisible text.

Requires the OCR module add-on
$doc = new PDFDoc();
$image_path = "path/to/image";

// Run OCR on the image without options
OCRModule::ImageToPDF($doc, $image_path, NULL);

Convert images to PDF with searchable/selectable text
Full code sample which shows how to use the PDFTron OCR module on scanned documents in multiple languages. The OCR module can make searchable PDFs and extract scanned text for further indexing.

Process a scanned document

To make a searchable PDF from an image-based PDF such as a scanned document, run OCR on it to add a layer of invisible text.

$doc = new PDFDoc($filename);

// Set English as the language of choice
$opts = new OCROptions();
$opts->AddLang("eng");

// Run OCR on the PDF with options
OCRModule::ProcessPDF($doc, $opts);

Add searchable/selectable text to an image-based PDF like a scanned document

Get metadata as JSON

To apply raw OCR output directly to the input document, call OCRModule::ImageToPDF (if the input file is an image) or OCRModule::ProcessPDF (for a PDF). However, some post-processing is often beneficial, e.g., comparing results against white/black lists. For this purpose, first extract the text and its metadata as either JSON or XML, process it, and then re-apply the processed results to the input document.

// Setup empty destination doc
$doc = new PDFDoc();
$image_path = "path/to/image";

// Extract OCR results as JSON
$json = OCRModule::GetOCRJsonFromImage($doc, $image_path, $opts);

// Post-processing step (whatever it might be) 

// Re-apply results. 
OCRModule::ApplyOCRJsonToPDF($doc, $json);
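
As an illustration of what the post-processing step might look like, here is a minimal sketch that drops every word found in a deny list. The function name filterOcrWords and the deny-list approach are assumptions made for this example, not part of the OCR module; the sketch also assumes the JSON follows the Page/Para/Line/Word nesting described in this guide and that it is available as a plain PHP string.

```php
<?php
// Hypothetical helper (not part of the PDFTron API): remove every word whose
// text appears in $denyList, case-insensitively, and return the updated JSON.
function filterOcrWords(string $json, array $denyList): string
{
    $data = json_decode($json, true, 512, JSON_THROW_ON_ERROR);
    if (!is_array($data) || !isset($data['Page'])) {
        throw new InvalidArgumentException('Unexpected OCR JSON structure');
    }

    // Flip the list so that membership checks are simple key lookups.
    $deny = array_flip(array_map('strtolower', $denyList));

    foreach ($data['Page'] as $p => $page) {
        foreach ($page['Para'] ?? [] as $q => $para) {
            foreach ($para['Line'] ?? [] as $l => $line) {
                if (!isset($line['Word'])) {
                    continue;
                }
                $kept = array_filter(
                    $line['Word'],
                    fn($w) => !isset($deny[strtolower($w['text'] ?? '')])
                );
                // array_values keeps "Word" a JSON array after filtering.
                $data['Page'][$p]['Para'][$q]['Line'][$l]['Word'] = array_values($kept);
            }
        }
    }

    return json_encode($data, JSON_THROW_ON_ERROR);
}
```

In the snippet above, a call such as `$json = filterOcrWords($json, ["lorem"]);` would sit between the extraction and the re-apply step. All other page, line, and word attributes pass through unchanged.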

Output Attributes

OCR output consists of nested arrays: pages contain paragraphs, paragraphs contain lines, and lines contain words. Each page carries additional metadata:

Attribute   Value        Description
num                      page number
dpi                      document resolution (needed to correctly scale the coordinates from points to pixels)
origin      TopLeft      coordinate system has origin at the top left corner (default)
            BottomLeft   coordinate system has origin at the bottom left corner (i.e., PDF page coordinate system)

Each word in the OCR output has the following attributes:

Attribute     Value   Description
x                     bounding box lower left corner x coordinate
y                     bounding box lower left corner y coordinate
length                length of bounding box
font-size             text's font size
text                  text output
orientation   L       270 degrees clockwise rotation
              R       90 degrees clockwise rotation
              D       180 degrees clockwise rotation
              U       0 degrees clockwise rotation

Finally, each line has an optional box property consisting of 4 values with the same interpretation as pdftron::PDF::Rect.
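
The dpi attribute exists so that coordinates can be scaled between points and pixels. A minimal sketch of that scaling follows, relying only on the fact that a typographic point is 1/72 of an inch. The function names are hypothetical, and whether a given value is stored in points or in pixels depends on the producer of the data, so treat this as an assumption to verify against your own output.

```php
<?php
// Hypothetical helpers (not part of the PDFTron API).
// There are 72 points per inch, so the scale factor is dpi / 72.
function pointsToPixels(float $points, float $dpi): float
{
    return $points * $dpi / 72.0;
}

function pixelsToPoints(float $pixels, float $dpi): float
{
    return $pixels * 72.0 / $dpi;
}
```

For example, at the dpi of 96 used in the samples in this guide, one inch (72 points) corresponds to 96 pixels.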

Sample JSON output

Below is a sample of the JSON that the OCR module produces.

{  
   "Page":[  
      {  
         "Para":[  
            {  
               "Line":[  
                  {  
                     "Word":[  
                        {  
                           "font-size": 27,
                           "length": 64,
                           "orientation": "U",
                           "text":"Hello",
                           "x": 273,
                           "y": 265
                        }
                     ],
                     "box":[  
                        273,
                        265,
                        64,
                        29
                     ]
                  }
               ]
            }
         ],
         "num": 1,
         "dpi": 96,
         "origin": "BottomLeft"
      }
   ]
}
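
To show how the nested structure can be consumed, for example to feed extracted text into an index, here is a minimal sketch that walks pages, paragraphs, and lines and joins the words of each line. The helper name ocrJsonToLines is hypothetical, and the sketch assumes the JSON is available as a plain PHP string shaped like the sample above.

```php
<?php
// Hypothetical helper (not part of the PDFTron API): return one entry per
// line of recognized text, together with the page number it came from.
function ocrJsonToLines(string $json): array
{
    $data = json_decode($json, true, 512, JSON_THROW_ON_ERROR);
    $lines = [];

    foreach ($data['Page'] ?? [] as $page) {
        foreach ($page['Para'] ?? [] as $para) {
            foreach ($para['Line'] ?? [] as $line) {
                // Pull the "text" attribute out of every word on this line.
                $texts = array_column($line['Word'] ?? [], 'text');
                if ($texts) {
                    $lines[] = [
                        'page' => $page['num'] ?? null,
                        'text' => implode(' ', $texts),
                    ];
                }
            }
        }
    }

    return $lines;
}
```

Joining words with a single space is a simplification; a layout-aware consumer would more likely use the x, y, and length attributes of each word.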

External OCR results

The API can also apply OCR XML/JSON generated by other OCR engines. The expected structures for input JSON and XML, respectively, are:

{  
   "Page":[  
    	{  
          "Word":[  
              {  
                  "font-size": 12,
                  "length": 43,
                  "text":"ABC",
                  "x": 321,
                  "y": 141
              }
         ],
         "num": 1,
         "dpi": 96,
         "origin": "TopLeft"
      	}
   ]
}

<Doc>
	<Page num="1" origin="TopLeft" dpi="96">
		<Word font-size="12" x="321" y="141" length="43">ABC</Word>
	</Page>
</Doc>

Note that the OCR structure is simplified: we expect an array of Page, with each page containing an array of Word. Each Word is described by its text content and 4 values in typographic points (i.e., font-size="12" x="321" y="141" length="43" in the example above), which are needed to construct the bounding box for placing the text on a page.
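
As a sketch of how results from another engine might be wrapped in this simplified structure, the hypothetical helper below builds the JSON from plain PHP arrays. The function name, the input shape (words grouped by 1-based page number), and the default dpi and origin values are assumptions made for this example.

```php
<?php
// Hypothetical helper (not part of the PDFTron API): build the simplified
// Page -> Word JSON from words produced by an external OCR engine.
// $pages maps a 1-based page number to a list of word arrays, each holding
// the keys 'text', 'x', 'y', 'length', and 'font-size'.
function buildExternalOcrJson(array $pages, int $dpi = 96, string $origin = 'TopLeft'): string
{
    $out = ['Page' => []];

    foreach ($pages as $num => $words) {
        $wordList = [];
        foreach ($words as $w) {
            $wordList[] = [
                'font-size' => $w['font-size'],
                'length'    => $w['length'],
                'text'      => $w['text'],
                'x'         => $w['x'],
                'y'         => $w['y'],
            ];
        }
        $out['Page'][] = [
            'Word'   => $wordList,
            'num'    => $num,
            'dpi'    => $dpi,
            'origin' => $origin,
        ];
    }

    return json_encode($out, JSON_THROW_ON_ERROR);
}

// Usage: reproduce the single-word example shown above.
$json = buildExternalOcrJson([
    1 => [['text' => 'ABC', 'x' => 321, 'y' => 141, 'length' => 43, 'font-size' => 12]],
]);
```

The resulting string could then be passed to OCRModule::ApplyOCRJsonToPDF in the same way as JSON produced by the OCR module itself.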

Language options

We use the pdftron.PDF.OCROptions convenience class to pass OCR parameters. Call pdftron.PDF.OCROptions.AddLang to pick a target language. If no language option is set, English is assumed.

The OCR Module binary currently contains 6 built-in languages:

  • English: eng
  • French: fra
  • Spanish: spa
  • Italian: ita
  • German: deu
  • Russian: rus

Additional trained language files can be placed in the search path (which can be registered using PDFNet::AddResourceSearchPath). Afterwards, they can be referred to by their file prefix.

Multiple languages can be specified, although it is not recommended to use more than 3 languages.

// Add French, Spanish and default English to target languages
$opts = new OCROptions();
$opts->AddLang("fra");
$opts->AddLang("spa");

Output quality options

When processing documents whose layouts are known in advance, we can enhance output quality either by specifying regions that OCR should ignore via OCROptions::AddIgnoreZonesForPage, or by listing the only regions to process via OCROptions::AddTextZonesForPage. Both zone options act as stencils: for ignore zones, the area inside the supplied rectangular regions is whited out before processing; for text zones, the area outside the supplied regions is whited out. The options store an array of RectCollection, where the index into the array corresponds to the relevant page number. OCROptions::AddIgnoreZonesForPage can also be used to skip a page entirely by setting the ignore zone equal to the page's media box.

// Optionally specify page zones for OCR extraction in a multipage document
$page_zones = new RectCollection();

$page_zones->AddRect(new Rect(900.0, 2384.0, 1236.0, 2480.0));
$page_zones->AddRect(new Rect(948.0, 1288.0, 1672.0, 1476.0));

// OCR will only process the two specified zones on the first page
$opts->AddTextZonesForPage($page_zones, 1);

// Reset zone container
$page_zones->Clear();

$page_zones->AddRect(new Rect(428.0, 1484.0, 1784.0, 2344.0));

// OCR will only process one specified zone on the second page
$opts->AddTextZonesForPage($page_zones, 2);

Setting Input Resolution

Users can manually set the input image resolution; tweaking it often leads to better results in practice.

// Manually override DPI
$opts->AddDPI(300);
