Set OCR workflows: output, language & quality in Node.js

To apply raw OCR output directly to the input document, call OCRModule::ImageToPDF if the input file is an image, or OCRModule::ProcessPDF if it is a PDF. In many cases, however, some post-processing is beneficial, e.g., comparing results against whitelists or blacklists. For this purpose we can first extract the text and its corresponding metadata as either JSON or XML, process it, and then re-apply the processed results to the input document.

// Setup empty destination doc
const doc = await PDFNet.PDFDoc.create();
const image_path = "path/to/image";

// Extract OCR results as JSON
// (opts is an OCROptions instance, see "Language options")
const json = await PDFNet.OCRModule.getOCRJsonFromImage(doc, image_path, opts);

// Post-processing step (whatever it might be) 

// Re-apply results. 
await PDFNet.OCRModule.applyOCRJsonToPDF(doc, json);
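
As an illustration of the post-processing step, here is a minimal sketch of a blacklist filter over the extracted JSON. It is not part of the SDK: the helper name filterOcrJson is hypothetical, and the nesting Page -> Para -> Line -> Word with a text field on each word is an assumption about the JSON layout that should be checked against real output.

```javascript
// Hypothetical helper: drop blacklisted words from OCR JSON before re-applying it.
// ASSUMPTION: the JSON nests Page -> Para -> Line -> Word and each word
// carries its recognized string in a `text` field.
function filterOcrJson(jsonString, blacklist) {
  // Compare case-insensitively so "SPAM" and "spam" are treated alike.
  const blocked = new Set(blacklist.map((w) => w.toLowerCase()));
  const result = JSON.parse(jsonString);

  for (const page of result.Page ?? []) {
    for (const para of page.Para ?? []) {
      for (const line of para.Line ?? []) {
        // Keep only the words that are not on the blacklist.
        line.Word = (line.Word ?? []).filter(
          (word) => !blocked.has(String(word.text).toLowerCase())
        );
      }
    }
  }

  // Return a string again, since the re-apply call takes JSON text.
  return JSON.stringify(result);
}
```

The filtered string would then be passed to applyOCRJsonToPDF in place of the raw json value. A whitelist variant only needs the filter condition inverted.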

It is worth clarifying the meaning of some metadata items. For each word in the OCR output, a pair of coordinates (in the PDF coordinate system) is returned, representing the position of the lower-left corner of the word's bounding box. In addition, each line has a box property consisting of 4 values that have the same interpretation as pdftron::PDF::Rect.
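
To make the box interpretation concrete, the following sketch converts a line's box into named corners plus a width and height. The helper boxToRect is hypothetical, and the value order x1, y1, x2, y2 is an assumption taken from the Rect comparison above; verify it against actual OCR output before relying on it.

```javascript
// Hypothetical helper: read a line's `box` as Rect-style corners.
// ASSUMPTION: the four values are ordered x1, y1, x2, y2.
function boxToRect(box) {
  if (!Array.isArray(box) || box.length !== 4) {
    throw new TypeError("box must be an array of 4 numbers");
  }
  const [xa, ya, xb, yb] = box;

  // Normalize so (x1, y1) is always the lower-left corner in PDF space,
  // where the y axis points upwards.
  const x1 = Math.min(xa, xb);
  const y1 = Math.min(ya, yb);
  const x2 = Math.max(xa, xb);
  const y2 = Math.max(ya, yb);

  return { x1, y1, x2, y2, width: x2 - x1, height: y2 - y1 };
}
```

Normalizing the corners means the result does not depend on which pair of opposite corners the box happens to list first.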

Language options

We use the pdftron.PDF.OCROptions convenience class to pass OCR parameters. Call pdftron.PDF.OCROptions.AddLang to pick a target language. If no language option is set, English is assumed.

The OCR Module binary currently contains 6 built-in languages:

  • English: eng
  • French: fra
  • Spanish: spa
  • Italian: ita
  • German: deu
  • Russian: rus

Additional trained language files can be placed in the resource search path (which can be registered using PDFNet::AddResourceSearchPath). Afterwards, they can be referred to by their file prefix.

Multiple languages can be specified, although it is not recommended to use more than 3 languages.
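
The rules above (default to English, six built-in codes, no more than 3 languages recommended) can be checked before any option is set. This is a sketch only: checkLanguages and its return shape are hypothetical and not part of the SDK.

```javascript
// Language codes listed above as built into the OCR module binary.
const BUILT_IN_LANGS = ["eng", "fra", "spa", "ita", "deu", "rus"];

// Hypothetical helper: tidy up a language list before passing it to addLang.
function checkLanguages(langs, maxRecommended = 3) {
  // Trim whitespace and drop duplicates while keeping the original order.
  const unique = [...new Set(langs.map((l) => l.trim()))].filter((l) => l !== "");

  // With no language given, English is assumed.
  const selected = unique.length > 0 ? unique : ["eng"];

  return {
    langs: selected,
    // Codes outside the built-in set need a trained language file
    // on the resource search path.
    needsExtraFiles: selected.filter((l) => !BUILT_IN_LANGS.includes(l)),
    // True when the list exceeds the recommended number of languages.
    overRecommended: selected.length > maxRecommended,
  };
}
```

Each entry of the returned langs array would then be passed to the AddLang call in turn.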

// Add French, Spanish and default English to target languages
const opts = new PDFNet.OCRModule.OCROptions();
for (const lang of ["fra", "spa", "eng"]) opts.addLang(lang);

Output quality options

When processing documents whose layout is known in advance, we can improve output quality either by specifying regions that OCR should ignore via OCROptions::AddIgnoreZonesForPage, or by listing the regions to focus on via OCROptions::AddTextZonesForPage.

// Optionally specify page zones for OCR extraction in a multipage document
let page_zones = [];

page_zones.push(new PDFNet.Rect(900, 2384, 1236, 2480));
page_zones.push(new PDFNet.Rect(948, 1288, 1672, 1476));

// OCR will only process the two specified zones on the first page
opts.addTextZonesForPage(page_zones, 1);

// Reset zone container
page_zones = [];

page_zones.push(new PDFNet.Rect(428, 1484, 1784, 2344));

// OCR will only process one specified zone on the second page
opts.addTextZonesForPage(page_zones, 2);
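
When many pages have known zones, the per-page bookkeeping above can be driven from one table. The sketch below assumes only the addTextZonesForPage(zones, pageNumber) call shown above; the helper addTextZones, the zonesByPage table, and the makeRect callback are hypothetical names introduced for illustration.

```javascript
// Hypothetical helper: register text zones for several pages in one call.
// `opts` must expose addTextZonesForPage(zones, pageNumber).
// `zonesByPage` maps a 1-based page number to a list of [x1, y1, x2, y2] arrays.
// `makeRect` builds a rectangle object from four numbers, for example
//   (a, b, c, d) => new PDFNet.Rect(a, b, c, d)
function addTextZones(opts, zonesByPage, makeRect) {
  for (const [page, zones] of Object.entries(zonesByPage)) {
    const pageNumber = Number(page);
    if (!Number.isInteger(pageNumber) || pageNumber < 1) {
      throw new RangeError(`invalid page number: ${page}`);
    }
    // Build a fresh zone list per page, so nothing has to be reset by hand.
    opts.addTextZonesForPage(zones.map((z) => makeRect(...z)), pageNumber);
  }
  return opts;
}
```

With the values from the snippet above, the table would hold two zones under key 1 and one zone under key 2. Passing the rectangle constructor as a callback keeps the helper free of any SDK dependency, which also makes it easy to exercise with a stub object.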

Setting Input Resolution

The input image resolution can be set manually; tuning this value often leads to better results in practice.

// Manually override DPI

Convert images to PDF with searchable/selectable text
Full code sample which shows how to use the PDFTron OCR module on scanned documents in multiple languages. The OCR module can make searchable PDFs and extract scanned text for further indexing.

Sample JSON output

Below is a sample of the JSON that the OCR module outputs.