Set OCR workflows: output, language & quality in .NET Core

To make a searchable PDF from an image, run OCR on the image and add the recognized text as invisible text.

Requires the OCR module add-on
PDFDoc doc = new PDFDoc();
string image_path = "path/to/image";

// Run OCR on the image without options
OCRModule.ImageToPDF(doc, image_path, null);

Convert images to PDF with searchable/selectable text
Full code sample which shows how to use the PDFTron OCR module on scanned documents in multiple languages. The OCR module can make searchable PDFs and extract scanned text for further indexing.

Process a scanned document

To make a searchable PDF from an image-based PDF, such as a scanned document, run OCR on it to add invisible text.

PDFDoc doc = new PDFDoc(filename);

// Set English as the language of choice
OCROptions opts = new OCROptions();
opts.AddLang("eng");

// Run OCR on the PDF with options            
OCRModule.ProcessPDF(doc, opts);


Get metadata as JSON

To apply raw OCR output to the input document, call OCRModule::ImageToPDF (if the input file is an image) or OCRModule::ProcessPDF (for a PDF). However, some post-processing is often beneficial, e.g., comparing results against allow lists or deny lists. For this purpose, we can first extract the text and its metadata as either JSON or XML, post-process it, and then re-apply the processed results to the input document.

// Setup empty destination doc
PDFDoc doc = new PDFDoc();
string image_path = "path/to/image";

// Default options (English is assumed when no language is set)
OCROptions opts = new OCROptions();

// Extract OCR results as JSON
string json = OCRModule.GetOCRJsonFromImage(doc, image_path, opts);

// Post-processing step (whatever it might be)

// Re-apply results
OCRModule.ApplyOCRJsonToPDF(doc, json);

For each word in the OCR output, a pair of coordinates (in the PDF coordinate system) is returned, representing the position of the lower-left corner of its bounding box. In addition, each line has a box property consisting of four values with the same interpretation as pdftron::PDF::Rect.
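As one possible post-processing step, the sketch below drops every word that appears on a deny list before the JSON is re-applied. It uses only System.Text.Json from the .NET base library. The class name OcrJsonFilter and the method RemoveDeniedWords are hypothetical and not part of the PDFTron API, and the code assumes the Page > Para > Line > Word layout of the sample output in this guide. Whether the OCR module needs other fields (such as a line's box) updated after a word is removed is not covered here.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json.Nodes;

// Hypothetical helper, not part of the PDFTron SDK.
public static class OcrJsonFilter
{
    // Removes every word whose "text" value is contained in denyList and
    // returns the filtered JSON. Matching uses the comparer of the set passed in.
    // Assumes the Page > Para > Line > Word layout of the sample output.
    public static string RemoveDeniedWords(string json, ISet<string> denyList)
    {
        JsonNode root = JsonNode.Parse(json);
        foreach (JsonNode page in root["Page"].AsArray())
        foreach (JsonNode para in page["Para"].AsArray())
        foreach (JsonNode line in para["Line"].AsArray())
        {
            JsonArray words = line["Word"].AsArray();
            // Iterate backwards so removals do not shift unvisited indices.
            for (int i = words.Count - 1; i >= 0; i--)
            {
                string text = words[i]["text"].GetValue<string>();
                if (denyList.Contains(text))
                {
                    words.RemoveAt(i);
                }
            }
        }
        return root.ToJsonString();
    }
}
```

In the workflow above, the string returned by GetOCRJsonFromImage would be passed through RemoveDeniedWords, and the result handed to ApplyOCRJsonToPDF. Passing a HashSet created with StringComparer.OrdinalIgnoreCase makes the match case-insensitive.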

Sample JSON output

Below is a sample of the JSON that the OCR module produces.

{  
   "Page":[  
      {  
         "Para":[  
            {  
               "Line":[  
                  {  
                     "Word":[  
                        {  
                           "font-size":"27",
                           "length":"64",
                           "orientation":"2",
                           "text":"Hello",
                           "x":"273",
                           "y":"265"
                        }
                     ],
                     "box":[  
                        "273",
                        "265",
                        "64",
                        "29"
                     ]
                  }
               ]
            }
         ],
         "num":1
      }
   ]
}
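To show how this structure can be consumed, here is a minimal sketch that walks the JSON and collects each word with its lower-left corner. It relies only on System.Text.Json. OcrJsonReader and ReadWords are hypothetical names, and the code assumes every page, paragraph, and line carries the keys shown in the sample above. Because the sample stores numeric values as strings ("273"), the helper accepts both JSON strings and JSON numbers.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.Json;

// Hypothetical helper, not part of the PDFTron SDK.
public static class OcrJsonReader
{
    // Returns every word with the lower-left corner of its bounding box.
    public static List<(string Text, double X, double Y)> ReadWords(string json)
    {
        var result = new List<(string Text, double X, double Y)>();
        using JsonDocument doc = JsonDocument.Parse(json);
        foreach (JsonElement page in doc.RootElement.GetProperty("Page").EnumerateArray())
        foreach (JsonElement para in page.GetProperty("Para").EnumerateArray())
        foreach (JsonElement line in para.GetProperty("Line").EnumerateArray())
        foreach (JsonElement word in line.GetProperty("Word").EnumerateArray())
        {
            result.Add((
                word.GetProperty("text").GetString(),
                ParseNumber(word.GetProperty("x")),
                ParseNumber(word.GetProperty("y"))));
        }
        return result;
    }

    // The sample stores numbers as JSON strings; accept both forms.
    private static double ParseNumber(JsonElement e) =>
        e.ValueKind == JsonValueKind.Number
            ? e.GetDouble()
            : double.Parse(e.GetString(), CultureInfo.InvariantCulture);
}
```

Called on the sample above, ReadWords returns a single entry with the text "Hello" at x = 273, y = 265.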

Language options

We use the pdftron.PDF.OCROptions convenience class to pass OCR parameters. Call pdftron.PDF.OCROptions.AddLang to pick a target language. If no language option is set, English is assumed.

The OCR module binary currently includes 6 built-in languages:

  • English: eng
  • French: fra
  • Spanish: spa
  • Italian: ita
  • German: deu
  • Russian: rus

Additional trained language files can be placed in the resource search path (which can be registered using PDFNet::AddResourceSearchPath). Afterwards, they can be referred to by their file prefix.

Multiple languages can be specified, although it is not recommended to use more than 3 languages.

// Add French, Spanish and default English to target languages
OCROptions opts = new OCROptions();
opts.AddLang("fra");
opts.AddLang("spa");
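Before passing codes to AddLang, it can help to check a requested language list against the guidance above. The following is a small sketch of such a check. The class OcrLanguages and its members are hypothetical and not part of the PDFTron API. The built-in set simply mirrors the six codes listed in this section, and the limit of three is the recommendation stated above. The count covers only the codes passed in, since this guide does not say whether an implicit default language counts toward that recommendation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of the PDFTron SDK.
public static class OcrLanguages
{
    // Codes listed in this guide as built into the OCR module binary.
    private static readonly HashSet<string> BuiltIn =
        new HashSet<string> { "eng", "fra", "spa", "ita", "deu", "rus" };

    // Recommended upper bound on the number of languages, per this guide.
    public const int RecommendedMax = 3;

    // Returns the distinct requested codes that are not built in and would
    // therefore need a trained language file on the resource search path.
    public static List<string> NeedsLanguageFile(IEnumerable<string> codes) =>
        codes.Distinct().Where(c => !BuiltIn.Contains(c)).ToList();

    // True when the number of distinct codes exceeds the recommended maximum.
    public static bool ExceedsRecommendation(IEnumerable<string> codes) =>
        codes.Distinct().Count() > RecommendedMax;
}
```

A caller could log a warning when ExceedsRecommendation returns true, and verify that a trained file exists for each code returned by NeedsLanguageFile before running OCR.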

Output quality options

When processing documents whose layout is known in advance, we can improve output quality either by specifying regions for OCR to ignore via OCROptions::AddIgnoreZonesForPage, or by listing the regions to focus on via OCROptions::AddTextZonesForPage.

// Optionally specify page zones for OCR extraction in a multipage document
RectCollection page_zones = new RectCollection();

page_zones.AddRect(900, 2384, 1236, 2480);
page_zones.AddRect(948, 1288, 1672, 1476);

// OCR will only process the two specified zones on the first page
opts.AddTextZonesForPage(page_zones, 1);

// Reset zone container
page_zones.Clear();

page_zones.AddRect(428, 1484, 1784, 2344);

// OCR will only process one specified zone on the second page
opts.AddTextZonesForPage(page_zones, 2);

Setting input resolution

The input image resolution can be set manually. Tuning this value often leads to better results in practice.

// Manually override DPI
opts.AddDPI(300);
