
Apr 27 2022

Our Comparison Tests for PDF Table Detection and Extraction Are Done – and the Results Are Compelling!

by Aman Kumar

Hero image for table extraction comparison blog

PDFTron’s AI platform offers superior PDF table detection and extraction compared to products developed by Adobe, Amazon, and Google when tested on industry-standard, benchmark datasets.

We recently updated PDFTron’s artificial intelligence platform, which uses deep learning methods to help companies accurately extract complex tables in PDFs into multiple formats.

Naturally, we care a lot about the tool’s performance – most of all, because in the world of content extraction, accuracy trumps all. We wanted to create a tool that would be a table extraction game changer in the data accuracy department. Yielding nearly perfect accuracy on key parameters, this tool would make intensive manual review a thing of the past.

Therefore, to assess how far we’ve progressed toward our vision, we needed to know how our updated platform fares against titans in the document recognition space, such as Adobe’s AI systems, Amazon Textract, and Google Table Parser.

We’ve packaged our experiment highlights in the following section, but read the full blog to get the whole story. If you’re hungry for details, including all the output per evaluated system, don’t hesitate to reach out to us. We’d be more than happy to share our results and answer any questions about our methodologies.

Highlights

In a nutshell:

→ PDFTron.ai offers the highest accuracy when it comes to correctly detecting tables in PDFs. In our experiment, we came out on top on overall accuracy at nearly 98%.

→ PDFTron.ai is significantly more accurate in recognizing the contents of tables – the trickiest aspect of table recognition. We got an overall score of nearly 94%, outscoring our competitors by a margin of nearly 3%.

You can experience PDFTron’s AI-powered table recognition and extraction for yourself by visiting PDFTron.ai.

Does Accuracy Really Matter?

Does a 3% difference in overall accuracy between two systems matter that much in the end? We think it does – data is useful only if it tells an accurate and complete story. 

When data is dirty, teams spend a lot of time manually correcting outputs. Errors that aren’t caught and corrected flow into downstream data analysis, where they compound. And if you’re dealing with thousands of documents and tables each day, what seems to be a small margin of error upfront translates into a huge burden of manual correction further down the workstream.

A 3% difference equals a huge amount of time saved for teams over the course of a year – time they can then spend on more meaningful work, like data analysis instead of data entry.

And now let’s explore PDF tables and the experiment PDFTron.ai ran last fall.

PDF Tables Are Everywhere!

Tables are a great way to represent information in a structured form. Countless online PDFs contain valuable tabular data, and knowledge workers across many fields need to unlock this data for processing and analysis.

Accurate PDF table detection, table structure recognition, and table data extraction are essential in many fields:

  • Data analysis (financial, academic, medical, manufacturing, and other businesses)
  • Traditional information retrieval and search
  • Visualization
  • Human-document interactions

How Do We Unlock Data from Tables?

The two pivotal problems in the domain of table understanding and extraction are:

  1. Detecting tables and then… 
  2. Recognizing their internal structures 

Modern OCR- and algorithm-based systems handle detecting table boundaries (#1) reasonably well. Where they fall short is in recognizing internal table structure (#2). They stumble when table layouts are heterogeneous, tables sit side by side or span pages, rulings are absent, and cells are unruly: misaligned content, empty cells, multi-line content, and spanning cells. One cell can span several cells, vertically or horizontally, and spanning cells can cross multiple pages.
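To make these structural challenges concrete, here is a minimal sketch (our own illustration, not any vendor’s data model) of representing recognized table cells with row and column spans, where a spanning cell covers several slots in the underlying grid:

```python
from dataclasses import dataclass

@dataclass
class Cell:
    row: int           # top-left grid position of the cell
    col: int
    row_span: int = 1  # a spanning cell covers multiple grid rows/columns
    col_span: int = 1
    text: str = ""

def covered(cell: Cell) -> set:
    """Return the set of (row, col) grid slots this cell occupies."""
    return {(cell.row + r, cell.col + c)
            for r in range(cell.row_span)
            for c in range(cell.col_span)}

# A header cell spanning two columns, e.g. "Revenue" over "2020 | 2021"
header = Cell(row=0, col=1, col_span=2, text="Revenue")
```

Recovering these spans correctly is exactly what systems struggle with when cell boundaries are misaligned or rulings are missing.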

Traditional approaches just don’t get the contextual meaning of the contents, which is vital to accurate and meaningful extraction. So, we’ve been working on a better way.

Our Experiment

We ran our experiment from August to November 2021. It consisted of two separate parts:

Test 1 - A Close Up View on Two Handpicked Sample Tables
We analyzed how PDFTron.ai, Adobe, Amazon, and Google did at correctly recognizing the following:

  • tables
  • table content (rows and columns, cell boundaries, headers for rows and columns, spanning cells)

We picked two sample tables to capture and demonstrate differences in system performance on common extraction challenges, such as merged cells and spanning cells.

Test 2 - Overall Accuracy Scores on Standard Datasets
This test was designed to capture the overall accuracy of the four tools on a large body of tables. We ran the systems against three standard public datasets and an in-house, private dataset, yielding an overall accuracy score, expressed as a percentage, for each system.
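As a rough sketch of how such a score can be produced (the exact aggregation and weighting used in the experiment may differ), per-table accuracy scores can be averaged into a single percentage:

```python
def overall_accuracy(per_table_scores):
    """Average per-table accuracy scores (each in [0, 1]) into a percentage."""
    if not per_table_scores:
        return 0.0
    return 100.0 * sum(per_table_scores) / len(per_table_scores)

# e.g. three tables scored 0.98, 0.92, and 0.95 by the evaluation metric
score = overall_accuracy([0.98, 0.92, 0.95])
```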

Again, the scores pertained to accurate recognition of table boundaries and table content. 


Test 1: Detection Results

The two sample tables we used represent many of the challenges we find in table and table content recognition, such as spanning cells, misaligned cell boundaries, boundaries that overlap other objects, and headers that are not clearly configured. Extraction systems often encounter problems with these factors, which are common in real-world tables, and we found evidence of this difficulty in our evaluations.

Table Detection Results
Tested against the first sample table, all systems except for Adobe correctly detected table boundaries. With the second sample table, all four systems correctly detected table boundaries. 

Table Content Recognition Results
Half of the systems had trouble recognizing cell boundaries, column headers, and spanning cells. 

Summary of Content Recognition Results
The following tables summarize how the four systems fared with the two sample tables.

First sample test table summary.

Second sample test table summary.

Now you get to see what inaccurate table content detection actually looks like. First, the image below shows the second sample table from dataset ICDAR-2013 in its original PDF form, without detection boundaries.

Example of PDF table before table recognition mark up.

And now here’s the output for this table produced by one of the tested systems, with problem areas called out. 

Examples of inaccuracies in table content recognition.

Test 2: Overall Accuracy Scores

We ran the four systems against three standard public datasets and an in-house, private dataset, yielding an overall accuracy score, expressed as a percentage, for each system. The standard public datasets were cTDaR-modern, ICDAR, and SciTSR. Research and commercial institutions frequently use these public datasets for unbiased evaluation of their machine learning models.

Our in-house evaluation dataset was generated from multiple sources. We used it to represent thousands of documents collected from our customers and the types of complex tables they typically use for extraction. 

To ensure equal footing for each system during comparison, our model had no prior training on our in-house evaluation repository.

How We Got to Our Overall Accuracy Scores

A standard metric used by deep learning teams worldwide to assess the success of their image recognition tools is Intersection over Union (IoU). IoU measures how accurately the AI detects an object.

We used IoU, and in some tests an extended version of IoU, to compute overall accuracy statistics for each system’s ability to detect table boundaries and then their contents, such as rows, columns, and individual cells.

Example of an inaccuracy in table content recognition.

This image (above) shows the accurate column boundary in green. The purple box represents an inaccurate prediction of the column’s left and right boundaries: the left purple boundary crosses the word 'Acquisition' and cuts the '$' symbols out of their cells. We compute the area of overlap between the predicted bounding box and the accurate bounding box, then divide it by the area of their union to get the final score – the IoU.
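The steps above can be sketched in a few lines. This is our own illustration of the standard IoU formula for two axis-aligned boxes given as (x1, y1, x2, y2) corners, not the exact evaluation code used in the experiment:

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    # Clamp to zero when the boxes do not overlap
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```

An IoU of 1.0 means the predicted box matches the ground truth exactly; 0.0 means they do not overlap at all.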

How the Systems Compared for Overall Accuracy

Accurate table detection was a challenge for some systems with some datasets. The common stumbling block proved to be recognition of tabular cell structure (cell boundaries, rows/columns overlapping with text, spanning cells, etc.).

To our delight, PDFTron.ai earned the top overall accuracy scores in both table detection and table content recognition. These scores confirm that our model performs better than the others in total tabulations across all datasets.

Overall scores for table detection test

Overall scores for table structure recognition test.

Conclusion

Our experiment demonstrates that PDFTron.ai and its deep learning model performed strongly against solutions from Adobe’s AI systems, Google Table Parser, and Amazon Textract.

As we’ve said earlier, we are proud that the PDFTron.ai solution not only keeps pace with commercially successful solutions developed by larger competitors but in fact outperforms them in two key areas:

1. Table detection:
PDFTron is the only system that in all cases accurately recognized table boundaries in the first part of the experiment. Overall, PDFTron.ai is nearly 98% accurate compared to Adobe’s 95%, Google’s 87%, and Amazon’s 95%.

2. Table structure recognition:
PDFTron is the only system that in all cases accurately recognized table structure in the first part of the experiment. Overall, PDFTron.ai is nearly 94% accurate compared to Adobe’s 82%, Google’s 72%, and Amazon’s 89%. 

The Bottom Line

PDFTron.ai outscored our next most accurate competitor by a margin of nearly 3% when it came to table content recognition – a small but significant percentage that lets your teams crunch data and make decisions with confidence.

Next Steps 

We’ve been improving our accuracy by training our model on high-quality, diverse training data. As a result, we’ve seen improvements in our evaluation numbers for the metrics and datasets we’ve used in this experiment compared to the past, and the quality of our deep learning model itself has advanced as well.

We’re looking forward to testing PDFTron.ai again against the competition in the near future, as we’re confident that we’ll see further improvements in accuracy next time around. We’re encouraged that applications branching from our core system will build on this foundation of accuracy, and we look forward to testing these applications too one day.

To learn more about PDFTron’s advanced table detection and extraction system, visit PDFTron.ai.

And don’t hesitate to reach out with any questions! Our engineers would be happy to chat with you about your project and requirements, and answer any questions about our technology and how we can help you meet your goals.


AMAN KUMAR

Senior Computational Linguist

Background in AI/NLP data modeling, governance, and research.
