PDFTron's Artificial Intelligence platform uses deep learning methods to extract complex tables accurately and output them in multiple formats.
PDFTron's AI-based table extraction is officially here!
Now there’s an easier way to get meaningful tabular data out of your real-world documents in PDF, Office, image formats, and more.
In this article, we discuss PDFTron’s new AI-based system, which uses deep learning to automatically analyze the entire document layout, recognize complex table structures, and extract tabular content accurately. It then lets users save the results as an HTML, JSON, CSV, or Excel spreadsheet file, with professional features to support further processing and analysis.
Check out the online demo to try the new, AI-powered table detection, recognition and extraction features in your browser. You can also read more on our AI-based capabilities and comprehensive document processing APIs at pdftron.ai.
There are millions of business and academic documents online that contain valuable tabular information, and many professionals would love to put that information to use downstream by automatically extracting and processing tabular data from a large corpus.
But imagine you’re a financial analyst, and you want to extract only a particular financial table from, say, 10-Q reports. Or you want to process and analyze a complex table with missing column or row separators in scanned invoices.
A frustration here is the tables themselves. Those found in documents in the wild are highly diverse and often error-prone. For example, they can have:
- complex row or column headers
- misaligned content in cells and key-value pairs due to text variance and justification
- oversized cells
- misplaced delimiters
- multiple empty cells
- cells with multi-line content
- cells that span multiple rows or columns of a table
As a result, teams spend more time than expected on tedious, manual proofreading and touching up the output of conventional OCR- and algorithm-based extraction tools.
To make data processing far less costly, we designed our new AI-based system from the ground up to handle rough-and-tumble real-world documents. Built on deep learning methods and trained on extensive real-world tabular data, it detects tables in documents with state-of-the-art accuracy and preserves the original table structure and relationships when it extracts them.
The system also identifies tables and tabular data in scanned images and transforms them into editable tables. Deep learning helps recreate the intended table structure and original formatting with accurate cell data.
Next, the system lets you port over data for further processing, such as editing or analysis. For example, you can extract and compare two tables side-by-side in HTML or Excel formats.
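As a rough illustration of what porting extracted data downstream can look like, the sketch below takes a table that has already been extracted into rows of cell text and serializes it to CSV and JSON with Python's standard library. The `rows` data and the function names `table_to_csv` and `table_to_json` are hypothetical and are not part of the PDFTron SDK; the shape of the real output may differ.

```python
import csv
import io
import json

# Hypothetical extracted table: the first row is the header, the rest are data rows.
rows = [
    ["Quarter", "Revenue", "Net income"],
    ["Q1", "1,200", "310"],
    ["Q2", "1,350", "342"],
]

def table_to_csv(rows):
    """Serialize rows of cell text to a CSV string."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)  # cells containing commas are quoted automatically
    return buffer.getvalue()

def table_to_json(rows):
    """Serialize a table to JSON as a list of records keyed by header text."""
    header, *body = rows
    records = [dict(zip(header, row)) for row in body]
    return json.dumps(records, indent=2)

csv_text = table_to_csv(rows)
json_text = table_to_json(rows)
```

Once two tables are in a plain format like this, comparing them side by side is a matter of loading both into a spreadsheet or a data analysis tool.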
HTML output is also generated from the detected table structure. This means you get reflowable content that adjusts dynamically to fit smaller screens, for a mobile-friendly reading experience with extracted data.
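To make the idea of structure-aware HTML concrete, here is a minimal sketch that renders a detected table, including spanning cells, as an HTML `<table>` using `rowspan` and `colspan`. The `Cell` class and `table_to_html` function are hypothetical names used only for illustration; they are not the PDFTron output format or API.

```python
from dataclasses import dataclass
from html import escape

@dataclass
class Cell:
    """Hypothetical cell record: text plus how many rows/columns it spans."""
    text: str
    rowspan: int = 1
    colspan: int = 1
    header: bool = False

def table_to_html(rows):
    """Render rows of Cell objects as an HTML table, keeping spanning cells."""
    lines = ["<table>"]
    for row in rows:
        lines.append("  <tr>")
        for cell in row:
            tag = "th" if cell.header else "td"
            attrs = ""
            if cell.rowspan > 1:
                attrs += f' rowspan="{cell.rowspan}"'
            if cell.colspan > 1:
                attrs += f' colspan="{cell.colspan}"'
            # escape() keeps characters such as & and < from breaking the markup
            lines.append(f"    <{tag}{attrs}>{escape(cell.text)}</{tag}>")
        lines.append("  </tr>")
    lines.append("</table>")
    return "\n".join(lines)

table = [
    [Cell("Region", rowspan=2, header=True), Cell("Sales", colspan=2, header=True)],
    [Cell("2020", header=True), Cell("2021", header=True)],
    [Cell("North & East"), Cell("10"), Cell("12")],
]
html_text = table_to_html(table)
```

Because the row and column relationships live in the markup rather than in fixed page coordinates, a browser can lay the table out again for whatever screen width is available.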
Last but not least, with the professional backing and broad functionality of the commercial PDFTron SDK, the system integrates seamlessly into a full range of environments and industries.
PDFTron’s AI-based table extraction system can be hosted locally and easily integrated with any downstream application or workflow. Its innovative methods, plus an easy-to-use UI, will save you significant time on manual data processing.
Accurate table structure recognition and extraction based on deep learning are essential in document analysis (for financial, academic, medical, manufacturing, and other businesses), traditional information retrieval and search, visualization, and human-document interactions.
As part of researching the best way to design this new system, we consulted with our customers in some of the most demanding enterprise environments in the world. Here are just a few places they said better table detection and extraction could make a significant impact:
Invoices & financial reports: Financial analysts want to extract only a particular financial table from filings such as a 10-Q report. Businesses may also need to process and analyze a table in a scanned invoice with missing column or row separators.
Contracts, insurance, and legal documents: An insurance agent working on an auto claim will be interested only in certain parts of an inventory table. Or a data analyst at a real estate marketing firm may want to compare two tables for pricing projections in a housing market.
Electronic healthcare records: Tables in raw electronic health record (EHR) data are disorganized and very diverse (manually scanned, digitally prepared from multiple sources, misaligned key-value pairs, missing column/row separators, etc.). A healthcare professional may want to extract a messy table with a patient's blood lab results and port it to DOCX format.
We rolled out PDFTron.ai, featuring the addition of deep learning to our comprehensive document SDK platform, in order to teach machines to understand semantic information such as reading order and tabular structures in documents more like a human would. Earlier this year, we released our AI-based PDF article detection and extraction. And now, we’re excited to introduce AI-based table detection and extraction.
Here’s how it can work in your browser:
The web interface to upload and process your document to extract tables
Output from a PDF with annotated tables and HTML in a desktop setting
Once you upload a document (e.g., a PDF file), our deep learning system (based on state-of-the-art methods) identifies all the tables, separates tables from the non-table text, and separates different tables from each other. Further, the models recognize the cell structure by partitioning text into cells, defining rows and columns, and figuring out spanning cells. The post-processing module then refines the identified tables by discarding false positives and adjusting the table border and tabular (cell-level) structure, or by following any customer-specific rules.
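The refinement step can be pictured with a small, generic sketch. The code below is not PDFTron's post-processing module; it assumes a hypothetical `Detection` record with a bounding box and a confidence score, and applies two common techniques from object detection: a confidence threshold to discard likely false positives, and non-maximum suppression to drop near-duplicate boxes for the same table. The thresholds are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    """Hypothetical table candidate: page coordinates plus a model confidence."""
    x0: float
    y0: float
    x1: float
    y1: float
    score: float  # assumed to lie in [0, 1]

def iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    ix0, iy0 = max(a.x0, b.x0), max(a.y0, b.y0)
    ix1, iy1 = min(a.x1, b.x1), min(a.y1, b.y1)
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a.x1 - a.x0) * (a.y1 - a.y0)
    area_b = (b.x1 - b.x0) * (b.y1 - b.y0)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def refine(detections, min_score=0.5, max_overlap=0.5):
    """Drop low-confidence candidates, then keep the best box of each overlapping group."""
    candidates = sorted(
        (d for d in detections if d.score >= min_score),
        key=lambda d: d.score,
        reverse=True,
    )
    kept = []
    for det in candidates:
        # Keep a box only if it does not heavily overlap a higher-scoring one.
        if all(iou(det, k) <= max_overlap for k in kept):
            kept.append(det)
    return kept

candidates = [
    Detection(50, 100, 550, 300, 0.97),  # a table
    Detection(55, 105, 545, 310, 0.80),  # near-duplicate of the first table
    Detection(50, 400, 550, 600, 0.91),  # a second table
    Detection(60, 700, 300, 720, 0.20),  # likely a false positive
]
tables = refine(candidates)
```

Customer-specific rules would slot in naturally as extra filters on `candidates`, for example keeping only tables above a minimum size or on particular pages.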
Your documents can be scanned, program-generated ('born-digital' PDF, TXT, DOCX, etc.), or hybrid. Our deep learning models address noise in scanned pages by, for example, reversing any rotation or distortion, sharpening low-resolution documents, and filtering out noise. They also correct inconsistent fonts, bounding boxes, and highlighted text in noisy OCR output.
We leverage these methods to achieve state-of-the-art tabular structure recognition results across all public test suites for PDF table extraction, with future applications in areas such as next-generation search and document reflow.
PDFTron's AI-based table detection and extraction tool is trained on hundreds of thousands of diverse tables to accurately identify, refine (post-processing), and extract tables in any given domain for your usage.
Don’t hesitate to contact us with any questions! Our engineers would be happy to discuss your project and requirements.