![](./img/table-extraction-and-pdf-to-xml-with-pdfgenie/7798292.gif)
![](./img/table-extraction-and-pdf-to-xml-with-pdfgenie/chicago5.png)
![](./img/table-extraction-and-pdf-to-xml-with-pdfgenie/chicago5_annotated.png)
- Why is PDF so popular and what is its Achilles’ heel?
- How difficult is it to extract a table from PDF?
- What is the best tool to extract structure from PDF?
- PDF Liberation Hackathon
- Evaluating accuracy of PDF table recognition
- Towards common evaluation framework for document recognition
PDF is a hugely popular format, and for good reason: with a PDF, you can be virtually assured that a document will display and print exactly the same way on different computers. However, PDF documents suffer from a drawback: they are usually missing the information that specifies which content constitutes paragraphs, tables, figures, headers/footers, etc. This lack of ‘logical structure’ information makes it difficult to edit files, to view documents on small screens, or to extract meaningful data from a PDF. In a sense, the content becomes ‘trapped’. In this article we discuss the logical structure problem and introduce PDFGenie, a tool for extracting text and tables, and we propose establishing a ground truth for evaluating progress in this area, by PDFGenie as well as by other tools.
After HTML, PDF is by far one of the most popular document formats on the Web. Google stats show that PDF is used to represent over 70% of the non-HTML web, and these are just the files that Google has indexed. There are likely many more in private silos such as company databases, academic archives, bank statements, credit card bills, material safety data sheets, product catalogues, product specifications, etc.
One of the main reasons why PDF is so popular is that it can be used for accurate and reliable visual reproduction across software, hardware, and operating systems.
To achieve this, PDF essentially became the ‘assembly language’ of document formats. It is fairly easy to ‘compile’ (i.e. convert) other document formats to PDF, but the reverse (i.e. decompiling PDF to a high-level representation) is much more difficult.
As a result, most PDF documents are missing logical structures such as paragraphs, tables, figures, header/footers, the reading order, sections, chapters, TOC, etc.
Although PDF can technically store this type of structured information via marked content, it is usually not present. When it is available, techniques similar to the one shown in the LogicalStructure sample can be used to extract structured content.
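As a quick illustration, one way to see whether a given PDF carries any structure tags at all is to look for the /StructTreeRoot and /MarkInfo entries that a tagged PDF’s document catalog must contain. The sketch below does this with a crude byte scan rather than a real PDF parser, so it is only a heuristic (and is not part of PDFGenie):

```python
# Heuristic check for a tagged PDF: the document catalog of a tagged file
# references a structure tree via /StructTreeRoot and typically declares
# /MarkInfo << /Marked true >>. Scanning raw bytes is a rough shortcut,
# not a real PDF parse -- it can be fooled by compressed object streams.

def looks_tagged(pdf_bytes: bytes) -> bool:
    return b"/StructTreeRoot" in pdf_bytes and b"/Marked true" in pdf_bytes

def file_looks_tagged(path: str) -> bool:
    with open(path, "rb") as f:
        return looks_tagged(f.read())
```

Even when this check passes, the tags themselves may be shallow or wrong, as discussed below.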
Unfortunately, even when a file contains some tags, they are frequently not very useful because there is no universally accepted grammar for logical structure in documents (just like there is no universally accepted high-level programming language). Tags are also frequently incorrect or damaged due to file manipulation or errors in PDF generation software.
The lack of structural information makes it difficult to reuse and repurpose the digital content represented by PDF.
So, although massive amounts of unstructured data are held in the form of PDF documents, automated extraction of tables, figures, and other structured information from PDF can be very difficult and costly.
Based on 15+ years of experience developing PDF toolkits for developers, we can attest that there is a profound lack of appreciation for the complexity of the problem. Interestingly, this view often comes from developers and even technology experts. Much like the famous MIT ‘summer vision project’ that was expected to solve computer vision in a few months, or Knuth’s expectation that TeX would be a short side project, it is hard and counter-intuitive to appreciate the full scope of what is required to get computers to understand documents.
Just as in the case of OCR, there is no ‘perfect’ solution. It is not difficult, for example, to find cases where humans segment and label parts of a document in widely different ways. Is a text column an article, a column, or part of a table? Sometimes you need to understand the semantics of the content to decide; at other times multiple interpretations are possible. In the figure below, do you see a face or a vase?
There are multiple annual conferences and hundreds of papers and doctorates published every year on the topic. The pace of research is only increasing and the problem is still far from being cracked.
These tools return precise positioning information for each character and can build simple segmentations (e.g. word > line > block > column). Most commercial solutions are tweaks of segmentation algorithms developed in the 1980s: collections of efficient bottom-up heuristics with hard-coded thresholds. Small but inevitable errors tend to propagate and cause serious issues down the line. For this reason, most existing solutions produce only very shallow structure (e.g. ‘paragraphs’).
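The bottom-up grouping described above can be sketched in a few lines. The threshold values here are invented placeholders, not anything a real engine uses; but hard-coding numbers of exactly this kind is where the fragility comes from:

```python
from dataclasses import dataclass

@dataclass
class Glyph:
    ch: str
    x: float   # left edge
    y: float   # baseline
    w: float   # advance width

# Hard-coded thresholds -- precisely the kind of magic numbers that make
# bottom-up heuristics fragile on real-world documents.
WORD_GAP = 3.0   # max horizontal gap inside a word
LINE_GAP = 5.0   # max baseline difference inside a line

def group_lines(glyphs):
    """Cluster glyphs into lines by baseline, then split lines into words
    by horizontal gap. Returns a list of lines, each a list of word strings."""
    lines = []
    for g in sorted(glyphs, key=lambda g: (g.y, g.x)):
        if lines and abs(g.y - lines[-1][-1].y) <= LINE_GAP:
            lines[-1].append(g)
        else:
            lines.append([g])
    result = []
    for line in lines:
        words, word = [], [line[0]]
        for prev, cur in zip(line, line[1:]):
            if cur.x - (prev.x + prev.w) <= WORD_GAP:
                word.append(cur)
            else:
                words.append("".join(g.ch for g in word))
                word = [cur]
        words.append("".join(g.ch for g in word))
        result.append(words)
    return result
```

A single mis-judged gap at this level (a word split in two, two cells merged) propagates into every higher-level decision, which is why errors compound so badly.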
Unfortunately, if you are looking for anything more complicated, even something as simple as table extraction, you are likely to be disappointed eventually regardless of the tool. For example, text may be jumbled, cells may be glued together or over-segmented, the table boundary may be incorrect, etc.
It doesn’t matter which tool you use; they will all under-perform because they are all based on the same paradigm. There are certainly promising approaches in machine learning that may improve the situation in years to come (yes, we are working on this), but that remains to be seen and will not happen overnight.
Before we see the next generation of tools, we will first need to figure out how to rate them in terms of recognition quality/accuracy.
Unfortunately, because there is no common test suite and no common ground truth, it is currently impossible to objectively compare and evaluate algorithms and solutions. Different recognition engines will inevitably represent higher-level document structures differently. Different engines will also optimize for, or bias towards, certain ‘types/grammars’ or ‘classes’ of documents, which further complicates comparisons. Like document recognition itself, evaluating the performance of recognition algorithms is an active area of research without clear or effective solutions.
Without a scientific approach to quantify the performance of different algorithm/solutions, how can we compare solutions, make recommendations, or advance the state of the art in document recognition? How can we dream about liberating PDF content?
With this in mind, we were looking with some anticipation at the ‘PDF Liberation Hackathon’ that ran between Jan. 17-19, 2014.
Although it is possible that the event helped with document processing in some niche workflows, it would be hard to say that it liberated PDF content. Perhaps the main issue is the event name, which may further entrench the false idea that PDF content can be ‘liberated through overnight hacking’. It is also unclear how helpful any software recommendations would be without objective evaluation or a proper benchmark.
At the same time, the event was quite effective at getting people to start thinking about document recognition. Hopefully this will lead to the development of common test suites and a ground truth that can be used as a starting point for evaluating different solutions.
On our end, the event was a catalyst to showcase the second-generation table recognition technology we have been working on over the last few years. Although the technology has been available for some time as a PDFNet add-on, it is now available as a simple-to-use command-line tool called PDFGenie.
PDFGenie can extract tables, text, and reading order from existing PDF documents in the form of HTML or XML output.
There is no special installation required; after you unzip the archive you are ready to go. For example, `pdfgenie my.pdf` will convert the input document to the following HTML.
For a list of available options, type `pdfgenie -h`.
A particularly useful option offered by PDFGenie is `-x` (or `--xfdf`), which produces XFDF (XML Forms Data Format). When the XFDF file resides in the same folder as the original PDF, opening it will load an annotation layer on top of the underlying/original PDF, as shown in the image below:
In this case annotations are used to visually highlight regions with tables and other structures. The annotated PDF can be used for quick visual validation of document recognition.
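For illustration, a minimal XFDF file carrying one region annotation might look like the sketch below. The element and attribute names follow the XFDF format; the rect coordinates, color, title, and subject values are invented for this example, and actual PDFGenie output may differ:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xfdf xmlns="http://ns.adobe.com/xfdf/" xml:space="preserve">
  <!-- Annotations live in a layer separate from the PDF content stream -->
  <annots>
    <!-- A square annotation marking a detected table region;
         rect is in PDF points (llx,lly,urx,ury) -->
    <square page="0" rect="72,400,540,680" color="#FF0000"
            title="pdfgenie" subject="Table"/>
  </annots>
  <!-- Reference to the PDF this annotation layer belongs to -->
  <f href="my.pdf"/>
</xfdf>
```

Because the annotations only reference page numbers and coordinates, they can be edited or regenerated without touching the PDF itself.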
The same option can be useful when generating ground truth (i.e. when you want to define the output you would expect to see). Instead of manually tagging hundreds of documents, PDFGenie can generate initial labels, and any PDF annotator can then be used to manually correct misclassified regions.
The primary advantage of annotations is that they are completely decoupled from the content stream and are therefore easier to manipulate compared to ‘tagged PDF’.
Using annotations to label PDF regions has some limitations compared to marked content (or ‘Tagged PDF’), e.g. when regions are disjoint, overlapping, or nested. However, we found that these cases are relatively rare and can be worked around by adding extra properties to annotation dictionaries.
Along with the input PDF, PDFGenie can accept a ground-truth XFDF file and compute the error rate and other statistics that measure deviation from the ideal output. Below is a sample report generated with the `-r` (or `--report`) option. In this case all stats are perfect because the input XFDF is exactly the same as the one generated with the `-x` option. If we modified the annotations to correct an error, the statistics would be less than perfect.
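PDFGenie’s exact scoring is not documented here, but region-level statistics of this kind are typically computed by matching predicted regions against ground-truth regions using an overlap measure. The sketch below, with an assumed intersection-over-union threshold, shows one way such precision/recall numbers could be derived:

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned rects (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def region_scores(predicted, truth, threshold=0.9):
    """Greedily match each predicted region to an unused ground-truth region
    whose IoU clears the threshold; return (precision, recall)."""
    unmatched = list(truth)
    hits = 0
    for p in predicted:
        for t in unmatched:
            if iou(p, t) >= threshold:
                unmatched.remove(t)
                hits += 1
                break
    precision = hits / len(predicted) if predicted else 1.0
    recall = hits / len(truth) if truth else 1.0
    return precision, recall
```

Running a tool’s own output back through such a scorer trivially yields perfect statistics, which matches the behaviour described above when the `-x` output is reused as ground truth.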
The above discussion regarding XFDF, ground truth, and automated verification applies beyond PDFGenie.
For example, XFDF (which is an upcoming ISO standard) could be used as a format to label a common ground-truth repository independent of any particular document recognition engine. This repository could serve as a measuring stick for further advances in document recognition.
Rather than trying to liberate PDF content by focusing on specific solutions (either commercial or open source), perhaps a more effective approach would be the development of a truly open ground-truth repository along with a corresponding benchmark and evaluation framework?
In the long run, it is unlikely that the development of any single document recognition engine will revolutionize the field; however, a community-driven framework for the scientific evaluation of different solutions does have the potential to make a big impact.
Even though this project is manageable, it is unlikely to happen as the result of a hackathon or a similar event. To succeed, the project would require focus and financial backing of the sort that usually comes from an organization.
If the above resonates with you and you would like to join the project please let us know.