The amount and variety of documents shared worldwide have grown rapidly. Organizations need technology that automates document understanding, so they can make sense of all this content faster and filter out the relevant and useful bits with their logical structures intact.
In this blog post, we discuss our new PDF article detection and extraction API. The API automatically detects and extracts articles from PDFs, Office files, scanned images, and more, and stores the extracted information in a structured XML/HTML format.
A typical application for this system is extracting relevant articles from born-digital PDFs of magazines, newspapers, brochures, and more. The same system separates core page content from surrounding page artifacts, including advertisements, crop marks, page numbers, and headers/footers. Along with built-in reflow and OCR, these capabilities enable developers to support many fast-growing or innovative use cases:
- Identify risk factors to companies by combining article extraction with NLP/OCR to accelerate investigation of an existing repository, such as a newspaper
- Streamline classification as part of a discovery process by extracting and presenting featured snippets so users can prioritize and cull content quickly
- Provide financial analysis of particular stock or equity performance via extraction and further processing of structured information from report filings
- Reflow extracted and processed articles via a downstream web-based or mobile application as part of a recommender system
- Curate content based on the presence of specific keywords, images, and metadata, to aid search through multiple repositories, such as patent/patent pending databases
Ready to deploy as a Docker container, the article detection system can be hosted on your own infrastructure or easily integrated in any workflow or solution.
Check out the online demo to try the article detection and extraction features firsthand. And read on for a deeper dive into new, AI-based capabilities.
Today, organizations rely on methods of content extraction with innate limitations. For example, methods reliant on human review require people to pore over content, while OCR-based methods, which use conventional algorithms to interpret document structures based on text blocks, don’t work well with complex and changing document layouts.
Using deep learning in addition to conventional methods, however, we get meaningful results faster with a larger variety of your content. That includes complex documents such as newspapers and financial report filings, all without costly reconfigurations of the system.
New features achieve a better understanding of your documents, and a more meaningful result, by analyzing multiple data sources for hints about a document’s logical structures and patterns. The system looks at more than just blocks of text; it also analyzes features such as color transitions and document metadata. Drawing on these multiple sources, it can tell the difference between a real newspaper article and an advertisement. It can also sort extracted content into its constituent parts. For example, the system can differentiate and extract newspaper article headlines, body paragraphs in their reading order, bylines, and publication dates.
Extracted text and images are then stored in a structured XML or HTML format that enables repurposing, whether you intend to reflow content for viewing or to further process for analysis.
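To make the idea of repurposing structured output concrete, here is a minimal sketch in Python that reads one extracted article from XML and pulls out its parts. The element names (`article`, `headline`, `byline`, `date`, `paragraph`) and the sample content are assumptions made up for illustration, not PDFTron’s actual output schema; consult the documentation for the real structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical output for one extracted article. The tag names and content
# are assumptions made for this sketch, not the schema produced by the API.
SAMPLE_XML = """
<article id="1">
  <headline>City Council Approves New Transit Plan</headline>
  <byline>Jane Doe</byline>
  <date>2021-03-15</date>
  <paragraph>The council voted on Monday to fund the plan.</paragraph>
  <paragraph>Construction is expected to begin next year.</paragraph>
</article>
"""


def parse_article(xml_text):
    """Return the article's parts as a plain dictionary."""
    root = ET.fromstring(xml_text.strip())
    return {
        "id": root.get("id"),
        "headline": root.findtext("headline", default="").strip(),
        "byline": root.findtext("byline", default="").strip(),
        "date": root.findtext("date", default="").strip(),
        # findall returns children in document order, which this sketch
        # treats as the reading order.
        "paragraphs": [(p.text or "").strip() for p in root.findall("paragraph")],
    }


article = parse_article(SAMPLE_XML)
print(article["headline"])
for paragraph in article["paragraphs"]:
    print(paragraph)
```

Once the content is in a plain dictionary like this, it can be handed to a search index, an NLP pipeline, or a template that renders it for viewing.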
Here is the output of a PDF file (a newspaper article) with zoned segments and HTML outputs of an extracted article in a desktop setting.
Newspaper featuring articles zoned with annotations by PDFTron’s automated article segmentation
HTML output (Article 1 - a selected/zoned segment of the newspaper page)
Here is the output of the same PDF file (a newspaper article) with zoned segments and sample output in a mobile (cell phone) setting.
HTML output in a mobile (cell phone) setting
And here is the HTML output of the same segment in a tablet setting (Portrait mode), using built-in reflow to fit content to the width of the window.
Lastly, here’s the HTML output in a tablet in Landscape mode.
This robust and scalable system is ideal for extracting and segmenting information contained in your source documents. It works with PDFs, images, MS Office files, CAD files, and 30+ other file formats supported by PDFTron’s document conversion. And the system’s capabilities dovetail with the hundreds of unique document processing, collaboration, and visualization features offered by the PDFTron SDK, particularly reflow and OCR.
The built-in text and image reflow allows the system to automatically re-render the input to fit the size of the window, as seen in the screen captures above. Reflow works through dynamic manipulation of the positions and dimensions of the Document Object Model (DOM) elements. Page content wraps accurately when zoomed in or out, while all the information and functionality continue to be available, ensuring a consistently smooth user experience when viewing extracted content downstream.
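The effect of reflow can be approximated with a toy example. The sketch below is not the SDK’s DOM-based reflow; it swaps in plain-text wrapping from Python’s standard `textwrap` module to show the same content re-fitted to different widths. The viewport names and character widths are made up for illustration.

```python
import textwrap

# Assumed character widths for three hypothetical viewports.
VIEWPORT_WIDTHS = {"phone": 30, "tablet_portrait": 50, "tablet_landscape": 80}

PARAGRAPH = (
    "Page content wraps to fit the window, while all the information "
    "remains available at every size."
)


def reflow(text, width):
    """Re-wrap text so that no line exceeds the given width."""
    return textwrap.wrap(text, width=width)


for name, width in VIEWPORT_WIDTHS.items():
    lines = reflow(PARAGRAPH, width)
    print(f"--- {name} ({width} chars, {len(lines)} lines) ---")
    print("\n".join(lines))
```

The narrower the viewport, the more lines the same paragraph occupies, but no words are dropped: joining the wrapped lines with spaces gives back the original text.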
At the same time, built-in OCR enables you to directly ingest your scanned images of documents, like invoices, and transform static images of text into selectable/searchable text information. The system can then detect, snip, and segment relevant text information to focus the view of your users or capture specific data for analysis.
To learn more about PDFTron’s new article detection and extraction, visit the online demo, [documentation](https://www.pdftron.com/documentation/web/guides/extraction/text-extract/#article-extraction), and PDFTron.ai. Don’t hesitate to contact us with any questions! Our engineers would be happy to discuss your project and requirements.