Semantic content recognition is the ability to identify the components of a document by their “class”: that is, whether a given piece of content is a title, subtitle, section, paragraph, word, figure, caption, table, and so on. Despite decades of research, this problem remains open. Available solutions are unreliable and fall far short of what a human reader can do.
At the 2015 PDF Technical Conference, PDFTron’s CTO gave a presentation on the problem of semantic content recognition in PDF. The presentation gives an overview of the problem, explains why it has been so hard to solve, and suggests how the industry as a whole might organize itself to finally develop solutions that are as accurate as a person.