To access PDF page content:
# PDFDoc and ElementReader come from the PDFTron SDK Python package
doc = PDFDoc(filename)
page = doc.GetPage(page_num)
reader = ElementReader()

# Use the reader to begin reading the page
reader.Begin(page)
element = reader.Next()
while element is not None:
    # ... perform some document processing
    element = reader.Next()

# Finish reading the page
reader.End()
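The Begin/Next loop above can also be wrapped in a generator so that calling code uses an ordinary `for` loop. This is a minimal sketch: `iter_elements` is a hypothetical helper name, not an SDK function, and it assumes only the `Begin`, `Next`, and `End` methods of `ElementReader`.

```python
def iter_elements(reader, page):
    """Yield each Element on `page` in display-list order.

    Illustrative helper, not part of the SDK. It relies only on
    reader.Begin(page), reader.Next() returning None at the end of
    the page, and reader.End().
    """
    reader.Begin(page)
    try:
        element = reader.Next()
        while element is not None:
            yield element
            element = reader.Next()
    finally:
        # Close the reader even if the caller stops iterating early.
        reader.End()
```

With the SDK loaded, this would be used as `for element in iter_elements(ElementReader(), doc.GetPage(page_num)):`, processing each element inside the loop body as the original loop does.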
Read Elements Across All PDF Pages
Full code sample illustrating how to traverse a page's display list using ElementReader.
PDFTron SDK provides the Element API, a powerful, easy-to-use API for reading, writing, and editing text, images, and other graphical entities. Because it is very efficient, the Element API is a good match for interactive applications (such as PDF viewers and editors), for content extraction applications (such as PDF conversion and validation), and for dynamic PDF generation.
Page content, a major component of a PDF document, is made up of the visible marks on a page drawn by PDF marking operators. For details on PDF content streams and thorough operator descriptions please refer to Section 3.7.1, “Content Streams,” in the PDF Reference Manual.
Although the PDFTron SDK SDF and Filter APIs provide everything required to decode and parse low-level content streams, using the Element API is easier and more intuitive, because it lets you treat a page's contents as a list of objects (i.e. a display list or a sequence of Elements) rather than as sets of cryptic marking operators.
An Element (such as text, a path, or an image) is constructed from a set of marking operators from the page content stream. A set of Elements represents a display list.
A sequence of page marking operators represents an Element.
Therefore, the PDFTron SDK Element interface allows you to treat page contents as a list of objects whose values and attributes can be modified.
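To make the relationship between marking operators and Elements concrete, here is a toy sketch that groups an already-tokenized operator sequence into a display list. It is an illustration only and not how the SDK builds Elements internally: `to_display_list`, the tuple input format, and the dictionary output are all assumptions made for this example, and only a handful of PDF operators are handled.

```python
# Toy model: each input item is (operator, operands) in content-stream order.
PATH_BUILD = {"m", "l", "c", "v", "y", "h", "re"}        # path construction
PATH_PAINT = {"S", "s", "f", "F", "f*", "B", "b", "n"}   # path painting
TEXT_SHOW = {"Tj", "TJ"}                                 # text showing
MARKERS = {
    "q": "e_group_begin", "Q": "e_group_end",
    "BT": "e_text_begin", "ET": "e_text_end",
}

def to_display_list(ops):
    """Group marking operators into a list of element-like dictionaries."""
    elements = []
    pending_path = []  # construction operators waiting for a painting operator
    for operator, operands in ops:
        if operator in PATH_BUILD:
            pending_path.append((operator, operands))
        elif operator in PATH_PAINT:
            # A painting operator closes the path: several operators, one element.
            elements.append({"type": "e_path",
                             "ops": pending_path + [(operator, operands)]})
            pending_path = []
        elif operator in TEXT_SHOW:
            elements.append({"type": "e_text", "ops": [(operator, operands)]})
        elif operator in MARKERS:
            elements.append({"type": MARKERS[operator],
                             "ops": [(operator, operands)]})
        # Anything else (for example Tf or rg) is ignored in this toy model.
    return elements

sample = [
    ("q", ()),
    ("re", (0, 0, 100, 50)), ("f", ()),
    ("BT", ()), ("Tf", ("/F1", 12)), ("Tj", ("Hello",)), ("ET", ()),
    ("Q", ()),
]
display_list = to_display_list(sample)
```

In this sample the rectangle and fill operators collapse into a single path element, which is the sense in which a sequence of marking operators represents an Element.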
Using the Element interface, applications can read, write, edit, and create page contents and resources. These contents and resources may in turn contain fonts, images, shadings, patterns, extended graphics states, and so on.
An application may use Element methods to modify the appearance of a page, or it can create page content from scratch.
Each Element is independent of other Elements. Therefore, every Element encapsulates all the relevant information about itself. A text object, for example, contains all font attributes.
Element is the concrete base class for all Elements. PDFTron SDK supports all content elements allowed by the PDF format, namely:
Note that some Elements, such as shading, represent concrete graphical elements. However, other Elements, such as marked_content_begin/end, don't have a graphical representation but are used for logical grouping of Element sequences or to provide metadata associated with Element groups.
The Element class hierarchy implements a composite pattern — that is, the Element class provides the methods of all derived classes.
Element hierarchy. Only methods listed in the Element group or base class can be invoked for the given type.
To find the type of an Element object, use the element.GetType() method. Be forewarned: calling a method that does not belong to an object's Element type is not allowed, and the behavior when doing so is undefined. For example, it is illegal to call element.GetImageData() on an e_path element.
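One way to respect this rule is to check GetType() first and only then call type-specific methods. The sketch below does that with a lookup table. `make_handlers` and `describe` are hypothetical helper names, and the use of `Element.e_text`, `Element.e_image`, `Element.e_path`, `GetTextString()`, `GetImageWidth()`, and `GetImageHeight()` reflects assumptions about the Python binding rather than anything stated on this page. The `Element` class is passed in as a parameter so the sketch needs no import.

```python
def make_handlers(Element):
    """Map Element type constants to functions that are safe for that type."""
    return {
        # Text-specific call, used only for e_text elements.
        Element.e_text: lambda e: ("text", e.GetTextString()),
        # Image-specific calls, used only for e_image elements.
        Element.e_image: lambda e: ("image", e.GetImageWidth(), e.GetImageHeight()),
        # For paths this sketch calls nothing type-specific at all.
        Element.e_path: lambda e: ("path",),
    }

def describe(element, handlers):
    """Dispatch on element.GetType(); return None for unhandled types."""
    handler = handlers.get(element.GetType())
    if handler is None:
        return None
    return handler(element)
```

Because every type-specific call sits behind its own table entry, an image method can never be reached for a path element.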
Note that, in the hierarchy above, e_text_begin/end don't add any functionality to the common Element interface (i.e. GetType()/GetGState()/GetCTM()). The main purpose of these Elements is to mark sequences of Elements into logical groups. The Element e_group_begin corresponds to the PDF 'q' operator (saveState), e_group_end corresponds to the 'Q' operator (restoreState), e_text_begin corresponds to the 'BT' (begin text) operator, and e_text_end corresponds to the 'ET' (end text) operator.
e_text_begin initializes a text object, setting the text matrix and the text line matrix to the identity matrix. Because PDF text objects can't be nested, a second e_text_begin element cannot appear before e_text_end. A text object contains one or more text runs (that is, e_text elements) and new line markers (that is, e_text_new_line elements). e_text_new_line elements are not allowed outside of a text group (that is, outside the element sequence that makes up a text object).
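The two text-group rules above (no nested text objects, and no new line markers outside a text group) can be checked mechanically. The following sketch does so over a list of element type names. Using plain strings for the types and the helper name `check_text_groups` are assumptions made for illustration; with the SDK, the values would come from GetType().

```python
def check_text_groups(types):
    """Return (index, message) pairs for violations of the text-group rules."""
    problems = []
    in_text = False  # True between e_text_begin and e_text_end
    for index, name in enumerate(types):
        if name == "e_text_begin":
            if in_text:
                problems.append((index, "nested e_text_begin"))
            in_text = True
        elif name == "e_text_end":
            in_text = False
        elif name == "e_text_new_line" and not in_text:
            problems.append((index, "e_text_new_line outside a text group"))
    return problems
```

An empty result means the sequence satisfies both rules; other element types pass through unchecked.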
Every element has an associated CTM (current transformation matrix) and graphics state.
Element.GetCTM() returns the transformation matrix used while processing the current Element.
Element.GetGState() returns the element's associated graphics state. GState keeps track of a number of style attributes used to visually define graphical Elements.
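As a reminder of what a CTM does, the sketch below applies a PDF transformation matrix to a point using the standard PDF convention x' = a·x + c·y + h and y' = b·x + d·y + v. It works on a plain six-number tuple; how those six coefficients are read from the object returned by GetCTM() is not shown here, and `apply_ctm` is a hypothetical helper name.

```python
def apply_ctm(ctm, x, y):
    """Map the point (x, y) through a PDF matrix given as (a, b, c, d, h, v)."""
    a, b, c, d, h, v = ctm
    return (a * x + c * y + h, b * x + d * y + v)

# Scale by 2, then translate by (10, 20).
scaled = apply_ctm((2, 0, 0, 2, 10, 20), 1, 1)   # (12, 22)

# Rotate 90 degrees counterclockwise about the origin.
rotated = apply_ctm((0, 1, -1, 0, 0, 0), 1, 0)   # (0, 1)
```

Applying the Element's CTM in this way converts coordinates from the Element's own space into page space.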
The methods available through the GState class are listed below:
For a detailed description of graphics state attributes refer to section 4.3 "Graphics State" in the PDF Reference Manual.