Working with Page Content

In this document
What is an Element?
Graphics State
Reading Page Content
Processing Forms, Type3 glyphs, tiling patterns.
Processing changes in Graphics State
Writing Page Content

PDFNet provides a powerful, easy-to-use API, called the Element API, for reading, writing, and editing text, images, and other graphical entities. Because the Element API is very efficient, PDFNet is a good match for interactive applications (such as PDF viewers and editors) and for content extraction applications (such as PDF conversion and validation), as well as for dynamic PDF generation.

Page content, a major component of a PDF document, is made up of the visible marks on a page drawn by PDF marking operators. For details on PDF content streams and thorough operator descriptions please refer to Section 3.7.1, “Content Streams,” in the PDF Reference Manual.

Although the PDFNet SDF and Filter APIs provide everything required to decode and parse low-level content streams, using the Element API is easier and more intuitive. This is because the Element API lets you treat a page's contents as a list of objects (i.e. a display list or a sequence of Elements) rather than as sets of cryptic marking operators.

What is an Element?

An Element (such as text, a path, or an image) is constructed from a set of marking operators from the page content stream. A set of Elements represents a display list.

Figure 7. A sequence of page marking operators represents an Element.

Therefore, the PDFNet Element interface allows you to treat page contents as a list of objects whose values and attributes can be modified.

Using the Element interface, applications can read, write, edit, and create page contents and resources. These contents and resources may in turn contain fonts, images, shadings, patterns, extended graphics states, and so on.

An application may use Element methods to modify the appearance of a page, or it can create page content from scratch.

Each Element is independent of other Elements. Therefore, every Element encapsulates all the relevant information about itself. A text object, for example, contains all font attributes.

Element is the concrete base class for all Elements. PDFNet supports all content elements allowed by the PDF format, namely: path, text_begin, text, text_new_line, text_end, image, inline_image, shading, form, group_begin, group_end, marked_content_begin, and marked_content_end.

Note that some Elements (such as path, text, image, inline_image, and shading) represent concrete graphical elements. Other Elements (such as text_begin/end, text_new_line, group_begin/end, and marked_content_begin/end) have no graphical representation; they are used to group Element sequences logically or to provide metadata associated with Element groups.

The Element class hierarchy implements a composite pattern — that is, the Element class provides the methods of all derived classes.

Figure 8. Element hierarchy. Only methods listed in the Element group or base class can be invoked for the given type.

To find the type of an Element object, use the element.GetType() method. Be careful to call only the methods that apply to that object's Element type; calling any other method results in undefined behavior. For example, it is illegal to call element.GetImageData() on an e_path element.

Note that, in Figure 8 above, e_group_begin/end and e_text_begin/end don't add any functionality to the common Element interface (i.e. GetType()/GetGState()/GetCTM()). The main purpose of these Elements is to mark sequences of Elements into logical groups. The Element e_group_begin corresponds to the PDF 'q' operator (saveState), e_group_end corresponds to the 'Q' operator, e_text_begin corresponds to the 'BT' (begin text) operator, and e_text_end corresponds to the 'ET' operator.

e_text_begin initializes a text object, initializing the text matrix and the text line matrix to the identity matrix. Because PDF text objects can't be nested, a second e_text_begin element cannot appear before e_text_end. A text object contains one or more text runs (that is, e_text elements) and new line markers (that is, e_text_new_line elements). e_text and e_text_new_line are not allowed outside of the text group (that is, outside the element sequence surrounded by e_text_begin/end).
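The grouping rules above can be sketched as a small check over a sequence of element type tags. This is a minimal illustration only: the Kind enum and the DisplayListRules class are hypothetical stand-ins (they are not part of PDFNet), and the sketch additionally assumes that group begin/end markers are expected to be balanced.

```csharp
using System.Collections.Generic;

// Hypothetical stand-in for the Element type tags; not the PDFNet enum itself.
public enum Kind { Path, TextBegin, Text, TextNewLine, TextEnd, GroupBegin, GroupEnd }

public static class DisplayListRules
{
    // Returns an empty string when the sequence obeys the grouping rules,
    // otherwise a short description of the first violation found.
    public static string Check(IEnumerable<Kind> kinds)
    {
        bool inText = false;   // true between TextBegin and TextEnd
        int groupDepth = 0;    // number of GroupBegin markers still open
        int index = 0;
        foreach (Kind k in kinds)
        {
            switch (k)
            {
                case Kind.TextBegin:
                    // Text objects cannot be nested.
                    if (inText) return "nested text object at " + index;
                    inText = true;
                    break;
                case Kind.TextEnd:
                    if (!inText) return "text end without begin at " + index;
                    inText = false;
                    break;
                case Kind.Text:
                case Kind.TextNewLine:
                    // Text runs and new-line markers live only inside a text group.
                    if (!inText) return "text run outside text object at " + index;
                    break;
                case Kind.GroupBegin:
                    groupDepth++;
                    break;
                case Kind.GroupEnd:
                    if (groupDepth == 0) return "group end without begin at " + index;
                    groupDepth--;
                    break;
            }
            index++;
        }
        if (inText) return "unterminated text object";
        if (groupDepth != 0) return "unterminated group";
        return "";
    }
}
```

For example, the sequence TextBegin, Text, TextNewLine, Text, TextEnd passes the check, while a lone Text element or two consecutive TextBegin elements is reported as a violation.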

Graphics State

Every Element has an associated CTM (current transformation matrix) and graphics state. Element.GetCTM() returns the transformation matrix used while processing the current Element. Element.GetGState() returns the Element's associated graphics state. GState keeps track of a number of style attributes used to visually define graphical Elements. The methods available through the GState class are listed below:

Figure 9. Graphics State.

For a detailed description of graphics state attributes refer to section 4.3 "Graphics State" in the PDF Reference Manual.
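A transformation matrix such as the CTM is a six-number affine matrix. As a rough sketch of how such a matrix maps points and how two matrices combine, the following uses a hypothetical Affine struct (an assumption made for illustration; it is not the PDFNet Matrix2D class) together with the standard PDF convention x' = a*x + c*y + h, y' = b*x + d*y + v.

```csharp
// Hypothetical value type mirroring the six-number PDF matrix [a b c d h v];
// it is not the PDFNet Matrix2D class.
public readonly struct Affine
{
    public readonly double A, B, C, D, H, V;

    public Affine(double a, double b, double c, double d, double h, double v)
    {
        A = a; B = b; C = c; D = d; H = h; V = v;
    }

    // Maps a point: x' = a*x + c*y + h, y' = b*x + d*y + v.
    public (double X, double Y) Apply(double x, double y)
    {
        return (A * x + C * y + H, B * x + D * y + V);
    }

    // Returns the single matrix equivalent to applying 'first' and then 'second'.
    public static Affine Concat(Affine first, Affine second)
    {
        return new Affine(
            first.A * second.A + first.B * second.C,
            first.A * second.B + first.B * second.D,
            first.C * second.A + first.D * second.C,
            first.C * second.B + first.D * second.D,
            first.H * second.A + first.V * second.C + second.H,
            first.H * second.B + first.V * second.D + second.V);
    }
}
```

For instance, new Affine(200, 0, 0, 300, 50, 450).Apply(1, 1) yields (250, 750): a unit square is scaled to 200 by 300 units and moved so that its lower-left corner sits at (50, 450). Concatenating a 0.5 scale with a translation by (280, 300) gives the matrix (0.5, 0, 0, 0.5, 280, 300).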

Reading Page Content

Page content is represented as a sequence of graphical Elements such as paths, text, images, and forms. The only effect of the ordering of Elements in the display list is the order in which Elements are painted. Elements that occur later in the display list can obscure earlier elements.

A display list can be traversed using an ElementReader object. For example:

void ReadDoc()
{
  // Open an existing document
  PDFDoc doc = new PDFDoc("in.pdf");
  doc.InitSecurityHandler();

  ElementReader reader = new ElementReader();

  //  Read page content on every page in the document
  PageIterator itr;
  PageIterator end = doc.PageEnd();
  for (itr=doc.PageBegin(); itr!=end; itr.Next())
  {
    // Read the page
    reader.Begin(itr.Current());
    ProcessElements(reader);
    // Close the page's display list before moving on
    reader.End();
  }
}

void ProcessElements(ElementReader reader)
{
  Element element;

  // Traverse the page display list
  while ((element = reader.Next()) != null)
  {
    switch (element.GetType())
    {
      case Element.ElementType.e_path:
      {
        if (element.IsClippingPath())
        {
          // This path is used for clipping
        }
        // ...
        break; 
      }
      case Element.ElementType.e_text:
      {
        Matrix2D text_mtx = element.GetTextMatrix();
        // ...
        break;
      }
      case Element.ElementType.e_form:
      {
        reader.FormBegin();
        ProcessElements(reader);
        reader.End();
        break;
      }
    }
  }
}

To start traversing the display list, call reader.Begin(). Then, reader.Next() will return subsequent Elements until null is returned (marking the end of the display list).

Note that, while ElementReader only works with one page at a time, the same ElementReader object may be reused to process multiple pages.

Processing Forms, Type3 glyphs, tiling patterns.

Note that a PDF page display list may contain child display lists of Form XObjects, Type3 font glyphs, and tiling patterns. A Form XObject is a self-contained description of any sequence of graphics objects (such as path objects, text objects, and sampled images), defined as a PDF content stream. It may be painted multiple times, either on several pages or at several locations on the same page, and will produce the same results each time (subject only to the graphics state at the time the Form XObject is painted). To open a child display list for a Form XObject, call the reader.FormBegin() method. To return processing to the parent display list, call reader.End(). Processing of the Form XObject display list is illustrated in Figure 10 below.

Figure 10. Traversing the child display list.

Note that, in the above sample code, a child display list is opened by calling reader.FormBegin() when an element of type Element.ElementType.e_form is encountered. The child display list becomes the current display list until it is closed using reader.End(). At that point, processing returns to the parent display list, and the next Element returned will be the Element following the Form XObject. Also note that, because Form XObjects may be nested, a sub-display list could have its own child display lists. The above sample code traverses these nested Form XObjects recursively.

Similarly, a pattern display list can be opened using reader.PatternBegin(), and a Type3 glyph display list can be opened using the reader.Type3FontBegin() method.

Processing changes in Graphics State

After reading an Element using ElementReader.Next(), it is possible to access all graphical attributes of the Element through its graphics state. Some applications are more interested in changes in the graphics state than in the attribute values themselves. For example, a transition from one Element to another may involve no changes in the graphics state, or changes to only a couple of attributes. In these cases, it isn't efficient to make memberwise comparisons between the old and current graphics states.

To make this easier and more efficient, PDFNet offers an API to enumerate the list of changes between subsequent Elements.

The list of changes in a graphics state can be traversed using the ElementReader.ChangesBegin/End() methods, as illustrated by the following example:

GSChangesIterator itr = reader.ChangesBegin();
GSChangesIterator end = reader.ChangesEnd();
for (; itr != end; itr.Next())
{
  switch(itr.Current())
  {
    case GState.GStateAttribute.e_transform:
      // Get the transform matrix for this element.
      // Unlike element.GetCTM(), which returns the
      // full transformation matrix, gs.GetTransform()
      // returns only the matrix that was installed
      // for this element (a cm operator preceding
      // this Element).
      // gs.GetTransform();
      break;
    case GState.GStateAttribute.e_line_width:
      // gs.GetLineWidth();
      break;
    case GState.GStateAttribute.e_line_cap:
      // gs.GetLineCap();
      break;
    case GState.GStateAttribute.e_line_join:
      // gs.GetLineJoin();
      break;
    case GState.GStateAttribute.e_miter_limit:
      // gs.GetMiterLimit();
      break;
    case GState.GStateAttribute.e_dash_pattern:
      break;

    // Etc.
  }
}

It's also possible to query ElementReader for changes to a specific attribute:

if (reader.IsChanged(
     GState.GStateAttribute.e_line_width)) 
{
   // line width was changed.
}

Note that the list of modified attributes is accumulated when calling ElementReader.Next(). To clear the list of modified attributes, use the ElementReader.ClearChangeList() method. A call to ClearChangeList() serves as a marker in the display list from which further changes in the graphics state are tracked.
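The accumulate-then-clear behavior described above can be pictured with a toy model. The ChangeTracker class and Attr enum below are hypothetical and exist only to illustrate the idea; they make no claim about how PDFNet tracks changes internally.

```csharp
using System.Collections.Generic;

// Hypothetical attribute tags for this sketch; not the PDFNet enum.
public enum Attr { Transform, LineWidth, LineCap, LineJoin }

// Toy model: remembers the latest value of each attribute and the set of
// attributes that have changed since the last call to ClearChangeList().
public sealed class ChangeTracker
{
    private readonly Dictionary<Attr, double> current = new Dictionary<Attr, double>();
    private readonly HashSet<Attr> changed = new HashSet<Attr>();

    // Records an attribute value seen while advancing through the display list.
    // The attribute is marked as changed only if its value actually differs.
    public void Observe(Attr attr, double value)
    {
        double old;
        if (!current.TryGetValue(attr, out old) || old != value)
        {
            current[attr] = value;
            changed.Add(attr);
        }
    }

    // True if the attribute changed since the last marker.
    public bool IsChanged(Attr attr)
    {
        return changed.Contains(attr);
    }

    // Places a new marker: later queries only see changes made after this call.
    public void ClearChangeList()
    {
        changed.Clear();
    }
}
```

In this model, changes keep accumulating across successive observations until ClearChangeList() is called, which mirrors the marker behavior described above.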

Writing Page Content

New page content can be added to an existing page or a blank new page using ElementBuilder and ElementWriter. ElementBuilder is used to instantiate one or more Elements that can be written to one or more pages using ElementWriter:

Figure 11. Adding new content to a page.

The following sample illustrates how to write page content to a new document:

PDFDoc doc = new PDFDoc();
doc.InitSecurityHandler();

// ElementBuilder is used to build new Element objects
ElementBuilder f = new ElementBuilder();

// ElementWriter is used to write Elements to the page
ElementWriter writer = new ElementWriter();

// Start a new page
// Position an image stream on several places on the page
Page page = doc.PageCreate();

// Begin writing to this page
writer.Begin(page);
// Attach ElementBuilder to the page    
f.Begin(page);        

// Import an Image that can be reused multiple 
// times in the document or  multiple times on the 
// same page.
MappedFile img_file = new MappedFile("peppers.jpg");
FilterReader img_data = new FilterReader(img_file);
Image img = Image.Create(doc.GetSDFDoc(),
   img_data, 
   Image.ImageCompression.e_jpeg, 
   400, 600, 8, 
   ColorSpace.CreateDeviceRGB());

Element element = f.CreateImage(img, 
  new Matrix2D(200, -145, 20, 300, 200, 150));
  
writer.WritePlacedElement(element);

GState gstate = element.GetGState();    

// Use the same image (just change its matrix)
gstate.SetTransform(200, 0, 0, 300, 50, 450);
writer.WritePlacedElement(element);

// Use the same image (just change its matrix)
writer.WritePlacedElement(
  f.CreateImage(img, 300, 600, 200, -150));

// save changes to the current page
writer.End();  

// Add a new page to the document sequence
doc.PagePushBack(page);

// Start a new page
page = doc.PageCreate();
writer.Begin(page);    
f.Begin(page);        

// Construct and draw a path object using 
// different GState attributes
f.PathBegin();
f.MoveTo(306, 396);
f.CurveTo(681, 771, 399.75, 864.75, 306, 771);
f.CurveTo(212.25, 864.75, -69, 771, 306, 396);
f.ClosePath();

// path is now constructed
element = f.PathEnd();            
element.SetPathFill(true);        

// Set the path color space and color.
gstate = element.GetGState();
gstate.SetFillColorSpace(
  ColorSpace.CreateDeviceCMYK()); 
gstate.SetFillColor(
  new ColorPt(1, 0, 0, 0));  // cyan
gstate.SetTransform(
  0.5, 0, 0, 0.5, -20, 300);
writer.WritePlacedElement(element);

// Draw the same path using a different 
// stroke color.
// The path will be filled and stroked.
element.SetPathStroke(true);        
gstate.SetFillColor(
  new ColorPt(0, 0, 1, 0)); // yellow
gstate.SetStrokeColorSpace(
  ColorSpace.CreateDeviceRGB());
gstate.SetStrokeColor(new ColorPt(1, 0, 0)); // red
gstate.SetTransform(0.5, 0, 0, 0.5, 280, 300);
gstate.SetLineWidth(20);
writer.WritePlacedElement(element);

// Draw the same path with a given dash pattern.
// This path should only be stroked.
element.SetPathFill(false);    
gstate.SetStrokeColor(new ColorPt(0, 0, 1)); // blue
gstate.SetTransform(0.5, 0, 0, 0.5, 280, 0);
double[] dash_pattern = {30};
gstate.SetDashPattern(ref dash_pattern, 0);
writer.WritePlacedElement(element);

writer.End();  // save changes to the current page
doc.PagePushBack(page);

doc.Save("out.pdf", PDFDoc.SaveOptions.e_remove_unused);

Note that once the Element is instantiated using ElementBuilder, you have full control over its properties and graphics state.

Page content can also come from existing pages. For example, you can use ElementReader to read paths, text, and images from existing pages and copy them to the current page. Note that, along the way, you can fully modify an Element's properties or its graphics state. This is how to perform page content editing. For example, the following copies all Elements except images from an existing page and changes text color to blue:

ElementWriter writer = new ElementWriter();
ElementReader reader = new ElementReader();
Element element;

reader.Begin(doc.PageBegin().Current());

Page new_page = doc.PageCreate(new Rect(0, 0, 612, 794));
doc.PagePushBack(new_page);

writer.Begin(new_page);
while ((element = reader.Next()) != null)
{
  if (element.GetType() == Element.ElementType.e_text) 
  {
    // Set all text to blue color.
    GState gs = element.GetGState();
    gs.SetFillColorSpace(
      ColorSpace.CreateDeviceRGB());
    gs.SetFillColor(new ColorPt(0, 0, 1));
  }
  else if (element.GetType() 
     == Element.ElementType.e_image) 
  {
    // remove all images
    continue;
  }
  
  writer.WriteElement(element);
}

writer.End();
reader.End();