Low level PDF API

A short introduction to PDF file format

In this section, we present the basic structure of a PDF document. For details, please refer to the PDF Reference Manual. Below is a listing of a very simple PDF document. It displays a "Hello World" string on a single page.

0000    %PDF-1.4
0001    1 0 obj <<
0002      /Parent 5 0 R
0003      /Resources 3 0 R
0004      /Contents 2 0 R
0005    >>
0006    endobj
0007    2 0 obj
0008    <<
0009      /Length 51
0010    >>
0011    stream
0012      BT
0013      /F1 24 Tf
0014      1 0 0 1 260 330 Tm
0015      (Hello World)Tj
0016      ET
0017    endstream
0018    endobj
0019    3 0 obj
0020    <<
0021      /ProcSet [/PDF/Text]
0022      /Font <</F1 4 0 R >>
0023    >>
0024    endobj
0025    4 0 obj <<
0026      /Type /Font
0027      /Subtype /Type1
0028      /Name /F1
0029      /BaseFont/Helvetica
0030    >>
0031    endobj
0032    5 0 obj
0033    <<
0034      /Type /Pages
0035      /Kids [ 1 0 R ]
0036      /Count 1
0037      /MediaBox [0 0 612 714]
0038    >>
0039    endobj
0040    6 0 obj
0041    <<
0042      /Type /Catalog
0043      /Pages 5 0 R
0044    >>
0045    endobj
0046    xref
0047    0 7
0048    0000000000 65535 f
0049    0000000009 00000 n
0050    0000000103 00000 n
0051    0000000204 00000 n
0052    0000000275 00000 n
0053    0000000361 00000 n
0054    0000000452 00000 n
0055    trailer
0056    <<
0057      /Size 7
0058      /Root 6 0 R
0059    >>
0060    startxref
0061    532

A PDF document consists of four sections:

  • A one-line header identifying the version of the PDF specification to which the file conforms (Line 0). In the above sample, the header string is "%PDF-1.4". It identifies this file as a PDF document adhering to the 1.4 specification.
  • A body containing the objects that make up the document contained in the file (Lines 1-45). Our sample file shows 6 objects, each beginning with an "n m obj" line and ending with "endobj". Each object has its own object number followed by a zero. The zero is the generation number (also known as the revision level); it exists because PDF allows updates to the file to be made without rewriting the entire file.
  • A cross-reference table containing information about the indirect objects in the file (Lines 46-54). The cross-reference table in our sample notes that it contains 7 entries: a dummy for object zero and one for each of the 6 objects. The table maps each implicit object index to the byte offset, measured from the beginning of the file, at which that object is located. For example, object 1 is the second entry (right after the dummy entry), indicating that it begins at byte 9; object 3 is the fourth entry, indicating that it is located at byte 204 in the file.
  • A trailer giving the location of the cross-reference table and of certain special objects within the body of the file (Lines 55-61).
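The cross-reference entries in the sample can be decoded mechanically. The following is a minimal sketch, not part of PDFNet: the class and method names (XrefSketch, ParseEntry) are hypothetical, and it assumes the classic fixed-width entry layout shown in the listing (a 10-digit field, a 5-digit generation number, and an "n" or "f" flag).

```csharp
using System;
using System.Globalization;

// Hypothetical helper for illustration only; it is not a PDFNet class.
static class XrefSketch
{
    // Parses one cross-reference entry such as "0000000204 00000 n".
    // For in-use ("n") entries the first field is the byte offset of the
    // object; for free ("f") entries it is the next free object number.
    public static (long Offset, int Generation, bool InUse) ParseEntry(string line)
    {
        string[] parts = line.Trim().Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        if (parts.Length != 3 || (parts[2] != "n" && parts[2] != "f"))
            throw new FormatException("Not a cross-reference entry: " + line);

        long offset = long.Parse(parts[0], CultureInfo.InvariantCulture);
        int generation = int.Parse(parts[1], CultureInfo.InvariantCulture);
        return (offset, generation, parts[2] == "n");
    }

    static void Main()
    {
        // Entries copied from lines 48-54 of the sample; the array index
        // is the implicit object number.
        string[] entries =
        {
            "0000000000 65535 f",
            "0000000009 00000 n",
            "0000000103 00000 n",
            "0000000204 00000 n",
            "0000000275 00000 n",
            "0000000361 00000 n",
            "0000000452 00000 n",
        };

        for (int objNum = 0; objNum < entries.Length; objNum++)
        {
            var entry = ParseEntry(entries[objNum]);
            Console.WriteLine(entry.InUse
                ? $"object {objNum}: byte offset {entry.Offset}, generation {entry.Generation}"
                : $"object {objNum}: free entry");
        }
    }
}
```

A real reader would first follow "startxref" to find the table and would also handle multiple subsections; the sketch only covers the single subsection that appears in the sample.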

Note that objects refer to each other using a notation like "5 0 R". The "R" stands for reference and uses the two preceding numbers to name a specific object and revision.
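As a small illustration of the reference notation, the sketch below splits a reference such as "5 0 R" into its object number and generation number. IndirectRefSketch and TryParse are hypothetical names chosen for this example and are not part of PDFNet.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; it is not a PDFNet class.
static class IndirectRefSketch
{
    // Two unsigned integers followed by the keyword R.
    static readonly Regex RefPattern = new Regex(@"^\s*([0-9]+)\s+([0-9]+)\s+R\s*$");

    // Returns true and fills in both numbers when 'text' is an indirect reference.
    public static bool TryParse(string text, out int objNum, out int genNum)
    {
        objNum = 0;
        genNum = 0;
        Match m = RefPattern.Match(text);
        if (!m.Success)
            return false;
        return int.TryParse(m.Groups[1].Value, out objNum)
            && int.TryParse(m.Groups[2].Value, out genNum);
    }

    static void Main()
    {
        if (TryParse("5 0 R", out int obj, out int gen))
            Console.WriteLine($"object {obj}, generation {gen}");
    }
}
```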

Therefore, the file body consists of a collection of objects, each object potentially referencing any or all objects, including itself. This set of nodes and directed references constitutes a graph. We could represent the "Hello World" sample file using the following abstract graph representation.

Each object in the graph is represented with an ellipse and each object cross reference is represented with an arrow.

Each PDF document must have a "Root" node. It must reference a "Catalog" node which must reference a "Pages" node. The "Pages" node further branches and points to each of the pages in the document. Note that a "Pages" node points to a group of pages whereas a "Page" node represents a single page.

The "Page" node references the page's "Contents" and the page's "Resources". The resource dictionary, in turn, references the "Fonts" used on the page. The resource dictionary can reference many other resource types, including Color Spaces, Patterns, Shadings, Images, Forms, and more. The page contents stream contains markup operators used to draw the page.

Every PDF document uses this basic object structure.
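To make the graph view concrete, the sketch below models the six objects of the "Hello World" sample as an adjacency table and walks it from the Catalog (object 6). The table was transcribed by hand from the listing, and ObjectGraphSketch and Reachable are hypothetical names rather than PDFNet API. Note that the page (object 1) points back to its parent (object 5), so the walk has to remember which objects it has already visited.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical model for illustration only; it is not a PDFNet class.
static class ObjectGraphSketch
{
    // Object number -> object numbers it references (from the sample listing).
    public static readonly Dictionary<int, int[]> References = new Dictionary<int, int[]>
    {
        { 1, new[] { 5, 3, 2 } }, // Page: /Parent, /Resources, /Contents
        { 2, new int[0] },        // page content stream
        { 3, new[] { 4 } },       // resource dictionary: /Font /F1
        { 4, new int[0] },        // font
        { 5, new[] { 1 } },       // Pages: /Kids
        { 6, new[] { 5 } },       // Catalog: /Pages
    };

    // Breadth-first walk. The visited set keeps the Page <-> Pages cycle
    // from looping forever.
    public static List<int> Reachable(int start)
    {
        var visited = new HashSet<int> { start };
        var order = new List<int>();
        var queue = new Queue<int>();
        queue.Enqueue(start);

        while (queue.Count > 0)
        {
            int current = queue.Dequeue();
            order.Add(current);
            foreach (int target in References[current])
            {
                if (visited.Add(target))
                    queue.Enqueue(target);
            }
        }
        return order;
    }

    static void Main()
    {
        // Prints "6 5 1 3 2 4": every object is reachable from the Catalog.
        Console.WriteLine(string.Join(" ", Reachable(6)));
    }
}
```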

Before going into the details of the PDFNet SDF/COS object model, we should review the basics. For a detailed description of the SDF syntax and semantics, please refer to Chapter 3 (Syntax) of the PDF Reference Manual.

In PDF there are five atomic objects:

Object Type | Description | Samples
Number | PDF provides two types of numeric object: integer and real. | 1.03 612
Bool | Boolean objects are identified by the keywords true and false. | true false
Name | A name object is an atomic symbol uniquely defined by a sequence of characters. Names always begin with "/" and can contain letters and numbers and a few special characters. | /Font /Info /PDFNet
String | Strings are sequences of bytes enclosed in "(" and ")". | (Hello World!)
Null | The null object has a type and value that are unequal to those of any other object. Usually it refers to a missing object. | null
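The five atomic types can usually be told apart by looking at a single token. The sketch below shows one way to do that. AtomicTypeSketch and Classify are hypothetical names, not PDFNet API, and the sketch assumes the token has already been isolated and only covers the parenthesized string form described above.

```csharp
using System;
using System.Globalization;

// Hypothetical helper for illustration only; it is not a PDFNet class.
static class AtomicTypeSketch
{
    // Returns the atomic object type that a single, already isolated token denotes.
    public static string Classify(string token)
    {
        if (string.IsNullOrEmpty(token)) return "Unknown";
        if (token == "true" || token == "false") return "Bool";
        if (token == "null") return "Null";
        if (token[0] == '/') return "Name";
        if (token.Length >= 2 && token[0] == '(' && token[token.Length - 1] == ')') return "String";

        // Integers and reals share one Number type; no exponent form is accepted.
        NumberStyles style = NumberStyles.AllowLeadingSign | NumberStyles.AllowDecimalPoint;
        if (double.TryParse(token, style, CultureInfo.InvariantCulture, out _)) return "Number";

        return "Unknown";
    }

    static void Main()
    {
        foreach (string token in new[] { "1.03", "612", "true", "/Font", "(Hello World!)", "null" })
            Console.WriteLine($"{token} -> {Classify(token)}");
    }
}
```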

Also, there are three compound objects:

Object Type | Description | Samples
Array | An array object is a one-dimensional collection of objects arranged sequentially. Unlike arrays in typical computer languages, PDF arrays may be heterogeneous; that is, an array's elements may be any combination of numbers, strings, dictionaries, or any other objects, including other arrays. | [ true /Name ]
Dictionary | A dictionary object is a map containing pairs of objects, known as the dictionary's entries. The first element of each entry is the key and the second element is the value. The key must be a name. The value can be any kind of object, including another dictionary. | <</key /value >>
Stream | A stream is essentially a dictionary followed by a sequence of bytes. PDF streams are always indirect objects and thus always may be shared. | 1 0 obj << /Length 144 >> stream ........... endstream endobj

Objects can be arbitrarily nested using the dictionary and array compounding operations.

All of the objects in the above tables are "direct objects" because they are not surrounded by "obj" and "endobj" keywords. The body of the PDF document is actually made up of a sequence of "indirect objects". An indirect object is created by taking a single direct object (whether it be atomic or compound) and enclosing it with the "n m obj" and "endobj" keywords (where n and m are non-negative integers).

Note that, because indirect objects are numbered and can be referenced by other objects, they can be shared — that is, referenced by more than one other object. However, since direct objects are not numbered, they can't be shared.

In the above PDF example, the object "3 0 obj" is an indirect object because the "obj" and "endobj" keywords wrap a dictionary object containing two entries.

3 0 obj
<<
  /ProcSet [/PDF /Text]
  /Font << /F1 4 0 R >>
>>
endobj

The "ProcSet" key is mapped to an array which is a direct object containing atomic direct objects. In a similar way, the "Font" key is mapped to a direct dictionary. On the other hand, "F1" in the inner dictionary is mapped to an indirect object with object number 4 and generation number 0. Because "F1" points to an indirect object, the same font resource can be shared across many different pages.

SDF/COS Object Model

Real-life PDF documents are much more complex than the "Hello World" sample from the previous section. Streams in a PDF document can be compressed and encrypted, objects can form complex networks, and, starting with PDF 1.5, parts of the object graph can be compressed and embedded in so-called "object streams". All this makes manual editing of PDF documents extremely difficult, and in some cases impossible. The good news is that PDFTron Systems released CosEdit, a graphical utility for browsing and editing PDF documents at the object level, offering unprecedented ease and control. PDFNet also provides a full SDF/COS level API making it very easy to read, write, and edit PDF and FDF at the atomic level. Furthermore, PDFNet provides a high-level API for reading, writing, and editing PDF documents at the level of pages, bookmarks, graphical primitives, and so on.

SDF (Structured Document Format) and COS (Carousel Object System; Carousel was a codename for Acrobat 1.0) are synonyms for the PDF low-level object model. SDF is the acronym used in PDFNet, whereas COS is a legacy term used in the Acrobat SDK.

In many ways, SDF is to PDF what XML and DOM are to SVG (Scalable Vector Graphics). The SDF/COS object system provides the low-level object types and file structure used in PDF documents. PDF documents are graphs of SDF objects. SDF objects can represent document components such as bookmarks, pages, fonts, annotations, and so on.

PDF is not the only document format built on top of SDF/COS. FDF (Form Data Format) and PJTF (Portable Job Ticket Format) are also built on top of SDF/COS.

SDF.Obj

The SDF layer deals directly with the data that is in a PDF document. The data types are referred to as SDF objects. There are eight data types found in PDF documents. They are arrays, dictionaries, numbers, boolean values, names, strings, streams, and the null object. PDFNet implements these objects as shown in the following graph:

In C#, all objects ultimately derive from the Object class. Similarly, all SDF objects ultimately derive from the Obj class. Following the Composite design pattern, Obj implements each method found in its derived classes. Thus you can invoke a member function of any derived object through the base Obj interface. This is illustrated in the following code sample:

SDFDoc doc = new SDFDoc("in.pdf");
        
// Get the trailer
Obj trailer = doc.GetTrailer();

// Get the info dictionary. 
Obj info = trailer.Get("Info").Value();

// Replace the Producer entry
info.PutString("Producer", "PDFNet");

// Create a custom inline dictionary within 
// Info dictionary
Obj custom_dict = info.PutDict("My Direct Dict");

// Add some key/value pairs
custom_dict.PutNumber("My Number", 100);
Obj my_array = custom_dict.PutArray("My Array");

// Create a custom indirect array within Info dictionary
Obj custom_array = doc.CreateIndirectArray();    
info.Put("My Indirect Array", custom_array);

// Create indirect link to root
custom_array.PushBack(trailer.Get("Root").Value());

doc.Save("out.pdf", 0, "%PDF-1.4");  // Save PDF

If a member function is not supported on a given object (e.g. if you are invoking obj.GetName() on a Bool object), an Exception will be thrown. Learn more about PDFNet exception handling under the Error handling section.

In order to find out type information at run time, use the obj.GetType() or obj.Is<type>() methods (where <type> could be Array, Number, Bool, Str, Dict, or Stream). Usually, an object's type can be inferred from the PDF/FDF specification. For example, when you call doc.GetTrailer(), you can assume that the returned object is a dictionary object because this is mandated by the PDF specification. If an object is not a dictionary, calling a dictionary method on it throws an exception. These semantics matter for style: since type casts and type checks are not required, you can keep your code efficient and elegant. Where the PDF/FDF specification is ambiguous, use the GetType() or Is<type>() methods.

As mentioned in the previous section, SDF objects can be either direct or indirect. Direct objects can be created inside a container object using the container's Put<type>() methods. The following example illustrates how to create direct number, dictionary, and name objects inside Dict objects. Note that a similar approach works for Array objects.

// you can create direct objects inside container objects.
doc.GetRoot().PutNumber("My number key", 100);
doc.GetRoot().PutDict("My dict key");
doc.GetRoot().PutName("My name key", "My name value");

New indirect objects can be created using the doc.CreateIndirect<type>() methods on an SDF document. The following code shows how to create new Number and Dictionary indirect objects:

Obj mynumber = doc.CreateIndirectNumber(100);
Obj mydict = doc.CreateIndirectDict();

PDFNet SDF provides many utility methods that can be used to efficiently traverse an SDF object graph. Here is an example on how to get to a document's page root:

Obj pages = doc.GetTrailer()
               .Get("Root").Value()
               .Get("Pages").Value();

Note that because the PDF specification mandates that "Root" is always a dictionary, we can directly reference the "Pages" object by calling Get("key"). Note also that some so-called "PDF" documents are corrupt, meaning that the documents are not in compliance with the PDF specification. In some corrupt PDF documents, the "Root" may be missing or may not be a dictionary object. In these and similar cases, the PDFNet SDK throws an exception.

In order to retrieve an object that may or may not be present in a dictionary, use the dict.FindObj("key") method. For example:

Obj my_value = dict.FindObj("my_key");
if (my_value != null)
{
    // ...use my_value...
}
else
{
    // "my_key" does not exist in dict
}

You can use DictIterator in order to traverse key-value pairs within a dictionary:

for (DictIterator itr = dict.GetDictIterator(); 
                  itr.HasNext(); 
                  itr.Next())
{
  // itr.Key();
  // itr.Value();
}

To retrieve objects from an Array object, use array.GetAt(idx) method:

for (int i = 0; i < array.Size(); ++i) 
{
  Obj obj = array.GetAt(i);
  // ...
}

In the previous section, we learned how to create indirect objects by calling the SDFDoc.CreateIndirect<type>() methods. Now, let's look at how to create references to those indirect objects. The following code shows how:

// Create the indirect dictionary that will be shared below.
Obj shared_dict = doc.CreateIndirectDict();
shared_dict.PutName("My key", "My value");

Obj trailer_dict = doc.GetTrailer();
if (trailer_dict != null)
{
    Obj info_dict = trailer_dict.Get("Info").Value();
    if (info_dict != null)
    {
        // Add indirect reference to 'shared_dict'.
        info_dict.Put("MyDict", shared_dict);

        Obj root_dict = trailer_dict.Get("Root").Value();
        if (root_dict != null)
        {
            // Add a second indirect reference to 'shared_dict'. 
            root_dict.Put("MyDict", shared_dict);
        }
    }
}

So it's possible for multiple objects to refer to the same object. We call such objects shared objects. But shared objects must always be indirect objects. So if you want to share an object, it must have been created using SDFDoc.CreateIndirect<type>(), or you should test Obj.IsIndirect() to make sure it's an indirect object.

Because the PDF document format disallows creating multiple links to direct objects, PDFNet will throw an exception should you try to create multiple links/references to a direct object. This is shown below:

Obj trailer_dict = mydoc.GetTrailer();
if (trailer_dict != null)
{
    Obj info_dict = trailer_dict.Get("Info").Value();
    if (info_dict != null)
    {
        Obj direct_obj = info_dict.PutDict("Link1");

        Obj root_dict = trailer_dict.Get("Root").Value();
        if (root_dict != null)
        {
            // Attempt to create a second link to direct_obj.
            // This will copy the object. If you want to
            // share objects, create them using the
            // PDFDoc.CreateIndirect() methods.
            root_dict.Put("Link2", direct_obj);
        }
    }
}

In addition to the basic types of objects mentioned so far, PDF also supports stream objects. A stream object is essentially a dictionary with an attached binary stream. In PDFNet, all methods that apply to dictionaries apply to streams as well.

In addition to the methods provided by Dict, streams provide an interface used to access an associated data stream. Given a stream Obj, you can use GetDecodedStream() to get decoded data or GetRawStream() to get raw, undecoded data. GetRawStreamLength() returns the length of the raw data stream. This number is the same as the one stored under the "Length" key in the stream dictionary.
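To see what the "Length" entry measures at the file level, the following sketch pulls the raw bytes out of a serialized stream object without using PDFNet at all. StreamLengthSketch and ExtractRaw are hypothetical names, and the sketch assumes that "Length" is stored as a direct integer in the stream dictionary (in real files it may instead be an indirect reference, which this code does not resolve).

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; it is not a PDFNet class.
static class StreamLengthSketch
{
    // Returns the raw (still encoded) bytes of a serialized stream object,
    // using the /Length entry to decide how many bytes to copy.
    public static byte[] ExtractRaw(byte[] obj)
    {
        // ASCII decoding yields exactly one char per byte, so character
        // positions in 'text' are also byte positions in 'obj'.
        string text = Encoding.ASCII.GetString(obj);

        Match len = Regex.Match(text, @"/Length\s+([0-9]+)");
        if (!len.Success)
            throw new FormatException("No direct /Length entry found.");
        int length = int.Parse(len.Groups[1].Value);

        int keyword = text.IndexOf("stream", len.Index, StringComparison.Ordinal);
        if (keyword < 0)
            throw new FormatException("No stream keyword found.");

        // The data starts after the end-of-line that follows "stream".
        int start = keyword + "stream".Length;
        if (start < text.Length && text[start] == '\r') start++;
        if (start < text.Length && text[start] == '\n') start++;
        if (start + length > obj.Length)
            throw new FormatException("/Length runs past the end of the data.");

        byte[] data = new byte[length];
        Array.Copy(obj, start, data, 0, length);
        return data;
    }

    static void Main()
    {
        byte[] sample = Encoding.ASCII.GetBytes(
            "1 0 obj\n<< /Length 5 >>\nstream\nHello\nendstream\nendobj\n");
        Console.WriteLine(Encoding.ASCII.GetString(ExtractRaw(sample)));
    }
}
```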

PDFNet supports all compression and encryption schemes used in the PDF format. It provides transparent access to decoded stream data. The following code decodes and extracts the contents of a given stream to an external file:

Obj stream = ...
Filter dec_stm = stream.GetDecodedStream();
dec_stm.WriteToFile("out.bin", false);

For a more complete discussion on PDFNet Filters see PDFNet Streams and Filters.

SDFDoc

Our overview of the SDF object model would not be complete without mentioning SDFDoc. SDFDoc brings together document security, document utility methods, and all SDF objects.

An SDF document can be created from scratch using a default constructor:

SDFDoc sdfdoc = new SDFDoc();
sdfdoc.InitSecurityHandler();
Obj trailer = sdfdoc.GetTrailer();

An SDF document can also be created from an existing file, such as an external PDF document:

SDFDoc sdfdoc = new SDFDoc("in.pdf");
sdfdoc.InitSecurityHandler();
Obj trailer = sdfdoc.GetTrailer();

Or it can be created from a memory buffer or some other Filter/Stream:

MemoryFilter memory = ....
SDFDoc sdfdoc = new SDFDoc(memory);
sdfdoc.InitSecurityHandler();
Obj trailer = sdfdoc.GetTrailer();

Finally, an SDF document can be accessed from a high-level PDF document as follows:

PDFDoc pdfdoc = new PDFDoc("in.pdf");
pdfdoc.InitSecurityHandler();
SDFDoc sdfdoc = pdfdoc.GetSDFDoc();
sdfdoc.InitSecurityHandler();
Obj trailer = sdfdoc.GetTrailer();

Note that the examples above use sdfdoc.GetTrailer() in order to access the document trailer, which is the starting SDF object (root node) in every document. Following the trailer links, we can visit all low-level objects in a document (e.g. all pages, outlines, fonts, and so on).

SDFDoc also provides utility methods used to import objects and object collections from one document to another. These methods can be useful for copy operations between documents such as a high-level page merge and document assembly.

Streams and Filters

One of the basic building blocks of a PDF document is an SDF stream object. For example, in a PDF document all page content, images, embedded fonts, and files are represented using stream objects that can be compressed and encrypted using various Filter chains. See the "Stream Objects" and "Filters" chapters in the PDF Reference Manual for more details.

PDFNet supports an efficient and flexible architecture for processing streams using Filter pipelines.

A Filter is an abstraction of a sequence of bytes, such as a file, an input/output device, an inter-process communication pipe, or a TCP/IP socket. A filter can also perform certain transformations of input/output data (e.g. data compression/decompression, color conversion, and so on).

Input Filters/Streams

PDFNet enables generic input from external files using the MappedFile filter. Use MappedFile to open, read from, and close files on a file system. For example:

MappedFile myfile = new MappedFile("in.jpg");

opens an external image file for reading. MappedFile buffers input and output for better performance. Although it is possible to read input data directly through the Filter interface (MappedFile is a subclass of Filter), it is more convenient to attach a FilterReader to the filter and then read data through FilterReader interface:

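FilterReader reader = new FilterReader(myfile);
// Read the input in fixed-size chunks until Read() reports 0 bytes.
byte[] buffer = new byte[4096];
int bytes;
while ((bytes = reader.Read(buffer)) != 0)
{
  // ...process the first 'bytes' bytes of buffer...
}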

Data associated with SDF stream objects can be accessed using Stream.GetRawStream() or Stream.GetDecodedStream() methods.

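void Extract(Obj stream)
{
  // Build the decoding pipeline and read from its last filter.
  Filter dec_stm = stream.GetDecodedStream();
  FilterReader reader = new FilterReader(dec_stm);

  // Read the decoded data in fixed-size chunks.
  byte[] buffer = new byte[4096];
  int bytes;
  while ((bytes = reader.Read(buffer)) != 0)
  {
    // ...process the first 'bytes' bytes of buffer...
  }
}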

Stream.GetRawStream() creates a Filter used to extract raw data as it appears in a serialized SDF document (or a decrypted version of the stream if the document is secured). Stream.GetDecodedStream() creates a Filter pipeline and returns the last filter in the chain. For example, a given stream may be compressed using JPEG (DCTDecode) compression and then encoded using ASCII85 into an ASCII stream. When GetDecodedStream() is invoked on this SDF stream, it will return the last filter in a chain that is composed of three filters: the file segment input Filter, the ASCII85Decode Filter, and the DCTDecode Filter, in that order, because decoding undoes the encoding steps in reverse. Data extracted from the returned Filter will be raw image data (i.e. RGB byte triples).
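The idea of a decoding chain can be sketched without PDFNet. The example below is a simplified model, not the PDFNet Filter implementation: FilterChainSketch and its methods are hypothetical names, each stage is an eager byte-array function rather than a lazy Filter, and two simple filters from the PDF specification (ASCIIHexDecode and RunLengthDecode) stand in for ASCII85Decode and DCTDecode to keep the code short. It assumes well-formed input.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical model for illustration only; it is not a PDFNet class.
static class FilterChainSketch
{
    // ASCIIHexDecode: pairs of hex digits become bytes, whitespace is
    // skipped, and '>' marks the end of the data.
    public static byte[] AsciiHexDecode(byte[] input)
    {
        var output = new MemoryStream();
        int high = -1;
        foreach (byte b in input)
        {
            char c = (char)b;
            if (c == '>') break;
            if (char.IsWhiteSpace(c)) continue;
            int digit = Convert.ToInt32(c.ToString(), 16);
            if (high < 0) { high = digit; }
            else { output.WriteByte((byte)(high * 16 + digit)); high = -1; }
        }
        // An odd number of digits is treated as if a final 0 followed.
        if (high >= 0) output.WriteByte((byte)(high * 16));
        return output.ToArray();
    }

    // RunLengthDecode: a length byte 0..127 copies the next length + 1 bytes,
    // 129..255 repeats the next byte 257 - length times, 128 ends the data.
    public static byte[] RunLengthDecode(byte[] input)
    {
        var output = new MemoryStream();
        int i = 0;
        while (i < input.Length)
        {
            int len = input[i++];
            if (len == 128) break;
            if (len < 128)
            {
                output.Write(input, i, len + 1);
                i += len + 1;
            }
            else
            {
                byte value = input[i++];
                for (int n = 0; n < 257 - len; n++) output.WriteByte(value);
            }
        }
        return output.ToArray();
    }

    // Runs the raw data through each stage in order, as a decoded stream would.
    public static byte[] Decode(byte[] raw, IEnumerable<Func<byte[], byte[]>> chain)
    {
        byte[] data = raw;
        foreach (Func<byte[], byte[]> stage in chain) data = stage(data);
        return data;
    }

    static void Main()
    {
        byte[] raw = Encoding.ASCII.GetBytes("02 48 69 21 FD 2E 80>");
        var chain = new Func<byte[], byte[]>[] { AsciiHexDecode, RunLengthDecode };
        Console.WriteLine(Encoding.ASCII.GetString(Decode(raw, chain))); // prints "Hi!...."
    }
}
```

The stages are applied in the reverse of the order in which the data was encoded, which is the same reason the decoding chain in the text ends with the image decoder.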

It's possible to iterate through the Filter chain using the Filter.GetAttachedFilter() method. For example, the following code prints out all the Filter names in the filter chain.

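// Walk the chain starting from the filter returned by GetDecodedStream().
// Checking cur_flt itself ensures the final filter in the chain is printed too.
Filter cur_flt = dec_stm;
while (cur_flt != null)
{
  Console.WriteLine(cur_flt.GetName());
  cur_flt = cur_flt.GetAttachedFilter();
}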

It's also possible to construct new filter chains, and to edit existing ones, using the Filter.AttachFilter() method.

Output Filters/Streams

To write a filter to a file, simply use Filter.WriteToFile():

dec_stm.WriteToFile("out.bin", false);

After the output file filter/stream is opened, you can output data using the FilterWriter class:

FilterWriter writer = new FilterWriter(myfile);
writer.WriteString("Hello World");
writer.Flush();

Implementing Custom Filters

PDFNet provides full support for all common Filters used in PDF. Although included Filters should cover all common use case scenarios, advanced users may want to provide custom implementations for certain filters (e.g. custom color conversion, or a new compression method). PDFNet provides an open and expandable architecture for creation of custom filters. To implement a custom Filter, derive a new class from Filter base class and implement the required interface. A more detailed guide for implementing custom Filters is available through PDFTron Systems developer program. Please contact support@pdftron.com for more details.

Security

PDF documents can be secured and encrypted using various encryption schemes. Control over document security in PDFNet is performed through security handlers. Security handlers perform user authorization and set various permissions over PDF documents. Although PDFNet offers an extension mechanism through which users can register custom security handlers, it also provides a standard security handler.

This built-in security handler is the Standard Security Handler (StdSecurityHandler). The Standard Security Handler supports two passwords:

  • A user password that permits a user to open and read a protected document only with whatever permissions the owner chose.
  • An owner password that grants a document's owner free rein over what permissions are granted to users.

An application can also create its own implementation of SecurityHandler. For example, a custom SecurityHandler could perform user authorization requiring the presence of a hardware dongle or even feedback from a biometric system.

A Security Handler is used when:

  • A document is opened. The security handler must determine whether a user is authorized to open the file. It must also set up the RC4 decryption key used to decrypt the file.
  • A document is saved. The security handler must set up the RC4 encryption key and write security information into the PDF file's encryption dictionary.
  • A user tries to change a document's security settings. Note that the Standard Security Handler in PDFNet does not enforce a document's permissions. For example, it is possible to edit a document although document modification permission is not granted. Therefore, it is up to the application to respect PDF permissions.
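Because respecting permissions is left to the application, it can help to see how they are stored. In the PDF Reference, the standard security handler records user permissions as bit flags in the integer "P" entry of the encryption dictionary. The sketch below decodes four of those flags. PermissionBitsSketch, UserAccess, and IsGranted are hypothetical names and are not the PDFNet API; the bit positions (3 for printing, 4 for modifying, 5 for extracting content, 6 for annotating, counting from 1 at the low-order bit) are taken from the PDF Reference.

```csharp
using System;

// Hypothetical helper for illustration only; it is not a PDFNet class.
static class PermissionBitsSketch
{
    // Bit masks for the /P entry; PDF numbers bits from 1 at the low-order end.
    [Flags]
    public enum UserAccess
    {
        Print = 1 << 2,          // bit 3
        Modify = 1 << 3,         // bit 4
        ExtractContent = 1 << 4, // bit 5
        Annotate = 1 << 5,       // bit 6
    }

    // True when the given permission bit is set in the /P value.
    public static bool IsGranted(int p, UserAccess access)
    {
        return (p & (int)access) != 0;
    }

    static void Main()
    {
        int p = -44; // example /P value: printing and extraction allowed
        foreach (UserAccess access in Enum.GetValues(typeof(UserAccess)))
            Console.WriteLine($"{access}: {(IsGranted(p, access) ? "granted" : "denied")}");
    }
}
```

An application that wants to honor a document's permissions would perform a check like this before offering the corresponding operation.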

The number of security handlers associated with a document changes over time. When the document is first opened, it isn't associated with any security handlers. When InitSecurityHandler (or InitStdSecurityHandler) is called on the document, that security handler is associated with the document. And when SetSecurityHandler is called on a document, that security handler is also associated with the document, albeit in a pending state until the document is saved. Until the document is saved with the new security handler, the old security handler rules the document's security.

A document may have both a current and a new security handler associated with it. A PDF document is not fully loaded in memory and decrypted when it is loaded. So to fully decrypt the document, even after applying a new security handler, the original security handler is still required.

Working with Secured/Encrypted Documents

PDFNet fully supports the reading of secured and encrypted PDF documents. To test whether a document requires a password, check the return value of PDFDoc.InitSecurityHandler():

// Open a potentially encrypted document
PDFDoc doc = new PDFDoc("in.pdf");
if (!doc.InitSecurityHandler())
{
  Console.WriteLine(
    "in.pdf requires a password.");
}
else
{
  Console.WriteLine(
    "in.pdf does not require a password.");
}

Because InitSecurityHandler() doesn't have any side effects on documents that are not encrypted, you should always invoke this method, or InitStdSecurityHandler(), after constructing a document.

If a document doesn't require authentication data (such as a user password) in order to view its content, InitSecurityHandler() is enough to work with encrypted documents. If, on the other hand, the document requires a password, InitStdSecurityHandler allows you to provide one:

// Open a potentially encrypted document
PDFDoc doc = new PDFDoc("in.pdf");
if (doc.InitStdSecurityHandler("test"))
{
  // The supplied password was accepted.
  Console.WriteLine(
    "in.pdf's password is 'test'.");
}
else
{
  // The supplied password was rejected.
  Console.WriteLine(
    "in.pdf's password is not 'test'.");
}

After the document's security handler is initialized, you can access it using the doc.GetSecurityHandler() method. You can edit permissions and authorization data on an existing handler, or set a completely new security handler using the doc.SetSecurityHandler(handler) method.

To remove PDF security, set the document's current SecurityHandler to null:

PDFDoc doc = new PDFDoc("encrypted.pdf");
doc.InitSecurityHandler();
doc.SetSecurityHandler(null);

Securing a Document

To secure a document, create a new SecurityHandler, set permission and authentication data, and call doc.SetSecurityHandler(handler) to set it as the new handler. For example:

PDFDoc doc = new PDFDoc("in.pdf");
if (!doc.InitSecurityHandler())
{
    Console.WriteLine(
        "Document authentication error...");
    return;
}

StdSecurityHandler new_handler = new StdSecurityHandler();

// Set a user password required to open a document
string user_password = "test";
new_handler.ChangeUserPassword(user_password);

// Set Permissions
new_handler.SetPermission(
    SecurityHandler.Permission.e_print, true);
new_handler.SetPermission(
    SecurityHandler.Permission.e_extract_content, false);

// Associate the new_handler with the document.
doc.SetSecurityHandler(new_handler);

Implementing Custom Security

Besides providing full support for standard PDF security, PDFNet allows users to work with custom security handlers and proprietary encryption algorithms. To define a custom security handler, derive a class from SecurityHandler and implement SecurityHandler's interface. Please see the PDFNet Knowledge Base or contact support@pdftron.com for more details.