HTML5 apps offer many advantages over native ones. Web apps:
- Are naturally cross-platform: develop once, run on iOS, Android, Windows Phone and everything else.
- Are easy to update for everyone, immediately.
- Do not have to go through Apple or Google to reach customers (though you still can, by embedding the web app in a native shell app).
But web apps suffer one big problem, and that’s the user experience.
Today, in 2013, even the best-crafted mobile web apps come nowhere near the quality of experience of the best native apps. In fact, with but a few exceptions, the best mobile web apps today still don’t approach the quality of the first batch of native iPhone apps from 2007.
One area of the user experience where HTML5 apps have been historically weak is their ability to display a PDF within the app. For a long time, “viewing” a PDF on the web meant downloading it and opening it in a different program. Next came browser PDF plugins that would take over the browser screen in order to display the PDF. A small improvement, but still not integrated and certainly not a good user experience.
So, if the goal is to integrate PDF viewing into a web app, how can that be done? There are a number of approaches, each with pros and cons. Keep reading to see what techniques exist, and which might be best for your app.
1. Rasterization to images
This is probably the simplest way to get “PDF” onto the web. Take the PDF, convert each page to an image with a command-line tool, and serve the images. Voila. PDF on the web in a format that is compatible with all browsers on all operating systems. However, there are some issues:
- With no vector content, quality is limited at high resolutions.
- Bitmap data is storage- and bandwidth-heavy.
- PDF capabilities such as forms, or a standard method of annotations, are not supported.
- Scalability suffers: rasterization is computationally expensive and storage requirements are large.
- Text selection and indexability require extra work to simulate.
While converting to images may be a good solution for some applications, it is unlikely to be an optimal one. So what can we do?
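To put rough numbers on the storage and resolution trade-off, here is a minimal sketch that estimates the uncompressed size of one rasterized page. The function name, the US Letter page size and the 24-bit RGB assumption are illustrative choices, not something prescribed by any particular converter. Compressed files on disk will be smaller, but the decoded bitmap in the browser is roughly this large.

```python
def raster_size_bytes(width_in: float, height_in: float, dpi: int,
                      bytes_per_pixel: int = 3) -> int:
    """Uncompressed size, in bytes, of one page rasterized at the given DPI."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bytes_per_pixel


# Assumption: a US Letter page (8.5 x 11 inches) in 24-bit RGB.
LETTER = (8.5, 11.0)

for dpi in (72, 150, 300):
    size = raster_size_bytes(*LETTER, dpi)
    print(f"{dpi:>3} DPI: {size / 1_000_000:.1f} MB uncompressed")
```

Doubling the resolution quadruples the pixel count, which is why serving images sharp enough for zooming gets expensive quickly.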
2. HTML DOM
The idea here is to use the browser’s native text rendering and layer it on top of an image that contains all of the non-text data. (This technique is implemented by PDFTron in pdftron.PDF.Convert.ToHtml().) While it sounds like an incremental change from full rasterization, there are some significant advantages:
- Text quality is often preserved. People are especially sensitive to the quality of text, so preserving the vector nature of the glyphs is a big improvement.
- The user can use the browser’s standard text selection and copying capabilities, and the text can also be read by search engine robots.
So while this is a step up from full rasterization, problems remain:
- Quality is still sacrificed for all non-text data, which remains rasterized.
- Accurate text positioning is possible, but it requires a separate DOM element for every letter. Doing this reduces page load speed and the ability to search, index and select text. So one must either accept this limitation or accept somewhat inaccurate text positioning.
- Degrades to full PDF rasterization when text is semi-transparent, partially occluded or covered by transparent objects, pattern-filled objects, etc.
- It is easy for users to save DOM content locally, which is a concern if serving copyrighted content.
- Storage requirements could be significant.
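The positioning trade-off in the list above can be sketched as follows. This is an illustration only, not how any particular converter emits its HTML: the `TextRun` structure, its per-glyph `advances` field and the helper names are all hypothetical. One function emits a single absolutely positioned element per text run (cheap, but glyph spacing is left to the browser's font metrics), the other emits one element per letter (accurate, but many more DOM nodes).

```python
from dataclasses import dataclass
from html import escape
from typing import List


@dataclass
class TextRun:
    """Hypothetical description of one run of text extracted from a PDF page."""
    text: str
    x: float               # left edge, CSS pixels
    y: float               # top edge, CSS pixels
    font_size: float       # CSS pixels
    advances: List[float]  # per-glyph advance widths, CSS pixels


def span(text: str, x: float, y: float, size: float) -> str:
    """One absolutely positioned element holding the given text."""
    return (f'<span style="position:absolute;left:{x:.1f}px;top:{y:.1f}px;'
            f'font-size:{size:.1f}px">{escape(text)}</span>')


def per_run(runs: List[TextRun]) -> List[str]:
    """One element per run: few nodes, browser decides glyph spacing."""
    return [span(r.text, r.x, r.y, r.font_size) for r in runs]


def per_glyph(runs: List[TextRun]) -> List[str]:
    """One element per letter: exact positions, many more nodes."""
    out = []
    for r in runs:
        x = r.x
        for ch, advance in zip(r.text, r.advances):
            out.append(span(ch, x, r.y, r.font_size))
            x += advance
    return out


def page_html(background_url: str, spans: List[str]) -> str:
    """Layer the text elements over the rasterized non-text background."""
    body = "\n".join(spans)
    return (f'<div style="position:relative">\n'
            f'<img src="{escape(background_url)}" alt="">\n'
            f'{body}\n</div>')


runs = [TextRun("Hello", 10.0, 20.0, 12.0, [7.0, 6.0, 3.0, 3.0, 7.0])]
print(len(per_run(runs)), "element(s) per run,",
      len(per_glyph(runs)), "per glyph")
```

Note that the per-glyph variant also splits each word across several elements, which is what hurts selection and indexing.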
3. SVG
The W3C recognized the need to bring high-quality vector graphics to the web, and proposed SVG (Scalable Vector Graphics). At first, this technology seemed very promising: it would deliver the vector data and precise positioning we want, with fonts, gradients, masks and more. A “PDF killer”, some predicted. PDFTron took action and developed the first PDF to SVG converter in 2001. However, widespread adoption of SVG and the supplanting of PDF never came to pass. Why not? Here are a few reasons:
- SVG is not fully compatible with the PDF graphics model (e.g. transparency/blend mode), making it impossible to faithfully reproduce PDF content using SVG.
- A bloated spec designed to also compete with Flash, incorporating scripting and animation, placed a high burden on those wishing to implement the spec completely.
- It is missing support for efficient monochrome compression, which is important for many scanned business documents.
- Worst of all, most implementations were incomplete and buggy. Until IE9, Microsoft did not support SVG at all, and even now there is no support for SVG fonts. In other browsers (Chrome, Firefox) there are many glitches related to text positioning.
SVG had some built-in technical limitations, but its biggest problem was (and still is) a lack of complete and correct implementations within browsers. Ultimately it has found success in certain niches, but it has not experienced widespread adoption for general use cases.
4. HTML5 Canvas
So where does that leave us? Not surprisingly, we are going to take a close look at “HTML5”, specifically the canvas. Does this technology finally deliver the ability to view a PDF inline? Will it succeed where others have come up short?
PDF → JS code → HTML5 Canvas: pdf.js
- Vector graphics
- Render the PDF directly rather than using an intermediate format (such as images or SVG)
- Would not suffer from limitations of the previously outlined techniques
- Consistent behaviour across browsers
From the get-go, pdf.js faced issues on the rendering side. For example, standard HTML5 Canvas does not support paths with dashes, the even-odd fill rule, or PDF blend modes. Since Mozilla developers were in control of their own browser, they were able to patch Firefox with custom extensions (prefixed with moz-… ). Unfortunately, these extensions are not part of the HTML5 standard and are not supported by all browsers, including the dominant mobile browsers. Also, even with all of the custom moz extensions, pdf.js can’t deal with some transparency groups, overprint, some soft masks, non-RGB color spaces, etc. Perhaps one day all browsers will add every extension required to accurately render a PDF; however, the project clearly showed some limitations of implementing a complex graphics system in JS (read our updated guide on PDF.js rendering accuracy).
pdf.js Rendering (left) & Correct Rendering (right)
pdf.js Rendering (left) & Correct PDF rendering (right)
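For readers unfamiliar with the even-odd fill rule mentioned above, here is a small sketch of how it differs from the nonzero winding rule. It is a plain geometric illustration with hypothetical function names, not code taken from pdf.js or from any browser. A renderer that only offers the nonzero rule will fill regions that a PDF using the even-odd rule expects to be holes.

```python
def crossings(point, subpaths):
    """Cast a ray from point toward +x.

    Returns (crossing_count, winding_number) over all closed subpaths,
    each given as a list of (x, y) vertices.
    """
    px, py = point
    count = 0
    winding = 0
    for path in subpaths:
        n = len(path)
        for i in range(n):
            (x0, y0), (x1, y1) = path[i], path[(i + 1) % n]
            # Does this edge straddle the horizontal line through the point?
            if (y0 <= py) != (y1 <= py):
                t = (py - y0) / (y1 - y0)
                x_cross = x0 + t * (x1 - x0)
                if x_cross > px:
                    count += 1
                    # Edge direction decides the sign of its contribution.
                    winding += 1 if y1 > y0 else -1
    return count, winding


def inside_even_odd(point, subpaths):
    """Filled when the ray crosses an odd number of edges."""
    return crossings(point, subpaths)[0] % 2 == 1


def inside_nonzero(point, subpaths):
    """Filled when the signed crossings do not cancel out."""
    return crossings(point, subpaths)[1] != 0


# Two nested squares drawn in the same direction.
outer = [(0, 0), (10, 0), (10, 10), (0, 10)]
inner = [(3, 3), (7, 3), (7, 7), (3, 7)]

center = (5, 5)
print("even-odd fills center:", inside_even_odd(center, [outer, inner]))
print("nonzero fills center: ", inside_nonzero(center, [outer, inner]))
```

With both squares wound the same way, the even-odd rule leaves the inner square empty while the nonzero rule fills it, so the same path data produces visibly different output.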
Mobile browsers do not respond well when they run out of memory: they simply exit, i.e. crash. Because PDF documents can be large and use complex resources, it is not difficult to exceed the limit. (The same issues exist on the desktop, but thanks to large amounts of RAM and virtual memory, they are less critical.) For more information, see our recently published PDF.js reliability benchmark, where we opened 1,663 PDF files in PDF.js.
A solution: PDF → PDFNet → JS code → HTML5: WebViewer
- Optimize the file for fast random access loading. This means that any page could be fetched and displayed regardless of which other pages in the document have already been downloaded.
- Downsample high resolution images so that they do not consume large amounts of memory, which is a real problem on mobile devices.
- Reduce the complexity of a document for accurate and efficient display on mobile devices. This means analyzing a PDF page element-by-element, looking for simplifications and alternate means of representing content that is known to be compatible with HTML5 Canvas. This may also mean rasterizing content that cannot in any way be accurately rendered by an HTML5 Canvas.
- Normalize all images to a form that can be natively decoded by a browser.
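As an illustration of the downsampling step in the list above, the sketch below picks new image dimensions so that the decoded bitmap fits a memory budget while keeping the aspect ratio. The function name, the 4 bytes per pixel (RGBA) assumption and the 16 MiB budget are hypothetical values chosen for the example; they do not describe how PDFNet actually makes this decision.

```python
import math


def downsampled_size(width: int, height: int, max_decoded_bytes: int,
                     bytes_per_pixel: int = 4):
    """Return (width, height) that fits the budget, preserving aspect ratio."""
    decoded = width * height * bytes_per_pixel
    if decoded <= max_decoded_bytes:
        return width, height  # already small enough, leave untouched
    # Pixel count scales with the square of the linear scale factor.
    scale = math.sqrt(max_decoded_bytes / decoded)
    new_w = max(1, math.floor(width * scale))
    new_h = max(1, math.floor(height * scale))
    return new_w, new_h


# Assumption: a 600 DPI US Letter scan and a 16 MiB per-image budget.
budget = 16 * 1024 * 1024
w, h = downsampled_size(5100, 6600, budget)
print(f"5100x6600 -> {w}x{h}, {w * h * 4 / (1024 * 1024):.1f} MiB decoded")
```

Doing this once on the server, rather than on every device at view time, is what keeps a large scanned document from exhausting memory in a mobile browser.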
So how well does this work? After 3+ years of implementing these optimizations for WebViewer, we are able to say that it indeed works very well. Once the PDF has been optimized for web viewing, all of pdf.js’s shortcomings melt away, and viewing is
These optimized documents have also served as a good basis for implementing PDF features such as interactive forms and annotations.
Displaying a PDF within a web browser is by no means trivial. What is clear is that for accurate and reliable viewing, the PDF needs to be “normalized” to a web-friendly representation. Some normalization methods, such as converting to images, do work, but with limitations. Sophisticated normalization, such as what is done for WebViewer, offers an experience that approaches that of a native PDF viewer.
April 2015 Update
What a difference 18 months make. Most of the article above holds; however, new technology and an innovative approach have allowed us to provide reliable and correct in-browser PDF rendering without the need to pre-process. (And no, not by using pdf.js; its problems remain.) Check out the newly released WebViewer 2.0, and our post on PDFNetJS.