January 15, 2013

Most website developers and web content managers prefer to pretend that the world’s chosen portable document format, better known as PDF, doesn’t exist.

Yes, if you post a PDF, your content management system will grudgingly provide a link. So far as your CMS is concerned, however, that PDF could be the company’s annual report or it could be scanned receipts from sales meetings in Vegas. It might contain 3, 30 or 30,000 pages – it might contain more content than the rest of your website!

Most web content management software cannot even parse PDF document metadata in order to populate its database. If your CMS is capable of searching within PDF files, it’s almost certainly limited to scraping the content streams, effectively guaranteeing low-quality search results.

Here’s the problem: PDF is everywhere and no one can reasonably predict that it’s going away. There are billions of PDF files on the web and tens of billions more on private servers. As the graph below makes clear, PDF is by far the dominant format for non-HTML documents posted online.

Chart showing PDF as a percentage of electronic document formats on the Web. PDF represents approximately 80% of the PDF, DOC, PPT and XLS files posted online. Time frames tested include April 2011, January 2012, August 2012 and January 2013.

Born before the web to facilitate the sharing of documents between differing computer systems, PDF became the format people use when they need an electronic rendition of a “hard copy” document. Many business, publishing and records-keeping situations require a reliable, flexible and capable analog for paper. PDF remains the only game in town.

Web content managers wouldn’t mind PDF so much if their content management systems knew what to do with PDF. But they don’t. Why not?

PDF may be complex but it’s not news. Adobe published the original specification in 1993 and thousands of third-party implementations have grown up since. In 2008 PDF became ISO 32000, a democratically managed international standard, no longer the property of Adobe Systems.

In the web world, content management developers get to offload the hard work of rendering to the browser. A PDF, by contrast, carries all rendering instructions within itself. You can’t really guess what a PDF page looks like until you process it, and to deal with PDF properly it’s necessary to process it fully. CMS developers have yet to accept this, or to start reaping the rewards that would come from understanding PDF as well as they understand HTML.

PDF files probably play a key (if unsung) role in your organization. Ask your content management vendor what their software does with PDF. It’s likely to be a really short discussion.

Then, check out your website. Count the number of PDF files (hopefully your CMS can help you with that much). You’ll need to look deeper to properly evaluate how significant a portion of your site’s content is in PDF, but counting files is a start.
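If your CMS can’t do the counting for you, a few lines of script will. What follows is a minimal sketch in Python, assuming you have file-system access to a copy of your site’s document root. The names `looks_like_pdf` and `summarize_pdfs` are made up for this illustration, and the check is deliberately crude: it only looks for the `%PDF-` marker at the very start of each file, so a PDF with stray bytes ahead of its header would be missed.

```python
from pathlib import Path

# Every PDF file is expected to begin with this marker.
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(path):
    """Return True if the file begins with the PDF header marker."""
    try:
        with path.open("rb") as fh:
            return fh.read(len(PDF_MAGIC)) == PDF_MAGIC
    except OSError:
        # Unreadable files are simply treated as "not PDF" in this sketch.
        return False


def summarize_pdfs(root):
    """Walk `root` recursively and tally all files versus PDF files."""
    total_files = total_bytes = 0
    pdf_files = pdf_bytes = 0
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        size = path.stat().st_size
        total_files += 1
        total_bytes += size
        # Check the content rather than the extension, so mislabeled
        # files are counted (or excluded) correctly.
        if looks_like_pdf(path):
            pdf_files += 1
            pdf_bytes += size
    return {
        "total_files": total_files,
        "total_bytes": total_bytes,
        "pdf_files": pdf_files,
        "pdf_bytes": pdf_bytes,
        # Share of stored bytes held in PDF; a rough proxy only.
        "pdf_byte_share": pdf_bytes / total_bytes if total_bytes else 0.0,
    }
```

Point `summarize_pdfs` at your own document root and compare `pdf_files` with `total_files`. Bear in mind that the byte share is only a rough proxy: a scanned 300-page report and a one-page text flyer can be wildly different in size for reasons that have nothing to do with how much content they hold.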

You may be surprised by (a) how large a proportion of your content is in PDF and (b) how little your CMS can tell you about your PDFs, or help you in managing them.