80% of non-HTML documents posted online are PDFs. Deal with it.

Most website developers and web content managers prefer to pretend that the world’s chosen portable document format, better known as PDF, doesn’t exist.

Yes, if you post a PDF, your content management system will grudgingly provide a link. So far as your CMS is concerned, however, that PDF could be the company’s annual report or it could be scanned receipts from sales meetings in Vegas. It might contain 3, 30 or 30,000 pages – it might contain more content than the rest of your website!

Most web content management software cannot even parse PDF document metadata in order to populate its database. If your CMS is capable of searching within PDF files, it’s almost certainly limited to scraping the content streams, effectively guaranteeing low-quality search results.
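To make "populate its database" concrete, here is a minimal sketch of what a CMS might store once a PDF has been properly processed. It assumes that a PDF library (not shown) has already extracted the values. The `PdfRecord` type, the `pdf_documents` table and the function names are all hypothetical, chosen for illustration. The title, author, subject and keywords fields mirror standard entries in a PDF's document information dictionary.

```python
import sqlite3
from dataclasses import dataclass
from typing import Optional


@dataclass
class PdfRecord:
    """Hypothetical record; values are assumed to come from a PDF library."""
    url: str
    title: Optional[str]
    author: Optional[str]
    subject: Optional[str]
    keywords: Optional[str]
    page_count: int


def create_table(conn):
    # One row per PDF, keyed by the URL at which it is published.
    conn.execute(
        """CREATE TABLE IF NOT EXISTS pdf_documents (
               url TEXT PRIMARY KEY,
               title TEXT,
               author TEXT,
               subject TEXT,
               keywords TEXT,
               page_count INTEGER NOT NULL
           )"""
    )


def upsert(conn, rec):
    # Re-indexing the same URL replaces the earlier row.
    conn.execute(
        "INSERT OR REPLACE INTO pdf_documents "
        "(url, title, author, subject, keywords, page_count) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (rec.url, rec.title, rec.author, rec.subject,
         rec.keywords, rec.page_count),
    )
    conn.commit()


def search_by_title(conn, term):
    # PDFs with no title metadata never match; they need other handling.
    return conn.execute(
        "SELECT url, title FROM pdf_documents "
        "WHERE title LIKE ? ORDER BY url",
        (f"%{term}%",),
    ).fetchall()
```

Even a table this small would let a CMS tell the annual report from the scanned receipts, provided the files carry usable metadata in the first place.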

Here’s the problem: PDF is everywhere, and no one can reasonably predict that it’s going away. There are billions of PDF files on the web and tens of billions more on private servers. As the graph below makes clear, PDF is by far the dominant format for non-HTML documents posted online.

Chart showing PDF as a percentage of electronic document formats on the Web. PDF represents approximately 80% of the PDF, DOC, PPT and XLS files posted online. Time frames tested include April 2011, January 2012, August 2012 and January 2013.

Born before the web to facilitate the sharing of documents between differing computer systems, PDF became the format people use when they need an electronic rendition of a “hard copy” document. Many business, publishing and records-keeping situations require a reliable, flexible and capable analog for paper. PDF remains the only game in town.

Web content managers wouldn’t mind PDF so much if their content management systems knew what to do with PDF. But they don’t. Why not?

PDF may be complex but it’s not news. Adobe published the original specification in 1993 and thousands of third-party implementations have grown up since. In 2008 PDF became ISO 32000, a democratically managed international standard, no longer the property of Adobe Systems.

In the web world, content management developers get to offload the hard work of rendering to the browser. A PDF, by contrast, carries all rendering instructions within itself. You can’t really guess what a PDF page looks like until you process it, and to deal with PDF properly it’s necessary to process it fully. CMS developers have yet to accept this, or to start reaping the rewards that would come from understanding PDF as well as they understand HTML.

It’s likely that PDF files play a key (if unsung) role in your organization. Ask your content management vendor what their software does with PDF. It’s likely to be a really short discussion.

Then, check out your website. Count the number of PDF files (hopefully your CMS can help you with that much). You’ll need to look deeper to properly evaluate how significant a portion of your site’s content is in PDF, but counting files is a start.
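If your CMS can’t do the counting for you, a rough tally is easy to produce yourself. The sketch below assumes you have a local copy of the site’s published files in a directory; the function names are hypothetical. It counts files and bytes by extension, then reports PDF’s share of each.

```python
import os
from collections import Counter
from pathlib import Path


def summarize_by_extension(root):
    """Return (file_counts, byte_totals), keyed by lowercase extension."""
    counts = Counter()
    sizes = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            # Normalize case so "Report.PDF" and "report.pdf" count together.
            ext = path.suffix.lower() or "(none)"
            counts[ext] += 1
            sizes[ext] += path.stat().st_size
    return counts, sizes


def pdf_share(counts, sizes):
    """Return PDF's share of files and of bytes, each between 0 and 1."""
    total_files = sum(counts.values())
    total_bytes = sum(sizes.values())
    file_share = counts[".pdf"] / total_files if total_files else 0.0
    byte_share = sizes[".pdf"] / total_bytes if total_bytes else 0.0
    return file_share, byte_share
```

Calling `pdf_share(*summarize_by_extension("path/to/site"))` gives both fractions for a directory of your choosing. Neither figure measures content: file size is a crude proxy at best, since a scanned page can outweigh a hundred pages of text, which is why a proper evaluation means looking inside the files.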

You may be surprised by (a) how large a proportion of your content is in PDF and (b) how little your CMS can tell you about your PDFs, or help you in managing them.


3 comments

  1. January 22, 2013 at 06:52

    Very interesting reading. In addition to using PDF on web pages, our clients, who map files and content of any format into K-Map structures, combine these files into a single PDF for purposes such as transmittal documents, collaborative platforms and archiving.

  2. February 10, 2013 at 07:14

    Agree with the previous commenter that this is interesting reading, and welcome the stats on PDFs.

    It was particularly refreshing to read the comment within the post that ‘Most web content management software cannot even parse PDF document metadata in order to populate its database.’ Finally – someone else who understands!

    Working extensively with a large collection of PDF files on a daily basis, I find it particularly frustrating that a) the built-in MS Word ‘Save as PDF’ function doesn’t transfer the Custom File Properties to the PDF file, and b) even if it did, SharePoint (or any other CMS I’ve used) doesn’t pick up on them out of the box anyway.

    I’ve often wondered why this functionality doesn’t exist and can only conclude it probably would if people demanded it. Does it follow, then, that people simply aren’t aware of the complex metadata properties that can be embedded within a PDF and the benefits that could flow from this – particularly when it comes to moving content and the associated metadata?

    A final frustration with this format is that the Windows Search Service doesn’t seem to index PDF Portfolios/Email Archives that can be made with the Acrobat plugin from Outlook!

  3. March 4, 2013 at 07:31

    Great thought. If you take the first 100 pages Google returns for each website, you see that in some cases only 4 of those 100 are in HTML (e.g. http://www.umic.pt). The relevant information about these entities is in PDF. It may become difficult to keep the PDF format out of the web accessibility validators used to produce web accessibility benchmarks.

