October 7, 2015

We already know that PDF documents are the format of choice for final contracts, invoices, manuals and many other documents businesses keep on their own computers, even if they also store them in a cloud repository.

But is it the same story online? Do users tend to post Word files? Would they prefer to integrate PowerPoint slides directly into web pages, or do they post a PDF instead?

For this latest update in my efforts to track electronic document use on the web, I’ve decided to return to the basic question of “fixed” vs. “editable” formats. In 2014 I tried to add the Open Document and EPUB formats, as well as TXT and RTF, but the numbers for those formats moved around even more than for the formats Google formally supports for its “filetype:” search (and those numbers vary a lot).

See the table on this page for chart data.

As in previous such surveys, I’ve provided a table representing this chart’s data.

Why is PDF so dominant?

If anything, PDF’s dominance online has increased; PDF is still very much the format for those placing “hard copy” online. It’s not limited to the IRS; organizations that study their online content are often surprised to find that their PDF files, which may include dozens or hundreds of pages each, actually contain far more “content” than their web pages.

In many cases, company or agency websites are, in reality, simply fancy HTML navigation to help the user find the PDF file they actually need.

What did you say about HTML?!?

Once again: many websites include far more content in PDF pages than in HTML. If your website is a .gov, .org or .edu, the chances are good that more than 80% of your actual text is in PDF files. Of course, the vast majority of .com sites include far more HTML than PDF… but where documents are a meaningful part of the content, they often represent a high proportion of volume in both text and traffic.

I’ve been asked: “Why don’t you include HTML in this survey?” There are three reasons, which I’ll reiterate here:

  1. The number of HTML files Google reports is vastly greater than the total of document files reported, but we already knew that HTML files are the primary “stuff” of the web.
  2. The point is to assess “documents” as distinct from content that can’t be captured from its host website without a high risk of changes or loss of usability. Of course, many PowerPoint files can’t even survive leaving the author’s computer, which is one of the reasons why we need PDF in the first place.
  3. A single HTML page is rarely a “document” in the same sense as a PDF file. A PDF might be a single-page invoice, a 40-page catalog, a 500-page annual report or a 5,000-page building plan complete with oversize drawings, layers and 3D models. It might include pages from five different sources, including scanned pages. By contrast, an HTML file is usually a single page. Very commonly, it takes three HTML pages to deliver a short article on a news website.

Chart data

These data are proportional, not absolute. The actual search results (counts by file-type) change over time due to search algorithm changes, or day of the month – who knows?
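For readers who want to reproduce the "proportional, not absolute" step, here is a minimal Python sketch of how raw result counts could be turned into percentage shares. The function name `to_shares` and the sample counts are my own illustrative assumptions; the counts are invented and are not survey results.

```python
def to_shares(counts):
    """Convert raw result counts per format into whole-number percentage shares."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no results to normalize")
    # Each share is rounded on its own, so a row may sum to 99 or 101.
    return {fmt: round(100 * n / total) for fmt, n in counts.items()}


# Hypothetical counts, invented purely for illustration (not survey data).
sample_counts = {"pdf": 800, "doc": 150, "ppt": 30, "xls": 20}
shares = to_shares(sample_counts)
print(shares)
```

Because only the ratios matter, doubling every raw count leaves the shares unchanged, which is why proportions can stay steady even when the reported totals swing from day to day.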

While the raw numbers fluctuate, there’s (relative) consistency in the proportions, which makes me think that the data are reasonable regardless of whatever search model Google is offering on my test days. The chart’s data is provided in tabular form:

2011, April       81%   13%   3%   3%
2012, January     86%   10%   3%   1%
2013, January     79%   17%   2%   1%
2014, February    82%    6%   6%   6%
2014, February    79%   18%   2%   2%
2015, August      89%    9%   1%   1%

Searches were conducted using “filetype:” search strings on the following file-types: PDF, DOC, DOCX, PPT, PPTX, XLS, XLSX. All searches were run in Chrome on Mac OS from Winchester, Mass. in the USA. Of course, your mileage may vary – I’ve noticed different results in different countries, and indeed, on different days of a given week.
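As a rough illustration of the search strings described above, the following Python sketch builds one “filetype:” query per format and a matching search URL. The helper names and the URL pattern are assumptions of mine rather than a record of how the searches were actually run, and the sketch does not fetch anything: the result counts still have to be read off the results page by hand.

```python
from urllib.parse import quote_plus

# The seven file-types listed in the survey method.
FILE_TYPES = ["pdf", "doc", "docx", "ppt", "pptx", "xls", "xlsx"]


def filetype_query(ext):
    """Return the bare search string for one file extension."""
    return f"filetype:{ext}"


def search_url(ext):
    """Return a search URL for the query (URL pattern is an assumption)."""
    return "https://www.google.com/search?q=" + quote_plus(filetype_query(ext))


queries = [filetype_query(ext) for ext in FILE_TYPES]
for ext in FILE_TYPES:
    print(filetype_query(ext), search_url(ext))
```

Keeping the extension list in one place makes it easy to repeat the same set of searches on another day or from another location and compare the proportions.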