February 12, 2015

Now in its fifth year and updated annually, this post continues to track document file-format popularity as measured by Google’s “filetype:” search.

Note that in 2014 I added EPUB, Open Document formats (ODT, ODP, ODS), TXT and RTF to the survey. This (naturally) caused a “hit” in the size of the PDF bar relative to the rest.

I’ve provided a table representing this chart’s data.

[Figure: graph showing PDF compared to other formats on the web.]

Why is PDF so dominant?

The story hasn’t changed since last year. PDF is still the format for those who need an electronic “hard copy” document. Although “authoring” formats such as DOC and PPT are posted quite frequently, they aren’t very attractive formats for record-keeping or formal content.

Organizations that study their online content are often surprised to find that their PDF files, which may include dozens or hundreds of pages, contain far more actual content than their web pages.

Bizarrely, the ECM industry remains (in North America, anyhow) stuck on TIFF images, unwilling to take advantage of the power of PDF even as the vast majority of “born digital” documents are PDF files.

What about HTML?

I’ve been asked: “Why don’t you include HTML in this survey?” There are three reasons.

  1. The number of HTML and HTM files Google reports is vastly greater (20 – 50 times) than the total of document files Google reports, so you’d learn nothing beyond the fact that HTML files are the primary substance of the web. Big deal; we knew that.
  2. I’m deliberately restricting scope to those documents which may (generally speaking) be abstracted (downloaded or otherwise captured) from the host website without changes to appearance or utilization. Of course, many PowerPoint files can’t even survive abstraction from the author’s computer, which is one of the reasons why we need PDF in the first place.
  3. For the purposes of this survey an HTML page can’t really be considered a document in any event. Why? A PDF might be a single-page invoice, a 40-page catalog, a 500-page annual report or a 5,000-page building plan complete with oversize drawings, layers and 3D models. It might include pages from five different sources, including scanned pages. By contrast, an HTML file is usually a single page.

Chart data

These data are proportional, not absolute. The actual search results (counts by file-type) change over time due (I guess) to search algorithm changes, or day of the month – who knows? While the raw numbers fluctuate, there’s (relative) consistency in the proportions, which makes me think the data’s reasonable net of whatever search model Google’s offering on my test days. The above chart’s data is provided in tabular form:

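2011, April | 81% | 13% | 3% | 3% | ? | ? | ?
2012, January | 86% | 10% | 3% | 1% | ? | ? | ?
2013, January | 79% | 17% | 2% | 1% | ? | ? | ?
2014, February | 77.3% | 5.5% | 6.0% | 6.1% | 1.4% | 0.8% | 2.9%
2015, February | 71.7% | 16.1% | 1.8% | 1.6% | 1.6% | 1.0% | 6.3%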
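
For anyone who wants to see the arithmetic behind "proportional, not absolute", here is a minimal Python sketch. It is not the script used for this survey: the `shares` helper and the raw counts are hypothetical, invented purely for illustration. It shows two things: adding a new format to the survey shrinks every existing share (PDF included) even when no counts change, and the 2014 and 2015 rows above each sum to roughly 100%, allowing for rounding.

```python
def shares(counts):
    """Convert raw result counts into percentage shares of the total."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no results to compare")
    return {fmt: round(100 * n / total, 1) for fmt, n in counts.items()}


# Hypothetical raw counts, invented for illustration (NOT survey data).
counts_before = {"pdf": 800, "doc": 150, "ppt": 50}
# The same counts plus one newly surveyed format.
counts_after = dict(counts_before, txt=100)

before = shares(counts_before)
after = shares(counts_after)
print(before["pdf"], after["pdf"])  # PDF's share drops although its count is unchanged

# Percentages copied from the 2014 and 2015 rows of the table.
row_2014 = [77.3, 5.5, 6.0, 6.1, 1.4, 0.8, 2.9]
row_2015 = [71.7, 16.1, 1.8, 1.6, 1.6, 1.0, 6.3]
for label, row in (("2014", row_2014), ("2015", row_2015)):
    # Each row should land within rounding distance of 100%.
    print(label, round(sum(row), 1))
```

The point of working with shares rather than counts is that a day-to-day swing in Google's reported totals tends to affect all formats together, so the proportions move far less than the raw numbers do.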

Searches were conducted on the following file-types: PDF, DOC, DOCX, PPT, PPTX, XLS, XLSX, EPUB, ODT, ODP, ODS, TXT, RTF. All searches were conducted using Chrome on Mac OS from Cambridge, Mass. in the USA. Of course, your mileage may vary: I’ve noticed different results in different countries, and indeed, on different days of a given week.
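
If you'd like to repeat the searches yourself, the sketch below builds one “filetype:” query per format in the list above. It is only an illustration, not the procedure used here: the `filetype_query` and `search_url` helpers are hypothetical names, and the URL form (Google's ordinary search page with a `q` parameter) is an assumption. The code does not fetch anything or read result counts; you would still open each URL in a browser and note the reported number by hand.

```python
from urllib.parse import urlencode

# The file-types listed in this survey.
FILE_TYPES = [
    "pdf", "doc", "docx", "ppt", "pptx", "xls", "xlsx",
    "epub", "odt", "odp", "ods", "txt", "rtf",
]


def filetype_query(ext):
    """Return the search string for one file extension, e.g. 'filetype:pdf'."""
    return f"filetype:{ext.lower()}"


def search_url(ext):
    """Build a search URL for one extension.

    Assumption: the standard Google search page taking the query in 'q'.
    """
    return "https://www.google.com/search?" + urlencode({"q": filetype_query(ext)})


for ext in FILE_TYPES:
    print(search_url(ext))
```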

2015-06 UPDATE: I now believe that the EPUB data is bogus / meaningless. I won’t bore you with the details, but the next survey won’t include that format (unless Google adds support for it).