February 17, 2014

This post is an update to my (approximately) twice-annual tracking of document file-format popularity, as measured by Google’s “filetype:” search. Here’s the previous survey, posted in January 2013.

New for 2014, I’ve decided to start tracking EPUB, OpenOffice files (ODT, ODP, ODS), TXT and RTF as well. In case you want all the numbers, I’ve provided a table representing this chart’s data.

Chart displaying data provided at the bottom of the page.

What about HTML?

I’ve been asked: “Why don’t you include HTML in this survey?” There are three reasons.

  1. The number of HTML and HTM files Google reports is vastly greater (20 – 50 times) than the total of document files Google reports, so you’d learn nothing beyond the fact that HTML files are the primary substance of the web. Big deal; we knew that.
  2. I’m deliberately restricting scope to those documents which may (generally speaking) be abstracted (downloaded or otherwise captured) from the host website without changes to appearance or utilization. Of course, many PowerPoint files can’t even survive abstraction from the author’s computer, which is one of the reasons why we need PDF in the first place.
  3. For the purposes of this survey an HTML page can’t really be considered a document in any event. Why?
    A PDF might be a single-page invoice, a 40-page catalog, a 500-page annual report or a 5,000-page building plan complete with oversize drawings, layers and 3D models. It might include pages from five different sources, including scanned pages. By contrast, an HTML page is usually... an HTML page, containing some text that may or may not be a document. In any event, with my humble methodology, I’d have no way to screen out the login pages and other scraps of text from the “content” HTML pages that might be candidates for consideration as a “document”.

Accordingly, I decided it’s just not meaningful to count each individual static .HTML and .HTM file (if that’s what Google is doing) and compare it to the number of .PDF and .DOC files. If you disagree, feel free to start your own survey. You are also welcome to smirk at the fact that this post is in HTML rather than PDF. You won’t be the first, I assure you!

Why is PDF so dominant?

Born before the web to facilitate the exchange of hardcopy documents, PDF is the format people use when they need an electronic “hard copy” document. Many business, publishing and records-keeping applications require a reliable, flexible and capable analog for paper. Some love their TIFF files, but those are pictures, not documents. For the vast majority, PDF remains the only game in town.

Look around. You may be surprised by how large a proportion of your important (and unimportant) content is in PDF. And don’t just count files, as I’ve done in this survey. Organizations that study their online content are often surprised to find that their PDF files, which may include dozens or hundreds of pages, contain far more actual content than their web pages.

Are you leveraging PDF technology?

What does your ECM / SharePoint / CMS / WCM or other system do to help you manage PDF files? PDF technology is far more than electronic paper. The format’s features include:

  • Archival quality control
  • Extensible document and content-level metadata
  • Annotations and fillable forms
  • Security and authenticity
  • Accessibility
  • Attached content
  • Content re-use
  • Redaction
  • Watermarking
  • Page management
  • 3D, video and other rich content
  • Scripting
  • Collation
  • and more…

Many vendors have yet to accept that PDF files play a key role in many of their customers’ organizations, and that better use of PDF might lead to new efficiencies and opportunities.

Ask your content management vendors how their software can support your needs.

Chart data

These data are proportional, not absolute. The actual search results (counts by file-type) change violently over time due (I guess) to search algorithm changes, or day of the month – who knows? While the raw numbers fluctuate, there’s (relative) consistency in the proportions, which makes me think the data’s reasonable net of whatever search model Google’s offering on my irregular test days.
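To illustrate what "proportional" means here, the following Python sketch turns raw result counts into percentage shares. The counts are made-up placeholders rather than real search results, and the function name `shares` is hypothetical; it only shows how raw numbers can double from one test day to the next while the proportions stay the same.

```python
def shares(counts):
    """Convert raw result counts per file type into percentage shares of the total."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no results to compare")
    # Round to one decimal place, as in the 2014 row of the table.
    return {ext: round(100 * n / total, 1) for ext, n in counts.items()}


# Placeholder counts for two imaginary test days (NOT real Google figures).
day_one = {"pdf": 8000, "doc": 1500, "txt": 300, "rtf": 200}
day_two = {"pdf": 16000, "doc": 3000, "txt": 600, "rtf": 400}

print(shares(day_one))
print(shares(day_two))
```

Both calls print the same shares even though every raw count on the second day is twice as large, which is the kind of consistency the survey relies on.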

The above chart’s data is provided in tabular form:

2011, April       81%     13%     3%      3%      ?       ?       ?
2012, January     86%     10%     3%      1%      ?       ?       ?
2012, August      83%     15%     1%      1%      ?       ?       ?
2013, January     79%     17%     2%      1%      ?       ?       ?
2013, June        83%     9%      5%      2%      ?       ?       ?
2014, February    77.3%   5.5%    6.0%    6.1%    1.4%    0.8%    2.9%

Searches are conducted on the following file-types:


All searches were conducted on Mac OS / Chrome from Cambridge, Mass. in the USA. Of course, your mileage may vary – I’ve noticed different results in different countries, and indeed, on different days of a given week.