
98% of .com is HTML but 38% of .gov is PDF!

Chart showing prevalence of HTML files across domains: .com 98%, .de 96%, .jp 94%, .org 76%, .gov 47%, .edu 45%.

I’ve been tracking the relative popularity of electronic document file-formats for several years – here’s the February 2014 survey.

Along the way I’ve noticed that the .com domain (and at least some country-specific top-level domains) tends to have a far higher proportion of HTML files than .gov, .edu and .org (see chart).

Let’s go ahead and add HTML

It’s hard to deny that HTML files can take the role of documents. Certainly, many HTML files play utility or fragmentary roles on websites, but some of them – perhaps many – are documents just as much as an .XLS or .RTF file may be. Let’s leave aside (for now) the fact that it might take many HTML pages to make up a single “document”.

So fair enough; this new survey includes .HTML (and .HTM) files. But how to screen out the utility files?

Let’s look at 3 top-level domains

In terms of text document formats, the overall web is more than 98% HTML – no surprises there.

However, a login page (to take one example) isn’t what we’d commonly consider a document. In a substantial (and unknowable) number of cases, these HTML files do not represent documents. By contrast, (almost) every single PDF or DOCX file posted for public access on a webserver was placed there to serve a document function.

Chart showing file-formats in .org, .gov and .edu domains.

Since I was trying to understand how organizations use file-formats to post their documents, restricting the survey’s scope to non-commercial top-level domains was a (crude) attempt on my part to focus on “institutional” websites that might be expected to post more “document” content. As it turns out, these domains do include a far higher proportion of non-HTML content compared to commercial domains (for accessibility, the data is provided in a table below).

To begin with, it’s clear that only HTML, PDF and the various Open XML formats (files typically created by Microsoft’s Word, Excel and PowerPoint applications) account for a meaningful share of the total volume. Note, for example, that Google was only able to find 4 (yes, four) EPUB files on the entire .gov domain.

Chart data

These data are offered without warranty. Search results (counts by file-type) change violently over time due (I guess) to search algorithm changes, or day of the month – who knows? The study will be repeated periodically.

The following data were used to create the charts above:

Domain | HTML           | PDF         | Open XML   | RTF       | ODx     | EPUB
.com   | 12,540,000,000 | 126,000,000 | 32,290,000 | 5,720,000 | 300,000 | 410,000
.de    | 1,195,700,000  | 46,900,000  | 3,500,000  | 1,090,000 | 67,000  | 17,000
.jp    | 1,356,300,000  | 74,300,000  | 6,100,000  | 2,220,000 | 1,700   | 3,900
.gov   | 48,800,000     | 39,400,000  | 14,815,000 | 1,060,000 | 5,852   | 4
.edu   | 55,300,000     | 33,100,000  | 33,620,000 | 1,910,000 | 32,000  | 2,090
.org   | 505,000,000    | 101,100,000 | 53,200,000 | 2,570,000 | 200,000 | 348,000
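
For anyone who wants to check the chart percentages against the table, here is a minimal Python sketch. It assumes each percentage is simply a format's count divided by the sum of all six columns for that domain; the names FORMATS, COUNTS and shares are my own, chosen only for illustration.

```python
# Search-result counts copied from the table above, one row per domain.
FORMATS = ["HTML", "PDF", "Open XML", "RTF", "ODx", "EPUB"]

COUNTS = {
    ".com": [12_540_000_000, 126_000_000, 32_290_000, 5_720_000, 300_000, 410_000],
    ".de":  [1_195_700_000, 46_900_000, 3_500_000, 1_090_000, 67_000, 17_000],
    ".jp":  [1_356_300_000, 74_300_000, 6_100_000, 2_220_000, 1_700, 3_900],
    ".gov": [48_800_000, 39_400_000, 14_815_000, 1_060_000, 5_852, 4],
    ".edu": [55_300_000, 33_100_000, 33_620_000, 1_910_000, 32_000, 2_090],
    ".org": [505_000_000, 101_100_000, 53_200_000, 2_570_000, 200_000, 348_000],
}


def shares(domain):
    """Return each format's percentage of the domain's total count.

    Assumption: the total is the sum of the six columns in the table.
    """
    row = COUNTS[domain]
    total = sum(row)
    return {fmt: 100 * n / total for fmt, n in zip(FORMATS, row)}


if __name__ == "__main__":
    for domain in COUNTS:
        s = shares(domain)
        print(f"{domain:5} HTML {s['HTML']:5.1f}%  PDF {s['PDF']:5.1f}%")
```

Under that assumption, .gov works out to roughly 47% HTML and 38% PDF, which is where the headline figure comes from.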

Notes

  • “Open XML” represents the totals from searches for DOC, DOCX, PPT, PPTX, XLS and XLSX files. Yes, I know that DOC, PPT and XLS files aren’t actually Open XML formats; the point is simply to group the MS Office-originated formats together.
  • “ODx” represents the totals from searches for ODT, ODP and ODS files.
  • All searches were conducted on Mac OS / Chrome from Cambridge, Mass. in the USA on March 7-8, 2014.
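
The grouping described in these notes can be sketched in Python as follows. This is an illustration only: the exact search strings used for the survey are not recorded here, so the use of Google's site: and filetype: operators is an assumption, and the names GROUPS, queries and group_total are hypothetical.

```python
# Extensions grouped into the table's columns, following the notes above.
GROUPS = {
    "HTML": ["html", "htm"],
    "PDF": ["pdf"],
    "Open XML": ["doc", "docx", "ppt", "pptx", "xls", "xlsx"],
    "RTF": ["rtf"],
    "ODx": ["odt", "odp", "ods"],
    "EPUB": ["epub"],
}


def queries(domain, group):
    """Build one search string per extension in the group.

    Assumption: searches combine the site: and filetype: operators.
    """
    tld = domain.lstrip(".")
    return [f"site:.{tld} filetype:{ext}" for ext in GROUPS[group]]


def group_total(counts_by_ext, group):
    """Sum manually recorded per-extension counts into one column total."""
    return sum(counts_by_ext.get(ext, 0) for ext in GROUPS[group])


if __name__ == "__main__":
    for q in queries(".gov", "ODx"):
        print(q)
```

The counts themselves still have to be read off the results page by hand; group_total only adds up whatever per-extension numbers were recorded.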

Of course, your mileage may vary – I’ve noticed different results in different countries, and indeed, on different days of a given week.

