January 18, 2013

Back in December I noticed that a feature in Adobe Acrobat I’d always thought very valuable was now missing: the ability to export tagged PDF to HTML or Word using the document’s structure (tags).

What are “tags” in PDF files?

Tags are the feature of PDF that provides reading order and semantic structure – headings, tables, figures, etc. – to text and other graphics encoded on the PDF page.
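To make the idea concrete, here is a minimal Python sketch of what a tags-driven export does in principle. It is an illustration only, not Acrobat's implementation: the `Tag` class, the `TAG_TO_HTML` mapping and the `to_html` function are hypothetical names invented for this example. The structure types shown (H1, P, Table, TR, TH, TD) are standard PDF tag names, and the order of the children stands in for the reading order.

```python
from dataclasses import dataclass, field
from html import escape
from typing import List

# Hypothetical mapping from a few standard PDF structure types to HTML elements.
TAG_TO_HTML = {
    "Document": "div",
    "H1": "h1",
    "P": "p",
    "Table": "table",
    "TR": "tr",
    "TH": "th",
    "TD": "td",
}


@dataclass
class Tag:
    """One node in a simplified tag tree (illustrative, not a real PDF API)."""
    kind: str
    text: str = ""
    children: List["Tag"] = field(default_factory=list)


def to_html(tag: Tag) -> str:
    """Walk the tree in order, so the tag order becomes the HTML order."""
    element = TAG_TO_HTML.get(tag.kind, "div")  # unknown types fall back to <div>
    inner = escape(tag.text) + "".join(to_html(child) for child in tag.children)
    return f"<{element}>{inner}</{element}>"


doc = Tag("Document", children=[
    Tag("H1", "Quarterly report"),
    Tag("P", "Sales & costs by region."),
    Tag("Table", children=[
        Tag("TR", children=[Tag("TH", "Region"), Tag("TH", "Sales")]),
        Tag("TR", children=[Tag("TD", "North"), Tag("TD", "120")]),
    ]),
])

print(to_html(doc))
```

The point of the sketch is that the output depends only on the tag tree: a heading comes out as a heading and a table as a table, wherever the text happens to sit on the page. It also shows why poor tagging produces poor output.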

While I’d certainly noticed the much-improved export to Word feature in Acrobat XI, that new capability is driven by super-smart software that analyzes the page and performs all kinds of clever tricks to produce something very similar in appearance when opened in MS Word.

The new export to HTML and Word functionality is not driven or influenced by the PDF’s tags (if any). I’m not privy to why Adobe chose this route, but I can guess it had to do with the (sad) fact that, even in files that are technically “tagged”, the actual quality of tagging in the vast majority of PDF files is very poor.

That’s not, however, a good reason for removing the functionality altogether! What’s more, it used to be otherwise. In Acrobat 7, 8 and 9, export to HTML and Word used PDF tags; in Acrobat X and XI that feature was not retained.

Screen shot of Pulkit's blog post.

Export Tagged PDF to HTML is Back

I mentioned the subject to Adobe, who responded – I have to say – very quickly for a supertanker-sized software company!

This Adobe blog post, by Pulkit Jain, provides instructions on how to download and install Export to HTML options that use tags, à la Acrobat 9.

With respect to Acrobat XI, this feature more than likely won’t be made available in a maintenance release, but you can freely download the necessary files for yourself from Pulkit’s blog post (or get them from an installation of Acrobat 9, for that matter).