pdfGoHTML: PDF Reflow Done Right

Screen shot of pdfGoHTML representing the tags in the PDF Association's PDF/UA flyer.

Last week I wrote about Acrobat’s export to HTML feature, how it was missing from Acrobat X and XI, and how Adobe has made it available once again.

Today we’re going to talk about an interesting implementation based on the idea of converting well-tagged PDF to HTML.

pdfGoHTML is a free plug-in by callas software for Adobe Acrobat Professional on Mac and Windows that converts PDF files into clean, easily reusable HTML which is then styled in a variety of ways to meet different needs.

I interviewed Olaf Drümmer, the founder and CEO of callas software, to understand his company’s objectives for the software, and what he thinks accessibility advocates and government agencies should be doing with it.

Duff Johnson: pdfGoHTML makes HTML from PDF, yes, but what does it actually do?

Olaf Drümmer: This plug-in does three things:

  • pdfGoHTML exports the content structure of a tagged PDF to HTML, mapping PDF structure elements to suitable HTML tags
  • pdfGoHTML opens the exported HTML file in the user’s default browser
  • Besides using the user’s default browser settings, pdfGoHTML also provides styled views, including:
    • Several disability-centric views, mostly for demonstration purposes, that reflect the needs of certain low vision or dyslexic users.
    • A diagnostic view that makes it easy to evaluate the PDF’s tags.
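To illustrate the first of these steps, here is a minimal Python sketch of mapping PDF structure elements to HTML tags. It is not pdfGoHTML's code: the `TAG_MAP` table, the `StructElem` class and the `to_html` function are hypothetical names invented for this example, and a real converter would have to handle many more structure types and attributes.

```python
from dataclasses import dataclass, field
from html import escape

# Hypothetical mapping from standard PDF structure types to HTML tags.
# Illustration only; this is not the table pdfGoHTML uses.
TAG_MAP = {
    "Document": "div",
    "H1": "h1",
    "H2": "h2",
    "P": "p",
    "L": "ul",
    "LI": "li",
    "Table": "table",
    "TR": "tr",
    "TH": "th",
    "TD": "td",
    "Figure": "img",
}


@dataclass
class StructElem:
    """A simplified stand-in for one node of a PDF structure tree."""
    stype: str                      # structure type, e.g. "P" or "H1"
    text: str = ""                  # text content owned by this node
    alt: str = ""                   # alternate text (used for figures)
    kids: list = field(default_factory=list)


def to_html(elem):
    """Recursively render a structure element and its children as HTML."""
    tag = TAG_MAP.get(elem.stype, "div")  # unknown types fall back to <div>
    if tag == "img":
        # Figures carry their meaning in the Alt text.
        return f'<img alt="{escape(elem.alt, quote=True)}">'
    inner = escape(elem.text) + "".join(to_html(kid) for kid in elem.kids)
    return f"<{tag}>{inner}</{tag}>"


doc = StructElem("Document", kids=[
    StructElem("H1", "PDF/UA flyer"),
    StructElem("P", "Tagged PDF can be reused."),
    StructElem("Figure", alt="Logo"),
])
print(to_html(doc))
```

Because the output follows the structure tree rather than the page layout, the quality of the resulting HTML depends entirely on the quality of the tags.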

DJ: There are many other PDF to HTML engines. Do you know of any others that use tagged PDF?

OD: While Adobe Acrobat 9 included the ability to create HTML and even Word files from tagged PDF, the feature was dropped in Acrobat X and XI; I’m not sure why. The only other tool I’m aware of that exports HTML from the content structure of a tagged PDF is the free PDF Accessibility Checker from Access for All, which generates a styled HTML for accessibility validation purposes.

DJ: I see that the pdfGoHTML plug-in for Adobe Acrobat is free. What was your principal motivation in developing pdfGoHTML if you aren’t going to charge for it?

OD: The idea is essentially that you have to sow before you can reap… As of today too few understand the value of tagged PDF. callas pdfGoHTML aims to highlight how tagged PDF works, and how different types of users can benefit from tagged PDF.

We expect that the market for tagged PDF is only beginning to take shape, and believe it will grow substantially in the coming years. One driving factor is increased demand for accessible PDF, which in essence is a high quality tagged PDF (as defined by the PDF/UA standard just published by ISO in July 2012). Another relevant factor is on the fly repurposing of PDF content for mobile devices, especially given the highly different form factors and user interface concepts.

We are developing functionality in our regular product lines to support creation or fixing of tagged PDF, or export of derivatives from tagged PDF beyond HTML, most notably EPUB.

In addition axaio software, the sister company of callas software, develops the MadeToTag plug-in for InDesign CS 5.5 and CS 6, which makes the creation of well tagged PDF from InDesign documents much faster and more reliable.

For these file creation-oriented offerings we do charge our customers, whether they are end users or OEM partners implementing our technology as a part of their own products and solutions.

DJ: How would you suggest end users get started using this tool? What would you suggest they do to most clearly understand the value it delivers?

OD: A number of common applications including Microsoft Word, PowerPoint and Excel, or Adobe InDesign have a built-in capability to export documents to tagged PDF.

It is very enlightening to review such documents using the diagnostic “Structure View” of callas pdfGoHTML, as it becomes obvious almost immediately where the tagging was done well, and where it simply fails.

Another worthwhile exercise is to open PDF files presented as being well-tagged. The diagnostic view will quickly reveal vast differences between files that all appear well crafted at first glance: some really are, whereas others are painful to consume for disabled and non-disabled users alike.

Last but not least I suggest playing around with the specialized views targeting different disabilities to get a feel for how low vision users interact with content. This will help you understand how to prepare documents in a fashion that doesn’t make the life of vision-impaired or older users unnecessarily difficult.

DJ: How did you come up with the profiles for Easy Reader, Low Vision and Dyslexia modes?

OD: As a proof of concept, Easy Reader View simply tries to represent the user experience when using an ebook reader like the Kindle, Nook or Kobo devices. The Low Vision Views are based on a project by Silas S. Brown, who has invested a lot of time in building a CSS style sheet generator for low vision needs.

The Dyslexic View leverages the results of a Dutch research project from a few years ago. One of the indirect results of this project was a free of charge font called “OpenDyslexic” that makes reading easier for dyslexic readers. A high quality version is also available for commercial licensing.

As many users are not familiar with the means by which people with disabilities optimize their access to electronic content – whether reading email, web content or a PDF document – I find it does help to be confronted with at least some examples. To be honest I wish someone had given these options to me a few years ago, when I started working on PDF/UA and the domain of accessible PDF in general. It would have been so much easier to understand some of the needs of users with disabilities.

DJ: Some users need very specific browser settings to get a good experience with HTML. Traditionally, PDF files have proven extremely difficult for these users. Can pdfGoHTML help these users?

OD: I’d say that in its initial release pdfGoHTML is more a proof-of-concept implementation than something users with disabilities would use on a daily basis. We are currently evaluating how we can provide a more practical version for these users, e.g. a free plug-in for the free Adobe Reader, or maybe even a command line tool or filter, so users can simply grab content from a tagged PDF and use it any way they wish.

Screen shot of the diagnostic tags view.

DJ: The “Structure Tags” viewing mode is like a spotlight for garbled or inconsistent tags! It’s certainly made it easy to spot plenty of errors I’d hoped not to find! Does it report on attributes other than Alt?

OD: Most of the attributes are present in the exported HTML, but would require specific CSS to make them visible. In our commercial products a user can create and use her own CSS styles.

In addition, while we ourselves are learning how to make even better use of the rich information present in tagged PDF we will improve and extend the Structure Tags View. We have to be careful here though, because we must avoid overloading it. It’s very important not to sacrifice its ease of use and intuitive display of the most important properties of the tagged content.

DJ: While the tool includes a “tags” view that’s very helpful in understanding the structure underlying the HTML presentation, there are no error or warning indicators. For example, it’s possible to erroneously tag a table with a Figure tag, or leave some content untagged, or have marked some real content as artifact. Are you planning to add any features to make pdfGoHTML more feasible as a diagnostic tool?

OD: We have no plans to turn pdfGoHTML into a “tagged PDF checker” – this type of functionality will nevertheless be implemented in future versions of our commercial product lines, pdfaPilot and pdfToolbox.

DJ: It’s newly released, so not too many people are aware of pdfGoHTML yet. The comments I’ve seen are very positive. What are your expectations for this application? What should the accessibility community do about this tool?

OD: In the past quite a few members of the accessibility community claimed there is no easy, free of charge access to PDF content, even when such PDFs are well tagged. callas pdfGoHTML intends to fill the gap (and ideally even convince other developers to come up with similar features).

In addition, we have found by using pdfGoHTML ourselves that files “officially” considered to be well-tagged PDF (e.g. against the background of legal requirements in the public sector in North America or Europe) actually have a very low level of accessibility. Using pdfGoHTML makes it so much easier for document creators as well as agencies contracting out creation of accessible documents to figure out quickly and easily whether a given PDF is, in fact, properly tagged.

DJ: What’s the relationship, as you see it, between this tool and Adobe’s Reflow Tool?

OD: From my point of view Adobe’s Reflow tool was a proof-of-concept feature when it was first introduced a decade ago. The tool proves that PDF content can be reflowed and in principle still be displayed nicely and maintain some of the appearance of the original PDF page. Reflow only made partial use of the tagging structure – preferring the sequence in which page content objects are encoded over their intended reading order as defined by the tagging structure. Sadly, it fails for PDF files with non-trivial layouts or use of certain properties, like inverted text (the white text will usually be displayed on white background in the reflow view, making it impossible to read and thus pretty useless).

Very unfortunately Adobe never took the Reflow feature beyond its proof of concept stage, and also very unfortunately, some members in the accessibility community took the Reflow feature for the real thing and started to massage PDF files to work around limitations in the Reflow view.

Clearly it is preferable to fix tools that are broken, rather than to fix files to accommodate broken tools. pdfGoHTML remedies this by also providing a reflowable view of a tagged PDF’s content, but fully honoring the tagging structure, including the reading order defined by it, and by ensuring that the content can be perceived well regardless of the layout complexity or graphic attributes of text or other page content.
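The difference between the two orderings can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not how either Reflow or pdfGoHTML is implemented: the sample data and the names `content_stream`, `structure_order`, `stream_order` and `logical_order` are made up for the example. It relies only on the general tagged PDF idea that page content is marked with identifiers (MCIDs) which the structure tree then references in logical order.

```python
# Page content as (MCID, text) pairs, in the order in which the objects
# happen to be encoded in the page's content stream (hypothetical data).
content_stream = [
    (2, "Sidebar note"),
    (0, "Headline"),
    (1, "Body text"),
]

# MCIDs in the order in which the structure tree references them,
# i.e. the intended reading order (hypothetical data).
structure_order = [0, 1, 2]


def stream_order(content):
    """Return the text in content-stream order, ignoring the tags."""
    return [text for _, text in content]


def logical_order(content, order):
    """Return the text in the reading order defined by the structure tree."""
    by_mcid = dict(content)
    return [by_mcid[mcid] for mcid in order if mcid in by_mcid]


print(stream_order(content_stream))
print(logical_order(content_stream, structure_order))
```

For a simple single-column page the two orders usually coincide, which is why the problem only becomes visible with non-trivial layouts.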

DJ: Will there be a version for Adobe Reader?

OD: Adobe has the last word on plug-ins for Adobe Reader, and we have to obtain a license from Adobe to be able to publish pdfGoHTML as a Reader plug-in. Accordingly, I can’t say for sure at this moment whether this will be the case, but our plan is to make a Reader version available later this year.

DJ: How about a server version that could operate on-demand from a website?

OD: This is possible already today using our commercial callas pdfaPilot 4 server offering, available for Mac, Windows, Linux and Sun Solaris.

DJ: How about a version for iOS or Android?

OD: Good question! All I can say at this moment is: stay tuned!

DJ: Does pdfGoHTML support PDF/UA? What are your plans with respect to PDF/UA?

OD: It does support PDF/UA in the sense that a conforming PDF/UA file is a high quality tagged PDF file, and the better the source material the better the exported HTML. In addition, pdfGoHTML would allow you to quickly verify how well the tags reflect the semantic structures of the document.

DJ: How does pdfGoHTML fit with callas’ other software offerings?

OD: pdfGoHTML illustrates what – in the domain of structured PDF content – is possible with PDFs when they are done right. callas’ other software offerings help users get their PDFs right in numerous ways.

DJ: Thanks for your time, Olaf. Good luck with pdfGoHTML!


