The National Digital Stewardship Alliance (NDSA), a consortium of archivist institutions committed to the long-term preservation of digital information, has published an important new paper analyzing PDF/A-3 from the archivist’s perspective. This post provides some brief analysis and comment.
What is PDF/A-3?
While PDF/A-2 updates PDF/A-1 by building on ISO 32000 instead of Adobe’s PDF 1.4, along with other important changes, it represents no fundamental challenge to archivists’ values.
But PDF/A-3 is different. Responding to commercial demands for an archival document format that could also serve as a container for associated (and possibly non-archival) content, the ISO committee made a single, very simple change, but one that has roiled the archivist community: the ability to embed arbitrary files in documents that are otherwise archival-grade.
The NDSA on PDF/A-3
The report characterizes the general problem of PDF/A-3 as the possibility that PDF/A-3 files may be used as a general-purpose bundling format irrespective of the relative significance of any given item of content, including the PDF/A-3 document itself.
While acknowledging the value of PDF/A-3 for commercial purposes, the NDSA report calls for specific protocols between depositors and archival repositories. PDF/A-3 should only be considered, they say, when workflow and protocols guaranteeing understanding of the relationship between the PDF document and any embedded files are fully established and documented.
It should be emphasized that the NDSA is not opposed to processing embedded files; it understands the value proposition and wants PDF/A-3 processors. The report says, for example:
The requirement for file specification dictionaries with required relationship values in those dictionaries will make associated files embedded in a PDF/A-3 more obvious and discoverable through generic PDF/A-3 viewers.
Nonetheless, the report concludes with a recommendation to software developers to provide tooling allowing users to identify cases where PDF/A-3 metadata is unnecessary (either no embedded files at all or all embedded files are PDF/A). This is, essentially, a way to “just say no” to PDF/A-3.
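To make that recommendation concrete, here is a minimal sketch of the decision such tooling would have to make. It is an illustration, not anything taken from the report: the `EmbeddedFile` record and the `needs_pdfa3` function are hypothetical names, and the `is_pdfa` flag is assumed to come from a separate validation step that this sketch does not perform.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class EmbeddedFile:
    """Hypothetical summary of one file embedded in a PDF."""
    name: str
    is_pdfa: bool  # assumption: set by some external PDF/A validation step


def needs_pdfa3(embedded: Iterable[EmbeddedFile]) -> bool:
    """Return True only if at least one embedded file is not PDF/A.

    With no embedded files, or with only PDF/A embedded files,
    the PDF/A-3 claim is unnecessary in the sense the report describes.
    """
    return any(not f.is_pdfa for f in embedded)


if __name__ == "__main__":
    print(needs_pdfa3([]))                                   # no embedded files
    print(needs_pdfa3([EmbeddedFile("annex.pdf", True)]))    # only PDF/A inside
    print(needs_pdfa3([EmbeddedFile("data.xml", False)]))    # arbitrary content
```

The hard part in practice is the `is_pdfa` flag itself, which is exactly where the report's concern about validation comes in.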
The NDSA report points out that: “There is currently no robust vendor-independent mechanism for assessing that a PDF/A file does in fact comply fully with the standard and the conformance level it claims in its internal metadata.”
The authors go on to note that a canonical PDF validator is highly desirable (and not only for PDF/A-3) because it would mitigate concerns over PDF’s complexity with respect to its use as a container format, reduce the volume of files bearing invalid metadata and result in higher file quality and reliability overall.
Quite rightly, archival institutions are afraid that users may use a PDF/A-3 file as, effectively, a cover-note for a garbage bag full of who-knows-what. The report considers that PDF/A-3 may be “appropriate for use in controlled workflows….” That’s good, because there’s no reason in principle why PDF/A-3 implementations (and their embedded content) can’t be designed in good faith with archival considerations in mind. Archivists should allow for such cases.
There’s general recognition that a bundling format is needed, and the report spends some time on vital characteristics for such a format, such as those described in ISO/IEC 21320-1. These features may be interesting considerations for a future version of PDF.
In general, however, this report is intended to ensure that archivists understand that PDF/A-3 is emphatically not a newer, faster, stronger PDF/A, and that’s entirely correct. They should not, however, conclude that PDF/A-3 is mistaken or ‘wrong’; it simply has its place.
PDF/A-3 Policy: A Recommendation
For memory institution purposes, where embedded (non-PDF/A) PDF files are concerned, I’d propose requiring that non-PDF/A embedded files always have a companion embedded file that is a PDF/A-conforming rendition.
Example: let’s assume we have a PDF with 3D in it. It seems to me that archivists could accept a PDF/A-3 container in such cases if a PDF/A conforming rendition of that non-PDF/A embedded content (which would have to be a static poster image for the 3D model) was also provided.
Reality will never be under full control, but some of it must be archived anyway. Such a policy would ensure the best of both worlds (and until we have PRC/A or U3D/A or PDF/E-2 this is the best one could get).
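As a rough sketch of how a repository might check this policy at ingest, consider the following. Everything here is an assumption made for illustration: the `EmbeddedFile` record, its `rendition_of` field, and the file names are hypothetical, and the `is_pdfa` flag is again assumed to come from a separate validator.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class EmbeddedFile:
    """Hypothetical summary of one file embedded in a PDF/A-3 container."""
    name: str
    is_pdfa: bool                       # assumption: result of external validation
    rendition_of: Optional[str] = None  # name of the embedded file this one renders


def missing_renditions(files: List[EmbeddedFile]) -> List[str]:
    """List non-PDF/A embedded files that lack a PDF/A companion rendition."""
    # Only a PDF/A-conforming file counts as a valid companion.
    covered = {f.rendition_of for f in files if f.is_pdfa and f.rendition_of}
    return [f.name for f in files if not f.is_pdfa and f.name not in covered]


if __name__ == "__main__":
    files = [
        EmbeddedFile("assembly-3d.pdf", is_pdfa=False),
        EmbeddedFile("assembly-static.pdf", is_pdfa=True,
                     rendition_of="assembly-3d.pdf"),
        EmbeddedFile("spreadsheet.xlsx", is_pdfa=False),
    ]
    print(missing_renditions(files))  # ['spreadsheet.xlsx']
```

A container would be accepted only when the returned list is empty. How the link between a file and its rendition is actually recorded inside the PDF is left open here.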
Let’s look at email
Email software comes in many flavors; there’s no canonical way to do it. Some emails appear very different on different systems, and so on.
Archiving email is nonetheless important, and may have to be done despite any obstacle. Hanging onto the email bitstream is doomed to fail at some point in the future. Just keeping PDF/A renderings (effectively, digital printouts) may fail to preserve every aspect that may seem relevant 50 years from now, but it’s a lot better than nothing. Preserving a robust PDF/A-3 digital printout with the original email bitstream embedded may be the best outcome one could reasonably and cost-effectively achieve in the foreseeable future.
PDF/A-3 may be extremely useful in this regard, or in other cases where archival perfection is simply unrealistic. As such, prohibiting PDF/A-3 seems short-sighted. Instead, a policy of carefully designed workflows that account for, and even leverage, PDF/A-3’s capabilities seems not only worthwhile, but might even lead to highly cost-effective solutions for many commonplace archiving problems.