We were excited to attend the recent project presentation entitled: ‘A Case Study on Theses in Oxford’s Institutional Repository: Challenges Meeting the ISO 19005 Standard’ given by Anna Oates, a student involved in the Oxford-Illinois Digital Libraries Placement Programme.
The presentation focused initially on the PDF/A format. PDF/A differs from standard PDF in that it avoids common long-term access issues associated with PDF. For example, a PDF created today may look and behave differently in 50 years' time. This is because many visual aspects of the PDF are not saved into the file itself (for example, PDFs may use font linking instead of font embedding). The standardised PDF/A format attempts to remedy this by embedding metadata within the file and restricting certain aspects commonly found in PDF which could inhibit long-term preservation.
Aspects excluded from PDF/A include:
- Audio and video content
- JavaScript and executable file launches
- All forms of PDF encryption
PDF/A is therefore better suited to the long-term preservation of digital material, as it maintains the integrity of the information included in the source files, be this textual or visual. Oates described PDF/A as having multiple ‘flavours’. PDF/A-1, published in 2005, includes conformance levels A (Accessible: maintains the structure of the file) and B (Basic: maintains the visual aspects only). Versions 2 and 3, published in 2011 and 2012, added conformance level U (Unicode: enabling the embedding of Unicode information) alongside other features such as JPEG 2000 compression and the embedding of arbitrary file formats within PDF/A documents.
Oates specified that different types of documents benefit from different ‘flavours’ of PDF/A: for example, digitised documents are better suited to conformance level B, whereas born-digital documents are better suited to level A.
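For readers who want to see which ‘flavour’ a file claims to be, the sketch below may help. A PDF/A file declares its part and conformance level in its XMP metadata using the `pdfaid:part` and `pdfaid:conformance` properties. The function name `claimed_pdfa_flavour` is our own, and the sketch assumes the XMP packet is stored uncompressed in the file. It only reads the claim and does not validate anything; checking real conformance needs a validator such as veraPDF.

```python
import re
from typing import Optional

# XMP can serialise a property either as an element
# (<pdfaid:part>1</pdfaid:part>) or as an attribute (pdfaid:part="1").
_PART = re.compile(rb"pdfaid:part\s*(?:=\s*[\"']|>)\s*(\d)")
_CONF = re.compile(rb"pdfaid:conformance\s*(?:=\s*[\"']|>)\s*([ABUabu])")


def claimed_pdfa_flavour(data: bytes) -> Optional[str]:
    """Return the flavour the file claims (e.g. 'PDF/A-1b'), or None.

    Hypothetical helper: scans the raw bytes for the XMP identification
    properties. A claim is not proof of conformance.
    """
    part = _PART.search(data)
    if part is None:
        return None  # no PDF/A identification found
    conf = _CONF.search(data)
    level = conf.group(1).decode("ascii").lower() if conf else ""
    return f"PDF/A-{part.group(1).decode('ascii')}{level}"
```

To try it on a file, read the bytes with `open(path, "rb").read()` and pass them to `claimed_pdfa_flavour`.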
Whilst specifying the benefits of PDF/A, Oates also highlighted the myriad issues associated with the format. Firstly, while experimenting with creating PDF/A documents and converting existing files to conform, she noted that the conformed documents had slight differences, such as changes to the colour of pixels in embedded image files (conversions made with programs like PDF Studio showed less difference in pixel colour). In her view, this showed a clear alteration of the authenticity of the original source file.
Secondly, Oates noted that when converting files from PDF to PDF/A-1b, smart software would change the decode filter of the image (e.g. from JPXDecode, used for JPEG 2000, to DCTDecode, which is accepted by ISO 19005) in order to ensure it would conform to ISO 19005. However, she noted that, although this avoided non-conformance, the software had increased the file size of the PDF by 65%. The file size increase poses obvious issues with regard to storage and cost for organisations using PDF/A.
Format uptake was also discussed by Oates. She found that PDF/A had not been widely utilised by universities in the UK for the long-term preservation of dissertations and theses. However, Oates provided examples of users of PDF/A for Electronic Theses and Dissertations Repositories that included: Concordia University, Johns Hopkins University, McGill University, Rutgers University, University of Alberta, University of Oulu and Virginia Tech. Alongside this, it was mentioned that uptake amongst Research and Cultural Heritage Institutions included: the Archaeology Data Service (ADS), British Library, California Digital Library, Data Archiving and Networked Services (DANS), the Library of Congress and the U.S. National Archives and Records Administration (NARA).
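As a rough illustration of the filter change and size increase described here, the sketch below counts the stream filter names that appear in a PDF's raw bytes and works out a percentage size change. The function names are our own, and the approach is an assumption-laden shortcut: filter names stored inside compressed object streams will not be seen, so a proper PDF parser would be needed for a reliable answer.

```python
import re
from collections import Counter

# Filter names as they appear in PDF stream dictionaries.
_FILTERS = re.compile(
    rb"/(JPXDecode|DCTDecode|FlateDecode|CCITTFaxDecode|JBIG2Decode|LZWDecode)\b"
)


def count_filters(data: bytes) -> Counter:
    """Count occurrences of each known filter name in the raw bytes."""
    return Counter(m.group(1).decode("ascii") for m in _FILTERS.finditer(data))


def uses_jpeg2000(data: bytes) -> bool:
    """True if any JPXDecode (JPEG 2000) filter name is visible.

    Such images have to be re-encoded when converting to PDF/A-1.
    """
    return count_filters(data)["JPXDecode"] > 0


def size_increase_percent(before: int, after: int) -> float:
    """Percentage change in size from `before` bytes to `after` bytes."""
    if before <= 0:
        raise ValueError("original size must be positive")
    return 100.0 * (after - before) / before
```

For instance, a file that grows from 100 bytes to 165 bytes gives `size_increase_percent(100, 165)` of `65.0`, matching the scale of increase reported in the presentation.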
“Adobe Preflight has failed to recognize most of the glyph errors. As such, veraPDF will remain our final tool for validation.” (Anna Oates)
Oates therefore concluded that PDF/A was not the best solution to PDF preservation. She also mentioned that the new ISO standard would cause new issues and considerations for PDF/A users.
Following the presentation, the audience debated whether PDF/A should still be used. Some considered whether other solutions to PDF preservation existed; one proposed solution was to keep both the PDF/A and the original PDFs. However, many still felt that PDF/A provided the best solution available despite its various drawbacks.
Hopefully Oates’ findings will highlight the various areas needed for improvement in both PDF/A conversion and validation software and the conformance aspects of the ISO 19005 standard used by PDF/A, to ensure it is up to the task of digital preservation.
To learn more about PDF/A, have a look at the PDF Association’s own e-book, PDF/A in a Nutshell.
Alice, Ben and Iram (Trainee Digital Archivists)
Hi Duff Johnson,
Thank you for your comment. We have amended the author of “PDF/A in a Nutshell” to the PDF Association.
There are a number of inaccuracies in this article that I wish to highlight.
– It’s routine to embed fonts in PDF files. The article implies otherwise.
– It’s not true that digitized documents are “better suited” to conformance level B. It’s simply that conformance level B preserves ONLY visual appearance, nothing else.
– It’s not fair to cite “myriad issues” without being specific about what’s being claimed. I am not aware of “myriad issues” pertaining to PDF/A at all.
– The differences in the documents as noted are compression artifacts, and have nothing to do with PDF/A whatsoever.
– An example is given of converting a PDF that includes JPXDecode to PDF/A. No specific part of PDF/A is specified, however the article states that the image had to be recompressed from JPXDecode to DCTDecode, which implies that PDF/A-1 was the target. If so, then the behavior reported is expected and intended as a function of the user’s choice to downgrade the file to conform to PDF/A-1. They should have used PDF/A-2, which would have retained the JPXDecode-encoded data. It’s as if the user put a color magazine into a black-and-white photocopier… but expected to get a color copy.
– It is stated that “PDF/A is not the best solution to PDF preservation” – ok, so what is better? Answer: nothing. PDF/A *is* in fact the best solution to PDF preservation.
– The second-to-last paragraph includes statements about areas needed for improvement. In fact, what needs improvement is education of end-users about relevant aspects of the specification, and in general, the need to use the most current PDF/A specification in order to best accommodate content in more advanced PDF files.
– “PDF/A in a Nutshell” is actually a PDF Association publication, not an Adobe publication.