All posts by iramsafdar

DPC Email Preservation: How Hard Can It Be? Part 2

Source: https://lu2cspjiis-flywheel.netdna-ssl.com/wp-content/uploads/2015/09/email-marketing.jpg

In July last year my colleague Miten and I attended a DPC Briefing Day titled Email Preservation: How Hard Can It Be?  which introduced me to the work of the Task Force on Technical Approaches to Email Archives  and we were lucky enough to attend the second session last week.

Arranging a second session gave Chris Prom (@chrisprom), University of Illinois at Urbana-Champaign and Kate Murray (@fileformatology), Library of Congress, co-chair’s of the Task Force the opportunity to reflect upon and add the issues raised from the first session to the Task Force Report, and provided the event attendees with an update on their progress overall, in anticipation of their final report scheduled to be published some time in April.

“Using Email Archives in Research”

The first guest presentation was given by Dr. James Baker (@j_w_baker), University of Sussex, who was inspired to write about the use of email archives within research by two key texts; Born-digital archives at the Wellcome Library: appraisal and sensitivity review of two hard drives (2016), an article by Victoria Sloyan, and Dust (2001) a book by Carolyn Steedman.

These texts led Dr. Baker to think of the “imagination of the archive” as he put it, the mystique of archival research, stemming from the imagery of  19th century research processes. He expanded on this idea, stating “physically and ontologically unique; the manuscript, is no longer what we imagine to be an archive”.

However, despite this new platform for research, Dr. Baker stated that very few people outside of archive professionals know that born-digital archives exist, yet alone use them. This is an issue, as archives require evidence of use, therefore, we need to encourage use.

To address this, Dr. Baker set up a Born-Digital Access Workshop, at the Wellcome Library in collaboration with their Collections Information Team, where he gathered people who use born-digital archives and the archivists who make them, and provided them with a set of 4 varying case-studies. These 4 case-studies were designed to explore the following:

A) the “original” environment; hard drive files in a Windows OS
B) the view experience; using the Wellcome’s Viewer
C) levels of curation; comparing reformatted and renamed collections with unaltered ones
D) the physical media; asking does the media hold value?

Several interesting observations came out of this workshop, which Dr. Baker organised in to three areas:

  1. Levels of description; filenames are important, and are valuable data in themselves to researchers. Users need a balance between curation and an authentic representation of the original order.
  2. “Bog-standard” laptop as access point; using modern technology that is already used by many researchers as the mode of access to email and digital archives creates a sense of familiarity when engaging with the content.
  3. Getting the researcher from desk to archive; there is a substantial amount of work needed to make the researcher aware of the resources available to them and how – can they remote access, how much collection level description is necessary?

Dr. Baker concluded that even with outreach and awareness events such as the one we were all attending, born-digital archives are not yet accessible to researchers, and this has made me realise the digital preservation community must push for access solutions,  and get these out to users, to enable researchers to gain the insights they might from our digital collections.

“Email as a Corporate Record”

The third presentation of the day was given by James Lappin (@JamesLappin), Loughborough University, who discussed the issues involved in applying archival policies to emails in a governmental context.

His main point concerned the routine deletion of email that happens in governments around the world. He said there are no civil servants email accounts scheduled to be saved past the next 3 – 4 years – but, they may be available via a different structure; a kind of records management system. However, Lappin pointed out the crux in this scenario: government departments have no budget to move and save many individuals email accounts, and no real idea of the numerics: how much to save, how much can be saved?

“email is the record of our age” – James Lappin

Lappin suggested an alternative: keep the emails of the senior staff only, however, this begs the questions, how do we filter out sensitive and personal content?

Lappin posits that auto-deletion is the solution, aiming to spare institutions from unmanageable volumes of email and the consequential breach of data protection.
Autodeletion encourages:

  •  governments to kickstart email preservation action,
  • the integration of tech for records management solutions,
  • actively considering the value of emails for long-term preservation

But how do we transfer emails to a EDRMS, what structures do we use, how do we separate individuals, how do we enforce the transfer of emails? These issues are to be worked out, and can be, Lappin argues, if we implement auto-deletion as tool to make email preservation less daunting , as at the end of the day, the current goal is to retain the “important” emails, which will make both government departments and historians happy, and in turn, this makes archivists happy. This does indeed seem like a positive scenario for us all!

However, it was particularly interesting when Lappin made his next point: what if the very nature of email, as intimate and immediate, makes governments uncomfortable with the idea of saving and preserving governmental correspondence? Therefore, governments must be more active in their selection processes, and save something, rather than nothing – which is where the implementation of auto-deletion, could, again, prove useful!

To conclude, Lappin presented a list of characteristics which could justify the preservation of an individuals government email accounts, which included:

  • The role they play is of historic interest
  • They expect their account to be permanently preserved
  • They are given the chance to flag or remove personal correspondence
  • Access to personal correspondence is prevented except in case of overriding legal need

I, personally, feel this fair and thorough, but only time will tell what route various governments take.

On a side note: Lappin runs an excellent comic-based blog on Records Management which you can see here.

Conclusions
One of the key issues that stood out for me today was, maybe surprisingly, not to do with the technology used in email preservation, but how to address the myriad issues email preservation brings to light, namely the feasibility of data protection, sensitivity review and appraisal, particularly prevalent when dealing in such vast quantities of material.

Email can only be preserved once we have defined what constitutes ’email’ and how to proceed ethically, morally and legally. Then, we can move forward with the implementation of the technical frameworks, which have been designed to meet our pre-defined requirements, that will enable access to historically valuable, and information rich, email archives, that will yield much in the name of research.

In the tweet below, Evil Archivist succinctly reminds us of the importance of maintaining and managing our digital records…

PASIG 2017: “Sharing my loss to protect your data” University of the Balearic Islands

 

Last week I was lucky enough to be able to attend the PASIG 2017 (Preservation and Archiving Special Interest Group) conference, held at the Oxford University Museum of Natural History, where over the course of three days the  digital preservation community connected to share, experiences, tools, successes and mishaps.

The story of one such mishap came from Eduardo del Valle, Head of the Digitization and Open Access Unit at the University of the Balearic Islands (UIB), in his presentation titled Sharing my loss to protect your data: A story of unexpected data loss and how to do real preservation”. In 2013 the digitisation and digital preservation workflow pictured below was set up by the IT team at UIB.

2013 Digitisation and Digital Preservation Workflow (Eduardo del Valle, 2017)

Del Valle was told this was a reliable system, with fast retrieval. However, he found this was not the case, with slow retrieval and the only means of organisation consisting of an excel spreadsheet used to contain the storage locations of the data.

In order to assess their situation they used the NDSA Levels of Digital Preservation, a tiered set of recommendations on how organisations should build their digital preservation activities, developed by the National Digital Stewardship Alliance (NDSA) in 2012. The guidelines are organised into five functional areas that lie at the centre of digital preservation:

  1. Storage and geographic location
  2. File fixity and data integrity
  3. Information security
  4. Metadata
  5. File formats

These five areas then have four columns (Levels 1-4) which set tiered recommendations of action, from Level 1 being the least an organisation should do, to Level 4 being the most an organisation can do. You can read the original paper on the NDSA Levels here.

The slide below shows the extent to which the University met the NDSA Levels. They found there was an urgent need for improvement.

NDSA Levels of Preservation UIB compliance (Eduardo del Valle, 2017)

“Anything that can go wrong, will go wrong” – Eduardo del Valle

In 2014 the IT team decided to implement a new back up system. While the installation and configuration of the new backup system (B) was completed, the old system (A) remained operative.

On the 14th and 15th November 2014, a backup was created for the digital material generated during the digitisation of 9 rare books from the 14th century in the Tape Backup System (A) and notably, two confirmation emails were received, verifying the success of the backup.  By October 2015, all digital data had been migrated from System (A) to the new System (B), spanning UIB projects from 2008-2014.

However, on 4th November 2014, a loss of data was detected…

The files corresponding to the 9 digitised rare books were lost. This loss was detected a year after the initial back up of the 9 books in System A, and therefore the contract for technical assistance had finished. This meant there was no possibility of obtaining financial compensation, if the loss was due to a hardware or software problem.  The loss of these files, unofficially dubbed “the X-files”, meant the loss of three months of work and it’s corresponding economic loss. Furthermore, the rare books were in poor condition, and to digitise them again could cause serious damage. Despite a number of theories, the University is yet to receive an explanation for the loss of data.

The digitised 14th century rare book from UIB collection (Eduardo del Valle, 2017)

To combat issues like this, and to enforce best practice in their digital preservation efforts, the University acquired Libsafe, a digital preservation solution offered by Libnova. Libsafe is OAIS and ISO 14.721:2012 compliant, and encompasses advanced metadata management with a built-in ISAD(g) filter, with the possibility to import any custom metadata schema. Furthermore, Libsafe offers fast delivery, format control, storage of two copies in disparate locations, and a built-in catalogue. With the implementation of a standards compliant workflow, the UIB proceeded to meet all four levels of the 5 areas of the NDSA Levels of Digital Preservation.

The ISO 14.721:2012 Space Data and Information Transfer Systems – Open Archival Information System – Reference Model (OAIS)  provides a framework for implementing the archival concepts needed for long-term digital preservation and access, and for describing and comparing architectures and operations of existing and future archives, as well as describing roles, processes and methods for long-term preservation.

The use of these standards facilitates the easy access, discovery and sharing of digital material, as well as their long-term preservation. Del Valle’s story of data loss reminds us of the importance of implementing standards-based practices in our own institutions, to minimise risk and maximise interoperability and access, in order to undertake true digital preservation.

 

With thanks to Eduardo del Valle, University of the Balearic Islands.

PDF/A: Challenges Meeting the ISO 19005 Standard

Anna Oates (MSLIS Candidate, University of Illinois at Urbana-Champaign and NDNP Coordinator Graduate Assistant, Preservation Services) explaining the differences between PDF and PDF/A

We were excited to attend the recent project presentation entitled: ‘A Case Study on Theses in Oxford’s Institutional Repository: Challenges Meeting the ISO 19005 Standard’ given by Anna Oates, a student involved in the Oxford-Illinois Digital Libraries Placement Programme.

The presentation focused initially on the PDF/A format: PDF/A differs from standard PDF in that it avoids common long term access issues associated with PDF. For example, a PDF created today may look and behave differently in 50 years time. This is because many visual aspects of the PDF are not saved into the file itself, (PDFs use font linking instead of font embedding) the standardised PDF/A format attempts to remedy this by embedding  metadata within the file and restricting certain aspects commonly found in PDF which could inhibit long term preservation.

Aspects excluded from PDF/A include :

  • Audio and video content
  • JavaScript executable files
  • All forms of PDF encryption

PDF/A is better suited therefore for the long term preservation of digital material as it maintains the integrity of the information included in the source files, be this textual or visual. Oates described PDF/A as having multiple ‘flavours’, PDF/A-1 published in 2005 including conformance level A (Accessible – maintains the structure of the file) and B (Basic – maintains the visual aspects only). Versions 2 and 3 published later in 2011 and 2012, were developed to encompass conformance level U (Unicode – enabling the embedding of Unicode information) alongside other features such as JPEG 2000 compression and the embedding of arbitrary file formats within PDF/A documents.

Oates specified that different types of documents benefited from different ‘flavours’ of PDF/A, for example, digitised documents were better suited to conformance level B whereas born digital documents were better suited to level A.

Whilst specifying the benefits of PDF/A, Oates also highlighted the myriad of issues associated with the format.  Firstly, while experimenting with creating and conforming PDF/A documents, she noted the conformed documents had slight differences, such as changes to the colour pixels of embedded image files (PDF/A format showed less difference in the colour of pixels with programs like PDF Studio), this showcased a clear alteration of the authenticity of the original source file.

Oates compared source images to PDF/A converted images and found obvious visual differences.

Secondly,  Oates noted that when converting files from PDF to PDF/A-1b, smart software would change the decode filter of the image (e.g. changing from JPXDecode used for JPEG2000 to DCTDecode accepted by ISO 19005) in order to ensure it would conform to ISO 19005. However, she noted that despite the positives of avoiding non-conformance the software had increased the file size of the PDF by 65%. The file size increase poses obvious issues in regards to storage and cost considerations for organisations using PDF/A.

Oates’ workflow for creation and conformance checking of PDF/A files using different PDF/A software

Format uptake was also discussed by Oates. She found that PDF/A had not been widely utilised by Universities for long term preservation of dissertations and thesis in the UK. However, Oates provided examples of users of PDF/A for Electronic Theses and Dissertations Repositories that included: Concordia University, Johns Hopkins University, McGill University, Rutgers University, University of Alberta, University of Oulu and Virginia Tech.  Alongside this it was mentioned that uptake amongst Research and Cultural Heritage Institutions included: the Archaeology Data Service (ADS), British Library, California Digital Library, Data Archiving and Networked Services (DANS), the Library of Congress and the U.S. National Archives and Records Administration (NARA).

“Adobe Preflight has failed to recognize most of the glyph errors. As such, veraPDF will remain our final tool for validation.” (Anna Oates)

Oates therefore concluded that PDF/A was not the best solution to PDF preservation, she mentioned that the new ISO standard would cause new issues and considerations for PDF/A users.

Following the presentation the audience debated whether PDF/A should still be used. Some considered whether other solutions existed to PDF preservation; an example of a proposed solution was to keep both PDF/A and the original PDFs. However, many still felt that PDF/A provided the best solution available despite its various drawbacks.

Hopefully Oates’  findings will highlight the various areas needed for improvement in both PDF/A  conversion/ validation software and conformance aspects of the ISO 19005 Standard used by PDF/A to ensure it is up to the task of digital preservation.

To learn  more about PDF/A have a look at Adobe’s own e-book PDF/A In a Nutshell.

Alice, Ben and Iram (Trainee Digital Archivists)

Email Preservation: How Hard Can it Be? DPC Briefing Day

Miten and I outside the National Archives

Miten and I outside the National Archives, looking forward to a day of learning and networking

Last week I had the pleasure of attending a Digital Preservation Coalition (DPC) Briefing Day titled Email Preservation: How Hard Can it Be? 

In 2016 the DPC, in partnership with the Andrew W. Mellon Foundation, announced the formation of the Task Force on Technical Approaches to Email Archives to address the challenges presented by email as a critical historical source. The Task Force delineated three core aims:

  1. Articulating the technical framework of email
  2. Suggesting how tools fit within this framework
  3. Beginning to identify missing elements.

The aim of the briefing day was two-fold; to introduce and review the work of the task force thus far in identifying emerging technical frameworks for email management, preservation and access; and to discuss more broadly the technical underpinnings of email preservation and the associated challenges, utilising a series of case studies to illustrate good practice frameworks.

The day started with an introductory talk from Kate Murray (Library of Congress) and Chris Prom (University of Illinois Urbana-Champaign), who explained the goals of the task force in the context of emails as cultural documents, which are worthy of preservation. They noted that email is a habitat where we live a large portion of our lives, encompassing both work and personal. Furthermore, when looking at the terminology, they acknowledged email is an object, several objects and a verb – and it’s multi-faceted nature all adds to the complexity of preserving email. Ultimately, it was said email is a transactional process whereby a sender transmits a message to a recipient, and from a technical perspective, a protocol that defines a series of commands and responses that operate in a manner like a computer programming language and which permits email processes to occur.

From this standpoint, several challenges of email preservation were highlighted:

  • Capture: building trust with donors, aggregating data, creating workflows and using tools
  • Ensuring authenticity: ensuring no part of the email (envelope, header, and message data etc.) have been tampered with
  • Working at scale: email
  • Addressing security concerns: malicious content leading to vulnerability, confidentiality issues
  • Messages and formats
  • Preserving attachments and linked/networked documents: can these be saved and do we have the resources?
  • Tool interoperability

 

The first case study of the day was presented by Jonathan Pledge from the British Library on “Collecting Email Archives”, who explained born-digital research began at the British Library in 2000, and many of their born-digital archives contain email.  The presentation was particularly interesting as it included their workflow for forensic capture, processing and delivery of email for preservation, providing a current and real life insight into how email archives are being handled. The British Library use Aid4Mail Forensic for their processing and delivery, however, are looking into ePADD as a more holistic approach. ePADD is a software package developed by Standford University which supports archival processes around the appraisal, ingest, processing, discovery and delivery of email archives. Some of the challenges they experienced surrounded the issue of email as often containing personal information. A possible solution would be the redaction of offending material, however they noted this could lead to the loss of meaning, as well as being an extremely time-consuming process.

Next we heard from Anthea Seles (The National Archives) and Greg Falconer (UK Government Cabinet Office) who spoke about email and the record of government. Their presentation focused on the question of where the challenge truly lies for email – suggesting that, opposed to issues of preservation, the challenge lies in capture and presentation. They noted that when coming from a government or institutional perspective, the amount of email created increases hugely, leaving large collections of unstructured records. In terms of capture, this leads to the challenge of identifying  what is of value and what is sensitive. Following this, the major challenge is how to best present emails to users – discoverability and accessibility. This includes issues of remapping existing relationships between unstructured records, and again, the issue of how to deal with linked and networked content.

The third and final case study was given by Michael Hope, from Preservica; an “Active Preservation” technology, providing a suite of (Open Archival Information System) compliant workflows for ingest, data management, storage, access, administration and preservation for digital archives.

Following the case studies, there was a second talk from Kate Murray and Chris Prom on emerging Email Task Force themes and their Technology Roadmap. In June 2017 the task force released a Consultation Report Draft of their findings so far, to enable review, discussion and feedback, and the remainder of their presentation focused on the contents and gaps of the draft report. They talked about three possible preservation approaches:

  • Format Migration: copying data from one type of format to another to ensure continued access
  • Emulation: recreating user experience for both message and attachments in the original context
  • Bit Level Preservation: preservation of the file, as it was submitted (may be appropriate for closed collections)

They noted that there are many tools within the cultural heritage domain designed for interoperability, scalability, preservation and access in mind, yet these are still developing and improving. Finally, we discussed what the possible gaps of the draft report, and issues such as  the authenticity of email collections were raised, as well as a general interest in the differing workflows between institutions. Ultimately, I had a great time at The National Archives for the Email Preservation: How Hard Can it Be? Briefing Day – I learnt a lot about the various challenges of email preservation, and am looking forward to seeing further developments and solutions in the near future.