Tag Archives: email

Email Preservation: How Hard Can it Be? DPC Briefing Day


Miten and I outside the National Archives, looking forward to a day of learning and networking

Last week I had the pleasure of attending a Digital Preservation Coalition (DPC) Briefing Day titled Email Preservation: How Hard Can it Be? 

In 2016 the DPC, in partnership with the Andrew W. Mellon Foundation, announced the formation of the Task Force on Technical Approaches to Email Archives to address the challenges presented by email as a critical historical source. The Task Force delineated three core aims:

  1. Articulating the technical framework of email
  2. Suggesting how tools fit within this framework
  3. Beginning to identify missing elements.

The aim of the briefing day was twofold: to introduce and review the work of the task force thus far in identifying emerging technical frameworks for email management, preservation and access; and to discuss more broadly the technical underpinnings of email preservation and the associated challenges, using a series of case studies to illustrate good practice frameworks.

The day started with an introductory talk from Kate Murray (Library of Congress) and Chris Prom (University of Illinois Urbana-Champaign), who explained the goals of the task force in the context of emails as cultural documents that are worthy of preservation. They noted that email is a habitat where we live a large portion of our lives, encompassing both work and personal life. Furthermore, when looking at the terminology, they acknowledged that email is an object, several objects and a verb – and its multi-faceted nature adds to the complexity of preserving email. Ultimately, they described email as a transactional process whereby a sender transmits a message to a recipient and, from a technical perspective, as a protocol that defines a series of commands and responses, operating much like a computer programming language, which permits email processes to occur.

From this standpoint, several challenges of email preservation were highlighted:

  • Capture: building trust with donors, aggregating data, creating workflows and using tools
  • Ensuring authenticity: ensuring no part of the email (envelope, header, message data etc.) has been tampered with
  • Working at scale: email
  • Addressing security concerns: malicious content leading to vulnerability, confidentiality issues
  • Messages and formats
  • Preserving attachments and linked/networked documents: can these be saved and do we have the resources?
  • Tool interoperability

 

The first case study of the day was presented by Jonathan Pledge from the British Library on “Collecting Email Archives”. He explained that born-digital research began at the British Library in 2000, and that many of their born-digital archives contain email. The presentation was particularly interesting as it included their workflow for forensic capture, processing and delivery of email for preservation, providing a current, real-life insight into how email archives are being handled. The British Library use Aid4Mail Forensic for their processing and delivery, but are looking into ePADD as a more holistic approach. ePADD is a software package developed by Stanford University which supports archival processes around the appraisal, ingest, processing, discovery and delivery of email archives. Some of the challenges they experienced stemmed from email often containing personal information. A possible solution would be the redaction of offending material; however, they noted this could lead to the loss of meaning, as well as being an extremely time-consuming process.

Next we heard from Anthea Seles (The National Archives) and Greg Falconer (UK Government Cabinet Office) who spoke about email and the record of government. Their presentation focused on the question of where the challenge truly lies for email – suggesting that, as opposed to issues of preservation, the challenge lies in capture and presentation. They noted that when coming from a government or institutional perspective, the amount of email created increases hugely, leaving large collections of unstructured records. In terms of capture, this leads to the challenge of identifying what is of value and what is sensitive. Following this, the major challenge is how best to present emails to users – discoverability and accessibility. This includes issues of remapping existing relationships between unstructured records and, again, the issue of how to deal with linked and networked content.

The third and final case study was given by Michael Hope from Preservica, an “Active Preservation” technology providing a suite of OAIS (Open Archival Information System) compliant workflows for ingest, data management, storage, access, administration and preservation for digital archives.

Following the case studies, there was a second talk from Kate Murray and Chris Prom on emerging Email Task Force themes and their Technology Roadmap. In June 2017 the task force released a Consultation Report Draft of their findings so far, to enable review, discussion and feedback, and the remainder of their presentation focused on the contents and gaps of the draft report. They talked about three possible preservation approaches:

  • Format Migration: copying data from one type of format to another to ensure continued access
  • Emulation: recreating user experience for both message and attachments in the original context
  • Bit Level Preservation: preservation of the file, as it was submitted (may be appropriate for closed collections)
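As a rough illustration of the first approach, format migration, here is a minimal Python sketch that copies every message out of an mbox file into its own .eml file. It is only an assumption about what such a step might look like, not something presented on the day: the function name `mbox_to_eml` and the numbered output file names are hypothetical, and it relies solely on Python's standard `mailbox` module.

```python
import mailbox
from pathlib import Path


def mbox_to_eml(mbox_path, out_dir):
    """Write each message in an mbox file to its own .eml file.

    Returns the list of paths written, in mailbox order.
    (Hypothetical helper; file naming is an assumption.)
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    # create=False: fail rather than silently create an empty mailbox
    box = mailbox.mbox(str(mbox_path), create=False)
    try:
        for index, key in enumerate(box.iterkeys()):
            # get_bytes() returns the message without the mbox "From " separator line
            raw = box.get_bytes(key)
            target = out / f"{index:06d}.eml"
            target.write_bytes(raw)
            written.append(target)
    finally:
        box.close()
    return written
```

A real migration would also keep the original mbox alongside the new files, so that the migrated copies can be checked against the source.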

They noted that there are many tools within the cultural heritage domain designed with interoperability, scalability, preservation and access in mind, yet these are still developing and improving. Finally, we discussed the possible gaps in the draft report; issues such as the authenticity of email collections were raised, as well as a general interest in the differing workflows between institutions. Ultimately, I had a great time at The National Archives for the Email Preservation: How Hard Can it Be? Briefing Day – I learnt a lot about the various challenges of email preservation, and am looking forward to seeing further developments and solutions in the near future.

Email Preservation: How Hard Can it Be? DPC Briefing Day

On Thursday 6th July 2017 I attended the Digital Preservation Coalition briefing day on email preservation, held in partnership with the Andrew W. Mellon Foundation and titled ‘Email preservation: how hard can it be?’. It was hosted at The National Archives (TNA); this was my first visit to TNA and it was fantastic. I didn’t know a great deal about email preservation prior to this and so I was really looking forward to learning about this topic.

The National Archives, Photograph by Mike Peel (www.mikepeel.net)., CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=9786613

The aim of the day was to engage in discussion about some of the current tools, technologies and thoughts on email preservation. It was orientated around the report of the ‘Task Force on Technical Approaches to Email Archives’, which is currently in its draft phase. We also got to hear about interesting case studies from the British Library, TNA and Preservica, each presenting their own unique experiences in relation to this topic. It was a great opportunity to learn about this area and hear from the co-chairs (Kate Murray and Christopher Prom) and the audience about their thoughts on the current situation and possible future directions.

We heard from Jonathan Pledge from the British Library (BL). He told us about the forensic capture expertise gained by the BL and their use of EnCase to capture email data from hard drives, CDs and USB drives. We also got an insight into how they are deciding which email archive tool to use. Aid4Mail fits better with their workflow; however, ePADD, with its holistic approach, was something they were considering. During their ingest they separate the emails from the attachments. They found that after the time-consuming process of removing emails that would violate data protection laws, there was very little usable content left, as often entire threads would have to be redacted due to one message. This is not the most effective use of an archivist’s time and is something they are working to address.
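To make the idea of separating emails from their attachments at ingest more concrete, here is a minimal Python sketch using the standard `email` package. It is not the British Library's workflow, which relies on the tools named above; the function name `split_message` and its return shape are assumptions made purely for illustration.

```python
from email import message_from_bytes, policy


def split_message(raw_bytes):
    """Return (body_text, attachments) for one raw email message.

    attachments is a list of (filename, payload_bytes) tuples.
    (Hypothetical helper for illustration only.)
    """
    # policy.default gives an EmailMessage with get_body()/iter_attachments()
    msg = message_from_bytes(raw_bytes, policy=policy.default)
    body_part = msg.get_body(preferencelist=("plain", "html"))
    body_text = body_part.get_content() if body_part is not None else ""
    attachments = []
    for part in msg.iter_attachments():
        name = part.get_filename() or "unnamed"
        # decode=True undoes base64/quoted-printable; it yields None for
        # nested multipart parts such as forwarded messages, so fall back to b""
        payload = part.get_payload(decode=True) or b""
        attachments.append((name, payload))
    return body_text, attachments
```

Keeping the attachment list in message order means each file can still be tied back to the email it arrived with, which matters once the two are stored apart.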

We also heard from Anthea Seles who works with government collections at TNA. We learnt that from their research, they discovered that approximately 1TB of data in an organisation’s own electronic document and records management system is linked to 10TB of related data in shared drives. Her focus was on discovery and data analytics. For example, one way to increase efficiency and focus the curator’s attention was to batch email. If an email was sent from TNA to a vast number of people, then there is a high chance that the content does not contain sensitive information. However, if it was sent to a high-profile individual, then there is a higher chance that it will contain sensitive information, so the curator can focus their attention on those messages.
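The batching idea can be sketched in a few lines of Python. This is only a guess at the general shape of such a triage step, not TNA's method: the threshold of 50 recipients and the function names are hypothetical, and it only looks at how widely a message was sent, not at who the recipients are.

```python
from email.utils import getaddresses


def recipient_count(msg):
    """Count distinct addresses across the To, Cc and Bcc headers."""
    fields = []
    for name in ("To", "Cc", "Bcc"):
        fields.extend(str(value) for value in msg.get_all(name, []))
    addresses = {addr.lower() for _, addr in getaddresses(fields) if addr}
    return len(addresses)


def triage(messages, broadcast_threshold=50):
    """Split messages into (review_first, broadcast) batches.

    broadcast_threshold is an assumed cut-off: messages sent to at least
    that many people are treated as low-risk circulars.
    """
    review_first, broadcast = [], []
    for msg in messages:
        if recipient_count(msg) >= broadcast_threshold:
            broadcast.append(msg)
        else:
            review_first.append(msg)
    return review_first, broadcast
```

A curator would then start with the `review_first` batch; a fuller version might also weight messages by a list of high-profile correspondents.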

Hearing from Preservica was interesting as it gave an insight into the commercial side of email archiving. In their view, preservation was not an issue. Instead, their attention was focused on identifying duplicate and unwanted emails efficiently, developing tools for whole-collection email analysis and, interestingly, solving the problem of acquiring emails via a continuous transfer.
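One common way to spot duplicates, sketched below in Python, is to key each message on its Message-ID header and fall back to a hash of the raw bytes when that header is missing. This is an assumption about how duplicate detection could work in general, not a description of Preservica's product; `dedupe_key` and `drop_duplicates` are hypothetical names.

```python
import hashlib
from email import message_from_bytes


def dedupe_key(raw_bytes):
    """Return a key that is equal for messages we treat as duplicates."""
    msg = message_from_bytes(raw_bytes)
    message_id = msg.get("Message-ID")
    if message_id:
        # Copies of one message held in several mailboxes share a Message-ID
        # even when transport headers such as Received differ.
        return "id:" + str(message_id).strip()
    # No Message-ID: only byte-identical copies will match.
    return "sha256:" + hashlib.sha256(raw_bytes).hexdigest()


def drop_duplicates(raw_messages):
    """Keep the first occurrence of each message, preserving order."""
    seen = set()
    unique = []
    for raw in raw_messages:
        key = dedupe_key(raw)
        if key not in seen:
            seen.add(key)
            unique.append(raw)
    return unique
```

Message-IDs are set by the sending system and are not guaranteed to be unique or honest, so a careful workflow would log the dropped copies rather than delete them outright.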

Emails are not going to be the main form of communication forever (the rise in the popularity of instant messaging is clear to see); however, we learnt that growth in its use is still expected for the near future.

One of the main issues that was brought up was the potential size of future email archives and the issues that come with effective and efficient appraisal. What is large in academic terms, e.g. 100 000 emails, is not in government. The figure of over 200 million emails at the George W. Bush presidential library is a phenomenal amount, and the Obama administration’s is estimated at 300 million. This requires smart solutions and we learnt how the use of artificial intelligence and machine learning could help.

Continuous active learning was highlighted as a way to improve searches. An example of searching for Miami dolphins was given. The Miami Dolphins are an American football team; however, someone might also be looking for information about dolphins in Miami. Initially the computer would present different search results and the user would choose which is the more relevant; over time it learns what the user is looking for in cases where searches can be ambiguous.

Another issue that was highlighted was: how do you make sure that you have searched for the correct person? How do you avoid false positives? At TNA the ‘Traces Through Time’ project aimed to do that, initially with World War One records. This technology, which uses big data analytics, can be used with email archives. There is also work on mining the email signature as a way to better determine ownership of the message.

User experience was also discussed. Emulation is an area of particular interest. Its advantage is that it recreates how the original user would have experienced the emails. However, this technology is still being developed. Bit-level preservation is a solution to make sure we capture and preserve the data now. This prevents loss of the archive and allows the information and value to be extracted in the future once the tools have been developed.
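Bit-level preservation only works if you can show that the stored bytes have not changed. A usual way to do that is to record a checksum for every file at ingest and re-check it later. The Python sketch below assumes SHA-256 and a simple in-memory manifest; neither was specified on the day, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large mail stores need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file under root (as a relative POSIX path) to its checksum."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(root, manifest):
    """List files that were added, removed or altered since the manifest."""
    current = build_manifest(root)
    return sorted(
        name
        for name in set(manifest) | set(current)
        if manifest.get(name) != current.get(name)
    )
```

In practice the manifest would be written to disk and stored separately from the archive, so that damage to one does not silently affect the other.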

It was interesting to hear how policy could affect how easy it would be to acquire email archives. The new General Data Protection Regulation, which will come into effect in May 2018, will mean that anyone in breach of it faces heavier penalties, of up to 4% of annual worldwide turnover. This means that companies may err on the side of caution with regard to keeping personal data such as emails.

Whilst the email protocols are well standardised, allowing emails to be sent from one client to another (e.g. from an AOL account of the early 1990s to the Gmail of today), the acquisition of email is not. When archivists get hold of email archives, they are left with the remnants of whatever the email client or user has done to them. This means metadata may have been added or removed and formats can vary. This adds a further level of complexity to the whole process.

The day was thoroughly enjoyable. It was a fantastic way to learn about archiving emails. As emails are now one of the main methods of communication, for government, large organisations and personal use, it is important that we develop the tools, techniques and policies for email preservation. To answer the question ‘how hard can it be?’ I’d say very. Emails are not simple objects of text; they are highly complex entities comprising attachments, links and embedded content. The solution will be complex but there is a great community of researchers, individuals, libraries and commercial entities working on solving this problem. I look forward to hearing the update in January 2018 when the task force is due to meet again.

Homes for old software

The more hybrid archives we work with, the more obvious it becomes that we need access to repositories of older software (or ‘abandonware’). For older formats you often find that not only is the creating software obsolete, but any migration tool you can dig up is pretty out-of-date too. Recently I used oldversion.com to source older versions of CompuServe and Eudora to transform an old CompuServe account to mbox format with CS2Eudora. The oldversion site is really valuable and we could use more like it, and more in it. The trouble is, collecting and publishing proprietary ‘abandonware’ seems to be a bit of a grey area.

In 2003, the Internet Archive obtained some exemptions from the Digital Millennium Copyright Act (DMCA) that have allowed them to archive software, but this has to be done privately, with the software being made available after copyright expiry. Not much help now, but promising for the long-term. The best thing that could happen (from an archivist’s point of view) is that individuals and companies formally rescinded their interests in older software and put it in the public domain. Ideally they would put an expiry date into the initial licence before the software becomes abandonware.

I’m curious to hear about other good abandonware sites, especially ones that include ‘productivity software’ (our focus is here rather than gaming!). The Macintosh Garden is a good one, and Apple themselves also provide access to some older software, like ClarisWorks. What else is out there that we should know about?

-Susan Thomas

OSS projects for accessing data held in .pst format

Thanks to Neil Jefferies for a link to this article in The Register, which tells us that MS has begun two open source projects that will make it possible for developers to create tools to ‘browse, read and extract emails, calendar, contacts and events information’ which live in MS Outlook’s .pst file format. These tools are the PST Data Structure View Tool and the PST File Format SDK, and both are to be Apache-licensed.

-Susan Thomas

MS format specs, inc. Outlook pst

Last year Microsoft said that they would publish the format specifications for personal store files (pst) – formats used to store data from Outlook products, including email, contacts and diary appointments. You can find them at MSDN here: http://msdn.microsoft.com/en-us/library/ff387869.aspx. There is also format information for other Office products: http://msdn.microsoft.com/en-us/library/cc313118.aspx.

Related sites of interest include Microsoft’s interoperability pages, such as their ‘open specification promise’ and Microsoft’s interoperability blog.

-Susan Thomas

XML Schema for archiving email accounts


I attended several great sessions at the Society of American Archivists conference last month. There is a wiki for the conference, but very few of the presentations have been posted so far…

One session I particularly enjoyed addressed the archiving of email – ‘Capturing the E-Tiger: New Tools for Email Preservation’. Archiving email is challenging for many reasons, which were very well put by the session speakers.

Both the EMCAP and CERP projects were introduced in the session.

EMCAP is a collaboration between state archives in North Carolina, Kentucky, and Pennsylvania to develop means to archive email. In the past, the archives have typically received email on CDs from a variety of systems, including MS Exchange, Novell Groupwise and Lotus Notes. One of the interesting outcomes of this work is software (an extension of the hmail software – see sourceforge) that enables ongoing capture of email, selected for archiving by users, from user systems. Email identified for archiving is normalised in an XML format and can be transformed to html for access. The software supports open email standards (POP3, SMTP, and IMAP4) as well as MySQL and MS SQL Server. The effort has been underway for five years and the software continues to be tested and refined.

CERP is a collaboration between the Smithsonian Institution Archives and the Rockefeller Archive Center. This context has more in common with archiving email in the Bodleian context, where an email account is more likely to be accessioned from its owner in bulk than cumulatively. Ricc Ferrante gave an overview of the issues encountered, which were similar to our experiences on the Paradigm project and in working with creators more generally.

CERP has worked with EMCAP to publish an XML schema for preserving email accounts. Email is first normalised to mbox format and then converted to this XML standard using a prototype parser built in Squeak Smalltalk, which also has a web interface (Seaside/Comanche). The result of the transformation is a single XML file that represents an entire email account as per its original arrangement. Attachments can be embedded in the XML file, or externally referenced if on the larger side (over 25 KB). If I remember rightly, the largest email account that has been processed so far is c. 1.5GB; we have one at the Library that’s significantly larger and I’d like to see how the parser handles this. It will be interesting to compare the schema/parser with The National Archives of Australia’s Xena. The developers are keen to receive comments on the schema, which is available here.
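To give a feel for the embed-or-reference rule, here is a small Python sketch that turns one message into an XML element, embedding small attachments as base64 and writing larger ones to disk. It does not use the real CERP/EMCAP schema: the element and attribute names (`Message`, `Header`, `Body`, `Attachment`, `href`) are invented for illustration, and the 25 KB cut-off is simply taken from the description above. The actual parser is written in Smalltalk and handles whole accounts, which this sketch does not.

```python
import base64
import xml.etree.ElementTree as ET
from email import message_from_bytes, policy
from pathlib import Path

EXTERNAL_THRESHOLD = 25 * 1024  # bytes; assumed reading of "over 25 KB"


def message_to_xml(raw_bytes, attachment_dir):
    """Build an XML element for one message (hypothetical element names)."""
    msg = message_from_bytes(raw_bytes, policy=policy.default)
    root = ET.Element("Message")
    for name in ("From", "To", "Date", "Subject"):
        if msg[name] is not None:
            ET.SubElement(root, "Header", name=name).text = str(msg[name])
    body = msg.get_body(preferencelist=("plain", "html"))
    ET.SubElement(root, "Body").text = body.get_content() if body is not None else ""
    out_dir = Path(attachment_dir)
    for index, part in enumerate(msg.iter_attachments()):
        payload = part.get_payload(decode=True) or b""
        name = part.get_filename() or f"attachment-{index}"
        node = ET.SubElement(root, "Attachment", name=name, size=str(len(payload)))
        if len(payload) > EXTERNAL_THRESHOLD:
            # Large attachment: store beside the XML and reference it by file name
            out_dir.mkdir(parents=True, exist_ok=True)
            target = out_dir / f"{index:04d}-{Path(name).name}"
            target.write_bytes(payload)
            node.set("href", target.name)
        else:
            # Small attachment: embed as base64 text
            node.text = base64.b64encode(payload).decode("ascii")
    return root
```

Calling `ET.tostring(message_to_xml(raw, "attachments"), encoding="unicode")` would give the serialised form; a whole-account version would wrap many such elements in folder elements that mirror the original arrangement.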

-Susan Thomas