Category Archives: Digital archives

Festivals in the UK Web Archive

Live events are funny things: can their spirit be captured, or do you have to “be there to get it”? Personally I don’t think it can, so why are we archiving festival websites?

Running throughout the year, though most are clustered around the short UK summer, festivals form a huge part of the UK’s contemporary cultural scene. While it’s often the big music festivals such as Glastonbury and Reading that come to mind, or perhaps the more local CAMRA-sponsored beer and cider festivals, these days there is a festival for pretty much everything under the sun.

UK Web Archive topics and themes

In part, this explosion of festivals – from the very local and niche to the mainstream and brand-sponsored – has been helped by the internet. You can now find festivals dedicated to anything from bird watching to meat grilling to vintage motors.

With the number of tools and platforms available for website creation and event and booking management, and the rise of social media, it seems anyone with an idea can put on a festival. More importantly, with the increasing connectedness that the web gives us, the reach of these home-grown festivals has become potentially global.

Of course, most will remain small local events that go on until the organisers lose interest or money, such as Blissfields in Winchester, which had to cancel its 2018 event due to poor ticket sales. But some will make it big, like Neverworld, which started in 2006 in Lee Denny’s back garden while his parents were away for the week; more than ten years on, it has sold out the 5,000-capacity festival venue it has relocated to.

The UK Web Archive’s Festivals collection attempts to capture the huge variety of UK festivals taking place each year and currently has around 1,200 events being archived, loosely categorised around 15 common themes – though of course there is a great deal of crossover, as many festivals combine several of them.

In this collection of UK festival sites, while we cannot capture the spirit of a live event, we can still try to capture festivals’ transient nature. Here you can see their rise and fall, the photographs and comments left in their wake, and their impact on local communities over time. Hopefully these sites and their contents can still give future researchers a sometimes surprising and often candid snapshot of contemporary British culture.

Emily Chen

Oxford LibGuides: Web Archives

Web archives are increasingly prevalent and increasingly used for research. They are fundamental to the preservation of our cultural heritage in the interconnected digital age. With the continuing collection development on the Bodleian Libraries Web Archive and the recent launch of the new UK Web Archive site, the web archiving team at the Bodleian have produced a new Web Archives LibGuide, which includes useful information for anyone wanting to learn more about web archives.

It focuses on the following areas:

  • The Internet Archive, The UK Web Archive and the Bodleian Libraries Web Archive.
  • Other web archives.
  • Web archive use cases.
  • Web archive citation information.

Check out the new look for the Web Archives LibGuide.


Introducing the new UK Web Archive website

Until recently, if you wanted to search the vast UK Legal Deposit Web Archive (containing the whole UK web space), you would need to travel to the reading room of a UK Legal Deposit Library to see if what you needed was there. For the first time, the new UK Web Archive website offers:

  • The ability to search the Legal Deposit web archive from anywhere.
  • The ability to search the Legal Deposit web archive alongside the ‘Open’ UK Web Archive (15,000 or so publicly available websites collected since 2005).
  • The opportunity to browse over 100 curated collections on a wide range of topics.

Who is the UK Web Archive?
UKWA is a partnership of all the UK Legal Deposit Libraries – The British Library, National Library of Scotland, National Library of Wales, the Bodleian Libraries, Cambridge University Libraries, and Trinity College, Dublin. The Legal Deposit Web Archive is available in the reading rooms of all the Libraries.

How much is available now?
At the time of writing, everything that a human (curators and collaborators) has selected since 2005 is searchable. This constitutes many thousands of websites and millions of individual web pages. The huge yearly Legal Deposit domain crawls will be added over the coming year.

This includes over 100 curated collections of websites on a wide range of topics and themes, among them several recent collections curated by the Bodleian Libraries.

Do the websites look and work as they did originally?
Yes and no. Every effort is made to ensure that websites look as they did originally, and internal links should work. However, owing to a variety of technical issues, many websites will look different or some elements may be missing. As a minimum, all of the text in the collection is searchable and most images should be there. Whilst we collect a considerable amount of video, much of this will not play back.

Is every UK website available?
We aim to collect every website made or owned by a UK resident; in reality, however, it is extremely difficult to be comprehensive! Our annual Legal Deposit collections include every .uk (and .london, .scot, .wales and .cymru) domain, plus any website on a server located in the UK. Of course, many UK websites are .com, .info etc. and sit on servers in other countries.

If you have or know of a UK website that should be in the archive, we encourage you to nominate it via the website.

Another version of this post was first published on the UK Web Archive blog.

Archives Unleashed – Vancouver Datathon

On 1st-2nd November 2018 I was lucky enough to attend the Archives Unleashed Datathon Vancouver, co-hosted by the Archives Unleashed Team and Simon Fraser University Library along with KEY (SFU’s Big Data Initiative). I am very grateful for the generous travel grant from the Andrew W. Mellon Foundation that made this possible.

The SFU campus at the Harbour Centre was an amazing venue for the Datathon, and it was nice to be able to take in some views of the surrounding mountains.

About the Archives Unleashed Project

The Archives Unleashed Project is a three-year project focused on making historical internet content easily accessible to scholars and researchers whose interests lie in exploring both the recent past and contemporary history.

After a series of datathons held at a number of international institutions, including the British Library, the University of Toronto, the Library of Congress and the Internet Archive, the Archives Unleashed Team identified some key areas of development that would help deliver their aim of making petabytes of valuable web content accessible.

Key Areas of Development
  • Better analytics tools
  • Community infrastructure
  • Accessible web archival interfaces

By engaging and building a community, alongside developing web archive search and data analysis tools, the project is successfully enabling a wide range of people – scholars, programmers, archivists and librarians – to “access, share and investigate recent history since the early days of the World Wide Web.”

The project has a three-pronged approach:
  1. Build a software toolkit (Archives Unleashed Toolkit)
  2. Deploy the toolkit in a cloud-based environment (Archives Unleashed Cloud)
  3. Build a cohesive user community that is sustainable and inclusive by bringing together the project team members with archivists, librarians and researchers (Datathons)
Archives Unleashed Toolkit

The Archives Unleashed Toolkit (AUT) is an open-source platform for analysing web archives with Apache Spark. I was really impressed by AUT because of its scalability, relative ease of use and the huge range of analytical options it provides. It can run on a laptop (macOS, Linux or Windows), on a powerful cluster or on a single-node server – and if you wanted to, you could even run AUT on a Raspberry Pi.

The Toolkit supports a number of search functions across the entirety of a web archive collection. You can filter collections by domain, URL pattern, date, language and more; create lists of URLs to return the top ten in a collection; extract plain text from the HTML in ARC or WARC files; and clean the data by removing ‘boilerplate’ content such as advertisements. It’s also possible to use the Stanford Named Entity Recognizer (NER) to extract the names of locations, organisations and persons.

I’m looking forward to seeing how this functionality is adapted to localised instances and controlled vocabularies – would it be possible to run a similar programme for automated tagging of web archive collections in the future? Perhaps one could ingest a collection into AUT, run NER and automatically tag up the data, providing richer metadata for web archives and subsequent research.
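AUT itself expresses these operations as Spark transformations in Scala, so the following is only a rough illustration of the filter-and-count pattern it enables – not AUT’s actual API – written as a pure-Python sketch over a few invented sample records:

```python
from collections import Counter
from urllib.parse import urlparse

# Invented sample records standing in for parsed (W)ARC entries.
# In AUT these would be rows derived from a real web archive collection.
records = [
    {"url": "http://example.org/festivals", "date": "20140301", "text": "festival line-up announced"},
    {"url": "http://example.org/tickets", "date": "20140302", "text": "tickets on sale"},
    {"url": "http://news.example.com/ebola", "date": "20141105", "text": "outbreak update"},
]

def filter_by_date(recs, prefix):
    """Keep records whose crawl date starts with the given prefix (e.g. a year-month)."""
    return [r for r in recs if r["date"].startswith(prefix)]

def top_domains(recs, n=10):
    """Count records per domain and return the n most frequent -- AUT's 'top ten' pattern."""
    return Counter(urlparse(r["url"]).netloc for r in recs).most_common(n)

print(top_domains(filter_by_date(records, "201403")))
```

The same two steps – filter, then aggregate – underpin most of the collection-level questions described above; AUT’s value is running them at archive scale rather than over three dictionaries.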

Archives Unleashed Cloud

The Archives Unleashed Cloud (AUK) is a GUI-based front end for working with AUT; it essentially provides an accessible interface for generating research derivatives from web archive files (WARCs). With a few clicks users can ingest and sync Archive-It collections, analyse them, create network graphs and visualise connections and nodes. It is currently free to use and runs on AUK’s central servers.

My experience at the Vancouver Datathon

The datathons bring together a small group of 15-20 people from varied professional backgrounds to work and experiment with the Archives Unleashed Toolkit and Cloud. I really like that the team chose to limit attendance: it created a close-knit working group full of collaboration, knowledge and idea exchange, and a relaxed, fun and friendly environment to work in.

Day One

After a quick coffee and light breakfast, the Datathon opened with introductory talks from project team members Ian Milligan (Principal Investigator), Nick Ruest (Co-Principal Investigator) and Samantha Fritz (Project Manager), relating to the project – its goals and outcomes, the toolkit, available datasets and event logistics.

Another quick coffee break and it was back to work – participants were asked to think about the datasets that interested them, techniques they might want to use and questions or themes they would like to explore and write these on sticky notes.

Once these were placed on the whiteboard, teams naturally formed around datasets, themes and questions. My team, with Kathleen Reed and Ben O’Brien, formed around a common interest in exploring the First Nations and Indigenous communities dataset.

Virtual machines were kindly provided by Compute Canada and were available throughout the Datathon to run AUT; datasets were preloaded onto these VMs and a number of derivative files had already been created. We spent some time brainstorming, sharing ideas and exploring the datasets using a number of different tools. The day finished with some informative lightning talks about the work participants had been doing with web archives at their home institutions.

Day Two

On day two we continued to explore the datasets, using the full-text derivatives, running NER and performing keyword searches with the command-line tool grep. We also analysed the text using sentiment analysis with the Natural Language Toolkit. To help visualise the data, we took the new text files produced by the keyword searches and uploaded them into Voyant Tools. This helped by visualising links between words, creating a list of top terms and providing quantitative data, such as how many times each word appears. It was here we found that the word ‘letter’ appeared quite frequently, and we settled on the dataset we would be using – University of British Columbia – bc-hydro-site-c.
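Our actual pipeline used grep, NLTK and Voyant, but the core grep-then-count step is simple enough to sketch in a few lines of Python. The sample lines below are invented, not the real derivative data:

```python
import re
from collections import Counter

# Toy stand-in for the full-text derivative lines we searched with grep.
lines = [
    "A letter of protest was submitted to BC Hydro.",
    "Residents sent another letter about the dam.",
    "General news item with no match.",
]

def keyword_search(lines, pattern):
    """Return lines matching a pattern, case-insensitively -- roughly `grep -i pattern`."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [ln for ln in lines if rx.search(ln)]

def top_terms(lines, n=5):
    """Crude term-frequency count of the kind Voyant visualises."""
    words = re.findall(r"[a-z']+", " ".join(lines).lower())
    return Counter(words).most_common(n)

hits = keyword_search(lines, r"letter")
print(len(hits), top_terms(hits, 1))
```

It was exactly this kind of frequency signal – ‘letter’ bubbling to the top of the term list – that steered us towards the Site C letters.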

We hunted down the site and found it contained a number of letters from people about the BC Hydro Dam Project. The problem was that the letters were in a table, and when extracted the data was not clean enough. Ben O’Brien came up with a clever extraction solution utilising the raw HTML files and some scripting. The data was then prepped for geocoding by Kathleen Reed to show the geographical spread of the letter writers, hot-spots and a timeline – a useful way of looking at the issue from the perspective of engagement and community.
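We don’t have the actual script, but a minimal sketch of the general idea – pulling cell text out of raw HTML with Python’s standard-library parser, using an invented table fragment rather than the real page – could look like this:

```python
from html.parser import HTMLParser

class TableTextExtractor(HTMLParser):
    """Collect the text of each <td>/<th> cell, one list per <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], None, False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []            # start a fresh row
        elif tag in ("td", "th"):
            self._in_cell = True      # begin buffering cell text
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._in_cell:
            self._row.append("".join(self._cell).strip())
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._cell.append(data)

# A made-up fragment in the rough shape of the letters table.
html = ("<table><tr><th>Name</th><th>Location</th></tr>"
        "<tr><td>A. Writer</td><td>Fort St. John</td></tr></table>")
parser = TableTextExtractor()
parser.feed(html)
print(parser.rows)
```

Once the rows are clean lists like this, the location column can be handed straight to a geocoder for the kind of mapping Kathleen produced.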

Map of letter writers.

Time Lapse of locations of letter writers. 

At the end of day two each team had a chance to present their project to the others. You can view the presentation we prepared (Exploring Letters of Protest for the BC Hydro Dam Site C) here, as well as the other team projects.

Why Web Archives Matter

How we preserve, collect, share and exchange cultural information has changed dramatically. The act of remembering at national institutions and libraries has altered greatly in scope, speed and scale because of the web. The way in which we provide access to, use and engage with archival material has been disrupted. Historians who want to study the period after the 1990s will have to use web archives as a resource, yet accessibility and usability have lagged behind, and many students and historians are not ready. Projects like Archives Unleashed will help to equip researchers, historians, students and the wider community with the tools needed to tackle these problems. I look forward to seeing the next steps the project takes.

Archives Unleashed is currently accepting submissions for the next Datathon in March 2019; I highly recommend it.

Higher Education Archive Programme Network Meeting on Research Data Management

On 22nd June 2018 I attended the Higher Education Archive Programme (#HEAP) network meeting on Research Data Management (RDM) at The National Archives in Kew. It allowed me to learn about current thinking in research data management from colleagues and peers working in this area, through hearing about their personal experiences.

The day consisted of a series of talks from presenters with a variety of backgrounds (archivists, managers, PhD students), giving their experiences of RDM from their different perspectives (design and implementation of systems, and use). I will briefly summarise the main message from a few of them. The talks were followed by a question-and-answer session, and the day concluded with a workshop run by John Kaye from Jisc.

Having had very little exposure to RDM in my career, this was a great way for me to understand what it is and what is being done in the sector. I undertook quantitative research during my PhD and so have an understanding of how research data is created, but until my recent move into the archival profession I rather foolishly gave little thought to how this data is managed. Events like this help to make people aware of the challenges archivists, information professionals and researchers face.

What is HEAP?

The Higher Education Archive Programme (#HEAP) is part of The National Archives’ continuing programme of engagement and sector support with particular archival constituencies. It is a mixture of strategic and practical work encompassing activity across The National Archives and the wider sector including guidance and training, pilot projects and advocacy. They also run network meetings for anyone involved in university archives, special collections and libraries with a variety of themes.

What is Research Data Management?

Susan Worrall, from the University of Birmingham, started the day by explaining what research data management is and why it is of interest to archivists. Put simply, it is the organisation, structuring, storage, care and use of data generated by research. It is important to archivists because these are all common themes of digital archiving and digital preservation, and it therefore suffers from similar issues, such as:

  • Skills gap in the sector
  • Fear of the unknown
  • Funding issues
  • Training

She presented a case study of a brain-imaging experiment, which highlighted the challenges of consent and of managing huge amounts of highly specialised data. There are, however, opportunities for archivists: RDM and digital archiving are two sides of the same coin, and digital archivists already carry out many RDM processes and so have many transferable skills. Online training is also available; the University of Edinburgh and the University of North Carolina at Chapel Hill collaborated to create a course on Coursera.

A Digital Archivist’s Perspective

Jenny Mitcham, from the University of York, gave us an insight into RDM from her experience as a digital archivist. She highlighted how RDM requires skills from the library, archival and IT sectors. Within a department you may have all of these skills, but the roles and responsibilities are not always clear, which can cause issues. She described a fantastic project called ‘Filling the Digital Preservation Gap’, which explored the potential of Archivematica for RDM. It was a finalist in the 2016 Digital Preservation Awards, and more information about the project can be found on its blog.

Planning, Designing and Implementing an RDM system

Laurain Williamson, from the University of Leicester, spoke about how to plan and implement a research data management service. Firstly, she described the current situation within the university and what the project brief involved. Any large-scale project requires a great deal of preparation and planning; she noted that certain elements, such as considering all viable technical solutions, were incredibly time-consuming but essential to getting the best fit for the institution. Through interviews and case studies they analysed the thoughts and wants of a variety of stakeholders.

Their research community wanted:

  • Expertise
  • Knowledge about copyright/publishing
  • Bespoke advice and a flexible service.

Challenges faced by the RDM team were:

  • Managing expectations (they will never be able to do everything, so they must collaborate and prioritise their resources)
  • Last-minute requests from researchers
  • The need to liaise with researchers at an early stage of a project (helping researchers think about file formats early on aids the preservation process).

Conclusion

Whilst RDM may at first seem simple to a layperson (save it to the cloud or a hard drive), once you delve into the archival theory of digital preservation this becomes an absurdly simplified view. Managing large amounts of data from specialised experiments (producing niche file formats) requires a huge amount of knowledge, collaboration and expertise.

(CC BY 4.0) Bryant, Rebecca, Brian Lavoie, and Constance Malpas. 2018. Incentives for Building University RDM Services. The Realities of Research Data Management, Part 3. Dublin, OH: OCLC Research. doi:10.25333/C3S62F.

Data produced by universities can be seen as a commodity. The growth of scholarly norms around open science and data sharing places greater emphasis on RDM. It matters to the institutions and individuals creating the data (where there is potential future scholarly or financial gain) and also to scientific integrity (allowing others in the community to review and confirm results). But not everyone will want to make their data open, and not all of it can or should be; creating a system and workflow that accounts for both is vital.

An OCLC research report recently stated that ‘It would be a mistake to imagine that there is a single, best model of RDM service capacity, or a simple roadmap to acquiring it’. As with most things in the digital sector, this is a fast-moving area, and new technologies and theories are continually being developed. It will be exciting to see how they are implemented in the future.


Building collections on Gender Equality at the UK Web Archive

The Bodleian is one of the six legal deposit libraries in the UK. One of my projects this year as a graduate trainee digital archivist on the Bodleian Libraries’ Developing the Next Generation Archivist programme is to help curate special collections in the UK Web Archive. Since May I’ve been working on the Gender Equality collection. Please note, this post also appears on the British Library UK Web Archive blog.

Why are we collecting?

2018 is the centenary of the Representation of the People Act 1918. UK-wide memorials and celebrations of this journey towards, and victory of, women’s suffrage are evident online: events, exhibitions, commemorations and campaigns. Popular topics at the moment include the hashtags #timesup and #metoo, gender pay disparity and the recent referendum on the 8th Amendment in the Republic of Ireland. These discussions produce a lot of ephemeral material, and without web archiving that material is at risk of moving or even disappearing. Gender equality is being discussed widely in the media at present, and these discussions have been developing for years.

Through the UK Web Archive SHINE interface we can see that text matching the phrase ‘gender equality’ increased from 0.002% of crawled resources (24 out of 843,204) in 1996 to 0.044% (23,289 out of 53,146,359) in 2013.

SHINE user interface

If we search UK web content relating to gender equality we generate a vast number of results; organisations have published their gender pay gap reports online, for example, and there is much to engage with in the social media accounts of individuals and organisations campaigning for gender equality. Browsing this web content, it becomes apparent that gender equality means something different to each of the many presences online: charities, societies, employers, authorities, heritage centres, and individuals such as social entrepreneurs, teachers, researchers and more.

The Fawcett Society: https://www.fawcettsociety.org.uk/blog/why-does-teaching-votes-for-women-matter-an-a-level-teachers-perspective

What we are collecting?

The Gender Equality special collection, now live on the UK Web Archive, comprises material that provides a snapshot of attitudes towards gender equality in the UK. Web material is harvested under the areas of:

  • Bodily autonomy
  • Domestic abuse/Gender based violence
  • Gender equality in the workplace
  • Gender identity
  • Parenting
  • The gender pay gap
  • Women’s suffrage

100 years on from women’s suffrage, the fight for gender equality continues. The collection is still being curated and is growing – and you can help too!

How to get involved?

If there are any UK websites that you think should be added to the Gender Equality collection, you can take up the UK Web Archive’s call to action and nominate them.


The UK Web Archive Ebola Outbreak collection

By CDC Global (Ebola virus) [CC BY 2.0 (http://creativecommons.org/licenses/by/2.0)], via Wikimedia Commons


Next month marks the fourth anniversary of the WHO’s public announcement of “a rapidly evolving outbreak of Ebola virus disease (EVD)”, which went on to become the deadliest outbreak of EVD in history.

With more than 28,000 cases and 11,000 deaths, it moved with such speed and virulence that – though concentrated in Guinea, Liberia and Sierra Leone – it was feared at the time that the 2014-2016 Ebola virus disease outbreak would spread to become a global pandemic.

No cure or vaccine has yet been discovered, and cases continue to flare up in West Africa; the most recent flare-up was declared over on 2 July 2017. Yet today most people in the UK, unless directly affected, don’t give it a second thought.

Searching online now, you can find fact sheets detailing everything you might want to know about patient zero and the subsequent rapid spread of infection. You can find discussions detailing the international response (or failure to respond) and lessons learned. You might even find the reminiscences of aid workers and survivors. But these sites all examine the outbreak in retrospect, and their pages and stories have been updated so often that posts from the time can no longer be found.

Posts that reflected the fear and uncertainty that permeated the UK during the epidemic. The urgent status updates and travel warnings.  The misinformation that people were telling each other. The speculation that ran riot. The groundswell of giving. The mobilisation of aid.

Understandably when we talk about epidemics the focus is on the scale of physical suffering: numbers stricken and dead; money spent and supplies sent; the speed and extent of its spread.

Whilst UKWA regularly collects the websites of major news channels and governmental agencies, what we wanted to capture was the public dialogue on, and interpretation of, events as they unfolded. To see how local interests and communities saw the crisis through the lenses of their own experience.

To this end, the special collection Ebola Outbreak, West Africa 2014 features a broad selection of websites concerning the UK response to the Ebola virus crisis. Here you can find:

  • The Anglican community’s view on the role of faith during the crisis;
  • Alternative medicine touting the virtues of liposomal vitamin C as a cure for Ebola;
  • Local football clubs fundraising to send aid;
  • Parents in the UK withdrawing children from school because of fear of the virus’ spread;
  • Think tanks’ and academics’ views on the national and international response;
  • Universities issuing guidance and reports on dealing with international students; and more.

Active collection for Ebola began in November 2014, at the height of the outbreak, while related websites dating back to the infection of patient zero in December 2013 have been retrospectively added to the collection. Collection continued through to January 2016, a few months before the outbreak tailed off in April 2016.

The Ebola collection is available via the UK Web Archive’s new beta interface.

DPC Email Preservation: How Hard Can It Be? Part 2


In July last year my colleague Miten and I attended a DPC briefing day titled Email Preservation: How Hard Can It Be?, which introduced me to the work of the Task Force on Technical Approaches to Email Archives, and we were lucky enough to attend the second session last week.

Arranging a second session gave Chris Prom (@chrisprom), University of Illinois at Urbana-Champaign, and Kate Murray (@fileformatology), Library of Congress, co-chairs of the Task Force, the opportunity to reflect on the issues raised at the first session and add them to the Task Force report, and to give attendees an update on their overall progress ahead of the final report, scheduled for publication in April.

“Using Email Archives in Research”

The first guest presentation was given by Dr. James Baker (@j_w_baker), University of Sussex, who was inspired to write about the use of email archives in research by two key texts: Born-digital archives at the Wellcome Library: appraisal and sensitivity review of two hard drives (2016), an article by Victoria Sloyan, and Dust (2001), a book by Carolyn Steedman.

These texts led Dr. Baker to think of the “imagination of the archive”, as he put it – the mystique of archival research, stemming from the imagery of 19th-century research processes. He expanded on this idea, stating “physically and ontologically unique; the manuscript, is no longer what we imagine to be an archive”.

However, despite this new platform for research, Dr. Baker stated that very few people outside the archival profession know that born-digital archives exist, let alone use them. This is an issue, as archives require evidence of use; therefore, we need to encourage use.

To address this, Dr. Baker set up a Born-Digital Access Workshop at the Wellcome Library, in collaboration with their Collections Information Team, gathering people who use born-digital archives and the archivists who make them, and providing them with a set of four case studies. These case studies were designed to explore the following:

A) the “original” environment; hard drive files in a Windows OS
B) the view experience; using the Wellcome’s Viewer
C) levels of curation; comparing reformatted and renamed collections with unaltered ones
D) the physical media; asking does the media hold value?

Several interesting observations came out of this workshop, which Dr. Baker organised into three areas:

  1. Levels of description; filenames are important, and are valuable data in themselves to researchers. Users need a balance between curation and an authentic representation of the original order.
  2. “Bog-standard” laptop as access point; using modern technology that is already used by many researchers as the mode of access to email and digital archives creates a sense of familiarity when engaging with the content.
  3. Getting the researcher from desk to archive; there is a substantial amount of work needed to make the researcher aware of the resources available to them and how – can they remote access, how much collection level description is necessary?

Dr. Baker concluded that even with outreach and awareness events such as the one we were attending, born-digital archives are not yet truly accessible to researchers. This made me realise that the digital preservation community must push for access solutions and get them out to users, so that researchers can gain the insights they might from our digital collections.

“Email as a Corporate Record”

The third presentation of the day was given by James Lappin (@JamesLappin), Loughborough University, who discussed the issues involved in applying archival policies to email in a governmental context.

His main point concerned the routine deletion of email that happens in governments around the world. He said no civil servants’ email accounts are scheduled to be kept beyond the next 3-4 years, though they may be available via a different structure, such as a records management system. However, Lappin pointed out the crux of this scenario: government departments have no budget to move and preserve many individuals’ email accounts, and no real sense of the numbers involved: how much should be saved, and how much can be?

“email is the record of our age” – James Lappin

Lappin suggested an alternative: keep the emails of senior staff only. This, however, raises the question: how do we filter out sensitive and personal content?

Lappin posits that auto-deletion is the solution, aiming to spare institutions from unmanageable volumes of email and the consequent breaches of data protection.
Auto-deletion encourages:

  • governments to kick-start email preservation action,
  • the integration of technology for records management solutions,
  • active consideration of the value of emails for long-term preservation.

But how do we transfer emails to an EDRMS? What structures do we use, how do we separate individuals, and how do we enforce the transfer of emails? These issues are still to be worked out, and can be, Lappin argues, if we implement auto-deletion as a tool to make email preservation less daunting. At the end of the day, the current goal is to retain the “important” emails, which will make both government departments and historians happy – and that, in turn, makes archivists happy. This does indeed seem like a positive scenario for us all!

However, it was particularly interesting when Lappin made his next point: what if the very nature of email, as intimate and immediate, makes governments uncomfortable with the idea of saving and preserving governmental correspondence? Governments must therefore be more active in their selection processes and save something rather than nothing – which is where auto-deletion could, again, prove useful!

To conclude, Lappin presented a list of characteristics which could justify the preservation of an individual's government email account, including:

  • The role they play is of historic interest
  • They expect their account to be permanently preserved
  • They are given the chance to flag or remove personal correspondence
  • Access to personal correspondence is prevented except in case of overriding legal need

Personally, I feel this is fair and thorough, but only time will tell what route various governments take.

On a side note: Lappin runs an excellent comic-based blog on Records Management which you can see here.

Conclusions
One of the key issues that stood out for me today was, perhaps surprisingly, not the technology used in email preservation, but how to address the myriad issues email preservation brings to light: namely the feasibility of data protection, sensitivity review and appraisal, which become particularly pressing when dealing with such vast quantities of material.

Email can only be preserved once we have defined what constitutes 'email' and how to proceed ethically, morally and legally. Then we can move forward with implementing the technical frameworks, designed to meet our pre-defined requirements, that will enable access to historically valuable and information-rich email archives, which will yield much in the name of research.

In the tweet below, Evil Archivist succinctly reminds us of the importance of maintaining and managing our digital records…

Email Preservation: How Hard Can It Be? 2 – DPC Briefing Day

On Wednesday 23rd of January I attended the Digital Preservation Coalition briefing day titled 'Email Preservation: How Hard Can It Be? 2' with my colleague Iram. Having attended the first briefing day back in July 2017, it was a great opportunity to see what advances and changes had been achieved since. This blog post will briefly highlight what I found particularly thought-provoking, focusing on two of the talks about e-discovery from a lawyer's viewpoint.

The day began with an introduction by the co-chair of the task force, Chris Prom (@chrisprom), informing us of the work the task force had been doing. This was followed by a variety of talks about the use of email archives, and some of the technologies used for large-scale processing, from the perspectives of researchers and lawyers. The day concluded with a panel discussion (in a twist, we the audience were the panel) about the pending report and the next steps.

Update on Task Force on Technical Approaches to Email Archives Report

Chris Prom told us how the report had taken on comments from the previous briefing day and from consultation with many other people and organisations, leading to clearer and more concise messages. The report itself does not aim to lay down hard rules but to give an overview of the current situation, with recommendations for any person or organisation involved with, interested in, or considering email preservation.

Reconstruction of Narrative in e-Discovery Investigations and The Future of Email Archiving: Four Propositions

Simon Attfield (Middlesex University) and Larry Chapin (attorney) spoke about narrative and e-discovery. It was a fascinating insight into a lawyer's requirements when using email archives. Larry used the LIBOR scandal as an example of a project he worked on, and of the power of emails in bringing people to justice. From his perspective, the importance of e-discovery lies in helping to create a narrative and tell a story, something a computer cannot yet do. Emails 'capture the stuff of story making' as they have the ability to reach into the crevices of things and detail the small. He noted how emails contain slang and, interestingly, the language of intention and desire. These subtleties show the true meaning of what people are saying, and that is important in the quest for the truth. Simon Attfield presented his research on the coding side, which aims to help lawyers assess and sort through these vast data sets. The work he described here was too technical for me to truly understand, but it was clear that collaboration between archivists, users and programmers/researchers will be vital for better preservation and use strategies.

Jason Baron (@JasonRBaron1) (attorney) gave a talk on the future of email archiving detailing four propositions.

Slide detailing the four propositions for the future of email archives. By Jason R Baron 2018

The general conclusion from this talk was that automation and technology will play an even bigger part in the future, helping with acquisition, review (filtering out sensitive material) and searching (aiding access to larger collections). As one of the leads of the Capstone project, he told us how that particular project saves all emails for a short time and some forever, dispelling the misconception that all emails will be saved forever. Analysis of how successful Capstone has been in improving the signal-to-noise ratio (capturing only email records of permanent value) will be important going forward.

The problem of scale, which permeates most aspects of digital preservation, arose again here. Lawyers must review any and all information, which for email accounts can be colossal. The analogy given was finding a needle in a haystack – except lawyers need to find ALL the needles (100% recall).

Current predictive coding for discovery requires human assistance. Users have to tell the program whether the recommendations it produced were correct; the program learns from this process and hopefully becomes more accurate. Whilst a program can efficiently and effectively sort personal information such as telephone numbers and dates of birth, it cannot currently sort textual content that requires prior knowledge, or non-textual content such as images.
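The feedback loop described above can be sketched in miniature. This toy example, nothing like a real e-discovery product, shows the shape of the idea: a human reviewer labels a handful of emails as relevant or not, the program derives crude term weights from those labels, and then ranks the unreviewed emails so the most likely needles surface first. All data and function names are invented for the illustration.

```python
from collections import Counter

def learn_weights(labelled):
    """labelled: list of (text, is_relevant) pairs from a human reviewer.
    Returns a per-term weight: positive if the term appears mostly in
    relevant emails, negative if mostly in irrelevant ones."""
    relevant, irrelevant = Counter(), Counter()
    for text, is_relevant in labelled:
        (relevant if is_relevant else irrelevant).update(text.lower().split())
    return {term: relevant[term] - irrelevant[term]
            for term in set(relevant) | set(irrelevant)}

def rank(unlabelled, weights):
    """Order unreviewed emails by learned relevance score, highest first."""
    def score(text):
        return sum(weights.get(term, 0) for term in text.lower().split())
    return sorted(unlabelled, key=score, reverse=True)
```

Each round of reviewer corrections would feed back into `learn_weights`, which is the "human assistance" the talk stressed: the ranking is only as good as the labels it learns from.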

Panel Discussion and Future Direction

The final report is due to be published around May 2018. Email is a complex digital object and the solution to its preservation and archiving will be complex also.

The technical means of physically preserving emails exist, but we still need to address the effective review and selection of the emails to be made available to researchers. The tools currently available are not accurate enough for large-scale processing; however, as artificial intelligence becomes more advanced, it appears this technology will be part of the solution.

Tim Gollins (@timgollins) gave a great overview of the current use of technology within this context, and stressed the point that the current technology is here to ASSIST humans. The tools for selection, appraisal and review need to be tailored for each process and quality test data is needed to train the programs effectively.

The non-technical aspects add further complexity, and might be more difficult to address. As a community, we need to find answers to:

  • Whose email to capture (particularly interesting when an email account is linked to a position rather than a person)
  • How much to capture (entire accounts such as in the case of Capstone or allowing the user to choose what is worthy of preservation)
  • How to get persons of interest engaged (effectiveness of tools that aid the process e.g. drag and drop into record management systems or integrated preservation tools)
  • Legal implications
  • How to best present the emails for scholarly research (bespoke software such as ePADD or emulation tools that recreate the original environment or a system that a user is familiar with) 

Like most things in the digital sector, this is a fast-moving area with ever-changing technologies and trends. It might be frustrating that there is no hard guidance on email preservation yet, but when the Task Force on Technical Approaches to Email Archives report is published it will be an invaluable resource and a must-read for anyone interested or actively involved in email preservation. The takeaway message was, and still is, that emails matter!

What I Wish I Knew Before I Started – DPC Student Conference 2018

On January 24th, four Archives Assistants from Archives and Modern Manuscripts visited Senate House, London for the DPC Student Conference. With the 2018 theme being ‘What I Wish I Knew Before I Started’, it was an opportunity for digital archivists to pass on their wealth of knowledge in the field.

Getting started with digital preservation

The day started with a brief introduction to digital preservation by Sharon McMeekin from the Digital Preservation Coalition. This included an outline of the three basic models of digital preservation: OAIS, DCC lifecycle and the three-legged stool. (More information about these models can be found in the DPC handbook.) Aimed at beginners, this introduction was made accessible and easy to understand, whilst also giving us plenty to think about.

Next to take the stage was Steph Taylor, an Information Manager from CoSector, University of London. Steph is a huge advocate for the use of Twitter to find out the latest information and opinion in the world of digital preservation. As someone who has never had a Twitter account, it made me realise the importance of social media for staying up to date in such a fast-moving profession. Needless to say, I signed myself up to Twitter that evening to find out what I had been missing out on. (You can follow what was happening at the conference with the hashtag #dpc_wiwik.)

The final speaker before lunch was Matthew Addis, giving a technologist's perspective. Matthew broke down the steps you would need to take if faced with the potentially overwhelming job of starting from scratch with a repository of digital material. He referenced a two-step approach – conceived by Tim Gollins – named 'Parsimonious Preservation', which involves firstly understanding what you have, and secondly keeping the bits safe. In the world of digital preservation, the worst thing you can do is nothing, so by dealing with the simple and usually low-cost files first, you can protect the vast majority of the collection rather than going straight to the technical, time-consuming and costly minority of material. Left undealt with, even the simple material may in the long run become technical and costly to handle – due to software obsolescence, for instance.
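Both steps of 'Parsimonious Preservation' can be approximated with very modest tooling, which is rather the point of the approach. As a hedged sketch (the function and manifest layout are my own invention, not anything Matthew presented): walking the directory tree gives you the inventory for "understand what you have", and recording a checksum per file gives you a baseline for "keep the bits safe", since any later corruption will change the hash.

```python
import hashlib
import os

def inventory(root):
    """Step 1: understand what you have - list every file with its size.
    Step 2: keep the bits safe - record a SHA-256 checksum per file so
    bit-level corruption can be detected on later audits."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            digest = hashlib.sha256()
            with open(path, "rb") as f:
                # Hash in chunks so large files do not exhaust memory.
                for chunk in iter(lambda: f.read(8192), b""):
                    digest.update(chunk)
            manifest[path] = {"size": os.path.getsize(path),
                              "sha256": digest.hexdigest()}
    return manifest
```

Re-running the same walk later and comparing hashes against the stored manifest is the low-cost "do something now" that the talk argued protects the bulk of a collection.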

That morning, the thought of tackling a simple digital preservation project would have seemed somewhat daunting. But Matthew illustrated the steps very clearly and as we broke for lunch I was left thinking that actually, with a little guidance, it probably wouldn’t be quite so bad.

Speakers on their experiences in the digital preservation field

During the afternoon, speakers gave presentations on their experiences in the digital preservation field. The speakers were Adrian Brown from the Parliamentary Archives, Glenn Cumiskey from the British Museum and Edith Halvarsson from the Bodleian Libraries. It was fascinating to learn how diverse the day-to-day working lives of digital archivists can be, and how often, as Glenn Cumiskey remarked, you may be the first digital archivist there has ever been within a given organisation, providing a unique opportunity for you to pave the way for its digital future.

Adrian Brown on his digital preservation experience at the Parliamentary Archive

The final speaker of the day, Dave Thomson, explained why it is up to students and new professionals to be 'disruptive change agents', further illustrating the point that digital preservation is a relatively new field. We now have a chance to be the change and put digital preservation at the forefront of businesses' minds, helping them avoid the loss of important information through complacency.

The conference closed with the speakers taking questions from attendees. There was lively discussion over whether postgraduate university courses in archiving and records management are teaching the skills needed for careers in digital preservation. It was decided that although some universities do teach this subject better than others, digital archivists have to make a commitment to life-long learning – not just one postgraduate course. This is a field where the technology and methods are constantly changing, so we need to be continuously developing our skills in accordance with these changes. The discussion certainly left me with lots to think about when considering postgraduate courses this year.

If you are new to the archiving field and want to gain an insight into digital preservation, I would highly recommend the annual conference. I left London with plenty of information, ideas and resources to further my knowledge of the subject, starting my commitment to life-long learning in the area of digital preservation!