All posts by mitenmistry

Higher Education Archive Programme Network Meeting on Research Data Management

On 22nd June 2018 I attended the Higher Education Archive Programme (#HEAP) network meeting on Research Data Management (RDM) at The National Archives in Kew. It was a chance to learn about current thinking in research data management from colleagues and peers working in this area, through hearing about their personal experiences.

The day consisted of a series of talks from presenters with a variety of backgrounds (archivists, managers, PhD students), each giving their experience of RDM from a different perspective (designing and implementing systems, or using them). I will briefly summarise the main message from a few of them. The talks were followed by a question and answer session, and the day concluded with a workshop run by John Kaye from Jisc.

Having had very little exposure to RDM in my career, it was a great way for me to understand what it is and what is being done in this sector. I have undertaken quantitative research myself during my PhD and so have an understanding of how research data is created, but until my recent move into the archival profession, I rather foolishly gave little thought as to how this data is managed. Events like this help to make people aware of the challenges archivists, information professionals and researchers face.

What is HEAP?

The Higher Education Archive Programme (#HEAP) is part of The National Archives’ continuing programme of engagement and sector support with particular archival constituencies. It is a mixture of strategic and practical work encompassing activity across The National Archives and the wider sector including guidance and training, pilot projects and advocacy. They also run network meetings for anyone involved in university archives, special collections and libraries with a variety of themes.

What is Research Data Management?

Susan Worrall, from the University of Birmingham, started the day by explaining what research data management is and why it is of interest to archivists. Put simply, it is the organisation, structuring, storage, care and use of data generated by research. These are all common themes of digital archiving and digital preservation, so RDM suffers from similar issues, such as:

  • Skills gap in the sector
  • Fear of the unknown
  • Funding issues
  • Training

She presented a case study on a brain imaging experiment, which highlighted the challenges of consent and of managing huge amounts of highly specialised data. There are, however, opportunities for archivists: RDM and digital archiving are two sides of the same coin, and digital archivists already carry out many RDM processes, so they have plenty of transferable skills. Online training is also available; the University of Edinburgh and the University of North Carolina at Chapel Hill collaborated to create a course on Coursera.

A Digital Archivist’s Perspective

Jenny Mitcham, from the University of York, gave us an insight into RDM from her experience as a digital archivist. She highlighted how RDM requires skills from the library, archival and IT sectors. Within a department you may have all of these skills, but the roles and responsibilities are not always clear, which can cause issues. She described a fantastic project called ‘Filling the Digital Preservation Gap’, which explored the potential of Archivematica for RDM. It was a finalist in the 2016 Digital Preservation Awards, and more information about the project can be found on its blog.

Planning, Designing and Implementing an RDM system

Laurain Williamson, from the University of Leicester, spoke about how to plan and implement a research data management service. She first described the current situation within the university and what the project brief involved. Any large-scale project requires a great deal of preparation and planning; she noted that certain elements, such as considering all viable technical solutions, were incredibly time consuming, but essential to finding the best fit for the institution. Through interviews and case studies they analysed the needs and wants of a variety of stakeholders.

Their research community wanted:

  • Expertise
  • Knowledge about copyright/publishing
  • Bespoke advice and a flexible service

Challenges faced by the RDM team were:

  • Managing expectations (they will never be able to do everything, so they must collaborate and prioritise their resources)
  • Last-minute requests from researchers
  • Liaising with researchers at an early stage of a project (helping researchers think about file formats early on aids the preservation process)

Conclusion

Whilst RDM may at first seem simple to a layperson (save it in the cloud or on a hard drive), once you delve into the archival theory behind proper digital preservation, that view proves absurdly simplistic. Managing large amounts of data from highly specialised experiments (producing niche file formats) requires a huge amount of knowledge, collaboration and expertise.

(CC BY 4.0) Bryant, Rebecca, Brian Lavoie, and Constance Malpas. 2018. Incentives for Building University RDM Services. The Realities of Research Data Management, Part 3. Dublin, OH: OCLC Research. doi:10.25333/C3S62F.

Data produced by universities can be seen as a commodity. The growing scholarly norms of open science and data sharing place greater emphasis on RDM. It matters to the institutions and individuals creating the data (where there is potential future scholarly or financial gain) and also to scientific integrity (allowing others in the community to review and confirm results). But not everyone will want to make their data open, and not all of it can or should be open; creating a system and workflow that accounts for both is vital.

An OCLC research report recently stated that ‘It would be a mistake to imagine that there is a single, best model of RDM service capacity, or a simple roadmap to acquiring it’. As with most things in the digital sector, this is a fast-moving area, and new technologies and theories are continually being developed. It will be exciting to see how they are implemented in the future.


Email Preservation: How Hard Can It Be? 2 – DPC Briefing Day

On Wednesday 23rd of January I attended the Digital Preservation Coalition briefing day titled ‘Email Preservation: How Hard Can It Be? 2’ with my colleague Iram. Having attended the first briefing day back in July 2017, it was a great opportunity to see what advances had been made since. This blog post briefly highlights what I found particularly thought provoking, focusing on two of the talks about e-discovery from a lawyer’s point of view.

The day began with an introduction from Chris Prom (@chrisprom), co-chair of the task force, informing us of the work it had been doing. This was followed by a variety of talks about the use of email archives, and some of the technologies used for large-scale processing, from the perspective of researchers and lawyers. The day concluded with a panel discussion (in a twist, we the audience were the panel) about the pending report and the next steps.

Update on Task Force on Technical Approaches to Email Archives Report

Chris Prom told us how the report had taken on comments from the previous briefing day and from consultation with many other people and organisations, leading to clearer and more concise messages. The report does not aim to lay down hard rules, but to give an overview of the current situation, with recommendations for anyone involved in, interested in, or considering email preservation.

Reconstruction of Narrative in e-Discovery Investigations and The Future of Email Archiving: Four Propositions

Simon Attfield (Middlesex University) and Larry Chapin (attorney) spoke about narrative and e-discovery. It was a fascinating insight into a lawyer’s requirements when using email archives. Larry used the LIBOR scandal, a project he worked on, as an example of the power of emails in bringing people to justice. From his perspective, the value of e-discovery lies in helping to create a narrative and tell a story, something a computer cannot currently do. Emails ‘capture the stuff of story making’: they reach into the crevices of things and detail the small. He noted how emails contain slang and, interestingly, the language of intention and desire. These subtleties show the true meaning of what people are saying, and that is important in the quest for the truth. Simon Attfield presented his research on the computational side, which aims to help lawyers assess and sort through these vast data sets. The work he described was too technical for me to truly understand, but it was clear that collaboration between archivists, users and programmers/researchers will be vital for better preservation and use strategies.

Jason Baron (@JasonRBaron1) (attorney) gave a talk on the future of email archiving detailing four propositions.

Slide detailing the four propositions for the future of email archives. By Jason R Baron 2018

The general conclusion from this talk was that automation and technology will play an even bigger part in the future, helping with acquisition, review (filtering out sensitive material) and searching (aiding access to larger collections). As one of the leads of the Capstone project, he told us how that particular approach saves all emails for a short time and some forever, dispelling the misconception that every email will be kept forever. Analysis of how successful Capstone has been in improving the signal-to-noise ratio (capturing only email records of permanent value) will be important going forward.

The problem of scale, which permeates most aspects of digital preservation, arose here again. Lawyers must review any and all information, which for email accounts can be colossal. The analogy given was finding a needle in a haystack, except that lawyers need to find ALL the needles (100% recall).

Current predictive coding for discovery requires human assistance. Users tell the program whether the recommendations it produced were correct; the program learns from this process and hopefully becomes more accurate. Whilst a program can efficiently and effectively sort personal information such as telephone numbers and dates of birth, it cannot currently sort textual content that requires prior knowledge, or non-textual content such as images.
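As a rough illustration of that feedback loop, here is a toy sketch (the class and all names are invented, with no relation to any real e-discovery product) in which a reviewer's relevance judgements update simple keyword weights:

```python
# Toy sketch of the human-in-the-loop cycle described above: a reviewer
# labels documents, and the model nudges simple keyword weights from each
# verdict. Real predictive-coding tools use far more sophisticated models.
from collections import defaultdict

class ToyPredictiveCoder:
    def __init__(self):
        self.weights = defaultdict(float)  # keyword -> relevance weight

    def score(self, doc):
        # Higher score = more likely relevant, by current weights.
        return sum(self.weights[w] for w in doc.lower().split())

    def feedback(self, doc, relevant):
        # Reviewer verdict: push the doc's terms up or down.
        delta = 0.5 if relevant else -0.5
        for w in doc.lower().split():
            self.weights[w] += delta

coder = ToyPredictiveCoder()
coder.feedback("libor rate submission", relevant=True)
coder.feedback("lunch menu friday", relevant=False)
print(coder.score("libor submission deadline") > coder.score("friday lunch plans"))
```

Each round of reviewer feedback reshapes the ranking, which is why the quality (and cost) of the human review step matters so much.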

Panel Discussion and Future Direction

The final report is due to be published around May 2018. Email is a complex digital object and the solution to its preservation and archiving will be complex also.

The technical means of physically preserving emails exist, but we still need to address the effective review and selection of the emails to be made available to researchers. The tools currently available are not accurate enough for large-scale processing; however, as artificial intelligence becomes more advanced, it appears this technology will be part of the solution.

Tim Gollins (@timgollins) gave a great overview of the current use of technology within this context, and stressed the point that the current technology is here to ASSIST humans. The tools for selection, appraisal and review need to be tailored for each process and quality test data is needed to train the programs effectively.

The non-technical aspects add further complexity, and might be more difficult to address. As a community we need to find answers to:

  • Whose email to capture (particularly interesting when an email account is linked to a position rather than a person)
  • How much to capture (entire accounts such as in the case of Capstone or allowing the user to choose what is worthy of preservation)
  • How to get persons of interest engaged (effectiveness of tools that aid the process e.g. drag and drop into record management systems or integrated preservation tools)
  • Legal implications
  • How to best present the emails for scholarly research (bespoke software such as ePADD or emulation tools that recreate the original environment or a system that a user is familiar with) 

Like most things in the digital sector, this is a fast-moving area with ever-changing technologies and trends. It might be frustrating that there is no hard guidance on email preservation, but when the Task Force on Technical Approaches to Email Archives report is published it will be an invaluable resource and a must-read for anyone with an interest or actively involved in email preservation. The takeaway message was, and still is, that emails matter!

Collecting Space: The Inaugural Science and Technology Archives Group Conference

On Friday 17th of November I attended the inaugural Science and Technology Archives Group (STAG) conference, held at the fantastic Dana Library and Research Centre. The theme was ‘Collecting Space’, and it brought together a variety of people working in or with science and technology archives on the topic of ‘Space’. The day consisted of a variety of talks (with topics as varied as the Cassini probe and UFOs), a tour of the Skylark exhibition and a final discussion on the future direction of STAG.

What is STAG?

The Science and Technology Archives Group was formed recently (September 2016) to celebrate and promote scientific archives and to engage anyone with an interest in the creation, use and preservation of such archives.

The keynote presentation was by Professor Michele Dougherty, who gave us a fascinating insight into the Cassini project, aided by some amazing photos. 

Colour-coded version of an ISS NAC clear-filter image of Enceladus’ near surface plumes at the south pole of the moon. Image credit: NASA/JPL-Caltech/SSI

Her concern with regard to archiving data was context. We were told how her raw data could be given to an archive, but it would be almost meaningless without the relevant contextual information, for example calibration parameters. Without it, data could be misinterpreted.
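To illustrate the point about context, here is a hypothetical sketch (every field name is invented, not drawn from any real mission archive) of a "sidecar" metadata record that keeps calibration parameters alongside the raw values, so a future user can still interpret the numbers:

```python
# Hypothetical sketch: archiving calibration context alongside raw
# readings as a JSON "sidecar", so the numbers remain interpretable.
import json

record = {
    "instrument": "magnetometer",        # illustrative fields only
    "units": "nT",
    "calibration": {"offset": -1.2, "scale_factor": 0.98},
    "raw_values": [101.4, 99.8, 100.6],
}

def apply_calibration(rec):
    # Raw counts are meaningless without these two parameters.
    cal = rec["calibration"]
    return [(v + cal["offset"]) * cal["scale_factor"] for v in rec["raw_values"]]

sidecar = json.dumps(record, indent=2)   # what would sit next to the data file
calibrated = apply_calibration(json.loads(sidecar))
```

Without the `calibration` block, the `raw_values` would be exactly the "almost meaningless" numbers she described.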

Dr James Peters from the University of Manchester told us of the unique challenges of the Jodrell Bank Observatory Archive, also called the ‘sleeping giant’. They have a vast amount of material that has yet to be accessioned but requires highly specialised scientific knowledge to understand, highlighting the importance of the relationship between the creator of an archive and the repository. Promoting use of the archive was a particular concern, one shared by Dr Sian Prosser of the Royal Astronomical Society archives, who spoke of the challenges of current collection development. I’m looking forward to finding out about the events and activities planned for their bicentenary in 2020.

We also heard from Dr Tom Lean of the Oral History of British Science project at the British Library. This was a great example of the vast amount of knowledge and history that is effectively hidden. The success of a project is typically well documented, but the stories of things that went wrong, or of the relationships between groups, have the potential to be lost. Whilst they may be lacking in scientific research value, they reveal the personal side of the projects and are a reminder of the people and personalities behind world-changing projects and discoveries.

Dr David Clarke spoke about the Ministry of Defence UFO files release programme. I was surprised to hear that as recently as 2009 there was a government-funded UFO desk. In 2009 the surviving records were transferred to The National Archives, where all files were digitised and made available online. The demand and reach for this content was huge, with millions of views and downloads from over 160 countries. Whilst people may dismiss its scientific relevance and use, such an archive provides an amazing window into the psyche of society at that time.

Dr Amy Chambers spoke about how much scientific research and knowledge can go into producing a film, using Stanley Kubrick’s 2001: A Space Odyssey as an example. It was described as a science-fiction dream plus space documentary. Directors like Kubrick would delve deeply into the subject matter and speak to a whole host of professionals in both academia and industry to get the most up-to-date scientific thinking of the time, even researching concepts that would never make it on screen. This was highlighted as a way of capturing scientific knowledge and contemporary thinking about the future of science at that point in history. Today it is no different: for Interstellar, director Christopher Nolan consulted Professor Kip Thorne, and the collaboration produced a publication on gravitational lensing in the journal Classical and Quantum Gravity.

It was great to see the Dana research library and a small exhibition of some of the space-related material the Science Museum holds. There was the Apollo 11 flight plan, signed by all the astronauts who took part, together with a letter from Independent Television News, who used the book during the televised broadcast. We also got to see the recently opened Skylark exhibition, celebrating British achievements in space research.

Launch of a British Skylark sounding rocket from Woomera in South Australia. Image credit: NASA

The final part of the conference was an open discussion focusing on the challenges and future of science and technology archives and how these could be addressed.

Awareness and exposure

From my experience as a chemistry graduate, I can speak first hand of the lack of awareness of science archives. I feel I was not alone: during a science degree, especially for research projects, archives are rarely needed compared with other disciplines, as most of the material we required was found in online journals. Although I completed my degree some time ago, I feel this is still the case today when I speak to friends who study and work in the science sector. Promoting science and technology archives to scientists (at any stage of their career, but especially at the start) will make them aware of the rich source of material out there that can benefit them, and in turn they will become more involved and interested in creating and maintaining such archives.

Content

For an archivist with little to no knowledge of a particular area of science, understanding the vastly complex data and material in a science and technology archive is a potentially impossible job. The nomenclature used in scientific disciplines can be highly specialised and specific, making the material extremely difficult to decipher.

This problem could be addressed in one of two ways. Firstly, the creator of the material, or a scientist working in that area, can be consulted. Whilst this can be time consuming, it is a necessity, as the highly specialised nature of certain topics can mean only a handful of people understand the work. Secondly, when the material is created, the creator should be encouraged to explain and store the data in a way that allows future users to understand and contextualise it.

As science and technology companies can be highly secretive entities, problems arise in exploiting sensitive material. It was suggested that the group could seek the advice of other specialist archive groups that have dealt with highly sensitive archives.

It appears that there is still a great deal of work to do to promote access, exploitation and awareness of current science and technology archives (for both creators and users). STAG is a fantastic way to get like minds together to discuss and implement solutions. I’m really looking forward to seeing how this develops and hopefully I will be able to contribute to this exciting, worthwhile and necessary future for science and technology archives.

Why and how do we Quality Assure (QA) websites at the BLWA?

At the Bodleian Libraries Web Archive (BLWA), we Quality Assure (QA) every site in the web archive. This blog post aims to give a brief introduction to why and how we QA. The first step of our web archiving is crawling a site using the tools developed by Archive-It. These tools allow entire websites to be captured and browsed in the Wayback Machine as if they were live, letting you download files, view videos and photos, and interact with dynamic content exactly as the website owner intended. However, due to the huge variety and technical complexity of websites, there is no guarantee that every capture will be successful (that is to say, that all the content is captured and working as it should be). Currently there is no accurate automatic process to check this, and so this is where we step in.

We want to ensure that the sites on our web archive are an accurate representation in every way. We owe this to the owners and the future users. Capturing the content is hugely important, but so too is how it looks, feels and how you interact with it, as this is a major part of the experience of using a website.

Quality assurance of a crawl involves manually checking the capture. Using the live site as a reference, we explore the archived capture, clicking on links, trying to download content or view videos, and noting any major discrepancies from the live site or any other issues. Sometimes a picture or two will be missing, or a certain link will not resolve correctly, which can be relatively easy to fix; at other times there are massive differences from the live site, and so the (often long and sometimes confusing) process of solving the problem begins. Some common issues we encounter are:

  • Incorrect formatting
  • Images/video missing
  • Large file sizes
  • Crawler traps
  • Social media feeds
  • Dynamic content playback issues

There are many techniques available to help solve these problems, but there is no ‘one fix for all’: the same issue on two different sites may require two different solutions. There is a lot of trial and error involved, and over the years we have gained a lot of knowledge about how to solve a variety of issues. Archive-It also has a fantastic FAQ section on their site. If we have gone through the usual avenues and still cannot solve a problem, our final port of call is to ask the experts at Archive-It, who are always happy and willing to help.
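One small part of this checking could in principle be automated. The sketch below is purely illustrative (it is not a tool we use, and the helper names are invented): it compares the resources referenced by a live page's HTML against those present in a capture and reports anything missing, which is the kind of discrepancy our manual QA looks for:

```python
# Illustrative QA check: list src/href resources in two HTML documents
# and report what the archived capture is missing relative to the live page.
from html.parser import HTMLParser

class ResourceLister(HTMLParser):
    def __init__(self):
        super().__init__()
        self.resources = set()

    def handle_starttag(self, tag, attrs):
        # Collect anything a page links to or embeds.
        for name, value in attrs:
            if name in ("src", "href") and value:
                self.resources.add(value)

def missing_resources(live_html, archived_html):
    live, archived = ResourceLister(), ResourceLister()
    live.feed(live_html)
    archived.feed(archived_html)
    return live.resources - archived.resources

live = '<img src="logo.png"><a href="report.pdf">report</a>'
captured = '<img src="logo.png">'
print(missing_resources(live, captured))  # the capture lacks report.pdf
```

A check like this only spots missing files; the look, feel and behaviour of a capture still need human eyes, which is why QA remains manual.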

An example of how important and effective QA can be: the initial test capture did not have the correct formatting and was missing images. This was resolved after the QA process.

QA’ing is a continual process. Websites add new content, or companies switch to different website designers, meaning captures of websites that have previously been successful might suddenly have an issue. It is for this reason that every crawl is given special attention and is QA’d. QA’ing captures before they are made available is a time-consuming but incredibly important part of the web archiving process at the Bodleian Libraries Web Archive. It allows us to maintain a high standard of capture and provide an accurate representation of the website for future generations.


Email Preservation: How Hard Can it Be? DPC Briefing Day

On Thursday 6th July 2017 I attended the Digital Preservation Coalition briefing day on email preservation, titled ‘Email preservation: how hard can it be?’, held in partnership with the Andrew W. Mellon Foundation. It was hosted at The National Archives (TNA); this was my first visit to TNA and it was fantastic. I didn’t know a great deal about email preservation beforehand, so I was really looking forward to learning about the topic.

The National Archives, Photograph by Mike Peel (www.mikepeel.net)., CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=9786613

The aim of the day was to discuss some of the current tools, technologies and thinking on email preservation. It was orientated around the ‘Task Force on Technical Approaches to Email Archives’ report, currently in its draft phase. We also heard interesting case studies from the British Library, TNA and Preservica, each presenting their own unique experiences. It was a great opportunity to learn about this area and to hear from the co-chairs (Kate Murray and Christopher Prom) and the audience about the current situation and possible future directions.

We heard from Jonathan Pledge from the British Library (BL). He told us about the forensic capture expertise the BL has gained, using EnCase to capture email data from hard drives, CDs and USB drives. We also got an insight into how they are deciding which email archiving tool to use: Aid4Mail fits better with their workflow, but ePADD, with its holistic approach, was also being considered. During ingest they separate the emails from the attachments. They found that after the time-consuming process of removing emails that would violate data protection laws, there was very little usable content left, as entire threads often had to be redacted because of one message. This is not the most effective use of an archivist’s time and is something they are working to address.

We also heard from Anthea Seles, who works with government collections at TNA. We learnt that, from their research, approximately 1TB of data in an organisation’s own electronic document and records management system is linked to 10TB of related data in shared drives. Her focus was on discovery and data analytics. For example, one way to increase efficiency is to batch emails so that the curator’s attention is focused where it matters. If an email was sent from TNA to a vast number of people, there is a high chance that it does not contain sensitive information; if it was sent to a high-profile individual, there is a higher chance that it does, so the curator can focus their attention on those messages.
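The batching idea can be sketched as a simple triage heuristic. The thresholds, addresses and function name below are all invented for illustration, not taken from TNA's actual analytics:

```python
# Toy triage heuristic for the approach described above: broadcast mail is
# deprioritised, mail involving flagged high-profile addresses goes to the
# top of the curator's review queue. Threshold of 50 is arbitrary.
def review_priority(recipients, high_profile):
    if any(r in high_profile for r in recipients):
        return "high"    # likely sensitive: review first
    if len(recipients) > 50:
        return "low"     # broadcast mail: unlikely to be sensitive
    return "normal"

high_profile = {"minister@example.gov.uk"}
print(review_priority(["minister@example.gov.uk"], high_profile))                     # high
print(review_priority([f"staff{i}@example.gov.uk" for i in range(200)], high_profile))  # low
```

Even a crude rule like this can shrink the pile a human must read first, which is the whole point of the data-analytics work.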

Hearing from Preservica was interesting, as it gave an insight into the commercial side of email archiving. In their view, preservation itself is not the issue; their attention is focused on identifying duplicate and unwanted emails efficiently, developing tools for whole-collection email analysis and, interestingly, solving the problem of acquiring emails via continuous transfer.

Emails will not be the main form of communication forever (the rise in popularity of instant messaging is clear to see), but we learnt that growth in their use is still expected for the near future.

One of the main issues brought up was the potential size of future email archives and the problems that come with effective and efficient appraisal. What is large in academic terms, e.g. 100,000 emails, is not in government. The figure of over 200 million emails at the George W. Bush presidential library is a phenomenal amount, and the Obama administration’s is estimated at 300 million. This requires smart solutions, and we learnt how artificial intelligence and machine learning could help.

Continuous active learning was highlighted as a way to improve searches. The example given was a search for ‘Miami dolphins’: the Miami Dolphins are an American football team, but someone might instead be looking for information about dolphins in Miami. Initially the computer presents a range of results and the user chooses which are more relevant; over time it learns what the user is looking for in cases where searches are ambiguous.

Another issue highlighted was: how do you make sure you have found the correct person? How do you avoid false positives? At TNA, the ‘Traces Through Time’ project aimed to do exactly that, initially with World War One records. This technology, using big data analytics, can be applied to email archives. There is also work on mining email signatures as a way to better determine ownership of a message.

User experience was also discussed. Emulation is an area of particular interest. The positive of this is that it recreates how the original user would have experienced the emails. However this technology is still being developed. Bit level preservation is a solution to make sure we capture and preserve the data now. This prevents loss of the archive and allows the information and value to be extracted in the future once the tools have been developed.

It was interesting to hear how policy could affect how easy it will be to acquire email archives. Under the new General Data Protection Regulation, which comes into effect in May 2018, anyone in breach will face tougher penalties, up to 4% of annual worldwide turnover. This means companies may err on the side of caution with regard to keeping personal data such as emails.

Whilst email protocols are well standardised, allowing messages to be sent from one client to another (e.g. from an AOL account of the early 1990s to Gmail today), the acquisition of email archives is not. When archivists receive an email archive, they are left with the remnants of whatever the email client or user has done to it. This means metadata may have been added or removed, and formats can vary, adding a further level of complexity to the whole process.
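As a small illustration of the standardised side, Python's standard library `email` module can parse a message in the common RFC 822/2822 text format; but which headers actually survive into an archive depends entirely on what the client or export tool preserved, which is exactly the variability described above. The message below is made up:

```python
# Parse a minimal RFC 2822 message and pull out the headers an archivist
# would hope to find; missing or stripped headers simply come back as None.
from email import message_from_string

raw = """\
From: alice@example.org
To: bob@example.org
Subject: Draft minutes
Date: Mon, 3 Jul 2017 10:15:00 +0000

Please find the draft attached.
"""

msg = message_from_string(raw)
metadata = {h: msg[h] for h in ("From", "To", "Subject", "Date")}
print(metadata["Subject"])  # Draft minutes
```

The parsing is the easy, standardised part; the hard part is that two exports of the same mailbox may disagree about which of these fields still exist.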

The day was thoroughly enjoyable and a fantastic way to learn about archiving emails. As email is now one of the main methods of communication, for government, large organisations and personal use, it is important that we develop the tools, techniques and policies for its preservation. To answer the question ‘how hard can it be?’, I’d say very. Emails are not simple objects of text; they are highly complex entities comprising attachments, links and embedded content. The solution will be complex too, but there is a great community of researchers, individuals, libraries and commercial entities working on the problem. I look forward to hearing the update in January 2018, when the task force is due to meet again.