Category Archives: Digital archives

The UK Web Archive Ebola Outbreak collection

By CDC Global (Ebola virus) [CC BY 2.0 (http://creativecommons.org/licenses/by/2.0)], via Wikimedia Commons

Next month marks the fourth anniversary of the WHO’s public announcement of “a rapidly evolving outbreak of Ebola virus disease (EVD)” that went on to become the deadliest outbreak of EVD in history.

With more than 28,000 cases and 11,000 deaths, it moved with such speed and virulence that–though concentrated in Guinea, Liberia and Sierra Leone–it was feared at the time that the Ebola virus disease outbreak of 2014-2016 would soon spread to become a global pandemic.

No cure or vaccine has yet been discovered and cases continue to flare up in West Africa; the most recent outbreak was declared over on 2 July 2017. Yet today most people in the UK, unless directly affected, don’t give it a second thought.

Searching online now, you can find fact sheets detailing everything you might want to know about patient zero and the subsequent rapid spread of infection. You can find discussions detailing the international response (or failure to respond) and lessons learned. You might even find the reminiscences of aid workers and survivors. But these sites all examine the outbreak in retrospect, and their pages and stories have been updated so often that posts from the time can no longer be found.

Posts that reflected the fear and uncertainty that permeated the UK during the epidemic. The urgent status updates and travel warnings.  The misinformation that people were telling each other. The speculation that ran riot. The groundswell of giving. The mobilisation of aid.

Understandably when we talk about epidemics the focus is on the scale of physical suffering: numbers stricken and dead; money spent and supplies sent; the speed and extent of its spread.

Whilst UKWA regularly collects the websites of major news channels and governmental agencies, what we wanted to capture was the public dialogue on, and interpretation of, events as they unfolded. To see how local interests and communities saw the crisis through the lenses of their own experience.

To this end, the special collection Ebola Outbreak, West Africa 2014 features a broad selection of websites concerning the UK response to the Ebola virus crisis. Here you can find:

  • The Anglican community’s view on the role of faith during the crisis;
  • Alternative medicine touting the virtues of liposomal vitamin C as a cure for Ebola;
  • Local football clubs fundraising to send aid;
  • Parents in the UK withdrawing children from school because of fear of the virus’ spread;
  • Think tanks’ and academics’ views on the national and international response;
  • Universities issuing guidance and reports on dealing with international students; and more.

Active collection for Ebola began in November 2014, at the height of the outbreak, while related websites dating back to the infection of patient zero in December 2013 have been retrospectively added to the collection. Collection continued through to January 2016, a few months before the outbreak began tailing off in April 2016.

The Ebola collection is available via the UK Web Archive’s new beta interface.

DPC Email Preservation: How Hard Can It Be? Part 2

In July last year my colleague Miten and I attended a DPC Briefing Day titled Email Preservation: How Hard Can It Be?  which introduced me to the work of the Task Force on Technical Approaches to Email Archives  and we were lucky enough to attend the second session last week.

Arranging a second session gave Chris Prom (@chrisprom), University of Illinois at Urbana-Champaign, and Kate Murray (@fileformatology), Library of Congress, co-chairs of the Task Force, the opportunity to reflect on the issues raised at the first session and fold them into the Task Force report. It also gave attendees an update on overall progress, ahead of the final report scheduled for publication in April.

“Using Email Archives in Research”

The first guest presentation was given by Dr. James Baker (@j_w_baker), University of Sussex, who was inspired to write about the use of email archives within research by two key texts: Born-digital archives at the Wellcome Library: appraisal and sensitivity review of two hard drives (2016), an article by Victoria Sloyan, and Dust (2001), a book by Carolyn Steedman.

These texts led Dr. Baker to think of the “imagination of the archive”, as he put it: the mystique of archival research, stemming from the imagery of 19th-century research processes. He expanded on this idea, stating that the “physically and ontologically unique” manuscript “is no longer what we imagine to be an archive”.

However, despite this new platform for research, Dr. Baker noted that very few people outside the archive profession know that born-digital archives exist, let alone use them. This is a problem: archives require evidence of use, so we need to encourage use.

To address this, Dr. Baker set up a Born-Digital Access Workshop at the Wellcome Library, in collaboration with their Collections Information Team, where he brought together people who use born-digital archives and the archivists who create them, and provided them with a set of four varying case studies. These four case studies were designed to explore the following:

A) the “original” environment; hard drive files in a Windows OS
B) the view experience; using the Wellcome’s Viewer
C) levels of curation; comparing reformatted and renamed collections with unaltered ones
D) the physical media; asking does the media hold value?

Several interesting observations came out of this workshop, which Dr. Baker organised into three areas:

  1. Levels of description; filenames are important, and are valuable data in themselves to researchers. Users need a balance between curation and an authentic representation of the original order.
  2. “Bog-standard” laptop as access point; using modern technology already familiar to many researchers as the mode of access to email and digital archives creates a sense of familiarity when engaging with the content.
  3. Getting the researcher from desk to archive; a substantial amount of work is needed to make researchers aware of the resources available to them and of how to use them: can they access material remotely? How much collection-level description is necessary?

Dr. Baker concluded that even with outreach and awareness events such as the one we were attending, born-digital archives are not yet truly accessible to researchers. This made me realise that the digital preservation community must push for access solutions, and get them out to users, so that researchers can gain the insights our digital collections hold.

“Email as a Corporate Record”

The third presentation of the day was given by James Lappin (@JamesLappin), Loughborough University, who discussed the issues involved in applying archival policies to emails in a governmental context.

His main point concerned the routine deletion of email that happens in governments around the world. He said no civil servants’ email accounts are currently scheduled to be saved beyond the next three to four years, though they may be available via a different structure: a kind of records management system. However, Lappin pointed out the crux of this scenario: government departments have no budget to move and save many individuals’ email accounts, and no real sense of the numbers: how much to save, and how much can be saved?

“email is the record of our age” – James Lappin

Lappin suggested an alternative: keep the emails of senior staff only. However, this raises the question: how do we filter out sensitive and personal content?

Lappin posits that auto-deletion is the solution, aiming to spare institutions from unmanageable volumes of email and the consequent breaches of data protection.
Auto-deletion encourages:

  • governments to kickstart email preservation action;
  • the integration of technology for records management solutions;
  • active consideration of the value of emails for long-term preservation.

But how do we transfer emails to an EDRMS? What structures do we use? How do we separate individuals? How do we enforce the transfer of emails? These issues are still to be worked out, and can be, Lappin argues, if we implement auto-deletion as a tool to make email preservation less daunting. At the end of the day, the current goal is to retain the “important” emails, which will make both government departments and historians happy, and in turn archivists happy. This does indeed seem like a positive scenario for us all!
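To make that idea concrete, here is a minimal Python sketch of a Capstone-style retention filter: keep whole messages from designated senior accounts, and leave the rest to the deletion schedule. The mailbox path and account list are invented for illustration; a real EDRMS transfer would of course be far more involved.

```python
import mailbox

# Hypothetical senior-staff ("capstone") accounts whose mail is retained.
CAPSTONE_ACCOUNTS = {"permanent.secretary@example.gov.uk"}

inbox = mailbox.mbox("department.mbox")   # placeholder departmental mailbox
keep = mailbox.mbox("retain.mbox")        # messages scheduled for preservation

for msg in inbox:
    sender = (msg.get("From") or "").lower()
    if any(account in sender for account in CAPSTONE_ACCOUNTS):
        keep.add(msg)  # everything else falls to the auto-deletion schedule

keep.flush()
print(f"retained {len(keep)} of {len(inbox)} messages")
```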

However, it was particularly interesting when Lappin made his next point: what if the very nature of email, as intimate and immediate, makes governments uncomfortable with the idea of saving and preserving governmental correspondence? Governments must therefore be more active in their selection processes, and save something rather than nothing – which is where the implementation of auto-deletion could, again, prove useful!

To conclude, Lappin presented a list of characteristics that could justify the permanent preservation of an individual’s government email account, including:

  • The role they play is of historic interest
  • They expect their account to be permanently preserved
  • They are given the chance to flag or remove personal correspondence
  • Access to personal correspondence is prevented except in case of overriding legal need

Personally, I feel this is fair and thorough, but only time will tell what route various governments take.

On a side note: Lappin runs an excellent comic-based blog on Records Management which you can see here.

Conclusions
One of the key issues that stood out for me today was, perhaps surprisingly, not the technology used in email preservation, but how to address the myriad issues email preservation brings to light: namely the feasibility of data protection, sensitivity review and appraisal, particularly when dealing with such vast quantities of material.

Email can only be preserved once we have defined what constitutes ’email’ and how to proceed ethically, morally and legally. Then we can move forward with implementing the technical frameworks, designed to meet our pre-defined requirements, that will enable access to historically valuable and information-rich email archives that will yield much in the name of research.

In a tweet, Evil Archivist succinctly reminds us of the importance of maintaining and managing our digital records…

Email Preservation: How Hard Can It Be? 2 – DPC Briefing Day

On Wednesday 23rd of January I attended the Digital Preservation Coalition briefing day titled ‘Email Preservation: How Hard Can It Be? 2’ with my colleague Iram. Having attended the first briefing day back in July 2017, it was a great opportunity to see what advances and changes had been achieved since. This blog post will briefly highlight what I found particularly thought-provoking, focusing on two of the talks about e-discovery from a lawyer’s viewpoint.

The day began with an introduction by the co-chair of the report, Chris Prom (@chrisprom), informing us of the work that the task force had been doing. This was followed by a variety of talks about the use of email archives and some of the technologies used for large-scale processing, from the perspectives of researchers and lawyers. The day concluded with a panel discussion (in a twist, we the audience were the panel) about the pending report and the next steps.

Update on Task Force on Technical Approaches to Email Archives Report

Chris Prom told us how the report had taken on the comments from the previous briefing day, as well as from consultation with many other people and organisations, leading to clearer and more concise messages. The report itself does not aim to lay down hard rules but to give an overview of the current situation, with recommendations for people and organisations involved in, interested in, or considering email preservation.

Reconstruction of Narrative in e-Discovery Investigations and The Future of Email Archiving: Four Propositions

Simon Attfield (Middlesex University) and Larry Chapin (attorney) spoke about narrative and e-discovery, a fascinating insight into a lawyer’s requirements when using email archives. Larry used the LIBOR scandal as an example of a project he worked on, and of the power of emails in bringing people to justice. For him, the importance of e-discovery lies in helping to create a narrative and tell a story, something a computer cannot yet do. Emails ‘capture the stuff of story making’: they reach into the crevices of things and detail the small. He noted how emails contain slang and, interestingly, the language of intention and desire; these subtleties show the true meaning of what people are saying, which matters in the quest for the truth. Simon Attfield presented his research on the coding side, which aims to help lawyers assess and sort through these vast data sets. The work he described was too technical for me to truly understand, but it was clear that collaboration between archivists, users and programmers/researchers will be vital for better preservation and use strategies.

Jason Baron (@JasonRBaron1) (attorney) gave a talk on the future of email archiving detailing four propositions.

Slide detailing the four propositions for the future of email archives. By Jason R Baron 2018

The general conclusion from this talk was that automation and technology will play an ever bigger part in the future, helping with acquisition, review (filtering out sensitive material) and searching (aiding access to larger collections). As one of the leads of the Capstone project, he explained how that approach saves all emails for a short time and some forever, dispelling the misconception that every email will be kept indefinitely. Analysis of how successful Capstone has been in improving the signal-to-noise ratio (capturing only email records of permanent value) will be important going forward.

The problem of scale, which permeates most aspects of digital preservation, arose again here. Lawyers must review any and all information, which for email accounts can be colossal. The analogy given was finding a needle in a haystack – except lawyers need to find ALL the needles (100% recall).

Current predictive coding for discovery requires human assistance: users tell the program whether its recommendations were correct, and the program learns from this process, hopefully becoming more accurate. While a program can efficiently and effectively sort personal information such as telephone numbers and dates of birth, it cannot currently sort textual content that requires prior knowledge, or non-textual content such as images.
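As a rough illustration of that human-in-the-loop process, the toy Python sketch below trains a classifier on a few reviewer-labelled messages (the documents and labels are invented) and then surfaces the unreviewed messages the model is least sure about. This shows the general shape of predictive coding, not any specific e-discovery product:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Seed set labelled by a human reviewer (1 = responsive, 0 = not responsive).
seed_docs = ["please adjust today's rate submission",
             "are we still on for lunch friday?"]
seed_labels = [1, 0]

vectoriser = TfidfVectorizer()
model = LogisticRegression().fit(vectoriser.fit_transform(seed_docs), seed_labels)

# Rank unreviewed messages: probabilities nearest 0.5 are the most ambiguous,
# so they are the most valuable for the human to label next.
unreviewed = ["submission figures attached as discussed",
              "car park permit renewal reminder"]
probs = model.predict_proba(vectoriser.transform(unreviewed))[:, 1]
for doc, p in sorted(zip(unreviewed, probs), key=lambda t: abs(t[1] - 0.5)):
    print(f"{p:.2f}  {doc}")
```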

Panel Discussion and Future Direction

The final report is due to be published around May 2018. Email is a complex digital object and the solution to its preservation and archiving will be complex also.

The technical means of physically preserving emails exist, but we still need to address the effective review and selection of the emails to be made available to researchers. The tools currently available are not accurate enough for large-scale processing; however, as artificial intelligence becomes more advanced, it appears this technology will be part of the solution.

Tim Gollins (@timgollins) gave a great overview of the current use of technology within this context, and stressed the point that the current technology is here to ASSIST humans. The tools for selection, appraisal and review need to be tailored for each process and quality test data is needed to train the programs effectively.

The non-technical aspects add further complexity, and may be more difficult to address. As a community we need to find answers to:

  • Whose email to capture (particularly interesting when an email account is linked to a position rather than a person)
  • How much to capture (entire accounts such as in the case of Capstone or allowing the user to choose what is worthy of preservation)
  • How to get persons of interest engaged (effectiveness of tools that aid the process e.g. drag and drop into record management systems or integrated preservation tools)
  • Legal implications
  • How to best present the emails for scholarly research (bespoke software such as ePADD or emulation tools that recreate the original environment or a system that a user is familiar with) 

Like most things in the digital sector, this is a fast-moving area with ever-changing technologies and trends. It may be frustrating that there is no hard guidance on email preservation yet, but when the Task Force on Technical Approaches to Email Archives report is published it will be an invaluable resource and a must-read for anyone interested or actively involved in email preservation. The takeaway message was, and still is, that emails matter!

What I Wish I Knew Before I Started – DPC Student Conference 2018

On January 24th, four Archives Assistants from Archives and Modern Manuscripts visited Senate House, London for the DPC Student Conference. With the 2018 theme being ‘What I Wish I Knew Before I Started’, it was an opportunity for digital archivists to pass on their wealth of knowledge in the field.

Getting started with digital preservation

The day started with a brief introduction to digital preservation by Sharon McMeekin from the Digital Preservation Coalition. This included an outline of the three basic models of digital preservation: OAIS, DCC lifecycle and the three-legged stool. (More information about these models can be found in the DPC handbook.) Aimed at beginners, this introduction was made accessible and easy to understand, whilst also giving us plenty to think about.

Next to take the stage was Steph Taylor, an Information Manager from CoSector, University of London. Steph is a huge advocate for the use of Twitter to find out the latest information and opinion in the world of digital preservation. As someone who has never had a Twitter account, it made me realise the importance of social media for staying up to date in such a fast-moving profession. Needless to say, I signed myself up to Twitter that evening to find out what I had been missing out on. (You can follow what was happening at the conference with the hashtag #dpc_wiwik.)

The final speaker before lunch was Matthew Addis, giving a technologist’s perspective. Matthew broke down the steps to take should you be faced with the potentially overwhelming job of starting from scratch with a repository of digital material. He referenced a two-step approach – conceived by Tim Gollins – named ‘Parsimonious Preservation’, which involves firstly understanding what you have, and secondly keeping the bits safe. In the world of digital preservation, the worst thing you can do is nothing, so by dealing with the simple and usually low-cost files first, you can protect the vast majority of the collection rather than going straight for the technical, time-consuming and costly minority of material. In the long run, the simple material that could have been dealt with early may itself become technical and costly – due to software obsolescence, for instance.
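As a minimal sketch of that first step – understanding what you have – even a crude survey of file extensions goes a long way. The ‘collection’ directory below is a placeholder:

```python
from collections import Counter
from pathlib import Path

# Tally files by extension so the cheap, common formats can be secured first.
counts = Counter(path.suffix.lower() or "<no extension>"
                 for path in Path("collection").rglob("*") if path.is_file())

for extension, total in counts.most_common():
    print(f"{extension:15} {total}")
```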

That morning, the thought of tackling a simple digital preservation project would have seemed somewhat daunting. But Matthew illustrated the steps very clearly and as we broke for lunch I was left thinking that actually, with a little guidance, it probably wouldn’t be quite so bad.

Speakers on their experiences in the digital preservation field

During the afternoon, speakers gave presentations on their experiences in the digital preservation field. The speakers were Adrian Brown from the Parliamentary Archives, Glenn Cumiskey from the British Museum and Edith Halvarsson from the Bodleian Libraries. It was fascinating to learn how diverse the day-to-day working lives of digital archivists can be, and how often, as Glenn Cumiskey remarked, you may be the first digital archivist there has ever been within a given organisation, providing a unique opportunity for you to pave the way for its digital future.

Adrian Brown on his digital preservation experience at the Parliamentary Archive

The final speaker of the day, Dave Thomson, explained why it is up to students and new professionals to be ‘disruptive change agents’ and further illustrated the point that digital preservation is a relatively new field. We now have a chance to be the change and make digital preservation something that is at the forefront of businesses’ minds, helping them avoid the loss of important information due to complacency.

The conference closed with the speakers taking questions from attendees. There was lively discussion over whether postgraduate university courses in archiving and records management are teaching the skills needed for careers in digital preservation. It was decided that although some universities do teach this subject better than others, digital archivists have to make a commitment to life-long learning – not just one postgraduate course. This is a field where the technology and methods are constantly changing, so we need to be continuously developing our skills in accordance with these changes. The discussion certainly left me with lots to think about when considering postgraduate courses this year.

If you are new to the archiving field and want to gain an insight into digital preservation, I would highly recommend the annual conference. I left London with plenty of information, ideas and resources to further my knowledge of the subject, starting my commitment to life-long learning in the area of digital preservation!

Web-Archiving: A Short Guide to Proxy Mode

Defining Proxy Mode:

Proxy Mode is an ‘offline browsing’ mode which provides an intuitive way of checking the quality and comprehensiveness of archived web content. It enables you to view documents within an Archive-It collection and ascertain which page elements have been captured effectively and which are still being ‘pulled’ from the live site.

Why Use Proxy Mode?

Carrying out QA (Quality Assurance) without Proxy Mode can lead to a false sense of reassurance about the data captured, since some page elements displayed may actually be drawn from the live site rather than from the archival capture. Proxy Mode should therefore be employed as part of the standard QA process: it prevents these live-site redirects from occurring and gives a true account of the data captured.

Using Proxy Mode:

Proxy Mode is easy to set up and involves simply downloading an add-on that can be accessed here. There is also an option to set up Proxy Mode manually in Firefox or Chrome.
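The same principle can be approximated outside the browser. In this minimal Python sketch, requests are routed through an archive proxy so that uncaptured resources fail instead of silently loading from the live web; the proxy host and port are placeholders to be replaced with the values from your Archive-It Proxy Mode settings:

```python
import requests

# Placeholder proxy address: substitute the host/port from your
# Archive-It Proxy Mode configuration.
ARCHIVE_PROXY = {"http": "http://wayback.archive-it.org:8080"}

# With the proxy active, page elements that were never crawled return
# errors rather than being pulled from the live site.
response = requests.get("http://subcultures.mml.ox.ac.uk/home.html",
                        proxies=ARCHIVE_PROXY, timeout=30)
print(response.status_code, len(response.content))
```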

Potential Issues and Solutions:

Whilst using Proxy Mode, a couple of members of the BLWA team (myself included) had issues viewing certain URLs, often receiving a ‘server not found’ error message. After corresponding with Archive-It I discovered that Proxy Mode often has trouble loading https URLs. With this in mind I loaded the same URL, this time removing the ‘s’ from https, and reloaded the page. Once Proxy Mode had been enabled, this seemed to rectify the issue.
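That workaround is easy to script when checking long lists of seeds. A small sketch of the scheme downgrade, as a hypothetical helper function:

```python
from urllib.parse import urlparse, urlunparse

def downgrade_to_http(url: str) -> str:
    """Return the URL with an http scheme, since Proxy Mode often
    fails to load https URLs."""
    parts = urlparse(url)
    if parts.scheme == "https":
        return urlunparse(parts._replace(scheme="http"))
    return url

print(downgrade_to_http("https://example.org/page"))  # http://example.org/page
```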

There was one particular instance, however, where this fix didn’t work and the same ‘server not found’ error message returned, much to my dismay! Browsers can sometimes save a specific version of a URL as the preferred version and redirect to it automatically. It turned out to be just a case of clearing the browser’s cache, cookies, offline website data and site preferences. Once this had been done I was able to load the site once again using Proxy Mode #bigachievements.

Significance & Authenticity: a Briefing

As an Ancient History graduate, significance and authenticity of source information characterised my university education. Transferring these principles to digital objects in an archival situation is a challenge I look forward to learning more about and embracing. Therefore I set off to Tate Britain on a cold Friday morning excited to explore the Digital Preservation Coalition’s briefing: Significance & Authenticity. Here are some of my reflections.

A dictionary definition is not enough

The morning started with a stimulating discussion led by Sharon McMeekin (DPC), on the definitions of these two concepts within the field of Digital Archives and the context of the varying institutions the delegates were from. Several key points were made, and further questions generated:

Authenticity

  • Authenticity clearly carries with it evidential value; if something is not what it purports to be then how can it (claim to) be authentic?
  • Chains of custody and tracking accidental/intended changes are extremely relevant to maintaining authenticity
  • Do further measures, such as increasing metadata fields, ensure authenticity?

For an archival record to retain authenticity there must be a record of the original creation or experience of the digital object; otherwise we are looking at data without context. This also has a bearing on how significant an archival record is. A suggestion was also made that perhaps, as a sector, too much emphasis is placed on integrity-checking procedures. Questions surfaced such as: is the digital preservation community too reliant on it? And is this practical, process-driven approach to ensuring authenticity too simplistic?

Significance

  • Records are not just static evidence; they are also for appreciation, education and use
  • Should the users and re-users (the designated community) be considered more extensively when deciding the significance of a digital object?
  • Emulation as a digital preservation action prioritises the experience of using the data: is this the way to go regarding maintaining both the significant properties together with the authenticity?

There was no doubt left in my mind that the two principles are inextricably linked. However, not only are they increasingly subjective for both the record keeper and the end user, they must also be distinguished from one another. For example, if a digital object can be interpreted as both a game and a book, yet was created and marketed as a book, does this make it any less significant or authentic? Or is the dispute part of what makes the object significant: the creation, characterisation and presentation of data in digital form is reflective of society today and of what researchers may (or may not) be interested in in the future? We do not know and, as a fellow delegate reminded us, cannot prejudice future research needs.

Building on the open-mindedness that the discussion encouraged, we were then fortunate enough to hear and learn from practitioners of differing backgrounds about how they ensure the significance and authenticity of their collections. One particular example had me contemplating all weekend.

Significance & Authenticity of Digital Art by Patricia Falcao & Tom Ensom (Tate)

Patricia and Tom explained that they work with time-based media art and its creators. Working (mostly) with living artists ensures a short chain of provenance, but the nature of digital art means that applying authenticity and significance is in no way straightforward. One factor that immediately shapes the criteria of significance is that it is very important the Tate can exhibit the works, illustrating that differences between organisations will of course have a bearing on how significant a record is.

One example Tom analysed was the software based Brutalism: Stereo Reality Environment 3 by Peruvian artist Jose Carlos Martinat Mendoza:

Brutalism: Stereo Reality Environment 3 2007 Jose Carlos Martinat Mendoza born 1974 Presented by Eduardo Leme 2007, accessioned 2011 http://www.tate.org.uk/art/work/T13251

The artwork comprises a range of components: high-speed printers, paper rolls, a web-search program and accompanying hardware, movement sensors, and a model replica of the Peruvian government building ‘El Pentagonito’, a symbol of brutalist architecture. The computer is programmed to search the web for references to ‘Brutalism’, and the different extracts of information it gathers are printed from printers mounted on the sculpture, left to fall to the floor around the replica.

Tom explained that retaining the authenticity of the digital art was very much a case of committing to represent the artist’s work together with its arrangement and intention. One method of ensuring this is the transfer of a document from the creator called ‘Installation Parameters’. For this particular example, it contained details such as paper type and cabling needs, as well as display specifications such as the hardware being a very visible element of the artwork.

Further documentation is created and stored to preserve the original authenticity, and thus the unique significance, of the artwork and the integrity of its ‘performance’. Provenance information such as diagrams, process metadata and the original source code is stored separately from the work itself. However, Tom acknowledged there is no doubt the work will need to change and in turn be reinterpreted. Interestingly, the point was made that the text on the paper is itself time-sensitive; live search results relating to Brutalism will evolve and change.

Looking ahead, what will happen when the hardware fails? And what will happen when nobody uses printers any more? Stockpiling is only a short-term plan for maintaining authenticity and significance. Furthermore, even if hardware can be guaranteed, the program software itself generates different issues. Software emulation, code-change tracking systems and binary analysis are all to be explored as means of enabling authenticity, but there will always be risk and a need for alternative solutions.

Would these changes reduce the authenticity or significance? I believe authenticity is associated with intention, so perhaps communicating changes to the user with justifications could be one way of maintaining this principle. Significance, on the other hand, is trickier: without the significant and notable properties of the work, is significance automatically lost?

This case study reinforced that there is much to explore and consider when approaching the principles of authenticity and significance of digital objects. To conclude, Tom and Patricia reinforced that within the artistic context, decisions around authenticity and significance are made through collaborative dialogues with the artist/creator which does indeed provide direction.

Workshop

After three more talks and a panel session, the briefing ended with a workshop requiring us to evaluate the significance and authenticity of a digital object provided. As a trainee digital archivist I can be guilty of shying away from group discussions and exercises within the community of practice, so I was really pleased to jump in and contribute during the group workshop exercise.

Thank you to the DPC and all involved for a brilliant day.

Subcultures as Integrative Forces in East-Central Europe 1900 – present: a Bodleian Libraries’ Web Archive record

A problem, and a solution in action:

The ephemeral nature of internet content (the average life of a web page is 100 days – illustrating that websites do not need to be purposefully deleted to vanish) is only one contributing factor to data loss. Web preservation is a high priority; action is required. This is a driver not only for Bodleian Libraries’ Web Archive, but for digital preservation initiatives on a global scale.

However, today I would like to share the solution in action, an example from BLWA’s University of Oxford Collection: Subcultures as Integrative Forces in East-Central Europe 1900 – present.

On the live web, attempts to access the site are met with automatic redirects to BLWA’s most recent archived capture (24 Jan. 2017). The yellow banner indicates it is part of our archive. Image from http://wayback.archive-it.org/2502/20170124104518/http://subcultures.mml.ox.ac.uk/home.html

Subcultures is a University of Oxford project, backed by the Arts & Humanities Research Council, which through its explorative redefinition of ‘sub-cultures’ aims to challenge the current understanding of simultaneous forms of identification in Eastern Europe, using a multi-disciplinary methodology spanning social anthropology, discourse analysis, historical studies and linguistics. The project ran from 2012 to 2016.

The Subcultures website is an incredibly rich record of the project and its numerous works. It hosted cross-continent collaborative initiatives including lectures, international workshops and seminars, as well as an outreach programme including academic publications. Furthermore, comparative micro-studies were conducted in parallel with the main collaborative project: Linguistic Identities: L’viv/Lodz, c.1900; Myth and Memory: Jews and Germans, Interwar Romania; Historical Discourses: Communist Silesia; and Discursive Constructions: L’viv and Wroclaw to the present. The scope and content of the project, including key questions, materials, past and present events and network information, is* all hosted on http://subcultures.mml.ox.ac.uk/home.html.

Was*. The site is no longer live on the internet.

However, as well as an automatic re-direction to our most recent archival copy, a search on Bodleian Libraries’ Web Archive generates 6 captures in total:

Search results for Subcultures within BLWA. Image from https://archive-it.org/home/bodleian?q=Subcultures

The materials tab of the site is fully functional in the archived capture: you are able to listen to the podcasts and download the papers on theory and case studies as PDFs.

The use of Subcultures

To explore the importance of web-archiving in this context, let us think about the potential use(rs) of this record and the implications if the website were no longer available:

As the project comprised a wider outreach programme alongside its research, content such as PDF publications and podcasts was made available for download, consultation and further research. The website platform means that these innovative collaborations, and the data produced by the primary methodology, remain accessible to the public on a global scale, for education, knowledge and engagement with important issues – without even elaborating on how academics, researchers, historians and the wider user community will benefit from the availability of these materials in the web archive. Outreach, by its very nature, serves an unspecified group of people, whose needs cannot be known in advance.

Listening to the podcast of the project event hosted in Krakow: ‘Hybrid Identity’ in 2014. Rationale, abstracts and biographies from the workshop can also be opened. Image from http://wayback.archive-it.org/2502/20170124104618/http://subcultures.mml.ox.ac.uk/materials/workshop-krakow-hybrid-identity-september-2014.html

Furthermore, the site provides an irreplaceable record of institutional history for the University of Oxford as a whole, as well as of its research and collaborations; this is a dominant purpose of our University of Oxford collection. The role of preserving for posterity cannot be overstated. Subcultures provides data that will be used and re-used, and will be of great importance for decades to come; it also documents decisions and projects of the University of Oxford. For example, the outline and rationale of the project is available in full through the Background Paper – Theory, which can be consulted through the archived capture just as it would be on the live web. Biographical details of contributors are also hosted on the captures, preserving records of the people involved and their roles for posterity and accountability.

Building on the importance of access to research: internet presence increases scholarly interaction. The scope of the project is of great relevance, and data for research is not only available from the capture of the site; the use of internet archives as datasets is expected to become more prominent.

Participate!

Here at BLWA the archiving process begins with a nomination: if you have a site that you believe is of value for preservation as part of one of our collections, please nominate it here. The nomination form will go to the curators and web archivists on the BLWA team for selection checks and further processing. We would love to hear your nominations.

Why and how do we Quality Assure (QA) websites at the BLWA?

At the Bodleian Libraries Web Archive (BLWA), we Quality Assure (QA) every site in the web archive. This blog post aims to give a brief introduction to why and how we QA. The first step of our web archiving involves crawling a site using the tools developed by Archive-It. These tools allow entire websites to be captured and browsed using the Wayback Machine as if they were live, letting you download files, view videos and photos and interact with dynamic content, exactly how the website owner would want you to. However, due to the huge variety and technical complexity of websites, there is no guarantee that every capture will be successful (that is to say, that all the content is captured and working as it should be). Currently there is no accurate automatic process to check this, and so this is where we step in.

We want to ensure that the sites on our web archive are an accurate representation in every way. We owe this to the owners and the future users. Capturing the content is hugely important, but so too is how it looks, feels and how you interact with it, as this is a major part of the experience of using a website.

Quality assurance of a crawl involves manually checking the capture. Using the live site as a reference, we explore the archived capture, clicking on links, trying to download content or view videos, and noting any major discrepancies from the live site or any other issues. Sometimes a picture or two will be missing, or a certain link may not resolve correctly, which can be relatively easy to fix; other times there can be massive differences compared to the live site, and so the (often long and sometimes confusing) process of solving the problem begins. Some common issues we encounter (with a simple scripted spot check sketched after the list) are:

  • Incorrect formatting
  • Images/video missing
  • Large file sizes
  • Crawler traps
  • Social media feeds
  • Dynamic content playback issues
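Much of this checking is necessarily manual, but a first pass can be scripted. As a crude, illustrative sketch (the regex ‘parsing’ is deliberately naive and both URLs are placeholders), comparing the number of images referenced by the live page and by its capture can flag obvious gaps for manual review:

```python
import re
import requests

def image_count(url: str) -> int:
    """Crudely count <img> tags in a page's HTML (illustrative only)."""
    html = requests.get(url, timeout=30).text
    return len(re.findall(r"<img\b", html, flags=re.IGNORECASE))

live = image_count("http://example.org/page.html")  # live URL (placeholder)
capture = image_count(
    "http://wayback.archive-it.org/2502/20170124104518/"
    "http://example.org/page.html")                 # its capture (placeholder)

print(f"live: {live} images, capture: {capture} images")
if capture < live:
    print("Capture may be missing images - queue for manual QA")
```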

There are many techniques available to help solve these problems, but there is no ‘one fix for all’: the same issue on two different sites may require two different solutions. There is a lot of trial and error involved, and over the years we have gained a lot of knowledge on how to solve a variety of issues. Archive-It also has a fantastic FAQ section on their site; if we have gone through the usual avenues and still cannot solve our problems, our final port of call is to ask the geniuses at Archive-It, who are always happy and willing to help.

An example of how important and effective QA can be. The initial test capture did not have the correct formatting and was missing images. This was resolved after the QA process

QA’ing is a continual process. Websites add new content, or companies change to different website designers, meaning captures of websites that have previously been successful might suddenly have an issue. It is for this reason that every crawl is given special attention and is QA’d. QA’ing the captures before they are made available is a time-consuming but incredibly important part of the web archiving process at the Bodleian Libraries Web Archive. It allows us to maintain a high standard of capture and provide an accurate representation of the website for future generations.

 

PASIG 2017: Smartphones within the changing landscape of digital preservation

I recently volunteered at the PASIG 2017 conference in Oxford; it was a great experience and a chance to learn more about the archives sector. Many of the talks at the conference focused on the current trends and influences affecting the trajectory of the industry.

A presentation that covered some of these trends in detail was given by Somaya Langley from Cambridge University Library (Polonsky Digital Preservation Project), featured in the ‘Future of DP theory and practice’ session: ‘Realistic digital preservation in the near future: How do we get from A to Z when B already seems too far away?’. Somaya’s presentation considered how we preserve the digital content we receive from donors on smartphones, with her focus being on iOS.

Langley, Somaya (2017): Realistic digital preservation in the near future: How to get from A to Z when B seems too far away?. figshare. https://doi.org/10.6084/m9.figshare.5418685.v1 Retrieved: 08:22, Sep 22, 2017 (GMT)

Somaya’s presentation discussed how ingest suites in digital preservation have long been used to dealing with CDs, DVDs, floppy disks and HDDs, but are not sufficiently prepared for ingesting smartphones or tablets and the various issues associated with these devices. We must realise that smartphones potentially hold a wealth of information for archives:

‘With the design of the Apple Operation System (iOS) and the large amount of storage space available, records of emails, text messages, browsing history, chat, map searching, and more are all being kept’.

(Forensic Analysis on iOS Devices,  Tim Proffitt, 2012. https://uk.sans.org/reading-room/whitepapers/forensics/forensic-analysis-ios-devices-34092 )

Why iOS? What about Android?

The UK market for the iPhone (unlike the rest of Europe) shows a much closer split: in November 2016, iOS held 48.3% of UK sales against Android’s 49.6% market share. This contrasts with Apple’s global market share of 12.1% in Q3 of 2016.

Whichever side of the fence you stand on, it is clear that smartphones, be they Android or iOS, will play an important role in our collections. The skills required to extract content differ across platforms, so we as digital archivists will have to learn both methods of extraction and leave our consumer preferences at the door.

So how do we get the data off the iPhone?

iOS has long been known as a ‘locked-down’ operating system, and Apple has always had an anti-tinkering stance with many of its products. It should therefore come as no surprise that locating files on an iPhone is not very straightforward.

As Somaya pointed out in her talk, after spending six hours at the Apple Store ‘Genius Bar’ she was no closer to learning from Apple employees what the best course of action would be to locate backups of notes from a ‘bricked’ iPhone. She therefore used her own method of retrieving the notes, using iExplorer to search through the iPhone’s backups.

She noted, however, that due to limitations of iOS it was very challenging to locate these files; in some cases it even required the command line to access the backup storage location, as it is hidden by default in OS X (macOS, the main operating system used by Apple computers).

Many tools exist for extracting information from iPhones; the four main methods are outlined in the SANS Institute white paper Forensic Analysis on iOS Devices by Tim Proffitt (a minimal sketch of exploring a backup follows the list):

  1. Acquisition via iTunes Backups (requires original PC last used to sync the iPhone)
  2. Acquiring Backup Data with iPhone Analyzer (free java-based computer program, issues exist when dealing with encrypted backups)
  3. Acquisition via Logical Methods: (uses a synchronisation method built into iOS to recover data, e.g: programs like iPhone Explorer)
  4. Acquisition via Physical Methods (obtaining a bit-by-bit copy, e.g: Lantern 2 forensics suite)
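As a minimal sketch of the first route: once an iTunes backup exists, its file index can be inspected with nothing more than Python’s sqlite3 module. This assumes a modern, unencrypted backup whose index lives in Manifest.db; the path and <UDID> below are placeholders:

```python
import os
import sqlite3

# Default macOS backup location; replace <UDID> with the device's
# actual backup folder name. Unencrypted backups only.
backup = os.path.expanduser(
    "~/Library/Application Support/MobileSync/Backup/<UDID>/Manifest.db")

connection = sqlite3.connect(backup)
# Each row maps a logical path on the device to a hashed file in the backup.
for file_id, domain, rel_path in connection.execute(
        "SELECT fileID, domain, relativePath FROM Files LIMIT 20"):
    print(f"{domain}  {rel_path}  ->  {file_id}")
connection.close()
```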

Encryption is a challenge for retrieving data from an iPhone, especially since iTunes includes an option to encrypt backups when syncing. Proffitt suggests using a password cracker or jail-breaking as solutions to this issue; however, these solutions might not be appropriate in our archival situations.

Another issue with smartphone digital preservation is platform and version locking. Even though the above methods work for data extraction at the moment, it is very possible that future versions of iOS could make them defunct, requiring software developers to continually update their programs or look for new approaches.

Langley, Somaya (2017): Realistic digital preservation in the near future: How to get from A to Z when B seems too far away?. figshare. https://doi.org/10.6084/m9.figshare.5418685.v1 Retrieved: 08:22, Sep 22, 2017 (GMT)

Final thoughts

One final consideration that can be raised from Somaya’s talk is that of privacy. As with the arrival of computers into our archives, phones will pose similar moral questions for archivists:

Do we ascribe different values to information stored on smartphones?
Do we consider the material stored on phones more personal than data stored on our computers?

As mentioned previously, our phones store everything from emails, geo-tagged photos and phone call information to, with the growing popularity of smart wearable technology, health data (including heart rate, daily activity, weight, etc.). We as digital archivists will be dealing with very sensitive personal information and need to understand our responsibility to safeguard it appropriately.

There is no doubt that soon enough we in the archive field will be receiving more and more smartphones and tablets into our archives from donors. Hopefully talks like Somaya’s will start the ball rolling towards the creation of better standards and approaches to smartphone digital curation.

PASIG 2017: “Sharing my loss to protect your data” University of the Balearic Islands

Last week I was lucky enough to attend the PASIG 2017 (Preservation and Archiving Special Interest Group) conference, held at the Oxford University Museum of Natural History, where over the course of three days the digital preservation community connected to share experiences, tools, successes and mishaps.

The story of one such mishap came from Eduardo del Valle, Head of the Digitization and Open Access Unit at the University of the Balearic Islands (UIB), in his presentation titled “Sharing my loss to protect your data: A story of unexpected data loss and how to do real preservation”. In 2013 the digitisation and digital preservation workflow pictured below was set up by the IT team at UIB.

2013 Digitisation and Digital Preservation Workflow (Eduardo del Valle, 2017)

Del Valle was told this was a reliable system with fast retrieval. He found this was not the case: retrieval was slow, and the only means of organisation was an Excel spreadsheet holding the storage locations of the data.

In order to assess their situation they used the NDSA Levels of Digital Preservation, a tiered set of recommendations on how organisations should build their digital preservation activities, developed by the National Digital Stewardship Alliance (NDSA) in 2012. The guidelines are organised into five functional areas that lie at the centre of digital preservation:

  1. Storage and geographic location
  2. File fixity and data integrity
  3. Information security
  4. Metadata
  5. File formats

These five areas then have four columns (Levels 1-4) which set tiered recommendations of action, from Level 1 being the least an organisation should do, to Level 4 being the most an organisation can do. You can read the original paper on the NDSA Levels here.
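Level 1 of the ‘file fixity and data integrity’ area, for example, amounts to recording checksums and periodically re-checking them. A minimal sketch, assuming a hypothetical ‘masters’ directory of digitised TIFFs, whose output can be diffed against an earlier run to detect silent corruption or loss:

```python
import hashlib
from pathlib import Path

def sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large masters don't exhaust memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        while block := handle.read(chunk_size):
            digest.update(block)
    return digest.hexdigest()

# Write a manifest line per file; re-run later and diff the manifests.
for tiff in sorted(Path("masters").rglob("*.tif")):
    print(f"{sha256(tiff)}  {tiff}")
```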

The slide below shows the extent to which the University met the NDSA Levels. They found there was an urgent need for improvement.

NDSA Levels of Preservation UIB compliance (Eduardo del Valle, 2017)

“Anything that can go wrong, will go wrong” – Eduardo del Valle

In 2014 the IT team decided to implement a new back up system. While the installation and configuration of the new backup system (B) was completed, the old system (A) remained operative.

On the 14th and 15th of November 2014, a backup of the digital material generated during the digitisation of 9 rare books from the 14th century was created in the tape backup system (A); notably, two confirmation emails were received, verifying the success of the backup. By October 2015, all digital data, spanning UIB projects from 2008 to 2014, had been migrated from system (A) to the new system (B).

However, on 4th November 2015, a loss of data was detected…

The files corresponding to the 9 digitised rare books were lost. The loss was detected a year after the initial backup of the 9 books in system A, by which time the contract for technical assistance had finished, meaning there was no possibility of obtaining financial compensation if the loss was due to a hardware or software problem. The loss of these files, unofficially dubbed “the X-files”, meant the loss of three months of work and its corresponding economic cost. Furthermore, the rare books were in poor condition, and to digitise them again could cause serious damage. Despite a number of theories, the University has yet to receive an explanation for the loss of data.

The digitised 14th century rare book from UIB collection (Eduardo del Valle, 2017)

To combat issues like this, and to enforce best practice in their digital preservation efforts, the University acquired Libsafe, a digital preservation solution offered by Libnova. Libsafe is OAIS (ISO 14721:2012) compliant and encompasses advanced metadata management with a built-in ISAD(G) filter, with the possibility of importing any custom metadata schema. Furthermore, Libsafe offers fast delivery, format control, storage of two copies in disparate locations, and a built-in catalogue. With the implementation of a standards-compliant workflow, the UIB went on to meet all four levels across the five areas of the NDSA Levels of Digital Preservation.

ISO 14721:2012 (Space Data and Information Transfer Systems – Open Archival Information System – Reference Model, OAIS) provides a framework for implementing the archival concepts needed for long-term digital preservation and access; for describing and comparing architectures and operations of existing and future archives; and for describing the roles, processes and methods for long-term preservation.

The use of these standards facilitates the easy access, discovery and sharing of digital material, as well as their long-term preservation. Del Valle’s story of data loss reminds us of the importance of implementing standards-based practices in our own institutions, to minimise risk and maximise interoperability and access, in order to undertake true digital preservation.

With thanks to Eduardo del Valle, University of the Balearic Islands.