Tag Archives: Bodleian Libraries Web Archive

Web Archiving & Preservation Working Group: Social Media & Complex Content

On January 16, 2020, I had the pleasure of attending the first public meeting of the Digital Preservation Coalition’s Web Archiving and Preservation Working Group. The meeting was held in the beautiful New Records House in Edinburgh.

We were welcomed by Sara Day Thomson, who in her opening talk gave us a very clear overview of the issues and questions we increasingly run into when archiving complex/dynamic web or social media content. For example, how do we preserve apps like Pokémon Go that use a user’s location data or even personal information to individualize the experience? Or where do we draw the line in interactive social media conversations? After all, we cannot capture everything. But how do we even capture this information without infringing the rights of the original creators? These and other musings set the stage perfectly for the rest of the talks during the day.

Although I would love to include every talk from the day, as they were all very interesting, I will only highlight a few of the presentations to give this blog post some pretence at “brevity”.

The first talk I want to highlight was given by Giulia Rossi, Curator of Digital Publications at the British Library, on “Overview of Collecting Approach to Complex Publications”. Rossi introduced us to the emerging formats project, a two-year project by the British Library. The project focuses on three types of content:

  1. Web-based interactive narratives where the user’s interaction with a browser based environment determines how the narrative evolves;
  2. Books as mobile apps (a.k.a. literary apps);
  3. Structured data.

Personally, I found Rossi’s discussion of the collection methods particularly interesting. The team working on the emerging formats project does not just use heritage crawlers and other web harvesting tools, but also file transfers or direct downloads via access code and password. Most strikingly, in the event that only a partial capture can be made, they try to capture as much contextual information about the digital object as possible, including blog posts, screenshots or videos of walkthroughs, so researchers will have a good idea of what the original content would have looked like.

The capture of contextual content and the inclusion of additional contextual metadata about web content is currently not standard practice. Many tools do not even allow for their inclusion. However, considering that many of the web harvesting tools experience issues when attempting to capture dynamic and complex content, this could offer an interesting work-around for most web archives. It is definitely an option that I myself would like to explore going forward.

The second talk that I would like to zoom in on is “Collecting internet art” by Karin de Wild, digital fellow at the University of Leicester. Taking Agent Ruby – a chatbot created by Lynn Hershman Leeson – as her example, de Wild explored how we determine which aspects of internet art need to be preserved and what challenges this poses. In the case of Agent Ruby, the San Francisco Museum of Modern Art initially exhibited the chatbot in a software installation within the museum, thereby taking the artwork out of its original context. They then proceeded to add it to their online Expedition e-space, which has since been taken offline. Only a screenshot of the online artwork is currently accessible through the SFMOMA website, as the museum prioritizes the preservation of the interface over the chat functionality.

This decision raises questions about the right ways to preserve online art. Does the interface indeed suffice, or should we attempt to maintain the integrity of the artwork by saving the code as well? And if we do that, should we employ code restitution, which aims to preserve the original artwork’s code, or a significant part of it, whilst adding restoration code to reanimate defunct code to full functionality? Or do we emulate the software, as the University of Freiburg is currently exploring? How do we keep track of the provenance of the artwork whilst taking into account the different iterations that digital artworks go through?

De Wild proposed turning to linked data as a way to keep track of the provenance of an artwork in particular. Together with two other colleagues, she has been working on a project called Rhizome, in which they are creating a data model that will allow people to track the provenance of internet art.

Although this is not within the scope of the Rhizome project, it would be interesting to see how well the finished data model would lend itself to tracking changes in the look and feel of regular websites as well. Even though the layouts of websites have changed radically over the past number of years, these changes are usually not documented in metadata or data models, although they can be as much a reflection of social and cultural changes as the content of the website. Going forward, it will be interesting to see how changes in archiving online artworks will influence the preservation of online content in general.

The final presentation I would like to draw attention to is “Twitter Data for Social Science Research” by Luke Sloan, deputy director of the Social Data Science Lab at Cardiff University. He provided us with a demo of COSMOS, an alternative to the Twitter API, which is freely available to academic institutions and not-for-profit organisations.

COSMOS allows you to either target a particular Twitter feed or enter a search term to obtain a 1% sample of the total worldwide Twitter feed. The gathered data can be analysed within the system and is stored in JSON format. The information can subsequently be exported to CSV or Excel format.

Although the system can only capture new (or live) Twitter data, it is possible to upload historical Twitter data into the system if an archive has access to it.
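To give a concrete idea of what such an export involves, the sketch below flattens line-delimited JSON tweet records into CSV using only the Python standard library. The field names and record layout here are illustrative assumptions, not COSMOS’s actual export schema.

```python
import csv
import io
import json

def tweets_json_to_csv(json_lines, csv_file, fields=("id", "created_at", "text")):
    """Write one CSV row per JSON tweet record.

    The field names are hypothetical; a real export would use
    whatever schema the archiving tool actually produces.
    """
    writer = csv.DictWriter(csv_file, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    for line in json_lines:
        record = json.loads(line)
        # Missing fields become empty cells rather than raising an error.
        writer.writerow({f: record.get(f, "") for f in fields})

# Usage with a small in-memory sample:
sample = [
    '{"id": 1, "created_at": "2020-01-16", "text": "hello", "user": "a"}',
    '{"id": 2, "created_at": "2020-01-17", "text": "world", "user": "b"}',
]
out = io.StringIO()
tweets_json_to_csv(sample, out)
print(out.getvalue())
```

Keeping the JSON as the master copy and treating CSV as a derived view preserves fields (such as nested user objects) that a flat export would otherwise discard.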

Having given us an explanation of how COSMOS works, Sloan asked us to consider the potential risks that archiving and sharing Twitter data could pose to the original creator. Should we not protect these creators by anonymizing their tweets to a certain extent? If so, what data should we keep? Do we only record the tweet ID and the location? Or would this already make it too easy to identify the creator?
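One common approach to this anonymization question (my own illustration, not something prescribed in the talk) is pseudonymization: replacing identifiers such as usernames with keyed hashes, so tweets by the same author can still be linked without the original handle being stored. A minimal sketch, assuming the archive holds a private key that is never published:

```python
import hashlib
import hmac

# Hypothetical secret held by the archive; anyone without it
# cannot recompute the mapping from handle to pseudonym.
SECRET_KEY = b"replace-with-a-private-archive-key"

def pseudonymize(identifier: str) -> str:
    """Map an identifier (e.g. a Twitter handle) to a stable pseudonym.

    The same input always yields the same output, so links between
    tweets survive, but the handle itself is not retained.
    """
    digest = hmac.new(SECRET_KEY, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]

tweet = {"user": "@example_handle", "text": "an archived tweet"}
tweet["user"] = pseudonymize(tweet["user"])
print(tweet["user"])  # a 16-character hex pseudonym
```

A keyed hash (HMAC) rather than a plain hash matters here: Twitter handles are public and enumerable, so an unkeyed hash could be reversed simply by hashing every known handle.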

The last part of Sloan’s presentation tied in really well with the discussion about the ethical approaches to archiving social media. During this discussion we were prompted to consider ways in which archives could archive Twitter data whilst being conscious of the potential risks to the original creators of the tweets. This definitely got me thinking about the way we currently archive some of the Twitter accounts related to the Bodleian Libraries in our very own Bodleian Libraries Web Archive.

All in all, the DPC event definitely gave me more than enough food for thought about the ways in which the Bodleian Libraries and the wider community in general can improve the ways we capture (meta)data related to the online content that we archive and the ethical responsibilities that we have towards the creators of said content.

Oxford LibGuides: Web Archives

Web archives are becoming more prevalent and are increasingly used for research purposes. They are fundamental to the preservation of our cultural heritage in the interconnected digital age. With the continuing collection development on the Bodleian Libraries Web Archive and the recent launch of the new UK Web Archive site, the web archiving team at the Bodleian have produced a new guide to web archives. The new Web Archives LibGuide includes useful information for anyone wanting to learn more about web archives.

It focuses on the following areas:

  • The Internet Archive, The UK Web Archive and the Bodleian Libraries Web Archive.
  • Other web archives.
  • Web archive use cases.
  • Web archive citation information.

Check out the new look for the Web Archives LibGuide.


Why and how do we Quality Assure (QA) websites at the BLWA?

At the Bodleian Libraries Web Archive (BLWA), we Quality Assure (QA) every site in the web archive. This blog post aims to give a brief introduction to why and how we QA. The first step of our web archiving process is crawling a site using the tools developed by Archive-It. These tools allow entire websites to be captured and browsed in the Wayback Machine as if they were live, allowing you to download files, view videos and photos and interact with dynamic content, exactly as the website owner would want you to. However, due to the huge variety and technical complexity of websites, there is no guarantee that every capture will be successful (that is to say, that all the content is captured and working as it should be). Currently there is no accurate automatic process to check this, and so this is where we step in.

We want to ensure that the sites on our web archive are an accurate representation in every way. We owe this to the owners and the future users. Capturing the content is hugely important, but so too is how it looks, feels and how you interact with it, as this is a major part of the experience of using a website.

Quality assurance of a crawl involves manually checking the capture. Using the live site as a reference, we explore the archived capture, clicking on links, trying to download content or view videos, and noting any major discrepancies from the live site or any other issues. Sometimes a picture or two will be missing, or a certain link will not resolve correctly, which can be relatively easy to fix; but at other times there can be massive differences compared to the live site, and so the (often long and sometimes confusing) process of solving the problem begins. Some common issues we encounter are:

  • Incorrect formatting
  • Images/video missing
  • Large file sizes
  • Crawler traps
  • Social media feeds
  • Dynamic content playback issues

There are many techniques available to help solve these problems, but there is no ‘one fix for all’: the same issue on two different sites may require two different solutions. There is a lot of trial and error involved, and over the years we have gained a lot of knowledge on how to solve a variety of issues. Archive-It also has a fantastic FAQ section on its site. However, if we have gone through the usual avenues and still cannot solve our problem, then our final port of call is to ask the geniuses at Archive-It, who are always happy and willing to help.

An example of how important and effective QA can be: the initial test capture did not have the correct formatting and was missing images. This was resolved after the QA process.

QA’ing is a continual process. Websites add new content, or organisations switch to different website designs, meaning captures of websites that have previously been successful might suddenly have an issue. It is for this reason that every crawl is given special attention and is QA’d. QA’ing the captures before they are made available is a time-consuming but incredibly important part of the web archiving process at the Bodleian Libraries Web Archive. It allows us to maintain a high standard of capture and provide an accurate representation of the website for future generations.