Tag Archives: Bodleian Libraries Web Archive

The International Internet Preservation Consortium Web Archiving Conference: Thoughts and Takeaways

A couple of months ago, thanks to the generous support of the IIPC student bursary, I had the pleasure of attending the International Internet Preservation Consortium (IIPC) web archiving conference in Hilversum, the Netherlands. The conference took place at the Netherlands Institute for Sound & Vision, which added gravitas and rainbow colour to each of the talks and panels.

The Netherlands Institute for Sound & Vision. Photo taken by Olga Holownia.

What struck me most throughout the conference was how up to date the ideas and topics of the panels were. While traditional archiving usually deals with history that happened decades or centuries ago, web archiving requires fast-paced decisions and actions to preserve contemporary material as it is being produced. The web is a dynamic, flexible, constantly changing entity. Content is often deleted, or buried under the constant barrage of newly created content. Web archivists must therefore stay in the know and up to date in order to keep up with the arms race between web technology and archiving resources.

For instance, right from the beginning, the opening keynote discussed the ongoing Russian war in Ukraine. Eliot Higgins, founder of Bellingcat, the independent investigative collective focused on producing open-source research, discussed the role of digital metadata and digital preservation techniques in the fight against disinformation. Using the example of Russian propaganda about the war in Ukraine, Higgins demonstrated that archived versions of sites and videos, and their associated metadata, can help to debunk intentionally spread misinformation depicting the Ukrainian army in a bad light. For instance, geolocation metadata has been used to prove that multiple videos supposedly showing the Ukrainian army threatening and killing innocent civilians were actually staged and filmed behind the Russian front lines. The notion that web archives are not just preserving modern culture and history, but also aiding in the fight against harmful disinformation, is quite heartening.

A similarly current topic of conversation was the potential use of artificial intelligence (AI) in web archives. Given what a hot topic AI is, its prevalence at the web archiving conference was well received. The quality assurance process for web archiving, which can be arduous and time-consuming, was mentioned as a potential use case for AI. Checking every subpage of an archived site against the live site is impossible given time and resource constraints. However, if AI could be used to compare screenshots of the live site to the captured version, then even without actually going in and patching the issues, just knowing where the issues are would save considerable time. AI could also be used to fill gaps in collections: it is hard to know what you do not know. In particular, the Bodleian has a collection aimed at preserving the events and experiences of people affected by the war in Ukraine. Given our web archiving team's lack of Ukrainian and Russian language skills, it can be hard to know which sites to include in the collection and which to leave out. Having AI generate a list of sites deemed culturally relevant to the conflict could help fill gaps in this collection that we were not even aware of.
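
To make the screenshot-comparison idea a little more concrete, here is a minimal sketch (not a tool discussed at the conference) that uses a perceptual hash to flag archived pages that look noticeably different from their live counterparts; the file paths and threshold are illustrative assumptions.

```python
# Rough sketch: flag archived pages whose screenshots differ noticeably
# from the live site, so a human can prioritise manual QA.
# Assumes screenshots have already been taken and saved to disk;
# the paths and threshold below are illustrative only.
from PIL import Image
import imagehash

PAGES = [
    ("screens/live/home.png", "screens/archived/home.png"),
    ("screens/live/about.png", "screens/archived/about.png"),
]
THRESHOLD = 10  # Hamming distance above which a page is flagged for review

for live_path, archived_path in PAGES:
    live_hash = imagehash.phash(Image.open(live_path))
    archived_hash = imagehash.phash(Image.open(archived_path))
    distance = live_hash - archived_hash  # number of differing hash bits
    if distance > THRESHOLD:
        print(f"Review needed: {archived_path} (distance {distance})")
```

Even a crude filter like this would only point a human at the pages most likely to need patching; it does not remove the need for manual QA.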

Social media archiving was also a significant subject at the conference. Despite the large part that social media plays in our lives and culture, it can be very challenging to capture. For example, the Heritrix crawler, the most commonly used crawler in web archiving, is blocked by Facebook and Instagram. Additionally, while Twitter technically remains capturable, much of the dynamic content contained in posts (e.g. videos, GIFs, links to outside content) cannot be replayed in archived versions. Collaboration between social media companies and archivists was heralded as a necessity, and something that needs to happen soon. In the meantime, the web archiving tools considered best suited to social media capture included Webrecorder and other tools that mimic how a user would navigate a website in order to create a high-fidelity capture that includes dynamic content.
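
As a rough illustration of that "mimic a user" approach (this is not Webrecorder itself), the sketch below drives a real browser with Playwright, scrolls the page so lazily loaded media are requested, and saves the rendered HTML plus a screenshot. The URL and output paths are placeholders I have made up for the example.

```python
# Sketch of a browser-driven capture that behaves like a user:
# load the page, scroll to trigger lazily loaded content, then save
# the rendered HTML and a full-page screenshot. Illustrative only;
# dedicated tools such as Webrecorder also record the underlying
# network traffic into WARC files for faithful replay.
from playwright.sync_api import sync_playwright

URL = "https://example.org/some-social-feed"  # placeholder URL

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto(URL, wait_until="networkidle")
    # Scroll a few times so dynamically loaded posts are fetched.
    for _ in range(5):
        page.mouse.wheel(0, 2000)
        page.wait_for_timeout(1000)
    with open("capture.html", "w", encoding="utf-8") as f:
        f.write(page.content())
    page.screenshot(path="capture.png", full_page=True)
    browser.close()
```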

Between discussions of the role of web archives in halting the spread of disinformation, the use of barely understood tools like generative AI, and potential techniques for getting around stumbling blocks in social media archiving, the conference left attendees excited to explore web preservation further. The internet is the main resource through which we communicate, disseminate knowledge, and create modern history. The pursuit of preserving that history is therefore necessary and integral to the field of archiving.

Invasion of Ukraine: web archiving volunteers needed

The Bodleian Libraries Web Archive (BLWA) needs your help to document what is happening in Ukraine and the surrounding region. Much of the information about Ukraine being added to the web right now will be ephemeral, especially information from individuals about their experiences and those of the people around them. Action is needed to ensure we preserve some of these contemporary insights for future reflection. We hope to archive a range of different content, including social media, and to start forming a resource which can join with other collections being developed elsewhere to:

  • capture the experiences of people affected by the invasion, both within and outside of Ukraine
  • reflect the different ways the crisis is being described and discussed, including misinformation and propaganda
  • record the response to the crisis

To play our part, we need help from individuals with relevant cultural knowledge and language skills who can select websites for archiving. We are particularly interested in Ukrainian and Russian websites, and those from other countries in the region, though any suggestions are welcome.

Please nominate websites via: https://www2.bodleian.ox.ac.uk/beam/webarchive/nominate

Web Archiving & Preservation Working Group: Social Media & Complex Content

On January 16 2020, I had the pleasure of attending the first public meeting of the Digital Preservation Coalition’s Web Archiving and Preservation Working Group. The meeting was held in the beautiful New Records House in Edinburgh.

We were welcomed by Sara Day Thomson, whose opening talk gave us a very clear overview of the issues and questions we increasingly run into when archiving complex/dynamic web or social media content. For example, how do we preserve apps like Pokémon Go that use a user's location data or even personal information to individualize the experience? Where do we draw the line in interactive social media conversations? After all, we cannot capture everything. And how do we capture this information without infringing the rights of the original creators? These and other musings set the stage perfectly for the rest of the talks during the day.

Although I would love to include every talk held that day, as they were all very interesting, I will only highlight a couple of the presentations to give this blog post some pretence of "brevity".

The first talk I want to highlight was given by Giulia Rossi, Curator of Digital Publications at the British Library, on "Overview of Collecting Approach to Complex Publications". Rossi introduced us to the Emerging Formats project, a two-year project by the British Library. The project focusses on three types of content:

  1. Web-based interactive narratives where the user’s interaction with a browser based environment determines how the narrative evolves;
  2. Books as mobile apps (a.k.a. literary apps);
  3. Structured data.

Personally, I found Rossi’s discussion of the collection methods particularly interesting. The team working on the Emerging Formats project does not just use Heritrix and other web harvesting tools, but also file transfers and direct downloads via access code and password. Most strikingly, in the event that only a partial capture can be made, they try to capture as much contextual information about the digital object as possible, including blog posts, screenshots or videos of walkthroughs, so researchers will have a good idea of what the original content would have looked like.

The capture of contextual content and the inclusion of additional contextual metadata about web content is currently not standard practice; many tools do not even allow for their inclusion. However, considering that many web harvesting tools struggle to capture dynamic and complex content, this could offer an interesting workaround for most web archives. It is definitely an option that I would like to explore going forward.
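
As an example of what recording that extra context might look like in practice, here is a small sketch that writes a "sidecar" metadata file alongside a partial capture, pointing to screenshots and a walkthrough video. The field names are my own assumptions, not a British Library or BLWA schema.

```python
# Sketch: store contextual information alongside a partial capture so
# future researchers can see what the original publication looked like.
# The field names below are illustrative, not an established schema.
import json
from datetime import date

sidecar = {
    "title": "Example interactive narrative",
    "capture_date": date.today().isoformat(),
    "capture_status": "partial",  # e.g. some dynamic features missing
    "context": {
        "screenshots": ["context/homepage.png", "context/chapter1.png"],
        "walkthrough_video": "context/walkthrough.mp4",
        "related_blog_posts": ["https://example.org/making-of"],
    },
    "notes": "Navigation captured; save-progress feature not replayable.",
}

with open("capture-context.json", "w", encoding="utf-8") as f:
    json.dump(sidecar, f, indent=2, ensure_ascii=False)
```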

The second talk that I would like to zoom in on is "Collecting internet art" by Karin de Wild, digital fellow at the University of Leicester. Taking Agent Ruby – a chatbot created by Lynn Hershman Leeson – as her example, de Wild explored how we determine which aspects of internet art need to be preserved and what challenges this poses. In the case of Agent Ruby, the San Francisco Museum of Modern Art initially exhibited the chatbot in a software installation within the museum, thereby taking the artwork out of its original context. They then added it to their online Expedition e-space, which has since been taken offline. Only a screenshot of the online artwork is currently accessible through the SFMOMA website, as the museum prioritizes the preservation of the interface over the chat functionality.

This decision raises questions about the right way to preserve online art. Does the interface suffice, or should we attempt to maintain the integrity of the artwork by saving the code as well? And if we do that, should we employ code restitution, which aims to preserve the original artwork's code, or a significant part of it, whilst adding restoration code to bring defunct code back to full functionality? Or do we emulate the software, as the University of Freiburg is currently exploring? And how do we keep track of the provenance of the artwork whilst taking into account the different iterations that digital artworks go through?

De Wild proposed turning to linked data as a way to keep track of, in particular, the provenance of an artwork. Together with two other colleagues, she has been working on a project called Rhizome, in which they are creating a data model that will allow people to track the provenance of internet art.

Although it is not within the scope of the Rhizome project, it would be interesting to see how the finished data model would lend itself to tracking changes in the look and feel of regular websites as well. Even though the layouts of websites have changed radically over the years, these changes are usually not documented in metadata or data models, even though they can be as much a reflection of social and cultural change as the content of the website. Going forward, it will be interesting to see how changes in archiving online artworks will influence the preservation of online content in general.
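
I do not know the details of the Rhizome data model, but as a generic illustration of how linked data can express provenance, the sketch below uses rdflib and the W3C PROV-O vocabulary to record that one iteration of an artwork was derived from an earlier one. All identifiers are invented for the example.

```python
# Generic sketch of provenance as linked data using the W3C PROV-O
# vocabulary: a later iteration of an online artwork is recorded as
# derived from an earlier one. Identifiers are invented for
# illustration and do not reflect the Rhizome data model.
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, RDFS

PROV = Namespace("http://www.w3.org/ns/prov#")
EX = Namespace("http://example.org/artworks/")

g = Graph()
g.bind("prov", PROV)

original = EX["agent-ruby-web-version"]
museum_copy = EX["agent-ruby-museum-installation"]

g.add((original, RDF.type, PROV.Entity))
g.add((original, RDFS.label, Literal("Agent Ruby (original web version)")))
g.add((museum_copy, RDF.type, PROV.Entity))
g.add((museum_copy, RDFS.label, Literal("Agent Ruby (museum installation)")))
g.add((museum_copy, PROV.wasDerivedFrom, original))

print(g.serialize(format="turtle"))
```

The appeal of this kind of model is that each new iteration can simply be added as another entity with its own derivation link, so the chain of versions remains queryable.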

The final presentation I would like to draw attention to is "Twitter Data for Social Science Research" by Luke Sloan, deputy director of the Social Data Science Lab at Cardiff University. He gave us a demo of COSMOS, an alternative to the Twitter API, which is freely available to academic institutions and not-for-profit organisations.

COSMOS allows you either to target a particular Twitter feed or to enter a search term to obtain a 1% sample of the total worldwide Twitter feed. The gathered data can be analysed within the system and is stored in JSON format. The information can subsequently be exported to CSV or Excel format.

Although the system is only able to capture new (live) Twitter data, it is possible to upload historical Twitter data into the system if an archive has access to it.

Having explained how COSMOS works, Sloan asked us to consider the potential risks that archiving and sharing Twitter data could pose to the original creator. Should we not protect these creators by anonymizing their tweets to a certain extent? If so, what data should we keep? Do we only record the tweet ID and the location? Or would even this make it too easy to identify the creator?
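
To make that trade-off more concrete, here is a small sketch of one possible approach (not something COSMOS or the Lab prescribes): keep the tweet ID and a coarse location, replace the username with a salted hash, and drop the free text before sharing. The field names follow common Twitter JSON conventions but are assumptions for the example, as is the line-per-tweet input file.

```python
# Sketch of one possible pseudonymisation step before sharing tweet
# data: keep the tweet ID and a coarse location, hash the username
# with a secret salt, and drop the free text. An illustration of the
# trade-off discussed above, not a recommended standard.
import csv
import hashlib
import json

SALT = "replace-with-a-secret-salt"  # kept private by the archive

def pseudonymise(tweet: dict) -> dict:
    user_hash = hashlib.sha256(
        (SALT + tweet["user"]["screen_name"]).encode("utf-8")
    ).hexdigest()
    return {
        "tweet_id": tweet["id_str"],
        "user_hash": user_hash[:16],
        "country": (tweet.get("place") or {}).get("country_code", ""),
        "created_at": tweet["created_at"],
    }

# Assumes one JSON tweet object per line in tweets.json.
with open("tweets.json", encoding="utf-8") as f:
    rows = [pseudonymise(json.loads(line)) for line in f if line.strip()]

with open("tweets_shared.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(
        f, fieldnames=["tweet_id", "user_hash", "country", "created_at"]
    )
    writer.writeheader()
    writer.writerows(rows)
```

Even this is a judgement call: a tweet ID alone can often be resolved back to the original tweet and its author, which is exactly the tension Sloan highlighted.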

The last part of Sloan’s presentation tied in really well with the discussion about ethical approaches to archiving social media. During this discussion we were prompted to consider ways in which archives could archive Twitter data whilst being conscious of the potential risks to the original creators of the tweets. This definitely got me thinking about the way we currently archive some of the Twitter accounts related to the Bodleian Libraries in our very own Bodleian Libraries Web Archive.

All in all, the DPC event definitely gave me more than enough food for thought about how the Bodleian Libraries, and the wider community in general, can improve the way we capture (meta)data related to the online content we archive, and about the ethical responsibilities we have towards the creators of that content.

Oxford LibGuides: Web Archives

Web archives are becoming increasingly prevalent and are increasingly used for research purposes. They are fundamental to the preservation of our cultural heritage in the interconnected digital age. With the continuing collection development on the Bodleian Libraries Web Archive and the recent launch of the new UK Web Archive site, the web archiving team at the Bodleian have produced a new guide to web archives. The new Web Archives LibGuide includes useful information for anyone wanting to learn more about web archives.

It focuses on the following areas:

  • The Internet Archive, The UK Web Archive and the Bodleian Libraries Web Archive.
  • Other web archives.
  • Web archive use cases.
  • Web archive citation information.

Check out the new look for the Web Archives LibGuide.


Why and how do we Quality Assure (QA) websites at the BLWA?

At the Bodleian Libraries Web Archive (BLWA), we quality assure (QA) every site in the web archive. This blog post aims to give a brief introduction to why and how we QA. The first step of our web archiving is to crawl a site using the tools developed by Archive-It. These tools allow entire websites to be captured and browsed in the Wayback Machine as if they were live, letting you download files, view videos and photos, and interact with dynamic content exactly as the website owner would want you to. However, due to the huge variety and technical complexity of websites, there is no guarantee that every capture will be successful (that is to say, that all the content is captured and works as it should). Currently there is no accurate automated process to check this, and so this is where we step in.

We want to ensure that the sites in our web archive are an accurate representation in every way. We owe this to the site owners and to future users. Capturing the content is hugely important, but so too is how a site looks and feels and how you interact with it, as this is a major part of the experience of using a website.

Quality assurance of a crawl involves manually checking the capture. Using the live site as a reference, we explore the archived capture, clicking on links, trying to download content or view videos, and noting any major discrepancies from the live site or any other issues. Sometimes a picture or two will be missing, or a certain link does not resolve correctly, which can be relatively easy to fix; other times there are massive differences compared to the live site, and the (often long and sometimes confusing) process of solving the problem begins. A rough sketch of the kind of page-by-page comparison involved follows the list below. Some common issues we encounter are:

  • Incorrect formatting
  • Images/video missing
  • Large file sizes
  • Crawler traps
  • Social media feeds
  • Dynamic content playback issues
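
As mentioned above, most of this checking is done by hand, but the sketch below gives a rough sense of the comparison involved: request the same set of pages on the live site and in the archived capture and flag any that behave noticeably differently for manual review. The base URLs and paths are placeholders, not real capture addresses, and this is an illustration rather than a description of our actual workflow.

```python
# Rough sketch of a page-by-page QA check: request the same paths on
# the live site and in the archived capture and flag mismatches for
# manual review. Base URLs and paths are placeholders; in practice
# the archived copy is browsed through the Wayback Machine.
import requests

LIVE_BASE = "https://example.org"
ARCHIVE_BASE = "https://wayback.example.org/web/20240101000000/https://example.org"
PATHS = ["/", "/about", "/publications", "/images/logo.png"]

for path in PATHS:
    live = requests.get(LIVE_BASE + path, timeout=30)
    archived = requests.get(ARCHIVE_BASE + path, timeout=30)
    if archived.status_code != live.status_code:
        print(f"Check {path}: live {live.status_code}, archived {archived.status_code}")
    elif abs(len(archived.content) - len(live.content)) > 0.2 * len(live.content):
        print(f"Check {path}: archived size differs noticeably from live version")
```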

There are many techniques available to help solve these problems, but there is no ‘one fix for all’: the same issue on two different sites may require two different solutions. There is a lot of trial and error involved, and over the years we have gained a lot of knowledge on how to solve a variety of issues. Archive-It also has a fantastic FAQ section on their site. However, if we have gone through the usual avenues and still cannot solve a problem, our final port of call is to ask the geniuses at Archive-It, who are always happy and willing to help.

An example of how important and effective QA can be: the initial test capture did not have the correct formatting and was missing images. This was resolved after the QA process.

QA’ing is a continual process. Websites add new content, or organisations switch to different website designers, meaning that captures of websites which have previously been successful might suddenly have an issue. It is for this reason that every crawl is given special attention and is QA’d. QA’ing the captures before they are made available is a time-consuming but incredibly important part of the web archiving process at the Bodleian Libraries Web Archive. It allows us to maintain a high standard of capture and provide an accurate representation of the website for future generations.