
The Why and How of Digital Archiving

Guest post by Matthew Bell, Summer intern in the Modern Archives & Manuscripts Department

If you have ever wondered how future historians will reconstruct and analyse our present society, you may well have envisioned scholars wading through stacks of printed Tweets, Facebook messages and online quizzes, discussing the relevance of, for instance, the GIFs posted in the comment section of a particular politician’s announcement of their candidacy, or what different e-mail auto-replies reveal about communication in the 2010s. The source material for researchers of this period must, after all, consist overwhelmingly of internet material: the platform for our communication, the source of our news, the medium on which we work. To take but one example, Ofcom’s 2022 report on UK news consumption observes that “The differences between platforms used across age groups are striking; younger age groups continue to be more likely to use the internet and social media for news, whereas their older counterparts favour print, radio and TV”. As this generation grows up to take positions of power in our country, it is clear that in seeking to understand the cultural background from which they emerged, relying solely on stored physical newspapers will be insufficient. An accurate picture of Britain today will only be possible through careful digital archaeology, sifting through sediments of hyperlinks and screenshots.

This month, through the Oxford University Summer Internship Programme, I was incredibly fortunate to spend four weeks as an intern with the Bodleian Libraries Web Archive (BLWA), at the cutting edge of digital archiving. One of the first things that became clear from speaking to those working on the BLWA is that the world wide web as a source of research material, as described above, is by no means a foregone conclusion. The perception of the internet as a stable collection that will persist without care and upkeep is a fallacy: websites are taken down, hyperlinks stop working or redirect elsewhere, social media accounts are removed, and companies go bankrupt and stop maintaining their online presence. Digital archiving can feel like a race against time, a push to capture the websites people use today whilst we still can; without the constant work of web archivists, there is nothing to ensure that the online resources we rely on will still be available decades down the line for researchers to consult.

Fortunately, the BLWA is far from alone in this endeavour. Perhaps the most ambitious contemporary web archive is the Internet Archive; since 1996 it has built a collection of billions of web pages, and states as its task the humble aim of providing “Universal Access to all Knowledge”, seeking to capture the entire internet. Other archives have a slightly more defined scope, such as the UK Web Archive, although even here the task is an enormous one: collecting “all UK websites at least once per year.” Given the scale of online material published every day, whether a site has already been archived by the Internet Archive or the UK Web Archive is one of the factors in whether the Bodleian chooses to archive it; to this extent, the world of digital archiving represents cooperation on an international scale.

One aspect of these web archives that struck me during my time here is the conscious effort made by many to place the power of web archiving in the hands of anyone with access to a computer. The Internet Archive, for instance, allows any user with a free account to add content to the archive. Furthermore, one of my responsibilities as an intern was a research project into the viability of a programme named Webrecorder for capturing more complex sites such as social media platforms, and the democratization of web archiving seems to be the key purpose of the programme. On its website, which offers free browser-based web archiving tools, the company’s name stands above the powerful rallying cry “Web archiving for all!” Whilst the programme currently remains difficult to navigate without a certain level of coding knowledge, and never quite worked as expected during my research, its potential for expanding the responsibility of archiving is certainly exciting. As historians increasingly seek to understand the lives of those whose records have not generally made it into archive collections, the desire to put secure archiving into the hands of people as well as institutions seems particularly noble.

The “why” of Digital Archiving, then, seems clear, but what about the “how”? Before going into my main responsibilities this month, some clarification of terminology is necessary.

Capture – This refers to the Bodleian’s copy of a website, a snapshot of it at a particular moment in time which can be navigated exactly like the original.

Live Site – The website as it is available to users on the internet, as opposed to the capture.

Crawl – The process by which a website is captured, as the computer program “crawls” through the live site, clicking on all the links, copying all of the text and photographs, and gathering all of this together into a capture (a minimal code sketch of this process follows the glossary).

Crawl Frequency – The frequency with which a particular website is captured by the Bodleian, determined by a series of criteria including the regularity of the website’s updates.

Archive-It – The website used by the Bodleian to run these crawls, and which stores the captured websites.

Brozzler – A more detailed, browser-based crawl, which takes more time but copes better with dynamic or complicated sites such as social media. Brozzler crawls are used for Twitter accounts, for instance. Crawls which are not Brozzler crawls are known as standard crawls and use the Heritrix software.

Data Budget – The allocated quantity of data the Bodleian Libraries purchase to use on captures, which means a necessary selectivity as to what is and is not captured.

Quality Assurance (QA) – A huge part of the work of digital archiving, the process by which a capture is compared with the live site and scrutinized for any potential problems in the way it has copied the website, which are then “patched” (fixed). These generally include missing images, stylesheets, or subpages.

Seed – The term for a website which is being captured.

Permission E-Mails – Due to the copyright regulations around web archiving, the BLWA requires permission from the owners of websites before archiving; this can be a particularly complicated task due to the difficulty of finding contact information for many websites, as well as language barriers.
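To make the “seed”, “crawl” and “capture” terms above a little more concrete, here is a minimal Python sketch of what a crawler does in principle. It is purely illustrative: production crawlers such as Heritrix and Brozzler write to the standard WARC format, respect crawl scoping rules, and handle JavaScript, deduplication and much more, and the function names and URL below are hypothetical.

```python
# Minimal illustration of a "crawl": start from a seed URL, follow links
# within the same site, and keep a copy of each page reached.
# (Illustrative sketch only; real crawlers such as Heritrix do far more.)
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def crawl(seed: str, max_pages: int = 50) -> dict[str, str]:
    """Return a {url: html} 'capture' of pages reachable from the seed."""
    domain = urlparse(seed).netloc
    to_visit, capture = [seed], {}
    while to_visit and len(capture) < max_pages:
        url = to_visit.pop()
        if url in capture:
            continue
        try:
            response = requests.get(url, timeout=10)
        except requests.RequestException:
            continue  # a real crawler would log the failure and retry
        capture[url] = response.text
        # Queue every in-scope link found on the page.
        for link in BeautifulSoup(response.text, "html.parser").find_all("a", href=True):
            absolute = urljoin(url, link["href"])
            if urlparse(absolute).netloc == domain:
                to_visit.append(absolute)
    return capture


if __name__ == "__main__":
    pages = crawl("https://example.com")
    print(f"Captured {len(pages)} pages")
```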

My responsibilities during my internship were diverse, and my day-to-day work was generally split between quality assurance, setting off crawls, and sending or drafting permission e-mails. Alongside this I was not only carrying out research into Webrecorder but also contributing to a report re-assessing the crawl frequency of several of our seeds. The work I have done this month has been not only incredibly satisfying (when the computer programme works and you are able to patch a PDF during QA of a website, it makes one disproportionately happy), but also rewarding. One missing image or hyperlink at a time, digital archivists are carefully maintaining a particularly fragile medium, but one which is vital for the analysis of everything we are living through today.

The International Internet Preservation Consortium Web Archiving Conference: Thoughts and Takeaways

A couple of months ago, thanks to the generous support of the IIPC student bursary, I had the pleasure of attending the International Internet Preservation Consortium (IIPC) Web Archiving Conference in Hilversum, the Netherlands. The conference took place in the Netherlands Institute for Sound & Vision, a venue that added gravitas and rainbow colour to each of the talks and panels.

The Netherlands Institute for Sound & Vision. Photo taken by Olga Holownia.

What struck me most throughout the conference was how up to date the ideas and topics of the panels were. While traditional archiving usually deals with history that happened decades or centuries ago, web archiving requires fast-paced decisions and actions to preserve contemporary material as it is being produced. The web is a dynamic, flexible, constantly changing entity. Content is often deleted, or buried under the constant barrage of new content creation. Web archivists must therefore stay in the know and up to date in order to keep up with the arms race between web technology and archiving resources.

For instance, right from the beginning, the opening keynote speech discussed the ongoing Russian war in Ukraine. Eliot Higgins, founder of Bellingcat, the independent investigative collective focused on producing open-source research, discussed the role of digital metadata and digital preservation techniques in the fight against disinformation. Using the example of Russian-spread propaganda about the war in Ukraine, Higgins demonstrated that archived versions of sites and videos, and their associated metadata, can help to debunk intentionally spread misinformation depicting the Ukrainian army in a bad light. For instance, geolocation metadata has been used to prove that multiple videos supposedly showing the Ukrainian army threatening and killing innocent civilians were actually staged and filmed behind the Russian front lines.

A similarly current topic of conversation was the potential use of artificial intelligence (AI) in web archives. Given what a hot topic AI is, its prevalence at the web archiving conference was well received. The quality assurance process for web archiving, which can be arduous and time-consuming, was mentioned as a potential use case for AI. Checking every subpage of an archived site against the live site is impossible given time and resource constraints. However, if AI could be used to compare screenshots of the live site with the captured version, then even without actually going in and patching the issues, just knowing where the issues are would save considerable time. Additionally, AI could be used to fill gaps in collections. It is hard to know what you do not know. In particular, the Bodleian has a collection aimed at preserving the events and experiences of people affected by the war in Ukraine. Given our web archiving team’s lack of Ukrainian and Russian language skills, it can be hard to know which sites to include in the collection and which not to. Having AI generate a list of sites deemed culturally relevant to the conflict could thus help fill gaps in the collection that we were not even aware of.
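As a rough illustration of the screenshot-comparison idea (not a description of any tool presented at the conference), a simple pixel-difference check using the Pillow imaging library could flag which captured pages deviate most from the live site; the file names and the 5% threshold below are assumptions made purely for the sake of the example.

```python
# Illustrative sketch: flag archived pages whose screenshots differ noticeably
# from the live site, so a human can prioritise QA. File names are hypothetical.
from PIL import Image, ImageChops


def difference_ratio(live_path: str, capture_path: str) -> float:
    """Fraction of pixels that differ between two screenshots."""
    live = Image.open(live_path).convert("RGB")
    capture = Image.open(capture_path).convert("RGB").resize(live.size)
    diff = ImageChops.difference(live, capture)
    changed = sum(1 for pixel in diff.getdata() if pixel != (0, 0, 0))
    return changed / (live.width * live.height)


if __name__ == "__main__":
    ratio = difference_ratio("live_homepage.png", "capture_homepage.png")
    if ratio > 0.05:  # threshold chosen arbitrarily for illustration
        print(f"{ratio:.1%} of pixels differ; send this page to QA")
    else:
        print("Capture looks close to the live site")
```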

Social media archiving was also a significant subject of discussion at the conference. Despite the large part that social media plays in our lives and culture, it can be very challenging to capture. For example, the Heritrix crawler, the most commonly used web crawler in web archiving, is blocked by Facebook and Instagram. Additionally, while Twitter technically remains capturable, much of the dynamic content contained in posts (e.g. videos, GIFs, links to outside content) can’t be replayed in archived versions. Collaboration between social media companies and archivists was heralded as a necessity, and something that needs to happen soon. In the meantime, the web archiving tools discussed as best suited to social media capture included Webrecorder and other tools that mimic how a user would navigate a website in order to create a high-fidelity capture that includes dynamic content.
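A rough sketch of that user-mimicking approach, using the Playwright browser automation library rather than any Webrecorder component, might look like the following: load the page in a real browser, scroll so that lazily loaded posts appear, then save the rendered HTML and a screenshot. Webrecorder’s own tools go further by recording the underlying network traffic into WARC/WACZ files; the URL and file names here are placeholders.

```python
# Rough illustration of browser-driven capture: render the page as a user would,
# scroll to trigger lazily loaded content, then save the result.
from playwright.sync_api import sync_playwright


def capture_rendered_page(url: str, out_prefix: str) -> None:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        # Scroll a few times so infinite-scroll feeds load more posts.
        for _ in range(5):
            page.mouse.wheel(0, 2000)
            page.wait_for_timeout(1000)
        with open(f"{out_prefix}.html", "w", encoding="utf-8") as f:
            f.write(page.content())  # the rendered DOM, not just the raw HTML
        page.screenshot(path=f"{out_prefix}.png", full_page=True)
        browser.close()


if __name__ == "__main__":
    capture_rendered_page("https://example.com", "example_capture")
```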

Between discussions of the role of web archives in halting the spread of disinformation, the use of barely understood tools like generative AI, and potential techniques to get around stumbling blocks in social media archiving, the conference got all attendees excited to explore web preservation further. The internet is the main resource through which we communicate, disseminate knowledge, and create modern history; the pursuit of preserving that history is therefore necessary and integral to the field of archiving.