Category Archives: Web Archives

Introducing Wacksy: a library for writing WACZ collections

Blog post published on behalf of Pierre Marshall, Technical Research Officer, Algorithmic Archive project.

As part of the Algorithmic Archive project, we have been building tooling to support a prospective social media archive. This post introduces Wacksy, a Rust crate we wrote for packaging Web ARChive (WARC) files in WACZ collections.

Web Archive Collection Zipped (WACZ) is a format for packaging web archives. A WACZ collects WARC files along with any other related resources into a single hermetic zip archive. Each WACZ collection is self-describing and should contain within it all the resources necessary to replay a WARC file.
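For orientation, a typical WACZ unpacked on disk looks roughly like this (file names follow the WACZ specification; a minimal collection may omit some of these):

```
example.wacz
├── archive/
│   └── data.warc.gz          # the captured WARC file(s)
├── indexes/
│   └── index.cdx             # CDXJ index into the WARC records
├── pages/
│   └── pages.jsonl           # list of pages available for replay
├── datapackage.json          # manifest describing every resource above
└── datapackage-digest.json   # digest of the manifest
```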

Motivation

The main advantage of independently replayable WACZ files is that you don’t need to maintain an external indexing server. Eliminating that dependency on an external indexing system is beneficial for digital preservation: in the long run, the most reliable database is just the file system.

Of course, this also means that packaging a WARC file in a WACZ involves indexing the WARC, and so the scope of this little project grew from ‘wrap some things in a zip archive’ to ‘build a WARC indexer’.

Thankfully, WARC files are easy enough to parse. One of the useful side-effects of this project was that we got to learn the WARC and WACZ specs very well. The process of implementing the WACZ format also brought up a few issues (#161, #163, #164, #166, #167) which could be used to tighten the spec in a future revision.
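To give a sense of why the parsing is tractable: each WARC record starts with a version line followed by colon-separated header fields, terminated by a blank line. A minimal standalone sketch of parsing that header block (our own illustration, not Wacksy’s code) might look like:

```rust
use std::collections::HashMap;

/// Parse a WARC record's header block (everything up to the first blank line)
/// into the version line plus a map of field names to values.
fn parse_warc_headers(raw: &str) -> Option<(String, HashMap<String, String>)> {
    let mut lines = raw.lines();
    // The first line is the version, e.g. "WARC/1.1".
    let version = lines.next()?.trim().to_string();
    if !version.starts_with("WARC/") {
        return None;
    }
    let mut fields = HashMap::new();
    for line in lines {
        if line.trim().is_empty() {
            break; // a blank line ends the header block
        }
        let (name, value) = line.split_once(':')?;
        fields.insert(name.trim().to_string(), value.trim().to_string());
    }
    Some((version, fields))
}

fn main() {
    let raw = "WARC/1.1\r\nWARC-Type: response\r\nContent-Length: 13\r\n\r\n";
    let (version, fields) = parse_warc_headers(raw).expect("well-formed header block");
    println!("{version}, {} header fields", fields.len());
}
```

A real indexer also has to handle gzip members, continuation lines and malformed input, but the core format really is this simple.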

Archiving the datafied web

There’s another reason you might want to package resources together in a WACZ. We’ve spent a lot of time in this project trying to understand how to represent a social media post.

While sidestepping arguments about mass literary culture, we can think of social media as a kind of electronic literature, and preserving it is an extension of ongoing web archiving work. This approach includes preserving the ‘look and feel’ of a post in context, surrounded by comments and advertising and blobby Frutiger Aero buttons.

Social media posts are also data, in the sense that they exist in the form of structured JSON. Each post is an object with properties: text, username, datetime, maybe links to associated media. All raw content, easily searchable, indexable, and ready for researchers to throw into a data processing workflow.

Ideally, you want to capture and preserve both: a web archive snapshot and the structured data.

You could include the JSON inline in a WARC header field, or add it to the WARC file as a resource record. Or, you could package the WARC and JSON files together in a WACZ collection; this is the use case we had in mind when writing Wacksy.
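For instance, a post’s JSON could travel inside the WARC itself as a resource record. The record below is made up for illustration (the URI, field values and byte count are hypothetical), but the field names follow the WARC spec:

```
WARC/1.1
WARC-Type: resource
WARC-Target-URI: https://example.org/api/posts/12345
WARC-Date: 2025-06-04T12:00:00Z
Content-Type: application/json
Content-Length: 74

{"text": "hello", "username": "alice", "datetime": "2025-06-04T12:00:00Z"}
```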

How to use

With a stable Rust toolchain, run cargo add wacksy to add the crate to your Cargo manifest.

The API provides a WACZ type with two functions: from_file and as_zip_archive.

from_file() builds the WACZ object: it takes a path to a WARC file, indexes it, and returns a Result containing either a WACZ struct or an error. The indexer was recently rewritten and contains almost no error handling, so use with caution! Also, the format requires all resources to be defined in the datapackage, so when you construct a datapackage you’re already building a structured representation of the WACZ collection. This is a neat feature; Ed Summers and Ilya Kreymer really did a good job on the WACZ spec here.
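To illustrate, the datapackage.json is a Frictionless Data-style manifest listing every file in the collection. An abridged example (hashes and sizes are placeholders) might look like:

```json
{
  "profile": "data-package",
  "wacz_version": "1.1.1",
  "created": "2025-06-04T12:00:00Z",
  "resources": [
    {
      "name": "data.warc.gz",
      "path": "archive/data.warc.gz",
      "hash": "sha256:…",
      "bytes": 12345
    },
    {
      "name": "index.cdx",
      "path": "indexes/index.cdx",
      "hash": "sha256:…",
      "bytes": 678
    }
  ]
}
```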

as_zip_archive() takes all the resources in the WACZ object and passes them through into a zip file, making use of Nick Babcock’s rawzip library.

Here is an example from the documentation:

use std::error::Error;
use std::fs;
use std::path::Path;

use wacksy::WACZ; // assuming the type is exported at the crate root

fn main() -> Result<(), Box<dyn Error>> {
    let warc_file_path = Path::new("example.warc.gz"); // set path to your WARC file
    let wacz_object = WACZ::from_file(warc_file_path)?; // index the WARC and create a WACZ object
    let zipped_wacz: Vec<u8> = wacz_object.as_zip_archive()?; // zip up the WACZ
    fs::write("example.wacz", zipped_wacz)?; // write out to file
    Ok(())
}

This API is still missing a way of adding arbitrary extra resources into the WACZ, although the code is flexible enough to accommodate that in future.

Shipping binaries

There are two other WACZ libraries out there: py-wacz (Python) and js-wacz (Node.js).

Besides the library API, another goal is to provide a simple command-line interface and wrap that up into a standalone binary. This would be the main distinguishing feature of Wacksy: you don’t need to set up Python or Node.js runtimes. For example, there would be fewer steps involved if you were packaging up WARC files in an automated workflow.

That said, py-wacz is better tested and more feature-complete, so take that into consideration. We have used py-wacz as a reference implementation to test against.

We’re also working on packaging Wacksy for Debian, and other systems after that. When it’s all packaged, it’ll be much easier for users to try out.

Performance

The WARC indexer was written with an eye on performance and memory use. When reading plain uncompressed WARCs, the indexer only reads the headers of each record. With a known header length and the WARC Content-Length value for each record, we can calculate the next record offset and skip through the file without passing record contents into memory. Where possible we’re also using a buffered reader rather than reading byte-by-byte. For gzip-compressed WARCs it’s more complicated, and we’ve avoided doing anything fancier, like streaming decompression.
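The skipping strategy can be sketched like this. This is a simplified standalone illustration, not Wacksy’s actual indexer; it assumes a well-formed, uncompressed WARC and exact “Content-Length” header casing:

```rust
use std::io::{BufRead, BufReader, Read, Seek};

/// Scan an uncompressed WARC stream and return the byte offset of each record,
/// reading only the header lines and seeking past each record body.
fn record_offsets<R: Read + Seek>(reader: R) -> std::io::Result<Vec<u64>> {
    let mut r = BufReader::new(reader);
    let mut offsets = Vec::new();
    loop {
        let offset = r.stream_position()?;
        let mut line = String::new();
        if r.read_line(&mut line)? == 0 {
            break; // end of file
        }
        if !line.starts_with("WARC/") {
            continue; // skip the blank lines separating records
        }
        offsets.push(offset);
        // Read header lines until the blank line, noting Content-Length.
        let mut content_length: u64 = 0;
        loop {
            line.clear();
            if r.read_line(&mut line)? == 0 || line.trim().is_empty() {
                break;
            }
            if let Some(v) = line.strip_prefix("Content-Length:") {
                content_length = v.trim().parse().unwrap_or(0);
            }
        }
        // Skip the record body without reading it into memory.
        r.seek_relative(content_length as i64)?;
    }
    Ok(offsets)
}

fn main() {
    let rec = "WARC/1.1\r\nWARC-Type: resource\r\nContent-Length: 5\r\n\r\nhello\r\n\r\n";
    let data = format!("{rec}{rec}");
    let offsets = record_offsets(std::io::Cursor::new(data.into_bytes())).unwrap();
    println!("record offsets: {offsets:?}");
}
```

Because only the headers pass through the buffer, memory use stays flat no matter how large the record bodies are.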

We’ve also tried to limit the dependencies used. At the moment, a binary compiled from the example above comes out at ~600 kilobytes — not super small, but more lightweight than most web pages. As a bonus, pruning the dependency tree will make Wacksy easier to package and distribute.

Use it!

At time of writing, the library is still experimental, not yet integrated into any other software, and it has only been tested against a few example WARC files. It would benefit from wider testing on real world use cases.

The code is all open source and available under an MIT license; contributions welcome!

Reporting from the RESAW2025 Conference Workshop: Towards an “Algorithmic Archive”

Welcome sign at the University of Siegen

At this year’s REsearch infrastructure for the Study of Archived Web materials (RESAW) conference, my colleague Pierre Marshall and I organised a workshop titled ‘Towards an “Algorithmic Archive”: Developing Collaborative Approaches to Persistent Social and Algorithmic Data Services for Researchers’. The workshop was accepted as one of the RESAW2025 pre-conference workshops and took place on 4 June 2025. We had around twelve participants, including researchers at various career stages and web archivists, who contributed to lively discussions and to the Algorithmic Archive project thanks to their experience with social media data.

The workshop was organised in two sessions: the first focussed on gathering researchers’ perspectives and information about the use of social media data in research. The second invited participants to imagine a long-term archive of social media data, asking them to think about the features they would like to see in a social media data service. Both sessions offered a valuable opportunity to gather insights for the Algorithmic Archive project, particularly regarding issues and expectations related to short- and long-term access to social media data.

Key themes and takeaways are summarised below.

Social media data (re)use and data management practices

Researchers appeared to work mostly with small datasets, especially after free access to data for research purposes came to an end with the deprecation of the Twitter Academic API in 2023. Among the researchers who shared their experience with social media data, one noted that they currently work with information about the number of followers, which is often supplemented with screenshots taken at different points in time. They explained that screenshots are essential for their research, as they enable them to capture the “look and feel” of the social platforms, a crucial part of the research they are conducting. In this regard, one of the web archivists participating in the workshop noted that their institution uses Webrecorder[1] at least once a year.

In addition, a researcher whose work focusses on algorithms noted that social media data collected via APIs is only one of the sources they use for their study. Other sources include existing policies, new regulations (e.g. the EU Digital Services Act) and other archival sources such as information on GitHub.[2]

As for long-term preservation, researchers participating in the workshop appeared not to have specific plans, with some indicating that they usually delete social media data some time after the end of the project. Despite some concerns related to potential ethical issues, researchers expressed a general interest in reusing datasets that include social media data. Nevertheless, they emphasised that effective reuse would require detailed documentation from the dataset creator to understand how the data was developed.

Access and user requirements

For the second session, we organised a post-it note exercise in which we asked researchers to reflect on the types of metadata they would find useful for their research and would like memory institutions to collect and provide. Researchers suggested several kinds of metadata or information they would like to see associated with an archived resource, including: date of capture; date of publication; technical and curatorial metadata; hardware (e.g. mobile, tablet, laptop); sensitivity assessment; and the type of tool used to collect the information.

Post-it note session

There was general agreement among participants about the need for the collecting institution to preserve at least some instances of the context in which the data was embedded. For example, walkthroughs of social media platforms recorded using tools such as Webrecorder would be crucial for researchers and future users of the collection to get a sense of platforms’ “look and feel” at certain points in time. Some of the participants also noted the importance of understanding potential functionality loss when replaying archived social media material.

Nevertheless, access to platform data, particularly free access, is still one of the major blockers for researchers who need such information for their studies. This has become even more crucial since the Twitter Academic API was deprecated in 2023 and replaced with a paid tier system; the high fees required to access the necessary amount of data have led many researchers to redirect their research goals, either significantly reducing the amount of data used or focusing on other platforms.

Overall, the workshop brought together diverse perspectives from practitioners and researchers working with social media data, fostering discussions about the development of sustainable strategies for collecting social media platforms. This was a unique opportunity to discuss some of the Algorithmic Archive findings, clarify researchers’ perspectives on concerns related to the use of social media data, and raise further questions that the Algorithmic Archive project should take into consideration for the development of a social media data service.


[1] Webrecorder homepage: https://webrecorder.net/

[2] More information about the GitHub Archiving Programme can be found here: https://archiveprogram.github.com/

Reflections on Curating in the Crossfire: Collecting in the Time of War, Conflict and Crises

On 3-4 November, I attended a two-day event at the British Library that highlighted the challenges and approaches of collecting materials created during times of war, conflict and crises. Through a series of panels and discussions, museum and library professionals, researchers and private collectors shared examples of incredible historical and contemporary initiatives to preserve diverse materials and heritage sites at risk of loss, decay or destruction.

Having recently worked on the joint Bodleian Libraries and History of Science Museum Collecting COVID project, I was particularly interested in contemporary programmes of collecting. Our project, which ran from 2021 to 2023, aimed to acquire and preserve the University of Oxford’s research response to the COVID-19 pandemic. It enabled us to capture, catalogue and publish over ninety oral history interviews.

Modern collections/initiatives showcased included:

  • Web Archiving the COVID-19 pandemic, Nicola Bingham, British Library
  • Coastal Connections (heritage sites at threat from coastal erosion), Dr Alex Kent, World Monuments Fund
  • Crowdsourcing photographs for the Picturing Lockdown Collection, Dr Tamsin Silvey, Historic England
  • Endangered Archives Programme (recent case studies include Ukraine, Gaza and Sudan), Dr Sam van Schaik, British Library
  • Collecting Human Stories during the war in Ukraine, Natalia Yemchenko, Rinat Akhmetov Foundation/Museum of Civilian Voices

Rapid collecting is a means to collect documentary evidence, preserve cultural memories and commemorate events. By providing access to these collections, institutions are then able to build a body of evidence and facilitate research. I was struck by the similarities between modern initiatives and those that had taken place a century before. Some of the contemporary examples of collections crowdsourcing harked back to the collecting of ephemera during the First World War. In her presentation with Alison Bailey, Dr Ann-Marie Foster highlighted the Bond of Sacrifice Collection and Women’s Work Collection (Imperial War Museums), to which families sent items memorialising loved ones, as examples of early collecting initiatives. Modern rapid collecting work has meant that contemporary archivists/curators have taken up this tradition, working actively to save materials at risk of loss through intentional selection.

As well as crowdsourcing and outreach, other strategies institutions draw upon in an increasingly online world are web archiving, digitisation and digital preservation. With social media now a main mode of communication for millions, web archiving is a useful tool to preserve and present online response to global events. Work to capture websites relating to recent events is ongoing at both the Bodleian Libraries and British Library. I found Archive-It to be an incredibly useful tool to capture and publish a range of web pages (including the social media pages of COVID-19 researchers, given with permission) for our project, which without reactive selection and preservation, would otherwise have been at risk of loss.

Overall, the event highlighted that institutions must use active strategies towards preserving at-risk materials created during ongoing crises and conflicts, including:

  • Involving communities to assist in selection of materials;
  • Providing as representative a view of the event as possible (capturing diverse perspectives);
  • Providing access to collections and making them available as widely as possible (ethical considerations and sensitivities permitting);
  • Democratising collections and preserving them for future generations.

Algorithmic Archive Project: Use Cases (3/3)

The Algorithmic Archive project is a one year project funded by the Mellon Foundation. As part of the first Work Package, we explored how researchers from different disciplines use social media data to answer various research questions.

This post is the third in a three-part series presenting use cases drawn from research conducted as part of the Algorithmic Archive project.

We would like to thank the researchers who generously shared insights from their work.


Use Case – Study on the trustworthiness of social media visual content among young adults (TRAVIS project)[1]

Research questions and aim(s):

Trust And Visuality: Everyday digital practices (TRAVIS) is an ESRC project which has received funding from the European Union’s Horizon 2020 Research and Innovation Programme. The project looks at how young adults experience, build and express trust in news and social media images related to wellbeing and health. It explores how and why people trust some visuals over others, and how content creators establish trustworthiness through visual content. The TRAVIS project involves a cross-national collaboration of multiple research teams located at different universities in the UK and Europe. This includes the University of Oxford, where the team is based in the School of Geography and the Environment.

Social media data used:

The project included data collected indirectly from platforms including Facebook, Instagram, TikTok and YouTube (see below).

Tools and methods adopted:

Data collection from social media consisted of screenshots taken from the devices of interviewed young adults, as the TRAVIS project investigates the meaning of social media posts (visual content) via interviews with young adult users. The dataset generated from this method of collection comprises around 400 screenshots, stored on an institutional cloud drive which is accessible to the whole team.


[1] Further information about the TRAVIS project is available here: https://www.tlu.ee/en/bfm/researchmedit/trust-and-visuality-everyday-digital-practices-travis

Algorithmic Archive Project: Use Cases (2/3)

The Algorithmic Archive project is a one year project funded by the Mellon Foundation. As part of the first Work Package, we explored how researchers from different disciplines use social media data to answer various research questions.

This post is the second in a three-part series presenting use cases drawn from research conducted as part of the Algorithmic Archive project.

We would like to thank the researchers who generously shared insights from their work.


Use Case – Exploring Algorithmic Mediation and Recommendation Systems on YouTube [1]

Research questions and aim(s):

The study sought to investigate how the YouTube platform operates, focusing on algorithmic activity and the strategies employed by both human and automated (robot) actors within federal and regional elections. The aim was to understand the impact that this system of mediation has on society and to demystify preconceptions of ideologically neutral technologies in highly disputed political events. The research focuses on two case studies: 1) the 2018 Ontario (Canada) election and 2) the 2018 Brazilian Federal Election. The data collection was carried out during the campaigning periods, between May and June 2018 in Ontario, and between August and October 2018 in Brazil.

Social media data used:

The research focussed solely on the YouTube platform. Specifically, the researchers collected information about recommended videos starting from specific keywords related to the election campaigns.

Tools and methods adopted:

The data collection was carried out using a Python script developed by the Algo Transparency project. The script automates YouTube search operations based on specified keywords (e.g., the names of the candidates), allowing the researcher to gather video-related data and the relative ranking position displayed to the user. Once the keywords were defined, the tool retrieved links for the top four results for each keyword and then examined the recommendation section. This process was repeated four times, each time collecting recommended videos, simulating a user interacting with algorithmic suggestions.

Data collected was stored on personal devices and the institutional cloud, and can be visualized at the following links:


[1] Reis, R., Zanetti, D., & Frizzera, L. (2020). A conveniência dos algoritmos: o papel do YouTube nas eleições brasileiras de 2018. Compolítica, 10(1), 35–58. https://doi.org/10.21878/compolitica.2020.10.1.333

Algorithmic Archive Project: Use Cases (1/3)

The Algorithmic Archive project is a one year project funded by the Mellon Foundation. As part of the first Work Package, we explored how researchers from different disciplines use social media data to answer various research questions.

This post is the first in a three-part series presenting use cases drawn from research conducted as part of the Algorithmic Archive project.

We would like to thank the researchers who generously shared insights from their work.


Use Case: Network/cluster analysis to investigate the construction and influence of information trustworthiness within social movements on Twitter [1]

Research questions and aim(s):

The researcher wanted to explore the construction and influence of information trustworthiness within social media movements in the context of the Hong Kong protests and the #BlackLivesMatter movement. Social media platforms offer a digital space for social movements to facilitate the diffusion of critical information, form networks, coordinate protests and reach a wider audience.

Social media data used:

This study focused on Twitter as it was used evenly by both social movements, and the researcher already had an established presence on this platform. Also, at the time of data collection (2020-2021), access to Twitter data for academic research was still relatively open to researchers.

For the purpose of this study, the researcher examined the following/follower relationships of top accounts with millions of followers that had been selected as major information disseminators, including organisations, individuals and accounts serving a particular niche or purpose.

Data collection was conducted at a specific point in time in 2021. Quantitative analysis of the social media data (e.g. cluster analysis) was complemented with qualitative data collected via an online survey.

Tools and methods adopted:

The researcher requested and obtained access to the Twitter API. However, high-level coding skills were required to access the data, which the researcher did not have at that time due to their predominantly qualitative research background. To address this, the researcher found and used a Go script called Nucoll[2], which is freely available on GitHub and enabled the researcher to collect the required data. Nucoll is a command-line tool that, according to its developer, retrieves data from Twitter using keyword instructions, for which the developer provided example queries and brief explanations. For each social movement, the researcher selected three organisations: one large organisation, one activist group, and one additional account that was relevant to the movement. Once these accounts were selected, they were processed through the script to capture all following/follower relationships and combine them into a graph for each protest analysed. Further data visualisation and analysis — including clustering and network analysis — were conducted using Gephi.


[1] Charlotte Im, The Construction and Influence of Information Trustworthiness in Social Movements, Doctoral Thesis, University College London (UCL), 2024.

[2] https://github.com/jdevoo/nucoll

The Algorithmic Archive: a project overview

What is the Algorithmic Archive Project?

In 2024, the Algorithmic Archive project received funding from the Mellon Foundation to carry out scoping research that will ultimately support the Bodleian Libraries in the development of a lasting, interoperable infrastructure and sustainable strategies for archiving web-based data, including social media data and algorithms. The project is part of the broader Future Bodleian programme, which aims to expand and evolve the Libraries’ centuries-old role by engaging with the digital domain.

Why archive social media data?

In the past two decades, social media platforms have become a central means of communication, enabling people from across the globe to engage in discussions that transcend geographical borders, reflect on contemporary events and contribute to collective memory. Given their profound impact on society, researchers across various disciplines increasingly rely on social media data to analyse social, economic, and political phenomena. However, social media data is inherently ephemeral, subject to continuous evolution driven by changes in platform leadership, economic gain, and shifting policies. For this reason, it is essential to preserve and provide reliable and sustainable access for the (re)use of such an important resource.

Steps towards the development of a social media and algorithmic data service

The Algorithmic Archive project is organised into four interconnected phases that investigate the research, archiving, legal and technical landscapes to inform the Bodleian Libraries’ future development of a social and algorithmic data service.

The image below offers a visual summary of the work packages that the Research Officers have been exploring over this one-year project.

In upcoming blog posts, we will present some of the results and highlight use cases drawn from research conducted with social media data.

Reporting from the Born-Digital Collections, Archives and Memory Conference 2025

Between 2-4 April 2025, I attended the very first edition of the Born-digital Collections, Archives and Memory conference, together with my colleague from the Algorithmic Archive Project, Pierre Marshall. The conference was co-organised by the School of Advanced Study at the University of London, the Endangered Material Knowledge Programme at The British Museum, The British Library and Aarhus University. This international event offered the unique opportunity to bring together academics and practitioners from diverse disciplines, career paths and backgrounds to explore the transformative impact of born-digital cultural heritage. The diverse range of research, methodologies, and practices presented in this year’s programme offered valuable insights and reflections, particularly relevant to the Algorithmic Archive project and its goal of developing sustainable, persistent approaches to preserving born-digital heritage created on the web, especially on social media platforms.

The inspiring opening keynote by Dorothy Berry, Digital Curator at the Smithsonian National Museum of African American History and Culture, highlighted the vital importance of preserving ephemeral and fragile forms of born-digital heritage (such as social media), many of which have increasingly replaced traditional modes of memory-making, and drew attention to the pressing need for a deeper understanding of what born-digital memory should be preserved and how. In particular, she stressed the need to record the “full context” in which born-digital records and materials were embedded before being collected and included in specific collections. However, she also highlighted the challenges many memory institutions face due to uneven resource distribution, an issue that may hinder both the development and long-term sustainability of innovative preservation efforts.

Given the richness of the BDCAM25 programme, it is incredibly difficult to summarise the many takeaways from the three-day conference. Nevertheless, it is worth highlighting sessions such as the one exploring the history, socio-technical dynamics and research conducted on corpora from platforms such as Usenet; the important reflections stemming from a study by Rosario Rogel-Salazar and Alan Colín-Arce exploring the presence of feminist organisations in web archives; and the research conducted by Dr Andrea Stanton exploring Palestine and the concept of Palestinian heritage through the analysis of accounts and hashtags on Instagram.

Particularly valuable insights also came from Dr Kieran Hegarty’s paper, which explored the challenges posed by unpredictable and frequent changes to platform design and policies, underscoring how these significantly influence what is included in web archives and how the material is made available.

Beveridge Hall entrance, Senate House, University of London. Photo taken by B. Cannelli

Overall, the conference provided a valuable opportunity to learn about new research and to network with scholars and practitioners from around the globe. During lunch and coffee breaks, I had insightful conversations with several delegates about the challenges of preserving born-digital materials, particularly data generated on social media platforms. We exchanged ideas and reinforced the importance of developing shared practices to safeguard these resources. This theme strongly resonated in the closing session, which brought together voices from diverse career paths and regions to reflect on the current state of born-digital archives, collections, and memory, and to identify future directions.
Among the key takeaways were the need to foster data literacy and build digital citizens from a young age, as well as the importance of connecting with activists and minority communities to help them tell and preserve their stories.

Highlights and Takeaways from the Association of Internet Researchers Annual Conference (AoIR) 2024

At the end of October, I had the opportunity to attend the 2024 Association of Internet Researchers (AoIR) conference, which took place in the lovely city of Sheffield. This was my first time attending an AoIR conference and I was grateful to join such a vibrant meeting of Internet researchers from all over the world. As a Curatorial and Policy Research Officer for the Algorithmic Archive Project, currently exploring the ways in which social media and algorithmic data are being used across disciplines, this was a unique opportunity for me to engage with a diverse range of research on the web and social platforms.

This year’s AoIR conference was hosted by the University of Sheffield, with the Student Union building serving as the main venue. This impressive structure spans five floors and includes a cosy lounge area on the third floor, offering attendees a space to relax and network between sessions in a packed 4-day program. The main theme of this year’s AoIR2024 conference was “industry”, inviting the research community to reflect and discuss the relationship between the internet and industry. With over thirteen parallel sessions scheduled for each time block, choosing just one to attend proved to be rather challenging.

A view of the University of Sheffield, Student Union where some of the AoIR2024 conference sessions took place between 30 October – 2 November 2024. Photo taken by B. Cannelli

One aspect that really stood out to me from the conference was the diverse range of research involving information generated on social media platforms, spanning creator-economy dynamics, news polarisation, AI applied in the context of online communities and content moderation, online pop culture, and disinformation across various platforms. There were several panels discussing platform governance – the set of rules, policies and decision-making processes that shape how content is collected, accessed and used within a platform – shedding light on the power dynamics that influence user experience. From an archival perspective, understanding how platforms regulate access to data and the consumption of content is crucial, with significant implications for how this content can be archived by memory institutions.

Among the many sessions exploring virality phenomena and cultures on social media, it is worth mentioning the one reflecting on “mediated memory”. It examined how social platforms like TikTok serve, for instance, as spaces to remember displaced cultures, and how they facilitate the transmission of cultural aspects to younger generations, helping to perpetuate them through time and space. Additionally, the session titled “Times and Transformations” provided some excellent examples of research conducted with web-archived content from research libraries, along with insightful reflections on the epistemology of web archiving.

Firth Court, a Grade II listed Edwardian building that constitutes part of the Western Bank Campus of the University of Sheffield. Photo taken by B. Cannelli

Overall, the conference highlighted the crucial role social media data play in today’s communication landscape and underscored the value of platforms’ user-generated content as a key resource for researchers across a wide range of disciplines. The interplay of light and shadows explored in various panels on platform governance further emphasised the enormous power platforms hold over this user-generated data, as well as the pressing need for support to enable researchers to access and preserve these data over time. 

I left the AoIR2024 conference with so much food for thought! It was also a fantastic opportunity for networking, which will be important for the scoping phase of the Algorithmic Archive project.

The Why and How of Digital Archiving

Guest post by Matthew Bell, Summer intern in the Modern Archives & Manuscripts Department

If you have ever wondered how future historians will reconstruct and analyse our present society, you may well have envisioned scholars wading through stacks of printed Tweets, Facebook messages and online quizzes, discussing the relevance of, for instance, GIFs sent in the comment section of a particular politician’s announcement of their candidacy, or what different E-Mail autoreplies reveal about communication in the 2010s. The source material for the researcher of this period must, after all, consist overwhelmingly of internet material; the platform for our communication, the source of our news, the medium on which we work. To take but one example, Ofcom’s report on UK consumption of news from 2022 identifies that “The differences between platforms used across age groups are striking; younger age groups continue to be more likely to use the internet and social media for news, whereas their older counterparts favour print, radio and TV”. As this generation grows up to take the positions of power in our country, it is clear that in seeking to understand the cultural background from which they emerged, a reliance on storing solely physical newspapers will be insufficient. An accurate picture of Britain today would only be possible through careful digital archaeology, sifting through sediments of hyperlinks and screenshots.

This month, through the Oxford University Summer Internship Programme, I was incredibly fortunate to work as an intern in the Bodleian Libraries Web Archive (BLWA) for four weeks, at the cutting edge of digital archiving. One of the first things that became clear from speaking to those working in the BLWA is that the survival of the world wide web as a source of research material, as described above, is by no means a foregone conclusion. The perception of the internet as a stable collection that will remain as it is without care and upkeep is a fallacy: websites are taken down, hyperlinks stop working or redirect somewhere else, social media accounts get removed, and companies go bankrupt and stop maintaining their online presence. Digital archiving can feel like a race against time, a push to capture the websites that people use today whilst we still can; without the constant work of web archivists, there is nothing to ensure that the online resources we use will still be available even decades down the line for researchers to consult.

Fortunately, the BLWA is far from alone in this endeavour. Perhaps the most ambitious contemporary web archive is the Internet Archive; since 1996 this archive has built a collection of billions of websites, and states as its task the humble aim of providing “Universal Access to all Knowledge”, seeking to capture the entire internet. Other archives have a slightly more defined scope, such as the UK Web Archive, although even here the task is still an enormous one, of collecting “all UK websites at least once per year.” Because of the scale of online material that is published every day, whether or not a site has been archived by either the Internet Archive or the UK Web Archive has relevance for whether the Bodleian chooses to archive it; to this extent the world of digital archiving represents cooperation on an international scale.

One aspect of these web archives that struck me during my time here is the conscious effort made by many to place the power of web archiving in the hands of anyone with access to a computer. The Internet Archive, for instance, allows any user with a free account to add content to the archive. Furthermore, one of my responsibilities as intern was a research project into the viability of a programme named Webrecorder for capturing more complex sites such as social media platforms, and the democratization of web archiving seems to be the key purpose of the programme. On their website, which offers free browser-based web archiving tools, the company’s name stands above the powerful rallying call “Web archiving for all!” Whilst the programme currently remains difficult to navigate without a certain level of coding knowledge, and never quite worked as expected during my research, its potential for expanding the responsibility of archiving is certainly exciting. As historians increasingly seek to understand the lives of those whose records have not generally made it into archive collections, one can see as particularly noble the desire to put secure archiving into the hands of people as well as institutions.

The “why” of Digital Archiving, then, seems clear, but what about the “how”? Before going into my main responsibilities this month, some clarification of terminology is necessary.

Capture – This refers to the Bodleian’s copy of a website, a snapshot of it at a particular moment in time which can be navigated exactly like the original.

Live Site – The website as it is available to users on the internet, as opposed to the capture.

Crawl – The process by which a website is captured, as the computer program “crawls” through the live site, clicking on all the links, copying all of the text and photographs, and gathering all of this together into a capture.

Crawl Frequency – The frequency with which a particular website is captured by the Bodleian, determined by a series of criteria including the regularity of the website’s updates.

Archive-It – The website used by the Bodleian to run these crawls, and which stores the captured websites.

Brozzler – A particularly detailed crawl, taking more time but better for dynamic or complicated sites such as social media platforms. Brozzlers are used for Twitter accounts, for instance. Crawls which are not brozzlers are known as standard crawls and use Heritrix software.

Data Budget – The allocated quantity of data the Bodleian libraries purchase to use on captures, meaning a necessary selectivity as to what is and is not captured.

Quality Assurance (QA) – A huge part of the work of digital archiving, the process by which a capture is compared with the live site and scrutinized for any potential problems in the way it has copied the website, which are then “patched” (fixed). These generally include missing images, stylesheets, or subpages.

Seed – The term for a website which is being captured.

Permission E-Mails – Due to the copyright regulations around web archiving, the BLWA requires permission from the owners of websites before archiving; this can be a particularly complicated task due to the difficulty of finding contact information for many websites, as well as language barriers.
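The crawl process defined above — following every link on a seed and copying what it finds — can be sketched in miniature. The following is a hypothetical illustration only, not the actual implementation used by Archive-It, Heritrix or Brozzler; the `fetch` mapping stands in for live HTTP requests so the sketch can run offline:

```python
from html.parser import HTMLParser
from collections import deque

class LinkExtractor(HTMLParser):
    """Collect href targets from <a> tags — the links a crawler 'clicks'."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl starting from `seed`.

    `fetch` maps a URL to its HTML; a real crawler would perform an
    HTTP GET here and also save images, stylesheets and other assets.
    Returns the set of URLs captured."""
    queue = deque([seed])
    captured = set()
    while queue and len(captured) < max_pages:
        url = queue.popleft()
        if url in captured or url not in fetch:
            continue  # already captured, or outside the site being crawled
        captured.add(url)
        parser = LinkExtractor()
        parser.feed(fetch[url])
        queue.extend(parser.links)  # follow every discovered link
    return captured

# A tiny static 'site' in place of the live web:
site = {
    "/": '<a href="/about">About</a> <a href="/news">News</a>',
    "/about": '<a href="/">Home</a>',
    "/news": '<a href="/news/1">Story</a>',
    "/news/1": "",
}
print(sorted(crawl("/", site)))  # → ['/', '/about', '/news', '/news/1']
```

The `max_pages` cap plays the role of the data budget: it bounds how much of a site a single crawl is allowed to capture.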

My responsibilities during my internship were diverse, and my day-to-day work was generally split between quality assurance, setting off crawls, and sending or drafting permission e-mails. Alongside this I was not only carrying out research into Webrecorder, but also contributing to a report re-assessing the crawl frequency of several of our seeds. The work I have done this month has been not only incredibly satisfying (successfully patching a missing PDF during QA of a website makes one disproportionately happy), but also rewarding. One missing image or hyperlink at a time, digital archivists are driving the careful maintenance of a particularly fragile medium, but one which is vital for the analysis of everything we are living through today.