Tag Archives: Algorithmic Archive project

Introducing Wacksy: a library for writing WACZ collections

Blog published on behalf of Pierre Marshall, Technical Research Officer, Algorithmic Archive project.

As part of the Algorithmic Archive project, we have been building tooling to support a prospective social media archive. This post introduces Wacksy, a Rust crate we wrote for packaging Web ARChive (WARC) files in WACZ collections.

Web Archive Collection Zipped (WACZ) is a format for packaging web archives. A WACZ collects WARC files along with any other related resources into a single hermetic zip archive. Each WACZ collection is self-describing and should contain within it all the resources necessary to replay a WARC file.
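For orientation, a typical WACZ collection unzips to something like the following layout (the file names under archive/ and indexes/ vary by tool):

```
example.wacz
├── datapackage.json          # manifest listing every resource in the collection
├── datapackage-digest.json   # hash of the manifest
├── archive/
│   └── data.warc.gz          # the captured WARC records
├── indexes/
│   └── index.cdx.gz          # CDXJ index used for replay lookup
└── pages/
    └── pages.jsonl           # entry-point pages for the replay UI
```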

Motivation

The main advantage of independently replayable WACZ files is that you don’t need to maintain an external indexing server. Eliminating that dependency is beneficial for digital preservation: in the long run, the most reliable database is just the file system.

Of course, this also means that packaging a WARC file in a WACZ involves indexing the WARC, and so the scope of this little project grew from ‘wrap some things in a zip archive’ to ‘build a WARC indexer’.

Thankfully, WARC files are easy enough to parse. One of the useful side-effects of this project was that we got to learn the WARC and WACZ specs very well. The process of implementing the WACZ format also brought up a few issues (#161, #163, #164, #166, #167) which could be used to tighten the spec in a future revision.

Archiving the datafied web

There’s another reason you might want to package resources together in a WACZ. We’ve spent a lot of time in this project trying to understand how to represent a social media post.

While sidestepping arguments about mass literary culture, we can think of social media as a kind of electronic literature, and preserving it is an extension of ongoing web archiving work. This approach includes preserving the ‘look and feel’ of a post in context, surrounded by comments and advertising and blobby Frutiger Aero buttons.

Social media posts are also data, in the sense that they exist as structured JSON. Each post is an object with properties: text, username, datetime, maybe links to associated media. All raw content: easily searchable, indexable, and ready for researchers to throw into a data processing workflow.

Ideally, you want to capture and preserve both: a web archive snapshot and the structured data.

You could include the JSON inline in a WARC header field, or add it to the WARC file as a resource record. Or you could package the WARC and JSON files together in a WACZ collection; this is the use case we had in mind when writing Wacksy.
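For illustration, a resource record embedding a post’s JSON payload might look like the following (all header values here are hypothetical):

```
WARC/1.1
WARC-Type: resource
WARC-Target-URI: urn:example:post/12345
WARC-Date: 2025-06-04T12:00:00Z
Content-Type: application/json
Content-Length: 100

{"text": "Hello world", "username": "example_user", "datetime": "2025-06-04T12:00:00Z", "media": []}
```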

How to use

With a stable Rust toolchain, run cargo add wacksy to add the crate to your Cargo manifest.

The API provides a WACZ type with two functions: from_file and as_zip_archive.

from_file() builds the WACZ object: it takes a path to a WARC file, indexes it, and returns a result with either a WACZ struct or an error. The indexer was recently rewritten and contains almost no error handling, so use with caution! Also, the format requires all resources to be defined in the datapackage, so when you construct a datapackage you’re already building a structured representation of the WACZ collection. This is a neat feature; Ed Summers and Ilya Kreymer really did a good job on the WACZ spec here.
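As a rough sketch of that structured representation (the hashes and byte counts below are placeholders, and file names will vary), a minimal datapackage.json lists every resource in the collection:

```json
{
  "profile": "data-package",
  "wacz_version": "1.1.1",
  "resources": [
    {
      "name": "data.warc.gz",
      "path": "archive/data.warc.gz",
      "hash": "sha256:<hex digest>",
      "bytes": 14064
    },
    {
      "name": "index.cdx.gz",
      "path": "indexes/index.cdx.gz",
      "hash": "sha256:<hex digest>",
      "bytes": 512
    }
  ]
}
```

A replayer can read this one file to discover and verify everything else in the zip.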

as_zip_archive() takes all the resources in the WACZ object and passes them through into a zip file, making use of Nick Babcock’s rawzip library.

Here is an example from the documentation:

use std::error::Error;
use std::fs;
use std::path::Path;
use wacksy::WACZ;

fn main() -> Result<(), Box<dyn Error>> {
    let warc_file_path = Path::new("example.warc.gz"); // set path to your WARC file
    let wacz_object = WACZ::from_file(warc_file_path)?; // index the WARC and create a WACZ object
    let zipped_wacz: Vec<u8> = wacz_object.as_zip_archive()?; // zip up the WACZ
    fs::write("example.wacz", zipped_wacz)?; // write out to file
    Ok(())
}

This API is still missing a way of adding arbitrary extra resources into the WACZ, although the code is flexible enough to accommodate that in future.

Shipping binaries

There are two other WACZ libraries out there: Webrecorder’s py-wacz (Python) and js-wacz (Node.js).

Besides the library API, another goal is to provide a simple command line interface and wrap it up into a standalone binary. This would be the main distinguishing feature of Wacksy: you wouldn’t need to set up a Python or Node.js runtime. For example, there would be fewer steps involved when packaging up WARC files in an automated workflow.

That said, py-wacz is better tested and more feature-complete, so take that into consideration. We have used py-wacz as a reference implementation to test against.

We’re also working on packaging Wacksy for Debian, and other systems after that. When it’s all packaged, it’ll be much easier for users to try out.

Performance

The WARC indexer was written with an eye on performance and memory use. When reading plain uncompressed WARCs, the indexer only reads the headers of each record. With a known header length and the WARC Content-Length value for each record, we can calculate the next record offset and skip through the file without loading record contents into memory. Where possible we also use a buffered reader rather than reading byte-by-byte. For gzip-compressed WARCs it’s more complicated, and we’ve avoided doing anything fancier like streaming decompression.
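As a minimal sketch of that skip-reading idea (illustrative code, not Wacksy’s actual indexer), you can read just the header block of a record, pull out Content-Length, and compute where the next record starts without ever loading the body:

```rust
use std::io::BufRead;

/// Read one WARC record's header block from `reader`, returning
/// (content_length, header_bytes), or None at end of input.
/// The next record then starts at header_bytes + content_length + 4,
/// accounting for the two CRLFs that terminate every record.
fn read_record_header<R: BufRead>(reader: &mut R) -> Option<(u64, u64)> {
    let mut header_bytes = 0u64;
    let mut content_length = None;
    loop {
        let mut line = String::new();
        let n = reader.read_line(&mut line).ok()?;
        if n == 0 {
            return None; // EOF before a complete header block
        }
        header_bytes += n as u64;
        let trimmed = line.trim_end();
        if trimmed.is_empty() {
            break; // a blank line closes the header block
        }
        // WARC header field names are case-insensitive
        let lower = trimmed.to_ascii_lowercase();
        if let Some(value) = lower.strip_prefix("content-length:") {
            content_length = value.trim().parse().ok();
        }
    }
    Some((content_length?, header_bytes))
}

fn main() {
    // Two back-to-back fake records; only the first header is ever parsed.
    let data: &[u8] =
        b"WARC/1.1\r\nWARC-Type: resource\r\nContent-Length: 5\r\n\r\nhello\r\n\r\nWARC/1.1\r\n";
    let mut reader = std::io::BufReader::new(data);
    let (len, header_bytes) = read_record_header(&mut reader).unwrap();
    let next_offset = header_bytes + len + 4; // skip body plus trailing CRLFCRLF
    assert!(data[next_offset as usize..].starts_with(b"WARC/1.1"));
    println!("next record starts at byte {next_offset}");
}
```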

We’ve also tried to limit the dependencies used. At the moment, a binary compiled from the example above comes out at ~600 kilobytes — not super small, but more lightweight than most web pages. As a bonus, pruning the dependency tree will make Wacksy easier to package and distribute.

Use it!

At time of writing, the library is still experimental, not yet integrated into any other software, and it has only been tested against a few example WARC files. It would benefit from wider testing on real world use cases.

The code is all open source and available under an MIT license; contributions welcome!

Reporting from the RESAW2025 Conference Workshop: Towards an “Algorithmic Archive”

Welcome sign at the University of Siegen

At this year’s REsearch infrastructure for the Study of Archived Web materials (RESAW) conference, my colleague Pierre Marshall and I organised a workshop titled “Towards an ‘Algorithmic Archive’: Developing Collaborative Approaches to Persistent Social and Algorithmic Data Services for Researchers”. The workshop was accepted as one of the RESAW2025 pre-conference workshops and took place on 4 June 2025. We had around twelve participants, including researchers at various career stages and web archivists, who contributed to lively discussions and, thanks to their experience with social media data, to the Algorithmic Archive project itself.

The workshop was organised in two sessions: the first focussed on gathering researchers’ perspectives and information about the use of social media data in research. The second invited participants to imagine a long-term archive of social media data, asking them to think about the features they would like to see in a social media data service. Both sessions offered a valuable opportunity to gather insights for the Algorithmic Archive project, particularly regarding issues and expectations related to short- and long-term access to social media data.

Key themes and takeaways are summarised below.

Social media data (re)use and data management practices

Researchers appeared to work mostly with small datasets, especially after free access to data for research purposes came to an end with the deprecation of the Twitter Academic API in 2023. Among the researchers who shared their experience with social media data, one noted that they currently work with information about the number of followers, often supplemented with screenshots taken at different points in time. They explained that screenshots are essential for their research, as they enable them to capture the “look and feel” of the social platforms, which is central to the research they are conducting. In this regard, one of the web archivists participating in the workshop noted that their institution uses Webrecorder[1] at least once a year.

In addition, a researcher whose work focusses on algorithms noted that social media data collected via APIs is only one of the sources they use for their study. Other sources include existing policies, new regulations (e.g. the EU Digital Services Act) and other archival sources such as information on GitHub.[2]

As for long-term preservation, researchers participating in the workshop appeared not to have specific plans, with some indicating that they usually delete social media data sometime after the end of a project. Despite some concerns about potential ethical issues, researchers expressed a general interest in reusing datasets that include social media data. Nevertheless, they emphasised that effective reuse would require detailed documentation from the dataset creator to understand how the data was developed.

Access and user requirements

For the second session, we organised a post-it note exercise in which we asked researchers to reflect on the type of metadata they would find useful for their research and would like memory institutions to collect and provide. Researchers suggested several kinds of metadata they would like to see associated with an archived resource, including: date of capture; date of publication; technical and curatorial metadata; hardware (e.g., mobile, tablet, laptop); sensitivity assessment; and the type of tool used to collect the information.

Post-it note session

There was general agreement among participants about the need for the collecting institution to preserve at least some instances of the context in which the data was embedded. For example, walkthroughs of social media platforms recorded using tools such as Webrecorder would be crucial for researchers and future users of the collection to get a sense of platforms’ “look and feel” at certain points in time. Some participants also noted the importance of understanding potential functionality loss when replaying archived social media material.

Nevertheless, access to platform data, particularly free access, remains one of the major blocks for researchers who need such information for their studies. This has become even more pressing since the Twitter Academic API was deprecated in 2023 and replaced with a paid tier system: the high fees required to access the necessary amount of data have led many researchers to redirect their research goals, either by significantly reducing the amount of data needed or by focusing on other platforms.

Overall, the workshop brought together diverse perspectives from practitioners and researchers working with social media data, fostering discussions regarding the development of sustainable strategies to collect social media platforms. This was a unique opportunity to discuss some of the Algorithmic Archive findings, clarify researchers’ perspectives on concerns related to the use of social media data as well as raise further questions that the Algorithmic Archive project should take into consideration for the development of a social media data service.


[1] Webrecorder homepage: https://webrecorder.net/

[2] More information about the GitHub Archiving Programme can be found here: https://archiveprogram.github.com/

Algorithmic Archive Project: Use Cases (3/3)

The Algorithmic Archive project is a one year project funded by the Mellon Foundation. As part of the first Work Package, we explored how researchers from different disciplines use social media data to answer various research questions.

This post is the third in a three-part series presenting use cases drawn from research conducted as part of the Algorithmic Archive project.

We would like to thank the researchers who generously shared insights from their work.


Use Case – Study on the trustworthiness of social media visual content among young adults (TRAVIS project)[1]

Research questions and aim(s):

Trust And Visuality: Everyday digital practices (TRAVIS) is an ESRC project which has received funding from the European Union’s Horizon 2020 Research and Innovation Programme. The project looks at how young adults experience, build and express trust in news and social media images related to wellbeing and health. It explores how and why people trust some visuals over others, and how content creators establish trustworthiness through visual content. TRAVIS involves cross-national collaboration between multiple research teams located at different universities in the UK and Europe, including the University of Oxford, where the team is based at the School of Geography and the Environment.

Social media data used:

The project included data collected indirectly from platforms including Facebook, Instagram, TikTok and YouTube (see below).

Tools and methods adopted:

Data collection from social media consisted of screenshots taken from the devices of the young adults interviewed, as the TRAVIS project investigates the meaning of social media posts (visual content) via interviews with young adult users. The dataset generated from this method of collection comprises around 400 screenshots, stored on an institutional cloud drive accessible to the whole team.


[1] Further information about the TRAVIS project is available here: https://www.tlu.ee/en/bfm/researchmedit/trust-and-visuality-everyday-digital-practices-travis

Algorithmic Archive Project: Use Cases (2/3)

The Algorithmic Archive project is a one year project funded by the Mellon Foundation. As part of the first Work Package, we explored how researchers from different disciplines use social media data to answer various research questions.

This post is the second in a three-part series presenting use cases drawn from research conducted as part of the Algorithmic Archive project.

We would like to thank the researchers who generously shared insights from their work.


Use Case – Exploring Algorithmic Mediation and Recommendation Systems on YouTube [1]

Research questions and aim(s):

The study sought to investigate how the YouTube platform operates, focusing on algorithmic activity and the strategies employed by both human and automated (robot) actors within federal and regional elections. The aim was to understand the impact that this system of mediation has on society and to demystify preconceptions of ideologically neutral technologies in highly disputed political events. The research focuses on two case studies: 1) the 2018 Ontario (Canada) election and 2) the 2018 Brazilian Federal Election. The data collection was carried out during the campaigning periods, between May and June in Ontario, and between August and October 2018 in Brazil.

Social media data used:

The research focussed solely on the YouTube platform. Specifically, the researchers collected information about recommended videos, starting from specific keywords related to the election campaign.

Tools and methods adopted:

The data collection was carried out using a Python script developed by the Algo Transparency project. The script automates YouTube search operations based on specified keywords (e.g., the names of the candidates), allowing the researcher to gather video-related data and the respective ranking positions displayed to the user. Once the keywords were defined, the tool retrieved links for the top four results for each keyword and then examined the recommendation section. This process was repeated four times, each time collecting recommended videos, simulating a user interacting with algorithmic suggestions.

Data collected was stored on personal devices and the institutional cloud, and can be visualized at the following links:


[1] Reis, R., Zanetti, D., & Frizzera, L. (2020). A conveniência dos algoritmos: o papel do YouTube nas eleições brasileiras de 2018. Compolítica, 10(1), 35–58. https://doi.org/10.21878/compolitica.2020.10.1.333

Algorithmic Archive Project: Use Cases (1/3)

The Algorithmic Archive project is a one year project funded by the Mellon Foundation. As part of the first Work Package, we explored how researchers from different disciplines use social media data to answer various research questions.

This post is the first in a three-part series presenting use cases drawn from research conducted as part of the Algorithmic Archive project.

We would like to thank the researchers who generously shared insights from their work.


Use Case: Network/cluster analysis to investigate the construction and influence of information trustworthiness within social movements on Twitter [1]

Research questions and aim(s):

The researcher wanted to explore the construction and influence of information trustworthiness within social media movements in the context of the Hong Kong protests and the #BlackLivesMatter movement. Social media platforms offer a digital space for social movements, facilitating the diffusion of critical information, the formation of networks, the coordination of protests, and the ability to reach a wider audience.

Social media data used:

This study focused on Twitter as it was used evenly by both social movements, and the researcher already had an established presence on this platform. Also, at the time of data collection (2020-2021), access to Twitter data for academic research was still relatively open to researchers.

For the purpose of this study, the researcher examined the following/follower relationships of top accounts with millions of followers that had been selected as big information disseminators, including organisations, individuals and accounts serving a particular niche or purpose.

Data collection was conducted at a specific point in time in 2021. The quantitative analysis of social media data (e.g. cluster analysis) was complemented with qualitative data collected via an online survey.

Tools and methods adopted:

The researcher requested and obtained access to the Twitter API. However, high-level coding skills were required to access the data, which the researcher did not have at that time due to their predominantly qualitative research background. To address this, the researcher found and used a Go script called Nucoll[2], which is freely available on GitHub and enabled the researcher to collect the required data. Nucoll is a command-line tool that, according to its developer, retrieves data from Twitter using keyword instructions, for which the developer provided example queries and brief explanations. For each social movement, the researcher selected three organisations: one large organisation, one activist group, and one additional account that was relevant to the movement. Once these accounts were selected, they were processed through the script to capture all following/follower relationships and combine them into a graph for each protest analysed. Further data visualisation and analysis — including clustering and network analysis — were conducted using Gephi.


[1] Charlotte Im, The Construction and Influence of Information Trustworthiness in Social Movements, Doctoral Thesis, University College London (UCL), 2024.

[2] https://github.com/jdevoo/nucoll

Reporting from the Born-Digital Collections, Archives and Memory Conference 2025

Between 2-4 April 2025, I attended the very first edition of the Born-digital Collections, Archives and Memory conference, together with my colleague from the Algorithmic Archive project, Pierre Marshall. The conference was co-organised by the School of Advanced Study at the University of London, the Endangered Material Knowledge Programme at The British Museum, The British Library and Aarhus University. This international event offered a unique opportunity to bring together academics and practitioners from diverse disciplines, career paths and backgrounds to explore the transformative impact of born-digital cultural heritage. The diverse range of research, methodologies, and practices presented in this year’s programme offered valuable insights and reflections, particularly relevant to the Algorithmic Archive project and its goal of developing sustainable, persistent approaches to preserving born-digital heritage created on the web, especially on social media platforms.

The inspiring opening keynote by Dorothy Berry, Digital Curator at the Smithsonian National Museum of African American History and Culture, highlighted the vital importance of preserving ephemeral and fragile forms of born-digital heritage (such as social media), many of which have increasingly replaced traditional modes of memory-making, and drew attention to the pressing need for a deeper understanding of what born-digital memory should be preserved and how. In particular, she stressed the need to record the “full context” in which born-digital records and materials were embedded before being collected and included in specific collections. However, she also highlighted the challenges many memory institutions face due to uneven resource distribution, an issue that may hinder both the development and long-term sustainability of innovative preservation efforts.

Given the richness of the BDCAM25 programme, it is incredibly difficult to summarise the many takeaways from the three-day conference. Nevertheless, it is worth highlighting sessions such as the one exploring the history, socio-technical dynamics and research conducted on corpora from platforms such as Usenet; the important reflections stemming from a study conducted by Rosario Rogel-Salazar and Alan Colín-Arce exploring the presence of feminist organisations in web archives; and the research conducted by Dr Andrea Stanton exploring Palestine and the concept of Palestinian heritage through the analysis of accounts and hashtags on Instagram.

Particularly valuable insights also came from Dr Kieran Hegarty’s paper, which explored the challenges posed by unpredictable and frequent changes to platform design and policies, underscoring how these significantly influence what is included in web archives and how the material is made available.

Beveridge Hall entrance, Senate House, University of London. Photo taken by B. Cannelli

Overall, the conference provided a valuable opportunity to learn about new research and to network with scholars and practitioners from around the globe. During lunch and coffee breaks, I had insightful conversations with several delegates about the challenges of preserving born-digital materials, particularly data generated on social media platforms. We exchanged ideas and reinforced the importance of developing shared practices to safeguard these resources. This theme strongly resonated in the closing session, which brought together voices from diverse career paths and regions to reflect on the current state of born-digital archives, collections, and memory, and to identify future directions.
Among the key takeaways were the need to foster data literacy and build digital citizenship from a young age, as well as the importance of connecting with activists and minority communities to help them tell and preserve their stories.