Blog published on behalf of Pierre Marshall, Technical Research Officer, Algorithmic Archive project.
As part of the Algorithmic Archive project, we have been building tooling to support a prospective social media archive. This post introduces Wacksy, a Rust crate we wrote for packaging Web ARChive (WARC) files in WACZ collections.
Web Archive Collection Zipped (WACZ) is a format for packaging web archives. A WACZ collects WARC files along with any other related resources into a single hermetic zip archive. Each WACZ collection is self-describing and should contain within it all the resources necessary to replay a WARC file.
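Concretely, a WACZ is a zip file with a conventional internal layout. A minimal collection produced from a single WARC looks roughly like this (file names here are illustrative):

```
example.wacz
├── archive/
│   └── example.warc.gz       # the original WARC file(s)
├── indexes/
│   └── index.cdx             # CDXJ index of the records in archive/
├── pages/
│   └── pages.jsonl           # entry pages for replay
├── datapackage.json          # manifest listing every resource with its hash
└── datapackage-digest.json   # digest of the manifest itself
```

The manifest is what makes the collection self-describing: a replay tool can open the zip, read datapackage.json, and locate everything else from there.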
Motivation
The main advantage of independently replayable WACZ files is that you don’t need to maintain an external indexing server. Eliminating that dependency on an external indexing system is beneficial for digital preservation: in the long run, the most reliable database is just the file system.
Of course, this also means that packaging a WARC file in a WACZ involves indexing the WARC, and so the scope of this little project grew from ‘wrap some things in a zip archive’ to ‘build a WARC indexer’.
Thankfully, WARC files are easy enough to parse. One of the useful side-effects of this project was that we got to learn the WARC and WACZ specs very well. The process of implementing the WACZ format also brought up a few issues (#161, #163, #164, #166, #167) which could be used to tighten the spec in a future revision.
Archiving the datafied web
There’s another reason you might want to package resources together in a WACZ. We’ve spent a lot of time in this project trying to understand how to represent a social media post.
While sidestepping arguments about mass literary culture, we can think of social media as a kind of electronic literature, and preserving it is an extension of ongoing web archiving work. This approach includes preserving the ‘look and feel’ of a post in context, surrounded by comments and advertising and blobby Frutiger Aero buttons.
Social media posts are also data, in the sense that they exist in the form of structured JSON. Each post is an object with properties: text, username, datetime, maybe links to associated media. It’s all raw content, easily searchable, indexable, and ready for researchers to throw into a data processing workflow.
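To make that concrete, here is a rough sketch in Rust of what such a post object might look like. The field names are hypothetical, not any platform’s actual schema, and the JSON serialisation is hand-rolled just to keep the sketch dependency-free:

```rust
// Hypothetical shape of a social media post as structured data.
// Field names are illustrative, not any platform's real schema.
struct Post {
    username: String,
    datetime: String,
    text: String,
}

impl Post {
    // Hand-rolled JSON output, purely to keep this sketch free of
    // external crates; a real implementation would use a serialiser.
    fn to_json(&self) -> String {
        format!(
            "{{\"username\":\"{}\",\"datetime\":\"{}\",\"text\":\"{}\"}}",
            self.username, self.datetime, self.text
        )
    }
}

fn main() {
    let post = Post {
        username: "@example".into(),
        datetime: "2024-05-01T12:00:00Z".into(),
        text: "hello, world".into(),
    };
    println!("{}", post.to_json());
}
```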
Ideally, you want to capture and preserve both: a web archive snapshot and the structured data.
You could include the JSON inline in a WARC header field, or add it to the WARC file as a resource record. Or, you could package the WARC and JSON files together in a WACZ collection; this is the use case we had in mind when writing Wacksy.
How to use
With a stable Rust toolchain, run cargo add wacksy to add the crate to your cargo manifest.
The API provides a WACZ type with two functions: from_file and as_zip_archive.
from_file() builds the WACZ object: it takes a path to a WARC file, indexes it, and returns a Result containing either a WACZ struct or an error. The indexer was recently rewritten and contains almost no error handling, so use it with caution! The format also requires all resources to be defined in the datapackage, so when you construct a datapackage you’re already building a structured representation of the WACZ collection. This is a neat feature; Ed Summers and Ilya Kreymer really did a good job on the WACZ spec here.
as_zip_archive() takes all the resources in the WACZ object and passes them through into a zip file, making use of Nick Babcock’s rawzip library.
Here is an example from the documentation (current/latest):
use std::error::Error;
use std::fs;
use std::path::Path;

use wacksy::WACZ;

fn main() -> Result<(), Box<dyn Error>> {
    let warc_file_path = Path::new("example.warc.gz"); // set path to your ᴡᴀʀᴄ file
    let wacz_object = WACZ::from_file(warc_file_path)?; // index the ᴡᴀʀᴄ and create a ᴡᴀᴄᴢ object
    let zipped_wacz: Vec<u8> = wacz_object.as_zip_archive()?; // zip up the ᴡᴀᴄᴢ
    fs::write("example.wacz", zipped_wacz)?; // write out to file
    Ok(())
}
This API is still missing a way of adding arbitrary extra resources into the WACZ, although the code is flexible enough to accommodate that in future.
Shipping binaries
There are two other WACZ libraries out there: py-wacz for Python and js-wacz for Node.js.
Besides the library API, another goal is to provide a simple command line interface and wrap that up into a standalone binary. This would be the main distinguishing feature of Wacksy: you wouldn’t need to set up a Python or Node.js runtime. For example, there would be fewer steps involved if you were packaging up WARC files in an automated workflow.
That said, py-wacz is better tested and more feature-complete, so take that into consideration; we have used it as a reference implementation to test against.
We’re also working on packaging Wacksy for Debian, and other systems after that. When it’s all packaged, it’ll be much easier for users to try out.
Performance
The WARC indexer was written with an eye on performance and memory use. When reading plain uncompressed WARCs, the indexer only reads the headers of each record. With a known header length and the WARC Content-Length value for each record, we can calculate the next record offset and skip through the file without passing record contents into memory. Where possible we’re also using a buffered reader rather than reading byte-by-byte. Gzip-compressed WARCs are more complicated, and we’ve avoided doing anything fancier like streaming decompression.
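The record-skipping idea can be sketched like this (a simplified illustration, not Wacksy’s actual indexer code): read header lines until the blank line that ends the header block, pick out Content-Length, then seek past the body and the two CRLFs that terminate the record.

```rust
use std::io::{BufRead, BufReader, Cursor, Seek, SeekFrom};

// Simplified sketch of skipping a WARC record without reading its body:
// parse only the header block, then seek forward by Content-Length.
fn next_record_offset<R: BufRead + Seek>(reader: &mut R) -> std::io::Result<u64> {
    let mut content_length: u64 = 0;
    let mut line = String::new();
    // Read header lines up to the blank line terminating the header block.
    loop {
        line.clear();
        if reader.read_line(&mut line)? == 0 {
            break; // end of file
        }
        let trimmed = line.trim_end();
        if trimmed.is_empty() {
            break;
        }
        if let Some(value) = trimmed.strip_prefix("Content-Length:") {
            content_length = value.trim().parse().unwrap_or(0);
        }
    }
    // Seek past the record body plus the trailing CRLF CRLF,
    // without ever loading the body into memory.
    reader.seek(SeekFrom::Current(content_length as i64 + 4))
}

fn main() -> std::io::Result<()> {
    // A tiny in-memory stand-in for a WARC file with one complete record.
    let record = b"WARC/1.1\r\nWARC-Type: resource\r\nContent-Length: 5\r\n\r\nhello\r\n\r\nWARC/1.1\r\n";
    let mut reader = BufReader::new(Cursor::new(&record[..]));
    let offset = next_record_offset(&mut reader)?;
    println!("next record starts at byte {offset}"); // prints 61 for this sample
    Ok(())
}
```

A real indexer has more to handle (case-insensitive header names, malformed records, and the gzip case), but the core trick is just this seek.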
We’ve also tried to limit the dependencies used. At the moment, a binary compiled from the example above comes out at ~600 kilobytes — not super small, but more lightweight than most web pages. As a bonus, pruning the dependency tree will make Wacksy easier to package and distribute.
Use it!
At time of writing, the library is still experimental, not yet integrated into any other software, and it has only been tested against a few example WARC files. It would benefit from wider testing on real world use cases.
The code is all open source and available under an MIT license; contributions welcome!