ingest | Archives and Manuscripts at the Bodleian Library

Yesterday, I enjoyed a very scenic drive through the Cotswolds to attend a workshop at Gloucestershire Archives (GA). GA have been working on digital curation for a few years now, using their historical digital archives – a discrete and reasonably unproblematic set of data – as a testbed for developing approaches that might help them with the more modern digital records created by the local authority in due course.

Yesterday’s event was the culmination of a six-month project, ‘Digital curation: from ingest to trusted storage’, and was designed to provide attendees with lots of hands-on time using the ‘SCAT’ tool developed through the project. The project was funded by the Society of Archivists’ research fund, but builds on previous work funded by CyMAL to develop GAip – a software tool which packages digital objects ready for ingest to preservation storage. There is not much in the way of a web presence for GAip or its successor project just yet, but the slides from Viv’s presentation at the Society of Archivists’ digital preservation road show give a flavour of GAip at least.

SCAT
The main output of the recent project is a tool called SCAT (Scat is Curation And Trust). The tool is written in Perl and currently runs only in Linux environments; with a few modifications it should also run on a Windows platform. SCAT provides an interface to a number of open source digital curation tools that exist out in the wild; once a file or directory has been loaded into SCAT, these curation tools can be applied to it. Among the tools represented are:

Bagit – the Library of Congress tool mentioned elsewhere on this blog
GAip – GA’s own packaging tool, which creates a Bagit-conforming package. This is used by GA to package its digital archives.
DROID – The National Archives’ tool for file format identification
Jhove – the tool developed by JSTOR and Harvard for object identification, validation and metadata extraction
NLNZ metadata extraction tool – extracts basic metadata from some popular formats
FITS – identifies and validates files, and extracts technical metadata. It is a wrapper for a number of third-party tools (Jhove, EXIFtool, NLNZ, DROID, Ffident and the file utility). The interesting thing about FITS is that it also attempts to normalise and consolidate the metadata output from these tools. Pete is using FITS at the moment to generate certain file-level metadata for dissemination purposes.
Antiword – a reader for Word files
Imagemagick – a tool that can do many wonderful things with image files
xmllint – used for validating XML files against their schemas, etc.
Unix’s file utility.
SWORD deposit to repository (GA have been experimenting with an eprints instance in this project)
tools for fixity checking employing the MD5 and SHA1 algorithms.
document conversion tool (destination formats odt and pdf – possibly openoffice.org?)

This list isn’t comprehensive, but it’s enough to show you that there is a strong open source philosophy underpinning the project.
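To give a flavour of what the fixity-checking tools in the list do, here is a minimal sketch in Python. It is not SCAT's own code (SCAT is written in Perl); the function names `file_digests` and `verify_fixity` are hypothetical and chosen only for illustration. The idea is simply to compute MD5 and SHA-1 digests for a file and compare them later with the values recorded at ingest.

```python
import hashlib


def file_digests(path, chunk_size=65536):
    """Return the MD5 and SHA-1 hex digests of a file, reading it in chunks."""
    md5 = hashlib.md5()
    sha1 = hashlib.sha1()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
            sha1.update(chunk)
    return {"md5": md5.hexdigest(), "sha1": sha1.hexdigest()}


def verify_fixity(path, recorded):
    """Return True if the file's current digests match the recorded ones."""
    return file_digests(path) == recorded
```

A curator would typically record the output of `file_digests` when a file is first received and call `verify_fixity` at intervals afterwards; a `False` result means the file's bytes have changed since the digests were recorded.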

SCAT is still very much alpha code, and Viv Cothey (its developer) intends to do a bit of tidying up before putting it out on the web. It’s really been designed to provide a hands-on learning space for archivists, and is not conceived as a ready-to-use system for digital curation. As a learning environment, I think it is very effective, providing a workbench which can call up a whole host of tools that the archivist can experiment with.

GAip packages
It’s worth saying a little something about the package produced by GAip. Using the tool, the archivist can decide to package:

a single item
a collection of materials as single items
a collection of materials as a bundle, retaining their directory structure.

In a GAip package you will find:

a copy of the original source data
a sidecar metadata record for each data item (GAip uses XMP for this, and the metadata includes Dublin Core and other metadata expressed in RDF)
an inventory of the files contained in the package including their hash values

The package is a compressed tar file (.tar.gz, although GAip uses a .gaip file extension). Each package is identified by a unique timestamp.
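The general shape of such a package can be sketched in a few lines of Python. This is an illustration under stated assumptions, not GAip's actual layout or code: the `data/` directory, the `inventory.txt` file name, the timestamp format and the function names are all hypothetical, SHA-1 is chosen arbitrarily as the hash, and the XMP sidecar records are left out for brevity. It copies a directory into a compressed tar file named after a timestamp, retains the directory structure, and adds an inventory of the files with their hash values.

```python
import hashlib
import io
import tarfile
import time
from pathlib import Path


def sha1_of(path):
    """Return the SHA-1 hex digest of a file (reads the whole file at once)."""
    return hashlib.sha1(Path(path).read_bytes()).hexdigest()


def build_package(source_dir, dest_dir):
    """Write <timestamp>.tar.gz to dest_dir and return its path.

    dest_dir is assumed to lie outside source_dir.
    """
    source_dir = Path(source_dir)
    # Hypothetical identifier: a UTC timestamp with one-second resolution.
    package_id = time.strftime("%Y%m%dT%H%M%S", time.gmtime())
    package_path = Path(dest_dir) / f"{package_id}.tar.gz"
    inventory_lines = []
    with tarfile.open(package_path, "w:gz") as archive:
        files = sorted(p for p in source_dir.rglob("*") if p.is_file())
        for item in files:
            relative = item.relative_to(source_dir).as_posix()
            # Copy the source file, keeping its place in the directory tree.
            archive.add(item, arcname=f"{package_id}/data/{relative}")
            inventory_lines.append(f"{sha1_of(item)}  data/{relative}")
        # Add the inventory as an in-memory file.
        inventory = ("\n".join(inventory_lines) + "\n").encode("utf-8")
        info = tarfile.TarInfo(name=f"{package_id}/inventory.txt")
        info.size = len(inventory)
        info.mtime = int(time.time())
        archive.addfile(info, io.BytesIO(inventory))
    return package_path
```

Calling `build_package("accession", "packages")` would write something like `packages/20090924T101500.tar.gz`. A one-second timestamp is only unique if packages are never created faster than that, which is one reason identifiers come up again in the discussion below.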

This is not a million miles away from our approach with the BEAM ingest tool.

Building on SCAT
Could SCAT be developed into something more? There are a number of areas that would need to be addressed. Some of the ones raised in discussions yesterday include:

the need to support multiple users (both GAip and SCAT are conceived as single-user software at present)
a better approach to unique, and persistent, identifiers
a method by which data objects can accrue additional metadata in their XMP sidecars beyond that supplied when the GAip package is created
workflow

If any of this is of interest, then I’m sure Viv Cothey would be pleased to hear from potential collaborators.

-Susan Thomas

Thanks to everyone who came along and contributed to the project’s first advisory board meeting last Thursday.

Introductions
We started with some introductory discussions around the Library’s hybrid collections and the futureArch project’s aims and activities. This discussion was wide-ranging, touching on a number of subjects including the potential content sources for ‘digital manuscripts’: from mobile phones, to digital media, to cloud materials.

Systems
In the past year, we’ve made progress on developing, and beginning to implement, the technical architecture for BEAM (Bodleian Electronic Archives & Manuscripts). Pete Cliff (futureArch Software Engineer) kicked off our session on ‘systems’ with an overview of the architecture, drawing on some particular highlights; it’s worth a look at his slides if you’re interested in finding out more.

futureArchitecture


Next, a demo of two ingest tools:
1. Renhart Gittens demonstrated the BEAM ingester, our means of committing accessions (under a collection umbrella) to BEAM’s preservation storage.

2. Dave Thompson (Wellcome Library Digital Curator) demonstrated the XIP creator. This tool does a similar job to the BEAM Ingester and forms part of the Tessella digital preservation system being implemented at the Wellcome Library.

Keeping with technical architecture, Neil Jefferies (OULS R&D Project Officer) introduced Oxford University Library Service’s Digital Asset Management System (or DAMS, as we’ve taken to calling it). This is the resilient preservation store upon which BEAM, and other digital repositories, will sit.

How will researchers use hybrid archives?
Next we turned our attention to the needs of the researchers who will use the Library’s hybrid archives. Matt Kirschenbaum (Assoc. Prof. of English & Assoc. Director of MITH at the University of Maryland) got us off to a great start with an overview of his work as a researcher working with born-digital materials. Matt’s talk emphasised digital archives as ‘material culture’, an aspect of digital manuscripts that can be overlooked when the focus becomes overly content-driven. Some researchers want to explore the writer’s writing environment; this includes seeing the writer’s desktop, and looking at their MP3 playlist, as much as examining the word-processed files generated on a given computer. Look out for the paper Matt has co-authored for iPRES this year.

Next we broke into groups to critique the ‘interim interface’ which will serve as a temporary access mechanism for digital archives while a more sophisticated interface is developed for BEAM. Feedback from the advisory board critique session was helpful and we’ve come away with a to-do list of bug fixes and enhancements for the interim interface as well as ideas for developing BEAM’s researcher interfaces. We expect to take work on researcher requirements further next year (2010) through workshops with researchers.

Finally, we heard from Helen Hockx-Yu (British Library’s Web Archiving Programme Manager) on the state of the art in web archiving. Helen kindly agreed to give us an overview of web archiving processes and the range of web archiving solutions available. Her talk covered all the options, from implementing existing tool suites in-house to outsourcing some or all of the activity. This was enormously useful and should inform conversations about the desired scope of web archiving activity at the Bodleian and the most appropriate means by which this could be supported.

Some of us continued the conversation into a sunny autumn evening on the terrace of the Fellows’ Garden of Exeter College, and then over dinner.

-Susan Thomas

scat @ Gloucestershire archives

Advisory board meeting, 24 Sept. 2009