Tag Archives: futureArch project

Advisory Board Meeting, 18 March 2011

Our second advisory board meeting took place on Friday. Thanks to everyone who came along and contributed to what was a very useful meeting. For those of you who weren't there, here is a summary of the meeting and our discussions.

The first half of the afternoon took the form of an overview of different aspects of the project.

Overview of futureArch’s progress

Susan Thomas gave us a reminder of the project’s aims and objectives and the progress being made to meet them. After an overview of the percentage of digital accessions coming into the library since 2004 and the remaining storage space we currently have, we discussed the challenge of predicting the size of future digital accessions and collecting digital material. We also discussed what we think researcher demand for born-digital material is now and will be in the future.

Born Digital Archives Case Studies

Bill Stingone presented a useful case study about what the New York Public Library has learnt from the process of making born-digital materials from the Straphangers Campaign records available to researchers.

After this Dave Thompson spoke about some of the technicalities of making all content in the Wellcome Library (born-digital, analogue and digitised) available through the Wellcome Digital Library. Since the project is so wide-reaching, a number of questions followed about the practicalities involved.

Web archiving update

Next we returned to the futureArch project and Susan gave an overview of the scoping, research and decisions that have been made regarding the web archiving pilot since the last meeting. I then gave an insight into how the process of web archiving will be managed using a tracking database. Some very helpful discussions followed about the practicalities of obtaining permission for archiving websites and the legal risks involved.

After breaking for a well-earned coffee we reconvened to look at systems.

Systems for Curators

Susan explained how the current data capture process works for digital collections at the Bodleian including an overview of the required metadata which we enter manually at the moment. Renhart moved on to talk about our intention to use a web-based capture workbench in the future and to give us a demo of the RAP workbench. Susan also showed us how FTK is used for appraisal, arrangement and description of collections and the directions we would like to take in the future.

Researcher Interface

To conclude the systems part of the afternoon, Pete spoke about how the BEAM researcher interface has developed since the last advisory board meeting, the experience of the first stage of testing the interface and the feedback gained so far. He then encouraged everyone to get up and have a go at using the interface for themselves and to comment on it.

Training the next generation of archivists?

With the end of the meeting fast approaching, Caroline Brown from the University of Dundee gave our final talk. She addressed the extent to which different archives courses in the UK cover digital curation and the challenges faced by course providers aiming to include this kind of content in their modules.

With the final talk over we moved onto some concluding discussions around the various skills that digital archivists need. Those of us who were able to stay continued our discussions over dinner.

-Emma Hancox

What have the Romans ever done for us?

Today I presented an internal seminar on RDF to the Bodleian Library developers, the first in a series of (hopefully) regular R&D meetings. This one was to provide a practical introduction to RDF and give us a baseline to build on when we start modelling our content (at least one Library project requires us to generate RDF). I called it "What have the Romans ever done for us?".


The title is a line from the Monty Python film "The Life of Brian", and I chose it not just because I've been looking into Python (the language) but also because I can imagine a future where people ask "What has RDF ever done for us?" in a disgruntled way. In the film people suggest the Romans did quite a lot – bits of public infrastructure, like aqueducts and roads, alongside useful services like wine and medicine. I think RDF is a bit like that. Done well, it creates a solid infrastructure from which useful services will be built, but it is also likely to be invisible, if not taken for granted, like sanitation and HTML. 🙂

The seminar itself seemed to go well – though you'd have to ask the attendees rather than the presenter to get the real story! We started with some slides that outlined the basics of RDF, using Dean Allemang & Jim Hendler's nice method of distributing responsibility for tabular data and ending up at RDF (see pg. 32 onwards in the book Semantic Web for the Working Ontologist), and leapt straight in with LIBRIS (for example, Neverwhere as RDF) as a case study. In the resultant discussion we looked at notation, RDFS, and linked data.
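
To make the triple idea concrete, here is a minimal sketch (purely illustrative, using Apache Jena, with a made-up book URI) that states a couple of facts about that same Neverwhere example and prints them in two notations:

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.vocabulary.DCTerms;

public class TripleDemo {
    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();

        // Two triples about one resource: a title and a creator.
        // The URI is invented purely for illustration.
        Resource book = model.createResource("http://example.org/book/neverwhere");
        book.addProperty(DCTerms.title, "Neverwhere");
        book.addProperty(DCTerms.creator, "Neil Gaiman");

        // The same statements, serialised in two different notations.
        model.write(System.out, "TURTLE");
        model.write(System.out, "RDF/XML");
    }
}

The point being that the model – a set of three-part statements – stays the same whichever notation you choose to write it in.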

The second half of the seminar was a workshop in which we split into two groups, data providers and data consumers, and considered what resources at the Bodleian might be suitable for publication as RDF (and linked data) and what services we might build using data from elsewhere.

The data providers felt there were probably quite a lot of resources in the Library that we could publish, or become the authority on – members of the University, for example, or any of the many wonders we have in the collections. To make this manageable, it was felt sensible to break the task up, probably by project. This would also allow specific models to be identified and/or developed for each set of resources.

There was some concern among the providers about how to "sell" the benefits of RDF and linked data to management. The concerns parallel what I imagine happened with the emergent Web: is this kind of data publication giving away valuable information assets for little or no return? At worst it leads people away from the Library to viewing and using information via aggregation services. Of course, there is a flip side to this argument: serendipitous discovery of a Bodleian resource via a third party is essentially free advertising and may drive users through our (virtual or real) door.

The consumers seemed to realise early on that one of the big problems was the lack of usable data. Indeed, for the talk I scoured datasets trying to find a decent match to augment Library catalogues and found it quite hard. That isn't to say there is not a lot of data available (though of course there could be more); it is just that, without an application for it, shoehorning data into a novel use remains a novelty item. One possible example, however, was the Legislation API. The group suggested that reviews from other sites could be used to augment Library catalogue results. They also suggested that the people data the Library could publish would be very useful, and Monica talked about a suggestion she had heard at Dev8D for a developer expertise database (a data Web?).

All in all, lots of very useful discussion, and I hope everyone went away with a good idea of what RDF is, what it might do for us, and what we might do with it. There (justifiably) remains some scepticism, mostly because without the Web, linked data is simply data, and we've all got our own neat ways to handle data already, be it an SQL database, XML & Solr, or whatever. Without the Web of Data, the question "What do we gain?" remains.

It is a bit chicken-and-egg, and the answer will eventually become clear as more and more people create machine-processable data on the Web.

For the next meeting we'll be modelling people. I'm going to bring the clay! 🙂

Slides and worksheets are available, with the source OpenOffice documents also published, on this (non-RDF) page!

-Peter Cliff

Migrating documents

We have a collection that consists of several thousand documents in various archaic (well, 1980s/90s) word processor formats, including Ami Professional and (its predecessor) Samna Word. Perhaps of interest to folks intent on discussing the implications of migration for the authenticity of the items, some of those Ami Pro files contain the (automatically generated) line:

“File … was converted from Samna Word on …”

So which is the original now?

Migrating these file formats has not been straightforward, because it has proved remarkably tricky to ascertain a key piece of information – the file format of the original. This is not the fault of the file format tools (I'm using FITS, which itself wraps the usual suspects, JHOVE & DROID), but the broader problem that the files have multiple formats. Ami Pro files are correctly identified as "text/plain". The Unix file command reports them as "ASCII English text". Some (not all) have the file extension ".sam", which usually indicates Ami Pro, but the ".sam" files are not all the same format.

Yet this small piece of metadata is essential, because without it it is very difficult to identify the correct tool to perform the migration. For example, if I run my usual text-to-PDF tool – which is primed to leap into action on arrival of a "text/plain" document – the resultant PDF shows the internals of an Ami Pro file, not the neatly laid out document the creator saw. We have a further piece of information available too, and curiously it is the most useful: the "Category" from FTK, which correctly sorts the Ami Pros from the Samna Words.

This leads to a complex migration machine that needs to be capable of collating file format information from disparate sources and making sense of the differences, all within the context of the collection itself. If I know that creator X used Ami Pro a lot, then I can guess that "text/plain" & ".sam" means an Ami Pro document, for example. This approach is not without problems, however, not least that it requires a lot of manual input into what should ultimately be an automated and unwatched process. (One day, when it works better, I'll try to share this code!)
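
To give a flavour of that collation, the heart of it is little more than an ordered set of rules, something like the sketch below. This is an illustration rather than the real code – the FTK category strings and the rules themselves are made up, and the real rules are collection- (and creator-) specific:

// An illustrative sketch only, not the real migration code.
// The category strings and rules are invented; real rules vary per collection.
public class FormatResolver {

    /** Best guess at the real format from MIME type, extension and FTK category. */
    public static String resolve(String mimeType, String extension, String ftkCategory) {
        // The FTK Category has proved the most reliable signal for this collection.
        if ("Samna Word".equalsIgnoreCase(ftkCategory)) return "samna-word";
        if ("Ami Pro".equalsIgnoreCase(ftkCategory)) return "ami-pro";

        // Collection-specific rule: this creator used Ami Pro heavily, so
        // "text/plain" plus a ".sam" extension almost certainly means Ami Pro.
        if ("text/plain".equals(mimeType) && "sam".equalsIgnoreCase(extension)) {
            return "ami-pro";
        }

        // Otherwise trust the MIME type and let the usual tools handle it.
        return mimeType;
    }

    public static void main(String[] args) {
        System.out.println(resolve("text/plain", "sam", "Ami Pro")); // ami-pro
        System.out.println(resolve("text/plain", "txt", null));      // text/plain
    }
}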

Sometimes you get lucky, and the tool to do the migration offers an “auto” mode for input. For this collection I am using a trial copy of FileMerlin to perform the migration and evaluate it. It actually works better if you let it guess the input format rather than attempt to tell it. Other tools, such as JODConverter, like to know the input format and here you have a similar problem – you need to know what JODConverter is happy to accept rather than the real format – for example, send it a file with a content type of “application/rtf” and it responds with an internal server error. Send the same file with a content type of “application/msword” and the PDF is generated and returned to you.
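
For the curious, the workaround amounts to lying about the content type when the file is sent over. The sketch below assumes a JODConverter-style web service; the endpoint URL is hypothetical and the details will differ depending on how JOD is deployed:

// A sketch of posting a document to a JODConverter-style conversion service.
// The endpoint URL is hypothetical; the interesting bit is the declared type.
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ConvertToPdf {
    public static void main(String[] args) throws Exception {
        URL service = new URL("http://localhost:8080/converter/service"); // hypothetical

        HttpURLConnection conn = (HttpURLConnection) service.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        // Declaring the honest type ("application/rtf") got us an internal server
        // error, so declare the file as Word and let the converter work it out.
        conn.setRequestProperty("Content-Type", "application/msword");
        conn.setRequestProperty("Accept", "application/pdf");

        try (OutputStream out = conn.getOutputStream()) {
            out.write(Files.readAllBytes(Paths.get("letter.rtf")));
        }

        if (conn.getResponseCode() == HttpURLConnection.HTTP_OK) {
            Files.copy(conn.getInputStream(), Paths.get("letter.pdf"));
        } else {
            System.err.println("Conversion failed: HTTP " + conn.getResponseCode());
        }
    }
}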

Then there is a final problem – sometimes it takes several steps to get the file into shape. For this collection, FileMerlin should be able to migrate Ami Pro and Samna Word into PDFs. In practice, it crashes on a very small subset of the documents. To overcome this, I migrate those same documents to rich text format (which FileMerlin seems OK with) and then to PDF with JODConverter – sending the aforementioned "application/msword" content type. I had a similar problem with WordPerfect files, where using JOD directly changed the formatting of the originals; using libwpd to create ODTs and then converting those to PDFs generated more accurate results. (This is strange behaviour, since OpenOffice itself uses libwpd!) Every time I hit a new (old) file format, the process of identifying it and generating a heuristic for handling it starts over.
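
Put together, the route for this collection ends up as a fallback chain along these lines. Again, a sketch only – the convertWith… helpers are hypothetical stand-ins for however FileMerlin, libwpd and JODConverter are actually driven:

// A hypothetical sketch of the fallback chain. The convertWith* helpers are
// placeholders for whatever wraps FileMerlin, libwpd and JODConverter.
import java.io.File;

public class MigrationChain {

    public static File toPdf(File original, String resolvedFormat) throws Exception {
        if ("wordperfect".equals(resolvedFormat)) {
            // Going via ODT (libwpd) gave more faithful PDFs than JOD directly.
            File odt = convertWithLibwpd(original);
            return convertWithJod(odt, "application/vnd.oasis.opendocument.text");
        }
        try {
            // FileMerlin copes with most of the Ami Pro / Samna Word files directly.
            return convertWithFileMerlin(original, "pdf");
        } catch (Exception crashed) {
            // For the few it crashes on, go via RTF and then hand the RTF to
            // JODConverter, declared as application/msword (see above).
            File rtf = convertWithFileMerlin(original, "rtf");
            return convertWithJod(rtf, "application/msword");
        }
    }

    // Hypothetical helpers – the real calls depend on how each tool is wrapped.
    private static File convertWithFileMerlin(File in, String target) throws Exception {
        throw new UnsupportedOperationException("wrap the FileMerlin conversion here");
    }

    private static File convertWithLibwpd(File in) throws Exception {
        throw new UnsupportedOperationException("wrap the libwpd conversion here");
    }

    private static File convertWithJod(File in, String declaredType) throws Exception {
        throw new UnsupportedOperationException("wrap the JODConverter call here");
    }
}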

I’m starting to think I need a neural network! That really would be putting the AI in OAIS!

-Peter Cliff

Digital forensics and digital archives

If you're interested in how digital forensics methodologies and tools could be, and are being, applied to digital archives, then you might like to take a look at a new report published by CLIR. See http://www.clir.org/pubs/reports/pub149/pub149.pdf. It also includes a sidebar outlining how forensics tools are incorporated into our workflow.

-Susan Thomas

What I learned from the word clouds…

Now, word clouds are probably a bit out of fashion these days. Like a Google Map, they just seem shiny but, most of the time, quite useless. Still, that hasn't stopped us trying them out in the interface – because I'm curious to see what interesting (and simple to gather) metadata n-grams & their frequency can suggest.

Take, for instance, the text of "Folk-Lore and Legends of Scotland" [from Project Gutenberg] (I'm probably not allowed to publish stuff from a real collection here, and chose this text because I'm pining for the mountains). It generates a "bi-gram"-based word cloud that looks like this:

Names (of both people and places) quickly become obvious to human readers, as do some subjects ("haunted ships" is my favourite). To make it more useful to machines, I'm pretty sure someone has already tried cross-referencing bi-grams with name authority files. I also imagine someone has used the bi-grams as facets. Theoretically, a bi-gram like "Winston Churchill" may well turn up in manuscripts from multiple collections. (Anyone know of any successes doing these things?)
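
If you are wondering what is involved, counting the bi-grams is the easy part. Here is a rough sketch – the real code (mentioned below) uses Jonathan Feinberg's tokenizer rather than this naive lower-case-and-split approach:

// A rough sketch of bi-gram counting over plain text. The real word-cloud code
// uses a proper tokenizer; this just lower-cases and splits on whitespace.
import java.util.HashMap;
import java.util.Map;

public class BigramCounter {

    public static Map<String, Integer> count(String text) {
        String[] words = text.toLowerCase().replaceAll("[^a-z\\s]", " ").split("\\s+");
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (int i = 0; i < words.length - 1; i++) {
            if (words[i].isEmpty() || words[i + 1].isEmpty()) continue;
            String bigram = words[i] + " " + words[i + 1];
            Integer seen = counts.get(bigram);
            counts.put(bigram, seen == null ? 1 : seen + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        String sample = "The haunted ships passed the haunted ships in the night";
        for (Map.Entry<String, Integer> entry : count(sample).entrySet()) {
            System.out.println(entry.getValue() + "\t" + entry.getKey());
        }
    }
}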

Still, for now I’ll probably just add the word clouds of the full-texts to the interface, including a “summary” of a shelfmark, and then see what happens!

I made the (very simple) Java code available on GitHub, but I take no credit for it! It is simply a Java reworking of Jim Bumgardner’s word cloud article using Jonathan Feinberg’s tokenizer (part of Wordle).

-Peter Cliff

The as yet unpaved publication pathway…

It has been a while since we had a whiteboard post, so I thought it was high time we had one! This delightful picture is the result of trying to explain the “Publication Pathway” – Susan’s term for making our content available – to a new member of staff at the Library…

Nothing too startling here really – take some disparate sources of metadata, add a sprinkling of auto-gen'd metadata (using the marvellous FITS and the equally marvellous tools it wraps), migrate the arcane input formats to something useful, normalise and publish! (I'm thinking I might get "Normalise and Publish!" printed on a t-shirt! :-))

The blue box, CollectionBuilder, is what does most of the work – it constructs an in-memory tree of "components" from the EAD, tags the items onto the right shelfmarks, augments the items with additional metadata, and writes the whole lot out in a tidy directory structure that even includes a FOXML file with DC, PREMIS and RDF datastreams (the RDF is used to maintain the hierarchical relationships in the EAD). That all sounds a lot neater than it currently is, but, like all computer software, it is a work in progress that works, rather than a perfect end result! 🙂
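
The EAD-walking part is the least mysterious bit of that. Stripped right down – and this is a sketch to show the shape of it, not CollectionBuilder itself, which also has the augmentation and FOXML writing to do – it is essentially a recursion over the <c> components, something like this (it assumes an un-namespaced EAD, for brevity):

// A sketch of walking EAD <c>/<c01>..<c12> components into an indented listing
// of shelfmarks. CollectionBuilder does much more (augmentation, FOXML, etc.).
import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class EadWalker {

    public static void walk(Element element, int depth) {
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            if (!(children.item(i) instanceof Element)) continue;
            Element child = (Element) children.item(i);
            if (child.getTagName().matches("c(\\d\\d)?")) {
                // A component: print its shelfmark (unitid) and recurse into it.
                Element did = directChild(child, "did");
                Element unitid = (did == null) ? null : directChild(did, "unitid");
                StringBuilder indent = new StringBuilder();
                for (int d = 0; d < depth; d++) indent.append("  ");
                System.out.println(indent + (unitid == null
                        ? "[no unitid]" : unitid.getTextContent().trim()));
                walk(child, depth + 1);
            } else {
                // Not a component (e.g. <archdesc>, <dsc>): keep looking inside it.
                walk(child, depth);
            }
        }
    }

    private static Element directChild(Element parent, String tag) {
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node node = children.item(i);
            if (node instanceof Element && tag.equals(((Element) node).getTagName())) {
                return (Element) node;
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        Document ead = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(new File("finding-aid.xml"));
        walk(ead.getDocumentElement(), 0);
    }
}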

After that, we (will – it ain't quite there yet) push the metadata parts into the Web interface and from there index them and present them to our lovely readers!

Hooray!

The four boxes at the bottom are the "vhysical" layout – it's a new word I made up to describe what is essentially a physical (machine) architecture, but is in fact a bunch of virtual machines…

For the really attentive among you, this shot is of the whiteboard in its new home on the 2nd floor of Osney One, where Renhart and I have moved following a fairly major building renovation. Clearly we were too naughty to remain with the archivists! 😉

-Peter Cliff