
Preserving Digital Sound and Vision: A Briefing, 8th April 2011

Last Friday I went along to the DPC briefing Preserving Digital Sound and Vision. I was particularly interested in the event because of digital video files currently held on DVD media at the Bodleian.

After arriving at the British Library and collecting my very funky Save the Bits DPC badge I sat down to listen to a packed programme of speakers. The morning talks gave an overview of issues associated with preserving audio-visual resources. We began with Nicky Whitsed from the Open University, who spoke about the nature of the problem of preserving audio-visual content; a particularly pertinent issue for the OU, who have 40 years of audio-visual teaching resources to deal with. Richard Ranft then gave a fascinating insight into the history and management of the British Library Sound Archive. He played a speech from Nelson Mandela’s 1964 trial to emphasise the value of audio preservation. Next, Stephen Gray from JISC Digital Media spoke about how students are using audio-visual content in their research. He mentioned the difficulties researchers find when citing videos, especially those on YouTube that may disappear at any time! To round off the morning, John Zubrycki from BBC R&D spoke about Challenges and Solutions in Broadcast Archives. One of the many interesting facts he mentioned was that subtitle files originally produced by the BBC for broadcast have been used as a tool for search and retrieval of video content.

After enjoying lunch and the beautiful sunny weather on the British Library terrace we moved on to the afternoon programme, based on specific projects and tools. Richard Wright of the BBC spoke about the PrestoCentre and the tools it has developed to help with audio-visual preservation. He also spoke about the useful digital preservation tools available online via PrestoSpace. Sue Allcock and James Alexander then discussed the outcomes and lessons learnt from the Access to Video Assets project at the Open University, which makes past video content from the Open University’s courses available to OU staff through a Fedora repository. As at the BBC, subtitle files have allowed the OU to index their audio-visual collections. Finally, Simon Dixon from the Centre for Digital Music at Queen Mary, University of London spoke about emerging tools for digital sound.

A final wide-ranging discussion about collaboration and next steps followed, which touched on storage as well as ideas for a future event addressing the contexts of audio-visual resources. I left the event with my mind full of new information and lots of pointers for places to look to help me consider the next steps for our digital video collections… watch this space.

-Emma Hancox


This is probably an old and battered hat for you good folks (seeing as the Web site’s last “announcement” was in 2004!), but most days I still feel pretty new to this whole digital archiving business – not just the “archive” bit, but also the “digital preservation”, um, bit – so it was news to me… 😉

Perusing the latest Linux Format at the weekend, I chanced on an article by Ben Martin (I couldn’t find a Web site for him…) about parchive and specifically par2cmdline.

Par-what? I hear you ask? (Or perhaps “oh yeah, that old thing” ;-))

Par2 files are what the article calls “error correcting files”. A bit like checksums, except that as well as detecting bit/byte-level damage, they can be used to repair the original file.


So I duly installed par2 – did I mention how wonderful Linux (Ubuntu in this case) is? – the install was simple:

sudo apt-get install par2

Then tried it out on a 300MB Mac disk image – the new Doctor Who game from the BBC – and guess what? It works! Create the par2 files, do some damage to the file with dd, run a verify and it says “the file is damaged, but I can fix it” in a reassuring HAL-like way (that could be my imagination, it didn’t really talk – and if it did, probably best not to trust it to fix the file, right?)

The par2 files totalled around 9MB at “5% redundancy” – as far as I can tell, that means recovery blocks equal to 5% of the file’s source blocks, which roughly bounds how much damage can be repaired – which isn’t much of an overhead for some extra data security… I think, though I’ve not tried, that it is integrated into KDE4 too for a little bit of personal file protection.

The interesting thing about par2 is that it comes from an age when bandwidth was limited. If you downloaded a large file and it was corrupt, rather than have to download it again, you simply downloaded the (much smaller) par2 file that had the power to fix your download.

This got me thinking. Is there then any scope for archives to share par2 files with each other? (Do they already?) We cannot exchange confidential data but perhaps we could share the par2 files, a little like a pseudo-mini-LOCKSS?

All that said, I’m not quite sure we will use parchive here, though it’d be pretty easy to create the par2 files on ingest. In theory our use of ZFS, RAID, etc. should be covering this level of data security for us, but I guess it remains an interesting question – would anything be gained by keeping par2 data alongside our disk images? And, after Dundee, would smaller archives be able to get some of the protection offered by things like ZFS, but in a smaller, lighter way?
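For what it’s worth, an ingest step like that could be as little as a loop – a sketch only, with an invented accession path that isn’t our real layout:

```shell
# Hypothetical ingest step: generate par2 recovery files for every disk image
# in an accession directory. The path below is made up for illustration.
accession=/archive/accessions/2008-07
created=0
for img in "$accession"/*.img; do
  [ -e "$img" ] || continue                 # glob matched nothing: skip
  par2 create -r5 "$img"                    # writes recovery data alongside the image
  created=$((created + 1))
done
echo "par2 sets created: $created"
```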

Oh, and Happy Summer Solstice!

-Peter Cliff

DPC’s preservation planning workshop

Earlier in the week I attended a DPC workshop on preservation planning, largely built around material coming out of the European project Planets, which is now half-way through its four-year programme. There were also interesting contributions from Natalie Walters of the Wellcome Library and Matthew Woollard of the UK Data Archive.

A preservation system for the Wellcome Library?
Much of what Natalie had to say about the curation of born-digital archives chimed with our experiences here. Unlike us though, Wellcome are in the process of evaluating ‘off the shelf’ systems to manage digital preservation. They put out a tender earlier this year and received five responses that seem, in the main, to demonstrate a misunderstanding of archival requirements and the immaturity of the digital curation/preservation marketplace. One criticism was that the responses offered systems for ‘access’ or ‘institutional repositories’ (of the kind associated with open access HE content – academic papers and e-theses). This is something we also felt when we evaluated the Fedora and DSpace repositories on the Paradigm project (admittedly, this evaluation becomes a bit more obsolete day by day). Balancing access and preservation requirements has long been an issue for archivists, since we often have to preserve material that is embargoed for a period of time. I still believe that systems providing preservation services and systems providing researcher access are doing different things, but we do of course need some form of access to embargoed material for management and processing purposes. I also find the adoption of new meanings for words, like ‘repository’ and ‘archive’, tricky to negotiate at times. These issues aside, one of the systems offered seems to have held Wellcome’s interest and I’ll be keen to find out which one when this information can be revealed.

Preservation policy at UKDA
Matthew spoke about the evolution of preservation policy at the UKDA, which had no preservation policy until 2003 despite celebrating its 40th anniversary last year. The first two editions of the policy were more or less exclusively concerned with the technical aspects of preserving digital material, specifying such things as acceptable storage conditions and the frequency with which tape should be re-tensioned. The latest (third) edition embraces wider requirements including organisational/business need, user requirements (designated community and others), standards, legislation, technology and security. The new policy increases emphasis on data integrity and archival standards, it defines archival packages more closely to provide for their verification, and it pays attention to the curation of metadata describing the resources to be preserved.

If I understood correctly, the UKDA preserves datasets in their original form (SIP), migrates them to a neutral format (AIP1) and creates usable versions from the neutral format (AIP2). All these versions are preserved and dissemination versions of the dataset are created from AIP2. The degree of processing applied to a dataset is determined by applying a matrix which assigns a value on the basis of likely use and value. These processes feel similar to those evolving here, though we need to do more work to formalise them.

Matthew also showed us a nice little diagram from 1976, which was created to document UKDA workflow from initial acquisition of a dataset to its presentation to the final user. The fundamentals of professional archival, or OAIS-like, practice are evident. The UKDA’s analysis of its own conformance with the OAIS model undertaken under the JISC 04/04 Programme is worth a look for those who haven’t seen it.

Towards the end of the talk Matthew reminded us that having written a policy, one must implement it. It’s not normally possible to implement every new thing in a policy at once, but the policy is valueless without mechanisms in place to audit it. Steps must be taken to progress those aspects of the policy that are new and to audit compliance more generally. The policy must also be available to relevant audiences who can evaluate the degree to which the archive complies with its own policy for themselves. I found this a very useful overview of the key issues involved in developing a preservation policy and the resulting policy itself is very clear and concise.

Planets tools for preservation planning
It’s great to see the promise of Planets starting to be realised, especially since we plan to build on the project’s work in relation to characterising material, planning and executing preservation strategies. Andreas Rauber kicked things off with an overview of the Planets project, which helped to demonstrate how the various components fit together. What is uncertain at the moment is how the software and services being developed by Planets will be sustained beyond the project’s life. Neither is it clear what licensing model(s) will be adopted for different components in the project, since there are the needs of commercial partners to consider as well as those of national archives, libraries and universities.

Christoph Becker gave us an overview of Plato, a tool which allows the user to develop preservation strategies for specific kinds of objects. In Plato, users can design experiments to determine the best available preservation strategy for a particular type of material. This involves a formal definition of constraints and objectives, which includes an assessment of the relative importance of each of these factors. Factors might include:

* object migration time – max. 1 second
* object migration cost – max. £0.05 per object
* preserve footnotes – 5
* preserve images – 5
* preserve headings – 4
* open format required – 5
* preserve font – 3
* and so on…

These are expressed in an ‘objective tree’, which can be created directly in Plato or in the Freemind mind-mapping tool and uploaded to Plato. Objective trees can be very simple, but the process of creating a good and detailed objective tree is quite demanding (we had a go at doing this ourselves in the afternoon). In future we should be able to build on previous objective trees as these are developed, and that will ease the process. For the moment the templates provided are minimal because the Plato team don’t want to preempt user requirements!
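As a toy illustration of how weighted objectives might be rolled up into a single score per candidate strategy – the strategy names, weights and measurements below are invented, and this is not Plato’s actual evaluation model:

```shell
# Each row: strategy, objective, weight (importance 1-5), measured score (1-5).
cat > results.csv <<'EOF'
strategy,objective,weight,score
migrate-to-odt,preserve footnotes,5,2
migrate-to-odt,preserve images,5,5
migrate-to-odt,open format,5,5
migrate-to-pdf,preserve footnotes,5,4
migrate-to-pdf,preserve images,5,5
migrate-to-pdf,open format,5,2
EOF
# Sum weight * score per strategy; the highest total wins.
awk -F, 'NR>1 { total[$1] += $3 * $4 } END { for (s in total) printf "%s: %d\n", s, total[s] }' results.csv
```

With these made-up numbers the ODT migration scores 60 and the PDF migration 55, so the report would favour ODT for this class of material.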

The user must also supply a sample of material which can be used to assess the effectiveness of different strategies. This should be the bare minimum of objects required to represent the range of factors expressed in the objective tree. The user then selects different strategies to apply to the sample material, sets the experiment in motion, and compares the results against the objective tree. The process of evaluating results is manual at present, but there are plans to begin automating aspects of this too. Once the evaluation is complete, Plato can produce a report of the experiment which should demonstrate why one preservation strategy was chosen over another in respect of a particular class of material.

Plato is available for offline use, which will be necessary for us when processing embargoed material, but it is also offered as an online service where users can perform experiments in one place and benefit from working with the results of experiments performed by others.

The Planets work on characterisation was introduced by Manfred Thaller. This work develops two formal characterisation languages – the extensible characterisation extraction language (XCEL) and the extensible characterisation description language (XCDL). The work should make it possible to determine more automatically whether a preservation action, such as migration, has preserved an object’s essential characteristics (or significant properties). It is expected that the Microsoft family of formats, PDF formats and common image formats will be treated before the end of the project.

One of the interesting aspects of the characterisation work is developing an understanding of what is preserved or not in a particular process and how a file format impacts on this. Thaller demonstrated this (using a little tool for *shooting* files) by deliberately causing a small amount of damage to a png file and a tif file. A small amount of damage to the png file had severe consequences for its rendering, while the tif file could be damaged much more extensively and still retain some of its informational value. Thaller also used the example of migrating a MS Word 2003 document to the Open Document Text format. The migration to ODT seemed to lose a footnote in the document. Thaller then showed the same MS Word 2003 document migrated to PDF, where the footnote appears to be retained. In actual fact the footnote isn’t lost in the migration to ODT, it’s just not rendered. On the other hand, the footnote is structurally lost in the PDF file, but visually present. Thaller is proposing a solution which allows structure and appearance to be preserved.

The final element of Planets on show was the testbed developed at HATII, demonstrated by Matthew Barr. The testbed looks very useful and, like Plato, will be available for use online and offline. There did seem to be some overlap in aims and functionality with Plato, but there are differences too. Its essential objectives seem similar – users should be able to perform experiments with selected data and tools, evaluate those experiments and draw conclusions to inform their preservation strategy; the testbed will also allow tools and services to be benchmarked. It struck me that the process of conducting an experiment was simpler than with Plato, since a granular expression of objectives is not necessary. It’s more quick and dirty, which may suit some scenarios better, but will the result be as good? Aspects I found particularly interesting were the development of corpora and the ability to add new services (tools are deployed and accessed using web services) for testing.

-Susan Thomas