Tag Archives: futureArch project

Digital Preservation: What I Wish I Knew Before I Started

Tuesday 24th January, 2012

Last week I attended a student conference, hosted by the Digital Preservation Coalition, on what digital preservation professionals wished they had known before they started. The event covered many of the challenges faced by those involved in digital preservation, and the skills required to deal with them.

The similarities between traditional archiving and digital preservation were highlighted at the beginning of the afternoon, when Sarah Higgins translated terms from the OAIS model into more traditional ‘archive speak’. Dave Thompson also emphasized this connection, arguing that digital data “is just a new kind of paper”, and that trained archivists already have 85-90% of the skills needed for digital preservation.

Digital preservation was shown to be a human rather than a technical challenge. Adrian Brown argued that much of the preservation process (the “boring stuff”) can be automated. Dave Thompson stated that many of the technical issues of digital preservation, such as migration, have been solved, and that the challenge we now face is to retain the context and significance of the data. The point made throughout the afternoon was that you don’t need to be a computer expert in order to carry out effective digital preservation.

The urgency of intervention was another key lesson for the afternoon. As William Kilbride put it, digital preservation won’t do itself, won’t go away, and we shouldn’t wait for perfection before we begin to act. Access to data in the future is not guaranteed without input now, and digital data is particularly intolerant of gaps in preservation. Andrew Fetherstone added to this argument, noting that doing something is (usually) better than doing nothing, and that even if you are not in a position to carry out the whole preservation process, it is better to follow the guidelines as far as you can, rather than wait and create a backlog.

The scale of digital preservation was another point illustrated throughout the afternoon. William Kilbride suggested that the days of manual processing are over, due to the sheer amount of digital data being created (estimated to reach 35ZB by 2020!). He argued that the ability to process this data is more important to the future of digital preservation than the risks of obsolescence. The impossibility of preserving all of this data was illustrated by Helen Hockx-Yu, who offered the statistic that the UK Web Archive and National Archives Web Archive combined have archived less than 1% of UK websites. Adrian Brown also pointed out that as we move towards dynamic, individualised content on the web, we must decide exactly what the information is that we are trying to preserve. During the Q&A session, it was argued that the scale of digital data means that we have to accept that we can’t preserve everything, that not everything needs to be preserved, and that there will be data loss.

The importance of collaboration was another theme which was repeated by many speakers. Collaboration between institutions on a local, national and even international level was encouraged, as by sharing solutions to problems and implementing common standards we can make the task of digital preservation easier.

This is only a selection of the points covered in a very engaging afternoon of discussion. Overall, the event showed that, despite the scale of the task, digital preservation needn’t be a frightening prospect, as archivists already have many of the necessary skills.

The DPC have uploaded the slides used during the event, and the event was also live-tweeted, using the hashtag #dpc_wiwik, if you are interested in finding out more.

-Rebecca Nielsen

What is ‘The Future of the Past of the Web’?

‘The Future of the Past of the Web’
Digital Preservation Coalition Workshop
British Library, 7 October 2011
Chrissie Webb and Liz McCarthy

In his keynote address to this event – organised by the Digital Preservation Coalition, the Joint Information Systems Committee and the British Library – Herbert Van de Sompel described the purpose of web archiving as combating the internet’s ‘perpetual now’. Stressing the importance to researchers of establishing the ‘temporal context’ of publications and information, he explained how the framework of his Memento Project uses a ‘timegate’ implemented via web plugins to show what a resource was like at a particular date in the past. There is a danger, however, that not enough is being archived to provide the temporal context; for instance, although DOIs provide stable documents, the resources they link to may disappear (‘link rot’).

The Memento Project Firefox plugin uses a sliding timeline (here, just below the Google search box) to let users choose an archived date
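As a rough illustration of the ‘timegate’ idea, the sketch below builds (but does not send) the kind of request a Memento client makes: it asks for a resource as it was at a chosen date by setting the Accept-Datetime header. The TimeGate address and the convention of appending the original URL to it are assumptions made for the example, not details given in the talk.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request


def build_timegate_request(timegate_base: str, target_url: str, when: datetime) -> Request:
    """Build (but do not send) a datetime-negotiation request for target_url."""
    if when.tzinfo is None:
        raise ValueError("when must be timezone-aware")
    # Accept-Datetime carries an HTTP date expressed in GMT.
    http_date = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    # Assumption: this TimeGate accepts the original URL appended to its base address.
    return Request(timegate_base + target_url, headers={"Accept-Datetime": http_date})


if __name__ == "__main__":
    request = build_timegate_request(
        "http://timegate.example.org/timegate/",  # hypothetical TimeGate address
        "http://example.com/",
        datetime(2011, 10, 7, 12, 0, tzinfo=timezone.utc),
    )
    print(request.full_url)
    print(request.header_items())
```

A real TimeGate would normally answer with a redirect to the archived copy closest to the requested date; sending the request and following that redirect is left out here.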

A session on using web archives picked up on the theme of web continuity in a presentation by The National Archives on the UK Government Web Archive, where a redirection solution using open source software helps tackle the problems that occur when content is moved or removed and broken links result. Current projects are looking at secure web archiving, capturing internal (e.g. intranet) sources, social media capture and a semantic search tool that helps to tag ‘unstructured’ material. In a presentation that reinforced the reason for the day’s ‘use and impact’ theme, Eric Meyer of the Oxford Internet Institute wondered whether web archives were in danger of becoming the ‘dusty archives’ of the future, contrasting their lack of use with the mass digitisation of older records to make them accessible. Is this due to a lack of engagement with researchers, their lack of confidence with the material or the lingering feeling that a URL is not a ‘real’ source? Archivists need to interrupt the momentum of ‘learned’ academic behaviour, engaging researchers with new online material and developing archival resources in ways that are relevant to real research – for instance, by helping set up mechanisms for researchers to trigger archiving activity around events or interests, or making more use of server logs to help them understand use of content and web traffic.

One of the themes of the second session on emerging trends was the shift from a ‘page by page’ approach to the concept of ‘data mining’ and large scale data analysis. Some of the work being done in this area is key to addressing the concerns of Eric Meyer’s presentation; it has meant working with researchers to determine what kinds and sources of data they could really use in their work. Representatives of the UK Web Archive and the Internet Archive described their innovations in this field, including visualisation and interactive tools. Archiving social networks was also a major theme, and Wim Peters outlined the challenges of the ARCOMEM project, a collaboration between Sheffield and Hanover Universities that is tackling the problems of archiving ‘community memory’ through the social web, confronting extremely diverse and volatile content of varying quality for which future demand is uncertain. Richard Davis of the University of London Computer Centre spoke about the BlogForever project, a multi-partner initiative to preserve blogs, while Mark Williamson of Hanzo Archives spoke about web archiving from a commercial perspective, noting that companies are very interested in preserving the research opportunities online information offers.

The final panel session raised the issue of the changing face of the internet, as blogs replace personal websites and social media rather than discrete pages are used to create records of events. The notion of ‘web pages’ may eventually disappear, and web archivists must be prepared to manage the dispersed data that will take (and is taking) their place. Other points discussed included the need for advocacy and better articulation of the demand for web archiving (proposed campaign: ‘Preserve!: Are you saving your digital stuff?’), duplication and deduplication of content, the use of automated selection for archiving and the question of standards.

Day of Digital Archives, 2011

Today is officially ‘Day of Digital Archives’ 2011! Well, it’s been quite a busy week on the digital archives front here at the Bodleian…

The week began with the arrival of our new digital archives graduate trainee, Rebecca Nielsen. During her year here with us, the majority of Rebecca’s work will be on digital archives of one kind or another; she’ll be archiving all sorts, from materials arriving on old floppies to web sites on the live web.

Another of my colleagues, Matthew Neely, has been spending quite a bit of time this week working on the archive of Oxford don, John Barton. The archive includes over 150 floppies and a hard disk as well as hard-copy papers and photographs.

Barton’s digital material was captured in our processing lab back in the Spring of 2010, and now Matthew is busy using Forensic Toolkit software to appraise, arrange and describe the digital content alongside the papers. There are a few older word-processing formats in the collection, but all things that we can handle.

We’ve also been having conversations with quite a few archive depositors this week, about scoping collections and transfer mechanisms, among other things. There has been some planning work too, while we consider the requirements for processing the archive of Sir Walter Bodmer, which includes around 300 disks (3.5″ and 5.25″). For more on the Bodmer archive see the Library’s Special Collections blog, The Conveyor.

Today, I’ve spent a little time looking at our ‘Publication Pathway’ and thinking about where we need a few tweaks. This is the process and toolset that we are building to publish our digital archives to users (Pete called it CollectionBuilder, and you can have a look at a slightly out-of-date version of it here: http://sourceforge.net/projects/beamcollectionb/). We have a bit more work to do on this and our user interface, but there is quite a bit of material in the pipeline waiting to get out to our users.

To close out the week, two of our webarchiving pilot group are heading off to the DPC’s The Future of the Past of The Web event tomorrow, to learn more about the state of the art in webarchiving.

Lastly, I can’t resist returning to the start of the week. On Monday, we had a power cut and temporarily lost access to Bodleian Electronic Archives and Manuscripts (BEAM) services. An unsubtle reminder that digital archives require lots of things to remain accessible, power being one of them!

-Susan Thomas

Comparing software tools

While looking at software relating to digital video earlier today I came across a handy website called alternativeTo. It’s a useful means of comparing software applications and getting an idea of the tools that are out there to help perform a particular task. AlternativeTo gives a brief summary of each piece of software along with screenshots of the software in action. Another useful feature is that searches can be filtered by whether the tools are free or open source.

-Emma Hancox

Mobile forensics

Anyone who heard Brad Glisson’s talk at the DPC event on digital forensics may be interested in the paper ‘A comparison of forensic evidence recovery techniques for a windows mobile smart phone’ published in Digital Investigation Volume 8, Issue 1, July 2011, Pages 23-36. Interesting to see what was recovered, and how the different tools behave.

-Susan Thomas

Preserving born-digital video – what are good practices?


Interesting to see Killian Escobedo’s post on digital video preservation over at the Smithsonian Archives’ visual archives blog. Our trainee, Emma, is working on questions of this sort at the moment as we start to develop strategies for preserving the vast amount of born-digital video being deposited in our archive collections. While there’s quite a lot of material out there on digitising analogue video, we’ve found a real shortage of guidance on the management of born-digital video collections. With that in mind I’d be really interested in hearing how other folks are dealing with this kind of material. Can you give us any pointers? At the moment we’re particularly interested in learning more about existing practices, good tools, realistic workflows, and preservation-grade standards (for metadata and content – which ones and why?).

So, what kind of digital video do we have? It’s a good question, and one I can’t answer fully for the moment. What I can say is that our collections include digital video deposited on CDs, DVDs, Blu-ray discs, miniDV and mediumDV cassettes, and hard disks. Much of this material has yet to be captured from its original media so we don’t have that inventory of codecs, wrapper formats, frame rates, metadata, etc. that Killian talks about. This kind of detailed survey work is a next step for us, but one that will have to wait until we have developed a workflow for initial capture (bit-level preservation comes first). I wonder if we’ll see the same diversity of technical characteristics present in the Smithsonian’s materials. It seems likely.
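One common building block of bit-level preservation is a checksum manifest recorded at capture time, so that later copies can be checked against it. The sketch below is only an illustration of that idea; the function names are made up for the example, and nothing here describes the workflow or tools actually used in our lab.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file under root (by relative path) to its SHA-256 digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }
```

Running build_manifest again on a later copy and comparing the two dictionaries shows which files, if any, have changed or gone missing.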

-Susan Thomas

Hidden Pages

Yesterday I foolishly uploaded a Pages document to my work machine (that isn’t a Mac) before heading into the office. I needed the content because I was due to give it to Susan that morning. Luckily I stumbled upon this tip and thought I’d share it in case you ever find yourself faced with a Mac disk full of documents and no Mac to read them on…

I guess I shouldn’t have been surprised to discover that a Pages document is in fact a zip file (like Word docs). If you unpack it you find an XML representation of the document (which would let you get at the text – run it through tidy first though, as there aren’t any line breaks!) and, neater still, a PDF of the document in the QuickLook directory (file reports PDF 1.3).
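If you would rather script the unpacking, here is a minimal sketch in Python. It assumes the zip-based Pages format described above and simply copies out any PDF it finds under a QuickLook directory; the function name is made up for the example, and the exact file name inside QuickLook is not assumed.

```python
import zipfile
from pathlib import Path, PurePosixPath


def extract_quicklook_pdfs(pages_file, out_dir):
    """Copy any PDF found under a QuickLook directory of a zip-based Pages file."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    with zipfile.ZipFile(pages_file) as archive:
        for name in archive.namelist():
            member = PurePosixPath(name)
            in_quicklook = "QuickLook" in member.parts[:-1]
            if in_quicklook and member.suffix.lower() == ".pdf":
                # Write under the bare file name so archive paths cannot escape out_dir.
                target = out / member.name
                target.write_bytes(archive.read(name))
                written.append(target)
    return written
```

The function returns the list of files it wrote; an empty list means no preview PDF was found in the document.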

Day saved! Phew!

-Peter Cliff

Media recognition – Floppy Disks part 3

3 inch Disks (Mitsumi ‘Quick Disk’)

Type:
Magnetic storage media
Period of use:
1980s
Capacity:
?128KB – 256KB
Compatibility:
Requires a 3″ drive appropriate to the manufacturer’s specifications.
Users:
Likely to have been individual users and small organisations. Used for word-processing, music and gaming.
File Systems:
Unknown. May vary according to use. The disks were manufactured by Mitsumi and offered as OEM to resellers and used in a range of contexts including Nintendo (Famicom), various MIDI keyboards/samplers (Roland) and the Smith Corona Personal Word Processor (PWP).
Common Manufacturers:
Disks: Mitsumi appear to have made the magnetic disk (the innards), while other manufacturers made the cases. This resulted in different case shapes and labelling. For example, Smith Corona labelled the disks as DataDisk 2.8″.
Drives: Mitsumi?

The Smith Corona Personal Word Processor (PWP) variant of the disk is double sided with one side being labelled ‘A’ and the other ‘B’. Each side also had a dedicated write-protect hole, known as a ‘breakout lug’.
2.8″ Smith Corona ‘Quick Disk’
3.5″ floppy side-by-side with a 2.8″ Smith Corona ‘Quick Disk’
Nintendo Famicom disk
Some rights reserved by bochalla
High Level Formatting
Unknown. Possibly varied according to use.
3 Inch Disk Drives
Varied according to disk. The Smith Corona word processing disks are most likely to turn up in an archival collection. These were used in a Smith Corona PWP; possible model numbers include: 3, 5, 6, 6BL, 7, X15, X25, 40, 50LT, 55D, 60, 65D, 75D, 80, 85DLT, 100, 100C, 220, 230, 250, 270LT, 300, 350, 355, 960, 990, 2000, 2100, 3000, 3100, 5000, 5100, 7000LT, DeVille 3, DeVille 300, Mark X, Mark XXX, Mark XL LT.

Lego mockup of a Nintendo Famicom drive

Some rights reserved by kelvin255


 -Susan Thomas

Preserving Digital Sound and Vision: A Briefing 8th April 2011

Last Friday I went along to the DPC briefing Preserving Digital Sound and Vision. I was particularly interested in the event because of digital video files currently held on DVD media at the Bodleian.

After arriving at the British Library and collecting my very funky Save the Bits DPC badge I sat down to listen to a packed programme of speakers. The morning talks gave an overview of issues associated with preserving audio-visual resources. We began with Nicky Whitsed from the Open University who spoke about the nature of the problem of preserving audio-visual content; a particularly pertinent issue for the OU who have 40 years of audio-visual teaching resources to deal with. Richard Ranft then gave a fascinating insight into the history and management of the British Library Sound Archive. He played a speech from Nelson Mandela’s 1964 trial to emphasise the value of audio preservation. Next Stephen Gray from JISC Digital Media spoke about how students are using audio-visual content in their research. He mentioned the difficulties researchers find when citing videos, especially those on YouTube that may disappear at any time! To round off the morning John Zubrycki from BBC R and D spoke about Challenges and Solutions in Broadcast Archives. One of the many interesting facts that he mentioned was that subtitle files originally produced by the BBC for broadcast have been used as a tool for search and retrieval of video content.

After enjoying lunch and the beautiful sunny weather on the British Library terrace we moved onto the afternoon programme based on specific projects and tools. Richard Wright of the BBC spoke about the Presto Centre and the tools it has developed to help with audio-visual preservation. He also spoke about the useful digital preservation tools available online via Presto Space. Sue Allcock and James Alexander then discussed the Outcomes and Lessons learnt from the Access to Video Assets Project at the Open University which makes past video content from the Open University’s courses available to OU staff through a Fedora repository. Like the BBC, discovering subtitle files has allowed the OU to index their audio-visual collections. Finally, Simon Dixon from the Centre for Digital Music at Queen Mary University of London spoke about emerging tools for digital sound.

A final wide-ranging discussion about collaboration and next steps followed, which covered storage as well as ideas for a future event addressing the contexts of audio-visual resources. I left the event with my mind full of new information and lots of pointers for places to look to help me consider the next steps for our digital video collections… watch this space.

-Emma Hancox