What is ‘The Future of the Past of the Web’?

‘The Future of the Past of the Web’
Digital Preservation Coalition Workshop
British Library, 7 October 2011
Chrissie Webb and Liz McCarthy

In his keynote address to this event – organised by the Digital Preservation Coalition, the Joint Information Systems Committee and the British Library – Herbert Van de Sompel described the purpose of web archiving as combating the internet’s ‘perpetual now’. Stressing the importance to researchers of establishing the ‘temporal context’ of publications and information, he explained how the framework of his Memento Project uses a ‘timegate’ implemented via web plugins to show what a resource was like at a particular date in the past. There is a danger, however, that not enough is being archived to provide the temporal context; for instance, although DOIs give documents stable identifiers, the resources those documents link to may disappear (‘link rot’).

The Memento Project Firefox plugin uses a sliding timeline (here, just below the Google search box) to let users choose an archived date
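
For readers curious about how a client asks a ‘timegate’ for a past version, here is a minimal sketch. It assumes the `Accept-Datetime` request header that the Memento framework uses for datetime negotiation (later standardised in RFC 7089). The timegate address and the helper name `build_timegate_request` are hypothetical, chosen only for illustration; the plugin shown at the workshop may well work differently. The sketch only builds the request and does not contact any server.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# Hypothetical TimeGate endpoint, used purely for illustration.
TIMEGATE_BASE = "http://timegate.example.org/timegate/"


def build_timegate_request(original_url, when):
    """Return (timegate_url, headers) asking for original_url as it was at `when`."""
    # Memento expresses the desired date as an HTTP-date in GMT.
    when_utc = when.astimezone(timezone.utc)
    headers = {"Accept-Datetime": format_datetime(when_utc, usegmt=True)}
    # Assumption: this TimeGate takes the original URL appended to its base.
    return TIMEGATE_BASE + original_url, headers


url, headers = build_timegate_request(
    "http://www.bl.uk/",
    datetime(2011, 10, 7, 12, 0, tzinfo=timezone.utc),
)
print(url)
print(headers["Accept-Datetime"])
```

A timegate that supports this negotiation would typically respond by pointing the client to the archived copy closest to the requested date.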

A session on using web archives picked up on the theme of web continuity in a presentation by The National Archives on the UK Government Web Archive, where a redirection solution using open source software helps tackle the problems that occur when content is moved or removed and broken links result. Current projects are looking at secure web archiving, capturing internal (e.g. intranet) sources, social media capture and a semantic search tool that helps to tag ‘unstructured’ material. In a presentation that reinforced the reason for the day’s ‘use and impact’ theme, Eric Meyer of the Oxford Internet Institute wondered whether web archives were in danger of becoming the ‘dusty archives’ of the future, contrasting their lack of use with the mass digitisation of older records to make them accessible. Is this due to a lack of engagement with researchers, their lack of confidence with the material or the lingering feeling that a URL is not a ‘real’ source? Archivists need to interrupt the momentum of ‘learned’ academic behaviour, engaging researchers with new online material and developing archival resources in ways that are relevant to real research – for instance, by helping set up mechanisms for researchers to trigger archiving activity around events or interests, or making more use of server logs to help them understand use of content and web traffic.

One of the themes of the second session on emerging trends was the shift from a ‘page by page’ approach to the concept of ‘data mining’ and large-scale data analysis. Some of the work being done in this area is key to addressing the concerns of Eric Meyer’s presentation; it has meant working with researchers to determine what kinds and sources of data they could really use in their work. Representatives of the UK Web Archive and the Internet Archive described their innovations in this field, including visualisation and interactive tools. Archiving social networks was also a major theme, and Wim Peters outlined the challenges of the ARCOMEM project, a collaboration between Sheffield and Hanover Universities that is tackling the problems of archiving ‘community memory’ through the social web, confronting extremely diverse and volatile content of varying quality for which future demand is uncertain. Richard Davis of the University of London Computer Centre spoke about the BlogForever project, a multi-partner initiative to preserve blogs, while Mark Williamson of Hanzo Archives spoke about web archiving from a commercial perspective, noting that companies are very interested in preserving the research opportunities online information offers.

The final panel session raised the issue of the changing face of the internet, as blogs replace personal websites and social media rather than discrete pages are used to create records of events. The notion of ‘web pages’ may eventually disappear, and web archivists must be prepared to manage the dispersed data that will take (and is taking) their place. Other points discussed included the need for advocacy and better articulation of the demand for web archiving (proposed campaign: ‘Preserve!: Are you saving your digital stuff?’), duplication and deduplication of content, the use of automated selection for archiving and the question of standards.

Advisory Board Meeting, 18 March 2011

Our second advisory board meeting took place on Friday. Thanks to everyone who came along and contributed to what was a very useful meeting. For those of you who weren’t there, here is a summary of the meeting and our discussions.

The first half of the afternoon took the form of an overview of different aspects of the project.

Overview of futureArch’s progress

Susan Thomas gave us a reminder of the project’s aims and objectives and the progress being made to meet them. After an overview of the percentage of digital accessions coming into the library since 2004 and the remaining storage space we currently have, we discussed the challenge of predicting the size of future digital accessions and collecting digital material. We also discussed what we think researcher demand for born-digital material is now and will be in the future.

Born Digital Archives Case Studies

Bill Stingone presented a useful case study about what the New York Public Library has learnt from the process of making born-digital materials from the Straphangers Campaign records available to researchers.

After this, Dave Thompson spoke about some of the technicalities of making all content in the Wellcome Library (born-digital, analogue and digitised) available through the Wellcome Digital Library. Since the project is so wide-reaching, a number of questions followed about the practicalities involved.

Web archiving update

Next we returned to the futureArch project and Susan gave an overview of the scoping, research and decisions that have been made regarding the web archiving pilot since the last meeting. I then gave an insight into how the process of web archiving will be managed using a tracking database. Some very helpful discussions followed about the practicalities of obtaining permission for archiving websites and the legal risks involved.

After breaking for a well-earned coffee, we reconvened to look at systems.

Systems for Curators

Susan explained how the current data capture process works for digital collections at the Bodleian, including an overview of the required metadata, which we enter manually at the moment. Renhart moved on to talk about our intention to use a web-based capture workbench in the future and gave us a demo of the RAP workbench. Susan also showed us how FTK is used for appraisal, arrangement and description of collections and the directions we would like to take in the future.

Researcher Interface

To conclude the systems part of the afternoon, Pete spoke about how the BEAM researcher interface has developed since the last advisory board meeting, the experience of the first stage of testing the interface and the feedback gained so far. He then encouraged everyone to get up and have a go at using the interface for themselves and to comment on it.

Training the next generation of archivists?

With the end of the meeting fast approaching, Caroline Brown from the University of Dundee gave our final talk. She addressed the extent to which different archives courses in the UK cover digital curation and the challenges faced by course providers aiming to include this kind of content in their modules.

With the final talk over, we moved on to some concluding discussions around the various skills that digital archivists need. Those of us who were able to stay continued our discussions over dinner.

-Emma Hancox

Advisory board meeting, 24 Sept. 2009

Thanks to everyone who came along and contributed to the project’s first advisory board meeting last Thursday.

Introductions
We started with some introductory discussions around the Library’s hybrid collections and the futureArch project’s aims and activities. This discussion was wide ranging, touching on a number of subjects including the potential content sources for ‘digital manuscripts’: from mobile phones, to digital media, to cloud materials.

Systems
In the past year, we’ve made progress on developing, and beginning to implement, the technical architecture for BEAM (Bodleian Electronic Archives & Manuscripts). Pete Cliff (futureArch Software Engineer) kicked off our session on ‘systems’ with an overview of the architecture, drawing on some particular highlights; it’s worth a look at his slides if you’re interested in finding out more.

Next, a demo of two ingest tools:
1. Renhart Gittens demonstrated the BEAM Ingester, our means of committing accessions (under a collection umbrella) to BEAM’s preservation storage.

2. Dave Thompson (Wellcome Library Digital Curator) demonstrated the XIP creator. This tool does a similar job to the BEAM Ingester and forms part of the Tessella digital preservation system being implemented at the Wellcome Library.

Keeping with technical architecture, Neil Jefferies (OULS R&D Project Officer) introduced Oxford University Library Service’s Digital Asset Management System (or DAMS, as we’ve taken to calling it). This is the resilient preservation store upon which BEAM, and other digital repositories, will sit.

How will researchers use hybrid archives?
Next we turned our attention to the needs of the researchers who will use the Library’s hybrid archives. Matt Kirschenbaum (Assoc. Prof. of English & Assoc. Director of MITH at the University of Maryland) got us off to a great start with an overview of his work as a researcher working with born-digital materials. Matt’s talk emphasised digital archives as ‘material culture’, an aspect of digital manuscripts that can be overlooked when the focus becomes overly content-driven. Some researchers want to explore the writer’s writing environment; this includes seeing the writer’s desktop, and looking at their MP3 playlist, as much as examining the word-processed files generated on a given computer. Look out for the paper Matt has co-authored for iPRES this year.

Next we broke into groups to critique the ‘interim interface’ which will serve as a temporary access mechanism for digital archives while a more sophisticated interface is developed for BEAM. Feedback from the advisory board critique session was helpful and we’ve come away with a to-do list of bug fixes and enhancements for the interim interface as well as ideas for developing BEAM’s researcher interfaces. We expect to take work on researcher requirements further next year (2010) through workshops with researchers.

Finally, we heard from Helen Hockx-Yu (British Library’s Web Archiving Programme Manager) on the state of the art in web archiving. Helen kindly agreed to give us an overview of web archiving processes and the range of web archiving solutions available. Her talk covered all the options, from implementing existing tool suites in-house to outsourcing some or all of the activity. This was enormously useful and should inform conversations about the desired scope of web archiving activity at the Bodleian and the most appropriate means by which this could be supported.

Some of us continued the conversation into a sunny autumn evening on the terrace of the Fellows’ Garden of Exeter College, and then over dinner.

-Susan Thomas

GeoCities being rescued by Archive Team

Yahoo’s decision to pull the plug on GeoCities sites shows how ephemeral web content can be. Individual users should have a means of retrieving their own GeoCities material first, provided they hear about the closure in time to act. In a more comprehensive approach, the Archive Team are busy downloading all the GeoCities websites they can find, though what happens after that is unclear. The Archive Team have a page with more information, and there’s a nice article in The Register, complete with GeoCities styling.

-Susan Thomas