Tag Archives: Web Archives

A web of meaningful links. Archived websites in and as special collections

As some of you may know, since 2011 the Bodleian has been archiving websites, which are collected in the Bodleian Libraries Web Archive (BLWA) and made publicly accessible through the Archive-It platform. BLWA is thematically organised into seven collections: Arts and Humanities; Social Sciences; Science, Technology and Medicine; International; Oxford University Colleges; Oxford Student Societies; and Oxford GLAM. As their names suggest, much of the online content we collect relates to Oxford University and seeks to provide a snapshot of its intellectual, cultural and academic life, as well as to document the University’s main administrative functions.

From the very beginning, the BLWA collection has also been regarded as a complement to and reflection of the Bodleian’s analogue special collections that users can consult in the reading rooms. For example, there are multiple meaningful links between our BLWA Arts & Humanities collection and the Bodleian’s Modern Archives & Manuscripts. By teasing out the connections between them, I hope to offer some concrete examples of how archived websites can be valuable to historical and cultural research and explore some of the reasons why the BLWA can be seen as integral to the Bodleian Special Collections.

Collecting author appreciation society websites…

In BLWA, you can find websites of societies dedicated to the study of famous authors whose papers are kept at the Bodleian (partly or in full), such as T.S. Eliot, J.R.R. Tolkien and Evelyn Waugh. An example from this category is The Philip Larkin Society website, which complements the holdings of correspondence to and from the poet and librarian Philip Larkin (1922-1985) held at the Bodleian.

The website provides helpful information to anyone with a general or academic interest in Larkin, as it lists talks and events about the poet as well as relevant publications and online resources promoted by the Society.

A 2018 capture in BLWA of a webpage from the Larkin Society website, describing a public art project celebrating Larkin’s famous poem ‘Toads’

The value of the archived version of The Philip Larkin Society website may not be immediately apparent now, while the live site is still active. However, decades from now, this website may well become a primary source offering a window onto how early 21st-century society engaged with English poetry and disseminated research about it through media and formats distinctive of our time, such as online reviews, podcasts and blog posts.

…and social media accounts

Alongside websites, BLWA has been actively collecting Twitter accounts pertaining to authors and artists, such as The Barbara Pym Society Twitter presence.

A 2019 capture in BLWA of the Barbara Pym Society Twitter account

The Twitter feed preserves the memory of ephemeral but meaningful encounters and forms of engagement with the works of English novelist Barbara Pym (1913-1980). The experience of consulting the Archive of English Novelist Barbara Pym in the Weston Library reading rooms is enriched by the possibility of reading through the posts on the Pym Twitter account. From talks about Pym’s work to quotes in newspaper articles mentioning the author, the Twitter feed is not only a collection of news and information about Barbara Pym’s work, but also a representation of the lively network of individuals engaging with her writings, both in academic and broader circles.

Online presence of contemporary artists

Building an online presence through social media and a personal website is a promotional strategy that many contemporary artists and authors have adopted. A good example of this is the website of the British photographer and documentarist Daniel Meadows (b. 1952). In 2019, BLWA started taking regular captures of Meadows’ website, Photobus, following the acquisition of Meadows’ Archive a year earlier. This hybrid archive (which includes both analogue and born-digital items) has since been catalogued and its finding aid is available here.

The captures taken of Meadows’ Photobus site provide us with contextual information on the photographic series described in the finding aid of Meadows’ Archive at the Bodleian. Through the website, we get an account of Meadows’ life in his own words, we learn about the exhibitions where Meadows’ photographs were displayed and find out about the books in which his work has been published.

If you were to search for Daniel Meadows’ website on the live web right now, you would find that the website is still active, but looks rather different in content and layout from the captures archived in the BLWA between 2019 and March 2023.

Comparison of the ‘About’ page on Daniel Meadows’ website: the BLWA capture from January 2023 (top), and the capture from May 2023 (bottom)

Furthermore, the URL has changed from Photobus to the name of the photographer himself. Were it not for the version of the website archived in BLWA, the old content and structure of the site would not be as easily accessible. The website has also changed in scope, as it now provides us with a comprehensive digital repository of Meadows’ photographic series.

Comparing Meadows’ website in BLWA with his archive at the Bodleian, we can see an interesting series of correspondences between the digital and analogue realms, and between digital and physical archives. For example, the archived version of Meadows’ website Photobus is included as a link in the section of the finding aid for the Meadows archive devoted to ‘related materials’. In turn, the updated, 2023 version of Meadows’ site reflects in some respects the organisation and structure of an archive: his oeuvre is tidily arranged into series, each accompanied by a description and digital images of the photographs to match their arrangement in the physical archive at the Bodleian. Daniel Meadows’ new website exemplifies how, through the combination of metadata and high-resolution images, websites can become a powerful interface through which an archive is discovered and its contents accessed, in ways that complement and enhance the experience of working through an archival box in a reading room.

Archived websites as a link to tomorrow’s archives

Web archives are a relatively recent phenomenon, so the uses of a collection of archived websites like the BLWA are only gradually beginning to emerge. The historical, cultural and evidential value of web archives is still overlooked, or perhaps just not yet fully exploited. It is only a matter of time before social media and websites like those kept in BLWA come to be seen as an increasingly important resource on the cultural significance of 20th- and 21st-century authors and artists and the reception of their work. After all, for today’s authors and artists, social media and websites are an important vehicle for the dissemination of news about their work, their opinions and their creativity. As such, their online presence may be different in form, but similar in purpose and significance to the letters, pamphlets, alba amicorum and diaries that one would consult to research the social interactions, ideas and activities of a humanist scholar.

One of the exciting aspects of working with digital archives is the proactive nature of our collecting practice. Curators of digital collections need to identify, select and collect relevant content before it disappears or decays – threats to which websites and social media are especially vulnerable. Through the choices we make today about which content to archive, we are ultimately shaping the digital archives that will be accessible decades from now.
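In practice, one small part of that triage is checking whether an at-risk page has already been captured somewhere. The sketch below queries the Internet Archive’s public Wayback Machine availability API (a real endpoint, though BLWA itself collects via Archive-It; the helper names are ours and purely illustrative):

```python
import json
import urllib.parse
import urllib.request

AVAILABILITY_API = "https://archive.org/wayback/available"

def availability_query(url, timestamp=None):
    """Build a query URL for the Wayback Machine availability API,
    which reports the closest archived snapshot of a page."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp  # YYYYMMDDhhmmss; nearest capture wins
    return AVAILABILITY_API + "?" + urllib.parse.urlencode(params)

def closest_snapshot(url, timestamp=None):
    """Return the URL of the closest archived snapshot, or None if the
    page has never been captured. (Makes a live network request.)"""
    with urllib.request.urlopen(availability_query(url, timestamp)) as resp:
        data = json.load(resp)
    snap = data.get("archived_snapshots", {}).get("closest")
    return snap["url"] if snap and snap.get("available") else None
```

A curator might run `closest_snapshot` over a nomination list to see which sites have no archived copy anywhere and therefore need capturing most urgently.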

We are happy to consider suggestions from our users about websites that could be suitable additions to the collection. If you are curious to explore the BLWA collection further, you can find it here. The online nomination form can be found at this link. So don’t just follow the links – help us save them!

The resilience of digital heritage. A session in focus from the iPRES 2022 Conference

iPRES, the annual International Conference on Digital Preservation, took place in Glasgow 12th-16th September 2022, hosted by the Digital Preservation Coalition (DPC). In this blog post, Alice Zamboni reports on some of the highlights of the conference, held in person after a two-year hiatus.

The title chosen for the 2022 iPRES Conference, “Let Digits Flourish. Data for all, for good, for ever” is also an exhortation that perfectly captures the ambitions of the Digital Preservation community and the spirit of its annual gathering at iPRES. Its rich conference programme combined traditional panels with lightning talks, workshops and interactive sessions. The subdivision of the programme into the five thematic strands of Resilience, Innovation, Environment, Exchange and Community was an effective way to foster interdisciplinary conversations among experts who are busy tackling similar issues from different angles and work towards the same goal of ensuring the preservation of digital heritage worldwide.

Thanks to the generous support of the DPC career development fund, I was lucky enough to be able to attend iPRES in person. As I am only a few months into my role as graduate trainee digital archivist at the Bodleian, this was my first professional conference. For me, attending iPRES was the perfect opportunity to get acquainted with current trends and developments in the field of digital preservation and learn more about the important work undertaken in Archives and Libraries across Europe and further afield.

Souvenirs from iPRES: a tote bag and a tartan scarf in the DPC colour scheme

One session that skilfully interwove many of the ideas running through the conference was held on Thursday 15th as part of the Resilience strand. The session brought together archivists, researchers and experts from various industries, allowing for a multifaceted exploration of the challenges involved in preserving complex digital resources connected to academia and the art world. The session touched upon a number of issues, from the threat that obsolete software poses to Internet art, to the importance of digital preservation strategies for academic research projects with a digital output, and the application of web archiving to academic referencing.

The first two presentations highlighted the value of web archiving as a way to ensure the preservation of online resources used in academic research. Sara Day Thomson and Anisa Hawes’s talk focused on the website created as part of the Carmichael Watson Research Project, based at the University of Edinburgh. The website hosts an important online database of primary written resources and artefacts relating to Gaelic culture. Following the end of the research project, the website was taken down owing to security issues caused by its infrastructure. Day Thomson and Hawes were involved in the complex task of archiving this very large online database using Webrecorder.

Without the web archivists’ intervention, the Carmichael Watson Project website would have simply vanished. The presentation made a case for the development of digital preservation strategies, which should be viewed as a priority by academic institutions whose research output includes important digital archives and databases. Equally, this case study sparks questions about whether web archiving is the sole and most viable solution for the preservation of digital archives and databases. Does the website – its structure and the way in which it displays the database – matter and is therefore worth preserving for its cultural and evidential value, or could the research output be separated from the website and preserved through other means?

Martin Klein’s (Los Alamos National Laboratory) paper on reference rot presented another issue posed by the ubiquity of the internet in academic writing and publishing. As the number of scholarly resources available solely in electronic formats grows, so too does the number of bibliographic citations that include a URL. Yet these links are easily broken. Many of us will have experienced the disappointment of clicking on a hyperlink only to find that the resource is no longer available at that address – a phenomenon known as link rot. Its subtler relative, content drift, occurs when a URL still resolves but the content it returns has changed from what the author originally cited. Luckily, Klein’s project has devised an automated tool for creating what he described as ‘robustified links’: an archived version of a URL is created, along with a unique resource identifier that records the date and time at which the robust link was made.
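The ‘robustified links’ Klein described build on the Robust Links convention from the Memento project, in which an ordinary HTML anchor is decorated with data-* attributes pointing at an archived snapshot and its capture date, so a reader can fall back on the archived copy if the live link rots. A minimal sketch (the attribute names follow the published Robust Links convention; the helper function and example URLs are illustrative):

```python
def robust_link(original_url, snapshot_url, version_date, text):
    """Render an HTML anchor in the Robust Links style: href keeps the
    original URL, data-versionurl points at an archived snapshot, and
    data-versiondate records when that snapshot was taken."""
    return (
        f'<a href="{original_url}" '
        f'data-versionurl="{snapshot_url}" '
        f'data-versiondate="{version_date}">{text}</a>'
    )

# Illustrative use with a generic Wayback Machine snapshot URL:
link = robust_link(
    "https://example.org/article",
    "https://web.archive.org/web/20230101000000/https://example.org/article",
    "2023-01-01",
    "an article",
)
```

Because the anchor still renders as a normal link in any browser, the extra attributes cost nothing for ordinary readers while giving citation tools a durable fallback.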

Both presentations offered me a new perspective on the work that I do at the Bodleian, where I help manage the Bodleian Libraries Web Archive. I often wonder who the current users of our web archive may be and what value this collection of websites may acquire decades from now. The two talks made me appreciate the growing recognition of web archiving as a form of preservation of digital heritage, as well as the value that these archived resources have for different stakeholders.

The second half of the session turned from academia to the art world, with papers by Natasa Milic-Frayling (IntactDigital Ltd) and Dragan Espenschied (Rhizome). The two papers explored some of the challenges faced in the preservation of Internet art. Both talks were interesting for the historical perspective they offered on recent developments in the art world such as NFT artworks, which may eventually find their way into a contemporary artist’s archive. As Milic-Frayling pointed out, the internet opened up a world of possibilities for emerging artists in the 1990s. Thanks to the web, artists could reach new audiences online without the mediation of art galleries and exhibitions. Yet the dissemination of artworks in the online environment has exposed them to the insidious threat of software obsolescence.

Espenschied showed the valuable work that Rhizome’s platform ArtBase has done to counter this issue. Active since 1999, this archive of Internet art employs various pieces of software to handle the obscure data formats used by artists in the 1990s, and allows users to render the artefacts by choosing from different options, such as browser emulation or a web-archived version of the artwork.

Milic-Frayling talked about her recent collaboration with artist Michael Takeo Magruder. Some of his Internet art pieces were created using Flash and VRML (Virtual Reality Modelling Language), both of which are no longer supported by today’s browsers. At first, it may be difficult to comprehend how a piece of software can negatively affect a work of art. Conservation issues affecting analogue archival material – from the threat of humidity and bookworms for a rare printed book to the excessive exposure to light for a delicate drawing – are tangible and visible. Yet software obsolescence should be taken just as seriously for the way in which it affects the born-digital counterparts to works on paper. In Magruder’s net art piece World[s], the combination of Flash and VRML contributes to the creation of mesmerizingly intricate three-dimensional virtual shapes floating through a dark space. If the software can no longer be correctly interpreted, the integrity and quality of the artwork are endangered and potentially lost forever. Milic-Frayling worked to ensure the preservation of these net art pieces, guided in her approach by the artist’s requirements around access to and display of his artworks.

Together, the four talks demonstrated that born-digital resources are fragile and especially vulnerable to obsolescence. Yet the picture they painted was far from bleak. The speakers also made a case for the resilience of digital heritage, which owes much to the work that digital preservation specialists do to ensure that complex born-digital objects adapt to constant technological advancement and continue to be accessible to future generations.

Some useful links:

Digital Preservation Coalition – https://www.dpconline.org/

Webarchiving with Webrecorder – https://webrecorder.net/tools#archivewebpage

Robustifying Links Project – https://robustlinks.mementoweb.org/

ArtBase archive – https://artbase.rhizome.org/wiki/Main_Page


Can web archives tell stories?

Archives tell stories. A series of induction sessions with archivists have brought me, a web archivist, to a new understanding of what archives are and what archivists do.

Archivists enable stories to be told — stories about people, organisations, society and much more. Archival materials bring them back to life. The very making of a collection — how its contents have been selected, preserved and made available to the public, and how some have not — constitutes a story in itself.

But can web archives tell stories? Web archives differ from conventional archives, where archival material comes into custody as a collection with a relatively clear boundary, within which archivists carry out appraisal, selection and cataloguing work. The boundaries for web archives, by comparison, have been both blurred and expanded.


Conference Report: IIPC Web Archiving Conference 2021

This year’s International Internet Preservation Consortium Web Archiving Conference was held online on 15th-16th June 2021, bringing together professionals from around the world to share their experiences of preserving the Web as a research tool for future generations. In this blog post, Simon Mackley reports back on some of the highlights from the conference.

How can we best preserve the World Wide Web for future researchers, and how can we best provide access to our collections? These were the questions that were at the forefront of this year’s International Internet Preservation Consortium Web Archiving Conference, which was hosted virtually by the National Library of Luxembourg. Web archiving is a subject of particular interest to me: as one of the Bodleian Library’s Graduate Trainee Digital Archivists, I spend a lot of my time working on our own Web collections as part of the Bodleian Libraries Web Archive. It was great therefore to have the chance to attend part of this virtual conference and hear for myself about new developments in the sector.

One thing that really struck me from the conference was the huge diversity in approaches to preserving the Web. On the one hand, many of the papers concerned large-scale efforts by national legal deposit institutions. For instance, Ivo Branco, Ricardo Basílio, and Daniel Gomes gave a very interesting presentation on the creation of the 2019 European Parliamentary Elections collection at the Portuguese Web Archive. This was a highly ambitious project, with the aim of crawling not just the Portuguese Web domain but also capturing a snapshot of elections coverage across 24 different European languages through the use of an automated search engine and a range of web crawler technologies (see their blog for more details). The World Wide Web is perhaps the ultimate example of an international information resource, so it is brilliant to see web archiving initiatives take a similarly international approach.

At the other end of the scale, Hélène Brousseau gave a fascinating paper on community-based web archiving at Artexte library and research centre, Canada. Within the arts community, websites often function as digital publications analogous to traditional exhibition catalogues. Brousseau emphasised the need for manual web archiving rather than automated crawling as a means of capturing the full content and functionality of these digital publications, and at Artexte this has been achieved by training website creators to self-archive their own websites using Conifer. Given that in many cases web archivists often have minimal or even no contact with website creators, it was fascinating to hear of an approach that places creators at the very heart of the process.

It was also really interesting to hear about the innovative new ways that web archives were engaging with researchers using their collections, particularly in the use of new ‘Labs’-style approaches. Marie Carlin and Dorothée Benhamou-Suesser for instance reported on the new services being planned for researchers at the Bibliothèque nationale de France Data Lab, including a crawl-on-demand service and the provision of web archive datasets. New methodologies are always being developed within the Digital Humanities, and so it is vitally important that web archives are able to meet the evolving needs of researchers.

Like all good conferences, the papers and discussions did not solely focus on the successes of the past year, but also explored the continued challenges of web archiving and how they can be addressed. Web archiving is often a resource-intensive activity, which can prove a significant challenge for collecting institutions. This was a major point of discussion in the panel session on web archiving the coronavirus pandemic, as institutions had to balance the urgency of quickly capturing web content during a fast-evolving crisis against the need to manage resources for the longer-term, as it became apparent that the pandemic would last months rather than weeks. It was clear from the speakers that no two institutions had approached documenting the pandemic in quite the same way, but nonetheless some very useful general lessons were drawn from the experiences, particularly about the need to clearly define collection scope and goals at the start of any collecting project dealing with rapidly changing events.

The question of access presents an even greater challenge. We ultimately work to preserve the Web so that researchers can make use of it, but as a sector we face significant barriers in delivering this goal. The larger legal deposit collections, for instance, can often only be consulted in the physical reading rooms of their collecting libraries. In his opening address to the conference, Claude D. Conter of the National Library of Luxembourg addressed this problem head-on, calling for copyright reform in order to meet reader expectations of access.

Yet although these challenges may be significant, I have no doubt from the range of new and innovative approaches showcased at this conference that the web archiving sector will be able to overcome them. I am delighted to have had the chance to attend the conference, and I cannot wait to see how some of the projects presented continue to develop in the years to come.

Simon Mackley

#WeMissiPRES: Preserving social media and boiling 1.04 x 10^16 kettles

This year the annual iPRES digital preservation conference was understandably postponed and in its place the community hosted a 3-day Zoom conference called #WeMissiPRES. As two of the Bodleian Libraries’ Graduate Trainee Digital Archivists, Simon and I were in attendance and blogged about our experiences. This post contains some of my highlights.

The conference kicked off with a keynote by Geert Lovink. Geert is the founding director of the Institute of Network Cultures and the author of several books on critical Internet studies. His talk was wide-ranging and covered topics from the rise of so-called ‘Zoom fatigue’ (I guarantee you know this feeling by now) to how social media platforms affect all aspects of contemporary life, often in negative ways. Geert highlighted the importance of preserving social media in order to allow future generations to be able to understand the present historical moment. However, this is a complicated area of digital preservation because archiving social media presents a host of ethical and technical challenges. For instance, how do we accurately capture the experience of using social media when the content displayed to you is largely dictated by an algorithm that is not made public for us to replicate?

After the keynote I attended a series of talks about the ARCHIVER project. João Fernandes from CERN explained that the goal of this project is to improve archiving and digital preservation services for scientific and research data. Preservation solutions for this type of data need to be cost-effective, scalable, and capable of ingesting data volumes in the petabyte range. There were several further talks from companies submitting to the design phase of this project, including Matthew Addis from Arkivum. Matthew’s talk focused on the ways that digital preservation can be conducted at the industrial scale required to meet the brief, and explained that Arkivum is collaborating with Google to achieve this, because Google’s cloud infrastructure can be leveraged for petabyte-scale storage. He also noted that while the marriage of preserved content with robust metadata is important in any digital preservation context, it is essential for repositories dealing with very complex scientific data.

In the afternoon I attended a range of talks that addressed new standards and technologies in digital preservation. Linas Cepinskas (Data Archiving and Networked Services (DANS)) spoke about a self-assessment tool for the FAIR principles, which is designed to assess whether data is Findable, Accessible, Interoperable and Reusable. Later, Barbara Sierman (DigitalPreservation.nl) and Ingrid Dillo (DANS) spoke about TRUST, a new set of guiding principles that are designed to map well with FAIR and assess the reliability of data repositories. Antonio Guillermo Martinez (LIBNOVA) gave a talk about his research into Artificial Intelligence and machine learning applied to digital preservation. Through case studies, he identified that AI is especially good at tasks such as anomaly detection and automatic metadata generation. However, he found that regardless of how well the AI performs, it needs to generate better explanations for its decisions, because it’s hard for human beings to build trust in automated decisions that we find opaque.

Paul Stokes (Jisc) gave a talk on calculating the carbon costs of digital curation and unfortunately concluded that not much research has been done in this area. The need to improve the environmental sustainability of all human activity could not be more pressing, and digital preservation is no exception, as approximately 3% of the world’s electricity is used by data centres. Paul also offered the statistic that data centres worldwide consume enough power to boil 10,400,000,000,000,000 kettles – which is the most important digital preservation metric I can think of.

This conference was challenging and eye-opening because it gave me an insight into (complicated!) areas of digital preservation that I was not familiar with, particularly surrounding the challenges of preserving large quantities of scientific and research data. I’m very grateful to the speakers for sharing their research and to the organisers, who did a fantastic job of bringing the community together to bridge the gap between 2019 and 2021!

#WeMissiPRES: A Bridge from 2019 to 2021

Every year, the international digital preservation community meets for the iPRES conference, an opportunity for practitioners to exchange knowledge and showcase the latest developments in the field. With the 2020 conference unable to take place due to the global pandemic, digital preservation professionals instead gathered online for #WeMissiPRES to ensure that the global community remained connected. Our graduate trainee digital archivist Simon Mackley attended the first day of the event; in this blog post he reflects on some of the highlights of the talks and what they tell us about the state of the field.

How do you keep the global digital preservation community connected when international conferences are not possible? This was the challenge faced by the organisers of #WeMissiPRES, a three-day online event hosted by the Digital Preservation Coalition. Conceived as a festival of digital preservation, the aim was not to try and replicate the regular iPRES conference in an online format, but instead to serve as a bridge for the digital preservation community, connecting the efforts of 2019 with the plans for 2021.

As might be expected, the impact of the pandemic loomed large in many of the talks. Caylin Smith (Cambridge University Library) and Sara Day Thomson (University of Edinburgh) for instance gave a fascinating paper on the challenge of rapidly collecting institutional responses to coronavirus, focusing on the development of new workflows and streamlined processes. The difficulties of working from home, the requirements of remote access to resources, and the need to move training online likewise proved to be recurrent themes throughout the day. As someone whose own experience of digital preservation has been heavily shaped by the pandemic (I began my traineeship at the start of lockdown!) it was really useful to hear how colleagues in other institutions have risen to these challenges.

I was also struck by the different ways in which responses to the crisis have strengthened digital preservation efforts. Lynn Bruce and Eve Wright (National Records of Scotland) noted for instance that the experience of the pandemic has led to increased appreciation of the value of web-archiving from stakeholders, as the need to capture rapidly-changing content has become more apparent. Similarly, Natalie Harrower (Digital Repository of Ireland) made the excellent point that the crisis had not only highlighted the urgent need for the sharing of medical research data, but also the need to preserve it: Coronavirus data may one day prove essential to fighting a future pandemic, and so there is therefore a moral imperative for us to ensure that it is preserved.

As our keynote speaker Geert Lovink (Institute of Network Cultures) reminded us, the events of the past year have been momentous quite apart from the pandemic, with issues such as the distorting impacts of social media on society, the climate emergency, and global demands for racial justice all having risen to the forefront of society. It was great therefore to see the role of digital preservation in these challenges being addressed in many of the panel sessions. A personal highlight for me was the presentation by Daniel Steinmeier (KB National Library of the Netherlands) on diversity and digital preservation. Steinmeier stressed that in order for diversity efforts to be successful, institutions needed to commit to continuing programmes of inclusion rather than one-off actions, with the communities concerned actively included in the archiving process.

So what challenges can we expect from the year ahead? Perhaps more than ever, this year this has been a difficult question to answer. Nonetheless, a key theme that struck me from many of the discussions was that the growing challenge of archiving social media platforms was matched only by the increasing need to preserve the content hosted on them. As Zefi Kavvadia (International Institute of Social History) noted, many social media platforms actively resist archiving; even when preservation is possible, curators are faced with a dilemma between capturing user experiences and capturing platform data. Navigating this challenge will surely be a major priority for the profession going forward.

While perhaps no substitute for meeting in person, #WeMissiPRES nonetheless succeeded in bringing the international digital preservation community together in a shared celebration of the progress being made in the field, successfully bridging the gap between 2019 and 2021, and laying the foundations for next year’s conference.

#WeMissiPRES was held online from 22nd-24th September 2020. For more information, and for recordings of the talks and panel sessions, see the event page on the DPC website.

Archiving web content related to the University of Oxford and the coronavirus pandemic

Since March 2020, the scope of collection development at the Bodleian Libraries’ Web Archive has expanded to focus also on the coronavirus pandemic: how the University of Oxford and the wider university community have reacted and responded to the rapidly changing global situation and government guidance. The Bodleian Libraries’ Web Archive team have endeavoured (and will keep working) to capture, quality-assess and make publicly available records from the web relating to Oxford and the coronavirus pandemic. Preserving these ephemeral records is important. Just a few months into what is sure to be a long road, what do these records show?

Firstly, records from the Bodleian Libraries’ Web Archive can demonstrate how university divisions and departments are continually adjusting in order to facilitate core activities of learning and research. This could be by moving planned events online or organising and hosting new events relevant to the current climate:

Capture of http://pcmlp.socleg.ox.ac.uk/ 24 May 2020 available through the Bodleian Libraries’ Web Archive. Wayback URL https://wayback.archive-it.org/2502/20200524133907/https://pcmlp.socleg.ox.ac.uk/global-media-policy-seminar-series-victor-pickard-on-media-policy-in-a-time-of-crisis/

Captures of websites also provide an insight into the numerous collaborations between Oxford University and both the UK government and other institutions at this unprecedented time; that is, the role Oxford is playing and how that role is changing and adapting. Much of this can be seen in the ever-evolving news pages of departmental websites, especially those within the Medical Sciences Division, such as the Nuffield Department of Population Health’s collaboration with UK Biobank for the Department of Health and Social Care, announced on 17 May 2020.

The web archive preserves records of how certain groups are contributing to Covid-19 research, front-line work and rapid evidence review – work that moves at an extremely fast pace, which the curators at the Bodleian Libraries’ Web Archive can attempt to capture by crawling more frequently. One example of this is the Centre for Evidence Based Medicine’s Oxford Covid-19 Evidence Service – a platform for rapid data analysis and reviews which is currently updated with several articles daily. Comparing two screenshots of different captures of the site, seven weeks apart, shows us the different themes of data being reviewed, and particularly how the ‘Most Viewed’ questions change (or indeed, don’t change) over time.

Capture of https://www.cebm.net/covid-19/ 14 April 2020 available through the Bodleian Libraries’ Web Archive. Wayback URL https://wayback.archive-it.org/org-467/20200414111731/https://www.cebm.net/covid-19/

Interestingly, the page location has changed slightly: the eagle-eyed among you may have spotted that the article reviews are now under /oxford-covid-19-evidence-service/, which is still within the web crawler’s scope.

Capture of https://www.cebm.net/covid-19/ 05 June 2020 available through the Bodleian Libraries’ Web Archive. Wayback URL https://wayback.archive-it.org/org-467/20200605100737/https://www.cebm.net/oxford-covid-19-evidence-service/

We welcome recommendations for sites to archive; if you would like to nominate a website for inclusion in the Bodleian Libraries’ Web Archive you can do so here. Meanwhile, the work to capture institutional, departmental and individual responses at this time continues.

Web Archiving & Preservation Working Group: Social Media & Complex Content

On January 16 2020, I had the pleasure of attending the first public meeting of the Digital Preservation Coalition’s Web Archiving and Preservation Working Group. The meeting was held in the beautiful New Records House in Edinburgh.

We were welcomed by Sara Day Thomson, who in her opening talk gave us a very clear overview of the issues and questions we increasingly run into when archiving complex/dynamic web or social media content. For example, how do we preserve apps like Pokémon Go that use a user’s location data or even personal information to individualize the experience? Or where do we draw the line in interactive social media conversations? After all, we cannot capture everything. But how do we even capture this information without infringing the rights of the original creators? These and other musings set the stage perfectly for the rest of the talks during the day.

Although I would love to include every talk held this day, as they were all very interesting, I will only highlight a couple of the presentations to give this blog some pretence at “brevity”.

The first talk I want to highlight was given by Giulia Rossi, Curator of Digital Publications at the British Library, on “Overview of Collecting Approach to Complex Publications”. Rossi introduced us to the Emerging Formats project, a two-year project by the British Library. The project focusses on three types of content:

  1. Web-based interactive narratives where the user’s interaction with a browser based environment determines how the narrative evolves;
  2. Books as mobile apps (a.k.a. literary apps);
  3. Structured data.

Personally, I found Rossi’s discussion of the collection methods particularly interesting. The team working on the Emerging Formats project does not just use web crawlers and other web harvesting tools, but also file transfers or direct downloads via access code and password. Most strikingly, in the event that only a partial capture can be made, they try to capture as much contextual information about the digital object as possible, including blog posts, screenshots or videos of walkthroughs, so researchers will have a good idea of what the original content would have looked like.

The capture of contextual content and the inclusion of additional contextual metadata about web content is currently not standard practice. Many tools do not even allow for their inclusion. However, considering that many of the web harvesting tools experience issues when attempting to capture dynamic and complex content, this could offer an interesting work-around for most web archives. It is definitely an option that I myself would like to explore going forward.

The second talk that I would like to zoom in on is “Collecting internet art” by Karin de Wild, digital fellow at the University of Leicester. Taking Agent Ruby – a chatbot created by Lynn Hershman Leeson – as her example, de Wild explored how we determine which aspects of internet art need to be preserved and what challenges this poses. In the case of Agent Ruby, the San Francisco Museum of Modern Art initially exhibited the chatbot in a software installation within the museum, thereby taking the artwork out of its original context. It then added the work to its online Expedition e-space, which has since been taken offline. Only a screenshot of the online artwork is currently accessible through the SFMOMA website, as the museum prioritizes the preservation of the interface over the chat functionality.

This decision raises questions about the right way to preserve online art. Does the interface indeed suffice, or should we attempt to maintain the integrity of the artwork by saving the code as well? And if we do that, should we employ code restitution, which aims to preserve the original artwork’s code, or a significant part of it, whilst adding restoration code to bring defunct code back to full functionality? Or do we emulate the software, as the University of Freiburg is currently exploring? And how do we keep track of the provenance of the artwork whilst taking into account the different iterations that digital artworks go through?

De Wild proposed turning to linked data as a way to keep track of, in particular, the provenance of an artwork. Together with two other colleagues she has been working on a project called Rhizome, in which they are creating a data model that will allow people to track the provenance of internet art.
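The Rhizome data model itself was not shown, but the basic idea of tracking provenance with linked data can be sketched with a few triples. All names and predicates below are my own invention for illustration, not the project’s actual vocabulary:

```python
# A minimal linked-data-style sketch: provenance as (subject, predicate,
# object) triples, with "derivedFrom" links connecting iterations of a work.
triples = [
    ("AgentRuby", "createdBy", "Lynn Hershman Leeson"),
    ("AgentRuby", "hasVersion", "AgentRuby_v2"),
    ("AgentRuby_v2", "exhibitedAt", "SFMOMA e-space"),
    ("AgentRuby_v2", "derivedFrom", "AgentRuby"),
]

def provenance_chain(entity, triples):
    """Follow derivedFrom links back to the earliest known iteration."""
    chain = [entity]
    while True:
        parents = [o for s, p, o in triples if s == chain[-1] and p == "derivedFrom"]
        if not parents:
            return chain
        chain.append(parents[0])

print(provenance_chain("AgentRuby_v2", triples))
# ['AgentRuby_v2', 'AgentRuby']
```

In a production data model these triples would of course live in an RDF store with a proper ontology (e.g. PROV-O), but the traversal logic is the same.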

Although this is not within the scope of the Rhizome project, it would be interesting to see how the finished data model would lend itself to keeping track of changes in the look and feel of regular websites as well. Even though the layouts of websites have changed radically over the past years, these changes are usually not documented in metadata or data models, despite being as much a reflection of social and cultural changes as the content of the website. Going forward, it will be interesting to see how changes in archiving online artworks will influence the preservation of online content in general.

The final presentation I would like to draw attention to is “Twitter Data for Social Science Research” by Luke Sloan, deputy director of the Social Data Science Lab at the University of Cardiff. He gave us a demo of COSMOS, an alternative to the Twitter API, which is freely available to academic institutions and not-for-profit organisations.

COSMOS allows you to either target a particular Twitter feed or enter a search term to obtain a 1% sample of the total worldwide Twitter feed. The gathered data can be analysed within the system and is stored in JSON format. The information can subsequently be exported to a .csv or Excel format.

Although the system is only able to capture new (or live) Twitter data, it is possible to upload historical Twitter data into the system if an archive has access to this.
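As a rough illustration of that JSON-to-CSV export step, here is a minimal Python sketch. The field names and sample records are hypothetical and will differ from COSMOS’s actual export schema:

```python
import csv
import io
import json

# Invented sample of tweet records as they might appear in a JSON export.
records_json = """
[
  {"id": "1001", "created_at": "2020-01-16", "text": "Archiving the web!",
   "user": {"location": "Edinburgh"}},
  {"id": "1002", "created_at": "2020-01-16", "text": "Digital preservation matters.",
   "user": {"location": "Oxford"}}
]
"""

def tweets_to_csv(records):
    """Flatten a list of tweet dicts into CSV rows with a header."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["id", "created_at", "location", "text"])
    for r in records:
        writer.writerow([r["id"], r["created_at"],
                         r.get("user", {}).get("location", ""), r["text"]])
    return out.getvalue()

print(tweets_to_csv(json.loads(records_json)))
```

Flattening the nested `user` object into a single column is the main work here; real exports would need to decide which of Twitter’s many nested fields to keep.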

Having explained how COSMOS works, Sloan asked us to consider the potential risks that archiving and sharing Twitter data could pose to the original creator. Should we not protect these creators by anonymizing their tweets to a certain extent? If so, what data should we keep? Do we only record the tweet ID and the location? Or would this already make it too easy to identify the creator?
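One possible, purely illustrative answer to the anonymization question is pseudonymization: replacing the author handle with a salted hash, so that tweets by the same author remain linkable for research without naming the author. This is my own sketch, not a technique proposed in the talk:

```python
import hashlib

# Kept private by the archive; without it, the hashes cannot be
# recomputed from known handles by an outside party.
SALT = "archive-local-secret"

def pseudonymize(tweet):
    """Replace the author handle with a short salted hash; keep the text."""
    digest = hashlib.sha256((SALT + tweet["user"]).encode()).hexdigest()[:12]
    return {"user": digest, "text": tweet["text"]}

tweet = {"user": "@example_author", "text": "Web archives matter."}
print(pseudonymize(tweet))
```

Note that hashing alone is not full anonymization: the tweet text itself may still identify the author, which is exactly the dilemma Sloan raised.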

The last part of Sloan’s presentation tied in really well with the discussion about ethical approaches to archiving social media. During this discussion we were prompted to consider ways in which archives could archive Twitter data whilst being conscious of the potential risks to the original creators of the tweets. This definitely got me thinking about the way we currently archive some of the Twitter accounts related to the Bodleian Libraries in our very own Bodleian Libraries Web Archive.

All in all, the DPC event definitely gave me more than enough food for thought about the ways in which the Bodleian Libraries and the wider community in general can improve the ways we capture (meta)data related to the online content that we archive and the ethical responsibilities that we have towards the creators of said content.

Oxford LibGuides: Web Archives

Web archives are becoming more and more prevalent and are being increasingly used for research purposes. They are fundamental to the preservation of our cultural heritage in the interconnected digital age. With the continuing collection development on the Bodleian Libraries Web Archive and the recent launch of the new UK Web Archive site, the web archiving team at the Bodleian have produced a new guide to web archives. The new Web Archives LibGuide includes useful information for anyone wanting to learn more about web archives.

It focuses on the following areas:

  • The Internet Archive, The UK Web Archive and the Bodleian Libraries Web Archive.
  • Other web archives.
  • Web archive use cases.
  • Web archive citation information.

Check out the new look for the Web Archives LibGuide.


Archives Unleashed – Vancouver Datathon

On the 1st-2nd of November 2018 I was lucky enough to attend the Archives Unleashed Datathon Vancouver, co-hosted by the Archives Unleashed Team and Simon Fraser University Library along with KEY (SFU Big Data Initiative). I am very grateful for the generous travel grant from the Andrew W. Mellon Foundation that made this possible.

The SFU campus at the Harbour Centre was an amazing venue for the Datathon, and it was nice to be able to take in some views of the surrounding mountains.

About the Archives Unleashed Project

The Archives Unleashed Project is a three-year project focused on making historical internet content easily accessible to scholars and researchers whose interests lie in exploring and researching both the recent past and contemporary history.

After a series of datathons held at a number of international institutions, such as the British Library, the University of Toronto, the Library of Congress and the Internet Archive, the Archives Unleashed Team identified some key areas of development that would help deliver their aim of making petabytes of valuable web content accessible.

Key Areas of Development
  • Better analytics tools
  • Community infrastructure
  • Accessible web archival interfaces

By engaging and building a community, alongside developing web archive search and data analysis tools the project is successfully enabling a wide range of people including scholars, programmers, archivists and librarians to “access, share and investigate recent history since the early days of the World Wide Web.”

The project has a three-pronged approach:
  1. Build a software toolkit (Archives Unleashed Toolkit)
  2. Deploy the toolkit in a cloud-based environment (Archives Unleashed Cloud)
  3. Build a cohesive user community that is sustainable and inclusive by bringing together the project team members with archivists, librarians and researchers (Datathons)
Archives Unleashed Toolkit

The Archives Unleashed Toolkit (AUT) is an open-source platform for analysing web archives with Apache Spark. I was really impressed by AUT due to its scalability, relative ease of use and the huge number of analytical options it provides. It can run on a laptop (macOS, Linux or Windows), a powerful cluster or a single-node server; if you wanted to, you could even use a Raspberry Pi to run AUT. The Toolkit allows for a number of search functions across the entirety of a web archive collection. You can filter collections by domain, URL pattern, date, language and more; generate lists of the top ten URLs in a collection; extract plain text from the HTML files in an ARC or WARC file; and clean the data by removing ‘boilerplate’ content such as advertisements. It’s also possible to use the Stanford Named Entity Recognizer (NER) to extract the names of entities, locations, organisations and persons. I’m looking forward to seeing how this functionality is adapted to localised instances and controlled vocabularies – would it be possible to run a similar programme for automated tagging of web archive collections in the future? Perhaps one could ingest a collection into AUT, run NER and automatically tag up the data, providing richer metadata for web archives and subsequent research.
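To give a flavour of the kind of collection-level analysis AUT enables, here is a toy Python sketch of counting captures per domain. AUT itself would do this with Spark over full ARC/WARC files, so treat this as an analogy for the analysis, not AUT’s actual API; the URL list is invented:

```python
from collections import Counter
from urllib.parse import urlparse

# Toy list standing in for the URLs of captured pages in a collection.
captured_urls = [
    "https://www.ox.ac.uk/news/one",
    "https://www.ox.ac.uk/news/two",
    "https://www.cebm.net/covid-19/",
    "https://www.bodleian.ox.ac.uk/weblog",
]

def top_domains(urls, n=10):
    """Count captures per domain and return the n most frequent."""
    return Counter(urlparse(u).netloc for u in urls).most_common(n)

print(top_domains(captured_urls))
# [('www.ox.ac.uk', 2), ('www.cebm.net', 1), ('www.bodleian.ox.ac.uk', 1)]
```

The same count-and-rank pattern underlies many of the derivative reports AUT generates, just applied at web-archive scale.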

Archives Unleashed Cloud

The Archives Unleashed Cloud (AUK) is a GUI-based front end for working with AUT; it essentially provides an accessible interface for generating research derivatives from web archive files (WARCs). With a few clicks, users can ingest and sync Archive-It collections, analyse the collections, create network graphs and visualise connections and nodes. It is currently free to use and runs on AUK central servers.

My experience at the Vancouver Datathon

The datathons bring together a small group of 15-20 people of varied professional backgrounds and experience to work and experiment with the Archives Unleashed Toolkit and the Archives Unleashed Cloud. I really like that the team have chosen to keep attendance small, because it created a close-knit working group full of collaboration, knowledge sharing and idea exchange. It was a relaxed, fun and friendly environment to work in.

Day One

After a quick coffee and light breakfast, the Datathon opened with introductory talks from project team members Ian Milligan (Principal Investigator), Nick Ruest (Co-Principal Investigator) and Samantha Fritz (Project Manager), relating to the project – its goals and outcomes, the toolkit, available datasets and event logistics.

Another quick coffee break and it was back to work – participants were asked to think about the datasets that interested them, techniques they might want to use and questions or themes they would like to explore and write these on sticky notes.

Once placed on the whiteboard, teams naturally formed around datasets, themes and questions. The team I was in consisted of Kathleen Reed and Ben O’Brien, and formed around a common interest in exploring the First Nations and Indigenous communities dataset.

Virtual Machines were kindly provided by Compute Canada and available for use throughout the Datathon to run AUT, datasets were preloaded onto these VMs and a number of derivative files had already been created. We spent some time brainstorming, sharing ideas and exploring datasets using a number of different tools. The day finished with some informative lightning talks about the work participants had been doing with web archives at their home institutions.

Day Two

On day two we continued to explore the datasets, using the full-text derivatives, running NER and performing keyword searches with the command-line tool grep. We also analysed the text using sentiment analysis with the Natural Language Toolkit. To help visualise the data, we took the new text files produced by the keyword searches and uploaded them into Voyant Tools. This helped by visualising links between words, creating a list of top terms and providing quantitative data such as how many times each word appears. It was here we found that the word ‘letter’ appeared quite frequently, and we finalised the dataset we would be using: University of British Columbia – bc-hydro-site-c.
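The grep-and-count workflow we used can be approximated in a few lines of Python; the sample “derivative” below is invented for illustration:

```python
import re
from collections import Counter

# Tiny stand-in for a full-text derivative extracted from a web archive.
derivative = [
    "I am writing this letter to oppose the dam.",
    "Our community depends on the river.",
    "This letter records my objection to the Site C project.",
]

def keyword_lines(lines, keyword):
    """grep-like filter: keep lines containing the keyword."""
    return [l for l in lines if keyword in l.lower()]

def top_terms(lines, n=5):
    """Rough Voyant-style term frequencies over the selected lines."""
    words = re.findall(r"[a-z']+", " ".join(lines).lower())
    return Counter(words).most_common(n)

hits = keyword_lines(derivative, "letter")
print(len(hits), top_terms(hits))
```

On the real dataset, grep did the filtering and Voyant Tools produced the term lists and visualisations; the logic is the same, just at much larger scale.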

We hunted down the site and found it contained a number of letters from people about the BC Hydro Dam Project. The problem was that the letters were held in a table, and when extracted the data was not clean enough. Ben O’Brien came up with a clever extraction solution utilising the raw HTML files and some script magic. Kathleen Reed then prepped the data for geocoding to show the geographical spread of the letter writers, hot-spots and a timeline – a useful way of looking at the issue from the perspective of engagement and the community.
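A minimal sketch of that table-extraction step, using only Python’s standard-library HTML parser; the HTML fragment is invented to stand in for the raw pages, and the real script’s details differed:

```python
from html.parser import HTMLParser

# Hypothetical fragment: letters held in table rows, one writer per row.
sample = """
<table>
  <tr><td>Jane Doe</td><td>Victoria, BC</td><td>I object to the dam.</td></tr>
  <tr><td>John Smith</td><td>Fort St. John</td><td>Please reconsider.</td></tr>
</table>
"""

class TableExtractor(HTMLParser):
    """Collect the text of each <td> cell, grouped by <tr> row."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_td = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_td = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False
        elif tag == "tr" and self._row:
            self.rows.append(self._row)

    def handle_data(self, data):
        if self._in_td and data.strip():
            self._row.append(data.strip())

parser = TableExtractor()
parser.feed(sample)
print(parser.rows)
```

Each row then carries a name, a location for geocoding, and the letter text, which is roughly the clean structure we needed downstream.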

Map of letter writers.

Time Lapse of locations of letter writers. 

At the end of day 2 each team had a chance to present their project to the other teams. You can view the presentation (Exploring Letters of protest for the BC Hydro Dam Site C) we prepared here, as well as the other team projects.

Why Web Archives Matter

How we preserve, collect, share and exchange cultural information has changed dramatically. The act of remembering at national institutes and libraries has altered greatly in scope, speed and scale because of the web. The way in which we provide access to, use and engage with archival material has been disrupted. Current and future historians who want to study the period after the 1990s will have to use web archives as a resource, yet accessibility and usability have lagged behind, and many students and historians are not ready. Projects like Archives Unleashed will help to equip researchers, historians, students and the wider community with the tools needed to tackle these problems. I look forward to seeing the next steps the project takes.

Archives Unleashed is currently accepting submissions for the next Datathon in March 2019 – I highly recommend it.