Tag Archives: Web Archives

#WeMissiPRES: Preserving social media and boiling 1.04 x 10^16 kettles

This year the annual iPRES digital preservation conference was understandably postponed and in its place the community hosted a 3-day Zoom conference called #WeMissiPRES. As two of the Bodleian Libraries’ Graduate Trainee Digital Archivists, Simon and I were in attendance and blogged about our experiences. This post contains some of my highlights.

The conference kicked off with a keynote by Geert Lovink. Geert is the founding director of the Institute of Network Cultures and the author of several books on critical Internet studies. His talk was wide-ranging and covered topics from the rise of so-called ‘Zoom fatigue’ (I guarantee you know this feeling by now) to how social media platforms affect all aspects of contemporary life, often in negative ways. Geert highlighted the importance of preserving social media so that future generations can understand the present historical moment. However, this is a complicated area of digital preservation because archiving social media presents a host of ethical and technical challenges. For instance, how do we accurately capture the experience of using social media when the content displayed to you is largely dictated by an algorithm that is not made public for us to replicate?

After the keynote I attended a series of talks about the ARCHIVER project. João Fernandes from CERN explained that the goal of this project is to improve archiving and digital preservation services for scientific and research data. Preservation solutions for this type of data need to be cost-effective, scalable, and capable of ingesting data at the petabyte scale. There were several further talks from companies submitting to the design phase of the project, including Matthew Addis from Arkivum. Matthew’s talk focused on how digital preservation can be conducted at the industrial scale required to meet the brief; he explained that Arkivum is collaborating with Google to achieve this, because Google’s cloud infrastructure can be leveraged for petabyte-scale storage. He also noted that while marrying preserved content with robust metadata is important in any digital preservation context, it is essential for repositories dealing with very complex scientific data.

In the afternoon I attended a range of talks that addressed new standards and technologies in digital preservation. Linas Cepinskas (Data Archiving and Networked Services (DANS)) spoke about a self-assessment tool for the FAIR principles, which is designed to assess whether data is Findable, Accessible, Interoperable and Reusable. Later, Barbara Sierman (DigitalPreservation.nl) and Ingrid Dillo (DANS) spoke about TRUST, a new set of guiding principles designed to map well onto FAIR and to assess the reliability of data repositories. Antonio Guillermo Martinez (LIBNOVA) gave a talk about his research into artificial intelligence and machine learning applied to digital preservation. Through case studies, he identified that AI is especially good at tasks such as anomaly detection and automatic metadata generation. However, he found that regardless of how well the AI performs, it needs to generate better explanations for its decisions, because it is hard for human beings to trust automated decisions that we find opaque.

Paul Stokes from Jisc gave a talk on calculating the carbon costs of digital curation and, unfortunately, concluded that not much research has been done in this area. The need to improve the environmental sustainability of all human activity could not be more pressing, and digital preservation is no exception: approximately 3% of the world’s electricity is used by data centres. Paul also offered the statistic that data centres worldwide consume enough power to boil 10,400,000,000,000,000 kettles – which is the most important digital preservation metric I can think of.
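Out of curiosity, the kettle statistic can be converted back into energy units. This is only a back-of-envelope sketch: the kettle size, the starting temperature and the talk’s unstated timescale are all my own assumptions.

```python
# Rough conversion of the quoted kettle count into energy terms.
# Assumptions: a 1-litre kettle heated from 20 C to boiling point.
SPECIFIC_HEAT_WATER = 4186                 # J per kg per K
JOULES_PER_TWH = 3.6e15                    # 1 TWh = 3.6e15 J

joules_per_kettle = 1.0 * SPECIFIC_HEAT_WATER * (100 - 20)  # ~335 kJ per boil
kettles = 1.04e16                          # the figure quoted in the talk

total_twh = kettles * joules_per_kettle / JOULES_PER_TWH
print(f"{total_twh:,.0f} TWh")             # roughly 970,000 TWh
```

For comparison, annual worldwide electricity generation is on the order of 25,000 TWh, which suggests the quoted figure is cumulative rather than annual.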

This conference was challenging and eye-opening because it gave me an insight into (complicated!) areas of digital preservation that I was not familiar with, particularly surrounding the challenges of preserving large quantities of scientific and research data. I’m very grateful to the speakers for sharing their research and to the organisers, who did a fantastic job of bringing the community together to bridge the gap between 2019 and 2021!

#WeMissiPRES: A Bridge from 2019 to 2021

Every year, the international digital preservation community meets for the iPRES conference, an opportunity for practitioners to exchange knowledge and showcase the latest developments in the field. With the 2020 conference unable to take place due to the global pandemic, digital preservation professionals instead gathered online for #WeMissiPRES to ensure that the global community remained connected. Our graduate trainee digital archivist Simon Mackley attended the first day of the event; in this blog post he reflects on some of the highlights of the talks and what they tell us about the state of the field.

How do you keep the global digital preservation community connected when international conferences are not possible? This was the challenge faced by the organisers of #WeMissiPRES, a three-day online event hosted by the Digital Preservation Coalition. Conceived as a festival of digital preservation, the aim was not to replicate the regular iPRES conference in an online format, but instead to serve as a bridge for the digital preservation community, connecting the efforts of 2019 with the plans for 2021.

As might be expected, the impact of the pandemic loomed large in many of the talks. Caylin Smith (Cambridge University Library) and Sara Day Thomson (University of Edinburgh), for instance, gave a fascinating paper on the challenge of rapidly collecting institutional responses to coronavirus, focusing on the development of new workflows and streamlined processes. The difficulties of working from home, the requirements of remote access to resources, and the need to move training online likewise proved to be recurrent themes throughout the day. As someone whose own experience of digital preservation has been heavily shaped by the pandemic (I began my traineeship at the start of lockdown!), it was really useful to hear how colleagues in other institutions have risen to these challenges.

I was also struck by the different ways in which responses to the crisis have strengthened digital preservation efforts. Lynn Bruce and Eve Wright (National Records of Scotland) noted for instance that the experience of the pandemic has led to increased appreciation of the value of web-archiving from stakeholders, as the need to capture rapidly-changing content has become more apparent. Similarly, Natalie Harrower (Digital Repository of Ireland) made the excellent point that the crisis had not only highlighted the urgent need for the sharing of medical research data, but also the need to preserve it: Coronavirus data may one day prove essential to fighting a future pandemic, and so there is therefore a moral imperative for us to ensure that it is preserved.

As our keynote speaker Geert Lovink (Institute of Network Cultures) reminded us, the events of the past year have been momentous quite apart from the pandemic, with issues such as the distorting impacts of social media on society, the climate emergency, and global demands for racial justice all having risen to the forefront of public attention. It was therefore great to see the role of digital preservation in these challenges being addressed in many of the panel sessions. A personal highlight for me was the presentation by Daniel Steinmeier (KB National Library of the Netherlands) on diversity and digital preservation. Steinmeier stressed that for diversity efforts to be successful, institutions need to commit to continuing programmes of inclusion rather than one-off actions, with the communities concerned actively included in the archiving process.

So what challenges can we expect from the year ahead? This year, perhaps more than ever, that has been a difficult question to answer. Nonetheless, a key theme that struck me from many of the discussions was that the growing challenge of archiving social media platforms is matched only by the increasing need to preserve the content hosted on them. As Zefi Kavvadia (International Institute of Social History) noted, many social media platforms actively resist archiving; even when preservation is possible, curators are faced with a dilemma between capturing user experiences and capturing platform data. Navigating this challenge will surely be a major priority for the profession going forward.

While perhaps no substitute for meeting in person, #WeMissiPRES nonetheless succeeded in bringing the international digital preservation community together in a shared celebration of the progress being made in the field, successfully bridging the gap between 2019 and 2021, and laying the foundations for next year’s conference.

 

#WeMissiPRES was held online from 22nd-24th September 2020. For more information, and for recordings of the talks and panel sessions, see the event page on the DPC website.

Archiving web content related to the University of Oxford and the coronavirus pandemic

Since March 2020, the scope of collection development at the Bodleian Libraries’ Web Archive has expanded to focus also on the coronavirus pandemic: how the University of Oxford and the wider university community have reacted and responded to the rapidly changing global situation and government guidance. The Bodleian Libraries’ Web Archive team have endeavoured (and will keep working) to capture, quality-assess and make publicly available records from the web relating to Oxford and the coronavirus pandemic. Preserving these ephemeral records is important. Just a few months into what is sure to be a long road, what do these records show?

Firstly, records from the Bodleian Libraries’ Web Archive can demonstrate how university divisions and departments are continually adjusting in order to facilitate core activities of learning and research. This could be by moving planned events online or organising and hosting new events relevant to the current climate:

Capture of http://pcmlp.socleg.ox.ac.uk/ 24 May 2020 available through the Bodleian Libraries’ Web Archive. Wayback URL https://wayback.archive-it.org/2502/20200524133907/https://pcmlp.socleg.ox.ac.uk/global-media-policy-seminar-series-victor-pickard-on-media-policy-in-a-time-of-crisis/

Captures of websites also provide an insight into the numerous collaborations between Oxford University, the UK government and other institutions at this unprecedented time; that is, the role Oxford is playing and how that role is changing and adapting. Much of this can be seen in the ever-evolving news pages of departmental websites, especially those within the Medical Sciences Division, such as the Nuffield Department of Population Health’s collaboration with UK Biobank for the Department of Health and Social Care, announced on 17 May 2020.

The web archive preserves records of how certain groups are contributing to COVID-19 research, front-line work and evidence review at an extremely fast pace, which the curators at the Bodleian Libraries’ Web Archive can attempt to capture by crawling more frequently. One example of this is the Centre for Evidence-Based Medicine’s Oxford COVID-19 Evidence Service – a platform for rapid data analysis and reviews which is currently updated with several articles daily. Comparing two screenshots of different captures of the site, seven weeks apart, shows us the different themes of data being reviewed, and particularly how the ‘Most Viewed’ questions change (or indeed, don’t change) over time.

Capture of https://www.cebm.net/covid-19/ 14 April 2020 available through the Bodleian Libraries’ Web Archive. Wayback URL https://wayback.archive-it.org/org-467/20200414111731/https://www.cebm.net/covid-19/

Interestingly, the page location has changed slightly; the eagle-eyed among you may have spotted that the article reviews are now under /oxford-covid-19-evidence-service/, which is still in the web crawler’s scope.

Capture of https://www.cebm.net/covid-19/ 05 June 2020 available through the Bodleian Libraries’ Web Archive. Wayback URL https://wayback.archive-it.org/org-467/20200605100737/https://www.cebm.net/oxford-covid-19-evidence-service/

We welcome recommendations for sites to archive; if you would like to nominate a website for inclusion in the Bodleian Libraries’ Web Archive you can do so here. Meanwhile, the work to capture institutional, departmental and individual responses at this time continues.

Web Archiving & Preservation Working Group: Social Media & Complex Content

On January 16 2020, I had the pleasure of attending the first public meeting of the Digital Preservation Coalition’s Web Archiving and Preservation Working Group. The meeting was held in the beautiful New Records House in Edinburgh.

We were welcomed by Sara Day Thomson, whose opening talk gave us a very clear overview of the issues and questions we increasingly run into when archiving complex or dynamic web and social media content. For example, how do we preserve apps like Pokémon Go that use a user’s location data or even personal information to individualize the experience? Or where do we draw the line in interactive social media conversations? After all, we cannot capture everything. But how do we even capture this information without infringing the rights of the original creators? These and more musings set the stage perfectly for the rest of the talks during the day.

Although I would love to include every talk held this day, as they were all very interesting, I will only highlight a couple of the presentations to give this blog some pretence at “brevity”.

The first talk I want to highlight was given by Giulia Rossi, Curator of Digital Publications at the British Library, on “Overview of Collecting Approach to Complex Publications”. Rossi introduced us to the Emerging Formats project, a two-year project by the British Library. The project focuses on three types of content:

  1. Web-based interactive narratives where the user’s interaction with a browser based environment determines how the narrative evolves;
  2. Books as mobile apps (a.k.a. literary apps);
  3. Structured data.

Personally, I found Rossi’s discussion of the collection methods particularly interesting. The team working on the Emerging Formats project does not just use heritage crawlers and other web harvesting tools, but also file transfers or direct downloads via access code and password. Most strikingly, in the event that only a partial capture can be made, they try to capture as much contextual information about the digital object as possible, including blog posts, screenshots or videos of walkthroughs, so researchers will have a good idea of what the original content looked like.

The capture of contextual content and the inclusion of additional contextual metadata about web content are currently not standard practice. Many tools do not even allow for their inclusion. However, considering that many web harvesting tools struggle to capture dynamic and complex content, this could offer an interesting workaround for many web archives. It is definitely an option that I would like to explore going forward.

The second talk that I would like to zoom in on is “Collecting internet art” by Karin de Wild, digital fellow at the University of Leicester. Taking Agent Ruby – a chatbot created by Lynn Hershman Leeson – as her example, de Wild explored how we determine which aspects of internet art need to be preserved and what challenges this poses. In the case of Agent Ruby, the San Francisco Museum of Modern Art initially exhibited the chatbot in a software installation within the museum, thereby taking the artwork out of its original context. It then proceeded to add it to its online Expedition e-space, which has since been taken offline. Only a screenshot of the online artwork is currently accessible through the SFMOMA website, as the museum prioritizes the preservation of the interface over the chat functionality.

This decision raises questions about the right ways to preserve online art. Does the interface indeed suffice, or should we attempt to maintain the integrity of the artwork by saving the code as well? And if we do that, should we employ code restitution, which aims to preserve the original artwork’s code, or a significant part of it, whilst adding restoration code to bring defunct code back to full functionality? Or do we emulate the software, as the University of Freiburg is currently exploring? How do we keep track of the provenance of the artwork whilst taking into account the different iterations that digital artworks go through?

De Wild proposed turning to linked data as a way of tracking, in particular, the provenance of an artwork. Together with two other colleagues, she has been working on a project called Rhizome, in which they are creating a data model that will allow people to track the provenance of internet art.

Although this is not within the scope of the Rhizome project, it would be interesting to see how the finished data model would lend itself to keep track of changes in the look and feel of regular websites as well. Even though the layouts of websites have changed radically over the past number of years, these changes are usually not documented in metadata or data models, even though they can be as much of a reflection of social and cultural changes as the content of the website. Going forward it will be interesting to see how the changes in archiving online art works will influence the preservation of online content in general.

The final presentation I would like to draw attention to is “Twitter Data for Social Science Research” by Luke Sloan, deputy director of the Social Data Science Lab at Cardiff University. He provided us with a demo of COSMOS, an alternative to the Twitter API, which is freely available to academic institutions and not-for-profit organisations.

COSMOS allows you either to target a particular Twitter feed or to enter a search term to obtain a 1% sample of the total worldwide Twitter feed. The gathered data can be analysed within the system and is stored in JSON format. The information can subsequently be exported to CSV or Excel format.
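As an illustration of what that export step involves, here is a minimal sketch that flattens tweet-like JSON records into a CSV file using only Python’s standard library. The field names and sample records are invented for the example, not COSMOS’s actual schema.

```python
# Hypothetical sketch: flattening tweet-like JSON records into CSV.
import csv
import json

raw = '''[
  {"id": "1", "created_at": "2020-01-16", "text": "Archiving the web", "user": {"screen_name": "example"}},
  {"id": "2", "created_at": "2020-01-17", "text": "Social media at scale", "user": {"screen_name": "example2"}}
]'''

tweets = json.loads(raw)
with open("tweets.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "created_at", "screen_name", "text"])
    for t in tweets:
        # Pull the nested user object up into a flat row
        writer.writerow([t["id"], t["created_at"], t["user"]["screen_name"], t["text"]])
```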

Although the system is only able to capture new (or live) Twitter data, it is possible to upload historical Twitter data into the system if an archive has access to it.

Having explained how COSMOS works, Sloan asked us to consider the potential risks that archiving and sharing Twitter data could pose to the original creator. Should we not protect these creators by anonymizing their tweets to a certain extent? If so, what data should we keep? Do we only record the tweet ID and the location? Or would this already make it too easy to identify the creator?
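One common answer to the anonymization question is pseudonymisation: replacing identifiers with salted hashes, so that records from the same account remain linkable without exposing the account name. A minimal sketch (my own illustration, not a COSMOS feature):

```python
# Hypothetical pseudonymisation step: a salted hash replaces the screen name,
# so the same account always maps to the same opaque token.
import hashlib

SALT = b"local-secret-salt"  # would be kept private by the archive

def pseudonymise(screen_name: str) -> str:
    digest = hashlib.sha256(SALT + screen_name.encode("utf-8"))
    return digest.hexdigest()[:16]

print(pseudonymise("example_user"))  # deterministic, but not reversible without the salt
```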

The last part of Sloan’s presentation tied in really well with the discussion about ethical approaches to archiving social media. During this discussion we were prompted to consider ways in which archives could archive Twitter data whilst being conscious of the potential risks to the original creators of the tweets. This definitely got me thinking about the way we currently archive some of the Twitter accounts related to the Bodleian Libraries in our very own Bodleian Libraries Web Archive.

All in all, the DPC event definitely gave me more than enough food for thought about the ways in which the Bodleian Libraries and the wider community in general can improve the ways we capture (meta)data related to the online content that we archive and the ethical responsibilities that we have towards the creators of said content.

Oxford LibGuides: Web Archives

Web archives are becoming more and more prevalent and are being increasingly used for research purposes. They are fundamental to the preservation of our cultural heritage in the interconnected digital age. With the continuing collection development on the Bodleian Libraries Web Archive and the recent launch of the new UK Web Archive site, the web archiving team at the Bodleian have produced a new guide to web archives. The new Web Archives LibGuide includes useful information for anyone wanting to learn more about web archives.

It focuses on the following areas:

  • The Internet Archive, The UK Web Archive and the Bodleian Libraries Web Archive.
  • Other web archives.
  • Web archive use cases.
  • Web archive citation information.

Check out the new look for the Web Archives LibGuide.

 

 

Archives Unleashed – Vancouver Datathon

On the 1st-2nd of November 2018 I was lucky enough to attend the Archives Unleashed Datathon Vancouver, co-hosted by the Archives Unleashed Team and Simon Fraser University Library along with KEY (SFU Big Data Initiative). I am very grateful for the generous travel grant from the Andrew W. Mellon Foundation that made this possible.

The SFU campus at the Harbour Centre was an amazing venue for the Datathon and it was nice to be able to take in some views of the surrounding mountains.

About the Archives Unleashed Project

The Archives Unleashed Project is a three-year project focused on making historical internet content easily accessible to scholars and researchers whose interests lie in exploring and researching both the recent past and contemporary history.

After a series of datathons held at a number of international institutions, including the British Library, the University of Toronto, the Library of Congress and the Internet Archive, the Archives Unleashed Team identified some key areas of development that would help to deliver their aim of making petabytes of valuable web content accessible.

Key Areas of Development
  • Better analytics tools
  • Community infrastructure
  • Accessible web archival interfaces

By engaging and building a community, alongside developing web archive search and data analysis tools the project is successfully enabling a wide range of people including scholars, programmers, archivists and librarians to “access, share and investigate recent history since the early days of the World Wide Web.”

The project has a three-pronged approach:
  1. Build a software toolkit (Archives Unleashed Toolkit)
  2. Deploy the toolkit in a cloud-based environment (Archives Unleashed Cloud)
  3. Build a cohesive user community that is sustainable and inclusive by bringing together the project team members with archivists, librarians and researchers (Datathons)
Archives Unleashed Toolkit

The Archives Unleashed Toolkit (AUT) is an open-source platform for analysing web archives with Apache Spark. I was really impressed by AUT due to its scalability, relative ease of use and the huge number of analytical options it provides. It can run on a laptop (macOS, Linux or Windows), a powerful cluster or a single-node server – and if you wanted to, you could even run AUT on a Raspberry Pi. The Toolkit supports a number of search functions across the entirety of a web archive collection: you can filter collections by domain, URL pattern, date, language and more; create lists of URLs and return the top ten in a collection; extract plain text from the HTML files within ARC or WARC files; and clean the data by removing ‘boilerplate’ content such as advertisements. It’s also possible to use the Stanford Named Entity Recognizer (NER) to extract the names of entities, locations, organisations and persons. I’m looking forward to seeing how this functionality is adapted to localised instances and controlled vocabularies – would it be possible to run a similar programme for automated tagging of web archive collections in the future? Perhaps ingest a collection into AUT, run NER and automatically tag the data, providing richer metadata for web archives and subsequent research.
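AUT itself drives these operations at scale through Apache Spark; purely as a toy stand-in, the plain-text-extraction step can be sketched with Python’s standard library (the class and the sample HTML are my own invention, not AUT code):

```python
# Toy illustration of one step such toolkits automate: extracting visible
# plain text from an HTML page while skipping script/style "boilerplate".
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth inside tags we treat as non-content

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

extractor = TextExtractor()
extractor.feed("<html><head><style>p{}</style></head>"
               "<body><p>Archived page text</p></body></html>")
print(" ".join(extractor.parts))  # -> Archived page text
```

Real WARC-scale extraction adds record parsing, character-set handling and boilerplate heuristics on top of this, which is exactly the plumbing AUT packages up.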

Archives Unleashed Cloud

The Archives Unleashed Cloud (AUK) is a GUI-based front end for working with AUT; it essentially provides an accessible interface for generating research derivatives from web archive files (WARCs). With a few clicks users can ingest and sync Archive-It collections, analyse the collections, create network graphs and visualise connections and nodes. It is currently free to use and runs on AUK’s central servers.

My experience at the Vancouver Datathon

The datathons bring together a small group of 15-20 people of varied professional backgrounds and experience to work and experiment with the Archives Unleashed Toolkit and the Archives Unleashed Cloud. I really like that the team chose to limit attendance, because it created a close-knit working group that was full of collaboration, knowledge and idea exchange. It was a relaxed, fun and friendly environment to work in.

Day One

After a quick coffee and light breakfast, the Datathon opened with introductory talks from project team members Ian Milligan (Principal Investigator), Nick Ruest (Co-Principal Investigator) and Samantha Fritz (Project Manager), relating to the project – its goals and outcomes, the toolkit, available datasets and event logistics.

Another quick coffee break and it was back to work – participants were asked to think about the datasets that interested them, techniques they might want to use and questions or themes they would like to explore and write these on sticky notes.

Once placed on the whiteboard, teams naturally formed around datasets, themes and questions. The team I was in consisted of Kathleen Reed and Ben O’Brien, and formed around a common interest in exploring the First Nations and Indigenous communities dataset.

Virtual machines were kindly provided by Compute Canada and were available for use throughout the Datathon to run AUT; datasets were preloaded onto these VMs and a number of derivative files had already been created. We spent some time brainstorming, sharing ideas and exploring datasets using a number of different tools. The day finished with some informative lightning talks about the work participants had been doing with web archives at their home institutions.

Day Two

On day two we continued to explore the datasets, using the full-text derivatives, running some NER and performing keyword searches with the command-line tool grep. We also analysed the text using sentiment analysis with the Natural Language Toolkit. To help visualise the data, we took the new text files produced by the keyword searches and uploaded them into Voyant Tools. This helped by visualising links between words, creating a list of top terms and providing quantitative data such as how many times each word appears. It was here that we found the word ‘letter’ appeared quite frequently, and we finalised the dataset we would be using: University of British Columbia – bc-hydro-site-c.
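The term-counting side of that workflow (the ‘top terms’ view Voyant provides) boils down to a word-frequency count, which can be sketched with the standard library alone; the sample text below is invented for the example:

```python
# Minimal stand-in for a "top terms" count over derivative text files.
import re
from collections import Counter

text = """A letter of support was received. Another letter objected.
The final letter asked for more consultation on the dam."""

# Lowercase and split on anything that isn't a letter or apostrophe
words = re.findall(r"[a-z']+", text.lower())
top_terms = Counter(words).most_common(3)
print(top_terms)  # 'letter' tops the list with three occurrences
```

A real pass would also drop stop-words ("the", "of", and so on), which is what makes a term like ‘letter’ stand out in tools such as Voyant.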

We hunted down the site and found it contained a number of letters from people about the BC Hydro Dam Project. The problem was that the letters were in a table, and when extracted the data was not clean enough. Ben O’Brien came up with a clever extraction solution utilising the raw HTML files and some script magic. The data was then prepped for geocoding by Kathleen Reed to show the geographical spread of the letter writers, hot-spots and a timeline – a useful way of looking at the issue from the perspective of engagement and the community.

Map of letter writers.

Time Lapse of locations of letter writers. 

At the end of day 2 each team had a chance to present their project to the other teams. You can view the presentation (Exploring Letters of protest for the BC Hydro Dam Site C) we prepared here, as well as the other team projects.

Why Web Archives Matter

How we preserve, collect, share and exchange cultural information has changed dramatically. The act of remembering at national institutes and libraries has altered greatly in scope, speed and scale because of the web. The way in which we provide access to, use and engage with archival material has been disrupted. Current and future historians who want to study the period after the 1990s will have to use web archives as a resource, yet accessibility and usability have lagged behind, and many students and historians are not yet prepared. Projects like Archives Unleashed will help to equip researchers, historians, students and the wider community with the tools needed to tackle these problems. I look forward to seeing the next steps the project takes.

Archives Unleashed is currently accepting submissions for the next Datathon in March 2019; I highly recommend it.

The UK Web Archive: Mental Health, Social Media and the Internet Collection

The UK Web Archive hosts several Special Collections, curating material related to a particular theme or subject. One such collection is on Mental Health, Social Media and the Internet.

Since the advent of Web 2.0, people have been using the Internet as a platform to engage and connect, amongst other things, resulting in new forms of communication and, consequently, new environments to adapt to – such as social media networks. This collection aims to illustrate how this has affected the UK in terms of the impact on mental health, reflecting the current attitudes towards mental health displayed online within the UK and how the Internet and social media are being used in contemporary society.

We began curating material in June 2017, archiving various types of web content, including: research, news pieces, UK based social media initiatives and campaigns, charities and organisations’ websites, blogs and forums.

Material is being collected around several themes, including:

Body Image
Over the past few years, there has been a move towards using social media to discuss body image and mental health. This part of the collection curates material relating to how the Internet and social media affect mental health issues relating to body image. This includes research developing theory in this area, news articles on various individuals’ experiences, and material posted on social media accounts discussing this theme.

Cyber-bullying
This theme curates material, such as charities’ and organisations’ websites and social media accounts, that discusses, raises awareness of and tackles this issue. It also collects material examining the impact of social media and Internet use on bullying, such as news articles, social media campaigns and blog posts, as well as online resources created to help with this issue, such as guides and advice.

Addiction
This theme collects material around gaming and other Internet-based activities that may become addictive, such as social media, pornography and gambling. It includes recent UK-based research, studies and online polls, social media campaigns, online resources, blogs and news articles from individuals and organisations. Discourse, discussion, opinion and action regarding different aspects of Internet addiction are all captured under this overarching term, including social media addiction.

The Mental Health, Social Media and the Internet Special Collection is available via the new UK Web Archive Beta interface!

Co-authored with Carl Cooper

The UK Web Archive Ebola Outbreak collection

By CDC Global (Ebola virus) [CC BY 2.0 (http://creativecommons.org/licenses/by/2.0)], via Wikimedia Commons

Next month marks the four-year anniversary of the WHO’s public announcement of “a rapidly evolving outbreak of Ebola virus disease (EVD)”, which went on to become the deadliest outbreak of EVD in history.

With more than 28,000 cases and 11,000 deaths, it moved with such speed and virulence that–though concentrated in Guinea, Liberia and Sierra Leone–it was feared at the time that the Ebola virus disease outbreak of 2014-2016 would soon spread to become a global pandemic.

No cure or vaccine has yet been discovered and cases continue to flare up in West Africa; the most recent flare-up was declared over on 2 July 2017. Yet today most people in the UK, unless directly affected, don't give it a second thought.

Searching online now, you can find fact sheets detailing everything you might want to know about patient zero and the subsequent rapid spread of infection. You can find discussions detailing the international response (or failure to respond) and lessons learned. You might even find the reminiscences of aid workers and survivors. But these sites all examine the outbreak in retrospect, and their pages and stories have been updated so often that posts from the time can no longer be found.

Posts that reflected the fear and uncertainty that permeated the UK during the epidemic. The urgent status updates and travel warnings. The misinformation that people were telling each other. The speculation that ran riot. The groundswell of giving. The mobilisation of aid.

Understandably when we talk about epidemics the focus is on the scale of physical suffering: numbers stricken and dead; money spent and supplies sent; the speed and extent of its spread.

Whilst UKWA regularly collects the websites of major news channels and governmental agencies, what we wanted to capture was the public dialogue on, and interpretation of, events as they unfolded. To see how local interests and communities saw the crisis through the lenses of their own experience.

To this end, the special collection Ebola Outbreak, West Africa 2014 features a broad selection of websites concerning the UK response to the Ebola virus crisis. Here you can find:

  • The Anglican community’s view on the role of faith during the crisis;
  • Alternative medicine touting the virtues of liposomal vitamin C as a cure for Ebola;
  • Local football clubs fundraising to send aid;
  • Parents in the UK withdrawing children from school because of fear of the virus’ spread;
  • Think tanks’ and academics’ views on the national and international response;
  • Universities issuing guidance and reports on dealing with international students; and more.

Active collection for Ebola began in November 2014, at the height of the outbreak, while related websites dating back to the infection of patient zero in December 2013 have been retrospectively added to the collection. Collection continued through to January 2016, a few months before the outbreak tailed off in April 2016.

The Ebola collection is available via the UK Web Archive’s new beta interface.

#WAWeek2017 – Researchers, practitioners and their use of the archived web

This year, the world of web archiving saw a premiere: not only were the biennial RESAW conference and the IIPC conference (established in 2016) held jointly for the first time, but they also formed part of a whole week of workshops, talks and public events around web archives – Web Archiving Week 2017 (or #WAWeek2017 for the social-medially inclined).

After previous conferences in Reykjavik (2016) and Aarhus (RESAW 2015), the big 2017 event was held in London, 14-16 June 2017, organised jointly by the School of Advanced Study of the University of London, the IIPC and the British Library.
The programme was packed with an eclectic variety of presentations and discussions, with topics ranging from the theory and practice of curating web archive collections and capturing whole national web domains, via technical topics such as preservation strategies, software architecture and data management, to the development of methodologies and tools for research using web archives and case studies of their application.

Even in digital times, who doesn’t like a conference pack? Of course, the full programme is also available online. (…but which version will be easier to archive?)

Continue reading

Researchers, practitioners and their use of the archived web. IIPC Web Archiving Conference, 15th June 2017

From the 14th to the 16th of June, researchers and practitioners from a global community came together for a series of talks, presentations and workshops on the subject of web archiving at the IIPC Web Archiving Conference. This event coincided with Web Archiving Week 2017, a week-long event running from the 12th to the 16th of June hosted by the British Library and the School of Advanced Study.

I was lucky enough to attend the conference on the 15th June with a fellow trainee digital archivist and listen to some thoughtful, engaging and challenging talks.

The day started with a plenary in which John Sheridan, Digital Director of the National Archives, spoke about the work of the National Archives and the challenges of, and approaches to, web archiving they have taken. The National Archives is principally the archive of the government; it allows us to see the past through the state's eyes. Archiving government websites is a crucial part of this record keeping as we move further into a digital age where records are increasingly born-digital. A number of points highlighted the motivations behind web archiving at the National Archives:

  • They care about the records that government publishes, and their primary function is to preserve those records
  • Accountability for online government services and the information they publish
  • Capturing both content and context

By preserving what the government publishes online, it can be held accountable; accountability is one aspect that demonstrates the inherent value of archiving the web. You can find a great blog post on accountability and digital services by Richard Pope at http://blog.memespring.co.uk/2016/11/23/oscon-2016/.

The records and content published on the internet provide valuable and crucial context for the records that are unpublished, linking the backstory to the published records. This allows for greater understanding and analysis of the information and will be vital for researchers and historians now and in the future.

Quality assurance is a high priority at the National Archives. A narrow crawling focus has both allowed and prompted a lot of effort to be directed at the quality of the archived material, so that it has high fidelity in playback. To keep these high standards, a really good in-depth crawl can take weeks. Having a small curated collection is an incentive to work harder on capture.

The users and their needs were also discussed, as these often shape the way the data is collected, packaged and delivered.

  • Users want to substantiate a point, using archived sites for citation on Facebook or Twitter, for example
  • The need to cite sources as a writer or researcher
  • Legal – what was the government's stance, or the law, at the time of my client's case?
  • Researchers' needs – highlighted as an area where improvements can be made
  • Government itself uses the archives for information purposes
  • Government websites requesting crawls before they close – for example, the NHS website transferring to a GOV.UK site

The last part of the talk focused on the future of web archiving and how this might take shape at the National Archives. Web archiving is complex and at times chaotic. Traditional archiving standards have been placed upon it in an attempt to order the records: it was a natural evolution for information managers and archivists to use their existing knowledge, skills and standards to bring this information under control. However, this has resulted in difficulties in searching across web archives, describing the content and structuring the information. The nature of the internet and the way its information is created mean that uncertainty inevitably has to be embraced.

Digital archiving could move into a second generation, a '2.0', stepping away from traditional standards to embrace new standards and concepts. One proposed method is the ICA's Records in Contexts conceptual model. It proposes a multidimensional description in which each 'thing' has a unique description, as opposed to the traditional one-size-fits-all unit of description. Instead of a single hierarchical, fonds-down approach, the Records in Contexts model uses a description that can be formed as a network or graph. The context of the fonds is broader, linking to other collections and records to give different perspectives and views. Records can be enriched this way, providing a fuller picture of the record or archive. The web produces content in a constant state of flux, and a system of description that can grow and morph over time, creating new links and context, would be a fruitful addition.
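The difference between a single fonds hierarchy and a graph of typed relations can be illustrated with a minimal sketch. This is not an implementation of the Records in Contexts standard itself; the class, the relation labels and the record names (drawn loosely from the Leveau example below) are hypothetical, chosen only to show how one record can sit in several contexts at once rather than under a single parent:

```python
from collections import defaultdict

class DescriptionGraph:
    """Toy graph-style description: each 'thing' (record, agent,
    institution) is a node, and context is expressed as typed,
    directed relations between nodes."""

    def __init__(self):
        # subject -> list of (relation, object) pairs
        self.relations = defaultdict(list)

    def relate(self, subject, relation, obj):
        """Record a typed relation from subject to obj."""
        self.relations[subject].append((relation, obj))

    def context_of(self, subject):
        """All typed relations for a node - its 'context'."""
        return self.relations[subject]

graph = DescriptionGraph()

# The same record participates in several contexts, not just one
# parent fonds (names below are illustrative, not real descriptions):
graph.relate("minute-book-1851", "is part of", "notary-fonds")
graph.relate("minute-book-1851", "was created by", "P.G.F. Leveau")
graph.relate("minute-book-1851", "is held by", "Archives nationales de France")

# Agents are nodes too, and can carry their own context:
graph.relate("P.G.F. Leveau", "held position of", "public notary")

for relation, obj in graph.context_of("minute-book-1851"):
    print(f"minute-book-1851 {relation} {obj}")
```

In a strict fonds-down hierarchy, only the first of those three relations could be expressed; the graph form lets creators, holding institutions and other collections all link into the same record, which is the multidimensional description the model proposes.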

Visual diagram of how the Records in Contexts conceptual model works:

“This example shows some information about P.G.F. Leveau a French public notary in the 19th century including:
• data from the Archives nationales de France (ANF) (in blue); and
• data from a local archival institution, the Archives départementales du Cher (in yellow).” (International Council on Archives, Records in Contexts: A Conceptual Model for Archival Description, p. 93)

Traditional fonds-level description

I really enjoyed the conference as a whole and the talk by John Sheridan. I learnt a lot about the National Archives' approach to web archiving, its challenges, and where the future of web archiving might go. I'm looking forward to applying this new knowledge to the web archiving work I do here at the Bodleian.

Changes are currently being made to the National Archives' web archiving site, which will relaunch on the 1st July this year. Why not go and check it out?