Tag Archives: Web Archives

Oxford LibGuides: Web Archives

Web archives are becoming more and more prevalent and are being increasingly used for research purposes. They are fundamental to the preservation of our cultural heritage in the interconnected digital age. With the continuing collection development on the Bodleian Libraries Web Archive and the recent launch of the new UK Web Archive site, the web archiving team at the Bodleian have produced a new guide to web archives. The new Web Archives LibGuide includes useful information for anyone wanting to learn more about web archives.

It focuses on the following areas:

  • The Internet Archive, The UK Web Archive and the Bodleian Libraries Web Archive.
  • Other web archives.
  • Web archive use cases.
  • Web archive citation information.

Check out the new look for the Web Archives LibGuide.

Archives Unleashed – Vancouver Datathon

On the 1st-2nd of November 2018 I was lucky enough to attend the Archives Unleashed Datathon Vancouver, co-hosted by the Archives Unleashed Team and Simon Fraser University Library along with KEY (SFU Big Data Initiative). I am very thankful for the generous travel grant from the Andrew W. Mellon Foundation that made this possible.

The SFU campus at the Harbour Centre was an amazing venue for the Datathon and it was nice to be able to take in some views of the surrounding mountains.

About the Archives Unleashed Project

The Archives Unleashed Project is a three-year project with a focus on making historical internet content easily accessible to scholars and researchers whose interests lie in exploring and researching both the recent past and contemporary history.

After a series of datathons held at a number of international institutions such as the British Library, University of Toronto, Library of Congress and the Internet Archive, the Archives Unleashed Team identified some key areas of development that would help to deliver their aim of making petabytes of valuable web content accessible.

Key Areas of Development
  • Better analytics tools
  • Community infrastructure
  • Accessible web archival interfaces

By engaging and building a community, alongside developing web archive search and data analysis tools, the project is successfully enabling a wide range of people including scholars, programmers, archivists and librarians to “access, share and investigate recent history since the early days of the World Wide Web.”

The project has a three-pronged approach:
  1. Build a software toolkit (Archives Unleashed Toolkit)
  2. Deploy the toolkit in a cloud-based environment (Archives Unleashed Cloud)
  3. Build a cohesive user community that is sustainable and inclusive by bringing together the project team members with archivists, librarians and researchers (Datathons)

Archives Unleashed Toolkit

The Archives Unleashed Toolkit (AUT) is an open-source platform for analysing web archives with Apache Spark. I was really impressed by AUT's scalability, relative ease of use and the huge number of analytical options it provides. It can run on a laptop (macOS, Linux or Windows), a single-node server or a powerful cluster, and if you wanted to, you could even run AUT on a Raspberry Pi. The Toolkit allows for a number of search functions across the entirety of a web archive collection. You can filter collections by domain, URL pattern, date, language and more; create lists of URLs and return the top ten in a collection; and extract plain text from the HTML files in an ARC or WARC file, cleaning the data by removing ‘boilerplate’ content such as advertisements. It’s also possible to use the Stanford Named Entity Recognizer (NER) to extract named entities such as locations, organisations and persons. I’m looking forward to seeing how this functionality is adapted to localised instances and controlled vocabularies – would it be possible to run a similar programme for automated tagging of web archive collections in the future? Maybe ingest a collection into AUT, run NER and automatically tag the data, providing richer metadata for web archives and subsequent research.
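To give a feel for what a "top ten" list involves, here is a minimal sketch in plain Python. It is not AUT code and does not use AUT's API (AUT runs on Apache Spark); the function name `top_domains` and the sample URLs are made up for illustration. It simply counts how often each hostname appears in a list of URLs.

```python
from collections import Counter
from urllib.parse import urlsplit


def top_domains(urls, n=10):
    """Return the n most frequent hostnames in an iterable of URLs.

    Hypothetical helper for illustration only; not part of AUT.
    """
    hosts = []
    for url in urls:
        host = urlsplit(url).hostname  # None when no hostname can be parsed
        if host:
            # Treat www.example.org and example.org as the same domain.
            hosts.append(host[4:] if host.startswith("www.") else host)
    return Counter(hosts).most_common(n)


if __name__ == "__main__":
    # Made-up sample data standing in for URLs pulled from a collection.
    sample = [
        "http://www.example.org/a",
        "https://example.org/b",
        "http://example.com/",
        "http://example.org/c",
        "not a url",
    ]
    for domain, count in top_domains(sample):
        print(domain, count)
```

On a real collection the list of URLs would come from a derivative file rather than a hard-coded list, and a Spark-based tool does the same counting in parallel across many WARC files.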

Archives Unleashed Cloud

The Archives Unleashed Cloud (AUK) is a GUI-based front end for working with AUT. It essentially provides an accessible interface for generating research derivatives from web archive files (WARCs). With a few clicks users can ingest and sync Archive-It collections, analyse the collections, create network graphs and visualise connections and nodes. It is currently free to use and runs on AUK central servers.

My experience at the Vancouver Datathon

The datathons bring together a small group of 15-20 people of varied professional backgrounds and experience to work and experiment with the Archives Unleashed Toolkit and the Archives Unleashed Cloud. I really like that the team have chosen to limit the numbers that attend because it created a close-knit working group full of collaboration and the exchange of knowledge and ideas. It was a relaxed, fun and friendly environment to work in.

Day One

After a quick coffee and light breakfast, the Datathon opened with introductory talks from project team members Ian Milligan (Principal Investigator), Nick Ruest (Co-Principal Investigator) and Samantha Fritz (Project Manager), relating to the project – its goals and outcomes, the toolkit, available datasets and event logistics.

Another quick coffee break and it was back to work – participants were asked to think about the datasets that interested them, techniques they might want to use and questions or themes they would like to explore and write these on sticky notes.

Once placed on the whiteboard, teams naturally formed around datasets, themes and questions. The team I was in included Kathleen Reed and Ben O’Brien and formed around a common interest in exploring the First Nations and Indigenous communities dataset.

Virtual machines were kindly provided by Compute Canada and were available throughout the Datathon to run AUT; datasets were preloaded onto these VMs and a number of derivative files had already been created. We spent some time brainstorming, sharing ideas and exploring datasets using a number of different tools. The day finished with some informative lightning talks about the work participants had been doing with web archives at their home institutions.

Day Two

On day two we continued to explore datasets by using the full-text derivatives, running some NER and performing keyword searches using the command-line tool grep. We also analysed the text using sentiment analysis with the Natural Language Toolkit. To help visualise the data, we took the new text files produced from the keyword searches and uploaded them into Voyant Tools. This helped by visualising links between words, creating a list of top terms and providing quantitative data such as how many times each word appears. It was here we found that the word ‘letter’ appeared quite frequently and we finalised the dataset we would be using – University of British Columbia – bc-hydro-site-c.
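The keyword search and top-terms steps can be sketched in a few lines of Python. This is an illustration under stated assumptions, not what we ran on the day (we used grep and Voyant Tools): the function names, the tiny stopword list and the sample lines are all hypothetical.

```python
import re
from collections import Counter

# A deliberately tiny, made-up stopword list; real tools ship much longer ones.
STOPWORDS = frozenset({"the", "a", "to", "of", "and", "about", "on"})


def grep_lines(lines, keyword):
    """Keep lines containing keyword, ignoring case (roughly `grep -i`)."""
    needle = keyword.lower()
    return [line for line in lines if needle in line.lower()]


def top_terms(lines, n=5, stopwords=STOPWORDS):
    """Count word frequencies across lines and return the n most common."""
    words = []
    for line in lines:
        for word in re.findall(r"[a-z']+", line.lower()):
            if word not in stopwords:
                words.append(word)
    return Counter(words).most_common(n)


if __name__ == "__main__":
    # Made-up lines standing in for a full-text derivative file.
    text = [
        "A letter about the dam",
        "Public comment on the river",
        "Another LETTER to BC Hydro",
    ]
    hits = grep_lines(text, "letter")
    print(hits)
    print(top_terms(hits))
```

Filtering first and counting second mirrors the workflow above: the keyword search narrows the text, and the frequency count is what surfaces an unexpectedly common word.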

We hunted down the site and found it contained a number of letters from people about the BC Hydro Dam Project. The problem was that the letters were in a table and when extracted the data was not clean enough. Ben O’Brien came up with a clever extraction solution utilising the raw HTML files and some script magic. The data was then prepped for geocoding by Kathleen Reed to show the geographical spread of the letter writers, hot-spots and timeline, a useful way of looking at the issue from the perspective of engagement and the community.
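For readers curious what pulling table data out of raw HTML can look like, here is a minimal sketch using only Python's standard library. It is not Ben's script, which is not reproduced here; the class name, the helper `extract_rows` and the sample table (including its column names and values) are assumptions made purely for illustration.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join fragments and collapse runs of whitespace and line breaks.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_rows(html):
    """Return table rows from an HTML string as lists of cell text."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


if __name__ == "__main__":
    # Hypothetical table; the real site's markup is not shown in this post.
    sample = (
        "<table><tr><th>Date</th><th>Place</th></tr>"
        "<tr><td>2016-01-05</td><td>Fort St. John,\n BC</td></tr></table>"
    )
    for row in extract_rows(sample):
        print(row)
```

Working from the raw HTML keeps the row and cell boundaries that are lost when a page is flattened to plain text, which is what makes the output clean enough to geocode.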

Map of letter writers.

Time Lapse of locations of letter writers. 

At the end of day 2 each team had a chance to present their project to the other teams. You can view the presentation (Exploring Letters of protest for the BC Hydro Dam Site C) we prepared here, as well as the other team projects.

Why Web Archives Matter

How we preserve, collect, share and exchange cultural information has changed dramatically. The act of remembering at national institutions and libraries has altered greatly in terms of scope, speed and scale due to the web. The way in which we provide access to, use and engage with archival material has been disrupted. All current and future historians who want to study the periods after the 1990s will have to use web archives as a resource. Currently, accessibility and usability lag behind, and many students and historians are not ready. Projects like Archives Unleashed will help to equip researchers, historians, students and the community with the necessary tools to combat these problems. I look forward to seeing the next steps the project takes.

Archives Unleashed are currently accepting submissions for the next Datathon in March 2019. I highly recommend it.

The UK Web Archive: Mental Health, Social Media and the Internet Collection

The UK Web Archive hosts several Special Collections, curating material related to a particular theme or subject. One such collection is on Mental Health, Social Media and the Internet.

Since the advent of Web 2.0, people have been using the Internet as a platform to engage and connect, amongst other things, resulting in new forms of communication, and consequently new environments to adapt to – such as social media networks. This collection aims to illustrate how this has affected the UK, in terms of the impact on mental health. This collection will reflect the current attitudes displayed online within the UK towards mental health, and how the Internet and social media are being used in contemporary society.

We began curating material in June 2017, archiving various types of web content, including: research, news pieces, UK based social media initiatives and campaigns, charities and organisations’ websites, blogs and forums.

Material is being collected around several themes, including:

Body Image
Over the past few years, there has been a move towards using social media to discuss body image and mental health. This part of the collection curates material relating to how the Internet and social media affect mental health issues relating to body image. This includes research about developing theory in this area, news articles on various individuals’ experiences, as well as various material posted on social media accounts discussing this theme.

Cyber-bullying
This theme curates material, such as charities and organisations’ websites and social media accounts, which discuss, raise awareness and tackle this issue. Furthermore, material which examines the impact of social media and Internet use on bullying such as news articles, social media campaigns and blog posts, as well as online resources created to aid with this issue, such as guides and advice, are also collected.

Addiction

This theme collects material around gaming and other Internet-based activities that may become addictive such as social media, pornography and gambling. It includes recent UK-based research, studies and online polls, social media campaigns, online resources, blogs and news articles from individuals and organisations. Discourse, discussions, opinion and actions regarding different aspects of Internet addiction are all captured and collected in this overarching catchment term of addiction, including social media addiction.

The Mental Health, Social Media and the Internet Special Collection is available via the new UK Web Archive Beta Interface!

Co-authored with Carl Cooper

The UK Web Archive Ebola Outbreak collection

By CDC Global (Ebola virus) [CC BY 2.0 (http://creativecommons.org/licenses/by/2.0)], via Wikimedia Commons

Next month marks the fourth anniversary of the WHO’s public announcement of “a rapidly evolving outbreak of Ebola virus disease (EVD)” that went on to become the deadliest outbreak of EVD in history.

With more than 28,000 cases and 11,000 deaths, it moved with such speed and virulence that–though concentrated in Guinea, Liberia and Sierra Leone–it was feared at the time that the Ebola virus disease outbreak of 2014-2016 would soon spread to become a global pandemic.

No cure or vaccine has yet been discovered and cases continue to flare up in West Africa. The most recent was declared over on 2 July 2017. Yet today most people in the UK, unless directly affected, don’t give it a second thought.

Searching online now, you can find fact sheets detailing everything you might want to know about patient zero and the subsequent rapid spread of infection. You can find discussions detailing the international response (or failure to do so) and lessons learned. You might even find the reminiscences of aid workers and survivors. But these sites all examine the outbreak in retrospect and their pages and stories have been updated so often that posts from then can no longer be found.

Posts that reflected the fear and uncertainty that permeated the UK during the epidemic. The urgent status updates and travel warnings. The misinformation that people were telling each other. The speculation that ran riot. The groundswell of giving. The mobilisation of aid.

Understandably when we talk about epidemics the focus is on the scale of physical suffering: numbers stricken and dead; money spent and supplies sent; the speed and extent of its spread.

Whilst UKWA regularly collects the websites of major news channels and governmental agencies, what we wanted to capture was the public dialogue on, and interpretation of, events as they unfolded. To see how local interests and communities saw the crisis through the lenses of their own experience.

To this end, the special collection Ebola Outbreak, West Africa 2014 features a broad selection of websites concerning the UK response to the Ebola virus crisis. Here you can find:

  • The Anglican community’s view on the role of faith during the crisis;
  • Alternative medicine touting the virtues of liposomal vitamin C as a cure for Ebola;
  • Local football clubs fundraising to send aid;
  • Parents in the UK withdrawing children from school because of fear of the virus’ spread;
  • Think tanks’ and academics’ views on the national and international response;
  • Universities issuing guidance and reports on dealing with international students; and more.

Active collection for Ebola began in November 2014 at the height of the outbreak, whilst related websites dating back to the infection of patient zero in December 2013 have been retrospectively added to the collection. Collection continued through to January 2016, a few months before the outbreak began tailing off in April 2016.

The Ebola collection is available via the UK Web Archive’s new beta interface.

#WAWeek2017 – Researchers, practitioners and their use of the archived web

This year, the world of web archiving saw a premiere: not only were the biennial RESAW conference and the IIPC conference, established in 2016, held jointly for the first time, but they also formed part of a whole week of workshops, talks and public events around web archives – Web Archiving Week 2017 (or #WAWeek2017 for the social medially inclined).

After previous conferences in Reykjavik (2016) and Aarhus (RESAW 2015), the big 2017 event was held in London, 14-16 June 2017, organised jointly by the School of Advanced Study of the University of London, the IIPC and the British Library.

The programme was packed full of an eclectic variety of presentations and discussions, with topics ranging from the theory and practice of curating web archive collections or capturing whole national web domains, via technical topics such as preservation strategies, software architecture and data management, to the development of methodologies and tools for web archive-based research and case studies of their application.

Even in digital times, who doesn’t like a conference pack? Of course, the full programme is also available online. (…but which version will be easier to archive?)


Researchers, practitioners and their use of the archived web. IIPC Web Archiving Conference 15th June 2017

From the 14th – 16th of June researchers and practitioners from a global community came together for a series of talks, presentations and workshops on the subject of Web Archiving at the IIPC Web Archiving Conference. This event coincided with Web Archiving Week 2017, a week-long event running from 12th – 16th June hosted by the British Library and the School of Advanced Study.

I was lucky enough to attend the conference on the 15th June with a fellow trainee digital archivist and listen to some thoughtful, engaging and challenging talks.

The day started with a plenary in which John Sheridan, Digital Director of the National Archives, spoke about the work of the National Archives and the challenges and approaches to Web Archiving they have taken. The National Archives is principally the archive of the government; it allows us to see what the state saw through the state’s eyes. Archiving government websites is a crucial part of this record keeping as we move further into the digital age where records are increasingly born-digital. A number of points were made which highlighted the motivations behind web archiving at the National Archives.

  • They care about the records that government are publishing and their primary function is to preserve the records
  • Accountability for government services online or information they publish
  • Capturing both the context and content

By preserving what the government publishes online, it can be held accountable. Accountability is one aspect that demonstrates the inherent value of archiving the web. You can find a great blog post on accountability and digital services by Richard Pope at this link: http://blog.memespring.co.uk/2016/11/23/oscon-2016/

The published records and content on the internet provide valuable and crucial context for the records that are unpublished, linking the backstory and the published records. This allows for a greater understanding and analysis of the information and will be vital for researchers and historians now and into the future.

Quality assurance is a high priority at the National Archives. Having a narrow crawling focus has both allowed and prompted a lot of effort to be directed into the quality of the archived material so that it has a high fidelity in playback. To keep these high standards, it can take weeks to complete a really good in-depth crawl. Having a small curated collection is an incentive to work harder on capture.

The users and their needs were also discussed as this often shapes the way the data is collected, packaged and delivered.

  • Users want to substantiate a point. They cite the archived sites on Facebook or Twitter, for example
  • Writers and researchers need to cite sources
  • Legal – What was the government stance or the law at the time of my client’s case?
  • Researchers’ needs – This was highlighted as an area where improvements can be made
  • Government itself uses the archives for information purposes
  • Government websites request crawls before they close – An example of this is the NHS website transferring to a GOV.UK site

The last part of the talk focused on the future of web archiving and how this might take shape at the National Archives. Web archiving is complex and at times chaotic. Traditional archiving standards have been placed upon it in an attempt to order the records. It was a natural evolution for information managers and archivists to use their existing knowledge, skills and standards to bring this information under control. This has resulted in difficulties in searching across web archives, describing the content and structuring the information. The nature of the internet and the way in which the information is created mean that uncertainty inevitably has to be embraced.

Digital archiving could take the turn into 2.0, the second generation, moving away from the traditional standards and embracing new standards and concepts. One proposed method is the ICA Records in Contexts conceptual model. It proposes a multidimensional description with each ‘thing’ having a unique description, as opposed to the traditional unit of description (one size fits all). Instead of a single hierarchical, fonds-down approach, the Records in Contexts model uses a description that can be formed as a network or graph. The context of the fonds is broader, linking between other collections and records to give different perspectives and views. The records can be enriched this way and provide a fuller picture of the record or archive. The web produces content that is in a constant state of flux, and a system of description that can grow and morph over time, creating new links and context, would be a fruitful addition.

Visual Diagram of How the Records in Contexts Conceptual Model works

“This example shows some information about P.G.F. Leveau a French public notary in the 19th century including:
• data from the Archives nationales de France (ANF) (in blue); and
• data from a local archival institution, the Archives départementales du Cher (in yellow).” International Council on Archives: Records in Contexts: A Conceptual Model for Archival Description, p. 93

Traditional Fonds Level Description

I really enjoyed the conference as a whole and the talk by John Sheridan. I learnt a lot about the National Archives’ approach to web archiving, the challenges and where the future of web archiving might go. I’m looking forward to taking this new knowledge and applying it to the web archiving work I do here at the Bodleian.

Changes are currently being made to the National Archives Web Archiving site and it will relaunch on the 1st July this year. Why don’t you go and check it out?

EU Referendum Web Archiving Mini-internship – Part 1

On 20 and 21 June eight Oxford University students took part in a web archiving micro-internship at the Weston Library’s Centre for Digital Scholarship. Working with the UK Legal Deposit Web Archive, they contributed to the curation of a special collection of websites on the UK European Referendum. This is the first of two guest blog posts on the micro-internship.

Web archiving micro-interns on the roof of the Weston Library, June 2016.

Using library archives for research is nothing new for any student or scholar. However, web archives represent a completely new dimension of swiftly evolving research methods – they aim to document what is posted online – a relatively recent form of data collection made possible by technological advances.

For researchers used to traditional archives, the need to store and analyse this data might not be obvious. However, web archiving, despite being relatively new, is very significant. Firstly, it allows us to store information for generations of future historians and sociologists – contrary to common perception, much of the data held on the World Wide Web disappears or changes frequently and rapidly. Secondly, it might be an asset for those pursuing topical research projects in the present – recent technologies (such as the prototype SHINE database for historical research) allow us to trace data trends and come to important and fascinating conclusions. Therefore, even if some might underrate web archives, it surely does not diminish their utility to academia.

On the eve of the Brexit referendum, which sparked many debates and discussions in British web space, timely creation of a web collection has proven to be very important – after all, the decision is likely to have long-term consequences for our society, economy, and legal system. Traditionally, individual narratives and civic engagement are set aside when documenting major political decisions. However, a web collection can significantly improve this situation by collecting diverse standpoints expressed in the web sphere. This, in my opinion, perfectly mirrors the ethos of direct democracy where every vote and view counts.

However, important as it is, web archiving comes with a range of practical and ethical obstacles: with huge masses of information being stored online it is very hard to choose what is worthy of being preserved for future generations. Legal restrictions, such as the recent legal deposit legislation, also significantly limit the scope of archivists’ work. During my micro-internship I, along with other interns, tried to overcome these obstacles as much as possible, minimising bias and efficiently using our time resources and server memory. Even in the era of technology, it is the human resources and individual judgment that shape the scope and direction of the collection.

Working on a web collection, especially as campaigning intensified just before the referendum, was very challenging. However, as interns, we tackled the masses of information by focusing on individual areas of knowledge. Our work on the project was also aided by the guidance provided by our supervisors and discussions on the ethical and scientific implications of our research. This was a very rewarding insight into a new area of knowledge, and I am convinced that the skills and knowledge I acquired and applied during the internship will aid me in my future research career.

Anna Lukina

Web Archiving Micro-internship – Part 2

On 14 and 15 March eight Oxford University students took part in a web archiving micro-internship at the Weston Library’s Centre for Digital Scholarship. Working with the UK Legal Deposit Web Archive, they contributed to the curation of a special collection of websites on the UK European Referendum. This is the second of two guest blog posts on the micro-internship.

The most central aspect of modern life is now the proliferation of digital technology. Since the 1990s, it has become a central mode of communication which is often taken for granted. At the start of this micro-internship, we were introduced to the concept of the digital ‘black hole’, a term used to describe the irrevocable loss of this information. Unlike physical correspondence and materials–the letters, writs, and manuscripts of earlier centuries–so much of what we write is fragile and evanescent. To stem the loss of this digital history, we were shown how the Bodleian Libraries and other legal deposit libraries use domain crawls to capture online content at pre-determined intervals using the W3ACT tool. This then preserves a screen grab of the website on the Internet Archive, namely the Wayback Machine, before the website is updated.

Web archiving micro-interns working in the Centre for Digital Scholarship, Weston Library, March 2016.

The right of legal deposit libraries to a copy of electronic and other non-print publications, such as e-journals and CD-ROMs, only came into existence on 6th April 2013. This meant that libraries were able to create an archive of all websites with domains based in the United Kingdom. The recent ‘right to be forgotten’ law adopted by the EU signals that the legal status of digital archives is nevertheless becoming increasingly complicated, particularly when compiling archives of events receiving international commentary, like the upcoming EU referendum.

Each of us focused on a different aspect of the EU referendum, reflecting our individual interests, ranging from national newspapers and student newspapers to the blogs of Scottish MSPs, Welsh AMs, and MEPs, and the blogs of solicitors and legal firms’ websites offering advice to businesses and refugees in the event of a ‘Brexit’. One of the trickier views to archive was that of British expats living abroad. In this situation, unless the site can be proven to be based in the UK, we would have to write to the owner of the domain to request permission to archive the website. In a situation where permission was given but the person expressing those views subsequently wished to erase this history under the ‘right to be forgotten’ law adopted by the EU, should the UK have voted to leave the EU, this would leave the archived material in a tricky legal position. We learned during the internship that this would most likely result in the relevant archived material being deleted. However, this is exactly what the archive was set up to prevent, and so the tension between the right to privacy and freedom of information on a public platform presents considerable problems to the aim of web archives to be fully comprehensive, aggravated further by the omission of websites with paywalls.

After finding this material and ensuring it was covered by the legal deposit law, it was necessary to classify the site accurately, identifying the main language, and providing titles and descriptions. For newspaper articles, this was relatively straightforward, but for Welsh- and Irish-language publications produced by political parties (both languages I am studying at Jesus College), this was more complicated, as the only languages available to select from were German or English–a testament to the nascent stage of the web archive’s development. In addition, classifying material was very much up to our own individual discretion and the descriptions to our own style. To complicate things further, the order in which searched-for material should be presented raises further issues, which we discussed at the end of the micro-internship, namely whether results should be arranged by ‘most popular’, by date of publication, or by any other criterion. The discussions and practical experience offered by this internship gave us an opportunity to help address the legal and administrative challenges facing web archivists.

Daniel Taylor

Preserving Social Media – a briefing day

This post is a bit late as the DPC briefing day on Preserving Social Media was almost a month ago, but our excuse is that there was a lot of food for thought!

As digital archives trainees, Rachael and I have spent a lot of time thinking about preserving social media (a bit sad maybe, but true!). Everyone loves web 2.0: it’s dynamic and complex; it gives us the ability to communicate and interact across continents; and it’s a giant headache if you’re trying to archive it!

So as you can see we were quite excited about this briefing day, and it did not disappoint!

Throughout the day the talks were pretty evenly split between various means of capturing and curating social media and how researchers looked to access and use it, as well as the quality of datasets they were able to pull from it. They also touched on the legal ramifications of preserving it and there were a few case studies that discussed lessons learnt from institutions that are actively collecting social media.

Nathan Cunningham introduced us to the concept of the Big Data Network and the UK Data Archive. He talked about how much data and metadata the web was currently generating and the funding that the government was putting into it.

Sara Thomson’s keynote focused on different strategies for capturing and curating social media, including the pros and cons of Platform APIs, Data Resellers, Third-party Services and Platform Self-Archiving Services. She also argued the need for better integration of social media with web archives in order to contextualise the social media, including preserving archived pages of the content that URLs link to. She also called for more collaboration between institutions in terms of resources, access and methods/knowledge, and within institutions with their own researchers and end users.

Stephen Daisley from STV talked about Social Media & Journalism, about how it provided diverse and up-to-date coverage through non-traditional channels and its use as a tool for those underrepresented in mainstream media.

After lunch Katrin Weller from GESIS discussed how social scientists were using social media (For research! Not lolcats!) and the challenges of collecting, sharing and documenting it. Going back to the methods that Sara Thomson listed in her keynote, most involve a third party and come with restrictions on how the data can be shared, what tools can be used on it and how much data you are given. She highlighted the difficulties this can cause when researchers want to replicate or expand upon another researcher’s work, as well as other issues that come from using data that the researcher has not collected themselves.

Tom Storrar from the National Archives rounded off the presentations with a talk on how the UK Government’s social media presence was being captured for posterity. His project was to capture the UK Government’s official Twitter presence. This involved deciding what would be in scope including content and metadata, how they would collect this data and finally how they would present it.

Emily:

While I found Sara’s keynote interesting and quite informative (especially in terms of what is available out there, with a balanced view of what each option has to offer), it wasn’t as relevant as I had hoped, as it focused more on someone else providing the data to you than on the tools you can use to collect what you are interested in. There are many benefits to having authorised data resellers or the platform itself give you archiving abilities, especially being able to harvest all the associated metadata. Still, I like the flexibility and power that we get with Archive-It (though of course in some ways it will be a much shallower collection, as we only collect what the end user sees) and the fact that we aren’t restricted to the data that the providers think we want.

I’m glad that she talked about the need for collaboration so that we don’t all try to reinvent the wheel. At the Bodleian we’re quite lucky because we work closely with other legal deposit libraries to capture web content (including social media), so we regularly have the opportunity to discuss and learn from each other’s experiences. We also have our own Bodleian Libraries Web Archive, which we encourage our own researchers to use as a repository and as a resource that they can help us grow.

One thing that I found problematic was Stephen Daisley’s talk. Well not problematic, but perhaps a bit naïve? While I agreed with some of his points, I think he romanticises the notion of social media as the great equaliser. Off the top of my head I can think of at least one quite large group of underrepresented voices that are not getting their say in social media: the elderly. And I’m sure that there are many more examples that you can come up with if you stop to think about it too. Just because the barrier to access is much lower than for traditional news stations does not mean there is no barrier. The vast amount of data and metadata generated makes it tempting to believe that it is the whole of the story, but I think we need to remember who isn’t part of the conversation.

I also really enjoyed Tom Storrar’s presentation because it highlights the need to have a clear collection policy, to realise you can’t and shouldn’t capture everything, and to make your decisions transparent so that researchers will know exactly what they do and do not have to work with.

Rachael:

Although the talks on Big Data and social science research were less relevant to our work on the Bodleian Libraries Web Archive, they were an eye-opening introduction to the sheer amount of digital data which is collected. This might be for commercial research, profiting from the amount of information we give to social media sites, such as our name, nationality, photos, mobile number, address, and interests; or for forecasting purposes, such as predicting the results of political elections; or for academic study in areas such as activism, audiences, networks and crisis communication and response. I think Katrin Weller certainly succeeded in dismissing the claim that ‘99% of tweets are worthless babble’ – Weller, Social Media as Research Data, 27/10/2015.

Like Emily, I also enjoyed Tom Storrar’s presentation on the capture of government bodies’ Twitter and YouTube feeds. For me it really highlighted how complex the web of legislation is, requiring the National Archives to adapt to changing circumstances. If an organisation ceases to be a government body, the National Archives no longer has the right to capture its social media content. Because of these legal restrictions, no retweets or YouTube comments are captured, which means it is a one-way conversation. I think this is a shame, as we are losing that interaction which is so essential to social media. If YouTube comments are the modern-day equivalent of the letters sent to the government to comment on its policies, should we be preserving them?

Overall the day was full of fascinating talks and discussions on how to move forward in preserving social media. But the best part of the briefing day was knowing we weren’t alone! We got to talk to people approaching the preservation of social media from very different angles: the BBC, the National Archives, etc. And even though we all had different mandates and different foci, we still found a lot of common ground.

Event: Exploring the UK Web, 11 December 2015

 

Exploring the UK Web:
An introduction to web archives as scholarly resources

11 December 2015
2.00pm – 4.00pm

Venue: Lecture Theatre, Weston Library

Speakers: Jason Webber, Prof Jane Winters, Dr Gareth Millward, Prof Ralph Schroeder

‘The Web’, in the 25 years of its existence, has become deeply ingrained in modern life: it is where we find information, communicate, research, share ideas, shop, get entertained, set and follow trends and, increasingly, live our social lives.
As much as we rely on traditional paper archives today to find out about the past, for anyone trying to understand life in the late 20th and early 21st century, archived websites will be an invaluable resource.

Join us and our expert panel for an afternoon of exploring the archives of the UK web space, focusing on their potential use for research and teaching. Short presentations will introduce the resources and tools available for web archives research in the UK, and the opportunities (and challenges) they come with in theory and practice: from web archives curation, preservation and research tool development at the British Library, to current research in the Big UK Domain Data for the Arts and Humanities (BUDDAH) Project and at the Oxford Internet Institute.
Afterwards there will be plenty of time for questions and discussion – your chance to ask everything you ever wanted to know about web archives and to contribute your thoughts and ideas to an emerging discipline.

Admission free. All welcome.
To secure a place, please complete our booking form via What’s on.

Jason Webber is the Web Archiving Engagement and Liaison Manager at the British Library, working with the UK Web Archive and the Legal Deposit Web Archive.
Jane Winters is Professor of Digital History at the Institute of Historical Research, and Principal Investigator in the BUDDAH Project.
Gareth Millward is a Research Fellow at the London School of Hygiene and Tropical Medicine and one of the BUDDAH Project bursary holders.
Ralph Schroeder is a Senior Research Fellow at the Oxford Internet Institute.