Tag Archives: digital preservation

Newly available: Recollecting Oxford Medicine oral history project

Born-digital material from the Recollecting Oxford Medicine oral history project has been donated to the Weston Library since the early 2010s, and the project is still active today, with further interviews planned. A selection of interviews from the project is now available to listen to online via University of Oxford podcasts.

The Recollecting Oxford Medicine oral history project comprises interviews with Oxford medics. The interviews provide individual perspectives on both pre-clinical and clinical courses at the Oxford Medical School and on medical careers in Oxford and other locations, and give an insight into the evolution of clinical medicine at Oxford since the mid-1940s.

The interviewees have worked in a range of specialisms and departments, including psychiatry, neurology, endocrinology and dermatology, to name a few. In episodes 11-12 we can learn about Chris Winearls – a self-proclaimed ‘accidental Rhodes Scholar’ from medical school in Cape Town – his journey into nephrology and how he later became Associate Professor of Medicine for the university.

Listen to the Recollecting Oxford Medicine oral history podcast series online at https://podcasts.ox.ac.uk/series/recollecting-oxford-medicine-oral-histories

In episode 1, John Spalding, interviewed by John Oxbury in 2011, discusses working under Hugh Cairns, firstly as a student houseman at the Radcliffe Infirmary during the Second World War. Spalding also recounts his experience of the initial conception of the East Radcliffe Ventilator, which was first devised for use in the treatment of polio. In episode 13 we can listen to Derek Hockaday’s interview with Joan Trowell, former Deputy Director of Clinical Studies for Oxford Medical School, which amongst other topics covers her experience of roles held at the General Medical Council.

The majority of the interviews were undertaken by Derek Hockaday, former Oxford hospitals consultant physician and Emeritus Fellow of Brasenose College. The cataloguing and preservation of the oral history project is supported by Oxford Medical Alumni. The library acknowledges the donations of material and financial support by Derek Hockaday and OMA respectively.

Listeners may also be interested in the Sir William Dunn School of Pathology Oral Histories, the archive masters of which are also preserved in the Weston Library.

Archiving web content related to the University of Oxford and the coronavirus pandemic

Since March 2020, the scope of collection development at the Bodleian Libraries’ Web Archive has expanded to also focus on the coronavirus pandemic: how the University of Oxford and the wider university community have reacted and responded to the rapidly changing global situation and government guidance. The Bodleian Libraries’ Web Archive team have endeavoured (and will keep working) to capture, quality-assess and make publicly available records from the web relating to Oxford and the coronavirus pandemic. Preserving these ephemeral records is important. Just a few months into what is sure to be a long road, what do these records show?

Firstly, records from the Bodleian Libraries’ Web Archive can demonstrate how university divisions and departments are continually adjusting in order to facilitate core activities of learning and research. This could be by moving planned events online or organising and hosting new events relevant to the current climate:

Capture of http://pcmlp.socleg.ox.ac.uk/ 24 May 2020 available through the Bodleian Libraries’ Web Archive. Wayback URL https://wayback.archive-it.org/2502/20200524133907/https://pcmlp.socleg.ox.ac.uk/global-media-policy-seminar-series-victor-pickard-on-media-policy-in-a-time-of-crisis/

Captures of websites also provide an insight into the numerous collaborations of Oxford University with both the UK government and other institutions at this unprecedented time; that is, the role Oxford is playing and how that role is changing and adapting. Much of this can be seen in the ever-evolving news pages of departmental websites, especially those within the Medical Sciences Division, such as the Nuffield Department of Population Health’s collaboration with UK Biobank for the government’s Department of Health and Social Care, announced on 17 May 2020.

The web archive preserves records of how certain groups are contributing to coronavirus (Covid-19) research, front-line work and evidence reviews. This work moves at an extremely fast pace, which the curators at the Bodleian Libraries’ Web Archive attempt to capture by crawling more frequently. One example of this is the Centre for Evidence Based Medicine’s Oxford Covid-19 Evidence Service – a platform for rapid data analysis and reviews which is currently updated with several articles daily. Comparing two screenshots of different captures of the site, seven weeks apart, shows us the different themes of data being reviewed, and particularly how the ‘Most Viewed’ questions change (or indeed, don’t change) over time.

Capture of https://www.cebm.net/covid-19/ 14 April 2020 available through the Bodleian Libraries’ Web Archive. Wayback URL https://wayback.archive-it.org/org-467/20200414111731/https://www.cebm.net/covid-19/

Interestingly, the page location has changed slightly: the eagle-eyed among you may have spotted that the article reviews are now under /oxford-covid-19-evidence-service/, which is still within the web crawler’s scope.

Capture of https://www.cebm.net/covid-19/ 05 June 2020 available through the Bodleian Libraries’ Web Archive. Wayback URL https://wayback.archive-it.org/org-467/20200605100737/https://www.cebm.net/oxford-covid-19-evidence-service/
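
The Wayback URLs quoted in the captions above appear to follow a common pattern: a collection or organisation identifier, a 14-digit capture timestamp (YYYYMMDDhhmmss), then the original URL. As a minimal sketch, assuming that pattern holds, the capture dates can be read straight out of the URLs and compared. The function names here are illustrative and not part of any Archive-It tooling.

```python
import re
from datetime import datetime

# Assumed pattern: https://<host>/<collection>/<14-digit timestamp>/<original URL>
WAYBACK_PATTERN = re.compile(
    r"^https?://[^/]+/(?P<collection>[^/]+)/(?P<timestamp>\d{14})/(?P<original>.+)$"
)


def parse_wayback_url(wayback_url):
    """Return (collection, capture datetime, original URL) for a Wayback URL."""
    match = WAYBACK_PATTERN.match(wayback_url)
    if match is None:
        raise ValueError(f"Not a recognised Wayback URL: {wayback_url}")
    captured = datetime.strptime(match.group("timestamp"), "%Y%m%d%H%M%S")
    return match.group("collection"), captured, match.group("original")


def days_between(first_url, second_url):
    """Number of calendar days between two captures."""
    _, first, _ = parse_wayback_url(first_url)
    _, second, _ = parse_wayback_url(second_url)
    return (second.date() - first.date()).days


april = "https://wayback.archive-it.org/org-467/20200414111731/https://www.cebm.net/covid-19/"
june = ("https://wayback.archive-it.org/org-467/20200605100737/"
        "https://www.cebm.net/oxford-covid-19-evidence-service/")

print(days_between(april, june))  # 52 days, a little over seven weeks
```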

We welcome recommendations for sites to archive; if you would like to nominate a website for inclusion in the Bodleian Libraries’ Web Archive you can do so here. Meanwhile, the work to capture institutional, departmental and individual responses at this time continues.

Web Archiving & Preservation Working Group: Social Media & Complex Content

On January 16 2020, I had the pleasure of attending the first public meeting of the Digital Preservation Coalition’s Web Archiving and Preservation Working Group. The meeting was held in the beautiful New Records House in Edinburgh.

We were welcomed by Sara Day Thomson, who in her opening talk gave us a very clear overview of the issues and questions we increasingly run into when archiving complex or dynamic web and social media content. For example, how do we preserve apps like Pokémon Go that use a user’s location data or even personal information to individualize the experience? Or where do we draw the line in interactive social media conversations? After all, we cannot capture everything. But how do we even capture this information without infringing the rights of the original creators? These and more musings set the stage perfectly for the rest of the talks during the day.

Although I would love to include every talk held this day, as they were all very interesting, I will only highlight a couple of the presentations to give this blog some pretence at “brevity”.

The first talk I want to highlight was given by Giulia Rossi, Curator of Digital Publications at the British Library, on “Overview of Collecting Approach to Complex Publications”. Rossi introduced us to the Emerging Formats project, a two-year project by the British Library. The project focusses on three types of content:

  1. Web-based interactive narratives where the user’s interaction with a browser based environment determines how the narrative evolves;
  2. Books as mobile apps (a.k.a. literary apps);
  3. Structured data.

Personally, I found Rossi’s discussion of the collection methods particularly interesting. The team working on the emerging formats project does not just use heritage crawlers and other web harvesting tools, but also file transfers or direct downloads via access code and password. Most strikingly, in the event that only a partial capture can be made, they try to capture as much contextual information about the digital object as possible, including blog posts, screenshots or videos of walkthroughs, so researchers will have a good idea of what the original content would have looked like.

The capture of contextual content and the inclusion of additional contextual metadata about web content is currently not standard practice. Many tools do not even allow for their inclusion. However, considering that many of the web harvesting tools experience issues when attempting to capture dynamic and complex content, this could offer an interesting work-around for most web archives. It is definitely an option that I myself would like to explore going forward.

The second talk that I would like to zoom in on is “Collecting internet art” by Karin de Wild, digital fellow at the University of Leicester. Taking Agent Ruby – a chatbot created by Lynn Hershman Leeson – as her example, de Wild explored questions on how we determine what aspects of internet art need to be preserved and what challenges this poses. In the case of Agent Ruby, the San Francisco Museum of Modern Art initially exhibited the chatbot in a software installation within the museum, thereby taking the artwork out of its original context. They then proceeded to add it to their online Expedition e-space, which has since been taken offline. Only a screenshot of the online artwork is currently accessible through the SFMOMA website, as the museum prioritizes the preservation of the interface over the chat functionality.

This decision raises questions about the right ways to preserve online art. Does the interface indeed suffice or should we attempt to maintain the integrity of the artwork by saving the code as well? And if we do that, should we employ code restitution, which aims to preserve the original artwork’s code, or a significant part of it, whilst adding restoration code to reanimate defunct code to full functionality? Or do we emulate the software, as the University of Freiburg is currently exploring? How do we keep track of the provenance of the artwork whilst taking into account the different iterations that digital artworks go through?

De Wild proposed to turn to linked data as a way to keep track of particularly the provenance of an artwork. Together with two other colleagues she has been working on a project called Rhizome in which they are creating a data model that will allow people to track the provenance of internet art.

Although this is not within the scope of the Rhizome project, it would be interesting to see how the finished data model would lend itself to keeping track of changes in the look and feel of regular websites as well. The layouts of websites have changed radically over the past number of years, yet these changes are usually not documented in metadata or data models, even though they can be as much of a reflection of social and cultural changes as the content of the website. Going forward it will be interesting to see how the changes in archiving online artworks will influence the preservation of online content in general.

The final presentation I would like to draw attention to is “Twitter Data for Social Science Research” by Luke Sloan, deputy director of the Social Data Science Lab at the University of Cardiff. He provided us with a demo of COSMOS, an alternative to the Twitter API, which is freely available to academic institutions and not-for-profit organisations.

COSMOS allows you to either target a particular Twitter feed or enter a search term to obtain a 1% sample of the total worldwide Twitter feed. The gathered data can be analysed within the system and is stored in JSON format. The information can subsequently be exported to a .CSV or Excel format.
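
To give a feel for what a JSON-to-CSV export involves, here is a minimal sketch in plain Python. It is not COSMOS code, and I do not know the exact shape of a COSMOS export. It assumes one JSON tweet object per line, using the field names from Twitter’s v1.1 tweet objects (`id_str`, `created_at`, `text`, `user.screen_name`); the function name `tweets_to_csv` is purely illustrative.

```python
import csv
import io
import json

# Assumed column set; a real export would likely carry many more fields.
FIELDS = ["id_str", "created_at", "screen_name", "text"]


def tweets_to_csv(json_lines, out):
    """Write one CSV row per tweet from an iterable of JSON strings.

    Returns the number of tweets written. Missing fields become empty cells.
    """
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    count = 0
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        tweet = json.loads(line)
        writer.writerow({
            "id_str": tweet.get("id_str", ""),
            "created_at": tweet.get("created_at", ""),
            # The screen name is nested inside the "user" object.
            "screen_name": tweet.get("user", {}).get("screen_name", ""),
            "text": tweet.get("text", ""),
        })
        count += 1
    return count


sample = ['{"id_str": "1", "created_at": "Thu Jan 16 10:00:00 +0000 2020", '
          '"user": {"screen_name": "example"}, "text": "Hello, archive"}']
buffer = io.StringIO()
tweets_to_csv(sample, buffer)
print(buffer.getvalue())
```

The `csv` module takes care of quoting, so commas and line breaks inside tweet text do not break the columns.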

Although the system is only able to capture new (or live) Twitter data, it is possible to upload historical Twitter data into the system if an archive has access to it.

Having explained how COSMOS works, Sloan asked us to consider the potential risks that archiving and sharing Twitter data could pose to the original creator. Should we not protect these creators by anonymizing their tweets to a certain extent? If so, what data should we keep? Do we only record the tweet ID and the location? Or would this already make it too easy to identify the creator?

The last part of Sloan’s presentation tied in really well with the discussion about the ethical approaches to archiving social media. During this discussion we were prompted to consider ways in which archives could archive Twitter data, whilst being conscious of the potential risks to the original creators of the tweets. This definitely got me thinking about the way we currently archive some of the Twitter accounts related to the Bodleian Libraries in our very own Bodleian Libraries Web Archive.

All in all, the DPC event definitely gave me more than enough food for thought about the ways in which the Bodleian Libraries and the wider community in general can improve the ways we capture (meta)data related to the online content that we archive and the ethical responsibilities that we have towards the creators of said content.

Because Digital Objects can Decay too: Conducting a Proof of Concept for Archivematica

Like other archives, the Bodleian Libraries has been searching for ways to optimize the conservation of our digital collections. The need to find a solution has become increasingly pressing as the Bodleian Electronic Archives and Manuscripts (BEAM), our digital repository service for the management of born-digital archives and manuscripts acquired by the Special Collections, now contains roughly 13TB worth of digital objects, with much more waiting in the wings.

In order to help us manage the ingest of digital objects within our collections, the Bodleian Libraries undertook an options review as part of its DPOC project. This led to a decision to conduct a proof of concept of Archivematica. This proof of concept included the installation of a QA and DEV environment with the help of Artefactual, followed by an extensive testing period and a gap analysis.

In November 2018 we started testing the system to establish whether or not Archivematica met our acceptance criteria. We mainly focussed on three areas:

  1. Overall performance/functionality: Is the system user-friendly? Can it successfully process all the different file types and sizes that we have in our collection?
  2. Metadata: Can Archivematica extract the metadata from the Excel sheets that we have created over time? What technical metadata does Archivematica automatically extract from ingested files?
  3. File extraction and normalization: Are disk images extracted properly? Is the content of a transfer normalized to the right file type?

Whilst testing, we also reached out to and visited other organisations that had already implemented Archivematica, including the International Institute of Social History in Amsterdam, the University of Edinburgh, the National Library of Wales and the Wellcome Trust.

Based on the outcomes of the tests we conducted, and the conversations we had with other institutions, we identified five gap areas:

  1. Performance: The Archivematica instance we configured for the Proof of Concept struggled with transfers over 200GB or transfers that contain more than 5000 files.
  2. Error reporting: It was often unclear what a particular error code and message meant. The error logs used by system administrators are also verbose, making it hard for them to pinpoint the error.
  3. Metadata: Here we identified two gaps. Firstly, there is the verbosity of the metadata. Because Archivematica records individual PREMIS events for each digital file, the resulting METS file becomes unwieldy, compromising the system’s performance. Secondly, we require a workflow to migrate our spreadsheet-held legacy pre-ingest capture metadata and file-level metadata into Archivematica. Future ingests must also be able to include this pre-ingest metadata, which will continue to be recorded in spreadsheet form for the foreseeable future.
  4. User/access management: Archivematica does not offer a way to manage access to collections or Archival Information Packages, and allows all users to alter the system workflow. We are a multi-user organisation, and wish to have tighter controls on access to collections and workflow configurations.
  5. General reporting: Archivematica currently does not offer many reports to monitor progress, content and growth of collections.
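
The performance gap above suggests a simple pre-ingest check: count the files and bytes in a prospective transfer before handing it over. The following is a minimal sketch in plain Python, not part of Archivematica. The thresholds are the ones observed in our proof of concept, while the function names and the decimal reading of "200GB" are assumptions made for illustration.

```python
import os

# Thresholds observed in the proof of concept; 200GB is read here as decimal gigabytes.
MAX_FILES = 5000
MAX_BYTES = 200 * 1000**3


def summarise_transfer(root):
    """Walk a transfer directory and return (file count, total size in bytes)."""
    total_files = 0
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # do not count symbolic links as content
            total_files += 1
            total_bytes += os.path.getsize(path)
    return total_files, total_bytes


def exceeds_limits(total_files, total_bytes,
                   max_files=MAX_FILES, max_bytes=MAX_BYTES):
    """True if a transfer is larger than either threshold."""
    return total_files > max_files or total_bytes > max_bytes
```

A transfer that exceeds either limit could then be split into smaller transfers before ingest, for example `exceeds_limits(*summarise_transfer("/path/to/transfer"))`, where the path is a placeholder.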

Once we had identified these gaps, we held an intensive two-day workshop with Artefactual to pinpoint possible solutions, which we subsequently presented to the wider Archivematica community during the Archivematica Camp in London in July 2019.

We will use all the input gathered from the proof of concept to inform our initial implementation of Archivematica, which will begin in January 2020. The project will focus on the performance and metadata gaps identified during the proof of concept, allowing us to bring Archivematica into production use in 2021. We are keen to work with the Archivematica community, so do get in touch at beam@bodleian.ox.ac.uk if you’re interested in finding out more about our work.

Online Enthusiast Communities in the UK Web Archive

There is a saying that ‘variety is the spice of life’ and this is certainly true when you think of the types of hobbies and interests the UK public engages in. There are the hobbies we have all probably heard of, such as train spotting or metal detecting, and there are the more obscure ones, such as Poohsticks or Hand Dryer appreciation. Websites are a useful tool for enthusiasts to communicate and share their passion with the world. At the UK Web Archive (UKWA) the Online Enthusiast Communities collection aims to:

‘Capture how UK based public forums are used to discuss hobbies and activities and serve as a place for enthusiasts to converse with others sharing similar interests.’

This collection includes such a diverse and wonderful selection of websites and forums. I can honestly say that curating this collection has truly been a joy – there are probably very few jobs that allow you to look at The Letter Box Study Group (a website about the history and development of British roadside letter boxes) as part of your tasks for the day.

Differences I have noticed

As a curator you get to explore lots of sites and you begin to notice differences and similarities between websites. It is interesting to see the variety in website design and levels of expertise and to me it feels like this is reflected in the websites that are archived.

I have noticed lots of online communities using a variety of website builders. The huge diversity in tools appears to have made it easier to create more professional-looking sites. Compared to older sites, you notice:

  • the increased use of images
  • cleaner feel
  • neutral backgrounds
  • minimal text
  • occasional e-commerce sections

However, it is nostalgic to see some of the older, more ‘blocky’ sites, as I do remember the days of dial-up internet access and early websites. To me, forums tend to have a similar feel and their designs do not deviate greatly from each other.

I have also found it intriguing how often a website updates. Some are regularly updated whereas others appear to have been untouched for several years. This may reflect that many websites are run by volunteers balancing other commitments. Regularity of updates is an important factor, as it helps decide how often we capture the site. It is the skill of a web archivist to judge this accordingly, although these frequencies can be changed later.

Some of my Favourite sites

One of the joys of curating this collection is that you get to experience sites that are really unique that you would not normally explore. I wanted to highlight a few of the sites that particularly caught my attention, specifically from the ‘Miscellaneous’ sub section as this is my personal favourite.

Pylon of the Month

Pylon of the month (February 2018) from Sweden. Image Credit: Kristin Allardh, 2018

This is a site dedicated to electricity pylons highlighting a monthly winner. These could include current pylons or historic images and entries can come from the UK and beyond. Images are usually accompanied by some interesting history or facts.

Modernist Britain

Odeon cinema Leicester, Leicestershire. Image Credit: Richard Coltman, 2010

This site is beautifully designed and celebrates modernist architecture in Britain. There are fifty illustrated images with accompanying information about the history of the buildings and photographs taken by Richard Coltman.

Cloud appreciation society

A Lenticular cloud. Image Credit: © José Ramón Sáez, 2019

This site was launched in 2005 with the aim of ‘bringing together people who love the sky’. It has an international membership, with members submitting images from all over the world. They also run events and publish cloud-related news, and in 2019 they are contributing to the non-profit FogQuest project.

The online enthusiast community is also very witty; there are some fantastically named sites and forums such as:

  • Planet of the Vapes – a forum about vaping
  • DIYnot Forum – a forum about DIY
  • Frit-Happens! – an online community for glass blowing and glass crafting

Curating the online enthusiast collection has been incredibly enjoyable. Having to actively seek new sites has made me more aware of the variety of hobbies and diversity of interests the public engage in.

As this collection develops, more sites relating to the variety of hobbies and interests will be captured and preserved for future generations to explore, enjoy and research. However, due to the size, complexity and technological challenges of archiving all UK websites, some may get missed, or we may simply not know about them. If there is a site that you think should be included, you can nominate it on the ‘Save a UK website’ page of the UKWA.

Developing collections on Gender Equality at the UK Web Archive

The Gender Equality collection

The UK Web Archive Gender Equality collection and its themed subsections provide a rich insight into attitudes and approaches towards gender equality in contemporary UK society and culture. I discussed this in my last blog post about the collection, which you can read here.

Curating the collection

A great deal of the discussion and activity relating to gender equality occurs online. This means that, as a curator for the Gender Equality collection, the harvest is plentiful! The type of content being collected by the UK Web Archive includes:

Of course there is some crossover, not only regarding the type of content but also within subsections of the gender equality collection.

This image is made available and reproduced by CC-BY-NC-SA 2.0. [https://creativecommons.org/licenses/by-nc-sa/2.0/legalcode]

Specifically, I find the event sites in the collection really interesting. As well as documenting that the event(s) even existed and happened in the first place, they can give us a snapshot of who organised the event, as well as who the intended audience was. Also, the collection exhibits the evolution of websites related to gender equality over time (which can be very speedy indeed when it comes to sites like Twitter accounts!), and the changing priorities, trends, initiatives and more that can tell us about attitudes towards gender equality in the UK. These kinds of websites are being created and engaged with by people right now.

Nominate a website!

The endeavour of the UK Web Archive never stops – if you would like to help grow the Gender Equality collection (or indeed, any other collections) click here to nominate a website to save. Go on…whilst you’re at it, you can explore the UK Web Archive’s funky new interface!

 

Image reference: Workers Solidarity Movement (2012) March for Choice

 

Sixth British Library Labs Symposium

On Monday November 12, 2018 I was fortunate enough to attend the annual British Library Labs Symposium. During the symposium the British Library showcases the projects that they have been working on for their digital collections and issues awards to those who either contributed to those projects or used the digital collections to create their own projects.

According to Adam Farquhar, Head of Digital Scholarship at the British Library, this year’s symposium was their biggest and best attended yet: a testimony to the growing importance of digitization, as well as digital preservation and curation, within both archives and libraries.

This year’s theme of 3D models and scanning was wonderfully introduced by Daniel Prett, Head of Digital and IT at the Fitzwilliam Museum in Cambridge, in his keynote lecture on ‘The Value, Impact and Importance of experimenting with Cultural Heritage Digital Collections’. He explained how, during his time with the British Museum, they began to experiment with the creation of digital 3D models. This eventually led to the purchase of a rig with multiple cameras, allowing them to take better quality photos in less time. At the Fitzwilliam Museum, Prett has continued to advocate the development of 3D imaging. The museum now even offers free 3D imaging workshops open to anyone who is in possession of a laptop and any device that has a camera (including a smartphone).

Although Prett shared many of his other successful projects with us, he also emphasized that much of digitization is about trial and error, and stressed the importance of recording those errors. Unfortunately, libraries and archives alike are prone to celebrate their successes but cover up their errors, even though we may learn just as much from them. Prett called upon all attendees to share their errors more frequently, so we may learn from each other.

During the break I wandered into a separate room where individuals and companies showcased the projects that they developed in relation to the digital libraries special collections. A lucky few managed to lay their hands on a VR headset in order to experience Project Lume (a virtual data simulation program) and part of the exhibition by Nomad. The British Library itself showcased its own digitization services, including 360° spin photography and 3D imaging. The latter led to some interesting discussions about the de- and re-contextualization of artworks when using 3D imaging technology.

In the midst of all this there was one stand that did not lure its spectators with fancy technology or gadgets. Instead, Jonah Coman, winner of the BL Teaching & Learning Award, showcased the small zines that he created. The format of these Pocket Miscellany, as they are called, is inspired by small medieval manuscripts, and they are intended to inform their readers about marginalized bodies, disability and queerness in medieval literature. Due to copyright issues these zines are not available for purchase, but can be found on Coman’s Patreon website.

The BL Labs symposium also showed how the digital collections of the British Library can inspire both art and fashion. Fashion designer Nabil Nayal, who unfortunately could not accept his BL Labs Commercial Award in person, had used the Elizabethan digital collections as inspiration for the collection he presented at the British Library during London Fashion Week.

Artist Richard Wright, on the other hand, looked to the library’s infrastructure for inspiration. This resulted in The Elastic System, a virtual mosaic of hundreds of British Library books that together make up a sketch of Thomas Watts. When you zoom in on the mosaic you can browse the books in detail and can even order them through a link to the BL’s catalogue that is integrated in the picture. Once a book is checked out, it reveals the pictures of BL employees working in the stacks to collect the books. It thereby slowly reveals a part of the library that is usually hidden from view.

Another fascinating talk was given by artist Michael Takeo Magruder about his exhibition on Imaginary Cities which will be staged at the British Library’s entrance hall from 5 April to 14 July 2019. Magruder is using the library’s 19th and early 20th century maps collection to create new and ever changing maps and simulations of virtual, fantastical cities. Try as I might, I fear I cannot do justice to Magruder’s unique and intriguing artwork with words alone and can therefore only urge you to go visit the exhibition this coming year.

These are only a few of the wonderful talks that were given during the Labs symposium. The British Library Labs symposium was a real eye-opener for me. I did not realize just how quickly the field of 3D imaging had developed within the museum and library world. Nor did I realize how digital collections could be used, not simply to inspire, but to create actual artworks.

Yet, one of the things that struck me most is how much the development of and advocacy for the use of digital collections within archives and libraries is spurred on by passionate individuals, be they artists who use digital collections to inspire their work, digital and IT specialists willing to sacrifice a lunch break or two for the sake of progress, or individual scholars who create little zines to spread awareness about a topic they feel passionate about. Imagine what they can do if initiatives like the BL Labs continue to bring such people together. I, for one, cannot wait to see what the future for digital collections and scholarship holds. On to next year’s symposium.

 

Archives Unleashed – Vancouver Datathon

On the 1st-2nd of November 2018 I was lucky enough to attend the Archives Unleashed Datathon Vancouver, co-hosted by the Archives Unleashed Team and Simon Fraser University Library along with KEY (SFU Big Data Initiative). I was very thankful and appreciative of the generous travel grant from the Andrew W. Mellon Foundation that made this possible.

The SFU campus at the Harbour Centre was an amazing venue for the Datathon and it was nice to be able to take in some views of the surrounding mountains.

About the Archives Unleashed Project

The Archives Unleashed Project is a three-year project with a focus on making historical internet content easily accessible to scholars and researchers whose interests lie in exploring and researching both the recent past and contemporary history.

After a series of datathons held at a number of international institutions such as the British Library, University of Toronto, Library of Congress and the Internet Archive, the Archives Unleashed Team identified some key areas of development that would help to deliver their aim of making petabytes of valuable web content accessible.

Key Areas of Development
  • Better analytics tools
  • Community infrastructure
  • Accessible web archival interfaces

By engaging and building a community, alongside developing web archive search and data analysis tools, the project is successfully enabling a wide range of people including scholars, programmers, archivists and librarians to “access, share and investigate recent history since the early days of the World Wide Web.”

The project has a three-pronged approach
  1. Build a software toolkit (Archives Unleashed Toolkit)
  2. Deploy the toolkit in a cloud-based environment (Archives Unleashed Cloud)
  3. Build a cohesive user community that is sustainable and inclusive by bringing together the project team members with archivists, librarians and researchers (Datathons)

Archives Unleashed Toolkit

The Archives Unleashed Toolkit (AUT) is an open-source platform for analysing web archives with Apache Spark. I was really impressed by AUT due to its scalability, relative ease of use and the huge number of analytical options it provides. It can work on a laptop (macOS, Linux or Windows), a single-node server or a powerful cluster, and if you wanted to, you could even use a Raspberry Pi to run AUT. The Toolkit allows for a number of search functions across the entirety of a web archive collection. You can filter collections by domain, URL pattern, date, language and more; create lists of URLs to return the top ten in a collection; and extract plain text from the HTML files in an ARC or WARC file, cleaning the data by removing ‘boilerplate’ content such as advertisements. It’s also possible to use the Stanford Named Entity Recognizer (NER) to extract the names of entities such as locations, organisations and persons. I’m looking forward to seeing how this functionality is adapted to localised instances and controlled vocabularies – would it be possible to run a similar programme for automated tagging of web archive collections in the future? Maybe we could ingest a collection into AUT, run NER and automatically tag up the data, providing richer metadata for web archives and subsequent research.
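
To illustrate the kind of "top ten" analysis described above, here is a small sketch in plain Python. It does not use AUT or Spark and does not reflect AUT’s actual API. It simply assumes you already have a list of captured URLs and counts how often each domain appears; the function name `top_domains` is hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


def top_domains(urls, n=10):
    """Return the n most frequent domains in a list of URLs as (domain, count) pairs."""
    counts = Counter()
    for url in urls:
        host = urlparse(url).hostname
        if not host:
            continue  # skip strings that are not full URLs
        if host.startswith("www."):
            host = host[4:]  # treat www.example.org and example.org as one domain
        counts[host] += 1
    return counts.most_common(n)


captured = [
    "https://www.cebm.net/covid-19/",
    "https://www.cebm.net/oxford-covid-19-evidence-service/",
    "https://podcasts.ox.ac.uk/series/recollecting-oxford-medicine-oral-histories",
]
print(top_domains(captured, n=1))  # [('cebm.net', 2)]
```

AUT performs this sort of aggregation at scale over ARC and WARC files; the sketch only shows the underlying idea on a handful of URLs.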

Archives Unleashed Cloud

The Archives Unleashed Cloud (AUK) is a GUI-based front end for working with AUT; it essentially provides an accessible interface for generating research derivatives from web archive files (WARCs). With a few clicks users can ingest and sync Archive-It collections, analyse the collections, create network graphs and visualise connections and nodes. It is currently free to use and runs on AUK central servers.

My experience at the Vancouver Datathon

The datathons bring together a small group of 15-20 people of varied professional backgrounds and experience to work and experiment with the Archives Unleashed Toolkit and the Archives Unleashed Cloud. I really like that the team have chosen to limit the number of attendees because it created a close-knit working group that was full of collaboration, knowledge and idea exchange. It was a relaxed, fun and friendly environment to work in.

Day One

After a quick coffee and light breakfast, the Datathon opened with introductory talks from project team members Ian Milligan (Principal Investigator), Nick Ruest (Co-Principal Investigator) and Samantha Fritz (Project Manager), relating to the project – its goals and outcomes, the toolkit, available datasets and event logistics.

Another quick coffee break and it was back to work – participants were asked to think about the datasets that interested them, techniques they might want to use and questions or themes they would like to explore and write these on sticky notes.

Once the notes were placed on the whiteboard, teams naturally formed around datasets, themes and questions. The team I was in, with Kathleen Reed and Ben O’Brien, formed around a common interest in exploring the First Nations and Indigenous communities dataset.

Virtual machines were kindly provided by Compute Canada and were available throughout the Datathon to run AUT; datasets were preloaded onto these VMs and a number of derivative files had already been created. We spent some time brainstorming, sharing ideas and exploring datasets using a number of different tools. The day finished with some informative lightning talks about the work participants had been doing with web archives at their home institutions.

Day Two

On day two we continued to explore the datasets by using the full-text derivatives, running some NER and performing keyword searches using the command-line tool grep. We also analysed the text using sentiment analysis with the Natural Language Toolkit. To help visualise the data, we took the new text files produced from the keyword searches and uploaded them into Voyant Tools. This helped by visualising links between words, creating a list of top terms and providing quantitative data such as how many times each word appears. It was here we found that the word ‘letter’ appeared quite frequently, and we finalised the dataset we would be using – University of British Columbia – bc-hydro-site-c.
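The keyword search and term counting described above can be approximated in a few lines of standard-library Python. This is only a sketch of the general idea, not the commands we ran at the Datathon; the function names and the sample lines are hypothetical stand-ins for a full-text derivative file.

```python
import re
from collections import Counter


def keyword_lines(lines, keyword):
    """Keep lines containing keyword, ignoring case (similar to grep -i)."""
    needle = keyword.lower()
    return [line for line in lines if needle in line.lower()]


def top_terms(lines, n=5, min_length=4):
    """Count lower-cased words that have at least min_length characters."""
    words = re.findall(r"[a-z']+", " ".join(lines).lower())
    return Counter(w for w in words if len(w) >= min_length).most_common(n)


# Hypothetical lines standing in for a full-text derivative file.
sample = [
    "Letter to the minister about the dam",
    "Public meeting scheduled for Tuesday",
    "A letter from a resident of the valley",
    "Open letter: protect the river",
]

matches = keyword_lines(sample, "letter")
print(len(matches), top_terms(matches, n=3))
```

The length cut-off is a crude substitute for a stop-word list; a tool like Voyant Tools handles that filtering for you.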

We hunted down the site and found it contained a number of letters from people about the BC Hydro Dam Project. The problem was that the letters were in a table and when extracted the data was not clean enough. Ben O’Brien came up with a clever extraction solution utilising the raw HTML files and some script magic. The data was then prepped for geocoding by Kathleen Reed to show the geographical spread of the letter writers, hot-spots and timeline, a useful way of looking at the issue from the perspective of engagement and the community.
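The post does not record how Ben's script worked, so the following is purely an illustration of one way to pull table rows out of raw HTML with Python's built-in `html.parser`. The class name, the column layout and the sample markup are all assumptions.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse line breaks and repeated spaces inside the cell.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


# Hypothetical markup; the real page's column layout is not given in the post.
html = """
<table>
  <tr><th>Date</th><th>Location</th><th>Letter</th></tr>
  <tr><td>2016-01-05</td><td>Sample Town</td><td>Please reconsider
      the project.</td></tr>
</table>
"""

parser = TableRows()
parser.feed(html)
header, *letters = parser.rows
print(header)
print(letters)
```

Rows in this shape, with a date and a place per letter, are what a geocoding step needs as input.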

Map of letter writers.

Time Lapse of locations of letter writers. 

At the end of day 2 each team had a chance to present their project to the other teams. You can view the presentation (Exploring Letters of protest for the BC Hydro Dam Site C) we prepared here, as well as the other team projects.

Why Web Archives Matter

How we preserve, collect, share and exchange cultural information has changed dramatically. The act of remembering at national institutions and libraries has altered greatly in terms of scope, speed and scale due to the web. The way in which we provide access to, use and engage with archival material has been disrupted. All current and future historians who want to study the periods after the 1990s will have to use web archives as a resource. Currently, accessibility and usability have lagged behind, and many students and historians are not ready. Projects like Archives Unleashed will help to equip researchers, historians, students and the community with the necessary tools to combat these problems. I look forward to seeing the next steps the project takes.

Archives Unleashed are currently accepting submissions for the next Datathon in March 2019; I highly recommend it.

Building collections on Gender Equality at the UK Web Archive

The Bodleian is one of the six legal deposit libraries in the UK. One of my projects this year as a graduate trainee digital archivist on the Bodleian Libraries’ Developing the Next Generation Archivist programme is to help curate special collections in the UK Web Archive. Since May I’ve been working on the Gender Equality collection. Please note, this post also appears on the British Library UK Web Archive blog.

Why are we collecting?

2018 is the centenary of the Representation of the People Act 1918. UK-wide memorials and celebrations of this journey, and of the victory of women’s suffrage, are all evident online: events, exhibitions, commemorations and campaigns. Popular topics being discussed at the moment include the hashtags #timesup and #metoo, gender pay disparity and the recent referendum on the 8th Amendment in the Republic of Ireland. These discussions produce a lot of ephemeral material, and without web archiving this material is at risk of moving or even disappearing. Gender equality is being discussed a lot in the media at present, but these discussions have been developing over years.

Through the UK Web Archive SHINE interface we can see that matching text for the phrase ‘gender equality’ increased from a result of 0.002% (24 out of 843,204) of crawled resources in 1996, to 0.044% (23,289 out of 53,146,359) in 2013.
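The percentages can be checked from the counts quoted above. The short sketch below simply redoes the division (the function name `share_percent` is my own). Worked through, 24 out of 843,204 comes to roughly 0.0028%, so the 0.002% figure looks truncated rather than rounded; the 2013 figure matches 0.044% when rounded.

```python
def share_percent(matches, total):
    """Percentage of crawled resources that matched the phrase."""
    return 100 * matches / total


# Counts quoted in the text above.
share_1996 = share_percent(24, 843_204)
share_2013 = share_percent(23_289, 53_146_359)

print(f"1996: {share_1996:.4f}%")
print(f"2013: {share_2013:.4f}%")
print(f"growth: about {share_2013 / share_1996:.1f}x")
```

Comparing shares rather than raw counts matters here, because the number of crawled resources itself grew enormously between 1996 and 2013.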

SHINE user interface

If we search UK web content relating to gender equality we will generate a great many results; for example, organisations have published their gender pay discrepancy reports online and there is much to engage with from social media accounts of both individuals and organisations relating to campaigning for gender equality. When we browse this web content, it becomes apparent that gender equality means something different to the many presences online: charities, societies, employers, authorities, heritage centres and individuals such as social entrepreneurs, teachers, researchers and more.

The Fawcett Society: https://www.fawcettsociety.org.uk/blog/why-does-teaching-votes-for-women-matter-an-a-level-teachers-perspective

What are we collecting?

The Gender Equality special collection, which is now live on the UK Web Archive, comprises material which provides a snapshot of attitudes towards gender equality in the UK. Web material is harvested under the areas of:

  • Bodily autonomy
  • Domestic abuse/Gender based violence
  • Gender equality in the workplace
  • Gender identity
  • Parenting
  • The gender pay gap
  • Women’s suffrage

100 years on from women’s suffrage the fight for gender equality continues. The collection is still undergoing curation and growing in archival records – and you can help too!

How to get involved?

If there are any UK websites that you think should be added to the Gender Equality collection then you can take up the UK Web Archive’s call to action and nominate them.

The UK Web Archive: Mental Health, Social Media and the Internet Collection

The UK Web Archive hosts several Special Collections, curating material related to a particular theme or subject. One such collection is on Mental Health, Social Media and the Internet.

Since the advent of Web 2.0, people have been using the Internet as a platform to engage and connect, amongst other things, resulting in new forms of communication, and consequently new environments to adapt to – such as social media networks. This collection aims to illustrate how this has affected the UK, in terms of the impact on mental health. This collection will reflect the current attitudes displayed online within the UK towards mental health, and how the Internet and social media are being used in contemporary society.

We began curating material in June 2017, archiving various types of web content, including: research, news pieces, UK based social media initiatives and campaigns, charities and organisations’ websites, blogs and forums.

Material is being collected around several themes, including:

Body Image
Over the past few years, there has been a move towards using social media to discuss body image and mental health. This part of the collection curates material relating to how the Internet and social media affect mental health issues relating to body image. This includes research about developing theory in this area, news articles on various individuals’ experiences, as well as various material posted on social media accounts discussing this theme.

Cyber-bullying
This theme curates material, such as charities and organisations’ websites and social media accounts, which discusses, raises awareness of and tackles this issue. Furthermore, material which examines the impact of social media and Internet use on bullying, such as news articles, social media campaigns and blog posts, as well as online resources created to aid with this issue, such as guides and advice, is also collected.

Addiction

This theme collects material around gaming and other Internet-based activities that may become addictive such as social media, pornography and gambling. It includes recent UK based research, studies and online polls, social media campaigns, online resources, blogs and news articles from individuals and organisations. Discourse, discussions, opinion and actions regarding different aspects of Internet addiction are all captured and collected in this overarching catchment term of addiction, including social media addiction.

The Mental Health, Social Media and the Internet Special Collection is available via the new UK Web Archive Beta Interface!

Co-authored with Carl Cooper