Based at the British Library and officially a collaborative effort between all six legal deposit libraries, UKWA has been at work since 2005. Their scope and reach have expanded since then and their records go back to 1996, even if their team is still small for such an encompassing endeavour. To put that in perspective, the World Wide Web only came into existence as a publicly accessible network in 1990. Jason Webber, the team lead, tries not to worry too much about those intervening six years, although you can tell that if anyone had that data he’d want UKWA to incorporate it.
In 2013 UKWA made their first full annual trawl of everything that could be considered a UK public website. That’s millions of websites, billions of individual assets, and hundreds of terabytes of data every year, and it’s growing all the time. They don’t get anything private, no emails, nothing from behind a log-in, and the rise in streaming is proving a challenge, but everything they do get is captured to look and work just as it did when it was live. Jason is keen to make it clear they do their best, but there may still be bad links here and there in the vast amount of data they process, and some websites, retail especially, are too much to handle in their complete form. “We collect a representative sample of the UK web space” is the line they’re comfortable with for now.
The event Hannah and I attended on November 4th was described as a “mini” conference, and with only perhaps 30 to 40 delegates the name is not inaccurate. The whole UKWA team consists of eight people, four technical and four curatorial staff. For all their efforts, such a small team means there are still difficulties accessing the archive and, together with legal deposit restrictions, a major limit on what’s possible in terms of Big Data analysis and research. Most collected websites are only accessible in legal deposit libraries. The website for BUDDAH (Big UK Domain Data for the Arts and Humanities), as presented by Prof. Jane Winters from the University of London, best summarises the current situation in most fields:
Despite the limitations, improvements to the user interface are a top priority at UKWA. There’s hope that as the archive “becomes history” and its relevance grows, increased interest will bring increased use and development. The archive was able to save a database of Conservative party speeches that was otherwise removed from public view back in 2013, while the privately organised Internet Archive was blocked from doing so (UKWA was protected by its legal obligation to gather the data). 90% of what UKWA holds is no longer live, so instances like this are likely to occur more often in the future – their Brexit collection is already seeing higher traffic than previous curations and holds evidence of the notorious bus-pledge on the Vote Leave campaign’s website.
More events of this kind are planned and it’s evident the UKWA team want to see the project grow. Presentations by researchers at the mini-con showed the breadth of what the archive can be used for. Public assistance also helps – anyone can nominate a website for archiving, easily and rapidly, at https://www.webarchive.org.uk/en/ukwa/info/nominate. Sites like this one with a “.uk” domain are automatically included, but anything else requires nomination. Don’t hold back – as the team made sure we were aware, every website matters.
- All websites back to 2013 that have given permission for public access are searchable at https://www.webarchive.org.uk/en/ukwa/index
- The pre-2013 collection can be found at webarchive.org.uk/shine
- (Currently, for both sites, URLs should be searched beginning with www. as the engine finds http and https confusing. Boolean searches are also strongly recommended.)
- The easiest way to get to grips with the archive is currently through their curated “topics and themes” section at https://www.webarchive.org.uk/en/ukwa/collection
- Updates on UKWA’s work can be found at https://blogs.bl.uk/webarchive
- More information about BUDDAH can be found at https://buddah.projects.history.ac.uk/
- Datasets pulled from UKWA and compiled into a workable format can be found at https://data.bl.uk/
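
If you are pasting addresses into either search box, the tip above about starting with www. can be sketched in a few lines of Python. This is only an illustration, not anything UKWA provides: the function name `to_search_form` is made up for this post, and it assumes that dropping the http/https scheme and putting www. in front is all the search engine needs. Hosts that live on another subdomain (a blog, for example) may need to be searched differently.

```python
from urllib.parse import urlsplit


def to_search_form(url: str) -> str:
    """Return an address without its scheme, starting with 'www.'.

    Hypothetical helper: assumes the search box only wants the host and path.
    """
    # urlsplit only recognises the host when a scheme or '//' is present,
    # so add '//' to bare addresses such as 'bl.uk/collection'.
    parts = urlsplit(url if "//" in url else "//" + url)
    host = parts.netloc.lower()
    if not host.startswith("www."):
        host = "www." + host
    # Drop any trailing slash so 'www.bl.uk/' and 'www.bl.uk' look the same.
    return host + parts.path.rstrip("/")


if __name__ == "__main__":
    for address in ["https://www.bl.uk/", "http://Example.co.uk/news/"]:
        print(to_search_form(address))
```

The helper deliberately ignores query strings and fragments, since the advice from the event only concerned the start of the address.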