Earlier this week I attended a three-day mashup event in Glasgow, organised as part of the SPRUCE project. SPRUCE aims to enable Higher Education Institutions to address preservation gaps and articulate the business case for digital preservation, and the mashup serves as a way to bring practitioners and developers together to work on these problems. Each practitioner took along a collection they were having issues with, and was paired with a developer who could work on a tool to provide a solution.
After some short presentations on the purpose of SPRUCE and the aims of the mashup, we practitioners gave lightning talks on our collections and problems. These included dealing with email attachments, preserving content from Facebook, software emulation, black areas in scanned images, and identifying file formats with incorrect extensions, amongst others. I took along some disk images, as we find it very time-consuming to establish the date ranges, file types and content of the files in a disk image, and we wanted a more efficient way to get this metadata. More information on the collections and issues presented can be found at the wiki.
After a short break for coffee (and excellent cakes and biscuits) we were sorted into small groups of collection owners and developers to discuss our issues in more detail. In my group this led to conversations about natural language processing, and the possibility of using predefined subjects to identify files as being about a particular topic. We thought this could be really helpful, but not something that could be built in a couple of days! We were then allocated our developers. As a few of us had problems with file identification, we were assigned to the same developer, Peter May from the BL. The day ended with a short presentation from William Kilbride on the value of digital collections and Neil Beagrie’s benefits framework.
The developers were packed off to another room to work on coding, while we collection owners started to look into the business case for digital preservation. We used Beagrie’s framework to consider the three dimensions of benefits (direct or indirect, near- or long-term, and internal or external) as they apply to our institutions. When we reported back, it was interesting to see how different organisations benefit in different ways. We also looked at various stakeholders and how important or influential they are to digital preservation. Write-ups of these sessions are also available at the wiki.
The developers came back at several points throughout the day to share their progress with us, and by lunchtime the first solution had been found! The first steps towards solving our own problem were also being made: Peter had found a program, Apache Tika, which can parse a file and extract its metadata, and which can identify a file’s content type even when the extension is incorrect. He had written a script that works through a directory of files and outputs the information into a CSV spreadsheet. This was a really promising start, especially given the amount of metadata that could potentially be extracted (provided it exists within the file).
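For readers curious about what such a script involves, here is a minimal sketch of the directory walk and CSV output. It is not Peter’s script, which isn’t reproduced in this post. The function names and column names are made up for illustration, and the extractor is a simple stand-in that only uses the Python standard library (it guesses the type from the file extension, which is exactly the weakness Tika avoids). In practice you would pass in a function that asks Tika for the metadata instead.

```python
import csv
import mimetypes
import os
from datetime import datetime, timezone


def basic_metadata(path):
    """Stand-in extractor (hypothetical): a real run would call Tika here.

    Uses only file system information and an extension-based type guess.
    """
    stat = os.stat(path)
    guessed_type, _encoding = mimetypes.guess_type(path)
    modified = datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc)
    return {
        "path": path,
        "size_bytes": stat.st_size,
        "modified": modified.isoformat(),
        "content_type": guessed_type or "unknown",
    }


def directory_to_csv(root, csv_path, extract=basic_metadata):
    """Walk `root`, run `extract` on every file, and write one CSV row per file.

    Returns the number of files processed.
    """
    rows = []
    for folder, _subfolders, filenames in os.walk(root):
        for name in sorted(filenames):
            rows.append(extract(os.path.join(folder, name)))

    # Different files may yield different metadata keys, so use their union
    # as the header and leave missing values blank.
    fieldnames = sorted({key for row in rows for key in row})
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Keeping the extractor as a separate function means the walking and CSV-writing parts stay the same whichever tool supplies the metadata.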
We had another catch up with the developers and their overnight progress. Peter had written a script that took the information from the CSV file and summarised it into one row, so that it fits into the spreadsheets we use at BEAM. Unfortunately, mounting the ISO image to check it with Apache Tika was slightly more complicated than anticipated, so our disk images couldn’t be checked this way without further work.
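To give an idea of what the summarising step does, here is a rough sketch that collapses a per-file CSV into a single summary row. Again, this is not Peter’s script: the function name and the column names (`modified`, `content_type`) are assumptions for illustration and do not reflect the actual BEAM spreadsheet layout. It also assumes the dates are ISO-formatted strings in the same timezone, so that sorting them as text gives the right order.

```python
import csv
from collections import Counter


def summarise_csv(csv_path, date_column="modified", type_column="content_type"):
    """Reduce a per-file metadata CSV to one summary row (as a dict).

    Column names are hypothetical; adjust them to match the real CSV.
    """
    dates = []
    types = Counter()
    total = 0
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            total += 1
            if row.get(date_column):
                dates.append(row[date_column])
            types[row.get(type_column) or "unknown"] += 1

    return {
        "file_count": total,
        # ISO dates sort correctly as plain strings.
        "earliest": min(dates) if dates else "",
        "latest": max(dates) if dates else "",
        # e.g. "application/pdf (1); text/plain (2)"
        "content_types": "; ".join(
            f"{name} ({count})" for name, count in sorted(types.items())
        ),
    }
```

The resulting dictionary can be written out as a single row with `csv.DictWriter`, ready to paste into an existing spreadsheet.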
While the developers set about finalising their solutions, we continued to work on the business case, doing a skills gap analysis to consider whether our institutions had the skills and resources to carry out digital preservation. Reporting back, we had a very interesting discussion on skills gaps within the broader archives sector, and the need to provide digital preservation training to students as well as existing professionals. We then had to prepare an ‘elevator pitch’ for those occasions when we find ourselves in a lift with senior management. This neatly brought together everything we had discussed, as we had to explain our goals and the specific benefits of digital preservation to our institution in about a minute.
To wrap up, the developers presented their solutions, which solved many of the problems we had arrived with.
A last-minute breakthrough in mounting ISO images using WinCDEmu and running scripts on them meant that we were able to use the Tika script on our disk images after all. However, because we were so short on time, there are still some small problems that need addressing. I’m really happy with our solution, and I was very impressed by all the developers and how much they were able to get done in such a short space of time.
I felt that this event was a very useful way to start thinking about the business case for what we do, and to see what other people within the sector are doing and what problems they are facing. As a non-techie, it was also really helpful to talk with developers and get an idea of what kinds of tools it is possible to build (and to get them made!). I would definitely recommend this type of event – in fact, I’d love to go along again if I get the opportunity!