Tag Archives: digital preservation

Subcultures as Integrative Forces in East-Central Europe 1900 – present: a Bodleian Libraries’ Web Archive record

A problem, and a solution in action:

The ephemeral nature of internet content is only one contributing factor to data loss: the average life of a web page is 100 days, which shows that websites do not need to be purposefully deleted to vanish. Web preservation is a high priority and action is required. This drives not only Bodleian Libraries’ Web Archive (BLWA) but digital preservation initiatives on a global scale.

However, today I would like to share the solution in action, an example from BLWA’s University of Oxford Collection: Subcultures as Integrative Forces in East-Central Europe 1900 – present.

On the live web, attempts to access the site are met with automatic redirects to BLWA’s most recent archived capture (24 Jan. 2017). The yellow banner indicates it is part of our archive. Image from http://wayback.archive-it.org/2502/20170124104518/http://subcultures.mml.ox.ac.uk/home.html

Subcultures is a University of Oxford project, backed by the Arts & Humanities Research Council. Through its explorative redefinition of ‘sub-cultures’, it aims to challenge the current way of understanding simultaneous forms of identification in Eastern Europe, using a multi-disciplinary methodology of social anthropology, discourse analysis, historical studies and linguistics. The project ran from 2012 to 2016.

The Subcultures website is an incredibly rich record of the project and its numerous works. The project held cross-continent collaborative initiatives including lectures, international workshops and seminars, as well as an outreach programme including academic publications. Furthermore, comparative micro-studies were conducted in parallel with the main collaborative project: Linguistic Identities: L’viv/Lodz, c.1900; Myth and Memory: Jews and Germans, Interwar Romania; Historical Discourses: Communist Silesia; and Discursive Constructions: L’viv and Wroclaw to present. The scope and content of the project, including key questions, materials, past and present events and network information is* all hosted on http://subcultures.mml.ox.ac.uk/home.html.

Was*. The site is no longer live on the internet.

However, as well as an automatic re-direction to our most recent archival copy, a search on Bodleian Libraries’ Web Archive generates 6 captures in total:

Search results for Subcultures within BLWA. Image from https://archive-it.org/home/bodleian?q=Subcultures

The materials tab of the site functions fully in the archived capture: you are able to listen to the podcasts and download the papers on theory and case studies as PDF versions.

The use of Subcultures

To explore the importance of web-archiving in this context, let us think about the potential use(rs) of this record and the implications if the website were no longer available:

As the project comprised a wider outreach programme alongside its research, content such as PDF publications and podcasts was available for download, consultation and further research. The website platform means that these innovative collaborations, and the data informed by the primary methodology, are available for access. The public can reach them on a global scale for education, knowledge and interaction with important issues, and that is before even elaborating on how academics, researchers, historians and the wider user community will benefit from the availability of the materials in this web archive. Outreach by its very nature serves an unspecified group of people.

Listening to the podcast of the project event hosted in Krakow: ‘Hybrid Identity’ in 2014. Rationale, abstracts and biographies from the workshop can also be opened. Image from http://wayback.archive-it.org/2502/20170124104618/http://subcultures.mml.ox.ac.uk/materials/workshop-krakow-hybrid-identity-september-2014.html

Furthermore, the site provides an irreplaceable record of institutional history for the University of Oxford as a whole, as well as its research and collaborations. This is a dominant purpose of our University of Oxford collection. The role of preserving for posterity cannot be overstated. Subcultures provides data that will be used and re-used, and will be of great importance for decades to come; it also documents decisions and projects of the University of Oxford. For example, the outline and rationale of the project is available in full through the Background Paper – Theory, which can be consulted through the archived capture just as it could on the live web. Biographical details of contributors are also held in the captures, preserving records of the people involved and their roles for posterity and accountability.

Building on the importance of access to research: internet presence increases scholarly interaction. The scope of the project is of great relevance, and not only is data for research available from the capture of the site, but the use of internet archives as datasets is expected to become more prominent.

Participate!

Here at BLWA the archiving process begins with a nomination for archiving: if you have a site that you believe is of value for preserving as part of one of our collections, then please nominate it here. The nomination form will go to the curators and web-archivists on the BLWA team for selection checks and further processing. We would love to hear your nominations.

PASIG 2017: Ageing of Digital – Towards Managed Services for Digital Continuity

PASIG 2017 (Preservation and Archiving Special Interest Group) was hosted in Oxford this year at the Natural History Museum by Bodleian Libraries & Digital Preservation at Oxford and Cambridge (DPOC). I attended on all three days (11th-13th September), and when I wasn’t working I had the opportunity to listen to some thought-provoking talks centred around the issue of digital preservation.

One of the highlights of the conference for me was a talk given by Natasa Milic-Frayling, the founder of Intact Digital. The presentation, entitled ‘Ageing of Digital: Towards Managed Services for Digital Continuity’, demonstrated the innovative ways in which digital preservation issues are being approached.

Digital technology has a short lifespan: hardware and software become outdated and obsolete in a very short time. Outdated software that no longer receives vendor support or updates is known as ‘legacy software’.

This poses the problem: how can we manage the life-cycle of digital in the face of a dynamic and changing computing ecosystem?

Technologies are routinely changed, updated (sometimes at a cost), made redundant and retired. The value of digital assets needs to be protected. In the current climate there is an imbalance of power between the technology producers and providers and the content producers, owners and curators. The providers and producers can move on without the opinion or input of those who use the software.

How do we enable prolonged use of software to protect the value of digital assets?

A case study was presented that contextualised the problem and the solution. The vendor Tamal Vista Insights provided Cut&Search, software for automated and semi-automated indexing of digitised manuscripts and digital artefacts that standard OCR cannot handle.

The software was supplied to Fo Guang Shan, an international Chinese Buddhist monastic order with over 200 branch temples worldwide, for use with their digitised manuscript collection. The project involves thousands of volunteers and spans years, beyond the provider’s expected life-cycle for the product, its primary market lifetime.

Intact Digital provide a managed service that allows for digital continuity. There are several steps in the process, which then provide a number of options to software providers and content producers:
  • Deposit
  • Hosting
  • Remote Access
  • Digital Continuity Assurance Plans

The software can be hosted in a virtual machine and accessed remotely via a browser. The implications of this are far-reaching for projects like the ones undertaken by Fo Guang Shan. They don’t need to worry about the Cut&Search software becoming redundant, and their digital assets remain protected. For smaller organisations operating on ever-decreasing budgets this is an important step both for asset protection and digital preservation.

Key areas to develop

Although this is an important step, there is still much work to do, and some key areas that need to be developed were highlighted. Developing them will result in a sustained use of digital.

  • Economy around “retired” software
  • Legal frameworks and sustainable business models
  • New practices to create demand
  • New services to make it efficient, economical and sustainable

Changes to the Ecosystem

Taking these steps and creating a dialogue between the technology producers/providers and the content producers changes the dynamic of the ecosystem, redressing the imbalance in control.

The talk ended with two very pertinent statements:

“Together we can create new practices and new models of extending the life of digital”

“Without digital continuity our digital content, information and knowledge has no future”

As a trainee I still have lots to learn, but a major theme running throughout digital archiving and digital preservation is the need for communication, collaboration and dialogue. Working together and sharing ideas and challenges is key to securing the future of digital content.

A complete collection of the slides relating to this topic can be found here: Milic-Frayling, Natasa (2017): Aging of digital: Towards managed services for digital continuity. figshare. https://doi.org/10.6084/m9.figshare.5415040.v1

PASIG 2017: Smartphones within the changing landscape of digital preservation

I recently volunteered at the PASIG 2017 Conference in Oxford. It was a great experience to learn more about the archives sector. Many of the talks at the conference focused on the current trends and influences affecting the trajectory of the industry.

A presentation that covered some of these trends in detail was given by Somaya Langley from Cambridge University Library (Polonsky Digital Preservation Project) in the ‘Future of DP theory and practice’ session: ‘Realistic digital preservation in the near future: How do we get from A to Z when B already seems too far away?’. Somaya’s presentation considered how we preserve the digital content we receive from donors on smartphones, with a focus on iOS.

Langley, Somaya (2017): Realistic digital preservation in the near future: How to get from A to Z when B seems too far away?. figshare. https://doi.org/10.6084/m9.figshare.5418685.v1 Retrieved: 08:22, Sep 22, 2017 (GMT)

Somaya’s presentation discussed how, in the field of digital preservation, ingest suites have long been used to dealing with CDs, DVDs, floppy disks and HDDs. However, they are not sufficiently prepared for ingesting smartphones or tablets, or for the various issues associated with these devices. We must realise that smartphones potentially hold a wealth of information for archives:

‘With the design of the Apple Operation System (iOS) and the large amount of storage space available, records of emails, text messages, browsing history, chat, map searching, and more are all being kept’.

(Forensic Analysis on iOS Devices,  Tim Proffitt, 2012. https://uk.sans.org/reading-room/whitepapers/forensics/forensic-analysis-ios-devices-34092 )

Why iOS? What about Android?

In the UK, unlike the rest of Europe, the split between the two platforms is close: in November 2016 iOS accounted for 48.3% of smartphone sales against 49.6% for Android. This contrasts with Apple’s global market share of 12.1% in Q3 of 2016.

Whatever side of the fence you stand on, it is clear that smartphones, be they Android or iOS, will play an important role in our collections. The skills required to extract content differ across platforms, so we as digital archivists will have to learn both methods of extraction and leave our consumer preferences at the door.

So how do we get the data off the iPhone?

iOS has long been known as a ‘locked-down’ operating system, and Apple have always had an anti-tinkering stance with many of their products. Therefore it should come as no surprise that locating files on an iPhone is not very straightforward.

As Somaya pointed out in her talk, after spending six hours in the Apple Shop ‘Genius Bar’ she was no closer to understanding from Apple employees what the best course of action would be to locate backups of notes from a ‘bricked’ iPhone. Therefore she used her own method of retrieving the notes, using iExplorer to search through the backups from the iPhone.

She noted, however, that due to limitations of iOS it was very challenging to locate these files. In some cases it even required the command line to reach the storage location for backups, as it is hidden by default in OS X (now macOS, the main operating system used on Apple computers).
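To give a feel for what "hidden by default" means in practice, here is a minimal sketch of listing backup folders from a script rather than from Finder. It assumes the default folder that iTunes has used for device backups on macOS (`~/Library/Application Support/MobileSync/Backup`); that path and the function name `list_ios_backups` are my own illustration and were not part of Somaya's talk.

```python
from pathlib import Path

# Assumption: the default iTunes backup folder on macOS. The ~/Library folder
# is hidden in Finder by default, which is why the files are hard to find.
DEFAULT_BACKUP_ROOT = (
    Path.home() / "Library" / "Application Support" / "MobileSync" / "Backup"
)


def list_ios_backups(root: Path = DEFAULT_BACKUP_ROOT) -> list[Path]:
    """Return one folder per device backup, most recently modified first."""
    if not root.is_dir():
        # No backups on this machine (or not a Mac): nothing to report.
        return []
    backups = [entry for entry in root.iterdir() if entry.is_dir()]
    return sorted(backups, key=lambda entry: entry.stat().st_mtime, reverse=True)


for backup in list_ios_backups():
    print(backup.name)
```

The folder names are device identifiers rather than anything human-readable, so a tool such as iExplorer is still needed to make sense of the contents.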

Many tools do exist for extracting information from iPhones. The SANS Institute white paper on Forensic Analysis on iOS Devices by Tim Proffitt outlines four main methods:

  1. Acquisition via iTunes Backups (requires original PC last used to sync the iPhone)
  2. Acquiring Backup Data with iPhone Analyzer (free Java-based computer program, issues exist when dealing with encrypted backups)
  3. Acquisition via Logical Methods (uses a synchronisation method built into iOS to recover data, e.g. programs like iPhone Explorer)
  4. Acquisition via Physical Methods (obtaining a bit-by-bit copy, e.g. Lantern 2 forensics suite)

Encryption is a challenge for retrieving data from the iPhone, especially since iTunes includes a feature for encrypting backups when syncing. Proffitt suggests using a password cracker or jail-breaking as solutions to this issue. However, these solutions might not be fully compatible with our archive situations.

Another issue with smartphone digital preservation is platform and version locking. The above methods work for data extraction at the moment, but it is very possible that future versions of iOS could make them defunct, requiring software developers to consistently update their programs or look for new approaches.

Final thoughts

One final consideration that can be raised from Somaya’s talk is that of privacy. As with the arrival of computers into our archives, phones will pose similar moral questions for archivists:

Do we ascribe different values to information stored on smartphones?
Do we consider the material stored on phones more personal than data stored on our computers?

As mentioned previously, our phones store everything from emails, geo-tagged photos and phone call information to, with the growing popularity of smart wearable technology, health data (including the user’s heart rate, daily activity, weight etc.). We as digital archivists will be dealing with very sensitive personal information and need to be prepared to take on the responsibility of safeguarding it appropriately.

There is no doubt that soon enough we in the archive field will be receiving more and more smartphones and tablets into our archives from donors. Hopefully talks like Somaya’s will start the ball rolling towards the creation of better standards and approaches to smartphone digital curation.

PASIG 2017: Reflections on ‘Digital Preservation at the United Nations Mechanism for International Criminal Tribunals’

Along with my colleagues, I was incredibly grateful to be at Oxford PASIG 2017, hosted at the Oxford University Museum of Natural History from 11-13 September.

A presentation given by Angeline Takawira was affirmation indeed of why advocacy for digital preservation is crucial worldwide. Angeline gave us an insight into the aims and challenges of digital preservation at the United Nations Mechanism for International Criminal Tribunals (UN MICT).

The Mechanism

Angeline explained that the purpose of the UN MICT is to continue the mandated and essential functions that have been carried out by two temporary International Criminal Tribunals: Rwanda (ICTR), from 1994 until 2015, and Yugoslavia (ICTY), since 1993, which will be closing at the end of this year. UN MICT was established in 2010 by the UN Security Council, and is therefore a relatively new organisation. However, like its two predecessors, it is temporary.

We were told about the highly significant and mandated functions of MICT:

  1. To protect and support victims, witnesses and all others affected by war crimes
  2. To enforce sentences and other judicial work
  3. To preserve and manage the archives of the international tribunals.

You can find out more about the important work of the UN MICT here.

Digital Preservation at UN MICT

The Mechanism is made up of two branches, one in The Hague, Netherlands and one in Arusha, Tanzania, so the single digital repository is maintained across two continents. Currently the digital records of each branch are a hybrid of digitised and born-digital material, with example files including emails, GIS datasets, websites and CAD files. However, audio-visual files account for 90% of the combined volume of the digital archives.

It is apparent that UN MICT’s preservation goals are aligned with its aims as an organisation as a whole; authenticity is imperative for all of its records. Angeline asserted that their digital preservation goals were for records to be trustworthy, accessible, usable and ‘demonstrably authentic’, that is, identical to the digital original in all essential aspects. The digital archive is made up of:

  • Judicial case records – such as court decisions, judgements, court transcripts
  • Records relating to the judicial process – for example detentions of the accused and the protection of witnesses
  • Administrative records of the tribunals as an organisation (and also the Mechanism as an organisation).

Through a range of actions, the development of the digital preservation programme is achieving these aims. Angeline cited the introduction of workflows and compliance with standards, as well as records being transferred to the repository with an unbroken chain of custody, stringent access controls and fixity checks to ensure no corruption. Furthermore, work continues on defining procedures around migration plans, as the Mechanism wishes to retain an experience of authenticity, which understandably needs a focus on file format characteristics.
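For readers new to the term, a fixity check usually means recomputing a file's checksum and comparing it with one recorded at an earlier point, such as at transfer. The sketch below shows the general idea with SHA-256; the manifest layout and the names `sha256_of` and `check_fixity` are hypothetical and say nothing about the tools or algorithms the Mechanism actually uses.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large audio-visual files do not fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_fixity(manifest: dict[str, str], root: Path) -> dict[str, str]:
    """Compare recorded checksums with fresh ones and report a status per file."""
    results = {}
    for relative_name, expected in manifest.items():
        target = root / relative_name
        if not target.is_file():
            results[relative_name] = "missing"
        elif sha256_of(target) == expected:
            results[relative_name] = "ok"
        else:
            results[relative_name] = "changed"
    return results
```

Run on a schedule, a report of anything other than "ok" is the signal to restore that file from another copy and to investigate what happened.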

Challenges

PASIG definitely taught me that authentic and usable digital preservation is always a trying undertaking, but the challenges faced when digitally preserving the records of the UN MICT are particularly distinctive due to their sensitive content and technicalities. For one, the fact that it is a temporary organisation is at odds with the long-term endeavour of making these tribunal records accessible for the future and ensuring their protection. A repository transfer as a next step would need extremely careful consideration. Also, the retention schedule of different data is a factor for discussion, so that the UN MICT can fulfil its requirements of deletion in a transparent way.

One of the largest challenges to the future of digital preservation, for this and similar organisations and initiatives, is the limited financial sustainability, resources and staff available to sustain the long-term commitment that digital preservation of records like these really demands.

Use

There is no doubt that the digital archive of the UN MICT would be of fundamental significance to an international user community of global media, legal professionals, academics, researchers and education in general. Combine these user groups with the broad range of stakeholders in preserving the Mechanism, such as the international courts and the Security Council which mandated the work, and there are many to whom this cause, and the information it preserves, will be vital. I have visited four countries of the former Yugoslavia, and the digital records of the MICT are surely as essential to preserve and learn from as the physical and tangible evidence of conflict. The need for advocacy of digital preservation is pressing, and the UN MICT are doing urgent work.

Bountiful Harvest: Curation, Collection and Use of Web Archives

The theme for the ARA Annual Conference 2017 is: ‘Challenge the Past, Set the Agenda’. I was fortunate enough to attend a pre-conference workshop in Manchester, run by Lori Donovan and Maria Praetzellis from The Internet Archive, about the bountiful harvest that is web content, and the technology, tools and features that enable web archivists to overcome the challenges it presents.

Part I – Collections, Community and Challenges

Lori gave us an insight into the use cases of Archive-it partner organisations to show us the breadth of reasons why other institutions archive the web. The creation of a web collection can be for one of (or indeed, all) the following reasons:

  • To maintain institutional history
  • To document social commentary and the perspectives of users
  • To capture spontaneous events
  • To augment physical holdings
  • Responsibility: Some documents are ONLY digital. For example, if a repository upholds a role to maintain all published records, a website can be moved into the realm of publication material.

When asked about duplication amongst web archives, and whether it was a problem if two different organisations archive the same web content, Lori put forward the argument that duplication is not worrisome. More captures of a website are good for long-term preservation in general, and in some cases organisations can work together on collaborative collecting if the collection scope is appropriate.

Ultimately, the priority of crawling and capturing a site is to recreate the same experience a user would have if they were to visit the live site on the day it was archived. Combining this with an appropriate archive frequency means that change over time can also be preserved. This is hugely important: the ephemeral nature of internet content is widely attested. Thankfully, the misconception that ‘online content will be around forever’ is being confronted. Lori put forward some examples to illustrate why the archiving of websites is crucial.

In general, a typical website lasts 90-100 days before one of the following happens:

  1. The content changes
  2. The site URL moves
  3. The content disappears completely

A study was carried out on the Occupy Movement sites archived in 2012. Of 582 archived sites, only 41% were still live on the web as of April 2014. (Lori Donovan)

Furthermore, we were told about a 2014 study which concluded that 70% of scholarly articles online with text citations suffered from reference rot over time. This speaks volumes about the importance of preserving copies for both authentication and academic integrity.

The challenge continues…

Lori also pointed us to the NDSA 2016/2017 survey, which outlines the principal concerns within web archiving currently: social media (70%), video (69%), and interactive media and databases (both 62%). Any dynamic content can be difficult to capture and curate, so sharing advice and guidelines amongst leaders in the web archiving community is a key factor in determining successful practice for both current web archivists and those of future generations.

Part II – Current and Future Agenda

Maria then talked us through some key tools and features which enable better crawling, higher quality captures and the preservation of web archives for access and use:

  • Brozzler. Definitely my new favourite portmanteau (browser + crawler = brozzler!), Brozzler is the newly developed crawler by The Internet Archive which is replacing the combination of the Heritrix and Umbra crawlers. Brozzler captures HTTP traffic as it is loaded, works with YouTube in order to improve media capture, and immediately writes and saves the data as a WARC file. Also, Brozzler uses a real browser to fetch pages, which enables it to capture embedded URLs and extract links.
  • WARC. The Web ARChive file format is the ISO standard for web archives. It is a concatenated file written by a crawler, with long-term storage and preservation specifically in mind. However, Maria pointed out to us that WARC files are not constructed to easily enable research (more on this below).
  • Elasticsearch. The full-text search system does not just search the HTML content displayed on the web pages; it also searches PDF, Word and other text-based documents.
  • Solr. A metadata-only search tool. Metadata can be added on Archive-it at collection, seed and document level.
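To make "a concatenated file" a little more concrete, here is a simplified sketch of how WARC-style records sit one after another: each record is a version line, some named header fields, a blank line, the content, and then two line breaks before the next record begins. This is my own illustration and not a conforming implementation. The functions `build_warc_record` and `iter_warc_records` are hypothetical, real WARC files are usually compressed record by record, and in practice you would use an established library rather than code like this.

```python
import uuid
from datetime import datetime, timezone


def build_warc_record(target_uri: str, payload: bytes) -> bytes:
    """Build one simplified 'resource' record: header fields, blank line, payload."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    header_lines = [
        "WARC/1.0",
        "WARC-Type: resource",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {now}",
        f"WARC-Target-URI: {target_uri}",
        "Content-Type: text/plain",
        f"Content-Length: {len(payload)}",
    ]
    header = "\r\n".join(header_lines).encode("utf-8") + b"\r\n\r\n"
    # Two CRLFs close the record, so records can simply be appended to a file.
    return header + payload + b"\r\n\r\n"


def iter_warc_records(data: bytes):
    """Walk the concatenated records, using Content-Length to find each payload."""
    position = 0
    while position < len(data):
        header_end = data.index(b"\r\n\r\n", position)
        lines = data[position:header_end].decode("utf-8").split("\r\n")
        fields = dict(line.split(": ", 1) for line in lines[1:])
        start = header_end + 4
        length = int(fields["Content-Length"])
        yield fields, data[start:start + length]
        position = start + length + 4


archive = build_warc_record("http://example.org/a", b"first capture")
archive += build_warc_record("http://example.org/b", b"second capture")
for fields, payload in iter_warc_records(archive):
    print(fields["WARC-Target-URI"], len(payload))
```

Reading anything out means walking the records in order like this, which hints at why WARC suits storage better than research.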

Supporting researchers now and in the future

The tangible experience and use of web archives, where a site can be navigated as if it were live, can shed so much light on the political and social climate of its time of capture. Yet Maria explained that the raw captured data, rather than just the replay, is a rich area for potential research and, if handled correctly, is an invaluable research tool.

As well as the use of Brozzler as a new crawling technology, Archive-it research services offer a set of derivative data-set files which are less complex than WARC and allow for data analysis and research. One of these derivative data sets is a Longitudinal Graph Analysis (LGA) dataset file, which allows the researcher to analyse the trend in links between URLs over time within an entire web collection.

Maria acknowledged that there are lessons  to be learnt when supporting researchers using web archives, including technical proficiency training and reference resources. The typology of the researchers who use web archives is ever growing: social and political scientists, digital humanities disciplines, computer science and documentary and evidence based research including legal discovery.

What Lori and Maria both made clear throughout the workshop was that the development and growth of web archiving is integral to challenging the past and preserving access on a long term scale. I really appreciated an insight into how the life cycle of web archiving is a continual process, from creating a collection, through to research services, whilst simultaneously managing the workflow of curation.

When in Manchester…

Virtual Archive, Central Library, Manchester

I couldn’t leave Manchester without exploring the John Rylands Library and Manchester’s Central Library. In the latter, an interactive digital representation of a physical archive lets you choose a box, arranged as it might be in a physical archive, and then projects the digitised content onto the screen. A few streets away in Deansgate I had just enough time in John Rylands to learn that the fear of beards is called pogonophobia. Go and visit yourself to learn more!

Special collections reading room, John Rylands Library, Manchester

Email Preservation: How Hard Can it Be? DPC Briefing Day

On Thursday 6th July 2017 I attended the Digital Preservation Coalition briefing day on email preservation, held in partnership with the Andrew W. Mellon Foundation and titled ‘Email preservation: how hard can it be?’. It was hosted at The National Archives (TNA). This was my first visit to TNA and it was fantastic. I didn’t know a great deal about email preservation prior to this, so I was really looking forward to learning about the topic.

The National Archives, Photograph by Mike Peel (www.mikepeel.net)., CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=9786613

The aim of the day was to engage in discussion about some of the current tools, technologies and thoughts on email preservation. It was orientated around the ‘Task Force on Technical Approaches to Email Preservation’ report that is currently in its draft phase. We also got to hear about interesting case studies from the British Library, TNA and Preservica, each presenting their own experiences in relation to this topic. It was a great opportunity to learn about this area and hear from the co-chairs (Kate Murray and Christopher Prom) and the audience about their thoughts on the current situation and possible future directions.

We heard from Jonathan Pledge from the British Library (BL). He told us about the forensic capture expertise gained by the BL and about using EnCase to capture email data from hard drives, CDs and USB drives. We also got an insight into how they are deciding which email archive tool to use. Aid4mail fits better with their workflow; however, ePADD, with its holistic approach, was something they were considering. During their ingest they separate the emails from the attachments. They found that after the time-consuming process of removing emails that would violate data protection laws, there was very little usable content left, as often entire threads would have to be redacted due to one message. This is not the most effective use of an archivist’s time and is something they are working to address.
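As a rough illustration of what separating emails from attachments involves, the sketch below uses only Python's standard `email` package to pull the body and the attachments out of a single message. It is an assumption-laden toy and not the BL's workflow or how Aid4mail or ePADD work: the function `split_message` and the demo message are invented for the example.

```python
from email import policy
from email.message import EmailMessage
from email.parser import BytesParser


def split_message(raw: bytes):
    """Return (subject, body text, [(filename, content), ...]) for one message."""
    message = BytesParser(policy=policy.default).parsebytes(raw)
    body_part = message.get_body(preferencelist=("plain", "html"))
    body = body_part.get_content() if body_part is not None else ""
    attachments = [
        (part.get_filename(), part.get_content())
        for part in message.iter_attachments()
    ]
    return message["Subject"], body, attachments


# A made-up message standing in for one item from a donated mailbox.
demo = EmailMessage()
demo["From"] = "donor@example.org"
demo["To"] = "archive@example.org"
demo["Subject"] = "Minutes"
demo.set_content("Minutes attached.")
demo.add_attachment(
    b"%PDF-1.4 demo", maintype="application", subtype="pdf", filename="minutes.pdf"
)

subject, body, attachments = split_message(demo.as_bytes())
print(subject, [name for name, _ in attachments])
```

Even this tiny example has to cope with nested parts, encodings and missing bodies, which gives a small taste of why email is harder to preserve than it first appears.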

We also heard from Anthea Seles, who works with government collections at TNA. We learnt that, from their research, approximately 1TB of data in an organisation’s own electronic document and records management system is linked to 10TB of related data in shared drives. Her focus was on discovery and data analytics. For example, one way to increase efficiency and focus the attention of the curator was to batch email. If an email was sent from TNA to a vast number of people, then there is a high chance that the content does not contain sensitive information. However, if it was sent to a high-profile individual, then there is a higher chance that it will contain sensitive information, so the curator can focus their attention on those messages.
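The batching idea can be sketched very simply by counting distinct recipients and sorting messages into two piles. The threshold of 20, the labels and the function names below are all hypothetical choices of mine; the sketch also ignores who the recipients are, which the approach described above would clearly need to take into account.

```python
from email import message_from_string
from email.utils import getaddresses

# Hypothetical cut-off: messages sent to at least this many people are
# treated as broadcasts that are less likely to be sensitive.
BROADCAST_THRESHOLD = 20


def recipient_count(message) -> int:
    """Count distinct addresses across the To, Cc and Bcc headers."""
    values = []
    for header in ("To", "Cc", "Bcc"):
        values.extend(str(value) for value in message.get_all(header, []))
    return len({address.lower() for _, address in getaddresses(values) if address})


def triage(message) -> str:
    """Label a message so the curator can look at narrowly addressed mail first."""
    if recipient_count(message) >= BROADCAST_THRESHOLD:
        return "low-priority"
    return "review-first"


staff_list = ", ".join(f"user{i}@example.org" for i in range(25))
newsletter = message_from_string(
    f"From: comms@example.org\nTo: {staff_list}\nSubject: Newsletter\n\nHello all"
)
print(triage(newsletter))
```

A heuristic like this does not decide anything on its own; it only orders the queue so that human attention goes where it is most likely to be needed.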

Hearing from Preservica was interesting as it gave an insight into the commercial side of email archiving. In their view, preservation was not an issue. Their attention was focused instead on issues such as identifying duplicate or unwanted emails efficiently, developing tools for performing whole-collection email analysis and, interestingly, how to solve the problem of acquiring emails via a continuous transfer.

Email is not going to be the main form of communication forever (the rise in the popularity of instant messaging is clear to see); however, we learnt that its use is still expected to grow in the near future.

One of the main issues that was brought up was the potential size of future email archives and the issues that come with effective and efficient appraisal. What is large in academic terms, e.g. 100,000 emails, is not in government. The figure of over 200 million emails at the George W. Bush presidential library is a phenomenal amount, and the Obama administration’s is estimated at 300 million. This requires smart solutions, and we learnt how the use of artificial intelligence and machine learning could help.

Continuous active learning was highlighted as a way to improve searches. An example of searching for Miami dolphins was given. The Miami Dolphins are an American football team; however, someone might also be looking for information about dolphins in Miami. Initially the computer would present different search results and the user would choose which is the more relevant result. Over time it will learn what the user is looking for in cases where searches can be ambiguous.
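The Miami dolphins example can be mimicked with a very small relevance-feedback loop, shown below. To be clear about what it is: this is simple word-weight adjustment, not a real continuous active learning system (which would retrain a proper classifier after each round of feedback), and the documents, weights and function names are all invented for illustration.

```python
from collections import Counter

DOCS = [
    "miami dolphins win football game",
    "dolphins spotted swimming near miami beach",
    "miami dolphins football season tickets",
    "marine biologists study dolphins swimming off miami",
]


def tokens(text: str) -> list[str]:
    return text.lower().split()


def initial_weights(query: str) -> Counter:
    """Start by giving every query word the same weight."""
    return Counter(tokens(query))


def score(document: str, weights: Counter) -> int:
    """Sum the weights of the distinct words in a document."""
    return sum(weights[word] for word in set(tokens(document)))


def apply_feedback(weights: Counter, document: str, relevant: bool) -> None:
    """Nudge word weights up or down according to the user's judgement."""
    change = 1 if relevant else -1
    for word in set(tokens(document)):
        weights[word] += change


def rank(documents: list[str], weights: Counter) -> list[str]:
    return sorted(documents, key=lambda doc: score(doc, weights), reverse=True)


weights = initial_weights("miami dolphins")
apply_feedback(weights, DOCS[0], relevant=False)  # user is not after football
apply_feedback(weights, DOCS[1], relevant=True)   # user wants the animals
print(rank(DOCS[2:], weights))
```

After just two judgements the remaining results are reordered so that the marine biology item comes ahead of the football one, which is the behaviour described above on a miniature scale.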

Another issue that was highlighted was: how do you make sure that you have found the correct person? How do you avoid false positives? At TNA the ‘Traces Through Time’ project aimed to do that, initially with World War One records. This technology, which uses big data analytics, can be used with email archives. There is also work on mining the email signature as a way to better determine ownership of the message.

User experience was also discussed. Emulation is an area of particular interest. Its advantage is that it recreates how the original user would have experienced the emails. However, this technology is still being developed. Bit-level preservation is a solution to make sure we capture and preserve the data now. This prevents loss of the archive and allows the information and value to be extracted in the future once the tools have been developed.
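
One common building block of bit-level preservation is a fixity check: record a checksum for every file when it is acquired, then recompute later to confirm that nothing has changed. The sketch below shows the idea in Python using SHA-256. The function names and the choice of a plain dictionary as the manifest are assumptions for illustration, not a description of any particular repository system.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder: Path) -> dict:
    """Map each file's path (relative to folder) to its checksum."""
    return {
        str(p.relative_to(folder)): sha256_of(p)
        for p in sorted(folder.rglob("*"))
        if p.is_file()
    }


def changed_files(folder: Path, manifest: dict) -> list:
    """List files that were added, removed or altered since the manifest."""
    current = build_manifest(folder)
    return sorted(
        name
        for name in manifest.keys() | current.keys()
        if manifest.get(name) != current.get(name)
    )
```

Running `build_manifest` at acquisition and `changed_files` on a schedule would flag any silent corruption, although the manifest itself also has to be stored somewhere safe.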

It was interesting to hear how policy could affect how easy it would be to acquire email archives. The new General Data Protection Regulation, which comes into effect in May 2018, means that anyone in breach of it will face heavier penalties, up to 4% of annual worldwide turnover. This means that companies may err on the side of caution with regard to keeping personal data such as emails.

Whilst the email protocols are well standardised, allowing emails to be sent from one client to another (e.g. from an AOL account of the early 1990s to the Gmail of today), their acquisition is not. When archivists get hold of email archives, they are left with the remnants of whatever the email client or user has done to them. This means metadata may have been added or removed and formats can vary. This adds a further level of complexity to the whole process.

The day was thoroughly enjoyable. It was a fantastic way to learn about archiving emails. As emails are now one of the main methods of communication, for government, large organisations and personal use, it is important that we develop the tools, techniques and policies for email preservation. To answer the question ‘how hard can it be?’ I’d say very. Emails are not simple objects of text; they are highly complex entities comprising attachments, links and embedded content. The solution will be complex but there is a great community of researchers, individuals, libraries and commercial entities working on solving this problem. I look forward to hearing the update in January 2018 when the task force is due to meet again.

Initiating conversation: let’s talk about web content (part 2)

Colin Harris, Superintendent of Special Collections reading rooms. Chosen site: cyndislist.com

‘I am a founding member of Oxfordshire Family History Society and I’ve long been interested in family history. As a phenomenon it surged in popularity in the 1970s. In about 1973 there was great curiosity (in OFHS) in Bicester as everyone was interested in the popular group, The Osmonds (who originated from Bicester!). Every county has a family history society and I would say it’s they who have done the lion’s share of the work. All of their work and indexing…it’s all grist to the mill in terms of recording names and events.

So the website I would like to have access to in 10 years’ time is cyndislist.com, which is one of the world’s largest databases for genealogy. In fact it’s been going for over 21 years already. This was launched on the 4th March 1996. The family history people have been right there from the very beginning, it’s been growing solidly since then; it’s fantastic. It covers 200 categories of subjects, it has links to 332,000 other websites, and it’s the starting point for any genealogical research. The ‘Cyndi’ is Cyndi Howells, an author in genealogy.

Almost every day the site is launching content that might be interesting in some particular subject. So just going back within the last couple of weeks: an article on Telling the Orphan’s story; Archive lab on how to preserve old negatives; The key to family reunion success and DNA: testing at a family reunion! Projects even go beyond individuals…they explore a Yellowstone wolf family. There is virtually nothing that is untouched. Anything with a name to it has potential for exploration.

To be honest, I haven’t been able to do any family history research since 1980, but I am hoping to do some later on this year (when I retire). All these years that have passed have meant that so much is available to be accessed over the internet.

Actually I’d love to see genealogy and family history workers and volunteers getting more recognition for the fantastic amount of industrious and tech savvy work they do. Family history is something for people from all walks of life. Our history, your history, my history is something very personal. As I say, 21 years and going strong; I’d love to see the site going stronger still in 10 years’ time.’


 

Pip Willcox, Head of the Centre for Digital Scholarship and Senior Researcher at Oxford e-Research. Chosen site: twitter.com

‘Twitter is an amazing tool that society has used to show the best of what humanity is at the moment…we share ideas, we share friendship, fun and joy, we communicate with others around the world, people help each other. But it also shows the worst of what humans can do. The news we see is just the tip of the iceberg – the levels of abuse that users, particularly minority groups, receive are appalling. Twitter is a fantastic place to meet people who think very differently from us, people who come from different backgrounds, have had different experiences, who live far from us, or close by but we might not otherwise have met. It is so rich, so full of potential, and some of what we do with it is amazing, yet some of what we do with it is appalling.

The question for the archive is “which Twitter?” There is the general feed, what you see if you don’t sign in. Then there are our individual feeds, where we curate our own filter bubbles, customizing what we see through our accounts. You can create a feed around a hashtag, an event, or slice it by time or location. All of these approaches will affect the version of Twitter we archive and leave for the future to discover.

These filter bubbles are not new: we have always lived in them, even if we haven’t called them that before. Last year there was an experiment where a series of couples who held diametrically opposing views switched Twitter accounts, and I found that, and their thoughtful responses to it, fascinating.

A project like Cultures of Knowledge, for example, which is based at the History Faculty here at the University of Oxford, traces early modern correspondence. This resource lets you search for who was writing to whom, when, where, and the subjects they were discussing. It’s an enormously rich, people-centred view of the history of ideas and relationships across time and space, and of course it points readers on in interesting directions, to engage closely with the texts themselves. This is possible because the letters were archived and catalogued over the years, over the centuries, by experts.

How are we going to trace the conversations of the late 20th and the early 21st centuries? The speed at which ideas flow is faster than ever and their breadth is global. What will future historians make of our age?

I’m interested from a future history as well as a community point of view. The way we are using Twitter has already changed and tracking its use, reach, and power seems to me well worth recording to help us understand it now, and to help explain an aspect of our lives to future societies. For me, Twitter makes the world more familiar, and anything that draws us together as a global community, that reinforces our understanding that we share one planet, that what we have in common vastly outweighs what divides us, and that helps us find ways to communicate is a good and a necessary thing.’

 


 

Will Shire, Library Assistant, Philosophy and Theology Faculty Library. Chosen site: wikipedia.org

‘It’s one of the sites I use the most…it has all of human knowledge. I think it’s a cool idea that anyone can edit it – unlike a normal book it’s updated constantly. I feel it’s derided almost too much by people who automatically think it’s not trustworthy…but I like the fact that it is a range of people coming together to edit and amend this resource. As a kid I bothered my mum all the time with constant questioning of ‘Why is this like this, why does it do that?’ Nowadays if you have a question about anything you can visit wikipedia.org. It would be really interesting to take a snapshot of one article every month or week in order to see how much it changes through user editing.

Also, I studied languages and it is extremely useful for learning new vocabulary as the links at the side of the article can take you to the content in other available languages. You can quite easily look at different words or use it as a starter to take you to different articles in other languages that aren’t English.’


 

 

 

 

Why archive the web?

Here at the Bodleian Libraries’ Web Archive (BLWA), the archiving process starts with a nomination – either by our web curators or by you, the public. From these nominations, the BLWA team selects for archiving the URLs identified as being of lasting value and significance.

Not only are the sites chosen from a preservation standpoint – we are also continually seeking to build up the scope and content of our 7 collections within the BLWA: University of Oxford; University of Oxford colleges; University of Oxford museums, libraries and archives; social sciences; arts and humanities; international; and science, medicine and technology. Exactly like the use of a physical collection, the sites belonging to the web collection will be used for research, fact checking, discovery and collaboration. There can be no denying that the web is the platform on which so much of contemporary society occurs. In the future then, and indeed now, web archives are providing an insight into our history.

Anti-Apartheid Movement Archives – http://www.aamarchives.org/

The AAMA site is part of our international collection in the BLWA. Within this collection we have captured aamarchives.org 7 times since 24th November 2015. This online platform is vital for digital access to further research, cross-cultural relationships and efforts towards understanding the history of the British Anti-Apartheid Movement 1959 – 1994. Each capture has preserved the navigation and functionality of the site and links still resolve; for example the user community can still browse the archive, learn about campaigns and download resources. The date and time of the capture are clearly displayed in the banner at the top.

BLWA’s first capture of the online AAMA

This website can also be used and explored in conjunction with our related physical holdings. Here at the Bodleian Special Collections we have an amazing depth and range of physical material in the Anti-Apartheid Movement archive and our Commonwealth and African studies collections. You can browse the catalogue for this here.

This archived capture is fully functional, like a live site.

This is a tangible example of how digital preservation enhances and complements physical material and ensures records can reach a wider audience. How exciting it is that a researcher can consult manuscript or archived material, alongside captures of websites from the past in order to gain more of an insight and have a wider scope of substance to survey!

Web content like aamarchives.org is not as stable as you might presume. A repository of web-based collections enables future discovery of internet sites that are perhaps taken for granted due to the nature of our technological society; everything is just a tap or a click away. In fact, much of the material we interact with today is only available online. The truth is that web content is ephemeral: there is a very real threat that it can rapidly change and disappear altogether. Therefore web archiving initiatives are vital to preserve these valuable resources for good. Through these captures, provenance, arrangement and content have been preserved; and arguably most importantly of all – access.

Both individual collections and the web archive as a whole can be searched for a specific site, or browsed at leisure.

Growth of open access and web-based initiatives means that there is an ever increasing network of digital libraries on a global scale. There is no doubt that the practice of web archiving is a significant contribution towards ensuring knowledge for all. Access to the Internet, enabling access to an ever growing knowledge depository, is central to the integrity of educational and professional research, web archiving and, on a larger scale, digital preservation.

Browse our collections in Bodleian Libraries’ Web Archive

Get involved and help preserve our history! Nominate a site to archive

Initiating conversation: let’s talk about web content (part 1)

To initiate conversation about preserving web content and to encourage people to think about why archiving the web is so important, I asked staff at the Bodleian Libraries to imagine the following: If you could choose just one website to have guaranteed access to in 10 years’ time what would it be – and why? Keep reading to discover staff answers and perspectives…

Richard Ovenden, Bodley’s Librarian, Bodleian Libraries. Chosen site: bodleian.ox.ac.uk

‘Obviously as somebody who is leading this institution, seeing its history reflected in the institutional website is so significant. If you go back to the archived captures of bodleian.ox.ac.uk that are accessible now through the Internet Archive it’s incredible not only to see the evolution of the HTML site itself and the look and feel of it but just to see how it reflects the changes in the organisation since the 1990s when the first Bodleian website was set up…the Bodleian was actually the first library in the UK to have a website.

We can see the changes to the way the Bodleian Libraries reflect their public persona through the web but also the website is a useful proxy for how the organisation itself has changed: the organisational structure, the administrative arrangements, the policies and strategies, how the web is a reflection of those changes over the past 20 years is really interesting. And in 10 years’ time it would be over 30 years and there will be another decade of evolution, growth, change…the web is a very convenient place to see that at a glance. We obviously archive a large number of institutional and administrative records in paper and digital form but it’s a huge amount to wade through, whereas the web provides a very convenient lens to view our organisational past through. I can’t think of another way, so conveniently, to chart our history, our progress, our challenges and even some of the mistakes that we’ve made as an organisation over that time.

Our organisation as a whole changed dramatically in the year 2000 when we stopped being just the historic Bodleian Library and we were integrated with the departmental faculty libraries. We then changed our name to University of Oxford Library services, then back to the Bodleian. Through the website you can actually see that extraordinary change. It’s such a convenient way of getting a grip on our history’.


Lukasz Kowalski, Bodleian Library Reader Services, Weston Library. Chosen site: stackexchange.com

‘I was thinking “what’s the website with the most information in it?”. My initial thought was Wikipedia.org. But I could easily live without it if I had to, as probably most knowledge contained in it is available in print. My next thought was stackexchange.com. It facilitates an exchange of knowledge and collective problem-solving on a large scale, otherwise unattainable via printed media. It’s supported by a large community of users, including experts in their fields. Together with its sister sites, it covers virtually any discipline and questions that can be asked and answered. Stackexchange is a web of knowledge, but different from Wikipedia. Rather than being organised knowledge it is more organised thinking.

My background is in Physics and I have used this site to further my understanding of concepts which did not have clear explanations in textbooks, or when I wanted to check that my thinking about a solution to a given problem was on the same page as others.

I think it goes back to what, I guess, the internet was about in the first place: the exchange of knowledge and ideas, and such is the character of this site. It’s great to rely on good teachers if one has access to them – but it is wonderful that people from across the world can gain a deeper understanding of concepts and exchange ideas by connecting more readily with those who have the expertise.’


 

Sophie Quantrell, Library Assistant, Philosophy and Theology Faculty Library. Chosen site: youtube.com

‘I was thinking about youtube.com as a resource mainly because it’s so versatile. It can be used to display images, sound…I’ve seen some people use it for musical scores – putting musical scores alongside the sound and that sort of thing. I think it is a site that can be used almost for any purpose – so you’ve got the social aspect of it with the comments and the interaction as well as the instructional aspect. I learn sign language when I am not busy with other things [gestures around her at the library] so being able to see and learn it through videos is great…it’s much more difficult to tell what the signs are if all you’ve got are drawings on a piece of paper!

It can link to videos on so many different topics, like instructional TED talks. There are so many good quality resources online that get overlooked with all the cat videos. It also crosses cultural boundaries…you can upload and view videos in whatever language you want. You could post a video from Australia and someone could be watching it in Kazakhstan!’


Iram Safdar, Graduate Trainee Digital Archivist, Weston Library. Chosen site: wikipedia.org

‘Wikipedia has been the main source for my knowledge since I was a kid. It’s also provided me with countless hours of entertainment by following the breadcrumb trail of links and seeing where you end up! All sorts of hilarity ensues when you find a rogue edit by someone…I like that it is an open source resource.

Similarly, it shows you what society thinks about things and reveals how we view stuff…which I think in a broader sense is quite interesting.’


Keep an eye out for part 2 and more staff insights coming up on the Archives and Modern Manuscripts blog imminently…

 

iPRES 2016

Last month, I attended the 13th International Conference on Digital Preservation, this year hosted in Bern, Switzerland. The four days of papers, panels, posters and workshops were an intensive and exciting opportunity to meet with colleagues working in digital preservation around the world, share ideas, and hear about innovative projects and approaches. The topics ranged widely from technical systems and practices, to quality and risk assessment, and stewardship and sustainability. What follows are just a couple of highlights from a really fascinating week.

The post-it note networking wall: What do you know? What do you want to know?

Net-based and digital art

As email, digital documents and social media replace traditional forms of communication, it is crucial to be able to preserve born-digital material and make it accessible. An area which I hadn’t previously considered was the realm of net-based art. Here, the internet is used as an artistic medium, which of course has implications (and complications) for digital preservation.

In her keynote speech, Sabine Himmelsbach from the House of Electronic Arts in Basel introduced us to this exciting field, showing artwork such as Olia Lialina’s ‘Summer’, 2013, shown below.

Screenshot of Summer, Olia Lialina, 2013. Available at https://www.youtube.com/watch?v=SxvHoXdC4Uk

The artwork features an animated loop of Lialina swinging from the browser bar. Each frame is hosted by a different website, and the playback therefore depends on your connection speed. This creative use of technology creates enormous challenges for preservation. Here, rather than preserving artefacts, it is the preservation of behaviours which is crucial, and these behaviours are extremely vulnerable to obsolescence.

Marc Lee’s ‘TV Bot’ is another net-based artwork, which is automated to broadcast current news stories with live TV streams, radio streams and webcam images from around the world. The artwork is reliant on technical infrastructure, and the shift from Real Player to Adobe Flash Player was one development which prevented ‘TV Bot’ from functioning. The artist then not only worked on technical migration, but re-interpreted the artwork, modernising the look and feel, resulting in ‘TV Bot 2.0’ in 2010. This process soon happened again, this time including a Twitter stream, in ‘TV Bot 3.0’, 2016. In this way, the artist is working against cultural, as well as technical, obsolescence.

Marc Lee, ‘TV Bot 2.0’, 2010. Image from http://ceaac.org/en/artistes/marc-lee

The heavy involvement from the artist in this case has helped preserve the artwork, but this process cannot be sustained indefinitely. Himmelsbach ended her speech by stressing the need for collaboration and dialogue, which emerged as a central theme of the conference.

A new approach to web archiving

Another highlight was the workshop on Webrecorder led by Dragan Espenschied from Rhizome. He introduced their new tool which departs from the usual crawling method to capture web content ‘symmetrically’, which results in incredibly high-fidelity captures. The demonstration of how the tool can capture dynamic and interactive content sparked gasps of amazement from the group!

Webrecorder not only captures social media, embedded video and complex JavaScript (often tricky with current tools), but can actually capture the essence of an individual’s interaction with the web content.

How it works: Webrecorder records all the content you interact with during the recording session. Users are then able to interact with the content themselves, but anything that was not viewed during the recording session will not be available to them.
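
The record-then-replay behaviour described above can be illustrated with a toy model in Python. This is not Webrecorder's code or API; the classes `RecordingSession` and `Replay`, and the idea of standing in for the live web with a simple function, are assumptions made purely to show why unvisited pages are missing on playback.

```python
class RecordingSession:
    """Stores every response fetched while the user browses."""

    def __init__(self, fetch):
        self.fetch = fetch     # callable url -> bytes, stands in for the live web
        self.captured = {}     # url -> response body seen during recording

    def visit(self, url):
        body = self.fetch(url)
        self.captured[url] = body
        return body


class Replay:
    """Serves only what was captured; it never goes back to the live web."""

    def __init__(self, captured):
        self.captured = dict(captured)

    def visit(self, url):
        if url not in self.captured:
            raise LookupError(f"{url} was not viewed during the recording session")
        return self.captured[url]
```

If the recording session visits only the home page, a later reader can replay the home page but gets an error for any other page, which mirrors the limitation noted above.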

Current web archiving strategies aren’t able to capture the personalised nature of web use. How to use this functionality is still a big question, as a recording made in this way would be personal to the web archivist, showing only what they decided to explore, unless an institution designed a systematic approach. This would itself be very resource-intensive, and is arguably not where the potential of Webrecorder lies; that is in its ability to capture dynamic content, such as net-based artworks. However, the possibility of preserving not only web content, but our interaction with it, is a very exciting development.

iPRES 2016 was a fantastic opportunity to gain insight into projects happening around the world to further digital preservation. It showed me that often there are no clear answers to ‘which file format is best for that?’ or ‘how do I preserve this?’ and that seeking advice from others, and experimenting, is often the way forward. What was really clear from attending was that the strength and support of the community is the most valuable digital preservation tool available.