Tag Archives: #SkillsForTheFuture

Why and how do we Quality Assure (QA) websites at the BLWA?

At the Bodleian Libraries Web Archive (BLWA), we Quality Assure (QA) every site in the web archive. This blog post aims to give a brief introduction to why and how we QA. The first step of our web archiving is crawling a site using the tools developed by Archive-It. These tools allow entire websites to be captured and browsed using the Wayback Machine as if they were live, allowing you to download files, view videos and photos and interact with dynamic content exactly as the website owner would want you to. However, due to the huge variety and technical complexity of websites, there is no guarantee that every capture will be successful (that is to say, that all the content is captured and working as it should be). Currently there is no accurate automatic process to check this, and so this is where we step in.

We want to ensure that the sites on our web archive are an accurate representation in every way. We owe this to the owners and the future users. Capturing the content is hugely important, but so too is how it looks, feels and how you interact with it, as this is a major part of the experience of using a website.

Quality assurance of a crawl involves manually checking the capture. Using the live site as a reference, we explore the archived capture, clicking on links and trying to download content or view videos, noting any major discrepancies from the live site or any other issues. Sometimes a picture or two will be missing, or it may be that a certain link is not resolving correctly, which can be relatively easy to fix. At other times there can be massive differences compared to the live site, and so the (often long and sometimes confusing) process of solving the problem begins. Some common issues we encounter are:

  • Incorrect formatting
  • Images/video missing
  • Large file sizes
  • Crawler traps
  • Social media feeds
  • Dynamic content playback issues
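The manual comparison described above cannot be replaced by a script, but a rough first pass can point a reviewer at likely gaps. The sketch below is only an illustration, not part of the BLWA workflow: it compares the links, images and media referenced in two HTML strings and lists what the capture lacks. The function names are hypothetical, and it assumes the archived HTML still uses the original URLs (replay tools usually rewrite them, so they would need normalising first).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ResourceCollector(HTMLParser):
    """Collects the URLs of links, images, media and scripts on a page."""

    # Tag name -> attribute that holds the resource URL.
    URL_ATTRS = {"a": "href", "img": "src", "video": "src",
                 "source": "src", "script": "src", "link": "href"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.resources = set()

    def handle_starttag(self, tag, attrs):
        wanted = self.URL_ATTRS.get(tag)
        for name, value in attrs:
            if name == wanted and value:
                # Resolve relative URLs against the page address.
                self.resources.add(urljoin(self.base_url, value))


def collect_resources(html, base_url):
    parser = ResourceCollector(base_url)
    parser.feed(html)
    return parser.resources


def missing_from_capture(live_html, archived_html, base_url):
    """Return resources referenced on the live page but not in the capture."""
    live = collect_resources(live_html, base_url)
    archived = collect_resources(archived_html, base_url)
    return sorted(live - archived)
```

A missing URL in the result is only a prompt to look more closely: it says nothing about formatting, playback or whether the file itself was captured.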

There are many techniques available to help us solve these problems, but there is no ‘one fix for all’: the same issue on two different sites may require two different solutions. There is a lot of trial and error involved, and over the years we have gained a lot of knowledge on how to solve a variety of issues. Archive-It also has a fantastic FAQ section on their site. If we have gone through the usual avenues and still cannot solve our problems, our final port of call is to ask the geniuses at Archive-It, who are always happy and willing to help.

An example of how important and effective QA can be. The initial test capture did not have the correct formatting and was missing images. This was resolved after the QA process.

QA’ing is a continual process. Websites add new content or companies change to different website designers, meaning that captures of websites that have previously been successful might suddenly have an issue. It is for this reason that every crawl is given special attention and is QA’d. QA’ing the captures before they are made available is a time-consuming but incredibly important part of the web archiving process at the Bodleian Libraries Web Archive. It allows us to maintain a high standard of capture and provide an accurate representation of the website for future generations.

 

PASIG 2017: Ageing of Digital – Towards Managed Services for Digital Continuity

PASIG 2017 (Preservation and Archiving Special Interest Group) was hosted in Oxford this year at the Natural History Museum by Bodleian Libraries & Digital Preservation at Oxford and Cambridge (DPOC). I attended on all three days (11th-13th September). When I wasn’t working, I had the opportunity to listen to some thought-provoking talks centred around the issue of digital preservation.

One of the highlights of the conference for me was a talk given by Natasa Milic-Frayling, the founder of Intact Digital. The presentation, entitled ‘Ageing of Digital: Towards Managed Services for Digital Continuity’, demonstrated the innovative ways in which digital preservation issues are being approached.

Digital technology has a short lifespan; hardware and software become outdated, redundant and obsolete in a very short time. Software in this state is known as ‘legacy software’: outdated software that no longer receives vendor support or updates.

This poses the problem – How can we manage the life-cycle of digital in the face of a dynamic and changing computing ecosystem?                                        

Technologies are routinely changed, updated (sometimes at a cost), made redundant and retired. The value of digital assets needs to be protected. In the current climate there is an imbalance of power between the technology producers and providers and the content producers, owners and curators. The providers and producers can move on without the opinion or input of those who use the software.

How do we enable prolonged use of software to protect value of digital assets?

A case study was presented that contextualised the problem and the solution. The vendor Tamal Vista Insights provided Cut&Search, software for automated and semi-automated indexing of digitised manuscripts and digital artefacts that standard OCR cannot handle.
The software was supplied to Fo Guang Shan, an international Chinese Buddhist monastic order with over 200 branch temples worldwide, for use with their digitised manuscript collection. This project is made up of thousands of volunteers and spans years, beyond the provider’s expected life-cycle for the product (its primary market lifetime).
Intact Digital provide a managed service that allows for digital continuity. There are several steps in the process, which then provide a number of options to software providers and content producers:
  • Deposit
  • Hosting
  • Remote Access
  • Digital Continuity Assurance Plans

The software can be hosted in a virtual machine and accessed remotely via a browser. The implications of this are far reaching for projects like the ones undertaken by the Fo Guang Shan. They don’t need to worry about the Cut&Search software becoming redundant and their digital assets remain protected. For smaller organisations operating on ever decreasing budgets this is an important step both for asset protection and digital preservation.

Key areas to develop

Although this is an important step, there is still much work to do and some key areas that need to be developed were highlighted. This will result in a sustained use of digital.

  • Economy around “retired” software
  • Legal frameworks and sustainable business models
  • New practices to create demand
  • New services to make it efficient, economical and sustainable

Changes to the Ecosystem

Taking these steps and creating a dialogue between the technology producers/providers and the content producers changes the dynamic of the ecosystem, redressing the imbalance in control.

 

The talk ended with two very pertinent statements:

“Together we can create new practices and new models of extending the life of digital”

“Without digital continuity our digital content, information and knowledge has no future”

As a trainee I still have lots to learn but a major theme running throughout digital archiving and digital preservation is the need for communication, collaboration and dialogue. Working together, sharing ideas and the challenges is key to securing the future of digital content.

 

A complete collection of the slides relating to this topic can be found here;  https://doi.org/10.6084/m9.figshare.5415040.v1  Milic-Frayling, Natasa (2017): Aging of digital: Towards managed services for digital continuity. figshare.

PASIG 2017: Smartphones within the changing landscape of digital preservation

I recently volunteered at the PASIG 2017 Conference in Oxford. It was a great experience to learn more about the archives sector. Many of the talks at the conference focused on the current trends and influences affecting the trajectory of the industry.

A presentation that covered some of these trends in detail was a talk by Somaya Langley from Cambridge University Library (Polonsky Digital Preservation Project), featured in the ‘Future of DP theory and practice’ session: ‘Realistic digital preservation in the near future: How do we get from A to Z when B already seems too far away?’. Somaya’s presentation considered how we preserve the digital content we receive from donors on smartphones, with her focus being on iOS.

Langley, Somaya (2017): Realistic digital preservation in the near future: How to get from A to Z when B seems too far away?. figshare. https://doi.org/10.6084/m9.figshare.5418685.v1 Retrieved: 08:22, Sep 22, 2017 (GMT)

Somaya’s presentation discussed how, in the field of digital preservation, ingest suites have long been used to dealing with CDs, DVDs, floppy disks and HDDs. However, they are not sufficiently prepared for ingesting smartphones or tablets, or for the various issues associated with these devices. We must realise that smartphones potentially hold a wealth of information for archives:

‘With the design of the Apple Operation System (iOS) and the large amount of storage space available, records of emails, text messages, browsing history, chat, map searching, and more are all being kept’.

(Forensic Analysis on iOS Devices,  Tim Proffitt, 2012. https://uk.sans.org/reading-room/whitepapers/forensics/forensic-analysis-ios-devices-34092 )

Why iOS? What about Android?

The UK smartphone market (unlike the rest of Europe) shows a much closer split: in November 2016, iOS accounted for 48.3% of sales in the UK versus 49.6% for Android. This contrasts with Apple’s global market share of 12.1% in Q3 of 2016.

Whatever side of the fence you stand on, it is clear that smartphones, be they Android or iOS, will play an important role in our collections. The skills required to extract content differ across platforms, so we as digital archivists will have to learn both methods of extraction and leave our consumer preferences at the door.

So how do we get the data off the iPhone?

iOS has long been known as a ‘locked-down’ operating system, and Apple have always had an anti-tinkering stance with many of their products. Therefore it should come as no surprise that locating files on an iPhone is not very straightforward.

As Somaya pointed out in her talk, after spending six hours in the Apple Shop ‘Genius Bar’ she was no closer to understanding from Apple employees what the best course of action would be to locate backups of notes from a ‘bricked’ iPhone. Therefore she used her own method of retrieving the notes, using iExplorer to search through the backups from the iPhone.

She noted, however, that due to limitations of iOS it was very challenging to locate these files. In some cases it even required the command line to access the storage location for backups, as they were hidden by default in OS X (now macOS, the main operating system used on Apple computers).

Many tools do exist for the purpose of extracting information from iPhones. The four main methods outlined in the SANS Institute White Paper on Forensic Analysis on iOS Devices by Tim Proffitt are:
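As a rough illustration of where those hidden backups live, the sketch below lists the backup folders under the default iTunes location on macOS (inside the user’s Library folder, which Finder hides by default). It is a minimal example under stated assumptions, not the method Somaya used: the helper name `list_backups` is hypothetical, and it only reports folder names, whether an `Info.plist` is present, and total size.

```python
from pathlib import Path

# Default iTunes/Finder backup location on macOS. The Library folder is
# hidden in Finder by default, which is why a command line helps.
DEFAULT_BACKUP_ROOT = (Path.home() / "Library" / "Application Support"
                       / "MobileSync" / "Backup")


def list_backups(root=DEFAULT_BACKUP_ROOT):
    """Return (folder name, has Info.plist, total bytes) per backup folder."""
    root = Path(root)
    if not root.is_dir():
        return []
    results = []
    for folder in sorted(p for p in root.iterdir() if p.is_dir()):
        # Add up the size of every file inside this backup folder.
        size = sum(f.stat().st_size for f in folder.rglob("*") if f.is_file())
        results.append((folder.name, (folder / "Info.plist").exists(), size))
    return results
```

Running `list_backups()` on the machine last used to sync the phone should show one entry per device backup. Reading the contents is a separate problem, particularly when the backup is encrypted.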

  1. Acquisition via iTunes Backups (requires original PC last used to sync the iPhone)
  2. Acquiring Backup Data with iPhone Analyzer (free java-based computer program, issues exist when dealing with encrypted backups)
  3. Acquisition via Logical Methods: (uses a synchronisation method built into iOS to recover data, e.g: programs like iPhone Explorer)
  4. Acquisition via Physical Methods (obtaining a bit-by-bit copy, e.g: Lantern 2 forensics suite)

Encryption is a challenge for retrieving data off the iPhone, especially since iTunes includes an encryption of backups feature when syncing. Proffitt suggests using a password cracker or jail-breaking as solutions to this issue, however, these solutions might not be fully compatible with our archive situations.

Another issue with smartphone digital preservation is platform and version locking. Although the above methods work for data extraction at the moment, it is very possible that future versions of iOS could make them defunct, requiring software developers to constantly update their programs or look for new approaches.


Final thoughts

One final consideration that can be raised from Somaya’s talk is that of privacy. As with the arrival of computers into our archives, phones will pose similar moral questions for archivists:

Do we ascribe different values to information stored on smartphones?
Do we consider the material stored on phones more personal than data stored on our computers?

As mentioned previously, our phones store everything from emails, geo-tagged photos and phone call information to, with the growing popularity of smart wearable technology, health data (including user heart rate, daily activity, weight etc.). We as digital archivists will be dealing with very sensitive personal information and need to be prepared to understand the responsibility to safeguard it appropriately.

There is no doubt that soon enough we in the archive field will be receiving more and more smartphones and tablets into our archives from donors. Hopefully talks like Somaya’s will start the ball rolling towards the creation of better standards and approaches to smartphone digital curation.

PASIG 2017: “Sharing my loss to protect your data” University of the Balearic Islands

 

Last week I was lucky enough to be able to attend the PASIG 2017 (Preservation and Archiving Special Interest Group) conference, held at the Oxford University Museum of Natural History, where over the course of three days the digital preservation community connected to share experiences, tools, successes and mishaps.

The story of one such mishap came from Eduardo del Valle, Head of the Digitization and Open Access Unit at the University of the Balearic Islands (UIB), in his presentation titled “Sharing my loss to protect your data: A story of unexpected data loss and how to do real preservation”. In 2013 the digitisation and digital preservation workflow pictured below was set up by the IT team at UIB.

2013 Digitisation and Digital Preservation Workflow (Eduardo del Valle, 2017)

Del Valle was told this was a reliable system with fast retrieval. However, he found this was not the case: retrieval was slow, and the only means of organisation was an Excel spreadsheet recording the storage locations of the data.

In order to assess their situation they used the NDSA Levels of Digital Preservation, a tiered set of recommendations on how organisations should build their digital preservation activities, developed by the National Digital Stewardship Alliance (NDSA) in 2012. The guidelines are organised into five functional areas that lie at the centre of digital preservation:

  1. Storage and geographic location
  2. File fixity and data integrity
  3. Information security
  4. Metadata
  5. File formats

These five areas then have four columns (Levels 1-4) which set tiered recommendations of action, from Level 1 being the least an organisation should do, to Level 4 being the most an organisation can do. You can read the original paper on the NDSA Levels here.
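One simple way to record such a self-assessment is as a mapping from each functional area to the highest level fully met, with the overall position set by the weakest area. The sketch below assumes exactly that; the function name and the idea of scoring an area 0 when Level 1 is not yet met are illustrative choices, not part of the NDSA guidelines.

```python
NDSA_AREAS = [
    "Storage and geographic location",
    "File fixity and data integrity",
    "Information security",
    "Metadata",
    "File formats",
]


def summarise_levels(levels_met):
    """Summarise a self-assessment against the NDSA Levels.

    levels_met maps each functional area to the highest level (1-4)
    fully met, or 0 if Level 1 is not yet met. Returns the overall
    level (the lowest across all areas) and the areas holding it back.
    """
    for area in NDSA_AREAS:
        if area not in levels_met:
            raise ValueError(f"No level recorded for: {area}")
        if levels_met[area] not in range(0, 5):
            raise ValueError(f"Level for {area} must be between 0 and 4")
    overall = min(levels_met[area] for area in NDSA_AREAS)
    weakest = [area for area in NDSA_AREAS if levels_met[area] == overall]
    return overall, weakest
```

Listing the weakest areas alongside the overall level makes it obvious where improvement is most urgent.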

The slide below shows the extent to which the University met the NDSA Levels. They found there was an urgent need for improvement.

NDSA Levels of Preservation UIB compliance (Eduardo del Valle, 2017)

“Anything that can go wrong, will go wrong” – Eduardo del Valle

In 2014 the IT team decided to implement a new back up system. While the installation and configuration of the new backup system (B) was completed, the old system (A) remained operative.

On the 14th and 15th November 2014, a backup was created for the digital material generated during the digitisation of 9 rare books from the 14th century in the Tape Backup System (A) and notably, two confirmation emails were received, verifying the success of the backup.  By October 2015, all digital data had been migrated from System (A) to the new System (B), spanning UIB projects from 2008-2014.

However, on 4th November 2015, a loss of data was detected…

The files corresponding to the 9 digitised rare books were lost. This loss was detected a year after the initial backup of the 9 books in System A, by which time the contract for technical assistance had finished. This meant there was no possibility of obtaining financial compensation if the loss was due to a hardware or software problem. The loss of these files, unofficially dubbed “the X-files”, meant the loss of three months of work and its corresponding economic loss. Furthermore, the rare books were in poor condition, and to digitise them again could cause serious damage. Despite a number of theories, the University is yet to receive an explanation for the loss of data.

The digitised 14th century rare book from UIB collection (Eduardo del Valle, 2017)

To combat issues like this, and to enforce best practice in their digital preservation efforts, the University acquired Libsafe, a digital preservation solution offered by Libnova. Libsafe is OAIS and ISO 14721:2012 compliant, and encompasses advanced metadata management with a built-in ISAD(G) filter, with the possibility to import any custom metadata schema. Furthermore, Libsafe offers fast delivery, format control, storage of two copies in disparate locations, and a built-in catalogue. With the implementation of a standards-compliant workflow, the UIB proceeded to meet all four levels of the 5 areas of the NDSA Levels of Digital Preservation.

ISO 14721:2012, Space Data and Information Transfer Systems – Open Archival Information System (OAIS) – Reference Model, provides a framework for implementing the archival concepts needed for long-term digital preservation and access, and for describing and comparing architectures and operations of existing and future archives, as well as describing roles, processes and methods for long-term preservation.

The use of these standards facilitates the easy access, discovery and sharing of digital material, as well as their long-term preservation. Del Valle’s story of data loss reminds us of the importance of implementing standards-based practices in our own institutions, to minimise risk and maximise interoperability and access, in order to undertake true digital preservation.

 

With thanks to Eduardo del Valle, University of the Balearic Islands.

PASIG2017: Preserving Memory

 

The Oxford University Natural History Museum (photo by Roxana Popistasu, twitter)

This year’s PASIG conference (Preservation and Archiving Special Interest Group) brought together an eclectic mix of individuals from around the world to discuss the very exciting and constantly evolving topic of digital preservation. Held at the Oxford University Natural History Museum, the conference aimed to connect practitioners from a variety of industries with a view to promoting conversation surrounding various digital preservation experiences, designs and best practices. The presentations given comprised a series of lightning talks, speeches and demos on a variety of themes, including the importance of standards, sustainability and copyright within digital preservation.

UNHCR: Archiving on the Edge

UNHCR Fieldworkers digitally preserving refugee records (photo by Natalie Harrower, twitter)

I was particularly moved by a talk given on the third day by Patricia Sleeman, an Archivist working for the UNHCR, a global organisation dedicated to saving lives, protecting rights and building a better future for refugees, forcibly displaced communities and stateless people.

Entitled “Keep your Eyes on the Information”, Sleeman’s poignant and thought-provoking presentation discussed the challenges and difficulties faced when undertaking digital preservation in countries devastated by the violence and conflicts of war. Whilst recognising that digital preservation doesn’t immediately save lives in the way that food, water and aid can, Sleeman identified the place of digital preservation as having significant importance in the effort to retain, record and preserve the memory, identity and voice of a people which would otherwise be lost through the destruction and devastation of displacement, war and violence.

About the Archive

Sleeman and her team seek to capture a wide range of digital media, including YouTube, websites and social media, each forming a precious snapshot of history and an antidote to the violent acts of mnemocide, or the destruction of memory.

The digital preservation being undertaken is still in its early stages with focus being given to the creation of good quality captures and metadata. It is hoped in time however that detailed policies and formats will be developed to aid Sleeman in her digital preservation work.

One of the core challenges of this project has been handling highly sensitive material including refugee case files. The preservation of such delicate material has required Sleeman and her team to act slowly and with integrity, respecting the content of information at each stage.

For more information on the UNHCR  please click here.

 

PASIG 2017: Reflections on ‘Digital Preservation at the United Nations Mechanism for International Criminal Tribunals’

Along with my colleagues, I was incredibly grateful to be at Oxford PASIG 2017, hosted at the Oxford University Museum of Natural History from 11-13 September.

A presentation given by Angeline Takawira was affirmation indeed as to why advocacy for digital preservation is crucial worldwide. Angeline gave us an insight into the aims and challenges of digital preservation at the United Nations Mechanism for International Criminal Tribunals (UN MICT).

The Mechanism

Angeline explained that the purpose of the UN MICT is to continue the mandated and essential actions that have been carried out temporarily by two International Criminal Tribunals: Rwanda (ICTR), from 1994 until 2015, and Yugoslavia (ICTY), since 1993, which will be closing at the end of this year. UN MICT was established in 2010 by the UN Security Council, and is therefore a relatively new organisation. However, like its two predecessors, it is temporary.

We were told about the highly significant and mandated functions of MICT:

  1. To protect and support victims, witnesses and all others affected by war crimes
  2. To enforce sentences and other judicial work
  3. To preserve and manage the archives of the international tribunals.

You can find out more about the important work of the UN MICT here.

Digital Preservation at UN MICT

The Mechanism is made up of two branches: The Hague, Netherlands and Arusha, Tanzania, so the single digital repository is maintained across two continents. Currently the digital records of each of these are a hybrid of both digitised and born-digital material with example files including emails, GIS datasets, websites and CAD files. However, the audio-visual files take up 90% in volume of the digital archives combined.

It is so apparent that UN MICT’s  preservation goals are aligned to their aims as an organisation as a whole; authenticity is imperative for all of their records.  Angeline asserted that their digital preservation goals were to be trustworthy, accessible and useable and ‘demonstrably authentic’ – that is, identical to the digital original in all essential aspects. The digital archive is made up of:

  • Judicial case records – such as court decisions, judgements, court transcripts
  • Records relating to the judicial process – for example detentions of the accused and the protection of witnesses
  • Administrative records of the tribunals as an organisation (and also the Mechanism as an organisation).

Through a range of actions, the development of the digital preservation programme is achieving these aims. Angeline cited the introductions of workflows and compliance with standards, as well as the records being transferred to the repository with an unbroken chain of custody with stringent access controls and fixity checks to ensure no corruption. Furthermore, work continues on defining procedures around migration plans, as the Mechanism wishes to retain an experience of authenticity – which understandably needs a focus on file format characteristics.
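To make the idea of a fixity check concrete, here is a minimal sketch of one common approach: record a checksum for every file when it enters the repository, then recompute and compare later. This is an illustration under my own assumptions (SHA-256, a manifest held as a dictionary, hypothetical function names), not a description of how UN MICT implements its checks.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its checksum."""
    root = Path(root)
    return {p.relative_to(root).as_posix(): sha256_of(p)
            for p in sorted(root.rglob("*")) if p.is_file()}


def verify_manifest(root, manifest):
    """Return the files that are missing or whose checksum has changed."""
    root = Path(root)
    problems = {}
    for relative_path, expected in manifest.items():
        target = root / relative_path
        if not target.is_file():
            problems[relative_path] = "missing"
        elif sha256_of(target) != expected:
            problems[relative_path] = "checksum mismatch"
    return problems
```

An empty result from `verify_manifest` means every file recorded at transfer is still present and bit-for-bit unchanged. The manifest itself needs to be stored safely, ideally apart from the files it describes.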

Challenges

PASIG definitely taught me that authentic and usable digital preservation is always a trying undertaking, but the challenges faced when digitally preserving the records of the UN MICT are unique due to their sensitive content and technicalities. For one, the fact that it is a temporary organisation is at odds with the long-term endeavour of making these tribunal records accessible for the future and ensuring their protection. A repository transfer as a next step would need extremely careful consideration. Also, the retention schedule of different data is a factor for discussion, so that the UN MICT can fulfil its requirements of deletion in a transparent way.

One of the largest challenges to the future of digital preservation for similar organisations and initiatives is the limited financial sustainability, resources and staff available to sustain the long-term commitment that digital preservation of records like these really commands.

Use

There is no doubt that the digital archive of the UN MICT would be of fundamental significance to an international user community of the global media, legal professionals, academics, researchers and education in general. Combine these user groups with the broad range of stakeholders in preserving the Mechanism, such as the international courts and the Security Council that mandated the work, and there are many to whom this cause, and the information it preserves, will be vital. I have visited 4 countries of the former Yugoslavia, and the digital records of the MICT are surely just as essential to preserve and learn from as the physical and tangible evidence of conflict. The need for advocacy of digital preservation is pressing, and the UN MICT are doing urgent work.

Bountiful Harvest: Curation, Collection and Use of Web Archives

The theme for the ARA Annual Conference 2017 is: ‘Challenge the Past, Set the Agenda’. I was fortunate enough to attend a pre-conference workshop in Manchester, run by Lori Donovan and Maria Praetzellis from The Internet Archive, about the bountiful harvest that is web content, and the technology, tools and features that enable web archivists to overcome the challenges it presents.

Part I – Collections, Community and Challenges

Lori gave us an insight into the use cases of Archive-It partner organisations to show us the breadth of reasons why other institutions archive the web. The creation of a web collection can be for one of (or indeed, all) the following reasons:

  • To maintain institutional history
  • To document social commentary and the perspectives of users
  • To capture spontaneous events
  • To augment physical holdings
  • Responsibility: Some documents are ONLY digital. For example, if a repository upholds a role to maintain all published records, a website can be moved into the realm of publication material.

When asked about duplication amongst web archives, and whether it was a problem if two different organisations archive the same web content, Lori put forward the argument that duplication is not worrisome. More captures of a website are good for long-term preservation in general, and in some cases organisations can work together on collaborative collecting if the collection scope is appropriate.

Ultimately, the priority of crawling and capturing a site is to recreate the same experience a user would have if they were to visit the live site on the day it was archived. Combining this with an appropriate archive frequency  means that change over time can also be preserved. This is hugely important: the ephemeral nature of internet content is widely attested to. Thankfully, the misconception that ‘online content will be around forever’ is being confronted. Lori put forward some examples to illustrate the point for why the archiving of websites is crucial.

In general, a typical website lasts 90-100 days before one of the following happens:

  1. The content changes
  2. The site URL moves
  3. The content disappears completely

A study was carried out on the Occupy Movement sites archived in 2012. Of 582 archived sites, only 41% were still live on the web as of April 2014. (Lori Donovan)

Furthermore, we were told about a 2014 study which concluded that 70% of scholarly articles online with text citations suffered from reference rot over time. This speaks volumes about the importance of preserving copies, for the sake of both authentication and academic integrity.

The challenge continues…

Lori also pointed us to the NDSA 2016/2017 survey, which outlines the principal concerns within web archiving currently: social media (70%), video (69%), and interactive media and databases (both 62%). Any dynamic content can be difficult to capture and curate, so sharing advice and guidelines amongst leaders in the web archiving community is a key factor in determining successful practice for both current web archivists and those of future generations.

Part II – Current and Future Agenda

Maria then talked us through some key tools and features which enable greater crawling technology, higher quality captures and the preservation of web archives for access and use:

  • Brozzler. Definitely my new favourite portmanteau (browser + crawler = brozzler!), Brozzler is the newly developed crawler by The Internet Archive which is replacing the combination of the Heritrix and Umbra crawlers. Brozzler captures HTTP traffic as it is loaded, works with YouTube in order to improve media capture, and immediately writes and saves the data as a WARC file. Also, Brozzler uses a real browser to fetch pages, which enables it to capture embedded URLs and extract links.
  • WARC. The Web ARChive file format is the ISO standard for web archives. It is a concatenated file written by a crawler, with long-term storage and preservation specifically in mind. However, Maria pointed out to us that WARC files are not constructed to easily enable research (more on this below).
  • Elasticsearch. The full-text search system does not just search the HTML content displayed on the web pages; it also searches PDF, Word and other text-based documents.
  • Solr. A metadata-only search tool. Metadata can be added on Archive-It at collection, seed and document level.
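To give a feel for what ‘concatenated’ means in a WARC file, each record is a version line and a block of named header fields, then a blank line, then a content block whose size is given by the `Content-Length` field. The sketch below parses one uncompressed record held in memory. It is a simplified illustration only: it ignores folded header lines and gzip compression, and the function name is hypothetical. For real collections a maintained library such as warcio is the safer choice.

```python
def parse_warc_record(data):
    """Parse one uncompressed WARC record from bytes.

    Returns (version, headers, content_block).
    """
    # The header block ends at the first blank line (CRLF CRLF).
    header_block, separator, rest = data.partition(b"\r\n\r\n")
    if not separator:
        raise ValueError("No blank line found after the WARC header block")
    lines = header_block.decode("utf-8").split("\r\n")
    version = lines[0]
    if not version.startswith("WARC/"):
        raise ValueError("Record does not start with a WARC version line")
    headers = {}
    for line in lines[1:]:
        # Split on the first colon only, since values such as URLs contain colons.
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    length = int(headers["Content-Length"])
    return version, headers, rest[:length]
```

For a `response` record the content block is the full HTTP response as the crawler received it, which is why a WARC file is good for preservation but awkward to analyse directly.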

Supporting researchers now and in the future

The tangible experience and use of web archives, where a site can be navigated as if it were live, can shed so much light on the political and social climate of its time of capture. Yet Maria explained that the raw captured data, rather than just the replay, is a rich area for potential research and, if handled correctly, an invaluable research tool.

As well as the use of Brozzler as a new crawling technology, Archive-It research services offer a set of derivative dataset files which are less complex than WARC and allow for data analysis and research. One of these derivative datasets is a Longitudinal Graph Analysis (LGA) dataset file, which allows the researcher to analyse the trend in links between URLs over time within an entire web collection.
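As a sketch of the kind of question such link data can answer, the example below counts how often one host links to another in each capture year. The input layout, a list of (timestamp, source URL, target URL) tuples, is an assumption made for illustration and is not the actual LGA file format; the function name is also hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def links_per_year(link_records):
    """Count links between pairs of hosts for each capture year.

    link_records is an iterable of (timestamp, source_url, target_url),
    where timestamp starts with a four-digit year, e.g. "20170914120000".
    """
    counts = Counter()
    for timestamp, source, target in link_records:
        year = timestamp[:4]
        source_host = urlsplit(source).hostname
        target_host = urlsplit(target).hostname
        counts[(year, source_host, target_host)] += 1
    return counts
```

Comparing the counts for the same pair of hosts across years gives a simple picture of how connections within a collection grow or fade.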

Maria acknowledged that there are lessons  to be learnt when supporting researchers using web archives, including technical proficiency training and reference resources. The typology of the researchers who use web archives is ever growing: social and political scientists, digital humanities disciplines, computer science and documentary and evidence based research including legal discovery.

What Lori and Maria both made clear throughout the workshop was that the development and growth of web archiving is integral to challenging the past and preserving access on a long term scale. I really appreciated an insight into how the life cycle of web archiving is a continual process, from creating a collection, through to research services, whilst simultaneously managing the workflow of curation.

When in Manchester…

Virtual Archive, Central Library, Manchester

I couldn’t leave Manchester without exploring the John Rylands Library and Manchester’s Central Library. In the latter, an interactive digital representation of a physical archive lets you choose a box, arranged as it might be in a physical archive, and then projects the digitised content onto the screen. A few streets away in Deansgate I had just enough time in John Rylands to learn that the fear of beards is called pogonophobia. Go and visit yourself to learn more!

Special collections reading room, John Rylands Library, Manchester

PDF/A: Challenges Meeting the ISO 19005 Standard

Anna Oates (MSLIS Candidate, University of Illinois at Urbana-Champaign and NDNP Coordinator Graduate Assistant, Preservation Services) explaining the differences between PDF and PDF/A

We were excited to attend the recent project presentation entitled: ‘A Case Study on Theses in Oxford’s Institutional Repository: Challenges Meeting the ISO 19005 Standard’ given by Anna Oates, a student involved in the Oxford-Illinois Digital Libraries Placement Programme.

The presentation focused initially on the PDF/A format: PDF/A differs from standard PDF in that it avoids common long term access issues associated with PDF. For example, a PDF created today may look and behave differently in 50 years’ time. This is because many visual aspects of the PDF are not saved into the file itself (PDFs can use font linking instead of font embedding). The standardised PDF/A format attempts to remedy this by embedding metadata within the file and restricting certain features commonly found in PDF which could inhibit long term preservation.

Aspects excluded from PDF/A include:

  • Audio and video content
  • JavaScript executable files
  • All forms of PDF encryption

PDF/A is therefore better suited to the long term preservation of digital material, as it maintains the integrity of the information included in the source files, be this textual or visual. Oates described PDF/A as having multiple ‘flavours’. PDF/A-1, published in 2005, includes conformance level A (Accessible – maintains the structure of the file) and B (Basic – maintains the visual aspects only). Versions 2 and 3, published in 2011 and 2012, added conformance level U (Unicode – enabling the embedding of Unicode information) alongside other features such as JPEG 2000 compression and the embedding of arbitrary file formats within PDF/A documents.

Oates specified that different types of documents benefited from different ‘flavours’ of PDF/A, for example, digitised documents were better suited to conformance level B whereas born digital documents were better suited to level A.

Whilst specifying the benefits of PDF/A, Oates also highlighted the myriad issues associated with the format. Firstly, while experimenting with creating and conforming PDF/A documents, she noted that the conformed documents had slight differences, such as changes to the colour of pixels in embedded image files (the PDF/A format showed less difference in the colour of pixels with programs like PDF Studio). This showcased a clear alteration of the authenticity of the original source file.

Oates compared source images to PDF/A converted images and found obvious visual differences.

Secondly, Oates noted that when converting files from PDF to PDF/A-1b, smart software would change the decode filter of the image (e.g. changing from JPXDecode, used for JPEG 2000, to DCTDecode, accepted by ISO 19005) in order to ensure it would conform to ISO 19005. However, she noted that despite the positives of avoiding non-conformance, the software had increased the file size of the PDF by 65%. The file size increase poses obvious issues with regard to storage and cost considerations for organisations using PDF/A.
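
As a rough illustration of the two things being compared here, the sketch below counts how often the JPXDecode and DCTDecode filter names appear in a PDF's raw bytes and works out a percentage size increase. It is a heuristic written for this post, not Oates' method and not a conformance check (a validator such as veraPDF does that properly); the function names and sample values are assumptions.

```python
import re

# Filter names mentioned in the talk: JPXDecode (JPEG 2000) and DCTDecode (JPEG).
# Searching the raw bytes is only a rough heuristic, not a real PDF parser.
FILTER_NAME = re.compile(rb"/(JPXDecode|DCTDecode)\b")


def count_image_filters(pdf_bytes):
    """Count occurrences of each filter name in the raw bytes of a PDF."""
    counts = {"JPXDecode": 0, "DCTDecode": 0}
    for match in FILTER_NAME.finditer(pdf_bytes):
        counts[match.group(1).decode("ascii")] += 1
    return counts


def size_increase_percent(original_size, converted_size):
    """Return the size increase after conversion as a percentage of the original."""
    return (converted_size - original_size) * 100 / original_size


if __name__ == "__main__":
    # A made-up fragment of an image dictionary, standing in for a whole file.
    sample = b"<< /Type /XObject /Subtype /Image /Filter /JPXDecode >>"
    print(count_image_filters(sample))
    print(size_increase_percent(1000, 1650))
```

Running the same count on a file before and after conversion would show whether JPXDecode images had been re-encoded as DCTDecode, and the second function puts a number on the storage cost.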

Oates’ workflow for creation and conformance checking of PDF/A files using different PDF/A software

Format uptake was also discussed by Oates. She found that PDF/A had not been widely utilised by universities in the UK for long term preservation of dissertations and theses. However, Oates provided examples of users of PDF/A for Electronic Theses and Dissertations Repositories that included: Concordia University, Johns Hopkins University, McGill University, Rutgers University, University of Alberta, University of Oulu and Virginia Tech. Alongside this it was mentioned that uptake amongst Research and Cultural Heritage Institutions included: the Archaeology Data Service (ADS), British Library, California Digital Library, Data Archiving and Networked Services (DANS), the Library of Congress and the U.S. National Archives and Records Administration (NARA).

“Adobe Preflight has failed to recognize most of the glyph errors. As such, veraPDF will remain our final tool for validation.” (Anna Oates)

Oates therefore concluded that PDF/A was not the best solution to PDF preservation. She mentioned that the new ISO standard would cause new issues and considerations for PDF/A users.

Following the presentation the audience debated whether PDF/A should still be used. Some considered whether other solutions existed to PDF preservation; an example of a proposed solution was to keep both PDF/A and the original PDFs. However, many still felt that PDF/A provided the best solution available despite its various drawbacks.

Hopefully Oates’ findings will highlight the various areas needed for improvement in both PDF/A conversion/validation software and conformance aspects of the ISO 19005 Standard used by PDF/A to ensure it is up to the task of digital preservation.

To learn more about PDF/A have a look at Adobe’s own e-book PDF/A In a Nutshell.

Alice, Ben and Iram (Trainee Digital Archivists)

Email Preservation: How Hard Can it Be? DPC Briefing Day

Miten and I outside the National Archives, looking forward to a day of learning and networking

Last week I had the pleasure of attending a Digital Preservation Coalition (DPC) Briefing Day titled Email Preservation: How Hard Can it Be? 

In 2016 the DPC, in partnership with the Andrew W. Mellon Foundation, announced the formation of the Task Force on Technical Approaches to Email Archives to address the challenges presented by email as a critical historical source. The Task Force delineated three core aims:

  1. Articulating the technical framework of email
  2. Suggesting how tools fit within this framework
  3. Beginning to identify missing elements.

The aim of the briefing day was two-fold; to introduce and review the work of the task force thus far in identifying emerging technical frameworks for email management, preservation and access; and to discuss more broadly the technical underpinnings of email preservation and the associated challenges, utilising a series of case studies to illustrate good practice frameworks.

The day started with an introductory talk from Kate Murray (Library of Congress) and Chris Prom (University of Illinois Urbana-Champaign), who explained the goals of the task force in the context of emails as cultural documents, which are worthy of preservation. They noted that email is a habitat where we live a large portion of our lives, encompassing both work and personal. Furthermore, when looking at the terminology, they acknowledged email is an object, several objects and a verb – and its multi-faceted nature adds to the complexity of preserving email. Ultimately, it was said that email is a transactional process whereby a sender transmits a message to a recipient and, from a technical perspective, a protocol that defines a series of commands and responses that operate in a manner like a computer programming language and which permits email processes to occur.

From this standpoint, several challenges of email preservation were highlighted:

  • Capture: building trust with donors, aggregating data, creating workflows and using tools
  • Ensuring authenticity: ensuring no part of the email (envelope, header, message data etc.) has been tampered with
  • Working at scale: email
  • Addressing security concerns: malicious content leading to vulnerability, confidentiality issues
  • Messages and formats
  • Preserving attachments and linked/networked documents: can these be saved and do we have the resources?
  • Tool interoperability

The first case study of the day was presented by Jonathan Pledge from the British Library on “Collecting Email Archives”, who explained that born-digital research began at the British Library in 2000, and many of their born-digital archives contain email. The presentation was particularly interesting as it included their workflow for forensic capture, processing and delivery of email for preservation, providing a current and real life insight into how email archives are being handled. The British Library use Aid4Mail Forensic for their processing and delivery, but are looking into ePADD as a more holistic approach. ePADD is a software package developed by Stanford University which supports archival processes around the appraisal, ingest, processing, discovery and delivery of email archives. Some of the challenges they experienced arose because email often contains personal information. A possible solution would be the redaction of offending material; however, they noted this could lead to the loss of meaning, as well as being an extremely time-consuming process.

Next we heard from Anthea Seles (The National Archives) and Greg Falconer (UK Government Cabinet Office) who spoke about email and the record of government. Their presentation focused on the question of where the challenge truly lies for email – suggesting that, as opposed to issues of preservation, the challenge lies in capture and presentation. They noted that when coming from a government or institutional perspective, the amount of email created increases hugely, leaving large collections of unstructured records. In terms of capture, this leads to the challenge of identifying what is of value and what is sensitive. Following this, the major challenge is how to best present emails to users – discoverability and accessibility. This includes issues of remapping existing relationships between unstructured records, and again, the issue of how to deal with linked and networked content.

The third and final case study was given by Michael Hope from Preservica, an “Active Preservation” technology providing a suite of Open Archival Information System (OAIS) compliant workflows for ingest, data management, storage, access, administration and preservation for digital archives.

Following the case studies, there was a second talk from Kate Murray and Chris Prom on emerging Email Task Force themes and their Technology Roadmap. In June 2017 the task force released a Consultation Report Draft of their findings so far, to enable review, discussion and feedback, and the remainder of their presentation focused on the contents and gaps of the draft report. They talked about three possible preservation approaches:

  • Format Migration: copying data from one type of format to another to ensure continued access
  • Emulation: recreating user experience for both message and attachments in the original context
  • Bit Level Preservation: preservation of the file, as it was submitted (may be appropriate for closed collections)
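
Of the three, bit level preservation is the easiest to picture in code: keep the file exactly as submitted and check later that it has not changed. The sketch below is a minimal illustration using SHA-256 checksums (often called fixity checks). It was not shown by the task force, and the function names and the file name in the usage note are assumptions.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the SHA-256 checksum of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_fixity(path, expected_checksum):
    """Return True if the file still matches the checksum recorded at ingest."""
    return sha256_of_file(path) == expected_checksum.lower()
```

In use, a checksum for something like a hypothetical `inbox.mbox` would be recorded at ingest, and `verify_fixity` run periodically afterwards to confirm the bits are unchanged.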

They noted that there are many tools within the cultural heritage domain designed with interoperability, scalability, preservation and access in mind, yet these are still developing and improving. Finally, we discussed the possible gaps in the draft report, and issues such as the authenticity of email collections were raised, as well as a general interest in the differing workflows between institutions. Ultimately, I had a great time at The National Archives for the Email Preservation: How Hard Can it Be? Briefing Day – I learnt a lot about the various challenges of email preservation, and am looking forward to seeing further developments and solutions in the near future.

Email Preservation: How Hard Can it Be? DPC Briefing Day

On Thursday 6th July 2017 I attended the Digital Preservation Coalition briefing day on email preservation, held in partnership with the Andrew W. Mellon Foundation and titled ‘Email preservation: how hard can it be?’. It was hosted at The National Archives (TNA). This was my first visit to TNA and it was fantastic. I didn’t know a great deal about email preservation prior to this and so I was really looking forward to learning about this topic.

The National Archives. Photograph by Mike Peel (www.mikepeel.net), CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=9786613

The aim of the day was to engage in discussion about some of the current tools, technologies and thoughts on email preservation. It was orientated around the ‘Task Force on Technical Approaches to Email Preservation’ report that is currently in its draft phase. We also got to hear about interesting case studies from the British Library, TNA and Preservica, each presenting their own unique experiences in relation to this topic. It was a great opportunity to learn about this area and hear from the co-chairs (Kate Murray and Christopher Prom) and the audience about their thoughts on the current situation and possible future directions.

We heard from Jonathan Pledge from the British Library (BL). He told us about the forensic capture expertise gained by the BL and about using EnCase to capture email data from hard drives, CDs and USB drives. We also got an insight into how they are deciding which email archive tool to use. Aid4Mail fits better with their workflow; however, ePADD with its holistic approach was something they were considering. During their ingest they separate the emails from the attachments. They found that after the time-consuming process of removing emails that would violate data protection laws, there was very little usable content left, as often entire threads would have to be redacted due to one message. This is not the most effective use of an archivist’s time and is something they are working to address.

We also heard from Anthea Seles, who works with government collections at TNA. We learnt that their research found that approximately 1TB of data in an organisation’s own electronic document and records management system is linked to 10TB of related data in shared drives. Her focus was on discovery and data analytics. For example, one way to increase efficiency and focus the curator’s attention was to batch emails. If an email was sent from TNA to a vast number of people, then there is a high chance that the content does not contain sensitive information. However, if it was sent to a high profile individual, then there is a higher chance that it will contain sensitive information, so the curator can focus their attention on those messages.
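
The batching idea can be sketched in a few lines of Python using the standard `email` library. This is only an illustration of the heuristic as I understood it, not TNA's actual tooling: the address in `HIGH_PROFILE`, the `BULK_THRESHOLD` of 50 recipients and the batch labels are all hypothetical.

```python
from email import message_from_string
from email.utils import getaddresses

# Hypothetical values for illustration only.
HIGH_PROFILE = {"minister@example.gov.uk"}
BULK_THRESHOLD = 50


def recipients(msg):
    """Return the lower-cased addresses found in the To and Cc headers."""
    fields = msg.get_all("To", []) + msg.get_all("Cc", [])
    return [addr.lower() for _name, addr in getaddresses(fields) if addr]


def triage(raw_message):
    """Sort a raw message into a review batch based on who received it."""
    addrs = recipients(message_from_string(raw_message))
    if any(addr in HIGH_PROFILE for addr in addrs):
        return "review-first"  # more likely to contain sensitive content
    if len(addrs) >= BULK_THRESHOLD:
        return "likely-low-sensitivity"  # mass mailing
    return "normal"
```

A curator would still make the final decision; the point of a rule like this is only to decide which batch to look at first.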

Hearing from Preservica was interesting as it gave an insight into the commercial side of email archiving. In their view, preservation was not an issue. Instead, their attention was focused on issues such as identifying duplicate or unwanted emails efficiently, developing tools for performing whole collection email analysis and, interestingly, how to solve the problem of acquiring emails via a continuous transfer.
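
To make the duplicate problem concrete, here is a minimal sketch of one possible approach: treat two messages as duplicates if they share a Message-ID header, and fall back to a checksum of the raw text when that header is missing. This is my own illustration rather than anything Preservica described, and the function names are assumptions.

```python
import hashlib
from email import message_from_string


def dedup_key(raw_message):
    """Build a key that is identical for copies of the same message."""
    message_id = message_from_string(raw_message).get("Message-ID")
    if message_id:
        return "id:" + message_id.strip()
    # No Message-ID header: fall back to a checksum of the raw text.
    return "sha256:" + hashlib.sha256(raw_message.encode("utf-8")).hexdigest()


def drop_duplicates(raw_messages):
    """Keep the first copy of each message, preserving the original order."""
    seen = set()
    unique = []
    for raw in raw_messages:
        key = dedup_key(raw)
        if key not in seen:
            seen.add(key)
            unique.append(raw)
    return unique
```

Real collections are messier than this: the same message can arrive with different headers in different mailboxes, which the checksum fallback will not catch.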

Emails are not going to be the main form of communication forever (the rise in the popularity of instant messaging is clear to see); however, we learnt that growth in email use is still expected for the near future.

One of the main issues that was brought up was the potential size of future email archives and the issues that come with effective and efficient appraisal. What is large in academic terms, e.g. 100,000 emails, is not large in government. The figure of over 200 million emails at the George W. Bush presidential library is a phenomenal amount, and the Obama administration’s is estimated at 300 million. This requires smart solutions and we learnt how the use of artificial intelligence and machine learning could help.

Continuous active learning was highlighted as a way to improve searches. An example of searching for Miami Dolphins was given. The Miami Dolphins are an American football team; however, someone might also be looking for information about dolphins in Miami. Initially the computer would present different search results and the user would choose which is the more relevant; over time it learns what the user is looking for in cases where searches can be ambiguous.

Another issue that was highlighted was how you make sure that you have found the correct person. How do you avoid false positives? At TNA the ‘Traces Through Time’ project aimed to do that, initially with World War One records. This technology, which uses big data analytics, can also be used with email archives. There is also work on mining the email signature as a way to better determine ownership of the message.

User experience was also discussed. Emulation is an area of particular interest. The positive of this is that it recreates how the original user would have experienced the emails. However, this technology is still being developed. Bit level preservation is a solution to make sure we capture and preserve the data now. This prevents loss of the archive and allows the information and value to be extracted in the future once the tools have been developed.

It was interesting to hear how policy could affect how easy it would be to acquire email archives. The new General Data Protection Regulation, which comes into effect in May 2018, will mean that anyone in breach faces heavier penalties, of up to 4% of annual worldwide turnover. This means that companies may err on the side of caution with regard to keeping personal data such as emails.

Whilst the email protocols are well standardised, allowing emails to be sent from one client to another (e.g. from an AOL account of the early 1990s to the Gmail of now), the acquisition of them is not. When archivists get hold of email archives, they are left with the remnants of whatever the email client or user has done to them. This means metadata may have been added or removed and formats can vary. This adds a further level of complexity to the whole process.

The day was thoroughly enjoyable. It was a fantastic way to learn about archiving emails. As emails are now one of the main methods of communication for government, large organisations and personal use, it is important that we develop the tools, techniques and policies for email preservation. To answer the question ‘how hard can it be?’ I’d say very. Emails are not simple objects of text; they are highly complex entities comprising attachments, links and embedded content. The solution will be complex but there is a great community of researchers, individuals, libraries and commercial entities working on solving this problem. I look forward to hearing the update in January 2018 when the task force is due to meet again.