PDF/A: Challenges Meeting the ISO 19005 Standard

Anna Oates (MSLIS Candidate, University of Illinois at Urbana-Champaign and NDNP Coordinator Graduate Assistant, Preservation Services) explaining the differences between PDF and PDF/A

We were excited to attend the recent project presentation entitled: ‘A Case Study on Theses in Oxford’s Institutional Repository: Challenges Meeting the ISO 19005 Standard’ given by Anna Oates, a student involved in the Oxford-Illinois Digital Libraries Placement Programme.

The presentation focused initially on the PDF/A format. PDF/A differs from standard PDF in that it avoids common long-term access issues associated with PDF. For example, a PDF created today may look and behave differently in 50 years’ time. This is because many visual aspects of a PDF are not necessarily saved in the file itself (a PDF can link to fonts rather than embed them). The standardised PDF/A format attempts to remedy this by embedding this information, along with metadata, within the file and by restricting certain features commonly found in PDF that could inhibit long-term preservation.

Aspects excluded from PDF/A include:

  • Audio and video content
  • JavaScript and executable files
  • All forms of PDF encryption

PDF/A is therefore better suited to the long-term preservation of digital material, as it maintains the integrity of the information included in the source files, be this textual or visual. Oates described PDF/A as having multiple ‘flavours’. PDF/A-1, published in 2005, includes conformance levels A (Accessible – maintains the structure of the file) and B (Basic – maintains the visual aspects only). Versions 2 and 3, published later in 2011 and 2012, added conformance level U (Unicode – enabling the embedding of Unicode information) alongside other features such as JPEG 2000 compression and the embedding of arbitrary file formats within PDF/A documents.

Oates specified that different types of documents benefited from different ‘flavours’ of PDF/A. For example, digitised documents were better suited to conformance level B, whereas born-digital documents were better suited to level A.

Whilst specifying the benefits of PDF/A, Oates also highlighted a myriad of issues associated with the format. Firstly, while experimenting with creating and conforming PDF/A documents, she noted that the conformed documents had slight differences from their sources, such as changes to the colour of pixels in embedded image files (the difference in pixel colour was smaller with programs like PDF Studio). This showed a clear alteration of the authenticity of the original source file.

Oates compared source images to PDF/A converted images and found obvious visual differences.
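
The kind of pixel comparison described above can be sketched in a few lines of Python. This is only an illustration, not the method Oates used: it assumes both images have already been decoded into flat lists of RGB values of the same size, and the function name `pixel_diff` is hypothetical.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each channel 0-255


def pixel_diff(source: List[Pixel], converted: List[Pixel]) -> dict:
    """Summarise how far a converted image drifts from its source.

    Both images are assumed to be flat lists of RGB pixels in the same order.
    """
    if len(source) != len(converted):
        raise ValueError("images must have the same number of pixels")

    changed = 0
    max_delta = 0
    for src_px, conv_px in zip(source, converted):
        # Largest per-channel difference for this pixel
        delta = max(abs(a - b) for a, b in zip(src_px, conv_px))
        if delta:
            changed += 1
            max_delta = max(max_delta, delta)

    return {
        "changed_pixels": changed,
        "changed_fraction": changed / len(source) if source else 0.0,
        "max_channel_delta": max_delta,
    }
```

A report like this only says that pixels changed and by how much; whether a small shift matters for authenticity is still a curatorial judgement.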

Secondly, Oates noted that when converting files from PDF to PDF/A-1b, smart software would change the decode filter of the image (e.g. changing from JPXDecode, used for JPEG 2000, to DCTDecode, which is accepted by ISO 19005) in order to ensure it would conform to the standard. However, she noted that, despite the benefit of avoiding non-conformance, the software had increased the file size of the PDF by 65%. The file size increase poses obvious issues with regard to storage and cost for organisations using PDF/A.
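
As a rough illustration of the two observations above, the sketch below counts occurrences of the two filter names in the raw bytes of a PDF and works out a percentage size change. It is a naive byte scan under stated assumptions, not a conformance check: it will miss filter names held inside compressed object streams, and both function names are hypothetical. A validator such as veraPDF remains the right tool for real checking.

```python
def count_image_filters(pdf_bytes: bytes) -> dict:
    """Count literal occurrences of two image filter names in raw PDF bytes.

    Assumption: the names appear uncompressed in the file, which is not
    guaranteed for every PDF.
    """
    names = (b"/JPXDecode", b"/DCTDecode")
    return {name.decode("ascii"): pdf_bytes.count(name) for name in names}


def size_increase_percent(original_size: int, converted_size: int) -> float:
    """Percentage change in file size after conversion (positive = larger)."""
    if original_size <= 0:
        raise ValueError("original_size must be positive")
    return (converted_size - original_size) * 100.0 / original_size
```

For example, a file that grows from 1,000 KB to 1,650 KB after conversion has increased by 65%, the figure Oates reported for her test file.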

Oates’ workflow for creation and conformance checking of PDF/A files using different PDF/A software

Format uptake was also discussed by Oates. She found that PDF/A had not been widely utilised by universities in the UK for the long-term preservation of dissertations and theses. However, Oates provided examples of users of PDF/A for Electronic Theses and Dissertations repositories, which included Concordia University, Johns Hopkins University, McGill University, Rutgers University, University of Alberta, University of Oulu and Virginia Tech. Alongside this, it was mentioned that uptake amongst research and cultural heritage institutions included the Archaeology Data Service (ADS), the British Library, the California Digital Library, Data Archiving and Networked Services (DANS), the Library of Congress and the U.S. National Archives and Records Administration (NARA).

“Adobe Preflight has failed to recognize most of the glyph errors. As such, veraPDF will remain our final tool for validation.” (Anna Oates)

Oates therefore concluded that PDF/A was not the best solution to PDF preservation. She mentioned that the new ISO standard would cause new issues and considerations for PDF/A users.

Following the presentation the audience debated whether PDF/A should still be used. Some considered whether other solutions existed to PDF preservation; an example of a proposed solution was to keep both PDF/A and the original PDFs. However, many still felt that PDF/A provided the best solution available despite its various drawbacks.

Hopefully Oates’ findings will highlight the various areas needed for improvement in both PDF/A conversion/validation software and conformance aspects of the ISO 19005 Standard used by PDF/A to ensure it is up to the task of digital preservation.

To learn more about PDF/A have a look at Adobe’s own e-book PDF/A In a Nutshell.

Alice, Ben and Iram (Trainee Digital Archivists)

Email Preservation: How Hard Can it Be? DPC Briefing Day

Miten and I outside the National Archives, looking forward to a day of learning and networking

Last week I had the pleasure of attending a Digital Preservation Coalition (DPC) Briefing Day titled Email Preservation: How Hard Can it Be? 

In 2016 the DPC, in partnership with the Andrew W. Mellon Foundation, announced the formation of the Task Force on Technical Approaches to Email Archives to address the challenges presented by email as a critical historical source. The Task Force delineated three core aims:

  1. Articulating the technical framework of email
  2. Suggesting how tools fit within this framework
  3. Beginning to identify missing elements.

The aim of the briefing day was two-fold: to introduce and review the work of the task force thus far in identifying emerging technical frameworks for email management, preservation and access; and to discuss more broadly the technical underpinnings of email preservation and the associated challenges, utilising a series of case studies to illustrate good practice frameworks.

The day started with an introductory talk from Kate Murray (Library of Congress) and Chris Prom (University of Illinois Urbana-Champaign), who explained the goals of the task force in the context of emails as cultural documents, which are worthy of preservation. They noted that email is a habitat where we live a large portion of our lives, encompassing both work and personal. Furthermore, when looking at the terminology, they acknowledged email is an object, several objects and a verb – and its multi-faceted nature adds to the complexity of preserving email. Ultimately, it was said email is a transactional process whereby a sender transmits a message to a recipient, and from a technical perspective, a protocol that defines a series of commands and responses that operate in a manner like a computer programming language and which permits email processes to occur.

From this standpoint, several challenges of email preservation were highlighted:

  • Capture: building trust with donors, aggregating data, creating workflows and using tools
  • Ensuring authenticity: ensuring no part of the email (envelope, header, message data etc.) has been tampered with
  • Working at scale: email
  • Addressing security concerns: malicious content leading to vulnerability, confidentiality issues
  • Messages and formats
  • Preserving attachments and linked/networked documents: can these be saved and do we have the resources?
  • Tool interoperability
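
To make the authenticity point in the list above a little more concrete, here is a minimal sketch using Python's standard `email` and `hashlib` modules. It records a SHA-256 checksum of the raw message bytes at capture, alongside a couple of headers, and later checks whether the bytes have changed. The sample message and the function names `fixity_record` and `is_unchanged` are made up for illustration. A checksum can only show that nothing changed after capture; it says nothing about what happened to the message beforehand.

```python
import hashlib
from email import message_from_bytes
from email.policy import default

# A made-up message used purely for illustration
RAW = (
    b"From: alice@example.org\r\n"
    b"To: bob@example.org\r\n"
    b"Subject: Minutes\r\n"
    b"Message-ID: <1@example.org>\r\n"
    b"\r\n"
    b"Please find the minutes below.\r\n"
)


def fixity_record(raw: bytes) -> dict:
    """Record a checksum of the raw bytes plus a few identifying headers."""
    msg = message_from_bytes(raw, policy=default)
    return {
        "sha256": hashlib.sha256(raw).hexdigest(),
        "message_id": str(msg["Message-ID"]),
        "subject": str(msg["Subject"]),
    }


def is_unchanged(raw: bytes, record: dict) -> bool:
    """True if the bytes still match the checksum taken at capture."""
    return hashlib.sha256(raw).hexdigest() == record["sha256"]
```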

 

The first case study of the day was presented by Jonathan Pledge from the British Library on “Collecting Email Archives”, who explained that born-digital research began at the British Library in 2000, and that many of their born-digital archives contain email. The presentation was particularly interesting as it included their workflow for forensic capture, processing and delivery of email for preservation, providing a current, real-life insight into how email archives are being handled. The British Library use Aid4Mail Forensic for their processing and delivery; however, they are looking into ePADD as a more holistic approach. ePADD is a software package developed by Stanford University which supports archival processes around the appraisal, ingest, processing, discovery and delivery of email archives. Some of the challenges they experienced stemmed from the fact that email often contains personal information. A possible solution would be the redaction of offending material; however, they noted this could lead to the loss of meaning, as well as being an extremely time-consuming process.

Next we heard from Anthea Seles (The National Archives) and Greg Falconer (UK Government Cabinet Office) who spoke about email and the record of government. Their presentation focused on the question of where the challenge truly lies for email – suggesting that, as opposed to issues of preservation, the challenge lies in capture and presentation. They noted that when coming from a government or institutional perspective, the amount of email created increases hugely, leaving large collections of unstructured records. In terms of capture, this leads to the challenge of identifying what is of value and what is sensitive. Following this, the major challenge is how to best present emails to users – discoverability and accessibility. This includes issues of remapping existing relationships between unstructured records, and again, the issue of how to deal with linked and networked content.

The third and final case study was given by Michael Hope from Preservica, an “Active Preservation” technology providing a suite of OAIS (Open Archival Information System) compliant workflows for ingest, data management, storage, access, administration and preservation for digital archives.

Following the case studies, there was a second talk from Kate Murray and Chris Prom on emerging Email Task Force themes and their Technology Roadmap. In June 2017 the task force released a Consultation Report Draft of their findings so far, to enable review, discussion and feedback, and the remainder of their presentation focused on the contents and gaps of the draft report. They talked about three possible preservation approaches:

  • Format Migration: copying data from one type of format to another to ensure continued access
  • Emulation: recreating user experience for both message and attachments in the original context
  • Bit Level Preservation: preservation of the file, as it was submitted (may be appropriate for closed collections)
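
Of the three approaches listed above, format migration is the easiest to sketch in code. The example below uses Python's standard `mailbox` module to copy each message in an mbox file out to its own .eml file. It is a minimal sketch rather than a recommended workflow: the function name `mbox_to_eml` and the output file naming are assumptions, and a real migration would also need to record what was changed and verify the results.

```python
import mailbox
from pathlib import Path


def mbox_to_eml(mbox_path: str, out_dir: str) -> list:
    """Write every message in an mbox file to a separate .eml file.

    Returns the list of paths written, in mailbox order.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    written = []
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for index, message in enumerate(box):
            target = out / f"message-{index:05d}.eml"
            # Drop the mbox "From " separator line; keep headers and body
            target.write_bytes(message.as_bytes(unixfrom=False))
            written.append(str(target))
    finally:
        box.close()
    return written
```

Keeping the original mbox file untouched alongside the migrated copies is, in effect, bit-level preservation and format migration used together.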

They noted that there are many tools within the cultural heritage domain designed with interoperability, scalability, preservation and access in mind, yet these are still developing and improving. Finally, we discussed the possible gaps in the draft report, and issues such as the authenticity of email collections were raised, as well as a general interest in the differing workflows between institutions. Ultimately, I had a great time at The National Archives for the Email Preservation: How Hard Can it Be? Briefing Day – I learnt a lot about the various challenges of email preservation, and am looking forward to seeing further developments and solutions in the near future.

Email Preservation: How Hard Can it Be? DPC Briefing Day

On Thursday 6th July 2017 I attended the Digital Preservation Coalition briefing day on email preservation, held in partnership with the Andrew W. Mellon Foundation and titled ‘Email preservation: how hard can it be?’. It was hosted at The National Archives (TNA). This was my first visit to TNA and it was fantastic. I didn’t know a great deal about email preservation prior to this, so I was really looking forward to learning about the topic.

The National Archives, Photograph by Mike Peel (www.mikepeel.net)., CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=9786613

The aim of the day was to engage in discussion about some of the current tools, technologies and thoughts on email preservation. It was orientated around the ‘Task Force on Technical Approaches to Email Archives’ report that is currently in its draft phase. We also got to hear about interesting case studies from the British Library, TNA and Preservica, each presenting their own unique experiences in relation to this topic. It was a great opportunity to learn about this area and hear from the co-chairs (Kate Murray and Christopher Prom) and the audience about their thoughts on the current situation and possible future directions.

We heard from Jonathan Pledge from the British Library (BL). He told us about the forensic capture expertise gained by the BL and about using EnCase to capture email data from hard drives, CDs and USB drives. We also got an insight into how they are deciding which email archive tool to use. Aid4Mail fits better with their workflow; however, ePADD, with its holistic approach, was something they were considering. During their ingest they separate the emails from the attachments. They found that after the time-consuming process of removing emails that would violate data protection laws, there was very little usable content left, as often entire threads would have to be redacted due to one message. This is not the most effective use of an archivist’s time and is something they are working to address.
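
Separating emails from their attachments, as mentioned above, can be illustrated with Python's standard `email` package. This is a minimal sketch and not the British Library's workflow, which relies on dedicated tools; the function names and the demo message are hypothetical.

```python
from email import message_from_bytes
from email.message import EmailMessage
from email.policy import default


def build_demo_message() -> bytes:
    """Build a small made-up message with one attachment, for illustration."""
    msg = EmailMessage()
    msg["From"] = "alice@example.org"
    msg["To"] = "bob@example.org"
    msg["Subject"] = "Report"
    msg.set_content("See attached.")
    msg.add_attachment(
        b"%PDF-1.4 demo", maintype="application", subtype="pdf",
        filename="report.pdf",
    )
    return bytes(msg)


def split_message(raw: bytes):
    """Return (body_text, [(filename, attachment_bytes), ...])."""
    msg = message_from_bytes(raw, policy=default)

    # Prefer the plain-text body, falling back to HTML if that is all there is
    body_part = msg.get_body(preferencelist=("plain", "html"))
    body = body_part.get_content() if body_part is not None else ""

    attachments = [
        (part.get_filename(), part.get_payload(decode=True))
        for part in msg.iter_attachments()
    ]
    return body, attachments
```

Once separated, the attachments can be identified and processed as files in their own right, while the link back to the parent message needs to be recorded so the context is not lost.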

We also heard from Anthea Seles, who works with government collections at TNA. We learnt that, from their research, they discovered that approximately 1TB of data in an organisation’s own electronic document and records management system is linked to 10TB of related data in shared drives. Her focus was on discovery and data analytics. For example, one way to increase efficiency and focus the attention of the curator was to batch email. If an email was sent from TNA to a vast number of people, then there is a high chance that the content does not contain sensitive information. However, if it was sent to a high-profile individual, then there is a higher chance that it will contain sensitive information, so the curator can focus their attention on those messages.
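
The batching idea described above could look something like the sketch below, which sorts messages into review queues by who they were sent to. It is only a toy heuristic under stated assumptions: the watchlist of high-profile addresses, the threshold of 50 recipients and the queue names are all invented for illustration, and it is not TNA's actual method.

```python
from email import message_from_string
from email.policy import default
from email.utils import getaddresses


def recipient_addresses(msg) -> set:
    """Collect the distinct To/Cc/Bcc addresses of a parsed message."""
    values = []
    for name in ("To", "Cc", "Bcc"):
        values.extend(str(value) for value in msg.get_all(name, []))
    return {addr.lower() for _, addr in getaddresses(values) if addr}


def triage(msg, watchlist, broadcast_threshold=50) -> str:
    """Assign a message to a review queue (hypothetical queue names)."""
    recipients = recipient_addresses(msg)
    if recipients & {address.lower() for address in watchlist}:
        # Sent to a high-profile individual: look at this first
        return "review_first"
    if len(recipients) >= broadcast_threshold:
        # Widely circulated: less likely to be sensitive
        return "likely_low_sensitivity"
    return "normal_queue"
```

The output is a prioritisation, not a decision: a curator would still sample the low-priority batch rather than release it unchecked.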

Hearing from Preservica was interesting as it gave an insight into the commercial side of email archiving. In their view, preservation was not an issue. Instead, their attention was focused on addressing issues such as identifying duplicate or unwanted emails efficiently, developing tools for performing whole-collection email analysis and, interestingly, solving the problem of acquiring emails via a continuous transfer.
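
One simple way to spot duplicate emails is to compare Message-ID headers, falling back to a checksum of the raw bytes when a message has none. The sketch below illustrates that idea only; it is not Preservica's approach, and the function names are hypothetical.

```python
import hashlib
from email import message_from_bytes
from email.policy import default


def dedupe_key(raw: bytes) -> str:
    """Key a message by Message-ID, or by checksum if it has no Message-ID."""
    msg = message_from_bytes(raw, policy=default)
    message_id = msg["Message-ID"]
    if message_id:
        return "id:" + str(message_id).strip()
    return "sha256:" + hashlib.sha256(raw).hexdigest()


def find_duplicates(raw_messages) -> list:
    """Return (duplicate_index, first_seen_index) pairs."""
    seen = {}
    duplicates = []
    for index, raw in enumerate(raw_messages):
        key = dedupe_key(raw)
        if key in seen:
            duplicates.append((index, seen[key]))
        else:
            seen[key] = index
    return duplicates
```

Matching on Message-ID catches copies of the same message held in different mailboxes even when their transport headers differ, but it trusts the header, so it is best treated as a first pass.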

Emails are not going to be the main form of communication forever (the rise in the popularity of instant messaging is clear to see); however, we learnt that growth in their use is still expected for the near future.

One of the main issues that was brought up was the potential size of future email archives and the issues that come with effective and efficient appraisal. What is large in academic terms, e.g. 100,000 emails, is not in government. The figure of over 200 million emails at the George W. Bush presidential library is a phenomenal amount, and the Obama administration’s is estimated at 300 million. This requires smart solutions, and we learnt how the use of artificial intelligence and machine learning could help.

Continuous active learning was highlighted as a way to improve searches. An example of searching for ‘Miami dolphins’ was given. The Miami Dolphins are an American football team; however, someone might also be looking for information about dolphins in Miami. Initially the computer would present different search results and the user would choose which is the more relevant. Over time it learns what the user is looking for in cases where searches can be ambiguous.

Another issue that was highlighted was how to make sure that you have found the correct person. How do you avoid false positives? At TNA the ‘Traces Through Time’ project aimed to do that, initially with World War One records. This technology, which uses big data analytics, can be used with email archives. There is also work on mining the email signature as a way to better determine ownership of the message.

User experience was also discussed. Emulation is an area of particular interest. The positive of this is that it recreates how the original user would have experienced the emails. However this technology is still being developed. Bit level preservation is a solution to make sure we capture and preserve the data now. This prevents loss of the archive and allows the information and value to be extracted in the future once the tools have been developed.

It was interesting to hear how policy could affect how easy it would be to acquire email archives. Under the new General Data Protection Regulation, which will come into effect in May 2018, anyone in breach will face heavier penalties of up to 4% of annual worldwide turnover. This means that companies may err on the side of caution with regard to keeping personal data such as emails.

Whilst the email protocols are well standardised, allowing emails to be sent from one client to another (e.g. from an AOL account of the early 1990s to the Gmail of today), the acquisition of email is not. When archivists get hold of email archives, they are left with the remnants of whatever the email client or user has done to them. This means metadata may have been added or removed and formats can vary. This adds a further level of complexity to the whole process.

The day was thoroughly enjoyable. It was a fantastic way to learn about archiving emails. As emails are now one of the main methods of communication, for government, large organisations and personal use, it is important that we develop the tools, techniques and policies for email preservation. To answer the question ‘how hard can it be?’ I’d say very. Emails are not simple objects of text; they are highly complex entities comprising attachments, links and embedded content. The solution will be complex but there is a great community of researchers, individuals, libraries and commercial entities working on solving this problem. I look forward to hearing the update in January 2018 when the task force is due to meet again.

The first rule of Pig Club…

Rulebook of Nailsworth Pig Club, April 1918, from the Sir Stafford Cripps archive at the Bodleian Library [sc22/1c]

There’s just something about this delightful little three-page rulebook that tickles me. Perhaps it’s the use of phrases like ‘eligible pigs’. Otherwise, it’s a perfectly serious document which details the ins and outs of the provision, insurance and inspection of pigs and potatoes (pigs and potatoes!) raised by members of the club.

It appears to have been produced by Gloucestershire County Council and was presumably part of a county-, or country-wide effort to encourage people to raise their own food during the war (also done during the Second World War). Despite the central concern of food production, though, it’s a surprisingly cheering document for people concerned with animal welfare, as it’s very specific that the animals must be healthy and well-cared for, and that insurance compensation would not be paid if ‘the death or sickness of a pig is attributed to bad food, insufficient attention, or other carelessness or ill-treatment’.

One of my favourite things about the rulebook is who it belonged to: Stafford Cripps, probably best known as Britain’s ambassador to Moscow in 1940-1942 and as the austerity chancellor from 1947-1950. When this document was produced, however, Cripps was not yet a political high flier. A chemistry graduate and practising lawyer, he was both married and in ill health in 1914, which meant that he was not called up. He kept himself busy with recruitment efforts and then volunteered for a year in France as a Red Cross ambulance driver. In late 1915 he offered his chemistry expertise to the Ministry of Munitions and was posted to one of the country’s biggest munitions factories in Queensferry, near Chester. From early 1916, Cripps was running it, and it took its toll on his health. In early 1918, when this rulebook was drawn up, he was convalescing from a physical breakdown. Somehow, though, he still managed to find the time and energy to serve as honorary secretary of the Nailsworth Pig Club. The now nearly 100-year-old rulebook survives in his archive at the Bodleian Library.

Incidentally, the first rule of Pig Club?:

  1. NAME.–The Society shall be called the “Nailsworth Pig Club.”

Can’t argue with that.

#WAWeek2017 – Researchers, practitioners and their use of the archived web

This year, the world of web archiving saw a premiere: not only were the biennial RESAW conference and the IIPC conference, established in 2016, held jointly for the first time, but they also formed part of a whole week of workshops, talks and public events around web archives – Web Archiving Week 2017 (or #WAWeek2017 for the social medially inclined).

After previous conferences in Reykjavik (2016) and Aarhus (RESAW 2015), the big 2017 event was held in London, 14-16 June 2017, organised jointly by the School of Advanced Study of the University of London, the IIPC and the British Library.
The programme was packed full of an eclectic variety of presentations and discussions, with topics ranging from the theory and practice of curating web archive collections or capturing whole national web domains, via technical topics such as preservation strategies, software architecture and data management, to the development of methodologies and tools for web archive-based research and case studies of their application.

Even in digital times, who doesn’t like a conference pack? Of course, the full programme is also available online. (…but which version will be easier to archive?)

Eton College, a journey to India, and wartime Britain: Personal stories from the Opie Archive

The Archive of Iona and Peter Opie, at first glance, is an archive of professional papers, covering the folklorists’ extensive research on childlore, games and play from the 1950s through to the 1990s. Researchers looking at the history of childhood, and even at how the ‘world of children’ was observed and documented, will no doubt find a rich resource in the children’s original papers, as well as in the Opies’ working files, their professional correspondence, and the notes for, and drafts of, the Opie publications from The Oxford Dictionary of Nursery Rhymes to The People in the Playground.

What is probably less well known is the fact that the archive also includes a significant proportion of personal papers and correspondence – the personal archive of Peter Opie, covering his own childhood and teenage years, his years at Eton College, his early career as an author, his time serving in the Army, and the early years of starting a family and finding his vocation of collecting books and researching childhood folklore. These personal papers, which focus on the 1930s-1950s, include correspondence with friends and relatives, diaries and scrapbooks, personal documents, photographs and memorabilia. Another sequence comprises the raw material, notes, and drafts for Peter Opie’s early autobiographical books, short stories and other writings.

An aspiring writer: Drafts for Peter Opie’s first book ‘I Want To Be A Success’, early 1938.

Researchers, practitioners and their use of the archived web. IIPC Web Archiving Conference 15th June 2017

From the 14th to the 16th of June, researchers and practitioners from a global community came together for a series of talks, presentations and workshops on the subject of web archiving at the IIPC Web Archiving Conference. This event coincided with Web Archiving Week 2017, a week-long event running from 12th – 16th June hosted by the British Library and the School of Advanced Study.

I was lucky enough to attend the conference on the 15th June with a fellow trainee digital archivist and listen to some thoughtful, engaging and challenging talks.

The day started with a plenary in which John Sheridan, Digital Director of the National Archives, spoke about the work of the National Archives and the challenges and approaches to web archiving they have taken. The National Archives is principally the archive of the government; it allows us to see what the state saw through the state’s eyes. Archiving government websites is a crucial part of this record keeping as we move further into the digital age, where records are increasingly born-digital. A number of points were made which highlighted the motivations behind web archiving at the National Archives.

  • They care about the records that government are publishing and their primary function is to preserve the records
  • Accountability for government services online or information they publish
  • Capturing both the context and content

By preserving what the government publishes online, it can be held accountable. Accountability is one aspect that demonstrates the inherent value of archiving the web. You can find a great blog post on accountability and digital services by Richard Pope at this link: http://blog.memespring.co.uk/2016/11/23/oscon-2016/

The published records and content on the internet provide valuable and crucial context for the records that are unpublished, linking the backstory and the published records. This allows for a greater understanding and analysis of the information and will be vital for researchers and historians now and into the future.

Quality assurance is a high priority at the National Archives. Having a narrow crawling focus has both allowed and prompted a lot of effort to be directed into the quality of the archived material, so that it has high fidelity in playback. To keep these high standards, a really good in-depth crawl can take weeks. Having a small curated collection is an incentive to work harder on capture.

The users and their needs were also discussed as this often shapes the way the data is collected, packaged and delivered.

  • Users want to substantiate a point. They use the archived sites for citation on Facebook or Twitter for example
  • The need to cite for a writer or researcher
  • Legal – What was the government stance or law at the time of my client’s case
  • Researchers’ needs – This was highlighted as an area where improvements can be made
  • Government itself are using the archives for information purposes
  • Government websites requesting crawls before their website closes – An example of this is the NHS website transferring to a GOV.UK site

The last part of the talk focused on the future of web archiving and how this might take shape at the National Archives. Web archiving is complex and at times chaotic. Traditional archiving standards have been placed upon it in an attempt to order the records. It was a natural evolution for information managers and archivists to use their existing knowledge, skills and standards to bring this information under control. This has resulted in difficulties in searching across web archives, describing the content and structuring the information. The nature of the internet and the way in which the information is created mean that uncertainty inevitably has to be embraced. Digital archiving could take a turn into a second generation, a ‘2.0’, moving away from the traditional standards and embracing new standards and concepts.

One proposed method is the ICA Records in Contexts conceptual model. It proposes a multidimensional description, with each ‘thing’ having its own description, as opposed to the traditional single unit of description (one size fits all). Instead of a single hierarchical, fonds-down approach, the Records in Contexts model uses a description that can be formed as a network or graph. The context of the fonds is broader, linking to other collections and records to give different perspectives and views. The records can be enriched this way, providing a fuller picture of the record or archive. The web produces content that is in a constant state of flux, so a system of description that can grow and morph over time, creating new links and context, would be a fruitful addition.

Visual Diagram of How the Records in Context Conceptual Model works

“This example shows some information about P.G.F. Leveau a French public notary in the 19th century including:
• data from the Archives nationales de France (ANF) (in blue); and
• data from a local archival institution, the Archives départementales du Cher (in yellow).” (International Council on Archives, Records in Contexts: A Conceptual Model for Archival Description, p. 93)

 

Traditional Fonds Level Description
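
The difference between the two diagrams can also be shown as a data structure. In a fonds-down hierarchy each record has exactly one parent, whereas a graph lets any described thing be linked to any other, by more than one institution. The sketch below is a loose illustration of that idea only: the class, the entity identifiers and the relation names are all hypothetical and are not taken from the Records in Contexts model itself.

```python
from collections import defaultdict


class DescriptionGraph:
    """A tiny directed graph of described things and named relations."""

    def __init__(self):
        self.labels = {}                 # entity id -> human-readable label
        self.edges = defaultdict(list)   # entity id -> [(relation, target id)]

    def add_entity(self, entity_id: str, label: str) -> None:
        self.labels[entity_id] = label

    def relate(self, source: str, relation: str, target: str) -> None:
        if source not in self.labels or target not in self.labels:
            raise KeyError("both entities must be added before relating them")
        self.edges[source].append((relation, target))

    def neighbours(self, entity_id: str) -> list:
        """Everything directly linked from an entity, as (relation, label)."""
        return [
            (relation, self.labels[target])
            for relation, target in self.edges.get(entity_id, [])
        ]


# Made-up example: one person described through records held in two places
graph = DescriptionGraph()
graph.add_entity("agent:notary", "A notary")
graph.add_entity("record:deed", "A deed")
graph.add_entity("record:register", "A register")
graph.add_entity("inst:national", "National archive")
graph.add_entity("inst:local", "Local archive")
graph.relate("agent:notary", "created", "record:deed")
graph.relate("agent:notary", "created", "record:register")
graph.relate("record:deed", "held_by", "inst:national")
graph.relate("record:register", "held_by", "inst:local")
```

New entities and links can be added at any time without restructuring what is already there, which is the property that makes a graph attractive for web content that keeps changing.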

 

I really enjoyed the conference as a whole and the talk by John Sheridan. I learnt a lot about the National Archives’ approach to web archiving, the challenges and where the future of web archiving might go. I’m looking forward to taking this new knowledge and applying it to the web archiving work I do here at the Bodleian.

Changes are currently being made to the National Archives Web Archiving site and it will relaunch on the 1st July this year. Why don’t you go and check it out?

Web Archiving Week 2017 – “Pages by kids, for kids”

Yesterday I was lucky enough to attend a day of the Web Archiving Week 2017 conferences in Senate House, London along with another graduate trainee digital archivist.

A beautiful staircase in Senate House

Every session I attended throughout the day was fascinating, but Ian Milligan’s ‘Pages by kids, for kids’: unlocking childhood and youth history through the GeoCities web archive stood out for me as truly capturing part of what makes a web archive so important to society today.

Pages by kids, for kids

GeoCities, for those unfamiliar with the name, was a website founded in 1994 from which anyone could build their own free website which would become part of a ‘neighbourhood’. Each neighbourhood was themed for a particular topic, allowing topic clusters to form from created websites. GeoCities was shut down in Europe and the US in 2009, but evidence of it still exists in the Internet Archive.

Milligan’s talk focused particularly on the Enchanted Forest neighbourhood between 1996 and 1999. The Enchanted Forest was dedicated to child-friendliness and was the only age-based neighbourhood, and as such had extra rules and community moderation to ensure nothing age-inappropriate was present.

“The web was not just made by dot.com companies”

The above image shows what I think was one of the key points from the talk, a quote from the New York Times, March 17th 1997:
“The web was not just made by dot.com companies, but that eleven-year-old boys and grandmothers are also busy putting up Web sites. Of course, the quality of these sites varies greatly, but low-cost and even free home page services are a growing part of the on-line world.”

The internet is a democracy, and to show a true record of how and why it has been used it necessarily involves people – not just businesses. By having GeoCities websites within the Internet Archive, it’s possible to access direct evidence of how people were using the internet in the late part of the 20th century, but, as Ian Milligan’s talk explained, it also allows access to direct evidence of childhood and youth culture forming on the internet.

Milligan pointed out that access to evidence of childhood and youth culture is rare, normally historical evidence comes in the form of adults remembering their time as children or from researchers studying children, but something produced by a child for other children would rarely make it into a traditional archive. Within the trove of archived GeoCities websites, however, children producing web content for children is clearly visible. From this, it is possible to examine what constituted popular activities for children on GeoCities in the late 20th century.

Milligan noted that one major activity within the Enchanted Forest centred around an awards culture, wherein a popular site would give awards to users based on several web page qualities, such as no personally identifiable information, working links and loading times of less than one minute. Some users would create their own awards to present to people, for example an award for finding all the Winnie the Pooh words in a word search. His findings showed that 15% of Enchanted Forest websites had a dedicated awards page.

A darker side of a child-centric portion of the web was also revealed in the Geokidz Club. On the surface, the Geokidz Club appeared to be an unofficial online clubhouse where children could share poetry and book reviews, chat and take HTML lessons – but these activities came at the price of a survey which contained questions about the lifestyles of the child’s parents (the type of information that would appeal to advertisers). This formed part of one of the first internet privacy court cases, due to the data being obtained from children and sold on without proper informed consent.

It was among my favourite talks of the day, and showed how much richer our understanding of the recent past can be using web archives, as well as the benefit to researchers of the history of youth and childhood.
It felt particularly relevant to me, as someone who spent her teen years in the 2000s watching, and being involved in, youth culture happening online, to know that online youth culture, which can feel very ephemeral, can be saved for future research in web archives.

A wall hanging in Senate House (made of sisal)

In truth, any talk I attended would have made an interesting topic for this blog – the entire day was filled with informative speakers, interesting presentations and monumental, hair-like wall hangings. But I felt Ian Milligan’s talk gave such a positive example of how the internet, and particularly web archives, can give a voice to those whose experiences might be lost otherwise.

Initiating conversation: let’s talk about web content (part 2)

Colin Harris, Superintendent of Special Collections reading rooms. Chosen site: cyndislist.com

‘I am a founding member of Oxfordshire Family History Society and I’ve long been interested in family history. As a phenomenon it surged in popularity in the 1970s. In about 1973 there was great curiosity (in OFHS) in Bicester as everyone was interested in the popular group, The Osmonds (who originated from Bicester!). Every county has a family history society and I would say it’s they who have done the lion’s share of the work. All of their work and indexing…it’s all grist to the mill in terms of recording names and events.

So the website I would like to have access to in 10 years’ time is cyndislist.com, which is one of the world’s largest databases for genealogy. In fact it’s been going for over 21 years already. This was launched on the 4th March 1996. The family history people have been right there from the very beginning; it’s been growing solidly since then; it’s fantastic. It covers 200 categories of subjects, it has links to 332,000 other websites, and it’s the starting point for any genealogical research. The ‘Cyndi’ is Cyndi Howells, an author in genealogy.

Almost every day the site is launching content that might be interesting in some particular subject. So just going back within the last couple of weeks: an article on Telling the Orphan’s story; Archive lab on how to preserve old negatives; The key to family reunion success and DNA: testing at a family reunion! Projects even go beyond individuals…they explore a Yellowstone wolf family. There is virtually nothing that is untouched. Anything with a name to it has potential for exploration.

To be honest, I haven’t been able to do any family history research since 1980, but I am hoping to do some later on this year (when I retire). All these years that have passed have meant that so much is available to be accessed over the internet.

Actually I’d love to see genealogy and family history workers and volunteers getting more recognition for the fantastic amount of industrious and tech-savvy work they do. Family history is something for people from all walks of life. Our history, your history, my history is something very personal. As I say, 21 years and going strong; I’d love to see the site going stronger still in 10 years’ time.’

Pip Willcox, Head of the Centre for Digital Scholarship and Senior Researcher at Oxford e-Research. Chosen site: twitter.com

‘Twitter is an amazing tool that society has used to show the best of what humanity is at the moment…we share ideas, we share friendship, fun and joy, we communicate with others around the world, people help each other. But it also shows the worst of what humans can do. The news we see is just the tip of the iceberg – the levels of abuse that users, particularly minority groups, receive are appalling. Twitter is a fantastic place to meet people who think very differently from us, people who come from different backgrounds, have had different experiences, who live far from us, or close by but we might not otherwise have met. It is so rich, so full of potential, and some of what we do with it is amazing, yet some of what we do with it is appalling.

The question for the archive is “which Twitter?” There is the general feed, what you see if you don’t sign in. Then there are our individual feeds, where we curate our own filter bubbles, customizing what we see through our accounts. You can create a feed around a hashtag, an event, or slice it by time or location. All of these approaches will affect the version of Twitter we archive and leave for the future to discover.

These filter bubbles are not new: we have always lived in them, even if we haven’t called them that before. Last year there was an experiment where a series of couples who held diametrically opposing views switched Twitter accounts, and I found that, and their thoughtful response to it, fascinating.

Cultures of Knowledge, for example, a project based at the History Faculty here at the University of Oxford, traces early modern correspondence. This resource lets you search for who was writing to whom, when, where, and the subjects they were discussing. It’s an enormously rich, people-centred view of the history of ideas and relationships across time and space, and of course it points readers on in interesting directions, to engage closely with the texts themselves. This is possible because the letters were archived and catalogued over the years, over the centuries, by experts.

How are we going to trace the conversations of the late 20th and the early 21st centuries? The speed at which ideas flow is faster than ever and their breadth is global. What will future historians make of our age?

I’m interested from a future history as well as a community point of view. The way we are using Twitter has already changed and tracking its use, reach, and power seems to me well worth recording to help us understand it now, and to help explain an aspect of our lives to future societies. For me, Twitter makes the world more familiar, and anything that draws us together as a global community, that reinforces our understanding that we share one planet, that what we have in common vastly outweighs what divides us, and that helps us find ways to communicate is a good and a necessary thing.’

Will Shire, Library Assistant, Philosophy and Theology Faculty Library. Chosen site: wikipedia.org

‘It’s one of the sites I use the most…it has all of human knowledge. I think it’s a cool idea that anyone can edit it – unlike a normal book it’s updated constantly. I feel it’s derided almost too much by people who automatically think it’s not trustworthy…but I like the fact that it is a range of people coming together to edit and amend this resource. As a kid I bothered my mum all the time with constant questioning of “Why is this like this? Why does it do that?” Nowadays if you have a question about anything you can visit wikipedia.org. It would be really interesting to take a snapshot of one article every month or week in order to see how much it changes through user editing.

Also, I studied languages and it is extremely useful for learning new vocabulary as the links at the side of the article can take you to the content in other available languages. You can quite easily look at different words or use it as a starter to take you to different articles in other languages that aren’t English.’

Why archive the web?

Here at the Bodleian Libraries’ Web Archive (BLWA), the archiving process starts with a nomination – either by our web curators or by you, the public. From the nominated URLs, the BLWA team then selects for archiving those identified as being of lasting value and significance for preservation.

Not only are the sites chosen from a preservation standpoint – we are also continually seeking to build up the scope and content of our seven collections within the BLWA: University of Oxford; University of Oxford colleges; University of Oxford museums, libraries and archives; social sciences; arts and humanities; international; and science, medicine and technology. Exactly like a physical collection, the sites belonging to the web collection will be used for research, fact checking, discovery and collaboration. There can be no denying that the web is the platform on which so much of contemporary society occurs. In the future then, and indeed now, web archives are providing an insight into our history.

Anti-Apartheid Movement Archives – http://www.aamarchives.org/

The AAMA site is part of our international collection in the BLWA. Within this collection we have captured aamarchives.org seven times since 24th November 2015. This online platform is vital for digital access to further research, cross-cultural relationships and efforts towards understanding the history of the British Anti-Apartheid Movement 1959 – 1994. This capture has preserved the navigation and functionality of the site, and links still resolve; for example, the user community can still browse the archive, learn about campaigns and download resources. The date and time of the capture are clearly displayed in the banner at the top.

BLWA’s first capture of the online AAMA

This website can also be used and explored in conjunction with our related physical holdings. Here at the Bodleian Special Collections we have an amazing depth and range of physical material in the Anti-Apartheid Movement archive and our Commonwealth and African studies collections. You can browse the catalogue for this here.

This archived capture is fully functional, like a live site.

This is a tangible example of how digital preservation enhances and complements physical material and ensures records can reach a wider audience. How exciting it is that a researcher can consult manuscript or archived material, alongside captures of websites from the past in order to gain more of an insight and have a wider scope of substance to survey!

Web content like aamarchives.org is not as stable as you might presume. A repository of web-based collections enables future discovery of internet sites that are perhaps taken for granted due to the nature of our technological society; everything is just a tap or a click away. In fact, much of the material we interact with today is only available online. The truth is that web content is ephemeral: there is a very real threat that it can rapidly change or disappear altogether. Therefore web archiving initiatives are vital to preserve these valuable resources for good. Through these captures, provenance, arrangement and content have been preserved; and arguably most importantly of all – access.

Both individual collections and the web archive as a whole can be searched for a specific site, or browsed at leisure.

The growth of open access and web-based initiatives means that there is an ever-increasing network of digital libraries on a global scale. There is no doubt that the practice of web archiving is a significant contribution towards ensuring knowledge for all. Internet access to an ever-growing repository of knowledge is central to the integrity of educational and professional research, to web archiving and, on a larger scale, to digital preservation.

Browse our collections in Bodleian Libraries’ Web Archive

Get involved and help preserve our history! Nominate a site to archive