Turning a historical book into a data set

A series of books published around the turn of the 20th century are crucial to modern bibliographic research: they are biographical dictionaries of booksellers and printers, including addresses, dates and significant works printed. Some of these books are out of copyright and available as scanned pages, allowing us not only to copy them into new formats, but adapt them into new kinds of resource.

These scanned books could be made more useful to researchers in a number of ways. Text could be meaningfully segmented, by dictionary entry rather than by page or paragraph. The book’s internal and external citations can become links, for instance linking a proper name to identifiers for the named person. The book can even have an open data representation which other data sets can hook on to, for example to say that a person is described in the book.
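
Segmenting by dictionary entry can be sketched in a few lines of Python. This is only an illustration, not the process used on Wikisource: it assumes each entry opens on a new line with a surname in capitals followed by forenames in parentheses, and the sample text and the names `ENTRY_START` and `split_entries` are invented for the example.

```python
import re

# Assumption: an entry starts a line as "SURNAME (FORENAME)".
# Real scanned text would need a pattern tuned to the book in hand.
ENTRY_START = re.compile(
    r"^(?P<surname>[A-Z][A-Z'\- ]+) \((?P<forename>[^)]+)\)",
    re.MULTILINE,
)


def split_entries(text):
    """Split running text into one record per dictionary entry."""
    starts = list(ENTRY_START.finditer(text))
    entries = []
    for i, match in enumerate(starts):
        # An entry runs until the next entry heading, or the end of the text.
        end = starts[i + 1].start() if i + 1 < len(starts) else len(text)
        body = " ".join(text[match.start():end].split())
        entries.append({
            "surname": match.group("surname").strip(),
            "forename": match.group("forename").strip(),
            "text": body,
        })
    return entries


# Invented placeholder text, not a quotation from the dictionary.
SAMPLE = (
    "EXAMPLE (JOHN), bookseller in London, 1641-1650.\n"
    "Continued after a page break.\n"
    "SAMPLE (MARY), printer in Oxford, 1660.\n"
)

entries = split_entries(SAMPLE)
for entry in entries:
    print(entry["surname"], "|", entry["forename"], "|", entry["text"])
```

Each record keeps the whole entry together even where the scan splits it across lines or pages, which is the property that makes later linking possible.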

This case study describes the transformation of one of these books, Henry Plomer’s A Dictionary of the Booksellers and Printers who Were at Work in England, Scotland and Ireland from 1641 to 1667, using Wikisource, part of the Wikimedia family of sites. As a collaborative platform, Wikisource allowed Bodleian staff to work with Wikisource volunteers. We benefited from many kinds of volunteer labour, from correcting simple errors in the text to creating custom wiki-code to speed up the process.

Many important data sets currently exist only in the form of printed books, including catalogues, dictionaries and encyclopedias. We adopted a process that has already been used on some large, multi-volume works and could be used for many more.

Making Sense of Negotiated Text at Scale: a workshop

Register by email: see below for details

What: Making Sense of Negotiated Text at Scale: a workshop

When: 11:30—14:30, Thursday 30 November 2017

Where: Centre for Digital Scholarship, Weston Library (map)

Open to all

Free

Registration is required: please email Pip Willcox with your name, email address, and access and dietary requirements

How do we evaluate the relationship between different iterations of ideas in text form?

Speakers

  • Nicholas Cole and Alfie Abdul-Rahman: The Quill Project
  • Radoslaw Zubek, David Doyle, and Abhishek Dasgupta: Measuring Government Policy with Text Analysis project
  • David Price: DebateGraph—Exploring the Intention to Withdraw from the Union
  • Félix Krawatzek: Buying Words? The impact of donations on political language

This workshop brings together experts from four projects which are using digital methods to analyze, understand, and re-present negotiated texts. Taking UK government policy documents, the creation of the American Constitution, current political debate, and the economic cost of political language as their subject matter, each speaker will outline the motivation for their work and the approaches they have taken towards answering questions such as:

  • Are government regulations becoming more or less business friendly?
  • Which State’s representatives contributed the most successful proposals to the American Constitution?
  • What common threads of agreement are there in differing political viewpoints?
  • How much money does it take to change the language in the US Congress?

This workshop will be of interest to people working in history, politics, computational linguistics, visualization, or the application of digital innovation to research.

Alfie Abdul-Rahman is a Research Associate at the University of Oxford e-Research Centre where she develops web-based visualization tools for humanities scholars, including for the Quill Project.

Nicholas Cole is a Senior Research Fellow at Pembroke College Oxford, specializing in the history of political thought and American Constitutional History, and directs the Quill Project.

Abhishek Dasgupta is a doctoral student at Exeter College, studying Foundations, Logic, and Structures in the Department of Computer Science.

David Doyle is an Associate Professor of Latin American Politics in the Department of Politics and International Relations at the University of Oxford, a Fellow of St Hugh’s College, and co-investigator of the Fell-funded Measuring Government Policy with Text Analysis project.

Félix Krawatzek is a British Academy Postdoctoral Fellow based at the University of Oxford’s Department of Politics and International Relations and a Research Fellow at Nuffield College.

David Price co-founded DebateGraph with the former Australian cabinet minister Peter Baldwin and has led DebateGraph’s projects with, amongst others, the UK Prime Minister’s Office, the White House Office of Science and Technology Policy, CNN, the European Commission, and the Bill & Melinda Gates Foundation.

Radoslaw Zubek is an Associate Professor of European Politics, a Tutorial Fellow at Hertford College, and principal investigator of the Fell-funded Measuring Government Policy with Text Analysis project.

This workshop is convened by:

Registration

To register, please email Pip Willcox (pip.willcox@bodleian.ox.ac.uk) with:

  • Your name
  • Your email address
  • Access or dietary requirements

Image credit: Global Academic Forum

Digital Approaches to the History of Science: a successful workshop

‘Digital Approaches to the History of Science’, the first of two planned workshops on this topic, was held at the History Faculty in Oxford on 28 September 2018. A total of nearly sixty attendees assembled to hear presentations from a selection of the most exciting current projects in this field from around the UK.

Professor Rob Iliffe, representing the Newton Project, addressed the ongoing challenges and complexity of digitizing and presenting the manuscript writings of Isaac Newton, and Alison Pearn spoke of the related issues faced by the digital side of the ongoing Darwin Correspondence Project. Lauren Kassell, of the Casebooks Project, introduced a very different type of material and spoke of the need to find new ways of representing, encoding and searching the mass of information contained in early modern medical-astrological casebooks.

After lunch, two speakers discussed, from complementary perspectives, the opportunities represented by the very rich archive of The Royal Society. Louisiane Ferlier discussed the digitization of Royal Society journals and the work needed to clean and link the metadata about the articles in them. Pierpaolo Dondio described his work modelling and visualising the network of authors, editors and referees who controlled the content of those papers, and provided examples of the kinds of research outcomes such work can produce. A final talk turned to the use of digital humanities resources in the university classroom: Kathryn Eccles and Howard Hotson described the Cabinet Project, which has made a rich ecology of digital images and objects available to students on a growing list of Oxford undergraduate papers.

Rich discussions took place both around the individual presentations and over lunch and coffee, and this sell-out event has certainly stimulated interest and ongoing discussion about the distinctive opportunities for history of science created by digital scholarship and resources.

Reflections on discussion topics during the workshop by Pip Willcox

The event was supported by the Centre for Digital Scholarship (Bodleian Libraries), ‘Reading Euclid’, The Royal Society and the Newton Project, and was organized jointly by the Centre for Digital Scholarship and ‘Reading Euclid’. The date for the second workshop will be announced shortly.

—Benjamin Wardhaugh, ‘Reading Euclid’

Top image credit: René Descartes, Principia philosophiae (Amsterdam, 1644), ‘Cartesian network of vortices of celestial motion’, p. 110. Bodleian Library Savile T 22. Edited in Photoshop by Yelda Nasifoglu.

SWORDV3 stakeholder call

The SWORDV3 project team are looking for expressions of interest from potential stakeholders as they develop a new technical standard and community and governance mechanisms for this updated version of SWORD. From the DPC announcement:

Expressions of interest are sought to become stakeholders in the project: to make suggestions, review activities and meet as required over the coming months.

In particular, the project team is interested in making contact with people who may wish to develop SWORD V3 libraries for their preferred platforms or languages since the aim is to provide some support for such activities during the project. Please contact one of the project team (ideally by mid-October) if you are interested in participating, and indicate if you are interested in the technical or community aspects of the project (or both!).

On the technical side, the project is creating a document that brings together the change requests and new use cases that have collected since the release of SWORDV2, culled from the github site, message posts and preliminary discussions with some stakeholders earlier this year. This has also suggested a way forward that breaks with SWORD’s AtomPub roots in order to provide a more up-to-date and flexible protocol. This will be circulated to stakeholders soon.

On the community side, a similar document outlining possible models for developing the SWORD community in the future will be circulated soon. This is a much more open set of choices since the SWORD user-base has expanded considerably since its first conception, and we are open to further suggestions! The final arrangements must be aligned with community wishes in order to be an effective sustainable solution.

More at http://www.dpconline.org/news/swordv3-project-stakeholder-call.

Working with Spreadsheets: a workshop

Image of hand-drawn spreadsheet

What: Working with Spreadsheets: a workshop

When: 10:00—16:30, Tuesday 21 November

Where: Centre for Digital Scholarship, Weston Library (map)

Access: open to all members of the University

Admission: free

Trainers: Iain Emsley and Pip Willcox

Registration is required: please see below

This workshop is designed for anyone who works with spreadsheets and wants to learn how to explore that data more efficiently and consistently. No prior experience is required. The hands-on workshop teaches basic concepts, skills, and tools for working more effectively and reproducibly with your data.

We will cover data organization in spreadsheets and OpenRefine for managing data.

By the end of the workshop participants will be able to manage and analyze data effectively and be able to apply the tools and approaches directly to their ongoing research.

The workshop draws on lessons prepared by Data Carpentry and adapted by the trainers for use with Early English Books Online Text Creation Partnership data.

The methods that you will learn will be applicable to work in any field that uses spreadsheets. The EEBO-TCP subject matter we will use may be of particular interest to people working with library or early modern data.

Registration

To register, please email Pip Willcox (pip.willcox@bodleian.ox.ac.uk) with:

  • Your name
  • Your ox.ac.uk email address
  • Your departmental affiliation

This workshop is run in collaboration with the Centre for Digital Scholarship and the Reproducible Research Oxford project.

For announcements about future workshops and related activities run by Reproducible Research Oxford, please see the project website, subscribe to the mailing list, and follow the project on Twitter @RR_Oxford.

Equipment

Participants are requested to bring a laptop. To work with spreadsheets, you will need an application such as Microsoft Excel, Mac Numbers, or OpenOffice.org. If you don’t have a suitable program installed, you might like to use LibreOffice, a free, open source spreadsheet program.

You will also need OpenRefine (formerly Google Refine) and a web browser, and to have Java installed.

If you cannot bring a laptop with you, please let us know before the day.

Trainers

Iain Emsley works for the University of Oxford e-Research Centre on digital library and museums projects. Having recently finished an MSc in Software Engineering, he has started a PhD in Digital Media at Sussex University.

Pip Willcox is the Head of the Centre for Digital Scholarship at the Bodleian Libraries and a Senior Researcher at the University of Oxford e-Research Centre.

Image credit: Stockbyte/Getty Images.

Research Uncovered—Beyond reading: understanding the book through computer vision

Book tickets!

What: Research Uncovered—Beyond reading: understanding the book through computer vision

Who: Giles Bergel

When: 13:00—14:00, Thursday 2 November 2017

Where: Weston Library Lecture Theatre (map)

Access: open to all

Admission: free

Registration is required

This talk showcases Oxford’s cutting-edge research at the intersection of book history and computer vision. It aims to make images of books as easy to search, compare and annotate as their texts.

The University’s Visual Geometry Group has a long track record of working with University researchers and collections, building tools to help researchers analyse everything from classical art to fifteenth-century printed books and English broadside ballads, as well as numerous applications in the sciences. Several of these tools have now been openly released for all to use and adapt.

The talk reveals how computer vision, far from detracting from understanding books as material objects, offers a fresh pair of eyes on what remains one of humanity’s most sophisticated inventions and richest forms of heritage.

Dr Giles Bergel is Digital Humanities Research Officer in the University of Oxford’s Visual Geometry Group. He works on printed books, printing materials and the history of the book trade. Find out more information.

Book tickets: http://www.bodleian.ox.ac.uk/whatson/whats-on/upcoming-events/2017/nov/beyond-reading

Reconciling database identifiers with Wikidata

Charles Grey, former Prime Minister, has an entry in Electronic Enlightenment. How do we find his UK National Archives ID, British Museum person ID, History of Parliament ID, National Portrait Gallery ID, and 22 other identifiers? By first linking his Wikidata identifier.

In a previous blog post I stressed the advantage of mapping the identifiers in databases and catalogues to Wikidata. This post describes a few different tools that were used in reconciling more than three thousand identifiers from the Electronic Enlightenment (EE) biographical dictionary.

The advantages to the source database include:

  • Maintaining links between Wikipedia and the source database. EE and Early Modern Letters Online (EMLO) are two biographical projects that maintain links to Wikipedia. As Wikipedia articles get renamed or occasionally deleted, links can break. It is also easy to miss the creation of new Wikipedia articles. As EE and EMLO links are added to Wikidata, a simple database query gets a list of Wikipedia article links and their corresponding identifiers. Thus we can save work by automatically maintaining the links.
  • Identifying the Wikipedia articles of individuals in the source database. These are targets for improvement by adding citations of the source database.
  • Identifying individuals in the source database who lack Wikipedia articles, or who have articles in other language versions of Wikipedia, but not English. New articles can raise the profile of those individuals and can link to the source database. We raised awareness among the Wikipedian community with a project page and blog post. We also arranged with Oxford University Press to give free access to EE for active Wikipedia editors who requested it, via OUP’s existing Wikipedia Library arrangement.
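
The "simple database query" in the first point could look something like the sketch below, which builds a SPARQL query for the public Wikidata Query Service. It is an illustration under stated assumptions rather than the query we ran: `build_query` and `build_url` are names made up for the example, and `P0000` is a placeholder to be replaced with the Wikidata property for the database in question.

```python
from urllib.parse import urlencode

# Public SPARQL endpoint of the Wikidata Query Service.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_query(property_id, wiki="https://en.wikipedia.org/"):
    """Return SPARQL listing each item's external ID and its article on one wiki."""
    return f"""
SELECT ?item ?externalId ?article WHERE {{
  ?item wdt:{property_id} ?externalId .
  ?article schema:about ?item ;
           schema:isPartOf <{wiki}> .
}}
""".strip()


def build_url(property_id):
    """Return a GET URL asking the query service for JSON results."""
    params = {"query": build_query(property_id), "format": "json"}
    return WDQS_ENDPOINT + "?" + urlencode(params)


# "P0000" is a placeholder, not a real Wikidata property.
query = build_query("P0000")
url = build_url("P0000")
print(query)
```

Because the query asks Wikidata for the current article of each item, renamed or newly created articles are picked up on the next run without anyone editing the source database by hand.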

Report from Wikimania

Last month I had the privilege to attend the Wikimania conference in Montreal, Canada, where 900 people from around the world gathered for two days of hacking and building and then three days of conference sessions. The conference scope includes not just the Wikimedia projects but also the big themes of open education, open access, community building, and privacy and rights in the digital age. One blog post by one attendee is only going to capture a sliver of what went on, and here I am summarising some big projects of most relevance to university research projects and GLAMs.

This time round, Wikidata rather than Wikipedia was generating the most excitement. Wikidata, the free structured knowledge-base, is going through a period of explosive growth, helped in small part by data shared from partner institutions including Oxford University, and the conference brought together many people using Wikidata to document cultural heritage and current knowledge.

The author and hundreds of other Wikimedians. Photo by Victor Grigas of the Wikimedia Foundation, CC-BY-SA 4.0

Data Carpentry Workshop for Humanists

You are invited to join a free Data Carpentry workshop run by the Reproducible Research Oxford project. Registration is required.

Date: 26–27 September 2017 

Venue: Institute of Cognitive and Evolutionary Anthropology, 64 Banbury Road, Oxford OX2 6PN

The workshop will cover data organization in spreadsheets and OpenRefine, data analysis and visualization in Python, and SQL for data management, with a focus on humanities data. This is a joint effort with Data Carpentry to develop a (pilot) curriculum for the digital humanities. It is at an introductory level.

See the workshop website for details: https://rroxford.github.io/2017-09-26-oxford/

The workshop is free and open to any member of the University — researchers, staff, and students. It will be particularly relevant to people working with humanities data, though the methods are widely applicable.

IIIFrankenstein

Last week Digital.Bodleian reached 700,000 images with the help of Mary Shelley’s Frankenstein notebooks. These have been accessible online at the wonderful Shelley-Godwin Archive for some time now, complete with transcriptions, TEI markup and detailed explanatory notes, alongside other manuscripts from Mary Shelley, Percy Bysshe Shelley, and William Godwin. Porting them to Digital.Bodleian is not intended to replace this brilliant resource, but it helps with the Bodleian’s mission to improve the discoverability of our online resources. It also lets users do a few extra neat things with the images.

Bodleian MS. Abinger c.57, fol. 23r.

Everything added to Digital.Bodleian receives a IIIF Manifest. This means the image sets and accompanying metadata are expressed in a rich, flexible format conforming to a shared API standard. IIIF tools exist for manipulating and comparing, as well as viewing, digital images. This comes in handy for the Frankenstein notebooks (properly called MS. Abinger c.56, MS. Abinger c.57 and MS. Abinger c.58). At present they are fragmented, and the ordering of the pages in the Draft notebooks (MS. Abinger c.56 and c.57) is different to the linear order of the novel. Using IIIF tools, we can easily work with the notebooks side-by-side, and remix the ordering of pages to fit the novel’s sequence.

The Mirador viewer, created by Stanford University with the help of the Andrew W. Mellon Foundation, lets us quickly and easily view multiple IIIF-compliant image sets alongside each other. We’ve created an instance with the Frankenstein notebooks ready-loaded side by side.

Bodleian MS. Abinger c.56, c.57 and c.58 viewed in Mirador.

The Bodleian’s Digital Manuscripts Toolkit, also funded with help from the Andrew W. Mellon Foundation, includes a Manifest Editor. This lets us remix and combine IIIF-compliant image sets into new sequences. Following the lead of the Shelley-Godwin Archive, we’ve created a manifest which reorders the Frankenstein Draft pages into the linear sequence of the novel. This can be viewed in a Mirador instance here – though note that the extant Draft is incomplete! The manifest itself lives here, and can be used with any other IIIF-compliant API.

IIIF Manifests are in a standardised JSON format.
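
Because a manifest is plain JSON, the kind of remix described above can also be sketched in a few lines of Python. This is not how the Manifest Editor works internally. It assumes a manifest laid out as in version 2 of the IIIF Presentation API, where pages are canvases listed inside the first sequence, and the function name `reorder_canvases`, the example.org URIs and the folio labels are all invented placeholders.

```python
import copy


def reorder_canvases(manifest, ordered_ids):
    """Return a copy of the manifest with its canvases in the given order."""
    remixed = copy.deepcopy(manifest)  # leave the source manifest untouched
    sequence = remixed["sequences"][0]
    by_id = {canvas["@id"]: canvas for canvas in sequence["canvases"]}
    sequence["canvases"] = [by_id[canvas_id] for canvas_id in ordered_ids]
    return remixed


# Invented two-page manifest; a real one also carries images and metadata.
draft = {
    "@context": "http://iiif.io/api/presentation/2/context.json",
    "@id": "https://example.org/iiif/draft/manifest.json",
    "@type": "sc:Manifest",
    "label": "Placeholder draft notebook",
    "sequences": [{
        "@type": "sc:Sequence",
        "canvases": [
            {"@id": "https://example.org/iiif/draft/canvas/1",
             "@type": "sc:Canvas", "label": "fol. 1r"},
            {"@id": "https://example.org/iiif/draft/canvas/2",
             "@type": "sc:Canvas", "label": "fol. 2r"},
        ],
    }],
}

# Hypothetical reading order: the second leaf comes first.
novel_order = [
    "https://example.org/iiif/draft/canvas/2",
    "https://example.org/iiif/draft/canvas/1",
]
remixed = reorder_canvases(draft, novel_order)
labels = [canvas["label"] for canvas in remixed["sequences"][0]["canvases"]]
print(labels)
```

A remixed manifest published for others to use would also need its own `@id`, since it is a new resource rather than a replacement for the original.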

If you’d like to use Mirador to view Digital.Bodleian images, you can use the link in the sidebar (the stylised ‘M’) when viewing any image or item.

IIIF, Universal Viewer and Mirador icons on Digital.Bodleian.

To add further images alongside an item in Mirador, select ‘Change Layout’ from the top menu and choose how many items you’d like to view together, and the layout you’d like to view them in. You can then simply click-and-drag the IIIF icon from any other Digital.Bodleian image set into the Mirador browser tab. You can also open IIIF-compliant image sets from other institutions – you just need the URI of the IIIF Manifest.

For instructions on using the Digital Manuscripts Toolkit’s Manifest Editor (and other tools), please see the DMT website.