Turning a historical book into a data set

A series of books published around the turn of the 20th century is crucial to modern bibliographic research: biographical dictionaries of booksellers and printers, giving addresses, dates and significant works printed. Some of these books are out of copyright and available as scanned pages, allowing us not only to copy them into new formats but also to adapt them into new kinds of resource.

These scanned books could be made more useful to researchers in a number of ways. The text can be meaningfully segmented by dictionary entry rather than by page or paragraph. The book’s internal and external citations can become links, for instance linking a proper name to identifiers for the named person. The book can even have an open data representation which other data sets can hook on to, for example to say that a person is described in the book.

This case study describes the transformation of one of these books, Henry Plomer’s A Dictionary of the Booksellers and Printers who Were at Work in England, Scotland and Ireland from 1641 to 1667, using Wikisource, part of the Wikimedia family of sites. As a collaborative platform, Wikisource allowed Bodleian staff to work with Wikisource volunteers. We benefited from many kinds of volunteer labour, from correcting simple errors in the text to creating custom wiki-code to speed up the process.

A lot of important data sets currently exist only in the form of printed books, including catalogues, dictionaries and encyclopedias. We adopted a process that has already been used on some large, multi-volume works and could be used for many more.

From page scans to transcriptions

The Internet Archive provides a scan of the book, downloadable as a set of images or a single plain text file. The text, generated by Optical Character Recognition (OCR), is mostly readable but has problems. Digits are misrecognised as letters or vice versa, and letters are merged or misread, as in “ComhiU” for “Cornhill”. The biographical dictionary has many dates and many page references to other sources, so common problems included “ii.” misrecognised as “iL” and “3” as “2”, because the typeface gave “3” a large descender. Still, the OCR text is good enough that someone without much prior knowledge can correct it to a faithful representation of the book. This was the basis of the crowdsourcing process.

Any image file on Wikisource needs to be hosted on Wikimedia Commons. There is a tool to copy the file and its associated metadata from Internet Archive to Commons, with links back to the source. The next step is to create an Index page on Wikisource. This gives a link for each page of the scanned book, where the OCR text is presented next to the image of the page, and users can click “Edit” to fix errors in the OCR.

A couple of pages in the scan were damaged, obscuring some words. The interface allowed proofreaders to mark those pages as “Problematic”, and an alternative Internet Archive scan of the same edition helped us complete the text.

The colour of the page links indicates the progress of the error correction. Pages in green have been checked by multiple users; pages in yellow by only one. Pages in red are incompletely transcribed, and a page number highlighted in blue indicates that transcribing that page is problematic, in this case because the text is indistinct. Once every page has been checked by at least one user, the “Progress” status of the whole work changes to “To be validated”. Then, once every page has been verified by multiple users, the transcription as a whole is done and the pages are ready for assembling into a digital version of the book.

Links within the text

The dictionary makes liberal use of references between entries, using “q.v.” or “the preceding”. As well as correcting OCR errors, users can add links to the text as they go. The same can happen to other kinds of reference, such as mentions of authors or their works. The community policy on links within text is that they can point to Wikisource or Wikipedia but not to external sites, and they must be purely informative and not editorialise. This does not mean that the text is isolated from external sites; it is just indirectly linked to them. A Wikisource author profile links to many external records for that author, including VIAF, LCCN, BnF and the Oxford Dictionary of National Biography. So in the sentence “He was the printer of the first volume of Sir W. Dugdale’s Monasticon”, “W. Dugdale” is linked to the Wikisource author profile “Author:William Dugdale (1605-1686)”, which in turn is linked to many authoritative external sites.

Recreating the book

Once all the pages are ready, they need to be assembled to create an online version of the book, with a page for each biography. This is not a matter of copy-and-paste, but of using pointers which pick out chunks of text in the transcription project. Creating the web version of one of the book’s biographies might require combining the last section of one page and the top section of the next, giving the biography a title, and providing “previous” and “next” links to step through the whole book. One of the Wikisource administrators created a wikicode template to simplify this. With the template, recreating the whole book was still a repetitive task, but it took much less effort than it otherwise would have.

With the web version of the book created, the individual biographies are now shareable and bookmarkable. Wikisource provides an export tool which gives the entire book in various electronic book formats, including EPUB and PDF; we can also, for example, retrieve the entire book as one XML file.

A search engine and API

With one line of wiki-code, we added a customised search that only returns results from within the book. Since it searches the structured dictionary rather than a set of pages or a monolithic text file, search results are in the form of the names of the printers. Searching for “oxford” brings up all the entries mentioning Oxford, which are almost always booksellers or printers who worked in Oxford. Searching for “Milton” brings up those who are mentioned in the dictionary as publishing Milton.
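Under the hood, the same restriction can also be expressed programmatically against Wikisource’s web API (introduced in the next paragraph). The Python sketch below is an illustration of that idea rather than the wiki-code actually used on the page: it assumes the requests library and restricts a search to subpages of the transcribed dictionary with a prefix: filter.

import requests

# Illustrative sketch: reproduce the book-only search through the MediaWiki
# search API, limiting results to subpages of the transcription.
API = "https://en.wikisource.org/w/api.php"
BOOK = ("A Dictionary of the Booksellers and Printers who Were at Work in "
        "England, Scotland and Ireland from 1641 to 1667")
params = {
    "action": "query",
    "list": "search",
    "srsearch": "oxford prefix:" + BOOK + "/",
    "format": "json",
}
for hit in requests.get(API, params=params).json()["query"]["search"]:
    print(hit["title"])  # the entries mentioning Oxford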

MediaWiki, the software platform for Wikisource and Wikipedia, has an API: an interface for use by software. This means that, by constructing a URL, it is possible to get some properties of a biography, or its text in XML form. This proves useful later for integration with Google Sheets.
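As a sketch of what such a URL amounts to, the Python snippet below calls the parse endpoint for a single biography; it assumes the requests library, and the subpage title used here is hypothetical (real titles come from the book’s table of contents). The spreadsheet formulas later in this post construct a similar URL.

import requests

# Minimal sketch of an API call to fetch one biography's rendered text.
API = "https://en.wikisource.org/w/api.php"
BOOK = ("A_Dictionary_of_the_Booksellers_and_Printers_who_Were_at_Work_in_"
        "England,_Scotland_and_Ireland_from_1641_to_1667")
params = {
    "action": "parse",
    "prop": "text",
    "page": BOOK + "/Dugard (William)",  # hypothetical subpage title
    "format": "json",
    "disabletoc": 1,
    "disableeditsection": 1,
}
data = requests.get(API, params=params).json()
html = data["parse"]["text"]["*"]  # rendered HTML of the biography
print(html[:200])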

The book’s structure as a data set

With the book online as electronic text, the next step was to create a representation in Wikidata of the book and its component biographies. This includes pointers to the online transcription and allows cross-referencing with other data.

The first step is to create a Wikidata representation for the book itself: this means two entries, one for the work and another for the edition. These include basic bibliographic data such as year of publication and original language, as well as identifiers in external databases including Google Books and the Open Library. Each of the book’s 940 entries is then represented by its own Wikidata entry, which points to this representation of the book. Wikidata properties can express that each entry is a biographical article, written by Henry Plomer, and published in the book. If we were dealing with a book in many editions, this structure could express which editions each article appeared in. Each biography’s Wikidata entry links to its Wikisource transcription.

To make this structure, I copied the text of the book’s table of contents and removed irrelevant lines. This gives a 940-line text file, where each line is both the title of an article and the URL path for its transcription in Wikisource. The QuickStatements tool enables bulk imports to Wikidata, taking a tab-delimited text format. Here is the QuickStatements syntax to create a new Wikidata item which is an instance of a biographical article and is published in Plomer’s dictionary:

CREATE
LAST P31 Q19389637
LAST P1433 Q40901539

In the Notepad++ text editor, I opened the biography list in one tab and, in another, created a template description of a biography in the QuickStatements format. A regular-expression search and replace combined the two files, resulting in a list of 6580 statements for pasting into QuickStatements.
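The same combination could equally be scripted. The Python sketch below is not the Notepad++ process itself, just an equivalent way of turning the title list into QuickStatements input; the file names are assumptions, the label line (Len) is an addition the original template may or may not have had, and the real template evidently produced more lines per entry than the two statements shown above.

# Sketch only: generate QuickStatements for every biography title.
Q_BIOGRAPHICAL_ARTICLE = "Q19389637"  # instance of: biographical article
Q_PLOMER_DICTIONARY = "Q40901539"     # published in: Plomer's dictionary

with open("biographies.txt", encoding="utf-8") as f:   # assumed file name
    titles = [line.strip() for line in f if line.strip()]

with open("quickstatements.txt", "w", encoding="utf-8") as out:
    for title in titles:
        out.write("CREATE\n")
        out.write('LAST\tLen\t"' + title + '"\n')   # English label for the new item
        out.write("LAST\tP31\t" + Q_BIOGRAPHICAL_ARTICLE + "\n")
        out.write("LAST\tP1433\t" + Q_PLOMER_DICTIONARY + "\n")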

A few errors crept in due to carelessness: four statements had the wrong number of tab characters. Luckily, QuickStatements gives an in-browser report on each statement, showing failed statements in red alongside links to the newly created Wikidata entries, so it was easy to scroll through the whole report, find the failed statements, and fix them manually.

Now that this structure exists, each entry has its own identifier in Wikidata, and a “Wikidata item” link now appears on each entry in Wikisource. These identifiers can be used in citations within Wikidata. Specifically, a citation can say that a fact is “stated in” a particular entry in the Plomer book, so we can start to connect the representation of the book to representations of the people and works it describes.

Image: Plomer’s biography of William Dugard used as a citation for a fact about Dugard in his Wikidata entry.

The book’s content as a data set

By pasting the list of biography titles into Google Sheets, we can start to recreate the book within the spreadsheet.

B1=CONCAT("A_Dictionary_of_the_Booksellers_and_Printers_who_Were_at_Work_in_England,_Scotland_and_Ireland_from_1641_to_1667/",A1)

This constructs the path within English Wikisource to the specific biography, taking the entry’s title from column A.

C1=CONCAT("https://en.wikisource.org/w/api.php?action=parse&prop=text&format=xml&disabletoc=true&disableeditsection=true&page=",B1)

This constructs the API call to get the text of the biography.

D1=REGEXREPLACE(IMPORTXML(C1,"//text"),"\n","")

This gets the text of the biography from the API and puts it all on one line. The lookups are spaced out in time, so populating the whole spreadsheet takes several hours.

E1=REGEXEXTRACT(D1,"[A-Z][A-Z][A-Z]+.*")

The text that comes out of Wikisource includes the title, header and “previous” and “next” links. This rather hacky step is needed to extract just the text of the biography: the first word in all capitals is the surname of the individual, which is also the first word of the biography, so the regular expression discards everything before it.

With the text of a biography in a spreadsheet, we can start to get properties. For example, we can get the length of the string and use this to work on the longest biographies first.

F1=LEN(E1)

With the Wikidata numbers of the biographies pasted into column G, we have a way to say that facts extracted from the text are stated in that particular biography. There are sophisticated things that could be done with text processing to extract individual statements from the text, but this was outside the scope of the project. A simple alternative is to use regular expressions within conditional statements that output the fact in QuickStatements format. For example:

H1=IF(REGEXMATCH(E1,"(?i)bookseller"),CONCAT("LAST        P106        Q998550        S248        ",G1),"")

In English, “If the biography text contains the word ‘bookseller’, add a statement that the person’s occupation (property P106) was bookseller (item Q998550), stated in (source property P248) the biography.”

J1=IF(REGEXMATCH(E1,"(?i)oxford"),CONCAT("LAST        P937        Q34217        S248        ",G1),"")

In English, “If the biography text contains the word ‘Oxford’, add a statement that the person’s work location (property P937) was Oxford (item Q34217), sourced to the biography.”

These statements will generate false positives and need human checking before being shared on Wikidata, but this speeds up the process of extracting facts. This aspect of the work is ongoing.
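For anyone working outside a spreadsheet, the same rule-based matching can be sketched in a few lines of Python. This is an illustration of the technique rather than the project’s own code; the rules simply mirror the two formulas above, and the example values are hypothetical.

import re

# Sketch: scan a biography's text and emit candidate QuickStatements rows,
# each sourced ("stated in", S248) to the Wikidata item for that biography.
# The output still needs human checking before upload.
RULES = [
    (r"(?i)bookseller", "P106", "Q998550"),  # occupation: bookseller
    (r"(?i)oxford",     "P937", "Q34217"),   # work location: Oxford
]

def candidate_statements(text, biography_qid, subject="LAST"):
    # "subject" may be an existing person item, or LAST following a CREATE line.
    rows = []
    for pattern, prop, value in RULES:
        if re.search(pattern, text):
            rows.append("\t".join([subject, prop, value, "S248", biography_qid]))
    return rows

# Hypothetical example text and biography item:
print("\n".join(candidate_statements("SMITH (JOHN), bookseller in Oxford, 1641-50.", "Q12345678")))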

Wikidata already had representations of some printers described in the book, including biographical statements, authority file identifiers and connections to sources (e.g. Robert Barker is described in the Dictionary of National Biography). It also has data about places (Datchet abuts the river Thames) and common-sense concepts (printing is the production of printed matter), so by making connections (Robert Barker is described in the Plomer book; was born in Datchet; was a printer) we connect the book to a much larger network of knowledge.
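One payoff of that network is that the new structure can be queried. As an illustration (not part of the original workflow), the sketch below asks the Wikidata Query Service for every item recorded as published in (P1433) the Plomer dictionary (Q40901539); it assumes the requests library and an arbitrary User-Agent string.

import requests

# Sketch: list every biographical article published in the Plomer dictionary,
# via the Wikidata Query Service SPARQL endpoint.
QUERY = """
SELECT ?entry ?entryLabel WHERE {
  ?entry wdt:P1433 wd:Q40901539 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""
r = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "plomer-dictionary-example/0.1"},
)
for row in r.json()["results"]["bindings"]:
    print(row["entry"]["value"], row["entryLabel"]["value"])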

I would like to thank User:Billinghurst and the other Bodleian staff and Wikisource volunteers who helped with this process, and the Electronic Enlightenment team for suggesting the book.

—Martin Poulter, Wikimedian in Residence

This post is licensed under a CC BY-SA 4.0 license.
