Reconciling database identifiers with Wikidata

Charles Grey, former Prime Minister, has an entry in Electronic Enlightenment. How do we find his UK National Archives ID, British Museum person ID, History of Parliament ID, National Portrait Gallery ID, and 22 other identifiers? By first linking his EE entry to his Wikidata identifier.

In a previous blog post I stressed the advantage of mapping the identifiers in databases and catalogues to Wikidata. This post describes a few different tools that were used in reconciling more than three thousand identifiers from the Electronic Enlightenment (EE) biographical dictionary.

The advantages to the source database include:

  • Maintaining links between Wikipedia and the source database. EE and Early Modern Letters Online (EMLO) are two biographical projects that maintain links to Wikipedia. As Wikipedia articles get renamed or occasionally deleted, links can break. It is also easy to miss the creation of new Wikipedia articles. As EE and EMLO links are added to Wikidata, a simple database query gets a list of Wikipedia article links and their corresponding identifiers. Thus we can save work by automatically maintaining the links.
  • Identifying the Wikipedia articles of individuals in the source database. These are targets for improvement by adding citations of the source database.
  • Identifying individuals in the source database who lack Wikipedia articles, or who have articles in other language versions of Wikipedia, but not English. New articles can raise the profile of those individuals and can link to the source database. We raised awareness among the Wikipedian community with a project page and blog post. We also arranged with Oxford University Press to give free access to EE for active Wikipedia editors who requested it, via OUP’s existing Wikipedia Library arrangement.

OpenRefine

OpenRefine is free software for data reconciliation. Thanks to the work of Antonin Delpeuch, an Oxford University student, Wikidata searching is built in. Let’s say you have a table of data about artworks, and one column represents the museum the art resides in: “Ashmolean”, “Louvre”, “Smithsonian” and so on. OpenRefine can look up those terms and return matches from Wikidata, restricting the search to a specific type of item: for example, art museums. The steps taken to clean up and match a data set in OpenRefine can be recorded and shared as a “recipe” for use on other data sets.

The ideal situation for OpenRefine, where properties are shared across many lines of a table, is different from the situation we were in, where every line represents an individual with a distinct identifier. Fortunately there are ways to automate the process.

Google Sheets: Getting a potential match

We wanted a way for two people to collaborate on the matching process, so an online shared application was preferable. Google Sheets gets prohibitively slow with large and complex datasets, but it was just about okay with our 8,000 records of Electronic Enlightenment data. However, it has one key advantage over desktop spreadsheet applications like Excel: Wikidata integration. The Wikipedia and Wikidata Tools add-on, once activated from the Add-ons menu in Google Sheets, makes calls to the Wikipedia and Wikidata APIs from within a spreadsheet.

If the names to be matched are in column A we can add a column to the spreadsheet with the function:

=WIKIDATAQID(CONCAT("en:",A2))

This will look for an English Wikipedia article whose title is the string in the A column. If the article exists, it will return Wikidata’s Q number identifier connected with that article. The “en” is a language code identifying English Wikipedia: to perform the lookup in French or German Wikipedia we instead use “fr” or “de”.
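For example, to run the same lookup against French Wikipedia instead of English, the formula becomes:

=WIKIDATAQID(CONCAT("fr:",A2))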

This initial matching was helped by transforming the names in EE to be more like the titles of Wikipedia articles. Wikipedia puts first names before surnames and excludes most titles, while EE uses a surname-first arrangement. “Davidson, Reverend Doctor Archibald” in EE is “Archibald Davidson” on Wikipedia. Some titles, including “earl” and “baronet”, are lower-case in EE but capitalised in Wikipedia. These differences are usually fixable with regular expressions, and more such fixes meant more matches.
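As a rough sketch of those fixes (assuming, hypothetically, that the EE-style name is in column A and that only a couple of honorifics need stripping), a helper column could use Google Sheets’ REGEXREPLACE to reorder the name before it is fed to the WIKIDATAQID lookup above:

=REGEXREPLACE(A2, "^([^,]+), (?:Reverend |Doctor |Sir )*(.+)$", "$2 $1")

This turns “Davidson, Reverend Doctor Archibald” into “Archibald Davidson”, and leaves any name that does not match the pattern unchanged.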

Sanity checks

Often the Wikidata identifier that appears is not that of a person. The William Meredith page on Wikipedia is not a biography but a list of pointers to biographies of various people named “William Meredith”, including an 18th-century British politician and a 19th-century American politician. Wikipedia has many of these disambiguation pages. Shared names are another source of false matches. An attempted match for Thomas Beach the poet (died 1737) returns the wrong identifier: Wikipedia’s article Thomas Beach is about the painter (1738–1806), and the poet has a separate article, Thomas Beach (poet).

For a secure match, we need more than just a name, both for the thing we are matching and the putative match in Wikidata. Google Sheets can help by extracting individual properties of a Wikidata item into the spreadsheet.

Let’s say the WIKIDATAQID command was in column G. Then in column H we put

=WIKIDATADESCRIPTIONS(G2,"en")

which will look up the English description of the item. These functions look up the person’s birthdate and death date respectively:

=WIKIDATAFACTS(G2,"first","P569")
=WIKIDATAFACTS(G2,"first","P570")

To check that the matched item is a person, we can use:

=WIKIDATAFACTS(G2,"first","P31")

Property P31 is the item’s type. People are instances of Q5 (human), whereas disambiguation pages are marked as instances of Q4167410 (Wikimedia disambiguation page). If birth and death dates are available, that serves as an implicit check that the item is a person.

With a comparison function, we marked records as a verified match when the name, year of birth and year of death from Wikidata matched those provided by EE. A count function kept track of how many records were verified. The Google Sheets interface also allowed us to tweak names and make manual matches. A total of 3,101 matches were made in the spreadsheet, and those identifiers were added to the relevant Wikidata records.
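As a sketch of what those functions looked like (with hypothetical column positions: EE’s name, birth year and death year in columns B, C and D; the corresponding values extracted from Wikidata in columns I, J and K; and the verification flag in column L), the comparison and count might be:

=IF(AND(B2=I2, C2=J2, D2=K2), "verified", "")
=COUNTIF(L:L, "verified")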

Mix’n’Match

Around the clock, volunteers are writing and editing Wikipedia articles, or importing batches of data into Wikidata. Less well known, but a community activity important to the whole project, is Mix’n’Match. This is a platform to crowdsource the effort of connecting identifiers from various external databases to Wikidata, and hence to each other. There are currently more than 500 catalogues of identifiers, including the Hansard identifiers for MPs and peers, the Pleiades database of ancient places, Project Gutenberg’s authors, and all 60,000 entries in the Oxford Dictionary of National Biography.

Mix’n’Match depends on volunteer work and the matches come over time as a steady drip, so automatic matching is preferable where possible. We had 3,955 EE identifiers that had not been matched through our spreadsheet process, so we shared them through this platform. EE and EMLO are databases of writers and addressees of letters, so there are many entries for which there is a name, or even a partial name, but no biographical detail. Wikidata is a database of things, not of names, so it cannot have an entry when there is no biographical data. Hence there will be EE and EMLO entries that never have a representation in Wikidata, because actual data about those people is lost to history, although the letters themselves could eventually be represented. Some catalogues of identifiers, like the ODNB, can get a 100% match with Wikidata; others will not.

Data providers upload a catalogue of their identifiers with some description of each individual item and some indication of the type of thing being identified: people, places, organisations, artworks and so on. There has to be a property in Wikidata for the identifier; for EE this was property P3429, Electronic Enlightenment ID, created in a previous community discussion.

When a catalogue is uploaded to Mix’n’Match, some records are automatically matched by name, looking across all the labels in Wikidata, not just the titles of articles in one version of Wikipedia. These putative matches are put into an “automatically matched” area for human checking. Wikidata users can confirm or undo the match with one click, using a game-like interface. Confirming the match adds the external identifier to the Wikidata entry, while undoing the match moves it from the “automatically matched” to “unmatched” pool of identifiers. As we have seen, automatic matches are often incorrect, but human checking stops them polluting the database. For items that have multiple name matches, users are shown short summaries of each match.

At the moment, there are a few active catalogues in Mix’n’Match relevant to work at Oxford University.

Each of these will drip-feed new matches into Wikidata over time, and anyone is welcome to pitch in and help.

The SPARQL query code to get English Wikipedia articles and corresponding EE identifiers is:

SELECT ?ee ?enwp
WHERE {
  ?person wdt:P3429 ?ee .
  ?enwp schema:about ?person ;
        schema:isPartOf <https://en.wikipedia.org/> .
}

This currently returns 2,787 results, which can be downloaded in XML, JSON or TSV format. Similar queries can pick up overlaps between EE and Hansard identifiers, or works of art that depict people in the EE database.
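For instance, a query along these lines would list people who have both an EE identifier and a Hansard identifier; this is a sketch that assumes P2015 is the Hansard (1803–2005) ID property:

SELECT ?ee ?hansard
WHERE {
  ?person wdt:P3429 ?ee .
  # P2015 is assumed here to be the Hansard (1803-2005) ID property
  ?person wdt:P2015 ?hansard .
}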

This case study shows that Google Sheets, Wikipedia and Wikidata work together as a system, and in turn help hundreds of other databases, including some of the most important authority files and catalogues, work together efficiently.

—Martin Poulter, Wikimedian in Residence

Thanks to Cory Salveson for doing a lot of the work described in this case study.

This post is licensed under a CC BY-SA 4.0 license.
