Resource discovery and Wikidata

How can I find reference materials about Jane Austen? This query could potentially take me to dozens of different sites and databases, each with different types of material. Project Gutenberg has transcribed text of her works. Librivox has audiobooks. Find A Grave has images of her memorial stone in Winchester Cathedral. The Huygens database of Women Writers has citations for modern research about her. The Stanford project Kindred Britain has her family tree. Across the Wikimedia family of sites, there are articles about Austen in 103 language versions of Wikipedia, quotations in 27 language versions of Wikiquote, and various images in Wikimedia Commons.

Portrait of Jane Austen by her sister, Cassandra. From the National Portrait Gallery via Wikimedia Commons

Title page of a first edition of Pride and Prejudice. Public Domain via Wikimedia Commons

Coat of arms of the Austen family. Public Domain via Wikimedia Commons

How do we capture the fact that all these different resources are about the same person? How do we make a path to these and similar sources, bypassing all the irrelevant links that would come up in a web search?

I can search for her by name, but naming is never simple. She is identified by “Austen, Jane” in some databases, “Jane Austen” in others, “jane-austen” in the New York Times topic index and so on. Other names are less straightforward: Alan Turing’s “Lady Lovelace” and Augusta Ada Byron are the same person. Her father was Lord Byron or, equally, 6th Baron Byron George Gordon Byron. Not all resources are in English: if an image has been tagged by a Russian user with “Аустен, Джен” then it is relevant to my search for Jane Austen, but will not be picked up by text-matching my search.

So searching for the string “Jane Austen” in lots of databases will miss a lot of the resources I want. Searches can be “fuzzy” and bring up resources that match not-quite-exactly, but this introduces its own problems. My search is for the English novelist Jane Austen, so I do not want links about the American novelist Jane G. Austin.

This is a core problem of discovery: the tools we commonly use are geared to searching for a name or phrase, but we are usually not interested in the name, but in the person or thing the name stands for. This briefing looks at how Wikimedia is solving this problem at web scale.

The first thing to note is that Wikimedia does solve the problem. While in the past the hundreds of different sites had to manually maintain links to each other, there is now a more elegant and scalable solution. Look at Austen’s Wikisource profile, which includes some transcriptions of her works. At the top of the page are links to Austen’s profiles in other Wikimedia projects. At the foot of the page are links to her entries in a variety of external databases, including the Virtual International Authority File, Library of Congress and Bibliothèque nationale de France.

Authority file links that appear automatically in Jane Austen’s Wikipedia and Wikisource pages

However, those identifiers and links do not appear in the source wiki-code of that page. Nor, for that matter, do the picture of Austen or her years of birth and death. Those links and facts are not held on Wikisource but automatically imported from a sister project, Wikidata.

As I write this, Wikidata has 26.6 million distinct entries, 3.47 million of which are for human beings. Each has a language-independent identifier or Q number; for Jane Austen, Q36322. Looking at a summary of Wikidata’s statements about Jane Austen we can see dozens of links to authority files, names in many human languages, and basic biographical facts such as Austen’s immediate family and cause of death. This is how I found the links in the first paragraph of this article.
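
To make the idea of a language-independent identifier concrete, here is a minimal sketch in Python. The name variants are the ones quoted earlier in this post, and Q36322 is Austen's Q number. Everything else (the lookup table, the function names, and the placeholder identifier standing in for Jane G. Austin) is an illustrative assumption of mine, not a structure that Wikidata exposes in this form.

```python
import unicodedata
from typing import Optional

AUSTEN = "Q36322"                 # Jane Austen's Q number, as given in the text
OTHER_NOVELIST = "Q-PLACEHOLDER"  # hypothetical stand-in, NOT a real Wikidata ID

# Many names, one identifier. Keys are stored in normalised (case-folded) form.
NAME_INDEX = {
    "jane austen": AUSTEN,
    "austen, jane": AUSTEN,
    "jane-austen": AUSTEN,
    "аустен, джен": AUSTEN,
    "jane g. austin": OTHER_NOVELIST,
}


def normalise(name: str) -> str:
    """Case-fold and collapse whitespace so trivially different spellings match."""
    folded = unicodedata.normalize("NFC", name).casefold()
    return " ".join(folded.split())


def resolve(name: str) -> Optional[str]:
    """Return the identifier for a known name, or None if the name is unknown."""
    return NAME_INDEX.get(normalise(name))
```

With this table, resolve("Austen, Jane") and resolve("Аустен, Джен") give the same identifier, while resolve("Jane G. Austin") gives a different one. The lookup is exact after normalisation, so no fuzzy matching is needed to keep the two novelists apart.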

There are more than three thousand different properties that a Wikidata entry can have. Each has to be proposed and then reviewed by the community in an open discussion. Some of these are identifiers in Digital Humanities databases relevant to work done in Oxford:

  • Cultures of Knowledge ID for scholars and intellectuals of the Early Modern period.
  • Nomisma ID for coins and currency described on Nomisma.org
  • Electronic Enlightenment ID for people and groups of the Enlightenment, broadly defined
  • Cuneiform Digital Library Initiative ID for objects with Cuneiform inscriptions
  • Medieval Libraries of Great Britain Book ID for individual medieval books
  • English Short Title Catalogue Citation ID for editions of books
  • Atlas of Hillforts is a scholarly database that will be mapping its identifiers to Wikidata.
  • Beazley Archive Pottery Database ID for pottery items and fragments

Another property is inventory number (in a specified collection) which enables Wikidata to represent collections data from archives and museums, as discussed in a previous article. Any item in a collection can have a Wikidata entry, and these entries can make useful links, including to geographical data such as location depicted in an artwork or the location of discovery of an artefact. Entries about events, people or creative works can also link to representations of scholarly research that describes them.

Wikidata does not just help with matching different names and identifiers, but with other kinds of relationship. Someone searching for resources about British novelists should be offered results about Jane Austen. Similarly, I might search for “coin” and be interested in this image of a Denarius though the word “coin” does not appear in its description.

P. Satrienus (c. 77 BC) Public Domain, via Wikimedia Commons

String matching will not help with these queries, because finding the right resources depends on background knowledge. This is where Wikidata comes into its own as a knowledge base, not just an authority hub. The facts “Denarius → subclass of → coin” or “Jane Austen → occupation → novelist”, and billions more like them, are represented in Wikidata.
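
The following Python sketch shows, under simplifying assumptions, how facts of this shape let a search for "coin" reach an item whose description never uses that word. The two facts quoted above are taken from the text. The third triple, the item name in it, and all function names are hypothetical and only there to make the example run. Wikidata itself stores statements by identifier rather than by English label.

```python
# Statements as (subject, property, value) triples.
TRIPLES = [
    ("Denarius", "subclass of", "coin"),
    ("Jane Austen", "occupation", "novelist"),
    # Hypothetical item, assumed for illustration only:
    ("denarius of P. Satrienus", "instance of", "Denarius"),
]


def superclasses(cls, triples):
    """Return cls plus every class reachable from it via 'subclass of'."""
    seen = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(v for s, p, v in triples
                     if s == current and p == "subclass of")
    return seen


def search_by_class(target, triples):
    """Find items that are instances of target or of any of its subclasses."""
    hits = set()
    for subject, prop, value in triples:
        if prop == "instance of" and target in superclasses(value, triples):
            hits.add(subject)
    return sorted(hits)


def subjects_with(prop, value, triples):
    """Find subjects that have the given property set to the given value."""
    return sorted(s for s, p, v in triples if p == prop and v == value)
```

Here search_by_class("coin", TRIPLES) finds the denarius by following the "subclass of" link, and subjects_with("occupation", "novelist", TRIPLES) finds Jane Austen. Neither result depends on matching the search word against a description.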

Sharing identifiers with Wikidata, or importing Wikidata identifiers into an existing database, is useful in making that database findable and semantically searchable, but is potentially just a start. It is also possible to share surface metadata (what names something is known by, what kind of thing it is, where it is) and improve the findability further. Going further, we could share more of the catalogue record. For a painting, we can specify its material, its measurements, and the date when it was acquired by a museum. As the data become richer, more things acquire a Wikidata representation. In adding data about historical books, we can create entries for printers and publishers. To enrich data about artworks, we can create Wikidata entries for their owners, or for the locations depicted.

The potential rewards are great, but that sounds like a lot of work. At a minimum, what should the designers or maintainers of databases and catalogues be doing? I propose three design principles:

  1. Databases should know the Wikidata identifiers of the objects they describe. One way to do this is to share the permanent identifiers used in your database to Wikidata. A future blog post will describe this process, which we are already doing with some data sets from Oxford University. Once identifiers are matched, we can query in both directions: for things on Wikidata that correspond to entries in your database (or an interesting subset of it) and records in your database that correspond to a Wikidata query.
  2. Search tools should resolve user queries in terms of Wikidata identifiers as early in the process as possible. Search for images of cancer in Wikimedia Commons, and you are given a choice of cancer the disease, Cancer the constellation, Cancer the astrological symbol or Cancer the genus of crab. Each of these has a separate Wikidata representation that links resources about that item within and beyond Wikimedia. This architectural principle could be used whenever we want to search across data from different sources: say, a museum catalogue, a bibliographic database, and a biographical dictionary. We need to distinguish things that are known by the same name, and combine identifiers that represent the same thing. Rather than build a composite index to do this, we can piggy-back on Wikidata, which is already serving this function.
  3. Sites should be constructed so that information about a thing can be accessed using its standard identifier. Wikidata constructs the VIAF link for Jane Austen by adding her VIAF number, 102333412, to the end of the VIAF stem: https://viaf.org/viaf/ . It constructs the Find A Grave link by adding her Find A Grave number, 44, to a different stem. If your links take the form of a stem plus an identifier, and if those identifiers are known to Wikidata, then scripts and apps can build links to your site. So this simple design principle enables a great variety of front-ends to your site, including timelines, maps and lists. You could use Wikidata identifiers, your own identifiers (which you share with Wikidata) or, if it covers your topic, an authority hub such as VIAF. If you put your information about Jane Austen at http://<<yoursite>>.ac.uk/102333412 then the rest of us can plumb your site into the web of knowledge that we build.
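
The third principle can be sketched in a few lines of Python. The VIAF stem and the two identifier values come from the text. The dictionary layout and the function name are my own assumptions, and the Find A Grave stem is deliberately left out because it is not given here, which also shows what happens when a stem is unknown.

```python
# Link stems, keyed by identifier scheme. Only the VIAF stem is given in the
# text; a scheme with no known stem simply produces no link.
STEMS = {
    "VIAF": "https://viaf.org/viaf/",
}

# Identifiers held for each item, keyed by Q number (values from the text).
IDENTIFIERS = {
    "Q36322": {"VIAF": "102333412", "Find A Grave": "44"},
}


def build_links(qid, identifiers=IDENTIFIERS, stems=STEMS):
    """Build external links for an item by appending each identifier to its stem."""
    links = {}
    for scheme, value in identifiers.get(qid, {}).items():
        stem = stems.get(scheme)
        if stem is not None:  # skip schemes whose stem we do not know
            links[scheme] = stem + value
    return links
```

Calling build_links("Q36322") yields the VIAF link for Jane Austen and nothing for Find A Grave. Adding one entry to STEMS for a new site is enough for every item with a matching identifier to gain a link to it, which is why the stem-plus-identifier pattern scales so cheaply.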

Edit, 11 October 2017: a third principle has been added to the initial list of two, after discussion with archivists.

A phrase search, such as “Jane Austen and the War of Ideas”, is not amenable to parsing into Wikidata identifiers. Then again, that kind of search is already well-served by web search engines such as Google. The simple way for a database to facilitate phrase searching is to make its text content viewable on the public web.

To find things known in English as “cancer” we can ask Wikidata with a custom query. There are presently 28 such items in Wikidata and we need to prioritise them by how likely the user is to be looking for them. The linked query orders the items by the number of pages about each on Wikipedia and related sites. This gives us a rough measure of the cultural importance of each meaning: cancer the disease has 149 of these sitelinks while Cancer the scientific journal has five and Cancer the album by American rock band Showbread has one.
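
The prioritising step can be sketched in Python as follows. The three sitelink counts are the ones quoted above. The function name and data layout are illustrative assumptions, and this is a local stand-in for the ordering, not the linked query itself.

```python
# (description, number of sitelinks) for three of the meanings of "cancer",
# using the counts quoted in the text.
CANDIDATES = [
    ("Cancer the album by Showbread", 1),
    ("cancer the disease", 149),
    ("Cancer the scientific journal", 5),
]


def rank_by_sitelinks(candidates):
    """Order candidate meanings from most to fewest sitelinks."""
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)


def best_guess(candidates):
    """Return the description of the most widely covered meaning, if any."""
    ranked = rank_by_sitelinks(candidates)
    return ranked[0][0] if ranked else None
```

A search tool could show the ranked list as a disambiguation menu, with the disease first and the album last, rather than silently picking the top entry.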

Like everything about Wikipedia and Wikidata, this is a public service, and anyone is welcome to build services on it. They are incomplete works-in-progress, just as many library catalogues and research databases are, but by sharing our data to improve them, while others share their own content and data, we reap a massive cumulative benefit.

Further reading

Jason Evans, Simon Cobb “Wikidata Visiting Scholar at the National Library of Wales”
https://en.wikipedia.org/wiki/User:Jason.nlw/Wikidata_Visiting_Scholar

Joachim Neubert, Jacob Voss (2017) “Wikidata as authority linking hub”
https://hackmd.io/p/S1YmXWC0e

EveryPoliticianBot (2016) “I use Wikidata for multilingual names”
https://medium.com/mysociety-for-coders/i-use-wikidata-for-multilingual-names-d35b331f1a59

—Martin Poulter, Wikimedian in Residence

This post licensed under a CC-BY-SA 4.0 license
