Some ways Wikidata can improve search and discovery

I have written in the past about how Wikidata enables entity-based browsing. Search is still necessary, though, so it is worth considering how a semantic web database can be useful to a search engine index.

This post is about three ways Wikidata could help search and discovery applications, without replacing them: 1) providing more general or more specific terms (hypernyms and hyponyms), 2) providing synonyms for a search term, and 3) structuring a thesaurus of topics to provide meaningful connections. I end with the real-world example of Quora.com, which is using Wikidata to manage a huge user-generated topic list.

Hypernyms and hyponyms

[Image: P. Satrienus (c. 77 BC). Public Domain, via Wikimedia Commons]

I previously raised the problem that this image will not appear in a straightforward search for “coins” because it is described with the specific term “denarius”. The stemming algorithm in a search index will recognise that “coin” and “coins” are different forms of the same English word but won’t recognise that “denarius” refers to a kind of coin. That real-world knowledge has to be supplied somehow.

Part of the point of Wikidata is to provide this sort of “Knowledge As A Service”. This query asks Wikidata for names of types of the thing known in English as “coin”.

SELECT DISTINCT ?name WHERE {
  ?sub wdt:P279+/rdfs:label "coin"@en.
  { ?sub rdfs:label ?name } UNION { ?sub skos:altLabel ?name }
  FILTER (lang(?name)="en")
}

This gives 173 results, although some are phrases like “silver coin” or “100 Yen coin” which a search for “coin” would have found anyway. Filtering these out with an extra statement…

FILTER ( !CONTAINS(?name, "coin") )

…leaves 103 matches. Some of these are long phrases like “Liberty Head nickel”. If we just want single words, we can add another filter:

FILTER ( !CONTAINS(?name, " ") )

This yields 50 terms including ducat, guinea, florin, crown, penny, quarter and so on. This could be refined further, but by incorporating these terms into a search index, we can make our denarius and florin records visible to a search for “coins”, and vice versa.
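As a rough sketch of what "incorporating these terms into a search index" might look like, the Python below expands a query term with its more specific terms before matching. Everything here is an assumption made for illustration: the hyponym table is a hand-copied subset of the terms mentioned above rather than a live query result, the records are toy data, and `normalise`, `expand_query` and `search` are hypothetical helpers, not part of any search library.

```python
# Illustrative subset of the single-word hyponyms of "coin" mentioned above.
# In practice this table would be filled from the Wikidata query results.
HYPONYMS = {
    "coin": {"denarius", "ducat", "guinea", "florin", "penny"},
}

# Toy records standing in for catalogue descriptions (hypothetical data).
RECORDS = {
    1: "Silver denarius of P. Satrienus",
    2: "Gold florin",
    3: "Portrait of a merchant",
}


def normalise(word):
    """Crude stand-in for a real stemmer: lower-case and drop a trailing 's'.

    It is applied identically to query terms and indexed words, so the two
    sides always agree even when the result is not a real word.
    """
    word = word.lower()
    return word[:-1] if word.endswith("s") else word


def expand_query(term):
    """Return the normalised term plus its normalised hyponyms, if any."""
    base = normalise(term)
    return {base} | {normalise(h) for h in HYPONYMS.get(base, set())}


def search(term):
    """Return ids of records containing the term or one of its hyponyms."""
    wanted = expand_query(term)
    hits = []
    for record_id, text in RECORDS.items():
        words = {normalise(w) for w in text.split()}
        if words & wanted:
            hits.append(record_id)
    return sorted(hits)


print(search("coins"))
```

With the expansion in place, the denarius and florin records match a search for "coins" even though neither description contains that word. A production index would more likely store the expansions at indexing time than compute them per query.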

Synonyms

Confucius and Kung Fu-tse are different Western versions of the same Chinese name. One way a search can fail is if the user searches for one form but the indexed text has another. Since Wikidata items can have any number of aliases, it can act as a synonym server.

The following query asks for alternative names used by English speakers for the thing known in English as “Confucius”.

SELECT DISTINCT ?name WHERE {
  VALUES ?rel { rdfs:label skos:altLabel }
  VALUES ?rel2 { rdfs:label skos:altLabel }
  ?target ?rel "Confucius"@en;
          ?rel2 ?name
  FILTER ( lang(?name)="en" )
}

It returns Kungfutse, Zhongni, Confucio, Konfuzius, Kongfuzi, Kong Fuzi, Kongqiu, Kong Qiu, K’ung-fu-tzu, K’ung-tzu, Kongzi and variations. This sort of query could be called over the web when building a search index. Better still, if the objects in a database are tagged with their Wikidata identifiers, then a query can retrieve the various alternative names for those specific things, avoiding ambiguity.
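To illustrate the last point, here is a minimal Python sketch of turning a synonym query result into a lookup table keyed by Wikidata identifier. It assumes the result arrives in the standard SPARQL JSON results layout; the sample document is hand-written from a few of the names listed above rather than captured from the query service, and `names_from_response` and `build_alias_index` are hypothetical helper names. The identifier Q4604 is the one used for this topic in the query at the end of this post.

```python
import json

# A hand-written sample in the SPARQL JSON results layout. The values are a
# few of the names listed above, not a captured server response.
SAMPLE_RESPONSE = json.dumps({
    "head": {"vars": ["name"]},
    "results": {"bindings": [
        {"name": {"type": "literal", "xml:lang": "en", "value": "Confucius"}},
        {"name": {"type": "literal", "xml:lang": "en", "value": "Kongzi"}},
        {"name": {"type": "literal", "xml:lang": "en", "value": "Kong Fuzi"}},
    ]},
})


def names_from_response(raw_json):
    """Extract the ?name values from a SPARQL JSON result document."""
    data = json.loads(raw_json)
    return [row["name"]["value"] for row in data["results"]["bindings"]]


def build_alias_index(qid, names):
    """Map each lower-cased name to the identifier of the thing it names."""
    return {name.lower(): qid for name in names}


alias_index = build_alias_index("Q4604", names_from_response(SAMPLE_RESPONSE))
print(alias_index.get("kong fuzi"))
```

Because every alias resolves to the same identifier, a search for any one form can retrieve records tagged with that identifier, whichever form appears in their text.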

For things with the property “country -> People’s Republic of China”, Wikidata has more than seven thousand alternative labels. For “country -> Japan”, there are more than ten thousand.

Semantically related topics and the Quora example

Pinterest and Quora are examples of sites in which content is organised by topic. To use topics for discovery, it helps if we can distinguish topics that are closely related, somewhat related and unrelated. Let’s try an experiment with these sixteen things:

Barack Obama
Congressional Black Caucus
Michelle Obama
University of Chicago
Time Person of the Year
Democratic Party
White House
Hillary Clinton
J. R. R. Tolkien
Aragorn
The Lord of the Rings
C. S. Lewis
University of Oxford
World War I
Battle of the Somme
British Army

Thanks to your knowledge of the world, you will see that there are two groups here, with sparse connections within the groups. With a quite simple query, we can ask Wikidata to form links based on properties connecting them. In a graph visualisation, the list organises into two clusters (though you might have to click and drag them apart to make that clear).

[Figure: The list separated into two clusters of related entities.] Some names appear twice because there are multiple things of that name, e.g. “Aragorn” is the name of a character and of a piece of music about the character.
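The clustering itself needs nothing more than the list of links. As a sketch, the Python below groups topics into connected components. The edge list is a hand-picked, illustrative subset of plausible links between the topics above, not the output of the Wikidata query, and `find_clusters` is a hypothetical helper name.

```python
from collections import defaultdict, deque

# Hand-picked illustrative links between some of the topics listed above.
# A real edge list would come from the Wikidata query.
EDGES = [
    ("Barack Obama", "Michelle Obama"),
    ("Barack Obama", "Democratic Party"),
    ("Hillary Clinton", "Democratic Party"),
    ("Barack Obama", "White House"),
    ("J. R. R. Tolkien", "The Lord of the Rings"),
    ("Aragorn", "The Lord of the Rings"),
    ("J. R. R. Tolkien", "University of Oxford"),
    ("C. S. Lewis", "University of Oxford"),
    ("J. R. R. Tolkien", "Battle of the Somme"),
    ("Battle of the Somme", "World War I"),
]


def find_clusters(edges):
    """Group topics into connected components with a breadth-first search."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen = set()
    clusters = []
    for start in sorted(neighbours):
        if start in seen:
            continue
        group = set()
        queue = deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            group.add(node)
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        clusters.append(group)
    return clusters


for cluster in find_clusters(EDGES):
    print(sorted(cluster))
```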

This graph makes use of direct connections between the entities. Shared connections to other topics (for instance that Tolkien and Lewis were both authors of fantasy literature; Obama and Clinton both presidential candidates) offer another way to cluster topics together.
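One possible way to score shared connections, sketched in Python, is the Jaccard overlap between the sets of property/value pairs attached to two topics. The pairs below are invented for illustration (real ones would come from Wikidata statements), and the choice of Jaccard similarity is my assumption, not something the graph above uses.

```python
# Hypothetical property/value pairs per topic, invented for illustration.
FEATURES = {
    "J. R. R. Tolkien": {
        ("genre", "fantasy"),
        ("occupation", "author"),
        ("occupation", "philologist"),
    },
    "C. S. Lewis": {
        ("genre", "fantasy"),
        ("occupation", "author"),
        ("occupation", "theologian"),
    },
    "Barack Obama": {
        ("occupation", "politician"),
        ("candidacy", "president"),
    },
    "Hillary Clinton": {
        ("occupation", "politician"),
        ("candidacy", "president"),
        ("occupation", "lawyer"),
    },
}


def jaccard(topic_a, topic_b):
    """Share of property/value pairs that two topics have in common."""
    left, right = FEATURES[topic_a], FEATURES[topic_b]
    union = left | right
    return len(left & right) / len(union) if union else 0.0


print(jaccard("J. R. R. Tolkien", "C. S. Lewis"))
print(jaccard("J. R. R. Tolkien", "Barack Obama"))
```

Topics with no direct link between them can still score highly this way, which is what makes shared connections a useful complement to the direct ones.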

The graph is not something that would be shown to users: it represents a network that could be used to offer users topics related to the ones they are browsing or have subscribed to. The connections could be weighted by how many other items have that same property. Since many notable things are connected in some way to World War I or to Oxford University, but fewer are connected to The Lord of the Rings, a sensible algorithm would prioritise The Lord of the Rings as a recommendation to people whose interests include Tolkien.
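A minimal sketch of that prioritisation, in Python, could use an inverse-frequency weight so that rarer connections score higher. The counts and the total below are invented for illustration and are not Wikidata statistics, the logarithmic weighting is only one possible choice, and `rarity_weight` and `recommend` are hypothetical names.

```python
import math

# Hypothetical counts of how many items link to each topic. The numbers are
# invented for illustration and are not Wikidata statistics.
LINK_COUNTS = {
    "World War I": 50_000,
    "University of Oxford": 20_000,
    "The Lord of the Rings": 500,
}
TOTAL_ITEMS = 1_000_000  # assumed size of the item collection


def rarity_weight(topic):
    """Inverse-frequency weight: the fewer items link to a topic, the higher."""
    return math.log(TOTAL_ITEMS / LINK_COUNTS[topic])


def recommend(connected_topics):
    """Order a user's connected topics from most to least distinctive."""
    return sorted(connected_topics, key=rarity_weight, reverse=True)


print(recommend(["World War I", "University of Oxford", "The Lord of the Rings"]))
```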

Quora.com is a site where people can ask questions on any topic, answer questions, and rate or comment on the answers. Categorising the millions of questions, and matching users to questions and answers that will interest them, is an enormous discovery problem. Users can create topics with a name and description, attach questions to them, and subscribe to topics of interest. This is how the site has built a large folksonomy of two and a half million topics.

By employing a Wikidata specialist and starting a bulk reconciliation of its topics, Quora benefits from translations, synonym identification, short descriptions, and semantic relations between topics. In this way it is turning a potentially chaotic list of strings into a structured network that reflects some of the human meaning of the topics.

—Martin Poulter, Wikimedian in Residence

This post licensed under a CC-BY-SA 4.0 license

P.S. A query to get items that are semantically related to a given topic (in this case Kung Fu-tse) is:

SELECT DISTINCT ?related ?relatedLabel ?relatedDescription WHERE {
  VALUES ?target { wd:Q4604 }
  { ?target ?prop ?related } UNION { ?related ?prop ?target }
  FILTER ( CONTAINS(STR(?related), '/entity/Q') ).
  SERVICE wikibase:label { bd:serviceParam wikibase:language 'en' }
}
ORDER BY UCASE(?relatedLabel)

At the time of writing, this gives 53 items.

Bodleian Digital Library

A Bodleian Libraries blog
