Report from Wikimania

Last month I had the privilege to attend the Wikimania conference in Montreal, Canada, where 900 people from around the world gathered for two days of hacking and building and then three days of conference sessions. The conference scope includes not just the Wikimedia projects but also the big themes of open education, open access, community building, and privacy and rights in the digital age. One blog post by one attendee is only going to capture a sliver of what went on, and here I am summarising some big projects of most relevance to university research projects and GLAMs.

This time round, Wikidata rather than Wikipedia was generating the most excitement. Wikidata, the free structured knowledge-base, is going through a period of explosive growth, helped in small part by data shared from partner institutions including Oxford University, and the conference brought together many people using Wikidata to document cultural heritage and current knowledge.

The author and hundreds of other Wikimedians. Photo by Victor Grigas of the Wikimedia Foundation, CC-BY-SA 4.0.

Wikidata’s growth over its four and a half years of existence, including its recent growth spurt. The actual number of items is somewhat lower because some identifiers are redirects.

All art

“Sum of all paintings” is the project to create Wikidata representations of all notable paintings, by importing catalogue data from many GLAM institutions. Already there are more than 200,000 paintings described in Wikidata, 100,000 of them having freely-reusable images in Wikimedia Commons. A past blog post gives examples of what we can do with these data.

However, there is no reason to focus only on paintings. An inspiring presentation by Maarten Dammers looked at other cultural works represented in Wikimedia projects. There are three-dimensional art, drawings, architecture, photographic collections, music from all eras and places, fashion, and furniture, all of which can be described in Wikipedia and Wikidata and depicted in images on Commons. Dammers is the creator of the Monuments database which tracks 2.5 million freely-reusable images of 1.5 million buildings, and which is used in global campaigns to crowd-source additional images. A separate session reported on work ingesting performing arts data from Swiss institutions into Wikidata, creating a custom ontology in the process.

The project pages on Wikidata give an overview of the many kinds of cultural data that Wikidata users are working to import.

Structured data on Commons

Frustratingly, Wikimedia is building two huge collections of data about art. While Wikidata holds metadata about works of art, there is also Wikimedia Commons, the archive of digital media which recently celebrated 40 million files, including 375,000 recently shared by New York’s Metropolitan Museum of Art. Metadata on Commons has become more sophisticated over the years, and images of an art work can have rich, multilingual records describing the work’s origin and physical location. Creators of art and host institutions can be identified with proper authority identifiers rather than text strings, which is helpful for internationalization.

Still, while there is some structure to the data, it is not in a database and the existing structure risks redundancy when the same objects are also being described in Wikidata. Images from the Bodleian’s Curzon Collection of Political Prints have been uploaded and tagged, so for instance they appear when you look for depictions of Robert Stewart, Viscount Castlereagh, but we are not yet able to query this to show all the Commons images of people who have held the title “Marquess of Londonderry”. There are impressive efforts to transfer some categories of data to Wikidata, but the current situation is a mix of metadata held on Commons in a structured form, imported to Commons from Wikidata or still in the form of unstructured text.

So it is good news that the Wikimedia Foundation is funding a dedicated project to transform Commons, combining its media repository functions with a Wikidata-style structured record for each file. Commons is so central to Wikimedia—almost any image, audio clip or video clip that illustrates a Wikipedia article is hosted there—that such a project has to proceed slowly and carefully: there is a lot that can break. The new hybrid platform will take some getting used to for existing users, but will make it much easier for GLAM institutions to share digital media and metadata from their collections.

All scholarly literature

Wikipedia’s authority is derived from the millions of reliable sources that it cites. These citations constitute the largest curated bibliography ever created, and it is helped by the recent integration of Zotero and OCLC bibliographic services into Wikipedia’s citation tool. WikiCite is the effort to make these citation data openly available and queryable through Wikidata, and was the subject of its own conference earlier this year. Across the wider publishing sector there is the Initiative for Open Citations (I4OC) which is releasing millions of citations as open data, not just those used in Wikipedia.

Progress on importing citations is dramatic. At the time of writing, around six and a half million scientific articles are represented in Wikidata, including all scholarly sources about the Zika virus. This enables citation analysis and tools such as the scholarly profile platform Scholia.

An interesting project I saw demonstrated was WikiFactMine, a creation of ContentMine at the University of Cambridge. They are importing open-access papers from Europe PubMed Central, and scanning them to identify text that can be converted into Wikidata facts.

A chemical such as paracetamol is known by various informal names, trade names, chemical formulae in various formats, or by identifiers in various databases. Then there are names in languages other than English. So, to find papers that mention paracetamol, one has to scan for many different possible text strings. To find statements about, for example, “drugs used as treatment in psychiatry” you need to scan for all possible names and identifiers of all the drugs in that category. WikiFactMine makes this possible via dictionaries: users can create and submit these, or use existing dictionaries from an already-extensive library. A specialist can thus get a feed of statements relevant to their field from recently-published peer-reviewed papers. Enable a script on Wikidata itself, and the text extracts appear at the click of a link, integrated with the Wikidata interface. Human judgement then comes in, as users decide whether the selected papers contain a fact that can be represented in Wikidata.
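To make the dictionary idea concrete, here is a minimal sketch in Python of scanning text for any of a concept's known names. It is not WikiFactMine's actual code or dictionary format: the `DICTIONARY` structure and the function names `build_matcher` and `find_mentions` are my own assumptions for illustration, and the list of names is deliberately short.

```python
import re

# Hypothetical dictionary: one concept mapped to a few of its known names.
# A real dictionary would hold many more names, formulae and identifiers.
DICTIONARY = {
    "paracetamol": ["paracetamol", "acetaminophen", "Tylenol", "Panadol", "APAP"],
}


def build_matcher(dictionary):
    """Compile one case-insensitive pattern covering every name in the dictionary."""
    term_to_concept = {}
    for concept, names in dictionary.items():
        for name in names:
            term_to_concept[name.lower()] = concept
    # Put longer names first so they are preferred over shorter alternatives.
    alternatives = sorted(term_to_concept, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(term) for term in alternatives) + r")\b",
        re.IGNORECASE,
    )
    return pattern, term_to_concept


def find_mentions(text, dictionary):
    """Return (matched text, concept) pairs in the order they appear."""
    pattern, term_to_concept = build_matcher(dictionary)
    return [
        (match.group(0), term_to_concept[match.group(0).lower()])
        for match in pattern.finditer(text)
    ]


mentions = find_mentions("Patients received acetaminophen (Tylenol) daily.", DICTIONARY)
print(mentions)
```

Both names in the sample sentence resolve to the same concept, which is the point: the reader of the feed sees statements about paracetamol however the paper happened to name it.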

Though open access scientific papers are the low-hanging fruit, other scholarly papers and other kinds of publication are also in scope for Wikidata. The site has settled on a simplified form of Functional Requirements for Bibliographic Records (FRBR) to represent books (and similar works like music recordings). As a result of my placement at Oxford, Wikidata can now hold English Short Title Catalogue identifiers (covering editions of English books prior to 1800). Bibliographic databases already exist, but Wikidata affords the ability to integrate with biographical, geographical and other data.

All words

Wiktionary is a hugely ambitious project to build a massively multilingual dictionary, not just describing words in their own language (e.g. English words described in English) but in every other language as well (Norwegian words described in Dutch; Arabic words described in French…). It has impressive numbers of entries despite its comparatively small editor community.

Wikidata project manager Lydia Pintscher demonstrated a Wiktionary prototype built on Wikidata. A word can have different forms (“hard”, “harder”, “hardest”), different senses (antonym of “easy”; antonym of “soft”) and different pronunciations (e.g. Scottish versus Australian). Each of these forms or senses of a word can have properties, and each of these properties can be represented as structured data, giving properties language-independent identifiers to which we then attach human labels such as “antonym of” or “comparative form of”.
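As a rough illustration of this structure, here is a small Python sketch of a word with forms and senses, where a property is stored under a language-independent identifier and only given a human label at display time. This is not the prototype's real data model: the class names, the placeholder identifier `PX1` and the `LABELS` table are all assumptions made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Form:
    representation: str                     # e.g. "harder"
    features: list = field(default_factory=list)


@dataclass
class Sense:
    gloss: str                              # short description of the meaning
    statements: dict = field(default_factory=dict)  # property id -> list of values


@dataclass
class Lexeme:
    lemma: str
    language: str
    forms: list = field(default_factory=list)
    senses: list = field(default_factory=list)


# "PX1" is a made-up placeholder, not a real Wikidata property identifier.
# The labels are attached per interface language, separately from the data.
LABELS = {"PX1": {"en": "antonym of", "fr": "antonyme de"}}

hard = Lexeme(
    lemma="hard",
    language="en",
    forms=[
        Form("hard", ["positive"]),
        Form("harder", ["comparative"]),
        Form("hardest", ["superlative"]),
    ],
    senses=[
        Sense("not easy", {"PX1": ["easy"]}),
        Sense("not soft", {"PX1": ["soft"]}),
    ],
)


def describe(lexeme, ui_language):
    """Render each sense's statements using labels in the reader's language."""
    lines = []
    for sense in lexeme.senses:
        for prop, values in sense.statements.items():
            for value in values:
                lines.append(f"{LABELS[prop][ui_language]}: {value}")
    return lines


print(describe(hard, "en"))
print(describe(hard, "fr"))
```

The stored data is identical in both calls; only the labels change with the reader's language.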

A structured-data platform for Wiktionary will mean easier multilingual operation, easier maintenance (because a property can be expressed independently of the user’s human language) and structured querying.

Putting it all together…

It is a decade since that influential (and much parodied) presentation in which Steve Jobs announced a new mobile telephone, a video iPod, and a wireless internet device. What really excited the crowd was that “these are not three separate devices”. People already had mobile phones, portable music players and mobile internet, but the iPhone was revolutionary. While Wikimedia and Apple are very different enterprises, Wikidata represents an analogous merging of functions. An art catalogue for the world’s GLAM institutions, a master gazetteer, an authority file of everything, a bibliographic database of all scholarly literature, and a lexicon of the world’s languages are all very ambitious projects that could transform their respective fields. The really exciting thing, however, is that these are not different databases.

Wikidata has less formal rigour than some databases constructed by professional teams, and progress in Wikidata involves some steps back among the steps forward, as the data and the community-driven standards evolve together. However, that is not necessarily a weakness: we saw this more than a decade ago when Wikipedia, rather than professionally-written encyclopaedias, came to dominate the web. The interesting professional databases are themselves never complete, but continually taking on new knowledge and fixing incorrect statements: Wikidata just makes this more explicitly part of the design.

As Wikipedia did, the projects described here will take years to reach their potential impact, but by going for sheer scale, variety and the sustainability that results from the open source/open content approach they will, before you know it, become an unavoidable part of the environment that universities and cultural institutions work in.

—Martin Poulter, Wikimedian in Residence

This post is licensed under a CC-BY-SA 4.0 license.

Bodleian Digital Library

A Bodleian Libraries blog