A step forward in the sharing of open data about theses

Title page of Marie Curie’s doctoral thesis; Yale University via Wikimedia Commons; Public Domain

Theses, particularly doctoral theses, are an important part of the scholarly record. Some are published and become influential books in their own right. As well as demonstrating the author’s ability to do original research, a thesis gives a snapshot of its author’s intellectual development at a formative time. This post reports on work sharing open data about thousands of theses, with links back to their full text in a repository.

The Oxford Research Archive (ORA) has 3237 Oxford doctoral theses on open access for anyone to download and read. Some of the authors have gone on to highly accomplished careers, such as the psychologist Professor Dorothy Bishop or the economist Sir John Vickers. During the confirmation hearings that eventually saw Neil Gorsuch appointed to the US Supreme Court, the interest in his background was such that TIME magazine wrote an article analysing his thesis and linking to ORA. This may well have been prompted by our linking the thesis from the top Google hit about Gorsuch: his Wikipedia biography.

Even for Oxford theses, notable authors are only a minority. The rest do not have an entry in Wikidata or in any authority file. The repository itself has a name for the author, but almost no biographical data about them. Wikidata has an “author name string” property. As the name suggests, it connects a publication to a name, with the possibility that the name might be replaced with a full biographical record later on. This leaves us the problem of how to connect the thesis to an institution, when there is no author record to infer the institution from. Wikidata lacked a property for this, so we proposed one.

Like all Wikidata properties, this was created after a public discussion that knocked a proposal into shape. These discussions often complete quickly, but the one for the thesis property was particularly long and involved, with several different suggested use cases. During the discussion it emerged that the bibliographic data format BibTeX has a property “school” to link theses to their institutions. We wanted something with the same function but without that possibly misleading name.

The result is Wikidata property P4101, known in English as “dissertation submitted to”. Now that the standard is set, anyone with bibliographic data about theses can share it openly, with links back to the full records and scanned files in official repositories. At the moment, Wikidata has more data about Oxford doctoral theses than all other theses put together, but more open data from more institutions will mean more useful queries and greater ease of finding the theses and linking them from Wikipedia.

The Oxford records (which you can retrieve via a Wikidata query) are quite rudimentary. They do not indicate the subject areas of the theses: this is an area that could be tackled in future work. Wikidata’s open platform with tens of millions of other records makes it possible to link theses to their subjects or to build citation networks. Wikipedians might be able to find or create Wikipedia articles for authors where we presently only have an author name. The set of all doctoral theses that Wikidata knows about can be obtained from http://tinyurl.com/yczlabov .
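As a rough illustration of the kind of query involved, here is a minimal Python sketch that builds a SPARQL query for works linked to an institution through P4101, plus a URL for the public Wikidata Query Service. It is not the query behind the links in this post. The function names are hypothetical, and the item ID used in the usage note (Q34433, assumed here to be the University of Oxford) should be checked on Wikidata before use.

```python
from urllib.parse import urlencode

# Public SPARQL endpoint of the Wikidata Query Service.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_thesis_query(institution_qid: str, limit: int = 100) -> str:
    """Build a SPARQL query for items 'submitted to' the given institution.

    P4101 is "dissertation submitted to"; P2093 is "author name string".
    The institution item ID is supplied by the caller (an assumption to verify).
    """
    if not (institution_qid.startswith("Q") and institution_qid[1:].isdigit()):
        raise ValueError(f"not a Wikidata item ID: {institution_qid!r}")
    return (
        "SELECT ?thesis ?thesisLabel ?authorName WHERE {\n"
        f"  ?thesis wdt:P4101 wd:{institution_qid} .\n"
        # Author names are optional so that theses without one still appear.
        "  OPTIONAL { ?thesis wdt:P2093 ?authorName . }\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }\n'
        "}\n"
        f"LIMIT {int(limit)}"
    )


def build_query_url(query: str) -> str:
    """Return a GET URL that asks the query service for JSON results."""
    return WDQS_ENDPOINT + "?" + urlencode({"query": query, "format": "json"})
```

Calling build_thesis_query("Q34433", limit=10) gives a query string that can be pasted into the query service, or passed to build_query_url to get a link. The sketch only builds strings; it makes no network request.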

Another change we got made will specifically benefit Wikipedia articles about philosophers. Many types of articles have an infobox at the top to summarise key facts. For most academics, this infobox can cite and link the individual’s thesis and name their doctoral advisor and doctoral students. However, philosophers were an exception. A kind Wikipedian has fixed this at our request and you can see an example of these new features in Brad Hooker’s Wikipedia biography.

To learn about data sharing on Wikidata, read the data donation guide or talk to me at martin.poulter@bodleian.ox.ac.uk. See Wikiproject Source Metadata for details of the bibliographic properties that can be shared on Wikidata about books, papers, conference proceedings and other scholarly publications. There is a lot going on to make this bibliographic data openly available, under the heading of the Initiative for Open Citations (I4OC).

—Martin Poulter, Wikimedian in Residence

Post Scriptum: more detail on data modelling

I’ve heard that another university intends to share bulk data about theses along the same lines as what we’ve done at Oxford, so I’ll reproduce the advice I’ve given about modelling theses.

  • Wikidata has a property for the language of a written work. I didn’t automatically add English as the language to all the Oxford theses in case there were exceptions, but it is worth considering if doing a bulk upload and the source database records the language of the thesis.
  • Wikidata does not have many thesis types, just a choice of doctoral thesis, master’s thesis, or bachelor’s thesis. Specific degrees such as MA, MSc, PhD, D.Phil. are attached to the person, not the thesis.
  • Wikidata can represent number of pages, so again, if sharing data from theses that exist in print, and the source database has this information, consider including it.
  • To describe the subject matter of the thesis, there is main subject. I thought of generating this from the university department associated with the thesis, but decided this is too blunt a classification. Deciding the primary topic of a thesis, rather than the field of study, will often need specialist knowledge.
  • Wikidata can point to the full text of a work, but it does not yet seem to distinguish transcribed text or born-digital text from scanned pages or Optical Character Recognition output.
  • Abstracts cannot be included: Wikidata is not designed for chunks of text. This is less of a problem for theses than it is for other works as their titles are almost always long and descriptive. A three-word thesis title is a blessedly rare thing.
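
The advice above can be sketched as a mapping from one source-database record to Wikidata statements. This is only an illustration under stated assumptions, not the process used for the Oxford upload. The ThesisRecord class and to_statements function are hypothetical, and the property and item IDs (P31 instance of, P1476 title, P2093 author name string, P953 full work available at URL, P407 language of work, P1104 number of pages, Q187685 doctoral thesis) are given to the best of my knowledge and should be checked on Wikidata before any bulk upload.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Assumed item ID for "doctoral thesis"; verify on Wikidata before use.
DOCTORAL_THESIS_QID = "Q187685"


@dataclass
class ThesisRecord:
    """Hypothetical shape of one row in a repository's thesis database."""
    title: str
    author_name: str
    institution_qid: str          # item ID of the awarding institution
    repository_url: str           # link back to the full text in the repository
    language_qid: Optional[str] = None   # only if the source records it
    pages: Optional[int] = None          # only if the source records it


def to_statements(record: ThesisRecord) -> List[Tuple[str, str]]:
    """Turn a record into (property ID, value) pairs, following the bullets above."""
    statements = [
        ("P31", DOCTORAL_THESIS_QID),       # thesis type, not the specific degree
        ("P1476", record.title),            # title
        ("P2093", record.author_name),      # author name string
        ("P4101", record.institution_qid),  # dissertation submitted to
        ("P953", record.repository_url),    # full work available at URL
    ]
    # Language and page count are added only when known, never guessed.
    if record.language_qid is not None:
        statements.append(("P407", record.language_qid))
    if record.pages is not None:
        statements.append(("P1104", str(record.pages)))
    # Main subject and abstracts are deliberately left out: the first needs
    # specialist judgement, and Wikidata is not designed to hold the second.
    return statements
```

A record with no language or page count simply yields the five core statements, which mirrors the choice not to assume English for every thesis.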

This post licensed under a CC-BY-SA 4.0 license
