Semantic data and the stories we’re not telling

One of my earliest memories of television was James Burke’s series Connections. It was fascinating yet accessible: each episode explored technology, history, science and society, jumping across topics based on historical connections or charming coincidences. One episode started with the stone fireplace and ended with Concorde.

In a digital utopia, we would each be our own James Burke, creating and sharing intellectual journeys by following the connections that interest us. We are not there yet. Many very valuable databases exist online, but the connections between them are obscured rather than celebrated, and this is an obstacle for anyone using those data in education or research. In a previous post I described the problems that come from the fact that things have different names in different databases, and described a semantic web approach to link them together.

Building on this approach, web applications can help people create their own stories; choosing their own path through sources of reliable information, building unexpected connections. In this post I describe three design principles behind these applications. Let’s start with a story.


Report from Wikimania

Last month I had the privilege to attend the Wikimania conference in Montreal, Canada, where 900 people from around the world gathered for two days of hacking and building and then three days of conference sessions. The conference scope includes not just the Wikimedia projects but also the big themes of open education, open access, community building, and privacy and rights in the digital age. One blog post by one attendee is only going to capture a sliver of what went on, and here I am summarising some big projects of most relevance to university research projects and GLAMs.

This time round, Wikidata rather than Wikipedia was generating the most excitement. Wikidata, the free structured knowledge-base, is going through a period of explosive growth, helped in small part by data shared from partner institutions including Oxford University, and the conference brought together many people using Wikidata to document cultural heritage and current knowledge.

The author and hundreds of other Wikimedians. Photo by Victor Grigas of the Wikimedia Foundation, CC-BY-SA 4.0


A step forward in the sharing of open data about theses

Title page of Marie Curie’s doctoral thesis; Yale University via Wikimedia Commons; Public Domain

Theses, particularly doctoral theses, are an important part of the scholarly record. Some are published and become influential books in their own right. As well as demonstrating the author’s ability to do original research, a thesis gives a snapshot of its author’s intellectual development at a formative time. This post reports on work sharing open data about thousands of theses, with links back to their full text in a repository.

The Oxford Research Archive (ORA) has 3237 Oxford doctoral theses on open access for anyone to download and read. Some of the authors have gone on to highly accomplished careers, such as the psychologist Professor Dorothy Bishop or the economist Sir John Vickers. During the confirmation hearings that eventually saw Neil Gorsuch appointed to the US Supreme Court, the interest in his background was such that TIME magazine wrote an article analysing his thesis and linking to ORA. This may well have been prompted by our linking the thesis from the top Google hit about Gorsuch: his Wikipedia biography.

Publicising a historic event in Wikipedia

The front page of English Wikipedia gets around five million hits per day. Highlighted sections of the page, such as “Did you know” and “In the news”, trumpet the site’s purpose: sharing knowledge for its own sake. One of these sections, “On this day…”, features five different facts each day, with links to relevant articles. These facts in turn are chosen from a large collection of roughly 100 historic events for each date. Many other language versions of Wikipedia have a similar “This day in history” section, though with different sets of facts.

As with everything else on Wikipedia, this collection of historic facts is offered freely for anyone to use for any purpose. “On this day in history” facts are ideal for sharing on social media, for example by Wikipedia’s official presence on Twitter.

Napoléon Bonaparte, listed in Wikipedia’s May 26 article for his coronation as King of Italy on 26 May 1805. Image from the Curzon Collection of political prints, CC-BY the Bodleian Libraries.

To avoid repetition from year to year, it helps to be able to draw on a large pool of historic events, so each day can showcase a variety of types of event, of locations and of eras. There is a relative shortage of events before 1800, so additions are welcome.

Being featured on the front page generates a lot of interest in the article.

  • The Alhambra Decree article typically gets about 300 views per day. When linked from the front page as a recent “On this day” item, it had nearly 10,000.
  • The Treaty of Fontainebleau (1814) article gets 70 to 80 views on a typical day, but had 5,400 when linked from the home page on its anniversary.
  • The article about Suvarnadurg, an Indian fort, usually gets around 30 views a day, but had 8,500 when the fort’s 1755 capture by the East India Company was listed on April 2.
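Figures like these can be checked against the public Wikimedia pageviews API. The following is a minimal sketch, assuming the per-article REST endpoint and only Python's standard library. The function names, the User-Agent string and any dates passed in are illustrative placeholders, not part of the original analysis.

```python
import json
import urllib.parse
import urllib.request

# Per-article endpoint of the Wikimedia pageviews REST API.
PAGEVIEWS_BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def pageviews_url(article, start, end, project="en.wikipedia"):
    """Build the URL for daily human ("user") views of one article.

    start and end are dates in YYYYMMDD form.
    """
    # Wikipedia titles use underscores; percent-encode everything else.
    title = urllib.parse.quote(article.replace(" ", "_"), safe="")
    return f"{PAGEVIEWS_BASE}/{project}/all-access/user/{title}/daily/{start}/{end}"


def daily_views(article, start, end):
    """Fetch daily views as a {timestamp: views} dict (needs network access)."""
    request = urllib.request.Request(
        pageviews_url(article, start, end),
        # Placeholder User-Agent; replace with your own contact details.
        headers={"User-Agent": "on-this-day-example/0.1 (illustrative script)"},
    )
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    return {item["timestamp"]: item["views"] for item in data["items"]}


def views_multiplier(typical_views, featured_views):
    """How many times the usual traffic an article got when featured."""
    return featured_views / typical_views
```

Using the numbers quoted above, `views_multiplier(75, 5400)` gives 72 for the Treaty of Fontainebleau article (taking 75 as the midpoint of 70 to 80), and `views_multiplier(30, 8500)` gives roughly 283 for Suvarnadurg.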

By considering one example, we can look at how a historic event is made visible in Wikipedia.

March 31: 1492 – The Catholic Monarchs of Spain issued the Alhambra Decree, ordering all Jews to convert to Christianity or be expelled from the country.

The typical form is a single sentence, in past tense, linking multiple different Wikipedia articles, with a bold link to the one most closely connected to the fact. Not every historical event qualifies:

  • The event must have happened on a single day, so not a crisis or war, but a precipitating or concluding event such as the signing of a treaty.
  • Births and deaths have their own process for appearing on the front page, so do not qualify for this collection of facts.
  • It must be an event with notable repercussions: one notable figure marrying another, or writing a letter to another, is not always significant in itself, but can be significant by initiating other events.
  • There must be no controversy about the day on which it happened. Reputable sources should agree.
  • The fact must be backed up by at least one reliable source, which must be cited in the article. As with all Wikipedia references, paywalled sources are fine but open-access sources have an advantage because they can be checked by Wikipedians outside subscribing institutions. With software developments over the last couple of years, adding citations has become extremely easy: the Cite tool expands DOIs into full citations and normally succeeds in transforming web links into full citations.

If you have a cited fact that meets the above criteria, it can have multiple mentions in Wikipedia:

  • The fact must be stated in the “home” article, in this case Alhambra Decree.
  • It can also go in the articles about the calendar date and the year. There are English Wikipedia articles about the year 1492 and about the date March 31. Unlike most Wikipedia articles, these are essentially lists of facts under different headings.
  • It can also appear in the biographies of the people, organisations or nations involved (in this case, Isabella of Castile). Some topics have timeline articles which are essentially lists of dates, such as Timeline of Spanish history.

The articles about individual dates, such as March 30, also have lists of births and deaths. In the long term, these will probably be driven by Wikidata, which is ideal for this kind of data. These lists have the same relative paucity of dates before 1800, and the same requirement that dates should be sourced and uncontroversial.

Facts for a particular day are chosen well in advance by an administrator, working behind the scenes in an area called the Selected anniversaries project. It is accepted, even encouraged, for other users to proactively edit in their own suggestions if they know wiki-code. The listing is decided two to four days in advance, so include your suggestion further in advance than that.

The guidelines give preference to events with a significant anniversary (meaning a multiple of 25, e.g. a 325th anniversary), events that differ from the others on the list (in era or geography), and articles that have not been on the front page before. “On this day” articles do not have to be comprehensive, but should be good examples of Wikipedia articles with citations in all sections. Each day’s “staging area” has a list of events that were submitted but did not qualify. Usually the article is rejected for having insufficient citations, so by improving the articles with links to scholarly sources, we can help those links reach the front page.
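The "multiple of 25" rule is easy to check before suggesting an event. This is only a small sketch of the arithmetic, with hypothetical function names; it assumes ordinary CE years and ignores the other guideline criteria.

```python
def anniversary_years(event_year, current_year):
    """Number of years between the event and the year it would be featured."""
    return current_year - event_year


def is_significant_anniversary(event_year, current_year, step=25):
    """True when the anniversary is a positive multiple of `step` years."""
    years = anniversary_years(event_year, current_year)
    return years > 0 and years % step == 0
```

For example, the Alhambra Decree of 1492 would have a 525th anniversary in 2017, which counts as significant under this rule, while Napoléon's 1805 coronation would be at 212 years and would not.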

So there is an opportunity here for heritage organisations and historians to extend awareness of the turning points of history, and the use of biographical papers or databases. We just need to succinctly describe the key events and share citations about them.

—Martin Poulter, Wikimedian in Residence

This post is licensed under a CC-BY-SA 4.0 license

Wikimedia for public engagement

By Dr Martin Poulter, Wikimedian in Residence at the University of Oxford

A takin is a Himalayan goat-antelope with whom I feel a personal connection, and the reason goes back to an event I attended in 2011. The wildlife charity ARKive had allowed some of its descriptions of threatened species to be copied into Wikipedia. After presentations introducing both ARKive and Wikipedia, we split up the room. One table took birds, another took lizards, and I must have been among the mammals. After carefully reading what ARKive and Wikipedia said about the takin, I found a couple of sentences that could be copied from one to the other. Everyone in the room made a small but concrete improvement to their target Wikipedia article.

The trainer at the event, Andy Mabbett, thanked me afterwards with a message through Wikipedia. Making that change, and being recognised for it, connected me to the topic in a way that a film or lecture could not. Somehow the takin had become my endangered mammal. People had turned up with a general curiosity about threatened species, engaged with the question of how to describe a specific species and had a positive experience with a peer-reviewed source – the ARKive site.

How do we create similar events where people are not just informed about a topic or a resource, but engage with it in a way that makes a lasting impression? Here are some suggested requirements for a public engagement event:

  • a collaborative task around a topic;
  • that requires thinking and reading, but not expertise, so anyone can take part;
  • that can be broken down into small chunks, identified beforehand;
  • that can be done in-person or remotely;
  • with a way to track individual contributions. We want to thank and reward contributors, and it’s also useful to assess the quality of their work. For a big, long-term project we might want something like a leader-board or a participation award.

Wikimedia platforms

Wikipedia and its sister projects are ideal platforms for meeting the above criteria. They

  • cover all academic subjects;
  • support collaboration between experts and non-experts;
  • have various tools to generate lists of “targets”: things that need improvement;
  • can be accessed by anyone with an internet connection;
  • have contributor records which publicly show what changes each user has made, even allowing ‘thank you’ messages for individual changes.

Perhaps most relevant is that Wikimedia resources do not exist in isolation but are derived from something else. A fact in Wikipedia or Wikidata needs to be backed up with a citation of a reliable source. A photo in Wikimedia Commons needs a description of where it was taken, or a citation of the collection it is drawn from. A transcribed text in Wikisource needs a pointer to the page scans that were transcribed, and ultimately to the physical copy of the book. So a Wikimedia event is always necessarily about a Wikimedia project and something else: a scholarly site or database, or physical exhibits, books or artworks.

Four Wikimedia projects hold distinct types of information about the same subject.

The best-established type of public event is a Wikipedia editathon, in which visitors are invited to write Wikipedia articles. Newcomer participants in editathons usually achieve little, because a lot of time and thought is needed to get to grips with Wikipedia’s interface, with Wikipedia’s culture and norms, and with the sources they will be using. Editathons can be very productive if participants are confident wiki editors, but that confidence does not come immediately. Fortunately, the Wikimedia projects offer simpler, less demanding ways for the public to engage with a subject.

An example: a Wikisource transcribe-a-thon

Looking to create events for the Ada Lovelace Bicentennial in 2015, I read about Mary Somerville, a 19th century mathematician and scientist who tutored Lovelace and for whom Somerville College is named. I could find none of Somerville’s works in electronic form, but some were available as scanned documents in the Internet Archive. This suggested how we could engage an audience interested in women scientists.

Wikisource is a platform for sharing and connecting out-of-copyright or freely-licensed text. Wikipedia’s article about Lord Byron summarizes his life, with brief mentions or quotes from his work. Wikisource, on the other hand, has a brief description of who he was and the full text of many of his poems and other works. Naturally, the two profiles link to each other. Most works on Wikisource come from scanned books which have been put through Optical Character Recognition (OCR) and then manually fixed. Each page has to be checked and approved by at least two different users before it is considered “validated” and ready for public readers.

Attendees at the event were given a shortened URL for the transcription and each got a post-it note with the page number that they should fix. They adapted to Wikisource at different rates, but that worked out fine because the quicker, more confident people checked and tweaked the work of those who made slower progress. In two hours, we got through one paper and a large proportion of a book by Somerville. After some further checking, these texts were linked from the front page of Wikisource. Feedback on the event was very good: participants recognised they were doing something important; not just learning about Somerville but helping to republish her work. The “transcribe-a-thon” format has been repeated as a conference session.

A transcription project on Wikisource: pages are yellow when they’ve been approved by one user, and green for two users.

A transcribe-a-thon needs some careful preparation in choosing the text, importing it into Wikisource and preparing it for transcription. The import process on Wikisource is documented, but not very intuitive, even for experienced wiki editors. Not all scans are suitable: if the images are poor quality, the OCR does not produce usable text, and if there is non-standard text such as mathematical formalism, transcription will be too difficult for newcomers. A little more work is necessary once all the pages are validated, to assemble them into a single work.

Photographs and WikiShootMe

Wikipedia articles about a place or building usually have a geographical point, defined by a latitude/longitude pair, attached to them in a machine-readable way. For example, the article about Oxford has coordinates that correspond to the central junction at Carfax. Wikidata, another sister project, has many more entries with locations, for items such as listed buildings. On some historic streets, almost every building has a Wikidata entry.

WikiShootMe is an online mapping tool that shows these articles and Wikidata items, colour-coded according to whether they have an image. It also allows users to upload images, but they need to register an account first on Wikimedia Commons. The images do not have to be professional quality, and photos taken with a smartphone or cheap digital camera are often suitable. As more images are uploaded, red dots disappear from the map.
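To get a feel for the data behind such a map, a similar list can be requested from the Wikidata Query Service. The sketch below is an illustration, not a description of how WikiShootMe itself works. It assumes the public SPARQL endpoint, the Wikidata properties P625 (coordinate location) and P18 (image), and approximate coordinates for Carfax; the function names are hypothetical.

```python
import urllib.parse

# Public SPARQL endpoint of the Wikidata Query Service.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# Items with a coordinate (P625) within a radius of a point, and no image (P18).
# Note that WKT points are written as longitude first, then latitude.
QUERY_TEMPLATE = """
SELECT ?item ?itemLabel ?location WHERE {
  SERVICE wikibase:around {
    ?item wdt:P625 ?location .
    bd:serviceParam wikibase:center "Point(%(lon)s %(lat)s)"^^geo:wktLiteral .
    bd:serviceParam wikibase:radius "%(radius_km)s" .
  }
  FILTER NOT EXISTS { ?item wdt:P18 [] }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
"""


def items_without_image_query(lat, lon, radius_km=0.5):
    """Return SPARQL text for nearby Wikidata items that lack an image."""
    values = {"lat": lat, "lon": lon, "radius_km": radius_km}
    return (QUERY_TEMPLATE % values).strip()


def query_url(query):
    """Build a GET URL that asks the endpoint for JSON results."""
    params = {"query": query, "format": "json"}
    return WDQS_ENDPOINT + "?" + urllib.parse.urlencode(params)
```

Calling `items_without_image_query(51.752, -1.258)` produces a query centred roughly on Carfax (the coordinates are an approximation for illustration), which can be pasted into the query service's web interface or fetched with `query_url`.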

A WikiShootMe scan around the North end of St. Giles, Oxford

So for an event or campaign that gets people engaged with local history, public art, or architecture, the group can decide on places to photograph and describe, then go to the location, and either upload their images from home or return to a central computer room and transfer images from their devices.

A tip to monitor contributions: when uploading an image, users are prompted for image categories. If they all add the same category, then it is possible to track images uploaded with that tag using PetScan (explained below). However, categories are case-sensitive so you have to make sure people type the category tag exactly as instructed. Commons helps by colouring the text red if it does not correspond to an existing category.

Instructions to attendees:

  • Create an account on commons.wikimedia.org
    • A Wikipedia account will work if you already have one of those.
  • Open WikiShootMe in another tab
  • Click on the red dot on the map where your photo was taken
  • Press the button to ‘Authorise uploading’
  • Click ‘Allow’. This will permit WikiShootMe to accept your photo, and return you to the map.
  • Navigate through the map to the red dot again. This time when you click the dot, the button says ‘Upload image’
  • Select the image on the computer or device
  • Give the image a title, description (say what you’ve photographed, e.g. the address of the building) and date.
  • In the Category box, type Buildings in Oxford with that exact capitalization.

PetScan is a tool for customised queries. If you are running an event or campaign where people create or upload images to a given category, use this procedure to get an overview of their contributions.

  • Go to https://petscan.wmflabs.org/
  • Click ‘Commons’
  • Enter the category “Buildings in Oxford”
  • Select the Page properties tab and click the checkbox next to File.
  • Select the Output tab, then choose Sort by date, Sort order descending.
  • Click ‘Do it!’ and on the resulting page, bookmark the link ‘for the query you just ran’.

This gives you a list of images in the category, most recent additions first. Clicking on an entry in the list will take you to the full description of that file, including the user profile of the uploader.
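A comparable list can also be requested in a script. The sketch below uses the standard MediaWiki API on Commons (the `categorymembers` list) rather than PetScan itself, so treat it as an alternative route under that assumption. The function names and the User-Agent string are placeholders.

```python
import json
import urllib.parse
import urllib.request

# Standard MediaWiki API endpoint for Wikimedia Commons.
COMMONS_API = "https://commons.wikimedia.org/w/api.php"


def category_files_url(category, limit=50):
    """Build a query for files in a Commons category, newest additions first."""
    params = {
        "action": "query",
        "list": "categorymembers",
        # Category names are case-sensitive, so pass the exact event category.
        "cmtitle": "Category:" + category,
        "cmtype": "file",
        "cmsort": "timestamp",  # time the file was added to the category
        "cmdir": "desc",        # most recent first
        "cmprop": "title|timestamp",
        "cmlimit": limit,
        "format": "json",
    }
    return COMMONS_API + "?" + urllib.parse.urlencode(params)


def recent_category_files(category, limit=50):
    """Fetch (title, timestamp) pairs for the category (needs network access)."""
    request = urllib.request.Request(
        category_files_url(category, limit),
        # Placeholder User-Agent; replace with your own contact details.
        headers={"User-Agent": "event-tracking-example/0.1 (illustrative script)"},
    )
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    members = data["query"]["categorymembers"]
    return [(m["title"], m["timestamp"]) for m in members]
```

For the event described above, `recent_category_files("Buildings in Oxford")` would return the latest files added to that category. This listing does not include the uploader, so the file description pages are still the place to find that.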

Other quick ideas

  • Use a biographical source to add individual facts, such as universities attended or birthplaces, to Wikidata entries or Wikipedia articles.
  • Examine a free image source (e.g. Europeana’s World War I collection) and find Wikipedia articles that the images can illustrate.
  • Search through audio archives for short clips that can be uploaded to Commons and embedded in Wikipedia articles about a person or event.

This post is licensed under a CC-BY-SA 4.0 license

Value, metrics and action in publishing data

The funding community and other proponents of Open Science and Open Data have been trying to persuade the mainstream research community to publish their data for some time with only partial success [1].

A key problem is that, although the arguments for doing so are logical – research becomes more reproducible, data can be cited and re-used, opportunities for cross-domain cooperation are increased, and so forth – concrete underlying evidence has until recently been in quite short supply, with a resulting lack of engagement from the wider research community.

It’s been possible to argue for a while that linking an open dataset to a primary publication is correlated with increased citation rates (of up to 30%) [2]. But this still doesn’t draw attention to the dataset itself. Researchers are busy and need to optimise their behaviour towards activities that will drive their research field, departments, institutions and personal career progress, and to date the proactive management, deposition and publication of their data has often simply not been a logical priority.

With Giving Researchers Credit for their Data we’re hoping to lower the barrier to action by automating and simplifying the process of submitting data papers to journals. The carrot of having a publishable, citable product at the end of the process is also part of the value proposition. And the proposition itself has been strengthened in recent weeks by the news of the data journal Earth System Science Data’s high citation rates. ESSD has been assigned an Impact Factor over 8, leapfrogging its primary research competitor titles to achieve a ranking of 2nd in Meteorology & Atmospheric Sciences, and 3rd in Geosciences, Multidisciplinary.

Whilst it can rightly be argued that the Impact Factor is a blunt instrument at best with which to measure the value of individual articles, this announcement does imply that researchers use and credit data papers in their work at levels comparable to, or exceeding, many traditional research articles (at least in Geosciences). Perhaps this development will lead to ‘write my data paper’ making its way on to the standard academic To Do list.

And that is certainly worth celebrating!

-Neil Jefferies (PI for Giving Researchers Credit for Their Data)

 

Giving Researchers Credit for their Data, funded as part of the Jisc Data Spring Initiative, aims to provide a button that can be added to a DataCite compliant data repository which largely automates the process of data paper submission for an authenticated researcher. The project uses a cloud-based app and SWORD2-based APIs to link with multiple repositories and publishers, taking advantage of existing DataCite and ORCID metadata so that a paper can be automatically inserted into a publisher’s submission system without requiring any data re-entry by the author.

References:
[1] Aleixandre-Benavent, R et al. Scientometrics (2016) 107: 1. doi:10.1007/s11192-016-1868-7
[2] Piwowar HA, Vision TJ. (2013) Data reuse and the open data citation advantage. PeerJ 1:e175 https://doi.org/10.7717/peerj.175. This analysis specifically concentrated on micro-array data.

Act on Acceptance reaches 1,250 deposits

The Act on Acceptance programme was launched on 1 October 2015. This programme is a mechanism allowing Oxford researchers to comply with the new mandated Open Access requirements set out by HEFCE for the next REF.

These requirements, which ask for a copy of the accepted manuscript to be deposited into an Institutional Repository within three months of acceptance, come into force on 1 April 2016. Launching on 1 October will allow all researchers to become familiar with the process before the requirements come into effect.
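As a rough illustration of the three-month window, the sketch below computes a latest deposit date from an acceptance date. It assumes "three months" means the same day of the month three calendar months later, clamped to the end of shorter months. That reading is an assumption for illustration, the function names are hypothetical, and the policy wording itself takes precedence.

```python
import calendar
from datetime import date


def add_months(start, months):
    """Return the date `months` calendar months after `start`.

    If the target month is shorter, the day is clamped to its last day.
    """
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


def deposit_deadline(accepted_on, window_months=3):
    """Latest deposit date under the assumed reading of the requirement."""
    return add_months(accepted_on, window_months)
```

Under this reading, a manuscript accepted on 1 April 2016 would need to be deposited by 1 July 2016.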

The programme includes the Open Access team, Research Services, Symplectic Elements and ORA. To date we have had over 1,250 deposits, and the rate of deposits continues to climb.

ORA currently receives at least 30 deposits a day via Symplectic Elements. Follow our progress on Twitter with the “#actonacceptance” hashtag.

Charts: divisional totals; divisional totals by type; running totals.

—Sarah Barkla

ORA-Data: managing research data

We’re pleased to announce that depositing in ORA-Data will now allow researchers and the University to comply with the EPSRC’s (Engineering and Physical Sciences Research Council) policy about research data management, which comes into effect from 1 May 2015.

The Bodleian Libraries recently launched a new service for the University, the Oxford Research Archive for Data (ORA-Data). A digital repository and catalogue for research data, ORA-Data offers a service to archive and enable the discovery, citation and sharing of data produced by researchers at Oxford.

ORA-Data is aimed especially at researchers who wish:

  • to deposit data that underpins publications
  • to deposit data that their funding body requires be preserved and made accessible
  • to add a record to the University’s catalogue of data

Any type of digital research data, from across all academic disciplines, may be deposited in ORA-Data, and we accept any file format. A permanent catalogue record is created for all data deposited in ORA-Data and a persistent unique identifier generated (a DOI, or Digital Object Identifier), which allows the dataset to be cited. Data and records will be discoverable through Google and other search engines, maximising the visibility and impact of the research. Researchers can also choose to set an embargo period for their files if they wish.
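One practical benefit of a DOI is that a formatted citation can be requested from it directly. The sketch below assumes the DOI content negotiation service offered by the main registration agencies, where an `Accept` header of `text/x-bibliography` asks for a ready-made citation. The function names are hypothetical, and whether a particular DOI supports this depends on its registration agency.

```python
import urllib.parse
import urllib.request


def citation_request(doi, style="apa"):
    """Build a request asking the DOI resolver for a formatted citation."""
    url = "https://doi.org/" + urllib.parse.quote(doi, safe="/")
    # Content negotiation: ask for a bibliography entry rather than a web page.
    headers = {"Accept": f"text/x-bibliography; style={style}"}
    return urllib.request.Request(url, headers=headers)


def fetch_citation(doi, style="apa"):
    """Return the citation text for a DOI (needs network access)."""
    with urllib.request.urlopen(citation_request(doi, style)) as response:
        return response.read().decode("utf-8").strip()
```

For instance, `fetch_citation("10.7717/peerj.175")` requests a citation for the Piwowar and Vision paper cited in an earlier post; a dataset DOI would be used in the same way.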

ORA-Data is running as a free pilot until July 2015. We’re keen for users to try it out, and would welcome any feedback to help us improve the service. ORA-Data can be accessed via the main ORA website: just select ‘Contribute’ followed by the ‘Data’ link. Our ‘How to deposit’ guide is available via our LibGuide and the ORA-Data team can be contacted at: ora@bodleian.ox.ac.uk – we would love to hear from anyone interested to find out more.

—Amanda Flynn, Digital Scholarship Support Officer

ORA: more than you might think

ORA doesn’t just take work published in academic journals! We also include working papers, reports, Oxford doctoral theses and metadata records.

Our recent work with the Transport Studies Unit (TSU) is online.

We’ve been working with Computer Science to add another 5,000 records, including 1,300 full-text papers.

The working papers from the Oxford Institute for Energy Studies (OIES) as well as their publication the Oxford Energy Forum are in the final stages of being added.

Our Polonsky Foundation-funded theses digitization project is winding up and ORA already holds these records.

—Sarah Barkla