New Corpora available from BYU

Two new corpora are now available via the Brigham Young University collection.

The TV Corpus and The Movie Corpus together contain over 525 million words of data and are a vital resource for studying informal language.

Here’s more information from BYU:

The TV Corpus contains 325 million words of data in 75,000 TV episodes from the 1950s to the current time.

The Movie Corpus contains 200 million words of data in more than 25,000 movies from the 1930s to the current time.

All of the 75,000 TV episodes and 25,000+ movies are linked to their IMDb entries, which means that you can create Virtual Corpora using extensive metadata: year, country, series, rating, genre, plot summary, etc.

Both corpora allow you to look at variation over time (1950s-1970s to 1990s-2010s) and variation between dialects (e.g. American and British English). In this sense, they are related to many other corpora of English that we have created, which offer unparalleled insight into variation in English.

You can find the corpora via the BYU collection. To use them, you will need to register using your university email account.
