Lecture 12: Corpora


Slides : Recording

In literature, a corpus (plural corpora) is a collection of written texts, in particular the complete works of a single author or a body of writing on a single subject. In computational linguistics and in theoretical linguistics, a corpus is a body of written or spoken text used for the study of a particular language or language variety. These corpora may be very large (billions of words) and provide the raw material for experimental investigation of real-world language use: the science of empirical linguistics.

In this course we look at the second kind of corpus, as a substantial example of semistructured data used for research. Today’s lecture briefly covered the aims and requirements of corpora, indicating how they are used and the considerations that go into building them, such as balancing and sampling, tokenization and annotation. I talked about some standard large corpora and gave examples of research results from corpus analysis.
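To make the tokenization step concrete, here is a minimal sketch in Python. It is an illustration only, not the method used by any particular corpus: the regular expression, the `tokenize` function and the sample sentence are all assumptions made for this example. It splits text into word and punctuation tokens, then counts word tokens (every occurrence) against word types (distinct words).

```python
import re
from collections import Counter

# Hypothetical token pattern (an assumption for this sketch): a run of word
# characters, optionally joined by apostrophes or hyphens, or else a single
# punctuation character. Real corpora use far more careful rules.
TOKEN_RE = re.compile(r"\w+(?:['’-]\w+)*|[^\w\s]")

def tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)

# Invented sample sentence, not taken from any corpus.
sample = "Corpora aren't small: the data is big, and the data is messy."

tokens = tokenize(sample)

# Keep only tokens that start with a letter or digit, folded to lower case.
words = [t.lower() for t in tokens if t[0].isalnum()]
counts = Counter(words)

print(tokens)
print(len(words), "word tokens,", len(counts), "word types")
print(counts.most_common(3))
```

Even this small example involves design decisions, such as whether “aren't” is one token or two, which is one reason tokenization needs to be settled when a corpus is built.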


1. Read These
2. Do This

Explore the Corpus of Contemporary American English.

  1. Go to http://corpus.byu.edu/coca/.
  2. Read the text box on the right-hand side, beginning “The Corpus of Contemporary American English (COCA) is the largest freely-available corpus of English”. Then follow the link “large and balanced” and read about how COCA is built up.
  3. Click on “SEARCH” at the top left, select “Chart” below that, type “data” in the text box, click “See frequency by section”, and wait for a chart that shows how often the word “data” appears in different kinds of writing.
  4. Click on one of the blue bars to dig into specific occurrences: wait for a table showing uses of the word “data” in context. Pick one row and click on the left-hand index number to see the original source text.

Now try out the other SEARCH options, and read the box on the right about each one.
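
The two views used in the steps above, frequency by section and a word shown in context, can be sketched in a few lines of Python. This is only a rough illustration and not how COCA itself works: the three-section toy corpus is invented, and the names `words`, `frequency_by_section` and `kwic` are hypothetical helpers written for this example. Frequencies are scaled to occurrences per million words, one common way to compare sections of different sizes.

```python
import re

# Toy stand-in for a sectioned corpus. The section names and texts are
# invented for this sketch and do not come from COCA.
CORPUS = {
    "spoken": "so the data we got was, you know, not great data",
    "academic": "The data were collected in 2011. Data analysis followed the sampling protocol.",
    "fiction": "She stared at the screen until the numbers blurred.",
}

def words(text):
    """Lower-case the text and return its word tokens (punctuation dropped)."""
    return re.findall(r"\w+(?:['’-]\w+)*", text.lower())

def frequency_by_section(corpus, target):
    """Map each section to (hits, section size in words, hits per million words)."""
    rows = {}
    for section, text in corpus.items():
        toks = words(text)
        hits = toks.count(target)
        per_million = 1_000_000 * hits / len(toks) if toks else 0.0
        rows[section] = (hits, len(toks), per_million)
    return rows

def kwic(corpus, target, width=3):
    """Keyword in context: each hit with up to `width` words on either side."""
    lines = []
    for section, text in corpus.items():
        toks = words(text)
        for i, tok in enumerate(toks):
            if tok == target:
                left = " ".join(toks[max(0, i - width):i])
                right = " ".join(toks[i + 1:i + 1 + width])
                lines.append((section, left, tok, right))
    return lines

table = frequency_by_section(CORPUS, "data")
for section, (hits, size, per_million) in table.items():
    print(f"{section:10} hits={hits} size={size} per_million={per_million:.0f}")

concordance = kwic(CORPUS, "data")
for section, left, tok, right in concordance:
    print(f"{section:10} {left:>25} [{tok}] {right}")
```

With sections this small the per-million figures are meaningless; the scaling only becomes informative when each section holds millions of words.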


Olivia Culpo, cellist nerd, playing with the Boston Accompanietta while a student at Boston University in March 2012. Later that year she went on to win Miss USA and Miss Universe; disrupt Burr Settles’s attempt to analyse geek/nerd word associations; and demonstrate that — regardless of stereotype — the appearance of nerds and geeks can be whatever they damn well please.