In literature, a corpus (plural corpora) is a collection of written texts, in particular the complete works of a single author or a body of writing on a single subject. In computational and theoretical linguistics, a corpus is a body of written or spoken text used for the study of a particular language or language variety. These corpora may be very large (billions of words) and provide the raw material for experimental investigation of real-world language use: the science of empirical linguistics.
In this course we look at the second kind of corpus as a substantial example of semistructured data used for research. Today’s lecture briefly covered the aims and requirements of corpora: how they are used, and the considerations that go into building them, such as balancing and sampling, tokenization, and annotation. I also talked about some standard large corpora and gave examples of research results from corpus analysis.
Link: Slides for Lecture 12
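As a small illustration of the tokenization step mentioned above, here is a minimal Python sketch that splits raw text into word tokens and counts how often each occurs. It is not taken from the lecture: the function name `tokenize`, the sample sentence, and the deliberately simple regular expression (lowercase letters with an optional internal apostrophe) are all assumptions made for the example. Real corpus tokenizers handle many more cases, such as hyphenation, numbers, and abbreviations.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return a list of word tokens.

    Assumption for this sketch: a token is a run of ASCII letters,
    optionally followed by an apostrophe and more letters (e.g. "didn't").
    """
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


# Invented sample text, standing in for a (much larger) corpus.
sample = "The cat sat on the mat. The dog didn't."

tokens = tokenize(sample)
counts = Counter(tokens)  # token -> number of occurrences

print(tokens)
print(counts.most_common(1))
```

Counting tokens in this way is the starting point for the kind of frequency-based analysis that large corpora make possible.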
McEnery and Wilson. What is a corpus and what is in it? Chapter 2 of Corpus Linguistics. Second edition, Edinburgh University Press, 2001.
Distributed by email, as a photocopied handout in the lecture, and available outside the ITO. Section 2.2.2 is optional, but you should read all the other sections to the end of the chapter.
The worksheet for Tutorial 5 went online earlier this week. In this exercise you use the xmllint command-line tool to check XML validity and run your own XPath queries.
Link: Tutorial Exercises
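The tutorial itself uses xmllint. For comparison, the following Python sketch shows the same two ideas, checking a document and querying it with XPath, using only the standard library. Several things here are assumptions for illustration: the tiny corpus fragment, its element names (`corpus`, `s`, `w`) and the `pos` attribute are invented, and the helper `is_well_formed` is hypothetical. Note also that `xml.etree.ElementTree` only checks well-formedness and supports a limited subset of XPath; it does not validate against a DTD as the tutorial exercise does.

```python
import xml.etree.ElementTree as ET

# Invented fragment of an annotated corpus: each word carries a
# part-of-speech tag in a hypothetical "pos" attribute.
doc = """<corpus>
  <s id="1"><w pos="DET">The</w><w pos="NOUN">cat</w><w pos="VERB">sat</w></s>
  <s id="2"><w pos="DET">A</w><w pos="NOUN">dog</w><w pos="VERB">barked</w></s>
</corpus>"""


def is_well_formed(xml_text):
    """Return True if the text parses as XML (well-formedness only, no DTD check)."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


root = ET.fromstring(doc)

# XPath-style query: every <w> element, at any depth, tagged as a noun.
nouns = [w.text for w in root.findall(".//w[@pos='NOUN']")]

print(is_well_formed(doc))
print(is_well_formed("<corpus><s></corpus>"))  # mismatched tags
print(nouns)
```

The query selects elements by annotation rather than by surface text, which is the reason corpora are marked up in the first place.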
- For information about connecting remotely to DICE, and many other things, see the Informatics Computing Support Help Pages.
- Some corpora: British National Corpus; Corpus of Contemporary American English; Oxford English Corpus; Corpus of American Soap Operas.
- The Expression of Emotions in 20th Century Books — Text mining uncovers British reserve and US emotion.
- Burr Settles. On “Geek” Versus “Nerd”.