Lecture 12: Corpora

In literature, a corpus (plural corpora) is a collection of written texts, in particular the complete works of a single author or a body of writing on a single subject. In computational linguistics and in theoretical linguistics, a corpus is a body of written or spoken text used for the study of a particular language or language variety. These corpora may be very large (billions of words) and provide the raw material for experimental investigation of real-world language use: the science of empirical linguistics.

Note: There was a photocopied handout in today’s lecture, and all copies were taken. I’ve placed a large number of additional copies in the ITO; see homework below for details.

In this course we look at the second kind of corpus as a substantial example of semistructured data used for research. Today’s lecture briefly covered the aims and requirements of corpora, indicating how they are used and the considerations that go into building them, such as balancing and sampling, tokenization, and annotation. I also talked about some standard large corpora and gave examples of recent research results from corpus analysis.
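As a rough illustration of what tokenization and annotation involve, here is a minimal Python sketch. It is not taken from the lecture: the sample sentence, the regular expression, and the function names `tokenize` and `annotate` are all assumptions made for this example, and real corpora use far more careful rules for splitting text and far richer annotation such as part-of-speech tags.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens using a deliberately naive rule.

    A token is a run of letters or digits, optionally followed by an
    apostrophe and more letters or digits (so "didn't" stays one token).
    """
    return re.findall(r"[a-z0-9]+(?:'[a-z0-9]+)?", text.lower())


def annotate(tokens):
    """Attach a position number to each token.

    This stands in for real annotation, which would add linguistic
    information such as part of speech or lemma.
    """
    return [{"id": i, "form": tok} for i, tok in enumerate(tokens, start=1)]


# Hypothetical sample text, invented for this sketch.
sample = "The cat sat on the mat. The mat didn't move."

tokens = tokenize(sample)
annotated = annotate(tokens)
freq = Counter(tokens)  # word-frequency counts, a basic corpus statistic

print(tokens)
print(freq.most_common(2))
```

Even in this tiny example the design choices matter: lowercasing merges "The" and "the", and the treatment of the apostrophe decides whether "didn't" counts as one token or two.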

Edinburgh Student Experience Survey

This closes tomorrow. If you have not already done so, please go to MyEd and complete the survey now. Thank you.

Links: Slides for Lecture 12; Video of Lecture 12; Log in to MyEd; About the survey.



Homework

McEnery and Wilson. What is a corpus and what is in it? Chapter 2 of Corpus Linguistics. Second edition, Edinburgh University Press, 2001.

Distributed as a photocopied handout in the lecture and available from the ITO. Section 2.2.2 is optional, but you should read all the other sections to the end of the chapter.


The worksheet for Tutorial 5 went online earlier this week. In this exercise you use the xmllint command-line tool to check XML validity and run your own XPath queries.

Link: Tutorial Exercises
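
For anyone who wants to experiment outside xmllint, the following Python sketch shows the same two ideas, checking a document and querying it with a path expression, using only the standard library. It is an illustration under stated assumptions, not part of the tutorial: the sample document and its element names (`corpus`, `text`, `s`, `w`) and the `pos` attribute are invented here. Note also that `xml.etree.ElementTree` only checks well-formedness, not validity against a DTD, and supports only a limited subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotated text, invented for this sketch.
SAMPLE = """<corpus>
  <text id="t1"><s><w pos="DT">The</w><w pos="NN">cat</w></s></text>
  <text id="t2"><s><w pos="NN">Dogs</w></s></text>
</corpus>"""


def is_well_formed(xml_string):
    """Return True if the string parses as XML, False otherwise.

    This checks well-formedness only; it does not validate against a DTD.
    """
    try:
        ET.fromstring(xml_string)
        return True
    except ET.ParseError:
        return False


root = ET.fromstring(SAMPLE)

# Path query: every <w> element, at any depth, whose pos attribute is "NN".
nouns = [w.text for w in root.findall(".//w[@pos='NN']")]

print(is_well_formed(SAMPLE))
print(is_well_formed("<a><b></a>"))  # mismatched tags
print(nouns)
```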