Lecture 12: Corpora

In literature, a corpus (plural corpora) is a collection of written texts, in particular the complete works of a single author or a body of writing on a single subject. In computational linguistics and in theoretical linguistics, a corpus is a body of written or spoken text used for the study of a particular language or language variety. These corpora may be very large (billions of words) and provide the raw material for experimental investigation of real-world language use: the science of empirical linguistics.

In this course we look at the second kind of corpus as a substantial example of semistructured data used for research. Today’s lecture briefly covered the aims and requirements of corpora, how they are used, and the considerations that go into building them, such as balancing and sampling, tokenization and annotation. I also talked about some standard large corpora and gave examples of research results from corpus analysis.
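To make tokenization and annotation a little more concrete, here is a minimal Python sketch. It is not taken from the lecture slides: the sample sentence, the function names `tokenize` and `annotate`, and the two coarse labels `word` and `punct` are all assumptions chosen for illustration. Real corpora use far more careful tokenizers and much richer annotation schemes, such as part-of-speech tags.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens and standalone punctuation marks."""
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text.lower())


def annotate(tokens):
    """Attach a position and a coarse, purely illustrative type label to each token."""
    return [
        {"n": i, "form": tok, "type": "word" if tok[0].isalnum() else "punct"}
        for i, tok in enumerate(tokens, start=1)
    ]


# Hypothetical two-sentence "corpus" used only for this sketch.
sample = "The cat sat on the mat. The dog did not."

tokens = tokenize(sample)
annotated = annotate(tokens)

# Once tokens carry annotations, simple empirical questions become easy to ask,
# for example: which word forms are most frequent?
word_counts = Counter(t["form"] for t in annotated if t["type"] == "word")
print(word_counts.most_common(3))
```

Even in this toy version, the tokenizer has already made decisions (lowercasing, splitting off punctuation) that affect every count computed afterwards, which is one reason corpus builders document these choices carefully.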

Link: Slides for Lecture 12

Homework

Read This

McEnery and Wilson. What is a corpus and what is in it? Chapter 2 of Corpus Linguistics. Second edition, Edinburgh University Press, 2001.

This chapter has been distributed by email and as a photocopied handout in the lecture, and copies are available outside the ITO. Section 2.2.2 is optional, but you should read all the other sections to the end of the chapter.

Do This

The worksheet for Tutorial 5 went online earlier this week. In this exercise you use the xmllint command-line tool to check XML validity and run your own XPath queries.
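If you would like to experiment with the same ideas outside xmllint, the following Python sketch may help. It is not part of the tutorial worksheet. The tiny annotated document, its element and attribute names (`corpus`, `s`, `w`, `pos`) and the helper `is_well_formed` are all invented for illustration. Note also that the standard-library `xml.etree.ElementTree` module only checks that a document is well-formed, not that it is valid against a DTD, and it supports only a limited subset of XPath.

```python
import xml.etree.ElementTree as ET

# A hypothetical fragment of an annotated corpus; the markup scheme is assumed.
doc = """<corpus>
  <text id="t1">
    <s n="1"><w pos="DT">The</w><w pos="NN">cat</w><w pos="VBD">sat</w></s>
    <s n="2"><w pos="DT">A</w><w pos="NN">dog</w><w pos="VBD">barked</w></s>
  </text>
</corpus>"""


def is_well_formed(xml_text):
    """Return True if the text parses as XML (well-formedness only, no DTD validation)."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


root = ET.fromstring(doc)

# XPath-style query: every <w> element, at any depth, whose pos attribute is "NN".
nouns = [w.text for w in root.findall(".//w[@pos='NN']")]

# XPath-style query: the <w> children of the sentence numbered 2.
second_sentence = [w.text for w in root.findall(".//s[@n='2']/w")]

print(nouns)
print(second_sentence)
print(is_well_formed("<a><b></a>"))  # mismatched tags, so not well-formed
```

The queries here have the same shape as the ones you write for xmllint; the main difference is that xmllint accepts full XPath 1.0 expressions and can validate against a DTD.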

Link: Tutorial Exercises

References