In literature, a corpus (plural corpora) is a collection of written texts, in particular the complete works of a single author or a body of writing on a single subject. In computational and theoretical linguistics, a corpus is a body of written or spoken text used to study a particular language or language variety. These corpora may be very large (billions of words) and provide the raw material for experimental investigation of real-world language use: the science of empirical linguistics.
Note: There was a photocopied handout in today’s lecture, and all copies were taken. I’ve placed a large number of additional copies in the ITO; see homework below for details.
In this course we look at the second kind of corpus as a substantial example of semistructured data used for research. Today’s lecture briefly covered the aims and requirements of corpora, indicating how they are used and the considerations that go into building them, such as balancing, sampling, tokenization and annotation. I talked about some standard large corpora and gave examples of recent research results from corpus analysis.
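To make tokenization and annotation concrete, here is a minimal Python sketch. It is not taken from the lecture: the regular expression is a deliberately crude tokenization rule, and `TOY_TAGS` is a hypothetical hand-made lexicon standing in for a real part-of-speech tagger. Real corpora use far more careful tokenizers and tag sets.

```python
import re
from collections import Counter

# Hypothetical toy lexicon; a real corpus would use a trained tagger
# and a standard tag set rather than a hand-written dictionary.
TOY_TAGS = {
    "the": "DET", "cat": "NOUN", "dog": "NOUN",
    "mat": "NOUN", "sat": "VERB", "on": "PREP",
}

def tokenize(text):
    """Split text into lowercase word tokens, keeping simple contractions whole."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

sample = "The cat sat on the mat. The dog didn't."

# Tokenization: raw text becomes a list of word tokens.
tokens = tokenize(sample)

# A basic corpus statistic: how often each token occurs.
counts = Counter(tokens)

# Annotation: attach a label to each token; unknown words get "UNK".
annotated = [(tok, TOY_TAGS.get(tok, "UNK")) for tok in tokens]

print(tokens)
print(counts.most_common(3))
print(annotated)
```

Even on this tiny sample the choices matter: the rule lowercases everything and treats "didn't" as a single token, and a different rule would give different counts.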
Edinburgh Student Experience Survey
This closes tomorrow. If you have not already done so, please go to MyEd and complete the survey now. Thank you.
McEnery and Wilson. What is a corpus and what is in it? Chapter 2 of Corpus Linguistics. Second edition, Edinburgh University Press, 2001.
Distributed as a photocopied handout in the lecture and available from the ITO. Section 2.2.2 is optional, but you should read all the other sections to the end of the chapter.
The worksheet for Tutorial 5 went online earlier this week. In this exercise you use the xmllint command-line tool to check XML validity and run your own XPath queries.
Link: Tutorial Exercises
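As a rough illustration of the kind of check and query involved, here is a Python sketch using the standard library's `xml.etree.ElementTree` instead of xmllint. The XML fragment, its element names and its attributes are invented for this example and are not from the tutorial worksheet. Note two limits: ElementTree only checks that a document is well-formed (it does not validate against a DTD), and it supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Invented example data: a tiny annotated corpus in XML.
CORPUS_XML = """<corpus>
  <text id="t1" genre="news">
    <s><w pos="DET">The</w><w pos="NOUN">cat</w><w pos="VERB">sat</w></s>
  </text>
  <text id="t2" genre="fiction">
    <s><w pos="DET">A</w><w pos="NOUN">dog</w><w pos="VERB">barked</w></s>
  </text>
</corpus>"""

def is_well_formed(xml_string):
    """Return True if the string parses as XML (well-formedness only, not validity)."""
    try:
        ET.fromstring(xml_string)
        return True
    except ET.ParseError:
        return False

root = ET.fromstring(CORPUS_XML)

# XPath-style query: every <w> element, at any depth, tagged as a noun.
nouns = [w.text for w in root.findall(".//w[@pos='NOUN']")]

# XPath-style query: ids of <text> children whose genre is fiction.
fiction_ids = [t.get("id") for t in root.findall("./text[@genre='fiction']")]

print(is_well_formed(CORPUS_XML))    # well-formed document
print(is_well_formed("<a><b></a>"))  # mismatched tags
print(nouns)
print(fiction_ids)
```

The same path expressions can be tried against a file with xmllint in the tutorial; the sketch only shows how a path with an attribute predicate selects parts of a semistructured document.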
- For information about connecting remotely to DICE, and many other things, see the Informatics Computing Support Help Pages.
- Some corpora: British National Corpus; Open American National Corpus; Oxford English Corpus; Corpus of American Soap Operas.
- The Expression of Emotions in 20th Century Books — Text mining uncovers British reserve and US emotion.
- Burr Settles. On “Geek” Versus “Nerd”.