Today’s lecture described some of the annotations added to text corpora, how they are generated, and some simple analyses of them. It also indicated how these relate to applications such as empirical linguistics and the engineering of systems that work with natural languages.
The annotations were parts of speech and syntactic structure; the analyses were concordances, frequencies, and n-grams. These are basic examples, but they are essential foundations for more advanced computational linguistics. Moreover, a key insight into working with natural language is that basic techniques turn out to be remarkably effective when provided with large quantities of data, even when they use little or no information about the meaning of the text.
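As a rough illustration of the three analyses named above, here is a minimal Python sketch using only the standard library. The sample sentence, the crude regular-expression tokenizer, and the function names `tokenize`, `ngrams` and `concordance` are all assumptions made for this example; they are not taken from the lecture slides or from any corpus tool.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into word tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens, each as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def concordance(tokens, keyword, width=3):
    """Return keyword-in-context lines with up to `width` tokens on either side of each hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}")
    return lines


# Invented sample text, standing in for a real corpus.
sample = "The cat sat on the mat. The dog sat on the log."
tokens = tokenize(sample)

word_freq = Counter(tokens)               # word frequencies
bigram_freq = Counter(ngrams(tokens, 2))  # frequencies of 2-grams
kwic = concordance(tokens, "sat", width=2)

print(word_freq.most_common(3))
print(bigram_freq.most_common(2))
for line in kwic:
    print(line)
```

None of these functions knows anything about meaning: they only count and align tokens. On a corpus of millions of words the same counts become informative, which is the point made above about simple techniques and large quantities of data.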
The lecture slides include a range of specific texts illustrating these annotations and analyses: from the British National Corpus, Penn Treebank, and the Sherlock Holmes short story A Case of Identity.
McEnery and Wilson. What is a corpus and what is in it? Chapter 2 of Corpus Linguistics. Second edition, Edinburgh University Press, 2001.
Section 2.2.2 is optional (and that’s a very large section, through the middle of the chapter) but you should read all the sections before and after this.
More copies of this text were distributed in the lecture today. If you still don’t have one, please go to the ITO, who have some. If they run out too, email me and I’ll get more.
This is not required, but if you are interested in seeing what a full-scale corpus looks like, follow the instructions below for an interactive tour of COCA, the Corpus of Contemporary American English.
Go to http://corpus.byu.edu/coca where you will find a three-pane window layout (if instead you get a title page with pictures, click the “Enter” button). Read the “Introduction” in the bottom half-window — follow links if you like, but make sure you get back to the introduction.
Now select “[ Where Should I Start? ]” just below the central bar, towards the left. Click “Brief tour for non-linguists”. Read this section and click links to activate searches.
After that, it’s up to you. Dig around in the various tours: there are a huge number of examples. At some point you will need to register to go on using COCA; do this.
- The BYU Corpora web interface from Brigham Young University in Utah provides access to many of the corpora mentioned in the course, including the British National Corpus (BNC) and the Corpus of American Soap Operas.
- Buy The Mamur Zapt and The Girl in the Nile through Blackwell’s.
- Read A Case of Identity from Project Gutenberg.