Today’s lecture described some of the annotations added to text corpora, how they are generated, and some simple analyses. It also indicated how these relate to applications such as empirical linguistics and the engineering of systems that work with natural languages.
The annotations were parts of speech and syntactic structure; the analyses were concordances, frequencies, and n-grams. These are basic examples, but they are essential foundations for building more advanced computational linguistics. Moreover, a key insight into working with natural language is that basic techniques turn out to be remarkably effective when provided with large quantities of data, even when they use little or no information about the meaning of the text.
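As a rough illustration of the three analyses, here is a minimal Python sketch. It is not the tooling used for the lecture slides: the sample sentence is invented, and the function names `tokenize`, `ngrams` and `concordance` are my own choices for this example. Real corpus tools use far more careful tokenisation.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def concordance(tokens, keyword, width=2):
    """Return (left context, keyword, right context) for each occurrence."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

# Invented sample text, standing in for a real corpus.
SAMPLE = "the cat sat on the mat and the dog sat on the log"

tokens = tokenize(SAMPLE)
word_freq = Counter(tokens)               # frequency of each word
bigram_freq = Counter(ngrams(tokens, 2))  # frequency of each 2-gram

if __name__ == "__main__":
    print(word_freq.most_common(3))
    print(bigram_freq.most_common(2))
    for left, kw, right in concordance(tokens, "sat"):
        print(f"{left:>10} [{kw}] {right}")
```

Nothing here uses the meaning of the words, which is the point: counting and lining up contexts is purely mechanical, and becomes informative once the text is large.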
The lecture slides include a range of specific texts illustrating these annotations and analyses: from the British National Corpus, Penn Treebank, and the Sherlock Holmes short story A Case of Identity.
I also presented some more history of Unicode, and in particular the UTF-8 encoding of Unicode characters, as used in XML and, in time, everywhere else.
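To make the UTF-8 scheme concrete, here is a small Python sketch that encodes a single code point by hand and checks the result against Python's built-in encoder. The function name `utf8_encode` is just for this example; in practice you would simply call `str.encode("utf-8")`.

```python
def utf8_encode(code_point):
    """Encode one Unicode code point as UTF-8 bytes (1 to 4 bytes)."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point < 0x80:        # 7 bits:  0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:       # 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:     # 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

if __name__ == "__main__":
    for ch in "A\u00e9\u20ac\U0001F600":   # A, e-acute, euro sign, an emoji
        encoded = utf8_encode(ord(ch))
        assert encoded == ch.encode("utf-8")
        print(f"U+{ord(ch):04X} -> {encoded.hex(' ')}")
```

The first branch is why every ASCII file is already valid UTF-8: code points below 128 are encoded as a single unchanged byte.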
Link: Slides for Lecture 13
McEnery and Wilson. What is a corpus and what is in it? Chapter 2 of Corpus Linguistics. Second edition, Edinburgh University Press, 2001.
Distributed by email, as a photocopied handout in last week’s lecture, and available outside the ITO. Section 2.2.2 is optional (and that’s a very large section, through the middle of the chapter) but you should read all the other sections before and after this.
If there are no more copies outside the ITO then email me and I’ll get more.
Follow these instructions for an interactive tour of COCA: The Corpus of Contemporary American English.
Go to http://corpus.byu.edu/coca where you will find a three-pane window layout (if instead you get a title page with pictures, click the “Enter” button). Read the “Introduction” in the bottom half-window — follow links if you like, but make sure you get back to the introduction.
Now select “[ Where Should I Start? ]” just below the central bar, towards the left. Click “Brief tour for non-linguists”. Read this section and click links to activate searches.
After that, it’s up to you. Dig around in the various tours: there are a huge number of examples. At some point you will need to register to go on using COCA.
The BYU Corpora web interface from Brigham Young University in Utah provides access to many of the corpora mentioned in the course, including the British National Corpus (BNC) and the Corpus of American Soap Operas.
Buy The Mamur Zapt and The Girl in the Nile through Blackwell’s.
The Observer Eye proposal for Unicode. Originally put forward around this time last year, it’s been provisionally allocated code point U+23FF. You can read more on the proposer’s blog and see its progress in the Unicode pipeline.
If you liked UTF-8, then I strongly recommend finding out about the extraordinary construction that is Punycode. There’s a formal description in RFC 3492; a useful explanation in the Wikipedia article on Punycode; you can try it yourself with this Punycode convertor; and confirm that it works for real by visiting the Chinese Internet Network Information Center at http://中国互联网络信息中心.中国.
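If you would rather experiment locally, Python ships a `punycode` codec in its standard library. The sketch below converts individual domain-name labels to and from their ASCII form. The helper names `to_ascii_label` and `from_ascii_label` are my own, and this deliberately skips the normalisation steps that full internationalised domain name processing applies before encoding, so treat it as a demonstration rather than a complete implementation.

```python
ACE_PREFIX = "xn--"  # marks a Punycode-encoded label in a domain name

def to_ascii_label(label):
    """Return the ASCII form of one domain-name label (illustrative helper)."""
    if label.isascii():
        return label  # plain ASCII labels are left alone
    return ACE_PREFIX + label.encode("punycode").decode("ascii")

def from_ascii_label(label):
    """Invert to_ascii_label for a single label."""
    if label.startswith(ACE_PREFIX):
        return label[len(ACE_PREFIX):].encode("ascii").decode("punycode")
    return label

if __name__ == "__main__":
    for label in ["bücher", "中国", "中国互联网络信息中心", "example"]:
        ascii_form = to_ascii_label(label)
        assert from_ascii_label(ascii_form) == label
        print(label, "->", ascii_form)
```

A full domain name is converted one label at a time, splitting on the dots, which is why the top-level domain 中国 has its own ASCII form independent of whatever comes before it.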