Today’s lecture described some of the annotations added to text corpora, how they are generated, and some simple analyses. It also indicated how these relate to applications such as empirical linguistics and the engineering of systems that work with natural languages.
The annotations were parts of speech and syntactic structure; the analyses were concordances, frequencies, and n-grams. These are basic examples, but they are essential foundations for more advanced computational linguistics. Moreover, a key insight into working with natural language is that basic techniques turn out to be remarkably effective when given large quantities of data, even when they use little or no information about the meaning of the text.
The lecture slides include a range of specific texts illustrating these annotations and analyses: from the British National Corpus, Penn Treebank, and the Sherlock Holmes short story A Case of Identity.
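To make the three analyses concrete, here is a minimal Python sketch of word frequencies, n-grams, and a keyword-in-context concordance. It is not taken from the lecture slides: the sample sentence is made up, the tokenizer is deliberately crude, and the function names `tokenize`, `ngrams`, and `concordance` are my own illustrative choices.

```python
import re
from collections import Counter

# Made-up sample text standing in for a real corpus (an assumption, not lecture data).
TEXT = "The cat sat on the mat. The dog sat on the rug, and the cat watched the dog."


def tokenize(text):
    """Lowercase the text and pull out runs of letters and apostrophes as tokens."""
    return re.findall(r"[a-z']+", text.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return list(zip(*(tokens[i:] for i in range(n))))


def concordance(tokens, keyword, width=2):
    """Return (left context, keyword, right context) for each occurrence of keyword."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


tokens = tokenize(TEXT)
word_freq = Counter(tokens)               # unigram frequencies
bigram_freq = Counter(ngrams(tokens, 2))  # bigram frequencies

print(word_freq.most_common(3))
print(bigram_freq.most_common(3))
for left, kw, right in concordance(tokens, "sat"):
    print(f"{left:>10} [{kw}] {right}")
```

None of this looks at meaning: it only counts and aligns strings, which is exactly why it scales so easily to very large corpora.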
I also presented some more history of Unicode, in particular the UTF-8 encoding of Unicode characters, as used in XML and, in time, everywhere else.
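As a rough illustration of how UTF-8 lays out a code point across one to four bytes, here is a hand-written encoder in Python, checked against the built-in `str.encode`. The function name `utf8_encode` is my own, and the sketch skips some validation a real encoder needs (for instance, it does not reject surrogate code points).

```python
def utf8_encode(code_point):
    """Encode a single Unicode code point as UTF-8 bytes, by hand."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    if code_point < 0x80:        # 1 byte:  0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0x10FFFF:   # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("code point is beyond the Unicode range")


for ch in "Aé中😀":
    encoded = utf8_encode(ord(ch))
    # The hand-rolled result should agree with Python's own encoder.
    assert encoded == ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {encoded.hex(' ')}")
```

Plain ASCII characters keep their single-byte form, which is a large part of why UTF-8 could be adopted without breaking existing ASCII text.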
Link: Slides for Lecture 13; No video but there is music
References
- You can buy The Mamur Zapt and The Girl in the Nile through Blackwell’s.
- Read A Case of Identity for free from Project Gutenberg.
- Find out the origin story of UTF-8.
- From the bright clear simplicity of UTF-8 to the magnificently twisted genius that is Punycode. Read the Wikipedia article or dive into the formal description in RFC 3492; try it yourself with this online Punycode converter; or just check that it works for real by visiting the Chinese Internet Network Information Center at http://中国互联网络信息中心.中国/.
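If you would rather try Punycode locally, Python ships a `punycode` codec in its standard library. The sketch below converts each label of the domain above to its ASCII form and back. The helper names `to_ascii_label` and `from_ascii_label` are my own, and this skips the normalisation steps that full internationalised domain name processing applies before encoding, so treat it as an illustration rather than a resolver-grade implementation.

```python
ACE_PREFIX = "xn--"  # marks a domain label as Punycode-encoded


def to_ascii_label(label):
    """Encode one domain label with Punycode if it contains non-ASCII characters."""
    if label.isascii():
        return label
    return ACE_PREFIX + label.encode("punycode").decode("ascii")


def from_ascii_label(label):
    """Decode one domain label back to Unicode if it carries the xn-- prefix."""
    if label.startswith(ACE_PREFIX):
        return label[len(ACE_PREFIX):].encode("ascii").decode("punycode")
    return label


domain = "中国互联网络信息中心.中国"
ascii_domain = ".".join(to_ascii_label(part) for part in domain.split("."))
round_trip = ".".join(from_ascii_label(part) for part in ascii_domain.split("."))

print(ascii_domain)
print(round_trip)
```

The ASCII form is what actually travels through DNS; browsers decode it back to the original script for display.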