Today’s lecture described some of the annotations added to text corpora, how they are generated, and some simple analyses of them. It also indicated how these relate to applications such as empirical linguistics and the engineering of systems that work with natural languages.

The annotations were parts of speech and syntactic structure; the analyses were concordances, frequencies, and n-grams. These are basic examples, but they are essential foundations for more advanced work in computational linguistics. Moreover, a key insight into working with natural language is that basic techniques turn out to be remarkably effective when provided with large quantities of data, even when they use little or no information about the meaning of the text.
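
As a rough illustration of the three analyses, here is a minimal Python sketch. It is not taken from the lecture slides: the sample sentence, the crude regular-expression tokenizer, and the function names `tokenize`, `ngrams`, and `concordance` are all assumptions made for this example, and real corpus tools handle tokenization and annotation far more carefully.

```python
import re
from collections import Counter


def tokenize(text):
    # Crude tokenizer (an assumption for this sketch): lowercase the text
    # and keep runs of letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower())


def ngrams(tokens, n):
    # Every run of n consecutive tokens, as tuples.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def concordance(tokens, keyword, width=3):
    # Keyword-in-context: each occurrence of keyword with up to
    # `width` tokens of context on either side.
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Made-up sample text, standing in for a real corpus.
sample = "the cat sat on the mat and the cat slept"
tokens = tokenize(sample)

word_freq = Counter(tokens)               # word frequencies
bigram_freq = Counter(ngrams(tokens, 2))  # bigram (2-gram) frequencies

print(word_freq.most_common(2))
print(bigram_freq.most_common(1))
for left, key, right in concordance(tokens, "cat", width=2):
    print(f"{left:>10} [{key}] {right}")
```

None of these functions knows anything about what the words mean; they only count and align tokens, which is why the same code scales directly to much larger texts.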

The lecture slides include a range of specific texts illustrating these annotations and analyses, drawn from the British National Corpus, the Penn Treebank, and the Sherlock Holmes short story A Case of Identity.

I also presented some more history of Unicode, and in particular the UTF-8 encoding of Unicode characters, as used in XML and, in time, everywhere else.
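
To make the variable-length nature of UTF-8 concrete, here is a small Python sketch that prints the bytes used for a few characters. The helper name `utf8_bytes` and the choice of characters are assumptions for this illustration, not examples from the slides.

```python
def utf8_bytes(ch):
    # Hex representation of the UTF-8 bytes for a single character.
    return " ".join(f"{b:02x}" for b in ch.encode("utf-8"))


# Characters needing 1, 2, 3 and 4 bytes respectively.
for ch in ["A", "é", "€", "😀"]:
    print(f"U+{ord(ch):04X} -> {utf8_bytes(ch)}")

# U+0041 -> 41
# U+00E9 -> c3 a9
# U+20AC -> e2 82 ac
# U+1F600 -> f0 9f 98 80
```

ASCII characters keep their single-byte values, so plain ASCII text is already valid UTF-8, while characters with higher code points take two to four bytes.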

