Lecture 14: Example Corpora Applications

Corpora are widely used for computational research into language, and for engineering natural-language computer systems. In linguistics, they make it possible to do real experimental science: to formulate hypotheses about the structure of languages, or changes in language between different places, times or people; and then test these on data. In building applications that handle text or speech, corpora provide the mass quantities of raw material used for machine learning and other algorithms.

This lecture presented two specific examples: identifying collocations through statistical analysis, and different approaches to machine translation between human languages. A key point about collocations is that they can be identified without attempting machine understanding: we expect a collocation to have a distinctive meaning, but it stands out statistically even if we know nothing about that meaning. For machine translation, the lecture covered the classic rule-based approach and the ever-growing success of statistical schemes.
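To make the statistical idea concrete, here is a minimal sketch of scoring adjacent word pairs by pointwise mutual information (PMI), which compares how often two words occur together with how often they would if they were independent. PMI is one common choice and is an assumption here, not necessarily the measure used in the lecture; the function name `pmi_scores`, the `min_count` threshold and the toy text are all invented for illustration.

```python
import math
from collections import Counter


def pmi_scores(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information.

    PMI(w1, w2) = log2( P(w1 w2) / (P(w1) * P(w2)) ), with probabilities
    estimated from raw counts. No word meanings are consulted anywhere.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = len(tokens) - 1

    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # very rare pairs get inflated, unreliable PMI values
        p_pair = count / n_bigrams
        p_w1 = unigrams[w1] / n_unigrams
        p_w2 = unigrams[w2] / n_unigrams
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


# Hypothetical toy text; a real study would use a large corpus.
text = ("strong tea and strong coffee . the tea is strong . "
        "powerful computer and strong tea . the computer is powerful")
tokens = text.split()

# Print pairs from highest to lowest PMI.
ranked = sorted(pmi_scores(tokens).items(), key=lambda kv: kv[1], reverse=True)
for pair, score in ranked:
    print(pair, round(score, 2))
```

A positive score means the pair occurs together more often than chance would predict. On a text this small the ranking is not meaningful, and pairs involving punctuation or function words can score highly, so in practice one would run this over a large corpus and probably filter by part of speech or use a significance test instead.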

Alongside this core material: some questions to ask about assessment and coursework; and a wander around Google n-grams, culturomics, and the Anachronism Machine.

Links: Slides for Lecture 14; Video of Lecture 14

References

Experimental: Everything. All of the links I used in the lecture, and those below, loaded into more tabs/windows than is healthy. Behaviour will depend on browser, may cause warnings and alerts, may cause crops to fail.