Corpora are widely used for computational research into language, and for engineering natural-language computer systems. In linguistics, they make it possible to do real experimental science: to formulate hypotheses about the structure of languages, or about changes in language between different places, times or people, and then to test these against data. In building applications that handle text or speech, corpora provide the large quantities of raw material needed for machine learning and other data-driven algorithms.
This lecture presented two specific examples: identifying collocations through statistical analysis, and different approaches to machine translation between human languages. A key point about collocations is that they can be identified without any attempt at machine understanding: we expect collocations to carry a distinctive meaning, but they stand out statistically even when nothing is known about that meaning. For machine translation, the lecture covered the classic rule-based approach and the ever-growing success of statistical schemes.
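To make the statistical idea concrete, here is a minimal sketch of one standard way to spot collocations: rank adjacent word pairs by pointwise mutual information (PMI), which measures how much more often two words co-occur than chance would predict. This is an illustrative assumption on my part, not necessarily the exact method from the lecture, and the toy sentence and function name are mine.

```python
# Hedged sketch: collocation spotting via pointwise mutual information (PMI).
# Bigrams whose words co-occur far more often than chance stand out
# statistically, with no model at all of what the phrase means.
import math
from collections import Counter

def pmi_collocations(tokens, min_count=2):
    """Rank adjacent word pairs by PMI = log2( P(w1,w2) / (P(w1) * P(w2)) )."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scores = {}
    for (w1, w2), c in bigrams.items():
        if c < min_count:
            continue  # rare pairs give unreliable probability estimates
        p_pair = c / (n - 1)            # joint probability of the bigram
        p1 = unigrams[w1] / n           # marginal probability of w1
        p2 = unigrams[w2] / n           # marginal probability of w2
        scores[(w1, w2)] = math.log2(p_pair / (p1 * p2))
    # Highest PMI first: the strongest collocation candidates.
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Toy corpus (hypothetical), just to exercise the function.
tokens = ("strong tea is strong and weak tea is weak but "
          "strong tea beats weak coffee").split()
for pair, score in pmi_collocations(tokens):
    print(pair, round(score, 2))
```

On real corpora a smoothed or significance-tested score (e.g. a likelihood-ratio test) is usually preferred, since raw PMI over-rewards rare pairs; the `min_count` cutoff here is a crude stand-in for that.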
Alongside this core material: some questions to ask about assessment and coursework; and a wander around Google n-grams, culturomics, and the Anachronism Machine.
Link: Slides for Lecture 14
Experimental: Click here for links to everything. All of the links I used in the lecture, and more, loaded into more tabs/windows than is healthy. Behaviour will depend on browser, may cause warnings and alerts, may cause cats to howl and crops to fail.
Google n-grams on the relative attention to vampires, werewolves, and zombies over the last two centuries.
What different kinds of tea people have been interested in over the decades.
Quantitative Analysis of Culture Using Millions of Digitized Books. Michel et al., Science 331(6014):176–182. (EASE login required)
The Culturomics FAQ answers many of the comments, objections and complaints that the authors of this paper received once their publication made newspaper front pages.
This is the server that failed during the lecture, so I used a Google cache instead. The real page includes graphs of word use over time.
Bookworm: “Bookworm is a simple and powerful way to visualize trends in repositories of digitized texts.”
This is from the Culturomics folk, and lets people build their own tools to explore repositories. There are some interesting examples right on their front page. Here’s Ben Schmidt’s Simpsons Bookworm that lets you compare the profile of different characters through 25 years of the show.
Prof. Schmidt has any number of enticing data visualisations, many based on text corpora. Here’s a visualisation of plot arcs in TV episodes based on topic words in the script. UK period detective drama, for example, spends a lot of time on the machine-identified topic “Inspector Professor Holmes Mr. sir Thank”.
Ben Schmidt again, with top-notch historiolinguistic analyses of cultural landmarks: Abraham Lincoln, Vampire Hunter and Abraham Lincoln vs. Zombies.