Lecture 14: Example Corpora Applications

Corpora are widely used for computational research into language, and for engineering natural-language computer systems. In linguistics, they make it possible to do real experimental science: to formulate hypotheses about the structure of languages, or about how language varies across places, times, and speakers, and then to test those hypotheses on data. In building applications that handle text or speech, corpora provide the bulk raw material for machine learning and other data-driven algorithms.

This lecture presented two specific examples: identifying collocations through statistical analysis, and different approaches to machine translation between human languages. A key point about collocations is that they can be identified without any attempt at machine understanding: we expect collocations to carry a distinctive meaning, but they stand out statistically even when nothing is known about that meaning. For machine translation, the lecture covered the classic rule-based approach and the ever-growing success of statistical methods.
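As a concrete, hedged illustration of that statistical approach (a minimal sketch, not necessarily the exact statistic used in the lecture), the fragment below ranks adjacent word pairs by pointwise mutual information (PMI), one standard collocation measure: a pair scores highly when it co-occurs far more often than the individual word frequencies would predict, without the program knowing anything about its meaning.

```python
import math
from collections import Counter

def pmi_collocations(words, min_count=2, top_n=10):
    """Rank adjacent word pairs by pointwise mutual information (PMI).

    PMI = log2(P(w1,w2) / (P(w1) * P(w2))): high when the pair co-occurs
    far more often than the words' individual frequencies predict, with
    no reference to what the words actually mean.
    """
    unigrams = Counter(words)
    bigrams = Counter(zip(words, words[1:]))
    n_uni = len(words)
    n_bi = max(len(words) - 1, 1)

    scored = []
    for (w1, w2), c in bigrams.items():
        if c < min_count:  # rare pairs give unreliable PMI estimates
            continue
        p_pair = c / n_bi
        p_indep = (unigrams[w1] / n_uni) * (unigrams[w2] / n_uni)
        scored.append((math.log2(p_pair / p_indep), w1, w2))
    return sorted(scored, reverse=True)[:top_n]

# Toy usage; real input would be the token stream of a large corpus.
tokens = "he kicked the bucket then kicked the ball and kicked the bucket".split()
for pmi, w1, w2 in pmi_collocations(tokens):
    print(f"{pmi:5.2f}  {w1} {w2}")
```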

Following these detailed examples came brief illustrations of other, similar applications: Google n-grams, culturomics, automatic topic identification, and analysis of TV crime drama.
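To give a flavour of automatic topic identification: the projects linked under References use their own, more sophisticated tooling, but the minimal sketch below (using scikit-learn's LDA implementation, with toy documents standing in for a real archive) shows the basic idea of extracting topics as word distributions from raw text counts.

```python
# Minimal topic-modelling sketch: fit LDA to bag-of-words counts and
# print the most probable words per topic. Real runs use thousands of
# documents, not these four toy sentences.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the inspector examined the body at the crime scene",
    "the detective questioned a suspect about the murder",
    "parliament debated the new tax bill this session",
    "the minister announced a budget vote in parliament",
]

# Bag-of-words counts; removing stop words sharpens the topics considerably.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Fit two topics; each topic is a probability distribution over the vocabulary.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

vocab = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = weights.argsort()[::-1][:5]
    print(f"topic {k}:", " ".join(vocab[i] for i in top))
```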

Links: Slides for Lecture 14; Recording of Lecture 14

Homework

Read these. The first two are short blog posts; the third is a longer research article.

The Culturomics FAQ answers many of the comments, objections and complaints that the authors of this paper received once their publication made newspaper front pages. You might also be interested to read this commentary on the limitations of Google Books for research.

References

Experimental: click https://is.gd/DA18L14 to open all of the links I used in the lecture, and more, in more tabs/windows than is healthy. Behaviour will depend on your browser; it may cause warnings and alerts, and may cause darkness to fall and wells to run dry.

  • Google n-grams on the relative attention to vampires, werewolves, and zombies over the last two centuries. (A sketch of the frequency computation behind such charts appears after this list.)
  • What different kinds of tea people have been interested in over the decades.
  • 1000 topics automatically extracted from 20 years of the New York Times by David Mimno.
  • What’s happening in US health research, where Mimno and others automatically generated a network of topic clusters based on grant funding.
  • NeuroElectro is a project to automatically gather information about different types of neuron and enable the discovery of new neuron/neuron relationships by scanning existing research literature.
  • Ben Schmidt is a history professor working on research in digital archives and data visualisation. He has written about identifying anachronistic use of language in Downton Abbey and other TV and film series.
  • Bookworm: “Bookworm is a simple and powerful way to visualize trends in repositories of digitized texts.”
    This is from the Culturomics folk, and lets people build their own tools to explore repositories. There are some interesting examples right on their front page. Here’s Ben Schmidt’s Simpsons Bookworm, which lets you compare the profiles of different characters across 25 years of the show.
  • Prof. Schmidt has any number of enticing data visualisations, many based on text corpora. Here’s a visualisation of plot arcs in TV episodes based on topic words in the script. UK period detective drama, for example, spends a lot of time on the machine-identified topic “Inspector Professor Holmes Mr. sir Thank”.
  • Ben Schmidt again, with top-notch historiolinguistic analyses of cultural landmarks: Abraham Lincoln, Vampire Hunter and Abraham Lincoln vs. Zombies.
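As promised above, here is a sketch of the computation behind a chart like the vampires/werewolves/zombies one: yearly relative frequency of each target word. It assumes the tab-separated word/year/match_count/volume_count layout of the Google Books 1-gram downloads and a precomputed {year: total tokens} mapping from the accompanying totals file; check the current download documentation, since the exact format has varied between releases.

```python
from collections import defaultdict

def yearly_relative_frequency(path, targets, totals):
    """Return {word: {year: frequency}} for each target word.

    Assumes tab-separated lines of "word year match_count volume_count",
    as in the Google Books 1-gram downloads; `totals` maps each year to
    that year's total token count.
    """
    freqs = defaultdict(lambda: defaultdict(float))
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, year, match_count, _ = line.rstrip("\n").split("\t")
            key = word.lower()
            if key in targets:
                # Accumulate, since capitalisation variants fold together.
                freqs[key][int(year)] += int(match_count) / totals[int(year)]
    return freqs

# Hypothetical usage (file name and totals mapping are placeholders):
# freqs = yearly_relative_frequency("eng-1gram.tsv",
#                                   {"vampire", "werewolf", "zombie"},
#                                   totals)
```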