Corpora are widely used for computational research into language, and for engineering natural-language computer systems. In linguistics, they make it possible to do real experimental science: to formulate hypotheses about the structure of languages, or about changes in language across places, times, or speakers, and then to test those hypotheses against data. In building applications that handle text or speech, corpora provide the large quantities of raw material that machine learning and other algorithms consume.
This lecture presented two specific examples: identifying collocations through statistical analysis, and contrasting approaches to machine translation between human languages. A key point about collocations is that they can be identified without any attempt at machine understanding: we expect a collocation to have a distinctive meaning, but it stands out statistically even if we know nothing about that meaning. For machine translation, the lecture covered the classic rule-based approach and the ever-growing success of statistical schemes.
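To make the "statistics without understanding" idea concrete, here is a minimal sketch of one standard association measure, pointwise mutual information (PMI), applied to adjacent word pairs. This is an illustration of the general technique, not necessarily the exact measure used in the lecture; the toy corpus and the `min_count` threshold are my own assumptions.

```python
import math
from collections import Counter

def pmi_bigrams(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information.

    PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) ): pairs whose words
    co-occur far more often than chance would predict score highly,
    which is one simple statistical signature of a collocation.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = len(tokens) - 1
    scores = {}
    for (x, y), c in bigrams.items():
        if c < min_count:          # ignore rare pairs: PMI is noisy for them
            continue
        p_xy = c / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores

# Toy corpus (hypothetical): "strong tea" recurs as a unit, so it should
# outscore pairs built from frequent but independent tokens.
text = ("strong tea please . i drink strong tea daily . "
        "the tea is hot . the day is long .").split()
scores = pmi_bigrams(text)
best = max(scores, key=scores.get)   # ('strong', 'tea')
```

Note that the score never consults what "strong tea" means; frequency counts alone are enough to flag it, which is exactly the point made above. Real systems use larger windows, smoothing, and significance tests (e.g. the t-test or log-likelihood ratio) on top of this basic idea.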
Alongside this core material: some questions to ask about assessment and coursework; and a wander around Google n-grams, culturomics, and the Anachronism Machine.
Read these. The first two are short blog posts; the third is a longer research article.
Zero-Shot Translation with Google’s Multilingual Neural Machine Translation System. Schuster, Johnson, and Thorat. Google Research blog, November 2016.
Zero-Shot Translation Is Both More and Less Important Than You Think. Arle Lommel. Technology, Translation and Localization blog, CSA Research, February 2017.
Quantitative Analysis of Culture Using Millions of Digitized Books. Michel, Aiden, et al. Science 331(6014):176–182, January 2011. DOI: 10.1126/science.1199644
The Culturomics FAQ answers many of the comments, objections and complaints that the authors of this paper received once their publication made newspaper front pages. You might also be interested to read this commentary on the limitations of Google Books for research.
The Pitfalls of Using Google Ngram to Study Language. Sarah Zhang. Wired, October 2015.
Experimental: Click here for links to everything. All of the links I used in the lecture, and more, loaded into more tabs/windows than is healthy. Behaviour will depend on browser, may cause warnings and alerts, may cause cats to howl and crops to fail.
- Google n-grams on the relative attention to vampires, werewolves, and zombies over the last two centuries.
- What different kinds of tea people have been interested in over the decades.
- 1000 topics automatically extracted from 20 years of the New York Times by David Mimno.
- Text analysis of Trump tweets from August 2016, exploring variation by time of day, device used, and sentiment expressed in a notable contemporary dataset.
- Ben Schmidt is a history professor working on research in digital archives and data visualisation. He has written about identifying anachronistic use of language in Downton Abbey together with other TV and film series.
- Bookworm: “Bookworm is a simple and powerful way to visualize trends in repositories of digitized texts.”
This is from the Culturomics folk, and lets people build their own tools to explore repositories. There are some interesting examples right on their front page. Here’s Ben Schmidt’s Simpsons Bookworm that lets you compare the profile of different characters through 25 years of the show.
- Prof. Schmidt has any number of enticing data visualisations, many based on text corpora. Here’s a visualisation of plot arcs in TV episodes based on topic words in the script. UK period detective drama, for example, spends a lot of time on the machine-identified topic “Inspector Professor Holmes Mr. sir Thank”.
- Ben Schmidt again, with top-notch historiolinguistic analyses of cultural landmarks: Abraham Lincoln, Vampire Hunter and Abraham Lincoln vs. Zombies.