Slides : Recording
Today’s lecture was cancelled owing to the heavy snowfall. In its place I recommend you watch the recording from last year, linked on the right. I did trial a YouTube Live stream yesterday, but in practice it didn’t offer much over what this earlier screencast does. I have changed the reading and references slightly from last time: the handout “What is a corpus and what is in it?” is no longer required. In the recording you will also see some TopHat questions used last year. I don’t think those turned out too well in the lectures, but I will look into recovering them for you to try outside lectures.
In literature a corpus (plural corpora) is a collection of written texts, in particular the complete works of a single author or a body of writing on a single subject. In computational linguistics and in theoretical linguistics a corpus is a body of written or spoken text used for the study of a particular language or language variety. These corpora may be very large (billions of words) and provide the raw material for experimental investigation of real-world language use: the science of empirical linguistics.
In this course we look at the second kind of corpus, as a substantial example of semistructured data used for research. This recording of last year’s lecture briefly covers the aims and requirements of corpora, indicating how they are used and the considerations that go into building them: things like balancing and sampling; tokenization and annotation. There is a little about some standard large corpora and examples of research results from corpus analysis.
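As a rough illustration of tokenization and annotation, here is a minimal Python sketch. The sample text, the token pattern and the three category labels are all assumptions made for this example; real corpora use far more careful tokenizers and linguistically motivated tags such as parts of speech.

```python
import re
from collections import Counter

# Toy text standing in for one corpus document (an assumption for illustration).
TEXT = "The data are in. The data, in 2013, were sparse!"

# A token is a run of letters, digits or apostrophes, or one punctuation mark.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9']+|[^\sA-Za-z0-9']")


def tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


def annotate(token):
    """Pair a token with a crude category label (not a real part-of-speech tag)."""
    if token.isdigit():
        return (token, "NUM")
    if token[0].isalnum() or token[0] == "'":
        return (token, "WORD")
    return (token, "PUNCT")


tokens = tokenize(TEXT)
annotated = [annotate(t) for t in tokens]

# Count word tokens only, ignoring case, as a first step towards frequency analysis.
word_counts = Counter(t.lower() for t, tag in annotated if tag == "WORD")

for token, tag in annotated:
    print(f"{token}\t{tag}")
print(word_counts.most_common(3))
```

Even this toy version shows why the choices matter: deciding whether “2013” counts as a word, or whether “The” and “the” are the same item, changes every frequency computed afterwards.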
Link: Slides for Lecture 12; Recording
Homework
1. Read These
- A Data Scientist Discovered the Most Metal Word in the English Language. Drake Baer, Science of Us, nymag, July 2016. You can also read the original post, which gives a longer word list and explains the calculations.
- Text mining uncovers British reserve and US emotion. Philip Ball. Nature News, March 2013.
- On “Geek” versus “Nerd”. Burr Settles, Slackpropagation blog, June 2013.
2. Do This
Explore the Corpus of Contemporary American English.
- Go to http://corpus.byu.edu/coca/.
- Read the text box on the right-hand side, beginning “The Corpus of Contemporary American English (COCA) is the largest freely-available corpus of English”. Then follow the link “large and balanced” and read about how COCA is built up.
- Click on “SEARCH” at the top left, select “Chart” below that, type “data” in the text box, click “See frequency by section”, and wait for a chart that shows how often the word “data” appears in different kinds of writing.
- Click on one of the blue bars to dig into specific occurrences: wait for a table showing uses of the word “data” in context. Pick one row and click on the left-hand index number to see the original source text.
Now try out the other SEARCH options, and read the box on the right about each one.
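To get a feel for what a frequency-by-section chart is computing, here is a small Python sketch. It is not how COCA itself is implemented: the miniature corpus, the section names and the helper function `frequency_by_section` are all invented for illustration. The idea is simply to count a word in each section and scale by the section size, so that sections of different lengths can be compared.

```python
from collections import Counter

# Hypothetical miniature corpus: section name -> list of tokens.
TOY_CORPUS = {
    "spoken": "well the data say so but who reads data anyway".split(),
    "fiction": "she stared at the sea and said nothing at all".split(),
    "academic": "the data show that data quality limits the analysis of data".split(),
}


def frequency_by_section(corpus, word):
    """Return {section: (raw count, section size, occurrences per million words)}."""
    rows = {}
    for section, tokens in corpus.items():
        count = Counter(tokens)[word]
        per_million = count / len(tokens) * 1_000_000
        rows[section] = (count, len(tokens), per_million)
    return rows


rows = frequency_by_section(TOY_CORPUS, "data")

# Draw a crude text bar chart, one '#' per 50,000 occurrences per million words.
for section, (count, size, per_million) in rows.items():
    bar = "#" * round(per_million / 50_000)
    print(f"{section:<10}{count:>3} of {size:<4}{bar}")
```

Normalizing per million words is what makes the bars comparable: a raw count alone would favour whichever section happens to contain the most text.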
References
- The British National Corpus
- The Oxford English Corpus
- More about the corpus and Oxford Dictionaries
- The Expression of Emotions in 20th Century Books. Acerbi et al. PLOS ONE, 8(3):e59030, March 2013.
Taichi Fukumura, CC BY-SA 3.0
Olivia Culpo, cellist nerd, playing with the Boston Accompanietta while a student at Boston University in March 2012. Later that year she went on to: win Miss USA and Miss Universe; disrupt Burr Settles’s attempt to analyse geek/nerd word associations; and demonstrate that — regardless of stereotype — the appearance of nerds and geeks can be whatever they damn well please.
XKCD 747