# Lecture 20: Course Review

This final lecture of the course covered worked solutions to the two past exam questions set as homework last Friday.

# Lecture 19: χ² Testing on Categorical Data

Today’s lecture covered more on hypothesis testing, presenting the χ² test and working through three examples: student Inf1-DA exam results in 2011, bigram frequency in the British National Corpus, and possible gender bias in student admissions to Berkeley in 1973.
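As a rough illustration of the calculation behind a χ² test, here is a minimal sketch that computes the Pearson χ² statistic for a contingency table. The counts are invented purely for illustration and are not taken from any of the three lecture examples; the function name `chi_squared` is likewise just a label chosen here.

```python
def chi_squared(observed):
    """Pearson chi-squared statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            # Expected count if the row and column categories were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (count - expected) ** 2 / expected
    return statistic


# Invented 2x2 table (NOT real data): rows are two groups, columns two outcomes.
table = [[30, 20],
         [20, 30]]
statistic = chi_squared(table)

# A 2x2 table has (2 - 1) * (2 - 1) = 1 degree of freedom; the critical value
# at the 5% significance level for 1 degree of freedom is 3.841.
CRITICAL_5_PERCENT_DF1 = 3.841
print(statistic, statistic > CRITICAL_5_PERCENT_DF1)
```

For this invented table every expected count is 25, so the statistic is 4.0, which just exceeds the 5% critical value; the null hypothesis of independence would be rejected at that level.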

# Lecture 18: Hypothesis Testing and Correlation

Where the last lecture was about summary statistics for a single set of data, we now address *multi-dimensional* data with several linked sets of values among which we might look for *correlations*. This leads into several more sophisticated questions which are key to the effective application of statistics: how do we identify potential effects like correlation; how do we know when we have found evidence for an effect; and what might this tell us about any *causal* connections?
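One common way to quantify correlation between two linked sets of values is the Pearson correlation coefficient. The sketch below assumes that measure (other correlation measures exist) and uses invented values; `pearson` is a name chosen here for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (n times the covariance).
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / sqrt(spread_x * spread_y)


# Invented values: the second list rises exactly in step with the first,
# the third falls exactly in step with it.
hours = [1, 2, 3, 4]
rising = [2, 4, 6, 8]
falling = [8, 6, 4, 2]
print(pearson(hours, rising))   # close to +1: perfect positive correlation
print(pearson(hours, falling))  # close to -1: perfect negative correlation
```

A coefficient near +1 or -1 indicates a strong linear relationship, but says nothing by itself about whether one quantity causes the other.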

# Lecture 17: Data Scales and Summary Statistics

This morning’s lecture gave a general overview of *statistics* and their role in analysing quantities of data. Most of the technical constructions (mean, median, mode, standard deviation) are probably familiar to many, but the setting for their application and the computational context may not be.
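For a concrete feel of the computational side, the four summary statistics named above can be computed with Python's standard `statistics` module. The data values are invented for illustration, and the sketch uses the population standard deviation (`pstdev`); use `stdev` instead when the values are a sample from a larger population.

```python
import statistics

# Invented data set, chosen so that the results come out as round numbers.
values = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(values)              # arithmetic average
median = statistics.median(values)          # middle value of the sorted data
mode = statistics.mode(values)              # most frequently occurring value
std_dev = statistics.pstdev(values)         # population standard deviation

print(mean, median, mode, std_dev)
```

Here the mean is 5, the median is 4.5, the mode is 4 and the population standard deviation is 2.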

# Lecture 16: Vector Spaces for Information Retrieval

Today’s lecture presented various techniques to support effective information retrieval: the *big bag of words* model; *term-frequency inverse document frequency* (tf-idf); the *vector space model*; and *cosine similarity* for document ranking.
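The following is a minimal sketch of how these pieces can fit together: documents are treated as bags of words, each term is weighted by tf-idf, and documents are compared by cosine similarity. There are several variants of tf-idf weighting; this sketch assumes raw term counts multiplied by log(N / document frequency), which may differ from the exact formula used in the lecture. The toy documents and function names are invented for illustration.

```python
from collections import Counter
from math import log, sqrt


def tf_idf_vectors(documents):
    """Turn tokenised documents into sparse vectors (dicts) of tf-idf weights."""
    n = len(documents)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)  # bag of words: term -> count, order ignored
        vectors.append({t: c * log(n / doc_freq[t]) for t, c in counts.items()})
    return vectors


def cosine(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Invented toy collection, tokenised by splitting on whitespace.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs".split(),
]
vectors = tf_idf_vectors(docs)
print(cosine(vectors[0], vectors[1]))  # positive: the documents share terms
print(cosine(vectors[0], vectors[2]))  # 0.0: no terms in common
```

To rank documents against a query, the query can be treated as one more short document, converted to a vector in the same way, and the documents sorted by decreasing cosine similarity to it.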

# Lecture 15: Information Retrieval

Following the rectangular tables of relational databases and the triangular trees of semistructured data, the remaining Inf1-DA lectures will address the representation and analysis of more *unstructured* data. Today’s lecture provided a brief introduction to the classic *information retrieval task* of searching a large collection of documents to find those that match a simple query.
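To make the retrieval task concrete, here is a small sketch that finds the documents containing every word of a simple query. It assumes an inverted index (a map from each term to the set of documents containing it), which is one standard way to do this and not necessarily the approach presented in the lecture; the documents and function names are invented for illustration.

```python
from collections import defaultdict


def build_index(documents):
    """Build an inverted index: term -> set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Copy the first posting set so intersecting does not modify the index.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Invented toy collection keyed by document id.
documents = {
    1: "relational databases store tables",
    2: "semistructured data forms trees",
    3: "databases and trees store data",
}
index = build_index(documents)
print(search(index, "store databases"))  # documents 1 and 3
print(search(index, "trees data"))       # documents 2 and 3
```

This is pure Boolean matching: a document either matches or it does not, with no ranking of the results.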

# Lecture 14: Example Corpora Applications

Corpora are widely used for computational research into language, and for engineering natural-language computer systems. In linguistics, they make it possible to do real experimental science: to formulate hypotheses about the structure of languages, or changes in language between different places, times or people; and then test these on data. In building applications that handle text or speech, corpora provide the mass quantities of raw material used for machine learning and other algorithms.

# Lecture 13: Annotation of Corpora

Today’s lecture described some of the annotations added to text corpora, how they are generated, and some simple analyses; as well as indicating how these relate to applications such as empirical linguistics and the engineering of systems which work with natural languages.

# Lecture 12: Corpora

In literature a *corpus* (plural *corpora*) is a collection of written texts, in particular the complete works of a single author or a body of writing on a single subject. In *computational linguistics* and in *theoretical linguistics* a corpus is a body of written or spoken text used for study of a particular language or language variety. These corpora may be very large (billions of words) and provide the raw material for experimental investigation of real-world language use: the science of *empirical linguistics*.

