In this final lecture I reviewed two questions from last year’s exams: one on entity-relationship modelling and database queries; and another on statistics and hypothesis testing. The slides give details of the questions and some possible answers, with notes on which elements are important in preparing a solution.
Today’s lecture introduced a selection of data scales — refinements of the qualitative/quantitative distinction — and discussed in more detail issues of the application and misapplication of statistics.
Continue reading Lecture 19: Data Scales; Correlation and Causation
Today’s lecture revisited the idea of correlation in data sets, and introduced the method of hypothesis testing for assessing whether features observed in a sample could plausibly have arisen by chance.
For paired series of numerical data we can use the correlation coefficient, and for qualitative data the χ² statistic. The lecture included examples of these applied to last year’s Inf1-DA exam results, bigram frequency in the British National Corpus, and possible gender bias in student admissions to Berkeley in 1973.
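Both statistics can be computed directly from their definitions. The sketch below uses plain Python with small, entirely hypothetical data sets (invented exam marks and an invented 2×2 admissions table, not the actual Inf1-DA, BNC or Berkeley figures):

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient for two paired series of numerical data."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def chi_squared(table):
    """Chi-squared statistic for a contingency table of observed counts:
    sum over cells of (observed - expected)^2 / expected, where the
    expected count assumes rows and columns are independent."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return sum((obs - row_totals[i] * col_totals[j] / total) ** 2
               / (row_totals[i] * col_totals[j] / total)
               for i, row in enumerate(table)
               for j, obs in enumerate(row))

# Hypothetical paired marks for five students on two exams
print(pearson_r([55, 62, 71, 48, 80], [58, 60, 75, 50, 77]))

# Hypothetical 2x2 table: rows = two groups, columns = admitted/rejected
print(chi_squared([[120, 80], [90, 110]]))
```

A large χ² value suggests the observed counts deviate from what independence would predict; whether it is large enough to reject the null hypothesis depends on the degrees of freedom and the chosen significance level, as covered in the lecture.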
Continue reading Lecture 18: Hypothesis Testing and chi²
This morning’s lecture gave a general overview of statistics and their role in analysing quantities of data. Most of the technical constructions — mean, median, mode, standard deviation — are probably familiar to many, but the setting for their application and the computational context may not be.
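As a reminder of how those constructions fit together, here is a minimal sketch in plain Python, applied to a small invented list of marks (the population standard deviation is shown; the lecture context determines whether the sample version, dividing by n − 1, is wanted instead):

```python
import math
from collections import Counter

def summary(data):
    """Return the mean, median, mode and population standard deviation."""
    n = len(data)
    mean = sum(data) / n
    s = sorted(data)
    mid = n // 2
    # Median: middle value, or average of the two middle values
    median = s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2
    # Mode: most frequent value (one of them, if there are ties)
    mode = Counter(data).most_common(1)[0][0]
    # Population standard deviation: root mean squared deviation from the mean
    sd = math.sqrt(sum((x - mean) ** 2 for x in data) / n)
    return mean, median, mode, sd

marks = [55, 60, 60, 65, 70, 90]  # hypothetical exam marks
print(summary(marks))
```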
Continue reading Lecture 17: Summary Statistics
Today’s lecture presented various techniques to support effective information retrieval: the big bag of words model; term frequency–inverse document frequency (tf-idf); the vector space model; and cosine similarity for document ranking.
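These techniques chain together naturally: each document becomes a bag of word counts, the counts are reweighted by tf-idf, and the resulting vectors are compared by cosine similarity. A rough sketch in plain Python, using whitespace tokenisation and one common idf variant, log(N/df) — real systems differ in tokenisation, weighting and smoothing:

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Represent each document as a sparse vector (dict) of tf*idf weights."""
    N = len(docs)
    tokenised = [doc.lower().split() for doc in docs]
    df = Counter()                      # document frequency of each term
    for toks in tokenised:
        df.update(set(toks))
    vecs = []
    for toks in tokenised:
        tf = Counter(toks)              # the "bag of words": term counts only
        vecs.append({t: tf[t] * math.log(N / df[t]) for t in tf})
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs"]
vecs = tf_idf_vectors(docs)
print(cosine(vecs[0], vecs[1]))   # share several terms: similarity > 0
print(cosine(vecs[0], vecs[2]))   # no terms in common: similarity 0
```

To rank documents against a query, the query itself is treated as a short document, turned into a vector the same way, and the documents are sorted by their cosine similarity to it.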
Continue reading Lecture 16: Vector Spaces for Information Retrieval
Following the rectangular tables of relational databases and the triangular trees of semistructured data, the remaining Inf1-DA lectures will address the representation and analysis of unstructured data. Today’s lecture provided a brief introduction to the classic information retrieval task of searching a large collection of documents to find those that match a simple query.
Continue reading Lecture 15: Information Retrieval
Corpora are widely used for computational research into language, and for engineering natural-language computer systems. In linguistics, they make it possible to do real experimental science: to formulate hypotheses about the structure of languages, or changes in language between different places, times or people; and then test these on data. In building applications that handle text or speech, corpora provide large quantities of raw material for machine learning and other algorithms.
Continue reading Lecture 14: Example Corpora Applications
Today’s lecture described some of the annotations added to text corpora, how they are generated, and some simple analyses; as well as indicating how these relate to applications such as empirical linguistics and the engineering of systems which work with natural languages.
Continue reading Lecture 13: Annotation of Corpora
In literary studies, a corpus (plural corpora) is a collection of written texts, in particular the complete works of a single author or a body of writing on a single subject. In computational linguistics and in theoretical linguistics, a corpus is a body of written or spoken text used for the study of a particular language or language variety. These corpora may be very large (billions of words) and provide the raw material for experimental investigation of real-world language use: the science of empirical linguistics.
Continue reading Lecture 12: Corpora