Notes on solutions for this week’s tutorial are now online. Tutors are marking coursework submissions, and will return them with feedback in next week’s tutorial.
Thanks to everyone who contributed to the revision topics poll. Interest was strongest in material from the earlier parts of the course, so I’ll pick some past questions in that area to go over in lectures.
Link: Tutorial notes
Today’s lecture covered more on hypothesis testing, presenting the χ² test and working through three examples: student Inf1-DA exam results in 2011, bigram frequency in the British National Corpus, and possible gender bias in student admissions to Berkeley in 1973.
Continue reading Lecture 19: chi² Testing on Categorical Data
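For anyone who wants to see the arithmetic behind a χ² test, here is a minimal sketch in Python. The 2×2 table of counts is made up purely for illustration (it is not the exam, corpus, or Berkeley data from the lecture), and the function name `chi_squared` is my own. The critical value 3.841 is the standard 5% threshold for one degree of freedom.

```python
def chi_squared(observed):
    """Return the chi-squared statistic for a table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Count expected in this cell if rows and columns were independent
            expected = row_totals[i] * col_totals[j] / total
            statistic += (obs - expected) ** 2 / expected
    return statistic


# Hypothetical counts, invented for this example only
table = [[30, 10],
         [20, 40]]
statistic = chi_squared(table)

# A 2x2 table has (2 - 1) * (2 - 1) = 1 degree of freedom
CRITICAL_5_PERCENT_1_DF = 3.841
reject_null = statistic > CRITICAL_5_PERCENT_1_DF
print(f"chi-squared = {statistic:.2f}, reject independence at 5%: {reject_null}")
```

With these invented counts the statistic comes out well above the threshold, so the null hypothesis of independence would be rejected at the 5% level.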
Where the last lecture was about summary statistics for a single set of data, we now address multi-dimensional data with several linked sets of values among which we might look for correlations. This leads into several more sophisticated questions which are key to the effective application of statistics: how do we identify potential effects like correlation; how do we know when we have found evidence for an effect; and what might this tell us about any causal connections?
Continue reading Lecture 18: Hypothesis Testing and Correlation
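As a small illustration of looking for correlation between two linked sets of values, the sketch below computes a correlation coefficient by hand. I am assuming Pearson's coefficient here as one common choice; the function name `pearson` and the paired values are hypothetical, not survey data from the course.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and sums of squared deviations
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Invented paired observations, for illustration only
hours_a = [1, 2, 3, 4, 5]
hours_b = [2, 4, 6, 8, 10]      # rises exactly in step with hours_a
hours_c = [10, 8, 6, 4, 2]      # falls exactly in step with hours_a

print(pearson(hours_a, hours_b))
print(pearson(hours_a, hours_c))
```

A value near +1 or -1 indicates a strong linear relationship; on its own it says nothing about which variable, if either, causes the other.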
The sheet of exercises for Tutorial 8 posted last night was unclear about whether Question 2 was measuring sleep or exercise hours. This was my error. It’s now corrected, and there’s a revised version online in the usual place. My apologies, and thanks to the student who raised this on Piazza and to Fabian for fixing it.
Exercises for Tutorial 8 are now online, as well as notes on solutions for Tutorial 7.
This week you will apply statistical tests to the survey data gathered in the first lecture of the course, looking for possible connections between sleep, exercise, and choice of operating system. This uses techniques to be presented in this afternoon’s lecture and, awkwardly, next Tuesday’s. To support this, I’ve prepared and posted both sets of lecture slides in advance.
Thanks to everyone who submitted their coursework assignment yesterday. The submissions are now going out to individual tutors, who will mark them and give you feedback on your work in the Week 11 tutorial.
Links: Tutorial exercises; Lecture slides
This morning’s lecture gave a general overview of statistics and their role in analysing quantities of data. Most of the technical constructions — mean, median, mode, standard deviation — are probably familiar to many, but the setting for their application and the computational context may not be.
Continue reading Lecture 17: Data Scales and Summary Statistics
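To show the computational side of these summary statistics, here is a short sketch using Python's standard `statistics` module. The data values are invented for the example, and I am assuming the population standard deviation (`pstdev`) rather than the sample version (`stdev`); which one is appropriate depends on whether the data are the whole population or a sample from it.

```python
import statistics

# Invented data set, for illustration only
values = [2, 3, 3, 5, 7]

mean = statistics.mean(values)        # arithmetic average
median = statistics.median(values)    # middle value when sorted
mode = statistics.mode(values)        # most frequent value
spread = statistics.pstdev(values)    # population standard deviation

print(f"mean={mean}, median={median}, mode={mode}, std dev={spread:.3f}")
```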
Today’s lecture presented various techniques to support effective information retrieval: the bag of words model; term-frequency inverse document frequency (tf-idf); the vector space model; and cosine similarity for document ranking.
Continue reading Lecture 16: Vector Spaces for Information Retrieval
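Here is a rough sketch of how these pieces fit together: each document becomes a bag of words, words are weighted by tf-idf, and documents are ranked against a query by cosine similarity. There are several variants of tf-idf; I am assuming raw term counts for tf and log(N / document frequency) for idf, which may differ from the weighting used in the lecture. The three tiny documents, the query, and the function names are all made up for illustration.

```python
import math
from collections import Counter


def tf_idf_vector(tokens, idf):
    """Sparse tf-idf vector (term -> weight); terms unseen in the corpus are dropped."""
    counts = Counter(tokens)
    return {term: count * idf[term] for term, count in counts.items() if term in idf}


def cosine(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical miniature document collection
docs = {
    "d1": "the cat sat on the mat",
    "d2": "the dog sat on the log",
    "d3": "cats and dogs",
}
tokenised = {name: text.split() for name, text in docs.items()}

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for tokens in tokenised.values() for term in set(tokens))
idf = {term: math.log(len(docs) / df) for term, df in doc_freq.items()}

vectors = {name: tf_idf_vector(tokens, idf) for name, tokens in tokenised.items()}
query_vec = tf_idf_vector("cat mat".split(), idf)

# Rank documents by similarity to the query, best match first
ranking = sorted(
    ((cosine(query_vec, vec), name) for name, vec in vectors.items()),
    reverse=True,
)
for score, name in ranking:
    print(f"{name}: {score:.3f}")
```

Note that "cats" in d3 does not match the query term "cat": a plain bag of words treats them as unrelated tokens unless some normalisation such as stemming is applied first.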
Exercises for Tutorial 7 are now online, together with notes on solutions for Tutorial 6.
Question 1 you can do immediately; Question 2 requires material covered in Friday’s lecture.
The exercises for this tutorial are shorter than those in previous weeks. That’s because you will also be working on the coursework assignment. This tutorial is an opportunity for you to talk about that and ask your tutor questions. Please plan for this: come to the tutorial prepared to discuss your progress on the assignment.
Link: Tutorial exercises
Following the rectangular tables of relational databases and the triangular trees of semistructured data, the remaining Inf1-DA lectures will address the representation and analysis of more unstructured data. Today’s lecture provided a brief introduction to the classic information retrieval task of searching a large collection of documents to find those that match a simple query.
Continue reading Lecture 15: Information Retrieval
Corpora are widely used for computational research into language, and for engineering natural-language computer systems. In linguistics, they make it possible to do real experimental science: to formulate hypotheses about the structure of languages, or changes in language between different places, times or people; and then test these on data. In building applications that handle text or speech, corpora provide the mass quantities of raw material used for machine learning and other algorithms.
Continue reading Lecture 14: Example Corpora Applications