Lecture 18: Hypothesis Testing and chi²

Today’s lecture revisited the idea of correlation in data sets, and introduced the method of hypothesis testing for identifying whether features observed in samples in fact arise by chance.

For paired series of numerical data we can use the correlation coefficient, and for qualitative data the χ² statistic. The lecture included examples of these applied to last year’s Inf1-DA exam results, bigram frequency in the British National Corpus, and possible gender bias in student admissions to Berkeley in 1973.
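As a rough illustration of the two statistics, here is a minimal Python sketch. The function names `pearson_r` and `chi_squared` and all of the numbers are made up for this example; none of the data comes from the exam results, the BNC, or the Berkeley admissions figures.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation coefficient of two equal-length numeric series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def chi_squared(table):
    """Chi-squared statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical paired marks (illustrative only).
marks_a = [1, 2, 3, 4, 5]
marks_b = [2, 4, 6, 8, 10]
r = pearson_r(marks_a, marks_b)

# Hypothetical 2x2 table of counts (illustrative only).
counts = [[10, 20], [20, 10]]
chi2 = chi_squared(counts)

print(f"r = {r:.3f}, chi-squared = {chi2:.3f}")
```

For a 2×2 table there is one degree of freedom, so a χ² value above roughly 3.84 would lead us to reject the null hypothesis of independence at the 5% significance level.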

British National Corpus

In the most recent tutorial exercises you used the cqp tool to search a 3-million-word Dickens corpus. We also have the 96-million-word British National Corpus installed under cqp, which you can explore by selecting BNC at the command line.

$ cqp -e
[no corpus]> BNC
BNC> AllWords = [word="[a-zA-Z].*"]
BNC> size AllWords
96063265
BNC>

This has part-of-speech and lemma information like the Dickens corpus, using the CLAWS5 POS tag set. As this corpus is much larger, you will find that queries take noticeably longer to execute.

I also recommend reading the following article on the design and creation of the BNC.

This includes information about text corpora in general, as well as specific details about how the BNC came about.

Tutorial Exercises: Information Retrieval

The tutorials web page now has the latest set of tutorial exercises. The coursework assignment has also been running for a week now, and is due in on Thursday next week. The Information Retrieval tutorial work is fairly brief, which will allow you to spend time in the tutorial discussing any questions you have about the assignment. Please do take advantage of this: attempt every question in the assignment before your tutorial, and note down any concerns you have.

Links: Tutorial exercises; Coursework assignment

Lecture 15: Information Retrieval

Following the rectangular tables of relational databases and the triangular trees of semistructured data, the remaining Inf1-DA lectures will address the representation and analysis of more unstructured data. Today’s lecture provided a brief introduction to the classic information retrieval task of searching a large collection of documents to find those that match a simple query.
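To make the retrieval task concrete, here is a minimal Python sketch of one simple approach: build an index from words to documents, then return the documents that contain every word of the query. The helper names `build_index` and `search`, the three toy documents, and the choice of "match all query words" are assumptions for illustration, not necessarily the model presented in the lecture.

```python
from collections import defaultdict

def build_index(documents):
    """Map each lowercased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    # Copy the first posting set so the index itself is never modified.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Toy document collection (hypothetical, for illustration only).
docs = {
    1: "relational databases store rectangular tables",
    2: "semistructured data forms trees",
    3: "information retrieval searches large document collections",
}
index = build_index(docs)
matches = search(index, "data trees")
print(matches)
```

Building the index once up front means each query only touches the sets for its own words, rather than rescanning every document.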