The tutorials web page now has the latest set of exercises. As before, Question 1 is based on the content of Tuesday’s lecture, and material for Question 2 appears on Friday.
Continue reading Week 10 Tutorial Exercises: Statistical Analysis
Exam Information
I’ve added a page with dates and other information about the Inf1-DA course exam. I’ll also review this in the Week 11 lectures.
Tutorial 7: Notes and Solutions
Notes and solutions for the information retrieval questions in Tutorial 7 are now on the tutorials web page.
Lecture 17: Summary Statistics
This morning’s lecture gave a general overview of statistics and their role in analysing quantities of data. Most of the technical constructions — mean, median, mode, standard deviation — are probably familiar to many, but the setting for their application and the computational context may not be.
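As a rough illustration of the four summary statistics named above, here is a minimal sketch using Python's standard `statistics` module. The list `marks` is made-up sample data, not anything from the lecture, and the choice between population and sample standard deviation is shown only because both are commonly used.

```python
import statistics

# Hypothetical sample data: marks for a small class (illustrative only).
marks = [52, 61, 61, 68, 70, 74, 85]

mean = statistics.mean(marks)        # arithmetic average
median = statistics.median(marks)    # middle value of the sorted data
mode = statistics.mode(marks)        # most frequently occurring value
pop_sd = statistics.pstdev(marks)    # population standard deviation (divides by n)
sample_sd = statistics.stdev(marks)  # sample standard deviation (divides by n - 1)

print(f"mean={mean:.2f} median={median} mode={mode}")
print(f"population sd={pop_sd:.2f} sample sd={sample_sd:.2f}")
```

The sample standard deviation is always slightly larger than the population one for the same data, because it divides by n - 1 rather than n.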
Lecture 16: Vector Spaces for Information Retrieval
Today’s lecture presented various techniques to support effective information retrieval: the big bag of words model; term-frequency inverse document frequency (tf-idf); the vector space model; and cosine similarity for document ranking.
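To show how these pieces can fit together, here is a small sketch of tf-idf weighting and cosine ranking over a toy collection. The documents, the query, and the helper names (`idf`, `tfidf_vector`, `cosine`) are all invented for illustration, and the weighting used (raw term count times log(N / document frequency)) is one common variant that may differ in detail from the one given in the lecture.

```python
import math
from collections import Counter

# Hypothetical toy collection; each document is treated as a bag of words.
docs = {
    "d1": "the cat sat on the mat".split(),
    "d2": "the dog chased the cat".split(),
    "d3": "stock markets fell sharply today".split(),
}

def idf(term, docs):
    """Inverse document frequency: log(N / number of documents containing term)."""
    df = sum(1 for words in docs.values() if term in words)
    return math.log(len(docs) / df) if df else 0.0

def tfidf_vector(words, vocab, docs):
    """Map a bag of words to a vector of tf-idf weights over the vocabulary."""
    tf = Counter(words)
    return [tf[t] * idf(t, docs) for t in vocab]

def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# One vector dimension per distinct term in the collection.
vocab = sorted({w for words in docs.values() for w in words})

query = "cat mat".split()
q_vec = tfidf_vector(query, vocab, docs)

# Rank documents by decreasing cosine similarity to the query vector.
ranking = sorted(
    docs,
    key=lambda d: cosine(q_vec, tfidf_vector(docs[d], vocab, docs)),
    reverse=True,
)
print(ranking)
```

Here the document containing both query terms ranks first, the one sharing only "cat" second, and the unrelated document last.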
British National Corpus
In the most recent tutorial exercises you used the cqp tool to search a 3-megaword Dickens corpus. We also have the 96-megaword British National Corpus installed under cqp, which you can explore by selecting BNC at the command line.
    $ cqp -e
    [no corpus]> BNC
    BNC> AllWords = [word="[a-zA-Z].*"]
    BNC> size AllWords
    96063265
    BNC>
The BNC has part-of-speech and lemma information like the Dickens corpus, using the Claws 5 POS tag set. As this corpus is much larger, you will find queries take noticeably longer to execute.
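For anyone wanting to see what the AllWords query is counting, the following Python sketch applies the same regular expression to a short token list. It does not use cqp; the `tokens` list is hypothetical, and `re.fullmatch` is used on the assumption that, as in CQP, the pattern has to match the whole token rather than a part of it.

```python
import re

# Hypothetical token list standing in for a corpus; cqp itself is not used here.
tokens = ["It", "was", "the", "best", "of", "times", ",", "1859", "!", "e-mail"]

# Same pattern as in the query [word="[a-zA-Z].*"]: a letter, then anything.
# fullmatch requires the pattern to cover the entire token.
pattern = re.compile(r"[a-zA-Z].*")

all_words = [t for t in tokens if pattern.fullmatch(t)]
size = len(all_words)

print(all_words)
print(size)
```

Tokens that begin with a letter are kept, so punctuation and numbers are excluded, which is why the count above is a count of word tokens rather than of all corpus positions.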
I also recommend reading the following article on the design and creation of the BNC.
- Gavin Burnage and Glynis Baguley. The British National Corpus. Library and Information Briefings 65, February 1996.
This includes information about text corpora in general, as well as specific details about how the BNC came about.
Tutorial 6: Notes and Solutions
Notes and solutions for the corpus querying exercises in Tutorial 6 are now on the tutorials web page.
Tutorial Exercises: Information Retrieval
The tutorials web page now has the latest set of tutorial exercises. The coursework assignment has also been running for a week now, and is due in on Thursday next week. The Information Retrieval tutorial work is fairly brief, which will allow you to spend time in the tutorial discussing any questions you have about the assignment. Please do take advantage of this: attempt every question in the assignment before your tutorial, and note down any concerns you have.
Lecture 15: Information Retrieval
Following the rectangular tables of relational databases and the triangular trees of semistructured data, the remaining Inf1-DA lectures will address the representation and analysis of more unstructured data. Today’s lecture provided a brief introduction to the classic information retrieval task of searching a large collection of documents to find those that match a simple query.
Lecture 14: Example Corpora Applications
Corpora are widely used for computational research into language, and for engineering natural-language computer systems. In linguistics, they make it possible to do real experimental science: to formulate hypotheses about the structure of languages, or changes in language between different places, times or people; and then test these on data. In building applications that handle text or speech, corpora provide the mass quantities of raw material used for machine learning and other algorithms.