# Lecture 17: Data Scales and Summary Statistics

This morning’s lecture gave a general overview of statistics and their role in analysing quantities of data. Most of the technical constructions — mean, median, mode, standard deviation — are probably familiar to many, but the setting for their application and the computational context may not be.
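As a rough illustration of the computational side, the four summary statistics named above can be computed with Python's standard `statistics` module. This is only a sketch: the `marks` list is made-up data, and the population standard deviation (`pstdev`) is an assumption, since the excerpt does not say whether the lecture used the population or the sample form.

```python
import statistics

# Hypothetical data set: eight exam marks, invented for illustration.
marks = [52, 67, 67, 71, 48, 90, 67, 58]

summary = {
    "mean": statistics.mean(marks),      # arithmetic average
    "median": statistics.median(marks),  # middle value of the sorted data
    "mode": statistics.mode(marks),      # most frequent value
    # Population standard deviation (divides by n); use statistics.stdev
    # for the sample version, which divides by n - 1.
    "stdev": statistics.pstdev(marks),
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Which of these is meaningful depends on the scale of the data: the mode makes sense for any data, while the mean and standard deviation need numeric values for which arithmetic is meaningful.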

# Lecture 16: Vector Spaces for Information Retrieval

Today’s lecture presented various techniques to support effective information retrieval: term frequency-inverse document frequency (tf-idf); the bag-of-words model; the vector space model; and cosine similarity for document ranking.
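To show how these pieces might fit together, here is a minimal sketch in Python. The documents, the query and all function names are hypothetical, and the weighting used (raw term count times `log(N / df)`) is just one common tf-idf variant; the formula in the lecture may differ.

```python
import math
from collections import Counter

def tokenize(text):
    """Lower-case and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()

def build_idf(docs):
    """Inverse document frequency: log(N / number of documents containing the term)."""
    bags = [Counter(tokenize(d)) for d in docs]
    doc_freq = Counter(term for bag in bags for term in bag)
    return {term: math.log(len(docs) / df) for term, df in doc_freq.items()}

def tf_idf(text, idf):
    """Bag-of-words vector as a sparse dict: term count weighted by idf."""
    bag = Counter(tokenize(text))
    return {term: count * idf.get(term, 0.0) for term, count in bag.items()}

def cosine(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical three-document collection and query.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stock markets fell sharply today",
]
idf = build_idf(docs)
query_vec = tf_idf("cat mat", idf)
scores = [cosine(query_vec, tf_idf(d, idf)) for d in docs]

# Rank document indices from most to least similar to the query.
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking, [round(s, 3) for s in scores])
```

Because cosine similarity depends only on the angle between vectors, a long document is not favoured over a short one merely for repeating terms more often.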

# Contribute to the Spoken British National Corpus

Fancy being recorded for posterity as an example of real-life speakers of British English? Then sign up for the new Spoken British National Corpus. Researchers at Lancaster and Cambridge are compiling a new, publicly accessible corpus of English speech.

“We aim to encourage people from all over the UK to record their interactions and send them to us as MP3 files. For each hour of good quality recordings we receive, along with all associated consent forms and information sheets completed correctly, we will pay £18.”

# Tutorial Solutions

I’ve just posted notes and solutions to this week’s tutorial exercises, on Corpus Querying.

The Inf1-DA assignment is continuing through this week and next week. If you have any questions or problems with this, then please do go along to InfBASE; ask on Piazza; ask me after Friday’s lecture; or ask your course tutor.

# Lecture 15: Information Retrieval

Following the rectangular tables of relational databases and the triangular trees of semistructured data, the remaining Inf1-DA lectures will address the representation and analysis of more unstructured data. Today’s lecture provided a brief introduction to the classic information retrieval task of searching a large collection of documents to find those that match a simple query.
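One simple way to picture the task is an inverted index that maps each term to the set of documents containing it, with a query matching those documents that contain every query term. The sketch below makes that assumption; the document collection, identifiers and function names are all invented for illustration, and the lecture itself may frame matching differently.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Start from the first term's postings and intersect with the rest.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical document collection.
docs = {
    "d1": "relational databases store rectangular tables",
    "d2": "semistructured data forms trees",
    "d3": "information retrieval searches unstructured documents",
}
index = build_index(docs)

print(search(index, "unstructured documents"))
print(search(index, "tables trees"))
```

Building the index once means each query only touches the postings for its own terms, rather than scanning every document in the collection.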

# Tutorial Exercises

I’ve just posted the next set of exercises to the tutorial web page. These are about Information Retrieval, which we shall be covering in this week’s lectures.

The Inf1-DA assignment is running through this week and next week. If you have any questions or problems with this, then please do go along to InfBASE, ask on Piazza, or ask your course tutor.

# Lecture 14: Example Corpora Applications

Corpora are widely used for computational research into language, and for engineering natural-language computer systems. In linguistics, they make it possible to do real experimental science: to formulate hypotheses about the structure of languages, or changes in language between different places, times or people; and then test these on data. In building applications that handle text or speech, corpora provide the mass quantities of raw material used for machine learning and other algorithms.

# Coursework Assignment and Tutorial Solutions

Next week’s tutorial exercises have been on the web pages since Monday, but I now have two new things to release.

• Coursework assignment. Available immediately, due for submission in two weeks.

• Notes and solutions to this week’s tutorial exercises, including files with validating XML and all the XPath queries.

Please post to Piazza or email me if you have any questions about the assignment, or the current tutorial exercises.

# Lecture 13: Annotation of Corpora

Today’s lecture described some of the annotations added to text corpora, how they are generated, and some simple analyses; as well as indicating how these relate to applications such as empirical linguistics and the engineering of systems which work with natural languages.

# Lecture 12: Corpora

In literature a corpus (plural corpora) is a collection of written texts, in particular the complete works of a single author or a body of writing on a single subject. In computational linguistics and in theoretical linguistics a corpus is a body of written or spoken text used for study of a particular language or language variety. These corpora may be very large (billions of words) and provide the raw material for experimental investigation of real-world language use: the science of empirical linguistics.

Note: There was a photocopied handout in today’s lecture, and all copies were taken. I’ve placed a large number of additional copies in the ITO; see homework below for details.