## Lecture 17: Data Scales and Summary Statistics

This lecture gave a general overview of statistics and their role in analysing quantities of data. Most of the technical constructions — mean, median, mode, standard deviation — are probably familiar to many, but the setting for their application and the computational context may not be.
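As a small illustration of the summary statistics named above, here is a minimal sketch using Python's standard `statistics` module. The `summarise` helper and the list of marks are made up for this example and are not taken from the lecture.

```python
import statistics


def summarise(data):
    """Return the basic summary statistics for a list of numbers."""
    return {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "mode": statistics.mode(data),
        # Population standard deviation: divides by n.
        "population_sd": statistics.pstdev(data),
        # Sample standard deviation: divides by n - 1.
        "sample_sd": statistics.stdev(data),
    }


# Hypothetical data: invented marks, purely for illustration.
marks = [52, 61, 61, 70, 74, 58, 61, 90]
summary = summarise(marks)

for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Whether the population or the sample standard deviation is appropriate depends on whether the data is the whole population or a sample drawn from it, which is why the sketch reports both.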


## Lecture 16: Vector Spaces for Information Retrieval

The vector space model for information retrieval treats documents as vectors in a very high-dimensional space, with one dimension for every distinct word and each coordinate giving the number of times that word occurs in the document. Across a collection of documents, these vectors combine to give a document matrix. We can assess how well a document matches a query by cosine similarity, the cosine of the angle between their vector representations. Evaluating cosine similarity between the query and every document gives a ranking by relevance to the query.
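The ranking procedure described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: words are obtained by lowercasing and splitting on whitespace, raw counts are used as coordinates, and the function names (`term_vector`, `cosine_similarity`, `rank`) and example documents are invented here rather than taken from the lecture.

```python
import math
from collections import Counter


def term_vector(text):
    """Map each distinct word to the number of times it occurs."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse word-count vectors."""
    # Only words present in both vectors contribute to the dot product.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An empty vector has no direction; treat as no match.
    return dot / (norm_a * norm_b)


def rank(query, documents):
    """Return (score, document) pairs, most similar to the query first."""
    q = term_vector(query)
    scored = [(cosine_similarity(q, term_vector(d)), d) for d in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical mini-collection, purely for illustration.
documents = [
    "the cat sat on the mat",
    "dogs chase cats",
    "the stock market fell",
]

for score, doc in rank("cat mat", documents):
    print(f"{score:.3f}  {doc}")
```

Note that "cats" and "cat" count as different words here, so the second document scores zero. A real system would normally normalise word forms before counting.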


## Tutorial 6: Solutions

I have posted notes on solutions to Tutorial 6 to the course web page and to NB. Following feedback from tutors and students, these also contain a few corrections and clarifications.


## Tutorial 7: Information Retrieval

I have posted the exercise sheet for Tutorial 7 on the course web page and NB. As usual, the questions themselves are followed by some examples and solutions, based this time on questions in past exam papers.

These tutorial exercises are running concurrently with the Inf1-DA assignment. Consequently, the tutorial meetings next week will also be a chance for you to discuss the assignment with your tutor, ask questions, and seek help. Please do use this opportunity: start work on the assignment well before the tutorial, take along your work, and tell your tutor which parts you find difficult.


## Lecture 15: Information Retrieval

Following the rectangular tables of relational databases and the triangular trees of semistructured data, the remaining Inf1-DA lectures will address the representation and analysis of more unstructured data. Today’s lecture provided a brief introduction to the classic information retrieval task of searching a large collection of documents to find those that match a simple query.


## Lecture 14: Example Corpora Applications

Corpora are widely used for computational research into language, and for engineering natural-language computer systems. In linguistics, they make it possible to do real experimental science: formulate hypotheses about the structure of languages, or changes in language between different places, times or people; and then test these on data. In building real applications which handle text or speech, corpora provide the base material for machine learning and other algorithms to work with.


## Coursework Assignment

The Inf1-DA web page now has details for the written coursework assignment. It’s a copy of last year’s exam, although instead of two hours in an exam hall you have two weeks to complete and submit your solutions. Follow the link and read Coursework for more details.


## Tutorial 6: Corpus Querying

I have placed the Tutorial 6 exercises on the course web page and NB. This week you’ll again be using a command-line tool — `cqp`, the Corpus Query Processor — and analysing a corpus of all works by Charles Dickens. Before working through the tutorial you will need to read the CQP Tutorial, also available for download on the course web page.

At the same time, I have put up solution notes from Tutorial 5, together with files containing example XPath queries and complete results returned by `xmllint`.


## Lecture 13: Annotation of Corpora

This lecture described examples of the kinds of annotation added to text corpora, how they are generated, and some simple analyses of them. It also indicated how these relate to applications such as empirical linguistics and the engineering of systems that work with natural languages.


## Lecture 12: Corpora

In literature, a corpus (plural corpora) is a collection of written texts, in particular the complete works of a single author or a body of writing on a single subject. In computational and theoretical linguistics, a corpus is a body of written or spoken text used for the study of a particular language or language variety.

In this course we look at the latter kind of corpus, as a substantial example of semistructured data used for research. This lecture briefly introduced the aims and requirements of corpora, indicating how they are used and the considerations that go into building them: things like balancing and sampling; tokenization and annotation. There were presentations of some standard large corpora, and examples of recent research results from corpus analysis.
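To make the idea of tokenization concrete, here is a rough sketch in Python that splits text into word tokens and counts how often each occurs. The regular expression is a deliberately crude assumption (lowercase ASCII letters, with an optional straight apostrophe inside a word). Real corpus tools use far more careful rules, and the `tokenize` function is made up for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens, dropping punctuation."""
    # Letters, optionally followed by an apostrophe and more letters,
    # so that "don't" stays as a single token.
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


# Sample sentence used only as illustrative input.
sample = "It was the best of times, it was the worst of times."
tokens = tokenize(sample)
counts = Counter(tokens)

print(tokens)
print(counts.most_common(3))
```

Even this toy version shows the kind of decision that tokenization forces: whether to fold case, how to treat apostrophes, and what to do with punctuation.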

## Edinburgh Student Experience Survey

This closes tomorrow. If you have not already done so, please go to MyEd and complete the survey now. Thank you.