Lecture 16: Vector Spaces for Information Retrieval


Today’s lecture presented several techniques to support effective information retrieval: the bag-of-words model; term frequency-inverse document frequency (tf-idf) weighting; the vector space model; and cosine similarity for document ranking.

The vector space model for information retrieval treats documents as vectors in a very high-dimensional space: there is a dimension for every distinct word, and the coordinate in that dimension is the number of times the word occurs in the document. For a collection of documents, these vectors combine to give a document matrix. We can assess how well a document matches a query by treating the query as a vector in the same space and computing the cosine of the angle between the two vectors: the closer this is to 1, the better the match. Evaluating this cosine similarity for every document gives a ranking by relevance to the query.
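As a rough illustration of the paragraph above, here is a minimal Python sketch of ranking by cosine similarity over raw word counts. The function names (bag_of_words, cosine_similarity, rank) and the tiny example collection are made up for this note, not taken from the slides; splitting on whitespace is a simplifying assumption, and no tf-idf weighting is applied.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Sparse document vector: maps each word to its number of occurrences.

    Assumption: words are lower-cased and separated by whitespace only.
    """
    return Counter(text.lower().split())


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse count vectors."""
    # Only words present in both vectors contribute to the dot product.
    dot = sum(count * v[word] for word, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty text matches nothing
    return dot / (norm_u * norm_v)


def rank(query, documents):
    """Return (score, document) pairs, best match first."""
    q = bag_of_words(query)
    scored = [(cosine_similarity(q, bag_of_words(d)), d) for d in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical three-document collection.
documents = [
    "the cat sat on the mat",
    "dogs chase cats",
    "vector space model for retrieval",
]

for score, doc in rank("cat mat", documents):
    print(f"{score:.2f}  {doc}")
```

With the query "cat mat", only the first document shares any words with the query, so it is ranked first. Note that "cats" does not match "cat" here: a real system would typically normalise words before counting them.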

The notion of a model is a powerful one that occurs across the natural sciences, and is actively used in Software Engineering to manage the design and maintenance of large systems. A well-defined model gives a precise representation of some aspects of a system being studied; the model need not capture everything about the system, and indeed it’s often important to abstract from concrete details.

In this case the model allows us to describe ranking and similarity without fixating on implementation details; and potentially to compare different kinds of ranking algorithm sharing the same model.

Sadly, problems with the podium PC continue and there are no recordings from AT4 today. I have logged this with Information Services, and hope they will have fixed this by next Tuesday. My apologies.

Link: Slides for Lecture 16; Music


Complete the tutorial exercise on Information Retrieval. Question 1 relates to Tuesday’s lecture; Question 2 uses material from today.

Continue working on the coursework assignment. Make sure you have started all three questions in time for your tutorial: bring along your written solutions so far, with a list of any problems you have found.

0101
Brewed by Data

Blending beer recipes by their “values, needs and emotional states”.

Link: 0101

IntelligentX
Beer Brewed by AI

Competition for 0101, using reinforcement learning and Bayesian decision making. Which are real things. Watch the video.

Link: IntelligentX

A Vector Space Model for Information Retrieval
Gerard Salton
Not in Comm. ACM, 1975; nor in J. Amer. Soc. for Inf. Science, 1975; nor indeed anywhere. This apparently highly influential paper is a bibliography virus.
Link: None. It’s a mirage.
The Most Influential Paper Gerard Salton Never Wrote
David Dubin
Library Trends 52(4): 748–764, 2004

Link: Copy in the Free Library

Karen Spärck-Jones
Pioneer in information retrieval and natural language processing.

Professor of Computers and Information
University of Cambridge Computer Laboratory

Link: Obituary