Lecture 16: Vector Spaces for Information Retrieval

Today’s lecture presented various techniques to support effective information retrieval: term-frequency inverse document frequency (tf-idf); the big bag of words model; the vector space model; and cosine similarity for document ranking.

The vector space model for information retrieval treats documents as vectors in a very high-dimensional space: a dimension for every distinct word, with the vector coordinate being the number of times the word occurs in the document. In a collection of documents, these vectors combine to give a document matrix. We can assess how well a document matches a query by representing the query as a vector in the same space and computing the angle between the two vectors: the smaller the angle, the larger its cosine and the closer the match. Evaluating this cosine similarity for every document gives a ranking of relevance to the query.
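As a rough illustration of the ranking just described, here is a minimal Python sketch. It assumes raw word counts as coordinates, splits on whitespace with no stemming or stop-word removal, and uses no tf-idf weighting; the function names `term_counts`, `cosine_similarity` and `rank` are invented for this example and do not come from the lecture.

```python
import math
from collections import Counter


def term_counts(text):
    """Sparse vector for a text: each lowercase word mapped to its count."""
    # Assumption: whitespace tokenisation, no stemming or stop-word removal.
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    # Only words present in `a` can contribute to the dot product;
    # a Counter returns 0 for words it has not seen.
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text matches nothing
    return dot / (norm_a * norm_b)


def rank(query, documents):
    """Return (score, document) pairs, most similar to the query first."""
    q = term_counts(query)
    scored = [(cosine_similarity(q, term_counts(d)), d) for d in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    docs = [
        "the cat sat on the mat",
        "dogs chase cats",
        "vector space models rank documents",
    ]
    for score, doc in rank("cat mat", docs):
        print(f"{score:.3f}  {doc}")
```

Note that "cats" in the second document does not match the query word "cat": each distinct word form is its own dimension in this simple version of the model.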

The notion of a model is a powerful one that occurs across the natural sciences, and is actively used in Software Engineering to manage the design and maintenance of large systems. A well-defined model gives a precise representation of some aspects of a system being studied; the model need not capture everything about the system, and indeed it’s often important to abstract from concrete details.

In this case the model allows us to describe ranking and similarity without fixating on implementation details; and potentially to compare different kinds of ranking algorithm sharing the same model.

Links: Slides for Lecture 16; Recording of Lecture 16

References

A Vector Space Model for Information Retrieval
Gerard Salton
Not in Comm. ACM, 1975; nor in J. Amer. Soc. for Inf. Science, 1975; nor indeed anywhere. As explained below, this apparently highly influential paper is a bibliography virus.
Link: None. It’s a mirage.
The Most Influential Paper Gerard Salton Never Wrote
David Dubin
Library Trends 52(4): 748–764, 2004

Link: Copy in the Free Library

Karen Spärck-Jones
Pioneer in information retrieval and natural language processing.

Professor of Computers and Information
University of Cambridge Computer Laboratory

Link: Obituary