Today’s lecture presented various techniques to support effective information retrieval: term frequency-inverse document frequency (tf-idf); the bag-of-words model; the vector space model; and cosine similarity for document ranking.
The vector space model for information retrieval treats documents as vectors in a very high-dimensional space: one dimension for every distinct word, with the vector coordinate being the number of times the word occurs in the document. In a collection of documents, these vectors combine to give a term-document matrix. We can assess whether a document matches a query by computing the angle between their vector representations. Evaluating cosine similarity against all documents then gives a ranking of relevance to the query.
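As a minimal sketch of this ranking step (the toy documents and query below are invented for illustration; a real system would also apply tf-idf weighting rather than raw counts):

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors (dicts)."""
    dot = sum(count * b.get(term, 0) for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical mini-collection; each document becomes a bag-of-words vector.
documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "information retrieval ranks documents",
]
vectors = [Counter(doc.split()) for doc in documents]

query = Counter("cat on mat".split())

# Rank document indices by decreasing cosine similarity to the query.
ranking = sorted(range(len(documents)),
                 key=lambda i: cosine_similarity(query, vectors[i]),
                 reverse=True)
```

Here the first document shares three query terms and ranks highest; the third shares none and has similarity zero, illustrating how the angle between vectors stands in for relevance.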
The notion of a model is a powerful one that occurs across the natural sciences, and is actively used in Software Engineering to manage the design and maintenance of large systems. A well-defined model gives a precise representation of some aspects of a system being studied; the model need not capture everything about the system, and indeed it’s often important to abstract from concrete details.
In this case the model allows us to describe ranking and similarity without fixating on implementation details, and potentially to compare different ranking algorithms that share the same model.
|A Vector Space Model for Information Retrieval|
Not in Comm. ACM, 1975; nor in J. Amer. Soc. for Inf. Science, 1975; nor indeed anywhere. As explained below, this apparently highly influential paper is a bibliography virus.
Link: None. It’s a mirage.
|The Most Influential Paper Gerard Salton Never Wrote|
Library Trends 52(4): 748–764, 2004
Link: Copy in the Free Library
Gerard Salton: pioneer in information retrieval and natural language processing; Professor of Computer Science at Cornell University.