Today’s lecture presented various techniques to support effective information retrieval: the big bag of words model; term frequency-inverse document frequency (tf-idf); the vector space model; and cosine similarity for document ranking.
The vector space model for information retrieval treats documents as vectors in a very high-dimensional space: there is a dimension for every distinct word, and the coordinate in that dimension is the number of times the word occurs in the document. For a collection of documents, these vectors combine to give a document matrix. A query is represented as a vector in the same space, and we can assess how well a document matches the query by computing the angle between the two vectors: the smaller the angle, the larger its cosine, and the closer the match. Evaluating this cosine similarity for every document in the collection gives a ranking by relevance to the query.
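As a rough illustration of the ranking just described, here is a minimal Python sketch. It assumes whitespace tokenisation, raw term counts as coordinates (no tf-idf weighting), and a toy three-document collection; the function names `term_counts`, `cosine_similarity` and `rank` and the sample texts are made up for this example, not taken from the lecture.

```python
import math
from collections import Counter


def term_counts(text):
    """Bag of words: map each lowercased word to its number of occurrences."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    # Only words present in `a` can contribute to the dot product;
    # a Counter returns 0 for words it has not seen.
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text matches nothing
    return dot / (norm_a * norm_b)


def rank(query, documents):
    """Return (score, name) pairs, best match first."""
    q = term_counts(query)
    scored = [(cosine_similarity(q, term_counts(text)), name)
              for name, text in documents.items()]
    return sorted(scored, reverse=True)


# Hypothetical toy collection, for illustration only.
documents = {
    "d1": "the cat sat on the mat",
    "d2": "the dog chased the cat",
    "d3": "stock markets fell sharply",
}

ranking = rank("cat mat", documents)
for score, name in ranking:
    print(f"{name}: {score:.3f}")
```

Storing each vector as a `Counter` keeps only the non-zero coordinates, which matters because almost every entry in a real document matrix is zero. Replacing the raw counts with tf-idf weights would change the coordinates but not the cosine calculation itself.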
The notion of a model is a powerful one that occurs across the natural sciences, and is actively used in Software Engineering to manage the design and maintenance of large systems. A well-defined model gives a precise representation of some aspects of a system being studied; the model need not capture everything about the system, and indeed it’s often important to abstract from concrete details.
In this case, the vector space model lets us describe ranking and similarity without fixating on implementation details, and potentially compare different ranking algorithms that share the same model.
Sadly, problems with the podium PC continue and there are no recordings from AT4 today. I have logged this with Information Services, and hope they will have fixed this by next Tuesday. My apologies.
Complete the tutorial exercise on Information Retrieval. Question 1 relates to Tuesday’s lecture, Question 2 uses material from today.
Continue working on the coursework assignment. Make sure you have started all three questions in time for your tutorial: bring along your written solutions so far, with a list of any problems you have found.
Brewed by Data
Blending beer recipes by their “values, needs and emotional states”.
Beer Brewed by AI
Competition for 0101, using reinforcement learning and Bayesian decision making. Which are real things. Watch the video.
A Vector Space Model for Information Retrieval
Not in Comm. ACM, 1975; nor in J. Amer. Soc. for Inf. Science, 1975; nor indeed anywhere. This apparently highly influential paper is a bibliography virus.
Link: None. It’s a mirage.
The Most Influential Paper Gerard Salton Never Wrote
Library Trends 52(4): 748–764, 2004
Link: Copy in the Free Library
Pioneer in information retrieval and natural language processing.
Professor of Computers and Information