Lecture 15: Information Retrieval

After the tightly-structured tables of relational databases and the annotated trees of semistructured data, the lectures for the remaining weeks of semester introduce ways to explore and analyse comparatively unstructured data. Today's lecture was on the classic task of information retrieval: looking in a large collection of documents for those that match a simple query.

The focus here is not on the precise algorithms or implementation, but on specifying the problem, recognising when you have a solution, and rating the performance of different competing solutions. In this case that meant making precise the notions of precision and recall in information retrieval; considering how each might matter in different problem domains; and using combined measures like the F-score to balance the two.
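These measures can be sketched in a few lines of code. The following is a minimal illustration, not anything from the lecture itself: the function name and the document IDs are made up for the example, and documents are represented simply as sets of IDs.

```python
def precision_recall_f1(retrieved, relevant):
    """Score a retrieved set of documents against the relevant set.

    Precision: what fraction of retrieved documents are relevant?
    Recall:    what fraction of relevant documents were retrieved?
    F1:        the harmonic mean of the two, balancing both.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical query: documents 1-4 retrieved, documents 3-8 relevant.
# Two of the four retrieved are relevant (precision 0.5), but only two
# of the six relevant were found (recall 1/3), giving an F1 of 0.4.
p, r, f = precision_recall_f1({1, 2, 3, 4}, {3, 4, 5, 6, 7, 8})
```

The harmonic mean penalises imbalance: a system that retrieves everything gets perfect recall but terrible precision, and the F-score stays low, which is why it is a common single-number summary.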

The lecture also looked at IBM's Watson and its performance on Jeopardy! as an example of a system combining unstructured data, natural language, and information retrieval.

Notices

Please note that the University holds all lectures and tutorials as usual on Friday 29 March (Good Friday) and Monday 1 April (Easter Monday).

Accidents of the calendar and my own miscalculation mean that the second question on the current tutorial exercise is slightly ahead of lectures, most notably for students who have tutorials on Monday or Tuesday. To help with this, I have put the relevant slides from last year on the course web page. My apologies for this oversight.

Links: Slides; Recording

References

These are all on Watson. As well as material from the lecture, there’s a video on Watson in healthcare and a more technical paper describing how it’s done. I’m very impressed by this project, and its synthesis of many ideas from across informatics, from natural language processing and knowledge representation through to information retrieval and parallel algorithms.

This entry was posted in Lecture log.