Lecture 11: Navigating XML using XPath

Once we have some semistructured data gathered into an XML tree, we might want to find information within it. Small XML documents we can simply read through, or search as text; for large and very large documents there are dedicated query languages. Today’s lecture presented one of them: XPath, the XML Path Language. As well as being a query language in its own right, XPath is also a key component of many other XML and web technologies, where it is used to navigate around documents.

As a path language, XPath describes a possible route through an XML tree. As a query language, XPath returns the set of endpoints of all such paths. This step from an individual run to considering all possible runs is crucial in raising the level of abstraction and allowing us to think bigger thoughts. It also introduces the new challenge of finding efficient ways to compute the answers to these higher-level queries.
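To make the path-versus-query distinction concrete, here is a minimal sketch in Python. It assumes a small timetable document invented purely for illustration, and it uses the standard library's `xml.etree.ElementTree`, which implements only a limited subset of XPath, so treat it as a rough illustration rather than a full XPath engine.

```python
import xml.etree.ElementTree as ET

# Hypothetical timetable document, invented for this example.
DOC = """\
<timetable>
  <course code="INF1">
    <lecture day="Tuesday">XPath</lecture>
    <lecture day="Friday">Corpora</lecture>
  </course>
  <course code="INF2">
    <lecture day="Monday">Logic</lecture>
  </course>
</timetable>
"""

root = ET.fromstring(DOC)

# Read as a path: from the root, step down to a course, then to a lecture.
# Read as a query: collect the endpoint of every route matching that description.
lectures = root.findall("./course/lecture")
titles = [lecture.text for lecture in lectures]

# A predicate in square brackets restricts which routes count.
friday = [lecture.text
          for lecture in root.findall("./course/lecture[@day='Friday']")]

print(titles)
print(friday)
```

The single expression `./course/lecture` names one kind of route, yet the result is a list with one entry per route through the tree that fits it, returned in document order.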

I also reviewed the comments from mid-semester feedback sheets, presented the XML side of the Path timetable browser, and gave a short history of character sets from ticker-tape machines through teletypes to Unicode.

Links: Slides for Lecture 11; Recording

Homework

1. Read This

Background preparation for Friday’s lecture:

McEnery and Wilson. What is a corpus and what is in it?
Chapter 2 of Corpus Linguistics.
Second edition, Edinburgh University Press, 2001.

Section 2.2.2 is optional: the technologies it describes in detail have since transformed into the XML we now use. That leaves the parts before and after — read Sections 2.1, 2.2, 2.2.1, 2.3 and 2.4.

I’ve sent this round to all students by email and distributed copies in the lecture; a few are still available outside the ITO.

2. Do This

Start on the Tutorial 5 exercises as soon as you are done with Tutorial 4. In these you use the xmllint command-line tool to check the validity of XML documents against a DTD and to run your own XPath queries.

References

Miscellany

EUSA Teaching Awards