Slides : Recording
Once we have some semistructured data gathered into an XML tree, we might want to find information within it. For small XML documents we can just look at them, or use text search; for large and very large documents there are dedicated query languages. Today’s lecture presented one of them: XPath, the XML Path Language. As well as being a query language in its own right, XPath is also a key component of many other XML and web technologies, where it is used to navigate around documents.
As a path language, XPath describes a possible route through an XML tree. As a query language, XPath returns the set of endpoints of all such paths. This step from an individual run to considering all possible runs is crucial in raising the level of abstraction and allowing us to think bigger thoughts. It also introduces the new challenge of finding efficient ways to compute the answers to these higher-level queries.
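To make the path-versus-query distinction concrete, here is a minimal sketch in Python. It uses the standard library's `xml.etree.ElementTree`, which supports only a limited subset of XPath, so it is an illustration rather than a full XPath engine. The timetable document, its element names and its attribute values are all invented for this example and are not taken from the real Path data.

```python
import xml.etree.ElementTree as ET

# Hypothetical timetable document, invented purely for illustration.
DOC = """
<timetable>
  <course code="INF1-DA">
    <event day="Tuesday"><room>Lecture Theatre A</room></event>
    <event day="Friday"><room>Lecture Theatre B</room></event>
  </course>
  <course code="INF1-OP">
    <event day="Friday"><room>Lecture Theatre C</room></event>
  </course>
</timetable>
"""

root = ET.fromstring(DOC)

# As a path: course/event/room describes one kind of route from the root,
# stepping to a course child, then an event child, then a room child.
# As a query: findall follows every route of that shape and returns
# all the endpoints, in document order.
rooms = [r.text for r in root.findall("course/event/room")]

# A predicate restricts which routes qualify at a given step.
friday_rooms = [
    r.text for r in root.findall("course/event[@day='Friday']/room")
]

print(rooms)
print(friday_rooms)
```

Note that the code never says how to walk the tree: it only states the shape of the route, and the library works out how to find every match. That is the efficiency question raised above, moved out of our hands and into the query engine.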
I also reviewed the comments from mid-semester feedback sheets, presented the XML side of the Path timetable browser, and gave a short history of character sets from ticker-tape machines through teletypes to Unicode.
1. Read This
Background preparation for Friday’s lecture:
McEnery and Wilson. What is a corpus and what is in it?
Chapter 2 of Corpus Linguistics.
Second edition, Edinburgh University Press, 2001.
Section 2.2.2 is optional: the technologies it describes in detail have since transformed into the XML we now use. That leaves the parts before and after — read Sections 2.1, 2.2, 2.2.1, 2.3 and 2.4.
I’ve sent this round to all students by email and distributed copies in the lecture; a few are still available outside the ITO.
2. Do This
Start on the Tutorial 5 exercises as soon as you are done with Tutorial 4. In these you use the xmllint command-line tool to check validity of XML against a DTD and run your own XPath queries.
- Wikipedia on XPath.
- The 10-minute XPath tutorial. I think ten minutes is rather optimistic, but I do recommend the tutorial.
- The full XPath specification XML Path Language Version 1.0. This is quite challenging, but I think worth browsing to see what the full formal standard looks like.
- Other XML technologies from the World Wide Web Consortium (W3C).
- The Inf1-DA timetable on Path, as a web page.
- The Inf1-DA timetable on Path, as XML.
- Developer information for Path timetabling: how to get exactly the details you want in the format you prefer.
- Wikipedia pages on some character sets: Baudot, ASCII, EBCDIC, ISO-8859-xx and Unicode.
- Some notes on the early years of Unicode.
- On the official Unicode site you can pay money to adopt a character.
- Or, if you feel like spending time not money, watch the Unicode slideshow, all 10⁵ characters in 3 hours.