Once we have some semistructured data gathered into an XML tree, we might want to find information within it. For small XML documents we could just look at them, or use text search; for large and very large documents there are dedicated query languages. Today’s lecture presented one of them: XPath, the XML Path Language. As well as being a query language in its own right, XPath is also a key component of many other XML and web technologies, where it is used to navigate around documents.
As a path language, XPath describes a possible route through an XML tree. As a query language, XPath returns the set of endpoints of all such paths. This step from an individual run to considering all possible runs is crucial in raising the level of abstraction and allowing us to think bigger thoughts. It also introduces the new challenge of finding efficient ways to compute the answers to these higher-level queries.
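To make the path-versus-query distinction concrete, here is a minimal sketch in Python using the standard library's `xml.etree.ElementTree`, which supports only a limited subset of XPath 1.0. The timetable document, its element names (`timetable`, `route`, `stop`) and the two helper functions are invented for illustration and are not taken from the lecture's timetable browser.

```python
import xml.etree.ElementTree as ET

# Hypothetical timetable document, invented for this example.
DOC = """
<timetable>
  <route name="Red">
    <stop>Airport</stop>
    <stop>Central</stop>
  </route>
  <route name="Blue">
    <stop>Harbour</stop>
  </route>
</timetable>
"""

root = ET.fromstring(DOC)


def follow_one_path(root):
    """Path view: one individual route through the tree."""
    route = root.find("route")   # step 1: the first <route> child
    stop = route.find("stop")    # step 2: the first <stop> child of that route
    return stop.text


def all_endpoints(root):
    """Query view: the endpoints of every path that matches route/stop."""
    return [stop.text for stop in root.findall("route/stop")]


one = follow_one_path(root)
everything = all_endpoints(root)

# A predicate narrows the set of paths: only routes whose name is "Blue".
blue_only = [s.text for s in root.findall("route[@name='Blue']/stop")]

print(one)
print(everything)
print(blue_only)
```

The first function walks a single run through the tree, choosing one child at each step; the second asks for the endpoints of all such runs at once, and leaves it to the library to decide how to find them.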
I also presented the XML side of the PATH timetable browser, and a short history of character sets from ticker-tape machines through teletypes to Unicode.
Link: Slides for Lecture 11
If you haven’t already done so, now is the time to read Sections 2.1–2.5 from Chapter 2 of Møller and Schwartzbach from Week 5. All students received a scanned PDF of this chapter by email; I have just placed additional fresh copies outside the ITO in Forrest Hill; and if those are all taken then email me. You can also find the whole book in the Library HUB.
- Wikipedia on XPath.
- The 10-minute XPath tutorial. I think ten minutes is rather optimistic, but I do recommend the tutorial.
- The full XPath specification XML Path Language Version 1.0. This is quite challenging, but I think worth browsing to see what the full formal standard looks like.
- Other XML technologies from the World Wide Web Consortium (W3C).
- Wikipedia pages on some character sets: Baudot, ASCII, EBCDIC, ISO-8859-xx and Unicode.
- Some notes on the early years of Unicode.
- On the official Unicode site you can pay money to adopt a character.
- Or, if you feel like spending time not money, watch the Unicode slideshow, all 10⁵ characters in 3 hours.