Today’s lecture described some of the annotations added to text corpora, how they are generated, and some simple analyses. It also indicated how these relate to applications such as empirical linguistics and the engineering of systems that work with natural languages.
In literature a corpus (plural corpora) is a collection of written texts, in particular the complete works of a single author or a body of writing on a single subject. In computational linguistics and in theoretical linguistics a corpus is a body of written or spoken text used for the study of a particular language or language variety. These corpora may be very large (billions of words) and provide the raw material for experimental investigation of real-world language use: the science of empirical linguistics.
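As a rough illustration of what an annotated corpus and a simple analysis might look like, here is a minimal Python sketch. The six-token corpus is made up for this example, and the tag names (DET, NOUN and so on) are assumptions rather than the tagset used in the lecture.

```python
from collections import Counter

# A toy, invented corpus: each token is paired with a part-of-speech tag.
# The tag names are illustrative assumptions, not a specific tagset.
tagged_corpus = [
    ("the", "DET"), ("cat", "NOUN"), ("sat", "VERB"),
    ("on", "PREP"), ("the", "DET"), ("mat", "NOUN"),
]


def word_frequencies(tagged):
    """Count how often each word form occurs."""
    return Counter(word for word, _ in tagged)


def tag_frequencies(tagged):
    """Count how often each annotation (tag) occurs."""
    return Counter(tag for _, tag in tagged)


word_counts = word_frequencies(tagged_corpus)
tag_counts = tag_frequencies(tagged_corpus)
print(word_counts.most_common(1))
print(tag_counts)
```

Real corpora are far larger and are usually annotated by automatic taggers, but frequency counts of this kind are a typical first analysis.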
Note: There was a photocopied handout in today’s lecture, and all copies were taken. I’ve placed a large number of additional copies in the ITO; see homework below for details.
Once we have some semistructured data gathered into an XML tree, we might want to find information within it. For small XML documents we could just look at it, or use text search; for large and very large documents there are dedicated query languages. Tuesday’s lecture presented one of them: XPath, the XML Path Language. As well as being a query language in its own right, XPath is also a key component of many other XML and web technologies, where it is used to navigate around documents.
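To give a flavour of path queries, here is a small Python sketch using the standard library's xml.etree.ElementTree, which supports only a limited subset of XPath. The library document and its element and attribute names are invented for this example.

```python
import xml.etree.ElementTree as ET

# A made-up document; element and attribute names are hypothetical.
XML_TEXT = """
<library>
  <book year="2001"><title>Alpha</title></book>
  <book year="2015"><title>Beta</title></book>
  <journal year="2015"><title>Gamma</title></journal>
</library>
""".strip()

root = ET.fromstring(XML_TEXT)

# Child steps: title elements that are children of book children of the root.
book_titles = [t.text for t in root.findall("./book/title")]

# Descendant step: every title element anywhere below the root.
all_titles = [t.text for t in root.findall(".//title")]

# Attribute predicate: only books whose year attribute equals 2015.
books_2015 = [b.find("title").text for b in root.findall("./book[@year='2015']")]

print(book_titles)
print(all_titles)
print(books_2015)
```

Full XPath offers much more than this subset (more axes, functions, and richer predicates), so treat the sketch as an indication of the style of query rather than a complete picture.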
Notes and solutions to this week’s tutorials are now available online; I’ve also put up next week’s exercises. These are about XML and XPath, using the xmllint command-line tool to analyse XML documents and execute XPath queries.
Link: Tutorial Exercises
Every well-formed XML document is neatly arranged as a tree, with names for element nodes and all their attributes. This is enough for basic tools to correctly transmit and process XML; but for many applications it is useful to add more precise domain-specific constraints that we expect documents to satisfy. For this we have XML schema languages: specialised languages for describing types of XML document. This lecture covered one in particular, the Document Type Definition (DTD) language.
Notes and solutions to the third set of tutorial exercises are available from the tutorial web page. Please note that the next tutorials are in two weeks’ time, after Innovative Learning Week.
Links: Tutorial Exercises; ILW Programme of Events
From the strict rectangles of structured data to the more generous triangles of semistructured data. This lecture gave an overview of what might qualify data as semistructured; trees in general as a mathematical model of data; the particular form of trees in the XPath data model; and their textual representation in XML, the Extensible Markup Language.
Finally, some examples of real XML data: from musical scores to financial trading.
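The link between a tree and its textual representation in XML can be sketched in a few lines of Python with the standard library's xml.etree.ElementTree. The tiny "score" document below is invented for illustration and does not follow any real music markup format.

```python
import xml.etree.ElementTree as ET

# Build a small tree by hand. The element and attribute names are hypothetical.
score = ET.Element("score", {"title": "Example"})
note1 = ET.SubElement(score, "note", {"pitch": "C4"})
note1.text = "quarter"
note2 = ET.SubElement(score, "note", {"pitch": "E4"})
note2.text = "half"

# Serialise the tree to its textual XML form.
xml_text = ET.tostring(score, encoding="unicode")
print(xml_text)


def describe(node, depth=0):
    """Return one indented line per element node, showing the tree shape."""
    lines = ["  " * depth + node.tag]
    for child in node:
        lines.extend(describe(child, depth + 1))
    return lines


# Parsing the text gives back a tree with the same shape.
reparsed = ET.fromstring(xml_text)
outline = describe(reparsed)
print("\n".join(outline))
```

Note that ElementTree's own tree model differs in detail from the XPath data model (for instance in how it stores text), so this shows the general idea only.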
Today’s lecture was the final one on Structured Data and covered a range of database topics: ACID properties for transactions; the NoSQL movement; nested SQL queries, set operations, and aggregate queries; ultimate physical limits to computation; the wonders of nature captured in SkyServer; and the idea of doing scientific research and experiments from inside the database.
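For the SQL topics in that list, here is a minimal sketch of a nested query, a set operation, and an aggregate query, run from Python against an in-memory SQLite database. The tables, column names, and rows are all invented for this example and are not taken from the lecture.

```python
import sqlite3

# In-memory database with two made-up tables (hypothetical schema and data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE takes (student_id INTEGER, course TEXT, mark INTEGER);
INSERT INTO student VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cat');
INSERT INTO takes VALUES (1, 'DA', 70), (1, 'OOP', 60), (2, 'DA', 50), (3, 'OOP', 80);
""")

# Nested query: students with at least one mark above the overall average.
nested = conn.execute("""
    SELECT name FROM student
    WHERE id IN (SELECT student_id FROM takes
                 WHERE mark > (SELECT AVG(mark) FROM takes))
    ORDER BY name
""").fetchall()

# Set operation: students who take DA but not OOP.
difference = conn.execute("""
    SELECT student_id FROM takes WHERE course = 'DA'
    EXCEPT
    SELECT student_id FROM takes WHERE course = 'OOP'
""").fetchall()

# Aggregate query: number of enrolments and average mark per course.
aggregate = conn.execute("""
    SELECT course, COUNT(*), AVG(mark) FROM takes
    GROUP BY course ORDER BY course
""").fetchall()

print(nested)
print(difference)
print(aggregate)
conn.close()
```

SQLite is used here only because it ships with Python; the same queries should carry over to other SQL systems with little or no change.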
Lovelace Colloquium 2015
Thursday 9 April 2015
One-day conference for women students of Computing and related subjects
We’re proud to be hosting the colloquium at Edinburgh this year.
The aims of the event are:
- To provide a forum for women undergraduate and masters students to share their ideas and network;
- To provide a stimulating series of talks from women in computing, both from academia and industry;
- To provide both formal (talks) and informal (networking) advice to undergraduate women about careers in computing from a female perspective.
There are poster competitions for women students at all levels of study. Everyone with a poster accepted for the meeting is eligible for travel funding to attend, and there are additional cash prizes from industry sponsors.
Link: Poster contest — Enter your 250-word abstract by 28 February
The organisers are also looking for students to help coordinate social events. If you don’t want to make a poster but are interested in getting involved then please contact
See also the Edinburgh University Hoppers for women in Informatics.
Links: Lovelace Colloquium; Edinburgh University Hoppers
Notes and solutions on this week’s tutorial exercises are now online: see the “Notes” field in the table of tutorials.
Link: Tutorial Exercises