In the most recent tutorial exercises you used the cqp tool to search a 3-million-word Dickens corpus. We also have the 96-million-word British National Corpus installed under cqp, which you can explore by selecting BNC at the command line.
The tutorials web page now has the latest set of tutorial exercises. The coursework assignment has also been running for a week now, and is due in on Thursday next week. The Information Retrieval tutorial work is fairly brief, which will allow you to spend time in the tutorial discussing any questions you have about the assignment. Please do take advantage of this: attempt every question in the assignment before your tutorial, and note down any concerns you have.
Following the rectangular tables of relational databases and the triangular trees of semistructured data, the remaining Inf1-DA lectures will address the representation and analysis of more unstructured data. Today’s lecture provided a brief introduction to the classic information retrieval task of searching a large collection of documents to find those that match a simple query. Continue reading Lecture 15: Information Retrieval→
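As a rough illustration of the retrieval task described above, here is a minimal Python sketch of one common approach: build an inverted index from words to documents, then answer a simple query by intersecting the matching document sets. This is not taken from the lecture; the function names `build_index` and `search`, the whitespace tokenisation, and the three tiny example documents are all assumptions made for illustration.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercased word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumption: splitting on whitespace is a good enough tokeniser here.
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    # Start from the documents matching the first word, then narrow down.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical toy collection, invented for this example.
docs = {
    1: "it was the best of times",
    2: "it was the worst of times",
    3: "a tale of two cities",
}
index = build_index(docs)
print(search(index, "best times"))
```

A real retrieval system would also rank the matching documents and handle punctuation and word variants, which this sketch ignores.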
Corpora are widely used for computational research into language, and for engineering natural-language computer systems. In linguistics, they make it possible to do real experimental science: to formulate hypotheses about the structure of languages, or changes in language between different places, times or people; and then test these on data. In building applications that handle text or speech, corpora provide the mass quantities of raw material used for machine learning and other algorithms. Continue reading Lecture 14: Example Corpora Applications→
Today’s lecture described some of the annotations added to text corpora, how they are generated, and some simple analyses; as well as indicating how these relate to applications such as empirical linguistics and the engineering of systems which work with natural languages. Continue reading Lecture 13: Annotation of Corpora→
In literature a corpus (plural corpora) is a collection of written texts, in particular the complete works of a single author or a body of writing on a single subject. In computational linguistics and in theoretical linguistics a corpus is a body of written or spoken text used for study of a particular language or language variety. These corpora may be very large (billions of words) and provide the raw material for experimental investigation of real-world language use: the science of empirical linguistics. Continue reading Lecture 12: Corpora→