In the most recent tutorial exercises you used the cqp tool to search a 3-million-word Dickens corpus. We also have the 96-million-word British National Corpus (BNC) installed under cqp, which you can explore by selecting BNC at the command line.
$ cqp -e
[no corpus]> BNC
BNC> AllWords = [word="[a-zA-Z].*"]
BNC> size AllWords
96063265
BNC>
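The AllWords query above counts only tokens whose word form begins with an ASCII letter, so punctuation and purely numeric tokens are left out of the total. The following Python sketch illustrates that filter outside cqp. It assumes that cqp matches the regular expression against the whole token, which re.fullmatch approximates here. The function name count_words and the sample token list are hypothetical and are not taken from the BNC.

```python
import re

# The same pattern as in the cqp query. Assumption: cqp anchors the
# pattern to the whole token, which re.fullmatch approximates.
WORD_PATTERN = re.compile(r"[a-zA-Z].*")


def count_words(tokens):
    """Count the tokens that begin with an ASCII letter."""
    return sum(1 for tok in tokens if WORD_PATTERN.fullmatch(tok))


# Hypothetical token list for illustration only (not BNC data).
sample = ["It", "was", "the", "best", "of", "times", ",",
          "1859", "'s", "o'clock", "."]

print(count_words(sample))
```

Tokens such as "," and "1859" are skipped, and so is "'s", because it starts with an apostrophe rather than a letter. A token such as "o'clock" is still counted, since only its first character has to be a letter.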
Like the Dickens corpus, the BNC is annotated with part-of-speech and lemma information; its POS tags use the CLAWS5 (C5) tag set. Because this corpus is much larger, you will find that queries take noticeably longer to execute.
I also recommend reading the following article on the design and creation of the BNC.
- Gavin Burnage and Glynis Baguley. The British National Corpus. Library and Information Briefings 65, February 1996.
This includes information about text corpora in general, as well as specific details about how the BNC came about.