This lecture followed on from Friday’s in looking at the use of hypothesis testing to detect correlations in data. The first section examined the χ² test for working with qualitative data, using two demonstration examples: possible correlation between coursework submission and exam grades; and the discovery of collocations in large text corpora.
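As a rough illustration of the χ² test on qualitative data, here is a minimal sketch in Python. The counts are invented for demonstration and are not the lecture's data; the function name `chi_squared` is also just illustrative. It compares observed counts in a contingency table with the counts expected if the two factors were independent.

```python
def chi_squared(table):
    """Return the chi-squared statistic for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count we would expect in this cell if rows and columns were independent
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Hypothetical counts (assumption, not real course data):
# rows = coursework submitted / not submitted, columns = exam passed / failed
observed = [[30, 10],
            [10, 30]]

# Standard critical value for 1 degree of freedom at the 5% level
CRITICAL_5_PERCENT_1_DF = 3.841

stat = chi_squared(observed)
print("chi-squared statistic:", stat)
print("exceeds 5% critical value:", stat > CRITICAL_5_PERCENT_1_DF)
```

A 2×2 table has one degree of freedom, so a statistic above 3.841 would lead us to reject the null hypothesis of independence at the 5% level. That is evidence of correlation only, not of causation.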
The second part of the lecture looked in more detail at some of the risks in misapplying statistical tests. Hypothesis testing can be a tremendously sensitive and powerful tool for discovering new science and identifying the connections between events. However, when used poorly it becomes misleading and unhelpful. The lecture covered a range of concerns about these risks: confusing correlation with causation; what p-values can tell us and what they can’t; when statistical “significance” is really about being statistically detectable; p-hacking, data dredging, outcome switching; and the current replication crisis in some experimental sciences. There is also hope and success, though: in the discovery of robust results through meta-analysis; the active discussions around reproducibility and predictive power in scientific research; and the many projects to record trials, replicate results, and improve publication of both negative and positive outcomes.
Today’s lecture presented the idea of correlation in data sets: observing correlations through scatter plots; measuring them with the correlation coefficient; and using hypothesis testing to see whether the measured value gives evidence to distinguish a real correlation from chance coincidence. In this way we get increasingly precise and sensitive measures for detecting correlation.
But remember: correlation does not imply causation. More on that next time.
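To make the correlation coefficient concrete, here is a small sketch in Python. The data values and the function names `pearson_r` and `t_statistic` are invented for illustration. The t statistic shown is one common way to test a correlation against the null hypothesis of no correlation; it is not necessarily the procedure used in the lecture.

```python
import math


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """Return the t statistic (n - 2 degrees of freedom) for testing r against zero."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))


# Hypothetical paired measurements (assumption, not course data)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = pearson_r(xs, ys)
print("correlation coefficient:", r)
print("t statistic:", t_statistic(r, len(xs)))
```

The resulting t value would then be compared with a critical value from tables for n − 2 degrees of freedom. With only five points, even a fairly large coefficient may not be statistically detectable.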
Notes and solutions for the Information Retrieval tutorial are now online.
I have rearranged the content of the remaining two tutorials. Because of the strike action there is no written assignment this year; normally this would be a practice exam paper reviewed in Week 11. Instead, the tutorial exercise for next week is two practice exam questions, and in the tutorial itself you will work through assessing your solutions with the tutor using the original examiners’ marking guidelines.
The final tutorial exercises will be in the following week on the topic of Statistical Analysis. By that time I will have addressed the necessary syllabus content in lectures.
Link: Tutorial exercises
Corpora are widely used for computational research into language, and for engineering natural-language computer systems. In linguistics, they make it possible to do real experimental science: to formulate hypotheses about the structure of languages, or changes in language between different places, times or people; and then test these on data. In building applications that handle text or speech, corpora provide the mass quantities of raw material used for machine learning and other algorithms.
I have put more course material online.
Notes and solutions for Tutorial 4, with a LibreOffice Base file of SQL queries.
Exercises for Tutorial 6, with instructions for the CQP tool you will be using.
Link: Tutorial exercises and notes
Regrettably, there is no recording of Lecture 13 from 2017. Last year’s blog post has slides and references; for video you might do best to go back to the course from 2014/15.
As I write this, the University and Colleges strike is still set to run from Monday to Thursday this week.
Universities UK and UCU have agreed to further talks mediated by the conciliation service ACAS. These will begin tomorrow. However, UUK have not conceded that any of the disputed pension changes are open to review and the strike action continues.
Link: Union statement on further talks
I think it’s excellent that we are now seeing some talks, but I haven’t yet heard anything from UUK suggesting progress toward any actual change. Several university vice-chancellors, though, have made such commitments for their own universities. Here at Edinburgh the Principal, Prof. Mathieson, is meeting with staff on Monday after more than 300 professors sent him an open letter.
(The letter was limited to professors because they have a formal role in governing the University as members of the Senate.)
Link: Letter to University Principal
Today’s lecture was cancelled owing to the heavy snowfall. In its place I recommend you watch the recording from last year, linked on the right. I did trial a YouTube Live stream yesterday, but in practice it didn’t offer much over what this earlier screencast does. I have changed the reading and references slightly from last time: the handout “What is a corpus and what is in it?” is no longer required. In the recording you will also see some TopHat questions used last year. I don’t think those turned out too well in the lectures, but I will look into recovering them for you to try outside lectures.
Staff in the University and College Union (UCU) will be on strike on Monday, Tuesday and Wednesday this week. This is part of a continuing dispute with Universities UK (UUK) over proposed changes to pension arrangements. Further strikes are planned over the following weeks: this schedule may change if the parties involved return to talks or take up arbitration.
Link: University guidance for students
This is the event that Riccardo Fiorista advertised at the start of Friday’s lecture. He’s a second-year Informatics student who has set this up with other students from Edinburgh and Heriot-Watt.
Links: Start|ED event site; Map; Booking information
A well-formed XML document is one that is properly arranged as a tree, with names for element nodes and all their attributes. This is enough for basic tools to correctly transmit and process XML; but for many applications it is useful to add more precise domain-specific constraints that we expect documents to satisfy. For this we have XML schema languages: specialised languages for describing types of XML document. This lecture covered one in particular, the Document Type Definition language DTD.
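As a small illustration of well-formedness, the following Python sketch uses the standard library's `xml.etree.ElementTree` parser to check whether a string parses as a properly nested tree. The element names and the helper `is_well_formed` are invented for this example. Note that this parser only checks well-formedness; it does not validate a document against a DTD, which would need a separate validating tool.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# Hypothetical documents (element and attribute names are assumptions)
good = "<course code='da'><title>Data and Analysis</title></course>"
bad = "<course><title>Data and Analysis</course></title>"  # tags overlap, so not a tree

print("good is well-formed:", is_well_formed(good))
print("bad is well-formed:", is_well_formed(bad))

# A DTD could add domain-specific constraints on top of this, for example
# a declaration such as <!ELEMENT course (title)> requiring that every
# course element contains exactly one title element.
```

The second document fails because its closing tags are in the wrong order, so the elements do not nest as a tree. A schema language such as DTD goes further and constrains which elements and attributes may appear where.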
Notes for Tutorial 3 are now online, as well as the next set of exercises. These include practice working directly with SQL in the LibreOffice desktop database tool. There are extension exercises on working with GUI query generation and also command-line interaction with a local PostgreSQL database server.
Link: Tutorial Exercises