Lecture 19: χ² Testing on Categorical Data

This lecture covered more on hypothesis testing, presenting the χ² test and working through three examples: student Inf1-DA exam results in 2011, bigram frequency in the British National Corpus, and possible gender bias in student admissions to Berkeley in 1973.

The χ² test is a tool for assessing potential correlations in categorical data, where it is not possible to apply the correlation coefficient measures used on quantitative data.

Use of χ² follows the standard pattern for statistical tests: we have a null hypothesis that there is no correlation between the different categories of data, and the test gives the probability p that, if this were true, we would observe data at least as far from the expected pattern as that actually seen. If p is very low, we reject the null hypothesis and take the statistical result as evidence suggesting a correlation.
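To make the pattern concrete, here is a minimal sketch in Python of a χ² calculation on a 2×2 table. The counts are invented purely for illustration (they are not any of the lecture's datasets), and the function names `chi_squared` and `p_value_one_df` are my own. The p-value shortcut relies on the fact that a χ² variable with one degree of freedom is the square of a standard normal, so it only applies to 2×2 tables.

```python
import math


def chi_squared(table):
    """Return (statistic, degrees_of_freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected

    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df


def p_value_one_df(statistic):
    """Upper-tail probability of a chi-squared value with ONE degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2))


# Hypothetical 2x2 table: rows are two groups, columns are two outcomes.
observed = [[30, 10],
            [20, 40]]

statistic, df = chi_squared(observed)
p = p_value_one_df(statistic)
print(f"chi-squared = {statistic:.3f}, df = {df}, p = {p:.6f}")
```

With these made-up counts the statistic is well above 3.84, the usual critical value for p = 0.05 at one degree of freedom, so the null hypothesis of no correlation would be rejected. For larger tables the statistic is computed the same way, but p has to come from tables or a statistics library.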

Of course, as usual, correlation does not imply causation, but a correlation may lead us to investigate possible mechanisms of causation, which might in turn give rise to predictions that can be repeatedly tested.

This repetition is key to establishing concrete scientific results. I gave an outline of an example where appropriate statistical meta-analysis can use this to squeeze significant information out of pre-existing datasets: specifically, the results of Lau et al. on heart attack treatments. They discovered that strong evidence for a certain treatment had already been present in the data fifteen years before the results of individual large trials brought the treatment into common use. The Cochrane Collaboration now routinely carries out large and influential medical meta-analyses, and uses a statistical “blobbogram” demonstrating the effect as its logo.
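As a rough illustration of how pooling repeated trials sharpens a result, the sketch below applies one common technique, fixed-effect inverse-variance pooling, cumulatively as trials are added. This is an assumption on my part about a reasonable way to show the idea, not necessarily the procedure Lau et al. used. The effect sizes and standard errors are invented, and the function names are hypothetical.

```python
import math


def pool_fixed_effect(effects, std_errors):
    """Inverse-variance weighted mean of effect estimates, with its standard error."""
    weights = [1 / se ** 2 for se in std_errors]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    pooled_se = math.sqrt(1 / sum(weights))
    return pooled, pooled_se


def cumulative_pool(effects, std_errors):
    """Pool the first k trials for k = 1, 2, ..., n (trials in date order)."""
    return [
        pool_fixed_effect(effects[:k], std_errors[:k])
        for k in range(1, len(effects) + 1)
    ]


# Invented data: one effect estimate (e.g. a log odds ratio) per small trial.
effects = [-0.40, -0.10, -0.30, -0.25, -0.20]
std_errors = [0.50, 0.40, 0.30, 0.25, 0.20]

for k, (pooled, se) in enumerate(cumulative_pool(effects, std_errors), start=1):
    z = pooled / se
    flag = "significant at 5%" if abs(z) > 1.96 else "not significant"
    print(f"after {k} trial(s): effect = {pooled:+.3f}, se = {se:.3f}, {flag}")
```

Each trial on its own is too small to be convincing, but the pooled standard error shrinks with every trial added, which is how a combined analysis can reach significance before any single trial does.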

Links: Slides for Lecture 19; Video of Lecture 19


The final lecture, this Friday, will review the course and address exam preparation. Please fill out the online poll to indicate particular areas you would like covered.

Link: Doodle poll on review topics


Correlation as Cosine Similarity
  • Jordan Ellenberg. How Not to Be Wrong: The Power of Mathematical Thinking. Penguin, 2014.

    Links: Author page; Publisher page; Edinburgh University Library

    The link between correlation coefficients and cosine similarity is described in Chapter 15 on Galton’s Ellipse. I recommend the whole book, though. There’s a copy available for loan on the third floor of the Main Library right now (2015-03-24 14:04).
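The connection can be checked numerically: the correlation coefficient of two data series equals the cosine of the angle between them once each has had its mean subtracted. The following is a minimal sketch with made-up numbers; the function names are my own rather than anything from the book.

```python
import math


def centre(values):
    """Subtract the mean from every value."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pearson(xs, ys):
    """Correlation coefficient: covariance divided by both standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (sd_x * sd_y)


# Invented paired measurements.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 1.0, 4.0, 3.0, 6.0]

r = pearson(xs, ys)
c = cosine_similarity(centre(xs), centre(ys))
print(f"correlation = {r:.6f}, cosine of centred vectors = {c:.6f}")
```

The two printed values agree up to floating-point rounding. Without the centring step they generally differ, which is why the cosine similarity of raw vectors is not the same thing as correlation.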

Statistics and Evaluation of Medical Treatments
Berkeley Admissions
  • P. J. Bickel, E. A. Hammel, and J. W. O’Connell. Sex Bias in Graduate Admissions: Data from Berkeley. Science, 187(4175):398–404, 1975. DOI: 10.1126/science.187.4175.398.

    This closely analyses the admissions data to conclude that it does point at serious issues of discrimination, although not quite in the places first indicated.

    Link: Full text via EASE login

    “The bias in the aggregated data stems not from any pattern of discrimination on the part of admissions committees, which seem quite fair on the whole, but apparently from prior screening at earlier levels of the educational system. Women are shunted by their socialization and education toward fields of graduate study that are generally more crowded, less productive of completed degrees, and less well funded, and that frequently offer poorer professional employment prospects.”

  • Wikipedia page on Simpson’s Paradox, of which the Berkeley admissions is a well-known example.
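The paradox is easy to reproduce with a toy example. The figures below are invented for illustration and are not the Berkeley data; the department names and the helper functions are likewise hypothetical. Women have the higher admission rate in each department, yet the lower rate overall, because most of them apply to the department that admits fewer people.

```python
# Invented figures: (applicants, admitted) per department and group.
applications = {
    "Dept A": {"men": (80, 48), "women": (20, 14)},  # admits most applicants
    "Dept B": {"men": (20, 4), "women": (80, 24)},   # admits few applicants
}


def rate(applied, admitted):
    """Fraction of applicants admitted."""
    return admitted / applied


def aggregate_rate(group):
    """Admission rate for one group with all departments lumped together."""
    applied = sum(dept[group][0] for dept in applications.values())
    admitted = sum(dept[group][1] for dept in applications.values())
    return rate(applied, admitted)


for name, dept in applications.items():
    print(f"{name}: men {rate(*dept['men']):.0%}, women {rate(*dept['women']):.0%}")

print(f"Overall: men {aggregate_rate('men'):.0%}, women {aggregate_rate('women'):.0%}")
```

A χ² test on the aggregated table alone would suggest a bias against women, while the same test run department by department would point the other way. This is why the level at which categorical data is grouped matters.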