Today’s lecture revisited the idea of correlation in data sets, and introduced hypothesis testing as a method for judging whether features observed in samples could simply have arisen by chance.
For paired series of numerical data we can use the correlation coefficient, and for qualitative data the χ² statistic. The lecture included examples of these applied to last year’s Inf1-DA exam results, bigram frequency in the British National Corpus, and possible gender bias in student admissions to Berkeley in 1973.
Link: Slides for Lecture 18; Recording; Music
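As a rough illustration of the two statistics named in the summary, here is a minimal Python sketch. It assumes the Pearson form of the correlation coefficient and the usual observed-versus-expected form of χ² for a contingency table; the function names `pearson_r` and `chi_squared` and all the numbers are invented for illustration and are not taken from the lecture's data sets.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two paired numerical series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def chi_squared(table):
    """Chi-squared statistic for a contingency table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


# Made-up paired marks (not real exam results).
coursework = [1, 2, 3, 4]
exam = [2, 4, 6, 8]
print(pearson_r(coursework, exam))

# Made-up 2x2 table of observed counts.
observed = [[10, 20], [20, 10]]
print(chi_squared(observed))
```

For the invented table every expected count is 15, so the statistic comes out at about 6.67. A 2×2 table has one degree of freedom, for which the 5% critical value is 3.841, so under these made-up counts the null hypothesis of independence would be rejected at that level.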
1. Do This
- Hack Your Way To Scientific Glory
Find “statistically significant” results yourself in 60 years of data on the US economy.
2. Read This
- Christie Aschwanden. Science Isn’t Broken.
FiveThirtyEight: Science, August 2015
- P. J. Bickel, E. A. Hammel, and J. W. O’Connell. Sex Bias in Graduate Admissions: Data from Berkeley. Science, 187(4175):398–404, 1975.
This closely analyses the admissions data to conclude that it does point at serious issues of discrimination, although not quite in the places first indicated.
The bias in the aggregated data stems not from any pattern of discrimination on the part of admissions committees, which seem quite fair on the whole, but apparently from prior screening at earlier levels of the educational system. Women are shunted by their socialization and education toward fields of graduate study that are generally more crowded, less productive of completed degrees, and less well funded, and that frequently offer poorer professional employment prospects.
- Wikipedia page on Simpson’s Paradox, of which the Berkeley admissions data is a well-known example.
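To see how Simpson’s Paradox can arise, the following Python sketch uses a small invented data set. The counts, the two departments "A" and "B", and the helper `aggregate_rate` are all hypothetical and are not the Berkeley figures. In this toy data women are admitted at a higher rate than men within each department, yet at a lower rate overall, because most women apply to the more competitive department.

```python
# Hypothetical counts only: department -> group -> (admitted, applicants).
applications = {
    "A": {"men": (80, 100), "women": (18, 20)},
    "B": {"men": (4, 20), "women": (30, 100)},
}


def aggregate_rate(data, group):
    """Admission rate for a group, pooled across all departments."""
    admitted = sum(dept[group][0] for dept in data.values())
    applicants = sum(dept[group][1] for dept in data.values())
    return admitted / applicants


# Per-department rates: women do better in both A and B.
for name, groups in applications.items():
    for group, (admitted, applicants) in groups.items():
        print(f"Dept {name}, {group}: {admitted / applicants:.0%}")

# Pooled rates: the direction of the comparison reverses.
for group in ("men", "women"):
    print(f"Overall, {group}: {aggregate_rate(applications, group):.0%}")
```

With these invented numbers the per-department rates are 80% against 90% in A and 20% against 30% in B, while the pooled rates are 70% for men and 40% for women.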