A collection of examples – both silly and serious – for illustrating key ideas from statistical thinking.
Classroom resources: Selection Bias
The Mother of All Sampling Biases
Natural Parenting, by xkcd.
"On one hand, every single one of my ancestors going back billions of years has managed to figure it out. On the other hand, that's the mother of all sampling biases."
Success Stories are Selection Biased
"This article demonstrates how the selection of cases for study on the basis of outcomes on the dependent variable biases conclusions. It first lays out the logic of explanation and shows how it is violated when only cases that have achieved the outcome of interest are studied. It then examines three well-known and highly regarded studies in the field of comparative politics, comparing the conclusions reached in the original work with a test of the arguments on cases selected without regard for their position on the dependent variable. In each instance, conclusions based on the uncensored sample differ from the original conclusions."
Classroom resources: Models
Correlation and Causation from "The West Wing"
"PRESIDENT BARTLETT: [Post hoc, ergo propter hoc]. After it, therefore because of it. It means one thing follows the other, therefore it was caused by the other. But it's not always true, in fact it's hardly ever true. We did not lose Texas because of the hat joke."
Link; quote begins at 1m10s.
Thinking Statistically Key Terms
Selection bias is the bias your inferences pick up when you use a non-random sample but pretend that it's random.
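The mechanics are easy to see in a simulation. Below is a minimal sketch with made-up income data: a genuinely random sample recovers the population mean, while a sample that only reaches the top half of the distribution quietly overestimates it.

```python
import random
from statistics import fmean

random.seed(0)

# Hypothetical population: 10,000 incomes from a right-skewed distribution.
population = [random.lognormvariate(10, 0.5) for _ in range(10_000)]
true_mean = fmean(population)

# A genuinely random sample gives an (approximately) unbiased estimate.
random_sample = random.sample(population, 500)

# A non-random sample: only people above the median respond, e.g. a survey
# handed out at an expensive conference. Treating it as random biases us upward.
median = sorted(population)[len(population) // 2]
selected_sample = [x for x in population if x > median][:500]

print(f"true mean:            {true_mean:10.0f}")
print(f"random-sample mean:   {fmean(random_sample):10.0f}")
print(f"selected-sample mean: {fmean(selected_sample):10.0f}")  # systematically too high
```

The danger is not the non-random sample itself but the pretending: the selected-sample mean is a perfectly good estimate of the wrong quantity.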
A model is a simplified or abstracted description of a system that captures the essence of what that system does.
A dependent variable is so called because it can’t vary freely within our model: its value is dependent on the values taken by the inputs.
Independent variables are so called because their variation should not be determined by any of the other variables in the equation. Like teenagers, independent variables won’t let anyone tell them what to do or be.
An error term sweeps up any random variation in outcomes and represents it as a single term. Error terms only work if the variation they encode is truly random. If we accidentally create a model where the variation is systematic, not random, we can run into lots of trouble.
The planning fallacy is the tendency for people to consistently underestimate how long it will take them to complete a given task.
Correlation, loosely defined, means that two variables change in relationship with each other: for example, a rise in sneezing is accompanied by a rise in punching. Causation, loosely defined, means that one thing directly caused the other thing to happen.
Omitted variable bias occurs, essentially, when we omit a variable from our model that has a significant impact on the outcome.
Exogenous variation is variation that originates outside the system of interest.
Endogenous variation originates within the system of interest.
A conditional probability is the probability of one event happening given that another did. Statisticians write P(X|Y) to represent the probability of X happening given that Y did.
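Conditional probabilities are easy to estimate by counting: restrict attention to the trials where Y happened, then ask how often X happened among those. A small dice simulation as a sketch:

```python
import random

random.seed(3)

# Roll two dice many times; estimate P(sum = 8 | first die is even).
trials = [(random.randint(1, 6), random.randint(1, 6)) for _ in range(100_000)]

given_y = [t for t in trials if t[0] % 2 == 0]       # event Y: first die even
x_and_y = [t for t in given_y if t[0] + t[1] == 8]   # X and Y together

p_x_given_y = len(x_and_y) / len(given_y)
print(f"P(sum=8 | first die even) ~ {p_x_given_y:.3f}")  # exact value: 3/18 = 1/6
```

Dividing the count of "X and Y" trials by the count of "Y" trials is exactly the definition P(X|Y) = P(X and Y) / P(Y) in empirical form.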
A hypothesis is a causal explanation for why something happened.
An alternative hypothesis is a different possible causal explanation for why something has happened – different to the main hypothesis we're exploring at the moment.
The prior probability for a hypothesis is the probability for it before we see any new evidence.
The posterior probability for a hypothesis is the probability for it after we've incorporated what we've learned from the new evidence.
The base rate fallacy occurs whenever you neglect to take account of the base rate, a.k.a. the prior probability that something was true before new evidence was introduced.
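Bayes' rule makes the fallacy concrete. With hypothetical test numbers — a rare condition (1% prior), a 99% sensitive test, a 5% false-positive rate — the posterior after a positive test is far smaller than intuition suggests:

```python
# Hypothetical numbers for illustration.
prior = 0.01        # P(condition): the base rate
sensitivity = 0.99  # P(positive | condition)
false_pos = 0.05    # P(positive | no condition)

# Bayes' rule: P(condition | positive)
#   = P(positive | condition) * P(condition) / P(positive)
p_positive = sensitivity * prior + false_pos * (1 - prior)
posterior = sensitivity * prior / p_positive
print(f"P(condition | positive test) = {posterior:.3f}")  # about 0.167, not 0.99
```

Neglecting the 1% base rate — reasoning only from the 99% sensitivity — would put the answer near 0.99 instead of roughly 1 in 6.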