Selection bias and predictive metrics


Freddie DeBoer has a nice post up at the ANOVA where he discusses how an administrator at Harvard, trying to figure out the correlation between SAT scores and first-year grades, could be tripped up by the statistical issue of restriction of range. Freddie’s point is a good one, and analyzing the variance of covariates is an important diagnostic step, but there’s also a slightly more subtle reason his example is flawed: selection bias. When people say things like “the SAT/GRE don’t predict success well, so we should stop using them”, they’re often falling victim to selection bias, and unfortunately, many of the stories discussing standardized tests don’t properly deal with it either.

To see the issue, let’s consider an extremely simplified model of personal achievement in which first-year grades are based on two factors: IQ and grit. When evaluating files, college admissions counselors get to see two things: an applicant’s SAT score, which depends on IQ, and a measure of extracurricular leadership, which depends on grit. Applicants are ranked on the sum of their SAT and leadership scores, and only the top 2% are admitted. In R, the model looks like this:

library(tidyverse)  # loads ggplot2, dplyr, and tibble, among others

set.seed(1234)

N = 10000

admissions_rate = 0.02

# intrinsic factors 

grit = rnorm(N)
IQ = rnorm(N)

# factors that determine college admission

sat = IQ + 0.2 * rnorm(N)
extracurricular = grit + 0.2 * rnorm(N)
admit_suitability =  sat +  extracurricular + 0.1 * rnorm(N)

# admits

admit_indices = order(-admit_suitability)[1 : round(N * admissions_rate)]
admitted = seq(N) %in% admit_indices

# college gpa

college_gpa = grit + IQ + rnorm(N)

# merge into single df

df = (tibble(sat, IQ, extracurricular, grit, 
             college_gpa, admitted) %>%
        arrange(admitted))
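Before running any regressions, it’s worth a quick sanity check that the cutoff behaves as intended; with N = 10000 and a 2% admissions rate, exactly 200 applicants should be flagged as admitted:

# sanity check: the admitted flag should cover 2% of applicants
mean(df$admitted)   # should be 0.02
sum(df$admitted)    # should be 200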

Note that this model doesn’t have the top-censoring problem that Freddie pointed out with the real-life SAT. Now we can see how well a hypothetical administrator’s estimate of the relationship between SAT scores and college grades among admitted students would generalize to the rest of the applicant population:

population_regression = lm(college_gpa ~ sat, 
                           data=filter(df, admitted==F))

admitted_regression = lm(college_gpa ~ sat, 
                         data=filter(df, admitted==T))

print(paste("Coefficient on SAT in population:", 
            round(population_regression$coefficients['sat'],3)))

print(paste("Coefficient on SAT in admitted sample:", 
            round(admitted_regression$coefficients['sat'],3)))
[1] "Coefficient on SAT in population: 0.924"
[1] "Coefficient on SAT in admitted sample: 0.151"

What does this mean? Let’s say we were comparing two potential students from the non-admitted population, knew only their SAT scores, and wanted to predict the difference in their college GPAs. The results of our simulation say that if we used data from the admitted population to train our model, the predicted difference would be over six times too small. Why? Selection bias. If we plot SAT score vs. college GPA, we see that the sample of admitted students shows a much flatter relationship than the population as a whole:

[Figure: SAT score vs. college GPA for admitted students and the full applicant pool]
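For reference, a plot along these lines can be generated with ggplot2 (the styling here is arbitrary, not the original figure’s):

ggplot(df, aes(x = sat, y = college_gpa, color = admitted)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "SAT score", y = "College GPA", color = "Admitted")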

Intuitively, admitted students with a relatively low SAT score are much less likely to have a low college GPA than non-admitted students with the same SAT score. This is because admitted students were admitted for a reason. If a student has a low SAT score, they were probably admitted because they have a lot of grit—and that grit will also help them do better in college. This shows up pretty starkly if we plot SAT scores vs. grit:

[Figure: SAT score vs. grit for admitted students and the full applicant pool]
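The same kind of plot works here, swapping college GPA for grit on the vertical axis (again just a sketch, not the original figure):

ggplot(df, aes(x = sat, y = grit, color = admitted)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "SAT score", y = "Grit", color = "Admitted")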

The selection process has created a strong negative relationship between grit and SAT score among admitted students, even though the two variables were generated independently, and so are essentially uncorrelated, in the population as a whole.
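We can quantify this by comparing the SAT–grit correlation among admits to the correlation in the full sample:

# correlation between SAT and grit: strongly negative among admits,
# close to zero in the full sample
with(filter(df, admitted==T), cor(sat, grit))
with(df, cor(sat, grit))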

One obvious way to counteract this selection bias is to include the measure of extracurricular success directly in our regression:

controlled_regression = lm(college_gpa ~ sat + extracurricular, 
                           data=filter(df, admitted==T))

print(paste("Coefficient on SAT in controlled admitted sample:", 
            round(controlled_regression$coefficients['sat'],3)))
[1] "Coefficient on SAT in controlled admitted sample: 1.012"

With the extracurricular measure included, the estimated slope is close to the coefficient of 1 that IQ carries in the data-generating model, rather than the badly attenuated estimate from before.

Unfortunately, we often won’t have a precise quantitative measure of how admissions officers evaluated something like grit when comparing applicants. This is especially true if we’re trying to compare outcomes across different universities. In these cases, we might try to estimate how likely each student was to be admitted based on a very large number of predictors, and then explicitly control for that probability of selection. Some of my current research is based around adapting machine learning methods to this predictive task of propensity score estimation; hopefully I’ll have some original results to share soon!
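To make the idea a bit more concrete, here is a minimal sketch of the propensity-score step in this toy setting, using a plain logistic regression rather than the machine learning methods mentioned above (the propensity_model and p_admit names are just for illustration, and because selection in the simulation is nearly deterministic, the fitted probabilities will be extreme):

# sketch: model each applicant's probability of admission from the observed
# predictors; in a real analysis these scores would then be used as controls
# or inverse-probability weights when estimating the SAT/GPA relationship
propensity_model = glm(admitted ~ sat + extracurricular, 
                       data = df, family = binomial)
df$p_admit = predict(propensity_model, type = "response")
summary(df$p_admit)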