rm(list = ls())
knitr::opts_chunk$set(echo = TRUE)
3 Matching
#1. Data: Job Training Data
Today we are using a subset of the original LaLonde (1986) job-training data (males only). We use it to familiarize ourselves with matching techniques.
We load the data, in .csv format:
lalonde = read.csv("http://www.spyroskosmidis.net/wp-content/uploads/2018/01/lalonde.csv")
What variables does the data include?
head(lalonde) #Checking first 6 rows of the data
- treat = dummy for treatment (job-training) or control
- age = participant age
- educ = participant education
- black, hispan = dummies for participant race (Black, Hispanic)
- married = participant’s marital status
- nodegree = participant has no school degree
- re74 = real earnings in 1974
- re75 = real earnings in 1975
- unem74 = unemployed in 1974
- unem75 = unemployed in 1975
- re78 = real earnings in 1978 (outcome variable)
The data comes from a labour market experiment that randomized participants into treatment (on-the-job training) and control groups, with the aim of estimating the effect of training on participants’ earnings. The experimental benchmark is $1794 – i.e. this is the effect of the treatment on earnings. Let’s start with an exercise. The model below shows the treatment effect with no covariates.
summary(lm(re78~treat,data=lalonde))
summary(lm(re78 ~ treat + age + educ + black + hispan + nodegree +
           married + re74 + re75 + unem74 + unem75, data = lalonde))
As we can see, using the full comparison group the estimate (difference in means) is -$8,498. What does this suggest about the earnings of those treated?
#2. Exercise 1
Try to retrieve the experimental benchmark by adding covariates into the above regression.
#3. Observational Data
We are trying to recover the experimental benchmark effect from observational data, and we are going to do that with matching. Before we proceed, though, let’s run some randomization checks. Note that in the context of matching this process is called a “balance check”. Since the treatment is binary, we run a logistic regression (or probit, or, less conservatively, OLS).
random <- glm(treat ~ age + educ + black + hispan + nodegree +
              married + re74 + re75 + unem74 + unem75,
              data = lalonde, family = binomial("logit"))
summary(random)
What do we notice?
Now, we are going to create a data frame with the predicted probabilities of treatment from the regression we just ran. The first column will be the propensity score and the second will be a binary indicator for those treated (==1).
ps.df <- data.frame(pr_score = predict(random, type = "response"),
                    treat = random$model$treat)
head(ps.df, n = 100)
Before we proceed, let’s install and load one more package.
#install.packages("sm")#This is a package that allows you to quickly plot two densities.
#One can do this in many different ways.
require(sm)
Now, let’s run the density plot
sm.density.compare(ps.df$pr_score, ps.df$treat, xlab="Propensity Score")
title(main="Propensity score by treatment status")
text(0.6,10, "Control", col="red")
text(0.6,8, "Treated", col="green")
#4. Matching
Generally, you should follow these steps:
- Run probit (or logit) with the treatment as outcome and (relevant) covariates as predictors
- Predict probabilities – this will create the propensity score. Note here that some (e.g. Sekhon) suggest that the logit score is superior to the propensity score (see the sketch after this list).
- Match each participant to one or more non-participants using the propensity score. You can do this in a variety of ways: Nearest Neighbour Matching (NN), Caliper Matching, Mahalanobis Distance, Exact Matching etc. The default of the Matching package we use today is NN.
- Check balance after matching
- Estimate ATT
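To see the difference between the two scores, here is a minimal sketch using the random model we fitted above (the object names are ours, not the handout’s):

# A sketch: the propensity score is the predicted probability, while the
# "logit score" is the linear predictor x'beta on the logit scale
p.score     <- predict(random, type = "response")
logit.score <- predict(random, type = "link")
head(cbind(p.score, logit.score))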
Let’s start by installing and loading some useful packages.
#install.packages("Matching")
#install.packages("cobalt")
library(Matching) #provides functions for multivariate and propensity score matching and for finding optimal balance based on a genetic search algorithm.
require(cobalt) # Generate balance tables and plots for covariates of groups preprocessed through matching, weighting or subclassification
Think back to our random model – was the treatment randomly assigned? No. This is why we’re doing matching. We are going to use Matching, a package with three important functions:
- MatchBalance gives you balance before and after matching
- Match performs propensity score and distance (e.g. Mahalanobis) matching
- GenMatch performs genetic matching
Let’s start with the first of these, MatchBalance, and check balance before matching. The key statistic here is the standardized mean difference:
\[smd = \frac{\bar{X}_t - \bar{X}_c}{\sqrt{\frac{SD^2_{t}}{n_t} + \frac{SD^2_{c}}{n_c}}}\]
bal.bm = MatchBalance(treat ~ age + educ + black + hispan + nodegree +
                      married + re74 + re75 + unem74 + unem75, data = lalonde)
What does this tell us?
Now, let’s match! We use propensity score matching (PSM). By default, we estimate NN – what does this mean? NN sets \(C(\bar{P}_i)=\min_j|\bar{P}_i-\bar{P}_j|\). Basically, the non-participant(s) \(\bar{P}_j\) closest to our participant \(\bar{P}_i\) will be selected as a match. The estimator has the option to perform matching with or without replacement. Today, we will only use matching without replacement. It’s good to know, though, that this method has the disadvantage that the final estimate will usually depend on the initial ordering of the treated observations for which the matches were selected. How many neighbours should you use? It’s usually better to oversample – use more neighbours. This, however, involves a trade-off between bias and variance: it lowers variance at the cost of increased bias. Generally, in choosing between different matching estimators, the one with the best balance is the most satisfactory one.
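To make the NN definition concrete, here is an illustrative sketch (not the package’s actual algorithm) that finds the nearest control, by propensity score, for the first treated unit:

# Illustrative only: nearest-neighbour match for one treated unit,
# using the fitted propensity scores from the `random` model above
pscore   <- random$fitted.values
treated  <- which(lalonde$treat == 1)
controls <- which(lalonde$treat == 0)
best     <- controls[which.min(abs(pscore[controls] - pscore[treated[1]]))]
pscore[c(treated[1], best)]  # the treated unit's score and its closest control's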
To retrieve the propensity score, we estimate a logit model predicting assignment to treatment – just what we did above in random. We’ll call this ps, though it’s exactly the same model as random.
ps <- glm(treat ~ age + educ + black + hispan + nodegree +
          married + re74 + re75 + unem74 + unem75,
          data = lalonde, family = binomial())
Formula to calculate the conditional probability of receiving the treatment (the predicted probabilities): \[P(W_i = 1 \mid X_i = x_i) = E(W_i \mid X_i = x_i) = \frac{e^{x_i'\beta}}{1 + e^{x_i'\beta}} = \frac{1}{1 + e^{-x_i'\beta}}\]
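As a quick check of the formula, a sketch that reproduces the predicted probabilities by hand from the linear predictor:

# Sketch: inverse-logit of the linear predictor equals predict(type = "response")
xb <- predict(ps, type = "link")  # x_i'beta
p  <- 1 / (1 + exp(-xb))          # conditional probability of treatment
all.equal(as.numeric(p), as.numeric(predict(ps, type = "response")))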
Next, we are going to match each treated observation to a control, without replacement (that is, we only use unique treatment-control pairs, by contrast to matching with replacement, where the same individual can serve as the control for several treated units).
set.seed(1111)
match.ps <- Match(Y = lalonde$re78, Tr = lalonde$treat, X = ps$fitted,
                  replace = FALSE)
summary(match.ps)
We got $1642. Pretty good, right?
Now that we’ve done the matching, let’s check balance after matching.
bal.ps = MatchBalance(treat ~ age + educ + black + hispan + nodegree +
                      married + re74 + re75 + unem74 + unem75,
                      match.out = match.ps, data = lalonde)
It’s getting better, but how about age? To see what is happening, let’s first plot the standardized mean differences with a threshold of 0.1. Note that the package gives you the standardized mean differences for both the unmatched and the matched data. Have a look at the code below to see how the standardized mean difference is calculated for the unmatched age variable.
sddiffage = (mean(lalonde$age[lalonde$treat==1]) - mean(lalonde$age[lalonde$treat==0])) /
            sd(lalonde$age[lalonde$treat==1])
abs(sddiffage*100)
The package gives you something very similar. The following plot by cobalt does not multiply it by 100.
love.plot(bal.tab(match.ps, treat ~ age + educ + black + hispan +
                  nodegree + married + re74 + re75 + unem74 + unem75,
                  data = lalonde), stat = "mean.diffs",
          threshold = .1, var.order = "unadjusted", abs = T)
Ideally, all covariates would be on the 0 line. The further away the absolute mean differences are, the worse the balance. Now we have seen what happens with the means of the covariates. Let’s see next what happens with their distributions. Remember, it can well be that the means are the same but the distributions are statistically different. We’ll test that, as we did last week, with a Kolmogorov-Smirnov statistic.
love.plot(bal.tab(match.ps, treat ~ age + educ + black + hispan +
                  nodegree + married + re74 + re75 + unem74 + unem75,
                  data = lalonde),
          stat = "ks.statistics", var.order = "unadjusted")
Again, ideally everything would be on 0 (that would mean the covariate distributions are identical between treatment and control). Note also that Kolmogorov-Smirnov statistics are only available for continuous covariates.
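If you want the raw statistic for a single covariate, here is a sketch using the matched row indices that Match stores (index.treated and index.control); ks.test will warn about ties since age is discrete:

# Sketch: a direct KS test on age in the matched sample
ks.test(lalonde$age[match.ps$index.treated],
        lalonde$age[match.ps$index.control])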
#4.1 Propensity Score, With Replacement
Now, let’s use propensity score WITH replacement. Remember, this means that a single control unit can be matched to several treated units. Matching without replacement can yield very bad matches if the number of comparison observations comparable to the treated observations is small. It keeps variance low at the cost of potential bias. Matching with replacement keeps bias low at the cost of a larger variance since you are using the same subjects again and again.
How many neighbors should you use? It’s usually better to oversample – use more neighbors. This, however, involves a trade-off between bias and variance: it lowers variance at the cost of increased bias.
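As a sketch of oversampling (the object name match.ps.m3 is ours, not the handout’s), Match’s M argument sets the number of matches per treated unit:

# Sketch: one-to-three nearest-neighbour matching with replacement
match.ps.m3 <- Match(Y = lalonde$re78, Tr = lalonde$treat, X = ps$fitted,
                     M = 3, replace = TRUE)
summary(match.ps.m3)

For now, though, we stick with a single neighbour and simply allow replacement: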
set.seed(7777)
match.ps2 <- Match(Y = lalonde$re78, Tr = lalonde$treat, X = ps$fitted,
                   replace = T)
summary(match.ps2)
$1314, not great, not terrible. And how’s the balance?
balancepost = MatchBalance(treat ~ ., match.out = match.ps2, data = lalonde[,-c(1,11)])
Let’s plot the standardized mean differences:
love.plot(bal.tab(match.ps2, treat ~ ., data = lalonde[,-c(1,11)]), stat = "mean.diffs",
          threshold = .1, var.order = "unadjusted", abs = T)
#5. Exercise 2
So far the closest we’ve got to the experimental benchmark was $1642. Let’s try Caliper Matching now. Caliper Matching is a variant of the NN method that attempts to avoid ‘bad’ matches (\(\bar{P}_j\) that are far from our \(\bar{P}_i\)). To do this, Caliper Matching imposes a tolerance on the maximum distance between the two. If, for some participants, no matches are found within the specified caliper (distance), they are excluded from the analysis. Theoretically, Caliper should be better than NN, since it corrects for ‘bad’ matches. Compared to NN, Caliper (also Kernel and LLR) matching uses weighted averages – Caliper, for example, uses the weighted average over the multiple persons within the caliper.
Let’s estimate Caliper with a distance of 0.1 standard deviations.
set.seed(1111)
match.ps.cal0.1 = Match(Y = lalonde$re78, Tr = lalonde$treat,
                        X = ps$fitted, caliper = 0.1, replace = F)
summary(match.ps.cal0.1)
We got $1374. That’s worse than before. How’s the balance though?
bal.ps.cal0.1 = MatchBalance(treat ~ age + educ + black + hispan + nodegree +
                             married + re74 + re75 + unem74 + unem75,
                             match.out = match.ps.cal0.1, data = lalonde)
And let’s plot that:
love.plot(bal.tab(match.ps.cal0.1, treat ~ age + educ + black + hispan +
                  nodegree + married + re74 + re75 + unem74 + unem75,
                  data = lalonde), stat = "mean.diffs",
          threshold = .1, var.order = "unadjusted", abs = T)
Does this look good? Remember we wanted to get as close as possible to $1794. Try again by changing the caliper.
# let's set the seed equal to 5555
# let's try a caliper equal to 0.25
# let's call the output of our Match function "match.ps.cal25"
set.seed(5555)
match.ps.cal25 <- Match(Y = lalonde$re78, Tr = lalonde$treat, X = ps$fitted,
                        caliper = 0.25, replace = F)
summary(match.ps.cal25)
# let's check for balance before and after matching using a 0.25 caliper
bal.ps.cal25 = MatchBalance(treat ~ age + educ + black + hispan + nodegree +
                            married + re74 + re75 + unem74 + unem75,
                            match.out = match.ps.cal25, data = lalonde)
# let's generate the balance plot
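One way to do it (a sketch mirroring the love.plot calls above, not code from the original handout):

love.plot(bal.tab(match.ps.cal25, treat ~ age + educ + black + hispan +
                  nodegree + married + re74 + re75 + unem74 + unem75,
                  data = lalonde), stat = "mean.diffs",
          threshold = .1, var.order = "unadjusted", abs = T)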
#6. Mahalanobis Distance
Like propensity score matching, Mahalanobis distance matching is built on the notion of distance between observations of pretreatment covariates. It uses the complete variance covariance matrix, which means that the relationship between variables is included in the analysis. The contribution to the distance calculation of two highly correlated covariates will then be lower than that of less correlated ones. In essence Mahalanobis distance matching scales the distance by the inverse of the covariance matrix.
\[MD(X_i, X_j) = \sqrt{(X_i - X_j)'\Sigma^{-1}(X_i - X_j)}\]
For an exact match, \(MD(X_i, X_j) = 0\).
library(tidyverse)
lalonde_cov <- lalonde %>%
  dplyr::select(-X, -treat, -re78)
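Before matching, we can check the distance formula directly; here is a sketch using stats::mahalanobis, which returns the squared distance:

# Sketch: squared Mahalanobis distance between the covariates of units 1 and 2
S   <- cov(lalonde_cov)                       # covariance matrix of the covariates
md2 <- mahalanobis(lalonde_cov[1, ], unlist(lalonde_cov[2, ]), S)
sqrt(md2)                                     # MD(X_1, X_2)

Now let’s run the Mahalanobis match itself: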
set.seed(2222)
match.maha <- Match(Y = lalonde$re78, Tr = lalonde$treat, X = lalonde_cov,
                    BiasAdjust = F, estimand = "ATT", M = 1, Weight = 2)
summary(match.maha)
With Mahalanobis, we get $1715, slightly better than the caliper = 0.1 match and both NN matches.
And checking balance:
bal.maha <- MatchBalance(lalonde$treat ~ ., match.out = match.maha, data = lalonde_cov, ks = FALSE)
Visualizing that in a plot:
love.plot(bal.tab(match.maha, treat ~ ., data = lalonde[,-c(1,11)]), stat = "mean.diffs",
          threshold = 0.1, var.order = "unadjusted", abs = T)
To sum up, here is how the estimators compare to the experimental benchmark:

Matching technique | Estimate ($) | Diff. to benchmark ($) |
---|---|---|
Benchmark | 1794 | 0 |
Regression | -8506 | 10300 |
Regression + covariates | 1068 | 726 |
NN without replacement | 1642 | 152 |
NN with replacement | 1714 | 80 |
Caliper (0.1) | 1374 | 420 |
Caliper (0.25) | 1839 | 45 |
Mahalanobis | 1715 | 79 |