Lab 7: Sharp Regression Discontinuity

Lecture Slides & Data






Recap

Sharp Regression Discontinuity

In RD we exploit the fact that treatment is assigned according to a known rule. All units in the study receive a score, usually called the running variable, forcing variable, or index, and treatment is assigned to those units whose score is above a known cut-off (Cattaneo et al. 2019).

In RD we have three components:

  1. The score
  2. The cutoff
  3. The treatment.

In the Sharp Regression Discontinuity Design, the score determines deterministically whether a unit is assigned to the treatment or the control condition. We still face the fundamental problem of causal inference, because we can only observe the untreated outcome for units below the cut-off (untreated) and the treated outcome for units above it (treated). However, imposing the assumption of comparability between units with very similar values of the score but on opposite sides of the cut-off enables us to estimate the treatment effect of the intervention.

Given the continuity assumption, we would expect observations in a small neighbourhood around the cut-off to have very similar potential outcomes. This justifies using observations just below the cut-off as a good proxy for the average outcome that units just above the cut-off would have had if they had received the control condition instead of the treatment.

The main goal in RD is to adequately extrapolate the average outcomes of treated and untreated units to the cutoff.

Estimation

There are different ways that we could estimate the causal effect from RDD:

  • Linear
  • Linear with different slopes
  • High-order polynomial (or non-linear)
  • Non-parametric
  • Local polynomial

Choice of Kernel Function and Polynomial Order

Whenever we conduct RD we choose the following:

  1. Polynomial order \(p\)
  2. Kernel function \(K(\cdot)\), which determines how observations within a bandwidth are weighted. Common options are the uniform, triangular, and Epanechnikov kernels.
  3. Bandwidth size \(h\): here there is a trade-off between bias and variance. The narrower the bandwidth, the less bias in the estimator, but the more variance, since fewer observations are used.
  4. We then conduct weighted least squares and obtain the intercepts of the chosen polynomial above and below the cut-off.
  5. Finally, we calculate the sharp point estimate by subtracting the two intercepts: \(\hat{\tau}_{SRD} = \hat{\alpha}_{above} - \hat{\alpha}_{below}\)
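The steps above can be sketched in a few lines of R. The following is a minimal illustration on simulated data (all variable names are made up for the example, and the true jump at the cut-off is set to 0.3; this is not the lab data set):

```r
# Minimal sketch of local linear estimation with a triangular kernel,
# on simulated data (illustrative only)
set.seed(123)
x <- runif(1000, -1, 1)                        # score, cut-off at 0
y <- 0.5 * x + 0.3 * (x >= 0) + rnorm(1000, sd = 0.2)

h <- 0.25                                      # bandwidth
w <- pmax(0, 1 - abs(x) / h)                   # triangular kernel weights

# Weighted least squares on each side of the cut-off
above <- lm(y ~ x, weights = w, subset = x >= 0 & w > 0)
below <- lm(y ~ x, weights = w, subset = x < 0 & w > 0)

# Sharp RD point estimate: the difference of the intercepts
tau_srd <- coef(above)[1] - coef(below)[1]
tau_srd  # estimate of the jump at the cut-off (true value here is 0.3)
```

Observations outside the bandwidth receive zero weight, and observations closer to the cut-off receive larger weights, which is exactly what the triangular kernel encodes.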


Falsification tests:

We learned that we can conduct multiple sensitivity and falsification tests:

  1. Sensitivity: To check that the results of our estimation are robust to different specifications
  2. Continuity: To check that covariates do not jump at the threshold.
  3. Sorting: To check that units do not sort around the threshold.
  4. Placebo cut-offs: To check that outcomes do not change abruptly at an arbitrary threshold (to test the continuity assumption)
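As a preview, a placebo cut-off check can be sketched with the rdrobust package (which we load later in this lab). Here `df`, `score`, and `y` are hypothetical placeholder names for a data frame, running variable, and outcome:

```r
# Placebo cut-off test (sketch): re-run the RD estimation at fake
# cut-offs, where we should find no significant jump in the outcome.
# `df`, `score` and `y` are placeholders, not the lab data set.
library(rdrobust)

for (placebo_c in c(-0.2, -0.1, 0.1, 0.2)) {
  fit <- rdrobust(df$y, df$score, c = placebo_c)
  cat(sprintf("cut-off %.2f: robust p-value = %.3f\n",
              placebo_c, fit$pv[3]))
}
```

If the continuity assumption holds, none of these placebo cut-offs should show a significant discontinuity.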

Before starting this seminar

  1. Create a folder called “lab7”

  2. Download the data (you can use the button or the one at the top, or read csv files directly from github):

  3. Open an R script (or Markdown file) and save it in our “lab7” folder.

  4. Set your working directory using the setwd() function or by clicking on "More". For example setwd("~/Desktop/Causal Inference/2022/Lab7")

  5. Let’s install and load the packages that we will be using in this lab:

library(stargazer) # generate formatted regression tables
library(texreg)    # generate formatted regression tables
library(tidyverse) # tidy data operations (includes ggplot2)
# install.packages(c("rdd", "rdrobust", "rddensity"))
library(rdd)       # to conduct the McCrary sorting test
library(rdrobust)  # to conduct non-parametric RD estimates
library(rddensity) # to conduct a density test and density plots

Seminar Overview

In this seminar, we will cover the following topics:
1. Conduct regression discontinuity with global polynomial estimation using the lm() function
2. Calculate the LATE using non-parametric estimation with rdrobust()
3. Conduct robustness/falsification tests such as balance tests, placebo outcomes, density tests, and placebo cut-offs


Islamic Political Rule and Women’s Empowerment - Meyerson (2014)

In this paper, Meyerson is interested in the effects of Islamic parties’ control of local governments on women’s rights, focusing on the educational attainment of young women. Meyerson conducts a Sharp RD design based on close elections in Turkey. The challenge is that municipalities where an Islamic party won the election may differ systematically from those that elected a secular mayor.

You would expect that municipalities controlled by Islamic parties would systematically differ from those that are controlled by a secular mayor. Particularly, if religious conservatism affects the educational outcomes of women. However, we can use RD to isolate the treatment effect of interest from all systematic differences between treated and untreated units.

We can compare municipalities where the Islamic party barely won the election versus municipalities where it barely lost. This reveals the causal (local) effect of Islamic party control on women’s educational attainment. One condition to meet in this setup is that parties cannot systematically manipulate the vote percentage they obtain.

The data are from the 1994 mayoral elections in Turkey. The unit of analysis is the municipality, and the running variable is the margin of victory. The outcome of interest is the educational attainment of women who were (potentially) in high school during 1994-2000, calculated as the percentage of the cohort of women aged 15 to 20 in 2000 who had completed high school by 2000.

We will be using the following variables:

| Variable | Description |
| --- | --- |
| margin | The Islamic party’s margin of victory in the 1994 election. If the margin is positive, an Islamic party won |
| school_men | Secondary school completion rate for men aged 15 to 20 |
| school_women | Secondary school completion rate for women aged 15 to 20 |
| log_pop | Log of the municipality population in 1994 |
| sex_ratio | Gender ratio of the municipality in 1994 |
| log_area | Log of the municipality area in 1994 |


Now let’s load the data. You can load the islamic_women.csv data set from your laptop using the read.csv() function. Call this data set educ.

# Set your working directory
#setwd("~/Desktop/Causal Inference/2022/Lab7")
# 
library(readr)
#educ <- read.csv("~/islamic_women.csv")

head(educ)
# A tibble: 6 × 6
  margin school_men school_women log_pop sex_ratio log_area
   <dbl>      <dbl>        <dbl>   <dbl>     <dbl>    <dbl>
1 -0.356      0.379        0.230    8.00     0.977     11.2
2 -0.543      0.309        0.254    7.36     1.12      10.6
3 -0.204      0.226        0.227   13.2      0.975     NA  
4 -0.450      0.171        0.159    7.62     1.03      10.9
5 -0.201      0.173        0.182    7.65     1.18      11.5
6 -0.506      0.266        0.25     7.31     1.02      10.3

Let’s start by visualising the data. We will use the plot() function to do this, producing the simplest scatter plot we can come up with.

Exercise 1: Generate a plot using the plot(X,Y) function. Replace X with educ$margin and Y with educ$school_women.


Reveal Answer
plot(educ$margin, educ$school_women)

This is a very simple plot that shows the raw relationship between the margin of victory and the outcome variable, but it conveys some important information. For example, most values of the margin of victory lie between roughly -0.5 and 0.3. Also, the outcome variable, school attainment, mostly ranges from 0 to 40%. Now let’s generate a slightly fancier plot.


Exercise 2: Generate a scatter plot using ggplot() function. Use the functions below to add some additional features into this plot.

ggplot(aes(x = running variable, y = outcome, colour = outcome), data = data) +
  # Make points small and semi-transparent since there are lots of them
  geom_point(size = 0.5, alpha = 0.5, position = position_jitter(width = 0, height = 0.25, seed = 1234)) + 
  # Add vertical line
  geom_vline(xintercept = 0) + 
  # Add labels
  labs(x = "Label X", y = "Label Y") + 
  # Turn off the color legend, since it's redundant
  guides(color = FALSE)


Reveal Answer
# Let's check if this is a sharp RD. 
ggplot(educ, aes(x = margin, y = school_women, colour = school_women)) +
  # Make points small and semi-transparent since there are lots of them
  geom_point(size = 0.5, alpha = 0.5, 
             position = position_jitter(width = 0, height = 0.25, seed = 1234)) + 
  # Add vertical line
  geom_vline(xintercept = 0) + 
  # Add labels
  labs(x = "Vote share", y = "Women's Educational Attainment (Proportion)") + 
  # Turn off the color legend, since it's redundant
  guides(color = FALSE)

We can now see more clearly that most municipalities elected a secular mayor. Remember that we are using the margin of victory for the Islamic parties: if the margin of victory is positive, the Islamic party won the election in that municipality.


Exercise 3: Let’s generate a dummy variable that is equal to “Treated” if the margin of victory is greater than or equal to zero and “Untreated” otherwise. You can use the mutate() and ifelse() functions to do this. Remember to use the pipe operator to store the variable in your existing data frame. See the syntax below for more information.

| Function/argument | Description |
| --- | --- |
| `data <- data %>%` | Pipe operator to assign the new operation to a new or existing data frame |
| `mutate(newvariable = ifelse(variable >= condition, "Treated", "Untreated"))` | If the condition is met, the new variable takes the value “Treated”, and “Untreated” otherwise |
data <- data %>% 
  mutate(newvariable = ifelse(variable >= 0 , "Treated", "Untreated"))


Reveal Answer
educ <- educ %>% 
  mutate(treat = ifelse(margin >= 0 , "Treated", "Untreated"))

Now that we have created our treatment condition variable, let’s generate an additional plot that conveys information about the distribution of the running variable for both treated and untreated municipalities.


Exercise 4: Generate a plot looking at the distribution of the margin of victory variable. Using the ggplot() function, set the x argument equal to the margin of victory variable and fill equal to the new treatment variable treat. There is no need to add the y argument, given that we will generate a histogram that gives us the number of observations for each value of the margin of victory variable. Add the geom_histogram() function, setting its binwidth argument equal to 0.01 and its colour argument equal to “white”. Add the geom_vline() function to draw a vertical line by setting the xintercept argument equal to zero. Finally, add the labs() function and set the x label equal to “Margin of victory”, the y label equal to “Count”, and the fill argument equal to “Treatment Status”.

| Function/argument | Description |
| --- | --- |
| `ggplot(aes(x = variable, fill = condition), data = data)` | Map data components into the graph |
| `geom_histogram(binwidth = 0.01, color = "white")` | Generate a histogram |
| `geom_vline(xintercept = 0)` | Add a vertical line that goes through zero |
| `labs(x = "Label X", y = "Label Y", fill = "Label Legend")` | Add labels |
ggplot(educ, aes(x = running_variable, fill = treat_status_variable)) +
  geom_histogram(binwidth = 0.01, color = "white") + 
  geom_vline(xintercept = 0) + 
  labs(x = "Running variable", y = "Count", fill = "Treatment Status")


Reveal Answer
ggplot(educ, aes(x = margin, fill = treat)) +
  geom_histogram(binwidth = 0.01, color = "white") + 
  geom_vline(xintercept = 0) + 
  labs(x = "Margin of Victory", y = "Count", fill = "Treatment Status")


In this plot, we can see the distribution of the running variable (margin of victory). Again, we can see that in the majority of the municipalities a secular mayor was elected.


Global Parametric Estimation:

As we discussed in the lecture, there are different ways to estimate the causal effect using RD. These approaches differ in how many observations they use and how they estimate the average outcome for units just above and just below the cut-off. One way to estimate the effect of the intervention in RD is to use OLS with the running variable as the main predictor. In this case, we measure the running variable as the distance from the cut-off. Stated formally: \(\tilde{X} = X - c\).

In this case, the regression below the cut-off (to the left) is:

\[Y= \alpha_l + \beta_l\tilde{X} + \epsilon\] Whereas the regression above the cut-off is:

\[Y= \alpha_r + \beta_r\tilde{X} + \epsilon\] It’s important to point out that, for all these estimations, the treatment effect is equal to the difference between the intercepts of the regressions above and below the cut-off.

\[\tau = \alpha_r - \alpha_l\]

In this case, we need to subset the data into observations above and below the threshold. Then, regress the outcome on the running variable in each subset. Finally, subtract the intercepts from each regression. Let’s do that in the following exercise.

Exercise 5: Run a regression using only the margin variable as the predictor. Set your outcome variable equal to the school_women variable. Set the data argument equal to educ (unless you called your data frame differently). Add the subset argument inside the lm() function and set it equal to margin >= 0 for the regression above the cut-off and margin < 0 for the regression below the cut-off. Use the summary() function to report your results, but also store the output of this function in an object. You can retrieve the intercept from that object this way:

object <- summary(lm(outcome ~ variable, data = data, subset = variable >= condition))

# intercept
object$coefficients[1]


Reveal Answer
# above
above <- summary(lm(school_women~margin, data = educ, subset = margin >= 0))

above$coefficients[1]
[1] 0.1564529
# below
below <- summary(lm(school_women~margin, data = educ, subset = margin < 0))

below$coefficients[1]
[1] 0.1620016
# 0.156453 - 0.162002
tau_rd = above$coefficients[1] - below$coefficients[1]
tau_rd
[1] -0.00554867

Based on this approach, the intercepts of the two regressions yield the estimated average outcome at the cut-off point for the treated and untreated units. The difference in intercepts is the estimated causal effect: when an Islamic party is in power (at the municipal level), women’s educational attainment is roughly 0.55 percentage points lower.


However, a more direct way of estimating the treatment effect is to run a pooled regression on both sides of the cut-off point, using the following specification: \(Y = \alpha + \tau D + \beta \tilde{X} + \epsilon\)

Where \(\tau\) is the coefficient of interest. Here again the LATE is the difference between the two intercepts: \(\tau = \alpha_r -\alpha_l\). When \(D\) switches off, controlling for the forcing variable \(\tilde{X}\), we get the regression line below the threshold. Conversely, for units above the cut-off, \(D\) switches on and we get the regression line above the cut-off. OLS effectively reweights information from both sides of the cut-off to obtain a single \(\tau\) coefficient.

However, this constrains the slope of the regression lines to be the same on both sides of the cut-off (\(\beta_l = \beta_r\)). This strategy is not fully consistent with the idea of RD designs, as estimates should be calculated approximating the cut-off from each side separately.

Exercise 6: Calculate the effect of the intervention using a regression model including the margin and treat variables. Use the lm() function to conduct this analysis. Store this regression in an object and call it global2. Use summary() to inspect your results. Interpret the coefficient of interest.


Reveal Answer
# a more direct way is to run a pooled regression on both sides of the cut-off (constraining the slopes)

global2 <- lm(school_women~treat+margin, data = educ)
summary(global2)

Call:
lm(formula = school_women ~ treat + margin, data = educ)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.17051 -0.07650 -0.00691  0.06527  0.51306 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)     0.142358   0.005494  25.912   <2e-16 ***
treatUntreated  0.017347   0.007467   2.323   0.0202 *  
margin         -0.019321   0.010967  -1.762   0.0782 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.09543 on 2627 degrees of freedom
  (574 observations deleted due to missingness)
Multiple R-squared:  0.008805,  Adjusted R-squared:  0.00805 
F-statistic: 11.67 on 2 and 2627 DF,  p-value: 9.016e-06

Using this estimation approach, note that the reference category of treat is “Treated”, so the positive coefficient on treatUntreated (0.017) means that secular (untreated) municipalities have higher attainment: on average, Islamic party control of a municipality is associated with a 1.7 percentage-point decrease in women’s educational attainment.


We can also allow the regression function to differ on both sides of the cut-off by including interaction terms between \(D\) and \(\tilde{X}\). This would be as follows:

\[Y = \alpha_l + \tau D + \beta_0\tilde{X} + \beta_1 (D \times\tilde{X})+ \epsilon\]

Let’s do that in the following exercise.

Exercise 7: Calculate the effect of the intervention using a regression model including the margin and treat variables and the interaction between these two variables. Use the lm() function to conduct this analysis. Store this regression in an object and call it global3. Use summary() to inspect your results. Interpret the coefficient of interest.


Reveal Answer
# allow the slopes to differ on both sides of the cut-off by adding an interaction

global3 <- lm(school_women~treat+margin + treat*margin, data = educ)
summary(global3)

Call:
lm(formula = school_women ~ treat + margin + treat * margin, 
    data = educ)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.16896 -0.07655 -0.00723  0.06561  0.51348 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)            0.156453   0.007439  21.030  < 2e-16 ***
treatUntreated         0.005549   0.008561   0.648  0.51697    
margin                -0.156501   0.050110  -3.123  0.00181 ** 
treatUntreated:margin  0.144062   0.051352   2.805  0.00506 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.09531 on 2626 degrees of freedom
  (574 observations deleted due to missingness)
Multiple R-squared:  0.01177,   Adjusted R-squared:  0.01064 
F-statistic: 10.42 on 3 and 2626 DF,  p-value: 8.144e-07

Here, the coefficient on treatUntreated is again positive (0.006), which means that secular (untreated) municipalities have slightly higher attainment at the cut-off: on average, Islamic party control is associated with a small decrease in women’s educational attainment, although the coefficient is not statistically significant. Note that its magnitude equals the intercept difference from Exercise 5, since the unconstrained model reproduces the two separate regressions.


The main limitation of these estimations is that we are relying upon observations that are far away from the cut-off. And the evidence against using a global polynomial approach is quite substantial. According to Cattaneo et al (2019), this estimation technique does not provide accurate point estimators and inference procedures with good statistical properties.

Exercise 7 (no coding required): How does the global polynomial approach weight each observation when it calculates the coefficient of interest?


Reveal Answer

OLS estimates \(\tau\) using all observations across the score. This means that an observation very far from the cut-off receives the same weight as one very close to it. In the worst-case scenario, if observations are clustered far from the cut-off, the estimate of \(\tau\) will be heavily influenced by those distant values rather than by those close to the cut-off.


We can also use a high-order polynomial to retrieve the LATE. However, the evidence against using high-order polynomials is quite robust: they lead to noisy estimates, they are sensitive to the degree of the polynomial, and their confidence intervals have poor coverage.

Exercise 8: Run a third-order polynomial regression. Include in this model the treat and margin variables. Also add the margin variable raised to the power of 2 and to the power of 3. We need to wrap these squared and cubed terms in the I() (“insulate”) function, which insulates whatever is inside it and creates a new predictor that is the product of the margin variable with itself two and three times. Store the output in an object called global4. Then use summary() to check the results of this specification. Interpret the results.


Reveal Answer
global4 <- lm(school_women~ treat + margin+I(margin^2)+ I(margin^3) , data = educ)

summary(global4)

Call:
lm(formula = school_women ~ treat + margin + I(margin^2) + I(margin^3), 
    data = educ)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.17226 -0.07581 -0.00592  0.06335  0.50378 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)     0.160107   0.006060  26.418  < 2e-16 ***
treatUntreated -0.021484   0.008893  -2.416  0.01577 *  
margin         -0.179024   0.023219  -7.710 1.77e-14 ***
I(margin^2)    -0.134706   0.037469  -3.595  0.00033 ***
I(margin^3)     0.190585   0.046510   4.098 4.30e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.09434 on 2625 degrees of freedom
  (574 observations deleted due to missingness)
Multiple R-squared:  0.03201,   Adjusted R-squared:  0.03054 
F-statistic:  21.7 on 4 and 2625 DF,  p-value: < 2.2e-16

Using a high-order polynomial, the coefficient on treatUntreated is now negative (-0.021), implying that, at the cut-off, women’s educational attainment is roughly 2.1 percentage points higher when an Islamic party is in power. Note how sensitive the global estimates are to the chosen specification.


Finally for this section, we can also use only observations that are close to the cut-off (above and below). Let’s run the unconstrained model from Exercise 7, but only using observations within 0.5 of the threshold.

Exercise 9: Run the same model used in Exercise 7, but subset your data to observations within 0.5 of the threshold on either side. Store the results from this regression in an object and call it local. Use summary() to check your results. If you don’t remember how to subset data in the lm() function, see the syntax below.

lm(outcome ~ variable1 + variable2 + variable1 * variable2, data=data,
           subset=(running_variable>=-0.5 & running_variable<=0.5))


Reveal Answer
local <- lm(school_women ~ margin + treat + treat * margin, data=educ,
           subset=(margin>=-0.5 & margin<=0.5))
summary(local)

Call:
lm(formula = school_women ~ margin + treat + treat * margin, 
    data = educ, subset = (margin >= -0.5 & margin <= 0.5))

Residuals:
     Min       1Q   Median       3Q      Max 
-0.18414 -0.07517 -0.00407  0.06421  0.50304 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)            0.159465   0.007898  20.190  < 2e-16 ***
margin                -0.190408   0.060455  -3.150  0.00166 ** 
treatUntreated        -0.010577   0.009179  -1.152  0.24930    
margin:treatUntreated  0.118165   0.062268   1.898  0.05787 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.09352 on 2174 degrees of freedom
  (574 observations deleted due to missingness)
Multiple R-squared:  0.02567,   Adjusted R-squared:  0.02432 
F-statistic: 19.09 on 3 and 2174 DF,  p-value: 3.224e-12

Using this local specification, the coefficient on treatUntreated is negative (-0.011), suggesting that, if anything, Islamic party control is associated with about 1 percentage point higher women’s educational attainment, although the estimate is not statistically significant.


It is important to stress that modern RD empirical work employs local polynomial methods. In this case, we estimate the average outcomes for treated and untreated units using observations that are near the cut-off. This approach tends to be more robust and less sensitive to boundary and overfitting problems. In local polynomial point estimation, we still use linear regression, but within a specific bandwidth near the threshold. In the following section, we will look at how to implement this approach using a non-parametric estimation strategy.


Non-Parametric Estimation

Let’s now estimate the LATE using a non-parametric estimator. Conveniently, we can do so using the rdrobust package. As the name indicates, the package allows us to estimate robust measures of uncertainty, such as standard errors and confidence intervals. It is based on theoretical and technical work by Calonico, Cattaneo and Titiunik. rdrobust estimates robust bias-corrected confidence intervals that address the undersmoothing problem that conventional confidence intervals face in RDDs: a small bias would be required for them to be valid, which might not be the case. They also address the poor performance of (non-robust) bias-corrected confidence intervals.

As suggested by the authors and somewhat counter-intuitively, we therefore use the point estimate provided by the conventional estimator, but robust standard errors, confidence intervals and p-values to make statistical inferences.

The rdrobust command has the following minimal syntax. You can use the same bandwidth or specify two different ones. We will work with further arguments later. Note that you do not need to specify if the running variable is centred on the cutoff.

robust_model = rdrobust(data$dependent_var, data$running_var, 
                       c=[cutoff], h=[bandwidth])

Exercise 10: Estimate the LATE using rdrobust with a bandwidth of 5% on either side of the cutoff. Interpret your result.

Reveal Answer
robust_5=rdrobust(educ$school_women, educ$margin, c=0, h=0.05)
summary(robust_5)
Call: rdrobust

Number of Obs.                 2630
BW type                      Manual
Kernel                   Triangular
VCE method                       NN

Number of Obs.                 2315          315
Eff. Number of Obs.             148          109
Order est. (p)                    1            1
Order bias  (q)                   2            2
BW est. (h)                   0.050        0.050
BW bias (b)                   0.050        0.050
rho (h/b)                     1.000        1.000
Unique Obs.                    2313          315

=============================================================================
        Method     Coef. Std. Err.         z     P>|z|      [ 95% C.I. ]       
=============================================================================
  Conventional     0.023     0.027     0.848     0.396    [-0.030 , 0.075]     
        Robust         -         -     0.774     0.439    [-0.049 , 0.112]     
=============================================================================

Our point estimate is 0.023. That is, a victory of an Islamic party is associated with a 2.3 percentage-point increase in the rate of women who complete school. However, the p-value and confidence intervals indicate that the estimate is not statistically significant. Based on this model, we would conclude that winning an election does not have an effect on women’s schooling.


A bandwidth of 5% seems reasonable, but we had better check different ones, too. Let’s see what happens if we halve the bandwidth.

Exercise 11: Estimate the same model as before with a bandwidth of 2.5%. Report and interpret your results.

Reveal Answer
robust_25=rdrobust(educ$school_women, educ$margin, c=0, h=0.025)
summary(robust_25)
Call: rdrobust

Number of Obs.                 2630
BW type                      Manual
Kernel                   Triangular
VCE method                       NN

Number of Obs.                 2315          315
Eff. Number of Obs.              68           62
Order est. (p)                    1            1
Order bias  (q)                   2            2
BW est. (h)                   0.025        0.025
BW bias (b)                   0.025        0.025
rho (h/b)                     1.000        1.000
Unique Obs.                    2313          315

=============================================================================
        Method     Coef. Std. Err.         z     P>|z|      [ 95% C.I. ]       
=============================================================================
  Conventional     0.027     0.040     0.691     0.490    [-0.050 , 0.105]     
        Robust         -         -    -0.333     0.739    [-0.145 , 0.103]     
=============================================================================

Well, we can see that the number of observations used to estimate the effect has been reduced, which is something we should be aware of. The point estimate did not change much and, as before, we find no significant effect at the cutoff.


Bandwidths of 5% or 2.5% around the cutoff seem somewhat reasonable in this case - but so would several others. How can we know what bandwidth we should use to estimate our effect?

Recall the trade-off we face when choosing bandwidths, discussed in the lecture. On the one hand, narrower bandwidths are associated with less biased estimates: the closer we get to the cut-off, the more comparable the units, with their position in the running variable being as-if random. On the other hand, the wider the bandwidth, the smaller the variance. As in several cases before, the structure of our data, such as the number of observations, plays an important role. Even if small bandwidths are desirable, it can be hard (or even impossible) to estimate a robust and significant effect if the number of observations around the cut-off is very small, even if there is a true effect.
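One practical way to see this trade-off is to re-estimate the model over a grid of manual bandwidths and watch how the point estimate and its confidence interval change. This is a sketch, assuming the educ data frame from earlier is loaded and rdrobust is installed:

```r
# Sensitivity of the LATE to the bandwidth choice (sketch; assumes
# `educ` is loaded and the rdrobust package is installed)
library(rdrobust)

for (h in c(0.025, 0.05, 0.10, 0.20)) {
  fit <- rdrobust(educ$school_women, educ$margin, c = 0, h = h)
  cat(sprintf("h = %.3f: tau = %6.3f, robust 95%% CI [%6.3f, %6.3f]\n",
              h, fit$coef[1], fit$ci[3, 1], fit$ci[3, 2]))
}
```

Narrow bandwidths use fewer observations and so produce wider intervals; wide bandwidths tighten the interval but risk bias from observations far from the cut-off.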

Luckily, the rdrobust package provides a remedy for this. The package allows us to request bandwidths that are optimal given the data. The rdrobust command then picks the bandwidth that minimises the mean squared error: the MSE-optimal bandwidth. Note that this is the default if you don’t specify any bandwidth. Let’s find out what it would be in our case.

Exercise 12: Estimate the LATE using rdrobust and MSE-optimal bandwidths. To specify the model, replace the h argument with bwselect="mserd".

Reveal Answer
robust_mserd=rdrobust(educ$school_women, educ$margin, c=0, bwselect="mserd")
summary(robust_mserd)
Call: rdrobust

Number of Obs.                 2630
BW type                       mserd
Kernel                   Triangular
VCE method                       NN

Number of Obs.                 2315          315
Eff. Number of Obs.             529          266
Order est. (p)                    1            1
Order bias  (q)                   2            2
BW est. (h)                   0.172        0.172
BW bias (b)                   0.286        0.286
rho (h/b)                     0.603        0.603
Unique Obs.                    2313          315

=============================================================================
        Method     Coef. Std. Err.         z     P>|z|      [ 95% C.I. ]       
=============================================================================
  Conventional     0.030     0.014     2.116     0.034     [0.002 , 0.058]     
        Robust         -         -     1.776     0.076    [-0.003 , 0.063]     
=============================================================================

This now looks very different. We estimate a LATE of about 3 percentage points, which is significant at the 90% level. Note that the optimal bandwidth has been estimated to be 17.2 percentage points on either side of the cut-off, with a separate optimal bandwidth for the bias correction. Note that the MSE-optimal bandwidth is optimal in statistical terms - we should always assess the bandwidth against our theory as well. Here, we’d compare parties’ winning margins of up to 17 percentage points.


Note that, so far, we have used a single bandwidth for data below and above the cut-off. We can also specify different ones - both manually and in terms of optimal bandwidths. As the structure of our data may differ on either side, different bandwidths might be optimal. We can request two separate MSE-optimal bandwidths by specifying bwselect="msetwo".

Exercise 13: Estimate the LATE using rdrobust and two MSE-optimal bandwidths. Interpret your results and compare them to the model with a single optimal bandwidth.

Reveal Answer
robust_msetwo <- rdrobust(educ$school_women, educ$margin, c = 0, bwselect = "msetwo")
summary(robust_msetwo)
Call: rdrobust

Number of Obs.                 2630
BW type                      msetwo
Kernel                   Triangular
VCE method                       NN

Number of Obs.                 2315          315
Eff. Number of Obs.             607          267
Order est. (p)                    1            1
Order bias  (q)                   2            2
BW est. (h)                   0.200        0.174
BW bias (b)                   0.323        0.297
rho (h/b)                     0.619        0.584
Unique Obs.                    2313          315

=============================================================================
        Method     Coef. Std. Err.         z     P>|z|      [ 95% C.I. ]       
=============================================================================
  Conventional     0.030     0.014     2.134     0.033     [0.002 , 0.057]     
        Robust         -         -     1.810     0.070    [-0.002 , 0.062]     
=============================================================================

This now looks pretty similar to the single optimal bandwidth, which is a good sign. In fact, the optimal bandwidth for data points above the cut-off remains virtually unchanged. For those below the cut-off, the bandwidth is slightly extended. The point estimate and measures of uncertainty also remain virtually unchanged. If specifying two optimal bandwidths alters the results substantially, this is an indicator that the data should be inspected closely for the cause of the diverging bandwidths.


Kernels

One of the advantages of the non-parametric RDD is that observations can be weighted by their proximity to the cutoff, based on the idea that those closer to the cut-off provide better comparisons than those further away. Think of victory margins as a running variable: if we look at fractions of a percentage point around the cut-off, it is fair to say that victory is as-if random. The further away we move, the less plausible this statement becomes.

As you learned in the lecture, there are different ways to do so. The default kernel used by the rdrobust package is a triangular kernel, which continuously assigns higher weight to observations closer to the cut-off. This is the one we have been using so far, as we didn’t explicitly specify a kernel. As the figure below shows, other possible options include Epanechnikov and uniform kernels. These can be specified via the kernel argument.
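As a quick illustration of how these weighting schemes differ, here is a small base-R sketch of the three kernel functions in their standard textbook form, applied to the standardised distance u = (x - c)/h. This is written out for intuition only - rdrobust computes these weights internally.

```r
# Kernel weights as a function of the standardised distance to the
# cut-off, u = (x - c) / h. All three kernels assign zero weight to
# observations outside the bandwidth (|u| > 1).
triangular   <- function(u) ifelse(abs(u) <= 1, 1 - abs(u), 0)
epanechnikov <- function(u) ifelse(abs(u) <= 1, 0.75 * (1 - u^2), 0)
uniform      <- function(u) ifelse(abs(u) <= 1, 0.5, 0)

u <- c(0, 0.5, 1.2)
triangular(u)   # highest weight at the cut-off, declining linearly, 0 outside
uniform(u)      # same weight everywhere inside the bandwidth, 0 outside
```

Comparing triangular(u) and uniform(u) for the same vector of distances makes the difference concrete: the triangular kernel downweights observations as they approach the edge of the bandwidth, while the uniform kernel treats every observation inside the bandwidth identically.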


Exercise 14: Estimate the LATE using rdrobust and a single MSE-optimal bandwidth, but specify a uniform kernel. Interpret your results.

Reveal Answer
robust_uniform <- rdrobust(educ$school_women, educ$margin, c = 0, bwselect = "mserd", kernel = "uniform")
summary(robust_uniform)
Call: rdrobust

Number of Obs.                 2630
BW type                       mserd
Kernel                      Uniform
VCE method                       NN

Number of Obs.                 2315          315
Eff. Number of Obs.             474          251
Order est. (p)                    1            1
Order bias  (q)                   2            2
BW est. (h)                   0.154        0.154
BW bias (b)                   0.283        0.283
rho (h/b)                     0.546        0.546
Unique Obs.                    2313          315

=============================================================================
        Method     Coef. Std. Err.         z     P>|z|      [ 95% C.I. ]       
=============================================================================
  Conventional     0.032     0.014     2.360     0.018     [0.005 , 0.059]     
        Robust         -         -     2.044     0.041     [0.001 , 0.063]     
=============================================================================

Compared to the triangular kernel, our point estimate is slightly larger, and our robust measures of uncertainty show that the effect is significant at the 95% level. Note that the optimal bandwidth is slightly narrower in the model with a uniform kernel. We have to be aware that this model is less conservative, though: we assign the same weight to each observation regardless of its distance to the cutoff, which introduces growing bias as bandwidths increase.



Falsification Checks

Sensitivity

If you apply an RDD, you should always make sure that your results are robust to different specifications of the model. Importantly, this involves sensitivity checks: verifying that results hold across model specifications and, in particular, across different bandwidths.

While there are packages that produce useful plots visualising robustness across bandwidths, they can be cumbersome to adjust and are not very flexible. Hence, one way to create such a plot is to loop rdrobust models over different bandwidths. This solution involves somewhat more coding, but gives you a lot of flexibility as to what you want to display and highlight.


Exercise 15: Create a plot that shows estimates and confidence intervals for estimates of the effect across different bandwidths.

There are several ways to do this. Feel free to play around a bit and try to come up with your own way.

Reveal Hint

Hint: You could use the following approach:
- Create a data frame with all variables you need for the plot and an observation for each bandwidth.
- Extract the values from the rdrobust output which you need for your plot.
- Loop the regression and the extraction of output over the bandwidths indicated in the initial data frame.
- Save the output in your loop to the respective row in the data frame.
- Plot the output from the newly created data frame.


Reveal Answer
bandwidth <-  seq(from = 0.05, to = 1, by = 0.05)  #create a vector with values for each bandwidth you want to estimate

coefficient<- NA 
se <- NA
obs <- NA
bw <- NA
ci_u <- NA
ci_l <- NA

data_extract <- data.frame(bandwidth, coefficient, se, obs, bw, ci_u, ci_l) # create a data.frame with all variables you want to include in your dataset

# create a loop over each bandwidth, indicated by 'i'
for (i in bandwidth) {
  rdbw <- rdrobust(educ$school_women, educ$margin, c = 0, h = i) # run the model

  # extract the model output (make sure to extract *robust* statistics)
  data_extract$coefficient[data_extract$bandwidth == i] <- rdbw$coef[3]
  data_extract$se[data_extract$bandwidth == i]          <- rdbw$se[3]
  data_extract$obs[data_extract$bandwidth == i]         <- rdbw$N_h[1] + rdbw$N_h[2]
  data_extract$bw[data_extract$bandwidth == i]          <- rdbw$bws[1, 1]
  data_extract$ci_l[data_extract$bandwidth == i]        <- rdbw$ci[3, 1]
  data_extract$ci_u[data_extract$bandwidth == i]        <- rdbw$ci[3, 2]
}

# Make sure the coefficient (and all other values) are numeric
data_extract$coefficient  <- as.numeric(data_extract$coefficient)

# Plot the estimates across bandwidths
ggplot(data = data_extract,
       aes(x = bandwidth, y = coefficient)) +
  geom_point(size = 0.8) +
  geom_ribbon(aes(ymin = ci_l, ymax = ci_u), alpha = 0.2) +
  geom_hline(aes(yintercept = 0), col = "red", linetype = 2) +
  coord_cartesian(ylim = c(-.05, 0.15)) +
  theme_minimal() +
  labs(y = "LATE at Discontinuity", x = "Bandwidths (Vote Margin)")



Sorting

The key identification assumption of RDDs is that there is no sorting on the running variable. That is, the running variable must be continuous around the threshold. If this is not the case, we have a problem: there is then a good chance that the observations close to the threshold (on either side of it) are not random. In other words, the presence of sorting is usually interpreted as empirical evidence of self-selection or non-random sorting of units into control and treatment status. Let’s check whether sorting is a problem here. We can do this using density checks, i.e. by looking at the density of the running variable.

There are several ways to check for sorting. Let’s start with McCrary’s density test. We can use the DCdensity command (from the rdd package) to conduct both a visual and a statistical test. The syntax is the following:

DCdensity(running_var, cutpoint = [cutoff], plot = TRUE, ext.out = TRUE)
title(xlab = "X Lab", ylab = "Y Lab")
abline(v = 0, lty = 1) # Add a vertical line at the cutoff

Exercise 16: Examine the density of the running variable using the DCdensity command.

Reveal Answer
DCdensity(educ$margin, cutpoint = 0, plot = TRUE, ext.out = TRUE)
$theta
[1] -0.09537487

$se
[1] 0.1466241

$z
[1] -0.6504719

$p
[1] 0.5153874

$binsize
[1] 0.008611425

$bw
[1] 0.1654581

$cutpoint
[1] 0

$data
          cellmp    cellval
1   -1.003231050 0.04365594
2   -0.994619625 0.04365594
3   -0.986008199 0.00000000
4   -0.977396774 0.00000000
5   -0.968785349 0.04365594
6   -0.960173923 0.00000000
7   -0.951562498 0.04365594
8   -0.942951073 0.00000000
9   -0.934339647 0.00000000
10  -0.925728222 0.04365594
11  -0.917116797 0.00000000
12  -0.908505371 0.04365594
13  -0.899893946 0.00000000
14  -0.891282521 0.08731188
15  -0.882671095 0.00000000
16  -0.874059670 0.00000000
17  -0.865448245 0.00000000
18  -0.856836819 0.04365594
19  -0.848225394 0.04365594
20  -0.839613969 0.04365594
21  -0.831002543 0.08731188
22  -0.822391118 0.04365594
23  -0.813779693 0.00000000
24  -0.805168268 0.00000000
25  -0.796556842 0.04365594
26  -0.787945417 0.00000000
27  -0.779333992 0.00000000
28  -0.770722566 0.00000000
29  -0.762111141 0.08731188
30  -0.753499716 0.00000000
31  -0.744888290 0.00000000
32  -0.736276865 0.04365594
33  -0.727665440 0.04365594
34  -0.719054014 0.08731188
35  -0.710442589 0.17462375
36  -0.701831164 0.08731188
37  -0.693219738 0.08731188
38  -0.684608313 0.13096781
39  -0.675996888 0.08731188
40  -0.667385462 0.04365594
41  -0.658774037 0.17462375
42  -0.650162612 0.21827969
43  -0.641551186 0.21827969
44  -0.632939761 0.26193563
45  -0.624328336 0.39290344
46  -0.615716910 0.48021532
47  -0.607105485 0.21827969
48  -0.598494060 0.87311876
49  -0.589882634 0.52387126
50  -0.581271209 0.61118313
51  -0.572659784 0.65483907
52  -0.564048359 0.78580688
53  -0.555436933 0.91677470
54  -0.546825508 1.30967814
55  -0.538214083 1.65892564
56  -0.529602657 1.74623752
57  -0.520991232 2.66301222
58  -0.512379807 1.70258158
59  -0.503768381 2.75032409
60  -0.495156956 2.48838846
61  -0.486545531 2.44473253
62  -0.477934105 1.78989346
63  -0.469322680 1.70258158
64  -0.460711255 1.35333408
65  -0.452099829 1.65892564
66  -0.443488404 1.83354939
67  -0.434876979 1.17871033
68  -0.426265553 1.65892564
69  -0.417654128 2.22645284
70  -0.409042703 1.70258158
71  -0.400431277 1.17871033
72  -0.391819852 1.57161377
73  -0.383208427 2.00817315
74  -0.374597001 1.78989346
75  -0.365985576 1.96451721
76  -0.357374151 1.83354939
77  -0.348762726 1.30967814
78  -0.340151300 2.22645284
79  -0.331539875 1.57161377
80  -0.322928450 1.57161377
81  -0.314317024 1.30967814
82  -0.305705599 1.48430189
83  -0.297094174 1.52795783
84  -0.288482748 1.52795783
85  -0.279871323 1.57161377
86  -0.271259898 1.35333408
87  -0.262648472 1.35333408
88  -0.254037047 1.35333408
89  -0.245425622 1.09139845
90  -0.236814196 1.00408657
91  -0.228202771 1.30967814
92  -0.219591346 1.00408657
93  -0.210979920 1.35333408
94  -0.202368495 1.09139845
95  -0.193757070 0.91677470
96  -0.185145644 1.30967814
97  -0.176534219 1.00408657
98  -0.167922794 0.96043064
99  -0.159311368 1.39699001
100 -0.150699943 0.96043064
101 -0.142088518 1.35333408
102 -0.133477092 1.22236626
103 -0.124865667 1.09139845
104 -0.116254242 1.00408657
105 -0.107642817 1.52795783
106 -0.099031391 1.44064595
107 -0.090419966 1.26602220
108 -0.081808541 1.17871033
109 -0.073197115 0.78580688
110 -0.064585690 1.13505439
111 -0.055974265 1.35333408
112 -0.047362839 0.96043064
113 -0.038751414 1.61526970
114 -0.030139989 1.22236626
115 -0.021528563 1.17871033
116 -0.012917138 1.00408657
117 -0.004305713 0.82946282
118  0.004305713 1.13505439
119  0.012917138 0.96043064
120  0.021528563 0.69849501
121  0.030139989 0.78580688
122  0.038751414 0.61118313
123  0.047362839 0.78580688
124  0.055974265 0.82946282
125  0.064585690 0.65483907
126  0.073197115 0.56752719
127  0.081808541 0.61118313
128  0.090419966 0.61118313
129  0.099031391 0.56752719
130  0.107642817 0.65483907
131  0.116254242 0.52387126
132  0.124865667 0.34924750
133  0.133477092 0.34924750
134  0.142088518 0.26193563
135  0.150699943 0.39290344
136  0.159311368 0.30559157
137  0.167922794 0.39290344
138  0.176534219 0.17462375
139  0.185145644 0.21827969
140  0.193757070 0.17462375
141  0.202368495 0.13096781
142  0.210979920 0.17462375
143  0.219591346 0.17462375
144  0.228202771 0.17462375
145  0.236814196 0.08731188
146  0.245425622 0.00000000
147  0.254037047 0.08731188
148  0.262648472 0.17462375
149  0.271259898 0.08731188
150  0.279871323 0.00000000
151  0.288482748 0.00000000
152  0.297094174 0.08731188
153  0.305705599 0.00000000
154  0.314317024 0.04365594
155  0.322928450 0.00000000
156  0.331539875 0.04365594
157  0.340151300 0.00000000
158  0.348762726 0.04365594
159  0.357374151 0.00000000
160  0.365985576 0.00000000
161  0.374597001 0.00000000
162  0.383208427 0.00000000
163  0.391819852 0.00000000
164  0.400431277 0.00000000
165  0.409042703 0.00000000
166  0.417654128 0.00000000
167  0.426265553 0.00000000
168  0.434876979 0.04365594
169  0.443488404 0.00000000
170  0.452099829 0.08731188
171  0.460711255 0.00000000
172  0.469322680 0.04365594
173  0.477934105 0.00000000
174  0.486545531 0.04365594
175  0.495156956 0.04365594
176  0.503768381 0.00000000
177  0.512379807 0.00000000
178  0.520991232 0.00000000
179  0.529602657 0.04365594
180  0.538214083 0.00000000
181  0.546825508 0.00000000
182  0.555436933 0.00000000
183  0.564048359 0.00000000
184  0.572659784 0.00000000
185  0.581271209 0.00000000
186  0.589882634 0.04365594
187  0.598494060 0.00000000
188  0.607105485 0.00000000
189  0.615716910 0.00000000
190  0.624328336 0.00000000
191  0.632939761 0.00000000
192  0.641551186 0.00000000
193  0.650162612 0.00000000
194  0.658774037 0.00000000
195  0.667385462 0.00000000
196  0.675996888 0.00000000
197  0.684608313 0.00000000
198  0.693219738 0.00000000
199  0.701831164 0.00000000
200  0.710442589 0.00000000
201  0.719054014 0.00000000
202  0.727665440 0.00000000
203  0.736276865 0.00000000
204  0.744888290 0.00000000
205  0.753499716 0.00000000
206  0.762111141 0.00000000
207  0.770722566 0.00000000
208  0.779333992 0.00000000
209  0.787945417 0.00000000
210  0.796556842 0.00000000
211  0.805168268 0.00000000
212  0.813779693 0.00000000
213  0.822391118 0.00000000
214  0.831002543 0.00000000
215  0.839613969 0.00000000
216  0.848225394 0.00000000
217  0.856836819 0.00000000
218  0.865448245 0.00000000
219  0.874059670 0.00000000
220  0.882671095 0.00000000
221  0.891282521 0.00000000
222  0.899893946 0.00000000
223  0.908505371 0.00000000
224  0.917116797 0.00000000
225  0.925728222 0.00000000
226  0.934339647 0.00000000
227  0.942951073 0.00000000
228  0.951562498 0.00000000
229  0.960173923 0.00000000
230  0.968785349 0.00000000
231  0.977396774 0.00000000
232  0.986008199 0.00000000
233  0.994619625 0.04365594
title(xlab="Winning Margin",ylab="Density") # add labels (always add labels. Really. Always.)
abline(v=0, lty=1)

The test provides relatively clear visual evidence that sorting is not a problem in this case: there is a clear overlap of the confidence intervals at the cut-off, so the continuity assumption holds. The command also provides statistical tests, and you can see that information on the individual cells - or bins - is reported. For now, we focus on the inference test for sorting. The p-value of about 0.52 makes clear that we do not reject the null hypothesis of continuity.


Let’s further explore the structure of our data and check whether sorting really isn’t a problem here. Recall, we want to make sure the density of units is continuous near the cut-off. We can use the rddensity package to do so. This package applies a different test than the one we used before. It is also somewhat more powerful and provides additional information.

The command is straightforward: you only have to pass the running variable as an argument and, if different from 0, the cutoff. Note that you can optionally specify bandwidths, as above, using the h argument; otherwise, an optimal bandwidth is selected automatically. You can then save the result as an object and run the summary command to see the output.

Exercise 17: Examine the density of the running variable using the rddensity command. Interpret your findings. Are they equivalent to those in the previous exercise?

Reveal Answer
rdd <- rddensity(educ$margin, c=0)
summary(rdd)

Manipulation testing using local polynomial density estimation.

Number of obs =       2660
Model =               unrestricted
Kernel =              triangular
BW method =           estimated
VCE method =          jackknife

c = 0                 Left of c           Right of c          
Number of obs         2332                328                 
Eff. Number of obs    769                 314                 
Order est. (p)        2                   2                   
Order bias (q)        3                   3                   
BW est. (h)           0.25                0.282               

Method                T                   P > |T|             
Robust                -0.9203             0.3574              


P-values of binomial tests (H0: p=0.5).

Window Length / 2          <c     >=c    P>|T|
0.004                      11       9    0.8238
0.009                      18      26    0.2912
0.013                      32      34    0.9022
0.017                      42      48    0.5984
0.021                      52      58    0.6338
0.026                      68      63    0.7269
0.030                      76      73    0.8699
0.034                      97      82    0.2954
0.038                     116      89    0.0691
0.043                     134      93    0.0078

This output looks clearer. Note that the test is based on quadratic local polynomial functions. We also get a different bandwidth - recall that we are solely looking at the density of the running variable here.

Importantly, the test reveals a p-value of 0.357. We again do not reject the null hypothesis of continuity around the cut-off. That both tests point to the same conclusion gives us additional confidence in doing so.

The second part of the output reports the results of binomial tests for the allocation of observations around the cut-off over windows of increasing length. The number of observations on each side is indicated, along with p-values for significance tests whose null hypothesis is an equal distribution on either side of the cut-off. Note that this can give us an indication, but modest imbalance is not evidence of sorting.
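For intuition, one row of that table can be reproduced by hand with base R’s binom.test: in the widest window (0.043 on either side of the cut-off) there are 134 observations below and 93 at or above the cut-off, and under the null of no sorting each observation falls on either side with probability 0.5.

```r
# Exact two-sided binomial test for the widest window in the table:
# 134 observations below vs. 93 at/above the cut-off, H0: p = 0.5.
binom.test(x = 93, n = 134 + 93, p = 0.5)$p.value  # close to the 0.0078 reported above
```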


Let’s now plot this density test, too. The rdplotdensity function allows us to do so easily. It uses the results of the test conducted in rddensity. The syntax is simple:

rdplotdensity([rdd_object], [running_var])


We can also add some useful arguments. Feel free to adjust your plot:

Argument Description
plotRange Indicates starting and end point of plot
plotN Number of grid points used on either side of the cutoff
CIuniform TRUE or FALSE. If TRUE, confidence intervals are displayed as continuous bands


Exercise 18: Use rdplotdensity to conduct a visual density check.

Reveal Answer
plot_rdd <- rdplotdensity(rdd,educ$margin, plotRange = c(-.2, 0.2),  CIuniform = TRUE)


Again, we find visual evidence for continuity around the cutoff. Now we can be confident that sorting is not a problem in this data set.


Balance

One important falsification test is examining whether, near the cut-off, treated units are similar to control units in terms of observable characteristics. The logic of this test is that if units cannot manipulate the score they receive, there should be no systematic differences between untreated and treated units with similar values of the score.

Exercise 19: Let’s check for balance around the cutoff. Use the rdrobust() function, setting the cut-off argument c equal to 0 and the bwselect argument equal to “mserd”. In this case, the outcome variables are the covariates that we want to check for balance. Check whether there is balance for the following covariates: log_pop, school_men, sex_ratio, log_area. Remember to use the summary() function.


Reveal Answer
summary(rdrobust(educ$log_pop, educ$margin, c=0, bwselect="mserd"))
Call: rdrobust

Number of Obs.                 2660
BW type                       mserd
Kernel                   Triangular
VCE method                       NN

Number of Obs.                 2332          328
Eff. Number of Obs.             385          232
Order est. (p)                    1            1
Order bias  (q)                   2            2
BW est. (h)                   0.122        0.122
BW bias (b)                   0.195        0.195
rho (h/b)                     0.627        0.627
Unique Obs.                    2330          328

=============================================================================
        Method     Coef. Std. Err.         z     P>|z|      [ 95% C.I. ]       
=============================================================================
  Conventional     0.013     0.290     0.045     0.964    [-0.555 , 0.581]     
        Robust         -         -     0.005     0.996    [-0.676 , 0.680]     
=============================================================================
summary(rdrobust(educ$school_men, educ$margin, c=0, bwselect="mserd"))
Call: rdrobust

Number of Obs.                 2632
BW type                       mserd
Kernel                   Triangular
VCE method                       NN

Number of Obs.                 2317          315
Eff. Number of Obs.             373          222
Order est. (p)                    1            1
Order bias  (q)                   2            2
BW est. (h)                   0.121        0.121
BW bias (b)                   0.187        0.187
rho (h/b)                     0.646        0.646
Unique Obs.                    2315          315

=============================================================================
        Method     Coef. Std. Err.         z     P>|z|      [ 95% C.I. ]       
=============================================================================
  Conventional     0.013     0.014     0.916     0.360    [-0.015 , 0.040]     
        Robust         -         -     0.747     0.455    [-0.021 , 0.046]     
=============================================================================
summary(rdrobust(educ$sex_ratio, educ$margin, c=0, bwselect="mserd"))
Call: rdrobust

Number of Obs.                 2659
BW type                       mserd
Kernel                   Triangular
VCE method                       NN

Number of Obs.                 2331          328
Eff. Number of Obs.             333          206
Order est. (p)                    1            1
Order bias  (q)                   2            2
BW est. (h)                   0.106        0.106
BW bias (b)                   0.191        0.191
rho (h/b)                     0.554        0.554
Unique Obs.                    2329          328

=============================================================================
        Method     Coef. Std. Err.         z     P>|z|      [ 95% C.I. ]       
=============================================================================
  Conventional     0.017     0.019     0.920     0.357    [-0.020 , 0.055]     
        Robust         -         -     1.006     0.314    [-0.021 , 0.065]     
=============================================================================
summary(rdrobust(educ$log_area, educ$margin, c=0, bwselect="mserd"))
Call: rdrobust

Number of Obs.                 2642
BW type                       mserd
Kernel                   Triangular
VCE method                       NN

Number of Obs.                 2320          322
Eff. Number of Obs.             352          214
Order est. (p)                    1            1
Order bias  (q)                   2            2
BW est. (h)                   0.112        0.112
BW bias (b)                   0.185        0.185
rho (h/b)                     0.607        0.607
Unique Obs.                    2318          322

=============================================================================
        Method     Coef. Std. Err.         z     P>|z|      [ 95% C.I. ]       
=============================================================================
  Conventional     0.029     0.234     0.124     0.901    [-0.431 , 0.489]     
        Robust         -         -     0.045     0.964    [-0.534 , 0.559]     
=============================================================================
# We advise you to create a loop to conduct this analysis.

We observe that there is balance across all the covariates of interest.
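The loop suggested above could look like the following sketch, which assumes the educ data frame and the rdrobust package from this lab are loaded (note that summary() on an rdrobust object prints its output directly, so this should work inside a loop):

```r
# Run the same balance check for each covariate instead of
# repeating the rdrobust call four times.
covariates <- c("log_pop", "school_men", "sex_ratio", "log_area")

for (cov in covariates) {
  cat("\n--- Balance check:", cov, "---\n")
  summary(rdrobust(educ[[cov]], educ$margin, c = 0, bwselect = "mserd"))
}
```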


Placebo cut-offs

Another falsification test is to look for treatment effects at artificial, or placebo, cut-off values. As you may remember from the lecture, one of the identifying assumptions is the continuity (or lack of abrupt changes) of the regression functions for the treated and control units around the cut-off. While we cannot empirically test this assumption at the cut-off itself, we can test for continuity at other points. Although this does not imply that the continuity assumption holds at the cut-off, it helps to rule out discontinuities away from it.

This test involves looking at the outcome variable at different cut-offs where we should not expect changes in the outcome. The estimates from this test should thus be near zero and not statistically significant. One important step in conducting this test is splitting the sample into observations above and below the real cut-off. We do this to avoid “contamination” from the real treatment effect. It also ensures that the analysis of each placebo cut-off is conducted using only observations with the same treatment status.

Exercise 20: Conduct a placebo cut-off test using the placebo values -0.2 and -0.25 below the cut-off, and 0.20 and 0.25 above the cut-off. Use the rdrobust() function to perform this test, setting the argument bwselect equal to “mserd” and replacing the cut-off argument c with the values listed above. Use summary() to report the results. Remember to create two new data frames; you can use the filter function from tidyverse to do this.

Function/argument Description
new_data <- data Creates a new data frame using the conditions set after the pipe operator %>%
filter(variable == "condition") Subsets the data based on a logical condition
new_data <- data %>% 
  filter(variable >= 0)


Reveal Answer
educ_above <- educ %>% 
  filter(margin >= 0)

educ_below <- educ %>% 
  filter(margin < 0)

summary(rdrobust(educ_above$school_women, educ_above$margin, c=0.2, bwselect = "mserd")) # -0.044
Call: rdrobust

Number of Obs.                  315
BW type                       mserd
Kernel                   Triangular
VCE method                       NN

Number of Obs.                  280           35
Eff. Number of Obs.              38           14
Order est. (p)                    1            1
Order bias  (q)                   2            2
BW est. (h)                   0.053        0.053
BW bias (b)                   0.093        0.093
rho (h/b)                     0.577        0.577
Unique Obs.                     280           35

=============================================================================
        Method     Coef. Std. Err.         z     P>|z|      [ 95% C.I. ]       
=============================================================================
  Conventional    -0.044     0.034    -1.287     0.198    [-0.112 , 0.023]     
        Robust         -         -    -1.022     0.307    [-0.122 , 0.038]     
=============================================================================
summary(rdrobust(educ_above$school_women, educ_above$margin, c=0.25, bwselect="mserd")) #-0.025
Call: rdrobust

Number of Obs.                  315
BW type                       mserd
Kernel                   Triangular
VCE method                       NN

Number of Obs.                  294           21
Eff. Number of Obs.              24           10
Order est. (p)                    1            1
Order bias  (q)                   2            2
BW est. (h)                   0.066        0.066
BW bias (b)                   0.119        0.119
rho (h/b)                     0.560        0.560
Unique Obs.                     294           21

=============================================================================
        Method     Coef. Std. Err.         z     P>|z|      [ 95% C.I. ]       
=============================================================================
  Conventional    -0.025     0.067    -0.380     0.704    [-0.156 , 0.105]     
        Robust         -         -    -0.323     0.747    [-0.185 , 0.133]     
=============================================================================
summary(rdrobust(educ_below$school_women, educ_below$margin, c=-0.2, bwselect="mserd")) # 0.063
Call: rdrobust

Number of Obs.                 2315
BW type                       mserd
Kernel                   Triangular
VCE method                       NN

Number of Obs.                 1707          608
Eff. Number of Obs.             105           96
Order est. (p)                    1            1
Order bias  (q)                   2            2
BW est. (h)                   0.034        0.034
BW bias (b)                   0.064        0.064
rho (h/b)                     0.542        0.542
Unique Obs.                    1705          608

=============================================================================
        Method     Coef. Std. Err.         z     P>|z|      [ 95% C.I. ]       
=============================================================================
  Conventional     0.063     0.030     2.081     0.037     [0.004 , 0.123]     
        Robust         -         -     2.172     0.030     [0.007 , 0.142]     
=============================================================================
summary(rdrobust(educ_below$school_women, educ_below$margin, c=-0.25, bwselect="mserd")) # -0.008
Call: rdrobust

Number of Obs.                 2315
BW type                       mserd
Kernel                   Triangular
VCE method                       NN

Number of Obs.                 1560          755
Eff. Number of Obs.             430          308
Order est. (p)                    1            1
Order bias  (q)                   2            2
BW est. (h)                   0.106        0.106
BW bias (b)                   0.160        0.160
rho (h/b)                     0.663        0.663
Unique Obs.                    1558          755

=============================================================================
        Method     Coef. Std. Err.         z     P>|z|      [ 95% C.I. ]       
=============================================================================
  Conventional    -0.008     0.015    -0.506     0.613    [-0.037 , 0.022]     
        Robust         -         -    -0.371     0.711    [-0.041 , 0.028]     
=============================================================================

We find that for almost all the placebo cut-off values, the coefficients are near zero and not statistically significant (with the exception of -0.2). Overall, this evidence suggests that the assumption of continuity holds in this case: our outcome of interest does not jump abruptly at artificial cut-offs.



Copyright © 2022 Felipe Torres Raposo & Kenneth Stiller.