library(tidyverse)
library(magrittr)
library(tidymodels)
library(haven)
library(estimatr)
weather <- read.csv("Weather_publicfile.csv")
6 Panel Data
6.0.1 Data
#install.packages("downloadthis")
library(downloadthis)
download_link(
link = "https://github.com/bayreuth-politics/CI22/raw/gh-pages/docs/data/Weather_publicfile.dta",
output_name = "weather",
output_extension = ".dta",
button_label = "Lab 5 Data",
button_type = "success",
has_icon = TRUE,
self_contained = TRUE
)
#download_link(
# link = "https://github.com/dpir-ci/CI22/raw/gh-pages/docs/lectures/lecture4.pdf",
# output_name = "week4",
# output_extension = ".pdf",
# button_label = "Lecture Slides",
# button_type = "default",
# has_icon = FALSE,
# self_contained = FALSE
#)
7 Recap
7.1 Omitted variable bias
Using panel data helps us partly circumvent the issue of omitted variable bias: rather than controlling for all potential time-invariant confounders, we can simply include fixed effects that capture all of these confounders - whether observed or unobserved.
7.2 Time-fixed effects
We know that bias can come from factors that change over time but affect all units. To minimise bias from these factors, we can include time-fixed effects in our model to control for any common impact that affects all units. Put differently, time-fixed effects capture time-specific factors that affect all entities at the same point in time but evolve over time.
7.3 Entity/Unit-fixed effects
A model with “unit-fixed effects” is a model to which we have added a dummy variable for every unit (county, individual, or state). By adding these to our model, we control for all differences between units that do not change over time. Unit fixed effects are essentially factors that change from one unit to another, but remain constant over time - this includes all time-invariant unobserved (and, importantly, unobservable) heterogeneity. Put differently, entity or unit-fixed effects are factors that are constant over time, but vary across entities.
The coefficient of each dummy variable shifts the intercept, but does not affect the slope.
\[Y_{i,t} = \beta_1 X_{i,t} + \alpha_i + \mu_{i,t}\]
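As a minimal sketch of this dummy-variable logic - assuming a hypothetical panel data frame df with unit identifier id, outcome y and predictor x - the same model can be estimated with lm():

# Hypothetical illustration: one dummy per unit shifts the intercept,
# while a single slope is estimated for x
fe_dummies <- lm(y ~ x + factor(id), data = df)
summary(fe_dummies)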
7.4 Two-way fixed effects
We can combine time and unit fixed effects in the same model. These models are referred to as two-way fixed effects models. By including both sets of fixed effects we are reducing bias that comes from:
- Factors that are constant over time, but differ across units.
- Factors that vary over time, but affect all units at the same time.
7.5 Diagnostic Tests and Standard Errors
Using panel data we cannot assume that regression errors are independent of each other. Indeed, errors are likely correlated within the same unit or the same year. Therefore, we need to cluster standard errors to adjust/account for this correlation. Using clustered standard errors we allow for heteroskedasticity and autocorrelation within an entity, but treat the errors as uncorrelated across entities.
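A minimal sketch of how this can be done with the estimatr package (used later in this lab), again assuming a hypothetical data frame df with unit identifier id:

# Hypothetical illustration: standard errors clustered by unit allow
# heteroskedasticity and correlation within each unit
library(estimatr)
m_clustered <- lm_robust(y ~ x + factor(year), clusters = id, data = df)
summary(m_clustered)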
Before starting this seminar
Create a folder called “lab5”
Download the data (you can use this button or the one at the top, or read the csv file directly from GitHub):
Open an R script (or Markdown file) and save it in your “lab5” folder.
Set your working directory using the setwd() function or by clicking on “More”. For example: setwd(“~/Desktop/Causal Inference/2022/Lab5”)
Let’s install and load the packages that we will be using in this lab:
library(jtools) # generate plots with regression coefficients
library(stargazer) # generate formatted regression tables
library(texreg) # generate formatted regression tables
library(tidyverse) # to conduct some tidy operations
library(ggplot2) # to generate plots
library(plm) # conduct one-way and two-way fixed effects
library(estimatr) # to conduct ols and provides robust standard errors
library(lmtest) # calculates the variance-covariance matrix to be clustered by group.
library(sandwich) # to calculate heteroscedasticity-consistent covariance matrix estimates
library(haven) # upload dta files
8 Seminar Overview
In this seminar, we will cover the following topics:
1. Conduct a time fixed effects model using the lm() function
2. Conduct a time and unit fixed effects model using the lm() function
3. Conduct a time fixed effects model using the plm() function
4. Conduct a time and unit fixed effects model using the plm() function
5. Correct standard errors obtained from the plm() function
6. Conduct a Hausman test and an F-test
8.1 The Republicans Should Pray for Rain: Weather, Turnout, and Voting in U.S. Presidential Elections
Today, we will work with data from Gomez et al.’s work The Republicans Should Pray for Rain: Weather, Turnout, and Voting in U.S. Presidential Elections. In this paper, Gomez et al. seek to answer the following question: Does bad weather affect turnout? To address this question, they looked at 14 U.S. presidential elections. Using GIS interpolations and weather data from 22,000 U.S. weather stations, they provide election day estimates of rain and snow at the county level.
The authors find that rain significantly reduces voter participation by a rate of just less than 1% per inch. Meanwhile, an inch of snowfall decreases turnout by almost .5%. The authors also find that poor weather benefits Republicans. The authors’ main outcome of interest is voter turnout at the county level.
Their data consists of over 3,000 counties in the continental United States for each presidential election from 1948 to 2000. The meteorological variables were drawn from the National Climatic Data Center’s daily reports of various measures of the day’s weather, including rainfall and snow, for over 20,000 weather stations. In total the authors collected up to 43,340 observations (N = 3,115 max, T = 14 elections).
The main “treatment” variable of this study is estimated rainfall and snowfall measured in inches (1 in = 2.54 cm). The authors used two alternative measures of weather. First, they calculated the normal (average) rainfall and snowfall for each election date in each county using data from the 1948-2000 time span. Then, they subtracted the appropriate daily normal value from the rainfall or snowfall estimated to have occurred on each election day under analysis.
The authors control for a number of socio-economic factors that are associated with voter turnout such as High School Graduates in the county, median household income, and how rural the county is.
The authors find that, when measured as deviations from their normal values, rain and snow elicit a negative and statistically significant effect on voter turnout. Thus, the evidence supports the claim that bad weather lowers voter turnout. The authors also find that when rain and snow increase above their respective election day normals, the vote share of Republican presidential candidates increases.
A description of some of the variables that we will use today is below:
Variable | Description
---|---
Year | Election year
State | State name
County | County name
FIPS_County | County id
Turnout | Turnout in the presidential election as a percentage of the voting-age population
Rain | GIS rainfall estimate
Snow | GIS snowfall estimate
Rain_Dev | Deviation from normal rainfall
Snow_Dev | Deviation from normal snowfall
GOPVoteShare | Republican candidate’s vote share
PcntBlack | Estimated percentage of Black population
ZPcntHSGrad | Estimated percentage of high school grads, standardised by year at the county level
FarmsPerCap | Farms per capita
Unemploy | Estimated unemployment rate
AdjIncome | CPI-adjusted median household income (1982-84 = 100), in 10,000s
Closing | Length of time between the date a voter must register and election day
Literacy | Literacy test
Poll Tax | Poll tax
Yr52 | Dummy for the 1952 election
Yr56 | Dummy for the 1956 election
Yr.. | Dummy for the … election
Yr2000 | Dummy for the 2000 election
Now let’s load the data. There are two ways to do this:
You can load the dataset from your laptop using the read_dta() function. We will call this data frame weather.
# Set your working directory
#setwd("~/Desktop/Causal Inference/2022/Lab5")
#
#setwd("/cloud/project")
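For instance - assuming you saved the file under its original name, Weather_publicfile.dta, in your working directory - the call would be:

# Read the Stata file with haven; the file name is assumed from the download link above
weather <- read_dta("Weather_publicfile.dta")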
Or you can download the data from the course website from the following URL: https://github.com/bayreuth-politics/CI22/raw/gh-pages/docs/data/Weather_publicfile.dta.
Let us start with a few descriptive statistics to become familiar with the data. First, as always, we use the head() function to get an impression of the data.
head(weather)
Now, let’s have a look at the number of elections that are included in the data set. We know that the authors are looking at presidential elections, so each unique year in the data can cover only one and the same presidential election. We use the unique() command to retrieve unique values of the Year variable. Then, you can use the length() command to display the number of unique values.
# Retrieving unique values of the variable
unique(weather$Year)
# Retrieving the number of the unique values of the variable.
length(unique(weather$Year))
So the data set covers 14 different presidential elections - as the unique() command shows us, these are all elections that took place between 1948 and 2000. Let’s do the same now for the number of states.
Exercise 1: How many US states are covered by the data - which ones are missing?
Reveal Answer
# Retrieving the number of the unique values of the variable.
length(unique(weather$State))
# Retrieving unique values of the variable
unique(weather$State)
The data cover 48 US states, so two are missing. We can see that the data cover all contiguous US states - the two missing ones are Alaska and Hawaii. Is there a reason to exclude them?
There is! Alaska and Hawaii joined the Union only in 1959 - so the authors would have had shorter time-series for these two states.
8.2 The Times are a changin’: Pooled OLS v Fixed Effects
Let’s start by paying attention to the temporal dimension of our voting data. Conventional wisdom has it that each and every election is different. Accordingly, we want to acknowledge these differences across elections and take them into account when estimating our models.
Let’s first have a look at the broad picture by plotting the outcome of interest across time. We can use the boxplot() function to do this. The syntax is the following: boxplot(Y ~ X, data = data, xlab = "X Label", ylab = "Y Label").
Exercise 2: Plot GOPVoteShare over time.
Reveal Answer
boxplot(GOPVoteShare ~ Year,
data = weather,
xlab = "Election Year",
ylab = "GOP Vote Share",
las = 2) # This argument rotates the axis labels
The vote share cast for the GOP varies widely. Note that the data set includes an observation for each county. The box indicates the interquartile range (i.e. the difference between the third and first quartile of data points) and the solid lines the median value for each year. There are meaningful differences: Note how GOP vote share is larger in 1972, when McGovern (ever heard of him?) had no chance against Nixon. Importantly, we still see a rather broad range of GOP vote shares - this is the variation introduced by individual counties and their specifics. Considering only temporal differences here, we do not distinguish any other effects.
Let’s now run a simple OLS model without considering any variation introduced by temporal differences or unit (i.e. county) specifics. Such a model is called pooled OLS as we simply pool all observations in our data set to estimate a coefficient. By design, all observations are considered to be independent of each other in such a model.
Exercise 3: Regress GOPVoteShare on Rain. Estimate two models: first, a simple bivariate model and, second, a model that includes important covariates: PcntBlack, ZPcntHSGrad, FarmsPerCap, AdjIncome and Literacy. Interpret your model estimates.
Reveal Answer
# Bivariate pooled OLS
ols_bivariate <- lm(GOPVoteShare ~ Rain, data=weather)
summary(ols_bivariate)

# Multivariate pooled OLS
ols_covariates <- lm(GOPVoteShare ~ Rain + PcntBlack + ZPcntHSGrad + FarmsPerCap + AdjIncome + Literacy, data=weather)
summary(ols_covariates)
The bivariate model estimates a - hardly significant - association between rain and GOP vote share of 0.5%-points per inch. As we know, however, such bivariate association does not account for several other effects. Indeed, all covariates included in the multivariate pooled OLS are highly correlated with GOP vote share - and the estimate of rain on election day doubles to about 1%-point in the multivariate model.
Let’s now move away from the pooled - or naive - OLS model. We know that despite accounting for the covariates, the multivariate model does not account for variance from two important sources: (i) general temporal development, (ii) unit-specific time-invariant variance. In a first step, we account for the former. Let’s see how acknowledging general election effects changes our estimate.
In other words, we will now estimate a fixed-effects model. You can use the lm() function to include time-fixed effects. If you add the Year variable as a factorised variable, a dummy variable for each year is added to the regression equation. In other words, we allow each presidential election to have a different intercept but estimate a single slope for the independent variables.
Exercise 4: Estimate the same bivariate and multivariate models as above including time-fixed effects.
Reveal Answer
Note that this is not the same as simply adding the Year variable as an independent variable. If you did the latter - with Year being a continuous variable - R would simply understand it to be a single, general and continuous time trend.
# Bivariate time FE
FE_bivariate <- lm(GOPVoteShare ~ Rain + factor(Year), data=weather)
summary(FE_bivariate)

# Multivariate time FE
FE_covariates <- lm(GOPVoteShare ~ Rain + PcntBlack + ZPcntHSGrad + FarmsPerCap + AdjIncome + Literacy + factor(Year), data=weather)
summary(FE_covariates)
Do you notice the vast differences in our estimated effect of Rain? In the bivariate FE model, the estimated effect is negative, so one inch of rain would be associated with a 1%-point decrease in GOP vote share. Once we control for the same set of covariates as above, the model estimates that a one-unit increase in Rain is not associated with GOP vote share at all!
Let’s pause a moment to make sure we grasp the difference between a pooled OLS and the FE estimator. Simply speaking, the pooled OLS considers each observation but does not differentiate them in any particular way. Thus, the bivariate model simply estimates the mathematical association between the two variables, using all data points:
require(modelr)
require(ggplot2)

grid <- data.frame(Intercept=1, Rain=seq_range(weather$Rain, 10))
grid$pred <- predict(ols_bivariate, grid)

ggplot(weather, aes(Rain)) +
  geom_point(aes(y = GOPVoteShare)) +
  geom_line(aes(y = pred), data = grid, colour = "red", size = 1)
The logic of the FE estimator is different, however. This estimator looks at and compares data points within a certain period. It estimates a single coefficient (or slope) for the independent variables, but allows each period to have an individual intercept, thus capturing general time trends that affect all units. Let’s look at the coefficient of Rain in the bivariate FE model now:
weather$fYear <- factor(weather$Year) # Creating a factor variable makes it easier to plot

grid_fixdum <- expand(data.frame(Intercept=1, Rain=seq_range(weather$Rain, 14), Year=unique(weather$fYear)),
                      Rain, Year)
grid_fixdum$pred <- predict(FE_bivariate, grid_fixdum)

ggplot(weather, aes(Rain)) +
  geom_point(aes(y = GOPVoteShare, colour=fYear)) +
  geom_smooth(aes(y = GOPVoteShare), method='lm', se=FALSE, colour="black") +
  geom_line(data=grid_fixdum, aes(x=Rain, y=pred, colour=Year))
In fact, looking only at the effects within years, we can see that the aggregate slope is even slightly negative!
Even with our FE estimates, however, there is an important problem. Using the lm() command, R estimates the time dummies as additional variables. This affects the measures of uncertainty and, therefore, statistical significance, as observations within the same year are not independent. In fact, we want to estimate FE models precisely because we have reason to believe that they are not independent of each other.
To adjust for this dependence of observations, we must calculate robust - or clustered or panel-corrected - standard errors. This is important and should always be done when estimating FE models - otherwise standard errors will be (heavily) biased. In turn, this would mean we make wrong inferences.
Luckily, we can easily adjust standard errors, either manually using special packages such as the sandwich package or automatically as part of the regression estimation. To do the latter, we can use the lm_robust() command from the estimatr package. Its syntax is similar to the general lm() command, but it allows you to simply specify fixed effects by adding a fixed_effects= argument, followed by the variable indicating the FE. Robust standard errors are automatically calculated when you use that function.
Exercise 5: Estimate the same time-FE models as above with robust standard errors using the lm_robust() function.
Reveal Answer
require(estimatr)

# Bivariate time-FE with robust SEs
FE_robust_bivariate <- lm_robust(GOPVoteShare ~ Rain, fixed_effects= factor(Year), data=weather)
summary(FE_robust_bivariate)

# Multivariate time-FE with robust SEs
FE_robust_covariates <- lm_robust(GOPVoteShare ~ Rain + PcntBlack + ZPcntHSGrad + FarmsPerCap + AdjIncome + Literacy, fixed_effects= factor(Year), data=weather)
summary(FE_robust_covariates)
Exercise 6: Display the regression output of all six models in a single table. Interpret the models.
Reveal Answer
require(texreg)
# This is one possible option, there are many others.
screenreg(l=list(ols_bivariate, ols_covariates, FE_bivariate, FE_covariates, FE_robust_bivariate, FE_robust_covariates),
custom.header = list("Pooled OLS" = 1:2, "Time FE" = 3:4, "Robust Time FE" = 5:6),
omit.coef = "(Intercept)")
Note that the default display of lm_robust() differs from the standard OLS output we have encountered so far. Instead of standard errors, the table indicates the 95% confidence intervals of the estimates - this is indeed an increasingly popular way to present results, as confidence intervals give a clearer indication of the effect’s uncertainty. Also, all significant results are marked by only one asterisk, which indicates that the value of the null hypothesis (i.e. 0) falls outside the confidence interval.
Pay attention to the significant differences between the OLS models and the FE models. Based on the pooled OLS model, we would have thought that Rain is associated with an increase in GOP vote share once we control for covariates. However, doing so does not account for temporal changes and is based on comparisons across different elections. In other words, a pooled OLS compares different observations from different points in time of different units to each other. The FE models, by contrast, suggest that there is no significant association between Rain (or AdjIncome) and GOPVoteShare once we include a list of (time-variant) covariates into our model. Note that in the FE models we are estimating effects of Rain within each individual election year and do not simply compare observations from different election years to each other.
8.2.1 PLM Function
Rather than using lm() or lm_robust(), we can use the plm() function from the plm package, which allows us to conduct unit and time fixed effects. The syntax is quite simple: plm(formula, effect = "", model = "", index = c("unit", "year"), data = data). The effect argument sets what type of model you want to run; the options include "individual", which is unit-fixed effects, "time" for time-fixed effects, and "twoways" for two-way fixed effects. The model argument determines the estimation method. If you want to calculate the coefficients by demeaning both the dependent and independent variables, and then calculate estimates using deviations from the mean, you need to set the model argument equal to "within". What essentially happens with the "within" estimation is that it subtracts group-specific mean values from the dependent variable and the explanatory variables, which removes unit-specific or time-specific effects. If you want to estimate the coefficients using the first difference estimator, you can set the model argument equal to "fd". The index argument tells the function the structure of the data, i.e. which variable contains the units and which variable contains the time periods, so we need to tell the function which variable is the entity identifier and which is the time identifier.
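Putting the pieces together, a skeleton call looks like this (shown as a commented sketch; the exercises below fill in the actual variables):

# Illustrative skeleton of a plm() call
# model_fe <- plm(outcome ~ treatment,
#                 effect = "individual",              # or "time" / "twoways"
#                 model  = "within",                  # or "fd" for first differences
#                 index  = c("FIPS_County", "Year"),  # unit id, then time id
#                 data   = weather)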
Now, let’s estimate a fixed-effects model using the plm() function. Let’s start by conducting a time-FE model for the Turnout variable for both “treatment” variables, Rain_Dev and Snow_Dev. Let’s conduct the simplest version without covariates.
Exercise 7: Run a time-FE model for both weather variables (Rain_Dev and Snow_Dev, in separate models). Set effect equal to "time", set the model argument equal to "within", and set the index argument equal to c("FIPS_County", "Year"). Store the output of your time-FE model into an object and then use summary() to check your results. Interpret the results of both estimations.
Reveal Answer
turnout_time_plm_rain <- plm(Turnout ~ Rain_Dev, effect = "time", model = "within", index = c("FIPS_County", "Year"), data = weather)
summary(turnout_time_plm_rain)

turnout_time_plm_snow <- plm(Turnout ~ Snow_Dev, effect = "time", model = "within", index = c("FIPS_County", "Year"), data = weather)
summary(turnout_time_plm_snow)
We find that a unit increase in the deviation of rainfall from “normal” election day weather is associated with a 2.16 increase in turnout, once we include time-fixed effects. The effect of snowfall seems to be smaller at 1.0105. Note that the results of these bivariate models are in the opposite direction of what the authors found in their study. Let’s do the same, but now looking at the vote share for Republicans.
Exercise 8: Run a time-FE model using the plm() function, but look at the effect of rainfall and snowfall on the vote share for Republicans (GOPVoteShare). Remember to set the effect argument equal to "time" and the model argument equal to "within". Store the output from the plm() function into an object. Use the summary() command to check the results. Interpret the coefficients of the Rain_Dev and Snow_Dev “treatment” variables. Do above-average rainfall and snowfall increase or decrease GOP vote share? Are the results statistically significant?
Reveal Answer
gop_time_plm_rain <- plm(GOPVoteShare ~ Rain_Dev, effect = "time", model = "within", index = c("FIPS_County", "Year"), data = weather)
summary(gop_time_plm_rain)

gop_time_plm_snow <- plm(GOPVoteShare ~ Snow_Dev, effect = "time", model = "within", index = c("FIPS_County", "Year"), data = weather)
summary(gop_time_plm_snow)
We find that a unit increase in the deviation from the average rainfalls on election day is associated with an increase in vote share for the Republicans by 1.93 percent, once we control for time-FE. Meanwhile, a unit increase in the deviation from the average snowfall is associated with a 0.01 increase in vote share for Republicans. However, this last result is not statistically significant. These results are more in line with what the authors found in their study.
8.2.2 Unit-FE
With the plm() function we can also conduct unit-FE. All we need to do is set the effect argument equal to "individual". Let’s do that:
Exercise 9: Conduct a unit-FE model looking at the effect of weather on voter turnout (Turnout) and the vote share for Republicans (GOPVoteShare). Use both measures of weather, Rain_Dev and Snow_Dev. Store the output of your panel data estimations into objects and use the summary() function to check the results.
Reveal Answer
# Turnout - rain
turnout_unit_plm_rain <- plm(Turnout ~ Rain_Dev, effect = "individual", model = "within", index = c("FIPS_County", "Year"), data = weather)
summary(turnout_unit_plm_rain)

# Turnout - snow
turnout_unit_plm_snow <- plm(Turnout ~ Snow_Dev, effect = "individual", model = "within", index = c("FIPS_County", "Year"), data = weather)
summary(turnout_unit_plm_snow)

# GOP vote share - rain
gop_unit_plm_rain <- plm(GOPVoteShare ~ Rain_Dev, effect = "individual", model = "within", index = c("FIPS_County", "Year"), data = weather)
summary(gop_unit_plm_rain)

# GOP vote share - snow
gop_unit_plm_snow <- plm(GOPVoteShare ~ Snow_Dev, effect = "individual", model = "within", index = c("FIPS_County", "Year"), data = weather)
summary(gop_unit_plm_snow)
Using county-level FE, we find results that are in line with the findings reported by Gomez et al. We find that a unit increase in the deviation from the average weather on election day is associated with a decrease in voter turnout of 2.68 per cent. We find similar results for snow (-0.87). We also find that relatively bad weather favours Republicans, with a sizeable 4.96 percentage point increase in their vote share.
8.2.3 Two-Way Fixed Effects
Now that we have conducted time and unit FE separately, we can conduct a two-way FE model that controls for unit and time FE at the same time. In a two-way FE model, we are controlling for all unit-specific time-invariant confounders and all year-specific confounders that affect all units at the same time. Doing this with plm() is very simple: we only need to change the effect argument and set it equal to "twoways".
Exercise 10: Conduct a two-way FE model looking at the effect of weather on voter turnout and GOP vote share. Use both measures of weather, Rain_Dev and Snow_Dev. Set the effect argument equal to "twoways". Store the outputs into objects and then use summary() to check the results.
Reveal Answer
# Two-way Turnout - rain
turnout_twoway_plm_rain <- plm(Turnout ~ Rain_Dev, effect = "twoways", model = "within", index = c("FIPS_County", "Year"), data = weather)
summary(turnout_twoway_plm_rain)

# Two-way Turnout - snow
turnout_twoway_plm_snow <- plm(Turnout ~ Snow_Dev, effect = "twoways", model = "within", index = c("FIPS_County", "Year"), data = weather)
summary(turnout_twoway_plm_snow)

# Two-way GOP vote share - rain
gop_twoway_plm_rain <- plm(GOPVoteShare ~ Rain_Dev, effect = "twoways", model = "within", index = c("FIPS_County", "Year"), data = weather)
summary(gop_twoway_plm_rain)

# Two-way GOP vote share - snow
gop_twoway_plm_snow <- plm(GOPVoteShare ~ Snow_Dev, effect = "twoways", model = "within", index = c("FIPS_County", "Year"), data = weather)
summary(gop_twoway_plm_snow)
We again find evidence that bad weather is associated with a decrease in turnout and an increase in GOP vote share once we include time and unit fixed effects. With time and unit fixed effects we are controlling for all time-invariant confounders. We are also controlling for time-specific confounders, but only a particular type: those that affect all units in the same period of analysis. Thus, we still need to control for time-varying unit-specific confounders, such as average income or unemployment rates at the county level, which differ from one county to another and also vary from one presidential cycle to another.
Exercise 11: Run a two-way FE model looking at the effect of weather on Turnout only. The syntax is the same as for lm(): you can include additional variables in your model by adding the variable names separated by + signs. Include the following covariates: PcntBlack, ZPcntHSGrad, FarmsPerCap, AdjIncome and Literacy. Store the output from your plm() estimations into separate objects and then use the summary() function to check the results. Again, conduct separate estimations, one using the Rain_Dev variable and another using the Snow_Dev variable.
Reveal Answer
# Turnout - rain + covariates
turnout_twoway_plm_rain_cov <- plm(Turnout ~ Rain_Dev + PcntBlack + ZPcntHSGrad + FarmsPerCap + AdjIncome + Literacy, effect = "twoways", model = "within", index = c("FIPS_County", "Year"), data = weather)
summary(turnout_twoway_plm_rain_cov)

# Turnout - snow + covariates
turnout_twoway_plm_snow_cov <- plm(Turnout ~ Snow_Dev + PcntBlack + ZPcntHSGrad + FarmsPerCap + AdjIncome + Literacy, effect = "twoways", model = "within", index = c("FIPS_County", "Year"), data = weather)
summary(turnout_twoway_plm_snow_cov)
Now let’s report the FE models for the turnout outcome variable in a single regression table. Let’s also only report the rainfall models (otherwise we would have 14 models). Let’s use screenreg() and compare the results. You can store all of the outputs in a single list() and give them names, as you can see below.
plm_models <- list("Time fixed effect" = turnout_time_plm_rain,
                   "Unit fixed effect" = turnout_unit_plm_rain,
                   "Twoway" = turnout_twoway_plm_rain,
                   "Twoway + cov" = turnout_twoway_plm_rain_cov)
Exercise 12: Report in a regression table the following models: the time FE, unit FE, two-way, and two-way-with-covariates FE models for the rainfall variable (Rain_Dev) and the voter turnout (Turnout) outcome only. Use the screenreg() function.
Reveal Answer
screenreg(plm_models,
stars = 0.05,
digits = 2,
custom.coef.names = c("Rain", "% Black ", "Graduates", "Rural", "Income", "Literacy"),
reorder.coef = c(1, 3, 4, 5, 2 ,6),
caption = "Fixed effects models - Voter turnout",
caption.above = TRUE)
One limitation of the plm() function is that it calculates clustered standard errors without correcting for small samples, and thus without reducing downward bias. The data that we are using are far from small, but when you are dealing with small panels, you may want to conduct this small-sample correction. For that, we can use the coeftest() and vcovHC() functions to obtain cluster-robust standard errors and draw inference based on them.
The syntax is a little bit tricky, but let’s look at the code below from the inside out. The first function is vcovHC(), which calculates heteroscedasticity-consistent standard errors. Once we have calculated our standard errors, we can perform a z/quasi-t Wald test with coeftest(), using the standard errors that we estimated from vcovHC(). The first argument of vcovHC() is the object where you stored the output of your fixed-effects model from plm(). The second argument is type: the type of small-sample adjustment that we want to make. "HC1" is the commonly used approach to correct for small samples, so we set type equal to "HC1". Finally, the third argument, cluster, determines whether we want to obtain the variance-covariance matrix clustered by units ("group") or by time ("time"); you can set both with cluster = c("group", "time"). The second function is coeftest(). Its first argument is the object where you stored your output from plm(). The second argument, vcov, is the variance-covariance matrix of the estimated coefficients. In this case, rather than using the non-robust variance-covariance matrix from plm(), we use the White heteroscedasticity-consistent covariance matrix that we obtained from vcovHC(). Using this matrix also allows us to deal with serial correlation, cross-sectional dependence, and heteroscedasticity across groups and time. You can see the syntax below for the two-way fixed-effects model where we looked at the effect of rainfall on voter turnout, including covariates.
out_rain_cov <- coeftest(turnout_twoway_plm_rain_cov, vcov = vcovHC(turnout_twoway_plm_rain_cov, type = "HC1", cluster = c("group", "time")))
out_rain_cov
It’s important to separate estimation from inference. First we obtained our point estimate, which is equal to -1.23; but when we want to draw inferences from this point estimate, we need to make the appropriate adjustments to yield accurate standard errors. The “unadjusted” standard errors from plm() for the rainfall variable were 0.159; once we applied this small-sample bias correction, they changed to 0.170 - not much, roughly a 6% increase. Presumably the standard errors did not change substantially because we are working with a relatively large sample. In other cases, the change might well have a meaningful impact on the interpretation of your findings.
Now obtain the estimates of the two-way fixed effect model of snowfalls on voter turnout and adjust your standard errors.
Exercise 13: Perform the small-sample correction to the standard errors obtained from the two-way FE model with covariates; do this for the model that looks at the effect of snowfall (Snow_Dev) on voter turnout (Turnout). Compare the standard errors to the “unadjusted” standard errors obtained from plm().
Reveal Answer
summary(turnout_twoway_plm_snow_cov)

out_snow_cov <- coeftest(turnout_twoway_plm_snow_cov, vcov = vcovHC(turnout_twoway_plm_snow_cov, type = "HC1", cluster = c("group", "time")))
out_snow_cov
We can see that without the small-sample adjustment the standard errors were 0.136, but once we applied the adjustment we got slightly larger standard errors (0.167). They don’t seem to change much, but this is actually a 22% increase, which may radically change whether your results are statistically significant or not!
8.2.4 Fixed Effects or Random Effects?
So far we have looked at fixed effects using the lm() and plm() functions, assuming that we should run a fixed-effects model. But which model should we go with? Let’s now test whether fixed effects or random effects is the more appropriate model. Recall the differences between the two estimators.
The fixed effects estimator - or within estimator - accounts for unobserved heterogeneity that is correlated with the independent covariates. In this model, the group means are fixed, with each group having a separate intercept - recall the year dummies in the models above. For the time-fixed effects in our example, this means we are essentially looking at the demeaned data for each time period: we are looking at the variation across units within a certain year, i.e. within each presidential election. In other words, we are estimating our coefficient while time is held fixed.
\[FE: Y_{i,t} = \alpha_i + \beta_1 X_{i,t} + u_{i,t}\]
In the case of a random effects model, we assume that \(\alpha_i\) has zero mean. In other words, instead of using fixed means per group, the group means are treated as a random sample from a population. While the fixed effects model ‘gets rid’ of unobserved time-invariant heterogeneity (\(\alpha_i\)), the random effects model makes the assumption that the unobserved effect is uncorrelated with the explanatory variables. If the assumption holds, the random effects estimator is more efficient and, unlike the FE estimator, can estimate the effects of time-invariant covariates. In essence, the random effects model is a weighted average of the within estimator and a between estimator that looks at effects between units.
\[RE: Y_{i,t} = \alpha_i + \beta_0 + \beta_1 X_{i,t,1} + \beta_2 X_{i,t,2} + u_{i,t}\]
We can use two different tests to check our data in order to find out which model is more appropriate. First, the Hausman-Test (sometimes also called Durbin–Wu–Hausman test) analyses the difference of the vectors of the coefficients of the two models under the null hypothesis that the models’ estimates are consistent. If the test finds that the models are consistent, we choose the random effects model as it is more efficient than the fixed effects model. However, the random effects model is inconsistent if there indeed is unobserved heterogeneity correlated with the covariates due to omitted variables, in which case the FE estimator should always be preferred.
An F-Test looks at whether any additional effects - such as one-way or two-way effects - are predictive of the outcome. That is, it tests the effects introduced in a within-model as compared to another model such as a random effects model. If the effects of the within-estimator are not significant, we can, again, use the random effects model to increase the efficiency of our estimators. If this is not the case, we go with FE, which is always unbiased with regard to unobserved heterogeneity that is constant over time.
Let’s now conduct the tests. To do so, we first specify fixed-effects and random-effects models using the plm package. The syntax is essentially the same as for lm(). However, you need to add the model argument as well as the index argument to specify the time variable (for fixed effects) and the unit and time variables (for random effects).
Exercise 14: Specify a time-fixed effects model and a random effects model using plm(). Run these panel data estimations for the GOP vote share outcome. Include the same covariates that we included before. Store the outputs from these models into different objects.
Reveal Answer
require(plm)

# Time-FE
FE_time <- plm(GOPVoteShare ~ Rain + PcntBlack + ZPcntHSGrad + FarmsPerCap + AdjIncome + Literacy, model= "within", index="Year", data=weather)

# Note: The identical model can be specified by explicitly indicating time-FE:
# FE_time2 <- plm(GOPVoteShare ~ Rain + PcntBlack + ZPcntHSGrad + FarmsPerCap + AdjIncome + Literacy, model= "within", effect="time", index=c("FIPS_County","Year"), data=weather)

# Random effects
random <- plm(GOPVoteShare ~ Rain + PcntBlack + ZPcntHSGrad + FarmsPerCap + AdjIncome + Literacy, model= "random", index=c("FIPS_County", "Year"), data=weather)
Exercise 15: Conduct both a Hausman test and an F-test using the models you just specified. Interpret your results.
You can use the phtest() and pFtest() commands for the Hausman and F-tests, respectively. The syntax is simple and similar for both: command(Model1, Model2).
Reveal Answer
phtest(FE_time, random)
pFtest(FE_time, random)
What do the tests tell us? In the case of the Hausman test, we can reject the null hypothesis of consistency and find that the models are inconsistent. The F-test shows us that we can reject the null hypothesis of no significant effects in the fixed-effects model: The FE model indeed adds significant effects. So, both models would caution us against using the random effects model and make us choose the FE estimator. Both tests giving us the same indication is the best outcome as we can be pretty sure about the need to choose the FE model. However, in some cases you might get two different indications - in which case it usually is a good choice to choose the conservative option: the unbiased but (possibly) inefficient FE estimator.
8.2.5 First Differences
Let’s now finally look at another estimator that can be of use when working with panel data. The First Difference Estimator looks at the differences between time units. Similar to the FE estimator, time invariant unit-specific covariates are dropped from the model as it only considers changes from one time period to the other:
\[FD: \Delta Y_i = \delta + \beta_1 \Delta X_{i1} + \beta_2 \Delta X_{i2} + \Delta v_i\]
You might have noticed a similarity to the FE estimator. While the FE estimator demeans over units or times, the FD estimator considers differences over time periods. Accordingly, the first difference estimator is identical to the FE estimator if the number of time periods is two: Whenever this is the case, the differencing of the FD estimator and the demeaning of the FE estimator are equivalent.
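As a quick sketch of how one could compare the two estimators on the same specification (a bivariate version here for brevity; with T = 14 periods in our data the estimates will generally differ):

# Compare the within (FE) and first-difference (FD) estimates of the same model
fe_check <- plm(GOPVoteShare ~ Rain, model = "within", effect = "individual",
                index = c("FIPS_County", "Year"), data = weather)
fd_check <- plm(GOPVoteShare ~ Rain, model = "fd",
                index = c("FIPS_County", "Year"), data = weather)
coef(fe_check); coef(fd_check)  # with only two time periods these would coincide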
Exercise 16: Estimate the first difference estimator using the same model specification as in Exercise 6. Make sure to use the plm package, to specify the FD estimator with the argument model="fd", and to specify the unit and time variables using the index argument.
Reveal Answer
fd_model <- plm(GOPVoteShare ~ Rain + PcntBlack + ZPcntHSGrad + FarmsPerCap + AdjIncome + Literacy, data=weather, effect="individual", model="fd", index=c("FIPS_County","Year"))
summary(fd_model)
The output of the FD estimator indicates the association between a change in an independent variable from one time period to the following period and the change in the outcome for the same unit. That is, if Rain in a county increases by one inch, Republican vote share is estimated to be 8%-points higher on average in the next period. Note that the FD estimator does not consider longer time effects, i.e. those that persist beyond the following time period. This might be useful if we have strong reasons to believe this is what happens; otherwise the FE estimator is generally the more powerful estimator.
8.2.6 HOMEWORK:
Now let’s go back to estimating the effect of weather on voter turnout. We can also look at the extensive and intensive margins of weather on voter turnout. The extensive margin refers to a discrete change in the level at which an activity or intervention takes place - for example, going from normal election day weather to a day that is above this average. Let’s dichotomise the Rain_Dev and Snow_Dev variables, setting them to 1 for any day on which the weather is worse than normal (Rain_Dev > 0 or Snow_Dev > 0). Meanwhile, the intensive margin refers to the degree (intensity) to which a resource is utilised - more broadly, how strong the intervention is - which is basically what we measured before.
Homework 1: Create a dummy variable
Homework 1
weather <- weather %>%
  mutate(dummy_rainfall = ifelse(Rain_Dev > 0, 1, 0),
         dummy_snowfall = ifelse(Snow_Dev > 0, 1, 0))

# 9616 county-elections with rainfall above the county's election day average
table(weather$dummy_rainfall)

# 2439 county-elections with snowfall above the county's election day average
table(weather$dummy_snowfall)
Homework 2: Run plm() looking at the extensive margin of weather on voter turnout. Use the plm() function and include the same covariates that we used in the previous exercise. Set the model argument equal to "within" and the effect argument equal to "twoways" so we get two-way FE estimates.
Homework 2
extensive_rain_cov <- plm(Turnout ~ dummy_rainfall + PcntBlack + ZPcntHSGrad + FarmsPerCap + AdjIncome + Literacy, effect = "twoways", model = "within", index = c("FIPS_County", "Year"), data = weather)
summary(extensive_rain_cov)