Lab 1: Randomised Trials

Lecture Slides & Data






Recap

Causality

This week's lecture introduced the concepts of causality and the potential outcomes framework. We reviewed the concept of causality, which refers to "the relationship between events: one set of events (the effects) is a direct consequence of another set of events (the causes)".

We also discussed the logic of counterfactuals: if X had/had not been the case, Y would/would not have happened. This implies that in order to make causal claims, our main task is to construct counterfactuals.

Given that we are interested in the effect of X on Y, we can use regression as a statistical tool to characterise the relationship between two or more variables. We can formalise OLS as follows:

\[Y_i = \beta_0 + \beta_1 X_i + \mu_i\] We also discussed omitted variable bias, which can persist despite controlling for additional covariates: there may still be unobservable confounders that we are not able to account for.
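The omitted variable problem is easy to see in a small simulation (hypothetical data, not part of the lab): when an unobserved confounder `z` drives both `x` and `y`, a regression that omits `z` produces a biased estimate of \(\beta_1\).

```r
# Hypothetical simulation of omitted variable bias (illustration only).
set.seed(42)
n <- 10000
z <- rnorm(n)                  # unobserved confounder
x <- 0.8 * z + rnorm(n)        # "treatment" correlated with z
y <- 2 * x + 3 * z + rnorm(n)  # z also affects the outcome; true effect of x is 2

short <- lm(y ~ x)      # omits z: biased estimate of the effect of x
long  <- lm(y ~ x + z)  # controls for z: approximately unbiased

coef(short)["x"]  # noticeably above 2 because z is omitted
coef(long)["x"]   # close to the true value of 2
```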

Potential Outcomes

We also discussed the Potential Outcomes Framework. In particular, we discussed The Fundamental Problem of Causal Inference, according to which it’s impossible to observe outcomes of different treatment status for the same unit \(i\).

We also learned about different estimands such as ATE and ATT.

The lecture covered the issue of selection bias, which corresponds to pre-treatment differences in the outcome between the treated and control groups.

We also learned that with random assignment we can circumvent the issue of selection bias. In theory - or in expectation - randomly allocated units in the treatment and control groups are similar on all observed and unobserved traits and differ only in their treatment assignment \(D_i\).
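A quick simulated sketch (hypothetical data, not the lab dataset) of why this works: under random assignment, a pre-treatment trait such as age has (almost) the same mean in both groups.

```r
# Hypothetical illustration: random assignment balances covariates in expectation.
set.seed(7)
n   <- 100000
age <- rnorm(n, mean = 40, sd = 12)     # a pre-treatment trait
d   <- rbinom(n, size = 1, prob = 0.5)  # random treatment assignment

# The difference in group means of the covariate should be close to zero.
mean(age[d == 1]) - mean(age[d == 0])
```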

Finally, we talked about the key assumptions for identification:

  • SUTVA
  • Unconfoundedness


In this seminar, we will cover the following topics:

Identifying the ATE

  1. Use the lm() function to retrieve the ATE
  2. Use lm() with covariates to retrieve the ATE

Randomisation Checks

  1. Perform two-sample t-tests using the t.test() and chisq.test() functions to compare treated versus control groups.
  2. Conduct a Kolmogorov–Smirnov test using ks.test() to compare the cumulative distribution of a continuous variable between the treated and control groups.
  3. Check for balance using multinom() when there are multiple treatments, and a logit via glm() or lm() when there are two conditions (treatment and control).

Before starting this seminar

  1. Create a folder called “lab1”.
  2. Download the data (you can use this button or the one at the top, or read csv files directly from github):


  3. Open an R script (or Markdown file) and save it in your “lab1” folder.
  4. Set your working directory using the setwd() function or by clicking on “More”. For example: setwd("~/Desktop/Causal Inference/2022/Lab1")


Seminar Overview

Party Brands.

Today we will be working with data from Noam Lupu’s work on Party Brands and Partisanship in Argentina.

What we need to know about the paper:

  • Two main theories on the dynamics of partisanship:
    • Partisanship as a social identity (stable)
    • Partisanship as rational maximisation of expected utility (unstable)

What is Lupu trying to empirically test in this paper?

  • Testable implication: when parties converge, their brands are diluted; thus, it becomes more difficult for voters to distinguish parties from one another, and party attachments subsequently weaken.

Details of the experiment

  • A survey experiment
  • Manipulation of information about political parties suggesting either divergence or convergence among the parties
  • Outcomes of interest:
    • Partisanship: whether the respondent said they identified with a political party
    • Partisan attachment: on a 0 - 10 scale
  • Between-subject design with three-treatment arms, plus a control
    • Control: Mostly blank information that listed only the party symbols, along with the name of the parties and their leaders.
    • Platform treatment: provided information about each party’s platform on economic, security, and federal issues. The idea is that this treatment will enhance respondents’ certainty about party brands and, in turn, their ability to identify with a party. The main prediction is therefore that party attachment will increase in this treatment (positive treatment effect)
    • Alliance Switching treatment: opposite to the platform treatment, this arm provided information suggesting that parties are less distinguishable, such as interparty alliances and party switching. We therefore expect party attachment to weaken (negative coefficient)
    • Full text treatment: this treatment contained both types of information provided in the Platform and Alliance treatment arms. The main prediction is that party attachment will not change, as the two effects countervail.
  • The information was gleaned from party manifestos and official statements

A description of the variables is listed below:

| Name | Description |
|------|-------------|
| formulario | Treatment assignment variable, from 1 to 4 |
| frm2 | Binary variable equal to 1 for subjects that received treatment 2 (Platform Information) |
| frm3 | Binary variable equal to 1 for subjects that received treatment 3 (Alliance Switching) |
| frm4 | Binary variable equal to 1 for subjects that received treatment 4 (Full Text) |
| pidstr | A 10-point scale representing the strength of the respondent’s party identification |
| pid | Binary variable equal to 1 for subjects who report that they identify with a party |
| age | Continuous variable with the age of the respondent |
| income | Categorical variable with values from 1 to 10; 1 = lowest income, 10 = highest |
| educgrp | Categorical variable: 1 = No primary, 2 = Primary, 3 = Secondary, 4 = Tertiary |
| info | Respondent’s level of political information: 1 = low, 2 = medium, 3 = high |
| female | Binary variable indicating the gender of the respondent: 1 = female, 0 otherwise |

Now let’s load the data. There are two ways to do this:

You can load the brands dataset from your laptop using the read_csv() function from the readr package (remember to load the package first). If you stored your data in the directory you set as your working directory, you only need to write the name of the dataset, plus the extension (.csv).

# setwd("C:/Desktop/Causal Inference/2022/Lab1")
# 
# library(readr)
# brands <- read_csv("brands.csv") # loading your data from your laptop

Or you can load the data from the course website using the read.csv() function. Write the following url: https://raw.githubusercontent.com/dpir-ci/CI22/gh-pages/docs/data/lupubrands.csv inside the read.csv() function

brands=read.csv("https://raw.githubusercontent.com/dpir-ci/CI22/gh-pages/docs/data/lupubrands.csv")
# loading the data from the course website. 

Let’s create a copy of the data frame called “brands_tidy”, so we can work with a separate data frame using R code in tidyverse style:

brands_tidy <- brands

What variables does the data include?

We can inspect the data using several functions, such as the head() function, which returns the first 6 observations of your dataset. The syntax is very simple: you just pass the name of your data frame to the head() function. Go ahead and use the head() function with the brands data frame.

head(brands) #Checking first 6 rows of the data
  X nr formulario time pidstr frm2 frm3 frm4 age female income educgrp cordoba
1 1  1          3    1     NA    0    1    0  32      0     10       3       1
2 2  2          4    1     NA    0    0    1  26      0      3       3       1
3 3  3          1    1      3    0    0    0  34      0      5       3       1
4 4  4          2    1     NA    1    0    0  30      0      5       3       1
5 5  5          3    1     NA    0    1    0  22      0      8       3       1
6 6  6          4    2     NA    0    0    1  27      0     NA       3       1
  info pid polar mandate
1    2   0     7       2
2    0   0     6       5
3    2   1     5       4
4    0   0     4       5
5    0   0    NA       2
6    1   0    NA       4

Another function that we can use is glimpse(), from the dplyr package. It shows you a transposed version of the data and also reports the data type of each variable, for example whether variables are numeric, integer, logical or character.

glimpse(brands)
Rows: 1,199
Columns: 17
$ X          <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, ~
$ nr         <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, ~
$ formulario <int> 3, 4, 1, 2, 3, 4, 1, 1, 2, 2, 2, 1, 1, 4, 3, 3, 1, 3, 2, 4,~
$ time       <int> 1, 1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 2,~
$ pidstr     <int> NA, NA, 3, NA, NA, NA, 8, NA, 10, 9, NA, 8, NA, 7, NA, NA, ~
$ frm2       <int> 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0,~
$ frm3       <int> 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0,~
$ frm4       <int> 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,~
$ age        <int> 32, 26, 34, 30, 22, 27, 72, 27, 24, 27, 30, 41, 27, 50, 43,~
$ female     <int> 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0,~
$ income     <int> 10, 3, 5, 5, 8, NA, 4, 5, 10, 9, 3, 3, 4, 4, 5, 6, 10, 3, 6~
$ educgrp    <int> 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 4, 4, 3, 2, 4, 2,~
$ cordoba    <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
$ info       <int> 2, 0, 2, 0, 0, 1, 1, 1, 2, 2, 0, 2, 0, 2, 2, 2, 1, 1, 1, 1,~
$ pid        <int> 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1,~
$ polar      <int> 7, 6, 5, 4, NA, NA, 8, NA, 7, 9, NA, 10, 6, 5, 0, 5, 10, 10~
$ mandate    <int> 2, 5, 4, 5, 2, 4, 1, 4, 2, 4, 1, 1, 1, 4, 3, 1, 3, 2, 1, 1,~

Now let’s focus on the treatment variables and the outcome. The four treatment conditions are captured by the categorical variable formulario:

  • 1 = Control group
  • 2 = Platform Information
  • 3 = Alliance Switching
  • 4 = Full Text

You can inspect this variable using the table() function and see how many observations there are in each treatment condition:

table(brands$formulario)

  1   2   3   4 
301 295 302 301 

This variable has been recoded into dummies: frm2 takes the value 1 if the respondent received the platform information treatment, frm3 for alliance switching, and frm4 for the full text. Note the omitted dummy represents the control group.

Rather than modifying the formulario variable, let’s create a new variable called treat where the control group is equal to 0 rather than 1.

Go ahead and create a new variable and call it treat. To create it, subtract 1 from the formulario variable. You can use the assignment operator <- to create new variables. In the tidyverse, you can create this variable with the mutate() function, chained with the pipe operator %>%, which passes the data frame on its left to the function on its right.

Reveal Answer

brands$treat=brands$formulario-1  #Assigning value 0 to the control group

# tidyverse
brands_tidy <- brands_tidy %>%
  mutate(treat = formulario - 1) 



Did we define the treatment variable correctly? Let’s use the head() function to inspect our data, but rather than showing all the variables, let’s subset our data frame to the assignment variables: formulario and treat. We can do this by calling the data frame, followed by the $ sign and the variable of interest. We can then combine these two vectors (variables) with the cbind() function, which binds a sequence of vectors together as columns.

head(cbind(brands$formulario, brands$treat))
     [,1] [,2]
[1,]    3    2
[2,]    4    3
[3,]    1    0
[4,]    2    1
[5,]    3    2
[6,]    4    3

It seems that it worked: for observations where the formulario variable is equal to 4, the treat variable is equal to 3. Similarly, for the control group formulario is equal to 1, whereas treat is equal to 0.

Now, in addition to the formulario and treat variables, use the head() function including the frm2, frm3 and frm4 dummies, combining these variables with the cbind() function.

Reveal Answer

head(cbind(brands$formulario, brands$treat, brands$frm2,brands$frm3,brands$frm4))
     [,1] [,2] [,3] [,4] [,5]
[1,]    3    2    0    1    0
[2,]    4    3    0    0    1
[3,]    1    0    0    0    0
[4,]    2    1    1    0    0
[5,]    3    2    0    1    0
[6,]    4    3    0    0    1

Everything looks good! Taking the first row as an example: when formulario==3 (column 1), the treatment variable we generated takes the value 2 (formulario-1) and the dummy for value 3 (frm3) takes the value 1, whereas the others are 0.


OLS Estimations

Lupu is interested in the effect of the three treatment conditions on partisanship. Among the two outcomes, pid (whether the individual identifies with a party) and pidstr (a 10 point scale representing the strength of the respondent’s party identification), we will focus on the latter.

Exercise 1: Regress the strength of respondents’ party identity on the treatment

Reveal Answer

summary(lm(pidstr~treat, data=brands))

Call:
lm(formula = pidstr ~ treat, data = brands)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.3801 -1.3801  0.1589  1.7996  3.1589 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   7.3801     0.1828  40.363   <2e-16 ***
treat        -0.1797     0.1017  -1.766   0.0782 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.283 on 395 degrees of freedom
  (802 observations deleted due to missingness)
Multiple R-squared:  0.007831,  Adjusted R-squared:  0.00532 
F-statistic: 3.118 on 1 and 395 DF,  p-value: 0.07821


Does this regression make sense?
Reveal Hint Run the command class(brands$treat). What does it reveal about the variable?

Reveal Answer It doesn’t! The OLS regression we just ran assumes that treat is a metric - or continuous - variable, such as vote shares or trade volume. Accordingly, the coefficient for treat shows the increase in the dependent variable associated with a one-unit increase in treat. Given what the variable means, we know this doesn’t make any sense: the values of treat simply describe assignment to different treatments, and both the differences between them and their ordering are arbitrary. That is, treat is a categorical variable. We must thus run a factor regression model.
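A toy illustration (hypothetical data, not the brands dataset) of the difference between the two specifications: coding the assignment as numeric forces a single slope across all groups, while factor() gives each group its own coefficient.

```r
# Toy data: a four-valued assignment variable and an outcome (hypothetical).
set.seed(1)
treat <- sample(0:3, 200, replace = TRUE)
y <- ifelse(treat == 1, 1, ifelse(treat == 2, -1, 0)) + rnorm(200)

length(coef(lm(y ~ treat)))          # 2 coefficients: one slope forced through all groups
length(coef(lm(y ~ factor(treat))))  # 4 coefficients: intercept + one per non-reference group
```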

Let’s assign labels to each group first. This will help with the output and makes the coding easier.

brands$treat[brands$treat==0]="Control"
brands$treat[brands$treat==1]="Platform"
brands$treat[brands$treat==2]="Switch"
brands$treat[brands$treat==3]="Full"

# tidy version:
brands_tidy <- brands_tidy %>%
  mutate(treat = recode(
    treat,
    `0` = "Control",
    `1` = "Platform",
    `2` = "Switch",
    `3` = "Full"
  ))


What will be our reference category here?

Exercise 2: Regress the strength of respondents’ party identity on the treatment, telling R to model treat as factor variable. Interpret the coefficients.

Reveal Answer

summary(lm(pidstr~factor(treat), data=brands))

Call:
lm(formula = pidstr ~ factor(treat), data = brands)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.7934 -1.1389  0.2066  1.8611  3.8611 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)            7.13084    0.21532  33.118  < 2e-16 ***
factor(treat)Full     -0.09991    0.31225  -0.320  0.74916    
factor(treat)Platform  0.66255    0.29556   2.242  0.02554 *  
factor(treat)Switch   -0.99195    0.33950  -2.922  0.00368 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.227 on 393 degrees of freedom
  (802 observations deleted due to missingness)
Multiple R-squared:  0.06022,   Adjusted R-squared:  0.05305 
F-statistic: 8.395 on 3 and 393 DF,  p-value: 2.021e-05
# tidy version: 
lin_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(pidstr ~ factor(treat), data = brands_tidy)

tidy(lin_fit)
# A tibble: 4 x 5
  term                  estimate std.error statistic   p.value
  <chr>                    <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)             7.13       0.215    33.1   8.89e-116
2 factor(treat)Full      -0.0999     0.312    -0.320 7.49e-  1
3 factor(treat)Platform   0.663      0.296     2.24  2.55e-  2
4 factor(treat)Switch    -0.992      0.339    -2.92  3.68e-  3

As you can see, both versions give exactly the same results. Since R now understands that treat is a factor variable, the model treats each level as a separate dummy variable and computes a coefficient for each. The coefficients show the effect on the dependent variable associated with each treatment, compared to the reference category. That is, receiving the switch treatment is associated with roughly a 1-point decrease in partisan attachment. Receiving the party platform treatment, on the other hand, is associated with a 0.66-point increase in party attachment. There is no significant effect, i.e. association, for respondents receiving the full treatment.


Exercise 3a - Create a new variable called treat2

We may want to change the reference category; for example, we may want to set the Platform treatment condition as the baseline and compare the other treatment effects to it. For that we can use the relevel() function. First, though, create a new variable called treat2 that is identical to treat, so that we don’t affect the results in the rest of the lab:

Reveal Answer

brands$treat2 <- brands$treat

# or using tidy

brands <- brands %>% 
  mutate(treat2 = treat)


Now that you have created the treat2 variable, we can change the levels of this categorical variable using the relevel() function. You can also create a new variable and change the reference category. Here is the syntax of these functions:

# dataframe$variable <- relevel(dataframe$variable, ref = "new reference category")
# NOTE: The relevel() function only works with unordered factors!

## or set the levels explicitly:
# dataframe$variable <- factor(dataframe$variable, levels = c("Level1", "Level2", "Level3", "Level4"))

If you create a new variable using the factor() function, the first level in the vector will be the reference level. Now, using either approach, change the reference category to “Platform” and run the same regression as in Exercise 2. Provide an interpretation of the coefficients for each treatment condition. Do these estimates tell the same story as before?
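A minimal relevel() demonstration on a toy factor (not the lab data), showing how the reference category moves to the front of the levels:

```r
# By default, factor levels are sorted alphabetically.
f <- factor(c("Control", "Platform", "Switch", "Full"))
levels(f)   # "Control" "Full" "Platform" "Switch"

# relevel() moves the chosen level to the front, making it the reference category.
f2 <- relevel(f, ref = "Platform")
levels(f2)  # "Platform" "Control" "Full" "Switch"
```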

Exercise 3b: Set the Platform treatment as the reference category


Reveal Answer

brands$treat2 <- factor(brands$treat2, levels = c("Platform", "Control", "Switch", "Full"))
summary(lm(pidstr ~ treat2, data=brands))

Call:
lm(formula = pidstr ~ treat2, data = brands)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.7934 -1.1389  0.2066  1.8611  3.8611 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)     7.7934     0.2025  38.490  < 2e-16 ***
treat2Control  -0.6625     0.2956  -2.242   0.0255 *  
treat2Switch   -1.6545     0.3315  -4.991 9.05e-07 ***
treat2Full     -0.7625     0.3035  -2.512   0.0124 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.227 on 393 degrees of freedom
  (802 observations deleted due to missingness)
Multiple R-squared:  0.06022,   Adjusted R-squared:  0.05305 
F-statistic: 8.395 on 3 and 393 DF,  p-value: 2.021e-05
We can see that the estimated coefficients of our new treatment variable have changed, given that the reference category is now Platform. The coefficient for the control condition means that, on average, those assigned to the Control condition report partisanship strength 0.66 points lower than those in the Platform condition. Is this story consistent with the regression results when we used the treat variable? Yes, it is! If you go back, you will see that the difference in mean partisanship between the Control and Platform conditions was 0.66.


As we have seen, the OLS regression results look like mere differences in means. Let’s see if that is true. Why should it (not) be true?


Let’s “manually” calculate the ATE for each treatment using the mean() function. The syntax of this function is mean(x, na.rm = FALSE), where x is the variable and na.rm indicates whether missing values should be excluded: setting na.rm = T or na.rm = TRUE excludes all observations with missing values.

But first, we need to subset our data by treatment condition. We can do this using square brackets [ , ]. The first argument inside the square brackets is a logical condition selecting which observations are included in your subset; the second argument, after the comma, selects which variables to keep. For example, to compute the group mean of partisanship strength (pidstr) within each treatment condition:

pid.control=mean(brands$pidstr[brands$treat=="Control"], na.rm=T)
pid.platform=mean(brands$pidstr[brands$treat=="Platform"],na.rm=T)
pid.switch=mean(brands$pidstr[brands$treat=="Switch"],na.rm=T)
pid.full=mean(brands$pidstr[brands$treat=="Full"],na.rm=T)

ate.platform=pid.platform-pid.control
ate.switch= pid.switch-pid.control
ate.full=pid.full-pid.control
# tidy version:

pid.control <- brands_tidy %>% # E_Y_X_control
  filter(treat == "Control") %>%
  summarise(conditional_mean = mean(pidstr, na.rm = TRUE))

pid.platform <- brands_tidy %>% # E_Y_X_platform
  filter(treat == "Platform") %>%
  summarise(conditional_mean = mean(pidstr, na.rm = TRUE))

pid.switch <- brands_tidy %>% # E_Y_X_switch
  filter(treat == "Switch") %>%
  summarise(conditional_mean = mean(pidstr, na.rm = TRUE))

pid.full <- brands_tidy %>% # E_Y_X_full
  filter(treat == "Full") %>%
  summarise(conditional_mean = mean(pidstr, na.rm = TRUE)) 

ate.platform=pid.platform-pid.control
ate.switch= pid.switch-pid.control
ate.full=pid.full-pid.control

ate.platform
  conditional_mean
1        0.6625473
ate.switch
  conditional_mean
1        -0.99195
ate.full
  conditional_mean
1      -0.09991329


Exercise 4: The ATEs we calculated just now are identical to the OLS slopes we calculated before. So, why do OLS?
Reveal Answer Firstly, we prefer models that can compute measures of dispersion for the estimates. An OLS regression takes into account the distribution of the variables as well as the number of observations; by simply subtracting group means we would not be able to tell whether the ATEs are statistically significant. Secondly, an advantage of OLS is that we can include (pre-treatment) covariates.


Still, can you think of other ways to estimate statistical uncertainty without using OLS?
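One possibility, sketched here on simulated (hypothetical) data rather than the lab dataset, is randomisation inference: re-randomise the treatment labels many times, recompute the difference in means each time, and compare the observed difference to this null distribution.

```r
# Sketch of randomisation inference on simulated data (hypothetical).
set.seed(123)
n <- 200
d <- rbinom(n, 1, 0.5)      # random treatment assignment
y <- 0.5 * d + rnorm(n)     # outcome with a true effect of 0.5

obs_diff <- mean(y[d == 1]) - mean(y[d == 0])

# Permute the labels many times to build the null distribution of the difference.
perm_diffs <- replicate(2000, {
  d_perm <- sample(d)
  mean(y[d_perm == 1]) - mean(y[d_perm == 0])
})

# Two-sided p-value: share of permuted differences at least as extreme as observed.
p_value <- mean(abs(perm_diffs) >= abs(obs_diff))
```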


Covariates

We can obtain more precise estimates of the treatment effects by including relevant covariates in our model. Let’s first add the age covariate.

model <- lm(pidstr~factor(treat)+age, data=brands)

summary(model)

Call:
lm(formula = pidstr ~ factor(treat) + age, data = brands)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.2634 -1.4220  0.4429  1.5392  3.8192 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)            6.279101   0.369852  16.977  < 2e-16 ***
factor(treat)Full     -0.064302   0.311289  -0.207  0.83646    
factor(treat)Platform  0.704956   0.293997   2.398  0.01696 *  
factor(treat)Switch   -1.089289   0.341047  -3.194  0.00152 ** 
age                    0.018018   0.006383   2.823  0.00500 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.213 on 390 degrees of freedom
  (804 observations deleted due to missingness)
Multiple R-squared:  0.07763,   Adjusted R-squared:  0.06817 
F-statistic: 8.206 on 4 and 390 DF,  p-value: 2.312e-06

Exercise 5: Include the following additional covariates in the model: age, income, educgrp and info. Provide an interpretation of the coefficients for all treatments (treat). Use the summary() function to report your results. Are the estimated treatment effects statistically significant for all treatment groups?


Reveal Answer

model2 <- lm(pidstr~factor(treat)+age+income+educgrp+info, data=brands)
summary(model2)

Call:
lm(formula = pidstr ~ factor(treat) + age + income + educgrp + 
    info, data = brands)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.2503 -1.4581  0.3581  1.4727  3.9318 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)            5.427020   0.648090   8.374 2.28e-15 ***
factor(treat)Full     -0.099458   0.337709  -0.295  0.76858    
factor(treat)Platform  1.106895   0.335590   3.298  0.00109 ** 
factor(treat)Switch   -1.154413   0.375977  -3.070  0.00234 ** 
age                    0.032159   0.007677   4.189 3.70e-05 ***
income                -0.015579   0.068661  -0.227  0.82066    
educgrp                0.126116   0.169046   0.746  0.45624    
info                  -0.052458   0.191788  -0.274  0.78465    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.159 on 295 degrees of freedom
  (896 observations deleted due to missingness)
Multiple R-squared:  0.139, Adjusted R-squared:  0.1186 
F-statistic: 6.803 on 7 and 295 DF,  p-value: 1.7e-07
Once we include relevant covariates, we should expect the precision of our estimates to improve. We also find that the magnitudes of the coefficients for the Platform and Switch treatments increase: the estimated treatment effects are larger once we introduce controls.


Are we making a fair comparison when we compare our simple model to the model with covariates?
Well, no, given that each estimation uses a different set of observations. We can use the nobs() function to obtain the number of observations used in a fitted model object and compare the number of observations used in each regression. Run the simple regression model again, store it in an object, and then use nobs() to obtain the number of observations used in that regression.

#Re-estimate the main model
atemod=lm(pidstr~factor(treat), data=brands)
summary(atemod)

Call:
lm(formula = pidstr ~ factor(treat), data = brands)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.7934 -1.1389  0.2066  1.8611  3.8611 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)            7.13084    0.21532  33.118  < 2e-16 ***
factor(treat)Full     -0.09991    0.31225  -0.320  0.74916    
factor(treat)Platform  0.66255    0.29556   2.242  0.02554 *  
factor(treat)Switch   -0.99195    0.33950  -2.922  0.00368 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.227 on 393 degrees of freedom
  (802 observations deleted due to missingness)
Multiple R-squared:  0.06022,   Adjusted R-squared:  0.05305 
F-statistic: 8.395 on 3 and 393 DF,  p-value: 2.021e-05
nobs(atemod)
[1] 397

Exercise 6: Obtain the number of observations used in the model with covariates using the nobs() function.


Reveal Answer

#Add covariates
atemodcont=lm(pidstr~factor(treat)+age+income+educgrp+info, data=brands)
summary(atemodcont)

Call:
lm(formula = pidstr ~ factor(treat) + age + income + educgrp + 
    info, data = brands)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.2503 -1.4581  0.3581  1.4727  3.9318 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)            5.427020   0.648090   8.374 2.28e-15 ***
factor(treat)Full     -0.099458   0.337709  -0.295  0.76858    
factor(treat)Platform  1.106895   0.335590   3.298  0.00109 ** 
factor(treat)Switch   -1.154413   0.375977  -3.070  0.00234 ** 
age                    0.032159   0.007677   4.189 3.70e-05 ***
income                -0.015579   0.068661  -0.227  0.82066    
educgrp                0.126116   0.169046   0.746  0.45624    
info                  -0.052458   0.191788  -0.274  0.78465    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.159 on 295 degrees of freedom
  (896 observations deleted due to missingness)
Multiple R-squared:  0.139, Adjusted R-squared:  0.1186 
F-statistic: 6.803 on 7 and 295 DF,  p-value: 1.7e-07
nobs(atemodcont)
[1] 303


Why do we lose so many observations? Is this a fair comparison of the ATE? Let’s constrain the estimation to the observations used in the model with covariates. We can do this using the resid() function, which extracts the residuals from the model along with their row names (all data frames have row names). The as.matrix() function then stores both the row names and the residuals of the observations used in the model in matrix format, and rownames() extracts the row names.

esample=rownames(as.matrix(resid(atemodcont)))

Once we extract the row names, we can subset our data using the esample object by including this object in the first argument of the square brackets.

atemodsample=lm(pidstr~factor(treat), data=brands[esample,])
summary(atemodsample)

Call:
lm(formula = pidstr ~ factor(treat), data = brands[esample, ])

Residuals:
    Min      1Q  Median      3Q     Max 
-7.1379 -1.1455  0.8545  1.8583  3.8545 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)             7.1379     0.2369  30.124  < 2e-16 ***
factor(treat)Full      -0.1879     0.3423  -0.549  0.58345    
factor(treat)Platform   0.9979     0.3412   2.924  0.00372 ** 
factor(treat)Switch    -0.9925     0.3807  -2.607  0.00960 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.21 on 299 degrees of freedom
Multiple R-squared:  0.0859,    Adjusted R-squared:  0.07673 
F-statistic: 9.366 on 3 and 299 DF,  p-value: 6.182e-06
nobs(atemodsample)
[1] 303

But first, let’s install and/or load the stargazer package before we compare both models. You may already be familiar with this package; if not, it is used to create formatted regression tables. Please remember to install and/or load the package before using it.

#install.packages("stargazer")
library(stargazer)

Let’s now compare both models:

stargazer(atemodcont, atemodsample,  type = "text")

===================================================================
                                   Dependent variable:             
                      ---------------------------------------------
                                         pidstr                    
                               (1)                    (2)          
-------------------------------------------------------------------
factor(treat)Full             -0.099                 -0.188        
                             (0.338)                (0.342)        
                                                                   
factor(treat)Platform        1.107***               0.998***       
                             (0.336)                (0.341)        
                                                                   
factor(treat)Switch         -1.154***              -0.992***       
                             (0.376)                (0.381)        
                                                                   
age                          0.032***                              
                             (0.008)                               
                                                                   
income                        -0.016                               
                             (0.069)                               
                                                                   
educgrp                       0.126                                
                             (0.169)                               
                                                                   
info                          -0.052                               
                             (0.192)                               
                                                                   
Constant                     5.427***               7.138***       
                             (0.648)                (0.237)        
                                                                   
-------------------------------------------------------------------
Observations                   303                    303          
R2                            0.139                  0.086         
Adjusted R2                   0.119                  0.077         
Residual Std. Error      2.159 (df = 295)       2.210 (df = 299)   
F Statistic           6.803*** (df = 7; 295) 9.366*** (df = 3; 299)
===================================================================
Note:                                   *p<0.1; **p<0.05; ***p<0.01

We can see that both models contain 303 observations. We also find that the treatment effects from the simple model re-estimated on this restricted sample differ noticeably from the earlier estimates. For example, the Full treatment effect from the simple model was -0.099 before and is -0.188 now. Estimating both models on the same set of observations makes the beta coefficients of the simple model and the model with covariates directly comparable.


Randomisation Checks

To check whether respondents - or units - were indeed randomly assigned to treatment and control groups, we can run some randomisation checks. We don’t only want to know whether covariates are associated with the outcome - we want to make sure there is no meaningful difference between respondents who received the different treatments. Here, again, comparing simple means would not be sufficient: what differences would be acceptable? Since we don’t want to rely on arbitrary cut-offs, we use formal statistical tests. One of these, and probably the most common, is the t-test. It compares the means of two groups - taking standard errors and degrees of freedom into account - to test the null hypothesis of no difference between the group means. Let’s check whether the mean value of age is similar across treatment groups. We do this by separately comparing each treatment group to the control group. Note that we are subsetting the data accordingly:

t.test(brands$age~brands$frm2, data=brands, subset=c(brands$frm3!=1,brands$frm4!=1))

    Welch Two Sample t-test

data:  brands$age by brands$frm2
t = 0.82403, df = 615.13, p-value = 0.4102
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.387264  3.393151
sample estimates:
mean in group 0 mean in group 1 
       45.02007        44.01712 
#and, similarly, for the other treatments
t.test(brands$age~brands$frm3, data=brands, subset=c(brands$frm2!=1,brands$frm4!=1))

    Welch Two Sample t-test

data:  brands$age by brands$frm3
t = -0.29667, df = 594.69, p-value = 0.7668
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.869676  2.116476
sample estimates:
mean in group 0 mean in group 1 
       45.02007        45.39667 
t.test(brands$age~brands$frm4, data=brands, subset=c(brands$frm3!=1,brands$frm2!=1))

    Welch Two Sample t-test

data:  brands$age by brands$frm4
t = 0.007524, df = 539.28, p-value = 0.994
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.520972  2.540358
sample estimates:
mean in group 0 mean in group 1 
       44.69426        44.68456 


With p-values nowhere near the conventional thresholds, we cannot reject the null hypothesis of no difference in any of these comparisons. Age appears to be randomly distributed across all treatment groups. Accordingly, we conclude that randomisation in terms of age was successful.
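Running each t-test by hand gets repetitive. The comparisons above can be automated with a small loop - this is only a sketch, assuming the brands data frame with the frm2/frm3/frm4 treatment dummies is already loaded; the covariates and treatments vectors are ours:

```r
covariates <- c("age", "income")          # add further covariates as needed
treatments <- c("frm2", "frm3", "frm4")

for (tr in treatments) {
  for (cv in covariates) {
    # drop respondents assigned to the other two treatment arms
    others <- setdiff(treatments, tr)
    keep   <- brands[[others[1]]] != 1 & brands[[others[2]]] != 1
    test   <- t.test(brands[[cv]][keep] ~ brands[[tr]][keep])
    cat(tr, cv, "p-value:", round(test$p.value, 3), "\n")
  }
}
```

Each printed p-value corresponds to one of the pairwise balance tests above, so imbalances stand out at a glance.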


Exercise 7a: Check whether randomisation was successful for income.

Reveal Answer

t.test(brands$income~brands$frm2, data=brands, subset=c(brands$frm3!=1,brands$frm4!=1))

    Welch Two Sample t-test

data:  brands$income by brands$frm2
t = -0.50244, df = 386.02, p-value = 0.6156
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.4287813  0.2542368
sample estimates:
mean in group 0 mean in group 1 
       4.948081        5.035354 
t.test(brands$income~brands$frm3, data=brands, subset=c(brands$frm2!=1,brands$frm4!=1))

    Welch Two Sample t-test

data:  brands$income by brands$frm3
t = -0.52723, df = 437.34, p-value = 0.5983
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.4197579  0.2421877
sample estimates:
mean in group 0 mean in group 1 
       4.948081        5.036866 
t.test(brands$income~brands$frm4, data=brands, subset=c(brands$frm3!=1,brands$frm2!=1))

    Welch Two Sample t-test

data:  brands$income by brands$frm4
t = -1.0524, df = 422.56, p-value = 0.2932
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.5245158  0.1587156
sample estimates:
mean in group 0 mean in group 1 
       4.911695        5.094595 


Exercise 7b: Check whether randomisation was successful for gender.

Note that the t-test relies on Student’s t-distribution and should generally be used to compare continuous variables. Gender, however, is coded as a binary variable here. For binary variables, the prop.test command should be used instead, which compares proportions:

table(brands$frm2,brands$female[brands$treat!=3 & brands$treat!=2])
   
      0   1
  0 366 538
  1 110 185
prop.test(table(brands$frm2,brands$female[brands$treat!=3 & brands$treat!=2]))

    2-sample test for equality of proportions with continuity correction

data:  table(brands$frm2, brands$female[brands$treat != 3 & brands$treat != 2])
X-squared = 0.82165, df = 1, p-value = 0.3647
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.03405038  0.09802218
sample estimates:
   prop 1    prop 2 
0.4048673 0.3728814 


Now, check the balance of gender between treatment and control group for the switch and full treatments.

Reveal Answer

prop.test(table(brands$frm3,brands$female[brands$treat!=3 & brands$treat!=1]))

    2-sample test for equality of proportions with continuity correction

data:  table(brands$frm3, brands$female[brands$treat != 3 & brands$treat != 1])
X-squared = 0.80703, df = 1, p-value = 0.369
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.09783690  0.03492668
sample estimates:
   prop 1    prop 2 
0.3890747 0.4205298 
prop.test(table(brands$frm4,brands$female[brands$treat!=1 & brands$treat!=2]))

    2-sample test for equality of proportions with continuity correction

data:  table(brands$frm4, brands$female[brands$treat != 1 & brands$treat != 2])
X-squared = 0.90893, df = 1, p-value = 0.3404
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.09976218  0.03319121
sample estimates:
   prop 1    prop 2 
0.3886414 0.4219269 


Another test to check for balance is the two-sample Kolmogorov-Smirnov test. It examines whether two samples of a continuous variable come from the same distribution by comparing their empirical cumulative distribution functions. In technical terms, the test statistic is the largest vertical distance between the cumulative distributions of the two samples. The null hypothesis is that both samples come from a population with the same distribution, meaning that there is no difference between the cumulative distributions.
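To see this vertical distance concretely, you can draw the two empirical CDFs yourself. A sketch using base R graphics, assuming the brands data is loaded (the full_age and control_age names are ours):

```r
full_age    <- brands$age[brands$treat == "Full"]
control_age <- brands$age[brands$treat == "Control"]

# the KS statistic D is the largest vertical gap between these two curves
plot(ecdf(control_age), main = "Empirical CDFs of age", xlab = "Age")
lines(ecdf(full_age), col = "red")
legend("bottomright", legend = c("Control", "Full"),
       col = c("black", "red"), lty = 1)
```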

Let’s run ks.test() for the Full treatment and examine whether there is a difference in age between respondents in the Full and Control conditions.

table(brands$treat) # copy and paste the name of the treatment 

 Control     Full Platform   Switch 
     301      301      295      302 
ks.test(brands$age[brands$treat=="Full"],
        brands$age[brands$treat=="Control"]) # Fullinfo vs Control

    Two-sample Kolmogorov-Smirnov test

data:  brands$age[brands$treat == "Full"] and brands$age[brands$treat == "Control"]
D = 0.068076, p-value = 0.4925
alternative hypothesis: two-sided

We can see that there is no statistically significant difference with regards to age (p-value = 0.49) between respondents in the Full and the Control groups.

Exercise 8: Perform the KS test for the age variable for the Switch and Platform treatments (vs Control), do you find statistically significant differences in the cumulative distribution of these variables?

Reveal Answer

table(brands$treat) # copy and paste the name of the treatment 

 Control     Full Platform   Switch 
     301      301      295      302 
ks.test(brands$age[brands$treat=="Switch"],
        brands$age[brands$treat== "Control"]) # Switching vs Control

    Two-sample Kolmogorov-Smirnov test

data:  brands$age[brands$treat == "Switch"] and brands$age[brands$treat == "Control"]
D = 0.05, p-value = 0.8475
alternative hypothesis: two-sided
ks.test(brands$age[brands$frm3==1 & brands$frm2!=1 & brands$frm4!=1],
        brands$age[brands$frm3==0 & brands$frm2!=1 & brands$frm4!=1]) # Switching vs Control

    Two-sample Kolmogorov-Smirnov test

data:  brands$age[brands$frm3 == 1 & brands$frm2 != 1 & brands$frm4 != 1] and brands$age[brands$frm3 == 0 & brands$frm2 != 1 & brands$frm4 != 1]
D = 0.05, p-value = 0.8475
alternative hypothesis: two-sided
ks.test(brands$age[brands$frm3!=1 & brands$frm2==1 & brands$frm4!=1],
        brands$age[brands$frm3!=1 & brands$frm2==0 & brands$frm4!=1]) # Platform vs control

    Two-sample Kolmogorov-Smirnov test

data:  brands$age[brands$frm3 != 1 & brands$frm2 == 1 & brands$frm4 != 1] and brands$age[brands$frm3 != 1 & brands$frm2 == 0 & brands$frm4 != 1]
D = 0.055982, p-value = 0.7426
alternative hypothesis: two-sided

Given the evidence obtained from the KS tests, we see that there are no systematic differences in the distribution of the pre-treatment variable age.


There are alternative ways of running these randomisation tests. For example, we could recode all categorical variables into dummies and run prop.test() for each. A more conservative approach for categorical variables is a chi-square test, which evaluates whether the observed proportions of a categorical variable differ between two samples. What the chi-square test essentially asks is: are the relative proportions of the two samples the same in each category? The null hypothesis is that the two samples come from a common distribution and therefore have the same relative proportion in each category. Let’s conduct a chi-square test using the info variable, comparing the Platform and the Control group.

table(brands[c(brands$frm3!=1,brands$frm4!=1),]$info,brands[c(brands$frm3!=1,brands$frm4!=1),]$frm2)
   
      0   1
  0 162  76
  1 304 165
  2 136  54
chisq.test(table(brands[c(brands$frm3!=1,brands$frm4!=1),]$info,brands[c(brands$frm3!=1,brands$frm4!=1),]$frm2))

    Pearson's Chi-squared test

data:  table(brands[c(brands$frm3 != 1, brands$frm4 != 1), ]$info, brands[c(brands$frm3 !=     1, brands$frm4 != 1), ]$frm2)
X-squared = 2.9335, df = 2, p-value = 0.2307

Finally, one could also run a multivariate randomisation check. This is easier if the treatment is binary. In our case, given that the treat variable has four categories, we need the multinom() function from the nnet package to fit a multinomial logistic regression. A multinomial model predicts a nominal dependent variable from one or more independent variables, and its coefficients are measured in log odds. If there is balance, we should expect that the covariates do not predict the probability of being assigned to any particular treatment arm.

#install.packages("nnet")
require(nnet)

We can use the multinom() function, whose syntax is the same as that of the lm() function. For example, if we run a multinomial regression using the age variable as predictor and the treat variable as the dependent variable, it would look as follows:

multi=multinom(treat~age, data=brands)
# weights:  12 (6 variable)
initial  value 1649.690290 
final  value 1649.007784 
converged
summary(multi)
Call:
multinom(formula = treat ~ age, data = brands)

Coefficients:
          (Intercept)           age
Full      0.090371559 -0.0021560204
Platform  0.167328243 -0.0043498385
Switch   -0.006269943  0.0001382091

Std. Errors:
         (Intercept)         age
Full       0.2245073 0.004644583
Platform   0.2252394 0.004694651
Switch     0.2246623 0.004612671

Residual Deviance: 3298.016 
AIC: 3310.016 

In terms of interpreting the coefficients of a multinomial model: if a subject’s age increased by one year, the multinomial log-odds of the Full treatment relative to the Control would be expected to change by -0.002 units, holding all other variables in the model constant - which is not a meaningful difference.

What we are essentially doing is evaluating whether age predicts assignment to a particular treatment arm. One of the main limitations of this function is that it does not provide p-values (apparently the creators of the package were not fond of the p-value approach), but we can calculate the p-values manually.

Let’s do it step by step. First, we retrieve the beta coefficients from the model and divide them by their standard errors; this gives us the z-scores for each treatment arm.

summary(multi)$coefficient
          (Intercept)           age
Full      0.090371559 -0.0021560204
Platform  0.167328243 -0.0043498385
Switch   -0.006269943  0.0001382091
summary(multi)$standard.errors
         (Intercept)         age
Full       0.2245073 0.004644583
Platform   0.2252394 0.004694651
Switch     0.2246623 0.004612671
alpha=summary(multi)$coefficients/summary(multi)$standard.errors
alpha
         (Intercept)         age
Full       0.4025328 -0.46420103
Platform   0.7428907 -0.92655196
Switch    -0.0279083  0.02996293

Then, let’s take the absolute value of \(\alpha\) using the abs() function. Working with absolute z-scores makes it straightforward to compute two-tailed p-values from one tail of the normal distribution.

abs(alpha)
         (Intercept)        age
Full       0.4025328 0.46420103
Platform   0.7428907 0.92655196
Switch     0.0279083 0.02996293

Finally, we use the pnorm function to obtain the cumulative distribution function of the normal distribution. The syntax is pnorm(q, mean, sd, lower.tail = TRUE, log.p = FALSE): the q argument is the vector of quantiles, and the second and third arguments are the mean and standard deviation of the normal distribution. pnorm gives the area under the normal curve below a particular z-score. Subtracting this quantity from 1 gives the area in the tail above the z-score, and multiplying that by 2 yields a two-tailed p-value.

p <- (1 - pnorm(abs(alpha), 0, 1)) * 2
p
         (Intercept)       age
Full       0.6872919 0.6425037
Platform   0.4575478 0.3541592
Switch     0.9777353 0.9760966

As we can see, age does not predict treatment assignment for any of the treatment arms: all of the p-values are well above 0.05.

Exercise 9: Run the multinomial regression including age, income, educgrp, info, and female. Store the output of this function into an object and then use the summary() function to report the results of this model. Then, manually calculate the p-values for all treatment arms.

Reveal Answer

multicov=multinom(treat~age+income+educgrp+info+female, data=brands)
# weights:  28 (18 variable)
initial  value 1178.350207 
iter  10 value 1173.107014
iter  20 value 1170.000313
final  value 1169.960514 
converged
summary(multicov)
Call:
multinom(formula = treat ~ age + income + educgrp + info + female, 
    data = brands)

Coefficients:
         (Intercept)          age     income     educgrp        info
Full       0.1602862 -0.001568734 0.09956541 -0.09463015 -0.15089367
Platform  -0.1210902 -0.006817290 0.04245140  0.10252286 -0.13096349
Switch    -0.1435468 -0.002097874 0.02905994  0.10144407 -0.01909619
              female
Full     -0.35475460
Platform -0.08517474
Switch   -0.28808492

Std. Errors:
         (Intercept)         age     income   educgrp      info    female
Full       0.4940061 0.005796394 0.05472623 0.1382340 0.1508345 0.2004447
Platform   0.5092535 0.006053721 0.05649825 0.1411189 0.1552429 0.2091415
Switch     0.4956036 0.005825677 0.05532334 0.1374163 0.1507181 0.2014886

Residual Deviance: 2339.921 
AIC: 2375.921 
alphacov=summary(multicov)$coefficients/summary(multicov)$standard.errors
alphacov
         (Intercept)        age    income    educgrp       info    female
Full       0.3244620 -0.2706395 1.8193362 -0.6845651 -1.0003925 -1.769838
Platform  -0.2377798 -1.1261321 0.7513755  0.7264997 -0.8436036 -0.407259
Switch    -0.2896404 -0.3601082 0.5252745  0.7382246 -0.1267014 -1.429783
abs(alphacov)
         (Intercept)       age    income   educgrp      info   female
Full       0.3244620 0.2706395 1.8193362 0.6845651 1.0003925 1.769838
Platform   0.2377798 1.1261321 0.7513755 0.7264997 0.8436036 0.407259
Switch     0.2896404 0.3601082 0.5252745 0.7382246 0.1267014 1.429783
p <- (1 - pnorm(abs(alphacov), 0, 1)) * 2
p
         (Intercept)       age     income   educgrp      info     female
Full       0.7455883 0.7866683 0.06886015 0.4936184 0.3171206 0.07675412
Platform   0.8120519 0.2601096 0.45242668 0.4675325 0.3988909 0.68381778
Switch     0.7720913 0.7187662 0.59939242 0.4603780 0.8991768 0.15277942

Only income and gender in the Full treatment show marginal imbalance, with p-values just below 0.1. All other covariates are fine!
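As a complement to inspecting individual p-values, one can also test whether the covariates jointly predict treatment assignment with a likelihood-ratio test against an intercept-only multinomial model. This is a sketch, not part of the original lab; the cc, multifull, and multinull names are ours, and it assumes nnet is loaded and brands is in memory:

```r
# keep only complete cases so both models are fit on the same sample
cc <- na.omit(brands[, c("treat", "age", "income", "educgrp", "info", "female")])

multifull <- multinom(treat ~ age + income + educgrp + info + female, data = cc)
multinull <- multinom(treat ~ 1, data = cc)  # intercept-only baseline

lr <- multinull$deviance - multifull$deviance  # likelihood-ratio statistic
df <- multifull$edf - multinull$edf            # extra parameters in the full model
pchisq(lr, df = df, lower.tail = FALSE)        # large p-value: no joint imbalance
```

A large p-value here means the covariates, taken together, do not predict treatment assignment - which is exactly what successful randomisation implies.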


What if the treatment is binary? Suppose we were only interested in the Platform vs Control treatment effect. How should we check whether randomisation worked?

One option is to use OLS. Let’s check for balance between the Control and Platform groups. But first, let’s subset our data so that it only contains observations from these two groups.

PlatVControl=brands[c(brands$treat=="Control" | brands$treat=="Platform"),]
randcheck=lm(frm2~age,data=PlatVControl)
summary(randcheck)

Call:
lm(formula = frm2 ~ age, data = PlatVControl)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.5250 -0.4953 -0.4500  0.5027  0.5821 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.546428   0.058595   9.326   <2e-16 ***
age         -0.001190   0.001228  -0.969    0.333    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5004 on 590 degrees of freedom
  (4 observations deleted due to missingness)
Multiple R-squared:  0.00159,   Adjusted R-squared:  -0.0001022 
F-statistic: 0.9396 on 1 and 590 DF,  p-value: 0.3328
## another way to do this is using subset 
randcheck=lm(frm2~age,data= subset(brands, treat=="Control" | brands$treat=="Platform"))
summary(randcheck)

Call:
lm(formula = frm2 ~ age, data = subset(brands, treat == "Control" | 
    brands$treat == "Platform"))

Residuals:
    Min      1Q  Median      3Q     Max 
-0.5250 -0.4953 -0.4500  0.5027  0.5821 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.546428   0.058595   9.326   <2e-16 ***
age         -0.001190   0.001228  -0.969    0.333    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5004 on 590 degrees of freedom
  (4 observations deleted due to missingness)
Multiple R-squared:  0.00159,   Adjusted R-squared:  -0.0001022 
F-statistic: 0.9396 on 1 and 590 DF,  p-value: 0.3328

We can observe that the pre-treatment variable age does not predict treatment assignment, at least for the “Platform” treatment condition.

We can use the coefplot() function to plot the coefficients of our regression. The syntax is simple: the first argument is the regression output stored in an object. You can remove the intercept from the plot, as it is not very informative, at least in this case. We can also set the colour of the plot. Please install and/or load the coefplot package before running this function. You can see an example of how to use it below:

#install.packages("coefplot")
require(coefplot)
coefplot(randcheck, intercept=F, col="black")

We can see that the confidence interval of the age coefficient contains zero, which suggests that there are no systematic differences in terms of age between subjects in the Platform and Control groups.

Exercise 10: Plot the coefficients of the age, income, educgrp, info, and female. Remove the intercept from the plot, and change the colour of your plot. Also, include the following arguments to the coefplot() function: decreasing = TRUE, sort = "magnitude". By including these arguments, you will plot the coefficients of our model by their magnitude in a descending order. Do you find any systematic differences in any of the pre-treatment variables?

Reveal Answer

randcheckcov=lm(frm2~age+income+educgrp+info+female, data=PlatVControl)
summary(randcheckcov)

Call:
lm(formula = frm2 ~ age + income + educgrp + info + female, data = PlatVControl)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.5850 -0.4698 -0.3815  0.5116  0.6805 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.493851   0.136435   3.620 0.000332 ***
age         -0.001956   0.001640  -1.193 0.233674    
income       0.011737   0.013914   0.844 0.399391    
educgrp      0.020109   0.034659   0.580 0.562103    
info        -0.033649   0.039130  -0.860 0.390330    
female      -0.026027   0.051965  -0.501 0.616740    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4996 on 410 degrees of freedom
  (180 observations deleted due to missingness)
Multiple R-squared:  0.01198,   Adjusted R-squared:  -6.44e-05 
F-statistic: 0.9947 on 5 and 410 DF,  p-value: 0.4206
coefplot(randcheckcov, intercept=F, col="red", decreasing = TRUE, sort = "magnitude")

None of the confidence intervals excludes zero, so we find no systematic differences in the pre-treatment variables between the Platform and Control groups.


A second option is to use logistic regression via the glm() function. Its syntax differs slightly from that of lm(): there is an additional argument called family, which specifies the type of Generalised Linear Model you want to fit. In our case, given that we are dealing with the binary outcome variable frm2, we need to set family = binomial(logit). Let’s run a logistic regression using only the age variable.

randchecklogit=glm(frm2~age,data=PlatVControl, family=binomial(logit) )
summary(randchecklogit) # (the deviance residual is -2*log likelihood)

Call:
glm(formula = frm2 ~ age, family = binomial(logit), data = PlatVControl)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.220  -1.169  -1.094   1.182   1.320  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.186036   0.234575   0.793    0.428
age         -0.004769   0.004918  -0.970    0.332

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 820.58  on 591  degrees of freedom
Residual deviance: 819.64  on 590  degrees of freedom
  (4 observations deleted due to missingness)
AIC: 823.64

Number of Fisher Scoring iterations: 3
coefplot(randchecklogit, intercept=F, col="black" )

Question 11: Run a logistic regression including age, income, educgrp, info and female. Use the summary() function to report the results of your logistic regression. Do the results from this regression suggest that there is balance between the Platform and the Control group?


Reveal Answer

randchecklogitcov=glm(frm2~age+income+educgrp+info+female,data=PlatVControl, family=binomial(logit) )
summary(randchecklogitcov) # (the deviance residual is -2*log likelihood)

Call:
glm(formula = frm2 ~ age + income + educgrp + info + female, 
    family = binomial(logit), data = PlatVControl)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.3271  -1.1259  -0.9809   1.1975   1.5007  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.023398   0.549889  -0.043    0.966
age         -0.007969   0.006651  -1.198    0.231
income       0.047572   0.056095   0.848    0.396
educgrp      0.081712   0.139860   0.584    0.559
info        -0.136592   0.158441  -0.862    0.389
female      -0.105979   0.209682  -0.505    0.613

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 575.07  on 415  degrees of freedom
Residual deviance: 570.05  on 410  degrees of freedom
  (180 observations deleted due to missingness)
AIC: 582.05

Number of Fisher Scoring iterations: 4


Question 12: Plot the coefficients of your logistic regression. Remove the intercept. If you want, change the colour of your plot.


Reveal Answer

coefplot(randchecklogitcov, intercept=F, col="green")



Copyright © 2022 Felipe Torres Raposo & Kenneth Stiller.