In economics, researchers often rely on natural or quasi-experimental settings, since randomized experiments can rarely be conducted. Such settings allow one to study a given change in the environment and to split the population at hand into a treatment and a control group. The split requires some kind of border, which might be:
in time
in a geographical sense
or with respect to some threshold value
The simple (or naïve) way to examine a natural experiment is to compute first differences in the outcome variable (Y), i.e. a before-after comparison (a small toy sketch follows the examples below).
Examples:
observations from regions that were subject to a (local) policy intervention vs. observations from regions where no policy intervention took place
observations from a region before and after the (local) policy intervention
The main assumption is that without the change in the natural environment the outcome variable would have remained constant!
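A minimal sketch of this naïve estimator, using made-up numbers of my own (not from any study):
# Toy data: outcome y observed in a before and an after period
toy <- data.frame(
  y      = c(10, 12, 11, 14, 15, 13),
  period = c("before", "before", "before", "after", "after", "after")
)
# Naive before-after estimator: difference of the outcome means
mean(toy$y[toy$period == "after"]) - mean(toy$y[toy$period == "before"]) # returns 3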
The DiD approach includes a before-after comparison for a treatment and control group. This is a combination of:
a cross-sectional comparison (= compare a treated sample to a non-treated control group)
a before-after comparison (= compare the treatment group with itself, before and after the treatment)
The before-after difference in the treatment group is corrected by the before-after difference in the control group, which eliminates the trend problem. To obtain an unbiased estimate of the treatment effect one needs the parallel trend assumption: without the change in the environment, the change in the outcome variable would have been the same for both groups (counterfactual outcome).
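In symbols (notation of my own, just to fix ideas), the DiD estimator is the double difference of group means:
$$\hat{\delta}_{DiD} = \left(\bar{Y}_{\text{treat}}^{\text{after}} - \bar{Y}_{\text{treat}}^{\text{before}}\right) - \left(\bar{Y}_{\text{control}}^{\text{after}} - \bar{Y}_{\text{control}}^{\text{before}}\right)$$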
The DiD approach requires panel data (i.e. multiple observations on the same individual over time).
The validity of the DiD approach is closely related to the similarity of the treatment and control group. Hence, some plausibility checks should be conducted (a small simulated sketch of the first check follows this list):
compute a placebo DiD for periods without a change in the environment
for (longer) time series: check and demonstrate the parallel time trends
use an alternative control group (if available): the estimate should be the same
replace Y by an alternative outcome which is definitely independent of the treatment (the DiD estimator should be 0)
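A hedged sketch of the placebo check on simulated data (all names and numbers here are my own and purely illustrative): with two pre-treatment periods and no treatment effect built in, the interaction term should be close to zero.
# Simulated pre-treatment data: no treatment effect exists by construction
set.seed(1)
placebo <- data.frame(
  treated      = rep(c(0, 1), each = 100),  # group membership
  placebo_time = rep(c(0, 1), times = 100)  # first vs. second pre-treatment period
)
placebo$y <- 10 + 2 * placebo$treated + placebo$placebo_time + rnorm(200)
# The placebo_time:treated estimate should be small and insignificant
summary(lm(y ~ placebo_time * treated, data = placebo))$coefficients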
In a randomized experiment the treatment and control group are identical by design; hence, the DiD estimator is unbiased. For a natural experiment it is useful to compare the observable characteristics of individuals in the treatment and control group. However, they do not have to be identical! In a linear regression approach one should consider interactions that control for structural changes between the treatment and control group over time (a small simulated sketch follows).
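For illustration, a simulated sketch of such a specification (everything here is my own toy setup: x is a hypothetical covariate whose effect changes between the periods, and the true treatment effect is set to 1.5):
# Simulated data: the effect of covariate x differs across periods
set.seed(2)
n   <- 400
sim <- data.frame(
  time    = rep(0:1, each = n / 2),
  treated = rep(0:1, times = n / 2),
  x       = rnorm(n)
)
sim$y <- 5 + sim$time + 2 * sim$treated + 1.5 * sim$time * sim$treated +
  (1 + sim$time) * sim$x + rnorm(n)
# Interacting x with time absorbs this structural change, so the
# time:treated coefficient should stay close to the true effect (1.5)
lm(y ~ time * treated + time * x, data = sim)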
In this section I am replicating a study by David Card and Alan B. Krueger about the effect of a raise in minimum wages on employment. Conventional economic theory suggests that in a labour market with perfect competition an increase in the minimum wage leads to an increase in unemployment. In April 1992, the U.S. state of New Jersey (NJ) raised the minimum wage from $4.25 to $5.05. Card and Krueger (1994) use a DiD approach and show that this increase in minimum wages led to an increase in employment in the fast food restaurant sector. The control group in their setting is the neighbouring U.S. state of Pennsylvania (PA), which was not subject to this policy change. The authors conducted a survey before and after the raise of the minimum wage with a representative sample of fast food restaurants in NJ and PA. This setting can be regarded as quasi-experimental, as the two states are not identical in many aspects and the legislative procedure to raise the minimum wage was not initiated at random.
The following R packages are required for this tutorial.
library(dplyr)
library(readr)
library(stringr) # needed below for str_sub(), str_squish(), str_to_lower(), str_remove()
library(ggplot2)
library(tidyr)
library(sjlabelled)
library(ggrepel)
library(scales)
library(ggpubr)
library(plm)
library(lmtest)
The raw data of the study can be accessed here. At first, I am downloading the data directly into RStudio and storing the variable names and labels from the codebook (included in the .zip file) in separate vectors. The data is then read with the function read_table2() from R package readr. At the moment, it has no column names!
# Temporary file and path
tfile_path <- tempfile()
tdir_path <- tempdir()
# Download zip file
download.file("http://davidcard.berkeley.edu/data_sets/njmin.zip",
destfile = tfile_path)
# Unzip
unzip(tfile_path, exdir = tdir_path)
# Read codebook
codebook <- read_lines(file = paste0(tdir_path, "/codebook"))
# Generate a vector with variable names
variable_names <- codebook %>%
`[`(8:59) %>% # variable names start at element 8 (sheet)
`[`(-c(5, 6, 13, 14, 32, 33)) %>% # remove elements w/o variable names
str_sub(1, 8) %>% # longest variable name has 8 characters
str_squish() %>% # remove white spaces
str_to_lower() # use lower case only
# Generate a vector with variable labels
variable_labels <- codebook %>%
`[`(8:59) %>% # variable names start at element 8 (sheet)
`[`(-c(5, 6, 13, 14, 32, 33)) %>% # remove elements w/o variable names
sub(".*\\.[0-9]", "", .) %>%
`[`(-c(5:10)) %>% # these elements are combined later on
str_squish() # remove white spaces
# Region
variable_labels[41] <- "region of restaurant"
# Read raw data
data_raw <- read_table2(paste0(tdir_path, "/public.dat"),
col_names = FALSE)
Next, I am cleaning the data to make it ready for the analysis! I am assigning the stored variable names from the codebook to each column of the raw dataset, and to certain dummy and categorical variables I am assigning meaningful value labels based on the codebook.
# Add variable names
data_mod <- data_raw %>%
select(-X47) %>% # remove empty column
`colnames<-`(., variable_names) %>% # Assign variable names
mutate_all(as.numeric) %>% # treat all variables as numeric
mutate(sheet = ifelse(sheet == 407 & chain == 4, 408, sheet)) # duplicated sheet id 407
# Process data (currently wide format)
data_mod <- data_mod %>%
# chain value label
mutate(chain = case_when(chain == 1 ~ "bk",
chain == 2 ~ "kfc",
chain == 3 ~ "roys",
chain == 4 ~ "wendys")) %>%
# state value label
mutate(state = case_when(state == 1 ~ "New Jersey",
state == 0 ~ "Pennsylvania")) %>%
# Region dummy
mutate(region = case_when(southj == 1 ~ "southj",
centralj == 1 ~ "centralj",
northj == 1 ~ "northj",
shore == 1 ~ "shorej",
pa1 == 1 ~ "phillypa",
pa2 == 1 ~ "eastonpa")) %>%
# meals value label
mutate(meals = case_when(meals == 0 ~ "none",
meals == 1 ~ "free meals",
meals == 2 ~ "reduced price meals",
meals == 3 ~ "both free and reduced price meals")) %>%
# meals value label
mutate(meals2 = case_when(meals2 == 0 ~ "none",
meals2 == 1 ~ "free meals",
meals2 == 2 ~ "reduced price meals",
meals2 == 3 ~ "both free and reduced price meals")) %>%
# status2 value label
mutate(status2 = case_when(status2 == 0 ~ "refused second interview",
status2 == 1 ~ "answered 2nd interview",
status2 == 2 ~ "closed for renovations",
status2 == 3 ~ "closed permanently",
status2 == 4 ~ "closed for highway construction",
status2 == 5 ~ "closed due to Mall fire")) %>%
mutate(co_owned = if_else(co_owned == 1, "yes", "no")) %>%
mutate(bonus = if_else(bonus == 1, "yes", "no")) %>%
mutate(special2 = if_else(special2 == 1, "yes", "no")) %>%
mutate(type2 = if_else(type2 == 1, "phone", "personal")) %>%
select(-southj, -centralj, -northj, -shore, -pa1, -pa2) %>% # now included in region dummy
mutate(date2 = lubridate::mdy(date2)) %>% # Convert date
rename(open2 = open2r) %>% # match name to wave 1
rename(firstinc2 = firstin2) %>% # match name to wave 1
sjlabelled::set_label(variable_labels) # Add stored variable labels
The dataset is currently in a wide format, i.e. there are separate columns for the variables of each wave of the survey. First, I am separating the contents into different objects and then stacking both waves on top of each other. The functions bind_cols() and bind_rows() from R package dplyr are very helpful here.
# Structural variables
structure <- data_mod %>%
select(sheet, chain, co_owned, state, region)
# Wave 1 variables
wave1 <- data_mod %>%
select(-ends_with("2"), -all_of(names(structure))) %>%
mutate(observation = "February 1992") %>%
bind_cols(structure)
# Wave 2 variables
wave2 <- data_mod %>%
select(ends_with("2")) %>%
rename_all(~str_remove(., "2")) %>%
mutate(observation = "November 1992") %>%
bind_cols(structure)
# Final dataset
card_krueger_1994 <- bind_rows(wave1, wave2) %>%
select(sort(names(.))) %>% # Sort columns alphabetically
sjlabelled::copy_labels(data_mod) # Restore variable labels
Finally, I am generating a variable that measures employment. According to the paper, the full-time equivalents (FTE) consist of full-time employees, managers, and part-time employees (emptot). The part-time employees are multiplied by the factor 0.5 before entering the calculation. Also, I am generating the share of full-time employees among all FTE (pct_fte).
card_krueger_1994_mod <- card_krueger_1994 %>%
mutate(emptot = empft + nmgrs + 0.5 * emppt,
pct_fte = empft / emptot * 100)
Table 2 in the paper shows extensive descriptive statistics for the dataset. Some of them are replicated in this section in order to show that reading and processing the data did not introduce errors. At first, I calculate the distribution of fast food chains in NJ and PA. The function prop.table() computes the relative share of each fast food chain within each state.
card_krueger_1994_mod %>%
select(chain, state) %>%
table() %>%
prop.table(margin = 2) %>%
apply(MARGIN = 2,
FUN = scales::percent_format(accuracy = 0.1)) %>%
noquote
state
chain New Jersey Pennsylvania
bk 41.1% 44.3%
kfc 20.5% 15.2%
roys 24.8% 21.5%
wendys 13.6% 19.0%
Next, I am calculating the mean values of certain variables of the first wave of the survey, grouped by state …
card_krueger_1994_mod %>%
filter(observation == "February 1992") %>%
group_by(state) %>%
summarise(emptot = mean(emptot, na.rm = TRUE),
pct_fte = mean(pct_fte, na.rm = TRUE),
wage_st = mean(wage_st, na.rm = TRUE),
hrsopen = mean(hrsopen, na.rm = TRUE)) %>%
pivot_longer(cols=-state, names_to = "variable") %>%
pivot_wider(names_from = state, values_from = value)
# A tibble: 4 x 3
variable `New Jersey` Pennsylvania
<chr> <dbl> <dbl>
1 emptot 20.4 23.3
2 pct_fte 32.8 35.0
3 wage_st 4.61 4.63
4 hrsopen 14.4 14.5
… as well as the mean values of the second wave of the survey. My calculations are in line with the numbers published in the paper - phew!
card_krueger_1994_mod %>%
filter(observation == "November 1992") %>%
group_by(state) %>%
summarise(emptot = mean(emptot, na.rm = TRUE),
pct_fte = mean(pct_fte, na.rm = TRUE),
wage_st = mean(wage_st, na.rm = TRUE),
hrsopen = mean(hrsopen, na.rm = TRUE)) %>%
pivot_longer(cols=-state, names_to = "variable") %>%
pivot_wider(names_from = state, values_from = value)
# A tibble: 4 x 3
variable `New Jersey` Pennsylvania
<chr> <dbl> <dbl>
1 emptot 21.0 21.2
2 pct_fte 35.9 30.4
3 wage_st 5.08 4.62
4 hrsopen 14.4 14.7
In this section I am reproducing figure 1 of the study. The authors created this figure with SAS; it shows the distribution of the fast food restaurants' wages in NJ and PA before and after the treatment. I am using the function geom_histogram() from R package ggplot2 to create a histogram. To show the percentage of stores within each state on the y-axis, the aes argument of the histogram has to be manipulated here. With the function ggarrange() from R package ggpubr I am combining both histograms into a single plot with a single legend.
hist.feb <- card_krueger_1994_mod %>%
filter(observation == "February 1992") %>%
ggplot(aes(wage_st, fill = state)) +
geom_histogram(aes(y=c(..count..[..group..==1]/sum(..count..[..group..==1]),
..count..[..group..==2]/sum(..count..[..group..==2]))*100),
alpha=0.5, position = "dodge", bins = 23) +
labs(title = "February 1992", x = "Wage range", y = "Percent of stores", fill = "") +
scale_fill_grey()
hist.nov <- card_krueger_1994_mod %>%
filter(observation == "November 1992") %>%
ggplot(aes(wage_st, fill = state)) +
geom_histogram(aes(y=c(..count..[..group..==1]/sum(..count..[..group..==1]),
..count..[..group..==2]/sum(..count..[..group..==2]))*100),
alpha = 0.5, position = "dodge", bins = 23) +
labs(title = "November 1992", x="Wage range", y = "Percent of stores", fill="") +
scale_fill_grey()
ggarrange(hist.feb, hist.nov, ncol = 2,
common.legend = TRUE, legend = "bottom")
Before calculating the DiD estimator with OLS, I want to derive it by differencing the mean values of employment (emptot) between the groups. This is easily done with the functions group_by() and summarise(). We obtain four groups with distinct mean values!
differences <- card_krueger_1994_mod %>%
group_by(observation, state) %>%
summarise(emptot = mean(emptot, na.rm = TRUE))
# Treatment group (NJ) before treatment
njfeb <- differences[1,3]
# Control group (PA) before treatment
pafeb <- differences[2,3]
# Treatment group (NJ) after treatment
njnov <- differences[3,3]
# Control group (PA) after treatment
panov <- differences[4,3]
The average treatment effect on the treated (ATT) in this setting can be determined in two ways:
(njnov-njfeb)-(panov-pafeb)
emptot
1 2.753606
(njnov-panov)-(njfeb-pafeb)
emptot
1 2.753606
This number can be found in row 3 of table 3 in the paper!
Representing the relationship between the treatment and control group graphically can be very helpful for understanding the DiD approach. First, I use the differences of variable emptot calculated in the previous step for NJ and PA in February and November. Additionally, we require the outcome of NJ if the treatment (raise of the minimum wage) had not happened. This is called the counterfactual outcome (nj_counterfactual). The DiD assumption states that the trends of the treatment and control group are identical until the treatment takes place. Hence, without the treatment, employment (emptot) in NJ would have declined from February to November by the same amount as in PA. Since I want to visualize this effect, I am adding three additional data points to depict an intervention line in the plot which indicates the treatment event.
# Calculate counterfactual outcome
nj_counterfactual <- tibble(
observation = c("February 1992","November 1992"),
state = c("New Jersey (Counterfactual)","New Jersey (Counterfactual)"),
emptot = as.numeric(c(njfeb, njfeb-(pafeb-panov)))
)
# Data points for treatment event
intervention <- tibble(
observation = c("Intervention", "Intervention", "Intervention"),
state = c("New Jersey", "Pennsylvania", "New Jersey (Counterfactual)"),
emptot = c(19.35, 22.3, 19.35)
)
# Combine data
did_plotdata <- bind_rows(differences,
nj_counterfactual,
intervention)
I am creating a line plot with the function geom_line() from R package ggplot2 to illustrate the DiD approach. In November 1992, the distance between the actual and counterfactual employment (emptot) of NJ identifies the causal effect of an increase in minimum wages on employment. Note the use of the function geom_label_repel() from R package ggrepel here, to align the labels of each line.
did_plotdata %>%
mutate(label = if_else(observation == "November 1992", as.character(state), NA_character_)) %>%
ggplot(aes(x=observation,y=emptot, group=state)) +
geom_line(aes(color=state), size=1.2) +
geom_vline(xintercept = "Intervention", linetype="dotted",
color = "black", size=1.1) +
scale_color_brewer(palette = "Accent") +
scale_y_continuous(limits = c(17,24)) +
ggrepel::geom_label_repel(aes(label = label),
nudge_x = 0.5, nudge_y = -0.5,
na.rm = TRUE) +
guides(color=FALSE) +
labs(x="", y="FTE Employment (mean)") +
annotate(
"text",
x = "November 1992",
y = 19.6,
label = "{Difference-in-Differences}",
angle = 90,
size = 3
)
With linear regression, this result can be achieved very easily. First, we need to create two dummy variables. One indicates the start of the treatment (time) and is equal to zero before the treatment and equal to one after the treatment. The other variable separates the observations into a treatment and a control group (treated). This dummy variable is equal to one for fast food restaurants located in NJ and equal to zero for fast food restaurants located in PA.
card_krueger_1994_mod <- mutate(card_krueger_1994_mod,
time = ifelse(observation == "November 1992", 1, 0),
treated = ifelse(state == "New Jersey", 1, 0)
)
The DiD estimator is an interaction between both dummy variables. This interaction can be specified with the : operator in the formula of function lm(), in addition to the individual dummy variables.
Hint: Another possibility is to specify only time*treated in the formula, which adds the individual dummy variables automatically.
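A quick illustration of this shorthand (my own addition; it estimates the same model as the explicit formula below):
lm(emptot ~ time * treated, data = card_krueger_1994_mod)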
did_model <- lm(emptot ~ time + treated + time:treated, data = card_krueger_1994_mod)
summary(did_model)
Call:
lm(formula = emptot ~ time + treated + time:treated, data = card_krueger_1994_mod)
Residuals:
Min 1Q Median 3Q Max
-21.166 -6.439 -1.027 4.473 64.561
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 23.331 1.072 21.767 <2e-16 ***
time -2.166 1.516 -1.429 0.1535
treated -2.892 1.194 -2.423 0.0156 *
time:treated 2.754 1.688 1.631 0.1033
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 9.406 on 790 degrees of freedom
(26 observations deleted due to missingness)
Multiple R-squared: 0.007401, Adjusted R-squared: 0.003632
F-statistic: 1.964 on 3 and 790 DF, p-value: 0.118
The coefficient of time:treated is the difference-in-differences estimator. The treatment (raise of the minimum wage) leads on average to an increase of employment (emptot) in NJ by 2.75 FTE.
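As a small addition of my own, the uncertainty of this estimate can be summarised with a confidence interval taken directly from the fitted model:
# 95% confidence interval for the DiD coefficient
confint(did_model)["time:treated", ]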
Row 4 of table 3 in the study shows a more precise calculation of the DiD estimator, which only includes fast food restaurants with responses regarding employment (emptot) before and after the treatment (a so-called balanced sample). In R we can get this result by computing a fixed effects model, which is sometimes also called a within estimator. I am using R package plm to run this regression with the function plm() and the argument model = "within". Beforehand, the data has to be declared as a panel with the function pdata.frame(). With the variable sheet each fast food restaurant can be uniquely identified. Additionally, we need the function coeftest() from R package lmtest in order to obtain the correct standard errors, which must be clustered by sheet.
# Declare as panel data
panel <- pdata.frame(card_krueger_1994_mod, "sheet")
# Within model
did.reg <- plm(emptot ~ time + treated + time:treated,
data = panel, model = "within")
# obtain clustered standard errors
coeftest(did.reg, vcov = function(x)
vcovHC(x, cluster = "group", type = "HC1"))
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
time2 -2.2833 1.2465 -1.8319 0.06775 .
time2:treated 2.7500 1.3359 2.0585 0.04022 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
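As a final check of my own (not part of the original replication), the balanced sample can also be constructed explicitly with dplyr. Plain OLS on this subset should reproduce the within estimate of 2.75:
# Keep only restaurants with non-missing employment in both waves
balanced <- card_krueger_1994_mod %>%
  group_by(sheet) %>%
  filter(!any(is.na(emptot))) %>%
  ungroup()
# OLS on the balanced sample; time:treated should be close to 2.75
lm(emptot ~ time * treated, data = balanced)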