Reading time: 16 minutes (3,009 words)
The following R packages are required for this tutorial.
library(ggplot2)
library(dplyr)
library(plm)
library(lfe)
library(lmtest)
library(car)
library(geepack)
All practical examples are built around Grunfeld’s Investment Data contained in R package plm
. This dataset is a panel of 10 observational units (firms) from 1935 to 1954.
data("Grunfeld", package = "plm")
A panel is usually denoted by having multiple entries (rows) for the same entity (firm, person, …) in the dataset. The multiple entries are due to different time periods at which the entity was observed. Initially, I create a twoway table of the variables firm
and year
. In total there are 10 different firms, each having one observation for year 1935 to year 1954. Since no holes (zeros) can be found in this structure, that is no missing information in any given year for each firm, this panel is considered to be balanced.
Grunfeld %>%
select(year, firm) %>%
table()
## firm
## year 1 2 3 4 5 6 7 8 9 10
## 1935 1 1 1 1 1 1 1 1 1 1
## 1936 1 1 1 1 1 1 1 1 1 1
## 1937 1 1 1 1 1 1 1 1 1 1
## 1938 1 1 1 1 1 1 1 1 1 1
## 1939 1 1 1 1 1 1 1 1 1 1
## 1940 1 1 1 1 1 1 1 1 1 1
## 1941 1 1 1 1 1 1 1 1 1 1
## 1942 1 1 1 1 1 1 1 1 1 1
## 1943 1 1 1 1 1 1 1 1 1 1
## 1944 1 1 1 1 1 1 1 1 1 1
## 1945 1 1 1 1 1 1 1 1 1 1
## 1946 1 1 1 1 1 1 1 1 1 1
## 1947 1 1 1 1 1 1 1 1 1 1
## 1948 1 1 1 1 1 1 1 1 1 1
## 1949 1 1 1 1 1 1 1 1 1 1
## 1950 1 1 1 1 1 1 1 1 1 1
## 1951 1 1 1 1 1 1 1 1 1 1
## 1952 1 1 1 1 1 1 1 1 1 1
## 1953 1 1 1 1 1 1 1 1 1 1
## 1954 1 1 1 1 1 1 1 1 1 1
With package plm
this can be examined with function is.pbalanced()
.
Grunfeld %>%
is.pbalanced()
## [1] TRUE
In package plm
there is another panel dataset named EmplUK
. Testing whether this panel is balanced gives the result FALSE
.
data("EmplUK", package="plm")
EmplUK %>%
is.pbalanced()
## [1] FALSE
I am computing a twoway table of the first 10 indexes of firms and find that firm 1 to 4 have no records for the year 1976 and firms 5 to 10 have no data for the year 1983.
EmplUK %>%
select(year, firm) %>%
filter(firm %in% c(1:10)) %>%
table()
## firm
## year 1 2 3 4 5 6 7 8 9 10
## 1976 0 0 0 0 1 1 1 1 1 1
## 1977 1 1 1 1 1 1 1 1 1 1
## 1978 1 1 1 1 1 1 1 1 1 1
## 1979 1 1 1 1 1 1 1 1 1 1
## 1980 1 1 1 1 1 1 1 1 1 1
## 1981 1 1 1 1 1 1 1 1 1 1
## 1982 1 1 1 1 1 1 1 1 1 1
## 1983 1 1 1 1 0 0 0 0 0 0
With function make.pbalanced()
an unbalanced panel can be balanced. There are different ways how to do this and the argument balance.type=
must be supplied with one out of three options.
Using "fill"
creates a new row with NAs
for each missing time point. The columns of firm 1, formerly missing for the year 1976, are now filled with NA
.
EmplUK.balanced1 <- make.pbalanced(EmplUK, balance.type = "fill")
EmplUK.balanced1[1:8,]
## firm year sector emp wage capital output
## 1 1 1976 NA NA NA NA NA
## 2 1 1977 7 5.041 13.1516 0.5894 95.7072
## 3 1 1978 7 5.600 12.3018 0.6318 97.3569
## 4 1 1979 7 5.015 12.8395 0.6771 99.6083
## 5 1 1980 7 4.715 13.8039 0.6171 100.5501
## 6 1 1981 7 4.093 14.2897 0.5076 99.5581
## 7 1 1982 7 3.166 14.8681 0.4229 98.6151
## 8 1 1983 7 2.936 13.7784 0.3920 100.0301
Using "shared.times"
keeps all available firms in the dataset but drops all time periods where at least one firm has no data. Only the years 1978-1982 remain since they are shared by all firms.
EmplUK.balanced2 <- make.pbalanced(EmplUK, balance.type = "shared.times")
EmplUK.balanced2[1:10,]
## firm year sector emp wage capital output
## 2 1 1978 7 5.600 12.3018 0.6318 97.3569
## 3 1 1979 7 5.015 12.8395 0.6771 99.6083
## 4 1 1980 7 4.715 13.8039 0.6171 100.5501
## 5 1 1981 7 4.093 14.2897 0.5076 99.5581
## 6 1 1982 7 3.166 14.8681 0.4229 98.6151
## 9 2 1978 7 70.643 14.1036 17.2422 97.3569
## 10 2 1979 7 70.918 14.9534 17.5413 99.6083
## 11 2 1980 7 72.031 15.4910 17.6574 100.5501
## 12 2 1981 7 73.689 16.1969 16.7133 99.5581
## 13 2 1982 7 72.419 16.1314 16.2469 98.6151
By using "shared.individuals"
all available time periods are kept but only for those firms which have information for each of them. Only firm 127 to 140 remain since they share all available time periods.
EmplUK.balanced3 <- make.pbalanced(EmplUK, balance.type = "shared.individuals")
EmplUK.balanced3 %>%
group_by(firm) %>%
slice(1)
## # A tibble: 14 x 7
## # Groups: firm [14]
## firm year sector emp wage capital output
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 127 1976 7 1.14 14.7 0.690 94.7
## 2 128 1976 7 1.70 14.2 0.422 94.7
## 3 129 1976 7 0.768 8.83 0.201 94.7
## 4 130 1976 3 20 29.3 23.6 102.
## 5 131 1976 9 4.10 25.6 0.504 104.
## 6 132 1976 1 0.971 20.1 0.114 127.
## 7 133 1976 4 2.17 27.6 0.284 118.
## 8 134 1976 8 13.0 15.2 1.50 117.
## 9 135 1976 8 2.37 18.6 0.398 117.
## 10 136 1976 8 0.615 24.1 0.124 117.
## 11 137 1976 2 2.60 38.4 0.664 110.
## 12 138 1976 9 1.92 27.0 0.255 102.
## 13 139 1976 9 1.05 22.2 0.147 103.
## 14 140 1976 3 1.54 29.1 0.651 105.
Furthermore, the function is.pconsecutive()
tests whether the entities in the data have any gaps in their time series. Testing the Grunfeld
data shows that no gaps in the time periods are present for any firm.
Grunfeld %>%
is.pconsecutive()
## 1 2 3 4 5 6 7 8 9 10
## TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
Next I drop the year 1940 for firm 1, 3 and 7 and repeat the test. The output shows that now there are gaps in the time series for firms 1, 3 and 7!
Grunfeld %>%
filter(!(firm %in% c(1,3,7) & year == 1940)) %>%
is.pconsecutive()
## 1 2 3 4 5 6 7 8 9 10
## FALSE TRUE FALSE TRUE TRUE TRUE FALSE TRUE TRUE TRUE
With function make.pconsecutive()
the time gaps can be identified and closed with NA
values.
When doing graphical analysis with panel data, the entity and time dimension has to be taken into account in order to get meaningful results.
A simple lineplot of gross investment (inv
) over time (year
) without taking into account the entity dimension does not give meaningful results!
ggplot(data = Grunfeld, aes(x = year, y = inv)) +
geom_line() +
labs(x = "Year", y = "Gross Investment") +
theme(legend.position = "none")
Therefore, I create a line plot for each firm
separated by colour. The additional blue dashed line indicates the overall trend in the data considering all firms simultaneously. It is in fact the fitted regression line of a linear model between inv
and year
. Firm 1 and 2 have a relatively high gross investment compared to the other firms. On average, gross investment increases over time.
ggplot(data = Grunfeld, aes(x = year, y = inv)) +
geom_line(aes(colour = as.factor(firm))) +
geom_smooth(method = "lm", se = F, lty = "dashed") +
labs(x = "Year", y = "Gross Investment") +
theme(legend.position = "none")
Heterogeneity across firms can be shown with a line plot. The blue line connects the mean values of inv
, using all available years across firms (entities).
Grunfeld %>%
group_by(firm) %>%
summarise(inv_mean = mean(inv)) %>%
left_join(Grunfeld) %>%
ggplot(data = .,
aes(x = reorder(as.character(firm), firm), y = inv)) +
geom_point() +
geom_line(aes(x = firm, y = inv_mean), col = "blue") +
labs(x = "Firm", y = "Gross Investment")
Depending on the dataset the entity identifier might be non-numeric. In the Gasoline
dataset, the entity dimension refers to countries and the argument group = 1
has to be added in function geom_line()
in order to draw a connecting line between the mean values.
data(Gasoline, package = "plm")
Gasoline %>%
group_by(country) %>%
summarise(mean = mean(lgaspcar)) %>%
left_join(Gasoline) %>%
ggplot(data = .,
aes(x = country, y = lgaspcar)) +
geom_point() +
geom_line(aes(x = country, y = mean, group = 1), col = "blue") +
labs(x = "Country", y = "Motor gasoline consumption per car (Log)") +
theme(axis.text.x = element_text(angle = 90))
The same holds for the time dimension. Here the blue line connects the mean values of inv
, using all available firms across years (time).
Grunfeld %>%
group_by(year) %>%
summarise(inv_mean = mean(inv)) %>%
left_join(Grunfeld) %>%
ggplot(data = .,
aes(x = year, y = inv)) +
geom_point() +
geom_line(aes(x = year, y = inv_mean), col = "blue") +
scale_x_continuous(labels = as.character(Grunfeld$year),
breaks = Grunfeld$year) +
labs(x = "Year", y = "Gross Investment") +
theme(axis.text.x = element_text(angle = 90))
OLS can be used to pool observations of the same individual recorded at different time points. However, observations of the same individual are then treated as if they originate from other individuals. Important influences like serial correlation of observations within the same entity cannot be considered, leading to biased estimates.
With function lm()
it is straightforward to estimate the pooled OLS model. I regress the firms’ gross investment (inv
) on their stock of plant and equipment (capital
).
pooled_ols_lm <- lm(inv ~ capital, data = Grunfeld )
summary(pooled_ols_lm)
##
## Call:
## lm(formula = inv ~ capital, data = Grunfeld)
##
## Residuals:
## Min 1Q Median 3Q Max
## -316.92 -96.45 -14.43 17.07 481.92
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.23620 15.63927 0.91 0.364
## capital 0.47722 0.03834 12.45 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 162.9 on 198 degrees of freedom
## Multiple R-squared: 0.439, Adjusted R-squared: 0.4362
## F-statistic: 154.9 on 1 and 198 DF, p-value: < 2.2e-16
The coefficient of capital
shows the average effect on inv
when capital
increases by one unit.
We achieve the same coefficient estimates by using function plm()
from package plm
. First, an index has to be supplied, corresponding to the entity and/or time dimension of the panel. The argument model=
is set to "pooling"
. Additional information about the dataset’s panel structure is shown at the top of the summary output. The Grunfeld
data consists of 10 firms (n
) measured at 20 time points (T
) resulting in 200 observations in total (N
).
pooled_ols_plm <- plm(inv ~ capital, data = Grunfeld,
index = c("firm", "year"),
effect = "individual", model = "pooling")
summary(pooled_ols_plm)
## Pooling Model
##
## Call:
## plm(formula = inv ~ capital, data = Grunfeld, effect = "individual",
## model = "pooling", index = c("firm", "year"))
##
## Balanced Panel: n = 10, T = 20, N = 200
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -316.924 -96.450 -14.429 17.069 481.924
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## (Intercept) 14.236205 15.639266 0.9103 0.3638
## capital 0.477224 0.038339 12.4474 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 9359900
## Residual Sum of Squares: 5251000
## R-Squared: 0.43899
## Adj. R-Squared: 0.43616
## F-statistic: 154.937 on 1 and 198 DF, p-value: < 2.22e-16
With a scatterplot it is easy to see that, although firms could be distinguished by the variable firm
, OLS estimation treats all observations as if they come from different entities and fits the regression line accordingly.
ggplot(data = Grunfeld,
aes(x = capital, y = inv)) +
geom_point(aes(shape = factor(firm,
levels = c(1:10)))) +
geom_smooth(method = "lm", se = F) +
scale_shape_manual(values = 1:10) +
labs(x = "Stock of Plant and Equipment",
y = "Gross Investment",
shape = "Firm")
The fixed effects (FE) model, also called within estimator or least squares dummy variable (LSDV) model, is commonly applied to remove omitted variable bias. By estimating changes within a specific group (over time) all time-invariant differences between entities (individuals, firms, …) are controlled for. For example:
the unobserved ability of the management influencing the firm’s revenue
or the skills influencing an employee’s wage .
The assumption behind the FE model is that something influences the independent variables and one needs to control for it (the error term and independent variables are correlated). Hence, the FE model removes characteristics that do not change over time, leading to unbiased estimates of the remaining regressors on the dependent variable. If unobserved characteristics do not change over time, each change in the dependent variable must be due to influences not related to the fixed effects, which are controlled for. The FE model is hence suited for investigating causal relationships.
Note that the influence of time-invariant regressors on the dependent variable cannot be examined with a FE model. Also they do not work well with data with low within-variance or variables which only change slowly over time.
With function lm()
a FE model can be estimated by including dummy variables for all firms. This is the so called least squares dummy variable (LSDV) approach. I have shown before that the factor variable firm
uniquely identifies each firm in the dataset. Similarly to the pooled OLS model, I am regressing inv
on capital
. If there is a large number of individuals, the LSDV method is expensive from a computational point of view.
fe_model_lm <- lm(inv ~ capital + factor(firm), data = Grunfeld)
summary(fe_model_lm)
##
## Call:
## lm(formula = inv ~ capital + factor(firm), data = Grunfeld)
##
## Residuals:
## Min 1Q Median 3Q Max
## -190.715 -20.835 -0.459 21.383 293.687
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 367.61297 18.96710 19.382 < 2e-16 ***
## capital 0.37075 0.01937 19.143 < 2e-16 ***
## factor(firm)2 -66.45535 21.23578 -3.129 0.00203 **
## factor(firm)3 -413.68214 20.66845 -20.015 < 2e-16 ***
## factor(firm)4 -326.44100 22.54586 -14.479 < 2e-16 ***
## factor(firm)5 -486.27841 20.34373 -23.903 < 2e-16 ***
## factor(firm)6 -350.86559 22.69651 -15.459 < 2e-16 ***
## factor(firm)7 -436.78321 21.11352 -20.687 < 2e-16 ***
## factor(firm)8 -356.47246 22.86642 -15.589 < 2e-16 ***
## factor(firm)9 -436.17028 21.21684 -20.558 < 2e-16 ***
## factor(firm)10 -366.73127 23.64118 -15.512 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 63.57 on 189 degrees of freedom
## Multiple R-squared: 0.9184, Adjusted R-squared: 0.9141
## F-statistic: 212.7 on 10 and 189 DF, p-value: < 2.2e-16
Remember that one firm dummy variable is dropped to avoid the dummy variable trap.
Next up, I calculate the same model but drop the constant (intercept
) by adding -1
to the formula, so that no coeffcient (level) of firm
is excluded. Note that this does not alter the coefficient estimate of capital
!
fe_model_lm_nocons <- lm(inv ~ capital + factor(firm) -1, data = Grunfeld)
summary(fe_model_lm_nocons)
##
## Call:
## lm(formula = inv ~ capital + factor(firm) - 1, data = Grunfeld)
##
## Residuals:
## Min 1Q Median 3Q Max
## -190.715 -20.835 -0.459 21.383 293.687
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## capital 0.37075 0.01937 19.143 < 2e-16 ***
## factor(firm)1 367.61297 18.96710 19.382 < 2e-16 ***
## factor(firm)2 301.15762 15.31806 19.660 < 2e-16 ***
## factor(firm)3 -46.06917 16.18939 -2.846 0.00492 **
## factor(firm)4 41.17196 14.40645 2.858 0.00474 **
## factor(firm)5 -118.66544 17.05605 -6.957 5.52e-11 ***
## factor(firm)6 16.74738 14.35657 1.167 0.24487
## factor(firm)7 -69.17024 15.46733 -4.472 1.33e-05 ***
## factor(firm)8 11.14050 14.31023 0.778 0.43725
## factor(firm)9 -68.55731 15.34015 -4.469 1.35e-05 ***
## factor(firm)10 0.88169 14.21425 0.062 0.95061
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 63.57 on 189 degrees of freedom
## Multiple R-squared: 0.9439, Adjusted R-squared: 0.9407
## F-statistic: 289.3 on 11 and 189 DF, p-value: < 2.2e-16
Due to the introduction of firm dummy variables each firm has its own intercept with the y axis! For comparison, I plotted the fitted values from the pooled OLS model (blue dashed line). Its slope is more steep compared to the LSDV approach as influential observations of firm 1 lead to an upward bias.
ggplot(data = broom::augment(fe_model_lm),
aes(x = capital, y = .fitted)) +
geom_point(aes(color = `factor(firm)`)) +
geom_line(aes(color = `factor(firm)`)) +
geom_line(data=broom::augment(pooled_ols_lm),
aes(x = capital, y =.fitted),
color = "blue", lty="dashed", size = 1) +
labs(x = "Stock of Plant and Equipment", y = "Fitted Values (inv ~ capital)",
color = "Firm")
The same coefficient estimates as with the LSDV approach can be computed with function plm()
. The argument model=
is now set to "within"
. This is the within estimator with n entity-specific intercepts.
fe_model_plm <- plm(inv ~ capital, data = Grunfeld,
index = c("firm", "year"),
effect = "individual", model = "within")
summary(fe_model_plm)
## Oneway (individual) effect Within Model
##
## Call:
## plm(formula = inv ~ capital, data = Grunfeld, effect = "individual",
## model = "within", index = c("firm", "year"))
##
## Balanced Panel: n = 10, T = 20, N = 200
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -190.71466 -20.83474 -0.45862 21.38262 293.68714
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## capital 0.370750 0.019368 19.143 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 2244400
## Residual Sum of Squares: 763680
## R-Squared: 0.65973
## Adj. R-Squared: 0.64173
## F-statistic: 366.446 on 1 and 189 DF, p-value: < 2.22e-16
The coefficient of capital
indicates how much inv
changes over time, on average per country, when capital
increases by one unit.
With function fixef()
the fixed effects, i.e. the constants for each firm, can be extracted. Compare them with the coefficients of the LSDV approach (w/o the consant) - they must be identical.
fixef(fe_model_plm)
## 1 2 3 4 5 6 7
## 367.61297 301.15762 -46.06917 41.17196 -118.66544 16.74738 -69.17024
## 8 9 10
## 11.14050 -68.55731 0.88169
The function felm()
from package lfe
does the same as lm()
and plm()
. However, it displays the F-statistic and R2 of a full and projected model. The full model is the first model with dummy variables for each firm (LSDV approach) and the projected model is the within estimator. In general, you want to provide the R2 from the full model. Note that the values of the F-statistic and R2 of the full model displayed here are only the same when the constant is included in the LSDV model.
fe_model_felm <- lfe::felm(inv ~ capital | factor(firm),
data = Grunfeld)
summary(fe_model_felm)
##
## Call:
## lfe::felm(formula = inv ~ capital | factor(firm), data = Grunfeld)
##
## Residuals:
## Min 1Q Median 3Q Max
## -190.715 -20.835 -0.459 21.383 293.687
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## capital 0.37075 0.01937 19.14 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 63.57 on 189 degrees of freedom
## Multiple R-squared(full model): 0.9184 Adjusted R-squared: 0.9141
## Multiple R-squared(proj model): 0.6597 Adjusted R-squared: 0.6417
## F-statistic(full model):212.7 on 10 and 189 DF, p-value: < 2.2e-16
## F-statistic(proj model): 366.4 on 1 and 189 DF, p-value: < 2.2e-16
With function pFtest()
one can test for fixed effects with the null hypothesis that pooled OLS is better than FE. Alternatively, this test can be carried out by jointly assessing the significance of the dummy variables in the LSDV approach. The results are identical.
# Within estimator vs. Pooled OLS
pFtest(fe_model_plm, pooled_ols_plm)
##
## F test for individual effects
##
## data: inv ~ capital
## F = 123.39, df1 = 9, df2 = 189, p-value < 2.2e-16
## alternative hypothesis: significant effects
# Joint significane test with LSDV approach
car::linearHypothesis(fe_model_lm,
hypothesis.matrix = matchCoefs(fe_model_lm, "firm"))
## Linear hypothesis test
##
## Hypothesis:
## factor(firm)2 = 0
## factor(firm)3 = 0
## factor(firm)4 = 0
## factor(firm)5 = 0
## factor(firm)6 = 0
## factor(firm)7 = 0
## factor(firm)8 = 0
## factor(firm)9 = 0
## factor(firm)10 = 0
##
## Model 1: restricted model
## Model 2: inv ~ capital + factor(firm)
##
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 198 5250996
## 2 189 763680 9 4487316 123.39 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In both cases the null hypothesis is rejected in favor of the alternative that there are significant fixed effects.
There is another way of estimating a FE model by specifying model = "fd"
in function plm()
.
fe_model_fd<- plm(inv ~ capital -1, data = Grunfeld,
index = c("firm", "year"),
effect = "individual", model = "fd")
summary(fe_model_fd)
## Oneway (individual) effect First-Difference Model
##
## Call:
## plm(formula = inv ~ capital - 1, data = Grunfeld, effect = "individual",
## model = "fd", index = c("firm", "year"))
##
## Balanced Panel: n = 10, T = 20, N = 200
## Observations used in estimation: 190
##
## Residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -240.4 -11.7 0.1 3.5 12.6 333.2
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## capital 0.230780 0.059639 3.8696 0.00015 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 584410
## Residual Sum of Squares: 561210
## R-Squared: 0.04476
## Adj. R-Squared: 0.04476
## F-statistic: 14.9739 on 1 and 189 DF, p-value: 0.00014998
The coefficient of capital
is now different compared to the LSDV approach and within-groups estimator. This is because the coefficients and standard errors of the first-differenced model are only identical to the previously obtained results when there are two time periods. For longer time series, both the coefficients and the standard errors will be different.
Let’s verify the former assumption by dropping all years except 1935 and 1936 from the Grunfeld
dataset and estimate the model again (also for the within model).
# Within estimation (two periods)
fe_model_plm_check <- plm(inv ~ capital,
data = Grunfeld,
subset = year %in% c(1935, 1936),
index = c("firm", "year"),
effect = "individual", model = "within")
lmtest::coeftest(fe_model_plm_check)
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## capital 0.91353 0.85333 1.0705 0.3122
# FD estimation (two periods)
fe_model_fd_check<- plm(inv ~ capital -1,
data = Grunfeld,
subset = year %in% c(1935, 1936),
index = c("firm", "year"),
effect = "individual", model = "fd")
lmtest::coeftest(fe_model_fd_check)
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## capital 0.91353 0.85333 1.0705 0.3122
The RE model (also called Partial Pooling Model) assumes, in contrast to the FE model, that any variation between entities is random and not correlated with the regressors used in the estimation model. If there are reasons to believe that differences between entities influence the dependent variable, a RE model should be preferred. This also means that time-invariant variables (like a person’s gender) can be taken into account as regressors. The entity’s error term (unobserved heterogeneity) is hence not correlated with the regressors.
With RE models individual characteristics have to be specified if they have an influence on the other regressors. This poses the problem hat some variables need to be controlled for and might not be available, leading to omitted variable bias. Furthermore, the RE model allows for population inference from the sample because it assumes a normal distribution.
To break down the difference between FE and RE:
the FE model assumes that an individual (entity) specific effect is correlated with the independent variables,
while the RE model assumes that an individual (entity) specific effect is not correlated with the independent variables.
With function plm()
the RE model can be estimated. The argument model=
is set to value "random"
.
re_model_plm <- plm(inv ~ capital, data = Grunfeld,
index = c("firm", "year"),
effect = "individual", model = "random")
summary(re_model_plm)
## Oneway (individual) effect Random Effect Model
## (Swamy-Arora's transformation)
##
## Call:
## plm(formula = inv ~ capital, data = Grunfeld, effect = "individual",
## model = "random", index = c("firm", "year"))
##
## Balanced Panel: n = 10, T = 20, N = 200
##
## Effects:
## var std.dev share
## idiosyncratic 4040.63 63.57 0.135
## individual 25949.52 161.09 0.865
## theta: 0.9121
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -164.0821 -22.2955 -3.7463 16.9121 319.9564
##
## Coefficients:
## Estimate Std. Error z-value Pr(>|z|)
## (Intercept) 43.246697 51.411319 0.8412 0.4002
## capital 0.372120 0.019316 19.2652 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 2299300
## Residual Sum of Squares: 799910
## R-Squared: 0.65211
## Adj. R-Squared: 0.65036
## Chisq: 371.149 on 1 DF, p-value: < 2.22e-16
The coefficients in the RE model include both the within-entity and between-entity effects. When having data with multiple entities and time periods the coefficient of capital
represents the average effect on inv
when capital
changes across years and between firms by one unit.
This figure is taken from Chapter 14 Doughtery (2016).
With this technique we can make specific assumptions about the structure of the correlation matrix of the errors. If the error structure is correctly specified, our model’s coefficients will be more exact. However, as opposed to the FE-model, the coefficients can still be biased due to missing time-invariant variables. The function geeglm()
is used below to speciy a model with an "exchangeable"
correlation structure. Note that this is equivalent to the RE-model.
gee_model1 <- geeglm(inv ~ capital, id = firm, waves = year,
data = Grunfeld, family = gaussian, corstr = "exchangeable")
summary(gee_model1)
##
## Call:
## geeglm(formula = inv ~ capital, family = gaussian, data = Grunfeld,
## id = firm, waves = year, corstr = "exchangeable")
##
## Coefficients:
## Estimate Std.err Wald Pr(>|W|)
## (Intercept) 43.20521 35.77051 1.459 0.227
## capital 0.37227 0.06234 35.658 2.35e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation structure = exchangeable
## Estimated Scale Parameters:
##
## Estimate Std.err
## (Intercept) 27249 11383
## Link = identity
##
## Estimated Correlation Parameters:
## Estimate Std.err
## alpha 0.8525 0.01326
## Number of clusters: 10 Maximum cluster size: 20
Next I’m using "ar1"
to estimate the same model with an correlation matrix of the type first order auto regression.
gee_model2 <- geeglm(inv ~ capital, id = firm, waves = year,
data = Grunfeld, family = gaussian, corstr = "ar1")
summary(gee_model2)
##
## Call:
## geeglm(formula = inv ~ capital, family = gaussian, data = Grunfeld,
## id = firm, waves = year, corstr = "ar1")
##
## Coefficients:
## Estimate Std.err Wald Pr(>|W|)
## (Intercept) 79.653 52.268 2.32 0.128
## capital 0.262 0.142 3.39 0.065 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation structure = ar1
## Estimated Scale Parameters:
##
## Estimate Std.err
## (Intercept) 30476 13423
## Link = identity
##
## Estimated Correlation Parameters:
## Estimate Std.err
## alpha 0.983 0.0094
## Number of clusters: 10 Maximum cluster size: 20
When using "independence"
as input for argument corstr =
we obtain the pooled OLS result.
gee_model3 <- geeglm(inv ~ capital, id = firm, waves = year,
data = Grunfeld, family = gaussian, corstr = "independence")
summary(gee_model3)
##
## Call:
## geeglm(formula = inv ~ capital, family = gaussian, data = Grunfeld,
## id = firm, waves = year, corstr = "independence")
##
## Coefficients:
## Estimate Std.err Wald Pr(>|W|)
## (Intercept) 14.236 28.046 0.26 0.61173
## capital 0.477 0.126 14.37 0.00015 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation structure = independence
## Estimated Scale Parameters:
##
## Estimate Std.err
## (Intercept) 26255 10354
## Number of clusters: 10 Maximum cluster size: 20
The pooled OLS model may be enhanced with the time dimension by including appropriate dummy variables. In the Grunfeld
dataset the factor variable year
contains information for the time dimension. Remember that one level of a factor variable will be held out when estimating the model. By controlling for year
the coefficient of capital
changes compared to the initial pooled OLS model.
pooled_ols_time_lm <- lm(inv ~ capital + factor(year), data = Grunfeld)
coef(pooled_ols_time_lm)[2]
## capital
## 0.538
The same stratey can be applied to function plm()
.
pooled_ols_time_plm <- plm(inv ~ capital + factor(year), data = Grunfeld,
index = c("firm", "year"), effect = "individual",
model = "pooling")
coef(pooled_ols_time_plm)[2]
## capital
## 0.538
For the LSDV approach the time dimension is added in the same fashion as with pooled OLS.
fe_time_lm <- lm(inv ~ capital + factor(firm) + factor(year),
data = Grunfeld)
coef(fe_time_lm)[2]
## capital
## 0.414
The within estimator computed with function plm()
can be supplied with time FE by setting the argument effect=
to value "twoways"
.
fe_time_plm <- plm(inv ~ capital, data = Grunfeld,
index = c("firm", "year"), effect = "twoways",
model = "within")
coef(fe_time_plm)[1]
## capital
## 0.414
In function felm()
the time dimension is added at the right hand side of the formula.
fe_time_felm <- lfe::felm(inv ~ capital | factor(firm) + factor(year),
data = Grunfeld)
coef(fe_time_felm)[1]
## capital
## 0.414
There is also a possibility to test whether time fixed effects are needed. The null hypothesis is that the coefficients are together zero for all years and hence no time fixed effects need to be taken into account. For the within model this can be tested with function pFtest()
which I already used for testing for the presence of individual fixed effects in section 3.2. The model with time FE is compared to the one without. You may also use the LSDV model and test the joint hypothesis that all coefficients of variable year
are together zero. The results are identical.
# Within model
pFtest(fe_time_plm, fe_model_plm)
##
## F test for twoways effects
##
## data: inv ~ capital
## F = 2, df1 = 19, df2 = 170, p-value = 0.06
## alternative hypothesis: significant effects
# LSDV model
car::linearHypothesis(fe_time_lm, matchCoefs(fe_time_lm, "year"))
## Linear hypothesis test
##
## Hypothesis:
## factor(year)1936 = 0
## factor(year)1937 = 0
## factor(year)1938 = 0
## factor(year)1939 = 0
## factor(year)1940 = 0
## factor(year)1941 = 0
## factor(year)1942 = 0
## factor(year)1943 = 0
## factor(year)1944 = 0
## factor(year)1945 = 0
## factor(year)1946 = 0
## factor(year)1947 = 0
## factor(year)1948 = 0
## factor(year)1949 = 0
## factor(year)1950 = 0
## factor(year)1951 = 0
## factor(year)1952 = 0
## factor(year)1953 = 0
## factor(year)1954 = 0
##
## Model 1: restricted model
## Model 2: inv ~ capital + factor(firm) + factor(year)
##
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 189 763680
## 2 170 648201 19 115478 1.59 0.062 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
There is evidence that time fixed effects should be taken into account.
A decision between a fixed and random effects model can be made with the Hausman test, which checks whether the individual error terms are correlated with the regressors. The null hypothesis states that there is no such correlation (RE). The alternative hypothesis is that a correlation exists (FE). The test is implemented in function phtest()
.
phtest(fe_model_plm, re_model_plm)
##
## Hausman Test
##
## data: inv ~ capital
## chisq = 0.9, df = 1, p-value = 0.3
## alternative hypothesis: one model is inconsistent
The null hypothesis cannot be rejected here, hence we should use a RE model.
The Breusch-Pagan Lagrange multiplier (LM) Test helps to decide between a random effects model and a simple OLS regression. This test is implemented in function plmtest()
with the null hypothesis that the variance across entities is zero. In this setting this means that there are no significant differences across firms (no panel effect).
plmtest(pooled_ols_plm, effect = "individual", type = c("bp"))
##
## Lagrange Multiplier Test - (Breusch-Pagan) for balanced panels
##
## data: inv ~ capital
## chisq = 1285, df = 1, p-value <2e-16
## alternative hypothesis: significant effects
The test shows that there are significant differences across firms. Running a pooled OLS regression is thus not appropriate and the RE model is the better choice.
Testing for the presence of heteroskedasticity is also possible in panel settings.The null hypothesis of the Breusch-Pagan test against heteroskedasticity is that homoskedasticity is present. The test is implemented in function bptest()
in package lmtest
.
lmtest::bptest(inv ~ capital + factor(firm),
studentize = F, data = Grunfeld)
##
## Breusch-Pagan test
##
## data: inv ~ capital + factor(firm)
## BP = 387, df = 10, p-value <2e-16
There is strong evidence for the presense of heteroskedasticity. Hence, the use of robust standard errors is advised.
Since the Grunfeld
dataset is “20 years long,” a test for serial correlation of the residuals should be performed. Serial correlation leads to an underestimation of standard errors (too small) and an overestimation of R2 (too large). The Breusch-Godfrey/Wooldridge test for serial correlation in panel models is implemented in function pbgtest()
with the null hypothesis that there is no serial correlation.
pbgtest(fe_model_plm)
##
## Breusch-Godfrey/Wooldridge test for serial correlation in panel models
##
## data: inv ~ capital
## chisq = 74, df = 20, p-value = 4e-08
## alternative hypothesis: serial correlation in idiosyncratic errors
There is strong evidence that the residuals are serially correlated.
If the error terms of different observations from the same entity are correlated, the standard errors have to be adjusted. In the Grunfdeld
dataset a firm is observed at 20 different time points and the observations for the same individuals are hence not independent. This leads to the previously described problem of serial correlation of the residuals. In order to solve this issue clustered standard errors have to be used. Both standard OLS as well as heteroskedasticity robust standard errors are wrong because they assume that the error terms ui,t are not serially correlated. Hence they underestimate the true sampling uncertainty (there is less random variation when error terms are correlated). Clustered standard errors estimate the variance of the coefficient when regressors are i.i.d. across entities but correlated within the entity.
This issue is not restricted to panel data and can also occur in cross-sectional studies, if the data contains clusters of observations and their error terms are correlated within but not between the clusters (regions, schools, branches, …).
In order to corrects the standard errors function vcovHC()
is used which originates from package sandwich
but is also available in package plm
. The argument cluster =
is set to "group"
and the argument type=
controls the estimation type of the standard errors. I am using "sss"
which includes the small sample correction method as applied by Stata (so it is easy to check the results with another software).
# OLS standard error
coeftest(pooled_ols_plm)[2,c(1,2)]
## Estimate Std. Error
## 0.4772 0.0383
# Cluster robust standard error
coeftest(pooled_ols_plm,
vcov = vcovHC(pooled_ols_plm,
type = "sss",
cluster = "group"))[2,c(1,2)]
## Estimate Std. Error
## 0.477 0.133
This works just the same for the FE model. However, although using type = "sss"
leads to a minor difference when I compute the standard error with Stata. I am not sure why that is the case.
# FE standard error
coeftest(fe_model_plm)[,c(1,2)]
## Estimate Std. Error
## 0.3707 0.0194
# Cluster robust standard error
coeftest(fe_model_plm,
vcov = vcovHC(fe_model_plm,
type = "sss",
cluster = "group"))[,c(1,2)]
## Estimate Std. Error
## 0.3707 0.0649
By using function felm()
from package lfe
cluster-robust standard errors can be computed directly. The standard errors are the same as in Stata here so this might be the best function to use when wanting to compute within-groups FE.
fe_model_felm2 <- lfe::felm(inv ~ capital | firm | 0 | firm, data = Grunfeld)
summary(fe_model_felm2)
##
## Call:
## lfe::felm(formula = inv ~ capital | firm | 0 | firm, data = Grunfeld)
##
## Residuals:
## Min 1Q Median 3Q Max
## -190.71 -20.83 -0.46 21.38 293.69
##
## Coefficients:
## Estimate Cluster s.e. t value Pr(>|t|)
## capital 0.3707 0.0651 5.69 3e-04 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 63.6 on 189 degrees of freedom
## Multiple R-squared(full model): 0.918 Adjusted R-squared: 0.914
## Multiple R-squared(proj model): 0.66 Adjusted R-squared: 0.642
## F-statistic(full model, *iid*): 213 on 10 and 189 DF, p-value: <2e-16
## F-statistic(proj model): 32.4 on 1 and 9 DF, p-value: 0.000296
Computing cluster robust standard errors for the RE model is not different. Again, the results are the same as in Stata.
coeftest(re_model_plm)[2,c(1,2)]
## Estimate Std. Error
## 0.3721 0.0193
coeftest(re_model_plm,
vcov = vcovHC(re_model_plm,
type = "sss",
cluster = "group"))[2,c(1,2)]
## Estimate Std. Error
## 0.3721 0.0658