1. Introduction

1.1 R packages

The following R packages are required for this tutorial.

library(ggplot2)
library(dplyr)
library(plm)
library(lfe)
library(lmtest)
library(car)
library(geepack)

1.2 Dataset

All practical examples are built around Grunfeld’s Investment Data contained in R package plm. This dataset is a panel of 10 observational units (firms) from 1935 to 1954.

data("Grunfeld", package = "plm")

1.3 Panel structure

Balanced panels

A panel dataset contains multiple entries (rows) for the same entity (firm, person, …), one for each time period at which the entity was observed. To inspect this structure, I first create a two-way table of the variables firm and year. In total there are 10 different firms, each with one observation for every year from 1935 to 1954. Since there are no holes (zeros) in this structure, i.e. no missing information for any firm in any year, the panel is balanced.

Grunfeld %>%
  select(year, firm) %>%
  table()
##       firm
## year   1 2 3 4 5 6 7 8 9 10
##   1935 1 1 1 1 1 1 1 1 1  1
##   1936 1 1 1 1 1 1 1 1 1  1
##   1937 1 1 1 1 1 1 1 1 1  1
##   1938 1 1 1 1 1 1 1 1 1  1
##   1939 1 1 1 1 1 1 1 1 1  1
##   1940 1 1 1 1 1 1 1 1 1  1
##   1941 1 1 1 1 1 1 1 1 1  1
##   1942 1 1 1 1 1 1 1 1 1  1
##   1943 1 1 1 1 1 1 1 1 1  1
##   1944 1 1 1 1 1 1 1 1 1  1
##   1945 1 1 1 1 1 1 1 1 1  1
##   1946 1 1 1 1 1 1 1 1 1  1
##   1947 1 1 1 1 1 1 1 1 1  1
##   1948 1 1 1 1 1 1 1 1 1  1
##   1949 1 1 1 1 1 1 1 1 1  1
##   1950 1 1 1 1 1 1 1 1 1  1
##   1951 1 1 1 1 1 1 1 1 1  1
##   1952 1 1 1 1 1 1 1 1 1  1
##   1953 1 1 1 1 1 1 1 1 1  1
##   1954 1 1 1 1 1 1 1 1 1  1

With package plm this can be examined with function is.pbalanced().

Grunfeld %>%
  is.pbalanced()
## [1] TRUE

Unbalanced panels

In package plm there is another panel dataset named EmplUK. Testing whether this panel is balanced gives the result FALSE.

data("EmplUK", package="plm")

EmplUK %>%
  is.pbalanced()
## [1] FALSE

Computing a two-way table for the first 10 firms shows that firms 1 to 4 have no records for the year 1976 and firms 5 to 10 have no data for the year 1983.

EmplUK %>%
  select(year, firm) %>%
  filter(firm %in% c(1:10)) %>%
  table()
##       firm
## year   1 2 3 4 5 6 7 8 9 10
##   1976 0 0 0 0 1 1 1 1 1  1
##   1977 1 1 1 1 1 1 1 1 1  1
##   1978 1 1 1 1 1 1 1 1 1  1
##   1979 1 1 1 1 1 1 1 1 1  1
##   1980 1 1 1 1 1 1 1 1 1  1
##   1981 1 1 1 1 1 1 1 1 1  1
##   1982 1 1 1 1 1 1 1 1 1  1
##   1983 1 1 1 1 0 0 0 0 0  0

Balancing an unbalanced panel

With function make.pbalanced() an unbalanced panel can be balanced. There are different ways to do this, and the argument balance.type= must be set to one of three options.

Using "fill" creates a new row for each missing time point, filled with NAs in the remaining columns. The row for firm 1 in the year 1976, which was previously missing, now appears with NA values.

EmplUK.balanced1 <- make.pbalanced(EmplUK, balance.type = "fill")
EmplUK.balanced1[1:8,]
##   firm year sector   emp    wage capital   output
## 1    1 1976     NA    NA      NA      NA       NA
## 2    1 1977      7 5.041 13.1516  0.5894  95.7072
## 3    1 1978      7 5.600 12.3018  0.6318  97.3569
## 4    1 1979      7 5.015 12.8395  0.6771  99.6083
## 5    1 1980      7 4.715 13.8039  0.6171 100.5501
## 6    1 1981      7 4.093 14.2897  0.5076  99.5581
## 7    1 1982      7 3.166 14.8681  0.4229  98.6151
## 8    1 1983      7 2.936 13.7784  0.3920 100.0301

Using "shared.times" keeps all available firms in the dataset but drops all time periods where at least one firm has no data. Only the years 1978-1982 remain since they are shared by all firms.

EmplUK.balanced2 <- make.pbalanced(EmplUK, balance.type = "shared.times")
EmplUK.balanced2[1:10,]
##    firm year sector    emp    wage capital   output
## 2     1 1978      7  5.600 12.3018  0.6318  97.3569
## 3     1 1979      7  5.015 12.8395  0.6771  99.6083
## 4     1 1980      7  4.715 13.8039  0.6171 100.5501
## 5     1 1981      7  4.093 14.2897  0.5076  99.5581
## 6     1 1982      7  3.166 14.8681  0.4229  98.6151
## 9     2 1978      7 70.643 14.1036 17.2422  97.3569
## 10    2 1979      7 70.918 14.9534 17.5413  99.6083
## 11    2 1980      7 72.031 15.4910 17.6574 100.5501
## 12    2 1981      7 73.689 16.1969 16.7133  99.5581
## 13    2 1982      7 72.419 16.1314 16.2469  98.6151

Using "shared.individuals" keeps all available time periods but only those firms which have information for each of them. Only firms 127 to 140 remain since they share all available time periods.

EmplUK.balanced3 <- make.pbalanced(EmplUK, balance.type = "shared.individuals")

EmplUK.balanced3 %>%
  group_by(firm) %>%
  slice(1)
## # A tibble: 14 x 7
## # Groups:   firm [14]
##     firm  year sector    emp  wage capital output
##    <dbl> <dbl>  <dbl>  <dbl> <dbl>   <dbl>  <dbl>
##  1   127  1976      7  1.14  14.7    0.690   94.7
##  2   128  1976      7  1.70  14.2    0.422   94.7
##  3   129  1976      7  0.768  8.83   0.201   94.7
##  4   130  1976      3 20     29.3   23.6    102. 
##  5   131  1976      9  4.10  25.6    0.504  104. 
##  6   132  1976      1  0.971 20.1    0.114  127. 
##  7   133  1976      4  2.17  27.6    0.284  118. 
##  8   134  1976      8 13.0   15.2    1.50   117. 
##  9   135  1976      8  2.37  18.6    0.398  117. 
## 10   136  1976      8  0.615 24.1    0.124  117. 
## 11   137  1976      2  2.60  38.4    0.664  110. 
## 12   138  1976      9  1.92  27.0    0.255  102. 
## 13   139  1976      9  1.05  22.2    0.147  103. 
## 14   140  1976      3  1.54  29.1    0.651  105.

Time dimension gaps

Furthermore, the function is.pconsecutive() tests whether the entities in the data have any gaps in their time series. Testing the Grunfeld data shows that no gaps in the time periods are present for any firm.

Grunfeld %>%
  is.pconsecutive()
##    1    2    3    4    5    6    7    8    9   10 
## TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

Next I drop the year 1940 for firms 1, 3 and 7 and repeat the test. The output shows that there are now gaps in the time series for these three firms!

Grunfeld %>%
  filter(!(firm %in% c(1,3,7) & year == 1940)) %>%
  is.pconsecutive()
##     1     2     3     4     5     6     7     8     9    10 
## FALSE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE

With function make.pconsecutive() the time gaps can be identified and closed with NA values.
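
As a quick illustration, here is a minimal sketch (my own addition, reusing the gapped data created above; the object names are hypothetical): make.pconsecutive() inserts NA rows for the missing year 1940 of firms 1, 3 and 7, after which is.pconsecutive() returns TRUE for every firm. Since no index is supplied, the first two columns (firm, year) are used as the panel index.

# Recreate the gapped data and close the gaps with NA rows
Grunfeld_gaps <- Grunfeld %>%
  filter(!(firm %in% c(1,3,7) & year == 1940))

Grunfeld_filled <- make.pconsecutive(Grunfeld_gaps)

Grunfeld_filled %>%
  is.pconsecutive()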


2. Visualization Techniques

When doing graphical analysis with panel data, the entity and time dimensions have to be taken into account in order to get meaningful results.

Naive Lineplot

A simple lineplot of gross investment (inv) over time (year) without taking into account the entity dimension does not give meaningful results!

ggplot(data = Grunfeld, aes(x = year, y = inv)) +
  geom_line() +
  labs(x = "Year",  y = "Gross Investment") +
  theme(legend.position = "none")

Separated Lineplot

Therefore, I create a line plot for each firm, separated by colour. The additional blue dashed line indicates the overall trend in the data across all firms; it is in fact the fitted regression line of a linear model of inv on year. Firms 1 and 2 have a relatively high gross investment compared to the other firms. On average, gross investment increases over time.

ggplot(data = Grunfeld, aes(x = year, y = inv)) +
  geom_line(aes(colour = as.factor(firm))) +
  geom_smooth(method = "lm", se = F, lty = "dashed") +
  labs(x = "Year",  y = "Gross Investment") +
  theme(legend.position = "none")

Entity Heterogeneity

Heterogeneity across firms can be shown with a line plot. The blue line connects the mean values of inv for each firm (entity), computed over all available years.

Grunfeld %>%
  group_by(firm) %>%
  summarise(inv_mean = mean(inv)) %>%
  left_join(Grunfeld) %>%
  ggplot(data = ., 
         aes(x = reorder(as.character(firm), firm), y = inv)) +
  geom_point() +
  geom_line(aes(x = firm, y = inv_mean), col = "blue") +
  labs(x = "Firm", y = "Gross Investment")

Depending on the dataset the entity identifier might be non-numeric. In the Gasoline dataset, the entity dimension refers to countries and the argument group = 1 has to be added in function geom_line() in order to draw a connecting line between the mean values.

data(Gasoline, package = "plm")

Gasoline %>%
  group_by(country) %>%
  summarise(mean = mean(lgaspcar)) %>%
  left_join(Gasoline) %>%
  ggplot(data = ., 
         aes(x = country, y = lgaspcar)) +
  geom_point() +
  geom_line(aes(x = country, y = mean, group = 1), col = "blue") +
  labs(x = "Country", y = "Motor gasoline consumption per car (Log)") +
  theme(axis.text.x = element_text(angle = 90))

Time Heterogeneity

The same holds for the time dimension. Here the blue line connects the mean values of inv for each year (time), computed over all firms.

Grunfeld %>%
  group_by(year) %>%
  summarise(inv_mean = mean(inv)) %>%
  left_join(Grunfeld) %>%
  ggplot(data = ., 
         aes(x = year, y = inv)) +
  geom_point() +
  geom_line(aes(x = year, y = inv_mean), col = "blue") +
  scale_x_continuous(labels = as.character(Grunfeld$year), 
                     breaks = Grunfeld$year) +
  labs(x = "Year", y = "Gross Investment") +
  theme(axis.text.x = element_text(angle = 90))


3. Estimation Methods

3.1 Pooled Cross Sections

OLS can be used to pool observations of the same individual recorded at different time points. However, observations of the same individual are then treated as if they originated from different individuals. Important features like serial correlation of observations within the same entity cannot be accounted for, leading to biased estimates.

Pooled OLS via lm()

With function lm() it is straightforward to estimate the pooled OLS model. I regress the firms’ gross investment (inv) on their stock of plant and equipment (capital).

pooled_ols_lm <- lm(inv ~ capital, data = Grunfeld )

summary(pooled_ols_lm)
## 
## Call:
## lm(formula = inv ~ capital, data = Grunfeld)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -316.92  -96.45  -14.43   17.07  481.92 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 14.23620   15.63927    0.91    0.364    
## capital      0.47722    0.03834   12.45   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 162.9 on 198 degrees of freedom
## Multiple R-squared:  0.439,  Adjusted R-squared:  0.4362 
## F-statistic: 154.9 on 1 and 198 DF,  p-value: < 2.2e-16

The coefficient of capital shows the average effect on inv when capital increases by one unit.

Pooled OLS via plm()

We achieve the same coefficient estimates by using function plm() from package plm. First, an index has to be supplied, corresponding to the entity and/or time dimension of the panel. The argument model= is set to "pooling". Additional information about the dataset’s panel structure is shown at the top of the summary output. The Grunfeld data consists of 10 firms (n) measured at 20 time points (T) resulting in 200 observations in total (N).

pooled_ols_plm <- plm(inv ~ capital, data = Grunfeld, 
                      index = c("firm", "year"), 
                      effect = "individual", model = "pooling")

summary(pooled_ols_plm)
## Pooling Model
## 
## Call:
## plm(formula = inv ~ capital, data = Grunfeld, effect = "individual", 
##     model = "pooling", index = c("firm", "year"))
## 
## Balanced Panel: n = 10, T = 20, N = 200
## 
## Residuals:
##     Min.  1st Qu.   Median  3rd Qu.     Max. 
## -316.924  -96.450  -14.429   17.069  481.924 
## 
## Coefficients:
##              Estimate Std. Error t-value Pr(>|t|)    
## (Intercept) 14.236205  15.639266  0.9103   0.3638    
## capital      0.477224   0.038339 12.4474   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    9359900
## Residual Sum of Squares: 5251000
## R-Squared:      0.43899
## Adj. R-Squared: 0.43616
## F-statistic: 154.937 on 1 and 198 DF, p-value: < 2.22e-16

Scatterplot

With a scatterplot it is easy to see that, although firms could be distinguished by the variable firm, OLS estimation treats all observations as if they come from different entities and fits the regression line accordingly.

ggplot(data = Grunfeld,
       aes(x = capital, y = inv)) +
  geom_point(aes(shape = factor(firm, 
                                levels = c(1:10)))) +
  geom_smooth(method = "lm", se = F) +
  scale_shape_manual(values = 1:10) +
  labs(x = "Stock of Plant and Equipment",
       y = "Gross Investment",
       shape = "Firm")


3.2 Fixed Effects Model

The fixed effects (FE) model, also called within estimator or least squares dummy variable (LSDV) model, is commonly applied to remove omitted variable bias. By estimating changes within a specific group (over time) all time-invariant differences between entities (individuals, firms, …) are controlled for. For example:

  • the unobserved ability of the management influencing the firm’s revenue

  • or the skills influencing an employee’s wage.

The assumption behind the FE model is that something within the entity influences the independent variables and must be controlled for, i.e. the entity-specific error term and the regressors are correlated. The FE model removes the effect of such time-invariant characteristics, so that the effect of the remaining regressors on the dependent variable can be estimated without this source of bias. Since these unobserved characteristics do not change over time, any change in the dependent variable must be due to influences other than the fixed characteristics, which are controlled for. The FE model is hence suited for investigating causal relationships.

Note that the influence of time-invariant regressors on the dependent variable cannot be examined with an FE model. FE models also do not work well with data that have little within-entity variance or with variables which change only slowly over time.


3.2.1 Least Squares Dummy Variable Estimation

FE using lm()

With function lm() an FE model can be estimated by including dummy variables for all firms. This is the so-called least squares dummy variable (LSDV) approach. I have shown before that the factor variable firm uniquely identifies each firm in the dataset. As in the pooled OLS model, I regress inv on capital. If there is a large number of individuals, the LSDV method becomes computationally expensive.

fe_model_lm <- lm(inv ~ capital + factor(firm), data = Grunfeld)

summary(fe_model_lm)
## 
## Call:
## lm(formula = inv ~ capital + factor(firm), data = Grunfeld)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -190.715  -20.835   -0.459   21.383  293.687 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     367.61297   18.96710  19.382  < 2e-16 ***
## capital           0.37075    0.01937  19.143  < 2e-16 ***
## factor(firm)2   -66.45535   21.23578  -3.129  0.00203 ** 
## factor(firm)3  -413.68214   20.66845 -20.015  < 2e-16 ***
## factor(firm)4  -326.44100   22.54586 -14.479  < 2e-16 ***
## factor(firm)5  -486.27841   20.34373 -23.903  < 2e-16 ***
## factor(firm)6  -350.86559   22.69651 -15.459  < 2e-16 ***
## factor(firm)7  -436.78321   21.11352 -20.687  < 2e-16 ***
## factor(firm)8  -356.47246   22.86642 -15.589  < 2e-16 ***
## factor(firm)9  -436.17028   21.21684 -20.558  < 2e-16 ***
## factor(firm)10 -366.73127   23.64118 -15.512  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 63.57 on 189 degrees of freedom
## Multiple R-squared:  0.9184, Adjusted R-squared:  0.9141 
## F-statistic: 212.7 on 10 and 189 DF,  p-value: < 2.2e-16

Remember that one firm dummy variable is dropped to avoid the dummy variable trap.

Excluding the intercept

Next, I estimate the same model but drop the constant (intercept) by adding -1 to the formula, so that no coefficient (level) of firm is excluded. Note that this does not alter the coefficient estimate of capital, as the quick check after the output below confirms.

fe_model_lm_nocons <- lm(inv ~ capital + factor(firm) -1, data = Grunfeld)

summary(fe_model_lm_nocons)
## 
## Call:
## lm(formula = inv ~ capital + factor(firm) - 1, data = Grunfeld)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -190.715  -20.835   -0.459   21.383  293.687 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## capital           0.37075    0.01937  19.143  < 2e-16 ***
## factor(firm)1   367.61297   18.96710  19.382  < 2e-16 ***
## factor(firm)2   301.15762   15.31806  19.660  < 2e-16 ***
## factor(firm)3   -46.06917   16.18939  -2.846  0.00492 ** 
## factor(firm)4    41.17196   14.40645   2.858  0.00474 ** 
## factor(firm)5  -118.66544   17.05605  -6.957 5.52e-11 ***
## factor(firm)6    16.74738   14.35657   1.167  0.24487    
## factor(firm)7   -69.17024   15.46733  -4.472 1.33e-05 ***
## factor(firm)8    11.14050   14.31023   0.778  0.43725    
## factor(firm)9   -68.55731   15.34015  -4.469 1.35e-05 ***
## factor(firm)10    0.88169   14.21425   0.062  0.95061    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 63.57 on 189 degrees of freedom
## Multiple R-squared:  0.9439, Adjusted R-squared:  0.9407 
## F-statistic: 289.3 on 11 and 189 DF,  p-value: < 2.2e-16
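
As a quick check (my own addition), the capital coefficient is indeed unchanged, and the firm-specific intercepts of the two parameterizations agree: for firm 2, the overall intercept plus its dummy coefficient from the first model equals its coefficient in the model without a constant.

# TRUE: the capital estimate is unchanged by dropping the constant
all.equal(coef(fe_model_lm)["capital"], coef(fe_model_lm_nocons)["capital"])

# Firm 2's intercept implied by the dummy parameterization:
# 367.613 - 66.455 = 301.158, the firm 2 coefficient in fe_model_lm_nocons
coef(fe_model_lm)["(Intercept)"] + coef(fe_model_lm)["factor(firm)2"]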

Scatterplot

Due to the introduction of firm dummy variables, each firm has its own intercept with the y axis! For comparison, I also plot the fitted values from the pooled OLS model (blue dashed line). Its slope is steeper than under the LSDV approach because influential observations of firm 1 pull it upwards.

ggplot(data = broom::augment(fe_model_lm),
       aes(x = capital, y = .fitted)) +
  geom_point(aes(color = `factor(firm)`)) +
  geom_line(aes(color = `factor(firm)`)) +
  geom_line(data=broom::augment(pooled_ols_lm), 
            aes(x = capital, y =.fitted), 
            color = "blue", lty="dashed", size = 1) +
  labs(x = "Stock of Plant and Equipment", y = "Fitted Values (inv ~ capital)",
       color = "Firm") 


3.2.2 Within-groups Estimator

FE using plm()

The same coefficient estimates as with the LSDV approach can be computed with function plm(). The argument model= is now set to "within". This is the within estimator with n entity-specific intercepts.

fe_model_plm <- plm(inv ~ capital, data = Grunfeld, 
                    index = c("firm", "year"), 
                    effect = "individual", model = "within")

summary(fe_model_plm)
## Oneway (individual) effect Within Model
## 
## Call:
## plm(formula = inv ~ capital, data = Grunfeld, effect = "individual", 
##     model = "within", index = c("firm", "year"))
## 
## Balanced Panel: n = 10, T = 20, N = 200
## 
## Residuals:
##       Min.    1st Qu.     Median    3rd Qu.       Max. 
## -190.71466  -20.83474   -0.45862   21.38262  293.68714 
## 
## Coefficients:
##         Estimate Std. Error t-value  Pr(>|t|)    
## capital 0.370750   0.019368  19.143 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    2244400
## Residual Sum of Squares: 763680
## R-Squared:      0.65973
## Adj. R-Squared: 0.64173
## F-statistic: 366.446 on 1 and 189 DF, p-value: < 2.22e-16

The coefficient of capital indicates how much inv changes over time, on average per firm, when capital increases by one unit.
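
To see what the within transformation does, here is a minimal sketch (my own addition): demeaning inv and capital by firm and regressing demeaned inv on demeaned capital without an intercept reproduces the within estimate of capital (the standard error differs slightly because lm() does not adjust the degrees of freedom for the estimated firm means).

# Demean both variables by firm, then run OLS on the deviations
Grunfeld %>%
  group_by(firm) %>%
  mutate(inv_dm     = inv     - mean(inv),
         capital_dm = capital - mean(capital)) %>%
  ungroup() %>%
  lm(inv_dm ~ capital_dm - 1, data = .) %>%
  coef()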

With function fixef() the fixed effects, i.e. the constants for each firm, can be extracted. Compare them with the coefficients of the LSDV approach without the constant - they are identical.

fixef(fe_model_plm)
##          1          2          3          4          5          6          7 
##  367.61297  301.15762  -46.06917   41.17196 -118.66544   16.74738  -69.17024 
##          8          9         10 
##   11.14050  -68.55731    0.88169

FE using felm()

The function felm() from package lfe does the same as lm() and plm(). In addition, it displays the F-statistic and R2 of both a full and a projected model. The full model corresponds to the LSDV approach with dummy variables for each firm, and the projected model corresponds to the within estimator. In general, you want to report the R2 from the full model. Note that the F-statistic and R2 of the full model displayed here only match those of the LSDV approach when the constant is included in the LSDV model.

fe_model_felm <- lfe::felm(inv ~ capital | factor(firm), 
                           data = Grunfeld)

summary(fe_model_felm)
## 
## Call:
##    lfe::felm(formula = inv ~ capital | factor(firm), data = Grunfeld) 
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -190.715  -20.835   -0.459   21.383  293.687 
## 
## Coefficients:
##         Estimate Std. Error t value Pr(>|t|)    
## capital  0.37075    0.01937   19.14   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 63.57 on 189 degrees of freedom
## Multiple R-squared(full model): 0.9184   Adjusted R-squared: 0.9141 
## Multiple R-squared(proj model): 0.6597   Adjusted R-squared: 0.6417 
## F-statistic(full model):212.7 on 10 and 189 DF, p-value: < 2.2e-16 
## F-statistic(proj model): 366.4 on 1 and 189 DF, p-value: < 2.2e-16

Testing for FE

With function pFtest() one can test for fixed effects; the null hypothesis is that pooled OLS is adequate (no significant fixed effects), the alternative that fixed effects are present. Alternatively, the same test can be carried out by jointly assessing the significance of the dummy variables in the LSDV approach. The results are identical.

# Within estimator vs. Pooled OLS
pFtest(fe_model_plm, pooled_ols_plm)
## 
##  F test for individual effects
## 
## data:  inv ~ capital
## F = 123.39, df1 = 9, df2 = 189, p-value < 2.2e-16
## alternative hypothesis: significant effects
# Joint significance test with LSDV approach
car::linearHypothesis(fe_model_lm,
                      hypothesis.matrix = matchCoefs(fe_model_lm, "firm"))
## Linear hypothesis test
## 
## Hypothesis:
## factor(firm)2 = 0
## factor(firm)3 = 0
## factor(firm)4 = 0
## factor(firm)5 = 0
## factor(firm)6 = 0
## factor(firm)7 = 0
## factor(firm)8 = 0
## factor(firm)9 = 0
## factor(firm)10 = 0
## 
## Model 1: restricted model
## Model 2: inv ~ capital + factor(firm)
## 
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
## 1    198 5250996                                  
## 2    189  763680  9   4487316 123.39 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In both cases the null hypothesis is rejected in favor of the alternative that there are significant fixed effects.


3.2.3 First-difference Estimator

More than two time periods

There is another way of estimating an FE model: specifying model = "fd" in function plm() gives the first-difference estimator.

fe_model_fd<- plm(inv ~ capital -1, data = Grunfeld,
                  index = c("firm", "year"), 
                  effect = "individual", model = "fd")

summary(fe_model_fd)
## Oneway (individual) effect First-Difference Model
## 
## Call:
## plm(formula = inv ~ capital - 1, data = Grunfeld, effect = "individual", 
##     model = "fd", index = c("firm", "year"))
## 
## Balanced Panel: n = 10, T = 20, N = 200
## Observations used in estimation: 190
## 
## Residuals:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -240.4   -11.7     0.1     3.5    12.6   333.2 
## 
## Coefficients:
##         Estimate Std. Error t-value Pr(>|t|)    
## capital 0.230780   0.059639  3.8696  0.00015 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    584410
## Residual Sum of Squares: 561210
## R-Squared:      0.04476
## Adj. R-Squared: 0.04476
## F-statistic: 14.9739 on 1 and 189 DF, p-value: 0.00014998

The coefficient of capital is now different from the LSDV and within-groups estimates. The coefficients and standard errors of the first-difference model are identical to the previously obtained results only when there are exactly two time periods. For longer time series, both the coefficients and the standard errors differ.
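
For intuition, here is a minimal sketch (my own addition) of what model = "fd" does: taking first differences of inv and capital within each firm and regressing differenced inv on differenced capital without an intercept reproduces the coefficient shown above.

# First-difference both variables within each firm; the first observation
# of each firm becomes NA and is dropped by lm()
Grunfeld %>%
  group_by(firm) %>%
  arrange(year, .by_group = TRUE) %>%
  mutate(d_inv     = inv     - lag(inv),
         d_capital = capital - lag(capital)) %>%
  ungroup() %>%
  lm(d_inv ~ d_capital - 1, data = .) %>%
  coef()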

Two time periods

Let’s verify this claim by dropping all years except 1935 and 1936 from the Grunfeld dataset and estimating both the within and the first-difference model again.

# Within estimation (two periods)
fe_model_plm_check <- plm(inv ~ capital, 
                          data = Grunfeld, 
                          subset = year %in% c(1935, 1936), 
                          index = c("firm", "year"), 
                          effect = "individual", model = "within")

lmtest::coeftest(fe_model_plm_check)
## 
## t test of coefficients:
## 
##         Estimate Std. Error t value Pr(>|t|)
## capital  0.91353    0.85333  1.0705   0.3122
# FD estimation (two periods)
fe_model_fd_check<- plm(inv ~ capital -1,
                        data = Grunfeld, 
                        subset = year %in% c(1935, 1936), 
                        index = c("firm", "year"), 
                        effect = "individual", model = "fd")


lmtest::coeftest(fe_model_fd_check)
## 
## t test of coefficients:
## 
##         Estimate Std. Error t value Pr(>|t|)
## capital  0.91353    0.85333  1.0705   0.3122


3.3 Random Effects Model

The RE model (also called Partial Pooling Model) assumes, in contrast to the FE model, that any variation between entities is random and not correlated with the regressors used in the estimation model. If there are reasons to believe that differences between entities influence the dependent variable, an RE model should be preferred. This also means that time-invariant variables (like a person’s gender) can be taken into account as regressors. The entity’s error term (unobserved heterogeneity) is hence not correlated with the regressors.

With RE models, individual characteristics have to be specified explicitly if they influence the other regressors. This poses the problem that some variables which need to be controlled for might not be available, leading to omitted variable bias. Furthermore, the RE model allows inference from the sample to the population, because the individual effects are assumed to follow a normal distribution.

To break down the difference between FE and RE:

  • the FE model assumes that an individual (entity) specific effect is correlated with the independent variables,

  • while the RE model assumes that an individual (entity) specific effect is not correlated with the independent variables.

RE via plm()

With function plm() the RE model can be estimated. The argument model= is set to value "random".

re_model_plm <- plm(inv ~ capital, data = Grunfeld, 
                    index = c("firm", "year"), 
                    effect = "individual", model = "random")

summary(re_model_plm)
## Oneway (individual) effect Random Effect Model 
##    (Swamy-Arora's transformation)
## 
## Call:
## plm(formula = inv ~ capital, data = Grunfeld, effect = "individual", 
##     model = "random", index = c("firm", "year"))
## 
## Balanced Panel: n = 10, T = 20, N = 200
## 
## Effects:
##                    var  std.dev share
## idiosyncratic  4040.63    63.57 0.135
## individual    25949.52   161.09 0.865
## theta: 0.9121
## 
## Residuals:
##      Min.   1st Qu.    Median   3rd Qu.      Max. 
## -164.0821  -22.2955   -3.7463   16.9121  319.9564 
## 
## Coefficients:
##              Estimate Std. Error z-value Pr(>|z|)    
## (Intercept) 43.246697  51.411319  0.8412   0.4002    
## capital      0.372120   0.019316 19.2652   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    2299300
## Residual Sum of Squares: 799910
## R-Squared:      0.65211
## Adj. R-Squared: 0.65036
## Chisq: 371.149 on 1 DF, p-value: < 2.22e-16

The coefficients in the RE model include both within-entity and between-entity effects. With data on multiple entities and time periods, the coefficient of capital represents the average effect on inv when capital changes across years and between firms by one unit.

Decision Tree

This figure is taken from Chapter 14 of Dougherty (2016).

Generalized Estimating Equations

With this technique we can make specific assumptions about the structure of the correlation matrix of the errors. If the error structure is correctly specified, the coefficient estimates will be more precise. However, as opposed to the FE model, the coefficients can still be biased due to omitted time-invariant variables. The function geeglm() from package geepack is used below to specify a model with an "exchangeable" correlation structure. Note that this is equivalent to the RE model.

gee_model1 <- geeglm(inv ~ capital, id = firm, waves = year, 
                     data = Grunfeld, family = gaussian, corstr = "exchangeable")
summary(gee_model1)
## 
## Call:
## geeglm(formula = inv ~ capital, family = gaussian, data = Grunfeld, 
##     id = firm, waves = year, corstr = "exchangeable")
## 
##  Coefficients:
##             Estimate  Std.err   Wald Pr(>|W|)    
## (Intercept) 43.20521 35.77051  1.459    0.227    
## capital      0.37227  0.06234 35.658 2.35e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation structure = exchangeable 
## Estimated Scale Parameters:
## 
##             Estimate Std.err
## (Intercept)    27249   11383
##   Link = identity 
## 
## Estimated Correlation Parameters:
##       Estimate Std.err
## alpha   0.8525 0.01326
## Number of clusters:   10  Maximum cluster size: 20

Next, I use "ar1" to estimate the same model with a first-order autoregressive correlation structure.

gee_model2 <- geeglm(inv ~ capital, id = firm, waves = year, 
                     data = Grunfeld, family = gaussian, corstr = "ar1")
summary(gee_model2)
## 
## Call:
## geeglm(formula = inv ~ capital, family = gaussian, data = Grunfeld, 
##     id = firm, waves = year, corstr = "ar1")
## 
##  Coefficients:
##             Estimate Std.err Wald Pr(>|W|)  
## (Intercept)   79.653  52.268 2.32    0.128  
## capital        0.262   0.142 3.39    0.065 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation structure = ar1 
## Estimated Scale Parameters:
## 
##             Estimate Std.err
## (Intercept)    30476   13423
##   Link = identity 
## 
## Estimated Correlation Parameters:
##       Estimate Std.err
## alpha    0.983  0.0094
## Number of clusters:   10  Maximum cluster size: 20

When using "independence" for the argument corstr=, we obtain the pooled OLS estimates.

gee_model3 <- geeglm(inv ~ capital, id = firm, waves = year, 
                     data = Grunfeld, family = gaussian, corstr = "independence")
summary(gee_model3)
## 
## Call:
## geeglm(formula = inv ~ capital, family = gaussian, data = Grunfeld, 
##     id = firm, waves = year, corstr = "independence")
## 
##  Coefficients:
##             Estimate Std.err  Wald Pr(>|W|)    
## (Intercept)   14.236  28.046  0.26  0.61173    
## capital        0.477   0.126 14.37  0.00015 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation structure = independence 
## Estimated Scale Parameters:
## 
##             Estimate Std.err
## (Intercept)    26255   10354
## Number of clusters:   10  Maximum cluster size: 20


3.4 Including the Time Dimension

Pooled OLS via lm()/plm()

The pooled OLS model may be enhanced with the time dimension by including appropriate dummy variables. In the Grunfeld dataset the variable year holds the time dimension; converting it to a factor in the model formula creates one dummy per year. Remember that one level of a factor variable is held out when estimating the model. By controlling for year, the coefficient of capital changes compared to the initial pooled OLS model.

pooled_ols_time_lm <- lm(inv ~ capital + factor(year), data = Grunfeld)

coef(pooled_ols_time_lm)[2]
## capital 
##   0.538

The same strategy can be applied with function plm().

pooled_ols_time_plm <- plm(inv ~ capital + factor(year), data = Grunfeld, 
                           index = c("firm", "year"), effect = "individual",
                           model = "pooling")

coef(pooled_ols_time_plm)[2]
## capital 
##   0.538

FE via lm()/plm()

For the LSDV approach the time dimension is added in the same fashion as with pooled OLS.

fe_time_lm <- lm(inv ~ capital + factor(firm) + factor(year), 
                 data = Grunfeld)

coef(fe_time_lm)[2]
## capital 
##   0.414

The within estimator computed with function plm() can be supplied with time FE by setting the argument effect= to value "twoways".

fe_time_plm <- plm(inv ~ capital, data = Grunfeld, 
                   index = c("firm", "year"), effect = "twoways",
                   model = "within")

coef(fe_time_plm)[1]
## capital 
##   0.414

FE via felm()

In function felm() the time fixed effects are added in the second part of the formula, after the |, alongside the firm fixed effects.

fe_time_felm <- lfe::felm(inv ~ capital | factor(firm) + factor(year), 
                          data = Grunfeld)

coef(fe_time_felm)[1]
## capital 
##   0.414

Testing for Time FE

It is also possible to test whether time fixed effects are needed. The null hypothesis is that the year coefficients are jointly zero and hence no time fixed effects need to be taken into account. For the within model this can be tested with function pFtest(), which I already used in section 3.2 to test for the presence of individual fixed effects; the model with time FE is compared to the one without. Alternatively, you can use the LSDV model and test the joint hypothesis that all coefficients of the year dummies are zero. The results are identical.

# Within model
pFtest(fe_time_plm, fe_model_plm) 
## 
##  F test for twoways effects
## 
## data:  inv ~ capital
## F = 2, df1 = 19, df2 = 170, p-value = 0.06
## alternative hypothesis: significant effects
# LSDV model
car::linearHypothesis(fe_time_lm, matchCoefs(fe_time_lm, "year")) 
## Linear hypothesis test
## 
## Hypothesis:
## factor(year)1936 = 0
## factor(year)1937 = 0
## factor(year)1938 = 0
## factor(year)1939 = 0
## factor(year)1940 = 0
## factor(year)1941 = 0
## factor(year)1942 = 0
## factor(year)1943 = 0
## factor(year)1944 = 0
## factor(year)1945 = 0
## factor(year)1946 = 0
## factor(year)1947 = 0
## factor(year)1948 = 0
## factor(year)1949 = 0
## factor(year)1950 = 0
## factor(year)1951 = 0
## factor(year)1952 = 0
## factor(year)1953 = 0
## factor(year)1954 = 0
## 
## Model 1: restricted model
## Model 2: inv ~ capital + factor(firm) + factor(year)
## 
##   Res.Df    RSS Df Sum of Sq    F Pr(>F)  
## 1    189 763680                           
## 2    170 648201 19    115478 1.59  0.062 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

With a p-value of about 0.06, the evidence for time fixed effects is weak; at the 5% significance level the null hypothesis of no time fixed effects cannot be rejected.


4. Tests for Panel Data Models

RE or FE?

A decision between a fixed and random effects model can be made with the Hausman test, which checks whether the individual error terms are correlated with the regressors. The null hypothesis states that there is no such correlation (RE). The alternative hypothesis is that a correlation exists (FE). The test is implemented in function phtest().

phtest(fe_model_plm, re_model_plm)
## 
##  Hausman Test
## 
## data:  inv ~ capital
## chisq = 0.9, df = 1, p-value = 0.3
## alternative hypothesis: one model is inconsistent

The null hypothesis cannot be rejected here, hence the RE model is preferred.

Pooled OLS or RE?

The Breusch-Pagan Lagrange multiplier (LM) Test helps to decide between a random effects model and a simple OLS regression. This test is implemented in function plmtest() with the null hypothesis that the variance across entities is zero. In this setting this means that there are no significant differences across firms (no panel effect).

plmtest(pooled_ols_plm, effect = "individual", type = c("bp"))
## 
##  Lagrange Multiplier Test - (Breusch-Pagan) for balanced panels
## 
## data:  inv ~ capital
## chisq = 1285, df = 1, p-value <2e-16
## alternative hypothesis: significant effects

The test shows that there are significant differences across firms. Running a pooled OLS regression is thus not appropriate and the RE model is the better choice.

Heteroskedasticity

Testing for the presence of heteroskedasticity is also possible in panel settings. The null hypothesis of the Breusch-Pagan test against heteroskedasticity is that homoskedasticity is present. The test is implemented in function bptest() in package lmtest.

lmtest::bptest(inv ~ capital + factor(firm), 
               studentize = F, data = Grunfeld)
## 
##  Breusch-Pagan test
## 
## data:  inv ~ capital + factor(firm)
## BP = 387, df = 10, p-value <2e-16

There is strong evidence for the presence of heteroskedasticity. Hence, the use of robust standard errors is advised.

Serial Correlation

Since the Grunfeld dataset is “20 years long,” a test for serial correlation of the residuals should be performed. Serial correlation leads to an underestimation of standard errors (too small) and an overestimation of R2 (too large). The Breusch-Godfrey/Wooldridge test for serial correlation in panel models is implemented in function pbgtest() with the null hypothesis that there is no serial correlation.

pbgtest(fe_model_plm)
## 
##  Breusch-Godfrey/Wooldridge test for serial correlation in panel models
## 
## data:  inv ~ capital
## chisq = 74, df = 20, p-value = 4e-08
## alternative hypothesis: serial correlation in idiosyncratic errors

There is strong evidence that the residuals are serially correlated.


5. Panel Data and Inference

If the error terms of different observations from the same entity are correlated, the standard errors have to be adjusted. In the Grunfeld dataset a firm is observed at 20 different time points, so the observations for the same firm are not independent. This leads to the previously described problem of serial correlation of the residuals. In order to solve this issue, clustered standard errors have to be used. Both standard OLS and heteroskedasticity-robust standard errors are wrong because they assume that the error terms u_it are not serially correlated. Hence they understate the true sampling uncertainty (correlated error terms carry less independent information than the number of observations suggests). Clustered standard errors estimate the variance of the coefficients when the regressors are i.i.d. across entities but correlated within entities.

This issue is not restricted to panel data and can also occur in cross-sectional studies, if the data contains clusters of observations and their error terms are correlated within but not between the clusters (regions, schools, branches, …).

Clustered SE (OLS)

In order to correct the standard errors, function vcovHC() is used, which originates from package sandwich but is also available in package plm. The argument cluster= is set to "group" and the argument type= controls the type of standard error estimation. I am using "sss", which applies the small-sample correction used by Stata (so the results are easy to check against other software).

# OLS standard error
coeftest(pooled_ols_plm)[2,c(1,2)]
##   Estimate Std. Error 
##     0.4772     0.0383
# Cluster robust standard error
coeftest(pooled_ols_plm,
         vcov = vcovHC(pooled_ols_plm,
                       type = "sss", 
                       cluster = "group"))[2,c(1,2)]  
##   Estimate Std. Error 
##      0.477      0.133

Clustered SE (FE)

This works just the same for the FE model. However, even with type = "sss" there remains a minor difference compared to the standard error computed with Stata. I am not sure why that is the case.

# FE standard error
coeftest(fe_model_plm)[,c(1,2)]
##   Estimate Std. Error 
##     0.3707     0.0194
# Cluster robust standard error
coeftest(fe_model_plm, 
         vcov = vcovHC(fe_model_plm,
                       type = "sss",
                       cluster = "group"))[,c(1,2)]
##   Estimate Std. Error 
##     0.3707     0.0649

With function felm() from package lfe, cluster-robust standard errors can be computed directly. Here the standard errors match those from Stata, so this might be the best function to use for the within-groups FE model with clustered standard errors.

fe_model_felm2 <- lfe::felm(inv ~ capital | firm | 0 | firm, data = Grunfeld)

summary(fe_model_felm2)
## 
## Call:
##    lfe::felm(formula = inv ~ capital | firm | 0 | firm, data = Grunfeld) 
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -190.71  -20.83   -0.46   21.38  293.69 
## 
## Coefficients:
##         Estimate Cluster s.e. t value Pr(>|t|)    
## capital   0.3707       0.0651    5.69    3e-04 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 63.6 on 189 degrees of freedom
## Multiple R-squared(full model): 0.918   Adjusted R-squared: 0.914 
## Multiple R-squared(proj model): 0.66   Adjusted R-squared: 0.642 
## F-statistic(full model, *iid*): 213 on 10 and 189 DF, p-value: <2e-16 
## F-statistic(proj model): 32.4 on 1 and 9 DF, p-value: 0.000296

Clustered SE (RE)

Computing cluster-robust standard errors for the RE model works the same way. Again, the results are the same as in Stata.

coeftest(re_model_plm)[2,c(1,2)]
##   Estimate Std. Error 
##     0.3721     0.0193
coeftest(re_model_plm, 
         vcov = vcovHC(re_model_plm,
                       type = "sss",
                       cluster = "group"))[2,c(1,2)]
##   Estimate Std. Error 
##     0.3721     0.0658


References

Croissant, Yves, and Giovanni Millo. 2008. “Panel Data Econometrics in R: The plm Package.” Journal of Statistical Software 27 (2): 1–43. https://doi.org/10.18637/jss.v027.i02.
Dougherty, C. 2016. Introduction to Econometrics. Oxford University Press.
Greene, W. H. 2018. Econometric Analysis. Pearson.