

1. Introduction

1.1 R packages

The following R packages are required in this tutorial.

library(dplyr)
library(broom)
library(ggplot2)
library(tidyr)
library(car)
library(quantreg)
library(ggrepel)

1.2 Dataset

All practical examples are built around the Boston Housing Dataset from R package MASS. This dataset contains data on housing values in suburbs of Boston.

Boston <- 
  MASS::Boston %>% 
  mutate(id = as.character(row_number()),
         medv = medv*1000)

1.3 Recap

When we estimate the relationship between predictors \(x_i\) and a dependent variable \(y\) using OLS (ordinary least squares), a fitted value is determined for each observed value (or combination of values) of \(x_i\). Combining these fitted values yields the regression line (fit). The difference between each fitted value \(\hat{y}\) and observed value \(y\) is called the residual (error). A linear regression model always describes the data incompletely, and we can use the residuals to improve the fit in different stages or to verify whether our model satisfies the necessary assumptions of OLS. In the following sections we will therefore consider different approaches to making sure that our model is well specified or finding out how it can be improved. This may include but is not limited to:

  • the inclusion of additional variables in the model
  • the special treatment of unusual data points
  • the transformation of variables.

Example model

Below I’m using the average number of rooms per dwelling rm to predict the median value of owner-occupied homes medv (more rooms = higher value?). I’m using OLS to estimate the model, which is implemented in function lm().

example.model <- lm(medv ~ rm, data = Boston)

The R package broom offers many functions that ease the handling of the output generated by lm() and related functions. With augment() we can transform an lm object into a tibble that contains the variables used in the model as well as additional measures like the fitted values (.fitted) or residuals (.resid).

broom::augment(example.model)
## # A tibble: 506 x 8
##     medv    rm .fitted  .resid .std.resid    .hat .sigma     .cooksd
##    <dbl> <dbl>   <dbl>   <dbl>      <dbl>   <dbl>  <dbl>       <dbl>
##  1 24000  6.58  25176. -1176.     -0.178  0.00231  6623. 0.0000367  
##  2 21600  6.42  23774. -2174.     -0.329  0.00205  6622. 0.000111   
##  3 34700  7.18  30728.  3972.      0.602  0.00523  6620. 0.000952   
##  4 33400  7.00  29026.  4374.      0.662  0.00402  6620. 0.000885   
##  5 36200  7.15  30382.  5818.      0.882  0.00496  6618. 0.00194    
##  6 28700  6.43  23856.  4844.      0.733  0.00206  6619. 0.000555   
##  7 22900  6.01  20051.  2849.      0.431  0.00227  6622. 0.000212   
##  8 27100  6.17  21508.  5592.      0.846  0.00203  6618. 0.000727   
##  9 16500  5.63  16583.   -83.4    -0.0126 0.00369  6623. 0.000000295
## 10 18900  6.00  19978. -1078.     -0.163  0.00229  6623. 0.0000306  
## # ... with 496 more rows

Scatterplot

With all these measures combined in a single tibble we can pass it directly to ggplot() to produce a graph. Below I'm creating a scatterplot of rm and medv and adding a regression line in blue, which connects the fitted values predicted by our model. Then I'm using geom_segment() to visualize the residuals as red lines, since they measure the distance between the observed data points and the predicted values of owner-occupied homes (blue regression line).

augment(example.model) %>%
  ggplot(data = ., aes(x = rm, y = medv)) +
  geom_point() +
  geom_line(aes(y = .fitted), color = "blue") +
  geom_segment(aes(xend = rm, yend = .fitted), color = "red") +
  labs(x = "Average number of rooms per dwelling",
       y = "Median value of owner-occupied homes (US-$)") +
  theme(legend.position = "none")


2. Unusual and Influential data

Before we start to examine different methods for carrying out regression diagnostics, I want to introduce common measures that will recur throughout this tutorial.

2.1 Measures

To describe data as unusual or influential we are equipped with different tools. Below we’ll walk through each of them individually but we will soon discover that a lot of them are interrelated.

Consider the following linear model. I'm regressing the median value of owner-occupied homes (medv) on the average number of rooms per dwelling (rm), the proportion of owner-occupied units built prior to 1940 (age) and the lower status of the population in % (lstat).

linear.model <- lm(medv ~ rm + age + lstat, data = Boston)

The summary() output of the lm object, however, provides us with no diagnostics related to our specified model.

summary(linear.model)
## 
## Call:
## lm(formula = medv ~ rm + age + lstat, data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -18210  -3467  -1053   1957  27500 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1175.311   3181.924  -0.369    0.712    
## rm           5019.133    454.306  11.048   <2e-16 ***
## age             9.091     11.215   0.811    0.418    
## lstat        -668.513     54.357 -12.298   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5542 on 502 degrees of freedom
## Multiple R-squared:  0.639,  Adjusted R-squared:  0.6369 
## F-statistic: 296.2 on 3 and 502 DF,  p-value: < 2.2e-16

Residuals

As already noted in the recap section, the raw residual is the difference between an observed data point and a calculated predicted value for that point. However, there are also standardized and studentized residuals. Both attempt to adjust the raw residuals for their standard errors. As we’ll see in a later section residuals can also be normalized by scaling them.

Standardized residuals divide each raw residual by an estimate of its standard deviation, as implemented in function rstandard(). The function rstudent() produces studentized residuals in the same way, but it uses a leave-one-out estimate of the error variance. Again we use augment() to access the raw (.resid) and standardized (.std.resid) residuals.

augment(linear.model) %>%
  pivot_longer(cols = c(.resid, .std.resid), 
               names_to = "type", 
               values_to = "value") %>%
  ggplot(data = ., aes(x = value)) + 
  geom_histogram(color = "black", fill = "white") +
  facet_wrap(. ~ type, scales = "free") +
  labs(x = "Residuals", y = "Absolute frequency")

Both histograms show a distribution of the residuals that is relatively close to a normal distribution. Note, however, the different scale of the residuals on the x-axis.
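
If you prefer to work with the lm object directly, the standardized and studentized residuals can also be obtained with rstandard() and rstudent(); a minimal sketch (resid_compare is just my own name for the comparison):

resid_compare <- data.frame(standardized = rstandard(linear.model),
                            studentized  = rstudent(linear.model))
head(resid_compare)  # the two versions differ noticeably only for large residuals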

Leverage

An observation with an extreme value on an explanatory variable is called a point with high leverage. Leverage measures how far an observation deviates from the mean of that variable. Such points can have a large effect on the estimate of the regression coefficients. A leverage exceeding \(\frac{3(k+1)}{n}\) (small samples) or \(\frac{2(k+1)}{n}\) (large samples) is considered high (with \(k\) denoting the number of predictors and \(n\) the number of observations). In R we can compute this measure with function hatvalues(). Again we can use augment() to transform our lm object such that it includes the leverage (.hat). Below I'm plotting the standardized residuals (.std.resid) against the leverage (.hat), indicating the threshold with a red line and labeling the data points above the threshold with geom_text().

augment(linear.model) %>%
  mutate(high_lev = if_else(.hat > (3*4)/dim(Boston)[1], Boston$id, "")) %>%
  ggplot(data = ., aes(x = .std.resid, y = .hat, label = high_lev)) +
  geom_point() + geom_text(vjust = 0, hjust = 0) +
  geom_hline(yintercept = (3*4)/dim(Boston)[1], color = "red") +
  labs(x = "Standardized residuals", y = "Leverage")

Influence

An observation is said to be influential if removing it from the model substantially changes the estimated coefficients. Influence is the product of leverage and discrepancy: leverage reflects an exceptional combination of predictor values \(x_i\), while discrepancy reflects an exceptional value of the dependent variable \(y\). If one of these aspects is missing, the influence of an observation is equal to zero. Mathematically, the discrepancy is an observation's standardized residual, i.e. the residual adjusted for the effect of leverage.

Cook’s D

This measure combines information on the residual and the leverage. The lowest value Cook's D can assume is zero, and the higher Cook's D is, the more influential the data point. A conventional cut-off point is \(\frac{4}{n}\). Cook's D also lets us assess the influence of an observation on all regression coefficients simultaneously. The idea behind it is that an observation's influence stems from an unusual value of \(y\) combined with an unusual combination of the predictors \(x_i\). Our model's Cook's D can be computed with function cooks.distance(). Let's check graphically which observations exceed the threshold. To do so we plot Cook's D against a running variable in the dataset (id) and highlight the cut-off value with a horizontal line.

augment(linear.model) %>%
  mutate(outlier = if_else(.cooksd > 4/dim(Boston)[1], Boston$id, "")) %>%
  ggplot(data = ., aes(x = Boston$id, y = .cooksd, label = outlier)) +
  geom_point() + geom_text(vjust = 0, hjust = 0) +
  geom_hline(yintercept = 4/dim(Boston)[1], color = "red") +
  labs(y = "Cook's D") +
  theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank())

The suburb with id 369 in particular stands out in this plot.
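
As a cross-check without a plot, cooks.distance() can be filtered against the same cut-off (a quick sketch; cooksd is my own object name):

cooksd <- cooks.distance(linear.model)
which(cooksd > 4/dim(Boston)[1])        # suburbs above the 4/n threshold
head(sort(cooksd, decreasing = TRUE))   # suburb 369 should top this list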

DFFITS

This measure also combines information on the residual and leverage. A common cut-off point for DFFITS is \(2\sqrt{\frac{k}{n}}\). It can be either positive or negative, with values close to zero corresponding to data points with small or no influence. This measure is very similar to Cook's D, except that the two scale differently. DFFITS is computed with function dffits().

dffits <- dffits(linear.model)

which(unname(dffits) > 2*sqrt((length(linear.model$coefficients) -1) / dim(Boston)[1]))
##  [1]   9  49  99 142 148 149 158 162 163 164 167 187 196 203 204 205 215 225 226
## [20] 229 234 257 258 262 263 268 281 283 284 366 368 369 370 371 372 373 374 375
## [39] 385 407 408 413 415 496
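
Since DFFITS can also be negative, it is safer to compare absolute values with the cut-off; a minimal variant of the check above (cutoff_dffits is my own name):

cutoff_dffits <- 2*sqrt((length(linear.model$coefficients) - 1) / dim(Boston)[1])
which(abs(dffits) > cutoff_dffits)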

DFBETA

Another formal way to detect influential observations is to compute DFBETA. It is calculated by estimating the specified regression model with all available observations. Then one observation is held back and the result is compared to the initial model. If there is a large difference in the estimated coefficients, the held-out observation had a large influence. This procedure is repeated for each observation and for each \(x_i\) separately. A common cut-off value for the standardized version (DFBETAS) is \(\frac{2}{\sqrt{n}}\). In R we can use function dfbeta() to compute the raw differences - so let's have a look at the predictors.

dfbeta <- dfbeta(linear.model)

Below, I use boxplots to show the distribution of DFBETA.

as_tibble(dfbeta) %>%
  mutate(id = rownames(dfbeta)) %>%
  dplyr::select(-`(Intercept)`) %>%
  pivot_longer(cols = -id, names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = variable, y = value)) + geom_boxplot() +
  labs(y = "DFBETA", x = "")

Outlier (univariate)

An outlier is an observation with an extreme value. It can indicate a peculiarity of the sample or a data entry error like an additional zero or misplaced decimal point. To identify univariate outliers we can use boxplots. All points outside the whiskers of the boxplot are considered outliers.

Boston %>%
  pivot_longer(cols = c(rm, age, lstat), 
               names_to = "predictor", values_to = "x_value") %>%
  ggplot(data = ., aes(x = x_value, group = predictor)) + 
  geom_boxplot() + facet_grid(.~predictor, scale = "free_x") +
  labs(x = "Value of predictor", y = "District's average reading score")

Outlier (multivariate)

However, outliers are often multivariate, which means that they have an unusual combination of values on the predictor variables. Often outliers can be explained: if you have a variable in your dataset that can explain the outlier(s), you should add this variable to your model. It's always helpful to create a scatterplot matrix of the predictors you are interested in. Below I do this with function ggpairs() from R package GGally. I highlighted suburbs where the average number of rooms per dwelling exceeds 8 (some really big houses). You can see that these suburbs also have a very low percentage of lower-status population (lstat) and a high proportion of owner-occupied units built prior to 1940 (age).

Boston %>%
  mutate(outlier = if_else(rm > 8, "outlier", "normal")) %>%
  select(rm, age, lstat, outlier) %>%
  GGally::ggpairs(data = ., aes(fill = outlier, color = outlier)) + 
  scale_fill_manual(values = c("black","red")) + 
  scale_color_manual(values = c("black","red"))

Furthermore, the package car contains the function outlierTest() that implements the Bonferroni outlier test. Applied to a linear regression model, it returns Bonferroni p-values for the studentized residuals. Here we judge data points to be outliers after we've already run our regression model; in this case an outlier is an observation with a large (studentized) residual.

outlierTest(linear.model)
##     rstudent unadjusted p-value Bonferroni p
## 369 5.216449         2.6742e-07   0.00013531
## 373 4.976143         8.9325e-07   0.00045198
## 372 4.708526         3.2346e-06   0.00163670


2.2 Diagnostic plots

RVF Plot

The residuals vs. fitted plot is well suited for multiple regression diagnostics as it combines all explanatory variables in a two-dimensional layout. We only need the .fitted values and residuals (.resid). Commonly the fitted values are reported on the x-axis and the residuals on the y-axis. I’m also labeling the outliers we’ve found with function outlierTest(). With the RVF plot we can quickly assess whether the relationship between a predictor \(x_i\) and \(y\) is nonlinear, outliers strongly influence the model or we missed some other influence correlated with the predictors in the model.

augment(linear.model) %>%
  mutate(outlier = if_else(Boston$id %in% c(369, 373, 372), Boston$id, "")) %>%
  ggplot(data = .) +
  aes(x = .fitted, y = .resid, label = outlier) +
  geom_point() + ggrepel::geom_label_repel(hjust = 1, vjust = 1) +
  geom_hline(yintercept = 0, color = "red") +
  labs(x = "Fitted values", y = "Residuals")

We can see two striking patterns in this plot. First, there is a downward-sloping line of points between fitted values of medv of roughly 20,000 and 40,000. Additionally, the remaining residuals exhibit a U-shaped pattern.

Residual Patterns

Remember that the mean of the residuals is always zero because that is how the OLS regression coefficients are estimated. To fulfil the assumption \(E(e_i) = 0\), the mean of the residuals must also be zero locally (i.e. for arbitrary values of \(x_i\)). This is not the case in our RVF plot.

Below I list common patterns that might emerge in an RVF plot:

  • point cloud: no residual relationship between r and x
  • sloping band: consider adding a linear term
  • curved band: suggests using a more flexible dependence, e.g. a quadratic term
  • wedge: occurs either when the density of the points changes with x, or the variability systematically increases / decreases with increasing values of x

Leverage Plot

With function leveragePlots() from package car we can automatically create leverage plots. One plot is created for each predictor used in the linear model.

leveragePlots(linear.model, id = TRUE)

Let’s color high leverage points in the residual vs. fitted plot.

augment(linear.model) %>%
  mutate(highLeverage = if_else(.hat > (3*4)/dim(Boston)[1], "yes","no") %>% as.factor()) %>%
  ggplot(data = .) +
  aes(x = .fitted, y = .resid, color = highLeverage) +
  geom_point() +
  geom_hline(yintercept = 0, color = "red") +
  scale_color_manual(values = c("blue","red"))

AV Plot

To create an AV plot (or partial regression plot) for a predictor \(x_i\), we first run a regression of \(y\) against all other predictors except \(x_i\). Then we run another regression of \(x_i\) against all remaining predictors. The residuals of both models are stored and then plotted against each other. The package car contains the function avPlot() to create this type of plot for a specific predictor, and avPlots() for plots of every predictor in the model.

avPlot(linear.model, variable = "lstat")

Points that lie far away from the majority of observations are considered multivariate outliers.

AV Plot (DFBETA)

It’s recommended to adjust the points of the AV plot in proportion with the observation’s DFBETA. Remember to use the abs() function to obtain absolute values of DFBETA as they can be negative.

Boston %>%
  # estimate first model
  lm(medv ~ rm + age, data = .) %>% 
  augment() %>% select(.resid) %>% rename(resid_1 = ".resid") %>%
  bind_cols( # estimate second model
    Boston %>% 
      lm(lstat ~ rm + age, data = .) %>% 
      augment()
    ) %>%
  ggplot(aes(x = .resid, y = resid_1, size = abs(dfbeta[,"lstat"]))) +
  geom_point() +
  labs(x = "Residuals lstat ~ rm + age",
       y = "Residuals medv ~ rm + age",
       size = "DFBETA lstat")

LVR2 Plot

The leverage versus residual squared plot uses (squared) normalized residuals instead of the raw residuals and plots them against the leverage. This is a quick way of checking potential influential observations and outliers at the same time. The horizontal reference line indicates the mean of the leverage and the vertical reference line the mean of the normalized squared residuals. Points above the horizontal line have higher-than-average leverage; points to the right of the vertical line have larger-than-average residuals.

Boston %>%
  mutate(leverage_values = hatvalues(linear.model),
         residuals_norm = resid(linear.model) / sqrt(sum(resid(linear.model)^2)),
         residuals_norm_sq= residuals_norm^2) %>%
  ggplot(data = .) +
  aes(x = residuals_norm_sq, y = leverage_values, label = id) +
  geom_point() + geom_text(hjust = 0, vjust = 0) +
  geom_hline(yintercept = mean(hatvalues(linear.model)), color = "red") +
  geom_vline(xintercept = mean((resid(linear.model) / sqrt(sum(resid(linear.model)^2)))^2), color = "red") +
  labs(x = "Normalized residuals squared",
       y = "Leverage")

The normalized residuals in this plot are computed as in Stata’s lvr2plot command via \(\frac{r}{\sqrt{\sum_{i}^{} r_i^2}}\) with \(r\) denoting the estimated residual.


2.3 Robust estimation

To counteract an outlier's (disproportionate) influence, robust methods exist whose fit is largely unaffected by the presence of outliers. A robust method may exclude the actual value of an outlier but still take its presence into account (for example by weighting). One way to handle outliers is to use quantile (median) regression or to complete an incompletely specified regression model. Often exceptional cases are exceptional because our underlying theory does not explain them. For example, a variable like income, whose distribution is often skewed to the right, may be log-transformed. If you still decide to remove an outlier from the regression model, always report both results (including the one with the outlier).

The package quantreg contains the function rq that lets us run a quantile regression of our model. The argument tau = 0.5 is the default and corresponds to a median regression.

median.regression <- quantreg::rq(medv ~ lstat, tau = 0.5, data = Boston)
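
Before the visual comparison, the coefficient estimates of the two fits can be set side by side (a quick sketch):

coef(lm(medv ~ lstat, data = Boston))  # OLS fit
coef(median.regression)                # median regression fit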

Now let’s compare both regressions visually. We can see that the fit of the linear model is shifted upwards because of the observations with high values of medv and small values of lstat. The fit of the median regression, however, is less influenced by these data points.

augment(median.regression) %>%
  ggplot(data = ., aes(x = lstat, y = medv)) +
  geom_point() +
  geom_line(aes(y = .fitted, color = "red"), size = 1) +
  geom_smooth(aes(color = "blue"), method = "lm", se = FALSE) +
  scale_colour_manual(name = "Model", 
                      values =c("blue"="blue","red"="red"), 
                      labels = c("lm",'rq'))


3. Normality of Residuals

We require a normal distribution of residuals for testing statistical hypotheses as the normality assumption assures that p-values for t-tests and F-tests are valid. However, it is not required to obtain unbiased estimates of regression coefficients. The OLS model only requires that the residuals are identically and independently distributed (i.i.d.). Also a predictor variable \(x_i\) is not required to be normally distributed.

Kernel density estimation

The kernel density is best shown graphically. It is essentially a smoothed histogram: think of very narrow bins combined with a moving average. Below I plot the distribution of the raw residuals of our estimated model.

broom::augment(linear.model) %>%
  ggplot(aes(x=.resid)) +
  geom_histogram(aes(y=..density..), color = "black", fill = "white") +
  geom_density(color = "red") +
  labs(x = "Residuals", y = "Density")

Q-Q Plot

We can also plot the quantiles of the residuals against the quantiles of a normal distribution, which is more sensitive to non-normality near the tails of the distribution.

augment(linear.model) %>%
  ggplot(data = ., aes(sample = .resid)) +
  stat_qq() +
  stat_qq_line() +
  labs(x = "Theoretical Quantiles",
       y = "Residuals")

P-P Plot

Here we plot the theoretical proportion less than or equal to each observed value against the actual proportion. The P-P plot is a Q-Q plot of these transformed values against a uniform distribution. It is sensitive to non-normality in the middle range of the data.

augment(linear.model) %>%
  mutate(mean_resid = mean(.resid, na.rm = TRUE),
         sd_resid = sd(.resid, na.rm = TRUE),
         p = (1 : dim(Boston)[1]) / dim(Boston)[1] - 0.5 / dim(Boston)[1]) %>%
  ggplot(aes(x = p, y = sort(pnorm(.resid, mean_resid, sd_resid)))) +
  geom_point() +
  labs(x = "Theoretical Proportion",
       y = "Residuals")

Inter-quartile-range

Severe outliers lie more than 3 inter-quartile ranges below the first quartile or above the third quartile. The presence of any severe outliers should be sufficient evidence to reject normality at a 5% significance level. Mild outliers, in contrast, are common in samples of any size. The inter-quartile range is exactly what a boxplot visualizes.

augment(linear.model) %>%
  ggplot(data = ., aes (x = .resid)) +
  geom_boxplot() + 
  labs(y = "", x = "Residuals") +
  theme(axis.title.y = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank())
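
The same rule can be checked numerically; a sketch that counts the residuals lying more than 3 inter-quartile ranges beyond the quartiles (all object names are my own):

r <- resid(linear.model)
q <- quantile(r, c(0.25, 0.75))
sum(r < q[1] - 3*IQR(r) | r > q[2] + 3*IQR(r))  # number of severe outliers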

Shapiro-Wilk test

This test can be used to test whether the data follows a normal distribution. The null hypothesis is that the data is normally distributed. This test can be computed with function shapiro.test().

shapiro.test(augment(linear.model)$.resid)
## 
##  Shapiro-Wilk normality test
## 
## data:  augment(linear.model)$.resid
## W = 0.91406, p-value = 2.385e-16

The test indicates a strong rejection of the null hypothesis.


4. Homoscedasticity of Residuals

If the model is well specified, there should be no pattern when plotting the residuals against the fitted values. If the variance of the residuals is non-constant, the residual variance is said to be heteroskedastic. This is a violation of the assumption \(VAR(e_i) = \sigma^2\), which states that the variance of the error terms is constant for all values of \(X\) (homoskedasticity). Heteroskedasticity does not lead to biased coefficient estimates, but the estimates are no longer efficient. Inefficient estimates have an increased probability of deviating from their true population values. Often this problem arises when the distribution of a predictor is not symmetrical.

Symmetry Plot

In a symmetry plot we determine the median and then, for successive pairs of observations above and below it, their distances to the median. If the distances are equal, the points lie on the main diagonal. If the distances of the observations above the median are larger than the distances below the median, the distribution is skewed to the right; in the opposite case it is skewed to the left. First we have to compute the distances for each pair of observations.

lstat_sorted <- sort(Boston$lstat)

# 506 observations yield 253 pairs around the median
y <- rep(0, 253)
x <- rep(0, 253)

for (i in 1:253){
  
  # distance of the i-th largest observation above the median
  y[i] <- lstat_sorted[507-i] - median(lstat_sorted, na.rm = TRUE)
  
  # distance of the i-th smallest observation below the median
  x[i] <- median(lstat_sorted, na.rm = TRUE) - lstat_sorted[0+i] 

}

Now we plot the distances.

tibble(below = x, above = y) %>%
  ggplot(aes(x = below, y = above)) +
  geom_point() + geom_abline(intercept = 0, slope = 1, size = 0.5) +
  labs(x = "Distance below median",
       y = "Distance above median")

We can conclude that the distribution of lstat is skewed to the right.

Extended RVF Plot

For this plot we partition the x-axis into \(k\) groups with approximately equal numbers of observations. For each group we then create a boxplot of the standardized residuals.

augment(linear.model) %>%
  mutate(k = .bincode(
    x = .fitted, 
    breaks = quantile(.fitted, seq(0, 1, length.out = round(dim(Boston)[1]/20)), type = 2),
    include.lowest = TRUE)
    ) %>%
  ggplot(aes(y = .std.resid, x = k, group = k)) +
  geom_boxplot() +
  labs(x = "Group", y = "Standardized residuals")

We see a slightly U-shaped pattern of the residuals.

Tests

The Breusch-Pagan test is usually applied to detect heteroskedasticity. It tests the null hypothesis that the variance of the residuals is homogeneous. Hence, if the p-value is very small, we have to reject the hypothesis and accept the alternative hypothesis that the variance is not homogeneous (heteroskedastic). The test is very sensitive to model assumptions (like normality). In R the test is implemented in function breusch_pagan() from package skedastic.

skedastic::breusch_pagan(linear.model)
## # A tibble: 1 x 5
##   statistic  p.value parameter method                alternative
##       <dbl>    <dbl>     <dbl> <chr>                 <chr>      
## 1      19.8 0.000189         3 Koenker (studentised) greater

We can also use the White test, which is a special case of the Breusch-Pagan test. It is less sensitive to violations of the normality assumption (distribution of the errors) and requires at least two independent variables in order to work. The test is implemented in function white_lm().

skedastic::white_lm(linear.model)
## # A tibble: 1 x 5
##   statistic  p.value parameter method       alternative
##       <dbl>    <dbl>     <dbl> <chr>        <chr>      
## 1      116. 1.28e-22         6 White's Test greater

Another test is the Goldfeld-Quandt-Test which compares the variances of two submodels divided by a specified breakpoint and rejects the null hypothesis if the variances differ.

lmtest::gqtest(linear.model)
## 
##  Goldfeld-Quandt test
## 
## data:  linear.model
## GQ = 2.6442, df1 = 249, df2 = 249, p-value = 2.722e-14
## alternative hypothesis: variance increases from segment 1 to 2

Autocorrelation

Autocorrelation is a violation of the assumption that \(COV(e_i, e_j) = 0; i \neq j\).

This assumption states that the error terms should not be correlated. Autocorrelation leads to inefficient coefficient estimates and standard errors that are too small. The issue often arises with time series data, where observations are recorded at multiple points in time and observations close together in time are more similar to each other than observations that lie further apart.
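
Although the Boston data are cross-sectional, a standard check for first-order autocorrelation is the Durbin-Watson test, available for instance in package lmtest (shown here only as a sketch):

lmtest::dwtest(linear.model)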

Remedies

To mitigate heteroskedasticity it often suffices to transform the dependent variable. If \(y\) is skewed to the right you should consider a log-transformation. Consider applying a \(log(x-(min(x)-1))\) transformation if your variable contains negative or zero values.
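
A minimal sketch of this remedy: refit the model with a log-transformed dependent variable and re-run the Breusch-Pagan test (whether the transformation actually removes the heteroskedasticity has to be checked case by case):

log.model <- lm(log(medv) ~ rm + age + lstat, data = Boston)
skedastic::breusch_pagan(log.model)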

If heteroskedasticity is still present you cannot use the standard errors reported by the summary ouput of lm(). However, various methods exist to correct standard errors in the presence of heteroskedasticity. I will discuss them in another tutorial.


5. Linearity

To assess the functional form of a relationship, non-parametric methods have to be applied. For a small number of observations it may suffice to use a scatterplot matrix, but as the number of cases rises we should consider using some form of scatterplot smoother. You have to be careful because the functional form between two variables can change once we control for other variables in a multiple linear regression model. Hence we also have to examine linearity while controlling for the other predictors.

Usually we start by examining the relationship between a predictor \(x_i\) and \(y\) by plotting the residuals against that predictor. However, this plot does not reveal the exact form of the relationship, so it may be consistent with a U-shape and a logarithmic relationship at the same time. The distinction is important because for a U-shaped relationship we should add a squared term, whereas for a logarithmic relationship it can suffice to transform the variable.

CPR Plot

We can use the component-plus-residual plot (or partial residual plot) to determine the type of functional relationship. Instead of the plain residual we plot the residual plus the linear component \(b_i x_i\) of the predictor against \(x_i\). Each plot displays a blue dashed least-squares line and a violet scatterplot-smoother line (loess). These plots help us to assess the relationship of each predictor with the partial residuals.

crPlots(linear.model, id = FALSE)

Below I recreate the CPR plot for variable rm with the ggplot2 package.

Boston %>%
  mutate(residual_values = residuals(linear.model),
         residual_plus_rm = residual_values + coefficients(linear.model)["rm"]*rm ) %>%
  ggplot(data = .) +
  aes(x = rm, y = residual_plus_rm) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "blue", linetype = "dashed") +
  geom_smooth(method = "loess", se = FALSE, color = "violet") +
  labs(y = "Component plus residual",
       x = "Average number of rooms per dwelling")

Single regression

Let’s have a look at the relationship between the median value of owner-occupied homes (medv) and the suburb’s share of the lower status of the population lstat. I specify one model where I use lstat without any transformation and another model with a log-transformation.

ggplot(data = Boston, aes(x = lstat, y = medv)) +
  geom_point() +
  geom_smooth(method = "lm", se = F, aes(color = "red")) +
  geom_smooth(method = "lm", formula = y ~ log(x), se = F, aes(color = "blue")) +
  scale_colour_manual(name = "Model", 
                      values =c("blue"="blue","red"="red"), 
                      labels = c("Linear-Log",'Linear')) +
  labs(x = "Lower status of the population (percent)",
       y = "Median value of owner-occupied homes")

The log-transformation of lstat fits a lot better than the simple linear model.
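
To back up the visual impression with a number, we can compare the adjusted \(R^2\) of the two specifications (a quick sketch):

summary(lm(medv ~ lstat, data = Boston))$adj.r.squared
summary(lm(medv ~ log(lstat), data = Boston))$adj.r.squared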

Multiple regression

Here we can plot the standardized residuals against each of the predictor variables in the regression model. If there is a clear nonlinear pattern, there is a problem of nonlinearity. Otherwise, we should see just a random scatter of points in each of the plots.

model1 <- lm(medv ~ rm + age + lstat, data = Boston)

augment(model1) %>%
  ggplot(data = ., aes(x = lstat, y = .std.resid)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "red")

After detecting a nonlinearity, we may find that the variable in question is heavily skewed. In this case we can consider a log transformation.

model2 <- lm(medv ~ rm + age + log(lstat), data = Boston)

augment(model2) %>%
  ggplot(data = ., aes(x = `log(lstat)`, y = .std.resid)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "red")

ACPR Plot

The augmented component-plus-residual plot aims to improve on the CPR plot for identifying nonlinearities in the data. I've recreated this plot from the acprplot command in Stata. In addition to the CPR plot, we add squared terms of each predictor to the model and then also add the squared term's coefficient multiplied by the squared values of the predictor.

modelsq <-  lm(medv ~ rm + I(rm^2) + age + I(age^2) + lstat + I(lstat^2), data = Boston) 

augment(modelsq) %>%
  mutate(residual_aug = .resid + coefficients(modelsq)["rm"]*rm +
           coefficients(modelsq)["I(rm^2)"]*rm*rm ) %>%
  ggplot(data = .) +
  aes(x = rm, y = residual_aug) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "blue", linetype = "dashed") +
  geom_smooth(method = "loess", se = FALSE, color = "violet") +
  labs(x = "Average number of rooms per dwelling", y = "Augmented component plus residual")


6. Multicollinearity

Multicollinearity often arises in the multiple regression model and can involve combinations of more than two variables. Variables that are uncorrelated are said to be orthogonal.

Multicollinearity can lead to the following issues:

  • imprecise estimates of regression coefficients
  • slight fluctuations in correlation may lead to large differences in regression coefficients
  • Adding or dropping cases may lead to large differences in regression coefficients
  • Increases the standard error of coefficients (affecting significance tests)

Variance Inflation Factor

A variable whose VIF value is greater than 5 may be further investigated. Another measure, tolerance, is defined as 1/VIF. A tolerance value lower than 0.1 is comparable to a VIF of 10; it means that the variable can almost be expressed as a linear combination of the other independent variables. The computation of the VIF is implemented in function vif() of R package car.

car::vif(linear.model)
##       rm      age    lstat 
## 1.675215 1.638542 2.477304
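
The corresponding tolerance values are simply the reciprocals of the VIFs (sketch):

1 / car::vif(linear.model)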

We can also assess the VIF visually with a bar plot.

tibble(vifs = car::vif(linear.model), variable = names(car::vif(linear.model))) %>%
  ggplot(data = ., aes(x = variable, y = vifs)) +
  geom_bar(stat = "identity", color = "black", fill = "white") +
  geom_hline(yintercept = 5, color = "red") +
  coord_flip() +
  labs(x = "", y = "VIF")

Collinearity diagnostics

In Stata there is a command called collin written by Phil Ender which reports more measures of collinearity such as the eigenvalues, condition indices and a condition number. The latter, for example, is a combined measure of all predictors together and reflects the global instability of the regression coefficients. A range between 20 and 30 indicates moderate multicollinearity while values higher than 30 indicate severe collinearity. In R we can use function CN() contained in package multiColl for this purpose.

cte <- array(1,length(Boston[,1]))           # column of ones for the constant term
Boston.X <- cbind(cte,Boston[,c(1,6,10,13)]) # crim, rm, tax and lstat
multiColl::CN(Boston.X)
## [1] 34.79292

In R we can also compute the condition indices for linear regression coefficients. The function cond.index() in R package klaR does this.

klaR::cond.index( medv ~ crim + rm + tax + lstat, data = Boston)
## [1]  1.000000  2.279673  5.038487  7.988427 34.792924


7. Model Specification

The assumption \(E(e|X) = 0\) is crucial for the consistency of the least-squares estimator. A typical source of violation of this assumption is a misspecification of the model, for example when functions of predictors (squared terms, interaction terms) are omitted. For continuous dependent variables we might use an appropriate nonlinear specification of \(x_i\) (logarithm, interaction, etc.). For discrete (binary) dependent variables we can extend multiple regression methods by using a probit or logit model, as sketched below.
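
For illustration only (the binarized outcome is my own construction and not part of the models used in this tutorial), such a logit specification could look like this:

# logit model for a binary outcome: is a suburb's medv above the overall median?
logit.model <- glm(as.numeric(medv > median(medv)) ~ rm + age + lstat,
                   family = binomial(link = "logit"), data = Boston)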

RESET

The REgression Specification Error Test checks that no nonlinear function of the independent variables is significant when added to the original model. Hence it is only a test of whether the model is linear in the original variables; it has no power to detect omitted variables. To run the test we compute powers of the model's fitted values and include them, along with the original predictors, in the regression model. There are different opinions on which powers to use. For example, Stata's ovtest command always uses four powers of \(\hat{y}\) (not without problems, as discussed here). In many applications squared and cubed terms have proven useful.

yhat_p2 <- fitted.values(linear.model)^2
yhat_p3 <- fitted.values(linear.model)^3

reset_model <- lm(medv ~ crim + rm + tax + lstat + yhat_p2 + yhat_p3, data = Boston)

The test's null hypothesis is that the coefficients on the powers of the fitted values of medv are jointly equal to zero.

linearHypothesis(reset_model, c("yhat_p2=0","yhat_p3=0"))
## Linear hypothesis test
## 
## Hypothesis:
## yhat_p2 = 0
## yhat_p3 = 0
## 
## Model 1: restricted model
## Model 2: medv ~ crim + rm + tax + lstat + yhat_p2 + yhat_p3
## 
##   Res.Df        RSS Df  Sum of Sq      F    Pr(>F)    
## 1    501 1.4925e+10                                   
## 2    499 9.5864e+09  2 5338419318 138.94 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

An insignificant p-value implies that the specification of the model is fine according to the RESET diagnostic and, in the other case, that there is some problem with the functional form of the model. Note that even when your model passes the test, this does not necessarily mean that you’ve found the best model.

To make things easier there is also a function resettest() in package lmtest that implements the RESET.

lmtest::resettest(linear.model, powers = 2:3, type = "fitted")
## 
##  RESET test
## 
## data:  linear.model
## RESET = 132.18, df1 = 2, df2 = 500, p-value < 2.2e-16

Note that if the functional form is properly specified, RESET has no power to detect heteroskedasticity.

Rainbow Test

The idea behind this test is that even a misspecified model might be able to fit the data reasonably well in the center of the sample but might lack fit in the tails. The rainbow test fits a model to a subsample (typically the middle 50%) and compares it with the model fitted to the full sample using an F-test. To determine the middle, the sample has to be ordered; e.g., by a regressor \(x_i\).

lmtest::raintest(linear.model, order.by = ~ lstat, data = Boston)
## 
##  Rainbow test
## 
## data:  linear.model
## Rain = 2.0917, df1 = 253, df2 = 249, p-value = 3.97e-09

The null hypothesis is rejected whenever the overall fit is significantly worse than the fit for the subsample. Above I've ordered the Boston data by lstat, so the test assesses whether the model fit for the middle 50% of suburbs (in terms of the share of lower-status population) differs from the fit using all suburbs. This appears to be the case at the chosen significance level.

Harvey-Collier test

The Harvey-Collier test performs a t-test (with \(n-k-1\) degrees of freedom, where \(k\) is the number of estimated coefficients) on the recursive residuals. If the true relationship is not linear but convex or concave, the mean of the recursive residuals should differ significantly from 0.

lmtest::harvtest(linear.model, order.by = ~ lstat, data = Boston)
## 
##  Harvey-Collier test
## 
## data:  linear.model
## HC = 7.3517, df = 501, p-value = 8.025e-13

J Test

The J Test for Comparing Non-Nested Models can be used for deciding whether an independent variable should appear in level or log form or to test a functional form misspecification. If the first model contains the correct set of regressors, then including the fitted values of the second model into the set of regressors should provide no significant improvement. But if it does, it can be concluded that the original model does not contain the correct set of regressors. Of course, the dependent variable must remain unchanged for this test.

linear.model2 <- lm(medv ~ log(crim) + log(rm) + log(tax) + log(lstat), data = Boston)
lmtest::jtest(linear.model, linear.model2)
## J test
## 
## Model 1: medv ~ rm + age + lstat
## Model 2: medv ~ log(crim) + log(rm) + log(tax) + log(lstat)
##                 Estimate Std. Error t value Pr(>|t|)    
## M1 + fitted(M2)   1.1816   0.099128 11.9197   <2e-16 ***
## M2 + fitted(M1)  -0.1375   0.153649 -0.8949   0.3713    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The problem with this test is that it may reject both models or fail to reject either of them. If neither model is rejected, you may base your decision on the adjusted \(R^2\) to choose between them. Here the inclusion of the fitted values of the second model provides a significant improvement to the first model, but not the other way around.

Proxy variables

Sometimes we're able to obtain a proxy variable for an omitted variable in our analysis. A proxy variable is related to the unobserved variable that we would like to control for. For example, researchers in psychology might use IQ as a proxy for ability. For this to work, IQ needs to be correlated with ability! In reality a respondent's education and experience will also affect her ability (and perhaps quite differently from what her IQ measures).

In time series settings you could consider using a lagged dependent variable to account for historical factors that cause current differences in the dependent variable.

Overlooked variables

These variables have an influence on \(y\) and are correlated with at least one other predictor \(x_i\). Although a plot of the model's residuals against a variable that was not included in the model (but is still available in the dataset) might indicate the need to add it (see the sketch below), this is above all a theoretical problem. You should therefore not add variables to the model at random, as issues like multicollinearity might arise.
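
As an illustration (the choice of dis, the weighted distance to five Boston employment centres, is mine), such a residual plot against an excluded variable might look like this:

augment(linear.model) %>%
  mutate(dis = Boston$dis) %>%
  ggplot(data = ., aes(x = dis, y = .resid)) +
  geom_point() +
  geom_smooth(method = "loess", se = FALSE, color = "blue") +
  geom_hline(yintercept = 0, color = "red") +
  labs(x = "Weighted distance to employment centres (dis)", y = "Residuals")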

We also refer to this problem as omitted variable bias, which arises in multiple regression if an omitted variable is both a determinant of \(y\) and correlated with at least one included regressor \(x_i\). If the omitted variables can be measured, you should include them in your model. More often than not this will not be possible, so you might consider using panel data (fixed effects regression) or instrumental variables regression.


References

Belsley, D. A., E. Kuh, R. E. Welsch, and R. O. Wells. 1980. Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. Wiley Series in Probability and Statistics - Applied Probability and Statistics Section Series. Wiley. https://onlinelibrary.wiley.com/doi/book/10.1002/0471725153.
Davidson, Russell, and James G. MacKinnon. 1981. “Several Tests for Model Specification in the Presence of Alternative Hypotheses.” Econometrica 49 (3): 781–93. http://www.jstor.org/stable/1911522.
Harvey, Andrew, and Patrick Collier. 1977. “Testing for Functional Misspecification in Regression Analysis.” Journal of Econometrics 6 (1): 103–19. https://EconPapers.repec.org/RePEc:eee:econom:v:6:y:1977:i:1:p:103-119.
Hoaglin, David C., Frederick Mosteller, and John W. Tukey (Editor). 2000. Understanding Robust and Exploratory Data Analysis. 1st ed. Wiley-Interscience.
Mallows, C. L. 1986. “Augmented Partial Residuals.” Technometrics 28 (4): 313–19. http://www.jstor.org/stable/1268980.
Ramsey, J. B. 1969. “Tests for Specification Errors in Classical Linear Least-Squares Regression Analysis.” Journal of the Royal Statistical Society. Series B (Methodological) 31 (2): 350–71. http://www.jstor.org/stable/2984219.
Utts, Jessica M. 1982. “The Rainbow Test for Lack of Fit in Regression.” Communications in Statistics - Theory and Methods 11 (24): 2801–15. https://doi.org/10.1080/03610928208828423.