Linear regression is a fundamental statistical procedure for data analysis. It comprises two parts:
- the model, which assumes a linear relationship between the specified variables, and
- the method used to estimate the model's parameters.
By linear regression, we usually mean estimation by ordinary least squares (OLS).
The following assumptions define the foundation on which the linear regression model is built. If they are met, the OLS estimator is said to be BLUE (best linear unbiased estimator). The framework refers to cross-sectional data generated by taking a random sample from a population; with empirical data, however, all assumptions are rarely met perfectly.
The main interest in running a linear regression is to estimate the influence of one or more variables \(x_i\) on \(y\). The \(x_i\) are regarded as random variables. Below are some of the most common terms for \(x\) and \(y\).
\(y\) | \(x_i\) |
---|---|
dependent variable | independent variables |
explained variable | explanatory variables |
endogenous variable | exogenous variables |
outcome variable | control variables |
predicted variable | predictors |
regressand | regressors |
Assumption 1: Linear model
The population model can be written as \( y = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k + u \), where \(\beta_0\) is a constant, \(\beta_1, \dots, \beta_k\) are the unknown parameters of interest and \(u\) is an unobserved random variable called the error term. The parameters are constants, also called regression coefficients. The coefficient \(\beta_j\) measures the average effect of a unit change of \(x_j\) on \(y\) while holding everything else constant (ceteris paribus).
The error term represents unobserved factors that influence \(y\) but are not captured by the explanatory variables. If we omit relevant regressors, i.e. use fewer explanatory variables than needed (underspecification), we speak of omitted variables. These variables become part of the error term because they have the same effect as unobserved factors. Omitted variables can bias the estimation of the model.
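As a hedged illustration with simulated data (my own addition, not part of the tutorial's mtcars analysis), the sketch below shows how omitting a relevant regressor that is correlated with an included one biases the estimated coefficient:
set.seed(42)
n  <- 1000
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)            # x2 is correlated with x1
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n) # true model
coef(lm(y ~ x1 + x2))                # both estimates close to the true values 2 and 3
coef(lm(y ~ x1))                     # x2 omitted: the estimate for x1 is biased (towards 2 + 3 * 0.5 = 3.5)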
Assumption 2: Random sample
There is no relationship or dependency between the units \(i\). This assumption may be violated when data are clustered (e.g. observations from the same region are correlated). It also implies no autocorrelation of the error terms.
Assumption 3: No perfect collinearity
The regressors in the population and in the sample are not perfectly collinear, i.e. no regressor is an exact linear combination of the others (they are linearly independent).
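To see what a violation looks like in practice (a small illustration I added, using the `mtcars` data introduced below): a regressor that is an exact linear combination of another one cannot be estimated, and `lm()` reports `NA` for its coefficient.
coef(lm(mpg ~ hp + I(2 * hp), data = mtcars))  # I(2 * hp) is perfectly collinear with hp, so its coefficient is NA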
Assumption 4: Exogeneity of the explanatory variables
The error term has a mean of zero and no systematic relationship with the explanatory variables.
Under assumptions 1, 2, 3 and 4 the OLS estimator is consistent.
Assumption 5: Homoskedasticity
The variance of the error term is constant. Should this assumption be violated, a common remedy is to use heteroskedasticity-robust standard errors.
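As a sketch of such a remedy (the `sandwich` and `lmtest` packages are not part of this tutorial's package list, so treat this as an optional illustration), robust standard errors could be computed like this:
library(sandwich)
library(lmtest)
fit <- lm(mpg ~ hp, data = mtcars)                # any fitted linear model
coeftest(fit, vcov. = vcovHC(fit, type = "HC1"))  # coefficients with heteroskedasticity-robust standard errors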
Assumption 6: Normal distribution of the error term
The error term is normally distributed. This assumption is often violated in reality.
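A quick way to inspect this assumption after fitting a model is a Q-Q plot and a Shapiro-Wilk test on the residuals (a small illustration of my own; the model below mirrors the `lm_simple` model estimated later in this article):
fit <- lm(mpg ~ hp, data = mtcars)
qqnorm(residuals(fit)); qqline(residuals(fit))  # points should lie close to the line
shapiro.test(residuals(fit))                    # H0: the residuals are normally distributed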
In linear regression the concept of correlation plays an important role, so let’s have a look at the two most common measures of dependence.
With the Pearson correlation coefficient we can measure the linear dependence between two variables \(x_1\) and \(y\). It is a parametric measure, intended for data that follow a normal distribution.
cor(mtcars$mpg, mtcars$hp, method = "pearson", use = "complete.obs")
## [1] -0.7761684
From the test below we can conclude that `hp` and `mpg` are significantly correlated, with a correlation coefficient of -0.776.
cor.test(mtcars$mpg, mtcars$hp, method = "pearson", use = "complete.obs")
##
## Pearson's product-moment correlation
##
## data: mtcars$mpg and mtcars$hp
## t = -6.7424, df = 30, p-value = 1.788e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.8852686 -0.5860994
## sample estimates:
## cor
## -0.7761684
The Spearman rank correlation coefficient is a nonparametric measure of association. While Pearson’s correlation coefficient assesses relationships that are linear, this measure assesses monotonic relationships. It can be used for both continuous and discrete ordinal variables and the data do not need to follow a normal distribution.
cor(mtcars$cyl, mtcars$gear, method = "spearman", use = "complete.obs")
## [1] -0.5643105
From the test below we can conclude that `cyl` and `gear` are significantly correlated, with a correlation coefficient of -0.564.
cor.test(mtcars$cyl, mtcars$gear, method = "spearman", use = "complete.obs")
##
## Spearman's rank correlation rho
##
## data: mtcars$cyl and mtcars$gear
## S = 8534.9, p-value = 0.0007678
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.5643105
In this article, I will not consider any modifications of the functional form of the linear model. Hence, every section refers to a so-called level-level model, which measures the average effect of a unit change of a regressor \(x_i\) on the dependent variable \(y\).
The following R packages are required for this tutorial.
library(dplyr)
library(ggplot2)
library(tidyr)
library(broom)
library(purrr)
library(car)
library(scatterplot3d)
library(QuantPsyc)
library(e1071)
All practical examples are built around the `mtcars` dataset. The data was extracted from the 1974 Motor Trend US magazine and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).
head(datasets::mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
The simple linear regression model has only one explanatory variable. I am regressing miles/(US) gallon (`mpg`) on gross horsepower (`hp`) by specifying both variables within the argument `formula =` of the function `lm()`.
lm_simple <- lm(formula = mpg ~ hp, data = mtcars)
The regression results are saved in the object `lm_simple` and can be accessed with different functions. The function `summary()` displays the output of the regression model in the R console.
summary(lm_simple)
##
## Call:
## lm(formula = mpg ~ hp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.7121 -2.1122 -0.8854 1.5819 8.2360
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.09886 1.63392 18.421 < 2e-16 ***
## hp -0.06823 0.01012 -6.742 1.79e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.863 on 30 degrees of freedom
## Multiple R-squared: 0.6024, Adjusted R-squared: 0.5892
## F-statistic: 45.46 on 1 and 30 DF, p-value: 1.788e-07
The first two lines of the output simply display the model specified in the formula of the function `lm()`.
The next three lines show the distribution of the residuals. A residual is the difference between the observed value of `mpg` in the dataset and the value of `mpg` predicted by the linear model. The smaller these differences are, the better the model fits the data. Also, a symmetrical distribution of the residuals around zero reflects a good fit. You might have a look at the residuals with a histogram.
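For example, with `ggplot2` (a small sketch of my own, using the `lm_simple` model from above):
ggplot(data.frame(resid = residuals(lm_simple)), aes(x = resid)) +
  geom_histogram(bins = 10, fill = "grey70", color = "black") +  # histogram of the residuals
  labs(x = "Residuals", y = "Count")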
The next block of lines shows the coefficient estimates. Beginning with the constant (`(Intercept)`) in row one, each additional row refers to one coefficient. The first column `Estimate` shows the corresponding values of the coefficients. The coefficient of `(Intercept)` shows the predicted `mpg` for cars with an `hp` of zero. The coefficient of `hp` shows the average effect on `mpg` when `hp` is increased by one unit. The second column `Std. Error` measures the variation of the coefficient estimate. The smaller the standard error relative to the coefficient estimate, the more reliable the estimation. Standard errors can be used to calculate confidence intervals and hypothesis tests. The third column `t value` measures how many standard errors the coefficient is away from zero. In order to reject the null hypothesis in favour of the alternative that a relationship between x and y exists (H0: \(\beta_j = 0\) vs. H1: \(\beta_j \neq 0\)), the t value should be far away from zero. The t values are then used to calculate the p values. The fourth column `Pr(>|t|)` reflects the probability of obtaining a t value at least as extreme as the one calculated if the null hypothesis were true. A small p value suggests that it is unlikely to observe such a relationship between the independent and dependent variable purely by chance. Typically, a threshold of 5% is used. The stars indicate the significance level of the p value; the corresponding legend can be found in the first row below the coefficients.
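As a brief sketch (my own addition), a 95% confidence interval for the `hp` coefficient can be built from the estimate and its standard error and cross-checked against the built-in `confint()`:
est <- summary(lm_simple)$coefficients["hp", "Estimate"]
se  <- summary(lm_simple)$coefficients["hp", "Std. Error"]
est + c(-1, 1) * qt(0.975, df = lm_simple$df.residual) * se  # manual 95% confidence interval
confint(lm_simple, "hp", level = 0.95)                       # built-in equivalent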
The RSE measures the goodness of fit of the regression model. It is calculated by taking the square root of the mean squared error (MSE), which is the residual sum of squares (RSS) divided by the degrees of freedom (n - k). In theory, every linear model has an error term, so the dependent variable can never be perfectly predicted by the independent variable(s). The RSE measures the average deviation of the observed \(y\) from the fitted regression line, i.e. the typical size of the error we make when we use the regression model to predict \(y\). The degrees of freedom show how many data points (observations) were used in the estimation after the parameters (`(Intercept)` and `hp`) have been taken into account.
rss <- sum(lm_simple$residuals^2)   # RSS (residual sum of squares)
mse <- rss / lm_simple$df.residual  # MSE (mean squared error)
sqrt(mse)                           # RSE (residual standard error)
## [1] 3.862962
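The same value is also stored in the summary object, which provides a quick cross-check (my own addition):
summary(lm_simple)$sigma  # residual standard error reported by summary()
## [1] 3.862962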
The R² is a measure of how well the model fits the actual data and is standardized between zero and one. A model with an R² close to zero performs poorly in explaining the observed variance in y. An R² close to one, however, shows that the model explains the observed variance in y very well. This model has an R² of 0.60, which means that 60% of the observed variance in `mpg` can be explained by the regressor `hp`. In practice, it is quite difficult to determine which R² is sufficient to consider a model relevant for explaining the relationship between y and x. It will always depend on the given data and field of application. Note that the R² increases with the number of variables added to the model, hence you should then consult the adjusted version as it imposes a penalty on the number of regressors.
rss <- sum(lm_simple$residuals ^ 2)                                # RSS (residual sum of squares)
tss <- sum((lm_simple$model$mpg - mean(lm_simple$model$mpg)) ^ 2)  # TSS (total sum of squares)
# R²
1 - rss / tss
## [1] 0.6024373
# Adjusted R² = 1 - (1 - R²) * (n - 1) / (n - k - 1), with n = 32 and k = 1
1 - (rss / tss) * (dim(mtcars)[1] - 1) / (dim(mtcars)[1] - 1 - 1)
## [1] 0.5891853
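Both measures are also stored in the summary object (again, a cross-check I added):
summary(lm_simple)$r.squared      # R² reported by summary()
## [1] 0.6024373
summary(lm_simple)$adj.r.squared  # adjusted R² reported by summary()
## [1] 0.5891853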
This is a measure of whether there is a relationship between \(y\) and one or more \(x_i\). It is called an overall F-test of the null hypothesis H0: \(\beta_1 = \beta_2 = \dots = \beta_k = 0\) vs. H1: at least one \(\beta_j \neq 0\) (the `(Intercept)` is not tested). The farther the F value is away from one, the better. How large the F value should be depends on the number of observations as well as the number of independent variables. With many observations, an F value a little larger than one will suffice to reject the null hypothesis. Conversely, with few observations, a larger F value will be required in order to demonstrate a potential relationship between \(y\) and \(x_i\).
# TSS (total sum of squares)
tss <- sum((lm_simple$model$mpg - mean(lm_simple$model$mpg)) ^ 2)
# RSS (residual sum of squares)
rss <- sum(lm_simple$residuals ^ 2)
# number of predictors whose coefficients are tested (excluding the intercept)
p <- 1
# observations
n <- dim(mtcars)[1]
# F statistic
((tss - rss) / p) / (rss / (n-p-1))
## [1] 45.4598
# p value
1 - pf(((tss - rss) / p) / (rss / (n - p - 1)),
df1 = 1,
df2 = 30)
## [1] 1.787835e-07
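The F statistic and its degrees of freedom are also stored in the summary object, which can serve as a cross-check (my own addition):
summary(lm_simple)$fstatistic  # named vector: F value (45.46), numerator df (1), denominator df (30)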
With a scatterplot we can visualize the relationship between the independent and the dependent variable. Before doing so, I am creating a new dataset (`lmpredict.df`) with the observed values of `mpg` and `hp` as well as the values of `mpg` predicted by our model and the corresponding standard errors.
pred <- predict(lm_simple, newdata = mtcars, se.fit = TRUE)
lmpredict.df <- tibble(
  hp = mtcars$hp,
  mpg = mtcars$mpg,
  .fitted = pred$fit,
  .se = pred$se.fit
)
Now we create the scatterplot of `mpg` against `hp`. The fitted regression line is shown in blue and the two red dashed lines depict the lower and upper bounds of its confidence interval.
ggplot(data = lmpredict.df, aes(x = hp, y = mpg)) +
geom_point() +
geom_line(data = lmpredict.df, aes(y = .fitted),
color = "blue", size = 1) +
geom_line(data = lmpredict.df, aes(y = .fitted + 2 * .se),
color = "red", size = 0.5, linetype = "dashed") +
geom_line(data = lmpredict.df, aes(y = .fitted - 2 * .se),
color = "red", size = 0.5, linetype = "dashed") +
scale_x_continuous(expand = expansion(mult = c(0, 0))) +
labs(x = "Gross horsepower", y = "Miles/(US) gallon")
The multiple linear regression model has at least two explanatory variables. I am regressing `mpg` on `hp` and displacement (cu. in., `disp`) by specifying the variables within the argument `formula =` of the function `lm()`.
lm_multiple <- lm(formula = mpg ~ hp + disp, data = mtcars)
The regression results are now saved in the object `lm_multiple` and displayed with the function `summary()` in the R console.
summary(lm_multiple)
##
## Call:
## lm(formula = mpg ~ hp + disp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.7945 -2.3036 -0.8246 1.8582 6.9363
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.735904 1.331566 23.083 < 2e-16 ***
## hp -0.024840 0.013385 -1.856 0.073679 .
## disp -0.030346 0.007405 -4.098 0.000306 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.127 on 29 degrees of freedom
## Multiple R-squared: 0.7482, Adjusted R-squared: 0.7309
## F-statistic: 43.09 on 2 and 29 DF, p-value: 2.062e-09
The explanations of the model call and of the residuals from the simple regression above also apply here.
The explanation of the coefficient block above also applies here, except that we now find an additional row for the newly added explanatory variable `disp`.
The explanation of the residual standard error above also applies here, except that for the calculation of the degrees of freedom the parameters `(Intercept)`, `hp` and `disp` now have to be taken into account.
# RSS (residual sum of squares)
sum(lm_multiple$residuals^2)
## [1] 283.4934
# MSE (mean squared error)
sum(lm_multiple$residuals^2)/lm_multiple$df.residual
## [1] 9.775636
# RSE (residual standard error)
sqrt(sum(lm_multiple$residuals^2)/lm_multiple$df.residual)
## [1] 3.126601
The explanation of the R² above also applies here, except that we should definitely consult the adjusted R² now. This model has an adjusted R² of 0.73, which means that 73% of the observed variance in `mpg` can be explained by the regressors `hp` and `disp`.
# RSS (residual sum of squares)
rss <- sum(lm_multiple$residuals ^ 2)
# TSS (total sum of squares)
tss <- sum((lm_multiple$model$mpg - mean(lm_multiple$model$mpg)) ^ 2)
# R²
1 - rss / tss
## [1] 0.7482402
# Adjusted R² = 1 - (1 - R²) * (n - 1) / (n - k - 1), with n = 32 and k = 2
1 - (rss / tss) * (dim(mtcars)[1] - 1) / (dim(mtcars)[1] - 2 - 1)
## [1] 0.7308774
The explanation of the F-statistic above also applies here.
# TSS (total sum of squares)
tss <- sum((lm_multiple$model$mpg - mean(lm_multiple$model$mpg)) ^ 2)
# RSS (residual sum of squares)
rss <- sum(lm_multiple$residuals ^ 2)
# number of predictors whose coefficients are tested (excluding the intercept)
p <- 2
# observations
n <- dim(mtcars)[1]
# F statistic
((tss - rss) / p) / (rss / (n-p-1))
## [1] 43.09458
# p value
1 - pf(((tss - rss) / p) / (rss / (n - p - 1)),
df1 = 2,
df2 = 29)
## [1] 2.062068e-09
With a 3D scatterplot we can visualize the relationship between the two independent variables and the dependent variable. I am using the function `scatterplot3d()` from the R package `scatterplot3d` here and add a regression plane based on the specified multiple linear regression model.
s3d <- scatterplot3d(x = mtcars[, 4], y = mtcars[, 3], z = mtcars[, 1],
                     xlab = "Gross horsepower", ylab = "Displacement (cu.in.)",
                     zlab = "Miles/(US) gallon", grid = TRUE, box = FALSE, angle = 20)
s3d$plane3d(lm_multiple, lty.box = "solid", col = "blue")
We can see that both gross horsepower and displacement have a negative relationship with miles per gallon.
In the regression model below, the coefficient of `wt` is much larger (in absolute value) than the coefficients of `hp` and `disp`. This difference in magnitude tempts us to assume that `wt` has a larger impact on `mpg` than the other two regressors. However, remember that a regression coefficient measures the average change of \(y\) for a unit change of the respective \(x_i\). In the case of `wt` the unit is 1,000 lbs, while the unit of `hp` is horsepower (roughly equivalent to 745.7 watts) and the unit of `disp` is cubic inches.
lm_multiple2 <- lm(formula = mpg ~ hp + disp + wt, data = mtcars)
summary(lm_multiple2)
##
## Call:
## lm(formula = mpg ~ hp + disp + wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.891 -1.640 -0.172 1.061 5.861
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.105505 2.110815 17.579 < 2e-16 ***
## hp -0.031157 0.011436 -2.724 0.01097 *
## disp -0.000937 0.010350 -0.091 0.92851
## wt -3.800891 1.066191 -3.565 0.00133 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.639 on 28 degrees of freedom
## Multiple R-squared: 0.8268, Adjusted R-squared: 0.8083
## F-statistic: 44.57 on 3 and 28 DF, p-value: 8.65e-11
In order to compare the impact of different variables with different units we can standardize the regression coefficients with the formula below:
\[ \begin{aligned} b_i = \hat{\beta}_i\frac{s_{x_i}}{s_{y}} \end{aligned} \]
The beta coefficient \(b_i\) is computed by multiplying the estimated regression coefficient \(\hat{\beta}_i\) by the ratio of the standard deviation of \(x_i\) to the standard deviation of \(y\).
In R we can either use the function `scale()` on all variables specified in our model…
lm_multiple2_scaled <- lm(formula = scale(mpg) ~ scale(hp) + scale(disp) + scale(wt), data = mtcars)
coef(lm_multiple2_scaled)[2:4]
## scale(hp) scale(disp) scale(wt)
## -0.35443851 -0.01926874 -0.61706350
… or use the function `lm.beta()` from the R package `QuantPsyc`.
QuantPsyc::lm.beta(lm_multiple2)
## hp disp wt
## -0.35443851 -0.01926874 -0.61706350
No matter which approach you choose, the interpretation of the beta coefficients is identical. Raising, for example, the weight (`wt`) by one standard deviation leads to a reduction of `mpg` by 0.62 standard deviations. In contrast, raising the horsepower (`hp`) by one standard deviation leads to a reduction of `mpg` by only 0.35 standard deviations.
As measured by the beta coefficients, `wt` has a stronger influence on `mpg` than `hp`.
The function `lm.beta()` does not produce correct standardized coefficients when interaction terms are present in the model. Also, it is not advisable to use standardized regression coefficients to compare regression models across different datasets, as the variances of the variables can differ.
Using beta coefficients with binary (dummy) variables is also problematic, as their magnitude decreases when the skewness of the variables increases. Below I included code to demonstrate this issue. Please note that this example requires good knowledge of data structures as well as functions of the `tidyverse` packages.
The for-loop generates a dataset with roughly 1,000 regression models in which \(x_1\) and \(x_2\) are used to predict \(y\). Both predictors are dummy variables which only take on values of either zero or one. While the fraction of zero (or one) values is constant in \(x_1\), it increases in \(x_2\) for each additional model.
dataset_list <- list()
j <- 0
for (i in seq(from = 0, to = 1, by = 1/1000)){
# Counter
j <- j + 1
# Computation
set.seed(123)
error <- rnorm(1000)
bin_cons = sample(rep(c(0,1),each = 500))
bin_var = sample(c(0,1), size = 1000, replace = TRUE, prob = c(1-i,i))
dep_var = bin_cons + bin_var + error
# Output
dataset_list[[j]] <-
tibble(
j,
data = list(tibble(j, dep_var, bin_cons, bin_var)),
frac_binary = mean(bin_var),
beta_cons = tryCatch(
coef(lm(scale(dep_var) ~ scale(bin_cons) + scale(bin_var)))[2],
error=function(e) NA),
beta_var = tryCatch(
coef(lm(scale(dep_var) ~ scale(bin_cons) + scale(bin_var)))[3],
error=function(e) NA)
)
}
output <- tibble(dataset_list)
Let’s have a look at what the output of model 123 looks like. The column `data` is a so-called list-column and comprises 4 additional columns (`j`, `dep_var`, `bin_cons` and `bin_var`). You can access this column with the function `map_dfr()`. The last two columns are the beta coefficients of \(x_1\) and \(x_2\) for this model. The share of ones in variable `bin_var` is 0.125 (`frac_binary`), i.e. roughly one in eight observations is a one.
output[[1]][[123]]
## # A tibble: 1 x 5
## j data frac_binary beta_cons beta_var
## <dbl> <list> <dbl> <dbl> <dbl>
## 1 123 <tibble [1,000 x 4]> 0.125 0.447 0.262
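The nested data of a single model can also be pulled out with `map_dfr()` and dplyr verbs, for example (a small sketch of mine):
map_dfr(dataset_list, "data") %>%  # row-bind the nested tibbles of all models
  filter(j == 123) %>%             # keep only the data of model 123
  head(3)                          # show the first three rows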
Below you can see how the skewness of variable `bin_var` changes with each model.
map_dfr(dataset_list,"data") %>%
pivot_longer(cols= c(bin_cons, bin_var)) %>%
group_by(j, name) %>%
summarise(skewness = e1071::skewness(value)) %>%
ggplot(aes(x = j, y = skewness, color = name)) +
geom_line() +
scale_color_manual(values = c("red","blue"),
labels = c(expression(Dummy~x[1]),
expression(Dummy~x[2]))) +
labs(x = "Model", y = "Skewness", color = "")
And finally, a graphical representation of the described problem: the beta coefficient of a dummy variable decreases as its skewness increases.
tibble(
map_dfr(dataset_list,"data") %>%
group_by(j) %>%
summarise(skewness_bin_var = e1071::skewness(bin_var))
) %>%
inner_join(
tibble(
map_dfr(dataset_list,"data") %>% distinct(j),
bin_cons = map_dbl(dataset_list, "beta_cons"),
bin_var = map_dbl(dataset_list, "beta_var")
),
by = c("j")
) %>%
pivot_longer(cols=c(bin_cons, bin_var)) %>%
ggplot(aes(x = skewness_bin_var, y = value, color = name)) +
geom_line() +
scale_color_manual(values = c("red","blue"),
labels = c(expression(x[1]~Skewness~constant),
expression(x[2]~Skewness~varies))) +
labs(title = expression(Model:~y==x[1]~+~x[2]~+~epsilon),
x = expression(Skewness~of~X[2]),
y = "Beta coefficient", color = "")
This section considers the t-test and the F-test in more detail, with special regard to their applications in the linear regression model. Let us consider another multiple linear regression model in which we add the variable `qsec` (1/4 mile time).
lm_multiple2 <- lm(formula = mpg ~ hp + disp + qsec, data = mtcars)
summary(lm_multiple2)
##
## Call:
## lm(formula = mpg ~ hp + disp + qsec, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5029 -2.6417 -0.6002 1.9749 7.2263
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.622206 9.668300 3.995 0.000426 ***
## hp -0.034642 0.017967 -1.928 0.064041 .
## disp -0.028468 0.007787 -3.656 0.001049 **
## qsec -0.385561 0.468127 -0.824 0.417114
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.144 on 28 degrees of freedom
## Multiple R-squared: 0.7542, Adjusted R-squared: 0.7279
## F-statistic: 28.64 on 3 and 28 DF, p-value: 1.118e-08
At the 5% significance level the critical t-value is approximately 2. When the absolute value of t is much larger than 2, we can already tell that the null hypothesis will be rejected; when it is much smaller, it will not be.
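The exact critical value for this model's residual degrees of freedom can be computed with `qt()` (a quick check of my own):
qt(0.975, df = lm_multiple2$df.residual)  # two-sided 5% critical value, roughly 2.05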
A coefficient’s t-value is calculated by dividing its estimate by the corresponding standard error.
# hp
hp_tval <- summary(lm_multiple2)$coefficients[2, 1] / summary(lm_multiple2)$coefficients[2, 2]
hp_pval <- 2 * pt(-abs(hp_tval), df = lm_multiple2$df.residual)      # two-sided p value
# disp
disp_tval <- summary(lm_multiple2)$coefficients[3, 1] / summary(lm_multiple2)$coefficients[3, 2]
disp_pval <- 2 * pt(-abs(disp_tval), df = lm_multiple2$df.residual)  # two-sided p value
# qsec
qsec_tval <- summary(lm_multiple2)$coefficients[4, 1] / summary(lm_multiple2)$coefficients[4, 2]
qsec_pval <- 2 * pt(-abs(qsec_tval), df = lm_multiple2$df.residual)  # two-sided p value
# Output
tibble(variable = c("hp", "disp", "qsec"),
t_value = c(hp_tval, disp_tval, qsec_tval),
p_value = c(hp_pval, disp_pval, qsec_pval))
## # A tibble: 3 x 3
## variable t_value p_value
## <chr> <dbl> <dbl>
## 1 hp -1.93 0.0640
## 2 disp -3.66 0.00105
## 3 qsec -0.824 0.417
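Since the `broom` package is already loaded, the same quantities can also be obtained in tidy form; the `statistic` and `p.value` columns should match the manual calculation above (a cross-check I added):
broom::tidy(lm_multiple2)  # term, estimate, std.error, statistic (= t value), p.value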
The F-test is more versatile than the t-test. We can test relationships between regression coefficients, e.g. whether one variable has the same effect on y as another variable, or an effect twice as large. Furthermore, we can carry out exclusion tests, i.e. test whether a group of variables can be excluded from the model.
For example, we may want to check whether not just a single coefficient but all of them are equal to zero. Hence, we estimate the model twice: once with and once without the regressors. First, we compute the restricted model containing only a constant. Then we compare this model to the regression model specified above with the function `anova()`. We obtain the same value for the F-statistic as in the summary of the regression output above.
lm_constant <- lm(mpg ~ 1, data = mtcars)
anova(lm_constant, lm_multiple2)
## Analysis of Variance Table
##
## Model 1: mpg ~ 1
## Model 2: mpg ~ hp + disp + qsec
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 31 1126.05
## 2 28 276.79 3 849.26 28.637 1.118e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
With an exclusion restriction, the joint significance of a group of variables is tested. That is not the same as testing each t-statistic separately. In practice, it is possible that variables are significant as a group even though no variable is significant individually (both assessed at the same level of significance).
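As a hedged illustration with simulated data (not part of the original mtcars analysis), the sketch below constructs two almost perfectly collinear regressors; their individual t-tests will typically be insignificant, while the F-test of their joint significance is highly significant:
set.seed(1)
n  <- 50
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)  # x2 is almost identical to x1
y  <- 1 + x1 + x2 + rnorm(n)
summary(lm(y ~ x1 + x2))        # compare the individual p values with the overall F-test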
Below, I am testing whether an additional unit of `hp` has, on average, the same effect on `mpg` as four additional units of `disp`. Formally, we can write the null hypothesis as \(\beta_{hp} = 4\beta_{disp}\). I am using the function `linearHypothesis()` from the R package `car` to test it.
linearHypothesis(lm_multiple2, c("hp=4*disp"))
## Linear hypothesis test
##
## Hypothesis:
## hp - 4 disp = 0
##
## Model 1: restricted model
## Model 2: mpg ~ hp + disp + qsec
##
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 29 305.73
## 2 28 276.79 1 28.939 2.9275 0.09814 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We find that the F statistic has a rather small value, and there is only weak evidence at the 10% level that an additional unit of `hp` has a different effect on `mpg` than four units of `disp`.
Let us also test whether `hp` and `disp` are jointly significant.
linearHypothesis(lm_multiple2, c("hp=0", "disp=0"))
## Linear hypothesis test
##
## Hypothesis:
## hp = 0
## disp = 0
##
## Model 1: restricted model
## Model 2: mpg ~ hp + disp + qsec
##
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 928.66
## 2 28 276.79 2 651.87 32.972 4.366e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Finally, we can also construct an overall F-test by testing the joint significance of all regressors. The result is identical to estimating a full and a restricted model and using the function `anova()`.
linearHypothesis(lm_multiple2, c("hp=0", "disp=0", "qsec=0"))
## Linear hypothesis test
##
## Hypothesis:
## hp = 0
## disp = 0
## qsec = 0
##
## Model 1: restricted model
## Model 2: mpg ~ hp + disp + qsec
##
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 31 1126.05
## 2 28 276.79 3 849.26 28.637 1.118e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1