Linear regression is a fundamental statistical procedure for data analysis. It comprises two parts:
- the model, which assumes a linear relationship between the specified variables, and
- the method used to estimate the model's parameters.
By linear regression, we usually mean estimation by ordinary least squares (OLS).
The following assumptions define the foundation on which the linear regression model is built. If they are met, the OLS estimator is said to be BLUE (best linear unbiased estimator). The framework refers to cross-sectional data generated by taking a random sample from a population; with empirical data, however, all assumptions are rarely met perfectly.
The main interest in running a linear regression is to estimate the influence of one or more variables \(x_i\) on \(y\). The \(x_i\) are regarded as random variables. Below are some of the most common terms for \(x\) and \(y\).
\(y\) | \(x_i\) |
---|---|
dependent variable | independent variables |
explained variable | explanatory variables |
endogenous variable | exogenous variables |
outcome variable | control variables |
predicted variable | predictors |
regressand | regressors |
Assumption 1: Linear model
The population model can be written as \( y = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k + u \), where \(\beta_0\) is a constant, \(\beta_1, \dots, \beta_k\) are the unknown parameters of interest and \(u\) is an unobserved random variable called the error term. The parameters are constants, also called regression coefficients. The coefficient \(\beta_j\) measures the average effect of a unit change of \(x_j\) on \(y\) while holding everything else constant (ceteris paribus).
The error term represents unobserved factors that influence \(y\) but are not captured by the explanatory variables. If we omit relevant regressors, i.e. use fewer explanatory variables than needed (underspecification), we speak of omitted variables. These variables become part of the error term because they have the same effect as unobserved factors. Omitted variables can bias the estimation of the model.
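As a hedged illustration with simulated data (my own addition, not part of the tutorial's mtcars analysis), the sketch below shows how omitting a relevant regressor that is correlated with an included one biases the estimated coefficient:
set.seed(42)
n  <- 1000
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)            # x2 is correlated with x1
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n) # true model
coef(lm(y ~ x1 + x2))                # both estimates close to the true values 2 and 3
coef(lm(y ~ x1))                     # x2 omitted: the estimate for x1 is biased (towards 2 + 3 * 0.5 = 3.5)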
Assumption 2: Random sample
There is no relationship or dependency between the units \(i\). This assumption may be violated when data are clustered (e.g. observations from the same region are correlated). It also implies no autocorrelation of the error terms.
Assumption 3: No perfect collinearity
The regressors in the population and in the sample are not perfectly collinear, i.e. no regressor is an exact linear combination of the others (they are linearly independent).
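To see what a violation looks like in practice (a small illustration I added, using the `mtcars` data introduced below): a regressor that is an exact linear combination of another one cannot be estimated, and `lm()` reports `NA` for its coefficient.
coef(lm(mpg ~ hp + I(2 * hp), data = mtcars))  # I(2 * hp) is perfectly collinear with hp, so its coefficient is NA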
Assumption 4: Exogeneity of the explanatory variables
The error term has a mean of zero and no systematic relationship with the explanatory variables.
Under assumptions 1, 2, 3 and 4 the OLS estimator is consistent.
Assumption 5: Homoskedasticity
The variance of the error term is constant. Should this assumption be violated, a common remedy is to use heteroskedasticity-robust standard errors.
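As a sketch of such a remedy (the `sandwich` and `lmtest` packages are not part of this tutorial's package list, so treat this as an optional illustration), robust standard errors could be computed like this:
library(sandwich)
library(lmtest)
fit <- lm(mpg ~ hp, data = mtcars)                # any fitted linear model
coeftest(fit, vcov. = vcovHC(fit, type = "HC1"))  # coefficients with heteroskedasticity-robust standard errors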
Assumption 6: Normal distribution of the error term
The error term is normally distributed. This assumption is often violated in reality.
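A quick way to inspect this assumption after fitting a model is a Q-Q plot and a Shapiro-Wilk test on the residuals (a small illustration of my own; the model below mirrors the `lm_simple` model estimated later in this article):
fit <- lm(mpg ~ hp, data = mtcars)
qqnorm(residuals(fit)); qqline(residuals(fit))  # points should lie close to the line
shapiro.test(residuals(fit))                    # H0: the residuals are normally distributed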
In linear regression the concept of correlation plays an important role, so let’s have a look at the two most common measures of dependence.
With the Pearson correlation coefficient we can measure the linear dependence between two variables \(x_1\) and \(y\). It is a parametric measure, intended for data that follow a normal distribution.
cor(mtcars$mpg, mtcars$hp, method = "pearson", use = "complete.obs")
## [1] -0.7761684
From the test below we can conclude that `hp` and `mpg` are significantly correlated, with a correlation coefficient of -0.776.
cor.test(mtcars$mpg, mtcars$hp, method = "pearson", use = "complete.obs")
##
## Pearson's product-moment correlation
##
## data: mtcars$mpg and mtcars$hp
## t = -6.7424, df = 30, p-value = 1.788e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.8852686 -0.5860994
## sample estimates:
## cor
## -0.7761684
The Spearman rank correlation coefficient is a nonparametric measure of association. While Pearson’s correlation coefficient assesses relationships that are linear, this measure assesses monotonic relationships. It can be used for both continuous and discrete ordinal variables and the data do not need to follow a normal distribution.
cor(mtcars$cyl, mtcars$gear, method = "spearman", use = "complete.obs")
## [1] -0.5643105
From the test below we can conclude that `cyl` and `gear` are significantly correlated, with a correlation coefficient of -0.564.
cor.test(mtcars$cyl, mtcars$gear, method = "spearman", use = "complete.obs")
##
## Spearman's rank correlation rho
##
## data: mtcars$cyl and mtcars$gear
## S = 8534.9, p-value = 0.0007678
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.5643105
In this article, I will not consider any modifications of the functional form of the linear model. Hence, every section refers to a so-called level-level model, which measures the average effect of a unit change of a regressor \(x_i\) on the dependent variable \(y\).
The following R packages are required for this tutorial.
library(dplyr)
library(ggplot2)
library(tidyr)
library(broom)
library(purrr)
library(car)
library(scatterplot3d)
library(QuantPsyc)
library(e1071)
All practical examples are built around the `mtcars` dataset. The data was extracted from the 1974 Motor Trend US magazine and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).
head(datasets::mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
The simple linear regression model has only one explanatory variable. I am regressing miles/(US) gallon (`mpg`) on gross horsepower (`hp`) by specifying both variables within the argument `formula =` of the function `lm()`.
lm_simple <- lm(formula = mpg ~ hp, data = mtcars)
The regression results are saved in the object `lm_simple` and can be accessed with different functions. The function `summary()` displays the output of the regression model in the R console.
summary(lm_simple)
##
## Call:
## lm(formula = mpg ~ hp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.7121 -2.1122 -0.8854 1.5819 8.2360
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.09886 1.63392 18.421 < 2e-16 ***
## hp -0.06823 0.01012 -6.742 1.79e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.863 on 30 degrees of freedom
## Multiple R-squared: 0.6024, Adjusted R-squared: 0.5892
## F-statistic: 45.46 on 1 and 30 DF, p-value: 1.788e-07
The first two lines of the output simply display the model specified in the formula of the function `lm()`.
The next three lines show the distribution of the residuals. A residual is the difference between the observed value of `mpg` in the dataset and the value of `mpg` predicted by the linear model. The smaller these differences are, the better the model fits the data. Also, a symmetrical distribution of the residuals around zero reflects a good fit. You might have a look at the residuals with a histogram.
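For example, with `ggplot2` (a small sketch of my own, using the `lm_simple` model from above):
ggplot(data.frame(resid = residuals(lm_simple)), aes(x = resid)) +
  geom_histogram(bins = 10, fill = "grey70", color = "black") +  # histogram of the residuals
  labs(x = "Residuals", y = "Count")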
The next block of lines shows the coefficient estimates. Beginning with the constant (`(Intercept)`) in row one, each additional row refers to one coefficient. The first column `Estimate` shows the corresponding values of the coefficients. The coefficient of `(Intercept)` shows the predicted `mpg` for cars with an `hp` of zero. The coefficient of `hp` shows the average effect on `mpg` when `hp` is increased by one unit. The second column `Std. Error` measures the variation of the coefficient estimate. The smaller the standard error relative to the coefficient estimate, the more reliable the estimation. Standard errors can be used to calculate confidence intervals and hypothesis tests. The third column `t value` measures how many standard errors the coefficient is away from zero. In order to reject the null hypothesis in favour of the alternative that a relationship between x and y exists (H0: \(\beta_j = 0\) vs. H1: \(\beta_j \neq 0\)), the t value should be far away from zero. The t values are then used to calculate the p values. The fourth column `Pr(>|t|)` reflects the probability of obtaining a t value at least as extreme as the one calculated if the null hypothesis were true. A small p value suggests that it is unlikely to observe such a relationship between the independent and dependent variable purely by chance. Typically, a threshold of 5% is used. The stars indicate the significance level of the p value; the corresponding legend can be found in the first row below the coefficients.
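As a brief sketch (my own addition), a 95% confidence interval for the `hp` coefficient can be built from the estimate and its standard error and cross-checked against the built-in `confint()`:
est <- summary(lm_simple)$coefficients["hp", "Estimate"]
se  <- summary(lm_simple)$coefficients["hp", "Std. Error"]
est + c(-1, 1) * qt(0.975, df = lm_simple$df.residual) * se  # manual 95% confidence interval
confint(lm_simple, "hp", level = 0.95)                       # built-in equivalent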
The RSE measures the goodness of fit of the regression model. It is calculated by taking the square root of the mean squared error (MSE), which is the residual sum of squares (RSS) divided by the degrees of freedom (n - k). In theory, every linear model has an error term, so the dependent variable can never be perfectly predicted by the independent variable(s). The RSE measures the average deviation of the observed \(y\) from the fitted regression line, i.e. the typical size of the error we make when we use the regression model to predict \(y\). The degrees of freedom show how many data points (observations) were used in the estimation after the parameters (`(Intercept)` and `hp`) have been taken into account.
rss <- sum(lm_simple$residuals^2)   # RSS (residual sum of squares)
mse <- rss / lm_simple$df.residual  # MSE (mean squared error)
sqrt(mse)                           # RSE (residual standard error)
## [1] 3.862962
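The same value is also stored in the summary object, which provides a quick cross-check (my own addition):
summary(lm_simple)$sigma  # residual standard error reported by summary()
## [1] 3.862962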
The R² is a measure of how well the model fits the actual data and is standardized between zero and one. A model with an R² close to zero performs poorly in explaining the observed variance in y. An R² close to one, however, shows that the model explains the observed variance in y very well. This model has an R² of 0.60, which means that 60% of the observed variance in `mpg` can be explained by the regressor `hp`. In practice, it is quite difficult to determine which R² is sufficient to consider a model relevant for explaining the relationship between y and x. It will always depend on the given data and field of application. Note that the R² increases with the number of variables added to the model, hence you should then consult the adjusted version as it imposes a penalty on the number of regressors.
rss <- sum(lm_simple$residuals ^ 2)                                # RSS (residual sum of squares)
tss <- sum((lm_simple$model$mpg - mean(lm_simple$model$mpg)) ^ 2)  # TSS (total sum of squares)
# R²
1 - rss / tss
## [1] 0.6024373
# Adjusted R² = 1 - (1 - R²) * (n - 1) / (n - k - 1), with n = 32 and k = 1
1 - (rss / tss) * (dim(mtcars)[1] - 1) / (dim(mtcars)[1] - 1 - 1)
## [1] 0.5891853
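Both measures are also stored in the summary object (again, a cross-check I added):
summary(lm_simple)$r.squared      # R² reported by summary()
## [1] 0.6024373
summary(lm_simple)$adj.r.squared  # adjusted R² reported by summary()
## [1] 0.5891853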
This is a measure of whether there is a relationship between \(y\) and one or more \(x_i\). It is called an overall F-test of the null hypothesis H0: \(\beta_1 = \beta_2 = \dots = \beta_k = 0\) vs. H1: at least one \(\beta_j \neq 0\) (the `(Intercept)` is not tested). The farther the F value is away from one, the better. How large the F value should be depends on the number of observations as well as the number of independent variables. With many observations, an F value a little larger than one will suffice to reject the null hypothesis. Conversely, with few observations, a larger F value will be required in order to demonstrate a potential relationship between \(y\) and \(x_i\).
# TSS (total sum of squares)
tss <- sum((lm_simple$model$mpg - mean(lm_simple$model$mpg)) ^ 2)
# RSS (residual sum of squares)
rss <- sum(lm_simple$residuals ^ 2)
# number of predictors whose coefficients are tested (excluding the intercept)
p <- 1
# observations
n <- dim(mtcars)[1]
# F statistic
((tss - rss) / p) / (rss / (n-p-1))
## [1] 45.4598
# p value
1 - pf(((tss - rss) / p) / (rss / (n - p - 1)),
df1 = 1,
df2 = 30)
## [1] 1.787835e-07
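The F statistic and its degrees of freedom are also stored in the summary object, which can serve as a cross-check (my own addition):
summary(lm_simple)$fstatistic  # named vector: F value (45.46), numerator df (1), denominator df (30)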
With a scatterplot we can visualize the relationship between the independent and the dependent variable. Before doing so, I am creating a new dataset (`lmpredict.df`) with the observed values of `mpg` and `hp` as well as the values of `mpg` predicted by our model and the corresponding standard errors.
pred <- predict(lm_simple, newdata = mtcars, se.fit = TRUE)
lmpredict.df <- tibble(
  hp = mtcars$hp,
  mpg = mtcars$mpg,
  .fitted = pred$fit,
  .se = pred$se.fit
)
Now we create the scatterplot of `mpg` against `hp`. The fitted regression line is shown in blue and the two red dashed lines depict the lower and upper bounds of its confidence interval.
ggplot(data = lmpredict.df, aes(x = hp, y = mpg)) +
geom_point() +
geom_line(data = lmpredict.df, aes(y = .fitted),
color = "blue", size = 1) +
geom_line(data = lmpredict.df, aes(y = .fitted + 2 * .se),
color = "red", size = 0.5, linetype = "dashed") +
geom_line(data = lmpredict.df, aes(y = .fitted - 2 * .se),
color = "red", size = 0.5, linetype = "dashed") +
scale_x_continuous(expand = expansion(mult = c(0, 0))) +
labs(x = "Gross horsepower", y = "Miles/(US) gallon")
The multiple linear regression model has at least two explanatory variables. I am regressing `mpg` on `hp` and displacement (cu. in., `disp`) by specifying the variables within the argument `formula =` of the function `lm()`.
lm_multiple <- lm(formula = mpg ~ hp + disp, data = mtcars)
The regression results are now saved in the object `lm_multiple` and displayed with the function `summary()` in the R console.
summary(lm_multiple)
##
## Call:
## lm(formula = mpg ~ hp + disp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.7945 -2.3036 -0.8246 1.8582 6.9363
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.735904 1.331566 23.083 < 2e-16 ***
## hp -0.024840 0.013385 -1.856 0.073679 .
## disp -0.030346 0.007405 -4.098 0.000306 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.127 on 29 degrees of freedom
## Multiple R-squared: 0.7482, Adjusted R-squared: 0.7309
## F-statistic: 43.09 on 2 and 29 DF, p-value: 2.062e-09
The explanations of the model call and of the residuals from the simple regression above also apply here.
The explanation of the coefficient block above also applies here, except that we now find an additional row for the newly added explanatory variable `disp`.
The explanation of the residual standard error above also applies here, except that for the calculation of the degrees of freedom the parameters `(Intercept)`, `hp` and `disp` now have to be taken into account.
# RSS (residual sum of squares)
sum(lm_multiple$residuals^2)
## [1] 283.4934
# MSE (mean squared error)
sum(lm_multiple$residuals^2)/lm_multiple$df.residual
## [1] 9.775636
# RSE (residual standard error)
sqrt(sum(lm_multiple$residuals^2)/lm_multiple$df.residual)
## [1] 3.126601
The explanation of the R² above also applies here, except that we should definitely consult the adjusted R² now. This model has an adjusted R² of 0.73, which means that 73% of the observed variance in `mpg` can be explained by the regressors `hp` and `disp`.
# RSS (residual sum of squares)
rss <- sum(lm_multiple$residuals ^ 2)
# TSS (total sum of squares)
tss <- sum((lm_multiple$model$mpg - mean(lm_multiple$model$mpg)) ^ 2)
# R²
1 - rss / tss
## [1] 0.7482402
# Adjusted R² = 1 - (1 - R²) * (n - 1) / (n - k - 1), with n = 32 and k = 2
1 - (rss / tss) * (dim(mtcars)[1] - 1) / (dim(mtcars)[1] - 2 - 1)
## [1] 0.7308774
The explanation of the F-statistic above also applies here.
# TSS (total sum of squares)
tss <- sum((lm_multiple$model$mpg - mean(lm_multiple$model$mpg)) ^ 2)
# RSS (residual sum of squares)
rss <- sum(lm_multiple$residuals ^ 2)
# number of predictors whose coefficients are tested (excluding the intercept)
p <- 2
# observations
n <- dim(mtcars)[1]
# F statistic
((tss - rss) / p) / (rss / (n-p-1))
## [1] 43.09458
# p value
1 - pf(((tss - rss) / p) / (rss / (n - p - 1)),
df1 = 2,
df2 = 29)
## [1] 2.062068e-09
With a 3D scatterplot we can visualize the relationship between the two independent variables and the dependent variable. I am using the function `scatterplot3d()` from the R package `scatterplot3d` here and add a regression plane based on the specified multiple linear regression model.
s3d <- scatterplot3d(x = mtcars[, 4], y = mtcars[, 3], z = mtcars[, 1],
                     xlab = "Gross horsepower", ylab = "Displacement (cu.in.)",
                     zlab = "Miles/(US) gallon", grid = TRUE, box = FALSE, angle = 20)
s3d$plane3d(lm_multiple, lty.box = "solid", col = "blue")
We can see that both gross horsepower and displacement have a negative relationship with miles per gallon.
In the regression model below, the coefficient of `wt` is much larger (in absolute value) than the coefficients of `hp` and `disp`. This difference in magnitude tempts us to assume that `wt` has a larger impact on `mpg` than the other two regressors. However, remember that a regression coefficient measures the average change of \(y\) for a unit change of the respective \(x_i\). In the case of `wt` the unit is 1,000 lbs, while the unit of `hp` is horsepower (roughly equivalent to 745.7 watts) and the unit of `disp` is cubic inches.
lm_multiple2 <- lm(formula = mpg ~ hp + disp + wt, data = mtcars)
summary(lm_multiple2)
##
## Call:
## lm(formula = mpg ~ hp + disp + wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.891 -1.640 -0.172 1.061 5.861
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.105505 2.110815 17.579 < 2e-16 ***
## hp -0.031157 0.011436 -2.724 0.01097 *
## disp -0.000937 0.010350 -0.091 0.92851
## wt -3.800891 1.066191 -3.565 0.00133 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.639 on 28 degrees of freedom
## Multiple R-squared: 0.8268, Adjusted R-squared: 0.8083
## F-statistic: 44.57 on 3 and 28 DF, p-value: 8.65e-11
In order to compare the impact of different variables with different units we can standardize the regression coefficients with the formula below:
\[ \begin{aligned} b_i = \hat{\beta}_i\frac{s_{x_i}}{s_{y}} \end{aligned} \]
The beta coefficient \(b_i\) is computed by multiplying the estimated regression coefficient \(\hat{\beta}_i\) by the ratio of the standard deviation of \(x_i\) to the standard deviation of \(y\).
In R we can either use the function `scale()` on all variables specified in our model…
lm_multiple2_scaled <- lm(formula = scale(mpg) ~ scale(hp) + scale(disp) + scale(wt), data = mtcars)
coef(lm_multiple2_scaled)[2:4]
## scale(hp) scale(disp) scale(wt)
## -0.35443851 -0.01926874 -0.61706350
… or use the function `lm.beta()` from the R package `QuantPsyc`.
QuantPsyc::lm.beta(lm_multiple2)
## hp disp wt
## -0.35443851 -0.01926874 -0.61706350
No matter which approach you choose, the interpretation of the beta coefficients is identical. Raising, for example, the weight (`wt`) by one standard deviation leads to a reduction of `mpg` by 0.62 standard deviations. In contrast, raising the horsepower (`hp`) by one standard deviation leads to a reduction of `mpg` by only 0.35 standard deviations.
As measured by the beta coefficients, `wt` has a stronger influence on `mpg` than `hp`.
The function `lm.beta()` does not produce correct standardized coefficients when interaction terms are present in the model. Also, it is not advisable to use standardized regression coefficients to compare regression models across different datasets, as the variances of the variables can differ.
Using beta coefficients with binary (dummy) variables is also problematic, as their magnitude decreases when the skewness of the variables increases. Below I included code to demonstrate this issue. Please note that this example requires good knowledge of data structures as well as functions of the `tidyverse` packages.
The for-loop generates a dataset with roughly 1,000 regression models in which \(x_1\) and \(x_2\) are used to predict \(y\). Both predictors are dummy variables which only take on values of either zero or one. While the fraction of zero (or one) values is constant in \(x_1\), it increases in \(x_2\) for each additional model.
dataset_list <- list()
j <- 0
for (i in seq(from = 0, to = 1, by = 1/1000)){
# Counter
j <- j + 1
# Computation
set.seed(123)
error <- rnorm(1000)
bin_cons = sample(rep(c(0,1),each = 500))
bin_var = sample(c(0,1), size = 1000, replace = TRUE, prob = c(1-i,i))
dep_var = bin_cons + bin_var + error
# Output
dataset_list[[j]] <-
tibble(
j,
data = list(tibble(j, dep_var, bin_cons, bin_var)),
frac_binary = mean(bin_var),
beta_cons = tryCatch(
coef(lm(scale(dep_var) ~ scale(bin_cons) + scale(bin_var)))[2],
error=function(e) NA),
beta_var = tryCatch(
coef(lm(scale(dep_var) ~ scale(bin_cons) + scale(bin_var)))[3],
error=function(e) NA)
)
}
output <- tibble(dataset_list)
Let’s have a look at what the output of model 123 looks like. The column `data` is a so-called list-column and comprises 4 additional columns (`j`, `dep_var`, `bin_cons` and `bin_var`). You can access this column with the function `map_dfr()`. The last two columns are the beta coefficients of \(x_1\) and \(x_2\) for this model. The share of ones in variable `bin_var` is 0.125 (`frac_binary`), i.e. roughly one in eight observations is a one.
output[[1]][[123]]
## # A tibble: 1 x 5
## j data frac_binary beta_cons beta_var
## <dbl> <list> <dbl> <dbl> <dbl>
## 1 123 <tibble [1,000 x 4]> 0.125 0.447 0.262
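The nested data of a single model can also be pulled out with `map_dfr()` and dplyr verbs, for example (a small sketch of mine):
map_dfr(dataset_list, "data") %>%  # row-bind the nested tibbles of all models
  filter(j == 123) %>%             # keep only the data of model 123
  head(3)                          # show the first three rows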
Below you can see how the skewness of variable `bin_var` changes with each model.
map_dfr(dataset_list,"data") %>%
pivot_longer(cols= c(bin_cons, bin_var)) %>%
group_by(j, name) %>%
summarise(skewness = e1071::skewness(value)) %>%
ggplot(aes(x = j, y = skewness, color = name)) +
geom_line() +
scale_color_manual(values = c("red","blue"),
labels = c(expression(Dummy~x[1]),
expression(Dummy~x[2]))) +
labs(x = "Model", y = "Skewness", color = "")
And finally, a graphical representation of the described problem: the beta coefficient of a dummy variable decreases as its skewness increases.
tibble(
map_dfr(dataset_list,"data") %>%
group_by(j) %>%
summarise(skewness_bin_var = e1071::skewness(bin_var))
) %>%
inner_join(
tibble(
map_dfr(dataset_list,"data") %>% distinct(j),
bin_cons = map_dbl(dataset_list, "beta_cons"),
bin_var = map_dbl(dataset_list, "beta_var")
),
by = c("j")
) %>%
pivot_longer(cols=c(bin_cons, bin_var)) %>%
ggplot(aes(x = skewness_bin_var, y = value, color = name)) +
geom_line() +
scale_color_manual(values = c("red","blue"),
labels = c(expression(x[1]~Skewness~constant),
expression(x[2]~Skewness~varies))) +
labs(title = expression(Model:~y==x[1]~+~x[2]~+~epsilon),
x = expression(Skewness~of~X[2]),
y = "Beta coefficient", color = "")
This section considers the t-test and the F-test in more detail, with special regard to their applications in the linear regression model. Let us consider another multiple linear regression model in which we add the variable `qsec` (1/4 mile time).
lm_multiple2 <- lm(formula = mpg ~ hp + disp + qsec, data = mtcars)
summary(lm_multiple2)
##
## Call:
## lm(formula = mpg ~ hp + disp + qsec, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5029 -2.6417 -0.6002 1.9749 7.2263
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.622206 9.668300 3.995 0.000426 ***
## hp -0.034642 0.017967 -1.928 0.064041 .
## disp -0.028468 0.007787 -3.656 0.001049 **
## qsec -0.385561 0.468127 -0.824 0.417114
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.144 on 28 degrees of freedom
## Multiple R-squared: 0.7542, Adjusted R-squared: 0.7279
## F-statistic: 28.64 on 3 and 28 DF, p-value: 1.118e-08
At the 5% significance level the critical t-value is approximately 2. When the absolute value of t is much larger than 2, we can already tell that the null hypothesis will be rejected; when it is much smaller, it will not be.
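The exact critical value for this model's residual degrees of freedom can be computed with `qt()` (a quick check of my own):
qt(0.975, df = lm_multiple2$df.residual)  # two-sided 5% critical value, roughly 2.05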
A coefficient’s t-value is calculated by dividing its estimate by the corresponding standard error.
# hp
hp_tval <- summary(lm_multiple2)$coefficients[2, 1] / summary(lm_multiple2)$coefficients[2, 2]
hp_pval <- 2 * pt(-abs(hp_tval), df = lm_multiple2$df.residual)      # two-sided p value
# disp
disp_tval <- summary(lm_multiple2)$coefficients[3, 1] / summary(lm_multiple2)$coefficients[3, 2]
disp_pval <- 2 * pt(-abs(disp_tval), df = lm_multiple2$df.residual)  # two-sided p value
# qsec
qsec_tval <- summary(lm_multiple2)$coefficients[4, 1] / summary(lm_multiple2)$coefficients[4, 2]
qsec_pval <- 2 * pt(-abs(qsec_tval), df = lm_multiple2$df.residual)  # two-sided p value
# Output
tibble(variable = c("hp", "disp", "qsec"),
t_value = c(hp_tval, disp_tval, qsec_tval),
p_value = c(hp_pval, disp_pval, qsec_pval))
## # A tibble: 3 x 3
## variable t_value p_value
## <chr> <dbl> <dbl>
## 1 hp -1.93 0.0640
## 2 disp -3.66 0.00105
## 3 qsec -0.824 0.417
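Since the `broom` package is already loaded, the same quantities can also be obtained in tidy form; the `statistic` and `p.value` columns should match the manual calculation above (a cross-check I added):
broom::tidy(lm_multiple2)  # term, estimate, std.error, statistic (= t value), p.value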
The F-test is more versatile than the t-test. We can test relationships between regression coefficients, e.g. whether one variable has the same effect on y as another variable, or an effect twice as large. Furthermore, we can carry out exclusion tests, i.e. test whether a group of variables can be excluded from the model.
For example, we may want to check whether not just a single coefficient but all of them are equal to zero. Hence, we estimate the model twice: once with and once without the regressors. First, we compute the restricted model containing only a constant. Then we compare this model to the regression model specified above with the function `anova()`. We obtain the same value for the F-statistic as in the summary of the regression output above.
lm_constant <- lm(mpg ~ 1, data = mtcars)
anova(lm_constant, lm_multiple2)
## Analysis of Variance Table
##
## Model 1: mpg ~ 1
## Model 2: mpg ~ hp + disp + qsec
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 31 1126.05
## 2 28 276.79 3 849.26 28.637 1.118e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
With an exclusion restriction, the joint significance of a group of variables is tested. That is not the same as testing each t-statistic separately. In practice, it is possible that variables are significant as a group even though no variable is significant individually (both assessed at the same level of significance).
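As a hedged illustration with simulated data (not part of the original mtcars analysis), the sketch below constructs two almost perfectly collinear regressors; their individual t-tests will typically be insignificant, while the F-test of their joint significance is highly significant:
set.seed(1)
n  <- 50
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)  # x2 is almost identical to x1
y  <- 1 + x1 + x2 + rnorm(n)
summary(lm(y ~ x1 + x2))        # compare the individual p values with the overall F-test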
Below, I am testing whether an additional unit of `hp` has, on average, the same effect on `mpg` as four additional units of `disp`. Formally, we can write the null hypothesis as \(\beta_{hp} = 4\beta_{disp}\). I am using the function `linearHypothesis()` from the R package `car` to test it.
linearHypothesis(lm_multiple2, c("hp=4*disp"))
## Linear hypothesis test
##
## Hypothesis:
## hp - 4 disp = 0
##
## Model 1: restricted model
## Model 2: mpg ~ hp + disp + qsec
##
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 29 305.73
## 2 28 276.79 1 28.939 2.9275 0.09814 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We find that the F statistic has a rather small value, and there is only weak evidence at the 10% level that an additional unit of `hp` has a different effect on `mpg` than four units of `disp`.
Let us also test whether `hp` and `disp` are jointly significant.
linearHypothesis(lm_multiple2, c("hp=0", "disp=0"))
## Linear hypothesis test
##
## Hypothesis:
## hp = 0
## disp = 0
##
## Model 1: restricted model
## Model 2: mpg ~ hp + disp + qsec
##
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 928.66
## 2 28 276.79 2 651.87 32.972 4.366e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Finally, we can also construct an overall F-test by testing the joint significance of all regressors. The result is identical to estimating a full and a restricted model and using the function `anova()`.
linearHypothesis(lm_multiple2, c("hp=0", "disp=0", "qsec=0"))
## Linear hypothesis test
##
## Hypothesis:
## hp = 0
## disp = 0
## qsec = 0
##
## Model 1: restricted model
## Model 2: mpg ~ hp + disp + qsec
##
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 31 1126.05
## 2 28 276.79 3 849.26 28.637 1.118e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1