

1. Introduction

This tutorial assumes that you are already familiar with the theoretical basics of linear regression, such as the Gauss-Markov assumptions, as well as with running OLS in R. In case you are not, have a look at this tutorial. In the following sections I discuss different approaches to modifying the functional form of a linear regression model, moving from very basic content to more sophisticated methods.

1.1 R packages

The following R packages are required for this tutorial.

library(dplyr)
library(tibble)
library(ggplot2)
library(magrittr)
library(broom)
library(purrr)
library(scales)
library(GGally)
library(splines)

1.2 Dataset

All practical examples are built around the Mid-Atlantic Wage Dataset contained in the R package ISLR. It contains information on 3,000 male workers in the Mid-Atlantic region, such as their yearly salary, age and other demographic characteristics. After loading the data, I convert wage to full US dollars (in the original data it is measured in thousands of dollars).

data("Wage", package = "ISLR")

Wage <- 
  Wage %>%
  mutate(wage = wage*1000)

2. Variable transformation

The standard linear regression model (without any transformations) implies that raising a regressor (\(x_i\)) by one unit changes the dependent variable (\(y\)) by \(β_i\) units. Any transformation of a regressor or of the regressand alters this interpretation.

2.1 Overview

We can distinguish four different types of models. Each is discussed in more detail throughout this section.

Level-Level:  y = β0 + β1·x + u
  Regressand: y; Regressor: x
  Interpretation: Raising x by one unit changes (c. p.) y on average by β1 units.

Log-Level:  log(y) = β0 + β1·x + u
  Regressand: log(y); Regressor: x
  Interpretation: Raising x by one unit changes (c. p.) y on average by approximately 100·β1 percent; the exact change is 100·(e^β1 - 1) percent.

Level-Log:  y = β0 + β1·log(x) + u
  Regressand: y; Regressor: log(x)
  Interpretation: Raising x by one percent changes (c. p.) y on average by approximately β1/100 units; the exact change is β1·ln(1 + 1/100) units (semi-elasticity).

Log-Log:  log(y) = β0 + β1·log(x) + u
  Regressand: log(y); Regressor: log(x)
  Interpretation: Raising x by one percent changes (c. p.) y on average by β1 percent (elasticity).

All logarithms are natural logarithms.

2.2 Level-Level-Model

This is the standard linear regression model: I regress wage on age. The variable wage measures a respondent’s yearly salary in US dollars and the variable age measures a respondent’s age in years.

Model

levlev <- lm(wage ~ age, data = Wage)
summary(levlev)
## 
## Call:
## lm(formula = wage ~ age, data = Wage)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -100265  -25115   -6063   16601  205748 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 81704.74    2846.24   28.71   <2e-16 ***
## age           707.28      64.75   10.92   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 40930 on 2998 degrees of freedom
## Multiple R-squared:  0.03827,    Adjusted R-squared:  0.03795 
## F-statistic: 119.3 on 1 and 2998 DF,  p-value: < 2.2e-16

Scatterplot

The function augment() from R package broom takes an lm object and transforms its content into a tidy format. The model’s predicted values are found in column .fitted.

ggplot(data = broom::augment(levlev),
       aes(x = age, y = wage)) +
  geom_point() +
  geom_line(aes(y = .fitted), size = 1, color = "red") +
  scale_y_continuous(labels = scales::dollar) +
  labs(x = "Age", y = "Wage")

Digression: The Intercept

First, I create a new dataset (age.df) with a column age whose values range from 0 to 100. With the function predict() I use the already fitted model levlev to predict the wage for each of these values of age. This generates predicted values for wage outside the range of the respondents’ age in the original dataset, where the lowest and highest values of age are 18 and 80, respectively. The extrapolation is not very meaningful since working respondents will rarely be younger than 18 or older than 80, but it highlights why interpreting the constant in this model is pointless: the intercept is the intersection of the regression line with the y-axis, i.e. the predicted wage of respondents who are hypothetically 0 years old.

age.df <- tibble(age = 0:100)
age.df$yhat <- predict(levlev, newdata = age.df)

ggplot(data = Wage, aes(x = age, y = wage)) +
  geom_point() +
  geom_line(data = age.df,
            aes(x = age, y = yhat),
            size = 1, color = "red") +
  scale_x_continuous(expand = expansion(mult = c(0, 0))) +
  scale_y_continuous(labels = scales::dollar) +
  labs(x = "Age", y = "Wage")

Raising the respondents’ age by 1 year changes their wage on average by $707 (remember that this is the slope of the regression line). As shown in the digression, the intercept is the predicted value of wage for respondents aged 0 years ($81,705).


2.3 Log-Level-Model

Now we’ll discuss the first type of transformation in more detail. For this model I compute the logarithm of wage and regress it on age.

Model

loglev <- lm(log(wage) ~ age, data = Wage)
summary(loglev)
## 
## Call:
## lm(formula = log(wage) ~ age, data = Wage)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.65779 -0.20059  0.00439  0.20045  1.13716 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.128e+01  2.388e-02  472.41   <2e-16 ***
## age         6.640e-03  5.432e-04   12.22   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3434 on 2998 degrees of freedom
## Multiple R-squared:  0.04748,    Adjusted R-squared:  0.04716 
## F-statistic: 149.4 on 1 and 2998 DF,  p-value: < 2.2e-16

Scatterplot

ggplot(broom::augment(loglev),
       aes(x = age, y = `log(wage)`)) +
  geom_point() + 
  geom_line(aes(y = .fitted), size = 1, color = "red") +
  labs(x = "Age", y = "Log(Wage)")

Raising the respondents’ age by 1 year changes their wage on average by approximately 0.664%. The exact change is 0.666% (calculation below).

exp(loglev$coefficients[2])-1
##         age 
## 0.006662229

2.4 Level-Log-Model

Next, I compute the logarithm of age and use it instead of the original variable in the regression model.

Model

levlog <- lm(wage ~ log(age), data = Wage)
summary(levlog)
## 
## Call:
## lm(formula = wage ~ log(age), data = Wage)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -100346  -24828   -5958   16216  204959 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -11500       9555  -1.204    0.229    
## log(age)       33228       2569  12.933   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 40620 on 2998 degrees of freedom
## Multiple R-squared:  0.05285,    Adjusted R-squared:  0.05253 
## F-statistic: 167.3 on 1 and 2998 DF,  p-value: < 2.2e-16

Scatterplot

ggplot(broom::augment(levlog),
       aes(x = `log(age)`, y = wage)) +
  geom_point() +
  geom_line(aes(y = .fitted), size = 1, color = "red") +
  scale_y_continuous(labels = scales::dollar) +
  labs(x = "Log(Age)", y = "Wage")

Raising the respondents’ age by 1% changes their wage on average by approximately $332.28 (β1/100 units). Note that doubling age, i.e. an increase of 100%, changes wage on average by β1·ln(2) ≈ $23,032 rather than by the full coefficient of $33,228, which corresponds to a one-unit increase in log(age) (multiplying age by e ≈ 2.72).
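The exact semi-elasticity can also be computed directly from the fitted model; a minimal sketch:

# exact change of wage for a 1% increase in age: beta1 * ln(1 + 1/100)
coef(levlog)["log(age)"] * log(1 + 1/100)  # roughly $331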


2.5 Log-Log-Model

At last, I compute the logarithm of both wage and age and run the model again.

Model

loglog <- lm(log(wage) ~ log(age), data = Wage)
summary(loglog)
## 
## Call:
## lm(formula = log(wage) ~ log(age), data = Wage)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.67076 -0.19396  0.00286  0.19475  1.13857 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.39102    0.07993  130.00   <2e-16 ***
## log(age)     0.31572    0.02149   14.69   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3398 on 2998 degrees of freedom
## Multiple R-squared:  0.06714,    Adjusted R-squared:  0.06683 
## F-statistic: 215.8 on 1 and 2998 DF,  p-value: < 2.2e-16

Scatterplot

ggplot(broom::augment(loglog),
       aes(x = `log(age)`, y = `log(wage)`)) +
  geom_point() +
  geom_line(aes(y = .fitted), size = 1, color = "red") +
  labs(x = "Log(Age)", y = "Log(Wage)")

Raising the respondents’ age by 1% changes their wage on average by 0.32%.


3. Adding qualitative variables

In the previous section I only used the numerical variables wage and age, which are both ratio-scaled (there is no such thing as being -1 years old or earning -1,000 US dollars). Now I’m going to add dichotomous variables with exactly two distinct values (often referred to as dummy variables) and variables with at least three distinct values (often referred to as factor or categorical variables) to the regression model.

3.1 Dummy variables

A dummy variable has exactly two values (levels). In the model below I am regressing the respondents’ wage on their health and age. The variable health indicates the health level of a worker and has two levels: <= Good and >= Very Good.

Model

The lm() function treats dummy variables as intended if they are either of class numeric, factor or character, so usually this should not cause any trouble.
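If you are unsure how a dummy is stored, a quick check of its class and levels never hurts; a small sketch:

class(Wage$health)   # factor, character, or a numeric 0/1 variable all work
levels(Wage$health)  # with the default contrasts the first level becomes the reference group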

lm_dummy <- lm(wage ~ health + age, data = Wage)
summary(lm_dummy)
## 
## Call:
## lm(formula = wage ~ health + age, data = Wage)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -100786  -24871   -5446   16599  215826 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          65739.28    3198.62   20.55   <2e-16 ***
## health2. >=Very Good 16900.00    1641.32   10.30   <2e-16 ***
## age                    799.20      64.27   12.44   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 40230 on 2997 degrees of freedom
## Multiple R-squared:  0.07113,    Adjusted R-squared:  0.07051 
## F-statistic: 114.8 on 2 and 2997 DF,  p-value: < 2.2e-16

Scatterplot

When plotting this model we see two fitted regression lines, one for each level of health. The slopes of both regression lines are identical but they have different intercepts with the y-axis. The health dummy is added via the color = aesthetic.

ggplot(augment(lm_dummy),
       aes(x = age, y = wage, color = health)) +
  geom_point(alpha = 0.5) +
  geom_line(aes(y = .fitted), size = 1) +
  scale_y_continuous(labels = scales::dollar) +
  scale_color_manual(values = c("red","darkgreen")) +
  labs(x = "Age", y = "Wage", color = "Health")

Interpretation of dummy variables in the linear model is always with respect to a so-called reference or base group (a level of the dummy variable). This level does not appear with a coefficient in the regression output. Regarding the dummy variable health, respondents with health condition <= Good are the reference group. The regression output hence displays the coefficient for respondents with health condition >= Very Good, indicating an average change of wage of $16,900 compared to the reference group. The effect of the respondents’ age on their wage is independent of their health condition ($799) but different from the value we found in section 2 without health in the model. Some of the variation in a respondent’s wage is now explained by health, hence the magnitude of the age coefficient changes. We might call health a control variable in this context.


3.2 Categorical variables

A categorical or factor variable has, compared to a dummy variable, at least three levels. In the following model I am regressing the respondents’ wage on their level of education and age. The categorical variable education has 5 levels. Often a logical order of the categories eases the interpretation of the model (in the Wage dataset the categories already range from lower to higher education levels). However, the order doesn’t really matter with respect to the specification of the model or the estimation of the coefficients.

Model

The lm() function recognizes categorical variables when they are of class factor or character. You can check this before running the model with the function is.factor(). Otherwise a categorical variable has to be converted, either beforehand or within the lm() call, or R will treat it like a numeric variable (like age).
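For illustration, a small sketch of such a check and of an explicit conversion (the conversion is only needed if the variable is not a factor yet):

is.factor(Wage$education)  # TRUE in the Wage data, so no conversion is needed here
# otherwise convert beforehand ...
# Wage <- Wage %>% mutate(education = factor(education))
# ... or directly inside lm():
# lm(wage ~ factor(education) + age, data = Wage)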

lm_faktor <- lm(wage ~ education + age, data=Wage)
summary(lm_faktor)
## 
## Call:
## lm(formula = wage ~ education + age, data = Wage)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -110033  -19635   -3907   14441  220408 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 60335.97    3245.71  18.589  < 2e-16 ***
## education2. HS Grad         11438.65    2480.25   4.612 4.16e-06 ***
## education3. Some College    24167.00    2609.76   9.260  < 2e-16 ***
## education4. College Grad    39766.77    2590.31  15.352  < 2e-16 ***
## education5. Advanced Degree 64986.56    2808.38  23.140  < 2e-16 ***
## age                           568.69      57.19   9.943  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 35940 on 2994 degrees of freedom
## Multiple R-squared:  0.2593, Adjusted R-squared:  0.2581 
## F-statistic: 209.6 on 5 and 2994 DF,  p-value: < 2.2e-16

Scatterplot

Now we obtain 5 different regression lines when plotting the model - one line for each level of education. Again, their slopes are identical while their intercepts with the y-axis are different.

ggplot(augment(lm_faktor),
       aes(x = age, y = wage, color = education)) +
  geom_point(alpha = 0.5) +
  geom_line(aes(y = .fitted), size = 1) +
  scale_y_continuous(labels = scales::dollar) +
  labs(x = "Age", y = "Wage", color = "Education")

Digression: Plotting Regression Coefficients

Interpreting coefficients of categorical variables with many levels is often cumbersome. However, there is a visualization technique called a coefficient plot which eases the interpretation. It can be created with the function ggcoef() from the R package GGally. Each blue dot represents the point estimate of the corresponding regression coefficient. The line passing through each dot represents an error bar for the associated 95% confidence interval.

ggcoef(lm_faktor, vline = FALSE, exclude_intercept = TRUE,
  errorbar_height = .2, color = "blue") +
  scale_x_continuous(labels = scales::dollar) +
  labs(x = "Estimated increase in wage", y = "")

Caution: Ordered Factors

It is important to note that when using summary() on an lm object, the coefficients of each level of the factor variable are only displayed when the factor is not ordered. For an ordered factor, orthogonal polynomial contrasts are displayed instead: L gives a measure of the linear trend, Q and C specify quadratic and cubic terms, and so forth. You may use the function is.ordered() to check your categorical variables before running your regression model.
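A quick sketch of such a check for the education variable:

is.ordered(Wage$education)  # FALSE here: education is an unordered factor,
                            # which is why the model above shows one dummy per level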

lm_faktor_ordered <- lm(wage ~ factor(education, ordered = TRUE) + age, data=Wage)
summary(lm_faktor_ordered)
## 
## Call:
## lm(formula = wage ~ factor(education, ordered = TRUE) + age, 
##     data = Wage)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -110033  -19635   -3907   14441  220408 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                         88407.77    2537.89  34.835  < 2e-16 ***
## factor(education, ordered = TRUE).L 50059.25    1865.18  26.839  < 2e-16 ***
## factor(education, ordered = TRUE).Q  8133.75    1747.09   4.656 3.37e-06 ***
## factor(education, ordered = TRUE).C  2634.28    1439.92   1.829   0.0674 .  
## factor(education, ordered = TRUE)^4   617.56    1368.37   0.451   0.6518    
## age                                   568.69      57.19   9.943  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 35940 on 2994 degrees of freedom
## Multiple R-squared:  0.2593, Adjusted R-squared:  0.2581 
## F-statistic: 209.6 on 5 and 2994 DF,  p-value: < 2.2e-16

As with dummy variables, the coefficients of categorical variables are interpreted with respect to a reference group, which here is formed by respondents with an education level of < HS Grad. For example, respondents with an education level of Advanced Degree earn, compared to the base group, a wage that is on average $64,987 higher (in the scatterplot this is reflected by the distance between the red and the purple line).

Again note that the effect of the respondents’ age on their wage is independent of their education ($569)!


4. Adding interactions

With an interaction (term) you can model that the influence of one regressor \(x_i\) on the dependent variable \(y\) systematically depends on another regressor \(x_j\). For example, in the section above the influence of a respondent’s age on her wage was independent of her education. With an interaction of age and education you would be able to see whether there is any such codependence. Three different types of interactions are covered in this section.

4.1 Two binary variables

In the first model I am regressing the respondents’ wage on the variables jobclass and health_ins. Remember that binary variables (dummies) have exactly two levels.

Model

In the formula = argument of function lm() interactions can be specified with the * operator. This operator also adds the base effect of each variable to the model. Alternatively you can use the operator :, which only includes the interaction term. If you want to include the base effects in your model as well, the variables have to be explicitly specified. Below I’ve specified both variants.

lm_interact_bin <- lm(wage ~ jobclass * health_ins, data = Wage)
lm_interact_bin <- lm(wage ~ jobclass + health_ins + jobclass:health_ins, data = Wage)
summary(lm_interact_bin)
## 
## Call:
## lm(formula = wage ~ jobclass + health_ins + jobclass:health_ins, 
##     data = Wage)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -94816 -24274  -6982  16425 210443 
## 
## Coefficients:
##                                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                              112255       1257  89.325   <2e-16 ***
## jobclass2. Information                    14927       1718   8.686   <2e-16 ***
## health_ins2. No                          -23990       2059 -11.650   <2e-16 ***
## jobclass2. Information:health_ins2. No    -4064       3176  -1.279    0.201    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39120 on 2996 degrees of freedom
## Multiple R-squared:  0.122,  Adjusted R-squared:  0.1211 
## F-statistic: 138.8 on 3 and 2996 DF,  p-value: < 2.2e-16

Line Plot

The interaction between two binary variables can be visualized with a line plot. First, I calculate the groupwise means of wage for each combination of the levels of jobclass and health_ins. Then I assign health_ins to the x-axis and the mean of wage to the y-axis. For each level of jobclass a separate line is drawn. The slopes of the two lines differ!

Wage %>%
  group_by(jobclass, health_ins) %>%
  summarise(mean_wage = mean(wage, na.rm = T)) %>%
  ggplot(aes(x = health_ins, y = mean_wage, group = jobclass)) +
  geom_line(aes(color = jobclass), size = 1.2) +
  scale_y_continuous(labels = scales::dollar) +
  labs(x = "Health insurance?", y = "Wage (mean)", color = "Type of job")

Contingency Table (DiD)

Sometimes this type of interaction is also called a differences-in-differences estimator as it measures the difference of the within-group differences of the outcome wage across the two dummy variables. This can be shown with a contingency table.

Jobclass       Health insurance   No health insurance   Difference
Industrial     $112,255           $88,265               $23,990
Information    $127,182           $99,129               $28,054
DiD (Industrial - Information):                         $-4,064
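The figures in the table can be reproduced directly from the data; a sketch, assuming the tidyr package is also installed for pivot_wider():

library(tidyr)

Wage %>%
  group_by(jobclass, health_ins) %>%
  summarise(mean_wage = mean(wage)) %>%
  ungroup() %>%
  pivot_wider(names_from = health_ins, values_from = mean_wage) %>%
  mutate(Difference = `1. Yes` - `2. No`,         # within-jobclass difference
         DiD = Difference - lead(Difference))     # difference of the differences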

First, in this specific model the intercept has a meaningful interpretation! It represents the mean wage for the reference group of both dummy variables: respondents with health_insurance employed in the Industrial sector earn on average $112,255.

The coefficient of jobclass measures the average change of wage for health_insured respondents between the sectors Information and Industrial. Respondents working in the Industrial sector form the reference group, hence the wage for health_insured respondents working in the Information sector changes on average by $14,927.

The coefficient of health_ins measures the average change of wage for respondents working in the Industrial sector depending on whether they have a health insurance or not. Having no health insurance changes their wage on average by $-23,990.

The interaction between the two dummy variables measures the job-specific difference between respondents with and without health_insurance. Having no health_insurance has a stronger (negative) effect on the respondents’ wage in the Information sector, on average by an additional $-4,064. This can also be seen in the line plot, where the line for the Information sector is a bit steeper. However, this interaction is not statistically significant!


4.2 Binary and continuous variable

In the next model the respondents’ wage is regressed on the dummy variable health_ins and the continuous variable age as well as the interaction term between both variables. In contrast to the interaction between two binary variables there are some differences regarding its specification and interpretation. First, I center age around its mean value in order to ease the interpretation of the model coefficients.

Model

To mean-center the respondents’ age you have two options: either you do it beforehand and add the centered variable to the Wage dataset, or you specify it directly within the formula = argument of lm(). The latter requires wrapping the expression inside the function I(), because otherwise a minus sign (-) is interpreted as removing a term from the model.
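A minimal sketch of the first option, using illustrative names (age_c, Wage_centered):

Wage_centered <- Wage %>% mutate(age_c = age - mean(age))
lm(wage ~ health_ins * age_c, data = Wage_centered)  # same fit as the I() variant used below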

lm_interact_bin_con <- lm(wage ~ health_ins*I(age-mean(age)), data = Wage)
summary(lm_interact_bin_con)
## 
## Call:
## lm(formula = wage ~ health_ins * I(age - mean(age)), data = Wage)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -92638 -23793  -6467  15560 204373 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        119746.29     862.37 138.857  < 2e-16 ***
## health_ins2. No                    -25575.19    1575.40 -16.234  < 2e-16 ***
## I(age - mean(age))                    450.65      78.69   5.727 1.12e-08 ***
## health_ins2. No:I(age - mean(age))    297.07     129.86   2.288   0.0222 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39160 on 2996 degrees of freedom
## Multiple R-squared:  0.1201, Adjusted R-squared:  0.1192 
## F-statistic: 136.3 on 3 and 2996 DF,  p-value: < 2.2e-16

Scatterplot

Again, the slopes of the two regression lines differ. The fitted line representing respondents without health insurance is steeper!

ggplot(broom::augment(lm_interact_bin_con),
       aes(x = `I(age - mean(age))`, y = wage, color = health_ins)) +
  geom_point(alpha = 0.5) +
  geom_line(aes(y = .fitted), size = 1) +
  geom_vline(xintercept = 0) +
  scale_y_continuous(labels = scales::dollar) +
  scale_color_manual(values = c("darkgreen", "red")) +
  labs(x = "Mean centered age (0 = 42 years)", y = "Wage", color = "Health insurance?")

Conditional Effects Plot

The conditional effects plot shows regression lines for different combinations of the regressors used in the model. The effect of each variable can be read off from the distance between the lines and from their slopes. Centering age around its mean is less important here than it is for interpreting the model output. First, we create a dataset with different values of age and both states of health_insurance. Then I compute the linear predictions from the interaction model as well as the corresponding standard errors. With geom_ribbon() I add confidence bands around the regression lines for each state of health_insurance. The plot shows that the increase of wage along increasing values of age is stronger for respondents without health_insurance than for respondents with it.

tibble(
  age = rep(seq(from = 20, to = 80, by = 5), 2),
  health_ins = factor(rep(c("1. Yes", "2. No"), each = 13))
  ) %>%
  mutate(fit = predict(lm_interact_bin_con, newdata = ., type = "response"),
         se = predict(lm_interact_bin_con, newdata = ., type = "response", se = TRUE)$se.fit,
         ll = fit - (1.96 * se), ul = fit + (1.96 * se)) %>%
  ggplot(aes(x = age, y = fit)) + 
  geom_ribbon(aes(ymin = ll, ymax = ul, fill = health_ins), alpha = 0.2) + 
  geom_line(aes(colour = health_ins), size = 1) +
  scale_y_continuous(labels = scales::dollar) +
  scale_color_manual(values = c("darkgreen", "red")) +
  scale_fill_manual(values = c("darkgreen", "red")) +
  labs(x = "Age", y = "Predicted wage", fill = "Health insurance?", color = "Health insurance?")

For respondents of average age the coefficient of health_ins measures the average change of wage when shifting from respondents with health_insurance to respondents without it ($-25,575).

Raising the age of a respondent with health_insurance by 1 year changes her wage on average by $450.65 (the slope for the reference group).

The interaction term shows how the effect of age on wage depends on the health insurance status of the respondent. For respondents without health_insurance, raising age by 1 year changes their wage on average by an additional $297.07, i.e. $747.72 in total. Hence, with increasing age the wage of respondents without health_insurance rises faster than the wage of respondents with health_insurance.
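The two implied slopes can be computed directly from the coefficient vector; a short sketch:

b <- coef(lm_interact_bin_con)
b["I(age - mean(age))"]                                            # slope for insured respondents
b["I(age - mean(age))"] + b["health_ins2. No:I(age - mean(age))"]  # slope for uninsured respondents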


4.3 Two continuous variables

For the last model, I am regressing the respondents’ wage on the continuous variable age and its quadratic term. This is in fact an interaction of age with itself; different continuous variables can be interacted with each other, and the quadratic term is only a special case of such an interaction.

Mathematically, a squared term implies that the marginal effect of \(x_i\) on \(y\) depends on the value of \(x_i\).

Consider the model y = β0 + β1x + β2x². Its marginal effect is dy/dx = β1 + 2β2x, and at x = -β1/(2β2) the function has either a:

  • maximum: if β2 < 0 (inverse U-shape)
  • minimum: if β2 > 0 (U-shape)

Model

In the formula = argument of function lm() the quadratic term has to be wrapped in the function I(), as the operator ^ has a special meaning in R’s formula syntax.

lm_interact_con_con <- lm(wage ~ age + I(age^2), data = Wage)
summary(lm_interact_con_con)
## 
## Call:
## lm(formula = wage ~ age + I(age^2), data = Wage)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -99126 -24309  -5017  15494 205621 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -10425.224   8189.780  -1.273    0.203    
## age           5294.030    388.689  13.620   <2e-16 ***
## I(age^2)       -53.005      4.432 -11.960   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39990 on 2997 degrees of freedom
## Multiple R-squared:  0.08209,    Adjusted R-squared:  0.08147 
## F-statistic:   134 on 2 and 2997 DF,  p-value: < 2.2e-16

Scatterplot

Plotting the fitted values of this model against age shows the inverse U-shaped curve due to the inclusion of the squared term. I also added predicted values for age outside the boundaries of the Wage dataset to show the intersection with the y-axis (intercept).

age.df <- tibble(age = 0:100)
age.df$yhat <- predict(lm_interact_con_con, newdata = age.df)

ggplot(data = Wage, aes(x = age, y = wage)) +
  geom_point() +
  geom_line(data = age.df,
       aes(x = age, y = yhat), size = 1, color = "red") +
  scale_x_continuous(expand = expansion(mult = c(0, 0))) +
  scale_y_continuous(labels = scales::dollar) +
  labs(x = "Age", y = "Wage")

The coefficient of age shows the effect of one additional year of age on the average change of wage, but only at age zero: going from 0 to 1 year, the predicted wage increases by $5,241. The lowest age of respondents present in the dataset is 18 years; raising their age from 18 to 19 years leads to a predicted increase of wage of about $3,333.

As the coefficient of the quadratic term is negative, the effect of an additional year declines with age. Hence, with increasing values of age the increase of wage gets smaller: raising the age of 30-year-old respondents to 31 years increases the predicted wage by only about $2,061.
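These discrete changes can be verified with predict(); a quick sketch:

# predicted change of wage for one additional year at different ages
diff(predict(lm_interact_con_con, newdata = tibble(age = c(18, 19))))
diff(predict(lm_interact_con_con, newdata = tibble(age = c(30, 31))))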

Also, the turning point (here a maximum) can be calculated as -β1/(2β2). It occurs at roughly 50 years.
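A short sketch of this calculation:

# turning point of the quadratic fit: x* = -beta1 / (2 * beta2)
-coef(lm_interact_con_con)["age"] / (2 * coef(lm_interact_con_con)["I(age^2)"])  # roughly 50 years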


4.4 Three-way interactions

With statistical software more complicated models can easily be estimated. As the title of this section suggests, three-way interactions create a ménage à trois between regressors. Starting from the interaction between health_ins and age, I am adding jobclass to the mix. This is hence an interaction between a continuous variable and two dummy variables.

Model

The interaction is specified, as in the previous section, with the * operator. Again I’m centering age around its mean value. The function tidy() from the R package broom condenses the regression output we usually create with summary(). We’ve already discussed the interaction between age and health_insurance in section 4.2. The only thing that changes is that jobclass has to be taken into account as well: there are two-way interactions of jobclass with age and with health_insurance, plus an interaction term of all three variables (row 8).

lm_interact_3way <- lm(wage ~ health_ins*I(age-mean(age))*jobclass, data = Wage)
tidy(lm_interact_3way)
## # A tibble: 8 x 5
##   term                                     estimate std.error statistic  p.value
##   <chr>                                       <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)                               111967.     1243.    90.0   0       
## 2 health_ins2. No                           -21206.     2095.   -10.1   1.06e-23
## 3 I(age - mean(age))                           482.      112.     4.31  1.71e- 5
## 4 jobclass2. Information                     14656.     1707.     8.59  1.43e-17
## 5 health_ins2. No:I(age - mean(age))           186.      174.     1.07  2.83e- 1
## 6 health_ins2. No:jobclass2. Information     -6007.     3183.    -1.89  5.92e- 2
## 7 I(age - mean(age)):jobclass2. Informati~    -115.      156.    -0.742 4.58e- 1
## 8 health_ins2. No:I(age - mean(age)):jobc~     207.      261.     0.793 4.28e- 1

Conditional Effects Plot

The construction of an appropriate dataset to draw the CEP is a bit complex with three different variables. I’m using the function rep_along() from the R package purrr to nest the categories of jobclass inside the categories of health_insurance.

tibble(
  age = rep(seq(from = 20, to = 80, by = 5), 4),
  health_ins = factor(rep(c("1. Yes", "2. No"), each = 26)),
  jobclass = factor(rep_along(age, c("1. Industrial", "2. Information")))
  ) %>%
  mutate(fit = predict(lm_interact_3way, newdata = ., type = "response"),
         se = predict(lm_interact_3way, newdata = ., type = "response", se = TRUE)$se.fit,
         ll = fit - (1.96 * se), ul = fit + (1.96 * se)) %>%
  ggplot(aes(x = age, y = fit)) + 
  geom_ribbon(aes(ymin = ll, ymax = ul, fill = health_ins), alpha = 0.2) + 
  geom_line(aes(colour = health_ins), size = 1) +
  facet_wrap(. ~ jobclass) +
  scale_y_continuous(labels = scales::dollar) +
  scale_color_manual(values = c("darkgreen", "red")) +
  scale_fill_manual(values = c("darkgreen", "red")) +
  labs(x = "Age", y = "Predicted wage", fill = "Health insurance?", color = "Health insurance?")

Let us first take a look at the results within each level of jobclass. In the Industrial sector the slopes for each level of health_insurance are nearly identical: a respondent’s wage increases with age regardless of whether she is health_insured or not. In the Information sector, however, a respondent’s wage increases more strongly with age when she has no health_insurance. Hence, the phenomenon observed in section 4.2 is mostly attributable to respondents employed in this sector. You find this in row 6 of the regression output; the p-value almost indicates significance at the 5% level.

Furthermore, we can compare across the levels of jobclass: the wage of a respondent without health_insurance increases at a similar rate with age in the Industrial and the Information sector (the slopes are nearly the same).

Also, compare this plot to the one we’ve created in section 4.2.


5. Polynomials

Polynomials extend the linear regression model by raising the included regressors to a higher power. For example, a cubic regression model uses \(x_i\), \(x_i^2\) and \(x_i^3\) as regressors. This is a fairly easy way to model non-linear relationships in the data.

Orthogonal Polynomials

The function poly() can be used to apply polynomials in the formula = argument of lm(). In the model below, I am regressing the respondents’ wage on a third-order polynomial of their age. By default poly() creates orthogonal rather than raw polynomials. This does not change the fitted values but has the advantage that you can see whether a certain order of the polynomial significantly improves the regression over the lower orders.

lm_poly_ort <- lm(wage ~ poly(age, 3), data = Wage)
summary(lm_poly_ort)
## 
## Call:
## lm(formula = wage ~ poly(age, 3), data = Wage)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -99693 -24562  -5222  15096 206119 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    111703.6      729.1 153.211  < 2e-16 ***
## poly(age, 3)1  447067.9    39933.5  11.195  < 2e-16 ***
## poly(age, 3)2 -478315.8    39933.5 -11.978  < 2e-16 ***
## poly(age, 3)3  125521.7    39933.5   3.143  0.00169 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39930 on 2996 degrees of freedom
## Multiple R-squared:  0.0851, Adjusted R-squared:  0.08419 
## F-statistic: 92.89 on 3 and 2996 DF,  p-value: < 2.2e-16

Digression: Choosing The Order

Using an ANOVA we can decide which polynomial order best fits the data. The F-test tests the null hypothesis that the restricted model M1 is sufficient to explain the data against the alternative of a more complex model M2.

model1 <- lm(wage ~ age, data = Wage)
model2 <- lm(wage ~ poly(age, 2), data = Wage)
model3 <- lm(wage ~ poly(age, 3), data = Wage)
model4 <- lm(wage ~ poly(age, 4), data = Wage)
model5 <- lm(wage ~ poly(age, 5), data = Wage)

anova(model1, model2, model3, model4, model5)
## Analysis of Variance Table
## 
## Model 1: wage ~ age
## Model 2: wage ~ poly(age, 2)
## Model 3: wage ~ poly(age, 3)
## Model 4: wage ~ poly(age, 4)
## Model 5: wage ~ poly(age, 5)
##   Res.Df        RSS Df  Sum of Sq        F    Pr(>F)    
## 1   2998 5.0222e+12                                     
## 2   2997 4.7934e+12  1 2.2879e+11 143.5931 < 2.2e-16 ***
## 3   2996 4.7777e+12  1 1.5756e+10   9.8888  0.001679 ** 
## 4   2995 4.7716e+12  1 6.0702e+09   3.8098  0.051046 .  
## 5   2994 4.7703e+12  1 1.2826e+09   0.8050  0.369682    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The model with the second-order polynomial is better than the simple linear model (the p-value is close to zero), and the cubic model with a third-order polynomial performs better than the quadratic model. The model with a fourth-order polynomial is only significant at the 10% level, whereas the fifth-order polynomial does not show any improvement. Also take into consideration that models with higher-order polynomials sacrifice interpretability!

Raw Polynomials

When using raw polynomials you obtain regression coefficients identical to those of a model that includes the powers of the regressor directly via I().

lm_poly_raw <- lm(wage ~ poly(age, 3, raw = TRUE), data = Wage)
lm_poly_check <- lm(wage ~ age + I(age^2) + I(age^3), data = Wage)

tibble(coefficient = names(coef(lm_poly_check)),
       raw_polynomial = coef(lm_poly_raw),
       check = coef(lm_poly_check))
## # A tibble: 4 x 3
##   coefficient raw_polynomial      check
##   <chr>                <dbl>      <dbl>
## 1 (Intercept)     -75244.    -75244.   
## 2 age              10190.     10190.   
## 3 I(age^2)          -168.      -168.   
## 4 I(age^3)             0.849      0.849

Scatterplot

Lastly, I create a new dataset (predicted.df) with the predicted values of wage for polynomials of different orders.

predicted.df <- 
  tibble(quadratic.fit = predict(lm(wage ~ poly(age, 2), data = Wage), Wage),
         cubic.fit = predict(lm(wage ~ poly(age, 3), data = Wage), Wage),
         quartic.fit = predict(lm(wage ~ poly(age, 4), data = Wage), Wage),
         quintic.fit = predict(lm(wage ~ poly(age, 5), data = Wage), Wage),
         age= Wage$age
         )

ggplot(Wage, aes(x = age, y = wage)) +
  geom_point(alpha=0.3) +
  geom_line(data=predicted.df, aes(x=age, y = quadratic.fit, col ="2fit"), size = 1) +
  geom_line(data=predicted.df, aes(x=age, y = cubic.fit, col="3fit"), size = 1) +
  geom_line(data=predicted.df, aes(x=age, y = quartic.fit, col="4fit"), size = 1) +
  geom_line(data=predicted.df, aes(x=age, y = quintic.fit, col="5fit"), size = 1) +
  scale_y_continuous(labels = scales::dollar) +
  scale_color_manual(values = c("2fit" = "green", "3fit" = "red", 
                                "4fit" = "blue", "5fit" = "yellow"),
                     labels = c("2nd order", "3rd order",
                                "4th order", "5th order")) +
  labs(x = "Age", y = "Wage", color="Polynomial")

The regression lines of the higher-order polynomials don’t seem to fit the data much better than a model with just a quadratic term (2nd-order polynomial).


6. Step functions

Step functions separate the range of a continuous variable into K distinct regions. Hence, a categorical (factor) variable is created and a constant is fit to the data within each region.

Model

Step functions can be applied by using the function cut() in the formula = argument of lm(). The argument breaks = controls the number of regions that are created. Here, I am cutting the variable age into 3 regions. With the function table() I check beforehand how many observations fall within each region.

table(cut(Wage$age, 3))
## 
## (17.9,38.7] (38.7,59.3] (59.3,80.1] 
##        1127        1663         210
lm_step <- lm(wage ~ cut(age, breaks = 3), data = Wage)
summary(lm_step)
## 
## Call:
## lm(formula = wage ~ cut(age, breaks = 3), data = Wage)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -98119 -25953  -6474  16500 212635 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       101695       1221  83.268  < 2e-16 ***
## cut(age, breaks = 3)(38.7,59.3]    16510       1582  10.437  < 2e-16 ***
## cut(age, breaks = 3)(59.3,80.1]    12239       3082   3.972  7.3e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 41000 on 2997 degrees of freedom
## Multiple R-squared:  0.03528,    Adjusted R-squared:  0.03464 
## F-statistic:  54.8 on 2 and 2997 DF,  p-value: < 2.2e-16

Visualization

Below I plot the stepwise-constant fitted regression line.

tibble(
  age = Wage$age,
  wage = Wage$wage,
  yhat = predict(lm_step, Wage)
  ) %>%
  ggplot(aes(x = age, y = wage)) +
  geom_point() +
  geom_line(aes(y = yhat), size = 1, col = "red") +
  scale_y_continuous(labels = scales::dollar) +
  labs(x = "Age", y = "Wage")


7. Regression Splines

Instead of fitting a higher-order polynomial over all values of a regressor \(x_i\), regression splines fit separate lower-order polynomials over different ranges of \(x_i\). The points where the coefficients change are called knots. The more knots you use, the more flexible the resulting fit.


7.1 Piecewise polynomial

First, a piecewise cubic polynomial with a single knot at age 50 is applied. To this end, I run two regressions: one on the subset of respondents younger than 50 years and one on respondents who are at least 50 years old.

Note: The step functions described in the previous chapter are piecewise polynomials of order zero!

Model

lm_piecewise_poly1 <- lm(wage ~ I(age) + I(age^2) + I(age^3),
                         data = Wage,
                         subset = age < 50)

lm_piecewise_poly2 <- lm(wage ~ I(age) + I(age^2) + I(age^3),
                         data = Wage,
                         subset = age >= 50)

Scatterplot

tibble(
  age = Wage$age,
  wage = Wage$wage,
  .fitted1 = ifelse(age < 50, predict(lm_piecewise_poly1, Wage), NA),
  .fitted2 = ifelse(age >= 50, predict(lm_piecewise_poly2, Wage), NA)
  ) %>%
  ggplot(aes(x = age, y = wage)) +
  geom_point() +
  geom_line(aes(y = .fitted1), size = 1, col = "red") +
  geom_line(aes(y = .fitted2), size = 1, col = "green") +
  geom_vline(xintercept = 50, linetype = "dashed",
             size = 1, color = "blue") +
  scale_y_continuous(labels = scales::dollar) +
  labs(x = "Age", y = "Wage")


7.2 Cubic spline

With the R package splines the function bs() can be used to fit regression splines. The argument knots = controls the number and position of the knots. Alternatively, argument df = might be used to place the knots at uniform quantiles of the data. The argument degree = determines the order of the spline.

Model

lm_cubic_spline_bs <-lm(wage ~ bs(age, knots = c(50), degree = 3), data = Wage)
summary(lm_cubic_spline_bs)
## 
## Call:
## lm(formula = wage ~ bs(age, knots = c(50), degree = 3), data = Wage)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -98641 -24785  -4901  15588 202827 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                            52515       4800  10.941  < 2e-16 ***
## bs(age, knots = c(50), degree = 3)1    68442       8862   7.723 1.53e-14 ***
## bs(age, knots = c(50), degree = 3)2    65597       6130  10.702  < 2e-16 ***
## bs(age, knots = c(50), degree = 3)3    66496      10114   6.574 5.74e-11 ***
## bs(age, knots = c(50), degree = 3)4    25759      13702   1.880   0.0602 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39900 on 2995 degrees of freedom
## Multiple R-squared:  0.08677,    Adjusted R-squared:  0.08555 
## F-statistic: 71.14 on 4 and 2995 DF,  p-value: < 2.2e-16

Scatterplot

tibble(age = Wage$age,
       wage = Wage$wage,
       .fitted = predict(lm_cubic_spline_bs, Wage)
       ) %>%
  ggplot(aes(x = age, y = wage)) +
  geom_point() +
  geom_line(aes(y = .fitted), size = 1, col = "red") +
  geom_vline(xintercept = 50, linetype = "dashed", size = 1, color = "blue") +
  scale_y_continuous(labels = scales::dollar) +
  labs(x = "Age", y = "Wage")

Interaction term

A cubic spline can also be created with an interaction term. This leads to a discontinuity in the third derivative at the knot, while the function remains continuous in the first and second derivatives. The discontinuity occurs at the knot of 50 years but is not visible in the plot.

lm_cubic_spline_int <- Wage %>%
  mutate(threshold = ifelse(age >= 50, 1, 0)) %$%
  lm(wage ~ age + I(age ^ 2) + I(age ^ 3) + threshold:I((age - 50) ^ 3))

summary(lm_cubic_spline_int)
## 
## Call:
## lm(formula = wage ~ age + I(age^2) + I(age^3) + threshold:I((age - 
##     50)^3))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -98641 -24785  -4901  15588 202827 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               -1.420e+05  3.611e+04  -3.931 8.64e-05 ***
## age                        1.589e+04  2.917e+03   5.448 5.51e-08 ***
## I(age^2)                  -3.217e+02  7.527e+01  -4.273 1.99e-05 ***
## I(age^3)                   2.164e+00  6.232e-01   3.472 0.000523 ***
## threshold:I((age - 50)^3) -3.721e+00  1.590e+00  -2.340 0.019327 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39900 on 2995 degrees of freedom
## Multiple R-squared:  0.08677,    Adjusted R-squared:  0.08555 
## F-statistic: 71.14 on 4 and 2995 DF,  p-value: < 2.2e-16

Scatterplot

tibble(age = Wage$age,
       wage = Wage$wage,
       .fitted = predict(lm_cubic_spline_int, Wage)
       ) %>%
  ggplot(aes(x = age, y = wage)) +
  geom_point() +
  geom_line(aes(y = .fitted), size = 1.5, col = "red") +
  geom_vline(xintercept = 50, linetype = "dashed", size = 1.5, color = "blue") +
  scale_y_continuous(labels = scales::dollar) +
  labs(x = "Age", y = "Wage")


7.3 Linear spline

A linear spline is a piecewise polynomial of degree one, i.e. a separate linear function is fit on each side of the knot.

Model

lm_linear_spline <- Wage %>%
  mutate(threshold = ifelse(age >= 50, 1, 0)) %$% 
  lm(wage ~ threshold*I(age-50))

summary(lm_linear_spline)
## 
## Call:
## lm(formula = wage ~ threshold * I(age - 50))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -103247  -24393   -5878   16305  204546 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           130317.8     1671.4  77.970  < 2e-16 ***
## threshold             -11205.9     2677.0  -4.186 2.92e-05 ***
## I(age - 50)             1558.9      108.6  14.356  < 2e-16 ***
## threshold:I(age - 50)  -1967.8      266.1  -7.396 1.82e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 40260 on 2996 degrees of freedom
## Multiple R-squared:  0.07002,    Adjusted R-squared:  0.06908 
## F-statistic: 75.19 on 3 and 2996 DF,  p-value: < 2.2e-16

Scatterplot

Wage %>% 
  select(age, wage) %>% 
  mutate(threshold = as.factor(ifelse(age >= 50, 1, 0))) %>% 
  ggplot(aes(x = age, y = wage, color = threshold)) +
  geom_point(color="grey") + 
  geom_smooth(method = "lm", se = FALSE) + 
  geom_vline(xintercept = 50, linetype = "dashed", size = 1, color = "blue") +
  scale_y_continuous(labels = scales::dollar) +
  labs(x = "Age", y = "Wage") +
  guides(color = FALSE)


7.4 Natural spline

Splines can have high variance at the tails of the regressor’s distribution (where x takes either very small or very large values). A natural spline is a regression spline with additional constraints at the boundaries: there the function is required to be linear, which produces more stable estimates.

Model

Natural splines can be applied with function ns() in the R package splines. I am using 4 degrees of freedom in this example. For comparison, I also compute a B-spline.

lm_spline_natural <- lm(wage ~ ns(age ,df = 4), data = Wage)

lm_spline_base <- lm(wage ~ bs(age, df = 4), data = Wage)

Scatterplot

tibble(age = Wage$age,
       wage = Wage$wage,
       fitted.base =  predict(lm_spline_base, se=T, Wage)$fit,
       se.base = predict(lm_spline_base, se=T, Wage)$se.fit,
       fitted.natural =  predict(lm_spline_natural, se=T, Wage)$fit,
       se.natural = predict(lm_spline_natural, se=T, Wage)$se.fit
       ) %>%
  ggplot(aes(x = age, y = wage)) +
  geom_point(color="grey") +
  geom_line(aes(y = fitted.base, color = "base"), size = 1) +
  geom_line(aes(y = fitted.base + 2*se.base), size = 0.5, col = "red", linetype="dashed") +
  geom_line(aes(y = fitted.base - 2*se.base), size = 0.5, col = "red", linetype="dashed") +
  geom_line(aes(y = fitted.natural, color = "natural"), size = 1) +
  geom_line(aes(y = fitted.natural + 2*se.natural), size = 0.5, col = "blue", linetype="dashed") +
  geom_line(aes(y = fitted.natural - 2*se.natural), size = 0.5, col = "blue", linetype="dashed") +
  scale_y_continuous(labels = scales::dollar) +
  scale_color_manual(values = c('base' = 'darkblue', 'natural' = 'red'),
                     labels = c("cubic spline", "natural cubic spline")) +
  labs(x = "Age", y = "Wage", color = "") +
  theme(legend.position = c(0.8, 0.8))

Digression: Choosing the number and position of knots

A regression spline is most flexible in regions with many knots because the polynomial coefficients can change rapidly there. Hence, more knots should be placed in regions where the function is expected to vary most, and fewer knots where the function is more stable. Although this may work, in practice knots are commonly placed uniformly: you specify the desired degrees of freedom and the knots are placed at uniform quantiles of the data.
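A small sketch of what this means in practice: with df = 6 and a cubic spline, bs() places 3 interior knots at the quartiles of age.

attr(bs(Wage$age, df = 6), "knots")             # interior knots chosen by bs()
quantile(Wage$age, probs = c(0.25, 0.5, 0.75))  # uniform quantiles of age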


8. Inverse functions

Although they may not be frequently applied in practice, inverse functions are easy to specify.

Model

I am regressing the respondents’ wage on the inverse of age. The calculation of the inverse is specified within the function I() since the / operator has a special meaning in the formula of lm(). A negative sign of the coefficient of the inverted age implies a positive impact on wage (and vice versa).

lm_age_inverse <- lm(wage ~ I(1 / age), data = Wage)
summary(lm_age_inverse)
## 
## Call:
## lm(formula = wage ~ I(1/age), data = Wage)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -99179 -24578  -5750  15855 203561 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   146165       2476   59.03   <2e-16 ***
## I(1/age)    -1345057      92267  -14.58   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 40330 on 2998 degrees of freedom
## Multiple R-squared:  0.06619,    Adjusted R-squared:  0.06588 
## F-statistic: 212.5 on 1 and 2998 DF,  p-value: < 2.2e-16
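The marginal effect implied by this model is not constant; a short sketch of its value at two different ages:

# dy/d(age) = -beta1 / age^2, so the effect of one additional year shrinks with age
b1 <- coef(lm_age_inverse)["I(1/age)"]
-b1 / c(30, 60)^2  # roughly $1,495 at age 30 and $374 at age 60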

Below you find the estimates of the original model without transforming age.

lm_age_normal  <- lm(wage ~ age, data = Wage)
summary(lm_age_normal)
## 
## Call:
## lm(formula = wage ~ age, data = Wage)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -100265  -25115   -6063   16601  205748 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 81704.74    2846.24   28.71   <2e-16 ***
## age           707.28      64.75   10.92   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 40930 on 2998 degrees of freedom
## Multiple R-squared:  0.03827,    Adjusted R-squared:  0.03795 
## F-statistic: 119.3 on 1 and 2998 DF,  p-value: < 2.2e-16

Scatterplot

ggplot(broom::augment(lm_age_inverse), 
       aes(x = `I(1/age)`, y = wage)) +
 geom_point(alpha=0.3) +
 geom_line(aes(y = .fitted), size = 1) +
 scale_y_continuous(labels = scales::dollar) +
 labs(x = "Age (inverted)", y = "Wage", color="")


References

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning. Vol. 112. Springer.