This tutorial assumes that you are already familiar with the theoretical basics of linear regression, like the Gauss-Markov assumptions, as well as how to run OLS in R. In case you are not, have a look at this tutorial. In the following sections I am going to discuss different approaches to modifying the functional form of a linear regression model, starting with very basic content and moving on to more sophisticated methods.
The following R packages are required for this tutorial.
library(dplyr)
library(tibble)
library(ggplot2)
library(magrittr)
library(broom)
library(purrr)
library(scales)
library(GGally)
library(splines)
All practical examples are built around the Mid-Atlantic Wage Dataset contained in R package ISLR
. It contains information on 3,000 male workers in the Mid-Atlantic region, like their yearly salary, age and other demographic characteristics. After reading the data, I convert wage
to full US dollars.
data("Wage", package = "ISLR")
Wage <-
Wage %>%
mutate(wage = wage*1000)
The standard linear regression model (without any transformations) implies that raising any regressor (\(x_i\)) by one unit leads to a change of the dependent variable (\(y\)) by \(β_i\) units. Now, any transformation of a regressor or the regressand will alter the interpretation.
We can distinguish four different types of models. Each is discussed in more detail throughout this section.
Model | Equation | Regressand | Regressor | Interpretation |
---|---|---|---|---|
Level-Level | \(y = \beta_0 + \beta_1 x + u\) | \(y\) | \(x\) | Raising x by one unit changes (c. p.) y on average by \(\beta_1\) units. |
Log-Level | \(\log_e(y) = \beta_0 + \beta_1 x + u\) | \(\log_e(y)\) | \(x\) | Raising x by one unit changes (c. p.) y on average approximately by \(100*\beta_1\)% and exactly by \(100*(e^{\beta_1}-1)\)%. |
Level-Log | \(y = \beta_0 + \beta_1 \log_e(x) + u\) | \(y\) | \(\log_e(x)\) | Raising x by one percent changes (c. p.) y on average approximately by \(\beta_1/100\) units and exactly by \(\beta_1*\log_e(1 + 1/100)\) units (semi-elasticity). |
Log-Log | \(\log_e(y) = \beta_0 + \beta_1 \log_e(x) + u\) | \(\log_e(y)\) | \(\log_e(x)\) | Raising x by one percent changes (c. p.) y on average by \(\beta_1\)% (elasticity). |
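For example, the exact formula in the log-level row follows directly from the model equation: raising x by one unit changes \(\log_e(y)\) by exactly \(\beta_1\), so the ratio of new to old y is \(e^{\beta_1}\) and the exact percentage change is \(100*(e^{\beta_1}-1)\). The approximation \(100*\beta_1\) works well only for small \(\beta_1\).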
This is the standard linear regression model and I am regressing wage
on age
. The variable wage
measures a respondent’s salary for a specific year in US-Dollar and the variable age
measures a respondent’s age in years.
levlev <- lm(wage ~ age, data = Wage)
summary(levlev)
##
## Call:
## lm(formula = wage ~ age, data = Wage)
##
## Residuals:
## Min 1Q Median 3Q Max
## -100265 -25115 -6063 16601 205748
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 81704.74 2846.24 28.71 <2e-16 ***
## age 707.28 64.75 10.92 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 40930 on 2998 degrees of freedom
## Multiple R-squared: 0.03827, Adjusted R-squared: 0.03795
## F-statistic: 119.3 on 1 and 2998 DF, p-value: < 2.2e-16
The function augment()
from R package broom
takes an lm
object and transforms its content into a tidy format. The model’s predicted values are found in column .fitted
.
ggplot(data = broom::augment(levlev),
aes(x = age, y = wage)) +
geom_point() +
geom_line(aes(y = .fitted), size = 1, color = "red") +
scale_y_continuous(labels = scales::dollar) +
labs(x = "Age", y = "Wage")
First, I am creating a new dataset (age.df
) with a column age
whose values range from 0 to 100. With function predict()
I am able to use the already specified model levlev
to predict the wage
for each value of age
. Hence, I am generating predicted values for wage
which are outside the boundaries of the respondents’ age
in the original dataset. There, the lowest and highest values of age
are 18 and 80 respectively. This extrapolation is not very meaningful since working respondents will rarely be younger than 18 or older than 80. However, it highlights the irrelevance of interpreting the constant in this model. The intercept
is hence the intersection of the model’s regression line with the y-axis, reflected by respondents that are hypothetically 0 years old.
age.df <- tibble(age = 0:100)
age.df$yhat <- predict(levlev, newdata = age.df)
ggplot(data = Wage, aes(x = age, y = wage)) +
geom_point() +
geom_line(data = age.df,
aes(x = age, y = yhat),
size = 1, color = "red") +
scale_x_continuous(expand = expansion(mult = c(0, 0))) +
scale_y_continuous(labels = scales::dollar) +
labs(x = "Age", y = "Wage")
Raising the respondents’ age
by 1 year changes their wage
on average by $707 (remember that this is the slope of the regression line). As shown in the digression, the intercept
is the predicted value of wage
for respondents aged 0 years ($81,705).
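Both statements can be checked numerically with predict(); a small sketch:

# Predicted wage at age 0 (the intercept) and the difference between two
# consecutive ages (the slope, about $707)
predict(levlev, newdata = tibble(age = c(0, 41, 42)))
diff(predict(levlev, newdata = tibble(age = c(41, 42))))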
Now we’ll discuss the first type of transformation in more detail. For this model I’m computing the logarithm of wage
and regressing it on age
.
loglev <- lm(log(wage) ~ age, data = Wage)
summary(loglev)
##
## Call:
## lm(formula = log(wage) ~ age, data = Wage)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.65779 -0.20059 0.00439 0.20045 1.13716
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.128e+01 2.388e-02 472.41 <2e-16 ***
## age 6.640e-03 5.432e-04 12.22 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3434 on 2998 degrees of freedom
## Multiple R-squared: 0.04748, Adjusted R-squared: 0.04716
## F-statistic: 149.4 on 1 and 2998 DF, p-value: < 2.2e-16
ggplot(broom::augment(loglev),
aes(x = age, y = `log(wage)`)) +
geom_point() +
geom_line(aes(y = .fitted), size = 1, color = "red") +
labs(x = "Age", y = "Log(Wage)")
Raising the respondents’ age
by 1 year changes their wage
on average by approximately 0.664%. The exact change in wage
is 0.666% on average (calculation below).
exp(loglev$coefficients[2])-1
## age
## 0.006662229
Next, I compute the logarithm of age
and use it instead of the original variable in the regression model.
levlog <- lm(wage ~ log(age), data = Wage)
summary(levlog)
##
## Call:
## lm(formula = wage ~ log(age), data = Wage)
##
## Residuals:
## Min 1Q Median 3Q Max
## -100346 -24828 -5958 16216 204959
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11500 9555 -1.204 0.229
## log(age) 33228 2569 12.933 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 40620 on 2998 degrees of freedom
## Multiple R-squared: 0.05285, Adjusted R-squared: 0.05253
## F-statistic: 167.3 on 1 and 2998 DF, p-value: < 2.2e-16
ggplot(broom::augment(levlog),
aes(x = `log(age)`, y = wage)) +
geom_point() +
geom_line(aes(y = .fitted), size = 1, color = "red") +
scale_y_continuous(labels = scales::dollar) +
labs(x = "Log(Age)", y = "Wage")
Raising the respondents' age by 1% changes their wage on average by approximately $332.28 (\(\beta_1/100\)); the exact change is \(\beta_1*\log_e(1.01) \approx\) $330.63. For larger changes the approximation breaks down: doubling age (an increase of 100%), for example, changes wage by \(\beta_1*\log_e(2) \approx\) $23,032 rather than by the full coefficient of $33,228.
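Both the approximate and the exact effect can be computed from the fitted model; a small sketch:

# Approximate effect of a 1% increase in age (beta1 / 100) and the exact
# effect beta1 * log(1.01), per the level-log row of the table above
coef(levlog)["log(age)"] / 100
coef(levlog)["log(age)"] * log(1 + 1/100)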
Finally, I compute the logarithm of both wage
and age
and run the model again.
loglog <- lm(log(wage) ~ log(age), data = Wage)
summary(loglog)
##
## Call:
## lm(formula = log(wage) ~ log(age), data = Wage)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.67076 -0.19396 0.00286 0.19475 1.13857
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.39102 0.07993 130.00 <2e-16 ***
## log(age) 0.31572 0.02149 14.69 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3398 on 2998 degrees of freedom
## Multiple R-squared: 0.06714, Adjusted R-squared: 0.06683
## F-statistic: 215.8 on 1 and 2998 DF, p-value: < 2.2e-16
ggplot(broom::augment(loglog),
aes(x = `log(age)`, y = `log(wage)`)) +
geom_point() +
geom_line(aes(y = .fitted), size = 1, color = "red") +
labs(x = "Log(Age)", y = "Log(Wage)")
Raising the respondents’ age
by 1% changes their wage
on average by 0.32%.
In the previous section I only used the numerical variables wage
and age
, which are both ratio-scaled (there is no such thing as being -1 years old or earning -1,000 US dollars). Now, I'm going to add dichotomous variables with exactly two distinct values (often referred to as dummy variables) and variables with at least three distinct values (often referred to as factor or categorical variables) to the regression model.
A dummy variable has exactly two values (levels). In the model below I am regressing the respondents’ wage
on their health
and age
. The variable health
indicates the health level of a worker and has two levels: <= Good
and >= Very Good
.
The lm()
function treats dummy variables as intended if they are either of class numeric
, factor
or character
, so usually this should not cause any trouble.
lm_dummy <- lm(wage ~ health + age, data = Wage)
summary(lm_dummy)
##
## Call:
## lm(formula = wage ~ health + age, data = Wage)
##
## Residuals:
## Min 1Q Median 3Q Max
## -100786 -24871 -5446 16599 215826
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 65739.28 3198.62 20.55 <2e-16 ***
## health2. >=Very Good 16900.00 1641.32 10.30 <2e-16 ***
## age 799.20 64.27 12.44 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 40230 on 2997 degrees of freedom
## Multiple R-squared: 0.07113, Adjusted R-squared: 0.07051
## F-statistic: 114.8 on 2 and 2997 DF, p-value: < 2.2e-16
When plotting this model we see two fitted regression lines, one for each level of health. The slopes of both regression lines are identical but they have different intercepts with the y-axis. The health
dummy is added in the argument color =
.
ggplot(augment(lm_dummy),
aes(x = age, y = wage, color = health)) +
geom_point(alpha = 0.5) +
geom_line(aes(y = .fitted), size = 1) +
scale_y_continuous(labels = scales::dollar) +
scale_color_manual(values = c("red","darkgreen")) +
labs(x = "Age", y = "Wage", color = "Health")
Interpretation of dummy variables in the linear model is always with respect to a so-called reference or base group (one level of the dummy variable). This level does not appear with a coefficient in the regression output. Regarding the dummy variable health
, respondents with health condition <= Good
are the reference group. The regression output hence displays the coefficient for respondents with health condition >= Very Good
, indicating the average change of wage
compared to the reference group by $16,900. The effect of respondents’ age
on their wage
is independent of their health condition ($799) but different from the value we’ve found in section 2 without using health
in the model. Some of the variation in a respondent’s wage
is explained with health
, hence the magnitude of the age
coefficient changes. We might call health
a control variable in this context.
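If you prefer the other level as base group, you can relevel the factor before (or inside) the call to lm(); a small sketch, with the level label copied from the regression output above:

# Make ">= Very Good" the reference group; the age coefficient stays the same,
# the health coefficient switches its sign
lm_dummy_relevel <- lm(wage ~ relevel(health, ref = "2. >=Very Good") + age,
                       data = Wage)
coef(lm_dummy_relevel)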
A categorical or factor variable has, compared to a dummy variable, at least three levels. In the following model I am regressing the respondents’ wage
on their level of education
and age
. The categorical variable education
has 5 levels. Often a logical ordering of the categories eases the interpretation of the model (in the Wage
dataset the categories already range from lower to higher education
levels). However, it doesn’t really matter with respect to the specification of the model or the estimation of the coefficients.
The lm()
function does recognize categorical variables when they are of class factor
or character
. For example, you can check this before running the model with function is.factor()
). Otherwise a categorical variable has to be converted either beforehand or within the lm()
function, or R will treat it like a numeric variable (like age
).
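For example:

# Check the class (and levels) of education before fitting; converting inside
# the formula, e.g. factor(education), would also work
is.factor(Wage$education)
levels(Wage$education)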
lm_faktor <- lm(wage ~ education + age, data=Wage)
summary(lm_faktor)
##
## Call:
## lm(formula = wage ~ education + age, data = Wage)
##
## Residuals:
## Min 1Q Median 3Q Max
## -110033 -19635 -3907 14441 220408
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 60335.97 3245.71 18.589 < 2e-16 ***
## education2. HS Grad 11438.65 2480.25 4.612 4.16e-06 ***
## education3. Some College 24167.00 2609.76 9.260 < 2e-16 ***
## education4. College Grad 39766.77 2590.31 15.352 < 2e-16 ***
## education5. Advanced Degree 64986.56 2808.38 23.140 < 2e-16 ***
## age 568.69 57.19 9.943 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35940 on 2994 degrees of freedom
## Multiple R-squared: 0.2593, Adjusted R-squared: 0.2581
## F-statistic: 209.6 on 5 and 2994 DF, p-value: < 2.2e-16
Now we obtain 5 different regression lines when plotting the model - one line for each level of education. Again, their slopes are identical while their intercepts with the y-axis are different.
ggplot(augment(lm_faktor),
aes(x = age, y = wage, color = education)) +
geom_point(alpha = 0.5) +
geom_line(aes(y = .fitted), size = 1) +
scale_y_continuous(labels = scales::dollar) +
labs(x = "Age", y = "Wage", color = "Education")
Interpreting coefficients of categorical variables with many levels is often cumbersome. However, there is a visualization technique called a coefficient plot, which eases the interpretation. It can be created with function ggcoef()
from R package GGally
. Each blue dot represents the point estimate of the corresponding regression coefficient. The line passing through each dot represents an error bar of the associated 95% confidence interval.
ggcoef(lm_faktor, vline = FALSE, exclude_intercept = TRUE,
errorbar_height = .2, color = "blue") +
scale_x_continuous(labels = scales::dollar) +
labs(x = "Estimated increase in wage", y = "")
It is important to note that when using summary()
on an lm
object, the coefficients of each level of the factor variable are only displayed when the factor is not ordered. For an ordered factor the orthogonal polynomial contrasts are displayed. The L
gives a measure of the linear trend, Q
and C
specify quadratic and cubic terms and so forth. You may use the function is.ordered()
to check your categorical variables before running your regression model.
lm_faktor_ordered <- lm(wage ~ factor(education, ordered = TRUE) + age, data=Wage)
summary(lm_faktor_ordered)
##
## Call:
## lm(formula = wage ~ factor(education, ordered = TRUE) + age,
## data = Wage)
##
## Residuals:
## Min 1Q Median 3Q Max
## -110033 -19635 -3907 14441 220408
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 88407.77 2537.89 34.835 < 2e-16 ***
## factor(education, ordered = TRUE).L 50059.25 1865.18 26.839 < 2e-16 ***
## factor(education, ordered = TRUE).Q 8133.75 1747.09 4.656 3.37e-06 ***
## factor(education, ordered = TRUE).C 2634.28 1439.92 1.829 0.0674 .
## factor(education, ordered = TRUE)^4 617.56 1368.37 0.451 0.6518
## age 568.69 57.19 9.943 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35940 on 2994 degrees of freedom
## Multiple R-squared: 0.2593, Adjusted R-squared: 0.2581
## F-statistic: 209.6 on 5 and 2994 DF, p-value: < 2.2e-16
As with dummy variables the coefficients of categorical variables are interpreted with respect to a reference group, which is formed here by respondents with an education
level of < HS Grad
. For example, respondents with an education
level of Advanced Degree
earn, compared to the base group, a wage
which is on average $64,987 higher (in the scatterplot this is reflected by the distance between the red line and the purple line).
Again note that the effect of the respondents’ age
on their wage
is independent of their education
($569)!
With an interaction (term) you can model that the influence of one regressor \(x_i\) on the dependent variable \(y\) systematically depends on another regressor \(x_j\). For example, in the above section the influence of a respondent's age
on her wage
was independent of her education
. With an interaction of age
and education
you would be able to see if there is any codependence between them. There are three different types of interactions that will be covered in this section.
In the first model I am regressing the respondents’ wage
on the variables jobclass
and health_ins
. Remember that binary variables (dummies) have exactly two levels.
In the formula =
argument of function lm()
interactions can be specified with the *
operator. This operator also adds the base effect of each variable to the model. Alternatively you can use the operator :
, which only includes the interaction term. If you want to include the base effects in your model as well, the variables have to be explicitly specified. Below I’ve specified both variants.
lm_interact_bin <- lm(wage ~ jobclass * health_ins, data = Wage)
lm_interact_bin <- lm(wage ~ jobclass + health_ins + jobclass:health_ins, data = Wage)
summary(lm_interact_bin)
##
## Call:
## lm(formula = wage ~ jobclass + health_ins + jobclass:health_ins,
## data = Wage)
##
## Residuals:
## Min 1Q Median 3Q Max
## -94816 -24274 -6982 16425 210443
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 112255 1257 89.325 <2e-16 ***
## jobclass2. Information 14927 1718 8.686 <2e-16 ***
## health_ins2. No -23990 2059 -11.650 <2e-16 ***
## jobclass2. Information:health_ins2. No -4064 3176 -1.279 0.201
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39120 on 2996 degrees of freedom
## Multiple R-squared: 0.122, Adjusted R-squared: 0.1211
## F-statistic: 138.8 on 3 and 2996 DF, p-value: < 2.2e-16
The interaction between two binary variables can be visualized with a line plot. First, I am calculating the groupwise means of wage
for each combination of the levels of jobclass
and health_ins
. Then I am assigning health_ins
to the x-axis and the mean of wage
to the y-axis. For each level of jobclass
a separate line is drawn. The slopes of the two lines differ!
Wage %>%
group_by(jobclass, health_ins) %>%
summarise(mean_wage = mean(wage, na.rm = T)) %>%
ggplot(aes(x = health_ins, y = mean_wage, group = jobclass)) +
geom_line(aes(color = jobclass), size = 1.2) +
scale_y_continuous(labels = scales::dollar) +
labs(x = "Health insurance?", y = "Wage (mean)", color = "Type of job")
Sometimes this type of interaction is also called a differences-in-differences estimator as it measures the difference of the difference within each dummy variable regarding the outcome wage
. This can be shown with a contingency table.
Jobclass | Health insurance | No health insurance | Difference | DiD |
---|---|---|---|---|
Industrial | $112,255 | $88,265 | $23,990 | $-4,064 |
Information | $127,182 | $99,129 | $28,054 | |
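The table can be reproduced from the groupwise means; a small sketch (the level labels follow the regression output above):

# Group means of wage for every jobclass x health_ins combination
means <- with(Wage, tapply(wage, list(jobclass, health_ins), mean))
means

# Within-jobclass difference: no insurance minus insurance
d <- means[, "2. No"] - means[, "1. Yes"]
d

# Difference-in-differences, equal to the interaction coefficient (about -$4,064)
d["2. Information"] - d["1. Industrial"]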
First, in this specific model the intercept
has a meaningful interpretation! It represents the mean for the reference group of each dummy variable. Respondents with health_ins
urance employed in the Industrial
sector have on average a wage
of $112,255.
The coefficient of jobclass
measures the average change of wage
for health_ins
ured respondents between the sectors Information
and Industrial
. Respondents working in the Industrial
sector form the reference group, hence the wage
for health_ins
ured respondents working in the Information
sector changes on average by $14,927.
The coefficient of health_ins
measures the average change of wage
for respondents working in the Industrial
sector depending on whether they have a health insurance or not. Having no health insurance changes their wage
on average by $-23,990.
The interaction between the two dummy variables measures the job-specific difference between respondents with and without health_ins
urance. Having no health_ins
urance has a stronger (negative) effect on the respondents’ wage
in the Information
sector; the additional effect is on average $-4,064. This can also be seen in the line plot, as the slope of the Information
sector is a little steeper. However, this interaction is not statistically significant!
In the next model the respondents’ wage
is regressed on the dummy variable health_ins
and the continuous variable age
as well as the interaction term between both variables. In contrast to the interaction term between two binary variables there are some differences regarding its specification and interpretation. Initially, I am centering age
around its mean value in order to ease the interpretation of the model coefficients.
To mean-center the respondents’ age
you have two options. Either you do it beforehand and add this variable to the Wage
dataset or you specify it directly within the formula =
argument of function lm()
. The latter requires wrapping the expression inside the function I()
. Otherwise a minus sign (-
) is interpreted as removing a parameter from the model.
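The first option (creating the centered variable beforehand) could look like this; a minimal sketch, where age_c is just an illustrative column name:

# Equivalent to using I(age - mean(age)) inside the formula
Wage_c <- Wage %>% mutate(age_c = age - mean(age))
lm(wage ~ health_ins * age_c, data = Wage_c)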
lm_interact_bin_con <- lm(wage ~ health_ins*I(age-mean(age)), data = Wage)
summary(lm_interact_bin_con)
##
## Call:
## lm(formula = wage ~ health_ins * I(age - mean(age)), data = Wage)
##
## Residuals:
## Min 1Q Median 3Q Max
## -92638 -23793 -6467 15560 204373
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 119746.29 862.37 138.857 < 2e-16 ***
## health_ins2. No -25575.19 1575.40 -16.234 < 2e-16 ***
## I(age - mean(age)) 450.65 78.69 5.727 1.12e-08 ***
## health_ins2. No:I(age - mean(age)) 297.07 129.86 2.288 0.0222 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39160 on 2996 degrees of freedom
## Multiple R-squared: 0.1201, Adjusted R-squared: 0.1192
## F-statistic: 136.3 on 3 and 2996 DF, p-value: < 2.2e-16
Again, the slopes of the two regression lines differ. The fitted line representing respondents without health insurance is steeper!
ggplot(broom::augment(lm_interact_bin_con),
aes(x = `I(age - mean(age))`, y = wage, color = health_ins)) +
geom_point(alpha = 0.5) +
geom_line(aes(y = .fitted), size = 1) +
geom_vline(xintercept = 0) +
scale_y_continuous(labels = scales::dollar) +
scale_color_manual(values = c("darkgreen", "red")) +
labs(x = "Mean centered age (0 = 42 years)", y = "Wage", color = "Health insurance?")
The conditional effects plot shows regression lines for different combinations of the regressors used in the model. The effect of each variable can be obtained via the distance between the lines and their slope. Centering the age around its mean value is not as important here as it is for the interpretation of the model output. First we have to create a dataset with different values of age
and different states of health_ins
urance. Then I compute the linear predictions using the specified interaction model as well as corresponding standard errors. With function geom_ribbon()
I add the confidence bands around the regression lines for each state of health_ins
urance. From the plot we can see that the increase of wage
along increasing values of age
is stronger for respondents without health_ins
urance than for respondents with it.
tibble(
age = rep(seq(from = 20, to = 80, by = 5), 2),
health_ins = factor(rep(c("1. Yes", "2. No"), each = 13))
) %>%
mutate(fit = predict(lm_interact_bin_con, newdata = ., type = "response"),
se = predict(lm_interact_bin_con, newdata = ., type = "response", se = TRUE)$se.fit,
ll = fit - (1.96 * se), ul = fit + (1.96 * se)) %>%
ggplot(aes(x = age, y = fit)) +
geom_ribbon(aes(ymin = ll, ymax = ul, fill = health_ins), alpha = 0.2) +
geom_line(aes(colour = health_ins), size = 1) +
scale_y_continuous(labels = scales::dollar) +
scale_color_manual(values = c("darkgreen", "red")) +
scale_fill_manual(values = c("darkgreen", "red")) +
labs(x = "Age", y = "Predicted wage", fill = "Health insurance?", color = "Health insurance?")
For respondents of average age
the coefficient of health_ins
measures the average change of wage
when shifting from respondents with health_ins
urance to respondents without it ($-25,575).
Raising the age
of a respondent with health insurance (the reference group) by 1 year leads to an average change of wage
by $450.65.
The interaction term shows the effect of age
on wage
depending on the health insurance status of the respondent. Raising the age
of respondents without health_ins
urance by 1 year above the mean-centered age, leads to an average change of their wage
by additional $297.07. Hence, with increasing age
the wage
of respondents without health_ins
urance rises faster than the wage
of respondents with health_ins
urance.
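The two group-specific slopes can be pieced together from the coefficients; a small sketch (positions follow the summary output above):

b <- coef(lm_interact_bin_con)
b[3]         # with health insurance (reference group): about $451 per year
b[3] + b[4]  # without health insurance: about $451 + $297 = $748 per year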
For the last model, I am regressing the respondents’ wage
on the continuous variable age
and its quadratic term. This is in fact an interaction between age
and age
. However, different continuous variables can be interacted with each other and the quadratic term is only a special case of this interaction.
Mathematically, a squared term lets the marginal effect of \(x_i\) on \(y\) vary with the level of \(x_i\). Consider the model \(y = \beta_0 + \beta_1 x + \beta_2 x^2\): the marginal effect is \(\beta_1 + 2\beta_2 x\), which equals zero at \(x = -\beta_1/(2\beta_2)\). This point is either a maximum (if \(\beta_2 < 0\)) or a minimum (if \(\beta_2 > 0\)) of the fitted curve.
In the formula =
argument of function lm()
the quadratic term has to be wrapped in the function I()
as the operator ^
has a special meaning in R.
lm_interact_con_con <- lm(wage ~ age + I(age^2), data = Wage)
summary(lm_interact_con_con)
##
## Call:
## lm(formula = wage ~ age + I(age^2), data = Wage)
##
## Residuals:
## Min 1Q Median 3Q Max
## -99126 -24309 -5017 15494 205621
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -10425.224 8189.780 -1.273 0.203
## age 5294.030 388.689 13.620 <2e-16 ***
## I(age^2) -53.005 4.432 -11.960 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39990 on 2997 degrees of freedom
## Multiple R-squared: 0.08209, Adjusted R-squared: 0.08147
## F-statistic: 134 on 2 and 2997 DF, p-value: < 2.2e-16
Plotting the fitted values of this model against age
shows the inverse U-shaped curve due to the inclusion of the squared term. I also added predicted values for age
outside the boundaries of the Wage dataset to show the intersection with the y-axis (intercept
).
age.df <- tibble(age = 0:100)
age.df$yhat <- predict(lm_interact_con_con, newdata = age.df)
ggplot(data = Wage, aes(x = age, y = wage)) +
geom_point() +
geom_line(data = age.df,
aes(x = age, y = yhat), size = 1, color = "red") +
scale_x_continuous(expand = expansion(mult = c(0, 0))) +
scale_y_continuous(labels = scales::dollar) +
labs(x = "Age", y = "Wage")
The coefficient of age ($5,294) is the marginal effect of one additional year of age on the average wage only at age zero. Raising age from 0 to 1 year increases the predicted wage by \(\beta_1 + \beta_2 \approx\) $5,241. Because the coefficient of the quadratic term is negative, this effect declines with every additional year: the change from age \(a\) to \(a + 1\) is \(\beta_1 + \beta_2(2a + 1)\). The lowest age present in the dataset is 18 years; raising age from 18 to 19 years increases the predicted wage by roughly $3,333, and raising the age of 30-year-old respondents to 31 years increases it by only about $2,061.
Also, the turning point (the maximum of the fitted parabola) can be calculated: it occurs at \(-\beta_1/(2\beta_2) \approx\) 50 years.
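A small sketch to verify these numbers from the fitted model (coefficient names as printed in the output above):

b <- coef(lm_interact_con_con)
-b["age"] / (2 * b["I(age^2)"])   # turning point, about 50 years

# Exact change in the predicted wage for one additional year of age
diff(predict(lm_interact_con_con, newdata = tibble(age = c(18, 19))))
diff(predict(lm_interact_con_con, newdata = tibble(age = c(30, 31))))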
With statistical software more complicated models can easily be calculated. As the title of this section suggests, three-way interactions create a ménage à trois between regressors. Starting from the interaction between health_ins
and age
I am adding jobclass
to the mix. This is hence an interaction between a continuous variable and two dummy variables.
The interaction is specified as seen in the section before with the *
operator. Again I’m centering age
around its mean value. The function tidy()
in R package broom
reduces the size of the regression output we usually create with function summary()
. We’ve already discussed the interaction between age
and health_ins
urance in section 4.2. Now the only thing that changes is that also jobclass
has to be taken into account. Hence, jobclass is interacted with both age and health_ins, and there is an interaction term of all three variables (row 8).
lm_interact_3way <- lm(wage ~ health_ins*I(age-mean(age))*jobclass, data = Wage)
tidy(lm_interact_3way)
## # A tibble: 8 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 111967. 1243. 90.0 0
## 2 health_ins2. No -21206. 2095. -10.1 1.06e-23
## 3 I(age - mean(age)) 482. 112. 4.31 1.71e- 5
## 4 jobclass2. Information 14656. 1707. 8.59 1.43e-17
## 5 health_ins2. No:I(age - mean(age)) 186. 174. 1.07 2.83e- 1
## 6 health_ins2. No:jobclass2. Information -6007. 3183. -1.89 5.92e- 2
## 7 I(age - mean(age)):jobclass2. Informati~ -115. 156. -0.742 4.58e- 1
## 8 health_ins2. No:I(age - mean(age)):jobc~ 207. 261. 0.793 4.28e- 1
The construction of an appropriate dataset to draw the conditional effects plot is a bit more complex with three different variables. I'm using the function rep_along()
from R package purrr
to nest the categories of jobclass
inside the categories of health_ins
urance.
tibble(
age = rep(seq(from = 20, to = 80, by = 5), 4),
health_ins = factor(rep(c("1. Yes", "2. No"), each = 26)),
jobclass = factor(rep_along(age, c("1. Industrial", "2. Information")))
) %>%
mutate(fit = predict(lm_interact_3way, newdata = ., type = "response"),
se = predict(lm_interact_3way, newdata = ., type = "response", se = TRUE)$se.fit,
ll = fit - (1.96 * se), ul = fit + (1.96 * se)) %>%
ggplot(aes(x = age, y = fit)) +
geom_ribbon(aes(ymin = ll, ymax = ul, fill = health_ins), alpha = 0.2) +
geom_line(aes(colour = health_ins), size = 1) +
facet_wrap(. ~ jobclass) +
scale_y_continuous(labels = scales::dollar) +
scale_color_manual(values = c("darkgreen", "red")) +
scale_fill_manual(values = c("darkgreen", "red")) +
labs(x = "Age", y = "Predicted wage", fill = "Health insurance?", color = "Health insurance?")
Let us first take a look at the results within each level of jobclass
. In the Industrial
sector the slopes for each level of health_ins
urance are nearly identical. Hence, a respondent’s wage
increases along increasing values of age
and it doesn’t really matter whether she is health_ins
ured or not. In the Information
sector, however, a respondent’s wage
increases more strongly when she has no health_ins
urance along increasing values of age
. Hence, this phenomenon is rather attributable to respondents employed in this sector. You find this in row 6 of the regression output. The p-value almost indicates significance at the 5% level.
Furthermore, we can make comparisons between the levels of jobclass
. The wage
of a respondent without health_ins
urance increases at a similar rate in the Industrial
and Information
sector (the slopes are nearly the same) along increasing values of age
.
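The group-specific slopes of age can again be pieced together from the coefficients; a small sketch (positions follow the tidy() output above):

b <- coef(lm_interact_3way)
tibble(
  jobclass   = rep(c("1. Industrial", "2. Information"), each = 2),
  health_ins = rep(c("1. Yes", "2. No"), times = 2),
  age_slope  = c(b[3],                       # Industrial, insured
                 b[3] + b[5],                # Industrial, not insured
                 b[3] + b[7],                # Information, insured
                 b[3] + b[5] + b[7] + b[8])  # Information, not insured
)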
Also, compare this plot to the one we’ve created in section 4.2.
Polynomials extend the linear regression model by raising the included regressors to a higher power. For example, a cubic regression model uses \(x_i\), \(x_i^2\) and \(x_i^3\) as regressors. This is a fairly easy way to model non-linear relationships in the data.
The function poly()
can be used to apply polynomials in the formula =
argument of function lm()
. In the model below, I am regressing the respondents’ wage on the third order polynomial of their age
. By default the function poly()
creates orthogonal polynomials instead of raw polynomials. This does not change the fitted values but has the advantage that you can see whether a certain order in the polynomial significantly improves the regression over the lower orders.
lm_poly_ort <- lm(wage ~ poly(age, 3), data = Wage)
summary(lm_poly_ort)
##
## Call:
## lm(formula = wage ~ poly(age, 3), data = Wage)
##
## Residuals:
## Min 1Q Median 3Q Max
## -99693 -24562 -5222 15096 206119
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 111703.6 729.1 153.211 < 2e-16 ***
## poly(age, 3)1 447067.9 39933.5 11.195 < 2e-16 ***
## poly(age, 3)2 -478315.8 39933.5 -11.978 < 2e-16 ***
## poly(age, 3)3 125521.7 39933.5 3.143 0.00169 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39930 on 2996 degrees of freedom
## Multiple R-squared: 0.0851, Adjusted R-squared: 0.08419
## F-statistic: 92.89 on 3 and 2996 DF, p-value: < 2.2e-16
Using an ANOVA we can decide which polynomial best fits the data. The F-test tests the null hypothesis that the restricted model M1 is sufficient to explain the data against the alternative that the more complex model M2 is required.
model1 <- lm(wage ~ age, data = Wage)
model2 <- lm(wage ~ poly(age, 2), data = Wage)
model3 <- lm(wage ~ poly(age, 3), data = Wage)
model4 <- lm(wage ~ poly(age, 4), data = Wage)
model5 <- lm(wage ~ poly(age, 5), data = Wage)
anova(model1, model2, model3, model4, model5)
## Analysis of Variance Table
##
## Model 1: wage ~ age
## Model 2: wage ~ poly(age, 2)
## Model 3: wage ~ poly(age, 3)
## Model 4: wage ~ poly(age, 4)
## Model 5: wage ~ poly(age, 5)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2998 5.0222e+12
## 2 2997 4.7934e+12 1 2.2879e+11 143.5931 < 2.2e-16 ***
## 3 2996 4.7777e+12 1 1.5756e+10 9.8888 0.001679 **
## 4 2995 4.7716e+12 1 6.0702e+09 3.8098 0.051046 .
## 5 2994 4.7703e+12 1 1.2826e+09 0.8050 0.369682
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The model with the second order polynomial is better than the simple linear model (the p value is close to zero). Also the cubic model with a third order polynomial performs better than the quadratic model. The model with a fourth order polynomial is only significant at the 10% level whereas the fifth order polynomial does not show any improvement. Also take into consideration that when using models with higher order polynomials you sacrifice the interpretability of the model!
When using raw polynomials you obtain regression coefficients that are identical to those from a model that includes the second and third power of the regressor directly.
lm_poly_raw <- lm(wage ~ poly(age, 3, raw = TRUE), data = Wage)
lm_poly_check <- lm(wage ~ age + I(age^2) + I(age^3), data = Wage)
tibble(coefficient = names(coef(lm_poly_check)),
raw_polynomial = coef(lm_poly_raw),
check = coef(lm_poly_check))
## # A tibble: 4 x 3
## coefficient raw_polynomial check
## <chr> <dbl> <dbl>
## 1 (Intercept) -75244. -75244.
## 2 age 10190. 10190.
## 3 I(age^2) -168. -168.
## 4 I(age^3) 0.849 0.849
Finally, I am creating a new dataset (predicted.df
) with the predicted values of wage
for polynomials of different orders.
predicted.df <-
tibble(quadratic.fit = predict(lm(wage ~ poly(age, 2), data = Wage), Wage),
cubic.fit = predict(lm(wage ~ poly(age, 3), data = Wage), Wage),
quartic.fit = predict(lm(wage ~ poly(age, 4), data = Wage), Wage),
quintic.fit = predict(lm(wage ~ poly(age, 5), data = Wage), Wage),
age= Wage$age
)
ggplot(Wage, aes(x = age, y = wage)) +
geom_point(alpha=0.3) +
geom_line(data=predicted.df, aes(x=age, y = quadratic.fit, col ="2fit"), size = 1) +
geom_line(data=predicted.df, aes(x=age, y = cubic.fit, col="3fit"), size = 1) +
geom_line(data=predicted.df, aes(x=age, y = quartic.fit, col="4fit"), size = 1) +
geom_line(data=predicted.df, aes(x=age, y = quintic.fit, col="5fit"), size = 1) +
scale_y_continuous(labels = scales::dollar) +
scale_color_manual(values = c("2fit" = "green", "3fit" = "red",
"4fit" = "blue", "5fit" = "yellow"),
labels = c("2nd order", "3rd order",
"4th order", "5th order")) +
labs(x = "Age", y = "Wage", color="Polynomial")
The regression lines of the higher order polynomials don't seem to fit the data much better than a model with just a quadratic term (2nd order polynomial).
Step functions separate the range of a continuous variable into K
different regions. Hence, a categorical (or factor) variable is created, which amounts to fitting piecewise constant functions to the data.
Step functions can be applied by using the function cut()
in the formula =
argument of lm()
. The argument breaks =
controls the number of intervals (regions) that are created. Here, I am cutting the variable age
into 3 different regions. With function table()
I check beforehand how many observations fall within each created region.
table(cut(Wage$age, 3))
##
## (17.9,38.7] (38.7,59.3] (59.3,80.1]
## 1127 1663 210
lm_step <- lm(wage ~ cut(age, breaks = 3), data = Wage)
summary(lm_step)
##
## Call:
## lm(formula = wage ~ cut(age, breaks = 3), data = Wage)
##
## Residuals:
## Min 1Q Median 3Q Max
## -98119 -25953 -6474 16500 212635
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 101695 1221 83.268 < 2e-16 ***
## cut(age, breaks = 3)(38.7,59.3] 16510 1582 10.437 < 2e-16 ***
## cut(age, breaks = 3)(59.3,80.1] 12239 3082 3.972 7.3e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 41000 on 2997 degrees of freedom
## Multiple R-squared: 0.03528, Adjusted R-squared: 0.03464
## F-statistic: 54.8 on 2 and 2997 DF, p-value: < 2.2e-16
Below I plot the steps in the fitted regression lines.
tibble(
age = Wage$age,
wage = Wage$wage,
yhat = predict(lm_step, Wage)
) %>%
ggplot(aes(x = age, y = wage)) +
geom_point() +
geom_line(aes(y = yhat), size = 1, col = "red") +
scale_y_continuous(labels = scales::dollar) +
labs(x = "Age", y = "Wage")
Instead of fitting a higher order polynomial over all values of a regressor \(x_i\), regression splines apply separate polynomials of lower order on different ranges of \(x_i\). The points where the coefficients change are called knots. The more knots you use, the more flexible the resulting fit.
First, a piecewise cubic polynomial with a single knot at respondents age
d 50 years is applied. Therefore, I am running two regressions each on a subset of the data with respondents less than 50 years old and respondents that are at least 50 years old.
Note: The step functions described in the previous chapter are piecewise polynomials of order zero!
lm_piecewise_poly1 <- lm(wage ~ I(age) + I(age^2) + I(age^3),
data = Wage,
subset = age < 50)
lm_piecewise_poly2 <- lm(wage ~ I(age) + I(age^2) + I(age^3),
data = Wage,
subset = age >= 50)
tibble(
age = Wage$age,
wage = Wage$wage,
.fitted1 = ifelse(age < 50, predict(lm_piecewise_poly1, Wage), NA),
.fitted2 = ifelse(age >= 50, predict(lm_piecewise_poly2, Wage), NA)
) %>%
ggplot(aes(x = age, y = wage)) +
geom_point() +
geom_line(aes(y = .fitted1), size = 1, col = "red") +
geom_line(aes(y = .fitted2), size = 1, col = "green") +
geom_vline(xintercept = 50, linetype = "dashed",
size = 1, color = "blue") +
scale_y_continuous(labels = scales::dollar) +
labs(x = "Age", y = "Wage")
With the R package splines
the function bs()
can be used to fit regression splines. The argument knots =
controls the number and position of the knots. Alternatively, argument df =
might be used to place the knots at uniform quantiles of the data. The argument degree =
determines the order of the spline.
lm_cubic_spline_bs <-lm(wage ~ bs(age, knots = c(50), degree = 3), data = Wage)
summary(lm_cubic_spline_bs)
##
## Call:
## lm(formula = wage ~ bs(age, knots = c(50), degree = 3), data = Wage)
##
## Residuals:
## Min 1Q Median 3Q Max
## -98641 -24785 -4901 15588 202827
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52515 4800 10.941 < 2e-16 ***
## bs(age, knots = c(50), degree = 3)1 68442 8862 7.723 1.53e-14 ***
## bs(age, knots = c(50), degree = 3)2 65597 6130 10.702 < 2e-16 ***
## bs(age, knots = c(50), degree = 3)3 66496 10114 6.574 5.74e-11 ***
## bs(age, knots = c(50), degree = 3)4 25759 13702 1.880 0.0602 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39900 on 2995 degrees of freedom
## Multiple R-squared: 0.08677, Adjusted R-squared: 0.08555
## F-statistic: 71.14 on 4 and 2995 DF, p-value: < 2.2e-16
tibble(age = Wage$age,
wage = Wage$wage,
.fitted = predict(lm_cubic_spline_bs, Wage)
) %>%
ggplot(aes(x = age, y = wage)) +
geom_point() +
geom_line(aes(y = .fitted), size = 1, col = "red") +
geom_vline(xintercept = 50, linetype = "dashed", size = 1, color = "blue") +
scale_y_continuous(labels = scales::dollar) +
labs(x = "Age", y = "Wage")
Also, a cubic spline can be created with an interaction term. This leads to a discontinuity in the third derivative at the knot. The function remains continuous in the first and second derivatives. The discontinuity occurs at the knot at 50 years but is not visible in the plot.
lm_cubic_spline_int <- Wage %>%
mutate(threshold = ifelse(age >= 50, 1, 0)) %$%
lm(wage ~ age + I(age ^ 2) + I(age ^ 3) + threshold:I((age - 50) ^ 3))
summary(lm_cubic_spline_int)
##
## Call:
## lm(formula = wage ~ age + I(age^2) + I(age^3) + threshold:I((age -
## 50)^3))
##
## Residuals:
## Min 1Q Median 3Q Max
## -98641 -24785 -4901 15588 202827
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.420e+05 3.611e+04 -3.931 8.64e-05 ***
## age 1.589e+04 2.917e+03 5.448 5.51e-08 ***
## I(age^2) -3.217e+02 7.527e+01 -4.273 1.99e-05 ***
## I(age^3) 2.164e+00 6.232e-01 3.472 0.000523 ***
## threshold:I((age - 50)^3) -3.721e+00 1.590e+00 -2.340 0.019327 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39900 on 2995 degrees of freedom
## Multiple R-squared: 0.08677, Adjusted R-squared: 0.08555
## F-statistic: 71.14 on 4 and 2995 DF, p-value: < 2.2e-16
tibble(age = Wage$age,
wage = Wage$wage,
.fitted = predict(lm_cubic_spline_int, Wage)
) %>%
ggplot(aes(x = age, y = wage)) +
geom_point() +
geom_line(aes(y = .fitted), size = 1.5, col = "red") +
geom_vline(xintercept = 50, linetype = "dashed", size = 1.5, color = "blue") +
scale_y_continuous(labels = scales::dollar) +
labs(x = "Age", y = "Wage")
A linear spline is a piecewise polynomial of degree one (the step functions described above are of degree zero).
lm_linear_spline <- Wage %>%
mutate(threshold = ifelse(age >= 50, 1, 0)) %$%
lm(wage ~ threshold*I(age-50))
summary(lm_linear_spline)
##
## Call:
## lm(formula = wage ~ threshold * I(age - 50))
##
## Residuals:
## Min 1Q Median 3Q Max
## -103247 -24393 -5878 16305 204546
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 130317.8 1671.4 77.970 < 2e-16 ***
## threshold -11205.9 2677.0 -4.186 2.92e-05 ***
## I(age - 50) 1558.9 108.6 14.356 < 2e-16 ***
## threshold:I(age - 50) -1967.8 266.1 -7.396 1.82e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 40260 on 2996 degrees of freedom
## Multiple R-squared: 0.07002, Adjusted R-squared: 0.06908
## F-statistic: 75.19 on 3 and 2996 DF, p-value: < 2.2e-16
Wage %>%
select(age, wage) %>%
mutate(threshold = as.factor(ifelse(age >= 50, 1, 0))) %>%
ggplot(aes(x = age, y = wage, color = threshold)) +
geom_point(color="grey") +
geom_smooth(method = "lm", se = FALSE) +
geom_vline(xintercept = 50, linetype = "dashed", size = 1, color = "blue") +
scale_y_continuous(labels = scales::dollar) +
labs(x = "Age", y = "Wage") +
guides(color = FALSE)
Splines can have high variance at the tails of the regressor's distribution (when x takes either very small or very large values). A natural spline is a regression spline with additional constraints at the boundaries: there the function is required to be linear, producing estimates that are more stable.
Natural splines can be applied with function ns()
in the R package splines
. I am using 4 degrees of freedom in this example. For comparison, I also compute a B-spline.
lm_spline_natural <- lm(wage ~ ns(age ,df = 4), data = Wage)
lm_spline_base <- lm(wage ~ bs(age, df = 4), data = Wage)
tibble(age = Wage$age,
wage = Wage$wage,
fitted.base = predict(lm_spline_base, se=T, Wage)$fit,
se.base = predict(lm_spline_base, se=T, Wage)$se.fit,
fitted.natural = predict(lm_spline_natural, se=T, Wage)$fit,
se.natural = predict(lm_spline_natural, se=T, Wage)$se.fit
) %>%
ggplot(aes(x = age, y = wage)) +
geom_point(color="grey") +
geom_line(aes(y = fitted.base, color = "base"), size = 1) +
geom_line(aes(y = fitted.base + 2*se.base), size = 0.5, col = "darkblue", linetype="dashed") +
geom_line(aes(y = fitted.base - 2*se.base), size = 0.5, col = "darkblue", linetype="dashed") +
geom_line(aes(y = fitted.natural, color = "natural"), size = 1) +
geom_line(aes(y = fitted.natural + 2*se.natural), size = 0.5, col = "red", linetype="dashed") +
geom_line(aes(y = fitted.natural - 2*se.natural), size = 0.5, col = "red", linetype="dashed") +
scale_y_continuous(labels = scales::dollar) +
scale_color_manual(values = c('base' = 'darkblue', 'natural' = 'red'),
labels = c("cubic spline", "natural cubic spline")) +
labs(x = "Age", y = "Wage", color = "") +
theme(legend.position = c(0.8, 0.8))
A regression spline is most flexible in regions with many knots because the polynomial coefficients can change rapidly. Hence, more knots are placed in regions where the function is assumed to have the most variation and fewer knots are placed in regions where the function is more stable. Although this might work, in practice knots are commonly placed uniformly. This is done by specifying the desired degrees of freedom; the knots are then placed at uniform quantiles of the data.
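You can inspect where the knots end up; a small sketch: with df = 4 a cubic B-spline has a single interior knot, which bs() stores as an attribute.

attr(bs(Wage$age, df = 4), "knots")   # placed at the 50% quantile of age
median(Wage$age)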
Although they may not be frequently applied in practice, inverse functions are easy to specify.
I am regressing the respondents’ wage
on the inverse of age
. The calculation of the inverse is specified within function I()
since the /
operator has special meaning in the formula of lm()
). A negative sign of the coefficient of the inverted age
means a positive impact on wage
(and vice versa).
lm_age_inverse <- lm(wage ~ I(1 / age), data = Wage)
summary(lm_age_inverse)
##
## Call:
## lm(formula = wage ~ I(1/age), data = Wage)
##
## Residuals:
## Min 1Q Median 3Q Max
## -99179 -24578 -5750 15855 203561
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 146165 2476 59.03 <2e-16 ***
## I(1/age) -1345057 92267 -14.58 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 40330 on 2998 degrees of freedom
## Multiple R-squared: 0.06619, Adjusted R-squared: 0.06588
## F-statistic: 212.5 on 1 and 2998 DF, p-value: < 2.2e-16
For comparison, below you find the estimates of the original model without transforming age
.
lm_age_normal <- lm(wage ~ age, data = Wage)
summary(lm_age_normal)
##
## Call:
## lm(formula = wage ~ age, data = Wage)
##
## Residuals:
## Min 1Q Median 3Q Max
## -100265 -25115 -6063 16601 205748
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 81704.74 2846.24 28.71 <2e-16 ***
## age 707.28 64.75 10.92 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 40930 on 2998 degrees of freedom
## Multiple R-squared: 0.03827, Adjusted R-squared: 0.03795
## F-statistic: 119.3 on 1 and 2998 DF, p-value: < 2.2e-16
ggplot(broom::augment(lm_age_inverse),
aes(x = `I(1/age)`, y = wage)) +
geom_point(alpha=0.3) +
geom_line(aes(y = .fitted), size = 1) +
scale_y_continuous(labels = scales::dollar) +
labs(x = "Age (inverted)", y = "Wage", color="")