Let’s start with a list of the required R packages and the dataset used in this tutorial.
Please install and load the following R packages.
library(ggplot2)
library(dplyr)
library(purrr)
library(Ecdat)
library(broom)
library(aod)
library(margins)
library(lmtest)
library(sandwich)
library(DescTools)
library(mfx)
library(brant)
library(tidyr)
library(janitor)
library(nnet)
All practical examples are built around the Boston HMDA Data Set contained in the R package Ecdat. This dataset is a cross-section from 1997-1998 of 2,381 individuals with information on their mortgage application status and other demographic characteristics. I create an additional variable deny_num which is equal to 0 if deny is "no" and equal to 1 if deny is "yes". The variable ccs is of ordinal scale but has an implausible value, which I round to a whole number.
data("Hdma", package="Ecdat")
Hdma <- Hdma %>%
mutate(deny_num = ifelse(deny == "no", 0, 1),
black_num = ifelse(black == "no", 0, 1),
ccs = round(ccs, digits = 0))
Mainly, I will work with the variables dir, black, ccs and deny in this tutorial. Note that black and deny are so-called dummy variables and take on only two values ("no" and "yes"). The variable ccs is a categorical variable (or factor) and has 6 levels. The variable dir is the ratio of a respondent’s debt payment to her total income. Most of its values range between 0 and 1.5. For example, a value of dir equal to 1 means that the loan payment is the same as the income. A value of dir greater than 1 means that the debt payment exceeds the income, while values smaller than 1 mean that the income exceeds the debt payment.
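A quick way to check this range is a five-number summary of dir:
# distribution of the debt-payment-to-income ratio
summary(Hdma$dir)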
Hdma %>%
glimpse()
## Rows: 2,381
## Columns: 15
## $ dir <dbl> 0.221, 0.265, 0.372, 0.320, 0.360, 0.240, 0.350, 0.280, 0.~
## $ hir <dbl> 0.221, 0.265, 0.248, 0.250, 0.350, 0.170, 0.290, 0.220, 0.~
## $ lvr <dbl> 0.8000000, 0.9218750, 0.9203980, 0.8604651, 0.6000000, 0.5~
## $ ccs <dbl> 5, 2, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 2, 1, 2, 2, 2~
## $ mcs <dbl> 2, 2, 2, 2, 1, 1, 2, 2, 2, 1, 2, 2, 2, 1, 1, 1, 1, 1, 2, 2~
## $ pbcr <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no~
## $ dmi <fct> no, no, no, no, no, no, no, no, yes, no, no, no, yes, no, ~
## $ self <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no~
## $ single <fct> no, yes, no, no, no, no, yes, no, no, yes, yes, yes, no, n~
## $ uria <dbl> 3.9, 3.2, 3.2, 4.3, 3.2, 3.9, 3.9, 1.8, 3.1, 3.9, 3.1, 4.3~
## $ comdominiom <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ black <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no~
## $ deny <fct> no, no, no, no, no, no, no, no, yes, no, no, no, yes, no, ~
## $ deny_num <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0~
## $ black_num <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
In binary regression models, the dependent variable takes on only two values, for example 0 and 1. Technically, the linear regression model can be used to estimate the relationship between a binary dependent variable and other independent variables. This is called the linear probability model (LPM). The predicted values generated by an LPM are interpreted as the predicted probability that \(y\)=1, and, for a given regressor \(x_i\), \(β_i\) is the change in that predicted probability for a unit change in \(x_i\).
The LPM assumes that the change in the predicted probability for a given change in \(x_i\) is constant for all values of \(x_i\), which is often a practical shortcoming of this model. This can be mitigated by using non-linear probability models, which I will show in the next section.
First, I want to regress a respondent’s mortgage application status (deny) on her ratio of debt payments to total income (dir). The dependent variable must be of numerical scale as the function lm() does not accept it to be of class factor, hence I use deny_num instead of deny. If the loan is denied, the variable deny_num is equal to one.
model_lpm1 <- lm(deny_num ~ dir, data = Hdma)
summary(model_lpm1)
##
## Call:
## lm(formula = deny_num ~ dir, data = Hdma)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.73064 -0.13731 -0.11317 -0.07092 1.05582
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.07996 0.02115 -3.780 0.000161 ***
## dir 0.60353 0.06083 9.922 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3182 on 2379 degrees of freedom
## Multiple R-squared: 0.03974, Adjusted R-squared: 0.03933
## F-statistic: 98.44 on 1 and 2379 DF, p-value: < 2.2e-16
The estimated coefficient of dir is 0.604. An increase of the loan-payment-to-income ratio by one unit leads to an increase in the probability of loan denial of 0.604 (or 60.4 percentage points).
We can also see that the predicted probabilities in the LPM can be smaller than 0 or greater than 1. The intercept shows the probability of loan denial for respondents with dir = 0, which is -0.080 (or -8.0%). This is not really meaningful as there is no such thing as a negative probability of loan denial.
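A quick check of this in the fitted model itself is to look at the range of its fitted values:
# smallest and largest "probability" predicted by the LPM
range(fitted(model_lpm1))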
A payment-to-income ratio of 1 is associated with an expected probability of mortgage application denial of roughly 50%. The model indicates that there is a positive relation between the payment-to-income ratio and the probability of a denied mortgage application. Individuals with a high value of dir are more likely to be rejected. The regression line (blue) extends beyond the possible range of deny.
ggplot(data = Hdma,
aes(x = dir, y = deny_num)) +
geom_point() +
geom_smooth(method = "lm", se = F) +
scale_y_continuous(breaks = c(0,1), limits = c(-0.5,1.5)) +
geom_hline(yintercept = 1, color = "red", lty = "dashed") +
geom_hline(yintercept = 0, color = "red", lty = "dashed") +
labs(x = "Debt to Income Ratio", y = "Mortgage Application Status")
Below I am adding another regressor black to the model, which indicates whether the respondent is black or not.
model_lpm2 <- lm(deny_num ~ dir + black, data = Hdma)
summary(model_lpm2)
##
## Call:
## lm(formula = deny_num ~ dir + black, data = Hdma)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.62520 -0.11763 -0.09291 -0.05483 1.06819
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.09056 0.02078 -4.357 1.37e-05 ***
## dir 0.55918 0.05986 9.342 < 2e-16 ***
## blackyes 0.17747 0.01837 9.664 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3122 on 2378 degrees of freedom
## Multiple R-squared: 0.07602, Adjusted R-squared: 0.07524
## F-statistic: 97.83 on 2 and 2378 DF, p-value: < 2.2e-16
When holding dir constant, the probability that the mortgage application is denied is higher by 0.177 (or 17.7 percentage points) for black respondents.
The residuals-versus-fitted plot shows that the residuals are not normally distributed and violate the homoskedasticity assumption. Since there are only two possible outcomes of Y (0 and 1), each residual is completely determined by its fitted value. Hence, standard error correction methods have to be applied when making inference with the LPM.
augment(model_lpm2) %>%
ggplot(data = .,
aes(x = .fitted, y = .std.resid)) +
geom_point() +
labs(x = "Fitted values", y = "Standardized residuals")
Below I calculate heteroskedasticity-robust standard errors for the extended model.
coeftest(model_lpm2, vcov. = vcovHC, type = "HC1")
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.090556 0.028598 -3.1665 0.001563 **
## dir 0.559183 0.088662 6.3069 3.384e-10 ***
## blackyes 0.177475 0.024945 7.1145 1.480e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
One drawback of the LPM is that it assumes that the probability of Y=1 is linear. This issue can be solved by modeling the probability with a nonlinear functional form. The choice between logit and probit depends largely on individual preferences as they produce similar results.
Below I’ve plotted the cumulative distribution functions used by the logit and probit transformations. They don’t look too different from each other, and both are bounded between 0 and 1.
tibble(x_values = seq(-4,4, by = 0.1),
Probit = pnorm(seq(-4,4, by = 0.1)),
Logit = exp(seq(-4,4, by = 0.1))/(1+exp(seq(-4,4, by = 0.1)))
) %>%
pivot_longer(cols = -x_values, names_to = "model", values_to = "values") %>%
ggplot(aes(x = x_values, y = values, group = model, color = model)) +
geom_line() +
labs(x = "Random variable", y = "Cumulative Probability", color = "Transformation")
Another difference to the LPM or OLS estimation is that the parameters of a logit or probit model are computed using Maximum Likelihood Estimation (MLE). It uses iterative algorithms and aims to maximize the conditional probability of observing the data given a specific probability distribution and its parameters.
For each iteration the algorithm tries to improve the log-likelihood. The more the initial and the final likelihood values differ from each other, the bigger the advantage of using the specified model with the independent variables compared to the null model with only an intercept.
A large number of iterations may indicate some problem with the specified model. However, take into account that the more predictors you include in your model, the more iterations will be required. Should the model not converge it might be helpful to exclude one or more predictor variables.
Some of the concepts of OLS can also be found when using MLE. The initial log-likelihood \(L_0\), for example, is a rough equivalent of the total sum of squares (\(TSS\)). The final log-likelihood \(L_K\), on the other hand, is an approximate equivalent of the residual sum of squares (\(RSS\)). And finally, the difference between \(L_0\) and \(L_K\) can be thought of as the mean squared error (\(MSE\)).
This section starts with the logit model as it is commonly more often used in practice than the probit model and some of the general concepts are easier to illustrate. The logit model is based on a cumulative standard logistic distribution.
Consider the following model:
\[ \begin{aligned} Log(\frac{Pr(Denied)}{NotDenied}) = β_0 + β_1(dir) + β_2(black) + ε \end{aligned} \]
The function glm() is used to compute a logit model by supplying the argument family = with the eponymous link. The dependent variable can be of class factor, so I am now using deny instead of deny_num. I follow up the extended model from the previous section and regress a respondent’s mortgage application status on her loan payment-to-income ratio (dir) and on a dummy variable which indicates whether she is black or not. There are quite a few differences compared to the output of the linear regression model.
model_logit <- glm(deny ~ dir + black,
data = Hdma,
family = binomial(link = "logit"))
summary(model_logit)
##
## Call:
## glm(formula = deny ~ dir + black, family = binomial(link = "logit"),
## data = Hdma)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3710 -0.4731 -0.4219 -0.3555 2.8041
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.1264 0.2684 -15.372 < 2e-16 ***
## dir 5.3714 0.7284 7.374 1.65e-13 ***
## blackyes 1.2733 0.1462 8.709 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1744.4 on 2380 degrees of freedom
## Residual deviance: 1591.6 on 2378 degrees of freedom
## AIC: 1597.6
##
## Number of Fisher Scoring iterations: 5
For a one unit increase in an independent variable the logit regression coefficients give the change in the log odds of the dependent variable. This is not really helpful to begin with, but the sign of the regression coefficients gives us a notion whether the probability of loan denial increases or decreases. A higher loan payment-to-income ratio (dir) is associated with an increase in the probability that the loan is denied. A black applicant has a higher probability of denial, holding dir constant. Also the magnitude of the coefficients can be assessed: the effect of a one unit increase in dir is stronger than a one unit increase in black. The log odds of denial when dir = 0 and black = "no" are reported by the intercept.
The z-value is the test value of the z-statistic (Wald z-statistic) and is reported in place of the t-value from the linear (probability) model. This different test statistic is needed because the response variable is binary.
Below I created a scatterplot of deny against dir and added the logit function. It is S-shaped and flattens out for large and small values of dir. The function also ensures that the predicted conditional probabilities of deny are between 0 and 1. The plot shows that for values of dir around 1.5, the probability of loan denial is already very close to 1 (100%).
ggplot(data = Hdma,
aes(x = dir, y = deny_num)) +
geom_point() +
stat_smooth(method="glm", method.args=list(family=binomial(link="logit")),
se = F) +
scale_y_continuous(breaks = c(0,1), limits = c(-0.5,1.5)) +
geom_hline(yintercept = 1, color = "red", lty = "dashed") +
geom_hline(yintercept = 0, color = "red", lty = "dashed") +
labs(x = "Debt to Income Ratio", y = "Mortgage Application Denied?")
As mentioned before, the log odds reported by default in the summary output of glm() are not a good measure for interpreting the effect of an independent variable on Y. Let us therefore take a look at odds. Consider some event where the outcome is either failure (Y=0) or success (Y=1). The probability of success is 0.8 and hence the probability of failure is 1 - 0.8 = 0.2. The odds of success are defined as the ratio of the probability of success over the probability of failure, in this example 0.8/0.2 = 4 (odds = p/(1-p) with p denoting the probability of the event). The odds of success are 4 to 1, or success is 4 times more likely than failure.
Considering the Hdma dataset, what are the odds of a black respondent being denied a loan and what are the odds of a non-black respondent being denied a loan? The function xtabs() creates a cross-table of deny and black.
xtabs(~deny + black, data = Hdma)
## black
## deny no yes
## no 1853 243
## yes 189 96
For black respondents the odds of loan denial are 96/243 = 0.40 and for non-black respondents the odds of loan denial are 189/1853 = 0.10. The ratio of the odds for black to the odds for non-black respondents is (96/243)/(189/1853) = 3.87. The odds of loan denial for black respondents are about 3.87 times higher than the odds for non-black respondents.
Below are some probabilities with their associated odds and log-odds.
| Probability | Odds | Logits (Log-Odds) |
|---|---|---|
| 10% | 10/90 = 0.11 | -2.21 |
| 20% | 20/80 = 0.25 | -1.39 |
| 30% | 30/70 = 0.43 | -0.84 |
| 40% | 40/60 = 0.67 | -0.40 |
| 50% | 50/50 = 1.00 | 0.00 |
| 60% | 60/40 = 1.50 | 0.41 |
| 70% | 70/30 = 2.33 | 0.85 |
| 80% | 80/20 = 4.00 | 1.39 |
| 90% | 90/10 = 9.00 | 2.20 |
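These numbers are easy to reproduce in R, for example with a small tibble:
tibble(probability = seq(0.1, 0.9, by = 0.1)) %>%
  mutate(odds = probability / (1 - probability),  # odds = p / (1 - p)
         log_odds = log(odds))                    # logits are the natural log of the odds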
Logistic regression coefficients can also be depicted in the form of odds ratios, i.e. as multiplicative effects on the odds. They can be calculated by exponentiating the logit coefficients (log odds). In the baseline model, the odds of loan denial for black respondents are 3.57 times higher when keeping dir constant. Stating that the odds for black respondents are 257% ((3.57 - 1.00) * 100%) higher than the odds of non-black respondents is also possible. Note that the z-values and hence the p-values do not change when transforming the log odds to odds ratios.
exp(coef(model_logit))
## (Intercept) dir blackyes
## 0.01614017 215.15719913 3.57263200
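If you also want confidence intervals on the odds ratio scale, one convenient option is tidy() from the already loaded broom package:
broom::tidy(model_logit, exponentiate = TRUE, conf.int = TRUE)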
Another form of interpretation is the transformation of logits to probabilities. This can be achieved by using the baseline model (model_logit) in function predict() with argument type = set to "response". Before, however, we must decide for which fraction of the data the probabilities should be calculated. Below I am calculating the predicted probabilities of loan denial (deny) for black and non-black respondents (black) with an average loan-payment-to-income ratio (mean(Hdma$dir)).
tibble(
dir = mean(Hdma$dir),
black = factor(c("no", "yes"))
) %>%
mutate(pred.prob = predict(model_logit, newdata = .,
type = "response"))
## # A tibble: 2 x 3
## dir black pred.prob
## <dbl> <fct> <dbl>
## 1 0.331 no 0.0871
## 2 0.331 yes 0.254
The predicted probability of loan denial for a black respondent with average dir is 0.254 (or 25.4%). The predicted probability of loan denial for a non-black respondent, in contrast, is only 0.087 (or 8.7%). There is a difference between black and non-black respondents of 0.167 (or 16.7 percentage points).
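These values can also be reproduced by hand from the coefficients with plogis(), the cumulative distribution function of the logistic distribution:
b <- coef(model_logit)
plogis(b["(Intercept)"] + b["dir"] * mean(Hdma$dir))                  # black = "no"
plogis(b["(Intercept)"] + b["dir"] * mean(Hdma$dir) + b["blackyes"])  # black = "yes"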
Instead of keeping dir fixed at its mean, we may vary it over a selected range of values. Below are the predicted probabilities of deny for black respondents over four different values of the loan-payment-to-income ratio.
tibble(
dir = seq(from = 0.2, to = 0.5 , by = 0.1),
black = factor(c("yes"))
) %>%
mutate(pred.prob = predict(model_logit, newdata = .,
type = "response"))
## # A tibble: 4 x 3
## dir black pred.prob
## <dbl> <fct> <dbl>
## 1 0.2 yes 0.144
## 2 0.3 yes 0.224
## 3 0.4 yes 0.331
## 4 0.5 yes 0.458
For black respondents with a dir of 0.2 the predicted probability of loan denial (deny = yes) is 0.144 (or 14.4%). The predicted probability of loan denial for black respondents with a dir of 0.5 is 0.458 (or 45.8%). Note that although the predicted probability increases with higher values of dir, this increase is not proportional!
With the logistic regression model we can understand residuals as the difference between the observed values and the values classified by our model. That is, if the predicted probability is larger than 0.5 the observation is classified as 1 and if it is smaller than 0.5 the observation is classified as 0. Typically we compare the values classified by our model with the observed values in a cross-table. I’m using the function augment() with argument type.predict = "response" to obtain the predicted probabilities. Then I’m using functions from R package janitor to create a table of the observed and predicted values of deny.
model_logit %>%
augment(type.predict = "response") %>%
mutate(deny_prediction = if_else(.fitted > 0.5, "yes", "no")) %>%
janitor::tabyl(deny_prediction, deny) %>%
janitor::adorn_totals(where = c("row", "col")) %>%
janitor::adorn_title()
## deny
## deny_prediction no yes Total
## no 2090 273 2363
## yes 6 12 18
## Total 2096 285 2381
With this table we can compute measures like the sensitivity or specificity. These are column percentages in the main diagonal. The sensitivity is the share of correctly predicted loan denials relative to all loan denials (12/285 = 4.21%). The specificity is the share of correctly predicted non-denials relative to all non-denials (2090/2096 = 99.71%).
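Both shares can also be computed directly from the predictions instead of reading them off the table, for example like this:
model_logit %>%
  augment(type.predict = "response") %>%
  mutate(deny_prediction = if_else(.fitted > 0.5, "yes", "no")) %>%
  summarise(sensitivity = sum(deny_prediction == "yes" & deny == "yes") / sum(deny == "yes"),
            specificity = sum(deny_prediction == "no" & deny == "no") / sum(deny == "no"))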
In social sciences you’ll frequently encounter the measure Count R2, which is the share of all correct predictions ((2090+12)/2381 = 88.28%). However, be wary that even without knowing the independent variables we are able to correctly classify some persons. If we only knew the distribution of deny, we would make the fewest errors by classifying all observations into the category with the most cases. If we classified every application as not denied, we would be correct in 2096/2381 = 88.03% of the cases.
Comparing the correct classifications by means of the marginal distribution of deny and by means of using the independent variables, we can compute the Adjusted Count R2. It is computed by subtracting the number of cases in the most frequent category (deny = no) from the number of correct predictions and dividing this by the difference between the total number of observations and the number of cases in the most frequent category.
((2090+12) - 2096)/(2381-2096)
## [1] 0.02105263
This measure means that by knowing the independent variables (dir and black), the prediction errors decrease by about 2.1% compared to a prediction using only the marginal distribution of \(y\) (deny).
With function lrtest() from package lmtest likelihood ratio tests can be performed. Similar to an F-test, it checks whether all regressors in the model are jointly equal to zero, that is, whether the model with just a constant fits as well as the specified model with the regressors. Note that F-tests are not utilized in the setting of binary response variables. The LR test is performed by estimating two models and comparing the fit of one model to the fit of the other. Removing predictor variables from a model will almost always make the model fit less well (i.e., the model will have a lower log-likelihood), but it is necessary to test whether the observed difference in model fit is statistically significant. The LR test does this by comparing the log-likelihoods of the two models. Below, the null hypothesis that the log-likelihood of the model with just a constant is the same as the log-likelihood of the baseline model is rejected.
lrtest(model_logit)
## Likelihood ratio test
##
## Model 1: deny ~ dir + black
## Model 2: deny ~ 1
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 3 -795.79
## 2 1 -872.21 -2 152.85 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The AIC is another measure of goodness of fit that takes into account the ability of the model to fit the data. It can be extracted with function AIC(). This is very useful when comparing two models where one may fit better but perhaps only because it’s more flexible and thus better able to fit any data. Since we have only one model, this is uninformative at the moment.
AIC(model_logit)
## [1] 1597.573
Compared to OLS, no R2 value is reported in the summary output of function glm(). However, there are some pseudo R2 measures which can be used in the setting of binary response variables. Below I am calculating McFadden's R2, which is based on the model’s log-likelihood. As with the standard R2, the higher the value the better the fit of the model.
1 - model_logit$deviance / model_logit$null.deviance
## [1] 0.08762363
The function PseudoR2() from package DescTools provides many different pseudo R2 measures, which you can look up if you like to. We can also assess the goodness of fit with the Hosmer-Lemeshow test implemented in function HosmerLemeshowTest() of the same package.
model.diagnostics <-
model_logit %>%
augment(type.predict = "response") %>%
mutate(deny_num = if_else(deny == "no",0,1))
DescTools::HosmerLemeshowTest(fit = model.diagnostics$.fitted,
obs = model.diagnostics$deny_num, ngr = 10)$C
##
## Hosmer-Lemeshow C statistic
##
## data: model.diagnostics$.fitted and model.diagnostics$deny_num
## X-squared = 25.863, df = 8, p-value = 0.001109
We can access the number of iterations before the model converged with:
model_logit$iter
## [1] 5
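If you want to watch the algorithm at work, a small optional sketch is to refit the same model with trace = TRUE passed via glm.control(), which prints the running deviance for each iteration:
glm(deny ~ dir + black,
    data = Hdma,
    family = binomial(link = "logit"),
    control = glm.control(trace = TRUE))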
With function logLik() we can extract the log-likelihood of our specified model.
logLik(model_logit)
## 'log Lik.' -795.7864 (df=3)
The null model’s log-likelihood can also be retrieved by refitting the model with only a constant, for example using the function update().
logLik(update(model_logit, . ~ 1))
## 'log Lik.' -872.2128 (df=1)
I will now remove the dummy variable black from the model but add the consumer’s credit score (ccs). This is an ordinal categorical variable with values ranging from 1 to 6. A low value represents a good score and a high value represents a bad score (just like grades in Germany).
model_logit2 <- glm(deny ~ dir + factor(ccs),
data = Hdma,
family = binomial(link = "logit"))
summary(model_logit2)
##
## Call:
## glm(formula = deny ~ dir + factor(ccs), family = binomial(link = "logit"),
## data = Hdma)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3218 -0.5037 -0.3664 -0.2901 2.9754
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.6351 0.2901 -15.977 < 2e-16 ***
## dir 5.5119 0.7434 7.414 1.23e-13 ***
## factor(ccs)2 0.7287 0.1902 3.831 0.000127 ***
## factor(ccs)3 1.1465 0.2652 4.324 1.53e-05 ***
## factor(ccs)4 1.5797 0.3043 5.192 2.08e-07 ***
## factor(ccs)5 1.4712 0.2143 6.864 6.68e-12 ***
## factor(ccs)6 2.0244 0.1919 10.548 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1744.4 on 2380 degrees of freedom
## Residual deviance: 1526.2 on 2374 degrees of freedom
## AIC: 1540.2
##
## Number of Fisher Scoring iterations: 5
The interpretation of the coefficients of each level of ccs is with respect to the left-out category (ccs = 1). Remember that these coefficients show the change in the log odds of deny. The worse a respondent’s ccs, the higher the probability of being denied a loan.
Especially with categorical variables the predicted probabilities are essential for an easy and meaningful interpretation. Below I am holding dir at its mean value and calculate the predicted probability of loan denial (deny) for each level of ccs. This is, by the way, the same as using the margins command in Stata (there will be more on this topic later on). Then I am using the function geom_errorbar() to show the confidence interval around each estimate.
tibble(dir = mean(Hdma$dir),
ccs = factor(rep(1:6))) %>%
mutate(
pred.prob = predict(model_logit2,newdata = .,
type = "response"),
se = predict(model_logit2, newdata = .,
type = "response",se = TRUE)$se.fit,
ll = pred.prob - (1.96 * se),
ul = pred.prob + (1.96 * se)) %>%
ggplot(data = .,
aes(x = ccs, y = pred.prob)) +
geom_errorbar(aes(ymin = ll, ymax = ul),
width = 0.2, col = "red") +
geom_point(fill = "black") +
labs(x = "Consumer's credit score",
y = "Predicted Probability (of loan denial)")
Below I am generating a new dataset with 50 values of dir per level of ccs. The dir values are generated by function seq() with argument length.out = set to 50 and are spaced uniformly between the from = and to = values. In total there are 300 observations because ccs has 6 levels. Then I am using the function predict() to compute the predicted probabilities and standard errors. I also calculate the upper and lower limit of the confidence interval for each predicted probability. Below is a snippet of this dataset.
probability_data <- tibble(
dir = rep(seq(from = 0, to = 1.5, length.out = 50), 6),
ccs = factor(rep(1:6, each = 50))
) %>%
mutate(pred.probs = predict(model_logit2, newdata = .,
type = "response"),
se = predict(model_logit2, newdata = .,
type = "response", se = TRUE)$se.fit,
ll = pred.probs - (1.96 * se),
ul = pred.probs + (1.96 * se))
probability_data %>%
group_by(ccs) %>%
slice(1:3)
## # A tibble: 18 x 6
## # Groups: ccs [6]
## dir ccs pred.probs se ll ul
## <dbl> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 0 1 0.00961 0.00276 0.00420 0.0150
## 2 0.0306 1 0.0114 0.00303 0.00543 0.0173
## 3 0.0612 1 0.0134 0.00330 0.00696 0.0199
## 4 0 2 0.0197 0.00581 0.00833 0.0311
## 5 0.0306 2 0.0233 0.00638 0.0107 0.0358
## 6 0.0612 2 0.0274 0.00699 0.0137 0.0411
## 7 0 3 0.0296 0.0105 0.00913 0.0501
## 8 0.0306 3 0.0349 0.0117 0.0120 0.0578
## 9 0.0612 3 0.0410 0.0130 0.0155 0.0666
## 10 0 4 0.0450 0.0161 0.0134 0.0766
## 11 0.0306 4 0.0528 0.0180 0.0175 0.0882
## 12 0.0612 4 0.0619 0.0201 0.0225 0.101
## 13 0 5 0.0405 0.0126 0.0159 0.0652
## 14 0.0306 5 0.0476 0.0138 0.0206 0.0747
## 15 0.0612 5 0.0559 0.0151 0.0263 0.0855
## 16 0 6 0.0685 0.0194 0.0304 0.106
## 17 0.0306 6 0.0800 0.0210 0.0389 0.121
## 18 0.0612 6 0.0934 0.0225 0.0493 0.137
With function geom_ribbon() it is easy to show intervals around a given value. Below are the predicted probabilities of loan denial (deny) for each value of the consumer’s credit score (ccs). The worse the consumer’s credit score, the more likely is the loan denial for a given value of dir. It is also apparent that along the values of dir the predicted probabilities of deny do not increase linearly.
ggplot(data = probability_data,
aes(x = dir, y = pred.probs)) +
geom_ribbon(aes(ymin = ll,
ymax = ul, fill = ccs), alpha = 0.2) +
geom_line(aes(colour = ccs),
size = 1) +
scale_color_brewer(palette = "RdYlGn", direction = -1) +
scale_fill_brewer(palette = "RdYlGn", direction = -1) +
labs(x = "Loan payment-to-income ratio",
y = "Predicted Probability (of loan denial)",
col = "Consumer's credit score", fill = "Consumer's credit score")
With a Wald-test we can check whether all levels of a factor variable are together equal to zero. The Wald test approximates the LR-test, but with the advantage that it only requires estimating one model.
wald.test(b = coef(model_logit2),
Sigma = vcov(model_logit2),
Terms = 3:7)
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 132.2, df = 5, P(> X2) = 0.0
The chi-squared test statistic of 132.2 with five degrees of freedom is associated with a p-value of less than 0.001, indicating that the overall effect of ccs is statistically significant.
Below I test whether the coefficients of ccs = 5 and ccs = 6 are equal.
test_vec <- cbind(0, 0, 0, 0, 0, 1, -1)
wald.test(b = coef(model_logit2), Sigma = vcov(model_logit2), L = test_vec)
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 5.5, df = 1, P(> X2) = 0.019
The chi-squared test statistic of 5.5 with one degree of freedom is associated with a p-value of 0.019 indicating that the two coefficients are significantly different from each other at the 5% level.
Below is an even more advanced example, where I added black to the model again. Now there is one continuous, one dummy and one categorical variable!
model_logit3 <- glm(deny ~ dir + black + factor(ccs),
data = Hdma,
family = binomial(link = "logit"))
summary(model_logit3)
##
## Call:
## glm(formula = deny ~ dir + black + factor(ccs), family = binomial(link = "logit"),
## data = Hdma)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0940 -0.4976 -0.3473 -0.2823 2.9861
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.6540 0.2876 -16.181 < 2e-16 ***
## dir 5.1816 0.7328 7.071 1.54e-12 ***
## blackyes 0.9414 0.1558 6.045 1.50e-09 ***
## factor(ccs)2 0.7170 0.1913 3.748 0.000178 ***
## factor(ccs)3 1.0260 0.2698 3.803 0.000143 ***
## factor(ccs)4 1.3527 0.3127 4.326 1.52e-05 ***
## factor(ccs)5 1.3439 0.2182 6.160 7.28e-10 ***
## factor(ccs)6 1.8055 0.1981 9.114 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1744.4 on 2380 degrees of freedom
## Residual deviance: 1492.0 on 2373 degrees of freedom
## AIC: 1508
##
## Number of Fisher Scoring iterations: 5
When using predicted probabilities, three different variables now have to be taken into account for the interpretation! I am generating 20 values of dir between 0 and 1.5 for each level of ccs, with black alternating between "no" and "yes" (120 observations in total). Truth be told, already creating this dataset is quite complicated! I am using the function rep_along() from package purrr to create a matching number of categories for each sequence of values.
probability_data2 <- tibble(
dir = rep(seq(from = 0, to = 1.5, length.out = 20), 6),
ccs = factor(rep(1:6, each = 20)),
black = factor(rep_along(ccs, c("no","yes")))
) %>%
mutate(pred.prob = predict(model_logit3, newdata = .,
type = "response"),
se = predict(model_logit3, newdata = .,
type = "response", se = TRUE)$se.fit,
ll = pred.prob - (1.96 * se),
ul = pred.prob + (1.96 * se))
probability_data2 %>%
group_by(ccs, black) %>%
slice(1:2)
## # A tibble: 24 x 7
## # Groups: ccs, black [12]
## dir ccs black pred.prob se ll ul
## <dbl> <fct> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 0 1 no 0.00943 0.00269 0.00417 0.0147
## 2 0.158 1 no 0.0211 0.00391 0.0135 0.0288
## 3 0.0789 1 yes 0.0355 0.00930 0.0172 0.0537
## 4 0.237 1 yes 0.0769 0.0139 0.0496 0.104
## 5 0 2 no 0.0191 0.00558 0.00820 0.0301
## 6 0.158 2 no 0.0423 0.00840 0.0259 0.0588
## 7 0.0789 2 yes 0.0700 0.0184 0.0339 0.106
## 8 0.237 2 yes 0.146 0.0268 0.0932 0.198
## 9 0 3 no 0.0259 0.00921 0.00783 0.0439
## 10 0.158 3 no 0.0568 0.0156 0.0262 0.0873
## # ... with 14 more rows
The conditional effects plot (CEP) shows that along the values of dir and for all consumer’s credit scores (ccs), a black respondent has a higher predicted probability of loan denial compared to a non-black respondent.
ggplot(data = probability_data2,
aes(x = dir, y = pred.prob)) +
geom_ribbon(aes(ymin = ll, ymax = ul, fill = black), alpha = 0.2) +
geom_line(aes(colour = black), size = 1) +
scale_color_brewer(palette = "Dark2") +
scale_fill_brewer(palette = "Dark2") +
facet_wrap(.~ ccs) +
labs(x = "Loan payment-to-income ratio",
y = "Predicted Probability (of loan denial)",
col = "Respondent is black", fill = "Respondent is black")
Let’s consider again the previous model, but now with an interaction term between black and ccs.
model_logit4 <- glm(deny ~ dir + black + factor(ccs) + black:factor(ccs),
data = Hdma,
family = binomial(link = "logit"))
summary(model_logit4)
##
## Call:
## glm(formula = deny ~ dir + black + factor(ccs) + black:factor(ccs),
## family = binomial(link = "logit"), data = Hdma)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1662 -0.4899 -0.3465 -0.2812 2.9912
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.67083 0.29452 -15.859 < 2e-16 ***
## dir 5.21577 0.73831 7.065 1.61e-12 ***
## blackyes 0.96500 0.29362 3.287 0.001014 **
## factor(ccs)2 0.73348 0.21279 3.447 0.000567 ***
## factor(ccs)3 1.19355 0.30811 3.874 0.000107 ***
## factor(ccs)4 0.95443 0.45721 2.088 0.036840 *
## factor(ccs)5 1.16676 0.26910 4.336 1.45e-05 ***
## factor(ccs)6 1.96144 0.23551 8.328 < 2e-16 ***
## blackyes:factor(ccs)2 -0.08307 0.48418 -0.172 0.863771
## blackyes:factor(ccs)3 -0.57349 0.61253 -0.936 0.349136
## blackyes:factor(ccs)4 0.82885 0.67518 1.228 0.219603
## blackyes:factor(ccs)5 0.51268 0.49145 1.043 0.296853
## blackyes:factor(ccs)6 -0.40924 0.43222 -0.947 0.343721
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1744.4 on 2380 degrees of freedom
## Residual deviance: 1485.5 on 2368 degrees of freedom
## AIC: 1511.5
##
## Number of Fisher Scoring iterations: 5
tibble(
dir = rep(seq(from = 0, to = 1.5, length.out = 20), 6),
ccs = factor(rep(1:6, each = 20)),
black = factor(rep_along(ccs, c("no","yes")))
) %>%
mutate(pred.prob = predict(model_logit4, newdata = .,
type = "response"),
se = predict(model_logit4, newdata = .,
type = "response", se = TRUE)$se.fit,
ll = pred.prob - (1.96 * se),
ul = pred.prob + (1.96 * se)) %>%
ggplot(data = .,
aes(x = dir, y = pred.prob)) +
geom_ribbon(aes(ymin = ll, ymax = ul, fill = black), alpha = 0.2) +
geom_line(aes(colour = black), size = 1) +
scale_color_brewer(palette = "Dark2") +
scale_fill_brewer(palette = "Dark2") +
facet_wrap(.~ ccs) +
labs(x = "Loan payment-to-income ratio",
y = "Predicted Probability (of loan denial)",
col = "Respondent is black", fill = "Respondent is black")
The probit regression model is based on a cumulative standard normal distribution. Logit and probit regression deliver practically identical results and you should consider using the probit model as an alternative to the logit model and not as an extension. Commonly, the probit coefficients are smaller than the logit coefficients.
The function glm() is used to compute the probit model by supplying the argument family = with the eponymous link. The dependent variable can again be of class factor, so I use deny instead of deny_num.
model_probit <- glm(deny ~ dir + black ,
data = Hdma,
family = binomial(link = "probit"))
summary(model_probit)
##
## Call:
## glm(formula = deny ~ dir + black, family = binomial(link = "probit"),
## data = Hdma)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1208 -0.4761 -0.4254 -0.3549 2.8802
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.25916 0.13670 -16.527 < 2e-16 ***
## dir 2.74208 0.38050 7.207 5.74e-13 ***
## blackyes 0.70843 0.08335 8.499 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1744.4 on 2380 degrees of freedom
## Residual deviance: 1594.5 on 2378 degrees of freedom
## AIC: 1600.5
##
## Number of Fisher Scoring iterations: 5
The probit regression coefficients give the change in the z-score or probit index for a one unit change in the independent variable and cannot be easily interpreted. First, we take a look at the signs of the coefficients. A higher loan payment-to-income ratio (dir) is, as we already know, associated with an increase in the probability of loan denial. Also, black applicants have a higher probability of denial, holding constant the payments-to-income ratio. However, the coefficient estimates are not probabilities - yet!
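To see the point made earlier that probit coefficients are commonly smaller than logit coefficients, a quick side-by-side comparison of the two baseline models:
# coefficients of the logit and probit baseline models next to each other
round(cbind(logit = coef(model_logit), probit = coef(model_probit)), 3)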
The z-value is the test value of the z-statistic (Wald z-statistic) and is again used instead of the t-statistic known from the linear probability model. This different test statistic is needed because the response variable is binary.
The function is clearly nonlinear and flattens out for large and small values of the loan payments to income ratio. The functional form thus also ensures that the predicted conditional probabilities of loan denial lie between 0 and 1.
ggplot(data = Hdma,
aes(x = dir, y = deny_num)) +
geom_point() +
stat_smooth(method="glm", method.args=list(family=binomial(link="probit")),
se = F) +
scale_y_continuous(breaks = c(0,1), limits = c(-0.5,1.5)) +
geom_hline(yintercept = 1, color = "red", lty = "dashed") +
geom_hline(yintercept = 0, color = "red", lty = "dashed") +
labs(x = "Debt to Income Ratio", y = "Mortgage Application Status")
First, I compute predictions for non-black and black applicants with the same loan payment-to-income ratio. Then, with function diff(), I compute the difference in probabilities.
predictions <- predict(model_probit,
newdata = tibble(
"dir" = c(0.8, 0.8),
"black" = c("no", "yes")
),
type = "response")
diff(predictions)
## 2
## 0.265976
The estimated difference in denial probabilities is about 0.266 (26.6%).
Please refer to the section from the logit regression as there are practically no differences.
I am adding the variable ccs to the model, which indicates a respondent’s credit score. Its values range from 1 to 6, with a low value indicating a good score.
The function xtabs() creates a cross-table of the dependent variable and the added factor variable ccs.
xtabs(~deny + ccs, data = Hdma)
## ccs
## deny 1 2 3 4 5 6
## no 1269 391 103 60 140 133
## yes 84 51 23 17 42 68
Below I am estimating the model.
model_probit2 <- glm(deny ~ dir + black + factor(ccs),
data = Hdma,
family = binomial(link = "probit"))
summary(model_probit2)
##
## Call:
## glm(formula = deny ~ dir + black + factor(ccs), family = binomial(link = "probit"),
## data = Hdma)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9093 -0.5004 -0.3481 -0.2759 3.1174
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.52646 0.14419 -17.521 < 2e-16 ***
## dir 2.65806 0.38073 6.982 2.92e-12 ***
## blackyes 0.52622 0.08805 5.976 2.28e-09 ***
## factor(ccs)2 0.35009 0.09756 3.588 0.000333 ***
## factor(ccs)3 0.54245 0.14571 3.723 0.000197 ***
## factor(ccs)4 0.70248 0.17505 4.013 6.00e-05 ***
## factor(ccs)5 0.69869 0.12002 5.821 5.83e-09 ***
## factor(ccs)6 0.98931 0.11055 8.949 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1744.4 on 2380 degrees of freedom
## Residual deviance: 1494.2 on 2373 degrees of freedom
## AIC: 1510.2
##
## Number of Fisher Scoring iterations: 5
With a Wald-test we can check whether all levels of a factor variable are together equal to zero. The Wald test approximates the LR-test, but with the advantage that it only requires estimating one model.
wald.test(b = coef(model_probit2),
Sigma = vcov(model_probit2),
Terms = 4:8)
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 99.5, df = 5, P(> X2) = 0.0
The chi-squared test statistic of 99.5 with five degrees of freedom is associated with a p-value of less than 0.001, indicating that the overall effect of ccs is statistically significant.
Below I test whether the coefficient of ccs = 5 is equal to the coefficient of ccs = 6.
l <- cbind(0, 0, 0, 0, 0, 0, 1, -1)
wald.test(b = coef(model_probit2), Sigma = vcov(model_probit2), L = l)
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 4.2, df = 1, P(> X2) = 0.04
As opposed to linear (OLS) regression, the logistic regression model has no single measure which expresses the influence of an independent variable on the dependent variable. However, one candidate is the average marginal effect. In OLS regressions the marginal effect is the slope of the regression line and constant for all values of X. As we have seen in the conditional effects plots, the slope of the regression line, and hence the marginal effect, is not constant in the logistic regression. The effect of an independent variable is stronger the closer the predicted probabilities are to 50% and weaker as they approach 0% or 100%. It is thus reasonable to calculate the slope of the regression line at various points.
Average marginal effects show the change in a probability when the predictor or independent variable is increased by one unit. For continuous variables this represents the instantaneous change given that the unit is very small. For binary variables, the change is from 0 to 1, so one unit as it is usually thought of.
Three types of marginal effects can be distinguished:
- Average marginal effects (AME)
- Marginal effects at representative values (MER)
- Marginal effects at means (MEM)
Let us consider again the baseline model with black and dir as regressors and deny as response. We now use the function margins() from package margins to calculate the average marginal effects. For non-continuous variables one refers to them as the average discrete change (ADC) relative to the base level.
model_logit <- glm(deny ~ dir + black,
data = Hdma,
family = binomial(link = "logit"))
summary(margins(model_logit, type = "response"))
## factor AME SE z p lower upper
## blackyes 0.1664 0.0239 6.9744 0.0000 0.1197 0.2132
## dir 0.5182 0.0697 7.4304 0.0000 0.3815 0.6549
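As a rough cross-check for the continuous regressor dir (a hedged sketch, not how margins computes it internally): in a logit model the AME is approximately the coefficient multiplied by the average logistic density evaluated at the linear predictor.
# average logistic density at the linear predictor times the dir coefficient
mean(dlogis(predict(model_logit, type = "link"))) * coef(model_logit)["dir"]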
The output from function margins() may be plotted with geom_point() and geom_errorbar() in order to compare the relative influence of each regressor.
ggplot(data = summary(margins(model_logit))) +
geom_point(aes(factor, AME)) +
geom_errorbar(aes(x = factor, ymin = lower, ymax = upper),
width = 0.2) +
labs(x = "")
The at = argument allows you to calculate marginal effects at representative values, which are marginal effects for particularly interesting (sets of) observations in a dataset. Below I calculate the marginal effects at three different values of dir.
summary(margins(model_logit, at = list(dir = c(0.3, 0.4, 0.5)), type = "response"))
## factor dir AME SE z p lower upper
## blackyes 0.3000 0.1493 0.0228 6.5606 0.0000 0.1047 0.1939
## blackyes 0.4000 0.2093 0.0286 7.3067 0.0000 0.1531 0.2654
## blackyes 0.5000 0.2668 0.0339 7.8689 0.0000 0.2004 0.3333
## dir 0.3000 0.4518 0.0528 8.5588 0.0000 0.3484 0.5553
## dir 0.4000 0.6611 0.1083 6.1069 0.0000 0.4490 0.8733
## dir 0.5000 0.9029 0.1747 5.1685 0.0000 0.5605 1.2453
When using the function logitmfx() from package mfx one has the option to compute marginal effects at means. There, the slope of the regression line is calculated while all independent variables are set to their mean values.
While it was once common practice to estimate MEMs - rather than AMEs or MERs - this is now considered somewhat inappropriate because it focuses on cases that may not exist (e.g., the mean of a dummy variable does not reflect a case that can actually exist in reality). Also, we are often interested in the effect of a variable at multiple possible values of the covariates, rather than at a single arbitrarily selected case.
logitmfx(deny ~ dir + black,
data = Hdma, atmean = TRUE)
## Call:
## logitmfx(formula = deny ~ dir + black, data = Hdma, atmean = TRUE)
##
## Marginal Effects:
## dF/dx Std. Err. z P>|z|
## dir 0.494729 0.066267 7.4657 8.286e-14 ***
## blackyes 0.167118 0.024409 6.8465 7.570e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## dF/dx is for discrete change for the following variables:
##
## [1] "blackyes"
For all respondents who are average with respect to all independent variables, the predicted probability of loan denial (deny) increases by 0.495 (49.5 percentage points) if dir increases by one unit. For black respondents the probability of loan denial increases by 0.167 (16.7 percentage points).
One major advantage of the linear probability model is its interpretability. We’ve seen in this section that the log-odds displayed in the summary output of the logit model are not really useful unless we either calculate odds ratios or, even better, compute predicted probabilities. So when using a logit model your results will be statistically more profound at the expense of more complexity. When you are still at the beginning of your study, it might be useful to fit an LPM just to have a look at whether the coefficients point in the direction you’re assuming. Then you can always enhance your analysis by fitting a logit or a probit model.
The column ccs of the Hdma dataset is a categorical variable of ordinal scale. Each category represents a score, ranging from very good (1) to very bad (6). Using ccs as the dependent variable in a regression model requires a special method since \(y\) will always be constrained by the highest or lowest category and we would run into problems similar to those we encountered when using OLS with a binary dependent variable. Below is a table which shows the distribution of ccs.
janitor::tabyl(Hdma$ccs)
## Hdma$ccs n percent
## 1 1353 0.56824864
## 2 442 0.18563629
## 3 126 0.05291894
## 4 77 0.03233935
## 5 182 0.07643847
## 6 201 0.08441831
Most respondents have a very good score, and since the other categories have relatively few observations I’m going to modify the ccs variable such that only three categories are left.
Hdma <-
Hdma %>%
mutate(ccs_mod =
case_when(
ccs == 1 ~ 1,
ccs %in% c(2,3) ~ 2,
ccs %in% c(4,5,6) ~ 3
)
)
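To verify the recoding we can tabulate the new variable in the same way as before:
# distribution of the recoded credit score
janitor::tabyl(Hdma$ccs_mod)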
The ordered logit regression model is implemented in function polr() of R package MASS. It is a generalization of the binary logistic regression model (proportional odds model). When specifying the model with polr() the dependent variable must be of type factor. Otherwise the syntax is similar to function glm().
is.factor(Hdma$ccs_mod)
## [1] FALSE
Since this is not the case yet, we have to specify this in the formula = argument by using factor(ccs_mod).
model_ologit <- MASS::polr(factor(ccs_mod) ~ pbcr, data = Hdma)
The output looks different from the one we obtain after running an LPM with function lm() or a logistic regression model with function glm().
summary(model_ologit)
## Call:
## MASS::polr(formula = factor(ccs_mod) ~ pbcr, data = Hdma)
##
## Coefficients:
## Value Std. Error t value
## pbcryes 1.669 0.1517 11
##
## Intercepts:
## Value Std. Error t value
## 1|2 0.3894 0.0433 8.9974
## 2|3 1.6014 0.0559 28.6642
##
## Residual Deviance: 4542.331
## AIC: 4548.331
## (1 observation deleted due to missingness)
The coefficient values are again in log-odds. The output shows that respondents with a public bad credit record (pbcr) are more likely to have a high (= bad) consumer credit score than respondents without a public bad credit record.
Let’s also calculate predicted probabilities for our model. We need a data frame with the levels of pbcr which we then plug into the function predict().
prediction_df <- tibble(pbcr =c ("no","yes"))
predict(object = model_ologit, prediction_df, type="probs")
## 1 2 3
## 1 0.5961307 0.2360849 0.1677844
## 2 0.2176752 0.2655254 0.5167994
The columns of the output indicate the levels of the dependent variable (ccs_mod) and the rows the levels of pbcr ("no" and "yes").
For respondents without a public bad credit record the odds of having a consumer credit score of 1 versus having a score of 2 or 3 are 5.31 times that of respondents with a bad public credit record.
(0.60/(1-0.60)) / (0.22/(1-0.22))
## [1] 5.318182
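Up to rounding, the same odds ratio follows directly from exponentiating the pbcryes coefficient reported in the summary output (exp(1.669) ≈ 5.31):
# coefficient of the ordered logit model on the odds-ratio scale
exp(coef(model_ologit))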
The proportional odds (parallel regression) assumption underlying this model can be tested using a Brant test, which is available from R package brant and works with objects generated by function polr().
brant::brant(model_ologit)
## --------------------------------------------
## Test for X2 df probability
## --------------------------------------------
## Omnibus 0.08 1 0.78
## pbcryes 0.08 1 0.78
## --------------------------------------------
##
## H0: Parallel Regression Assumption holds
According to the test the parallel regression assumption holds, because all p-values are above 0.05. The first row (Omnibus) is for the whole model and the second row for the individual coefficient of pbcr.
The function polr() can also be used to estimate an ordered probit model by changing the method = argument. Below I’m estimating the same model as in the previous section.
model_oprobit <- MASS::polr(factor(ccs_mod) ~ pbcr, data = Hdma,
method = "probit")
For interpretation the usual approaches apply.
summary(model_oprobit)
## Call:
## MASS::polr(formula = factor(ccs_mod) ~ pbcr, data = Hdma, method = "probit")
##
## Coefficients:
## Value Std. Error t value
## pbcryes 1.011 0.09159 11.04
##
## Intercepts:
## Value Std. Error t value
## 1|2 0.2430 0.0268 9.0525
## 2|3 0.9642 0.0313 30.8466
##
## Residual Deviance: 4542.262
## AIC: 4548.262
## (1 observation deleted due to missingness)
If the dependent variable has more than two categories which cannot be ordered, the multinomial logit model is used.
Therefore I’m creating a new variable status which divides respondents into one of four classes according to their family (single) and employment (self) status.
Hdma <-
Hdma %>%
mutate(
status = case_when(
single == "no" & self == "no" ~ "Not single & not self employed",
single == "yes" & self == "no" ~ "Single & not self employed",
single == "no" & self == "yes" ~ "Not Single & self employed",
single == "yes" & self == "yes" ~ "Single & self employed"
)
)
Let’s have a look at this variable. We find that the majority of respondents are neither single nor self employed, while the second largest group are respondents who are single but not self employed.
tabyl(Hdma$status)
## Hdma$status n percent valid_percent
## Not single & not self employed 1267 0.5321293574 0.53235294
## Not Single & self employed 177 0.0743385132 0.07436975
## Single & not self employed 836 0.3511129777 0.35126050
## Single & self employed 100 0.0419991600 0.04201681
## <NA> 1 0.0004199916 NA
With the multinomial logit model we predict the probability for each level of \(y\). We could, for example, run four separate binary regression models, each with one level of status as the dependent variable. With these models we’re able to compute a predicted probability for each level of status. On the downside, however, these probabilities won’t add up to one. Therefore it makes sense to estimate the probabilities simultaneously. Solving the math behind this approach requires setting the coefficients of one of the equations to zero (the baseline). This is somewhat similar to excluding a level of a binary or categorical predictor variable when estimating a model with OLS.
In order to run the multinomial logit model we use the function multinom() from R package nnet. It sets the coefficients equal to zero for the first class (level) of the dependent variable, which in our data is Not single & not self employed. You can change the ordering of status if you want another category set to zero.
model_mlogit <- nnet::multinom(status ~ pbcr + dir, data = Hdma)
## # weights: 16 (9 variable)
## initial value 3299.380579
## iter 10 value 2452.456492
## final value 2445.721144
## converged
Except for the base category, we find coefficient estimates and standard errors for each level of status.
summary(model_mlogit)
## Call:
## nnet::multinom(formula = status ~ pbcr + dir, data = Hdma)
##
## Coefficients:
## (Intercept) pbcryes dir
## Not Single & self employed -2.3086576 -0.04344533 1.0353472
## Single & not self employed -0.6457075 -0.12811151 0.7243947
## Single & self employed -3.0188938 0.57934691 1.2615303
##
## Std. Errors:
## (Intercept) pbcryes dir
## Not Single & self employed 0.2365491 0.3093766 0.6666061
## Single & not self employed 0.1559587 0.1764939 0.4551908
## Single & self employed 0.2687501 0.3185651 0.7284517
##
## Residual Deviance: 4891.442
## AIC: 4909.442
The coefficients of the other three equations are to be interpreted relative to respondents who are neither single nor self employed. For example, the coefficients for Single & self employed measure the change in the log-odds of a respondent being in this category and not in the baseline group, when the predictors are increased by one unit. A major problem of this model is that you cannot interpret the algebraic sign in the usual manner. So a positive sign does not necessarily mean an increase in the probability. Again, I advise you to use marginal effects or a CEP.
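As with the binary logit, the coefficients can be exponentiated; this yields so-called relative risk ratios, i.e. multiplicative effects on the probability of being in a category relative to the probability of being in the baseline category.
# relative risk ratios relative to "Not single & not self employed"
exp(coef(model_mlogit))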
Using predict() on the object generated by function multinom() requires setting the argument type = "probs" in order to obtain the predicted probabilities. However, no standard errors are calculated and they have to be computed separately. I’m not sure how to do this right now, so I’ll add it in the future.
pred.data <-
tibble(dir = rep(seq(from = 0, to = 1.5, length.out = 50), 2),
pbcr = factor(rep(c("no","yes"), each = 50)))
pred.probs <-
predict(model_mlogit, newdata = pred.data, type = "probs") %>%
as_tibble(.name_repair = make_clean_names)
probability_data3 <- bind_cols(pred.data, pred.probs)
ggplot(data = probability_data3,
aes(x = dir, y = single_self_employed)) +
geom_line(aes(colour = pbcr), size = 1) +
scale_color_manual(values = c("blue","red")) +
labs(x = "Loan payment-to-income ratio",
y = "Predicted Probability Status = Single & self employed",
col = "Public bad credit record?")
From this plot we can see that the probability of being in the group Single & self employed increases along the values of dir and is larger for respondents with a public bad credit record (pbcr).
We can also use a CEP to compare the predicted probabilities of all categories simultaneously, for example by making the data set longer (pivot_longer()) and then using facet_grid() to create a CEP matrix.
probability_data3 %>%
pivot_longer(cols = -c(dir, pbcr),
names_to = "category",
values_to = "pred.probs") %>%
ggplot(data = .,
aes(x = dir, y = pred.probs)) +
geom_line(aes(colour = pbcr), size = 1) +
facet_grid(. ~ category) +
scale_color_manual(values = c("blue","red")) +
labs(x = "Loan payment-to-income ratio",
y = "Predicted Probability",
col = "Public bad credit record?") +
theme(legend.position = "bottom")
As you may have noticed, no p-values are returned after summarizing the output generated by function multinom(). Still, we can compute the p-values with a Wald test…
z <- summary(model_mlogit)$coefficients/summary(model_mlogit)$standard.errors
# 2-tailed z test
(1 - pnorm(abs(z), 0, 1)) * 2
## (Intercept) pbcryes dir
## Not Single & self employed 0.000000e+00 0.88832133 0.12038445
## Single & not self employed 3.469341e-05 0.46791879 0.11151756
## Single & self employed 0.000000e+00 0.06897039 0.08330977
… or simply use function tidy() from R package broom.
broom::tidy(model_mlogit)
## # A tibble: 9 x 6
## y.level term estimate std.error statistic p.value
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Not Single & self employed (Intercept) -2.31 0.237 -9.76 1.68e-22
## 2 Not Single & self employed pbcryes -0.0434 0.309 -0.140 8.88e- 1
## 3 Not Single & self employed dir 1.04 0.667 1.55 1.20e- 1
## 4 Single & not self employed (Intercept) -0.646 0.156 -4.14 3.47e- 5
## 5 Single & not self employed pbcryes -0.128 0.176 -0.726 4.68e- 1
## 6 Single & not self employed dir 0.724 0.455 1.59 1.12e- 1
## 7 Single & self employed (Intercept) -3.02 0.269 -11.2 2.81e-29
## 8 Single & self employed pbcryes 0.579 0.319 1.82 6.90e- 2
## 9 Single & self employed dir 1.26 0.728 1.73 8.33e- 2
Finally, two quick exploratory plots of the denial rate against the loan-to-value ratio (lvr).
# Share of denied applications within eight bins of the loan-to-value ratio (lvr),
# drawn as a line on top of the raw 0/1 outcomes
Hdma %>%
  mutate(lvr_group = cut(lvr, 8)) %>%
  group_by(lvr_group) %>%
  summarize(deny_mean = mean(deny_num, na.rm = TRUE)) %>%
  left_join(Hdma %>% mutate(lvr_group = cut(lvr, 8)), by = "lvr_group") %>%
  ggplot(aes(x = lvr, y = deny_mean)) +
  geom_line() +
  geom_point(aes(y = deny_num))
# The same raw 0/1 outcomes, this time with a loess smooth instead of the binned means
Hdma %>%
  mutate(lvr_group = cut(lvr, 8)) %>%
  group_by(lvr_group) %>%
  summarize(deny_mean = mean(deny_num, na.rm = TRUE)) %>%
  left_join(Hdma %>% mutate(lvr_group = cut(lvr, 8)), by = "lvr_group") %>%
  ggplot(aes(x = lvr, y = deny_num)) +
  geom_point() +
  geom_smooth(method = "loess", se = FALSE)
The analogue of Cook’s D in logistic regression is Pregibon’s delta-beta statistic (dbeta), which I compute below for another logit model.
model_diagnostic <- glm(deny ~ lvr + black + pbcr, data = Hdma, family = binomial(link="logit"))
tidy(model_diagnostic)
## # A tibble: 4 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -4.42 0.357 -12.4 2.70e-35
## 2 lvr 2.58 0.443 5.84 5.34e- 9
## 3 blackyes 0.995 0.154 6.46 1.03e-10
## 4 pbcryes 1.72 0.178 9.68 3.51e-22
# Leverage (hat) values and studentized residuals of the logit model
hat <- hatvalues(model_diagnostic)
stdresid <- rstudent(model_diagnostic)
# Auxiliary regression of the hat values on the studentized residuals;
# keep its residuals and their sum of squares
model <- lm(hat ~ stdresid)
new_resid <- residuals(model)
sum_res <- sum(new_resid^2)
# Combine the pieces into a per-observation delta-beta influence measure
dbeta <- stdresid * new_resid / sqrt((1 - hat) * sum_res)
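To spot potentially influential observations, a natural follow-up (a small sketch reusing the objects created above) is to plot these values against the fitted probabilities.
tibble(fitted = fitted(model_diagnostic),
       dbeta = dbeta) %>%
  ggplot(aes(x = fitted, y = dbeta)) +
  geom_point() +
  labs(x = "Fitted probability of loan denial", y = "Delta-beta")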