Let’s start with a list of the required R packages and the dataset used in this tutorial.
Please install and load the following R packages.
library(ggplot2)
library(dplyr)
library(purrr)
library(Ecdat)
library(broom)
library(aod)
library(margins)
library(lmtest)
library(sandwich)
library(DescTools)
library(mfx)
library(brant)
library(tidyr)
library(janitor)
library(nnet)
All practical examples are built around the Boston HMDA Data Set contained in the R package Ecdat. This dataset is a cross-section from 1997-1998 of 2,381 individuals with information on their mortgage application status and other demographic characteristics. I create an additional variable deny_num which is equal to 0 if deny is "no" and equal to 1 if deny is "yes". The variable ccs is of ordinal scale but has an implausible value, which I round to a whole number.
data("Hdma", package="Ecdat")
Hdma <- Hdma %>%
mutate(deny_num = ifelse(deny == "no", 0, 1),
black_num = ifelse(black == "no", 0, 1),
ccs = round(ccs, digits = 0))
Mainly, I will work with the variables dir, black, ccs and deny in this tutorial. Note that black and deny are so-called dummy variables and take on only two values ("no" and "yes"). The variable ccs is a categorical variable (or factor) and has 6 levels. The variable dir is the ratio of a respondent’s debt payment to her total income. Most of its values range between 0 and 1.5. For example, a value of dir equal to 1 means that the loan payment is the same as the income. A value of dir greater than 1 means that the debt payment exceeds the income, while values smaller than 1 mean that the income exceeds the debt payment.
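A quick way to check this range is a five-number summary of dir:
# distribution of the debt-payment-to-income ratio
summary(Hdma$dir)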
Hdma %>%
glimpse()
## Rows: 2,381
## Columns: 15
## $ dir <dbl> 0.221, 0.265, 0.372, 0.320, 0.360, 0.240, 0.350, 0.280, 0.~
## $ hir <dbl> 0.221, 0.265, 0.248, 0.250, 0.350, 0.170, 0.290, 0.220, 0.~
## $ lvr <dbl> 0.8000000, 0.9218750, 0.9203980, 0.8604651, 0.6000000, 0.5~
## $ ccs <dbl> 5, 2, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 2, 1, 2, 2, 2~
## $ mcs <dbl> 2, 2, 2, 2, 1, 1, 2, 2, 2, 1, 2, 2, 2, 1, 1, 1, 1, 1, 2, 2~
## $ pbcr <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no~
## $ dmi <fct> no, no, no, no, no, no, no, no, yes, no, no, no, yes, no, ~
## $ self <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no~
## $ single <fct> no, yes, no, no, no, no, yes, no, no, yes, yes, yes, no, n~
## $ uria <dbl> 3.9, 3.2, 3.2, 4.3, 3.2, 3.9, 3.9, 1.8, 3.1, 3.9, 3.1, 4.3~
## $ comdominiom <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ black <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no~
## $ deny <fct> no, no, no, no, no, no, no, no, yes, no, no, no, yes, no, ~
## $ deny_num <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0~
## $ black_num <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
In binary regression models, the dependent variable takes on only two values, for example 0 and 1. Technically, the linear regression model can be used to estimate the relationship between a binary dependent variable and other independent variables. This is called the linear probability model (LPM). The predicted values generated by an LPM are interpreted as the predicted probability that \(y\)=1, and, for a given regressor \(x_i\), \(β_i\) is the change in that predicted probability for a unit change in \(x_i\).
The LPM assumes that the change in the predicted probability for a given change in \(x_i\) is constant for all values of \(x_i\), which is often a practical shortcoming of this model. This can be mitigated by using non-linear probability models, which I will show in the next section.
First, I want to regress a respondent’s mortgage application status (deny) on her ratio of debt payments to total income (dir). The dependent variable must be of numerical scale as the function lm() does not accept it to be of class factor, hence I use deny_num instead of deny. If the loan is denied, the variable deny_num is equal to one.
model_lpm1 <- lm(deny_num ~ dir, data = Hdma)
summary(model_lpm1)
##
## Call:
## lm(formula = deny_num ~ dir, data = Hdma)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.73064 -0.13731 -0.11317 -0.07092 1.05582
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.07996 0.02115 -3.780 0.000161 ***
## dir 0.60353 0.06083 9.922 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3182 on 2379 degrees of freedom
## Multiple R-squared: 0.03974, Adjusted R-squared: 0.03933
## F-statistic: 98.44 on 1 and 2379 DF, p-value: < 2.2e-16
The estimated coefficient of dir is 0.604. An increase of the loan-payment-to-income ratio by one unit leads to an increase in the probability of loan denial of 0.604 (or 60.4 percentage points).
We can also see that the predicted probabilities in the LPM can be smaller than 0 or greater than 1. The intercept shows the probability of loan denial for respondents with dir = 0, which is -0.080 (or -8.0%). This is not really meaningful as there is no such thing as a negative probability of loan denial.
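A quick check of this in the fitted model itself is to look at the range of its fitted values:
# smallest and largest "probability" predicted by the LPM
range(fitted(model_lpm1))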
A payment-to-income ratio of 1 is associated with an expected probability of mortgage application denial of roughly 50%. The model indicates that there is a positive relation between the payment-to-income ratio and the probability of a denied mortgage application. Individuals with a high value of dir are more likely to be rejected. The regression line (blue) extends beyond the possible range of deny.
ggplot(data = Hdma,
aes(x = dir, y = deny_num)) +
geom_point() +
geom_smooth(method = "lm", se = F) +
scale_y_continuous(breaks = c(0,1), limits = c(-0.5,1.5)) +
geom_hline(yintercept = 1, color = "red", lty = "dashed") +
geom_hline(yintercept = 0, color = "red", lty = "dashed") +
labs(x = "Debt to Income Ratio", y = "Mortgage Application Status")
Below I am adding another regressor black to the model, which indicates whether the respondent is black or not.
model_lpm2 <- lm(deny_num ~ dir + black, data = Hdma)
summary(model_lpm2)
##
## Call:
## lm(formula = deny_num ~ dir + black, data = Hdma)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.62520 -0.11763 -0.09291 -0.05483 1.06819
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.09056 0.02078 -4.357 1.37e-05 ***
## dir 0.55918 0.05986 9.342 < 2e-16 ***
## blackyes 0.17747 0.01837 9.664 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3122 on 2378 degrees of freedom
## Multiple R-squared: 0.07602, Adjusted R-squared: 0.07524
## F-statistic: 97.83 on 2 and 2378 DF, p-value: < 2.2e-16
When holding dir constant, the probability that the mortgage application is denied is higher by 0.177 (or 17.7 percentage points) for black respondents.
The residuals-versus-fitted plot shows that the residuals are not normally distributed and violate the homoskedasticity assumption. Since there are only two possible outcomes of Y (0 and 1), each residual is completely determined by its fitted value. Hence, standard error correction methods have to be applied when making inference with the LPM.
augment(model_lpm2) %>%
ggplot(data = .,
aes(x = .fitted, y = .std.resid)) +
geom_point() +
labs(x = "Fitted values", y = "Standardized residuals")
Below I calculate heteroskedasticity-robust standard errors for the extended model.
coeftest(model_lpm2, vcov. = vcovHC, type = "HC1")
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.090556 0.028598 -3.1665 0.001563 **
## dir 0.559183 0.088662 6.3069 3.384e-10 ***
## blackyes 0.177475 0.024945 7.1145 1.480e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
One drawback of the LPM is that it assumes that the probability of Y=1 is linear. This issue can be solved by modeling the probability with a nonlinear functional form. The choice between logit and probit depends largely on individual preferences as they produce similar results.
Below I’ve plotted the cumulative distribution functions used by the logit and probit transformations. They don’t look too different from each other, and both are bounded between 0 and 1.
tibble(x_values = seq(-4,4, by = 0.1),
Probit = pnorm(seq(-4,4, by = 0.1)),
Logit = exp(seq(-4,4, by = 0.1))/(1+exp(seq(-4,4, by = 0.1)))
) %>%
pivot_longer(cols = -x_values, names_to = "model", values_to = "values") %>%
ggplot(aes(x = x_values, y = values, group = model, color = model)) +
geom_line() +
labs(x = "Random variable", y = "Cumulative Probability", color = "Transformation")
Another difference to the LPM or OLS estimation is that the parameters of a logit or probit model are computed using Maximum Likelihood Estimation (MLE). It uses iterative algorithms and aims to maximize the conditional probability of observing the data given a specific probability distribution and its parameters.
For each iteration the algorithm tries to improve the log-likelihood. The more the initial and the final likelihood values differ from each other, the bigger the advantage of using the specified model with the independent variables compared to the null model with only an intercept.
A large number of iterations may indicate some problem with the specified model. However, take into account that the more predictors you include in your model, the more iterations will be required. Should the model not converge it might be helpful to exclude one or more predictor variables.
Some of the concepts of OLS can also be found when using MLE. The initial log-likelihood \(L_0\), for example, is a rough equivalent of the total sum of squares (\(TSS\)). The final log-likelihood \(L_K\), on the other hand, is an approximate equivalent of the residual sum of squares (\(RSS\)). And finally, the difference between \(L_0\) and \(L_K\) can be thought of as the mean squared error (\(MSE\)).
This section starts with the logit model as it is commonly more often used in practice than the probit model and some of the general concepts are easier to illustrate. The logit model is based on a cumulative standard logistic distribution.
Consider the following model:
\[ \begin{aligned} Log(\frac{Pr(Denied)}{NotDenied}) = β_0 + β_1(dir) + β_2(black) + ε \end{aligned} \]
The function glm() is used to compute a logit model by supplying the argument family = with the eponymous link. The dependent variable can be of class factor, so I am now using deny instead of deny_num. I follow up the extended model from the previous section and regress a respondent’s mortgage application status on her loan payment-to-income ratio (dir) and on a dummy variable which indicates whether she is black or not. There are quite a few differences compared to the output of the linear regression model.
model_logit <- glm(deny ~ dir + black,
data = Hdma,
family = binomial(link = "logit"))
summary(model_logit)
##
## Call:
## glm(formula = deny ~ dir + black, family = binomial(link = "logit"),
## data = Hdma)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3710 -0.4731 -0.4219 -0.3555 2.8041
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.1264 0.2684 -15.372 < 2e-16 ***
## dir 5.3714 0.7284 7.374 1.65e-13 ***
## blackyes 1.2733 0.1462 8.709 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1744.4 on 2380 degrees of freedom
## Residual deviance: 1591.6 on 2378 degrees of freedom
## AIC: 1597.6
##
## Number of Fisher Scoring iterations: 5
For a one unit increase in an independent variable the logit regression coefficients give the change in the log odds of the dependent variable. This is not really helpful to begin with, but the sign of the regression coefficients gives us a notion whether the probability of loan denial increases or decreases. A higher loan payment-to-income ratio (dir) is associated with an increase in the probability that the loan is denied. A black applicant has a higher probability of denial, holding dir constant. Also the magnitude of the coefficients can be assessed: the effect of a one unit increase in dir is stronger than a one unit increase in black. The log odds of denial when dir = 0 and black = "no" are reported by the intercept.
The z-value is the test value of the z-statistic (Wald z-statistic) and is reported in place of the t-value from the linear (probability) model. This different test statistic is needed because the response variable is binary.
Below I created a scatterplot of deny against dir and added the logit function. It is S-shaped and flattens out for large and small values of dir. The function also ensures that the predicted conditional probabilities of deny are between 0 and 1. The plot shows that for values of dir around 1.5, the probability of loan denial is already very close to 1 (100%).
ggplot(data = Hdma,
aes(x = dir, y = deny_num)) +
geom_point() +
stat_smooth(method="glm", method.args=list(family=binomial(link="logit")),
se = F) +
scale_y_continuous(breaks = c(0,1), limits = c(-0.5,1.5)) +
geom_hline(yintercept = 1, color = "red", lty = "dashed") +
geom_hline(yintercept = 0, color = "red", lty = "dashed") +
labs(x = "Debt to Income Ratio", y = "Mortgage Application Denied?")
As mentioned before, the log odds reported by default in the summary output of glm() are not a good measure for interpreting the effect of an independent variable on Y. Let us therefore take a look at odds. Consider some event where the outcome is either failure (Y=0) or success (Y=1). The probability of success is 0.8 and hence the probability of failure is 1 - 0.8 = 0.2. The odds of success are defined as the ratio of the probability of success over the probability of failure, in this example 0.8/0.2 = 4 (odds = p/(1-p) with p denoting the probability of the event). The odds of success are 4 to 1, or success is 4 times more likely than failure.
Considering the Hdma dataset, what are the odds of a black respondent being denied a loan and what are the odds of a non-black respondent being denied a loan? The function xtabs() creates a cross-table of deny and black.
xtabs(~deny + black, data = Hdma)
## black
## deny no yes
## no 1853 243
## yes 189 96
For black respondents the odds of loan denial are 96/243 = 0.40 and for non-black respondents the odds of loan denial are 189/1853 = 0.10. The ratio of the odds for black to the odds for non-black respondents is (96/243)/(189/1853) = 3.87. The odds of loan denial for black respondents are about 3.87 times higher than the odds for non-black respondents.
Below are some probabilities with their associated odds and log-odds.
| Probability | Odds | Logits (Log-Odds) |
|---|---|---|
| 10% | 10/90 = 0.11 | -2.21 |
| 20% | 20/80 = 0.25 | -1.39 |
| 30% | 30/70 = 0.43 | -0.84 |
| 40% | 40/60 = 0.67 | -0.40 |
| 50% | 50/50 = 1.00 | 0.00 |
| 60% | 60/40 = 1.50 | 0.41 |
| 70% | 70/30 = 2.33 | 0.85 |
| 80% | 80/20 = 4.00 | 1.39 |
| 90% | 90/10 = 9.00 | 2.20 |
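These numbers are easy to reproduce in R, for example with a small tibble:
tibble(probability = seq(0.1, 0.9, by = 0.1)) %>%
  mutate(odds = probability / (1 - probability),  # odds = p / (1 - p)
         log_odds = log(odds))                    # logits are the natural log of the odds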
Logistic regression coefficients can also be depicted in the form of odds ratios, i.e. as multiplicative effects on the odds. They can be calculated by exponentiating the logit coefficients (log odds). In the baseline model, the odds of loan denial for black respondents are 3.57 times higher when keeping dir constant. Stating that the odds for black respondents are 257% ((3.57 - 1.00) * 100%) higher than the odds of non-black respondents is also possible. Note that the z-values and hence the p-values do not change when transforming the log odds to odds ratios.
exp(coef(model_logit))
## (Intercept) dir blackyes
## 0.01614017 215.15719913 3.57263200
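If you also want confidence intervals on the odds ratio scale, one convenient option is tidy() from the already loaded broom package:
broom::tidy(model_logit, exponentiate = TRUE, conf.int = TRUE)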
Another form of interpretation is the transformation of logits to probabilities. This can be achieved by using the baseline model (model_logit) in function predict() with argument type = set to "response". Before, however, we must decide for which fraction of the data the probabilities should be calculated. Below I am calculating the predicted probabilities of loan denial (deny) for black and non-black respondents (black) with an average loan-payment-to-income ratio (mean(Hdma$dir)).
tibble(
dir = mean(Hdma$dir),
black = factor(c("no", "yes"))
) %>%
mutate(pred.prob = predict(model_logit, newdata = .,
type = "response"))
## # A tibble: 2 x 3
## dir black pred.prob
## <dbl> <fct> <dbl>
## 1 0.331 no 0.0871
## 2 0.331 yes 0.254
The predicted probability of loan denial for a black respondent with average dir is 0.254 (or 25.4%). The predicted probability of loan denial for a non-black respondent, in contrast, is only 0.087 (or 8.7%). There is a difference between black and non-black respondents of 0.167 (or 16.7 percentage points).
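These values can also be reproduced by hand from the coefficients with plogis(), the cumulative distribution function of the logistic distribution:
b <- coef(model_logit)
plogis(b["(Intercept)"] + b["dir"] * mean(Hdma$dir))                  # black = "no"
plogis(b["(Intercept)"] + b["dir"] * mean(Hdma$dir) + b["blackyes"])  # black = "yes"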
Instead of keeping dir fixed at its mean, we may vary it over a selected range of values. Below are the predicted probabilities of deny for black respondents over four different values of the loan-payment-to-income ratio.
tibble(
dir = seq(from = 0.2, to = 0.5 , by = 0.1),
black = factor(c("yes"))
) %>%
mutate(pred.prob = predict(model_logit, newdata = .,
type = "response"))
## # A tibble: 4 x 3
## dir black pred.prob
## <dbl> <fct> <dbl>
## 1 0.2 yes 0.144
## 2 0.3 yes 0.224
## 3 0.4 yes 0.331
## 4 0.5 yes 0.458
For black respondents with a dir of 0.2 the predicted probability of loan denial (deny = yes) is 0.144 (or 14.4%). The predicted probability of loan denial for black respondents with a dir of 0.5 is 0.458 (or 45.8%). Note that although the predicted probability increases with higher values of dir, this increase is not proportional!
With the logistic regression model we can understand residuals as the difference between the observed values and the values classified by our model. That is, if the predicted probability is larger than 0.5 the observation is classified as 1 and if it is smaller than 0.5 the observation is classified as 0. Typically we compare the values classified by our model with the observed values in a cross-table. I’m using the function augment() with argument type.predict = "response" to obtain the predicted probabilities. Then I’m using functions from R package janitor to create a table of the observed and predicted values of deny.
model_logit %>%
augment(type.predict = "response") %>%
mutate(deny_prediction = if_else(.fitted > 0.5, "yes", "no")) %>%
janitor::tabyl(deny_prediction, deny) %>%
janitor::adorn_totals(where = c("row", "col")) %>%
janitor::adorn_title()
## deny
## deny_prediction no yes Total
## no 2090 273 2363
## yes 6 12 18
## Total 2096 285 2381
With this table we can compute measures like the sensitivity or specificity. These are column percentages in the main diagonal. The sensitivity is the share of correctly predicted loan denials relative to all loan denials (12/285 = 4.21%). The specificity is the share of correctly predicted non-denials relative to all non-denials (2090/2096 = 99.71%).
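Both shares can also be computed directly from the predictions instead of reading them off the table, for example like this:
model_logit %>%
  augment(type.predict = "response") %>%
  mutate(deny_prediction = if_else(.fitted > 0.5, "yes", "no")) %>%
  summarise(sensitivity = sum(deny_prediction == "yes" & deny == "yes") / sum(deny == "yes"),
            specificity = sum(deny_prediction == "no" & deny == "no") / sum(deny == "no"))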
In social sciences you’ll frequently encounter the measure Count R2, which is the share of all correct predictions ((2090+12)/2381 = 88.28%). However, be wary that even without knowing the independent variables we are able to correctly classify some persons. If we only knew the distribution of deny, we would make the fewest errors by classifying all observations into the category with the most cases. If we classified every application as not denied, we would be correct in 2096/2381 = 88.03% of the cases.
Comparing the correct classifications by means of the marginal distribution of deny and by means of using the independent variables, we can compute the Adjusted Count R2. It is computed by subtracting the number of cases in the most frequent category (deny = no) from the number of correct predictions and dividing this by the difference between the total number of observations and the number of cases in the most frequent category.
((2090+12) - 2096)/(2381-2096)
## [1] 0.02105263
This measure means that by knowing the independent variables (dir and black), the prediction errors decrease by about 2.1% compared to a prediction using only the marginal distribution of \(y\) (deny).
With function lrtest() from package lmtest likelihood ratio tests can be performed. Similar to an F-test, it checks whether all regressors in the model are jointly equal to zero, that is, whether the model with just a constant fits as well as the specified model with the regressors. Note that F-tests are not utilized in the setting of binary response variables. The LR test is performed by estimating two models and comparing the fit of one model to the fit of the other. Removing predictor variables from a model will almost always make the model fit less well (i.e., the model will have a lower log-likelihood), but it is necessary to test whether the observed difference in model fit is statistically significant. The LR test does this by comparing the log-likelihoods of the two models. Below, the null hypothesis that the log-likelihood of the model with just a constant is the same as the log-likelihood of the baseline model is rejected.
lrtest(model_logit)
## Likelihood ratio test
##
## Model 1: deny ~ dir + black
## Model 2: deny ~ 1
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 3 -795.79
## 2 1 -872.21 -2 152.85 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The AIC is another measure of goodness of fit that takes into account the ability of the model to fit the data. It can be extracted with function AIC(). This is very useful when comparing two models where one may fit better but perhaps only because it’s more flexible and thus better able to fit any data. Since we have only one model, this is uninformative at the moment.
AIC(model_logit)
## [1] 1597.573
Compared to OLS, no R2 value is reported in the summary output of function glm(). However, there are some pseudo R2 measures which can be used in the setting of binary response variables. Below I am calculating McFadden's R2, which is based on the model’s log-likelihood. As with the standard R2, the higher the value the better the fit of the model.
1 - model_logit$deviance / model_logit$null.deviance
## [1] 0.08762363
The function PseudoR2() from package DescTools provides many different pseudo R2 measures, which you can look up if you like to. We can also assess the goodness of fit with the Hosmer-Lemeshow test implemented in function HosmerLemeshowTest() of the same package.
model.diagnostics <-
model_logit %>%
augment(type.predict = "response") %>%
mutate(deny_num = if_else(deny == "no",0,1))
DescTools::HosmerLemeshowTest(fit = model.diagnostics$.fitted,
obs = model.diagnostics$deny_num, ngr = 10)$C
##
## Hosmer-Lemeshow C statistic
##
## data: model.diagnostics$.fitted and model.diagnostics$deny_num
## X-squared = 25.863, df = 8, p-value = 0.001109
We can access the number of iterations before the model converged with:
model_logit$iter
## [1] 5
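If you want to watch the algorithm at work, a small optional sketch is to refit the same model with trace = TRUE passed via glm.control(), which prints the running deviance for each iteration:
glm(deny ~ dir + black,
    data = Hdma,
    family = binomial(link = "logit"),
    control = glm.control(trace = TRUE))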
With function logLik() we can extract the log-likelihood of our specified model.
logLik(model_logit)
## 'log Lik.' -795.7864 (df=3)
The null model’s log-likelihood can also be retrieved by refitting the model with only a constant, for example using the function update().
logLik(update(model_logit, . ~ 1))
## 'log Lik.' -872.2128 (df=1)
I will now remove the dummy variable black from the model but add the consumer’s credit score (ccs). This is an ordinal categorical variable with values ranging from 1 to 6. A low value represents a good score and a high value represents a bad score (just like grades in Germany).
model_logit2 <- glm(deny ~ dir + factor(ccs),
data = Hdma,
family = binomial(link = "logit"))
summary(model_logit2)
##
## Call:
## glm(formula = deny ~ dir + factor(ccs), family = binomial(link = "logit"),
## data = Hdma)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3218 -0.5037 -0.3664 -0.2901 2.9754
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.6351 0.2901 -15.977 < 2e-16 ***
## dir 5.5119 0.7434 7.414 1.23e-13 ***
## factor(ccs)2 0.7287 0.1902 3.831 0.000127 ***
## factor(ccs)3 1.1465 0.2652 4.324 1.53e-05 ***
## factor(ccs)4 1.5797 0.3043 5.192 2.08e-07 ***
## factor(ccs)5 1.4712 0.2143 6.864 6.68e-12 ***
## factor(ccs)6 2.0244 0.1919 10.548 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1744.4 on 2380 degrees of freedom
## Residual deviance: 1526.2 on 2374 degrees of freedom
## AIC: 1540.2
##
## Number of Fisher Scoring iterations: 5
The interpretation of the coefficients of each level of ccs is with respect to the left-out category (ccs = 1). Remember that these coefficients show the change in the log odds of deny. The worse a respondent’s ccs, the higher the probability of being denied a loan.
Especially with categorical variables the predicted probabilities are essential for an easy and meaningful interpretation. Below I am holding dir at its mean value and calculate the predicted probability of loan denial (deny) for each level of ccs. This is, by the way, the same as using the margins command in Stata (there will be more on this topic later on). Then I am using the function geom_errorbar() to show the confidence interval around each estimate.
tibble(dir = mean(Hdma$dir),
ccs = factor(rep(1:6))) %>%
mutate(
pred.prob = predict(model_logit2,newdata = .,
type = "response"),
se = predict(model_logit2, newdata = .,
type = "response",se = TRUE)$se.fit,
ll = pred.prob - (1.96 * se),
ul = pred.prob + (1.96 * se)) %>%
ggplot(data = .,
aes(x = ccs, y = pred.prob)) +
geom_errorbar(aes(ymin = ll, ymax = ul),
width = 0.2, col = "red") +
geom_point(fill = "black") +
labs(x = "Consumer's credit score",
y = "Predicted Probability (of loan denial)")
Below I am generating a new dataset with 50 values of dir per level of ccs. The dir values are generated by function seq() with argument length.out = set to 50 and are spaced uniformly between the from = and to = values. In total there are 300 observations because ccs has 6 levels. Then I am using the function predict() to compute the predicted probabilities and standard errors. I also calculate the upper and lower limit of the confidence interval for each predicted probability. Below is a snippet of this dataset.
probability_data <- tibble(
dir = rep(seq(from = 0, to = 1.5, length.out = 50), 6),
ccs = factor(rep(1:6, each = 50))
) %>%
mutate(pred.probs = predict(model_logit2, newdata = .,
type = "response"),
se = predict(model_logit2, newdata = .,
type = "response", se = TRUE)$se.fit,
ll = pred.probs - (1.96 * se),
ul = pred.probs + (1.96 * se))
probability_data %>%
group_by(ccs) %>%
slice(1:3)
## # A tibble: 18 x 6
## # Groups: ccs [6]
## dir ccs pred.probs se ll ul
## <dbl> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 0 1 0.00961 0.00276 0.00420 0.0150
## 2 0.0306 1 0.0114 0.00303 0.00543 0.0173
## 3 0.0612 1 0.0134 0.00330 0.00696 0.0199
## 4 0 2 0.0197 0.00581 0.00833 0.0311
## 5 0.0306 2 0.0233 0.00638 0.0107 0.0358
## 6 0.0612 2 0.0274 0.00699 0.0137 0.0411
## 7 0 3 0.0296 0.0105 0.00913 0.0501
## 8 0.0306 3 0.0349 0.0117 0.0120 0.0578
## 9 0.0612 3 0.0410 0.0130 0.0155 0.0666
## 10 0 4 0.0450 0.0161 0.0134 0.0766
## 11 0.0306 4 0.0528 0.0180 0.0175 0.0882
## 12 0.0612 4 0.0619 0.0201 0.0225 0.101
## 13 0 5 0.0405 0.0126 0.0159 0.0652
## 14 0.0306 5 0.0476 0.0138 0.0206 0.0747
## 15 0.0612 5 0.0559 0.0151 0.0263 0.0855
## 16 0 6 0.0685 0.0194 0.0304 0.106
## 17 0.0306 6 0.0800 0.0210 0.0389 0.121
## 18 0.0612 6 0.0934 0.0225 0.0493 0.137
With function geom_ribbon() it is easy to show intervals around a given value. Below are the predicted probabilities of loan denial (deny) for each value of the consumer’s credit score (ccs). The worse the consumer’s credit score, the more likely is the loan denial for a given value of dir. It is also apparent that along the values of dir the predicted probabilities of deny do not increase linearly.
ggplot(data = probability_data,
aes(x = dir, y = pred.probs)) +
geom_ribbon(aes(ymin = ll,
ymax = ul, fill = ccs), alpha = 0.2) +
geom_line(aes(colour = ccs),
size = 1) +
scale_color_brewer(palette = "RdYlGn", direction = -1) +
scale_fill_brewer(palette = "RdYlGn", direction = -1) +
labs(x = "Loan payment-to-income ratio",
y = "Predicted Probability (of loan denial)",
col = "Consumer's credit score", fill = "Consumer's credit score")
With a Wald-test we can check whether all levels of a factor variable are together equal to zero. The Wald test approximates the LR-test, but with the advantage that it only requires estimating one model.
wald.test(b = coef(model_logit2),
Sigma = vcov(model_logit2),
Terms = 3:7)
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 132.2, df = 5, P(> X2) = 0.0
The chi-squared test statistic of 132.2 with five degrees of freedom is associated with a p-value of less than 0.001, indicating that the overall effect of ccs is statistically significant.
Below I test whether the coefficients of ccs = 5 and ccs = 6 are equal.
test_vec <- cbind(0, 0, 0, 0, 0, 1, -1)
wald.test(b = coef(model_logit2), Sigma = vcov(model_logit2), L = test_vec)
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 5.5, df = 1, P(> X2) = 0.019
The chi-squared test statistic of 5.5 with one degree of freedom is associated with a p-value of 0.019 indicating that the two coefficients are significantly different from each other at the 5% level.
Below is an even more advanced example, where I added black to the model again. Now there is one continuous, one dummy and one categorical variable!
model_logit3 <- glm(deny ~ dir + black + factor(ccs),
data = Hdma,
family = binomial(link = "logit"))
summary(model_logit3)
##
## Call:
## glm(formula = deny ~ dir + black + factor(ccs), family = binomial(link = "logit"),
## data = Hdma)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0940 -0.4976 -0.3473 -0.2823 2.9861
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.6540 0.2876 -16.181 < 2e-16 ***
## dir 5.1816 0.7328 7.071 1.54e-12 ***
## blackyes 0.9414 0.1558 6.045 1.50e-09 ***
## factor(ccs)2 0.7170 0.1913 3.748 0.000178 ***
## factor(ccs)3 1.0260 0.2698 3.803 0.000143 ***
## factor(ccs)4 1.3527 0.3127 4.326 1.52e-05 ***
## factor(ccs)5 1.3439 0.2182 6.160 7.28e-10 ***
## factor(ccs)6 1.8055 0.1981 9.114 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1744.4 on 2380 degrees of freedom
## Residual deviance: 1492.0 on 2373 degrees of freedom
## AIC: 1508
##
## Number of Fisher Scoring iterations: 5
When using predicted probabilities, three different variables now have to be taken into account for the interpretation! I am generating 20 values of dir between 0 and 1.5 for each level of ccs, with black alternating between "no" and "yes" (120 observations in total). Truth be told, already creating this dataset is quite complicated! I am using the function rep_along() from package purrr to create a matching number of categories for each sequence of values.
probability_data2 <- tibble(
dir = rep(seq(from = 0, to = 1.5, length.out = 20), 6),
ccs = factor(rep(1:6, each = 20)),
black = factor(rep_along(ccs, c("no","yes")))
) %>%
mutate(pred.prob = predict(model_logit3, newdata = .,
type = "response"),
se = predict(model_logit3, newdata = .,
type = "response", se = TRUE)$se.fit,
ll = pred.prob - (1.96 * se),
ul = pred.prob + (1.96 * se))
probability_data2 %>%
group_by(ccs, black) %>%
slice(1:2)
## # A tibble: 24 x 7
## # Groups: ccs, black [12]
## dir ccs black pred.prob se ll ul
## <dbl> <fct> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 0 1 no 0.00943 0.00269 0.00417 0.0147
## 2 0.158 1 no 0.0211 0.00391 0.0135 0.0288
## 3 0.0789 1 yes 0.0355 0.00930 0.0172 0.0537
## 4 0.237 1 yes 0.0769 0.0139 0.0496 0.104
## 5 0 2 no 0.0191 0.00558 0.00820 0.0301
## 6 0.158 2 no 0.0423 0.00840 0.0259 0.0588
## 7 0.0789 2 yes 0.0700 0.0184 0.0339 0.106
## 8 0.237 2 yes 0.146 0.0268 0.0932 0.198
## 9 0 3 no 0.0259 0.00921 0.00783 0.0439
## 10 0.158 3 no 0.0568 0.0156 0.0262 0.0873
## # ... with 14 more rows
The conditional effects plot (CEP) shows that along the values of dir and for all consumer’s credit scores (ccs), a black respondent has a higher predicted probability of loan denial compared to a non-black respondent.
ggplot(data = probability_data2,
aes(x = dir, y = pred.prob)) +
geom_ribbon(aes(ymin = ll, ymax = ul, fill = black), alpha = 0.2) +
geom_line(aes(colour = black), size = 1) +
scale_color_brewer(palette = "Dark2") +
scale_fill_brewer(palette = "Dark2") +
facet_wrap(.~ ccs) +
labs(x = "Loan payment-to-income ratio",
y = "Predicted Probability (of loan denial)",
col = "Respondent is black", fill = "Respondent is black")
Let’s consider again the previous model, but now with an interaction term between black and ccs.
model_logit4 <- glm(deny ~ dir + black + factor(ccs) + black:factor(ccs),
data = Hdma,
family = binomial(link = "logit"))
summary(model_logit4)
##
## Call:
## glm(formula = deny ~ dir + black + factor(ccs) + black:factor(ccs),
## family = binomial(link = "logit"), data = Hdma)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1662 -0.4899 -0.3465 -0.2812 2.9912
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.67083 0.29452 -15.859 < 2e-16 ***
## dir 5.21577 0.73831 7.065 1.61e-12 ***
## blackyes 0.96500 0.29362 3.287 0.001014 **
## factor(ccs)2 0.73348 0.21279 3.447 0.000567 ***
## factor(ccs)3 1.19355 0.30811 3.874 0.000107 ***
## factor(ccs)4 0.95443 0.45721 2.088 0.036840 *
## factor(ccs)5 1.16676 0.26910 4.336 1.45e-05 ***
## factor(ccs)6 1.96144 0.23551 8.328 < 2e-16 ***
## blackyes:factor(ccs)2 -0.08307 0.48418 -0.172 0.863771
## blackyes:factor(ccs)3 -0.57349 0.61253 -0.936 0.349136
## blackyes:factor(ccs)4 0.82885 0.67518 1.228 0.219603
## blackyes:factor(ccs)5 0.51268 0.49145 1.043 0.296853
## blackyes:factor(ccs)6 -0.40924 0.43222 -0.947 0.343721
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1744.4 on 2380 degrees of freedom
## Residual deviance: 1485.5 on 2368 degrees of freedom
## AIC: 1511.5
##
## Number of Fisher Scoring iterations: 5
tibble(
dir = rep(seq(from = 0, to = 1.5, length.out = 20), 6),
ccs = factor(rep(1:6, each = 20)),
black = factor(rep_along(ccs, c("no","yes")))
) %>%
mutate(pred.prob = predict(model_logit4, newdata = .,
type = "response"),
se = predict(model_logit4, newdata = .,
type = "response", se = TRUE)$se.fit,
ll = pred.prob - (1.96 * se),
ul = pred.prob + (1.96 * se)) %>%
ggplot(data = .,
aes(x = dir, y = pred.prob)) +
geom_ribbon(aes(ymin = ll, ymax = ul, fill = black), alpha = 0.2) +
geom_line(aes(colour = black), size = 1) +
scale_color_brewer(palette = "Dark2") +
scale_fill_brewer(palette = "Dark2") +
facet_wrap(.~ ccs) +
labs(x = "Loan payment-to-income ratio",
y = "Predicted Probability (of loan denial)",
col = "Respondent is black", fill = "Respondent is black")
The probit regression model is based on a cumulative standard normal distribution. Logit and probit regression deliver practically identical results and you should consider using the probit model as an alternative to the logit model and not as an extension. Commonly, the probit coefficients are smaller than the logit coefficients.
The function glm() is used to compute the probit model by supplying the argument family = with the eponymous link. The dependent variable can again be of class factor, so I use deny instead of deny_num.
model_probit <- glm(deny ~ dir + black ,
data = Hdma,
family = binomial(link = "probit"))
summary(model_probit)
##
## Call:
## glm(formula = deny ~ dir + black, family = binomial(link = "probit"),
## data = Hdma)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1208 -0.4761 -0.4254 -0.3549 2.8802
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.25916 0.13670 -16.527 < 2e-16 ***
## dir 2.74208 0.38050 7.207 5.74e-13 ***
## blackyes 0.70843 0.08335 8.499 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1744.4 on 2380 degrees of freedom
## Residual deviance: 1594.5 on 2378 degrees of freedom
## AIC: 1600.5
##
## Number of Fisher Scoring iterations: 5
The probit regression coefficients give the change in the z-score or probit index for a one unit change in the independent variable and cannot be easily interpreted. First, we take a look at the signs of the coefficients. A higher loan payment-to-income ratio (dir) is, as we already know, associated with an increase in the probability of loan denial. Also, black applicants have a higher probability of denial, holding constant the payments-to-income ratio. However, the coefficient estimates are not probabilities - yet!
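To see the point made earlier that probit coefficients are commonly smaller than logit coefficients, a quick side-by-side comparison of the two baseline models:
# coefficients of the logit and probit baseline models next to each other
round(cbind(logit = coef(model_logit), probit = coef(model_probit)), 3)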
The z-value is the test value of the z-statistic (Wald z-statistic) and is again used instead of the t-statistic known from the linear probability model. This different test statistic is needed because the response variable is binary.
The function is clearly nonlinear and flattens out for large and small values of the loan payments to income ratio. The functional form thus also ensures that the predicted conditional probabilities of loan denial lie between 0 and 1.
ggplot(data = Hdma,
aes(x = dir, y = deny_num)) +
geom_point() +
stat_smooth(method="glm", method.args=list(family=binomial(link="probit")),
se = F) +
scale_y_continuous(breaks = c(0,1), limits = c(-0.5,1.5)) +
geom_hline(yintercept = 1, color = "red", lty = "dashed") +
geom_hline(yintercept = 0, color = "red", lty = "dashed") +
labs(x = "Debt to Income Ratio", y = "Mortgage Application Status")
First, I compute predictions for non-black and black applicants with the same loan payment-to-income ratio. Then, with function diff(), I compute the difference in probabilities.
predictions <- predict(model_probit,
newdata = tibble(
"dir" = c(0.8, 0.8),
"black" = c("no", "yes")
),
type = "response")
diff(predictions)
## 2
## 0.265976
The estimated difference in denial probabilities is about 0.266 (26.6%).
Please refer to the section from the logit regression as there are practically no differences.
I am adding the variable ccs to the model, which indicates a respondent’s credit score. Its values range from 1 to 6, with a low value indicating a good score.
The function xtabs() creates a cross-table of the dependent variable and the added factor variable ccs.
xtabs(~deny + ccs, data = Hdma)
## ccs
## deny 1 2 3 4 5 6
## no 1269 391 103 60 140 133
## yes 84 51 23 17 42 68
Below I am estimating the model.
model_probit2 <- glm(deny ~ dir + black + factor(ccs),
data = Hdma,
family = binomial(link = "probit"))
summary(model_probit2)
##
## Call:
## glm(formula = deny ~ dir + black + factor(ccs), family = binomial(link = "probit"),
## data = Hdma)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9093 -0.5004 -0.3481 -0.2759 3.1174
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.52646 0.14419 -17.521 < 2e-16 ***
## dir 2.65806 0.38073 6.982 2.92e-12 ***
## blackyes 0.52622 0.08805 5.976 2.28e-09 ***
## factor(ccs)2 0.35009 0.09756 3.588 0.000333 ***
## factor(ccs)3 0.54245 0.14571 3.723 0.000197 ***
## factor(ccs)4 0.70248 0.17505 4.013 6.00e-05 ***
## factor(ccs)5 0.69869 0.12002 5.821 5.83e-09 ***
## factor(ccs)6 0.98931 0.11055 8.949 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1744.4 on 2380 degrees of freedom
## Residual deviance: 1494.2 on 2373 degrees of freedom
## AIC: 1510.2
##
## Number of Fisher Scoring iterations: 5
With a Wald-test we can check whether all levels of a factor variable are together equal to zero. The Wald test approximates the LR-test, but with the advantage that it only requires estimating one model.
wald.test(b = coef(model_probit2),
Sigma = vcov(model_probit2),
Terms = 4:8)
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 99.5, df = 5, P(> X2) = 0.0
The chi-squared test statistic of 99.5 with five degrees of freedom is associated with a p-value of less than 0.001, indicating that the overall effect of ccs is statistically significant.
Below I test whether the coefficient of ccs = 5 is equal to the coefficient of ccs = 6.
l <- cbind(0, 0, 0, 0, 0, 0, 1, -1)
wald.test(b = coef(model_probit2), Sigma = vcov(model_probit2), L = l)
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 4.2, df = 1, P(> X2) = 0.04
As opposed to linear (OLS) regression, the logistic regression model has no single measure which expresses the influence of an independent variable on the dependent variable. However, one candidate is the average marginal effect. In OLS regressions the marginal effect is the slope of the regression line and constant for all values of X. As we have seen in the conditional effects plots, the slope of the regression line, and hence the marginal effect, is not constant in the logistic regression. The effect of an independent variable is stronger the closer the predicted probabilities are to 50% and weaker as they approach 0% or 100%. It is thus reasonable to calculate the slope of the regression line at various points.
Average marginal effects show the change in a probability when the predictor or independent variable is increased by one unit. For continuous variables this represents the instantaneous change given that the unit is very small. For binary variables, the change is from 0 to 1, so one unit as it is usually thought of.
Three types of marginal effects can be distinguished:
- Average marginal effects (AME)
- Marginal effects at representative values (MER)
- Marginal effects at means (MEM)
Let us consider again the baseline model with black and dir as regressors and deny as response. We now use the function margins() from package margins to calculate the average marginal effects. For non-continuous variables one refers to them as the average discrete change (ADC) relative to the base level.
model_logit <- glm(deny ~ dir + black,
data = Hdma,
family = binomial(link = "logit"))
summary(margins(model_logit, type = "response"))
## factor AME SE z p lower upper
## blackyes 0.1664 0.0239 6.9744 0.0000 0.1197 0.2132
## dir 0.5182 0.0697 7.4304 0.0000 0.3815 0.6549
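As a rough cross-check for the continuous regressor dir (a hedged sketch, not how margins computes it internally): in a logit model the AME is approximately the coefficient multiplied by the average logistic density evaluated at the linear predictor.
# average logistic density at the linear predictor times the dir coefficient
mean(dlogis(predict(model_logit, type = "link"))) * coef(model_logit)["dir"]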
The output from function margins() may be plotted with geom_point() and geom_errorbar() in order to compare the relative influence of each regressor.
ggplot(data = summary(margins(model_logit))) +
geom_point(aes(factor, AME)) +
geom_errorbar(aes(x = factor, ymin = lower, ymax = upper),
width = 0.2) +
labs(x = "")
The at = argument allows you to calculate marginal effects at representative values, which are marginal effects for particularly interesting (sets of) observations in a dataset. Below I calculate the marginal effects at three different values of dir.
summary(margins(model_logit, at = list(dir = c(0.3, 0.4, 0.5)), type = "response"))
## factor dir AME SE z p lower upper
## blackyes 0.3000 0.1493 0.0228 6.5606 0.0000 0.1047 0.1939
## blackyes 0.4000 0.2093 0.0286 7.3067 0.0000 0.1531 0.2654
## blackyes 0.5000 0.2668 0.0339 7.8689 0.0000 0.2004 0.3333
## dir 0.3000 0.4518 0.0528 8.5588 0.0000 0.3484 0.5553
## dir 0.4000 0.6611 0.1083 6.1069 0.0000 0.4490 0.8733
## dir 0.5000 0.9029 0.1747 5.1685 0.0000 0.5605 1.2453
When using the function logitmfx() from package mfx one has the option to compute marginal effects at means. There, the slope of the regression line is calculated while all independent variables are set to their mean values.
While it was once common practice to estimate MEMs - rather than AMEs or MERs - this is now considered somewhat inappropriate because it focuses on cases that may not exist (e.g., the mean of a dummy variable does not reflect a case that can actually exist in reality). Also, we are often interested in the effect of a variable at multiple possible values of the covariates, rather than at a single arbitrarily selected case.
logitmfx(deny ~ dir + black,
data = Hdma, atmean = TRUE)
## Call:
## logitmfx(formula = deny ~ dir + black, data = Hdma, atmean = TRUE)
##
## Marginal Effects:
## dF/dx Std. Err. z P>|z|
## dir 0.494729 0.066267 7.4657 8.286e-14 ***
## blackyes 0.167118 0.024409 6.8465 7.570e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## dF/dx is for discrete change for the following variables:
##
## [1] "blackyes"
For all respondents who are average with respect to all independent variables, the predicted probability of loan denial (deny) increases by 0.495 (49.5 percentage points) if dir increases by one unit. For black respondents the probability of loan denial increases by 0.167 (16.7 percentage points).
One major advantage of the linear probability model is its interpretability. We’ve seen in this section that the log-odds displayed in the summary output of the logit model are not really useful unless we either calculate odds ratios or, even better, compute predicted probabilities. So when using a logit model your results will be statistically more profound at the expense of more complexity. When you are still at the beginning of your study, it might be useful to fit an LPM just to have a look at whether the coefficients point in the direction you’re assuming. Then you can always enhance your analysis by fitting a logit or a probit model.
The column ccs of the Hdma dataset is a categorical variable of ordinal scale. Each category represents a score, ranging from very good (1) to very bad (6). Using ccs as the dependent variable in a regression model requires a special method since \(y\) will always be constrained by the highest or lowest category and we would run into problems similar to those we encountered when using OLS with a binary dependent variable. Below is a table which shows the distribution of ccs.
janitor::tabyl(Hdma$ccs)
## Hdma$ccs n percent
## 1 1353 0.56824864
## 2 442 0.18563629
## 3 126 0.05291894
## 4 77 0.03233935
## 5 182 0.07643847
## 6 201 0.08441831
Most respondents have a very good score, and since the other categories have relatively few observations I’m going to modify the ccs variable such that only three categories are left.
Hdma <-
Hdma %>%
mutate(ccs_mod =
case_when(
ccs == 1 ~ 1,
ccs %in% c(2,3) ~ 2,
ccs %in% c(4,5,6) ~ 3
)
)
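To verify the recoding we can tabulate the new variable in the same way as before:
# distribution of the recoded credit score
janitor::tabyl(Hdma$ccs_mod)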
The ordered logit regression model is implemented in function polr() of R package MASS. It is a generalization of the binary logistic regression model (proportional odds model). When specifying the model with polr() the dependent variable must be of type factor. Otherwise the syntax is similar to function glm().
is.factor(Hdma$ccs_mod)
## [1] FALSE
Since this is not the case yet, we have to specify this in the formula = argument by using factor(ccs_mod).
model_ologit <- MASS::polr(factor(ccs_mod) ~ pbcr, data = Hdma)
The output looks different from the one we obtain after running an LPM with function lm() or a logistic regression model with function glm().
summary(model_ologit)
## Call:
## MASS::polr(formula = factor(ccs_mod) ~ pbcr, data = Hdma)
##
## Coefficients:
## Value Std. Error t value
## pbcryes 1.669 0.1517 11
##
## Intercepts:
## Value Std. Error t value
## 1|2 0.3894 0.0433 8.9974
## 2|3 1.6014 0.0559 28.6642
##
## Residual Deviance: 4542.331
## AIC: 4548.331
## (1 observation deleted due to missingness)
The coefficient values are again in log-odds. The output shows that respondents with a public bad credit record (pbcr) are more likely to have a high (= bad) consumer credit score than respondents without a public bad credit record.
Let’s also calculate predicted probabilities for our model. We need a data frame with the levels of pbcr which we then plug into the function predict().
prediction_df <- tibble(pbcr =c ("no","yes"))
predict(object = model_ologit, prediction_df, type="probs")
## 1 2 3
## 1 0.5961307 0.2360849 0.1677844
## 2 0.2176752 0.2655254 0.5167994
The columns of the output indicate the levels of the dependent variable (ccs_mod) and the rows the levels of pbcr ("no" and "yes").
For respondents without a public bad credit record the odds of having a consumer credit score of 1 versus having a score of 2 or 3 are 5.31 times that of respondents with a bad public credit record.
(0.60/(1-0.60)) / (0.22/(1-0.22))
## [1] 5.318182
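Up to rounding, the same odds ratio follows directly from exponentiating the pbcryes coefficient reported in the summary output (exp(1.669) ≈ 5.31):
# coefficient of the ordered logit model on the odds-ratio scale
exp(coef(model_ologit))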
The proportional odds (parallel regression) assumption underlying this model can be tested using a Brant test, which is available from R package brant and works with objects generated by function polr().
brant::brant(model_ologit)
## --------------------------------------------
## Test for X2 df probability
## --------------------------------------------
## Omnibus 0.08 1 0.78
## pbcryes 0.08 1 0.78
## --------------------------------------------
##
## H0: Parallel Regression Assumption holds
According to the test the parallel regression assumption holds, because all p-values are above 0.05. The first row (Omnibus) is for the whole model and the second row for the individual coefficient of pbcr.
The function polr() can also be used to estimate an ordered probit model by changing the method = argument. Below I’m estimating the same model as in the previous section.
model_oprobit <- MASS::polr(factor(ccs_mod) ~ pbcr, data = Hdma,
method = "probit")
For interpretation the usual approaches apply.
summary(model_oprobit)
## Call:
## MASS::polr(formula = factor(ccs_mod) ~ pbcr, data = Hdma, method = "probit")
##
## Coefficients:
## Value Std. Error t value
## pbcryes 1.011 0.09159 11.04
##
## Intercepts:
## Value Std. Error t value
## 1|2 0.2430 0.0268 9.0525
## 2|3 0.9642 0.0313 30.8466
##
## Residual Deviance: 4542.262
## AIC: 4548.262
## (1 observation deleted due to missingness)
If the dependent variable has more than two categories which cannot be ordered, the multinomial logit model is used.
Therefore I’m creating a new variable status which divides respondents into one of four classes according to their family (single) and employment (self) status.
Hdma <-
Hdma %>%
mutate(
status = case_when(
single == "no" & self == "no" ~ "Not single & not self employed",
single == "yes" & self == "no" ~ "Single & not self employed",
single == "no" & self == "yes" ~ "Not Single & self employed",
single == "yes" & self == "yes" ~ "Single & self employed"
)
)
Let’s have a look at this variable. We find that the majority of respondents are neither single nor self employed, while the second largest group are respondents who are single but not self employed.
tabyl(Hdma$status)
## Hdma$status n percent valid_percent
## Not single & not self employed 1267 0.5321293574 0.53235294
## Not Single & self employed 177 0.0743385132 0.07436975
## Single & not self employed 836 0.3511129777 0.35126050
## Single & self employed 100 0.0419991600 0.04201681
## <NA> 1 0.0004199916 NA
With the multinomial logit model we predict the probability for each level of \(y\). We could, for example, run four separate binary regression models, each with one level of status as the dependent variable. With these models we’re able to compute a predicted probability for each level of status. On the downside, however, these probabilities won’t add up to one. Therefore it makes sense to estimate the probabilities simultaneously. Solving the math behind this approach requires setting the coefficients of one of the equations to zero (the baseline). This is somewhat similar to excluding a level of a binary or categorical predictor variable when estimating a model with OLS.
In order to run the multinomial logit model we use the function multinom() from R package nnet. It sets the coefficients equal to zero for the first class (level) of the dependent variable, which in our data is Not single & not self employed. You can change the ordering of status if you want another category set to zero.
model_mlogit <- nnet::multinom(status ~ pbcr + dir, data = Hdma)
## # weights: 16 (9 variable)
## initial value 3299.380579
## iter 10 value 2452.456492
## final value 2445.721144
## converged
Except for the base category, we find coefficient estimates and standard errors for each level of status.
summary(model_mlogit)
## Call:
## nnet::multinom(formula = status ~ pbcr + dir, data = Hdma)
##
## Coefficients:
## (Intercept) pbcryes dir
## Not Single & self employed -2.3086576 -0.04344533 1.0353472
## Single & not self employed -0.6457075 -0.12811151 0.7243947
## Single & self employed -3.0188938 0.57934691 1.2615303
##
## Std. Errors:
## (Intercept) pbcryes dir
## Not Single & self employed 0.2365491 0.3093766 0.6666061
## Single & not self employed 0.1559587 0.1764939 0.4551908
## Single & self employed 0.2687501 0.3185651 0.7284517
##
## Residual Deviance: 4891.442
## AIC: 4909.442
The coefficients of the other three equations are to be interpreted relative to respondents who are neither single nor self employed. For example, the coefficients for Single & self employed measure the change in the log-odds of a respondent being in this category and not in the baseline group, when the predictors are increased by one unit. A major problem of this model is that you cannot interpret the algebraic sign in the usual manner. So a positive sign does not necessarily mean an increase in the probability. Again, I advise you to use marginal effects or a CEP.
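As with the binary logit, the coefficients can be exponentiated; this yields so-called relative risk ratios, i.e. multiplicative effects on the probability of being in a category relative to the probability of being in the baseline category.
# relative risk ratios relative to "Not single & not self employed"
exp(coef(model_mlogit))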
Using predict() on the object generated by function multinom() requires setting the argument type = "probs" in order to obtain the predicted probabilities. However, no standard errors are calculated and they have to be computed separately. I’m not sure how to do this right now, so I’ll add it in the future.
pred.data <-
tibble(dir = rep(seq(from = 0, to = 1.5, length.out = 50), 2),
pbcr = factor(rep(c("no","yes"), each = 50)))
pred.probs <-
predict(model_mlogit, newdata = pred.data, type = "probs") %>%
as_tibble(.name_repair = make_clean_names)
probability_data3 <- bind_cols(pred.data, pred.probs)
ggplot(data = probability_data3,
aes(x = dir, y = single_self_employed)) +
geom_line(aes(colour = pbcr), size = 1) +
scale_color_manual(values = c("blue","red")) +
labs(x = "Loan payment-to-income ratio",
y = "Predicted Probability Status = Single & self employed",
col = "Public bad credit record?")
From this plot we can see that the probability of being in the group Single & self employed increases along the values of dir and is larger for respondents with a public bad credit record (pbcr).
We can also use a CEP to compare the predicted probabilities of all categories simultaneously, for example by making the data set longer (pivot_longer()) and then using facet_grid() to create a CEP matrix.
probability_data3 %>%
pivot_longer(cols = -c(dir, pbcr),
names_to = "category",
values_to = "pred.probs") %>%
ggplot(data = .,
aes(x = dir, y = pred.probs)) +
geom_line(aes(colour = pbcr), size = 1) +
facet_grid(. ~ category) +
scale_color_manual(values = c("blue","red")) +
labs(x = "Loan payment-to-income ratio",
y = "Predicted Probability",
col = "Public bad credit record?") +
theme(legend.position = "bottom")
As you may have noticed, no p-values are returned after summarizing the output generated by function multinom(). Still, we can compute the p-values with a Wald test…
z <- summary(model_mlogit)$coefficients/summary(model_mlogit)$standard.errors
# 2-tailed z test
(1 - pnorm(abs(z), 0, 1)) * 2
## (Intercept) pbcryes dir
## Not Single & self employed 0.000000e+00 0.88832133 0.12038445
## Single & not self employed 3.469341e-05 0.46791879 0.11151756
## Single & self employed 0.000000e+00 0.06897039 0.08330977
… or simply use function tidy() from R package broom.
broom::tidy(model_mlogit)
## # A tibble: 9 x 6
## y.level term estimate std.error statistic p.value
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Not Single & self employed (Intercept) -2.31 0.237 -9.76 1.68e-22
## 2 Not Single & self employed pbcryes -0.0434 0.309 -0.140 8.88e- 1
## 3 Not Single & self employed dir 1.04 0.667 1.55 1.20e- 1
## 4 Single & not self employed (Intercept) -0.646 0.156 -4.14 3.47e- 5
## 5 Single & not self employed pbcryes -0.128 0.176 -0.726 4.68e- 1
## 6 Single & not self employed dir 0.724 0.455 1.59 1.12e- 1
## 7 Single & self employed (Intercept) -3.02 0.269 -11.2 2.81e-29
## 8 Single & self employed pbcryes 0.579 0.319 1.82 6.90e- 2
## 9 Single & self employed dir 1.26 0.728 1.73 8.33e- 2
Finally, two quick exploratory plots of the denial rate against the loan-to-value ratio (lvr).
# Share of denied applications within eight bins of the loan-to-value ratio (lvr),
# drawn as a line on top of the raw 0/1 outcomes
Hdma %>%
  mutate(lvr_group = cut(lvr, 8)) %>%
  group_by(lvr_group) %>%
  summarize(deny_mean = mean(deny_num, na.rm = TRUE)) %>%
  left_join(Hdma %>% mutate(lvr_group = cut(lvr, 8)), by = "lvr_group") %>%
  ggplot(aes(x = lvr, y = deny_mean)) +
  geom_line() +
  geom_point(aes(y = deny_num))
# The same raw 0/1 outcomes, this time with a loess smooth instead of the binned means
Hdma %>%
  mutate(lvr_group = cut(lvr, 8)) %>%
  group_by(lvr_group) %>%
  summarize(deny_mean = mean(deny_num, na.rm = TRUE)) %>%
  left_join(Hdma %>% mutate(lvr_group = cut(lvr, 8)), by = "lvr_group") %>%
  ggplot(aes(x = lvr, y = deny_num)) +
  geom_point() +
  geom_smooth(method = "loess", se = FALSE)
The analogue of Cook’s D in logistic regression is Pregibon’s delta-beta statistic (dbeta), which I compute below for another logit model.
model_diagnostic <- glm(deny ~ lvr + black + pbcr, data = Hdma, family = binomial(link="logit"))
tidy(model_diagnostic)
## # A tibble: 4 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -4.42 0.357 -12.4 2.70e-35
## 2 lvr 2.58 0.443 5.84 5.34e- 9
## 3 blackyes 0.995 0.154 6.46 1.03e-10
## 4 pbcryes 1.72 0.178 9.68 3.51e-22
# Leverage (hat) values and studentized residuals of the logit model
hat <- hatvalues(model_diagnostic)
stdresid <- rstudent(model_diagnostic)
# Auxiliary regression of the hat values on the studentized residuals;
# keep its residuals and their sum of squares
model <- lm(hat ~ stdresid)
new_resid <- residuals(model)
sum_res <- sum(new_resid^2)
# Combine the pieces into a per-observation delta-beta influence measure
dbeta <- stdresid * new_resid / sqrt((1 - hat) * sum_res)
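To spot potentially influential observations, a natural follow-up (a small sketch reusing the objects created above) is to plot these values against the fitted probabilities.
tibble(fitted = fitted(model_diagnostic),
       dbeta = dbeta) %>%
  ggplot(aes(x = fitted, y = dbeta)) +
  geom_point() +
  labs(x = "Fitted probability of loan denial", y = "Delta-beta")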