Reading time: 4 minutes (640 words)


1 Introduction

The R package broom offers a set of functions which aim to ease the handling of output generated by statistical models. The package follows the tidy data concept (see the tidyr tutorial) and hence will work well with other R packages of the tidyverse. For illustrating the use of each function, I’m using the starwars dataset from the dplyr package. Let’s load both packages.

library(broom)
library(dplyr)

Later on I’m using a logistic regression model, so I need to create a recoded version of the gender column.

starwars <-
  starwars %>%
  mutate(gender_2 = if_else(gender == "masculine",0,1))

If you don’t know it already: this is the starwars dataset.

print(starwars)
## # A tibble: 87 x 15
##    name     height  mass hair_color skin_color eye_color birth_year sex   gender
##    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Luke Sk~    172    77 blond      fair       blue            19   male  mascu~
##  2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu~
##  3 R2-D2        96    32 <NA>       white, bl~ red             33   none  mascu~
##  4 Darth V~    202   136 none       white      yellow          41.9 male  mascu~
##  5 Leia Or~    150    49 brown      light      brown           19   fema~ femin~
##  6 Owen La~    178   120 brown, gr~ light      blue            52   male  mascu~
##  7 Beru Wh~    165    75 brown      light      blue            47   fema~ femin~
##  8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu~
##  9 Biggs D~    183    84 black      light      brown           24   male  mascu~
## 10 Obi-Wan~    182    77 auburn, w~ fair       blue-gray       57   male  mascu~
## # ... with 77 more rows, and 6 more variables: homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>, gender_2 <dbl>

2 Tidying functions

The package contains three main functions to manipulate objects that are usually not in a tidy format. This may include the output of statistical models or tests.

2.1 tidy()

With this function we can turn an object into a tidy tibble. For more information on the tibble see here.

Regression Model Output with summary()

Consider the following linear regression model, which is computed with function lm().

linear.model <- lm(height ~ mass, data = starwars, subset = species == "Human")

Typcially you’ll use the summary() function, which prints the following output to R’s console.

summary(linear.model)
## 
## Call:
## lm(formula = height ~ mass, data = starwars, subset = species == 
##     "Human")
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.626  -6.261   1.810   5.130  14.518 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 147.88670    8.45289  17.495 1.37e-13 ***
## mass          0.38244    0.09954   3.842  0.00102 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.842 on 20 degrees of freedom
##   (17 observations deleted due to missingness)
## Multiple R-squared:  0.4247, Adjusted R-squared:  0.3959 
## F-statistic: 14.76 on 1 and 20 DF,  p-value: 0.001018

Although this output is neatly arranged it will take sometime to extract certain measures, for example when attempting to compare them with another model. At least with function coef() we can arrange the model’s coefficients in a tabular format such that the point estimate, standard error and the t and p-value are in their own columns. However the column names are pretty messy (especially the one holding the p-value) and will require additional data manipulation.

coef(summary(linear.model))
##                Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 147.8866974 8.45289168 17.495397 1.365844e-13
## mass          0.3824361 0.09953906  3.842071 1.017536e-03

Using tidy()

The function tidy() summarizes the model’s statistical findings in a consistent way and stores them in a tibble.

tidy(linear.model)
## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)  148.       8.45       17.5  1.37e-13
## 2 mass           0.382    0.0995      3.84 1.02e- 3

This function can be applied to a wide range of objects and not only of the output of function lm(). Consider for example the following logistic regression computed by using function glm().

logit.model <- glm(gender_2 ~ mass, data = starwars, family = binomial(link = "logit"))
tidy(logit.model)
## # A tibble: 2 x 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)   0.815     1.04       0.785  0.433 
## 2 mass         -0.0371    0.0160    -2.32   0.0203

To obtain a complete list of possible objects tidy() can handle just type methods(tidy) into R’s console.

Hypothesis Testing

We can also use the function tidy() to clean the output of hypothesis testing functions like t.test().

The usual output of a t-test looks like this:

t.test(mass ~ gender, data = starwars)
## 
##  Welch Two Sample t-test
## 
## data:  mass by gender
## t = -1.936, df = 49.094, p-value = 0.05864
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -104.868811    1.952712
## sample estimates:
##  mean in group feminine mean in group masculine 
##                54.68889               106.14694

Below I use tidy() to clean the output.

t.test(mass ~ gender, data = starwars) %>%
  tidy()
## # A tibble: 1 x 10
##   estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
##      <dbl>     <dbl>     <dbl>     <dbl>   <dbl>     <dbl>    <dbl>     <dbl>
## 1    -51.5      54.7      106.     -1.94  0.0586      49.1    -105.      1.95
## # ... with 2 more variables: method <chr>, alternative <chr>


2.2 glance()

In addition of tidying the coefficient block of the model output we can construct a single row summary of the model’s fit (like R2, overall F-test, etc.) with function glance(). The output of this function varies with the type of object you supply it with. For example the output of a linear regression model contains the following columns:

glance(linear.model)
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.425         0.396  8.84      14.8 0.00102     1  -78.1  162.  166.
## # ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

The output of the logit model, for example, contains no R2 (since this is not a meaningful diagnostic measure for logistic regressions).

glance(logit.model)
## # A tibble: 1 x 8
##   null.deviance df.null logLik   AIC   BIC deviance df.residual  nobs
##           <dbl>   <int>  <dbl> <dbl> <dbl>    <dbl>       <int> <int>
## 1          50.1      57  -21.6  47.2  51.3     43.2          56    58

2.3 augment()

With augment() we can enhance data (used in estimating the model) with information from an object. For example the object generated with function lm() can be augmented with regression diagnostics like the residuals, leverage and Cook’s D.

Regression Diagnostics

Usually you would have to compute regression diagnostics with functions like residuals() or fitted.values() and then combine them in a data frame.

r <- residuals(linear.model)
fit <- fitted.values(linear.model)
leverage <- hatvalues(linear.model)
cooksd <- cooks.distance(linear.model)

tibble(r, fit, leverage, cooksd)
## # A tibble: 22 x 4
##         r   fit leverage   cooksd
##     <dbl> <dbl>    <dbl>    <dbl>
##  1  -5.33  177.   0.0497 0.0100  
##  2   2.10  200.   0.404  0.0322  
##  3 -16.6   167.   0.190  0.512   
##  4 -15.8   194.   0.221  0.580   
##  5 -11.6   177.   0.0531 0.0507  
##  6   2.99  180.   0.0456 0.00286 
##  7   4.67  177.   0.0497 0.00766 
##  8   7.99  180.   0.0456 0.0205  
##  9   1.52  178.   0.0464 0.000753
## 10  -7.33  177.   0.0497 0.0189  
## # ... with 12 more rows

The function augment() combines all the relevant information in a tibble.

augment(linear.model)
## # A tibble: 22 x 9
##    .rownames height  mass .fitted .resid .std.resid   .hat .sigma  .cooksd
##    <chr>      <int> <dbl>   <dbl>  <dbl>      <dbl>  <dbl>  <dbl>    <dbl>
##  1 1            172    77    177.  -5.33     -0.619 0.0497   8.98 0.0100  
##  2 4            202   136    200.   2.10      0.308 0.404    9.05 0.0322  
##  3 5            150    49    167. -16.6      -2.09  0.190    8.02 0.512   
##  4 6            178   120    194. -15.8      -2.02  0.221    8.09 0.580   
##  5 7            165    75    177. -11.6      -1.34  0.0531   8.65 0.0507  
##  6 9            183    84    180.   2.99      0.346 0.0456   9.04 0.00286 
##  7 10           182    77    177.   4.67      0.541 0.0497   9.00 0.00766 
##  8 11           188    84    180.   7.99      0.925 0.0456   8.88 0.0205  
##  9 14           180    80    178.   1.52      0.176 0.0464   9.06 0.000753
## 10 17           170    77    177.  -7.33     -0.851 0.0497   8.91 0.0189  
## # ... with 12 more rows

This is often a useful starting point for creating diganostic plots with the ggplot2 package.

Chi-squared Test

The function augment() can also be applied to output generated by function chisq.test(). First, take a look at the contingency table below.

ftable(xtabs(mass ~ gender + hair_color, data = starwars))
##           hair_color auburn, white  black  blond blonde  brown brown, grey   grey   none  white
## gender                                                                                         
## feminine                       0.0  106.2    0.0   55.0  169.0         0.0    0.0  162.0    0.0
## masculine                     77.0  405.2  161.0    0.0  703.0       120.0   75.0 1910.0  179.0

The output of the chi-squared test looks like this:

chisq.test(xtabs(mass ~ gender + species, data = starwars))
## 
##  Pearson's Chi-squared test
## 
## data:  xtabs(mass ~ gender + species, data = starwars)
## X-squared = 3752.1, df = 30, p-value < 2.2e-16

Now, with augment() we can enhance the categories of the contingency table with the information genereated by the test.

augment(chisq.test(xtabs(mass ~ gender + species, data = starwars)))
## # A tibble: 62 x 9
##    gender    species  .observed   .prop .row.prop .col.prop .expected .resid
##    <fct>     <fct>        <dbl>   <dbl>     <dbl>     <dbl>     <dbl>  <dbl>
##  1 feminine  Aleena           0 0         0               0      1.30 -1.14 
##  2 masculine Aleena          15 0.00263   0.00288         1     13.7   0.350
##  3 feminine  Besalisk         0 0         0               0      8.82 -2.97 
##  4 masculine Besalisk       102 0.0179    0.0196          1     93.2   0.913
##  5 feminine  Cerean           0 0         0               0      7.09 -2.66 
##  6 masculine Cerean          82 0.0144    0.0158          1     74.9   0.819
##  7 feminine  Clawdite        55 0.00966   0.112           1      4.75 23.0  
##  8 masculine Clawdite         0 0         0               0     50.2  -7.09 
##  9 feminine  Droid            0 0         0               0     24.1  -4.91 
## 10 masculine Droid          279 0.0490    0.0536          1    255.    1.51 
## # ... with 52 more rows, and 1 more variable: .std.resid <dbl>


References

Robinson, David, Alex Hayes, and Simon Couch. 2020. Broom: Convert Statistical Objects into Tidy Tibbles. https://CRAN.R-project.org/package=broom.