Reading time: 4 minutes (640 words)
The R package broom
offers a set of functions which aim to ease the handling of output generated by statistical models. The package follows the tidy data concept (see the tidyr tutorial) and hence will work well with other R packages of the tidyverse. For illustrating the use of each function, I’m using the starwars
dataset from the dplyr
package. Let’s load both packages.
library(broom)
library(dplyr)
Later on I’m using a logistic regression model, so I need to create a recoded version of the gender
column.
starwars <-
starwars %>%
mutate(gender_2 = if_else(gender == "masculine",0,1))
If you don’t know it already: this is the starwars
dataset.
print(starwars)
## # A tibble: 87 x 15
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke Sk~ 172 77 blond fair blue 19 male mascu~
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu~
## 3 R2-D2 96 32 <NA> white, bl~ red 33 none mascu~
## 4 Darth V~ 202 136 none white yellow 41.9 male mascu~
## 5 Leia Or~ 150 49 brown light brown 19 fema~ femin~
## 6 Owen La~ 178 120 brown, gr~ light blue 52 male mascu~
## 7 Beru Wh~ 165 75 brown light blue 47 fema~ femin~
## 8 R5-D4 97 32 <NA> white, red red NA none mascu~
## 9 Biggs D~ 183 84 black light brown 24 male mascu~
## 10 Obi-Wan~ 182 77 auburn, w~ fair blue-gray 57 male mascu~
## # ... with 77 more rows, and 6 more variables: homeworld <chr>, species <chr>,
## # films <list>, vehicles <list>, starships <list>, gender_2 <dbl>
The package contains three main functions to manipulate objects that are usually not in a tidy format. This may include the output of statistical models or tests.
With this function we can turn an object into a tidy tibble. For more information on the tibble see here.
Consider the following linear regression model, which is computed with function lm()
.
linear.model <- lm(height ~ mass, data = starwars, subset = species == "Human")
Typcially you’ll use the summary()
function, which prints the following output to R’s console.
summary(linear.model)
##
## Call:
## lm(formula = height ~ mass, data = starwars, subset = species ==
## "Human")
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.626 -6.261 1.810 5.130 14.518
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 147.88670 8.45289 17.495 1.37e-13 ***
## mass 0.38244 0.09954 3.842 0.00102 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.842 on 20 degrees of freedom
## (17 observations deleted due to missingness)
## Multiple R-squared: 0.4247, Adjusted R-squared: 0.3959
## F-statistic: 14.76 on 1 and 20 DF, p-value: 0.001018
Although this output is neatly arranged it will take sometime to extract certain measures, for example when attempting to compare them with another model. At least with function coef()
we can arrange the model’s coefficients in a tabular format such that the point estimate, standard error and the t and p-value are in their own columns. However the column names are pretty messy (especially the one holding the p-value) and will require additional data manipulation.
coef(summary(linear.model))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 147.8866974 8.45289168 17.495397 1.365844e-13
## mass 0.3824361 0.09953906 3.842071 1.017536e-03
The function tidy()
summarizes the model’s statistical findings in a consistent way and stores them in a tibble.
tidy(linear.model)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 148. 8.45 17.5 1.37e-13
## 2 mass 0.382 0.0995 3.84 1.02e- 3
This function can be applied to a wide range of objects and not only of the output of function lm()
. Consider for example the following logistic regression computed by using function glm()
.
logit.model <- glm(gender_2 ~ mass, data = starwars, family = binomial(link = "logit"))
tidy(logit.model)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.815 1.04 0.785 0.433
## 2 mass -0.0371 0.0160 -2.32 0.0203
To obtain a complete list of possible objects tidy()
can handle just type methods(tidy)
into R’s console.
We can also use the function tidy()
to clean the output of hypothesis testing functions like t.test()
.
The usual output of a t-test looks like this:
t.test(mass ~ gender, data = starwars)
##
## Welch Two Sample t-test
##
## data: mass by gender
## t = -1.936, df = 49.094, p-value = 0.05864
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -104.868811 1.952712
## sample estimates:
## mean in group feminine mean in group masculine
## 54.68889 106.14694
Below I use tidy()
to clean the output.
t.test(mass ~ gender, data = starwars) %>%
tidy()
## # A tibble: 1 x 10
## estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 -51.5 54.7 106. -1.94 0.0586 49.1 -105. 1.95
## # ... with 2 more variables: method <chr>, alternative <chr>
In addition of tidying the coefficient block of the model output we can construct a single row summary of the model’s fit (like R2, overall F-test, etc.) with function glance()
. The output of this function varies with the type of object you supply it with. For example the output of a linear regression model contains the following columns:
glance(linear.model)
## # A tibble: 1 x 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.425 0.396 8.84 14.8 0.00102 1 -78.1 162. 166.
## # ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
The output of the logit model, for example, contains no R2 (since this is not a meaningful diagnostic measure for logistic regressions).
glance(logit.model)
## # A tibble: 1 x 8
## null.deviance df.null logLik AIC BIC deviance df.residual nobs
## <dbl> <int> <dbl> <dbl> <dbl> <dbl> <int> <int>
## 1 50.1 57 -21.6 47.2 51.3 43.2 56 58
With augment()
we can enhance data (used in estimating the model) with information from an object. For example the object generated with function lm()
can be augmented with regression diagnostics like the residuals, leverage and Cook’s D.
Usually you would have to compute regression diagnostics with functions like residuals()
or fitted.values()
and then combine them in a data frame.
r <- residuals(linear.model)
fit <- fitted.values(linear.model)
leverage <- hatvalues(linear.model)
cooksd <- cooks.distance(linear.model)
tibble(r, fit, leverage, cooksd)
## # A tibble: 22 x 4
## r fit leverage cooksd
## <dbl> <dbl> <dbl> <dbl>
## 1 -5.33 177. 0.0497 0.0100
## 2 2.10 200. 0.404 0.0322
## 3 -16.6 167. 0.190 0.512
## 4 -15.8 194. 0.221 0.580
## 5 -11.6 177. 0.0531 0.0507
## 6 2.99 180. 0.0456 0.00286
## 7 4.67 177. 0.0497 0.00766
## 8 7.99 180. 0.0456 0.0205
## 9 1.52 178. 0.0464 0.000753
## 10 -7.33 177. 0.0497 0.0189
## # ... with 12 more rows
The function augment()
combines all the relevant information in a tibble.
augment(linear.model)
## # A tibble: 22 x 9
## .rownames height mass .fitted .resid .std.resid .hat .sigma .cooksd
## <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 172 77 177. -5.33 -0.619 0.0497 8.98 0.0100
## 2 4 202 136 200. 2.10 0.308 0.404 9.05 0.0322
## 3 5 150 49 167. -16.6 -2.09 0.190 8.02 0.512
## 4 6 178 120 194. -15.8 -2.02 0.221 8.09 0.580
## 5 7 165 75 177. -11.6 -1.34 0.0531 8.65 0.0507
## 6 9 183 84 180. 2.99 0.346 0.0456 9.04 0.00286
## 7 10 182 77 177. 4.67 0.541 0.0497 9.00 0.00766
## 8 11 188 84 180. 7.99 0.925 0.0456 8.88 0.0205
## 9 14 180 80 178. 1.52 0.176 0.0464 9.06 0.000753
## 10 17 170 77 177. -7.33 -0.851 0.0497 8.91 0.0189
## # ... with 12 more rows
This is often a useful starting point for creating diganostic plots with the ggplot2
package.
The function augment()
can also be applied to output generated by function chisq.test()
. First, take a look at the contingency table below.
ftable(xtabs(mass ~ gender + hair_color, data = starwars))
## hair_color auburn, white black blond blonde brown brown, grey grey none white
## gender
## feminine 0.0 106.2 0.0 55.0 169.0 0.0 0.0 162.0 0.0
## masculine 77.0 405.2 161.0 0.0 703.0 120.0 75.0 1910.0 179.0
The output of the chi-squared test looks like this:
chisq.test(xtabs(mass ~ gender + species, data = starwars))
##
## Pearson's Chi-squared test
##
## data: xtabs(mass ~ gender + species, data = starwars)
## X-squared = 3752.1, df = 30, p-value < 2.2e-16
Now, with augment()
we can enhance the categories of the contingency table with the information genereated by the test.
augment(chisq.test(xtabs(mass ~ gender + species, data = starwars)))
## # A tibble: 62 x 9
## gender species .observed .prop .row.prop .col.prop .expected .resid
## <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 feminine Aleena 0 0 0 0 1.30 -1.14
## 2 masculine Aleena 15 0.00263 0.00288 1 13.7 0.350
## 3 feminine Besalisk 0 0 0 0 8.82 -2.97
## 4 masculine Besalisk 102 0.0179 0.0196 1 93.2 0.913
## 5 feminine Cerean 0 0 0 0 7.09 -2.66
## 6 masculine Cerean 82 0.0144 0.0158 1 74.9 0.819
## 7 feminine Clawdite 55 0.00966 0.112 1 4.75 23.0
## 8 masculine Clawdite 0 0 0 0 50.2 -7.09
## 9 feminine Droid 0 0 0 0 24.1 -4.91
## 10 masculine Droid 279 0.0490 0.0536 1 255. 1.51
## # ... with 52 more rows, and 1 more variable: .std.resid <dbl>