In this tutorial I’ll show you how to handle categorical variables (factors) with the package forcats
. Factors represent another storage type for columns like character or numeric. The behavior of factors in R is not always consistent, especially when switching between conventional data frames and tibbles (for more information about the difference see this tutorial). There is an interesting article by Roger Peng who explains factors in the course of R’s development.
For this tutorial you’ll need to know some functions of the dplyr
package like mutate()
, summarise()
and group_by()
. So I recommend you briefly read sections 2 to 6 of this tutorial first.
Otherwise, load the packages.
library(forcats)
library(dplyr)
library(tibble)
We will work once more with the starwars
dataset contained in the dplyr
package. Below let’s have a tibble::glimpse()
at the dataset.
starwars %>%
glimpse()
## Rows: 87
## Columns: 14
## $ name <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Or~
## $ height <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2~
## $ mass <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.~
## $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", N~
## $ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", "~
## $ eye_color <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue",~
## $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, ~
## $ sex <chr> "male", "none", "none", "male", "female", "male", "female",~
## $ gender <chr> "masculine", "masculine", "masculine", "masculine", "femini~
## $ homeworld <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "T~
## $ species <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Huma~
## $ films <list> <"The Empire Strikes Back", "Revenge of the Sith", "Return~
## $ vehicles <list> <"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "Imp~
## $ starships <list> <"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1",~
There are columns of type character (chr), numeric (int, dbl), and list. Do you have an idea which columns are suitable to be treated as factors?
A factor is R’s data structure for categorical data and the forcats
package provides many functions to work with them. Factors are often required when performing regression analysis in R with lm()
or glm()
. These functions need factors to handle categorical data appropriately. For example, R maps the unique values of a categorical variable to individual dummy variables in order to estimate the regression model. It is therefore good to know some basic functions that handle categorical data from the start.
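As a minimal sketch of this dummy coding, base R's model.matrix() shows the columns that lm() would build from a factor. The toy data frame and its values are made up for illustration and are not part of the starwars dataset.

```r
# Hypothetical toy data: one numeric outcome and one three-level factor
toy <- data.frame(
  y     = c(1, 2, 3, 4, 5, 6),
  group = factor(c("a", "b", "c", "a", "b", "c"))
)

# model.matrix() builds the design matrix that lm() uses internally
mm <- model.matrix(y ~ group, data = toy)

# The first level ("a") acts as the reference category, so only
# "b" and "c" get their own dummy column next to the intercept
colnames(mm)
```

With the default treatment contrasts the first level of the factor becomes the baseline, which is one reason why the order of levels matters.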
The following functions are used for basic operations on factors.
With the base R function factor() we can create a new factor variable from scratch. First we have to input a vector x =
of values and then a vector of levels =
.
factor(
x = c("1", "2", "3",
"1", "2", "3"),
levels = c("1", "2", "3")
)
## [1] 1 2 3 1 2 3
## Levels: 1 2 3
Below the values of the factor we see a line with its levels.
There is also a labels =
argument where we can supply labels for the factor levels. By default the labels are identical to the levels. The labels have to appear in the same order as the levels.
factor(
x = c("1", "2", "3",
"1", "2", "3"),
levels = c("1", "2", "3"),
labels = c("first", "second", "third")
)
## [1] first second third first second third
## Levels: first second third
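It may help to see what happens to values that are not listed as levels. The sketch below uses a made-up vector of answers (not from the starwars dataset) and also prints the integer codes that R stores behind a factor.

```r
# Hypothetical survey answers; "unknown" is not declared as a level
answers <- factor(
  x      = c("low", "high", "medium", "unknown"),
  levels = c("low", "medium", "high")
)

levels(answers)      # the declared levels, in the declared order
as.integer(answers)  # integer codes; the undeclared value becomes NA
```

So the order of the levels is whatever we declare, and any value outside the declared levels is silently turned into NA.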
We may also take an already existing column of the starwars
dataset and transform it with function as.factor()
and dplyr::mutate()
into a factor variable. Below I take the species
variable and convert it to a factor. Only the storage type of the column is altered but not the values of species
.
starwars %>%
mutate(species_fct = as.factor(species)) %>%
select(name, starts_with("species"))
## # A tibble: 87 x 3
## name species species_fct
## <chr> <chr> <fct>
## 1 Luke Skywalker Human Human
## 2 C-3PO Droid Droid
## 3 R2-D2 Droid Droid
## 4 Darth Vader Human Human
## 5 Leia Organa Human Human
## 6 Owen Lars Human Human
## 7 Beru Whitesun lars Human Human
## 8 R5-D4 Droid Droid
## 9 Biggs Darklighter Human Human
## 10 Obi-Wan Kenobi Human Human
## # ... with 77 more rows
The function levels()
returns a character vector of all levels of a factor. The column to which this function is applied must be extracted as a vector and must be a factor. First, let us take a look at what happens if we use levels()
on a column that is not of type factor.
starwars %>%
pull(species) %>%
levels()
## NULL
It simply returns NULL
because there are no levels attached to the column species
which is originally of type character. Next we’ll try levels()
on the converted species_fct
column.
starwars %>%
mutate(species_fct = as.factor(species)) %>%
pull(species_fct) %>%
levels()
## [1] "Aleena" "Besalisk" "Cerean" "Chagrian"
## [5] "Clawdite" "Droid" "Dug" "Ewok"
## [9] "Geonosian" "Gungan" "Human" "Hutt"
## [13] "Iktotchi" "Kaleesh" "Kaminoan" "Kel Dor"
## [17] "Mirialan" "Mon Calamari" "Muun" "Nautolan"
## [21] "Neimodian" "Pau'an" "Quermian" "Rodian"
## [25] "Skakoan" "Sullustan" "Tholothian" "Togruta"
## [29] "Toong" "Toydarian" "Trandoshan" "Twi'lek"
## [33] "Vulptereen" "Wookiee" "Xexto" "Yoda's species"
## [37] "Zabrak"
This worked as expected! Anybody remember the function dplyr::distinct()
from the dplyr tutorial? This function also lists all unique values of a column but can be applied to any column type.
starwars %>%
distinct(species)
## # A tibble: 38 x 1
## species
## <chr>
## 1 Human
## 2 Droid
## 3 Wookiee
## 4 Rodian
## 5 Hutt
## 6 Yoda's species
## 7 Trandoshan
## 8 Mon Calamari
## 9 Ewok
## 10 Sullustan
## # ... with 28 more rows
There is also a minor difference: distinct()
lists NA
as a unique value while levels()
does not. However, using function factor()
with argument exclude = NULL
retains NA
in the output.
starwars %>%
mutate(species_fct = factor(species, exclude = NULL)) %>%
pull(species_fct) %>%
levels()
## [1] "Aleena" "Besalisk" "Cerean" "Chagrian"
## [5] "Clawdite" "Droid" "Dug" "Ewok"
## [9] "Geonosian" "Gungan" "Human" "Hutt"
## [13] "Iktotchi" "Kaleesh" "Kaminoan" "Kel Dor"
## [17] "Mirialan" "Mon Calamari" "Muun" "Nautolan"
## [21] "Neimodian" "Pau'an" "Quermian" "Rodian"
## [25] "Skakoan" "Sullustan" "Tholothian" "Togruta"
## [29] "Toong" "Toydarian" "Trandoshan" "Twi'lek"
## [33] "Vulptereen" "Wookiee" "Xexto" "Yoda's species"
## [37] "Zabrak" NA
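The same behaviour can be reproduced on a tiny vector, which makes the effect of exclude = NULL easier to see. The vector x is a made-up example.

```r
# Hypothetical character vector with one missing value
x <- c("a", NA, "b", "a")

# By default NA is not a level
with_default <- levels(factor(x))

# With exclude = NULL, NA is kept as an additional level
with_na <- levels(factor(x, exclude = NULL))

with_default
with_na
```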
The following functions are used to have a closer look at the values and levels of a factor.
With fct_count()
you’re able to count the number of values for each level of a factor variable. Below I apply this to the species
column of the starwars
dataset.
fct_count(starwars$species)
## # A tibble: 38 x 2
## f n
## <fct> <int>
## 1 Aleena 1
## 2 Besalisk 1
## 3 Cerean 1
## 4 Chagrian 1
## 5 Clawdite 1
## 6 Droid 6
## 7 Dug 1
## 8 Ewok 1
## 9 Geonosian 1
## 10 Gungan 3
## # ... with 28 more rows
Note that it is not always necessary to convert columns of type character to a factor before applying functions of the forcats
package.
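fct_count() also has the arguments sort = and prop =, which I find handy for a quick overview. Here is a small sketch on a made-up factor; the object names f and counts are only for illustration.

```r
library(forcats)

# Hypothetical toy factor
f <- factor(c("a", "b", "b", "c", "c", "c"))

# sort = TRUE orders the rows by descending count,
# prop = TRUE adds a column p with the share of each level
counts <- fct_count(f, sort = TRUE, prop = TRUE)
counts
```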
By the way: the function dplyr::count()
returns a similar output.
starwars %>%
count(species)
## # A tibble: 38 x 2
## species n
## <chr> <int>
## 1 Aleena 1
## 2 Besalisk 1
## 3 Cerean 1
## 4 Chagrian 1
## 5 Clawdite 1
## 6 Droid 6
## 7 Dug 1
## 8 Ewok 1
## 9 Geonosian 1
## 10 Gungan 3
## # ... with 28 more rows
We can check which values of a factor match one or more levels with the function fct_match()
. For each value it returns TRUE
if the value equals one of the given levels or FALSE
if it does not. Let’s see which values of the factor sex
display a level "male"
.
fct_match(starwars$sex, "male") %>% table()
## .
## FALSE TRUE
## 27 60
There are 60 Star Wars characters where the level "male"
is present.
We may also check multiple levels at once.
fct_match(starwars$sex, c("male", "female")) %>% table()
## .
## FALSE TRUE
## 11 76
There are 76 Star Wars characters with either a "male"
or "female"
level of factor sex
.
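One reason to prefer fct_match() over %in% is that it validates the levels we ask for. The sketch below uses a made-up factor and a deliberately misspelled level to show the difference; the error is caught with tryCatch() so the example runs through.

```r
library(forcats)

# Hypothetical toy factor
sex <- factor(c("male", "female", "none", "male"))

# One logical value per element of the factor
matches <- fct_match(sex, "male")
matches

# %in% silently returns FALSE everywhere for a misspelled level ...
typo_in <- sex %in% "mael"

# ... whereas fct_match() throws an error because "mael" is not a level
typo_result <- tryCatch(
  fct_match(sex, "mael"),
  error = function(e) "error"
)
typo_result
```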
The function fct_unique()
only returns the unique values of a factor and removes duplicates.
fct_unique(as.factor(starwars$sex))
## [1] female hermaphroditic male none
## Levels: female hermaphroditic male none
We may also combine different factors with the following functions.
With function fct_c()
we can combine factors with different levels. Below I create two factors f1
and f2
which represent the sex
and gender
column of the starwars
dataset. Then I use fct_c()
to combine them.
f1 <- as.factor(starwars$sex)
f2 <- as.factor(starwars$gender)
fct_c(f1, f2) %>% levels()
## [1] "female" "hermaphroditic" "male" "none"
## [5] "feminine" "masculine"
This function is best used to patch together factors from multiple sources that should have the same levels.
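A smaller sketch may make the mechanics clearer. The two factors fa and fb are made up and share only one level.

```r
library(forcats)

# Two hypothetical factors with partly different levels
fa <- factor(c("a", "b"))
fb <- factor(c("b", "c"))

# fct_c() concatenates the values and uses the union of the levels
combined <- fct_c(fa, fb)

as.character(combined)
levels(combined)
```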
There is also a neat function called fct_cross()
which computes a factor whose levels are the combinations of the levels of all input factors.
starwars %>%
mutate(sex_gender = fct_cross(sex, gender)) %>%
select(name, sex_gender, sex, gender)
## # A tibble: 87 x 4
## name sex_gender sex gender
## <chr> <fct> <chr> <chr>
## 1 Luke Skywalker male:masculine male masculine
## 2 C-3PO none:masculine none masculine
## 3 R2-D2 none:masculine none masculine
## 4 Darth Vader male:masculine male masculine
## 5 Leia Organa female:feminine female feminine
## 6 Owen Lars male:masculine male masculine
## 7 Beru Whitesun lars female:feminine female feminine
## 8 R5-D4 none:masculine none masculine
## 9 Biggs Darklighter male:masculine male masculine
## 10 Obi-Wan Kenobi male:masculine male masculine
## # ... with 77 more rows
For standardizing the levels across a list of factors we can use the function fct_unify()
. It returns a list where each element is the initial factor augmented by the unified levels. To see only the levels()
of each factor we have to apply a special function called map()
from the package purrr
. It applies the functions levels()
to each list element. Read more about it here.
f3 <- as.factor(starwars$eye_color)
fct_unify(list(f1, f2, f3)) %>% purrr::map(levels)
## [[1]]
## [1] "female" "hermaphroditic" "male" "none"
## [5] "feminine" "masculine" "black" "blue"
## [9] "blue-gray" "brown" "dark" "gold"
## [13] "green, yellow" "hazel" "orange" "pink"
## [17] "red" "red, blue" "unknown" "white"
## [21] "yellow"
##
## [[2]]
## [1] "female" "hermaphroditic" "male" "none"
## [5] "feminine" "masculine" "black" "blue"
## [9] "blue-gray" "brown" "dark" "gold"
## [13] "green, yellow" "hazel" "orange" "pink"
## [17] "red" "red, blue" "unknown" "white"
## [21] "yellow"
##
## [[3]]
## [1] "female" "hermaphroditic" "male" "none"
## [5] "feminine" "masculine" "black" "blue"
## [9] "blue-gray" "brown" "dark" "gold"
## [13] "green, yellow" "hazel" "orange" "pink"
## [17] "red" "red, blue" "unknown" "white"
## [21] "yellow"
f1
, f2
and f3
now all have the same levels.
We can’t go on forever without making nice graphs, can we? Some of the functions in this section are particularly helpful when you plot factors but are unhappy with the order in which their levels appear. So first, load the ggplot2
package and then take a look at the following bar chart.
library(ggplot2)
starwars %>%
ggplot(data = ., aes(x=sex)) +
geom_bar(stat = "count", fill = "white", color = "black")
Each bar represents the number of cases in the starwars
dataset with the respective level of sex
. They are ordered alphabetically except for NA
which is treated as a separate category and comes last.
These functions address the most common issues that arise when attempting to change the order of the levels of a factor. In general, they’re used in combination with dplyr::mutate()
because the values of the factor need to be accessed.
With fct_relevel() you can manually reorder the levels of a factor variable. We do not have to list all levels of the factor variable in fct_relevel()
as the remaining levels keep their existing order (alphabetical here) and follow the ones we list. Below I put the levels female
and male
in the first and second position of sex
.
starwars %>%
mutate(sex_ordered = fct_relevel(sex, c("female", "male"))) %>%
ggplot(data = ., aes(x = sex_ordered)) +
geom_bar(stat = "count", fill = "white", color = "black")
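fct_relevel() also has an after = argument that controls where the listed levels are placed. The following sketch uses a made-up factor to show a few typical calls.

```r
library(forcats)

# Hypothetical toy factor with levels a, b, c, d
f <- factor(c("a", "b", "c", "d"))

# Default (after = 0): the listed level moves to the front
to_front <- levels(fct_relevel(f, "c"))

# after = 1: the listed level is placed after the first remaining level
after_one <- levels(fct_relevel(f, "d", after = 1))

# after = Inf: the listed level moves to the end
to_end <- levels(fct_relevel(f, "a", after = Inf))

to_front
after_one
to_end
```

Moving a level to the end with after = Inf is useful for catch-all categories that should appear last in a plot.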
The function fct_infreq() lets you reorder a factor’s levels by the frequency with which they appear in the data. The level with the highest frequency comes first. Let’s use this function on the sex
column before plotting the bar chart. Note that NA
is again excluded.
starwars %>%
mutate(sex_ordered = fct_infreq(f = sex)) %>%
ggplot(data = ., aes(x = sex_ordered)) +
geom_bar(stat = "count", fill = "white", color = "black")
There is also a related function called fct_inseq()
which is designed to work with factors whose levels display a numeric sequence. Columns of type numeric such as birth_year
must always be explicitly converted to factor.
starwars %>%
mutate(birth_year_ordered = fct_inseq(as.factor(birth_year))) %>%
ggplot(data = ., aes(x = birth_year_ordered)) +
geom_bar(stat = "count", fill = "white", color = "black") +
theme(axis.text.x = element_text(angle = 90))
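Why is fct_inseq() needed at all? Levels are character strings, so by default they are sorted as text and not as numbers. A made-up example illustrates this.

```r
library(forcats)

# Hypothetical factor whose levels are numbers stored as text
f <- factor(c("10", "2", "1"))

# Default: levels are sorted as text, so "10" comes before "2"
text_order <- levels(f)

# fct_inseq() orders the levels by their numeric value
numeric_order <- levels(fct_inseq(f))

text_order
numeric_order
```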
For what it’s worth, with fct_inorder() you can also reorder the levels of a factor by the order in which they first appear in the dataset.
starwars %>%
mutate(sex_ordered = fct_inorder(sex)) %>%
ggplot(data = ., aes(x = sex_ordered)) +
geom_bar(stat = "count", fill = "white", color = "black")
Let’s have a look at the first 10 rows of the starwars
dataset to verify this.
starwars %>%
print()
## # A tibble: 87 x 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke Sk~ 172 77 blond fair blue 19 male mascu~
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu~
## 3 R2-D2 96 32 <NA> white, bl~ red 33 none mascu~
## 4 Darth V~ 202 136 none white yellow 41.9 male mascu~
## 5 Leia Or~ 150 49 brown light brown 19 fema~ femin~
## 6 Owen La~ 178 120 brown, gr~ light blue 52 male mascu~
## 7 Beru Wh~ 165 75 brown light blue 47 fema~ femin~
## 8 R5-D4 97 32 <NA> white, red red NA none mascu~
## 9 Biggs D~ 183 84 black light brown 24 male mascu~
## 10 Obi-Wan~ 182 77 auburn, w~ fair blue-gray 57 male mascu~
## # ... with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## # films <list>, vehicles <list>, starships <list>
The function fct_rev() reverses the order of a factor’s levels. This can be useful when flipping the axes of a bar chart with function ggplot2::coord_flip()
. When applying fct_rev()
the levels of the factor appear alphabetically from the top to the bottom of the y-axis instead of the other way around (as always, except for NA
).
starwars %>%
mutate(sex_ordered = fct_rev(f = sex)) %>%
ggplot(data = ., aes(x = sex_ordered)) +
geom_bar(stat = "count", fill = "white", color = "black") +
coord_flip()
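A combination I use often in flipped bar charts is fct_rev(fct_infreq()), because the first level is drawn at the bottom of a flipped axis. The sketch below only inspects the levels of a made-up factor and leaves out the plot.

```r
library(forcats)

# Hypothetical toy factor: "c" is most frequent, "a" least frequent
f <- factor(c("a", "b", "b", "c", "c", "c"))

# Most frequent level first
by_freq <- fct_infreq(f)

# Reversed: most frequent level last, i.e. at the top of a flipped axis
by_freq_rev <- fct_rev(by_freq)

levels(by_freq)
levels(by_freq_rev)
```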
There are also some more advanced functions which you’ll either use less frequently or which require more thought when applying them.
You can shift the levels of a factor to the left or right with the function fct_shift(). Positive values of argument n =
shift the levels to the left and negative values to the right.
starwars %>%
mutate(sex_ordered = fct_shift(f = as.factor(sex), n = -1)) %>%
ggplot(data = ., aes(x = sex_ordered)) +
geom_bar(stat = "count", fill = "white", color = "black")
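Shifting is easiest to understand with cyclical levels such as weekdays. The factor below is a made-up example with only three days.

```r
library(forcats)

# Hypothetical factor with an explicit level order
days <- factor(
  c("Mon", "Tue", "Wed"),
  levels = c("Mon", "Tue", "Wed")
)

# n = 1 shifts to the left: the first level wraps around to the end
shift_left <- levels(fct_shift(days, n = 1))

# n = -1 shifts to the right: the last level wraps around to the front
shift_right <- levels(fct_shift(days, n = -1))

shift_left
shift_right
```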
With fct_shuffle()
you can randomly permute the order of a factor’s levels. Use set.seed()
to obtain replicable results when working with functions that have a random component.
set.seed(123)
starwars %>%
mutate(sex_ordered = fct_shuffle(f = sex)) %>%
ggplot(data = ., aes(x = sex_ordered)) +
geom_bar(stat = "count", fill = "white", color = "black")
set.seed(456)
starwars %>%
mutate(sex_ordered = fct_shuffle(f = sex)) %>%
ggplot(data = ., aes(x = sex_ordered)) +
geom_bar(stat = "count", fill = "white", color = "black")
The function fct_reorder()
lets you reorder a factor’s levels by their relationship with another variable. This is useful when you plot a factor and want to arrange its levels by another column. Below I want to sort the levels of column sex
according to the average height of the Star Wars characters within each unique value (level) of sex
. In addition to the factor .f =
we have to specify the variable along which the reordering should be carried out .x =
, as well as a function .fun =
that summarises the values of .x within each level and thereby determines the order. With argument .desc =
we can force a descending order.
starwars %>%
mutate(sex_ordered =
fct_reorder(
.f = sex,
.x = height,
.fun = mean, na.rm = TRUE,
.desc = TRUE)
) %>%
ggplot(data = ., aes(x = sex_ordered)) +
geom_bar(stat = "count", fill = "white", color = "black")
According to this plot male
Star Wars characters have the highest average height, followed by hermaphroditic
and female
characters. Do we trust function fct_reorder()
just like that? Better to check this with dplyr::group_by()
and dplyr::summarise()
.
starwars %>%
group_by(sex) %>%
summarise(mean_height = mean(height, na.rm = TRUE)) %>%
arrange(desc(mean_height))
## # A tibble: 5 x 2
## sex mean_height
## <chr> <dbl>
## 1 <NA> 181.
## 2 male 179.
## 3 hermaphroditic 175
## 4 female 169.
## 5 none 131.
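The same check can be done on a toy example where the group means are easy to compute by hand. The vectors f and x are made up; note that fct_reorder() uses the median by default, so I pass .fun = mean explicitly.

```r
library(forcats)

# Hypothetical data: group means are a = 2, b = 15, c = 6
f <- factor(c("a", "a", "b", "b", "c", "c"))
x <- c(1, 3, 10, 20, 5, 7)

# Ascending order of the group means
ascending <- levels(fct_reorder(f, x, .fun = mean))

# Descending order of the group means
descending <- levels(fct_reorder(f, x, .fun = mean, .desc = TRUE))

ascending
descending
```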
The function fct_reorder2() is a 2d version of the function fct_reorder()
and lets you specify two variables (.x =
, .y =
) along which the factor’s levels are reordered. forcats provides two helper functions for argument .fun =
. The function last2()
finds the last value of .y when sorted by .x and first2()
finds the first value. This is helpful when using a line plot and aligning the line colours with the legend.
Have a look at the plot below, which uses the gss_cat dataset that ships with forcats. The legend is quite hard to read, isn’t it?
by_age <- gss_cat %>%
filter(!is.na(age)) %>%
count(age, marital) %>%
group_by(age) %>%
mutate(prop = n / sum(n))
ggplot(by_age, aes(age, prop, colour = marital)) +
geom_line(na.rm = TRUE)
Now we’ll use fct_reorder2()
to improve the plot.
ggplot(data = by_age,
aes(x = age, y = prop,
colour = fct_reorder2(.f = marital, .x = age, .y = prop, .fun = last2)
)
) +
geom_line() +
labs(colour = "marital")
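To see what last2() and first2() actually do, here is a sketch with two made-up groups and two x positions each. By default fct_reorder2() uses .fun = last2 and sorts in descending order.

```r
library(forcats)

# Hypothetical data: two lines ("a" and "b") observed at x = 1 and x = 2
f <- factor(c("a", "a", "b", "b"))
x <- c(1, 2, 1, 2)
y <- c(5, 1, 2, 9)

# last2: y at the largest x is 1 for "a" and 9 for "b",
# so "b" comes first in descending order
by_last <- levels(fct_reorder2(f, x, y))

# first2: y at the smallest x is 5 for "a" and 2 for "b",
# so "a" comes first in descending order
by_first <- levels(fct_reorder2(f, x, y, .fun = first2))

by_last
by_first
```

Ordering by the last value means the legend entries appear in the same vertical order as the line ends on the right side of the plot.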
Again you’ll often use the functions in this section to alter the values of the factor, hence I demonstrate the examples in combination with function dplyr::mutate()
.
The function fct_recode()
lets us manually change levels of a factor. You only have to specify a new value of the level for each original value that you want to change. If no value is supplied for an original level, the level remains unaltered. Below I’m recoding the levels of column sex
such that only the first letter of each original level is present in the new factor sex_mod
.
starwars %>%
mutate(sex_mod = fct_recode(
sex,
m = "male", f = "female",
n = "none", h = "hermaphroditic")
) %>%
select(name, sex_mod, sex)
## # A tibble: 87 x 3
## name sex_mod sex
## <chr> <fct> <chr>
## 1 Luke Skywalker m male
## 2 C-3PO n none
## 3 R2-D2 n none
## 4 Darth Vader m male
## 5 Leia Organa f female
## 6 Owen Lars m male
## 7 Beru Whitesun lars f female
## 8 R5-D4 n none
## 9 Biggs Darklighter m male
## 10 Obi-Wan Kenobi m male
## # ... with 77 more rows
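fct_recode() can also merge levels when the same new name is assigned to several old levels. The factor below is a made-up example.

```r
library(forcats)

# Hypothetical toy factor
f <- factor(c("apple", "bear", "banana", "dear"))

# Assigning the same new name twice merges both old levels
merged <- fct_recode(f, fruit = "apple", fruit = "banana")

as.character(merged)
levels(merged)
```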
Also consider using the function fct_relabel()
which obeys the purrr::map()
syntax to apply a function or expression to each level. Below I use function paste0()
to add the string sex_
to each level of variable sex
.
starwars %>%
mutate(sex_mod = fct_relabel(
sex,
~ paste0("sex_", .x))
) %>%
select(name, sex_mod, sex)
## # A tibble: 87 x 3
## name sex_mod sex
## <chr> <fct> <chr>
## 1 Luke Skywalker sex_male male
## 2 C-3PO sex_none none
## 3 R2-D2 sex_none none
## 4 Darth Vader sex_male male
## 5 Leia Organa sex_female female
## 6 Owen Lars sex_male male
## 7 Beru Whitesun lars sex_female female
## 8 R5-D4 sex_none none
## 9 Biggs Darklighter sex_male male
## 10 Obi-Wan Kenobi sex_male male
## # ... with 77 more rows
Sometimes, for example when publishing sensitive data, it may become necessary to anonymize levels of a factor variable with random integers. This can be done with function fct_anon()
.
starwars %>%
mutate(sex_mod = fct_anon(as.factor(sex))) %>%
select(name, sex_mod, sex)
## # A tibble: 87 x 3
## name sex_mod sex
## <chr> <fct> <chr>
## 1 Luke Skywalker 4 male
## 2 C-3PO 2 none
## 3 R2-D2 2 none
## 4 Darth Vader 4 male
## 5 Leia Organa 1 female
## 6 Owen Lars 4 male
## 7 Beru Whitesun lars 1 female
## 8 R5-D4 2 none
## 9 Biggs Darklighter 4 male
## 10 Obi-Wan Kenobi 4 male
## # ... with 77 more rows
It is also possible to add a prefix =
to the anonymized factor.
starwars %>%
mutate(sex_mod =
fct_anon(
as.factor(sex),
prefix = "sex_"
)
) %>%
select(name, sex_mod, sex)
## # A tibble: 87 x 3
## name sex_mod sex
## <chr> <fct> <chr>
## 1 Luke Skywalker sex_1 male
## 2 C-3PO sex_4 none
## 3 R2-D2 sex_4 none
## 4 Darth Vader sex_1 male
## 5 Leia Organa sex_3 female
## 6 Owen Lars sex_1 male
## 7 Beru Whitesun lars sex_3 female
## 8 R5-D4 sex_4 none
## 9 Biggs Darklighter sex_1 male
## 10 Obi-Wan Kenobi sex_1 male
## # ... with 77 more rows
With the function fct_collapse()
we can collapse levels of a factor into manually defined groups. This is very useful when a factor has too many distinct levels but some of them share common characteristics and can be merged. Below I’m generating new levels for column hair_color
. Star Wars characters with unicolored/multicolored hair are reassigned either the level "singleColor"
or "multiColor"
. Cases with "unknown"
or "none"
hair color are changed to "missing"
.
starwars %>%
mutate(hair_color_mod =
fct_collapse(
hair_color,
missing = c("unknown", "none"),
singleColor = c("auburn", "black", "blond", "blonde",
"brown", "grey", "white"),
multiColor = c("auburn, grey", "auburn, white", "brown, grey")
)
) %>%
select(name, hair_color_mod, hair_color)
## # A tibble: 87 x 3
## name hair_color_mod hair_color
## <chr> <fct> <chr>
## 1 Luke Skywalker singleColor blond
## 2 C-3PO <NA> <NA>
## 3 R2-D2 <NA> <NA>
## 4 Darth Vader missing none
## 5 Leia Organa singleColor brown
## 6 Owen Lars multiColor brown, grey
## 7 Beru Whitesun lars singleColor brown
## 8 R5-D4 <NA> <NA>
## 9 Biggs Darklighter singleColor black
## 10 Obi-Wan Kenobi multiColor auburn, white
## # ... with 77 more rows
The function fct_other()
is used to replace levels of a factor with value "Other"
. We can use this function either with argument keep =
or drop =
.
The argument keep =
merges all not supplied levels into the new category "Other"
.
starwars %>%
mutate(sex_mod =
fct_other(
sex,
keep = c("male", "female")
)
) %>%
select(name, sex_mod, sex)
## # A tibble: 87 x 3
## name sex_mod sex
## <chr> <fct> <chr>
## 1 Luke Skywalker male male
## 2 C-3PO Other none
## 3 R2-D2 Other none
## 4 Darth Vader male male
## 5 Leia Organa female female
## 6 Owen Lars male male
## 7 Beru Whitesun lars female female
## 8 R5-D4 Other none
## 9 Biggs Darklighter male male
## 10 Obi-Wan Kenobi male male
## # ... with 77 more rows
The argument drop =
merges all supplied levels into the new category "Other"
.
starwars %>%
mutate(sex_mod =
fct_other(
sex,
drop = c("hermaphroditic", "none")
)
) %>%
select(name, sex_mod, sex)
## # A tibble: 87 x 3
## name sex_mod sex
## <chr> <fct> <chr>
## 1 Luke Skywalker male male
## 2 C-3PO Other none
## 3 R2-D2 Other none
## 4 Darth Vader male male
## 5 Leia Organa female female
## 6 Owen Lars male male
## 7 Beru Whitesun lars female female
## 8 R5-D4 Other none
## 9 Biggs Darklighter male male
## 10 Obi-Wan Kenobi male male
## # ... with 77 more rows
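If the label "Other" does not fit, fct_other() accepts an other_level = argument. A short sketch on a made-up factor, where the label "rest" is only an illustration:

```r
library(forcats)

# Hypothetical toy factor
f <- factor(c("a", "b", "c", "d"))

# Keep "a" and "b", merge everything else into a custom level
kept <- fct_other(f, keep = c("a", "b"), other_level = "rest")

as.character(kept)
levels(kept)
```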
To lump values of a factor’s levels means to combine them in a new level "Other"
. Four different functions exist to do this. Mind that NA
is not affected.
The function fct_lump_min()
sets a min =
number of times the value must appear - otherwise it is reassigned to "Other"
.
starwars %>%
mutate(sex_lumped = fct_lump_min(sex, min = 7)) %>%
count(sex_lumped, sex)
## # A tibble: 5 x 3
## sex_lumped sex n
## <fct> <chr> <int>
## 1 female female 16
## 2 male male 60
## 3 Other hermaphroditic 1
## 4 Other none 6
## 5 <NA> <NA> 4
All levels of a factor except for the n =
most frequent ones can be lumped with function fct_lump_n()
. This also works in the other direction (least frequent) with negative values of n.
starwars %>%
mutate(sex_lumped = fct_lump_n(sex, n = 2)) %>%
count(sex_lumped, sex)
## # A tibble: 5 x 3
## sex_lumped sex n
## <fct> <chr> <int>
## 1 female female 16
## 2 male male 60
## 3 Other hermaphroditic 1
## 4 Other none 6
## 5 <NA> <NA> 4
To lump levels of a factor that appear in fewer than a given proportion of cases, use the function fct_lump_prop()
.
starwars %>%
mutate(sex_lumped = fct_lump_prop(sex, prop = 0.20)) %>%
count(sex_lumped, sex)
## # A tibble: 5 x 3
## sex_lumped sex n
## <fct> <chr> <int>
## 1 male male 60
## 2 Other female 16
## 3 Other hermaphroditic 1
## 4 Other none 6
## 5 <NA> <NA> 4
Last but not least the function fct_lump_lowfreq()
can be used to lump together the least frequent levels of a factor, which ensures that "Other"
is still the smallest level.
starwars %>%
mutate(sex_lumped = fct_lump_lowfreq(sex)) %>%
count(sex_lumped, sex)
## # A tibble: 5 x 3
## sex_lumped sex n
## <fct> <chr> <int>
## 1 male male 60
## 2 Other female 16
## 3 Other hermaphroditic 1
## 4 Other none 6
## 5 <NA> <NA> 4
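To compare three of the lumping functions side by side, here is a sketch on a made-up factor with counts a = 5, b = 3, c = 1 and d = 1 (ten values in total).

```r
library(forcats)

# Hypothetical toy factor with counts a = 5, b = 3, c = 1, d = 1
f <- factor(c(rep("a", 5), rep("b", 3), "c", "d"))

# Keep the two most frequent levels
lump_n <- levels(fct_lump_n(f, n = 2))

# Keep levels that appear at least three times
lump_min <- levels(fct_lump_min(f, min = 3))

# Keep levels that make up more than 40 % of the values
lump_prop <- levels(fct_lump_prop(f, prop = 0.4))

lump_n
lump_min
lump_prop
```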
The last section of this tutorial is about manipulating the levels of a factor instead of its values.
The function fct_drop()
lets you drop unused levels from a factor. Imagine we want to create a subset of the starwars
dataset that only contains "Human"
characters. I call this dataset starwars_humans
, convert the column sex
to a factor and dplyr::filter()
the relevant cases.
starwars_humans <-
starwars %>%
mutate(sex = as.factor(sex)) %>%
filter(species == "Human")
Let’s have a look at the sex
of the remaining characters.
starwars_humans %>% count(sex)
## # A tibble: 2 x 2
## sex n
## <fct> <int>
## 1 female 9
## 2 male 26
There are 26 male and 9 female Star Wars characters in this subset. Now, let’s also have a look at the levels of factor sex
.
starwars_humans %>% pull(sex) %>% levels()
## [1] "female" "hermaphroditic" "male" "none"
Did you expect that the factor sex
still contains all its initial levels, including those that are no longer present in the subset? With function fct_drop()
we can fix this.
starwars_humans %>% pull(sex) %>% fct_drop() %>% levels()
## [1] "female" "male"
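Sometimes only some of the unused levels should go. For this, fct_drop() has an only = argument. The factor below is a made-up example with two unused levels.

```r
library(forcats)

# Hypothetical factor: levels "c" and "d" are declared but never used
f <- factor(c("a", "b"), levels = c("a", "b", "c", "d"))

# Drop all unused levels
drop_all <- levels(fct_drop(f))

# Drop only the unused level "c" and keep the unused level "d"
drop_some <- levels(fct_drop(f, only = "c"))

drop_all
drop_some
```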
With the function fct_expand() we can add levels to a factor. Below I’m adding the level "other"
to the existing levels of sex
.
starwars %>% pull(sex) %>% fct_expand("other") %>% levels()
## [1] "female" "hermaphroditic" "male" "none"
## [5] "other"
Mind that the Star Wars characters’ values of sex
are not affected by this function.
To assign a level to NA
values of a factor the function fct_explicit_na()
can be used. By default the argument na_level =
reassigns NA
values to "(Missing)"
, which is then a distinct level. Let’s take a look at this by plotting the bar chart with the Star Wars characters’ sex
again!
starwars %>%
mutate(sex_f = fct_explicit_na(sex, na_level = "(Missing)")) %>%
ggplot(data = ., aes(x = sex_f)) +
geom_bar(stat = "count", fill = "white", color = "black")
Also let us compare the levels of factor sex
before…
starwars %>% mutate(sex = as.factor(sex)) %>% pull(sex) %>% levels()
## [1] "female" "hermaphroditic" "male" "none"
…and after applying fct_explicit_na()
.
starwars %>%
mutate(sex_f = fct_explicit_na(sex, na_level = "(Missing)")) %>%
pull(sex_f) %>% levels()
## [1] "female" "hermaphroditic" "male" "none"
## [5] "(Missing)"
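One caveat: since forcats 1.0.0 the function fct_explicit_na() is deprecated in favour of fct_na_value_to_level(). The sketch below picks whichever function fits the installed version; the toy factor f and the helper object f_explicit are made up for illustration.

```r
library(forcats)

# Hypothetical toy factor with one missing value
f <- factor(c("a", NA, "b"))

# Use the newer function when it is available, otherwise fall back
f_explicit <- if (utils::packageVersion("forcats") >= "1.0.0") {
  fct_na_value_to_level(f, level = "(Missing)")
} else {
  fct_explicit_na(f, na_level = "(Missing)")
}

# The former NA is now an ordinary value with its own level
as.character(f_explicit)
levels(f_explicit)
```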