

1. Introduction

In this tutorial I’ll show you how to handle categorical variables (factors) with the package forcats. Factors are another storage type for columns, alongside character and numeric. The behavior of factors in R is not always consistent, especially when switching between conventional data frames and tibbles (for more information about the difference see this tutorial). There is an interesting article by Roger Peng, who explains factors in the context of R’s development.

For this tutorial you’ll need to know some functions of the dplyr package like mutate(), summarise() and group_by(). So I recommend that you briefly read sections 2 to 6 of this tutorial first.

Otherwise load the following packages.

library(forcats)
library(dplyr)
library(tibble)

We will work once more with the starwars dataset contained in the dplyr package. Below let’s have a tibble::glimpse() at the dataset.

starwars %>%
  glimpse()
## Rows: 87
## Columns: 14
## $ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Or~
## $ height     <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2~
## $ mass       <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.~
## $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", N~
## $ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", "~
## $ eye_color  <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue",~
## $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, ~
## $ sex        <chr> "male", "none", "none", "male", "female", "male", "female",~
## $ gender     <chr> "masculine", "masculine", "masculine", "masculine", "femini~
## $ homeworld  <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "T~
## $ species    <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Huma~
## $ films      <list> <"The Empire Strikes Back", "Revenge of the Sith", "Return~
## $ vehicles   <list> <"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "Imp~
## $ starships  <list> <"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1",~

There are columns of type character (chr), numeric (int, dbl), and list. Do you have an idea which columns are suitable to be treated as factors?


2. Factors

A factor is R’s data structure for categorical data and the forcats package provides many functions to work with them. Factors are often required when performing regression analysis in R with lm() or glm(). These functions need factors to handle categorical data appropriately. For example R maps the unique values of a categorical variable to individual dummy variables in order to estimate the regression model. Therefore it’s good to know for a start about some basic functions that handle categorical data.
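
To make the dummy variable mapping a bit more concrete, here is a minimal sketch with base R’s model.matrix(), which shows the design matrix that lm() builds from a formula. The data frame d and its values are made up for illustration and are not part of the starwars dataset.

```r
# Toy data: hypothetical values, not taken from starwars
d <- data.frame(
  y     = c(1, 2, 3, 4, 5, 6),
  group = factor(c("a", "b", "c", "a", "b", "c"))
)

# The design matrix contains one 0/1 dummy column per level,
# except for the first level ("a"), which serves as reference category
X <- model.matrix(y ~ group, data = d)
colnames(X)
```

The first level of the factor becomes the reference category, which is one reason why the order of levels matters later in this tutorial.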

2.1 General

The following functions are used for basic operations on factors.

factor()

With this function we can create a new factor variable from scratch. First we supply a vector of values with x = and then a vector of the allowed levels with levels =.

factor(
  x = c("1", "2", "3",
        "1", "2", "3"), 
  levels = c("1", "2", "3")
  )
## [1] 1 2 3 1 2 3
## Levels: 1 2 3

Below the values of the factor we see a line with its levels.

There is also a labels = argument where we can supply labels for the factor levels. By default the labels are identical to the levels. The labels have to appear in the same order as the levels.

factor(
  x = c("1", "2", "3",
        "1", "2", "3"), 
  levels = c("1", "2", "3"),
  labels = c("first", "second", "third")
  )
## [1] first  second third  first  second third 
## Levels: first second third
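
Two details of factor() are easy to miss, so here is a small sketch with made-up vectors (f and g are only illustrative names): values that are not listed in levels = become NA, and a factor internally stores integer codes that point to its levels.

```r
# Values that are not listed in levels = are turned into NA
f <- factor(c("1", "2", "4"), levels = c("1", "2", "3"))
is.na(f)

# Internally a factor stores integer codes that index into its levels
g <- factor(c("low", "high", "low"), levels = c("low", "high"))
as.integer(g)
levels(g)
```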

as.factor()

We may also take an already existing column of the starwars dataset and transform it with function as.factor() and dplyr::mutate() into a factor variable. Below I take the species variable and convert it to a factor. Only the storage type of the column is altered but not the values of species.

starwars %>%
  mutate(species_fct = as.factor(species)) %>%
  select(name, starts_with("species"))
## # A tibble: 87 x 3
##    name               species species_fct
##    <chr>              <chr>   <fct>      
##  1 Luke Skywalker     Human   Human      
##  2 C-3PO              Droid   Droid      
##  3 R2-D2              Droid   Droid      
##  4 Darth Vader        Human   Human      
##  5 Leia Organa        Human   Human      
##  6 Owen Lars          Human   Human      
##  7 Beru Whitesun lars Human   Human      
##  8 R5-D4              Droid   Droid      
##  9 Biggs Darklighter  Human   Human      
## 10 Obi-Wan Kenobi     Human   Human      
## # ... with 77 more rows

levels()

The function levels() returns a vector of all levels of a factor. The column to which this function is applied has to be a factor that is passed as a vector (for example with dplyr::pull()). First, let us take a look at what happens if we use levels() on a column that is not of type factor.

starwars %>%
  pull(species) %>%
  levels()
## NULL

It simply returns NULL because there are no levels attached to the column species which is originally of type character. Next we’ll try levels() on the converted species_fct column.

starwars %>%
  mutate(species_fct = as.factor(species)) %>%
  pull(species_fct) %>%
  levels()
##  [1] "Aleena"         "Besalisk"       "Cerean"         "Chagrian"      
##  [5] "Clawdite"       "Droid"          "Dug"            "Ewok"          
##  [9] "Geonosian"      "Gungan"         "Human"          "Hutt"          
## [13] "Iktotchi"       "Kaleesh"        "Kaminoan"       "Kel Dor"       
## [17] "Mirialan"       "Mon Calamari"   "Muun"           "Nautolan"      
## [21] "Neimodian"      "Pau'an"         "Quermian"       "Rodian"        
## [25] "Skakoan"        "Sullustan"      "Tholothian"     "Togruta"       
## [29] "Toong"          "Toydarian"      "Trandoshan"     "Twi'lek"       
## [33] "Vulptereen"     "Wookiee"        "Xexto"          "Yoda's species"
## [37] "Zabrak"

This worked as expected! Anybody remember the function dplyr::distinct() from the dplyr tutorial? This function also lists all unique values of a column but can be applied to any column type.

starwars %>%
  distinct(species)
## # A tibble: 38 x 1
##    species       
##    <chr>         
##  1 Human         
##  2 Droid         
##  3 Wookiee       
##  4 Rodian        
##  5 Hutt          
##  6 Yoda's species
##  7 Trandoshan    
##  8 Mon Calamari  
##  9 Ewok          
## 10 Sullustan     
## # ... with 28 more rows

There is also a minor difference: distinct() lists NA as a unique value while levels() does not. However, using function factor() with argument exclude = NULL retains NA in the output.

starwars %>%
  mutate(species_fct = factor(species, exclude = NULL)) %>%
  pull(species_fct) %>%
  levels()
##  [1] "Aleena"         "Besalisk"       "Cerean"         "Chagrian"      
##  [5] "Clawdite"       "Droid"          "Dug"            "Ewok"          
##  [9] "Geonosian"      "Gungan"         "Human"          "Hutt"          
## [13] "Iktotchi"       "Kaleesh"        "Kaminoan"       "Kel Dor"       
## [17] "Mirialan"       "Mon Calamari"   "Muun"           "Nautolan"      
## [21] "Neimodian"      "Pau'an"         "Quermian"       "Rodian"        
## [25] "Skakoan"        "Sullustan"      "Tholothian"     "Togruta"       
## [29] "Toong"          "Toydarian"      "Trandoshan"     "Twi'lek"       
## [33] "Vulptereen"     "Wookiee"        "Xexto"          "Yoda's species"
## [37] "Zabrak"         NA
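
The same difference can be seen on a tiny made-up vector x with base R only. This sketch also shows that the order differs: unique() keeps the order of first appearance while factor() sorts the levels.

```r
x <- c("b", "a", NA, "b")

unique(x)                          # order of first appearance, NA included
levels(factor(x))                  # sorted, NA dropped
levels(factor(x, exclude = NULL))  # sorted, NA kept as a level
```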


2.2 Inspect

The following functions are used to have a closer look at the values and levels of a factor.

fct_count()

With fct_count() you’re able to count the number of values for each level of a factor variable. Below I apply this to the species column of the starwars dataset.

fct_count(starwars$species)
## # A tibble: 38 x 2
##    f             n
##    <fct>     <int>
##  1 Aleena        1
##  2 Besalisk      1
##  3 Cerean        1
##  4 Chagrian      1
##  5 Clawdite      1
##  6 Droid         6
##  7 Dug           1
##  8 Ewok          1
##  9 Geonosian     1
## 10 Gungan        3
## # ... with 28 more rows

Note that it is not always necessary to convert columns of type character to a factor before applying functions of the forcats package.

By the way: the function dplyr::count() returns a similar output.

starwars %>%
  count(species)
## # A tibble: 38 x 2
##    species       n
##    <chr>     <int>
##  1 Aleena        1
##  2 Besalisk      1
##  3 Cerean        1
##  4 Chagrian      1
##  5 Clawdite      1
##  6 Droid         6
##  7 Dug           1
##  8 Ewok          1
##  9 Geonosian     1
## 10 Gungan        3
## # ... with 28 more rows

fct_match()

We can check which values of a factor match one or more levels with the function fct_match(). It returns a logical vector with TRUE for each value that matches one of the given levels and FALSE otherwise. Let’s see how many values of the factor sex have the level "male".

fct_match(starwars$sex, "male") %>% table()
## .
## FALSE  TRUE 
##    27    60

There are 60 Star Wars characters where the level "male" is present.

We may also check multiple levels at once.

fct_match(starwars$sex, c("male", "female")) %>% table()
## .
## FALSE  TRUE 
##    11    76

There are 76 Star Wars characters with either a "male" or "female" level of factor sex.
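
You might wonder why we shouldn’t simply use %in% here. As far as I can tell from the forcats documentation, the main difference is that fct_match() validates the supplied levels and raises an error if one of them does not exist. Below is a small sketch on a made-up factor f; the error message text in the handler is my own.

```r
library(forcats)

f <- factor(c("a", "b", "a"))

# One logical value per element of f
fct_match(f, "a")

# %in% quietly returns FALSE for a level that does not exist ...
f %in% "z"

# ... whereas fct_match() raises an error, which helps to catch typos
result <- tryCatch(
  fct_match(f, "z"),
  error = function(e) "error: level not present"
)
result
```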

fct_unique()

The function fct_unique() returns the unique values of a factor, one per level, and removes duplicates. The column sex is converted to a factor first.

fct_unique(as.factor(starwars$sex))
## [1] female         hermaphroditic male           none          
## Levels: female hermaphroditic male none


2.3 Combine

We may also combine different factors with the following functions.

fct_c()

With function fct_c() we can combine factors with different levels. Below I create two factors f1 and f2 which represent the sex and gender columns of the starwars dataset. Then I use fct_c() to combine them.

f1 <- as.factor(starwars$sex)
f2 <- as.factor(starwars$gender)

fct_c(f1, f2) %>% levels()
## [1] "female"         "hermaphroditic" "male"           "none"          
## [5] "feminine"       "masculine"

This function is best used to patch together factors from multiple sources that should have the same levels.
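
A smaller sketch of that use case, with two made-up factors fa and fb whose levels only partly overlap: the values are concatenated and the levels are merged.

```r
library(forcats)

fa <- factor(c("x", "y"))
fb <- factor(c("y", "z"))

# Values are concatenated, levels are the union of both inputs
combined <- fct_c(fa, fb)
as.character(combined)
levels(combined)
```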

There is also a neat function called fct_cross() which computes a factor whose levels are the combinations of the levels of all input factors.

starwars %>%
  mutate(sex_gender = fct_cross(sex, gender)) %>%
  select(name, sex_gender, sex, gender)
## # A tibble: 87 x 4
##    name               sex_gender      sex    gender   
##    <chr>              <fct>           <chr>  <chr>    
##  1 Luke Skywalker     male:masculine  male   masculine
##  2 C-3PO              none:masculine  none   masculine
##  3 R2-D2              none:masculine  none   masculine
##  4 Darth Vader        male:masculine  male   masculine
##  5 Leia Organa        female:feminine female feminine 
##  6 Owen Lars          male:masculine  male   masculine
##  7 Beru Whitesun lars female:feminine female feminine 
##  8 R5-D4              none:masculine  none   masculine
##  9 Biggs Darklighter  male:masculine  male   masculine
## 10 Obi-Wan Kenobi     male:masculine  male   masculine
## # ... with 77 more rows

fct_unify()

For standardizing the levels across a list of factors we can use the function fct_unify(). It returns a list where each element is the initial factor augmented by the unified levels. To see only the levels() of each factor we have to apply a special function called map() from the package purrr. It applies the function levels() to each list element. Read more about it here.

f3 <- as.factor(starwars$eye_color)

fct_unify(list(f1, f2, f3)) %>% purrr::map(levels)
## [[1]]
##  [1] "female"         "hermaphroditic" "male"           "none"          
##  [5] "feminine"       "masculine"      "black"          "blue"          
##  [9] "blue-gray"      "brown"          "dark"           "gold"          
## [13] "green, yellow"  "hazel"          "orange"         "pink"          
## [17] "red"            "red, blue"      "unknown"        "white"         
## [21] "yellow"        
## 
## [[2]]
##  [1] "female"         "hermaphroditic" "male"           "none"          
##  [5] "feminine"       "masculine"      "black"          "blue"          
##  [9] "blue-gray"      "brown"          "dark"           "gold"          
## [13] "green, yellow"  "hazel"          "orange"         "pink"          
## [17] "red"            "red, blue"      "unknown"        "white"         
## [21] "yellow"        
## 
## [[3]]
##  [1] "female"         "hermaphroditic" "male"           "none"          
##  [5] "feminine"       "masculine"      "black"          "blue"          
##  [9] "blue-gray"      "brown"          "dark"           "gold"          
## [13] "green, yellow"  "hazel"          "orange"         "pink"          
## [17] "red"            "red, blue"      "unknown"        "white"         
## [21] "yellow"

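All three list elements now share the same levels. Note that f1, f2 and f3 themselves are not modified unless the result is assigned back to them.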


3. Change the order of levels

We can’t go on forever without making nice graphs, can we? Some of the functions in this section are particularly helpful when you plot factors but are unhappy with the order in which their levels appear. So first, load the ggplot2 package and then take a look at the following bar chart.

library(ggplot2)
starwars %>%
  ggplot(data = ., aes(x=sex)) +
  geom_bar(stat = "count", fill = "white", color = "black")

Each bar represents the number of cases in the starwars dataset with the respective level of sex. They are ordered alphabetically except for NA which is treated as a separate category and comes last.

3.1 Basic functions

These functions address the most common issues that arise when attempting to change the order of the levels of a factor. In general, they’re used in combination with dplyr::mutate() because the values of the factor need to be accessed.

fct_relevel()

With this function you can manually reorder the levels of a factor variable. We do not have to list all levels of the factor variable in fct_relevel(): the remaining levels keep their existing order, which is alphabetical for a character column such as sex. Below I put the levels female and male in the first and second position of sex.

starwars %>%
  mutate(sex_ordered = fct_relevel(sex, c("female", "male"))) %>%
  ggplot(data = ., aes(x = sex_ordered)) +
  geom_bar(stat = "count", fill = "white", color = "black")
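
Since the plot only shows the result, here is a small sketch that prints the levels directly. It also uses the after = argument of fct_relevel(), which moves a level to a position other than the front. The factor f is made up for illustration.

```r
library(forcats)

f <- factor(c("a", "b", "c", "d"), levels = c("b", "c", "d", "a"))

levels(fct_relevel(f, "a"))               # "a" moves to the front
levels(fct_relevel(f, "a", after = 2))    # "a" moves behind the second level
levels(fct_relevel(f, "a", after = Inf))  # "a" moves to the end
```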

fct_infreq()

This function lets you reorder a factor’s levels by the frequency with which they appear in the data. The level with the highest frequency comes first. Let’s use this function on the sex column before plotting the bar chart. Note that NA is again excluded from the ordering and comes last.

starwars %>%
  mutate(sex_ordered = fct_infreq(f = sex)) %>%
  ggplot(data = ., aes(x = sex_ordered)) +
  geom_bar(stat = "count", fill = "white", color = "black")
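
To see the effect without a plot, here is a minimal sketch on a made-up factor f in which "c" occurs most often and "a" least often.

```r
library(forcats)

f <- factor(c("a", "b", "b", "c", "c", "c"))

levels(f)              # default: alphabetical
levels(fct_infreq(f))  # most frequent level first
```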

fct_inseq()

There is also a related function called fct_inseq() which is designed to work with factors whose levels display a numeric sequence. Columns of type numeric such as birth_year must always be explicitly converted to factor.

starwars %>%
  mutate(birth_year_ordered = fct_inseq(as.factor(birth_year))) %>%
  ggplot(data = ., aes(x = birth_year_ordered)) +
  geom_bar(stat = "count", fill = "white", color = "black") +
  theme(axis.text.x = element_text(angle = 90))
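
Why is this needed at all? Levels are stored as text, so by default they are sorted as text and "10" typically ends up before "2". A small sketch with a made-up factor f:

```r
library(forcats)

f <- factor(c("10", "2", "1", "2"))

levels(f)             # text sorting, typically "1" "10" "2"
levels(fct_inseq(f))  # sorted by the numeric value of the levels
```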

fct_inorder()

For what it’s worth you can also reorder the levels of a factor by the order in which they first appear in the dataset.

starwars %>%
  mutate(sex_ordered = fct_inorder(sex)) %>%
  ggplot(data = ., aes(x = sex_ordered)) +
  geom_bar(stat = "count", fill = "white", color = "black")

Let’s have a look at the first 10 rows of the starwars dataset to verify this.

starwars %>%
  print()
## # A tibble: 87 x 14
##    name     height  mass hair_color skin_color eye_color birth_year sex   gender
##    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Luke Sk~    172    77 blond      fair       blue            19   male  mascu~
##  2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu~
##  3 R2-D2        96    32 <NA>       white, bl~ red             33   none  mascu~
##  4 Darth V~    202   136 none       white      yellow          41.9 male  mascu~
##  5 Leia Or~    150    49 brown      light      brown           19   fema~ femin~
##  6 Owen La~    178   120 brown, gr~ light      blue            52   male  mascu~
##  7 Beru Wh~    165    75 brown      light      blue            47   fema~ femin~
##  8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu~
##  9 Biggs D~    183    84 black      light      brown           24   male  mascu~
## 10 Obi-Wan~    182    77 auburn, w~ fair       blue-gray       57   male  mascu~
## # ... with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>
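
The same idea on a made-up factor f, where the levels can be checked at a glance:

```r
library(forcats)

f <- factor(c("b", "b", "a", "c"))

levels(f)               # default: alphabetical
levels(fct_inorder(f))  # order of first appearance in the data
```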

fct_rev()

This function reverses the order of the factor’s levels. This can be useful when flipping the axes of a bar chart with the function ggplot2::coord_flip(). When applying fct_rev() the levels of the factor appear alphabetically from the top to the bottom of the y-axis instead of the other way around (as always except for NA).

starwars %>%
  mutate(sex_ordered = fct_rev(f = sex)) %>%
  ggplot(data = ., aes(x = sex_ordered)) +
  geom_bar(stat = "count", fill = "white", color = "black") +
  coord_flip()

3.2 Advanced functions

There are also some more advanced functions which you’ll either use less frequently or which require more thought when applying them.

fct_shift()

You can shift the levels of a factor to the left or right with this function. Positive values of argument n = shift the levels to the left and negative values to the right.

starwars %>%
  mutate(sex_ordered = fct_shift(f = as.factor(sex), n = -1)) %>%
  ggplot(data = ., aes(x = sex_ordered)) +
  geom_bar(stat = "count", fill = "white", color = "black")
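
The direction of the shift is easier to follow on a made-up factor f with four levels. Levels that fall off one end reappear at the other end.

```r
library(forcats)

f <- factor(c("a", "b", "c", "d"))

levels(fct_shift(f, n = 1))   # shift to the left:  "a" wraps around to the end
levels(fct_shift(f, n = -1))  # shift to the right: "d" wraps around to the front
```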

fct_shuffle()

With fct_shuffle() you can randomly permute the order of a factor’s levels. Use set.seed() to obtain replicable results when working with random function components.

set.seed(123)
starwars %>%
  mutate(sex_ordered = fct_shuffle(f = sex)) %>%
  ggplot(data = ., aes(x = sex_ordered)) +
  geom_bar(stat = "count", fill = "white", color = "black")

set.seed(456)
starwars %>%
  mutate(sex_ordered = fct_shuffle(f = sex)) %>%
  ggplot(data = ., aes(x = sex_ordered)) +
  geom_bar(stat = "count", fill = "white", color = "black")

fct_reorder()

The function fct_reorder() lets you reorder a factor’s levels by their relationship with another variable. This is useful when plotting a factor and rearranging its levels by another column. Below I want to sort the levels of column sex according to the average height of the Star Wars characters within each unique value (level) of sex. In addition to the factor .f = we have to specify the variable along which the reordering should be carried out (.x =), as well as a function (.fun =) that determines the reordering. With argument .desc = we can force a descending order.

starwars %>%
  mutate(sex_ordered = 
           fct_reorder(
             .f = sex, 
             .x = height,
             .fun = mean, na.rm = TRUE, 
             .desc = TRUE)
         ) %>%
  ggplot(data = ., aes(x = sex_ordered)) +
  geom_bar(stat = "count", fill = "white", color = "black")

According to this plot male Star Wars characters have the highest average height, followed by hermaphroditic and female characters. Do we trust function fct_reorder() just like that? Better to check this with dplyr::group_by() and dplyr::summarise().

starwars %>%
  group_by(sex) %>%
  summarise(mean_height = mean(height, na.rm = TRUE)) %>%
  arrange(desc(mean_height))
## # A tibble: 5 x 2
##   sex            mean_height
##   <chr>                <dbl>
## 1 <NA>                  181.
## 2 male                  179.
## 3 hermaphroditic        175 
## 4 female                169.
## 5 none                  131.
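
If you prefer to check fct_reorder() on numbers you can compute in your head, here is a minimal sketch with a made-up factor g and a made-up numeric vector v. The group means are 1.5 for "a", 5.5 for "b" and 3.5 for "c".

```r
library(forcats)

g <- factor(c("a", "a", "b", "b", "c", "c"))
v <- c(1, 2, 5, 6, 3, 4)

# Ascending by group mean
levels(fct_reorder(g, v, .fun = mean))

# Descending by group mean
levels(fct_reorder(g, v, .fun = mean, .desc = TRUE))
```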

Digression: fct_reorder2()

This is a 2d version of the function fct_reorder() and lets you specify two variables (.x = and .y =) by which the factor’s levels are reordered. forcats provides two helper functions for the argument .fun =: last2() finds the last value of .y when sorted by .x, and first2() finds the first value. This is helpful when using a line plot and aligning the line colours with the legend.

Have a look at the plot below. The legend is quite hard to read isn’t it?

by_age <- gss_cat %>%
  filter(!is.na(age)) %>%
  count(age, marital) %>%
  group_by(age) %>%
  mutate(prop = n / sum(n))

ggplot(by_age, aes(age, prop, colour = marital)) +
  geom_line(na.rm = TRUE)

Now we’ll use fct_reorder2() to improve the plot.

ggplot(data = by_age, 
       aes(x = age, y = prop, 
           colour = fct_reorder2(.f = marital, .x = age, .y = prop, .fun = last2)
           )
       ) +
  geom_line() +
  labs(colour = "marital")
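
What last2() does is easier to see on a tiny made-up example with two groups and two x positions each. In this sketch the group with the larger final y value comes first, because fct_reorder2() sorts in descending order by default (as far as I know, its .desc = argument defaults to TRUE).

```r
library(forcats)

g <- factor(c("a", "a", "b", "b"))
x <- c(1, 2, 1, 2)
y <- c(1, 5, 2, 9)

# Last y value of each group when sorted by x
last2(x[g == "a"], y[g == "a"])
last2(x[g == "b"], y[g == "b"])

# "b" ends higher than "a", so "b" becomes the first level
levels(fct_reorder2(g, x, y, .fun = last2))
```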


4. Change the value of levels

Again you’ll often use the functions in this section to alter the values of the factor, hence I demonstrate the examples in combination with function dplyr::mutate().

fct_recode()

The function fct_recode() lets us manually change levels of a factor. You only have to specify a new value of the level for each original value that you want to change. If no value is supplied for an original level, the level remains unaltered. Below I’m recoding the levels of column sex such that only the first letter of each original level is present in the new factor sex_mod.

starwars %>%
  mutate(sex_mod = fct_recode(
    sex,
    m = "male",f = "female", 
    n = "none", h = "hermaphroditic")
    ) %>%
  select(name, sex_mod, sex)
## # A tibble: 87 x 3
##    name               sex_mod sex   
##    <chr>              <fct>   <chr> 
##  1 Luke Skywalker     m       male  
##  2 C-3PO              n       none  
##  3 R2-D2              n       none  
##  4 Darth Vader        m       male  
##  5 Leia Organa        f       female
##  6 Owen Lars          m       male  
##  7 Beru Whitesun lars f       female
##  8 R5-D4              n       none  
##  9 Biggs Darklighter  m       male  
## 10 Obi-Wan Kenobi     m       male  
## # ... with 77 more rows
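
fct_recode() can also merge several original levels into one by repeating the new name. A minimal sketch on a made-up factor x:

```r
library(forcats)

x <- factor(c("apple", "bear", "banana", "dear"))

# Both "apple" and "banana" are recoded to the same new level "fruit"
y <- fct_recode(x, fruit = "apple", fruit = "banana")
as.character(y)
levels(y)
```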

fct_relabel()

Also consider using the function fct_relabel() which obeys the purrr::map() syntax to apply a function or expression to each level. Below I use function paste0() to add the string sex_ to each level of variable sex.

starwars %>%
  mutate(sex_mod = fct_relabel(
    sex,
    ~ paste0("sex_", .x))
    ) %>%
  select(name, sex_mod, sex)
## # A tibble: 87 x 3
##    name               sex_mod    sex   
##    <chr>              <fct>      <chr> 
##  1 Luke Skywalker     sex_male   male  
##  2 C-3PO              sex_none   none  
##  3 R2-D2              sex_none   none  
##  4 Darth Vader        sex_male   male  
##  5 Leia Organa        sex_female female
##  6 Owen Lars          sex_male   male  
##  7 Beru Whitesun lars sex_female female
##  8 R5-D4              sex_none   none  
##  9 Biggs Darklighter  sex_male   male  
## 10 Obi-Wan Kenobi     sex_male   male  
## # ... with 77 more rows

fct_anon()

Sometimes, for example when publishing sensitive data, it may become necessary to anonymize levels of a factor variable with random integers. This can be done with function fct_anon().

starwars %>%
  mutate(sex_mod = fct_anon(as.factor(sex))) %>%
  select(name, sex_mod, sex)
## # A tibble: 87 x 3
##    name               sex_mod sex   
##    <chr>              <fct>   <chr> 
##  1 Luke Skywalker     4       male  
##  2 C-3PO              2       none  
##  3 R2-D2              2       none  
##  4 Darth Vader        4       male  
##  5 Leia Organa        1       female
##  6 Owen Lars          4       male  
##  7 Beru Whitesun lars 1       female
##  8 R5-D4              2       none  
##  9 Biggs Darklighter  4       male  
## 10 Obi-Wan Kenobi     4       male  
## # ... with 77 more rows

It is also possible to add a prefix = to the anonymized factor.

starwars %>%
  mutate(sex_mod = 
           fct_anon(
             as.factor(sex), 
             prefix = "sex_"
             )
         ) %>%
  select(name, sex_mod, sex)
## # A tibble: 87 x 3
##    name               sex_mod sex   
##    <chr>              <fct>   <chr> 
##  1 Luke Skywalker     sex_1   male  
##  2 C-3PO              sex_4   none  
##  3 R2-D2              sex_4   none  
##  4 Darth Vader        sex_1   male  
##  5 Leia Organa        sex_3   female
##  6 Owen Lars          sex_1   male  
##  7 Beru Whitesun lars sex_3   female
##  8 R5-D4              sex_4   none  
##  9 Biggs Darklighter  sex_1   male  
## 10 Obi-Wan Kenobi     sex_1   male  
## # ... with 77 more rows

fct_collapse()

With the function fct_collapse() we can collapse levels of a factor into manually defined groups. This is very useful when a factor has too many distinct levels but some of them share common characteristics and can be merged. Below I’m generating new levels for column hair_color. Star Wars characters with unicolored/multicolored hair are reassigned either the level "singleColor" or "multiColor". Cases with "unknown" or "none" hair color are changed to "missing".

starwars %>%
  mutate(hair_color_mod = 
           fct_collapse(
             hair_color, 
             missing = c("unknown", "none"),
             singleColor = c("auburn", "black", "blond", "blonde", 
                        "brown", "grey", "white"),
             multiColor = c("auburn, grey", "auburn, white", "brown, grey")
             )
         ) %>%
  select(name, hair_color_mod, hair_color)
## # A tibble: 87 x 3
##    name               hair_color_mod hair_color   
##    <chr>              <fct>          <chr>        
##  1 Luke Skywalker     singleColor    blond        
##  2 C-3PO              <NA>           <NA>         
##  3 R2-D2              <NA>           <NA>         
##  4 Darth Vader        missing        none         
##  5 Leia Organa        singleColor    brown        
##  6 Owen Lars          multiColor     brown, grey  
##  7 Beru Whitesun lars singleColor    brown        
##  8 R5-D4              <NA>           <NA>         
##  9 Biggs Darklighter  singleColor    black        
## 10 Obi-Wan Kenobi     multiColor     auburn, white
## # ... with 77 more rows

fct_other()

The function fct_other() is used to replace levels of a factor with value "Other". We can use this function either with argument keep = or drop =.

The argument keep = merges all levels that are not supplied into the new category "Other".

starwars %>%
  mutate(sex_mod = 
           fct_other(
             sex,
             keep = c("male", "female")
             )
         ) %>%
  select(name, sex_mod, sex)
## # A tibble: 87 x 3
##    name               sex_mod sex   
##    <chr>              <fct>   <chr> 
##  1 Luke Skywalker     male    male  
##  2 C-3PO              Other   none  
##  3 R2-D2              Other   none  
##  4 Darth Vader        male    male  
##  5 Leia Organa        female  female
##  6 Owen Lars          male    male  
##  7 Beru Whitesun lars female  female
##  8 R5-D4              Other   none  
##  9 Biggs Darklighter  male    male  
## 10 Obi-Wan Kenobi     male    male  
## # ... with 77 more rows

The argument drop = merges all supplied levels into the new category "Other".

starwars %>%
  mutate(sex_mod = 
           fct_other(
             sex,
             drop = c("hermaphroditic", "none")
             )
         ) %>%
  select(name, sex_mod, sex)
## # A tibble: 87 x 3
##    name               sex_mod sex   
##    <chr>              <fct>   <chr> 
##  1 Luke Skywalker     male    male  
##  2 C-3PO              Other   none  
##  3 R2-D2              Other   none  
##  4 Darth Vader        male    male  
##  5 Leia Organa        female  female
##  6 Owen Lars          male    male  
##  7 Beru Whitesun lars female  female
##  8 R5-D4              Other   none  
##  9 Biggs Darklighter  male    male  
## 10 Obi-Wan Kenobi     male    male  
## # ... with 77 more rows

Lumping

To lump values of a factor’s levels means to combine them in a new level "Other". Four different functions exist to do this. Mind that NA is not affected.

The function fct_lump_min() sets a min = number of times the value must appear - otherwise it is reassigned to "Other".

starwars %>%
  mutate(sex_lumped = fct_lump_min(sex, min = 7)) %>%
  count(sex_lumped, sex) 
## # A tibble: 5 x 3
##   sex_lumped sex                n
##   <fct>      <chr>          <int>
## 1 female     female            16
## 2 male       male              60
## 3 Other      hermaphroditic     1
## 4 Other      none               6
## 5 <NA>       <NA>               4

All levels of a factor except for the n = most frequent ones can be lumped with function fct_lump_n(). This also works in the other direction (least frequent) with negative values of n.

starwars %>%
  mutate(sex_lumped = fct_lump_n(sex, n = 2)) %>%
  count(sex_lumped, sex) 
## # A tibble: 5 x 3
##   sex_lumped sex                n
##   <fct>      <chr>          <int>
## 1 female     female            16
## 2 male       male              60
## 3 Other      hermaphroditic     1
## 4 Other      none               6
## 5 <NA>       <NA>               4

To lump levels of a factor that appear in less than a given proportion of cases use the function fct_lump_prop().

starwars %>%
  mutate(sex_lumped = fct_lump_prop(sex, prop = 0.20)) %>%
  count(sex_lumped, sex) 
## # A tibble: 5 x 3
##   sex_lumped sex                n
##   <fct>      <chr>          <int>
## 1 male       male              60
## 2 Other      female            16
## 3 Other      hermaphroditic     1
## 4 Other      none               6
## 5 <NA>       <NA>               4

Last but not least the function fct_lump_lowfreq() can be used to lump together the least frequent levels of a factor while ensuring that "Other" is still the smallest level.

starwars %>%
  mutate(sex_lumped = fct_lump_lowfreq(sex)) %>%
  count(sex_lumped, sex) 
## # A tibble: 5 x 3
##   sex_lumped sex                n
##   <fct>      <chr>          <int>
## 1 male       male              60
## 2 Other      female            16
## 3 Other      hermaphroditic     1
## 4 Other      none               6
## 5 <NA>       <NA>               4
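
To compare the lumping functions side by side, here is a small sketch on a made-up factor f with counts a = 5, b = 3, c = 1 and d = 1 (proportions 0.5, 0.3, 0.1 and 0.1).

```r
library(forcats)

f <- factor(c(rep("a", 5), rep("b", 3), "c", "d"))

# Keep the two most frequent levels
levels(fct_lump_n(f, n = 2))

# Keep levels that appear at least three times
levels(fct_lump_min(f, min = 3))

# Keep levels that make up more than 40 percent of the values
levels(fct_lump_prop(f, prop = 0.4))
```

In all three cases the new level "Other" is placed last.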


5. Add or drop levels

The last section of this tutorial is about manipulating the levels of a factor instead of its values.

fct_drop()

The function fct_drop() lets you drop unused levels from a factor. Imagine we want to create a subset of the starwars dataset that only contains "Human" characters. I call this dataset starwars_humans, convert the column sex to a factor and dplyr::filter() the relevant cases.

starwars_humans <- 
  starwars %>%
  mutate(sex = as.factor(sex)) %>%
  filter(species == "Human") 

Let’s have a look at the sex of the remaining characters.

starwars_humans %>% count(sex)
## # A tibble: 2 x 2
##   sex        n
##   <fct>  <int>
## 1 female     9
## 2 male      26

There are 26 male and 9 female Star Wars characters in this subset. Now, let’s also have a look at the levels of factor sex.

starwars_humans %>% pull(sex) %>% levels()
## [1] "female"         "hermaphroditic" "male"           "none"

Did you expect that the factor sex still contains all its initial levels, even those that are no longer present in the subset? With function fct_drop() we can fix this.

starwars_humans %>% pull(sex) %>% fct_drop() %>% levels()
## [1] "female" "male"
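
If only some of the unused levels should go, fct_drop() has an only = argument. A minimal sketch on a made-up factor f with two unused levels "c" and "d":

```r
library(forcats)

f <- factor(c("a", "b"), levels = c("a", "b", "c", "d"))

levels(fct_drop(f))              # drops all unused levels
levels(fct_drop(f, only = "c"))  # drops "c" but keeps the unused level "d"
```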

fct_expand()

With this function we can add levels to a factor. Below I’m adding the level "other" to the existing levels of sex.

starwars %>% pull(sex) %>% fct_expand("other") %>% levels()
## [1] "female"         "hermaphroditic" "male"           "none"          
## [5] "other"

Mind that the Star Wars characters’ values of sex are not affected by this function.
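
A quick way to verify this is to tabulate the expanded factor. In the sketch below (f is a made-up factor) the new level shows up with a count of zero while the values stay the same.

```r
library(forcats)

f <- factor(c("a", "b", "a"))
f_expanded <- fct_expand(f, "c")

# The new level "c" exists but has no observations
table(f_expanded)

# The values themselves are unchanged
identical(as.character(f), as.character(f_expanded))
```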

fct_explicit_na()

To assign a level to NA values of a factor the function fct_explicit_na() can be used. By default the argument na_level = reassigns NA values to "(Missing)", which is then a distinct level. Let’s take a look at this by plotting the bar chart with the Star Wars characters’ sex again!

starwars %>%
  mutate(sex_f = fct_explicit_na(sex, na_level = "(Missing)")) %>%
  ggplot(data = ., aes(x = sex_f)) +
  geom_bar(stat = "count", fill = "white", color = "black")

Also let us compare the levels of factor sex before…

starwars %>% mutate(sex = as.factor(sex)) %>% pull(sex) %>% levels()
## [1] "female"         "hermaphroditic" "male"           "none"

…and after applying fct_explicit_na().

starwars %>% 
  mutate(sex_f = fct_explicit_na(sex, na_level = "(Missing)")) %>% 
  pull(sex_f) %>% levels()
## [1] "female"         "hermaphroditic" "male"           "none"          
## [5] "(Missing)"
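
A side note on package versions: to my knowledge, fct_explicit_na() is deprecated from forcats 1.0.0 on in favour of fct_na_value_to_level(). The sketch below picks whichever function matches the installed version. The version check and the toy factor f are my own additions for illustration.

```r
library(forcats)

# Toy factor with one missing value
f <- factor(c("a", NA, "b"))

# Assumption: fct_na_value_to_level() is available from forcats 1.0.0 on
if (utils::packageVersion("forcats") >= "1.0.0") {
  f_explicit <- fct_na_value_to_level(f, level = "(Missing)")
} else {
  f_explicit <- fct_explicit_na(f, na_level = "(Missing)")
}

levels(f_explicit)
```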


References

Wickham, Hadley. 2020. Forcats: Tools for Working with Categorical Variables (factors). https://CRAN.R-project.org/package=forcats.