Reading time: 13 minutes (2,468 words)


1. Introduction

Even when data is stored in a modern format like a tibble it can still be messy. The tidyr package therefore offers a wide range of functions that help us to bring our data into shape so that a maximum of information can be gained. This tutorial requires some knowledge of the dplyr package and I recommend to familiarize yourself with it before continuing here - for example with this tutorial. When you’re ready load both packages!

library(tidyr)
library(dplyr)

As always we’ll use the starwars dataset contained in the dplyr package. Below is a snippet of the data.

starwars %>%
  print()
## # A tibble: 87 x 14
##    name     height  mass hair_color skin_color eye_color birth_year sex   gender
##    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Luke Sk~    172    77 blond      fair       blue            19   male  mascu~
##  2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu~
##  3 R2-D2        96    32 <NA>       white, bl~ red             33   none  mascu~
##  4 Darth V~    202   136 none       white      yellow          41.9 male  mascu~
##  5 Leia Or~    150    49 brown      light      brown           19   fema~ femin~
##  6 Owen La~    178   120 brown, gr~ light      blue            52   male  mascu~
##  7 Beru Wh~    165    75 brown      light      blue            47   fema~ femin~
##  8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu~
##  9 Biggs D~    183    84 black      light      brown           24   male  mascu~
## 10 Obi-Wan~    182    77 auburn, w~ fair       blue-gray       57   male  mascu~
## # ... with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>

Remember the principles of the tidy data concept:

  • every row in the starwars dataset corresponds to a character of the movie (= cases)
  • each column holds information about a character like its name, height or mass (= variables)
  • when a variable is saved in its own column and each case is saved in its own row the dataset is said to be tidy

2. Reshape data

Initially we’ll discuss how to reorganize tabular data without losing any information in the process.

pivot_longer()

With function pivot_longer() we can make a dataset longer by collapsing several columns into only two. One of the new column then holds the column names of the collapsed columns (names_to =) and the second new column holds the collapsed columns’ values (values_to =).

Below I am collapsing a Star Wars character’s hair_color, skin_color and eye_color into the new columns characteristic and color - the former holding the column names and the latter the values. In addition I’m using the names_pattern = argument which lets me modify the original column names (turned values) of the names_to = column.

starwars %>%
  select(name, ends_with("_color")) %>%
  pivot_longer(
    cols = ends_with("_color"),
    names_to = "characteristic",
    values_to = "color",
    names_pattern = "(.*)_color"
    )
## # A tibble: 261 x 3
##    name           characteristic color      
##    <chr>          <chr>          <chr>      
##  1 Luke Skywalker hair           blond      
##  2 Luke Skywalker skin           fair       
##  3 Luke Skywalker eye            blue       
##  4 C-3PO          hair           <NA>       
##  5 C-3PO          skin           gold       
##  6 C-3PO          eye            yellow     
##  7 R2-D2          hair           <NA>       
##  8 R2-D2          skin           white, blue
##  9 R2-D2          eye            red        
## 10 Darth Vader    hair           none       
## # ... with 251 more rows

The original starwars dataset held 87 rows. Now, after using pivot_longer(), it holds 261 rows. Note that now there are multiple entries for the same Star Wars character.

pivot_wider()

The function pivot_wider() does the opposite of pivot_longer(). The dataset is now getting wider by expanding two columns into several. One column then provides the new column names (names from =) and the other the values (values_from =). All other columns remain unaffected and must be listed under id_cols = if they should still be included in the output. Below, I’m reshaping the starwars dataset we’ve just pivoted back to its original form. See the argument names_glue =? It uses the names_from column (and a special .value) to create custom column names.

starwars_pivot %>%
  pivot_wider(
    id_cols = c(name, mass, height),
    names_from = characteristic,
    values_from = color,
    names_glue = "{characteristic}_{.value}"
  )
## # A tibble: 87 x 6
##    name                mass height hair_color    skin_color  eye_color
##    <chr>              <dbl>  <int> <chr>         <chr>       <chr>    
##  1 Luke Skywalker        77    172 blond         fair        blue     
##  2 C-3PO                 75    167 <NA>          gold        yellow   
##  3 R2-D2                 32     96 <NA>          white, blue red      
##  4 Darth Vader          136    202 none          white       yellow   
##  5 Leia Organa           49    150 brown         light       brown    
##  6 Owen Lars            120    178 brown, grey   light       blue     
##  7 Beru Whitesun lars    75    165 brown         light       blue     
##  8 R5-D4                 32     97 <NA>          white, red  red      
##  9 Biggs Darklighter     84    183 black         light       brown    
## 10 Obi-Wan Kenobi        77    182 auburn, white fair        blue-gray
## # ... with 77 more rows

The starwars dataset is now back in its original state with one row for each character and 87 rows in total.


3. Expand data

The functions in this section let you create new combinations of variables or identify implicit missing values (= combinations of variables which are not present in the dataset but may exist).

expand()

The function expand() creates a new tibble with all possible combinations of the values of the supplied variables. Other variables are dropped from the output. Below, I expand() the sex and gender of the Star Wars characters.

starwars %>%
  expand(sex, gender)
## # A tibble: 15 x 2
##    sex            gender   
##    <chr>          <chr>    
##  1 female         feminine 
##  2 female         masculine
##  3 female         <NA>     
##  4 hermaphroditic feminine 
##  5 hermaphroditic masculine
##  6 hermaphroditic <NA>     
##  7 male           feminine 
##  8 male           masculine
##  9 male           <NA>     
## 10 none           feminine 
## 11 none           masculine
## 12 none           <NA>     
## 13 <NA>           feminine 
## 14 <NA>           masculine
## 15 <NA>           <NA>

Note that also NA values are treated as a category and are contained in the list of possible combinations.

complete()

With function complete() missing combinations of values of the supplied variables are added to the starwars dataset. By doing this, all other variables are filled with NA. This is really helpful when creating a bar chart for example. where you would like to show categories that dont’t exist in your dataset but still happen to exist (in another sample etc.). Below, I complete() the sex and gender columns.

starwars %>%
  complete(sex, gender) %>%
  slice(11:20) %>%
  select(sex, gender, name, height, mass)
## # A tibble: 10 x 5
##    sex            gender    name                  height  mass
##    <chr>          <chr>     <chr>                  <int> <dbl>
##  1 female         feminine  Zam Wesell               168    55
##  2 female         feminine  Taun We                  213    NA
##  3 female         feminine  Jocasta Nu               167    NA
##  4 female         feminine  Shaak Ti                 178    57
##  5 female         feminine  Rey                       NA    NA
##  6 female         feminine  Padmé Amidala            165    45
##  7 female         masculine <NA>                      NA    NA
##  8 female         <NA>      <NA>                      NA    NA
##  9 hermaphroditic feminine  <NA>                      NA    NA
## 10 hermaphroditic masculine Jabba Desilijic Tiure    175  1358

The missing combinations have been added to the starwars dataset. For example there is no Star Wars character with a sex-gender combination female-masculine so NA values are filled in each other column.

crossing() / expand_grid()

The function crossing() is a wrapper around expand_grid() that de-duplicates and sorts its inputs. This is very helpful when creating data frames for making predictions, for example after fitting a model.

tibble(year = c("2017","2018","2019")) %>%
  crossing(
    player = c("Thomas","Peter","Sandy","Maria"),
    mood =c("good","bad")
    )
## # A tibble: 24 x 3
##    year  player mood 
##    <chr> <chr>  <chr>
##  1 2017  Maria  bad  
##  2 2017  Maria  good 
##  3 2017  Peter  bad  
##  4 2017  Peter  good 
##  5 2017  Sandy  bad  
##  6 2017  Sandy  good 
##  7 2017  Thomas bad  
##  8 2017  Thomas good 
##  9 2018  Maria  bad  
## 10 2018  Maria  good 
## # ... with 14 more rows


4. Split cells

Next, I’ll discuss functions which can split or combine cells into individual, isolated values.

unite()

The function unite() collapses cells (row values) across several columns into a single column. Below I’m collapsing the sex and gender column into a new column sex_gender. The cell values of sex and gender are combined using a dash (sep = "-"). When specifying remove = FALSE the collapsed columns are preserved in the output.

starwars %>%
  unite(
    col = sex_gender, 
    c(sex, gender), 
    sep = "-", 
    remove = FALSE
    ) %>%
  select(name, sex_gender, sex, gender)
## # A tibble: 87 x 4
##    name               sex_gender      sex    gender   
##    <chr>              <chr>           <chr>  <chr>    
##  1 Luke Skywalker     male-masculine  male   masculine
##  2 C-3PO              none-masculine  none   masculine
##  3 R2-D2              none-masculine  none   masculine
##  4 Darth Vader        male-masculine  male   masculine
##  5 Leia Organa        female-feminine female feminine 
##  6 Owen Lars          male-masculine  male   masculine
##  7 Beru Whitesun lars female-feminine female feminine 
##  8 R5-D4              none-masculine  none   masculine
##  9 Biggs Darklighter  male-masculine  male   masculine
## 10 Obi-Wan Kenobi     male-masculine  male   masculine
## # ... with 77 more rows

See how sex_gender is related to sex and gender?

separate()

The opposite of unite() is achieved with function separate() as it separates each cell in a column into several columns. Below I separate the column name into a forename and surname column for all Human Star Wars characters. Their name is separated at the first blank space (sep = " "). The argument extra = lets me control what should happen if there are more components than variables supplied by the argument into =. Just have a look at the characters’ names. Often more than one blank space occurs between their parts of the name. With extra = "merge" I can preserve all excess parts of the name.

starwars %>%
  filter(species == "Human") %>%
  separate(
    col = name,
    into = c("forename", "surname"),
    sep = " ",
    extra = "merge",
    remove = FALSE
  ) %>%
  select(name, forename, surname)
## # A tibble: 35 x 3
##    name               forename surname      
##    <chr>              <chr>    <chr>        
##  1 Luke Skywalker     Luke     Skywalker    
##  2 Darth Vader        Darth    Vader        
##  3 Leia Organa        Leia     Organa       
##  4 Owen Lars          Owen     Lars         
##  5 Beru Whitesun lars Beru     Whitesun lars
##  6 Biggs Darklighter  Biggs    Darklighter  
##  7 Obi-Wan Kenobi     Obi-Wan  Kenobi       
##  8 Anakin Skywalker   Anakin   Skywalker    
##  9 Wilhuff Tarkin     Wilhuff  Tarkin       
## 10 Han Solo           Han      Solo         
## # ... with 25 more rows

Can you guess what would happen to Beru Whitesun lars without the argument extra = "merge"?

extract()

The function extract() is quite similar and turns captured groups by regular expressions into a new column. If the groups don’t match, or the input is NA, the output will be NA. Below I am extracting all non-human Star Wars characters’ names into a new column name_wdigit if it contains a digit.

starwars %>%
  filter(species != "Human") %>%
  extract(
    col = name,
    into = "name_wdigit",
    regex = "(.*[0-9].*)",
    remove = FALSE
  ) %>%
  select(name, name_wdigit, height, mass)
## # A tibble: 48 x 4
##    name                  name_wdigit height  mass
##    <chr>                 <chr>        <int> <dbl>
##  1 C-3PO                 C-3PO          167    75
##  2 R2-D2                 R2-D2           96    32
##  3 R5-D4                 R5-D4           97    32
##  4 Chewbacca             <NA>           228   112
##  5 Greedo                <NA>           173    74
##  6 Jabba Desilijic Tiure <NA>           175  1358
##  7 Yoda                  <NA>            66    17
##  8 IG-88                 IG-88          200   140
##  9 Bossk                 <NA>           190   113
## 10 Ackbar                <NA>           180    83
## # ... with 38 more rows

In this tutorial I discuss regular expressions in more detail.

separate_rows()

With this function each cell in a column is separated into several rows. In the starwars dataset the column hair_color often contains more than one color, separated by a ,. The function separate_rows() breaks up the cell values into additional rows while the hair_color column remains in place.

starwars %>%
  separate_rows(
    hair_color,
    sep = ", "
  ) %>%
  select(name, height, mass, hair_color)
## # A tibble: 90 x 4
##    name               height  mass hair_color
##    <chr>               <int> <dbl> <chr>     
##  1 Luke Skywalker        172    77 blond     
##  2 C-3PO                 167    75 <NA>      
##  3 R2-D2                  96    32 <NA>      
##  4 Darth Vader           202   136 none      
##  5 Leia Organa           150    49 brown     
##  6 Owen Lars             178   120 brown     
##  7 Owen Lars             178   120 grey      
##  8 Beru Whitesun lars    165    75 brown     
##  9 R5-D4                  97    32 <NA>      
## 10 Biggs Darklighter     183    84 black     
## # ... with 80 more rows

Although this might not be the best practical example it shows you that the dataset is getting longer when using this function!


5. Missing values

The following functions drop or replace explicit missing values (NA).

drop_na()

The function drop_na() drops all rows containing NA values in the specified columns. Below I’m excluding all rows that have a missing value in column height. In total, 6 rows are dropped.

starwars %>%
  drop_na(height)
## # A tibble: 81 x 14
##    name     height  mass hair_color skin_color eye_color birth_year sex   gender
##    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Luke Sk~    172    77 blond      fair       blue            19   male  mascu~
##  2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu~
##  3 R2-D2        96    32 <NA>       white, bl~ red             33   none  mascu~
##  4 Darth V~    202   136 none       white      yellow          41.9 male  mascu~
##  5 Leia Or~    150    49 brown      light      brown           19   fema~ femin~
##  6 Owen La~    178   120 brown, gr~ light      blue            52   male  mascu~
##  7 Beru Wh~    165    75 brown      light      blue            47   fema~ femin~
##  8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu~
##  9 Biggs D~    183    84 black      light      brown           24   male  mascu~
## 10 Obi-Wan~    182    77 auburn, w~ fair       blue-gray       57   male  mascu~
## # ... with 71 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>

What happens if more than one column is used with drop_na()? Below, rows that either have a missing value in height or mass are dropped from the dataset. In total 11 rows are excluded.

starwars %>%
  drop_na(height, hair_color)
## # A tibble: 76 x 14
##    name     height  mass hair_color skin_color eye_color birth_year sex   gender
##    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Luke Sk~    172    77 blond      fair       blue            19   male  mascu~
##  2 Darth V~    202   136 none       white      yellow          41.9 male  mascu~
##  3 Leia Or~    150    49 brown      light      brown           19   fema~ femin~
##  4 Owen La~    178   120 brown, gr~ light      blue            52   male  mascu~
##  5 Beru Wh~    165    75 brown      light      blue            47   fema~ femin~
##  6 Biggs D~    183    84 black      light      brown           24   male  mascu~
##  7 Obi-Wan~    182    77 auburn, w~ fair       blue-gray       57   male  mascu~
##  8 Anakin ~    188    84 blond      fair       blue            41.9 male  mascu~
##  9 Wilhuff~    180    NA auburn, g~ fair       blue            64   male  mascu~
## 10 Chewbac~    228   112 brown      unknown    blue           200   male  mascu~
## # ... with 66 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>

See that C-3PO and R2-D2 are now also excluded from the output since they don’t have hair?

replace_na()

The function replace_na() lets you specify a value to replace NA values in selected columns instead of removing the row. If the input to this function is a tibble or data frame like the starwars dataset, the argument replace = takes a list of values, with one value for each column that has NA values to be replaced. Below I am replacing missing values in column mass with 0 and missing values in column hair_color with "unknown".

starwars %>%
  replace_na(replace = list(mass = 0, hair_color = "unknown")) %>%
  select(name, height, mass, hair_color)
## # A tibble: 87 x 4
##    name               height  mass hair_color   
##    <chr>               <int> <dbl> <chr>        
##  1 Luke Skywalker        172    77 blond        
##  2 C-3PO                 167    75 unknown      
##  3 R2-D2                  96    32 unknown      
##  4 Darth Vader           202   136 none         
##  5 Leia Organa           150    49 brown        
##  6 Owen Lars             178   120 brown, grey  
##  7 Beru Whitesun lars    165    75 brown        
##  8 R5-D4                  97    32 unknown      
##  9 Biggs Darklighter     183    84 black        
## 10 Obi-Wan Kenobi        182    77 auburn, white
## # ... with 77 more rows

Take care to only supply values that match the type of the column because otherwise it is coerced to the supplied type of value. Below I’m replacing NA values in column mass with "unknown" which converts the column to character.

starwars %>%
  replace_na(replace = list(mass = "unknown")) %>%
  select(name, height, mass, hair_color)
## # A tibble: 87 x 4
##    name               height mass  hair_color   
##    <chr>               <int> <chr> <chr>        
##  1 Luke Skywalker        172 77    blond        
##  2 C-3PO                 167 75    <NA>         
##  3 R2-D2                  96 32    <NA>         
##  4 Darth Vader           202 136   none         
##  5 Leia Organa           150 49    brown        
##  6 Owen Lars             178 120   brown, grey  
##  7 Beru Whitesun lars    165 75    brown        
##  8 R5-D4                  97 32    <NA>         
##  9 Biggs Darklighter     183 84    black        
## 10 Obi-Wan Kenobi        182 77    auburn, white
## # ... with 77 more rows

mutate() + replace_na()

The function replace_na() can also take a vector as input, for example in combination with function mutate(). Now I’m creating the new variable hair_color_clean for which all missing values are replaced with "unknown". Note that no list is required in function replace_na() now!

starwars %>%
  mutate(hair_color_clean = replace_na(hair_color, "unknown")) %>%
  select(name, mass, hair_color_clean)
## # A tibble: 87 x 3
##    name                mass hair_color_clean
##    <chr>              <dbl> <chr>           
##  1 Luke Skywalker        77 blond           
##  2 C-3PO                 75 unknown         
##  3 R2-D2                 32 unknown         
##  4 Darth Vader          136 none            
##  5 Leia Organa           49 brown           
##  6 Owen Lars            120 brown, grey     
##  7 Beru Whitesun lars    75 brown           
##  8 R5-D4                 32 unknown         
##  9 Biggs Darklighter     84 black           
## 10 Obi-Wan Kenobi        77 auburn, white   
## # ... with 77 more rows

fill()

You rather want to fill() in missing values in a column using the next or previous value? No problem! The argument .direction = controls the direction in which the NA values are filled. For example "down" uses the first non-missing value and fills in the next missing value(s). C-3PO, R2-D2 and R5-D4 have missing values in column hair_color - let’s take a look what happens when I use fill().

starwars %>%
  fill(hair_color, .direction = "down") %>%
  select(name, height, mass, hair_color)
## # A tibble: 87 x 4
##    name               height  mass hair_color   
##    <chr>               <int> <dbl> <chr>        
##  1 Luke Skywalker        172    77 blond        
##  2 C-3PO                 167    75 blond        
##  3 R2-D2                  96    32 blond        
##  4 Darth Vader           202   136 none         
##  5 Leia Organa           150    49 brown        
##  6 Owen Lars             178   120 brown, grey  
##  7 Beru Whitesun lars    165    75 brown        
##  8 R5-D4                  97    32 brown        
##  9 Biggs Darklighter     183    84 black        
## 10 Obi-Wan Kenobi        182    77 auburn, white
## # ... with 77 more rows

Specifying .direction = "up" reverses this procedure as it uses the last non-missing value and fills in the previous missing value(s).

starwars %>%
  fill(hair_color, .direction = "up") %>%
  select(name, height, mass, hair_color)
## # A tibble: 87 x 4
##    name               height  mass hair_color   
##    <chr>               <int> <dbl> <chr>        
##  1 Luke Skywalker        172    77 blond        
##  2 C-3PO                 167    75 none         
##  3 R2-D2                  96    32 none         
##  4 Darth Vader           202   136 none         
##  5 Leia Organa           150    49 brown        
##  6 Owen Lars             178   120 brown, grey  
##  7 Beru Whitesun lars    165    75 brown        
##  8 R5-D4                  97    32 black        
##  9 Biggs Darklighter     183    84 black        
## 10 Obi-Wan Kenobi        182    77 auburn, white
## # ... with 77 more rows


6. Nested data

In the dplyr tutorial we already had a peek at nested data when we worked with function dplyr::group_by(). A nested data frame stores individual tables as a list-column of data frames within a larger organizing data frame. List-columns can also be lists of vectors or lists of varying data types. When working with nested data you can preserve relationships between observations and subsets of data as well as the type of the variables being nested.

Have a look at the columns films, vehicles and starships in the starwars dataset. All three of them are list-columns and contain multiple entries per row.

starwars %>%
  select(films, vehicles, starships)
## # A tibble: 87 x 3
##    films     vehicles  starships
##    <list>    <list>    <list>   
##  1 <chr [5]> <chr [2]> <chr [2]>
##  2 <chr [6]> <chr [0]> <chr [0]>
##  3 <chr [7]> <chr [0]> <chr [0]>
##  4 <chr [4]> <chr [0]> <chr [1]>
##  5 <chr [5]> <chr [1]> <chr [0]>
##  6 <chr [3]> <chr [0]> <chr [0]>
##  7 <chr [3]> <chr [0]> <chr [0]>
##  8 <chr [1]> <chr [0]> <chr [0]>
##  9 <chr [1]> <chr [0]> <chr [1]>
## 10 <chr [6]> <chr [1]> <chr [5]>
## # ... with 77 more rows

In the output we can only see the value-type chr and the number of elements per row within []. How can we access these columns? Below I am using the function dplyr::pull() to extract all films in which Luke Skywalker (= row #1 according to the initial sorting order) appeared.

starwars %>% 
  slice(1) %>%
  pull(films)
## [[1]]
## [1] "The Empire Strikes Back" "Revenge of the Sith"    
## [3] "Return of the Jedi"      "A New Hope"             
## [5] "The Force Awakens"

As the previous output indicated [5] films are present in this cell.

6.1 Creation

nest()

The function nest() moves groups of cells into a list-column of a data frame. Below I’m nesting all columns whose names end with "color" in a new column nested_colors.

starwars %>%
  nest(nested_colors = ends_with("color")) %>%
  select(name, nested_colors)
## # A tibble: 87 x 2
##    name               nested_colors   
##    <chr>              <list>          
##  1 Luke Skywalker     <tibble [1 x 3]>
##  2 C-3PO              <tibble [1 x 3]>
##  3 R2-D2              <tibble [1 x 3]>
##  4 Darth Vader        <tibble [1 x 3]>
##  5 Leia Organa        <tibble [1 x 3]>
##  6 Owen Lars          <tibble [1 x 3]>
##  7 Beru Whitesun lars <tibble [1 x 3]>
##  8 R5-D4              <tibble [1 x 3]>
##  9 Biggs Darklighter  <tibble [1 x 3]>
## 10 Obi-Wan Kenobi     <tibble [1 x 3]>
## # ... with 77 more rows

In RStudio we can use the function View() to inspect the data frame and a click on any cell of the column nested_colors opens up a new tab with all information related to the respective row.

starwars %>%
  nest(nest_colors = ends_with("color")) %>%
  select(name, nest_colors) %>%
  View()

group_by() + nest()

The function nest() can also be used with dplyr::group_by(). Below I’m grouping the starwars dataset by homeworld. All characters (rows) with the same homeworld are grouped in a new dataframe and stored in the column data. When using View() to inspect the output, a mouse-click in the data cell for Tatooine shows you all Star Wars characters with this homeworld.

starwars %>%
  group_by(homeworld) %>%
  nest()
## # A tibble: 49 x 2
## # Groups:   homeworld [49]
##    homeworld  data              
##    <chr>      <list>            
##  1 Tatooine   <tibble [10 x 13]>
##  2 Naboo      <tibble [11 x 13]>
##  3 Alderaan   <tibble [3 x 13]> 
##  4 Stewjon    <tibble [1 x 13]> 
##  5 Eriadu     <tibble [1 x 13]> 
##  6 Kashyyyk   <tibble [2 x 13]> 
##  7 Corellia   <tibble [2 x 13]> 
##  8 Rodia      <tibble [1 x 13]> 
##  9 Nal Hutta  <tibble [1 x 13]> 
## 10 Bestine IV <tibble [1 x 13]> 
## # ... with 39 more rows

Tibbles with list-columns

In this tutorial I explained functions that can be used to create a tibble. You may also apply tidy data concepts with them and store information in list columns. The function tibble::tibble() for example saves list input as list-columns.

tibble::tibble(
  name = c("Luke Skywalker", "Obi-Wan Kenobi"),
  starships = list(c("X-wing","Imperial shuttle"),
                   c("Jedi starfighter","Trade Federation cruiser"))
  )
## # A tibble: 2 x 2
##   name           starships
##   <chr>          <list>   
## 1 Luke Skywalker <chr [2]>
## 2 Obi-Wan Kenobi <chr [2]>

This also works with tibble::tribble() which makes list-columns when needed.

tibble::tribble(
  ~name, ~starships,
  "Luke Skywalker", c("X-wing","Imperial shuttle"),
  "Obi-Wan Kenobi", c("Jedi starfighter","Trade Federation cruiser")
)
## # A tibble: 2 x 2
##   name           starships
##   <chr>          <list>   
## 1 Luke Skywalker <chr [2]>
## 2 Obi-Wan Kenobi <chr [2]>

At last tibble::enframe() is able to convert multi-level lists to a tibble with list-columns.

tibble::enframe(
  list('Luke Skywalker' = c("X-wing","Imperial shuttle"), 
       'Obi-Wan Kenobi' = c("Jedi starfighter","Trade Federation cruiser")), 
  'name', 'starships'
  )
## # A tibble: 2 x 2
##   name           starships
##   <chr>          <list>   
## 1 Luke Skywalker <chr [2]>
## 2 Obi-Wan Kenobi <chr [2]>

Output list-columns

The functions dplyr::mutate(), dplyr::transmute() and dplyr::summarise() will also output list-columns if they return a list. Below I compute the sample quantiles of mass for each homeworld with function quantile(). This function returns five values corresponding to the 0%, 25%, 50%, 75% and 100% percentile. With function list() we can store the returned values in a single row for each homeworld.

starwars %>%
  group_by(homeworld) %>%
  summarise(pecentiles = list(quantile(mass, na.rm = TRUE)))
## # A tibble: 49 x 2
##    homeworld      pecentiles
##    <chr>          <list>    
##  1 Alderaan       <dbl [5]> 
##  2 Aleen Minor    <dbl [5]> 
##  3 Bespin         <dbl [5]> 
##  4 Bestine IV     <dbl [5]> 
##  5 Cato Neimoidia <dbl [5]> 
##  6 Cerea          <dbl [5]> 
##  7 Champala       <dbl [5]> 
##  8 Chandrila      <dbl [5]> 
##  9 Concord Dawn   <dbl [5]> 
## 10 Corellia       <dbl [5]> 
## # ... with 39 more rows


6.2 Reshaping

unnest()

The function unnest() flattens nested columns back to regular columns. It’s the inverse of nest(). Below I’m unnesting the previously homeworld-nested starwars dataset.

starwars_nested %>%
  unnest(cols =c(data))
## # A tibble: 87 x 14
## # Groups:   homeworld [49]
##    homeworld name  height  mass hair_color skin_color eye_color birth_year sex  
##    <chr>     <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>
##  1 Tatooine  Luke~    172    77 blond      fair       blue            19   male 
##  2 Tatooine  C-3PO    167    75 <NA>       gold       yellow         112   none 
##  3 Tatooine  Dart~    202   136 none       white      yellow          41.9 male 
##  4 Tatooine  Owen~    178   120 brown, gr~ light      blue            52   male 
##  5 Tatooine  Beru~    165    75 brown      light      blue            47   fema~
##  6 Tatooine  R5-D4     97    32 <NA>       white, red red             NA   none 
##  7 Tatooine  Bigg~    183    84 black      light      brown           24   male 
##  8 Tatooine  Anak~    188    84 blond      fair       blue            41.9 male 
##  9 Tatooine  Shmi~    163    NA black      fair       brown           72   fema~
## 10 Tatooine  Clie~    183    NA brown      fair       blue            82   male 
## # ... with 77 more rows, and 5 more variables: gender <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>

unnest_longer()

This function turns each element of a list-column into a row. Below I’m applying it to the column starships of the starwars dataset.

starwars %>%
  select(name, starships) %>%
  unnest_longer(col = starships)
## # A tibble: 98 x 2
##    name               starships       
##    <chr>              <chr>           
##  1 Luke Skywalker     X-wing          
##  2 Luke Skywalker     Imperial shuttle
##  3 C-3PO              <NA>            
##  4 R2-D2              <NA>            
##  5 Darth Vader        TIE Advanced x1 
##  6 Leia Organa        <NA>            
##  7 Owen Lars          <NA>            
##  8 Beru Whitesun lars <NA>            
##  9 R5-D4              <NA>            
## 10 Biggs Darklighter  X-wing          
## # ... with 88 more rows

See how the dataset is getting longer and has now 173 rows in total? That is because for each element inside a cell of the starships list-column an additional row is added.

unnest_wider()

The function unnest_wider() turns each element of a list-column into a regular column. I apply this function again to the starships column.

starwars %>%
  select(name, starships) %>%
  unnest_wider(col = starships)
## # A tibble: 87 x 6
##    name               ...1             ...2                    ...3  ...4  ...5 
##    <chr>              <chr>            <chr>                   <chr> <chr> <chr>
##  1 Luke Skywalker     X-wing           Imperial shuttle        <NA>  <NA>  <NA> 
##  2 C-3PO              <NA>             <NA>                    <NA>  <NA>  <NA> 
##  3 R2-D2              <NA>             <NA>                    <NA>  <NA>  <NA> 
##  4 Darth Vader        TIE Advanced x1  <NA>                    <NA>  <NA>  <NA> 
##  5 Leia Organa        <NA>             <NA>                    <NA>  <NA>  <NA> 
##  6 Owen Lars          <NA>             <NA>                    <NA>  <NA>  <NA> 
##  7 Beru Whitesun lars <NA>             <NA>                    <NA>  <NA>  <NA> 
##  8 R5-D4              <NA>             <NA>                    <NA>  <NA>  <NA> 
##  9 Biggs Darklighter  X-wing           <NA>                    <NA>  <NA>  <NA> 
## 10 Obi-Wan Kenobi     Jedi starfighter Trade Federation cruis~ Nabo~ Jedi~ Belb~
## # ... with 77 more rows

Five new columns are created! Can you tell why? It reflects the maximum number of starships of a character in the dataset (5 list elements in a cell).

hoist()

Last but not least with function hoist() we can pull list components out into their own top-level columns. Below I’m creating two new columns representing the first_ and second_ element of the starships column. The starships column itself is preserved but each list element according to a specified position is removed. Hence, we can immediately tell if there are any elements left that where not captured by the new columns.

starwars %>%
  select(name, starships) %>%
  hoist(starships, 
        first_starship = 1, 
        second_starship = 2, 
        .remove = TRUE)
## # A tibble: 87 x 4
##    name               first_starship   second_starship          starships
##    <chr>              <chr>            <chr>                    <list>   
##  1 Luke Skywalker     X-wing           Imperial shuttle         <chr [0]>
##  2 C-3PO              <NA>             <NA>                     <chr [0]>
##  3 R2-D2              <NA>             <NA>                     <chr [0]>
##  4 Darth Vader        TIE Advanced x1  <NA>                     <chr [0]>
##  5 Leia Organa        <NA>             <NA>                     <chr [0]>
##  6 Owen Lars          <NA>             <NA>                     <chr [0]>
##  7 Beru Whitesun lars <NA>             <NA>                     <chr [0]>
##  8 R5-D4              <NA>             <NA>                     <chr [0]>
##  9 Biggs Darklighter  X-wing           <NA>                     <chr [0]>
## 10 Obi-Wan Kenobi     Jedi starfighter Trade Federation cruiser <chr [3]>
## # ... with 77 more rows

Obi-Wan Kenobi for example has three other starships which we could potentially move to a third_, fourth_ or fifth_ starships column.

Digression: Rectangling

This section is under construction. See vignette("rectangle") for more information.


6.3 Transformation

In the dplyr tutorial we also learned about vectorized functions. Unfortunately, these functions don’t work with lists (or list-columns). With the function dplyr::rowwise(), however, we can group data so that each row is one group, and within the groups, elements of list-columns appear directly (and not as lists of length one). When using rowwise(), dplyr functions will seem to apply functions to list-columns in a vectorized fashion.

See the purrr tutorial for more list functions.

6.3.1 Single list-columns

Create a list-column

At first, I want to apply a function to a list-column and create a new list-column. Below I’m counting the number of characters of each element in the vehicles column with function nchar(). In function mutate() I have to enclose the expression in function list() since nchar() is applied to a list which can return more than one value per row.

starwars %>%
  rowwise() %>%
  mutate(vehicles_nchars = list(nchar(vehicles))) %>%
  select(name, vehicles, vehicles_nchars)
## # A tibble: 87 x 3
## # Rowwise: 
##    name               vehicles  vehicles_nchars
##    <chr>              <list>    <list>         
##  1 Luke Skywalker     <chr [2]> <int [2]>      
##  2 C-3PO              <chr [0]> <int [0]>      
##  3 R2-D2              <chr [0]> <int [0]>      
##  4 Darth Vader        <chr [0]> <int [0]>      
##  5 Leia Organa        <chr [1]> <int [1]>      
##  6 Owen Lars          <chr [0]> <int [0]>      
##  7 Beru Whitesun lars <chr [0]> <int [0]>      
##  8 R5-D4              <chr [0]> <int [0]>      
##  9 Biggs Darklighter  <chr [0]> <int [0]>      
## 10 Obi-Wan Kenobi     <chr [1]> <int [1]>      
## # ... with 77 more rows

Now let’s take a look at the vehicles_nchars column with unnest().

starwars %>%
  rowwise() %>%
  mutate(vehicles_nchars = list(nchar(vehicles))) %>%
  select(name, vehicles, vehicles_nchars) %>%
  unnest(c(vehicles, vehicles_nchars))
## # A tibble: 13 x 3
##    name             vehicles                    vehicles_nchars
##    <chr>            <chr>                                 <int>
##  1 Luke Skywalker   Snowspeeder                              11
##  2 Luke Skywalker   Imperial Speeder Bike                    21
##  3 Leia Organa      Imperial Speeder Bike                    21
##  4 Obi-Wan Kenobi   Tribubble bongo                          15
##  5 Anakin Skywalker Zephyr-G swoop bike                      19
##  6 Anakin Skywalker XJ-6 airspeeder                          15
##  7 Chewbacca        AT-ST                                     5
##  8 Wedge Antilles   Snowspeeder                              11
##  9 Qui-Gon Jinn     Tribubble bongo                          15
## 10 Darth Maul       Sith speeder                             12
## 11 Dooku            Flitknot speeder                         16
## 12 Zam Wesell       Koro-2 Exodrive airspeeder               26
## 13 Grievous         Tsmeu-6 personal wheel bike              27
Create a regular column

Next I want to apply a function to a list-column and create a regular column. The function is.character() returns one logical value per row as it checks whether the list-element is of type character or not.

starwars %>%
  rowwise() %>%
  mutate(check_character = is.character(vehicles)) %>%
  select(name, vehicles, check_character)
## # A tibble: 87 x 3
## # Rowwise: 
##    name               vehicles  check_character
##    <chr>              <list>    <lgl>          
##  1 Luke Skywalker     <chr [2]> TRUE           
##  2 C-3PO              <chr [0]> TRUE           
##  3 R2-D2              <chr [0]> TRUE           
##  4 Darth Vader        <chr [0]> TRUE           
##  5 Leia Organa        <chr [1]> TRUE           
##  6 Owen Lars          <chr [0]> TRUE           
##  7 Beru Whitesun lars <chr [0]> TRUE           
##  8 R5-D4              <chr [0]> TRUE           
##  9 Biggs Darklighter  <chr [0]> TRUE           
## 10 Obi-Wan Kenobi     <chr [1]> TRUE           
## # ... with 77 more rows


6.3.2 Multiple list-columns

Collapse

It is also possible to collapse multiple list-columns into a single list-column. With function mutate() I create a new column transport which should contain both the vehicles and starships of a Star Wars character. This can be done with function append(). Note that it returns a list for each row, hence the column-type must be list.

starwars %>%
  rowwise() %>%
  mutate(transport = list(append(vehicles, starships))) %>%
  select(name, transport)
## # A tibble: 87 x 2
## # Rowwise: 
##    name               transport
##    <chr>              <list>   
##  1 Luke Skywalker     <chr [4]>
##  2 C-3PO              <chr [0]>
##  3 R2-D2              <chr [0]>
##  4 Darth Vader        <chr [1]>
##  5 Leia Organa        <chr [1]>
##  6 Owen Lars          <chr [0]>
##  7 Beru Whitesun lars <chr [0]>
##  8 R5-D4              <chr [0]>
##  9 Biggs Darklighter  <chr [1]>
## 10 Obi-Wan Kenobi     <chr [6]>
## # ... with 77 more rows

Let’s take a look at the data with unnest().

starwars %>%
  rowwise() %>%
  mutate(transport = list(append(vehicles, starships))) %>%
  select(name, transport) %>%
  unnest(transport)
## # A tibble: 44 x 2
##    name              transport               
##    <chr>             <chr>                   
##  1 Luke Skywalker    Snowspeeder             
##  2 Luke Skywalker    Imperial Speeder Bike   
##  3 Luke Skywalker    X-wing                  
##  4 Luke Skywalker    Imperial shuttle        
##  5 Darth Vader       TIE Advanced x1         
##  6 Leia Organa       Imperial Speeder Bike   
##  7 Biggs Darklighter X-wing                  
##  8 Obi-Wan Kenobi    Tribubble bongo         
##  9 Obi-Wan Kenobi    Jedi starfighter        
## 10 Obi-Wan Kenobi    Trade Federation cruiser
## # ... with 34 more rows
Apply a function

Finally, I’m going to apply a function to multiple list-columns. The function length() returns the length of a vector or list. That is the number of elements per list when applied to the vehicles and starships column.

starwars %>%
  rowwise() %>%
  mutate(n_transports = length(c(vehicles, starships))) %>%
  select(name, vehicles, starships, n_transports)
## # A tibble: 87 x 4
## # Rowwise: 
##    name               vehicles  starships n_transports
##    <chr>              <list>    <list>           <int>
##  1 Luke Skywalker     <chr [2]> <chr [2]>            4
##  2 C-3PO              <chr [0]> <chr [0]>            0
##  3 R2-D2              <chr [0]> <chr [0]>            0
##  4 Darth Vader        <chr [0]> <chr [1]>            1
##  5 Leia Organa        <chr [1]> <chr [0]>            1
##  6 Owen Lars          <chr [0]> <chr [0]>            0
##  7 Beru Whitesun lars <chr [0]> <chr [0]>            0
##  8 R5-D4              <chr [0]> <chr [0]>            0
##  9 Biggs Darklighter  <chr [0]> <chr [1]>            1
## 10 Obi-Wan Kenobi     <chr [1]> <chr [5]>            6
## # ... with 77 more rows


References

Wickham, Hadley. 2020. Tidyr: Tidy Messy Data. https://CRAN.R-project.org/package=tidyr.