Reading time: 13 minutes (2,468 words)
Even when data is stored in a modern format like a tibble it can still be messy. The tidyr
package therefore offers a wide range of functions that help us to bring our data into shape so that a maximum of information can be gained. This tutorial requires some knowledge of the dplyr
package and I recommend to familiarize yourself with it before continuing here - for example with this tutorial. When you’re ready load both packages!
library(tidyr)
library(dplyr)
As always we’ll use the starwars
dataset contained in the dplyr
package. Below is a snippet of the data.
starwars %>%
print()
## # A tibble: 87 x 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke Sk~ 172 77 blond fair blue 19 male mascu~
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu~
## 3 R2-D2 96 32 <NA> white, bl~ red 33 none mascu~
## 4 Darth V~ 202 136 none white yellow 41.9 male mascu~
## 5 Leia Or~ 150 49 brown light brown 19 fema~ femin~
## 6 Owen La~ 178 120 brown, gr~ light blue 52 male mascu~
## 7 Beru Wh~ 165 75 brown light blue 47 fema~ femin~
## 8 R5-D4 97 32 <NA> white, red red NA none mascu~
## 9 Biggs D~ 183 84 black light brown 24 male mascu~
## 10 Obi-Wan~ 182 77 auburn, w~ fair blue-gray 57 male mascu~
## # ... with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## # films <list>, vehicles <list>, starships <list>
Remember the principles of the tidy data concept:
starwars
dataset corresponds to a character of the movie (= cases)name
, height
or mass
(= variables)Initially we’ll discuss how to reorganize tabular data without losing any information in the process.
With function pivot_longer()
we can make a dataset longer by collapsing several columns into only two. One of the new column then holds the column names of the collapsed columns (names_to =
) and the second new column holds the collapsed columns’ values (values_to =
).
Below I am collapsing a Star Wars character’s hair_color
, skin_color
and eye_color
into the new columns characteristic
and color
- the former holding the column names and the latter the values. In addition I’m using the names_pattern =
argument which lets me modify the original column names (turned values) of the names_to =
column.
starwars %>%
select(name, ends_with("_color")) %>%
pivot_longer(
cols = ends_with("_color"),
names_to = "characteristic",
values_to = "color",
names_pattern = "(.*)_color"
)
## # A tibble: 261 x 3
## name characteristic color
## <chr> <chr> <chr>
## 1 Luke Skywalker hair blond
## 2 Luke Skywalker skin fair
## 3 Luke Skywalker eye blue
## 4 C-3PO hair <NA>
## 5 C-3PO skin gold
## 6 C-3PO eye yellow
## 7 R2-D2 hair <NA>
## 8 R2-D2 skin white, blue
## 9 R2-D2 eye red
## 10 Darth Vader hair none
## # ... with 251 more rows
The original starwars
dataset held 87 rows. Now, after using pivot_longer()
, it holds 261 rows. Note that now there are multiple entries for the same Star Wars character.
The function pivot_wider()
does the opposite of pivot_longer()
. The dataset is now getting wider by expanding two columns into several. One column then provides the new column names (names from =
) and the other the values (values_from =
). All other columns remain unaffected and must be listed under id_cols =
if they should still be included in the output. Below, I’m reshaping the starwars
dataset we’ve just pivoted back to its original form. See the argument names_glue =
? It uses the names_from
column (and a special .value
) to create custom column names.
starwars_pivot %>%
pivot_wider(
id_cols = c(name, mass, height),
names_from = characteristic,
values_from = color,
names_glue = "{characteristic}_{.value}"
)
## # A tibble: 87 x 6
## name mass height hair_color skin_color eye_color
## <chr> <dbl> <int> <chr> <chr> <chr>
## 1 Luke Skywalker 77 172 blond fair blue
## 2 C-3PO 75 167 <NA> gold yellow
## 3 R2-D2 32 96 <NA> white, blue red
## 4 Darth Vader 136 202 none white yellow
## 5 Leia Organa 49 150 brown light brown
## 6 Owen Lars 120 178 brown, grey light blue
## 7 Beru Whitesun lars 75 165 brown light blue
## 8 R5-D4 32 97 <NA> white, red red
## 9 Biggs Darklighter 84 183 black light brown
## 10 Obi-Wan Kenobi 77 182 auburn, white fair blue-gray
## # ... with 77 more rows
The starwars
dataset is now back in its original state with one row for each character and 87 rows in total.
The functions in this section let you create new combinations of variables or identify implicit missing values (= combinations of variables which are not present in the dataset but may exist).
The function expand()
creates a new tibble with all possible combinations of the values of the supplied variables. Other variables are dropped from the output. Below, I expand()
the sex
and gender
of the Star Wars characters.
starwars %>%
expand(sex, gender)
## # A tibble: 15 x 2
## sex gender
## <chr> <chr>
## 1 female feminine
## 2 female masculine
## 3 female <NA>
## 4 hermaphroditic feminine
## 5 hermaphroditic masculine
## 6 hermaphroditic <NA>
## 7 male feminine
## 8 male masculine
## 9 male <NA>
## 10 none feminine
## 11 none masculine
## 12 none <NA>
## 13 <NA> feminine
## 14 <NA> masculine
## 15 <NA> <NA>
Note that also NA
values are treated as a category and are contained in the list of possible combinations.
With function complete()
missing combinations of values of the supplied variables are added to the starwars
dataset. By doing this, all other variables are filled with NA
. This is really helpful when creating a bar chart for example. where you would like to show categories that dont’t exist in your dataset but still happen to exist (in another sample etc.). Below, I complete()
the sex
and gender
columns.
starwars %>%
complete(sex, gender) %>%
slice(11:20) %>%
select(sex, gender, name, height, mass)
## # A tibble: 10 x 5
## sex gender name height mass
## <chr> <chr> <chr> <int> <dbl>
## 1 female feminine Zam Wesell 168 55
## 2 female feminine Taun We 213 NA
## 3 female feminine Jocasta Nu 167 NA
## 4 female feminine Shaak Ti 178 57
## 5 female feminine Rey NA NA
## 6 female feminine Padmé Amidala 165 45
## 7 female masculine <NA> NA NA
## 8 female <NA> <NA> NA NA
## 9 hermaphroditic feminine <NA> NA NA
## 10 hermaphroditic masculine Jabba Desilijic Tiure 175 1358
The missing combinations have been added to the starwars
dataset. For example there is no Star Wars character with a sex
-gender
combination female-masculine so NA
values are filled in each other column.
The function crossing()
is a wrapper around expand_grid()
that de-duplicates and sorts its inputs. This is very helpful when creating data frames for making predictions, for example after fitting a model.
tibble(year = c("2017","2018","2019")) %>%
crossing(
player = c("Thomas","Peter","Sandy","Maria"),
mood =c("good","bad")
)
## # A tibble: 24 x 3
## year player mood
## <chr> <chr> <chr>
## 1 2017 Maria bad
## 2 2017 Maria good
## 3 2017 Peter bad
## 4 2017 Peter good
## 5 2017 Sandy bad
## 6 2017 Sandy good
## 7 2017 Thomas bad
## 8 2017 Thomas good
## 9 2018 Maria bad
## 10 2018 Maria good
## # ... with 14 more rows
Next, I’ll discuss functions which can split or combine cells into individual, isolated values.
The function unite()
collapses cells (row values) across several columns into a single column. Below I’m collapsing the sex
and gender
column into a new column sex_gender
. The cell values of sex
and gender
are combined using a dash (sep = "-"
). When specifying remove = FALSE
the collapsed columns are preserved in the output.
starwars %>%
unite(
col = sex_gender,
c(sex, gender),
sep = "-",
remove = FALSE
) %>%
select(name, sex_gender, sex, gender)
## # A tibble: 87 x 4
## name sex_gender sex gender
## <chr> <chr> <chr> <chr>
## 1 Luke Skywalker male-masculine male masculine
## 2 C-3PO none-masculine none masculine
## 3 R2-D2 none-masculine none masculine
## 4 Darth Vader male-masculine male masculine
## 5 Leia Organa female-feminine female feminine
## 6 Owen Lars male-masculine male masculine
## 7 Beru Whitesun lars female-feminine female feminine
## 8 R5-D4 none-masculine none masculine
## 9 Biggs Darklighter male-masculine male masculine
## 10 Obi-Wan Kenobi male-masculine male masculine
## # ... with 77 more rows
See how sex_gender
is related to sex
and gender
?
The opposite of unite()
is achieved with function separate()
as it separates each cell in a column into several columns. Below I separate the column name
into a forename
and surname
column for all Human
Star Wars characters. Their name
is separated at the first blank space (sep = " "
). The argument extra =
lets me control what should happen if there are more components than variables supplied by the argument into =
. Just have a look at the characters’ names. Often more than one blank space occurs between their parts of the name. With extra = "merge"
I can preserve all excess parts of the name.
starwars %>%
filter(species == "Human") %>%
separate(
col = name,
into = c("forename", "surname"),
sep = " ",
extra = "merge",
remove = FALSE
) %>%
select(name, forename, surname)
## # A tibble: 35 x 3
## name forename surname
## <chr> <chr> <chr>
## 1 Luke Skywalker Luke Skywalker
## 2 Darth Vader Darth Vader
## 3 Leia Organa Leia Organa
## 4 Owen Lars Owen Lars
## 5 Beru Whitesun lars Beru Whitesun lars
## 6 Biggs Darklighter Biggs Darklighter
## 7 Obi-Wan Kenobi Obi-Wan Kenobi
## 8 Anakin Skywalker Anakin Skywalker
## 9 Wilhuff Tarkin Wilhuff Tarkin
## 10 Han Solo Han Solo
## # ... with 25 more rows
Can you guess what would happen to Beru Whitesun lars without the argument extra = "merge"
?
The function extract()
is quite similar and turns captured groups by regular expressions into a new column. If the groups don’t match, or the input is NA
, the output will be NA
. Below I am extracting all non-human Star Wars characters’ name
s into a new column name_wdigit
if it contains a digit.
starwars %>%
filter(species != "Human") %>%
extract(
col = name,
into = "name_wdigit",
regex = "(.*[0-9].*)",
remove = FALSE
) %>%
select(name, name_wdigit, height, mass)
## # A tibble: 48 x 4
## name name_wdigit height mass
## <chr> <chr> <int> <dbl>
## 1 C-3PO C-3PO 167 75
## 2 R2-D2 R2-D2 96 32
## 3 R5-D4 R5-D4 97 32
## 4 Chewbacca <NA> 228 112
## 5 Greedo <NA> 173 74
## 6 Jabba Desilijic Tiure <NA> 175 1358
## 7 Yoda <NA> 66 17
## 8 IG-88 IG-88 200 140
## 9 Bossk <NA> 190 113
## 10 Ackbar <NA> 180 83
## # ... with 38 more rows
In this tutorial I discuss regular expressions in more detail.
With this function each cell in a column is separated into several rows. In the starwars
dataset the column hair_color
often contains more than one color, separated by a ,
. The function separate_rows()
breaks up the cell values into additional rows while the hair_color
column remains in place.
starwars %>%
separate_rows(
hair_color,
sep = ", "
) %>%
select(name, height, mass, hair_color)
## # A tibble: 90 x 4
## name height mass hair_color
## <chr> <int> <dbl> <chr>
## 1 Luke Skywalker 172 77 blond
## 2 C-3PO 167 75 <NA>
## 3 R2-D2 96 32 <NA>
## 4 Darth Vader 202 136 none
## 5 Leia Organa 150 49 brown
## 6 Owen Lars 178 120 brown
## 7 Owen Lars 178 120 grey
## 8 Beru Whitesun lars 165 75 brown
## 9 R5-D4 97 32 <NA>
## 10 Biggs Darklighter 183 84 black
## # ... with 80 more rows
Although this might not be the best practical example it shows you that the dataset is getting longer when using this function!
The following functions drop or replace explicit missing values (NA
).
The function drop_na()
drops all rows containing NA
values in the specified columns. Below I’m excluding all rows that have a missing value in column height
. In total, 6 rows are dropped.
starwars %>%
drop_na(height)
## # A tibble: 81 x 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke Sk~ 172 77 blond fair blue 19 male mascu~
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu~
## 3 R2-D2 96 32 <NA> white, bl~ red 33 none mascu~
## 4 Darth V~ 202 136 none white yellow 41.9 male mascu~
## 5 Leia Or~ 150 49 brown light brown 19 fema~ femin~
## 6 Owen La~ 178 120 brown, gr~ light blue 52 male mascu~
## 7 Beru Wh~ 165 75 brown light blue 47 fema~ femin~
## 8 R5-D4 97 32 <NA> white, red red NA none mascu~
## 9 Biggs D~ 183 84 black light brown 24 male mascu~
## 10 Obi-Wan~ 182 77 auburn, w~ fair blue-gray 57 male mascu~
## # ... with 71 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## # films <list>, vehicles <list>, starships <list>
What happens if more than one column is used with drop_na()
? Below, rows that either have a missing value in height
or mass
are dropped from the dataset. In total 11 rows are excluded.
starwars %>%
drop_na(height, hair_color)
## # A tibble: 76 x 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke Sk~ 172 77 blond fair blue 19 male mascu~
## 2 Darth V~ 202 136 none white yellow 41.9 male mascu~
## 3 Leia Or~ 150 49 brown light brown 19 fema~ femin~
## 4 Owen La~ 178 120 brown, gr~ light blue 52 male mascu~
## 5 Beru Wh~ 165 75 brown light blue 47 fema~ femin~
## 6 Biggs D~ 183 84 black light brown 24 male mascu~
## 7 Obi-Wan~ 182 77 auburn, w~ fair blue-gray 57 male mascu~
## 8 Anakin ~ 188 84 blond fair blue 41.9 male mascu~
## 9 Wilhuff~ 180 NA auburn, g~ fair blue 64 male mascu~
## 10 Chewbac~ 228 112 brown unknown blue 200 male mascu~
## # ... with 66 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## # films <list>, vehicles <list>, starships <list>
See that C-3PO and R2-D2 are now also excluded from the output since they don’t have hair?
The function replace_na()
lets you specify a value to replace NA
values in selected columns instead of removing the row. If the input to this function is a tibble or data frame like the starwars
dataset, the argument replace =
takes a list of values, with one value for each column that has NA
values to be replaced. Below I am replacing missing values in column mass
with 0
and missing values in column hair_color
with "unknown"
.
starwars %>%
replace_na(replace = list(mass = 0, hair_color = "unknown")) %>%
select(name, height, mass, hair_color)
## # A tibble: 87 x 4
## name height mass hair_color
## <chr> <int> <dbl> <chr>
## 1 Luke Skywalker 172 77 blond
## 2 C-3PO 167 75 unknown
## 3 R2-D2 96 32 unknown
## 4 Darth Vader 202 136 none
## 5 Leia Organa 150 49 brown
## 6 Owen Lars 178 120 brown, grey
## 7 Beru Whitesun lars 165 75 brown
## 8 R5-D4 97 32 unknown
## 9 Biggs Darklighter 183 84 black
## 10 Obi-Wan Kenobi 182 77 auburn, white
## # ... with 77 more rows
Take care to only supply values that match the type of the column because otherwise it is coerced to the supplied type of value. Below I’m replacing NA
values in column mass
with "unknown"
which converts the column to character.
starwars %>%
replace_na(replace = list(mass = "unknown")) %>%
select(name, height, mass, hair_color)
## # A tibble: 87 x 4
## name height mass hair_color
## <chr> <int> <chr> <chr>
## 1 Luke Skywalker 172 77 blond
## 2 C-3PO 167 75 <NA>
## 3 R2-D2 96 32 <NA>
## 4 Darth Vader 202 136 none
## 5 Leia Organa 150 49 brown
## 6 Owen Lars 178 120 brown, grey
## 7 Beru Whitesun lars 165 75 brown
## 8 R5-D4 97 32 <NA>
## 9 Biggs Darklighter 183 84 black
## 10 Obi-Wan Kenobi 182 77 auburn, white
## # ... with 77 more rows
The function replace_na()
can also take a vector as input, for example in combination with function mutate()
. Now I’m creating the new variable hair_color_clean
for which all missing values are replaced with "unknown"
. Note that no list is required in function replace_na()
now!
starwars %>%
mutate(hair_color_clean = replace_na(hair_color, "unknown")) %>%
select(name, mass, hair_color_clean)
## # A tibble: 87 x 3
## name mass hair_color_clean
## <chr> <dbl> <chr>
## 1 Luke Skywalker 77 blond
## 2 C-3PO 75 unknown
## 3 R2-D2 32 unknown
## 4 Darth Vader 136 none
## 5 Leia Organa 49 brown
## 6 Owen Lars 120 brown, grey
## 7 Beru Whitesun lars 75 brown
## 8 R5-D4 32 unknown
## 9 Biggs Darklighter 84 black
## 10 Obi-Wan Kenobi 77 auburn, white
## # ... with 77 more rows
You rather want to fill()
in missing values in a column using the next or previous value? No problem! The argument .direction =
controls the direction in which the NA
values are filled. For example "down"
uses the first non-missing value and fills in the next missing value(s). C-3PO, R2-D2 and R5-D4 have missing values in column hair_color
- let’s take a look what happens when I use fill()
.
starwars %>%
fill(hair_color, .direction = "down") %>%
select(name, height, mass, hair_color)
## # A tibble: 87 x 4
## name height mass hair_color
## <chr> <int> <dbl> <chr>
## 1 Luke Skywalker 172 77 blond
## 2 C-3PO 167 75 blond
## 3 R2-D2 96 32 blond
## 4 Darth Vader 202 136 none
## 5 Leia Organa 150 49 brown
## 6 Owen Lars 178 120 brown, grey
## 7 Beru Whitesun lars 165 75 brown
## 8 R5-D4 97 32 brown
## 9 Biggs Darklighter 183 84 black
## 10 Obi-Wan Kenobi 182 77 auburn, white
## # ... with 77 more rows
Specifying .direction = "up"
reverses this procedure as it uses the last non-missing value and fills in the previous missing value(s).
starwars %>%
fill(hair_color, .direction = "up") %>%
select(name, height, mass, hair_color)
## # A tibble: 87 x 4
## name height mass hair_color
## <chr> <int> <dbl> <chr>
## 1 Luke Skywalker 172 77 blond
## 2 C-3PO 167 75 none
## 3 R2-D2 96 32 none
## 4 Darth Vader 202 136 none
## 5 Leia Organa 150 49 brown
## 6 Owen Lars 178 120 brown, grey
## 7 Beru Whitesun lars 165 75 brown
## 8 R5-D4 97 32 black
## 9 Biggs Darklighter 183 84 black
## 10 Obi-Wan Kenobi 182 77 auburn, white
## # ... with 77 more rows
In the dplyr tutorial we already had a peek at nested data when we worked with function dplyr::group_by()
. A nested data frame stores individual tables as a list-column of data frames within a larger organizing data frame. List-columns can also be lists of vectors or lists of varying data types. When working with nested data you can preserve relationships between observations and subsets of data as well as the type of the variables being nested.
Have a look at the columns films
, vehicles
and starships
in the starwars
dataset. All three of them are list-columns and contain multiple entries per row.
starwars %>%
select(films, vehicles, starships)
## # A tibble: 87 x 3
## films vehicles starships
## <list> <list> <list>
## 1 <chr [5]> <chr [2]> <chr [2]>
## 2 <chr [6]> <chr [0]> <chr [0]>
## 3 <chr [7]> <chr [0]> <chr [0]>
## 4 <chr [4]> <chr [0]> <chr [1]>
## 5 <chr [5]> <chr [1]> <chr [0]>
## 6 <chr [3]> <chr [0]> <chr [0]>
## 7 <chr [3]> <chr [0]> <chr [0]>
## 8 <chr [1]> <chr [0]> <chr [0]>
## 9 <chr [1]> <chr [0]> <chr [1]>
## 10 <chr [6]> <chr [1]> <chr [5]>
## # ... with 77 more rows
In the output we can only see the value-type chr
and the number of elements per row within []
. How can we access these columns? Below I am using the function dplyr::pull()
to extract all films
in which Luke Skywalker (= row #1 according to the initial sorting order) appeared.
starwars %>%
slice(1) %>%
pull(films)
## [[1]]
## [1] "The Empire Strikes Back" "Revenge of the Sith"
## [3] "Return of the Jedi" "A New Hope"
## [5] "The Force Awakens"
As the previous output indicated [5]
films
are present in this cell.
The function nest()
moves groups of cells into a list-column of a data frame. Below I’m nesting all columns whose names end with "color"
in a new column nested_colors
.
starwars %>%
nest(nested_colors = ends_with("color")) %>%
select(name, nested_colors)
## # A tibble: 87 x 2
## name nested_colors
## <chr> <list>
## 1 Luke Skywalker <tibble [1 x 3]>
## 2 C-3PO <tibble [1 x 3]>
## 3 R2-D2 <tibble [1 x 3]>
## 4 Darth Vader <tibble [1 x 3]>
## 5 Leia Organa <tibble [1 x 3]>
## 6 Owen Lars <tibble [1 x 3]>
## 7 Beru Whitesun lars <tibble [1 x 3]>
## 8 R5-D4 <tibble [1 x 3]>
## 9 Biggs Darklighter <tibble [1 x 3]>
## 10 Obi-Wan Kenobi <tibble [1 x 3]>
## # ... with 77 more rows
In RStudio we can use the function View()
to inspect the data frame and a click on any cell of the column nested_colors
opens up a new tab with all information related to the respective row.
starwars %>%
nest(nest_colors = ends_with("color")) %>%
select(name, nest_colors) %>%
View()
The function nest()
can also be used with dplyr::group_by()
. Below I’m grouping the starwars
dataset by homeworld
. All characters (rows) with the same homeworld
are grouped in a new dataframe and stored in the column data
. When using View()
to inspect the output, a mouse-click in the data
cell for Tatooine shows you all Star Wars characters with this homeworld
.
starwars %>%
group_by(homeworld) %>%
nest()
## # A tibble: 49 x 2
## # Groups: homeworld [49]
## homeworld data
## <chr> <list>
## 1 Tatooine <tibble [10 x 13]>
## 2 Naboo <tibble [11 x 13]>
## 3 Alderaan <tibble [3 x 13]>
## 4 Stewjon <tibble [1 x 13]>
## 5 Eriadu <tibble [1 x 13]>
## 6 Kashyyyk <tibble [2 x 13]>
## 7 Corellia <tibble [2 x 13]>
## 8 Rodia <tibble [1 x 13]>
## 9 Nal Hutta <tibble [1 x 13]>
## 10 Bestine IV <tibble [1 x 13]>
## # ... with 39 more rows
In this tutorial I explained functions that can be used to create a tibble. You may also apply tidy data concepts with them and store information in list columns. The function tibble::tibble()
for example saves list input as list-columns.
tibble::tibble(
name = c("Luke Skywalker", "Obi-Wan Kenobi"),
starships = list(c("X-wing","Imperial shuttle"),
c("Jedi starfighter","Trade Federation cruiser"))
)
## # A tibble: 2 x 2
## name starships
## <chr> <list>
## 1 Luke Skywalker <chr [2]>
## 2 Obi-Wan Kenobi <chr [2]>
This also works with tibble::tribble()
which makes list-columns when needed.
tibble::tribble(
~name, ~starships,
"Luke Skywalker", c("X-wing","Imperial shuttle"),
"Obi-Wan Kenobi", c("Jedi starfighter","Trade Federation cruiser")
)
## # A tibble: 2 x 2
## name starships
## <chr> <list>
## 1 Luke Skywalker <chr [2]>
## 2 Obi-Wan Kenobi <chr [2]>
At last tibble::enframe()
is able to convert multi-level lists to a tibble with list-columns.
tibble::enframe(
list('Luke Skywalker' = c("X-wing","Imperial shuttle"),
'Obi-Wan Kenobi' = c("Jedi starfighter","Trade Federation cruiser")),
'name', 'starships'
)
## # A tibble: 2 x 2
## name starships
## <chr> <list>
## 1 Luke Skywalker <chr [2]>
## 2 Obi-Wan Kenobi <chr [2]>
The functions dplyr::mutate()
, dplyr::transmute()
and dplyr::summarise()
will also output list-columns if they return a list. Below I compute the sample quantiles of mass
for each homeworld
with function quantile()
. This function returns five values corresponding to the 0%, 25%, 50%, 75% and 100% percentile. With function list()
we can store the returned values in a single row for each homeworld
.
starwars %>%
group_by(homeworld) %>%
summarise(pecentiles = list(quantile(mass, na.rm = TRUE)))
## # A tibble: 49 x 2
## homeworld pecentiles
## <chr> <list>
## 1 Alderaan <dbl [5]>
## 2 Aleen Minor <dbl [5]>
## 3 Bespin <dbl [5]>
## 4 Bestine IV <dbl [5]>
## 5 Cato Neimoidia <dbl [5]>
## 6 Cerea <dbl [5]>
## 7 Champala <dbl [5]>
## 8 Chandrila <dbl [5]>
## 9 Concord Dawn <dbl [5]>
## 10 Corellia <dbl [5]>
## # ... with 39 more rows
The function unnest()
flattens nested columns back to regular columns. It’s the inverse of nest()
. Below I’m unnesting the previously homeworld
-nested starwars
dataset.
starwars_nested %>%
unnest(cols =c(data))
## # A tibble: 87 x 14
## # Groups: homeworld [49]
## homeworld name height mass hair_color skin_color eye_color birth_year sex
## <chr> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Tatooine Luke~ 172 77 blond fair blue 19 male
## 2 Tatooine C-3PO 167 75 <NA> gold yellow 112 none
## 3 Tatooine Dart~ 202 136 none white yellow 41.9 male
## 4 Tatooine Owen~ 178 120 brown, gr~ light blue 52 male
## 5 Tatooine Beru~ 165 75 brown light blue 47 fema~
## 6 Tatooine R5-D4 97 32 <NA> white, red red NA none
## 7 Tatooine Bigg~ 183 84 black light brown 24 male
## 8 Tatooine Anak~ 188 84 blond fair blue 41.9 male
## 9 Tatooine Shmi~ 163 NA black fair brown 72 fema~
## 10 Tatooine Clie~ 183 NA brown fair blue 82 male
## # ... with 77 more rows, and 5 more variables: gender <chr>, species <chr>,
## # films <list>, vehicles <list>, starships <list>
This function turns each element of a list-column into a row. Below I’m applying it to the column starships
of the starwars
dataset.
starwars %>%
select(name, starships) %>%
unnest_longer(col = starships)
## # A tibble: 98 x 2
## name starships
## <chr> <chr>
## 1 Luke Skywalker X-wing
## 2 Luke Skywalker Imperial shuttle
## 3 C-3PO <NA>
## 4 R2-D2 <NA>
## 5 Darth Vader TIE Advanced x1
## 6 Leia Organa <NA>
## 7 Owen Lars <NA>
## 8 Beru Whitesun lars <NA>
## 9 R5-D4 <NA>
## 10 Biggs Darklighter X-wing
## # ... with 88 more rows
See how the dataset is getting longer and has now 173 rows in total? That is because for each element inside a cell of the starships
list-column an additional row is added.
The function unnest_wider()
turns each element of a list-column into a regular column. I apply this function again to the starships
column.
starwars %>%
select(name, starships) %>%
unnest_wider(col = starships)
## # A tibble: 87 x 6
## name ...1 ...2 ...3 ...4 ...5
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Luke Skywalker X-wing Imperial shuttle <NA> <NA> <NA>
## 2 C-3PO <NA> <NA> <NA> <NA> <NA>
## 3 R2-D2 <NA> <NA> <NA> <NA> <NA>
## 4 Darth Vader TIE Advanced x1 <NA> <NA> <NA> <NA>
## 5 Leia Organa <NA> <NA> <NA> <NA> <NA>
## 6 Owen Lars <NA> <NA> <NA> <NA> <NA>
## 7 Beru Whitesun lars <NA> <NA> <NA> <NA> <NA>
## 8 R5-D4 <NA> <NA> <NA> <NA> <NA>
## 9 Biggs Darklighter X-wing <NA> <NA> <NA> <NA>
## 10 Obi-Wan Kenobi Jedi starfighter Trade Federation cruis~ Nabo~ Jedi~ Belb~
## # ... with 77 more rows
Five new columns are created! Can you tell why? It reflects the maximum number of starships
of a character in the dataset (5 list elements in a cell).
Last but not least with function hoist()
we can pull list components out into their own top-level columns. Below I’m creating two new columns representing the first_
and second_
element of the starships
column. The starships
column itself is preserved but each list element according to a specified position is removed. Hence, we can immediately tell if there are any elements left that where not captured by the new columns.
starwars %>%
select(name, starships) %>%
hoist(starships,
first_starship = 1,
second_starship = 2,
.remove = TRUE)
## # A tibble: 87 x 4
## name first_starship second_starship starships
## <chr> <chr> <chr> <list>
## 1 Luke Skywalker X-wing Imperial shuttle <chr [0]>
## 2 C-3PO <NA> <NA> <chr [0]>
## 3 R2-D2 <NA> <NA> <chr [0]>
## 4 Darth Vader TIE Advanced x1 <NA> <chr [0]>
## 5 Leia Organa <NA> <NA> <chr [0]>
## 6 Owen Lars <NA> <NA> <chr [0]>
## 7 Beru Whitesun lars <NA> <NA> <chr [0]>
## 8 R5-D4 <NA> <NA> <chr [0]>
## 9 Biggs Darklighter X-wing <NA> <chr [0]>
## 10 Obi-Wan Kenobi Jedi starfighter Trade Federation cruiser <chr [3]>
## # ... with 77 more rows
Obi-Wan Kenobi for example has three other starships
which we could potentially move to a third_
, fourth_
or fifth_
starships
column.
This section is under construction. See vignette("rectangle")
for more information.
In the dplyr tutorial we also learned about vectorized functions. Unfortunately, these functions don’t work with lists (or list-columns). With the function dplyr::rowwise()
, however, we can group data so that each row is one group, and within the groups, elements of list-columns appear directly (and not as lists of length one). When using rowwise()
, dplyr
functions will seem to apply functions to list-columns in a vectorized fashion.
See the purrr tutorial for more list functions.
At first, I want to apply a function to a list-column and create a new list-column. Below I’m counting the number of characters of each element in the vehicles
column with function nchar()
. In function mutate()
I have to enclose the expression in function list()
since nchar()
is applied to a list which can return more than one value per row.
starwars %>%
rowwise() %>%
mutate(vehicles_nchars = list(nchar(vehicles))) %>%
select(name, vehicles, vehicles_nchars)
## # A tibble: 87 x 3
## # Rowwise:
## name vehicles vehicles_nchars
## <chr> <list> <list>
## 1 Luke Skywalker <chr [2]> <int [2]>
## 2 C-3PO <chr [0]> <int [0]>
## 3 R2-D2 <chr [0]> <int [0]>
## 4 Darth Vader <chr [0]> <int [0]>
## 5 Leia Organa <chr [1]> <int [1]>
## 6 Owen Lars <chr [0]> <int [0]>
## 7 Beru Whitesun lars <chr [0]> <int [0]>
## 8 R5-D4 <chr [0]> <int [0]>
## 9 Biggs Darklighter <chr [0]> <int [0]>
## 10 Obi-Wan Kenobi <chr [1]> <int [1]>
## # ... with 77 more rows
Now let’s take a look at the vehicles_nchars
column with unnest()
.
starwars %>%
rowwise() %>%
mutate(vehicles_nchars = list(nchar(vehicles))) %>%
select(name, vehicles, vehicles_nchars) %>%
unnest(c(vehicles, vehicles_nchars))
## # A tibble: 13 x 3
## name vehicles vehicles_nchars
## <chr> <chr> <int>
## 1 Luke Skywalker Snowspeeder 11
## 2 Luke Skywalker Imperial Speeder Bike 21
## 3 Leia Organa Imperial Speeder Bike 21
## 4 Obi-Wan Kenobi Tribubble bongo 15
## 5 Anakin Skywalker Zephyr-G swoop bike 19
## 6 Anakin Skywalker XJ-6 airspeeder 15
## 7 Chewbacca AT-ST 5
## 8 Wedge Antilles Snowspeeder 11
## 9 Qui-Gon Jinn Tribubble bongo 15
## 10 Darth Maul Sith speeder 12
## 11 Dooku Flitknot speeder 16
## 12 Zam Wesell Koro-2 Exodrive airspeeder 26
## 13 Grievous Tsmeu-6 personal wheel bike 27
Next I want to apply a function to a list-column and create a regular column. The function is.character()
returns one logical value per row as it checks whether the list-element is of type character or not.
starwars %>%
rowwise() %>%
mutate(check_character = is.character(vehicles)) %>%
select(name, vehicles, check_character)
## # A tibble: 87 x 3
## # Rowwise:
## name vehicles check_character
## <chr> <list> <lgl>
## 1 Luke Skywalker <chr [2]> TRUE
## 2 C-3PO <chr [0]> TRUE
## 3 R2-D2 <chr [0]> TRUE
## 4 Darth Vader <chr [0]> TRUE
## 5 Leia Organa <chr [1]> TRUE
## 6 Owen Lars <chr [0]> TRUE
## 7 Beru Whitesun lars <chr [0]> TRUE
## 8 R5-D4 <chr [0]> TRUE
## 9 Biggs Darklighter <chr [0]> TRUE
## 10 Obi-Wan Kenobi <chr [1]> TRUE
## # ... with 77 more rows
It is also possible to collapse multiple list-columns into a single list-column. With function mutate()
I create a new column transport
which should contain both the vehicles
and starships
of a Star Wars character. This can be done with function append()
. Note that it returns a list for each row, hence the column-type must be list.
starwars %>%
rowwise() %>%
mutate(transport = list(append(vehicles, starships))) %>%
select(name, transport)
## # A tibble: 87 x 2
## # Rowwise:
## name transport
## <chr> <list>
## 1 Luke Skywalker <chr [4]>
## 2 C-3PO <chr [0]>
## 3 R2-D2 <chr [0]>
## 4 Darth Vader <chr [1]>
## 5 Leia Organa <chr [1]>
## 6 Owen Lars <chr [0]>
## 7 Beru Whitesun lars <chr [0]>
## 8 R5-D4 <chr [0]>
## 9 Biggs Darklighter <chr [1]>
## 10 Obi-Wan Kenobi <chr [6]>
## # ... with 77 more rows
Let’s take a look at the data with unnest()
.
starwars %>%
rowwise() %>%
mutate(transport = list(append(vehicles, starships))) %>%
select(name, transport) %>%
unnest(transport)
## # A tibble: 44 x 2
## name transport
## <chr> <chr>
## 1 Luke Skywalker Snowspeeder
## 2 Luke Skywalker Imperial Speeder Bike
## 3 Luke Skywalker X-wing
## 4 Luke Skywalker Imperial shuttle
## 5 Darth Vader TIE Advanced x1
## 6 Leia Organa Imperial Speeder Bike
## 7 Biggs Darklighter X-wing
## 8 Obi-Wan Kenobi Tribubble bongo
## 9 Obi-Wan Kenobi Jedi starfighter
## 10 Obi-Wan Kenobi Trade Federation cruiser
## # ... with 34 more rows
Finally, I’m going to apply a function to multiple list-columns. The function length()
returns the length of a vector or list. That is the number of elements per list when applied to the vehicles
and starships
column.
starwars %>%
rowwise() %>%
mutate(n_transports = length(c(vehicles, starships))) %>%
select(name, vehicles, starships, n_transports)
## # A tibble: 87 x 4
## # Rowwise:
## name vehicles starships n_transports
## <chr> <list> <list> <int>
## 1 Luke Skywalker <chr [2]> <chr [2]> 4
## 2 C-3PO <chr [0]> <chr [0]> 0
## 3 R2-D2 <chr [0]> <chr [0]> 0
## 4 Darth Vader <chr [0]> <chr [1]> 1
## 5 Leia Organa <chr [1]> <chr [0]> 1
## 6 Owen Lars <chr [0]> <chr [0]> 0
## 7 Beru Whitesun lars <chr [0]> <chr [0]> 0
## 8 R5-D4 <chr [0]> <chr [0]> 0
## 9 Biggs Darklighter <chr [0]> <chr [1]> 1
## 10 Obi-Wan Kenobi <chr [1]> <chr [5]> 6
## # ... with 77 more rows