Reading time: 5 minutes (958 words)
In this tutorial I am going to explore selection helper functions of the tidyselect
package. As the name suggests these functions are commonly used with the select()
function of the dplyr
package. Read more about this package here. First, let’s load both packages.
library(dplyr)
library(tidyselect)
We will work again with the starwars
dataset, which I will use to show how tidyselect
functions work in practice. Below I print()
the dataset.
starwars %>%
print()
## # A tibble: 87 x 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke Sk~ 172 77 blond fair blue 19 male mascu~
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu~
## 3 R2-D2 96 32 <NA> white, bl~ red 33 none mascu~
## 4 Darth V~ 202 136 none white yellow 41.9 male mascu~
## 5 Leia Or~ 150 49 brown light brown 19 fema~ femin~
## 6 Owen La~ 178 120 brown, gr~ light blue 52 male mascu~
## 7 Beru Wh~ 165 75 brown light blue 47 fema~ femin~
## 8 R5-D4 97 32 <NA> white, red red NA none mascu~
## 9 Biggs D~ 183 84 black light brown 24 male mascu~
## 10 Obi-Wan~ 182 77 auburn, w~ fair blue-gray 57 male mascu~
## # ... with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## # films <list>, vehicles <list>, starships <list>
This overview shows all helper functions contained in the tidyselect
packages. They will be discussed in the folliwing sections in more detail.
Section | Function | Purpose |
---|---|---|
3 | tidyselect::everything() | Select all variable |
tidyselect::last_col() | Select the last variable | |
4 | tidyselect::starts_with() | Starts with a prefix |
tidyselect::ends_with() | Ends with a suffix | |
tidyselect::contains() | Contains a literal string | |
tidyselect::matches() | Matches a regular expression | |
tidyselect::num_range() | Matches a numerical range like x01, x02, x03 | |
5 | tidyselect::all_of() | If any of the variables is missing, an error is thrown |
tidyselect::any_of() | Doesn’t check for missing variables | |
6 | tidyselect::where() | Select variables with a predicate function (returns TRUE or FALSE) |
These two functions are self-explanatory and straightforward to use.
With function everything()
we can simply select every column in the starwars
dataset.
starwars %>%
select(everything())
## # A tibble: 87 x 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke Sk~ 172 77 blond fair blue 19 male mascu~
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu~
## 3 R2-D2 96 32 <NA> white, bl~ red 33 none mascu~
## 4 Darth V~ 202 136 none white yellow 41.9 male mascu~
## 5 Leia Or~ 150 49 brown light brown 19 fema~ femin~
## 6 Owen La~ 178 120 brown, gr~ light blue 52 male mascu~
## 7 Beru Wh~ 165 75 brown light blue 47 fema~ femin~
## 8 R5-D4 97 32 <NA> white, red red NA none mascu~
## 9 Biggs D~ 183 84 black light brown 24 male mascu~
## 10 Obi-Wan~ 182 77 auburn, w~ fair blue-gray 57 male mascu~
## # ... with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## # films <list>, vehicles <list>, starships <list>
This function lets us select the last column of the starwars
dataset.
starwars %>%
select(last_col())
## # A tibble: 87 x 1
## starships
## <list>
## 1 <chr [2]>
## 2 <chr [0]>
## 3 <chr [0]>
## 4 <chr [1]>
## 5 <chr [0]>
## 6 <chr [0]>
## 7 <chr [0]>
## 8 <chr [0]>
## 9 <chr [1]>
## 10 <chr [5]>
## # ... with 77 more rows
We start with the selection of variables via pattern matching.
The function starts_with()
allows you to select all variables that start with a prefix (= a single letter or string). Below I select()
all columns starting with a lowercase s.
starwars %>%
select(starts_with("s"))
## # A tibble: 87 x 4
## skin_color sex species starships
## <chr> <chr> <chr> <list>
## 1 fair male Human <chr [2]>
## 2 gold none Droid <chr [0]>
## 3 white, blue none Droid <chr [0]>
## 4 white male Human <chr [1]>
## 5 light female Human <chr [0]>
## 6 light male Human <chr [0]>
## 7 light female Human <chr [0]>
## 8 white, red none Droid <chr [0]>
## 9 light male Human <chr [1]>
## 10 fair male Human <chr [5]>
## # ... with 77 more rows
This works the other way around using function ends_with()
. Below I select all variables of the starwars
dataset that end with the suffix color.
starwars %>%
select(ends_with("color"))
## # A tibble: 87 x 3
## hair_color skin_color eye_color
## <chr> <chr> <chr>
## 1 blond fair blue
## 2 <NA> gold yellow
## 3 <NA> white, blue red
## 4 none white yellow
## 5 brown light brown
## 6 brown, grey light blue
## 7 brown light blue
## 8 <NA> white, red red
## 9 black light brown
## 10 auburn, white fair blue-gray
## # ... with 77 more rows
With function contains()
it is possible to match a string between the beginning and end of the variable’s name. Below I select all variables that contain the string ir.
starwars %>%
select(contains("ir"))
## # A tibble: 87 x 2
## hair_color birth_year
## <chr> <dbl>
## 1 blond 19
## 2 <NA> 112
## 3 <NA> 33
## 4 none 41.9
## 5 brown 19
## 6 brown, grey 52
## 7 brown 47
## 8 <NA> NA
## 9 black 24
## 10 auburn, white 57
## # ... with 77 more rows
The function matches()
is more flexible than contains()
when matching strings because it allows for using regular expressions. Below I select all variables that start with a lowercase s or a lowercase h.
starwars %>%
select(matches("^(m|s)"))
## # A tibble: 87 x 5
## mass skin_color sex species starships
## <dbl> <chr> <chr> <chr> <list>
## 1 77 fair male Human <chr [2]>
## 2 75 gold none Droid <chr [0]>
## 3 32 white, blue none Droid <chr [0]>
## 4 136 white male Human <chr [1]>
## 5 49 light female Human <chr [0]>
## 6 120 light male Human <chr [0]>
## 7 75 light female Human <chr [0]>
## 8 32 white, red none Droid <chr [0]>
## 9 84 light male Human <chr [1]>
## 10 77 fair male Human <chr [5]>
## # ... with 77 more rows
Regular expressions offer a wide range of possibilities to carefully match strings that you require for a selection of variable. Below are some examples to get an idea what you can do with them. You’ll find a more comprehensive list in the stringr tutorial.
Regular expression | Purpose |
---|---|
\d |
Matches any digit |
\D |
Matches any non-digit |
^s |
Matches strings beginning with lowercase s |
s$ |
Matches strings ending on lowercase s |
* , + , ? |
Quantifiers |
\. |
Matches a period (special characters need to be escaped by a ) |
. |
Matches any character except line break |
The function num_range()
lets us select variables that share a common string and a sequence of numbers. Below I’ve created three new variables called fav_food1
, fav_food2
and fav_food3
. Two of them I want to select simultaneously with the help of the num_range()
function.
starwars %>%
mutate(
fav_food1 = "Dantooine flapjack",
fav_food2 = "Space waffle",
fav_food3 = "Fromirian roast queg"
) %>%
select(num_range("fav_food", 2:3))
## # A tibble: 87 x 2
## fav_food2 fav_food3
## <chr> <chr>
## 1 Space waffle Fromirian roast queg
## 2 Space waffle Fromirian roast queg
## 3 Space waffle Fromirian roast queg
## 4 Space waffle Fromirian roast queg
## 5 Space waffle Fromirian roast queg
## 6 Space waffle Fromirian roast queg
## 7 Space waffle Fromirian roast queg
## 8 Space waffle Fromirian roast queg
## 9 Space waffle Fromirian roast queg
## 10 Space waffle Fromirian roast queg
## # ... with 77 more rows
When working with large datasets it is common to store certain variables in a vector to have them easily available for analysis. Below I’ve created two vectors which should contain numeric variables of the starwars
dataset. Unfortunately, I’ve made a mistake in num_vars2
by including the column name age which is not a variable of the starwars
dataset.
num_vars1 <- c("height", "mass", "birth_year")
num_vars2 <- c("height", "mass", "birth_year", "age")
With function all_of()
a strict selection on the character vector is performed. When using num_vars1
to select the numeric variables of the starwars
dataset everything works as expected.
starwars %>%
select(all_of(num_vars1))
## # A tibble: 87 x 3
## height mass birth_year
## <int> <dbl> <dbl>
## 1 172 77 19
## 2 167 75 112
## 3 96 32 33
## 4 202 136 41.9
## 5 150 49 19
## 6 178 120 52
## 7 165 75 47
## 8 97 32 NA
## 9 183 84 24
## 10 182 77 57
## # ... with 77 more rows
However, if any of the variables in the character vector is missing, an error is thrown.
starwars %>%
select(all_of(num_vars2))
## Error in `stop_subscript()`:
## ! Can't subset columns that don't exist.
## x Column `age` doesn't exist.
The function any_of()
on the other hand does not check for missing variables, so even num_vars2
is working for the variable selection.
starwars %>%
select(any_of(num_vars2))
## # A tibble: 87 x 3
## height mass birth_year
## <int> <dbl> <dbl>
## 1 172 77 19
## 2 167 75 112
## 3 96 32 33
## 4 202 136 41.9
## 5 150 49 19
## 6 178 120 52
## 7 165 75 47
## 8 97 32 NA
## 9 183 84 24
## 10 182 77 57
## # ... with 77 more rows
This function is also useful to make sure a variable was removed from the dataset as expected because it doesn’t throw an error when used multiple times. Below I remove all numeric variables from the starwars
dataset.
starwars %>%
select(-any_of(num_vars2)) %>%
select(-any_of(num_vars2))
## # A tibble: 87 x 11
## name hair_color skin_color eye_color sex gender homeworld species films
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <lis>
## 1 Luke Sk~ blond fair blue male mascu~ Tatooine Human <chr>
## 2 C-3PO <NA> gold yellow none mascu~ Tatooine Droid <chr>
## 3 R2-D2 <NA> white, bl~ red none mascu~ Naboo Droid <chr>
## 4 Darth V~ none white yellow male mascu~ Tatooine Human <chr>
## 5 Leia Or~ brown light brown fema~ femin~ Alderaan Human <chr>
## 6 Owen La~ brown, gr~ light blue male mascu~ Tatooine Human <chr>
## 7 Beru Wh~ brown light blue fema~ femin~ Tatooine Human <chr>
## 8 R5-D4 <NA> white, red red none mascu~ Tatooine Droid <chr>
## 9 Biggs D~ black light brown male mascu~ Tatooine Human <chr>
## 10 Obi-Wan~ auburn, w~ fair blue-gray male mascu~ Stewjon Human <chr>
## # ... with 77 more rows, and 2 more variables: vehicles <list>,
## # starships <list>
The selection helper where()
is very powerful as different predicate functions can be used inside of it. A predicate function is a function that returns a single TRUE
or FALSE
value. For example is.character()
checks whether a column of the starwars
dataset is of type character.
starwars %>%
pull(name) %>%
is.character()
## [1] TRUE
Can you remember why I’m using pull()
here instead of select()
?. Right, pull()
returns a vector and not a tibble or dataframe.
So is.character()
should work fine in combination with where()
. I am using the purrr-style formula approach here because it becomes more clear that each column .x
is plugged in, in function is.character()
.
starwars %>%
select(where(~is.character(.x)))
## # A tibble: 87 x 8
## name hair_color skin_color eye_color sex gender homeworld species
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Luke Skywalker blond fair blue male mascu~ Tatooine Human
## 2 C-3PO <NA> gold yellow none mascu~ Tatooine Droid
## 3 R2-D2 <NA> white, bl~ red none mascu~ Naboo Droid
## 4 Darth Vader none white yellow male mascu~ Tatooine Human
## 5 Leia Organa brown light brown fema~ femin~ Alderaan Human
## 6 Owen Lars brown, gr~ light blue male mascu~ Tatooine Human
## 7 Beru Whitesun~ brown light blue fema~ femin~ Tatooine Human
## 8 R5-D4 <NA> white, red red none mascu~ Tatooine Droid
## 9 Biggs Darklig~ black light brown male mascu~ Tatooine Human
## 10 Obi-Wan Kenobi auburn, w~ fair blue-gray male mascu~ Stewjon Human
## # ... with 77 more rows
Of course there are more possibilities than just checking the data type of a column. Below I want so select all columns of type character which don’t have any missing values.
starwars %>%
select(where(~is.character(.x) & !anyNA(.x)))
## # A tibble: 87 x 3
## name skin_color eye_color
## <chr> <chr> <chr>
## 1 Luke Skywalker fair blue
## 2 C-3PO gold yellow
## 3 R2-D2 white, blue red
## 4 Darth Vader white yellow
## 5 Leia Organa light brown
## 6 Owen Lars light blue
## 7 Beru Whitesun lars light blue
## 8 R5-D4 white, red red
## 9 Biggs Darklighter light brown
## 10 Obi-Wan Kenobi fair blue-gray
## # ... with 77 more rows