Reading time: 5 minutes (958 words)


1. Introduction

In this tutorial I am going to explore selection helper functions of the tidyselect package. As the name suggests these functions are commonly used with the select() function of the dplyr package. Read more about this package here. First, let’s load both packages.

library(dplyr)
library(tidyselect)

We will work again with the starwars dataset, which I will use to show how tidyselect functions work in practice. Below I print() the dataset.

starwars %>%
  print()
## # A tibble: 87 x 14
##    name     height  mass hair_color skin_color eye_color birth_year sex   gender
##    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Luke Sk~    172    77 blond      fair       blue            19   male  mascu~
##  2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu~
##  3 R2-D2        96    32 <NA>       white, bl~ red             33   none  mascu~
##  4 Darth V~    202   136 none       white      yellow          41.9 male  mascu~
##  5 Leia Or~    150    49 brown      light      brown           19   fema~ femin~
##  6 Owen La~    178   120 brown, gr~ light      blue            52   male  mascu~
##  7 Beru Wh~    165    75 brown      light      blue            47   fema~ femin~
##  8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu~
##  9 Biggs D~    183    84 black      light      brown           24   male  mascu~
## 10 Obi-Wan~    182    77 auburn, w~ fair       blue-gray       57   male  mascu~
## # ... with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>


2. Overview

This overview shows all helper functions contained in the tidyselect packages. They will be discussed in the folliwing sections in more detail.

Section Function Purpose
3 tidyselect::everything() Select all variable
tidyselect::last_col() Select the last variable
4 tidyselect::starts_with() Starts with a prefix
tidyselect::ends_with() Ends with a suffix
tidyselect::contains() Contains a literal string
tidyselect::matches() Matches a regular expression
tidyselect::num_range() Matches a numerical range like x01, x02, x03
5 tidyselect::all_of() If any of the variables is missing, an error is thrown
tidyselect::any_of() Doesn’t check for missing variables
6 tidyselect::where() Select variables with a predicate function (returns TRUE or FALSE)

3. Simple selection

These two functions are self-explanatory and straightforward to use.

everything()

With function everything() we can simply select every column in the starwars dataset.

starwars %>%
  select(everything())
## # A tibble: 87 x 14
##    name     height  mass hair_color skin_color eye_color birth_year sex   gender
##    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Luke Sk~    172    77 blond      fair       blue            19   male  mascu~
##  2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu~
##  3 R2-D2        96    32 <NA>       white, bl~ red             33   none  mascu~
##  4 Darth V~    202   136 none       white      yellow          41.9 male  mascu~
##  5 Leia Or~    150    49 brown      light      brown           19   fema~ femin~
##  6 Owen La~    178   120 brown, gr~ light      blue            52   male  mascu~
##  7 Beru Wh~    165    75 brown      light      blue            47   fema~ femin~
##  8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu~
##  9 Biggs D~    183    84 black      light      brown           24   male  mascu~
## 10 Obi-Wan~    182    77 auburn, w~ fair       blue-gray       57   male  mascu~
## # ... with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>

last_col()

This function lets us select the last column of the starwars dataset.

starwars %>%
  select(last_col())
## # A tibble: 87 x 1
##    starships
##    <list>   
##  1 <chr [2]>
##  2 <chr [0]>
##  3 <chr [0]>
##  4 <chr [1]>
##  5 <chr [0]>
##  6 <chr [0]>
##  7 <chr [0]>
##  8 <chr [0]>
##  9 <chr [1]>
## 10 <chr [5]>
## # ... with 77 more rows


4. Selection via pattern matching

We start with the selection of variables via pattern matching.

starts_with()

The function starts_with() allows you to select all variables that start with a prefix (= a single letter or string). Below I select() all columns starting with a lowercase s.

starwars %>%
  select(starts_with("s"))
## # A tibble: 87 x 4
##    skin_color  sex    species starships
##    <chr>       <chr>  <chr>   <list>   
##  1 fair        male   Human   <chr [2]>
##  2 gold        none   Droid   <chr [0]>
##  3 white, blue none   Droid   <chr [0]>
##  4 white       male   Human   <chr [1]>
##  5 light       female Human   <chr [0]>
##  6 light       male   Human   <chr [0]>
##  7 light       female Human   <chr [0]>
##  8 white, red  none   Droid   <chr [0]>
##  9 light       male   Human   <chr [1]>
## 10 fair        male   Human   <chr [5]>
## # ... with 77 more rows

ends_with()

This works the other way around using function ends_with(). Below I select all variables of the starwars dataset that end with the suffix color.

starwars %>%
  select(ends_with("color"))
## # A tibble: 87 x 3
##    hair_color    skin_color  eye_color
##    <chr>         <chr>       <chr>    
##  1 blond         fair        blue     
##  2 <NA>          gold        yellow   
##  3 <NA>          white, blue red      
##  4 none          white       yellow   
##  5 brown         light       brown    
##  6 brown, grey   light       blue     
##  7 brown         light       blue     
##  8 <NA>          white, red  red      
##  9 black         light       brown    
## 10 auburn, white fair        blue-gray
## # ... with 77 more rows

contains()

With function contains() it is possible to match a string between the beginning and end of the variable’s name. Below I select all variables that contain the string ir.

starwars %>%
  select(contains("ir"))
## # A tibble: 87 x 2
##    hair_color    birth_year
##    <chr>              <dbl>
##  1 blond               19  
##  2 <NA>               112  
##  3 <NA>                33  
##  4 none                41.9
##  5 brown               19  
##  6 brown, grey         52  
##  7 brown               47  
##  8 <NA>                NA  
##  9 black               24  
## 10 auburn, white       57  
## # ... with 77 more rows

matches()

The function matches() is more flexible than contains() when matching strings because it allows for using regular expressions. Below I select all variables that start with a lowercase s or a lowercase h.

starwars %>%
  select(matches("^(m|s)"))
## # A tibble: 87 x 5
##     mass skin_color  sex    species starships
##    <dbl> <chr>       <chr>  <chr>   <list>   
##  1    77 fair        male   Human   <chr [2]>
##  2    75 gold        none   Droid   <chr [0]>
##  3    32 white, blue none   Droid   <chr [0]>
##  4   136 white       male   Human   <chr [1]>
##  5    49 light       female Human   <chr [0]>
##  6   120 light       male   Human   <chr [0]>
##  7    75 light       female Human   <chr [0]>
##  8    32 white, red  none   Droid   <chr [0]>
##  9    84 light       male   Human   <chr [1]>
## 10    77 fair        male   Human   <chr [5]>
## # ... with 77 more rows

Digression: regular expressions

Regular expressions offer a wide range of possibilities to carefully match strings that you require for a selection of variable. Below are some examples to get an idea what you can do with them. You’ll find a more comprehensive list in the stringr tutorial.

Regular expression Purpose
\d Matches any digit
\D Matches any non-digit
^s Matches strings beginning with lowercase s
s$ Matches strings ending on lowercase s
*, +, ? Quantifiers
\. Matches a period (special characters need to be escaped by a )
. Matches any character except line break

num_range()

The function num_range() lets us select variables that share a common string and a sequence of numbers. Below I’ve created three new variables called fav_food1, fav_food2 and fav_food3. Two of them I want to select simultaneously with the help of the num_range() function.

starwars %>%
  mutate(
    fav_food1 = "Dantooine flapjack",
    fav_food2 = "Space waffle",
    fav_food3 = "Fromirian roast queg"
    ) %>%
  select(num_range("fav_food", 2:3))
## # A tibble: 87 x 2
##    fav_food2    fav_food3           
##    <chr>        <chr>               
##  1 Space waffle Fromirian roast queg
##  2 Space waffle Fromirian roast queg
##  3 Space waffle Fromirian roast queg
##  4 Space waffle Fromirian roast queg
##  5 Space waffle Fromirian roast queg
##  6 Space waffle Fromirian roast queg
##  7 Space waffle Fromirian roast queg
##  8 Space waffle Fromirian roast queg
##  9 Space waffle Fromirian roast queg
## 10 Space waffle Fromirian roast queg
## # ... with 77 more rows


5. Selection via character vectors

When working with large datasets it is common to store certain variables in a vector to have them easily available for analysis. Below I’ve created two vectors which should contain numeric variables of the starwars dataset. Unfortunately, I’ve made a mistake in num_vars2 by including the column name age which is not a variable of the starwars dataset.

num_vars1 <- c("height", "mass", "birth_year")
num_vars2 <- c("height", "mass", "birth_year", "age")

all_of()

With function all_of() a strict selection on the character vector is performed. When using num_vars1 to select the numeric variables of the starwars dataset everything works as expected.

starwars %>%
  select(all_of(num_vars1))
## # A tibble: 87 x 3
##    height  mass birth_year
##     <int> <dbl>      <dbl>
##  1    172    77       19  
##  2    167    75      112  
##  3     96    32       33  
##  4    202   136       41.9
##  5    150    49       19  
##  6    178   120       52  
##  7    165    75       47  
##  8     97    32       NA  
##  9    183    84       24  
## 10    182    77       57  
## # ... with 77 more rows

However, if any of the variables in the character vector is missing, an error is thrown.

starwars %>%
  select(all_of(num_vars2))
## Error in `stop_subscript()`:
## ! Can't subset columns that don't exist.
## x Column `age` doesn't exist.

any_of()

The function any_of() on the other hand does not check for missing variables, so even num_vars2 is working for the variable selection.

starwars %>%
  select(any_of(num_vars2))
## # A tibble: 87 x 3
##    height  mass birth_year
##     <int> <dbl>      <dbl>
##  1    172    77       19  
##  2    167    75      112  
##  3     96    32       33  
##  4    202   136       41.9
##  5    150    49       19  
##  6    178   120       52  
##  7    165    75       47  
##  8     97    32       NA  
##  9    183    84       24  
## 10    182    77       57  
## # ... with 77 more rows

This function is also useful to make sure a variable was removed from the dataset as expected because it doesn’t throw an error when used multiple times. Below I remove all numeric variables from the starwars dataset.

starwars %>%
  select(-any_of(num_vars2)) %>%
  select(-any_of(num_vars2))
## # A tibble: 87 x 11
##    name     hair_color skin_color eye_color sex   gender homeworld species films
##    <chr>    <chr>      <chr>      <chr>     <chr> <chr>  <chr>     <chr>   <lis>
##  1 Luke Sk~ blond      fair       blue      male  mascu~ Tatooine  Human   <chr>
##  2 C-3PO    <NA>       gold       yellow    none  mascu~ Tatooine  Droid   <chr>
##  3 R2-D2    <NA>       white, bl~ red       none  mascu~ Naboo     Droid   <chr>
##  4 Darth V~ none       white      yellow    male  mascu~ Tatooine  Human   <chr>
##  5 Leia Or~ brown      light      brown     fema~ femin~ Alderaan  Human   <chr>
##  6 Owen La~ brown, gr~ light      blue      male  mascu~ Tatooine  Human   <chr>
##  7 Beru Wh~ brown      light      blue      fema~ femin~ Tatooine  Human   <chr>
##  8 R5-D4    <NA>       white, red red       none  mascu~ Tatooine  Droid   <chr>
##  9 Biggs D~ black      light      brown     male  mascu~ Tatooine  Human   <chr>
## 10 Obi-Wan~ auburn, w~ fair       blue-gray male  mascu~ Stewjon   Human   <chr>
## # ... with 77 more rows, and 2 more variables: vehicles <list>,
## #   starships <list>


6. Selection via functions

The selection helper where() is very powerful as different predicate functions can be used inside of it. A predicate function is a function that returns a single TRUE or FALSE value. For example is.character() checks whether a column of the starwars dataset is of type character.

starwars %>%
  pull(name) %>%
  is.character()
## [1] TRUE

Can you remember why I’m using pull() here instead of select()?. Right, pull() returns a vector and not a tibble or dataframe.

where()

So is.character() should work fine in combination with where(). I am using the purrr-style formula approach here because it becomes more clear that each column .x is plugged in, in function is.character().

starwars %>%
  select(where(~is.character(.x)))
## # A tibble: 87 x 8
##    name           hair_color skin_color eye_color sex   gender homeworld species
##    <chr>          <chr>      <chr>      <chr>     <chr> <chr>  <chr>     <chr>  
##  1 Luke Skywalker blond      fair       blue      male  mascu~ Tatooine  Human  
##  2 C-3PO          <NA>       gold       yellow    none  mascu~ Tatooine  Droid  
##  3 R2-D2          <NA>       white, bl~ red       none  mascu~ Naboo     Droid  
##  4 Darth Vader    none       white      yellow    male  mascu~ Tatooine  Human  
##  5 Leia Organa    brown      light      brown     fema~ femin~ Alderaan  Human  
##  6 Owen Lars      brown, gr~ light      blue      male  mascu~ Tatooine  Human  
##  7 Beru Whitesun~ brown      light      blue      fema~ femin~ Tatooine  Human  
##  8 R5-D4          <NA>       white, red red       none  mascu~ Tatooine  Droid  
##  9 Biggs Darklig~ black      light      brown     male  mascu~ Tatooine  Human  
## 10 Obi-Wan Kenobi auburn, w~ fair       blue-gray male  mascu~ Stewjon   Human  
## # ... with 77 more rows

Of course there are more possibilities than just checking the data type of a column. Below I want so select all columns of type character which don’t have any missing values.

starwars %>%
  select(where(~is.character(.x) & !anyNA(.x)))
## # A tibble: 87 x 3
##    name               skin_color  eye_color
##    <chr>              <chr>       <chr>    
##  1 Luke Skywalker     fair        blue     
##  2 C-3PO              gold        yellow   
##  3 R2-D2              white, blue red      
##  4 Darth Vader        white       yellow   
##  5 Leia Organa        light       brown    
##  6 Owen Lars          light       blue     
##  7 Beru Whitesun lars light       blue     
##  8 R5-D4              white, red  red      
##  9 Biggs Darklighter  light       brown    
## 10 Obi-Wan Kenobi     fair        blue-gray
## # ... with 77 more rows

References

Henry, Lionel, and Hadley Wickham. 2020. Tidyselect: Select from a Set of Strings. https://CRAN.R-project.org/package=tidyselect.