R Tutorial: dplyr package

Reading time: 19 minutes (3,780 words)

1. Introduction

In this tutorial I present functions of the dplyr package and their practical application. To start with, install the package and load it as shown below.

library(dplyr)

The dplyr package contains a dataset called starwars, which I will use hereafter to demonstrate the functions. Let us take a preliminary look at the data with print().

print(starwars)

## # A tibble: 87 x 14
##    name     height  mass hair_color skin_color eye_color birth_year sex   gender
##    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Luke Sk~    172    77 blond      fair       blue            19   male  mascu~
##  2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu~
##  3 R2-D2        96    32 <NA>       white, bl~ red             33   none  mascu~
##  4 Darth V~    202   136 none       white      yellow          41.9 male  mascu~
##  5 Leia Or~    150    49 brown      light      brown           19   fema~ femin~
##  6 Owen La~    178   120 brown, gr~ light      blue            52   male  mascu~
##  7 Beru Wh~    165    75 brown      light      blue            47   fema~ femin~
##  8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu~
##  9 Biggs D~    183    84 black      light      brown           24   male  mascu~
## 10 Obi-Wan~    182    77 auburn, w~ fair       blue-gray       57   male  mascu~
## # ... with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>

Tibbles

This dataset is a so called tibble which is simply a dataframe with some quality of life enhancements. Note that when printing starwars to the console, only the first 10 rows are displayed instead of all rows when using print() with a conventional dataframe. More information about the tibble can be found in this tutorial.

Cases and variables

Every row in the starwars dataset corresponds to a character of the movie - we will refer to the rows as cases. There are 87 cases in total. Each column holds information about a Star Wars character like its name, height or mass. The columns are referred to as variables. When a variable is saved in its own column and each case is saved in its own row the dataset is said to be tidy. Tidy data complements R’s vectorized operations and is also the required input for many functions of the dplyr package.

Pipe operator

Throughout this tutorial you will often encounter the pipe operator (%>%), which is in fact not a function of dplyr but the magrittr package. It passes an object forward into a function and makes code a lot more readable. Printing the starwars dataset with the pipe operator looks like this:

starwars %>%
  print()

## # A tibble: 87 x 14
##    name     height  mass hair_color skin_color eye_color birth_year sex   gender
##    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Luke Sk~    172    77 blond      fair       blue            19   male  mascu~
##  2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu~
##  3 R2-D2        96    32 <NA>       white, bl~ red             33   none  mascu~
##  4 Darth V~    202   136 none       white      yellow          41.9 male  mascu~
##  5 Leia Or~    150    49 brown      light      brown           19   fema~ femin~
##  6 Owen La~    178   120 brown, gr~ light      blue            52   male  mascu~
##  7 Beru Wh~    165    75 brown      light      blue            47   fema~ femin~
##  8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu~
##  9 Biggs D~    183    84 black      light      brown           24   male  mascu~
## 10 Obi-Wan~    182    77 auburn, w~ fair       blue-gray       57   male  mascu~
## # ... with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>

The output is the same as using print(starwars).

2. Manipulate variables

Initially, we will deal with the manipulation of variables. This affects the columns of our dataset while the cases (rows) remain unchanged.

select()

With the select() function I extract a character’s name and sex from the starwars dataset. Simply list the columns you want to keep inside the functions and separate them by ,.

starwars %>%
  select(name, height, mass)

## # A tibble: 87 x 3
##    name               height  mass
##    <chr>               <int> <dbl>
##  1 Luke Skywalker        172    77
##  2 C-3PO                 167    75
##  3 R2-D2                  96    32
##  4 Darth Vader           202   136
##  5 Leia Organa           150    49
##  6 Owen Lars             178   120
##  7 Beru Whitesun lars    165    75
##  8 R5-D4                  97    32
##  9 Biggs Darklighter     183    84
## 10 Obi-Wan Kenobi        182    77
## # ... with 77 more rows

pull()

The function pull() extracts a single column from a dataset as a vector. It is very helpful when working with other functions that require a vector as input instead of a tibble or dataframe.

starwars %>%
  pull(height)

##  [1] 172 167  96 202 150 178 165  97 183 182 188 180 228 180 173 175 170 180  66
## [20] 170 183 200 190 177 175 180 150  NA  88 160 193 191 170 196 224 206 183 137
## [39] 112 183 163 175 180 178  94 122 163 188 198 196 171 184 188 264 188 196 185
## [58] 157 183 183 170 166 165 193 191 183 168 198 229 213 167  79  96 193 191 178
## [77] 216 234 188 178 206  NA  NA  NA  NA  NA 165

Why can’t we use select() to do this? You will of course get the same column and even the same rows of the dataset but R treats both outputs differently. When using select() the output is still a tibble. Let’s verify this using the class() function:

starwars %>%
  select(height) %>%
  class()

## [1] "tbl_df"     "tbl"        "data.frame"

starwars %>%
  pull(height) %>%
  class()

## [1] "integer"

rename()

You might also want to change a variable’s name and then use it in a subsequent select() function. Below I rename() the mass variable to weight.

starwars %>%
  rename(weight = "mass") %>%
  select(name, height, weight)

## # A tibble: 87 x 3
##    name               height weight
##    <chr>               <int>  <dbl>
##  1 Luke Skywalker        172     77
##  2 C-3PO                 167     75
##  3 R2-D2                  96     32
##  4 Darth Vader           202    136
##  5 Leia Organa           150     49
##  6 Owen Lars             178    120
##  7 Beru Whitesun lars    165     75
##  8 R5-D4                  97     32
##  9 Biggs Darklighter     183     84
## 10 Obi-Wan Kenobi        182     77
## # ... with 77 more rows

relocate()

Don’t like the sequence of columns in the starwars dataset? With relocate() you might change their position. Below I move the species variable before the variable name.

starwars %>%
  relocate(species, .before = name)

## # A tibble: 87 x 14
##    species name    height  mass hair_color skin_color eye_color birth_year sex  
##    <chr>   <chr>    <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>
##  1 Human   Luke S~    172    77 blond      fair       blue            19   male 
##  2 Droid   C-3PO      167    75 <NA>       gold       yellow         112   none 
##  3 Droid   R2-D2       96    32 <NA>       white, bl~ red             33   none 
##  4 Human   Darth ~    202   136 none       white      yellow          41.9 male 
##  5 Human   Leia O~    150    49 brown      light      brown           19   fema~
##  6 Human   Owen L~    178   120 brown, gr~ light      blue            52   male 
##  7 Human   Beru W~    165    75 brown      light      blue            47   fema~
##  8 Droid   R5-D4       97    32 <NA>       white, red red             NA   none 
##  9 Human   Biggs ~    183    84 black      light      brown           24   male 
## 10 Human   Obi-Wa~    182    77 auburn, w~ fair       blue-gray       57   male 
## # ... with 77 more rows, and 5 more variables: gender <chr>, homeworld <chr>,
## #   films <list>, vehicles <list>, starships <list>

Digression: selection helpers

There are many ways to reference variables inside the select() function. For example, when putting the - opreator before one or more variables they will be dropped from the dataset. On the other hand, a sequence of variables can be selected with a : (which means from:to).

In the tidyselect package are many more of these selection helpers which are discussed in this tutorial. Below I included a table to get an idea how to use them.

Function	Purpose
tidyselect::everything()	Select all variables
tidyselect::last_col()	Select the last variable
tidyselect::starts_with()	Starts with a prefix
tidyselect::ends_with()	Ends with a suffix
tidyselect::contains()	Contains a literal string
tidyselect::matches()	Matches a regular expression
tidyselect::num_range()	Matches a numerical range like x01, x02, x03
tidyselect::all_of()	If any of the variables is missing, an error is thrown
tidyselect::any_of()	Doesn’t check for missing variables
tidyselect::where()	Select variables with a predicate function (returns TRUE or FALSE)

3. Add new variables

Instead of manipulating the dataset’s existing variables we can also creating new columns. In this section vectorized functions play an important role. They take vectors as input and return vectors of the same length as output.

3.1 Add a single variable

mutate()

This function generates a new variable based on a function that uses existing columns of the dataset or some other (external) supplied vector of values. When we compute the avgWeight of the Star Wars characters with mutate(), the mean() value of mass is added to each case of the data!

starwars %>%
  mutate(avgWeight = mean(mass, na.rm = TRUE)) %>%
  select(name, avgWeight, mass)

## # A tibble: 87 x 3
##    name               avgWeight  mass
##    <chr>                  <dbl> <dbl>
##  1 Luke Skywalker          97.3    77
##  2 C-3PO                   97.3    75
##  3 R2-D2                   97.3    32
##  4 Darth Vader             97.3   136
##  5 Leia Organa             97.3    49
##  6 Owen Lars               97.3   120
##  7 Beru Whitesun lars      97.3    75
##  8 R5-D4                   97.3    32
##  9 Biggs Darklighter       97.3    84
## 10 Obi-Wan Kenobi          97.3    77
## # ... with 77 more rows

Now each row shows the mass of a Star Wars character but also the avgWeight of all characters (including itself). Note the use of argument na.rm = in function mean() which removes NA values when set to TRUE.

transmute()

While mutate() adds new variables and preserves all existing ones the function transmute() adds new variables and drops the existing ones.

starwars %>%
  transmute(avgWeight = mean(mass, na.rm = TRUE))

## # A tibble: 87 x 1
##    avgWeight
##        <dbl>
##  1      97.3
##  2      97.3
##  3      97.3
##  4      97.3
##  5      97.3
##  6      97.3
##  7      97.3
##  8      97.3
##  9      97.3
## 10      97.3
## # ... with 77 more rows

if_else()

With function if_else() we can specify a single condition that assigns the new variable a value if it evaluates to TRUE or else if it evaluates to FALSE. Below I check whether a case has a missing value in the mass variable with function is.na(). If so, I assign the value 0 instead of NA. If not I keep the given value of mass.

starwars %>%
  mutate(
    mass_mod = if_else(
      condition = is.na(mass), 
      true = 0, 
      false = mass
      )
    ) %>%
  select(name, mass, mass_mod)

## # A tibble: 87 x 3
##    name                mass mass_mod
##    <chr>              <dbl>    <dbl>
##  1 Luke Skywalker        77       77
##  2 C-3PO                 75       75
##  3 R2-D2                 32       32
##  4 Darth Vader          136      136
##  5 Leia Organa           49       49
##  6 Owen Lars            120      120
##  7 Beru Whitesun lars    75       75
##  8 R5-D4                 32       32
##  9 Biggs Darklighter     84       84
## 10 Obi-Wan Kenobi        77       77
## # ... with 77 more rows

case_when()

The case_when() function extends if_else() by allowing for multiple conditions. This is very useful when working with categorical/factor variables. Below I am creating a new variable rating, which represents the result of a survey on the popularity of the Star Wars characters among fans of the movies. For each case where a condition specified inside case_when() evaluates to TRUE the value after the ~ is stored in the new variable rating. The arguments are evaluated in order, so the conditions must be specified from the most specific to the most general. With TRUE you can assign a value to all remaining cases where no condition was met - otherwise these cases will be assigned with NA.

starwars %>%
  mutate(rating =
           case_when(
             name %in% c("Chewbacca", "Yoda", "C-3PO") ~ "badass",
             name %in% c("Han Solo", "R2-D2") ~ "quite cool",
             mass > 1000 ~ "wicked",
             TRUE ~ "indifferent"
             )
         ) %>%
  select(name, rating)

## # A tibble: 87 x 2
##    name               rating     
##    <chr>              <chr>      
##  1 Luke Skywalker     indifferent
##  2 C-3PO              badass     
##  3 R2-D2              quite cool 
##  4 Darth Vader        indifferent
##  5 Leia Organa        indifferent
##  6 Owen Lars          indifferent
##  7 Beru Whitesun lars indifferent
##  8 R5-D4              indifferent
##  9 Biggs Darklighter  indifferent
## 10 Obi-Wan Kenobi     indifferent
## # ... with 77 more rows

recode()

This is a vectorised version of switch() that replaces numeric values based on their position or their name, and character or factor values only by their name. Note that unlike in rename() or mutate() the old value of the variable comes first and then the new one. Hence, this function is not fully optimised for the dplyr package and might be altered in the future. Below I am recoding values of the variable sex (hermaphroditic and none) to take on the value other.

starwars %>%
  mutate(
    sex = recode_factor(
      sex,
      hermaphroditic = "other",
      none = "other"
      )
    ) %>%
  select(name, sex)

## # A tibble: 87 x 2
##    name               sex   
##    <chr>              <fct> 
##  1 Luke Skywalker     male  
##  2 C-3PO              other 
##  3 R2-D2              other 
##  4 Darth Vader        male  
##  5 Leia Organa        female
##  6 Owen Lars          male  
##  7 Beru Whitesun lars female
##  8 R5-D4              other 
##  9 Biggs Darklighter  male  
## 10 Obi-Wan Kenobi     male  
## # ... with 77 more rows

Digression: vectorized functions

The dplyr package contains many vectorized functions that take a vector of values as input and return another vector of values. Below I included a table of these functions and their possible field of application.

Class	Function	Purpose
Offset	dplyr::lead()	Computes leading values
	dplyr::lagged()	Computes lagged values
Ranking	dplyr::row_number()	Row number of the observation (sorting order)
	dplyr::min_rank()	Rank of each observation
	dplyr::dense_rank()	Rank of each observation (without gaps)
	dplyr::percent_rank()	Rank rescaled between 0 and 1
	dplyr::ntile()	Rough rank, which breaks the input vector into n buckets
	dplyr::cume_dist()	Proportion of all values <= current rank
Cumulative aggregate	dplyr::cumall()	Cumulative all()
	dplyr::cumany()	Cumulative any()
	base::cummax()	Cumulative max()
	dplyr::cummean()	Cumulative mean()
	base::cummin()	cumulative min()
	base::cumprod()	cumulative prod()
	base::cumsum()	cumulative sum()
Math	+, - , *, /, ^, %/%, %%	Arithmetic operators
	log(), log2(), log10()	Logarithm
	<, <=, >, >=, !=, ==	Logical comparisons
	dplyr::between()	x >= left & x <= right
	dplyr::near()	Safe comparison for floating point numbers
Misc	dplyr::coalesce()	Use a single value to replace all missing values, Match together a complete vector from missing pieces
	dplyr::na_if()	Convert an annoying value to NA
	tidyr::replace_na()	Replace NA values with another value

3.2 Add multiple variables

mutate()

With mutate() we can also add multiple new variables as they just have to be separated by a ,.

starwars %>%
  mutate(avgWeight = mean(mass, na.rm = TRUE),
         avgHeight = mean(height, na.rm = TRUE)) %>%
  select(name, height, mass, avgHeight, avgWeight)

## # A tibble: 87 x 5
##    name               height  mass avgHeight avgWeight
##    <chr>               <int> <dbl>     <dbl>     <dbl>
##  1 Luke Skywalker        172    77      174.      97.3
##  2 C-3PO                 167    75      174.      97.3
##  3 R2-D2                  96    32      174.      97.3
##  4 Darth Vader           202   136      174.      97.3
##  5 Leia Organa           150    49      174.      97.3
##  6 Owen Lars             178   120      174.      97.3
##  7 Beru Whitesun lars    165    75      174.      97.3
##  8 R5-D4                  97    32      174.      97.3
##  9 Biggs Darklighter     183    84      174.      97.3
## 10 Obi-Wan Kenobi        182    77      174.      97.3
## # ... with 77 more rows

across()

The more interesting question is whether we can apply a vectorized function to multiple columns simultaneously. Suppose we want to replace the NA values with "unknown" in all columns of type character. The relevant columns (.cols =) are selected with the selection helper where(). To replace the NA values with "unknown" I am using if_else(). To be on the safe side I don’t want to overwrite the original columns. With the argument .names = each modified column is added to the starwars dataset with a new name. We can access the original column names via {col}.

starwars %>%
  mutate(
    across(
      .cols = where(is.character),
      .fns = ~if_else(condition = is.na(.x), 
                      true = "unknown", 
                      false = as.character(.x)),
      .names = "new_{col}"
    )
  ) %>%
  select(name, hair_color, new_hair_color)

## # A tibble: 87 x 3
##    name               hair_color    new_hair_color
##    <chr>              <chr>         <chr>         
##  1 Luke Skywalker     blond         blond         
##  2 C-3PO              <NA>          unknown       
##  3 R2-D2              <NA>          unknown       
##  4 Darth Vader        none          none          
##  5 Leia Organa        brown         brown         
##  6 Owen Lars          brown, grey   brown, grey   
##  7 Beru Whitesun lars brown         brown         
##  8 R5-D4              <NA>          unknown       
##  9 Biggs Darklighter  black         black         
## 10 Obi-Wan Kenobi     auburn, white auburn, white 
## # ... with 77 more rows

All characters with missing hair_color now have a new_hair_color of value "unknown". The same applies to the variables species and homeworld, which contain NA values.

4. Manipulate cases

After we’re done with manipulating the columns of a dataset let us take into account the row dimension of the starwars dataset. Although we also have to use the variables (columns) of our dataset in order to access the rows are, they are not altered!

filter()

Below I only keep Star Wars characters with the hair_color blond. This is done by usin the function filter().

starwars %>%
  filter(hair_color == "blond")

## # A tibble: 3 x 14
##   name      height  mass hair_color skin_color eye_color birth_year sex   gender
##   <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
## 1 Luke Sky~    172    77 blond      fair       blue            19   male  mascu~
## 2 Anakin S~    188    84 blond      fair       blue            41.9 male  mascu~
## 3 Finis Va~    170    NA blond      fair       blue            91   male  mascu~
## # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

slice()

This function allows you to select, remove, and duplicate rows of a dataset. An integer value must be supplied to it, which indexes the row in the dataset. Below I select rows 1 to 5 of the dataset. Note that the sorting of the dataset influences the output of slice(). Remember that the : operator means from row 1 to row 5.

starwars %>%
  slice(1:5)

## # A tibble: 5 x 14
##   name      height  mass hair_color skin_color eye_color birth_year sex   gender
##   <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
## 1 Luke Sky~    172    77 blond      fair       blue            19   male  mascu~
## 2 C-3PO        167    75 <NA>       gold       yellow         112   none  mascu~
## 3 R2-D2         96    32 <NA>       white, bl~ red             33   none  mascu~
## 4 Darth Va~    202   136 none       white      yellow          41.9 male  mascu~
## 5 Leia Org~    150    49 brown      light      brown           19   fema~ femin~
## # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

With functions slice_min() or slice_max() we can select the top or bottom n = values after the dataset is ordered by a variable. Below I select the five tallest characters in the starwars dataset.

starwars %>%
  slice_max(order_by = height, n = 5)

## # A tibble: 5 x 14
##   name      height  mass hair_color skin_color eye_color birth_year sex   gender
##   <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
## 1 Yarael P~    264    NA none       white      yellow            NA male  mascu~
## 2 Tarfful      234   136 brown      brown      blue              NA male  mascu~
## 3 Lama Su      229    88 none       grey       black             NA male  mascu~
## 4 Chewbacca    228   112 brown      unknown    blue             200 male  mascu~
## 5 Roos Tar~    224    82 none       grey       orange            NA male  mascu~
## # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

Note that instead of an explicit number of rows you might also take a fraction (prop =) of the data.

slice_sample()

With function slice_sample() we can randomly select rows of the dataset. The prop = argument handles the fraction of the data which is selected. Setting replace = TRUE allows that a row can be selected multiple times. Always consider using set.seed() when working with random function components.

set.seed(2)
starwars %>%
  slice_sample(prop = 0.15, replace = TRUE)

## # A tibble: 13 x 14
##    name     height  mass hair_color skin_color eye_color birth_year sex   gender
##    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
##  1 BB8          NA    NA none       none       black             NA none  mascu~
##  2 Raymus ~    188    79 brown      light      brown             NA male  mascu~
##  3 Taun We     213    NA none       grey       black             NA fema~ femin~
##  4 Owen La~    178   120 brown, gr~ light      blue              52 male  mascu~
##  5 Nute Gu~    191    90 none       mottled g~ red               NA male  mascu~
##  6 R5-D4        97    32 <NA>       white, red red               NA none  mascu~
##  7 Wedge A~    170    77 brown      fair       hazel             21 male  mascu~
##  8 Tion Me~    206    80 none       grey       black             NA male  mascu~
##  9 Shaak Ti    178    57 none       red, blue~ black             NA fema~ femin~
## 10 Shmi Sk~    163    NA black      fair       brown             72 fema~ femin~
## 11 Kit Fis~    196    87 none       green      black             NA male  mascu~
## 12 San Hill    191    NA none       grey       gold              NA male  mascu~
## 13 Bail Pr~    191    NA black      tan        brown             67 male  mascu~
## # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

Instead of a fraction of the data we might also set the number of rows we want to filter from the data by using the n = argument.

distinct()

This functions removes all duplicate rows from the dataset. When working with a dataset that you’ve never touched before, this function might be helpful to check the range of values of the variables(s) you are going to use in function filter(). The function might be applied to the whole dataset, a single column or multiple columns.

starwars %>%
  distinct(hair_color)

## # A tibble: 13 x 1
##    hair_color   
##    <chr>        
##  1 blond        
##  2 <NA>         
##  3 none         
##  4 brown        
##  5 brown, grey  
##  6 black        
##  7 auburn, white
##  8 auburn, grey 
##  9 white        
## 10 grey         
## 11 auburn       
## 12 blonde       
## 13 unknown

There are 13 distinct() values of hair_color in the starwars dataset.

arrange()

With arrange() we can alter how the dataset is sorted. The default sorting order is from the lowest to the highest value. The function desc() lets you reverse that order.

starwars %>%
  filter(sex == "male") %>%
  arrange(desc(height))

## # A tibble: 60 x 14
##    name     height  mass hair_color skin_color eye_color birth_year sex   gender
##    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Yarael ~    264    NA none       white      yellow          NA   male  mascu~
##  2 Tarfful     234   136 brown      brown      blue            NA   male  mascu~
##  3 Lama Su     229    88 none       grey       black           NA   male  mascu~
##  4 Chewbac~    228   112 brown      unknown    blue           200   male  mascu~
##  5 Roos Ta~    224    82 none       grey       orange          NA   male  mascu~
##  6 Grievous    216   159 none       brown, wh~ green, y~       NA   male  mascu~
##  7 Rugor N~    206    NA none       green      orange          NA   male  mascu~
##  8 Tion Me~    206    80 none       grey       black           NA   male  mascu~
##  9 Darth V~    202   136 none       white      yellow          41.9 male  mascu~
## 10 Ki-Adi-~    198    82 white      pale       yellow          92   male  mascu~
## # ... with 50 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>

Digression: operators

In section 2 we’ve seen many possibilities how to refer to variables in the select() function. There are also quite a few ways how we can filter() the cases of a dataset. They might be the result of comparing values, logical/boolean operators or value matching.

Operator	Function
<, >	Less than, greater than
==	Equal to
!=	Not equal to
<=, >=	Less than, greather than or equal to
%in%	Group membership
is.na(), !is.na()	Is NA, js not NA
&, \|, !, xor(), any(), all()	Boolean operators

Also some of the vectorized functions from the previous section can be useful for subsetting the dataset. Below I filter() cases in the starwars dataset, which lie between() row_number() 5 and 10.

starwars %>% 
  filter(between(row_number(), left = 5, right = 10))

## # A tibble: 6 x 14
##   name      height  mass hair_color skin_color eye_color birth_year sex   gender
##   <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
## 1 Leia Org~    150    49 brown      light      brown             19 fema~ femin~
## 2 Owen Lars    178   120 brown, gr~ light      blue              52 male  mascu~
## 3 Beru Whi~    165    75 brown      light      blue              47 fema~ femin~
## 4 R5-D4         97    32 <NA>       white, red red               NA none  mascu~
## 5 Biggs Da~    183    84 black      light      brown             24 male  mascu~
## 6 Obi-Wan ~    182    77 auburn, w~ fair       blue-gray         57 male  mascu~
## # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

5. Summarise cases

summarise()

Suppose we are interested in the average weight of all characters in the starwars dataset. By using the function summarise() we just have to plug in a name of the variable to be aggregated as well as an aggregation function like mean().

starwars %>%
  summarise(avgWeight = mean(mass, na.rm = TRUE))

## # A tibble: 1 x 1
##   avgWeight
##       <dbl>
## 1      97.3

A single value of the new (aggregated) variable avgWeight is returned. Remember that we’ve already computed this value in section 3. However, the function mutate() treats mean() as a vectorized function and returns not a single but a vector of values which is added to the dataset.

count()

With count() we can count the number of rows of each unique value of a variable. This function works with numeric and character inputs alike. It is quite similar to the function distinct() except that we get a second column with the number of occurences in the dataset. Remeber Base R’s table() function? This is roughly an equivalent to it but much more tidy.

starwars %>%
  count(hair_color)

## # A tibble: 13 x 2
##    hair_color        n
##    <chr>         <int>
##  1 auburn            1
##  2 auburn, grey      1
##  3 auburn, white     1
##  4 black            13
##  5 blond             3
##  6 blonde            1
##  7 brown            18
##  8 brown, grey       1
##  9 grey              1
## 10 none             37
## 11 unknown           1
## 12 white             4
## 13 <NA>              5

There is also a vectorised version of this function called add_count(). It stores the result of count() in a column n and adds it to the dataset in a mutate()-like fashion.

starwars %>% 
  add_count(hair_color) %>%
  select(name, hair_color, n)

## # A tibble: 87 x 3
##    name               hair_color        n
##    <chr>              <chr>         <int>
##  1 Luke Skywalker     blond             3
##  2 C-3PO              <NA>              5
##  3 R2-D2              <NA>              5
##  4 Darth Vader        none             37
##  5 Leia Organa        brown            18
##  6 Owen Lars          brown, grey       1
##  7 Beru Whitesun lars brown            18
##  8 R5-D4              <NA>              5
##  9 Biggs Darklighter  black            13
## 10 Obi-Wan Kenobi     auburn, white     1
## # ... with 77 more rows

across()

In section 3 we have seen that a vectorized function can be applied to multiple columns when using mutate(). The same is also true for aggregation functions. Suppose you want to compute the average height and weight of all characters in the starwars dataset. With across() we can select multiple columns (below I use the selection helper c()) and then apply one or many aggregation function(s) like mean() to each column.

starwars %>%
  summarise(
    across(
      .cols = c(height, mass),
      .fns = mean
      )
    )

## # A tibble: 1 x 2
##   height  mass
##    <dbl> <dbl>
## 1     NA    NA

Okay, this did not work as expected as for both columns NA was returned. Can you guess why? Before we used the argument na.rm = TRUE when computing the average weight. This argument drops any NA values in the dataset. We also have to do this within across(). There are multiple ways of doing this, however I recommend the purrr-style formula approach below. With .x we reference each element in the .cols = argument and then apply the ~mean() function while being able to access all arguments inside it.

starwars %>%
  summarise(
    across(
      .cols = c(height, mass), 
      .fns = ~mean(.x, na.rm = TRUE)
      )
    )

## # A tibble: 1 x 2
##   height  mass
##    <dbl> <dbl>
## 1   174.  97.3

Digression: aggregation functions

In general you can think of aggreation as the supplying of multiple values for the input and obatining a single value in return. There are different types of aggregation functions and many of them are already implemented in Base R. Below is a comprehensive table of the most common functions.

Metric	Function	Purpose
Position	stats::mean()	Mean value of a vector
	stats::median()	Median value of a vector
Logical	stats::mean()	Proportion of TRUE values
	base::sum()	Number of TRUE values
Order	dplyr::nth()	Nth value of a vector
	dplyr::first()	First value of a vector
	dplyr::last()	Last value of a vector
Rank	stats::max()	Maximum value in a vector
	stats::min()	Minimum value in a vector
	stats::quantile()	Minimum value in a vector
Spread	stats::var()	Variance of a vector
	stats::sd()	Standard deviation of a vector
	stats::IQR()	Inter quartile range of a vector
	stats::mad()	Mean absolute deviation
Count	dplyr::n()	Number of values/rows
	dplyr::n_distinct()	Number of unique values in a vector
	sum(!is.na())	Number of non-missing values

6. Group cases

Sometimes you don’t want to apply an aggregate or vectorized function on the whole dataset but only to specific subsets (or groups) of the data.

group_by()

The function group_by() lets you just do that by grouping the data into rows with the same value of one or multiple columns. Below I am grouping the starwars dataset by a character’s species.

starwars %>%
  group_by(species)

## # A tibble: 87 x 14
## # Groups:   species [38]
##    name     height  mass hair_color skin_color eye_color birth_year sex   gender
##    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Luke Sk~    172    77 blond      fair       blue            19   male  mascu~
##  2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu~
##  3 R2-D2        96    32 <NA>       white, bl~ red             33   none  mascu~
##  4 Darth V~    202   136 none       white      yellow          41.9 male  mascu~
##  5 Leia Or~    150    49 brown      light      brown           19   fema~ femin~
##  6 Owen La~    178   120 brown, gr~ light      blue            52   male  mascu~
##  7 Beru Wh~    165    75 brown      light      blue            47   fema~ femin~
##  8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu~
##  9 Biggs D~    183    84 black      light      brown           24   male  mascu~
## 10 Obi-Wan~    182    77 auburn, w~ fair       blue-gray       57   male  mascu~
## # ... with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>

Alright, this output does not look too interesing but the only thing we did so far is to supply our dataset with a structure of the groups.

group_by() + mutate()

For example, group_by() might be used in combination with mutate() which is very useful when computing a new variable for different subsets of the data.

starwars %>%
  group_by(species) %>%
  mutate(avgSpeciesWeight = mean(mass, na.rm = TRUE)) %>%
  select(name, species, mass, avgSpeciesWeight)

## # A tibble: 87 x 4
## # Groups:   species [38]
##    name               species  mass avgSpeciesWeight
##    <chr>              <chr>   <dbl>            <dbl>
##  1 Luke Skywalker     Human      77             82.8
##  2 C-3PO              Droid      75             69.8
##  3 R2-D2              Droid      32             69.8
##  4 Darth Vader        Human     136             82.8
##  5 Leia Organa        Human      49             82.8
##  6 Owen Lars          Human     120             82.8
##  7 Beru Whitesun lars Human      75             82.8
##  8 R5-D4              Droid      32             69.8
##  9 Biggs Darklighter  Human      84             82.8
## 10 Obi-Wan Kenobi     Human      77             82.8
## # ... with 77 more rows

See how inside the new avgSpeciesWeight column the values correspond to the character’s species instead of the global mean of all Star Wars characters?

group_by() + summarise()

On the other hand we may also use an aggregate function with summarise() and exploit the grouping structure.

starwars %>%
  group_by(species) %>%
  summarise(avgWeight = mean(mass, na.rm = TRUE))

## # A tibble: 38 x 2
##    species   avgWeight
##    <chr>         <dbl>
##  1 Aleena         15  
##  2 Besalisk      102  
##  3 Cerean         82  
##  4 Chagrian      NaN  
##  5 Clawdite       55  
##  6 Droid          69.8
##  7 Dug            40  
##  8 Ewok           20  
##  9 Geonosian      80  
## 10 Gungan         74  
## # ... with 28 more rows

How would you find out which species has the lowest average weight without inspecting the output in the console?

starwars %>%
  group_by(species) %>%
  summarise(avgWeight = mean(mass, na.rm = TRUE)) %>%
  slice_min(order_by = avgWeight, n = 1)

## # A tibble: 1 x 2
##   species avgWeight
##   <chr>       <dbl>
## 1 Aleena         15

It’s the Aleena race. According to Wookieepedia these guys are “a species of short, sentient reptiles.”

Digression: context dependent expressions

The function n() was mentioned in the digression of the previous section but is especially useful when working with grouped data. Let us take a look what the function n() does when used within summarise() after the dataset was grouped. It counts the number of rows for each value of homeworld and hence reflects the number of Star Wars characters from this planet in the dataset.

starwars %>%
  group_by(homeworld) %>%
  summarise(n = n())

## # A tibble: 49 x 2
##    homeworld          n
##    <chr>          <int>
##  1 Alderaan           3
##  2 Aleen Minor        1
##  3 Bespin             1
##  4 Bestine IV         1
##  5 Cato Neimoidia     1
##  6 Cerea              1
##  7 Champala           1
##  8 Chandrila          1
##  9 Concord Dawn       1
## 10 Corellia           2
## # ... with 39 more rows

There are more of these functions, which return information about the current group or current variable and only work in the context of summarise() or mutate(). Below I am creating an planet_id variable related to a character’s homeworld.

starwars %>%
  group_by(homeworld) %>%
  mutate(planet_id = cur_group_id()) %>%
  select(name, homeworld, planet_id)

## # A tibble: 87 x 3
## # Groups:   homeworld [49]
##    name               homeworld planet_id
##    <chr>              <chr>         <int>
##  1 Luke Skywalker     Tatooine         40
##  2 C-3PO              Tatooine         40
##  3 R2-D2              Naboo            28
##  4 Darth Vader        Tatooine         40
##  5 Leia Organa        Alderaan          1
##  6 Owen Lars          Tatooine         40
##  7 Beru Whitesun lars Tatooine         40
##  8 R5-D4              Tatooine         40
##  9 Biggs Darklighter  Tatooine         40
## 10 Obi-Wan Kenobi     Stewjon          38
## # ... with 77 more rows

In total there are 49 unique values in variable homeworld, hence the planet_id takes on the values between 1 and 49

ungroup()

There is also a function called ungroup() which removes the grouping information. When carrying out complex queries with the %>% it can be advisable to remove the grouping structure, for example because it becomes more clear during which steps aggregation functions are applied to the data. Let’s try to verify my previous statement that the planet_id takes on the values between 1 and 49. After computing planet_id I select all unique values of it by using distinct().

starwars %>%
  group_by(homeworld) %>%
  mutate(planet_id = cur_group_id()) %>%
  select(name, homeworld, planet_id) %>%
  distinct(planet_id)

## # A tibble: 49 x 2
## # Groups:   homeworld [49]
##    homeworld  planet_id
##    <chr>          <int>
##  1 Tatooine          40
##  2 Naboo             28
##  3 Alderaan           1
##  4 Stewjon           38
##  5 Eriadu            15
##  6 Kashyyyk          23
##  7 Corellia          10
##  8 Rodia             32
##  9 Nal Hutta         29
## 10 Bestine IV         4
## # ... with 39 more rows

Although I used distinct() only on the variable planet_id the variable homeworld is still present in the output. Now let me remove the grouping structure with ungroup().

starwars %>%
  group_by(homeworld) %>%
  mutate(planet_id = cur_group_id()) %>%
  select(name, homeworld, planet_id) %>%
  ungroup() %>%
  distinct(planet_id)

## # A tibble: 49 x 1
##    planet_id
##        <int>
##  1        40
##  2        28
##  3         1
##  4        38
##  5        15
##  6        23
##  7        10
##  8        32
##  9        29
## 10         4
## # ... with 39 more rows

See the difference? In this setting it does’nt do any harm to keep the grouping structure but depending on the situation it might cause some unwanted results.

rowwise()

This function can be used to group data into individual rows (as dplyr functions compute results for each row). Also you may apply functions to list-columns. In the tidyr tutorial the rowwise() function is discussed in more detail.

7. Combine datasets

Sometimes neither manipulating nor summarising doesn’t get you any further with your analysis (and of course you have also exploited various grouping strucutres). After thorough searching you find another dataset which complements your previous findings but now you’re faced with the challenge to merge them together somehow. There are many pitfalls when linking different datasets so this section might be a bit more challening than the previous ones.

7.1 Combine variables or cases

bind_cols()

Below I stored the name and mass of the Star Wars characters in two different tibbles sw_names and sw_mass. The function bind_cols() lets us combine variables from different sources into a single dataset. Take care that the separated datasets are ordered in the same way before glueing them together. The columns will not be matched by an id (= common link in both datasets) with bind_cols() so be careful that they are ordered in the same way.

sw_names <- 
  starwars %>%
  select(name)

sw_mass <-
  starwars %>%
  select(mass)

bind_cols(sw_names, sw_mass)

## # A tibble: 87 x 2
##    name                mass
##    <chr>              <dbl>
##  1 Luke Skywalker        77
##  2 C-3PO                 75
##  3 R2-D2                 32
##  4 Darth Vader          136
##  5 Leia Organa           49
##  6 Owen Lars            120
##  7 Beru Whitesun lars    75
##  8 R5-D4                 32
##  9 Biggs Darklighter     84
## 10 Obi-Wan Kenobi        77
## # ... with 77 more rows

bind_rows()

Next I am creating two new datasets. One contains all Star Wars characters with a mass of at least 100 kg. The other contains characters which are at least 1,90 m tall. Remember that the variables (columns) are not altered when function filter() is used. Now, with function bind_rows() I can combine the cases of both datasets into a single dataset.

sw_big <-
  starwars %>%
  select(name, height, mass) %>%
  filter(mass >= 100 )

sw_tall <-
  starwars %>%
  select(name, height, mass) %>%
  filter(height >= 190)

bind_rows(sw_big, sw_tall)

## # A tibble: 33 x 3
##    name                  height  mass
##    <chr>                  <int> <dbl>
##  1 Darth Vader              202   136
##  2 Owen Lars                178   120
##  3 Chewbacca                228   112
##  4 Jabba Desilijic Tiure    175  1358
##  5 Jek Tono Porkins         180   110
##  6 IG-88                    200   140
##  7 Bossk                    190   113
##  8 Dexter Jettster          198   102
##  9 Grievous                 216   159
## 10 Tarfful                  234   136
## # ... with 23 more rows

Using argument .id = you can supply a column name which links each row to its original dataset.

bind_rows(
  "big guys" = sw_big, 
  "tall guys" = sw_tall, 
  .id = "id"
  ) %>%
  group_by(id) %>%
  slice(1:3)

## # A tibble: 6 x 4
## # Groups:   id [2]
##   id        name        height  mass
##   <chr>     <chr>        <int> <dbl>
## 1 big guys  Darth Vader    202   136
## 2 big guys  Owen Lars      178   120
## 3 big guys  Chewbacca      228   112
## 4 tall guys Darth Vader    202   136
## 5 tall guys Chewbacca      228   112
## 6 tall guys IG-88          200   140

Can you see what happens to characters being big and tall?

7.2 Set operations

Set operations are used to compare two datasets with each other.

intersect()

Let’s find rows that appear both in sw_big and sw_tall. This can be done with intersect().

intersect(sw_big, sw_tall)

## # A tibble: 7 x 3
##   name            height  mass
##   <chr>            <int> <dbl>
## 1 Darth Vader        202   136
## 2 Chewbacca          228   112
## 3 IG-88              200   140
## 4 Bossk              190   113
## 5 Dexter Jettster    198   102
## 6 Grievous           216   159
## 7 Tarfful            234   136

setdiff()

With setdiff() on the other hand, we can find rows that appear in sw_big but not sw_tall.

setdiff(sw_big, sw_tall)

## # A tibble: 3 x 3
##   name                  height  mass
##   <chr>                  <int> <dbl>
## 1 Owen Lars                178   120
## 2 Jabba Desilijic Tiure    175  1358
## 3 Jek Tono Porkins         180   110

union()

Then with union() we can find rows that appear in sw_big or sw_tall while removing duplicate cases. Note that there is also union_all() to retain the duplicate values.

union(sw_big, sw_tall)

## # A tibble: 26 x 3
##    name                  height  mass
##    <chr>                  <int> <dbl>
##  1 Darth Vader              202   136
##  2 Owen Lars                178   120
##  3 Chewbacca                228   112
##  4 Jabba Desilijic Tiure    175  1358
##  5 Jek Tono Porkins         180   110
##  6 IG-88                    200   140
##  7 Bossk                    190   113
##  8 Dexter Jettster          198   102
##  9 Grievous                 216   159
## 10 Tarfful                  234   136
## # ... with 16 more rows

setequal()

Finally, with function setequal() we can test whether two datasets contain the exact same rows. The sorting order does not matter here.

setequal(sw_big, sw_tall)

## [1] FALSE

7.3 Mutating joins

When using a mutating join you combine different datasets in order to add additional information to your analysis. Hence, new columns will be added to your initial dataset. To do so, an id variable which can identify rows between the datasets must be present in your data. With the help of Wookieepedia I created a dataset called worlds, which contains information about astronomical objects of the Star Wars universe. Have a look at it below.

print(worlds)

## # A tibble: 49 x 4
##    object         type   moons species
##    <chr>          <chr>  <dbl> <lgl>  
##  1 Alderaan       planet     0 NA     
##  2 Aleen Minor    planet     0 NA     
##  3 Bespin         planet     2 NA     
##  4 Bestine IV     planet     0 NA     
##  5 Cato Neimoidia planet     2 NA     
##  6 Cerea          planet     0 NA     
##  7 Champala       planet     0 NA     
##  8 Chandrila      planet     1 NA     
##  9 Concord Dawn   planet     3 NA     
## 10 Corellia       planet     2 NA     
## # ... with 39 more rows

left_join()

With the left_join() two datasets x and y are combined by by joining all matching values from y to x. Non-matching values of x still remain in the output but will display NA by default for any column of y. In the starwars dataset the column homeworld contains values which also occur in the object column of the worlds dataset. Use the argument by = to supply the column names on which you want to match the two datasets.

left_join(
  x = starwars,
  y = worlds,
  by = c("homeworld" = "object")
  ) %>%
  select(name, homeworld, type, moons)

## # A tibble: 87 x 4
##    name               homeworld type                moons
##    <chr>              <chr>     <chr>               <dbl>
##  1 Luke Skywalker     Tatooine  planet                  3
##  2 C-3PO              Tatooine  planet                  3
##  3 R2-D2              Naboo     <NA>                   NA
##  4 Darth Vader        Tatooine  planet                  3
##  5 Leia Organa        Alderaan  planet                  0
##  6 Owen Lars          Tatooine  planet                  3
##  7 Beru Whitesun lars Tatooine  planet                  3
##  8 R5-D4              Tatooine  planet                  3
##  9 Biggs Darklighter  Tatooine  planet                  3
## 10 Obi-Wan Kenobi     Stewjon   astronomical object     0
## # ... with 77 more rows

Now each character in the starwars dataset shows the type and the number of moons of its homeworld. Note that the number of cases in the starwars dataset was not affected by this operation. Also note that for characters with the homeworld Naboo NA values were returned. Either this object is not included in the worlds dataset or some information are missing. Can we find out what’s going on here?

inner_join()

With the function inner_join() only rows that match between two datasets are returned. So when there is no object called Naboo in the worlds dataset any Star Wars character with this homeworld will be dropped from the output.

inner_join(
  x = starwars,
  y = worlds,
  by = c("homeworld" = "object")
  ) %>%
  select(name, homeworld, type, moons)

## # A tibble: 66 x 4
##    name               homeworld type                moons
##    <chr>              <chr>     <chr>               <dbl>
##  1 Luke Skywalker     Tatooine  planet                  3
##  2 C-3PO              Tatooine  planet                  3
##  3 Darth Vader        Tatooine  planet                  3
##  4 Leia Organa        Alderaan  planet                  0
##  5 Owen Lars          Tatooine  planet                  3
##  6 Beru Whitesun lars Tatooine  planet                  3
##  7 R5-D4              Tatooine  planet                  3
##  8 Biggs Darklighter  Tatooine  planet                  3
##  9 Obi-Wan Kenobi     Stewjon   astronomical object     0
## 10 Anakin Skywalker   Tatooine  planet                  3
## # ... with 56 more rows

Looks like R2-D2 is gone now!

full_join()

There is a third way to combine two datasets with function full_join(), which retains all rows and all values. For example, there might be some objects in the worlds dataset which are uninhabitable (even to the most adapted of creatures). It is hence unlikely that any character in the starwars dataset will have such an object as its homeworld. In such a case we would expect that the full_join() returns values for columns of the worlds dataset but NA for columns of the starwars dataset.

full_join(
  x = starwars,
  y = worlds,
  by = c("homeworld" = "object")
  ) %>%
  select(name, homeworld, type, moons) %>%
  filter(is.na(name))

## # A tibble: 2 x 4
##   name  homeworld type   moons
##   <chr> <chr>     <chr>  <dbl>
## 1 <NA>  Dagobah   planet     1
## 2 <NA>  Jakku     planet     2

No Star Wars character in the starwars dataset is from Dagobah or Jakku.

cross-join

There is also a possibility to perform a cross-join (aka cartesian product) which generates all combinations of x and y. This is not really meaningful given the information supplied in the starwars and worlds dataset but it’s simply done by specifying by = character().

left_join(
  x = starwars,
  y = worlds,
  by = character()
  ) %>%
  select(name, homeworld, moons)

## # A tibble: 4,263 x 3
##    name           homeworld moons
##    <chr>          <chr>     <dbl>
##  1 Luke Skywalker Tatooine      0
##  2 Luke Skywalker Tatooine      0
##  3 Luke Skywalker Tatooine      2
##  4 Luke Skywalker Tatooine      0
##  5 Luke Skywalker Tatooine      2
##  6 Luke Skywalker Tatooine      0
##  7 Luke Skywalker Tatooine      0
##  8 Luke Skywalker Tatooine      1
##  9 Luke Skywalker Tatooine      3
## 10 Luke Skywalker Tatooine      2
## # ... with 4,253 more rows

The output is a pretty long dataset with 4.263 rows.

7.4 Filtering joins

With a filtering join one table can be filtered against the rows of another.

semi_join()

When using a filtering join no columns from dataset y are added to x. Hence, only cases of starwars that have a match in worlds are returned. We can thus see what will be included in a join.

semi_join(
  x = starwars,
  y = worlds,
  by = c("homeworld" = "object")
  ) %>%
  select(name, homeworld)

## # A tibble: 66 x 2
##    name               homeworld
##    <chr>              <chr>    
##  1 Luke Skywalker     Tatooine 
##  2 C-3PO              Tatooine 
##  3 Darth Vader        Tatooine 
##  4 Leia Organa        Alderaan 
##  5 Owen Lars          Tatooine 
##  6 Beru Whitesun lars Tatooine 
##  7 R5-D4              Tatooine 
##  8 Biggs Darklighter  Tatooine 
##  9 Obi-Wan Kenobi     Stewjon  
## 10 Anakin Skywalker   Tatooine 
## # ... with 56 more rows

All characters whose homeworld can be found in the worlds dataset are returned.

anti_join()

The anti_join() does the opposite and returns characters of the starwars dataset which don’t have a match in worlds. We can hence see what will not be included in a join.

anti_join(
  x = starwars,
  y = worlds,
  by = c("homeworld" = "object")
  ) %>%
  select(name, homeworld)

## # A tibble: 21 x 2
##    name          homeworld
##    <chr>         <chr>    
##  1 R2-D2         Naboo    
##  2 Yoda          <NA>     
##  3 Palpatine     Naboo    
##  4 IG-88         <NA>     
##  5 Arvel Crynyd  <NA>     
##  6 Qui-Gon Jinn  <NA>     
##  7 Jar Jar Binks Naboo    
##  8 Roos Tarpals  Naboo    
##  9 Rugor Nass    Naboo    
## 10 Ric Olié      Naboo    
## # ... with 11 more rows

These are either characters from Naboo, which is missing in the homeworlds dataset, or characters with a missing value in variable homeworld.

nest_join()

With the function nest_join() one dataset can be inner joined to another into a nested data frame. Sounds quite complex so let’s take a look at what’s happening.

nest_join(
  x = starwars,
  y = worlds,
  by = c("homeworld" = "object")
  ) %>%
  select(name, worlds)

## # A tibble: 87 x 2
##    name               worlds          
##    <chr>              <list>          
##  1 Luke Skywalker     <tibble [1 x 3]>
##  2 C-3PO              <tibble [1 x 3]>
##  3 R2-D2              <tibble [0 x 3]>
##  4 Darth Vader        <tibble [1 x 3]>
##  5 Leia Organa        <tibble [1 x 3]>
##  6 Owen Lars          <tibble [1 x 3]>
##  7 Beru Whitesun lars <tibble [1 x 3]>
##  8 R5-D4              <tibble [1 x 3]>
##  9 Biggs Darklighter  <tibble [1 x 3]>
## 10 Obi-Wan Kenobi     <tibble [1 x 3]>
## # ... with 77 more rows

The matching rows of the worlds are now stored in one single list column instead of three separate columns. This safes some space for sure but it also makes things a bit more complex when working with the column worlds. There is a function called unnest() in the tidyr package which can entangle this column. Read more about it in this tutorial.

7.5 Column matching

by =

There are several ways how you can carry out a mutating or filtering join with argument by =.

Setting	Function
one identical column name (for matching) per dataset	`by = "id1"`
multiple identical column names (for matching) per dataset	`by = c("id1", "id2")`
one different column name (for matching) per dataset	`by = c("ida1" = "idb1")`
multiple different column names (for matching) per dataset	`by = c("ida1" = "idb1", "ida2 = idb2")`
identical column names not used for matching	`by = c("ida1" = "idb1", suffix = c("1","2"))`

suffix =

Suppose that the worlds dataset also contains a species variable which represents the most common creatures on the planet. Without the suffix = argument the species column will be duplicated with a default suffix .x and .y. We can change the suffix to make it clear which column represents which information. I’m using _chr for the species of the Star Wars character and _wrld for creature living on a planet.

left_join(
  x = starwars,
  y = worlds,
  by = c("homeworld" = "object"),
  suffix = c("_chr","_wrld")
  ) %>%
  select(name, species_chr, homeworld, species_wrld)

## # A tibble: 87 x 4
##    name               species_chr homeworld species_wrld
##    <chr>              <chr>       <chr>     <lgl>       
##  1 Luke Skywalker     Human       Tatooine  NA          
##  2 C-3PO              Droid       Tatooine  NA          
##  3 R2-D2              Droid       Naboo     NA          
##  4 Darth Vader        Human       Tatooine  NA          
##  5 Leia Organa        Human       Alderaan  NA          
##  6 Owen Lars          Human       Tatooine  NA          
##  7 Beru Whitesun lars Human       Tatooine  NA          
##  8 R5-D4              Droid       Tatooine  NA          
##  9 Biggs Darklighter  Human       Tatooine  NA          
## 10 Obi-Wan Kenobi     Human       Stewjon   NA          
## # ... with 77 more rows

References

Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2020. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.

R Tutorial: dplyr package

Philipp Leppert

26.11.2021

1. Introduction

Tibbles

Cases and variables

Pipe operator

2. Manipulate variables

select()

pull()

rename()

relocate()

Digression: selection helpers

3. Add new variables

3.1 Add a single variable

mutate()

transmute()

if_else()

case_when()

recode()

Digression: vectorized functions

3.2 Add multiple variables

mutate()

across()

4. Manipulate cases

filter()

slice()

slice_sample()

distinct()

arrange()

Digression: operators

5. Summarise cases

summarise()

count()

across()

Digression: aggregation functions

6. Group cases

group_by()

group_by() + mutate()

group_by() + summarise()

Digression: context dependent expressions

ungroup()

rowwise()

7. Combine datasets

7.1 Combine variables or cases

bind_cols()

bind_rows()

7.2 Set operations

intersect()

setdiff()

union()

setequal()

7.3 Mutating joins

left_join()

inner_join()

full_join()

cross-join

7.4 Filtering joins

semi_join()

anti_join()

nest_join()

7.5 Column matching

by =

suffix =

References