Reading time: 19 minutes (3,780 words)
In this tutorial I present functions of the dplyr
package and their practical application. To start with, install the package and load it as shown below.
library(dplyr)
The dplyr
package contains a dataset called starwars
, which I will use hereafter to demonstrate the functions. Let us take a preliminary look at the data with print()
.
print(starwars)
## # A tibble: 87 x 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke Sk~ 172 77 blond fair blue 19 male mascu~
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu~
## 3 R2-D2 96 32 <NA> white, bl~ red 33 none mascu~
## 4 Darth V~ 202 136 none white yellow 41.9 male mascu~
## 5 Leia Or~ 150 49 brown light brown 19 fema~ femin~
## 6 Owen La~ 178 120 brown, gr~ light blue 52 male mascu~
## 7 Beru Wh~ 165 75 brown light blue 47 fema~ femin~
## 8 R5-D4 97 32 <NA> white, red red NA none mascu~
## 9 Biggs D~ 183 84 black light brown 24 male mascu~
## 10 Obi-Wan~ 182 77 auburn, w~ fair blue-gray 57 male mascu~
## # ... with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## # films <list>, vehicles <list>, starships <list>
This dataset is a so called tibble which is simply a dataframe with some quality of life enhancements. Note that when printing starwars
to the console, only the first 10 rows are displayed instead of all rows when using print()
with a conventional dataframe. More information about the tibble can be found in this tutorial.
Every row in the starwars
dataset corresponds to a character of the movie - we will refer to the rows as cases. There are 87 cases in total. Each column holds information about a Star Wars character like its name
, height
or mass
. The columns are referred to as variables. When a variable is saved in its own column and each case is saved in its own row the dataset is said to be tidy. Tidy data complements R’s vectorized operations and is also the required input for many functions of the dplyr
package.
Throughout this tutorial you will often encounter the pipe operator (%>%
), which is in fact not a function of dplyr
but the magrittr
package. It passes an object forward into a function and makes code a lot more readable. Printing the starwars
dataset with the pipe operator looks like this:
starwars %>%
print()
## # A tibble: 87 x 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke Sk~ 172 77 blond fair blue 19 male mascu~
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu~
## 3 R2-D2 96 32 <NA> white, bl~ red 33 none mascu~
## 4 Darth V~ 202 136 none white yellow 41.9 male mascu~
## 5 Leia Or~ 150 49 brown light brown 19 fema~ femin~
## 6 Owen La~ 178 120 brown, gr~ light blue 52 male mascu~
## 7 Beru Wh~ 165 75 brown light blue 47 fema~ femin~
## 8 R5-D4 97 32 <NA> white, red red NA none mascu~
## 9 Biggs D~ 183 84 black light brown 24 male mascu~
## 10 Obi-Wan~ 182 77 auburn, w~ fair blue-gray 57 male mascu~
## # ... with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## # films <list>, vehicles <list>, starships <list>
The output is the same as using print(starwars)
.
Initially, we will deal with the manipulation of variables. This affects the columns of our dataset while the cases (rows) remain unchanged.
With the select()
function I extract a character’s name
and sex
from the starwars
dataset. Simply list the columns you want to keep inside the functions and separate them by ,
.
starwars %>%
select(name, height, mass)
## # A tibble: 87 x 3
## name height mass
## <chr> <int> <dbl>
## 1 Luke Skywalker 172 77
## 2 C-3PO 167 75
## 3 R2-D2 96 32
## 4 Darth Vader 202 136
## 5 Leia Organa 150 49
## 6 Owen Lars 178 120
## 7 Beru Whitesun lars 165 75
## 8 R5-D4 97 32
## 9 Biggs Darklighter 183 84
## 10 Obi-Wan Kenobi 182 77
## # ... with 77 more rows
The function pull()
extracts a single column from a dataset as a vector. It is very helpful when working with other functions that require a vector as input instead of a tibble or dataframe.
starwars %>%
pull(height)
## [1] 172 167 96 202 150 178 165 97 183 182 188 180 228 180 173 175 170 180 66
## [20] 170 183 200 190 177 175 180 150 NA 88 160 193 191 170 196 224 206 183 137
## [39] 112 183 163 175 180 178 94 122 163 188 198 196 171 184 188 264 188 196 185
## [58] 157 183 183 170 166 165 193 191 183 168 198 229 213 167 79 96 193 191 178
## [77] 216 234 188 178 206 NA NA NA NA NA 165
Why can’t we use select()
to do this? You will of course get the same column and even the same rows of the dataset but R treats both outputs differently. When using select()
the output is still a tibble. Let’s verify this using the class()
function:
starwars %>%
select(height) %>%
class()
## [1] "tbl_df" "tbl" "data.frame"
starwars %>%
pull(height) %>%
class()
## [1] "integer"
You might also want to change a variable’s name and then use it in a subsequent select()
function. Below I rename()
the mass
variable to weight
.
starwars %>%
rename(weight = "mass") %>%
select(name, height, weight)
## # A tibble: 87 x 3
## name height weight
## <chr> <int> <dbl>
## 1 Luke Skywalker 172 77
## 2 C-3PO 167 75
## 3 R2-D2 96 32
## 4 Darth Vader 202 136
## 5 Leia Organa 150 49
## 6 Owen Lars 178 120
## 7 Beru Whitesun lars 165 75
## 8 R5-D4 97 32
## 9 Biggs Darklighter 183 84
## 10 Obi-Wan Kenobi 182 77
## # ... with 77 more rows
Don’t like the sequence of columns in the starwars
dataset? With relocate()
you might change their position. Below I move the species
variable before the variable name
.
starwars %>%
relocate(species, .before = name)
## # A tibble: 87 x 14
## species name height mass hair_color skin_color eye_color birth_year sex
## <chr> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Human Luke S~ 172 77 blond fair blue 19 male
## 2 Droid C-3PO 167 75 <NA> gold yellow 112 none
## 3 Droid R2-D2 96 32 <NA> white, bl~ red 33 none
## 4 Human Darth ~ 202 136 none white yellow 41.9 male
## 5 Human Leia O~ 150 49 brown light brown 19 fema~
## 6 Human Owen L~ 178 120 brown, gr~ light blue 52 male
## 7 Human Beru W~ 165 75 brown light blue 47 fema~
## 8 Droid R5-D4 97 32 <NA> white, red red NA none
## 9 Human Biggs ~ 183 84 black light brown 24 male
## 10 Human Obi-Wa~ 182 77 auburn, w~ fair blue-gray 57 male
## # ... with 77 more rows, and 5 more variables: gender <chr>, homeworld <chr>,
## # films <list>, vehicles <list>, starships <list>
There are many ways to reference variables inside the select()
function. For example, when putting the -
opreator before one or more variables they will be dropped from the dataset. On the other hand, a sequence of variables can be selected with a :
(which means from:to).
In the tidyselect
package are many more of these selection helpers which are discussed in this tutorial. Below I included a table to get an idea how to use them.
Function | Purpose |
---|---|
tidyselect::everything() | Select all variables |
tidyselect::last_col() | Select the last variable |
tidyselect::starts_with() | Starts with a prefix |
tidyselect::ends_with() | Ends with a suffix |
tidyselect::contains() | Contains a literal string |
tidyselect::matches() | Matches a regular expression |
tidyselect::num_range() | Matches a numerical range like x01, x02, x03 |
tidyselect::all_of() | If any of the variables is missing, an error is thrown |
tidyselect::any_of() | Doesn’t check for missing variables |
tidyselect::where() | Select variables with a predicate function (returns TRUE or FALSE) |
Instead of manipulating the dataset’s existing variables we can also creating new columns. In this section vectorized functions play an important role. They take vectors as input and return vectors of the same length as output.
This function generates a new variable based on a function that uses existing columns of the dataset or some other (external) supplied vector of values. When we compute the avgWeight
of the Star Wars characters with mutate()
, the mean()
value of mass
is added to each case of the data!
starwars %>%
mutate(avgWeight = mean(mass, na.rm = TRUE)) %>%
select(name, avgWeight, mass)
## # A tibble: 87 x 3
## name avgWeight mass
## <chr> <dbl> <dbl>
## 1 Luke Skywalker 97.3 77
## 2 C-3PO 97.3 75
## 3 R2-D2 97.3 32
## 4 Darth Vader 97.3 136
## 5 Leia Organa 97.3 49
## 6 Owen Lars 97.3 120
## 7 Beru Whitesun lars 97.3 75
## 8 R5-D4 97.3 32
## 9 Biggs Darklighter 97.3 84
## 10 Obi-Wan Kenobi 97.3 77
## # ... with 77 more rows
Now each row shows the mass
of a Star Wars character but also the avgWeight
of all characters (including itself). Note the use of argument na.rm =
in function mean()
which removes NA
values when set to TRUE
.
While mutate()
adds new variables and preserves all existing ones the function transmute()
adds new variables and drops the existing ones.
starwars %>%
transmute(avgWeight = mean(mass, na.rm = TRUE))
## # A tibble: 87 x 1
## avgWeight
## <dbl>
## 1 97.3
## 2 97.3
## 3 97.3
## 4 97.3
## 5 97.3
## 6 97.3
## 7 97.3
## 8 97.3
## 9 97.3
## 10 97.3
## # ... with 77 more rows
With function if_else()
we can specify a single condition that assigns the new variable a value if it evaluates to TRUE
or else if it evaluates to FALSE
. Below I check whether a case has a missing value in the mass
variable with function is.na()
. If so, I assign the value 0
instead of NA
. If not I keep the given value of mass
.
starwars %>%
mutate(
mass_mod = if_else(
condition = is.na(mass),
true = 0,
false = mass
)
) %>%
select(name, mass, mass_mod)
## # A tibble: 87 x 3
## name mass mass_mod
## <chr> <dbl> <dbl>
## 1 Luke Skywalker 77 77
## 2 C-3PO 75 75
## 3 R2-D2 32 32
## 4 Darth Vader 136 136
## 5 Leia Organa 49 49
## 6 Owen Lars 120 120
## 7 Beru Whitesun lars 75 75
## 8 R5-D4 32 32
## 9 Biggs Darklighter 84 84
## 10 Obi-Wan Kenobi 77 77
## # ... with 77 more rows
The case_when()
function extends if_else()
by allowing for multiple conditions. This is very useful when working with categorical/factor variables. Below I am creating a new variable rating
, which represents the result of a survey on the popularity of the Star Wars characters among fans of the movies. For each case where a condition specified inside case_when()
evaluates to TRUE
the value after the ~
is stored in the new variable rating
. The arguments are evaluated in order, so the conditions must be specified from the most specific to the most general. With TRUE
you can assign a value to all remaining cases where no condition was met - otherwise these cases will be assigned with NA
.
starwars %>%
mutate(rating =
case_when(
name %in% c("Chewbacca", "Yoda", "C-3PO") ~ "badass",
name %in% c("Han Solo", "R2-D2") ~ "quite cool",
mass > 1000 ~ "wicked",
TRUE ~ "indifferent"
)
) %>%
select(name, rating)
## # A tibble: 87 x 2
## name rating
## <chr> <chr>
## 1 Luke Skywalker indifferent
## 2 C-3PO badass
## 3 R2-D2 quite cool
## 4 Darth Vader indifferent
## 5 Leia Organa indifferent
## 6 Owen Lars indifferent
## 7 Beru Whitesun lars indifferent
## 8 R5-D4 indifferent
## 9 Biggs Darklighter indifferent
## 10 Obi-Wan Kenobi indifferent
## # ... with 77 more rows
This is a vectorised version of switch()
that replaces numeric values based on their position or their name, and character or factor values only by their name. Note that unlike in rename()
or mutate()
the old value of the variable comes first and then the new one. Hence, this function is not fully optimised for the dplyr
package and might be altered in the future. Below I am recoding values of the variable sex
(hermaphroditic and none) to take on the value other.
starwars %>%
mutate(
sex = recode_factor(
sex,
hermaphroditic = "other",
none = "other"
)
) %>%
select(name, sex)
## # A tibble: 87 x 2
## name sex
## <chr> <fct>
## 1 Luke Skywalker male
## 2 C-3PO other
## 3 R2-D2 other
## 4 Darth Vader male
## 5 Leia Organa female
## 6 Owen Lars male
## 7 Beru Whitesun lars female
## 8 R5-D4 other
## 9 Biggs Darklighter male
## 10 Obi-Wan Kenobi male
## # ... with 77 more rows
The dplyr
package contains many vectorized functions that take a vector of values as input and return another vector of values. Below I included a table of these functions and their possible field of application.
Class | Function | Purpose |
---|---|---|
Offset | dplyr::lead() | Computes leading values |
dplyr::lagged() | Computes lagged values | |
Ranking | dplyr::row_number() | Row number of the observation (sorting order) |
dplyr::min_rank() | Rank of each observation | |
dplyr::dense_rank() | Rank of each observation (without gaps) | |
dplyr::percent_rank() | Rank rescaled between 0 and 1 | |
dplyr::ntile() | Rough rank, which breaks the input vector into n buckets | |
dplyr::cume_dist() | Proportion of all values <= current rank | |
Cumulative aggregate | dplyr::cumall() | Cumulative all() |
dplyr::cumany() | Cumulative any() | |
base::cummax() | Cumulative max() | |
dplyr::cummean() | Cumulative mean() | |
base::cummin() | cumulative min() | |
base::cumprod() | cumulative prod() | |
base::cumsum() | cumulative sum() | |
Math | +, - , *, /, ^, %/%, %% | Arithmetic operators |
log(), log2(), log10() | Logarithm | |
<, <=, >, >=, !=, == | Logical comparisons | |
dplyr::between() | x >= left & x <= right | |
dplyr::near() | Safe comparison for floating point numbers | |
Misc | dplyr::coalesce() | Use a single value to replace all missing values, Match together a complete vector from missing pieces |
dplyr::na_if() | Convert an annoying value to NA | |
tidyr::replace_na() | Replace NA values with another value |
With mutate()
we can also add multiple new variables as they just have to be separated by a ,
.
starwars %>%
mutate(avgWeight = mean(mass, na.rm = TRUE),
avgHeight = mean(height, na.rm = TRUE)) %>%
select(name, height, mass, avgHeight, avgWeight)
## # A tibble: 87 x 5
## name height mass avgHeight avgWeight
## <chr> <int> <dbl> <dbl> <dbl>
## 1 Luke Skywalker 172 77 174. 97.3
## 2 C-3PO 167 75 174. 97.3
## 3 R2-D2 96 32 174. 97.3
## 4 Darth Vader 202 136 174. 97.3
## 5 Leia Organa 150 49 174. 97.3
## 6 Owen Lars 178 120 174. 97.3
## 7 Beru Whitesun lars 165 75 174. 97.3
## 8 R5-D4 97 32 174. 97.3
## 9 Biggs Darklighter 183 84 174. 97.3
## 10 Obi-Wan Kenobi 182 77 174. 97.3
## # ... with 77 more rows
The more interesting question is whether we can apply a vectorized function to multiple columns simultaneously. Suppose we want to replace the NA
values with "unknown"
in all columns of type character. The relevant columns (.cols =
) are selected with the selection helper where()
. To replace the NA
values with "unknown"
I am using if_else()
. To be on the safe side I don’t want to overwrite the original columns. With the argument .names =
each modified column is added to the starwars
dataset with a new name. We can access the original column names via {col}
.
starwars %>%
mutate(
across(
.cols = where(is.character),
.fns = ~if_else(condition = is.na(.x),
true = "unknown",
false = as.character(.x)),
.names = "new_{col}"
)
) %>%
select(name, hair_color, new_hair_color)
## # A tibble: 87 x 3
## name hair_color new_hair_color
## <chr> <chr> <chr>
## 1 Luke Skywalker blond blond
## 2 C-3PO <NA> unknown
## 3 R2-D2 <NA> unknown
## 4 Darth Vader none none
## 5 Leia Organa brown brown
## 6 Owen Lars brown, grey brown, grey
## 7 Beru Whitesun lars brown brown
## 8 R5-D4 <NA> unknown
## 9 Biggs Darklighter black black
## 10 Obi-Wan Kenobi auburn, white auburn, white
## # ... with 77 more rows
All characters with missing hair_color
now have a new_hair_color
of value "unknown"
. The same applies to the variables species
and homeworld
, which contain NA
values.
After we’re done with manipulating the columns of a dataset let us take into account the row dimension of the starwars
dataset. Although we also have to use the variables (columns) of our dataset in order to access the rows are, they are not altered!
Below I only keep Star Wars characters with the hair_color
blond. This is done by usin the function filter()
.
starwars %>%
filter(hair_color == "blond")
## # A tibble: 3 x 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke Sky~ 172 77 blond fair blue 19 male mascu~
## 2 Anakin S~ 188 84 blond fair blue 41.9 male mascu~
## 3 Finis Va~ 170 NA blond fair blue 91 male mascu~
## # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
This function allows you to select, remove, and duplicate rows of a dataset. An integer value must be supplied to it, which indexes the row in the dataset. Below I select rows 1 to 5 of the dataset. Note that the sorting of the dataset influences the output of slice()
. Remember that the :
operator means from row 1 to row 5.
starwars %>%
slice(1:5)
## # A tibble: 5 x 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke Sky~ 172 77 blond fair blue 19 male mascu~
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu~
## 3 R2-D2 96 32 <NA> white, bl~ red 33 none mascu~
## 4 Darth Va~ 202 136 none white yellow 41.9 male mascu~
## 5 Leia Org~ 150 49 brown light brown 19 fema~ femin~
## # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
With functions slice_min()
or slice_max()
we can select the top or bottom n =
values after the dataset is ordered by a variable. Below I select the five tallest characters in the starwars
dataset.
starwars %>%
slice_max(order_by = height, n = 5)
## # A tibble: 5 x 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Yarael P~ 264 NA none white yellow NA male mascu~
## 2 Tarfful 234 136 brown brown blue NA male mascu~
## 3 Lama Su 229 88 none grey black NA male mascu~
## 4 Chewbacca 228 112 brown unknown blue 200 male mascu~
## 5 Roos Tar~ 224 82 none grey orange NA male mascu~
## # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
Note that instead of an explicit number of rows you might also take a fraction (prop =
) of the data.
With function slice_sample()
we can randomly select rows of the dataset. The prop =
argument handles the fraction of the data which is selected. Setting replace = TRUE
allows that a row can be selected multiple times. Always consider using set.seed()
when working with random function components.
set.seed(2)
starwars %>%
slice_sample(prop = 0.15, replace = TRUE)
## # A tibble: 13 x 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 BB8 NA NA none none black NA none mascu~
## 2 Raymus ~ 188 79 brown light brown NA male mascu~
## 3 Taun We 213 NA none grey black NA fema~ femin~
## 4 Owen La~ 178 120 brown, gr~ light blue 52 male mascu~
## 5 Nute Gu~ 191 90 none mottled g~ red NA male mascu~
## 6 R5-D4 97 32 <NA> white, red red NA none mascu~
## 7 Wedge A~ 170 77 brown fair hazel 21 male mascu~
## 8 Tion Me~ 206 80 none grey black NA male mascu~
## 9 Shaak Ti 178 57 none red, blue~ black NA fema~ femin~
## 10 Shmi Sk~ 163 NA black fair brown 72 fema~ femin~
## 11 Kit Fis~ 196 87 none green black NA male mascu~
## 12 San Hill 191 NA none grey gold NA male mascu~
## 13 Bail Pr~ 191 NA black tan brown 67 male mascu~
## # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
Instead of a fraction of the data we might also set the number of rows we want to filter from the data by using the n =
argument.
This functions removes all duplicate rows from the dataset. When working with a dataset that you’ve never touched before, this function might be helpful to check the range of values of the variables(s) you are going to use in function filter()
. The function might be applied to the whole dataset, a single column or multiple columns.
starwars %>%
distinct(hair_color)
## # A tibble: 13 x 1
## hair_color
## <chr>
## 1 blond
## 2 <NA>
## 3 none
## 4 brown
## 5 brown, grey
## 6 black
## 7 auburn, white
## 8 auburn, grey
## 9 white
## 10 grey
## 11 auburn
## 12 blonde
## 13 unknown
There are 13 distinct()
values of hair_color
in the starwars
dataset.
With arrange()
we can alter how the dataset is sorted. The default sorting order is from the lowest to the highest value. The function desc()
lets you reverse that order.
starwars %>%
filter(sex == "male") %>%
arrange(desc(height))
## # A tibble: 60 x 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Yarael ~ 264 NA none white yellow NA male mascu~
## 2 Tarfful 234 136 brown brown blue NA male mascu~
## 3 Lama Su 229 88 none grey black NA male mascu~
## 4 Chewbac~ 228 112 brown unknown blue 200 male mascu~
## 5 Roos Ta~ 224 82 none grey orange NA male mascu~
## 6 Grievous 216 159 none brown, wh~ green, y~ NA male mascu~
## 7 Rugor N~ 206 NA none green orange NA male mascu~
## 8 Tion Me~ 206 80 none grey black NA male mascu~
## 9 Darth V~ 202 136 none white yellow 41.9 male mascu~
## 10 Ki-Adi-~ 198 82 white pale yellow 92 male mascu~
## # ... with 50 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## # films <list>, vehicles <list>, starships <list>
In section 2 we’ve seen many possibilities how to refer to variables in the select()
function. There are also quite a few ways how we can filter()
the cases of a dataset. They might be the result of comparing values, logical/boolean operators or value matching.
Operator | Function |
---|---|
<, > | Less than, greater than |
== | Equal to |
!= | Not equal to |
<=, >= | Less than, greather than or equal to |
%in% | Group membership |
is.na(), !is.na() | Is NA, js not NA |
&, |, !, xor(), any(), all() | Boolean operators |
Also some of the vectorized functions from the previous section can be useful for subsetting the dataset. Below I filter()
cases in the starwars
dataset, which lie between()
row_number()
5 and 10.
starwars %>%
filter(between(row_number(), left = 5, right = 10))
## # A tibble: 6 x 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Leia Org~ 150 49 brown light brown 19 fema~ femin~
## 2 Owen Lars 178 120 brown, gr~ light blue 52 male mascu~
## 3 Beru Whi~ 165 75 brown light blue 47 fema~ femin~
## 4 R5-D4 97 32 <NA> white, red red NA none mascu~
## 5 Biggs Da~ 183 84 black light brown 24 male mascu~
## 6 Obi-Wan ~ 182 77 auburn, w~ fair blue-gray 57 male mascu~
## # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
Suppose we are interested in the average weight of all characters in the starwars
dataset. By using the function summarise()
we just have to plug in a name of the variable to be aggregated as well as an aggregation function like mean()
.
starwars %>%
summarise(avgWeight = mean(mass, na.rm = TRUE))
## # A tibble: 1 x 1
## avgWeight
## <dbl>
## 1 97.3
A single value of the new (aggregated) variable avgWeight
is returned. Remember that we’ve already computed this value in section 3. However, the function mutate()
treats mean()
as a vectorized function and returns not a single but a vector of values which is added to the dataset.
With count()
we can count the number of rows of each unique value of a variable. This function works with numeric and character inputs alike. It is quite similar to the function distinct()
except that we get a second column with the number of occurences in the dataset. Remeber Base R’s table()
function? This is roughly an equivalent to it but much more tidy.
starwars %>%
count(hair_color)
## # A tibble: 13 x 2
## hair_color n
## <chr> <int>
## 1 auburn 1
## 2 auburn, grey 1
## 3 auburn, white 1
## 4 black 13
## 5 blond 3
## 6 blonde 1
## 7 brown 18
## 8 brown, grey 1
## 9 grey 1
## 10 none 37
## 11 unknown 1
## 12 white 4
## 13 <NA> 5
There is also a vectorised version of this function called add_count()
. It stores the result of count()
in a column n
and adds it to the dataset in a mutate()
-like fashion.
starwars %>%
add_count(hair_color) %>%
select(name, hair_color, n)
## # A tibble: 87 x 3
## name hair_color n
## <chr> <chr> <int>
## 1 Luke Skywalker blond 3
## 2 C-3PO <NA> 5
## 3 R2-D2 <NA> 5
## 4 Darth Vader none 37
## 5 Leia Organa brown 18
## 6 Owen Lars brown, grey 1
## 7 Beru Whitesun lars brown 18
## 8 R5-D4 <NA> 5
## 9 Biggs Darklighter black 13
## 10 Obi-Wan Kenobi auburn, white 1
## # ... with 77 more rows
In section 3 we have seen that a vectorized function can be applied to multiple columns when using mutate()
. The same is also true for aggregation functions. Suppose you want to compute the average height and weight of all characters in the starwars
dataset. With across()
we can select multiple columns (below I use the selection helper c()
) and then apply one or many aggregation function(s) like mean()
to each column.
starwars %>%
summarise(
across(
.cols = c(height, mass),
.fns = mean
)
)
## # A tibble: 1 x 2
## height mass
## <dbl> <dbl>
## 1 NA NA
Okay, this did not work as expected as for both columns NA
was returned. Can you guess why? Before we used the argument na.rm = TRUE
when computing the average weight. This argument drops any NA
values in the dataset. We also have to do this within across()
. There are multiple ways of doing this, however I recommend the purrr-style formula approach below. With .x
we reference each element in the .cols =
argument and then apply the ~mean()
function while being able to access all arguments inside it.
starwars %>%
summarise(
across(
.cols = c(height, mass),
.fns = ~mean(.x, na.rm = TRUE)
)
)
## # A tibble: 1 x 2
## height mass
## <dbl> <dbl>
## 1 174. 97.3
In general you can think of aggreation as the supplying of multiple values for the input and obatining a single value in return. There are different types of aggregation functions and many of them are already implemented in Base R. Below is a comprehensive table of the most common functions.
Metric | Function | Purpose |
---|---|---|
Position | stats::mean() | Mean value of a vector |
stats::median() | Median value of a vector | |
Logical | stats::mean() | Proportion of TRUE values |
base::sum() | Number of TRUE values | |
Order | dplyr::nth() | Nth value of a vector |
dplyr::first() | First value of a vector | |
dplyr::last() | Last value of a vector | |
Rank | stats::max() | Maximum value in a vector |
stats::min() | Minimum value in a vector | |
stats::quantile() | Minimum value in a vector | |
Spread | stats::var() | Variance of a vector |
stats::sd() | Standard deviation of a vector | |
stats::IQR() | Inter quartile range of a vector | |
stats::mad() | Mean absolute deviation | |
Count | dplyr::n() | Number of values/rows |
dplyr::n_distinct() | Number of unique values in a vector | |
sum(!is.na()) | Number of non-missing values |
Sometimes you don’t want to apply an aggregate or vectorized function on the whole dataset but only to specific subsets (or groups) of the data.
The function group_by()
lets you just do that by grouping the data into rows with the same value of one or multiple columns. Below I am grouping the starwars dataset by a character’s species
.
starwars %>%
group_by(species)
## # A tibble: 87 x 14
## # Groups: species [38]
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke Sk~ 172 77 blond fair blue 19 male mascu~
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu~
## 3 R2-D2 96 32 <NA> white, bl~ red 33 none mascu~
## 4 Darth V~ 202 136 none white yellow 41.9 male mascu~
## 5 Leia Or~ 150 49 brown light brown 19 fema~ femin~
## 6 Owen La~ 178 120 brown, gr~ light blue 52 male mascu~
## 7 Beru Wh~ 165 75 brown light blue 47 fema~ femin~
## 8 R5-D4 97 32 <NA> white, red red NA none mascu~
## 9 Biggs D~ 183 84 black light brown 24 male mascu~
## 10 Obi-Wan~ 182 77 auburn, w~ fair blue-gray 57 male mascu~
## # ... with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## # films <list>, vehicles <list>, starships <list>
Alright, this output does not look too interesing but the only thing we did so far is to supply our dataset with a structure of the groups.
For example, group_by()
might be used in combination with mutate()
which is very useful when computing a new variable for different subsets of the data.
starwars %>%
group_by(species) %>%
mutate(avgSpeciesWeight = mean(mass, na.rm = TRUE)) %>%
select(name, species, mass, avgSpeciesWeight)
## # A tibble: 87 x 4
## # Groups: species [38]
## name species mass avgSpeciesWeight
## <chr> <chr> <dbl> <dbl>
## 1 Luke Skywalker Human 77 82.8
## 2 C-3PO Droid 75 69.8
## 3 R2-D2 Droid 32 69.8
## 4 Darth Vader Human 136 82.8
## 5 Leia Organa Human 49 82.8
## 6 Owen Lars Human 120 82.8
## 7 Beru Whitesun lars Human 75 82.8
## 8 R5-D4 Droid 32 69.8
## 9 Biggs Darklighter Human 84 82.8
## 10 Obi-Wan Kenobi Human 77 82.8
## # ... with 77 more rows
See how inside the new avgSpeciesWeight
column the values correspond to the character’s species instead of the global mean of all Star Wars characters?
On the other hand we may also use an aggregate function with summarise()
and exploit the grouping structure.
starwars %>%
group_by(species) %>%
summarise(avgWeight = mean(mass, na.rm = TRUE))
## # A tibble: 38 x 2
## species avgWeight
## <chr> <dbl>
## 1 Aleena 15
## 2 Besalisk 102
## 3 Cerean 82
## 4 Chagrian NaN
## 5 Clawdite 55
## 6 Droid 69.8
## 7 Dug 40
## 8 Ewok 20
## 9 Geonosian 80
## 10 Gungan 74
## # ... with 28 more rows
How would you find out which species has the lowest average weight without inspecting the output in the console?
starwars %>%
group_by(species) %>%
summarise(avgWeight = mean(mass, na.rm = TRUE)) %>%
slice_min(order_by = avgWeight, n = 1)
## # A tibble: 1 x 2
## species avgWeight
## <chr> <dbl>
## 1 Aleena 15
It’s the Aleena
race. According to Wookieepedia these guys are “a species of short, sentient reptiles.”
The function n()
was mentioned in the digression of the previous section but is especially useful when working with grouped data. Let us take a look what the function n()
does when used within summarise()
after the dataset was grouped. It counts the number of rows for each value of homeworld
and hence reflects the number of Star Wars characters from this planet in the dataset.
starwars %>%
group_by(homeworld) %>%
summarise(n = n())
## # A tibble: 49 x 2
## homeworld n
## <chr> <int>
## 1 Alderaan 3
## 2 Aleen Minor 1
## 3 Bespin 1
## 4 Bestine IV 1
## 5 Cato Neimoidia 1
## 6 Cerea 1
## 7 Champala 1
## 8 Chandrila 1
## 9 Concord Dawn 1
## 10 Corellia 2
## # ... with 39 more rows
There are more of these functions, which return information about the current group or current variable and only work in the context of summarise()
or mutate()
. Below I am creating an planet_id
variable related to a character’s homeworld
.
starwars %>%
group_by(homeworld) %>%
mutate(planet_id = cur_group_id()) %>%
select(name, homeworld, planet_id)
## # A tibble: 87 x 3
## # Groups: homeworld [49]
## name homeworld planet_id
## <chr> <chr> <int>
## 1 Luke Skywalker Tatooine 40
## 2 C-3PO Tatooine 40
## 3 R2-D2 Naboo 28
## 4 Darth Vader Tatooine 40
## 5 Leia Organa Alderaan 1
## 6 Owen Lars Tatooine 40
## 7 Beru Whitesun lars Tatooine 40
## 8 R5-D4 Tatooine 40
## 9 Biggs Darklighter Tatooine 40
## 10 Obi-Wan Kenobi Stewjon 38
## # ... with 77 more rows
In total there are 49 unique values in variable homeworld
, hence the planet_id
takes on the values between 1 and 49
There is also a function called ungroup()
which removes the grouping information. When carrying out complex queries with the %>%
it can be advisable to remove the grouping structure, for example because it becomes more clear during which steps aggregation functions are applied to the data. Let’s try to verify my previous statement that the planet_id
takes on the values between 1 and 49. After computing planet_id
I select all unique values of it by using distinct()
.
starwars %>%
group_by(homeworld) %>%
mutate(planet_id = cur_group_id()) %>%
select(name, homeworld, planet_id) %>%
distinct(planet_id)
## # A tibble: 49 x 2
## # Groups: homeworld [49]
## homeworld planet_id
## <chr> <int>
## 1 Tatooine 40
## 2 Naboo 28
## 3 Alderaan 1
## 4 Stewjon 38
## 5 Eriadu 15
## 6 Kashyyyk 23
## 7 Corellia 10
## 8 Rodia 32
## 9 Nal Hutta 29
## 10 Bestine IV 4
## # ... with 39 more rows
Although I used distinct()
only on the variable planet_id
the variable homeworld
is still present in the output. Now let me remove the grouping structure with ungroup()
.
starwars %>%
group_by(homeworld) %>%
mutate(planet_id = cur_group_id()) %>%
select(name, homeworld, planet_id) %>%
ungroup() %>%
distinct(planet_id)
## # A tibble: 49 x 1
## planet_id
## <int>
## 1 40
## 2 28
## 3 1
## 4 38
## 5 15
## 6 23
## 7 10
## 8 32
## 9 29
## 10 4
## # ... with 39 more rows
See the difference? In this setting it does’nt do any harm to keep the grouping structure but depending on the situation it might cause some unwanted results.
This function can be used to group data into individual rows (as dplyr
functions compute results for each row). Also you may apply functions to list-columns. In the tidyr tutorial the rowwise()
function is discussed in more detail.
Sometimes neither manipulating nor summarising doesn’t get you any further with your analysis (and of course you have also exploited various grouping strucutres). After thorough searching you find another dataset which complements your previous findings but now you’re faced with the challenge to merge them together somehow. There are many pitfalls when linking different datasets so this section might be a bit more challening than the previous ones.
Below I stored the name
and mass
of the Star Wars characters in two different tibbles sw_names
and sw_mass
. The function bind_cols()
lets us combine variables from different sources into a single dataset. Take care that the separated datasets are ordered in the same way before glueing them together. The columns will not be matched by an id (= common link in both datasets) with bind_cols()
so be careful that they are ordered in the same way.
sw_names <-
starwars %>%
select(name)
sw_mass <-
starwars %>%
select(mass)
bind_cols(sw_names, sw_mass)
## # A tibble: 87 x 2
## name mass
## <chr> <dbl>
## 1 Luke Skywalker 77
## 2 C-3PO 75
## 3 R2-D2 32
## 4 Darth Vader 136
## 5 Leia Organa 49
## 6 Owen Lars 120
## 7 Beru Whitesun lars 75
## 8 R5-D4 32
## 9 Biggs Darklighter 84
## 10 Obi-Wan Kenobi 77
## # ... with 77 more rows
Next I am creating two new datasets. One contains all Star Wars characters with a mass of at least 100 kg. The other contains characters which are at least 1,90 m tall. Remember that the variables (columns) are not altered when function filter()
is used. Now, with function bind_rows()
I can combine the cases of both datasets into a single dataset.
sw_big <-
starwars %>%
select(name, height, mass) %>%
filter(mass >= 100 )
sw_tall <-
starwars %>%
select(name, height, mass) %>%
filter(height >= 190)
bind_rows(sw_big, sw_tall)
## # A tibble: 33 x 3
## name height mass
## <chr> <int> <dbl>
## 1 Darth Vader 202 136
## 2 Owen Lars 178 120
## 3 Chewbacca 228 112
## 4 Jabba Desilijic Tiure 175 1358
## 5 Jek Tono Porkins 180 110
## 6 IG-88 200 140
## 7 Bossk 190 113
## 8 Dexter Jettster 198 102
## 9 Grievous 216 159
## 10 Tarfful 234 136
## # ... with 23 more rows
Using argument .id =
you can supply a column name which links each row to its original dataset.
bind_rows(
"big guys" = sw_big,
"tall guys" = sw_tall,
.id = "id"
) %>%
group_by(id) %>%
slice(1:3)
## # A tibble: 6 x 4
## # Groups: id [2]
## id name height mass
## <chr> <chr> <int> <dbl>
## 1 big guys Darth Vader 202 136
## 2 big guys Owen Lars 178 120
## 3 big guys Chewbacca 228 112
## 4 tall guys Darth Vader 202 136
## 5 tall guys Chewbacca 228 112
## 6 tall guys IG-88 200 140
Can you see what happens to characters being big and tall?
Set operations are used to compare two datasets with each other.
Let’s find rows that appear both in sw_big
and sw_tall
. This can be done with intersect()
.
intersect(sw_big, sw_tall)
## # A tibble: 7 x 3
## name height mass
## <chr> <int> <dbl>
## 1 Darth Vader 202 136
## 2 Chewbacca 228 112
## 3 IG-88 200 140
## 4 Bossk 190 113
## 5 Dexter Jettster 198 102
## 6 Grievous 216 159
## 7 Tarfful 234 136
With setdiff()
on the other hand, we can find rows that appear in sw_big
but not sw_tall
.
setdiff(sw_big, sw_tall)
## # A tibble: 3 x 3
## name height mass
## <chr> <int> <dbl>
## 1 Owen Lars 178 120
## 2 Jabba Desilijic Tiure 175 1358
## 3 Jek Tono Porkins 180 110
Then with union()
we can find rows that appear in sw_big
or sw_tall
while removing duplicate cases. Note that there is also union_all()
to retain the duplicate values.
union(sw_big, sw_tall)
## # A tibble: 26 x 3
## name height mass
## <chr> <int> <dbl>
## 1 Darth Vader 202 136
## 2 Owen Lars 178 120
## 3 Chewbacca 228 112
## 4 Jabba Desilijic Tiure 175 1358
## 5 Jek Tono Porkins 180 110
## 6 IG-88 200 140
## 7 Bossk 190 113
## 8 Dexter Jettster 198 102
## 9 Grievous 216 159
## 10 Tarfful 234 136
## # ... with 16 more rows
Finally, with function setequal()
we can test whether two datasets contain the exact same rows. The sorting order does not matter here.
setequal(sw_big, sw_tall)
## [1] FALSE
When using a mutating join you combine different datasets in order to add additional information to your analysis. Hence, new columns will be added to your initial dataset. To do so, an id variable which can identify rows between the datasets must be present in your data. With the help of Wookieepedia I created a dataset called worlds
, which contains information about astronomical objects of the Star Wars universe. Have a look at it below.
print(worlds)
## # A tibble: 49 x 4
## object type moons species
## <chr> <chr> <dbl> <lgl>
## 1 Alderaan planet 0 NA
## 2 Aleen Minor planet 0 NA
## 3 Bespin planet 2 NA
## 4 Bestine IV planet 0 NA
## 5 Cato Neimoidia planet 2 NA
## 6 Cerea planet 0 NA
## 7 Champala planet 0 NA
## 8 Chandrila planet 1 NA
## 9 Concord Dawn planet 3 NA
## 10 Corellia planet 2 NA
## # ... with 39 more rows
With the left_join()
two datasets x and y are combined by by joining all matching values from y to x. Non-matching values of x still remain in the output but will display NA
by default for any column of y. In the starwars
dataset the column homeworld
contains values which also occur in the object
column of the worlds
dataset. Use the argument by =
to supply the column names on which you want to match the two datasets.
left_join(
x = starwars,
y = worlds,
by = c("homeworld" = "object")
) %>%
select(name, homeworld, type, moons)
## # A tibble: 87 x 4
## name homeworld type moons
## <chr> <chr> <chr> <dbl>
## 1 Luke Skywalker Tatooine planet 3
## 2 C-3PO Tatooine planet 3
## 3 R2-D2 Naboo <NA> NA
## 4 Darth Vader Tatooine planet 3
## 5 Leia Organa Alderaan planet 0
## 6 Owen Lars Tatooine planet 3
## 7 Beru Whitesun lars Tatooine planet 3
## 8 R5-D4 Tatooine planet 3
## 9 Biggs Darklighter Tatooine planet 3
## 10 Obi-Wan Kenobi Stewjon astronomical object 0
## # ... with 77 more rows
Now each character in the starwars
dataset shows the type
and the number of moons
of its homeworld
. Note that the number of cases in the starwars
dataset was not affected by this operation. Also note that for characters with the homeworld
Naboo NA
values were returned. Either this object
is not included in the worlds
dataset or some information are missing. Can we find out what’s going on here?
With the function inner_join()
only rows that match between two datasets are returned. So when there is no object
called Naboo in the worlds
dataset any Star Wars character with this homeworld
will be dropped from the output.
inner_join(
x = starwars,
y = worlds,
by = c("homeworld" = "object")
) %>%
select(name, homeworld, type, moons)
## # A tibble: 66 x 4
## name homeworld type moons
## <chr> <chr> <chr> <dbl>
## 1 Luke Skywalker Tatooine planet 3
## 2 C-3PO Tatooine planet 3
## 3 Darth Vader Tatooine planet 3
## 4 Leia Organa Alderaan planet 0
## 5 Owen Lars Tatooine planet 3
## 6 Beru Whitesun lars Tatooine planet 3
## 7 R5-D4 Tatooine planet 3
## 8 Biggs Darklighter Tatooine planet 3
## 9 Obi-Wan Kenobi Stewjon astronomical object 0
## 10 Anakin Skywalker Tatooine planet 3
## # ... with 56 more rows
Looks like R2-D2 is gone now!
There is a third way to combine two datasets with function full_join()
, which retains all rows and all values. For example, there might be some objects
in the worlds
dataset which are uninhabitable (even to the most adapted of creatures). It is hence unlikely that any character in the starwars
dataset will have such an object
as its homeworld
. In such a case we would expect that the full_join()
returns values for columns of the worlds
dataset but NA
for columns of the starwars
dataset.
full_join(
x = starwars,
y = worlds,
by = c("homeworld" = "object")
) %>%
select(name, homeworld, type, moons) %>%
filter(is.na(name))
## # A tibble: 2 x 4
## name homeworld type moons
## <chr> <chr> <chr> <dbl>
## 1 <NA> Dagobah planet 1
## 2 <NA> Jakku planet 2
No Star Wars character in the starwars
dataset is from Dagobah or Jakku.
There is also a possibility to perform a cross-join (aka cartesian product) which generates all combinations of x and y. This is not really meaningful given the information supplied in the starwars
and worlds
dataset but it’s simply done by specifying by = character()
.
left_join(
x = starwars,
y = worlds,
by = character()
) %>%
select(name, homeworld, moons)
## # A tibble: 4,263 x 3
## name homeworld moons
## <chr> <chr> <dbl>
## 1 Luke Skywalker Tatooine 0
## 2 Luke Skywalker Tatooine 0
## 3 Luke Skywalker Tatooine 2
## 4 Luke Skywalker Tatooine 0
## 5 Luke Skywalker Tatooine 2
## 6 Luke Skywalker Tatooine 0
## 7 Luke Skywalker Tatooine 0
## 8 Luke Skywalker Tatooine 1
## 9 Luke Skywalker Tatooine 3
## 10 Luke Skywalker Tatooine 2
## # ... with 4,253 more rows
The output is a pretty long dataset with 4.263 rows.
With a filtering join one table can be filtered against the rows of another.
When using a filtering join no columns from dataset y are added to x. Hence, only cases of starwars
that have a match in worlds
are returned. We can thus see what will be included in a join.
semi_join(
x = starwars,
y = worlds,
by = c("homeworld" = "object")
) %>%
select(name, homeworld)
## # A tibble: 66 x 2
## name homeworld
## <chr> <chr>
## 1 Luke Skywalker Tatooine
## 2 C-3PO Tatooine
## 3 Darth Vader Tatooine
## 4 Leia Organa Alderaan
## 5 Owen Lars Tatooine
## 6 Beru Whitesun lars Tatooine
## 7 R5-D4 Tatooine
## 8 Biggs Darklighter Tatooine
## 9 Obi-Wan Kenobi Stewjon
## 10 Anakin Skywalker Tatooine
## # ... with 56 more rows
All characters whose homeworld
can be found in the worlds
dataset are returned.
The anti_join()
does the opposite and returns characters of the starwars
dataset which don’t have a match in worlds
. We can hence see what will not be included in a join.
anti_join(
x = starwars,
y = worlds,
by = c("homeworld" = "object")
) %>%
select(name, homeworld)
## # A tibble: 21 x 2
## name homeworld
## <chr> <chr>
## 1 R2-D2 Naboo
## 2 Yoda <NA>
## 3 Palpatine Naboo
## 4 IG-88 <NA>
## 5 Arvel Crynyd <NA>
## 6 Qui-Gon Jinn <NA>
## 7 Jar Jar Binks Naboo
## 8 Roos Tarpals Naboo
## 9 Rugor Nass Naboo
## 10 Ric Olié Naboo
## # ... with 11 more rows
These are either characters from Naboo, which is missing in the homeworlds
dataset, or characters with a missing value in variable homeworld
.
With the function nest_join()
one dataset can be inner joined to another into a nested data frame. Sounds quite complex so let’s take a look at what’s happening.
nest_join(
x = starwars,
y = worlds,
by = c("homeworld" = "object")
) %>%
select(name, worlds)
## # A tibble: 87 x 2
## name worlds
## <chr> <list>
## 1 Luke Skywalker <tibble [1 x 3]>
## 2 C-3PO <tibble [1 x 3]>
## 3 R2-D2 <tibble [0 x 3]>
## 4 Darth Vader <tibble [1 x 3]>
## 5 Leia Organa <tibble [1 x 3]>
## 6 Owen Lars <tibble [1 x 3]>
## 7 Beru Whitesun lars <tibble [1 x 3]>
## 8 R5-D4 <tibble [1 x 3]>
## 9 Biggs Darklighter <tibble [1 x 3]>
## 10 Obi-Wan Kenobi <tibble [1 x 3]>
## # ... with 77 more rows
The matching rows of the worlds
are now stored in one single list column instead of three separate columns. This safes some space for sure but it also makes things a bit more complex when working with the column worlds
. There is a function called unnest()
in the tidyr
package which can entangle this column. Read more about it in this tutorial.
There are several ways how you can carry out a mutating or filtering join with argument by =
.
Setting | Function |
---|---|
one identical column name (for matching) per dataset | by = "id1" |
multiple identical column names (for matching) per dataset | by = c("id1", "id2") |
one different column name (for matching) per dataset | by = c("ida1" = "idb1") |
multiple different column names (for matching) per dataset | by = c("ida1" = "idb1", "ida2 = idb2") |
identical column names not used for matching | by = c("ida1" = "idb1", suffix = c("1","2")) |
Suppose that the worlds
dataset also contains a species
variable which represents the most common creatures on the planet. Without the suffix =
argument the species
column will be duplicated with a default suffix .x
and .y
. We can change the suffix to make it clear which column represents which information. I’m using _chr
for the species of the Star Wars character and _wrld
for creature living on a planet.
left_join(
x = starwars,
y = worlds,
by = c("homeworld" = "object"),
suffix = c("_chr","_wrld")
) %>%
select(name, species_chr, homeworld, species_wrld)
## # A tibble: 87 x 4
## name species_chr homeworld species_wrld
## <chr> <chr> <chr> <lgl>
## 1 Luke Skywalker Human Tatooine NA
## 2 C-3PO Droid Tatooine NA
## 3 R2-D2 Droid Naboo NA
## 4 Darth Vader Human Tatooine NA
## 5 Leia Organa Human Alderaan NA
## 6 Owen Lars Human Tatooine NA
## 7 Beru Whitesun lars Human Tatooine NA
## 8 R5-D4 Droid Tatooine NA
## 9 Biggs Darklighter Human Tatooine NA
## 10 Obi-Wan Kenobi Human Stewjon NA
## # ... with 77 more rows