In this tutorial the tibble package will be discussed in more detail. The dplyr package contains the starwars dataset, which is a tibble. We’ve seen this dataset before in this tutorial.
Let’s load both packages first.
library(tibble)
library(dplyr)
The tibble is in fact a data frame but is more convenient to work with. Below I’ll contrast the differences between both data types.
When you print a tibble, it only shows the first ten rows and all the columns that fit on the screen. It also prints an abbreviated description of the column type and uses font styles and color for highlighting (in R’s console).
starwars %>%
select(name, height, mass) %>%
print()
## # A tibble: 87 x 3
## name height mass
## <chr> <int> <dbl>
## 1 Luke Skywalker 172 77
## 2 C-3PO 167 75
## 3 R2-D2 96 32
## 4 Darth Vader 202 136
## 5 Leia Organa 150 49
## 6 Owen Lars 178 120
## 7 Beru Whitesun lars 165 75
## 8 R5-D4 97 32
## 9 Biggs Darklighter 183 84
## 10 Obi-Wan Kenobi 182 77
## # ... with 77 more rows
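If ten rows are not enough, the print method for tibbles accepts the arguments n and width. The following is a small sketch of how that might be used; the object name heights_and_masses is only an illustrative choice and not part of the original example.

```r
library(tibble)
library(dplyr)

# Illustrative helper object holding the three columns used above.
heights_and_masses <- starwars %>%
  select(name, height, mass)

# n sets the number of rows to show; width = Inf prints all columns
# instead of only those that fit on the screen.
print(heights_and_masses, n = 15, width = Inf)
```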
Tibbles are stricter about subsetting. The function [ is used to extract rows and/or columns from a dataset. When applied to a tibble it always returns another tibble.
class(starwars[,1])
## [1] "tbl_df" "tbl" "data.frame"
Tibbles are also stricter with the $ operator. Tibbles never do partial matching; if a column does not exist, they issue a warning and return NULL.
starwars$b
## NULL
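Because [ always returns a tibble, a different tool is needed when the column itself is wanted as a vector. The sketch below shows three common options; the object names one_column and name_vector are just illustrative.

```r
library(tibble)
library(dplyr)

# `[` with a single column still returns a tibble.
one_column <- starwars[, "name"]
class(one_column)

# `[[`, `$` and dplyr::pull() return the underlying vector instead.
name_vector <- starwars[["name"]]
class(name_vector)
class(starwars$name)
class(pull(starwars, name))

# A missing column gives NULL (plus a warning, silenced here).
suppressWarnings(starwars$b)
```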
First I have to convert the starwars dataset to a conventional data frame with the function as.data.frame().
starwars_df <- as.data.frame(starwars)
When printing starwars_df all rows are shown in the output. To save some document space I keep only the first 15 rows using dplyr::slice().
starwars_df %>%
select(name, height, mass) %>%
slice(1:15) %>%
print()
## name height mass
## 1 Luke Skywalker 172 77
## 2 C-3PO 167 75
## 3 R2-D2 96 32
## 4 Darth Vader 202 136
## 5 Leia Organa 150 49
## 6 Owen Lars 178 120
## 7 Beru Whitesun lars 165 75
## 8 R5-D4 97 32
## 9 Biggs Darklighter 183 84
## 10 Obi-Wan Kenobi 182 77
## 11 Anakin Skywalker 188 84
## 12 Wilhuff Tarkin 180 NA
## 13 Chewbacca 228 112
## 14 Han Solo 180 80
## 15 Greedo 173 74
When subsetting a data frame with the function [ it sometimes returns a data frame and sometimes just a vector.
class(starwars_df[,1])
## [1] "character"
As opposed to tibbles, partial matching is also possible. When attempting to extract the non-existent column b from the starwars_df data frame, the column birth_year is returned. This behavior is not always desirable and can lead to unwanted results.
starwars_df$b
## [1] 19.0 112.0 33.0 41.9 19.0 52.0 47.0 NA 24.0 57.0 41.9 64.0
## [13] 200.0 29.0 44.0 600.0 21.0 NA 896.0 82.0 31.5 15.0 53.0 31.0
## [25] 37.0 41.0 48.0 NA 8.0 NA 92.0 NA 91.0 52.0 NA NA
## [37] NA NA NA 62.0 72.0 54.0 NA 48.0 NA NA NA 72.0
## [49] 92.0 NA NA NA NA NA 22.0 NA NA NA 82.0 NA
## [61] 58.0 40.0 NA 102.0 67.0 66.0 NA NA NA NA NA NA
## [73] NA NA NA NA NA NA NA NA NA NA NA NA
## [85] NA NA 46.0
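With a conventional data frame both surprises can be avoided by being explicit. The following is a minimal sketch, assuming the same starwars_df conversion as above: drop = FALSE keeps the data frame structure, and [[ matches column names exactly by default.

```r
library(dplyr)

starwars_df <- as.data.frame(starwars)

# drop = FALSE keeps a single column as a data frame.
single_col <- starwars_df[, 1, drop = FALSE]
class(single_col)

# `[[` does exact matching by default, so "b" does not find birth_year.
exact_lookup <- starwars_df[["b"]]
is.null(exact_lookup)
```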
In addition to print(), the function glimpse() can be used to get an overview of the starwars dataset. In the output the columns run down the page and the data runs across, which makes it possible to see every column in a data frame. The base R function str() is somewhat related to glimpse().
starwars %>%
glimpse()
## Rows: 87
## Columns: 14
## $ name <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Or~
## $ height <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2~
## $ mass <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.~
## $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", N~
## $ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", "~
## $ eye_color <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue",~
## $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, ~
## $ sex <chr> "male", "none", "none", "male", "female", "male", "female",~
## $ gender <chr> "masculine", "masculine", "masculine", "masculine", "femini~
## $ homeworld <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "T~
## $ species <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Huma~
## $ films <list> <"The Empire Strikes Back", "Revenge of the Sith", "Return~
## $ vehicles <list> <"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "Imp~
## $ starships <list> <"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1",~
Next we’ll take a look at functions that create a tibble from scratch.
With the function tibble() a data frame is constructed. The returned data frame has the class tbl_df, in addition to data.frame. This allows the tibble to exhibit special behaviours like the enhanced printing discussed in section 1. Also, character vectors are not coerced to factors, list-columns are expressly anticipated, and column names are not modified. Below I am creating a tibble based on information about three characters of the starwars dataset.
tibble(
name = c("Luke Skywalker", "C-3PO", "R2-D2"),
height = c(172, 167, 96),
mass = c(77.0, 75.0, 32.0)
)
## # A tibble: 3 x 3
## name height mass
## <chr> <dbl> <dbl>
## 1 Luke Skywalker 172 77
## 2 C-3PO 167 75
## 3 R2-D2 96 32
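To illustrate the point about column names, here is a small sketch comparing data.frame() and tibble(). The column names body mass, height_cm and height_m are made up for this illustration. It also shows that tibble() builds columns in order, so a later column may refer to an earlier one.

```r
library(tibble)

# data.frame() repairs non-syntactic names by default ...
df_names <- names(data.frame(`body mass` = c(77, 75)))
df_names

# ... while tibble() leaves them untouched.
tbl_names <- names(tibble(`body mass` = c(77, 75)))
tbl_names

# Columns are created sequentially, so height_m can use height_cm.
heights_tbl <- tibble(
  height_cm = c(172, 167, 96),
  height_m  = height_cm / 100
)
heights_tbl
```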
You can also create a tibble that contains a column which is a tibble itself. Below I added a fourth column which is a tibble consisting of the two columns planet and moons.
tibble(
name = c("Luke Skywalker", "C-3PO", "R2-D2"),
height = c(172, 167, 96),
mass = c(77.0, 75.0, 32.0),
homeworld = tibble(
planet = c("Tatooine", "Tatooine", "Naboo"),
moons = c(3, 3, NA)
)
)
## # A tibble: 3 x 4
## name height mass homeworld$planet $moons
## <chr> <dbl> <dbl> <chr> <dbl>
## 1 Luke Skywalker 172 77 Tatooine 3
## 2 C-3PO 167 75 Tatooine 3
## 3 R2-D2 96 32 Naboo NA
The function tibble_row() constructs a data frame that is guaranteed to occupy one row. Below, a linear model fitted with the function lm() is stored in a single row of the resulting output. The column lm is a so-called list-column.
tibble_row(
a = "model1",
lm = lm(mass ~ height, data = starwars)
)
## # A tibble: 1 x 2
## a lm
## <chr> <list>
## 1 model1 <lm>
Does this also work when using the function tibble()?
tibble(
a = "model1",
lm = lm(mass ~ height, data = starwars)
)
## Error:
## ! All columns in a tibble must be vectors.
## x Column `lm` is a `lm` object.
It does not, because tibble() expects vectors as input for a column unless specified otherwise. Below is a solution that creates a list-column explicitly with the function list(). This topic will be discussed in more detail in this tutorial.
tibble(
a = "model1",
lm = list(lm(mass ~ height, data = starwars))
)
## # A tibble: 1 x 2
## a lm
## <chr> <list>
## 1 model1 <lm>
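Once the model sits in a list-column, it can be pulled out again with [[. The following is a sketch under the assumption that the tibble from the previous example is stored in an object, here called models (a name chosen only for illustration).

```r
library(tibble)
library(dplyr)

# Same construction as above, stored under an illustrative name.
models <- tibble(
  a  = "model1",
  lm = list(lm(mass ~ height, data = starwars))
)

# The column itself is a list ...
class(models$lm)

# ... and its first element is the fitted model.
first_model <- models$lm[[1]]
class(first_model)
coef(first_model)
```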
The function tribble() lets you create a tibble by using an easier-to-read row-by-row layout.
tribble(
~name, ~height, ~mass,
"Luke Skywalker", 172, 77.0,
"C-3PO", 167, 75.0,
"R2-D2", 96, 32.0
)
## # A tibble: 3 x 3
## name height mass
## <chr> <dbl> <dbl>
## 1 Luke Skywalker 172 77
## 2 C-3PO 167 75
## 3 R2-D2 96 32
The function enframe() converts a named atomic vector or list to a one- or two-column data frame. Below I create a vector heights with three numeric values. Using this atomic vector in the function enframe() returns a tibble with the columns name and value. Because the vector has no names, the name column is simply the sequence from 1 to 3.
heights <- c(172, 167, 96)
enframe(heights)
## # A tibble: 3 x 2
## name value
## <int> <dbl>
## 1 1 172
## 2 2 167
## 3 3 96
With the function deframe() this operation can be reversed by converting a two-column data frame to a named vector or list, using the first column as name and the second column as value.
enframed_heights <- enframe(heights)
deframe(enframed_heights)
## 1 2 3
## 172 167 96
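The example above uses an unnamed vector. As a complementary sketch, here is what happens with a named vector; the names luke, c3po and r2d2 and the column labels character and height are invented for this illustration. The arguments name and value of enframe() set the column names of the result.

```r
library(tibble)

# A named vector: the names end up in the first column.
named_heights <- c(luke = 172, c3po = 167, r2d2 = 96)

characters <- enframe(named_heights, name = "character", value = "height")
characters

# deframe() turns the two columns back into a named vector.
round_trip <- deframe(characters)
round_trip
```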
The function as_tibble() turns an existing object, for example a conventional data frame, into a tibble with class tbl_df. Let us turn the starwars_df dataset created in section 1 from a data.frame back to a tbl_df.
starwars_df %>%
as_tibble() %>%
class()
## [1] "tbl_df" "tbl" "data.frame"
The function is_tibble() returns TRUE for tibbles or subclasses thereof, and FALSE for all other objects, including conventional data frames.
As the starwars dataset is already a tibble we expect TRUE to be returned.
is_tibble(starwars)
## [1] TRUE
As we created the starwars_df dataset as a conventional data frame we expect FALSE to be returned.
is_tibble(starwars_df)
## [1] FALSE
There are also functions which allow us to modify existing tibbles.
The function add_row() is a convenient way to add one or more rows of data to an existing data frame. The function add_case() is an alias of add_row() and works the same. Below I am adding a row for the bounty hunter and pod racer Aldar Beedo to the starwars dataset.
starwars %>%
slice(1:5) %>%
add_row(
name = "Aldar Beedo",
height = 130,
mass = 32,
skin_color = "yellow, blue, grey, beige",
eye_color = "orange",
sex = "male",
gender = "masculine",
homeworld = "Ploo II"
)
## # A tibble: 6 x 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <dbl> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke Sky~ 172 77 blond fair blue 19 male mascu~
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu~
## 3 R2-D2 96 32 <NA> white, bl~ red 33 none mascu~
## 4 Darth Va~ 202 136 none white yellow 41.9 male mascu~
## 5 Leia Org~ 150 49 brown light brown 19 fema~ femin~
## 6 Aldar Be~ 130 32 <NA> yellow, b~ orange NA male mascu~
## # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
As you can see, it is not necessary to supply a value for each column of the original dataset; columns that are not supplied are filled with NA.
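By default the new row is appended at the end. The arguments .before and .after of add_row() control where it is placed. A minimal sketch, reusing the values from the example above and an illustrative object name with_new_row:

```r
library(tibble)
library(dplyr)

with_new_row <- starwars %>%
  select(name, height, mass) %>%
  slice(1:3) %>%
  # .before = 1 inserts the new row ahead of the first existing row.
  add_row(name = "Aldar Beedo", height = 130, mass = 32, .before = 1)

with_new_row
```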
We might also add another column with the function add_column(). The arguments .before = and .after = let us control the position of the new column. Below I add a sequential id for each row in the starwars dataset.
starwars %>%
add_column(id = 1:87, .before = "name")
## # A tibble: 87 x 15
## id name height mass hair_color skin_color eye_color birth_year sex
## <int> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 1 Luke Sky~ 172 77 blond fair blue 19 male
## 2 2 C-3PO 167 75 <NA> gold yellow 112 none
## 3 3 R2-D2 96 32 <NA> white, bl~ red 33 none
## 4 4 Darth Va~ 202 136 none white yellow 41.9 male
## 5 5 Leia Org~ 150 49 brown light brown 19 fema~
## 6 6 Owen Lars 178 120 brown, gr~ light blue 52 male
## 7 7 Beru Whi~ 165 75 brown light blue 47 fema~
## 8 8 R5-D4 97 32 <NA> white, red red NA none
## 9 9 Biggs Da~ 183 84 black light brown 24 male
## 10 10 Obi-Wan ~ 182 77 auburn, w~ fair blue-gray 57 male
## # ... with 77 more rows, and 6 more variables: gender <chr>, homeworld <chr>,
## # species <chr>, films <list>, vehicles <list>, starships <list>
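The hard-coded 1:87 only works as long as the dataset has exactly 87 rows. As a slightly more robust sketch, the id can be derived from the number of rows, and .after can be used instead of .before. The object name with_id is an illustrative choice.

```r
library(tibble)
library(dplyr)

with_id <- starwars %>%
  # seq_len(nrow(...)) adapts to the number of rows in the data.
  add_column(id = seq_len(nrow(starwars)), .after = "name")

names(with_id)[1:3]
```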
Generally, it is best to avoid row names, because they are basically a character column with different semantics than every other column. Still, especially when having conventional data frames as input, it can be helpful to know about functions that can manipulate them.
The function column_to_rownames() takes an existing column in a data frame and converts it to rownames. It is important that the column from which the rownames are taken does not contain duplicate values!
starwars %>%
select(name, height, mass) %>%
column_to_rownames(var = "name") %>%
head()
## height mass
## Luke Skywalker 172 77
## C-3PO 167 75
## R2-D2 96 32
## Darth Vader 202 136
## Leia Organa 150 49
## Owen Lars 178 120
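One way to guard against duplicate values is to check for them before converting. The sketch below uses a small made-up tibble called droids in which one name is repeated on purpose; base R's anyDuplicated() detects the problem and make.unique() is one possible way to resolve it.

```r
library(tibble)

# Hypothetical example data with a repeated name.
droids <- tibble(
  name   = c("R2-D2", "R5-D4", "R2-D2"),
  height = c(96, 97, 96)
)

# A value greater than 0 means at least one duplicate exists.
has_duplicates <- anyDuplicated(droids$name) > 0
has_duplicates

# make.unique() appends a numeric suffix to repeated values.
droids$name <- make.unique(droids$name)
droids_rn <- column_to_rownames(droids, var = "name")
droids_rn
```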
With the function has_rownames() we can detect if rownames are present in the dataset. It returns TRUE or FALSE. The original starwars dataset should not contain rownames.
starwars %>%
has_rownames()
## [1] FALSE
However, after adding rownames with the function column_to_rownames() they should exist.
starwars %>%
column_to_rownames(var = "name") %>%
has_rownames()
## [1] TRUE
Rownames can be removed from the dataset with the function remove_rownames(). Note, however, that the name column is then lost from the dataset, because it was converted to rownames first.
starwars %>%
select(name, height, mass) %>%
column_to_rownames(var = "name") %>%
remove_rownames() %>%
slice(1:10)
## height mass
## 1 172 77
## 2 167 75
## 3 96 32
## 4 202 136
## 5 150 49
## 6 178 120
## 7 165 75
## 8 97 32
## 9 183 84
## 10 182 77
If rownames already exist in the dataset they can be stored as a single column with the function rownames_to_column(). With the argument var = a column name can be supplied.
starwars %>%
select(name, height, mass) %>%
column_to_rownames(var = "name") %>%
rownames_to_column(var = "name") %>%
slice(1:10)
## name height mass
## 1 Luke Skywalker 172 77
## 2 C-3PO 167 75
## 3 R2-D2 96 32
## 4 Darth Vader 202 136
## 5 Leia Organa 150 49
## 6 Owen Lars 178 120
## 7 Beru Whitesun lars 165 75
## 8 R5-D4 97 32
## 9 Biggs Darklighter 183 84
## 10 Obi-Wan Kenobi 182 77
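When the goal is a tibble anyway, the rownames argument of as_tibble() can do the conversion in one step. The following is a small sketch with a made-up data frame heights_df that carries rownames:

```r
library(tibble)

# Hypothetical conventional data frame with rownames.
heights_df <- data.frame(
  height    = c(172, 167),
  row.names = c("Luke Skywalker", "C-3PO")
)

# rownames = "name" stores the rownames in a new first column called name.
heights_tbl <- as_tibble(heights_df, rownames = "name")
heights_tbl
```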
The function rowid_to_column() adds a column of ascending sequential row ids, starting at 1, at the start of the data frame.
starwars %>%
select(name, height, mass) %>%
rowid_to_column(var = "id") %>%
slice(1:10)
## # A tibble: 10 x 4
## id name height mass
## <int> <chr> <int> <dbl>
## 1 1 Luke Skywalker 172 77
## 2 2 C-3PO 167 75
## 3 3 R2-D2 96 32
## 4 4 Darth Vader 202 136
## 5 5 Leia Organa 150 49
## 6 6 Owen Lars 178 120
## 7 7 Beru Whitesun lars 165 75
## 8 8 R5-D4 97 32
## 9 9 Biggs Darklighter 183 84
## 10 10 Obi-Wan Kenobi 182 77