R Tutorial: tibble package

1. Introduction

In this tutorial the tibble package will be discussed in more detail. The dplyr package contains the starwars dataset, which is a tibble. We’ve seen this dataset before in this tutorial.

Let’s load both packages first.

library(tibble)
library(dplyr)

A tibble is in fact a data frame, but one that is more convenient to work with. Below I'll contrast the two data types.

Tibble

When you print a tibble, it only shows the first ten rows and all the columns that fit on the screen. It also prints an abbreviated description of the column type and uses font styles and color for highlighting (in R’s console).

starwars %>%
  select(name, height, mass) %>%
  print()

## # A tibble: 87 x 3
##    name               height  mass
##    <chr>               <int> <dbl>
##  1 Luke Skywalker        172    77
##  2 C-3PO                 167    75
##  3 R2-D2                  96    32
##  4 Darth Vader           202   136
##  5 Leia Organa           150    49
##  6 Owen Lars             178   120
##  7 Beru Whitesun lars    165    75
##  8 R5-D4                  97    32
##  9 Biggs Darklighter     183    84
## 10 Obi-Wan Kenobi        182    77
## # ... with 77 more rows
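
If ten rows are not enough, the print method for tibbles accepts the arguments n and width. The following is a minimal sketch; the tibble measurements is a hypothetical example and not part of the starwars data.

```r
library(tibble)

# A small illustrative tibble with 30 rows (hypothetical data)
measurements <- tibble(id = 1:30, value = id^2)

# Show 15 rows instead of the default 10
print(measurements, n = 15)

# Show all rows and all columns, regardless of the console width
print(measurements, n = Inf, width = Inf)
```

Setting n = Inf is convenient for small tibbles but can flood the console for large ones.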

Tibbles are more strict about subsetting. The function [ is used to extract rows and/or columns from a dataset. When applied to a tibble it always returns another tibble.

class(starwars[,1])

## [1] "tbl_df"     "tbl"        "data.frame"
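
When a plain vector is needed from a tibble, it has to be requested explicitly. The sketch below uses a small hypothetical tibble called characters to show the usual options.

```r
library(tibble)

# Hypothetical three-row tibble for illustration
characters <- tibble(
  name = c("Luke Skywalker", "C-3PO", "R2-D2"),
  height = c(172, 167, 96)
)

# [ keeps the tibble class, even for a single column
class(characters[, 1])

# [[ and $ extract the column itself as a vector
class(characters[[1]])
class(characters$name)

# drop = TRUE asks [ for the vector explicitly
class(characters[, 1, drop = TRUE])
```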

Tibbles are also stricter with the $ operator. Tibbles never do partial matching and will throw a warning and return NULL if a column does not exist.

starwars$b

## NULL

Conventional data frame

First I have to convert the starwars dataset to a conventional data frame with function as.data.frame().

starwars_df <- as.data.frame(starwars)

When printing starwars_df, all rows are shown in the output. To save some space I keep only the first 15 rows using dplyr::slice().

starwars_df %>%
  select(name, height, mass) %>%
  slice(1:15) %>%
  print()

##                  name height mass
## 1      Luke Skywalker    172   77
## 2               C-3PO    167   75
## 3               R2-D2     96   32
## 4         Darth Vader    202  136
## 5         Leia Organa    150   49
## 6           Owen Lars    178  120
## 7  Beru Whitesun lars    165   75
## 8               R5-D4     97   32
## 9   Biggs Darklighter    183   84
## 10     Obi-Wan Kenobi    182   77
## 11   Anakin Skywalker    188   84
## 12     Wilhuff Tarkin    180   NA
## 13          Chewbacca    228  112
## 14           Han Solo    180   80
## 15             Greedo    173   74

When subsetting a data frame with function [ it sometimes returns a data frame and sometimes just a vector.

class(starwars_df[,1])

## [1] "character"

In contrast to tibbles, partial matching is possible with the $ operator. Attempting to extract the non-existent column b from starwars_df returns the column birth_year instead. This behavior is not always desirable and can lead to unwanted results.

starwars_df$b

##  [1]  19.0 112.0  33.0  41.9  19.0  52.0  47.0    NA  24.0  57.0  41.9  64.0
## [13] 200.0  29.0  44.0 600.0  21.0    NA 896.0  82.0  31.5  15.0  53.0  31.0
## [25]  37.0  41.0  48.0    NA   8.0    NA  92.0    NA  91.0  52.0    NA    NA
## [37]    NA    NA    NA  62.0  72.0  54.0    NA  48.0    NA    NA    NA  72.0
## [49]  92.0    NA    NA    NA    NA    NA  22.0    NA    NA    NA  82.0    NA
## [61]  58.0  40.0    NA 102.0  67.0  66.0    NA    NA    NA    NA    NA    NA
## [73]    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
## [85]    NA    NA  46.0
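
Partial matching can be avoided in conventional data frames as well. The sketch below assumes a small hypothetical data frame characters_df; it relies on [[ matching names exactly by default and on the base R option warnPartialMatchDollar.

```r
# A small conventional data frame (hypothetical data for illustration)
characters_df <- data.frame(
  name = c("Luke Skywalker", "C-3PO"),
  birth_year = c(19, 112)
)

# $ matches partially: "b" resolves to birth_year
partial <- characters_df$b
partial

# [[ matches exactly by default, so an incomplete name gives NULL
exact <- characters_df[["b"]]
exact

# Optionally let R warn whenever $ falls back to partial matching
options(warnPartialMatchDollar = TRUE)
```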

glimpse()

In addition to print(), the function glimpse() can be used to get an overview of the starwars dataset. In the output the columns run down the page and the data runs across, which makes it possible to see every column in a data frame. The base R function str() is somewhat related to glimpse().

starwars %>%
  glimpse()

## Rows: 87
## Columns: 14
## $ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Or~
## $ height     <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2~
## $ mass       <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.~
## $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", N~
## $ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", "~
## $ eye_color  <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue",~
## $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, ~
## $ sex        <chr> "male", "none", "none", "male", "female", "male", "female",~
## $ gender     <chr> "masculine", "masculine", "masculine", "masculine", "femini~
## $ homeworld  <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "T~
## $ species    <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Huma~
## $ films      <list> <"The Empire Strikes Back", "Revenge of the Sith", "Return~
## $ vehicles   <list> <"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "Imp~
## $ starships  <list> <"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1",~

2. Creating tibbles

Next we’ll take a look at functions that create a tibble from scratch.

tibble()

With the function tibble() a data frame is constructed. The returned data frame has the class tbl_df, in addition to data.frame. This allows the tibble to exhibit special behaviours such as the enhanced printing discussed in section 1. In addition, character vectors are not coerced to factors, list-columns are expressly anticipated, and column names are not modified. Below I am creating a tibble based on information about three characters of the starwars dataset.

tibble(
  name = c("Luke Skywalker", "C-3PO", "R2-D2"),
  height = c(172, 167, 96),
  mass = c(77.0, 75.0, 32.0)
  )

## # A tibble: 3 x 3
##   name           height  mass
##   <chr>           <dbl> <dbl>
## 1 Luke Skywalker    172    77
## 2 C-3PO             167    75
## 3 R2-D2              96    32
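
Some of the behaviours mentioned above can be illustrated with a short sketch. The column names `body mass`, `2x` and species_known are made up for this example.

```r
library(tibble)

# Column names are kept as written, even if they are not syntactic R names.
# Later columns may also refer to columns defined earlier in the same call.
tb <- tibble(`body mass` = c(77, 75, 32), `2x` = `body mass` * 2)
names(tb)

# data.frame() repairs such names unless check.names = FALSE is set
df <- data.frame(`body mass` = c(77, 75, 32))
names(df)

# Only vectors of length one are recycled to the length of the other columns
recycled <- tibble(
  name = c("Luke Skywalker", "C-3PO", "R2-D2"),
  species_known = TRUE
)
recycled
```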

You can also create a tibble that contains a column which is a tibble itself. Below I added a fourth column which is a tibble consisting of the two columns planet and moons.

tibble(
  name = c("Luke Skywalker", "C-3PO", "R2-D2"),
  height = c(172, 167, 96),
  mass = c(77.0, 75.0, 32.0),
  homeworld = tibble(
    planet = c("Tatooine", "Tatooine", "Naboo"),
    moons = c(3, 3, NA)
    )
  )

## # A tibble: 3 x 4
##   name           height  mass homeworld$planet $moons
##   <chr>           <dbl> <dbl> <chr>             <dbl>
## 1 Luke Skywalker    172    77 Tatooine              3
## 2 C-3PO             167    75 Tatooine              3
## 3 R2-D2              96    32 Naboo                NA

tibble_row()

The function tibble_row() constructs a data frame that is guaranteed to occupy one row. Below, a linear model is fitted with the function lm(); the whole model object is stored in a single cell, so the result has exactly one row. The column lm is a so-called list-column.

tibble_row(
  a = "model1", 
  lm = lm(mass ~ height, data = starwars)
  )

## # A tibble: 1 x 2
##   a      lm    
##   <chr>  <list>
## 1 model1 <lm>

Does this also work when using the function tibble()?

tibble(
  a = "model1", 
  lm = lm(mass ~ height, data = starwars)
  )

## Error:
## ! All columns in a tibble must be vectors.
## x Column `lm` is a `lm` object.

It does not, because tibble() expects every column to be a vector and an lm object is not one. A solution is to create the list-column explicitly with the function list(), as shown below. This topic will be discussed in more detail in this tutorial.

tibble(
  a = "model1", 
  lm = list(lm(mass ~ height, data = starwars))
  )

## # A tibble: 1 x 2
##   a      lm    
##   <chr>  <list>
## 1 model1 <lm>
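
Once a model is stored in a list-column, it can be retrieved again with [[. In the following sketch the built-in mtcars dataset stands in for starwars, so the formula mpg ~ wt is an assumption made only for illustration.

```r
library(tibble)

# mtcars ships with R and stands in for the starwars data here
models <- tibble(
  a = "model1",
  lm = list(lm(mpg ~ wt, data = mtcars))
)

# The list-column holds the fitted model; [[ retrieves it
fit <- models$lm[[1]]
class(fit)
coef(fit)
```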

tribble()

The function tribble() lets you create a tibble by using an easier to read row-by-row layout.

tribble(
  ~name,            ~height, ~mass,
  "Luke Skywalker", 172,     77.0,
  "C-3PO",          167,     75.0,
  "R2-D2",           96,     32.0
  )

## # A tibble: 3 x 3
##   name           height  mass
##   <chr>           <dbl> <dbl>
## 1 Luke Skywalker    172    77
## 2 C-3PO             167    75
## 3 R2-D2              96    32

enframe() / deframe()

The function enframe() converts a named atomic vector or list to a one- or two-column data frame. Below I create a vector heights with three numeric values. Using this atomic vector in the function enframe() returns a tibble with the columns name and value. Because the vector has no names, the name column is simply a sequence of values from 1 to 3.

heights <- c(172, 167, 96)

enframe(heights)

## # A tibble: 3 x 2
##    name value
##   <int> <dbl>
## 1     1   172
## 2     2   167
## 3     3    96

With function deframe() this operation can be reversed by converting a two-column data frame to a named vector or list, using the first column as name and the second column as value.

enframed_heights <- enframe(heights)

deframe(enframed_heights)

##   1   2   3 
## 172 167  96
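
With a named vector the names end up in the first column. The sketch below also uses the arguments name and value of enframe() to choose the column names; the vector names luke, c3po and r2d2 are made up for the example.

```r
library(tibble)

# A named numeric vector (names are hypothetical)
heights <- c(luke = 172, c3po = 167, r2d2 = 96)

# The names become a character column, the values a numeric column
height_tbl <- enframe(heights, name = "character", value = "height")
height_tbl

# deframe() restores the named vector
restored <- deframe(height_tbl)
restored
```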

as_tibble()

The function as_tibble() turns an existing object, for example a conventional data frame, into a tibble with class tbl_df. Let us turn the starwars_df dataset created in section 1 from a data.frame back into a tbl_df.

starwars_df %>%
  as_tibble() %>%
  class()

## [1] "tbl_df"     "tbl"        "data.frame"

is_tibble()

The function is_tibble() returns TRUE for tibbles or subclasses thereof, and FALSE for all other objects, including conventional data frames.

As the starwars dataset is already a tibble we expect TRUE to be returned.

is_tibble(starwars)

## [1] TRUE

As we created the starwars_df dataset as a conventional data frame we expect FALSE to be returned.

is_tibble(starwars_df)

## [1] FALSE

3. Manipulating tibbles

There are also functions which allow us to modify existing tibbles.

add_row() / add_case()

This is a convenient way to add one or more rows of data to an existing data frame. The function add_case() is an alias of add_row() and works the same. Below I am adding a row for the bounty hunter and pod racer Aldar Beedo to the starwars dataset.

starwars %>%
  slice(1:5) %>%
  add_row(
    name = "Aldar Beedo",
    height = 130,
    mass = 32,
    skin_color = "yellow, blue, grey, beige",
    eye_color = "orange",
    sex = "male",
    gender = "masculine",
    homeworld = "Ploo II"
    )

## # A tibble: 6 x 14
##   name      height  mass hair_color skin_color eye_color birth_year sex   gender
##   <chr>      <dbl> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
## 1 Luke Sky~    172    77 blond      fair       blue            19   male  mascu~
## 2 C-3PO        167    75 <NA>       gold       yellow         112   none  mascu~
## 3 R2-D2         96    32 <NA>       white, bl~ red             33   none  mascu~
## 4 Darth Va~    202   136 none       white      yellow          41.9 male  mascu~
## 5 Leia Org~    150    49 brown      light      brown           19   fema~ femin~
## 6 Aldar Be~    130    32 <NA>       yellow, b~ orange          NA   male  mascu~
## # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

As you can see, it is not necessary to supply a value for every column of the original dataset; columns that are not supplied are filled with NA.
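
By default the new row is appended at the end. The arguments .before and .after of add_row() control its position, as in this minimal sketch with a hypothetical two-row tibble.

```r
library(tibble)

# Hypothetical two-row tibble for illustration
characters <- tibble(
  name = c("Luke Skywalker", "C-3PO"),
  height = c(172, 167)
)

# Insert the new row at position 1; height is not supplied and becomes NA
with_new <- add_row(characters, name = "R2-D2", .before = 1)
with_new
```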

add_column()

We might also add another column with function add_column(). The arguments .before = and .after = let us control the position of the new column. Below I add a sequential id for each row in the starwars dataset.

starwars %>%
  add_column(id = 1:87, .before = "name")

## # A tibble: 87 x 15
##       id name      height  mass hair_color skin_color eye_color birth_year sex  
##    <int> <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>
##  1     1 Luke Sky~    172    77 blond      fair       blue            19   male 
##  2     2 C-3PO        167    75 <NA>       gold       yellow         112   none 
##  3     3 R2-D2         96    32 <NA>       white, bl~ red             33   none 
##  4     4 Darth Va~    202   136 none       white      yellow          41.9 male 
##  5     5 Leia Org~    150    49 brown      light      brown           19   fema~
##  6     6 Owen Lars    178   120 brown, gr~ light      blue            52   male 
##  7     7 Beru Whi~    165    75 brown      light      blue            47   fema~
##  8     8 R5-D4         97    32 <NA>       white, red red             NA   none 
##  9     9 Biggs Da~    183    84 black      light      brown           24   male 
## 10    10 Obi-Wan ~    182    77 auburn, w~ fair       blue-gray       57   male 
## # ... with 77 more rows, and 6 more variables: gender <chr>, homeworld <chr>,
## #   species <chr>, films <list>, vehicles <list>, starships <list>
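
Hard-coding the number of rows breaks as soon as the data changes. A slightly more robust sketch derives the ids from nrow(); the tibble characters and the column droid are hypothetical.

```r
library(tibble)

# Hypothetical three-row tibble for illustration
characters <- tibble(
  name = c("Luke Skywalker", "C-3PO", "R2-D2"),
  height = c(172, 167, 96)
)

# Derive the ids from the number of rows instead of typing 1:3
with_id <- add_column(
  characters,
  id = seq_len(nrow(characters)),
  .before = "name"
)
names(with_id)

# .after places the new column behind an existing one
with_flag <- add_column(characters, droid = c(FALSE, TRUE, TRUE), .after = "name")
names(with_flag)
```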

4. Working with rownames

Generally, it is best to avoid row names, because they are basically a character column with different semantics than every other column. Still, especially when having conventional data frames as input, it can be helpful to know about functions that can manipulate them.

column_to_rownames()

The function column_to_rownames() takes an existing column in a data frame and converts it to rownames. It is important that the column from which the rownames are taken does not contain duplicate values!

starwars %>%
  select(name, height, mass) %>%
  column_to_rownames(var = "name") %>%
  head()

##                height mass
## Luke Skywalker    172   77
## C-3PO             167   75
## R2-D2              96   32
## Darth Vader       202  136
## Leia Organa       150   49
## Owen Lars         178  120
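
One way to guard against duplicates is to check the column before converting it. The sketch below uses the base R functions anyDuplicated() and make.unique() on a hypothetical tibble with a repeated name; making the names unique is only one possible remedy.

```r
library(tibble)

# Hypothetical tibble in which one name occurs twice
characters <- tibble(
  name = c("Luke Skywalker", "C-3PO", "C-3PO"),
  height = c(172, 167, 167)
)

# anyDuplicated() returns 0 only when all values are unique
if (anyDuplicated(characters$name) != 0) {
  # make.unique() appends .1, .2, ... to repeated values
  characters$name <- make.unique(characters$name)
}

result <- column_to_rownames(characters, var = "name")
rownames(result)
```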

has_rownames()

With this function we can detect if rownames are present in the dataset. It returns TRUE or FALSE. The original starwars dataset should not contain rownames.

starwars %>%
  has_rownames()

## [1] FALSE

However, when adding rownames with function column_to_rownames() they should exist.

starwars %>%
  column_to_rownames(var = "name") %>%
  has_rownames()

## [1] TRUE

remove_rownames()

Rownames can be removed from the dataset with this function. Note, however, that the names are then lost entirely, because the name column was converted to rownames beforehand.

starwars %>%
  select(name, height, mass) %>%
  column_to_rownames(var = "name") %>%
  remove_rownames() %>%
  slice(1:10)

##    height mass
## 1     172   77
## 2     167   75
## 3      96   32
## 4     202  136
## 5     150   49
## 6     178  120
## 7     165   75
## 8      97   32
## 9     183   84
## 10    182   77

rownames_to_column()

If rownames already exist in the dataset they can be stored as a single column with function rownames_to_column(). With argument var = a column name can be supplied.

starwars %>%
  select(name, height, mass) %>%
  column_to_rownames(var = "name") %>%
  rownames_to_column(var = "name") %>%
  slice(1:10)

##                  name height mass
## 1      Luke Skywalker    172   77
## 2               C-3PO    167   75
## 3               R2-D2     96   32
## 4         Darth Vader    202  136
## 5         Leia Organa    150   49
## 6           Owen Lars    178  120
## 7  Beru Whitesun lars    165   75
## 8               R5-D4     97   32
## 9   Biggs Darklighter    183   84
## 10     Obi-Wan Kenobi    182   77
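
When the input is a conventional data frame with rownames, the conversion to a tibble and the rescue of the rownames can be combined in one step with the rownames argument of as_tibble(). The built-in mtcars dataset and the column name model are assumptions made for this sketch.

```r
library(tibble)

# mtcars stores the car models as rownames
has_rownames(mtcars)

# Convert to a tibble and keep the rownames in a column called "model"
cars_tbl <- as_tibble(mtcars, rownames = "model")
cars_tbl
```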

rowid_to_column()

This function adds a column of ascending sequential row ids, starting at 1, at the start of the data frame.

starwars %>%
  select(name, height, mass) %>%
  rowid_to_column(var = "id") %>%
  slice(1:10)

## # A tibble: 10 x 4
##       id name               height  mass
##    <int> <chr>               <int> <dbl>
##  1     1 Luke Skywalker        172    77
##  2     2 C-3PO                 167    75
##  3     3 R2-D2                  96    32
##  4     4 Darth Vader           202   136
##  5     5 Leia Organa           150    49
##  6     6 Owen Lars             178   120
##  7     7 Beru Whitesun lars    165    75
##  8     8 R5-D4                  97    32
##  9     9 Biggs Darklighter     183    84
## 10    10 Obi-Wan Kenobi        182    77

References

Müller, Kirill, and Hadley Wickham. 2020. Tibble: Simple Data Frames. https://CRAN.R-project.org/package=tibble.