Reading time: 7 minutes (1,203 words)
The functions of the R package janitor
help to facilitate initital data cleaning and data exploration tasks before going into a deeper analysis with the dataset. For illustrating the use of each function, I’m using the dirty_data
dataset from the package Author’s GitHub page. We will also require some other R packages, which I’ve listed below.
library(janitor)
library(dplyr)
library(readr)
library(stringr)
library(lubridate)
We read the dataset with function read_delim()
from R package readr
and print it to the console.
dirty_data <- readr::read_delim(file = "data/dirty_data.csv", delim = ";")
print(dirty_data)
## # A tibble: 14 x 12
## `Data most recently~` ...2 ...3 `Dec-27 2020` ...5 ...6 ...7 ...8 ...9
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 First Name Last~ Empl~ Subject Hire~ % Al~ Full~ do n~ Cert~
## 2 Jason Bour~ Teac~ PE 39690 75% Yes <NA> Phys~
## 3 Jason Bour~ Teac~ Drafting 14.0~ 25% Yes <NA> Phys~
## 4 Alicia Keys Teac~ Music 15.0~ 100% Yes <NA> Inst~
## 5 Ada Love~ Teac~ #BEZUG! 38572 100% Yes <NA> PEND~
## 6 Desus Nice Admi~ Dean 25.0~ 100% Yes <NA> PEND~
## 7 Chien-Shiung Wu Teac~ Physics 11037 50% Yes <NA> Scie~
## 8 Chien-Shiung Wu Teac~ Chemistry 11037 50% Yes <NA> Scie~
## 9 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 10 James Joyce Teac~ English 20.0~ 50% No <NA> <NA>
## 11 Hedy Lama~ Teac~ Science 27919 50% No <NA> PEND~
## 12 Carlos Booz~ Coach Basketball 42221 #NV No <NA> Phys~
## 13 Young Booz~ Coach <NA> 34700 #NV No <NA> <NA>
## 14 Micheal Lars~ Teac~ English 40071 80% No <NA> Voca~
## # ... with 3 more variables: ...10 <chr>, ...11 <chr>, ...12 <chr>
Something already went wrong here! There seems to be some kind of note in the first row of this dataset, which read_delim()
uses per default as names for the columns. We can alter this behaviour by setting argument skip = 1
. Now the second row of the data will be used as column names. We now use glimpse()
to inspect the data’s structure.
readr::read_delim(file = "data/dirty_data.csv", delim = ";", skip = 1) %>%
glimpse()
## Rows: 13
## Columns: 12
## $ `First Name` <chr> "Jason", "Jason", "Alicia", "Ada", "Desus", "Chien~
## $ `Last Name` <chr> "Bourne", "Bourne", "Keys", "Lovelace", "Nice", "W~
## $ `Employee Status` <chr> "Teacher", "Teacher", "Teacher", "Teacher", "Admin~
## $ Subject <chr> "PE", "Drafting", "Music", "#BEZUG!", "Dean", "Phy~
## $ `Hire Date` <chr> "39690", "14.01.2019", "15.08.2001", "38572", "25.~
## $ `% Allocated` <chr> "75%", "25%", "100%", "100%", "100%", "50%", "50%"~
## $ `Full time?` <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", N~
## $ `do not edit! --->` <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
## $ Certification...9 <chr> "Physical ed", "Physical ed", "Instr. music", "PEN~
## $ Certification...10 <chr> "Theater", "Theater", "Vocal music", "Computers", ~
## $ `Active?` <chr> "YES", "YES", "YES", "YES", "YES", "YES", "YES", N~
## $ `Student Rating` <dbl> 1.5, 2.5, 3.5, 4.5, 2.3, 1.7, 3.0, NA, -1.5, -2.5,~
Still, the column names of this dataset look pretty messy as they contain special characters and two columns have identical names, which R enhances by default with ...
. So let’s have a look what can be done about this using functions of the janitor
package.
The first set of functions is used for the main tasks of data cleaning, like fixing the column names and removing missing values.
With this function we can select a row to be the column names of the data frame. This is somewhat redundant now as we already used the skip =
argument of function read_delim()
for this purpose. However, you might encounter a dataset where the row names are stored at the bottom of the table and then this function will come in handy.
dirty_data <-
read_delim(file = "data/dirty_data.csv", delim = ";") %>%
row_to_names(row_number = 1)
glimpse(dirty_data)
## Rows: 13
## Columns: 12
## $ `First Name` <chr> "Jason", "Jason", "Alicia", "Ada", "Desus", "Chien~
## $ `Last Name` <chr> "Bourne", "Bourne", "Keys", "Lovelace", "Nice", "W~
## $ `Employee Status` <chr> "Teacher", "Teacher", "Teacher", "Teacher", "Admin~
## $ Subject <chr> "PE", "Drafting", "Music", "#BEZUG!", "Dean", "Phy~
## $ `Hire Date` <chr> "39690", "14.01.2019", "15.08.2001", "38572", "25.~
## $ `% Allocated` <chr> "75%", "25%", "100%", "100%", "100%", "50%", "50%"~
## $ `Full time?` <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", N~
## $ `do not edit! --->` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
## $ Certification <chr> "Physical ed", "Physical ed", "Instr. music", "PEN~
## $ Certification <chr> "Theater", "Theater", "Vocal music", "Computers", ~
## $ `Active?` <chr> "YES", "YES", "YES", "YES", "YES", "YES", "YES", N~
## $ `Student Rating` <chr> "1.5", "2.5", "3.5", "4.5", "2.3", "1.7", "3.0", N~
This function cleans the column names of your dataset. It takes and returns a data frame, allowing for easy of piping with the %>%
operator. The resulting names are unique and consist only of the _
special character as well as numbers, and letters.
dirty_data %>%
clean_names() %>%
glimpse()
## Rows: 13
## Columns: 12
## $ first_name <chr> "Jason", "Jason", "Alicia", "Ada", "Desus", "Chien-S~
## $ last_name <chr> "Bourne", "Bourne", "Keys", "Lovelace", "Nice", "Wu"~
## $ employee_status <chr> "Teacher", "Teacher", "Teacher", "Teacher", "Adminis~
## $ subject <chr> "PE", "Drafting", "Music", "#BEZUG!", "Dean", "Physi~
## $ hire_date <chr> "39690", "14.01.2019", "15.08.2001", "38572", "25.02~
## $ percent_allocated <chr> "75%", "25%", "100%", "100%", "100%", "50%", "50%", ~
## $ full_time <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", NA,~
## $ do_not_edit <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
## $ certification <chr> "Physical ed", "Physical ed", "Instr. music", "PENDI~
## $ certification_2 <chr> "Theater", "Theater", "Vocal music", "Computers", NA~
## $ active <chr> "YES", "YES", "YES", "YES", "YES", "YES", "YES", NA,~
## $ student_rating <chr> "1.5", "2.5", "3.5", "4.5", "2.3", "1.7", "3.0", NA,~
You can specify capitalization preferences with the case =
argument (the default is "snake"
).
dirty_data %>%
clean_names(case = "small_camel") %>%
glimpse()
## Rows: 13
## Columns: 12
## $ firstName <chr> "Jason", "Jason", "Alicia", "Ada", "Desus", "Chien-Sh~
## $ lastName <chr> "Bourne", "Bourne", "Keys", "Lovelace", "Nice", "Wu",~
## $ employeeStatus <chr> "Teacher", "Teacher", "Teacher", "Teacher", "Administ~
## $ subject <chr> "PE", "Drafting", "Music", "#BEZUG!", "Dean", "Physic~
## $ hireDate <chr> "39690", "14.01.2019", "15.08.2001", "38572", "25.02.~
## $ percentAllocated <chr> "75%", "25%", "100%", "100%", "100%", "50%", "50%", N~
## $ fullTime <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", NA, ~
## $ doNotEdit <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
## $ certification <chr> "Physical ed", "Physical ed", "Instr. music", "PENDIN~
## $ certification_2 <chr> "Theater", "Theater", "Vocal music", "Computers", NA,~
## $ active <chr> "YES", "YES", "YES", "YES", "YES", "YES", "YES", NA, ~
## $ studentRating <chr> "1.5", "2.5", "3.5", "4.5", "2.3", "1.7", "3.0", NA, ~
This functions is similar to the previous one but operates on character vectors and can be used during the data import, for example in the argument name_repair =
of function read_delim()
.
dirty_data_mod <- read_delim(
file = "data/dirty_data.csv", delim = ";", skip = 1,
name_repair = make_clean_names
)
glimpse(dirty_data_mod)
## Rows: 13
## Columns: 12
## $ first_name <chr> "Jason", "Jason", "Alicia", "Ada", "Desus", "Chien-S~
## $ last_name <chr> "Bourne", "Bourne", "Keys", "Lovelace", "Nice", "Wu"~
## $ employee_status <chr> "Teacher", "Teacher", "Teacher", "Teacher", "Adminis~
## $ subject <chr> "PE", "Drafting", "Music", "#BEZUG!", "Dean", "Physi~
## $ hire_date <chr> "39690", "14.01.2019", "15.08.2001", "38572", "25.02~
## $ percent_allocated <chr> "75%", "25%", "100%", "100%", "100%", "50%", "50%", ~
## $ full_time <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", NA,~
## $ do_not_edit <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
## $ certification <chr> "Physical ed", "Physical ed", "Instr. music", "PENDI~
## $ certification_2 <chr> "Theater", "Theater", "Vocal music", "Computers", NA~
## $ active <chr> "YES", "YES", "YES", "YES", "YES", "YES", "YES", NA,~
## $ student_rating <dbl> 1.5, 2.5, 3.5, 4.5, 2.3, 1.7, 3.0, NA, -1.5, -2.5, 4~
The function remove_empty()
removes all rows/columns from a data frame that only consist of NA
values.
dirty_data_mod %>%
remove_empty(which = c("rows", "cols"))
## # A tibble: 12 x 11
## first_name last_name employee_status subject hire_date percent_allocat~
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Jason Bourne Teacher PE 39690 75%
## 2 Jason Bourne Teacher Drafting 14.01.2019 25%
## 3 Alicia Keys Teacher Music 15.08.2001 100%
## 4 Ada Lovelace Teacher #BEZUG! 38572 100%
## 5 Desus Nice Administration Dean 25.02.2017 100%
## 6 Chien-Shiung Wu Teacher Physics 11037 50%
## 7 Chien-Shiung Wu Teacher Chemistry 11037 50%
## 8 James Joyce Teacher English 20.09.1999 50%
## 9 Hedy Lamarr Teacher Science 27919 50%
## 10 Carlos Boozer Coach Basketball 42221 #NV
## 11 Young Boozer Coach <NA> 34700 #NV
## 12 Micheal Larsen Teacher English 40071 80%
## # ... with 5 more variables: full_time <chr>, certification <chr>,
## # certification_2 <chr>, active <chr>, student_rating <dbl>
Instead of a total of 17 rows there are now only 12 rows left and the last column do_not_edit
is removed.
This function removes constant columns (= every row has identical values) from the data frame.
dirty_data_mod %>%
remove_constant(na.rm = FALSE)
## # A tibble: 13 x 11
## first_name last_name employee_status subject hire_date percent_allocat~
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Jason Bourne Teacher PE 39690 75%
## 2 Jason Bourne Teacher Drafting 14.01.2019 25%
## 3 Alicia Keys Teacher Music 15.08.2001 100%
## 4 Ada Lovelace Teacher #BEZUG! 38572 100%
## 5 Desus Nice Administration Dean 25.02.2017 100%
## 6 Chien-Shiung Wu Teacher Physics 11037 50%
## 7 Chien-Shiung Wu Teacher Chemistry 11037 50%
## 8 <NA> <NA> <NA> <NA> <NA> <NA>
## 9 James Joyce Teacher English 20.09.1999 50%
## 10 Hedy Lamarr Teacher Science 27919 50%
## 11 Carlos Boozer Coach Basketball 42221 #NV
## 12 Young Boozer Coach <NA> 34700 #NV
## 13 Micheal Larsen Teacher English 40071 80%
## # ... with 5 more variables: full_time <chr>, certification <chr>,
## # certification_2 <chr>, active <chr>, student_rating <dbl>
The column do_not_edit
is removed.
When we set the argument na.rm = TRUE
then active
is also removed.
dirty_data_mod %>%
remove_constant(na.rm = TRUE)
## # A tibble: 13 x 10
## first_name last_name employee_status subject hire_date percent_allocat~
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Jason Bourne Teacher PE 39690 75%
## 2 Jason Bourne Teacher Drafting 14.01.2019 25%
## 3 Alicia Keys Teacher Music 15.08.2001 100%
## 4 Ada Lovelace Teacher #BEZUG! 38572 100%
## 5 Desus Nice Administration Dean 25.02.2017 100%
## 6 Chien-Shiung Wu Teacher Physics 11037 50%
## 7 Chien-Shiung Wu Teacher Chemistry 11037 50%
## 8 <NA> <NA> <NA> <NA> <NA> <NA>
## 9 James Joyce Teacher English 20.09.1999 50%
## 10 Hedy Lamarr Teacher Science 27919 50%
## 11 Carlos Boozer Coach Basketball 42221 #NV
## 12 Young Boozer Coach <NA> 34700 #NV
## 13 Micheal Larsen Teacher English 40071 80%
## # ... with 4 more variables: full_time <chr>, certification <chr>,
## # certification_2 <chr>, student_rating <dbl>
With compare_df_cols()
we can compare two data frames for the purpose of checking if they bind together, for example when using the function dplyr::bind_row()
.
Below I compare the starwars
dataset from R package dplyr
with our dirty_data
.
compare_df_cols(starwars, dirty_data_mod)
## column_name starwars dirty_data_mod
## 1 active <NA> character
## 2 birth_year numeric <NA>
## 3 certification <NA> character
## 4 certification_2 <NA> character
## 5 do_not_edit <NA> logical
## 6 employee_status <NA> character
## 7 eye_color character <NA>
## 8 films list <NA>
## 9 first_name <NA> character
## 10 full_time <NA> character
## 11 gender character <NA>
## 12 hair_color character <NA>
## 13 height integer <NA>
## 14 hire_date <NA> character
## 15 homeworld character <NA>
## 16 last_name <NA> character
## 17 mass numeric <NA>
## 18 name character <NA>
## 19 percent_allocated <NA> character
## 20 sex character <NA>
## 21 skin_color character <NA>
## 22 species character <NA>
## 23 starships list <NA>
## 24 student_rating <NA> numeric
## 25 subject <NA> character
## 26 vehicles list <NA>
Especially when reading data from a spreadsheet format, handling of date-values can be quite a hassle. The following functions aim to ease the cleaning of date columns.
This function converts many date formats to the Date
class. The dates can be of mixed formats, such as a combination of Microsoft Excel serial numbers and string-formatted dates. In argument character_fun =
you need to specify how to parse the dates with year, month, and day components.
dirty_data_mod %>%
mutate(hire_date_clean = convert_to_date(hire_date,
character_fun = lubridate::dmy)
) %>%
select(starts_with("hire"))
## # A tibble: 13 x 2
## hire_date hire_date_clean
## <chr> <date>
## 1 39690 2008-08-30
## 2 14.01.2019 2019-01-14
## 3 15.08.2001 2001-08-15
## 4 38572 2005-08-08
## 5 25.02.2017 2017-02-25
## 6 11037 1930-03-20
## 7 11037 1930-03-20
## 8 <NA> NA
## 9 20.09.1999 1999-09-20
## 10 27919 1976-06-08
## 11 42221 2015-08-05
## 12 34700 1995-01-01
## 13 40071 2009-09-15
You can also only convert dates encoded as serial numbers (by Microsoft Excel) to the Date
class. Note that your date column must then only contain these serial numbers and no other type of date-value. Also the serial number must be numeric. Below I first remove all dates that are not serial numbers from the column hire_date
by using function str_detect()
from R package stringr
. Then I can use the function excel_numeric_to_date()
.
dirty_data_mod %>%
filter(stringr::str_detect(hire_date, pattern = "\\.") == FALSE) %>%
mutate(hire_date_clean = excel_numeric_to_date(hire_date %>% as.numeric())) %>%
select(starts_with("hire"))
## # A tibble: 8 x 2
## hire_date hire_date_clean
## <chr> <date>
## 1 39690 2008-08-30
## 2 38572 2005-08-08
## 3 11037 1930-03-20
## 4 11037 1930-03-20
## 5 27919 1976-06-08
## 6 42221 2015-08-05
## 7 34700 1995-01-01
## 8 40071 2009-09-15
With function convert_to_datetime()
you are able to convert date-times. It is similar to function convert_to_date()
but requires to set a time zone and a specification how to parse date-times with year, month, and day, hour, minute, and second components.
Sometimes the rounding of columns can also be an issue, which might be tackled with the following functions.
The base
R function round()
uses banker’s rounding, i.e., halves are rounded to the nearest even number. The function round_half_up()
, in contrast, rounds all halves up (positive or negative) - provided you set argument digits = 0
.
dirty_data_mod %>%
mutate(rounded_janitor = round_half_up(student_rating, digits = 0),
rounded_base = round(student_rating, digits = 0)) %>%
select(student_rating, starts_with("round"))
## # A tibble: 13 x 3
## student_rating rounded_janitor rounded_base
## <dbl> <dbl> <dbl>
## 1 1.5 2 2
## 2 2.5 3 2
## 3 3.5 4 4
## 4 4.5 5 4
## 5 2.3 2 2
## 6 1.7 2 2
## 7 3 3 3
## 8 NA NA NA
## 9 -1.5 -2 -2
## 10 -2.5 -3 -2
## 11 4 4 4
## 12 2 2 2
## 13 1 1 1
With this function you may round a decimal to the precise decimal value of a specified fractional denominator. For example you can use denominator = 1
to round to whole numbers or denominator = 3
to round to third of a whole number.
dirty_data_mod %>%
mutate(rounded_janitor1 = round_to_fraction(student_rating, denominator = 1),
rounded_janitor3 = round_to_fraction(student_rating, denominator = 3)) %>%
select(student_rating, starts_with("round"))
## # A tibble: 13 x 3
## student_rating rounded_janitor1 rounded_janitor3
## <dbl> <dbl> <dbl>
## 1 1.5 2 1.33
## 2 2.5 2 2.67
## 3 3.5 4 3.33
## 4 4.5 4 4.67
## 5 2.3 2 2.33
## 6 1.7 2 1.67
## 7 3 3 3
## 8 NA NA NA
## 9 -1.5 -2 -1.33
## 10 -2.5 -2 -2.67
## 11 4 4 4
## 12 2 2 2
## 13 1 1 1
The remainder of this tutorial discusses the janitor
functions which are intented to be used during an initital exploration of the dataset. This can either be important to check if there are duplicates in your dataset that need to be removed before further analyzing it or to detect unusual (combination of) values that you’ve missed before.
With function get_dupes()
we can explore records with duplicated values for specific combinations of variables. Inside this function we can use selection helpers from the tidyselect
package (see this tutorial for more information).
dirty_data_mod %>%
get_dupes(contains("name"))
## # A tibble: 4 x 13
## first_name last_name dupe_count employee_status subject hire_date
## <chr> <chr> <int> <chr> <chr> <chr>
## 1 Chien-Shiung Wu 2 Teacher Physics 11037
## 2 Chien-Shiung Wu 2 Teacher Chemistry 11037
## 3 Jason Bourne 2 Teacher PE 39690
## 4 Jason Bourne 2 Teacher Drafting 14.01.2019
## # ... with 7 more variables: percent_allocated <chr>, full_time <chr>,
## # do_not_edit <lgl>, certification <chr>, certification_2 <chr>,
## # active <chr>, student_rating <dbl>
There are two rows with duplicate values in columns first_name
and last_name
.
We can use the function tabyl()
to tabulate up to three variables. This function is an equivalent to the base
R function table()
but can be used with the %>%
operator and the output is always a (tidy) data frame.
dirty_data_mod %>%
tabyl(employee_status)
## employee_status n percent valid_percent
## Administration 1 0.07692308 0.08333333
## Coach 2 0.15384615 0.16666667
## Teacher 9 0.69230769 0.75000000
## <NA> 1 0.07692308 NA
The function can also be used to create a two-way table. With argument show_na =
you can control whether NA
values should be reported or not.
dirty_data_mod %>%
tabyl(employee_status, full_time, show_na = FALSE)
## employee_status No Yes
## Administration 0 1
## Coach 2 0
## Teacher 3 6
Even a three-way table is possible.
dirty_data_mod %>%
tabyl(employee_status, certification, full_time, show_na = FALSE)
## $No
## employee_status Instr. music PENDING Physical ed Science 6-12 Vocal music
## Administration 0 0 0 0 0
## Coach 0 0 1 0 0
## Teacher 0 1 0 0 1
##
## $Yes
## employee_status Instr. music PENDING Physical ed Science 6-12 Vocal music
## Administration 0 1 0 0 0
## Coach 0 0 0 0 0
## Teacher 1 1 2 2 0
This set of functions can be used to embellish the results of tabulations with tabyl()
for creating concise but clear reports. Below I’m adding the row totals with adorn_total()
as well as row percentages with adorn_percentages()
. The latter can be formatted with function adorn_pct_formatting()
. With adorn_ns()
I include again the absolute frequencies after having computed relative frequencies with adorn_percentages()
. At last we may add the column variable name to the output with funtion adorn_title()
. Be careful that this makes the data frame less usfeul for further computations.
dirty_data_mod %>%
remove_empty(which = c("cols","rows")) %>%
tabyl(employee_status, full_time) %>%
adorn_totals(where = "row") %>%
adorn_percentages(denominator = "row") %>%
adorn_pct_formatting(digits = 2) %>%
adorn_ns() %>%
adorn_title("combined")
## employee_status/full_time No Yes
## Administration 0.00% (0) 100.00% (1)
## Coach 100.00% (2) 0.00% (0)
## Teacher 33.33% (3) 66.67% (6)
## Total 41.67% (5) 58.33% (7)
Below I’ve used knitr::kable()
to demonstrate how the output looks rendered in a Markdown report.
employee_status/full_time | No | Yes |
---|---|---|
Administration | 0.00% (0) | 100.00% (1) |
Coach | 100.00% (2) | 0.00% (0) |
Teacher | 33.33% (3) | 66.67% (6) |
Total | 41.67% (5) | 58.33% (7) |
This function returns a frequency table with three groups (top/middle/bottom) and named rows according to the factor’s levels. With argument n =
you can control the number of levels to include in the top and bottom group.
starwars %>%
pull(eye_color) %>%
factor() %>%
top_levels(n = 1)
## . n percent
## black 10 0.1149425
## <<< Middle Group (13 categories) >>> 66 0.7586207
## yellow 11 0.1264368