1 Introduction

2 Data Cleaning

2.1 General Functions
2.2 Dates
2.3 Rounding

3 Data Exploration

References

Reading time: 7 minutes (1,203 words)

1 Introduction

The functions of the R package janitor help to facilitate initital data cleaning and data exploration tasks before going into a deeper analysis with the dataset. For illustrating the use of each function, I’m using the dirty_data dataset from the package Author’s GitHub page. We will also require some other R packages, which I’ve listed below.

library(janitor)
library(dplyr)
library(readr)
library(stringr)
library(lubridate)

We read the dataset with function read_delim() from R package readr and print it to the console.

dirty_data <- readr::read_delim(file = "data/dirty_data.csv", delim = ";")
print(dirty_data)

## # A tibble: 14 x 12
##    `Data most recently~` ...2  ...3  `Dec-27 2020` ...5  ...6  ...7  ...8  ...9 
##    <chr>                 <chr> <chr> <chr>         <chr> <chr> <chr> <chr> <chr>
##  1 First Name            Last~ Empl~ Subject       Hire~ % Al~ Full~ do n~ Cert~
##  2 Jason                 Bour~ Teac~ PE            39690 75%   Yes   <NA>  Phys~
##  3 Jason                 Bour~ Teac~ Drafting      14.0~ 25%   Yes   <NA>  Phys~
##  4 Alicia                Keys  Teac~ Music         15.0~ 100%  Yes   <NA>  Inst~
##  5 Ada                   Love~ Teac~ #BEZUG!       38572 100%  Yes   <NA>  PEND~
##  6 Desus                 Nice  Admi~ Dean          25.0~ 100%  Yes   <NA>  PEND~
##  7 Chien-Shiung          Wu    Teac~ Physics       11037 50%   Yes   <NA>  Scie~
##  8 Chien-Shiung          Wu    Teac~ Chemistry     11037 50%   Yes   <NA>  Scie~
##  9 <NA>                  <NA>  <NA>  <NA>          <NA>  <NA>  <NA>  <NA>  <NA> 
## 10 James                 Joyce Teac~ English       20.0~ 50%   No    <NA>  <NA> 
## 11 Hedy                  Lama~ Teac~ Science       27919 50%   No    <NA>  PEND~
## 12 Carlos                Booz~ Coach Basketball    42221 #NV   No    <NA>  Phys~
## 13 Young                 Booz~ Coach <NA>          34700 #NV   No    <NA>  <NA> 
## 14 Micheal               Lars~ Teac~ English       40071 80%   No    <NA>  Voca~
## # ... with 3 more variables: ...10 <chr>, ...11 <chr>, ...12 <chr>

Something already went wrong here! There seems to be some kind of note in the first row of this dataset, which read_delim() uses per default as names for the columns. We can alter this behaviour by setting argument skip = 1. Now the second row of the data will be used as column names. We now use glimpse() to inspect the data’s structure.

readr::read_delim(file = "data/dirty_data.csv", delim = ";", skip = 1) %>%
  glimpse()

## Rows: 13
## Columns: 12
## $ `First Name`        <chr> "Jason", "Jason", "Alicia", "Ada", "Desus", "Chien~
## $ `Last Name`         <chr> "Bourne", "Bourne", "Keys", "Lovelace", "Nice", "W~
## $ `Employee Status`   <chr> "Teacher", "Teacher", "Teacher", "Teacher", "Admin~
## $ Subject             <chr> "PE", "Drafting", "Music", "#BEZUG!", "Dean", "Phy~
## $ `Hire Date`         <chr> "39690", "14.01.2019", "15.08.2001", "38572", "25.~
## $ `% Allocated`       <chr> "75%", "25%", "100%", "100%", "100%", "50%", "50%"~
## $ `Full time?`        <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", N~
## $ `do not edit! --->` <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
## $ Certification...9   <chr> "Physical ed", "Physical ed", "Instr. music", "PEN~
## $ Certification...10  <chr> "Theater", "Theater", "Vocal music", "Computers", ~
## $ `Active?`           <chr> "YES", "YES", "YES", "YES", "YES", "YES", "YES", N~
## $ `Student Rating`    <dbl> 1.5, 2.5, 3.5, 4.5, 2.3, 1.7, 3.0, NA, -1.5, -2.5,~

Still, the column names of this dataset look pretty messy as they contain special characters and two columns have identical names, which R enhances by default with .... So let’s have a look what can be done about this using functions of the janitor package.

2 Data Cleaning

2.1 General Functions

The first set of functions is used for the main tasks of data cleaning, like fixing the column names and removing missing values.

With this function we can select a row to be the column names of the data frame. This is somewhat redundant now as we already used the skip = argument of function read_delim() for this purpose. However, you might encounter a dataset where the row names are stored at the bottom of the table and then this function will come in handy.

dirty_data <- 
  read_delim(file = "data/dirty_data.csv", delim = ";") %>%
  row_to_names(row_number = 1)

glimpse(dirty_data)

## Rows: 13
## Columns: 12
## $ `First Name`        <chr> "Jason", "Jason", "Alicia", "Ada", "Desus", "Chien~
## $ `Last Name`         <chr> "Bourne", "Bourne", "Keys", "Lovelace", "Nice", "W~
## $ `Employee Status`   <chr> "Teacher", "Teacher", "Teacher", "Teacher", "Admin~
## $ Subject             <chr> "PE", "Drafting", "Music", "#BEZUG!", "Dean", "Phy~
## $ `Hire Date`         <chr> "39690", "14.01.2019", "15.08.2001", "38572", "25.~
## $ `% Allocated`       <chr> "75%", "25%", "100%", "100%", "100%", "50%", "50%"~
## $ `Full time?`        <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", N~
## $ `do not edit! --->` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
## $ Certification       <chr> "Physical ed", "Physical ed", "Instr. music", "PEN~
## $ Certification       <chr> "Theater", "Theater", "Vocal music", "Computers", ~
## $ `Active?`           <chr> "YES", "YES", "YES", "YES", "YES", "YES", "YES", N~
## $ `Student Rating`    <chr> "1.5", "2.5", "3.5", "4.5", "2.3", "1.7", "3.0", N~

This function cleans the column names of your dataset. It takes and returns a data frame, allowing for easy of piping with the %>% operator. The resulting names are unique and consist only of the _ special character as well as numbers, and letters.

dirty_data %>%
  clean_names() %>%
  glimpse()

## Rows: 13
## Columns: 12
## $ first_name        <chr> "Jason", "Jason", "Alicia", "Ada", "Desus", "Chien-S~
## $ last_name         <chr> "Bourne", "Bourne", "Keys", "Lovelace", "Nice", "Wu"~
## $ employee_status   <chr> "Teacher", "Teacher", "Teacher", "Teacher", "Adminis~
## $ subject           <chr> "PE", "Drafting", "Music", "#BEZUG!", "Dean", "Physi~
## $ hire_date         <chr> "39690", "14.01.2019", "15.08.2001", "38572", "25.02~
## $ percent_allocated <chr> "75%", "25%", "100%", "100%", "100%", "50%", "50%", ~
## $ full_time         <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", NA,~
## $ do_not_edit       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
## $ certification     <chr> "Physical ed", "Physical ed", "Instr. music", "PENDI~
## $ certification_2   <chr> "Theater", "Theater", "Vocal music", "Computers", NA~
## $ active            <chr> "YES", "YES", "YES", "YES", "YES", "YES", "YES", NA,~
## $ student_rating    <chr> "1.5", "2.5", "3.5", "4.5", "2.3", "1.7", "3.0", NA,~

You can specify capitalization preferences with the case = argument (the default is "snake").

dirty_data %>%
  clean_names(case = "small_camel") %>%
  glimpse()

## Rows: 13
## Columns: 12
## $ firstName        <chr> "Jason", "Jason", "Alicia", "Ada", "Desus", "Chien-Sh~
## $ lastName         <chr> "Bourne", "Bourne", "Keys", "Lovelace", "Nice", "Wu",~
## $ employeeStatus   <chr> "Teacher", "Teacher", "Teacher", "Teacher", "Administ~
## $ subject          <chr> "PE", "Drafting", "Music", "#BEZUG!", "Dean", "Physic~
## $ hireDate         <chr> "39690", "14.01.2019", "15.08.2001", "38572", "25.02.~
## $ percentAllocated <chr> "75%", "25%", "100%", "100%", "100%", "50%", "50%", N~
## $ fullTime         <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", NA, ~
## $ doNotEdit        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
## $ certification    <chr> "Physical ed", "Physical ed", "Instr. music", "PENDIN~
## $ certification_2  <chr> "Theater", "Theater", "Vocal music", "Computers", NA,~
## $ active           <chr> "YES", "YES", "YES", "YES", "YES", "YES", "YES", NA, ~
## $ studentRating    <chr> "1.5", "2.5", "3.5", "4.5", "2.3", "1.7", "3.0", NA, ~

This functions is similar to the previous one but operates on character vectors and can be used during the data import, for example in the argument name_repair = of function read_delim().

dirty_data_mod <- read_delim(
  file = "data/dirty_data.csv", delim = ";", skip = 1, 
  name_repair = make_clean_names
  )

glimpse(dirty_data_mod)

## Rows: 13
## Columns: 12
## $ first_name        <chr> "Jason", "Jason", "Alicia", "Ada", "Desus", "Chien-S~
## $ last_name         <chr> "Bourne", "Bourne", "Keys", "Lovelace", "Nice", "Wu"~
## $ employee_status   <chr> "Teacher", "Teacher", "Teacher", "Teacher", "Adminis~
## $ subject           <chr> "PE", "Drafting", "Music", "#BEZUG!", "Dean", "Physi~
## $ hire_date         <chr> "39690", "14.01.2019", "15.08.2001", "38572", "25.02~
## $ percent_allocated <chr> "75%", "25%", "100%", "100%", "100%", "50%", "50%", ~
## $ full_time         <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", NA,~
## $ do_not_edit       <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
## $ certification     <chr> "Physical ed", "Physical ed", "Instr. music", "PENDI~
## $ certification_2   <chr> "Theater", "Theater", "Vocal music", "Computers", NA~
## $ active            <chr> "YES", "YES", "YES", "YES", "YES", "YES", "YES", NA,~
## $ student_rating    <dbl> 1.5, 2.5, 3.5, 4.5, 2.3, 1.7, 3.0, NA, -1.5, -2.5, 4~

The function remove_empty() removes all rows/columns from a data frame that only consist of NA values.

dirty_data_mod %>%
  remove_empty(which = c("rows", "cols"))

## # A tibble: 12 x 11
##    first_name   last_name employee_status subject    hire_date  percent_allocat~
##    <chr>        <chr>     <chr>           <chr>      <chr>      <chr>           
##  1 Jason        Bourne    Teacher         PE         39690      75%             
##  2 Jason        Bourne    Teacher         Drafting   14.01.2019 25%             
##  3 Alicia       Keys      Teacher         Music      15.08.2001 100%            
##  4 Ada          Lovelace  Teacher         #BEZUG!    38572      100%            
##  5 Desus        Nice      Administration  Dean       25.02.2017 100%            
##  6 Chien-Shiung Wu        Teacher         Physics    11037      50%             
##  7 Chien-Shiung Wu        Teacher         Chemistry  11037      50%             
##  8 James        Joyce     Teacher         English    20.09.1999 50%             
##  9 Hedy         Lamarr    Teacher         Science    27919      50%             
## 10 Carlos       Boozer    Coach           Basketball 42221      #NV             
## 11 Young        Boozer    Coach           <NA>       34700      #NV             
## 12 Micheal      Larsen    Teacher         English    40071      80%             
## # ... with 5 more variables: full_time <chr>, certification <chr>,
## #   certification_2 <chr>, active <chr>, student_rating <dbl>

Instead of a total of 17 rows there are now only 12 rows left and the last column do_not_edit is removed.

This function removes constant columns (= every row has identical values) from the data frame.

dirty_data_mod %>%
  remove_constant(na.rm = FALSE)

## # A tibble: 13 x 11
##    first_name   last_name employee_status subject    hire_date  percent_allocat~
##    <chr>        <chr>     <chr>           <chr>      <chr>      <chr>           
##  1 Jason        Bourne    Teacher         PE         39690      75%             
##  2 Jason        Bourne    Teacher         Drafting   14.01.2019 25%             
##  3 Alicia       Keys      Teacher         Music      15.08.2001 100%            
##  4 Ada          Lovelace  Teacher         #BEZUG!    38572      100%            
##  5 Desus        Nice      Administration  Dean       25.02.2017 100%            
##  6 Chien-Shiung Wu        Teacher         Physics    11037      50%             
##  7 Chien-Shiung Wu        Teacher         Chemistry  11037      50%             
##  8 <NA>         <NA>      <NA>            <NA>       <NA>       <NA>            
##  9 James        Joyce     Teacher         English    20.09.1999 50%             
## 10 Hedy         Lamarr    Teacher         Science    27919      50%             
## 11 Carlos       Boozer    Coach           Basketball 42221      #NV             
## 12 Young        Boozer    Coach           <NA>       34700      #NV             
## 13 Micheal      Larsen    Teacher         English    40071      80%             
## # ... with 5 more variables: full_time <chr>, certification <chr>,
## #   certification_2 <chr>, active <chr>, student_rating <dbl>

The column do_not_edit is removed.

When we set the argument na.rm = TRUE then active is also removed.

dirty_data_mod %>%
  remove_constant(na.rm = TRUE)

## # A tibble: 13 x 10
##    first_name   last_name employee_status subject    hire_date  percent_allocat~
##    <chr>        <chr>     <chr>           <chr>      <chr>      <chr>           
##  1 Jason        Bourne    Teacher         PE         39690      75%             
##  2 Jason        Bourne    Teacher         Drafting   14.01.2019 25%             
##  3 Alicia       Keys      Teacher         Music      15.08.2001 100%            
##  4 Ada          Lovelace  Teacher         #BEZUG!    38572      100%            
##  5 Desus        Nice      Administration  Dean       25.02.2017 100%            
##  6 Chien-Shiung Wu        Teacher         Physics    11037      50%             
##  7 Chien-Shiung Wu        Teacher         Chemistry  11037      50%             
##  8 <NA>         <NA>      <NA>            <NA>       <NA>       <NA>            
##  9 James        Joyce     Teacher         English    20.09.1999 50%             
## 10 Hedy         Lamarr    Teacher         Science    27919      50%             
## 11 Carlos       Boozer    Coach           Basketball 42221      #NV             
## 12 Young        Boozer    Coach           <NA>       34700      #NV             
## 13 Micheal      Larsen    Teacher         English    40071      80%             
## # ... with 4 more variables: full_time <chr>, certification <chr>,
## #   certification_2 <chr>, student_rating <dbl>

With compare_df_cols() we can compare two data frames for the purpose of checking if they bind together, for example when using the function dplyr::bind_row().

Below I compare the starwars dataset from R package dplyr with our dirty_data.

compare_df_cols(starwars, dirty_data_mod)

##          column_name  starwars dirty_data_mod
## 1             active      <NA>      character
## 2         birth_year   numeric           <NA>
## 3      certification      <NA>      character
## 4    certification_2      <NA>      character
## 5        do_not_edit      <NA>        logical
## 6    employee_status      <NA>      character
## 7          eye_color character           <NA>
## 8              films      list           <NA>
## 9         first_name      <NA>      character
## 10         full_time      <NA>      character
## 11            gender character           <NA>
## 12        hair_color character           <NA>
## 13            height   integer           <NA>
## 14         hire_date      <NA>      character
## 15         homeworld character           <NA>
## 16         last_name      <NA>      character
## 17              mass   numeric           <NA>
## 18              name character           <NA>
## 19 percent_allocated      <NA>      character
## 20               sex character           <NA>
## 21        skin_color character           <NA>
## 22           species character           <NA>
## 23         starships      list           <NA>
## 24    student_rating      <NA>        numeric
## 25           subject      <NA>      character
## 26          vehicles      list           <NA>

2.2 Dates

Especially when reading data from a spreadsheet format, handling of date-values can be quite a hassle. The following functions aim to ease the cleaning of date columns.

This function converts many date formats to the Date class. The dates can be of mixed formats, such as a combination of Microsoft Excel serial numbers and string-formatted dates. In argument character_fun = you need to specify how to parse the dates with year, month, and day components.

dirty_data_mod %>%
  mutate(hire_date_clean = convert_to_date(hire_date, 
                                          character_fun = lubridate::dmy)
         ) %>%
  select(starts_with("hire"))

## # A tibble: 13 x 2
##    hire_date  hire_date_clean
##    <chr>      <date>         
##  1 39690      2008-08-30     
##  2 14.01.2019 2019-01-14     
##  3 15.08.2001 2001-08-15     
##  4 38572      2005-08-08     
##  5 25.02.2017 2017-02-25     
##  6 11037      1930-03-20     
##  7 11037      1930-03-20     
##  8 <NA>       NA             
##  9 20.09.1999 1999-09-20     
## 10 27919      1976-06-08     
## 11 42221      2015-08-05     
## 12 34700      1995-01-01     
## 13 40071      2009-09-15

You can also only convert dates encoded as serial numbers (by Microsoft Excel) to the Date class. Note that your date column must then only contain these serial numbers and no other type of date-value. Also the serial number must be numeric. Below I first remove all dates that are not serial numbers from the column hire_date by using function str_detect() from R package stringr. Then I can use the function excel_numeric_to_date().

dirty_data_mod %>%
  filter(stringr::str_detect(hire_date, pattern = "\\.") == FALSE) %>%
  mutate(hire_date_clean = excel_numeric_to_date(hire_date %>% as.numeric())) %>%
  select(starts_with("hire"))

## # A tibble: 8 x 2
##   hire_date hire_date_clean
##   <chr>     <date>         
## 1 39690     2008-08-30     
## 2 38572     2005-08-08     
## 3 11037     1930-03-20     
## 4 11037     1930-03-20     
## 5 27919     1976-06-08     
## 6 42221     2015-08-05     
## 7 34700     1995-01-01     
## 8 40071     2009-09-15

With function convert_to_datetime() you are able to convert date-times. It is similar to function convert_to_date() but requires to set a time zone and a specification how to parse date-times with year, month, and day, hour, minute, and second components.

2.3 Rounding

Sometimes the rounding of columns can also be an issue, which might be tackled with the following functions.

round_half_up()
round_to_fraction()

The base R function round() uses banker’s rounding, i.e., halves are rounded to the nearest even number. The function round_half_up(), in contrast, rounds all halves up (positive or negative) - provided you set argument digits = 0.

dirty_data_mod %>%
  mutate(rounded_janitor = round_half_up(student_rating, digits = 0),
         rounded_base = round(student_rating, digits = 0)) %>%
  select(student_rating, starts_with("round"))

## # A tibble: 13 x 3
##    student_rating rounded_janitor rounded_base
##             <dbl>           <dbl>        <dbl>
##  1            1.5               2            2
##  2            2.5               3            2
##  3            3.5               4            4
##  4            4.5               5            4
##  5            2.3               2            2
##  6            1.7               2            2
##  7            3                 3            3
##  8           NA                NA           NA
##  9           -1.5              -2           -2
## 10           -2.5              -3           -2
## 11            4                 4            4
## 12            2                 2            2
## 13            1                 1            1

With this function you may round a decimal to the precise decimal value of a specified fractional denominator. For example you can use denominator = 1 to round to whole numbers or denominator = 3 to round to third of a whole number.

dirty_data_mod %>%
  mutate(rounded_janitor1 = round_to_fraction(student_rating, denominator = 1),
         rounded_janitor3 = round_to_fraction(student_rating, denominator = 3)) %>%
  select(student_rating, starts_with("round"))

## # A tibble: 13 x 3
##    student_rating rounded_janitor1 rounded_janitor3
##             <dbl>            <dbl>            <dbl>
##  1            1.5                2             1.33
##  2            2.5                2             2.67
##  3            3.5                4             3.33
##  4            4.5                4             4.67
##  5            2.3                2             2.33
##  6            1.7                2             1.67
##  7            3                  3             3   
##  8           NA                 NA            NA   
##  9           -1.5               -2            -1.33
## 10           -2.5               -2            -2.67
## 11            4                  4             4   
## 12            2                  2             2   
## 13            1                  1             1

3 Data Exploration

The remainder of this tutorial discusses the janitor functions which are intented to be used during an initital exploration of the dataset. This can either be important to check if there are duplicates in your dataset that need to be removed before further analyzing it or to detect unusual (combination of) values that you’ve missed before.

With function get_dupes() we can explore records with duplicated values for specific combinations of variables. Inside this function we can use selection helpers from the tidyselect package (see this tutorial for more information).

dirty_data_mod %>%
  get_dupes(contains("name"))

## # A tibble: 4 x 13
##   first_name   last_name dupe_count employee_status subject   hire_date 
##   <chr>        <chr>          <int> <chr>           <chr>     <chr>     
## 1 Chien-Shiung Wu                 2 Teacher         Physics   11037     
## 2 Chien-Shiung Wu                 2 Teacher         Chemistry 11037     
## 3 Jason        Bourne             2 Teacher         PE        39690     
## 4 Jason        Bourne             2 Teacher         Drafting  14.01.2019
## # ... with 7 more variables: percent_allocated <chr>, full_time <chr>,
## #   do_not_edit <lgl>, certification <chr>, certification_2 <chr>,
## #   active <chr>, student_rating <dbl>

There are two rows with duplicate values in columns first_name and last_name.

We can use the function tabyl() to tabulate up to three variables. This function is an equivalent to the base R function table() but can be used with the %>% operator and the output is always a (tidy) data frame.

dirty_data_mod %>%
  tabyl(employee_status)

##  employee_status n    percent valid_percent
##   Administration 1 0.07692308    0.08333333
##            Coach 2 0.15384615    0.16666667
##          Teacher 9 0.69230769    0.75000000
##             <NA> 1 0.07692308            NA

The function can also be used to create a two-way table. With argument show_na = you can control whether NA values should be reported or not.

dirty_data_mod %>%
  tabyl(employee_status, full_time, show_na = FALSE)

##  employee_status No Yes
##   Administration  0   1
##            Coach  2   0
##          Teacher  3   6

Even a three-way table is possible.

dirty_data_mod %>%
  tabyl(employee_status, certification, full_time, show_na = FALSE)

## $No
##  employee_status Instr. music PENDING Physical ed Science 6-12 Vocal music
##   Administration            0       0           0            0           0
##            Coach            0       0           1            0           0
##          Teacher            0       1           0            0           1
## 
## $Yes
##  employee_status Instr. music PENDING Physical ed Science 6-12 Vocal music
##   Administration            0       1           0            0           0
##            Coach            0       0           0            0           0
##          Teacher            1       1           2            2           0

This set of functions can be used to embellish the results of tabulations with tabyl() for creating concise but clear reports. Below I’m adding the row totals with adorn_total() as well as row percentages with adorn_percentages(). The latter can be formatted with function adorn_pct_formatting(). With adorn_ns() I include again the absolute frequencies after having computed relative frequencies with adorn_percentages(). At last we may add the column variable name to the output with funtion adorn_title(). Be careful that this makes the data frame less usfeul for further computations.

dirty_data_mod %>%
  remove_empty(which = c("cols","rows")) %>%
  tabyl(employee_status, full_time) %>%
  adorn_totals(where = "row") %>%
  adorn_percentages(denominator = "row") %>%
  adorn_pct_formatting(digits = 2) %>%
  adorn_ns() %>%
  adorn_title("combined")

##  employee_status/full_time          No         Yes
##             Administration   0.00% (0) 100.00% (1)
##                      Coach 100.00% (2)   0.00% (0)
##                    Teacher  33.33% (3)  66.67% (6)
##                      Total  41.67% (5)  58.33% (7)

Below I’ve used knitr::kable() to demonstrate how the output looks rendered in a Markdown report.

employee_status/full_time	No	Yes
Administration	0.00% (0)	100.00% (1)
Coach	100.00% (2)	0.00% (0)
Teacher	33.33% (3)	66.67% (6)
Total	41.67% (5)	58.33% (7)

This function returns a frequency table with three groups (top/middle/bottom) and named rows according to the factor’s levels. With argument n = you can control the number of levels to include in the top and bottom group.

starwars %>%
  pull(eye_color) %>%
  factor() %>%
  top_levels(n = 1)

##                                     .  n   percent
##                                 black 10 0.1149425
##  <<< Middle Group (13 categories) >>> 66 0.7586207
##                                yellow 11 0.1264368

References

Firke, Sam. 2021. Janitor: Simple Tools for Examining and Cleaning Dirty Data. https://CRAN.R-project.org/package=janitor.

R Tutorial: janitor package

Philipp Leppert

28.01.2022