Reading time: 13 minutes (2,506 words)


1. Introduction

Before trying to answer scientific questions or making predictions with a (unknown) dataset, you should get familiar with it. At least, this will include an analysis of the data structure and the contained observations.

  • How many variables are there?

  • What is their level of measurement?

  • Do my variables of interest follow a normal distribution or are there clusters in my data I should pay attention to?

  • Are some of my modelling assumptions going to be violated?

The answers to these questions lie in a description or summary of the data at hand. That’s why they are commonly referred to as descriptive statistics.

1.1 R packages

library(dplyr)
library(tidyr)
library(ggplot2)
library(purrr)
library(forcats)
library(ggmosaic)
library(Hmisc)
library(scales)

1.2 Dataset

All practical examples are built around the Mid-Atlantic Wage Data contained in R package ISLR. This dataset contains information about 3,000 male workers in the Mid-Atlantic region like their yearly salary and other demographic characteristics. After reading the data, I convert wage in full US dollar.

data("Wage", package = "ISLR")

Wage <- Wage %>%
  mutate(wage = wage*1000)

2. The Big Picture

2.1 At first glance

After reading the data, the Wage dataset can be found in RStudio’s global environment and displays the number of columns and rows (3000 obs. of 11 variables).

dplyr::glimpse

The R package dplyr offers the function glimpse() which displays more information about the dataset. RStudio’s console displays all columns (variable names) of the dataset. The column format, called class, can be found between the < > operators. There are integer columns (int), factor columns (fct) as well as double columns (dbl). For each column some values (rows) are displayed.

Wage %>%
  glimpse()
## Rows: 3,000
## Columns: 11
## $ year       <int> 2006, 2004, 2003, 2003, 2005, 2008, 2009, 2008, 2006, 2004,~
## $ age        <int> 18, 24, 45, 43, 50, 54, 44, 30, 41, 52, 45, 34, 35, 39, 54,~
## $ maritl     <fct> 1. Never Married, 1. Never Married, 2. Married, 2. Married,~
## $ race       <fct> 1. White, 1. White, 1. White, 3. Asian, 1. White, 1. White,~
## $ education  <fct> 1. < HS Grad, 4. College Grad, 3. Some College, 4. College ~
## $ region     <fct> 2. Middle Atlantic, 2. Middle Atlantic, 2. Middle Atlantic,~
## $ jobclass   <fct> 1. Industrial, 2. Information, 1. Industrial, 2. Informatio~
## $ health     <fct> 1. <=Good, 2. >=Very Good, 1. <=Good, 2. >=Very Good, 1. <=~
## $ health_ins <fct> 2. No, 2. No, 1. Yes, 1. Yes, 1. Yes, 1. Yes, 1. Yes, 1. Ye~
## $ logwage    <dbl> 4.318063, 4.255273, 4.875061, 5.041393, 4.318063, 4.845098,~
## $ wage       <dbl> 75043.15, 70476.02, 130982.18, 154685.29, 75043.15, 127115.~

utils::str

The integrated function str() is very similar to glimpse(). There are only minor differences like the display of the number of levels for each factor variable.

Wage %>%
  str()
## 'data.frame':    3000 obs. of  11 variables:
##  $ year      : int  2006 2004 2003 2003 2005 2008 2009 2008 2006 2004 ...
##  $ age       : int  18 24 45 43 50 54 44 30 41 52 ...
##  $ maritl    : Factor w/ 5 levels "1. Never Married",..: 1 1 2 2 4 2 2 1 1 2 ...
##  $ race      : Factor w/ 4 levels "1. White","2. Black",..: 1 1 1 3 1 1 4 3 2 1 ...
##  $ education : Factor w/ 5 levels "1. < HS Grad",..: 1 4 3 4 2 4 3 3 3 2 ...
##  $ region    : Factor w/ 9 levels "1. New England",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ jobclass  : Factor w/ 2 levels "1. Industrial",..: 1 2 1 2 2 2 1 2 2 2 ...
##  $ health    : Factor w/ 2 levels "1. <=Good","2. >=Very Good": 1 2 1 2 1 2 2 1 2 2 ...
##  $ health_ins: Factor w/ 2 levels "1. Yes","2. No": 2 2 1 1 1 1 1 1 1 1 ...
##  $ logwage   : num  4.32 4.26 4.88 5.04 4.32 ...
##  $ wage      : num  75043 70476 130982 154685 75043 ...


2.2 Qualitative variables

These variables are typically of class factor or character. The following representations might apply:


2.3 Numerical variables

These variable are typically of class int or double. The following representations might apply:


3. Exploring Qualitative Variables

In this section we’ll discuss frequency tables, bar charts and mosaic plots as a simple but useful way to examine one ore more variables of nominal or ordinal scale.

3.1 One variable

The distribution of the respondents’ education marks the starting point of the following analysis.

One-way Table

The integrated function table() produces a one-way (frequency) table, which calculates the number of observations (counts) for each level of education.

Wage %>%
  select(education) %>%
  table()
## .
##       1. < HS Grad         2. HS Grad    3. Some College    4. College Grad 
##                268                971                650                685 
## 5. Advanced Degree 
##                426

Bar Chart

The function geom_bar() from R package ggplot2 is used to create a corresponding bar chart. Bald but informative!

ggplot(data = Wage, aes(x=education)) +
  geom_bar(fill = "white", color = "black") +
  labs(x = "Education", y = "Number of observations")


3.2 Two variables

Next, I want to gain information about the distribution of the respondents’ education while taking into account their race.

Two-way Table

Supplying the function table() with two variables produces a two-way (contingency) table.

Wage %>%
  select(education, race) %>%
  table()
##                     race
## education            1. White 2. Black 3. Asian 4. Other
##   1. < HS Grad            211       31       15       11
##   2. HS Grad              822      105       31       13
##   3. Some College         532       92       18        8
##   4. College Grad         576       40       66        3
##   5. Advanced Degree      339       25       60        2

Bar Chart (fill)

In the bar chart variable race might be assigned to argument fill=, which fills each bar with the levels of race.

ggplot(data = Wage, aes(x = education, fill = race)) +
  geom_bar(color = "black") +
  labs(x = "Education", y = "Number of observations", fill = "Race")

Bar Chart (dodge)

For comparing the respondents’ race between each level of education, the arumgent position might be altered to dodge the bars instead of stacking them.

ggplot(data = Wage, aes(x = education, fill = race)) +
  geom_bar(position = "dodge", color = "black") +
  labs(x = "Education", y = "Number of observations", fill = "Race")

Bar Chart (percent)

The distribution of race within each education bar can be more precisely assessed by displaying relative percentages instead of counts. This can be achieved by rescaling the y-axis with function scale_y_continuous() and using the R package scales which offers tailored options for the labeling of the scale. This bar chart is well suited for comparing the respondents’ race within each level of education.

ggplot(data = Wage, aes(x = education, fill = race)) +
  geom_bar(position = "fill", color = "black") +
  scale_y_continuous(labels = scales::percent) +
  labs(x = "Education", y = "Share", fill = "Race")

Mosaic Plot

The mosaic plot is a superior form of visualization in this setting. The contents of a contingency table are displayed in bars which differ in their width according to the respective count. There are many implementations of the mosaic plot - here I am using the function geom_mosaic from R package ggmosaic which is easily integrated in a ggplot visualization. I removed all labels of race from the x-axis and assigned them to the argument fill = because they overlap each other and are hardy readable.

ggplot(data = Wage) +
  geom_mosaic(aes(x = product(education, race), fill = race)) +
  labs(x = "Race", y = "Education", fill = "Race") +
  theme(
    axis.title.x = element_blank(),
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank()
  )


3.3 Three variables

In addition to race let us take into account the respondents’ jobclass for assessing the distribution of education.

Three-way Table

Now, the function table() creates a contingency table of education and race for each level of jobclass. Extracting information from this form of display is a bit more elaborate …

Wage %>%
  dplyr::select(education, race, jobclass) %>%
  table()
## , , jobclass = 1. Industrial
## 
##                     race
## education            1. White 2. Black 3. Asian 4. Other
##   1. < HS Grad            159       12       12        7
##   2. HS Grad              552       48       25       11
##   3. Some College         300       30        8        4
##   4. College Grad         234       17       23        0
##   5. Advanced Degree       80        4       18        0
## 
## , , jobclass = 2. Information
## 
##                     race
## education            1. White 2. Black 3. Asian 4. Other
##   1. < HS Grad             52       19        3        4
##   2. HS Grad              270       57        6        2
##   3. Some College         232       62       10        4
##   4. College Grad         342       23       43        3
##   5. Advanced Degree      259       21       42        2

Bar Chart (facet)

The bar chart might be used again by creating stand-alone bar charts for each level of jobclass while using separate bars for each level of race inside them. The function facet_grid()allows to group a plot by one or two variables. Here I am using column-wise grouping with argument cols =. What I don’t like about this bar chart are the duplicated x-axes which are hardly readable because of the levels of education …

ggplot(data = Wage, aes(x = education, fill = race)) +
  geom_bar(position = "dodge", color = "black") +
  facet_grid(cols = vars(jobclass)) +
  labs(x = "Education", y = "Number of observations", fill = "Race")

Bar Chart (flipped)

The function coord_flip() solves this problem. Close inspection reveals that the colored bars for race are not always on the same height, making a comparison between the levels of jobclass unnecessary difficult. This issue is caused by empty cells (= combinations of variables’ levels without any observations). For example, there are no respondents with jobclass Industrial, race Other and education Advanced Degree or College Degree …

ggplot(data = Wage, aes(x = education, fill = race)) +
  geom_bar(position = "dodge", color = "black") +
  facet_grid(cols = vars(jobclass)) +
  coord_flip() +
  labs(x = "Education", y = "Number of observations", fill = "Race")

Bar Chart (complete)

This issue is more demanding. At first, each cell (= count per combination of each variable’s levels) has to be calculated manually. This can be achieved with the functions group_by() and summarise() from R package dplyr.

Then complete() from R package tidyr does the trick by filling in missing categories of a given hierarchical structure with NA or other user-defined values. Always check that the argument nesting is supplied with the required structure. The argument stat = within function geom_bar() now has to be supplied with the value "identity" since the "count"s (number of observations per combination) have already been calculated.

Wage %>%
  group_by(education, race, jobclass) %>%
  summarise(count = n()) %>%
  ungroup() %>%
  complete(education, nesting(jobclass, race),  fill = list(count = NA)) %>%
  ggplot(aes(x = education, y = count, fill = race)) +
  geom_bar(stat = "identity",
           position = "dodge",
           color = "black") +
  facet_grid(cols = vars(jobclass)) +
  coord_flip() +
  labs(x = "Education", y = "Number of observations", fill = "Race")

Mosaic Plot

The mosaic plot also offers a concise way of displaying information contained in grouped contingency tables. Argument conds = has to be supplied with the grouping variable, so in this example the levels of education are grouped by the levels of jobclass. To ease comparisons with the previous bar chart I reversed the order of the levels of jobclass with function fct_rev() from R package forcats. Looks neat!

Wage %>%
  mutate(jobclass = fct_rev(jobclass)) %>%
  ggplot(data = .) +
  geom_mosaic(aes(
    x = product(education, race),
    fill = race,
    conds = product(jobclass)
  ),
  divider = mosaic("v")) +
  labs(x = "", y = "Education : Jobclass", fill = "Race") +
  theme(
    axis.title.x = element_blank(),
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank()
  )


3.4 Four variables

Generally speaking more variables might be added in the tabular dimension, although the information gain is often quite small because the form of presentation gets confusing Even visualizations with mosaic plots or bar charts hit a wall when supplied with too many variables. Nevertheless, I am adding health_ins as grouping variable for my analysis of the distribution of education.

N-way Table

The function table() now prints contingency tables for all combinations of jobclass and health_ins. Not very pretty but since each variable has only two levels it’s still readable. Especially when using the function ftable() which flattens the output of function table().

Wage %>%
  select(education, race, jobclass, health_ins) %>%
  table() %>% 
  ftable()
##                                            health_ins 1. Yes 2. No
## education          race     jobclass                              
## 1. < HS Grad       1. White 1. Industrial                 73    86
##                             2. Information                25    27
##                    2. Black 1. Industrial                  6     6
##                             2. Information                13     6
##                    3. Asian 1. Industrial                  4     8
##                             2. Information                 0     3
##                    4. Other 1. Industrial                  2     5
##                             2. Information                 1     3
## 2. HS Grad         1. White 1. Industrial                324   228
##                             2. Information               199    71
##                    2. Black 1. Industrial                 27    21
##                             2. Information                41    16
##                    3. Asian 1. Industrial                 11    14
##                             2. Information                 4     2
##                    4. Other 1. Industrial                  5     6
##                             2. Information                 1     1
## 3. Some College    1. White 1. Industrial                207    93
##                             2. Information               180    52
##                    2. Black 1. Industrial                 21     9
##                             2. Information                40    22
##                    3. Asian 1. Industrial                  6     2
##                             2. Information                 6     4
##                    4. Other 1. Industrial                  4     0
##                             2. Information                 3     1
## 4. College Grad    1. White 1. Industrial                179    55
##                             2. Information               273    69
##                    2. Black 1. Industrial                 11     6
##                             2. Information                18     5
##                    3. Asian 1. Industrial                 15     8
##                             2. Information                31    12
##                    4. Other 1. Industrial                  0     0
##                             2. Information                 2     1
## 5. Advanced Degree 1. White 1. Industrial                 60    20
##                             2. Information               220    39
##                    2. Black 1. Industrial                  1     3
##                             2. Information                19     2
##                    3. Asian 1. Industrial                 13     5
##                             2. Information                36     6
##                    4. Other 1. Industrial                  0     0
##                             2. Information                 2     0

Barchart

The function facet_grid() offers a second argument (rows=) for adding another grouping variable in the bar chart. Use the labeller = argument inside function facet_grid() in order to display the variable name inside the facets which eases the interpretation of imprecise labels. When using the relative shares of race inside the bars of each level of education this plot is also quite informative.

Wage %>%
  group_by(education, race, jobclass, health_ins) %>%
  summarise(count = n()) %>%
  ungroup() %>%
  complete(education, nesting(jobclass, race, health_ins),  fill = list(count = NA)) %>%
  ggplot(aes(x=education, y = count, fill = race)) +
  geom_bar(stat = "identity", position = "fill", color = "black") +
  facet_grid(rows = vars(health_ins), cols = vars(jobclass),
             labeller = label_both) +
  scale_y_continuous(labels = scales::percent) +
  coord_flip() +
  labs(x = "Education", y = "Share", fill = "Race")

Mosaic Plot

The variable health_in might be added inside the facet_grid() function.

Wage %>%
  mutate(jobclass = fct_rev(jobclass)) %>%
  ggplot(data = .) +
  geom_mosaic(aes(
    x = product(education, race),
    fill = race,
    conds = product(jobclass)
  ),
  divider = mosaic("v")) +
  facet_grid(cols =  vars(health_ins), labeller = label_both) +
  labs(x = "", y = "Education : Jobclass", fill = "Race") +
  theme(
    axis.title.x = element_blank(),
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank()
  )


4. Exploring Numeric Variables

Now we’ll discuss tools to evaluate variables of interval or continuous scale. I’m going to present classical summary statistics, histograms, density plots as well as boxplots for assessing the distribution of these variables.

4.1 Simple summary statistics

base::summary

The function summary() displays summary statistics for all variables in a dataset. For numeric variables (class int ordbl) location parameters like the mean and median of the their distribution are printed.

Wage %>%
  select(age, wage) %>%
  summary()
##       age             wage       
##  Min.   :18.00   Min.   : 20086  
##  1st Qu.:33.75   1st Qu.: 85384  
##  Median :42.00   Median :104922  
##  Mean   :42.41   Mean   :111704  
##  3rd Qu.:51.00   3rd Qu.:128681  
##  Max.   :80.00   Max.   :318342

Hmisc::describe

R package Hmisc contains function describe() which is also suited for generating summary statistics. Additionally the number of missing and distinct values are displayed. Regarding numeric variables the five lowest and highest values are shown.

Wage %>%
  select(age, wage) %>%
  Hmisc::describe()
## . 
## 
##  2  Variables      3000  Observations
## --------------------------------------------------------------------------------
## age 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     3000        0       61    0.999    42.41    13.16    24.00    27.00 
##      .25      .50      .75      .90      .95 
##    33.75    42.00    51.00    58.00    61.00 
## 
## lowest : 18 19 20 21 22, highest: 74 75 76 77 80
## --------------------------------------------------------------------------------
## wage 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     3000        0      508        1   111704    42643    61188    70476 
##      .25      .50      .75      .90      .95 
##    85384   104922   128680   154704   176990 
## 
## lowest :  20085.54  20934.38  22962.40  23274.70  23952.94
## highest: 299262.98 309571.77 311934.57 314329.34 318342.43
## --------------------------------------------------------------------------------

Histogram

The function geom_histogram() from R package ggplot2 generates a histogram for a single variable. The number of bins can be controlled with the eponymous argument.

ggplot(data = Wage, aes(x = wage)) +
  geom_histogram(fill = "white", color = "black", bins = 50) +
  scale_x_continuous(labels = scales::dollar) +
  labs(x = "Wage", y = "Count")


4.2 Summarizing by a single group

Summary statistics of continuously scaled variables might be grouped by variables of nominal, categorical or ordinal scale. Hence, individual summary statistics for each level of the grouping variable will be calculated. In this section I am grouping the summary statistics by jobclass.

base::tapply

With the integrated function tapply() function summary() can be (t)applied to all levels of education for calculating summary statistics of wage. However, summary statistics of more than one variable cannot be caluclated …

tapply(Wage$wage, Wage$jobclass, summary)
## $`1. Industrial`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   22962   81283   99689  103321  118884  295991 
## 
## $`2. Information`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   20086   91699  112649  120593  137590  318342

purrr::map

The R package purrr contains function map() which can produce grouped summary statistics for multiple variables. At first, however, I have to split() the dataset by the grouping variable. With many variables this form of display gets messy very quickly …

Wage %>% 
  select(age, wage, jobclass) %>%
  split(.$jobclass) %>% 
  map(summary)
## $`1. Industrial`
##       age            wage                  jobclass   
##  Min.   :18.0   Min.   : 22962   1. Industrial :1544  
##  1st Qu.:32.0   1st Qu.: 81283   2. Information:   0  
##  Median :41.0   Median : 99689                        
##  Mean   :41.4   Mean   :103321                        
##  3rd Qu.:50.0   3rd Qu.:118884                        
##  Max.   :80.0   Max.   :295991                        
## 
## $`2. Information`
##       age             wage                  jobclass   
##  Min.   :18.00   Min.   : 20086   1. Industrial :   0  
##  1st Qu.:35.00   1st Qu.: 91699   2. Information:1456  
##  Median :43.00   Median :112649                        
##  Mean   :43.49   Mean   :120593                        
##  3rd Qu.:51.00   3rd Qu.:137590                        
##  Max.   :80.00   Max.   :318342

Tidy Approach

The functions group_by() and across() from R package dplyr are able to generate grouped summary statistics for multiple variables as well. However, the location parameters have to be calculated manually. Sounds complex but it is not! Sure the coding effort is a bit higher but there is virtually no restriction on the number and type of (location) parameters. For this example I am calculating the mean and median as well as missing values for wage and age. Then I am using the function pivot_longer() from R package tidyr to transpose the output in a long format which looks tidy. Finally, just print() the data frame.

grouped_summary <- Wage %>%
  select(wage, age, jobclass) %>%
  group_by(jobclass) %>%
  summarise(across(
    .cols = everything(),
    .fns = list(
      mean = mean,
      median = median,
      nmiss = ~ sum(is.na(.x))
    )
  )) %>%
  pivot_longer(
    cols = -jobclass,
    names_to = c("variable", "parameter"),
    names_sep = "_",
    values_to = "value"
  )

print(grouped_summary)
## # A tibble: 12 x 4
##    jobclass       variable parameter    value
##    <fct>          <chr>    <chr>        <dbl>
##  1 1. Industrial  wage     mean      103321. 
##  2 1. Industrial  wage     median     99689. 
##  3 1. Industrial  wage     nmiss          0  
##  4 1. Industrial  age      mean          41.4
##  5 1. Industrial  age      median        41  
##  6 1. Industrial  age      nmiss          0  
##  7 2. Information wage     mean      120593. 
##  8 2. Information wage     median    112649. 
##  9 2. Information wage     nmiss          0  
## 10 2. Information age      mean          43.5
## 11 2. Information age      median        43  
## 12 2. Information age      nmiss          0

Boxplot

A boxplot is suitable for visualising location parameters especially by groups. The function geom_boxplot() from R package ggplot2 generates boxplots of wage for each level of jobclass.

ggplot(data = Wage, aes(x = wage, y = jobclass)) +
  geom_boxplot(color = "black") +
  scale_x_continuous(labels = scales::dollar) +
  labs(x = "Wage", y = "Jobclass") 

Density plot

The density plot is a smooth version of the histogram, which is in my opinion more suited for grouped summary statistics. It can be created using function geom_density() from R package ggplot2. The grouping variable jobclass is assigned to argument fill =. The important feature to make this plot informative is the argument alpha =, which adjusts the color transparency of the grouping variable. More transparent colors are obtained with smaller values of alpha.

ggplot(data = Wage, aes(x = wage, fill = jobclass)) +
  geom_density(alpha = 0.6) +
  scale_x_continuous(labels = scales::dollar) +
  labs(x = "Wage", y = "Density", fill = "Jobclass")

Violin plot

Another visualization technique which is the violin plot. It is a mixture of a boxplot and a density plot. The latter is mirrored such that it appears in shape as a boxplot. Below I marked the mean value of wage for each joblcass inside the violin plot with a red dot and also added a point range of one standard deviation from the mean.

ggplot(data = Wage, aes(x = jobclass, y = wage)) +
  geom_violin(trim = FALSE) +
  stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1), 
               geom = "pointrange", color = "red") +
  scale_y_continuous(labels = scales::dollar) +
  labs(x = "", y = "Wage")


4.3 Summarizing by multiple groups

Of course we might add another variable for grouping a summary statistic so I am adding health_ins to the mix.

purrr::map

The function split() introduced in the previous section can only be used with one variable to split a dataset by its levels. I am helping myself out here with a little trick by combining the levels of both grouping variables in a new grouping variable. With the integrated function intersect() I am creating the new variable jobclass_health_in which coalesces the levels of jobclass and health_ins.

Wage %>% 
  mutate(jobclass_health_ins = interaction(jobclass, health_ins)) %>%
  select(wage, jobclass_health_ins) %>%
  split(.$jobclass_health_ins) %>% 
  map(summary)
## $`1. Industrial.1. Yes`
##       wage                   jobclass_health_ins
##  Min.   : 34607   1. Industrial.1. Yes :969     
##  1st Qu.: 87981   2. Information.1. Yes:  0     
##  Median :107904   1. Industrial.2. No  :  0     
##  Mean   :112255   2. Information.2. No :  0     
##  3rd Qu.:129454                                 
##  Max.   :278964                                 
## 
## $`2. Information.1. Yes`
##       wage                   jobclass_health_ins
##  Min.   : 32366   1. Industrial.1. Yes :   0    
##  1st Qu.: 99690   2. Information.1. Yes:1114    
##  Median :117147   1. Industrial.2. No  :   0    
##  Mean   :127182   2. Information.2. No :   0    
##  3rd Qu.:141775                                 
##  Max.   :318342                                 
## 
## $`1. Industrial.2. No`
##       wage                   jobclass_health_ins
##  Min.   : 22962   1. Industrial.1. Yes :  0     
##  1st Qu.: 68748   2. Information.1. Yes:  0     
##  Median : 84100   1. Industrial.2. No  :575     
##  Mean   : 88265   2. Information.2. No :  0     
##  3rd Qu.:102870                                 
##  Max.   :295991                                 
## 
## $`2. Information.2. No`
##       wage                   jobclass_health_ins
##  Min.   : 20086   1. Industrial.1. Yes :  0     
##  1st Qu.: 73776   2. Information.1. Yes:  0     
##  Median : 92298   1. Industrial.2. No  :  0     
##  Mean   : 99129   2. Information.2. No :342     
##  3rd Qu.:116926                                 
##  Max.   :309572

Tidy Approach

However, I think the tidy approach introduced in the previous section is more suitable. On the one hand, I only have to alter the functions group_by() and pivot_longer() by adding health_ins. On the other hand, I preserve the flexibility to add various (location) parameters.

grouped_summary <- Wage %>%
  select(age, wage, jobclass, health_ins) %>%
  group_by(jobclass, health_ins) %>%
  summarise(across(
    .cols = everything(),
    .fns = list(median = median,
                mean = mean)
  )) %>%
  pivot_longer(
    cols = -c(jobclass, health_ins),
    names_to = c("variable", "parameter"),
    names_sep = "_",
    values_to = "value"
  )

print(grouped_summary)
## # A tibble: 16 x 5
## # Groups:   jobclass [2]
##    jobclass       health_ins variable parameter    value
##    <fct>          <fct>      <chr>    <chr>        <dbl>
##  1 1. Industrial  1. Yes     age      median        43  
##  2 1. Industrial  1. Yes     age      mean          43.0
##  3 1. Industrial  1. Yes     wage     median    107904. 
##  4 1. Industrial  1. Yes     wage     mean      112255. 
##  5 1. Industrial  2. No      age      median        38  
##  6 1. Industrial  2. No      age      mean          38.7
##  7 1. Industrial  2. No      wage     median     84100. 
##  8 1. Industrial  2. No      wage     mean       88265. 
##  9 2. Information 1. Yes     age      median        44  
## 10 2. Information 1. Yes     age      mean          43.9
## 11 2. Information 1. Yes     wage     median    117147. 
## 12 2. Information 1. Yes     wage     mean      127182. 
## 13 2. Information 2. No      age      median        41.5
## 14 2. Information 2. No      age      mean          42.0
## 15 2. Information 2. No      wage     median     92298. 
## 16 2. Information 2. No      wage     mean       99129.

Boxplot

Regarding boxplots the argument fill = might be supplied with health_ins in order to obtain separate boxplots for each level of jobclass.

ggplot(data = Wage, aes(x = wage, y = jobclass, fill = health_ins)) +
  geom_boxplot(color = "black") +
  scale_x_continuous(labels = scales::dollar) +
  labs(x = "Wage", y = "Jobclass", fill = "Health_ins") 

Density Plot

In the previous section I already assigned jobclass to argument fill = for the density plot. Instead I am now using function facet_grid and generate stand-alone density plots for each level of jobclass.

ggplot(data = Wage, aes(x = wage, fill=jobclass)) +
  geom_density(alpha = 0.6) +
  facet_grid(cols = vars(health_ins), labeller = label_both) +
  scale_x_continuous(labels = scales::dollar) +
  labs(x = "Wage", y = "Density", fill = "Jobclass")

Violin plot

The violin plot is specified in the same manner as the boxplot. Note that you have to alter the position = argument in function stat_summary() in order to shift the calculated point range inside the filled violin plots.

ggplot(data = Wage, aes(x = jobclass, y = wage, fill = health_ins)) +
  geom_violin(trim = FALSE) +
   stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1), 
                geom = "pointrange", color = "red",
                position = position_dodge(width = 0.9)) +
  scale_y_continuous(labels = scales::dollar) +
  labs(x = "", y = "Wage", fill = "Health_ins")


4.4 Further methods

Scatterplot

The scatterplot is a two dimensional representation of two numeric variables. We use the function geom_point()to create it.

ggplot(data = Wage, aes(x = age, y = wage)) +
  geom_point() +
  scale_y_continuous(labels = scales::dollar) +
  labs(x = "Age", y = "Wage")

It can be easily enhanced with many of the approaches we’ve seen in the previous sections such as adding further dimensions via facet_wrap().

ggplot(data = Wage, aes(x = age, y = wage)) +
  geom_point() +
  facet_wrap(.~year) +
  scale_y_continuous(labels = scales::dollar) +
  labs(x = "Age", y = "Wage")

Quantile Plot

In this plot we plot the value of a numeric variable against each share of all smaller values. By assessing the slope of the resulting curve we can identify local densities and outliers. Are the variable’s values of equal size, they will lie on a horizontal line in this plot. If the values are slightly different we will observe a slight slope. A steep slope indicates that the values are strongly different from each. Hence, the curve flattens out when the density of the distribution is high.

There is, to my knowledge, no implementation of this plot readily availabe in R, so I’ve created something on my own that works with functions of the ggplot2 package. We need to create a new variable fraction which assigns each osbervations some kind of rank reagarding the variable of interest. Likewise we have to sort the variable of interest (wage). Then we plot the ordered values of wage against quantiles of a uniform distribution.

Wage_mod <-
  Wage %>%
  mutate(fraction = (1:length(wage) - 1)/(length(wage) - 1),
         wage_sorted = sort(wage))

Wage_mod %>%
  ggplot(aes(x=fraction, y = wage_sorted)) +
  geom_point() +
  geom_abline(
    intercept = min(Wage_mod$wage), 
    slope = (max(Wage_mod$wage)-min(Wage_mod$wage))/(max(Wage_mod$fraction)-min(Wage_mod$fraction)),
    color = "red"
    ) +
  geom_vline(xintercept = c(0.25,0.50, 0.75)) +
  scale_y_continuous(labels = scales::dollar) +
  labs(x = "Fraction of the data",
       y = "Quantiles of Wage")

For a large fraction of the data we observe only a slight slope, hence the distributions is pretty dense. On the other hand we observe steep slopes at the tails of the distribution. So there are some outliers with very high values of wage but also some with low values of wage - compared to all other wages.

If we observe that the majority of values is below the straight red line, the distribution is said to be skewed to the right.

Quantile-Quantile Plot

With a Q-Q plot we can compare the distribution of two variables or the distribution of one variable against the normal distribution.

ggplot(Wage, 
       aes(sample = wage)) +
  stat_qq() +
  stat_qq_line() +
  scale_y_continuous(labels = scales::dollar) +
  labs(x = "Theoretical Quantiles",
       y = "Wage")

Below I compare wage between the "Industrial" and "Information" sector. If the workers wages between both sectors are identically distributed, all observations would lie on the straight line. As the observations are closert to the y-axis, workers in the information sector earn higher wages in total. I was not able to come up with a solution to do this in ggplot2 yet.

qqplot(Wage$wage[Wage$jobclass=="1. Industrial"], Wage$wage[Wage$jobclass=="2. Information"],)
abline(a = 0, b = 1, lty="solid")