Reading time: 13 minutes (2,506 words)
Before trying to answer scientific questions or making predictions with a (unknown) dataset, you should get familiar with it. At least, this will include an analysis of the data structure and the contained observations.
How many variables are there?
What is their level of measurement?
Do my variables of interest follow a normal distribution or are there clusters in my data I should pay attention to?
Are some of my modelling assumptions going to be violated?
The answers to these questions lie in a description or summary of the data at hand. That’s why they are commonly referred to as descriptive statistics.
library(dplyr)
library(tidyr)
library(ggplot2)
library(purrr)
library(forcats)
library(ggmosaic)
library(Hmisc)
library(scales)
All practical examples are built around the Mid-Atlantic Wage Data contained in R package ISLR
. This dataset contains information about 3,000 male workers in the Mid-Atlantic region like their yearly salary and other demographic characteristics. After reading the data, I convert wage
in full US dollar.
data("Wage", package = "ISLR")
Wage <- Wage %>%
mutate(wage = wage*1000)
After reading the data, the Wage dataset can be found in RStudio’s global environment
and displays the number of columns and rows (3000 obs. of 11 variables
).
The R package dplyr
offers the function glimpse()
which displays more information about the dataset. RStudio’s console
displays all columns (variable names) of the dataset. The column format, called class
, can be found between the < >
operators. There are integer columns (int
), factor columns (fct
) as well as double columns (dbl
). For each column some values (rows) are displayed.
Wage %>%
glimpse()
## Rows: 3,000
## Columns: 11
## $ year <int> 2006, 2004, 2003, 2003, 2005, 2008, 2009, 2008, 2006, 2004,~
## $ age <int> 18, 24, 45, 43, 50, 54, 44, 30, 41, 52, 45, 34, 35, 39, 54,~
## $ maritl <fct> 1. Never Married, 1. Never Married, 2. Married, 2. Married,~
## $ race <fct> 1. White, 1. White, 1. White, 3. Asian, 1. White, 1. White,~
## $ education <fct> 1. < HS Grad, 4. College Grad, 3. Some College, 4. College ~
## $ region <fct> 2. Middle Atlantic, 2. Middle Atlantic, 2. Middle Atlantic,~
## $ jobclass <fct> 1. Industrial, 2. Information, 1. Industrial, 2. Informatio~
## $ health <fct> 1. <=Good, 2. >=Very Good, 1. <=Good, 2. >=Very Good, 1. <=~
## $ health_ins <fct> 2. No, 2. No, 1. Yes, 1. Yes, 1. Yes, 1. Yes, 1. Yes, 1. Ye~
## $ logwage <dbl> 4.318063, 4.255273, 4.875061, 5.041393, 4.318063, 4.845098,~
## $ wage <dbl> 75043.15, 70476.02, 130982.18, 154685.29, 75043.15, 127115.~
The integrated function str()
is very similar to glimpse()
. There are only minor differences like the display of the number of levels for each factor variable.
Wage %>%
str()
## 'data.frame': 3000 obs. of 11 variables:
## $ year : int 2006 2004 2003 2003 2005 2008 2009 2008 2006 2004 ...
## $ age : int 18 24 45 43 50 54 44 30 41 52 ...
## $ maritl : Factor w/ 5 levels "1. Never Married",..: 1 1 2 2 4 2 2 1 1 2 ...
## $ race : Factor w/ 4 levels "1. White","2. Black",..: 1 1 1 3 1 1 4 3 2 1 ...
## $ education : Factor w/ 5 levels "1. < HS Grad",..: 1 4 3 4 2 4 3 3 3 2 ...
## $ region : Factor w/ 9 levels "1. New England",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ jobclass : Factor w/ 2 levels "1. Industrial",..: 1 2 1 2 2 2 1 2 2 2 ...
## $ health : Factor w/ 2 levels "1. <=Good","2. >=Very Good": 1 2 1 2 1 2 2 1 2 2 ...
## $ health_ins: Factor w/ 2 levels "1. Yes","2. No": 2 2 1 1 1 1 1 1 1 1 ...
## $ logwage : num 4.32 4.26 4.88 5.04 4.32 ...
## $ wage : num 75043 70476 130982 154685 75043 ...
These variables are typically of class factor
or character
. The following representations might apply:
These variable are typically of class int
or double
. The following representations might apply:
In this section we’ll discuss frequency tables, bar charts and mosaic plots as a simple but useful way to examine one ore more variables of nominal or ordinal scale.
The distribution of the respondents’ education
marks the starting point of the following analysis.
The integrated function table()
produces a one-way (frequency) table, which calculates the number of observations (counts) for each level of education
.
Wage %>%
select(education) %>%
table()
## .
## 1. < HS Grad 2. HS Grad 3. Some College 4. College Grad
## 268 971 650 685
## 5. Advanced Degree
## 426
The function geom_bar()
from R package ggplot2
is used to create a corresponding bar chart. Bald but informative!
ggplot(data = Wage, aes(x=education)) +
geom_bar(fill = "white", color = "black") +
labs(x = "Education", y = "Number of observations")
Next, I want to gain information about the distribution of the respondents’ education
while taking into account their race
.
Supplying the function table()
with two variables produces a two-way (contingency) table.
Wage %>%
select(education, race) %>%
table()
## race
## education 1. White 2. Black 3. Asian 4. Other
## 1. < HS Grad 211 31 15 11
## 2. HS Grad 822 105 31 13
## 3. Some College 532 92 18 8
## 4. College Grad 576 40 66 3
## 5. Advanced Degree 339 25 60 2
In the bar chart variable race
might be assigned to argument fill=
, which fills each bar with the levels of race
.
ggplot(data = Wage, aes(x = education, fill = race)) +
geom_bar(color = "black") +
labs(x = "Education", y = "Number of observations", fill = "Race")
For comparing the respondents’ race
between each level of education
, the arumgent position
might be altered to dodge
the bars instead of stack
ing them.
ggplot(data = Wage, aes(x = education, fill = race)) +
geom_bar(position = "dodge", color = "black") +
labs(x = "Education", y = "Number of observations", fill = "Race")
The distribution of race
within each education
bar can be more precisely assessed by displaying relative percentages instead of counts. This can be achieved by rescaling the y-axis with function scale_y_continuous()
and using the R package scales
which offers tailored options for the labeling of the scale. This bar chart is well suited for comparing the respondents’ race
within each level of education
.
ggplot(data = Wage, aes(x = education, fill = race)) +
geom_bar(position = "fill", color = "black") +
scale_y_continuous(labels = scales::percent) +
labs(x = "Education", y = "Share", fill = "Race")
The mosaic plot is a superior form of visualization in this setting. The contents of a contingency table are displayed in bars which differ in their width according to the respective count. There are many implementations of the mosaic plot - here I am using the function geom_mosaic
from R package ggmosaic
which is easily integrated in a ggplot visualization. I removed all labels of race
from the x-axis and assigned them to the argument fill =
because they overlap each other and are hardy readable.
ggplot(data = Wage) +
geom_mosaic(aes(x = product(education, race), fill = race)) +
labs(x = "Race", y = "Education", fill = "Race") +
theme(
axis.title.x = element_blank(),
axis.text.x = element_blank(),
axis.ticks.x = element_blank()
)
In addition to race
let us take into account the respondents’ jobclass
for assessing the distribution of education
.
Now, the function table()
creates a contingency table of education
and race
for each level of jobclass
. Extracting information from this form of display is a bit more elaborate …
Wage %>%
dplyr::select(education, race, jobclass) %>%
table()
## , , jobclass = 1. Industrial
##
## race
## education 1. White 2. Black 3. Asian 4. Other
## 1. < HS Grad 159 12 12 7
## 2. HS Grad 552 48 25 11
## 3. Some College 300 30 8 4
## 4. College Grad 234 17 23 0
## 5. Advanced Degree 80 4 18 0
##
## , , jobclass = 2. Information
##
## race
## education 1. White 2. Black 3. Asian 4. Other
## 1. < HS Grad 52 19 3 4
## 2. HS Grad 270 57 6 2
## 3. Some College 232 62 10 4
## 4. College Grad 342 23 43 3
## 5. Advanced Degree 259 21 42 2
The bar chart might be used again by creating stand-alone bar charts for each level of jobclass
while using separate bars for each level of race
inside them. The function facet_grid()
allows to group a plot by one or two variables. Here I am using column-wise grouping with argument cols =
. What I don’t like about this bar chart are the duplicated x-axes which are hardly readable because of the levels of education
…
ggplot(data = Wage, aes(x = education, fill = race)) +
geom_bar(position = "dodge", color = "black") +
facet_grid(cols = vars(jobclass)) +
labs(x = "Education", y = "Number of observations", fill = "Race")
The function coord_flip()
solves this problem. Close inspection reveals that the colored bars for race
are not always on the same height, making a comparison between the levels of jobclass
unnecessary difficult. This issue is caused by empty cells (= combinations of variables’ levels without any observations). For example, there are no respondents with jobclass
Industrial, race
Other and education
Advanced Degree or College Degree …
ggplot(data = Wage, aes(x = education, fill = race)) +
geom_bar(position = "dodge", color = "black") +
facet_grid(cols = vars(jobclass)) +
coord_flip() +
labs(x = "Education", y = "Number of observations", fill = "Race")
This issue is more demanding. At first, each cell (= count per combination of each variable’s levels) has to be calculated manually. This can be achieved with the functions group_by()
and summarise()
from R package dplyr
.
Then complete()
from R package tidyr
does the trick by filling in missing categories of a given hierarchical structure with NA
or other user-defined values. Always check that the argument nesting
is supplied with the required structure. The argument stat =
within function geom_bar()
now has to be supplied with the value "identity"
since the "count"
s (number of observations per combination) have already been calculated.
Wage %>%
group_by(education, race, jobclass) %>%
summarise(count = n()) %>%
ungroup() %>%
complete(education, nesting(jobclass, race), fill = list(count = NA)) %>%
ggplot(aes(x = education, y = count, fill = race)) +
geom_bar(stat = "identity",
position = "dodge",
color = "black") +
facet_grid(cols = vars(jobclass)) +
coord_flip() +
labs(x = "Education", y = "Number of observations", fill = "Race")
The mosaic plot also offers a concise way of displaying information contained in grouped contingency tables. Argument conds =
has to be supplied with the grouping variable, so in this example the levels of education
are grouped by the levels of jobclass
. To ease comparisons with the previous bar chart I reversed the order of the levels of jobclass
with function fct_rev()
from R package forcats
. Looks neat!
Wage %>%
mutate(jobclass = fct_rev(jobclass)) %>%
ggplot(data = .) +
geom_mosaic(aes(
x = product(education, race),
fill = race,
conds = product(jobclass)
),
divider = mosaic("v")) +
labs(x = "", y = "Education : Jobclass", fill = "Race") +
theme(
axis.title.x = element_blank(),
axis.text.x = element_blank(),
axis.ticks.x = element_blank()
)
Generally speaking more variables might be added in the tabular dimension, although the information gain is often quite small because the form of presentation gets confusing Even visualizations with mosaic plots or bar charts hit a wall when supplied with too many variables. Nevertheless, I am adding health_ins
as grouping variable for my analysis of the distribution of education
.
The function table()
now prints contingency tables for all combinations of jobclass
and health_ins
. Not very pretty but since each variable has only two levels it’s still readable. Especially when using the function ftable()
which flattens the output of function table()
.
Wage %>%
select(education, race, jobclass, health_ins) %>%
table() %>%
ftable()
## health_ins 1. Yes 2. No
## education race jobclass
## 1. < HS Grad 1. White 1. Industrial 73 86
## 2. Information 25 27
## 2. Black 1. Industrial 6 6
## 2. Information 13 6
## 3. Asian 1. Industrial 4 8
## 2. Information 0 3
## 4. Other 1. Industrial 2 5
## 2. Information 1 3
## 2. HS Grad 1. White 1. Industrial 324 228
## 2. Information 199 71
## 2. Black 1. Industrial 27 21
## 2. Information 41 16
## 3. Asian 1. Industrial 11 14
## 2. Information 4 2
## 4. Other 1. Industrial 5 6
## 2. Information 1 1
## 3. Some College 1. White 1. Industrial 207 93
## 2. Information 180 52
## 2. Black 1. Industrial 21 9
## 2. Information 40 22
## 3. Asian 1. Industrial 6 2
## 2. Information 6 4
## 4. Other 1. Industrial 4 0
## 2. Information 3 1
## 4. College Grad 1. White 1. Industrial 179 55
## 2. Information 273 69
## 2. Black 1. Industrial 11 6
## 2. Information 18 5
## 3. Asian 1. Industrial 15 8
## 2. Information 31 12
## 4. Other 1. Industrial 0 0
## 2. Information 2 1
## 5. Advanced Degree 1. White 1. Industrial 60 20
## 2. Information 220 39
## 2. Black 1. Industrial 1 3
## 2. Information 19 2
## 3. Asian 1. Industrial 13 5
## 2. Information 36 6
## 4. Other 1. Industrial 0 0
## 2. Information 2 0
The function facet_grid()
offers a second argument (rows=
) for adding another grouping variable in the bar chart. Use the labeller =
argument inside function facet_grid()
in order to display the variable name inside the facets which eases the interpretation of imprecise labels. When using the relative shares of race
inside the bars of each level of education
this plot is also quite informative.
Wage %>%
group_by(education, race, jobclass, health_ins) %>%
summarise(count = n()) %>%
ungroup() %>%
complete(education, nesting(jobclass, race, health_ins), fill = list(count = NA)) %>%
ggplot(aes(x=education, y = count, fill = race)) +
geom_bar(stat = "identity", position = "fill", color = "black") +
facet_grid(rows = vars(health_ins), cols = vars(jobclass),
labeller = label_both) +
scale_y_continuous(labels = scales::percent) +
coord_flip() +
labs(x = "Education", y = "Share", fill = "Race")
The variable health_in
might be added inside the facet_grid()
function.
Wage %>%
mutate(jobclass = fct_rev(jobclass)) %>%
ggplot(data = .) +
geom_mosaic(aes(
x = product(education, race),
fill = race,
conds = product(jobclass)
),
divider = mosaic("v")) +
facet_grid(cols = vars(health_ins), labeller = label_both) +
labs(x = "", y = "Education : Jobclass", fill = "Race") +
theme(
axis.title.x = element_blank(),
axis.text.x = element_blank(),
axis.ticks.x = element_blank()
)
Now we’ll discuss tools to evaluate variables of interval or continuous scale. I’m going to present classical summary statistics, histograms, density plots as well as boxplots for assessing the distribution of these variables.
The function summary()
displays summary statistics for all variables in a dataset. For numeric variables (class int
ordbl
) location parameters like the mean and median of the their distribution are printed.
Wage %>%
select(age, wage) %>%
summary()
## age wage
## Min. :18.00 Min. : 20086
## 1st Qu.:33.75 1st Qu.: 85384
## Median :42.00 Median :104922
## Mean :42.41 Mean :111704
## 3rd Qu.:51.00 3rd Qu.:128681
## Max. :80.00 Max. :318342
R package Hmisc
contains function describe()
which is also suited for generating summary statistics. Additionally the number of missing
and distinct
values are displayed. Regarding numeric variables the five lowest and highest values are shown.
Wage %>%
select(age, wage) %>%
Hmisc::describe()
## .
##
## 2 Variables 3000 Observations
## --------------------------------------------------------------------------------
## age
## n missing distinct Info Mean Gmd .05 .10
## 3000 0 61 0.999 42.41 13.16 24.00 27.00
## .25 .50 .75 .90 .95
## 33.75 42.00 51.00 58.00 61.00
##
## lowest : 18 19 20 21 22, highest: 74 75 76 77 80
## --------------------------------------------------------------------------------
## wage
## n missing distinct Info Mean Gmd .05 .10
## 3000 0 508 1 111704 42643 61188 70476
## .25 .50 .75 .90 .95
## 85384 104922 128680 154704 176990
##
## lowest : 20085.54 20934.38 22962.40 23274.70 23952.94
## highest: 299262.98 309571.77 311934.57 314329.34 318342.43
## --------------------------------------------------------------------------------
The function geom_histogram()
from R package ggplot2
generates a histogram for a single variable. The number of bins
can be controlled with the eponymous argument.
ggplot(data = Wage, aes(x = wage)) +
geom_histogram(fill = "white", color = "black", bins = 50) +
scale_x_continuous(labels = scales::dollar) +
labs(x = "Wage", y = "Count")
Summary statistics of continuously scaled variables might be grouped by variables of nominal, categorical or ordinal scale. Hence, individual summary statistics for each level of the grouping variable will be calculated. In this section I am grouping the summary statistics by jobclass
.
With the integrated function tapply()
function summary()
can be (t)applied to all levels of education
for calculating summary statistics of wage
. However, summary statistics of more than one variable cannot be caluclated …
tapply(Wage$wage, Wage$jobclass, summary)
## $`1. Industrial`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 22962 81283 99689 103321 118884 295991
##
## $`2. Information`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 20086 91699 112649 120593 137590 318342
The R package purrr
contains function map()
which can produce grouped summary statistics for multiple variables. At first, however, I have to split()
the dataset by the grouping variable. With many variables this form of display gets messy very quickly …
Wage %>%
select(age, wage, jobclass) %>%
split(.$jobclass) %>%
map(summary)
## $`1. Industrial`
## age wage jobclass
## Min. :18.0 Min. : 22962 1. Industrial :1544
## 1st Qu.:32.0 1st Qu.: 81283 2. Information: 0
## Median :41.0 Median : 99689
## Mean :41.4 Mean :103321
## 3rd Qu.:50.0 3rd Qu.:118884
## Max. :80.0 Max. :295991
##
## $`2. Information`
## age wage jobclass
## Min. :18.00 Min. : 20086 1. Industrial : 0
## 1st Qu.:35.00 1st Qu.: 91699 2. Information:1456
## Median :43.00 Median :112649
## Mean :43.49 Mean :120593
## 3rd Qu.:51.00 3rd Qu.:137590
## Max. :80.00 Max. :318342
The functions group_by()
and across()
from R package dplyr
are able to generate grouped summary statistics for multiple variables as well. However, the location parameters have to be calculated manually. Sounds complex but it is not! Sure the coding effort is a bit higher but there is virtually no restriction on the number and type of (location) parameters. For this example I am calculating the mean and median as well as missing values for wage
and age
. Then I am using the function pivot_longer()
from R package tidyr
to transpose the output in a long format which looks tidy. Finally, just print()
the data frame.
grouped_summary <- Wage %>%
select(wage, age, jobclass) %>%
group_by(jobclass) %>%
summarise(across(
.cols = everything(),
.fns = list(
mean = mean,
median = median,
nmiss = ~ sum(is.na(.x))
)
)) %>%
pivot_longer(
cols = -jobclass,
names_to = c("variable", "parameter"),
names_sep = "_",
values_to = "value"
)
print(grouped_summary)
## # A tibble: 12 x 4
## jobclass variable parameter value
## <fct> <chr> <chr> <dbl>
## 1 1. Industrial wage mean 103321.
## 2 1. Industrial wage median 99689.
## 3 1. Industrial wage nmiss 0
## 4 1. Industrial age mean 41.4
## 5 1. Industrial age median 41
## 6 1. Industrial age nmiss 0
## 7 2. Information wage mean 120593.
## 8 2. Information wage median 112649.
## 9 2. Information wage nmiss 0
## 10 2. Information age mean 43.5
## 11 2. Information age median 43
## 12 2. Information age nmiss 0
A boxplot is suitable for visualising location parameters especially by groups. The function geom_boxplot()
from R package ggplot2
generates boxplots of wage
for each level of jobclass
.
ggplot(data = Wage, aes(x = wage, y = jobclass)) +
geom_boxplot(color = "black") +
scale_x_continuous(labels = scales::dollar) +
labs(x = "Wage", y = "Jobclass")
The density plot is a smooth version of the histogram, which is in my opinion more suited for grouped summary statistics. It can be created using function geom_density()
from R package ggplot2
. The grouping variable jobclass
is assigned to argument fill =
. The important feature to make this plot informative is the argument alpha =
, which adjusts the color transparency of the grouping variable. More transparent colors are obtained with smaller values of alpha
.
ggplot(data = Wage, aes(x = wage, fill = jobclass)) +
geom_density(alpha = 0.6) +
scale_x_continuous(labels = scales::dollar) +
labs(x = "Wage", y = "Density", fill = "Jobclass")
Another visualization technique which is the violin plot. It is a mixture of a boxplot and a density plot. The latter is mirrored such that it appears in shape as a boxplot. Below I marked the mean value of wage
for each joblcass
inside the violin plot with a red dot and also added a point range of one standard deviation from the mean.
ggplot(data = Wage, aes(x = jobclass, y = wage)) +
geom_violin(trim = FALSE) +
stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1),
geom = "pointrange", color = "red") +
scale_y_continuous(labels = scales::dollar) +
labs(x = "", y = "Wage")
Of course we might add another variable for grouping a summary statistic so I am adding health_ins
to the mix.
The function split()
introduced in the previous section can only be used with one variable to split a dataset by its levels. I am helping myself out here with a little trick by combining the levels of both grouping variables in a new grouping variable. With the integrated function intersect()
I am creating the new variable jobclass_health_in
which coalesces the levels of jobclass
and health_ins
.
Wage %>%
mutate(jobclass_health_ins = interaction(jobclass, health_ins)) %>%
select(wage, jobclass_health_ins) %>%
split(.$jobclass_health_ins) %>%
map(summary)
## $`1. Industrial.1. Yes`
## wage jobclass_health_ins
## Min. : 34607 1. Industrial.1. Yes :969
## 1st Qu.: 87981 2. Information.1. Yes: 0
## Median :107904 1. Industrial.2. No : 0
## Mean :112255 2. Information.2. No : 0
## 3rd Qu.:129454
## Max. :278964
##
## $`2. Information.1. Yes`
## wage jobclass_health_ins
## Min. : 32366 1. Industrial.1. Yes : 0
## 1st Qu.: 99690 2. Information.1. Yes:1114
## Median :117147 1. Industrial.2. No : 0
## Mean :127182 2. Information.2. No : 0
## 3rd Qu.:141775
## Max. :318342
##
## $`1. Industrial.2. No`
## wage jobclass_health_ins
## Min. : 22962 1. Industrial.1. Yes : 0
## 1st Qu.: 68748 2. Information.1. Yes: 0
## Median : 84100 1. Industrial.2. No :575
## Mean : 88265 2. Information.2. No : 0
## 3rd Qu.:102870
## Max. :295991
##
## $`2. Information.2. No`
## wage jobclass_health_ins
## Min. : 20086 1. Industrial.1. Yes : 0
## 1st Qu.: 73776 2. Information.1. Yes: 0
## Median : 92298 1. Industrial.2. No : 0
## Mean : 99129 2. Information.2. No :342
## 3rd Qu.:116926
## Max. :309572
However, I think the tidy approach
introduced in the previous section is more suitable. On the one hand, I only have to alter the functions group_by()
and pivot_longer()
by adding health_ins
. On the other hand, I preserve the flexibility to add various (location) parameters.
grouped_summary <- Wage %>%
select(age, wage, jobclass, health_ins) %>%
group_by(jobclass, health_ins) %>%
summarise(across(
.cols = everything(),
.fns = list(median = median,
mean = mean)
)) %>%
pivot_longer(
cols = -c(jobclass, health_ins),
names_to = c("variable", "parameter"),
names_sep = "_",
values_to = "value"
)
print(grouped_summary)
## # A tibble: 16 x 5
## # Groups: jobclass [2]
## jobclass health_ins variable parameter value
## <fct> <fct> <chr> <chr> <dbl>
## 1 1. Industrial 1. Yes age median 43
## 2 1. Industrial 1. Yes age mean 43.0
## 3 1. Industrial 1. Yes wage median 107904.
## 4 1. Industrial 1. Yes wage mean 112255.
## 5 1. Industrial 2. No age median 38
## 6 1. Industrial 2. No age mean 38.7
## 7 1. Industrial 2. No wage median 84100.
## 8 1. Industrial 2. No wage mean 88265.
## 9 2. Information 1. Yes age median 44
## 10 2. Information 1. Yes age mean 43.9
## 11 2. Information 1. Yes wage median 117147.
## 12 2. Information 1. Yes wage mean 127182.
## 13 2. Information 2. No age median 41.5
## 14 2. Information 2. No age mean 42.0
## 15 2. Information 2. No wage median 92298.
## 16 2. Information 2. No wage mean 99129.
Regarding boxplots the argument fill =
might be supplied with health_ins
in order to obtain separate boxplots for each level of jobclass
.
ggplot(data = Wage, aes(x = wage, y = jobclass, fill = health_ins)) +
geom_boxplot(color = "black") +
scale_x_continuous(labels = scales::dollar) +
labs(x = "Wage", y = "Jobclass", fill = "Health_ins")
In the previous section I already assigned jobclass
to argument fill =
for the density plot. Instead I am now using function facet_grid
and generate stand-alone density plots for each level of jobclass
.
ggplot(data = Wage, aes(x = wage, fill=jobclass)) +
geom_density(alpha = 0.6) +
facet_grid(cols = vars(health_ins), labeller = label_both) +
scale_x_continuous(labels = scales::dollar) +
labs(x = "Wage", y = "Density", fill = "Jobclass")
The violin plot is specified in the same manner as the boxplot. Note that you have to alter the position =
argument in function stat_summary()
in order to shift the calculated point range inside the filled violin plots.
ggplot(data = Wage, aes(x = jobclass, y = wage, fill = health_ins)) +
geom_violin(trim = FALSE) +
stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1),
geom = "pointrange", color = "red",
position = position_dodge(width = 0.9)) +
scale_y_continuous(labels = scales::dollar) +
labs(x = "", y = "Wage", fill = "Health_ins")
The scatterplot is a two dimensional representation of two numeric variables. We use the function geom_point()
to create it.
ggplot(data = Wage, aes(x = age, y = wage)) +
geom_point() +
scale_y_continuous(labels = scales::dollar) +
labs(x = "Age", y = "Wage")
It can be easily enhanced with many of the approaches we’ve seen in the previous sections such as adding further dimensions via facet_wrap()
.
ggplot(data = Wage, aes(x = age, y = wage)) +
geom_point() +
facet_wrap(.~year) +
scale_y_continuous(labels = scales::dollar) +
labs(x = "Age", y = "Wage")
In this plot we plot the value of a numeric variable against each share of all smaller values. By assessing the slope of the resulting curve we can identify local densities and outliers. Are the variable’s values of equal size, they will lie on a horizontal line in this plot. If the values are slightly different we will observe a slight slope. A steep slope indicates that the values are strongly different from each. Hence, the curve flattens out when the density of the distribution is high.
There is, to my knowledge, no implementation of this plot readily availabe in R, so I’ve created something on my own that works with functions of the ggplot2
package. We need to create a new variable fraction
which assigns each osbervations some kind of rank reagarding the variable of interest. Likewise we have to sort the variable of interest (wage
). Then we plot the ordered values of wage
against quantiles of a uniform distribution.
Wage_mod <-
Wage %>%
mutate(fraction = (1:length(wage) - 1)/(length(wage) - 1),
wage_sorted = sort(wage))
Wage_mod %>%
ggplot(aes(x=fraction, y = wage_sorted)) +
geom_point() +
geom_abline(
intercept = min(Wage_mod$wage),
slope = (max(Wage_mod$wage)-min(Wage_mod$wage))/(max(Wage_mod$fraction)-min(Wage_mod$fraction)),
color = "red"
) +
geom_vline(xintercept = c(0.25,0.50, 0.75)) +
scale_y_continuous(labels = scales::dollar) +
labs(x = "Fraction of the data",
y = "Quantiles of Wage")
For a large fraction of the data we observe only a slight slope, hence the distributions is pretty dense. On the other hand we observe steep slopes at the tails of the distribution. So there are some outliers with very high values of wage
but also some with low values of wage
- compared to all other wages.
If we observe that the majority of values is below the straight red line, the distribution is said to be skewed to the right.
With a Q-Q plot we can compare the distribution of two variables or the distribution of one variable against the normal distribution.
ggplot(Wage,
aes(sample = wage)) +
stat_qq() +
stat_qq_line() +
scale_y_continuous(labels = scales::dollar) +
labs(x = "Theoretical Quantiles",
y = "Wage")
Below I compare wage
between the "Industrial"
and "Information"
sector. If the workers wages between both sectors are identically distributed, all observations would lie on the straight line. As the observations are closert to the y-axis, workers in the information sector earn higher wages in total. I was not able to come up with a solution to do this in ggplot2
yet.
qqplot(Wage$wage[Wage$jobclass=="1. Industrial"], Wage$wage[Wage$jobclass=="2. Information"],)
abline(a = 0, b = 1, lty="solid")