Reading time: 25 minutes (4,872 words)
The following packages are required for this tutorial.
library(ggplot2)
library(dplyr)
library(tidyr)
library(forcats)
library(stringr)
library(broom)
library(car)
library(boot)
library(heplots)
library(mvnormalTest)
The practical examples in the following sections are always illustrated with simulated data, so there is no need to load a dataset.
With a statistical hypothesis we want to test if there is evidence against a statement about the nature of a population (consider for example a parameter like the mean or median). Before testing a statistical hypothesis the probability of error or the significance level (α) has to be set. The corresponding confidence interval is 1-α. So for example a significance level of 5 % (or 0.05) reflects a 95 % confidence interval (0.95).
The significance level gives the probability of rejecting the null hypothesis although it is correct. Given a significance level of 5 %, in 5 out of 100 cases the null hypothesis is erroneously rejected. The null hypotheis is rejected if the calculated p value is smaller than the predefined significane level. However, it is also correct to say that you are 95 % confident that the alternative hypothesis is true. In the other case, one can say that the null hypothesis cannot be rejected instead that the null hypothesis is accepted. The statement the null hypothesis is rejected and the alternative is true with a probality of at least 95 % is wrong.
Rejecting the null hypothesis after a statistical test is denoted as either:
By design the result of a statistical test is never unequivocal evidence for the change/decrease/increase of a measure, even when the null hypothesis is rejected at a very low significane level. Also, a significant change can be negligibly small. Always remember that it is required to set the significane level before calculating the p value, in order to not adjust α in hindsight to obtain the desired result.
The signficance level in the following examples is always set to 5 %.
The Student’s t-test is a statistical hypothesis test, in which the test statistic follows the Student’s t-distribution. Commonly, it is used to determine if the means of two groups of data are significantly different from each other. Still, it can also be applied to a single sample, where the mean is tested against a specific value. Below is a representation with different degrees of freedom.
For higher degrees of freedom the t-distribution approximates the normal distribution. However, note the difference in the tails between both distributions.
In its most simple form the t-test is a location test of whether the mean of a population has a value specified in the null hypothesis. The data, and hence the sample mean, must follow a normal distribution in order for the test to work.
The first dataset I create contains 75 individuals identified by an id
for which a certain variable
of interest was recorded. The variable
was generated with function rnorm()
and follows a normal distribution with a mean of 5 and standard deviation of 2. Since the values are drawn randomly from the distribution, it is important to use function set.seed()
in order to obtain replicable results.
set.seed(123)
dataset_1 <- tibble(
id = as.character(c(1:75)),
variable = rnorm(n = 75, mean = 5, sd = 3)
)
dataset_1
## # A tibble: 75 x 2
## id variable
## <chr> <dbl>
## 1 1 3.32
## 2 2 4.31
## 3 3 9.68
## 4 4 5.21
## 5 5 5.39
## 6 6 10.1
## 7 7 6.38
## 8 8 1.20
## 9 9 2.94
## 10 10 3.66
## # ... with 65 more rows
With a histogram, the distribution of the variable
can be visualized.
ggplot(data = dataset_1,
aes(x = variable)) +
geom_histogram(fill = "white", color = "black", bins = 10) +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank())
This looks normally distributed.
Before I compute the t-test, the normality assumption is checked. As the sample is quite small, the central limit theorem will not apply. However, a Shapiro-Wilk Normality Test can be used to test whether the data follows a normal distribution. The null hypothesis is hence that the data is normally distributed. This test can be computed with function shapiro.test()
.
shapiro.test(dataset_1$variable)
##
## Shapiro-Wilk normality test
##
## data: dataset_1$variable
## W = 0.99421, p-value = 0.9834
The test shows a p-value of 0.98. Hence, there is no evidence to reject the null hypothesis in favor of the alternative that the data is not normally distributed.
Finally we can run the t-test which is implemented in function t.test()
. The theoretical mean against which the hypothesis test is carried out (mu
) is 0 by default. With argument alternative =
a two- or one-tailed test can be applied. Below, the null hypothesis that the population mean of variable
is equal to zero (mu = 0
), is rejected in favor of the alternative that it is not equal to zero.
t.test(dataset_1$variable,
alternative = "two.sided",
mu = 0, conf.level = 0.95)
##
## One Sample t-test
##
## data: dataset_1$variable
## t = 15.679, df = 74, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 4.433596 5.724500
## sample estimates:
## mean of x
## 5.079048
To check this result, I am computing the test statistic and confidence interval by hand.
knitr::kable(tibble(
stderr = sqrt( var(dataset_1$variable) / length(dataset_1$variable) ),
tstat = (mean(dataset_1$variable) - 0) / stderr,
pval = 2 * pt(q = -abs(tstat), df = length(dataset_1$variable) -1),
alpha = 1-0.95,
error = qt(p = 1 - alpha / 2, df = length(dataset_1$variable) -1) * stderr,
lower_ci = mean(dataset_1$variable) - error,
upper_ci = mean(dataset_1$variable) + error)
)
stderr | tstat | pval | alpha | error | lower_ci | upper_ci |
---|---|---|---|---|---|---|
0.3239338 | 15.67928 | 0 | 0.05 | 0.6454521 | 4.433596 | 5.7245 |
A one-tailed t-test is used for the null hypothesis that the population mean of variable
is greater than or equal to 6. The null hypothesis rejected in favor of the alternative that it is less than 6.
t.test(dataset_1$variable,
alternative = "less",
mu = 6, conf.level = 0.95)
##
## One Sample t-test
##
## data: dataset_1$variable
## t = -2.843, df = 74, p-value = 0.002886
## alternative hypothesis: true mean is less than 6
## 95 percent confidence interval:
## -Inf 5.618627
## sample estimates:
## mean of x
## 5.079048
Next, I test the null hypothesis that the the population mean of variable
is less than or equal to 4. The null hypothesis is rejected in favor of the alternative that it is greater than 4.
t.test(dataset_1$variable,
alternative = "greater",
mu = 4, conf.level = 0.95)
##
## One Sample t-test
##
## data: dataset_1$variable
## t = 3.3311, df = 74, p-value = 0.000676
## alternative hypothesis: true mean is greater than 4
## 95 percent confidence interval:
## 4.539469 Inf
## sample estimates:
## mean of x
## 5.079048
Again, I show the caluclation for both one-tailed tests.
knitr::kable(tibble(
stderr = sqrt( var(dataset_1$variable) / length(dataset_1$variable) ),
tstat = (mean(dataset_1$variable) - 6) / stderr,
pval = pt(q = -abs(tstat), df = length(dataset_1$variable) -1),
error = qt(p = 0.95, df = length(dataset_1$variable) -1) * stderr,
lower_ci = -Inf,
upper_ci = mean(dataset_1$variable) + error)
)
stderr | tstat | pval | error | lower_ci | upper_ci |
---|---|---|---|---|---|
0.3239338 | -2.843025 | 0.0028864 | 0.5395787 | -Inf | 5.618627 |
knitr::kable(tibble(
stderr = sqrt( var(dataset_1$variable) / length(dataset_1$variable) ),
tstat = (mean(dataset_1$variable) - 4) / stderr,
pval = pt(q = abs(tstat), df = length(dataset_1$variable) -1, lower.tail = FALSE),
error = qt(p = 0.95, df = length(dataset_1$variable) -1) * stderr,
lower_ci = mean(dataset_1$variable) - error,
upper_ci = Inf)
)
stderr | tstat | pval | error | lower_ci | upper_ci |
---|---|---|---|---|---|
0.3239338 | 3.331076 | 0.000676 | 0.5395787 | 4.539469 | Inf |
The independent or unpaired samples t-test is used to compare the mean of two unrelated groups. Each group is a sample of one of the two populations being compared. Again, we have to check that the data of the two groups is normally distributed. Additionally, the variance of the two groups now plays a role and is assumed to be equal (variance homogeneity).
For the dataset in this section I create 50 observations for men and women each of a variable
of interest. Both groups are normally distributed with an identical standard deviation but their mean is different as seen below.
set.seed(123)
dataset_2 <- tibble(
men = rnorm(50, mean = 5, sd = 2),
women = rnorm(50, mean = 8, sd = 2)
) %>%
pivot_longer(cols = c(men, women),names_to = "group", values_to = "variable")
dataset_2
## # A tibble: 100 x 2
## group variable
## <chr> <dbl>
## 1 men 3.88
## 2 women 8.51
## 3 men 4.54
## 4 women 7.94
## 5 men 8.12
## 6 women 7.91
## 7 men 5.14
## 8 women 10.7
## 9 men 5.26
## 10 women 7.55
## # ... with 90 more rows
The distribution of variable
within each group is shown below with two boxplots.
ggplot(data = dataset_2,
aes(x = group, y = variable)) +
geom_boxplot() +
labs(x = "")
The Shapiro-Wilk-Test is again used to check the assumption of normally distributed data. It is carried out for each group individually.
dataset_2 %>%
filter(group == "men") %>%
pull(variable) %>%
shapiro.test()
##
## Shapiro-Wilk normality test
##
## data: .
## W = 0.98928, p-value = 0.9279
dataset_2 %>%
filter(group == "women") %>%
pull(variable) %>%
shapiro.test()
##
## Shapiro-Wilk normality test
##
## data: .
## W = 0.99073, p-value = 0.9618
For both grouphs, the null hypothesis that the data is normally distributed cannot be rejected.
The function var.test()
performs an F-test to compare the variances of the two groups regarding variable
. The null hypothesis is that the ratio of the variances is equal to 1.
var.test(variable ~ group, data = dataset_2,
ratio = 1, alternative = "two.sided",
conf.level = 0.95)
##
## F test to compare two variances
##
## data: variable by group
## F = 1.0456, num df = 49, denom df = 49, p-value = 0.8766
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.5933642 1.8425790
## sample estimates:
## ratio of variances
## 1.04562
There is no evidence to reject the null hypothesis. So it is fine to assume homoskedasticity.
The t-test for unpaired groups can be computed in the same way as in the one-sample case. The argument paired =
is set to FALSE
by default in function t.test()
. Since there was no evidence against homoskedasticity, the argument var.equal=TRUE
is added. The null hypothesis is that the difference of the mean value of variable
between men
and women
is equal to zero.
t.test(variable ~ group, data = dataset_2,
alternative = "two.sided",
mu = 0, paired = FALSE,
var.equal = TRUE)
##
## Two Sample t-test
##
## data: variable by group
## t = -8.8019, df = 98, p-value = 4.799e-14
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -3.950893 -2.497126
## sample estimates:
## mean in group men mean in group women
## 5.068807 8.292817
The null hypothesis is rejected in favor of the alternative that there is a significant difference in the groups’ means.
Below, I computed the output of t.test()
. The confidence interval now refers to the difference between the mean of the two groups.
knitr::kable(
tibble(
stderr = sqrt(var(dataset_2[dataset_2$group == "men", ]$variable) / 50 +
var(dataset_2[dataset_2$group == "women", ]$variable) / 50),
tstat = (mean(dataset_2[dataset_2$group == "men", ]$variable) -
mean(dataset_2[dataset_2$group == "women", ]$variable)) / stderr,
pval = 2 * pt(q = -abs(tstat), df = length(dataset_2$variable) - 2),
alpha = 1 - 0.95,
error = qt(p = 1 - alpha / 2, df = length(dataset_2$variable) - 1) * stderr,
lower_ci = mean(dataset_2[dataset_2$group == "men", ]$variable) -
mean(dataset_2[dataset_2$group == "women", ]$variable) - error,
upper_ci = mean(dataset_2[dataset_2$group == "men", ]$variable) -
mean(dataset_2[dataset_2$group == "women", ]$variable) + error
)
)
stderr | tstat | pval | alpha | error | lower_ci | upper_ci |
---|---|---|---|---|---|---|
0.3662862 | -8.801886 | 0 | 0.05 | 0.7267913 | -3.950801 | -2.497218 |
If the variance of the two groups being compared is not identical, it is possible to use a Welch t-test instead. This test is an adaptation of the Student’s t-test and does not use the pooled variance of both groups.
The next dataset is similar to the previously generated dataset, except that the variance of the distribution of women and men is different by design.
set.seed(123)
dataset_3 <- tibble(
men = rnorm(50, mean = 5, sd = 8),
women = rnorm(50, mean = 8, sd = 2)
) %>%
pivot_longer(cols = c(men, women),names_to = "group", values_to = "variable")
dataset_3
## # A tibble: 100 x 2
## group variable
## <chr> <dbl>
## 1 men 0.516
## 2 women 8.51
## 3 men 3.16
## 4 women 7.94
## 5 men 17.5
## 6 women 7.91
## 7 men 5.56
## 8 women 10.7
## 9 men 6.03
## 10 women 7.55
## # ... with 90 more rows
We can easily see this with the boxplots. The observations within group
men have a greater deviation around the median than the observations in the group
women.
ggplot(data = dataset_3,
aes(x = group, y = variable)) +
geom_boxplot() +
labs(x = "")
The null hypothesis, that the ratio of the variances is equal to 1, can now be rejected.
var.test(variable ~ group, data = dataset_3)
##
## F test to compare two variances
##
## data: variable by group
## F = 16.73, num df = 49, denom df = 49, p-value < 2.2e-16
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 9.493828 29.481263
## sample estimates:
## ratio of variances
## 16.72991
To perform a Welch t-test, the function t.test()
is modified by setting the argument var.equal = FALSE
.
t.test(variable ~ group, data = dataset_3,
alternative = "two.sided",
mu = 0, var.equal = FALSE)
##
## Welch Two Sample t-test
##
## data: variable by group
## t = -2.7983, df = 54.837, p-value = 0.007075
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -5.1788028 -0.8563735
## sample estimates:
## mean in group men mean in group women
## 5.275228 8.292817
If the data for the two groups are not normally distributed, it is recommended to use the Mann-Whitney U test (or Wilcoxon rank sum test). The null hypothesis of this test is that two samples come from the same population (have the same median) or, alternatively, whether observations in one sample tend to be larger than observations in the other. It assumes, however, that the two distributions are similar in shape.
The dataset from the previous section is slightly modified such that the variable
of interest for the two groups now follows a log normal distribution, which I created with function rlnorm()
.
set.seed(123)
dataset_4 <- tibble(
men = rlnorm(50, meanlog = 1, sdlog = 0.6),
women = rlnorm(50, meanlog = 1.2, sdlog = 0.6)
) %>%
pivot_longer(cols = c(men, women),names_to = "group", values_to = "variable")
dataset_4
## # A tibble: 100 x 2
## group variable
## <chr> <dbl>
## 1 men 1.94
## 2 women 3.87
## 3 men 2.37
## 4 women 3.26
## 5 men 6.93
## 6 women 3.24
## 7 men 2.84
## 8 women 7.55
## 9 men 2.94
## 10 women 2.90
## # ... with 90 more rows
I use a density plot to assess the skewness of the distribution of variable
in dataset_4
by comparing it to its distribution in dataset_2
from the previous section. It is similar in shape but cleary deviates from the normal distribution.
bind_rows(
list(dataset_2=dataset_2, dataset_4=dataset_4),
.id = "source"
) %>%
ggplot(data = .,
aes(x = variable, fill = group)) +
geom_density(position = "dodge", color = "black") +
facet_wrap(. ~ source) +
labs(fill = "")
The null hypothesis that the data follow a normal distribution is now strongly rejected for each group!
with(dataset_4, shapiro.test(variable[group == "men"]))
##
## Shapiro-Wilk normality test
##
## data: variable[group == "men"]
## W = 0.86863, p-value = 5.146e-05
with(dataset_4, shapiro.test(variable[group == "women"]))
##
## Shapiro-Wilk normality test
##
## data: variable[group == "women"]
## W = 0.87171, p-value = 6.328e-05
Below the Wilcoxon rank sum test is computed with function wilcox.test()
and setting argument paired = FALSE
. The null hypothesis is that the distribution of variable
between men
and women
differ by a location shift of mu
and the alternative is that they differ by some other location shift.
wilcox.test(variable ~ group, data = dataset_4,
alternative = "two.sided",
mu = 0, paired = FALSE)
##
## Wilcoxon rank sum test with continuity correction
##
## data: variable by group
## W = 898, p-value = 0.01539
## alternative hypothesis: true location shift is not equal to 0
We can conclude that the median value of variable
for men is significantly different from the median of variable
for women at the 5% level.
The paired samples t-test is similiar in its assumptions to the unpaired t-test except that it requires that the two groups of samples are related. For example this is the case when the same individual is observed at different time points.
The next dataset is generated such that 50 individuals which are identified by an id
were observed before and after a treatment. The observations are drawn from a normal distribution but the parameters differ before and after the treatment.
set.seed(123)
dataset_5 <- tibble(
id = c(1:50),
before_treatment = rnorm(50, mean = 3, sd = 1),
after_treatment = rnorm(50, mean = 5, sd = 1)
) %>%
pivot_longer(cols = c(before_treatment, after_treatment),
names_to = "group", values_to = "variable")
dataset_5
## # A tibble: 100 x 3
## id group variable
## <int> <chr> <dbl>
## 1 1 before_treatment 2.44
## 2 1 after_treatment 5.25
## 3 2 before_treatment 2.77
## 4 2 after_treatment 4.97
## 5 3 before_treatment 4.56
## 6 3 after_treatment 4.96
## 7 4 before_treatment 3.07
## 8 4 after_treatment 6.37
## 9 5 before_treatment 3.13
## 10 5 after_treatment 4.77
## # ... with 90 more rows
The distribution of variable
is shown with a boxplot before and after the treatment. Additionally, I connected each pair of id
. The observations are not independent!
ggplot(data = dataset_5,
aes(x = reorder(group, variable), y = variable)) +
geom_boxplot() +
geom_point() +
geom_line(aes(group = factor(id))) +
labs(x = "")
The Shapiro-Wilk-Test is computed as always but now the difference
between each pair of observations has to be used. The null hypothesis, that the differences before and after the treatment follow a normal distribution, cannot be rejected.
dataset_5 %>%
pivot_wider(id_cols = id, names_from = "group", values_from="variable") %>%
mutate(difference = after_treatment - before_treatment) %>%
pull(difference) %>%
shapiro.test()
##
## Shapiro-Wilk normality test
##
## data: .
## W = 0.98276, p-value = 0.6727
Let’s have a look at the distribution of the difference
.
dataset_5 %>%
pivot_wider(id_cols = id, names_from = "group", values_from="variable") %>%
mutate(difference = after_treatment - before_treatment) %>%
ggplot(aes(x=difference)) + geom_density()
The t-test for dependent groups can be computed by setting the argument paired =
to TRUE
in function t.test()
. Since I did not check the assumption of variance homogeneity the default var.equal = FALSE
is in place. The null hypothesis is that the difference of the mean value of variable
before and after the treatment is equal to zero.
t.test(variable ~ group, data = dataset_5,
alternative = "two.sided",
mu = 0,
paired = TRUE)
##
## Paired t-test
##
## data: variable by group
## t = 11.331, df = 49, p-value = 2.711e-15
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1.737424 2.486586
## sample estimates:
## mean of the differences
## 2.112005
The null hypothesis is rejected in favor of the alternative that there is a significant difference in the mean of variable
before and after the treatment.
Should the data not follow a normal distribution, it is recommended to use a Wilcoxon signed rank test instead of the t-test.
Now, a dataset is generated where observations are related but drawn from a log normal distribution.
set.seed(123)
dataset_6 <- tibble(
id = c(1:50),
before_treatment = rlnorm(50, meanlog = 2, sdlog = 0.7),
after_treatment = rlnorm(50, meanlog = 1, sdlog = 0.6)
) %>%
pivot_longer(cols = c(before_treatment, after_treatment),
names_to = "group", values_to = "variable")
dataset_6
## # A tibble: 100 x 3
## id group variable
## <int> <chr> <dbl>
## 1 1 before_treatment 4.99
## 2 1 after_treatment 3.16
## 3 2 before_treatment 6.29
## 4 2 after_treatment 2.67
## 5 3 before_treatment 22.0
## 6 3 after_treatment 2.65
## 7 4 before_treatment 7.76
## 8 4 after_treatment 6.18
## 9 5 before_treatment 8.09
## 10 5 after_treatment 2.37
## # ... with 90 more rows
I use a density plot to assess the skewness of the distribution before and after treatment by comparing it to the distribution in dataset_5
.
bind_rows(
list(dataset_5=dataset_5, dataset_6=dataset_6),
.id = "source"
) %>%
ggplot(data = .,
aes(x = variable, fill = group)) +
geom_density(position = "dodge", color = "black") +
facet_wrap(. ~ source, scales = "free") +
labs(fill = "")
There is no difference compared to the previous section, except that the null hypothesis, that the difference before and after the treatment follows a normal distribution, is now rejected.
dataset_6 %>%
pivot_wider(id_cols = id, names_from = "group", values_from="variable") %>%
mutate(difference = after_treatment - before_treatment) %>%
pull(difference) %>%
shapiro.test()
##
## Shapiro-Wilk normality test
##
## data: .
## W = 0.90492, p-value = 0.0007013
The distribution of difference
looks rather skewed now.
dataset_6 %>%
pivot_wider(id_cols = id, names_from = "group", values_from="variable") %>%
mutate(difference = after_treatment - before_treatment) %>%
ggplot(aes(x=difference)) + geom_density()
In function wilcox.test()
the argument paired =
must be set to TRUE
to perform this test. It tests whether the distribution of the two groups is symmetric about mu
. In other words, the median of the differences before and after treatment are compared to a hypothetical median.
wilcox.test(variable ~ group, data = dataset_6,
alternative = "two.sided",
mu = 0,
paired = TRUE)
##
## Wilcoxon signed rank test with continuity correction
##
## data: variable by group
## V = 100, p-value = 2.174e-07
## alternative hypothesis: true location shift is not equal to 0
The null hypothesis that the median differences before and after the treatment are equal to zero can be rejected.
The one-way analysis of variance (ANOVA), or one-factor ANOVA, is an extension of the independent two-samples t-test. It compares means in a situation where there are more than two groups. The data is split into several groups based on one single grouping variable (also called factor variable). ANOVA assumes that the observations are obtained independently and randomly from the population defined by the factor levels and the data of each factor level is normally distributed. Also the assumption of variance homogeneity, previously discussed by the t-test, applies here.
The ANOVA is performed with the following steps:
Compute the variance between groups
Compute the variance within groups (residual variance)
Determine the degrees of freedom
Calculate the F-statistic as the ratio of between variance and within variance
I am generating 30 observations for three different groups which are drawn from a normal distribution with different means but identical standard deviations. The column group
is declared as a factor variable with function factor()
.
set.seed(345)
dataset_7 <- tibble(
group_A = rnorm(30, mean = 1, sd = 2),
group_C = rnorm(30, mean = 4, sd = 2),
group_B = rnorm(30, mean = 3, sd = 2)
) %>%
pivot_longer(cols = starts_with("group"),
names_to = "group",
names_pattern = "._(.+)",
values_to = "variable") %>%
mutate(group = factor(group))
dataset_7
## # A tibble: 90 x 2
## group variable
## <fct> <dbl>
## 1 A -0.570
## 2 C 2.51
## 3 B 1.70
## 4 A 0.441
## 5 C 5.37
## 6 B 1.83
## 7 A 0.677
## 8 C 2.07
## 9 B 4.22
## 10 A 0.419
## # ... with 80 more rows
With function levels()
the unique values of a factor variable can be displayed. When performing an ANOVA with an unbalanced design, that means unequal cell sizes, the order of the levels is important.
levels(dataset_7$group)
## [1] "A" "B" "C"
dataset_7$group <- ordered(dataset_7$group,
levels = c("A", "B", "C"))
Below I plot all observations per group using the function geom_jitter()
which makes the distribution of variable
more visible. I am adding the standard deviation from the mean with function geom_errorbar()
.
dataset_7 %>%
group_by(group) %>%
summarise(
mean = mean(variable, na.rm = TRUE),
sd = sd(variable, na.rm = TRUE)
) %>%
left_join(dataset_7) %>%
ggplot(data = .,
aes(x = group, y = variable, col = group)) +
geom_jitter(width = 0.1) +
geom_errorbar(aes(ymin = mean - sd, ymax = mean + sd)) +
geom_point(aes(y = mean), size = 5, shape = 3) +
theme(legend.position = "none")
First we have to compute the means of variable
for each group
as well as the overall mean regardless of group
. To save some coding effort I create subsets of dataset_7
for each group.
group_A <- dataset_7 %>% filter(group == "A")
group_B <- dataset_7 %>% filter(group == "B")
group_C <- dataset_7 %>% filter(group == "C")
A_mean <- mean(group_A$variable)
B_mean <- mean(group_B$variable)
C_mean <- mean(group_C$variable)
Overall_mean <- mean(dataset_7$variable, na.rm =T)
To obtain the between variance the squared differences between the mean of each group and the overall mean are computed.
The within variance is computed as the sum of the squared differences between each observation and its group mean.
The degrees of freedom for the between variance are calculated by substracting 1 from the number of distinct groups. The degress of freedom for the within variance are calculated by substracting the number of distrinct groups from the number of observations.
The F-statistic is the ratio of the between and within variance adjusted by the degress of freedom. The p-value can be calculcated with function pf()
.
knitr::kable(tibble(
between_variance = 30*(A_mean-Overall_mean)^2 +
30*(B_mean-Overall_mean)^2 +
30*(C_mean-Overall_mean)^2,
within_variance = sum((group_A$variable - mean(group_A$variable))^2) +
sum((group_B$variable - mean(group_B$variable))^2) +
sum((group_C$variable - mean(group_C$variable))^2),
df_between = 3-1,
df_within = 90-3,
F_stat = (between_variance/df_between) / (within_variance/df_within),
pval = pf(F_stat, 2, 87, lower.tail = FALSE)
)
)
between_variance | within_variance | df_between | df_within | F_stat | pval |
---|---|---|---|---|---|
129.0416 | 355.1128 | 2 | 87 | 15.80711 | 1.4e-06 |
The one-way ANOVA can be calculated way more easily with function aov()
and the results are displayed with function summary()
. The null hypothesis is that there are no differences regarding variable
between the levels of group
.
model_aov <- aov(variable ~ group, data = dataset_7)
summary(model_aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## group 2 129.0 64.52 15.81 1.39e-06 ***
## Residuals 87 355.1 4.08
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Using function lm()
to compute the anova yields the same results with a little bit more precision but we have to use function anova()
instead of summary()
to get the desired output.
model_lm <- lm(variable ~ group, data = dataset_7)
anova(model_lm)
## Analysis of Variance Table
##
## Response: variable
## Df Sum Sq Mean Sq F value Pr(>F)
## group 2 129.04 64.521 15.807 1.394e-06 ***
## Residuals 87 355.11 4.082
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As the p-value is less than the significance level of 5 %, we can conclude that there are significant differences between the groups. However, we do not know which pairs of groups are different.
It is possible to perform multiple pairwise-comparisons to determine whether the mean differences between specific pairs of groups are statistically significant. This is can be done by computing Tukey Honest Significant Differences with function TukeyHSD()
. It takes the fitted ANOVA as an argument.
TukeyHSD(model_aov)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = variable ~ group, data = dataset_7)
##
## $group
## diff lwr upr p adj
## B-A 1.9378958 0.6940356 3.181756 0.0010356
## C-A 2.8756463 1.6317862 4.119506 0.0000011
## C-B 0.9377505 -0.3061096 2.181611 0.1763757
We can conclude that there are significant differences between the means of group A and group B as well as group A and group C but not between group B and group C.
An ANOVA assumes that the variance between each group’s observations is homogeneous. Either Bartlett’s test or Levene’s test can be used to test this assumption. The null hypothesis is that the variance across groups is homogeneous.
car::leveneTest(variable ~ group, data = dataset_7)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 0.5639 0.5711
## 87
bartlett.test(variable ~ group, data = dataset_7)
##
## Bartlett test of homogeneity of variances
##
## data: variable by group
## Bartlett's K-squared = 0.52506, df = 2, p-value = 0.7691
Both tests show that there is no evidence to suggest that the variance across groups is significantly different.
Also a scatterplot might be used to check whether there is a relationship between residuals and the fitted values (which are the mean of each group in this setting). The straight blue line connects the mean of each group’s residuals. Pay special attention to outliers in this type of plot as they can strongly influence the above tests! Here, everything looks fine. Note that this plot requires an lm
object and can hence only be created after running the ANOVA.
broom::augment(model_lm) %>%
group_by(group) %>%
summarise(mean_resid = mean(.resid)) %>%
left_join(broom::augment(model_lm)) %>%
ggplot(data = .,
aes(x = .fitted, y = .resid)) +
geom_point() +
geom_line(aes(y = mean_resid), color = "blue") +
labs(x = "Fitted values", y = "Residuals")
The Shapiro-Wilk test, which was already used with the Student’s t-test, is used to check the distribution of the residuals. The null hypothesis is that the residuals follow a normal distribution. The function residuals()
is used to extract the residuals from either the object created by aov()
or lm()
.
model_aov_residuals <- residuals(object = model_aov)
shapiro.test(x = model_aov_residuals)
##
## Shapiro-Wilk normality test
##
## data: model_aov_residuals
## W = 0.98789, p-value = 0.5769
The null hypothesis that the residuals are normally distributed cannot be rejected.
For assessing the normality assumption I am also creating a Q-Q (quantile-quantile) plot of the residuals with functions stat_qq()
and stat_qq_line()
. Approximately, the residuals should follow a straight line.
ggplot(broom::augment(model_lm),
aes(sample = .std.resid)) +
stat_qq() +
stat_qq_line() +
labs(x = "Theoretical Quantiles",
y = "Standardized Residuals")
Again, I am creating 30 observations for three different groups which are drawn from a normal distribution but now with different means and standard deviations.
set.seed(345)
dataset_8 <- tibble(
group_A = rnorm(30, mean = 1, sd = 2),
group_C = rnorm(30, mean = 4, sd = 5),
group_B = rnorm(30, mean = 3, sd = 8)
) %>%
pivot_longer(cols = starts_with("group"),
names_to = "group",
names_pattern = "._(.+)",
values_to = "variable") %>%
mutate(group = ordered(group,
levels = c("A", "B", "C")))
dataset_8
## # A tibble: 90 x 2
## group variable
## <ord> <dbl>
## 1 A -0.570
## 2 C 0.285
## 3 B -2.21
## 4 A 0.441
## 5 C 7.42
## 6 B -1.67
## 7 A 0.677
## 8 C -0.821
## 9 B 7.87
## 10 A 0.419
## # ... with 80 more rows
Let us take a look at the modified dataset. By comparing the distance between the error bars of each group, it is easy to see that there is a different spread within each group compared to the previous section.
dataset_8 %>%
group_by(group) %>%
summarise(
mean = mean(variable, na.rm = TRUE),
sd = sd(variable, na.rm = TRUE)
) %>%
left_join(dataset_8) %>%
ggplot(data = .,
aes(x = group, y = variable, col = group)) +
geom_jitter(width = 0.1) +
geom_errorbar(aes(ymin = mean - sd, ymax = mean + sd)) +
geom_point(aes(y = mean), size = 5, shape = 3) +
theme(legend.position = "none")
Now I am running again both tests to test whether the variance between the three groups is homogeneous.
car::leveneTest(variable ~ group, data = dataset_8)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 9.6438 0.0001649 ***
## 87
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
bartlett.test(variable ~ group, data = dataset_8)
##
## Bartlett test of homogeneity of variances
##
## data: variable by group
## Bartlett's K-squared = 38.319, df = 2, p-value = 4.776e-09
In both tests the null hypothesis that the variance between groups is homogeneous is rejected. Now a test, which does not assume equal variance is required
The Welch one-way test is designed to do this and can be performed with function oneway.test()
. It tests whether two or more samples, which follow a normal distribution, have the same means.
oneway.test(variable ~ group, data = dataset_8,
var.equal = FALSE)
##
## One-way analysis of means (not assuming equal variances)
##
## data: variable and group
## F = 3.4606, num df = 2.000, denom df = 47.205, p-value = 0.03959
The null hypothesis that the group’s means are equal can be rejected at the 5 % level in favor of the alternative that they are different.
With function pairwise.t.test()
comparisons between group levels can be calculated. The argument pool.sd =
has to be set to FALSE
.
pairwise.t.test(dataset_8$variable, dataset_8$group,
pool.sd = FALSE)
##
## Pairwise comparisons using t tests with non-pooled SD
##
## data: dataset_8$variable and dataset_8$group
##
## A B
## B 0.727 -
## C 0.042 0.727
##
## P value adjustment method: holm
There are significant differences between group A
and group C
but not between group A
and group B
as well as group B
and group C
.
A non-parametric alternative to the one-way ANOVA is the Kruskal-Wallis rank sum test, which can be used when the normality assumption is violated.
Below I create another 30 observations for three different groups which are now drawn from a log normal distribution.
set.seed(345)
dataset_9 <-
tibble(
group_A = rlnorm(30, meanlog = 1.0, sdlog = 0.3),
group_C = rlnorm(30, meanlog = 1.5, sdlog = 0.3),
group_B = rlnorm(30, meanlog = 1.8, sdlog = 0.3)
) %>%
pivot_longer(cols = starts_with("group"),
names_to = "group",
names_pattern = "._(.+)",
values_to = "variable") %>%
mutate(group =
ordered(group, levels = c("A", "B", "C"))
)
dataset_9
## # A tibble: 90 x 2
## group variable
## <ord> <dbl>
## 1 A 2.15
## 2 C 3.59
## 3 B 4.98
## 4 A 2.50
## 5 C 5.50
## 6 B 5.08
## 7 A 2.59
## 8 C 3.36
## 9 B 7.26
## 10 A 2.49
## # ... with 80 more rows
I use a density plot to assess the skewness of the distribution of variable
among the three groups in dataset_9
by comparing it to its distribution in dataset_7
from the previous section. The distribution of variable
in dataset_9 is skewed to the right.
bind_rows(
list(dataset_7=dataset_7, dataset_9=dataset_9),
.id = "source"
) %>%
ggplot(data = .,
aes(x = variable, fill = group)) +
geom_density(position = "dodge", color = "black") +
facet_wrap(. ~ source, scales = "free")
Using the function shapiro.test()
, the null hypothesis that the residuals are normally distributed is now rejected.
model_aov_residuals <- residuals(object = lm(variable ~ group, data = dataset_9))
shapiro.test(x = model_aov_residuals)
##
## Shapiro-Wilk normality test
##
## data: model_aov_residuals
## W = 0.87061, p-value = 2.441e-07
The Q-Q plot of the residuals also shows the deviation from the normal distribution at the tails.
ggplot(broom::augment(lm(variable ~ group, data = dataset_9)),
aes(sample = .std.resid)) +
stat_qq() +
stat_qq_line() +
labs(x = "Theoretical Quantiles",
y = "Standardized Residuals")
The function kruskal.test()
performs a test of the null hypothesis that the location parameters of the distribution of variable
are the same in each group
. The alternative is that they differ in at least one.
kruskal.test(variable ~ group, data = dataset_9)
##
## Kruskal-Wallis rank sum test
##
## data: variable by group
## Kruskal-Wallis chi-squared = 49.644, df = 2, p-value = 1.66e-11
The null hypothesis is rejected in favor of the alternative that the distribution of variable
differs among the three groups.
With function pairwise.wilcox.test()
pairwise comparisons between group levels may be calculated.
pairwise.wilcox.test(dataset_9$variable, dataset_9$group)
##
## Pairwise comparisons using Wilcoxon rank sum exact test
##
## data: dataset_9$variable and dataset_9$group
##
## A B
## B 9.4e-13 -
## C 3.8e-07 0.00021
##
## P value adjustment method: holm
There are significant differences between all three groups.
The two-way analysis of variance is an extension of the one-way ANOVA and examines the influence of two different categorical independent variables on one continuous dependent variable. It not only aims at assessing the main effect of each independent variable but also if there is any interaction between them. The two-way ANOVA again assumes that the observations within each group are normally distributed and have equal variances.
The follwing hypotheses might be tested:
There is no difference in the means of categories of variable A.
There is no difference in the means of categories of variable B.
There is no interaction between the variables A and B.
The first section discusses an ANOVA when the study’s design is balanced. This means that for each combination of the two factor variables there are the same number of observations (cell counts).
Below, I am generating a dataset with 60 observations. They are split within group
(A, B, C) and country
(DE, EN), hence there are 10 observations per combination of the two factor variables. The variable
of interest is drawn from a normal distribution. The mean differs within and between group
and country
while the standard deviation only differs between country
.
set.seed(345)
dataset_10 <- tibble(
DE_group_A = rnorm(10, mean = 1, sd = 1),
DE_group_B = rnorm(10, mean = 2, sd = 1),
DE_group_C = rnorm(10, mean = 3, sd = 1),
EN_group_A = rnorm(10, mean = 2, sd = 1.4),
EN_group_B = rnorm(10, mean = 4, sd = 1.4),
EN_group_C = rnorm(10, mean = 6, sd = 1.4)
) %>%
pivot_longer(cols = everything(),
names_to = c("country", "group"),
names_pattern = "(.*)_group_(.*)",
values_to = "variable") %>%
mutate(group = factor(group),
country = factor(country))
The cell counts can be checked with a two-way table.
dataset_10 %>%
select(country, group) %>%
table()
## group
## country A B C
## DE 10 10 10
## EN 10 10 10
Here, the design is balanced.
Let us take a look at dataset_11
. Since there is now a second categorical variable, I am utilizing the color =
argument to separate the boxplots by country
.
ggplot(data = dataset_10,
aes(x = group, y = variable, color = country)) +
geom_boxplot()
The two-way ANOVA is carried out in the same fashion as in the previous section. The second factor variable is added via a +
in the formula of function aov()
.
model_aov2 <- aov(variable ~ country + group, data = dataset_10)
summary(model_aov2)
## Df Sum Sq Mean Sq F value Pr(>F)
## country 1 53.38 53.38 34.62 2.34e-07 ***
## group 2 53.94 26.97 17.49 1.25e-06 ***
## Residuals 56 86.34 1.54
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Both country
and group
are statistically significant and influence the mean of variable
length.
The above fitted model is called an additive model. It makes the assumption that the two factor variables country
and group
are independent. If there are reasons to believe that these two variables might interact, that is they depend on each other, the formula of function aov()
has to be adjusted by adding country:group
.
model_aov_interact <- aov(variable ~ country + group + country:group, data = dataset_10)
summary(model_aov_interact)
## Df Sum Sq Mean Sq F value Pr(>F)
## country 1 53.38 53.38 37.528 1.07e-07 ***
## group 2 53.94 26.97 18.958 5.79e-07 ***
## country:group 2 9.53 4.76 3.349 0.0425 *
## Residuals 54 76.82 1.42
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The interaction is still statistically significant at the 5 % level. Hence, the relationship between group
and variable
does depend on country
. If the interaction is not significant the additive model is preferred.
With function model.tables()
summary tables for model fits can be printed. It displays the overall mean, the means per level of each factor variable as well as the means of the combinations of both factor’s levels.
model.tables(model_aov_interact, type = "means", se = TRUE)
## Tables of means
## Grand mean
##
## 2.877139
##
## country
## country
## DE EN
## 1.934 3.820
##
## group
## group
## A B C
## 1.585 3.212 3.834
##
## country:group
## group
## country A B C
## DE 1.203 2.034 2.564
## EN 1.968 4.390 5.104
##
## Standard errors for differences of means
## country group country:group
## 0.3080 0.3772 0.5334
## replic. 30 20 10
Similarly to the one-way ANOVA, a significant p-value indicates that the group means are different, but we do not know exactly which pairs. Again, we can compute Tukey Honest Significant Differences to perform multiple pairwise-comparisons between the means of groups.
The argument which =
of function TukeyHSD()
can take a list of character values as input and hence we can test both factor variables (country
and group
) at once.
TukeyHSD(model_aov_interact, which = c("country", "group"))
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = variable ~ country + group + country:group, data = dataset_10)
##
## $country
## diff lwr upr p adj
## EN-DE 1.886507 1.269101 2.503912 1e-07
##
## $group
## diff lwr upr p adj
## B-A 1.6266655 0.7177109 2.535620 0.0002010
## C-A 2.2488666 1.3399120 3.157821 0.0000006
## C-B 0.6222011 -0.2867535 1.531156 0.2339918
Irrespective of group
, we find that there are statistically significant differences between countries DE and EN. Also we find that there are statistically significant differences between group A and B as well as group A and C but not between group B and C irrespective of country
.
Alternatively, we can also use a pairwise t-test to consider all possible combinations of the categorical variables. I’m using the function str_c()
to combine the vectors group
and country
.
pairwise.t.test(dataset_10$variable, str_c(dataset_10$group, dataset_10$country, sep = "-"))
##
## Pairwise comparisons using t tests with pooled SD
##
## data: dataset_10$variable and str_c(dataset_10$group, dataset_10$country, sep = "-")
##
## A-DE A-EN B-DE B-EN C-DE
## A-EN 0.78787 - - - -
## B-DE 0.74968 0.90070 - - -
## B-EN 2.6e-06 0.00032 0.00044 - -
## C-DE 0.09505 0.80470 0.80470 0.00953 -
## C-EN 1.9e-08 3.4e-06 5.0e-06 0.78787 0.00016
##
## P value adjustment method: holm
Variance homogeneity: The null hypothesis that the variance between samples is homogeneous cannot be rejected.
car::leveneTest(variable ~ country*group, data = dataset_10)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 5 0.388 0.8549
## 54
Normality: The null hypothesis that the residuals are normally distributed cannot be rejected.
model_aov_interact_residuals <- residuals(object = lm(variable ~ country*group, data = dataset_10))
shapiro.test(x = model_aov_interact_residuals)
##
## Shapiro-Wilk normality test
##
## data: model_aov_interact_residuals
## W = 0.97825, p-value = 0.3592
An unbalanced design has unequal numbers of subjects in each group. In general there are three different ways to run an ANOVA and as long as a study’s design is balanced these methods lead to the same results. However, when dealing with an unbalanced design is is recommended to calculate Type-II or Type-III sums of squares.
In order to create an unbalanced design I use the same dataset from the previous section (balanced design) and introduce missing values (NA
) in each group.
set.seed(345)
dataset_11 <- tibble(
DE_group_A = c(rnorm(8, mean = 1, sd = 1), NA,NA),
DE_group_B = c(rnorm(7, mean = 2, sd = 1), NA, NA, NA),
DE_group_C = c(rnorm(9, mean = 3, sd = 1), NA),
EN_group_A = c(rnorm(9, mean = 2, sd = 1.4), NA),
EN_group_B = c(rnorm(10, mean = 4, sd = 1.4)),
EN_group_C = c(rnorm(8, mean = 6, sd = 1.4), NA, NA)
) %>%
pivot_longer(cols = everything(),
names_to = c("country", "group"),
names_pattern = "(.*)_group_(.*)",
values_to = "variable") %>%
mutate(group = factor(group),
country = factor(country)) %>%
filter(!is.na(variable))
Below we can see the unequal cell counts. They vary between 7 and 10.
dataset_11 %>%
select(country, group) %>%
table()
## group
## country A B C
## DE 8 7 9
## EN 9 10 8
With a boxplot is not possible to see whether a dataset inherits a balanced design or not, so always use frequency tables to check for it.
ggplot(data = dataset_11,
aes(x = group, y = variable, color = country)) +
geom_boxplot()
The first type is also known as sequential sum of squares because the sequence of the specified independent variables matters. In R’s anova()
and aov()
functions this type is implemented and the ANOVA should be run two times, once with each independent variable taking the lead.
summary(aov(variable ~ country*group, data = dataset_11))
## Df Sum Sq Mean Sq F value Pr(>F)
## country 1 47.00 47.00 34.394 4.96e-07 ***
## group 2 97.40 48.70 35.642 5.29e-10 ***
## country:group 2 19.91 9.95 7.284 0.00182 **
## Residuals 45 61.49 1.37
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov(variable ~ group*country, data = dataset_11))
## Df Sum Sq Mean Sq F value Pr(>F)
## group 2 93.54 46.77 34.226 9.20e-10 ***
## country 1 50.87 50.87 37.225 2.21e-07 ***
## group:country 2 19.91 9.95 7.284 0.00182 **
## Residuals 45 61.49 1.37
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The difference in the main effects of country
and group
is only due to which variable comes first. Also note that the interaction effect is not affected by the ordering of the independent variables.
For the second type the sequence of the arguments in the model does not matter! We can use the function Anova()
from package car
and specify type = "II"
to obtain the desired results.
model_aov2 <- aov(variable ~ country*group, data = dataset_11)
car::Anova(model_aov2, type = "II")
## Anova Table (Type II tests)
##
## Response: variable
## Sum Sq Df F value Pr(>F)
## country 50.865 1 37.2248 2.209e-07 ***
## group 97.404 2 35.6417 5.286e-10 ***
## country:group 19.907 2 7.2844 0.001817 **
## Residuals 61.489 45
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The third type is also known as partial sums of squares. Again we use the function Anova()
from package car
and specify type = "III"
.
model_aov3 <- aov(variable ~ country*group, data = dataset_11)
car::Anova(model_aov3, type = "III")
## Anova Table (Type III tests)
##
## Response: variable
## Sum Sq Df F value Pr(>F)
## (Intercept) 5.388 1 3.9433 0.053171 .
## country 2.251 1 1.6476 0.205850
## group 17.765 2 6.5004 0.003312 **
## country:group 19.907 2 7.2844 0.001817 **
## Residuals 61.489 45
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The effect of group
is still statistically significant but the influence of country
is not! The p-value of the interaction term is identical to the previous settings.
A multivariate analysis of variance (MANOVA) is simply an ANOVA with more than one dependent variable. Hence, while an ANOVA tests for the difference in means between two or more groups, the MANOVA tests for the difference in two or more vectors of means.
I create a dataset where two variables of interest (height
and weight
) are drawn from a normal distribution for three different groups
. In total there are 60 observations of the two variables of interest - 20 for each group.
set.seed(345)
dataset_12 <- tibble(
height_group_A = rnorm(20, mean = 3.0, sd = 1.0),
height_group_B = rnorm(20, mean = 4.5, sd = 1.0),
height_group_C = rnorm(20, mean = 3.5, sd = 1.0),
weight_group_A = rnorm(20, mean = 5.0, sd = 1.0),
weight_group_B = rnorm(20, mean = 6.5, sd = 1.0),
weight_group_C = rnorm(20, mean = 5.5, sd = 1.0)
) %>%
pivot_longer(cols = everything(),
names_to = c(".value", "group"),
names_pattern = "(.+)_group_(.)"
) %>%
mutate(group = factor(group))
dataset_12
## # A tibble: 60 x 3
## group height weight
## <fct> <dbl> <dbl>
## 1 A 2.22 4.35
## 2 B 4.94 7.86
## 3 C 4.13 4.14
## 4 A 2.72 4.42
## 5 B 5.33 5.58
## 6 C 4.34 5.01
## 7 A 2.84 5.61
## 8 B 1.92 5.48
## 9 C 2.65 6.38
## 10 A 2.71 3.79
## # ... with 50 more rows
Since there is more than one continuous variable, I am using a scatterplot to show the distribution of height
and weight
between group
A, B and C.
ggplot(data = dataset_12,
aes(x = height, y = weight)) +
geom_point() +
facet_wrap(. ~ group, labeller = purrr::partial(label_both, sep = ": "))
With function cbind()
the height
and weight
variable are combined and then used as dependent variables in function manova()
.
model_manova <- manova(cbind(height, weight) ~ group, data = dataset_12)
summary(model_manova)
## Df Pillai approx F num Df den Df Pr(>F)
## group 2 0.46724 8.6879 4 114 3.708e-06 ***
## Residuals 57
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
First, we can see that there are significant differences between the groups regarding both dependent variables. The column Pillai
in the output displays Pillai’s Trace which is a positive integer value ranging from 0 to 1. This test is considered to be the most powerful and robust statistic for general use, especially for departures from the assumption of variance homogeneity.
Below is another output for both responses individually. There are also significant differences between the groups for each dependent variable.
summary.aov(model_manova)
## Response height :
## Df Sum Sq Mean Sq F value Pr(>F)
## group 2 15.145 7.5726 7.1769 0.00166 **
## Residuals 57 60.142 1.0551
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response weight :
## Df Sum Sq Mean Sq F value Pr(>F)
## group 2 35.123 17.5613 19.638 3.252e-07 ***
## Residuals 57 50.972 0.8942
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Similar to the ANOVA we can use the Shapiro-Wilk test to check the distribution of the residuals. The null hypothesis is that the residuals follow a normal distribution. The function residuals() is used to extract the residuals from the object created by function manova()
. However, we can only check for univariate normality with this test.
model_manova_residuals <- residuals(object = model_manova)
shapiro.test(x = model_manova_residuals)
##
## Shapiro-Wilk normality test
##
## data: model_manova_residuals
## W = 0.99062, p-value = 0.5905
The null hypothesis that the residuals are univariate normally distributed cannot be rejected.
In addition there is **Mardia’skewness and kurtosis test* for multivariate normality. It is implemented in function mardia
in package mvnormalTest
.
mvnormalTest::mardia(cbind(dataset_12$height, dataset_12$weight))
## $mv.test
## Test Statistic p-value Result
## 1 Skewness 4.2945 0.3676 YES
## 2 Kurtosis -1.2815 0.2 YES
## 3 MV Normality <NA> <NA> YES
##
## $uv.shapiro
## W p-value UV.Normality
## V1 0.9812 0.4818 Yes
## V2 0.9695 0.1374 Yes
The null hypothesis that the dependent variables are multivariate normally distributed cannot be rejected.
When perfoming a MANOVA the data should have equal variance-covariance matrices for each combination formed by each group in the independent variable. This is a multivariate version of the homogeneity of variances described in the ANOVA section. We can use Box’s M test implemented in function boxM()
in package heplots
.
heplots::boxM(cbind(height, weight) ~ group, data = dataset_12)
##
## Box's M-test for Homogeneity of Covariance Matrices
##
## data: Y
## Chi-Sq (approx.) = 3.3773, df = 6, p-value = 0.7602
The null hypothesis that the covariance matrices are homogeneous cannot be rejected.
Also we need to check for multicollinearity among our two dependent variables height
and weight
. Correlation is measured between -1 and 1 where zero indicates no correlation and -1/1 perfect negative/positive correlation.
cor(dataset_12$height, dataset_12$weight)
## [1] 0.3606827
As a non-parametric estimation method, the bootstrap quantifies the uncertainty of an estimator involved with the standard deviation. This can for example be very helpful if the sample size is relatively small or when the underlying distribution cannot be approximated parametrically.
I create a dataset with 30 observations of two variables of interest (height
and weight
) which are sampled from a range of given values.
set.seed(345)
dataset_13 <- tibble(
height = sample(170:190, 30, replace = TRUE),
weight = sample(70:90, 30, replace = TRUE)
)
dataset_13
## # A tibble: 30 x 2
## height weight
## <int> <int>
## 1 188 80
## 2 190 88
## 3 187 70
## 4 189 74
## 5 177 78
## 6 170 81
## 7 178 90
## 8 171 75
## 9 188 86
## 10 174 70
## # ... with 20 more rows
For variable height
I calculate the mean and its standard error.
mean(dataset_13$height)
## [1] 179.4333
sd(dataset_13$height)/sqrt(length(dataset_13$height))
## [1] 1.20505
Before we are able to perform the bootstrap, a specific function is required which must allow for subsetting the dataset. Below I am generating a new function mean_indexed()
which uses the argument indices
to subset the supplied data
.
mean_indexed <- function(data, indices){
mean(data[indices])
}
Now with function boot()
from R package boot
the bootstrap can be calculated. In order to get replicable results a seed must be set. Then I am using the above generated function mean_indexed()
with argument statistic =
and set R = 1000
to create 1000 replications (samples).
set.seed(345)
bootMean <- boot(dataset_13$height, statistic = mean_indexed, R = 1000)
summary(bootMean)
## R original bootBias bootSE bootMed
## 1 1000 179.43 -0.045633 1.2248 179.38
The original
mean is 179.43
. The bootBias
represents the difference between the original and bootstrapped value of the mean. The bootstrapped standard error (bootSE
) is 1.2248. The values shown above can also be calculated by using the 1000 replicated mean values contained in t
inside object bootMean
.
mean(bootMean$t)
## [1] 179.3877
sd(bootMean$t)
## [1] 1.224809
options(pillar.sigfig = 5)
tibble(
variable = c("height", "height"),
type = c("parametric", "bootstrap"),
mean = c(mean(dataset_13$height), mean(bootMean$t)),
se = c(sd(dataset_13$height)/sqrt(length(dataset_13$height)), sd(bootMean$t))
)
## # A tibble: 2 x 4
## variable type mean se
## <chr> <chr> <dbl> <dbl>
## 1 height parametric 179.43 1.2051
## 2 height bootstrap 179.39 1.2248
Below is another example on how to use the bootstrap. I want to estimate the correlation between height
and weight
with the bootstrap.
cor_indexed <- function(data, indices){
cor(data[indices,]$height, data[indices,]$weight)
}
set.seed(345)
bootCor <- boot(dataset_13, statistic = cor_indexed, R = 1000)
options(pillar.sigfig = 5)
tibble(
correlation_paramteric = cor(dataset_13$height, dataset_13$weight),
correlation_bootstrap = mean(bootCor$t)
)
## # A tibble: 1 x 2
## correlation_paramteric correlation_bootstrap
## <dbl> <dbl>
## 1 0.053863 0.049370