R Tutorial: Statistical Inference 1

1. Introduction

1.1 R packages & Data
1.2 Hypothesis testing

2. Student’s t-test

2.1 One-sample t-test
2.2 Two-samples unpaired t-test
2.3 Two-samples paired t-test

3. Analysis of Variance (ANOVA)

3.1 One-way ANOVA
3.2 Two-way ANOVA

4. Multivariate Analysis of Variance (MANOVA)

5. Bootstrapping

References

Reading time: 25 minutes (4,872 words)

1. Introduction

1.1 R packages & Data

The following packages are required for this tutorial.

library(ggplot2)
library(dplyr)
library(tidyr)
library(forcats)
library(stringr)
library(broom)
library(car)
library(boot)
library(heplots)
library(mvnormalTest)

The practical examples in the following sections are always illustrated with simulated data, so there is no need to load a dataset.

1.2 Hypothesis testing

With a statistical hypothesis we want to test if there is evidence against a statement about the nature of a population (consider for example a parameter like the mean or median). Before testing a statistical hypothesis the probability of error or the significance level (α) has to be set. The corresponding confidence interval is 1-α. So for example a significance level of 5 % (or 0.05) reflects a 95 % confidence interval (0.95).

The significance level gives the probability of rejecting the null hypothesis although it is correct. Given a significance level of 5 %, in 5 out of 100 cases the null hypothesis is erroneously rejected. The null hypotheis is rejected if the calculated p value is smaller than the predefined significane level. However, it is also correct to say that you are 95 % confident that the alternative hypothesis is true. In the other case, one can say that the null hypothesis cannot be rejected instead that the null hypothesis is accepted. The statement the null hypothesis is rejected and the alternative is true with a probality of at least 95 % is wrong.

Rejecting the null hypothesis after a statistical test is denoted as either:

a significant change (two-sided test),
a significant decrease (left-sided test) or
a significant increase (right-sided test)

By design the result of a statistical test is never unequivocal evidence for the change/decrease/increase of a measure, even when the null hypothesis is rejected at a very low significane level. Also, a significant change can be negligibly small. Always remember that it is required to set the significane level before calculating the p value, in order to not adjust α in hindsight to obtain the desired result.

The signficance level in the following examples is always set to 5 %.

2. Student’s t-test

The Student’s t-test is a statistical hypothesis test, in which the test statistic follows the Student’s t-distribution. Commonly, it is used to determine if the means of two groups of data are significantly different from each other. Still, it can also be applied to a single sample, where the mean is tested against a specific value. Below is a representation with different degrees of freedom.

For higher degrees of freedom the t-distribution approximates the normal distribution. However, note the difference in the tails between both distributions.

2.1 One-sample t-test

In its most simple form the t-test is a location test of whether the mean of a population has a value specified in the null hypothesis. The data, and hence the sample mean, must follow a normal distribution in order for the test to work.

The first dataset I create contains 75 individuals identified by an id for which a certain variable of interest was recorded. The variable was generated with function rnorm() and follows a normal distribution with a mean of 5 and standard deviation of 2. Since the values are drawn randomly from the distribution, it is important to use function set.seed() in order to obtain replicable results.

set.seed(123)
dataset_1 <- tibble(
  id = as.character(c(1:75)),
  variable = rnorm(n = 75, mean = 5, sd = 3)
  )

dataset_1

## # A tibble: 75 x 2
##    id    variable
##    <chr>    <dbl>
##  1 1         3.32
##  2 2         4.31
##  3 3         9.68
##  4 4         5.21
##  5 5         5.39
##  6 6        10.1 
##  7 7         6.38
##  8 8         1.20
##  9 9         2.94
## 10 10        3.66
## # ... with 65 more rows

With a histogram, the distribution of the variable can be visualized.

ggplot(data = dataset_1,
       aes(x = variable)) +
  geom_histogram(fill = "white", color = "black", bins = 10) +
  theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank())

This looks normally distributed.

Before I compute the t-test, the normality assumption is checked. As the sample is quite small, the central limit theorem will not apply. However, a Shapiro-Wilk Normality Test can be used to test whether the data follows a normal distribution. The null hypothesis is hence that the data is normally distributed. This test can be computed with function shapiro.test().

shapiro.test(dataset_1$variable)

## 
##  Shapiro-Wilk normality test
## 
## data:  dataset_1$variable
## W = 0.99421, p-value = 0.9834

The test shows a p-value of 0.98. Hence, there is no evidence to reject the null hypothesis in favor of the alternative that the data is not normally distributed.

Finally we can run the t-test which is implemented in function t.test(). The theoretical mean against which the hypothesis test is carried out (mu) is 0 by default. With argument alternative = a two- or one-tailed test can be applied. Below, the null hypothesis that the population mean of variable is equal to zero (mu = 0), is rejected in favor of the alternative that it is not equal to zero.

t.test(dataset_1$variable, 
       alternative = "two.sided",
       mu = 0, conf.level = 0.95)

## 
##  One Sample t-test
## 
## data:  dataset_1$variable
## t = 15.679, df = 74, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  4.433596 5.724500
## sample estimates:
## mean of x 
##  5.079048

To check this result, I am computing the test statistic and confidence interval by hand.

knitr::kable(tibble(
  stderr = sqrt( var(dataset_1$variable) / length(dataset_1$variable) ),
  tstat = (mean(dataset_1$variable) - 0) / stderr,
  pval = 2 * pt(q = -abs(tstat), df = length(dataset_1$variable) -1),
  alpha = 1-0.95,
  error = qt(p = 1 - alpha / 2, df = length(dataset_1$variable) -1) * stderr,
  lower_ci = mean(dataset_1$variable) - error,
  upper_ci = mean(dataset_1$variable) + error)
  )

stderr	tstat	pval	alpha	error	lower_ci	upper_ci
0.3239338	15.67928	0	0.05	0.6454521	4.433596	5.7245

A one-tailed t-test is used for the null hypothesis that the population mean of variable is greater than or equal to 6. The null hypothesis rejected in favor of the alternative that it is less than 6.

t.test(dataset_1$variable, 
       alternative = "less",
       mu = 6, conf.level = 0.95)

## 
##  One Sample t-test
## 
## data:  dataset_1$variable
## t = -2.843, df = 74, p-value = 0.002886
## alternative hypothesis: true mean is less than 6
## 95 percent confidence interval:
##      -Inf 5.618627
## sample estimates:
## mean of x 
##  5.079048

Next, I test the null hypothesis that the the population mean of variable is less than or equal to 4. The null hypothesis is rejected in favor of the alternative that it is greater than 4.

t.test(dataset_1$variable,
       alternative = "greater",
       mu = 4, conf.level = 0.95)

## 
##  One Sample t-test
## 
## data:  dataset_1$variable
## t = 3.3311, df = 74, p-value = 0.000676
## alternative hypothesis: true mean is greater than 4
## 95 percent confidence interval:
##  4.539469      Inf
## sample estimates:
## mean of x 
##  5.079048

Again, I show the caluclation for both one-tailed tests.

knitr::kable(tibble(
  stderr = sqrt( var(dataset_1$variable) / length(dataset_1$variable) ),
  tstat = (mean(dataset_1$variable) - 6) / stderr,
  pval = pt(q = -abs(tstat), df = length(dataset_1$variable) -1),
  error = qt(p = 0.95, df = length(dataset_1$variable) -1) * stderr,
  lower_ci = -Inf,
  upper_ci = mean(dataset_1$variable) + error)
)

stderr	tstat	pval	error	lower_ci	upper_ci
0.3239338	-2.843025	0.0028864	0.5395787	-Inf	5.618627

knitr::kable(tibble(
  stderr = sqrt( var(dataset_1$variable) / length(dataset_1$variable) ),
  tstat = (mean(dataset_1$variable) - 4) / stderr,
  pval = pt(q = abs(tstat), df = length(dataset_1$variable) -1, lower.tail = FALSE),
  error = qt(p = 0.95, df = length(dataset_1$variable) -1) * stderr,
  lower_ci = mean(dataset_1$variable) - error,
  upper_ci = Inf)
)

stderr	tstat	pval	error	lower_ci	upper_ci
0.3239338	3.331076	0.000676	0.5395787	4.539469	Inf

2.2 Two-samples unpaired t-test

The independent or unpaired samples t-test is used to compare the mean of two unrelated groups. Each group is a sample of one of the two populations being compared. Again, we have to check that the data of the two groups is normally distributed. Additionally, the variance of the two groups now plays a role and is assumed to be equal (variance homogeneity).

2.2.1 Equal variance

For the dataset in this section I create 50 observations for men and women each of a variable of interest. Both groups are normally distributed with an identical standard deviation but their mean is different as seen below.

set.seed(123)
dataset_2 <- tibble(
  men = rnorm(50, mean = 5, sd = 2), 
  women = rnorm(50, mean = 8, sd = 2)
  ) %>%
  pivot_longer(cols = c(men, women),names_to = "group", values_to = "variable")

dataset_2

## # A tibble: 100 x 2
##    group variable
##    <chr>    <dbl>
##  1 men       3.88
##  2 women     8.51
##  3 men       4.54
##  4 women     7.94
##  5 men       8.12
##  6 women     7.91
##  7 men       5.14
##  8 women    10.7 
##  9 men       5.26
## 10 women     7.55
## # ... with 90 more rows

The distribution of variable within each group is shown below with two boxplots.

ggplot(data = dataset_2,
       aes(x = group, y = variable)) +
  geom_boxplot() +
  labs(x = "")

The Shapiro-Wilk-Test is again used to check the assumption of normally distributed data. It is carried out for each group individually.

dataset_2 %>%
  filter(group == "men")  %>%
  pull(variable) %>%
  shapiro.test()

## 
##  Shapiro-Wilk normality test
## 
## data:  .
## W = 0.98928, p-value = 0.9279

dataset_2 %>%
  filter(group == "women")  %>%
  pull(variable) %>%
  shapiro.test()

## 
##  Shapiro-Wilk normality test
## 
## data:  .
## W = 0.99073, p-value = 0.9618

For both grouphs, the null hypothesis that the data is normally distributed cannot be rejected.

The function var.test() performs an F-test to compare the variances of the two groups regarding variable. The null hypothesis is that the ratio of the variances is equal to 1.

var.test(variable ~ group, data = dataset_2, 
         ratio = 1, alternative = "two.sided", 
         conf.level = 0.95)

## 
##  F test to compare two variances
## 
## data:  variable by group
## F = 1.0456, num df = 49, denom df = 49, p-value = 0.8766
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.5933642 1.8425790
## sample estimates:
## ratio of variances 
##            1.04562

There is no evidence to reject the null hypothesis. So it is fine to assume homoskedasticity.

The t-test for unpaired groups can be computed in the same way as in the one-sample case. The argument paired = is set to FALSE by default in function t.test(). Since there was no evidence against homoskedasticity, the argument var.equal=TRUE is added. The null hypothesis is that the difference of the mean value of variable between men and women is equal to zero.

t.test(variable ~ group, data = dataset_2, 
       alternative = "two.sided", 
       mu = 0, paired = FALSE,
       var.equal = TRUE)

## 
##  Two Sample t-test
## 
## data:  variable by group
## t = -8.8019, df = 98, p-value = 4.799e-14
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.950893 -2.497126
## sample estimates:
##   mean in group men mean in group women 
##            5.068807            8.292817

The null hypothesis is rejected in favor of the alternative that there is a significant difference in the groups’ means.

Below, I computed the output of t.test(). The confidence interval now refers to the difference between the mean of the two groups.

knitr::kable(
  tibble(
    stderr = sqrt(var(dataset_2[dataset_2$group == "men", ]$variable) / 50 +
                    var(dataset_2[dataset_2$group == "women", ]$variable) / 50),
    tstat = (mean(dataset_2[dataset_2$group == "men", ]$variable) -
               mean(dataset_2[dataset_2$group == "women", ]$variable)) / stderr,
    pval = 2 * pt(q = -abs(tstat), df = length(dataset_2$variable) - 2),
    alpha = 1 - 0.95,
    error = qt(p = 1 - alpha / 2, df = length(dataset_2$variable) - 1) * stderr,
    lower_ci = mean(dataset_2[dataset_2$group == "men", ]$variable) -
      mean(dataset_2[dataset_2$group == "women", ]$variable) - error,
    upper_ci = mean(dataset_2[dataset_2$group == "men", ]$variable) -
      mean(dataset_2[dataset_2$group == "women", ]$variable) + error
  )
)

stderr	tstat	pval	alpha	error	lower_ci	upper_ci
0.3662862	-8.801886	0	0.05	0.7267913	-3.950801	-2.497218

2.2.2 Unequal variance

If the variance of the two groups being compared is not identical, it is possible to use a Welch t-test instead. This test is an adaptation of the Student’s t-test and does not use the pooled variance of both groups.

The next dataset is similar to the previously generated dataset, except that the variance of the distribution of women and men is different by design.

set.seed(123)
dataset_3 <- tibble(
  men = rnorm(50, mean = 5, sd = 8), 
  women = rnorm(50, mean = 8, sd = 2)
  ) %>%
  pivot_longer(cols = c(men, women),names_to = "group", values_to = "variable")

dataset_3

## # A tibble: 100 x 2
##    group variable
##    <chr>    <dbl>
##  1 men      0.516
##  2 women    8.51 
##  3 men      3.16 
##  4 women    7.94 
##  5 men     17.5  
##  6 women    7.91 
##  7 men      5.56 
##  8 women   10.7  
##  9 men      6.03 
## 10 women    7.55 
## # ... with 90 more rows

We can easily see this with the boxplots. The observations within group men have a greater deviation around the median than the observations in the group women.

ggplot(data = dataset_3,
       aes(x = group, y = variable)) +
  geom_boxplot() +
  labs(x = "")

The null hypothesis, that the ratio of the variances is equal to 1, can now be rejected.

var.test(variable ~ group, data = dataset_3)

## 
##  F test to compare two variances
## 
## data:  variable by group
## F = 16.73, num df = 49, denom df = 49, p-value < 2.2e-16
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##   9.493828 29.481263
## sample estimates:
## ratio of variances 
##           16.72991

To perform a Welch t-test, the function t.test() is modified by setting the argument var.equal = FALSE.

t.test(variable ~ group, data = dataset_3, 
       alternative = "two.sided",
       mu = 0, var.equal = FALSE)

## 
##  Welch Two Sample t-test
## 
## data:  variable by group
## t = -2.7983, df = 54.837, p-value = 0.007075
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -5.1788028 -0.8563735
## sample estimates:
##   mean in group men mean in group women 
##            5.275228            8.292817

2.2.3 Non-parametric alternative

If the data for the two groups are not normally distributed, it is recommended to use the Mann-Whitney U test (or Wilcoxon rank sum test). The null hypothesis of this test is that two samples come from the same population (have the same median) or, alternatively, whether observations in one sample tend to be larger than observations in the other. It assumes, however, that the two distributions are similar in shape.

The dataset from the previous section is slightly modified such that the variable of interest for the two groups now follows a log normal distribution, which I created with function rlnorm().

set.seed(123)
dataset_4 <- tibble(
  men = rlnorm(50, meanlog = 1, sdlog = 0.6), 
  women = rlnorm(50, meanlog = 1.2, sdlog = 0.6)
  ) %>%
  pivot_longer(cols = c(men, women),names_to = "group", values_to = "variable")

dataset_4

## # A tibble: 100 x 2
##    group variable
##    <chr>    <dbl>
##  1 men       1.94
##  2 women     3.87
##  3 men       2.37
##  4 women     3.26
##  5 men       6.93
##  6 women     3.24
##  7 men       2.84
##  8 women     7.55
##  9 men       2.94
## 10 women     2.90
## # ... with 90 more rows

I use a density plot to assess the skewness of the distribution of variable in dataset_4 by comparing it to its distribution in dataset_2 from the previous section. It is similar in shape but cleary deviates from the normal distribution.

bind_rows(
  list(dataset_2=dataset_2, dataset_4=dataset_4), 
  .id = "source"
  ) %>%
  ggplot(data = .,
       aes(x = variable, fill = group)) +
  geom_density(position = "dodge", color = "black") +
  facet_wrap(. ~ source) +
  labs(fill = "")

The null hypothesis that the data follow a normal distribution is now strongly rejected for each group!

with(dataset_4, shapiro.test(variable[group == "men"]))

## 
##  Shapiro-Wilk normality test
## 
## data:  variable[group == "men"]
## W = 0.86863, p-value = 5.146e-05

with(dataset_4, shapiro.test(variable[group == "women"]))

## 
##  Shapiro-Wilk normality test
## 
## data:  variable[group == "women"]
## W = 0.87171, p-value = 6.328e-05

Below the Wilcoxon rank sum test is computed with function wilcox.test() and setting argument paired = FALSE. The null hypothesis is that the distribution of variable between men and women differ by a location shift of mu and the alternative is that they differ by some other location shift.

wilcox.test(variable ~ group, data = dataset_4,
            alternative = "two.sided", 
            mu = 0, paired = FALSE)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  variable by group
## W = 898, p-value = 0.01539
## alternative hypothesis: true location shift is not equal to 0

We can conclude that the median value of variable for men is significantly different from the median of variable for women at the 5% level.

2.3 Two-samples paired t-test

The paired samples t-test is similiar in its assumptions to the unpaired t-test except that it requires that the two groups of samples are related. For example this is the case when the same individual is observed at different time points.

2.3.1 Parametric version

The next dataset is generated such that 50 individuals which are identified by an id were observed before and after a treatment. The observations are drawn from a normal distribution but the parameters differ before and after the treatment.

set.seed(123)
dataset_5 <- tibble(
  id = c(1:50),
  before_treatment = rnorm(50, mean = 3, sd = 1),
  after_treatment = rnorm(50, mean = 5, sd = 1)
  ) %>%
  pivot_longer(cols = c(before_treatment, after_treatment), 
               names_to = "group", values_to = "variable")

dataset_5

## # A tibble: 100 x 3
##       id group            variable
##    <int> <chr>               <dbl>
##  1     1 before_treatment     2.44
##  2     1 after_treatment      5.25
##  3     2 before_treatment     2.77
##  4     2 after_treatment      4.97
##  5     3 before_treatment     4.56
##  6     3 after_treatment      4.96
##  7     4 before_treatment     3.07
##  8     4 after_treatment      6.37
##  9     5 before_treatment     3.13
## 10     5 after_treatment      4.77
## # ... with 90 more rows

The distribution of variable is shown with a boxplot before and after the treatment. Additionally, I connected each pair of id. The observations are not independent!

ggplot(data = dataset_5,
       aes(x = reorder(group, variable), y = variable)) +
  geom_boxplot() +
  geom_point() +
  geom_line(aes(group = factor(id))) +
  labs(x = "")

The Shapiro-Wilk-Test is computed as always but now the difference between each pair of observations has to be used. The null hypothesis, that the differences before and after the treatment follow a normal distribution, cannot be rejected.

dataset_5 %>%
  pivot_wider(id_cols = id, names_from = "group", values_from="variable") %>%
  mutate(difference = after_treatment - before_treatment) %>%
  pull(difference) %>%
  shapiro.test()

## 
##  Shapiro-Wilk normality test
## 
## data:  .
## W = 0.98276, p-value = 0.6727

Let’s have a look at the distribution of the difference.

dataset_5 %>%
  pivot_wider(id_cols = id, names_from = "group", values_from="variable") %>%
  mutate(difference = after_treatment - before_treatment) %>%
  ggplot(aes(x=difference)) + geom_density()

The t-test for dependent groups can be computed by setting the argument paired = to TRUE in function t.test(). Since I did not check the assumption of variance homogeneity the default var.equal = FALSE is in place. The null hypothesis is that the difference of the mean value of variable before and after the treatment is equal to zero.

t.test(variable ~ group, data = dataset_5,
       alternative = "two.sided",
       mu = 0, 
       paired = TRUE)

## 
##  Paired t-test
## 
## data:  variable by group
## t = 11.331, df = 49, p-value = 2.711e-15
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.737424 2.486586
## sample estimates:
## mean of the differences 
##                2.112005

The null hypothesis is rejected in favor of the alternative that there is a significant difference in the mean of variable before and after the treatment.

2.3.2 Non-parametric alternative

Should the data not follow a normal distribution, it is recommended to use a Wilcoxon signed rank test instead of the t-test.

Now, a dataset is generated where observations are related but drawn from a log normal distribution.

set.seed(123)
dataset_6 <- tibble(
  id = c(1:50),
  before_treatment = rlnorm(50, meanlog = 2, sdlog = 0.7), 
  after_treatment = rlnorm(50, meanlog = 1, sdlog = 0.6)
  ) %>%
  pivot_longer(cols = c(before_treatment, after_treatment), 
               names_to = "group", values_to = "variable")

dataset_6

## # A tibble: 100 x 3
##       id group            variable
##    <int> <chr>               <dbl>
##  1     1 before_treatment     4.99
##  2     1 after_treatment      3.16
##  3     2 before_treatment     6.29
##  4     2 after_treatment      2.67
##  5     3 before_treatment    22.0 
##  6     3 after_treatment      2.65
##  7     4 before_treatment     7.76
##  8     4 after_treatment      6.18
##  9     5 before_treatment     8.09
## 10     5 after_treatment      2.37
## # ... with 90 more rows

I use a density plot to assess the skewness of the distribution before and after treatment by comparing it to the distribution in dataset_5.

bind_rows(
  list(dataset_5=dataset_5, dataset_6=dataset_6), 
  .id = "source"
  ) %>%
  ggplot(data = .,
       aes(x = variable, fill = group)) +
  geom_density(position = "dodge", color = "black") +
  facet_wrap(. ~ source, scales = "free") +
  labs(fill = "")

There is no difference compared to the previous section, except that the null hypothesis, that the difference before and after the treatment follows a normal distribution, is now rejected.

dataset_6 %>%
  pivot_wider(id_cols = id, names_from = "group", values_from="variable") %>%
  mutate(difference = after_treatment - before_treatment) %>%
  pull(difference) %>%
  shapiro.test()

## 
##  Shapiro-Wilk normality test
## 
## data:  .
## W = 0.90492, p-value = 0.0007013

The distribution of difference looks rather skewed now.

dataset_6 %>%
  pivot_wider(id_cols = id, names_from = "group", values_from="variable") %>%
  mutate(difference = after_treatment - before_treatment) %>%
  ggplot(aes(x=difference)) + geom_density()

In function wilcox.test() the argument paired = must be set to TRUE to perform this test. It tests whether the distribution of the two groups is symmetric about mu. In other words, the median of the differences before and after treatment are compared to a hypothetical median.

wilcox.test(variable ~ group, data = dataset_6,
            alternative = "two.sided", 
            mu = 0, 
            paired = TRUE)

## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  variable by group
## V = 100, p-value = 2.174e-07
## alternative hypothesis: true location shift is not equal to 0

The null hypothesis that the median differences before and after the treatment are equal to zero can be rejected.

3. Analysis of Variance (ANOVA)

3.1 One-way ANOVA

The one-way analysis of variance (ANOVA), or one-factor ANOVA, is an extension of the independent two-samples t-test. It compares means in a situation where there are more than two groups. The data is split into several groups based on one single grouping variable (also called factor variable). ANOVA assumes that the observations are obtained independently and randomly from the population defined by the factor levels and the data of each factor level is normally distributed. Also the assumption of variance homogeneity, previously discussed by the t-test, applies here.

The ANOVA is performed with the following steps:

Compute the variance between groups
Compute the variance within groups (residual variance)
Determine the degrees of freedom
Calculate the F-statistic as the ratio of between variance and within variance

3.1.1 Equal variance between groups

I am generating 30 observations for three different groups which are drawn from a normal distribution with different means but identical standard deviations. The column group is declared as a factor variable with function factor().

set.seed(345)
dataset_7 <- tibble(
  group_A = rnorm(30, mean = 1, sd = 2), 
  group_C = rnorm(30, mean = 4, sd = 2),
  group_B = rnorm(30, mean = 3, sd = 2)
  ) %>%
  pivot_longer(cols = starts_with("group"), 
               names_to = "group", 
               names_pattern = "._(.+)", 
               values_to = "variable") %>%
  mutate(group = factor(group))

dataset_7

## # A tibble: 90 x 2
##    group variable
##    <fct>    <dbl>
##  1 A       -0.570
##  2 C        2.51 
##  3 B        1.70 
##  4 A        0.441
##  5 C        5.37 
##  6 B        1.83 
##  7 A        0.677
##  8 C        2.07 
##  9 B        4.22 
## 10 A        0.419
## # ... with 80 more rows

With function levels() the unique values of a factor variable can be displayed. When performing an ANOVA with an unbalanced design, that means unequal cell sizes, the order of the levels is important.

levels(dataset_7$group)

## [1] "A" "B" "C"

dataset_7$group <- ordered(dataset_7$group,
                           levels = c("A", "B", "C"))

Below I plot all observations per group using the function geom_jitter() which makes the distribution of variable more visible. I am adding the standard deviation from the mean with function geom_errorbar().

dataset_7 %>%
group_by(group) %>%
  summarise(
    mean = mean(variable, na.rm = TRUE),
    sd = sd(variable, na.rm = TRUE)
  ) %>%
  left_join(dataset_7) %>%
  ggplot(data = ., 
       aes(x = group, y = variable, col = group)) +
  geom_jitter(width = 0.1) +
  geom_errorbar(aes(ymin = mean - sd, ymax = mean + sd)) +
  geom_point(aes(y = mean), size = 5, shape = 3) +
  theme(legend.position = "none")

First we have to compute the means of variable for each group as well as the overall mean regardless of group. To save some coding effort I create subsets of dataset_7 for each group.

group_A <- dataset_7 %>% filter(group == "A")
group_B <- dataset_7 %>% filter(group == "B")
group_C <- dataset_7 %>% filter(group == "C")

A_mean <- mean(group_A$variable)
B_mean <- mean(group_B$variable)
C_mean <- mean(group_C$variable)
Overall_mean <- mean(dataset_7$variable, na.rm =T)

To obtain the between variance the squared differences between the mean of each group and the overall mean are computed.

The within variance is computed as the sum of the squared differences between each observation and its group mean.

The degrees of freedom for the between variance are calculated by substracting 1 from the number of distinct groups. The degress of freedom for the within variance are calculated by substracting the number of distrinct groups from the number of observations.

The F-statistic is the ratio of the between and within variance adjusted by the degress of freedom. The p-value can be calculcated with function pf().

knitr::kable(tibble(
  between_variance = 30*(A_mean-Overall_mean)^2 + 
                     30*(B_mean-Overall_mean)^2 + 
                     30*(C_mean-Overall_mean)^2,
  within_variance = sum((group_A$variable - mean(group_A$variable))^2) + 
                   sum((group_B$variable - mean(group_B$variable))^2) + 
                   sum((group_C$variable - mean(group_C$variable))^2),
  df_between = 3-1,
  df_within = 90-3,
  F_stat = (between_variance/df_between) / (within_variance/df_within),
  pval = pf(F_stat, 2, 87, lower.tail = FALSE)
  )
)

between_variance	within_variance	df_between	df_within	F_stat	pval
129.0416	355.1128	2	87	15.80711	1.4e-06

The one-way ANOVA can be calculated way more easily with function aov() and the results are displayed with function summary(). The null hypothesis is that there are no differences regarding variable between the levels of group.

model_aov <- aov(variable ~ group, data = dataset_7)

summary(model_aov)

##             Df Sum Sq Mean Sq F value   Pr(>F)    
## group        2  129.0   64.52   15.81 1.39e-06 ***
## Residuals   87  355.1    4.08                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Using function lm() to compute the anova yields the same results with a little bit more precision but we have to use function anova() instead of summary() to get the desired output.

model_lm <- lm(variable ~ group, data = dataset_7)
anova(model_lm)

## Analysis of Variance Table
## 
## Response: variable
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## group      2 129.04  64.521  15.807 1.394e-06 ***
## Residuals 87 355.11   4.082                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As the p-value is less than the significance level of 5 %, we can conclude that there are significant differences between the groups. However, we do not know which pairs of groups are different.

It is possible to perform multiple pairwise-comparisons to determine whether the mean differences between specific pairs of groups are statistically significant. This is can be done by computing Tukey Honest Significant Differences with function TukeyHSD(). It takes the fitted ANOVA as an argument.

TukeyHSD(model_aov)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = variable ~ group, data = dataset_7)
## 
## $group
##          diff        lwr      upr     p adj
## B-A 1.9378958  0.6940356 3.181756 0.0010356
## C-A 2.8756463  1.6317862 4.119506 0.0000011
## C-B 0.9377505 -0.3061096 2.181611 0.1763757

We can conclude that there are significant differences between the means of group A and group B as well as group A and group C but not between group B and group C.

An ANOVA assumes that the variance between each group’s observations is homogeneous. Either Bartlett’s test or Levene’s test can be used to test this assumption. The null hypothesis is that the variance across groups is homogeneous.

car::leveneTest(variable  ~ group, data = dataset_7)

## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value Pr(>F)
## group  2  0.5639 0.5711
##       87

bartlett.test(variable ~ group, data = dataset_7)

## 
##  Bartlett test of homogeneity of variances
## 
## data:  variable by group
## Bartlett's K-squared = 0.52506, df = 2, p-value = 0.7691

Both tests show that there is no evidence to suggest that the variance across groups is significantly different.

Also a scatterplot might be used to check whether there is a relationship between residuals and the fitted values (which are the mean of each group in this setting). The straight blue line connects the mean of each group’s residuals. Pay special attention to outliers in this type of plot as they can strongly influence the above tests! Here, everything looks fine. Note that this plot requires an lm object and can hence only be created after running the ANOVA.

broom::augment(model_lm) %>%
  group_by(group) %>%
  summarise(mean_resid = mean(.resid)) %>%
  left_join(broom::augment(model_lm)) %>%
  ggplot(data = .,
         aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_line(aes(y = mean_resid), color = "blue") +
  labs(x = "Fitted values", y = "Residuals")

The Shapiro-Wilk test, which was already used with the Student’s t-test, is used to check the distribution of the residuals. The null hypothesis is that the residuals follow a normal distribution. The function residuals() is used to extract the residuals from either the object created by aov() or lm().

model_aov_residuals <- residuals(object = model_aov)

shapiro.test(x = model_aov_residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  model_aov_residuals
## W = 0.98789, p-value = 0.5769

The null hypothesis that the residuals are normally distributed cannot be rejected.

For assessing the normality assumption I am also creating a Q-Q (quantile-quantile) plot of the residuals with functions stat_qq() and stat_qq_line(). Approximately, the residuals should follow a straight line.

ggplot(broom::augment(model_lm), 
       aes(sample = .std.resid)) +
  stat_qq() +
  stat_qq_line() +
  labs(x = "Theoretical Quantiles",
       y = "Standardized Residuals")

3.1.2 Unequal variance between groups

Again, I am creating 30 observations for three different groups which are drawn from a normal distribution but now with different means and standard deviations.

set.seed(345)
dataset_8 <- tibble(
  group_A = rnorm(30, mean = 1, sd = 2), 
  group_C = rnorm(30, mean = 4, sd = 5),
  group_B = rnorm(30, mean = 3, sd = 8)
  ) %>%
  pivot_longer(cols = starts_with("group"), 
               names_to = "group",
               names_pattern = "._(.+)", 
               values_to = "variable") %>%
  mutate(group = ordered(group,
                         levels = c("A", "B", "C")))

dataset_8

## # A tibble: 90 x 2
##    group variable
##    <ord>    <dbl>
##  1 A       -0.570
##  2 C        0.285
##  3 B       -2.21 
##  4 A        0.441
##  5 C        7.42 
##  6 B       -1.67 
##  7 A        0.677
##  8 C       -0.821
##  9 B        7.87 
## 10 A        0.419
## # ... with 80 more rows

Let us take a look at the modified dataset. By comparing the distance between the error bars of each group, it is easy to see that there is a different spread within each group compared to the previous section.

dataset_8 %>%
group_by(group) %>%
  summarise(
    mean = mean(variable, na.rm = TRUE),
    sd = sd(variable, na.rm = TRUE)
  ) %>%
  left_join(dataset_8) %>%
  ggplot(data = ., 
       aes(x = group, y = variable, col = group)) +
  geom_jitter(width = 0.1) +
  geom_errorbar(aes(ymin = mean - sd, ymax = mean + sd)) +
  geom_point(aes(y = mean), size = 5, shape = 3) +
  theme(legend.position = "none")

Now I am running again both tests to test whether the variance between the three groups is homogeneous.

car::leveneTest(variable  ~ group, data = dataset_8)

## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value    Pr(>F)    
## group  2  9.6438 0.0001649 ***
##       87                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

bartlett.test(variable  ~ group, data = dataset_8)

## 
##  Bartlett test of homogeneity of variances
## 
## data:  variable by group
## Bartlett's K-squared = 38.319, df = 2, p-value = 4.776e-09

In both tests the null hypothesis that the variance between groups is homogeneous is rejected. Now a test, which does not assume equal variance is required

The Welch one-way test is designed to do this and can be performed with function oneway.test(). It tests whether two or more samples, which follow a normal distribution, have the same means.

oneway.test(variable ~ group, data = dataset_8,
            var.equal = FALSE)

## 
##  One-way analysis of means (not assuming equal variances)
## 
## data:  variable and group
## F = 3.4606, num df = 2.000, denom df = 47.205, p-value = 0.03959

The null hypothesis that the group’s means are equal can be rejected at the 5 % level in favor of the alternative that they are different.

With function pairwise.t.test() comparisons between group levels can be calculated. The argument pool.sd = has to be set to FALSE.

pairwise.t.test(dataset_8$variable, dataset_8$group, 
                pool.sd = FALSE)

## 
##  Pairwise comparisons using t tests with non-pooled SD 
## 
## data:  dataset_8$variable and dataset_8$group 
## 
##   A     B    
## B 0.727 -    
## C 0.042 0.727
## 
## P value adjustment method: holm

There are significant differences between group A and group C but not between group A and group B as well as group B and group C.

3.1.3. Non-parametric alternative

A non-parametric alternative to the one-way ANOVA is the Kruskal-Wallis rank sum test, which can be used when the normality assumption is violated.

Below I create another 30 observations for three different groups which are now drawn from a log normal distribution.

set.seed(345)
dataset_9 <- 
  tibble(
    group_A = rlnorm(30, meanlog = 1.0, sdlog = 0.3), 
    group_C = rlnorm(30, meanlog = 1.5, sdlog = 0.3),
    group_B = rlnorm(30, meanlog = 1.8, sdlog = 0.3)
    ) %>%
  pivot_longer(cols = starts_with("group"), 
               names_to = "group", 
               names_pattern = "._(.+)", 
               values_to = "variable") %>%
  mutate(group = 
           ordered(group, levels = c("A", "B", "C"))
         )

dataset_9

## # A tibble: 90 x 2
##    group variable
##    <ord>    <dbl>
##  1 A         2.15
##  2 C         3.59
##  3 B         4.98
##  4 A         2.50
##  5 C         5.50
##  6 B         5.08
##  7 A         2.59
##  8 C         3.36
##  9 B         7.26
## 10 A         2.49
## # ... with 80 more rows

I use a density plot to assess the skewness of the distribution of variable among the three groups in dataset_9 by comparing it to its distribution in dataset_7 from the previous section. The distribution of variable in dataset_9 is skewed to the right.

bind_rows(
  list(dataset_7=dataset_7, dataset_9=dataset_9), 
  .id = "source"
  ) %>%
  ggplot(data = .,
       aes(x = variable, fill = group)) +
  geom_density(position = "dodge", color = "black") +
  facet_wrap(. ~ source, scales = "free")

Using the function shapiro.test(), the null hypothesis that the residuals are normally distributed is now rejected.

model_aov_residuals <- residuals(object = lm(variable ~ group, data = dataset_9))

shapiro.test(x = model_aov_residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  model_aov_residuals
## W = 0.87061, p-value = 2.441e-07

The Q-Q plot of the residuals also shows the deviation from the normal distribution at the tails.

ggplot(broom::augment(lm(variable ~ group, data = dataset_9)), 
       aes(sample = .std.resid)) +
  stat_qq() +
  stat_qq_line() +
  labs(x = "Theoretical Quantiles",
       y = "Standardized Residuals")

The function kruskal.test() performs a test of the null hypothesis that the location parameters of the distribution of variable are the same in each group. The alternative is that they differ in at least one.

kruskal.test(variable ~ group, data = dataset_9)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  variable by group
## Kruskal-Wallis chi-squared = 49.644, df = 2, p-value = 1.66e-11

The null hypothesis is rejected in favor of the alternative that the distribution of variable differs among the three groups.

With function pairwise.wilcox.test() pairwise comparisons between group levels may be calculated.

pairwise.wilcox.test(dataset_9$variable, dataset_9$group)

## 
##  Pairwise comparisons using Wilcoxon rank sum exact test 
## 
## data:  dataset_9$variable and dataset_9$group 
## 
##   A       B      
## B 9.4e-13 -      
## C 3.8e-07 0.00021
## 
## P value adjustment method: holm

There are significant differences between all three groups.

3.2 Two-way ANOVA

The two-way analysis of variance is an extension of the one-way ANOVA and examines the influence of two different categorical independent variables on one continuous dependent variable. It not only aims at assessing the main effect of each independent variable but also if there is any interaction between them. The two-way ANOVA again assumes that the observations within each group are normally distributed and have equal variances.

The follwing hypotheses might be tested:

There is no difference in the means of categories of variable A.
There is no difference in the means of categories of variable B.
There is no interaction between the variables A and B.

3.2.1 Balanced Design

The first section discusses an ANOVA when the study’s design is balanced. This means that for each combination of the two factor variables there are the same number of observations (cell counts).

Below, I am generating a dataset with 60 observations. They are split within group (A, B, C) and country (DE, EN), hence there are 10 observations per combination of the two factor variables. The variable of interest is drawn from a normal distribution. The mean differs within and between group and country while the standard deviation only differs between country.

set.seed(345)
dataset_10 <- tibble(
  DE_group_A = rnorm(10, mean = 1, sd = 1), 
  DE_group_B = rnorm(10, mean = 2, sd = 1),
  DE_group_C = rnorm(10, mean = 3, sd = 1),
  EN_group_A = rnorm(10, mean = 2, sd = 1.4), 
  EN_group_B = rnorm(10, mean = 4, sd = 1.4),
  EN_group_C = rnorm(10, mean = 6, sd = 1.4)
  ) %>%
  pivot_longer(cols = everything(),
               names_to = c("country", "group"),
               names_pattern = "(.*)_group_(.*)",
               values_to = "variable") %>%
  mutate(group = factor(group),
         country = factor(country))

The cell counts can be checked with a two-way table.

dataset_10 %>%
  select(country, group) %>%
  table()

##        group
## country  A  B  C
##      DE 10 10 10
##      EN 10 10 10

Here, the design is balanced.

Let us take a look at dataset_11. Since there is now a second categorical variable, I am utilizing the color = argument to separate the boxplots by country.

ggplot(data = dataset_10,
       aes(x = group, y = variable, color = country)) +
  geom_boxplot()

The two-way ANOVA is carried out in the same fashion as in the previous section. The second factor variable is added via a + in the formula of function aov().

model_aov2 <- aov(variable ~ country + group, data = dataset_10)

summary(model_aov2)

##             Df Sum Sq Mean Sq F value   Pr(>F)    
## country      1  53.38   53.38   34.62 2.34e-07 ***
## group        2  53.94   26.97   17.49 1.25e-06 ***
## Residuals   56  86.34    1.54                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Both country and group are statistically significant and influence the mean of variable length.

The above fitted model is called an additive model. It makes the assumption that the two factor variables country and group are independent. If there are reasons to believe that these two variables might interact, that is they depend on each other, the formula of function aov() has to be adjusted by adding country:group.

model_aov_interact <- aov(variable ~ country + group + country:group, data = dataset_10)

summary(model_aov_interact)

##               Df Sum Sq Mean Sq F value   Pr(>F)    
## country        1  53.38   53.38  37.528 1.07e-07 ***
## group          2  53.94   26.97  18.958 5.79e-07 ***
## country:group  2   9.53    4.76   3.349   0.0425 *  
## Residuals     54  76.82    1.42                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The interaction is still statistically significant at the 5 % level. Hence, the relationship between group and variable does depend on country. If the interaction is not significant the additive model is preferred.

With function model.tables() summary tables for model fits can be printed. It displays the overall mean, the means per level of each factor variable as well as the means of the combinations of both factor’s levels.

model.tables(model_aov_interact, type = "means", se = TRUE)

## Tables of means
## Grand mean
##          
## 2.877139 
## 
##  country 
## country
##    DE    EN 
## 1.934 3.820 
## 
##  group 
## group
##     A     B     C 
## 1.585 3.212 3.834 
## 
##  country:group 
##        group
## country A     B     C    
##      DE 1.203 2.034 2.564
##      EN 1.968 4.390 5.104
## 
## Standard errors for differences of means
##         country  group country:group
##          0.3080 0.3772        0.5334
## replic.      30     20            10

Similarly to the one-way ANOVA, a significant p-value indicates that the group means are different, but we do not know exactly which pairs. Again, we can compute Tukey Honest Significant Differences to perform multiple pairwise-comparisons between the means of groups.

The argument which = of function TukeyHSD() can take a list of character values as input and hence we can test both factor variables (country and group) at once.

TukeyHSD(model_aov_interact, which = c("country", "group"))

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = variable ~ country + group + country:group, data = dataset_10)
## 
## $country
##           diff      lwr      upr p adj
## EN-DE 1.886507 1.269101 2.503912 1e-07
## 
## $group
##          diff        lwr      upr     p adj
## B-A 1.6266655  0.7177109 2.535620 0.0002010
## C-A 2.2488666  1.3399120 3.157821 0.0000006
## C-B 0.6222011 -0.2867535 1.531156 0.2339918

Irrespective of group, we find that there are statistically significant differences between countries DE and EN. Also we find that there are statistically significant differences between group A and B as well as group A and C but not between group B and C irrespective of country.

Alternatively, we can also use a pairwise t-test to consider all possible combinations of the categorical variables. I’m using the function str_c() to combine the vectors group and country.

pairwise.t.test(dataset_10$variable, str_c(dataset_10$group, dataset_10$country, sep = "-"))

## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  dataset_10$variable and str_c(dataset_10$group, dataset_10$country, sep = "-") 
## 
##      A-DE    A-EN    B-DE    B-EN    C-DE   
## A-EN 0.78787 -       -       -       -      
## B-DE 0.74968 0.90070 -       -       -      
## B-EN 2.6e-06 0.00032 0.00044 -       -      
## C-DE 0.09505 0.80470 0.80470 0.00953 -      
## C-EN 1.9e-08 3.4e-06 5.0e-06 0.78787 0.00016
## 
## P value adjustment method: holm

Variance homogeneity: The null hypothesis that the variance between samples is homogeneous cannot be rejected.

car::leveneTest(variable ~ country*group, data = dataset_10)

## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value Pr(>F)
## group  5   0.388 0.8549
##       54

Normality: The null hypothesis that the residuals are normally distributed cannot be rejected.

model_aov_interact_residuals <- residuals(object = lm(variable ~ country*group, data = dataset_10))

shapiro.test(x = model_aov_interact_residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  model_aov_interact_residuals
## W = 0.97825, p-value = 0.3592

3.2.2 Unbalanced Designs

An unbalanced design has unequal numbers of subjects in each group. In general there are three different ways to run an ANOVA and as long as a study’s design is balanced these methods lead to the same results. However, when dealing with an unbalanced design is is recommended to calculate Type-II or Type-III sums of squares.

In order to create an unbalanced design I use the same dataset from the previous section (balanced design) and introduce missing values (NA) in each group.

set.seed(345)
dataset_11 <- tibble(
  DE_group_A = c(rnorm(8, mean = 1, sd = 1), NA,NA), 
  DE_group_B = c(rnorm(7, mean = 2, sd = 1), NA, NA, NA),
  DE_group_C = c(rnorm(9, mean = 3, sd = 1), NA),
  EN_group_A = c(rnorm(9, mean = 2, sd = 1.4), NA), 
  EN_group_B = c(rnorm(10, mean = 4, sd = 1.4)),
  EN_group_C = c(rnorm(8, mean = 6, sd = 1.4), NA, NA)
  ) %>%
  pivot_longer(cols = everything(),
               names_to = c("country", "group"),
               names_pattern = "(.*)_group_(.*)",
               values_to = "variable") %>%
  mutate(group = factor(group),
         country = factor(country)) %>%
  filter(!is.na(variable))

Below we can see the unequal cell counts. They vary between 7 and 10.

dataset_11 %>%
  select(country, group) %>%
  table()

##        group
## country  A  B  C
##      DE  8  7  9
##      EN  9 10  8

With a boxplot is not possible to see whether a dataset inherits a balanced design or not, so always use frequency tables to check for it.

ggplot(data = dataset_11,
       aes(x = group, y = variable, color = country)) +
  geom_boxplot()

The first type is also known as sequential sum of squares because the sequence of the specified independent variables matters. In R’s anova() and aov() functions this type is implemented and the ANOVA should be run two times, once with each independent variable taking the lead.

summary(aov(variable ~ country*group, data = dataset_11))

##               Df Sum Sq Mean Sq F value   Pr(>F)    
## country        1  47.00   47.00  34.394 4.96e-07 ***
## group          2  97.40   48.70  35.642 5.29e-10 ***
## country:group  2  19.91    9.95   7.284  0.00182 ** 
## Residuals     45  61.49    1.37                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(aov(variable ~ group*country, data = dataset_11))

##               Df Sum Sq Mean Sq F value   Pr(>F)    
## group          2  93.54   46.77  34.226 9.20e-10 ***
## country        1  50.87   50.87  37.225 2.21e-07 ***
## group:country  2  19.91    9.95   7.284  0.00182 ** 
## Residuals     45  61.49    1.37                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The difference in the main effects of country and group is only due to which variable comes first. Also note that the interaction effect is not affected by the ordering of the independent variables.

For the second type the sequence of the arguments in the model does not matter! We can use the function Anova() from package car and specify type = "II" to obtain the desired results.

model_aov2 <- aov(variable ~ country*group, data = dataset_11)

car::Anova(model_aov2, type = "II")

## Anova Table (Type II tests)
## 
## Response: variable
##               Sum Sq Df F value    Pr(>F)    
## country       50.865  1 37.2248 2.209e-07 ***
## group         97.404  2 35.6417 5.286e-10 ***
## country:group 19.907  2  7.2844  0.001817 ** 
## Residuals     61.489 45                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The third type is also known as partial sums of squares. Again we use the function Anova() from package car and specify type = "III".

model_aov3 <- aov(variable ~ country*group, data = dataset_11)

car::Anova(model_aov3, type = "III")

## Anova Table (Type III tests)
## 
## Response: variable
##               Sum Sq Df F value   Pr(>F)   
## (Intercept)    5.388  1  3.9433 0.053171 . 
## country        2.251  1  1.6476 0.205850   
## group         17.765  2  6.5004 0.003312 **
## country:group 19.907  2  7.2844 0.001817 **
## Residuals     61.489 45                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The effect of group is still statistically significant but the influence of country is not! The p-value of the interaction term is identical to the previous settings.

4. Multivariate Analysis of Variance (MANOVA)

A multivariate analysis of variance (MANOVA) is simply an ANOVA with more than one dependent variable. Hence, while an ANOVA tests for the difference in means between two or more groups, the MANOVA tests for the difference in two or more vectors of means.

I create a dataset where two variables of interest (height and weight) are drawn from a normal distribution for three different groups. In total there are 60 observations of the two variables of interest - 20 for each group.

set.seed(345)
dataset_12 <- tibble(
  height_group_A = rnorm(20, mean = 3.0, sd = 1.0), 
  height_group_B = rnorm(20, mean = 4.5, sd = 1.0),
  height_group_C = rnorm(20, mean = 3.5, sd = 1.0),
  weight_group_A = rnorm(20, mean = 5.0, sd = 1.0), 
  weight_group_B = rnorm(20, mean = 6.5, sd = 1.0),
  weight_group_C = rnorm(20, mean = 5.5, sd = 1.0)
  ) %>%
  pivot_longer(cols = everything(),
               names_to = c(".value", "group"),
               names_pattern = "(.+)_group_(.)"
               ) %>%
  mutate(group = factor(group))

dataset_12

## # A tibble: 60 x 3
##    group height weight
##    <fct>  <dbl>  <dbl>
##  1 A       2.22   4.35
##  2 B       4.94   7.86
##  3 C       4.13   4.14
##  4 A       2.72   4.42
##  5 B       5.33   5.58
##  6 C       4.34   5.01
##  7 A       2.84   5.61
##  8 B       1.92   5.48
##  9 C       2.65   6.38
## 10 A       2.71   3.79
## # ... with 50 more rows

Since there is more than one continuous variable, I am using a scatterplot to show the distribution of height and weight between group A, B and C.

ggplot(data = dataset_12,
       aes(x = height, y = weight)) +
  geom_point() +
  facet_wrap(. ~ group,  labeller = purrr::partial(label_both, sep = ": "))

With function cbind() the height and weight variable are combined and then used as dependent variables in function manova().

model_manova <- manova(cbind(height, weight) ~ group, data = dataset_12)
summary(model_manova)

##           Df  Pillai approx F num Df den Df    Pr(>F)    
## group      2 0.46724   8.6879      4    114 3.708e-06 ***
## Residuals 57                                             
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

First, we can see that there are significant differences between the groups regarding both dependent variables. The column Pillai in the output displays Pillai’s Trace which is a positive integer value ranging from 0 to 1. This test is considered to be the most powerful and robust statistic for general use, especially for departures from the assumption of variance homogeneity.

Below is another output for both responses individually. There are also significant differences between the groups for each dependent variable.

summary.aov(model_manova)

##  Response height :
##             Df Sum Sq Mean Sq F value  Pr(>F)   
## group        2 15.145  7.5726  7.1769 0.00166 **
## Residuals   57 60.142  1.0551                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response weight :
##             Df Sum Sq Mean Sq F value    Pr(>F)    
## group        2 35.123 17.5613  19.638 3.252e-07 ***
## Residuals   57 50.972  0.8942                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Similar to the ANOVA we can use the Shapiro-Wilk test to check the distribution of the residuals. The null hypothesis is that the residuals follow a normal distribution. The function residuals() is used to extract the residuals from the object created by function manova(). However, we can only check for univariate normality with this test.

model_manova_residuals <- residuals(object = model_manova)

shapiro.test(x = model_manova_residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  model_manova_residuals
## W = 0.99062, p-value = 0.5905

The null hypothesis that the residuals are univariate normally distributed cannot be rejected.

In addition there is **Mardia’skewness and kurtosis test* for multivariate normality. It is implemented in function mardia in package mvnormalTest.

mvnormalTest::mardia(cbind(dataset_12$height, dataset_12$weight))

## $mv.test
##           Test Statistic p-value Result
## 1     Skewness    4.2945  0.3676    YES
## 2     Kurtosis   -1.2815     0.2    YES
## 3 MV Normality      <NA>    <NA>    YES
## 
## $uv.shapiro
##    W      p-value UV.Normality
## V1 0.9812 0.4818  Yes         
## V2 0.9695 0.1374  Yes

The null hypothesis that the dependent variables are multivariate normally distributed cannot be rejected.

When perfoming a MANOVA the data should have equal variance-covariance matrices for each combination formed by each group in the independent variable. This is a multivariate version of the homogeneity of variances described in the ANOVA section. We can use Box’s M test implemented in function boxM() in package heplots.

heplots::boxM(cbind(height, weight) ~ group, data = dataset_12)

## 
##  Box's M-test for Homogeneity of Covariance Matrices
## 
## data:  Y
## Chi-Sq (approx.) = 3.3773, df = 6, p-value = 0.7602

The null hypothesis that the covariance matrices are homogeneous cannot be rejected.

Also we need to check for multicollinearity among our two dependent variables height and weight. Correlation is measured between -1 and 1 where zero indicates no correlation and -1/1 perfect negative/positive correlation.

cor(dataset_12$height, dataset_12$weight)

## [1] 0.3606827

5. Bootstrapping

As a non-parametric estimation method, the bootstrap quantifies the uncertainty of an estimator involved with the standard deviation. This can for example be very helpful if the sample size is relatively small or when the underlying distribution cannot be approximated parametrically.

I create a dataset with 30 observations of two variables of interest (height and weight) which are sampled from a range of given values.

set.seed(345)
dataset_13 <- tibble(
  height = sample(170:190, 30, replace = TRUE), 
  weight = sample(70:90, 30, replace = TRUE)
  )

dataset_13

## # A tibble: 30 x 2
##    height weight
##     <int>  <int>
##  1    188     80
##  2    190     88
##  3    187     70
##  4    189     74
##  5    177     78
##  6    170     81
##  7    178     90
##  8    171     75
##  9    188     86
## 10    174     70
## # ... with 20 more rows

For variable height I calculate the mean and its standard error.

mean(dataset_13$height)

## [1] 179.4333

sd(dataset_13$height)/sqrt(length(dataset_13$height))

## [1] 1.20505

Before we are able to perform the bootstrap, a specific function is required which must allow for subsetting the dataset. Below I am generating a new function mean_indexed() which uses the argument indices to subset the supplied data.

mean_indexed <- function(data, indices){
  mean(data[indices])
  }

Now with function boot() from R package boot the bootstrap can be calculated. In order to get replicable results a seed must be set. Then I am using the above generated function mean_indexed() with argument statistic = and set R = 1000 to create 1000 replications (samples).

set.seed(345)
bootMean <- boot(dataset_13$height, statistic = mean_indexed, R = 1000)

summary(bootMean)

##      R original  bootBias bootSE bootMed
## 1 1000   179.43 -0.045633 1.2248  179.38

The original mean is 179.43. The bootBias represents the difference between the original and bootstrapped value of the mean. The bootstrapped standard error (bootSE) is 1.2248. The values shown above can also be calculated by using the 1000 replicated mean values contained in t inside object bootMean.

mean(bootMean$t)

## [1] 179.3877

sd(bootMean$t)

## [1] 1.224809

options(pillar.sigfig = 5)
tibble(
  variable = c("height", "height"),
  type = c("parametric", "bootstrap"),
  mean = c(mean(dataset_13$height), mean(bootMean$t)),
  se = c(sd(dataset_13$height)/sqrt(length(dataset_13$height)), sd(bootMean$t))
  )

## # A tibble: 2 x 4
##   variable type         mean     se
##   <chr>    <chr>       <dbl>  <dbl>
## 1 height   parametric 179.43 1.2051
## 2 height   bootstrap  179.39 1.2248

Below is another example on how to use the bootstrap. I want to estimate the correlation between height and weight with the bootstrap.

cor_indexed <- function(data, indices){
  cor(data[indices,]$height, data[indices,]$weight)
  }

set.seed(345)
bootCor <- boot(dataset_13, statistic = cor_indexed, R = 1000)

options(pillar.sigfig = 5)
tibble(
  correlation_paramteric = cor(dataset_13$height, dataset_13$weight),
  correlation_bootstrap = mean(bootCor$t)
  )

## # A tibble: 1 x 2
##   correlation_paramteric correlation_bootstrap
##                    <dbl>                 <dbl>
## 1               0.053863              0.049370