R Intermediate Course 2025

Imferential Statistics II

Bioinformatics Core Facility CECAD

2025-03-17

Slides & Code

[f] Full screen
[o] Slide Overview
[c] Notes
[h] help

git repo

R-Basic

Clone repo

git clone https://github.com/CECADBioinformaticsCoreFacility/Intermediate_R_Course_2025.git

Slides Directly

https://cecadbioinformaticscorefacility.github.io/Intermediate_R_Course_2025/

Session 6 :: Inferential Statistics II

Our Running Example: The Iris Dataset

# load the dataset
data(iris) 

# have a look
summary(iris)

  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
       Species  
 setosa    :50  
 versicolor:50  
 virginica :50

4 flower trait observations per sample
3 species
Do the species differ significantly in individual traits, and/or in all traits jointly?
Outlook on Fisher's original approach to predict species membership from a linear combination of the traits

A Quick Tour of Concepts: One-Sample T-Test

A Quick Tour of Concepts: Independent Samples T-Test

A Quick Tour of Concepts: Paired Samples T-Test

Equivalent to an unpaired t-test on the differences between the conditions
Note that the two conditions can be anything, not necessarily pre/post!

A Quick Tour of Concepts: ANOVA

ANOVA

A Quick Tour of Concepts: ANOVA

NOTE:

A significant F value rejects the Null "hypothesis all means are equal"
One needs to do individual pairwise post-hoc tests to find the actual differences
Multiple post-hoc tests need multiple testing correction!

A Quick Tour of Concepts: Linear Regression

A Quick Tour of Concepts: Multiple Linear Regression

A Quick Tour of Concepts: MANOVA

MANOVA = ANOVA with >1 dependent variable
(> 1 feature measured per sampling unit)
A significant MANOVA result must be followed up with post-hoc tests, as in ANOVA
It is popular to use Linear Discriminant Analysis (LDA) as a post-hoc test for MANOVA for a quick overview over strongly deviant groups (not a formal test)

Test Functions in R and Their Assumptions

Code of Some Convenience Functions

library(dplyr)
library(ggpubr)

run_onevar_test <-
    function(x,
             group_var,
             test_var,
             test_f,
             ndigits = 1) {

        purrr::map(x |> dplyr::pull(group_var) |> levels(),
           \ (l) iris |> 
                dplyr::filter(.data[[group_var]]==l) |> 
                dplyr::pull(test_var) |> 
                test_f() |>
                broom::tidy() |>
                mutate(level = l) |>
                mutate(test_var = test_var) |>
                mutate(p.value = formatC(p.value, 
                                 format = "e", 
                                 digits = ndigits)
                       )
        ) |> bind_rows()
    }

run_formula_test <-
    function(x,group_var,test_var, ndigits=1, test_f) {
        char_f <- paste(test_var,"~",group_var)
        
        test_f(eval(as.formula(char_f)),
                        data = x)  |>
        broom::tidy() |>
        mutate(group_var = group_var) |>
        mutate(test_var = test_var) |>
        mutate(p.value = formatC(p.value, 
                                 format = "e", 
                                 digits = ndigits)
               )
    }
     
run_levene <-
    function(x,group_var,test_var, ndigits=1) {
        char_f <- paste(test_var,"~",group_var)
        
        car::leveneTest(y = eval(as.formula(char_f)),
                        data = x)  |>
        broom::tidy() |>
        mutate(group_var = group_var) |>
        mutate(test_var = test_var) |>
        mutate(p.value = formatC(p.value, 
                                 format = "e", 
                                 digits = ndigits)
               )
    }

Do the Assumptions Hold For the Iris Flower Traits?

Normality (Sepal.Length):

this_trait <- "Sepal.Length"
ggpubr::ggqqplot(iris, 
                 this_trait,
                 facet.by="Species",
                 ylab = this_trait
                )

run_onevar_test(iris,
                "Species",
                "Sepal.Length",
                shapiro.test)

# A tibble: 3 × 5
  statistic p.value method                      level      test_var    
      <dbl> <chr>   <chr>                       <chr>      <chr>       
1     0.978 4.6e-01 Shapiro-Wilk normality test setosa     Sepal.Length
2     0.978 4.6e-01 Shapiro-Wilk normality test versicolor Sepal.Length
3     0.971 2.6e-01 Shapiro-Wilk normality test virginica  Sepal.Length

Do the Assumptions Hold For the Iris Flower Traits?

Normality (Sepal.Width):

this_trait <- "Sepal.Width"
ggpubr::ggqqplot(iris, 
                 this_trait,
                 facet.by="Species",
                 ylab = this_trait
                )

run_onevar_test(iris,
                "Species",
                "Sepal.Width",
                shapiro.test)

# A tibble: 3 × 5
  statistic p.value method                      level      test_var   
      <dbl> <chr>   <chr>                       <chr>      <chr>      
1     0.972 2.7e-01 Shapiro-Wilk normality test setosa     Sepal.Width
2     0.974 3.4e-01 Shapiro-Wilk normality test versicolor Sepal.Width
3     0.967 1.8e-01 Shapiro-Wilk normality test virginica  Sepal.Width

Do the Assumptions Hold For the Iris Flower Traits?

Normality (Petal.Length):

this_trait <- "Petal.Length"
ggpubr::ggqqplot(iris, 
                 this_trait,
                 facet.by="Species",
                 ylab = this_trait
                )

run_onevar_test(iris,
                "Species",
                "Petal.Length",
                shapiro.test)

# A tibble: 3 × 5
  statistic p.value method                      level      test_var    
      <dbl> <chr>   <chr>                       <chr>      <chr>       
1     0.955 5.5e-02 Shapiro-Wilk normality test setosa     Petal.Length
2     0.966 1.6e-01 Shapiro-Wilk normality test versicolor Petal.Length
3     0.962 1.1e-01 Shapiro-Wilk normality test virginica  Petal.Length

Do the Assumptions Hold For the Iris Flower Traits?

Normality (Petal.Width):

this_trait <- "Petal.Width"
ggpubr::ggqqplot(iris, 
                 this_trait,
                 facet.by="Species",
                 ylab = this_trait
                )

run_onevar_test(iris,"Species",
                "Petal.Width",
                shapiro.test)

# A tibble: 3 × 5
  statistic p.value method                      level      test_var   
      <dbl> <chr>   <chr>                       <chr>      <chr>      
1     0.800 8.7e-07 Shapiro-Wilk normality test setosa     Petal.Width
2     0.948 2.7e-02 Shapiro-Wilk normality test versicolor Petal.Width
3     0.960 8.7e-02 Shapiro-Wilk normality test virginica  Petal.Width

Do the Assumptions Hold For the Iris Flower Traits?

Equal variance:

purrr::map(colnames(iris)[-5],
           \(x) run_levene(iris,"Species",x)
) |> bind_rows()

# A tibble: 4 × 6
  statistic p.value    df df.residual group_var test_var    
      <dbl> <chr>   <int>       <int> <chr>     <chr>       
1     6.35  2.3e-03     2         147 Species   Sepal.Length
2     0.590 5.6e-01     2         147 Species   Sepal.Width 
3    19.5   3.1e-08     2         147 Species   Petal.Length
4    19.9   2.3e-08     2         147 Species   Petal.Width

For all traits except Sepal.Width, __the Equality of Variance assumption is strongly violated

Normality in contrast is more or less OK, except for Petal.Width.
So we can use downstream tests which assume normality but not necessarily equal variance.

oneway.test():

An ANOVA Test Which Can Deal With Unequal Variances

purrr::map(colnames(iris)[1:4], ## run oneway.test() for each trait 
           \(x)run_formula_test(iris, "Species",x,test_f=oneway.test)
  )|> 
  bind_rows() |> ## concatenate the results by row
  select(statistic,p.value,group_var,test_var) ## filter outputs to keep

# A tibble: 4 × 4
  statistic p.value group_var test_var    
      <dbl> <chr>   <chr>     <chr>       
1     139.  1.5e-28 Species   Sepal.Length
2      45.0 1.4e-14 Species   Sepal.Width 
3    1828.  2.7e-66 Species   Petal.Length
4    1277.  4.1e-64 Species   Petal.Width

The Null Hypothesis of Equality of all Species Means is strongly rejected for all 4 traits. However the Petal traits are clearly more species specific than the Sepal trait.

Preparation for doing t-Tests

The recommended post-hoc test for ANOVA is actually Tukey's HSD (stats::TukeyHSD). However it assumes equal variances.

## Define a wrapper function around ggplot (slightly simpler):
require(dplyr)
require(ggplot2)
require(ggpubr)

ggbp <- 
     function(data, group_var, test_var, color, palette,
              var.equal=FALSE,
              paired = FALSE) {
       
         cmb <-utils::combn(data |> 
                            dplyr::pull(group_var) |>
                            levels(),
                            2,
                            # return a *list* of pairs!
                            simplify=FALSE 
                      ) 
         ggboxplot(data=data, x = group_var, y = test_var,
                   color = group_var, palette = palette,
                   order = levels(data |> pull(group_var)),
                   ylab = test_var, xlab = group_var) +
             
             stat_compare_means(method = "t.test",
                                paired = paired,
                                method.args = 
                                  list(var.equal = var.equal),
                                comparisons = cmb
                                            
             )
     }

Welch’s t-Test [unpaired] (Sepal.Length)

ggbp(data = iris,
     group_var = "Species",
     test_var = "Sepal.Length",
     color = "Species",
     palette = c("blue","orange","yellow"),
     paired = FALSE
     )

Welch’s t-Test (Sepal.Width)

ggbp(data = iris,
     group_var = "Species",
     test_var = "Sepal.Width",
     color = "Species",
     palette = c("blue","orange","yellow"),
     paired = FALSE
     )

Welch’s t-Test (Petal.Length)

ggbp(data = iris,
     group_var = "Species",
     test_var = "Petal.Length",
     color = "Species",
     palette = c("blue","orange","yellow"),
     paired = FALSE
     )

Welch’s t-Test (Sepal.Length)

ggbp(data = iris,
     group_var = "Species",
     test_var = "Petal.Width",
     color = "Species",
     palette = c("blue","orange","yellow"),
     paired = FALSE
     )

Linear Regression

Linear Regression in R

In plain text for copying …

our.model <- lm(Sepal.Length ~ Sepal.Width,
                iris)

our.model


Call:
lm(formula = Sepal.Length ~ Sepal.Width, data = iris)

Coefficients:
(Intercept)  Sepal.Width  
     6.5262      -0.2234

summary(our.model)


Call:
lm(formula = Sepal.Length ~ Sepal.Width, data = iris)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.5561 -0.6333 -0.1120  0.5579  2.2226 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   6.5262     0.4789   13.63   <2e-16 ***
Sepal.Width  -0.2234     0.1551   -1.44    0.152    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8251 on 148 degrees of freedom
Multiple R-squared:  0.01382,   Adjusted R-squared:  0.007159 
F-statistic: 2.074 on 1 and 148 DF,  p-value: 0.1519

Linear Regression in R

ggplot2::ggplot(
  iris,
  aes(x=Sepal.Width,
      y=Sepal.Length
    )
)+geom_point()+
  geom_smooth(method="lm")
#

ggplot2::ggplot(
  iris,
  aes(x=Sepal.Width,
      y=Sepal.Length,
      color=Species
    )
)+geom_point()+
  geom_smooth(method="lm")

Linear Regression in R

summary(
  lm(
  Sepal.Length ~ Sepal.Width + Species,
  data = iris
  )
)


Call:
lm(formula = Sepal.Length ~ Sepal.Width + Species, data = iris)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.30711 -0.25713 -0.05325  0.19542  1.41253 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)         2.2514     0.3698   6.089 9.57e-09 ***
Sepal.Width         0.8036     0.1063   7.557 4.19e-12 ***
Speciesversicolor   1.4587     0.1121  13.012  < 2e-16 ***
Speciesvirginica    1.9468     0.1000  19.465  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.438 on 146 degrees of freedom
Multiple R-squared:  0.7259,    Adjusted R-squared:  0.7203 
F-statistic: 128.9 on 3 and 146 DF,  p-value: < 2.2e-16

MANOVA with Post-Hoc Linear Discriminant Analysis

MANOVA = multivariate ANOVA = multi-dimensional observations
like ANOVA, it tests the Null hypothesis of equal group means
Linear Discriminat Analysis (LDA) helps to identify likely causal groups:
- Use LDA to find a linear combination of the data dimensions (here: the flower traits) which optimally distinguishes the data groups (here: the species)
- Apply this linear combination to the data (re-predict the data)
- Plot the predictions and see which groups are most clearly distinguished

MANOVA with Post-Hoc Linear Discriminant Analysis

## this is from https://rpubs.com/KristelG/894161

dependent_vars <- iris[,-5] |> as.matrix()
independent_var <- iris$Species

manova_model <- 
  manova(dependent_vars ~ independent_var, 
         data = iris
         )
summary(manova_model)

                 Df Pillai approx F num Df den Df    Pr(>F)    
independent_var   2 1.1919   53.466      8    290 < 2.2e-16 ***
Residuals       147                                            
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

MANOVA with Post-Hoc Linear Discriminant Analysis

require(MASS)

iris_lda <- 
  lda(independent_var ~ dependent_vars, 
      CV = FALSE) # no leave-one-out cross-validation
iris_lda

Call:
lda(independent_var ~ dependent_vars, CV = FALSE)

Prior probabilities of groups:
    setosa versicolor  virginica 
 0.3333333  0.3333333  0.3333333 

Group means:
           dependent_varsSepal.Length dependent_varsSepal.Width
setosa                          5.006                     3.428
versicolor                      5.936                     2.770
virginica                       6.588                     2.974
           dependent_varsPetal.Length dependent_varsPetal.Width
setosa                          1.462                     0.246
versicolor                      4.260                     1.326
virginica                       5.552                     2.026

Coefficients of linear discriminants:
                                  LD1         LD2
dependent_varsSepal.Length  0.8293776  0.02410215
dependent_varsSepal.Width   1.5344731  2.16452123
dependent_varsPetal.Length -2.2012117 -0.93192121
dependent_varsPetal.Width  -2.8104603  2.83918785

Proportion of trace:
   LD1    LD2 
0.9912 0.0088

MANOVA with Post-Hoc Linear Discriminant Analysis

lda_df <- 
  data.frame(species = iris[, "Species"],
             lda = predict(iris_lda)$x
  )

ggplot(lda_df) +
  geom_point(aes(x = lda.LD1, 
                 y = lda.LD2, 
                 color = species), 
             size = 4) +
  theme_classic()

Setosa is the most cleanly re-predicted species and is likely most contributing to the rejection of the Null hypothesis.

Some Remarks on LDA, Ronald A Fisher, and the Iris Data

Fisher was the first to describe the idea of Linear Discriminant Analysis
He did so while working with Anderson’s (=our) Iris data.
The basic idea is "Fisher's rule": Project the multidimensional data onto an axis,
such that the group means distance is maximized, and the group scatter in the direction of the axis is minimized (optimal signal-to-noise ratio)

Some Remarks on LDA, Ronald A Fisher, and the Iris Data

Fisher's initial method could only distinguish two groups.
However his 1936 paper argues that a biological constraint on the group means allows him to separate all 3 Iris species.
He derived the constraint from the (now proven) hypothesis that versicolor = sertosa x virginica,
and argued that the fact that it leads to a good classifier suggests that the genome dosis has a global effect on all traits (independent gene action)