Imferential Statistics II
Bioinformatics Core Facility CECAD
2025-03-17
git clone https://github.com/CECADBioinformaticsCoreFacility/Intermediate_R_Course_2025.git
https://cecadbioinformaticscorefacility.github.io/Intermediate_R_Course_2025/
Session 6 :: Inferential Statistics II
Sepal.Length Sepal.Width Petal.Length Petal.Width
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
Median :5.800 Median :3.000 Median :4.350 Median :1.300
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
Species
setosa :50
versicolor:50
virginica :50
Do the species differ significantly in individual traits, and/or in all traits jointly?
Outlook on Fisher's original approach
to predict species membership from a linear combination of the traits
NOTE:
Null "hypothesis all means are equal"
pairwise post-hoc tests to find the actual differences
MANOVA = ANOVA with >1 dependent variable
post-hoc tests
, as in ANOVALinear Discriminant Analysis (LDA) as a post-hoc test for MANOVA
for a quick overview over strongly deviant groups (not a formal test)library(dplyr)
library(ggpubr)
run_onevar_test <-
function(x,
group_var,
test_var,
test_f,
ndigits = 1) {
purrr::map(x |> dplyr::pull(group_var) |> levels(),
\ (l) iris |>
dplyr::filter(.data[[group_var]]==l) |>
dplyr::pull(test_var) |>
test_f() |>
broom::tidy() |>
mutate(level = l) |>
mutate(test_var = test_var) |>
mutate(p.value = formatC(p.value,
format = "e",
digits = ndigits)
)
) |> bind_rows()
}
run_formula_test <-
function(x,group_var,test_var, ndigits=1, test_f) {
char_f <- paste(test_var,"~",group_var)
test_f(eval(as.formula(char_f)),
data = x) |>
broom::tidy() |>
mutate(group_var = group_var) |>
mutate(test_var = test_var) |>
mutate(p.value = formatC(p.value,
format = "e",
digits = ndigits)
)
}
run_levene <-
function(x,group_var,test_var, ndigits=1) {
char_f <- paste(test_var,"~",group_var)
car::leveneTest(y = eval(as.formula(char_f)),
data = x) |>
broom::tidy() |>
mutate(group_var = group_var) |>
mutate(test_var = test_var) |>
mutate(p.value = formatC(p.value,
format = "e",
digits = ndigits)
)
}
# A tibble: 3 × 5
statistic p.value method level test_var
<dbl> <chr> <chr> <chr> <chr>
1 0.978 4.6e-01 Shapiro-Wilk normality test setosa Sepal.Length
2 0.978 4.6e-01 Shapiro-Wilk normality test versicolor Sepal.Length
3 0.971 2.6e-01 Shapiro-Wilk normality test virginica Sepal.Length
# A tibble: 3 × 5
statistic p.value method level test_var
<dbl> <chr> <chr> <chr> <chr>
1 0.972 2.7e-01 Shapiro-Wilk normality test setosa Sepal.Width
2 0.974 3.4e-01 Shapiro-Wilk normality test versicolor Sepal.Width
3 0.967 1.8e-01 Shapiro-Wilk normality test virginica Sepal.Width
# A tibble: 3 × 5
statistic p.value method level test_var
<dbl> <chr> <chr> <chr> <chr>
1 0.955 5.5e-02 Shapiro-Wilk normality test setosa Petal.Length
2 0.966 1.6e-01 Shapiro-Wilk normality test versicolor Petal.Length
3 0.962 1.1e-01 Shapiro-Wilk normality test virginica Petal.Length
# A tibble: 3 × 5
statistic p.value method level test_var
<dbl> <chr> <chr> <chr> <chr>
1 0.800 8.7e-07 Shapiro-Wilk normality test setosa Petal.Width
2 0.948 2.7e-02 Shapiro-Wilk normality test versicolor Petal.Width
3 0.960 8.7e-02 Shapiro-Wilk normality test virginica Petal.Width
# A tibble: 4 × 6
statistic p.value df df.residual group_var test_var
<dbl> <chr> <int> <int> <chr> <chr>
1 6.35 2.3e-03 2 147 Species Sepal.Length
2 0.590 5.6e-01 2 147 Species Sepal.Width
3 19.5 3.1e-08 2 147 Species Petal.Length
4 19.9 2.3e-08 2 147 Species Petal.Width
For all traits except Sepal.Width, __the Equality of Variance assumption is strongly violated
Normality in contrast is more or less OK
, except for Petal.Width.
So we can use downstream tests which assume normality but not necessarily equal variance.
An ANOVA Test Which Can Deal With Unequal Variances
purrr::map(colnames(iris)[1:4], ## run oneway.test() for each trait
\(x)run_formula_test(iris, "Species",x,test_f=oneway.test)
)|>
bind_rows() |> ## concatenate the results by row
select(statistic,p.value,group_var,test_var) ## filter outputs to keep
# A tibble: 4 × 4
statistic p.value group_var test_var
<dbl> <chr> <chr> <chr>
1 139. 1.5e-28 Species Sepal.Length
2 45.0 1.4e-14 Species Sepal.Width
3 1828. 2.7e-66 Species Petal.Length
4 1277. 4.1e-64 Species Petal.Width
The Null Hypothesis of Equality of all Species Means is strongly rejected for all 4 traits
. However the Petal traits are clearly more species specific than the Sepal trait.
The recommended post-hoc test for ANOVA is actually Tukey's HSD (stats::TukeyHSD)
. However it assumes equal variances.
## Define a wrapper function around ggplot (slightly simpler):
require(dplyr)
require(ggplot2)
require(ggpubr)
ggbp <-
function(data, group_var, test_var, color, palette,
var.equal=FALSE,
paired = FALSE) {
cmb <-utils::combn(data |>
dplyr::pull(group_var) |>
levels(),
2,
# return a *list* of pairs!
simplify=FALSE
)
ggboxplot(data=data, x = group_var, y = test_var,
color = group_var, palette = palette,
order = levels(data |> pull(group_var)),
ylab = test_var, xlab = group_var) +
stat_compare_means(method = "t.test",
paired = paired,
method.args =
list(var.equal = var.equal),
comparisons = cmb
)
}
In plain text for copying …
Call:
lm(formula = Sepal.Length ~ Sepal.Width, data = iris)
Coefficients:
(Intercept) Sepal.Width
6.5262 -0.2234
Call:
lm(formula = Sepal.Length ~ Sepal.Width, data = iris)
Residuals:
Min 1Q Median 3Q Max
-1.5561 -0.6333 -0.1120 0.5579 2.2226
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.5262 0.4789 13.63 <2e-16 ***
Sepal.Width -0.2234 0.1551 -1.44 0.152
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.8251 on 148 degrees of freedom
Multiple R-squared: 0.01382, Adjusted R-squared: 0.007159
F-statistic: 2.074 on 1 and 148 DF, p-value: 0.1519
Call:
lm(formula = Sepal.Length ~ Sepal.Width + Species, data = iris)
Residuals:
Min 1Q Median 3Q Max
-1.30711 -0.25713 -0.05325 0.19542 1.41253
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.2514 0.3698 6.089 9.57e-09 ***
Sepal.Width 0.8036 0.1063 7.557 4.19e-12 ***
Speciesversicolor 1.4587 0.1121 13.012 < 2e-16 ***
Speciesvirginica 1.9468 0.1000 19.465 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.438 on 146 degrees of freedom
Multiple R-squared: 0.7259, Adjusted R-squared: 0.7203
F-statistic: 128.9 on 3 and 146 DF, p-value: < 2.2e-16
MANOVA = multivariate ANOVA = multi-dimensional observations
Null hypothesis of equal group means
Linear Discriminat Analysis (LDA)
helps to identify likely causal groups:
re-predict the data
)see which groups are most clearly distinguished
Df Pillai approx F num Df den Df Pr(>F)
independent_var 2 1.1919 53.466 8 290 < 2.2e-16 ***
Residuals 147
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
require(MASS)
iris_lda <-
lda(independent_var ~ dependent_vars,
CV = FALSE) # no leave-one-out cross-validation
iris_lda
Call:
lda(independent_var ~ dependent_vars, CV = FALSE)
Prior probabilities of groups:
setosa versicolor virginica
0.3333333 0.3333333 0.3333333
Group means:
dependent_varsSepal.Length dependent_varsSepal.Width
setosa 5.006 3.428
versicolor 5.936 2.770
virginica 6.588 2.974
dependent_varsPetal.Length dependent_varsPetal.Width
setosa 1.462 0.246
versicolor 4.260 1.326
virginica 5.552 2.026
Coefficients of linear discriminants:
LD1 LD2
dependent_varsSepal.Length 0.8293776 0.02410215
dependent_varsSepal.Width 1.5344731 2.16452123
dependent_varsPetal.Length -2.2012117 -0.93192121
dependent_varsPetal.Width -2.8104603 2.83918785
Proportion of trace:
LD1 LD2
0.9912 0.0088
Setosa is the most cleanly re-predicted species and is likely most contributing to the rejection of the Null hypothesis.
"Fisher's rule"
: Project the multidimensional data onto an axis, group means distance is maximized
, and the group scatter
in the direction of the axis is minimized
(optimal signal-to-noise ratio)
Fisher's initial method could only distinguish two groups
. a biological constraint on the group means allows him to separate all 3 Iris species
.hypothesis that versicolor = sertosa x virginica
,the fact that it leads to a good classifier suggests
that the genome dosis has a global effect on all traits (independent gene action
)