Imferential Statistics II
Bioinformatics Core Facility CECAD
2025-03-17
git clone https://github.com/CECADBioinformaticsCoreFacility/Intermediate_R_Course_2025.git
https://cecadbioinformaticscorefacility.github.io/Intermediate_R_Course_2025/
Session 6 :: Inferential Statistics II
Sepal.Length Sepal.Width Petal.Length Petal.Width
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
Median :5.800 Median :3.000 Median :4.350 Median :1.300
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
Species
setosa :50
versicolor:50
virginica :50
Do the species differ significantly in individual traits, and/or in all traits jointly?Outlook on Fisher's original approach to predict species membership from a linear combination of the traits
NOTE:
Null "hypothesis all means are equal"pairwise post-hoc tests to find the actual differences
MANOVA = ANOVA with >1 dependent variable post-hoc tests, as in ANOVALinear Discriminant Analysis (LDA) as a post-hoc test for MANOVA for a quick overview over strongly deviant groups (not a formal test)library(dplyr)
library(ggpubr)
run_onevar_test <-
function(x,
group_var,
test_var,
test_f,
ndigits = 1) {
purrr::map(x |> dplyr::pull(group_var) |> levels(),
\ (l) iris |>
dplyr::filter(.data[[group_var]]==l) |>
dplyr::pull(test_var) |>
test_f() |>
broom::tidy() |>
mutate(level = l) |>
mutate(test_var = test_var) |>
mutate(p.value = formatC(p.value,
format = "e",
digits = ndigits)
)
) |> bind_rows()
}
run_formula_test <-
function(x,group_var,test_var, ndigits=1, test_f) {
char_f <- paste(test_var,"~",group_var)
test_f(eval(as.formula(char_f)),
data = x) |>
broom::tidy() |>
mutate(group_var = group_var) |>
mutate(test_var = test_var) |>
mutate(p.value = formatC(p.value,
format = "e",
digits = ndigits)
)
}
run_levene <-
function(x,group_var,test_var, ndigits=1) {
char_f <- paste(test_var,"~",group_var)
car::leveneTest(y = eval(as.formula(char_f)),
data = x) |>
broom::tidy() |>
mutate(group_var = group_var) |>
mutate(test_var = test_var) |>
mutate(p.value = formatC(p.value,
format = "e",
digits = ndigits)
)
}# A tibble: 3 × 5
statistic p.value method level test_var
<dbl> <chr> <chr> <chr> <chr>
1 0.978 4.6e-01 Shapiro-Wilk normality test setosa Sepal.Length
2 0.978 4.6e-01 Shapiro-Wilk normality test versicolor Sepal.Length
3 0.971 2.6e-01 Shapiro-Wilk normality test virginica Sepal.Length
# A tibble: 3 × 5
statistic p.value method level test_var
<dbl> <chr> <chr> <chr> <chr>
1 0.972 2.7e-01 Shapiro-Wilk normality test setosa Sepal.Width
2 0.974 3.4e-01 Shapiro-Wilk normality test versicolor Sepal.Width
3 0.967 1.8e-01 Shapiro-Wilk normality test virginica Sepal.Width
# A tibble: 3 × 5
statistic p.value method level test_var
<dbl> <chr> <chr> <chr> <chr>
1 0.955 5.5e-02 Shapiro-Wilk normality test setosa Petal.Length
2 0.966 1.6e-01 Shapiro-Wilk normality test versicolor Petal.Length
3 0.962 1.1e-01 Shapiro-Wilk normality test virginica Petal.Length
# A tibble: 3 × 5
statistic p.value method level test_var
<dbl> <chr> <chr> <chr> <chr>
1 0.800 8.7e-07 Shapiro-Wilk normality test setosa Petal.Width
2 0.948 2.7e-02 Shapiro-Wilk normality test versicolor Petal.Width
3 0.960 8.7e-02 Shapiro-Wilk normality test virginica Petal.Width
# A tibble: 4 × 6
statistic p.value df df.residual group_var test_var
<dbl> <chr> <int> <int> <chr> <chr>
1 6.35 2.3e-03 2 147 Species Sepal.Length
2 0.590 5.6e-01 2 147 Species Sepal.Width
3 19.5 3.1e-08 2 147 Species Petal.Length
4 19.9 2.3e-08 2 147 Species Petal.Width
For all traits except Sepal.Width, __the Equality of Variance assumption is strongly violated
Normality in contrast is more or less OK, except for Petal.Width.
So we can use downstream tests which assume normality but not necessarily equal variance.
An ANOVA Test Which Can Deal With Unequal Variances
purrr::map(colnames(iris)[1:4], ## run oneway.test() for each trait
\(x)run_formula_test(iris, "Species",x,test_f=oneway.test)
)|>
bind_rows() |> ## concatenate the results by row
select(statistic,p.value,group_var,test_var) ## filter outputs to keep# A tibble: 4 × 4
statistic p.value group_var test_var
<dbl> <chr> <chr> <chr>
1 139. 1.5e-28 Species Sepal.Length
2 45.0 1.4e-14 Species Sepal.Width
3 1828. 2.7e-66 Species Petal.Length
4 1277. 4.1e-64 Species Petal.Width
The Null Hypothesis of Equality of all Species Means is strongly rejected for all 4 traits. However the Petal traits are clearly more species specific than the Sepal trait.
The recommended post-hoc test for ANOVA is actually Tukey's HSD (stats::TukeyHSD). However it assumes equal variances.
## Define a wrapper function around ggplot (slightly simpler):
require(dplyr)
require(ggplot2)
require(ggpubr)
ggbp <-
function(data, group_var, test_var, color, palette,
var.equal=FALSE,
paired = FALSE) {
cmb <-utils::combn(data |>
dplyr::pull(group_var) |>
levels(),
2,
# return a *list* of pairs!
simplify=FALSE
)
ggboxplot(data=data, x = group_var, y = test_var,
color = group_var, palette = palette,
order = levels(data |> pull(group_var)),
ylab = test_var, xlab = group_var) +
stat_compare_means(method = "t.test",
paired = paired,
method.args =
list(var.equal = var.equal),
comparisons = cmb
)
}
In plain text for copying …
Call:
lm(formula = Sepal.Length ~ Sepal.Width, data = iris)
Coefficients:
(Intercept) Sepal.Width
6.5262 -0.2234
Call:
lm(formula = Sepal.Length ~ Sepal.Width, data = iris)
Residuals:
Min 1Q Median 3Q Max
-1.5561 -0.6333 -0.1120 0.5579 2.2226
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.5262 0.4789 13.63 <2e-16 ***
Sepal.Width -0.2234 0.1551 -1.44 0.152
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.8251 on 148 degrees of freedom
Multiple R-squared: 0.01382, Adjusted R-squared: 0.007159
F-statistic: 2.074 on 1 and 148 DF, p-value: 0.1519
Call:
lm(formula = Sepal.Length ~ Sepal.Width + Species, data = iris)
Residuals:
Min 1Q Median 3Q Max
-1.30711 -0.25713 -0.05325 0.19542 1.41253
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.2514 0.3698 6.089 9.57e-09 ***
Sepal.Width 0.8036 0.1063 7.557 4.19e-12 ***
Speciesversicolor 1.4587 0.1121 13.012 < 2e-16 ***
Speciesvirginica 1.9468 0.1000 19.465 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.438 on 146 degrees of freedom
Multiple R-squared: 0.7259, Adjusted R-squared: 0.7203
F-statistic: 128.9 on 3 and 146 DF, p-value: < 2.2e-16
MANOVA = multivariate ANOVA = multi-dimensional observationsNull hypothesis of equal group meansLinear Discriminat Analysis (LDA) helps to identify likely causal groups:
re-predict the data)see which groups are most clearly distinguished
Df Pillai approx F num Df den Df Pr(>F)
independent_var 2 1.1919 53.466 8 290 < 2.2e-16 ***
Residuals 147
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
require(MASS)
iris_lda <-
lda(independent_var ~ dependent_vars,
CV = FALSE) # no leave-one-out cross-validation
iris_ldaCall:
lda(independent_var ~ dependent_vars, CV = FALSE)
Prior probabilities of groups:
setosa versicolor virginica
0.3333333 0.3333333 0.3333333
Group means:
dependent_varsSepal.Length dependent_varsSepal.Width
setosa 5.006 3.428
versicolor 5.936 2.770
virginica 6.588 2.974
dependent_varsPetal.Length dependent_varsPetal.Width
setosa 1.462 0.246
versicolor 4.260 1.326
virginica 5.552 2.026
Coefficients of linear discriminants:
LD1 LD2
dependent_varsSepal.Length 0.8293776 0.02410215
dependent_varsSepal.Width 1.5344731 2.16452123
dependent_varsPetal.Length -2.2012117 -0.93192121
dependent_varsPetal.Width -2.8104603 2.83918785
Proportion of trace:
LD1 LD2
0.9912 0.0088
Setosa is the most cleanly re-predicted species and is likely most contributing to the rejection of the Null hypothesis.
"Fisher's rule": Project the multidimensional data onto an axis, group means distance is maximized, and the group scatter in the direction of the axis is minimized (optimal signal-to-noise ratio)
Fisher's initial method could only distinguish two groups. a biological constraint on the group means allows him to separate all 3 Iris species.hypothesis that versicolor = sertosa x virginica,the fact that it leads to a good classifier suggests that the genome dosis has a global effect on all traits (independent gene action)