Day 1: Exercises for the Intermediate R Course 2025

Day 1 Practice

Session 1: Functions and Packages

Exercise 1 (Functions: simple)

Write a function that samples n random elements of the built-in vector LETTERS and returns these elements concatenated into a single string. (Hint: Read the help pages of sample and paste. Mind the collapse parameter!)

f <-
  function(n=10) {
    ## YOUR CODE HERE
    
  }
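
One possible solution sketch (other approaches are equally valid):

f <-
  function(n = 10) {
    x <- sample(LETTERS, n)    # n random letters, without replacement
    paste(x, collapse = "")    # collapse the vector into one string
  }

f(5)   # returns a single random 5-letter string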

Exercise 2a (Functions: advanced)

Make a copy of your function. Modify the copy such that

  • It now takes three parameters n1, n2, and num, where
    • n1 is the number of elements to sample from LETTERS, as above
    • num is a whole number (give it a default value, e.g. 100)
    • n2 is the number of elements to sample from the integers 1:num

The function should then

  • sample n1 elements from LETTERS
  • sample n2 elements from 1:num
  • concatenate these vectors (use c())
  • scramble the concatenated vector (read carefully what the help page of sample says about the default of its size parameter)
  • return the pasted elements
f_advanced <-
  function(n1=5, n2=5, num=100) {
    ## YOUR CODE HERE
    
  }
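
A possible sketch; note that sample(x) without a size argument returns a random permutation of x, which does the scrambling:

f_advanced <-
  function(n1 = 5, n2 = 5, num = 100) {
    letters_part <- sample(LETTERS, n1)   # n1 random letters
    numbers_part <- sample(1:num, n2)     # n2 random integers from 1:num
    combined  <- c(letters_part, numbers_part)
    scrambled <- sample(combined)         # default size = length(x): a permutation
    paste(scrambled, collapse = "")
  }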

Exercise 2b (Functions: advanced)

The LETTERS vector has a fixed length (check it with length()). Your parameter n1 must not exceed this length (sample samples without replacement by default).

Likewise, n2 must not exceed the chosen value of num. Handle cases where a user requests illegal values of n1 or n2 (either with stop("ERROR MESSAGE") or with an early return() that prints an explanatory message and returns an empty string).

f_advanced_safe <-
  function(n1=5, n2=5, num=100) {
    ## YOUR CODE HERE
    
  }
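
One possible sketch that demonstrates both strategies (stop() for n1, an early return() for n2):

f_advanced_safe <-
  function(n1 = 5, n2 = 5, num = 100) {
    if (n1 > length(LETTERS)) {
      stop("n1 must not exceed ", length(LETTERS))
    }
    if (n2 > num) {
      message("n2 must not exceed num -- returning an empty string")
      return("")
    }
    scrambled <- sample(c(sample(LETTERS, n1), sample(1:num, n2)))
    paste(scrambled, collapse = "")
  }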

Exercise 3 (Source file)

Once your functions are working, put them into a .R file (commonly named “functions.R”). Inside another script, the functions can then be made available by

  • source("functions.R")

Note that while a .Rmd (R Markdown) file allows for more expressive documentation, it cannot be sourced from within another R script. Hence functions are better kept in a plain .R script file.
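
A minimal sketch of the calling script (the file name analysis.R is just an example; it assumes functions.R is in the current working directory):

## analysis.R
source("functions.R")                    # defines f(), f_advanced(), f_advanced_safe()
f_advanced_safe(n1 = 3, n2 = 4, num = 50)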

Exercise 4 (Packages: installing from CRAN)

  • Find the package deepdep on CRAN and have a quick look at its vignettes.
  • In RStudio, run getOption("repos") to see what the default repository is. It should already be CRAN.
  • From the “Packages” tab of RStudio you can install a package simply by entering its name in the search box and clicking Install. Do this for deepdep.
  • Watch what happens in the RStudio console when you do it! It shows the actual R command RStudio executes under the hood.
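
The command echoed in the console is essentially the one below; you can also run it yourself instead of using the Packages tab:

install.packages("deepdep")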

Exercise 5 (Packages: Dependencies)

  • Use browseVignettes("deepdep") to see an HTML version of the package’s vignettes. Browse the “deepdep package” vignette.
  • How would you visualize the strong dependencies of the ggplot2 package, using deepdep?
  • What are the strong dependencies of deepdep itself?
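
A possible approach, assuming the deepdep() and plot_dependencies() functions described in the package vignette (strong dependencies are the default dependency type):

library(deepdep)
dd <- deepdep("ggplot2", depth = 2)   # strong dependencies of ggplot2
plot_dependencies(dd)                 # dependency graph
deepdep("deepdep")                    # strong dependencies of deepdep itself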

Exercise 6 (Packages: installing from Bioconductor)

  • If you have never used Bioconductor, go to https://www.bioconductor.org/ and follow the “Get Started” link. Observe that Bioconductor has its own installer (the BiocManager package). Actually doing a complete installation takes time, so it is best to do this outside of the course.
  • You can, however, follow the “Packages” link and browse its impressive list of biology-related software!
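
For reference, the installation pattern looks like this (run it outside of the course; "limma" is just an example package name):

install.packages("BiocManager")
BiocManager::install("limma")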

Session 2: Data Manipulation with dplyr

Exercise 1: Grouped summary with multiple stats

Find the mean, median, and standard deviation of Sepal.Length and Petal.Length for each Species, arranged by descending mean Sepal.Length.
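
One possible sketch using across(); the summary columns get names such as Sepal.Length_mean by default:

library(dplyr)

iris %>%
  group_by(Species) %>%
  summarise(across(c(Sepal.Length, Petal.Length),
                   list(mean = mean, median = median, sd = sd))) %>%
  arrange(desc(Sepal.Length_mean))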

Exercise 2: Conditional column creation

Create a new column Sepal.Size that labels flowers as:

  • “Small” if Sepal.Length is below its species-specific median,

  • “Large” otherwise.
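
A possible sketch: grouping by Species before mutate() makes median() species-specific.

library(dplyr)

iris %>%
  group_by(Species) %>%
  mutate(Sepal.Size = if_else(Sepal.Length < median(Sepal.Length),
                              "Small", "Large")) %>%
  ungroup()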

Exercise 3: Complex filtering with multiple conditions

Return rows where:

  • Petal.Length is above the overall 75th percentile, and

  • Sepal.Width is below the species mean.
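
A possible sketch: the overall quantile is computed once up front, while the species mean is evaluated inside the grouped filter.

library(dplyr)

q75 <- quantile(iris$Petal.Length, 0.75)       # overall 75th percentile

iris %>%
  group_by(Species) %>%
  filter(Petal.Length > q75,
         Sepal.Width < mean(Sepal.Width)) %>%  # mean within each species
  ungroup()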

Session 3: Data Visualization with ggplot2

Exercise 1: Basic Scatter Plot

Create a scatter plot showing the relationship between Sepal.Length (x-axis) and Sepal.Width (y-axis), with points colored by Species. Add appropriate labels and use theme_minimal().
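
One possible sketch:

library(ggplot2)

ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point() +
  labs(x = "Sepal length (cm)", y = "Sepal width (cm)",
       title = "Sepal dimensions by species") +
  theme_minimal()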

Exercise 2: Histogram with Faceting

Create a histogram of Petal.Length, filled by Species, and use facet_wrap() to create separate plots for each species. Use alpha = 0.7 for transparency.
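
One possible sketch (the number of bins is an arbitrary choice):

library(ggplot2)

ggplot(iris, aes(x = Petal.Length, fill = Species)) +
  geom_histogram(alpha = 0.7, bins = 30) +
  facet_wrap(~ Species) +
  theme_minimal()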

Exercise 3: Boxplot with Data Points

Create a boxplot comparing Sepal.Width across the three Species. Add individual data points using geom_jitter() and customize the colors manually using scale_fill_manual().
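
One possible sketch; the hex colours are an arbitrary example palette:

library(ggplot2)

ggplot(iris, aes(x = Species, y = Sepal.Width, fill = Species)) +
  geom_boxplot(outlier.shape = NA) +       # hide outliers; jitter shows all points
  geom_jitter(width = 0.15, alpha = 0.5) +
  scale_fill_manual(values = c("setosa" = "#66c2a5",
                               "versicolor" = "#fc8d62",
                               "virginica" = "#8da0cb")) +
  theme_minimal()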

Exercise 4: Base R vs ggplot2 Comparison

Create the same plot using both Base R and ggplot2: a scatter plot of Petal.Length vs Petal.Width colored by Species. Compare the code complexity and visual output.
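
A possible sketch of the two versions side by side:

## Base R
plot(iris$Petal.Length, iris$Petal.Width,
     col = as.integer(iris$Species), pch = 19,
     xlab = "Petal length", ylab = "Petal width")
legend("topleft", legend = levels(iris$Species), col = 1:3, pch = 19)

## ggplot2
library(ggplot2)
ggplot(iris, aes(Petal.Length, Petal.Width, color = Species)) +
  geom_point()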

Exercise 5: Advanced Multi-layer Plot

Create a complex plot with multiple layers:

  1. Scatter plot of Sepal.Length vs Sepal.Width
  2. Add trend lines for each species
  3. Use faceting by species
  4. Custom color palette
  5. Professional labeling with title, subtitle, and caption
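
One possible sketch covering all five points (palette and wording are arbitrary choices):

library(ggplot2)

ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point(alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE) +      # per-species trend lines
  facet_wrap(~ Species) +
  scale_color_brewer(palette = "Dark2") +
  labs(title = "Sepal dimensions across iris species",
       subtitle = "Linear trend fitted per species",
       caption = "Data: Anderson's iris data set",
       x = "Sepal length (cm)", y = "Sepal width (cm)") +
  theme_minimal()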

Exercise 6: Troubleshooting Common Mistakes

The code below contains several common ggplot2 mistakes. Identify and fix them:

ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = "blue")) %>%
  geom_point() %>%
  labs(title = "My Plot")
  theme_minimal()
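
Try it yourself first. One possible corrected version (layers are combined with +, the constant colour moves outside aes(), and theme_minimal() is attached with +):

ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(color = "blue") +
  labs(title = "My Plot") +
  theme_minimal()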

Exercise 7: Custom Themes and Colors

Create a density plot of Petal.Width by Species with:

  1. Custom color palette using scale_fill_viridis_d()
  2. Transparency (alpha = 0.7)
  3. Custom theme modifications (remove grid, adjust text)
  4. Proper axis labels and title
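
One possible sketch:

library(ggplot2)

ggplot(iris, aes(x = Petal.Width, fill = Species)) +
  geom_density(alpha = 0.7) +
  scale_fill_viridis_d() +
  labs(title = "Petal width distribution by species",
       x = "Petal width (cm)", y = "Density") +
  theme_minimal() +
  theme(panel.grid = element_blank(),
        text = element_text(size = 12))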

Day 2: Advanced Data Analysis

Session 1: Statistical Tests

Exercise 1: Chi-square test of independence

Create a categorical variable for Sepal.Length as “Short” or “Long” based on the overall median. Test whether the proportion of “Short” vs. “Long” flowers is independent of Species.
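
A possible sketch using base R:

sepal_cat <- ifelse(iris$Sepal.Length > median(iris$Sepal.Length),
                    "Long", "Short")
tab <- table(sepal_cat, iris$Species)
chisq.test(tab)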

Exercise 2: Fisher’s Exact test

Focus only on Setosa and Versicolor. Create a binary variable PetalCat as “Wide” or “Narrow” using the median Petal.Width of these two species. Test whether PetalCat is independent of Species.
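
A possible sketch; droplevels() removes the unused virginica level so the contingency table is 2 x 2:

sub <- droplevels(subset(iris, Species %in% c("setosa", "versicolor")))
sub$PetalCat <- ifelse(sub$Petal.Width > median(sub$Petal.Width),
                       "Wide", "Narrow")
fisher.test(table(sub$PetalCat, sub$Species))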

Session 2: Principal Component Analysis (PCA)

Exercise 1: Basic PCA Implementation

Load the mtcars dataset and perform PCA on the continuous variables. Scale the data before performing PCA and examine the summary of the PCA result. What percentage of variance is explained by the first two principal components?
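
A possible sketch (one reasonable choice of continuous columns from mtcars):

num_vars <- mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")]
pca <- prcomp(num_vars, scale. = TRUE)   # scale. = TRUE standardises each variable
summary(pca)                             # check the "Cumulative Proportion" for PC1-PC2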

Exercise 2: PCA Visualization - Scores Plot

Create a scatter plot of the first two principal components using the mtcars PCA results. Color the points by the number of cylinders (cyl) and add proper labels. Include the percentage of variance explained in the axis labels.
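
A possible sketch, reusing the PCA from the previous exercise:

library(ggplot2)

pca <- prcomp(mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")], scale. = TRUE)
var_expl <- round(100 * summary(pca)$importance["Proportion of Variance", 1:2], 1)

scores <- data.frame(pca$x[, 1:2], cyl = factor(mtcars$cyl))
ggplot(scores, aes(PC1, PC2, color = cyl)) +
  geom_point(size = 3) +
  labs(x = paste0("PC1 (", var_expl[1], "% variance)"),
       y = paste0("PC2 (", var_expl[2], "% variance)")) +
  theme_minimal()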

Exercise 3: PCA Loadings Analysis

Extract and visualize the loadings (variable contributions) for the first two principal components from the mtcars PCA. Create a loading plot showing which original variables contribute most to each PC.
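
A possible sketch; the loadings live in the rotation matrix of the prcomp object:

library(ggplot2)
library(grid)   # for arrow() and unit()

pca <- prcomp(mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")], scale. = TRUE)
loadings <- data.frame(variable = rownames(pca$rotation),
                       pca$rotation[, 1:2])

ggplot(loadings) +
  geom_segment(aes(x = 0, y = 0, xend = PC1, yend = PC2),
               arrow = arrow(length = unit(0.2, "cm"))) +
  geom_text(aes(x = PC1, y = PC2, label = variable), vjust = -0.5) +
  labs(x = "PC1 loading", y = "PC2 loading") +
  theme_minimal()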

Exercise 4: Scree Plot and Variance Explanation

Create a scree plot to visualize the variance explained by each principal component. Determine how many components are needed to explain at least 80% of the total variance.
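
A possible sketch using base graphics:

pca <- prcomp(mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")], scale. = TRUE)
var_expl <- pca$sdev^2 / sum(pca$sdev^2)

plot(var_expl, type = "b", xlab = "Principal component",
     ylab = "Proportion of variance explained")

## how many components reach at least 80% cumulative variance?
which(cumsum(var_expl) >= 0.8)[1]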

Exercise 5: Comprehensive PCA Analysis

Using the iris dataset, perform a complete PCA analysis including:

  1. PCA calculation
  2. a biplot combining scores and loadings
  3. interpretation of which variables drive species separation
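
A possible starting sketch:

iris_pca <- prcomp(iris[, 1:4], scale. = TRUE)
summary(iris_pca)

## biplot: scores and loadings in one display
biplot(iris_pca, col = c("grey50", "red"))

## which variables load most strongly on the first two PCs?
round(iris_pca$rotation[, 1:2], 2)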

Session 3: Hierarchical Clustering

Exercise 1: Basic Hierarchical Clustering

Perform hierarchical clustering on the mtcars dataset using Euclidean distance and complete linkage. Create a dendrogram and cut the tree to obtain 3 clusters. Which cars are grouped together?
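
A possible sketch; the variables are scaled first because they are on very different scales:

d  <- dist(scale(mtcars), method = "euclidean")
hc <- hclust(d, method = "complete")

plot(hc, cex = 0.7, main = "mtcars, complete linkage")
clusters <- cutree(hc, k = 3)
split(names(clusters), clusters)   # which cars fall into which cluster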

Exercise 2: Comparing Linkage Methods

Compare different linkage methods (complete, single, average, ward.D2) on the iris dataset. Create dendrograms for each method and observe how they differ in their clustering patterns.
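
A possible sketch that draws the four dendrograms in one 2 x 2 panel:

d <- dist(scale(iris[, 1:4]))

op <- par(mfrow = c(2, 2))
for (m in c("complete", "single", "average", "ward.D2")) {
  plot(hclust(d, method = m), labels = FALSE, main = m, xlab = "", sub = "")
}
par(op)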

Exercise 3: Heatmap with Hierarchical Clustering

Create a heatmap of the mtcars dataset with hierarchical clustering applied to both rows (cars) and columns (variables). Include a color annotation for the number of cylinders.
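
A possible sketch, assuming the pheatmap package is installed:

library(pheatmap)

ann <- data.frame(cyl = factor(mtcars$cyl), row.names = rownames(mtcars))

pheatmap(mtcars,
         scale = "column",                      # z-scores per variable
         annotation_row = ann,                  # colour bar for cylinders
         clustering_distance_rows = "euclidean",
         clustering_distance_cols = "euclidean")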

Exercise 4: Optimal Number of Clusters

Using the iris dataset, determine the optimal number of clusters using multiple methods: elbow method, silhouette analysis, and gap statistic. Compare the results.
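
A possible sketch, assuming the factoextra package (fviz_nbclust() with hcut() for hierarchical clustering) is available:

library(factoextra)

x <- scale(iris[, 1:4])

fviz_nbclust(x, FUNcluster = hcut, method = "wss")          # elbow method
fviz_nbclust(x, FUNcluster = hcut, method = "silhouette")   # average silhouette
fviz_nbclust(x, FUNcluster = hcut, method = "gap_stat")     # gap statistic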

Exercise 5: Advanced Clustering with Custom Distance

Create a custom analysis combining PCA and hierarchical clustering. First reduce the mtcars data to 2 principal components, then perform hierarchical clustering on the PC scores. Visualize the results.
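
A possible sketch of the combined workflow:

pca    <- prcomp(mtcars, scale. = TRUE)
scores <- pca$x[, 1:2]                          # first two principal components

hc       <- hclust(dist(scores), method = "ward.D2")
clusters <- cutree(hc, k = 3)

plot(scores, col = clusters, pch = 19,
     xlab = "PC1", ylab = "PC2", main = "Clusters on the first two PCs")
text(scores, labels = rownames(mtcars), pos = 3, cex = 0.6)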