Day 1: Exercises for the Intermediate R Course 2025

Day 1 Practice

Session 1: Functions and Packages

Exercise 1 (Functions: simple)

Write a function that samples n random elements of the built-in vector LETTERS and returns these elements concatenated into a single string. (Hint: Read the help pages of sample and paste. Mind the collapse parameter!)

f <-
  function(n=10) {
    ## YOUR CODE HERE
    
  }
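
One possible solution sketch (other approaches are equally valid):

f <-
  function(n = 10) {
    x <- sample(LETTERS, n)    # n random letters, without replacement
    paste(x, collapse = "")    # collapse the vector into one string
  }

f(5)   # returns a single random 5-letter string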

Exercise 2a (Functions: advanced)

Make a copy of your function. Modify the copy such that

  • It now takes three parameters n1, n2, and num, where
    • n1 is the number of elements to sample from LETTERS, as above
    • num is a whole number (give it a default value, e.g. 100)
    • n2 is the number of elements to sample from the integers 1:num

The function should then

  • sample n1 elements from LETTERS
  • sample n2 elements from 1:num
  • concatenate these vectors (use c())
  • scramble the concatenated vector (read carefully what the help page of sample says about the default of its size parameter)
  • return the pasted elements
f_advanced <-
  function(n1=5, n2=5, num=100) {
    ## YOUR CODE HERE
    
  }
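
A possible sketch; note that sample(x) without a size argument returns a random permutation of x, which does the scrambling:

f_advanced <-
  function(n1 = 5, n2 = 5, num = 100) {
    letters_part <- sample(LETTERS, n1)   # n1 random letters
    numbers_part <- sample(1:num, n2)     # n2 random integers from 1:num
    combined  <- c(letters_part, numbers_part)
    scrambled <- sample(combined)         # default size = length(x): a permutation
    paste(scrambled, collapse = "")
  }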

Exercise 2b (Functions: advanced)

The LETTERS vector has a fixed length (check it with length()). Your parameter n1 must not exceed this length (sample samples without replacement by default).

Likewise, n2 must not exceed the chosen value of num. Handle cases where a user requests illegal values of n1 or n2 (either with stop("ERROR MESSAGE") or with an early return() that prints an explanatory message and returns an empty string).

f_advanced_safe <-
  function(n1=5, n2=5, num=100) {
    ## YOUR CODE HERE
    
  }
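
One possible sketch that demonstrates both strategies (stop() for n1, an early return() for n2):

f_advanced_safe <-
  function(n1 = 5, n2 = 5, num = 100) {
    if (n1 > length(LETTERS)) {
      stop("n1 must not exceed ", length(LETTERS))
    }
    if (n2 > num) {
      message("n2 must not exceed num -- returning an empty string")
      return("")
    }
    scrambled <- sample(c(sample(LETTERS, n1), sample(1:num, n2)))
    paste(scrambled, collapse = "")
  }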

Exercise 3 (Source file)

Once your functions are working, put them into a .R file (commonly named “functions.R”). Inside another script, the functions can then be made available by

  • source("functions.R")

Note that while a .Rmd (R Markdown) file allows for more expressive documentation, it cannot be sourced from within another R script. Hence functions are better kept in a plain .R script file.
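
A minimal sketch of the calling script (the file name analysis.R is just an example; it assumes functions.R is in the current working directory):

## analysis.R
source("functions.R")                    # defines f(), f_advanced(), f_advanced_safe()
f_advanced_safe(n1 = 3, n2 = 4, num = 50)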

Exercise 4 (Packages: installing from CRAN)

  • Find the package deepdep on CRAN and have a quick look at its vignettes.
  • In RStudio, run getOption("repos") to see what the default repository is. It should already be CRAN.
  • From the “Packages” tab of RStudio you can install a package simply by entering its name in the search box and clicking Install. Do this for deepdep.
  • Watch what happens in the RStudio console when you do it! It shows the actual R command RStudio executes under the hood.
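
The command echoed in the console is essentially the one below; you can also run it yourself instead of using the Packages tab:

install.packages("deepdep")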

Exercise 5 (Packages: Dependencies)

  • Use browseVignettes("deepdep") to see an HTML version of the package’s vignettes. Browse the “deepdep package” vignette.
  • How would you visualize the strong dependencies of the ggplot2 package, using deepdep?
  • What are the strong dependencies of deepdep itself?
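
A possible approach, assuming the deepdep() and plot_dependencies() functions described in the package vignette (strong dependencies are the default dependency type):

library(deepdep)
dd <- deepdep("ggplot2", depth = 2)   # strong dependencies of ggplot2
plot_dependencies(dd)                 # dependency graph
deepdep("deepdep")                    # strong dependencies of deepdep itself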

Exercise 6 (Packages: installing from Bioconductor)

  • If you have never used Bioconductor, go to https://www.bioconductor.org/ and follow the “Get Started” link. Observe that Bioconductor has its own installer (the BiocManager package). Actually doing a complete installation takes time, so it is best to do this outside of the course.
  • You can, however, follow the “Packages” link and browse its impressive list of biology-related software!
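
For reference, the installation pattern looks like this (run it outside of the course; "limma" is just an example package name):

install.packages("BiocManager")
BiocManager::install("limma")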

Session 2: Data Manipulation with dplyr

Exercise 1: Grouped summary with multiple stats

Find the mean, median, and standard deviation of Sepal.Length and Petal.Length for each Species, arranged by descending mean Sepal.Length.
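
One possible sketch using across(); the summary columns get names such as Sepal.Length_mean by default:

library(dplyr)

iris %>%
  group_by(Species) %>%
  summarise(across(c(Sepal.Length, Petal.Length),
                   list(mean = mean, median = median, sd = sd))) %>%
  arrange(desc(Sepal.Length_mean))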

Exercise 2: Conditional column creation

Create a new column Sepal.Size that labels flowers as:

  • “Small” if Sepal.Length is below its species-specific median,

  • “Large” otherwise.
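
A possible sketch: grouping by Species before mutate() makes median() species-specific.

library(dplyr)

iris %>%
  group_by(Species) %>%
  mutate(Sepal.Size = if_else(Sepal.Length < median(Sepal.Length),
                              "Small", "Large")) %>%
  ungroup()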

Exercise 3: Complex filtering with multiple conditions

Return rows where:

  • Petal.Length is above the overall 75th percentile, and

  • Sepal.Width is below the species mean.
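
A possible sketch: the overall quantile is computed once up front, while the species mean is evaluated inside the grouped filter.

library(dplyr)

q75 <- quantile(iris$Petal.Length, 0.75)       # overall 75th percentile

iris %>%
  group_by(Species) %>%
  filter(Petal.Length > q75,
         Sepal.Width < mean(Sepal.Width)) %>%  # mean within each species
  ungroup()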

Session 3: Data Visualization with ggplot2

Exercise 1: Basic Scatter Plot

Create a scatter plot showing the relationship between Sepal.Length (x-axis) and Sepal.Width (y-axis), with points colored by Species. Add appropriate labels and use theme_minimal().
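
One possible sketch:

library(ggplot2)

ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point() +
  labs(x = "Sepal length (cm)", y = "Sepal width (cm)",
       title = "Sepal dimensions by species") +
  theme_minimal()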

Exercise 2: Histogram with Faceting

Create a histogram of Petal.Length, filled by Species, and use facet_wrap() to create separate plots for each species. Use alpha = 0.7 for transparency.
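
One possible sketch (the number of bins is an arbitrary choice):

library(ggplot2)

ggplot(iris, aes(x = Petal.Length, fill = Species)) +
  geom_histogram(alpha = 0.7, bins = 30) +
  facet_wrap(~ Species) +
  theme_minimal()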

Exercise 3: Boxplot with Data Points

Create a boxplot comparing Sepal.Width across the three Species. Add individual data points using geom_jitter() and customize the colors manually using scale_fill_manual().
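
One possible sketch; the hex colours are an arbitrary example palette:

library(ggplot2)

ggplot(iris, aes(x = Species, y = Sepal.Width, fill = Species)) +
  geom_boxplot(outlier.shape = NA) +       # hide outliers; jitter shows all points
  geom_jitter(width = 0.15, alpha = 0.5) +
  scale_fill_manual(values = c("setosa" = "#66c2a5",
                               "versicolor" = "#fc8d62",
                               "virginica" = "#8da0cb")) +
  theme_minimal()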

Exercise 4: Base R vs ggplot2 Comparison

Create the same plot using both Base R and ggplot2: a scatter plot of Petal.Length vs Petal.Width colored by Species. Compare the code complexity and visual output.
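
A possible sketch of the two versions side by side:

## Base R
plot(iris$Petal.Length, iris$Petal.Width,
     col = as.integer(iris$Species), pch = 19,
     xlab = "Petal length", ylab = "Petal width")
legend("topleft", legend = levels(iris$Species), col = 1:3, pch = 19)

## ggplot2
library(ggplot2)
ggplot(iris, aes(Petal.Length, Petal.Width, color = Species)) +
  geom_point()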

Exercise 5: Advanced Multi-layer Plot

Create a complex plot with multiple layers:

  1. Scatter plot of Sepal.Length vs Sepal.Width
  2. Add trend lines for each species
  3. Use faceting by species
  4. Custom color palette
  5. Professional labeling with title, subtitle, and caption
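
One possible sketch covering all five points (palette and wording are arbitrary choices):

library(ggplot2)

ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point(alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE) +      # per-species trend lines
  facet_wrap(~ Species) +
  scale_color_brewer(palette = "Dark2") +
  labs(title = "Sepal dimensions across iris species",
       subtitle = "Linear trend fitted per species",
       caption = "Data: Anderson's iris data set",
       x = "Sepal length (cm)", y = "Sepal width (cm)") +
  theme_minimal()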

Exercise 6: Troubleshooting Common Mistakes

The code below contains several common ggplot2 mistakes. Identify and fix them:

ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = "blue")) %>%
  geom_point() %>%
  labs(title = "My Plot")
  theme_minimal()
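
Try it yourself first. One possible corrected version (layers are combined with +, the constant colour moves outside aes(), and theme_minimal() is attached with +):

ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(color = "blue") +
  labs(title = "My Plot") +
  theme_minimal()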

Exercise 7: Custom Themes and Colors

Create a density plot of Petal.Width by Species with:

  1. Custom color palette using scale_fill_viridis_d()
  2. Transparency (alpha = 0.7)
  3. Custom theme modifications (remove grid, adjust text)
  4. Proper axis labels and title
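
One possible sketch:

library(ggplot2)

ggplot(iris, aes(x = Petal.Width, fill = Species)) +
  geom_density(alpha = 0.7) +
  scale_fill_viridis_d() +
  labs(title = "Petal width distribution by species",
       x = "Petal width (cm)", y = "Density") +
  theme_minimal() +
  theme(panel.grid = element_blank(),
        text = element_text(size = 12))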

Day 2: Advanced Data Analysis

Session 1: Statistical Tests

Exercise 1: Chi-square test of independence

Create a categorical variable for Sepal.Length as “Short” or “Long” based on the overall median. Test whether the proportion of “Short” vs. “Long” flowers is independent of Species.
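
A possible sketch using base R:

sepal_cat <- ifelse(iris$Sepal.Length > median(iris$Sepal.Length),
                    "Long", "Short")
tab <- table(sepal_cat, iris$Species)
chisq.test(tab)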

Exercise 2: Fisher’s Exact test

Focus only on Setosa and Versicolor. Create a binary variable PetalCat as “Wide” or “Narrow” using the median Petal.Width of these two species. Test whether PetalCat is independent of Species.
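
A possible sketch; droplevels() removes the unused virginica level so the contingency table is 2 x 2:

sub <- droplevels(subset(iris, Species %in% c("setosa", "versicolor")))
sub$PetalCat <- ifelse(sub$Petal.Width > median(sub$Petal.Width),
                       "Wide", "Narrow")
fisher.test(table(sub$PetalCat, sub$Species))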

Session 2: Principal Component Analysis (PCA)

Exercise 1: Basic PCA Implementation

Load the mtcars dataset and perform PCA on the continuous variables. Scale the data before performing PCA and examine the summary of the PCA result. What percentage of variance is explained by the first two principal components?
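
A possible sketch (one reasonable choice of continuous columns from mtcars):

num_vars <- mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")]
pca <- prcomp(num_vars, scale. = TRUE)   # scale. = TRUE standardises each variable
summary(pca)                             # check the "Cumulative Proportion" for PC1-PC2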

Exercise 2: PCA Visualization - Scores Plot

Create a scatter plot of the first two principal components using the mtcars PCA results. Color the points by the number of cylinders (cyl) and add proper labels. Include the percentage of variance explained in the axis labels.
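
A possible sketch, reusing the PCA from the previous exercise:

library(ggplot2)

pca <- prcomp(mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")], scale. = TRUE)
var_expl <- round(100 * summary(pca)$importance["Proportion of Variance", 1:2], 1)

scores <- data.frame(pca$x[, 1:2], cyl = factor(mtcars$cyl))
ggplot(scores, aes(PC1, PC2, color = cyl)) +
  geom_point(size = 3) +
  labs(x = paste0("PC1 (", var_expl[1], "% variance)"),
       y = paste0("PC2 (", var_expl[2], "% variance)")) +
  theme_minimal()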

Exercise 3: PCA Loadings Analysis

Extract and visualize the loadings (variable contributions) for the first two principal components from the mtcars PCA. Create a loading plot showing which original variables contribute most to each PC.
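
A possible sketch; the loadings live in the rotation matrix of the prcomp object:

library(ggplot2)
library(grid)   # for arrow() and unit()

pca <- prcomp(mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")], scale. = TRUE)
loadings <- data.frame(variable = rownames(pca$rotation),
                       pca$rotation[, 1:2])

ggplot(loadings) +
  geom_segment(aes(x = 0, y = 0, xend = PC1, yend = PC2),
               arrow = arrow(length = unit(0.2, "cm"))) +
  geom_text(aes(x = PC1, y = PC2, label = variable), vjust = -0.5) +
  labs(x = "PC1 loading", y = "PC2 loading") +
  theme_minimal()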

Exercise 4: Scree Plot and Variance Explanation

Create a scree plot to visualize the variance explained by each principal component. Determine how many components are needed to explain at least 80% of the total variance.
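
A possible sketch using base graphics:

pca <- prcomp(mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")], scale. = TRUE)
var_expl <- pca$sdev^2 / sum(pca$sdev^2)

plot(var_expl, type = "b", xlab = "Principal component",
     ylab = "Proportion of variance explained")

## how many components reach at least 80% cumulative variance?
which(cumsum(var_expl) >= 0.8)[1]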

Exercise 5: Comprehensive PCA Analysis

Using the iris dataset, perform a complete PCA analysis including:

  1. PCA calculation
  2. a biplot combining scores and loadings
  3. interpretation of which variables drive species separation
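
A possible starting sketch:

iris_pca <- prcomp(iris[, 1:4], scale. = TRUE)
summary(iris_pca)

## biplot: scores and loadings in one display
biplot(iris_pca, col = c("grey50", "red"))

## which variables load most strongly on the first two PCs?
round(iris_pca$rotation[, 1:2], 2)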

Session 3: Hierarchical Clustering

Exercise 1: Basic Hierarchical Clustering

Perform hierarchical clustering on the mtcars dataset using Euclidean distance and complete linkage. Create a dendrogram and cut the tree to obtain 3 clusters. Which cars are grouped together?
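
A possible sketch; the variables are scaled first because they are on very different scales:

d  <- dist(scale(mtcars), method = "euclidean")
hc <- hclust(d, method = "complete")

plot(hc, cex = 0.7, main = "mtcars, complete linkage")
clusters <- cutree(hc, k = 3)
split(names(clusters), clusters)   # which cars fall into which cluster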

Exercise 2: Comparing Linkage Methods

Compare different linkage methods (complete, single, average, ward.D2) on the iris dataset. Create dendrograms for each method and observe how they differ in their clustering patterns.
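
A possible sketch that draws the four dendrograms in one 2 x 2 panel:

d <- dist(scale(iris[, 1:4]))

op <- par(mfrow = c(2, 2))
for (m in c("complete", "single", "average", "ward.D2")) {
  plot(hclust(d, method = m), labels = FALSE, main = m, xlab = "", sub = "")
}
par(op)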

Exercise 3: Heatmap with Hierarchical Clustering

Create a heatmap of the mtcars dataset with hierarchical clustering applied to both rows (cars) and columns (variables). Include a color annotation for the number of cylinders.
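
A possible sketch, assuming the pheatmap package is installed:

library(pheatmap)

ann <- data.frame(cyl = factor(mtcars$cyl), row.names = rownames(mtcars))

pheatmap(mtcars,
         scale = "column",                      # z-scores per variable
         annotation_row = ann,                  # colour bar for cylinders
         clustering_distance_rows = "euclidean",
         clustering_distance_cols = "euclidean")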

Exercise 4: Optimal Number of Clusters

Using the iris dataset, determine the optimal number of clusters using multiple methods: elbow method, silhouette analysis, and gap statistic. Compare the results.
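
A possible sketch, assuming the factoextra package (fviz_nbclust() with hcut() for hierarchical clustering) is available:

library(factoextra)

x <- scale(iris[, 1:4])

fviz_nbclust(x, FUNcluster = hcut, method = "wss")          # elbow method
fviz_nbclust(x, FUNcluster = hcut, method = "silhouette")   # average silhouette
fviz_nbclust(x, FUNcluster = hcut, method = "gap_stat")     # gap statistic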

Exercise 5: Advanced Clustering with Custom Distance

Create a custom analysis combining PCA and hierarchical clustering. First reduce the mtcars data to 2 principal components, then perform hierarchical clustering on the PC scores. Visualize the results.
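
A possible sketch of the combined workflow:

pca    <- prcomp(mtcars, scale. = TRUE)
scores <- pca$x[, 1:2]                          # first two principal components

hc       <- hclust(dist(scores), method = "ward.D2")
clusters <- cutree(hc, k = 3)

plot(scores, col = clusters, pch = 19,
     xlab = "PC1", ylab = "PC2", main = "Clusters on the first two PCs")
text(scores, labels = rownames(mtcars), pos = 3, cex = 0.6)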