Day 1: Exercises for the Intermediate R Course 2025
Day 1 Practice
Session 1: Functions and Packages
Exercise 1 (Functions: simple)
Write a function which samples n random elements of the built-in vector LETTERS and returns a concatenation of these elements.
(Hint: Read the help pages of sample and paste. Mind the difference between the sep and collapse parameters!)
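One possible shape for such a function is sketched below. The name sample_letters is hypothetical, and the sketch assumes that "concatenation" means a single string; in that case paste() needs collapse, because sep only separates different arguments while collapse joins the elements of one vector.

```r
# Hypothetical helper: sample n letters and join them into one string
sample_letters <- function(n) {
  picked <- sample(LETTERS, size = n)  # samples without replacement by default
  paste(picked, collapse = "")         # collapse joins the elements of one vector
}

set.seed(1)  # only to make the example reproducible
sample_letters(5)
```

Calling paste(picked, sep = "") instead would return the sampled vector unchanged, which is worth trying once to see the difference.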
Exercise 2a (Functions: advanced)
Make a copy of your function. Modify the copy such that
- It now takes three parameters n1, n2, and num, where
  - n1 is the number of elements to sample from LETTERS, as above
  - num is a whole number (give it a default value, e.g. 100)
  - n2 is the number of elements to sample from the integers 1:num
The function should then
- sample n1 elements from LETTERS
- sample n2 elements from 1:num
- concatenate these vectors (use c())
- scramble the concatenated vector (read carefully what the help page of sample says about the default of its size parameter)
- return the pasted elements
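The steps above could be sketched roughly as follows. The function name sample_mixed is an assumption, and the result is again assumed to be a single string.

```r
# Hypothetical helper following the steps listed above
sample_mixed <- function(n1, n2, num = 100) {
  letters_part <- sample(LETTERS, size = n1)
  numbers_part <- sample(seq_len(num), size = n2)
  combined <- c(letters_part, numbers_part)  # c() coerces the numbers to character
  shuffled <- sample(combined)               # size defaults to length(x): a full permutation
  paste(shuffled, collapse = "")
}

set.seed(1)  # only to make the example reproducible
sample_mixed(3, 2, num = 9)
```

Because size defaults to the length of the input, calling sample() with only the vector returns all of its elements in random order, which is exactly the scrambling step.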
Exercise 2b (Functions: advanced)
The LETTERS vector has a fixed length (check it with length()). Your parameter n1 must not exceed this length (sample samples without replacement by default).
Likewise, n2 must not exceed the chosen value of num. Handle cases where a user requests illegal values of n1 and n2 (either by stop("ERROR MESSAGE") or by a premature return() with an explanatory message and an empty string).
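A minimal sketch of the stop() variant is shown below. The function name and the wording of the messages are assumptions; the return() variant would replace each stop() with a message() call followed by return("").

```r
# Hypothetical helper with input checks before any sampling happens
sample_mixed_checked <- function(n1, n2, num = 100) {
  if (n1 > length(LETTERS)) {
    stop("n1 must not exceed ", length(LETTERS), " (the length of LETTERS)")
  }
  if (n2 > num) {
    stop("n2 must not exceed num (", num, ")")
  }
  combined <- c(sample(LETTERS, n1), sample(seq_len(num), n2))
  paste(sample(combined), collapse = "")
}

# tryCatch() captures the error message so the script keeps running
result <- tryCatch(
  sample_mixed_checked(30, 2),
  error = function(e) conditionMessage(e)
)
result
```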
Exercise 3 (Source file)
Once your functions are working, put them into a .R file (commonly named “functions.R”). Inside another script, the functions can then be made available by
source("functions.R")
Note that while a .Rmd (rmarkdown) file allows for more expressive documentation, it cannot be sourced from within another R script. Hence functions are better kept in a .R (R script) file.
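To try the mechanism without touching your project folder, the sketch below writes a one-line function file to a temporary directory and sources it. The function add_one and the temporary path are stand-ins chosen for illustration only.

```r
# Write a tiny function file to a temporary location (stand-in for functions.R)
fun_file <- file.path(tempdir(), "functions.R")
writeLines("add_one <- function(x) x + 1", fun_file)

# source() evaluates the file, so add_one() is available afterwards
source(fun_file)
add_one(41)
```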
Exercise 4 (Packages: installing from CRAN)
- Find package deepdep on CRAN and have a quick look at its vignettes
- In RStudio, run getOption("repos") to see what the default repository is. It should already be CRAN.
- From the “Packages” tab of RStudio you can simply install a package by entering its name in the search box and clicking install. Do this for deepdep.
- Watch what happens in the RStudio console when you do it! It mirrors the actual R commands it is executing under the hood.
Exercise 5 (Packages: Dependencies)
- Use browseVignettes("deepdep") to see an html version of the package’s vignettes. Browse the “deepdep package” vignette.
- How would you visualize the strong dependencies of the ggplot2 package, using deepdep?
- What are the strong dependencies of deepdep itself?
Exercise 6 (Packages: installing from Bioconductor)
- If you have never used Bioconductor, go to https://www.bioconductor.org/ and follow the “Get Started” link. Observe that Bioconductor has its own installer (the BiocManager). Actually doing a complete installation takes time, so it is better to do this outside of the course.
- But you can follow the “Packages” link and browse their impressive list of biology-related software!
Session 2: Data Manipulation with dplyr
Exercise 1: Grouped summary with multiple stats
Using the built-in iris dataset, find the mean, median, and standard deviation of Sepal.Length and Petal.Length for each Species, arranged by descending mean Sepal.Length.
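One way to approach this with dplyr is sketched below, assuming the built-in iris data and dplyr 1.0 or later (for across()). The object name iris_summary and the column naming pattern are choices made for this example.

```r
library(dplyr)

iris_summary <- iris %>%
  group_by(Species) %>%
  summarise(
    across(
      c(Sepal.Length, Petal.Length),
      list(mean = mean, median = median, sd = sd),
      .names = "{.col}_{.fn}"   # e.g. Sepal.Length_mean
    )
  ) %>%
  arrange(desc(Sepal.Length_mean))

iris_summary
```

Writing six separate expressions inside summarise() gives the same result; across() only saves repetition.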
Exercise 2: Conditional column creation
Create a new column Sepal.Size that labels flowers as:
- “Small” if Sepal.Length < species median,
- “Large” otherwise (species-specific).
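A possible sketch, assuming the iris data: grouping before mutate() makes median() refer to the median within each species rather than the overall median. The object name iris_sized is hypothetical.

```r
library(dplyr)

iris_sized <- iris %>%
  group_by(Species) %>%
  mutate(
    # median() is evaluated per group because of group_by()
    Sepal.Size = if_else(Sepal.Length < median(Sepal.Length), "Small", "Large")
  ) %>%
  ungroup()

count(iris_sized, Species, Sepal.Size)
```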
Exercise 3: Complex filtering with multiple conditions
Return rows where:
- Petal.Length is above the overall 75th percentile, and
- Sepal.Width is below the species mean.
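The two conditions mix an overall statistic with a per-species one. One way to keep them apart, sketched here for the iris data, is to compute the overall percentile before grouping. The names q75 and iris_filtered are assumptions.

```r
library(dplyr)

# Overall 75th percentile, computed on the ungrouped data
q75 <- quantile(iris$Petal.Length, 0.75)

iris_filtered <- iris %>%
  group_by(Species) %>%
  # mean(Sepal.Width) is evaluated within each species
  filter(Petal.Length > q75, Sepal.Width < mean(Sepal.Width)) %>%
  ungroup()

iris_filtered
```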
Session 3: Data Visualization with ggplot2
Exercise 1: Basic Scatter Plot
Create a scatter plot showing the relationship between Sepal.Length (x-axis) and Sepal.Width (y-axis), with points colored by Species. Add appropriate labels and use theme_minimal().
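A minimal sketch for the iris data follows. The title and axis label texts are placeholders chosen for this example. Note that ggplot2 layers are combined with +, not with the pipe.

```r
library(ggplot2)

p_scatter <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point() +
  labs(
    title = "Sepal dimensions by species",  # placeholder title
    x = "Sepal length (cm)",
    y = "Sepal width (cm)",
    color = "Species"
  ) +
  theme_minimal()

p_scatter
```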
Exercise 2: Histogram with Faceting
Create a histogram of Petal.Length, filled by Species, and use facet_wrap() to create separate plots for each species. Use alpha = 0.7 for transparency.
Exercise 3: Boxplot with Data Points
Create a boxplot comparing Sepal.Width across the three Species. Add individual data points using geom_jitter() and customize the colors manually using scale_fill_manual().
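One possible layering is sketched below for the iris data. The palette is hypothetical (any three valid colors work), and hiding the boxplot outliers is a design choice made here so that the jittered points do not draw them twice.

```r
library(ggplot2)

# Hypothetical palette, named by species level
species_cols <- c(setosa = "#1b9e77", versicolor = "#d95f02", virginica = "#7570b3")

p_box <- ggplot(iris, aes(x = Species, y = Sepal.Width, fill = Species)) +
  geom_boxplot(outlier.shape = NA, alpha = 0.6) +    # outliers hidden: jitter shows every point
  geom_jitter(width = 0.15, height = 0, size = 1) +  # jitter horizontally only
  scale_fill_manual(values = species_cols) +
  labs(x = "Species", y = "Sepal width (cm)") +
  theme_minimal()

p_box
```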
Exercise 4: Base R vs ggplot2 Comparison
Create the same plot using both Base R and ggplot2: a scatter plot of Petal.Length vs Petal.Width colored by Species. Compare the code complexity and visual output.
Exercise 5: Advanced Multi-layer Plot
Create a complex plot with multiple layers:
1. Scatter plot of Sepal.Length vs Sepal.Width
2. Add trend lines for each species
3. Use faceting by species
4. Custom color palette
5. Professional labeling with title, subtitle, and caption
Exercise 6: Troubleshooting Common Mistakes
The code below contains several common ggplot2 mistakes. Identify and fix them:
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = "blue")) %>%
geom_point() %>%
labs(title = "My Plot")
theme_minimal()
Exercise 7: Custom Themes and Colors
Create a density plot of Petal.Width by Species with:
1. Custom color palette using scale_fill_viridis_d()
2. Transparency (alpha = 0.7)
3. Custom theme modifications (remove grid, adjust text)
4. Proper axis labels and title
Day 2: Advanced Data Analysis
Session 1: Statistical Tests
Exercise 1: Chi-square test of independence
Create a categorical variable for Sepal.Length as “Short” or “Long” based on the overall median. Test whether the proportion of “Short” vs. “Long” flowers is independent of Species.
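A base R sketch for the iris data is given below. How to treat values exactly equal to the median is not specified in the exercise; counting them as “Short” is an assumption made here.

```r
# Values equal to the median are labelled "Short" (an assumption)
length_cat <- ifelse(iris$Sepal.Length > median(iris$Sepal.Length), "Long", "Short")

# Contingency table: species in rows, length category in columns
tab_chisq <- table(Species = iris$Species, Length = length_cat)
tab_chisq

res_chisq <- chisq.test(tab_chisq)
res_chisq
```

Looking at res_chisq$expected is a quick way to check that the expected counts are large enough for the chi-square approximation.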
Exercise 2: Fisher’s Exact test
Focus only on Setosa and Versicolor. Create a binary variable PetalCat as “Wide” or “Narrow” using the median Petal.Width of these two species. Test whether PetalCat is independent of Species.
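This could be sketched in base R as follows. Labelling values equal to the median as “Narrow” is an assumption, and droplevels() is used so that the unused virginica level does not appear as an empty row in the table.

```r
# Keep two species and drop the unused factor level
two_species <- droplevels(subset(iris, Species %in% c("setosa", "versicolor")))

# Median of Petal.Width across these two species only
cutoff <- median(two_species$Petal.Width)
two_species$PetalCat <- ifelse(two_species$Petal.Width > cutoff, "Wide", "Narrow")

tab_fisher <- table(Species = two_species$Species, PetalCat = two_species$PetalCat)
tab_fisher

res_fisher <- fisher.test(tab_fisher)
res_fisher
```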
Session 2: Principal Component Analysis (PCA)
Exercise 1: Basic PCA Implementation
Load the mtcars dataset and perform PCA on the continuous variables. Scale the data before performing PCA and examine the summary of the PCA result. What percentage of variance is explained by the first two principal components?
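A minimal sketch with prcomp() follows. Which mtcars columns count as continuous is a judgment call; the selection in cont_vars is an assumption made for this example, and other choices are defensible.

```r
# Assumed set of continuous variables (cyl, vs, am, gear, carb left out)
cont_vars <- c("mpg", "disp", "hp", "drat", "wt", "qsec")

# scale. = TRUE standardises each variable to unit variance before the PCA
pca_cars <- prcomp(mtcars[, cont_vars], scale. = TRUE)
summary(pca_cars)

# Proportion of variance per component, from the component standard deviations
var_explained <- pca_cars$sdev^2 / sum(pca_cars$sdev^2)
round(100 * sum(var_explained[1:2]), 1)  # percent explained by PC1 and PC2
```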
Exercise 2: PCA Visualization - Scores Plot
Create a scatter plot of the first two principal components using the mtcars PCA results. Color the points by the number of cylinders (cyl) and add proper labels. Include the percentage of variance explained in the axis labels.
Exercise 3: PCA Loadings Analysis
Extract and visualize the loadings (variable contributions) for the first two principal components from the mtcars PCA. Create a loading plot showing which original variables contribute most to each PC.
Exercise 4: Scree Plot and Variance Explanation
Create a scree plot to visualize the variance explained by each principal component. Determine how many components are needed to explain at least 80% of the total variance.
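A base R sketch is shown below. It assumes a PCA on all scaled mtcars columns; restricting the input to a subset of variables works the same way.

```r
pca_all <- prcomp(mtcars, scale. = TRUE)

# Proportion of variance per component and its running total
var_prop <- pca_all$sdev^2 / sum(pca_all$sdev^2)
cum_prop <- cumsum(var_prop)

# Scree plot: variance explained by each component
plot(var_prop, type = "b",
     xlab = "Principal component", ylab = "Proportion of variance explained",
     main = "Scree plot")

# Smallest number of components reaching at least 80% cumulative variance
n_needed <- which(cum_prop >= 0.8)[1]
n_needed
```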
Exercise 5: Comprehensive PCA Analysis
Using the iris dataset, perform a complete PCA analysis including: 1) PCA calculation, 2) biplot combining scores and loadings, 3) interpretation of which variables drive species separation.
Session 3: Hierarchical Clustering
Exercise 1: Basic Hierarchical Clustering
Perform hierarchical clustering on the mtcars dataset using Euclidean distance and complete linkage. Create a dendrogram and cut the tree to obtain 3 clusters. Which cars are grouped together?
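The workflow could be sketched with base R as follows. Scaling the variables first is an assumption not stated in the exercise; without it, variables with large ranges such as disp and hp dominate the Euclidean distance.

```r
# Standardise columns so no single variable dominates the distance (assumption)
cars_scaled <- scale(mtcars)

d_cars <- dist(cars_scaled, method = "euclidean")
hc_cars <- hclust(d_cars, method = "complete")

# Dendrogram with the 3-cluster cut highlighted
plot(hc_cars, main = "mtcars: complete linkage", xlab = "", sub = "")
rect.hclust(hc_cars, k = 3)

# Cluster membership per car, then the car names grouped by cluster
clusters_cars <- cutree(hc_cars, k = 3)
split(names(clusters_cars), clusters_cars)
```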
Exercise 2: Comparing Linkage Methods
Compare different linkage methods (complete, single, average, ward.D2) on the iris dataset. Create dendrograms for each method and observe how they differ in their clustering patterns.
Exercise 3: Heatmap with Hierarchical Clustering
Create a heatmap of the mtcars dataset with hierarchical clustering applied to both rows (cars) and columns (variables). Include a color annotation for the number of cylinders.
Exercise 4: Optimal Number of Clusters
Using the iris dataset, determine the optimal number of clusters using multiple methods: elbow method, silhouette analysis, and gap statistic. Compare the results.
Exercise 5: Advanced Clustering with Custom Distance
Create a custom analysis combining PCA and hierarchical clustering. First reduce the mtcars data to 2 principal components, then perform hierarchical clustering on the PC scores. Visualize the results.
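A compact base R sketch of this pipeline is given below. The number of clusters (3) and the linkage method (complete) are assumptions carried over from the earlier exercise, not requirements of this one.

```r
# Step 1: PCA on scaled data, keep the scores of the first two components
pca_mt <- prcomp(mtcars, scale. = TRUE)
scores <- pca_mt$x[, 1:2]

# Step 2: hierarchical clustering on the 2-dimensional scores
hc_scores <- hclust(dist(scores), method = "complete")
clusters_pc <- cutree(hc_scores, k = 3)  # k = 3 is an assumption

# Step 3: scores plot colored by cluster, with car names as labels
plot(scores, col = clusters_pc, pch = 19,
     xlab = "PC1", ylab = "PC2", main = "Clusters on PC scores")
text(scores, labels = rownames(scores), pos = 3, cex = 0.6)
```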