Day 1: Exercises for the Intermediate R Course 2025
Day 1 Practice
Session 1: Functions and Packages
Exercise 1 (Functions: simple)
Write a function which samples n random elements of the
built-in vector LETTERS and returns a concatenation of these elements.
(Hint: Read the help pages of sample and
paste. Mind the sep parameter!)
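One possible approach, as a minimal sketch: the function name random_letters is a hypothetical choice, not prescribed by the exercise. Note that paste() uses sep between separate arguments, while collapse joins the elements of a single vector, which is what is needed here.

```r
# Sample n random elements of LETTERS and join them into one string.
# random_letters is a hypothetical name chosen for illustration.
random_letters <- function(n) {
  picked <- sample(LETTERS, size = n)  # samples without replacement by default
  paste(picked, collapse = "")         # collapse joins the vector elements
}

set.seed(1)        # only to make the example reproducible
random_letters(5)  # a single string of 5 letters
```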
Exercise 2a (Functions: advanced)
Make a copy of your function. Modify the copy such that
- It now takes three parameters n1, n2, and num, where
  - n1 is the number of elements to sample from LETTERS, as above
  - num is a whole number (give it a default value, e.g. 100)
  - n2 is the number of elements to sample from the integers 1:num
The function should then
- sample n1 elements from LETTERS
- sample n2 elements from 1:num
- concatenate these vectors (use c())
- scramble the concatenated vector (read carefully what the help page of sample says about the default of its size parameter)
- return the pasted elements
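The steps listed above could be sketched as follows. The function name random_mix is hypothetical. The sketch relies on the documented behaviour that sample(x) with no size argument returns a random permutation of x.

```r
# Mix n1 random letters with n2 random integers from 1:num, scramble, and paste.
# random_mix is a hypothetical name chosen for illustration.
random_mix <- function(n1, n2, num = 100) {
  letters_part <- sample(LETTERS, size = n1)
  numbers_part <- sample(seq_len(num), size = n2)
  combined <- c(letters_part, numbers_part)  # integers are coerced to character
  scrambled <- sample(combined)              # size defaults to length(combined)
  paste(scrambled, collapse = "")
}

set.seed(42)          # only to make the example reproducible
random_mix(n1 = 4, n2 = 3)
```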
Exercise 2b (Functions: advanced)
The LETTERS vector has a fixed length (check it with
length()). Your parameter n1 must not exceed
this length (sample samples without replacement by
default).
Likewise, n2 must not exceed the chosen value of
num. Handle cases where a user requests illegal values of
n1 and n2 (either by
stop("ERROR MESSAGE") or by a premature
return() with an explanatory message and an empty string).
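A minimal sketch of the stop() variant is shown here; the function name random_mix_checked and the wording of the error messages are assumptions made for illustration. The alternative described above (a message followed by returning an empty string) would replace each stop() call with message(...) and return("").

```r
# Same idea as the mixing function, with input validation added.
# random_mix_checked is a hypothetical name.
random_mix_checked <- function(n1, n2, num = 100) {
  if (n1 > length(LETTERS)) {
    stop("n1 must not exceed length(LETTERS) = ", length(LETTERS))
  }
  if (n2 > num) {
    stop("n2 must not exceed num = ", num)
  }
  combined <- c(sample(LETTERS, size = n1), sample(seq_len(num), size = n2))
  paste(sample(combined), collapse = "")
}

# An illegal request raises an error; tryCatch captures its message here.
result <- tryCatch(
  random_mix_checked(n1 = 30, n2 = 2),
  error = function(e) conditionMessage(e)
)
result
```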
Exercise 3 (Source file)
Once your functions are working, put them into a .R file (commonly named “functions.R”). Inside another script, the functions can then be made available by
source("functions.R")
Note that while a .Rmd (rmarkdown) file allows for more expressive
documentation, it cannot be loaded with source() from within another R
script. Hence functions are better kept in a .R (R script) file.
Exercise 4 (Packages: installing from CRAN)
- Find package deepdep on CRAN and have a quick look at its vignettes
- In RStudio, run getOption("repos") to see what the default repository is. It should already be CRAN.
- From the “Packages” tab of RStudio you can simply install a package by entering its name in the search box and clicking install. Do this for deepdep.
- Watch what happens in the RStudio console when you do it! It mirrors the actual R commands it is executing under the hood.
Exercise 5 (Packages: Dependencies)
- Use browseVignettes("deepdep") to see an html version of the package’s vignettes. Browse the “deepdep package” vignette.
- How would you visualize the strong dependencies of the ggplot2 package, using deepdep?
- What are the strong dependencies of deepdep itself?
Exercise 6 (Packages: installing from Bioconductor)
- If you have never used Bioconductor, go to https://www.bioconductor.org/ and follow the “Get Started” link. Observe that Bioconductor has its own installer (the BiocManager). Actually doing a complete installation takes time, so you had better do this outside of the course.
- But you can follow the “Packages” link and browse their impressive list of biology-related software!
Session 2: Data Manipulation with dplyr
Exercise 1: Grouped summary with multiple stats
Find the mean, median, and standard deviation of
Sepal.Length and Petal.Length for each
Species, arranged by descending mean Sepal.Length.
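One way to approach this, sketched with the built-in iris data and assuming dplyr 1.0.0 or later (for across()). The object name iris_summary is hypothetical. With a named list of functions, across() produces columns named like Sepal.Length_mean.

```r
library(dplyr)

# Mean, median, and sd of two columns per species, sorted by mean Sepal.Length.
iris_summary <- iris %>%
  group_by(Species) %>%
  summarise(
    across(
      c(Sepal.Length, Petal.Length),
      list(mean = mean, median = median, sd = sd)
    ),
    .groups = "drop"
  ) %>%
  arrange(desc(Sepal.Length_mean))

iris_summary
```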
Exercise 2: Conditional column creation
Create a new column Sepal.Size that labels flowers as:
- “Small” if Sepal.Length < species median,
- “Large” otherwise (species-specific).
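A possible sketch using a grouped mutate(), so that median() is evaluated within each species rather than over the whole data set. The object name iris_sized is hypothetical.

```r
library(dplyr)

# Inside group_by(), median(Sepal.Length) is the species-specific median.
iris_sized <- iris %>%
  group_by(Species) %>%
  mutate(
    Sepal.Size = if_else(Sepal.Length < median(Sepal.Length), "Small", "Large")
  ) %>%
  ungroup()

count(iris_sized, Species, Sepal.Size)
```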
Exercise 3: Complex filtering with multiple conditions
Return rows where:
- Petal.Length is above the overall 75th percentile, and
- Sepal.Width is below the species mean.
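The two conditions mix an overall statistic with a per-species one. A minimal sketch (object names q75 and iris_filtered are hypothetical) computes the overall percentile before grouping and the species mean inside the grouped filter():

```r
library(dplyr)

# Overall 75th percentile, computed on the ungrouped data.
q75 <- quantile(iris$Petal.Length, probs = 0.75)

# Inside group_by(), mean(Sepal.Width) is the species-specific mean.
iris_filtered <- iris %>%
  group_by(Species) %>%
  filter(Petal.Length > q75, Sepal.Width < mean(Sepal.Width)) %>%
  ungroup()

iris_filtered
```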
Session 3: Data Visualization with ggplot2
Exercise 1: Basic Scatter Plot
Create a scatter plot showing the relationship between
Sepal.Length (x-axis) and Sepal.Width
(y-axis), with points colored by Species. Add appropriate
labels and use theme_minimal().
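As a starting point, a scatter plot along these lines could look like the sketch below. The title and axis label texts, and the object name p_scatter, are assumptions; choose your own wording.

```r
library(ggplot2)

# Species is mapped to colour inside aes(), so ggplot2 builds a legend for it.
p_scatter <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point() +
  labs(
    title = "Sepal dimensions in the iris data",  # label texts are illustrative
    x = "Sepal length (cm)",
    y = "Sepal width (cm)",
    color = "Species"
  ) +
  theme_minimal()

p_scatter
```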
Exercise 2: Histogram with Faceting
Create a histogram of Petal.Length, filled by
Species, and use facet_wrap() to create
separate plots for each species. Use alpha = 0.7 for
transparency.
Exercise 3: Boxplot with Data Points
Create a boxplot comparing Sepal.Width across the three
Species. Add individual data points using
geom_jitter() and customize the colors manually using
scale_fill_manual().
Exercise 4: Base R vs ggplot2 Comparison
Create the same plot using both Base R and ggplot2: a scatter plot of
Petal.Length vs Petal.Width colored by
Species. Compare the code complexity and visual output.
Exercise 5: Advanced Multi-layer Plot
Create a complex plot with multiple layers:
1. Scatter plot of Sepal.Length vs Sepal.Width
2. Add trend lines for each species
3. Use faceting by species
4. Custom color palette
5. Professional labeling with title, subtitle, and caption
Exercise 6: Troubleshooting Common Mistakes
The code below contains several common ggplot2 mistakes. Identify and fix them:
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = "blue")) %>%
geom_point() %>%
labs(title = "My Plot")
theme_minimal()
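For comparison after you have tried it yourself, here is one possible corrected version. It assumes the intent was to draw all points in blue. It addresses three issues: ggplot2 layers are combined with + rather than the pipe, a constant colour belongs outside aes() (inside aes() the string "blue" is treated as a data value), and theme_minimal() needs a preceding + to be part of the plot.

```r
library(ggplot2)

# Layers joined with +; the constant colour is set in geom_point(), not in aes().
p_fixed <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(color = "blue") +
  labs(title = "My Plot") +
  theme_minimal()

p_fixed
```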
Exercise 7: Custom Themes and Colors
Create a density plot of Petal.Width by Species with:
1. Custom color palette using scale_fill_viridis_d()
2. Transparency (alpha = 0.7)
3. Custom theme modifications (remove grid, adjust text)
4. Proper axis labels and title
Day 2: Advanced Data Analysis
Session 1: Statistical Tests
Exercise 1: Chi-square test of independence
Create a categorical variable for Sepal.Length as
“Short” or “Long” based on the overall median. Test whether the
proportion of “Short” vs. “Long” flowers is independent of Species.
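A minimal sketch in base R, using the iris data. The exercise does not say which side of the median counts as “Short”; the sketch assumes values strictly below the median are “Short” and all others “Long”. Object names are hypothetical.

```r
# Assumption: strictly below the overall median = "Short", otherwise "Long".
length_cat <- ifelse(iris$Sepal.Length < median(iris$Sepal.Length), "Short", "Long")

# Contingency table of species against the length category.
length_table <- table(Species = iris$Species, Length = length_cat)
length_table

# Chi-square test of independence on the 3 x 2 table.
chisq_result <- chisq.test(length_table)
chisq_result
```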
Exercise 2: Fisher’s Exact test
Focus only on Setosa and Versicolor. Create a binary variable PetalCat as “Wide” or “Narrow” using the median Petal.Width of these two species. Test whether PetalCat is independent of Species.
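A possible sketch in base R. It assumes that values strictly above the median count as “Wide” and the rest as “Narrow”, which the exercise leaves open; in iris the species levels are spelled in lowercase (setosa, versicolor). Object names are hypothetical.

```r
# Keep two species and drop the unused factor level.
two_species <- droplevels(subset(iris, Species %in% c("setosa", "versicolor")))

# Assumption: strictly above the median of these two species = "Wide".
width_median <- median(two_species$Petal.Width)
two_species$PetalCat <- ifelse(two_species$Petal.Width > width_median, "Wide", "Narrow")

# 2 x 2 table and Fisher's exact test.
petal_table <- table(Species = two_species$Species, PetalCat = two_species$PetalCat)
petal_table
fisher_result <- fisher.test(petal_table)
fisher_result
```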
Session 2: Principal Component Analysis (PCA)
Exercise 1: Basic PCA Implementation
Load the mtcars dataset and perform PCA on the
continuous variables. Scale the data before performing PCA and examine
the summary of the PCA result. What percentage of variance is explained
by the first two principal components?
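A minimal sketch using prcomp() from base R. Which mtcars columns count as continuous is a judgment call; the selection below (mpg, disp, hp, drat, wt, qsec) is an assumption, and the object names are hypothetical.

```r
# Assumption: these six columns are treated as the continuous variables.
continuous_vars <- c("mpg", "disp", "hp", "drat", "wt", "qsec")

# scale. = TRUE standardises each variable to unit variance before the PCA.
pca_mtcars <- prcomp(mtcars[, continuous_vars], center = TRUE, scale. = TRUE)
summary(pca_mtcars)

# Proportion of variance per component, and the share covered by PC1 + PC2.
var_explained <- pca_mtcars$sdev^2 / sum(pca_mtcars$sdev^2)
pc12_percent <- 100 * sum(var_explained[1:2])
pc12_percent
```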
Exercise 2: PCA Visualization - Scores Plot
Create a scatter plot of the first two principal components using the
mtcars PCA results. Color the points by the number of
cylinders (cyl) and add proper labels. Include the
percentage of variance explained in the axis labels.
Exercise 3: PCA Loadings Analysis
Extract and visualize the loadings (variable contributions) for the
first two principal components from the mtcars PCA. Create
a loading plot showing which original variables contribute most to each
PC.
Exercise 4: Scree Plot and Variance Explanation
Create a scree plot to visualize the variance explained by each principal component. Determine how many components are needed to explain at least 80% of the total variance.
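The cumulative-variance step could be sketched as follows in base R. The sketch assumes a scaled PCA on the same six mtcars columns treated as continuous (an assumption); object names are hypothetical.

```r
# Assumption: PCA on six columns treated as continuous, scaled to unit variance.
continuous_vars <- c("mpg", "disp", "hp", "drat", "wt", "qsec")
pca <- prcomp(mtcars[, continuous_vars], scale. = TRUE)

var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)

# Scree plot: variance explained per component.
plot(var_explained, type = "b",
     xlab = "Principal component", ylab = "Proportion of variance explained",
     main = "Scree plot")

# Smallest number of components whose cumulative variance reaches 80%.
n_components_80 <- which(cum_var >= 0.80)[1]
n_components_80
```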
Exercise 5: Comprehensive PCA Analysis
Using the iris dataset, perform a complete PCA analysis
including: 1) PCA calculation, 2) biplot combining scores and loadings,
3) interpretation of which variables drive species separation.
Session 3: Hierarchical Clustering
Exercise 1: Basic Hierarchical Clustering
Perform hierarchical clustering on the mtcars dataset
using Euclidean distance and complete linkage. Create a dendrogram and
cut the tree to obtain 3 clusters. Which cars are grouped together?
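A minimal base R sketch of these steps. Scaling the variables before computing distances is an assumption added here (the exercise does not require it); without it, variables with large ranges such as disp and hp dominate the Euclidean distance. Object names are hypothetical.

```r
# Assumption: standardise the variables first so no single column dominates.
car_dist <- dist(scale(mtcars), method = "euclidean")

# Complete-linkage hierarchical clustering and its dendrogram.
car_tree <- hclust(car_dist, method = "complete")
plot(car_tree, cex = 0.7, main = "mtcars: complete linkage", xlab = "", sub = "")

# Cut the tree into 3 clusters and list the cars in each.
car_clusters <- cutree(car_tree, k = 3)
split(names(car_clusters), car_clusters)
```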
Exercise 2: Comparing Linkage Methods
Compare different linkage methods (complete, single, average,
ward.D2) on the iris dataset. Create dendrograms for each
method and observe how they differ in their clustering patterns.
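One way to set up the comparison in base R is sketched below. Scaling the four measurements and cutting every tree into 3 clusters for the cross-tabulation are assumptions made for illustration; object names are hypothetical.

```r
# Assumption: scaled measurements, Euclidean distance.
iris_dist <- dist(scale(iris[, 1:4]), method = "euclidean")

linkages <- c("complete", "single", "average", "ward.D2")
trees <- lapply(linkages, function(m) hclust(iris_dist, method = m))
names(trees) <- linkages

# Four dendrograms in a 2 x 2 grid.
old_par <- par(mfrow = c(2, 2))
for (m in linkages) {
  plot(trees[[m]], labels = FALSE, main = paste("Linkage:", m), xlab = "", sub = "")
}
par(old_par)

# Compare 3-cluster solutions against the known species.
lapply(trees, function(tree) table(cutree(tree, k = 3), iris$Species))
```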
Exercise 3: Heatmap with Hierarchical Clustering
Create a heatmap of the mtcars dataset with hierarchical
clustering applied to both rows (cars) and columns (variables). Include
a color annotation for the number of cylinders.
Exercise 4: Optimal Number of Clusters
Using the iris dataset, determine the optimal number of
clusters using multiple methods: elbow method, silhouette analysis, and
gap statistic. Compare the results.
Exercise 5: Advanced Clustering with Custom Distance
Create a custom analysis combining PCA and hierarchical clustering.
First reduce the mtcars data to 2 principal components,
then perform hierarchical clustering on the PC scores. Visualize the
results.
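The two-stage analysis could be sketched like this in base R. Using all mtcars columns, complete linkage, and 3 clusters are assumptions chosen for illustration; object names are hypothetical.

```r
# Stage 1: scaled PCA, keeping the scores on the first two components.
pca <- prcomp(mtcars, scale. = TRUE)
scores <- pca$x[, 1:2]

# Stage 2: hierarchical clustering on the 2-dimensional PC scores.
score_tree <- hclust(dist(scores), method = "complete")  # linkage is an assumption
score_clusters <- cutree(score_tree, k = 3)              # k = 3 is an assumption

# Visualise the clusters in PC space, labelled with the car names.
plot(scores, col = score_clusters, pch = 19,
     xlab = "PC1", ylab = "PC2", main = "Clusters on the first two PCs")
text(scores, labels = rownames(scores), pos = 3, cex = 0.6)
```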