R Intermediate Course 2025

Dr. Debasish Mukherjee, Dr. Ulrike Goebel, Dr. Ali Abdallah

Bioinformatics Core Facility CECAD

2025-09-19

Slides & Code

git repo

R-Intermediate

Clone repo

git clone https://github.com/CECADBioinformaticsCoreFacility/Intermediate_R_Course_2025.git

Slides Directly

https://cecadbioinformaticscorefacility.github.io/Intermediate_R_Course_2025/

Session 2 :: Data Wrangling

Recap :: Data Reshaping : Base R Tools

reshape()
stack() / unstack()
cbind() / rbind() vs. merge()
split() / unsplit()
table() / cut()

`Tidyverse`

The tidyverse is an ecosystem of packages that work together to make data science in R easier and more efficient.

Some of the most popular packages in the tidyverse include:

readr for data import

tidyr for data tidying
tibble for enhanced data frames

dplyr for data manipulation
stringr for string manipulation
forcats for working with categorical variables (factors)
lubridate for working with dates and times

ggplot2 for data visualization

purrr for functional programming

tidy data

Tidy data ¹ is a standardized way of organizing data values within a dataset. In tidy data:

Each variable forms a column.
Each observation forms a row.
Each type of observational unit forms a table.

This structure makes it easier to manipulate, analyze, and visualize data.
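
As a rough illustration of these rules, the sketch below reshapes a small made-up table with tidyr's pivot_longer(). The data and the names wide, long, sample, geneA, geneB, gene and count are hypothetical and only serve the example.

```r
library(tidyr)

# Hypothetical "wide" table: one column per gene, so the variable
# "gene" is hidden in the column names
wide <- data.frame(
  sample = c("s1", "s2"),
  geneA  = c(10, 20),
  geneB  = c(5, 8)
)

# Tidy ("long") form: one row per sample/gene observation,
# one column per variable (sample, gene, count)
long <- pivot_longer(
  wide,
  cols      = c(geneA, geneB),
  names_to  = "gene",
  values_to = "count"
)

long
```

In the long form every row is one observation, which is the layout that dplyr verbs and ggplot2 expect.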

Pipes `|>` or `%>%`

The |> and %>% operators pass the output of one function as the first argument of the next function, which makes code more readable and concise. Put simply, they chain multiple operations together.
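
A minimal sketch of the idea, using only base R: the nested call and the piped call below compute the same value. The vector x and the names nested and piped are made up for this example. The native pipe |> needs R 4.1.0 or later; %>% comes from magrittr and is also available after library(dplyr).

```r
x <- c(1, 4, 9, 16)

# Nested calls: read from the inside out
nested <- round(mean(sqrt(x)), 1)

# Piped calls: read from left to right, one step at a time
piped <- x |>
  sqrt() |>     # square root of each value
  mean() |>     # average of the roots
  round(1)      # round to one decimal place

identical(nested, piped)
```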

Pipes `|>` or `%>%` …

Using the ChickWeight dataset, find the average weight of chicks per Diet on day 21, sorted in descending order of average weight.

Without the pipe operator

# Step 1: Filter data
day21 <- ChickWeight[ChickWeight$Time == 21, ]

# Step 2: Compute mean weight per Diet
avg_weights <- aggregate(weight ~ Diet, data = day21, FUN = mean)

# Step 3: Sort by descending weight
avg_weights <- avg_weights[order(-avg_weights$weight), ]

avg_weights

  Diet   weight
3    3 270.3000
4    4 238.5556
2    2 214.7000
1    1 177.7500

With the pipe operator

library(dplyr)

ChickWeight %>%
  filter(Time == 21) %>%                      # 1. Keep only records at day 21
  group_by(Diet) %>%                          # 2. Group by Diet
  summarise(avg_weight = mean(weight)) %>%    # 3. Calculate average weight
  arrange(desc(avg_weight))                   # 4. Sort descending

# A tibble: 4 × 2
  Diet  avg_weight
  <fct>      <dbl>
1 3           270.
2 4           239.
3 2           215.
4 1           178.

Data Reshaping with dplyr

library(dplyr)
iris |> 
   filter(Sepal.Length > 5) |>                          # filter rows
   select(Sepal.Length, Petal.Length, Species) |>       # select columns
   mutate(SePa.Length=Sepal.Length+Petal.Length) |>     # add new column
   arrange(desc(SePa.Length)) |>                        # sort rows
   group_by(Species) |>                                 # group by Species
   summarise(mean = mean(SePa.Length), n = n())         # summarise

# A tibble: 3 × 3
  Species     mean     n
  <fct>      <dbl> <int>
1 setosa      6.82    22
2 versicolor 10.3     47
3 virginica  12.2     49

Using `select()` to pick columns

select(Sepal.Length, Petal.Length, Species)
- select specific columns by name

select(1,2,3)
- select specific columns by position

select(1:3)
- select a range of columns by position

select(-c(4:5))
- exclude specific columns by position

select(Species, Sepal.Length, Petal.Length)
- reorder columns

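select() vs pull()
- select() returns a data frame with the chosen columns; pull() extracts a single column as a vector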
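
The variants above can be tried on iris as follows. This is a small sketch; the object names by_name, by_pos, dropped and petal_vec are made up for the example.

```r
library(dplyr)

by_name <- iris |> select(Sepal.Length, Petal.Length, Species)  # by name
by_pos  <- iris |> select(1:3)                                  # by position
dropped <- iris |> select(-c(4:5))                              # drop columns 4 and 5

# select() keeps a data frame, pull() returns a plain vector
petal_df  <- iris |> select(Petal.Length)
petal_vec <- iris |> pull(Petal.Length)

class(petal_df)
class(petal_vec)
```

Selecting by name is usually safer than selecting by position, because positions change silently when columns are added or reordered.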

Using `slice()` to pick rows

slice(1:10)
- first 10 rows

slice(10:20)
- rows 10 to 20

slice(n())
- last row

slice((n()-9):n())
- last 10 rows

slice(c(1,3,5,7))
- specific rows

Note

slice() selects rows by row number, not based on conditions.
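
A short sketch of the slice() calls above on iris. The object names are made up; slice_head() and slice_tail() are dplyr helpers that are not on the slide and are shown here as an alternative for the first and last rows.

```r
library(dplyr)

first10  <- iris |> slice(1:10)             # first 10 rows
last_row <- iris |> slice(n())              # last row
last10   <- iris |> slice((n() - 9):n())    # last 10 rows
picked   <- iris |> slice(c(1, 3, 5, 7))    # specific rows

# Helper functions for the same first/last selections
first10_alt <- iris |> slice_head(n = 10)
last10_alt  <- iris |> slice_tail(n = 10)
```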

Functional row selection with `filter()`

filter(Sepal.Length > 5)
- keep rows where a condition is TRUE
  - Comparison operators: >, >=, <, <=, ==, !=
  - Combine multiple conditions using logical operators: & (and), | (or), ! (not)

filter( Sepal.Length %in% c(5.0, 5.1))
- %in% : match values in a vector

filter( is.na(Sepal.Length) )
- is.na() : keep rows with missing values (use !is.na() to drop them)

filter( abs(Sepal.Length - 5) < 0.1 )
- abs() : match values within a tolerance (here, Sepal.Length close to 5)

filter( grepl("ver", Species))
- grepl() : pattern matching

Note

filter() selects rows based on conditions.
& and | are not the same as && and ||: filter() needs the vectorized & and |, while && and || only work on single TRUE/FALSE values.
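
The conditions above can be combined as in this sketch on iris. The object names are made up for the example.

```r
library(dplyr)

# Two conditions joined with & (vectorized "and")
long_setosa <- iris |> filter(Sepal.Length > 5 & Species == "setosa")

# Separating conditions with a comma is equivalent to &
long_setosa2 <- iris |> filter(Sepal.Length > 5, Species == "setosa")

# | keeps rows that satisfy at least one condition
short_or_virginica <- iris |> filter(Sepal.Length < 4.5 | Species == "virginica")

# %in% matches against a set of values
five_ish <- iris |> filter(Sepal.Length %in% c(5.0, 5.1))

# grepl() matches a pattern in the species names
versicolor <- iris |> filter(grepl("ver", Species))

# !is.na() drops rows with missing values (iris has none)
complete <- iris |> filter(!is.na(Sepal.Length))

# abs() keeps values near 5; floating-point rounding can affect
# values that sit exactly on the boundary
near_five <- iris |> filter(abs(Sepal.Length - 5) < 0.1)
```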

`mutate()` & `transmute()` to add or transform columns

mutate(SePa.Length=Sepal.Length+Petal.Length)
- add new column

mutate(SePa.Length=ifelse(Sepal.Length > 5, "Long", "Short"))
- conditional column

transmute(SePa.Length=Sepal.Length+Petal.Length)
- create new column and drop others

transmute(SePa.Length=ifelse(Sepal.Length > 5, "Long", "Short"))
- conditional column and drop others
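
A small sketch comparing mutate() and transmute() on iris. The object names and the column name Size are made up for this example; the slide reuses SePa.Length for the conditional column.

```r
library(dplyr)

# mutate() keeps all existing columns and appends the new one
with_sum <- iris |>
  mutate(SePa.Length = Sepal.Length + Petal.Length)

# Conditional column: ifelse() is evaluated row by row
labelled <- iris |>
  mutate(Size = ifelse(Sepal.Length > 5, "Long", "Short"))

# transmute() keeps only the columns it creates
only_sum <- iris |>
  transmute(SePa.Length = Sepal.Length + Petal.Length)

ncol(with_sum)   # original columns plus the new one
ncol(only_sum)   # only the new column
```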

arrange (sorting)

The arrange() function sorts the rows of a data frame by one or more columns.

arrange(Sepal.Length)
- sort by Sepal.Length in ascending order

arrange(desc(Sepal.Length))
- sort by Sepal.Length in descending order

arrange(Species, Sepal.Length)
- sort by Species and then by Sepal.Length
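
The sorting variants above, sketched on iris. The object names are made up, and the last call mixes ascending and descending order as one possible combination.

```r
library(dplyr)

ascending  <- iris |> arrange(Sepal.Length)             # smallest first
descending <- iris |> arrange(desc(Sepal.Length))       # largest first
by_species <- iris |> arrange(Species, Sepal.Length)    # Species, then Sepal.Length

# Ascending Species, descending Sepal.Length within each species
mixed <- iris |> arrange(Species, desc(Sepal.Length))

head(mixed, 3)
```

Factors such as Species are sorted by their level order, not alphabetically, although for iris the two happen to agree.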

group_by + summarise (aggregation)

The group_by() function groups rows by one or more variables. The summarise() function then collapses each group into a single row of summary statistics.

group_by(Species) %>% summarise(mean = mean(Petal.Length), n = n())
- group by Species and calculate mean and count

group_by(Species, Sepal.Length) %>% summarise(mean = mean(Petal.Length), n = n())
- group by Species and Sepal.Length and calculate mean and count
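
A sketch of both aggregations on iris. The object names are made up, and .groups = "drop" is an addition that is not on the slide: with more than one grouping variable, summarise() otherwise keeps the result grouped by the remaining variable(s) and prints a message about it.

```r
library(dplyr)

# One grouping variable: one output row per species
by_species <- iris |>
  group_by(Species) |>
  summarise(mean = mean(Petal.Length), n = n())

# Two grouping variables: one output row per Species/Sepal.Length pair;
# .groups = "drop" returns an ungrouped result
by_two <- iris |>
  group_by(Species, Sepal.Length) |>
  summarise(mean = mean(Petal.Length), n = n(), .groups = "drop")

by_species
```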

distinct (deduplication)

distinct()
- return unique rows

distinct(Index, .keep_all = TRUE)
- return unique rows based on Index column and keep all other columns

distinct(Index, .keep_all = FALSE)
- return unique rows based on Index column and drop other columns
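
Since iris has no Index column, the sketch below uses a small made-up data frame called records, with hypothetical columns Index and value, to show the three calls.

```r
library(dplyr)

# Hypothetical data with one fully duplicated row (Index 1)
# and one Index (2) that appears with two different values
records <- data.frame(
  Index = c(1, 1, 2, 2, 3),
  value = c("a", "a", "b", "c", "d")
)

unique_rows  <- records |> distinct()                          # drops exact duplicates
first_per_id <- records |> distinct(Index, .keep_all = TRUE)   # first row per Index, all columns
ids_only     <- records |> distinct(Index)                     # .keep_all = FALSE is the default

first_per_id
```

With .keep_all = TRUE, the first row found for each Index is kept, so the order of the input decides which duplicate survives.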