R Intermediate Course 2025

Dr. Debasish Mukherjee, Dr. Ulrike Goebel, Dr. Ali Abdallah

Bioinformatics Core Facility CECAD

2025-09-19

Slides & Code

git repo

R-Intermediate


Clone repo

git clone https://github.com/CECADBioinformaticsCoreFacility/Intermediate_R_Course_2025.git


Slides Directly

https://cecadbioinformaticscorefacility.github.io/Intermediate_R_Course_2025/

Session 2 :: Data Wrangling

Recap :: Data Reshaping : Base R Tools

  1. reshape()
  2. stack() / unstack()
  3. cbind() / rbind() vs. merge()
  4. split() / unsplit()
  5. table() / cut()
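
As a quick reminder of how some of these base R tools behave, here is a minimal sketch. The data frames df and labels are hypothetical toy data made up for illustration; they are not part of the course material.

```r
# Hypothetical toy data: two measurements for each of two groups
df <- data.frame(group = c("a", "a", "b", "b"),
                 value = c(1, 5, 7, 12))

# split(): one data frame per level of the grouping variable
parts <- split(df, df$group)

# cut() bins a numeric vector into intervals; table() counts each bin
bins   <- cut(df$value, breaks = c(0, 5, 10, 15))
counts <- table(bins)

# merge(): join two data frames on a shared key column
labels <- data.frame(group = c("a", "b"),
                     label = c("control", "treated"))
merged <- merge(df, labels, by = "group")
```

By default cut() builds right-closed intervals, so the value 5 falls into (0,5] and not into (5,10].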

Tidyverse

The tidyverse is an ecosystem of packages that work together to make data science in R easier and more efficient.

Some of the most popular packages in the tidyverse include:

  • readr for data import
  • tidyr for data tidying
  • tibble for enhanced data frames
  • dplyr for data manipulation
  • stringr for string manipulation
  • forcats for working with categorical variables (factors)
  • lubridate for working with dates and times
  • ggplot2 for data visualization
  • purrr for functional programming

tidy data

Tidy data is a standardized way of organizing data values within a dataset. In tidy data:

  • Each variable forms a column.
  • Each observation forms a row.
  • Each type of observational unit forms a table.

This structure makes it easier to manipulate, analyze, and visualize data.
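
The three rules can be illustrated with a small sketch. The table wide and its column names are hypothetical. The example assumes the tidyr package is installed and uses pivot_longer() to move from a wide layout to a tidy one.

```r
library(tidyr)

# Hypothetical wide table: one row per sample, one column per time point.
# The variable "day" is hidden in the column names, so this is not tidy.
wide <- data.frame(
  sample = c("s1", "s2"),
  day1   = c(10, 12),
  day2   = c(15, 18)
)

# Tidy (long) form: each variable is a column (sample, day, weight)
# and each sample-day observation is a row
long <- pivot_longer(wide,
                     cols      = c(day1, day2),
                     names_to  = "day",
                     values_to = "weight")
```

In the long form, grouping or plotting by day works directly, because day is now an ordinary column.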

Pipes |> or %>%

  • The |> and %>% operators pass the output of one function as the input to the next function, making your code more readable and concise. In short, they let you chain multiple operations together.
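
A minimal sketch of the idea: the same computation written as nested calls, with the native pipe, and with the magrittr pipe. The vector x is made up for illustration. The native pipe |> requires R 4.1 or later; %>% is available here because dplyr re-exports it from magrittr.

```r
library(dplyr)  # makes %>% available

x <- c(1, 4, 9, 16)

# Nested calls are read from the inside out
nested <- round(mean(sqrt(x)), 1)

# Piped calls are read from left to right, in the order they run
native   <- x |> sqrt() |> mean() |> round(1)     # native pipe
magrittr <- x %>% sqrt() %>% mean() %>% round(1)  # magrittr pipe
```

All three variables hold the same value; only the way the code reads differs.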

Pipes |> or %>%

Using the ChickWeight dataset, find the average weight of chicks per Diet on day 21, sorted in descending order of average weight.

# Step 1: Filter data
day21 <- ChickWeight[ChickWeight$Time == 21, ]

# Step 2: Compute mean weight per Diet
avg_weights <- aggregate(weight ~ Diet, data = day21, FUN = mean)

# Step 3: Sort by descending weight
avg_weights <- avg_weights[order(-avg_weights$weight), ]

avg_weights
  Diet   weight
3    3 270.3000
4    4 238.5556
2    2 214.7000
1    1 177.7500

The same task with dplyr and pipes:

library(dplyr)

ChickWeight %>%
  filter(Time == 21) %>%                      # 1. Keep only records at day 21
  group_by(Diet) %>%                          # 2. Group by Diet
  summarise(avg_weight = mean(weight)) %>%    # 3. Calculate average weight
  arrange(desc(avg_weight))                   # 4. Sort descending
# A tibble: 4 × 2
  Diet  avg_weight
  <fct>      <dbl>
1 3           270.
2 4           239.
3 2           215.
4 1           178.

Data Reshaping with dplyr

library(dplyr)
iris |> 
   filter(Sepal.Length > 5) |>                          # filter rows
   select(Sepal.Length, Petal.Length, Species) |>       # select columns
   mutate(SePa.Length=Sepal.Length+Petal.Length) |>     # add new column
   arrange(desc(SePa.Length)) |>                        # sort rows
   group_by(Species) |>                                 # group by Species
   summarise(mean = mean(SePa.Length), n = n())         # summarise
# A tibble: 3 × 3
  Species     mean     n
  <fct>      <dbl> <int>
1 setosa      6.82    22
2 versicolor 10.3     47
3 virginica  12.2     49

Using select() to pick columns

  • select(Sepal.Length, Petal.Length, Species)
    • select specific columns by name
  • select(1,2,3)
    • select specific columns by position
  • select(1:3)
    • select a range of columns by position
  • select(-c(4:5))
    • exclude specific columns by position
  • select(Species, Sepal.Length, Petal.Length)
    • reorder columns
  • select() vs pull()
    • select() always returns a data frame; pull() extracts a single column as a vector
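
The variants above can be tried on the built-in iris dataset. This is a minimal sketch; the variable names on the left-hand side are made up for illustration.

```r
library(dplyr)

by_name  <- iris |> select(Sepal.Length, Petal.Length, Species)  # by name
by_range <- iris |> select(1:3)                                  # columns 1 to 3
dropped  <- iris |> select(-c(4:5))                              # all but columns 4 and 5

# select() keeps the data frame structure, even for a single column
col_df  <- iris |> select(Species)

# pull() returns the column itself (a factor in this case)
col_vec <- iris |> pull(Species)
```

Selecting by name is usually safer than selecting by position, since positions change when columns are added or reordered.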

Using slice() to pick rows

  • slice(1:10)
    • first 10 rows
  • slice(10:20)
    • rows 10 to 20
  • slice(n())
    • last row
  • slice((n()-9):n())
    • last 10 rows
  • slice(c(1,3,5,7))
    • specific rows

Note

  • slice() selects rows by row number, not based on conditions.
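
A minimal sketch of the slice() calls above on the built-in iris dataset. The last line uses slice_tail(), a dplyr helper that is not mentioned on the slide, as an alternative way to take the last rows.

```r
library(dplyr)

first10 <- iris |> slice(1:10)            # first 10 rows
last    <- iris |> slice(n())             # last row; n() is the number of rows
last10  <- iris |> slice((n() - 9):n())   # last 10 rows

# Alternative not shown on the slide: slice_tail() takes the last n rows
last10_alt <- iris |> slice_tail(n = 10)
```

Note the parentheses in (n() - 9):n(). Without them, the : operator would be evaluated before the subtraction.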

Functional row selection with filter()

  • filter(Sepal.Length > 5)
    • keep rows that satisfy a condition
      • Comparison operators: >, >=, <, <=, ==, !=
      • Combine multiple conditions with the logical operators & (and) and | (or)
  • filter( Sepal.Length %in% c(5.0, 5.1))
    • %in% : keep rows whose value matches any element of a vector
  • filter( is.na(Sepal.Length) )
    • is.na() : keep rows where the value is missing; use !is.na() to drop them
  • filter( abs(Sepal.Length - 5) < 0.1 )
    • abs() : keep rows whose value lies within a tolerance (here 0.1) of a target
  • filter( grepl("ver", Species))
    • grepl() : keep rows whose value matches a pattern

Note

  • filter() selects rows based on conditions.
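  • & and | are not the same as && and ||: & and | compare element by element, whereas && and || work on a single logical value. Use & and | inside filter().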
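
A minimal sketch of some of these conditions on the built-in iris dataset. The variable names are made up for illustration.

```r
library(dplyr)

# Two conditions combined with & (element-wise "and")
long_setosa <- iris |> filter(Sepal.Length > 5 & Species == "setosa")

# Conditions separated by commas are combined with "and" as well
long_setosa2 <- iris |> filter(Sepal.Length > 5, Species == "setosa")

# %in% matches against a vector of allowed values
two_species <- iris |> filter(Species %in% c("setosa", "virginica"))

# grepl() keeps rows whose Species contains "ver" (only versicolor)
versicolor <- iris |> filter(grepl("ver", Species))

# !is.na() drops rows with a missing value (iris has none)
complete <- iris |> filter(!is.na(Sepal.Length))
```

Each condition is evaluated for every row at once, which is why the vectorized operators & and | are the ones to use here.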

mutate() & transmute() to add or transform columns

  • mutate(SePa.Length=Sepal.Length+Petal.Length)
    • add new column
  • mutate(SePa.Length=ifelse(Sepal.Length > 5, "Long", "Short"))
    • conditional column

  • transmute(SePa.Length=Sepal.Length+Petal.Length)
    • create new column and drop others
  • transmute(SePa.Length=ifelse(Sepal.Length > 5, "Long", "Short"))
    • conditional column and drop others
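
A minimal sketch contrasting the two functions on the built-in iris dataset. The column name Size is a hypothetical name chosen here so that the label column is not confused with the numeric SePa.Length column.

```r
library(dplyr)

# mutate() adds the new column and keeps the existing five
added <- iris |> mutate(SePa.Length = Sepal.Length + Petal.Length)

# transmute() keeps only the columns it creates
only_new <- iris |> transmute(SePa.Length = Sepal.Length + Petal.Length)

# Conditional column: ifelse() is evaluated for every row
labelled <- iris |> mutate(Size = ifelse(Sepal.Length > 5, "Long", "Short"))
```

Here added has six columns, while only_new has a single column.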

arrange (sorting)

The arrange() function is used to sort data.

  • arrange(Sepal.Length)
    • sort by Sepal.Length in ascending order
  • arrange(desc(Sepal.Length))
    • sort by Sepal.Length in descending order
  • arrange(Species, Sepal.Length)
    • sort by Species and then by Sepal.Length
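
A minimal sketch of the three sorting patterns on the built-in iris dataset. The last call mixes ascending and descending order, which is a small variation on the slide's examples.

```r
library(dplyr)

ascending  <- iris |> arrange(Sepal.Length)         # smallest first
descending <- iris |> arrange(desc(Sepal.Length))   # largest first

# Sort by Species first, then by descending Sepal.Length within each species
by_species <- iris |> arrange(Species, desc(Sepal.Length))
```

With several sort keys, each later key only breaks ties left by the keys before it.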

group_by + summarise (aggregation)

The group_by() function is used to group data by categorical variable(s). The summarise() function is used to summarize data.

  • group_by(Species) %>% summarise(mean = mean(Petal.Length), n = n())
    • group by Species and calculate mean and count
  • group_by(Species, Sepal.Length) %>% summarise(mean = mean(Petal.Length), n = n())
    • group by Species and Sepal.Length and calculate mean and count
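
A minimal sketch of both aggregations on the built-in iris dataset. The .groups = "drop" argument is an addition not shown on the slide; it returns an ungrouped result when grouping by more than one variable.

```r
library(dplyr)

# One row per species
by_species <- iris |>
  group_by(Species) |>
  summarise(mean = mean(Petal.Length), n = n())

# One row per combination of Species and Sepal.Length;
# .groups = "drop" removes the remaining grouping from the result
by_two <- iris |>
  group_by(Species, Sepal.Length) |>
  summarise(mean = mean(Petal.Length), n = n(), .groups = "drop")
```

Without .groups = "drop", the second result would still be grouped by Species, which can affect later steps in a pipeline.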

distinct (deduplication)

  • distinct()
    • return unique rows
  • distinct(Index, .keep_all = TRUE)
    • keep the first row for each unique value of the Index column, along with all other columns
  • distinct(Index, .keep_all = FALSE)
    • return only the unique values of the Index column and drop the other columns (the default)
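
A minimal sketch of the three calls. The data frame records, including its Index and value columns, is hypothetical toy data made up for illustration.

```r
library(dplyr)

# Hypothetical data with one fully duplicated row and repeated Index values
records <- data.frame(
  Index = c(1, 1, 2, 3, 3),
  value = c("a", "a", "b", "c", "d")
)

# Drop rows that are duplicated across all columns
unique_rows <- records |> distinct()

# Keep the first row per Index value, with all columns
first_per_index <- records |> distinct(Index, .keep_all = TRUE)

# Keep only the unique Index values
index_only <- records |> distinct(Index)
```

With .keep_all = TRUE, the row with Index 3 and value "d" is dropped, because only the first row for each Index value is kept.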