Bioinformatics Core Facility CECAD
2025-09-19
git clone https://github.com/CECADBioinformaticsCoreFacility/Intermediate_R_Course_2025.git
https://cecadbioinformaticscorefacility.github.io/Intermediate_R_Course_2025/
Session 2 :: Data Wrangling
reshape()
stack() / unstack()
cbind() / rbind() vs. merge()
split() / unsplit()
table() / cut()

Tidyverse
The tidyverse is an ecosystem of R packages that work together to make data science easier and more efficient.
Some of the most popular packages in the tidyverse include:
readr for data import
tidyr for data tidying
tibble for enhanced data frames
dplyr for data manipulation
stringr for string manipulation
forcats for working with categorical variables (factors)
lubridate for working with dates and times
ggplot2 for data visualization
purrr for functional programming

Tidy data is a standardized way of organizing data values within a dataset. In tidy data, each variable is a column, each observation is a row, and each value is a single cell.
This structure makes it easier to manipulate, analyze, and visualize data.
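As a rough illustration of the tidy layout described above, the sketch below reshapes a small made-up table with tidyr's pivot_longer(). The data frame `wide` and its columns `sample`, `day1` and `day2` are hypothetical and only serve the example.

```r
library(tidyr)

# Hypothetical untidy table: the variable "day" is hidden in the column names
wide <- data.frame(
  sample = c("A", "B"),
  day1   = c(10, 12),
  day2   = c(15, 18)
)

# Tidy version: one row per observation (sample x day), one column per variable
long <- pivot_longer(
  wide,
  cols      = c(day1, day2),
  names_to  = "day",
  values_to = "value"
)

long
```

In the long form, `day` is an ordinary column, so it can be filtered, grouped, or mapped to a plot axis like any other variable.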
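|> or %>%
The |> and %>% operators pass the result of the expression on their left as the first argument of the function on their right, which makes code more readable and concise. Put simply, they chain multiple operations together. |> is built into R 4.1 and later; %>% comes from the magrittr package and is available once dplyr is loaded.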
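A minimal sketch of the difference in readability, using only base R functions and an arbitrary numeric vector chosen for illustration:

```r
# Nested calls: read from the inside out
nested <- round(mean(sqrt(c(1, 4, 9, 16))), 1)

# Same computation with the native pipe (R >= 4.1): read top to bottom
piped <- c(1, 4, 9, 16) |>
  sqrt() |>
  mean() |>
  round(1)

identical(nested, piped)
```

Both forms compute the same value; the piped form lists the steps in the order they are executed.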
|> or %>% …
Find the average weight of chicks by Diet on day 21, sorted in descending order of average weight, using the ChickWeight dataset.
  Diet   weight
3    3 270.3000
4    4 238.5556
2    2 214.7000
1    1 177.7500

# A tibble: 4 × 2
  Diet  avg_weight
  <fct>      <dbl>
1 3           270.
2 4           239.
3 2           215.
4 1           178.
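The code that produced these two results is not shown here. One way they could be reproduced is sketched below, first with base R and then with a dplyr pipeline; the object names `day21`, `base_r` and `piped` are illustrative, and the original slides may have used different calls.

```r
library(dplyr)

# Base R: subset to day 21, average weight by Diet, then sort descending
day21  <- ChickWeight[ChickWeight$Time == 21, ]
base_r <- aggregate(weight ~ Diet, data = day21, FUN = mean)
base_r <- base_r[order(base_r$weight, decreasing = TRUE), ]

# dplyr: the same steps chained with the pipe
piped <- ChickWeight |>
  filter(Time == 21) |>
  group_by(Diet) |>
  summarise(avg_weight = mean(weight)) |>
  arrange(desc(avg_weight))

base_r
piped
```

The piped version needs no intermediate objects, and each line corresponds to one step of the task description.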
library(dplyr)
iris |>
  filter(Sepal.Length > 5) |>                            # filter rows
  select(Sepal.Length, Petal.Length, Species) |>         # select columns
  mutate(SePa.Length = Sepal.Length + Petal.Length) |>   # add new column
  arrange(desc(SePa.Length)) |>                          # sort rows
  group_by(Species) |>                                   # group by Species
  summarise(mean = mean(SePa.Length), n = n())           # summarise

# A tibble: 3 × 3
  Species     mean     n
  <fct>      <dbl> <int>
1 setosa      6.82    22
2 versicolor 10.3     47
3 virginica  12.2     49
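select() to pick columns
select(Sepal.Length, Petal.Length, Species)   # by name
select(1, 2, 3)                               # by position
select(1:3)                                   # by a range of positions
select(-c(4:5))                               # drop columns 4 and 5
select(Species, Sepal.Length, Petal.Length)   # by name, in a new column order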
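The select() fragments above are meant to be used inside a pipeline. A small runnable sketch on iris follows; the object names are illustrative, and the final two lines add a comparison with pull(), which extracts a single column as a vector.

```r
library(dplyr)

# By name: keeps Sepal.Length, Petal.Length and Species
by_name <- iris |> select(Sepal.Length, Petal.Length, Species)

# By position: columns 1 to 3 of iris
by_pos <- iris |> select(1:3)

# By exclusion: dropping columns 4 and 5 leaves the same three columns
by_drop <- iris |> select(-c(4:5))

names(by_name)
names(by_pos)
identical(by_pos, by_drop)

# select() returns a data frame even for one column; pull() returns a vector
class(iris |> select(Species))
class(iris |> pull(Species))
```

Note that selecting by position depends on the column order of the input, so selecting by name is usually the more robust choice.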
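select() vs pull()
select() returns a data frame containing the chosen columns; pull() extracts a single column as a vector.
slice() to pick rows
slice(1:10)          # rows 1 to 10
slice(10:20)         # rows 10 to 20
slice(n())           # last row
slice((n()-9):n())   # last 10 rows
slice(c(1,3,5,7))    # rows 1, 3, 5 and 7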
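The slice() calls above can be tried on iris as follows. This is a minimal sketch with illustrative object names; inside slice(), n() is the number of rows in the current data.

```r
library(dplyr)

first_ten <- iris |> slice(1:10)             # rows 1 to 10
last_row  <- iris |> slice(n())              # last row only
last_ten  <- iris |> slice((n() - 9):n())    # last 10 rows
odd_rows  <- iris |> slice(c(1, 3, 5, 7))    # rows 1, 3, 5 and 7

nrow(first_ten)
nrow(last_ten)
```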
Note
slice() selects rows by row number, not based on conditions.

filter()
filter(Sepal.Length > 5)
>, >=, <, <=, ==, != : comparison operators
&, | : logical operators for combining conditions
filter(Sepal.Length %in% c(5.0, 5.1))
%in% : match values in a vector
filter(is.na(Sepal.Length))
is.na() : keep rows where the value is missing
filter(abs(Sepal.Length - 5) < 0.1)
abs() : absolute value
filter(grepl("ver", Species))
grepl() : pattern matching
Note
filter() selects rows based on conditions.
& and | are not the same as && and ||: & and | work element-wise on whole vectors, whereas && and || evaluate a single TRUE/FALSE value.

mutate() & transmute() to add or transform columns
mutate(SePa.Length = Sepal.Length + Petal.Length)                    # add a column, keep the others
mutate(SePa.Length = ifelse(Sepal.Length > 5, "Long", "Short"))      # add a conditional column
transmute(SePa.Length = Sepal.Length + Petal.Length)                 # keep only the new column
transmute(SePa.Length = ifelse(Sepal.Length > 5, "Long", "Short"))   # keep only the new column
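The filter() conditions and the mutate() / transmute() calls above can be sketched on iris as follows. The object names are illustrative, and combining two conditions in one filter() call is an example added here rather than taken from the slides.

```r
library(dplyr)

# Combine two conditions with the vectorised & (not &&)
long_versi <- iris |>
  filter(Sepal.Length > 5 & grepl("ver", Species))

# Match against a set of values
five_ish <- iris |> filter(Sepal.Length %in% c(5.0, 5.1))

# mutate() keeps every existing column and appends the new one
with_sum <- iris |> mutate(SePa.Length = Sepal.Length + Petal.Length)

# transmute() keeps only the columns it creates
only_sum <- iris |> transmute(SePa.Length = Sepal.Length + Petal.Length)

unique(long_versi$Species)
names(with_sum)
names(only_sum)
```

Recent dplyr releases mark transmute() as superseded in favour of mutate() with its .keep argument, but transmute() continues to work.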
The arrange() function is used to sort data.
arrange(Sepal.Length)
arrange(desc(Sepal.Length))
arrange(Species, Sepal.Length)
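A short sketch of the three arrange() calls applied to iris, with illustrative object names:

```r
library(dplyr)

# Ascending by Sepal.Length (the default)
asc <- iris |> arrange(Sepal.Length)

# Descending by Sepal.Length
dsc <- iris |> arrange(desc(Sepal.Length))

# Sort by Species first, then by Sepal.Length within each species
by_two <- iris |> arrange(Species, Sepal.Length)

head(asc$Sepal.Length, 3)
head(dsc$Sepal.Length, 3)
head(by_two, 3)
```

With several sorting columns, later columns only break ties among rows that share the same values in the earlier ones.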
The group_by() function is used to group data by categorical variable(s). The summarise() function is used to summarize data.
group_by(Species) %>% summarise(mean = mean(Petal.Length), n = n())
group_by(Species, Sepal.Length) %>% summarise(mean = mean(Petal.Length), n = n())
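The two grouped summaries above can be run on iris as sketched below. The `.groups = "drop"` argument in the second call is an addition for this example: it returns an ungrouped result and avoids the message summarise() prints when there are several grouping variables.

```r
library(dplyr)

# One grouping variable: one output row per species
by_species <- iris |>
  group_by(Species) |>
  summarise(mean = mean(Petal.Length), n = n())

# Two grouping variables: one output row per observed combination
by_both <- iris |>
  group_by(Species, Sepal.Length) |>
  summarise(mean = mean(Petal.Length), n = n(), .groups = "drop")

by_species
head(by_both)
```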
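The distinct() function is used to remove duplicate rows.
distinct()                           # drop fully duplicated rows
distinct(Index, .keep_all = TRUE)    # first row for each value of Index, keeping all columns
distinct(Index, .keep_all = FALSE)   # unique values of Index only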
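Since iris has no Index column, the sketch below uses a small made-up data frame to show the three distinct() variants. The data frame `df` and its `value` column are hypothetical.

```r
library(dplyr)

# Hypothetical data with repeated Index values and one fully duplicated row
df <- data.frame(
  Index = c(1, 1, 2, 2, 3),
  value = c("a", "a", "b", "c", "d")
)

all_cols   <- distinct(df)                             # drops the duplicated (1, "a") row
keep_all   <- distinct(df, Index, .keep_all = TRUE)    # first row per Index, all columns
index_only <- distinct(df, Index, .keep_all = FALSE)   # unique Index values only

all_cols
keep_all
index_only
```

With .keep_all = TRUE, the row kept for each Index value is the first one encountered, so the result depends on the row order of the input.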