Bioinformatics Core Facility CECAD
2025-09-19
git clone https://github.com/CECADBioinformaticsCoreFacility/Intermediate_R_Course_2025.git
https://cecadbioinformaticscorefacility.github.io/Intermediate_R_Course_2025/
Session 2 :: Data Wrangling
reshape()
stack() / unstack()
cbind() / rbind() vs. merge()
split() / unsplit()
table() / cut()
Tidyverse
The tidyverse is an ecosystem of packages that work together to make data science easier and more efficient in R.
Some of the most popular packages in the tidyverse include:
readr for data import
tidyr for data tidying
tibble for enhanced data frames
dplyr for data manipulation
stringr for string manipulation
forcats for working with categorical variables (factors)
lubridate for working with dates and times
ggplot2 for data visualization
purrr for functional programming
Tidy data is a standardized way of organizing data values within a dataset. In tidy data:
Each variable is a column.
Each observation is a row.
Each value is a cell.
This structure makes it easier to manipulate, analyze, and visualize data.
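To make the idea concrete, here is a minimal sketch that reshapes a small wide table into tidy form with tidyr's pivot_longer(). The data frame and its column names (sample, day1, day2) are made up for illustration and do not come from the course material.

```r
library(tidyr)

# Hypothetical wide table: one column per measurement day (not tidy)
wide <- data.frame(
  sample = c("A", "B"),
  day1   = c(1.2, 2.3),
  day2   = c(1.5, 2.8)
)

# Tidy form: one row per observation, one column per variable
long <- pivot_longer(
  wide,
  cols      = c(day1, day2),
  names_to  = "day",
  values_to = "value"
)
long
```

In the long table, day becomes an ordinary variable, so it can be filtered, grouped, or mapped to a plot axis like any other column.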
|> or %>%
The |> or %>% operators allow you to pass the output of one function as the input to another function, making your code more readable and concise. Put simply, they chain multiple operations together.
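As a small illustration, the sketch below writes the same computation once as nested calls and once as a pipe chain. It uses the native pipe |>, which assumes R 4.1 or later; %>% from magrittr (loaded with dplyr) would read the same way here.

```r
# Nested calls are read inside-out
nested <- round(mean(sqrt(c(1, 4, 9, 16))), 1)

# The same steps with the native pipe are read left to right
piped <- c(1, 4, 9, 16) |>
  sqrt() |>
  mean() |>
  round(1)

identical(nested, piped)
```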
|> or %>% …
Find the average weight of chicks by Diet on day 21, sorted in descending order of average weight, using the ChickWeight dataset.
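The code for this exercise is not shown here, so the following is only one possible solution sketch, first in base R and then with dplyr. The object names (day21, base_avg, dplyr_avg) are assumptions chosen for illustration.

```r
library(dplyr)

# Base R: keep day 21, average weight per Diet, then sort descending
day21    <- ChickWeight[ChickWeight$Time == 21, ]
base_avg <- aggregate(weight ~ Diet, data = day21, FUN = mean)
base_avg <- base_avg[order(base_avg$weight, decreasing = TRUE), ]
base_avg

# dplyr: the same steps as one pipe chain
dplyr_avg <- ChickWeight |>
  filter(Time == 21) |>
  group_by(Diet) |>
  summarise(avg_weight = mean(weight)) |>
  arrange(desc(avg_weight))
dplyr_avg
```

The pipe version avoids the intermediate objects and reads in the same order as the task description.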
  Diet   weight
3    3 270.3000
4    4 238.5556
2    2 214.7000
1    1 177.7500
# A tibble: 4 × 2
  Diet  avg_weight
  <fct>      <dbl>
1 3           270.
2 4           239.
3 2           215.
4 1           178.
library(dplyr)
iris |>
  filter(Sepal.Length > 5) |>                            # filter rows
  select(Sepal.Length, Petal.Length, Species) |>         # select columns
  mutate(SePa.Length = Sepal.Length + Petal.Length) |>   # add new column
  arrange(desc(SePa.Length)) |>                          # sort rows
  group_by(Species) |>                                   # group by Species
  summarise(mean = mean(SePa.Length), n = n())           # summarise
# A tibble: 3 × 3
  Species     mean     n
  <fct>      <dbl> <int>
1 setosa      6.82    22
2 versicolor 10.3     47
3 virginica  12.2     49
select() to pick columns
select(Sepal.Length, Petal.Length, Species)
select(1,2,3)
select(1:3)
select(-c(4:5))
select(Species, Sepal.Length, Petal.Length)
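The variants above can be tried on iris as sketched below (object names are illustrative). Note that on iris the position-based forms select(1:3) and select(-c(4:5)) both return the first three columns, which is not the same set as the name-based selection in the first line.

```r
library(dplyr)

by_name   <- iris |> select(Sepal.Length, Petal.Length, Species)  # by column name
by_range  <- iris |> select(1:3)                                  # by position
by_drop   <- iris |> select(-c(4:5))                              # drop columns 4 and 5
reordered <- iris |> select(Species, Sepal.Length, Petal.Length)  # select() also reorders

names(by_name)
names(by_range)
names(by_drop)
names(reordered)
```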
select() vs pull()
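A short sketch of the difference, using iris: select() always returns a data frame, even for a single column, whereas pull() extracts the column itself as a vector. The object names are illustrative.

```r
library(dplyr)

sel <- iris |> select(Species)  # data frame with one column
pul <- iris |> pull(Species)    # the column itself (here a factor)

class(sel)
class(pul)
```

pull() is handy when the next step needs a plain vector, for example mean() or table().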
slice() to pick rows
slice(1:10)
slice(10:20)
slice(n())
slice((n()-9):n())
slice(c(1,3,5,7))
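A runnable sketch of these calls on iris (object names are illustrative). Inside slice(), n() is the number of rows, so n() alone picks the last row and (n() - 9):n() the last ten.

```r
library(dplyr)

first10  <- iris |> slice(1:10)             # rows 1 to 10
last_row <- iris |> slice(n())              # last row only
last10   <- iris |> slice((n() - 9):n())    # last 10 rows
picked   <- iris |> slice(c(1, 3, 5, 7))    # specific row numbers

nrow(first10)
nrow(last_row)
nrow(last10)
nrow(picked)
```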
Note
slice() selects rows by row number, not based on conditions.
filter()
filter(Sepal.Length > 5)
Comparison operators: >, >=, <, <=, ==, !=
Logical operators: &, |
filter(Sepal.Length %in% c(5.0, 5.1))
%in%: match values in a vector
filter(is.na(Sepal.Length))
is.na(): filter missing values
filter(abs(Sepal.Length - 5) < 0.1)
abs(): absolute values
filter(grepl("ver", Species))
grepl(): pattern matching
Note
filter() selects rows based on conditions.
& and | are not the same as && and ||.
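The sketch below combines conditions inside filter() and shows why the single-character operators are the ones to use there: & and | work element by element on whole columns, while && and || expect a single TRUE or FALSE on each side (in R 4.3.0 and later they raise an error for longer vectors). Object names are illustrative.

```r
library(dplyr)

# Both conditions must hold (&)
long_setosa <- iris |> filter(Sepal.Length > 5 & Species == "setosa")

# At least one condition must hold (|)
either <- iris |> filter(Sepal.Length > 7 | Petal.Length < 1.2)

# Pattern matching on a factor column
versicolor <- iris |> filter(grepl("ver", Species))

# & compares element by element and returns a vector
vectorised <- c(TRUE, FALSE) & c(TRUE, TRUE)
vectorised

# && works on single values and returns a single value
scalar <- TRUE && FALSE
scalar
```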
mutate() & transmute() to add or transform columns
mutate(SePa.Length=Sepal.Length+Petal.Length)
mutate(SePa.Length=ifelse(Sepal.Length > 5, "Long", "Short"))
transmute(SePa.Length=Sepal.Length+Petal.Length)
transmute(SePa.Length=ifelse(Sepal.Length > 5, "Long", "Short"))
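The difference between the two verbs shows up in which columns survive: mutate() keeps all existing columns and adds the new one, while transmute() keeps only the columns it creates. A minimal sketch on iris, with illustrative object names:

```r
library(dplyr)

mutated    <- iris |> mutate(SePa.Length = Sepal.Length + Petal.Length)
transmuted <- iris |> transmute(SePa.Length = Sepal.Length + Petal.Length)

names(mutated)     # the five original columns plus SePa.Length
names(transmuted)  # only SePa.Length
```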
The arrange() function is used to sort data.
arrange(Sepal.Length)
arrange(desc(Sepal.Length))
arrange(Species, Sepal.Length)
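Applied to iris, the three calls sort ascending, descending, and by Species first with Sepal.Length as tie-breaker. The object names in this sketch are illustrative.

```r
library(dplyr)

ascending  <- iris |> arrange(Sepal.Length)           # smallest first
descending <- iris |> arrange(desc(Sepal.Length))     # largest first
by_two     <- iris |> arrange(Species, Sepal.Length)  # Species, then Sepal.Length

head(ascending, 3)
head(descending, 3)
head(by_two, 3)
```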
The group_by() function is used to group data by categorical variable(s). The summarise() function is used to summarize data.
group_by(Species) %>% summarise(mean = mean(Petal.Length), n = n())
group_by(Species, Sepal.Length) %>% summarise(mean = mean(Petal.Length), n = n())
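The two fragments above can be run on iris as sketched below. The .groups = "drop" argument in the second call is an addition made here, not part of the original fragment; it returns an ungrouped result and avoids the grouping message that summarise() prints when several grouping variables are used.

```r
library(dplyr)

# One row per Species
by_species <- iris %>%
  group_by(Species) %>%
  summarise(mean = mean(Petal.Length), n = n())
by_species

# One row per Species / Sepal.Length combination
by_both <- iris %>%
  group_by(Species, Sepal.Length) %>%
  summarise(mean = mean(Petal.Length), n = n(), .groups = "drop")
head(by_both)
```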
distinct()
distinct(Index, .keep_all= TRUE)
distinct(Index, .keep_all= FALSE)
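The column Index above belongs to a dataset not shown here. As a stand-in, this sketch uses Species from iris to show what .keep_all changes: with TRUE the first row of each distinct value is kept with all of its columns, with FALSE only the column(s) named in distinct() are returned.

```r
library(dplyr)

# Species is used here in place of the Index column (an assumption for illustration)
keep_all <- iris |> distinct(Species, .keep_all = TRUE)
only_key <- iris |> distinct(Species, .keep_all = FALSE)

dim(keep_all)  # one row per species, all columns
dim(only_key)  # one row per species, Species column only
```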