R Beginners Course 2025

Introduction to R and Basic Programming Concepts

Dr. Debasish Mukherjee, Dr. Ulrike Goebel, Dr. Ali Abdallah

Bioinformatics Core Facility CECAD

2025-05-13

Welcome to the R Workshop Series!

Programming+Statistics+Biology
  • Learning R step-by-step
  • Focus: Molecular Biology and Data Analysis
  • Building reproducible research skills (at data level)

Workshop Goals

Five Goals

Why Learn R as a Biologist?

  • Handle biological data (e.g., gene expression, sequencing)
  • Access and use bioinformatics packages (Bioconductor!)
  • Automate and streamline analyses
  • Reproducibility and transparency
  • Create publication-ready graphs

Workshop Schedule

  • Beginner Workshop: May
  • Intermediate Workshop: September
  • Advanced Workshop: October

Practice between workshops is key!

Timeline

R Workshop Series 2025
  • May 14-15: Beginner Workshop (2 days)
  • Sep 18-19: Intermediate Workshop (2 days)
  • Nov 20-21: Advanced Workshop (2 days)

Learning Between Workshops

  • Learning R = Learning a language: Use it often!
  • Practice exercises, small projects
  • Apply R to your own data
  • Prepare questions for next workshop

AI generated

What You’ll Learn in This Workshop (revisited)

  • Basics of R and RStudio
  • Variables, data types, and structures
  • Data input/output and reshaping
  • Basic plotting with base R
  • Control flow and apply functions
  • Introduction to reproducible reports (R Markdown)

🛠 Focus: Base R only — no extra packages yet!

Why Start with Base R?

  • Learn R’s core concepts: data types, operations, programming basics
  • Build strong foundations for any R analysis
  • Understand how R handles data internally
  • Prepare for more advanced workflows later

In future workshops, we will explore modern tools like the tidyverse — but good base R skills come first!

Tools We’ll Use

  • R (The software)
  • RStudio (Integrated development environment)
  • R Markdown (Literate programming)
  • Base R functions and structures (The language)

Structure of Each Workshop

  • Lectures combined with live demonstrations
  • Hands-on exercises
  • Mini projects
  • Open Q&A sessions

Learning by doing!

Workshop Norms

  • No bad questions
  • Practice actively — typing > watching
  • Help/encourage each other
  • Respect different learning paces

The Journey Ahead

  • Beginner: Base R Foundations
  • Intermediate: Advanced Concepts, Tools and Applications
  • Advanced: Reproducible Research and Bioinformatics

Step-by-step towards good data analysis skills!

Let’s Get Started!

🎯 Setting up RStudio and exploring R basics…

Slides & Code

  • [f] Full screen
  • [o] Slide Overview
  • [c] Notes
  • [h] help

git repo

R-Basic


Clone repo

git clone https://github.com/CECADBioinformaticsCoreFacility/Beginners_R_Course_2025.git


Slides Directly

https://cecadbioinformaticscorefacility.github.io/Beginners_R_Course_2025/

Session 1 :: Background & R Basics

The Lifeline of R

  • R is an offspring of the S programming language
  • S was a pioneer in making exploratory data analysis easy
    • make existing algorithms available in user friendly functions
    • provide accessible function documentation
    • provide interactive graphics devices

The Lifeline of R

  • R was born as an implementation of S in 1993, because native S had gone commercial
  • Being free and encouraging contributions by the user community, R can easily "evolve" to adapt to new needs and trends
  • Data-driven science, including the genome projects, was the perfect “niche” which R could successfully claim for itself
  • Indeed the Bioconductor project was initiated by one of the founders of R

The Lifeline of R

  • The RStudio (now: Posit) company is gaining increasing influence on the evolution of the language
    • its Integrated Development Environment (IDE) is increasingly used by people doing data analysis with R
    • Posit is actively developing and promoting the tidyverse, which is both a special style and a code repository for the analysis of data tables
  • Currently there is quite some evolutionary pressure for change of the language!

The Iris Data of Edgar Anderson and Ronald A. Fisher

– Some Background Information on Our Practice Dataset

We will use the iris dataset of floral traits for practicing throughout the course:

data("iris") ## Load the data
head(iris, n=3)   ## glimpse at the first 3 lines
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa


It contains data on three species
with 50 observations each:  

table(iris$Species) ## species in dataset

    setosa versicolor  virginica 
        50         50         50 


The Iris Data of Edgar Anderson and Ronald A. Fisher

– Some Background Information on Our Practice Dataset

  • Data published by Ronald A. Fisher in 1936
  • Statistician and co-founder of population genetics
  • Plants collected by Edgar Anderson
    • I. setosa and I. versicolor in 1935,  
      in the same natural habitat
    • I. virginica likely in 1926, 
      at a different place
  • Field botanist with a focus on speciation mechanisms


The Iris Data of Edgar Anderson and Ronald A. Fisher

– Some Background Information on Our Practice Dataset

  • Both men were involved in the making of the Modern/Evolutionary Synthesis, with complementary central tenets: 

    • R. A. Fisher: Model evolutionary processes from the known facts of genetics
    • E. Anderson: Observe the real dynamics of (plant) populations, in order to understand the role of genetics in evolution


The Iris Data of Edgar Anderson and Ronald A. Fisher

– Some Background Information on Our Practice Dataset

  • Edgar Anderson suspected that I. versicolor may be an allopolyploid hybrid: 
    I. versicolor =  
    I. setosa (2n) x I. virginica (4n)  
    (confirmed by Lim et al. 2007)

  • This may have supported the establishment of the species, by preventing back-crossing to its parents. 

  • Fisher applied his Linear Discriminant Analysis technique to Anderson’s data, in order to test the hypothesis of additive gene action:
    if true, versicolor should be twice as similar to virginica as to setosa!


Interacting with R

  • RStudio
  • Jupyter Notebook
  • Visual Studio
  • Eclipse

# Load data
data("iris")

# display data sample
head(iris)

# get statistics
summary(iris)

# get plot
boxplot(Petal.Length ~ Species, data = iris, col=c(1:3))



# Running the script
Rscript boxplot.R

Literate programming

? [function]

help([function])

example([function])

demo([topic])

browseVignettes([package])

search()

data()
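
As a quick illustration of the lookup functions listed above, here is a minimal sketch. The topics boxplot and read.csv are only example choices, and apropos() is an additional base R helper that is not in the list above.

```r
# Open the documentation for a function (two equivalent forms)
?boxplot
help("boxplot")

# Find objects whose names match a pattern (apropos() is an extra helper)
matches <- apropos("^read\\.csv")
print(matches)

# List what is currently attached to the session (first three entries)
attached <- search()
print(head(attached, 3))
```

In RStudio the help pages open in the Help pane; in a plain terminal they are shown as text.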

Session 2 :: Basic Concepts in R

Variables

  • Variables are containers for storing data values.
  • R does not have a command for declaring a variable
  • A variable is created the moment you first assign a value to it.
  • The assignment operators <- and = can both be used to assign a value (<- is the more common convention)

# This is a comment

name <- "John"    # This is also a comment
age <- 40
name
[1] "John"

name <- "Tom"
print(name)
[1] "Tom"

Variable Names

A variable can have a short name (like x and y) or a more descriptive name (age, carname, total_volume). Rules for R variables are:

  • A variable name must start with a letter, or with a period (.) that is not followed by a digit, and can be a combination of letters, digits, periods (.) and underscores (_).
  • Reserved words (such as TRUE, NULL or if) cannot be used as variable names.
# Allowed variable names:
myvar    <- "John"
my_var   <- "John"
myVar    <- "John"
MYVAR    <- "John"
myvar2   <- "John"
myvar_2. <- "John"
.myvar   <- "John"  # a leading '.' is fine if no digit follows
# Not allowed variable names:
2myvar  <- "John"  # starts with a digit
.2myvar <- "John"  # starts with a '.' followed by a digit
my-var  <- "John"  # special character
my var  <- "John"  # space
_my_var <- "John"  # starts with a '_'
my_var% <- "John"  # special character
TRUE    <- "John"  # reserved word
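
One way to check a candidate name against these rules is base R's make.names(), which turns any string into a syntactically valid name; a string that comes back unchanged was already valid. A minimal sketch (the helper is_valid_name is hypothetical and defined here only for illustration):

```r
# Hypothetical helper: a name is valid if make.names() leaves it unchanged
is_valid_name <- function(x) make.names(x) == x

candidates <- c("myvar", "my_var", ".myvar", "myvar_2.",
                "2myvar", ".2myvar", "my-var", "my var", "_my_var", "TRUE")

# Named logical vector: TRUE = allowed, FALSE = not allowed
validity <- setNames(is_valid_name(candidates), candidates)
print(validity)
```

The first four candidates should come back as allowed, the remaining six as not allowed.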

Data Types

Data Type   Example        Verify                             print(x)   class(x)
Logical     TRUE / FALSE   x <- TRUE;     print(x); class(x)  TRUE       logical
Numeric     1.3, 5, 4.2    x <- 1.35;     print(x); class(x)  1.35       numeric
Integer     1L, 0L, 4L     x <- 35L;      print(x); class(x)  35         integer
Complex     2+3i           x <- 2+3i;     print(x); class(x)  2+3i       complex
Character   "Hello!"       x <- "Hello!"; print(x); class(x)  "Hello!"   character
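
Beyond class(), base R offers is.*() functions to test a type and as.*() functions to convert between types. A small sketch with made-up values:

```r
x <- 35L
class(x)                 # "integer"
is.numeric(x)            # TRUE: integers also count as numeric

y <- 1.35
is.integer(y)            # FALSE: a plain number is stored as a double

# Explicit conversion between types
z <- as.integer("42")    # character -> integer
class(z)                 # "integer"
as.character(TRUE)       # logical -> character, gives "TRUE"
```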

R Data Structure

Variables are assigned R objects, and the data type of the R object becomes the data type of the variable. There are many types of R objects. The most frequently used ones are vectors, lists, factors, matrices, arrays and data frames:

# Create a vector.
numbers <- c('100','200',"450","670")

# Print the vector.
print(numbers)
[1] "100" "200" "450" "670"


# Create a vector.
alphabets <- c('a','b',"c","d")

# Associate names with the vector
names(numbers) <- alphabets

# Print the named vector.
print(numbers)
    a     b     c     d 
"100" "200" "450" "670" 
# Create a list.
list1 <- list(c(2,5,3),21.3,sin)

# Print the list.
print(list1)
[[1]]
[1] 2 5 3

[[2]]
[1] 21.3

[[3]]
function (x)  .Primitive("sin")
# Create a vector.
apple_colors <- c('green','green','yellow','red','red','red','green')

# Create a factor object.
factor_apple <- factor(apple_colors)

# Print the factor.
print(factor_apple)
[1] green  green  yellow red    red    red    green 
Levels: green red yellow
# Create a matrix.
M = matrix( c('a','a','b','c','b','a'), 
            nrow = 2, 
            ncol = 3, 
            byrow = TRUE)

# Print the matrix
print(M)
     [,1] [,2] [,3]
[1,] "a"  "a"  "b" 
[2,] "c"  "b"  "a" 
# Create an array.
a <- array(c('green','yellow'),dim = c(1,3,2))

# Print the array
print(a)
, , 1

     [,1]    [,2]     [,3]   
[1,] "green" "yellow" "green"

, , 2

     [,1]     [,2]    [,3]    
[1,] "yellow" "green" "yellow"
# Create the data frame.
BMI <-  data.frame(
   gender = c("Male", "Male","Female"), 
   height = c(152, 171.5, 165), 
   weight = c(81,93, 78),
   Age = c(42,38,26)
)

# Print the data frame
print(BMI)
  gender height weight Age
1   Male  152.0     81  42
2   Male  171.5     93  38
3 Female  165.0     78  26
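
Once such objects exist, the next step is usually to pull single elements out of them. The following sketch rebuilds small versions of the objects above (values shortened for illustration) and shows the basic indexing forms:

```r
# Vector: access by position or by name
v <- c(a = 100, b = 200, c = 450)
v[2]          # element at position 2 (keeps its name)
v[["c"]]      # single value by name

# List: [[ ]] extracts one element, [ ] would return a shorter list
l <- list(c(2, 5, 3), 21.3, sin)
l[[1]]        # the numeric vector
l[[3]](0)     # call the stored function: sin(0)

# Matrix: [row, column]
M <- matrix(c("a", "a", "b", "c", "b", "a"), nrow = 2, byrow = TRUE)
M[2, 1]       # "c"

# Data frame: $ selects a column, [row, column] a single cell
BMI <- data.frame(gender = c("Male", "Male", "Female"),
                  height = c(152, 171.5, 165))
BMI$height
BMI[3, "gender"]
```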

Operators

Operators are symbols that tell R to perform specific mathematical or logical manipulations. R is rich in built-in operators and provides the following types of operators:

Arithmetic operators

Operator  Name                                Example
+         Addition                            x + y
-         Subtraction                         x - y
*         Multiplication                      x * y
/         Division                            x / y
^         Exponent                            x ^ y
%%        Modulus (remainder from division)   x %% y
%/%       Integer division                    x %/% y

Comparison operators

Operator  Name                       Example
==        Equal                      x == y
!=        Not equal                  x != y
>         Greater than               x > y
<         Less than                  x < y
>=        Greater than or equal to   x >= y
<=        Less than or equal to      x <= y

Logical operators

Operator  Description
&         Element-wise logical AND: returns TRUE where both elements are TRUE
&&        Logical AND for single values: returns TRUE if both operands are TRUE
|         Element-wise logical OR: returns TRUE where at least one of the elements is TRUE
||        Logical OR for single values: returns TRUE if at least one of the operands is TRUE
!         Logical NOT: returns FALSE if the operand is TRUE, and TRUE if it is FALSE

Assignment operators

my_var <- 3

my_var <<- 3

3 -> my_var

3 ->> my_var

assign("my_var", c(10.4, 5.6, 3.1, 6.4, 21.7))

my_var # print my_var
[1] 10.4  5.6  3.1  6.4 21.7
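
At the top level, <- and <<- behave the same; the difference only shows inside a function. A minimal sketch (the names counter, add_local and add_global are made up for this illustration):

```r
counter <- 0

# '<-' inside a function creates a local variable; the outer one is untouched
add_local <- function() {
  counter <- counter + 1
  counter
}

# '<<-' modifies the variable in the enclosing environment instead
add_global <- function() {
  counter <<- counter + 1
  invisible(counter)
}

add_local()   # returns 1, but 'counter' outside is still 0
add_global()
add_global()
counter       # now 2
```

Because <<- changes state outside the function, it is usually best used sparingly.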
Miscellaneous operators

Operator  Description                                    Example
:         Creates a sequence of consecutive numbers      x <- 1:10
%in%      Tests whether an element belongs to a vector   x %in% y
%*%       Matrix multiplication                          x <- Matrix1 %*% Matrix2
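
A few of the less familiar operators in action, as a small sketch with made-up values:

```r
x <- 17
y <- 5

x %/% y        # integer division: 3
x %% y         # remainder: 2
(x %/% y) * y + (x %% y) == x   # TRUE: quotient and remainder rebuild x

# %in% checks membership element by element
species <- c("setosa", "tulipa")
known <- species %in% c("setosa", "versicolor", "virginica")
known          # TRUE FALSE

# & and | work element-wise on vectors ...
a <- c(TRUE, TRUE, FALSE)
b <- c(TRUE, FALSE, FALSE)
a & b          # TRUE FALSE FALSE
a | b          # TRUE TRUE FALSE

# ... while && and || compare single values, e.g. in an if() condition
if (x > 10 && y < 10) message("both conditions hold")
```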

Session 3 :: Data I/O and Reshaping

Data IO (read & write files)

  • In this session, we’ll try to build an understanding of base R functions for data input/output (I/O) and data reshaping using the iris dataset.

  • Beyond simply running code, we’ll discuss why you might choose one function over another, highlighting their specific strengths and trade-offs, when it makes sense.

  • We’ll do this in an interactive way.

CSV formats

Base R provides a versatile suite of I/O functions. Some are highly configurable (e.g., read.table()), while others wrap common defaults for convenience (e.g., read.csv()).

write.csv(iris, "iris.csv", row.names = FALSE)
iris_csv <- read.csv("iris.csv")
head(iris_csv, n = 6)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
write.csv2(iris, "iris2.csv", row.names = FALSE)
iris_csv2 <- read.csv2("iris2.csv")
head(iris_csv2, n = 6)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
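
Both round trips print the same table, so the difference between the two variants is easy to miss. A small sketch that looks at the raw text of each file instead (temporary files and a two-row excerpt are used here only to keep the output short):

```r
small <- head(iris[, c("Sepal.Length", "Species")], 2)

file_csv  <- tempfile(fileext = ".csv")
file_csv2 <- tempfile(fileext = ".csv")

write.csv(small,  file_csv,  row.names = FALSE)
write.csv2(small, file_csv2, row.names = FALSE)

# Look at the raw text that was written
lines_csv  <- readLines(file_csv)
lines_csv2 <- readLines(file_csv2)
print(lines_csv)    # fields separated by ",", decimal point "."
print(lines_csv2)   # fields separated by ";", decimal comma ","

# Each reader restores the numbers when paired with its own writer
back  <- read.csv(file_csv)
back2 <- read.csv2(file_csv2)
all.equal(back$Sepal.Length, back2$Sepal.Length)
```

The csv2 variants follow the convention common in locales that use a decimal comma, so mixing a writer with the other reader will misparse the file.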

General Tabular I/O

When you need full control—different delimiters, quoting rules, or no headers— the generic read.table() and write.table() shine. They accept parameters like sep, quote, na.strings, and more.

write.table(iris, "iris_tab.tsv", sep = "\t", row.names = FALSE)
iris_tab <- read.table("iris_tab.tsv", header = TRUE, sep = "\t")
head(iris_tab, n = 10)
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1           5.1         3.5          1.4         0.2  setosa
2           4.9         3.0          1.4         0.2  setosa
3           4.7         3.2          1.3         0.2  setosa
4           4.6         3.1          1.5         0.2  setosa
5           5.0         3.6          1.4         0.2  setosa
6           5.4         3.9          1.7         0.4  setosa
7           4.6         3.4          1.4         0.3  setosa
8           5.0         3.4          1.5         0.2  setosa
9           4.4         2.9          1.4         0.2  setosa
10          4.9         3.1          1.5         0.1  setosa
delim1 <- read.delim("iris_tab.tsv")
# change the decimal point in data.frame delim1 to a decimal comma
delim1$Sepal.Length <- gsub(replacement = ",", pattern="\\.", delim1$Sepal.Length)
delim1$Sepal.Width <- gsub(replacement = ",", pattern="\\.", delim1$Sepal.Width)
delim1$Petal.Length <- gsub(replacement = ",", pattern="\\.", delim1$Petal.Length)
delim1$Petal.Width <- gsub(replacement = ",", pattern="\\.", delim1$Petal.Width)
write.table(delim1, "iris_tab2.txt", sep = "\t", row.names = FALSE)
delim2 <- read.delim2("iris_tab2.txt")
head(delim1, n = 2)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5,1         3,5          1,4         0,2  setosa
2          4,9           3          1,4         0,2  setosa
head(delim2, n = 2)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
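
To illustrate the extra control read.table() offers, here is a sketch that parses a pipe-delimited table with a custom missing-value token. The input text, the "n/a" token and the column names are invented for this example.

```r
# Assumed example input: pipe-delimited, with "n/a" marking missing values
raw_text <- "sample|length|species
s1|5.1|setosa
s2|n/a|versicolor
s3|6.3|virginica"

measurements <- read.table(
  text       = raw_text,   # read from a string instead of a file
  header     = TRUE,       # first line holds the column names
  sep        = "|",        # custom field delimiter
  na.strings = "n/a"       # treat this token as missing (NA)
)

str(measurements)
is.na(measurements$length)   # FALSE TRUE FALSE
```

Because "n/a" is declared as missing, the length column is read as numeric rather than as text.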

R’s Internal Binary Formats

When performance, fidelity and reproducibility matter—especially for large objects—R’s binary formats (.RDS / .RData) beat text. saveRDS() / readRDS() handle single objects, while save() / load() manage multiple objects.

tmp <- "iris.RDS"
saveRDS(iris, tmp)
iris_rds <- readRDS(tmp)

save(iris, iris_csv, file = "iris_data.RData")
rm(iris, iris_csv)
load("iris_data.RData")

save.image("workspace.RData")  # saves your entire session
rm(list = ls())
load("workspace.RData")  # reloads the entire session
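
What "fidelity" means in practice can be seen by comparing a binary and a text round trip. A minimal sketch using temporary files and the unmodified dataset from the datasets package:

```r
rds_file <- tempfile(fileext = ".rds")
csv_file <- tempfile(fileext = ".csv")

saveRDS(datasets::iris, rds_file)
write.csv(datasets::iris, csv_file, row.names = FALSE)

from_rds <- readRDS(rds_file)
from_csv <- read.csv(csv_file)

# The binary round trip returns the object unchanged
identical(from_rds, datasets::iris)

# The text round trip does not keep the factor
# (with the defaults of R >= 4.0, Species comes back as character)
class(from_rds$Species)
class(from_csv$Species)
```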

Data Reshaping: Base R Tools

Base R reshaping functions cover pivoting, stacking, grouping, and merging. We’ll compare each pair to understand when to use one versus another.

  • reshape() handles complex wide⇄long pivots in one call via the direction argument. It’s powerful but can be verbose when specifying varying and times.
  • stack() / unstack() provide a quick way to collapse or expand multiple columns into key-value pairs, but without the record-level identifiers that reshape() preserves.

stack() / unstack()

   stacked <- stack(iris[1:4])
   head(stacked, 4)

Execute the code. Examine:

  • Are values grouped by column (Sepal.Length then Sepal.Width, etc.)?
  • How does the ind factor map each measurement?
unstacked <- unstack(stacked)
identical(unstacked, iris[1:4])
set.seed(123)
stacked2 <- stacked[sample(nrow(stacked)), ]
unstacked2 <- try(unstack(stacked2), silent = TRUE)
unstacked2
    Sepal.Length Sepal.Width Petal.Length Petal.Width
1            7.7         3.4          5.1         0.1
2            4.3         3.8          4.7         1.4
3            5.5         3.4          1.4         1.5
4            5.0         2.9          4.6         2.2
5            5.8         2.3          6.0         1.0
6            4.6         2.0          4.9         0.2
7            6.1         2.8          1.4         2.3
8            6.1         4.4          6.1         1.7
9            6.7         3.0          1.5         0.4
10           5.0         3.1          5.8         2.1
11           5.5         3.2          1.5         1.4
12           6.4         3.2          4.9         1.5
13           4.4         2.8          4.2         0.1
14           5.5         2.5          1.6         0.4
15           4.8         2.9          1.5         0.2
16           6.2         2.9          4.4         1.5
17           5.6         3.0          1.6         1.5
18           6.9         3.1          1.6         2.3
19           7.2         3.0          5.0         0.3
20           6.1         3.4          1.3         0.2
21           5.6         4.0          4.5         2.1
22           5.4         3.5          1.6         1.3
23           7.0         3.4          6.4         0.5
24           6.1         3.0          3.9         0.2
25           6.2         2.6          1.5         0.2
26           6.8         3.5          4.7         1.2
27           5.0         3.1          1.2         0.2
28           6.5         3.6          4.8         1.0
29           6.3         2.8          4.6         0.2
30           4.6         3.0          1.4         0.2
31           6.8         3.1          3.6         2.3
32           5.8         2.3          5.8         1.3
33           5.0         2.7          4.9         1.0
34           5.1         3.0          5.6         1.3
35           6.4         4.2          4.3         1.5
36           5.1         3.0          3.8         1.4
37           4.5         2.5          1.7         0.2
38           6.0         3.5          4.5         2.5
39           5.4         3.0          5.5         1.5
40           4.9         3.4          5.1         2.3
41           5.7         3.5          4.0         0.2
42           5.2         3.2          5.1         1.8
43           5.1         2.5          1.4         0.2
44           4.9         3.3          5.4         1.3
45           6.7         2.8          4.7         1.4
46           5.0         3.1          4.4         2.0
47           5.5         2.8          1.6         1.5
48           5.8         3.0          4.5         2.0
49           4.8         3.2          5.6         0.4
50           6.3         3.0          1.3         1.5
51           6.5         3.0          5.8         1.3
52           6.3         3.0          5.7         1.6
53           6.4         3.1          3.0         2.0
54           7.6         3.0          4.7         0.2
55           5.0         2.8          6.7         0.2
56           4.6         3.0          5.9         0.3
57           6.7         3.8          5.2         2.3
58           6.5         2.9          5.1         1.1
59           6.7         2.8          4.8         1.4
60           4.6         2.2          4.9         0.2
61           6.5         2.7          1.0         2.1
62           7.7         2.6          5.1         1.2
63           5.1         2.7          1.3         0.2
64           6.7         3.1          3.9         1.5
65           6.3         2.8          5.0         0.6
66           5.1         3.0          5.6         1.2
67           7.2         3.7          5.1         2.3
68           6.6         2.9          1.3         0.1
69           5.7         3.8          1.4         1.6
70           5.0         3.7          4.1         1.7
71           6.5         3.0          1.5         0.3
72           6.3         2.7          4.0         1.0
73           5.8         2.9          5.0         1.8
74           4.7         3.5          4.2         1.6
75           5.7         2.6          5.7         1.5
76           5.5         2.7          1.5         0.4
77           4.8         3.3          1.4         1.8
78           4.8         3.4          5.9         0.1
79           5.0         3.7          1.4         1.1
80           5.6         2.5          3.5         1.8
81           5.7         3.1          6.9         1.8
82           7.3         3.4          1.4         2.1
83           5.0         2.6          6.1         1.3
84           5.7         3.2          1.5         0.3
85           5.2         2.4          1.4         1.5
86           5.6         3.0          1.7         1.3
87           7.7         2.7          5.2         1.8
88           5.4         3.8          1.5         2.2
89           6.3         2.3          5.1         1.8
90           7.7         2.5          5.0         1.3
91           6.0         2.9          5.6         0.2
92           7.9         3.0          3.3         1.2
93           5.1         2.7          3.9         1.3
94           5.5         3.0          4.8         0.2
95           6.8         2.8          1.3         0.2
96           5.7         2.2          5.5         2.2
97           7.1         3.9          5.5         2.0
98           6.0         3.2          3.7         0.3
99           4.4         2.5          6.0         1.3
100          7.2         3.2          4.5         0.2
101          4.9         3.3          5.1         1.8
102          6.0         3.2          4.5         1.9
103          6.7         3.9          1.9         1.5
104          6.4         3.6          1.6         0.2
105          5.4         3.3          1.5         2.3
106          6.7         2.2          1.4         1.8
107          6.6         2.8          4.0         1.4
108          6.1         3.4          1.4         2.1
109          5.4         3.2          3.3         1.4
110          5.3         2.8          1.7         1.8
111          6.4         3.0          4.0         0.2
112          7.4         2.4          1.5         0.2
113          5.6         3.0          1.4         0.2
114          6.0         3.4          4.4         2.3
115          4.8         2.7          4.9         1.9
116          5.2         3.0          1.6         0.2
117          6.3         2.7          6.3         0.3
118          5.4         2.8          1.3         2.5
119          6.0         2.5          3.5         0.3
120          6.3         3.2          4.0         1.9
121          6.7         3.0          1.3         0.2
122          5.2         3.6          1.1         2.0
123          6.9         3.1          6.7         1.3
124          5.1         2.5          5.4         2.1
125          6.4         3.3          5.6         0.4
126          5.9         3.1          4.5         0.2
127          5.1         3.1          4.5         1.3
128          6.2         3.5          4.7         1.8
129          4.9         2.8          1.2         1.3
130          5.6         3.4          4.5         1.0
131          5.8         3.2          5.3         1.9
132          6.3         3.6          5.7         1.8
133          6.1         2.8          5.6         0.2
134          5.8         3.2          4.4         1.1
135          4.9         3.4          4.1         2.0
136          6.4         2.9          1.5         1.6
137          4.7         2.3          1.7         2.4
138          5.0         3.2          5.3         1.9
139          4.4         2.6          1.5         1.2
140          6.9         3.0          4.6         1.0
141          6.9         2.4          6.6         2.5
142          5.7         3.0          4.2         2.4
143          5.9         3.3          4.2         0.2
144          5.9         3.4          1.4         0.4
145          5.1         3.0          4.1         2.4
146          6.2         2.9          4.8         0.4
147          5.8         3.8          4.3         1.0
148          4.9         3.8          1.5         1.4
149          5.5         2.9          1.9         0.2
150          5.7         4.1          6.1         0.1

reshape()

iris$rowID <- seq_len(nrow(iris))
   long <- reshape(
     iris,
     varying = list(names(iris)[1:4]),
     v.names   = "Measurement",
     timevar   = "Feature",
     times     = names(iris)[1:4],
     idvar     = c("rowID","Species"),
     direction = "long"
   )
   head(long, 5)
                      Species rowID      Feature Measurement
1.setosa.Sepal.Length  setosa     1 Sepal.Length         5.1
2.setosa.Sepal.Length  setosa     2 Sepal.Length         4.9
3.setosa.Sepal.Length  setosa     3 Sepal.Length         4.7
4.setosa.Sepal.Length  setosa     4 Sepal.Length         4.6
5.setosa.Sepal.Length  setosa     5 Sepal.Length         5.0
   set.seed(42)
   long_shuffled <- long[sample(nrow(long)), ]

   wide <- reshape(
     long_shuffled,
     idvar     = c("rowID","Species"),
     timevar   = "Feature",
     direction = "wide"
   )
names(wide) <- sub("^Measurement\\.", "", names(wide))
wide <- wide[,colnames(iris)]
wide <- wide[order(wide$rowID), ]
rownames(wide) <- NULL
all.equal(wide, iris)

Summary of Best Practice

  • Implicit vs. Explicit IDs: Why do stack()/unstack() fail after reordering? How does idvar rescue us?
  • When to Use Which:
    • Quick two-column views → stack()
    • Reliable round-trips with extra columns → reshape() + idvar

cbind() / rbind() vs. merge()

# Combine first four columns back-to-back plus Species
cb <- cbind(
  iris[, 1:2],
  iris[, 3:4],
  Species = iris$Species
)
head(cb)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
identical(cb, iris)
[1] FALSE
upper <- iris[1:75, ]
lower <- iris[76:150, ]
rb <- rbind(upper, lower)
  • Prediction: Does rb == iris? Not quite:
  • rbind() appends rows but preserves original row names, so identical(rb, iris) is FALSE. Use rownames(rb) <- NULL to align.
short <- iris[-(141:150), ]
# cbind(short, iris)
  • ❌ Error: cbind() expects equal row counts. It can’t recycle or drop.
swapped <- iris[76:150, c("Species","Sepal.Length","Sepal.Width","Petal.Length","Petal.Width")]
# rbind(upper, swapped)
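  • ❌ Error if the columns differ: for data frames, rbind() matches columns by name, so a different column order alone is realigned, but both tables must have the same set of column names.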

Relational Binding: merge()

# Partial views
dir1 <- iris[1:100, c("Sepal.Length","Species")]
dir2 <- iris[51:150, c("Sepal.Width","Species")]
merged <- merge(dir1, dir2, by = "Species")
head(merged, 10)
      Species Sepal.Length Sepal.Width
1  versicolor            7         2.7
2  versicolor            7         2.0
3  versicolor            7         3.2
4  versicolor            7         3.2
5  versicolor            7         3.1
6  versicolor            7         2.3
7  versicolor            7         2.8
8  versicolor            7         2.8
9  versicolor            7         3.3
10 versicolor            7         2.4
  • Result: Many-to-many join: 50 versicolor rows in each table → 50 × 50 = 2500 rows.
  • Why? merge() pairs every matching row for duplicated keys.
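
The same effect is easier to follow on a toy example. The tables left and right and the column key below are made up for this sketch:

```r
left  <- data.frame(key = c("a", "a", "b"), x = 1:3)
right <- data.frame(key = c("a", "a", "c"), y = 4:6)

# Inner join: key "a" occurs twice on each side, giving 2 x 2 = 4 rows;
# "b" and "c" have no partner and are dropped
inner <- merge(left, right, by = "key")
n_inner <- nrow(inner)
n_inner

# Full outer join: the 4 matched rows plus one NA-padded row each for "b" and "c"
outer <- merge(left, right, by = "key", all = TRUE)
n_outer <- nrow(outer)
n_outer
```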
# Create a row index that matches the original iris row numbers
dir1$idx <- 1:nrow(dir1)
dir2$idx <- (50+1):(50+nrow(dir2))

# Now merge on both Species & idx
matched <- merge(dir1, dir2, by = c("Species","idx"), all = TRUE)
matched
# Drop idx if you like
matched$idx <- NULL
  • Outcome: Exactly one-to-one pairing, recovering the intended alignment.

Try It: Shuffle the rows of dir2 before merging. Do you still get correct matches? Why?

Challenge: Perform a full outer join (all=TRUE) on (Species, idx) and inspect NAs.

Summary & Takeaways

  • cbind() / rbind(): Great for straightforward stacking when dimensions align exactly; no key matching.
  • merge(): Key-based alignment; duplicates produce Cartesian products unless you add an ID for one-to-one matching.

By consciously adding IDs when joining on duplicated keys, you ensure your merged table mirrors your intended relational structure—no surprises!

Other interesting functions

  • split() / unsplit() break data into subsets by factor, letting you apply arbitrary functions via lapply() [TOMORROW!], then reassemble.
# Manual split / unsplit
spl <- split(iris, iris$Species)
# res_list <- lapply(spl, function(df) colMeans(df[1:4]))
unspl <- unsplit(spl, iris$Species)
  • table() quickly tabulates counts across factors.
  • cut() bins continuous variables into discrete intervals.
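
cut() and table() combine naturally: first bin a continuous trait, then count the bins per species. A minimal sketch; the break points and labels are chosen arbitrarily for illustration:

```r
# Bin petal length into three classes: (0,2], (2,5] and (5,7]
petal_class <- cut(iris$Petal.Length,
                   breaks = c(0, 2, 5, 7),
                   labels = c("short", "medium", "long"))

# Cross-tabulate the size classes against species
counts <- table(petal_class, iris$Species)
print(counts)
```

Every flower falls into exactly one bin here because the breaks span the full range of petal lengths in the dataset.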

Session 4 :: Visualization

R Plots (base) – some preparations

First we will pre-compute the mean values of each flower trait in each species for later use.

Step 1: Split the full table into species-specific tables


class(species_tables)
[1] "list"

 

names(species_tables)
[1] "setosa"     "versicolor" "virginica" 

 

head(species_tables[["setosa"]],n=3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1          5.1         3.5          1.4         0.2
2          4.9         3.0          1.4         0.2
3          4.7         3.2          1.3         0.2
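
The call that creates species_tables is not shown above. One way to obtain an object consistent with the printed output, assuming only the four numeric trait columns are kept, is split():

```r
# Assumption: keep the four numeric trait columns, then split the rows by species
species_tables <- split(iris[, 1:4], iris$Species)

class(species_tables)                       # a list ...
names(species_tables)                       # ... with one element per species
head(species_tables[["setosa"]], n = 3)     # each element is a data frame
```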

R Plots (base) – some preparations

Step 2: Compute the means and assemble them into a matrix

species_means <- 
    rbind(setosa = 
            colMeans( ## return table column means as a vector
                species_tables[["setosa"]]
                ),
          versicolor = 
            colMeans(species_tables[["versicolor"]]),
          virginica = 
            colMeans(species_tables[["virginica"]])
    )

species_means
           Sepal.Length Sepal.Width Petal.Length Petal.Width
setosa            5.006       3.428        1.462       0.246
versicolor        5.936       2.770        4.260       1.326
virginica         6.588       2.974        5.552       2.026

 

There are easier ways to run a function over the columns of a table – tomorrow!
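As a preview, one possible shortcut is sketched here with sapply(). It assumes that species_tables is a list of the four numeric columns per species, which is rebuilt in the sketch so that it runs on its own:

```r
# Assumed construction of the per-species tables
species_tables <- split(iris[, 1:4], iris$Species)

# sapply() calls colMeans() on every list element and collects the
# results as columns; t() then turns the species into rows.
species_means_alt <- t(sapply(species_tables, colMeans))
species_means_alt
```

The result should have the same layout as species_means: one row per species, one column per trait.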

R Plots (base) – some preparations

Finally we define our own coloring scheme:

## The default palette (R >= 4.0) was chosen for
## better accessibility, e.g. for viewers with
## color vision deficiencies.
species_colors <- setNames( ## makes a named vector
                    palette()[1:3],  rownames(species_means)
                 )
species_colors
    setosa versicolor  virginica 
   "black"  "#DF536B"  "#61D04F" 

 

## colors reflecting the organ type:
trait_colors <-
    c(Petal.Length = "orange2",Petal.Width = "yellow2",
      Sepal.Length = "blue3",Sepal.Width = "lightblue"
    )
trait_colors 
Petal.Length  Petal.Width Sepal.Length  Sepal.Width 
   "orange2"    "yellow2"      "blue3"  "lightblue" 

R Plots (base)

R Plots (base) – barplot()

Barplots represent sign and absolute value of numbers by the direction and length of bars.  

If called with a matrix as first argument, the function draws one group of bars for each column:

barplot(## one bar group per column == trait,
        ## one bar == species mean!
        species_means,
        
        ## do not stack the bars
        beside=TRUE,
        
        ## larger group labels
        cex.names=1.5,
        
        col=species_colors
        )

## add a legend (plot "augmentation"!)
legend("topright",
       rownames(species_means), 
       fill=species_colors,
       cex=1.5)

R Plots (base) – barplot()

If we want to plot trait means per species, we must change the rows of matrix species_means (= the species) into columns, because barplot() reads a matrix by column. 

This is done by the t() function ("transpose"):

m <-  t(species_means) ## TRANSPOSE

barplot(## one bar group per column == species,
        ## one bar == trait mean!
        m,
        
        ## do not stack the bars
        beside=TRUE,
        
        ## larger group labels
        cex.names=2,
        
        col=trait_colors,
        
        ## increase y limit to fit the legend
        ylim = c(0,10),
        cex = 2
        )

## add a legend (plot "augmentation"!)
legend(x=1,y=10, ##"topright",
       rownames(m), 
       fill=trait_colors)

R Plots (base) – pie()

Pie charts are a quick-and-dirty alternative for representing numbers. 

The pie() function can only represent one set of numbers at a time. In addition, comparing angles on a pie chart is visually harder than comparing bar heights.

pie(species_means[,"Petal.Length"], 
    labels = rownames(species_means),
    main = paste("Mean Petal Lengths in ",
                 "Fisher's Iris Species"),
    col=species_colors,
    
    cex=2, ## larger text annotation
    cex.main = 2 ## larger title
)

R Plots (base) – pie()


pie(species_means["setosa",], 
    labels = colnames(species_means),
    main = paste("Mean Flower Organ Dimensions",
                 "in Iris setosa"),
    col=trait_colors,
    
    cex=2, ## larger text annotation
    cex.main = 2 ## larger title
)

R Plots (base) – plot()

The plot() function is an extremely versatile workhorse for x/y plots.

As an “initializing” function, it may be called to just create an empty canvas, to be filled later:

## (see par() for graphical parameters!)
plot(x=NULL,y=NULL,
     
     ## Note that if you start empty, 
     ## you have to set the canvas 
     ## dimensions yourself!
     
     xlim=c(0,3),
     ylim=c(0,9),
     xlab = "x dimension",
     ylab = "y dimension"
     
)

R Plots (base) – plot()

Or it is called with an initial set of data, with the option to extend the plot later:

plot(x=iris$Petal.Width, 
     y=iris$Petal.Length
)

R Plots (base) – plot()

There is more than one way to relate x/y dimensions to data!

## Initialize a new plot, 
## using formula notation to specify x and y:
plot(Petal.Length ~ Petal.Width,
     data = iris)

R Plots (base) – plot()

Add a grid:

grid() 

R Plots (base) – plot()

Overplot some points with color, in order to identify a group in your data:

points(Petal.Length ~ Petal.Width, 
       
       data=subset(iris,
                Species == "versicolor"
            ),
       
       pch=21, # symbol code 21: bullet with
               # separate interior color (bg) 
               # and border color (col)
               ## See points()!

       ## Set color manually:
       col= "red", 
       bg = "red"  
)

R Plots (base) – plot()

Color all points by species, using our named vector species_colors:

plot(Petal.Length ~ Petal.Width, 
     data=iris,
     pch=21,
     
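     ## index the "species_colors" vector
     ## by species names (as.character(),
     ## because a factor would index by its
     ## integer level codes, not its labels):
     col=species_colors[as.character(Species)], 
     bg =species_colors[as.character(Species)]  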
     )

legend("topleft",
       rownames(species_means),
       fill=species_colors)
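One subtlety when a named vector is indexed with a factor: R uses the factor's integer level codes, not its labels. This only gives the intended colors if the vector is in the same order as the factor levels. The sketch below makes the difference visible with a deliberately reordered vector (shuffled_colors is a hypothetical name used only here):

```r
# Named color vector in the order of the factor levels
species_colors <- setNames(palette()[1:3], levels(iris$Species))

# Hypothetical reordered copy, to expose the difference
shuffled_colors <- species_colors[c("virginica", "setosa", "versicolor")]

# Indexing by factor: uses the level codes 1, 2, 3 (positions)
by_code <- shuffled_colors[iris$Species]

# Indexing by character: uses the names
by_name <- shuffled_colors[as.character(iris$Species)]

# The first plant is a setosa (level code 1):
names(by_code)[1]  # element at position 1 of shuffled_colors
names(by_name)[1]  # element named "setosa"
```

Wrapping the factor in as.character() is the safer habit whenever the order of the vector is not guaranteed.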

R Plots (base) – plot()

Points with adjacent positions in the input can be connected by lines, using different line styles. A typical use case is a line graph, with x as a running number or ID.

## See par() for line-related parameters!

## Make a new data.frame, 
## containing only setosa:
df <- subset(iris, Species=="setosa")

plot(
     # x is now the row number in df
     x=1:nrow(df),
     xlab="individual plant",
         
     y=df$Petal.Width,
     ylab="Petal.Width",
         
     ## show both points and 
     ## connecting lines:
     type="b",
     
     ## line width:
     lwd = 2,
     ## line style = dashed:
     lty=2,
         
     main="Iris setosa"
)

R Plots (base) – plot()

It can make sense to connect some points in a general scatterplot by lines. 
The augmenting function lines() can do this. 

 

Here, we want to connect the (x,y) means of our three species:

## Initialize a plain x/y plot:
plot(Petal.Length ~ Petal.Width,
     data=iris)

## Add colored mean points, connected by lines
lines(x=species_means[,"Petal.Width"],
      y=species_means[,"Petal.Length"],
      type="b", # show both points and lines 
      pch=21,
      bg=species_colors,
      cex=2.5
)

R Plots (base) – plot()

Annotate individual points:

text(x=species_means[,"Petal.Width"],   
     y=species_means[,"Petal.Length"],
     
     labels=rownames(species_means),
     
     pos=4, ## put labels to the right of points
            ## (see ?text)
     cex=3  ## expansion factor for the text
)

R Plots (base) – plot()

Function abline() adds indicator lines to a plot.

Regression line:

## Initialize a plain x/y plot:
plot(Petal.Length ~ Petal.Width,
     data=iris,
     ## Put a title:
     main = "Regression Line Example")
 
## Mark the linear regression line:
abline(lm(Petal.Length ~ Petal.Width,
          data=iris
       ),
       lty=2, lwd=2,col="blue"
       )
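To see what abline() receives from lm(), the fitted coefficients can be inspected directly. The following sketch extracts intercept and slope with coef() and passes them to abline() explicitly, which should draw the same line as passing the model object:

```r
# Fit the linear model and extract its two coefficients
fit <- lm(Petal.Length ~ Petal.Width, data = iris)
coefs <- coef(fit)  # named vector: "(Intercept)" and "Petal.Width"
coefs

# Draw the scatterplot, then the line from intercept (a) and slope (b)
plot(Petal.Length ~ Petal.Width, data = iris)
abline(a = coefs["(Intercept)"],
       b = coefs["Petal.Width"],
       lty = 2, lwd = 2, col = "blue")
```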

R Plots (base) – plot()

Function abline() adds indicator lines to a plot.

Lines marking locations or slopes of interest:

## Initialize a plain x/y plot:
plot(Petal.Length ~ Petal.Width,
     data=iris,
     main = "abline() example"
     )

## Horizontal and vertical markers:
abline(h = 2.5, col = "red", 
       lwd=2 ## line width
       )
abline(v = 0.75, col = "yellow", lwd=2)

## An "assumed" regression line for reference:
abline(a=1,b=2,lwd=2)

R Plots (base) – layout()

Several plots can be combined on the same page in a grid-like layout.

The grid is specified by a matrix of possible plot positions, like so:

## prepare the layout matrix
m <- matrix(1:4, 
            nrow=2, 
            ncol=2, 
            byrow=FALSE)
m
     [,1] [,2]
[1,]    1    3
[2,]    2    4

 

The first plot will go to grid position 1, the second to position 2, and so on. Because of byrow=FALSE, the positions are numbered down the columns.
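The layout matrix does not have to use each number only once: if a number is repeated in neighbouring cells, the corresponding plot spans all of them. A small sketch (the matrix m_span and the choice of plots are illustrative assumptions):

```r
# Plot 1 spans the whole top row, plots 2 and 3 share the bottom row
m_span <- matrix(c(1, 1,
                   2, 3),
                 nrow = 2, byrow = TRUE)
layout(m_span)

plot(Petal.Length ~ Petal.Width, data = iris)   # position 1 (wide)
hist(iris$Petal.Length, main = "Petal.Length")  # position 2
hist(iris$Petal.Width,  main = "Petal.Width")   # position 3

layout(1)  # back to a single plot per page
```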

R Plots (base) – layout()

layout(m) ## read the layout matrix

## as.character(): index by species name, not by factor level code
use_cols <- species_colors[as.character(iris$Species)]
## 1
plot(Sepal.Length ~ Sepal.Width, data=iris,
     pch=21, col=use_cols, bg=use_cols, 
     cex.lab=2)
## 2
plot(Petal.Length ~ Petal.Width, data=iris, 
     pch=21, col=use_cols, bg=use_cols, 
     cex.lab=2)
## 3
plot(Sepal.Length ~ Petal.Length, data=iris, 
     pch=21, col=use_cols, bg=use_cols, 
     cex.lab=2)
## 4
plot(Sepal.Width ~ Petal.Width, data=iris, 
     pch=21, col=use_cols, bg=use_cols, 
     cex.lab=2)

layout(1) ## back to full screen

R Plots (base) – hist()

The hist() function is one of those “dual use functions”:   

With add=FALSE, it initializes the device and the coordinate system, while 
with add=TRUE,  its output goes directly to an existing plot.

 

setosa <- subset(iris,Species=="setosa")
versicolor <- subset(iris,Species=="versicolor")
virginica <- subset(iris,Species=="virginica")
              
## Plot the histogram of setosa, 
## and initialize the entire plot: 

hist(setosa$Petal.Length,
     col=species_colors["setosa"],
     
     add=FALSE, ## this is the default
     
     ## initialize to full x range !
     xlim=range(iris$Petal.Length),
     
     ## the full y range is usually only
     ## known after some trial and error
     ylim=c(0,22),
     
     ## x-axis label
     xlab="Petal Length",
     
     ## larger axis labels:
     cex.lab = 2,
     
     main = "Petal Length Distributions",
     
     ## larger title:
     cex.main = 2
)
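Instead of finding the y limit by trial and error, the bin counts can be computed beforehand without drawing anything. This sketch uses hist(..., plot=FALSE) with the default breaks for each species; petal_by_species and y_upper are names made up for this example:

```r
# Petal lengths, split into one numeric vector per species
petal_by_species <- split(iris$Petal.Length, iris$Species)

# Highest bin count per species, computed without plotting
max_counts <- sapply(petal_by_species,
                     function(v) max(hist(v, plot = FALSE)$counts))
max_counts

# An upper y limit that fits all three histograms
y_upper <- max(max_counts)
y_upper
```

The value could then be used as ylim=c(0, y_upper) in the first hist() call, possibly with some extra room for a legend.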

R Plots (base) – hist()


hist(versicolor$Petal.Length,
     add=TRUE,
     col=species_colors["versicolor"]
)

R Plots (base) – hist()


hist(virginica$Petal.Length,
     add=TRUE,
     col=species_colors["virginica"]
)

legend(x=2,y=22, 
       legend=c("setosa",
                "versicolor",
                "virginica"), 
       fill=species_colors,
       cex =1.5 ## larger text in legend 
       )

R Plots (base) – boxplot()

The boxplot() function has “dual-use” capabilities, too.

In addition, it accepts a formula with a factor on the right-hand side and then splits the dataset automatically by factor level. So we can plot all species at once:

boxplot(Petal.Length ~ Species, 
        data = iris,
        ## colors are not automatically
        ## inferred from the factor levels!
        col = species_colors
        )

R Plots (base) – boxplot()

Let’s add a boxplot for the global Petal.Length distribution (all species merged):

# Repeat the last boxplot, with an extended x axis: 
boxplot(Petal.Length ~ Species, 
        data = iris, col = species_colors,
        xlim=c(0,6)
        )

boxplot(iris$Petal.Length, # take the entire column!
        add=TRUE, 
        at=5, ## position on x axis
        names="all species",
        show.names=TRUE
        )

R Plots (base) – Saving Plots From RStudio
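Besides the interactive export in RStudio, a plot can also be written to a file from code, which is easier to reproduce. A minimal sketch using the pdf() device; the file name and the size are arbitrary assumptions, and png() works in a similar way:

```r
# Hypothetical output file in the session's temporary directory
out_file <- file.path(tempdir(), "petal_scatter.pdf")

pdf(out_file, width = 7, height = 5)  # open a file device (inches)
plot(Petal.Length ~ Petal.Width, data = iris)
dev.off()                             # close the device, write the file

file.exists(out_file)
```

All plotting calls between pdf() and dev.off() go to the file instead of the Plots pane.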

 


Digression: Software is Usually Built From Bits and Pieces


  • … they come under the names of subroutines, macros, functions
  • … in code, they are used like commands:


  • Actually the function name invokes a piece of hidden code
  • … hiding complexity
  • … yet giving easy access to complex algorithms
  • … and allowing local extensions of a language