R Beginners Course 2026

Session 3 :: Data I/O and Reshaping

Dr. Debasish Mukherjee, Dr. Ulrike Goebel, Dr. Ali Abdallah

Bioinformatics Core Facility CECAD

2026-05-13

Session 3 :: Data I/O and Reshaping


Learning Objectives

By the end of this session you will be able to:

  • Choose the right base-R I/O function for a given file format and locale (read.csv / read.csv2, read.table, read.delim / read.delim2, saveRDS / readRDS, save / load).
  • Reshape data between wide and long form with stack() / unstack() and reshape(), and explain why explicit IDs matter for round-trips.
  • Combine tables with cbind() / rbind() vs. merge(), and predict the effect of duplicated keys on the join result.
  • Apply small reproducibility hygiene habits: keep files inside your project folder with relative paths; avoid save.image() / load("workspace.RData").

Data I/O (read & write files)

  • In this session, we’ll build an understanding of base R functions for data input/output (I/O) and data reshaping, using the iris dataset.

  • Beyond simply running code, we’ll discuss why you might choose one function over another, highlighting each function’s strengths and trade-offs where it makes sense.

  • We’ll do this in an interactive way.

Recap — iris: built-in data frame, 150 × 5, columns Sepal.Length, Sepal.Width, Petal.Length, Petal.Width (numeric) and Species (factor: setosa, versicolor, virginica, 50 each).

What is I/O and why does it matter?

I/O = Input / Output — reading data into R and writing results out of R.

Why persistent storage?

R keeps everything in RAM — fast, but temporary.
The moment you close R, any object you haven’t saved to disk is gone.

Saving to a file lets you:

  • share results with collaborators
  • pick up a long analysis the next day
  • hand cleaned data to a downstream script or tool
  • archive what you submitted alongside a paper

The read / write cycle

  ┌─────────────────────────┐
  │   File on disk          │
  │   (CSV, TSV, RDS, …)    │
  └────────┬────────────────┘
           │  read.csv() / readRDS() / …
           ▼
  ┌─────────────────────────┐
  │   R object in memory    │
  │   data.frame, list, …   │
  └────────┬────────────────┘
           │  write.csv() / saveRDS() / …
           ▼
  ┌─────────────────────────┐
  │   File on disk          │
  └─────────────────────────┘

Text vs binary. Text formats (CSV, TSV) are human-readable and portable but larger and slower. Binary formats (RDS, RData) are compact and lossless but only R can read them directly. Choose based on who — or what — needs to open the file next.
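A quick way to feel the text-vs-binary difference is to write the same object both ways and compare file sizes — a minimal sketch using tempfile() so nothing lands in your project folder:

```r
# Write iris once as text (CSV) and once as binary (RDS)
csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")

write.csv(iris, csv_path, row.names = FALSE)
saveRDS(iris, rds_path)     # compressed by default

file.size(csv_path)         # text: human-readable, but larger
file.size(rds_path)         # binary: compact, but R-only
```

On a dataset this small the gap is modest; on genome-scale tables it becomes dramatic.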

Working directory & paths

Why this slide exists. Every I/O demo today writes a file, then reads it back. We need a sensible place to put those files — inside your project folder, using relative paths.

Two ideas to keep in mind:

Function                       What it does
getwd()                        shows R’s current working directory
setwd("path")                  changes it (avoid in scripts — use RStudio Projects)
file.path("data", "iris.csv")  builds a path that works on Windows, macOS, Linux
dir.create("data")             creates a sub-folder if it doesn’t exist yet
getwd()                          # where are we right now?
file.path("data", "iris.csv")    # builds "data/iris.csv" portably

Convention for today. Each demo writes into the current directory (or a data/ sub-folder you create once with dir.create("data")). You’ll see the files appear in RStudio’s Files pane — useful for understanding what actually happened on disk.
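Putting those helpers together, today’s convention looks like this — a minimal sketch (the data/ folder name is simply the one used in this deck):

```r
# Create the sub-folder once (skip if it already exists)
if (!dir.exists("data")) dir.create("data")

# Build a portable path and write into it
out <- file.path("data", "iris.csv")
write.csv(iris, out, row.names = FALSE)

file.exists(out)   # TRUE — check the Files pane too
```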

CSV formats — When & Why

Scenario. A collaborator emails you a sample sheet. The US lab sends it with , separators and . decimals; the Munich lab sends ; separators and , decimals. Same data, two formats — base R has a dedicated reader for each.

Key parameters (shared by the whole read.* family):

Arg          Meaning                    read.csv  read.csv2
file         path or URL                —         —
sep          field separator            ","       ";"
dec          decimal mark               "."       ","
header       first row = column names   TRUE      TRUE
na.strings   strings to treat as NA     "NA"      "NA"

Decision rule. US/UK (, + .) → read.csv(). DE/FR (; + ,) → read.csv2(). Anything else → read.table() with explicit sep and dec.

CSV formats — Live

# Write iris into your project folder, then read it back.
write.csv(iris, "iris.csv", row.names = FALSE)
iris_csv <- read.csv("iris.csv")
head(iris_csv, n = 6)
write.csv2(iris, "iris_de.csv", row.names = FALSE)
iris_csv2 <- read.csv2("iris_de.csv")

# Peek at the raw file to see the locale conventions
writeLines(readLines("iris_de.csv", n = 3))

head(iris_csv2, n = 6)

Note the ; separator and , decimal in the raw file — that’s the whole point of read.csv2() (DE/FR locale).

Predict first. We write iris (where Species is a factor) to CSV and read it back. Will identical(iris, iris_back) be TRUE?

write.csv(iris, "iris.csv", row.names = FALSE)
iris_back <- read.csv("iris.csv")

# Compare types
class(iris$Species)       # "factor"
class(iris_back$Species)  # "character" — factor levels lost!

identical(iris, iris_back)  # FALSE — surprising?

Why? Since R 4.0, read.csv() no longer converts character columns to factors automatically (stringsAsFactors = FALSE is the new default). The data values are identical, but the type differs — which is enough to make identical() return FALSE.

Fix: rebuild the factor explicitly after loading:

iris_back$Species <- factor(iris_back$Species,
                            levels = levels(iris$Species))
identical(iris, iris_back)  # TRUE

General Tabular I/O — When & Why

Scenario. A counts matrix from a sequencing pipeline arrives as a TSV with comment lines at the top, missing values written as "-", and an unusual quoting style. The read.csv* shortcuts are too rigid; you need full control.

Key parameters beyond read.csv’s defaults:

Arg What it controls
sep field separator ("\t", ";", " ", …)
quote quoting characters (use "" to disable)
na.strings values to treat as NA (e.g. c("NA", "-", "."))
skip number of header lines to skip
comment.char lines starting with this char are ignored
check.names sanitise column names (set FALSE to keep raw IDs)

Decision rule. read.delim()/read.delim2() = read.table() with sep="\t" baked in (period vs comma decimals). Use read.table() itself for anything more exotic.

General Tabular I/O — Live

write.table(iris, "iris.tsv", sep = "\t", row.names = FALSE)
iris_tab <- read.table("iris.tsv", header = TRUE, sep = "\t")
head(iris_tab, n = 10)
# US-style: tab-separated, period decimal (reuses iris.tsv from previous tab)
delim1 <- read.delim("iris.tsv")

# Write the same data in DE-style: tab-separated, comma decimal
write.table(iris, "iris_de.tsv",
            sep = "\t", dec = ",", row.names = FALSE)

# Read it back with the locale-aware shortcut
delim2 <- read.delim2("iris_de.tsv")

head(delim1, n = 2)
head(delim2, n = 2)
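The scenario above — comment lines plus "-" for missing values — never appears in the shortcut demos, so here is a small sketch with a fabricated file (messy.tsv and its contents are made up for illustration):

```r
# Fabricate a small messy export: a comment line, tabs, "-" for missing
writeLines(c("# exported by some pipeline",
             "gene\tsampleA\tsampleB",
             "g1\t5\t-",
             "g2\t-\t7"),
           "messy.tsv")

msg <- read.table("messy.tsv", header = TRUE, sep = "\t",
                  comment.char = "#",           # drop the comment line
                  na.strings   = c("NA", "-"))  # treat "-" as missing
msg   # the "-" cells come back as NA
```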

R’s Internal Binary Formats — When & Why

Scenario. You just finished a 30-minute QC pipeline on a large object (say, a SummarizedExperiment) and want to resume tomorrow without re-running everything. Binary formats are lossless and orders of magnitude faster than text.

Two flavours, one decision:

Function pair           Saves                               Variable name on load
saveRDS() / readRDS()   one object                          you choose: x <- readRDS(...)
save() / load()         one or many objects + their names   fixed: original names reappear

Key args: file (path), compress (TRUE/"gzip"/"xz").

Decision rule. Single object → saveRDS() (you pick the name on load, safer for reuse). Workspace / many objects with their names baked in → save().

R’s Internal Binary Formats — Live

# saveRDS / readRDS: a single object, no name preserved
saveRDS(iris, "iris.rds")
iris_rds <- readRDS("iris.rds")

# save / load: one or many objects, names preserved
save(iris, iris_rds, file = "iris.RData")
rm(iris_rds)              # remove from session
load("iris.RData")        # variables reappear with their original names
ls()

Warning

save.image() / load("workspace.RData") (and saving .RData on exit) are convenient but discouraged for reproducible work: hidden state silently comes back next session. Prefer scripts you can re-run from a clean R.

What is data reshaping — and why?

The same data can be arranged in two fundamentally different shapes:

Wide format — one row per subject, one column per measurement

flower  Sepal.L  Sepal.W  Petal.L  Petal.W
1       5.1      3.5      1.4      0.2
2       4.9      3.0      1.4      0.2

→ natural for data entry, spreadsheets, many statistical tests

Long format — one row per observation, a column naming what was measured

flower  feature    value
1       Sepal.L    5.1
1       Sepal.W    3.5
1       Petal.L    1.4
1       Petal.W    0.2
2       Sepal.L    4.9
...

→ required by boxplot(y ~ group), aov(), t.test() — base R’s formula interface

Why does this matter in the lab? A plate reader exports one column per well (wide). Base R’s boxplot(value ~ well, data = df) needs one row per measurement with a grouping column (long). Reshaping is the bridge.

Tool choice. Two base R tools for wide ↔︎ long: stack() / unstack() — quick, but relies on column order. reshape() — explicit IDs, reliable round-trips, more arguments. We’ll cover both.

stack() / unstack() — When & Why

Scenario. You have one numeric column per replicate (wide) and boxplot(value ~ group) or aov() wants one row per observation with a grouping column (long). stack() is the cheapest possible reshape — when you don’t need to keep row identity.

Mental model. stack() melts numeric columns into two columns:

  • values — all the numbers, concatenated column-by-column
  • ind — a factor labelling which original column each value came from

Key args (rarely needed): select (columns to stack), drop (drop unused levels).

Decision rule. stack()/unstack() for quick wide↔︎long when row identity doesn’t matter. reshape() (next) when you need to round-trip reliably with extra ID columns.
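One quick illustration of the select argument — the only extra you are likely to need. Like subset(), it takes unquoted column names:

```r
# Stack only two of the four measurement columns
long_sel <- stack(iris[1:4], select = c(Sepal.Length, Petal.Length))

nrow(long_sel)      # 300 — 150 values from each selected column
head(long_sel, 3)
```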

stack() / unstack() — Live

The tabs below — Look & Predict → Run & Observe → Reversibility → Challenge — make the hidden assumptions of stack() visible.

Predict first, then run. What will the first few rows of stacked look like?

stacked <- stack(iris[1:4])
head(stacked, 4)

Don’t run it — sketch the output on paper / in your head first. Switch to Run & Observe when ready.

Click Run code to execute live in your browser. Examine:

  • Are values grouped by column (Sepal.Length then Sepal.Width, etc.)?
  • How does the ind factor map each measurement?
# Run & Observe
stacked <- stack(iris[1:4])
head(stacked, 8)

# Reversibility: does unstack() undo stack()?
stacked <- stack(iris[1:4])
unstacked <- unstack(stacked)
identical(unstacked, iris[1:4])

# Challenge: shuffle the long rows, then unstack
stacked <- stack(iris[1:4])
set.seed(123)
stacked2 <- stacked[sample(nrow(stacked)), ]
unstacked2 <- try(unstack(stacked2), silent = TRUE)
unstacked2   # runs — but flowers no longer line up across columns

reshape() — When & Why

Scenario. Samples × measurements wide table → long form for aov() / t.test() / boxplot() — and you must pivot back even if rows get reordered. stack() can’t do that; reshape() can, because you give it explicit IDs.

The six named arguments — read them in this order:

Arg        What it controls
direction  "long" or "wide" — which way you’re pivoting
varying    columns to melt (long) / columns to split (wide)
v.names    name of the new value column
timevar    name of the new key column (which-feature)
times      labels that go into timevar
idvar      column(s) that identify a row (anchor for round-trip)

Decision rule. Use reshape() when (a) you need both directions, (b) you have non-measurement columns to carry, or (c) row order isn’t guaranteed. Otherwise stack() is shorter.

reshape() — Live

Heads-up — densest call of the deck. Don’t try to absorb the six-argument call in one read; the tabs below walk through it.

iris2 <- iris                          # local copy
iris2$rowID <- seq_len(nrow(iris2))    # explicit row identifier
head(iris2, 3)

Before you run the next tab: iris2 has 150 rows and 4 measurement columns. After reshaping to long form with one row per (flower × feature):

  • How many rows will long have?
  • How many columns?
  • What will the new Feature column contain?

Commit to your three answers before clicking Run.

iris2 <- iris
iris2$rowID <- seq_len(nrow(iris2))
long <- reshape(
  iris2,
  varying   = list(names(iris2)[1:4]),  # cols to melt into one
  v.names   = "Measurement",            # name of new value col
  timevar   = "Feature",                # name of new key col
  times     = names(iris2)[1:4],        # labels that fill timevar
  idvar     = c("rowID","Species"),     # row-identity anchor
  direction = "long"                    # "long" or "wide"
)
dim(long)
head(long, 5)

Reversibility test. We shuffle long so row order is destroyed, then reshape back to wide. With explicit idvars, the recovery should still work.

iris2 <- iris
iris2$rowID <- seq_len(nrow(iris2))
long <- reshape(
  iris2,
  varying   = list(names(iris2)[1:4]),
  v.names   = "Measurement",
  timevar   = "Feature",
  times     = names(iris2)[1:4],
  idvar     = c("rowID","Species"),
  direction = "long"
)
set.seed(42)
long_shuffled <- long[sample(nrow(long)), ]

wide <- reshape(
  long_shuffled,
  idvar     = c("rowID","Species"),
  timevar   = "Feature",
  direction = "wide"
)
head(wide, 3)

# Clean up auto-generated names and row order, then compare
names(wide) <- sub("^Measurement\\.", "", names(wide))
wide <- wide[, colnames(iris2)]
wide <- wide[order(wide$rowID), ]
rownames(wide) <- NULL
all.equal(wide, iris2)

Predict, then run. What happens if we drop Species from idvar? Does reshape still recover the original data? Why / why not?

iris2 <- iris
iris2$rowID <- seq_len(nrow(iris2))
long <- reshape(
  iris2,
  varying   = list(names(iris2)[1:4]),
  v.names   = "Measurement",
  timevar   = "Feature",
  times     = names(iris2)[1:4],
  idvar     = c("rowID","Species"),
  direction = "long"
)
set.seed(42)
long_shuffled <- long[sample(nrow(long)), ]
# idvar = rowID only — Species is no longer treated as a key
wide_bad <- try(reshape(
  long_shuffled,
  idvar     = "rowID",
  timevar   = "Feature",
  direction = "wide"
))
head(wide_bad, 3)

Reshaping: Best Practice

  • Implicit vs. Explicit IDs: Why do stack()/unstack() fail after reordering? How does idvar rescue us?
  • When to Use Which:
    • Quick two-column views → stack()
    • Reliable round-trips with extra columns → reshape() + idvar

identical() vs all.equal()

The next sections check round-trips. Two checkers, two purposes:

  • identical(x, y) — strict structural equality. Returns FALSE if attributes (e.g. row.names), types, or storage differ — even when the values look the same.
  • all.equal(x, y) — content-level check with numeric tolerance. Returns TRUE when values match within ~1.5e-8; otherwise a character vector describing the differences (use isTRUE(all.equal(...)) for a boolean).

Rule of thumb: use identical() to verify exact round-trips (binary I/O, reshape, key joins); use all.equal() when only the content must match (floating-point pipelines, text round-trips with type drift).
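A two-line illustration of why the distinction matters for floating-point and type-drift cases:

```r
x <- 0.1 + 0.2             # not stored as exactly 0.3

identical(x, 0.3)          # FALSE — bit patterns differ
isTRUE(all.equal(x, 0.3))  # TRUE  — within numeric tolerance

# Type differences trip identical() too
identical(1L, 1)           # FALSE — integer vs double
isTRUE(all.equal(1L, 1))   # TRUE  — same value
```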

cbind() / rbind() vs. merge() — When & Why

Scenario. You have a counts matrix and a sample sheet.

  • Already aligned (same rows, same order) → cbind() glues columns.
  • Two batches with the same columns → rbind() stacks rows.
  • Alignment must come from a key column (sample_id, gene_id) → merge() does an SQL-style join.

Decision tree. Same rows in same order? → cbind. Same columns to append? → rbind. Match on a shared key? → merge. Mismatched keys with duplicates? → add an explicit idx before merge (see next slide).

cbind() — Live

Predict first. We rebuild iris by cbind-ing its column blocks back together. Before running:

  • Will all.equal(cb, iris) return TRUE?
  • Will identical(cb, iris) return TRUE?
  • If the two checkers disagree — why?
# Combine first four columns back-to-back plus Species
cb <- cbind(
  iris[, 1:2],
  iris[, 3:4],
  Species = iris$Species
)
head(cb)

# Try BOTH checkers — do they agree?
all.equal(cb, iris)   # content-level (with tolerance)
identical(cb, iris)   # strict (attributes + types)
  • Why the disagreement? cbind() rebuilds the data.frame from scratch — minor attributes (e.g. row.names) differ even when the visible content is the same. all.equal() ignores those; identical() doesn’t.

This chunk is supposed to break. We’re showing what cbind() does when row counts disagree — the error message is the lesson.

short <- iris[-(141:150), ]
tryCatch(
  cbind(short, iris),
  error = function(e) conditionMessage(e)
)
  • ❌ Error: cbind() expects equal row counts. It can’t recycle or drop.

rbind() — Live

Reversibility test. Split iris into upper/lower halves and rbind them back. Predict: does identical(rb, iris) return TRUE? If not, what one line would fix it?

upper <- iris[1:75, ]
lower <- iris[76:150, ]
rb <- rbind(upper, lower)
identical(rb, iris)
  • rbind() appends rows but preserves original row names, so identical(rb, iris) is FALSE. Fix with rownames(rb) <- NULL.

This chunk is supposed to break. On data frames, rbind() matches columns by name — reordered columns are forgiven, but a name missing from one side is not. The error message is the point.

upper <- iris[1:75, ]
renamed <- iris[76:150, ]
names(renamed)[1] <- "SepalLength"   # illustrative typo: this name no longer matches
tryCatch(
  rbind(upper, renamed),
  error = function(e) conditionMessage(e)
)
  • ❌ Error: rbind() needs every column name of the first data frame to appear in the second; order alone is fine, because it matches by name.

merge() — When & Why

Scenario. Counts come from the pipeline, metadata comes from a separate sample sheet, and both share a sample_id column. merge() is the SQL‑style join you reach for to bring them together.

Key arguments:

Arg Meaning
x, y the two data frames
by column(s) shared by both — the join key
by.x, by.y use when the key column has different names in x and y
all TRUE → full outer join (keep all rows of both)
all.x, all.y left / right outer joins
suffixes renames clashing non-key columns (default c(".x",".y"))
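by.x / by.y and suffixes in action — a minimal sketch with made-up sample-sheet columns (sample_id, id, reads, batch, date are illustrative names, not from the iris demos):

```r
counts <- data.frame(sample_id = c("s1", "s2"),
                     reads     = c(100, 250),
                     date      = c("2026-01-01", "2026-01-02"))
meta   <- data.frame(id    = c("s1", "s2"),
                     batch = c("A", "B"),
                     date  = c("2026-02-01", "2026-02-02"))

# Keys have different names; the non-key column "date" clashes in both tables
m <- merge(counts, meta, by.x = "sample_id", by.y = "id",
           suffixes = c(".counts", ".meta"))
names(m)   # note date.counts and date.meta
```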

merge() — Join types

You want Call
Inner join (default) merge(x, y, by = "k")
Left join (keep all x) merge(x, y, by = "k", all.x = TRUE)
Right join (keep all y) merge(x, y, by = "k", all.y = TRUE)
Full outer join (keep all rows) merge(x, y, by = "k", all = TRUE)
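A quick sanity check of the four variants on two toy frames (x, y, k are placeholder names) — each join keeps a different set of keys:

```r
x <- data.frame(k = c("a", "b", "c"), vx = 1:3)
y <- data.frame(k = c("b", "c", "d"), vy = 4:6)

nrow(merge(x, y, by = "k"))                # 2 — inner: keys in both (b, c)
nrow(merge(x, y, by = "k", all.x = TRUE))  # 3 — left:  all keys of x
nrow(merge(x, y, by = "k", all.y = TRUE))  # 3 — right: all keys of y
nrow(merge(x, y, by = "k", all = TRUE))    # 4 — full:  union of keys
```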

Below we join two partial tables on a key — but Species is duplicated. How do we get a one-to-one match?

merge() — Live

Predict first. Below, dir1 has 100 rows (setosa + versicolor) and dir2 has 100 rows (versicolor + virginica), both keyed on Species. How many rows do you expect in merge(dir1, dir2, by = "Species")?

dir1 <- iris[1:100,   c("Sepal.Length", "Species")]
dir2 <- iris[51:150,  c("Sepal.Width",  "Species")]
# dir1: 50 setosa + 50 versicolor
# dir2: 50 versicolor + 50 virginica

Write a number down before opening the next tab.

# Partial views: dir1 has setosa + versicolor, dir2 has versicolor + virginica
dir1 <- iris[1:100, c("Sepal.Length","Species")]
dir2 <- iris[51:150, c("Sepal.Width","Species")]
merged <- merge(dir1, dir2, by = "Species")
nrow(merged)
head(merged, 10)
  • Result: Inner join on a duplicated key → only versicolor survives (it’s the sole species present in both), and its 50 rows in dir1 × 50 rows in dir2 produce a Cartesian 2500 rows.
  • Did your prediction match? If you guessed ~100, the Cartesian product is the lesson here.
  • Why? merge() pairs every matching row for duplicated keys.
dir1 <- iris[1:100, c("Sepal.Length","Species")]
dir2 <- iris[51:150, c("Sepal.Width","Species")]
# Give each row an index equal to its original iris row number
dir1$idx <- 1:nrow(dir1)                  # iris rows 1–100
dir2$idx <- (50 + 1):(50 + nrow(dir2))    # iris rows 51–150

# Now merge on both Species & idx
matched <- merge(dir1, dir2, by = c("Species", "idx"), all = TRUE)
head(matched, 10)
# Drop idx if you like
matched$idx <- NULL
  • Outcome: Exactly one-to-one pairing, recovering the intended alignment.

Try It: Shuffle rows of dir2 before merging—do you still get correct matches? Why? Challenge: Perform a full outer join (all = TRUE) on (Species, idx) and inspect NAs.

Other interesting functions — When & Why

Scenarios.

  • split() / unsplit() — “Run the same analysis per cell type / per cluster / per batch”: break a data frame into a list of data frames by a grouping factor, work on each piece, then glue back.
  • table() — “How many cells in each cluster? How many samples per treatment?”: fast cross-tabulation of one or several factors.
  • cut() — “Bin continuous expression / age / dose into low/med/high”: turn a numeric vector into a factor with chosen breaks.

Other interesting functions — Live

  • split() / unsplit() break data into subsets by factor, letting you apply arbitrary functions via lapply() — see Session 4 (Day 2) — then reassemble.
  • unsplit() recovers the original row order only if you split the whole data frame and pass the same factor back.
# Split the full data.frame (keep Species along for the ride)
spl   <- split(iris, iris$Species)
# e.g. compute per-species column means:
# lapply(spl, function(df) colMeans(df[, 1:4]))

# Reassemble — round-trips iris exactly
unspl <- unsplit(spl, iris$Species)
identical(unspl, iris)
  • table() quickly tabulates counts across factors.
  • cut() bins continuous variables into discrete intervals.
tbl <- table(iris$Species)
fl  <- cut(iris$Sepal.Length, breaks = 3)

tbl                    # counts per species
table(fl)              # counts per Sepal.Length bin
head(fl)               # the bin labels assigned to each row
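The “low/med/high” binning from the scenario needs one more argument — labels (the three label names here are just the example from the slide):

```r
# Equal-width bins with readable labels instead of interval-style ones
size <- cut(iris$Sepal.Length, breaks = 3,
            labels = c("low", "med", "high"))
table(size)
head(size)
```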

Transpose & sort — When & Why

Scenarios — both are everyday companions of plotting.

  • t() — Transpose. “My matrix has samples in columns but boxplot() / heatmap() / prcomp() expect them in rows (or vice versa).” t() swaps rows ↔︎ columns of a matrix or data frame (data frames are first coerced to a matrix — beware mixed types).
  • sort() — Sort one vector. “Order the genes by p-value before showing the top hits.” Returns the values in order.
  • order() — Get a row permutation. “Reorder a whole data frame by one (or several) columns.” Returns the indices that put values in order — use it to subset rows: df[order(df$x), ].
  • rank() — Replace values by their ranks (handy for non-parametric stats).

Plotting preview. boxplot(t(mat)) flips a sample×gene matrix to one box per sample. df[order(df$pvalue), ] puts the strongest hits at the top of a volcano-plot label list.

Transpose & sort — Live

  • t(x) swaps rows and columns. Result is always a matrix.
  • Data frames with mixed types get coerced — everything becomes character.
# Numeric part of iris: 150 rows (flowers) x 4 columns (measurements)
mat <- as.matrix(iris[, 1:4])
dim(mat)            # 150  4

# Transpose: 4 rows (measurements) x 150 columns (flowers)
tmat <- t(mat)
dim(tmat)           #   4 150
tmat[, 1:3]         # first three flowers as columns

# Round-trip
identical(mat, t(tmat))
  • sort(x) returns the values in order — useful for a vector.
  • order(x) returns the indices — use it to reorder a whole data frame.
x <- c(30, 10, 20)

sort(x)                   # values in order:        10 20 30
order(x)                  # indices that sort x:     2  3  1
x[order(x)]               # same as sort(x)

# Reorder iris by Sepal.Length (ascending)
head(iris[order(iris$Sepal.Length), ])

# Descending: negate numeric, or use decreasing = TRUE
head(iris[order(-iris$Sepal.Length), ])
head(iris[order(iris$Sepal.Length, decreasing = TRUE), ])

# Multi-key sort: Species, then Sepal.Length
head(iris[order(iris$Species, iris$Sepal.Length), ], 10)
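And rank(), mentioned above but not yet shown — it replaces each value by its position in the sorted order (ties get averaged ranks by default):

```r
p <- c(0.03, 0.20, 0.01)

sort(p)    # the values, ordered:   0.01 0.03 0.20
order(p)   # positions to take:     3 1 2
rank(p)    # each value's place:    2 3 1

# Ties receive the average of the ranks they span
rank(c(10, 20, 20))   # 1.0 2.5 2.5
```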

Summary — Three things to remember

If you only remember three things:

  1. I/O — match the reader to the file. Text formats (CSV, TSV) for sharing; binary formats (RDS, RData) for saving R objects. The reader must match the format, or you get silently mangled data.
  2. Reshaping — know which shape your analysis needs. Wide = one column per measurement; long = one row per measurement. Use explicit IDs (idvar) when you need to pivot back safely.
  3. Combining — dimensions vs. keys. cbind()/rbind() when dimensions already align; merge() when rows must be matched by a shared column. Duplicated keys → Cartesian product, not what you expect.

Summary — Quick reference

Topic             Function(s)                   Key point
CSV US/UK         read.csv() / write.csv()      "," separator, "." decimal
CSV DE/FR         read.csv2() / write.csv2()    ";" separator, "," decimal
Any tabular       read.table() / read.delim()   full control via sep, dec
Single R object   saveRDS() / readRDS()         custom name on load
Many R objects    save() / load()               names baked in
Wide → long       reshape(direction="long")     varying, idvar
Side-by-side      cbind()                       same row count
Stack rows        rbind()                       same column names
Join on key       merge()                       duplicated keys → Cartesian
Rotate matrix     t()                           result is always a matrix
Reorder rows      df[order(df$x), ]             use order(), not sort()
Strict compare    identical()                   type + attributes
Fuzzy compare     all.equal()                   numeric tolerance

Hands-on Exercise

Write iris to a file using German locale conventions, read it back, and verify the round-trip.

  1. write.table() with sep = ";", dec = ",", row.names = FALSE into "iris_de.csv" (in your project folder).
  2. Read back with read.csv2() (defaults: ; and ,).
  3. Compare with all.equal(iris, iris_back) — what about Species?

Tip

Hint: since R 4.0, text is read back as character, not factor. Rebuild with factor(iris_back$Species, levels = levels(iris$Species)) before comparing.

# Step 1: write iris with ; separator and , decimal
______

# Step 2: read it back with read.csv2()
iris_back <- ______

# Step 3: compare
all.equal(iris, iris_back)