Session 3 :: Data I/O and Reshaping
Bioinformatics Core Facility CECAD
2026-05-13
By the end of this session you will be able to:
- Read and write tabular and binary data with base R (read.csv / read.csv2, read.table, read.delim / read.delim2, saveRDS / readRDS, save / load).
- Reshape between wide and long with stack() / unstack() and reshape(), and explain why explicit IDs matter for round-trips.
- Choose between cbind() / rbind() vs. merge(), and predict the effect of duplicated keys on the join result.
- Explain why save.image() / load("workspace.RData") is discouraged for reproducible work.
In this session, we’ll build an understanding of base R functions for data input/output (I/O) and data reshaping using the iris dataset.
Beyond simply running code, we’ll discuss why you might choose one function over another, highlighting each function’s strengths and trade-offs.
We’ll do this interactively.
Recap — iris: built-in data frame, 150 × 5, columns Sepal.Length, Sepal.Width, Petal.Length, Petal.Width (numeric) and Species (factor: setosa, versicolor, virginica, 50 each).
I/O = Input / Output — reading data into R and writing results out of R.
Why persistent storage?
R keeps everything in RAM — fast, but temporary.
The moment you close R, any object you haven’t saved to disk is gone.
Saving to a file lets you:
The read / write cycle
┌─────────────────────────┐
│  File on disk           │
│  (CSV, TSV, RDS, …)     │
└────────┬────────────────┘
         │  read.csv() / readRDS() / …
         ▼
┌─────────────────────────┐
│  R object in memory     │
│  data.frame, list, …    │
└────────┬────────────────┘
         │  write.csv() / saveRDS() / …
         ▼
┌─────────────────────────┐
│  File on disk           │
└─────────────────────────┘
Text vs binary. Text formats (CSV, TSV) are human-readable and portable but larger and slower. Binary formats (RDS, RData) are compact and lossless but only R can read them directly. Choose based on who — or what — needs to open the file next.
Why this slide exists. Every I/O demo today writes a file, then reads it back. We need a sensible place to put those files — inside your project folder, using relative paths.
Two ideas to keep in mind:
| Function | What it does |
|---|---|
| getwd() | shows R’s current working directory |
| setwd("path") | changes it (avoid in scripts; use RStudio Projects) |
| file.path("data", "iris.csv") | builds a path that works on Windows, macOS, Linux |
| dir.create("data") | creates a sub-folder if it doesn’t exist yet |
Convention for today. Each demo writes into the current directory (or a data/ sub-folder you create once with dir.create("data")). You’ll see the files appear in RStudio’s Files pane — useful for understanding what actually happened on disk.
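A minimal sketch of this convention, assuming your R session is already sitting in your project folder. The file name data/iris.csv is only an illustration.

```r
# Where is R looking right now?
getwd()

# Create the data/ sub-folder once; showWarnings = FALSE keeps the call
# quiet if the folder already exists.
dir.create("data", showWarnings = FALSE)

# Build a relative path that is portable across operating systems
csv_path <- file.path("data", "iris.csv")
csv_path

# Write a file, then confirm that it really landed on disk
write.csv(iris, csv_path, row.names = FALSE)
file.exists(csv_path)
list.files("data")
```

Because the path is relative, the same script keeps working when the whole project folder is moved or shared.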
Scenario. A collaborator emails you a sample sheet. The US lab sends it with , separators and . decimals; the Munich lab sends ; separators and , decimals. Same data, two formats — base R has a dedicated reader for each.
Key parameters (shared by the whole read.* family):
| Arg | Meaning | read.csv | read.csv2 |
|---|---|---|---|
| file | path or URL | (no default) | (no default) |
| sep | field separator | "," | ";" |
| dec | decimal mark | "." | "," |
| header | first row = column names | TRUE | TRUE |
| na.strings | strings to treat as NA | "NA" | "NA" |
Decision rule. US/UK (, + .) → read.csv(). DE/FR (; + ,) → read.csv2(). Anything else → read.table() with explicit sep and dec.
Note the ; separator and , decimal in the raw file — that’s the whole point of read.csv2() (DE/FR locale).
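The difference between the two conventions is easiest to see in the raw text. The sketch below writes iris in both styles to temporary files (an assumption made here so that nothing is left behind in your project) and peeks at the first lines with readLines().

```r
csv_us <- tempfile(fileext = ".csv")   # temporary stand-in for a US/UK file
csv_de <- tempfile(fileext = ".csv")   # temporary stand-in for a DE/FR file

write.csv(iris,  csv_us, row.names = FALSE)  # "," separator, "." decimal
write.csv2(iris, csv_de, row.names = FALSE)  # ";" separator, "," decimal

# Look at the raw text: header line plus the first data row
readLines(csv_us, n = 2)
readLines(csv_de, n = 2)

# Each file needs its matching reader
us <- read.csv(csv_us)
de <- read.csv2(csv_de)

# Same data once parsed with the right function
all.equal(us, de)
```

Reading the DE/FR file with read.csv() instead would not raise an error; it would quietly give you one wide character column, which is why the decision rule matters.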
Predict first. We write iris (where Species is a factor) to CSV and read it back. Will identical(iris, iris_back) be TRUE?
Why? Since R 4.0, read.csv() no longer converts character columns to factors automatically (stringsAsFactors = FALSE is the new default). The data values are identical, but the type differs — which is enough to make identical() return FALSE.
Fix: rebuild the factor explicitly after loading:
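A minimal sketch of the round-trip and the fix, assuming R 4.0 or later and a temporary file in place of a file in your project folder:

```r
csv_path <- tempfile(fileext = ".csv")   # temporary file, an assumption of this sketch
write.csv(iris, csv_path, row.names = FALSE)
iris_back <- read.csv(csv_path)

# Species was a factor when written, but comes back as character
class(iris$Species)
class(iris_back$Species)
same_before_fix <- identical(iris, iris_back)
same_before_fix

# Rebuild the factor, reusing the original level order
iris_back$Species <- factor(iris_back$Species, levels = levels(iris$Species))
all.equal(iris, iris_back)
```

Passing levels = levels(iris$Species) explicitly keeps the level order independent of the order in which the species happen to appear in the file.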
Scenario. A counts matrix from a sequencing pipeline arrives as a TSV with comment lines at the top, missing values written as "-", and an unusual quoting style. The read.csv* shortcuts are too rigid; you need full control.
Key parameters beyond read.csv’s defaults:
| Arg | What it controls |
|---|---|
| sep | field separator ("\t", ";", " ", …) |
| quote | quoting characters (use "" to disable) |
| na.strings | values to treat as NA (e.g. c("NA", "-", ".")) |
| skip | number of lines to skip at the top of the file before reading starts |
| comment.char | everything from this character to the end of the line is ignored (read.table default "#"; read.csv disables it) |
| check.names | sanitise column names (set FALSE to keep raw IDs) |
Decision rule. read.delim()/read.delim2() = read.table() with sep="\t" baked in (period vs comma decimals). Use read.table() itself for anything more exotic.
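To make the scenario concrete, here is a small sketch with an invented file: the comment lines, gene names and counts are hypothetical and exist only to exercise the arguments from the table.

```r
tsv_path <- tempfile(fileext = ".tsv")   # temporary file, an assumption of this sketch

# A made-up pipeline output: comment lines, "-" for missing values,
# and column names that are not syntactically valid R names
writeLines(c(
  "# hypothetical pipeline output",
  "# generated for illustration only",
  "gene-id\tsample 1\tsample 2",
  "geneA\t10\t-",
  "geneB\t-\t7"
), tsv_path)

counts <- read.table(
  tsv_path,
  sep          = "\t",
  header       = TRUE,
  comment.char = "#",            # ignore the two comment lines
  na.strings   = c("NA", "-"),   # "-" means missing in this file
  quote        = "",             # no quoting in this file
  check.names  = FALSE           # keep "gene-id" and "sample 1" as written
)

names(counts)
str(counts)
```

With check.names = TRUE (the default) the column names would be rewritten to valid R names, which makes it harder to match them against the original sample sheet.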
# US-style: tab-separated, period decimal
# (iris.tsv is written here so that this chunk also runs on its own)
write.table(iris, "iris.tsv", sep = "\t", row.names = FALSE)
delim1 <- read.delim("iris.tsv")
# Write the same data in DE-style: tab-separated, comma decimal
write.table(iris, "iris_de.tsv",
            sep = "\t", dec = ",", row.names = FALSE)
# Read it back with the locale-aware shortcut
delim2 <- read.delim2("iris_de.tsv")
head(delim1, n = 2)
head(delim2, n = 2)
Scenario. You just finished a 30-minute QC pipeline on a large object (say, a SummarizedExperiment) and want to resume tomorrow without re-running everything. Binary formats are lossless and typically much faster to read and write than text.
Two flavours, one decision:
| Function pair | Saves | Variable name on load |
|---|---|---|
| saveRDS() / readRDS() | one object | you choose: x <- readRDS(...) |
| save() / load() | one or many objects + their names | fixed: original names reappear |
Key args: file (path), compress (TRUE/"gzip"/"xz").
Decision rule. Single object → saveRDS() (you pick the name on load, safer for reuse). Workspace / many objects with their names baked in → save().
# saveRDS / readRDS: a single object, no name preserved
saveRDS(iris, "iris.rds")
iris_rds <- readRDS("iris.rds")
# save / load: one or many objects, names preserved
save(iris, iris_rds, file = "iris.RData")
rm(iris_rds) # remove from session
load("iris.RData") # variables reappear with their original names
ls()
Warning
save.image() / load("workspace.RData") (and saving .RData on exit) are convenient but discouraged for reproducible work: hidden state silently comes back next session. Prefer scripts you can re-run from a clean R.
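One way to keep load() from silently overwriting objects is to load into a separate environment and inspect it first. The sketch below contrasts this with readRDS(); the object names a, b and flowers are placeholders chosen for this example.

```r
rds_path   <- tempfile(fileext = ".rds")     # temporary files, an assumption
rdata_path <- tempfile(fileext = ".RData")   # of this sketch

# saveRDS / readRDS: the name is chosen at load time
saveRDS(iris, rds_path)
flowers <- readRDS(rds_path)
identical(flowers, iris)        # binary round-trip is lossless

# save / load: names are stored in the file
a <- 1:3
b <- letters
save(a, b, file = rdata_path)

# Load into a fresh environment instead of the global workspace,
# so nothing that already exists gets overwritten
env <- new.env()
restored <- load(rdata_path, envir = env)
restored                        # load() returns the names it restored
ls(env)
identical(env$a, a)
```

Inspecting the environment before copying objects out makes the otherwise hidden behaviour of load() explicit.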
The same data can be arranged in two fundamentally different shapes:
Wide format — one row per subject, one column per measurement
flower Sepal.L Sepal.W Petal.L Petal.W
1 5.1 3.5 1.4 0.2
2 4.9 3.0 1.4 0.2
→ natural for data entry, spreadsheets, many statistical tests
Long format — one row per observation, a column naming what was measured
flower feature value
1 Sepal.L 5.1
1 Sepal.W 3.5
1 Petal.L 1.4
1 Petal.W 0.2
2 Sepal.L 4.9
...
→ required by boxplot(y ~ group), aov(), t.test() — base R’s formula interface
Why does this matter in the lab? A plate reader exports one column per well (wide). Base R’s boxplot(value ~ well, data = df) needs one row per measurement with a grouping column (long). Reshaping is the bridge.
Tool choice. Two base R tools for wide ↔︎ long: stack() / unstack() — quick, but relies on column order. reshape() — explicit IDs, reliable round-trips, more arguments. We’ll cover both.
stack() / unstack(): When & Why
Scenario. You have one numeric column per replicate (wide) and boxplot(value ~ group) or aov() wants one row per observation with a grouping column (long). stack() is the cheapest possible reshape when you don’t need to keep row identity.
Mental model. stack() melts numeric columns into two columns:
- values: all the numbers, concatenated column by column
- ind: a factor labelling which original column each value came from
Key args (rarely needed): select (columns to stack), drop (drop unused levels).
Decision rule. stack()/unstack() for quick wide↔︎long when row identity doesn’t matter. reshape() (next) when you need to round-trip reliably with extra ID columns.
stack() / unstack(): Live
Look & Predict → Run & Observe → Reversibility → Challenge. This sequence makes the hidden assumptions of stack() visible.
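A minimal sketch of that sequence on the four numeric columns of iris. The shuffling step is an illustration added here to expose the reliance on row order; the seed is arbitrary.

```r
wide <- iris[, 1:4]

# Wide -> long: two columns, values and ind
long <- stack(wide)
dim(long)            # 150 rows x 4 columns become 600 rows x 2 columns
names(long)
levels(long$ind)

# Long -> wide: works as long as the row order is untouched
back <- unstack(long)
all.equal(back, wide)

# Hidden assumption: unstack() rebuilds rows purely by position.
# After shuffling, each column still holds the same values,
# but they no longer line up flower by flower.
set.seed(1)
shuffled <- long[sample(nrow(long)), ]
back_shuffled <- unstack(shuffled)
isTRUE(all.equal(back_shuffled, wide))
identical(sort(back_shuffled$Sepal.Length), sort(wide$Sepal.Length))
```

Nothing in the long table records which flower a value belonged to, so there is no way for unstack() to repair the order.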
reshape(): When & Why
Scenario. Samples × measurements wide table → long form for aov() / t.test() / boxplot(), and you must pivot back even if rows get reordered. stack() can’t do that; reshape() can, because you give it explicit IDs.
The six named arguments, to be read in this order:
| Arg | What it controls |
|---|---|
| direction | "long" or "wide": which way you’re pivoting |
| varying | columns to melt (long) / columns to split (wide) |
| v.names | name of the new value column |
| timevar | name of the new key column (which-feature) |
| times | labels that go into timevar |
| idvar | columns that identify a row (anchor for round-trip) |
Decision rule. Use reshape() when (a) you need both directions, (b) you have non-measurement columns to carry, or (c) row order isn’t guaranteed. Otherwise stack() is shorter.
reshape(): Live
Heads-up: this is the densest call of the deck. Don’t try to absorb the six-argument call in one read; the tabs below walk through it.
Before you run the next tab: iris2 has 150 rows and 4 measurement columns. After reshaping to long form with one row per (flower × feature):
- How many rows will long have?
- What will the Feature column contain?
Commit to your answers before clicking Run.
iris2 <- iris
iris2$rowID <- seq_len(nrow(iris2))
long <- reshape(
  iris2,
  varying   = list(names(iris2)[1:4]),  # cols to melt into one
  v.names   = "Measurement",            # name of new value col
  timevar   = "Feature",                # name of new key col
  times     = names(iris2)[1:4],        # labels that fill timevar
  idvar     = c("rowID","Species"),     # row-identity anchor
  direction = "long"                    # "long" or "wide"
)
dim(long)
head(long, 5)
Reversibility test. We shuffle long so row order is destroyed, then reshape back to wide. With explicit idvars, the recovery should still work.
iris2 <- iris
iris2$rowID <- seq_len(nrow(iris2))
long <- reshape(
  iris2,
  varying   = list(names(iris2)[1:4]),
  v.names   = "Measurement",
  timevar   = "Feature",
  times     = names(iris2)[1:4],
  idvar     = c("rowID","Species"),
  direction = "long"
)
set.seed(42)
long_shuffled <- long[sample(nrow(long)), ]
wide <- reshape(
  long_shuffled,
  idvar     = c("rowID","Species"),
  timevar   = "Feature",
  direction = "wide"
)
head(wide, 3)
# Clean up auto-generated names and row order, then compare
names(wide) <- sub("^Measurement\\.", "", names(wide))
wide <- wide[, colnames(iris2)]
wide <- wide[order(wide$rowID), ]
rownames(wide) <- NULL
all.equal(wide, iris2)
Predict, then run. What happens if we drop Species from idvar? Does reshape still recover the original data? Why / why not?
iris2 <- iris
iris2$rowID <- seq_len(nrow(iris2))
long <- reshape(
  iris2,
  varying   = list(names(iris2)[1:4]),
  v.names   = "Measurement",
  timevar   = "Feature",
  times     = names(iris2)[1:4],
  idvar     = c("rowID","Species"),
  direction = "long"
)
set.seed(42)
long_shuffled <- long[sample(nrow(long)), ]
# idvar = rowID only: Species is no longer treated as a key
wide_bad <- try(reshape(
  long_shuffled,
  idvar     = "rowID",
  timevar   = "Feature",
  direction = "wide"
))
head(wide_bad, 3)
Why do stack()/unstack() fail after reordering? How does idvar rescue us? Compare stack() with reshape() + idvar.
identical() vs all.equal()
The next sections check round-trips. Two checkers, two purposes:
- identical(x, y): strict structural equality. Returns FALSE if attributes (e.g. row.names), types, or storage differ, even when the values look the same.
- all.equal(x, y): content-level check with numeric tolerance. Returns TRUE when values match within ~1.5e-8; otherwise a character vector describing the differences (use isTRUE(all.equal(...)) for a boolean).
Rule of thumb: use identical() to verify exact round-trips (binary I/O, reshape, key joins); use all.equal() when only the content must match (floating-point pipelines, text round-trips with type drift).
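The two checkers disagree in exactly the situations described above. A short sketch with small hand-made vectors (the values are chosen only to trigger each case):

```r
# Floating-point drift: 0.1 + 0.2 is not stored as exactly 0.3
x <- 0.1 + 0.2
y <- 0.3
identical(x, y)            # FALSE: the bits differ
all.equal(x, y)            # TRUE: within numeric tolerance

# Storage type: integer vs double holding the same numbers
a <- 1:3
b <- c(1, 2, 3)
identical(a, b)            # FALSE: integer vs double
all.equal(a, b)            # TRUE: same values

# all.equal() returns a description, not FALSE, when things differ
msg <- all.equal(1, 1.1)
msg
isTRUE(msg)                # FALSE: safe to use inside if()
```

Wrapping the call in isTRUE() matters because a character string inside if() would raise an error rather than evaluate to FALSE.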
cbind() / rbind() vs. merge(): When & Why
Scenario. You have a counts matrix and a sample sheet.
- Same rows in the same order → cbind() glues columns.
- Same columns → rbind() stacks rows.
- A shared key (sample_id, gene_id) → merge() does an SQL-style join.
Decision tree. Same rows in same order? → cbind. Same columns to append? → rbind. Match on a shared key? → merge. Mismatched keys with duplicates? → add an explicit idx before merge (see next slide).
cbind(): Live
Predict first. We rebuild iris by cbind-ing its column blocks back together. Before running:
- Will all.equal(cb, iris) return TRUE?
- Will identical(cb, iris) return TRUE?
cbind() rebuilds the data.frame from scratch, so minor attributes (e.g. row.names) can differ even when the visible content is the same. all.equal() ignores those; identical() doesn’t.
This chunk is supposed to break. We’re showing what cbind() does when row counts disagree; the error message is the lesson.
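A sketch of both steps. The column split (first two columns vs. the remaining three) and the 150-vs-100 row mismatch are choices made for this illustration; try() is used so the expected error does not stop the script.

```r
# Rebuild iris from two column blocks
left  <- iris[, 1:2]
right <- iris[, 3:5]
cb <- cbind(left, right)

all.equal(cb, iris)
identical(cb, iris)    # compare with your prediction

# Deliberately mismatched row counts: 150 rows vs 100 rows
res <- try(cbind(iris[1:150, 1:2], iris[1:100, 3:5]), silent = TRUE)
inherits(res, "try-error")
cat(res)               # read the error message
```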
cbind() expects equal row counts. It can’t recycle or drop.
rbind(): Live
Reversibility test. Split iris into upper/lower halves and rbind them back. Predict: does identical(rb, iris) return TRUE? If not, what one line would fix it?
rbind() appends rows but preserves original row names, so identical(rb, iris) is FALSE. Fix with rownames(rb) <- NULL.
This chunk is supposed to break. For data frames, rbind() requires the same set of column names (columns are matched by name, so their order may differ); the error message is the point.
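A sketch of the reversibility test and of the failure case. The renamed column SepalLen is a hypothetical typo introduced here to provoke the error.

```r
upper <- iris[1:75, ]
lower <- iris[76:150, ]

rb <- rbind(upper, lower)
all.equal(rb, iris)
identical(rb, iris)        # compare with your prediction
rownames(rb) <- NULL       # reset to automatic row names
identical(rb, iris)

# Data frames are matched by column name, so a different column
# order in the second piece is handled for you
rb_reordered <- rbind(upper, lower[, 5:1])
identical(names(rb_reordered), names(iris))

# A column name that does not match is an error
lower_renamed <- lower
names(lower_renamed)[1] <- "SepalLen"   # hypothetical typo
res <- try(rbind(upper, lower_renamed), silent = TRUE)
inherits(res, "try-error")
cat(res)
```

Matching by name applies to data frames only; for matrices rbind() binds by position.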
rbind() on data frames requires the same set of column names; columns are matched by name, so their order may differ.
merge(): When & Why
Scenario. Counts come from the pipeline, metadata comes from a separate sample sheet, and both share a sample_id column. merge() is the SQL-style join you reach for to bring them together.
Key arguments:
| Arg | Meaning |
|---|---|
| x, y | the two data frames |
| by | column(s) shared by both: the join key |
| by.x, by.y | use when the key column has different names in x and y |
| all | TRUE → full outer join (keep all rows of both) |
| all.x, all.y | left / right outer joins |
| suffixes | renames clashing non-key columns (default c(".x",".y")) |
merge(): Join types
| You want | Call |
|---|---|
| Inner join (default) | merge(x, y, by = "k") |
| Left join (keep all x) | merge(x, y, by = "k", all.x = TRUE) |
| Right join (keep all y) | merge(x, y, by = "k", all.y = TRUE) |
| Full outer join (keep all rows) | merge(x, y, by = "k", all = TRUE) |
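The four calls in the table can be compared side by side on two tiny data frames. The sample IDs, treatments and read counts below are invented for illustration.

```r
# Hypothetical sample sheet and counts table sharing the key column k
x <- data.frame(k = c("s1", "s2", "s3"),
                treatment = c("ctrl", "drug", "drug"))
y <- data.frame(k = c("s2", "s3", "s4"),
                reads = c(120L, 95L, 210L))

inner <- merge(x, y, by = "k")                 # only keys present in both
left  <- merge(x, y, by = "k", all.x = TRUE)   # every row of x
right <- merge(x, y, by = "k", all.y = TRUE)   # every row of y
full  <- merge(x, y, by = "k", all = TRUE)     # every row of both

sapply(list(inner = inner, left = left, right = right, full = full), nrow)

# Unmatched rows are padded with NA
left
full
```

Here the keys are unique in both tables, so each join returns at most one row per key.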
Below we join two partial tables on a key, but Species is duplicated. How do we get a one-to-one match?
merge(): Live
Predict first. Below, dir1 has 100 rows (setosa + versicolor) and dir2 has 100 rows (versicolor + virginica), both keyed on Species. How many rows do you expect in merge(dir1, dir2, by = "Species")?
dir1 <- iris[1:100, c("Sepal.Length", "Species")]
dir2 <- iris[51:150, c("Sepal.Width", "Species")]
# dir1: 50 setosa + 50 versicolor
# dir2: 50 versicolor + 50 virginica
Write a number down before turning the tab.
Only versicolor survives (it’s the sole species present in both), and its 50 rows in dir1 × 50 rows in dir2 produce a Cartesian product of 2500 rows.
merge() pairs every matching row for duplicated keys.
dir1 <- iris[1:100, c("Sepal.Length","Species")]
dir2 <- iris[51:150, c("Sepal.Width","Species")]
# Create an explicit index: each row's original position in iris
dir1$idx <- seq_len(nrow(dir1))        # 1..100
dir2$idx <- 50 + seq_len(nrow(dir2))   # 51..150
# Now merge on both Species & idx
matched <- merge(dir1, dir2, by = c("Species", "idx"), all = TRUE)
head(matched, 10)
# Drop idx if you like
matched$idx <- NULL
Try It: Shuffle the rows of dir2 before merging. Do you still get correct matches? Why?
Challenge: Perform a full outer join (all = TRUE) on (Species, idx) and inspect the NAs.
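The row counts behind this discussion can be checked directly. The sketch below repeats the setup so that it runs on its own, and it also shuffles dir2 (with an arbitrary seed) to show that the explicit key, not the row order, drives the match.

```r
dir1 <- iris[1:100,  c("Sepal.Length", "Species")]
dir2 <- iris[51:150, c("Sepal.Width",  "Species")]

# Duplicated key: every versicolor row pairs with every versicolor row
naive <- merge(dir1, dir2, by = "Species")
nrow(naive)                       # 50 x 50 = 2500

# Explicit index = original position in iris
dir1$idx <- 1:100
dir2$idx <- 51:150

# Shuffle dir2 AFTER the index is attached
set.seed(1)
dir2_shuffled <- dir2[sample(nrow(dir2)), ]

matched <- merge(dir1, dir2_shuffled, by = c("Species", "idx"), all = TRUE)
nrow(matched)                     # 50 setosa + 50 versicolor + 50 virginica
sum(complete.cases(matched))      # only versicolor has both measurements
```

The index has to be created before shuffling; attaching it afterwards would record the shuffled positions and pair the wrong flowers.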
Scenarios.
- split() / unsplit(): “Run the same analysis per cell type / per cluster / per batch”. Break a data frame into a list of data frames by a grouping factor, work on each piece, then glue back.
- table(): “How many cells in each cluster? How many samples per treatment?” Fast cross-tabulation of one or several factors.
- cut(): “Bin continuous expression / age / dose into low/med/high”. Turn a numeric vector into a factor with chosen breaks.
Work on each piece with lapply() (see Session 4, Day 2), then reassemble.
unsplit() recovers the original row order only if you split the whole data frame and pass the same factor back.
Scenarios. Both are everyday companions of plotting.
- t(): Transpose. “My matrix has samples in columns but boxplot() / heatmap() / prcomp() expect them in rows (or vice versa).” t() swaps rows ↔ columns of a matrix or data frame (data frames are first coerced to a matrix, so beware mixed types).
- sort(): Sort one vector. “Order the genes by p-value before showing the top hits.” Returns the values in order.
- order(): Get a row permutation. “Reorder a whole data frame by one (or several) columns.” Returns the indices that put values in order; use it to subset rows: df[order(df$x), ].
- rank(): Replace values by their ranks (handy for non-parametric stats).
Plotting preview. boxplot(t(mat)) flips a sample×gene matrix to one box per sample. df[order(df$pvalue), ] puts the strongest hits at the top of a volcano-plot label list.
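The warning about mixed types is worth seeing once. A small sketch on the first three rows of iris (the row selection is arbitrary), with a one-line rank() example added for completeness:

```r
# Numeric matrix: t() simply swaps the dimensions
mat <- as.matrix(iris[1:3, 1:4])
dim(mat)             # 3 rows, 4 columns
dim(t(mat))          # 4 rows, 3 columns

# Data frame with a factor column: coerced to a character matrix
mixed <- t(iris[1:3, ])
is.matrix(mixed)
typeof(mixed)        # "character": the numbers are now text
mixed

# rank() replaces each value by its position in the sorted vector
rank(c(30, 10, 20))  # 3 1 2
```

Dropping non-numeric columns before calling t() keeps the result numeric.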
- t(x) swaps rows and columns. Result is always a matrix.
- sort(x) returns the values in order, which is useful for a vector.
- order(x) returns the indices; use it to reorder a whole data frame.
x <- c(30, 10, 20)
sort(x) # values in order: 10 20 30
order(x) # indices that sort x: 2 3 1
x[order(x)] # same as sort(x)
# Reorder iris by Sepal.Length (ascending)
head(iris[order(iris$Sepal.Length), ])
# Descending: negate numeric, or use decreasing = TRUE
head(iris[order(-iris$Sepal.Length), ])
head(iris[order(iris$Sepal.Length, decreasing = TRUE), ])
# Multi-key sort: Species, then Sepal.Length
head(iris[order(iris$Species, iris$Sepal.Length), ], 10)
If you only remember three things:
- Reshape with explicit IDs (idvar) when you need to pivot back safely.
- Use cbind()/rbind() when dimensions already align; merge() when rows must be matched by a shared column. Duplicated keys → Cartesian product, not what you expect.
| Topic | Function(s) | Key point |
|---|---|---|
| CSV US/UK | read.csv() / write.csv() | , sep, . decimal |
| CSV DE/FR | read.csv2() / write.csv2() | ; sep, , decimal |
| Any tabular | read.table() / read.delim() | full control via sep, dec |
| Single R object | saveRDS() / readRDS() | custom name on load |
| Many R objects | save() / load() | names baked in |
| Wide → long | reshape(direction="long") | varying, idvar |
| Side-by-side | cbind() | same row count |
| Stack rows | rbind() | same column names (matched by name) |
| Join on key | merge() | duplicated keys → Cartesian |
| Rotate matrix | t() | result is always a matrix |
| Reorder rows | df[order(df$x), ] | use order(), not sort() |
| Strict compare | identical() | type + attributes |
| Fuzzy compare | all.equal() | numeric tolerance |
Write iris to a file using German locale conventions, read it back, and verify the round-trip.
1. Write iris with write.table() using sep = ";", dec = ",", row.names = FALSE into "iris_de.csv" (in your project folder).
2. Read it back with read.csv2() (defaults: ; and ,).
3. Compare with all.equal(iris, iris_back). What about Species?
Tip
Hint: since R 4.0, text is read back as character, not factor. Rebuild with factor(iris_back$Species, levels = levels(iris$Species)) before comparing.