R Beginners Course 2026

Session 3 :: Data I/O and Reshaping

Dr. Debasish Mukherjee, Dr. Ulrike Goebel, Dr. Ali Abdallah

Bioinformatics Core Facility CECAD

2026-05-13


Learning Objectives

By the end of this session you will be able to:

Choose the right base-R I/O function for a given file format and locale (read.csv / read.csv2, read.table, read.delim / read.delim2, saveRDS / readRDS, save / load).
Reshape data between wide and long form with stack() / unstack() and reshape(), and explain why explicit IDs matter for round-trips.
Combine tables with cbind() / rbind() vs. merge(), and predict the effect of duplicated keys on the join result.
Apply small reproducibility hygiene habits: keep files inside your project folder with relative paths; avoid save.image() / load("workspace.RData").

Data IO (read & write files)

In this session, we’ll build an understanding of base R functions for data input/output (I/O) and data reshaping, using the iris dataset.
Beyond simply running code, we’ll discuss why you might choose one function over another, and highlight their specific strengths and trade-offs where it matters.
We’ll do this interactively.

Recap — iris: built-in data frame, 150 × 5, columns Sepal.Length, Sepal.Width, Petal.Length, Petal.Width (numeric) and Species (factor: setosa, versicolor, virginica, 50 each).

What is I/O and why does it matter?

I/O = Input / Output — reading data into R and writing results out of R.

Why persistent storage?

R keeps everything in RAM — fast, but temporary.
The moment you close R, any object you haven’t saved to disk is gone.

Saving to a file lets you:

share results with collaborators
pick up a long analysis the next day
hand cleaned data to a downstream script or tool
archive what you submitted alongside a paper

The read / write cycle

  ┌─────────────────────────┐
  │   File on disk          │
  │   (CSV, TSV, RDS, …)    │
  └────────┬────────────────┘
           │  read.csv() / readRDS() / …
           ▼
  ┌─────────────────────────┐
  │   R object in memory    │
  │   data.frame, list, …   │
  └────────┬────────────────┘
           │  write.csv() / saveRDS() / …
           ▼
  ┌─────────────────────────┐
  │   File on disk          │
  └─────────────────────────┘

Text vs binary. Text formats (CSV, TSV) are human-readable and portable but larger and slower. Binary formats (RDS, RData) are compact and lossless but only R can read them directly. Choose based on who — or what — needs to open the file next.

Working directory & paths

Why this slide exists. Every I/O demo today writes a file, then reads it back. We need a sensible place to put those files — inside your project folder, using relative paths.

Two ideas to keep in mind:

Function	What it does
`getwd()`	shows R’s current working directory
`setwd("path")`	changes it (avoid in scripts — use RStudio Projects)
`file.path("data", "iris.csv")`	builds a path that works on Windows, macOS, Linux
`dir.create("data")`	creates a sub-folder if it doesn’t exist yet

getwd()                          # where are we right now?
file.path("data", "iris.csv")    # builds "data/iris.csv" portably

Convention for today. Each demo writes into the current directory (or a data/ sub-folder you create once with dir.create("data")). You’ll see the files appear in RStudio’s Files pane — useful for understanding what actually happened on disk.
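A minimal sketch of these helpers working together. It writes into R's temporary directory so it cannot clutter a real project; the folder name `data_demo` is a made-up example, and in your own project you would use a relative path such as `data/` instead.

```r
# Hypothetical sub-folder inside R's per-session temp directory
data_dir <- file.path(tempdir(), "data_demo")

# Create the folder only if it is missing (dir.create() warns if it exists)
if (!dir.exists(data_dir)) dir.create(data_dir)

# Build the file path portably, write, and confirm the file is on disk
csv_path <- file.path(data_dir, "iris.csv")
write.csv(iris, csv_path, row.names = FALSE)
file.exists(csv_path)
list.files(data_dir)
```

Checking `dir.exists()` first keeps the script re-runnable: the second run skips the creation step instead of raising a warning.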

CSV formats — When & Why

Scenario. A collaborator emails you a sample sheet. The US lab sends it with , separators and . decimals; the Munich lab sends ; separators and , decimals. Same data, two formats — base R has a dedicated reader for each.

Key parameters (shared by the whole read.* family):

Arg	Meaning	`read.csv`	`read.csv2`
`file`	path or URL	—	—
`sep`	field separator	`","`	`";"`
`dec`	decimal mark	`"."`	`","`
`header`	first row = column names	`TRUE`	`TRUE`
`na.strings`	strings to treat as `NA`	`"NA"`	`"NA"`

Decision rule. US/UK (, + .) → read.csv(). DE/FR (; + ,) → read.csv2(). Anything else → read.table() with explicit sep and dec.

# Write iris into your project folder, then read it back.
write.csv(iris, "iris.csv", row.names = FALSE)
iris_csv <- read.csv("iris.csv")
head(iris_csv, n = 6)

write.csv2(iris, "iris_de.csv", row.names = FALSE)
iris_csv2 <- read.csv2("iris_de.csv")

# Peek at the raw file to see the locale conventions
writeLines(readLines("iris_de.csv", n = 3))

head(iris_csv2, n = 6)

Note the ; separator and , decimal in the raw file — that’s the whole point of read.csv2() (DE/FR locale).
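The third branch of the decision rule ("anything else") can be sketched with a small made-up table that uses `|` as separator and `,` as decimal mark. The data frame, its column names, and the file are illustrative assumptions, not part of the course material.

```r
# Toy data (hypothetical sample names and concentrations)
df <- data.frame(sample = c("a", "b", "c"),
                 conc   = c(1.5, 2.25, 10))

# Write with an unusual separator and a comma decimal mark
path <- tempfile(fileext = ".txt")
write.table(df, path, sep = "|", dec = ",", row.names = FALSE)
writeLines(readLines(path))        # peek at the raw file

# Neither read.csv() nor read.csv2() fits: spell out sep and dec
back <- read.table(path, header = TRUE, sep = "|", dec = ",")
str(back)
```

Because `dec = ","` is passed explicitly, `conc` comes back as a numeric column rather than as text.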

Predict first. We write iris (where Species is a factor) to CSV and read it back. Will identical(iris, iris_back) be TRUE?

write.csv(iris, "iris.csv", row.names = FALSE)
iris_back <- read.csv("iris.csv")

# Compare types
class(iris$Species)       # "factor"
class(iris_back$Species)  # "character" — factor levels lost!

identical(iris, iris_back)  # FALSE — surprising?

Why? Since R 4.0, read.csv() no longer converts character columns to factors automatically (stringsAsFactors = FALSE is the new default). The data values are identical, but the type differs — which is enough to make identical() return FALSE.

Fix: rebuild the factor explicitly after loading:

iris_back$Species <- factor(iris_back$Species,
                            levels = levels(iris$Species))
identical(iris, iris_back)  # TRUE
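An alternative worth knowing is to ask the reader for factors at load time. This is a sketch, not a recommendation: `stringsAsFactors = TRUE` converts every character column, and the levels are derived from the file contents rather than taken from the original object.

```r
path <- tempfile(fileext = ".csv")
write.csv(iris, path, row.names = FALSE)

# Convert all text columns to factors while reading
iris_fct <- read.csv(path, stringsAsFactors = TRUE)
class(iris_fct$Species)
levels(iris_fct$Species)
```

For iris the derived levels happen to match the original ones. When the level order carries meaning (for example control before treatment), rebuilding the factor with explicit `levels` as shown above is the safer route.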

General Tabular I/O — When & Why

Scenario. A counts matrix from a sequencing pipeline arrives as a TSV with comment lines at the top, missing values written as "-", and an unusual quoting style. The read.csv* shortcuts are too rigid; you need full control.

Key parameters beyond read.csv’s defaults:

Arg	What it controls
`sep`	field separator (`"\t"`, `";"`, `" "`, …)
`quote`	quoting characters (use `""` to disable)
`na.strings`	values to treat as `NA` (e.g. `c("NA", "-", ".")`)
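`skip`	number of lines to skip before reading starts
`comment.char`	everything from this character to the end of the line is ignored (`"#"` by default in `read.table()`, disabled in `read.csv()` / `read.delim()`)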
`check.names`	sanitise column names (set `FALSE` to keep raw IDs)

Decision rule. read.delim()/read.delim2() = read.table() with sep="\t" baked in (period vs comma decimals). Use read.table() itself for anything more exotic.
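The scenario above can be sketched with a tiny made-up counts file. The file contents, gene IDs, and sample names are invented for illustration; only the `read.table()` arguments come from the table.

```r
# Build a small messy TSV: a comment line, "-" for missing values,
# and column names that are not syntactically valid R names
path <- tempfile(fileext = ".tsv")
writeLines(c(
  "# made-up pipeline header",
  "gene_id\tsample-1\tsample-2",
  "geneA\t10\t-",
  "geneB\t-\t7"
), path)

counts <- read.table(
  path,
  header       = TRUE,
  sep          = "\t",
  quote        = "",              # no quoting in this file
  na.strings   = c("NA", "-"),    # treat "-" as missing
  comment.char = "#",             # drop the comment line
  check.names  = FALSE            # keep "sample-1" as is
)
counts
names(counts)
```

With `check.names = TRUE` (the default) the sample columns would be renamed to `sample.1` and `sample.2`, which can break later matching against a sample sheet.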

General Tabular I/O — Live

read.table() and write.table()
read.delim() and read.delim2()

write.table(iris, "iris.tsv", sep = "\t", row.names = FALSE)
iris_tab <- read.table("iris.tsv", header = TRUE, sep = "\t")
head(iris_tab, n = 10)

# US-style: tab-separated, period decimal (reuses iris.tsv from previous tab)
delim1 <- read.delim("iris.tsv")

# Write the same data in DE-style: tab-separated, comma decimal
write.table(iris, "iris_de.tsv",
            sep = "\t", dec = ",", row.names = FALSE)

# Read it back with the locale-aware shortcut
delim2 <- read.delim2("iris_de.tsv")

head(delim1, n = 2)
head(delim2, n = 2)

R’s Internal Binary Formats — When & Why

Scenario. You just finished a 30-minute QC pipeline on a large object (say, a SummarizedExperiment) and want to resume tomorrow without re-running everything. Binary formats are lossless and typically much faster to read and write than text.

Two flavours, one decision:

Function pair	Saves	Variable name on load
`saveRDS()` / `readRDS()`	one object	you choose: `x <- readRDS(...)`
`save()` / `load()`	one or many objects + their names	fixed: original names reappear

Key args: file (path), compress (TRUE/"gzip"/"xz").

Decision rule. Single object → saveRDS() (you pick the name on load, safer for reuse). Workspace / many objects with their names baked in → save().

R’s Internal Binary Formats — Live

# saveRDS / readRDS: a single object, no name preserved
saveRDS(iris, "iris.rds")
iris_rds <- readRDS("iris.rds")

# save / load: one or many objects, names preserved
save(iris, iris_rds, file = "iris.RData")
rm(iris_rds)              # remove from session
load("iris.RData")        # variables reappear with their original names
ls()
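Two small follow-ups, sketched with throwaway objects (`a`, `b`, and the environment name `sandbox` are assumptions made for this example). First, a binary round-trip keeps the factor that the CSV round-trip lost. Second, `load()` can restore into a separate environment, so it cannot silently overwrite objects in your session.

```r
# 1. RDS round-trip: types and attributes survive
rds_path <- tempfile(fileext = ".rds")
saveRDS(iris, rds_path)
iris_again <- readRDS(rds_path)
identical(iris_again, iris)

# 2. load() into a sandbox environment instead of the global one
a <- 1:3
b <- "hello"
rdata_path <- tempfile(fileext = ".RData")
save(a, b, file = rdata_path)

sandbox  <- new.env()
restored <- load(rdata_path, envir = sandbox)
restored        # load() returns the names it restored
ls(sandbox)
```

Inspecting `restored` before copying anything out of `sandbox` shows exactly which names an `.RData` file would have placed in your workspace.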

Warning

save.image() / load("workspace.RData") (and saving .RData on exit) are convenient but discouraged for reproducible work: hidden state silently comes back next session. Prefer scripts you can re-run from a clean R.

What is data reshaping — and why?

The same data can be arranged in two fundamentally different shapes:

Wide format — one row per subject, one column per measurement

flower  Sepal.L  Sepal.W  Petal.L  Petal.W
1       5.1      3.5      1.4      0.2
2       4.9      3.0      1.4      0.2

→ natural for data entry, spreadsheets, many statistical tests

Long format — one row per observation, a column naming what was measured

flower  feature    value
1       Sepal.L    5.1
1       Sepal.W    3.5
1       Petal.L    1.4
1       Petal.W    0.2
2       Sepal.L    4.9
...

→ required by boxplot(y ~ group), aov(), t.test() — base R’s formula interface

Why does this matter in the lab? A plate reader exports one column per well (wide). Base R’s boxplot(value ~ well, data = df) needs one row per measurement with a grouping column (long). Reshaping is the bridge.

Tool choice. Two base R tools for wide ↔︎ long: stack() / unstack() is quick, but relies on the implicit row order. reshape() uses explicit IDs, gives reliable round-trips, and takes more arguments. We’ll cover both.

`stack()` / `unstack()` — When & Why

Scenario. You have one numeric column per replicate (wide) and boxplot(value ~ group) or aov() wants one row per observation with a grouping column (long). stack() is the cheapest possible reshape — when you don’t need to keep row identity.

Mental model. stack() melts numeric columns into two columns:

values — all the numbers, concatenated column-by-column
ind — a factor labelling which original column each value came from

Key args (rarely needed): select (columns to stack), drop (drop unused levels).

Decision rule. stack()/unstack() for quick wide↔︎long when row identity doesn’t matter. reshape() (next) when you need to round-trip reliably with extra ID columns.

`stack()` / `unstack()` — Live

Look & Predict → Run & Observe → Reversibility → Challenge — makes the hidden assumptions of stack() visible.

Look & Predict
Run & Observe
Reverse
Break

Predict first, then run. What will the first few rows of stacked look like?

stacked <- stack(iris[1:4])
head(stacked, 4)

Don’t run it — sketch the output on paper / in your head first. Switch to Run & Observe when ready.

Run the code, then examine:

Are values grouped by column (Sepal.Length then Sepal.Width, etc.)?
How does the ind factor map each measurement?

stacked <- stack(iris[1:4])
head(stacked, 8)

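Reverse. Does unstack() undo stack() when nothing has been reordered?

stacked <- stack(iris[1:4])
unstacked <- unstack(stacked)
identical(unstacked, iris[1:4])

Break. Shuffle the rows, then unstack. No error is raised (so try() has nothing to catch), but unstack() fills each column in the order the values now appear, so the values in one row no longer belong to the same flower.

stacked <- stack(iris[1:4])
set.seed(123)
stacked2 <- stacked[sample(nrow(stacked)), ]
unstacked2 <- try(unstack(stacked2), silent = TRUE)
unstacked2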
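Once data are stacked, the long form plugs straight into base R's formula interface. The sketch below uses a made-up plate-reader table (well names `A1`, `A2` and the readings are invented) and renames the default `values` / `ind` columns to something more descriptive.

```r
# Hypothetical plate-reader export: one column per well, three replicates
plate <- data.frame(A1 = c(0.11, 0.12, 0.10),
                    A2 = c(0.52, 0.49, 0.55))

# Wide -> long, then give the two columns meaningful names
long <- stack(plate)
names(long) <- c("value", "well")
long

# The formula interface now works: mean reading per well
well_means <- aggregate(value ~ well, data = long, FUN = mean)
well_means
```

The same `value ~ well` formula can be handed to `boxplot()` or `aov()` without further reshaping.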

`reshape()` — When & Why

Scenario. Samples × measurements wide table → long form for aov() / t.test() / boxplot() — and you must pivot back even if rows get reordered. stack() can’t do that; reshape() can, because you give it explicit IDs.

The six named arguments — read them in this order:

Arg	What it controls
`direction`	`"long"` or `"wide"` — which way you’re pivoting
`varying`	columns to melt (long) / columns to split (wide)
`v.names`	name of the new value column
`timevar`	name of the new key column (which-feature)
`times`	labels that go into `timevar`
`idvar`	columns that identify a row (anchor for round-trip)

Decision rule. Use reshape() when (a) you need both directions, (b) you have non-measurement columns to carry, or (c) row order isn’t guaranteed. Otherwise stack() is shorter.
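Before the full iris call, the six arguments can be tried on a two-row toy table. The column names `id`, `t0`, `t1`, `timepoint`, and `value` are assumptions chosen for this sketch.

```r
# Toy wide table: two samples, two measurement columns
wide <- data.frame(id = c("s1", "s2"),
                   t0 = c(1.0, 2.0),
                   t1 = c(1.5, 2.5))

long <- reshape(
  wide,
  direction = "long",
  varying   = list(c("t0", "t1")),   # columns to melt into one
  v.names   = "value",               # name of the new value column
  timevar   = "timepoint",           # name of the new key column
  times     = c("t0", "t1"),         # labels written into timepoint
  idvar     = "id"                   # identifies the original row
)
long
dim(long)
```

Each original row appears once per melted column, and `id` is what lets `reshape(direction = "wide")` put the pieces back together later.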

`reshape()` — Live

Heads-up: this is the densest call of the deck. Don’t try to absorb the six-argument call in one read; the tabs below walk through it.

Setup
Look & Predict
Run & Observe
Reverse
Break

iris2 <- iris                          # local copy
iris2$rowID <- seq_len(nrow(iris2))    # explicit row identifier
head(iris2, 3)

Before you run the next tab: iris2 has 150 rows and 4 measurement columns. After reshaping to long form with one row per (flower × feature):

How many rows will long have?
How many columns?
What will the new Feature column contain?

Commit to your three answers before clicking Run.

iris2 <- iris
iris2$rowID <- seq_len(nrow(iris2))
long <- reshape(
  iris2,
  varying   = list(names(iris2)[1:4]),  # cols to melt into one
  v.names   = "Measurement",            # name of new value col
  timevar   = "Feature",                # name of new key col
  times     = names(iris2)[1:4],        # labels that fill timevar
  idvar     = c("rowID","Species"),     # row-identity anchor
  direction = "long"                    # "long" or "wide"
)
dim(long)
head(long, 5)

Reversibility test. We shuffle long so row order is destroyed, then reshape back to wide. With explicit idvars, the recovery should still work.

iris2 <- iris
iris2$rowID <- seq_len(nrow(iris2))
long <- reshape(
  iris2,
  varying   = list(names(iris2)[1:4]),
  v.names   = "Measurement",
  timevar   = "Feature",
  times     = names(iris2)[1:4],
  idvar     = c("rowID","Species"),
  direction = "long"
)
set.seed(42)
long_shuffled <- long[sample(nrow(long)), ]

wide <- reshape(
  long_shuffled,
  idvar     = c("rowID","Species"),
  timevar   = "Feature",
  direction = "wide"
)
head(wide, 3)

# Clean up auto-generated names and row order, then compare
names(wide) <- sub("^Measurement\\.", "", names(wide))
wide <- wide[, colnames(iris2)]
wide <- wide[order(wide$rowID), ]
rownames(wide) <- NULL
all.equal(wide, iris2)

Predict, then run. What happens if we drop Species from idvar? Does reshape still recover the original data? Why / why not?

iris2 <- iris
iris2$rowID <- seq_len(nrow(iris2))
long <- reshape(
  iris2,
  varying   = list(names(iris2)[1:4]),
  v.names   = "Measurement",
  timevar   = "Feature",
  times     = names(iris2)[1:4],
  idvar     = c("rowID","Species"),
  direction = "long"
)
set.seed(42)
long_shuffled <- long[sample(nrow(long)), ]
# idvar = rowID only — Species is no longer treated as a key
wide_bad <- try(reshape(
  long_shuffled,
  idvar     = "rowID",
  timevar   = "Feature",
  direction = "wide"
))
head(wide_bad, 3)

Reshaping: Best Practice

Implicit vs. Explicit IDs: Why do stack()/unstack() fail after reordering? How does idvar rescue us?
When to Use Which:
- Quick two-column views → stack()
- Reliable round-trips with extra columns → reshape() + idvar

`identical()` vs `all.equal()`

The next sections check round-trips. Two checkers, two purposes:

identical(x, y) — strict structural equality. Returns FALSE if attributes (e.g. row.names), types, or storage differ — even when the values look the same.
all.equal(x, y) — content-level check with numeric tolerance. Returns TRUE when values match within ~1.5e-8; otherwise a character describing the diff (use isTRUE(all.equal(...)) for a boolean).

Rule of thumb: use identical() to verify exact round-trips (binary I/O, reshape, key joins); use all.equal() when only the content must match (floating-point pipelines, text round-trips with type drift).
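A minimal sketch of where the two checkers part ways, using plain numbers instead of data frames.

```r
# Floating-point arithmetic: same value on paper, different bits
x <- 0.1 + 0.2
y <- 0.3
identical(x, y)            # strict comparison
all.equal(x, y)            # comparison with tolerance

# Same value, different storage type (integer vs double)
identical(1L, 1)
all.equal(1L, 1)

# all.equal() returns a description, not FALSE, when values differ
res <- all.equal(1, 1.1)
res
isTRUE(res)                # wrap in isTRUE() to get a clean logical
```

This is why `if (all.equal(a, b))` is fragile: on a mismatch the condition receives a character string, so `isTRUE(all.equal(a, b))` is the safe form.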

`cbind()` / `rbind()` vs. `merge()` — When & Why

Scenario. You have a counts matrix and a sample sheet.

Already aligned (same rows, same order) → cbind() glues columns.
Two batches with the same columns → rbind() stacks rows.
Alignment must come from a key column (sample_id, gene_id) → merge() does an SQL-style join.

Decision tree. Same rows in same order? → cbind. Same columns to append? → rbind. Match on a shared key? → merge. Mismatched keys with duplicates? → add an explicit idx before merge (see the merge() live demo).
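The first and third branches can be contrasted on a made-up counts table and sample sheet whose rows are in different orders. All names and values here are invented for illustration.

```r
# Hypothetical data: same samples, different row order
counts <- data.frame(sample_id = c("s1", "s2", "s3"),
                     count     = c(10, 20, 30))
sheet  <- data.frame(sample_id = c("s3", "s1", "s2"),
                     group     = c("ctrl", "trt", "trt"))

# cbind() pairs rows by position: s1 silently receives s3's group
glued <- cbind(counts, group = sheet$group)
glued

# merge() pairs rows by key: s1 gets its own group
joined <- merge(counts, sheet, by = "sample_id")
joined

# Alternative: reorder the sheet with match(), then cbind() is safe
sheet_aligned <- sheet[match(counts$sample_id, sheet$sample_id), ]
sheet_aligned
```

`cbind()` raises no error in the misaligned case, which is what makes it risky whenever the row order of the two tables has not been verified.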

`cbind()` — Live

cbind() — Look & Predict / Run & Observe
Mismatched Rows — Break

Predict first. We rebuild iris by cbind-ing its column blocks back together. Before running:

Will all.equal(cb, iris) return TRUE?
Will identical(cb, iris) return TRUE?
If the two checkers disagree — why?

# Combine first four columns back-to-back plus Species
cb <- cbind(
  iris[, 1:2],
  iris[, 3:4],
  Species = iris$Species
)
head(cb)

# Try BOTH checkers — do they agree?
all.equal(cb, iris)   # content-level (with tolerance)
identical(cb, iris)   # strict (attributes + types)

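Do they disagree? all.equal() compares content with a numeric tolerance; identical() additionally requires types and the internal representation of attributes (e.g. row.names) to match exactly. If identical() also returns TRUE here, cbind() has rebuilt the data frame with the same attributes as iris, and the two checkers agree.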

This chunk is supposed to break. We’re showing what cbind() does when row counts disagree — the error message is the lesson.

short <- iris[-(141:150), ]
tryCatch(
  cbind(short, iris),
  error = function(e) conditionMessage(e)
)

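❌ Error: "arguments imply differing number of rows: 140, 150". For data frames, cbind() needs compatible row counts; 140 rows cannot be recycled evenly to 150, and nothing is dropped to make them fit.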

`rbind()` — Live

rbind() — Reverse
Mismatched Columns — Break

Reversibility test. Split iris into upper/lower halves and rbind them back. Predict: does identical(rb, iris) return TRUE? If not, what one line would fix it?

upper <- iris[1:75, ]
lower <- iris[76:150, ]
rb <- rbind(upper, lower)
identical(rb, iris)

rbind() appends rows but preserves original row names, so identical(rb, iris) is FALSE. Fix with rownames(rb) <- NULL.

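This chunk looks like it should break, but the first call does not. For data frames, rbind() matches columns by name, so a different column order is realigned to the first argument. It fails only when the names themselves differ.

upper <- iris[1:75, ]
swapped <- iris[76:150, c("Species","Sepal.Length","Sepal.Width","Petal.Length","Petal.Width")]
rb_swapped <- rbind(upper, swapped)   # works: columns are matched by name
names(rb_swapped)                     # column order follows `upper`

# Rename one column to trigger the real failure
renamed <- swapped
names(renamed)[1] <- "species"        # lower case: no longer matches "Species"
tryCatch(
  rbind(upper, renamed),
  error = function(e) conditionMessage(e)
)

❌ Error: "names do not match previous names". rbind() on data frames needs the same set of column names; column order alone does not matter.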

`merge()` — When & Why

Scenario. Counts come from the pipeline, metadata comes from a separate sample sheet, and both share a sample_id column. merge() is the SQL‑style join you reach for to bring them together.

Key arguments:

Arg	Meaning
`x`, `y`	the two data frames
`by`	column(s) shared by both — the join key
`by.x`, `by.y`	use when the key column has different names in x and y
`all`	`TRUE` → full outer join (keep all rows of both)
`all.x`, `all.y`	left / right outer joins
`suffixes`	renames clashing non-key columns (default `c(".x",".y")`)

`merge()` — Join types

You want	Call
Inner join (default)	`merge(x, y, by = "k")`
Left join (keep all x)	`merge(x, y, by = "k", all.x = TRUE)`
Right join (keep all y)	`merge(x, y, by = "k", all.y = TRUE)`
Full outer join (keep all rows)	`merge(x, y, by = "k", all = TRUE)`
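The four calls can be compared on a toy pair of tables with unique keys. The key column `k` follows the table above; the value columns and their contents are made up.

```r
# Keys a, b, c on the left; b, c, d on the right
x <- data.frame(k = c("a", "b", "c"), x_val = 1:3)
y <- data.frame(k = c("b", "c", "d"), y_val = c(20, 30, 40))

inner <- merge(x, y, by = "k")                 # keys in both: b, c
left  <- merge(x, y, by = "k", all.x = TRUE)   # all of x: a, b, c
right <- merge(x, y, by = "k", all.y = TRUE)   # all of y: b, c, d
full  <- merge(x, y, by = "k", all = TRUE)     # everything: a, b, c, d

c(inner = nrow(inner), left = nrow(left),
  right = nrow(right), full = nrow(full))
full                                           # unmatched cells are NA
```

Rows without a partner are kept only in the outer joins, and their missing side is filled with `NA`.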

Below we join two partial tables on a key — but Species is duplicated. How do we get a one-to-one match?

`merge()` — Live

Predict
Inner Join (with Duplicates)
One-to-One Matching (via ID)

Predict first. Below, dir1 has 100 rows (setosa + versicolor) and dir2 has 100 rows (versicolor + virginica), both keyed on Species. How many rows do you expect in merge(dir1, dir2, by = "Species")?

dir1 <- iris[1:100,   c("Sepal.Length", "Species")]
dir2 <- iris[51:150,  c("Sepal.Width",  "Species")]
# dir1: 50 setosa + 50 versicolor
# dir2: 50 versicolor + 50 virginica

Write a number down before turning the tab.

# Partial views: dir1 has setosa + versicolor, dir2 has versicolor + virginica
dir1 <- iris[1:100, c("Sepal.Length","Species")]
dir2 <- iris[51:150, c("Sepal.Width","Species")]
merged <- merge(dir1, dir2, by = "Species")
nrow(merged)
head(merged, 10)

Result: Inner join on a duplicated key → only versicolor survives (it’s the sole species present in both), and its 50 rows in dir1 × 50 rows in dir2 produce a Cartesian 2500 rows.
Did your prediction match? If you guessed ~100, the Cartesian product is the lesson here.
Why? merge() pairs every matching row for duplicated keys.
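The row count of such a join can be predicted before running it. This sketch counts how often each key occurs on each side and multiplies the counts; the variable names `n1`, `n2`, and `expected_rows` are assumptions for the example.

```r
dir1 <- iris[1:100,  c("Sepal.Length", "Species")]
dir2 <- iris[51:150, c("Sepal.Width",  "Species")]

# Is the key duplicated at all? (0 would mean "no duplicates")
anyDuplicated(dir1$Species) > 0

# Occurrences of each key on both sides
n1 <- table(dir1$Species)
n2 <- table(dir2$Species)
rbind(dir1 = n1, dir2 = n2)

# Inner join size = sum over keys of (count in dir1 * count in dir2)
expected_rows <- sum(as.vector(n1) * as.vector(n2))
expected_rows
nrow(merge(dir1, dir2, by = "Species"))
```

Running a check like this before a merge on real data is a cheap way to catch an unintended many-to-many join.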

dir1 <- iris[1:100, c("Sepal.Length","Species")]
dir2 <- iris[51:150, c("Sepal.Width","Species")]
# Row index = position in iris, so the same flower gets the same idx in both tables
dir1$idx <- seq_len(nrow(dir1))          # 1..100
dir2$idx <- 50 + seq_len(nrow(dir2))     # 51..150

# Now merge on both Species & idx (inner join: only flowers present in both)
matched <- merge(dir1, dir2, by = c("Species", "idx"))
nrow(matched)
head(matched, 10)
# Drop idx if you like
matched$idx <- NULL

Outcome: Exactly one-to-one pairing, recovering the intended alignment.

Try It: Shuffle the rows of dir2 (after adding idx) before merging. Do you still get correct matches? Why?
Challenge: Perform a full outer join (all = TRUE) on (Species, idx) and inspect the NAs.

Other interesting functions — When & Why

Scenarios.

split() / unsplit() — “Run the same analysis per cell type / per cluster / per batch”: break a data frame into a list-of-data-frames by a grouping factor, work on each piece, then glue back.
table() — “How many cells in each cluster? How many samples per treatment?”: fast cross-tabulation of one or several factors.
cut() — “Bin continuous expression / age / dose into low/med/high”: turn a numeric vector into a factor with chosen breaks.

Other interesting functions — Live

split() and unsplit()
table() and cut()

split() / unsplit() break data into subsets by factor, letting you apply arbitrary functions via lapply() — see Session 4 (Day 2) — then reassemble.
unsplit() recovers the original row order only if you split the whole data frame and pass the same factor back.

# Split the full data.frame (keep Species along for the ride)
spl   <- split(iris, iris$Species)
# e.g. compute per-species column means:
# lapply(spl, function(df) colMeans(df[, 1:4]))

# Reassemble in the original row order, then check the round-trip
unspl <- unsplit(spl, iris$Species)
identical(unspl, iris)

table() quickly tabulates counts across factors.
cut() bins continuous variables into discrete intervals.

tbl <- table(iris$Species)
fl  <- cut(iris$Sepal.Length, breaks = 3)

tbl                    # counts per species
table(fl)              # counts per Sepal.Length bin
head(fl)               # the bin labels assigned to each row
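A short sketch combining the three helpers. The break points and the labels `low` / `med` / `high` are arbitrary choices for illustration, picked so that every iris value falls into a bin.

```r
# split() + sapply(): one summary per group
sl_by_species <- split(iris$Sepal.Length, iris$Species)
sapply(sl_by_species, mean)

# cut() with explicit breaks and labels instead of breaks = 3
size <- cut(iris$Sepal.Length,
            breaks = c(4, 5.5, 6.5, 8),          # (4,5.5], (5.5,6.5], (6.5,8]
            labels = c("low", "med", "high"))

# table() with two factors gives a cross-tabulation
size_by_species <- table(size, iris$Species)
size_by_species
```

Explicit breaks keep the bins stable when new data arrive, whereas `breaks = 3` recomputes the intervals from the range of whatever vector it is given.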

Transpose & sort — When & Why

Scenarios — both are everyday companions of plotting.

t() — Transpose. “My matrix has samples in columns but boxplot() / heatmap() / prcomp() expect them in rows (or vice versa).” t() swaps rows ↔︎ columns of a matrix or data frame (data frames are first coerced to a matrix — beware mixed types).
sort() — Sort one vector. “Order the genes by p-value before showing the top hits.” Returns the values in order.
order() — Get a row permutation. “Reorder a whole data frame by one (or several) columns.” Returns the indices that put values in order — use it to subset rows: df[order(df$x), ].
rank() — Replace values by their ranks (handy for non-parametric stats).

Plotting preview. boxplot(t(mat)) flips a sample×gene matrix to one box per sample. df[order(df$pvalue), ] puts the strongest hits at the top of a volcano-plot label list.

Transpose & sort — Live

t() — Transpose
sort() vs order()

t(x) swaps rows and columns. Result is always a matrix.
Data frames with mixed types get coerced — everything becomes character.

# Numeric part of iris: 150 rows (flowers) x 4 columns (measurements)
mat <- as.matrix(iris[, 1:4])
dim(mat)            # 150  4

# Transpose: 4 rows (measurements) x 150 columns (flowers)
tmat <- t(mat)
dim(tmat)           #   4 150
tmat[, 1:3]         # first three flowers as columns

# Round-trip
identical(mat, t(tmat))
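The coercion warning above is easy to see on two rows of iris. This sketch compares transposing the whole data frame (numeric columns plus the Species factor) with transposing only the numeric part.

```r
# Mixed types: the factor column forces everything to character
mixed <- t(head(iris, 2))
class(mixed)
typeof(mixed)
mixed

# Numeric columns only: the result stays numeric
numeric_only <- t(as.matrix(iris[1:2, 1:4]))
typeof(numeric_only)
```

Selecting the numeric columns before calling `t()` avoids having to convert strings back to numbers afterwards.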

sort(x) returns the values in order — useful for a vector.
order(x) returns the indices — use it to reorder a whole data frame.

x <- c(30, 10, 20)

sort(x)                   # values in order:        10 20 30
order(x)                  # indices that sort x:     2  3  1
x[order(x)]               # same as sort(x)

# Reorder iris by Sepal.Length (ascending)
head(iris[order(iris$Sepal.Length), ])

# Descending: negate numeric, or use decreasing = TRUE
head(iris[order(-iris$Sepal.Length), ])
head(iris[order(iris$Sepal.Length, decreasing = TRUE), ])

# Multi-key sort: Species, then Sepal.Length
head(iris[order(iris$Species, iris$Sepal.Length), ], 10)
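`rank()` was mentioned above but not demonstrated. A minimal sketch, plus a mixed-direction sort (ascending on one key, descending on another); the object name `top` is an assumption for the example.

```r
x <- c(30, 10, 20)
rank(x)                              # position of each value: 3 1 2

# Ties share the average rank by default
rank(c(10, 20, 20))                  # 1.0 2.5 2.5
rank(c(10, 20, 20), ties.method = "min")   # 1 2 2

# Mixed direction: Species ascending, Sepal.Length descending
top <- iris[order(iris$Species, -iris$Sepal.Length), ]
head(top, 3)
```

Negating a column works only for numeric keys; for character or factor keys, `order()` accepts a vector for `decreasing` when `method = "radix"` is used.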

Summary — Three things to remember

If you only remember three things:

I/O — match the reader to the file. Text formats (CSV, TSV) for sharing; binary formats (RDS, RData) for saving R objects. The reader must match the format or you get silent errors.
Reshaping — know which shape your analysis needs. Wide = one column per measurement; long = one row per measurement. Use explicit IDs (idvar) when you need to pivot back safely.
Combining — dimensions vs. keys. cbind()/rbind() when dimensions already align; merge() when rows must be matched by a shared column. Duplicated keys → Cartesian product, not what you expect.

Summary — Quick reference

Topic	Function(s)	Key point
CSV US/UK	`read.csv()` / `write.csv()`	`,` sep, `.` decimal
CSV DE/FR	`read.csv2()` / `write.csv2()`	`;` sep, `,` decimal
Any tabular	`read.table()` / `read.delim()`	full control via `sep`, `dec`
Single R object	`saveRDS()` / `readRDS()`	custom name on load
Many R objects	`save()` / `load()`	names baked in
Wide → long	`reshape(direction="long")`	`varying`, `idvar`
Side-by-side	`cbind()`	same row count
Stack rows	`rbind()`	same column names (matched by name)
Join on key	`merge()`	duplicated keys → Cartesian
Rotate matrix	`t()`	result is always a matrix
Reorder rows	`df[order(df$x), ]`	use `order()`, not `sort()`
Strict compare	`identical()`	type + attributes
Fuzzy compare	`all.equal()`	numeric tolerance

Hands-on Exercise

Write iris to a file using German locale conventions, read it back, and verify the round-trip.

write.table() with sep = ";", dec = ",", row.names = FALSE into "iris_de.csv" (in your project folder).
Read back with read.csv2() (defaults: ; and ,).
Compare with all.equal(iris, iris_back) — what about Species?

Tip

Hint: since R 4.0, text is read back as character, not factor. Rebuild with factor(iris_back$Species, levels = levels(iris$Species)) before comparing.

# Step 1: write iris with ; separator and , decimal
______

# Step 2: read it back with read.csv2()
iris_back <- ______

# Step 3: compare
all.equal(iris, iris_back)
