
Design-to-Discovery: RNA-seq Differential Expression with Negative Binomial GLMs

From sequencing experiments to dispersion-aware inference

Dr. Ali T. Abdallah

Working with R (Advanced Course)

Advanced Analysis Teaching Series • November 2025

© 2025 Dr. Ali T. Abdallah. All rights reserved.


Overview: The Journey Ahead

🎯 The Challenge You Face

You have 20,000 genes and 3 replicates per condition. Which genes truly changed? Without the right model, you'll miss real changes or report false positives.

Our Logical Path
🧬 Biology → 🔢 Counts → 📊 Model → 🎯 Answers

⏱️ ~45-60 min | Step-by-step | No advanced math required

Payoff: Understand WHY and HOW DESeq2 works!
Figure 1: Overview of the flow from biology to statistical inference
Challenge

Why We Need a Statistical Model

  • RNA-seq produces discrete read counts shaped jointly by biology and experimental procedures. For example: Gene A shows 100, 95, and 180 reads across three replicates. Why the jump to 180?
  • Even under careful protocols, replicate samples show some level of disagreement due to biological variation (cell-type composition, individual heterogeneity, transcriptional bursting) and technical factors (capture efficiency, library depth, prep variations).
  • To distinguish real biological differences from noise, we need a model that correctly describes the process generating the observed counts.
  • Starting from the simplest building block-capturing one molecule-we'll see how a suitable model emerges naturally from biology and chemistry.
Figure 2: Replicate variability in RNA-seq counts
Key Insight

From Biology to Statistical Model: The Conceptual Flow

🧬 Biology: One RNA molecule → Success or failure
🔢 Many Molecules: Binomial → Poisson
⚠️ Reality Check: Biological + Technical variation
📊 Statistical Model: Gamma-Poisson mixture
🎯 Regression Framework: GLM for comparison

The Key Principle

Every step in our statistical model corresponds to a biological or experimental reality.

We're not arbitrarily choosing distributions; we're attempting to derive them from the data-generating process!
Challenge

Why Not Just Use a t-Test?

Natural first thought: To compare control vs treatment, why not just run a t-test on the counts?

Problem 1: Wrong Distribution Assumption
  • The t-test requires continuous, normally distributed data with constant variance.
  • RNA-seq values are discrete counts that show strong mean–variance dependence, so these assumptions do not hold.
  • Normalization alone cannot fix this; specialized methods (e.g., VST, limma-voom) must model the mean–variance relationship first.
Problem 2: Mean–Variance Relationship
  • The t-test assumes constant variance across genes and groups.
  • RNA-seq data violate this: variance increases with the mean, a pattern the negative binomial model captures.
  • Low-count genes have high relative noise, while high-count genes have greater absolute variance.
Problem 3: Ignores the Counting Process
  • RNA-seq counts follow a well-characterized biological and technical sampling process (next slides).
  • A simple t-test ignores the count nature of the data and therefore loses statistical power.
💡 Solution

We need a model that respects the count-generating process. Let's build it step-by-step...

💡 Key Takeaway: t-tests fail for RNA-seq because counts are discrete and variance depends on the mean; a count model lets us describe the biological process directly.
Figure 3: Mean-dependent variance in RNA-seq (simulated)
Building Block

Step 1 — A Bernoulli Experiment

🧬 Biological Reality: Following One mRNA Molecule

Imagine tracking a single FOXO3A mRNA in your cell sample. During RNA-seq, this molecule either gets successfully captured and sequenced → Success (X=1), or gets lost somewhere in the process → Failure (X=0).

Key Idea: One RNA molecule → Either captured (X=1) or lost (X=0)

$$X \sim \text{Bernoulli}(p_g)$$
$$P(X = 1) = p_g \quad | \quad P(X = 0) = 1 - p_g$$
What is pg? A Concrete Example

The overall success probability pg reflects the combined efficiency of all experimental steps:

  • RNA extraction: 90% of FOXO3A molecules survive cell lysis
  • Reverse transcription: 80% get converted to cDNA
  • PCR amplification: 95% amplify successfully
  • Capture chemistry: 85% bind to sequencing adapters
  • Sequencing instrument: 75% generate readable sequences

Overall pFOXO3A = 0.90 × 0.80 × 0.95 × 0.85 × 0.75 ≈ 0.43
So each FOXO3A molecule has ~43% chance of being counted!

Ideal Scenario Assumption

In a perfectly homogeneous world, each molecule of a given gene would behave as an independent Bernoulli trial with the same gene-specific success probability pg across all molecules and all replicates.

Note: This assumption breaks down in reality, which we'll address in later slides.

Why This Matters: From Molecules to Statistics

This binary outcome is the fundamental building block behind count modeling. Every counted read started as a single molecule that succeeded in this Bernoulli trial.

🔗 Biological Bridge:
• If you see 100 FOXO3A reads in your data
• This represents ~233 original FOXO3A molecules (100 ÷ 0.43)
• The other ~133 molecules were lost during processing
Statistical modeling lets us account for this loss!
💡 Key Takeaway: Each RNA molecule (like our FOXO3A example) is an independent Bernoulli trial. The success probability pg varies by gene based on sequence properties and expression level.
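To make the trial concrete, here is a minimal R sketch (the step efficiencies are the illustrative teaching numbers from this slide, not measured values) that multiplies them into pg and simulates a few molecules:

# Illustrative sketch: per-molecule capture as a Bernoulli trial
set.seed(42)

# Step efficiencies from the FOXO3A example above (illustrative values)
efficiencies <- c(extraction = 0.90, rev_transcription = 0.80,
                  pcr = 0.95, adapter_capture = 0.85, sequencing = 0.75)
p_g <- prod(efficiencies)  # overall per-molecule success probability, ~0.44
p_g

# Simulate 10 individual molecules: 1 = counted, 0 = lost
rbinom(10, size = 1, prob = p_g)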
Figure 4: Bernoulli outcome for a single molecule (success vs failure probabilities)
Building Block

Step 2 — Binomial Counts

🧬 From One Molecule to Many

Your sample contains n = 500 FOXO3A molecules. Each has a 43% chance of being captured. How many reads will you observe? This follows a Binomial distribution.

Key Idea: Now imagine n independent molecules for the same gene, each with capture probability pg.

Y ~ Binomial(n, pg)
Total reads = sum of n independent Bernoulli trials
What does the Binomial capture? A Concrete Example
  • n = 500: Number of FOXO3A molecules in your sample
  • pFOXO3A = 0.43: Per-molecule capture probability (from the previous slide)
  • Both parameters determine the distribution of read counts

Expected reads = n × p = 500 × 0.43 = 215 reads
But you might observe anywhere from ~180 to ~250 reads due to random sampling variation!

The Ideal Scenario (Rarely True!)

In a perfect world, every replicate would share the same n and pg for a gene, giving identical expected counts.

⚠️ Reality Check:
• Replicate 1: 500 FOXO3A molecules → ~215 reads
• Replicate 2: 480 molecules (biological variation!) → ~206 reads
• Replicate 3: 520 molecules → ~224 reads
The assumption breaks down in real experiments!
Why move to Poisson? The Practical Problem

In RNA-seq, n (total molecules) is unknown and unmeasurable. We can't directly fit a Binomial model.

❓ The Challenge:
You observe 215 FOXO3A reads, but you don't know:
• Were there 500 molecules with p=0.43?
• Or 1000 molecules with p=0.215?
• Or 300 molecules with p=0.717?
All give the same expected count! We can't separate n from p.

Solution: Approximate with Poisson using λ = n·pg = 215 (next slide).

💡 Key Takeaway: The Binomial models counts from n independent trials (500 FOXO3A molecules), but we need to simplify since n is unknown in RNA-seq. We only observe the product λ = n·p.
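A quick simulation makes the sampling variation visible. This sketch reuses the illustrative n = 500 and p = 0.43 from this slide:

# Illustrative sketch: replicate read counts as Binomial draws
set.seed(1)
n_molecules <- 500   # FOXO3A molecules in the sample (illustrative)
p_capture   <- 0.43  # per-molecule capture probability (illustrative)

rbinom(3, size = n_molecules, prob = p_capture)  # three replicate counts near 215
n_molecules * p_capture                          # expected count: 215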
Figure 5: Binomial distribution of read counts (histogram with theoretical curve)
Building Block

Step 3 — Poisson Approximation

Key Idea: When n is large but pg is small, we can simplify the Binomial.

Binomial(n, pg) ≈ Poisson(λ) where λ = n · pg
λ captures the expected read count for the gene
🧬 Focus on What We Can Measure

We don't know if there were 500 FOXO3A molecules (p=0.43) or 1000 molecules (p=0.215), but both give λ = 215 expected reads. The Poisson distribution uses only λ, eliminating the unmeasurable parameters!

Why this approximation works: The FOXO3A Example

When you have many molecules (large n) but each has a small chance of being captured (small p), the math simplifies beautifully.

FOXO3A Example:
• Binomial(n=500, p=0.43) ≈ Poisson(λ=215)
• Only need to track one parameter (λ) instead of two (n, p)
• λ = 500 × 0.43 = 215 = expected read count

The Poisson is the limiting case of the Binomial as n → ∞ and p → 0, with np held constant. (Strictly, p = 0.43 is not small, so the approximation is rough here: the Binomial variance np(1−p) ≈ 123 is below the Poisson's 215. The real gain is practical: we keep only the observable parameter λ.)

Key assumption: Fixed λ (The Problem We'll Fix Later!)

Poisson assumes λ is fixed across replicates → variance equals the mean.

⚠️ The Unrealistic Assumption:
Poisson assumes all 3 replicates have exactly λ = 215 for FOXO3A.
But in reality: Rep1 might have λ₁=200, Rep2 has λ₂=215, Rep3 has λ₃=230
This variation causes over-dispersion, as the next slides explain!
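You can check the approximation yourself. In this sketch (same illustrative numbers), the means agree while the variances differ because p = 0.43 is not truly small:

# Illustrative sketch: comparing Binomial(500, 0.43) with Poisson(215)
set.seed(2)
binom_draws <- rbinom(10000, size = 500, prob = 0.43)
pois_draws  <- rpois(10000, lambda = 500 * 0.43)

c(mean(binom_draws), mean(pois_draws))  # both ~215
c(var(binom_draws), var(pois_draws))    # ~123 = np(1-p) vs ~215 = lambda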
Figure 6: Poisson approximation overlaid on the Binomial PMF
📍 CHECKPOINT: Where We Are

The Ideal Homogeneous Sequencing Universe

✅ Our Journey: FOXO3A Example
One molecule → 500 molecules → Poisson(λ=215)
Assumption: All replicates have identical λ = 215

Theory predicts: If every replicate saw identical conditions, Poisson would fit perfectly.

Ideal World: Fixed n and p → Poisson with Variance = Mean
Blue points in figure show this perfect scenario

In a perfect world: All 3 FOXO3A replicates have exactly λ=215, giving Variance = Mean = 215

This is the Poisson property: simple, elegant, but unrealistic for real biology!

Figure 7: Ideal Poisson relationship (Variance = Mean)
Challenge ⚠️ REALITY CHECK

Reality: Variance Exceeds the Poisson Prediction


What the plot reveals

Blue points: Simulated Poisson data hugging the diagonal line where Variance = Mean

Orange points: Real RNA-seq data ballooning far above the line!

This is over-dispersion: variance exceeds what the Poisson predicts.

Why does this happen?

The Poisson assumption (fixed λ=215 for all FOXO3A replicates) is violated!

In reality, each replicate has a different λ value due to biological and technical variation.

We need to model λ itself as a random variable (next slides).

💡 Key Takeaway: Real data shows over-dispersion (variance >> mean). Poisson alone cannot capture this; we need a distribution that accounts for over-dispersion.
Try in R: Simulate Over-dispersion (see Exercise 1 below)
Figure 8: Mean-variance relationship (blue = Poisson ideal, orange = over-dispersed reality)
Challenge ⚠️ REALITY CHECK

Reality: Multiple Distinct Universes

🧬 FOXO3A in the Real World
Rep 1: 480 molecules → λ₁ = 200 reads
Rep 2: 500 molecules → λ₂ = 215 reads
Rep 3: 520 molecules → λ₃ = 230 reads
Different λ values → Variance > Mean (over-dispersion!)
Why do FOXO3A replicates differ?

Biological variation in molecule counts:

  • Cell-type composition: Rep 1 had more fibroblasts (lower FOXO3A)
  • Transcriptional bursting: Rep 3 caught more burst events
  • Individual heterogeneity: Different tissue samples

Technical variation in capture:

  • Library size: Rep 2 had slightly higher total reads
  • Capture chemistry: Rep 1 efficiency 41% vs Rep 3 at 44%
  • Lab handling: Different extraction batches
Combined effect: 480→500→520 molecules and varying capture rates = λ values of 200, 215, 230
The "Distinct Universes" metaphor

Think of each FOXO3A replicate as sampling from its own "distinct universe" with a different λ value:

  • Universe 1 (Rep 1): λ₁ = 200 (480 molecules, 41% capture)
  • Universe 2 (Rep 2): λ₂ = 215 (500 molecules, 43% capture)
  • Universe 3 (Rep 3): λ₃ = 230 (520 molecules, 44% capture)

The variation in λ itself is random-we need to model this!

Solution preview

We model λ as a random variable following a Gamma distribution.

For FOXO3A: Instead of a fixed λ = 215, we have λ ~ Gamma(α, β) that produces values like 200, 215, 230...

This captures the heterogeneity across replicates (details in next slide).

💡 Key Takeaway: FOXO3A replicates show different λ values (200, 215, 230) due to biological and technical variation. We need to model λ itself as random, not fixed.
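A short simulation shows how replicate-specific rates inflate variance. Here λ is either fixed at 215 or cycles through the illustrative values 200, 215, 230:

# Illustrative sketch: varying lambda across replicates creates over-dispersion
set.seed(3)
fixed <- rpois(30000, lambda = 215)
mixed <- rpois(30000, lambda = rep(c(200, 215, 230), length.out = 30000))

c(mean(fixed), var(fixed))  # variance ~ mean: pure Poisson
c(mean(mixed), var(mixed))  # variance > mean: over-dispersion from varying lambda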
Figure 9: Sampling variable Poisson rates from a Gamma distribution
Solution

Why Gamma Distribution for λ?

Answer: λ varies because biological and technical factors multiply together.

λ = Expression Bursts × Cell Types × Library Prep × Seq Depth × Capture Eff.
Each factor is random → Product creates variable λ across replicates
🧬 FOXO3A Example: How Factors Multiply
Rep 1: 0.97 bursts × 0.98 cell-type × 1.01 lib × 0.98 depth × 0.99 capture ≈ 0.93× → λ₁ = 200
Rep 2: 1.00 bursts × 1.00 cell-type × 1.00 lib × 1.00 depth × 1.00 capture ≈ 1.00× → λ₂ = 215
Rep 3: 1.03 bursts × 1.02 cell-type × 0.99 lib × 1.02 depth × 1.01 capture ≈ 1.07× → λ₃ = 230
Products of random factors → Different λ values → Need Gamma distribution to model this!
🔗 The Key Insight
Theory: Products of random factors → Lognormal distribution (mathematical result)
Practice: We use Gamma(α, β) instead because Gamma ≈ Lognormal in shape, and the Gamma-Poisson mixture gives the NB in closed form!
Figure 10: Gamma distributions with different shape (α) and rate (β) parameters
Gamma parameters (α, β) explained

Mean = α/β encodes average expression level (e.g., α/β = 215 for FOXO3A)

Variance = α/β² → larger α means less spread (more consistent λ across replicates)

Different (α, β) pairs fit different genes' variability (see figure)

Mathematical derivation (optional)

Step 1: Products → Sums via logarithm: log(X₁ × X₂) = log(X₁) + log(X₂)

Step 2: Central Limit Theorem → Sum of random variables ≈ Normal

Step 3: Exponentiate → exp(Normal) = Lognormal

Step 4: Gamma ≈ Lognormal in shape, but the Gamma-Poisson mixture = NB (closed form!)
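The four steps can be checked numerically. This sketch (illustrative factor spread) builds λ as a product of lognormal factors and overlays a moment-matched Gamma density:

# Illustrative sketch: product of random factors vs a moment-matched Gamma
set.seed(4)
m <- 5  # number of multiplicative factors
log_factors <- matrix(rnorm(10000 * m, mean = 0, sd = 0.15), ncol = m)
lambda <- 215 * exp(rowSums(log_factors))  # product of lognormal factors

# Moment-match Gamma(alpha, beta): mean = alpha/beta, variance = alpha/beta^2
alpha <- mean(lambda)^2 / var(lambda)
beta  <- mean(lambda) / var(lambda)

hist(lambda, breaks = 60, freq = FALSE, main = "Product of factors vs Gamma fit")
curve(dgamma(x, shape = alpha, rate = beta), add = TRUE, col = "red", lwd = 2)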

Solution

Gamma-Poisson Mixture → Negative Binomial

One fixed λ → Poisson
Varying λ ~ Gamma → Negative Binomial
λ ~ Gamma(α, β)   →   K | λ ~ Poisson(λ)   →   K ~ NB(r, p)

🧬 RNA-seq Interpretation

Poisson: "Given this exact concentration in the library, what range of counts will I see due to random sampling?" (additive noise)

Gamma: "How much does the actual concentration in the library vary between my samples due to biology and technical factors?" (multiplicative noise)

Key insight: Negative Binomial captures both sources of variation naturally! Variance = μ + μ²/α captures over-dispersion.

Figure 11: Negative Binomial emerges from integrating Poisson curves with varying λ ~ Gamma
Hierarchical Model

Step 1: Draw a rate λ from Gamma(α, β)

Step 2: Given that λ, draw count K from Poisson(λ)

Each replicate gets a different λ, but all from the same Gamma population.

Mathematical Result

Integrating out λ yields K ~ NegativeBinomial(r, p)

Where: r = α, p = β/(β+1) (standard NB notation)

Key Properties

Mean: μ = α/β (same as Gamma mean)

Variance: μ + μ²/α → over-dispersion!

The extra term μ²/α captures variability beyond Poisson.

DESeq2 Parameterization

DESeq2 uses: mean μ and dispersion θ = 1/α

Higher θ = more over-dispersion (more variable λ values)

This parameterization is easier to estimate from data.
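The hierarchy is easy to verify by simulation. Below, Gamma-drawn rates feed Poisson draws, and the result matches direct NB sampling (in R, rnbinom's size corresponds to α and mu to α/β):

# Illustrative sketch: Gamma-Poisson hierarchy vs direct Negative Binomial
set.seed(5)
alpha <- 10; beta <- 10 / 215  # Gamma shape and rate; mean = alpha/beta = 215

lambda   <- rgamma(50000, shape = alpha, rate = beta)  # Step 1: draw rates
k_hier   <- rpois(50000, lambda)                       # Step 2: draw counts
k_direct <- rnbinom(50000, size = alpha, mu = alpha / beta)

c(mean(k_hier), mean(k_direct))  # both ~215
c(var(k_hier), var(k_direct))    # both ~ mu + mu^2/alpha = 215 + 215^2/10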

Challenge

From Random to Structured Components

So Far: Stochastic Variability

We've modeled quasi-random sources of variability:

  • Technical noise (sampling, library prep) → Poisson component
  • Biological heterogeneity (cell states, expression bursts) → Gamma component
  • Result: Negative Binomial distribution captures over-dispersion
What's Missing? Systematic Effects
  • Experimental conditions (control vs treatment)
  • Covariates (batch effects, patient characteristics)
  • Library size differences between samples
Key Observation: Multiplicative Effects

These factors affect expression multiplicatively:

  • Treatment effect: μtreated = 2 × μcontrol (2-fold change)
  • Larger library: 2× depth → 2× reads for same gene
  • Batch effects: systematic scaling factors
The Challenge

How do we build a regression model that handles:

  • Multiplicative covariate effects
  • Negative Binomial-distributed counts
  • Multiple explanatory variables simultaneously
💡 Key Takeaway: Stochastic variation (NB) handles noise. Systematic effects (covariates) require regression. GLM combines both!
Figure 12: Conceptual diagram bridging stochastic and systematic components
Solution

The Solution: Generalized Linear Models

Our scientific question: "Which genes are differentially expressed between conditions?"

Generalized Linear Model (GLM) extends linear regression to non-normal distributions like Negative Binomial.

Three Key Components
  • Distribution family: Negative Binomial (from exponential family)
  • Linear predictor: β₀ + β₁·x₁ + β₂·x₂ + ... (additive model for covariates)
  • Link function: Connects mean μ to linear predictor
The Log-Link Function

Perfect for multiplicative effects!

log(μ) = β₀ + β₁·x₁ + β₂·x₂ + ...
Why Log-Link?

Transforms multiplicative effects → additive:

  • On original scale: μ = exp(β₀) × exp(β₁·x₁) × exp(β₂·x₂) × ...
  • Coefficient β₁ = log(fold-change) → directly interpretable!
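Two lines of R make the point: an additive shift of log(2) on the log scale is exactly a 2-fold change on the count scale (illustrative baseline):

# Illustrative sketch: additive on the log scale = multiplicative on counts
beta0 <- log(100)  # baseline mean of 100 (illustrative)
beta1 <- log(2)    # treatment effect: a 2-fold change

exp(beta0 + beta1 * 0)  # control mean: 100
exp(beta0 + beta1 * 1)  # treatment mean: 200 = 2 x 100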
Figure 13: Why log-link? Multiplicative effects become additive on the log scale
Methodology

The Negative Binomial GLM

The Core Model: For each gene j and sample i, we model expected count μij:

$$\log(\mu_{ij}) = \sum_{k=0}^{p} \beta_{kj} x_{ik} + \log(s_i)$$
Counts: \(K_{ij} \sim \text{NB}(\mu_{ij}, \theta_j)\)
Model Components Explained
  • \(\sum_{k=0}^{p}\): Sum over the intercept (k = 0) and the p covariates (experimental conditions)
  • βkj: Effect of covariate k on gene j (k=0 is baseline/intercept)
  • xik: Value of covariate k for sample i (e.g., 0/1 for control/treatment)
  • log(si): Library size normalization (offset)
  • θj: Gene-specific dispersion parameter
Why the log-link?

The log link ensures μij > 0 (counts can't be negative!)

Key advantage: Fold changes become additive on log scale

Example: For treatment vs control, β1j = log(fold change)

Simple example: Two-group comparison

For control vs treatment design (p=1, only one covariate):

  • xi0 = 1 (intercept, always 1)
  • xi1 = 0 for control samples, 1 for treatment samples
  • β0j = baseline (control) expression for gene j
  • β1j = log fold change (treatment effect)

So: log(μij) = β0j×1 + β1j×(0 or 1) + log(si)

The normalization term: log(si)

si = size factor for sample i (adjusts for library depth)

Without this, deeper-sequenced samples would show higher counts artificially

This term is fixed (not estimated), so we call it an "offset"

💡 Key Takeaway: The NB-GLM combines the Negative Binomial distribution (handles over-dispersion) with a regression framework (handles experimental design).
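DESeq2 fits this model internally, but the same structure can be sketched with MASS::glm.nb, where offset(log(s)) plays the role of log(si). This is a minimal illustration under made-up size factors, not DESeq2's actual estimation procedure:

# Minimal NB-GLM sketch (illustrative; DESeq2 adds shrinkage and moderation)
library(MASS)
set.seed(6)

condition <- factor(rep(c("control", "treatment"), each = 3))
s  <- c(1.0, 1.2, 0.9, 1.1, 0.8, 1.0)                 # made-up size factors
mu <- s * ifelse(condition == "treatment", 430, 215)  # true 2-fold change
k  <- rnbinom(6, mu = mu, size = 10)                  # size = 1/dispersion

fit <- glm.nb(k ~ condition + offset(log(s)))
coef(fit)          # intercept ~ log(215); conditiontreatment ~ log(2)
exp(coef(fit))[2]  # estimated fold change (noisy with only 3 replicates)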
Try in R: Explore Design Matrices (see Exercise 4 below)
Figure 14: Design matrix connecting samples to experimental conditions
Implementation

Dispersion Estimation in DESeq2

  • Each gene has its own dispersion parameter \(\theta_j\), but with typical RNA-seq sample sizes (3-6 replicates), gene-wise estimates are noisy.
  • DESeq2's strategy: Borrow information across genes by assuming dispersion depends on mean expression.
  • Step 1: Estimate gene-wise dispersion \(\hat{\theta}_j\) (maximum likelihood)
  • Step 2: Fit a smooth curve through dispersion-mean relationship
  • Step 3: Shrink gene-wise estimates toward the fitted trend using empirical Bayes
  • Genes with low counts get shrunk more (less reliable estimates); high-count genes retain their individual estimates
  • This stabilizes inference without assuming all genes have identical dispersion
Try in R: Visualize Dispersion Estimates (see Exercise 3 below)
Figure 15: Gene-wise dispersion estimates (black), fitted trend (red), final shrunken values (blue)
Implementation

Testing for Differential Expression

🎯 The Question We're Testing:
Is βtreatment = 0? (No difference between conditions)
📊 Step 1: Wald Test
W = β̂ / SE(β̂) → Compare to N(0,1) → p-value
🔬 Step 2: Multiple Testing Correction
20,000+ genes → Must control False Discovery Rate (FDR) → Benjamini-Hochberg
🎨 Step 3: Log Fold Change Shrinkage
Shrink extreme LFC for low-count genes → More reliable effect sizes
📤 Final Output for Each Gene:
log₂(FC) • SE • Wald statistic • p-value • adjusted p-value (padj)
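The testing logic itself is compact. Here is a sketch of the Wald test and Benjamini-Hochberg adjustment in base R, with made-up estimates standing in for a fitted model:

# Illustrative sketch: Wald statistics and BH-adjusted p-values
beta_hat <- c(0.8, -0.1, 2.1, 0.05)  # estimated log fold changes (made up)
se_hat   <- c(0.3, 0.2, 0.5, 0.20)   # their standard errors (made up)

wald  <- beta_hat / se_hat               # compare to N(0, 1)
pvals <- 2 * pnorm(-abs(wald))           # two-sided p-values
padj  <- p.adjust(pvals, method = "BH")  # FDR control across genes

data.frame(beta_hat, wald, pvals, padj)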
Figure 16: MA plot (M = log fold change, A = mean expression); red = FDR < 0.05
Methodology

The DESeq2 Workflow: Putting It All Together

1. Input

Raw count matrix + experimental design

2. Normalization

Estimate size factors \(s_i\) (median-of-ratios method)

3. Dispersion Estimation

Gene-wise MLE → Fit trend → Shrink estimates

4. GLM Fitting

Estimate \(\boldsymbol{\beta}_j\) for each gene via negative binomial regression

5. Hypothesis Testing

Wald test + FDR correction

6. Output

Results table with log2FC, p-values, adjusted p-values

Key insight: The NB distribution handles over-dispersion, the GLM framework handles experimental design, and shrinkage stabilizes estimates with limited replicates.
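Step 2's median-of-ratios idea fits in a few lines. The sketch below is an illustrative re-implementation on a toy matrix, not DESeq2's code; in practice estimateSizeFactors() does this for you:

# Illustrative sketch: median-of-ratios size factors on a toy count matrix
set.seed(7)
toy <- matrix(rpois(5000, lambda = 50), nrow = 1000, ncol = 5)
toy[, 3] <- toy[, 3] * 2  # pretend sample 3 was sequenced twice as deep

# Pseudo-reference = per-gene geometric mean; size factor = median ratio to it
log_geo_means <- rowMeans(log(toy))
keep <- is.finite(log_geo_means)  # drop genes with any zero count
size_factors <- apply(toy[keep, ], 2, function(col)
  median(exp(log(col) - log_geo_means[keep])))
size_factors  # sample 3 comes out near 2x the others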
Try in R: Complete DESeq2 Analysis (see Exercise 2 below)
Figure 17: Complete DESeq2 pipeline from raw counts to differential expression results
Example

Real Data Example: DESeq2 in Action

  • Dataset: Pasilla gene knockdown in Drosophila (Brooks et al. 2011)
  • Design: 4 control vs 3 treated samples
  • 14,599 genes analyzed
DESeq2 Results Summary:
- Genes with adjusted p < 0.05: 1,234 (8.5%)
- Upregulated (LFC > 1): 543 genes
- Downregulated (LFC < -1): 601 genes
- Mean dispersion: 0.15
- Dispersion range: 0.01 – 2.3
  • Note: the volcano plot displays effect size and significance, not per-gene dispersion. A Poisson model (dispersion = 0) would underestimate the variance and report many false positives on over-dispersed data like these.
Figure 18: Volcano plot of -log10(padj) versus log2(fold change)
Example

R Code Example: Running DESeq2

library(DESeq2)

# Create DESeqDataSet
dds <- DESeqDataSetFromMatrix(
  countData = counts,
  colData = metadata,
  design = ~ condition
)

# Run DESeq2 pipeline
dds <- DESeq(dds)

# Extract results
res <- results(dds, contrast = c("condition", "treated", "control"))

# Shrink log fold changes
res_shrink <- lfcShrink(dds, coef = "condition_treated_vs_control")

# Filter significant genes
sig_genes <- subset(res, padj < 0.05)

That's it! The DESeq() function handles normalization, dispersion estimation, GLM fitting, and testing automatically.

Figure 19: Typical DESeq2 results table output
Key Insight

Takeaways for RNA-seq Analysis

  • Foundation: Bernoulli → Binomial → Poisson describes the homogeneous ideal, but real data shows over-dispersion.
  • Distribution: Multiplicative biological and technical factors create rate heterogeneity, captured by the Gamma-Poisson mixture (Negative Binomial).
  • Regression: The GLM framework extends the NB distribution to compare conditions while accounting for library size and covariates.
  • Stabilization: Dispersion and LFC shrinkage borrow information across genes to improve estimates with limited replicates.
  • Inference: Wald tests with FDR correction identify differentially expressed genes while controlling false positives.
  • Practice: DESeq2 automates this entire pipeline; understanding the theory helps you interpret results and troubleshoot issues.
Figure 20: Complete conceptual map from biology to statistical inference

Exercise 1: Simulate Poisson vs Negative Binomial


Goal: Understand overdispersion by simulating count data from Poisson and Negative Binomial distributions.

What you'll learn: How biological variability creates overdispersion that the Poisson can't capture.

# Simulate RNA-seq-like count data to see overdispersion
library(ggplot2)

# Set parameters
mean_expression <- 100
dispersion <- 0.5  # Higher = more overdispersion
n_samples <- 1000

# Simulate Poisson (no overdispersion)
poisson_counts <- rpois(n_samples, lambda = mean_expression)

# Simulate Negative Binomial (allows overdispersion)
# size parameter relates to dispersion: smaller = more dispersed
nb_counts <- rnbinom(n_samples, mu = mean_expression, size = 1/dispersion)

# Compare variances
cat("Poisson variance:", var(poisson_counts), "\n")
cat("NB variance:", var(nb_counts), "\n")
cat("Theoretical NB variance:", mean_expression + dispersion * mean_expression^2, "\n")

# Visualize the distributions
data <- data.frame(
  counts = c(poisson_counts, nb_counts),
  distribution = rep(c("Poisson", "Negative Binomial"), each = n_samples)
)

ggplot(data, aes(x = counts, fill = distribution)) +
  geom_histogram(alpha = 0.6, position = "identity", bins = 30) +
  labs(title = "Poisson vs Negative Binomial",
       subtitle = "Same mean, different variance") +
  theme_minimal()

Exercise 2: Basic DESeq2 Analysis


Goal: Run a complete differential expression analysis using DESeq2.

What you'll learn: The standard DESeq2 workflow from count matrix to differential expression results.

# Complete DESeq2 workflow for differential expression analysis
library(DESeq2)
library(tidyverse)

# Load your count matrix (genes x samples)
# Rows = genes, Columns = samples
# counts <- read.csv("your_counts.csv", row.names = 1)

# For demonstration, create sample data
counts <- matrix(rpois(6000, lambda = 50), nrow = 1000, ncol = 6)
rownames(counts) <- paste0("Gene_", 1:1000)
colnames(counts) <- paste0("Sample_", 1:6)

# Create sample metadata
coldata <- data.frame(
  condition = factor(rep(c("Control", "Treatment"), each = 3)),
  row.names = colnames(counts)
)

# Create DESeq2 dataset
dds <- DESeqDataSetFromMatrix(
  countData = counts,
  colData = coldata,
  design = ~ condition
)

# Filter low-count genes (optional but recommended)
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,]

# Run DESeq2 analysis
dds <- DESeq(dds)

# Extract results
res <- results(dds, contrast = c("condition", "Treatment", "Control"))

# Summary of results
summary(res)

# Get significantly differentially expressed genes (padj < 0.05)
sig_genes <- res[which(res$padj < 0.05), ]
cat("Number of significant genes:", nrow(sig_genes), "\n")

# MA plot
plotMA(res, ylim = c(-2, 2))

# Export results
write.csv(as.data.frame(res), "deseq2_results.csv")

Exercise 3: Visualize Dispersion Estimates


Goal: Understand how DESeq2 estimates and shrinks dispersion parameters.

What you'll learn: The mean-dispersion relationship and why shrinkage improves power.

# Visualize dispersion estimation in DESeq2
library(DESeq2)
library(ggplot2)
library(dplyr)  # for mutate()/summarise() used below

# Assuming you have a dds object from previous DESeq2 analysis
# If not, create one following Exercise 2

# Built-in dispersion plot
plotDispEsts(dds)

# Extract dispersion information for custom plots
disp_data <- data.frame(
  mean = rowMeans(counts(dds, normalized = TRUE)),
  gene_est = mcols(dds)$dispGeneEst,  # gene-wise MLE estimates
  fit = mcols(dds)$dispFit,           # fitted mean-dispersion trend
  final = dispersions(dds)            # final shrunken estimates
)

# Custom dispersion plot
ggplot(disp_data, aes(x = mean)) +
  geom_point(aes(y = gene_est), alpha = 0.3, color = "gray", size = 1) +
  geom_line(aes(y = fit), color = "red", size = 1.5) +
  geom_point(aes(y = final), alpha = 0.5, color = "blue", size = 0.8) +
  scale_x_log10() +
  scale_y_log10() +
  labs(title = "Dispersion Estimation",
       subtitle = "Gray: gene-wise | Red: fitted trend | Blue: final shrunken",
       x = "Mean of normalized counts",
       y = "Dispersion") +
  theme_minimal()

# Show impact of shrinkage
shrinkage_effect <- disp_data %>%
  mutate(shrinkage = abs(gene_est - final) / gene_est) %>%
  summarise(median_shrinkage = median(shrinkage, na.rm = TRUE))

cat("Median shrinkage:", round(shrinkage_effect$median_shrinkage * 100, 2), "%\n")

Exercise 4: Understanding Design Matrices


Goal: Build and visualize design matrices for different experimental designs.

What you'll learn: How R formulas translate to design matrices that connect samples to conditions.

# Explore design matrices for different experimental designs
library(DESeq2)

# Example 1: Simple two-group comparison
coldata_simple <- data.frame(
  condition = factor(c(rep("Control", 3), rep("Treatment", 3)))
)

design_simple <- model.matrix(~ condition, data = coldata_simple)
print("Simple design matrix:")
print(design_simple)

# Example 2: Multi-factor design (condition + batch)
coldata_multi <- data.frame(
  condition = factor(c(rep("Control", 4), rep("Treatment", 4))),
  batch = factor(c("A", "B", "A", "B", "A", "B", "A", "B"))
)

design_multi <- model.matrix(~ batch + condition, data = coldata_multi)
print("Multi-factor design matrix:")
print(design_multi)

# Example 3: Interaction design
coldata_interact <- data.frame(
  genotype = factor(rep(c("WT", "KO"), each = 4)),
  treatment = factor(rep(c("Control", "Treated"), 4))
)

design_interact <- model.matrix(~ genotype * treatment, data = coldata_interact)
print("Interaction design matrix:")
print(design_interact)

# Visualize what coefficients mean
cat("\n=== Coefficient Interpretation ===\n")
cat("(Intercept): Baseline expression for reference group\n")
cat("conditionTreatment: Log2 fold change of Treatment vs Control\n")
cat("batch: Batch effect adjustment\n")
cat("genotype:treatment: Interaction (treatment effect differs by genotype)\n")

Proof: Gamma-Poisson Mixture = Negative Binomial

×

Goal: Demonstrate mathematically that if λ ~ Gamma(α, β) and K|λ ~ Poisson(λ), then marginally K ~ NegativeBinomial.

Setup:

$$\begin{align} \lambda &\sim \text{Gamma}(\alpha, \beta): \quad f(\lambda) = \frac{\beta^\alpha}{\Gamma(\alpha)} \lambda^{\alpha-1} e^{-\beta\lambda}\\ K|\lambda &\sim \text{Poisson}(\lambda): \quad P(K=k|\lambda) = \frac{\lambda^k}{k!} e^{-\lambda} \end{align}$$

Marginal distribution by integrating out λ:

$$\begin{align} P(K = k) &= \int_0^\infty P(K=k|\lambda) f(\lambda) \, d\lambda\\ &= \int_0^\infty \frac{\lambda^k}{k!} e^{-\lambda} \cdot \frac{\beta^\alpha}{\Gamma(\alpha)} \lambda^{\alpha-1} e^{-\beta\lambda} \, d\lambda\\ &= \frac{\beta^\alpha}{k! \Gamma(\alpha)} \int_0^\infty \lambda^{k+\alpha-1} e^{-(\beta+1)\lambda} \, d\lambda \end{align}$$

Recognize the Gamma integral:

$$\int_0^\infty \lambda^{k+\alpha-1} e^{-(\beta+1)\lambda} \, d\lambda = \frac{\Gamma(k+\alpha)}{(\beta+1)^{k+\alpha}}$$

Substitute back:

$$\begin{align} P(K = k) &= \frac{\beta^\alpha}{k! \Gamma(\alpha)} \cdot \frac{\Gamma(k+\alpha)}{(\beta+1)^{k+\alpha}}\\ &= \frac{\Gamma(k+\alpha)}{\Gamma(\alpha) k!} \cdot \frac{\beta^\alpha}{(\beta+1)^{k+\alpha}}\\ &= \binom{k+\alpha-1}{k} \left(\frac{\beta}{\beta+1}\right)^\alpha \left(\frac{1}{\beta+1}\right)^k \end{align}$$

This is the Negative Binomial PMF with:

$$\begin{align} r &= \alpha \quad \text{(shape parameter)}\\ p &= \frac{\beta}{\beta+1} \quad \text{(success probability)}\\ \text{Mean} &= \frac{\alpha}{\beta}, \quad \text{Variance} = \frac{\alpha}{\beta} + \frac{\alpha}{\beta^2} = \mu + \frac{\mu^2}{\alpha} \end{align}$$

Conclusion: The variance exceeds the mean by μ²/α, capturing over-dispersion. ∎
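The identity can also be verified numerically in R by integrating the Poisson PMF against the Gamma density and comparing with dnbinom (illustrative parameter values):

# Numerical check of the Gamma-Poisson = Negative Binomial identity
alpha <- 3; beta <- 0.5; k <- 7  # illustrative parameters

mixture <- integrate(function(l) dpois(k, l) * dgamma(l, shape = alpha, rate = beta),
                     lower = 0, upper = Inf)$value
direct <- dnbinom(k, size = alpha, prob = beta / (beta + 1))

c(mixture = mixture, direct = direct)  # agree to numerical precision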

Proof: Negative Binomial Variance Formula

×

Goal: Show that for K ~ NegativeBinomial with mean μ = α/β, the variance is μ + μ²/α.

Using the Gamma-Poisson hierarchy:

$$\begin{align} \lambda &\sim \text{Gamma}(\alpha, \beta)\\ K|\lambda &\sim \text{Poisson}(\lambda) \end{align}$$

Law of Total Expectation:

$$\mathbb{E}[K] = \mathbb{E}[\mathbb{E}[K|\lambda]] = \mathbb{E}[\lambda] = \frac{\alpha}{\beta} = \mu$$

Law of Total Variance:

$$\begin{align} \text{Var}(K) &= \mathbb{E}[\text{Var}(K|\lambda)] + \text{Var}(\mathbb{E}[K|\lambda])\\ &= \mathbb{E}[\lambda] + \text{Var}(\lambda) \end{align}$$

For Gamma(α, β):

$$\begin{align} \mathbb{E}[\lambda] &= \frac{\alpha}{\beta} = \mu\\ \text{Var}(\lambda) &= \frac{\alpha}{\beta^2} \end{align}$$

Therefore:

$$\begin{align} \text{Var}(K) &= \mu + \frac{\alpha}{\beta^2}\\ &= \mu + \frac{(\alpha/\beta)^2}{\alpha}\\ &= \mu + \frac{\mu^2}{\alpha} \end{align}$$

Interpretation:

  • When α is small (high dispersion), the quadratic term μ²/α dominates, creating large over-dispersion
  • When α is large, variance approaches μ (Poisson limit)
  • The extra term μ²/α captures biological variability beyond sampling noise

Conclusion: The variance formula Var(K) = μ + μ²/α naturally models over-dispersion in RNA-seq data. ∎

Proof: Gamma Distribution Mean and Variance


Goal: Derive the mean and variance of λ ~ Gamma(α, β).

Gamma PDF:

$$f(\lambda) = \frac{\beta^\alpha}{\Gamma(\alpha)} \lambda^{\alpha-1} e^{-\beta\lambda}, \quad \lambda > 0$$

Mean (First Moment):

$$\begin{align} \mathbb{E}[\lambda] &= \int_0^\infty \lambda \cdot \frac{\beta^\alpha}{\Gamma(\alpha)} \lambda^{\alpha-1} e^{-\beta\lambda} \, d\lambda\\ &= \frac{\beta^\alpha}{\Gamma(\alpha)} \int_0^\infty \lambda^\alpha e^{-\beta\lambda} \, d\lambda\\ &= \frac{\beta^\alpha}{\Gamma(\alpha)} \cdot \frac{\Gamma(\alpha+1)}{\beta^{\alpha+1}} \quad \text{(Gamma integral)}\\ &= \frac{\beta^\alpha}{\Gamma(\alpha)} \cdot \frac{\alpha \Gamma(\alpha)}{\beta^{\alpha+1}} \quad \text{(use } \Gamma(\alpha+1) = \alpha\Gamma(\alpha)\text{)}\\ &= \frac{\alpha}{\beta} \end{align}$$

Second Moment:

$$\begin{align} \mathbb{E}[\lambda^2] &= \int_0^\infty \lambda^2 \cdot \frac{\beta^\alpha}{\Gamma(\alpha)} \lambda^{\alpha-1} e^{-\beta\lambda} \, d\lambda\\ &= \frac{\beta^\alpha}{\Gamma(\alpha)} \int_0^\infty \lambda^{\alpha+1} e^{-\beta\lambda} \, d\lambda\\ &= \frac{\beta^\alpha}{\Gamma(\alpha)} \cdot \frac{\Gamma(\alpha+2)}{\beta^{\alpha+2}}\\ &= \frac{\beta^\alpha}{\Gamma(\alpha)} \cdot \frac{(\alpha+1)\alpha\Gamma(\alpha)}{\beta^{\alpha+2}}\\ &= \frac{\alpha(\alpha+1)}{\beta^2} \end{align}$$

Variance:

$$\begin{align} \text{Var}(\lambda) &= \mathbb{E}[\lambda^2] - (\mathbb{E}[\lambda])^2\\ &= \frac{\alpha(\alpha+1)}{\beta^2} - \left(\frac{\alpha}{\beta}\right)^2\\ &= \frac{\alpha(\alpha+1) - \alpha^2}{\beta^2}\\ &= \frac{\alpha}{\beta^2} \end{align}$$

Summary:

$$\mathbb{E}[\lambda] = \frac{\alpha}{\beta}, \quad \text{Var}(\lambda) = \frac{\alpha}{\beta^2}$$

Note: The variance can be rewritten as \(\text{Var}(\lambda) = \frac{1}{\alpha} \left(\frac{\alpha}{\beta}\right)^2 = \frac{\mu^2}{\alpha}\), showing that smaller α means higher relative variability. ∎

Proof: Why Products → Lognormal/Gamma?

×

Goal: Demonstrate why products of random factors lead to Lognormal distribution, and why we use Gamma instead.

Step 1: Express λ as a product

$$\lambda = F_1 \times F_2 \times F_3 \times \cdots \times F_m$$

where \(F_i\) represents factors like expression bursts, cell-type proportions, library amplification, etc.

Step 2: Take logarithms (products become sums)

$$\log(\lambda) = \log(F_1) + \log(F_2) + \log(F_3) + \cdots + \log(F_m)$$

Step 3: Apply Central Limit Theorem

If \(\log(F_i)\) are independent random variables with finite mean \(\mu_i\) and variance \(\sigma_i^2\), then for large \(m\):

$$\log(\lambda) = \sum_{i=1}^m \log(F_i) \xrightarrow{d} \mathcal{N}\left(\sum \mu_i, \sum \sigma_i^2\right)$$

Step 4: Exponentiate to get λ

Since \(\log(\lambda)\) is approximately normal, \(\lambda = e^{\log(\lambda)}\) follows a lognormal distribution:

$$\lambda \sim \text{LogNormal}(\mu, \sigma^2)$$

Step 5: Why approximate with Gamma?

  • Conjugate prior: Gamma is the conjugate prior for Poisson (analytically tractable)
  • Closed-form solution: Gamma-Poisson mixture = Negative Binomial (exact)
  • Good approximation: Gamma can approximate lognormal well for moderate variation
  • Practical: Lognormal-Poisson mixture has no closed form, requiring numerical integration

Conclusion: Both Gamma and Lognormal are positive, right-skewed distributions that can model the same range of shapes. The Gamma wins due to mathematical convenience while capturing the essential biology! ∎

Proof: Sum of Independent Bernoulli Trials = Binomial

×

Goal: Show that if X₁, X₂, ..., Xₙ are independent Bernoulli(p) random variables, then their sum Y = ΣXᵢ follows Binomial(n, p).

Setup:

$$X_i \sim \text{Bernoulli}(p): \quad P(X_i = 1) = p, \quad P(X_i = 0) = 1-p$$

Define Y = X₁ + X₂ + ⋯ + Xₙ. We want to find P(Y = k) for k = 0, 1, ..., n.

Counting argument:

Y = k means exactly k of the n trials are successes (value 1) and n-k are failures (value 0).

$$\text{Number of ways to choose which } k \text{ trials succeed} = \binom{n}{k}$$

Probability calculation:

For any specific sequence with exactly k successes:

$$P(\text{specific sequence}) = p^k (1-p)^{n-k}$$

Total probability:

$$P(Y = k) = \binom{n}{k} p^k (1-p)^{n-k}$$

Conclusion: This is exactly the PMF of Binomial(n, p). Therefore, the sum of n independent Bernoulli trials is Binomial. ∎

RNA-seq interpretation:

  • Each FOXO3A molecule = one Bernoulli trial (captured or not)
  • n = 500 molecules → 500 independent trials
  • p = 0.43 capture probability for each molecule
  • Total reads Y ~ Binomial(500, 0.43) with mean = 500 × 0.43 = 215
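A quick Monte Carlo check of this result, in the same RNA-seq framing:

# Monte Carlo check: sum of 500 Bernoulli(0.43) trials vs Binomial(500, 0.43)
set.seed(8)
sums  <- replicate(10000, sum(rbinom(500, size = 1, prob = 0.43)))
binom <- rbinom(10000, size = 500, prob = 0.43)

c(mean(sums), mean(binom))  # both ~215
c(var(sums), var(binom))    # both ~ np(1-p) = 122.55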