Varying λ ~ Gamma → Negative Binomial
Dr. Ali T. Abdallah
Working with R (Advanced Course)
Advanced Analysis Teaching Series • November 2025
© 2025 Dr. Ali T. Abdallah. All rights reserved.
You have 20,000 genes and 3 replicates per condition. Which genes truly changed? Without the right model, you'll miss real changes or report false positives.
⏱️ ~45-60 min | Step-by-step | No advanced math required
Every step in our statistical model
corresponds to a
biological or experimental reality
Natural first thought: To compare control vs treatment, why not just run a t-test on the counts?
We need a model that respects the count-generating process. Let's build it step-by-step...
Imagine tracking a single FOXO3A mRNA in your cell sample. During RNA-seq, this molecule either gets successfully captured and sequenced → Success (X=1), or gets lost somewhere in the process → Failure (X=0).
Key Idea: One RNA molecule → Either captured (X=1) or lost (X=0)
The overall success probability pg reflects the combined efficiency of all experimental steps:
Overall pFOXO3A = 0.90 × 0.80 × 0.95 × 0.85 × 0.75 ≈ 0.43
So each FOXO3A molecule has ~43% chance of being counted!
In a perfectly homogeneous world, each molecule of a given gene would behave as an independent Bernoulli trial with the same gene-specific success probability pg across all molecules and all replicates.
Note: This assumption breaks down in reality, which we'll address in later slides.
This binary outcome is the fundamental building block behind count modeling. Every counted read started as a single molecule that succeeded in this Bernoulli trial.
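A minimal R sketch of this idea (the capture probability 0.43 comes from the FOXO3A example above; everything else is illustrative):

set.seed(42)
p_capture <- 0.43
rbinom(1, size = 1, prob = p_capture)            # one molecule: 1 = captured, 0 = lost
mean(rbinom(10000, size = 1, prob = p_capture))  # fraction captured across many molecules, ~0.43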
Your sample contains n = 500 FOXO3A molecules. Each has a 43% chance of being captured. How many reads will you observe? This follows a Binomial distribution.
Key Idea: Now imagine n independent molecules for the same gene, each with capture probability pg.
Expected reads = n × p = 500 × 0.43 = 215 reads
But you might observe anywhere from ~180 to ~250 reads due to random sampling variation!
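A quick R check of that range, simulating 10,000 hypothetical libraries with n = 500 and p = 0.43:

set.seed(1)
reads <- rbinom(10000, size = 500, prob = 0.43)
mean(reads)                       # ~215 reads on average
quantile(reads, c(0.001, 0.999))  # extreme outcomes land roughly at 180 and 250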
In a perfect world, every replicate would share the same n and pg for a gene, giving identical expected counts.
In RNA-seq, n (total molecules) is unknown and unmeasurable. We can't directly fit a Binomial model.
Solution: Approximate with Poisson using λ = n·pg = 215 (next slide).
Key Idea: When n is large but pg is small, we can simplify the Binomial.
We don't know if there were 500 FOXO3A molecules (p=0.43) or 1000 molecules (p=0.215), but both give λ = 215 expected reads. The Poisson distribution uses only λ, eliminating the unmeasurable parameters!
When you have many molecules (large n) but each has a small chance of being captured (small p), the math simplifies beautifully.
The Poisson is the limiting case of Binomial as n → ∞ and p → 0, with np held constant.
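A short R sketch of this idea, using the two (n, p) pairs from this slide. Note that p = 0.43 is not actually small, so the Binomial variances still fall short of the Poisson value; the approximation tightens as p shrinks with np fixed:

set.seed(2)
b1 <- rbinom(1e5, size = 500, prob = 0.43)    # n = 500,  p = 0.43
b2 <- rbinom(1e5, size = 1000, prob = 0.215)  # n = 1000, p = 0.215
po <- rpois(1e5, lambda = 215)
c(mean(b1), mean(b2), mean(po))   # all ~215
c(var(b1), var(b2), var(po))      # ~123, ~169, ~215: variance approaches 215 as p -> 0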
Poisson assumes λ is fixed across replicates → variance equals the mean.
Theory predicts: If every replicate saw identical conditions, Poisson would fit perfectly.
In a perfect world: All 3 FOXO3A replicates have exactly λ=215, giving Variance = Mean = 215
This is the Poisson property: simple, elegant, but unrealistic for real biology!
Blue points: Simulated Poisson data hugging the diagonal line where Variance = Mean
Orange points: Real RNA-seq data ballooning far above the line!
This is over-dispersion: the variance exceeds what the Poisson predicts.
The Poisson assumption (fixed λ=215 for all FOXO3A replicates) is violated!
In reality, each replicate has a different λ value due to biological and technical variation.
We need to model λ itself as a random variable (next slides).
Biological variation in molecule counts (e.g., expression bursts, cell-type proportions):
Technical variation in capture (e.g., library preparation efficiency, amplification, sequencing depth):
Think of each FOXO3A replicate as sampling from its own "distinct universe" with a different λ value:
The variation in λ itself is random, so we need to model it!
We model λ as a random variable following a Gamma distribution.
For FOXO3A: Instead of fixed λ=215, we have λ ~ Gamma(shape, scale) that produces values like 200, 215, 230...
This captures the heterogeneity across replicates (details in next slide).
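A minimal R sketch of this step (shape = 100 and rate = 100/215 are illustrative choices that give mean 215):

set.seed(5)
lambda_reps <- rgamma(3, shape = 100, rate = 100/215)  # one lambda per replicate, mean 215
round(lambda_reps)  # three different lambdas, e.g. values near 200, 215, 230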
Answer: λ varies because biological and technical factors multiply together.
Mean = α/β encodes average expression level (e.g., α/β = 215 for FOXO3A)
Variance = α/β² = μ²/α → at a fixed mean, larger α means less spread (more consistent λ across replicates)
Different (α, β) pairs fit different genes' variability (see figure)
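A quick Monte Carlo check of these two formulas (α = 100 and β = 100/215 are illustrative values):

alpha <- 100
beta  <- 100/215
draws <- rgamma(1e6, shape = alpha, rate = beta)
c(mean(draws), alpha / beta)    # both ~215
c(var(draws),  alpha / beta^2)  # both ~462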
Step 1: Products → Sums via logarithm: log(X₁ × X₂) = log(X₁) + log(X₂)
Step 2: Central Limit Theorem → Sum of random variables ≈ Normal
Step 3: Exponentiate → exp(Normal) = Lognormal
Step 4: Gamma ≈ Lognormal in shape, but Gamma + Poisson = NB (closed form!)
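A small R demo of Steps 1-3 (the factors Fᵢ are drawn uniformly from [0.8, 1.25] purely for illustration):

set.seed(6)
m <- 20
lambda <- replicate(1e4, 215 * prod(runif(m, 0.8, 1.25)))  # product of m random factors
hist(log(lambda), breaks = 50)  # roughly bell-shaped on the log scale (CLT at work)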
Poisson: "Given this exact concentration in the library, what range of counts will I see due to random sampling?" (additive noise)
Gamma: "How much does the actual concentration in the library vary between my samples due to biology and technical factors?" (multiplicative noise)
Key insight: Negative Binomial captures both sources of variation naturally! Variance = μ + μ²/α captures over-dispersion.
Step 1: Draw a rate λ from Gamma(α, β)
Step 2: Given that λ, draw count K from Poisson(λ)
Each replicate gets a different λ, but all from the same Gamma population.
Integrating out λ yields K ~ NegativeBinomial(r, p)
Where: r = α, p = β/(β+1) (standard NB notation)
Mean: μ = α/β (same as Gamma mean)
Variance: μ + μ²/α → over-dispersion!
The extra term μ²/α captures variability beyond Poisson.
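A sketch verifying this hierarchy in R (α = 10 and β = 10/215 are illustrative; rnbinom's size parameter plays the role of r = α):

set.seed(7)
alpha <- 10
beta  <- 10/215                                       # Gamma mean alpha/beta = 215
k_hier <- rpois(1e5, rgamma(1e5, shape = alpha, rate = beta))  # two-step hierarchy
k_nb   <- rnbinom(1e5, size = alpha, mu = alpha/beta)          # direct NB draw
c(mean(k_hier), mean(k_nb))                           # both ~215
c(var(k_hier), var(k_nb), 215 + 215^2/alpha)          # all ~4840: mu + mu^2/alpha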
DESeq2 uses: mean μ and dispersion θ = 1/α
Higher θ = more over-dispersion (more variable λ values)
This parameterization is easier to estimate from data.
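A one-line check of this parameterization (θ = 0.1 and μ = 215 are illustrative):

theta <- 0.1                                # DESeq2-style dispersion, theta = 1/alpha
mu    <- 215
x <- rnbinom(1e6, size = 1/theta, mu = mu)
c(var(x), mu + theta * mu^2)                # both ~4840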
We've modeled quasi-random sources of variability: biological (e.g., expression bursts, cell-type proportions) and technical (e.g., library efficiency, capture rate, sequencing depth).
These factors affect expression multiplicatively: each factor scales the expected count, so effects multiply rather than add (a 10% drop in capture efficiency reduces every expected count by 10%).
How do we build a regression model that handles: non-negative counts, multiplicative effects, over-dispersion, and unequal sequencing depths?
Our scientific question: "Which genes are differentially expressed between conditions?"
Generalized Linear Model (GLM) extends linear regression to non-normal distributions like Negative Binomial.
Perfect for multiplicative effects!
Transforms multiplicative effects → additive: \(\log(a \times b) = \log(a) + \log(b)\)
The Core Model: For each gene j and sample i, we model the expected count μij:
\[ \log(\mu_{ij}) = \sum_{k=0}^{p} \beta_{kj}\, x_{ik} + \log(s_i), \qquad x_{i0} = 1 \]
The log link ensures μij > 0 (counts can't be negative!)
Key advantage: Fold changes become additive on log scale
Example: For treatment vs control, β1j = log(fold change)
For control vs treatment design (p=1, only one covariate):
So: \(\log(\mu_{ij}) = \beta_{0j} \cdot 1 + \beta_{1j} \cdot x_{i1} + \log(s_i)\), where \(x_{i1} = 0\) for control and \(x_{i1} = 1\) for treatment samples
si = size factor for sample i (adjusts for library depth)
Without this, deeper-sequenced samples would show higher counts artificially
This term is fixed (not estimated), so we call it an "offset"
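A minimal sketch of the full linear predictor in R (all β values and size factors below are made-up illustration numbers, not estimates):

beta0 <- log(100)                         # baseline (control) expression on the log scale
beta1 <- log(2)                           # log fold change: treatment doubles expression
s     <- c(0.9, 1.0, 1.1, 0.9, 1.0, 1.1)  # size factors (library depth)
x     <- c(0, 0, 0, 1, 1, 1)              # 0 = control, 1 = treatment
mu    <- exp(beta0 + beta1 * x + log(s))  # expected count per sample
round(mu)                                 # 90 100 110 (control), 180 200 220 (treatment)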
Raw count matrix + experimental design
Estimate size factors \(s_i\) (median-of-ratios method)
Gene-wise MLE → Fit trend → Shrink estimates
Estimate \(\boldsymbol{\beta}_j\) for each gene via negative binomial regression
Wald test + FDR correction
Results table with log2FC, p-values, adjusted p-values
That's it! The DESeq() function handles normalization, dispersion estimation, GLM fitting, and testing automatically.
We want to show that when \(n\) is large and \(p\) is small such that \(np = \lambda\) remains constant, the \(\text{Binomial}(n, p)\) distribution converges to \(\text{Poisson}(\lambda)\).
Starting with the Binomial PMF:
\[ P(K = k) = \binom{n}{k} p^k (1-p)^{n-k} \]
Substitute \(p = \lambda/n\):
\[ P(K = k) = \frac{\lambda^k}{k!} \cdot \frac{n!}{(n-k)!\, n^k} \left(1 - \frac{\lambda}{n}\right)^{n} \left(1 - \frac{\lambda}{n}\right)^{-k} \]
Simplify as \(n \to \infty\):
\[ \frac{n!}{(n-k)!\, n^k} = \frac{n(n-1)\cdots(n-k+1)}{n^k} \to 1, \qquad \left(1 - \frac{\lambda}{n}\right)^{n} \to e^{-\lambda}, \qquad \left(1 - \frac{\lambda}{n}\right)^{-k} \to 1 \]
Therefore:
\[ P(K = k) \to \frac{\lambda^k e^{-\lambda}}{k!}, \qquad k = 0, 1, 2, \ldots \]
This approximation is excellent when \(n \geq 100\) and \(p \leq 0.01\), or more generally when \(np < 10\) and \(n\) is large.
We will show that if \(\lambda \sim \text{Gamma}(\alpha, \beta)\) and \(K|\lambda \sim \text{Poisson}(\lambda)\), then marginally \(K \sim \text{NegativeBinomial}\).
Setup:
\(\lambda \sim \text{Gamma}(\alpha, \beta)\) with density \(f(\lambda) = \frac{\beta^\alpha}{\Gamma(\alpha)}\, \lambda^{\alpha-1} e^{-\beta\lambda}\) for \(\lambda > 0\), and \(K \mid \lambda \sim \text{Poisson}(\lambda)\) with \(P(K = k \mid \lambda) = \frac{\lambda^k e^{-\lambda}}{k!}\).
Marginal distribution by integrating out \(\lambda\):
\[ P(K = k) = \int_0^\infty \frac{\lambda^k e^{-\lambda}}{k!} \cdot \frac{\beta^\alpha}{\Gamma(\alpha)}\, \lambda^{\alpha-1} e^{-\beta\lambda}\, d\lambda = \frac{\beta^\alpha}{k!\,\Gamma(\alpha)} \int_0^\infty \lambda^{k+\alpha-1} e^{-(\beta+1)\lambda}\, d\lambda \]
Recognize the Gamma integral:
\[ \int_0^\infty \lambda^{k+\alpha-1} e^{-(\beta+1)\lambda}\, d\lambda = \frac{\Gamma(k+\alpha)}{(\beta+1)^{k+\alpha}} \]
Substitute back:
\[ P(K = k) = \frac{\Gamma(k+\alpha)}{k!\,\Gamma(\alpha)} \left(\frac{\beta}{\beta+1}\right)^{\alpha} \left(\frac{1}{\beta+1}\right)^{k} \]
This is the Negative Binomial PMF with \(r = \alpha\) and \(p = \beta/(\beta+1)\), giving mean \(\mu = \alpha/\beta\) and variance \(\mu + \mu^2/\alpha\).
The variance exceeds the mean by \(\mu^2/\alpha\), capturing over-dispersion.
We show that for \(K \sim \text{NegativeBinomial}\) with mean \(\mu = \alpha/\beta\), the variance is \(\mu + \mu^2/\alpha\).
Using the Gamma-Poisson hierarchy: \(\lambda \sim \text{Gamma}(\alpha, \beta)\) and \(K \mid \lambda \sim \text{Poisson}(\lambda)\), so \(\mathbb{E}[K \mid \lambda] = \text{Var}(K \mid \lambda) = \lambda\).
Law of Total Expectation:
\[ \mathbb{E}[K] = \mathbb{E}\bigl[\mathbb{E}[K \mid \lambda]\bigr] = \mathbb{E}[\lambda] = \frac{\alpha}{\beta} = \mu \]
Law of Total Variance:
\[ \text{Var}(K) = \mathbb{E}\bigl[\text{Var}(K \mid \lambda)\bigr] + \text{Var}\bigl(\mathbb{E}[K \mid \lambda]\bigr) = \mathbb{E}[\lambda] + \text{Var}(\lambda) \]
For \(\text{Gamma}(\alpha, \beta)\): \(\mathbb{E}[\lambda] = \frac{\alpha}{\beta}\) and \(\text{Var}(\lambda) = \frac{\alpha}{\beta^2}\).
Therefore:
\[ \text{Var}(K) = \frac{\alpha}{\beta} + \frac{\alpha}{\beta^2} = \mu + \frac{\mu^2}{\alpha} \]
When \(\alpha\) is small (high dispersion), the quadratic term \(\mu^2/\alpha\) dominates, creating large over-dispersion. When \(\alpha\) is large, variance approaches \(\mu\) (Poisson limit).
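A quick simulation of this limit in R (μ = 215 as in the running example):

mu <- 215
for (alpha in c(2, 20, 2000)) {
  x <- rnbinom(1e6, size = alpha, mu = mu)
  cat("alpha =", alpha, "| var =", round(var(x)),
      "| mu + mu^2/alpha =", round(mu + mu^2/alpha), "\n")
}
# As alpha grows, the variance shrinks toward mu = 215 (the Poisson limit)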
We show that if \(X_1, X_2, \ldots, X_n\) are independent \(\text{Bernoulli}(p)\) random variables, then their sum \(Y = \sum_{i=1}^n X_i\) follows \(\text{Binomial}(n, p)\).
Setup: each \(X_i\) takes value 1 with probability \(p\) and 0 with probability \(1-p\), independently of the others.
Define \(Y = X_1 + X_2 + \cdots + X_n\). We want to find \(P(Y = k)\) for \(k = 0, 1, \ldots, n\).
Counting argument:
\(Y = k\) means exactly \(k\) of the \(n\) trials are successes (value 1) and \(n-k\) are failures (value 0).
Probability calculation:
For any specific sequence with exactly \(k\) successes:
\[ P(X_1 = x_1, \ldots, X_n = x_n) = p^k (1-p)^{n-k} \]
Total probability: there are \(\binom{n}{k}\) such sequences, so
\[ P(Y = k) = \binom{n}{k} p^k (1-p)^{n-k} \]
This is exactly the PMF of \(\text{Binomial}(n, p)\). Therefore, the sum of \(n\) independent Bernoulli trials is Binomial.
We prove that for \(X \sim \text{Poisson}(\lambda)\), both \(\mathbb{E}[X] = \lambda\) and \(\text{Var}(X) = \lambda\).
Poisson PMF:
\[ P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \qquad k = 0, 1, 2, \ldots \]
Proof that \(\mathbb{E}[X] = \lambda\):
\[ \mathbb{E}[X] = \sum_{k=0}^{\infty} k\, \frac{\lambda^k e^{-\lambda}}{k!} = \lambda e^{-\lambda} \sum_{k=1}^{\infty} \frac{\lambda^{k-1}}{(k-1)!} = \lambda e^{-\lambda} e^{\lambda} = \lambda \]
Proof that \(\text{Var}(X) = \lambda\):
Use \(\text{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2\). First find \(\mathbb{E}[X^2]\) via the factorial moment:
\[ \mathbb{E}[X(X-1)] = \sum_{k=2}^{\infty} k(k-1)\, \frac{\lambda^k e^{-\lambda}}{k!} = \lambda^2 e^{-\lambda} \sum_{k=2}^{\infty} \frac{\lambda^{k-2}}{(k-2)!} = \lambda^2, \]
so \(\mathbb{E}[X^2] = \mathbb{E}[X(X-1)] + \mathbb{E}[X] = \lambda^2 + \lambda\).
Therefore:
\[ \text{Var}(X) = \lambda^2 + \lambda - \lambda^2 = \lambda \]
This unique property (variance equals mean) characterizes the Poisson distribution and is why it is unsuitable for over-dispersed RNA-seq data.
We explain why products of many small random factors lead to a lognormal distribution via the Central Limit Theorem.
Setup: Multiplicative model
Suppose the effective expression rate \(\lambda\) is determined by multiplicative factors:
\[ \lambda = F_1 \times F_2 \times \cdots \times F_m \]
where each \(F_i > 0\) is a random factor (e.g., library efficiency, cell-type proportion, capture rate).
Taking logarithms:
\[ \log(\lambda) = \sum_{i=1}^{m} \log(F_i) \]
Central Limit Theorem:
If \(\log(F_i)\) are independent random variables with finite mean \(\mu_i\) and variance \(\sigma_i^2\), then for large \(m\):
\[ \log(\lambda) \approx \mathcal{N}\!\left(\sum_{i=1}^{m} \mu_i,\; \sum_{i=1}^{m} \sigma_i^2\right) \]
Lognormal distribution:
Since \(\log(\lambda)\) is approximately normal, \(\lambda = e^{\log(\lambda)}\) follows a lognormal distribution:
\[ \lambda \sim \text{Lognormal}\!\left(\sum_{i=1}^{m} \mu_i,\; \sum_{i=1}^{m} \sigma_i^2\right) \]
Why Gamma instead?
The lognormal is mathematically valid, but it is not conjugate to the Poisson: the Poisson-lognormal mixture has no closed-form PMF and requires numerical integration for every gene. The Gamma-Poisson mixture, in contrast, yields the Negative Binomial in closed form.
Practical takeaway: The multiplicative factors story justifies a heavy-tailed, positive distribution for \(\lambda\). Gamma is chosen for mathematical convenience while capturing the same essential features.
We derive the mean and variance of \(\lambda \sim \text{Gamma}(\alpha, \beta)\).
Gamma PDF:
\[ f(\lambda) = \frac{\beta^\alpha}{\Gamma(\alpha)}\, \lambda^{\alpha - 1} e^{-\beta \lambda}, \qquad \lambda > 0 \]
Mean (First Moment), using \(\Gamma(\alpha+1) = \alpha\,\Gamma(\alpha)\):
\[ \mathbb{E}[\lambda] = \int_0^\infty \lambda\, f(\lambda)\, d\lambda = \frac{\beta^\alpha}{\Gamma(\alpha)} \cdot \frac{\Gamma(\alpha+1)}{\beta^{\alpha+1}} = \frac{\alpha}{\beta} \]
Second Moment:
\[ \mathbb{E}[\lambda^2] = \frac{\beta^\alpha}{\Gamma(\alpha)} \cdot \frac{\Gamma(\alpha+2)}{\beta^{\alpha+2}} = \frac{(\alpha+1)\,\alpha}{\beta^2} \]
Variance:
\[ \text{Var}(\lambda) = \mathbb{E}[\lambda^2] - (\mathbb{E}[\lambda])^2 = \frac{\alpha^2 + \alpha - \alpha^2}{\beta^2} = \frac{\alpha}{\beta^2} \]
Summary: \(\mathbb{E}[\lambda] = \alpha/\beta\) and \(\text{Var}(\lambda) = \alpha/\beta^2\).
Note: The variance can be rewritten as \(\text{Var}(\lambda) = \frac{1}{\alpha} \left(\frac{\alpha}{\beta}\right)^2 = \frac{\mu^2}{\alpha}\), showing that smaller \(\alpha\) means higher relative variability.
DESeq2 uses empirical Bayes shrinkage to stabilize dispersion estimates when sample sizes are small.
Problem:
With only 3-6 replicates per condition, the maximum likelihood estimate \(\hat{\theta}_j\) of gene \(j\)'s dispersion is highly variable (high variance estimator).
Observation:
Dispersion tends to depend on mean expression: \(\theta_j = f(\mu_j) + \epsilon_j\) where \(f\) is a smooth function (often power-law or spline).
Empirical Bayes Strategy: shrink each gene-wise estimate toward the fitted trend,
\[ \tilde{\theta}_j = w_j\, \hat{\theta}_j + (1 - w_j)\, f(\mu_j) \]
where \(w_j \in [0,1]\) is a data-driven weight that balances the gene's own (noisy) estimate \(\hat{\theta}_j\) against the fitted trend value \(f(\mu_j)\).
Weight calculation:
The weight \(w_j\) depends on the precision (inverse variance) of \(\hat{\theta}_j\):
\[ w_j = \frac{\tau^2}{\tau^2 + \text{Var}(\hat{\theta}_j)} \]
where \(\tau^2\) is the variance of true dispersions around the fitted trend (estimated from data).
Effect: genes whose \(\hat{\theta}_j\) is precise keep \(w_j \approx 1\) and are barely shrunk; genes with noisy estimates get small \(w_j\) and are pulled toward the trend \(f(\mu_j)\).
Benefit: Reduces false positives from genes with unreliable dispersion estimates while preserving true biological variability.
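A toy sketch of this shrinkage arithmetic in R (all numbers are hypothetical, and this is the generic precision-weighted average above, not DESeq2's exact estimator):

theta_hat <- c(0.50, 0.12)  # gene-wise dispersion estimates (hypothetical)
trend     <- c(0.15, 0.15)  # fitted trend f(mu_j) at each gene's mean (hypothetical)
var_hat   <- c(0.20, 0.01)  # sampling variance of each theta_hat (hypothetical)
tau2      <- 0.02           # spread of true dispersions around the trend (hypothetical)
w <- tau2 / (tau2 + var_hat)
round(w * theta_hat + (1 - w) * trend, 3)  # noisy gene 1 is pulled to the trend; precise gene 2 barely moves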
When testing \(m\) hypotheses simultaneously (e.g., 20,000 genes), we need to control the False Discovery Rate (FDR): the expected proportion of false positives among rejected hypotheses.
Problem with naive thresholding:
If we reject all genes with \(p < 0.05\) and test 20,000 genes where all nulls are true, we expect \(20{,}000 \times 0.05 = 1{,}000\) false positives!
Benjamini-Hochberg (BH) Procedure:
Given \(m\) p-values \(p_1, \ldots, p_m\), to control FDR at level \(\alpha\) (e.g., 0.05):
Step 1: Order p-values: \(p_{(1)} \leq p_{(2)} \leq \cdots \leq p_{(m)}\)
Step 2: Find the largest \(k\) such that:
\[ p_{(k)} \leq \frac{k}{m}\, \alpha \]
Step 3: Reject hypotheses \(H_{(1)}, \ldots, H_{(k)}\)
Adjusted p-values:
DESeq2 reports \(\text{padj}_i\), the smallest FDR level at which hypothesis \(i\) would be rejected:
\[ \text{padj}_{(i)} = \min_{j \geq i} \left\{ \min\!\left( \frac{m\, p_{(j)}}{j},\; 1 \right) \right\} \]
Interpretation: \(\text{padj}_i \leq 0.05\) means gene \(i\) belongs to a rejection set whose expected fraction of false discoveries is at most 5%.
FDR Guarantee (under independence or positive dependence):
\[ \text{FDR} = \mathbb{E}\!\left[\frac{V}{\max(R, 1)}\right] \leq \frac{m_0}{m}\,\alpha \leq \alpha \]
where \(V\) is the number of false rejections, \(R\) the total number of rejections, and \(m_0\) the number of true nulls.
This is much weaker than family-wise error rate (FWER) control (e.g., Bonferroni) but more powerful for high-throughput screening.
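A sketch implementing the BH adjustment by hand and checking it against R's built-in p.adjust (the simulated p-value mix is illustrative):

set.seed(8)
p <- c(runif(9000), rbeta(1000, 1, 50))  # mostly null p-values plus some signal
m <- length(p)
o <- order(p)
padj <- numeric(m)
# padj_(i) = min over j >= i of min(1, m * p_(j) / j)
padj[o] <- pmin(1, rev(cummin(rev(p[o] * m / seq_len(m)))))
all.equal(padj, p.adjust(p, method = "BH"))  # TRUE
sum(padj < 0.05)                             # number of BH discoveries at FDR 5%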
The Mathematical Journey:
Step 1: Express λ as a product
\[ \lambda = F_1 \times F_2 \times \cdots \times F_m \]
where \(F_i\) represents factors like expression bursts, cell-type proportions, library amplification, etc.
Step 2: Take logarithms (products become sums)
\[ \log(\lambda) = \sum_{i=1}^{m} \log(F_i) \]
Step 3: Apply Central Limit Theorem
If \(\log(F_i)\) are independent random variables with finite mean \(\mu_i\) and variance \(\sigma_i^2\), then for large \(m\):
\[ \log(\lambda) \approx \mathcal{N}\!\left(\sum_{i=1}^{m} \mu_i,\; \sum_{i=1}^{m} \sigma_i^2\right) \]
Step 4: Exponentiate to get λ
Since \(\log(\lambda)\) is approximately normal, \(\lambda = e^{\log(\lambda)}\) follows a lognormal distribution:
\[ \lambda \sim \text{Lognormal}\!\left(\sum_{i=1}^{m} \mu_i,\; \sum_{i=1}^{m} \sigma_i^2\right) \]
Step 5: Why approximate with Gamma?
Visual comparison:
Both Gamma and Lognormal are positive, right-skewed distributions that can model the same range of shapes. The Gamma wins due to mathematical convenience!
The Key Reason: Mathematical Elegance
Gamma distribution is chosen primarily because it is the conjugate prior for the Poisson distribution.
What does this mean? If the prior on \(\lambda\) is Gamma and the counts are Poisson given \(\lambda\), the posterior of \(\lambda\) is again a Gamma, and the marginal distribution of the counts (the Gamma-Poisson mixture) is available in closed form: the Negative Binomial.
The Four Benefits:
1. Closed-form solution (No numerical integration needed)
2. Interpretable parameters
3. Matches biological reality
4. Computational efficiency
Bottom Line:
While lognormal is theoretically justified by multiplicative factors, Gamma is chosen because it leads to the Negative Binomial distribution, giving us a tractable, interpretable, and computationally efficient model for RNA-seq counts!
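A minimal R sketch of the conjugate update itself (prior parameters and counts are illustrative):

alpha0 <- 2; beta0 <- 0.01     # Gamma(alpha0, beta0) prior on lambda (hypothetical)
k <- c(198, 215, 232)          # observed replicate counts (illustrative values)
alpha_post <- alpha0 + sum(k)  # posterior is Gamma(alpha0 + sum(k), beta0 + n): no integration needed
beta_post  <- beta0 + length(k)
alpha_post / beta_post         # posterior mean of lambda, ~215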
What is the Wald Test?
The Wald test is a statistical hypothesis test used to determine if a coefficient in our regression model is significantly different from zero.
The Test Statistic:
\[ W = \frac{\hat{\beta}_{j,\text{treatment}}}{\text{SE}(\hat{\beta}_{j,\text{treatment}})} \]
where \(\hat{\beta}_{j,\text{treatment}}\) is the estimated log2 fold change for gene \(j\), and \(\text{SE}(\hat{\beta}_{j,\text{treatment}})\) is its standard error from the negative binomial GLM fit.
Under the null hypothesis (no differential expression, \(\beta_{j,\text{treatment}} = 0\)):
The test statistic follows a standard normal distribution: \(W \sim \mathcal{N}(0, 1)\) (asymptotically).
Computing the p-value:
\[ p = 2\,\bigl(1 - \Phi(|W|)\bigr) \]
where \(\Phi\) is the standard normal cumulative distribution function. We use a two-tailed test (factor of 2) because we care about both up- and down-regulation.
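A short R illustration of this computation (the estimate and standard error are hypothetical):

beta_hat <- 0.9                 # estimated log2 fold change (hypothetical)
se_hat   <- 0.3                 # its standard error (hypothetical)
W <- beta_hat / se_hat
p <- 2 * (1 - pnorm(abs(W)))    # two-tailed p-value
c(W = W, p = p)                 # W = 3, p ~ 0.0027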
Interpretation: a large \(|W|\) means the estimated coefficient lies many standard errors away from zero; the resulting small p-value is evidence of differential expression.
Why Wald Test (instead of Likelihood Ratio Test)? The Wald test needs only one model fit per gene, whereas the LRT requires fitting both a full and a reduced model; for simple two-group contrasts the two are asymptotically equivalent, so DESeq2 defaults to the faster Wald test (the LRT remains available via test = "LRT").
Important Note:
For genes with very low counts or high dispersion, the asymptotic normal approximation may not be accurate. DESeq2 handles this through filtering and shrinkage.
The Problem: Noisy LFC Estimates for Low-Count Genes
Consider a gene with very low counts in both conditions (only a handful of reads per sample) yet a large apparent log2 fold change.
But is this reliable? With so few counts, this could be noise!
The Solution: Empirical Bayes Shrinkage
DESeq2 shrinks LFC estimates toward zero, with the amount of shrinkage depending on the information available for the gene (counts and dispersion) and on the distribution of LFCs across all genes, which sets the width of the prior.
The Shrinkage Formula: the shrunken LFC is the posterior mode under a zero-centered prior,
\[ \tilde{\beta}_j = \operatorname*{arg\,max}_{\beta}\ \bigl[\, \ell_j(\beta) + \log \pi(\beta) \,\bigr] \]
where \(\ell_j(\beta)\) is the negative binomial log-likelihood for gene \(j\) and \(\pi(\beta)\) is a zero-centered prior on the LFC (normal in the original DESeq2; heavy-tailed in apeglm).
Interpretation: genes with precise LFC estimates (high counts, low dispersion) are barely shrunk; noisy estimates are pulled strongly toward zero.
Example: a low-count gene with an apparent LFC of 3 may shrink to well below 1, while a high-count gene with the same LFC barely moves.
Benefits: more stable effect-size estimates, more reliable gene rankings, and fewer spurious large fold changes driven by low-count noise.
Important:
Shrinkage affects LFC estimates (effect sizes), not p-values. P-values still come from the original (unshrunken) Wald test.
DESeq2 Function:
Use lfcShrink() after DESeq() to get shrunken LFC for plotting and ranking genes.
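A usage sketch, assuming the dds object and Control/Treatment design from the workflow exercise below; the coefficient name should be checked with resultsNames(dds), and type = "apeglm" requires the apeglm package:

resultsNames(dds)                # check the available coefficient names first
resShrunk <- lfcShrink(dds, coef = "condition_Treatment_vs_Control", type = "apeglm")
plotMA(resShrunk, ylim = c(-2, 2))   # shrunken LFCs give a cleaner MA plot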
Goal: Understand overdispersion by simulating count data from Poisson and Negative Binomial distributions.
What you'll learn: How biological variability creates overdispersion that the Poisson can't capture.
# Simulate RNA-seq-like count data to see overdispersion
library(ggplot2)

# Set parameters
mean_expression <- 100
dispersion <- 0.5   # Higher = more overdispersion
n_samples <- 1000

# Simulate Poisson (no overdispersion)
poisson_counts <- rpois(n_samples, lambda = mean_expression)

# Simulate Negative Binomial (allows overdispersion)
# size parameter relates to dispersion: smaller = more dispersed
nb_counts <- rnbinom(n_samples, mu = mean_expression, size = 1/dispersion)

# Compare variances
cat("Poisson variance:", var(poisson_counts), "\n")
cat("NB variance:", var(nb_counts), "\n")
cat("Theoretical NB variance:", mean_expression + dispersion * mean_expression^2, "\n")

# Visualize the distributions
data <- data.frame(
  counts = c(poisson_counts, nb_counts),
  distribution = rep(c("Poisson", "Negative Binomial"), each = n_samples)
)
ggplot(data, aes(x = counts, fill = distribution)) +
  geom_histogram(alpha = 0.6, position = "identity", bins = 30) +
  labs(title = "Poisson vs Negative Binomial",
       subtitle = "Same mean, different variance") +
  theme_minimal()
Goal: Run a complete differential expression analysis using DESeq2.
What you'll learn: The standard DESeq2 workflow from count matrix to differential expression results.
# Complete DESeq2 workflow for differential expression analysis
library(DESeq2)
library(tidyverse)

# Load your count matrix (genes x samples)
# Rows = genes, Columns = samples
# counts <- read.csv("your_counts.csv", row.names = 1)

# For demonstration, create sample data
counts <- matrix(rpois(6000, lambda = 50), nrow = 1000, ncol = 6)
rownames(counts) <- paste0("Gene_", 1:1000)
colnames(counts) <- paste0("Sample_", 1:6)

# Create sample metadata
coldata <- data.frame(
  condition = factor(rep(c("Control", "Treatment"), each = 3)),
  row.names = colnames(counts)
)

# Create DESeq2 dataset
dds <- DESeqDataSetFromMatrix(
  countData = counts,
  colData = coldata,
  design = ~ condition
)

# Filter low-count genes (optional but recommended)
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep, ]

# Run DESeq2 analysis
dds <- DESeq(dds)

# Extract results
res <- results(dds, contrast = c("condition", "Treatment", "Control"))

# Summary of results
summary(res)

# Get significantly differentially expressed genes (padj < 0.05)
sig_genes <- res[which(res$padj < 0.05), ]
cat("Number of significant genes:", nrow(sig_genes), "\n")

# MA plot
plotMA(res, ylim = c(-2, 2))

# Export results
write.csv(as.data.frame(res), "deseq2_results.csv")
Goal: Understand how DESeq2 estimates and shrinks dispersion parameters.
What you'll learn: The mean-dispersion relationship and why shrinkage improves power.
# Visualize dispersion estimation in DESeq2
library(DESeq2)
library(ggplot2)
library(dplyr)   # for %>%, mutate, summarise

# Assuming you have a dds object from previous DESeq2 analysis
# If not, create one following Exercise 2

# Built-in dispersion plot
plotDispEsts(dds)

# Extract dispersion information for custom plots
mean_norm <- rowMeans(counts(dds, normalized = TRUE))
disp_data <- data.frame(
  mean = mean_norm,
  gene_est = mcols(dds)$dispGeneEst,          # Gene-wise estimates
  fit = dispersionFunction(dds)(mean_norm),   # Fitted trend
  final = dispersions(dds)                    # Final shrunken estimates
)

# Custom dispersion plot
ggplot(disp_data, aes(x = mean)) +
  geom_point(aes(y = gene_est), alpha = 0.3, color = "gray", size = 1) +
  geom_line(aes(y = fit), color = "red", linewidth = 1.5) +
  geom_point(aes(y = final), alpha = 0.5, color = "blue", size = 0.8) +
  scale_x_log10() +
  scale_y_log10() +
  labs(title = "Dispersion Estimation",
       subtitle = "Gray: gene-wise | Red: fitted trend | Blue: final shrunken",
       x = "Mean of normalized counts",
       y = "Dispersion") +
  theme_minimal()

# Show impact of shrinkage
shrinkage_effect <- disp_data %>%
  mutate(shrinkage = abs(gene_est - final) / gene_est) %>%
  summarise(median_shrinkage = median(shrinkage, na.rm = TRUE))
cat("Median shrinkage:", round(shrinkage_effect$median_shrinkage * 100, 2), "%\n")
Goal: Build and visualize design matrices for different experimental designs.
What you'll learn: How R formulas translate to design matrices that connect samples to conditions.
# Explore design matrices for different experimental designs
library(DESeq2)

# Example 1: Simple two-group comparison
coldata_simple <- data.frame(
  condition = factor(c(rep("Control", 3), rep("Treatment", 3)))
)
design_simple <- model.matrix(~ condition, data = coldata_simple)
print("Simple design matrix:")
print(design_simple)

# Example 2: Multi-factor design (condition + batch)
coldata_multi <- data.frame(
  condition = factor(c(rep("Control", 4), rep("Treatment", 4))),
  batch = factor(c("A", "B", "A", "B", "A", "B", "A", "B"))
)
design_multi <- model.matrix(~ batch + condition, data = coldata_multi)
print("Multi-factor design matrix:")
print(design_multi)

# Example 3: Interaction design
coldata_interact <- data.frame(
  genotype = factor(rep(c("WT", "KO"), each = 4)),
  treatment = factor(rep(c("Control", "Treated"), 4))
)
design_interact <- model.matrix(~ genotype * treatment, data = coldata_interact)
print("Interaction design matrix:")
print(design_interact)

# Visualize what coefficients mean
cat("\n=== Coefficient Interpretation ===\n")
cat("(Intercept): Baseline expression for reference group\n")
cat("conditionTreatment: Log2 fold change of Treatment vs Control\n")
cat("batchB: Batch effect adjustment (batch B vs batch A)\n")
cat("genotypeWT:treatmentTreated: Interaction (treatment effect differs by genotype)\n")