Varying λ ~ Gamma → Negative Binomial
Dr. Ali T. Abdallah
Working with R (Advanced Course)
Advanced Analysis Teaching Series • November 2025
© 2025 Dr. Ali T. Abdallah. All rights reserved.
You have 20,000 genes and 3 replicates per condition. Which genes truly changed? Without the right model, you'll miss real changes or report false positives.
⏱️ ~45-60 min | Step-by-step | No advanced math required
Every step in our statistical model
corresponds to a
biological or experimental reality
Natural first thought: To compare control vs treatment, why not just run a t-test on the counts?
We need a model that respects the count-generating process. Let's build it step-by-step...
Imagine tracking a single FOXO3A mRNA in your cell sample. During RNA-seq, this molecule either gets successfully captured and sequenced → Success (X=1), or gets lost somewhere in the process → Failure (X=0).
Key Idea: One RNA molecule → Either captured (X=1) or lost (X=0)
The overall success probability pg reflects the combined efficiency of all experimental steps:
Overall pFOXO3A = 0.90 × 0.80 × 0.95 × 0.85 × 0.75 ≈ 0.43
So each FOXO3A molecule has ~43% chance of being counted!
In a perfectly homogeneous world, each molecule of a given gene would behave as an independent Bernoulli trial with the same gene-specific success probability pg across all molecules and all replicates.
Note: This assumption breaks down in reality, which we'll address in later slides.
This binary outcome is the fundamental building block behind count modeling. Every counted read started as a single molecule that succeeded in this Bernoulli trial.
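A minimal R sketch of this idea (the capture probability 0.43 comes from the FOXO3A example above; everything else is illustrative):

set.seed(42)
p_capture <- 0.43
rbinom(1, size = 1, prob = p_capture)            # one molecule: 1 = captured, 0 = lost
mean(rbinom(10000, size = 1, prob = p_capture))  # fraction captured across many molecules, ~0.43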
Your sample contains n = 500 FOXO3A molecules. Each has a 43% chance of being captured. How many reads will you observe? This follows a Binomial distribution.
Key Idea: Now imagine n independent molecules for the same gene, each with capture probability pg.
Expected reads = n × p = 500 × 0.43 = 215 reads
But you might observe anywhere from ~180 to ~250 reads due to random sampling variation!
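A quick R check of that range, simulating 10,000 hypothetical libraries with n = 500 and p = 0.43:

set.seed(1)
reads <- rbinom(10000, size = 500, prob = 0.43)
mean(reads)                       # ~215 reads on average
quantile(reads, c(0.001, 0.999))  # extreme outcomes land roughly at 180 and 250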
In a perfect world, every replicate would share the same n and pg for a gene, giving identical expected counts.
In RNA-seq, n (total molecules) is unknown and unmeasurable. We can't directly fit a Binomial model.
Solution: Approximate with Poisson using λ = n·pg = 215 (next slide).
Key Idea: When n is large but pg is small, we can simplify the Binomial.
We don't know if there were 500 FOXO3A molecules (p=0.43) or 1000 molecules (p=0.215), but both give λ = 215 expected reads. The Poisson distribution uses only λ, eliminating the unmeasurable parameters!
When you have many molecules (large n) but each has a small chance of being captured (small p), the math simplifies beautifully.
The Poisson is the limiting case of Binomial as n → ∞ and p → 0, with np held constant.
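A short R sketch of this idea, using the two (n, p) pairs from this slide. Note that p = 0.43 is not actually small, so the Binomial variances still fall short of the Poisson value; the approximation tightens as p shrinks with np fixed:

set.seed(2)
b1 <- rbinom(1e5, size = 500, prob = 0.43)    # n = 500,  p = 0.43
b2 <- rbinom(1e5, size = 1000, prob = 0.215)  # n = 1000, p = 0.215
po <- rpois(1e5, lambda = 215)
c(mean(b1), mean(b2), mean(po))   # all ~215
c(var(b1), var(b2), var(po))      # ~123, ~169, ~215: variance approaches 215 as p -> 0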
Poisson assumes λ is fixed across replicates → variance equals the mean.
Theory predicts: If every replicate saw identical conditions, Poisson would fit perfectly.
In a perfect world: All 3 FOXO3A replicates have exactly λ=215, giving Variance = Mean = 215
This is the Poisson property: simple, elegant, but unrealistic for real biology!
Blue points: Simulated Poisson data hugging the diagonal line where Variance = Mean
Orange points: Real RNA-seq data ballooning far above the line!
This is over-dispersion: the variance exceeds what the Poisson predicts.
The Poisson assumption (fixed λ=215 for all FOXO3A replicates) is violated!
In reality, each replicate has a different λ value due to biological and technical variation.
We need to model λ itself as a random variable (next slides).
Biological variation in molecule counts (e.g., expression bursts, cell-type proportions):
Technical variation in capture (e.g., library preparation efficiency, amplification, sequencing depth):
Think of each FOXO3A replicate as sampling from its own "distinct universe" with a different λ value:
The variation in λ itself is random, so we need to model it!
We model λ as a random variable following a Gamma distribution.
For FOXO3A: Instead of fixed λ=215, we have λ ~ Gamma(shape, scale) that produces values like 200, 215, 230...
This captures the heterogeneity across replicates (details in next slide).
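A minimal R sketch of this step (shape = 100 and rate = 100/215 are illustrative choices that give mean 215):

set.seed(5)
lambda_reps <- rgamma(3, shape = 100, rate = 100/215)  # one lambda per replicate, mean 215
round(lambda_reps)  # three different lambdas, e.g. values near 200, 215, 230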
Answer: λ varies because biological and technical factors multiply together.
Mean = α/β encodes average expression level (e.g., α/β = 215 for FOXO3A)
Variance = α/β² = μ²/α → at a fixed mean, larger α means less spread (more consistent λ across replicates)
Different (α, β) pairs fit different genes' variability (see figure)
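A quick Monte Carlo check of these two formulas (α = 100 and β = 100/215 are illustrative values):

alpha <- 100
beta  <- 100/215
draws <- rgamma(1e6, shape = alpha, rate = beta)
c(mean(draws), alpha / beta)    # both ~215
c(var(draws),  alpha / beta^2)  # both ~462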
Step 1: Products → Sums via logarithm: log(X₁ × X₂) = log(X₁) + log(X₂)
Step 2: Central Limit Theorem → Sum of random variables ≈ Normal
Step 3: Exponentiate → exp(Normal) = Lognormal
Step 4: Gamma ≈ Lognormal in shape, but Gamma + Poisson = NB (closed form!)
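A small R demo of Steps 1-3 (the factors Fᵢ are drawn uniformly from [0.8, 1.25] purely for illustration):

set.seed(6)
m <- 20
lambda <- replicate(1e4, 215 * prod(runif(m, 0.8, 1.25)))  # product of m random factors
hist(log(lambda), breaks = 50)  # roughly bell-shaped on the log scale (CLT at work)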
Poisson: "Given this exact concentration in the library, what range of counts will I see due to random sampling?" (additive noise)
Gamma: "How much does the actual concentration in the library vary between my samples due to biology and technical factors?" (multiplicative noise)
Key insight: Negative Binomial captures both sources of variation naturally! Variance = μ + μ²/α captures over-dispersion.
Step 1: Draw a rate λ from Gamma(α, β)
Step 2: Given that λ, draw count K from Poisson(λ)
Each replicate gets a different λ, but all from the same Gamma population.
Integrating out λ yields K ~ NegativeBinomial(r, p)
Where: r = α, p = β/(β+1) (standard NB notation)
Mean: μ = α/β (same as Gamma mean)
Variance: μ + μ²/α → over-dispersion!
The extra term μ²/α captures variability beyond Poisson.
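A sketch verifying this hierarchy in R (α = 10 and β = 10/215 are illustrative; rnbinom's size parameter plays the role of r = α):

set.seed(7)
alpha <- 10
beta  <- 10/215                                       # Gamma mean alpha/beta = 215
k_hier <- rpois(1e5, rgamma(1e5, shape = alpha, rate = beta))  # two-step hierarchy
k_nb   <- rnbinom(1e5, size = alpha, mu = alpha/beta)          # direct NB draw
c(mean(k_hier), mean(k_nb))                           # both ~215
c(var(k_hier), var(k_nb), 215 + 215^2/alpha)          # all ~4840: mu + mu^2/alpha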
DESeq2 uses: mean μ and dispersion θ = 1/α
Higher θ = more over-dispersion (more variable λ values)
This parameterization is easier to estimate from data.
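A one-line check of this parameterization (θ = 0.1 and μ = 215 are illustrative):

theta <- 0.1                                # DESeq2-style dispersion, theta = 1/alpha
mu    <- 215
x <- rnbinom(1e6, size = 1/theta, mu = mu)
c(var(x), mu + theta * mu^2)                # both ~4840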
We've modeled quasi-random sources of variability: biological (e.g., expression bursts, cell-type proportions) and technical (e.g., library efficiency, capture rate, sequencing depth).
These factors affect expression multiplicatively: each factor scales the expected count, so effects multiply rather than add (a 10% drop in capture efficiency reduces every expected count by 10%).
How do we build a regression model that handles: non-negative counts, multiplicative effects, over-dispersion, and unequal sequencing depths?
Our scientific question: "Which genes are differentially expressed between conditions?"
Generalized Linear Model (GLM) extends linear regression to non-normal distributions like Negative Binomial.
Perfect for multiplicative effects!
Transforms multiplicative effects → additive: \(\log(a \times b) = \log(a) + \log(b)\)
The Core Model: For each gene j and sample i, we model the expected count μij:
\[ \log(\mu_{ij}) = \sum_{k=0}^{p} \beta_{kj}\, x_{ik} + \log(s_i), \qquad x_{i0} = 1 \]
The log link ensures μij > 0 (counts can't be negative!)
Key advantage: Fold changes become additive on log scale
Example: For treatment vs control, β1j = log(fold change)
For control vs treatment design (p=1, only one covariate):
So: \(\log(\mu_{ij}) = \beta_{0j} \cdot 1 + \beta_{1j} \cdot x_{i1} + \log(s_i)\), where \(x_{i1} = 0\) for control and \(x_{i1} = 1\) for treatment samples
si = size factor for sample i (adjusts for library depth)
Without this, deeper-sequenced samples would show higher counts artificially
This term is fixed (not estimated), so we call it an "offset"
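A minimal sketch of the full linear predictor in R (all β values and size factors below are made-up illustration numbers, not estimates):

beta0 <- log(100)                         # baseline (control) expression on the log scale
beta1 <- log(2)                           # log fold change: treatment doubles expression
s     <- c(0.9, 1.0, 1.1, 0.9, 1.0, 1.1)  # size factors (library depth)
x     <- c(0, 0, 0, 1, 1, 1)              # 0 = control, 1 = treatment
mu    <- exp(beta0 + beta1 * x + log(s))  # expected count per sample
round(mu)                                 # 90 100 110 (control), 180 200 220 (treatment)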
Raw count matrix + experimental design
Estimate size factors \(s_i\) (median-of-ratios method)
Gene-wise MLE → Fit trend → Shrink estimates
Estimate \(\boldsymbol{\beta}_j\) for each gene via negative binomial regression
Wald test + FDR correction
Results table with log2FC, p-values, adjusted p-values
That's it! The DESeq() function handles normalization, dispersion estimation, GLM fitting, and testing automatically.
We want to show that when \(n\) is large and \(p\) is small such that \(np = \lambda\) remains constant, the \(\text{Binomial}(n, p)\) distribution converges to \(\text{Poisson}(\lambda)\).
Starting with the Binomial PMF:
\[ P(K = k) = \binom{n}{k} p^k (1-p)^{n-k} \]
Substitute \(p = \lambda/n\):
\[ P(K = k) = \frac{\lambda^k}{k!} \cdot \frac{n!}{(n-k)!\, n^k} \left(1 - \frac{\lambda}{n}\right)^{n} \left(1 - \frac{\lambda}{n}\right)^{-k} \]
Simplify as \(n \to \infty\):
\[ \frac{n!}{(n-k)!\, n^k} = \frac{n(n-1)\cdots(n-k+1)}{n^k} \to 1, \qquad \left(1 - \frac{\lambda}{n}\right)^{n} \to e^{-\lambda}, \qquad \left(1 - \frac{\lambda}{n}\right)^{-k} \to 1 \]
Therefore:
\[ P(K = k) \to \frac{\lambda^k e^{-\lambda}}{k!}, \qquad k = 0, 1, 2, \ldots \]
This approximation is excellent when \(n \geq 100\) and \(p \leq 0.01\), or more generally when \(np < 10\) and \(n\) is large.
We will show that if \(\lambda \sim \text{Gamma}(\alpha, \beta)\) and \(K|\lambda \sim \text{Poisson}(\lambda)\), then marginally \(K \sim \text{NegativeBinomial}\).
Setup:
\(\lambda \sim \text{Gamma}(\alpha, \beta)\) with density \(f(\lambda) = \frac{\beta^\alpha}{\Gamma(\alpha)}\, \lambda^{\alpha-1} e^{-\beta\lambda}\) for \(\lambda > 0\), and \(K \mid \lambda \sim \text{Poisson}(\lambda)\) with \(P(K = k \mid \lambda) = \frac{\lambda^k e^{-\lambda}}{k!}\).
Marginal distribution by integrating out \(\lambda\):
\[ P(K = k) = \int_0^\infty \frac{\lambda^k e^{-\lambda}}{k!} \cdot \frac{\beta^\alpha}{\Gamma(\alpha)}\, \lambda^{\alpha-1} e^{-\beta\lambda}\, d\lambda = \frac{\beta^\alpha}{k!\,\Gamma(\alpha)} \int_0^\infty \lambda^{k+\alpha-1} e^{-(\beta+1)\lambda}\, d\lambda \]
Recognize the Gamma integral:
\[ \int_0^\infty \lambda^{k+\alpha-1} e^{-(\beta+1)\lambda}\, d\lambda = \frac{\Gamma(k+\alpha)}{(\beta+1)^{k+\alpha}} \]
Substitute back:
\[ P(K = k) = \frac{\Gamma(k+\alpha)}{k!\,\Gamma(\alpha)} \left(\frac{\beta}{\beta+1}\right)^{\alpha} \left(\frac{1}{\beta+1}\right)^{k} \]
This is the Negative Binomial PMF with \(r = \alpha\) and \(p = \beta/(\beta+1)\), giving mean \(\mu = \alpha/\beta\) and variance \(\mu + \mu^2/\alpha\).
The variance exceeds the mean by \(\mu^2/\alpha\), capturing over-dispersion.
We show that for \(K \sim \text{NegativeBinomial}\) with mean \(\mu = \alpha/\beta\), the variance is \(\mu + \mu^2/\alpha\).
Using the Gamma-Poisson hierarchy: \(\lambda \sim \text{Gamma}(\alpha, \beta)\) and \(K \mid \lambda \sim \text{Poisson}(\lambda)\), so \(\mathbb{E}[K \mid \lambda] = \text{Var}(K \mid \lambda) = \lambda\).
Law of Total Expectation:
\[ \mathbb{E}[K] = \mathbb{E}\bigl[\mathbb{E}[K \mid \lambda]\bigr] = \mathbb{E}[\lambda] = \frac{\alpha}{\beta} = \mu \]
Law of Total Variance:
\[ \text{Var}(K) = \mathbb{E}\bigl[\text{Var}(K \mid \lambda)\bigr] + \text{Var}\bigl(\mathbb{E}[K \mid \lambda]\bigr) = \mathbb{E}[\lambda] + \text{Var}(\lambda) \]
For \(\text{Gamma}(\alpha, \beta)\): \(\mathbb{E}[\lambda] = \frac{\alpha}{\beta}\) and \(\text{Var}(\lambda) = \frac{\alpha}{\beta^2}\).
Therefore:
\[ \text{Var}(K) = \frac{\alpha}{\beta} + \frac{\alpha}{\beta^2} = \mu + \frac{\mu^2}{\alpha} \]
When \(\alpha\) is small (high dispersion), the quadratic term \(\mu^2/\alpha\) dominates, creating large over-dispersion. When \(\alpha\) is large, variance approaches \(\mu\) (Poisson limit).
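A quick simulation of this limit in R (μ = 215 as in the running example):

mu <- 215
for (alpha in c(2, 20, 2000)) {
  x <- rnbinom(1e6, size = alpha, mu = mu)
  cat("alpha =", alpha, "| var =", round(var(x)),
      "| mu + mu^2/alpha =", round(mu + mu^2/alpha), "\n")
}
# As alpha grows, the variance shrinks toward mu = 215 (the Poisson limit)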
We show that if \(X_1, X_2, \ldots, X_n\) are independent \(\text{Bernoulli}(p)\) random variables, then their sum \(Y = \sum_{i=1}^n X_i\) follows \(\text{Binomial}(n, p)\).
Setup: each \(X_i\) takes value 1 with probability \(p\) and 0 with probability \(1-p\), independently of the others.
Define \(Y = X_1 + X_2 + \cdots + X_n\). We want to find \(P(Y = k)\) for \(k = 0, 1, \ldots, n\).
Counting argument:
\(Y = k\) means exactly \(k\) of the \(n\) trials are successes (value 1) and \(n-k\) are failures (value 0).
Probability calculation:
For any specific sequence with exactly \(k\) successes:
\[ P(X_1 = x_1, \ldots, X_n = x_n) = p^k (1-p)^{n-k} \]
Total probability: there are \(\binom{n}{k}\) such sequences, so
\[ P(Y = k) = \binom{n}{k} p^k (1-p)^{n-k} \]
This is exactly the PMF of \(\text{Binomial}(n, p)\). Therefore, the sum of \(n\) independent Bernoulli trials is Binomial.
We prove that for \(X \sim \text{Poisson}(\lambda)\), both \(\mathbb{E}[X] = \lambda\) and \(\text{Var}(X) = \lambda\).
Poisson PMF:
\[ P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \qquad k = 0, 1, 2, \ldots \]
Proof that \(\mathbb{E}[X] = \lambda\):
\[ \mathbb{E}[X] = \sum_{k=0}^{\infty} k\, \frac{\lambda^k e^{-\lambda}}{k!} = \lambda e^{-\lambda} \sum_{k=1}^{\infty} \frac{\lambda^{k-1}}{(k-1)!} = \lambda e^{-\lambda} e^{\lambda} = \lambda \]
Proof that \(\text{Var}(X) = \lambda\):
Use \(\text{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2\). First find \(\mathbb{E}[X^2]\) via the factorial moment:
\[ \mathbb{E}[X(X-1)] = \sum_{k=2}^{\infty} k(k-1)\, \frac{\lambda^k e^{-\lambda}}{k!} = \lambda^2 e^{-\lambda} \sum_{k=2}^{\infty} \frac{\lambda^{k-2}}{(k-2)!} = \lambda^2, \]
so \(\mathbb{E}[X^2] = \mathbb{E}[X(X-1)] + \mathbb{E}[X] = \lambda^2 + \lambda\).
Therefore:
\[ \text{Var}(X) = \lambda^2 + \lambda - \lambda^2 = \lambda \]
This unique property (variance equals mean) characterizes the Poisson distribution and is why it is unsuitable for over-dispersed RNA-seq data.
We explain why products of many small random factors lead to a lognormal distribution via the Central Limit Theorem.
Setup: Multiplicative model
Suppose the effective expression rate \(\lambda\) is determined by multiplicative factors:
\[ \lambda = F_1 \times F_2 \times \cdots \times F_m \]
where each \(F_i > 0\) is a random factor (e.g., library efficiency, cell-type proportion, capture rate).
Taking logarithms:
\[ \log(\lambda) = \sum_{i=1}^{m} \log(F_i) \]
Central Limit Theorem:
If \(\log(F_i)\) are independent random variables with finite mean \(\mu_i\) and variance \(\sigma_i^2\), then for large \(m\):
\[ \log(\lambda) \approx \mathcal{N}\!\left(\sum_{i=1}^{m} \mu_i,\; \sum_{i=1}^{m} \sigma_i^2\right) \]
Lognormal distribution:
Since \(\log(\lambda)\) is approximately normal, \(\lambda = e^{\log(\lambda)}\) follows a lognormal distribution:
\[ \lambda \sim \text{Lognormal}\!\left(\sum_{i=1}^{m} \mu_i,\; \sum_{i=1}^{m} \sigma_i^2\right) \]
Why Gamma instead?
The lognormal is mathematically valid, but it is not conjugate to the Poisson: the Poisson-lognormal mixture has no closed-form PMF and requires numerical integration for every gene. The Gamma-Poisson mixture, in contrast, yields the Negative Binomial in closed form.
Practical takeaway: The multiplicative factors story justifies a heavy-tailed, positive distribution for \(\lambda\). Gamma is chosen for mathematical convenience while capturing the same essential features.
We derive the mean and variance of \(\lambda \sim \text{Gamma}(\alpha, \beta)\).
Gamma PDF:
\[ f(\lambda) = \frac{\beta^\alpha}{\Gamma(\alpha)}\, \lambda^{\alpha - 1} e^{-\beta \lambda}, \qquad \lambda > 0 \]
Mean (First Moment), using \(\Gamma(\alpha+1) = \alpha\,\Gamma(\alpha)\):
\[ \mathbb{E}[\lambda] = \int_0^\infty \lambda\, f(\lambda)\, d\lambda = \frac{\beta^\alpha}{\Gamma(\alpha)} \cdot \frac{\Gamma(\alpha+1)}{\beta^{\alpha+1}} = \frac{\alpha}{\beta} \]
Second Moment:
\[ \mathbb{E}[\lambda^2] = \frac{\beta^\alpha}{\Gamma(\alpha)} \cdot \frac{\Gamma(\alpha+2)}{\beta^{\alpha+2}} = \frac{(\alpha+1)\,\alpha}{\beta^2} \]
Variance:
\[ \text{Var}(\lambda) = \mathbb{E}[\lambda^2] - (\mathbb{E}[\lambda])^2 = \frac{\alpha^2 + \alpha - \alpha^2}{\beta^2} = \frac{\alpha}{\beta^2} \]
Summary: \(\mathbb{E}[\lambda] = \alpha/\beta\) and \(\text{Var}(\lambda) = \alpha/\beta^2\).
Note: The variance can be rewritten as \(\text{Var}(\lambda) = \frac{1}{\alpha} \left(\frac{\alpha}{\beta}\right)^2 = \frac{\mu^2}{\alpha}\), showing that smaller \(\alpha\) means higher relative variability.
DESeq2 uses empirical Bayes shrinkage to stabilize dispersion estimates when sample sizes are small.
Problem:
With only 3-6 replicates per condition, the maximum likelihood estimate \(\hat{\theta}_j\) of gene \(j\)'s dispersion is highly variable (high variance estimator).
Observation:
Dispersion tends to depend on mean expression: \(\theta_j = f(\mu_j) + \epsilon_j\) where \(f\) is a smooth function (often power-law or spline).
Empirical Bayes Strategy: shrink each gene-wise estimate toward the fitted trend,
\[ \tilde{\theta}_j = w_j\, \hat{\theta}_j + (1 - w_j)\, f(\mu_j) \]
where \(w_j \in [0,1]\) is a data-driven weight that balances the gene's own (noisy) estimate \(\hat{\theta}_j\) against the fitted trend value \(f(\mu_j)\).
Weight calculation:
The weight \(w_j\) depends on the precision (inverse variance) of \(\hat{\theta}_j\):
\[ w_j = \frac{\tau^2}{\tau^2 + \text{Var}(\hat{\theta}_j)} \]
where \(\tau^2\) is the variance of true dispersions around the fitted trend (estimated from data).
Effect: genes whose \(\hat{\theta}_j\) is precise keep \(w_j \approx 1\) and are barely shrunk; genes with noisy estimates get small \(w_j\) and are pulled toward the trend \(f(\mu_j)\).
Benefit: Reduces false positives from genes with unreliable dispersion estimates while preserving true biological variability.
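A toy sketch of this shrinkage arithmetic in R (all numbers are hypothetical, and this is the generic precision-weighted average above, not DESeq2's exact estimator):

theta_hat <- c(0.50, 0.12)  # gene-wise dispersion estimates (hypothetical)
trend     <- c(0.15, 0.15)  # fitted trend f(mu_j) at each gene's mean (hypothetical)
var_hat   <- c(0.20, 0.01)  # sampling variance of each theta_hat (hypothetical)
tau2      <- 0.02           # spread of true dispersions around the trend (hypothetical)
w <- tau2 / (tau2 + var_hat)
round(w * theta_hat + (1 - w) * trend, 3)  # noisy gene 1 is pulled to the trend; precise gene 2 barely moves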
When testing \(m\) hypotheses simultaneously (e.g., 20,000 genes), we need to control the False Discovery Rate (FDR): the expected proportion of false positives among rejected hypotheses.
Problem with naive thresholding:
If we reject all genes with \(p < 0.05\) and test 20,000 genes where all nulls are true, we expect \(20{,}000 \times 0.05 = 1{,}000\) false positives!
Benjamini-Hochberg (BH) Procedure:
Given \(m\) p-values \(p_1, \ldots, p_m\), to control FDR at level \(\alpha\) (e.g., 0.05):
Step 1: Order p-values: \(p_{(1)} \leq p_{(2)} \leq \cdots \leq p_{(m)}\)
Step 2: Find the largest \(k\) such that:
\[ p_{(k)} \leq \frac{k}{m}\, \alpha \]
Step 3: Reject hypotheses \(H_{(1)}, \ldots, H_{(k)}\)
Adjusted p-values:
DESeq2 reports \(\text{padj}_i\), the smallest FDR level at which hypothesis \(i\) would be rejected:
\[ \text{padj}_{(i)} = \min_{j \geq i} \left\{ \min\!\left( \frac{m\, p_{(j)}}{j},\; 1 \right) \right\} \]
Interpretation: \(\text{padj}_i \leq 0.05\) means gene \(i\) belongs to a rejection set whose expected fraction of false discoveries is at most 5%.
FDR Guarantee (under independence or positive dependence):
\[ \text{FDR} = \mathbb{E}\!\left[\frac{V}{\max(R, 1)}\right] \leq \frac{m_0}{m}\,\alpha \leq \alpha \]
where \(V\) is the number of false rejections, \(R\) the total number of rejections, and \(m_0\) the number of true nulls.
This is much weaker than family-wise error rate (FWER) control (e.g., Bonferroni) but more powerful for high-throughput screening.
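A sketch implementing the BH adjustment by hand and checking it against R's built-in p.adjust (the simulated p-value mix is illustrative):

set.seed(8)
p <- c(runif(9000), rbeta(1000, 1, 50))  # mostly null p-values plus some signal
m <- length(p)
o <- order(p)
padj <- numeric(m)
# padj_(i) = min over j >= i of min(1, m * p_(j) / j)
padj[o] <- pmin(1, rev(cummin(rev(p[o] * m / seq_len(m)))))
all.equal(padj, p.adjust(p, method = "BH"))  # TRUE
sum(padj < 0.05)                             # number of BH discoveries at FDR 5%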
The Mathematical Journey:
Step 1: Express λ as a product
\[ \lambda = F_1 \times F_2 \times \cdots \times F_m \]
where \(F_i\) represents factors like expression bursts, cell-type proportions, library amplification, etc.
Step 2: Take logarithms (products become sums)
\[ \log(\lambda) = \sum_{i=1}^{m} \log(F_i) \]
Step 3: Apply Central Limit Theorem
If \(\log(F_i)\) are independent random variables with finite mean \(\mu_i\) and variance \(\sigma_i^2\), then for large \(m\):
\[ \log(\lambda) \approx \mathcal{N}\!\left(\sum_{i=1}^{m} \mu_i,\; \sum_{i=1}^{m} \sigma_i^2\right) \]
Step 4: Exponentiate to get λ
Since \(\log(\lambda)\) is approximately normal, \(\lambda = e^{\log(\lambda)}\) follows a lognormal distribution:
\[ \lambda \sim \text{Lognormal}\!\left(\sum_{i=1}^{m} \mu_i,\; \sum_{i=1}^{m} \sigma_i^2\right) \]
Step 5: Why approximate with Gamma?
Visual comparison:
Both Gamma and Lognormal are positive, right-skewed distributions that can model the same range of shapes. The Gamma wins due to mathematical convenience!
The Key Reason: Mathematical Elegance
Gamma distribution is chosen primarily because it is the conjugate prior for the Poisson distribution.
What does this mean? If the prior on \(\lambda\) is Gamma and the counts are Poisson given \(\lambda\), the posterior of \(\lambda\) is again a Gamma, and the marginal distribution of the counts (the Gamma-Poisson mixture) is available in closed form: the Negative Binomial.
The Four Benefits:
1. Closed-form solution (No numerical integration needed)
2. Interpretable parameters
3. Matches biological reality
4. Computational efficiency
Bottom Line:
While lognormal is theoretically justified by multiplicative factors, Gamma is chosen because it leads to the Negative Binomial distribution, giving us a tractable, interpretable, and computationally efficient model for RNA-seq counts!
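A minimal R sketch of the conjugate update itself (prior parameters and counts are illustrative):

alpha0 <- 2; beta0 <- 0.01     # Gamma(alpha0, beta0) prior on lambda (hypothetical)
k <- c(198, 215, 232)          # observed replicate counts (illustrative values)
alpha_post <- alpha0 + sum(k)  # posterior is Gamma(alpha0 + sum(k), beta0 + n): no integration needed
beta_post  <- beta0 + length(k)
alpha_post / beta_post         # posterior mean of lambda, ~215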
What is the Wald Test?
The Wald test is a statistical hypothesis test used to determine if a coefficient in our regression model is significantly different from zero.
The Test Statistic:
\[ W = \frac{\hat{\beta}_{j,\text{treatment}}}{\text{SE}(\hat{\beta}_{j,\text{treatment}})} \]
where \(\hat{\beta}_{j,\text{treatment}}\) is the estimated log2 fold change for gene \(j\), and \(\text{SE}(\hat{\beta}_{j,\text{treatment}})\) is its standard error from the negative binomial GLM fit.
Under the null hypothesis (no differential expression, \(\beta_{j,\text{treatment}} = 0\)):
The test statistic follows a standard normal distribution: \(W \sim \mathcal{N}(0, 1)\) (asymptotically).
Computing the p-value:
\[ p = 2\,\bigl(1 - \Phi(|W|)\bigr) \]
where \(\Phi\) is the standard normal cumulative distribution function. We use a two-tailed test (factor of 2) because we care about both up- and down-regulation.
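A short R illustration of this computation (the estimate and standard error are hypothetical):

beta_hat <- 0.9                 # estimated log2 fold change (hypothetical)
se_hat   <- 0.3                 # its standard error (hypothetical)
W <- beta_hat / se_hat
p <- 2 * (1 - pnorm(abs(W)))    # two-tailed p-value
c(W = W, p = p)                 # W = 3, p ~ 0.0027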
Interpretation: a large \(|W|\) means the estimated coefficient lies many standard errors away from zero; the resulting small p-value is evidence of differential expression.
Why Wald Test (instead of Likelihood Ratio Test)? The Wald test needs only one model fit per gene, whereas the LRT requires fitting both a full and a reduced model; for simple two-group contrasts the two are asymptotically equivalent, so DESeq2 defaults to the faster Wald test (the LRT remains available via test = "LRT").
Important Note:
For genes with very low counts or high dispersion, the asymptotic normal approximation may not be accurate. DESeq2 handles this through filtering and shrinkage.
The Problem: Noisy LFC Estimates for Low-Count Genes
Consider a gene with very low counts in both conditions (only a handful of reads per sample) yet a large apparent log2 fold change.
But is this reliable? With so few counts, this could be noise!
The Solution: Empirical Bayes Shrinkage
DESeq2 shrinks LFC estimates toward zero, with the amount of shrinkage depending on the information available for the gene (counts and dispersion) and on the distribution of LFCs across all genes, which sets the width of the prior.
The Shrinkage Formula: the shrunken LFC is the posterior mode under a zero-centered prior,
\[ \tilde{\beta}_j = \operatorname*{arg\,max}_{\beta}\ \bigl[\, \ell_j(\beta) + \log \pi(\beta) \,\bigr] \]
where \(\ell_j(\beta)\) is the negative binomial log-likelihood for gene \(j\) and \(\pi(\beta)\) is a zero-centered prior on the LFC (normal in the original DESeq2; heavy-tailed in apeglm).
Interpretation: genes with precise LFC estimates (high counts, low dispersion) are barely shrunk; noisy estimates are pulled strongly toward zero.
Example: a low-count gene with an apparent LFC of 3 may shrink to well below 1, while a high-count gene with the same LFC barely moves.
Benefits: more stable effect-size estimates, more reliable gene rankings, and fewer spurious large fold changes driven by low-count noise.
Important:
Shrinkage affects LFC estimates (effect sizes), not p-values. P-values still come from the original (unshrunken) Wald test.
DESeq2 Function:
Use lfcShrink() after DESeq() to get shrunken LFC for plotting and ranking genes.
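A usage sketch, assuming the dds object and Control/Treatment design from the workflow exercise below; the coefficient name should be checked with resultsNames(dds), and type = "apeglm" requires the apeglm package:

resultsNames(dds)                # check the available coefficient names first
resShrunk <- lfcShrink(dds, coef = "condition_Treatment_vs_Control", type = "apeglm")
plotMA(resShrunk, ylim = c(-2, 2))   # shrunken LFCs give a cleaner MA plot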
Goal: Understand overdispersion by simulating count data from Poisson and Negative Binomial distributions.
What you'll learn: How biological variability creates overdispersion that the Poisson can't capture.
# Simulate RNA-seq-like count data to see overdispersion
library(ggplot2)

# Set parameters
mean_expression <- 100
dispersion <- 0.5   # Higher = more overdispersion
n_samples <- 1000

# Simulate Poisson (no overdispersion)
poisson_counts <- rpois(n_samples, lambda = mean_expression)

# Simulate Negative Binomial (allows overdispersion)
# size parameter relates to dispersion: smaller = more dispersed
nb_counts <- rnbinom(n_samples, mu = mean_expression, size = 1/dispersion)

# Compare variances
cat("Poisson variance:", var(poisson_counts), "\n")
cat("NB variance:", var(nb_counts), "\n")
cat("Theoretical NB variance:", mean_expression + dispersion * mean_expression^2, "\n")

# Visualize the distributions
data <- data.frame(
  counts = c(poisson_counts, nb_counts),
  distribution = rep(c("Poisson", "Negative Binomial"), each = n_samples)
)
ggplot(data, aes(x = counts, fill = distribution)) +
  geom_histogram(alpha = 0.6, position = "identity", bins = 30) +
  labs(title = "Poisson vs Negative Binomial",
       subtitle = "Same mean, different variance") +
  theme_minimal()
Goal: Run a complete differential expression analysis using DESeq2.
What you'll learn: The standard DESeq2 workflow from count matrix to differential expression results.
# Complete DESeq2 workflow for differential expression analysis
library(DESeq2)
library(tidyverse)

# Load your count matrix (genes x samples)
# Rows = genes, Columns = samples
# counts <- read.csv("your_counts.csv", row.names = 1)

# For demonstration, create sample data
counts <- matrix(rpois(6000, lambda = 50), nrow = 1000, ncol = 6)
rownames(counts) <- paste0("Gene_", 1:1000)
colnames(counts) <- paste0("Sample_", 1:6)

# Create sample metadata
coldata <- data.frame(
  condition = factor(rep(c("Control", "Treatment"), each = 3)),
  row.names = colnames(counts)
)

# Create DESeq2 dataset
dds <- DESeqDataSetFromMatrix(
  countData = counts,
  colData = coldata,
  design = ~ condition
)

# Filter low-count genes (optional but recommended)
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep, ]

# Run DESeq2 analysis
dds <- DESeq(dds)

# Extract results
res <- results(dds, contrast = c("condition", "Treatment", "Control"))

# Summary of results
summary(res)

# Get significantly differentially expressed genes (padj < 0.05)
sig_genes <- res[which(res$padj < 0.05), ]
cat("Number of significant genes:", nrow(sig_genes), "\n")

# MA plot
plotMA(res, ylim = c(-2, 2))

# Export results
write.csv(as.data.frame(res), "deseq2_results.csv")
Goal: Understand how DESeq2 estimates and shrinks dispersion parameters.
What you'll learn: The mean-dispersion relationship and why shrinkage improves power.
# Visualize dispersion estimation in DESeq2
library(DESeq2)
library(ggplot2)
library(dplyr)   # for %>%, mutate, summarise

# Assuming you have a dds object from previous DESeq2 analysis
# If not, create one following Exercise 2

# Built-in dispersion plot
plotDispEsts(dds)

# Extract dispersion information for custom plots
mean_norm <- rowMeans(counts(dds, normalized = TRUE))
disp_data <- data.frame(
  mean = mean_norm,
  gene_est = mcols(dds)$dispGeneEst,          # Gene-wise estimates
  fit = dispersionFunction(dds)(mean_norm),   # Fitted trend
  final = dispersions(dds)                    # Final shrunken estimates
)

# Custom dispersion plot
ggplot(disp_data, aes(x = mean)) +
  geom_point(aes(y = gene_est), alpha = 0.3, color = "gray", size = 1) +
  geom_line(aes(y = fit), color = "red", linewidth = 1.5) +
  geom_point(aes(y = final), alpha = 0.5, color = "blue", size = 0.8) +
  scale_x_log10() +
  scale_y_log10() +
  labs(title = "Dispersion Estimation",
       subtitle = "Gray: gene-wise | Red: fitted trend | Blue: final shrunken",
       x = "Mean of normalized counts",
       y = "Dispersion") +
  theme_minimal()

# Show impact of shrinkage
shrinkage_effect <- disp_data %>%
  mutate(shrinkage = abs(gene_est - final) / gene_est) %>%
  summarise(median_shrinkage = median(shrinkage, na.rm = TRUE))
cat("Median shrinkage:", round(shrinkage_effect$median_shrinkage * 100, 2), "%\n")
Goal: Build and visualize design matrices for different experimental designs.
What you'll learn: How R formulas translate to design matrices that connect samples to conditions.
# Explore design matrices for different experimental designs
library(DESeq2)

# Example 1: Simple two-group comparison
coldata_simple <- data.frame(
  condition = factor(c(rep("Control", 3), rep("Treatment", 3)))
)
design_simple <- model.matrix(~ condition, data = coldata_simple)
print("Simple design matrix:")
print(design_simple)

# Example 2: Multi-factor design (condition + batch)
coldata_multi <- data.frame(
  condition = factor(c(rep("Control", 4), rep("Treatment", 4))),
  batch = factor(c("A", "B", "A", "B", "A", "B", "A", "B"))
)
design_multi <- model.matrix(~ batch + condition, data = coldata_multi)
print("Multi-factor design matrix:")
print(design_multi)

# Example 3: Interaction design
coldata_interact <- data.frame(
  genotype = factor(rep(c("WT", "KO"), each = 4)),
  treatment = factor(rep(c("Control", "Treated"), 4))
)
design_interact <- model.matrix(~ genotype * treatment, data = coldata_interact)
print("Interaction design matrix:")
print(design_interact)

# Visualize what coefficients mean
cat("\n=== Coefficient Interpretation ===\n")
cat("(Intercept): Baseline expression for reference group\n")
cat("conditionTreatment: Log2 fold change of Treatment vs Control\n")
cat("batchB: Batch effect adjustment (batch B vs batch A)\n")
cat("genotypeWT:treatmentTreated: Interaction (treatment effect differs by genotype)\n")