PCA Exercises with Biological Dataset
PCA Exercises with Biological Dataset (airway
)
# Install and load necessary packages
if (!requireNamespace("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
if (!requireNamespace("airway", quietly = TRUE)) {
BiocManager::install("airway")
}
if (!requireNamespace("AnnotationDbi", quietly = TRUE)) {
BiocManager::install("AnnotationDbi")
}
if (!requireNamespace("org.Hs.eg.db", quietly = TRUE)) {
BiocManager::install("org.Hs.eg.db")
}
library(ggplot2)
library(DESeq2)
library(airway)
library(AnnotationDbi)
library(org.Hs.eg.db)
# Load airway dataset
data(airway)
Exercise 1: Perform PCA on RNA-seq Data
- Objective: Perform PCA on the normalized counts of the airway dataset after variance stabilization.
dds <- DESeqDataSet(airway, design=~dex)
# Perform variance stabilizing transformation (VST)
vsd <- vst(dds, blind = FALSE)
# Extract the top 500 most variable genes for PCA
top_var_genes <- order(rowVars(assay(vsd)), decreasing = TRUE)[1:500]
# Perform PCA on the top 500 most variable genes
pca_result <- prcomp(t(assay(vsd)[top_var_genes, ]), center = TRUE, scale. = TRUE)
# Calculate proportion of variance explained by each component
var_explained <- pca_result$sdev^2 / sum(pca_result$sdev^2)
# Display summary of PCA result
summary(pca_result)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 15.198 11.0602 8.8165 7.05108 3.14998 2.44650 1.82629
## Proportion of Variance 0.462 0.2447 0.1555 0.09944 0.01984 0.01197 0.00667
## Cumulative Proportion 0.462 0.7066 0.8621 0.96151 0.98136 0.99333 1.00000
## PC8
## Standard deviation 1.372e-14
## Proportion of Variance 0.000e+00
## Cumulative Proportion 1.000e+00
- Your Task: Create a scree plot
using
ggplot2
to visualize the proportion of variance explained by each principal component.
Exercise 2: PCA Scatter Plot of the First Two Principal Components
- Objective: Visualize the first two principal
components using a scatter plot with
ggplot2
.
# Get the PCA scores (coordinates for each sample on the principal components)
pca_scores <- as.data.frame(pca_result$x)
# Add sample metadata (cell type and dex treatment) to the PCA scores
pca_scores$cell <- colData(vsd)$cell
pca_scores$dex <- colData(vsd)$dex
# Create scatter plot for PC1 and PC2
ggplot(pca_scores, aes(x = PC1, y = PC2, color = dex, shape = cell)) +
geom_point(size = 3) +
ggtitle("PCA Scatter Plot: PC1 vs PC2") +
xlab(paste0("PC1 (", round(var_explained[1] * 100, 2), "% variance)")) +
ylab(paste0("PC2 (", round(var_explained[2] * 100, 2), "% variance)")) +
theme_minimal()
- Your Task:
- Look at the PCA scatter plot of PC1 vs. PC2 from Exercise 2.
- Answer the following questions:
- Are the samples from the dexamethasone-treated group separated from the untreated group?
- Which principal component (PC1 or PC2) seems to capture more of the variation between the treatment groups?
- What might the separation along PC1 or PC2 indicate about the influence of dexamethasone on gene expression in this dataset?
Exercise 3: Visualizing the Top 20 Loadings on the First Two Principal Components
- Objective: Create a loading plot
for the top 20 genes using
ggplot2
and annotate the Ensembl IDs with gene symbols.
# Extract loadings for the first two components
pca_loadings <- as.data.frame(pca_result$rotation[, 1:2])
# Get the top 20 genes with the highest absolute loadings in PC1 and PC2
top_genes <- head(order(abs(pca_loadings$PC1) + abs(pca_loadings$PC2), decreasing = TRUE), 20)
# Subset the PCA loadings to the top 20 genes
top_loadings <- pca_loadings[top_genes, ]
# Annotate gene symbols from Ensembl IDs
top_loadings$ensembl_id <- rownames(top_loadings)
top_loadings$gene_symbol <- mapIds(org.Hs.eg.db, keys = top_loadings$ensembl_id,
column = "SYMBOL", keytype = "ENSEMBL", multiVals = "first")
## 'select()' returned 1:1 mapping between keys and columns
# Remove genes with missing symbols and create a loading plot
top_loadings <- na.omit(top_loadings)
# Create loading plot using ggplot2
ggplot(top_loadings, aes(x = PC1, y = PC2, label = gene_symbol)) +
geom_point(size = 3) +
geom_text(vjust = -0.5, hjust = 0.5) +
ggtitle("Top 20 Gene Loadings on PC1 and PC2") +
xlab("PC1 Loadings") +
ylab("PC2 Loadings") +
theme_minimal()
- Your Task: Identify the top 20 genes contributing to PC1 and PC2, and explain their potential biological significance.