PCA Exercises with Biological Dataset

PCA Exercises with Biological Dataset (`airway`)

# Install and load necessary packages
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
if (!requireNamespace("airway", quietly = TRUE)) {
  BiocManager::install("airway")
}
if (!requireNamespace("AnnotationDbi", quietly = TRUE)) {
  BiocManager::install("AnnotationDbi")
}
if (!requireNamespace("org.Hs.eg.db", quietly = TRUE)) {
  BiocManager::install("org.Hs.eg.db")
}

library(ggplot2)
library(DESeq2)
library(airway)
library(AnnotationDbi)
library(org.Hs.eg.db)

# Load airway dataset
data(airway)

Exercise 1: Perform PCA on RNA-seq Data

Objective: Perform PCA on the normalized counts of the airway dataset after variance stabilization.

dds <- DESeqDataSet(airway, design=~dex)

# Perform variance stabilizing transformation (VST)
vsd <- vst(dds, blind = FALSE)

# Extract the top 500 most variable genes for PCA
top_var_genes <- order(rowVars(assay(vsd)), decreasing = TRUE)[1:500]

# Perform PCA on the top 500 most variable genes
pca_result <- prcomp(t(assay(vsd)[top_var_genes, ]), center = TRUE, scale. = TRUE)

# Calculate proportion of variance explained by each component
var_explained <- pca_result$sdev^2 / sum(pca_result$sdev^2)

# Display summary of PCA result
summary(pca_result)

## Importance of components:
##                           PC1     PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     15.198 11.0602 8.8165 7.05108 3.14998 2.44650 1.82629
## Proportion of Variance  0.462  0.2447 0.1555 0.09944 0.01984 0.01197 0.00667
## Cumulative Proportion   0.462  0.7066 0.8621 0.96151 0.98136 0.99333 1.00000
##                              PC8
## Standard deviation     1.372e-14
## Proportion of Variance 0.000e+00
## Cumulative Proportion  1.000e+00

Your Task: Create a scree plot using ggplot2 to visualize the proportion of variance explained by each principal component.

Exercise 2: PCA Scatter Plot of the First Two Principal Components

Objective: Visualize the first two principal components using a scatter plot with ggplot2.

# Get the PCA scores (coordinates for each sample on the principal components)
pca_scores <- as.data.frame(pca_result$x)

# Add sample metadata (cell type and dex treatment) to the PCA scores
pca_scores$cell <- colData(vsd)$cell
pca_scores$dex <- colData(vsd)$dex

# Create scatter plot for PC1 and PC2
ggplot(pca_scores, aes(x = PC1, y = PC2, color = dex, shape = cell)) +
  geom_point(size = 3) +
  ggtitle("PCA Scatter Plot: PC1 vs PC2") +
  xlab(paste0("PC1 (", round(var_explained[1] * 100, 2), "% variance)")) +
  ylab(paste0("PC2 (", round(var_explained[2] * 100, 2), "% variance)")) +
  theme_minimal()

Your Task:

Look at the PCA scatter plot of PC1 vs. PC2 from Exercise 2.
Answer the following questions:
- Are the samples from the dexamethasone-treated group separated from the untreated group?
- Which principal component (PC1 or PC2) seems to capture more of the variation between the treatment groups?
- What might the separation along PC1 or PC2 indicate about the influence of dexamethasone on gene expression in this dataset?

Exercise 3: Visualizing the Top 20 Loadings on the First Two Principal Components

Objective: Create a loading plot for the top 20 genes using ggplot2 and annotate the Ensembl IDs with gene symbols.

# Extract loadings for the first two components
pca_loadings <- as.data.frame(pca_result$rotation[, 1:2])

# Get the top 20 genes with the highest absolute loadings in PC1 and PC2
top_genes <- head(order(abs(pca_loadings$PC1) + abs(pca_loadings$PC2), decreasing = TRUE), 20)

# Subset the PCA loadings to the top 20 genes
top_loadings <- pca_loadings[top_genes, ]

# Annotate gene symbols from Ensembl IDs
top_loadings$ensembl_id <- rownames(top_loadings)
top_loadings$gene_symbol <- mapIds(org.Hs.eg.db, keys = top_loadings$ensembl_id,
                                   column = "SYMBOL", keytype = "ENSEMBL", multiVals = "first")

## 'select()' returned 1:1 mapping between keys and columns

# Remove genes with missing symbols and create a loading plot
top_loadings <- na.omit(top_loadings)

# Create loading plot using ggplot2
ggplot(top_loadings, aes(x = PC1, y = PC2, label = gene_symbol)) +
  geom_point(size = 3) +
  geom_text(vjust = -0.5, hjust = 0.5) +
  ggtitle("Top 20 Gene Loadings on PC1 and PC2") +
  xlab("PC1 Loadings") +
  ylab("PC2 Loadings") +
  theme_minimal()

Your Task: Identify the top 20 genes contributing to PC1 and PC2, and explain their potential biological significance.

PCA Exercises with Biological Dataset

PCA Exercises with Biological Dataset

PCA Exercises with Biological Dataset (airway)

Exercise 1: Perform PCA on RNA-seq Data

Exercise 2: PCA Scatter Plot of the First Two Principal Components

Exercise 3: Visualizing the Top 20 Loadings on the First Two Principal Components

PCA Exercises with Biological Dataset (`airway`)