R-Workshop 2024: RNASeq dataset analysis in R
This guide walks you through analyzing an RNASeq dataset using R,
with a particular focus on the airway
dataset, which comes
prepackaged with R’s airway
library. We’ll cover library
setup, dataset preprocessing, differential expression analysis, PCA, and
visualization techniques such as volcano plots.
1. Load Libraries and Datasets
Before starting, we need to load various libraries required for RNASeq data analysis and visualization. If these libraries aren’t installed yet, R will automatically install them for you.
Loading and Installing Required Libraries
In R, libraries (also called “packages”) are collections of functions and data that extend the basic functionality of R. Before you start analyzing your RNASeq data, you need to install and load several specialized libraries. These libraries provide tools for handling biological data, visualizing results, and performing statistical analysis.
Let’s walk through the process of installing and loading the necessary libraries.
1. Checking if Libraries Are Installed and Installing Them if Missing
# Check if the BiocManager package is installed, if not, install it
if(!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
Explanation: - The requireNamespace()
function checks whether a package is installed. - The
quietly = TRUE
argument suppresses messages during the
check. - If the package is not installed, the
install.packages()
function installs it. In this case, we
are installing BiocManager
, a tool that helps you install
Bioconductor packages, which are commonly used in bioinformatics.
# Install DESeq2 if it's not already installed
if(!requireNamespace("DESeq2", quietly = TRUE))
BiocManager::install("DESeq2")
## Warning: replacing previous import 'S4Arrays::makeNindexFromArrayViewport' by
## 'DelayedArray::makeNindexFromArrayViewport' when loading 'SummarizedExperiment'
Explanation: - DESeq2
is a powerful
package designed for differential gene expression analysis. It’s often
used to analyze RNASeq data to find genes that are upregulated or
downregulated between conditions. - If the package is not found on your
system, the BiocManager::install()
function installs it
from the Bioconductor repository.
# Install dplyr if it's not already installed
if(!requireNamespace("dplyr", quietly = TRUE))
install.packages("dplyr")
Explanation: - dplyr
is a widely-used
package for data manipulation. You will use it to clean and manipulate
your RNASeq data, such as filtering genes or organizing metadata. - This
package is part of the tidyverse, a popular set of tools for data
science in R.
# Install the org.Hs.eg.db package for human gene annotations
if(!requireNamespace("org.Hs.eg.db", quietly = TRUE))
BiocManager::install("org.Hs.eg.db")
##
Explanation: - org.Hs.eg.db
provides
mappings between Ensembl gene IDs and gene symbols for Homo
sapiens (human). You will need this package to convert gene
identifiers (often used in RNASeq) to human gene symbols.
# Install airway package if not already installed (RNASeq dataset)
if(!requireNamespace("airway", quietly = TRUE))
BiocManager::install("airway")
Explanation: - The airway
package
contains the RNASeq dataset that will be used throughout this tutorial.
It includes gene expression data for human airway smooth muscle cells
treated with dexamethasone.
# Install ggplot2 if it's not already installed
if(!requireNamespace("ggplot2", quietly = TRUE))
install.packages("ggplot2")
Explanation: - ggplot2
is a powerful
and flexible package for data visualization. It will be used to create
plots such as PCA (Principal Component Analysis) plots and volcano plots
to visualize your RNASeq analysis results.
# Install ggrepel for enhanced plotting (avoids label overlap)
if(!requireNamespace("ggrepel", quietly = TRUE))
install.packages("ggrepel")
Explanation: - ggrepel
enhances
ggplot2
plots by preventing text labels from overlapping.
It is especially useful in creating clear and readable plots for
labeling significant genes in RNASeq analysis.
# Install DT package for interactive tables
if(!requireNamespace("DT", quietly = TRUE))
install.packages("DT")
Explanation: - The DT
package helps
create interactive data tables. You will use it to display the results
of your differential expression analysis in a format that allows easy
sorting and filtering.
2. Loading the Installed Libraries
After ensuring that all necessary packages are installed, we need to load them into the current R session. Loading a library makes its functions and data available for use.
# Load the libraries
library(DESeq2)
library(ggplot2)
library(dplyr)
library(org.Hs.eg.db)
library(airway)
library(ggrepel)
library(DT)
Explanation: - This block loads the installed
libraries into your R session. Each library()
call makes
the associated package functions and datasets available for immediate
use.
3. Loading the Dataset
Finally, we load the airway
dataset, which contains
RNASeq data from human airway cells.
Explanation: - The data()
function
loads built-in datasets from R packages. Here, it loads the
airway
dataset from the airway
package, which
we will use for the RNASeq analysis.
Explanation
- This code installs the necessary packages if they aren’t already available and then loads them for use.
- The
airway
dataset is a sample RNASeq dataset from theairway
package. It contains gene expression data for studying airway smooth muscle cells under different conditions.
2. Preprocess the Dataset
Preprocessing involves preparing the RNASeq dataset for analysis, including organizing the feature (gene) data and metadata. You can either load preprocessed files or generate them from scratch.
RUN <- F # Set this to TRUE if you want to preprocess the dataset from scratch
if(RUN){
my_airway_dataset <- airway
my_airway_dataset <- assay(my_airway_dataset) %>% as.data.frame()
# Add gene symbols to the dataset
my_airway_dataset$gene_symbol <- mapIds(org.Hs.eg.db, keys = rownames(my_airway_dataset), column = "SYMBOL", keytype = "ENSEMBL", multiVals = "first")
# Organize feature data (gene symbols and Ensembl gene IDs)
my_airway_featureData <- my_airway_dataset %>% tibble::add_column(ensembl_gene_id = rownames(.)) %>%
dplyr::select(ensembl_gene_id, gene_symbol)
# Clean up the dataset, removing unnecessary columns
my_airway_dataset <- my_airway_dataset %>% dplyr::select(-gene_symbol)
# Extract metadata (colData)
my_airway_colData <- colData(airway) %>% as.data.frame()
# Save preprocessed data
saveRDS(my_airway_dataset, file="data/my_airway_dataset.rds")
saveRDS(my_airway_colData, file="data/my_airway_colData.rds")
saveRDS(my_airway_featureData, file="data/my_airway_featureData.rds")
} else {
# Load preprocessed data
my_airway_dataset <- readRDS("data/my_airway_dataset.rds")
my_airway_colData <- readRDS("data/my_airway_colData.rds")
my_airway_featureData <- readRDS("data/my_airway_featureData.rds")
}
Excursus to SummarizedExperiment
The class of the airway dataset is:
## [1] "RangedSummarizedExperiment"
## attr(,"package")
## [1] "SummarizedExperiment"
In bioinformatics, SummarizedExperiment
and
RangedSummarizedExperiment
are commonly used data
structures for storing genomic data. Both handle assay data (like
expression counts) and metadata, but they differ in their handling of
genomic coordinates.
SummarizedExperiment
The SummarizedExperiment
class stores:
Assay data (e.g., gene expression matrix).
Row metadata (e.g., gene names).
Column metadata (e.g., sample information).
However, it does not store genomic ranges (positions on chromosomes).
Example:
library(SummarizedExperiment)
# Assay data (gene expression counts)
assay_data <- matrix(rnorm(100), nrow=10, ncol=10)
rownames(assay_data) <- paste0("gene", 1:10)
colnames(assay_data) <- paste0("sample", 1:10)
# Metadata
row_data <- data.frame(gene_id=rownames(assay_data))
col_data <- data.frame(condition=c(rep("treated", 5), rep("control", 5)), row.names=colnames(assay_data))
# Create SummarizedExperiment object
se <- SummarizedExperiment(assays = list(counts = assay_data), rowData = row_data, colData = col_data)
se
## class: SummarizedExperiment
## dim: 10 10
## metadata(0):
## assays(1): counts
## rownames(10): gene1 gene2 ... gene9 gene10
## rowData names(1): gene_id
## colnames(10): sample1 sample2 ... sample9 sample10
## colData names(1): condition
RangedSummarizedExperiment
RangedSummarizedExperiment
extends
SummarizedExperiment
by including genomic
ranges (chromosome, start, end, strand) for each gene or
region.
Example:
library(GenomicRanges)
library(SummarizedExperiment)
# Genomic ranges (GRanges object)
gr <- GRanges(seqnames = Rle(c("chr1", "chr1", "chr2", "chr2")),
ranges = IRanges(start = c(100, 200, 300, 400), end = c(150, 250, 350, 450)),
strand = Rle(strand(c("+", "-", "+", "-"))))
# Assay data
assay_data <- matrix(rnorm(40), nrow=4, ncol=10)
rownames(assay_data) <- paste0("gene", 1:4)
col_data <- data.frame(condition=c(rep("treated", 5), rep("control", 5)), row.names=colnames(assay_data))
# Create RangedSummarizedExperiment object
rse <- SummarizedExperiment(assays = list(counts = assay_data), rowRanges = gr, colData = col_data)
rse
## class: RangedSummarizedExperiment
## dim: 4 10
## metadata(0):
## assays(1): counts
## rownames(4): gene1 gene2 gene3 gene4
## rowData names(0):
## colnames: NULL
## colData names(1): condition
Key Differences
Feature | SummarizedExperiment | RangedSummarizedExperiment |
---|---|---|
Assay data | Yes | Yes |
Row data (gene metadata) | Yes, but no genomic coordinates | Yes, includes genomic coordinates (via GRanges ) |
Column data (sample metadata) | Yes | Yes |
Genomic coordinate handling | No | Yes (chromosome, start, end, strand) |
Subsetting by Genomic Ranges
With RangedSummarizedExperiment
, you can subset data
based on genomic positions. For example, subsetting genes on
chr1
:
## class: RangedSummarizedExperiment
## dim: 0 10
## metadata(0):
## assays(1): counts
## rownames(0):
## rowData names(0):
## colnames: NULL
## colData names(1): condition
Let us have a look into the data structure of the airway dataset:
## Formal class 'RangedSummarizedExperiment' [package "SummarizedExperiment"] with 6 slots
## ..@ rowRanges :Formal class 'CompressedGRangesList' [package "GenomicRanges"] with 5 slots
## .. .. ..@ unlistData :Formal class 'GRanges' [package "GenomicRanges"] with 7 slots
## .. .. .. .. ..@ seqnames :Formal class 'Rle' [package "S4Vectors"] with 4 slots
## .. .. .. .. .. .. ..@ values : Factor w/ 722 levels "1","2","3","4",..: 23 20 1 6 1 23 6 3 7 12 ...
## .. .. .. .. .. .. ..@ lengths : int [1:47491] 27 29 173 80 75 27 4 41 196 71 ...
## .. .. .. .. .. .. ..@ elementMetadata: NULL
## .. .. .. .. .. .. ..@ metadata : list()
## .. .. .. .. ..@ ranges :Formal class 'IRanges' [package "IRanges"] with 6 slots
## .. .. .. .. .. .. ..@ start : int [1:738009] 99883667 99885756 99887482 99887538 99888402 99888402 99888439 99888928 99888928 99890175 ...
## .. .. .. .. .. .. ..@ width : int [1:738009] 1317 108 84 28 135 135 98 99 99 75 ...
## .. .. .. .. .. .. ..@ NAMES : NULL
## .. .. .. .. .. .. ..@ elementType : chr "ANY"
## .. .. .. .. .. .. ..@ elementMetadata: NULL
## .. .. .. .. .. .. ..@ metadata : list()
## .. .. .. .. ..@ strand :Formal class 'Rle' [package "S4Vectors"] with 4 slots
## .. .. .. .. .. .. ..@ values : Factor w/ 3 levels "+","-","*": 2 1 2 1 2 1 2 1 2 1 ...
## .. .. .. .. .. .. ..@ lengths : int [1:31658] 17 10 59 72 26 45 68 12 42 33 ...
## .. .. .. .. .. .. ..@ elementMetadata: NULL
## .. .. .. .. .. .. ..@ metadata : list()
## .. .. .. .. ..@ seqinfo :Formal class 'Seqinfo' [package "GenomeInfoDb"] with 4 slots
## .. .. .. .. .. .. ..@ seqnames : chr [1:722] "1" "2" "3" "4" ...
## .. .. .. .. .. .. ..@ seqlengths : int [1:722] 249250621 243199373 198022430 191154276 180915260 171115067 159138663 146364022 141213431 135534747 ...
## .. .. .. .. .. .. ..@ is_circular: logi [1:722] FALSE FALSE FALSE FALSE FALSE FALSE ...
## .. .. .. .. .. .. ..@ genome : chr [1:722] NA NA NA NA ...
## .. .. .. .. ..@ elementMetadata:Formal class 'DFrame' [package "S4Vectors"] with 6 slots
## .. .. .. .. .. .. ..@ rownames : NULL
## .. .. .. .. .. .. ..@ nrows : int 738009
## .. .. .. .. .. .. ..@ elementType : chr "ANY"
## .. .. .. .. .. .. ..@ elementMetadata: NULL
## .. .. .. .. .. .. ..@ metadata : list()
## .. .. .. .. .. .. ..@ listData :List of 2
## .. .. .. .. .. .. .. ..$ exon_id : int [1:738009] 667145 667146 667147 667148 667149 667150 667151 667153 667152 667154 ...
## .. .. .. .. .. .. .. ..$ exon_name: chr [1:738009] "ENSE00001459322" "ENSE00000868868" "ENSE00000401072" "ENSE00001849132" ...
## .. .. .. .. ..@ elementType : chr "ANY"
## .. .. .. .. ..@ metadata : list()
## .. .. ..@ elementMetadata:Formal class 'DFrame' [package "S4Vectors"] with 6 slots
## .. .. .. .. ..@ rownames : NULL
## .. .. .. .. ..@ nrows : int 63677
## .. .. .. .. ..@ elementType : chr "ANY"
## .. .. .. .. ..@ elementMetadata: NULL
## .. .. .. .. ..@ metadata : list()
## .. .. .. .. ..@ listData :List of 10
## .. .. .. .. .. ..$ gene_id : chr [1:63677] "ENSG00000000003" "ENSG00000000005" "ENSG00000000419" "ENSG00000000457" ...
## .. .. .. .. .. ..$ gene_name : chr [1:63677] "TSPAN6" "TNMD" "DPM1" "SCYL3" ...
## .. .. .. .. .. ..$ entrezid : int [1:63677] NA NA NA NA NA NA NA NA NA NA ...
## .. .. .. .. .. ..$ gene_biotype : chr [1:63677] "protein_coding" "protein_coding" "protein_coding" "protein_coding" ...
## .. .. .. .. .. ..$ gene_seq_start : int [1:63677] 99883667 99839799 49551404 169818772 169631245 27938575 196621008 143815948 53362139 41040684 ...
## .. .. .. .. .. ..$ gene_seq_end : int [1:63677] 99894988 99854882 49575092 169863408 169823221 27961788 196716634 143832827 53481768 41067715 ...
## .. .. .. .. .. ..$ seq_name : chr [1:63677] "X" "X" "20" "1" ...
## .. .. .. .. .. ..$ seq_strand : int [1:63677] -1 1 -1 -1 1 -1 1 -1 -1 1 ...
## .. .. .. .. .. ..$ seq_coord_system: int [1:63677] NA NA NA NA NA NA NA NA NA NA ...
## .. .. .. .. .. ..$ symbol : chr [1:63677] "TSPAN6" "TNMD" "DPM1" "SCYL3" ...
## .. .. ..@ elementType : chr "GRanges"
## .. .. ..@ metadata :List of 1
## .. .. .. ..$ genomeInfo:List of 20
## .. .. .. .. ..$ Db type : chr "TranscriptDb"
## .. .. .. .. ..$ Supporting package : chr "GenomicFeatures"
## .. .. .. .. ..$ Data source : chr "BioMart"
## .. .. .. .. ..$ Organism : chr "Homo sapiens"
## .. .. .. .. ..$ Resource URL : chr "www.biomart.org:80"
## .. .. .. .. ..$ BioMart database : chr "ensembl"
## .. .. .. .. ..$ BioMart database version : chr "ENSEMBL GENES 75 (SANGER UK)"
## .. .. .. .. ..$ BioMart dataset : chr "hsapiens_gene_ensembl"
## .. .. .. .. ..$ BioMart dataset description : chr "Homo sapiens genes (GRCh37.p13)"
## .. .. .. .. ..$ BioMart dataset version : chr "GRCh37.p13"
## .. .. .. .. ..$ Full dataset : chr "yes"
## .. .. .. .. ..$ miRBase build ID : chr NA
## .. .. .. .. ..$ transcript_nrow : chr "215647"
## .. .. .. .. ..$ exon_nrow : chr "745593"
## .. .. .. .. ..$ cds_nrow : chr "537555"
## .. .. .. .. ..$ Db created by : chr "GenomicFeatures package from Bioconductor"
## .. .. .. .. ..$ Creation time : chr "2014-07-10 14:55:55 -0400 (Thu, 10 Jul 2014)"
## .. .. .. .. ..$ GenomicFeatures version at creation time: chr "1.17.9"
## .. .. .. .. ..$ RSQLite version at creation time : chr "0.11.4"
## .. .. .. .. ..$ DBSCHEMAVERSION : chr "1.0"
## .. .. ..@ partitioning :Formal class 'PartitioningByEnd' [package "IRanges"] with 5 slots
## .. .. .. .. ..@ end : int [1:63677] 17 27 56 86 158 184 229 243 297 309 ...
## .. .. .. .. ..@ NAMES : chr [1:63677] "ENSG00000000003" "ENSG00000000005" "ENSG00000000419" "ENSG00000000457" ...
## .. .. .. .. ..@ elementType : chr "ANY"
## .. .. .. .. ..@ elementMetadata: NULL
## .. .. .. .. ..@ metadata : list()
## ..@ colData :Formal class 'DFrame' [package "S4Vectors"] with 6 slots
## .. .. ..@ rownames : chr [1:8] "SRR1039508" "SRR1039509" "SRR1039512" "SRR1039513" ...
## .. .. ..@ nrows : int 8
## .. .. ..@ elementType : chr "ANY"
## .. .. ..@ elementMetadata: NULL
## .. .. ..@ metadata : list()
## .. .. ..@ listData :List of 9
## .. .. .. ..$ SampleName: Factor w/ 8 levels "GSM1275862","GSM1275863",..: 1 2 3 4 5 6 7 8
## .. .. .. ..$ cell : Factor w/ 4 levels "N052611","N061011",..: 4 4 1 1 3 3 2 2
## .. .. .. ..$ dex : Factor w/ 2 levels "trt","untrt": 2 1 2 1 2 1 2 1
## .. .. .. ..$ albut : Factor w/ 1 level "untrt": 1 1 1 1 1 1 1 1
## .. .. .. ..$ Run : Factor w/ 8 levels "SRR1039508","SRR1039509",..: 1 2 3 4 5 6 7 8
## .. .. .. ..$ avgLength : int [1:8] 126 126 126 87 120 126 101 98
## .. .. .. ..$ Experiment: Factor w/ 8 levels "SRX384345","SRX384346",..: 1 2 3 4 5 6 7 8
## .. .. .. ..$ Sample : Factor w/ 8 levels "SRS508567","SRS508568",..: 2 1 3 4 5 6 7 8
## .. .. .. ..$ BioSample : Factor w/ 8 levels "SAMN02422669",..: 1 4 6 2 7 3 8 5
## ..@ assays :Formal class 'SimpleAssays' [package "SummarizedExperiment"] with 1 slot
## .. .. ..@ data:Formal class 'SimpleList' [package "S4Vectors"] with 4 slots
## .. .. .. .. ..@ listData :List of 1
## .. .. .. .. .. ..$ counts: int [1:63677, 1:8] 679 0 467 260 60 0 3251 1433 519 394 ...
## .. .. .. .. ..@ elementType : chr "ANY"
## .. .. .. .. ..@ elementMetadata: NULL
## .. .. .. .. ..@ metadata : list()
## ..@ NAMES : NULL
## ..@ elementMetadata:Formal class 'DFrame' [package "S4Vectors"] with 6 slots
## .. .. ..@ rownames : NULL
## .. .. ..@ nrows : int 63677
## .. .. ..@ elementType : chr "ANY"
## .. .. ..@ elementMetadata: NULL
## .. .. ..@ metadata : list()
## .. .. ..@ listData : Named list()
## ..@ metadata :List of 1
## .. ..$ :Formal class 'MIAME' [package "Biobase"] with 13 slots
## .. .. .. ..@ name : chr "Himes BE"
## .. .. .. ..@ lab : chr NA
## .. .. .. ..@ contact : chr ""
## .. .. .. ..@ title : chr "RNA-Seq transcriptome profiling identifies CRISPLD2 as a glucocorticoid responsive gene that modulates cytokine"| __truncated__
## .. .. .. ..@ abstract : chr "Asthma is a chronic inflammatory respiratory disease that affects over 300 million people worldwide. Glucocorti"| __truncated__
## .. .. .. ..@ url : chr "http://www.ncbi.nlm.nih.gov/pubmed/24926665"
## .. .. .. ..@ pubMedIds : chr "24926665"
## .. .. .. ..@ samples : list()
## .. .. .. ..@ hybridizations : list()
## .. .. .. ..@ normControls : list()
## .. .. .. ..@ preprocessing : list()
## .. .. .. ..@ other : list()
## .. .. .. ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot
## .. .. .. .. .. ..@ .Data:List of 2
## .. .. .. .. .. .. ..$ : int [1:3] 1 0 0
## .. .. .. .. .. .. ..$ : int [1:3] 1 1 0
## .. .. .. .. .. ..$ names: chr [1:2] "MIAxE" "MIAME"
rowData-function
The rowData-function returns the feature data of the airway dataset. Its class is:
## [1] "DFrame"
## attr(,"package")
## [1] "S4Vectors"
DFrame is a class that is used to store the feature data of the airway dataset. It is a subclass of DataFrame. The DataFrame class is a subclass of the data.frame class. The DataFrame class is used to store data in a tabular format. The additional functionality provided by DFrame compared to data.frame is the ability to store metadata. The metadata is stored in the metadata slot of the DFrame object. The metadata is a list that can store any type of data. The metadata can be used to store additional information about the data stored in the DFrame object.
Let us have a look:
## DataFrame with 6 rows and 10 columns
## gene_id gene_name entrezid gene_biotype
## <character> <character> <integer> <character>
## ENSG00000000003 ENSG00000000003 TSPAN6 NA protein_coding
## ENSG00000000005 ENSG00000000005 TNMD NA protein_coding
## ENSG00000000419 ENSG00000000419 DPM1 NA protein_coding
## ENSG00000000457 ENSG00000000457 SCYL3 NA protein_coding
## ENSG00000000460 ENSG00000000460 C1orf112 NA protein_coding
## ENSG00000000938 ENSG00000000938 FGR NA protein_coding
## gene_seq_start gene_seq_end seq_name seq_strand
## <integer> <integer> <character> <integer>
## ENSG00000000003 99883667 99894988 X -1
## ENSG00000000005 99839799 99854882 X 1
## ENSG00000000419 49551404 49575092 20 -1
## ENSG00000000457 169818772 169863408 1 -1
## ENSG00000000460 169631245 169823221 1 1
## ENSG00000000938 27938575 27961788 1 -1
## seq_coord_system symbol
## <integer> <character>
## ENSG00000000003 NA TSPAN6
## ENSG00000000005 NA TNMD
## ENSG00000000419 NA DPM1
## ENSG00000000457 NA SCYL3
## ENSG00000000460 NA C1orf112
## ENSG00000000938 NA FGR
By using the str
function you could see the inner world
of this DFrame object.
## Formal class 'DFrame' [package "S4Vectors"] with 6 slots
## ..@ rownames : chr [1:63677] "ENSG00000000003" "ENSG00000000005" "ENSG00000000419" "ENSG00000000457" ...
## ..@ nrows : int 63677
## ..@ elementType : chr "ANY"
## ..@ elementMetadata: NULL
## ..@ metadata : list()
## ..@ listData :List of 10
## .. ..$ gene_id : chr [1:63677] "ENSG00000000003" "ENSG00000000005" "ENSG00000000419" "ENSG00000000457" ...
## .. ..$ gene_name : chr [1:63677] "TSPAN6" "TNMD" "DPM1" "SCYL3" ...
## .. ..$ entrezid : int [1:63677] NA NA NA NA NA NA NA NA NA NA ...
## .. ..$ gene_biotype : chr [1:63677] "protein_coding" "protein_coding" "protein_coding" "protein_coding" ...
## .. ..$ gene_seq_start : int [1:63677] 99883667 99839799 49551404 169818772 169631245 27938575 196621008 143815948 53362139 41040684 ...
## .. ..$ gene_seq_end : int [1:63677] 99894988 99854882 49575092 169863408 169823221 27961788 196716634 143832827 53481768 41067715 ...
## .. ..$ seq_name : chr [1:63677] "X" "X" "20" "1" ...
## .. ..$ seq_strand : int [1:63677] -1 1 -1 -1 1 -1 1 -1 -1 1 ...
## .. ..$ seq_coord_system: int [1:63677] NA NA NA NA NA NA NA NA NA NA ...
## .. ..$ symbol : chr [1:63677] "TSPAN6" "TNMD" "DPM1" "SCYL3" ...
As one could see multiple meta-information about the data are present like elementType, elementMetadata, metadata etc. If we compare this to a normal data.frame object
## 'data.frame': 63677 obs. of 10 variables:
## $ gene_id : chr "ENSG00000000003" "ENSG00000000005" "ENSG00000000419" "ENSG00000000457" ...
## $ gene_name : chr "TSPAN6" "TNMD" "DPM1" "SCYL3" ...
## $ entrezid : int NA NA NA NA NA NA NA NA NA NA ...
## $ gene_biotype : chr "protein_coding" "protein_coding" "protein_coding" "protein_coding" ...
## $ gene_seq_start : int 99883667 99839799 49551404 169818772 169631245 27938575 196621008 143815948 53362139 41040684 ...
## $ gene_seq_end : int 99894988 99854882 49575092 169863408 169823221 27961788 196716634 143832827 53481768 41067715 ...
## $ seq_name : chr "X" "X" "20" "1" ...
## $ seq_strand : int -1 1 -1 -1 1 -1 1 -1 -1 1 ...
## $ seq_coord_system: int NA NA NA NA NA NA NA NA NA NA ...
## $ symbol : chr "TSPAN6" "TNMD" "DPM1" "SCYL3" ...
We could clearly see the difference. The DFrame object has additional slots that could be used to store additional type of information.
colData-function
The colData-function, called accessor-function, returns the metadata of the airway dataset samples. Its class is also a:
## [1] "DFrame"
## attr(,"package")
## [1] "S4Vectors"
## Formal class 'DFrame' [package "S4Vectors"] with 6 slots
## ..@ rownames : chr [1:8] "SRR1039508" "SRR1039509" "SRR1039512" "SRR1039513" ...
## ..@ nrows : int 8
## ..@ elementType : chr "ANY"
## ..@ elementMetadata: NULL
## ..@ metadata : list()
## ..@ listData :List of 9
## .. ..$ SampleName: Factor w/ 8 levels "GSM1275862","GSM1275863",..: 1 2 3 4 5 6 7 8
## .. ..$ cell : Factor w/ 4 levels "N052611","N061011",..: 4 4 1 1 3 3 2 2
## .. ..$ dex : Factor w/ 2 levels "trt","untrt": 2 1 2 1 2 1 2 1
## .. ..$ albut : Factor w/ 1 level "untrt": 1 1 1 1 1 1 1 1
## .. ..$ Run : Factor w/ 8 levels "SRR1039508","SRR1039509",..: 1 2 3 4 5 6 7 8
## .. ..$ avgLength : int [1:8] 126 126 126 87 120 126 101 98
## .. ..$ Experiment: Factor w/ 8 levels "SRX384345","SRX384346",..: 1 2 3 4 5 6 7 8
## .. ..$ Sample : Factor w/ 8 levels "SRS508567","SRS508568",..: 2 1 3 4 5 6 7 8
## .. ..$ BioSample : Factor w/ 8 levels "SAMN02422669",..: 1 4 6 2 7 3 8 5
the content is as follows:
## DataFrame with 8 rows and 9 columns
## SampleName cell dex albut Run avgLength
## <factor> <factor> <factor> <factor> <factor> <integer>
## SRR1039508 GSM1275862 N61311 untrt untrt SRR1039508 126
## SRR1039509 GSM1275863 N61311 trt untrt SRR1039509 126
## SRR1039512 GSM1275866 N052611 untrt untrt SRR1039512 126
## SRR1039513 GSM1275867 N052611 trt untrt SRR1039513 87
## SRR1039516 GSM1275870 N080611 untrt untrt SRR1039516 120
## SRR1039517 GSM1275871 N080611 trt untrt SRR1039517 126
## SRR1039520 GSM1275874 N061011 untrt untrt SRR1039520 101
## SRR1039521 GSM1275875 N061011 trt untrt SRR1039521 98
## Experiment Sample BioSample
## <factor> <factor> <factor>
## SRR1039508 SRX384345 SRS508568 SAMN02422669
## SRR1039509 SRX384346 SRS508567 SAMN02422675
## SRR1039512 SRX384349 SRS508571 SAMN02422678
## SRR1039513 SRX384350 SRS508572 SAMN02422670
## SRR1039516 SRX384353 SRS508575 SAMN02422682
## SRR1039517 SRX384354 SRS508576 SAMN02422673
## SRR1039520 SRX384357 SRS508579 SAMN02422683
## SRR1039521 SRX384358 SRS508580 SAMN02422677
metadata-function
returns a list of objects containing metadata related to overall experiment. So it is not directly metadata of the samples as in colData but it adds a context to the experiment dataset. The class of the metadata in our case is:
## [1] "MIAME"
## attr(,"package")
## [1] "Biobase"
MIAME is an acronym for Minimum Information About a Microarray Experiment. MIAME is conceptually a set of guidelines for the minimum information that should be provided when reporting a microarray experiment. The MIAME guidelines were developed by the Microarray Gene Expression Data (MGED) Society. The MGED Society is an international organization that aims to promote the sharing of microarray data and the development of standards for microarray data.
This framework could be also partially used for RNA-Seq data. The Package Biobase procide a class MIAME implementing this framework.
## Experiment data
## Experimenter name: Himes BE
## Laboratory: NA
## Contact information:
## Title: RNA-Seq transcriptome profiling identifies CRISPLD2 as a glucocorticoid responsive gene that modulates cytokine function in airway smooth muscle cells.
## URL: http://www.ncbi.nlm.nih.gov/pubmed/24926665
## PMIDs: 24926665
##
## Abstract: A 226 word abstract is available. Use 'abstract' method.
## ensembl_gene_id gene_symbol
## ENSG00000000003 ENSG00000000003 TSPAN6
## ENSG00000000005 ENSG00000000005 TNMD
## ENSG00000000419 ENSG00000000419 DPM1
## ENSG00000000457 ENSG00000000457 SCYL3
## ENSG00000000460 ENSG00000000460 FIRRM
## ENSG00000000938 ENSG00000000938 FGR
Explanation
- Feature Data: You extract gene symbols and
associate them with Ensembl gene IDs using
org.Hs.eg.db
. This step is critical for adding biological meaning to the analysis. - Metadata: The metadata contains information about
the experimental conditions (e.g.,
dex
treatment) applied to each sample. - You can either load preprocessed datasets or preprocess the data on
the fly by setting
RUN
toTRUE
.
Assay-function
Let’s take a quick look at the count data using the assay()-function. It expects a SummarizedExperiment object as input and returns the assay data as a matrix.
## SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516
## ENSG00000000003 679 448 873 408 1138
## ENSG00000000005 0 0 0 0 0
## ENSG00000000419 467 515 621 365 587
## ENSG00000000457 260 211 263 164 245
## ENSG00000000460 60 55 40 35 78
## ENSG00000000938 0 0 2 0 1
## SRR1039517 SRR1039520 SRR1039521
## ENSG00000000003 1047 770 572
## ENSG00000000005 0 0 0
## ENSG00000000419 799 417 508
## ENSG00000000457 331 233 229
## ENSG00000000460 63 76 60
## ENSG00000000938 0 0 0
## [1] "matrix" "array"
Why Should Biologists Use SummarizedExperiment
and
RangedSummarizedExperiment
?
As a biologist diving into bioinformatics and RNASeq data analysis,
you may often work with large datasets that contain gene expression
levels, sample metadata, and genomic coordinates. Managing these
different types of data in a consistent and organized way can be
challenging. This is where
SummarizedExperiment
and
RangedSummarizedExperiment
come in.
1. All Your Data in One Place
Both SummarizedExperiment
and
RangedSummarizedExperiment
are designed to hold all
the information you need for RNASeq and genomic analysis in one
object: - Assay Data: Your gene expression
counts. - Row Metadata: Information about genes or
genomic features (like gene IDs or symbols). - Column
Metadata: Sample information, such as treatment conditions or
time points.
By using these classes, you don’t have to manage separate files for each type of data. Everything is stored in a single, well-organized object.
2. Easy to Manipulate and Query
These classes allow you to easily access and manipulate the different
types of data: - You can quickly filter genes based on
expression levels or metadata. - You can subset samples
based on treatment or experimental conditions. - With
RangedSummarizedExperiment
, you can even subset by
genomic coordinates, allowing you to focus on specific regions
of the genome.
For example, suppose you want to analyze only genes located on a
particular chromosome or region. RangedSummarizedExperiment
makes this kind of genomic subsetting straightforward.
3. Seamless Integration with Bioconductor Tools
Both classes are part of the Bioconductor ecosystem,
which means they integrate smoothly with many tools designed for
biological data analysis. For example: - The DESeq2
package
for differential expression analysis works directly with these classes.
- Visualization tools like ggplot2
can easily handle data
from these objects.
This means that by learning these classes, you’re preparing yourself to work with a wide range of powerful tools in bioinformatics.
4. Reproducibility and Collaboration
Because these objects store data in a consistent format, it’s easy to share your analysis with others: - Your collaborators can load the exact same dataset with all the metadata intact. - Your workflows become reproducible, since everything is stored in one object, minimizing the risk of losing or misplacing parts of the dataset.
5. A Beginner-Friendly Start
Although these classes may sound complex, they are beginner-friendly. Instead of juggling multiple data frames and files, you only need to learn how to work with one object that holds all the necessary information. The ability to access, filter, and modify the data using simple commands makes it an excellent entry point for biologists starting with R.
In summary, SummarizedExperiment
and
RangedSummarizedExperiment
provide a convenient, organized,
and powerful way to manage, analyze, and visualize your biological data
in R. They help you focus on the biological questions rather than the
data handling, making your research more efficient, reproducible, and
scalable.
4. Differential Expression Analysis
The core of RNASeq analysis is identifying genes that are
differentially expressed under different conditions. In this section, we
will use DESeq2
for the analysis.
# Create a DESeq2 object (DESeqDataSet) from the count matrix and metadata.
dds <- DESeq2::DESeqDataSet(se = my_airway_dataset,
design = ~dex)
# Run the DESeq pipeline for differential expression analysis
dds <- DESeq(dds)
vsd <- vst(dds, blind = T)
# Extract the results table, which contains log2 fold changes and p-values
results <- DESeq2::results(dds) %>% as.data.frame()
# Add significance column based on log2FoldChange and adjusted p-value
results <- results %>%
tibble::add_column(significance =
if_else(.$log2FoldChange > 1 & .$padj < 0.05, "upregulated",
if_else(.$log2FoldChange < -1 & .$padj < 0.05, "downregulated", "not significant")))
# Remove rows with missing values
results <- na.omit(results)
# Add gene symbols to the results using Ensembl gene IDs
results <- results %>% tibble::add_column(ensembl_gene_id = rownames(.)) %>%
left_join(my_airway_featureData, by = c("ensembl_gene_id"))
# Select relevant columns and arrange by adjusted p-value
results <- results %>% dplyr::select(ensembl_gene_id,
gene_symbol,
baseMean,
log2FoldChange,
padj,
lfcSE,
stat,
pvalue,
significance) %>%
arrange(padj)
# Display the results in an interactive table
DT::datatable(results,
filter = "top",
extensions = c('Buttons', 'ColReorder'),
options = list(dom = 'Bfrtip',
buttons = c('copy', 'csv', 'excel', 'pdf', 'print'),
colReorder = TRUE))
Explanation
- DESeq2 Object: We first create a
DESeqDataSet
object that encapsulates the count data, metadata, and experimental design (in this case, the design is~dex
, indicating the treatment). - Differential Expression: The
DESeq
function runs the analysis, and we extract results that include fold changes and statistical significance. - Interactive Table: The
DT
package is used to create an interactive table where users can filter and sort results.
5. Principal Component Analysis (PCA)
PCA is a useful method for visualizing the global structure of high-dimensional datasets like RNASeq.
# Transpose the data and convert it into a matrix
data_matrix <- t(assay(vst(dds)))
# Remove columns (features) with zero variance
data_matrix_no_zero_var <- data_matrix[, apply(data_matrix, 2, var) != 0]
# Perform PCA
pca <- prcomp(data_matrix_no_zero_var, scale. = TRUE, center = TRUE, retx = TRUE)
# Extract PCA scores and add metadata
pca_scores <- pca$x %>% as.data.frame() %>% tibble::add_column(dex = colData(dds)$dex)
# Extract PCA loadings (gene contributions)
pca_loadings <- pca$rotation %>% as.data.frame()
pca_loadings <- pca_loadings %>% tibble::add_column(ensembl_gene_id = rownames(.)) %>%
left_join(my_airway_featureData, by = c("ensembl_gene_id")) %>%
dplyr::select(ensembl_gene_id, gene_symbol, 1:8)
Explanation
- Data Preparation: We transpose the data matrix and remove genes with zero variance to avoid problems in PCA.
- PCA: The
prcomp
function performs PCA on the preprocessed data, and the principal components (PCs) are extracted.
6. Visualization: PCA Plot
Visualize the first two principal components to observe the grouping of samples.
# Get PCA summary
sum_pca <- summary(pca)
importance <- sum_pca$importance %>% as.data.frame()
# Create new sample names in the format dex_REP[1:4]
my_airway_colData$Sample.New <- paste0(my_airway_colData$dex,"_REP",rep(1:4, each=2))
pca_scores <- pca_scores %>% tibble::add_column(sample=my_airway_colData$Sample.New)
# Plot the PCA results
ggplot2::ggplot(data=pca_scores, aes(x=PC1, y=PC2, color=dex, label=sample)) +
geom_point(size=3) +
labs(title="PCA of airway dataset",
x=paste0("PC1 (", round(100*importance[2,1], 2), "%)"),
y=paste0("PC2 (", round(100*importance[2,2], 2), "%)"))+
ggrepel::geom_label_repel()
7. Visualization: Volcano Plot
Volcano plots are a popular method for visualizing fold changes against significance.
Initial Volcano Plot
vp <- ggplot2::ggplot(data = results, aes(x = log2FoldChange, y = -log10(padj), color = significance)) +
geom_point() +
ggtitle("Volcano plot of the airway dataset") +
theme(plot.title = element_text(face = "bold", hjust = .5))
vp
Adjust the Plot
Center the plot and add cutoff lines.
# Extract the maximal absolute log2 fold change
max_abs_log2FoldChange <- max(abs(results$log2FoldChange), na.rm = TRUE)
# Extract the maximal -log10(padj) (lowest adjusted p-value)
max_neg_log10_padj <- max(-log10(results$padj), na.rm = TRUE)
# Adjust the plot limits
vp.2 <- vp + xlim(c(-max_abs_log2FoldChange, max_abs_log2FoldChange)) +
ylim(c(0, max_neg_log10_padj))
# Add cutoff lines for log2FoldChange and padj
vp.3 <- vp.2 + geom_hline(yintercept = -log10(0.05), linetype = "dashed") +
geom_vline(xintercept = -1, linetype = "dashed") +
geom_vline(xintercept = 1, linetype = "dashed")
vp.3
Customize Colors and Highlight Genes
Highlight the top 20 genes by padj
.
custom_colors <- c(upregulated = "darkred", downregulated = "blue", `not significant` = "grey")
vp.4 <- vp.3 + scale_color_manual(values = custom_colors)
# Top 20 genes by adjusted p-value
top_20_genes <- results %>% arrange(padj) %>% head(20)
vp.5 <- vp.4 +
geom_text_repel(data = top_20_genes, aes(label = gene_symbol),
box.padding = 0.5, point.padding = 0.5,
segment.color = "grey50", segment.size = 0.5,
segment.alpha = 0.5, max.overlaps = Inf, size = 3)
vp.5
You now have a comprehensive guide to analyzing RNASeq data in R, including preprocessing, differential expression analysis, PCA, and visualizing the results with plots like the PCA and volcano plots.