R Beginners Course 2025

Introduction to R and Basic Programming Concepts

Bioinformatics Core Facility CECAD

2025-03-17

Slides & Code

git repo

R-Basic


Clone repo

git clone https://github.com/CECADBioinformaticsCoreFacility/Beginners_R_Course_2025.git


Slides Directly

https://cecadbioinformaticscorefacility.github.io/Beginners_R_Course_2025/

Session 5 :: Descriptive Statistics

What is Descriptive Statistics?

Descriptive statistics (in the broad sense of the term) is a branch of statistics that aims to summarize, describe, and present a series of values or a dataset.


Many measures exist to summarize a dataset. They are broadly divided into three types:

  • Central Tendency
    • Mean, median, and mode
  • Variability
    • Range, variance, and standard deviation
  • Distribution
    • Modality, Skewness, Kurtosis

Measures of Central Tendency

mean, median, mode

mean(iris$Sepal.Width)
[1] 3.057333


median(iris$Sepal.Width)
[1] 3


x <- table(iris$Sepal.Width)
sort(x, decreasing = TRUE)

  3 2.8 3.2 3.4 3.1 2.9 2.7 2.5 3.3 3.5 3.8 2.6 2.3 3.6 2.2 2.4 3.7 3.9   2   4 
 26  14  13  12  11  10   9   8   6   6   6   5   4   4   3   3   3   2   1   1 
4.1 4.2 4.4 
  1   1   1 
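R has no built-in function for the statistical mode, so the sorted table above has to be read by eye. A minimal sketch of a helper that extracts the most frequent value(s) follows; the name `stat_mode` is hypothetical and not part of base R.

```r
# Hypothetical helper: return the most frequent value(s) of a numeric vector
stat_mode <- function(v) {
  counts <- table(v)
  # keep every value whose count equals the maximum (ties are all returned)
  as.numeric(names(counts)[counts == max(counts)])
}

stat_mode(iris$Sepal.Width)  # 3, which occurs 26 times in the table above
```

Returning all tied values rather than just the first one is a design choice made here; a dataset can have more than one mode.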

Measures of Variability

Min, max, range

min(iris$Sepal.Length)
[1] 4.3
max(iris$Sepal.Length)
[1] 7.9
range(iris$Sepal.Length)
[1] 4.3 7.9


Variance

# variance
var(iris$Sepal.Length) 
[1] 0.6856935


Standard Deviation

# standard deviation
sd(iris$Sepal.Length)
[1] 0.8280661
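To make the link between the two measures explicit, the following sketch recomputes both by hand. The variable names `var_manual` and `sd_manual` are chosen for illustration only.

```r
x <- iris$Sepal.Length
n <- length(x)

# Sample variance: squared deviations from the mean, divided by n - 1
var_manual <- sum((x - mean(x))^2) / (n - 1)

# Standard deviation: square root of the variance
sd_manual <- sqrt(var_manual)

all.equal(var_manual, var(x))  # TRUE
all.equal(sd_manual, sd(x))    # TRUE
```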

Measures of Distribution

Modes

The modality of a distribution is determined by the number of peaks it contains
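One way to look for peaks is to plot a density estimate or a histogram. The sketch below uses `Petal.Length`, which is an assumption made for this example; the number of peaks you see also depends on the smoothing bandwidth and the number of bins.

```r
# Kernel density estimate of petal length; peaks in the curve suggest modes
d <- density(iris$Petal.Length)
plot(d, main = "Density of Petal.Length")

# Histogram view of the same variable
hist(iris$Petal.Length, breaks = 20, main = "Histogram of Petal.Length")
```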


Skewness

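Skewness measures the asymmetry of a distribution. A value near 0 indicates a roughly symmetric distribution, a positive value a longer right tail, and a negative value a longer left tail.

x <- iris$Sepal.Width
# sum of cubed deviations, scaled by (n - 1) and the cubed standard deviation
sum((x - mean(x))^3) / ((length(x) - 1) * sd(x)^3)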
[1] 0.3147128


Kurtosis

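Kurtosis measures whether your dataset is heavy-tailed or light-tailed compared to a normal distribution, which has a kurtosis of 3.

# sum of fourth-power deviations, scaled by (n - 1) and sd^4 (x as defined for skewness)
sum((x - mean(x))^4) / ((length(x) - 1) * sd(x)^4)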
[1] 3.15977
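The two formulas can be wrapped in small functions so that they are reusable for other columns. This is only a sketch; the names `skew_manual` and `kurt_manual` are hypothetical.

```r
# Hypothetical helpers that wrap the two formulas used on these slides
skew_manual <- function(v) {
  sum((v - mean(v))^3) / ((length(v) - 1) * sd(v)^3)
}
kurt_manual <- function(v) {
  sum((v - mean(v))^4) / ((length(v) - 1) * sd(v)^4)
}

skew_manual(iris$Sepal.Width)  # same value as the skewness computed above
kurt_manual(iris$Sepal.Width)  # same value as the kurtosis computed above

# Apply both to every numeric column (column 5 is Species, so it is left out)
sapply(iris[, 1:4], skew_manual)
sapply(iris[, 1:4], kurt_manual)
```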
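
# The moments package provides skewness() and kurtosis(). Both divide by n
# rather than n - 1, so their results differ slightly from the formulas above.
library(moments)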

Descriptive Statistics

quantile(iris$Sepal.Length, 0.25) # first quartile
25% 
5.1 


quantile(iris$Sepal.Length, 0.75) # third quartile
75% 
6.4 


IQR(iris$Sepal.Length) # interquartile range 
[1] 1.3
summary(iris)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
       Species  
 setosa    :50  
 versicolor:50  
 virginica :50  
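As a quick check of how these numbers relate, the sketch below requests several quantiles in one call and rebuilds the interquartile range from them. The variable name `iqr_manual` is for illustration only.

```r
x <- iris$Sepal.Length

# Several quantiles in one call
q <- quantile(x, probs = c(0.25, 0.50, 0.75))
q

# The interquartile range is the third quartile minus the first
iqr_manual <- q[["75%"]] - q[["25%"]]
all.equal(iqr_manual, IQR(x))  # TRUE
```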
                
                
                

Frequency/Cross/Contingency Tables

Cross-tabulation analysis, also known as contingency table analysis, is most often used to analyze categorical (nominal measurement scale) data.

At their core, cross-tabulations are simply data tables that present the results for the entire group of respondents as well as for subgroups. With them, you can examine relationships within the data that might not be readily apparent when looking only at the total survey responses.


demo_data <- iris

demo_data$size <- 
    ifelse(demo_data$Sepal.Length <
                            median(demo_data$Sepal.Length),
  "small", "big"
)
table(demo_data$Species, demo_data$size)
            
             big small
  setosa       1    49
  versicolor  29    21
  virginica   47     3
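Counts are often easier to compare once totals and proportions are added. The sketch below rebuilds the same table and uses the base R functions `addmargins()` and `prop.table()`; the object name `tab` is an assumption made for this example.

```r
demo_data <- iris
demo_data$size <- ifelse(
  demo_data$Sepal.Length < median(demo_data$Sepal.Length),
  "small", "big"
)
tab <- table(demo_data$Species, demo_data$size)

addmargins(tab)              # adds row and column totals
prop.table(tab, margin = 1)  # proportions within each species (rows sum to 1)
```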

Correlation

Correlation measures the strength and direction of the relationship between two variables. It indicates whether the variables evolve in the same direction, in opposite directions, or show no such association.

  • Correlation is usually computed on two quantitative variables, but it can also be computed on two qualitative ordinal variables.

  • Pearson correlation is often used for quantitative continuous variables that have a linear relationship.

  • Spearman correlation (which is the Pearson correlation computed on the ranked values of each variable rather than on the raw data) is often used to evaluate relationships involving at least one qualitative ordinal variable, or two quantitative variables whose relationship is monotonic but only partially linear.


cor(iris$Sepal.Length, iris$Sepal.Width, method = "pearson")
[1] -0.1175698
cor(iris$Sepal.Length, iris$Sepal.Width, method = "spearman")
[1] -0.1667777
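To show what the two methods compute, the following sketch reproduces both coefficients by hand. The names `pearson_manual` and `spearman_manual` are for illustration only.

```r
x <- iris$Sepal.Length
y <- iris$Sepal.Width

# Pearson: covariance scaled by both standard deviations
pearson_manual <- cov(x, y) / (sd(x) * sd(y))

# Spearman: Pearson correlation computed on the ranks
spearman_manual <- cor(rank(x), rank(y), method = "pearson")

all.equal(pearson_manual, cor(x, y, method = "pearson"))    # TRUE
all.equal(spearman_manual, cor(x, y, method = "spearman"))  # TRUE
```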

Correlation Test

A correlation coefficient different from 0 in the sample does not mean that the correlation is significantly different from 0 in the population. This needs to be checked with a hypothesis test, known as the correlation test.

The null and alternative hypotheses for the correlation test are as follows:

  • H0 : ρ = 0 (meaning that there is no linear relationship between the two variables)
  • H1 : ρ ≠ 0 (meaning that there is a linear relationship between the two variables)


cor.test(iris$Sepal.Length, iris$Sepal.Width)

    Pearson's product-moment correlation

data:  iris$Sepal.Length and iris$Sepal.Width
t = -1.4403, df = 148, p-value = 0.1519
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.27269325  0.04351158
sample estimates:
       cor 
-0.1175698 
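The printed report can also be stored and queried piece by piece, which is handy when the result feeds into later code. A short sketch follows; the significance level `alpha = 0.05` is an assumption made for this example.

```r
res <- cor.test(iris$Sepal.Length, iris$Sepal.Width)

res$estimate  # sample correlation coefficient
res$p.value   # p-value of the test
res$conf.int  # 95 percent confidence interval for the correlation

# Decision at a significance level of 0.05 (alpha is an assumption made here)
alpha <- 0.05
if (res$p.value < alpha) {
  print("Reject H0: the correlation differs significantly from 0")
} else {
  print("Do not reject H0")
}
```

With the p-value of 0.1519 reported above, the second branch runs: the data do not provide evidence of a linear relationship between the two variables.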

Session 6 :: More Basic Concepts in R

Control Flow

  • if is used to evaluate whether a statement is TRUE or FALSE
  • If the statement is TRUE, the code block is executed
  • If the statement is FALSE, the code block is skipped
x <- 30L
if(is.integer(x)) {
   print("X is an Integer")
}
[1] "X is an Integer"

[Flowchart: if. The condition is checked; if True, the conditional code runs; if False, it is skipped.]

  • if-else is used to evaluate whether a statement is TRUE or FALSE
  • If the statement is TRUE, the first code block is executed
  • If the statement is FALSE, the second code block is executed
x <- c("what","is","truth")
if("Truth" %in% x) {
   print("Truth is found")
} else {
   print("Truth is not found")
}
[1] "Truth is not found"

[Flowchart: if-else. The condition is checked; if True, the if block runs; if False, the else block runs.]
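When there are more than two outcomes, conditions can be chained with `else if`. A minimal sketch, using a hypothetical helper named `sign_label`:

```r
# Hypothetical helper that labels a number by its sign
sign_label <- function(x) {
  if (x > 0) {
    "positive"
  } else if (x < 0) {
    "negative"
  } else {
    "zero"
  }
}

sign_label(30)  # "positive"
sign_label(-2)  # "negative"
sign_label(0)   # "zero"
```

Conditions are tested from top to bottom, and only the first block whose condition is TRUE is executed.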

  • switch is used to select one of several code blocks to be executed
  • The switch function takes an expression and a list of cases
x <- switch(
   3,
   "first",
   "second",
   "third",
   "fourth"
)
print(x)
[1] "third"

[Flowchart: switch. The expression selects one of the code blocks (Case 1, Case 2, Case 3).]
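Besides a number, `switch` also accepts a character string, in which case the cases are matched by name. The sketch below uses a hypothetical helper named `summarise_by`.

```r
# Hypothetical helper: pick a summary statistic by name
summarise_by <- function(v, type) {
  switch(
    type,
    mean   = mean(v),
    median = median(v),
    stop("Unknown type: ", type)  # unnamed last argument acts as the default
  )
}

summarise_by(iris$Sepal.Width, "median")  # 3, as computed earlier
```

Ending with an unnamed default that calls `stop()` is a design choice made here, so that a misspelled name fails loudly instead of silently returning NULL.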

R Loops

v <- LETTERS[1:4]
# print each element of v in turn
for (i in v) {
   print(i)
}
[1] "A"
[1] "B"
[1] "C"
[1] "D"

v <- c("Hello","while loop")
cnt <- 2

while (cnt < 7) {
   print(v)
   cnt <- cnt + 1
}
[1] "Hello"      "while loop"
[1] "Hello"      "while loop"
[1] "Hello"      "while loop"
[1] "Hello"      "while loop"
[1] "Hello"      "while loop"

v <- c("Hello","loop")
cnt <- 2

repeat {
   print(v)
   cnt <- cnt + 1
    
   if(cnt > 5) {
      break
   }
}
[1] "Hello" "loop" 
[1] "Hello" "loop" 
[1] "Hello" "loop" 
[1] "Hello" "loop" 

v <- LETTERS[1:6]
for (i in v) {
   # skip "D" and continue with the next element
   if (i == "D") {
      next
   }
   print(i)
}
[1] "A"
[1] "B"
[1] "C"
[1] "E"
[1] "F"
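The loops above only print values. In practice a loop usually stores its results, and a common pattern is to create the result vector first and fill it by index. A sketch under that assumption, with the hypothetical names `cols` and `col_means`:

```r
cols <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")

# Pre-allocate a named result vector, then fill it inside the loop
col_means <- numeric(length(cols))
names(col_means) <- cols

for (i in seq_along(cols)) {
  col_means[i] <- mean(iris[[cols[i]]])
}

col_means
```

`seq_along(cols)` is used instead of `1:length(cols)` because it also behaves correctly when the vector is empty.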

The apply family

An apply function is essentially a loop, but it often requires less code than an explicit loop and is not necessarily faster. The apply family is a set of functions in R that apply a function to the rows or columns of a matrix or data frame, or to the elements of a list or vector. The main functions in the apply family are apply, lapply, sapply, vapply, tapply, and mapply.

mtrx <- matrix(c(1:10, 11:20, 21:30), 
               nrow = 10, ncol = 3)
mtrx
      [,1] [,2] [,3]
 [1,]    1   11   21
 [2,]    2   12   22
 [3,]    3   13   23
 [4,]    4   14   24
 [5,]    5   15   25
 [6,]    6   16   26
 [7,]    7   17   27
 [8,]    8   18   28
 [9,]    9   19   29
[10,]   10   20   30
apply(mtrx, 1, sum) # row-wise
 [1] 33 36 39 42 45 48 51 54 57 60
apply(mtrx, 2, sum) # column-wise
[1]  55 155 255
# standard error of the mean
st.err <- function(x) {
  sd(x) / sqrt(length(x))
}
apply(mtrx, 2, st.err)
[1] 0.9574271 0.9574271 0.9574271
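For sums and means, base R also offers dedicated helpers such as `rowSums()` and `colSums()`, and `apply()` accepts an anonymous function when no named one exists. A short sketch that rebuilds the same matrix:

```r
mtrx <- matrix(1:30, nrow = 10, ncol = 3)

# apply() with sum gives the same values as the dedicated helpers
all(apply(mtrx, 1, sum) == rowSums(mtrx))  # TRUE
all(apply(mtrx, 2, sum) == colSums(mtrx))  # TRUE

# An anonymous function works too: range width (max - min) of each column
apply(mtrx, 2, function(col) max(col) - min(col))  # 9 9 9
```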

lapply is used to apply a function to each element of a list or vector and returns a list. It is useful when you want to apply a function to each element of a list and return the results in a list format.

A <- 1:10
B <- 11:20
C <- 21:30
my.lst <- list(A, B, C)
lapply(my.lst, sum)
[[1]]
[1] 55

[[2]]
[1] 155

[[3]]
[1] 255

sapply is a simplified version of lapply. It tries to simplify the result to a vector or matrix if possible.

sapply(my.lst, mean)
[1]  5.5 15.5 25.5

vapply is similar to sapply, but it requires you to specify the type of output you expect. This can help prevent unexpected results.

vapply(my.lst, mean, numeric(1))
[1]  5.5 15.5 25.5
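The difference between the two functions shows up in edge cases. The sketch below uses an empty list and a function that returns two values; both inputs are assumptions chosen for illustration.

```r
empty <- list()

# sapply() cannot guess the output type for empty input and returns a list
s <- sapply(empty, length)
class(s)  # "list"

# vapply() still returns the declared type: an integer vector of length 0
v <- vapply(empty, length, integer(1))
class(v)  # "integer"

# vapply() raises an error when a result does not match the template
res <- try(vapply(list(1:3, 4:6), range, numeric(1)), silent = TRUE)
inherits(res, "try-error")  # TRUE: range() returns two values, not one
```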

tapply is used to apply a function to subsets of a vector, based on a grouping factor. It is useful for performing calculations on subsets of data.

tapply(iris$Sepal.Length, iris$Species, mean)
    setosa versicolor  virginica 
     5.006      5.936      6.588 
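`tapply` returns one value per group. To get several statistics per group at once, one option is to split the vector first and use `sapply` on the pieces. This is a sketch of one possible approach; the names `groups` and `stats` are hypothetical.

```r
# Split Sepal.Length by species, then compute several statistics per group
groups <- split(iris$Sepal.Length, iris$Species)

stats <- sapply(groups, function(v) {
  c(mean = mean(v), median = median(v), sd = sd(v))
})
stats  # 3 x 3 matrix: one row per statistic, one column per species
```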

mapply is a multivariate version of sapply. It allows you to apply a function to multiple arguments in parallel.

A
 [1]  1  2  3  4  5  6  7  8  9 10
B
 [1] 11 12 13 14 15 16 17 18 19 20
C
 [1] 21 22 23 24 25 26 27 28 29 30
mapply(sum, A, B, C)
 [1] 33 36 39 42 45 48 51 54 57 60
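`mapply` is most useful with a function of several arguments that is not already vectorised. The sketch below uses an arbitrary weighted sum as a stand-in; the weights and the name `weighted` are assumptions made for this example.

```r
A <- 1:10
B <- 11:20
C <- 21:30

# A custom function of three arguments, applied element by element
weighted <- mapply(function(a, b, c) a + 2 * b + 3 * c, A, B, C)
weighted[1]  # 1 + 2 * 11 + 3 * 21 = 86

# For simple arithmetic, plain vectorised code gives the same result
all(weighted == A + 2 * B + 3 * C)  # TRUE
```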

Session 7 :: Practice 1

Session 8 :: Practice 2

Practice 2

  1. Use .Rmd or .qmd file to document your code and results.
  2. Select iris dataset and create a new dataset with only Sepal.Length and Sepal.Width columns.
  3. Create a new column Sepal.Area using cbind in the new dataset by multiplying Sepal.Length and Sepal.Width.
    • Check if there is any correlation between Sepal.Length and Sepal.Area.
    • Create a new column Sepal.Size in the new dataset by using ifelse function to classify Sepal.Area as small or big based on the median value of Sepal.Area
    • Create a Contingency Table for Sepal.Area.
    • Create a Mosaic Plot for Sepal.Area.
    • Create a Histogram for Sepal.Area and save it with specific dimension and resolution.
    • Save the dataset as iris_new.csv and iris_new.xls in the working directory.

Practice 2

  1. Use a for and/or while loop to calculate the mean of Sepal.Length and Sepal.Width for each species in the new dataset.
    • Use a break statement to exit the loop when the mean of Sepal.Length is greater than 5.
    • Use a next statement to skip the iteration when the mean of Sepal.Length is less than 5.
  2. Use apply family functions to calculate the mean, median, standard deviation, and standard error of Sepal.Length and Sepal.Width for each species in the new dataset.
    • Draw boxplots of Sepal.Length and Sepal.Width for each species in the new dataset.
  3. Use the cut() function to create a vector of intervals for each of the four floral traits (see ?cut). Then, for each trait, make a contingency table of Species versus the interval factor of that trait, and draw a barplot of this table. In the barplot, there should be one bar per trait interval, with the counts of each species in this interval stacked. Combine the plots for the 4 floral traits into one plot, using layout().