Differential Gene Expression Analysis

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • Key question

Objectives
  • To check for sample mixup in data.

  • To identify genes differentially expressed between treatment groups.

  • To describe the effect of sample size on differential gene expression analysis.


Intro

One of the most common applications of RNA sequencing technology is to identify genes that are differentially expressed between sample groups, for example, between wild type and mutant, or between tumor and normal samples. Count data report the number of sequence reads (fragments) assigned to each gene, which describes the expression abundance of a gene. Similar data can be found in ChIP-Seq, HiC, shRNA screening, or mass spectrometry experiments.

We will be using Diversity Outbred Striatum IVSA dataset: 386 DO mice, males vs. females.

For a given gene, how would you compare expression between males and females?

QC Sample Mix ups

Let us make a plot.

plot of chunk samplemixups

DESeq2 Package

With low number of samples per group, the estimate of variance is not accurate. The solution is share information across the genes, i.e. Bayesian shrinkage.

For DESeq2 we will need be using read counts at RNA level.

Let us filter out genes with zero and low expression (less than 5 read counts) in 50% of the samples.

[1] 1 3 5 7 8 9
[1] 20945   416

And apply DESeq2 package as follows:

summary of Differential Expression Analysis

summary(res)

out of 20945 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)     : 422, 2% 
LFC < 0 (down)   : 497, 2.4% 
outliers [1]     : 0, 0% 
low counts [2]   : 2031, 9.7% 
(mean count < 12)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

MA plot

In DESeq2, the function plotMA shows the log2 fold changes attributable to a given variable over the mean of normalized counts for all the samples in the DESeqDataSet. Points will be colored red if the adjusted p value is less than 0.1. Points which fall out of the window are plotted as open triangles pointing either up or down.

plotMA(res, main="M-A Plot: All Samples per group", ylim=c(-2,2))

plot of chunk maplot

Plot counts

It can also be useful to examine the counts of reads for a single gene across the groups. A simple function for making this plot is plotCounts, which normalizes counts by sequencing depth and adds a pseudocount of 1/2 to allow for log scale plotting. The counts are grouped by the variables in intgroup, where more than one variable can be specified. Here we specify the gene which had the smallest p value from the results table created above. You can select the gene to plot by rowname or by numeric index.

ylab.nm <- "Eif2s3x"
d<-plotCounts(dds, gene=which.min(res$padj), intgroup="group",
              returnData=TRUE)
p <- ggplot(d, aes(x=group, y=count)) +
  geom_point(position=position_jitter(w=0.2,h=0),size=3)
p <- p + theme(axis.text=element_text(size=12),
               axis.title=element_text(size=20,face="bold", colour = "blue"),
               plot.title = element_text(size = rel(2))) + ylab(label = ylab.nm)
p

plot of chunk plotcount

Extract DE genes

We have identified 716 up-regulated and 789 down-regulated genes, respectively. Let us next extract these gene lists for future use.

We first order results by the smallest p value:

resOrdered <- res[order(res$pvalue),]

Extract all results

write.csv(as.data.frame(resOrdered), file="DO_FvM_deresults.csv")

Extract significant only results

resSig <- subset(resOrdered, padj < 0.1)
write.csv(as.data.frame(resSig), file="DO_FvM_deresults_sig0.1.csv")

Gene ontology enrichment analysis

We will use R/topGO for the purposes of enrichment analysis

Error in loadNamespace(name): there is no package called 'webshot'

Export to GeneWeaver

Let us first subset the differential gene expression results. One containing up-regulated genes and the other down-regulated genes. We will create a batch upload file with two genesets, named UpReg_Sex_Specific_Genes_AddictionShortCourse2017 and DownReg_Sex_Specific_Genes_AddictionShortCourse2017

[1] TRUE

Key Points

  • Non-coding RNA can be used to check for sample mixups in RNA sequencing data.