Mapping Many Gene Expression Traits
Overview
Teaching: 30 min
Exercises: 30 minQuestions
How do I map many genes?
Objectives
To map several genes at the same time
Load Libraries
For this lesson, we need to install two more libraries and load them. You can do this by typing the following code:
BiocManager::install(c("AnnotationHub","rtracklayer"))
Let’s install our libraries, and source two other R scripts.
library(tidyverse)
library(knitr)
library(broom)
library(qtl2)
library(qtl2ggplot)
library(RColorBrewer)
source("../code/gg_transcriptome_map.R")
source("../code/qtl_heatmap.R")
Before we begin this lesson, we need to create another directory called results
in our main directory. You can do this by clicking on the “Files” tab and navigate into the main directory. Then select “New Folder” and name it “results”.
Load Data
# expression data
load("../data/attie_DO500_expr.datasets.RData")
# data from paper
load("../data/dataset.islet.rnaseq.RData")
# phenotypes
load("../data/attie_DO500_clinical.phenotypes.RData")
# mapping data
load("../data/attie_DO500_mapping.data.RData")
# genotype probabilities
probs = readRDS("../data/attie_DO500_genoprobs_v5.rds")
Data Selection
For this lesson, lets choose a random set of 50 gene expression phenotypes.
genes = colnames(norm)
sams <- sample(length(genes), 50, replace = FALSE, prob = NULL)
genes <- genes[sams]
gene.info <- dataset.islet.rnaseq$annots[genes,]
rownames(gene.info) = NULL
kable(gene.info[1:10,])
gene_id | symbol | chr | start | end | strand | middle | nearest.marker.id | biotype | module | hotspot |
---|---|---|---|---|---|---|---|---|---|---|
ENSMUSG00000028182 | Lrriq3 | 3 | 155.09343 | 155.19428 | 1 | 155.14386 | 3_155141204 | protein_coding | brown | NA |
ENSMUSG00000025437 | Usp33 | 3 | 152.34648 | 152.39361 | 1 | 152.37005 | 3_152372563 | protein_coding | pink | NA |
ENSMUSG00000022354 | Ndufb9 | 15 | 58.93381 | 58.93949 | 1 | 58.93665 | 15_58956707 | protein_coding | blue | NA |
ENSMUSG00000086905 | Gm13716 | 2 | 85.19693 | 85.19867 | 1 | 85.19780 | 2_85202802 | antisense | grey | NA |
ENSMUSG00000022812 | Gsk3b | 16 | 38.08900 | 38.24608 | 1 | 38.16754 | 16_38137342 | protein_coding | midnightblue | NA |
ENSMUSG00000018974 | Sart3 | 5 | 113.74245 | 113.77165 | -1 | 113.75705 | 5_113653421 | protein_coding | purple | NA |
ENSMUSG00000021013 | Ttc8 | 12 | 98.92057 | 98.98324 | 1 | 98.95191 | 12_98904979 | protein_coding | grey | NA |
ENSMUSG00000075558 | Gm17151 | 10 | 80.76686 | 80.76899 | -1 | 80.76792 | 10_80824573 | antisense | grey | NA |
ENSMUSG00000049699 | Ucn2 | 9 | 108.98616 | 108.98716 | 1 | 108.98666 | 9_108986207 | protein_coding | grey | NA |
ENSMUSG00000050534 | Htr5b | 1 | 121.50984 | 121.52846 | -1 | 121.51915 | 1_121401626 | protein_coding | grey | NA |
Expression Data
Lets check the distribution for the first 20 gene expression phenotypes. If you would like to check the distribution of all 50 genes, change for(gene in genes[1:20])
in the code below to for(gene in genes)
.
par(mfrow=c(3,4))
for(gene in genes[1:20]){
hist(norm[,gene], main = gene)
}
Check the distributions. Do they all have a normal distribution?
You will notice that the distribtion of some genes are skewed to the left. This means that that only a small amount of samples have data and therefore, will need to be removed. A suitable qc would be keeping expression data that have at least 5% of the samples with more than 10 reads.
genes_qc <- which(as.numeric(colSums(counts[,genes] > 10)) >= 0.05 * nrow(counts[,genes]))
genes <- genes[genes_qc]
The Marker Map
We are using the same marker map as in the previous lesson
Genotype probabilities
We have explored this earlier in th previous lesson. But, as a reminder, we have already calculated genotype probabilities which we loaded above called probs
. This contains the 8 state genotype probabilities using the 69k grid map of the same 500 DO mice that also have clinical phenotypes.
Kinship Matrix
We have explored the kinship matrix in the previous lesson. It has already been calculated and loaded in above.
Covariates
Now let’s add the necessary covariates. For these 50 gene expression data, we will correct for DOwave
,sex
and diet_days
.
# convert sex and DO wave (batch) to factors
pheno_clin$sex = factor(pheno_clin$sex)
pheno_clin$DOwave = factor(pheno_clin$DOwave)
pheno_clin$diet_days = factor(pheno_clin$DOwave)
covar = model.matrix(~sex + DOwave + diet_days, data = pheno_clin)[,-1]
Performing a genome scan
Now lets perform the genome scan! We are also going to save our qtl results in an Rdata
file to be used in further lessons. We will not perform permutations in this lesson as it will take too long. Instead we will use 6, which is the LOD score used in the paper to determine significance.
QTL Scans
qtl.file = "../results/gene.norm_qtl_cis.trans.Rdata"
if(file.exists(qtl.file)) {
load(qtl.file)
} else {
qtl = scan1(genoprobs = probs,
pheno = norm[,genes, drop = FALSE],
kinship = K,
addcovar = covar,
cores = 2)
save(qtl, file = qtl.file)
}
QTL plots
Let’s plot the first 20 gene expression phenotypes. If you would like to plot all 50, change for(i in 1:20)
in the code below to for(i in 1:ncol(qtl))
.
par(mfrow=c(3,4))
for(i in 1:20) {
plot_scan1(x = qtl,
map = map,
lodcolumn = i,
main = colnames(qtl)[i])
abline(h = 6, col = 2, lwd = 2)
}
QTL Peaks
We are also going to save our peak results so we can use these again else where. First, lets get out peaks with a LOD score greater than 6.
lod_threshold = 6
peaks = find_peaks(scan1_output = qtl,
map = map,
threshold = lod_threshold,
peakdrop = 4,
prob = 0.95)
We will save these peaks into a csv file.
kable(peaks[1:10,] %>%
dplyr::select(-lodindex) %>%
arrange(chr, pos), caption = "Expression QTL (eQTL) Peaks with LOD >= 6")
write_csv(peaks, "../results/gene.norm_qtl_peaks_cis.trans.csv")
Table: Phenotype QTL Peaks with LOD >= 6
lodcolumn | chr | pos | lod | ci_lo | ci_hi |
---|---|---|---|---|---|
ENSMUSG00000026575 | 1 | 58.22118 | 8.209107 | 57.626420 | 59.24670 |
ENSMUSG00000026575 | 1 | 159.09013 | 27.718407 | 156.733003 | 159.16100 |
ENSMUSG00000026575 | 1 | 164.84588 | 44.516456 | 163.460361 | 164.85502 |
ENSMUSG00000027868 | 3 | 92.66069 | 28.519024 | 91.152799 | 92.68765 |
ENSMUSG00000027868 | 3 | 98.14651 | 63.870847 | 98.144249 | 99.88818 |
ENSMUSG00000020140 | 10 | 115.85360 | 13.645956 | 115.451100 | 115.89945 |
ENSMUSG00000027868 | 11 | 106.62472 | 6.454471 | 9.144667 | 107.71252 |
ENSMUSG00000021302 | 13 | 14.93271 | 39.030453 | 13.564744 | 14.93875 |
ENSMUSG00000026575 | 15 | 12.40748 | 6.070807 | 9.260173 | 18.72732 |
ENSMUSG00000027868 | 15 | 66.27724 | 6.502607 | 65.853995 | 67.86561 |
QTL Peaks Figure
qtl_heatmap(qtl = qtl, map = map, low.thr = 3.5)
Challenge
What do the qtl scans for all gene exression traits look like? Note: Don’t worry, we’ve done the qtl scans for you!!! You can read in this file,
../data/gene.norm_qtl_all.genes.Rdata
, which are thescan1
results for all gene expression traits.Solution
load("../data/gene.norm_qtl_all.genes.Rdata") lod_threshold = 6 peaks = find_peaks(scan1_output = qtl.all, map = map, threshold = lod_threshold, peakdrop = 4, prob = 0.95) write_csv(peaks, "../results/gene.norm_qtl_all.genes_peaks.csv") ## Heat Map qtl_heatmap(qtl = qtl, map = map, low.thr = 3.5)
Key Points