This lesson is in the early stages of development (Alpha version)

Mapping Many Gene Expression Traits

Overview

Teaching: 30 min
Exercises: 30 min
Questions
  • How do I map many genes?

Objectives
  • To map several genes at the same time

Load Libraries

For this lesson, we need to install two more libraries and load them. You can do this by typing the following code:

BiocManager::install(c("AnnotationHub","rtracklayer"))

Let’s install our libraries, and source two other R scripts.

library(tidyverse)
library(knitr)
library(broom)
library(qtl2)
library(qtl2ggplot)
library(RColorBrewer)

source("../code/gg_transcriptome_map.R")
source("../code/qtl_heatmap.R")

Before we begin this lesson, we need to create another directory called results in our main directory. You can do this by clicking on the “Files” tab and navigate into the main directory. Then select “New Folder” and name it “results”.

Load Data

# expression data
load("../data/attie_DO500_expr.datasets.RData")

# data from paper
load("../data/dataset.islet.rnaseq.RData")

# phenotypes
load("../data/attie_DO500_clinical.phenotypes.RData")

# mapping data
load("../data/attie_DO500_mapping.data.RData")

# genotype probabilities
probs = readRDS("../data/attie_DO500_genoprobs_v5.rds")

Data Selection

For this lesson, lets choose a random set of 50 gene expression phenotypes.

genes = colnames(norm)

sams <- sample(length(genes), 50, replace = FALSE, prob = NULL)
genes <- genes[sams]

gene.info <- dataset.islet.rnaseq$annots[genes,]
rownames(gene.info) = NULL
kable(gene.info[1:10,])
gene_id symbol chr start end strand middle nearest.marker.id biotype module hotspot
ENSMUSG00000028182 Lrriq3 3 155.09343 155.19428 1 155.14386 3_155141204 protein_coding brown NA
ENSMUSG00000025437 Usp33 3 152.34648 152.39361 1 152.37005 3_152372563 protein_coding pink NA
ENSMUSG00000022354 Ndufb9 15 58.93381 58.93949 1 58.93665 15_58956707 protein_coding blue NA
ENSMUSG00000086905 Gm13716 2 85.19693 85.19867 1 85.19780 2_85202802 antisense grey NA
ENSMUSG00000022812 Gsk3b 16 38.08900 38.24608 1 38.16754 16_38137342 protein_coding midnightblue NA
ENSMUSG00000018974 Sart3 5 113.74245 113.77165 -1 113.75705 5_113653421 protein_coding purple NA
ENSMUSG00000021013 Ttc8 12 98.92057 98.98324 1 98.95191 12_98904979 protein_coding grey NA
ENSMUSG00000075558 Gm17151 10 80.76686 80.76899 -1 80.76792 10_80824573 antisense grey NA
ENSMUSG00000049699 Ucn2 9 108.98616 108.98716 1 108.98666 9_108986207 protein_coding grey NA
ENSMUSG00000050534 Htr5b 1 121.50984 121.52846 -1 121.51915 1_121401626 protein_coding grey NA

Expression Data

Lets check the distribution for the first 20 gene expression phenotypes. If you would like to check the distribution of all 50 genes, change for(gene in genes[1:20]) in the code below to for(gene in genes).

par(mfrow=c(3,4))
for(gene in genes[1:20]){
  hist(norm[,gene], main = gene)
  }

plot of chunk hist_untransformedplot of chunk hist_untransformed

Check the distributions. Do they all have a normal distribution?

You will notice that the distribtion of some genes are skewed to the left. This means that that only a small amount of samples have data and therefore, will need to be removed. A suitable qc would be keeping expression data that have at least 5% of the samples with more than 10 reads.

genes_qc <- which(as.numeric(colSums(counts[,genes] > 10)) >= 0.05 * nrow(counts[,genes]))
genes <- genes[genes_qc]

The Marker Map

We are using the same marker map as in the previous lesson

Genotype probabilities

We have explored this earlier in th previous lesson. But, as a reminder, we have already calculated genotype probabilities which we loaded above called probs. This contains the 8 state genotype probabilities using the 69k grid map of the same 500 DO mice that also have clinical phenotypes.

Kinship Matrix

We have explored the kinship matrix in the previous lesson. It has already been calculated and loaded in above.

Covariates

Now let’s add the necessary covariates. For these 50 gene expression data, we will correct for DOwave,sex and diet_days.

# convert sex and DO wave (batch) to factors
pheno_clin$sex = factor(pheno_clin$sex)
pheno_clin$DOwave = factor(pheno_clin$DOwave)
pheno_clin$diet_days = factor(pheno_clin$DOwave)

covar = model.matrix(~sex + DOwave + diet_days, data = pheno_clin)[,-1]

Performing a genome scan

Now lets perform the genome scan! We are also going to save our qtl results in an Rdata file to be used in further lessons. We will not perform permutations in this lesson as it will take too long. Instead we will use 6, which is the LOD score used in the paper to determine significance.

QTL Scans

qtl.file = "../results/gene.norm_qtl_cis.trans.Rdata"

if(file.exists(qtl.file)) {
  load(qtl.file)
  } else {
    qtl = scan1(genoprobs = probs, 
                pheno = norm[,genes, drop = FALSE],
                kinship = K, 
                addcovar = covar, 
                cores = 2)
    save(qtl, file = qtl.file)
    }

QTL plots

Let’s plot the first 20 gene expression phenotypes. If you would like to plot all 50, change for(i in 1:20) in the code below to for(i in 1:ncol(qtl)).

par(mfrow=c(3,4))
for(i in 1:20) {
  plot_scan1(x = qtl, 
             map = map, 
             lodcolumn = i, 
             main = colnames(qtl)[i])
  abline(h = 6, col = 2, lwd = 2)
  }

plot of chunk qtl_plotsplot of chunk qtl_plots

QTL Peaks

We are also going to save our peak results so we can use these again else where. First, lets get out peaks with a LOD score greater than 6.

lod_threshold = 6
peaks = find_peaks(scan1_output = qtl, 
                   map = map, 
                   threshold = lod_threshold, 
                   peakdrop = 4, 
                   prob = 0.95)

We will save these peaks into a csv file.

kable(peaks[1:10,] %>% 
        dplyr::select(-lodindex) %>% 
        arrange(chr, pos), caption = "Expression QTL (eQTL) Peaks with LOD >= 6")

write_csv(peaks, "../results/gene.norm_qtl_peaks_cis.trans.csv")

Table: Phenotype QTL Peaks with LOD >= 6

lodcolumn chr pos lod ci_lo ci_hi
ENSMUSG00000026575 1 58.22118 8.209107 57.626420 59.24670
ENSMUSG00000026575 1 159.09013 27.718407 156.733003 159.16100
ENSMUSG00000026575 1 164.84588 44.516456 163.460361 164.85502
ENSMUSG00000027868 3 92.66069 28.519024 91.152799 92.68765
ENSMUSG00000027868 3 98.14651 63.870847 98.144249 99.88818
ENSMUSG00000020140 10 115.85360 13.645956 115.451100 115.89945
ENSMUSG00000027868 11 106.62472 6.454471 9.144667 107.71252
ENSMUSG00000021302 13 14.93271 39.030453 13.564744 14.93875
ENSMUSG00000026575 15 12.40748 6.070807 9.260173 18.72732
ENSMUSG00000027868 15 66.27724 6.502607 65.853995 67.86561

QTL Peaks Figure

qtl_heatmap(qtl = qtl, map = map, low.thr = 3.5)

plot of chunk qtl_heatmap

Challenge

What do the qtl scans for all gene exression traits look like? Note: Don’t worry, we’ve done the qtl scans for you!!! You can read in this file, ../data/gene.norm_qtl_all.genes.Rdata, which are the scan1 results for all gene expression traits.

Solution

load("../data/gene.norm_qtl_all.genes.Rdata")

lod_threshold = 6
peaks = find_peaks(scan1_output = qtl.all, 
               map = map, 
               threshold = lod_threshold, 
               peakdrop = 4, 
               prob = 0.95)
write_csv(peaks, "../results/gene.norm_qtl_all.genes_peaks.csv")

## Heat Map
qtl_heatmap(qtl = qtl, map = map, low.thr = 3.5)

Key Points