Mapping Many Gene Expression Traits

Overview

Teaching: 30 min
Exercises: 30 min

Questions

How do I map many genes?

Objectives

To map several genes at the same time

Load Libraries

For this lesson, we need to install two more libraries and load them. You can do this by typing the following code:

BiocManager::install(c("AnnotationHub","rtracklayer"))

Let’s install our libraries, and source two other R scripts.

library(tidyverse)
library(knitr)
library(broom)
library(qtl2)
library(qtl2ggplot)
library(RColorBrewer)

source("../code/gg_transcriptome_map.R")
source("../code/qtl_heatmap.R")

Before we begin this lesson, we need to create another directory called results in our main directory. You can do this by clicking on the “Files” tab and navigate into the main directory. Then select “New Folder” and name it “results”.

Load Data

# expression data
load("../data/attie_DO500_expr.datasets.RData")

# data from paper
load("../data/dataset.islet.rnaseq.RData")

# phenotypes
load("../data/attie_DO500_clinical.phenotypes.RData")

# mapping data
load("../data/attie_DO500_mapping.data.RData")

# genotype probabilities
probs = readRDS("../data/attie_DO500_genoprobs_v5.rds")

Data Selection

For this lesson, lets choose a random set of 50 gene expression phenotypes.

genes = colnames(norm)

sams <- sample(length(genes), 50, replace = FALSE, prob = NULL)
genes <- genes[sams]

gene.info <- dataset.islet.rnaseq$annots[genes,]
rownames(gene.info) = NULL
kable(gene.info[1:10,])

gene_id	symbol	chr	start	end	strand	middle	nearest.marker.id	biotype	module	hotspot
ENSMUSG00000028182	Lrriq3	3	155.09343	155.19428	1	155.14386	3_155141204	protein_coding	brown	NA
ENSMUSG00000025437	Usp33	3	152.34648	152.39361	1	152.37005	3_152372563	protein_coding	pink	NA
ENSMUSG00000022354	Ndufb9	15	58.93381	58.93949	1	58.93665	15_58956707	protein_coding	blue	NA
ENSMUSG00000086905	Gm13716	2	85.19693	85.19867	1	85.19780	2_85202802	antisense	grey	NA
ENSMUSG00000022812	Gsk3b	16	38.08900	38.24608	1	38.16754	16_38137342	protein_coding	midnightblue	NA
ENSMUSG00000018974	Sart3	5	113.74245	113.77165	-1	113.75705	5_113653421	protein_coding	purple	NA
ENSMUSG00000021013	Ttc8	12	98.92057	98.98324	1	98.95191	12_98904979	protein_coding	grey	NA
ENSMUSG00000075558	Gm17151	10	80.76686	80.76899	-1	80.76792	10_80824573	antisense	grey	NA
ENSMUSG00000049699	Ucn2	9	108.98616	108.98716	1	108.98666	9_108986207	protein_coding	grey	NA
ENSMUSG00000050534	Htr5b	1	121.50984	121.52846	-1	121.51915	1_121401626	protein_coding	grey	NA

Expression Data

Lets check the distribution for the first 20 gene expression phenotypes. If you would like to check the distribution of all 50 genes, change for(gene in genes[1:20]) in the code below to for(gene in genes).

par(mfrow=c(3,4))
for(gene in genes[1:20]){
  hist(norm[,gene], main = gene)
  }

plot of chunk hist_untransformed

Check the distributions. Do they all have a normal distribution?

You will notice that the distribtion of some genes are skewed to the left. This means that that only a small amount of samples have data and therefore, will need to be removed. A suitable qc would be keeping expression data that have at least 5% of the samples with more than 10 reads.

genes_qc <- which(as.numeric(colSums(counts[,genes] > 10)) >= 0.05 * nrow(counts[,genes]))
genes <- genes[genes_qc]

The Marker Map

We are using the same marker map as in the previous lesson

Genotype probabilities

We have explored this earlier in th previous lesson. But, as a reminder, we have already calculated genotype probabilities which we loaded above called probs. This contains the 8 state genotype probabilities using the 69k grid map of the same 500 DO mice that also have clinical phenotypes.

Kinship Matrix

We have explored the kinship matrix in the previous lesson. It has already been calculated and loaded in above.

Covariates

Now let’s add the necessary covariates. For these 50 gene expression data, we will correct for DOwave,sex and diet_days.

# convert sex and DO wave (batch) to factors
pheno_clin$sex = factor(pheno_clin$sex)
pheno_clin$DOwave = factor(pheno_clin$DOwave)
pheno_clin$diet_days = factor(pheno_clin$DOwave)

covar = model.matrix(~sex + DOwave + diet_days, data = pheno_clin)[,-1]

Performing a genome scan

Now lets perform the genome scan! We are also going to save our qtl results in an Rdata file to be used in further lessons. We will not perform permutations in this lesson as it will take too long. Instead we will use 6, which is the LOD score used in the paper to determine significance.

QTL Scans

qtl.file = "../results/gene.norm_qtl_cis.trans.Rdata"

if(file.exists(qtl.file)) {
  load(qtl.file)
  } else {
    qtl = scan1(genoprobs = probs, 
                pheno = norm[,genes, drop = FALSE],
                kinship = K, 
                addcovar = covar, 
                cores = 2)
    save(qtl, file = qtl.file)
    }

QTL plots

Let’s plot the first 20 gene expression phenotypes. If you would like to plot all 50, change for(i in 1:20) in the code below to for(i in 1:ncol(qtl)).

par(mfrow=c(3,4))
for(i in 1:20) {
  plot_scan1(x = qtl, 
             map = map, 
             lodcolumn = i, 
             main = colnames(qtl)[i])
  abline(h = 6, col = 2, lwd = 2)
  }

plot of chunk qtl_plots

QTL Peaks

We are also going to save our peak results so we can use these again else where. First, lets get out peaks with a LOD score greater than 6.

lod_threshold = 6
peaks = find_peaks(scan1_output = qtl, 
                   map = map, 
                   threshold = lod_threshold, 
                   peakdrop = 4, 
                   prob = 0.95)

We will save these peaks into a csv file.

kable(peaks[1:10,] %>% 
        dplyr::select(-lodindex) %>% 
        arrange(chr, pos), caption = "Expression QTL (eQTL) Peaks with LOD >= 6")

write_csv(peaks, "../results/gene.norm_qtl_peaks_cis.trans.csv")

Table: Phenotype QTL Peaks with LOD >= 6

lodcolumn	chr	pos	lod	ci_lo	ci_hi
ENSMUSG00000026575	1	58.22118	8.209107	57.626420	59.24670
ENSMUSG00000026575	1	159.09013	27.718407	156.733003	159.16100
ENSMUSG00000026575	1	164.84588	44.516456	163.460361	164.85502
ENSMUSG00000027868	3	92.66069	28.519024	91.152799	92.68765
ENSMUSG00000027868	3	98.14651	63.870847	98.144249	99.88818
ENSMUSG00000020140	10	115.85360	13.645956	115.451100	115.89945
ENSMUSG00000027868	11	106.62472	6.454471	9.144667	107.71252
ENSMUSG00000021302	13	14.93271	39.030453	13.564744	14.93875
ENSMUSG00000026575	15	12.40748	6.070807	9.260173	18.72732
ENSMUSG00000027868	15	66.27724	6.502607	65.853995	67.86561

QTL Peaks Figure

qtl_heatmap(qtl = qtl, map = map, low.thr = 3.5)

plot of chunk qtl_heatmap

Challenge

What do the qtl scans for all gene exression traits look like? Note: Don’t worry, we’ve done the qtl scans for you!!! You can read in this file, ../data/gene.norm_qtl_all.genes.Rdata, which are the scan1 results for all gene expression traits.
Solution
load("../data/gene.norm_qtl_all.genes.Rdata")

lod_threshold = 6
peaks = find_peaks(scan1_output = qtl.all, 
               map = map, 
               threshold = lod_threshold, 
               peakdrop = 4, 
               prob = 0.95)
write_csv(peaks, "../results/gene.norm_qtl_all.genes_peaks.csv")

## Heat Map
qtl_heatmap(qtl = qtl, map = map, low.thr = 3.5)

Key Points

previous episode

Gene Expression QTL Analysis

next episode