Mapping A Single Gene Expression Trait

Overview

Teaching: 30 min
Exercises: 30 min

Questions

How do I map one gene expression trait?

Objectives

QTL mapping of an expression data set

In this tutorial, we are going to use the gene expression data as our phenotype to map QTLs. That is, we are looking to find any QTLs (cis or trans) that are responsible for variation in gene expression across samples.

We are using the same directory as our previous tutorial, which you can check by typing getwd() in the Console. If you are not in the correct directory, type in setwd("code") in the console or Session -> Set Working Directory -> Choose Directory in the RStudio menu to set your working directory to the main directory.

If you have started a new R session, you will need to load the libraries again. If you haven’t, the libraries listed below are already loaded.

Load Libraries

library(tidyverse)
library(knitr)
library(broom)
library(qtl2)

Load Data

In this lesson, we are loading in gene expression data for 21,771 genes: attie_DO500_expr.datasets.RData that includes normalised and raw gene expression data. dataset.islet.rnaseq is the dataset you can download directly from Dryad. Again, if you are using the same R session, you will not need to the load the mapping, phenotypes and genotype probabilities data, again.

# expression data
load("../data/attie_DO500_expr.datasets.RData")

# data from paper
load("../data/dataset.islet.rnaseq.RData")

# phenotypes
load("../data/attie_DO500_clinical.phenotypes.RData")

# mapping data
load("../data/attie_DO500_mapping.data.RData")

# genotype probabilities
probs = readRDS("../data/attie_DO500_genoprobs_v5.rds")

Expression Data

Raw gene expression counts are in the counts data object. These counts have been normalised and saved in the norm data object. More information is about normalisation is here.

To view the contents of either one of these data , click the table on the right hand side either norm or counts in the Environment tab. If you type in names(counts), you will see the names all start with ENSMUSG. These are Ensembl IDs. If we want to see which gene these IDs correspond to, type in dataset.islet.rnaseq$annots, which gives information about each gene, including ensemble id, gene symbol as well as start & stop location of the gene and chromsome on which the gene lies.

Because we are working with the insulin tAUC phenotype, let’s map the expression counts for Hnf1b which is known to influence this phenotype is these data. First, we need to find the Ensembl ID for this gene:

dataset.islet.rnaseq$annots[dataset.islet.rnaseq$annots$symbol == "Hnf1b",]

                              gene_id symbol chr    start      end strand
ENSMUSG00000020679 ENSMUSG00000020679  Hnf1b  11 83.85006 83.90592      1
                     middle nearest.marker.id        biotype       module
ENSMUSG00000020679 83.87799       11_84097611 protein_coding midnightblue
                   hotspot
ENSMUSG00000020679    <NA>

We can see that the ensembl ID of Hnf1b is ENSMUSG00000020679. If we check the distribution for Hnf1b expression data between the raw and normalised data, we can see there distribution has been corrected. Here is the distribution of the raw counts:

hist(counts$ENSMUSG00000020679, main = "Hnf1b")

plot of chunk hist_untransformed

and here is the distribution of the normalised counts:

hist(norm$ENSMUSG00000020679, main = "Hnf1b")

plot of chunk hist_transformed

The histogram indicates that distribution of these counts are normalised.

The Marker Map

We are using the same marker map as in the previous lesson

Genotype probabilities

We have explored this earlier in th previous lesson. But, as a reminder, we have already calculated genotype probabilities which we loaded above called probs. This contains the 8 state genotype probabilities using the 69k grid map of the same 500 DO mice that also have clinical phenotypes.

Kinship Matrix

We have explored the kinship matrix in the previous lesson. It has already been calculated and loaded in above.

Covariates

Now let’s add the necessary covariates. For Hnf1b expression data, let’s see which covariates are significant.

###merging covariate data and expression data to test for sex, wave and diet_days.

cov.counts <- merge(covar, norm, by=c("row.names"), sort=F)

#testing covairates on expression data

tmp = cov.counts %>%
        dplyr::select(mouse, sex, DOwave, diet_days, ENSMUSG00000020679) %>%
        gather(expression, value, -mouse, -sex, -DOwave, -diet_days) %>%
        group_by(expression) %>%
        nest()
mod_fxn = function(df) {
  lm(value ~ sex + DOwave + diet_days, data = df)
}
tmp = tmp %>%
  mutate(model = map(data, mod_fxn)) %>%
  mutate(summ = map(model, tidy)) %>%
  unnest(summ) 
#  kable(tmp, caption = "Effects of Sex, Wave & Diet Days on Expression")

tmp

# A tibble: 4 x 8
# Groups:   expression [1]
  expression         data     model  term      estimate std.e~1 stati~2  p.value
  <chr>              <list>   <list> <chr>        <dbl>   <dbl>   <dbl>    <dbl>
1 ENSMUSG00000020679 <tibble> <lm>   (Interce~ -1.70    0.506     -3.36 8.72e- 4
2 ENSMUSG00000020679 <tibble> <lm>   sexM      -0.199   0.0824    -2.42 1.60e- 2
3 ENSMUSG00000020679 <tibble> <lm>   DOwave     0.510   0.0370    13.8  3.73e-35
4 ENSMUSG00000020679 <tibble> <lm>   diet_days  0.00403 0.00386    1.04 2.97e- 1
# ... with abbreviated variable names 1: std.error, 2: statistic

tmp %>%
  filter(term != "(Intercept)") %>%
  mutate(neg.log.p = -log10(p.value)) %>%
  ggplot(aes(term, neg.log.p)) +
    geom_point() +
    facet_wrap(~expression) +
    labs(title = "Significance of Sex, Wave & Diet Days on Expression") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
rm(tmp)

plot of chunk covariatesa_sig

We can see that sex and DOwave are significant. Here DOwave is the group or batch number as not all mice were submitted for genotyping at the same time. Because of this, we now have to correct for it. Considering the paper included the covariate, diet_days, we will include that as well.

# convert sex and DO wave (batch) to factors
pheno_clin$sex = factor(pheno_clin$sex)
pheno_clin$DOwave = factor(pheno_clin$DOwave)
pheno_clin$diet_days = factor(pheno_clin$DOwave)

covar = model.matrix(~sex + DOwave + diet_days, data = pheno_clin)[,-1]

Performing a genome scan

[Permutations]

First, we need to work out the signifcance level. Let’s find the signifance level for 0.1, 0.05 and 0.01.

operm <- scan1perm(genoprobs = probs, 
                   pheno = norm[,"ENSMUSG00000020679", drop = FALSE], 
                   addcovar=covar,
                   n_perm=1000)

Note DO NOT RUN THIS (it will take too long). Instead, I have run it earlier and will load it in here. We will also perform a summary to find the summary level for 0.1, 0.05 and 0.01

load("../data/operm_ENSMUSG00000020679_1000.Rdata")

summary(operm,alpha=c(0.1, 0.05, 0.01))

LOD thresholds (1000 permutations)
     ENSMUSG00000020679
0.1                7.29
0.05               7.73
0.01               8.30

Genome Scan

qtl = scan1(genoprobs = probs, 
            pheno = norm[,"ENSMUSG00000020679", drop = FALSE], 
            kinship = K, 
            addcovar = covar)

Next, we plot the genome scan.

plot_scan1(x = qtl, 
           map = map, 
           lodcolumn = "ENSMUSG00000020679",
           main = colnames(qtl))
           add_threshold(map,  summary(operm, alpha=0.1), col = 'purple')
           add_threshold(map,  summary(operm, alpha=0.05), col = 'red')
           add_threshold(map,  summary(operm, alpha=0.01), col = 'blue')

plot of chunk qtl_plot

Finding LOD peaks

Let’s find LOD peaks

lod_threshold = summary(operm, alpha=0.01)
peaks = find_peaks(scan1_output = qtl, 
                   map = map, 
                   threshold = lod_threshold, 
                   peakdrop = 4, prob = 0.95)
kable(peaks %>% 
        dplyr::select(-lodindex) %>% 
        arrange(chr, pos), caption = "Phenotype QTL Peaks with LOD >= 6")

Table: Phenotype QTL Peaks with LOD >= 6

lodcolumn	chr	pos	lod	ci_lo	ci_hi
ENSMUSG00000020679	11	84.40138	36.87894	83.64714	84.40138

QTL effects

blup <- scan1blup(genoprobs=probs[,peaks$chr[1]], 
                  norm[,peaks$lodcolumn[1], drop=FALSE])

plot_coefCC(blup, 
       map=map, 
       columns=1:8,
       bgcolor="gray95", 
       legend="bottomleft",
       scan1_output = qtl )

plot of chunk qtl_effects

Challenge

Now choose another gene expression trait in norm data object and perform the same steps.
1). Check the distribution of the raw counts and normalised counts 2). Are there any sex, batch, diet effects?
3). Run a genome scan with the genotype probabilities and kinship provided.
4). Plot the genome scan for this gene.
5). Find the peaks above LOD score of 6.

Solution

Replace <ensembl id> with your choice of gene expression trait

#1).

hist(pheno_clin$<ensembl id>)

pheno_clin$<ensembl id>_log <- log(pheno_clin$<ensembl id>)

hist(pheno_clin$<ensembl id>_log)


#2).

tmp = pheno_clin %>%
        dplyr::select(mouse, sex, DOwave, diet_days, <ensembl id>) %>%
        gather(expression, value, -mouse, -sex, -DOwave, -diet_days) %>%
        group_by(expression) %>%
        nest()
mod_fxn = function(df) {
  lm(value ~ sex + DOwave + diet_days, data = df)
}
tmp = tmp %>%
  mutate(model = map(data, mod_fxn)) %>%
  mutate(summ = map(model, tidy)) %>%
  unnest(summ) 

tmp

tmp %>%
  filter(term != "(Intercept)") %>%
  mutate(neg.log.p = -log10(p.value)) %>%
  ggplot(aes(term, neg.log.p)) +
    geom_point() +
    facet_wrap(~expression) +
    labs(title = "Significance of Sex, Wave & Diet Days on Phenotypes") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
rm(tmp)

#3).

qtl = scan1(genoprobs = probs, 
           pheno = pheno_clin[,"<ensembl id>", drop = FALSE], 
        kinship = K, 
        addcovar = covar)

#4).

plot_scan1(x = qtl, map = map, lodcolumn = "<ensembl id>")
abline(h = 6, col = 2, lwd = 2)

#5). 

peaks = find_peaks(scan1_output = qtl, map = map, 
               threshold = lod_threshold, 
               peakdrop = 4, 
               prob = 0.95)

Key Points

To run a QTL analysis for expression data

previous episode

Gene Expression QTL Analysis

next episode