Review Mapping Steps

Overview

Teaching: 30 min
Exercises: 30 min

Questions

What are the steps involved in running QTL mapping in Diversity Outbred mice?

Objectives

Reviewing the QTL mapping steps learnt in the first two days

Before we begin to run QTL mapping on gene expression data to find eQTLs, let’s review the main QTL mapping steps that we learnt in the QTL mapping course. As a reminder, we are using data from the Keller et al. paper that are freely available to download.

Make sure that you are in your main directory. If you’re not sure where you are working right now, you can check your working directory with getwd(). If you are not in your main directory, run setwd("code") in the Console or Session -> Set Working Directory -> Choose Directory in the RStudio menu to set your working directory to the code directory.

Load Libraries

Below are the neccessary libraries that we require for this review. They are already installed on your machines so go ahead an load them using the following code:

library(tidyverse)
library(knitr)
library(broom)
library(qtl2)

Load Data

The data for this tutorial has been saved as several R binary files which contain several data objects, including phenotypes and mapping data as well as the genoprobs.

Load the data in now by running the following command in your new script.

##phenotypes
load("../data/attie_DO500_clinical.phenotypes.RData")

##mapping data
load("../data/attie_DO500_mapping.data.RData")

##genotype probabilities
probs = readRDS("../data/attie_DO500_genoprobs_v5.rds")

We loaded in a few data objects today. Check the Environment tab to see what was loaded. You should see a phenotypes object called pheno_clin with 500 observations (in rows) of 32 variables (in columns), an object called map (physical map), and an object called probs (genotype probabilities).

Phenotypes

In this data set, we have 20 phenotypes for 500 Diversity Outbred mice. pheno_clin is a data frame containing the phenotype data as well as covariate information. Click on the triangle to the left of pheno_clin in the Environment pane to view its contents. Run names(pheno_clin) to list the variables.

pheno_clin_dict is the phenotype dictionary. This data frame contains information on each variable in pheno_clin, including name,short name, pheno_type, formula (if used) and description.

Since the paper is interested in type 2 diabetes and insulin secretion, we will choose insulin tAUC (area under the curve (AUC) which was calculated without any correction for baseline differences) for this review

Many statistical models, including the QTL mapping model in qtl2, expect that the incoming data will be normally distributed. You may use transformations such as log or square root to make your data more normally distributed. Here, we will log transform the data.

Here is a histogram of the untransformed data.

hist(pheno_clin$Ins_tAUC, main = "Insulin Area Under the Curve")

plot of chunk hist_untransformed

Now, let’s apply the log() function to this data to correct the distribution.

pheno_clin$Ins_tAUC_log <- log(pheno_clin$Ins_tAUC)

Now, let’s make a histogram of the log-transformed data.

hist(pheno_clin$Ins_tAUC_log, main = "insulin tAUC (log-transformed)")

plot of chunk hist_log_transform

This looks much better!

The Marker Map

The marker map for each chromosome is stored in the map object. This is used to plot the LOD scores calculated at each marker during QTL mapping. Each list element is a numeric vector with each marker position in megabases (Mb). Here we are using the 69K grid marker file. Often when there are numerous genotype arrays used in a study, we interoplate all to a 69k grid file so we are able to combine all samples across different array types.

Look at the structure of map in the Environment tab by clicking the triangle to the left or by running str(map) in the Console.

Genotype probabilities

Each element of probs is a 3 dimensional array containing the founder allele dosages for each sample at each marker on one chromosome. These are the 8 state alelle probabilities (not 32) using the 69k marker grid for same 500 DO mice that also have clinical phenotypes. We have already calculated genotype probabilities for you, so you can skip the step for calculating genotype probabilities and the optional step for calculating allele probabilities.

Next, we look at the dimensions of probs for chromosome 1:

dim(probs[[1]])

[1]  500    8 4711

plot_genoprob(probs, map, ind = 1, chr = 1)

plot of chunk geno_plot

In the plot above, the founder contributions, which range between 0 and 1, are colored from white (= 0) to black (= 1.0). A value of ~0.5 is grey. The markers are on the X-axis and the eight founders (denoted by the letters A through H) on the Y-axis. Starting at the left, we see that this sample has genotype GH because the rows for G & H are grey, indicating values of 0.5 for both alleles. Moving along the genome to the right, the genotype becomes HH where where the row is black indicating a value of 1.0. This is followed by CD, DD, DG, AD, AH, CE, etc. The values at each marker sum to 1.0.

Kinship Matrix

The kinship matrix has already been calculated and loaded in above

n_samples <- 50
heatmap(K[[1]][1:n_samples, 1:n_samples])

plot of chunk kinship_probs

The figure above shows kinship between all pairs of samples. Light yellow indicates low kinship and dark red indicates higher kinship. Orange values indicate varying levels of kinship between 0 and 1. The dark red diagonal of the matrix indicates that each sample is identical to itself. The orange blocks along the diagonal may indicate close relatives (i.e. siblings or cousins).

Covariates

Next, we need to create additive covariates that will be used in the mapping model. First, we need to see which covariates are significant. In the data set, we have sex, DOwave (Wave (i.e., batch) of DO mice) and diet_days (number of days on diet) to test whether there are any gender, batch or diet effects.

First we are going to select out the covariates and phenotype we want from pheno_clin data frame. Then reformat these selected variables into a long format (using the gather command) grouped by the phenotypes (in this case, we only have Ins_tAUC_log)

### Tests for sex, wave and diet_days.

tmp = pheno_clin %>%
        dplyr::select(mouse, sex, DOwave, diet_days, Ins_tAUC_log) %>%
        gather(phenotype, value, -mouse, -sex, -DOwave, -diet_days) %>%
        group_by(phenotype) %>%
        nest()

Let’s create a linear model function that will regress the covariates and the phenotype.

mod_fxn = function(df) {
  lm(value ~ sex + DOwave + diet_days, data = df)
}

Now let’s apply that function to the data object tmp that we created above.

tmp = tmp %>%
  mutate(model = map(data, mod_fxn)) %>%
  mutate(summ  = map(model, tidy)) %>%
  unnest(summ)

tmp

# A tibble: 7 x 8
# Groups:   phenotype [1]
  phenotype    data               model  term  estimate std.e~1 stati~2  p.value
  <chr>        <list>             <list> <chr>    <dbl>   <dbl>   <dbl>    <dbl>
1 Ins_tAUC_log <tibble [500 x 5]> <lm>   (Int~  4.49    0.447    10.1   1.10e-21
2 Ins_tAUC_log <tibble [500 x 5]> <lm>   sexM   0.457   0.0742    6.17  1.48e- 9
3 Ins_tAUC_log <tibble [500 x 5]> <lm>   DOwa~ -0.294   0.118    -2.49  1.33e- 2
4 Ins_tAUC_log <tibble [500 x 5]> <lm>   DOwa~ -0.395   0.118    -3.36  8.46e- 4
5 Ins_tAUC_log <tibble [500 x 5]> <lm>   DOwa~ -0.176   0.118    -1.49  1.38e- 1
6 Ins_tAUC_log <tibble [500 x 5]> <lm>   DOwa~ -0.137   0.118    -1.16  2.46e- 1
7 Ins_tAUC_log <tibble [500 x 5]> <lm>   diet~  0.00111 0.00346   0.322 7.48e- 1
# ... with abbreviated variable names 1: std.error, 2: statistic

tmp %>%
  filter(term != "(Intercept)") %>%
  mutate(neg.log.p = -log10(p.value)) %>%
  ggplot(aes(term, neg.log.p)) +
    geom_point() +
    facet_wrap(~phenotype) +
    labs(title = "Significance of Sex, Wave & Diet Days on Phenotypes") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
rm(tmp)

plot of chunk covariates_sig_function_plots

We can see that sex and DOwave (especially the third batch) are significant. Here DOwave is the group or batch number as not all mice are in the experient at the same time. Because of this, we now have to correct for it.

# convert sex and DO wave (batch) to factors
pheno_clin$sex    = factor(pheno_clin$sex)
pheno_clin$DOwave = factor(pheno_clin$DOwave)

covar = model.matrix(~sex + DOwave, data = pheno_clin)[,-1]

REMEMBER: the sample IDs must be in the rownames of pheno, addcovar, genoprobs and K. qtl2 uses the sample IDs to align the samples between objects.

Performing a genome scan

At each marker on the genotyping array, we will fit a model that regresses the phenotype (insulin secretion AUC) on covariates and the founder allele proportions. Note that this model will give us an estimate of the effect of each founder allele at each marker. There are eight founder strains that contributed to the DO, so we will get eight founder allele effects.

[Permutations]

First, we need to work out the signifcance level. Let’s find the signifance level for 0.1, 0.05 and 0.01.

operm <- scan1perm(genoprobs = probs, 
                   pheno     = pheno_clin[,"Ins_tAUC_log", drop = FALSE],
                   addcovar  = covar,
                   n_perm    = 1000)

Note DO NOT RUN THIS (it will take too long). Instead, I have run it earlier and will load it in here. We will also perform a summary to find the summary level for 0.1, 0.05 and 0.01

load("../data/operm_Ins_tAUC_log_1000.Rdata")

summary(operm, alpha = c(0.1, 0.05, 0.01))

LOD thresholds (1000 permutations)
     Ins_tAUC_log
0.1          7.19
0.05         7.63
0.01         8.43

Genome Scan

qtl = scan1(genoprobs = probs, 
            pheno     = pheno_clin[,"Ins_tAUC_log", drop = FALSE], 
            kinship   = K, 
            addcovar  = covar)

Next, we plot the genome scan.

plot_scan1(x   = qtl, 
           map = map, 
           lodcolumn = "Ins_tAUC_log")
           add_threshold(map,  summary(operm, alpha = 0.1),  col = 'purple')
           add_threshold(map,  summary(operm, alpha = 0.05), col = 'red')
           add_threshold(map,  summary(operm, alpha = 0.01), col = 'blue')

plot of chunk qtl_plot

We can see a very strong peak on chromosome 11 with no other distinguishable peaks.

Finding LOD peaks

We can find all of the peaks above the significance threshold using the find_peaks function.

The support interval is determined using the Bayesian Credible Interval and represents the region most likely to contain the causal polymorphism(s). We can obtain this interval by adding a prob argument to find_peaks. We pass in a value of 0.95 to request a support interval that contains the causal SNP 95% of the time.

In case there are multiple peaks are found on a chromosome, the peakdrop argument allows us to the find the peak which has a certain LOD score drop between other peaks.

Let’s find LOD peaks. Here we are choosing to find peaks with a LOD score greater than 6.

lod_threshold = summary(operm, alpha=0.01)
peaks = find_peaks(scan1_output = qtl, map = map, 
                   threshold = lod_threshold, 
                   peakdrop = 4, 
                   prob = 0.95)

kable(peaks %>% dplyr::select (-lodindex) %>% 
        arrange(chr, pos), caption = "Phenotype QTL Peaks with LOD >= 6")

Table: Phenotype QTL Peaks with LOD >= 6

lodcolumn	chr	pos	lod	ci_lo	ci_hi
Ins_tAUC_log	11	83.59467	11.25884	83.58553	84.95444

QTL effects

Once we have selected a QTL peak to investigate, we can make several types of plots to view the founder allele effects. In the mapping model, we can estimate the magnitude and direction of the founder allele effects. One method is to estimate the allele effects across the entire chromosome. This can be informative for complex peaks that span many Mb. We use the [scan1blup][https://github.com/kbroman/qtl2/blob/main/R/scan1blup.R] fuction to estimate the allele effects and the [plot_coefCC][https://github.com/kbroman/qtl2/blob/main/R/plot_coef.R] fuction to plot the results.

# g <- maxmarg(probs = probs, 
#              map   = map, 
#              chr   = peaks$chr[1],
#              pos   = peaks$pos[1], 
#              minprob     = 0.4,
#              return_char = TRUE)
# 
# plot_pxg(geno  = g, 
#          pheno = pheno_clin[,peaks$lodcolumn[1]], 
#          ylab  = peaks$lodcolumn[1], 
#          sort  = FALSE)

chr = 11

blup <- scan1blup(genoprobs = probs[,chr], 
                  pheno     = pheno_clin[,"Ins_tAUC_log", drop = FALSE], 
                  kinship   = K[[chr]], 
                  addcovar  = covar)

plot_coefCC(x            = blup, 
            map          = map, 
            columns      = 1:8,
            bgcolor      = "gray95", 
            legend       = "bottomleft",
            scan1_output = qtl,
            main         = "Ins_tAUC_log")

plot of chunk qtl_effects

NOTE: Don’t run the code above because it takes a long time. We have provided the code and the output so that you can run this for your data.

From the plot above, DO mice carrying the A/J, c57BL/6J, 129S1/SvImJ, or NOD/ShiLtJ alleles at the chromosome 11 QTL peak have higher Insulint tAUC than DO mice carrying the other founder alleles.

If we are only interested in the allele effects at the marker with the highest LOD score, we can extract the genoprobs at that marker and estimate and plot the allele effects.

# Get the marker with the highest LOD score.
max_mkr = find_marker(map = map, chr = peaks$chr, pos = peaks$pos)

max_pr = pull_genoprobpos(genoprobs = probs, marker = max_mkr)

mod = fit1(genoprobs = max_pr, 
           pheno     = pheno_clin[,"Ins_tAUC_log", drop = FALSE], 
           kinship   = K[[chr]], 
           addcovar  = covar)

data.frame(founder = names(mod$coef[1:8]),
           coef    = mod$coef[1:8],
           se      = mod$SE[1:8]) %>% 
  ggplot(aes(x = founder, y = coef)) +
    geom_pointrange(aes(ymin = coef - se, ymax = coef + se)) +
    labs(title = "Founder Allele Effects at Chr 11 Ins_tAUC QTL")

plot of chunk fit1

In this plot, you can see that DO mice carrying the four founder alleles on the left (A/J, c57BL/6J, 129S1/SvImJ, NOD/ShiLtJ) have higher Insulin tAUC values than DO mice carrying the other founders.

As a reference, here are the DO founder letter codes:

kable(data.frame(Letter = LETTERS[1:8],
                 Strain = c("A/J",       "C57BL/6J", "129S1/SvImJ", "NOD/ShiLtJ",
                            "NZO/HlLtJ", "CAST/EiJ", "PWK/PhJ",     "WSB/EiJ")))

Letter	Strain
A	A/J
B	C57BL/6J
C	129S1/SvImJ
D	NOD/ShiLtJ
E	NZO/HlLtJ
F	CAST/EiJ
G	PWK/PhJ
H	WSB/EiJ

Challenge

Now choose another phenotype in pheno_clin and perform the same steps.
1). Check the distribution. Does it need transforming? If so, how would you do it? 2). Are there any sex, batch, diet effects?
3). Run a genome scan with the genotype probabilities and kinship provided.
4). Plot the genome scan for this phenotype.
5). Find the peaks above LOD score of 6.

Solution

Replace <pheno name> with your choice of phenotype

#1).

hist(pheno_clin$<pheno name>)

pheno_clin$<pheno name>_log <- log(pheno_clin$<pheno name>)

hist(pheno_clin$<pheno name>_log)


#2).

tmp = pheno_clin %>%
        dplyr::select(mouse, sex, DOwave, diet_days, <pheno name>) %>%
        gather(phenotype, value, -mouse, -sex, -DOwave, -diet_days) %>%
        group_by(phenotype) %>%
        nest()
mod_fxn = function(df) {
  lm(value ~ sex + DOwave + diet_days, data = df)
}
tmp = tmp %>%
  mutate(model = map(data, mod_fxn)) %>%
  mutate(summ = map(model, tidy)) %>%
  unnest(summ) 

tmp

tmp %>%
  filter(term != "(Intercept)") %>%
  mutate(neg.log.p = -log10(p.value)) %>%
  ggplot(aes(term, neg.log.p)) +
    geom_point() +
    facet_wrap(~phenotype) +
    labs(title = "Significance of Sex, Wave & Diet Days on Phenotypes") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
rm(tmp)

#3).

qtl = scan1(genoprobs = probs, 
           pheno = pheno_clin[,"<pheno name>", drop = FALSE], 
        kinship = K, 
        addcovar = covar)

#4).

plot_scan1(x = qtl, map = map, lodcolumn = "<pheno name>")
abline(h = 6, col = 2, lwd = 2)

#5). 

peaks = find_peaks(scan1_output = qtl, map = map, 
               threshold = lod_threshold, 
               peakdrop = 4, 
               prob = 0.95)

Key Points

To understand the key steps running a QTL mapping analysis

previous episode

Gene Expression QTL Analysis

next episode