Review Mapping Steps
Last updated on 2024-11-21 | Edit this page
Estimated time: 60 minutes
Overview
Questions
- What are the steps involved in running QTL mapping in Diversity Outbred mice?
Objectives
- To understand the key steps running a QTL mapping analysis
Before we begin to run QTL mapping on gene expression data to find eQTLs, let’s review the main QTL mapping steps that we learnt in the QTL mapping course. As a reminder, we are using data from the Keller et al. paper that are freely available to download.
Make sure that you are in your main directory. If you’re not sure
where you are working right now, you can check your working directory
with getwd()
. If you are not in your main directory, run
setwd("code")
in the Console or Session -> Set Working
Directory -> Choose Directory in the RStudio menu to set your working
directory to the code directory.
Load Libraries
Below are the neccessary libraries that we require for this review. They are already installed on your machines so go ahead an load them using the following code:
R
library(tidyverse)
library(knitr)
library(broom)
library(qtl2)
Load Data
The data for this tutorial has been saved as several R binary files which contain several data objects, including phenotypes and mapping data as well as the genoprobs.
Load the data in now by running the following command in your new script.
R
##phenotypes
load("../data/attie_DO500_clinical.phenotypes.RData")
WARNING
Warning in readChar(con, 5L, useBytes = TRUE): cannot open compressed file
'../data/attie_DO500_clinical.phenotypes.RData', probable reason 'No such file
or directory'
ERROR
Error in readChar(con, 5L, useBytes = TRUE): cannot open the connection
R
##mapping data
load("../data/attie_DO500_mapping.data.RData")
WARNING
Warning in readChar(con, 5L, useBytes = TRUE): cannot open compressed file
'../data/attie_DO500_mapping.data.RData', probable reason 'No such file or
directory'
ERROR
Error in readChar(con, 5L, useBytes = TRUE): cannot open the connection
R
##genotype probabilities
probs = readRDS("../data/attie_DO500_genoprobs_v5.rds")
WARNING
Warning in gzfile(file, "rb"): cannot open compressed file
'../data/attie_DO500_genoprobs_v5.rds', probable reason 'No such file or
directory'
ERROR
Error in gzfile(file, "rb"): cannot open the connection
We loaded in a few data objects today. Check the Environment tab to
see what was loaded. You should see a phenotypes object called
pheno_clin
with 500 observations (in rows) of 32 variables
(in columns), an object called map (physical map), and an object called
probs (genotype probabilities).
Phenotypes
In this data set, we have 20 phenotypes for 500 Diversity Outbred
mice. pheno_clin
is a data frame containing the phenotype
data as well as covariate information. Click on the triangle to the left
of pheno_clin
in the Environment pane to view its contents.
Run names(pheno_clin)
to list the variables.
pheno_clin_dict
is the phenotype dictionary. This data
frame contains information on each variable in pheno_clin
,
including name
,short name
,
pheno_type
, formula
(if used) and
description
.
Since the paper is interested in type 2 diabetes and insulin
secretion, we will choose insulin tAUC
(area under the
curve (AUC) which was calculated without any correction for baseline
differences) for this review
Many statistical models, including the QTL mapping model in qtl2, expect that the incoming data will be normally distributed. You may use transformations such as log or square root to make your data more normally distributed. Here, we will log transform the data.
Here is a histogram of the untransformed data.
R
hist(pheno_clin$Ins_tAUC, main = "Insulin Area Under the Curve")
ERROR
Error: object 'pheno_clin' not found
Now, let’s apply the log()
function to this data to
correct the distribution.
R
pheno_clin$Ins_tAUC_log <- log(pheno_clin$Ins_tAUC)
ERROR
Error: object 'pheno_clin' not found
Now, let’s make a histogram of the log-transformed data.
R
hist(pheno_clin$Ins_tAUC_log, main = "insulin tAUC (log-transformed)")
ERROR
Error: object 'pheno_clin' not found
This looks much better!
The Marker Map
The marker map for each chromosome is stored in the map
object. This is used to plot the LOD scores calculated at each marker
during QTL mapping. Each list element is a numeric vector with each
marker position in megabases (Mb). Here we are using the 69K grid marker
file. Often when there are numerous genotype arrays used in a study, we
interoplate all to a 69k grid file so we are able to combine all samples
across different array types.
Look at the structure of map
in the Environment tab by
clicking the triangle to the left or by running str(map)
in
the Console.
Genotype probabilities
Each element of probs
is a 3 dimensional array
containing the founder allele dosages for each sample at each marker on
one chromosome. These are the 8 state allelle probabilities (not 32)
using the 69k marker grid for same 500 DO mice that also have clinical
phenotypes. We have already calculated genotype probabilities for you,
so you can skip the step for calculating
genotype probabilities and the optional step for calculating allele
probabilities.
Next, we look at the dimensions of probs
for chromosome
1:
R
dim(probs[[1]])
ERROR
Error: object 'probs' not found
R
plot_genoprob(probs, map, ind = 1, chr = 1)
ERROR
Error: object 'probs' not found
In the plot above, the founder contributions, which range between 0 and 1, are colored from white (= 0) to black (= 1.0). A value of ~0.5 is grey. The markers are on the X-axis and the eight founders (denoted by the letters A through H) on the Y-axis. Starting at the left, we see that this sample has genotype GH because the rows for G & H are grey, indicating values of 0.5 for both alleles. Moving along the genome to the right, the genotype becomes HH where where the row is black indicating a value of 1.0. This is followed by CD, DD, DG, AD, AH, CE, etc. The values at each marker sum to 1.0.
Kinship Matrix
The kinship matrix has already been calculated and loaded in above
R
n_samples <- 50
heatmap(K[[1]][1:n_samples, 1:n_samples])
ERROR
Error: object 'K' not found
The figure above shows kinship between all pairs of samples. Light yellow indicates low kinship and dark red indicates higher kinship. Orange values indicate varying levels of kinship between 0 and 1. The dark red diagonal of the matrix indicates that each sample is identical to itself. The orange blocks along the diagonal may indicate close relatives (i.e. siblings or cousins).
Covariates
Next, we need to create additive covariates that will be used in the
mapping model. First, we need to see which covariates are significant.
In the data set, we have sex
, DOwave
(Wave
(i.e., batch) of DO mice) and diet_days
(number of days on
diet) to test whether there are any gender, batch or diet effects.
First we are going to select out the covariates and phenotype we want
from pheno_clin
data frame. Then reformat these selected
variables into a long format (using the gather
command)
grouped by the phenotypes (in this case, we only have
Ins_tAUC_log
)
R
### Tests for sex, wave and diet_days.
tmp = pheno_clin %>%
dplyr::select(mouse, sex, DOwave, diet_days, Ins_tAUC_log) %>%
gather(phenotype, value, -mouse, -sex, -DOwave, -diet_days) %>%
group_by(phenotype) %>%
nest()
ERROR
Error: object 'pheno_clin' not found
Let’s create a linear model function that will regress the covariates and the phenotype.
R
mod_fxn = function(df) {
lm(value ~ sex + DOwave + diet_days, data = df)
}
Now let’s apply that function to the data object tmp
that we created above.
R
tmp = tmp %>%
mutate(model = map(data, mod_fxn)) %>%
mutate(summ = map(model, tidy)) %>%
unnest(summ)
ERROR
Error: object 'tmp' not found
R
tmp
ERROR
Error: object 'tmp' not found
R
tmp %>%
filter(term != "(Intercept)") %>%
mutate(neg.log.p = -log10(p.value)) %>%
ggplot(aes(term, neg.log.p)) +
geom_point() +
facet_wrap(~phenotype) +
labs(title = "Significance of Sex, Wave & Diet Days on Phenotypes") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
rm(tmp)
ERROR
Error: object 'tmp' not found
We can see that sex and DOwave (especially the third batch) are significant. Here DOwave is the group or batch number as not all mice are in the experiment at the same time. Because of this, we now have to correct for it.
R
# convert sex and DO wave (batch) to factors
pheno_clin$sex = factor(pheno_clin$sex)
ERROR
Error: object 'pheno_clin' not found
R
pheno_clin$DOwave = factor(pheno_clin$DOwave)
ERROR
Error: object 'pheno_clin' not found
R
covar = model.matrix(~sex + DOwave, data = pheno_clin)[,-1]
ERROR
Error: object 'pheno_clin' not found
REMEMBER: the sample IDs must be in the rownames of
pheno
, addcovar
, genoprobs
and
K
. qtl2
uses the sample IDs to align the
samples between objects.
Performing a genome scan
At each marker on the genotyping array, we will fit a model that regresses the phenotype (insulin secretion AUC) on covariates and the founder allele proportions. Note that this model will give us an estimate of the effect of each founder allele at each marker. There are eight founder strains that contributed to the DO, so we will get eight founder allele effects.
[Permutations]
First, we need to work out the signifcance level. Let’s find the significance level for 0.1, 0.05 and 0.01.
R
operm <- scan1perm(genoprobs = probs,
pheno = pheno_clin[,"Ins_tAUC_log", drop = FALSE],
addcovar=covar,
n_perm=1000)
Note DO NOT RUN THIS (it will take too long). Instead, I have run it earlier and will load it in here. We will also perform a summary to find the summary level for 0.1, 0.05 and 0.01
R
load("../data/operm_Ins_tAUC_log_1000.Rdata")
summary(operm,alpha=c(0.1, 0.05, 0.01))
Genome Scan
R
qtl = scan1(genoprobs = probs,
pheno = pheno_clin[,"Ins_tAUC_log", drop = FALSE],
kinship = K,
addcovar = covar)
Next, we plot the genome scan.
R
plot_scan1(x = qtl,
map = map,
lodcolumn = "Ins_tAUC_log")
ERROR
Error: object 'qtl' not found
R
add_threshold(map, summary(operm,alpha=0.1), col = 'purple')
ERROR
Error in .rangeNum(..., na.rm = na.rm, finite = finite, isNumeric = is.numeric): argument is missing, with no default
R
add_threshold(map, summary(operm, alpha=0.05), col = 'red')
ERROR
Error in .rangeNum(..., na.rm = na.rm, finite = finite, isNumeric = is.numeric): argument is missing, with no default
R
add_threshold(map, summary(operm, alpha=0.01), col = 'blue')
ERROR
Error in .rangeNum(..., na.rm = na.rm, finite = finite, isNumeric = is.numeric): argument is missing, with no default
We can see a very strong peak on chromosome 11 with no other distibuishable peaks.
Finding LOD peaks
We can find all of the peaks above the significance threshold using the find_peaks function.
The support interval is determined using the Bayesian Credible
Interval and represents the region most likely to contain the
causative polymorphism(s). We can obtain this interval by adding a
prob
argument to find_peaks.
We pass in a value of 0.95
to request a support interval
that contains the causal SNP 95% of the time.
In case there are multiple peaks are found on a chromosome, the
peakdrop
argument allows us to the find the peak which has
a certain LOD score drop between other peaks.
Let’s find LOD peaks. Here we are choosing to find peaks with a LOD score greater than 6.
R
lod_threshold = summary(operm, alpha=0.01)
ERROR
Error: object 'operm' not found
R
peaks = find_peaks(scan1_output = qtl, map = map,
threshold = lod_threshold,
peakdrop = 4,
prob = 0.95)
ERROR
Error: object 'qtl' not found
R
kable(peaks %>% dplyr::select (-lodindex) %>%
arrange(chr, pos), caption = "Phenotype QTL Peaks with LOD >= 6")
ERROR
Error: object 'peaks' not found
QTL effects
R
g <- maxmarg(probs,
map=map,
chr=peaks$chr[1],
pos=peaks$pos[1],
minprob = 0.4,
return_char=TRUE)
ERROR
Error: object 'probs' not found
R
plot_pxg(g,
pheno_clin[,peaks$lodcolumn[1]],
ylab=peaks$lodcolumn[1],
sort=FALSE)
ERROR
Error: object 'g' not found
R
blup <- scan1blup(genoprobs=probs[,peaks$chr[1]],
pheno=pheno_clin[,peaks$lodcolumn[1], drop=FALSE])
ERROR
Error: object 'probs' not found
R
plot_coefCC(blup,
map=map,
columns=1:8,
bgcolor="gray95",
legend="bottomleft",
scan1_output = qtl
)
ERROR
Error: object 'blup' not found
Challenge 1:
Now choose another phenotype in pheno_clin
and perform
the same steps.
1). Check the distribution. Does it need transforming? If so, how would
you do it? 2). Are there any sex, batch, diet effects?
3). Run a genome scan with the genotype probabilities and kinship
provided.
4). Plot the genome scan for this phenotype.
5). Find the peaks above LOD score of 6.
R
#1).
hist(pheno_clin$<pheno name>)
pheno_clin$<pheno name>_log <- log(pheno_clin$<pheno name>)
hist(pheno_clin$<pheno name>_log)
#2).
tmp = pheno_clin %>%
dplyr::select(mouse, sex, DOwave, diet_days, <pheno name>) %>%
gather(phenotype, value, -mouse, -sex, -DOwave, -diet_days) %>%
group_by(phenotype) %>%
nest()
mod_fxn = function(df) {
lm(value ~ sex + DOwave + diet_days, data = df)
}
tmp = tmp %>%
mutate(model = map(data, mod_fxn)) %>%
mutate(summ = map(model, tidy)) %>%
unnest(summ)
tmp
tmp %>%
filter(term != "(Intercept)") %>%
mutate(neg.log.p = -log10(p.value)) %>%
ggplot(aes(term, neg.log.p)) +
geom_point() +
facet_wrap(~phenotype) +
labs(title = "Significance of Sex, Wave & Diet Days on Phenotypes") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
rm(tmp)
#3).
qtl = scan1(genoprobs = probs,
pheno = pheno_clin[,"<pheno name>", drop = FALSE],
kinship = K,
addcovar = covar)
#4).
plot_scan1(x = qtl, map = map, lodcolumn = "<pheno name>")
abline(h = 6, col = 2, lwd = 2)
#5).
peaks = find_peaks(scan1_output = qtl, map = map,
threshold = lod_threshold,
peakdrop = 4,
prob = 0.95)
Key Points
- Use
.md
files for episodes when you want static content - Use
.Rmd
files for episodes when you need to generate output - Run
sandpaper::check_lesson()
to identify any issues with your lesson - Run
sandpaper::build_lesson()
to preview your lesson locally