Content from Introduction
Last updated on 2024-11-12 | Edit this page
Estimated time: 12 minutes
Overview
Questions
- What are expression quantitative trait loci (eQTL)?
- How are eQTL used in genetic studies?
Objectives
- Describe how an expression quantitative trait locus (eQTL) impacts gene expression.
- Describe how eQTL are used in genetic studies.
Introduction
Differences in disease risk between individuals are often caused by genetic variants. Identifying the effects of genetic variants is key to understanding disease phenotypes and their underlying biology. The effects of variants in many single-gene disorders, such as cystic fibrosis, are generally well-characterized and their disease biology well understood. For example, in cystic fibrosis, mutations in the coding region of the CFTR gene alter the three-dimensional structure of the chloride channel proteins in epithelial cells, affecting not only chloride transport, but also sodium and potassium transport in the lungs, pancreas and skin. The path from gene mutation to altered protein to disease phenotype is relatively simple and well understood.
The most common human disorders, however, involve many genes interacting with each other and with the environment, a far more complicated path to follow than the path from a single gene mutation to its protein to a disease phenotype. Cardiovascular disease, Alzheimer’s disease, arthritis, diabetes and cancer involve a complex interplay of genes with environment, and their mechanisms are not well understood. One method of understanding the relationship between genetic variants and disease is a “Genome-wide association study (GWAS), which associates genetic variants with disease traits. It is tempting to think that these genetic variants would fall in coding regions. However, most GWAS variants for common diseases like diabetes are located in non-coding regions of the genome. These variants are therefore likely to fall in regulatory sequences which are involved in gene regulation.
Gene regulation controls the quantity, timing and locale of gene expression. Analyzing the association between gene expression and genetic variants is known as expression quantitative trait locus (eQTL) mapping. eQTL mapping searches for associations between the expression of one or more genes and a genetic locus. Specifically, genetic variants underlying eQTL peak explain some of the variation in gene expression levels. eQTL studies can reveal the architecture of quantitative traits, connect DNA sequence variation to phenotypic variation, and shed light on transcriptional regulation and regulatory variation. Traditional analytic techniques like linkage and association mapping can be applied to thousands of gene expression traits (transcripts) in eQTL analysis, such that gene expression can be mapped in much the same way as a physiological phenotype like blood pressure or heart rate. Joining gene expression and physiological phenotypes with genetic variation can identify genes with variants affecting disease phenotypes.
To the simple diagram above we’ll add two more details. Non-coding SNPs can regulate gene expression from nearby locatons on the same chromosome (in cis):
SNPs that affect gene expression from afar, often from a different chromosome from the gene that they regulate are called distal (trans) regulators.
In this lesson we revisit genetic mapping of quantitative traits and apply its methods to gene expression. The examples are from Genetic Drivers of Pancreatic Islet Function by Keller, et al. This study offers supporting evidence for type 2 diabetes-associated loci in human GWAS, most of which affect pancreatic islet function. The study assessed pancreatic islet gene expression in Diversity Outbred mice on either a regular chow or high-fat, high-sugar diet. Islet mRNA abundance was quantified and analyzed, and the study identified more than 18,000 eQTL peaks.
Key Points
- An expression quantitative trait locus (eQTL) explains part of the variation in gene expression.
- Traditional linkage and association mapping can be applied to gene expression traits (transcripts).
- Genetic variants, such as single nucleotide polymorphisms (SNPs), that underlie eQTL illuminate transcriptional regulation and variation.
Content from Genetic Drivers of Pancreatic Islet Function
Last updated on 2024-11-12 | Edit this page
Estimated time: 30 minutes
Overview
Questions
- What is the hypothesis of an example eQTL study?
Objectives
- Describe an example eQTL study in Diversity Outbred mice.
- State the hypothesis from an example eQTL study in Diversity Outbred mice.
Genome-wide association studies (GWAS) often identify variants in non-coding regions of the genome, indicating that regulation of gene expression predominates in common diseases like type II diabetes. Most of the more than 100 genetic loci associated with type II diabetes affect the function of pancreatic islets, which produce insulin for regulating blood glucose levels. Susceptibility to type II diabetes (T2D) increases with obesity, such that T2D-associated genetic loci operate mainly under conditions of obesity (See Keller, Mark P et al. “Genetic Drivers of Pancreatic Islet Function.” Genetics vol. 209,1 (2018): 335-356). Like most GWAS loci, the T2D-associated genetic loci identified from genome-wide association studies (GWAS) have very small effect sizes and odds ratios just slightly more than 1.
This study explored islet gene expression in diabetes. The authors hypothesized that gene expression changes in response to dietary challenge would reveal signaling pathways involved in stress responses. The expression of many genes often map to the same locus, indicating that expression of these genes is controlled in common. If their mRNAs encode proteins with common physiological functions, the function of the controlling gene(s) is revealed. Variation in expression of the controlling gene(s), rather than a genetic variant, can be tested as an immediate cause of a disease-related phenotype.
In this study, Diversity Outbred (DO) mice were fed a high-fat, high-sugar diet as a stressor, sensitizing the mice to develop diabetic traits. Body weight and plasma glucose, insulin, and triglyceride measurements were taken biweekly. Food intake could be measured since animals were individually housed. A glucose tolerance test at 18 weeks of age provided measurement of dynamic glucose and insulin changes at 5, 15, 30, 60 and 120 minutes after glucose administration. Area under the curve (AUC) was determined from these time points for both plasma glucose and insulin levels. Homeostatic model assessment (HOMA) of insulin resistance (IR) and pancreatic islet function (B) were determined after the glucose tolerance test was given. Islet cells were isolated from pancreas, and RNA extracted and libraries constructed from isolated RNA for gene expression measurements.
Genome scans were performed with the leave-one-chromosome-out (LOCO) method for kinship correction. Sex and experimental cohort (DO wave) were used as covariates. The results of one scan for insulin area under the curve (AUC) is shown below with a strong peak on chromosome 11. In this lesson, we will look into genes located under this peak.
Key Points
- Use
.md
files for episodes when you want static content - Use
.Rmd
files for episodes when you need to generate output - Run
sandpaper::check_lesson()
to identify any issues with your lesson - Run
sandpaper::build_lesson()
to preview your lesson locally
Content from Load and explore the data
Last updated on 2024-11-12 | Edit this page
Estimated time: 45 minutes
Overview
Questions
- What data are required for eqtl mapping?
Objectives
- To provide an example and exploration of data used for eqtl mapping.
Load the libraries.
R
library(ggbeeswarm)
library(tidyverse)
library(knitr)
library(corrplot)
# the following analysis is derived from supplementary
# File S1 Attie_eQTL_paper_physiology.Rmd
# by Daniel Gatti. See Data Dryad entry for more information.
Physiological Phenotypes
The complete data used in these analyses are available from Data Dryad.
Load in the clinical phenotypes.
R
# load the data
load("../data/attie_DO500_clinical.phenotypes.RData")
WARNING
Warning in readChar(con, 5L, useBytes = TRUE): cannot open compressed file
'../data/attie_DO500_clinical.phenotypes.RData', probable reason 'No such file
or directory'
ERROR
Error in readChar(con, 5L, useBytes = TRUE): cannot open the connection
See the data dictionary to see a description of each of these phenotypes. You can also view a table of the data dictionary.
R
pheno_clin_dict %>%
select(description, formula) %>%
kable()
ERROR
Error: object 'pheno_clin_dict' not found
Phenotype Distributions
Boxplots are a great way to view the distribution of the data and to identify any outliers. We will be using the total area under the curve of insulin from the glucose tolerance test (Ins_tAUC). We will also log-transform the data using the [scale_y_log10()][scale_y_log10] function.
R
# plot Insulin on a log 10 scale
ggplot(pheno_clin, aes(sex, Ins_tAUC)) +
geom_boxplot() +
scale_y_log10() +
labs(title = "Insulin tAUC", y = "Insulin tAUC")
ERROR
Error: object 'pheno_clin' not found
Another visualization that has become popular is the [Violin Plot][https://en.wikipedia.org/wiki/Violin_plot]. We can
create one using ggplot’s [geom_violin][https://ggplot2.tidyverse.org/reference/geom_violin.html].
Whereas the boxplot automatically adds the median, we must tell
geom_violin()
which quantiles that we want to draw using
the argument draw_quantiles = c(0.25, 0.5, 0.75)
. We have
also overlaid the data points using ggbeeswarm’s [geom_beeswarm][https://www.rdocumentation.org/packages/ggbeeswarm/versions/0.5.3/topics/geom_beeswarm].
We have told geom_beeswarm()
to plot the points using the
argument alpha = 0.1
. The alpha
argument
ranges between 0 (completely transparent) to 1 (completely opaque). A
value of 0.1 means mostly transparent.
R
# plot Insulin on a log 10 scale
ggplot(pheno_clin, aes(sex, Ins_tAUC)) +
geom_violin(draw_quantiles = c(0.25, 0.5, 0.75)) +
geom_beeswarm(alpha = 0.1) +
scale_y_log10() +
labs(title = "Insulin tAUC", y = "Insulin tAUC")
ERROR
Error: object 'pheno_clin' not found
Challenge 1:
How many orders of magnitude (powers of 10) does Insulin tAUC span?
Insulin tAUC spans three orders of magnitude, from near 10 to over 1000.
Challenge 2:
Which sex has higher median Insulin tAUC values?
Males have higher Insulin tAUC than females.
Both of the boxplot and the violin plot are useful visualizations which you can use to get some sense of the distribution of your data.
Quality Control of Data
Many statistical tests rely upon the data having a “normal” (or Gaussian) distribution. Many biological phenotypes do not follow this distribution and must be transformed before analysis. This is why we log-transformed the data in the plots above.
While we can “eyeball” the distributions in the violin plot, it would be better to use a “quantile-quantile” plot.
R
pheno_clin %>%
ggplot(aes(sample = Ins_tAUC)) +
stat_qq() +
geom_qq_line() +
facet_wrap(~sex)
ERROR
Error: object 'pheno_clin' not found
In these plots, the “quantiles” of the normal distribution are plotted on the X-axis and the data are plotted on the Y-axis. The line indicates the quantiles that would be followed by a normal distribution. The untransformed data do not follow a normal distribution because the points are far from the line.
Next, we will loag-transform the data and then create a quantile-quantile plot.
R
pheno_clin %>%
mutate(Ins_tAUC = log(Ins_tAUC)) %>%
ggplot(aes(sample = Ins_tAUC)) +
stat_qq() +
geom_qq_line() +
facet_wrap(~sex)
ERROR
Error: object 'pheno_clin' not found
Challenge 3:
Does the log transformation make the data more normally distributed? Explain your answer.
Yes. The log transformation makes the data more normally distributed because the data points follow the normality line more closely.
Challenge 4:
Do any data points look suspicious to you? Explain your answer.
The data points that deviate from the normality line would be worth investigating. All data deviates somewhat from normality, but the three lowest points in the male data plot would be worth investigating. They may be real, but there may also have been mishap in the assay.
Another way to identify outliers is to standardize the data and look for data points that are more than four standard deviations from the mean.
To do this, we will log transform and standardize Insulin tAUC.
R
ins_tauc = pheno_clin %>%
select(mouse, sex, Ins_tAUC) %>%
group_by(sex) %>%
mutate(Ins_tAUC = log(Ins_tAUC),
Ins_tAUC = scale(Ins_tAUC))
ERROR
Error: object 'pheno_clin' not found
R
ins_tauc %>%
ggplot(aes(x = sex, y = Ins_tAUC)) +
geom_boxplot() +
geom_hline(aes(yintercept = -4), color = 'red') +
geom_hline(aes(yintercept = 4), color = 'red') +
labs(title = "Distribution of Standardized Ins_tAUC")
ERROR
Error: object 'ins_tauc' not found
There are no data points outside of the four standard deviation limits.
Gene Expression Phenotypes
R
# load the expression data along with annotations and metadata
load("../data/dataset.islet.rnaseq.RData")
WARNING
Warning in readChar(con, 5L, useBytes = TRUE): cannot open compressed file
'../data/dataset.islet.rnaseq.RData', probable reason 'No such file or
directory'
ERROR
Error in readChar(con, 5L, useBytes = TRUE): cannot open the connection
R
names(dataset.islet.rnaseq)
ERROR
Error: object 'dataset.islet.rnaseq' not found
R
# look at gene annotations
dataset.islet.rnaseq$annots[1:6,]
ERROR
Error: object 'dataset.islet.rnaseq' not found
R
# look at raw counts
dataset.islet.rnaseq$raw[1:6,1:6]
ERROR
Error: object 'dataset.islet.rnaseq' not found
R
# look at sample metadata
# summarize mouse sex, birth dates and DO waves
table(dataset.islet.rnaseq$samples[, c("sex", "birthdate")])
ERROR
Error: object 'dataset.islet.rnaseq' not found
R
table(dataset.islet.rnaseq$samples[, c("sex", "DOwave")])
ERROR
Error: object 'dataset.islet.rnaseq' not found
In order to make reasonable gene comparisons between samples, the count data need to be normalized. In the quantile-quantile (Q-Q) plot below, count data for the first gene are plotted over a diagonal line tracing a normal distribution for those counts. Notice that most of the count data values lie off of this line, indicating that these gene counts are not normally distributed.
ERROR
Error: object 'dataset.islet.rnaseq' not found
Q-Q plots for the first six genes show that count data for these genes are not normally distributed. They are also not on the same scale. The y-axis values for each subplot range to 20,000 counts in the first subplot, 250 in the second, 90 in the third, and so on.
R
dataset.islet.rnaseq$raw %>%
as.data.frame() %>%
select(ENSMUSG00000000001:ENSMUSG00000000058) %>%
pivot_longer(cols = everything(), names_to = 'gene', values_to = 'value') %>%
ggplot(aes(sample = value)) +
stat_qq() +
geom_qq_line() +
facet_wrap(~gene, scales = 'free') +
labs(title = 'Count distribution for six genes',
xlab = 'Normal percentiles', y = 'Count percentiles')
ERROR
Error: object 'dataset.islet.rnaseq' not found
Q-Q plots of the normalized expression data for the first six genes show that the data values match the diagonal line well, meaning that they are now normally distributed. They are also all on the same scale now as well.
R
dataset.islet.rnaseq$expr %>%
as.data.frame() %>%
select(ENSMUSG00000000001:ENSMUSG00000000058) %>%
pivot_longer(cols = everything(), names_to = 'gene', values_to = 'value') %>%
ggplot(aes(sample = value)) +
stat_qq() +
geom_qq_line() +
facet_wrap(~gene, scales = 'free') +
labs(title = 'Normalized count distribution for six genes',
xlab = 'Normal percentiles', y = 'Count percentiles')
ERROR
Error: object 'dataset.islet.rnaseq' not found
Boxplots of raw counts for six example genes are shown at left below. Notice that the median count values (horizontal black bar in each boxplot) are not comparable between the genes because the counts are not on the same scale. At right, boxplots for the same genes show normalized count data on the same scale.
R
raw = dataset.islet.rnaseq$raw %>%
as.data.frame() %>%
select(ENSMUSG00000000001:ENSMUSG00000000058) %>%
pivot_longer(cols = everything(), names_to = 'gene', values_to = 'value') %>%
mutate(type = 'raw')
ERROR
Error: object 'dataset.islet.rnaseq' not found
R
norm = dataset.islet.rnaseq$expr %>%
as.data.frame() %>%
select(ENSMUSG00000000001:ENSMUSG00000000058) %>%
pivot_longer(cols = everything(), names_to = 'gene', values_to = 'value') %>%
mutate(type = 'normalized')
ERROR
Error: object 'dataset.islet.rnaseq' not found
R
bind_rows(raw, norm) %>%
mutate(type = factor(type, levels = c('raw', 'normalized'))) %>%
ggplot(aes(gene, value)) +
geom_boxplot() +
facet_wrap(~type, scales = 'free') +
labs(title = 'Count distributions for example genes') +
theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 1))
ERROR
Error in `bind_rows()` at dplyr/R/mutate.R:146:3:
! Argument 1 must be a data frame or a named atomic vector.
R
rm(raw, norm)
WARNING
Warning in rm(raw, norm): object 'raw' not found
WARNING
Warning in rm(raw, norm): object 'norm' not found
Have a look at the first several rows of normalized count data.
R
# look at normalized counts
dataset.islet.rnaseq$expr[1:6,1:6]
ERROR
Error: object 'dataset.islet.rnaseq' not found
The expression data loaded provides LOD peaks for the eQTL analyses performed in this study. As a preview of what you will be doing next, look at the first several rows of LOD peak values and extract the LOD peaks for chromosome 11.
R
# look at LOD peaks
dataset.islet.rnaseq$lod.peaks[1:6,]
ERROR
Error: object 'dataset.islet.rnaseq' not found
R
# look at chromosome 11 LOD peaks
chr11_peaks <- dataset.islet.rnaseq$annots %>%
select(gene_id, chr) %>%
filter(chr=="11") %>%
left_join(dataset.islet.rnaseq$lod.peaks,
by = c("chr" = "chrom", "gene_id" = "annot.id"))
ERROR
Error: object 'dataset.islet.rnaseq' not found
R
# look at the first several rows of chromosome 11 peaks
head(chr11_peaks)
ERROR
Error: object 'chr11_peaks' not found
R
# how many rows?
dim(chr11_peaks)
ERROR
Error: object 'chr11_peaks' not found
R
# how many rows have LOD scores?
chr11_peaks %>% filter(!is.na(lod)) %>% dim()
ERROR
Error: object 'chr11_peaks' not found
R
# sort chromosome 11 peaks by LOD score
chr11_peaks %>% arrange(desc(lod)) %>% head()
ERROR
Error: object 'chr11_peaks' not found
R
# range of LOD scores and positions
range(chr11_peaks$lod, na.rm = TRUE)
ERROR
Error: object 'chr11_peaks' not found
R
range(chr11_peaks$pos, na.rm = TRUE)
ERROR
Error: object 'chr11_peaks' not found
R
# view LOD scores by position
chr11_peaks %>% arrange(desc(lod)) %>%
ggplot(aes(pos, lod)) + geom_point()
ERROR
Error: object 'chr11_peaks' not found
Key Points
- Use
.md
files for episodes when you want static content - Use
.Rmd
files for episodes when you need to generate output - Run
sandpaper::check_lesson()
to identify any issues with your lesson - Run
sandpaper::build_lesson()
to preview your lesson locally
Content from Review Mapping Steps
Last updated on 2024-11-12 | Edit this page
Estimated time: 60 minutes
Overview
Questions
- What are the steps involved in running QTL mapping in Diversity Outbred mice?
Objectives
- To understand the key steps running a QTL mapping analysis
Before we begin to run QTL mapping on gene expression data to find eQTLs, let’s review the main QTL mapping steps that we learnt in the QTL mapping course. As a reminder, we are using data from the Keller et al. paper that are freely available to download.
Make sure that you are in your main directory. If you’re not sure
where you are working right now, you can check your working directory
with getwd()
. If you are not in your main directory, run
setwd("code")
in the Console or Session -> Set Working
Directory -> Choose Directory in the RStudio menu to set your working
directory to the code directory.
Load Libraries
Below are the neccessary libraries that we require for this review. They are already installed on your machines so go ahead an load them using the following code:
R
library(tidyverse)
library(knitr)
library(broom)
library(qtl2)
Load Data
The data for this tutorial has been saved as several R binary files which contain several data objects, including phenotypes and mapping data as well as the genoprobs.
Load the data in now by running the following command in your new script.
R
##phenotypes
load("../data/attie_DO500_clinical.phenotypes.RData")
WARNING
Warning in readChar(con, 5L, useBytes = TRUE): cannot open compressed file
'../data/attie_DO500_clinical.phenotypes.RData', probable reason 'No such file
or directory'
ERROR
Error in readChar(con, 5L, useBytes = TRUE): cannot open the connection
R
##mapping data
load("../data/attie_DO500_mapping.data.RData")
WARNING
Warning in readChar(con, 5L, useBytes = TRUE): cannot open compressed file
'../data/attie_DO500_mapping.data.RData', probable reason 'No such file or
directory'
ERROR
Error in readChar(con, 5L, useBytes = TRUE): cannot open the connection
R
##genotype probabilities
probs = readRDS("../data/attie_DO500_genoprobs_v5.rds")
WARNING
Warning in gzfile(file, "rb"): cannot open compressed file
'../data/attie_DO500_genoprobs_v5.rds', probable reason 'No such file or
directory'
ERROR
Error in gzfile(file, "rb"): cannot open the connection
We loaded in a few data objects today. Check the Environment tab to
see what was loaded. You should see a phenotypes object called
pheno_clin
with 500 observations (in rows) of 32 variables
(in columns), an object called map (physical map), and an object called
probs (genotype probabilities).
Phenotypes
In this data set, we have 20 phenotypes for 500 Diversity Outbred
mice. pheno_clin
is a data frame containing the phenotype
data as well as covariate information. Click on the triangle to the left
of pheno_clin
in the Environment pane to view its contents.
Run names(pheno_clin)
to list the variables.
pheno_clin_dict
is the phenotype dictionary. This data
frame contains information on each variable in pheno_clin
,
including name
,short name
,
pheno_type
, formula
(if used) and
description
.
Since the paper is interested in type 2 diabetes and insulin
secretion, we will choose insulin tAUC
(area under the
curve (AUC) which was calculated without any correction for baseline
differences) for this review
Many statistical models, including the QTL mapping model in qtl2, expect that the incoming data will be normally distributed. You may use transformations such as log or square root to make your data more normally distributed. Here, we will log transform the data.
Here is a histogram of the untransformed data.
R
hist(pheno_clin$Ins_tAUC, main = "Insulin Area Under the Curve")
ERROR
Error: object 'pheno_clin' not found
Now, let’s apply the log()
function to this data to
correct the distribution.
R
pheno_clin$Ins_tAUC_log <- log(pheno_clin$Ins_tAUC)
ERROR
Error: object 'pheno_clin' not found
Now, let’s make a histogram of the log-transformed data.
R
hist(pheno_clin$Ins_tAUC_log, main = "insulin tAUC (log-transformed)")
ERROR
Error: object 'pheno_clin' not found
This looks much better!
The Marker Map
The marker map for each chromosome is stored in the map
object. This is used to plot the LOD scores calculated at each marker
during QTL mapping. Each list element is a numeric vector with each
marker position in megabases (Mb). Here we are using the 69K grid marker
file. Often when there are numerous genotype arrays used in a study, we
interoplate all to a 69k grid file so we are able to combine all samples
across different array types.
Look at the structure of map
in the Environment tab by
clicking the triangle to the left or by running str(map)
in
the Console.
Genotype probabilities
Each element of probs
is a 3 dimensional array
containing the founder allele dosages for each sample at each marker on
one chromosome. These are the 8 state allelle probabilities (not 32)
using the 69k marker grid for same 500 DO mice that also have clinical
phenotypes. We have already calculated genotype probabilities for you,
so you can skip the step for calculating
genotype probabilities and the optional step for calculating allele
probabilities.
Next, we look at the dimensions of probs
for chromosome
1:
R
dim(probs[[1]])
ERROR
Error: object 'probs' not found
R
plot_genoprob(probs, map, ind = 1, chr = 1)
ERROR
Error: object 'probs' not found
In the plot above, the founder contributions, which range between 0 and 1, are colored from white (= 0) to black (= 1.0). A value of ~0.5 is grey. The markers are on the X-axis and the eight founders (denoted by the letters A through H) on the Y-axis. Starting at the left, we see that this sample has genotype GH because the rows for G & H are grey, indicating values of 0.5 for both alleles. Moving along the genome to the right, the genotype becomes HH where where the row is black indicating a value of 1.0. This is followed by CD, DD, DG, AD, AH, CE, etc. The values at each marker sum to 1.0.
Kinship Matrix
The kinship matrix has already been calculated and loaded in above
R
n_samples <- 50
heatmap(K[[1]][1:n_samples, 1:n_samples])
ERROR
Error: object 'K' not found
The figure above shows kinship between all pairs of samples. Light yellow indicates low kinship and dark red indicates higher kinship. Orange values indicate varying levels of kinship between 0 and 1. The dark red diagonal of the matrix indicates that each sample is identical to itself. The orange blocks along the diagonal may indicate close relatives (i.e. siblings or cousins).
Covariates
Next, we need to create additive covariates that will be used in the
mapping model. First, we need to see which covariates are significant.
In the data set, we have sex
, DOwave
(Wave
(i.e., batch) of DO mice) and diet_days
(number of days on
diet) to test whether there are any gender, batch or diet effects.
First we are going to select out the covariates and phenotype we want
from pheno_clin
data frame. Then reformat these selected
variables into a long format (using the gather
command)
grouped by the phenotypes (in this case, we only have
Ins_tAUC_log
)
R
### Tests for sex, wave and diet_days.
tmp = pheno_clin %>%
dplyr::select(mouse, sex, DOwave, diet_days, Ins_tAUC_log) %>%
gather(phenotype, value, -mouse, -sex, -DOwave, -diet_days) %>%
group_by(phenotype) %>%
nest()
ERROR
Error: object 'pheno_clin' not found
Let’s create a linear model function that will regress the covariates and the phenotype.
R
mod_fxn = function(df) {
lm(value ~ sex + DOwave + diet_days, data = df)
}
Now let’s apply that function to the data object tmp
that we created above.
R
tmp = tmp %>%
mutate(model = map(data, mod_fxn)) %>%
mutate(summ = map(model, tidy)) %>%
unnest(summ)
ERROR
Error: object 'tmp' not found
R
tmp
ERROR
Error: object 'tmp' not found
R
tmp %>%
filter(term != "(Intercept)") %>%
mutate(neg.log.p = -log10(p.value)) %>%
ggplot(aes(term, neg.log.p)) +
geom_point() +
facet_wrap(~phenotype) +
labs(title = "Significance of Sex, Wave & Diet Days on Phenotypes") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
rm(tmp)
ERROR
Error: object 'tmp' not found
We can see that sex and DOwave (especially the third batch) are significant. Here DOwave is the group or batch number as not all mice are in the experiment at the same time. Because of this, we now have to correct for it.
R
# convert sex and DO wave (batch) to factors
pheno_clin$sex = factor(pheno_clin$sex)
ERROR
Error: object 'pheno_clin' not found
R
pheno_clin$DOwave = factor(pheno_clin$DOwave)
ERROR
Error: object 'pheno_clin' not found
R
covar = model.matrix(~sex + DOwave, data = pheno_clin)[,-1]
ERROR
Error: object 'pheno_clin' not found
REMEMBER: the sample IDs must be in the rownames of
pheno
, addcovar
, genoprobs
and
K
. qtl2
uses the sample IDs to align the
samples between objects.
Performing a genome scan
At each marker on the genotyping array, we will fit a model that regresses the phenotype (insulin secretion AUC) on covariates and the founder allele proportions. Note that this model will give us an estimate of the effect of each founder allele at each marker. There are eight founder strains that contributed to the DO, so we will get eight founder allele effects.
[Permutations]
First, we need to work out the signifcance level. Let’s find the significance level for 0.1, 0.05 and 0.01.
R
operm <- scan1perm(genoprobs = probs,
pheno = pheno_clin[,"Ins_tAUC_log", drop = FALSE],
addcovar=covar,
n_perm=1000)
Note DO NOT RUN THIS (it will take too long). Instead, I have run it earlier and will load it in here. We will also perform a summary to find the summary level for 0.1, 0.05 and 0.01
R
load("../data/operm_Ins_tAUC_log_1000.Rdata")
summary(operm,alpha=c(0.1, 0.05, 0.01))
Genome Scan
R
qtl = scan1(genoprobs = probs,
pheno = pheno_clin[,"Ins_tAUC_log", drop = FALSE],
kinship = K,
addcovar = covar)
Next, we plot the genome scan.
R
plot_scan1(x = qtl,
map = map,
lodcolumn = "Ins_tAUC_log")
ERROR
Error: object 'qtl' not found
R
add_threshold(map, summary(operm,alpha=0.1), col = 'purple')
ERROR
Error in .rangeNum(..., na.rm = na.rm, finite = finite, isNumeric = is.numeric): argument is missing, with no default
R
add_threshold(map, summary(operm, alpha=0.05), col = 'red')
ERROR
Error in .rangeNum(..., na.rm = na.rm, finite = finite, isNumeric = is.numeric): argument is missing, with no default
R
add_threshold(map, summary(operm, alpha=0.01), col = 'blue')
ERROR
Error in .rangeNum(..., na.rm = na.rm, finite = finite, isNumeric = is.numeric): argument is missing, with no default
We can see a very strong peak on chromosome 11 with no other distibuishable peaks.
Finding LOD peaks
We can find all of the peaks above the significance threshold using the find_peaks function.
The support interval is determined using the Bayesian Credible
Interval and represents the region most likely to contain the
causative polymorphism(s). We can obtain this interval by adding a
prob
argument to find_peaks.
We pass in a value of 0.95
to request a support interval
that contains the causal SNP 95% of the time.
In case there are multiple peaks are found on a chromosome, the
peakdrop
argument allows us to the find the peak which has
a certain LOD score drop between other peaks.
Let’s find LOD peaks. Here we are choosing to find peaks with a LOD score greater than 6.
R
lod_threshold = summary(operm, alpha=0.01)
ERROR
Error: object 'operm' not found
R
peaks = find_peaks(scan1_output = qtl, map = map,
threshold = lod_threshold,
peakdrop = 4,
prob = 0.95)
ERROR
Error: object 'qtl' not found
R
kable(peaks %>% dplyr::select (-lodindex) %>%
arrange(chr, pos), caption = "Phenotype QTL Peaks with LOD >= 6")
ERROR
Error: object 'peaks' not found
QTL effects
R
g <- maxmarg(probs,
map=map,
chr=peaks$chr[1],
pos=peaks$pos[1],
minprob = 0.4,
return_char=TRUE)
ERROR
Error: object 'probs' not found
R
plot_pxg(g,
pheno_clin[,peaks$lodcolumn[1]],
ylab=peaks$lodcolumn[1],
sort=FALSE)
ERROR
Error: object 'g' not found
R
blup <- scan1blup(genoprobs=probs[,peaks$chr[1]],
pheno=pheno_clin[,peaks$lodcolumn[1], drop=FALSE])
ERROR
Error: object 'probs' not found
R
plot_coefCC(blup,
map=map,
columns=1:8,
bgcolor="gray95",
legend="bottomleft",
scan1_output = qtl
)
ERROR
Error: object 'blup' not found
Challenge 1:
Now choose another phenotype in pheno_clin
and perform
the same steps.
1). Check the distribution. Does it need transforming? If so, how would
you do it? 2). Are there any sex, batch, diet effects?
3). Run a genome scan with the genotype probabilities and kinship
provided.
4). Plot the genome scan for this phenotype.
5). Find the peaks above LOD score of 6.
R
#1).
hist(pheno_clin$<pheno name>)
pheno_clin$<pheno name>_log <- log(pheno_clin$<pheno name>)
hist(pheno_clin$<pheno name>_log)
#2).
tmp = pheno_clin %>%
dplyr::select(mouse, sex, DOwave, diet_days, <pheno name>) %>%
gather(phenotype, value, -mouse, -sex, -DOwave, -diet_days) %>%
group_by(phenotype) %>%
nest()
mod_fxn = function(df) {
lm(value ~ sex + DOwave + diet_days, data = df)
}
tmp = tmp %>%
mutate(model = map(data, mod_fxn)) %>%
mutate(summ = map(model, tidy)) %>%
unnest(summ)
tmp
tmp %>%
filter(term != "(Intercept)") %>%
mutate(neg.log.p = -log10(p.value)) %>%
ggplot(aes(term, neg.log.p)) +
geom_point() +
facet_wrap(~phenotype) +
labs(title = "Significance of Sex, Wave & Diet Days on Phenotypes") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
rm(tmp)
#3).
qtl = scan1(genoprobs = probs,
pheno = pheno_clin[,"<pheno name>", drop = FALSE],
kinship = K,
addcovar = covar)
#4).
plot_scan1(x = qtl, map = map, lodcolumn = "<pheno name>")
abline(h = 6, col = 2, lwd = 2)
#5).
peaks = find_peaks(scan1_output = qtl, map = map,
threshold = lod_threshold,
peakdrop = 4,
prob = 0.95)
Key Points
- Use
.md
files for episodes when you want static content - Use
.Rmd
files for episodes when you need to generate output - Run
sandpaper::check_lesson()
to identify any issues with your lesson - Run
sandpaper::build_lesson()
to preview your lesson locally
Content from Mapping A Single Gene Expression Trait
Last updated on 2024-11-12 | Edit this page
Estimated time: 60 minutes
Overview
Questions
- How do I map one gene expression trait?
Objectives
- To run a QTL analysis for expression data
In this tutorial, we are going to use the gene expression data as our phenotype to map QTLs. That is, we are looking to find any QTLs (cis or trans) that are responsible for variation in gene expression across samples.
We are using the same directory as our previous tutorial, which you
can check by typing getwd()
in the Console. If you are not
in the correct directory, type in setwd("code")
in the
console or Session -> Set Working Directory -> Choose Directory in
the RStudio menu to set your working directory to the main
directory.
If you have started a new R session, you will need to load the libraries again. If you haven’t, the libraries listed below are already loaded.
Load Data
In this lesson, we are loading in gene expression data for
21,771
genes: attie_DO500_expr.datasets.RData
that includes normalised and raw gene expression data.
dataset.islet.rnaseq
is the dataset you can download
directly from Dryad. Again, if you are using the same R session, you
will not need to the load the mapping, phenotypes and genotype
probabilities data, again.
R
# expression data
load("../data/attie_DO500_expr.datasets.RData")
WARNING
Warning in readChar(con, 5L, useBytes = TRUE): cannot open compressed file
'../data/attie_DO500_expr.datasets.RData', probable reason 'No such file or
directory'
ERROR
Error in readChar(con, 5L, useBytes = TRUE): cannot open the connection
R
# data from paper
load("../data/dataset.islet.rnaseq.RData")
WARNING
Warning in readChar(con, 5L, useBytes = TRUE): cannot open compressed file
'../data/dataset.islet.rnaseq.RData', probable reason 'No such file or
directory'
ERROR
Error in readChar(con, 5L, useBytes = TRUE): cannot open the connection
R
# phenotypes
load("../data/attie_DO500_clinical.phenotypes.RData")
WARNING
Warning in readChar(con, 5L, useBytes = TRUE): cannot open compressed file
'../data/attie_DO500_clinical.phenotypes.RData', probable reason 'No such file
or directory'
ERROR
Error in readChar(con, 5L, useBytes = TRUE): cannot open the connection
R
# mapping data
load("../data/attie_DO500_mapping.data.RData")
WARNING
Warning in readChar(con, 5L, useBytes = TRUE): cannot open compressed file
'../data/attie_DO500_mapping.data.RData', probable reason 'No such file or
directory'
ERROR
Error in readChar(con, 5L, useBytes = TRUE): cannot open the connection
R
# genotype probabilities
probs = readRDS("../data/attie_DO500_genoprobs_v5.rds")
WARNING
Warning in gzfile(file, "rb"): cannot open compressed file
'../data/attie_DO500_genoprobs_v5.rds', probable reason 'No such file or
directory'
ERROR
Error in gzfile(file, "rb"): cannot open the connection
Expression Data
Raw gene expression counts are in the counts
data
object. These counts have been normalised and saved in the
norm
data object. More information is about normalisation
is here.
To view the contents of either one of these data , click the table on
the right hand side either norm
or counts
in
the Environment tab. If you type in names(counts)
, you will
see the names all start with ENSMUSG
. These are Ensembl
IDs. If we want to see which gene these IDs correspond to, type in
dataset.islet.rnaseq$annots
, which gives information about
each gene, including ensemble id, gene symbol as well as start &
stop location of the gene and chromsome on which the gene lies.
Because we are working with the insulin tAUC
phenotype,
let’s map the expression counts for Hnf1b
which is known to
influence this phenotype is these data. First, we need to find the
Ensembl ID for this gene:
R
dataset.islet.rnaseq$annots[dataset.islet.rnaseq$annots$symbol == "Hnf1b",]
ERROR
Error: object 'dataset.islet.rnaseq' not found
We can see that the ensembl ID of Hnf1b
is
ENSMUSG00000020679
. If we check the distribution for
Hnf1b
expression data between the raw and normalised data,
we can see there distribution has been corrected. Here is the
distribution of the raw counts:
R
hist(counts$ENSMUSG00000020679, main = "Hnf1b")
ERROR
Error: object 'counts' not found
and here is the distribution of the normalised counts:
R
hist(norm$ENSMUSG00000020679, main = "Hnf1b")
ERROR
Error in norm$ENSMUSG00000020679: object of type 'closure' is not subsettable
The histogram indicates that distribution of these counts are normalised.
The Marker Map
We are using the same marker map as in the previous lesson
Genotype probabilities
We have explored this earlier in th previous lesson.
But, as a reminder, we have already calculated genotype probabilities
which we loaded above called probs
. This contains the 8
state genotype probabilities using the 69k grid map of the same 500 DO
mice that also have clinical phenotypes.
Kinship Matrix
We have explored the kinship matrix in the previous lesson. It has already been calculated and loaded in above.
Covariates
Now let’s add the necessary covariates. For Hnf1b
expression data, let’s see which covariates are significant.
R
###merging covariate data and expression data to test for sex, wave and diet_days.
cov.counts <- merge(covar, norm, by=c("row.names"), sort=F)
ERROR
Error: object 'covar' not found
R
#testing covairates on expression data
tmp = cov.counts %>%
dplyr::select(mouse, sex, DOwave, diet_days, ENSMUSG00000020679) %>%
gather(expression, value, -mouse, -sex, -DOwave, -diet_days) %>%
group_by(expression) %>%
nest()
ERROR
Error: object 'cov.counts' not found
R
mod_fxn = function(df) {
lm(value ~ sex + DOwave + diet_days, data = df)
}
tmp = tmp %>%
mutate(model = map(data, mod_fxn)) %>%
mutate(summ = map(model, tidy)) %>%
unnest(summ)
ERROR
Error: object 'tmp' not found
R
# kable(tmp, caption = "Effects of Sex, Wave & Diet Days on Expression")
tmp
ERROR
Error: object 'tmp' not found
R
tmp %>%
filter(term != "(Intercept)") %>%
mutate(neg.log.p = -log10(p.value)) %>%
ggplot(aes(term, neg.log.p)) +
geom_point() +
facet_wrap(~expression) +
labs(title = "Significance of Sex, Wave & Diet Days on Expression") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
rm(tmp)
ERROR
Error: object 'tmp' not found
We can see that sex and DOwave are significant. Here DOwave is the
group or batch number as not all mice were submitted for genotyping at
the same time. Because of this, we now have to correct for it.
Considering the paper included the covariate, diet_days
, we
will include that as well.
R
# convert sex and DO wave (batch) to factors
pheno_clin$sex = factor(pheno_clin$sex)
ERROR
Error: object 'pheno_clin' not found
R
pheno_clin$DOwave = factor(pheno_clin$DOwave)
ERROR
Error: object 'pheno_clin' not found
R
pheno_clin$diet_days = factor(pheno_clin$DOwave)
ERROR
Error: object 'pheno_clin' not found
R
covar = model.matrix(~sex + DOwave + diet_days, data = pheno_clin)[,-1]
ERROR
Error: object 'pheno_clin' not found
Performing a genome scan
[Permutations]
First, we need to work out the signifcance level. Let’s find the signifance level for 0.1, 0.05 and 0.01.
R
operm <- scan1perm(genoprobs = probs,
pheno = norm[,"ENSMUSG00000020679", drop = FALSE],
addcovar=covar,
n_perm=1000)
Note DO NOT RUN THIS (it will take too long). Instead, I have run it earlier and will load it in here. We will also perform a summary to find the summary level for 0.1, 0.05 and 0.01
R
load("../data/operm_ENSMUSG00000020679_1000.Rdata")
summary(operm,alpha=c(0.1, 0.05, 0.01))
Genome Scan
R
qtl = scan1(genoprobs = probs,
pheno = norm[,"ENSMUSG00000020679", drop = FALSE],
kinship = K,
addcovar = covar)
Next, we plot the genome scan.
R
plot_scan1(x = qtl,
map = map,
lodcolumn = "ENSMUSG00000020679",
main = colnames(qtl))
add_threshold(map, summary(operm, alpha=0.1), col = 'purple')
add_threshold(map, summary(operm, alpha=0.05), col = 'red')
add_threshold(map, summary(operm, alpha=0.01), col = 'blue')
Finding LOD peaks
Let’s find LOD peaks
R
lod_threshold = summary(operm, alpha=0.01)
peaks = find_peaks(scan1_output = qtl,
map = map,
threshold = lod_threshold,
peakdrop = 4, prob = 0.95)
kable(peaks %>%
dplyr::select(-lodindex) %>%
arrange(chr, pos), caption = "Phenotype QTL Peaks with LOD >= 6")
QTL effects
R
blup <- scan1blup(genoprobs=probs[,peaks$chr[1]],
norm[,peaks$lodcolumn[1], drop=FALSE])
plot_coefCC(blup,
map=map,
columns=1:8,
bgcolor="gray95",
legend="bottomleft",
scan1_output = qtl )
Challenge 1:
Now choose another gene expression trait in norm
data
object and perform the same steps.
1). Check the distribution of the raw counts and normalised counts 2).
Are there any sex, batch, diet effects?
3). Run a genome scan with the genotype probabilities and kinship
provided.
4). Plot the genome scan for this gene.
5). Find the peaks above LOD score of 6.
Replace <ensembl id>
with your choice of gene
expression trait
R
#1).
hist(pheno_clin$<ensembl id>)
pheno_clin$<ensembl id>_log <- log(pheno_clin$<ensembl id>)
hist(pheno_clin$<ensembl id>_log)
#2).
tmp = pheno_clin %>%
dplyr::select(mouse, sex, DOwave, diet_days, <ensembl id>) %>%
gather(expression, value, -mouse, -sex, -DOwave, -diet_days) %>%
group_by(expression) %>%
nest()
mod_fxn = function(df) {
lm(value ~ sex + DOwave + diet_days, data = df)
}
tmp = tmp %>%
mutate(model = map(data, mod_fxn)) %>%
mutate(summ = map(model, tidy)) %>%
unnest(summ)
tmp
tmp %>%
filter(term != "(Intercept)") %>%
mutate(neg.log.p = -log10(p.value)) %>%
ggplot(aes(term, neg.log.p)) +
geom_point() +
facet_wrap(~expression) +
labs(title = "Significance of Sex, Wave & Diet Days on Phenotypes") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
rm(tmp)
#3).
qtl = scan1(genoprobs = probs,
pheno = pheno_clin[,"<ensembl id>", drop = FALSE],
kinship = K,
addcovar = covar)
#4).
plot_scan1(x = qtl, map = map, lodcolumn = "<ensembl id>")
abline(h = 6, col = 2, lwd = 2)
#5).
peaks = find_peaks(scan1_output = qtl, map = map,
threshold = lod_threshold,
peakdrop = 4,
prob = 0.95)
Key Points
- Use
.md
files for episodes when you want static content - Use
.Rmd
files for episodes when you need to generate output - Run
sandpaper::check_lesson()
to identify any issues with your lesson - Run
sandpaper::build_lesson()
to preview your lesson locally
Content from Mapping Many Gene Expression Traits
Last updated on 2024-11-12 | Edit this page
Estimated time: 60 minutes
Overview
Questions
- How do I map many genes?
Objectives
- To map several genes at the same time
Load Libraries
For this lesson, we need to install two more libraries and load them. You can do this by typing the following code:
R
BiocManager::install(c("AnnotationHub","rtracklayer"))
Let’s install our libraries, and source two other R scripts.
R
library(tidyverse)
library(knitr)
library(broom)
library(qtl2)
library(qtl2ggplot)
library(RColorBrewer)
#source("../code/gg_transcriptome_map.R")
#source("../code/qtl_heatmap.R")
Before we begin this lesson, we need to create another directory
called results
in our main directory. You can do this by
clicking on the “Files” tab and navigate into the main directory. Then
select “New Folder” and name it “results”.
Load Data
R
# expression data
load("../data/attie_DO500_expr.datasets.RData")
# data from paper
load("../data/dataset.islet.rnaseq.RData")
# phenotypes
load("../data/attie_DO500_clinical.phenotypes.RData")
# mapping data
load("../data/attie_DO500_mapping.data.RData")
# genotype probabilities
probs = readRDS("../data/attie_DO500_genoprobs_v5.rds")
Data Selection
For this lesson, lets choose a random set of 50 gene expression phenotypes.
R
genes = colnames(norm)
sams <- sample(length(genes), 50, replace = FALSE, prob = NULL)
ERROR
Error in sample.int(length(x), size, replace, prob): cannot take a sample larger than the population when 'replace = FALSE'
R
genes <- genes[sams]
ERROR
Error: object 'sams' not found
R
gene.info <- dataset.islet.rnaseq$annots[genes,]
ERROR
Error: object 'dataset.islet.rnaseq' not found
R
rownames(gene.info) = NULL
ERROR
Error: object 'gene.info' not found
R
kable(gene.info[1:10,])
ERROR
Error: object 'gene.info' not found
Expression Data
Lets check the distribution for the first 20 gene expression
phenotypes. If you would like to check the distribution of all 50 genes,
change for(gene in genes[1:20])
in the code below to
for(gene in genes)
.
R
par(mfrow=c(3,4))
for(gene in genes[1:20]){
hist(norm[,gene], main = gene)
}
Check the distributions. Do they all have a normal distribution?
You will notice that the distribtion of some genes are skewed to the left. This means that that only a small amount of samples have data and therefore, will need to be removed. A suitable qc would be keeping expression data that have at least 5% of the samples with more than 10 reads.
R
genes_qc <- which(as.numeric(colSums(counts[ , genes] > 10)) >= 0.05 * nrow(counts[,genes]))
ERROR
Error: object 'counts' not found
R
genes <- genes[genes_qc]
ERROR
Error: object 'genes_qc' not found
The Marker Map
We are using the same marker map as in the previous lesson
Genotype probabilities
We have explored this earlier in th previous lesson.
But, as a reminder, we have already calculated genotype probabilities
which we loaded above called probs
. This contains the 8
state g enotype probabilities using the 69k grid map of the same 500 DO
mice that also have clinical phenotypes.
Kinship Matrix
We have explored the kinship matrix in the previous lesson. It has already been calculated and loaded in above.
Covariates
Now let’s add the necessary covariates. For these 50 gene expression
data, we will correct for DOwave
,sex
and
diet_days
.
R
# convert sex and DO wave (batch) to factors
pheno_clin$sex = factor(pheno_clin$sex)
ERROR
Error: object 'pheno_clin' not found
R
pheno_clin$DOwave = factor(pheno_clin$DOwave)
ERROR
Error: object 'pheno_clin' not found
R
pheno_clin$diet_days = factor(pheno_clin$DOwave)
ERROR
Error: object 'pheno_clin' not found
R
covar = model.matrix(~sex + DOwave + diet_days, data = pheno_clin)[,-1]
ERROR
Error: object 'pheno_clin' not found
Performing a genome scan
Now lets perform the genome scan! We are also going to save our qtl
results in an Rdata
file to be used in further lessons. We
will not perform permutations in this lesson as it will take too long.
Instead we will use 6, which is the LOD score used in the paper to
determine significance.
QTL Scans
R
qtl.file = "../results/gene.norm_qtl_cis.trans.Rdata"
if(file.exists(qtl.file)) {
load(qtl.file)
} else {
qtl = scan1(genoprobs = probs,
pheno = norm[,genes, drop = FALSE],
kinship = K,
addcovar = covar,
cores = 2)
save(qtl, file = qtl.file)
}
QTL plots
Let’s plot the first 20 gene expression phenotypes. If you would like
to plot all 50, change for(i in 1:20)
in the code below to
for(i in 1:ncol(qtl))
.
R
par(mfrow=c(3,4))
for(i in 1:20) {
plot_scan1(x = qtl,
map = map,
lodcolumn = i,
main = colnames(qtl)[i])
abline(h = 6, col = 2, lwd = 2)
}
ERROR
Error: object 'qtl' not found
QTL Peaks
We are also going to save our peak results so we can use these again
else where.
First, lets get out peaks with a LOD score greater than 6.
R
lod_threshold = 6
peaks = find_peaks(scan1_output = qtl,
map = map,
threshold = lod_threshold,
peakdrop = 4,
prob = 0.95)
ERROR
Error: object 'qtl' not found
We will save these peaks into a csv file.
R
kable(peaks[1:10,] %>%
dplyr::select(-lodindex) %>%
arrange(chr, pos), caption = "Expression QTL (eQTL) Peaks with LOD >= 6")
# write_csv(peaks, "../results/gene.norm_qtl_peaks_cis.trans.csv")
ERROR
Error: object 'peaks' not found
QTL Peaks Figure
R
qtl_heatmap(qtl = qtl, map = map, low.thr = 3.5)
ERROR
Error in qtl_heatmap(qtl = qtl, map = map, low.thr = 3.5): could not find function "qtl_heatmap"
Challenge 1:
What do the qtl scans for all gene exression traits look like?
Note: Don’t worry, we’ve done the qtl scans for you!!! You can
read in this file, ../data/gene.norm_qtl_all.genes.Rdata
,
which are the scan1
results for all gene expression
traits.
R
load("../data/gene.norm_qtl_all.genes.Rdata")
lod_threshold = 6
peaks = find_peaks(scan1_output = qtl.all,
map = map,
threshold = lod_threshold,
peakdrop = 4,
prob = 0.95)
write_csv(peaks, "../results/gene.norm_qtl_all.genes_peaks.csv")
## Heat Map
qtl_heatmap(qtl = qtl, map = map, low.thr = 3.5)
Key Points
- Use
.md
files for episodes when you want static content - Use
.Rmd
files for episodes when you need to generate output - Run
sandpaper::check_lesson()
to identify any issues with your lesson - Run
sandpaper::build_lesson()
to preview your lesson locally
Content from Creating A Transcriptome Map
Last updated on 2024-11-12 | Edit this page
Estimated time: 60 minutes
Overview
Questions
- How do I create and interpret a transcriptome map?
Objectives
- Describe a transcriptome map.
- Interpret a transcriptome map.
Load Libraries
R
library(tidyverse)
library(qtl2)
library(qtl2convert)
library(RColorBrewer)
library(qtl2ggplot)
# source("../code/gg_transcriptome_map.R")
Load Data
Load in the LOD peaks over 6 from previous lesson.
R
# REad in the LOD peaks from the previous lesson.
lod_summary <- read.csv("../results/gene.norm_qtl_peaks_cis.trans.csv")
In order to use the ggtmap
function, we need to provide
specific column names. These are documented in the
gg_transcriptome_map.R
file in the code directory of this
workshop. The required column names are:
-
data
: data.frame (or tibble) with the following columns:-
ensembl
: (required) character string containing the Ensembl gene ID. -
qtl_chr
: (required) character string containing QTL chromsome. -
qtl_pos
: (required) floating point number containing the QTL position in Mb. -
qtl_lod
: (optional) floating point number containing the LOD score. -
gene_chr
: (optional) character string containing transcript chromosome. -
gene_start
: (optional) character string containing transcript start position in Mb. -
gene_end
: (optional) character string containing transcript end position in Mb.
-
R
# Get gene positions.
ensembl <- get_ensembl_genes()
ERROR
Error in get_ensembl_genes(): could not find function "get_ensembl_genes"
R
df <- data.frame(ensembl = ensembl$gene_id,
gene_chr = seqnames(ensembl),
gene_start = start(ensembl) * 1e-6,
gene_end = end(ensembl) * 1e-6,
stringsAsFactors = F)
ERROR
Error: object 'ensembl' not found
R
lod_summary <- lod_summary %>%
rename(lodcolumn = "ensembl",
chr = "qtl_chr",
pos = "qtl_pos",
lod = "qtl_lod") %>%
left_join(df, by = "ensembl") %>%
mutate(marker.id = str_c(qtl_chr, qtl_pos * 1e6, sep = "_"),
gene_chr = factor(gene_chr, levels = c(1:19, "X")),
qtl_chr = factor(qtl_chr, levels = c(1:19, "X")))
ERROR
Error: object 'lod_summary' not found
R
rm(df)
Some of the genes will have a QTL in the same location as the gene and others will have a QTL on a chromosome where the gene is not located.
Challenge 1:
What do we call eQTL that are co-colocated with the gene? What do we call eQTL that are located on a different chromosome than the gene?
A cis-eQTL is an eQTL that is co-colocated with the gene. A trans-eQTL is an eQTL that is located on a chromosome other than the gene that was mapped.
We can tabulate the number of cis- and trans-eQTL that we have and add this to our QTL summary table. A cis-eQTL occurs when the QTL peaks is directly over the gene position. But what if it is 2 Mb away? Or 10 Mb? It’s possible that a gene may have a trans eQTL on the same chromosome if the QTL is “far enough” from the gene. We have selected 4 Mb as a good rule of thumb.
R
lod_summary <- lod_summary %>%
mutate(cis = if_else(qtl_chr == gene_chr &
abs(gene_start - qtl_pos) < 4,
"cis", "trans"))
ERROR
Error: object 'lod_summary' not found
R
count(lod_summary, cis)
ERROR
Error: object 'lod_summary' not found
Plot Transcriptome Map
R
ggtmap(data = lod_summary %>%
filter(qtl_lod >= 7.18),
cis.points = TRUE,
cis.radius = 4)
ERROR
Error in ggtmap(data = lod_summary %>% filter(qtl_lod >= 7.18), cis.points = TRUE, : could not find function "ggtmap"
The plot above is called a “Transcriptome Map” because it shows the positions of the genes (or transcripts) and their corresponding QTL. The QTL position is shown on the X-axis and the gene position is shown on the Y-axis. The chromosomes are listed along the top and right of the plot. What type of QTL are the genes with QTL that are located along the diagonal?
Key Points
- Transcriptome maps aid in understanding gene expression regulation.
Content from Transcriptome Map of cis and trans eQTL
Last updated on 2024-11-12 | Edit this page
Estimated time: 30 minutes
Overview
Questions
- How do I create a full transcriptome map?
Objectives
- Explain how to use markdown with the new lesson template
- Demonstrate how to include pieces of code, figures, and nested challenge blocks
Load Libraries
R
library(tidyverse)
library(qtl2)
library(knitr)
library(RColorBrewer)
library(qtl2ggplot)
# source("../code/gg_transcriptome_map.R")
Load Data
Load in the RNA-seq eQTL mapping results.
R
# loading previous results
load("../data/dataset.islet.rnaseq.RData")
Next, we need to format the column names of our eQTL results to that
the ggtmap
function can use the results.
R
lod_summary = dataset.islet.rnaseq$lod.peaks
# Get gene positions.
ensembl <- get_ensembl_genes()
df <- data.frame(ensembl = ensembl$gene_id,
gene_chr = seqnames(ensembl),
gene_start = start(ensembl) * 1e-6,
gene_end = end(ensembl) * 1e-6,
stringsAsFactors = F)
# Create eQTL table for transcriptome map function.
lod_summary <- lod_summary %>%
rename(annot.id = "ensembl",
chrom = "qtl_chr",
pos = "qtl_pos",
lod = "qtl_lod") %>%
left_join(df, by = "ensembl") %>%
mutate(marker.id = str_c(qtl_chr, qtl_pos * 1e6, sep = "_"),
gene_chr = factor(gene_chr, levels = c(1:19, "X")),
qtl_chr = factor(qtl_chr, levels = c(1:19, "X"))) %>%
mutate(cis = if_else(qtl_chr == gene_chr & abs(gene_start - qtl_pos) < 4, "cis", "trans"))
rm(df)
Plot Transcriptome Map
In the previous lesson, we mapped the QTL locations of 50 genes. In this lesson, we will map the QTL positions of genes.
R
ggtmap(data = lod_summary %>%
filter(qtl_lod >= 7.18),
cis.points = TRUE,
cis.radius = 4)
This transcriptome map is definitely a lot more crowded than the one in the previous lesson. Again, the gene locations are shown on the X-axis and the QTL locations are shown on the Y-axis.
Challenge 1: What patterns among the points do you see in the transcriptome map?
There are at least two patterns. One is the dense diagonal line of cis-eQTL. The other is the increased density of QTL in vertical lines.
Challenge 2: What would a vertical band in the transcriptome map mean?
A vertical band indicates that one locus regulates the expression of many genes.
Look at the transcriptome map. How many vertical bands do you see and which chromosomes are they on?
QTL Density Plot
In the transcriptome map above, we noticed vertical banding patterns in the eQTL, which indicate that one locus may regulate the expression of dozens of genes. How many genes are regulated by each locus and which genes are they? In order to address this question, we need to make a plot of the density of eQTL along the genome. This is like stacking up the eQTL onto the X-axis.
We have provided a function to do this in the
gg_transcriptome_map.R
file in the code
directory of this lesson. The function is called
eqtl_density_plot
and takes the following arguments:
-
data
: data.frame (or tibble) with the following columns:-
ensembl
: (required) character string containing the Ensembl gene ID. -
qtl_chr
: (required) character string containing QTL chromsome. -
qtl_pos
: (required) floating point number containing the QTL position in Mb. -
qtl_lod
: (optional) floating point number containing the LOD score. -
gene_chr
: (optional) character string containing transcript chromosome. -
gene_start
: (optional) character string containing transcript start position in Mb. -
gene_end
: (optional) character string containing transcript end position in Mb.
-
-
lod_thr
: numeric value that is the LOD above which QTL will be retained. Default = 7.
This function has been designed to use the same data structure as we used to create the transcriptome map. First, we will look at the overall QTL density for all peaks with LOD > 7.18.
R
eqtl_density_plot(data = lod_summary, lod_thr = 7.18)
There are clearly some loci that have an enrichment of eQTL. We have
drawn a dashed line at 100"
as an arbitrary cutoff as a
potential cutoff to use when selecting peaks.
Compare this plot with the transcriptome map, in which we saw vertical bands of eQTL. Do the peaks in the eQTL density plot match the bands in the transcriptome map?
Next, we will look at the locations of the cis-eQTL. We must also select a LOD threshold. We will use 7.18 since this is what was used in the Keller et al. paper.
R
eqtl_density_plot(data = filter(lod_summary, cis == "cis"), lod_thr = 7.18)
In the plot above, there are many loci that have many genes associated with their expression. Some loci have over 100 genes associated with them. For example, there is a locus on chromosome 17 that may regulate over 100 genes. In this case, we are looking at cis-eQTL, QTL which are co-located with the gene. What might this mean biologically? Perhaps there is a mutation which causes a region of chromatin to open, which leads to increased expression of a set of genes. This increased expression may have biological consequences.
Next, we will create an eQTL density plot of the trans-eQTL. These are QTL for which the gene and QTL are far from each other.
R
eqtl_density_plot(data = filter(lod_summary, cis == "trans"), lod_thr = 7.18)
In this case, we see much taller peaks than in the cis-eQTL density plot and these peaks match the ones in the overall eQTL density plot.
There are many potential explanations for a trans-eQTL band. There may be a polymorphism in a transcription factor which alters the expression of many other genes. Or there may be a polymorphism that alters an amino acid and prevents it from binding properly to another protein in a signalling cascade. Biologists are often interested in these trans-eQTL bands, which are called “eQTL hotspots”. A set of over 100 genes with differential expression between genotypes may help us to understand the biology behind variation in higher level phenotypes. It is also possible that one of the genes with a cis-eQTL under the eQTL hotspot regulates the expression of the remaining hotspot genes.
Islet RNASeq eQTL Hotspots
Select eQTL Hotspots
There are several decisions to make when selecting eQTL hotspots. What LOD threshold should you use to select the genes that comprise hotspots? What number of genes should you use as a threshold to call a peak an eQTL hotspot? In this case, the authors select a LOD of 7.18 and decided that 100 trans-regulated genes was a useful threshold.
TBD: Add references for eQTL module selection. Possibly augment lesson?
In order to identify eQTL hotspots, we will look at the density of trans-eQTL along the genome using the density function. We will filter to retain QTL with LOD > 7.8 and will get the position on each chromosome with the highest density of eQTL.
R
qtl_dens = lod_summary %>%
filter(qtl_lod > 7.18 & cis == 'trans') %>%
group_by(qtl_chr) %>%
summarize(dens_x = density(qtl_pos, adjust = 0.1)$x,
dens_y = density(qtl_pos, adjust = 0.1)$y) %>%
slice_max(dens_y, n = 1)
qtl_dens
In the table above, there is one row per chromosome. x
is the position on each chromosome with the highest density of eQTL.
y
is the density, which we needed to obtain the
x
position but is not used further.
Now that we have the location of maximum eQTL density on each chromosome, we will count the number of eQTL within +/- 2 Mb of the center.
R
hotspots = left_join(lod_summary, select(qtl_dens, qtl_chr, dens_x)) %>%
filter(qtl_lod > 7.18 & cis == 'trans') %>%
mutate(pos_diff = abs(dens_x - qtl_pos)) %>%
filter(pos_diff <= 2) %>%
select(-pos_diff) %>%
rename(dens_x = 'center')
head(hotspots)
Now that we have a list of genes near the position of highest eQTL density on each chromosome, we can count the number of genes in each potential hotspot and retain the ones containing more than 100 genes.
R
hotspots = hotspots %>%
count(qtl_chr, center) %>%
filter(n >= 100)
kable(hotspots, caption = "Islet trans-eQTL hotspots")
Chr 11 eQTL Hotspot
From this analysis, we have identified five eQTL hotspots. Due to time constraints, we will not examine all of them in depth. We will look at the chromosome 11 eQTL hotspot in greater depth. It contains 163 trans-eQTL. First, we will get all of the genes with an eQTL in within +/- 2 Mb of the chromosome 11 eQTL hotspot.
R
chr11_mid = hotspots %>%
filter(qtl_chr == '11') %>%
pull(center)
chr11_eqtl = lod_summary %>%
filter(qtl_lod > 7.18 &
qtl_chr == '11' &
abs(chr11_mid - qtl_pos) < 2 &
!is.na(cis))
Next, we will filter to retain the cis-eQTL in the same interval. It is possible that one or more genes near 71.5 Mb on chromosome 11 have a cis-eQTL, which in turn alters the expression of the trans-eQTL genes.
R
chr11_cis = chr11_eqtl %>%
filter(cis == 'cis')
kable(chr11_cis, caption = 'Chr 11 cis-eQTL')
As you can see, there are cis-eQTL genes under the chromosome 11 eQTL hotspot. This is a large number of candidate genes to screen. There may be more genes with non-synonymous, splice, or stop mutations under the eQTL hotspot as well. The Sanger Mouse Genomes website has been removed and we are uncertain if it will be replaced. There are two websites where you can find this information:
- Ensembl: Once you search for a gene, you can select the “Variant Table” under “Genetic Variation” in the left navigation panel.
- Founder Variant Search: You can use this site to search for variants in specific genes or genomic intervals.
Next, we will get the expression of the genes in the chromosome 11 eQTL hotspot. The expression data is a numeric matrix, so we will use the colnames to filter the genes.
R
chr11_genes = dataset.islet.rnaseq$expr[,chr11_eqtl$ensembl]
Next, we will join the eQTL and expression data and write it out to a file. This way you will have all of the data for the hotspot in one place.
R
chr11_expr = tibble(ensembl = colnames(chr11_genes),
data.frame(t(chr11_genes)))
chr11_all = left_join(chr11_eqtl, chr11_expr, by = 'ensembl')
write_csv(chr11_all, file = file.path('../results/', 'chr11_eqtl_genes.csv'))
Key Points
- Use
.md
files for episodes when you want static content - Use
.Rmd
files for episodes when you need to generate output - Run
sandpaper::check_lesson()
to identify any issues with your lesson - Run
sandpaper::build_lesson()
to preview your lesson locally
Content from Maximum eQTL Peaks and Nearby Genes
Last updated on 2024-11-12 | Edit this page
Estimated time: 30 minutes
Overview
Questions
- How do I display maximum eQTL peaks and nearby genes?
Objectives
- Explain how to use markdown with the new lesson template
- Demonstrate how to include pieces of code, figures, and nested challenge blocks
Challenge 1: Can you do it?
What is the output of this command?
R
paste("This", "new", "lesson", "looks", "good")
OUTPUT
[1] "This new lesson looks good"
Challenge 2: how do you nest solutions within challenge blocks?
You can add a line with at least three colons and a
solution
tag.
Key Points
- Use
.md
files for episodes when you want static content - Use
.Rmd
files for episodes when you need to generate output - Run
sandpaper::check_lesson()
to identify any issues with your lesson - Run
sandpaper::build_lesson()
to preview your lesson locally
Content from Interpreting qtl2 results
Last updated on 2024-11-12 | Edit this page
Estimated time: 30 minutes
Overview
Questions
- How do I interpret qtl2 results?
Objectives
- Interpret the relationship between sequence, expression and phenotype variation from qtl2 mapping results.
::::::::::::::::::::::::::::::::::::: challenge
Challenge: Interpreting qtl2 results
Refer to the figure above.
1). What does panel A show? What conclusions could you draw from
panel A?
2). What does panel B show? What conclusions could you draw from panel
B?
3). What does panel C show? What conclusions could you draw from panel
C?
4). How are panels A through C related to one another? What story do
they tell together?
::::::::::::::::::::::::::::::::::::::::::::::::
Key Points
- Use
.md
files for episodes when you want static content - Use
.Rmd
files for episodes when you need to generate output - Run
sandpaper::check_lesson()
to identify any issues with your lesson - Run
sandpaper::build_lesson()
to preview your lesson locally
Content from Mediation Analysis
Last updated on 2024-11-12 | Edit this page
Estimated time: 30 minutes
Overview
Questions
- What is mediation analysis?
- How do I explore causal relations with mediation analysis?
Objectives
- Describe mediation analysis as applied in genetics and genomics.
GWAS studies show that most disease-associated variants are found in non-coding regions. This fact leads to the idea that regulation of gene expression is an important mechanism enabling genetic variants to affect complex traits. Mediation analysis can identify a causal chain from genetic variants to molecular and clinical phenotypes. The graphic below shows complete mediation, in which a predictor variable does not directly impact the response variable, but does directly the mediator. The mediator has a direct impact on the response variable. We would observe the relationship between predictor and response, not knowing that a mediator intervenes in this relationship.
Mediation analysis is widely used in the social sciences including psychology. In biomedicine, mediation analysis has been employed to investigate how gene expression mediates the effects of genetic variants on complex phenotypes and disease.
For example, a genetic variant (non-coding SNP) indirectly regulates expression of gene 2 through a mediator, gene 1. The SNP regulates expression of gene 1 in cis, and expression of gene 1 influences expression of gene 2 in trans.
Instead of the expression of one gene impacting another, expression of gene 1 in the graphic above could impact a physiological phenotype like blood glucose. Expression of gene 1 would mediate the relationship between the non-coding SNP and the glucose phenotype.
Gene Akr1e1 is located on chromosome 13 in mouse. How would you interpret the LOD plot below? On which chromosome(s) would you expect to find the driver gene(s)? The SNP(s)?
Myo15b is located on chromosome 11. How would you interpret the following LOD plot? On which chromosome(s) would you expect to find the driver gene(s)? The SNP(s)?
The QTL Viewer for the Attie islet data integrates mediation into exploration of the data. Below, mediation analysis identifies gene Hnf4a as the chromosome 2 gene that impacts Myo15b expression.
Load Functions
R
rankZ = function(x) {
x = rank(x, na.last = "keep", ties.method = "average") / (sum(!is.na(x)) + 1)
return(qnorm(x))
}
Load Data
R
# expression data
load("../data/attie_DO500_expr.datasets.RData")
WARNING
Warning in readChar(con, 5L, useBytes = TRUE): cannot open compressed file
'../data/attie_DO500_expr.datasets.RData', probable reason 'No such file or
directory'
ERROR
Error in readChar(con, 5L, useBytes = TRUE): cannot open the connection
R
# data from paper
load("../data/dataset.islet.rnaseq.RData")
WARNING
Warning in readChar(con, 5L, useBytes = TRUE): cannot open compressed file
'../data/dataset.islet.rnaseq.RData', probable reason 'No such file or
directory'
ERROR
Error in readChar(con, 5L, useBytes = TRUE): cannot open the connection
R
# phenotypes
load("../data/attie_DO500_clinical.phenotypes.RData")
WARNING
Warning in readChar(con, 5L, useBytes = TRUE): cannot open compressed file
'../data/attie_DO500_clinical.phenotypes.RData', probable reason 'No such file
or directory'
ERROR
Error in readChar(con, 5L, useBytes = TRUE): cannot open the connection
R
# mapping data
load("../data/attie_DO500_mapping.data.RData")
WARNING
Warning in readChar(con, 5L, useBytes = TRUE): cannot open compressed file
'../data/attie_DO500_mapping.data.RData', probable reason 'No such file or
directory'
ERROR
Error in readChar(con, 5L, useBytes = TRUE): cannot open the connection
R
# genotype probabilities
probs = readRDS("../data/attie_DO500_genoprobs_v5.rds")
WARNING
Warning in gzfile(file, "rb"): cannot open compressed file
'../data/attie_DO500_genoprobs_v5.rds', probable reason 'No such file or
directory'
ERROR
Error in gzfile(file, "rb"): cannot open the connection
Example from package. This will be removed, but I needed it for now.
R
# DOQTL liver protein expresion dataset
data(Tmem68)
# Let us mediate Tmem68 to other proteins
trgt = matrix(Tmem68$target, ncol = 1, dimnames = list(names(Tmem68$target), 'Tmem68'))
annot = Tmem68$annotation
colnames(annot)[5] = 'MIDDLE_POINT'
med <- mediation.scan(target = trgt,
mediator = Tmem68$mediator,
annotation = annot,
covar = Tmem68$covar,
qtl.geno = Tmem68$qtl.geno)
# Interactive plot
kplot(med)
Searching for Candidate Genes
Mapping the Phenotype
We will map the Insulin tAUC
trait becuase it has a QTL
on chromosome 11. First, we will rankZ
transform the
phenotypes because this helps to satisfy the model assumptions. We will
map using sex, generation, and the number of days on the diet as
additive covariates.
R
# NOTE: QTL is not nearly as large with untransformed phenotype.
pheno_rz = pheno_clin %>%
select(num_islets:weight_10wk) %>%
as.matrix()
ERROR
Error: object 'pheno_clin' not found
R
pheno_rz = apply(pheno_rz, 2, rankZ)
ERROR
Error: object 'pheno_rz' not found
R
covar$DOwave = factor(covar$DOwave)
ERROR
Error: object 'covar' not found
R
addcovar = model.matrix(~sex + DOwave + diet_days, data = covar)[,-1]
ERROR
Error: object 'covar' not found
R
ins_qtl = scan1(genoprobs = probs,
pheno = pheno_rz[,'Ins_tAUC', drop = FALSE],
kinship = K,
addcovar = addcovar)
ERROR
Error: object 'probs' not found
R
plot_scan1(ins_qtl, map, main = 'Insulin tAUC')
ERROR
Error: object 'ins_qtl' not found
There is a large peak on chromosome 11. Let’s look at the LOD score and the location.
R
peaks = find_peaks(ins_qtl, map, threshold = 10, prob = 0.95)
ERROR
Error: object 'ins_qtl' not found
R
peaks
ERROR
Error: object 'peaks' not found
Let’s also estimate the founder allele effects for
insulin tAUC
.
TBD: This is slow. Precalculate and just show the plot?
R
chr = '11'
feff = scan1blup(genoprobs = probs[,chr],
pheno = pheno_rz[,'Ins_tAUC', drop = FALSE],
kinship = K[[chr]],
addcovar = addcovar)
ERROR
Error: object 'probs' not found
R
plot_coefCC(feff, map, scan1_output = qtl, legend = 'bottomleft')
ERROR
Error: object 'feff' not found
Looking at the founder allele effects above, we can see that A/J, BL6, 129, and NOD alleles contribute to higher insulin tAUC values. Ideally, we are looking for a gene with a pattern of allele effects that matches this pattern, or it’s inverse.
TBD: Plot founder allele effects for two genes, one with a similar pattern, one without.
Searching for cis-eQTL under a physiological QTL
When we have a physiological QTL, we often start by looking for genes which contain missense, splice, or stop codon SNPs. This is a reasonable first approach to selecting candidate genes and may pay off. However, many QTL fall in intergenic regions, which suggests that the causal variant(s) may be in regulatory regions. These regulatory regions are not as well annotated as coding regions and it may be difficult to identify them from sequence alone. However, if you have gene expression data in a relevant tissue, you can look for genes in the QTL interval which have a cis-eQTL. If the pattern of founder allele effects for a candidate gene is similar to the allele effects of the physiological QTL, then this gene is a good candidate.
TBD: Need a figure with physiological QTL effect & cis-eQTL effects.
We have a QTL for insulin tAUC that extends from Mb to Mb. This support interval means that we expect the causal variant to be within this interval 95% of the time. We would like to select a set of candidate genes located near this QTL. If the causal variant is a coding SNP, then the gene will be located within this interval. However, if the causal variant is a regulatory SNP, then the gene that it regulates may be up to 2 Mb away from the variant. This means that we should be searching for genes within 2 Mb of the edges of the support interval.
TBD: Is there a REF for the +/- 2 Mb distance?
First, we need to get the support interval and add 2 MB to each side.
R
ci = c(peaks$ci_lo - 2, peaks$ci_hi + 2)
ERROR
Error: object 'peaks' not found
Next, we filter the eQTL results to retain genes in this interval.
R
cis_eqtl = dataset.islet.rnaseq$lod.peaks %>%
filter(chrom == '11' &
pos >= ci[1] &
pos <= ci[2])
ERROR
Error: object 'dataset.islet.rnaseq' not found
There are r nrow(cis_eqtl) genes with cis-eQTL in this interval. That is a lot of genes! We could estimate the founder allele effects for each of the genes and stare at them pensively. Or we could add each gene to the QTL mapping model and see if the LOD score changes.
Mapping the Phenotype with a Gene as as Covariate
In QTL mediation analysis, we are hypothesizing that there is a genomic variant which affects the expression of a gene, which in turn affects the physiological phenotype.
TBD: Add mediation figure here again?
There is a gene called Slfn3 under the Insulin tAUC peak. Our hypothesis is that this gene may influence Insulin AUC becuase it has a cis-eQTL in the same location. In order to test this hypothesis, we will add the expression of this gene to the model as an additive covariate. If Slfn3 explains some of the variance in Insulin tAUC, then the LOD should decrease. If it does not, then the LOD will not change much.
R
covar_gene = cbind(norm[rownames(addcovar), 'ENSMUSG00000018986'], addcovar)
ERROR
Error: object 'addcovar' not found
R
slfn3_qtl = scan1(genoprobs = probs,
pheno = pheno_rz[,'Ins_tAUC', drop = FALSE],
kinship = K,
addcovar = covar_gene)
ERROR
Error: object 'probs' not found
R
plot_scan1(ins_qtl, map, chr = chr, main = 'Insulin tAUC with Slfn3')
ERROR
Error: object 'ins_qtl' not found
R
plot_scan1(slfn3_qtl, map, chr = chr, color = 'blue', add = TRUE)
ERROR
Error: object 'slfn3_qtl' not found
As you can see, the LOD decreased when we added the expression of Slfn3 into the model. Let’s see what the new LOD is on chromosome 11.
R
find_peaks(qtl, map, threshold = 6.5)
ERROR
Error: object 'qtl' not found
The LOd decreased from 10.3 to 7.2.
DMG: STOPPED HERE
Searching for Causal Genes
In mediation analysis, we don’t know the causal gene in advance. We must fit a model like the one above for each gene that was measured. But this would involve mapping the trait across the entire genome for each gene. We don’t need to do this to identify causal genes under the phenotype QTL. We only need to fit a model using each gene at the position of the phenotype QTL.
Mediation analysis requires five pieces of data and metadata:
- Target: the phenotype that has a QTL for which we are seeking causal gene candidates.
- Mediator: gene expression for genes that will be used in the mediation analysis.
- Annotation: gene annotation data for the mediator genes.
- Covar: a set of covariates to use in the model.
- QTL.Geno: the genotype probabilities at the Target maximum QTL.
Challenge 1: Can you do it?
Visit the Attie islet data QTL viewer:
https://churchilllab.jax.org/qtlviewer/attie/islets# 1.
Select Islet RNA for the current data set.
2. Type in a gene symbol to search for.
3. Click on the highest peak in the LOD plot.
4. Look at the Effects in the bottom right panel.
5. Select the Mediation tab and click the LOD peak again.
6. Hover over the dots in the mediation plot to discover genes that
raise or lower the peak.
Key Points
- Mediation analysis investigates an intermediary between an independent variable and its effect on a dependent variable.
- Mediation analysis is used in high-throughput genomics studies to identify molecular phenotypes, such as gene expression or methylation traits, that mediate the effect of genetic variation on disease phenotypes or other outcomes of interest.