Mapping A Single Gene Expression Trait

Last updated on 2024-11-12 | Edit this page

Overview

Questions

  • How do I map one gene expression trait?

Objectives

  • To run a QTL analysis for expression data

In this tutorial, we are going to use the gene expression data as our phenotype to map QTLs. That is, we are looking to find any QTLs (cis or trans) that are responsible for variation in gene expression across samples.

We are using the same directory as our previous tutorial, which you can check by typing getwd() in the Console. If you are not in the correct directory, type in setwd("code") in the console or Session -> Set Working Directory -> Choose Directory in the RStudio menu to set your working directory to the main directory.

If you have started a new R session, you will need to load the libraries again. If you haven’t, the libraries listed below are already loaded.

Load Libraries

R

library(tidyverse)
library(knitr)
library(broom)
library(qtl2)

Load Data

In this lesson, we are loading in gene expression data for 21,771 genes: attie_DO500_expr.datasets.RData that includes normalised and raw gene expression data. dataset.islet.rnaseq is the dataset you can download directly from Dryad. Again, if you are using the same R session, you will not need to the load the mapping, phenotypes and genotype probabilities data, again.

R

# expression data
load("../data/attie_DO500_expr.datasets.RData")

WARNING

Warning in readChar(con, 5L, useBytes = TRUE): cannot open compressed file
'../data/attie_DO500_expr.datasets.RData', probable reason 'No such file or
directory'

ERROR

Error in readChar(con, 5L, useBytes = TRUE): cannot open the connection

R

# data from paper
load("../data/dataset.islet.rnaseq.RData")

WARNING

Warning in readChar(con, 5L, useBytes = TRUE): cannot open compressed file
'../data/dataset.islet.rnaseq.RData', probable reason 'No such file or
directory'

ERROR

Error in readChar(con, 5L, useBytes = TRUE): cannot open the connection

R

# phenotypes
load("../data/attie_DO500_clinical.phenotypes.RData")

WARNING

Warning in readChar(con, 5L, useBytes = TRUE): cannot open compressed file
'../data/attie_DO500_clinical.phenotypes.RData', probable reason 'No such file
or directory'

ERROR

Error in readChar(con, 5L, useBytes = TRUE): cannot open the connection

R

# mapping data
load("../data/attie_DO500_mapping.data.RData")

WARNING

Warning in readChar(con, 5L, useBytes = TRUE): cannot open compressed file
'../data/attie_DO500_mapping.data.RData', probable reason 'No such file or
directory'

ERROR

Error in readChar(con, 5L, useBytes = TRUE): cannot open the connection

R

# genotype probabilities
probs = readRDS("../data/attie_DO500_genoprobs_v5.rds")

WARNING

Warning in gzfile(file, "rb"): cannot open compressed file
'../data/attie_DO500_genoprobs_v5.rds', probable reason 'No such file or
directory'

ERROR

Error in gzfile(file, "rb"): cannot open the connection

Expression Data

Raw gene expression counts are in the counts data object. These counts have been normalised and saved in the norm data object. More information is about normalisation is here.

To view the contents of either one of these data , click the table on the right hand side either norm or counts in the Environment tab. If you type in names(counts), you will see the names all start with ENSMUSG. These are Ensembl IDs. If we want to see which gene these IDs correspond to, type in dataset.islet.rnaseq$annots, which gives information about each gene, including ensemble id, gene symbol as well as start & stop location of the gene and chromsome on which the gene lies.

Because we are working with the insulin tAUC phenotype, let’s map the expression counts for Hnf1b which is known to influence this phenotype is these data. First, we need to find the Ensembl ID for this gene:

R

dataset.islet.rnaseq$annots[dataset.islet.rnaseq$annots$symbol == "Hnf1b",]

ERROR

Error: object 'dataset.islet.rnaseq' not found

We can see that the ensembl ID of Hnf1b is ENSMUSG00000020679. If we check the distribution for Hnf1b expression data between the raw and normalised data, we can see there distribution has been corrected. Here is the distribution of the raw counts:

R

hist(counts$ENSMUSG00000020679, main = "Hnf1b")

ERROR

Error: object 'counts' not found

and here is the distribution of the normalised counts:

R

hist(norm$ENSMUSG00000020679, main = "Hnf1b")

ERROR

Error in norm$ENSMUSG00000020679: object of type 'closure' is not subsettable

The histogram indicates that distribution of these counts are normalised.

The Marker Map

We are using the same marker map as in the previous lesson

Genotype probabilities

We have explored this earlier in th previous lesson. But, as a reminder, we have already calculated genotype probabilities which we loaded above called probs. This contains the 8 state genotype probabilities using the 69k grid map of the same 500 DO mice that also have clinical phenotypes.

Kinship Matrix

We have explored the kinship matrix in the previous lesson. It has already been calculated and loaded in above.

Covariates

Now let’s add the necessary covariates. For Hnf1b expression data, let’s see which covariates are significant.

R

###merging covariate data and expression data to test for sex, wave and diet_days.

cov.counts <- merge(covar, norm, by=c("row.names"), sort=F)

ERROR

Error: object 'covar' not found

R

#testing covairates on expression data

tmp = cov.counts %>%
        dplyr::select(mouse, sex, DOwave, diet_days, ENSMUSG00000020679) %>%
        gather(expression, value, -mouse, -sex, -DOwave, -diet_days) %>%
        group_by(expression) %>%
        nest()

ERROR

Error: object 'cov.counts' not found

R

mod_fxn = function(df) {
  lm(value ~ sex + DOwave + diet_days, data = df)
}
tmp = tmp %>%
  mutate(model = map(data, mod_fxn)) %>%
  mutate(summ = map(model, tidy)) %>%
  unnest(summ) 

ERROR

Error: object 'tmp' not found

R

#  kable(tmp, caption = "Effects of Sex, Wave & Diet Days on Expression")

tmp

ERROR

Error: object 'tmp' not found

R

tmp %>%
  filter(term != "(Intercept)") %>%
  mutate(neg.log.p = -log10(p.value)) %>%
  ggplot(aes(term, neg.log.p)) +
    geom_point() +
    facet_wrap(~expression) +
    labs(title = "Significance of Sex, Wave & Diet Days on Expression") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
rm(tmp)

ERROR

Error: object 'tmp' not found

We can see that sex and DOwave are significant. Here DOwave is the group or batch number as not all mice were submitted for genotyping at the same time. Because of this, we now have to correct for it. Considering the paper included the covariate, diet_days, we will include that as well.

R

# convert sex and DO wave (batch) to factors
pheno_clin$sex = factor(pheno_clin$sex)

ERROR

Error: object 'pheno_clin' not found

R

pheno_clin$DOwave = factor(pheno_clin$DOwave)

ERROR

Error: object 'pheno_clin' not found

R

pheno_clin$diet_days = factor(pheno_clin$DOwave)

ERROR

Error: object 'pheno_clin' not found

R

covar = model.matrix(~sex + DOwave + diet_days, data = pheno_clin)[,-1]

ERROR

Error: object 'pheno_clin' not found

Performing a genome scan

[Permutations]

First, we need to work out the signifcance level. Let’s find the signifance level for 0.1, 0.05 and 0.01.

R

operm <- scan1perm(genoprobs = probs, 
                   pheno = norm[,"ENSMUSG00000020679", drop = FALSE], 
                   addcovar=covar,
                   n_perm=1000)

Note DO NOT RUN THIS (it will take too long). Instead, I have run it earlier and will load it in here. We will also perform a summary to find the summary level for 0.1, 0.05 and 0.01

R

load("../data/operm_ENSMUSG00000020679_1000.Rdata")

summary(operm,alpha=c(0.1, 0.05, 0.01))

Genome Scan

R

qtl = scan1(genoprobs = probs, 
            pheno = norm[,"ENSMUSG00000020679", drop = FALSE], 
            kinship = K, 
            addcovar = covar)

Next, we plot the genome scan.

R

plot_scan1(x = qtl, 
           map = map, 
           lodcolumn = "ENSMUSG00000020679",
           main = colnames(qtl))
           add_threshold(map,  summary(operm, alpha=0.1), col = 'purple')
           add_threshold(map,  summary(operm, alpha=0.05), col = 'red')
           add_threshold(map,  summary(operm, alpha=0.01), col = 'blue')

Finding LOD peaks

Let’s find LOD peaks

R

lod_threshold = summary(operm, alpha=0.01)
peaks = find_peaks(scan1_output = qtl, 
                   map = map, 
                   threshold = lod_threshold, 
                   peakdrop = 4, prob = 0.95)
kable(peaks %>% 
        dplyr::select(-lodindex) %>% 
        arrange(chr, pos), caption = "Phenotype QTL Peaks with LOD >= 6")

QTL effects

R

blup <- scan1blup(genoprobs=probs[,peaks$chr[1]], 
                  norm[,peaks$lodcolumn[1], drop=FALSE])

plot_coefCC(blup, 
       map=map, 
       columns=1:8,
       bgcolor="gray95", 
       legend="bottomleft",
       scan1_output = qtl )

Challenge 1:

Now choose another gene expression trait in norm data object and perform the same steps.
1). Check the distribution of the raw counts and normalised counts 2). Are there any sex, batch, diet effects?
3). Run a genome scan with the genotype probabilities and kinship provided.
4). Plot the genome scan for this gene.
5). Find the peaks above LOD score of 6.

Replace <ensembl id> with your choice of gene expression trait

R

#1).

hist(pheno_clin$<ensembl id>)

pheno_clin$<ensembl id>_log <- log(pheno_clin$<ensembl id>)

hist(pheno_clin$<ensembl id>_log)


#2).

tmp = pheno_clin %>%
        dplyr::select(mouse, sex, DOwave, diet_days, <ensembl id>) %>%
        gather(expression, value, -mouse, -sex, -DOwave, -diet_days) %>%
        group_by(expression) %>%
        nest()
mod_fxn = function(df) {
  lm(value ~ sex + DOwave + diet_days, data = df)
}
tmp = tmp %>%
  mutate(model = map(data, mod_fxn)) %>%
  mutate(summ = map(model, tidy)) %>%
  unnest(summ) 

tmp

tmp %>%
  filter(term != "(Intercept)") %>%
  mutate(neg.log.p = -log10(p.value)) %>%
  ggplot(aes(term, neg.log.p)) +
    geom_point() +
    facet_wrap(~expression) +
    labs(title = "Significance of Sex, Wave & Diet Days on Phenotypes") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
rm(tmp)

#3).

qtl = scan1(genoprobs = probs, 
           pheno = pheno_clin[,"<ensembl id>", drop = FALSE], 
        kinship = K, 
        addcovar = covar)

#4).

plot_scan1(x = qtl, map = map, lodcolumn = "<ensembl id>")
abline(h = 6, col = 2, lwd = 2)

#5). 

peaks = find_peaks(scan1_output = qtl, map = map, 
               threshold = lod_threshold, 
               peakdrop = 4, 
               prob = 0.95)

Key Points

  • Use .md files for episodes when you want static content
  • Use .Rmd files for episodes when you need to generate output
  • Run sandpaper::check_lesson() to identify any issues with your lesson
  • Run sandpaper::build_lesson() to preview your lesson locally