Mapping Many Gene Expression Traits
Last updated on 2025-01-07 | Edit this page
Overview
Questions
- How do I map all of the genes in my data set?
- What resources do I need to map all of the genes in my data set?
Objectives
- To map several genes at the same time
Mapping All Genes
Most people have laptops with enough memory and computing power to map hundreds of genes. However, we have over 20,000 genes in our data set. This would either take a very long time or would use up all of the memory on most peoples’ laptops.
To map all of the genes, you will need to use a computing cluster, such as sumner2 at JAX. Since mapping all of the genes took 12 hours, you will not perform this operation in this workshop. Instead, we will show you how to set up a computing job to run this analysis.
sumner2 uses slurm to schedule
computing jobs. slurm
is software that runs on the cluster
and manages how computing jobs are run. Users submit job requests and
slurm
checks whether the requested memory, number of
processors, and time are available and then decides when to run the job.
It also balances the needs of different users so that everyone can get
their computing jobs done.
sumner2 also uses software containers to run software. JAX
uses Singularity
containers. These contain software without the need to install it on the
cluster. The Computational Sciences group makes a large number of
containers available on sumner2 and we will use their
qtl2
container.
If you are unfamiliar with the JAX computing cluster, Research IT has
excellent
documentation that explains how the cluster works, how to submit
jobs using slurm
, and how to use Singularity
containers.
bash Script
To run a batch job on the cluster, we will create a bash script which
will request resources from slurm
and will call an R
script, which will perform the mapping. The script that we used is shown
below.
#!/bin/bash
#SBATCH --qos batch
#SBATCH --partition compute
#SBATCH --nodes 1
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 22
#SBATCH --mem 200G
#SBATCH --time 1-0:00
CONTAINER=/projects/omics_share/meta/containers/quay.io-jaxcompsci-r-qtl2-deseq-biomart-tidy-v4.img
module load singularity
singularity exec ${CONTAINER} Rscript eqtl.R
For each data set, you will need to determine the number of cores to
use. qtl2
can perform the QTL mapping in parallel on one
node, so you will typically request only one node. In this case, we
requested 22 cores on one node. sumner2 has nodes with up to 70
cores, but there is a point of diminishing returns when parallelizing
too much. We performed a short test run and found that we used about
100GB of memory, so we requested twice that amount in case there were
memory surges during computation. And we requested one day (24 hours) of
compute time.
Overall, the job took 12 hours and used 100GB of memory.
Note that there are many ways to set up a computing job, including
starting many separate mapping jobs and combining the results. We have
chosen to show you one simple way in this course. If you have
slurm
expertise, you can try other methods of breaking up
the work.
Next, we call the R script using singularity exec
to
execute the command which follows the container name. We call
Rscript
to run the eQTL mapping script, which we placed in
the same directory as the bash script.
The inputs to the script are the genoprobs, expression data, covariates, kinship matrix, and the marker map. You should have all of these pre-computed and saved on the cluster in the same directory.
R Script
################################################################################
# Script to map all of the Attic DO 500 pancreatic islet RNASeq data.
#
# Daniel Gatti
# dan.gatti@jax.org
# 2024-11-01
################################################################################
##### LIBRARIES #####
library(qtl2)
##### VARIABLES #####
# Base directory for project.
base_dir = '/projects/compsci/eqtl_course'
# Number of cores. MUST MATCH SLURM REQUEST.
n_cores = 20
# Read in pre-built data files.
probs = readRDS(file.path(base_dir, 'attie_DO500_genoprobs_v5.rds'))
K = readRDS(file.path(base_dir, 'attie_do_kinship.rds'))
covar = readRDS(file.path(base_dir, 'attie_do_covar.rds'))
expr = readRDS(file.path(base_dir, 'attie_do_expr_rz.rds'))
map = readRDS(file.path(base_dir, 'attie_do_map.rds'))
##### MAIN #####
# Convert covariates to factors.
covar$sex = factor(covar$sex)
covar$DOwave = factor(covar$DOwave)
# Create a matrix of additive covariates.
addcovar = model.matrix(~sex + DOwave, data = covar)[,-1]
# Map all genes.
lod = scan1(genoprobs = probs,
pheno = expr,
kinship = K,
addcovar = addcovar,
cores = n_cores)
# Save the LOD scores for all genes.
saveRDS(lod, file = file.path(base_dir, 'attie_do_eqtl_lod.rds'))
# Get the peaks with LOD >= 6.
peaks = find_peaks(scan1_output = lod,
map = map,
threshold = 6,
prob = 0.95)
# Save the harvested LOD peaks.
saveRDS(peaks, file = file.path(base_dir, 'attie_do_eqtl_peaks.rds'))
The output of this script is two files.
-
attie_do_eqtl_lod.rds
: This is numeric matrix containing LOD scores for all genes at all markers. We save it as a compressed R data file (*.rds
). In this case, the file is 11 GB. -
attie_do_eqtl_peaks.rds
: This is a data.frame containing the peaks with LOD over 6. We save it as a compressed R data file (*.rds
). In this case, the file is 674 KB. You should have downloaded this file into yourdata
directory.
Finding QTL Peaks
Let’s load in the QTL peaks that we pre-computed.
R
peaks <- readRDS(file = "data/attie_do_eqtl_peaks.rds")
peaks
is a data frame which contains all of the peaks in
the same format as we saw in the previous lesson. Let’s remind ourselves
what the peaks table looks like.
R
head(peaks)
OUTPUT
lodindex lodcolumn chr pos lod ci_lo ci_hi
1 2 ENSMUSG00000000028 6 71.73759 7.402989 65.005336 77.42623
2 2 ENSMUSG00000000028 11 40.55159 6.028608 3.031502 119.01535
3 2 ENSMUSG00000000028 16 18.54468 7.875773 15.918386 20.81798
4 3 ENSMUSG00000000037 5 99.00918 6.837536 93.289032 99.67072
5 3 ENSMUSG00000000037 12 77.55794 6.153038 77.002567 110.22383
6 3 ENSMUSG00000000037 X 161.30209 26.698858 160.417579 161.80999
Each row contains information for one peak, including the gene ID, chromosome, peak position, LOD, and support interval.
Notice that both genes ENSMUSG00000000028
and
ENSMUSG00000000037
have more than one peak.
Challenge 1: How many genes have QTL peaks?
Look at how many genes we have and how many rows there are in peaks. Is this the same number? If not, can you explain why the numbers are different.
Get the number of genes from expr_rz
.
R
ncol(expr_rz)
OUTPUT
[1] 21771
Get the number of QTL peaks in peaks
.
R
nrow(peaks)
OUTPUT
[1] 42941
There are more peaks than there are genes.
There are several different possibilities when mapping a gene.
- A gene may have no QTL peaks over a LOD of 6, and so it will not appear in the list of peaks.
- A gene may have more than one QTL peak with a LOD greater than 6, and so will appear many times in the list of peaks.
Let’s count the number of peaks that each gene has.
R
dplyr::count(peaks, lodcolumn) |>
dplyr::count(n)
OUTPUT
n nn
1 1 9201
2 2 6755
3 3 3789
4 4 1425
5 5 483
6 6 87
7 7 25
8 8 4
9 9 1
10 10 1
From the table above, we can see that most genes have one or two peaks. However, some genes have 9 peaks with LOD > 6!
When we built this table, we selected all of the peaks with LOD > 6. This is well below the alpha = 0.05 significance threshold. Let’s get that threshold and filter the list of peaks.
R
ethr <- summary(object = eperm,
alpha = 0.05)
ethr
OUTPUT
LOD thresholds (1000 permutations)
ENSMUSG00000020679
0.05 7.47
Now filter the peaks to retain peaks with LOD greater than or equal to 7.4745761.
R
peaks_filt <- filter(peaks, lod >= ethr[1,1])
Challenge 2: How many peaks are there after filtering?
Get the number of QTL peaks that we have after filtering.
R
nrow(peaks_filt)
OUTPUT
[1] 17164
There are now fewer peaks.
Challenge 3: How many peaks does each gene have?
Use the code above to count the number of QTL peaks that each gene has.
R
dplyr::count(peaks_filt, lodcolumn) |>
dplyr::count(n)
OUTPUT
n nn
1 1 12707
2 2 1959
3 3 151
4 4 19
5 5 2
Now most genes have only one peak and only two genes have five peaks.
Let’s look at the number of genes with LOD scores above a certain value. We do this by counting the number of peaks with LOD > 6, then > 7, etc.
Callout
Do not try to type the next block. We will look at it on the workshop website.
R
lod_brks <- 1:100
results <- data.frame(lod_brks, n = 0)
for(i in lod_brks) {
results$n[i] <- sum(peaks$lod >= i)
} # for(i)
results |>
filter(lod_brks >= 6) |>
ggplot(aes(lod_brks, n)) +
geom_line(linewidth = 1.25) +
scale_x_continuous(breaks = 0:10 * 10) +
labs(title = "Number of Peaks above LOD Score",
x = "LOD",
y = "Number of Peaks with LOD >= x") +
theme(text = element_text(size = 20))
R
rm(lod_brks, results)
As you can see from the plot above, most peaks have LODs less than 10. However, about 12,000 peaks have LODs over 10. As we increase the LOD score, we see that fewer genes have high LODs. Let’s look at how many peaks each gene has as we increase the LOD threshold.
Callout
Do not try to type the next block. We will look at it on the workshop website.
R
lod_brks <- 0:100
result <- data.frame(lod_brks = -1, n = 0, nn = 0)
for(i in lod_brks) {
smry <- peaks |>
filter(lod > i) |>
dplyr::count(lodcolumn) |>
dplyr::count(n) |>
mutate(lod_brks = i)
result <- bind_rows(result, smry)
} # for(i)
result |>
filter(lod_brks >= 6) |>
mutate(n = as.character(n)) |>
ggplot(aes(lod_brks, nn, color = n)) +
geom_line(aes(group = n), size = 1.25) +
geom_point() +
scale_x_continuous(breaks = 0:10 * 10) +
labs(title = "Number of Peaks per Gene by LOD Threshold",
x = "LOD",
y = "Number of Peaks per Gene",
color = "Number of Peaks") +
theme(text = element_text(size = 20))
R
rm(lod_brks, result)
From this plot, we can see that higher LOD scores occur only once per gene. This makes sense since a high LOD score implies that the peak explains a large proportion of the gene’s variance.
Summary
In this episode, you learned how to run an eQTL analysis on a computing cluster and started to look at the resulting peaks. When we filter the peaks by higher LOD scores, we tend to get fewer peaks per gene.
Key Points
- Mapping all genes in a study requires a computing cluster.
- Genes may have more than one QTL peak.
- High LOD scores often occur only once per gene.