SNP association mapping
OverviewTeaching: 30 min
Exercises: 30 minQuestions
How do I identify SNPs in a QTL?Objectives
Perform a basic QTL analysis.
Identify QTL with a genome scan.
Find SNPs within a QTL.
Convert founder genotypes to a strain distribution pattern (SDP).
Infer SNP genotypes for Diversity Outbred mice.
Perform SNP association analysis
For multi-parent crosses, it can be useful to collapse the genotype or allele probabilities according to the founder genotypes of the various SNPs in the region of a QTL.
QTL analysis in Diversity Outbred mice
To illustrate this sort of SNP association analysis, we’ll consider some Diversity Outbred mouse data. The Diversity Outcross (DO) mice are an advanced intercross population derived from the same eight founder strains as the Collaborative Cross (CC). See Svenson et al. (2012) and Gatti et al. (2014).
We’ll consider a subset of the data from
Recla et al. (2014), available as part of the
qtl2data github repository. (The
full data are in
DO_Recla; the directory
DOex contains a reduced set, with just three chromosomes, one phenotype (
OF_immobile_pct, percent immobile in the open field test), and a
reduced set of markers.
You can download the data from a single zip file, as follows:
DOex <- read_cross2(file = "https://raw.githubusercontent.com/rqtl/qtl2data/master/DOex/DOex.zip")
Let’s quickly whip through a basic analysis.
We first calculate genotype probabilities and convert them to allele probabilities. We’ll just use marker locations and not insert any pseudomarkers.
pr <- calc_genoprob(DOex, error_prob=0.002) apr <- genoprob_to_alleleprob(pr)
We calculate kinship matrices (using the “LOCO” method, though with the caveat that here we are only considering genotypes on three chromosomes).
k <- calc_kinship(apr, "loco")
We create a numeric covariate for sex; be sure to include the individual IDs as names.
sex <- (DOex$covar$Sex == "male")*1 names(sex) <- rownames(DOex$covar)
Note that you can create the vector and assign the names in one step using the basic R function setNames().
sex <- setNames( (DOex$covar$Sex == "male")*1, rownames(DOex$covar) )
We perform a genome scan with a linear mixed model (adjusting for a residual polygenic effect), with sex as an additive covariate.
out <- scan1(apr, DOex$pheno, k, sex)
Here’s a plot of the results.
par(mar=c(4.1, 4.1, 0.6, 0.6)) plot(out, DOex$gmap)
There’s a strong peak on chromosome 2. Let’s look at the QTL effects. We estimate them with
scan1coef(). We need to subset the allele probabilities and the list of kinship matrices.
coef_c2 <- scan1coef(apr[,"2"], DOex$pheno, k[["2"]], sex)
For the DO, with 8 QTL alleles, we can use the function
plot_coefCC in the R/qtl2plot package, which plots the 8 allele effects in the “official” Collaborative Cross (CC)
colors. (Well, actually slightly modified colors, because I think the official colors are kind of ugly.) The strong locus seems to be mostly
due to the NZO allele. Note that
CCcolors is a vector of colors included in the qtl2plot package; there’s also a
with the official colors.
par(mar=c(4.1, 4.1, 0.6, 0.6)) plot_coefCC(coef_c2, DOex$gmap["2"], bgcolor="gray95", legend="bottomleft")
If you provide
plot_coefCC() with the genome scan output, it will display the LOD curve below the coefficient estimates.
par(mar=c(4.1, 4.1, 0.6, 0.6)) plot_coefCC(coef_c2, DOex$gmap["2"], scan1_output=out, bgcolor="gray95", legend="bottomleft")
Connecting to SNP and gene databases
To perform SNP association analysis in the region of a QTL, we’ll need to grab data on all of the SNPs in the region, including their genotypes in the eight founder strains for the CC and DO populations. As a related task, we’ll also want to identify the genes in the region.
We want R/qtl2 to permit different ways of storing this information. For the CC and DO populations, we’ve prepared SQLite database files for the variants in the CC founders and for the MGI mouse gene annotations. But others might wish to use a different kind of database, or may wish to query an online database.
We provide a template for how to use R/qtl2 to connect to SNP and gene databases, with the functions
create_gene_query_func(). Each returns a function for querying variant and gene databases, respectively. The query functions that are returned take just three arguments (chr, start, end end), and themselves return a data frame of variants on the one hand and genes on the other.
During setup you would have downloaded the SQLite databases from Figshare and placed these in your
- cc_variants.sqlite doi:10.6084/m9.figshare.5280229.v2, variants in the Collaborative Cross founders (3 GB)
- mouse_genes.sqlite doi:10.6084/m9.figshare.5280238.v4 full set of mouse gene annotations (677 MB)
- mouse_genes_mgi.sqlite doi:10.6084/m9.figshare.5286019.v5 just the MGI mouse gene annotations (11 MB)
To create a function for querying the CC variants, call
create_variant_query_func() with the path to the
query_variants <- create_variant_query_func("../data/cc_variants.sqlite")
To grab the variants in the interval 97-98 Mbp on chromosome 2, you’d then do the following:
variants_2_97.5 <- query_variants(2, 97, 98)
Similarly, to create a function for querying the MGI mouse gene annotations, you call
create_gene_query_func() with the path to the
query_genes <- create_gene_query_func("../data/mouse_genes_mgi.sqlite")
To grab the genes overlapping the interval 97-98 Mbp on chromosome 2, you’d then do the following:
genes_2_97.5 <- query_genes(2, 97, 98)
The way we’ve set this up is a bit complicated, but it allows greatest flexibility on the part of the user. And for our own work, we like to have a local SQLite database, for rapid queries of SNPs and genes.
Okay, now finally we get to the SNP associations. We have a large peak on chromosome 2, and we want to look at individual SNPs in the region of the locus.
Well, actually, we first need to find the location of the inferred QTL. The peak LOD score on chromosome 2 occurs at 52.4 cM. But to find nearby SNPs, we really want to know the Mbp position. The calculations were only performed at the marker positions, and so we can use
max(), giving both the
scan1() output and the physical map, and then pull out the position from the results.
peak_Mbp <- max(out, DOex$pmap)$pos
The marker is at 97.5 Mbp. We’ll focus on a 2 Mbp interval centered at 97.5 Mbp.
We can pull out the variants in the 2 Mbp interval centered at 97.5 on chr 2 using the query function we defined above:
variants <- query_variants(2, peak_Mbp - 1, peak_Mbp + 1)
There are 27737 variants in the interval, including 27492 SNPs, 101 indels, and 144 structural variants. We’re treating all of them as biallelic markers (all but the major allele in the eight founder strains binned into a single allele). In the following, we’re going to just say “SNP” rather than “variant”, even though the variants include indels and structural variants.
After identifying the variants in the interval of interest, we use our genotype probabilities and the founder SNP genotypes to infer the SNP genotypes for the DO mice. That is, at each SNP, we want to collapse the eight founder allele probabilities to two SNP allele probabilities, using the SNP genotypes of the founders.
We do this assuming that the genotype probabilities were calculated sufficiently densely that they can be assumed to be constant in intervals. With this assumption, we then:
- Find the interval for each SNP.
- Reduce the SNPs to a “distinct” set: if two SNPs have the same strain distribution pattern (SDP; the pattern of alleles in the eight founders) and are in the same interval, by our assumption their allele probabilities will be the same.
- Take the average of the allele probabilities at the two endpoints of each interval.
- Collapse the 8 allele probabilities to two according to each observed SDP in the interval.
- We further create a look-up table relating the full set of SNPs to the reduced set (one of each observed SDP in each interval).
All of these steps are combined into a single function
scan1snps(), which takes the genotype probabilities, a physical map of those locations, the phenotype, the kinship matrix, covariates, the query function for grabbing the SNPs in a given interval, and the chromosome, start, and end positions for the interval. (If
insert_pseudomarkers() was used to insert pseudomarkers, you will need to use
interp_map() to get interpolated Mbp positions for the pseudomarkers.)
out_snps <- scan1snps(pr, DOex$pmap, DOex$pheno, k[["2"]], sex, query_func=query_variants, chr=2, start=peak_Mbp-1, end=peak_Mbp+1, keep_all_snps=TRUE)
The output is a list with two components:
lod is a matrix of LOD scores (with a single column, since we’re using just one phenotype), and
snpinfo is a data frame with SNP information. With the argument
snpinfo data frame contains information about all of the variants in the region with an index column indicating the equivalence classes.
plot_snpasso() can be used to plot the results, with points at each of the SNPs. The default is to plot all SNPs. In this case, there are 27737 variants in the region, but only 150 distinct ones.
par(mar=c(4.1, 4.1, 0.6, 0.6)) plot_snpasso(out_snps$lod, out_snps$snpinfo)
We can actually just type plot() rather than plot_snpasso(), because with the snpinfo table in place of a genetic map, plot() detects that a SNP association plot should be created.
par(mar=c(4.1, 4.1, 0.6, 0.6)) plot(out_snps$lod, out_snps$snpinfo)
We can use our
query_genes() function to identify the genes in the region, and
plot_genes() to plot their locations. But also
plot_snpasso() can take the gene locations with the argument genes and then display them below the SNP association results. Here, we are also highlighting the top SNPs in the SNP association plot using the
drop_hilit argument. SNPs with LOD score within
drop_hilit of the maximum are shown in pink.
genes <- query_genes(2, peak_Mbp - 1, peak_Mbp + 1) par(mar=c(4.1, 4.1, 0.6, 0.6)) plot(out_snps$lod, out_snps$snpinfo, drop_hilit=1.5, genes=genes)
To get a table of the SNPs with the largest LOD scores, use the function
top_snps(). This will show all SNPs with LOD score within some amount (the default is 1.5) of the maximum SNP LOD score. We’re going to display just a subset of the columns.
top <- top_snps(out_snps$lod, out_snps$snpinfo) print(top[,c(1, 8:15, 20)], row.names=FALSE)
snp_id A_J C57BL_6J 129S1_SvImJ NOD_ShiLtJ NZO_HlLtJ rs33297153 1 1 2 1 2 rs226751615 1 1 1 1 2 rs230653109 1 1 1 1 2 rs250407660 1 1 1 1 2 rs248216267 1 1 1 1 2 rs237564299 1 1 1 1 2 rs255011323 1 1 1 1 2 rs212448521 1 1 1 1 2 rs27379206 1 1 1 1 2 rs27379194 1 1 1 1 2 rs244484646 1 1 1 1 2 rs215855380 1 1 1 1 2 rs236942088 1 1 1 1 2 rs212414861 1 1 1 1 2 rs229578122 1 1 1 1 2 rs254318131 1 1 1 1 2 rs217679969 1 1 1 1 2 rs238404461 1 1 1 1 2 rs262749925 1 1 1 1 2 rs231282689 1 1 1 1 2 rs260286709 1 1 1 1 2 rs27396282 1 1 1 1 2 rs263841533 1 1 1 1 2 rs231205920 1 1 1 1 2 rs242885221 1 1 1 1 2 rs586746690 1 1 2 1 2 rs49002164 1 1 2 1 2 SV_2_96945406_96945414 1 1 1 1 2 rs244595995 1 1 2 1 2 SV_2_96993116_96993118 1 1 1 1 2 CAST_EiJ PWK_PhJ WSB_EiJ lod 3 1 1 8.213182 2 1 1 7.949721 2 1 1 7.949721 2 1 1 7.949721 2 1 1 7.949721 2 1 1 7.949721 2 1 1 7.949721 2 1 1 7.949721 2 1 1 7.949721 2 1 1 7.949721 2 1 1 7.949721 2 1 1 7.949721 2 1 1 7.949721 2 1 1 7.949721 2 1 1 7.949721 2 1 1 7.949721 2 1 1 7.949721 2 1 1 7.949721 2 1 1 7.949721 2 1 1 7.949721 2 1 1 7.949721 2 1 1 7.949721 2 1 1 7.949721 2 1 1 7.949721 2 1 1 7.949721 2 1 1 8.631497 2 1 1 8.631497 3 1 1 8.280942 2 1 1 8.631497 2 1 1 8.280942
The top SNPs all have NZO and CAST with a common allele, different from the other 6 founders. The next-best SNPs have NZO, CAST, and 129 with a common allele, and then there’s a group of SNPs where NZO has a unique allele.
scan1snps() function can also be used to perform a genome-wide SNP association scan, by providing a variant query function but leaving chr, start, and end unspecified. In this case it’s maybe best to use keep_all_snps=FALSE (the default) and only save the index SNPs.
out_gwas <- scan1snps(pr, DOex$pmap, DOex$pheno, k, sex, query_func=query_variants, cores=0)
We can make a Manhattan plot of the results as follows. We use
altcol to define a color for alternate chromosomes and
gap=0 to have no gap between chromosomes.
par(mar=c(4.1, 4.1, 0.6, 0.6)) plot(out_gwas$lod, out_gwas$snpinfo, altcol="green4", gap=0)
Note that while there are LOD scores on the X chromosome that are as large as those on chromosome 2, we’re allowing a genotype × sex interaction on the X chromosome and so the test has 3 degrees of freedom (versus 2 degrees of freedom on the autosomes) and so the LOD scores will naturally be larger. If we’d used the allele probabilities (apr above, calculated from genoprob_to_alleleprob()) rather than the genotype probabilities, we would be performing a test with 1 degree of freedom on both the X chromosome and the autosomes.