Performing a genome scan with a linear mixed model
Overview
Teaching: 20 min
Exercises: 10 minQuestions
How do I use a linear mixed model in a genome scan?
How do different mapping and kinship calculation methods differ?
Objectives
Create a genome scan with a linear mixed model.
Compare LOD plots for Haley-Knott regression and linear mixed model methods.
Compare LOD plots for the standard kinship matrix with the leave-one-chromosome-out (LOCO) method.
Genetic mapping in mice presents a good example of why accounting for population structure is important. Laboratory mouse strains are descended from a small number of founders (fancy mice) and went through several population bottlenecks. Wild-derived strains are not descended from fancy mice and don’t share the same history as laboratory strains. Linear mixed models were developed to solve problems with population structure created by differing ancestries, and to handle relatedness between individuals. Linear mixed models (LMMs) consider genome-wide similarity between all pairs of individuals to account for population structure, known kinship and unknown relatedness. Linear mixed models in mapping studies can successfully correct for genetic relatedness between individuals in a population by incorporating kinship into the model. Earlier we calculated a kinship matrix for input to a linear mixed model to account for relationships among individuals. For a current review of mixed models in genetics, see this preprint of Martin and Eskin, 2017.
Simple linear regression takes the form
y = μ + βX + ε
which describes a line with slope β and y-intercept μ. The error (or residual) is represented by ε.
To model data from a cross, we use
where yj is the phenotype of the jth individual, μ is the mean phenotype value, βk is the effect of the kth genotype, Xjk is the genotype for individual j, and εj is the error for the jth individual. In the figure below, μ equals 94.6, and β equals 15.4 for the alternative hypothesis (QTL exists). This linear model is y = 94.6 + 15.4X + ε. The model intersects the genotype groups at their group means. In contrast, the null hypothesis would state that there is no difference in group means (no QTL anywhere). The linear model for the null hypothesis would be y = 94.6 + 0X + ε. This states that the phenotype is equal to the combined mean (94.6), plus some error (ε). In other words, genotype doesn’t affect the phenotype.
The linear models above describe the relationship between genotype and phenotype but are inadequate for describing the relationship between genotype and phenotype in large datasets. They don’t account for relatedness among individuals. In real populations, the effect of a single genotype is influenced by many other genotypes that affect the phenotype. A true genetic model takes into account the effect of all variants on the phenotype.
To model the phenotypes of all individuals in the data, we can adapt the simple linear model to include all individuals and their variants so that we capture the effect of all variants shared by individuals on their phenotypes.
Now, y represents the phenotypes of all individuals. The effect of the ith genotype on the phenotype is βi, the mean is μ times 1, (mean multiplied by a vector of 1s) and the error is ε. Here, the number of genotypes is M.
To model the effect of all genotypes and to account for relatedness, we test the effect of a single genotype while bringing all other genotypes into the model.
βk is the effect of the genotype Xk, and Σi≠kβiXi sums the effects of all other genotypes except genotype k. For the leave one chromosome out (LOCO) method, βkXk is the effect of genotypes on chromosome k, and βiXi represents effect of genotypes on all other chromosomes.
If the sample contains divergent subpopulations, SNPs on different chromosomes will be correlated because of the difference in allele frequencies between subpopulations caused by relatedness. To correct for correlations between chromosomes, we model all genotypes on the other chromosomes when testing for the association of a SNP.
To perform a genome scan using a linear mixed model you also use the function scan1
; you just need to provide the argument kinship
, a kinship matrix (or, for the LOCO method, a list of kinship matrices).
out_pg <- scan1(pr, iron$pheno, kinship=kinship, Xcovar=Xcovar)
Again, on a multi-core machine, you can get some speed-up using the cores
argument.
out_pg <- scan1(pr, iron$pheno, kinship, Xcovar=Xcovar, cores=4)
If, for your linear mixed model genome scan, you wish to use the “leave one chromosome out” (LOCO) method (scan each chromosome using a kinship matrix that is calculated using data from all other chromosomes), use type="loco"
in the call to calc_kinship()
.
kinship_loco <- calc_kinship(pr, "loco")
For the LOCO (leave one chromosome out) method, provide the list of kinship matrices as obtained from calc_kinship()
with method="loco"
.
out_pg_loco <- scan1(pr, iron$pheno, kinship_loco, Xcovar=Xcovar)
To plot the results, we again use plot_scan1()
.
Here is a plot of the LOD scores by Haley-Knott regression and the linear mixed model using either the standard kinship matrix or the LOCO method.
You can use the code below to generate overlaid plots for each method.
plot_scan1(out_pg_loco, map = map, lodcolumn = "liver", col = "black")
plot_scan1(out_pg, map = map, lodcolumn = "liver", col = "blue", add = TRUE)
plot_scan1(out, map = map, lodcolumn = "liver", col = "green", add = TRUE)
For the liver phenotype (top panel), the three methods give quite different results. The linear mixed model with an overall kinship matrix gives much lower LOD scores than the other two methods. On chromosomes with some evidence of a QTL, the LOCO method gives higher LOD scores than Haley-Knott, except on chromosome 16 where it gives lower LOD scores.
For the spleen phenotype (bottom panel), the linear mixed model with an overall kinship matrix again gives much lower LOD scores than the other two methods. However, in this case Haley-Knott regression and the LOCO method give quite similar results.
Challenge 1
Earlier you inserted pseudomarkers for the
grav
data and saved the results to an object calledgravmap
. Then you calculated genotype probabilities and saved the results to an object calledgravpr
.
1). Calculate kinship for thegrav
data using the LOCO method.
2). Run a genome scan with the genotype probabilities and kinship that you calculated.
3). Find the maximum LOD score for the scan usingwhich(out_grav == maxlod(out_grav), arr.ind = TRUE)
.
4). Plot the genome scan for this phenotype (hint: use the column number as lodcolumn).Solution to Challenge 1
1).
grav_kinship <- calc_kinship(gravpr, "loco")
2).out_grav <- scan1(genoprobs = gravpr, pheno = grav$pheno, kinship = grav_kinship)
3).which(out_grav == maxlod(out_grav), arr.ind = TRUE)
row 166, col 133 4).plot(out_grav, lodcolumn = 133, map = gravmap)
Challenge 2
What are the benefits and disadvantages of the three methods for genome scanning (Haley-Knott regression, kinship matrix, and leave-one-chromosome out (LOCO)?)
Which method would you use to scan, and why?
Think about the advantages and disadvantages of each, discuss with a neighbor, and share your thoughts in the collaborative document.Solution to Challenge 2
file <- paste0("https://raw.githubusercontent.com/rqtl/",
"qtl2data/master/B6BTBR/b6btbr.zip")
b6btbr <- read_cross2(file)
Challenge 3
Pair programming exercise: with your partner, review and carry out all of the steps in QTL mapping that we have covered so far, using a new data set. One of you types the code, the other explains what needs to happen next, finds the relevant code in the lesson, suggests new names for objects (i.e. NOT the ones you’ve already used, such as “map”, “pr”, “out”, etc.).
- Run the code above to load the B6 x BTBR intercross data into an object called b6btbr.
- Insert pseudomarkers and calculate genotype probabilities.
- Run a genome scan for the log10_insulin_10wk phenotype.
- Calculate a kinship matrix.
- Calculate a list of kinship matrices with the LOCO method.
- Run genome scans with the regular kinship matrix and with the list of LOCO matrices.
- Plot the 3 different genome scans in a single plot in different colors.
- Which chromosomes appear to have peaks with a LOD score greater than 4? Which methods identify these peaks? Which don’t?
Solution to Challenge 3
file <- paste0("https://raw.githubusercontent.com/rqtl/", "qtl2data/master/B6BTBR/b6btbr.zip")
b6btbr <- read_cross2(file)
summary(b6btbr)
head(b6btbr$pheno)
colnames(b6btbr$pheno)
b6bmap <- insert_pseudomarkers(map=b6btbr$gmap, step=1)
prb6b <- calc_genoprob(cross=b6btbr, map=b6bmap, error_prob=0.002)
b6bXcovar <- get_x_covar(b6btbr)
b6bout <- scan1(genoprobs = prb6b, pheno = b6btbr$pheno, Xcovar=b6bXcovar)
plot(b6bout, map = b6bmap)
b6bkinship <- calc_kinship(probs = prb6b)
out_pg_b6b <- scan1(prb6b, b6btbr$pheno, kinship=b6bkinship, Xcovar=b6bXcovar)
kinship_loco_b6b <- calc_kinship(prb6b, "loco")
out_pg_loco_b6b <- scan1(prb6b, b6btbr$pheno, kinship_loco_b6b, Xcovar=b6bXcovar)
plot_scan1(out_pg_loco_b6b, map = b6bmap, lodcolumn = "log10_insulin_10wk", col = "black")
plot_scan1(out_pg_b6b, map = b6bmap, lodcolumn = "log10_insulin_10wk", col = "blue", add = TRUE)
plot_scan1(b6bout, map = b6bmap, lodcolumn = "log10_insulin_10wk", col = "green", add = TRUE)
Key Points
To perform a genome scan with a linear mixed model, supply a kinship matrix.
Different mapping and kinship calculation methods give different results.