Content from Introduction to the Data Set


Last updated on 2024-10-21 | Edit this page

Overview

Questions

  • What data will we be using in this workshop?

Objectives

  • Understand the experimental design of the data set.
  • Understand the goals of the experiment.

Introduction


In the first part of this lesson, we will be analyzing data from a mouse experiment involving Type 2 diabetes (T2D). There are two types of diabetes: type 1, in which the immune system attacks insulin-secreting cells and prevents insulin production, and type 2, in which the pancreas makes less insulin and the body becomes less responsive to insulin.

Figure showing Type 2 diabetes & insulin. Created in BioRender.com

This study is from Tian et al and involves an intercross between the diabetes-resistant C57BL/6J (B6 or B) strain and the diabetes-susceptible BTBR T+ tf/J (BTBR or R) strain mice carrying a Leptinob/ob mutation.

Figure showing intercross breeding design.

The mutation causes the mice to not produce leptin, a hormone that regulates hunger and satiety. When leptin levels are low (or missing), the body does not receive satiety signals and continues to feel hunger. Leptinob/ob mice continue to eat and become obese. Obesity is one of the risk factors for T2D and this experiment sought to use genetic variation between B6 and BTBR strains to identify genes which influence T2D.

This study measured insulin and glucose levels in mice at 10 weeks, at which time the mice were euthanized. After euthanasia, the author’s harvested six tissues, adipose, gastrocnemius muscle, hypothalamus, pancreatic islets, kidney, and liver, and measured transcript levels via gene expression microarray.

In this study, we will analyze circulating insulin levels and pancreatic islet gene expression. We will map circulating insulin levels to identify genomic loci which influence insulin levels. We will then use the pancreatic islet gene expression data to identify candidate genes.

Key Points

  • Leptinob/ob mice do now produce insulin and become obese due to overeating.
  • This study crossed mice carrying the Leptinob/ob mutation in C57BL/6J and BTBR T+ tf/J.
  • C57BL/6J mice are resistant to diabetes and BTBR mice are susceptible.
  • By crossing these two strains, the authors aimed to identify genes which influence susceptibility to T2D.

Content from Input File Format


Last updated on 2024-10-21 | Edit this page

Overview

Questions

  • How are the data files formatted for qtl2?
  • Which data files are required for qtl2?
  • Where can I find sample data for mapping with the qtl2 package?

Objectives

  • To specify which input files are required for qtl2 and how they should be formatted.
  • To locate sample data for qtl mapping.

QTL mapping data consists of a set of tables of data: sample genotypes, phenotypes, marker maps, etc. These different tables are in different comma-separated value (CSV) files. In each file, the first column is a set of IDs for the rows, and the first row is a set of IDs for the columns. For example, the genotype data file will have individual IDs in the first column, marker names for the rest of the column headers.

Table showing the mouse genotypes as BB, BR, and RR.
Attie Sample Genotypes

The sample genotype file above shows two alleles: B and R. These represent the founder strains for an intercross, which are C57BL/6 (BB) and BTBR (RR) Tian et al. The B and R alleles themselves represent the haplotypes inherited from the parental strains C57BL/6 and BTBR.

In the Diversity Outbred (DO) and Collaborative Cross (CC), alleles A to H represent haplotypes of the 8 founder strains.

Figure showing the colors and letter codes of the CC/DO founders.
CC and DO Founder Alleles

CC lines have very low heterozygosity throughout their genomes. For most loci, CC lines will be homozygous for one of the founder strains A-H above, and as such will have one of only 8 genotypes (e.g. AA, BB, CC, …). In contrast, DO mice (not lines) have high heterozygosity throughout their genomes. Each locus will have one of 36 possible genotypes (e.g. AA, AB, AC, …, BB, BC, BD,…).

Figure showing CC and DO genomes
Example Collaborative Cross and Diversity Outbred genomes

For the purposes of learning QTL mapping, this lesson begins with an intercross that has only 3 possible genotypes instead of 8 or 36. Once we have learned how to use qtl2 for the simpler case, we will advance to the most complex case involving mapping in DO mice.

R/qtl2 accepts the following files:
1. genotypes. 2. phenotypes. 3. phenotype covariates (i.e. tissue type, time points)
4. genetic map
5. physical map (optional)
6. control file (YAML or JSON format, not CSV).

We use both a genetic marker map and a physical map (if available). A sample from a genetic map of SNP markers is shown here.

Table showing the marker, chromosome, and centimorgan position for five markers
Attie Genetic Map

A physical marker map provides location in bases rather than centiMorgans.

Table showing top five rows of physical marker map.
Attie Physical Map

Numeric phenotypes are separate from the often non-numeric covariates.

Table showing top five rows of phenotype table, including insulin
Attie Phenotypes

Phenotype covariates are metadata describing the phenotypes. For example, in the case of a phenotype measured over time, one column in the phenotype covariate data could be the time of measurement. For gene expression data, we would have columns representing chromosome and physical position of genes, as well as gene IDs. The covariates shown below include sex and parental grandmother (pgm).

Table showing the top five rows of covariates table
Attie Covariates

In addition to the set of CSV files with the primary data, we need a separate control file with various control parameters (or metadata), including the names of all of the other data files and the genotype codes used in the genotype data file. The control file is in a specific format using either YAML or JSON; these are human-readable text files for representing relatively complex data.

Attie Control File{alt=‘Figure showing the qtl2 control file’, width=75%}

A big advantage of this control file scheme is that it greatly simplifies the function for reading in the data. That function, read_cross2(), has a single argument: the name (with path) of the control file.

For further details, see the separate vignette on the input file format.

Challenge 1: What data does qtl2 need?

  1. Which data files are required by qtl2?
  2. Which ones are optional?
  3. How should they be formatted?
  1. marker genotypes, phenotypes, genetic map
  2. physical map
  3. csv; JSON or YAML for control file

Sample data sets


In this lesson, we will not work with data sets included in the qtl2 package, though you may want to explore them to learn more. You can find out more about the sample data files from the R/qtl2 web site. Zipped versions of these datasets are included with the qtl2geno package and can be loaded into R using the read_cross2() function. Additional sample data sets, including data on Diversity Outbred (DO) mice, are available at https://github.com/rqtl/qtl2data.

Challenge 2: Additional R/qtl2 datasets

Go to https://github.com/rqtl/qtl2data to view additional sample data.
1). Find the Recla data and locate the phenotype data file. Open the file by clicking on the file name. What is in the first column? the first row?
2). Locate the genotype data file, click on the file name, and view the raw data. What is in the first column? the first row?
3). Locate the covariates file and open it by clicking on the file name. What kind of information does this file contain?
4). Locate the control file (YAML or JSON format) and open it. What kind of information does this file contain?

1). What is in the first column of the phenotype file? Animal ID. The first row? Phenotype variable names - OF_distance_first4, OF_distance, OF_corner_pct, OF_periphery_pct, …
2). What is in the first column of the genotype file? marker ID. the first row? Animal ID - 1,4,5,6,7,8,9,10, …
3). Locate the covariates file and open it. What kind of information does this file contain? Animal ID, sex, cohort, group, subgroup, ngen, and coat color.
4). Locate the control file (YAML or JSON format) and open it. What kind of information does this file contain? Names of primary data files, genotype and allele codes, cross type, description, and other metadata.

Preparing your Diversity Outbred (DO) data for qtl2


Karl Broman provides detailed instructions for preparing DO mouse data for R/qtl2.

Key Points

  • QTL mapping data consists of a set of tables of data: marker genotypes, phenotypes, marker maps, etc.
  • These different tables are in separate comma-delimited (CSV) files.
  • In each file, the first column is a set of IDs for the rows, and the first row is a set of IDs for the columns.
  • In addition to primary data, a separate file with control parameters (or metadata) in either YAML or JSON format is required.
  • Published and public data already formatted for QTL mapping are available on the web.
  • These data can be used as a model for formatting your own QTL data.

Content from Calculating Genotype Probabilities


Last updated on 2024-10-21 | Edit this page

Overview

Questions

  • How do I calculate QTL at positions between genotyped markers?
  • How do I calculate QTL genotype probabilities?
  • How do I calculate allele probabilities?
  • How can I speed up calculations if I have a large data set?

Objectives

  • To explain why the first step in QTL analysis is to calculate genotype probabilities.
  • To calculate genotype probabilities.

The first task in QTL analysis is to calculate conditional genotype probabilities, given the observed marker data, at each putative QTL position. For example, the first step would be to determine the probabilities for genotypes BR and RR at the locus indicated below.

Genotype probabilities must be calculated between typed markers; adapted from Broman & Sen, 2009{‘a chromosome with two typed markers labeled BR and RR with a locus of unknown genotype between them’}

The calc_genoprob() function calculates QTL genotype probabilities conditional on the available marker data. These are needed for most of the QTL mapping functions. The result is returned as a list of three-dimensional arrays (one per chromosome). Each 3d array of probabilities is arranged as individuals \(\times\) genotypes \(\times\) positions.

Figure showing three-dimensional array of genotype probabilities (genoprobs)
Three-dimensional genotype probabilities array

See this page for a graphical review of data structures in R{‘a web page showing R data structures including one-dimensional vectors and lists, two dimensional dataframes and matrices, and n-dimensional arrays’}.

We’ll use the Attie BL6/BTBR dataset from Tian et al (an intercross) as an example. In this study, circulating insulin levels were measured in an F2 cross between mouse strains C57BL/6J and BTBTR T+ .

First, we will load in the qtl2 library, which provides the functions that we will use for QTL analysis.

R

suppressPackageStartupMessages(library(qtl2))

R

cross <- read_cross2(file = 'data/attie_b6btbr_grcm39/attie_control.json')

To load your own data from your machine, you would use the file path to your data files. For example, if the file path to your data files is /Users/myUserName/qtlProject/data, the command to load your data would look like this:

R

myQTLdata <- read_cross2(file = "/Users/myUserName/qtlProject/data/myqtldata.json" )

The JSON file contains all control information for your data, including names of data files, cross type, column specifications for sex and cross information, and more. This can also be in YAML format. Alternatively, all data files can be zipped together for loading.

R

myQTLdata <- read_cross2(file = "/Users/myUserName/qtlProject/data/myqtldata.zip" )

Back to the BTBR data. Now look at a summary of the cross data and the names of each variable within the data.

R

summary(cross)

OUTPUT

Object of class cross2 (crosstype "f2")

Total individuals             490
No. genotyped individuals     490
No. phenotyped individuals    490
No. with both geno & pheno    490

No. phenotypes                  3
No. covariates                  8
No. phenotype covariates        0

No. chromosomes                20
Total markers                2057

No. markers by chr:
  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19   X
156 135 157 126 125 102 109  91  93 123 124 116 116  91 102  66  60  95  50  20 

R

names(cross)

OUTPUT

 [1] "crosstype"  "geno"       "gmap"       "pmap"       "pheno"
 [6] "covar"      "is_x_chr"   "is_female"  "cross_info" "alleles"   

Have a look at the markers listed in the genetic map, gmap. Markers are listed by chromosome and described by cM position. View only the markers on the first two chromosomes.

R

head(cross$gmap, n=2)

OUTPUT

$`1`
rs13475697  rs3681603 rs13475703 rs13475710  rs6367205 rs13475716 rs13475717
 0.1881141  0.1920975  0.4167755  0.6488793  0.6555814  0.6638576  0.6676198
rs13475719 rs13459050  rs3680898 rs13475727 rs13475728 rs13475729 rs13475731
 0.6711377  0.6749344  0.6775292  1.8149573  1.9596637  2.3456569  2.7186389
rs13475737 rs13475744  rs6397513 rs13475747 rs13475748 rs13475749 rs13475750
 3.1059517  3.8222865  4.3094607  4.3120150  4.5098582  4.8154609  4.8853505
rs13475751 rs13475752 rs13475762 rs13475764 rs13475765 rs13475768 rs13475769
 4.8869793  4.8902179  7.2954871  8.2102887  8.3708197  8.7178703  8.8859153
rs13475771  rs6384194 rs13475790  rs3676270 rs13475794 rs13475801  rs4222269
 9.1374722  9.9295192  9.9970634 10.1508878 10.3962716 11.5981956 11.9606369
 rs6387241 rs13475822 rs13475824 rs13475826 rs13475827 rs13475834 rs13475880
16.8770742 16.9815396 17.4434784 18.0866148 18.6276972 19.2288050 27.4056813
rs13475883  rs6239834  rs3162895  rs6212146  rs3022802 rs13475899 rs13475900
28.4641674 30.8427150 31.1526514 31.2751278 31.3428706 31.8493556 31.8518088
rs13475906  rs3022803 rs13475907 rs13475909 rs13475912  rs6209698 rs13475929
32.2967145 32.3074644 32.3683291 32.8001894 33.6026526 36.5341646 37.6881435
rs13475931 rs13475933 rs13475934  rs4222476 rs13475939  rs8253473 rs13475941
37.7429827 38.0416271 38.0430095 38.9647582 39.4116688 39.4192277 39.4871064
rs13475944 rs13475947 rs13475948 rs13475950 rs13475951 rs13475954 rs13475955
39.7672829 40.2599440 40.3380113 40.3417592 40.3439501 41.1407252 41.2887176
rs13475963 rs13475966 rs13475967 rs13475970 rs13475960  rs6250696 rs13475973
42.4744416 42.5667702 42.9736574 43.1427994 43.5985261 43.5992946 43.6014053
 rs3691187 rs13475988 rs13475991 rs13476023 rs13476024  rs3684654  rs6274257
44.6237384 45.7855528 46.0180221 47.8579278 47.8600317 48.2423958 48.9612178
rs13476045 rs13476049  rs6319405 rs13476050 rs13476051 rs13476054 rs13476057
49.2018340 49.3701384 49.4261039 49.4275718 49.4323558 49.4972616 49.5031830
rs13476059 rs13476060 rs13476062 rs13476066 rs13476067  rs6259837 rs13476080
49.5084008 49.5113545 49.6085043 49.6644819 50.1779477 50.8256056 51.0328603
 rs6302966 rs13476085 rs13476089  rs3717360 rs13476090  rs6248251 rs13476091
51.3659553 51.6974451 52.3869798 52.3903517 52.3936241 52.4228715 52.5787388
 rs3088725  rs3022832  rs4222577 rs13476100  rs6263067  rs8256168  rs6327099
53.4044231 53.4129004 53.4189013 54.3267003 54.4193890 55.1459517 55.3274320
rs13476111 rs13476119  rs8236484  rs8270838  rs8236489 rs13476129 rs13476134
55.9050491 56.8936305 56.9852502 57.1870637 58.0248893 58.7605079 59.5401544
rs13476135 rs13476137 rs13476138 rs13476140 rs13476148  rs6202860 rs13476158
59.5426193 59.6023794 60.3355828 60.3439598 61.1791787 61.9905512 61.9930265
rs13476163 rs13476177 rs13476178 rs13476183 rs13476184  rs6194543 rs13476196
62.0039607 62.6243588 62.6269118 63.8101331 64.0856907 66.4047817 66.7425394
rs13476201  rs3685700  rs3022846 rs13476210 rs13476214 rs13459163  rs4222816
67.2638714 68.7230251 68.7246243 69.1209547 70.1550813 75.5548371 75.5593190
 rs4222820  rs3090340  rs8245949 rs13476242 rs13476251 rs13476254  rs6383012
75.5593202 75.5637846 76.7508053 79.0157673 79.7644000 79.8248805 85.3173344
rs13476279  rs6348421 rs13476290 rs13476300 rs13476302 rs13476304  rs3669814
86.7653503 88.2128991 89.0565541 94.6215368 94.8227821 94.8269227 95.5413280
rs13501301 rs13476316
96.0784002 96.9960494

$`2`
rs13476325 rs13476327 rs13476328 rs13476330  rs3695983 rs13476334 rs13476337
  1.329379   1.760872   1.839732   1.950151   1.954566   2.265170   3.619681
rs13476342  rs3696091 rs13476348  rs3681847 rs13476358 rs13476424 rs13476427
  4.113919   5.308337   6.708720   7.382269  10.053730  21.102901  21.547272
rs13476432 rs13476433 rs13476438 rs13476440 rs13476445 rs13476446 rs13476448
 22.457323  22.458829  22.472305  22.502827  22.703755  22.706606  22.809631
rs13476449 rs13476451 rs13476452 rs13476456  rs6333344 rs13476459  rs6288325
 22.811905  22.816989  22.917366  23.496527  23.499872  23.505774  23.764498
rs13476470 rs13476482  rs6203572 rs13476485 rs13476501 rs13476502 rs13476503
 25.671278  26.670288  26.702994  27.001404  29.014864  29.017621  29.019099
rs13476525 rs13476526 rs13476530  rs3660779 rs13476533 rs13476534  rs4223189
 31.635258  31.882499  32.606343  33.182340  33.622217  33.881289  33.886227
 rs6205317 rs13476536  rs3709716 rs13476538 rs13476540  rs6314726 rs13476543
 34.125908  34.182229  34.777129  35.011099  35.196852  35.859566  36.045536
rs13476544  rs6222797 rs13476546 rs13476553 rs13476554  rs8263229 rs13476565
 36.199822  36.656504  36.783097  37.624894  37.628034  38.509801  39.813261
rs13476566  rs3670631 rs13476583 rs13476586 rs13476592 rs13476595 rs13476645
 40.314536  41.215814  43.089828  43.274068  45.681933  46.026997  50.383519
rs13476655 rs13476660  rs6378047 rs13476661 rs13476666 rs13476667 rs13476669
 50.804234  51.471634  51.474692  51.557954  52.083779  52.257580  52.543537
rs13476672  rs3724460  rs3143279 rs13476687  rs6278009 rs13476693  rs3022892
 52.594599  53.100014  54.071835  54.256293  54.721524  55.005598  55.109510
rs13476700 rs13476702 rs13476703 rs13476705  rs4223406  rs3090608 rs13476739
 55.226885  55.233738  55.400903  55.678374  55.949739  57.841429  58.879499
rs13476747 rs13476754  rs6257970 rs13476758  rs3022901 rs13476769  rs4223486
 59.218037  59.231937  59.742629  60.304581  61.226450  61.739777  61.741096
rs13476774  rs4223511 rs13476778 rs13476783  rs6234650 rs13476784 rs13476786
 61.912021  62.703328  63.294674  63.904805  64.118754  64.219897  64.224107
rs13476788  rs3682725 rs13476801 rs13476803 rs13476816  rs6170159  rs3022909
 64.987248  65.971246  66.931607  67.071161  69.496928  69.584633  69.689891
rs13476819 rs13476822 rs13476823  rs3716380  rs6332517 rs13476826 rs13476827
 69.754708  71.191302  71.405084  71.816078  71.959179  72.019045  72.087910
rs13476830 rs13476831 rs13476832  rs4223605  rs3022932 rs13476872  rs3692409
 72.364799  72.368511  72.371502  76.460822  76.621947  78.531760  79.528125
 rs3726342 rs13476882  rs3673613  rs8260429  rs8275858 rs13476892 rs13476894
 80.337464  81.259831  82.295892  82.969957  83.669130  84.046818  84.787594
 rs3024096 rs13476907 rs13476910 rs13476918  rs3673248 rs13476928 rs13476934
 87.473424  87.498263  87.955210  92.672743  94.072805  96.238835  98.601444
rs13476935 rs13476936
 99.057281  99.455950 

Next we use calc_genoprob() to calculate the QTL genotype probabilities.

R

probs <- calc_genoprob(cross      = cross, 
                       map        = cross$gmap, 
                       error_prob = 0.002)

The argument error_prob supplies an assumed genotyping error probability of 0.002. If a value for error_prob is not supplied, the default probability is 0.0001.

Recall that the result of calc_genoprob, probs, is a list of three-dimensional arrays (one per chromosome).

R

names(probs)

OUTPUT

 [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14" "15"
[16] "16" "17" "18" "19" "X" 

Each three-dimensional array of probabilities is arranged as individuals \(\times\) genotypes \(\times\) positions. Have a look at the names of each of the three dimensions for chromosome 19.

R

dimnames(probs$`19`)

OUTPUT

[[1]]
  [1] "Mouse3051" "Mouse3551" "Mouse3430" "Mouse3476" "Mouse3414" "Mouse3145"
  [7] "Mouse3656" "Mouse3242" "Mouse3427" "Mouse3527" "Mouse3281" "Mouse3405"
 [13] "Mouse3530" "Mouse3477" "Mouse3498" "Mouse3526" "Mouse3284" "Mouse3160"
 [19] "Mouse3655" "Mouse3615" "Mouse3491" "Mouse3603" "Mouse3191" "Mouse3130"
 [25] "Mouse3330" "Mouse3199" "Mouse3614" "Mouse3577" "Mouse3081" "Mouse3204"
 [31] "Mouse3513" "Mouse3219" "Mouse3331" "Mouse3301" "Mouse3503" "Mouse3083"
 [37] "Mouse3568" "Mouse3189" "Mouse3287" "Mouse3131" "Mouse3311" "Mouse3357"
 [43] "Mouse3149" "Mouse3256" "Mouse3644" "Mouse3217" "Mouse3212" "Mouse3082"
 [49] "Mouse3156" "Mouse3535" "Mouse3481" "Mouse3123" "Mouse3359" "Mouse3555"
 [55] "Mouse3597" "Mouse3624" "Mouse3314" "Mouse3128" "Mouse3531" "Mouse3295"
 [61] "Mouse3231" "Mouse3496" "Mouse3438" "Mouse3183" "Mouse3052" "Mouse3237"
 [67] "Mouse3462" "Mouse3293" "Mouse3543" "Mouse3276" "Mouse3200" "Mouse3502"
 [73] "Mouse3171" "Mouse3364" "Mouse3524" "Mouse3334" "Mouse3355" "Mouse3254"
 [79] "Mouse3358" "Mouse3468" "Mouse3192" "Mouse3214" "Mouse3536" "Mouse3606"
 [85] "Mouse3226" "Mouse3393" "Mouse3415" "Mouse3266" "Mouse3648" "Mouse3224"
 [91] "Mouse3474" "Mouse3381" "Mouse3138" "Mouse3660" "Mouse3616" "Mouse3425"
 [97] "Mouse3554" "Mouse3196" "Mouse3528" "Mouse3312" "Mouse3045" "Mouse3585"
[103] "Mouse3471" "Mouse3308" "Mouse3628" "Mouse3429" "Mouse3324" "Mouse3124"
[109] "Mouse3291" "Mouse3452" "Mouse3373" "Mouse3367" "Mouse3579" "Mouse3647"
[115] "Mouse3169" "Mouse3335" "Mouse3122" "Mouse3635" "Mouse3154" "Mouse3484"
[121] "Mouse3652" "Mouse3612" "Mouse3668" "Mouse3233" "Mouse3175" "Mouse3306"
[127] "Mouse3046" "Mouse3663" "Mouse3165" "Mouse3519" "Mouse3592" "Mouse3127"
[133] "Mouse3184" "Mouse3650" "Mouse3599" "Mouse3494" "Mouse3605" "Mouse3505"
[139] "Mouse3573" "Mouse3561" "Mouse3489" "Mouse3480" "Mouse3186" "Mouse3421"
[145] "Mouse3607" "Mouse3346" "Mouse3375" "Mouse3633" "Mouse3589" "Mouse3094"
[151] "Mouse3611" "Mouse3307" "Mouse3133" "Mouse3152" "Mouse3518" "Mouse3209"
[157] "Mouse3056" "Mouse3320" "Mouse3365" "Mouse3313" "Mouse3441" "Mouse3339"
[163] "Mouse3352" "Mouse3159" "Mouse3619" "Mouse3238" "Mouse3203" "Mouse3137"
[169] "Mouse3509" "Mouse3289" "Mouse3054" "Mouse3432" "Mouse3487" "Mouse3179"
[175] "Mouse3572" "Mouse3285" "Mouse3466" "Mouse3252" "Mouse3517" "Mouse3546"
[181] "Mouse3185" "Mouse3665" "Mouse3537" "Mouse3096" "Mouse3600" "Mouse3349"
[187] "Mouse3098" "Mouse3275" "Mouse3667" "Mouse3342" "Mouse3333" "Mouse3300"
[193] "Mouse3244" "Mouse3478" "Mouse3560" "Mouse3501" "Mouse3315" "Mouse3440"
[199] "Mouse3669" "Mouse3486" "Mouse3632" "Mouse3319" "Mouse3453" "Mouse3172"
[205] "Mouse3121" "Mouse3590" "Mouse3215" "Mouse3447" "Mouse3618" "Mouse3340"
[211] "Mouse3047" "Mouse3666" "Mouse3516" "Mouse3225" "Mouse3167" "Mouse3207"
[217] "Mouse3631" "Mouse3444" "Mouse3168" "Mouse3298" "Mouse3602" "Mouse3309"
[223] "Mouse3416" "Mouse3260" "Mouse3146" "Mouse3374" "Mouse3144" "Mouse3485"
[229] "Mouse3610" "Mouse3348" "Mouse3500" "Mouse3613" "Mouse3253" "Mouse3384"
[235] "Mouse3664" "Mouse3206" "Mouse3426" "Mouse3332" "Mouse3210" "Mouse3283"
[241] "Mouse3670" "Mouse3120" "Mouse3274" "Mouse3461" "Mouse3202" "Mouse3472"
[247] "Mouse3437" "Mouse3434" "Mouse3593" "Mouse3055" "Mouse3234" "Mouse3422"
[253] "Mouse3571" "Mouse3236" "Mouse3049" "Mouse3350" "Mouse3249" "Mouse3326"
[259] "Mouse3134" "Mouse3143" "Mouse3493" "Mouse3361" "Mouse3636" "Mouse3436"
[265] "Mouse3510" "Mouse3117" "Mouse3601" "Mouse3303" "Mouse3497" "Mouse3544"
[271] "Mouse3463" "Mouse3118" "Mouse3354" "Mouse3162" "Mouse3464" "Mouse3181"
[277] "Mouse3188" "Mouse3356" "Mouse3521" "Mouse3591" "Mouse3241" "Mouse3467"
[283] "Mouse3469" "Mouse3262" "Mouse3643" "Mouse3548" "Mouse3372" "Mouse3542"
[289] "Mouse3563" "Mouse3583" "Mouse3584" "Mouse3208" "Mouse3661" "Mouse3659"
[295] "Mouse3195" "Mouse3459" "Mouse3653" "Mouse3649" "Mouse3382" "Mouse3180"
[301] "Mouse3386" "Mouse3084" "Mouse3205" "Mouse3299" "Mouse3515" "Mouse3540"
[307] "Mouse3255" "Mouse3177" "Mouse3523" "Mouse3366" "Mouse3567" "Mouse3557"
[313] "Mouse3114" "Mouse3623" "Mouse3419" "Mouse3580" "Mouse3271" "Mouse3385"
[319] "Mouse3492" "Mouse3119" "Mouse3232" "Mouse3598" "Mouse3150" "Mouse3310"
[325] "Mouse3164" "Mouse3587" "Mouse3050" "Mouse3627" "Mouse3506" "Mouse3413"
[331] "Mouse3435" "Mouse3151" "Mouse3112" "Mouse3630" "Mouse3646" "Mouse3223"
[337] "Mouse3187" "Mouse3263" "Mouse3637" "Mouse3662" "Mouse3508" "Mouse3550"
[343] "Mouse3125" "Mouse3545" "Mouse3570" "Mouse3641" "Mouse3136" "Mouse3626"
[349] "Mouse3166" "Mouse3269" "Mouse3529" "Mouse3218" "Mouse3625" "Mouse3448"
[355] "Mouse3378" "Mouse3227" "Mouse3651" "Mouse3182" "Mouse3304" "Mouse3617"
[361] "Mouse3141" "Mouse3552" "Mouse3479" "Mouse3658" "Mouse3539" "Mouse3190"
[367] "Mouse3093" "Mouse3097" "Mouse3126" "Mouse3170" "Mouse3229" "Mouse3520"
[373] "Mouse3582" "Mouse3351" "Mouse3129" "Mouse3153" "Mouse3450" "Mouse3113"
[379] "Mouse3586" "Mouse3549" "Mouse3538" "Mouse3201" "Mouse3556" "Mouse3247"
[385] "Mouse3455" "Mouse3176" "Mouse3344" "Mouse3343" "Mouse3439" "Mouse3629"
[391] "Mouse3286" "Mouse3216" "Mouse3588" "Mouse3488" "Mouse3221" "Mouse3142"
[397] "Mouse3428" "Mouse3111" "Mouse3353" "Mouse3211" "Mouse3569" "Mouse3280"
[403] "Mouse3325" "Mouse3368" "Mouse3553" "Mouse3245" "Mouse3228" "Mouse3135"
[409] "Mouse3622" "Mouse3095" "Mouse3369" "Mouse3609" "Mouse3410" "Mouse3302"
[415] "Mouse3594" "Mouse3483" "Mouse3197" "Mouse3336" "Mouse3507" "Mouse3305"
[421] "Mouse3532" "Mouse3250" "Mouse3194" "Mouse3449" "Mouse3178" "Mouse3198"
[427] "Mouse3620" "Mouse3596" "Mouse3638" "Mouse3222" "Mouse3147" "Mouse3163"
[433] "Mouse3273" "Mouse3473" "Mouse3578" "Mouse3465" "Mouse3279" "Mouse3558"
[439] "Mouse3443" "Mouse3490" "Mouse3460" "Mouse3248" "Mouse3243" "Mouse3431"
[445] "Mouse3564" "Mouse3347" "Mouse3565" "Mouse3525" "Mouse3574" "Mouse3329"
[451] "Mouse3140" "Mouse3257" "Mouse3328" "Mouse3193" "Mouse3132" "Mouse3220"
[457] "Mouse3235" "Mouse3499" "Mouse3246" "Mouse3270" "Mouse3608" "Mouse3442"
[463] "Mouse3157" "Mouse3642" "Mouse3566" "Mouse3139" "Mouse3282" "Mouse3053"
[469] "Mouse3454" "Mouse3363" "Mouse3213" "Mouse3654" "Mouse3514" "Mouse3341"
[475] "Mouse3401" "Mouse3388" "Mouse3604" "Mouse3161" "Mouse3451" "Mouse3634"
[481] "Mouse3482" "Mouse3559" "Mouse3645" "Mouse3264" "Mouse3155" "Mouse3251"
[487] "Mouse3297" "Mouse3541" "Mouse3158" "Mouse3294"

[[2]]
[1] "BB" "BR" "RR"

[[3]]
 [1] "rs4232073"  "rs13483548" "rs13483549" "rs13483550" "rs13483554"
 [6] "rs13483555" "rs3090321"  "rs3090137"  "rs6309315"  "rs13483577"
[11] "rs3090325"  "rs13483579" "rs13483584" "rs13483586" "rs13483587"
[16] "rs13483589" "rs13483592" "rs13483593" "rs6344448"  "rs13483594"
[21] "rs13483595" "rs3705022"  "rs13483609" "rs13483612" "rs13483648"
[26] "rs13483650" "rs13483654" "rs13483658" "rs13483660" "rs13483664"
[31] "rs13483666" "rs13483667" "rs13483670" "rs8275553"  "rs8275912"
[36] "rs13483677" "rs13483679" "rs13483680" "rs13483681" "rs3660143"
[41] "rs13483682" "rs13483683" "rs13483685" "rs13483686" "rs6355398"
[46] "rs4222106"  "rs13483690" "rs13483693" "rs13483695" "rs13483699"

View the first three rows of genotype probabilities for the first genotyped marker on chromosome 19.

R

(probs$`19`)[1:5,,"rs4232073"]

OUTPUT

                    BB           BR           RR
Mouse3051 1.317728e-11 1.235895e-07 9.999999e-01
Mouse3551 9.999840e-01 1.595361e-05 5.027172e-08
Mouse3430 1.317728e-11 1.235895e-07 9.999999e-01
Mouse3476 9.999999e-01 1.235895e-07 1.317728e-11
Mouse3414 6.179474e-08 9.999999e-01 6.179474e-08

We can also view the genotype probabilities using plot_genoprob. The arguments to this function specify:

  1. probs: the genotype probabilities,
  2. map: the marker map,
  3. ind: the index of the individual to plot,
  4. chr: the index of the chromosome to plot.

R

plot_genoprob(probs = probs, 
              map   = cross$pmap, 
              ind   = 1, 
              chr   = 19, 
              main  = rownames(probs[['19']])[1])

The coordinates along chromosome 19 are shown on the horizontal axis and the three genotypes are shown on the vertical axis. Higher genotype probabilities are plotted in darker shades. This mouse has a RR genotype on the proximal end of the chromosome and transitions to BR.

Challenge 1

1). Load a second dataset from Arabidopsis recombinant inbred lines (Moore et al, Genetics, 2013) in a study of plant root response to gravity (gravitropism).

grav <- read_cross2(file = system.file('extdata', 'grav2.zip', package = 'qtl2'))
2). How many individuals were in the study? How many phenotypes? How many chromosomes?
3). Insert pseudomarkers at 1cM intervals and save the results to an object called gravmap. Have a look at the first chromosome.
4). Calculate genotype probabilities and save the results to an object called gravpr. View the genotypes for the first three markers and pseudomarkers on chromosome 1 for the first five individuals.

1). grav <- read_cross2(file = system.file('extdata', 'grav2.zip', package = 'qtl2'))
2). summary(grav)
3). gravmap <- insert_pseudomarkers(map = grav$gmap, step = 1)
followed by head(gravmap, n=1)
4). gravpr <- calc_genoprob(cross = grav, map = gravmap) followed by
(gravpr$``1``)[1:5,,"PVV4"], (gravpr$1)[1:5,,"c1.loc1"], and
(gravpr$1)[1:5,,"c1.loc2"]

R

m  = maxmarg(probs)
ph = guess_phase(cross, m)
plot_onegeno(ph, cross$pmap)

Challenge 2

Calculate genotype probabilities for a different data set from the qtl2 data repository, this one from a study of obesity and diabetes in a C57BL/6 (B6) × BTBR intercross.
1). Create a new script in RStudio with File -> New File -> R Script.
2) Download the B6 x BTBR zip file from the qtl2 data repository into an object called b6btbr by running this code:
b6btbr <- read_cross2(file = "https://raw.githubusercontent.com/rqtl/qtl2data/master/B6BTBR/b6btbr.zip")
3). View a summary of the b6btbr data. How many individuals? phenotypes? chromosomes? markers?
4). View the genetic map for the b6btbr data.
5). Insert pseudomarkers at 2 cM intervals. Assign the results to an object called b6btbrmap.
6). Calculate genotype probabilities assuming a genotyping error probability of 0.001. Assign the results to an object called b6btbrpr.
7). View the first several rows of genotype probabilities for any marker on chromosome 18.

1). Create a new script in RStudio with File -> New File -> R Script.
2). b6btbr <- read_cross2(file = "https://raw.githubusercontent.com/rqtl/qtl2data/master/B6BTBR/b6btbr.zip
3). summary(b6btbr) shows 544 individuals, 3 phenotypes, 20 chromosomes, 2057 markers.
4). b6btbr$gmap
5). b6btbrmap <- insert_pseudomarkers(map=b6btbr$gmap, step=2)
6). b6btbrpr <- calc_genoprob(cross=b6btbr, map=b6btbrmap, error_prob=0.001)
7). dimnames((b6btbrpr$18)) shows all marker names for chromosome 18. head((b6btbrpr$18)[,,"c18.loc48"]) gives genotype probabilities for an example pseudomarker, while head((b6btbrpr$18)[,,"rs6338896"]) gives genotype probabilities for a genotyped marker.

Parallel calculations (optional) To speed up the calculations with large datasets on a multi-core machine, you can use the argument cores. With cores=0, the number of available cores will be detected via parallel::detectCores(). Otherwise, specify the number of cores as a positive integer.

R

probs <- calc_genoprob(cross = iron, map = map, error_prob = 0.002, cores = 4)

Allele probabilities (optional) The genome scan functions use genotype probabilities as well as a matrix of phenotypes. If you wished to perform a genome scan via an additive allele model, you would first convert the genotype probabilities to allele probabilities, using the function genoprob_to_alleleprob().

R

apr <- genoprob_to_alleleprob(probs = probs)

The figure below shows genotype and allele probabilities for 3 samples. In the Diversity Outbred, there are 36 possible genotype states (AA, AB, AC, …, BB, BC, BD, …, CC, CD, CE, …, DD, DE, DF, …, EE,…) or 8 + 7 + 6 + 5 + 4 + 3 + 2 + 1. The first SNP below has genotype BB. In the table describing alleles (8 state founder probabilities), the probability that this SNP has a B allele is 1. The 2nd SNP has genotype BH, so the allele table shows a probability of 0.5 for B and 0.5 for H. The third SNP is either BG or BH, and has a probability of 0.5 for each of these genotypes. The allele table shows a probability of 0.5 for allele B, and 0.25 for both G and H.

Genotype and allele probabilities{‘a table showing the probabilities for each of 36 genotypes in the Diversity Outbred followed by a second table showing probabilities for each of the 8 founder alleles’}

Key Points

  • The first step in QTL analysis is to calculate genotype probabilities.
  • Calculate genotype probabilities between genotyped markers with calc_genoprob().

Content from Performing a genome scan


Last updated on 2024-10-21 | Edit this page

Overview

Questions

  • How do I perform a genome scan?
  • How do I plot a genome scan?
  • How do additive covariates differ from interactive covariates?

Objectives

  • Map one trait using additive covariates.
  • Map the sample trait using additive and interactive covariates.
  • Plot a genome scan.

The freely available chapter on single-QTL analysis from Broman and Sen’s A Guide to QTL Mapping with R/qtl describes different methods for QTL analysis. Two of these methods are described here using data from an experiment on hypertension in the mouse Sugiyama et al., Genomics 71:70-77, 2001. The study employed a backcross between two mouse strains resulting in two possible genotypes - AA for homozygotes, and AB for heterozygotes.

Linear regression can be employed to identify presence of QTL in a cross. To identify QTL using regression, we compare the fit for two models: 1) the null hypothesis that there are no QTL anywhere in the genome; and 2) the alternative hypothesis that there is a QTL near a specific position. A sloped line indicates that there is a difference in mean phenotype between the two genotype groups, and that a QTL is present. A line with no slope indicates that there is no difference in mean phenotype between the two groups, and that no QTL exists. Regression aims to find the line of best fit to the data. In the case of a backcross with only two genotypes, a t-test is performed at the marker to determine whether the difference in phenotype means is zero.

adapted from Broman & Sen, 2009
adapted from Broman & Sen, 2009

To find the line of best fit, the residuals or errors are calculated, then squared for each data point.

Squared error for a single data pointThe line of best fit minimizes the sum of squared errors

The line of best fit will be the one that minimizes the sum of squared residuals, which maximizes the likelihood of the data.

Marker regression produces a LOD (logarithm of odds) score comparing the null hypothesis to the alternative. The LOD score is calculated using the sum of squared residuals for the null and alternative hypotheses. The LOD score is the difference between the log10 likelihood of the null hypothesis and the log10 likelihood of the alternative hypothesis. It is related to the regression model above by identifying the line of best fit to the data. A higher LOD score indicates greater likelihood of the alternative hypothesis. A LOD score closer to zero favors the null hypothesis.

Marker regression can identify the existence and effect of a QTL by comparing means between groups, however, it requires known marker genotypes and can’t identify QTL in between typed markers. To identify QTL between typed markers, we use Haley-Knott regression. After calculating genotype probabilities, we can regress the phenotypes for animals of unknown genotype on these conditional genotype probabilities (conditional on known marker genotypes). In Haley-Knott regression, phenotype values can be plotted and a regression line drawn through the phenotype mean for the untyped individuals.

As shown by the green circle in the figure, an individual of unknown genotype is placed between known genotypes according to the probability of its genotype being AA or AB. In this case, the probability of this individual having genotype AA is 0.6, and the probability of having genotype AB is 0.4.

To perform a genome scan by Haley-Knott regression (Haley and Knott 1992), use the function scan1(). scan1() takes as input the genotype probabilities, a matrix of phenotypes, and then optional additive and interactive covariates. Another option is to provide a vector of weights.

Additive Genome Scan


There are two potential covariates in the Attie data set. Let’s look at the top of the covariates in the cross object.

R

head(cross$covar)

OUTPUT

             Sex pgm adipose_batch gastroc_batch hypo_batch islet_batch
Mouse3051   Male   1    12/19/2007     8/11/2008 11/26/2007  11/28/2007
Mouse3551   Male   1    12/19/2007     8/12/2008 11/27/2007  12/03/2007
Mouse3430   Male   1    12/19/2007     8/12/2008 11/27/2007  12/03/2007
Mouse3476   Male   1    12/19/2007     8/12/2008 11/27/2007  12/03/2007
Mouse3414   Male   1    12/18/2007     8/11/2008 11/26/2007  11/28/2007
Mouse3145 Female   1    12/19/2007     8/11/2008         NA  11/28/2007
          kidney_batch liver_batch
Mouse3051   07/15/2008       other
Mouse3551   07/16/2008  12/10/2007
Mouse3430   07/15/2008  12/05/2007
Mouse3476   07/16/2008  12/05/2007
Mouse3414   07/15/2008  12/05/2007
Mouse3145   07/15/2008  12/10/2007

Sex is potential covariate. It is a good idea to always include sex in any analysis. Examples of other covariates might be age, diet, treatment, or experimental batch. It is worth taking time to identify covariates that may affect your results.

R

cross$covar$Sex <- factor(cross$covar$Sex)

addcovar <- model.matrix(~Sex, data = cross$covar)[,-1, drop = FALSE]

When we perform a genome scan with additive covariates, we are searching for loci that have the same effect in both covariate groups. In this case, we are searching for loci that affect females and male in the same way.

R

lod_add <- scan1(genoprobs = probs, 
                 pheno     = cross$pheno[,'log10_insulin_10wk'], 
                 addcovar  = addcovar)

On a multi-core machine, you can get some speed-up via the cores argument, as with calc_genoprob() and calc_kinship().

R

lod_add <- scan1(genoprobs = probs, 
                 pheno     = cross$pheno[,'log10_insulin_10wk', drop = FALSE], 
                 addcovar  = addcovar,
                 cores     = 4)

The output of scan1() is a matrix of LOD scores, positions × phenotypes.

Take a look at the first ten rows of the scan object. The numerical values are the LOD scores for the marker named at the beginning of the row. LOD values are shown for circulating insulin.

R

head(lod_add, n = 10)

OUTPUT

               pheno1
rs13475697 0.04674829
rs3681603  0.04674829
rs13475703 0.04680494
rs13475710 0.12953382
rs6367205  0.13734728
rs13475716 0.13735534
rs13475717 0.13735534
rs13475719 0.13735534
rs13459050 0.13735534
rs3680898  0.13735541

The function plot_scan1() can be used to plot the LOD curves. Use the argument lodcolumn to indicate which column to plot.

R

plot_scan1(lod_add, 
           map  = cross$pmap,
           main = 'log(insulin): 10 weeks')

The LOD plot for insulin shows several peaks, with the largest peak on chromosome 2. There are smaller peaks on other chromosomes. Which of these peaks is significant, and why? We’ll evaluate the significance of genome scan results in a later episode in performing a permutation test.

Challenge 1

Use the sort() function to sort the LOD scores for liver. Hint: run dim(out) for the row and column dimensions, or colnames(out) for the column names.
Which marker has the highest LOD score? Which genotyped marker has the highest LOD score? What chromosome number are they on?

sort(out[,1]) for column 1 or sort(out[,"liver"]) for the column named “liver”.
The pseudomarker with the largest score is c16.loc29, with a LOD of 7.68. The genotyped marker with the largest LOD is D16Mit30 with a score of 7.28. Both are located on chromosome 16.

Challenge 2

Use the sort() function to sort the LOD scores for spleen. Which pseudomarker has the highest LOD score? Which genotyped marker has the highest LOD score? What chromosome number are they on?

sort(out[,2]) or sort(out[,"spleen"]) The pseudomarker with the largest score is c9.loc57, with a LOD of 12.1. The genotyped marker with the largest LOD is D9Mit182 with a score of 10.4. Both are located on chromosome 9.

Challenge 3

Plot the LOD scores for spleen. Does the genome scan for spleen share any large-ish peaks with the scan for liver?

plot_scan1(out, map = map, lodcolumn = "spleen") Both liver and spleen genome scans share a peak on chromosome 8 with a LOD score near 4.

Interactive Genome Scan


Above, we mapped insulin levels using sex as an additive covariate and searched for loci where both sexes had the same effect. But what if the two sexes have different effects? The we would like to map in a way that allows each sex to have different effects. We do this using an interactive genome scan.

You should always include the interactive covariates as additive covariates as well. In this case, we only have sex as a covariate, so we can use the additive covariate matrix. For clarity, we will make a copy and name it for the interactive covariates.

R

intcovar = addcovar

R

lod_int <- scan1(genoprobs = probs, 
                 pheno     = cross$pheno[,'log10_insulin_10wk'], 
                 addcovar  = addcovar,
                 intcovar  = intcovar)

R

plot_scan1(x = lod_int, map = cross$pmap, main = 'log10_insulin_10wk')

It is difficult to tell if there is a difference in LOD scores between the additive and interactive scans. To resolve this, we can plot both genome scans in the same plo using the “add = TRUE” argument.

R

plot_scan1(x = lod_int, map = cross$pmap, main = 'log10_insulin_10wk')
plot_scan1(x = lod_add, map = cross$pmap, col = 'blue', add = TRUE)

It is still difficult to tell whether any peaks differ by sex. Another way to view the plot is to plot the difference between the interactive and additive scans.

R

plot_scan1(x = lod_int, map = cross$pmap, main = 'log10_insulin_10wk')
plot_scan1(x = lod_add, map = cross$pmap, col = 'blue', add = TRUE)
plot_scan1(x = lod_int - lod_add, map = cross$pmap, col = 'red', add = TRUE)

While it was important to look at the effect of sex on the trait, in this experiment, there do not appear to be any sex-specific peaks.

Key Points

  • A qtl2 genome scan requires genotype probabilities and a phenotype matrix.
  • The output from a genome scan contains a LOD score matrix, map positions, and phenotypes.
  • LOD curve plots for a genome scan can be viewed with plot_scan1().

Content from Calculating A Kinship Matrix


Last updated on 2024-10-21 | Edit this page

Overview

Questions

  • Why would I calculate kinship between individuals?
  • How do I calculate kinship between individuals?
  • What does a kinship matrix look like?

Objectives

  • Explain why and when kinship calculation matters in mapping.
  • Create a kinship matrix for individuals.

Population structure and kinship are common confounding factors in genome-wide association studies (GWAS), case-control studies, and other study types in genetics. They create false positive associations between genotype and phenotype at genetic markers that differ in genotype frequencies between subpopulations due to genetic relatedness between samples. Simple association tests assume statistical independence between individuals. Population structure and kinship confound associations when phenotype covariance between individuals results from genetic similarity. Accounting for relatedness between individuals helps to distinguish true associations from false positives generated by population structure or kinship.

As an example see the table below for phenotype and genotype frequencies between two subpopulations in a case-control study.

subpop1 subpop2 overall pop
frequency 0.5 0.5 1
probability of AA genotype 0.1 0.9 0.5
probability of disease 0.9 0.1 0.5
probability of disease & AA 0.09 0.09 0.09

The full population consists of two equally represented subpopulations. In the overall population, the probability of the AA genotype is 0.5, and the probability of disease is also 0.5. The joint probability of both disease and AA genotype in the population (0.09) is less than either the probability of disease (0.5) or the probability of the AA genotype (0.5) alone, and is considerably less than the joint probability of 0.25 that would be calculated if subpopulations weren’t taken into account. In a case-control study that fails to recognize subpopulations, most of the cases will come from subpopulation 1 since this subpopulation has a disease probability of 0.9. However, this subpopulation also has a low probability of the AA genotype. So a false association between AA genotype and disease would occur because only overall population probabilities would be considered.

Linear mixed models (LMMs) consider genome-wide similarity between all pairs of individuals to account for population structure, known kinship and unknown relatedness. They model the covariance between individuals. Linear mixed models in association mapping studies can successfully correct for genetic relatedness between individuals in a population by incorporating kinship into the model. To perform a genome scan by a linear mixed model, accounting for the relationships among individuals (in other words, including a random polygenic effect), you’ll need to calculate a kinship matrix for the individuals. This is accomplished with the calc_kinship() function. It takes the genotype probabilities as input.

R

kinship <- calc_kinship(probs = probs)

Take a look at the kinship values calculated for the first 5 individuals.

R

kinship[1:5, 1:5]

OUTPUT

          Mouse3051 Mouse3551 Mouse3430 Mouse3476 Mouse3414
Mouse3051     0.737     0.501     0.535     0.520     0.510
Mouse3551     0.501     0.758     0.515     0.512     0.499
Mouse3430     0.535     0.515     0.749     0.458     0.493
Mouse3476     0.520     0.512     0.458     0.749     0.434
Mouse3414     0.510     0.499     0.493     0.434     0.696

We can also look at the first 50 mice in the kinship matrix.

R

n_samples <- 50
heatmap(kinship[1:n_samples, 1:n_samples], symm = TRUE)

The mice are listed in the same order on both sides of the matrix. The comb-like structures are called “dendrograms” and they indicate how the mice are clustered together. Each cell represents the degree of allele sharing between mice. Red colors indicate higher kinship and yellow colors indicate lower kinship. Each mouse is closely related to itself, so the cells along the diagonal tend to be darker than the other cells. You can see some evidence of related mice, possibly siblings, in the orange-shaded blocks along the diagonal.

By default, the genotype probabilities are converted to allele probabilities, and the kinship matrix is calculated as the proportion of shared alleles. To use genotype probabilities instead, use use_allele_probs=FALSE in the call to calc_kinship(). Further, by default we omit the X chromosome and only use the autosomes. To include the X chromosome, use omit_x=FALSE.

On a multi-core machine, you can get some speed-up via the cores argument, as with calc_genoprob().

R

kinship <- calc_kinship(probs = probs, 
                        cores = 4)

Challenge 1

Think about what a kinship matrix is and what it represents. Share your understanding with a neighbor. Write your explanation in the collaborative document or in your own personal notes.

Key Points

  • Kinship matrices account for relationships among individuals.
  • Kinship is calculated as the proportion of shared alleles between individuals.
  • Kinship calculation is a precursor to a genome scan via a linear mixed model.

Content from Performing a genome scan with a linear mixed model


Last updated on 2024-10-21 | Edit this page

Overview

Questions

  • How do I use a linear mixed model in a genome scan?
  • How do different mapping and kinship calculation methods differ?

Objectives

  • Create a genome scan with a linear mixed model.
  • Compare LOD plots for Haley-Knott regression and linear mixed model methods.
  • Compare LOD plots for the standard kinship matrix with the leave-one-chromosome-out (LOCO) method.

Genetic mapping in mice presents a good example of why accounting for population structure is important. Laboratory mouse strains are descended from a small number of founders (fancy mice) and went through several population bottlenecks. Wild-derived strains are not descended from fancy mice and don’t share the same history as laboratory strains. Linear mixed models were developed to solve problems with population structure created by differing ancestries, and to handle relatedness between individuals. Linear mixed models (LMMs) consider genome-wide similarity between all pairs of individuals to account for population structure, known kinship and unknown relatedness. Linear mixed models in mapping studies can successfully correct for genetic relatedness between individuals in a population by incorporating kinship into the model. Earlier we calculated a kinship matrix for input to a linear mixed model to account for relationships among individuals. For a current review of mixed models in genetics, see this preprint of Martin and Eskin, 2017.

Simple linear regression takes the form:

\(y = \mu + \beta G + \epsilon\)

which describes a line with slope \(\beta\) and y-intercept \(\mu\). The error (or residual) is represented by \(\epsilon\).

To model the relationship between the phenotype \(y\) and the genotypes at one marker, we use:

\(y_j = \mu + \beta_k G_{jk} + \epsilon_j\)

where $y_{ij} is the phenotype of the \(j\)th individual, \(\mu\); is the mean phenotype value, \(\beta_k\) is the effect of the \(k\)th genotype, $G_{jk} is the genotype for individual \(j\), and \(\epsilon_j\) is the error for the \(j\)th individual. In the figure below, \(\mu\); equals 1, and \(\beta\); equals 0.1 for the alternative hypothesis (QTL exists). This linear model is \(y\) = 1 + 0.1X + \(\epsilon\). The model intersects the genotype groups at their group means. In contrast, the null hypothesis would state that there is no difference in group means (no QTL anywhere). The linear model for the null hypothesis would be \(y\) = 94.6 + 0\(X\) + \(\epsilon\). This states that the phenotype is equal to the combined mean (94.6), plus some error (\(\epsilon\)). In other words, genotype doesn’t affect the phenotype.

The linear models above describe the relationship between genotype and phenotype but are inadequate for describing the relationship between genotype and phenotype in large datasets. They don’t account for relatedness among individuals. In real populations, the effect of a single genotype is influenced by many other genotypes that affect the phenotype. A true genetic model takes into account the effect of all variants on the phenotype.

To model the phenotypes of all individuals in the data, we can adapt the simple linear model to include all individuals and their variants so that we capture the effect of all variants shared by individuals on their phenotypes.

\(y=\mu+\sum_{i=1}^M{\beta_iX_i}+\epsilon\)

Now, \(y\) represents the phenotypes of all individuals. The effect of the \(i\)th genotype on the phenotype is \(\beta_i\), the mean is \(\mu\) and the error is \(\epsilon\). Here, the number of genotypes is M.

To model the effect of all genotypes and to account for relatedness, we test the effect of a single genotype while bringing all other genotypes into the model.

\(\beta_k\) is the effect of the genotype \(X_k\), and \(\Sigma_{i\neq k}\beta_iX_i\) sums the effects of all other genotypes except genotype k. For the leave one chromosome out (LOCO) method, \(\beta_kX_k\) is the effect of genotypes on chromosome \(k\), and \(\beta_iX_i\) represents effect of genotypes on all other chromosomes.

If the sample contains divergent subpopulations, SNPs on different chromosomes will be correlated because of the difference in allele frequencies between subpopulations caused by relatedness. To correct for correlations between chromosomes, we model all genotypes on the other chromosomes when testing for the association of a SNP.

First, we will create a single kinship matrix using all of the genoprobs on all chromosomes.

R

kinship_all <- calc_kinship(probs = probs, 
                            type = "overall")

To perform a genome scan using a linear mixed model you also use the function scan1; you just need to provide the argument kinship, a kinship matrix (or, for the LOCO method, a list of kinship matrices).

R

lod_add_pg <- scan1(genoprobs = probs, 
                    pheno     = cross$pheno[,"log10_insulin_10wk",drop = FALSE], 
                    kinship   = kinship_all, 
                    addcovar  = addcovar)

Again, on a multi-core machine, you can get some speed-up using the cores argument.

R

lod_add_all <- scan1(genoprobs = probs, 
                     pheno     = cross$pheno[,"log10_insulin_10wk",drop = FALSE], 
                     kinship   = kinship_all, 
                     addcovar  = addcovar, 
                     cores     = 4)

If, for your linear mixed model genome scan, you wish to use the “leave one chromosome out” (LOCO) method (scan each chromosome using a kinship matrix that is calculated using data from all other chromosomes), use type="loco" in the call to calc_kinship().

R

kinship_loco <- calc_kinship(probs = probs, 
                             type  = "loco")

For the LOCO (leave one chromosome out) method, provide the list of kinship matrices as obtained from calc_kinship() with method="loco".

R

lod_add_loco <- scan1(genoprobs = probs, 
                      pheno     = cross$pheno[,"log10_insulin_10wk",drop = FALSE], 
                      kinship   = kinship_loco, 
                      addcovar  = addcovar)

To plot the results, we again use plot_scan1().

Here is a plot of the LOD scores by Haley-Knott regression and the linear mixed model using either the standard kinship matrix or the LOCO method.

R

plot_scan1(x     = lod_add,      
           map   = cross$pmap, 
           col   = 'black',
           ylim  = c(0, 7.5))
plot_scan1(x     = lod_add_all,  
           map   = cross$pmap, 
           col   = 'blue',   
           add   = TRUE)

ERROR

Error in eval(expr, envir, enclos): object 'lod_add_all' not found

R

plot_scan1(x     = lod_add_loco, 
           map   = cross$pmap, 
           col   = 'orange', 
           add   = TRUE)

ERROR

Error in eval(expr, envir, enclos): object 'lod_add_loco' not found

R

legend(x = 1500, y = 7.5, legend = c("No kinship", "All kinship", "LOCO kinship"),
       lwd = 2, col = c('black', 'blue', 'orange'))

For the circulating insulin, the three methods give quite different results. The linear mixed model with an overall kinship matrix gives much lower LOD scores than the other two methods. On chromosomes with some evidence of a QTL, the LOCO method gives higher LOD scores than Haley-Knott, except on chromosome 6 where it gives lower LOD scores.

Challenge 1

What are the benefits and disadvantages of the three methods for genome scanning (Haley-Knott regression, kinship matrix, and leave-one-chromosome out (LOCO)?)
Which method would you use to scan, and why?
Think about the advantages and disadvantages of each, discuss with a neighbor, and share your thoughts in the collaborative document.

Challenge 2

Pair programming exercise: with your partner, review and carry out all of the steps in QTL mapping that we have covered so far, using a new data set. One of you types the code, the other explains what needs to happen next, finds the relevant code in the lesson, suggests new names for objects (i.e. NOT the ones you’ve already used, such as “map”, “pr”, “out”, etc.).

  1. Run the code above to load the B6 x BTBR intercross data into an object called b6btbr.
  2. Insert pseudomarkers and calculate genotype probabilities.
  3. Run a genome scan for the log10_insulin_10wk phenotype.
  4. Calculate a kinship matrix.
  5. Calculate a list of kinship matrices with the LOCO method.
  6. Run genome scans with the regular kinship matrix and with the list of LOCO matrices.
  7. Plot the 3 different genome scans in a single plot in different colors.
  8. Which chromosomes appear to have peaks with a LOD score greater than 4? Which methods identify these peaks? Which don’t?

file <- paste0("https://raw.githubusercontent.com/rqtl/", "qtl2data/master/B6BTBR/b6btbr.zip")
b6btbr <- read_cross2(file)
summary(b6btbr)
head(b6btbr$pheno)
colnames(b6btbr$pheno)
b6bmap <- insert_pseudomarkers(map=b6btbr$gmap, step=1)
prb6b <- calc_genoprob(cross=b6btbr, map=b6bmap, error_prob=0.002)
b6baddcovar <- get_x_covar(b6btbr)
b6bout <- scan1(genoprobs = prb6b, pheno = b6btbr$pheno, addcovar=b6baddcovar)
plot(b6bout, map = b6bmap)
b6bkinship <- calc_kinship(probs = prb6b)
out_pg_b6b <- scan1(prb6b, b6btbr$pheno, kinship=b6bkinship, addcovar=b6baddcovar)
kinship_loco_b6b <- calc_kinship(prb6b, "loco")
out_pg_loco_b6b <- scan1(prb6b, b6btbr$pheno, kinship_loco_b6b, addcovar=b6baddcovar)
plot_scan1(out_pg_loco_b6b, map = b6bmap, lodcolumn = "log10_insulin_10wk", col = "black")
plot_scan1(out_pg_b6b, map = b6bmap, lodcolumn = "log10_insulin_10wk", col = "blue", add = TRUE)
plot_scan1(b6bout, map = b6bmap, lodcolumn = "log10_insulin_10wk", col = "green", add = TRUE)

R

file <- paste0("https://raw.githubusercontent.com/rqtl/",
               "qtl2data/master/B6BTBR/b6btbr.zip")
b6btbr <- read_cross2(file)

Key Points

  • “To perform a genome scan with a linear mixed model, supply a kinship matrix.”
  • “Different mapping and kinship calculation methods give different results.”

Content from Performing a genome scan with binary traits


Last updated on 2024-10-21 | Edit this page

Overview

Questions

  • “How do I create a genome scan for binary traits?”

Objectives

  • Convert phenotypes to binary values.
  • Use logistic regression for genome scans with binary traits.
  • Plot and compare genome scans for binary traits.

The genome scans above were performed assuming that the residual variation followed a normal distribution. This will often provide reasonable results even if the residuals are not normal, but an important special case is that of a binary trait, with values 0 and 1, which is best treated differently. The scan1 function can perform a genome scan with binary traits by logistic regression, using the argument model="binary". (The default value for the model argument is "normal".) At present, we can not account for relationships among individuals in this analysis.

Let’s look at the phenotypes in the cross again.

R

head(cross$pheno)

OUTPUT

          log10_insulin_10wk agouti_tan tufted
Mouse3051              1.399          1      0
Mouse3551              0.369          1      1
Mouse3430              0.860          0      1
Mouse3476              0.800          1      0
Mouse3414              1.370          0      0
Mouse3145              1.783          1      0

There are two binary traits called “agouti_tan”, and “tufted” which are related to coat color and shape.

We perform a binary genome scan in a similar manner to mapping continuous traits by using scan1. When we mapped insulin, there was a hidden argument called model which told qtl2 which mapping model to use. There are two options: normal, the default, and binary. The normal argument tells qtl2 to ues a "normal" (least squares) linear model. To map a binary trait, we will includemodel = “binary”` to indicate that the phenotype is a binary trait with values 0 and 1.

R

lod_agouti <- scan1(genoprobs = probs, 
                    pheno     = cross$pheno[,'agouti_tan'], 
                    addcovar  = addcovar, 
                    model     = "binary")

Let’s plot the result and see if there is a peak.

R

plot_scan1(x    = lod_agouti, 
           map  = cross$pmap, 
           main = 'Agouti')

Yes! There is a big peak on chromosome 2. Let’s zoom in on chromosome 2.

R

plot_scan1(x    = lod_agouti, 
           map  = cross$pmap, 
           chr  = 2,
           main = 'Agouti')

We can use find_peaks to find the position of the highest LOD score.

R

find_peaks(scan1_output = lod_agouti, 
           map          = cross$pmap)

This turns out to be a well-known coat color locus for agouti coat color which contains the nonagouti gene. Mice carrying two black alleles will have a black coat, and mice carrying one or no black alleles will have agouti coats.

Challenge 1: How many mice have black coats?

Look at the frequency of the black (0) and agouti (1) phenotypes. What proportion of the mice are black? Can you use what you learned about how the nonagouti locus works and the cross design to explain the frequency of black mice?

First, get the number of black and agouti mice.

R

tbl <- table(cross$pheno[,"agouti_tan"])
tbl

OUTPUT


  0   1
125 356 

Then use the number of mice to calculate the proportion with each coat color.

R

tbl / sum(tbl)

OUTPUT


   0    1
0.26 0.74 

We can see that the black (0) mice occur about 25 % of the time. If the A allele causes mice to have black coats when it is recessive, and if a is the agouti allele, then, when breeding two heterozygous (Aa) mice together, we expect mean allele frequencies of:

Allele Frequency Coat Color
AA 0.25 black
Aa 0.5 agouti
aa 0.25 agouti

From this, we can see that about 25% of the mice should have black coats.

Challenge 2: Map the “tufted” phenotype.

Map the tufted phenotype an determine if there are any tall peaks for this trait.

First, map the trait.

R

lod_tufted <- scan1(genoprobs = probs, 
                    pheno     = cross$pheno[,"tufted"], 
                    addcovar  = addcovar, 
                    model     = "binary")

Then, plot the LOD score.

R

plot_scan1(x    = lod_tufted, 
           map  = cross$pmap, 
           main = "Tufted")

There is a large peak on chromosome 17. This is a known locus associated with the Itpr3 gene near 27.3 Mb on chromsome 17.

Key Points

  • “A genome scan for binary traits (0 and 1) requires special handling; scans for non-binary traits assume normal variation of the residuals.”
  • “A genome scan for binary traits is performed with logistic regression.”

Content from Performing a permutation test


Last updated on 2024-10-21 | Edit this page

Overview

Questions

  • “How can I evaluate the statistical significance of genome scan results?”

Objectives

  • Run a permutation test to establish LOD score thresholds.

Using Permutations to Determine Significance Thresholds


To establish the statistical significance of the results of a genome scan, a permutation test can identify the maximum LOD score that can occur by random chance. A permutation tests shuffles genotypes and phenotypes, essentially breaking the relationship between the two. The genome-wide maximum LOD score is then calculated on the permuted data, and this score used as a threshold of statistical significance. A genome-wide maximum LOD on shuffled, or permuted, data serves as the threshold because it represents the highest LOD score generated by random chance.

The code below shuffles the phenotypes so that they no longer match up with the genotypes. The purpose of this is to find out how high the LOD score can be due to random chance alone.

R

shuffled_order <- sample(rownames(cross$pheno))
pheno_permuted <- cross$pheno
rownames(pheno_permuted) <- shuffled_order
xcovar_permuted <- Xcovar
rownames(xcovar_permuted) <- shuffled_order
out_permuted <- scan1(genoprobs = probs, pheno = pheno_permuted, Xcovar = xcovar_permuted)
plot(out_permuted, map)
head(shuffled_order)

The scan1perm() function takes the same arguments as scan1(), plus additional arguments to control the permutations:

  • n_perm is the number of permutation replicates.
  • perm_Xsp controls whether to perform autosome/X chromosome specific permutations (with perm_Xsp=TRUE) or not (the default is FALSE).
  • perm_strata is a vector that defines the strata for a stratified permutation test.
  • chr_lengths is a vector of chromosome lengths, used in the case that perm_Xsp=TRUE.

As with scan1(), you may provide a kinship matrix (or vector of kinship matrices, for the “leave one chromosome out” (loco) approach), in order to fit linear mixed models. If kinship is unspecified, the function performs ordinary Haley-Knott regression.

To perform a permutation test with the insulin phenotype, we run scan1perm(), provide it with the genotype probabilities, the phenotype data, X covariates and number of permutations. For expediency, we’ll use only 10 permutations, although 1000 is recommended.

R

perm_add <- scan1perm(genoprobs = probs, 
                      pheno     = cross$pheno[,'log10_insulin_10wk',drop = FALSE],
                      addcovar  = addcovar,
                      Xcovar    = addcovar,
                      n_perm    = 1000) 

Note the need to specify special covariates for the X chromosome (via Xcovar), to be included under the null hypothesis of no QTL. And note that when these are provided, the default is to perform a stratified permutation test, using strata defined by the rows in Xcovar. In general, when the X chromosome is considered, one will wish to stratify at least by sex.

Also note that, as with scan1(), you can speed up the calculations on a multi-core machine by specifying the argument cores. With cores=0, the number of available cores will be detected via parallel::detectCores(). Otherwise, specify the number of cores as a positive integer. For large data sets, be mindful of the amount of memory that will be needed; you may need to use fewer than the maximum number of cores, to avoid going beyond the available memory.

R

perm_add <- scan1perm(genoprobs = probs, 
                      phenotype = cross$pheno[,'log10_insulin_10wk',drop = FALSE], 
                      addcovar  = addcovar
                      Xcovar    = Xcovar, 
                      n_perm    = 1000, 
                      cores     = 0)

perm_add now contains the maximum LOD score for each permutation for the liver and spleen phenotypes. There should be 1000 values for each phenotypes. We can view the insulin permutation LOD scores by making a histogram.

R

hist(perm_add, 
     breaks = 50, 
     xlab   = "LOD", 
     las    = 1,
     main   = "Empirical distribution of maximum LOD scores under permuation")
abline(v = summary(perm_add), col = 'red', lwd = 2)

In the histogram above, you can see that most of the maximum LOD scores fall between 1 and 3.5. This means that we expect LOD scores less than 3.5 to occur by chance fairly often. The red line indicates the alpha = 0.05 threshold, which means that we only see LOD values by chance this high or higher, 5% of the time. This is one way of estimating a significance threshold for QTL plots.

To get estimated significance thresholds, use the function summary().

R

summary(perm_add)

OUTPUT

LOD thresholds (1000 permutations)
     log10_insulin_10wk
0.05               3.84

The default is to return the 5% significance thresholds. Thresholds for other (or for multiple) significance levels can be obtained via the alpha argument.

R

summary(perm_add, 
        alpha = c(0.2, 0.05))

OUTPUT

LOD thresholds (1000 permutations)
     log10_insulin_10wk
0.2                3.16
0.05               3.84

Estimating an X Chromosome Specific Threshold


To obtain autosome/X chromosome-specific significance thresholds, specify perm_Xsp=TRUE. In this case, you need to provide chromosome lengths, which may be obtained with the function chr_lengths().

R

perm_add2 <- scan1perm(genoprobs   = probs, 
                       pheno       = cross$pheno[,"log10_insulin_10wk",drop = FALSE], 
                       addcovar    = addcovar, 
                       n_perm      = 1000,
                       perm_Xsp    = TRUE, 
                       chr_lengths = chr_lengths(cross$pmap))

Separate permutations are performed for the autosomes and X chromosome, and considerably more permutation replicates are needed for the X chromosome. The computations take about twice as much time. See Broman et al. (2006) Genetics 174:2151-2158.

The significance thresholds are again derived via summary():

R

summary(perm_add2, 
        alpha = c(0.2, 0.05))

OUTPUT

Autosome LOD thresholds (1000 permutations)
     log10_insulin_10wk
0.2                3.19
0.05               3.96

X chromosome LOD thresholds (14369 permutations)
     log10_insulin_10wk
0.2                3.12
0.05               4.02

Estimating Significance Thresholds with the Kinship Matrix


You may have noticed above that we did not include the kinship matrix as an argument to scan1perm. We can include the LOCO kinship matrices in our permutations, since this is how we mapped insulin previously.

R

perm_add_loco <- scan1perm(genoprobs = probs, 
                           pheno     = cross$pheno[,'log10_insulin_10wk',drop = FALSE],
                           kinship   = kinship_loco,
                           addcovar  = addcovar,
                           n_perm    = 1000) 

How does this affect the significance threshold estimates?

R

summary(perm_add_loco, 
        alpha = c(0.2, 0.05))

OUTPUT

LOD thresholds (1000 permutations)
     log10_insulin_10wk
0.2                3.15
0.05               3.88

There is not a large difference in the thresholds. Currently, we are on the fence about using the kinship matrices to estimate significance thresholds. In principle, the kinship matrices should be used because we used them when mapping the phenotype. However, in practice, we often find that the significance thresholds are similar to those obtained without including the kinship matrices. Given that it also takes more time to run the permutations with the kinship matrices, it seems reasonable to exclude them.

We ran 1000 permutations of the insulin phenotype and estimated the 0.05 significance threshold with and without the kinship matrices. We repeated this process 100 times and plotted the thresholds below.

Significance Thresholds with/without Kinship{alt=“Comparison of significane threholds”,width=“50%”}

The plot shows that the median significance threshold is the same. The lines connecting the points denote matched simulations in which the permutation order was the same. While the exact value of the LOD threshold is different, the median value and the variance are similar.

Estimating Binary Model Significance Thresholds


As with scan1, we can use scan1perm with binary traits, using the argument model="binary". Again, this can’t be used with a kinship matrix, but all of the other arguments can be applied.

R

perm_bin <- scan1perm(genoprobs = probs, 
                      pheno     = cross$pheno[,"agouti_tan",drop = FALSE], 
                      addcovar  = addcovar, 
                      n_perm    = 1000, 
                      perm_Xsp  = TRUE, 
                      chr_lengths = chr_lengths(cross$pmap),
                      model     = "binary")

Here are the estimated 5% and 20% significance thresholds.

R

summary(perm_bin, 
        alpha = c(0.2, 0.05))

OUTPUT

Autosome LOD thresholds (1000 permutations)
     agouti_tan
0.2        3.21
0.05       3.97

X chromosome LOD thresholds (14369 permutations)
     agouti_tan
0.2        3.07
0.05       3.76

Selecting the Number of Permutations


How do we know how many permutations to perform in order to obtain a good estimate of the significance threshold? Could we get a good estimate with 10 permutations? 100? 1000?

When we run more permutations, we decrease the variance of the threshold estimate.

Significance Threshold Variance{alt=“Figure showing decreasing variance of significance threshold estimates with increasing permutations”,width=75%}

In the figure above, we performed 10, 100, or 1000 permutations 1000 times and recorded the 0.05 significance threshold each time. We plotted the significance threshold versus the number of permutations and overlaid violin plots showing the median value. Note that the variance of the significance threshold estimate is higher at lower numbers of permutations. With 1000 permutations, the variance decreases. The table below shows the number of permutations and the mean and standard deviation of the significance threshold. With 1000 permutations, the estimate is 3.86 and the standard deviation is 0.064, which is an acceptable value.

Num. Perm. Mean Std. Dev.
10 3.63 0.492
100 3.81 0.195
1000 3.86 0.064

Challenge 1:

Run the preceding code to shuffle the phenotype data and plot a genome scan with this shuffled (permuted) data.

What is the maximum LOD score in the scan from this permuted data?
How does it compare to the maximum LOD scores obtained from the earlier scan?
How does it compare to the 5% and 20% LOD thresholds obtained earlier?
Paste the maximum LOD score in the scan from your permuted data into the etherpad.

Challenge 2

  1. Find the 1% and 10% significance thresholds for the first set of permutations contained in the object operm.
  2. What do the 1% and 10% significance thresholds say about LOD scores?
  1. summary(operm, alpha=c(0.01, 0.10))
  2. These LOD thresholds indicate maximum LOD scores that can be obtained by random chance at the 1% and 10% significance levels. We expect to see LOD values this high or higher 1% and 10% of the time respectively.

Key Points

  • “A permutation test establishes the statistical significance of a genome scan.”

Content from Finding QTL peaks


Last updated on 2024-10-21 | Edit this page

Overview

Questions

  • “How do I locate QTL peaks above a certain LOD threshold value?”

Objectives

  • Locate QTL peaks above a LOD threshold value throughout the genome.
  • Identify the Bayesian credible interval for a QTL peak.

Once we have LOD scores from a genome scan and a significance threshold, we can look for QTL peaks associated with the phenotype. High LOD scores indicate the neighborhood of a QTL but don’t give its precise position. To find the exact position of a QTL, we define an interval that is likely to hold the QTL.

We’ll use the Bayesian credible interval, which is a method for defining QTL intervals. It describes the probability that the interval contains the true value. Credible intervals make a probabilistic statement about the true value, for example, a 95% credible interval states that there is a 95% chance that the true value lies within the interval.

Let’s remind ourselves how the genome scan for insulin looks.

R

thr <- summary(perm_add_loco)

plot_scan1(x    = lod_add_loco,
           map  = cross$pmap,
           main = "Insulin")
add_threshold(map = cross$pmap, thresholdA = thr, col = 'red')

To find peaks above a given threshold LOD value, use the function find_peaks(). It can also provide Bayesian credible intervals by using the argument prob (the nominal coverage for the Bayes credible intervals). Set the argument expand2markers = FALSE to keep from expanding the interval out to genotyped markers, or exclude this argument if you’d like to include flanking markers.

You need to provide both the scan1() output, the marker map and a threshold. We will use the 95% threshold from the permutations in the previous lesson.

R

find_peaks(scan1_output = lod_add_loco, 
           map          = cross$pmap, 
           threshold    = thr, 
           prob         = 0.95, 
           expand2markers = FALSE)

OUTPUT

  lodindex lodcolumn chr       pos      lod      ci_lo     ci_hi
1        1    pheno1   2 138.94475 7.127351  64.949395 149.57739
2        1    pheno1   7 144.18230 5.724018 139.368290 144.18230
3        1    pheno1  12  25.14494 4.310493  15.834815  29.05053
4        1    pheno1  14  22.24292 3.974322   6.241951  45.93876
5        1    pheno1  16  80.37433 4.114024  10.238134  80.37433
6        1    pheno1  19  54.83012 5.476587  48.370980  55.15007

In the table above, we have one peak per chromosome because that is the default behavior of find_peaks(). The find_peaks() function can also pick out multiple peaks on a chromosome: each peak must exceed the chosen threshold, and the argument peakdrop indicates the amount that the LOD curve must drop between the lowest of two adjacent peaks. Use this feature with caution.

R

find_peaks(scan1_output = lod_add_loco, 
           map          = cross$pmap, 
           threshold    = thr, 
           peakdrop     = 1.8, 
           prob         = 0.95, 
           expand2markers = FALSE)

OUTPUT

  lodindex lodcolumn chr       pos      lod      ci_lo     ci_hi
1        1    pheno1   2 138.94475 7.127351  64.949395 149.57739
2        1    pheno1   7 144.18230 5.724018 139.368290 144.18230
3        1    pheno1  12  25.14494 4.310493  15.834815  29.05053
4        1    pheno1  14  22.24292 3.974322   6.241951  45.93876
5        1    pheno1  16  17.48123 3.995627   5.604167  37.99110
6        1    pheno1  16  80.37433 4.114024  74.250773  80.37433
7        1    pheno1  19  54.83012 5.476587  48.370980  55.15007

Each row shows a different peak; the columns show the peak location, LOD score and the lower and upper interval endpoints. Note that we now have two peaks on chromosome 16, one at 17.5 Mb and one at 80.4 Mb.

Challenge 1

Find peaks in the genome scan object called lod_add_loco that meet a threshold of 3 and are in the interval described by a 2 point LOD drop from the peak. How many peaks meet the LOD threshold of 3 and lie within the interval defined by a 2 point LOD drop from the maximum peaks on each chromosome?

R

find_peaks(scan1_output = lod_add_loco, 
           map          = cross$pmap, 
           threshold    = 3, 
           drop         = 2)

OUTPUT

  lodindex lodcolumn chr       pos      lod      ci_lo     ci_hi
1        1    pheno1   2 138.94475 7.127351  63.943187 156.83772
2        1    pheno1   5 103.41486 3.130862  41.967549 132.28428
3        1    pheno1   7 144.18230 5.724018 129.414016 144.18230
4        1    pheno1   9  83.67606 3.865635  17.504307 111.02206
5        1    pheno1  12  25.14494 4.310493   9.998200  34.23274
6        1    pheno1  14  22.24292 3.974322   6.241951  68.04655
7        1    pheno1  16  80.37433 4.114024   3.804882  96.52406
8        1    pheno1  19  54.83012 5.476587  47.361847  56.37100

This produces 7 peaks on 6 different chromosomes that meet a LOD threshold of 3 and are within the interval defined by a 2-LOD drop from the maximum peak on each chromosome.

Challenge 2

1). Calculate the 90% Bayes credible interval. For chromosome 2, what is the range of the interval that has a 90% chance of containing the true QTL position?
2). Compare with the 95% Bayes credible interval calculated earlier. How does the interval change as you increase the probability? Why?

1). This produces a range from 118.0 to 149.6 Mb.

R

pks = find_peaks(scan1_output = lod_add_loco, 
                 map          = cross$pmap,
                 prob         = 0.90, 
                 expand2markers = FALSE)
subset(pks, chr == '2')

OUTPUT

  lodindex lodcolumn chr      pos      lod    ci_lo    ci_hi
1        1    pheno1   2 138.9448 7.127351 117.9557 149.5774

2). This produces a range from 64.9 to 149.6 Mb, which is much broader than the 90% interval. The interval widens because the probability that the interval contains the true QTL position has increased.

R

pks = find_peaks(scan1_output = lod_add_loco, 
                 map          = cross$pmap,
                 prob         = 0.95, 
                 expand2markers = FALSE)
subset(pks, chr == '2')

OUTPUT

  lodindex lodcolumn chr      pos      lod    ci_lo    ci_hi
1        1    pheno1   2 138.9448 7.127351 64.94939 149.5774

Key Points

  • LOD peaks and support intervals can be identified with find_peaks().
  • The Bayesian Credible Interval estimates the width of the support interval around a QTL peak.

Content from Estimating QTL effects


Last updated on 2024-10-21 | Edit this page

Overview

Questions

  • How do I find the founder allele effects at a QTL peak?

Objectives

  • Estimate the founder allele effects at a QTL peak.
  • Plot the estimated founder allele effects.

Estimating Founder Allele Effects


Recall that to model data from a cross, we use

\(y_j = \mu + \beta_k G_{jk} + \epsilon_j\)

where \(y_{ij}\) is the phenotype of the \(j\)th individual, \(\mu\) is the mean phenotype value, \(\beta_k\) is the effect of the \(k\)th genotype, \(G_{jk}\) is the genotype for individual \(j\), and \(\epsilon_j\) is the error for the \(j\)th individual. In the figure below, \(\mu\) equals 1, and \(\beta\) equals 0.1 for the alternative hypothesis (QTL exists).

This linear model is y = 1 + 0.1X + ε. The model intersects the genotype groups at their group means, and is based on μ and βk for chromosome 2 marker rs13476803 located at 138.944754 Mb.

The effect of genotype RR (the β for the RR genotype) at marker rs13476803 is 0.1, while the effect of the BB genotype is -0.1 on the insulin phenotype. The effect of the BR genotype is 0 relative to \(\mu\) equals 1.

The scan1() function returns only LOD scores. To obtain estimated QTL effects, use the function scan1coef(). This function takes a single phenotype and the genotype probabilities for a single chromosome and returns a matrix with the estimated coefficients at each putative QTL location along the chromosome.

For example, to get the estimated QTL effects on chromosome 2 for the insulin phenotype, we would provide the chromosome 2 genotype probabilities and the insulin phenotype to the function scan1coef() as follows:

R

chr      <- "7"
eff_chr7 <- scan1coef(genoprobs = probs[,chr], 
                      pheno     = cross$pheno[,"log10_insulin_10wk", drop = FALSE],
                      kinship   = kinship_loco[[chr]],
                      addcovar  = addcovar)

The result is a matrix of 109 positions \(\times\)
4 genotypes. An additional column contains the intercept values (\(\mu\)).

R

dim(eff_chr7)

OUTPUT

[1] 109   5

Plotting Founder Allele Effects Along a Chromosome


To plot the QTL effects, use the plot_coef() function. Add the LOD plot to the scan1_output argument to include a LOD plot at the bottom.

R

plot_coef(x            = eff_chr7, 
          map          = cross$pmap,
          scan1_output = lod_add_loco, 
          legend       = "topright")

The plot shows effect values on the y-axis and cM values on the x-axis. The value of the intercept (μ) appears at the top. The effect of the BR genotype is centered around zero, with the effects of the other two genotypes above and below. We are usually not directly interested in how the additive covariates change across the genome, but rather, the the founder allele effects change.

To plot only the founder allele effects, use the argument columns to indicate which coefficients to plot. Let’s look at the columns which contain the founder allele effects.

R

head(eff_chr7)

OUTPUT

                   BB           BR          RR    SexMale intercept
rs8252589  0.02792439 -0.002516938 -0.02540745 -0.1753913  1.011125
rs13479104 0.02140644  0.002728978 -0.02413541 -0.1751367  1.009704
rs13479112 0.02010520  0.006609288 -0.02671449 -0.1750756  1.008747
rs13479114 0.01880760  0.007656166 -0.02646377 -0.1750906  1.008527
rs13479120 0.02083594  0.005991180 -0.02682712 -0.1752293  1.008946
rs13479124 0.02061108  0.002498998 -0.02311008 -0.1751596  1.009863

We would like to plot the columns “BB”, “BR”, and “RR”, which are in columns 1 through 3. This is what we pass into the columns argument.

R

plot_coef(x       = eff_chr7, 
          map     = cross$pmap, 
          columns = 1:3, 
          scan1_output = lod_add_loco, 
          main    = "Chromosome 2 QTL effects and LOD scores",
          legend  = "topleft")

Looking at the plot above, which founder allele contributes to higher insulin levels?

Estimating Founder Allele Effects using BLUPs


Another option for estimating the founder allele effects is to treat them as random effects and calculate Best Linear Unbiased Predictors (BLUPs). This is particularly valuable for multi-parent populations such as the Collaborative Cross and Diversity Outbred mice, where the large number of possible genotypes at a QTL leads to considerable variability in the effect estimates. To calculate BLUPs, use scan1blup(); it takes the same arguments as scan1coef(), including the option of a kinship matrix to account for a residual polygenic effect.

R

blup_chr7 <- scan1blup(genoprobs = probs[,chr], 
                       pheno     = cross$pheno[,"log10_insulin_10wk", drop = FALSE],
                       kinship   = kinship_loco[[chr]],
                       addcovar  = addcovar)

We can plot the BLUP effects using plot_coef as before.

R

plot_coef(x       = blup_chr7, 
          map     = cross$pmap, 
          columns = 1:3, 
          scan1_output = lod_add_loco, 
          main    = paste("Chromosome", chr, "QTL BLUP effects and LOD scores"),
          legend  = "topleft")

In the plot below, we plotted the founder allele effects (solid lines) and the BLUPs (dashed lines). In this case, the effects are not greatly different, but the effects are “shrunken” toward zero.

R

plot_coef(x       = eff_chr7, 
          map     = cross$pmap, 
          columns = 1:3, 
          main    = paste("Chromosome", chr,"QTL BLUP effects and LOD scores"),
          legend  = "topleft")
plot_coef(x       = blup_chr7, 
          map     = cross$pmap, 
          columns = 1:3,
          lty     = 2,
          legend  = "topleft",
          add     = TRUE)

Plotting Allele Effects at one Marker


You may also want plot the founder allele effects at the marker with the highest LOD. To do this, you first need to get the position of the marker from the peak list.

R

peaks <- find_peaks(scan1_output = lod_add_loco,
                    map          = cross$pmap,
                    threshold    = thr)
peaks

OUTPUT

  lodindex lodcolumn chr       pos      lod
1        1    pheno1   2 138.94475 7.127351
2        1    pheno1   7 144.18230 5.724018
3        1    pheno1  12  25.14494 4.310493
4        1    pheno1  14  22.24292 3.974322
5        1    pheno1  16  80.37433 4.114024
6        1    pheno1  19  54.83012 5.476587

The position of the maximum LOD on chromosome 7 is 144.182298 Mb. We can pass this value into the qtl2 function pull_genoprobpos to get the genoprobs at this marker.

R

max_pos <- subset(peaks, chr == '7')$pos
max_mkr <- find_marker(map = cross$pmap, 
                       chr = chr, 
                       pos = max_pos)

pr      <- pull_genoprobpos(genoprobs = probs, 
                            marker    = max_mkr)

What does the structure of pr look like?

R

str(pr)

OUTPUT

 num [1:490, 1:3] 1.55e-06 1.55e-06 1.55e-06 3.94e-05 1.00 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:490] "Mouse3051" "Mouse3551" "Mouse3430" "Mouse3476" ...
  ..$ : chr [1:3] "BB" "BR" "RR"

pr is a numeric matrix with 490 rows and 3 columns. The rownames contain the mouse IDs and the column names contain the genotypes.

We can then pass pr as an argument into the fit1 function, which fits the mapping model at a single marker.

R

mod = fit1(genoprobs = pr,
           pheno     = cross$pheno[,"log10_insulin_10wk", drop = FALSE],
           kinship   = kinship_loco[[chr]], 
           addcovar  = addcovar)

Then, we can plot the founder allele effects and their standard error.

R

mod_eff = data.frame(eff = mod$coef, 
                     se  = mod$SE) |>
           rownames_to_column("genotype") |>
           filter(genotype %in% c("BB", "BR", "RR"))

ggplot(data    = mod_eff, 
       mapping = aes(x = genotype, y = eff)) +
  geom_beeswarm() +
  geom_pointrange(mapping = aes(ymin = eff - se, 
                                ymax = eff + se),
                  size       = 1.5,
                  linewidth  = 1.25) +
  labs(title = paste("Founder Allele Effects on Chr", chr),
       x     = "Genotype", y = "Founder Allele Effects") +
  theme(text = element_text(size = 20))

In the plot above, which founder allele contributes to higher insulin levels? Is that consistent with the plot created using plot_coef above?

If instead you want additive and dominance effects, you can provide a square matrix of contrasts, as follows:

R

c7effB <- scan1coef(genoprobs = probs[,chr], 
                    pheno     = cross$pheno[,"log10_insulin_10wk", drop = FALSE],
                    kinship   = kinship_loco[[chr]],
                    addcovar  = addcovar,
                    contrasts = cbind(mu = c(   1, 1,    1), 
                                      a  = c(  -1, 0,    1), 
                                      d  = c(-0.5, 1, -0.5)))

The result will then contain the estimates of mu, a (the additive effect), and d (the dominance effect).

R

dim(c7effB)

OUTPUT

[1] 109   4

R

head(c7effB)

OUTPUT

                 mu           a            d    SexMale
rs8252589  1.011125 -0.02666592 -0.002516938 -0.1753913
rs13479104 1.009704 -0.02277092  0.002728978 -0.1751367
rs13479112 1.008747 -0.02340984  0.006609288 -0.1750756
rs13479114 1.008527 -0.02263568  0.007656166 -0.1750906
rs13479120 1.008946 -0.02383153  0.005991180 -0.1752293
rs13479124 1.009863 -0.02186058  0.002498998 -0.1751596

For marker rs13479570, mu, a, and d are 1.0058686, -0.1114157, -0.0201158, -0.164111.

Here’s a plot of the chromosome 7 additive and dominance effects, which are in the second and third columns.

R

plot_coef(x       = c7effB, 
          map     = cross$pmap[chr], 
          columns = 2:3, 
          col     = 1:2)
legend('bottomleft', lty = 1, col = 1:2, legend = c("additive", "dominant"))

Plotting Phenotypes versus Genotypes


Finally, to plot the raw phenotypes against the genotypes at a single putative QTL position, you can use the function plot_pxg(). This takes a vector of genotypes as produced by the maxmarg() function, which picks the most likely genotype from a set of genotype probabilities, provided it is greater than some specified value (the argument minprob). Note that the “marg” in “maxmarg” stands for “marginal”, as this function is selecting the genotype at each position that has maximum marginal probability.

For example, we could get inferred genotypes at the chr 7 QTL for the insulin phenotype (at 28.6 cM) as follows:

R

g <- maxmarg(probs = probs, 
             map   = cross$pmap, 
             chr   = chr, 
             pos   = max_pos, 
             return_char = TRUE)

We use return_char = TRUE to have maxmarg() return a vector of character strings with the genotype labels.

We then plot the insulin phenotype against these genotypes as follows:

R

plot_pxg(geno   = g, 
         pheno  = cross$pheno[,"log10_insulin_10wk"], 
         SEmult = 2,
         main   = paste("Insulin vs Chr", chr ,"Genotype"))

Challenge 1

Calculate the insulin BLUP effects for chromosome 7.
1) Create an object called blup_chr7 to contain the effects.
2) Plot the chromosome 7 BLUPs and add the LOD plot at bottom. 3) Which founder allele increases insulin levels?

R

chr <- '7'
blup_chr7 <- scan1blup(genoprobs = probs[,chr], 
                       pheno     = cross$pheno[,"log10_insulin_10wk"],
                       addcovar  = addcovar,
                       kinship   = kinship_loco[[chr]])  
plot_coef(x       = blup_chr7, 
          map     = cross$pmap,
          columns = 1:3,
          scan1_output = lod_add_loco,
          legend  = "topleft",
          main    = "Insulin")

The C57BL/6J allele on chromosome 7 increases insulin levels.

Challenge 2

Calculate the insulin BLUP effects for chromosome 19.
1. Create an object called blup_chr19 to contain the effects.
2. Plot the chromosome 19 BLUPs and add the LOD plot at bottom. 3. Which founder allele increases insulin levels? 4. Plot insulin versus the genotype at the marker with the higest LOD on chromosome 19.

R

chr <- '19'
blup_chr19 <- scan1blup(genoprobs = probs[, chr], 
                        pheno     = cross$pheno[, "log10_insulin_10wk"],
                        addcovar  = addcovar,
                        kinship   = kinship_loco[[chr]])  
plot_coef(x       = blup_chr19, 
          map     = cross$pmap,
          columns = 1:3,
          scan1_output = lod_add_loco,
          legend  = "topleft",
          main    = "Insulin")

The BTBR allele on chromosome 19 increases insulin levels.

R

max_pos19 = peaks$pos[peaks$chr == chr]
g <- maxmarg(probs = probs, 
             map   = cross$pmap, 
             chr   = chr, 
             pos   = max_pos19, 
             return_char = TRUE)

plot_pxg(geno   = g, 
         pheno  = cross$pheno[,"log10_insulin_10wk"], 
         SEmult = 2, 
         main   = "Insulin vs Chr 19 Genotype")

Key Points

  • “Estimated founder allele effects can be plotted from the mapping model coefficients.”
  • “Additive and dominance effects can be plotted using contrasts.”

Content from Integrating Gene Expression Data


Last updated on 2024-10-21 | Edit this page

Overview

Questions

  • How can I use gene expression data to identify candidate genes?
  • What is expression QTL mapping?

Objectives

  • Find genes which are correlated with a physiological phenotype.
  • Find genes which have expression QTL in the same position as a physiological phenotype.

Introduction


Once you have QTL peak, the next step if often to identify genes which may be causal candidates. However, there are often hundreds of genes within the QTL support interval. How do you select candidate genes?

Often, there is a tissue which is related to the phenotype that you are measuring. For example, if you measure circulating cardiac troponin, then gene expression in the heart may be relevant. For some complex phenotypes, it may not be easy to select a tissue. For example, if the liver creates a metabolite that causes kidney injury, which is then exacerbated by the immune system, in which tissue would you measure gene expression?

Susceptibility to type 2 diabetes involves a complex interaction between several tissues. However, for the purposes of this tutorial in which we mapped circulating insulin levels, it is reasonable to look at gene expression in the pancreas.

Reading in Gene Expression Data


Gene expression data consists of two parts: expression measurements and gene annotation. In this study, gene expression was measured via two-color microarray. The expression values represent a ratio of two sets of RNA: a control set and the sample set. The expression values represent the log-normalized ratio of these two sets.

We have prepared two files containing normalized gene expression data and the gene annotation.

R

Sys.getenv("VROOM_CONNECTION_SIZE")

OUTPUT

[1] ""

R

annot <- read_csv("data/attie_b6btbr_grcm39/attie_gene_annot.csv",
                  show_col_types = FALSE) |>
           mutate(a_gene_id = as.character(a_gene_id))

expr  <- read_csv("data/attie_b6btbr_grcm39/attie_gene_expr.csv",
                  show_col_types = FALSE)

Challenge 1: Size of data structures.

  1. How many rows and columns are in annot?
  2. What are the column names off annot?

Dimensions of annot.

R

dim(annot)

OUTPUT

[1] 29860    13

Column names of annot.

R

colnames(annot)

OUTPUT

 [1] "a_gene_id"       "chr"             "start"           "end"
 [5] "width"           "strand"          "gene_id"         "gene_name"
 [9] "gene_biotype"    "description"     "gene_id_version" "symbol"
[13] "entrezid"       

Challenge 2

  1. What are the dimensions of expr?
  2. Look at the top 5 row by 5 column block of expr.
  1. Dimensions of expr.

R

dim(expr)

OUTPUT

[1]   490 29861
  1. Top-left block of expr.

R

expr[1:5,1:5]

OUTPUT

# A tibble: 5 × 5
  MouseNum  `497628` `497629` `497630` `497632`
  <chr>        <dbl>    <dbl>    <dbl>    <dbl>
1 Mouse3051  -0.0790 -0.0114  -0.0790    0.0461
2 Mouse3551   0.0544 -0.0693  -0.0721   -0.408
3 Mouse3430   0.154  -0.0468  -0.00900  -0.298
4 Mouse3476   0.144  -0.0451  -0.00505   0.12
5 Mouse3414   0.266   0.00496 -0.0746   -0.127 

Let’s look at the relationship between the annotation and the expression data.

The annotation data has 29860 rows and the expression data has 29861 columns. The first column in expr contains the mouse ID and the remaining columns contain the expression values for each gene. The gene IDs are in the column names. These are Agilent gene IDs. They are also in the a_gene_id column in the annotation.

R

all(annot$a_gene_id == colnames(expr)[-1])

OUTPUT

[1] TRUE

Now we know that the genes are aligned between the annotation and the expression data. When you receive your own expression data, it is critical that you align the genes in your expression data with your annotation data.

We must also align the mouse IDs between the physiological phenotypes and the expression data. We saw above that the mouse IDs are in the first column of the expression data. Let’s see where the mouse IDs are in the phenotype data.

R

head(cross$pheno)

OUTPUT

          log10_insulin_10wk agouti_tan tufted
Mouse3051          1.3985133          1      0
Mouse3551          0.3693926          1      1
Mouse3430          0.8599836          0      1
Mouse3476          0.7997233          1      0
Mouse3414          1.3702541          0      0
Mouse3145          1.7827215          1      0

cross$pheno is a matrix which contains the mouse IDs in the rownames. Let’s convert the expression data to a matrix as well.

R

expr <- expr |>
         column_to_rownames(var = "MouseNum") |>
         as.matrix()

Now let’s check whether the mouse IDs are aligned between cross$pheno and expr. Again, when you assemble your phenotype and expression data, you will need to make sure that the sample IDs are aligned between the data sets.

R

all(rownames(cross$pheno) == rownames(expr))

OUTPUT

[1] TRUE

Identifying Genes in a QTL Support Interval


In previous episodes, we found significant QTL peaks using the find_peaks function. Let’s look at those peaks again.

R

peaks

OUTPUT

  lodindex lodcolumn chr       pos      lod      ci_lo     ci_hi
1        1    pheno1   2 138.94475 7.127351  64.949395 149.57739
2        1    pheno1   7 144.18230 5.724018 139.368290 144.18230
3        1    pheno1  12  25.14494 4.310493  15.834815  29.05053
4        1    pheno1  14  22.24292 3.974322   6.241951  45.93876
5        1    pheno1  16  80.37433 4.114024  10.238134  80.37433
6        1    pheno1  19  54.83012 5.476587  48.370980  55.15007

We looked at the QTL peak on chromosome 19 in a previous lesson. The QTL interval is 6.779094 Mb wide. This is quite wide. Let’s get the genes expressed in the pancreas within this region.

R

chr         <- '19'
peaks_chr19 <- filter(peaks, chr == '19')
annot_chr19 <- filter(annot, chr == '19' & start > peaks_chr19$ci_lo & end < peaks_chr19$ci_hi)
expr_chr19  <- expr[,annot_chr19$a_gene_id]

There are 22 genes! How can we start to narrow down which ones may be candidate genes?

Challenge 3: How can we narrow down the candidate gene list?

  1. Take a moment to think of ways that you could narrow down the gene list to select the most promising candidate genes that regulate insulin levels.
  2. Turn to your neighbor and exchange your ideas.
  3. Share your ideas with the group.

Identifying Coding SNPs in QTL Intervals


When you have a QTL peak, you can search for SNPs which lie within the coding regions of genes. These SNPs may cause amino acid substitutions which will affect protein structure. The most comprehensive SNP resource is the Mouse Genomes Project. They have sequenced 52 inbred strains, including BTBR, and have made the data available in Variant Call Format (VCF). The full VCF files are available on the EBI FTP site. However, the full SNP file is 22 GB! When you need other strains, you can download and query that file. However, for this workshop, we only need SNPs for the BTBR strain. We have created a VCF file containing only SNPs which differ from C57BL/6J and BTBR. Let’s read in this file now using the VariantAnnotation function readVcf.

R

vcf <- readVcf(file.path("data", "btbr_snps_grcm39.vcf.gz"))

The vcf object contains both the SNP allele calls and information about the sequencing depth and SNP consequences. Let’s see how many SNPs are in the VCF.

R

dim(vcf)

OUTPUT

[1] 4791000       1

There are about 4.8 million SNPs in vcf. There are many fields in the VCF file and there are vignettes which document the more advanced features. Here, we will search for SNPs within QTL support intervals and filter them to retain missense or stop coding SNPs.

In order to filter the SNPs by location, we need to create a GenomicRanges object which contains the coordinates of the QTL support interval on chromosome 19. Note that the support interval is reported by qtl2 in Mb and the GRanges object must be given bp positions, so we need to multiply these by 1e6.

R

chr19_gr <- GRanges(seqnames = peaks_chr19$chr, 
                    ranges   = IRanges(peaks_chr19$ci_lo * 1e6, peaks_chr19$ci_hi * 1e6))
chr19_gr

OUTPUT

GRanges object with 1 range and 0 metadata columns:
      seqnames            ranges strand
         <Rle>         <IRanges>  <Rle>
  [1]       19 48370979-55150074      *
  -------
  seqinfo: 20 sequences from an unspecified genome; no seqlengths

Next, we will filter the vcf object to retain SNPs within the support interval.

R

vcf_chr19 <- subsetByOverlaps(vcf, ranges = chr19_gr)

Let’s see how many SNPs there are.

R

dim(vcf_chr19)

OUTPUT

[1] 19065     1

There are still 19,000 SNPs! But many of them may not be in coding regions or may be synonymous. Next, we will search for SNPs that have missense or stop coding changes. These are likely to have the most severe effects. Note that synonymous SNPs and SNPs in the untranslated region of genes may affect RNA folding, RNA stability, and translation. But it is more difficult to predict those effects.

The vcf object contains predicted SNP consequences in a field called CSQ. The format of this field is challenging to parse, so we will get it, search for missense and stop SNPs, and then do some data-wrangling to may the output readable.

R

csq <- info(vcf_chr19)$CSQ
csq <- as.list(csq)

wh  <- grep('missense|stop', csq)
csq <- csq[wh]
rr  <- rowRanges(vcf_chr19)[wh]

Let’s see how many SNPs had missense or stop codon consequences.

R

length(csq)

OUTPUT

[1] 8

There were only 8 SNPs out of 19,000 SNPs with missense or stop codon consequences. Next, let’s do the data wrangling to reformat the consequences of these SNPs.

R

csq <- lapply(csq, strsplit, split = "\\|")
csq <- lapply(csq, function(z) {
                     matrix(unlist(z), ncol = length(z[[1]]), byrow = TRUE)
                   })

for(i in seq_along(csq)) {
  
  csq[[i]] <- data.frame(snp_id = names(rr)[i],
                         csq[[i]])
  
} # for(i)

csq = do.call(rbind, csq)

Let’s look at the top of csq.

R

head(csq)

OUTPUT

           snp_id X1               X2       X3     X4                 X5
1 19:50141282_A/G  G   intron_variant MODIFIER Sorcs1 ENSMUSG00000043531
2 19:50141282_A/G  T   intron_variant MODIFIER Sorcs1 ENSMUSG00000043531
3 19:50141282_A/G  C   intron_variant MODIFIER Sorcs1 ENSMUSG00000043531
4 19:50141282_A/G  G missense_variant MODERATE Sorcs1 ENSMUSG00000043531
5 19:50141282_A/G  T missense_variant MODERATE Sorcs1 ENSMUSG00000043531
6 19:50141282_A/G  C missense_variant MODERATE Sorcs1 ENSMUSG00000043531
          X6                 X7             X8    X9   X10 X11 X12  X13  X14
1 Transcript ENSMUST00000072685 protein_coding       26/26
2 Transcript ENSMUST00000072685 protein_coding       26/26
3 Transcript ENSMUST00000072685 protein_coding       26/26
4 Transcript ENSMUST00000111756 protein_coding 27/27               3465 3448
5 Transcript ENSMUST00000111756 protein_coding 27/27               3465 3448
6 Transcript ENSMUST00000111756 protein_coding 27/27               3465 3448
   X15 X16     X17 X18 X19 X20 X21 X22 X23 X24               X25 X26 X27 X28
1                           -1     SNV MGI
2                           -1     SNV MGI
3                           -1     SNV MGI
4 1150 S/P Tct/Cct          -1     SNV MGI       tolerated(0.09)
5 1150 S/T Tct/Act          -1     SNV MGI       tolerated(0.26)
6 1150 S/A Tct/Gct          -1     SNV MGI     deleterious(0.03)
  X29
1
2
3
4
5
6    

There are a lot of columns and we may not need all of them. Next, let’s reduce the number of columns to keep the ones that we need.

R

csq <- csq[,c(1:7,9,17,18,26)]

Next, we will subset the consequences to retain the unique rows and then retain rows which contain “missense” and “stop”.

R

csq <- distinct(csq) |>
         filter(str_detect(X2, "missense|stop"))

Let’s look at the top of csq now.

R

head(csq)

OUTPUT

           snp_id X1               X2       X3     X4                 X5
1 19:50141282_A/G  G missense_variant MODERATE Sorcs1 ENSMUSG00000043531
2 19:50141282_A/G  T missense_variant MODERATE Sorcs1 ENSMUSG00000043531
3 19:50141282_A/G  C missense_variant MODERATE Sorcs1 ENSMUSG00000043531
4 19:50141444_G/A  A missense_variant MODERATE Sorcs1 ENSMUSG00000043531
5 19:50141444_G/A  T missense_variant MODERATE Sorcs1 ENSMUSG00000043531
6 19:50141444_G/A  C missense_variant MODERATE Sorcs1 ENSMUSG00000043531
          X6             X8 X16     X17                              X25
1 Transcript protein_coding S/P Tct/Cct                  tolerated(0.09)
2 Transcript protein_coding S/T Tct/Act                  tolerated(0.26)
3 Transcript protein_coding S/A Tct/Gct                deleterious(0.03)
4 Transcript protein_coding S/F tCc/tTc deleterious_low_confidence(0.01)
5 Transcript protein_coding S/Y tCc/tAc deleterious_low_confidence(0.01)
6 Transcript protein_coding S/C tCc/tGc deleterious_low_confidence(0.01)

Column X4 contains gene names. Let’s get the unique gene names.

R

csq |>
  distinct(X4)

OUTPUT

      X4
1 Sorcs1
2  Rbm20

There are only two genes in the QTL support interval which contain missense SNPs.

Challenge 4: Search Genes using Pubmed

  1. Go to Pubmed and search for papers involving the two genes and insulin or diabetes.

Both genes return published papers relating to insulin:

Sorcs1

Rbm2

Sorcs1 was identified as a gene which relates to obesity-induced type 2 diabetes by the same group that provided this data set. In fact, it was identified in the same type of mouse cross between C57BL/6J and BTBR.

In this case, we used published SNPs data to identify two potential candidate genes. If we had not found publications that had already associated these genes with diabetes, these would be genes that you would follow up on in the lab.

Using Expression QTL Mapping


Another method of searching for candidate genes is to look for genes which have QTL peaks in the same location as the insulin QTL. Genes which have QTL that are co-located with insulin QTL will also be strongly correlated with insulin levels. These genes may be correlated with insulin because they control insulin levels, respond to insulin levels, or are correlated by chance. However, genes which have QTL within the insulin QTL support interval are reasonable candidate genes since we expect the genotype to influence expression levels.

Let’s perform QTL mapping on the genes within the chromosome 19 insulin QTL support interval. We can do this by passing in the expression values as phenotypes into scan1.

R

eqtl_chr19 <- scan1(genoprobs = probs[,chr],
                   pheno     = expr_chr19,
                   kinship   = kinship_loco[[chr]],
                   addcovar  = addcovar)

Let’s look at the top of the results.

R

head(eqtl_chr19[,1:6])

OUTPUT

             498370   500695    501344    504488    505754     506116
rs4232073  2.382757 2.017080 0.7656653 0.2790253 0.9177116 0.15670328
rs13483548 2.340860 2.005920 0.7778981 0.2792932 0.9191257 0.15667150
rs13483549 2.334980 1.995714 0.7841222 0.2779127 0.9132370 0.14886307
rs13483550 2.227526 1.720540 0.8645001 0.2454031 0.7653925 0.06175031
rs13483554 1.939305 1.690000 0.8860579 0.2489432 0.8189368 0.11267374
rs13483555 2.083754 1.623693 0.8624511 0.2550340 0.8669779 0.12892654

The eQTL results have one row for each marker and one column for each gene. Each column represents the LOD plot for one gene. Let’s plot the genome scan for one gene.

R

gene_id = '10002678668'
plot_scan1(x   = eqtl_chr19, 
           map = cross$pmap,
           lodcolumn = gene_id,
           main      = gene_id)

This gene seems to have a QTL on chromosome 19 near 52 Mb with a LOD of 34. The insulin QTL is closer to 55 Mb, so this may be a good candidate gene.

We can harvest the significant QTL peaks on chromosome 19 using find_peaks. We will use the default threshold of LOD = 3.

R

eqtl_chr19_peaks = find_peaks(eqtl_chr19, map = cross$pmap) |>
                    arrange(pos)
eqtl_chr19_peaks

OUTPUT

  lodindex   lodcolumn chr      pos       lod
1       21 10004035475  19 47.36185  4.655021
2       17 10002936879  19 51.36235  7.657043
3       16 10002928587  19 51.82752  7.117759
4       11 10002678668  19 52.42857 34.668074
5        1      498370  19 52.99433 12.391144
6       14 10002910800  19 54.83012  4.233249

We can see that there are 6 significant QTL on chromosome 19. All of these genes are within the support interval of the insulin QTL.

Let’s filter the eQTL genes and add in gene annotation for them.

R

eqtl_chr19_peaks <- eqtl_chr19_peaks |>
                     filter(pos > peaks_chr19$ci_lo & pos < peaks_chr19$ci_hi) |>
                     left_join(annot, by = c('lodcolumn' = 'a_gene_id'))
eqtl_chr19_peaks

OUTPUT

  lodindex   lodcolumn chr.x      pos       lod chr.y    start      end  width
1       17 10002936879    19 51.36235  7.657043    19 53.66574 53.85551 189775
2       16 10002928587    19 51.82752  7.117759    19 50.13174 50.66708 535348
3       11 10002678668    19 52.42857 34.668074    19 52.92036 53.02864 108289
4        1      498370    19 52.99433 12.391144    19 53.36782 53.37901  11189
5       14 10002910800    19 54.83012  4.233249    19 52.25273 52.25391   1180
  strand            gene_id gene_name   gene_biotype
1      + ENSMUSG00000043639     Rbm20 protein_coding
2      - ENSMUSG00000043531    Sorcs1 protein_coding
3      - ENSMUSG00000025027   Xpnpep1 protein_coding
4      - ENSMUSG00000025024    Smndc1 protein_coding
5      + ENSMUSG00000035804      Ins1 protein_coding
                                                                                description
1                          RNA binding motif protein 20 [Source:MGI Symbol;Acc:MGI:1920963]
2   sortilin-related VPS10 domain containing receptor 1 [Source:MGI Symbol;Acc:MGI:1929666]
3 X-prolyl aminopeptidase (aminopeptidase P) 1, soluble [Source:MGI Symbol;Acc:MGI:2180003]
4             survival motor neuron domain containing 1 [Source:MGI Symbol;Acc:MGI:1923729]
5                                               insulin I [Source:MGI Symbol;Acc:MGI:96572]
        gene_id_version  symbol entrezid
1 ENSMUSG00000043639.16   Rbm20       NA
2 ENSMUSG00000043531.17  Sorcs1       NA
3 ENSMUSG00000025027.19 Xpnpep1       NA
4  ENSMUSG00000025024.8  Smndc1       NA
5  ENSMUSG00000035804.6    Ins1       NA

Let’s plot the insulin QTL again to remind ourselves what the peak on chromosome 19 look like. We will also plot the genome scan for one of the genes on chromosome 19.

R

plot_scan1(x    = eqtl_chr19,
           map  = cross$pmap,
           lodcolumn = "10002936879",
           main = "Insulin")
plot_scan1(x    = lod_add_loco, 
           map  = cross$pmap,
           chr  = '19',
           col  = 'blue',
           add  = TRUE)
legend("topleft", legend = c("insluin", "10002936879"), 
       col = c('blue', 'black'), lwd = 2)

Mediation Analysis


If a gene is correlated with insulin levels and if it has an eQTL in the same location as insulin, then we could add it into the insulin QTL model and see if it reduces the LOD on chromosome 19.

R

stopifnot(rownames(addcovar) == rownames(expr_chr19))
addcovar_chr19 <- cbind(addcovar, expr_chr19[,"10002936879"])

lod_med = scan1(genoprobs = probs[,chr],
                pheno     = cross$pheno[,'log10_insulin_10wk',drop = FALSE],
                kinship   = kinship_loco[[chr]],
                addcovar  = addcovar_chr19)

Next, we will plot the original insulin genome scan and overlay the genome scan with gene 10002936879 in the model.

R

plot_scan1(x    = lod_add_loco,
           map  = cross$pmap,
           chr  = '19',
           main = "Insulin with/without 10002936879")
plot_scan1(x   = lod_med,
           map = cross$pmap,
           chr  = '19',
           col = "blue",
           add = TRUE)
legend("topleft", legend = c("insluin", "mediation"), 
       col = c('black', 'blue'), lwd = 2)

The LOD dropped from 5.4765866 to 2.0202412. This is a difference of 3.4563455.

It would be slow to run a genome scan for each gene on chromosome 19 and look at the LOD drop. Also, we don’t really need the LOD drop across the entire chromosome. We only need the LOD drop at the location of the peak LOD. To do this, we can use the fit1 function to get the LOD at the marker with the highest LOD in the insulin genome scan.

R

# Get the probs at the maximum insulin QTL on Chr 19.
pr_chr19 = pull_genoprobpos(genoprobs = probs,
                            map       = cross$pmap,
                            chr       = chr,
                            pos       = peaks_chr19$pos)

# Create data sructure for results.
lod_drop = data.frame(a_gene_id = colnames(expr_chr19),
                      lod       = 0)

for(i in 1:ncol(expr_chr19)) {

  # Make new covariates.
  curr_covar = cbind(addcovar, expr_chr19[,i])
  
  mod = fit1(genoprobs = pr_chr19,
             pheno     = cross$pheno[,'log10_insulin_10wk',drop = FALSE],
             kinship   = kinship_loco[[chr]],
             addcovar  = curr_covar)
  
  lod_drop$lod[i] = mod$lod
  
} # for(i)

lod_drop$lod_drop = lod_drop$lod - peaks_chr19$lod

R

lod_drop <- left_join(lod_drop, annot_chr19, by = 'a_gene_id')

plot(lod_drop$start, lod_drop$lod_drop, col = NA, 
     main = "Mediation Analysis on Chr 19",
     xlab = "Posiiton (Mb)", ylab = "LOD Drop")
text(lod_drop$start, lod_drop$lod_drop, labels = lod_drop$gene_name)
abline(v = peaks_chr19$pos, col = 2)

In the plot above, we plotted the position of each gene versus the decrease in the LOD score (i.e. LOD drop). Genes with the lowest LOD drop are the best candidate genes based on mediation analysis.

Let’s look at the annotation for the genes with LOD drop less than -2.

R

lod_drop |>
  filter(lod_drop < -2) |>
  select(gene_id, symbol, chr, start, end, lod, lod_drop, description)

OUTPUT

             gene_id  symbol chr    start      end      lod  lod_drop
1 ENSMUSG00000025027 Xpnpep1  19 52.92036 53.02864 2.850368 -2.626219
2 ENSMUSG00000035804    Ins1  19 52.25273 52.25391 2.598842 -2.877745
3 ENSMUSG00000043639   Rbm20  19 53.66574 53.85551 2.020241 -3.456345
                                                                                description
1 X-prolyl aminopeptidase (aminopeptidase P) 1, soluble [Source:MGI Symbol;Acc:MGI:2180003]
2                                               insulin I [Source:MGI Symbol;Acc:MGI:96572]
3                          RNA binding motif protein 20 [Source:MGI Symbol;Acc:MGI:1920963]

Do you see any good candidate genes which might regulate insulin?

Key Points

  • Use .md files for episodes when you want static content
  • Use .Rmd files for episodes when you need to generate output
  • Run sandpaper::check_lesson() to identify any issues with your lesson
  • Run sandpaper::build_lesson() to preview your lesson locally

Content from QTL Mapping in Diversity Outbred Mice


Last updated on 2024-10-21 | Edit this page

Overview

Questions

  • How do I map traits in Diversity Outbred mice?
  • How do I interpret the founder allele effects at a QTL peak?
  • How do I perform association mapping in Diversity Outbred mice?
  • How do I narrow down the set of candidate genes under a QTL?

Objectives

  • Understand the data objects required for QTL mapping.
  • Use the scan1 family of functions to perform different types of QTL scans.
  • Interpret the founder allele effects.
  • Find significant QTL peaks and identify candidate peaks under a QTL peak.
  • Use auxiliary data to identify candidate genes under a QTL peak.

Introduction


This tutorial will take you through the process of mapping a QTL and searching for candidate genes for DNA damage from benzene exposure. The adverse health effects of benzene, including leukemia and aplastic anemia, were first studied in occupational settings in which workers were exposed to high benzene concentrations. Environmental sources of benzene exposure typically come from the petrochemical industry, however, a person’s total exposure can be increased from cigarettes, consumer products, gas stations, and gasoline powered engines or tools stored at home.

Exposure thresholds for toxicants are often determined using animal models that have limited genetic diversity, including standard inbred and outbred rats and mice. These animal models fail to capture the influence of genetic diversity on toxicity response, an important component of human responses to chemicals such as benzene. The Diversity Outbred (DO) mice reflect a level of genetic diversity similar to that of humans.

The data comes from a toxicology study in which Diversity Outbred (DO) mice were exposed to benzene via inhalation for 6 hours a day, 5 days a week for 4 weeks (French, J. E., et al. (2015) Environ Health Perspect 123(3): 237-245). The study was conducted in two equally sized cohort of 300 male mice each, for a total of 600 mice. They were then sacrificed and reticulocytes (red blood cell precursors) were isolated from bone marrow. The number of micro-nucleated reticulocytes, a measure of DNA damage, was then measured in each mouse. The goal is to map gene(s) that influence the level of DNA damage in the bone marrow.

Benzene study dosing showing 6 hours per day, 5 days per week of inhalation.
Benzene Study Dosing

DO Reference Data


As you work with DO data, you may need different reference files. These are complied on a JAX website at: https://www.jax.org/research-and-faculty/genetic-diversity-initiative/tools-data/diversity-outbred-reference-data. There are links to the markers and founder genotypes as well as the standard colors that we use for the eight founder strains.

Open the qtl_mapping Project.


If you have not opened the qtl_mapping project, do this now so that your script and console will run in the correct directory.

  1. From the File menu, select “Open Project…”.
  2. Navigate to the project that you created in your Desktop/qtl_mapping folder and open it.

This should reset your RStudio window and close any files.

Create a New Markdown File


Create a new R Markdown file and save it in the “Desktop/scripts” folder as “do_qtl_mapping.Rmd”.

Load Libraries


The primary library that we will use is qtl2. You should have qtl2 installed from the Setup that you performed before the workshop.

R

suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(ggbeeswarm))
suppressPackageStartupMessages(library(qtl2))
suppressPackageStartupMessages(library(qtl2convert))

Load and Explore the Data


The data for this tutorial has been saved as an R binary file that contains several data objects. Load it in now by running the following command in your new script.

R

load("./data/qtl2_demo_grcm39.Rdata")

We loaded in four data objects from the *.Rdata file. Look at the Environment tab to see what was loaded. You should see an object called pheno which contains the phenotypes. There are two objects called gmap and pmap, which contain the marker positions in genetic and physical coordinates. Finally, there is an object called probs which contains the founder allele probabilities for each mouse at each marker.

Phenotypes

pheno is a data frame containing the phenotype data. Click on the blue circle to the left of pheno in the Environment pane to view its contents.

Callout

The sample IDs must be in the rownames of pheno for qtl2 to work.

pheno contains the sample ID, the study cohort, the concentration of benzene and several blood cell measurements. Note that the sample IDs are also stored in the rownames of pheno.

In order to save time for this tutorial, we will only map with 149 samples from the 100 ppm dosing group.

Challenge 1: How many mice are there in the phenotype data?

Look in the Environment tab or use the Console to figure out how many mice there are in the phenotype data.

You can look at the number of observations in the Environment tab or you can get the number of rows in pheno in the Console.

R

nrow(pheno)

OUTPUT

[1] 598

There are 598 mice in the phenotype data.

Challenge 2: How many mice are there from each sex?

Use an R command to determine how many mice there are from each sex.

You can look at the number of observations in the Environment tab or you can get the number of rows in pheno in the Console.

R

count(pheno, sex)

OUTPUT

   sex   n
1 male 598

All of the mice are male.

Challenge 3: How is the “prop.bm.mn.ret” phenotype distributed?

Make histogram of the “prop.bm.mn.ret” column and assess whether it is normally distributed.

You can look at the number of observations in the Environment tab or you can get the number of rows in pheno in the Console.

R

ggplot(data = pheno, mapping = aes(prop.bm.mn.ret)) +
  geom_histogram() +
  labs(title = "Histogram of Bone Marrow MN-RETs") +
  theme(text = element_text(size = 20))

OUTPUT

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The bone marrow MN-RET values do not look normally distributed.

Challenge 4: How is the log(“prop.bm.mn.ret”) phenotype distributed?

Take the log of the “prop.bm.mn.ret” column and make a histogram.Assess whether it is normally distributed.

You can look at the number of observations in the Environment tab or you can get the number of rows in pheno in the Console.

R

pheno |>
  mutate(log_mnret = log(prop.bm.mn.ret)) |>
  ggplot(mapping = aes(log_mnret)) +
    geom_histogram() +
  labs(title = "Histogram of log(Bone Marrow MN-RETs)") +
  theme(text = element_text(size = 20))

OUTPUT

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The log of the bone marrow MN-RET values look more normally distributed.

Let’s add a column containing the log-transformed phenotype to our pheno object.

R

pheno <- pheno |>
           mutate(log_mnret = log(prop.bm.mn.ret))

Some researchers are concerned about the reproducibility of DO studies. The argument is that each DO mouse is unique and therefore can never be reproduced. But this misses the point of using the DO. While mice are the sampling unit, in QTL mapping we are sampling the founder alleles at each locus. An average of 1/8th of the alleles should come from each founder at any given locus. Also, DO mice are a population of mice, not a single strain. While it is true that results in an individual DO mouse may not be reproducible, results at the population level should be reproducible. This is similar to the human population in that results from one individual may not represent all humans, but results at the population level should be reproducible.

The benzene inhalation study was conducted in two separate cohorts (termed study in the pheno object). Let’s plot the proportion of micronucleated reticulocytes in bone marrow versus the benzene concentration for each study cohort.

R

pheno |>
  mutate(conc = factor(conc)) |>
  ggplot(mapping = aes(conc, prop.bm.mn.ret * 1000, color = conc)) +
    geom_violin(draw_quantiles = 0.5, linewidth = 1.25) +
    geom_beeswarm() +
    scale_y_log10() +
    scale_color_brewer(palette = "Reds") +
    facet_wrap(~study, ncol = 2) +
    labs(title = "Bone Marrow MN-RET by Study Cohort") +
    theme(text = element_text(size = 20), 
          legend.position = "none")

As you can see, while individual mice have varying micronucleated reticulocytes, there is a dose-dependent increase in micronucleated reticulocytes in both cohorts. This is an example of how results in the DO reproduce at the population level.

Markers

The markers are the SNPs on the Mouse Universal Genotyping Array (MUGA) that were used to reconstruct the haplotypes of the DO mice. This version of the array had about 8,000 SNPs on it. The latest version contains over 140,000 SNPs. We are using this smaller data set so that we can finish the analysis in a timely manner during class.

The markers have been mapped to the latest mouse genome build (GRCm39) and are provided by Karl Broman on Github at: https://github.com/kbroman/MUGAarrays/tree/main/UWisc. There are four versions of the MUGA platforms. In this study, we used the MUGA and those marker files are named as “muga_uwisc_vN.csv”, where “N” is a version number. We will use version 4, which you downloaded into your “data” directory during the Setup before the workshop.

Open the marker file now:

R

markers = read.csv("./data/muga_uwisc_v4.csv")

Challenge 5: Which columns contain the marker positions?

Look at the structure of the genetic or physical map in the Environment tab and see which columns contain marker positions in some coordinate system.

R

str(markers)

OUTPUT

'data.frame':	7854 obs. of  11 variables:
 $ marker        : chr  "JAX00240603" "UNC010001397" "UNC010515443" "UNC010001943" ...
 $ chr           : chr  "1" "1" "1" "1" ...
 $ bp_mm10       : int  3252796 3336839 3668628 3977130 4430623 4531029 5045931 5840130 6020779 6378154 ...
 $ bp_grcm39     : int  3323019 3407062 3738851 4047353 4500846 4601252 5116154 5910353 6091003 6448378 ...
 $ cM_cox        : num  0.15 0.172 0.194 0.197 0.202 ...
 $ strand        : chr  "minus" "minus" "minus" "plus" ...
 $ snp           : chr  "TC" "AC" "TC" "AC" ...
 $ unique        : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
 $ unmapped      : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ probe         : chr  "AAAATATGATTCTTTTAATTAATGCATCAGTGATGAGAAGTGAGAGTGGC" "CATAGTGTCTGGTGAGAAGTCTGGAGTTATTCTAATAGGCCTGCCTTTAT" "CAGGAAATGATGCTGAGAAAGTGAGAAGTAGGAAAACGTGGAGAAAAATA" "TCTATTCCTATCACCTTGTACAAAGCTCAAGTCTTGTAAACCCCCCCCCC" ...
 $ strand_flipped: logi  TRUE FALSE FALSE FALSE FALSE FALSE ...

The columns labelled “bp_mm10”, “bp_grcm39”, and “cM_cox” contain marker positions.

Challenge 6: Which chromosomes are in the markers?

Use an R command to determine which chromosomes are in markers.

R

count(markers, chr)

OUTPUT

    chr   n
1     1 493
2    10 375
3    11 335
4    12 318
5    13 327
6    14 331
7    15 265
8    16 266
9    17 227
10   18 216
11   19 162
12    2 474
13    3 438
14    4 412
15    5 382
16    6 414
17    7 379
18    8 341
19    9 342
20    M   3
21    X 414
22    Y   5
23 <NA> 935

There are 23 chromosomes, including some markers with “NA”. These are markers that could not be uniquely mapped to one chromosome.

Challenge 6: What do you think the difference is between teh “bp_mm10” and

“bp_grcm39” columns?

Turn to your neighbor and discuss what the differences between the two column are. Then share your ideas with the class.

There are different builds of the mouse genome. The Genome Resource Consortium (GRC) named mouse builds from build 1 to 39. The University of California at Santa Cruz developed their own naming system which conflicted with the GRC names. “GRCm38” was equivalent to “mm10”. We are using the GRC naming conventions for genome builds.

qtl2 needs the markers to be in a specific format. qtl2 requires that the markers be in a list. The qtl2convert package provides functions to make this easier.

First, we will filter the markers to retain ones on the autosomes and Chr X. We don’t use the mitochondrial or Y chromosomes in mapping, but they may be important covariates. The MUGA did not contain enough markers on Chr M and Y to identify the contributing founder strain, but later versions of the MUGA, such as the GigaMUGA, do contain enough markers to identify founder contributions.

R

markers <- markers |>
             filter(chr %in% c(1:19, 'X'))

Next, we will create a column with GRCm39 Mb positions.

R

markers <- markers |>
             mutate(pos = bp_grcm39 / 1e6)

Finally, we will use the qtl2convert::map_df_to_list() function to create the list of markers that qtl2 requires. We will call this object the marker “map” to distinguish it from the marker data.frame.

R

map = qtl2convert::map_df_to_list(map = markers, pos_column = "pos")

Founder Allele Probabilities (Genoprobs)

The probs object is a list with 20 elements. Each element of probs is a three-dimensional array containing the founder allele dosages for each sample at each marker on one chromosome. These probabilities have been pre-calculated for you, so you can skip the step for calculating allele probabilities.

Let’s look at the dimensions of probs for chromosome 1:

R

dim(probs[[1]])

OUTPUT

[1] 590   8 479

probs[[1]] is a three-dimensional array containing the proportion of each founder haplotype at each marker for each DO sample. The 590 samples are in the first dimension, the 8 founders in the second and the 479 markers along chromosome 1 are in the third dimension.

Let’s return to the probs object. Look at the contents of one sample on chromosome 1.

R

plot_genoprob(probs, map, ind = 1, chr = 1)

Uh oh! We have an error which says “Different numbers of positions in probs and map”. This means that the number of markers in the genoprobs and the marker map is different. In fact, this is quite important. The markers in probs and map must match exactly on every chromosome.

Challenge 7: Align the markers in probs and map.

Work with the person next to you to write a short script that will get the intersection of the marker names in the probs and map for each chromosome, subset the markers in each object to only contain the common markers, and make sure that the markers in probs are in the same order as in map.

Hint: The markers names for each chromosome in probs are in dimention 3. i.e. dimnames(probs[[1]])[[3]] contains the marker names for chromosome 1.

We will write a loop that gets the common markers between probs and map for each chromosome and then subsets them.

R

# Verify that probs and map are the same length.
stopifnot(length(probs) == length(map))

# Verify that probs and map have their chromsomes in the same order.
stopifnot(names(probs) == names(map))

# Loop through each chromosome...
for(i in seq_along(probs)) {
  
  # Get the intersection of the markers in probs and map.
  common_markers <- intersect(dimnames(probs[[i]])[[3]], names(map[[i]]))
  
  # Subset the map to contain the common markers.
  map[[i]] <- map[[i]][common_markers]
  
  # Subset the probs to contain the common markers.
  probs[[i]] <- probs[[i]][,,names(map[[i]])]
  
  # Verify that the markers are identical in probs and map.
  stopifnot(dimnames(probs[[i]])[[3]] == names(map[[i]]))
  
} # for(i)

Now that we have aligned the markers in probs and map, let’s plot the allele probabilities for one sample on chromosome 1.

R

plot_genoprob(probs, map, ind = 1, chr = 1, main = "Founder Allele Probabilities")

In the plot above, the founder contributions, which range between 0 and 1, are colored from white (= 0) to black (= 1.0). A value of ~0.5 is grey. The markers are on the X-axis and the eight founders (denoted by the letters A through H) on the Y-axis. Starting at the left, we see that this sample has genotype BB because the row for B is black, indicating values of 1.0. Moving along the genome to the right, the genotype becomes CE where rows C and E are gray, followed by CD, FH, AG, GH, etc. The values at each marker sum to 1.0.

Calculating A Kinship Matrix

Next, we need to create a matrix that accounts for the kinship relationships between the mice. We do this by looking at the correlation between the founder haplotypes for each sample at each SNP. For each chromosome, we create a kinship matrix using all markers except the ones on the current chromosome using the “loco” (leave-one-chromosome-out) method. Simulations suggest that mapping using this approach increases the power to detect QTL.

Callout

The sample IDs must be in the rownames of probs. The sample IDs will be copied to the row and column names in the kinship matrices.

R

K <- calc_kinship(probs = probs, type = "loco")

Kinship values between pairs of samples range between 0 (no relationship) and 1.0 (completely identical). Let’s look at the kinship matrix for the first 50 samples.

R

n_samples <- 50
heatmap(K[[1]][1:n_samples, 1:n_samples])

The figure above shows kinship between all pairs of samples. Light yellow indicates low kinship and dark red indicates higher kinship. Orange values indicate varying levels of kinship between 0 and 1. The dark red diagonal of the matrix indicates that each sample is identical to itself. The orange blocks along the diagonal may indicate close relatives (i.e. siblings or cousins).

Covariates

Next, we need to create additive covariates that will be used in the mapping model. We will use study cohort as a covariate in the mapping model. This is the same as outbreeding generation since each cohort was purchased from successive generations. If we were mapping with all mice, we would also add benzene concentration to the model. This study contained only male mice, but in most cases, you would include sex as an additive covariate as well.

To recap, you would normally add sex and outbreeding generation to the model. In this case, we have one sex and are using study instead of generation.

R

addcovar <- model.matrix(~study, data = pheno)[,-1, drop = FALSE]

The code above copies the rownames(pheno) to rownames(addcovar) as a side-effect.

Callout

The sample IDs must be in the rownames of pheno, addcovar, genoprobs and K. qtl2 uses the sample IDs to align the samples between objects. For more information about data file format, see Karl Broman’s vignette on input file format.

Performing a Genome Scan


Before we perform our first genome scan, let’s look at the mapping model. At each marker on the genotyping array, we will fit a model that regresses the phenotype (micronucleated reticulocytes) on covariates and the founder allele proportions.

\(y_i = \beta_ss_i+\sum_{j=1}^8(\beta_jg_{ij}) + \lambda_i + \epsilon_i\)

where:

\(y_i\) is the phenotype for mouse \(i\), \(\beta_s\) is the effect of study cohort, \(s_i\) is the study cohort for mouse \(i\), \(\beta_j\) is the effect of founder allele \(j\), \(g_{ij}\) is the probability that mouse \(i\) carries an allele from founder \(j\), \(\lambda_;<sub>_i\) is an adjustment for kinship-induced correlated errors for mouse \(i\), \(\epsilon_i\) is the residual error for mouse \(i\).

Note that this model will give us an estimate of the effect of each of the eight founder alleles at each marker. This will be important when we estimate the founder allele effects below.

There are almost 600 samples in this data set and it may take several minutes to map one trait. In order to save some time, we will map using only the samples in the 100 ppm concentration group. We will create a smaller phenotype data.frame.

R

pheno_100 <- pheno[pheno$conc == 100,]

Challenge 8: How many mice are we mapping with?

Get the number of rows in pheno_100.

You can do this by looking at the number of observations for pheno_100 in the Environment tab or in the Console.

R

nrow(pheno_100)

OUTPUT

[1] 149

There are 149 mice in pheno_100.

In order to map the proportion of bone marrow reticulocytes that were micro-nucleated, you will use the scan1 function. To see the arguments for scan1, you can type help(scan1). First, let’s map the untransformed phenotype. (Recall that we log-transformed it above).

R

index <- which(colnames(pheno_100) == "prop.bm.mn.ret")
lod   <- scan1(genoprobs = probs, 
               pheno     = pheno_100[,index, drop = FALSE], 
               kinship   = K, 
               addcovar  = addcovar)

Next, we plot the genome scan.

R

plot_scan1(x    = lod, 
           map  = map, 
           main = "Proportion of Micro-nucleated Bone Marrow Reticulocytes")

It looks like we have a large peak on chromosome 10.

Challenge 9: How does a log-tranformation change the QTL plot?

  1. Perform a genome scan on the column called log_mnret. (Hint: set index to the column index in pheno.)
  2. How does the LOD score for the peak on Chr 10 change?

R

index <- which(colnames(pheno_100) == "log_mnret")
lod2  <- scan1(genoprobs = probs, 
               pheno     = pheno_100[,index, drop = FALSE], 
               kinship   = K, 
               addcovar  = addcovar)
plot_scan1(x    = lod2, 
           map  = map, 
           main = "(log(Proportion of Micro-nucleated Bone Marrow Reticulocytes)")

Using a log transformation increases the LOD increase from about 17 to over 25.

The challenge shows the importance of looking at your data and transforming it to meet the expectations of the mapping model. In this case, a log transformation improved the model fit and increased the LOD score. We will continue the rest of this lesson using the log-transformed data. Set your index variable equal to the column index of log.MN.RET. Redo the genome scan with the log-transformed reticulocytes.

R

index <- which(colnames(pheno) == "log_mnret")
lod   <- scan1(genoprobs = probs, 
               pheno     = pheno_100[, index,  drop = FALSE], 
               kinship   = K, 
               addcovar  = addcovar)

Assessing Significance of LOD Scores


There is clearly a large peak on Chr 10. But is it significant? In other words, could we see a LOD score over 25 by chance? And if so, what is the probability of seeing a LOD of 25 or higher?

These questions are most commonly answered via permutation of the sample labels and recording the maximum LOD score in each permutation. This procedure breaks the connection between the phenotypes and the genotypes of the mice, so the results represent the expected LOD by chance.

Imagine that you shuffle the sample labels in the phenotype file and map the resampled trait. Then you record the maximum LOD across the genome. You would get a table that looks like this:

Permutation Max LOD
1 1.7
2 3.2
3 2.2
4 1.8
5 2.3

Once you ran 1,000 permutations, you could plot a histogram of the maximum LOD values.

R

perms   = readRDS('./data/sim_perm1000.rds')
max_lod = apply(perms, 2, max)
hist(max_lod, breaks = 100, main = 'Histogram of Max. LOD')
q95 = quantile(max_lod, probs = 0.95)
abline(v = q95, col = 'red', lwd = 2)

In this case, the maximum LOD score is around 9. Most of the values fall between 5 and 6. We have drawn a red line at the 95th percentile of the distribution. LOD values above this are likely to occur 5% of the time. That means 1 in 20 times. We often use this 95th percentile as our significance threshold for one trait.

We advise running at least 1,000 permutations to obtain significance thresholds. In the interest of time, we perform 100 permutations here using the scan1perm function.

R

perms <- scan1perm(genoprobs = probs, 
                   pheno     = pheno_100[,index, drop = FALSE], 
                   kinshp    = K,
                   addcovar  = addcovar, 
                   n_perm    = 100)

The perms object contains the maximum LOD score from each genome scan of permuted data.

Challenge 10: How does a log-tranformation change the QTL plot?

  1. Create a histogram of the LOD scores perms. Hint: use the hist() function.
  2. Estimate the value of the LOD score at the 95th percentile.
  3. Then find the value of the LOD score at the 95th percentile using the summary() function.

R

hist(x = perms, breaks = 15)

R

summary(perms)

OUTPUT

LOD thresholds (100 permutations)
     log_mnret
0.05      7.04

Note that this summary function returns the 95th percentile value of the LOD distribution.

We can now add thresholds to the previous QTL plot. We use a significance threshold of p < 0.05. To do this, we select the 95th percentile of the permutation LOD distribution.

R

plot(x    = lod, 
     map  = map,  
     main = "Proportion of Micro-nucleated Bone Marrow Reticulocytes")
thr = summary(perms)
add_threshold(map = map, thresholdA = thr, col = 'red')

The peak on Chr 10 is well above the red significance line.

Finding LOD Peaks


We can find all of the peaks above the significance threshold using the find_peaks function.

R

find_peaks(scan1_output = lod,
           map          = map, 
           threshold    = thr)

OUTPUT

  lodindex lodcolumn chr      pos      lod
1        1 log_mnret  10 33.52438 27.01701

Challenge 11: How does a log-tranformation change the QTL plot?

Find all peaks for this scan whether or not they meet the 95% significance threshold.

R

find_peaks(scan1_output = lod, 
           map          = map)

OUTPUT

   lodindex lodcolumn chr        pos       lod
1         1 log_mnret   1   4.601252  3.594686
2         1 log_mnret   2 141.066196  5.351645
3         1 log_mnret   3 124.914616  3.904994
4         1 log_mnret   4 132.636314  5.501382
5         1 log_mnret   6  85.246537  5.965923
6         1 log_mnret   7  44.136809  4.325292
7         1 log_mnret   8  26.176972  3.073792
8         1 log_mnret   9   8.115893  4.027077
9         1 log_mnret  10  33.524378 27.017008
10        1 log_mnret  11  48.605368  3.563292
11        1 log_mnret  12  28.480475  3.549043
12        1 log_mnret  13 109.752043  3.111400
13        1 log_mnret  16  72.091039  5.263606
14        1 log_mnret  17  12.528523  3.659285
15        1 log_mnret  18  65.977753  3.894702
16        1 log_mnret   X   8.770202  4.059789

Notice that some peaks are missing because they don’t meet the default threshold value of 3. See help(find_peaks) for more information about this function.

The support interval is determined using the
Bayesian Credible Interval and represents the region most likely to contain the causative polymorphism(s). We can obtain this interval by adding a prob argument to find_peaks. We pass in a value of 0.95 to request a support interval that contains the causal SNP 95% of the time.

R

find_peaks(scan1_output = lod, 
           map          = map, 
           threshold    = thr, 
           prob         = 0.95)

OUTPUT

  lodindex lodcolumn chr      pos      lod   ci_lo    ci_hi
1        1 log_mnret  10 33.52438 27.01701 31.7442 34.05311

From the output above, you can see that the support interval is 2.3 Mb wide (31.7442 to 34.05311 Mb). The location of the maximum LOD score is at 33.52438 Mb.

Estimated Founder Allele Effects


We will now zoom in on Chr 10 and look at the contribution of each of the eight founder alleles to the proportion of bone marrow reticulocytes that were micro-nucleated. Remember, the mapping model above estimates the effect of each of the eight DO founders. We can plot these effects (also called ‘coefficients’) across Chr 10 using scan1coef.

R

chr    <- 10
coef10 <- scan1blup(genoprobs = probs[,chr], 
                    pheno     = pheno_100[,index, drop = FALSE], 
                    kinship   = K[[chr]], 
                    addcovar  = addcovar)

This produces an object containing estimates of each of the eight DO founder allele effect. These are the βj values in the mapping equation above.

R

plot_coefCC(x    = coef10, 
            map  = map, 
            scan1_output = lod, 
            main = "Proportion of Micro-nucleated Bone Marrow Reticulocytes")

The top panel shows the eight founder allele effects (or model coefficients) along Chr 10. The founder allele effects are centered at zero and the units are the same as the phenotype. You can see that DO mice containing the CAST/EiJ allele near 34 Mb have lower levels of micro-nucleated reticulocytes. This means that the CAST allele is associated with less DNA damage and has a protective effect. The bottom panel shows the LOD score, with the support interval for the peak shaded blue.

SNP Association Mapping


At this point, we have a 2.3 Mb wide support interval that contains polymorphism(s) that influence benzene-induced DNA damage. Next, we will impute the DO founder sequences onto the DO genomes. The Mouse Genomes Project has sequenced the eight DO founders and provides SNP, insertion-deletion (indel), and structural variant files for the strains (see Baud et.al., Nat. Gen., 2013). We can impute these SNPs onto the DO genomes and then perform association mapping. The process involves several steps and I have provided a function to encapsulate the steps. To access the Sanger SNPs, we use a SQLlite database provided by Karl Broman. You should have downloaded this during Setup. It is available from
figshare, but the file is 10 GB, so it may take too long to download right now.

SNP Imputation{alt=“Figure showing haplotypes being used to impute founder SNPs onto DO genomes.”,width=75%}

Association mapping involves imputing the founder SNPs onto each DO genome and fitting the mapping model at each SNP. At each marker, we fit the following model:

\(y_{im} = \beta_ss_i+\beta_mg_{im} + \lambda_i + \epsilon_i\)

where:

\(y_i\) is the phenotype for mouse \(i\), \(\beta_s\) is the effect of study cohort, \(s_i\) is the study cohort for mouse \(i\), \(\beta_m\) is the effect of adding one allele at marker \(m\), \(g_{im}\) is the allele call for mouse \(i\) at marker \(m\)>, \(\lambda_i\) is an adjustment for kinship-induced correlated errors for mouse \(i\), \(\epsilon_i\) is the residual error for mouse \(i\).

We can call scan1snps to perform association mapping in the QTL interval on Chr 10. We first create variables for the chromosome and support interval where we are mapping. We then create a function to get the SNPs from the founder SNP database.

Callout

It is important to use the keep_all_snps = TRUE in order to return all SNPs.

R

chr   <- 10
start <- 30
end   <- 36

# Create function to query founder SNP database.
query_snps  <- create_variant_query_func(dbfile   = "./data/fv.2021.snps.db3",
                                         id_field = "variants_id")

assoc      <- scan1snps(genoprobs  = probs[,chr], 
                        map        = map, 
                        pheno      = pheno_100[,index,drop = FALSE], 
                        kinship    = K, 
                        addcovar   = addcovar, 
                        query_func = query_snps,
                        chr        = chr, 
                        start      = start, 
                        end        = end, 
                        keep_all_snps = TRUE)

The assoc object is a list containing two objects: the LOD scores for each unique SNP and a snpinfo object that maps the LOD scores to each SNP. To plot the association mapping, we need to provide both objects to the plot_snpasso function.

R

#png('./fig/bm_mnret_assoc_fig1.png', width = 2000, height = 2000, res = 256)
plot_snpasso(scan1output = assoc$lod, 
             snpinfo     = assoc$snpinfo, 
             main        = "Proportion of Micro-nucleated Bone Marrow Reticulocytes")
#dev.off()

Bone Marrow MN-RET Association Mapping{alt=“Plot showing LOD of each SNP in QTL interval”,width=100%}

This plot shows the LOD score for each SNP in the QTL interval. The SNPs occur in “shelves” because all of the SNPs in a haplotype block have the same founder strain pattern. The SNPs with the highest LOD scores are the ones for which CAST/EiJ contributes the alternate allele.

You might naturally ask what genes are under this QTL peak. We will query the SNP/gens database for the genes in the interval. As before, we will create a function to query the genes in a specific interval of the genome.

R

# Create function to query gene database.
query_genes <- create_gene_query_func(dbfile      = "./data/fv.2021.snps.db3",
                                      chr_field   = "chromosome", 
                                      name_field  = "symbol",
                                      start_field = "start_position", 
                                      stop_field  = "end_position")

genes <- query_genes(chr, start, end)
head(genes)

OUTPUT

          ensembl_id ensembl_version    Name                     name chromosome start_position end_position strand    start
1 ENSMUSG00000000295              14   Hddc2   HD domain containing 2         10       31189379     31204200      1 31.18938
2 ENSMUSG00000000296               9 Tpd52l1 tumor protein D52-like 1         10       31208372     31321954     -1 31.20837
3 ENSMUSG00000019779              16     Frk       fyn-related kinase         10       34359395     34487274      1 34.35939
4 ENSMUSG00000019782              11   Rwdd1  RWD domain containing 1         10       33872551     33895620     -1 33.87255
5 ENSMUSG00000019785              15   Clvs2               clavesin 2         10       33388282     33500765     -1 33.38828
6 ENSMUSG00000019787              10    Trdn                  triadin         10       32956550     33352705      1 32.95655
      stop
1 31.20420
2 31.32195
3 34.48727
4 33.89562
5 33.50077
6 33.35271

The genes object contains annotation information for each gene in the interval.

Next, we will create a plot with two panels: one containing the association mapping LOD scores and one containing the genes in the QTL interval. We do this by passing in the genes argument to plot_snpasso. We can also adjust the proportion of the plot allocated for SNPs (on the top) and genes (on the bottom) using the ‘top_panel_prop’ argument.

R

#png('./fig/bm_mnret_assoc_fig2.png', width = 2000, height = 2000, res = 256)
plot_snpasso(scan1output = assoc$lod, 
             snpinfo     = assoc$snpinfo, 
             main        = "Proportion of Micro-nucleated Bone Marrow Reticulocytes", 
             genes       = genes,
             colors      = "black",
             top_panel_prop = 0.25)
#dev.off()

Bone Marrow MN-RET Association Mapping{alt=“Plot showing LOD of each SNP in QTL interval with genes below”,width=100%}

Now that we have the genes in this interval, we would like to know which founders have the minor allele for the SNPs with the highest LOD scores. To do this, we will highlight SNPs that are withing a 1 LOD drop of the highest LOD and we will add an argument to show which founder contributes the minor allele at the highlighted SNPs.

R

#png('./fig/bm_mnret_assoc_fig3.png', width = 2000, height = 2000, res = 256)
plot_snpasso(scan1output = assoc$lod, 
             snpinfo     = assoc$snpinfo, 
             main        = "Proportion of Micro-nucleated Bone Marrow Reticulocytes", 
             genes       = genes,
             colors      = "black",
             panel_prop = c(0.25, 0.25, 0.5),
             drop_hilit  = 1.0,
             col_hilit   = 'red',
             sdp_panel   = TRUE)
#dev.off()

Bone Marrow MN-RET Association Mapping{alt=“Plot showing LOD of each SNP in QTL interval with genes below and strain distribution pattern above”,width=100%}

Searching for Candidate Genes

One strategy for finding genes related to a phenotype is to search for genes with expression QTL (eQTL) in the same location. Ideally, we would have liver and bone marrow gene expression data in the DO mice from this experiment. Unfortunately, we did not collect this data. However, we have liver gene expression for a separate set of untreated DO mice Liver eQTL Viewer. We searched for genes in the QTL interval that had an eQTL in the same location. Then, we looked at the pattern of founder effects to see if CAST stood out. We found two genes that met these criteria.

Sult3a1/Sult3a2 Liver eQTL{Figure showing LOD plots of Sult3a1 & 2 and CAST-specific haplotype effects}

The plot above shows the genome scane for two genes: Sult3a1 and Gm4794. Gm4794 has been renamed to Sult3a2. As you can see, both Sult3a1 Sult3a2 have eQTL in the same location at the MN-RET QTL on Chr 10. Mice carrying the CAST allele (in green) express these genes more highly. Sult3a1 is a sulfotransferase that may be involved in adding a sulfate group to phenol, one of the metabolites of benzene.

Challenge 11: Look for Sult3a1 & Sult3a2 QTL in DO bone expression

Go to the DO Bone eQTL Viewer, type in Sult3a1 and click on the search button. Then search on Sult3a2.

Neither Sult3a1 or Sult3a2 has an eQTL in the DO bone marrow data set.

Next, we will look at the relationship between Sult3a1 and Sult3a2. Go to the Ensembl web page for Sult3a2. In the menu on the left, click on the “Gene Tree (image)” link.

Ensembl gene tree of Sult3a1 & Sult3a2{Figure showing that Sult3a1 & 2 are paralogs}

As you can see, Sult3a2 is a paralog of Sult3a1 and hence both genes are sulfotransferases. These genes encode enzymes that attach a sulfate group to other compounds.

Challenge 12: Look for structural variants near Sult3a1

Go to the Ensembl website and select “Mouse” from the species dropdown. Then type “10:33300000-33800000” into the search box and press “Go”. Scroll down to the detailed view and look for structural variants in this region.

You should see a plot similar to the one below.

Ensembl Structural Variants{alt=“Ensembl viewer showing genes and structural variants under that chromosome 10 QTL peak”,width=100%}

Ensembl Structural Variants{alt=“Ensembl viewer showing genes and structural variants under that chromosome 10 QTL peak”,width=100%}

In order to visualize the size of the copy number gain, we queried the Mouse Genomes Project alignment files for the eight founders. We piled up the reads at each base (which is beyond the scope of this tutorial) and made the figure below.

DO Founder BAM file pileups{alt=“Figure showing CAST with a duplication at the Sult3a1/2 locus.”,width=100%}

As you can see, there appears to be a duplicatation in the CAST founders that covers four genes: Clvs2, Gm15939, Sult3a1 and Sult3a2. Clvs2 is expressed in neurons and Gm15939 is a predicted gene that may not produce a transcript.

Hence, we have three pieces of evidence that narrows our candidate gene list to Sult3a1 and Sult3a2:

  1. Both genes have a liver eQTL in the same location as the micronucleated reticulocytes QTL.
  2. Among genes in the micronucleated reticulocytes QTL interval, only Sult3a1 and Sult3a2 have differential expression of the CAST allele in the liver.
  3. There is a copy number gain of these two genes in CAST.

Sulfation is a prominent detoxification mechanism for benzene as well. The diagram below shows the metabolism pathway for benzene (Monks, T. J., et al. (2010). Chem Biol Interact 184(1-2): 201-206.) Hydroquinone, phenol and catechol are all sulfated and excreted from the body.

Benezene metabolism pathways{alt=“Figure showing benezene metabolism pathways”,width=100%}

This analysis has led us to the following hypothesis. Inhaled benzene is absorbed by the lungs into the bloodstream and transported to the liver. There, it is metabolized, and some metabolites are transported to the bone marrow. One class of genes that is involved in toxicant metabolism are sulfotransferases. Sult3a1 is a phase II enzyme that conjugates compounds (such as phenol, which is a metabolite of benzene) with a sulfate group before transport into the bile. It is possible that a high level of Sult3a1 expression could remove benzene by-products and be protective. Our hypothesis is that the copy number gain in the CAST allele increases liver gene expression of Sult3a1 and Sult3a2. High liver expression of these genes allows mice containing the CAST allele to rapidly conjugate harmful benzene metabolites and excrete them from the body before they can reach the bone marrow and cause DNA damage. Further experimental validation is required, but this is a plausible hypothesis.

Benzene metabolism hypothesis{alt=“Figure showing mice carrying the CAST allele being protected from benezene toxicity”,width=100%}

We hope that this tutorial has shown you how the DO can be used to map QTL and use the founder effects and bioinformatics resources to narrow down the candidate gene list. Here, we made used of external gene expression databases and the founder sequence data to build a case for a pair of genes.

Challenge 13: Map another trait.

  1. Make a histogram of the column pre.prop.mn.ret in pheno. Does it look like it should be log transformed? If so, add a column to pheno that contains the log of pre.prop.mn.ret.
  2. Perform a genome scan using all samples on the column called pre.prop.mn.ret. Since you’re using all samples, the scan will take longer. You will also need to add the benezene concentration to the covariates. (Hint: set index to the column index in pheno.)
  3. Which chromosome has the highest peak? Use find_peaks to get the location and support interval for the highest peak.
  4. Calculate and plot the BLUPs for the chromosome with the highest peak. (This may take a few minutes.) Which founder contributes an allele that makes MN-RET values higher?
  5. Perform association mapping in the support interval for the QTL peak, plot the results and plot the genes beneath the association mapping plot.
  1. Plot the distribution of

R

hist(pheno$pre.prop.mn.ret)

Log transform the pre-dose proportion of MN-RETs.

R

pheno$log_pre_prop_mnret <- log(pheno$pre.prop.mn.ret)
  1. Set the mapping index equal to the pre.prop.mn.ret column.

R

index <- which(colnames(pheno) == "log_pre_prop_mnret")

Create new covariates which include study and benzene concentration.

R

addcovar2 <- model.matrix(~study + conc, data = pheno)[,-1]

R

lod_pre <- scan1(genoprobs = probs, 
                 pheno     = pheno[,index, drop = FALSE], 
                 kinship   = K, 
                 addcovar  = addcovar2)
plot_scan1(x    = lod_pre, 
           map  = map, 
           main = "Log-Transformed Pre-dose micronucleated reticulocytes")
  1. Find peaks above a significance threshold.

R

find_peaks(lod_pre, map, threshold = 8, prob = 0.95)

OUTPUT

  lodindex          lodcolumn chr       pos       lod     ci_lo     ci_hi
1        1 log_pre_prop_mnret   4 132.77875 13.145968 132.77875 134.27938
2        1 log_pre_prop_mnret  12  75.45433  8.404242  74.14949  77.30552
  1. Calculate and plot BLUPs for the highest peak.

R

chr   <- 4
coef4 <- scan1blup(genoprobs = probs[,chr], 
                   pheno     = pheno[,index, drop = FALSE], 
                   kinship   = K[[chr]], 
                   addcovar  = addcovar2)
plot_coefCC(x            = coef4, 
            map          = map, 
            scan1_output = lod_pre, 
            legend       = "topleft",
            main         = "Log-Transformed Pre-dose micronucleated reticulocytes")
  1. Perform association mapping in the QTL interval for the highest peak.

R

start  <- 132.8
end    <- 134.3
assoc4 <- scan1snps(genoprobs  = probs[,chr], 
                    map        = map, 
                    pheno      = pheno[,index,drop = FALSE], 
                    kinship    = K, 
                    addcovar   = addcovar2, 
                    query_func = query_snps, 
                    chr        = chr,
                    start      = start, 
                    end        = end, 
                    keep_all_snps = TRUE)

genes <- query_genes(chr, start, end)

R

#png('./fig/pre_mnret_assoc.png', width = 2000, height = 2000, res = 256)
plot_snpasso(scan1output = assoc4$lod, 
             snpinfo     = assoc4$snpinfo, 
             main        = "Log-Transformed Pre-dose micronucleated reticulocytes", 
             genes       = genes,
             colors      = "black",
             panel_prop = c(0.25, 0.25, 0.5),
             drop_hilit  = 1.0,
             col_hilit   = 'red',
             sdp_panel   = TRUE)
#dev.off()

Pre-dose MN-RET Association Mapping{alt=“Figure showing pre-dose MN-RET association mapping with WSB carrying the alternate allele”,width=100%}

R

sessionInfo()

OUTPUT

R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C

time zone: UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] qtl2convert_0.30 qtl2_0.36        ggbeeswarm_0.7.2 lubridate_1.9.3
 [5] forcats_1.0.0    stringr_1.5.1    dplyr_1.1.4      purrr_1.0.2
 [9] readr_2.1.5      tidyr_1.3.1      tibble_3.2.1     ggplot2_3.5.1
[13] tidyverse_2.0.0

loaded via a namespace (and not attached):
 [1] utf8_1.2.4          generics_0.1.3      renv_1.0.11
 [4] RSQLite_2.3.7       stringi_1.8.4       hms_1.1.3
 [7] magrittr_2.0.3      RColorBrewer_1.1-3  evaluate_0.24.0
[10] grid_4.4.1          timechange_0.3.0    fastmap_1.2.0
[13] blob_1.2.4          DBI_1.2.3           BiocManager_1.30.25
[16] fansi_1.0.6         scales_1.3.0        qtl_1.70
[19] cli_3.6.3           rlang_1.1.4         bit64_4.5.2
[22] munsell_0.5.1       withr_3.0.1         cachem_1.1.0
[25] yaml_2.3.10         parallel_4.4.1      tools_4.4.1
[28] tzdb_0.4.0          memoise_2.0.1       colorspace_2.1-1
[31] vctrs_0.6.5         R6_2.5.1            lifecycle_1.0.4
[34] bit_4.5.0           vipor_0.4.7         pkgconfig_2.0.3
[37] beeswarm_0.4.0      pillar_1.9.0        gtable_0.3.5
[40] glue_1.7.0          data.table_1.16.2   Rcpp_1.0.13
[43] highr_0.11          xfun_0.46           tidyselect_1.2.1
[46] knitr_1.48          farver_2.1.2        labeling_0.4.3
[49] compiler_4.4.1     

Key Points

  • There are generally five steps to QTL mapping in DO mice:
    • map the trait,
    • perform permutations,
    • find significant peaks,
    • calculate founder allele effects at the QTL peak,
    • perform association mapping to narrow the gene candidates.
  • You may need to bring in outside resources to help narrow your candidate gene list.
  • You will need the 10 GB SNP database to perform association mapping.