# Calculating Genotype Probabilities

## Overview

Teaching:30 min

Exercises:30 minQuestions

How do I calculate QTL at positions between genotyped markers?

How do I calculate QTL genotype probabilities?

How do I calculate allele probabilities?

How can I speed up calculations if I have a large data set?

Objectives

To explain why the first step in QTL analysis is to calculate genotype probabilities.

To insert pseudomarkers between genotyped markers.

To calculate genotype probabilities.

The first task in QTL analysis is to calculate conditional genotype probabilities, given the observed marker data, at each putative QTL position. For example, the first step would be to determine the probabilities for genotypes AA and AB at the locus indicated below.

The `calc_genoprob()`

function calculates QTL genotype probabilities, conditional on the available marker data. These are needed for most of the QTL mapping functions. The result is returned as a list of three-dimensional arrays (one per chromosome). Each 3d array of probabilities is arranged as individuals × genotypes × positions.

See this page for a graphical review of data structures in R.

To find QTL at positions between markers (so called “pseudomarkers”), first insert pseudomarkers into the genetic map with the function `insert_pseudomarkers()`

.

We’ll use the
iron dataset
from Grant et al. (2006) Hepatology 44:174-185
(an intercross) as an example. In this study spleen and liver iron levels were measured in an F2 cross between mouse strains C57BL/6J and SWR. C57BL/6J mice exhibit low levels of non-heme iron, while SWR mice exhibit high levels. Iron levels between spleen and liver in the F2s were poorly correlated, indicating tissue-specific regulation. Significant QTL were found on chromosomes 2 and 16 for liver, and on chromosomes 8 and 9 in spleen. Candidate genes included transferrin near the chromosome 9 peak, and *β*2-microglobulin near the chromosome 2 peak.

We first load the data using the function `system.file`

, which finds files located in packages. The iron data are built into the qtl2 package, so we can use the `system.file()`

function to load them directly from the package.

```
library(qtl2)
iron <- read_cross2(file = system.file("extdata", "iron.zip", package="qtl2") )
```

To load your own data from your machine, you would use the file path to your data files. For example, if the file path to your data files is `/Users/myUserName/qtlProject/data`

, the command to load your data would look like this:

```
myQTLdata <- read_cross2(file = "/Users/myUserName/qtlProject/data/myqtldata.yaml" )
```

The YAML file contains all control information for your data, including names of data files, cross type, column specifications for sex and cross information, and more. This can also be in JSON format. Alternatively, all data files can be zipped together for loading.

```
myQTLdata <- read_cross2(file = "/Users/myUserName/qtlProject/data/myqtldata.zip" )
```

Back to the iron data. Now look at a summary of the cross data and the names of each variable within the data.

```
summary(iron)
```

```
Object of class cross2 (crosstype "f2")
Total individuals 284
No. genotyped individuals 284
No. phenotyped individuals 284
No. with both geno & pheno 284
No. phenotypes 2
No. covariates 2
No. phenotype covariates 1
No. chromosomes 20
Total markers 66
No. markers by chr:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 X
3 5 2 2 2 2 7 8 5 2 7 2 2 2 2 5 2 2 2 2
```

```
names(iron)
```

```
[1] "crosstype" "geno" "gmap" "pmap" "pheno"
[6] "covar" "phenocovar" "is_x_chr" "is_female" "cross_info"
[11] "alleles"
```

Have a look at the markers listed in the genetic map, `gmap`

. Markers are listed by chromosome and described by cM position. View only the markers on the first several chromosomes.

```
head(iron$gmap)
```

```
$`1`
D1Mit18 D1Mit80 D1Mit17
27.3 51.4 110.4
$`2`
D2Mit379 D2Mit75 D2Mit17 D2Mit304 D2Mit48
38.3 48.1 56.8 58.0 73.2
$`3`
D3Mit22 D3Mit18
25.1 54.6
$`4`
D4Mit2 D4Mit352
10.9 53.6
$`5`
D5Mit11 D5Mit30
17.5 62.3
$`6`
D6Mit104 D6Mit15
41.5 66.7
```

We then use `insert_pseudomarkers()`

to insert pseudomarkers into the
genetic map, which we grab from the `iron`

object as `iron$gmap`

:

```
map <- insert_pseudomarkers(map=iron$gmap, step=1)
```

Pseudomarkers are inserted at regular intervals, and genotype probabilities will be calculated at each of these intervals. Now have a look at the new object called `map`

. View only the first two chromosomes.

```
head(map, n=2)
```

```
$`1`
D1Mit18 c1.loc28 c1.loc29 c1.loc30 c1.loc31 c1.loc32 c1.loc33
27.3 28.3 29.3 30.3 31.3 32.3 33.3
c1.loc34 c1.loc35 c1.loc36 c1.loc37 c1.loc38 c1.loc39 c1.loc40
34.3 35.3 36.3 37.3 38.3 39.3 40.3
c1.loc41 c1.loc42 c1.loc43 c1.loc44 c1.loc45 c1.loc46 c1.loc47
41.3 42.3 43.3 44.3 45.3 46.3 47.3
c1.loc48 c1.loc49 c1.loc50 c1.loc51 D1Mit80 c1.loc52 c1.loc53
48.3 49.3 50.3 51.3 51.4 52.3 53.3
c1.loc54 c1.loc55 c1.loc56 c1.loc57 c1.loc58 c1.loc59 c1.loc60
54.3 55.3 56.3 57.3 58.3 59.3 60.3
c1.loc61 c1.loc62 c1.loc63 c1.loc64 c1.loc65 c1.loc66 c1.loc67
61.3 62.3 63.3 64.3 65.3 66.3 67.3
c1.loc68 c1.loc69 c1.loc70 c1.loc71 c1.loc72 c1.loc73 c1.loc74
68.3 69.3 70.3 71.3 72.3 73.3 74.3
c1.loc75 c1.loc76 c1.loc77 c1.loc78 c1.loc79 c1.loc80 c1.loc81
75.3 76.3 77.3 78.3 79.3 80.3 81.3
c1.loc82 c1.loc83 c1.loc84 c1.loc85 c1.loc86 c1.loc87 c1.loc88
82.3 83.3 84.3 85.3 86.3 87.3 88.3
c1.loc89 c1.loc90 c1.loc91 c1.loc92 c1.loc93 c1.loc94 c1.loc95
89.3 90.3 91.3 92.3 93.3 94.3 95.3
c1.loc96 c1.loc97 c1.loc98 c1.loc99 c1.loc100 c1.loc101 c1.loc102
96.3 97.3 98.3 99.3 100.3 101.3 102.3
c1.loc103 c1.loc104 c1.loc105 c1.loc106 c1.loc107 c1.loc108 c1.loc109
103.3 104.3 105.3 106.3 107.3 108.3 109.3
c1.loc110 D1Mit17
110.3 110.4
$`2`
D2Mit379 c2.loc39 c2.loc40 c2.loc41 c2.loc42 c2.loc43 c2.loc44 c2.loc45
38.3 39.3 40.3 41.3 42.3 43.3 44.3 45.3
c2.loc46 c2.loc47 D2Mit75 c2.loc48 c2.loc49 c2.loc50 c2.loc51 c2.loc52
46.3 47.3 48.1 48.3 49.3 50.3 51.3 52.3
c2.loc53 c2.loc54 c2.loc55 c2.loc56 D2Mit17 c2.loc57 D2Mit304 c2.loc58
53.3 54.3 55.3 56.3 56.8 57.3 58.0 58.3
c2.loc59 c2.loc60 c2.loc61 c2.loc62 c2.loc63 c2.loc64 c2.loc65 c2.loc66
59.3 60.3 61.3 62.3 63.3 64.3 65.3 66.3
c2.loc67 c2.loc68 c2.loc69 c2.loc70 c2.loc71 c2.loc72 D2Mit48
67.3 68.3 69.3 70.3 71.3 72.3 73.2
```

Notice that pseudomarkers are now spaced at 1 cM intervals from genotyped markers. The argument `step=1`

generated pseudomarkers at these intervals.

Next we use `calc_genoprob()`

to calculate the QTL genotype probabilities.

```
pr <- calc_genoprob(cross=iron, map=map, error_prob=0.002)
```

The argument `error_prob`

supplies an assumed genotyping error probability of 0.002. If a value for `error_prob`

is not supplied, the default probability is 0.0001.

Recall that the result of `calc_genoprob`

, `pr`

, is a list of three-dimensional arrays (one per chromosome).

```
names(pr)
```

```
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14"
[15] "15" "16" "17" "18" "19" "X"
```

Each 3d array of probabilities is arranged as individuals × genotypes × positions. Have a look at the names of each of the three dimensions for chromosome 19.

```
dimnames(pr$`19`)
```

```
[[1]]
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11"
[12] "12" "13" "14" "15" "16" "17" "18" "19" "20" "21" "22"
[23] "23" "24" "25" "26" "27" "28" "29" "30" "31" "32" "33"
[34] "34" "35" "36" "37" "38" "39" "40" "41" "42" "43" "44"
[45] "45" "46" "47" "48" "49" "50" "51" "52" "53" "54" "55"
[56] "56" "57" "58" "59" "60" "61" "62" "63" "64" "65" "66"
[67] "67" "68" "69" "70" "71" "72" "73" "74" "75" "76" "77"
[78] "78" "79" "80" "81" "82" "83" "84" "85" "86" "87" "88"
[89] "89" "90" "91" "92" "93" "94" "95" "96" "97" "98" "99"
[100] "100" "101" "102" "103" "104" "105" "106" "107" "108" "109" "110"
[111] "111" "112" "113" "114" "115" "116" "117" "118" "119" "120" "121"
[122] "122" "123" "124" "125" "126" "127" "128" "129" "130" "131" "132"
[133] "133" "134" "135" "136" "137" "138" "139" "140" "141" "142" "143"
[144] "144" "145" "146" "147" "148" "149" "150" "151" "152" "153" "154"
[155] "155" "156" "157" "158" "159" "160" "161" "162" "163" "164" "165"
[166] "166" "167" "168" "169" "170" "171" "172" "173" "174" "175" "176"
[177] "177" "178" "179" "180" "181" "182" "183" "184" "185" "186" "187"
[188] "188" "189" "190" "191" "192" "193" "194" "195" "196" "197" "198"
[199] "199" "200" "201" "202" "203" "204" "205" "206" "207" "208" "209"
[210] "210" "211" "212" "213" "214" "215" "216" "217" "218" "219" "220"
[221] "221" "222" "223" "224" "225" "226" "227" "228" "229" "230" "231"
[232] "232" "233" "234" "235" "236" "237" "238" "239" "240" "241" "242"
[243] "243" "244" "245" "246" "247" "248" "249" "250" "251" "252" "253"
[254] "254" "255" "256" "257" "258" "259" "260" "261" "262" "263" "264"
[265] "265" "266" "267" "268" "269" "270" "271" "272" "273" "274" "275"
[276] "276" "277" "278" "279" "280" "281" "282" "283" "284"
[[2]]
[1] "SS" "SB" "BB"
[[3]]
[1] "D19Mit68" "c19.loc4" "c19.loc5" "c19.loc6" "c19.loc7"
[6] "c19.loc8" "c19.loc9" "c19.loc10" "c19.loc11" "c19.loc12"
[11] "c19.loc13" "c19.loc14" "c19.loc15" "c19.loc16" "c19.loc17"
[16] "c19.loc18" "c19.loc19" "c19.loc20" "c19.loc21" "c19.loc22"
[21] "c19.loc23" "c19.loc24" "c19.loc25" "c19.loc26" "c19.loc27"
[26] "c19.loc28" "c19.loc29" "c19.loc30" "c19.loc31" "c19.loc32"
[31] "c19.loc33" "c19.loc34" "c19.loc35" "c19.loc36" "c19.loc37"
[36] "D19Mit37"
```

View the first three rows of genotype probabilities for the first genotyped marker on chromosome 19, and the two adjacent pseudomarkers located at 1 cM intervals away. Compare the probabilities for each pseudomarker genotype with those of the genotyped marker.

```
(pr$`19`)[1:3,,"D19Mit68"] # genotyped marker
```

```
SS SB BB
1 0.0009976995 0.003298162 0.9957041387
2 0.2500000000 0.500000000 0.2500000000
3 0.0003029243 0.999394151 0.0003029243
```

```
(pr$`19`)[1:3,,"c19.loc4"] # pseudomarker 1 cM away
```

```
SS SB BB
1 0.001080613 0.03581825 0.963101136
2 0.250000000 0.50000000 0.250000000
3 0.006141104 0.98771779 0.006141104
```

```
(pr$`19`)[1:3,,"c19.loc5"] # the next pseudomarker
```

```
SS SB BB
1 0.001342511 0.06759555 0.93106194
2 0.250000000 0.50000000 0.25000000
3 0.011589058 0.97682188 0.01158906
```

## Challenge 1

Find a partner and explain why calculating genotype probabilities is the first step in QTL analysis. Why do you need to insert pseudomarkers first? Listen to your partner’s explanation, then write your responses in the collaborative document.

## Solution to Challenge 1

## Challenge 2

Calculate genotype probabilities for a different data set from the qtl2 data repository, this one from a study of obesity and diabetes in a C57BL/6 (B6) × BTBR intercross.

1). Create a new script in RStudio with File -> New File -> R Script. 2) Download the B6 x BTBR zip file from the qtl2 data repository into an object called`b6btbr`

by running this code:`b6btbr <- read_cross2(file = "https://raw.githubusercontent.com/rqtl/qtl2data/master/B6BTBR/b6btbr.zip")`

2). View a summary of the`b6btbr`

data. How many individuals? phenotypes? chromosomes? markers? 3). View the genetic map for the`b6btbr`

data.

4). Insert pseudomarkers at 2 cM intervals. Assign the results to an object called`b6btbrmap`

.

5). Calculate genotype probabilities assuming a genotyping error probability of 0.001. Assign the results to an object called`b6btbrpr`

.

6). View the first several rows of genotype probabilities for any marker on chromosome 18.## Solution to Challenge 2

1). Create a new script in RStudio with File -> New File -> R Script. 2).

`b6btbr <- read_cross2(file = "https://raw.githubusercontent.com/rqtl/qtl2data/master/B6BTBR/b6btbr.zip`

3).`summary(b6btbr)`

shows 544 individuals, 3 phenotypes, 20 chromosomes, 2057 markers. 4).`b6btbr$gmap`

5).`b6btbrmap <- insert_pseudomarkers(map=b6btbr$gmap, step=2)`

6).`b6btbrpr <- calc_genoprob(cross=b6btbr, map=b6btbrmap, error_prob=0.001)`

7).`dimnames((b6btbrpr$`

18`))`

shows all marker names for chromosome 18.`head((b6btbrpr$`

18`)[,,"c18.loc48"])`

gives genotype probabilities for an example pseudomarker, while`head((b6btbrpr$`

18`)[,,"rs6338896"])`

gives genotype probabilities for a genotyped marker.

**Parallel calculations (optional)** To speed up the calculations with large datasets on a multi-core machine, you can use the argument `cores`

. With `cores=0`

, the number of available cores will be detected via `parallel::detectCores()`

. Otherwise, specify the number of cores as a positive integer.

```
pr <- calc_genoprob(cross=iron, map=map, error_prob=0.002, cores=4)
```

**Allele probabilities (optional)** The genome scan functions use genotype probabilities as well as a matrix of phenotypes. If you wished to perform a genome scan via an additive allele model, you would first convert the genotype probabilities to allele probabilities, using the function `genoprob_to_alleleprob()`

.

```
apr <- genoprob_to_alleleprob(probs=pr)
```

The figure below shows genotype and allele probabilities for 3 samples. In the Diversity Outbred, there are 36 possible genotype states (AA, AB, AC, …, BB, BC, BD, …, CC, CD, CE, …, DD, DE, DF, …, EE,…) or 8 + 7 + 6 + 5 + 4 + 3 + 2 + 1. The first SNP below has genotype BB. In the table describing alleles (8 state founder probabilities), the probability that this SNP has a B allele is 1. The 2nd SNP has genotype BH, so the allele table shows a probability of 0.5 for B and 0.5 for H. The third SNP is either BG or BH, and has a probability of 0.5 for each of these genotypes. The allele table shows a probability of 0.5 for allele B, and 0.25 for both G and H.

## Key Points

The first step in QTL analysis is to calculate genotype probabilities.

Insert pseudomarkers to calculate QTL at positions between genotyped markers.

Calculate genotype probabilities between genotyped markers with calc_genoprob().