Load and explore the data

Overview

Teaching: 15 min
Exercises: 30 min

Questions

What data are required for eqtl mapping?

Objectives

To provide an example and exploration of data used for eqtl mapping.

Load the libraries.

library(ggbeeswarm)
library(tidyverse)
library(knitr)
library(corrplot)
# the following analysis is derived from supplementary 
# File S1 Attie_eQTL_paper_physiology.Rmd 
# by Daniel Gatti. See Data Dryad entry for more information.

Physiological Phenotypes

The complete data used in these analyses are available from Data Dryad.

Load in the clinical phenotypes.

# load the data
load("../data/attie_DO500_clinical.phenotypes.RData")

See the data dictionary to see a description of each of these phenotypes. You can also view a table of the data dictionary.

pheno_clin_dict %>% 
  select(description, formula) %>% 
  kable()

	description	formula
mouse	Animal identifier.	NA
sex	Male (M) or female (F).	NA
sac_date	Date when mouse was sacrificed; used to compute days on diet, using birth dates.	NA
partial_inflation	Some mice showed a partial pancreatic inflation which would negatively effect the total number of islets collected from these mice.	NA
coat_color	Visual inspection by Kathy Schuler on coat color.	NA
oGTT_date	Date the oGTT was performed.	NA
FAD_NAD_paired	A change in the method that was used to make this measurement by Matt Merrins’ lab. Paired was the same islet for the value at 3.3mM vs. 8.3mM glucose; unpaired was where averages were used for each glucose concentration and used to compute ratio.	NA
FAD_NAD_filter_set	A different filter set was used on the microscope to make the fluorescent measurement; may have influenced the values.	NA
crumblers	Some mice store food in their bedding (hoarders) which would be incorrectly interpreted as consumed.	NA
birthdate	Birth date	NA
diet_days	Number of days.	NA
num_islets	Total number of islets harvested per mouse; negatively impacted by those with partial inflation.	NA
Ins_per_islet	Amount of insulin per islet in units of ng/ml/islet.	NA
WPIC	Derived number; equal to total number of islets times insulin content per islet.	Ins_per_islet * num_islets
HOMA_IR_0min	glucose*insulin/405 at time t=0 for the oGTT	Glu_0min * Ins_0min / 405
HOMA_B_0min	360 * Insulin / (Glucose - 63) at time t=0 for the oGTT	360 * Ins_0min / (Glu_0min - 63)
Glu_tAUC	Area under the curve (AUC) calculation without any correction for baseline differences.	complicated
Ins_tAUC	Area under the curve (AUC) calculation without any correction for baseline differences.	complicated
Glu_6wk	Plasma glucose with units of mg/dl; fasting.	NA
Ins_6wk	Plasma insulin with units of ng/ml; fasting.	NA
TG_6wk	Plasma triglyceride (TG) with units of mg/dl; fasting.	NA
Glu_10wk	Plasma glucose with units of mg/dl; fasting.	NA
Ins_10wk	Plasma insulin with units of ng/ml; fasting.	NA
TG_10wk	Plasma triglyceride (TG) with units of mg/dl; fasting.	NA
Glu_14wk	Plasma glucose with units of mg/dl; fasting.	NA
Ins_14wk	Plasma insulin with units of ng/ml; fasting.	NA
TG_14wk	Plasma triglyceride (TG) with units of mg/dl; fasting.	NA
food_ave	Average food consumption over the measurements made for each mouse.	complicated
weight_2wk	Body weight at indicated date; units are gm.	NA
weight_6wk	Body weight at indicated date; units are gm.	NA
weight_10wk	Body weight at indicated date; units are gm.	NA
DOwave	Wave (i.e., batch) of DO mice	NA

Phenotype Distributions

Boxplots are a great way to view the distribution of the data and to identify any outliers. We will be using the total area under the curve of insulin from the glucose tolerance test (Ins_tAUC). We will also log-transform the data using the [scale_y_log10()][scale_y_log10] function.

# plot Insulin on a log 10 scale
ggplot(pheno_clin, aes(sex, Ins_tAUC)) +
  geom_boxplot() +
  scale_y_log10() +
  labs(title = "Insulin tAUC", y = "Insulin tAUC")

plot of chunk Ins_tAUC_boxplot

Another visualization that has become popular is the [Violin Plot][https://en.wikipedia.org/wiki/Violin_plot]. We can create one using ggplot’s [geom_violin][https://ggplot2.tidyverse.org/reference/geom_violin.html]. Whereas the boxplot automatically adds the median, we must tell geom_violin() which quantiles that we want to draw using the argument “draw_quantiles = c(0.25, 0.5, 0.75)”. We have also overlaid the data points using ggbeeswarm’s [geom_beeswarm][https://www.rdocumentation.org/packages/ggbeeswarm/versions/0.5.3/topics/geom_beeswarm]. We have told geom_beeswarm() to plot the points using the argument “alpha = 0.1”. The alpha argument ranges between 0 (completely transparent) to 1 (completely opaque). A value of 0.1 means mostly transparent.

# plot Insulin on a log 10 scale
ggplot(pheno_clin, aes(sex, Ins_tAUC)) +
  geom_violin(draw_quantiles = c(0.25, 0.5, 0.75)) +
  geom_beeswarm(alpha = 0.1) +
  scale_y_log10() +
  labs(title = "Insulin tAUC", y = "Insulin tAUC")

plot of chunk Ins_tAUC_violin

Challenge 1

How many orders of magnitude (powers of 10) does Insulin tAUC span?

Solution

Insulin tAUC spans three orders of magnitude, from near 10 to over 1000.

Challenge 2

Which sex has higher median Insulin tAUC values?

Solution

Males have higher Insulin tAUC than females.

Both of the boxplot and the violin plot are useful visualizations which you can use to get some sense of the distribution of your data.

Quality Control of Data

Many statistical tests rely upon the data having a “normal” (or Gaussian) distribution. Many biological phenotypes do not follow this distribution and must be transformed before analysis. This is why we log-transformed the data in the plots above.

While we can “eyeball” the distributions in the violin plot, it would be better to use a “quantile-quantile” plot.

pheno_clin %>% 
  ggplot(aes(sample = Ins_tAUC)) +
    stat_qq() +
    geom_qq_line() +
    facet_wrap(~sex)

plot of chunk qqplot

In these plots, the “quantiles” of the normal distribution are plotted on the X-axis and the data are plotted on the Y-axis. The line indicates the quantiles that would be followed by a normal distribution. The untransformed data do not follow a normal distribution because the points are far from the line.

Next, we will loag-transform the data and then create a quantile-quantile plot.

pheno_clin %>% 
  mutate(Ins_tAUC = log(Ins_tAUC)) %>% 
  ggplot(aes(sample = Ins_tAUC)) +
    stat_qq() +
    geom_qq_line() +
    facet_wrap(~sex)

plot of chunk qqplot_log

Challenge 3

Does the log transformation make the data more normally distributed? Explain your answer.

Solution

Yes. The log transformation makes the data more normally distributed because the data points follow the normality line more closely.

Challenge 4

Do any data points look suspicious to you? Explain your answer.

Solution

The data points that deviate from the normality line would be worth investigating. All data deviates somewhat from normality, but the three lowest points in the male data plot would be worth investigating. They may be real, but there may also have been mishap in the assay.

Another way to identify outliers is to standardize the data and look for data points that are more than four standard deviations from the mean.

To do this, we will log transform and standardize Insulin tAUC.

ins_tauc = pheno_clin %>% 
             select(mouse, sex, Ins_tAUC) %>%
             group_by(sex) %>% 
             mutate(Ins_tAUC = log(Ins_tAUC),
                    Ins_tAUC = scale(Ins_tAUC))

ins_tauc %>% 
  ggplot(aes(x = sex, y = Ins_tAUC)) +
    geom_boxplot() +
    geom_hline(aes(yintercept = -4), color = 'red') +
    geom_hline(aes(yintercept =  4), color = 'red') +
    labs(title = "Distribution of Standardized Ins_tAUC")

plot of chunk pheno_std

There are no data points outside of the four standard deviation limits.

Gene Expression Phenotypes

# load the expression data along with annotations and metadata
load("../data/dataset.islet.rnaseq.RData")
names(dataset.islet.rnaseq)

[1] "annots"        "covar"         "covar.factors" "datatype"     
[5] "display.name"  "expr"          "lod.peaks"     "raw"          
[9] "samples"      

# look at gene annotations
dataset.islet.rnaseq$annots[1:6,]

                              gene_id symbol chr     start       end strand
ENSMUSG00000000001 ENSMUSG00000000001  Gnai3   3 108.10728 108.14615     -1
ENSMUSG00000000028 ENSMUSG00000000028  Cdc45  16  18.78045  18.81199     -1
ENSMUSG00000000037 ENSMUSG00000000037  Scml2   X 161.11719 161.25821      1
ENSMUSG00000000049 ENSMUSG00000000049   Apoh  11 108.34335 108.41440      1
ENSMUSG00000000056 ENSMUSG00000000056   Narf  11 121.23725 121.25586      1
ENSMUSG00000000058 ENSMUSG00000000058   Cav2   6  17.28119  17.28911      1
                      middle nearest.marker.id        biotype      module
ENSMUSG00000000001 108.12671       3_108090236 protein_coding   darkgreen
ENSMUSG00000000028  18.79622       16_18817262 protein_coding        grey
ENSMUSG00000000037 161.18770       X_161182677 protein_coding        grey
ENSMUSG00000000049 108.37887      11_108369225 protein_coding greenyellow
ENSMUSG00000000056 121.24655      11_121200487 protein_coding       brown
ENSMUSG00000000058  17.28515        6_17288298 protein_coding       brown
                   hotspot
ENSMUSG00000000001    <NA>
ENSMUSG00000000028    <NA>
ENSMUSG00000000037    <NA>
ENSMUSG00000000049    <NA>
ENSMUSG00000000056    <NA>
ENSMUSG00000000058    <NA>

# look at raw counts
dataset.islet.rnaseq$raw[1:6,1:6]

      ENSMUSG00000000001 ENSMUSG00000000028 ENSMUSG00000000037
DO021              10247                108                 29
DO022              11838                187                 35
DO023              12591                160                 19
DO024              12424                216                 30
DO025              10906                 76                 21
DO026              12248                110                 34
      ENSMUSG00000000049 ENSMUSG00000000056 ENSMUSG00000000058
DO021                 15                120                703
DO022                 18                136                747
DO023                 18                275               1081
DO024                 81                160                761
DO025                  7                163                770
DO026                 18                204                644

# look at sample metadata
# summarize mouse sex, birth dates and DO waves
table(dataset.islet.rnaseq$samples[, c("sex", "birthdate")])

   birthdate
sex 2014-05-29 2014-10-15 2015-02-25 2015-09-22
  F         46         46         49         47
  M         45         48         50         47

table(dataset.islet.rnaseq$samples[, c("sex", "DOwave")])

   DOwave
sex  1  2  3  4  5
  F 46 46 49 47  0
  M 45 48 50 47  0

In order to make reasonable gene comparisons between samples, the count data need to be normalized. In the quantile-quantile (Q-Q) plot below, count data for the first gene are plotted over a diagonal line tracing a normal distribution for those counts. Notice that most of the count data values lie off of this line, indicating that these gene counts are not normally distributed.

plot of chunk view_manual_qqplot_raw

Q-Q plots for the first six genes show that count data for these genes are not normally distributed. They are also not on the same scale. The y-axis values for each subplot range to 20,000 counts in the first subplot, 250 in the second, 90 in the third, and so on.

dataset.islet.rnaseq$raw %>% 
  as.data.frame() %>%
  select(ENSMUSG00000000001:ENSMUSG00000000058) %>% 
  pivot_longer(cols = everything(), names_to = 'gene', values_to = 'value') %>% 
  ggplot(aes(sample = value)) +
    stat_qq() +
    geom_qq_line() +
    facet_wrap(~gene, scales = 'free') +
    labs(title = 'Count distribution for six genes',
         xlab = 'Normal percentiles', y = 'Count percentiles')

plot of chunk view_qqplots_raw

Q-Q plots of the normalized expression data for the first six genes show that the data values match the diagonal line well, meaning that they are now normally distributed. They are also all on the same scale now as well.

dataset.islet.rnaseq$expr %>% 
  as.data.frame() %>%
  select(ENSMUSG00000000001:ENSMUSG00000000058) %>% 
  pivot_longer(cols = everything(), names_to = 'gene', values_to = 'value') %>% 
  ggplot(aes(sample = value)) +
    stat_qq() +
    geom_qq_line() +
    facet_wrap(~gene, scales = 'free') +
    labs(title = 'Normalized count distribution for six genes',
         xlab = 'Normal percentiles', y = 'Count percentiles')

plot of chunk view_qqplots_normalized

Boxplots of raw counts for six example genes are shown at left below. Notice that the median count values (horizontal black bar in each boxplot) are not comparable between the genes because the counts are not on the same scale. At right, boxplots for the same genes show normalized count data on the same scale.

raw = dataset.islet.rnaseq$raw %>% 
        as.data.frame() %>% 
        select(ENSMUSG00000000001:ENSMUSG00000000058) %>% 
        pivot_longer(cols = everything(), names_to = 'gene', values_to = 'value') %>% 
        mutate(type = 'raw')

norm = dataset.islet.rnaseq$expr %>% 
         as.data.frame() %>% 
         select(ENSMUSG00000000001:ENSMUSG00000000058) %>% 
         pivot_longer(cols = everything(), names_to = 'gene', values_to = 'value') %>% 
         mutate(type = 'normalized')

bind_rows(raw, norm) %>%
  mutate(type = factor(type, levels = c('raw', 'normalized'))) %>% 
  ggplot(aes(gene, value)) +
    geom_boxplot() +
    facet_wrap(~type, scales = 'free') +
    labs(title = 'Count distributions for example genes') +
    theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 1))

plot of chunk view_example_boxplots

rm(raw, norm)

Have a look at the first several rows of normalized count data.

# look at normalized counts
dataset.islet.rnaseq$expr[1:6,1:6]

      ENSMUSG00000000001 ENSMUSG00000000028 ENSMUSG00000000037
DO021        -0.09858728         0.19145703          0.5364855
DO022         0.97443030         2.41565800          1.1599398
DO023         0.75497867         1.62242658         -0.4767954
DO024         1.21299720         2.79216889          0.7205211
DO025         1.62242658        -0.27918258          0.3272808
DO026         0.48416010         0.03281516          0.8453674
      ENSMUSG00000000049 ENSMUSG00000000056 ENSMUSG00000000058
DO021         0.67037652         -1.4903750          0.6703765
DO022         1.03980755         -1.1221937          0.9744303
DO023         0.89324170          0.2317220          1.8245906
DO024         2.15179362         -0.7903569          0.9957498
DO025        -0.03281516         -0.2587715          1.5530179
DO026         0.86427916         -0.3973634          0.2791826

The expression data loaded provides LOD peaks for the eQTL analyses performed in this study. As a preview of what you will be doing next, look at the first several rows of LOD peak values and extract the LOD peaks for chromosome 11.

# look at LOD peaks
dataset.islet.rnaseq$lod.peaks[1:6,]

            annot.id   marker.id chrom        pos       lod
ENSMUSG00000037922 3_136728699     3 136.728699 11.976856
ENSMUSG00000037926 2_164758936     2 164.758936  7.091543
ENSMUSG00000037926 5_147178504     5 147.178504  6.248598
ENSMUSG00000037933 5_147253583     5 147.253583  8.581871
ENSMUSG00000037933  13_6542783    13   6.542783  6.065497
ENSMUSG00000037935 11_99340415    11  99.340415  8.089051

# look at chromosome 11 LOD peaks
chr11_peaks <- dataset.islet.rnaseq$annots %>% 
   select(gene_id, chr) %>% 
   filter(chr=="11") %>%
   left_join(dataset.islet.rnaseq$lod.peaks, 
             by = c("chr" = "chrom", "gene_id" = "annot.id")) 

# look at the first several rows of chromosome 11 peaks
head(chr11_peaks)

             gene_id chr    marker.id       pos       lod
ENSMUSG00000000049  11 11_109408556 109.40856  8.337996
ENSMUSG00000000056  11 11_121434933 121.43493 11.379449
ENSMUSG00000000093  11  11_74732783  74.73278  9.889119
ENSMUSG00000000120  11  11_95962181  95.96218 11.257591
ENSMUSG00000000125  11         <NA>        NA        NA
ENSMUSG00000000126  11  11_26992236  26.99224  6.013533

# how many rows?
dim(chr11_peaks)

[1] 1810    5

# how many rows have LOD scores?
chr11_peaks %>% filter(!is.na(lod)) %>% dim()

[1] 1208    5

# sort chromosome 11 peaks by LOD score
chr11_peaks %>% arrange(desc(lod)) %>% head()

             gene_id chr   marker.id      pos      lod
ENSMUSG00000020268  11 11_55073940 55.07394 189.2332
ENSMUSG00000020333  11 11_53961279 53.96128 147.8134
ENSMUSG00000017404  11 11_98062374 98.06237 136.0430
ENSMUSG00000093483  11 11_83104215 83.10421 135.1067
ENSMUSG00000058546  11 11_78204410 78.20441 132.3365
ENSMUSG00000078695  11 11_97596132 97.59613 131.8005

# range of LOD scores and positions
range(chr11_peaks$lod, na.rm = TRUE)

[1]   6.000975 189.233211

range(chr11_peaks$pos, na.rm = TRUE)

[1]   3.126009 122.078650

# view LOD scores by position
chr11_peaks %>% arrange(desc(lod)) %>% 
  ggplot(aes(pos, lod)) + geom_point()

plot of chunk lod_peaks

Key Points

previous episode

Gene Expression QTL Analysis

next episode

Load and explore the data

Overview

Physiological Phenotypes

Phenotype Distributions

Challenge 1

Solution

Challenge 2

Solution

Quality Control of Data

Challenge 3

Solution

Challenge 4

Solution

Gene Expression Phenotypes

Key Points

previous episode

next episode