Load and explore the data

Last updated on 2024-11-12 | Edit this page

Estimated time: 45 minutes

Overview

Questions

  • What data are required for eqtl mapping?

Objectives

  • To provide an example and exploration of data used for eqtl mapping.

Load the libraries.

R

library(ggbeeswarm)
library(tidyverse)
library(knitr)
library(corrplot)
# the following analysis is derived from supplementary 
# File S1 Attie_eQTL_paper_physiology.Rmd 
# by Daniel Gatti. See Data Dryad entry for more information.

Physiological Phenotypes


The complete data used in these analyses are available from Data Dryad.

Load in the clinical phenotypes.

R

# load the data
load("../data/attie_DO500_clinical.phenotypes.RData")

WARNING

Warning in readChar(con, 5L, useBytes = TRUE): cannot open compressed file
'../data/attie_DO500_clinical.phenotypes.RData', probable reason 'No such file
or directory'

ERROR

Error in readChar(con, 5L, useBytes = TRUE): cannot open the connection

See the data dictionary to see a description of each of these phenotypes. You can also view a table of the data dictionary.

R

pheno_clin_dict %>% 
  select(description, formula) %>% 
  kable()

ERROR

Error: object 'pheno_clin_dict' not found

Phenotype Distributions

Boxplots are a great way to view the distribution of the data and to identify any outliers. We will be using the total area under the curve of insulin from the glucose tolerance test (Ins_tAUC). We will also log-transform the data using the [scale_y_log10()][scale_y_log10] function.

R

# plot Insulin on a log 10 scale
ggplot(pheno_clin, aes(sex, Ins_tAUC)) +
  geom_boxplot() +
  scale_y_log10() +
  labs(title = "Insulin tAUC", y = "Insulin tAUC")

ERROR

Error: object 'pheno_clin' not found

Another visualization that has become popular is the [Violin Plot][https://en.wikipedia.org/wiki/Violin_plot]. We can create one using ggplot’s [geom_violin][https://ggplot2.tidyverse.org/reference/geom_violin.html]. Whereas the boxplot automatically adds the median, we must tell geom_violin() which quantiles that we want to draw using the argument draw_quantiles = c(0.25, 0.5, 0.75). We have also overlaid the data points using ggbeeswarm’s [geom_beeswarm][https://www.rdocumentation.org/packages/ggbeeswarm/versions/0.5.3/topics/geom_beeswarm]. We have told geom_beeswarm() to plot the points using the argument alpha = 0.1. The alpha argument ranges between 0 (completely transparent) to 1 (completely opaque). A value of 0.1 means mostly transparent.

R

# plot Insulin on a log 10 scale
ggplot(pheno_clin, aes(sex, Ins_tAUC)) +
  geom_violin(draw_quantiles = c(0.25, 0.5, 0.75)) +
  geom_beeswarm(alpha = 0.1) +
  scale_y_log10() +
  labs(title = "Insulin tAUC", y = "Insulin tAUC")

ERROR

Error: object 'pheno_clin' not found

Challenge 1:

How many orders of magnitude (powers of 10) does Insulin tAUC span?

Insulin tAUC spans three orders of magnitude, from near 10 to over 1000.

Challenge 2:

Which sex has higher median Insulin tAUC values?

Males have higher Insulin tAUC than females.

Both of the boxplot and the violin plot are useful visualizations which you can use to get some sense of the distribution of your data.

Quality Control of Data

Many statistical tests rely upon the data having a “normal” (or Gaussian) distribution. Many biological phenotypes do not follow this distribution and must be transformed before analysis. This is why we log-transformed the data in the plots above.

While we can “eyeball” the distributions in the violin plot, it would be better to use a “quantile-quantile” plot.

R

pheno_clin %>% 
  ggplot(aes(sample = Ins_tAUC)) +
    stat_qq() +
    geom_qq_line() +
    facet_wrap(~sex)

ERROR

Error: object 'pheno_clin' not found

In these plots, the “quantiles” of the normal distribution are plotted on the X-axis and the data are plotted on the Y-axis. The line indicates the quantiles that would be followed by a normal distribution. The untransformed data do not follow a normal distribution because the points are far from the line.

Next, we will loag-transform the data and then create a quantile-quantile plot.

R

pheno_clin %>% 
  mutate(Ins_tAUC = log(Ins_tAUC)) %>% 
  ggplot(aes(sample = Ins_tAUC)) +
    stat_qq() +
    geom_qq_line() +
    facet_wrap(~sex)

ERROR

Error: object 'pheno_clin' not found

Challenge 3:

Does the log transformation make the data more normally distributed? Explain your answer.

Yes. The log transformation makes the data more normally distributed because the data points follow the normality line more closely.

Challenge 4:

Do any data points look suspicious to you? Explain your answer.

The data points that deviate from the normality line would be worth investigating. All data deviates somewhat from normality, but the three lowest points in the male data plot would be worth investigating. They may be real, but there may also have been mishap in the assay.

Another way to identify outliers is to standardize the data and look for data points that are more than four standard deviations from the mean.

To do this, we will log transform and standardize Insulin tAUC.

R

ins_tauc = pheno_clin %>% 
             select(mouse, sex, Ins_tAUC) %>%
             group_by(sex) %>% 
             mutate(Ins_tAUC = log(Ins_tAUC),
                    Ins_tAUC = scale(Ins_tAUC))

ERROR

Error: object 'pheno_clin' not found

R

ins_tauc %>% 
  ggplot(aes(x = sex, y = Ins_tAUC)) +
    geom_boxplot() +
    geom_hline(aes(yintercept = -4), color = 'red') +
    geom_hline(aes(yintercept =  4), color = 'red') +
    labs(title = "Distribution of Standardized Ins_tAUC")

ERROR

Error: object 'ins_tauc' not found

There are no data points outside of the four standard deviation limits.

Gene Expression Phenotypes


R

# load the expression data along with annotations and metadata
load("../data/dataset.islet.rnaseq.RData")

WARNING

Warning in readChar(con, 5L, useBytes = TRUE): cannot open compressed file
'../data/dataset.islet.rnaseq.RData', probable reason 'No such file or
directory'

ERROR

Error in readChar(con, 5L, useBytes = TRUE): cannot open the connection

R

names(dataset.islet.rnaseq)

ERROR

Error: object 'dataset.islet.rnaseq' not found

R

# look at gene annotations
dataset.islet.rnaseq$annots[1:6,]

ERROR

Error: object 'dataset.islet.rnaseq' not found

R

# look at raw counts
dataset.islet.rnaseq$raw[1:6,1:6]

ERROR

Error: object 'dataset.islet.rnaseq' not found

R

# look at sample metadata
# summarize mouse sex, birth dates and DO waves
table(dataset.islet.rnaseq$samples[, c("sex", "birthdate")])

ERROR

Error: object 'dataset.islet.rnaseq' not found

R

table(dataset.islet.rnaseq$samples[, c("sex", "DOwave")])

ERROR

Error: object 'dataset.islet.rnaseq' not found

In order to make reasonable gene comparisons between samples, the count data need to be normalized. In the quantile-quantile (Q-Q) plot below, count data for the first gene are plotted over a diagonal line tracing a normal distribution for those counts. Notice that most of the count data values lie off of this line, indicating that these gene counts are not normally distributed.

ERROR

Error: object 'dataset.islet.rnaseq' not found

Q-Q plots for the first six genes show that count data for these genes are not normally distributed. They are also not on the same scale. The y-axis values for each subplot range to 20,000 counts in the first subplot, 250 in the second, 90 in the third, and so on.

R

dataset.islet.rnaseq$raw %>% 
  as.data.frame() %>%
  select(ENSMUSG00000000001:ENSMUSG00000000058) %>% 
  pivot_longer(cols = everything(), names_to = 'gene', values_to = 'value') %>% 
  ggplot(aes(sample = value)) +
    stat_qq() +
    geom_qq_line() +
    facet_wrap(~gene, scales = 'free') +
    labs(title = 'Count distribution for six genes',
         xlab = 'Normal percentiles', y = 'Count percentiles')

ERROR

Error: object 'dataset.islet.rnaseq' not found

Q-Q plots of the normalized expression data for the first six genes show that the data values match the diagonal line well, meaning that they are now normally distributed. They are also all on the same scale now as well.

R

dataset.islet.rnaseq$expr %>% 
  as.data.frame() %>%
  select(ENSMUSG00000000001:ENSMUSG00000000058) %>% 
  pivot_longer(cols = everything(), names_to = 'gene', values_to = 'value') %>% 
  ggplot(aes(sample = value)) +
    stat_qq() +
    geom_qq_line() +
    facet_wrap(~gene, scales = 'free') +
    labs(title = 'Normalized count distribution for six genes',
         xlab = 'Normal percentiles', y = 'Count percentiles')

ERROR

Error: object 'dataset.islet.rnaseq' not found

Boxplots of raw counts for six example genes are shown at left below. Notice that the median count values (horizontal black bar in each boxplot) are not comparable between the genes because the counts are not on the same scale. At right, boxplots for the same genes show normalized count data on the same scale.

R

raw = dataset.islet.rnaseq$raw %>% 
        as.data.frame() %>% 
        select(ENSMUSG00000000001:ENSMUSG00000000058) %>% 
        pivot_longer(cols = everything(), names_to = 'gene', values_to = 'value') %>% 
        mutate(type = 'raw')

ERROR

Error: object 'dataset.islet.rnaseq' not found

R

norm = dataset.islet.rnaseq$expr %>% 
         as.data.frame() %>% 
         select(ENSMUSG00000000001:ENSMUSG00000000058) %>% 
         pivot_longer(cols = everything(), names_to = 'gene', values_to = 'value') %>% 
         mutate(type = 'normalized')

ERROR

Error: object 'dataset.islet.rnaseq' not found

R

bind_rows(raw, norm) %>%
  mutate(type = factor(type, levels = c('raw', 'normalized'))) %>% 
  ggplot(aes(gene, value)) +
    geom_boxplot() +
    facet_wrap(~type, scales = 'free') +
    labs(title = 'Count distributions for example genes') +
    theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 1))

ERROR

Error in `bind_rows()` at dplyr/R/mutate.R:146:3:
! Argument 1 must be a data frame or a named atomic vector.

R

rm(raw, norm)

WARNING

Warning in rm(raw, norm): object 'raw' not found

WARNING

Warning in rm(raw, norm): object 'norm' not found

Have a look at the first several rows of normalized count data.

R

# look at normalized counts
dataset.islet.rnaseq$expr[1:6,1:6]

ERROR

Error: object 'dataset.islet.rnaseq' not found

The expression data loaded provides LOD peaks for the eQTL analyses performed in this study. As a preview of what you will be doing next, look at the first several rows of LOD peak values and extract the LOD peaks for chromosome 11.

R

# look at LOD peaks
dataset.islet.rnaseq$lod.peaks[1:6,]

ERROR

Error: object 'dataset.islet.rnaseq' not found

R

# look at chromosome 11 LOD peaks
chr11_peaks <- dataset.islet.rnaseq$annots %>% 
   select(gene_id, chr) %>% 
   filter(chr=="11") %>%
   left_join(dataset.islet.rnaseq$lod.peaks, 
             by = c("chr" = "chrom", "gene_id" = "annot.id")) 

ERROR

Error: object 'dataset.islet.rnaseq' not found

R

# look at the first several rows of chromosome 11 peaks
head(chr11_peaks)

ERROR

Error: object 'chr11_peaks' not found

R

# how many rows?
dim(chr11_peaks)

ERROR

Error: object 'chr11_peaks' not found

R

# how many rows have LOD scores?
chr11_peaks %>% filter(!is.na(lod)) %>% dim()

ERROR

Error: object 'chr11_peaks' not found

R

# sort chromosome 11 peaks by LOD score
chr11_peaks %>% arrange(desc(lod)) %>% head()

ERROR

Error: object 'chr11_peaks' not found

R

# range of LOD scores and positions
range(chr11_peaks$lod, na.rm = TRUE)

ERROR

Error: object 'chr11_peaks' not found

R

range(chr11_peaks$pos, na.rm = TRUE)

ERROR

Error: object 'chr11_peaks' not found

R

# view LOD scores by position
chr11_peaks %>% arrange(desc(lod)) %>% 
  ggplot(aes(pos, lod)) + geom_point()

ERROR

Error: object 'chr11_peaks' not found

Key Points

  • Use .md files for episodes when you want static content
  • Use .Rmd files for episodes when you need to generate output
  • Run sandpaper::check_lesson() to identify any issues with your lesson
  • Run sandpaper::build_lesson() to preview your lesson locally