Summary and Schedule

High-throughput technologies have changed basic biology and the biomedical sciences from data-poor disciplines to data-intensive ones. A specific example comes from research fields interested in understanding gene expression. Gene expression is the process in which DNA, the blueprint for life, is copied into RNA, the templates for the synthesis of proteins, the building blocks for life. In the 1990s, the analysis of gene expression data amounted to spotting black dots on a piece of paper or extracting a few numbers from standard curves. With high-throughput technologies, such as microarrays, this suddenly changed to sifting through tens of thousands of numbers. More recently, RNA sequencing has further increased data complexity. Biologists went from using their eyes or simple summaries to categorize results, to having thousands (and now millions) of measurements per sample to analyze. In this lesson we will focus on statistical inference in the context of high-throughput measurements. Specifically, we focus on the problem of detecting differences between groups using statistical tests and quantifying uncertainty in a meaningful way. We also introduce exploratory data analysis techniques that should be used in conjunction with inference when analyzing high-throughput data.

This lesson presents data analysis concepts featured in Data Analysis for the Life Sciences by Rafael A. Irizarry and Michael I. Love. The lesson is adapted from software for the chapters on Inference for High-Dimensional Data, Statistical Modeling, and Distance and Dimension Reduction, which are published under the MIT license. Adaptation was funded by NIH grant 1R25GM141520 awarded to Dr. Gary Churchill at The Jackson Laboratory.

Callout

This lesson assumes basic skills in the R statistical programming language. If you know how to install packages, load libraries and data, and work with common R data structures (e.g. vectors, data frames, matrices), you are ready for this course.

Setup Instructions

Download files required for the lesson

00h 00m

1. Example Gene Expression Datasets

What data will we be using for our analyses?

00h 35m

2. Basic inference for high-throughput data

How are inferences from high-throughput data different from inferences from smaller samples?
How is the interpretation of p-values affected by high-throughput data?

01h 35m

3. Procedures for Multiple Comparisons

Why are p-values not a useful quantity when dealing with high-dimensional data?
What are error rates and how are they calculated?

02h 05m

4. Error Rates

Why are type I and II error rates a problem in inferential statistics?
What is the family-wise error rate, and why is it a concern in high-throughput data?

02h 35m

5. The Bonferroni Correction

What is one way to control the family-wise error rate?

03h 20m

6. False Discovery Rate

What are False Discovery Rates, and when are they a concern in data analysis?
How can you control false discovery rates?

04h 00m

7. Direct Approach to FDR and q-values

How can you control false discovery rates when you don’t have an a priori error rate?

05h 00m

8. Basic EDA for high-throughput data

What problems can different kinds of exploratory plots detect in high-throughput data?

06h 15m

9. Principal Components Analysis

How can researchers simplify or streamline EDA in high-throughput data sets?
What is principal component analysis (PCA) and when can it be used?

07h 15m

10. Statistical Models

What are some of the most widely used parametric distributions in the life sciences besides the normal?

08h 15m

Finish

The actual schedule may vary slightly depending on the topics and exercises chosen by the instructor.

Data Files and Project Organization

1. Make a new folder on your Desktop called inference. Move into this new folder.
2. Create a data folder to hold the data, a scripts folder to house your scripts, and a results folder to hold results.

Alternatively, you can use the R console to run the following commands for steps 1 and 2.
```
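# Step 1: create the project folder on the Desktop and move into it
setwd("~/Desktop")
dir.create("./inference")
setwd("~/Desktop/inference")
# Step 2: create subfolders for the data, scripts, and results
dir.create("./data")
dir.create("./scripts")
dir.create("./results")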
```
Please download the following files and place them in your data folder. You can copy and paste the following into the R console to download the data.

```
download.file(url = "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/femaleControlsPopulation.csv",
              destfile = "data/femaleControlsPopulation.csv",
              mode = "wb")
download.file(url = "https://github.com/genomicsclass/GSE5859/raw/refs/heads/master/data/GSE5859.rda",
              destfile = "data/GSE5859.rda",
              mode = "wb")
download.file(url = "https://github.com/genomicsclass/GSE5859Subset/raw/refs/heads/master/data/GSE5859Subset.rda",
              destfile = "data/GSE5859Subset.rda",
              mode = "wb")
download.file(url = "https://github.com/genomicsclass/maPooling/raw/refs/heads/master/data/maPooling.RData",
              destfile = "data/maPooling.RData",
              mode = "wb")
download.file(url = "https://github.com/genomicsclass/tissuesGeneExpression/raw/refs/heads/master/data/tissuesGeneExpression.rda",
              destfile = "data/tissuesGeneExpression.rda",
              mode = "wb")
download.file(url = "https://github.com/genomicsclass/dagdata/raw/refs/heads/master/data/hcmv.rda",
              destfile = "data/hcmv.rda",
              mode = "wb")
```
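
After downloading, it can help to confirm that every file arrived. The following is a minimal sketch, not part of the original lesson: the helper name `missing_files` is hypothetical, and it assumes your working directory is the inference folder, so that the files sit in `data/`.

```r
# File names taken from the download commands above
expected_files <- c("femaleControlsPopulation.csv", "GSE5859.rda",
                    "GSE5859Subset.rda", "maPooling.RData",
                    "tissuesGeneExpression.rda", "hcmv.rda")

# Hypothetical helper: returns the names of expected files that are
# missing or empty in the given folder
missing_files <- function(files, dir = "data") {
  paths <- file.path(dir, files)
  sizes <- file.size(paths)  # NA when a file does not exist
  files[is.na(sizes) | sizes == 0]
}

# Lists any files that still need to be downloaded
missing_files(expected_files)
```

If the call returns any file names, re-run the corresponding `download.file()` command for those files.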

Software Setup

R is a programming language that is especially powerful for data exploration, visualization, and statistical analysis. To interact with R, we use RStudio.

Install the latest version of R from CRAN. If you are using a JAX-owned machine, you can use the JAX Self Service app instead without needing support from the IT help desk.
Install the latest version of RStudio here. Choose the free RStudio Desktop version for Windows, Mac, or Linux. If you are using a JAX-owned machine, you can use the JAX Self Service app instead without needing support from the IT help desk.
Start RStudio. We will use several packages from CRAN. You can install them from the Console or from the Install button on the RStudio Packages tab. Copy-paste this list of packages into the Install dialog box:

BiocManager, here, rafalib, matrixStats, Brq

Alternatively, run the following in the Console.
```
install.packages(c("BiocManager", "here",
                   "rafalib", "matrixStats", "Brq"))
```
Once you have installed the packages, load the libraries by checking the box next to each package name on the Packages tab, or by running this code in the Console.
```
library(BiocManager)
library(here)
library(rafalib)
library(matrixStats)
library(Brq)
```

Install these Bioconductor packages by running the code in the Console:

```
BiocManager::install(c("genefilter", "SpikeInSubset",
                       "SummarizedExperiment", "parathyroidSE", "Biobase",
                       "limma", "qvalue", "PCAtools"))
```

Once you have installed the Bioconductor packages, load the libraries by running this code in the Console.

```
library(genefilter)
library(SpikeInSubset)
library(SummarizedExperiment)
library(parathyroidSE)
library(Biobase)
library(limma)
library(qvalue)
library(PCAtools)
```
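
If any `library()` call fails, one quick way to find which packages are still missing is sketched below. This is an optional check, not part of the original setup. The helper name `not_installed` is hypothetical, and the package names are copied from the install commands above.

```r
# Package names copied from the install commands above
cran_pkgs <- c("BiocManager", "here", "rafalib", "matrixStats", "Brq")
bioc_pkgs <- c("genefilter", "SpikeInSubset", "SummarizedExperiment",
               "parathyroidSE", "Biobase", "limma", "qvalue", "PCAtools")

# Hypothetical helper: returns the packages that cannot be found
not_installed <- function(pkgs) {
  # requireNamespace() returns FALSE instead of stopping with an error
  found <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  unname(pkgs[!found])
}

# Lists any packages that still need to be installed
not_installed(c(cran_pkgs, bioc_pkgs))
```

Install any CRAN packages it lists with `install.packages()` and any Bioconductor packages with `BiocManager::install()`.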