Summary and Setup

High-throughput technologies have changed basic biology and the biomedical sciences from data-poor disciplines to data-intensive ones. A specific example comes from research fields interested in understanding gene expression. Gene expression is the process in which DNA, the blueprint for life, is copied into RNA, the templates for the synthesis of proteins, the building blocks for life. In the 1990s, the analysis of gene expression data amounted to spotting black dots on a piece of paper or extracting a few numbers from standard curves. With high-throughput technologies, such as microarrays, this suddenly changed to sifting through tens of thousands of numbers. More recently, RNA sequencing has further increased data complexity. Biologists went from using their eyes or simple summaries to categorize results, to having thousands (and now millions) of measurements per sample to analyze.

In this lesson we will focus on statistical inference in the context of high-throughput measurements. Specifically, we focus on the problem of detecting differences between groups using statistical tests and quantifying uncertainty in a meaningful way. We also introduce exploratory data analysis techniques that should be used in conjunction with inference when analyzing high-throughput data.

This lesson presents data analysis concepts featured in Data Analysis for the Life Sciences by Rafael A. Irizarry and Michael I. Love. The lesson is adapted from software for the chapters on Inference for High-Dimensional Data, Statistical Modeling, and Distance and Dimension Reduction, which are published under the MIT license. Adaptation was funded by NIH grant 1R25GM141520 awarded to Dr. Gary Churchill at The Jackson Laboratory.

Callout

This lesson assumes basic skills in the R statistical programming language. If you know how to install packages, load libraries and data, and work with common R data structures (e.g. vectors, data frames, matrices), you are ready for this course.

Data Files and Project Organization


  1. Make a new folder on your Desktop called inference. Move into this new folder.

  2. Create a data folder to hold the data, a scripts folder to house your scripts, and a results folder to hold results.

    Alternatively, you can use the R console to run the following commands for steps 1 and 2.

    setwd("~/Desktop")
    dir.create("./inference")
    setwd("~/Desktop/inference")
    dir.create("./data")
    dir.create("./scripts")
    dir.create("./results")
  3. Please download the following files and place them in your data folder. You can copy and paste the following into the R console to download the data.

download.file(url = "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/femaleControlsPopulation.csv",
              destfile = "data/femaleControlsPopulation.csv",
              mode = "wb")
download.file(url = "https://github.com/genomicsclass/GSE5859/raw/refs/heads/master/data/GSE5859.rda",
              destfile = "data/GSE5859.rda",
              mode = "wb")
download.file(url = "https://github.com/genomicsclass/GSE5859Subset/raw/refs/heads/master/data/GSE5859Subset.rda",
              destfile = "data/GSE5859Subset.rda",
              mode = "wb")
download.file(url = "https://github.com/genomicsclass/maPooling/raw/refs/heads/master/data/maPooling.RData",
              destfile = "data/maPooling.RData",
              mode = "wb")
download.file(url = "https://github.com/genomicsclass/tissuesGeneExpression/raw/refs/heads/master/data/tissuesGeneExpression.rda",
              destfile = "data/tissuesGeneExpression.rda",
              mode = "wb")
download.file(url = "https://github.com/genomicsclass/dagdata/raw/refs/heads/master/data/hcmv.rda",
              destfile = "data/hcmv.rda",
              mode = "wb")
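
After the downloads finish, it can help to confirm that every file actually arrived, since an interrupted download may leave a missing or empty file behind. The following is a minimal sketch, assuming your working directory is the inference folder; the helper name find_missing_files is hypothetical and not part of the lesson or of any package.

```r
# File names taken from the download.file() calls above
expected_files <- c("femaleControlsPopulation.csv",
                    "GSE5859.rda",
                    "GSE5859Subset.rda",
                    "maPooling.RData",
                    "tissuesGeneExpression.rda",
                    "hcmv.rda")

# Hypothetical helper: return the names of expected files that are
# missing or empty (zero bytes) in `data_dir`
find_missing_files <- function(data_dir, files = expected_files) {
  paths <- file.path(data_dir, files)
  sizes <- file.size(paths)  # NA when a file does not exist
  files[is.na(sizes) | sizes == 0]
}

# Assumes the working directory is the inference folder
missing_files <- find_missing_files("data")
if (length(missing_files) == 0) {
  message("All data files are present.")
} else {
  message("Missing or empty files: ", paste(missing_files, collapse = ", "))
}
```

If any file is reported, rerun the corresponding download.file() call.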

Software Setup


R is a programming language that is especially powerful for data exploration, visualization, and statistical analysis. To interact with R, we use RStudio.

  1. Install the latest version of R from CRAN. If you are using a JAX-owned machine, you can use the JAX Self Service app instead without needing support from the IT help desk.

  2. Install the latest version of RStudio. Choose the free RStudio Desktop version for Windows, Mac, or Linux. If you are using a JAX-owned machine, you can use the JAX Self Service app instead without needing support from the IT help desk.

  3. Start RStudio. We will use several packages from CRAN. You can install them from the Console or from the Install button on the RStudio Packages tab. Copy-paste this list of packages into the Install dialog box:

    BiocManager, here, rafalib, matrixStats, Brq

    Alternatively, run the following in the Console.

    install.packages(c("BiocManager", "here",
                       "rafalib", "matrixStats", "Brq"))
  4. Once you have installed the packages, load the libraries by checking the box next to each package name on the Packages tab, or by running the following code in the Console.

    library(BiocManager)
    library(here)
    library(rafalib)
    library(matrixStats)
    library(Brq)
  5. Install these Bioconductor packages by running the code in the Console:

    BiocManager::install(c("genefilter", "SpikeInSubset",
                           "SummarizedExperiment", "parathyroidSE", "Biobase",
                           "limma", "qvalue", "PCAtools"))
  6. Once you have installed the Bioconductor packages, load the libraries by running this code in the Console.

    library(genefilter)
    library(SpikeInSubset)
    library(SummarizedExperiment)
    library(parathyroidSE)
    library(Biobase)
    library(limma)
    library(qvalue)
    library(PCAtools)
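
If a library() call fails, the package was probably not installed. As a minimal sketch, the check below lists any packages from the steps above that R cannot find; the helper name find_missing_packages is hypothetical, and the package list is simply copied from the install commands in this setup.

```r
# Packages named in the CRAN and Bioconductor install steps above
required_packages <- c("BiocManager", "here", "rafalib", "matrixStats", "Brq",
                       "genefilter", "SpikeInSubset", "SummarizedExperiment",
                       "parathyroidSE", "Biobase", "limma", "qvalue", "PCAtools")

# Hypothetical helper: return the packages in `pkgs` that cannot be loaded
find_missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

missing_packages <- find_missing_packages(required_packages)
if (length(missing_packages) == 0) {
  message("All required packages are installed.")
} else {
  message("Not installed: ", paste(missing_packages, collapse = ", "))
}
```

Any package reported here can be installed again with install.packages() for the CRAN packages or BiocManager::install() for the Bioconductor packages.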