Example Gene Expression Datasets

Overview

Teaching: 15 min
Exercises: 20 min

Questions

What data will we be using for our analyses?

Objectives

Explore a high-throughput dataset composed of three tables.

Examine the features (high-throughput measurements) of the data you explored.

Explore a Gene Expression Dataset

Since there is a vast number of available public datasets, we use several gene expression examples. Nonetheless, the statistical techniques you will learn have also proven useful in other fields that make use of high-throughput technologies. Technologies such as microarrays, next-generation sequencing, fMRI, and mass spectrometry all produce data for which the methods we learn here will be indispensable.

Data packages

Several of the examples we are going to use in the following sections are best obtained through R packages. These are available from GitHub and can be installed using the install_github function from the devtools package. Microsoft Windows users might need to follow additional setup instructions to install devtools properly.

Once devtools is installed, you can then install the data packages like this:

library(devtools)
install_github("genomicsclass/GSE5859Subset")
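Before installing, it can be useful to check which packages are already present. The following is a minimal sketch of one way to do that; missing_packages is a hypothetical helper written for this illustration and is not part of devtools. The install calls are left commented out so that nothing is downloaded unless you choose to run them.

```r
# Hypothetical helper (not part of devtools): return the packages in `pkgs`
# that are not yet installed in the current R library.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("devtools", "GSE5859Subset"))
print(to_install)

# Uncomment to install whatever is missing:
# if ("devtools" %in% to_install) install.packages("devtools")
# if ("GSE5859Subset" %in% to_install) {
#   devtools::install_github("genomicsclass/GSE5859Subset")
# }
```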

The three tables

Most of the data we use as examples in this book are created with high-throughput technologies. These technologies measure thousands of features. Examples of features are genes, single base locations of the genome, genomic regions, or image pixel intensities. Each specific measurement product is defined by a specific set of features. For example, a specific gene expression microarray product is defined by the set of genes that it measures.

A specific study will typically use one product to make measurements on several experimental units. The most common experimental unit is the individual, but units can also be other entities, for example different parts of a tumor. Following experimental jargon, we often call the experimental units samples. It is important not to confuse these with samples as referred to in previous chapters, for example a “random sample”.

So a high-throughput experiment is usually defined by three tables: one with the high-throughput measurements and two tables with information about the columns and rows of this first table respectively.

Because a dataset is typically defined by a set of experimental units and a product defines a fixed set of features, the high-throughput measurements can be stored in an n x m matrix, with n the number of units and m the number of features. In R, the convention has been to store the transpose of this matrix, so that the rows correspond to features and the columns correspond to units.
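To make the two layouts concrete, here is a small sketch using made-up data. The object names, sample names, and gene names are invented for illustration and do not come from any real dataset.

```r
# Toy data: 3 experimental units (samples) and 5 features (genes).
# All names and values here are made up for illustration.
set.seed(1)
n <- 3  # number of units
m <- 5  # number of features
units_by_features <- matrix(rnorm(n * m), nrow = n, ncol = m,
                            dimnames = list(paste0("sample", 1:n),
                                            paste0("gene", 1:m)))
dim(units_by_features)   # n rows, m columns

# The layout conventionally used in R: features in rows, units in columns.
features_by_units <- t(units_by_features)
dim(features_by_units)   # m rows, n columns

# The same measurement is found by swapping the row and column index.
units_by_features["sample2", "gene4"] == features_by_units["gene4", "sample2"]
```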

Here is an example from a gene expression dataset:

library(GSE5859Subset)
data(GSE5859Subset) ##this loads the three tables
dim(geneExpression)

## [1] 8793   24

We have RNA expression measurements for 8793 genes from blood taken from 24 individuals (the experimental units). For most statistical analyses, we will also need information about the individuals. For example, in this case the data was originally collected to compare gene expression across ethnic groups. However, we have created a subset of this dataset for illustration and separated the data into two groups:

dim(sampleInfo)

## [1] 24  4

head(sampleInfo)

##     ethnicity       date         filename group
## 107       ASN 2005-06-23 GSM136508.CEL.gz     1
## 122       ASN 2005-06-27 GSM136530.CEL.gz     1
## 113       ASN 2005-06-27 GSM136517.CEL.gz     1
## 163       ASN 2005-10-28 GSM136576.CEL.gz     1
## 153       ASN 2005-10-07 GSM136566.CEL.gz     1
## 161       ASN 2005-10-07 GSM136574.CEL.gz     1

sampleInfo$group

##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0

One of the columns, filename, permits us to connect the rows of this table to the columns of the measurement table.

match(sampleInfo$filename, colnames(geneExpression))

##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
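In this dataset the two tables are already in the same order, so match returns 1 through 24. The sketch below uses invented toy tables (expr and info are hypothetical stand-ins, not objects from the package) to show how the same call can realign the columns when the orders differ.

```r
# Toy stand-ins for the real tables; file names and values are invented.
expr <- matrix(1:6, nrow = 2,
               dimnames = list(c("probeA", "probeB"),
                               c("s1.CEL.gz", "s2.CEL.gz", "s3.CEL.gz")))
info <- data.frame(filename = c("s3.CEL.gz", "s1.CEL.gz", "s2.CEL.gz"),
                   group = c(1, 0, 1),
                   stringsAsFactors = FALSE)

# For each row of `info`, find the position of the matching column of `expr`.
idx <- match(info$filename, colnames(expr))
idx

# Reorder the columns so that column i of `expr_aligned` describes row i of `info`.
expr_aligned <- expr[, idx]
identical(colnames(expr_aligned), info$filename)
```

Checking the alignment this way before an analysis helps avoid silently pairing measurements with the wrong sample information.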

Finally, we have a table describing the features:

dim(geneAnnotation)

## [1] 8793    4

head(geneAnnotation)

##      PROBEID  CHR     CHRLOC SYMBOL
## 1  1007_s_at chr6   30852327   DDR1
## 30   1053_at chr7  -73645832   RFC2
## 31    117_at chr1  161494036  HSPA6
## 32    121_at chr2 -113973574   PAX8
## 33 1255_g_at chr6   42123144 GUCA1A
## 34   1294_at chr3  -49842638   UBA7

The table includes an ID that permits us to connect the rows of this table with the rows of the measurement table:

head(match(geneAnnotation$PROBEID, rownames(geneExpression)))

## [1] 1 2 3 4 5 6

The table also includes biological information about the features, namely chromosome location and the gene “name” used by biologists.
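As a sketch of how the feature table can be used together with the measurements, the toy example below selects the measurement rows for probes on one chromosome. The tables annot and expr, and all IDs and values in them, are invented for illustration.

```r
# Toy stand-ins; probe IDs, chromosomes, symbols and values are invented.
annot <- data.frame(PROBEID = c("p1", "p2", "p3", "p4"),
                    CHR = c("chr1", "chrY", NA, "chrY"),
                    SYMBOL = c("GENE1", "GENE2", "GENE3", "GENE4"),
                    stringsAsFactors = FALSE)
expr <- matrix(seq_len(8), nrow = 4,
               dimnames = list(annot$PROBEID, c("sampleA", "sampleB")))

# which() drops the NA produced by comparing a missing chromosome with "chrY".
on_y <- which(annot$CHR == "chrY")
y_probes <- annot$PROBEID[on_y]

# Row names connect the annotation to the measurements.
expr[y_probes, , drop = FALSE]
```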

For the remaining parts of this lesson we will be downloading larger datasets than those we have been using. Most of these datasets are not available as part of the standard R installation or packages such as UsingR. For some of these datasets, we have created packages and offer them via GitHub. To download these you will need to install the devtools package. Once you do this, you can install packages such as GSE5859Subset, which we will be using here:

library(devtools)
install_github("genomicsclass/GSE5859Subset")
library(GSE5859Subset)
data(GSE5859Subset)

Loading the data from this package gives three tables: geneAnnotation, geneExpression, and sampleInfo. Answer the following questions to familiarize yourself with the dataset:

Exercise 1: How many samples were processed on 2005-06-27?

Solution

unique(sampleInfo$date) # check date format
sampleInfo[sampleInfo$date == "2005-06-27",]  
sum(sampleInfo$date == "2005-06-27") # sum of TRUEs     

Exercise 2: How many of the genes represented in this particular technology are on chromosome Y?

Solution

unique(geneAnnotation$CHR) # check chromosome spelling
sum(geneAnnotation$CHR == "chrY", na.rm = TRUE) # na.rm = TRUE removes missing values (NAs) before summing the TRUEs

Exercise 3: What is the log expression value for gene ARPC1A on the one subject that we measured on 2005-06-10?

Solution

sampleInfo[sampleInfo$date == "2005-06-10",] # June 10 sample 
sampleFileName <- sampleInfo[sampleInfo$date == "2005-06-10", "filename"] # save file name   
sampleProbeID <- geneAnnotation[which(geneAnnotation$SYMBOL == "ARPC1A"), "PROBEID"] # save probe ID   
geneExpression[sampleProbeID, sampleFileName]

Discussion

What kinds of research questions might you ask of this data? What are the dependent (response) and independent variables? Turn to a partner and discuss, then share with the group in the collaborative document.


Key Points

High-throughput technologies measure thousands of features.

High-throughput data is typically composed of multiple tables.
