Example Gene Expression Datasets
Overview
Teaching: 15 min
Exercises: 20 min
Questions
What data will we be using for our analyses?
Objectives
Explore a high-throughput dataset composed of three tables.
Examine the features (high-throughput measurements) of the data you explored.
Explore a Gene Expression Dataset
Since there is a vast number of available public datasets, we use several gene expression examples. Nonetheless, the statistical techniques you will learn have also proven useful in other fields that make use of high-throughput technologies. Technologies such as microarrays, next generation sequencing, fMRI, and mass spectrometry all produce data to answer questions for which what we learn here will be indispensable.
Data packages
Several of the examples we are going to use in the following sections are best
obtained through R packages. These are available from GitHub and can be
installed using the install_github function from the devtools package.
Microsoft Windows users might need to follow
these instructions to properly
install devtools.
Once devtools is installed, you can then install the data packages like this:
library(devtools)
install_github("genomicsclass/GSE5859Subset")
The three tables
Most of the data we use as examples in this book are created with high-throughput technologies. These technologies measure thousands of features. Examples of features are genes, single base locations of the genome, genomic regions, or image pixel intensities. Each specific measurement product is defined by a specific set of features. For example, a specific gene expression microarray product is defined by the set of genes that it measures.
A specific study will typically use one product to make measurements on several experimental units, such as individuals. The most common experimental unit is the individual, but units can also be defined by other entities, for example different parts of a tumor. Following experimental jargon, we often call the experimental units samples. It is important that these are not confused with samples as referred to in previous chapters, for example “random sample”.
So a high-throughput experiment is usually defined by three tables: one with the high-throughput measurements and two tables with information about the columns and rows of this first table respectively.
Because a dataset is typically defined by a set of experimental units and a product defines a fixed set of features, the high-throughput measurements can be stored in an n x m matrix, with n the number of units and m the number of features. In R, the convention has been to store the transpose of these matrices, so that the features are in the rows and the units are in the columns.
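The relationship between the two layouts can be sketched with a small made-up matrix. The object names and values below are assumptions chosen for illustration and are not part of the GSE5859Subset data.

```r
# Toy measurements: 3 experimental units (rows) by 5 features (columns).
# All names and values here are made up for illustration.
set.seed(1)
unitsByFeatures <- matrix(rnorm(15), nrow = 3, ncol = 5,
                          dimnames = list(paste0("unit", 1:3),
                                          paste0("feature", 1:5)))
dim(unitsByFeatures)   # 3 units, 5 features

# The layout commonly used in R for high-throughput data:
# features in rows, units in columns. t() returns the transpose.
featuresByUnits <- t(unitsByFeatures)
dim(featuresByUnits)   # 5 features, 3 units

# The same measurement is found by swapping the two indices.
unitsByFeatures["unit2", "feature4"] == featuresByUnits["feature4", "unit2"]
```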
Here is an example from a gene expression dataset:
library(GSE5859Subset)
data(GSE5859Subset) ##this loads the three tables
dim(geneExpression)
## [1] 8793 24
We have RNA expression measurements for 8793 genes from blood taken from 24 individuals (the experimental units). For most statistical analyses, we will also need information about the individuals. For example, in this case the data was originally collected to compare gene expression across ethnic groups. However, we have created a subset of this dataset for illustration and separated the data into two groups:
dim(sampleInfo)
## [1] 24 4
head(sampleInfo)
## ethnicity date filename group
## 107 ASN 2005-06-23 GSM136508.CEL.gz 1
## 122 ASN 2005-06-27 GSM136530.CEL.gz 1
## 113 ASN 2005-06-27 GSM136517.CEL.gz 1
## 163 ASN 2005-10-28 GSM136576.CEL.gz 1
## 153 ASN 2005-10-07 GSM136566.CEL.gz 1
## 161 ASN 2005-10-07 GSM136574.CEL.gz 1
sampleInfo$group
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0
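A group indicator like this one has one entry per column of the measurement table, so it can be used to select the columns belonging to each group. The following is a minimal sketch on made-up data; toyExpression and toyGroup are hypothetical stand-ins for geneExpression and sampleInfo$group.

```r
# Made-up stand-ins for geneExpression and sampleInfo$group.
set.seed(1)
toyExpression <- matrix(rnorm(4 * 6), nrow = 4,
                        dimnames = list(paste0("gene", 1:4),
                                        paste0("sample", 1:6)))
toyGroup <- c(1, 1, 1, 0, 0, 0)   # one entry per column of toyExpression

# Count the samples in each group.
table(toyGroup)

# Use the indicator to pick the columns belonging to each group,
# then average each gene (row) within the group.
meanGroup1 <- rowMeans(toyExpression[, toyGroup == 1])
meanGroup0 <- rowMeans(toyExpression[, toyGroup == 0])
meanGroup1 - meanGroup0   # per-gene difference in average expression
```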
One of the columns, filename, permits us to connect the rows of this table to the columns of the measurement table.
match(sampleInfo$filename, colnames(geneExpression))
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
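In this dataset the two tables are already in the same order, so match returns 1 through 24. When the orders differ, the same function can be used to realign them. Here is a sketch on made-up data; the file names and tables are hypothetical.

```r
# Made-up expression matrix: 2 genes by 3 samples, columns named by file.
toyExpression <- matrix(1:6, nrow = 2,
                        dimnames = list(c("geneA", "geneB"),
                                        c("s1.CEL", "s2.CEL", "s3.CEL")))

# Made-up sample table whose rows are in a different order than the columns.
toySampleInfo <- data.frame(filename = c("s3.CEL", "s1.CEL", "s2.CEL"),
                            group    = c(0, 1, 1),
                            stringsAsFactors = FALSE)

# For each column name, find the row of toySampleInfo that describes it.
idx <- match(colnames(toyExpression), toySampleInfo$filename)
idx

# Reorder the sample table so that row i describes column i.
toySampleInfo <- toySampleInfo[idx, ]
identical(toySampleInfo$filename, colnames(toyExpression))
```

Note that the arguments are swapped relative to the call above: here we ask where each column name sits in the sample table, because it is the sample table that gets reordered.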
Finally, we have a table describing the features:
dim(geneAnnotation)
## [1] 8793 4
head(geneAnnotation)
## PROBEID CHR CHRLOC SYMBOL
## 1 1007_s_at chr6 30852327 DDR1
## 30 1053_at chr7 -73645832 RFC2
## 31 117_at chr1 161494036 HSPA6
## 32 121_at chr2 -113973574 PAX8
## 33 1255_g_at chr6 42123144 GUCA1A
## 34 1294_at chr3 -49842638 UBA7
The table includes an ID that permits us to connect the rows of this table with the rows of the measurement table:
head(match(geneAnnotation$PROBEID, rownames(geneExpression)))
## [1] 1 2 3 4 5 6
The table also includes biological information about the features, namely chromosome location and the gene “name” used by biologists.
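Because the annotation table shares its ID with the row names of the measurement table, biological information can be used to select rows of measurements. A minimal sketch on made-up tables follows; the probe IDs, gene symbols, and values are hypothetical.

```r
# Made-up annotation and expression tables; identifiers are hypothetical.
toyAnnotation <- data.frame(PROBEID = c("p1_at", "p2_at", "p3_at"),
                            CHR     = c("chr1", "chr2", "chr1"),
                            SYMBOL  = c("GENE1", "GENE2", "GENE3"),
                            stringsAsFactors = FALSE)
toyExpression <- matrix(c(5.1, 7.3, 6.2,
                          5.4, 7.0, 6.8),
                        nrow = 3,
                        dimnames = list(toyAnnotation$PROBEID,
                                        c("sampleA", "sampleB")))

# Step 1: translate a gene symbol into its probe ID.
probe <- toyAnnotation$PROBEID[toyAnnotation$SYMBOL == "GENE2"]

# Step 2: use the probe ID as a row name of the measurement table.
toyExpression[probe, ]

# The annotation also lets us select all features on one chromosome.
onChr1 <- toyAnnotation$PROBEID[toyAnnotation$CHR == "chr1"]
toyExpression[onChr1, ]
```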
For the remaining parts of this lesson we will be downloading larger datasets than those we have been using. Most of these datasets are not available as part of the standard R installation or packages such as UsingR. For some of these datasets, we have created packages and offer them via GitHub. To download these you will need to install the devtools package. Once you do this, you can install packages such as GSE5859Subset, which we will be using here:
library(devtools)
install_github("genomicsclass/GSE5859Subset")
library(GSE5859Subset)
data(GSE5859Subset)
This package loads three tables: geneAnnotation, geneExpression, and sampleInfo. Answer the following questions to familiarize yourself with the data set:
Exercise 1: How many samples were processed on 2005-06-27?
Solution
unique(sampleInfo$date) # check date format
sampleInfo[sampleInfo$date == "2005-06-27",]
sum(sampleInfo$date == "2005-06-27") # sum of TRUEs
Exercise 2: How many of the genes represented in this particular technology
are on chromosome Y?
Solution
unique(geneAnnotation$CHR) # check chromosome spelling
# remove missing values (NAs) to sum TRUEs
sum(geneAnnotation$CHR == "chrY", na.rm = TRUE)
Exercise 3: What is the log expression value for gene ARPC1A on the one
subject that we measured on 2005-06-10?
Solution
sampleInfo[sampleInfo$date == "2005-06-10",] # June 10 sample
sampleFileName <- sampleInfo[sampleInfo$date == "2005-06-10", "filename"] # save file name
sampleProbeID <- geneAnnotation[which(geneAnnotation$SYMBOL == "ARPC1A"), "PROBEID"] # save probe ID
geneExpression[sampleProbeID, sampleFileName]
Discussion
What kinds of research questions might you ask of this data? What are the dependent (response) and independent variables? Turn to a partner and discuss, then share with the group in the collaborative document.
Key Points
High-throughput data measures thousands of features.
High-throughput data is typically composed of multiple tables.