This lesson is in the early stages of development (Alpha version)

Inference for High-dimensional Data

High-throughput technologies have changed basic biology and the biomedical sciences from data-poor disciplines into data-intensive ones. A specific example comes from research fields interested in understanding gene expression. Gene expression is the process by which DNA, the blueprint for life, is copied into RNA, the templates for the synthesis of proteins, the building blocks for life. In the 1990s, the analysis of gene expression data amounted to spotting black dots on a piece of paper or extracting a few numbers from standard curves. With high-throughput technologies, such as microarrays, this suddenly changed to sifting through tens of thousands of numbers. More recently, RNA sequencing has further increased data complexity. Biologists went from using their eyes or simple summaries to categorize results to having thousands (and now millions) of measurements per sample to analyze. In this lesson we focus on statistical inference in the context of high-throughput measurements: specifically, the problem of detecting differences between groups using statistical tests and quantifying uncertainty in a meaningful way. We also introduce exploratory data analysis techniques that should be used in conjunction with inference when analyzing high-throughput data.
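As a preview, here is a minimal sketch in R of the kind of analysis this lesson builds toward: running one statistical test per gene and adjusting the resulting p-values for multiple comparisons. The data below are simulated purely for illustration; the lesson itself uses real gene expression datasets.

# A minimal sketch (simulated data, not the lesson's datasets) of testing
# many genes at once and adjusting for multiple comparisons.
set.seed(1)

n_genes   <- 1000   # number of features (genes)
n_samples <- 6      # three samples per group
group <- factor(rep(c("control", "treatment"), each = n_samples / 2))

# Simulated expression matrix: genes in rows, samples in columns
expr <- matrix(rnorm(n_genes * n_samples), nrow = n_genes)

# One t-test per gene comparing the two groups
pvals <- apply(expr, 1, function(gene) t.test(gene ~ group)$p.value)

# Raw p-values alone are misleading with this many tests: even with no
# real signal, we expect about 5% of them to fall below 0.05 by chance
sum(pvals < 0.05)

# Adjusting for multiple comparisons (Bonferroni and Benjamini-Hochberg);
# since no true differences were simulated, these counts should be near zero
sum(p.adjust(pvals, method = "bonferroni") < 0.05)
sum(p.adjust(pvals, method = "BH") < 0.05)

Why so many "significant" raw p-values appear by chance alone, and how corrections such as these control the error rates involved, is exactly what the episodes below work through.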

Prerequisites

This lesson assumes basic skills in the R statistical programming language and familiarity with statistical concepts, including population sampling and the interpretation of p-values.

To get started, follow the directions in the Setup tab to access the required software and data for this workshop.

Schedule

Setup      Download files required for the lesson
00:00  1.  Introduction
           What is statistical inference in the context of high-dimensional data?
           Why is it important to know about statistical inference when analyzing high-dimensional data?
00:10  2.  Example Gene Expression Datasets
           What data will we be using for our analyses?
00:45  3.  Basic inference for high-throughput data
           How are inferences from high-throughput data different from inferences from smaller samples?
           How is the interpretation of p-values affected by high-throughput data?
01:45  4.  Procedures for Multiple Comparisons
           Why are p-values not a useful quantity when dealing with high-dimensional data?
           What are error rates and how are they calculated?
02:15  5.  Error Rates
           Why are type I and type II error rates a problem in inferential statistics?
           What is the family-wise error rate, and why is it a concern in high-throughput data?
02:45  6.  The Bonferroni Correction
           What is one way to control the family-wise error rate?
03:30  7.  False Discovery Rate
           What are false discovery rates, and when are they a concern in data analysis?
           How can you control false discovery rates?
04:10  8.  Direct Approach to FDR and q-values
           How can you control false discovery rates when you don't have an a priori error rate?
05:10  9.  Basic EDA for high-throughput data
           What problems can different kinds of exploratory plots detect in high-throughput data?
06:25  10. Principal Components Analysis
           How can researchers simplify or streamline EDA in high-throughput data sets?
           What is principal component analysis (PCA) and when can it be used?
07:25  11. Statistical Models
           What are some of the most widely used parametric distributions in the life sciences besides the normal?
08:25      Finish

The actual schedule may vary slightly depending on the topics and exercises chosen by the instructor.