This lesson is in the early stages of development (Alpha version)

Introduction

Overview

Teaching: 5 min
Exercises: 5 min
Questions
  • What is statistical inference in the context of high-dimensional data?

  • Why is it important to know about statistical inference when analyzing high-dimensional data?

Objectives
  • Explain how biological technology has expanded the quantity and complexity of data, and why such data require special analytical considerations.

Introduction

High-throughput technologies have changed basic biology and the biomedical sciences from data-poor disciplines to data-intensive ones. A specific example comes from research fields interested in understanding gene expression. Gene expression is the process in which DNA, the blueprint for life, is copied into RNA, the template for the synthesis of proteins, which are the building blocks of life.

In the 1990s, the analysis of gene expression data amounted to spotting black dots on a piece of paper or extracting a few numbers from standard curves. With high-throughput technologies, such as microarrays, this suddenly changed to sifting through tens of thousands of numbers. More recently, RNA sequencing has further increased data complexity. Biologists went from using their eyes or simple summaries to categorize results, to having thousands (and now millions) of measurements per sample to analyze.

In this chapter, we will focus on statistical inference in the context of high-throughput measurements. Specifically, we focus on the problem of detecting differences between groups using statistical tests and quantifying uncertainty in a meaningful way. We also introduce exploratory data analysis techniques that should be used in conjunction with inference when analyzing high-throughput data. In later chapters, we will study the statistics behind clustering, machine learning, factor analysis and multi-level modeling.
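To make the problem of detecting differences between groups more concrete, the following is a minimal sketch of what can happen when one test is run per measurement. It is not part of the lesson's own materials. It simulates two groups with no true difference and counts how many features still fall below a conventional significance threshold. The number of features, the group size, the threshold, and the use of a simple z-test with a known standard deviation are all assumptions made for illustration.

```python
import random
from statistics import NormalDist, fmean

random.seed(1)  # fixed seed so the sketch is reproducible

n_features = 10_000   # hypothetical number of genes measured per sample
n_per_group = 12      # hypothetical number of samples in each group
alpha = 0.05          # conventional significance threshold (an assumption)

std_normal = NormalDist()  # standard normal, used for two-sided p-values


def two_sided_p(group_a, group_b, sigma=1.0):
    """Two-sample z-test p-value for equal-sized groups.

    Assumes the common standard deviation sigma is known, which keeps the
    sketch within the standard library; real analyses typically use t-tests.
    """
    n = len(group_a)
    standard_error = sigma * (2 / n) ** 0.5
    z = (fmean(group_a) - fmean(group_b)) / standard_error
    return 2 * (1 - std_normal.cdf(abs(z)))


p_values = []
for _ in range(n_features):
    # Both groups are drawn from the same distribution: no true difference.
    group_a = [random.gauss(0, 1) for _ in range(n_per_group)]
    group_b = [random.gauss(0, 1) for _ in range(n_per_group)]
    p_values.append(two_sided_p(group_a, group_b))

false_positives = sum(p < alpha for p in p_values)
print(f"{false_positives} of {n_features} features have p < {alpha}")
```

Because every feature is simulated without a real effect, roughly 5% of them are expected to fall below the threshold by chance alone. With tens of thousands of measurements per sample, that can amount to hundreds of apparent "differences", which is one reason inference on high-throughput data needs special care.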

Discussion

Turn to a partner and discuss the following point.
More data are better than less data, right? Based on what you already know about statistical inference, do you agree with this or not? Are more data better? Are fewer data better? Why or why not? Share your responses with the group through the collaborative document.

Solution

Key Points

  • Inferential analyses of high-throughput data must account for the extremely large number of measurements made on each sample.