Example Gene Expression Datasets
Explore a Gene Expression Dataset
- High-throughput data measures thousands of features.
- High-throughput data is typically composed of multiple tables.
Basic inference for high-throughput data
- P-values are random variables.
- Very small p-values can occur by random chance when analyzing high-throughput data.
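A quick way to see this is to simulate a dataset in which no feature truly differs between groups and test every feature anyway. The feature count, group sizes, and use of `scipy` below are illustrative assumptions, not part of the lesson.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical dataset: 10,000 features measured in two groups of 12 samples,
# with NO true differences (both groups drawn from the same distribution).
n_features, n_per_group = 10_000, 12
group1 = rng.normal(0, 1, size=(n_features, n_per_group))
group2 = rng.normal(0, 1, size=(n_features, n_per_group))

# One t-test per feature: every null hypothesis is true here.
pvals = stats.ttest_ind(group1, group2, axis=1).pvalue

# Even though nothing is truly different, roughly 5% of p-values fall below
# 0.05, and the smallest ones can be very small, purely by chance.
print((pvals < 0.05).sum())   # around 500 "significant" features
print(pvals.min())            # often below 0.001
```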
Procedures for Multiple Comparisons
- We want our inferential analyses to maximize the percentage of true positives (sensitivity) and true negatives (specificity).
- Because p-values are random variables, conducting multiple comparisons on high-throughput data can produce a large number of false positives (Type I errors) simply by chance.
- There are procedures for improving sensitivity and specificity by controlling error rates below a predefined value.
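For concreteness, here is a minimal sketch of how sensitivity and specificity are computed from the four possible outcomes of a testing procedure; the counts are invented for illustration.

```python
# Hypothetical outcome counts from one multiple-testing procedure:
tp = 80     # true positives: real effects that were called significant
fp = 450    # false positives (Type I errors)
fn = 20     # false negatives (Type II errors): real effects that were missed
tn = 9450   # true negatives: true nulls correctly left alone

sensitivity = tp / (tp + fn)   # 0.80 -> share of real effects detected
specificity = tn / (tn + fp)   # ~0.95 -> share of true nulls correctly retained
print(sensitivity, specificity)
```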
Error Rates
- Type I and Type II errors are complementary concerns in data analysis: the reason we don't set extremely strict cutoffs for alpha is that we don't want to miss true positive results.
- It is possible to calculate the probability of finding at least one false positive result when conducting multiple inferential tests. This probability is the Family-Wise Error Rate (FWER).
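Assuming independent tests each run at the same alpha, the FWER is 1 - (1 - alpha)^m; the short script below simply evaluates that formula for a few test counts.

```python
# FWER for m independent tests, each performed at alpha = 0.05:
#   FWER = 1 - (1 - alpha) ** m
alpha = 0.05
for m in (1, 10, 100, 10_000):
    fwer = 1 - (1 - alpha) ** m
    print(f"{m:>6} tests -> FWER = {fwer:.4f}")
# 1 test    -> 0.0500
# 10 tests  -> 0.4013
# 100 tests -> 0.9941
# 10,000    -> effectively 1.0000
```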
The Bonferroni Correction
- The Bonferroni technique controls FWER by dividing a predetermined alpha rate (e.g. alpha = .05) by the number of inferential tests performed.
- The Bonferroni correction is very strict and conservative.
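As a sketch of the Bonferroni correction, using made-up p-values and the `statsmodels` implementation (an assumption; the lesson may use other tooling):

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# Hypothetical p-values from 10,000 feature-level tests with no real effects.
rng = np.random.default_rng(2)
pvals = rng.uniform(size=10_000)

# Bonferroni: compare each p-value to alpha / m ...
alpha, m = 0.05, pvals.size
print("per-test cutoff:", alpha / m)   # 5e-06

# ... or equivalently work with adjusted p-values.
reject, p_adj, _, _ = multipletests(pvals, alpha=alpha, method="bonferroni")
print("features called significant:", reject.sum())   # usually 0 here
```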
False Discovery Rate
- Controlling FWER too strictly can cause researchers to fail to reject the null hypothesis when it is actually false, missing true effects. This is especially likely with the small samples used in discovery-phase experiments.
- The Benjamini-Hochberg correction controls the false discovery rate (FDR), keeping its expected value at or below a desired alpha level.
- FDR control is a more liberal correction than Bonferroni: while it admits more false positives, it also provides more statistical power.
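A sketch of Benjamini-Hochberg on a made-up mixture of null and non-null p-values, again using `statsmodels` as one possible implementation:

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# Hypothetical mixture: 9,000 null features (uniform p-values) plus 1,000
# features with real effects (p-values concentrated near zero).
rng = np.random.default_rng(3)
pvals = np.concatenate([rng.uniform(size=9_000),
                        rng.beta(0.1, 10, size=1_000)])

# Benjamini-Hochberg keeps the expected false discovery rate at or below alpha.
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print("discoveries:", reject.sum())
print("of which from the null block:", reject[:9_000].sum())
```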
Direct Approach to FDR and q-values
- The Storey correction makes different assumptions than Benjamini-Hochberg. It does not set an a priori alpha level, but instead estimates the proportion of true null hypotheses from the data at hand.
- The Storey correction is less computationally stable than Benjamini-Hochberg.
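The standard implementation is the R `qvalue` package; the function below is only a minimal, illustrative sketch of the idea: estimate the proportion of true nulls (pi0) from p-values above a tuning threshold lambda, then scale BH-style adjusted values by it.

```python
import numpy as np

def qvalues(pvals, lam=0.5):
    """Minimal sketch of Storey's direct FDR approach (illustrative only)."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    # Estimate the proportion of true null hypotheses: p-values above lambda
    # should come almost entirely from true nulls, which are uniform on [0, 1].
    pi0 = min(1.0, np.mean(p > lam) / (1 - lam))
    # Rank the p-values and compute pi0 * m * p / rank (BH scaled by pi0).
    order = np.argsort(p)
    ranked = pi0 * m * p[order] / np.arange(1, m + 1)
    # Enforce monotonicity from the largest p-value downwards to get q-values.
    q_sorted = np.minimum.accumulate(ranked[::-1])[::-1]
    q = np.empty(m)
    q[order] = q_sorted
    return q

# Usage on hypothetical p-values (same kind of mixture as above):
rng = np.random.default_rng(4)
pvals = np.concatenate([rng.uniform(size=9_000), rng.beta(0.1, 10, size=1_000)])
print("features with q < 0.05:", (qvalues(pvals) < 0.05).sum())
```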
Basic EDA for high-throughput data
- While it is tempting to jump straight into inferential analyses, it is very important to run EDA first. Visualizing high-throughput data with EDA enables researchers to detect biological and technical issues with the data. Plots can show at a glance where errors lie, making inferential analyses more accurate, efficient, and replicable.
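One simple EDA plot of this kind is a per-sample boxplot of the expression matrix; the data and the shifted sample below are invented to show how a technical problem can stand out at a glance.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical expression matrix: 5,000 features (rows) x 12 samples (columns),
# with one sample deliberately shifted to mimic a technical artifact.
rng = np.random.default_rng(5)
expr = rng.normal(8, 2, size=(5_000, 12))
expr[:, 7] += 3   # sample 8 has inflated values overall

# One box per sample: the problem sample stands out before any testing is done.
plt.boxplot(expr)
plt.xlabel("sample")
plt.ylabel("expression (log scale)")
plt.show()
```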
Principal Components Analysis
What is a principal component?
How many principal components do we need?
Using PCA to analyse gene expression data
Using PCA output in further analysis
- Visualizing data with thousands or tens of thousands of measurements is impossible using standard techniques.
- Dimension reduction techniques coupled with visualization can reveal relationships between dimensions (rows or columns) in the data.
- Principal components analysis is a dimension reduction technique that can reduce and summarize large datasets.
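A minimal sketch of PCA on a simulated expression matrix, using scikit-learn (an assumption; the lesson itself may use different tooling):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical matrix: 200 samples (rows) x 5,000 features (columns), with two
# sample groups that differ on a small subset of features.
rng = np.random.default_rng(6)
X = rng.normal(size=(200, 5_000))
X[:100, :50] += 2.0   # the first 100 samples are shifted on 50 features

# Standardize features, then project the samples onto the leading components.
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=10)
scores = pca.fit_transform(X_std)

print(scores.shape)                       # (200, 10): reduced representation
print(pca.explained_variance_ratio_[:3])  # variance captured by PC1-PC3
# The columns of `scores` can be plotted or used in downstream analyses.
```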
Statistical Models
- Not every dataset follows a normal distribution. It is important to be aware of other parametric distributions underlying the data collected. This improves the technical precision of study results.
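For example, sequencing read counts are discrete, non-negative, and often overdispersed, so they are commonly modeled with Poisson or negative binomial distributions rather than a normal. The simulated counts below (an illustrative assumption) show the telltale pattern.

```python
import numpy as np
from scipy import stats

# Hypothetical read counts for one feature across 100 samples, drawn from a
# negative binomial, a common model for sequencing count data.
rng = np.random.default_rng(7)
counts = rng.negative_binomial(n=5, p=0.3, size=100)

# Count data are discrete, non-negative and right-skewed, and the variance
# grows with the mean, so a normal model can describe them poorly.
print(counts.mean(), counts.var(ddof=1))   # variance well above the mean
print(stats.skew(counts))                  # positive skew
```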