
Statistical Experimental Design

Introduction

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • What is gained from good experimental design?

Objectives
  • Connect experimental design with data quality and meaningful findings.

FIXME

Key Points

  • Good experimental design generates information-rich data with a clear message.


Essential Features of a Comparative Experiment

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • How are comparative experiments structured?

Objectives
  • Describe the common features of comparative experiments.

  • Identify experimental units, treatments, response measurements, and ancillary and nuisance variables.

The Generation 100 study evaluated the effects of exercise on more than 1500 elderly Norwegians from Trondheim, Norway, to determine if exercise led to a longer active and healthy life. Specifically the researchers investigated the relationship between exercise intensity and health and longevity. One group performed high-intensity interval training (10 minute warm-up followed by four 4-minute intervals at ∼90% of peak heart rate) twice a week for five years. A second group performed moderate exercise twice a week (50 minutes of continuous exercise at ∼70% of peak heart rate). A third control group followed physical activity advice according to national recommendations. Clinical examinations and questionnaires were administered to all at the start and after one, three, and five years. Heart rate, blood pressure, leg and grip strength, cognitive function, and other health indicators were measured during clinical exams.

Challenge 1: Raw ingredients of a comparative experiment

Discuss the following questions with your partner, then share your answers to each question in the collaborative document.

  1. What is the research question in this study? If you prefer to name a hypothesis, turn the research question into a declarative statement.
  2. What is the treatment (treatment factor)? How many levels are there for this treatment factor?
  3. What are the experimental units (the entities to which treatments are applied)?
  4. What are the responses (the measurements used to determine treatment effects)?
  5. Should participants have been allowed to choose which group (high-intensity, moderate exercise, or national standard) they wanted to join? Why or why not? Should the experimenters have assigned participants to treatment groups based on their judgment of each participant’s characteristics? Why or why not?

Solution

The research question asked whether exercise, specifically high-intensity exercise, would affect healthspan and lifespan of elderly Norwegians. The treatments were high-intensity, moderate-intensity, and national standard exercise groups. The experimental units are the individuals. The responses measured were heart rate, blood pressure, strength, cognitive function and other health indicators. The main response measured was 5-year survival. If participants had been allowed to choose their preferred exercise group or if experimenters had chosen the groups based on participant characteristics, extraneous variables (e.g. state of depression) could be introduced into the study. When participants are randomly assigned to treatment groups, these variables are spread across the groups and cancel out. Furthermore, if experimenters had used their own judgment to assign participants to groups, their own biases could have affected the results.

Conducting a Comparative Experiment

Comparative experiments apply treatments to experimental units and measure the responses, then compare the responses to those treatments with statistical analysis. If the experimenters in the Generation 100 study had used only one group (e.g. high-intensity training), they might have achieved good results but would have had no way of knowing whether either of the other treatments would have achieved the same or even better results. To know whether high-intensity training is better than moderate or low-intensity training, it was necessary to run an experiment in which some experimental units engaged in high-intensity training, others in moderate, and others still in low-intensity training. Only then can the responses to those treatments be statistically analyzed to determine treatment effects.

Challenge 2: Which are the experimental units?

Identify the experimental units in each experiment described below, then share your answers in the collaborative document.

  1. Three hundred mice are individually housed in the same room. Half of them are fed a high-fat diet and the other half are fed regular chow.
  2. Three hundred mice are housed five per cage in the same room. Half of them are fed a high-fat diet and the other half are fed regular chow.
  3. Three hundred mice are individually housed in two different rooms. Those in the first room are fed a high-fat diet and those in the other room are fed regular chow.

Solution

  1. The individual animal is the experimental unit.

  2. The cage receives the treatment and is the experimental unit.

  3. The room receives the treatment and is the experimental unit.

Reducing Bias with Randomization and Blinding

Randomized studies assign experimental units to treatment groups randomly by pulling a number out of a hat or using a computer’s random number generator. The main purpose for randomization comes later during statistical analysis, where we compare the data we obtained to the data distribution we might have obtained. Randomization provides us a way to create the distribution of data we might have obtained and ensures that our comparisons between treatment groups are valid. Random assignment (allocation) of experimental units to treatment groups prevents the subjective bias that might be introduced by an experimenter who selects, even in good faith and with good intention, which experimental units should get which treatment. For example, if the experimenter selected which people would do high-, moderate- and low-intensity training they might unconsciously bias the groups by body size or shape. This selection bias would influence the outcome of the experiment.

Randomization also accounts for or cancels out effects of “nuisance” variables like the time or day of the experiment, the investigator or technician, equipment calibration, exposure to light or ventilation in animal rooms, or other variables that are not being studied but that do influence the responses. Randomization balances out the effects of nuisance variables between treatment groups by giving an equal probability for an experimental unit to be assigned to any treatment group.

Blinding (also known as masking) prevents the experimenter from influencing the outcome of an experiment to suit their expectations or preferred hypothesis. Ideally experimenters should not know which treatment the experimental units have received or will receive, from the start of the experiment through to the statistical analysis. This might require additional personnel, such as technicians or other colleagues, to perform some tasks, and should be considered during experimental design. If ideal circumstances can’t be arranged, it should still be possible to carry out at least some of the stages blind. Blinding during allocation (assignment of experimental units to treatment groups), treatment, data collection or data analysis can reduce experimental bias.
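One simple way to blind data collection is to replace treatment labels with neutral codes, with a colleague holding the key until analysis. Here is a minimal sketch in R; the mouse IDs, diets, and codes are hypothetical.

# a minimal sketch of blinding: the experimenter records data using
# neutral codes while a colleague keeps the key (all names hypothetical)
set.seed(7)
mouse_ID <- LETTERS
treatment <- sample(rep(c("chow", "high fat"), each = 13))

# the key pairs each diet with a neutral code
code_key <- data.frame(treatment = c("chow", "high fat"), code = c("X", "Y"))

# the experimenter sees only the coded table, not the diet names
blinded <- data.frame(mouse_ID,
                      code = code_key$code[match(treatment, code_key$treatment)])
head(blinded)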

Challenge 3: How does bias enter an experiment?

Identify ways that bias enters into each experiment described below, then share your answers in the collaborative document.

  1. A clinician perceives increased aggression in subjects given testosterone.
  2. A clinician concludes that mood of each subject has improved in the treatment group given a new antidepressant.
  3. A researcher unintentionally treats subjects differently based on their treatment group by providing more food to control group animals.
  4. A clinician gives different nonverbal cues to patients in the treatment group of a clinical trial than to the control group patients.

Solution

1 and 2 describe nonblind data collection reporting increased treatment effects. Inflated effect sizes are a common problem with nonblinded studies. In 3 and 4 the experimenter unintentionally treats the groups differently, through extra food or nonverbal cues, which can alter the responses themselves. In all four cases, blinding the people who administer treatments and collect data would reduce the bias.

Key Points

  • The raw ingredients of comparative experiments are experimental units, treatments and responses.


Experimental Design Principles

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • What are the core principles of experimental design?

Objectives
  • The way in which a design applies treatments to experimental units and measures the responses will determine 1) what questions can be answered and 2) with what precision relationships can be described.

  • The core principles guiding the way are 1) replication, 2) randomization and 3) blocking.

Variability is natural in the real world. A medication given to a group of patients will affect each of them differently. A specific diet given to a cage of mice will affect each mouse differently. Ideally, if something were measured many times, each measurement would give exactly the same result and would represent the true value. This ideal doesn’t exist in the real world. For example, the mass of one kilogram was defined, until 2019, by the International Prototype Kilogram, a cylinder composed of platinum and iridium about the size of a golf ball.

International Prototype of the Kilogram

Copies of this prototype kilogram (replicates) were distributed worldwide so that each country hosting a replica had its own national standard kilogram. None of the replicas measures precisely the same mass despite careful storage and handling, and the reasons for this variation in measurements are not known. A kilogram in Austria differs from a kilogram in Australia, which differs from those in Brazil, Kazakhstan, Pakistan, Switzerland or the U.S. What we assume to be an absolute measure of mass shows real-world natural variability. Variability is a feature of natural systems, and a natural part of every experiment we undertake.

Replication to characterize variability

To figure out whether a difference in responses is real or inherently random, replication applies the same treatment to multiple experimental units. The variability of the responses within a set of replicates provides a measure against which we can compare differences among treatments. This variability is known as experimental error. This does not mean that something was done wrong! It is a phrase describing the random variability in the responses. Random variation is also known as random error or noise. It reflects imprecision, but not inaccuracy. Larger sample sizes reduce this imprecision.

In addition to random (experimental) error, also known as noise, there are two other sources of variability in experiments. Systematic error, or bias, occurs when measurements or observations deviate consistently in one particular direction, either overestimating or underestimating the true value. As an example, a scale might be calibrated so that mass measurements are consistently too high or too low. Unlike random error, systematic error is consistent in one direction, is predictable, and follows a pattern. Larger sample sizes don’t correct for systematic bias; equipment or measurement calibration does. Technical replicates characterize this systematic bias by running the same sample through the machine or measurement protocol multiple times to measure the variation caused by equipment or protocols.
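To make the distinction concrete, here is a small simulation sketch with made-up numbers. Random error averages out as measurements accumulate, while systematic error does not.

# a sketch: weigh a 20-gram mass many times on a miscalibrated scale
set.seed(1)
true_mass <- 20
random_error <- rnorm(1000, mean = 0, sd = 0.5) # noise; differs each reading
systematic_error <- 1                           # scale always reads 1 gram heavy
measured <- true_mass + random_error + systematic_error

mean(measured) # close to 21, not 20: more readings don't remove the bias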

A biological replicate measures different biological samples in parallel to estimate the variation caused by the unique biology of the samples. The sample or group of samples are derived from the same biological source, such as cells, tissues, organisms, or individuals. Biological replicates assess the variability and reproducibility of experimental results. For example, if a study examines the effect of a drug on cell growth, biological replicates would involve multiple sets of cells from the same cell line to test the drug’s effects. This helps to ensure that any observed changes are due to the drug itself rather than variations in the biological material being used.

The greater the number of replications, the greater the precision (the closeness of two or more measurements to each other). Having a large enough sample size to ensure high precision is necessary to ensure reproducible results.
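A quick simulation sketch, using standard normal noise, shows how replication improves precision: the spread of sample means shrinks as the number of replicates grows.

# a sketch: spread of 1000 sample means computed from 5 vs. 50 replicates
set.seed(5)
sd(replicate(1000, mean(rnorm(5))))  # wider spread of means with n = 5
sd(replicate(1000, mean(rnorm(50)))) # narrower spread of means with n = 50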

Exercise 1: Which kind of error?

A study used to determine the effect of a drug on weight loss could have the following sources of experimental error. Classify the following sources as either biological, systematic, or random error.
1). A scale is broken and provides inconsistent readings.
2). A scale is calibrated wrongly and consistently measures mice 1 gram heavier.
3). A mouse has an unusually high weight compared to its experimental group (i.e., it is an outlier).
4). Strong atmospheric low pressure and accompanying storms affect instrument readings, animal behavior, and indoor relative humidity.

Solution to Exercise 1

1). random, because the broken scale provides inconsistent, unpredictable readings
2). systematic, because the measurements are consistently 1 gram too heavy
3). biological, because the variation comes from the animal itself
4). random or systematic; it can be argued either way, so explain your reasoning

These three sources of error can be mitigated by good experimental design. Systematic and biological error can be characterized and controlled through adequate numbers of technical and biological replicates, respectively. Random error is harder to handle because, by definition, it is unpredictable or unknowable; replicates are not effective against it. For example, an atmospheric low-pressure system or a strong storm could affect equipment measurements, animal behavior, and indoor relative humidity, introducing random error. We could assume that all random error will balance itself out and that all samples will be equally subject to it. A more precise way to mitigate random error is through blocking.

Randomization

Exercise 2: The efficient technician

Your technician colleague finds a way to simplify and expedite an experiment. The experiment applies four different wheel-running treatments to twenty different mice over the course of five days. Each day, four mice are treated individually for two hours each, with the four treatments assigned at random. Your clever colleague decides that a simplified protocol would work just as well and save time: run treatment 1 on five mice on day 1, treatment 2 on five mice on day 2, and so on. Some overtime would be required each day but the experiment would be completed in only four days, and then they can take Friday off! Does this adjustment make sense to you?
Can you foresee any problems with the experimental results?

Solution to Exercise 2

Since each treatment is run on only one day, the day effectively becomes the experimental unit: the five mice treated on a given day share not only the treatment but also everything else that happened that day, so the treatment is applied to the day as a whole. Each experimental unit (day) has five samples (mice), but there is only one replication of each treatment. As a result there is no valid way to compare treatments, because there is no way to separate the treatment effect from day-to-day differences in environment, equipment setup, personnel, and other extraneous variables.

Why should treatments be randomly assigned to experimental units? Randomization minimizes bias and moderates experimental error (a.k.a. noise). A hat full of numbers, a random number table or a computational random number generator can be used to assign random numbers to experimental units so that any experimental unit has equal chances of being assigned to a specific treatment group.

Here is an example of randomization using a random number generator. The study asks how a high-fat diet affects blood pressure in mice. If the number is odd, the sample is assigned to the treatment group, which receives the high-fat diet. If the random number is even, the sample is assigned to the control group (the group that doesn’t receive the treatment, in this case, regular chow).

# create the mouse IDs and 26 random numbers between 1 and 100
mouse_ID <- LETTERS
random_number <- sample(x = 100, size = 26)

# %% is the modulo operator, which returns the remainder from division by 2
# if the remainder is 0 (even number), regular chow diet is assigned
treatment <- ifelse(random_number %% 2 == 0, "chow", "high fat")
random_allocation <- data.frame(mouse_ID, random_number, treatment)
random_allocation
   mouse_ID random_number treatment
1         A            43  high fat
2         B            80      chow
3         C            35  high fat
4         D            32      chow
5         E            60      chow
6         F            11  high fat
7         G            89  high fat
8         H            82      chow
9         I            71  high fat
10        J            15  high fat
11        K            69  high fat
12        L            23  high fat
13        M            46      chow
14        N             3  high fat
15        O            79  high fat
16        P            83  high fat
17        Q            28      chow
18        R            30      chow
19        S            96      chow
20        T            94      chow
21        U            42      chow
22        V            66      chow
23        W            55  high fat
24        X            37  high fat
25        Y            77  high fat
26        Z            78      chow

This might produce unequal numbers in the treatment and control groups. Equal numbers are not strictly necessary; however, sensitivity, or statistical power (the probability of detecting an effect when it truly exists), is maximized when sample sizes are equal.

table(random_allocation$treatment)

    chow high fat 
      12       14 

To randomly assign samples to groups with equal numbers, you can do the following.

# place IDs and random numbers in data frame
equal_allocation <- data.frame(mouse_ID, random_number)

# sort by random numbers (not by sample IDs)
equal_allocation <- equal_allocation[order(random_number),]

# now assign to treatment or control groups
treatment <- sort(rep(x = c("chow", "high fat"), times = 13))
equal_allocation <- cbind(equal_allocation, treatment)
row.names(equal_allocation) <- 1:26
equal_allocation
   mouse_ID random_number treatment
1         N             3      chow
2         F            11      chow
3         J            15      chow
4         L            23      chow
5         Q            28      chow
6         R            30      chow
7         D            32      chow
8         C            35      chow
9         X            37      chow
10        U            42      chow
11        A            43      chow
12        M            46      chow
13        W            55      chow
14        E            60  high fat
15        V            66  high fat
16        K            69  high fat
17        I            71  high fat
18        Y            77  high fat
19        Z            78  high fat
20        O            79  high fat
21        B            80  high fat
22        H            82  high fat
23        P            83  high fat
24        G            89  high fat
25        T            94  high fat
26        S            96  high fat

You can write out this treatment plan to a comma-separated values (csv) file, then open it in Excel and use it to record your data collection or just keep track of which samples are randomly assigned which diet.

write.csv(equal_allocation, file = "../data/random-assign.csv", row.names = FALSE)

Discussion

Why not assign treatment and control groups to samples in alphabetical order?
Did we really need a random number generator to obtain randomized equal groups?

Solution

1). Scenario: One technician processed samples A through M, and a different technician processed samples N through Z. Might the first technician have processed samples somewhat differently from the second technician? If so, there would be a “technician effect” in the results that would be difficult to separate from the treatment effect.
2). Another scenario: Samples A through M were processed on a Monday, and samples N through Z on a Tuesday. Might the weather or the environment in general have been different between Monday and Tuesday? What if a big construction project started on Tuesday, or the whole team had a birthday gathering for one of their members, or anything else in the environment differed between Monday and Tuesday? If so, there would be a “day-of-the-week effect” in the results that would be difficult to separate from the treatment effect.
3). Yet another scenario: Samples A through M were from one strain, and samples N through Z from a different strain. How would you be able to distinguish between the treatment effect and the strain effect?
4). Yet another scenario: Samples with consecutive IDs were all sibling groups. For example, samples A, B and C were all siblings, and all assigned to the same treatment.
All of these cases would introduce an effect (from the technician, the day of the week, the strain, or sibling relationships) that would confound the results and lead to misinterpretation.

Controlling Natural Variation with Blocking

Experimental units can be grouped, or blocked, to increase the precision of treatment comparisons. Blocking divides an experiment into groups of experimental units to control natural variation among these units. Treatments are randomized to experimental units within each block. Each block, then, is effectively a sub-experiment.

Randomization within blocks accounts for nuisance variables that could bias the results, such as day, time, cage proximity to light or ventilation, etc. In the illustration below, three treatments are randomized to the experimental units (the cages) on each shelf. Each shelf is a block that accounts for random variation introduced by a nuisance variable, proximity to the light.

shelves of cages beneath a light

Shelf height is a blocking factor that should be included in the data analysis phase of the experiment. Adding a nuisance variable as a blocking factor accounts for variability and can increase the probability of detecting a real treatment effect (statistical power). If the blocking factor doesn’t substantially impact variability, however, it reduces the information used to estimate a statistic (degrees of freedom) and diminishes statistical power. Blocking should only be used when a variable is suspected to impact the experiment.

Another way to define blocks of experimental units is to use characteristics or traits that are likely associated with the response. Sex and age, for example, can serve as blocking factors in experiments. Stratified randomization places experimental units into separate blocks for each age category and sex, then randomly allocates treatments within each block. As with nuisance variables, these blocking factors (age and sex) should be used in the subsequent data analysis.
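Here is a sketch of stratified randomization in R, using a small set of hypothetical subjects: treatments are randomized within each sex-by-age block.

# a sketch of stratified randomization; the subjects are hypothetical
set.seed(3)
subjects <- data.frame(id  = 1:12,
                       sex = rep(c("F", "M"), each = 6),
                       age = rep(c("70-74", "75-79", "80+"), times = 4))

# randomize treatments to subjects within each sex-by-age block
subjects$treatment <- NA
for (idx in split(seq_len(nrow(subjects)), list(subjects$sex, subjects$age))) {
  subjects$treatment[idx] <- sample(rep(c("control", "exercise"),
                                        length.out = length(idx)))
}
subjects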

Exercise 3: Explain blocking to the efficient technician

Your technician colleague is not only efficient but very well-organized. They will administer treatments A, B and C shown in the figure above.

  1. Explain to your colleague why the treatments should not be administered by shelf (e.g. the top shelf all get treatment A, the next shelf B and the lower shelf treatment C).
  2. Explain blocking to the technician and describe how it helps the experiment.

Solution to Exercise 3

Exercise 4: How and when to set up blocks

For the following scenarios, describe whether you would set up blocks and if so, how you would set up the blocks.

  1. A large gene expression study will be run in five different batches or runs.
  2. An experiment will use two different models of equipment to obtain measurements.
  3. Samples will be processed in the morning, afternoon and evening.
  4. In a study in which mice were randomly assigned to treatment and control groups, the air handler in the room went off and temperature and humidity increased.

Solution to Exercise 4

Key Points

  • Replication, randomization and blocking determine the validity and usefulness of an experiment.


Statistics in Data Analysis

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • How can information be extracted and communicated from experimental data?

Objectives
  • Plotting reveals information in the data.

  • Statistical significance testing compares experimental data obtained to probability distributions of data that might also be possible.

  • A probability distribution is a mathematical function that gives the probabilities of different possible outcomes for an experiment.

A picture is worth a thousand words

To motivate this next section on statistics, we start with an example of human variability. This 1975 living histogram of female students from the University of Wisconsin Madison shows variability in a natural population.

living histogram of female student heights

B. Joiner, Int’l Stats Review, 1975

Exercise 1: A living histogram

From the living histogram, can you estimate by eye
1). the mean and median heights of this sample of female students?
2). the spread of the data? Estimate either standard deviation or variance by eye. If you’re not sure how to do this, think about how you would describe the spread of the data from the mean. You do not need to calculate a statistic.
3). any outliers? Estimate by eye - don’t worry about calculations.
4). What do you predict would happen to mean, median, spread and outliers if an equal number of male students were added to the histogram?

Solution to Exercise 1

1). Mean and median are two measures of the center of the data. The median is the 50th percentile of the data, with half the female students above this value and half below. There are approximately 100 students total. Fifty of them appear to be above 5 feet 5 inches and fifty below 5’5”. The median is not influenced by extreme values (outliers), but the mean is. While there are some very tall and very short people, the bulk appear to be centered around a mean of 5 feet 5 inches, with a somewhat longer right tail to the histogram.
2). If the mean is approximately 5’5” and the distribution appears normal (bell-shaped), then we know that approximately 68% of the data lies within one standard deviation (sd) of the mean and ~95% lies within two sd’s. If there are ~100 people in the sample, 95% of them lie between 5’0” and 5’10” (2 sd’s = 5” above and 5” below the mean). One standard deviation would then be about 5”/2 = 2.5” from the mean of 5’5”. So 68% of the data (~68 people) lie between 5 feet 2.5 inches and 5 feet 7.5 inches.
3). There are some very tall and very short people, but it’s not clear whether they are outliers. Outliers deviate markedly from expected values, for example by more than 3 standard deviations in a normal distribution. Values more than 3 sd’s (7.5”) above or below the mean of 5’5” could be considered outliers, which here means shorter than about 4 feet 9.5 inches or taller than about 6 feet 0.5 inches. The shortest person is about 4 feet 9 inches and the tallest about 6 feet, so at most the very shortest person is borderline; there are no clear outliers in this sample.
4). Average male heights are greater than average female heights, so a random sample of 100 male students would increase the average height of the combined sample of 200 students. The mean would shift to the right of the distribution toward taller heights, as would the median. The spread would also increase, and the histogram might become bimodal, with one peak for each sex.

The first step in data analysis: plot the data!

A picture is worth a thousand words, and a picture of your data could reveal important information that can guide you forward. So first, plot the data!

# read in the simulated heart rate data
heart_rate <- read.csv("../data/heart_rate.csv")

# take a random sample of 100 and create a histogram
# first set the seed for the random number generator
set.seed(42)
sample100 <- sample(heart_rate$heart_rate, 100)
hist(sample100, xlab = "resting heart rate for 100 participants")
plot of chunk readin_data

Exercise 2: What does this picture tell you about resting heart rates?

Do the data appear to be normally distributed? Why does this matter?
Do the left and right tails of the data seem to mirror each other or not?
Are there gaps in the data?
Are there large clusters of similar heart rate values in the data?
Are there apparent outliers?
What message do the data deliver in this histogram?

Solution to Exercise 2

Now create a boxplot of the same sample data.

boxplot(sample100)
plot of chunk boxplot_simulated_gen100_data

Exercise 3: What does this boxplot tell you about resting heart rates?

What does the box signify?
What does horizontal black line dividing the box signify?
Are there apparent outliers?
How does the boxplot relate to the histogram? What message do the data deliver in this boxplot?

Solution to Exercise 3

Plotting the data can identify unusual response measurements (outliers), reveal relationships between variables, and guide further statistical analysis. When data are not normally distributed (bell-shaped and symmetrical), many of the statistical methods typically used will not perform well. In these cases the data can be transformed to a more symmetrical bell-shaped curve.
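As an example of such a transformation, here is a sketch with simulated right-skewed data; a log transform often makes the distribution more symmetrical.

# a sketch: log-transforming right-skewed (log-normal) data
set.seed(6)
skewed <- rlnorm(100)

par(mfrow = c(1, 2))                       # show the two histograms side by side
hist(skewed, main = "raw", xlab = "value")
hist(log(skewed), main = "log-transformed", xlab = "log(value)")
par(mfrow = c(1, 1))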

Statistical significance testing

The Generation 100 study aimed to determine whether high-intensity exercise in elderly adults affects lifespan and healthspan. As a first look, compare resting heart rates across the exercise groups with boxplots.

boxplot(heart_rate ~ exercise_group, data = heart_rate)
plot of chunk exercise_intensity_boxplot

Exercise 4: Comparing two groups - control vs. high intensity

  1. Does there appear to be a significant heart rate difference between the two groups? How would you know?
  2. Do any of the data overlap between the two boxplots?
  3. Can you know which exercise group a person belongs to just by knowing their heart rate? For example, for a heart rate of 80 could you say with certainty that a person belongs to one group or the other?

    Solution to Exercise 4

    1. There appears to be a trend of lower heart rate in the high-intensity exercise group; however, we can’t say whether it is significant without performing statistical tests.
    2. There is considerable overlap between the two groups, which shows that there is considerable variability in the data.
    3. Someone with a heart rate of 80 could belong to either group. When considering the significance of heart rate differences between the two groups we don’t look at individuals; rather, we look at averages between the two groups.

The boxplots above show a trend of lower heart rate in the high-intensity exercise group and higher heart rate in the control exercise group. There is inherent variability in heart rate in both groups, however, which is to be expected. That variability appears in the box and whisker lengths of the boxplots, along with the outliers that appear as hollow circles. This variability also means that the boxplots overlap between the two groups, making it difficult to determine whether there is a significant difference in mean heart rate between the groups.

We can calculate the difference in means between the two groups to answer the question about exercise intensity.

# calculate the means of the two groups
HI <- heart_rate[heart_rate$exercise_group=="high intensity", "heart_rate"]
control <- heart_rate[heart_rate$exercise_group=="control", "heart_rate"]
mean(control) - mean(HI)
[1] 20.97851

The observed difference in mean heart rates is about 21 beats per minute. Another way of stating this is that the high-intensity group had a mean heart rate roughly 26 percent lower than that of the control group.

So are we done now? Does this difference support the alternative hypothesis that there is a significant difference in mean heart rates? Or does it fail to reject the null hypothesis of no significant difference? Why do we need p-values and confidence intervals if we have evidence we think supports our claim? The reason is that the mean values are random variables that can take many different values. We are working with two samples of elderly Norwegians, not the entire population of elderly Norwegians. The means are estimates of the true mean heart rate of the entire population, a number that we can never know because we can’t access the entire population of elders. The sample means will vary with every sample we take from the population. To demonstrate this, let’s take a sample from each exercise group and calculate the difference in means for those samples.

# calculate the sample mean of 100 people in each group
HI100 <- mean(sample(HI, size=100)) 
control100 <- mean(sample(control, size=100)) 
control100 - HI100
[1] 20.22396

Now take another sample of 100 from each group and calculate the difference in means.

# calculate the sample mean of 100 people in each group
HI100 <- mean(sample(HI, size=100)) 
control100 <- mean(sample(control, size=100)) 
control100 - HI100
[1] 21.05398

Are the differences in sample means the same? We can repeat this sampling again and again, and each time arrive at a different value. The sample means are a random variable, meaning that they can take on any number of different values. Since they are random variables, the difference between the means is also a random variable.

Let’s explore random variables further. Imagine that you have measured the resting heart rate of the entire population of elderly people 70 or older, not just the 1,567 from the Generation 100 study. In practice we would never have access to the entire population, so this is a thought exercise.

# simulate the heart rates of an entire population by pooling two copies
# of the study data (a stand-in, since we can never measure everyone)
population <- c(heart_rate$heart_rate, heart_rate$heart_rate)

# sample 100 of them and calculate the mean three times
mean(sample(population, size=100)) 
[1] 71.2388
mean(sample(population, size=100)) 
[1] 68.13981
mean(sample(population, size=100)) 
[1] 69.99927

Notice how the mean changes each time you sample. We can continue to do this many times to learn about the distribution of this random variable.

The null hypothesis

Now let’s return to the mean difference between treatment groups. How do we know that this difference is due to the exercise? What would happen if everyone in the sample did the same exercise intensity? Would we still see a difference as large as the one between the two treatment groups? The scenario in which exercise intensity has no effect is called the null hypothesis. The word null reminds us to be skeptical and to entertain the possibility that there is no difference.

Because we have access to the population, we can randomly sample groups of 100 controls to observe as many differences in means as we like when exercise intensity has no effect. We can give everyone the same exercise plan and then record the difference in means between two randomly split groups of 100 and 100.

Here is this process written in R code:

##100 controls
control <- sample(population,100)
##another 100 controls that we pretend are on a high-intensity regimen 
treatment <- sample(population,100)
print(mean(treatment) - mean(control))
[1] -1.891352

Now let’s do it 10,000 times. We will use a “for-loop”, an operation that lets us automate this (a simpler approach, shown below, is to use replicate).

n <- 10000
null <- vector("numeric",n) 
for (i in 1:n) {
  control <- sample(population,100)
  treatment <- sample(population,100)
  null[i] <- mean(treatment) - mean(control)
}
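The values stored in null form the null distribution. As a sketch of how they can be used, the proportion of null differences at least as extreme as the observed difference of about 21 gives an empirical p-value; the loop above can also be written in one line with replicate.

# recompute the observed difference (the loop reassigned `control` above)
obs <- mean(heart_rate[heart_rate$exercise_group == "control", "heart_rate"]) -
  mean(heart_rate[heart_rate$exercise_group == "high intensity", "heart_rate"])

# empirical p-value: proportion of null differences as extreme as observed
mean(abs(null) >= abs(obs))

# the same null distribution, generated with replicate instead of a for-loop
null <- replicate(n, mean(sample(population, 100)) - mean(sample(population, 100)))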

Significance testing can answer questions about differences between the two groups in light of inherent variability in heart rate measurements. Comparing the data obtained to a probability distribution of data that might have been obtained can help to answer questions about the effects of exercise intensity on heart rate.

The t-test

What does it mean that a difference is statistically significant? We can eye plots like the one above and see a difference; however, we need something more objective than eyeballs to claim a significant difference. A t-test will report whether the difference in mean values between the two groups is significant. The null hypothesis states that there is no difference in mean heart rates between the groups in the whole population of elders in Norway, while the alternative hypothesis states that the population means differ.

# provide a formula stating that heart rate depends on exercise group;
# the subset argument keeps only the two groups being compared
t.test(heart_rate ~ exercise_group, data = heart_rate,
       subset = exercise_group %in% c("control", "high intensity"))

Exercise 5: What does a p-value mean?

What does this p-value tell us about the difference in means between the two groups? How can we interpret this value? What does it say about the significance of the difference in mean values?

Solution to Exercise 5

P-values are often misinterpreted as the probability that, in this example, high- and control exercise result in the same average heart rate. However, “high- and control exercise result in the same average heart rate” is not a random variable like the number of heads or tails in 10 flips of a coin. It’s a statement that doesn’t have a probability distribution, so you can’t make probability statements about it. The p-value summarizes the comparison between our data and the data we might have obtained from a probability distribution if there were no difference in mean heart rates. Specifically, the p-value tells us how far out on the tail of that distribution the data we got falls. To understand this better, we’ll explore probability distributions next.

Probability and probability distributions

Suppose you have measured the resting heart rate of the entire population of elderly Norwegians 70 or older, not just the 1,567 from the Generation 100 study. Imagine you need to describe all of these numbers to someone who has no idea what resting heart rate is. Imagine that all the measurements from the entire population are contained in the heart_rate data. We could list out all of the numbers for them to see, or take a sample and show them that sample of heart rates, but this would be inefficient and wouldn’t provide much insight into the data. A better approach is to define and visualize a distribution. The simplest way to think of a distribution is as a compact description of many numbers.

Histograms show us the proportion of values within an interval. Here is a histogram showing all resting heart rates for the entire population 70 and older.

hist(heart_rate$heart_rate, xlab = "resting heart rate")

Showing this plot is much more informative and easier to interpret than a long table of numbers. With this histogram we can approximate the number of individuals in any given interval. For example, there are approximately 30 individuals (~2%) with a resting heart rate greater than 100, and another ~30 with a resting heart rate below 60.

The histogram above approximates one that is very common in nature: the bell curve, also known as the normal distribution or Gaussian distribution.

plot(function(x) dnorm(x), -3.5, 3.5, main = "Standard normal distribution", 
     xlab = "z", ylab = "")
plot of chunk standard_normal_dist

The curve shown above is an example of a probability density function that defines a bell-shaped curve. The y-axis is the probability density, and the total area under the curve equals 1.0. The x-axis denotes a variable z that by statistical convention has a standard normal distribution. If you draw a random value from a normal distribution, the probability that the value falls in a particular interval, say from a to b, is given by the area under the curve between a and b. Software can be used to calculate these probabilities.
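In R, pnorm() returns the area under the standard normal curve to the left of a value, so the area between a and b is the difference of two calls. For example, the probability of falling within one standard deviation of the mean:

# area under the standard normal curve between a = -1 and b = 1
pnorm(1) - pnorm(-1) # approximately 0.68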

Real-world populations may be approximated by the mathematical ideal of the normal distribution. Repeat the sampling we did earlier and produce a new histogram of the sample.

sample100 <- sample(heart_rate$heart_rate, 100)
hist(sample100, xlab = "resting heart rate for 100 participants")
plot of chunk sample_hist

Exercise 6: Sampling from a population

Does the sample appear to be normally distributed?
Can you estimate the mean resting heart rate by eye?
What is the sample mean using R (hint: use mean())?
Can you estimate the sample standard deviation by eye? Hint: if normally distributed, 68% of the data will lie within one standard deviation of the mean and 95% will lie within 2 standard deviations.
What is the sample standard deviation using R (hint: use sd())?
Estimate the number of people with a resting heart rate between 60 and 70.
What message does the sample deliver about the population from which it was drawn?

Solution to Exercise 6

When the histogram of a list of numbers approximates the normal distribution, we can use a convenient mathematical formula to approximate the proportion of values or outcomes in any given interval.
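For instance, a sketch using the sample mean and standard deviation of sample100 estimates the proportion of resting heart rates between 60 and 70:

# normal approximation to the proportion of heart rates between 60 and 70;
# multiply by the sample size (100) to estimate the number of people
pnorm(70, mean = mean(sample100), sd = sd(sample100)) -
  pnorm(60, mean = mean(sample100), sd = sd(sample100))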

The perils of p-values

Confidence intervals

Sample sizes and power curves

Comparing standard deviations

Key Points

  • Plotting and significance testing describe patterns in the data and quantify effects against random variation.


Completely Randomized Designs

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • What is a completely randomized design (CRD)?

Objectives
  • CRD is the simplest experimental design.

  • In CRD, treatments are assigned randomly to experimental units.

  • CRD assumes that the experimental units are relatively homogeneous or similar.

  • CRD doesn’t remove or account for systematic differences among experimental units.

A single qualitative factor

Analysis of variance (ANOVA)

Equal variances and normality

Inference

A single quantitative factor

Design issues

Key Points

  • CRD is a simple design that can be used when experimental units are homogeneous.


Completely Randomized Design with More than One Treatment Factor

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • How is a CRD with more than one treatment factor designed and analyzed?

Objectives
  • .

  • .

Key Points

  • .


Randomized Complete Block Designs

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • What is randomized complete block design?

Objectives
  • A randomized complete block design randomizes treatments to experimental units within the block.

  • Blocking increases the precision of treatment comparisons.

Design issues

Imagine that you want to evaluate the effect of different doses of a new drug on the proliferation of cancer cell lines in vitro. You use four different cancer cell lines because you would like the results to generalize to many types of cell lines. Divide each of the cell lines into four treatment groups, each with the same number of cells. Each treatment group receives a different dose of the drug for five consecutive days.

Group 1: Control (no drug)
Group 2: Low dose (10 μM)
Group 3: Medium dose (50 μM)
Group 4: High dose (100 μM)

# create treatment levels
f <- factor(c("control", "low", "medium", "high"))

# create random orderings of the treatment levels
block1 <- sample(f, 4)
block2 <- sample(f, 4)
block3 <- sample(f, 4)
block4 <- sample(f, 4)
treatment <- c(block1, block2, block3, block4)
block <- factor(rep(c("cellLine1", "cellLine2", "cellLine3", "cellLine4"), each = 4))
dishnum <- rep(1:4, 4)
plan <- data.frame(cellLine = block, DishNumber = dishnum, treatment = treatment)
plan
    cellLine DishNumber treatment
1  cellLine1          1    medium
2  cellLine1          2      high
3  cellLine1          3   control
4  cellLine1          4       low
5  cellLine2          1      high
6  cellLine2          2   control
7  cellLine2          3       low
8  cellLine2          4    medium
9  cellLine3          1       low
10 cellLine3          2    medium
11 cellLine3          3   control
12 cellLine3          4      high
13 cellLine4          1   control
14 cellLine4          2       low
15 cellLine4          3    medium
16 cellLine4          4      high

When analyzing a randomized complete block design, the effect of the block is included in the statistical model along with the effect of the treatment.
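In R this can be done by adding the block as a term in the model formula. A minimal sketch, assuming a data frame results with hypothetical columns proliferation (the response), treatment, and cellLine (the block):

# include the block (cell line) along with the treatment in the model;
# `results` and its column names are hypothetical
fit <- aov(proliferation ~ cellLine + treatment, data = results)
summary(fit)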

Randomized block design with a single replication

Sizing a randomized block experiment

True replication

Balanced incomplete block designs

Key Points

  • Replication, randomization and blocking determine the validity and usefulness of an experiment.


Repeated Measures Designs

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • What is a repeated measures design?

Objectives
  • A repeated measures design measures the response of experimental units repeatedly during the study.

Drug effect on heart rate

Among-subject vs. within-subject variability

Each subject can be its own control

crossover design for heart rate

Key Points

  • .