This lesson is in the early stages of development (Alpha version)

Missing data

Overview

Teaching: 15 min
Exercises: 15 min
Questions
  • What does R do when data are missing?

Objectives
  • Analyze vectors with missing data.

Missing data

As R was designed to analyze datasets, it includes the concept of missing data (which is uncommon in other programming languages). Missing data are represented in vectors as NA.
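As a small illustration of this idea, the sketch below uses two made-up vectors (`temperatures` and `species` are hypothetical names, not part of the lesson data) to show that NA can stand in for a missing value of any type, and that operations on a missing value generally produce another missing value.

```r
# Hypothetical example values, not part of the lesson data
temperatures <- c(18.5, NA, 21.0)
species <- c("cat", NA, "dog")

# is.na() flags the missing element in either vector
is.na(temperatures)
is.na(species)

# Arithmetic and comparisons involving NA are themselves NA
doubled <- temperatures * 2
doubled
temperatures > 20
```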

When doing operations on numbers, most functions will return NA if the data you are working with include missing values. This behavior makes it harder to overlook missing data. You can add the argument na.rm = TRUE to calculate the result as if the missing values had been removed first (rm stands for "remove").

heights <- c(2, 4, 4, NA, 6)
mean(heights)
[1] NA
max(heights)
[1] NA
mean(heights, na.rm = TRUE)
[1] 4
max(heights, na.rm = TRUE)
[1] 6
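Before reaching for na.rm = TRUE, it can help to check how many values are actually missing. The following is a minimal sketch using the same small heights vector; the name `n_missing` is introduced here only for illustration.

```r
heights <- c(2, 4, 4, NA, 6)

# is.na() gives TRUE for each missing element; summing the TRUEs counts them
n_missing <- sum(is.na(heights))
n_missing

# Position(s) of the missing values in the vector
which(is.na(heights))

# na.rm = TRUE is accepted by other summary functions as well, e.g. sum()
sum(heights, na.rm = TRUE)
```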

If your data include missing values, you may want to become familiar with the functions is.na(), na.omit(), and complete.cases(). See below for examples.

# Extract those elements which are not missing values.
heights[!is.na(heights)]
[1] 2 4 4 6
# Returns the object with incomplete cases removed.
# The returned object is an atomic vector of type
# `"numeric"` (or `"double"`); the attributes printed
# after it record which positions were dropped.
na.omit(heights)
[1] 2 4 4 6
attr(,"na.action")
[1] 4
attr(,"class")
[1] "omit"
# Extract those elements which are complete cases. 
# The returned object is an atomic vector of type 
# `"numeric"` (or `"double"`).
heights[complete.cases(heights)]
[1] 2 4 4 6

Recall that you can use the typeof() function to find the type of your atomic vector.
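As a quick check, and only as a sketch, typeof() can be applied to the vector before and after the missing values are removed. The names `type_before` and `type_after` are hypothetical, introduced just for this example.

```r
heights <- c(2, 4, 4, NA, 6)

# The presence of NA does not change the type of a numeric vector
type_before <- typeof(heights)
type_before

# Removing the NA leaves the type unchanged
type_after <- typeof(heights[!is.na(heights)])
type_after

# The result of na.omit() is still an atomic vector of the same type
typeof(na.omit(heights))
is.atomic(na.omit(heights))
```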

Exercise

  1. Using this vector of heights in inches, create a new vector, heights_no_na, with the NAs removed.
heights <- c(63, 69, 60, 65, NA, 68, 61, 70, 61, 59, 64, 69, 63, 63, NA, 72, 65, 64, 70, 63, 65)
  2. Use the function median() to calculate the median of the heights vector.

  3. Use R to figure out how many people in the set are taller than 67 inches.

Solution

  1. heights_no_na <- heights[!is.na(heights)]
     # or
     heights_no_na <- na.omit(heights)
     # or
     heights_no_na <- heights[complete.cases(heights)]

  2. median(heights, na.rm = TRUE)

  3. heights_above_67 <- heights_no_na[heights_no_na > 67]
     length(heights_above_67)
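
One possible alternative for counting the people taller than 67 inches, offered as a sketch rather than as the lesson's own solution, is to sum the logical comparison directly and let na.rm = TRUE skip the missing values. The name `n_taller` is hypothetical.

```r
heights <- c(63, 69, 60, 65, NA, 68, 61, 70, 61, 59, 64, 69, 63, 63, NA, 72, 65, 64, 70, 63, 65)

# heights > 67 is TRUE, FALSE, or NA for each person;
# sum() counts the TRUEs, and na.rm = TRUE ignores the NAs
n_taller <- sum(heights > 67, na.rm = TRUE)
n_taller
```

This avoids creating an intermediate vector, at the cost of being slightly less explicit about how the missing values are handled.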
      

Key Points