Missing data
Overview
Teaching: 15 min
Exercises: 15 min
Questions
What does R do when data are missing?
Objectives
Analyze vectors with missing data.
Missing data
Because R was designed to analyze datasets, it includes the concept of missing data
(which is uncommon in other programming languages). Missing data are represented
in vectors as NA.
When doing operations on numbers, most functions will return NA if the data
you are working with include missing values. This feature
makes it harder to overlook the cases where you are dealing with missing data.
You can add the argument na.rm = TRUE to calculate the result as if the missing
values were removed first (rm stands for ReMoved).
heights <- c(2, 4, 4, NA, 6)
mean(heights)
[1] NA
max(heights)
[1] NA
mean(heights, na.rm = TRUE)
[1] 4
max(heights, na.rm = TRUE)
[1] 6
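As a minimal sketch of the same idea, the snippet below shows one NA propagating through arithmetic and sum(), and checks that na.rm = TRUE gives the same mean as computing on the non-missing elements. The variable names (total_with_na, total_without_na, same_mean, n_all) are illustrative only and are not part of the lesson.

```r
heights <- c(2, 4, 4, NA, 6)

# A single NA propagates through arithmetic and summaries.
heights + 1                                      # 3 5 5 NA 7
total_with_na <- sum(heights)                    # NA

# na.rm = TRUE drops the missing values before computing.
total_without_na <- sum(heights, na.rm = TRUE)   # 16

# The result matches computing on the non-missing subset.
same_mean <- identical(mean(heights, na.rm = TRUE),
                       mean(heights[!is.na(heights)]))   # TRUE

# length() has no na.rm argument; it counts NA like any other element.
n_all <- length(heights)                         # 5
```

Not every function accepts na.rm, so for functions such as length() the missing values have to be removed explicitly first.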
If your data include missing values, you may want to become familiar with the
functions is.na(), na.omit(), and complete.cases(). See below for
examples.
# Extract those elements which are not missing values.
heights[!is.na(heights)]
[1] 2 4 4 6
# Returns the object with incomplete cases removed.
# The returned object is an atomic vector of type
# `"numeric"` (or `"double"`).
na.omit(heights)
[1] 2 4 4 6
attr(,"na.action")
[1] 4
attr(,"class")
[1] "omit"
# Extract those elements which are complete cases.
# The returned object is an atomic vector of type
# `"numeric"` (or `"double"`).
heights[complete.cases(heights)]
[1] 2 4 4 6
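The subsetting examples above rely on the logical vector that is.na() returns. The sketch below, which assumes the same heights vector, shows that vector directly and uses it to count and locate the missing values. The names missing, n_missing, and where_missing are illustrative.

```r
heights <- c(2, 4, 4, NA, 6)

# is.na() returns one TRUE/FALSE per element.
missing <- is.na(heights)
missing                            # FALSE FALSE FALSE  TRUE FALSE

# TRUE counts as 1, so sum() gives the number of missing values.
n_missing <- sum(missing)          # 1

# which() gives the positions of the missing values.
where_missing <- which(missing)    # 4

# anyNA() is a quick check for whether any value is missing.
anyNA(heights)                     # TRUE
```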
Recall that you can use the typeof() function to find the type of your atomic vector.
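A short sketch of that check, assuming the heights vector from above: all three approaches give a vector of type "double", but na.omit() also attaches an na.action attribute recording which positions were dropped. The name heights_omitted is illustrative.

```r
heights <- c(2, 4, 4, NA, 6)

typeof(heights)                          # "double"
typeof(heights[!is.na(heights)])         # "double"

heights_omitted <- na.omit(heights)
typeof(heights_omitted)                  # "double"
class(heights_omitted)                   # "numeric"

# The extra attribute makes the na.omit() result differ from plain subsetting.
identical(heights_omitted, heights[!is.na(heights)])              # FALSE

# as.vector() strips the attribute, leaving the same values.
identical(as.vector(heights_omitted), heights[!is.na(heights)])   # TRUE
```

If the extra attribute gets in the way of a later comparison, subsetting with !is.na() or wrapping the result in as.vector() avoids it.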
Exercise
- Using this vector of heights in inches, create a new vector, heights_no_na, with the NAs removed.
  heights <- c(63, 69, 60, 65, NA, 68, 61, 70, 61, 59, 64, 69, 63, 63, NA, 72, 65, 64, 70, 63, 65)
- Use the function median() to calculate the median of the heights vector.
- Use R to figure out how many people in the set are taller than 67 inches.
Solution
heights_no_na <- heights[!is.na(heights)]
# or
heights_no_na <- na.omit(heights)
# or
heights_no_na <- heights[complete.cases(heights)]

median(heights, na.rm = TRUE)

heights_above_67 <- heights_no_na[heights_no_na > 67]
length(heights_above_67)
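As a possible alternative for the last step, the comparison can be counted directly with sum() and na.rm = TRUE. The sketch below also shows why the NAs need handling: subsetting the original vector with a comparison keeps an NA for each missing height, which inflates the count. The names n_tall, n_wrong, and med are illustrative.

```r
heights <- c(63, 69, 60, 65, NA, 68, 61, 70, 61, 59, 64, 69, 63, 63, NA,
             72, 65, 64, 70, 63, 65)

# Median of the 19 non-missing heights.
med <- median(heights, na.rm = TRUE)       # 64

# heights > 67 is TRUE, FALSE, or NA; na.rm = TRUE ignores the NAs.
n_tall <- sum(heights > 67, na.rm = TRUE)  # 6

# Pitfall: without removing NAs, the subset keeps two NA elements.
n_wrong <- length(heights[heights > 67])   # 8
```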
Key Points