This lesson is in the early stages of development (Alpha version)

Subsetting vectors

Overview

Teaching: 15 min
Exercises: 15 min
Questions
  • How can I select specific elements of a data vector?

Objectives
  • Inspect and manipulate the content of vectors.
  • Subset and extract values from vectors.

Subsetting vectors

If we want to extract one or several values from a vector, we must provide one or several indices in square brackets. For instance:

animals <- c("mouse", "rat", "dog", "cat")
animals[2]
[1] "rat"
animals[c(3, 2)]
[1] "dog" "rat"

We can also repeat the indices to create an object with more elements than the original one:

more_animals <- animals[c(1, 2, 3, 2, 1, 4)]
more_animals
[1] "mouse" "rat"   "dog"   "rat"   "mouse" "cat"  

R indices start at 1. Like Fortran, MATLAB, and Julia, R counts from 1 because that is how people typically count. Languages in the C family (including C++, Java, Perl, and Python) count from 0 because that is simpler for computers to do.
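To see 1-based indexing in action, the short sketch below reuses the animals vector from above and asks for the first element, for position 0, and for the last element. The names first, zeroth, and last are only illustrative.

```r
animals <- c("mouse", "rat", "dog", "cat")

# The first element lives at index 1
first <- animals[1]
first

# Index 0 does not refer to any element, so R returns an empty character vector
zeroth <- animals[0]
length(zeroth)

# length() gives the index of the last element
last <- animals[length(animals)]
last
```

Note that asking for index 0 does not raise an error, so an off-by-one mistake carried over from a 0-based language can go unnoticed.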

Conditional subsetting

Another common way of subsetting is by using a logical vector. TRUE will select the element with the same index, while FALSE will not:

weight_g <- c(21, 34, 39, 54, 55)
weight_g[c(TRUE, FALSE, FALSE, TRUE, TRUE)]
[1] 21 54 55

Typically, these logical vectors are not typed by hand, but are the output of other functions or logical tests. For instance, if you wanted to select only the values above 50:

weight_g > 50    # will return logicals with TRUE for the indices that meet the condition
[1] FALSE FALSE FALSE  TRUE  TRUE
## so we can use this to select only the values above 50
weight_g[weight_g > 50]
[1] 54 55
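The logical vector produced by a test can also be stored in its own object and reused. The sketch below assumes the same weight_g values as above; is_heavy, heavy, and n_heavy are illustrative names, not part of the lesson.

```r
weight_g <- c(21, 34, 39, 54, 55)

# Store the result of the logical test
is_heavy <- weight_g > 50

# Use the stored logical vector to subset
heavy <- weight_g[is_heavy]
heavy

# TRUE counts as 1 and FALSE as 0, so sum() counts the matching elements
n_heavy <- sum(is_heavy)
n_heavy
```

Storing the test separately can make longer conditions easier to read and lets you count matches without subsetting.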

You can combine multiple tests using & (both conditions are true, AND) or | (at least one of the conditions is true, OR):

weight_g[weight_g > 30 & weight_g < 50]
[1] 34 39
weight_g[weight_g <= 30 | weight_g == 55]
[1] 21 55
weight_g[weight_g >= 30 & weight_g == 21]
numeric(0)
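To see how & and | combine tests element by element, it can help to look at the intermediate logical vectors. This is a small sketch using the same weight_g values; the object names are illustrative.

```r
weight_g <- c(21, 34, 39, 54, 55)

above_30 <- weight_g > 30   # FALSE  TRUE  TRUE  TRUE  TRUE
below_50 <- weight_g < 50   #  TRUE  TRUE  TRUE FALSE FALSE

# & is TRUE only where both vectors are TRUE
both <- above_30 & below_50     # FALSE  TRUE  TRUE FALSE FALSE

# | is TRUE where at least one vector is TRUE
either <- above_30 | below_50   # TRUE for every element

weight_g[both]
```

When no element satisfies the combined condition, as in the last example above (no value is both at least 30 and equal to 21), every entry of the logical vector is FALSE and the result is an empty vector, which R prints as numeric(0).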

Here, > stands for “greater than”, < for “less than”, <= for “less than or equal to”, >= for “greater than or equal to”, and == for “equal to”. The double equal sign == is a test for equality between the left and right hand sides, and should not be confused with the single = sign, which performs variable assignment (similar to <-).
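The difference between assignment and comparison can be illustrated with a short sketch. The objects x and y are hypothetical and only serve this example.

```r
x <- 5    # assignment with <-
y = 5     # assignment with = (works at the top level, but <- is the usual style)

# == compares values and returns a logical
same <- x == y
same

different <- x == 6
different

# == also works element by element on vectors, including character vectors
is_cat <- c("rat", "cat") == "cat"
is_cat
```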

A common task is to search for certain strings in a vector. One could use the “or” operator | to test for equality to multiple values, but this can quickly become tedious. The operator %in% tests, for each element of a vector, whether it is found anywhere in a search vector:

animals <- c("mouse", "rat", "dog", "cat", "cat")

# return both rat and cat
animals[animals == "cat" | animals == "rat"] 
[1] "rat" "cat" "cat"
# return a logical vector that is TRUE for the elements within animals
# that are found in the character vector and FALSE for those that are not
animals %in% c("rat", "cat", "dog", "duck", "goat") 
[1] FALSE  TRUE  TRUE  TRUE  TRUE
# use the logical vector created by %in% to return elements from animals 
# that are found in the character vector
animals[animals %in% c("rat", "cat", "dog", "duck", "goat")]
[1] "rat" "dog" "cat" "cat"
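Because %in% returns an ordinary logical vector with one entry per element of the left-hand vector, it can be negated with ! to keep the elements that are not in the search vector. This is a sketch only; wanted, found, and not_found are illustrative names.

```r
animals <- c("mouse", "rat", "dog", "cat", "cat")
wanted <- c("rat", "cat", "dog", "duck", "goat")

# One logical value per element of animals, regardless of the length of wanted
found <- animals %in% wanted
length(found)

# ! flips TRUE and FALSE, so this keeps the elements NOT found in wanted
not_found <- animals[!(animals %in% wanted)]
not_found
```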

Exercise

Can you figure out why "four" > "five" returns TRUE?

Solution

When using ">" or "<" on strings, R compares their alphabetical order, character by character. Both strings start with "f", but "o" comes after "i" in the alphabet, so "four" comes after "five" and is therefore greater than it.
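One way to check this reasoning is to sort the two strings, since sort() uses the same ordering as the comparison operators. A minimal sketch, with illustrative object names:

```r
# The comparison from the exercise
is_greater <- "four" > "five"
is_greater

# Sorting puts the "smaller" string first
sorted <- sort(c("four", "five"))
sorted
```

The exact ordering of strings can depend on the locale R is running in, although for plain lowercase words like these the alphabetical order is the same in common locales.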

Key Points