Subsetting vectors
Overview
Teaching: 15 min
Exercises: 15 min
Questions
How can I select specific elements of a data vector?
Objectives
Inspect the content of vectors and manipulate their content.
Subset and extract values from vectors.
Subsetting vectors
If we want to extract one or several values from a vector, we must provide one or several indices in square brackets. For instance:
animals <- c("mouse", "rat", "dog", "cat")
animals[2]
[1] "rat"
animals[c(3, 2)]
[1] "dog" "rat"
We can also repeat the indices to create an object with more elements than the original one:
more_animals <- animals[c(1, 2, 3, 2, 1, 4)]
more_animals
[1] "mouse" "rat" "dog" "rat" "mouse" "cat"
R indices start at 1. Programming languages like Fortran, MATLAB, Julia, and R start counting at 1, because that’s what human beings typically do. Languages in the C family (including C++, Java, Perl, and Python) count from 0 because that’s simpler for computers to do.
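As a small sketch of what 1-based indexing means in practice, reusing the animals vector defined above (the values in the comments follow from that definition):

```r
animals <- c("mouse", "rat", "dog", "cat")

# The first element sits at index 1, not 0
animals[1]
# [1] "mouse"

# length() gives the number of elements, which is also the index of the last one
animals[length(animals)]
# [1] "cat"

# Index 0 does not refer to any element, so the result is an empty vector
animals[0]
# character(0)
```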
Conditional subsetting
Another common way of subsetting is by using a logical vector. TRUE will select the element with the same index, while FALSE will not:
weight_g <- c(21, 34, 39, 54, 55)
weight_g[c(TRUE, FALSE, FALSE, TRUE, TRUE)]
[1] 21 54 55
Typically, these logical vectors are not typed by hand, but are the output of other functions or logical tests. For instance, if you wanted to select only the values above 50:
weight_g > 50 # will return logicals with TRUE for the indices that meet the condition
[1] FALSE FALSE FALSE TRUE TRUE
## so we can use this to select only the values above 50
weight_g[weight_g > 50]
[1] 54 55
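The logical test can also be stored in its own object before it is used for subsetting, which may make longer conditions easier to read. A minimal sketch, where is_heavy is a hypothetical name chosen for illustration:

```r
weight_g <- c(21, 34, 39, 54, 55)

# Store the result of the logical test (is_heavy is a hypothetical name)
is_heavy <- weight_g > 50
is_heavy
# [1] FALSE FALSE FALSE  TRUE  TRUE

# Use the stored logical vector to subset
weight_g[is_heavy]
# [1] 54 55

# TRUE counts as 1 when summed, so this gives the number of matching values
sum(is_heavy)
# [1] 2
```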
You can combine multiple tests using & (both conditions are true, AND) or | (at least one of the conditions is true, OR):
weight_g[weight_g > 30 & weight_g < 50]
[1] 34 39
weight_g[weight_g <= 30 | weight_g == 55]
[1] 21 55
weight_g[weight_g >= 30 & weight_g == 21]
numeric(0)
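To see why the last example returns numeric(0), it can help to look at the two tests separately. The sketch below only reuses weight_g from above:

```r
weight_g <- c(21, 34, 39, 54, 55)

# Each test on its own
weight_g >= 30
# [1] FALSE  TRUE  TRUE  TRUE  TRUE
weight_g == 21
# [1]  TRUE FALSE FALSE FALSE FALSE

# No position is TRUE in both vectors, so & gives FALSE everywhere
weight_g >= 30 & weight_g == 21
# [1] FALSE FALSE FALSE FALSE FALSE

# Subsetting with an all-FALSE vector selects nothing
length(weight_g[weight_g >= 30 & weight_g == 21])
# [1] 0
```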
Here, > stands for “greater than”, < for “less than”, <= for “less than or equal to”, >= for “greater than or equal to”, and == for “equal to”. The double equal sign == tests for equality between the left and right hand sides, and should not be confused with the single = sign, which performs variable assignment (similar to <-).
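A short sketch of the difference between comparing and assigning. The names x and y are hypothetical and used only for illustration:

```r
x <- 5      # assignment with <-

x == 5      # comparison: is x equal to 5?
# [1] TRUE
x == 6
# [1] FALSE

y = 10      # assignment with =; y now holds 10
y
# [1] 10
```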
A common task is to search for certain strings in a vector. One could use the “or” operator | to test for equality to multiple values, but this can quickly become tedious. The %in% operator allows you to test, for each element of a vector, whether it is found in a search vector:
animals <- c("mouse", "rat", "dog", "cat", "cat")
# return both rat and cat
animals[animals == "cat" | animals == "rat"]
[1] "rat" "cat" "cat"
# return a logical vector that is TRUE for the elements within animals
# that are found in the character vector and FALSE for those that are not
animals %in% c("rat", "cat", "dog", "duck", "goat")
[1] FALSE TRUE TRUE TRUE TRUE
# use the logical vector created by %in% to return elements from animals
# that are found in the character vector
animals[animals %in% c("rat", "cat", "dog", "duck", "goat")]
[1] "rat" "dog" "cat" "cat"
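As a quick check that the two approaches agree, the sketch below builds the same selection once with | and once with %in%. The names with_or and with_in are hypothetical:

```r
animals <- c("mouse", "rat", "dog", "cat", "cat")

# Same condition written two ways (with_or and with_in are hypothetical names)
with_or <- animals == "cat" | animals == "rat"
with_in <- animals %in% c("rat", "cat")

# Both give the same logical vector
identical(with_or, with_in)
# [1] TRUE

animals[with_in]
# [1] "rat" "cat" "cat"
```

With only two values the difference is small, but with %in% adding another value to search for means adding one element to the search vector instead of another == test.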
Exercise
Can you figure out why "four" > "five" returns TRUE?
Solution
When using ">" or "<" on strings, R compares their alphabetical order. Here "four" comes after "five", and therefore is greater than it.
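The comparison can be explored step by step, as in the sketch below. String ordering in R depends on the locale, but for these lowercase words the result should be the same in common locales:

```r
"four" > "five"
# [1] TRUE

# Both words start with "f", so the second letter decides the comparison
"o" > "i"
# [1] TRUE

# sort() uses the same ordering, so "five" comes first
sort(c("four", "five"))
# [1] "five" "four"
```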
Key Points