Using the dplyr package
Overview
Teaching: 15 min
Exercises: 20 minQuestions
How do I select, filter, and subset data in R?
Objectives
Select specific observations from data.
Filter out those observations that meet specific criteria.
Subset data.
Save transformed data.
library(tidyverse)
Reading text files (CSV)
Now that we know how to find our files, let’s load up our first data set.
When trying to type in a file path,
you can hit the <TAB>
key to autocomplete the files.
This will help you with a lot of potential spelling mistakes.
read_csv("../data/medicaldata_tumorgrowth.csv")
# A tibble: 574 x 5
Grp Group ID Day Size
<chr> <dbl> <dbl> <dbl> <dbl>
1 1.CTR 1 101 0 41.8
2 1.CTR 1 101 3 85
3 1.CTR 1 101 4 114
4 1.CTR 1 101 5 162.
5 1.CTR 1 101 6 178.
6 1.CTR 1 101 7 325
7 1.CTR 1 101 10 624.
8 1.CTR 1 101 11 648.
9 1.CTR 1 101 12 836.
10 1.CTR 1 101 13 1030.
# … with 564 more rows
Debug help:
- If the above code returns a
could not find function "read_csv"
make sure you have loaded up the proper library withlibrary(tidyverse)
- If the above code returns a
does not exist in current working directory
, make sure the working directory it lists is your expected “starting point” (i.e., working directory), and make sure the file path is spelled correctly.
read_csv
will show us the columns that were read in,
as well as the data type of that column (e.g., character, double – a number).
Loading a data set is great, but we need a convenient way to refer to the data set.
We don’t want to re-load the data set every time we want to perform an action on it.
We can take this loaded data set and assign it to a variable.
We can do this with the assignment operator, <-
.
Note the way it is typed, a less than symbol (<
) followed immediately by the dash (-
) without any spaces in between.
The right side of the assignment operator, <-
, will be executed and then assigned to the variable on the left.
tumor <- read_csv("../data/medicaldata_tumorgrowth.csv")
Notice this time we no longer see the dataset being printed.
The “Environment” tab in the RStudio panel will now have an entry for the variable you used.
Clicking on the right data set icon will open a view of your dataset,
clicking on the arrow will show you the column-by-column text representation (technically it’s called the structure
).
To look at our dataset we can execute just the variable we assigned the dataset to.
tumor
# A tibble: 574 x 5
Grp Group ID Day Size
<chr> <dbl> <dbl> <dbl> <dbl>
1 1.CTR 1 101 0 41.8
2 1.CTR 1 101 3 85
3 1.CTR 1 101 4 114
4 1.CTR 1 101 5 162.
5 1.CTR 1 101 6 178.
6 1.CTR 1 101 7 325
7 1.CTR 1 101 10 624.
8 1.CTR 1 101 11 648.
9 1.CTR 1 101 12 836.
10 1.CTR 1 101 13 1030.
# … with 564 more rows
This tabular dataset that has now been loaded into R is called a data frame object (or simply dataframe),
the tidyverse
uses a tibble
.
For the most part, a data.frame
object will behave like a tibble
object.
What is dplyr?
The package dplyr provides easy tools for the most common data manipulation tasks. It is built to work directly with data frames, with many common tasks optimized by being written in a compiled language (C++). An additional feature is the ability to work directly with data stored in an external database. The benefits of doing this are that the data can be managed natively in a relational database, queries can be conducted on that database, and only the results of the query are returned.
This addresses a common problem with R in that all operations are conducted in-memory and thus the amount of data you can work with is limited by available memory. The database connections essentially remove that limitation in that you can connect to a database of many hundreds of GB, conduct queries on it directly, and pull back into R only what you need for analysis.
To learn more about dplyr after the workshop, you may want to check out this handy data transformation with dplyr cheatsheet.
Selecting columns
Here is the tumor dataset we are working with.
tumor
# A tibble: 574 x 5
Grp Group ID Day Size
<chr> <dbl> <dbl> <dbl> <dbl>
1 1.CTR 1 101 0 41.8
2 1.CTR 1 101 3 85
3 1.CTR 1 101 4 114
4 1.CTR 1 101 5 162.
5 1.CTR 1 101 6 178.
6 1.CTR 1 101 7 325
7 1.CTR 1 101 10 624.
8 1.CTR 1 101 11 648.
9 1.CTR 1 101 12 836.
10 1.CTR 1 101 13 1030.
# … with 564 more rows
In order to subset the columns of our dataset, we can use the select
function.
The first argument of the select
function is the dataset we want to select columns from.
Then, the other arguments are the columns we want to select.
select(tumor, Size)
# A tibble: 574 x 1
Size
<dbl>
1 41.8
2 85
3 114
4 162.
5 178.
6 325
7 624.
8 648.
9 836.
10 1030.
# … with 564 more rows
If we want multiple columns (or to re-order columns) we can specify each column we want selected.
select(tumor, Group, Day, Size)
# A tibble: 574 x 3
Group Day Size
<dbl> <dbl> <dbl>
1 1 0 41.8
2 1 3 85
3 1 4 114
4 1 5 162.
5 1 6 178.
6 1 7 325
7 1 10 624.
8 1 11 648.
9 1 12 836.
10 1 13 1030.
# … with 564 more rows
The %>%
pipe operator
A common way tidyverse
code gets written is using the %>%
operator.
The %>%
takes the object on the left, and passes that into the first argument to the function on the right.
We can re-write the above select statements as follows.
R interprets the code in the exact same way.
tumor %>%
select(Size)
# A tibble: 574 x 1
Size
<dbl>
1 41.8
2 85
3 114
4 162.
5 178.
6 325
7 624.
8 648.
9 836.
10 1030.
# … with 564 more rows
The way we pronounce the pipe, %>%
is “then”.
So the above statement is read “tumor, then, select size”.
tumor %>%
select(Group, Day, Size)
# A tibble: 574 x 3
Group Day Size
<dbl> <dbl> <dbl>
1 1 0 41.8
2 1 3 85
3 1 4 114
4 1 5 162.
5 1 6 178.
6 1 7 325
7 1 10 624.
8 1 11 648.
9 1 12 836.
10 1 13 1030.
# … with 564 more rows
The benefits of the pipe may not be as apparent now, but when we start chaining multiple data sets together, it’ll make the code a little more readable.
Filtering rows
Now that we know how to select columns on our dataframe, we can now learn how to filter the rows.
To filter rows, we can use the filter
function.
The first argument of filter
is the dataframe object (the same pattern as the select
function),
and the other arguments are statements that return some TRUE
or FALSE
value (this is known as a boolean condition).
For example if we want to filter the tumor
dataset such that the Group
value is 1
we can filter the rows using that condition.
filter(tumor, Group == 1)
# A tibble: 97 x 5
Grp Group ID Day Size
<chr> <dbl> <dbl> <dbl> <dbl>
1 1.CTR 1 101 0 41.8
2 1.CTR 1 101 3 85
3 1.CTR 1 101 4 114
4 1.CTR 1 101 5 162.
5 1.CTR 1 101 6 178.
6 1.CTR 1 101 7 325
7 1.CTR 1 101 10 624.
8 1.CTR 1 101 11 648.
9 1.CTR 1 101 12 836.
10 1.CTR 1 101 13 1030.
# … with 87 more rows
We can write the “pipe” version as such:
tumor %>%
filter(Group == 1)
# A tibble: 97 x 5
Grp Group ID Day Size
<chr> <dbl> <dbl> <dbl> <dbl>
1 1.CTR 1 101 0 41.8
2 1.CTR 1 101 3 85
3 1.CTR 1 101 4 114
4 1.CTR 1 101 5 162.
5 1.CTR 1 101 6 178.
6 1.CTR 1 101 7 325
7 1.CTR 1 101 10 624.
8 1.CTR 1 101 11 648.
9 1.CTR 1 101 12 836.
10 1.CTR 1 101 13 1030.
# … with 87 more rows
If we want to combine multiple conditions, where Group
is 1
and Day
is 0
,
we can put each statement as a separate parameter into the filter
function.
tumor %>%
filter(Group == 1, Day == 0)
# A tibble: 8 x 5
Grp Group ID Day Size
<chr> <dbl> <dbl> <dbl> <dbl>
1 1.CTR 1 101 0 41.8
2 1.CTR 1 102 0 79.4
3 1.CTR 1 103 0 44.8
4 1.CTR 1 104 0 67.7
5 1.CTR 1 105 0 54.7
6 1.CTR 1 106 0 60
7 1.CTR 1 107 0 46.8
8 1.CTR 1 108 0 49.4
We can also use the &
for the and operator.
tumor %>%
filter(Group == 1 & Day == 0)
# A tibble: 8 x 5
Grp Group ID Day Size
<chr> <dbl> <dbl> <dbl> <dbl>
1 1.CTR 1 101 0 41.8
2 1.CTR 1 102 0 79.4
3 1.CTR 1 103 0 44.8
4 1.CTR 1 104 0 67.7
5 1.CTR 1 105 0 54.7
6 1.CTR 1 106 0 60
7 1.CTR 1 107 0 46.8
8 1.CTR 1 108 0 49.4
Or use the |
for the or operator.
tumor %>%
filter(Day == 0 | Day == 13)
# A tibble: 60 x 5
Grp Group ID Day Size
<chr> <dbl> <dbl> <dbl> <dbl>
1 1.CTR 1 101 0 41.8
2 1.CTR 1 101 13 1030.
3 1.CTR 1 102 0 79.4
4 1.CTR 1 102 13 619.
5 1.CTR 1 103 0 44.8
6 1.CTR 1 104 0 67.7
7 1.CTR 1 105 0 54.7
8 1.CTR 1 105 13 1699.
9 1.CTR 1 106 0 60
10 1.CTR 1 107 0 46.8
# … with 50 more rows
Subsetting columns and rows
To subset on both columns and rows,
we call the corresponding subseting function one after the other.
Here we are taking the tumor
column, then select
ing the Group
, Day
and Size
columns, then
filter
ing the rows where the Size
is greater than 2000
.
tumor %>%
select(Group, Day, Size) %>%
filter(Size > 2000)
# A tibble: 14 x 3
Group Day Size
<dbl> <dbl> <dbl>
1 1 15 2406.
2 1 14 2163.
3 1 16 2125.
4 1 13 2343.
5 1 13 2296.
6 2 22 2047.
7 2 25 2269.
8 2 27 2115.
9 3 13 2059.
10 3 28 2362.
11 3 27 2157.
12 3 20 2160.
13 3 24 2189.
14 4 22 2343
This is an example where the pipe, %>%
makes the code a little easier to read.
We can “un-pipe” our code and it looks like this:
filter(select(tumor, Group, Day, Size), Size > 2000)
# A tibble: 14 x 3
Group Day Size
<dbl> <dbl> <dbl>
1 1 15 2406.
2 1 14 2163.
3 1 16 2125.
4 1 13 2343.
5 1 13 2296.
6 2 22 2047.
7 2 25 2269.
8 2 27 2115.
9 3 13 2059.
10 3 28 2362.
11 3 27 2157.
12 3 20 2160.
13 3 24 2189.
14 4 22 2343
Saving out data
Saving out your data frame objects is the first step in creating pipelines. Even if all you do is read in data, filter it, and write out the subsetted dataset for another use case, you have created your first pipeline. Now, you have a repeatable way of performing the same action when your input dataset is updated or changed.
Let’s say we want to work with a filtered version of our tumor
dataset,
we can first write the code to subset the data we want.
filtered <- tumor %>%
select(Group, ID, Day, Size) %>%
filter(Day == 0 | Day == 13)
filtered
# A tibble: 60 x 4
Group ID Day Size
<dbl> <dbl> <dbl> <dbl>
1 1 101 0 41.8
2 1 101 13 1030.
3 1 102 0 79.4
4 1 102 13 619.
5 1 103 0 44.8
6 1 104 0 67.7
7 1 105 0 54.7
8 1 105 13 1699.
9 1 106 0 60
10 1 107 0 46.8
# … with 50 more rows
We can then write out the datafrme to a file (in this case a CSV text file), using the same file pathing string format as we loaded a dataset earlier.
write_csv(filtered, "../data/tumor_filtered.csv")
One thing to be mindful of is most programs and computer languages don’t like having spaces in file names.
In general, you’ll end up typing more characters dealing with spaces in file names,
instead of using an underscore, _
, or dash, -
.
Also as a general rule, keeping files in lower-case will prevent a lot of potential spelling mistakes.
Exercise
- Read in all the patient data from the spreadsheets chapter found in
data > spreadsheets > all.csv
select
thefirst_name
,last_name
,day
,time
, andtemp_f
columns.filter
the rows such that the person’s name is “Aniya Warner”- Save the filtered dataset into the
output
folder namedaniya_warner.csv
.Solution
all_patient_data <- read_csv("../data/spreadsheets/all.csv")
aniya_warner <- all_patient_data %>%
select(first_name, last_name, day, time, temp_f) %>%
filter(first_name == "Aniya", last_name == "Warner")
write_csv(aniya_warner, file = "../output/aniya_warner.csv")
Key Points