This lesson is in the early stages of development (Alpha version)

Using the dplyr package

Overview

Teaching: 15 min
Exercises: 20 min
Questions
  • How do I select, filter, and subset data in R?

Objectives
  • Select specific observations from data.

  • Filter out those observations that meet specific criteria.

  • Subset data.

  • Save transformed data.

library(tidyverse)

Reading text files (CSV)

Now that we know how to find our files, let’s load up our first data set. When trying to type in a file path, you can hit the <TAB> key to autocomplete the files. This will help you with a lot of potential spelling mistakes.

read_csv("../data/medicaldata_tumorgrowth.csv")
# A tibble: 574 x 5
   Grp   Group    ID   Day   Size
   <chr> <dbl> <dbl> <dbl>  <dbl>
 1 1.CTR     1   101     0   41.8
 2 1.CTR     1   101     3   85  
 3 1.CTR     1   101     4  114  
 4 1.CTR     1   101     5  162. 
 5 1.CTR     1   101     6  178. 
 6 1.CTR     1   101     7  325  
 7 1.CTR     1   101    10  624. 
 8 1.CTR     1   101    11  648. 
 9 1.CTR     1   101    12  836. 
10 1.CTR     1   101    13 1030. 
# … with 564 more rows

Debug help:

read_csv will show us the columns that were read in, as well as the data type of that column (e.g., character, double – a number).

Loading a data set is great, but we need a convenient way to refer to the data set. We don’t want to re-load the data set every time we want to perform an action on it. We can take this loaded data set and assign it to a variable. We can do this with the assignment operator, <-. Note the way it is typed, a less than symbol (<) followed immediately by the dash (-) without any spaces in between. The right side of the assignment operator, <-, will be executed and then assigned to the variable on the left.

tumor <- read_csv("../data/medicaldata_tumorgrowth.csv")

Notice this time we no longer see the dataset being printed. The “Environment” tab in the RStudio panel will now have an entry for the variable you used. Clicking on the right data set icon will open a view of your dataset, clicking on the arrow will show you the column-by-column text representation (technically it’s called the structure).

To look at our dataset we can execute just the variable we assigned the dataset to.

tumor
# A tibble: 574 x 5
   Grp   Group    ID   Day   Size
   <chr> <dbl> <dbl> <dbl>  <dbl>
 1 1.CTR     1   101     0   41.8
 2 1.CTR     1   101     3   85  
 3 1.CTR     1   101     4  114  
 4 1.CTR     1   101     5  162. 
 5 1.CTR     1   101     6  178. 
 6 1.CTR     1   101     7  325  
 7 1.CTR     1   101    10  624. 
 8 1.CTR     1   101    11  648. 
 9 1.CTR     1   101    12  836. 
10 1.CTR     1   101    13 1030. 
# … with 564 more rows

This tabular dataset that has now been loaded into R is called a data frame object (or simply dataframe), the tidyverse uses a tibble. For the most part, a data.frame object will behave like a tibble object.

What is dplyr?

The package dplyr provides easy tools for the most common data manipulation tasks. It is built to work directly with data frames, with many common tasks optimized by being written in a compiled language (C++). An additional feature is the ability to work directly with data stored in an external database. The benefits of doing this are that the data can be managed natively in a relational database, queries can be conducted on that database, and only the results of the query are returned.

This addresses a common problem with R in that all operations are conducted in-memory and thus the amount of data you can work with is limited by available memory. The database connections essentially remove that limitation in that you can connect to a database of many hundreds of GB, conduct queries on it directly, and pull back into R only what you need for analysis.

To learn more about dplyr after the workshop, you may want to check out this handy data transformation with dplyr cheatsheet.

Selecting columns

Here is the tumor dataset we are working with.

tumor
# A tibble: 574 x 5
   Grp   Group    ID   Day   Size
   <chr> <dbl> <dbl> <dbl>  <dbl>
 1 1.CTR     1   101     0   41.8
 2 1.CTR     1   101     3   85  
 3 1.CTR     1   101     4  114  
 4 1.CTR     1   101     5  162. 
 5 1.CTR     1   101     6  178. 
 6 1.CTR     1   101     7  325  
 7 1.CTR     1   101    10  624. 
 8 1.CTR     1   101    11  648. 
 9 1.CTR     1   101    12  836. 
10 1.CTR     1   101    13 1030. 
# … with 564 more rows

In order to subset the columns of our dataset, we can use the select function. The first argument of the select function is the dataset we want to select columns from. Then, the other arguments are the columns we want to select.

select(tumor, Size)
# A tibble: 574 x 1
     Size
    <dbl>
 1   41.8
 2   85  
 3  114  
 4  162. 
 5  178. 
 6  325  
 7  624. 
 8  648. 
 9  836. 
10 1030. 
# … with 564 more rows

If we want multiple columns (or to re-order columns) we can specify each column we want selected.

select(tumor, Group, Day, Size)
# A tibble: 574 x 3
   Group   Day   Size
   <dbl> <dbl>  <dbl>
 1     1     0   41.8
 2     1     3   85  
 3     1     4  114  
 4     1     5  162. 
 5     1     6  178. 
 6     1     7  325  
 7     1    10  624. 
 8     1    11  648. 
 9     1    12  836. 
10     1    13 1030. 
# … with 564 more rows

The %>% pipe operator

A common way tidyverse code gets written is using the %>% operator. The %>% takes the object on the left, and passes that into the first argument to the function on the right. We can re-write the above select statements as follows. R interprets the code in the exact same way.

tumor %>%
  select(Size)
# A tibble: 574 x 1
     Size
    <dbl>
 1   41.8
 2   85  
 3  114  
 4  162. 
 5  178. 
 6  325  
 7  624. 
 8  648. 
 9  836. 
10 1030. 
# … with 564 more rows

The way we pronounce the pipe, %>% is “then”. So the above statement is read “tumor, then, select size”.

tumor %>%
  select(Group, Day, Size)
# A tibble: 574 x 3
   Group   Day   Size
   <dbl> <dbl>  <dbl>
 1     1     0   41.8
 2     1     3   85  
 3     1     4  114  
 4     1     5  162. 
 5     1     6  178. 
 6     1     7  325  
 7     1    10  624. 
 8     1    11  648. 
 9     1    12  836. 
10     1    13 1030. 
# … with 564 more rows

The benefits of the pipe may not be as apparent now, but when we start chaining multiple data sets together, it’ll make the code a little more readable.

Filtering rows

Now that we know how to select columns on our dataframe, we can now learn how to filter the rows.

To filter rows, we can use the filter function. The first argument of filter is the dataframe object (the same pattern as the select function), and the other arguments are statements that return some TRUE or FALSE value (this is known as a boolean condition).

For example if we want to filter the tumor dataset such that the Group value is 1 we can filter the rows using that condition.

filter(tumor, Group == 1)
# A tibble: 97 x 5
   Grp   Group    ID   Day   Size
   <chr> <dbl> <dbl> <dbl>  <dbl>
 1 1.CTR     1   101     0   41.8
 2 1.CTR     1   101     3   85  
 3 1.CTR     1   101     4  114  
 4 1.CTR     1   101     5  162. 
 5 1.CTR     1   101     6  178. 
 6 1.CTR     1   101     7  325  
 7 1.CTR     1   101    10  624. 
 8 1.CTR     1   101    11  648. 
 9 1.CTR     1   101    12  836. 
10 1.CTR     1   101    13 1030. 
# … with 87 more rows

We can write the “pipe” version as such:

tumor %>%
  filter(Group == 1)
# A tibble: 97 x 5
   Grp   Group    ID   Day   Size
   <chr> <dbl> <dbl> <dbl>  <dbl>
 1 1.CTR     1   101     0   41.8
 2 1.CTR     1   101     3   85  
 3 1.CTR     1   101     4  114  
 4 1.CTR     1   101     5  162. 
 5 1.CTR     1   101     6  178. 
 6 1.CTR     1   101     7  325  
 7 1.CTR     1   101    10  624. 
 8 1.CTR     1   101    11  648. 
 9 1.CTR     1   101    12  836. 
10 1.CTR     1   101    13 1030. 
# … with 87 more rows

If we want to combine multiple conditions, where Group is 1 and Day is 0, we can put each statement as a separate parameter into the filter function.

tumor %>%
  filter(Group == 1, Day == 0)
# A tibble: 8 x 5
  Grp   Group    ID   Day  Size
  <chr> <dbl> <dbl> <dbl> <dbl>
1 1.CTR     1   101     0  41.8
2 1.CTR     1   102     0  79.4
3 1.CTR     1   103     0  44.8
4 1.CTR     1   104     0  67.7
5 1.CTR     1   105     0  54.7
6 1.CTR     1   106     0  60  
7 1.CTR     1   107     0  46.8
8 1.CTR     1   108     0  49.4

We can also use the & for the and operator.

tumor %>%
  filter(Group == 1 & Day == 0)
# A tibble: 8 x 5
  Grp   Group    ID   Day  Size
  <chr> <dbl> <dbl> <dbl> <dbl>
1 1.CTR     1   101     0  41.8
2 1.CTR     1   102     0  79.4
3 1.CTR     1   103     0  44.8
4 1.CTR     1   104     0  67.7
5 1.CTR     1   105     0  54.7
6 1.CTR     1   106     0  60  
7 1.CTR     1   107     0  46.8
8 1.CTR     1   108     0  49.4

Or use the | for the or operator.

tumor %>%
  filter(Day == 0 | Day == 13)
# A tibble: 60 x 5
   Grp   Group    ID   Day   Size
   <chr> <dbl> <dbl> <dbl>  <dbl>
 1 1.CTR     1   101     0   41.8
 2 1.CTR     1   101    13 1030. 
 3 1.CTR     1   102     0   79.4
 4 1.CTR     1   102    13  619. 
 5 1.CTR     1   103     0   44.8
 6 1.CTR     1   104     0   67.7
 7 1.CTR     1   105     0   54.7
 8 1.CTR     1   105    13 1699. 
 9 1.CTR     1   106     0   60  
10 1.CTR     1   107     0   46.8
# … with 50 more rows

Subsetting columns and rows

To subset on both columns and rows, we call the corresponding subseting function one after the other. Here we are taking the tumor column, then selecting the Group, Day and Size columns, then filtering the rows where the Size is greater than 2000.

tumor %>%
  select(Group, Day, Size) %>%
  filter(Size > 2000)
# A tibble: 14 x 3
   Group   Day  Size
   <dbl> <dbl> <dbl>
 1     1    15 2406.
 2     1    14 2163.
 3     1    16 2125.
 4     1    13 2343.
 5     1    13 2296.
 6     2    22 2047.
 7     2    25 2269.
 8     2    27 2115.
 9     3    13 2059.
10     3    28 2362.
11     3    27 2157.
12     3    20 2160.
13     3    24 2189.
14     4    22 2343 

This is an example where the pipe, %>% makes the code a little easier to read. We can “un-pipe” our code and it looks like this:

filter(select(tumor, Group, Day, Size), Size > 2000)
# A tibble: 14 x 3
   Group   Day  Size
   <dbl> <dbl> <dbl>
 1     1    15 2406.
 2     1    14 2163.
 3     1    16 2125.
 4     1    13 2343.
 5     1    13 2296.
 6     2    22 2047.
 7     2    25 2269.
 8     2    27 2115.
 9     3    13 2059.
10     3    28 2362.
11     3    27 2157.
12     3    20 2160.
13     3    24 2189.
14     4    22 2343 

Saving out data

Saving out your data frame objects is the first step in creating pipelines. Even if all you do is read in data, filter it, and write out the subsetted dataset for another use case, you have created your first pipeline. Now, you have a repeatable way of performing the same action when your input dataset is updated or changed.

Let’s say we want to work with a filtered version of our tumor dataset, we can first write the code to subset the data we want.

filtered <- tumor %>%
  select(Group, ID, Day, Size) %>%
  filter(Day == 0 | Day == 13)
filtered
# A tibble: 60 x 4
   Group    ID   Day   Size
   <dbl> <dbl> <dbl>  <dbl>
 1     1   101     0   41.8
 2     1   101    13 1030. 
 3     1   102     0   79.4
 4     1   102    13  619. 
 5     1   103     0   44.8
 6     1   104     0   67.7
 7     1   105     0   54.7
 8     1   105    13 1699. 
 9     1   106     0   60  
10     1   107     0   46.8
# … with 50 more rows

We can then write out the datafrme to a file (in this case a CSV text file), using the same file pathing string format as we loaded a dataset earlier.

write_csv(filtered, "../data/tumor_filtered.csv")

One thing to be mindful of is most programs and computer languages don’t like having spaces in file names. In general, you’ll end up typing more characters dealing with spaces in file names, instead of using an underscore, _, or dash, -. Also as a general rule, keeping files in lower-case will prevent a lot of potential spelling mistakes.

Exercise

  1. Read in all the patient data from the spreadsheets chapter found in data > spreadsheets > all.csv
  2. select the first_name, last_name, day, time, and temp_f columns.
  3. filter the rows such that the person’s name is “Aniya Warner”
  4. Save the filtered dataset into the output folder named aniya_warner.csv.

Solution

  1. all_patient_data <- read_csv("../data/spreadsheets/all.csv")
  2. aniya_warner <- all_patient_data %>%
    select(first_name, last_name, day, time, temp_f) %>%
    filter(first_name == "Aniya", last_name == "Warner")
    write_csv(aniya_warner, file = "../output/aniya_warner.csv")

Key Points