Data Transformation
Overview
Teaching: 20 min
Exercises: 20 minQuestions
How do I transform data?
Objectives
Add new columns to a data frame that are functions of existing columns with
mutate
.Use the split-apply-combine concept for data analysis.
Use
summarize
,group_by
, andcount
to split a data frame into groups of observations, apply summary statistics for each group, and then combine the results.
library(tidyverse)
tumor <- read_csv("../data/medicaldata_tumorgrowth.csv")
Mutate
Frequently you’ll want to create new columns based on the values in existing
columns, for example to do unit conversions, or to find the ratio of values in two
columns. For this we’ll use mutate()
.
To create a new column of tumor size (volume) in cubic centimeters:
tumor %>%
mutate(size_cubic_cm = Size / 1000)
# A tibble: 574 × 6
Grp Group ID Day Size size_cubic_cm
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1.CTR 1 101 0 41.8 0.0418
2 1.CTR 1 101 3 85 0.085
3 1.CTR 1 101 4 114 0.114
4 1.CTR 1 101 5 162. 0.162
5 1.CTR 1 101 6 178. 0.178
6 1.CTR 1 101 7 325 0.325
7 1.CTR 1 101 10 624. 0.624
8 1.CTR 1 101 11 648. 0.648
9 1.CTR 1 101 12 836. 0.836
10 1.CTR 1 101 13 1030. 1.03
# … with 564 more rows
You can also create a second new column based on the first new column within the same call of mutate()
:
tumor %>%
mutate(size_cubic_cm = Size / 1000,
size_cubic_in = size_cubic_cm / 16)
# A tibble: 574 × 7
Grp Group ID Day Size size_cubic_cm size_cubic_in
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1.CTR 1 101 0 41.8 0.0418 0.00261
2 1.CTR 1 101 3 85 0.085 0.00531
3 1.CTR 1 101 4 114 0.114 0.00712
4 1.CTR 1 101 5 162. 0.162 0.0101
5 1.CTR 1 101 6 178. 0.178 0.0111
6 1.CTR 1 101 7 325 0.325 0.0203
7 1.CTR 1 101 10 624. 0.624 0.0390
8 1.CTR 1 101 11 648. 0.648 0.0405
9 1.CTR 1 101 12 836. 0.836 0.0522
10 1.CTR 1 101 13 1030. 1.03 0.0644
# … with 564 more rows
If this runs off your screen and you just want to see the first few rows, you
can use a pipe to view the head()
of the data. (Pipes work with non-dplyr
functions, too, as long as the dplyr
or magrittr
package is loaded).
tumor %>%
mutate(size_cubic_cm = Size / 1000) %>%
head()
# A tibble: 6 × 6
Grp Group ID Day Size size_cubic_cm
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1.CTR 1 101 0 41.8 0.0418
2 1.CTR 1 101 3 85 0.085
3 1.CTR 1 101 4 114 0.114
4 1.CTR 1 101 5 162. 0.162
5 1.CTR 1 101 6 178. 0.178
6 1.CTR 1 101 7 325 0.325
Exercise 1
Create a new data frame from the
tumor
data that meets the following criteria: contains only theGrp
column and a new column calledsize_cubic_in
containing theSize
values converted to cubic inches. In thissize_cubic_in
column, all values are greater than 0.1.Hint: think about how the commands should be ordered to produce this data frame!
Solution
tumor_cubic_in <- tumor %>% mutate(size_cubic_cm = Size / 1000, size_cubic_in = size_cubic_cm / 16) %>% filter(size_cubic_in > 0.1) %>% select(Grp, size_cubic_in)
Split-apply-combine data analysis and the summarize()
function
Many data analysis tasks can be approached using the split-apply-combine
paradigm: split the data into groups, apply some analysis to each group, and
then combine the results. dplyr
makes this very easy through the use of the
group_by()
function.
The summarize()
function
group_by()
is often used together with summarize()
, which collapses each
group into a single-row summary of that group. group_by()
takes as arguments
the column names that contain the categorical variables for which you want
to calculate the summary statistics. So to compute the mean Size
by group:
tumor %>%
group_by(Grp) %>%
summarize(mean_size = mean(Size))
# A tibble: 4 × 2
Grp mean_size
<chr> <dbl>
1 1.CTR 688.
2 2.D 577.
3 3.R 683.
4 4.D+R 506.
You can also group by multiple columns:
tumor %>%
group_by(Grp, Day) %>%
summarize(mean_size = mean(Size)) %>%
tail()
`summarise()` has grouped output by 'Grp'. You can override using the `.groups` argument.
# A tibble: 6 × 3
# Groups: Grp [1]
Grp Day mean_size
<chr> <dbl> <dbl>
1 4.D+R 23 878.
2 4.D+R 24 1085.
3 4.D+R 25 823.
4 4.D+R 26 903.
5 4.D+R 27 830.
6 4.D+R 28 821.
Here, we used tail()
to look at the last six rows of our summary. Before, we had
used head()
to look at the first six rows.
If you want to display more data, you can use the print()
function
at the end of your chain with the argument n
specifying the number of rows to
display:
tumor %>%
group_by(Grp, Day) %>%
summarize(mean_size = mean(Size)) %>%
print(n = 15)
`summarise()` has grouped output by 'Grp'. You can override using the `.groups` argument.
# A tibble: 109 × 3
# Groups: Grp [4]
Grp Day mean_size
<chr> <dbl> <dbl>
1 1.CTR 0 55.6
2 1.CTR 1 86.1
3 1.CTR 2 77.5
4 1.CTR 3 87.9
5 1.CTR 4 130.
6 1.CTR 5 216.
7 1.CTR 6 277.
8 1.CTR 7 319.
9 1.CTR 8 395.
10 1.CTR 9 525.
11 1.CTR 10 618.
12 1.CTR 11 887.
13 1.CTR 12 1305.
14 1.CTR 13 1597.
15 1.CTR 14 1331.
# … with 94 more rows
Once the data are grouped, you can also summarize multiple variables at the same time (and not necessarily on the same variable). For instance, we could add a column indicating the minimum size for each day for each group:
tumor %>%
group_by(Grp, Day) %>%
summarize(mean_size = mean(Size),
min_size = min(Size))
`summarise()` has grouped output by 'Grp'. You can override using the `.groups` argument.
# A tibble: 109 × 4
# Groups: Grp [4]
Grp Day mean_size min_size
<chr> <dbl> <dbl> <dbl>
1 1.CTR 0 55.6 41.8
2 1.CTR 1 86.1 67.5
3 1.CTR 2 77.5 55.8
4 1.CTR 3 87.9 61.1
5 1.CTR 4 130. 75.3
6 1.CTR 5 216. 112.
7 1.CTR 6 277. 118.
8 1.CTR 7 319. 165.
9 1.CTR 8 395. 269.
10 1.CTR 9 525. 478.
# … with 99 more rows
To sort in descending order, we need to add the desc()
function. If we want to sort the results by decreasing order of mean weight:
tumor %>%
group_by(Grp, Day) %>%
summarize(mean_size = mean(Size),
min_size = min(Size)) %>%
arrange(desc(mean_size))
`summarise()` has grouped output by 'Grp'. You can override using the `.groups` argument.
# A tibble: 109 × 4
# Groups: Grp [4]
Grp Day mean_size min_size
<chr> <dbl> <dbl> <dbl>
1 1.CTR 16 2045. 1964.
2 2.D 25 1836. 1525.
3 2.D 26 1738. 1738.
4 3.R 28 1714. 1065
5 2.D 24 1668. 1624.
6 2.D 22 1650. 1109.
7 1.CTR 13 1597. 619.
8 1.CTR 21 1564. 1564.
9 3.R 24 1546. 902.
10 1.CTR 15 1535. 810.
# … with 99 more rows
Counting
When working with data, we often want to know the number of observations found
for each factor or combination of factors. For this task, dplyr
provides
count()
. For example, if we wanted to count the number of rows of data for
each group, we would do:
tumor %>%
count(Grp)
# A tibble: 4 × 2
Grp n
<chr> <int>
1 1.CTR 97
2 2.D 171
3 3.R 161
4 4.D+R 145
The count()
function is shorthand for something we’ve already seen: grouping by a variable, and summarizing it by counting the number of observations in that group. In other words, tumor %>% count()
is equivalent to:
tumor %>%
group_by(Grp) %>%
summarise(count = n())
# A tibble: 4 × 2
Grp count
<chr> <int>
1 1.CTR 97
2 2.D 171
3 3.R 161
4 4.D+R 145
For convenience, count()
provides the sort
argument:
tumor %>%
count(Grp, sort = TRUE)
# A tibble: 4 × 2
Grp n
<chr> <int>
1 2.D 171
2 3.R 161
3 4.D+R 145
4 1.CTR 97
Previous example shows the use of count()
to count the number of rows/observations
for one factor (i.e., Grp
).
If we wanted to count combination of factors, such as Grp
and Day
,
we would specify the first and the second factor as the arguments of count()
:
tumor %>%
count(Grp, Day)
# A tibble: 109 × 3
Grp Day n
<chr> <dbl> <int>
1 1.CTR 0 8
2 1.CTR 1 4
3 1.CTR 2 3
4 1.CTR 3 5
5 1.CTR 4 8
6 1.CTR 5 5
7 1.CTR 6 5
8 1.CTR 7 8
9 1.CTR 8 4
10 1.CTR 9 3
# … with 99 more rows
With the above code, we can proceed with arrange()
to sort the table
according to a number of criteria so that we have a better comparison.
For instance, we might want to arrange the table above in (i) an alphabetical order of
the levels of the group and (ii) in descending order of the count:
tumor %>%
count(Grp, Day) %>%
arrange(Day, desc(n))
# A tibble: 109 × 3
Grp Day n
<chr> <dbl> <int>
1 2.D 0 10
2 3.R 0 10
3 4.D+R 0 9
4 1.CTR 0 8
5 2.D 1 7
6 4.D+R 1 7
7 1.CTR 1 4
8 3.R 1 3
9 2.D 2 7
10 4.D+R 2 5
# … with 99 more rows
Exercise 2
How many observations are there for each
ID
number?Solution
tumor %>% count(ID)
# A tibble: 37 × 2 ID n <dbl> <int> 1 101 16 2 102 16 3 103 12 4 104 13 5 105 11 6 106 13 7 107 8 8 108 8 9 201 19 10 202 15 # … with 27 more rows
Exercise 3
Use
group_by()
andsummarize()
to find the mean, min, and max tumor size for each ID (usingID
). Also add the number of observations (hint: see?n
).Solution
tumor %>% group_by(ID) %>% summarize( mean_size = mean(Size), min_size = min(Size), max_size = max(Size), n = n() )
# A tibble: 37 × 5 ID mean_size min_size max_size n <dbl> <dbl> <dbl> <dbl> <int> 1 101 810. 41.8 1678. 16 2 102 670. 79.4 1564. 16 3 103 640. 44.8 2406. 12 4 104 595. 67.7 1964. 13 5 105 683. 54.7 2163. 11 6 106 488. 60 2125. 13 7 107 809. 46.8 2343. 8 8 108 917. 49.4 2296. 8 9 201 768. 49.1 1998. 19 10 202 674. 60.6 2047. 15 # … with 27 more rows
Exercise 4
What was the largest tumor measured in each day? Return the columns
Day
,Grp
,ID
, andSize
.Solution
tumor %>% group_by(Day) %>% filter(Size == max(Size)) %>% select(Day, Grp, ID, Size) %>% arrange(Day)
# A tibble: 29 × 4 # Groups: Day [29] Day Grp ID Size <dbl> <chr> <dbl> <dbl> 1 0 1.CTR 102 79.4 2 1 1.CTR 102 110. 3 2 3.R 306 129. 4 3 2.D 210 134 5 4 1.CTR 102 201. 6 5 3.R 306 316. 7 6 3.R 308 612. 8 7 3.R 301 732. 9 8 1.CTR 102 670 10 9 3.R 310 892. # … with 19 more rows
Key Points