Tidy Walkthrough, Part 1
Created by: Monica Thieu
Goals for this vignette
- Demonstrate (what I think are) key first-level features of the tidyverse
- Illustrate psychology use cases for tidyverse functions
- Hopefully convince you to join the cult of tidy
We shall do this by walking through various use cases you might encounter with your data. I’ll jump in with specific examples of functions when appropriate, to highlight solutions for particular situations you might run into with your own data.
Please keep in mind!
This vignette is not intended to be an exhaustive walkthrough of all functions from the indicated tidyverse packages. For reference, please see the excellent reference pages online for the relevant package. For a more detailed self-teaching experience, please refer to the incredible R for Data Science online textbook by Garrett Grolemund and Hadley Wickham, which teaches using tidyverse functions.
Links to Files
The files for all tutorials can be downloaded from the Columbia Psychology Scientific Computing GitHub page using these instructions. This particular file is located here: /content/tutorials/r-extra/tidyverse-guide/tidyguide-1.rmd.
Quick cheat-list of key functions from various pkgs
Remember that each of these pkgs contains other functions too! Often these other functions are more specific use cases of the primary functions that crop up often enough that you'd want a separate function for them.
To see all the functions in each pkg, visit their reference pages at tidyverse.org! I love these reference pages so much.
%>%, the pipe

tidyr

- gather() & spread() for making long and wide
- separate() & unite() for fixing key columns (pre/post-processing for gather()/spread())
- expand(), crossing(), etc. – like expand.grid() but a little more predictable
dplyr

- select(), filter(), distinct() for subsetting and reordering dfs/tibbles
- group_by()/ungroup() for setting groups
- mutate(), your column manipulation workhorse
- summarize() for squishing multiple rows down to summary statistics
- arrange() to sort dfs/tibbles more easily
Brief intro to the pipe
Before we begin…
Enter the pipe %>%! The pipe does one simple, but key, thing: it takes the object on the left-hand side and feeds it into the first argument of the function on the right-hand side. This means that:

- a %>% foo() is equivalent to foo(a). Fine and good.
- a %>% foo() %>% bar(arg = TRUE) is equivalent to bar(foo(a), arg = TRUE). Now, nested function calls read left-to-right!
- Most common use case: df_new <- df_old %>% foo() %>% bar(arg = TRUE) %>% baz() is equivalent to df_new <- baz(bar(foo(df_old), arg = TRUE)). Now you can chain a series of preprocessing commands to operate on a dataframe all at once, and easily read those commands as typed in your script. No more accidentally failing to run some key preprocessing command and causing later code to break!
Note that #pipelife requires functions to:
- take as their first argument the object to be operated upon
- return the same object (or an analog of it), now operated upon
Essentially all functions from the tidyverse are pipe-safe, but bear this in mind when trying to incorporate functions from base R or other packages into your tidy new world.
Initializing fake data
We’ll use a simulated dataset for this vignette, so you don’t need to worry about any dependencies involving datasets you don’t have access to while you’re following along.
Here, I’ll simulate a recognition memory task, where each fake subject sees a series of objects, some they’ve seen before in a learning phase, and some they haven’t. For each object, which is either old (they saw it before) or new (they did not see it before), subjects must respond “old” if they think they saw it before, or “new” if they think they didn’t see it before. I’ll also simulate some person-level demographic variables (age, gender) and trait variables (questionnaire scores).
If you have tidyverse loaded, all this code should run if you try it in your R console. I'm using functions that I'll introduce later, so don't worry if you don't fully understand this right now! Feel free to skip past this.
# 20 fake subjects, 50 fake trials per subject
# Will simulate the person-level variables FIRST,
# then expand to simulate the trial-level variables
raw <- tibble(id = 1L:20L,
age = sample(18L:35L, size = 20, replace = TRUE),
# assuming binary gender for the purposes of this simulation
gender = sample(c("male", "female"), size = 20, replace = TRUE)) %>%
# simulating some "questionnaire" scores; person-level
mutate(q_1 = rnorm(n = n(), mean = 30, sd = 10),
q_2 = rnorm(n = n(), mean = 30, sd = 10),
q_3 = rnorm(n = n(), mean = 30, sd = 10)) %>%
# slice() subsets rows by position; you can use it to repeat rows by repeating position indices
slice(rep(1:n(), each = 50)) %>%
# We'll get to this in a bit--this causes every "group"
# aka every set of rows with the same value for "id", to behave as an independent df
group_by(id) %>%
# I just want to have a column for "trial order", I like those in my task data
mutate(trial_num = 1:n(),
# Each subject sees half OLD and half NEW trials in this recognition memory task
is_old = rep(0L:1L, times = n()/2),
# I'm shuffling the order of "old" and "new" trials in my fake memory task
is_old = sample(is_old),
# This will generate binary "old"/"new" responses corresponding roughly to a d' of 1
# yep, everyone has the same d' today
response = if_else(is_old == 1,
rbinom(n = n(), size = 1, prob = 0.7),
rbinom(n = n(), size = 1, prob = 0.3)),
rt = rnorm(n = n(), mean = 3, sd = 1.5)) %>%
ungroup()
Loading data
In today’s example, we won’t actually be loading any data, because we’re generating simulated data. If you do have real data you’re looking to analyze, though, chances are you have it saved in tabular form (e.g. a CSV) and you need to read it into R. You might be using something like base::read.csv() to do this.
As it happens, the tidyverse features a package, readr, which provides file reading functions in the same family as base::read.table(), base::read.csv(), etc. I like to use these file reading functions, readr::read_csv() in particular (note that tidyverse functions tend to use underscores instead of periods!), because they run a little more quickly and predictably than their base R counterparts.
The command you’d use would look a little something like this, aka very similar to base R:
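The rendered chunk isn't shown here, since we aren't reading real data in this vignette, but a sketch of such a call would be (the filename "my_task_data.csv" is a made-up placeholder):

```r
library(tidyverse)  # or just library(readr)

# "my_task_data.csv" is a placeholder filename for illustration
raw <- read_csv("my_task_data.csv")
```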
Features of readr file reading functions include:

- default args: header always TRUE, and stringsAsFactors always FALSE (better safe than sorry!)
- attempts to guess the data type of each column it reads and tells you what data type it guessed for each one, or you can manually specify the data type of each column upon reading in the data
- always returns a tibble
Now, there will be cases when you need the fine-grained control over delimiter assumptions, header assumptions, and such that the base R file reading functions afford you. Most of the time, though, I find that read_csv() suits my needs well.
PS: In our companion R package, cuPsychRTools, we provide helper functions like read_csv_multi(), which wraps read_csv() to allow you to specify a vector of files to read instead of just one file. This way, you can read in multiple data files that contain different observations of the same data (e.g. a psychological task run on multiple subjects, with each subject’s data in a separate file) and have them all output into one very long df!
Visually inspecting data
The tidyverse introduces the tibble (via the tibble package, which is loaded automatically when you load the whole tidyverse). You can think of a tibble as a dataframe, but a little better. One of the reasons I like to work with tibbles is that visually inspecting them in the console is quite easy:
## # A tibble: 1,000 x 10
## id age gender q_1 q_2 q_3 trial_num is_old response rt
## <int> <int> <chr> <dbl> <dbl> <dbl> <int> <int> <int> <dbl>
## 1 1 32 male 26.6 23.7 37.8 1 0 1 6.01
## 2 1 32 male 26.6 23.7 37.8 2 0 0 1.40
## 3 1 32 male 26.6 23.7 37.8 3 1 1 3.19
## 4 1 32 male 26.6 23.7 37.8 4 0 0 3.33
## 5 1 32 male 26.6 23.7 37.8 5 0 0 0.865
## 6 1 32 male 26.6 23.7 37.8 6 1 1 6.39
## 7 1 32 male 26.6 23.7 37.8 7 0 0 1.94
## 8 1 32 male 26.6 23.7 37.8 8 1 1 3.50
## 9 1 32 male 26.6 23.7 37.8 9 0 1 2.58
## 10 1 32 male 26.6 23.7 37.8 10 0 0 3.53
## # … with 990 more rows
You can see that the default print output of tibbles is kind of like the output of str() on steroids.
Compared to a dataframe, a tibble:
- has a cleaner & more informative print() method
- automatically tells you the data type of each column
- prints only the top 10 rows for longer tibbles (but you can customize how many rows show)
- in some RStudio themes, prints numbers in certain colors to help you quickly identify negatives, decimals, etc.
- is stricter when you create tibbles using tibble(), or when you index (with [] or $)
There are a couple other useful things that tibbles do that dataframes don’t, but the strictness and the print method are the two most useful things that I find. As you’ll see, imposing a bit more strictness on R can have the beneficial side effect of minimizing buggy cases in your own code, since you’re forced to be more explicit.
By the way, note that packages within the tidyverse work together–raw rectangular data read in using readr is given the tibble class by default.
A tibble isn’t too far off from a dataframe in its underlying properties, though.
## [1] "tbl_df" "tbl" "data.frame"
"tbl_df"
is the class name for tibbles. As you can see, tibbles still have the "data.frame"
class, so they can do most anything dataframes do. (There are the occasional older non-tidyverse functions that get confused when they see tibbles, and that’s okay, you can always call as.data.frame(your_tibble)
if you need to in a pinch.)
Non-tibble dataframes can be coerced to tibble using as_tibble(). This shouldn't change the existing data type of any of the columns.
For the rest of this vignette, I will refer to data of a tabular structure (dataframe or tibble) as a df. Know that while the tidyverse is tibble-optimized, everything WORKS on a dataframe, so you can still think of your data structures as dataframes. Hence, I’ll use the term df to help you with mental consistency.
Subsetting data
We’ll deal with subsetting dfs two ways: choosing a subset of columns and a subset of rows.
Subsetting by column
dplyr::select() is the workhorse function for subsetting a df by column. You simply need to name the columns you would like to select!
Note that these are unquoted variable names and not a character vector. Tidyverse functions allow you to call column names bare as variables inside of them.
# This auto-indenting biz after the pipe works because the pipe is a left-right operator
# Remember that raw is getting piped into select() as the first argument
# so the only enumerated arguments inside select() are the columns of interest,
# which is kind of nice!
raw %>%
select(id, age, gender, trial_num, is_old, response, rt)
## # A tibble: 1,000 x 7
## id age gender trial_num is_old response rt
## <int> <int> <chr> <int> <int> <int> <dbl>
## 1 1 32 male 1 0 1 6.01
## 2 1 32 male 2 0 0 1.40
## 3 1 32 male 3 1 1 3.19
## 4 1 32 male 4 0 0 3.33
## 5 1 32 male 5 0 0 0.865
## 6 1 32 male 6 1 1 6.39
## 7 1 32 male 7 0 0 1.94
## 8 1 32 male 8 1 1 3.50
## 9 1 32 male 9 0 1 2.58
## 10 1 32 male 10 0 0 3.53
## # … with 990 more rows
Note that you can use select() to rename columns as well; simply put the new name in the argument name spot, and the old name in the argument value spot.
# Renaming columns is so smooth in the tidyverse!
raw %>%
select(subject_id = id, age, gender, trial_num, is_old, response, rt)
## # A tibble: 1,000 x 7
## subject_id age gender trial_num is_old response rt
## <int> <int> <chr> <int> <int> <int> <dbl>
## 1 1 32 male 1 0 1 6.01
## 2 1 32 male 2 0 0 1.40
## 3 1 32 male 3 1 1 3.19
## 4 1 32 male 4 0 0 3.33
## 5 1 32 male 5 0 0 0.865
## 6 1 32 male 6 1 1 6.39
## 7 1 32 male 7 0 0 1.94
## 8 1 32 male 8 1 1 3.50
## 9 1 32 male 9 0 1 2.58
## 10 1 32 male 10 0 0 3.53
## # … with 990 more rows
There also exists a version of select() called rename(), which behaves exactly as select() does, except that rename() never subsets columns, and always gives you all the columns of your original df (renamed or otherwise).
You can specify columns in select() by naming each individual column you want, or you can select groups of columns in the following ways.

You can use the colon : to select all columns between the ones named, kind of like when creating integer vectors using the colon:
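The chunk that produced the output below wasn't rendered; judging by the printed columns, it was presumably a colon selection like this (a reconstruction, not the original source):

```r
raw %>%
  select(id:gender)  # every column from id through gender, inclusive
```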
## # A tibble: 1,000 x 3
## id age gender
## <int> <int> <chr>
## 1 1 32 male
## 2 1 32 male
## 3 1 32 male
## 4 1 32 male
## 5 1 32 male
## 6 1 32 male
## 7 1 32 male
## 8 1 32 male
## 9 1 32 male
## 10 1 32 male
## # … with 990 more rows
You can also use any one of a series of special helper functions provided by dplyr to select columns in a regex-style way. Like so:
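The source chunk for the output below is missing; given the printed columns, it was presumably something like:

```r
raw %>%
  select(id, starts_with("q"))  # id, plus every column whose name starts with "q"
```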
## # A tibble: 1,000 x 4
## id q_1 q_2 q_3
## <int> <dbl> <dbl> <dbl>
## 1 1 26.6 23.7 37.8
## 2 1 26.6 23.7 37.8
## 3 1 26.6 23.7 37.8
## 4 1 26.6 23.7 37.8
## 5 1 26.6 23.7 37.8
## 6 1 26.6 23.7 37.8
## 7 1 26.6 23.7 37.8
## 8 1 26.6 23.7 37.8
## 9 1 26.6 23.7 37.8
## 10 1 26.6 23.7 37.8
## # … with 990 more rows
Check the reference page for tidyselect::select_helpers to see all of the helper functions that are available to you.
Note that the order in which you specify columns/helper calls determines the ordering of the columns in your output.
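The chunk isn't shown, but the column order in the output below suggests a call like:

```r
raw %>%
  select(gender, age, id)  # columns come out in exactly this order
```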
## # A tibble: 1,000 x 3
## gender age id
## <chr> <int> <int>
## 1 male 32 1
## 2 male 32 1
## 3 male 32 1
## 4 male 32 1
## 5 male 32 1
## 6 male 32 1
## 7 male 32 1
## 8 male 32 1
## 9 male 32 1
## 10 male 32 1
## # … with 990 more rows
In this way, you can explicitly use select() to reorder columns without dropping any. To do this, use the tidyselect helper function everything():
## # A tibble: 1,000 x 10
## id gender age q_1 q_2 q_3 trial_num is_old response rt
## <int> <chr> <int> <dbl> <dbl> <dbl> <int> <int> <int> <dbl>
## 1 1 male 32 26.6 23.7 37.8 1 0 1 6.01
## 2 1 male 32 26.6 23.7 37.8 2 0 0 1.40
## 3 1 male 32 26.6 23.7 37.8 3 1 1 3.19
## 4 1 male 32 26.6 23.7 37.8 4 0 0 3.33
## 5 1 male 32 26.6 23.7 37.8 5 0 0 0.865
## 6 1 male 32 26.6 23.7 37.8 6 1 1 6.39
## 7 1 male 32 26.6 23.7 37.8 7 0 0 1.94
## 8 1 male 32 26.6 23.7 37.8 8 1 1 3.50
## 9 1 male 32 26.6 23.7 37.8 9 0 1 2.58
## 10 1 male 32 26.6 23.7 37.8 10 0 0 3.53
## # … with 990 more rows
everything() essentially returns the names of all the other, not-explicitly-named columns, keeping them in their original order minus the columns you explicitly specified. You can use this syntax to pull a selected number of columns to the far left of your df, and leave all other columns in their previous order.
You can also use select() to keep all columns except ones you specify. For this, use the - operator like so:
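Reconstructed from the output below (gender is the one missing column):

```r
raw %>%
  select(-gender)  # every column except gender
```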
## # A tibble: 1,000 x 9
## id age q_1 q_2 q_3 trial_num is_old response rt
## <int> <int> <dbl> <dbl> <dbl> <int> <int> <int> <dbl>
## 1 1 32 26.6 23.7 37.8 1 0 1 6.01
## 2 1 32 26.6 23.7 37.8 2 0 0 1.40
## 3 1 32 26.6 23.7 37.8 3 1 1 3.19
## 4 1 32 26.6 23.7 37.8 4 0 0 3.33
## 5 1 32 26.6 23.7 37.8 5 0 0 0.865
## 6 1 32 26.6 23.7 37.8 6 1 1 6.39
## 7 1 32 26.6 23.7 37.8 7 0 0 1.94
## 8 1 32 26.6 23.7 37.8 8 1 1 3.50
## 9 1 32 26.6 23.7 37.8 9 0 1 2.58
## 10 1 32 26.6 23.7 37.8 10 0 0 3.53
## # … with 990 more rows
You can combine the - with the : operator, or with one of the tidyselect helpers we saw before:
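The two printed results below are consistent with calls like these (reconstructions, not the original source):

```r
# first output: drop the whole span of trialwise columns
raw %>%
  select(-(trial_num:rt))

# second output: drop all the questionnaire columns
raw %>%
  select(-starts_with("q"))
```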
## # A tibble: 1,000 x 6
## id age gender q_1 q_2 q_3
## <int> <int> <chr> <dbl> <dbl> <dbl>
## 1 1 32 male 26.6 23.7 37.8
## 2 1 32 male 26.6 23.7 37.8
## 3 1 32 male 26.6 23.7 37.8
## 4 1 32 male 26.6 23.7 37.8
## 5 1 32 male 26.6 23.7 37.8
## 6 1 32 male 26.6 23.7 37.8
## 7 1 32 male 26.6 23.7 37.8
## 8 1 32 male 26.6 23.7 37.8
## 9 1 32 male 26.6 23.7 37.8
## 10 1 32 male 26.6 23.7 37.8
## # … with 990 more rows
## # A tibble: 1,000 x 7
## id age gender trial_num is_old response rt
## <int> <int> <chr> <int> <int> <int> <dbl>
## 1 1 32 male 1 0 1 6.01
## 2 1 32 male 2 0 0 1.40
## 3 1 32 male 3 1 1 3.19
## 4 1 32 male 4 0 0 3.33
## 5 1 32 male 5 0 0 0.865
## 6 1 32 male 6 1 1 6.39
## 7 1 32 male 7 0 0 1.94
## 8 1 32 male 8 1 1 3.50
## 9 1 32 male 9 0 1 2.58
## 10 1 32 male 10 0 0 3.53
## # … with 990 more rows
Subsetting by row
There are a number of dplyr functions which will, in some way, reduce the number of rows in your df.
Logical subsetting
filter() will return a df subset for only those rows that return TRUE for the logical statements specified in its arguments.

Notice that, as with other functions in the tidyverse, you call df columns as bare variable names, without needing $ or [] indexing.
## # A tibble: 50 x 10
## id age gender q_1 q_2 q_3 trial_num is_old response rt
## <int> <int> <chr> <dbl> <dbl> <dbl> <int> <int> <int> <dbl>
## 1 13 35 female 46.7 38.9 17.3 1 0 1 2.74
## 2 13 35 female 46.7 38.9 17.3 2 0 1 2.95
## 3 13 35 female 46.7 38.9 17.3 3 0 0 2.10
## 4 13 35 female 46.7 38.9 17.3 4 0 1 2.76
## 5 13 35 female 46.7 38.9 17.3 5 1 1 3.08
## 6 13 35 female 46.7 38.9 17.3 6 0 0 2.36
## 7 13 35 female 46.7 38.9 17.3 7 0 0 3.00
## 8 13 35 female 46.7 38.9 17.3 8 1 1 0.946
## 9 13 35 female 46.7 38.9 17.3 9 1 1 1.60
## 10 13 35 female 46.7 38.9 17.3 10 1 1 6.23
## # … with 40 more rows
filter() takes logical statements with & and | in them, like you might use in bracket indexing.
# would I ordinarily use a %in% call for this?
# Yes I would, but this is just to show how the or operator | works here
raw %>%
filter(id == 13 | id == 17)
## # A tibble: 100 x 10
## id age gender q_1 q_2 q_3 trial_num is_old response rt
## <int> <int> <chr> <dbl> <dbl> <dbl> <int> <int> <int> <dbl>
## 1 13 35 female 46.7 38.9 17.3 1 0 1 2.74
## 2 13 35 female 46.7 38.9 17.3 2 0 1 2.95
## 3 13 35 female 46.7 38.9 17.3 3 0 0 2.10
## 4 13 35 female 46.7 38.9 17.3 4 0 1 2.76
## 5 13 35 female 46.7 38.9 17.3 5 1 1 3.08
## 6 13 35 female 46.7 38.9 17.3 6 0 0 2.36
## 7 13 35 female 46.7 38.9 17.3 7 0 0 3.00
## 8 13 35 female 46.7 38.9 17.3 8 1 1 0.946
## 9 13 35 female 46.7 38.9 17.3 9 1 1 1.60
## 10 13 35 female 46.7 38.9 17.3 10 1 1 6.23
## # … with 90 more rows
filter() also has a nice feature: specifying separate logical statements as separate arguments, separated by commas, implicitly acts like chaining the statements together with &, since the intersection of logical vectors is probably the most common use case for row subsetting.
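The output below (25 rows, questionnaire columns dropped) is consistent with a call like this reconstruction:

```r
raw %>%
  select(-starts_with("q")) %>%
  # comma-separated conditions act as if chained with &
  filter(id == 13, is_old == 1)
```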
## # A tibble: 25 x 7
## id age gender trial_num is_old response rt
## <int> <int> <chr> <int> <int> <int> <dbl>
## 1 13 35 female 5 1 1 3.08
## 2 13 35 female 8 1 1 0.946
## 3 13 35 female 9 1 1 1.60
## 4 13 35 female 10 1 1 6.23
## 5 13 35 female 11 1 1 2.29
## 6 13 35 female 12 1 1 1.58
## 7 13 35 female 13 1 0 0.903
## 8 13 35 female 16 1 1 0.606
## 9 13 35 female 17 1 1 2.53
## 10 13 35 female 19 1 1 5.20
## # … with 15 more rows
Subsetting rows by position
Should you find this necessary, you can subset rows by numeric index using dplyr::slice(). It takes any numeric vector of valid indices as input.
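The chunk isn't shown; the printed rows (subject 8's trials 1 through 10, which fall at rows 351-360 of this simulated df, with 50 trials per subject) are consistent with:

```r
raw %>%
  slice(351:360)  # grab rows 351 through 360 by position
```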
## # A tibble: 10 x 10
## id age gender q_1 q_2 q_3 trial_num is_old response rt
## <int> <int> <chr> <dbl> <dbl> <dbl> <int> <int> <int> <dbl>
## 1 8 32 male 38.6 31.7 24.8 1 1 1 2.61
## 2 8 32 male 38.6 31.7 24.8 2 1 1 1.63
## 3 8 32 male 38.6 31.7 24.8 3 0 0 2.68
## 4 8 32 male 38.6 31.7 24.8 4 0 0 4.30
## 5 8 32 male 38.6 31.7 24.8 5 0 1 2.59
## 6 8 32 male 38.6 31.7 24.8 6 0 0 1.35
## 7 8 32 male 38.6 31.7 24.8 7 1 0 4.76
## 8 8 32 male 38.6 31.7 24.8 8 1 0 2.10
## 9 8 32 male 38.6 31.7 24.8 9 1 1 3.92
## 10 8 32 male 38.6 31.7 24.8 10 1 1 3.65
Getting only unique/un-duplicated rows
Another function that technically “subsets” data, but not logically: distinct() is the dplyr function for removing rows with duplicate observations.
raw %>%
# First, I'm removing columns with trialwise observations so that all remaining values are in fact duplicated
select(id, gender, age) %>%
# the .keep_all argument specifies whether to drop columns that aren't being distinct-ed by
distinct(id, .keep_all = TRUE)
## # A tibble: 20 x 3
## id gender age
## <int> <chr> <int>
## 1 1 male 32
## 2 2 male 31
## 3 3 female 32
## 4 4 female 25
## 5 5 male 23
## 6 6 male 29
## 7 7 male 25
## 8 8 male 32
## 9 9 female 18
## 10 10 male 19
## 11 11 male 20
## 12 12 female 25
## 13 13 female 35
## 14 14 female 28
## 15 15 male 33
## 16 16 male 29
## 17 17 male 27
## 18 18 female 35
## 19 19 male 31
## 20 20 male 21
You can, of course, call distinct() on multiple columns, which will retain each unique combination of values in said columns once.
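Reconstructed from the output below (15 unique age-gender combinations):

```r
raw %>%
  distinct(age, gender)  # one row per unique combination of age and gender
```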
## # A tibble: 15 x 2
## age gender
## <int> <chr>
## 1 32 male
## 2 31 male
## 3 32 female
## 4 25 female
## 5 23 male
## 6 29 male
## 7 25 male
## 8 18 female
## 9 19 male
## 10 20 male
## 11 35 female
## 12 28 female
## 13 33 male
## 14 27 male
## 15 21 male
Subsetting random rows
This is something you might need to do on occasion. For example, sometimes when I want to plot single-subject data for just a few subjects from a dataset, I want to select a random subset of subjects so I’m not always just looking at the subjects who happened to be at the top of the df for whatever reason.
dplyr contains the functions sample_n() and sample_frac(), which wrap base::sample() to let you quickly get a random subset of rows, specified either by number of rows to get or by fraction of total rows to get.

Note that these work best if your df has one row per subject, because they behave in a very particular way on grouped dfs (more on those later).
raw %>%
# this will create a df with only one row per subject
distinct(id, gender, age) %>%
sample_n(5)
## # A tibble: 5 x 3
## id age gender
## <int> <int> <chr>
## 1 17 27 male
## 2 15 33 male
## 3 12 25 female
## 4 16 29 male
## 5 7 25 male
Subsetting by row and column simultaneously
Thanks to the pipe %>%, you can chain a select() call with a filter() (or similar) call to subset by row and by column in the same command.

In this case, the order doesn't much matter, so it's a matter of personal preference which you call first. I would generally recommend putting any renaming calls as early as possible, so that the rest of your code stays consistent with respect to column names.
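The chunk behind the output below isn't shown; it was presumably a chain like:

```r
raw %>%
  select(-starts_with("q")) %>%  # drop the questionnaire columns
  filter(id == 13)               # keep only subject 13's trials
```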
## # A tibble: 50 x 7
## id age gender trial_num is_old response rt
## <int> <int> <chr> <int> <int> <int> <dbl>
## 1 13 35 female 1 0 1 2.74
## 2 13 35 female 2 0 1 2.95
## 3 13 35 female 3 0 0 2.10
## 4 13 35 female 4 0 1 2.76
## 5 13 35 female 5 1 1 3.08
## 6 13 35 female 6 0 0 2.36
## 7 13 35 female 7 0 0 3.00
## 8 13 35 female 8 1 1 0.946
## 9 13 35 female 9 1 1 1.60
## 10 13 35 female 10 1 1 6.23
## # … with 40 more rows
Manipulating data
dplyr is the home for most of the tidyverse functions that deal with manipulating your data in some way: creating new data, calculating summary data, etc. We'll tour these functions by breaking them down into their use cases.
Creating/modifying columns
dplyr::mutate() will be your workhorse function for creating new columns or modifying existing ones in a df.

Inside of every mutate() call, as with previous tidyverse functions, column names are fed in as bare names (like variables, without quotes), without needing to $-index them from the parent df every time.
raw %>%
# just to make the output narrower
select(-starts_with("q")) %>%
mutate(rt_ms = round(rt * 1000))
## # A tibble: 1,000 x 8
## id age gender trial_num is_old response rt rt_ms
## <int> <int> <chr> <int> <int> <int> <dbl> <dbl>
## 1 1 32 male 1 0 1 6.01 6012
## 2 1 32 male 2 0 0 1.40 1404
## 3 1 32 male 3 1 1 3.19 3188
## 4 1 32 male 4 0 0 3.33 3326
## 5 1 32 male 5 0 0 0.865 865
## 6 1 32 male 6 1 1 6.39 6389
## 7 1 32 male 7 0 0 1.94 1943
## 8 1 32 male 8 1 1 3.50 3500
## 9 1 32 male 9 0 1 2.58 2575
## 10 1 32 male 10 0 0 3.53 3532
## # … with 990 more rows
You can do anything inside a mutate() call that:
- is totally element-wise
- returns a vector of length 1 (this will recycle)
- returns a vector of the same length as the total number of rows
Like tibble() for creating tibbles, mutate() will not recycle vectors longer than 1 but shorter than the input df. I like this behavior, as it has protected me on many occasions from creating columns that didn't contain what I meant them to! If you do have a vector that you WANT recycled into a column, feed that vector into rep() so that you explicitly create an output vector of the right length.
# This will run, because I've wrapped the offending vector in rep() to force it to be the right length
# you can use any of the rep() args to control the length of the output vector
# I'll explain n() in a little bit also
raw %>%
distinct(id, gender, age) %>%
mutate(group = rep(c("treatment", "control"), length.out = n()))
## # A tibble: 20 x 4
## id age gender group
## <int> <int> <chr> <chr>
## 1 1 32 male treatment
## 2 2 31 male control
## 3 3 32 female treatment
## 4 4 25 female control
## 5 5 23 male treatment
## 6 6 29 male control
## 7 7 25 male treatment
## 8 8 32 male control
## 9 9 18 female treatment
## 10 10 19 male control
## 11 11 20 male treatment
## 12 12 25 female control
## 13 13 35 female treatment
## 14 14 28 female control
## 15 15 33 male treatment
## 16 16 29 male control
## 17 17 27 male treatment
## 18 18 35 female control
## 19 19 31 male treatment
## 20 20 21 male control
Nice helpers
This section features a bunch of handy functions you can use inside mutate() (but not exclusively! they work on any old vector too) to do all sorts of magical things. Again, this is not an exhaustive walkthrough of all the tidy functions you might find useful–this is a limited subset of functions/cases that I find myself running into regularly.
n()
n() is a shortcut (only usable inside other tidyverse functions, alas) that returns the number of rows in the current df. I find it a lot safer than using nrow(my_df) inside of calls operating on my_df (for example).

You can call it inside of other functions inside of mutate() as well.
raw %>%
# again, to make the print output narrower
select(-starts_with("q")) %>%
mutate(nrows = n())
## # A tibble: 1,000 x 8
## id age gender trial_num is_old response rt nrows
## <int> <int> <chr> <int> <int> <int> <dbl> <int>
## 1 1 32 male 1 0 1 6.01 1000
## 2 1 32 male 2 0 0 1.40 1000
## 3 1 32 male 3 1 1 3.19 1000
## 4 1 32 male 4 0 0 3.33 1000
## 5 1 32 male 5 0 0 0.865 1000
## 6 1 32 male 6 1 1 6.39 1000
## 7 1 32 male 7 0 0 1.94 1000
## 8 1 32 male 8 1 1 3.50 1000
## 9 1 32 male 9 0 1 2.58 1000
## 10 1 32 male 10 0 0 3.53 1000
## # … with 990 more rows
Observe that the value of the column nrows is equal to the number of rows reported in the tibble output. You can use this number for all sorts of stuff–I frequently find myself creating a helper column in my dfs with the row index, by doing something like my_df %>% mutate(index = 1:n()).
recode()
recode() will point-replace specified values in a vector with other values. It's most useful where you would otherwise have used a series of nested if-else calls to reassign old values of your column to new values.

It has a particular syntax: the old values must be argument names (on the left of the =), formatted as bare variable names, and the new values go on the right, specified as strings or numbers as appropriate.

As with other functions in the tidyverse, recode() won't let you substitute in values of different data types–all new values must be the same type.

You also don't have to replace every existing value (you can leave values the way they are by just not listing them in the recode() call), but again, all values in the output vector must be the same data type.
## # A tibble: 1,000 x 10
## id age gender q_1 q_2 q_3 trial_num is_old response rt
## <int> <int> <chr> <dbl> <dbl> <dbl> <int> <int> <int> <dbl>
## 1 1 32 M 26.6 23.7 37.8 1 0 1 6.01
## 2 1 32 M 26.6 23.7 37.8 2 0 0 1.40
## 3 1 32 M 26.6 23.7 37.8 3 1 1 3.19
## 4 1 32 M 26.6 23.7 37.8 4 0 0 3.33
## 5 1 32 M 26.6 23.7 37.8 5 0 0 0.865
## 6 1 32 M 26.6 23.7 37.8 6 1 1 6.39
## 7 1 32 M 26.6 23.7 37.8 7 0 0 1.94
## 8 1 32 M 26.6 23.7 37.8 8 1 1 3.50
## 9 1 32 M 26.6 23.7 37.8 9 0 1 2.58
## 10 1 32 M 26.6 23.7 37.8 10 0 0 3.53
## # … with 990 more rows
Note that to call recode() on a numeric, or otherwise not variable-name-safe, input vector, you have to use backtick quotes to force the non-variable-safe values to render as argument names.
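Reconstructed from the output below (note that is_old prints as <chr> afterward, since the recoded values are strings):

```r
raw %>%
  # backticks let the numeric values 0 and 1 act as argument names
  mutate(is_old = recode(is_old, `0` = "new", `1` = "old"))
```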
## # A tibble: 1,000 x 10
## id age gender q_1 q_2 q_3 trial_num is_old response rt
## <int> <int> <chr> <dbl> <dbl> <dbl> <int> <chr> <int> <dbl>
## 1 1 32 male 26.6 23.7 37.8 1 new 1 6.01
## 2 1 32 male 26.6 23.7 37.8 2 new 0 1.40
## 3 1 32 male 26.6 23.7 37.8 3 old 1 3.19
## 4 1 32 male 26.6 23.7 37.8 4 new 0 3.33
## 5 1 32 male 26.6 23.7 37.8 5 new 0 0.865
## 6 1 32 male 26.6 23.7 37.8 6 old 1 6.39
## 7 1 32 male 26.6 23.7 37.8 7 new 0 1.94
## 8 1 32 male 26.6 23.7 37.8 8 old 1 3.50
## 9 1 32 male 26.6 23.7 37.8 9 new 1 2.58
## 10 1 32 male 26.6 23.7 37.8 10 new 0 3.53
## # … with 990 more rows
if_else() and case_when()
dplyr::if_else() (again, observe the underscore in the function name) is a stricter & faster version of base::ifelse(), your workhorse conditional vector-modifying function. In line with the tidy way, if_else() will yell at you if your values for true and false aren't the same data type, to try to protect you from accidental data typing disaster.
raw %>%
distinct(id, age, gender) %>%
# I know being over 30 isn't OLD. Just for illustration!
mutate(age_bin = if_else(age >= 30,
"old",
"young"))
## # A tibble: 20 x 4
## id age gender age_bin
## <int> <int> <chr> <chr>
## 1 1 32 male old
## 2 2 31 male old
## 3 3 32 female old
## 4 4 25 female young
## 5 5 23 male young
## 6 6 29 male young
## 7 7 25 male young
## 8 8 32 male old
## 9 9 18 female young
## 10 10 19 male young
## 11 11 20 male young
## 12 12 25 female young
## 13 13 35 female old
## 14 14 28 female young
## 15 15 33 male old
## 16 16 29 male young
## 17 17 27 male young
## 18 18 35 female old
## 19 19 31 male old
## 20 20 21 male young
dplyr::case_when() is a function that lets you create a vector based on a switch statement. A switch statement is essentially a neater if-else if-else chain that lets you specify a series of conditions, one of which you'd expect to be true for each piece of data going in.
Notice the particular syntax that case_when() expects to let it do its job: the logical statement is specified on the left, then a tilde ~, then the desired output value on the right. The last logical statement, TRUE ~ ..., functions as a final "else" in this large if-else chain, assigning a default output to any values that aren't TRUE for any other condition in the statement. I like to have the final case be either NA, or the existing value of the old vector (if overwriting an existing vector).

Below is one of my common use cases for case_when(): binning numeric values. (I prefer this to base::cut() because I find it reads much more cleanly.)
raw %>%
distinct(id, age, gender) %>%
mutate(age_bin = case_when(age >= 30 ~ "old",
age > 22 & age < 30 ~ "medium",
age <= 22 ~ "young",
TRUE ~ NA_character_))
## # A tibble: 20 x 4
## id age gender age_bin
## <int> <int> <chr> <chr>
## 1 1 32 male old
## 2 2 31 male old
## 3 3 32 female old
## 4 4 25 female medium
## 5 5 23 male medium
## 6 6 29 male medium
## 7 7 25 male medium
## 8 8 32 male old
## 9 9 18 female young
## 10 10 19 male young
## 11 11 20 male young
## 12 12 25 female medium
## 13 13 35 female old
## 14 14 28 female medium
## 15 15 33 male old
## 16 16 29 male medium
## 17 17 27 male medium
## 18 18 35 female old
## 19 19 31 male old
## 20 20 21 male young
case_when() lets you specify all manner of logical statements, so you can let your heart go wild!
# You will probably never need this specific coding
# This is just to show how you can have slightly more complex logical statements
raw %>%
distinct(id, age, gender) %>%
mutate(age_gender = case_when(age >= 30 & gender == "male" ~ "old male",
age >= 30 & gender == "female" ~ "old female",
gender == "female" ~ "female",
TRUE ~ NA_character_))
## # A tibble: 20 x 4
## id age gender age_gender
## <int> <int> <chr> <chr>
## 1 1 32 male old male
## 2 2 31 male old male
## 3 3 32 female old female
## 4 4 25 female female
## 5 5 23 male <NA>
## 6 6 29 male <NA>
## 7 7 25 male <NA>
## 8 8 32 male old male
## 9 9 18 female female
## 10 10 19 male <NA>
## 11 11 20 male <NA>
## 12 12 25 female female
## 13 13 35 female old female
## 14 14 28 female female
## 15 15 33 male old male
## 16 16 29 male <NA>
## 17 17 27 male <NA>
## 18 18 35 female old female
## 19 19 31 male old male
## 20 20 21 male <NA>
Important: `case_when()` evaluates its logical statements in the order you specify them, just like an if-else if statement. If multiple of your conditions are true for some values, those values will be assigned the output of the first true condition listed. Observe the behavior of the following call versus the one immediately above:
# Now, ALL the values of age_gender where gender == "female" will be "female",
# because that condition was satisfied BEFORE the case_when() statement could check
# the age >= 30 & gender == "female" statement
raw %>%
distinct(id, age, gender) %>%
mutate(age_gender = case_when(gender == "female" ~ "female",
age >= 30 & gender == "male" ~ "old male",
age >= 30 & gender == "female" ~ "old female",
TRUE ~ NA_character_))
## # A tibble: 20 x 4
## id age gender age_gender
## <int> <int> <chr> <chr>
## 1 1 32 male old male
## 2 2 31 male old male
## 3 3 32 female female
## 4 4 25 female female
## 5 5 23 male <NA>
## 6 6 29 male <NA>
## 7 7 25 male <NA>
## 8 8 32 male old male
## 9 9 18 female female
## 10 10 19 male <NA>
## 11 11 20 male <NA>
## 12 12 25 female female
## 13 13 35 female female
## 14 14 28 female female
## 15 15 33 male old male
## 16 16 29 male <NA>
## 17 17 27 male <NA>
## 18 18 35 female female
## 19 19 31 male old male
## 20 20 21 male <NA>
As long as you remember that `case_when()` statements check conditions sequentially, and you're deliberate when you specify conditions that might be simultaneously true, you should be okay.
By the way, notice that a `case_when()` statement in which every logical condition tests for equality with `==` is equivalent to calling `recode()` to do the same thing.
raw %>%
distinct(id, age, gender) %>%
mutate(gender = case_when(gender == "male" ~ "M",
gender == "female" ~ "F",
TRUE ~ gender))
## # A tibble: 20 x 3
## id age gender
## <int> <int> <chr>
## 1 1 32 M
## 2 2 31 M
## 3 3 32 F
## 4 4 25 F
## 5 5 23 M
## 6 6 29 M
## 7 7 25 M
## 8 8 32 M
## 9 9 18 F
## 10 10 19 M
## 11 11 20 M
## 12 12 25 F
## 13 13 35 F
## 14 14 28 F
## 15 15 33 M
## 16 16 29 M
## 17 17 27 M
## 18 18 35 F
## 19 19 31 M
## 20 20 21 M
The above `case_when()` call is equivalent to the following `recode()` call:
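The chunk itself did not survive rendering, but judging from the output below, it was presumably along these lines (in `recode()`, old values go on the left of each `=` and new values on the right):

```r
# Presumed original chunk: old values on the left, new values on the right
raw %>%
  distinct(id, age, gender) %>%
  mutate(gender = recode(gender,
                         male = "M",
                         female = "F"))
```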
## # A tibble: 20 x 3
## id age gender
## <int> <int> <chr>
## 1 1 32 M
## 2 2 31 M
## 3 3 32 F
## 4 4 25 F
## 5 5 23 M
## 6 6 29 M
## 7 7 25 M
## 8 8 32 M
## 9 9 18 F
## 10 10 19 M
## 11 11 20 M
## 12 12 25 F
## 13 13 35 F
## 14 14 28 F
## 15 15 33 M
## 16 16 29 M
## 17 17 27 M
## 18 18 35 F
## 19 19 31 M
## 20 20 21 M
The `recode()` call is a lot less verbose. Where you can use `recode()` in place of `case_when()`, I recommend it. In situations where you need the full flexibility that `case_when()` affords you, continue on your merry way!
coalesce() and na_if()
One thing you might find yourself doing a lot with `if_else()` calls is replacing values in a column with `NA`, or replacing `NA` values in a column with something else. `dplyr` helper functions are here to save you a few keystrokes!
To replace a particular value in a column with `NA`, use the aptly named `na_if()`.
raw %>%
distinct(id, gender) %>%
# In this example, I'm saving the result into a new column,
# but you could easily overwrite the contents of an old column
mutate(female = na_if(gender, "male"))
## # A tibble: 20 x 3
## id gender female
## <int> <chr> <chr>
## 1 1 male <NA>
## 2 2 male <NA>
## 3 3 female female
## 4 4 female female
## 5 5 male <NA>
## 6 6 male <NA>
## 7 7 male <NA>
## 8 8 male <NA>
## 9 9 female female
## 10 10 male <NA>
## 11 11 male <NA>
## 12 12 female female
## 13 13 female female
## 14 14 female female
## 15 15 male <NA>
## 16 16 male <NA>
## 17 17 male <NA>
## 18 18 female female
## 19 19 male <NA>
## 20 20 male <NA>
Bear in mind that `na_if()` is a specialized case of `if_else(column == value_to_remove, NA, column)`, so use `if_else()` if you want to replace values using any comparison other than direct equality.
To replace missing values in a column with non-missing values, use `coalesce()`. You can use this a few ways.
First off, you can replace all missing values with the same new value.
raw %>%
distinct(id, gender) %>%
mutate(female = na_if(gender, "male"),
# See that I'm calling coalesce() on the column with NAs I just created
gender2 = coalesce(female, "male"))
## # A tibble: 20 x 4
## id gender female gender2
## <int> <chr> <chr> <chr>
## 1 1 male <NA> male
## 2 2 male <NA> male
## 3 3 female female female
## 4 4 female female female
## 5 5 male <NA> male
## 6 6 male <NA> male
## 7 7 male <NA> male
## 8 8 male <NA> male
## 9 9 female female female
## 10 10 male <NA> male
## 11 11 male <NA> male
## 12 12 female female female
## 13 13 female female female
## 14 14 female female female
## 15 15 male <NA> male
## 16 16 male <NA> male
## 17 17 male <NA> male
## 18 18 female female female
## 19 19 male <NA> male
## 20 20 male <NA> male
You can also use `coalesce()` to zip together a series of vectors, where every time one has a missing value, another one has a non-missing value.
raw %>%
distinct(id, gender) %>%
mutate(female = na_if(gender, "male"),
male = na_if(gender, "female"),
# notice that coalesce is being called on two columns,
# specified as bare variables
gender2 = coalesce(female, male))
## # A tibble: 20 x 5
## id gender female male gender2
## <int> <chr> <chr> <chr> <chr>
## 1 1 male <NA> male male
## 2 2 male <NA> male male
## 3 3 female female <NA> female
## 4 4 female female <NA> female
## 5 5 male <NA> male male
## 6 6 male <NA> male male
## 7 7 male <NA> male male
## 8 8 male <NA> male male
## 9 9 female female <NA> female
## 10 10 male <NA> male male
## 11 11 male <NA> male male
## 12 12 female female <NA> female
## 13 13 female female <NA> female
## 14 14 female female <NA> female
## 15 15 male <NA> male male
## 16 16 male <NA> male male
## 17 17 male <NA> male male
## 18 18 female female <NA> female
## 19 19 male <NA> male male
## 20 20 male <NA> male male
Here's a more explicit example of using `coalesce()` to zip together more than two vectors:
coalesce(c(1, NA, NA, NA, NA),
c(NA, NA, NA, NA, 5),
c(NA, NA, NA, 4, NA),
c(NA, NA, 3, NA, NA),
c(NA, 2, NA, NA, NA))
## [1] 1 2 3 4 5
As with many other tidy functions, `coalesce()` obeys order: if more than one of your input vectors has a non-missing value at a given position, your output vector takes the value from the first vector with a non-missing value there.
coalesce(c(1, NA, NA, NA, NA),
c(NA, 3, NA, NA, 5),
c(NA, NA, NA, 4, NA),
c(NA, NA, 3, NA, NA),
c(NA, 2, NA, NA, NA))
## [1] 1 3 3 4 5
stringr
If you don't already use `stringr`, it's a fantastic package in the tidyverse that makes string manipulation a breeze. Detecting string matches, subsetting strings, removing whitespace: you name it, `stringr` probably has a function for it!
There won't be any code examples here, because the package has so many functions you might want to use that I don't know which ones to show. If you want to peruse what the package has to offer, please check out the official `stringr` cheatsheet to get the full experience!
Calculating summary statistics
Handily, `dplyr`'s general function for calculating summary statistics is `summarize()`! You can use this to generate any single summary value on a vector you can think of: mean, median, SD, you name it.
What can you use to generate a summary statistic? Any function that operates on a vector (or multiple vectors) and returns a single value (length 1) as output. Gotta be length 1!
In `summarize()`, as in `mutate()`, the argument name will be the name of the new column generated by the code provided in the argument value.
(If you DON'T specify the column name, by leaving the argument name empty, the code you provided to calculate the column is automatically substituted in as the column's name. Don't do this…)
# Just trying not to overload any function names
summaries <- raw %>%
summarize(rt_mean = mean(rt),
rt_median = median(rt))
summaries
## # A tibble: 1 x 2
## rt_mean rt_median
## <dbl> <dbl>
## 1 2.98 2.99
Special versions of summarize()
`summarize()` lets you custom-specify every summary statistic you want to calculate, which is very nice indeed. Sometimes, however, you may want to calculate the same statistic(s) on very many variables, and you wouldn't like to type each of those out. We can handle this!
The two functions I use most often for this purpose are `summarize_if()` and `summarize_at()`.
`summarize_if()` lets you specify summary functions, and then applies those functions to every column in your starting df that meets some logical criterion. For example, let's say you want to calculate the mean and median of every column in a df that contains numeric data.
# Creating one character column and three numeric columns so you can see that the character column shan't appear in the output of summarize_if()
tibble(chars = letters[1:10],
nums1 = rnorm(10, mean = 0, sd = 1)) %>%
mutate(nums2 = rnorm(n(), mean = 100, sd = 20),
nums3 = rnorm(n(), mean = 0, sd = 20)) %>%
summarize_if(is.numeric, funs(avg = mean(.),
median = median(.)))
## Warning: `funs()` is deprecated as of dplyr 0.8.0.
## Please use a list of either functions or lambdas:
##
## # Simple named list:
## list(mean = mean, median = median)
##
## # Auto named with `tibble::lst()`:
## tibble::lst(mean, median)
##
## # Using lambdas
## list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
## # A tibble: 1 x 6
## nums1_avg nums2_avg nums3_avg nums1_median nums2_median nums3_median
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 -0.173 93.8 -2.28 0.124 90.2 -5.79
The first true argument of `summarize_if()` is `.predicate`, where you specify the tester function that should return `TRUE` or `FALSE` for every column in your starter df.
The next argument is `.funs`, where you specify the summary function(s) you would like applied to all of the columns for which `.predicate` returns `TRUE`. `dplyr` provides the helper function `funs()` to generate a list of function calls in the right format to be fed into `summarize_if()`. All you need is to specify the input function calls, each separated by a comma. The period `.` in the function calls stands for the column currently being summarized; it lets you tell `tidyverse` functions which argument of your summary function should receive the data from the starter df. This is useful in cases where the data isn't the first argument of your summary function; this way you can make sure the data to be summarized doesn't get sent into the wrong argument and gum up the works.
You can see that `funs()` allows you to name the function calls you enter inside it, and that `summarize_if()` automatically appends the provided names to the output columns.
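As the warning above notes, `funs()` is deprecated as of dplyr 0.8.0. The same call can be written with a named list of purrr-style lambdas, which behaves identically here:

```r
# Same summary as above, without the deprecated funs(); the list names
# ("avg", "median") are appended to the output column names just as before
tibble(chars = letters[1:10],
       nums1 = rnorm(10, mean = 0, sd = 1)) %>%
  mutate(nums2 = rnorm(n(), mean = 100, sd = 20),
         nums3 = rnorm(n(), mean = 0, sd = 20)) %>%
  summarize_if(is.numeric, list(avg = ~ mean(.),
                                median = ~ median(.)))
```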
The other specialty summarizing function, `summarize_at()`, does not use a logical predicate function to choose which columns to summarize, but instead allows you to specify all the columns to summarize by name.
Instead of `.predicate`, its first argument is `.vars`, where you specify the variables you want. You can do this a couple of ways:
With a character vector explicitly naming each column you want as a string:
tibble(chars = letters[1:10],
nums1 = rnorm(10, mean = 0, sd = 1)) %>%
mutate(nums2 = rnorm(n(), mean = 100, sd = 20),
nums3 = rnorm(n(), mean = 0, sd = 20)) %>%
summarize_at(c("nums1", "nums2"), funs(avg = mean(.),
median = median(.)))
## # A tibble: 1 x 4
## nums1_avg nums2_avg nums1_median nums2_median
## <dbl> <dbl> <dbl> <dbl>
## 1 0.142 102. 0.272 108.
With `tidyselect` helpers:
tibble(chars = letters[1:10],
nums1 = rnorm(10, mean = 0, sd = 1)) %>%
mutate(nums2 = rnorm(n(), mean = 100, sd = 20),
# This one doesn't start with "nums" like the other two do, notice!
numbers3 = rnorm(n(), mean = 0, sd = 20)) %>%
summarize_at(vars(starts_with("nums")), funs(avg = mean(.),
median = median(.)))
## # A tibble: 1 x 4
## nums1_avg nums2_avg nums1_median nums2_median
## <dbl> <dbl> <dbl> <dbl>
## 1 0.0805 106. -0.00313 103.
Note that here, you have to wrap your `tidyselect` call (in this case `starts_with()`) inside a `vars()` call. `dplyr::vars()` is a helper function that lets you use `tidyselect` helper functions inside other `tidyverse` functions that aren't `select()`.
Grouping
`dplyr` supports grouping within dfs/tibbles. This means that, if groups are defined in your df, all operations that in some way operate along the rows of your df now operate independently for each group. Your df still looks the same, but this grouping metadata will affect the behavior of some other `dplyr` functions called on your data.
To group your data, call `dplyr::group_by()`, listing the variable(s) you want to group by.
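The chunk producing the output below appears to have been dropped in rendering; reconstructing from the `Groups: id [20]` line in the tibble header, it is simply:

```r
# Group the trialwise data by subject
raw %>%
  group_by(id)
```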
## # A tibble: 1,000 x 10
## # Groups: id [20]
## id age gender q_1 q_2 q_3 trial_num is_old response rt
## <int> <int> <chr> <dbl> <dbl> <dbl> <int> <int> <int> <dbl>
## 1 1 32 male 26.6 23.7 37.8 1 0 1 6.01
## 2 1 32 male 26.6 23.7 37.8 2 0 0 1.40
## 3 1 32 male 26.6 23.7 37.8 3 1 1 3.19
## 4 1 32 male 26.6 23.7 37.8 4 0 0 3.33
## 5 1 32 male 26.6 23.7 37.8 5 0 0 0.865
## 6 1 32 male 26.6 23.7 37.8 6 1 1 6.39
## 7 1 32 male 26.6 23.7 37.8 7 0 0 1.94
## 8 1 32 male 26.6 23.7 37.8 8 1 1 3.50
## 9 1 32 male 26.6 23.7 37.8 9 0 1 2.58
## 10 1 32 male 26.6 23.7 37.8 10 0 0 3.53
## # … with 990 more rows
You can group by as many variables as you would like. Each group will be composed of the observations belonging to one of the unique combinations of grouping variable levels present in your data.
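For example, grouping by subject and trial type yields one group per subject-by-`is_old` combination, which matches the `Groups: id, is_old [40]` header in the output below:

```r
# 20 subjects x 2 levels of is_old = 40 groups
raw %>%
  group_by(id, is_old)
```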
## # A tibble: 1,000 x 10
## # Groups: id, is_old [40]
## id age gender q_1 q_2 q_3 trial_num is_old response rt
## <int> <int> <chr> <dbl> <dbl> <dbl> <int> <int> <int> <dbl>
## 1 1 32 male 26.6 23.7 37.8 1 0 1 6.01
## 2 1 32 male 26.6 23.7 37.8 2 0 0 1.40
## 3 1 32 male 26.6 23.7 37.8 3 1 1 3.19
## 4 1 32 male 26.6 23.7 37.8 4 0 0 3.33
## 5 1 32 male 26.6 23.7 37.8 5 0 0 0.865
## 6 1 32 male 26.6 23.7 37.8 6 1 1 6.39
## 7 1 32 male 26.6 23.7 37.8 7 0 0 1.94
## 8 1 32 male 26.6 23.7 37.8 8 1 1 3.50
## 9 1 32 male 26.6 23.7 37.8 9 0 1 2.58
## 10 1 32 male 26.6 23.7 37.8 10 0 0 3.53
## # … with 990 more rows
To remove grouping from your df, use `ungroup()`. No arguments needed!
# Notice that the tibble output no longer tells you that groups are present in this df
raw %>%
group_by(id, is_old) %>%
ungroup()
## # A tibble: 1,000 x 10
## id age gender q_1 q_2 q_3 trial_num is_old response rt
## <int> <int> <chr> <dbl> <dbl> <dbl> <int> <int> <int> <dbl>
## 1 1 32 male 26.6 23.7 37.8 1 0 1 6.01
## 2 1 32 male 26.6 23.7 37.8 2 0 0 1.40
## 3 1 32 male 26.6 23.7 37.8 3 1 1 3.19
## 4 1 32 male 26.6 23.7 37.8 4 0 0 3.33
## 5 1 32 male 26.6 23.7 37.8 5 0 0 0.865
## 6 1 32 male 26.6 23.7 37.8 6 1 1 6.39
## 7 1 32 male 26.6 23.7 37.8 7 0 0 1.94
## 8 1 32 male 26.6 23.7 37.8 8 1 1 3.50
## 9 1 32 male 26.6 23.7 37.8 9 0 1 2.58
## 10 1 32 male 26.6 23.7 37.8 10 0 0 3.53
## # … with 990 more rows
Mutating grouped data
If we think back to the two types of vector operations `mutate()` supports, you can think about how grouping affects these operations.
- Grouping does not affect:
  - truly element-wise operations
  - operations returning a vector of length 1
- Grouping does affect:
  - operations using all values in the column (z-scoring, ranking, etc.)
  - operations referencing the number of rows in the df (e.g. `n()`)
For all operations that use all values in the column, grouping causes these operations to see each group as if it were its own independent df.
raw %>%
# Just to make the df narrower for printing
select(-starts_with("q")) %>%
mutate(rt_z_across = scale(rt)) %>%
group_by(id) %>%
mutate(rt_z_within = scale(rt))
## # A tibble: 1,000 x 9
## # Groups: id [20]
## id age gender trial_num is_old response rt rt_z_across[,1]
## <int> <int> <chr> <int> <int> <int> <dbl> <dbl>
## 1 1 32 male 1 0 1 6.01 1.97
## 2 1 32 male 2 0 0 1.40 -1.03
## 3 1 32 male 3 1 1 3.19 0.136
## 4 1 32 male 4 0 0 3.33 0.226
## 5 1 32 male 5 0 0 0.865 -1.38
## 6 1 32 male 6 1 1 6.39 2.22
## 7 1 32 male 7 0 0 1.94 -0.675
## 8 1 32 male 8 1 1 3.50 0.339
## 9 1 32 male 9 0 1 2.58 -0.263
## 10 1 32 male 10 0 0 3.53 0.360
## # … with 990 more rows, and 1 more variable: rt_z_within[,1] <dbl>
Here, `scale()` (a base R function to mean-center or z-score vectors) demonstrates different behavior when called on an ungrouped df (the first call) vs. on a grouped df (the second call). The first call z-scores across all observations in the vector, across subjects, while the second, grouped, call z-scores independently within each subject.
Sometimes you'll want to `mutate()` considering every observation in the df, and sometimes you'll want to `mutate()` independently for each group in your df. You can do both! Just be mindful of the grouping status of your df at all times.
Summarizing grouped data
Grouping is incredibly useful in conjunction with the `summarize()` functions, because it allows a smooth, modular way to specify when you want summary statistics calculated separately for every level of some grouping variable.
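For example, to calculate each subject's median RT (one summary row per subject, as in the output below):

```r
# Grouped by id, so median(rt) is computed once per subject
raw %>%
  group_by(id) %>%
  summarize(rt_median = median(rt))
```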
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 20 x 2
## id rt_median
## <int> <dbl>
## 1 1 2.98
## 2 2 2.96
## 3 3 3.30
## 4 4 2.59
## 5 5 3.29
## 6 6 3.02
## 7 7 3.08
## 8 8 2.95
## 9 9 3.54
## 10 10 3.82
## 11 11 2.85
## 12 12 2.48
## 13 13 2.74
## 14 14 3.51
## 15 15 2.78
## 16 16 2.85
## 17 17 3.00
## 18 18 3.07
## 19 19 2.98
## 20 20 2.63
Observe that `summarize()` outputs the grouping variable as well as the calculated summary statistic, because the values of the grouping variable are necessary to mark which summary statistic belongs to which grouping level.
We shouldn't be surprised by the output of `summarize()` when we feed in data that has been grouped by multiple variables.
# for example, if you wanted to tabulate summary statistics for multiple conditions within subject
raw %>%
group_by(id, is_old) %>%
summarize(rt_median = median(rt))
## `summarise()` regrouping output by 'id' (override with `.groups` argument)
## # A tibble: 40 x 3
## # Groups: id [20]
## id is_old rt_median
## <int> <int> <dbl>
## 1 1 0 2.66
## 2 1 1 3.37
## 3 2 0 2.67
## 4 2 1 3.15
## 5 3 0 2.53
## 6 3 1 3.83
## 7 4 0 2.58
## 8 4 1 2.68
## 9 5 0 3.44
## 10 5 1 2.69
## # … with 30 more rows
The special summarizing functions, like `summarize_if()` and `summarize_at()`, will also obey grouping.
When to ungroup your df
Some tidyverse functions either run more slowly on grouped data, or have unwanted behavior on grouped data. For the following functions, I recommend ungrouping your df before applying:
- `filter()`, because it runs rather slowly on grouped dfs
  - UNLESS! You may want to filter your df based on rankings within a grouping variable (e.g. the row with the minimum RT for every subject). In this case, something like `filter(rt == min(rt))` would return one row per group if called on a grouped df.
- `complete()` and `crossing()` from `tidyr` (more on these below)
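For instance, a minimal sketch of that grouped-filter exception, assuming the `raw` df used throughout (ties with the minimum would return more than one row per subject):

```r
# One row per subject (barring ties): each subject's fastest trial
raw %>%
  group_by(id) %>%
  filter(rt == min(rt)) %>%
  ungroup()
```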
Creating new rows
Repeating existing rows
This strategy is inspired by a Stack Overflow post that I can't seem to find now, but I find it useful when I'm generating simulated datasets with multilevel structure. Essentially, because `dplyr::slice()` and other numeric indexing methods (base R bracket indexing included) allow you to index the same position multiple times to get repeats in your output, you can repeat rows in your df by using `slice()` to index the same row multiple times.
If you want to repeat every row in your df, you can use the syntax below to `rep()` every index in your df (`1:n()`) `each` number of times. Specifying `each` instead of `times` ensures that all repeats of the same row stay next to each other, which is useful if you use this strategy to expand simulated person-level data to accommodate multiple observations per person.
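The chunk producing the output below seems to have been lost in rendering; a call that produces it is:

```r
# rep(1:n(), each = 3) yields row indices 1 1 1 2 2 2 ... for slice()
tibble(x = letters[1:5]) %>%
  slice(rep(1:n(), each = 3))
```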
## # A tibble: 15 x 1
## x
## <chr>
## 1 a
## 2 a
## 3 a
## 4 b
## 5 b
## 6 b
## 7 c
## 8 c
## 9 c
## 10 d
## 11 d
## 12 d
## 13 e
## 14 e
## 15 e
Completing combinations of grouping variables
There might be situations (particularly those about to be fed into a `summarize()` call) where you want to complete combinations of grouping variables that might not exist in your data. For example, you might be working with a task where each trial can have a correct or incorrect response, and you'd like to use `summarize()` to count up the total number of correct and incorrect responses per person. If you have some star subject who gets all their trials correct, there will be no instances of that subject paired with the "incorrect" response value, so instead of showing a count of 0, the summary will just be missing that row.
After you've called `summarize()`, you can use `tidyr::complete()` to create all the grouping variable combos you might have been missing (because they had no instances to count).
tibble(x = letters[1:5]) %>%
slice(rep(1:n(), each = 3)) %>%
mutate(y = rbinom(n(), 1, 0.5)) %>%
count(x, y) %>%
# complete() must be called on ungrouped data; otherwise it attempts to
# complete each group independently, which means it doesn't recognize when
# one group has levels of a variable that another group doesn't have
ungroup() %>%
complete(x, y, fill = list(n = 0L))
## # A tibble: 10 x 3
## x y n
## <chr> <int> <int>
## 1 a 0 1
## 2 a 1 2
## 3 b 0 3
## 4 b 1 0
## 5 c 0 0
## 6 c 1 3
## 7 d 0 1
## 8 d 1 2
## 9 e 0 2
## 10 e 1 1
More broadly, you can use `tidyr` functions to do what you might do using `base::expand.grid()`, but with a bit more control. `tidyr::crossing()` is directly analogous to `base::expand.grid()`, in that it will create every possible combination of the provided variable levels.
# in crossing(), vectors listed first will repeat slowest
crossing(x = letters[1:2], y = letters[3:4], z = 1:2)
## # A tibble: 8 x 3
## x y z
## <chr> <chr> <int>
## 1 a c 1
## 2 a c 2
## 3 a d 1
## 4 a d 2
## 5 b c 1
## 6 b c 2
## 7 b d 1
## 8 b d 2
Reshaping data
`tidyr` provides the functions you'll need to reshape your data. Once you get the hang of them, hopefully you'll find these functions easy to use, because they have essentially no ancillary arguments specifying how you want your data reshaped. Each of the tidy reshaping functions does only one thing; what they lose in flexibility they gain in predictability, so you should get either what you expect, or an error, when you call them.
From wide to long
`tidyr::pivot_longer()` is how we'll be taking data from wide form to long form. We'll take wide data that has different conditions of an observed/calculated variable in different columns, and make it long, so that all the values of the same variable are in one column, with another column indicating which condition each value comes from.
(If you want a deeper review of tidy data structure, and what wide form and long form mean, read this vignette.)
In this example, I'm first going to calculate some related summary statistics on reaction time to create wide data, because our original trialwise data is in long form.
summaries <- raw %>%
group_by(id) %>%
summarize(rt_min = min(rt),
rt_median = median(rt),
rt_max = max(rt)) %>%
# Note that I'm using sample_n() here to just make the output df shorter by selecting some subjs
sample_n(3)
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 4
## id rt_min rt_median rt_max
## <int> <dbl> <dbl> <dbl>
## 1 3 -0.198 3.30 6.31
## 2 6 -0.716 3.02 5.78
## 3 12 -1.11 2.48 5.11
You can see this df is wide, because different observations of summary RT (minimum, median, and maximum) are each in their own column. (This data does have the utility of having one row per subject. Wide data is not necessarily always bad!)
When we `pivot_longer()` this wide df and thus make it long, we expect to have one column for RT, and another column indicating whether that RT is a minimum, median, or maximum.
`pivot_longer()` takes three main arguments (peep the other ones in the docs if you want to futz around):
- `cols`: you specify all of the columns that actually contain values to be pivoted. Note that here, you can specify columns just as you would in `select()`, including using `tidyselect` helpers.
- `names_to`: you specify the name of the new column (doesn't exist yet!) containing your indicator variable for the conditions of your value variable
- `values_to`: you specify the name of the new column that will contain the values of the variable themselves
summaries_long <- summaries %>%
# if I wanted to use tidyselect helpers, I could use starts_with("rt") here
pivot_longer(cols = c(rt_min, rt_median, rt_max), names_to = "statistic", values_to = "rt")
summaries_long
## # A tibble: 9 x 3
## id statistic rt
## <int> <chr> <dbl>
## 1 3 rt_min -0.198
## 2 3 rt_median 3.30
## 3 3 rt_max 6.31
## 4 6 rt_min -0.716
## 5 6 rt_median 3.02
## 6 6 rt_max 5.78
## 7 12 rt_min -1.11
## 8 12 rt_median 2.48
## 9 12 rt_max 5.11
To trim prefixes off the key/indicator variable, you can use the `names_prefix` argument to tell R to drop a particular string from the beginning of each column name before pivoting.
summaries_long <- summaries %>%
pivot_longer(cols = starts_with("rt"), names_to = "statistic", values_to = "rt", names_prefix = "rt_")
summaries_long
## # A tibble: 9 x 3
## id statistic rt
## <int> <chr> <dbl>
## 1 3 min -0.198
## 2 3 median 3.30
## 3 3 max 6.31
## 4 6 min -0.716
## 5 6 median 3.02
## 6 6 max 5.78
## 7 12 min -1.11
## 8 12 median 2.48
## 9 12 max 5.11
Pivot-longering multiple distinct variables at once
You can also use `pivot_longer()` to pivot multiple value variables at once. If you have multiple unique variables in a wide df that are all bound by the same indicator conditions, you can modify the arguments of `pivot_longer()` and get the job done.
Below, I'm first going to create a summary df where we have observations of multiple variables (in this case, the three "questionnaire" scores), each with multiple observation conditions (the minimum and maximum score on that questionnaire within each gender).
summaries_qs <- raw %>%
select(id, gender, starts_with("q")) %>%
distinct(id, .keep_all = TRUE) %>%
group_by(gender) %>%
summarize_at(vars(starts_with("q")), funs(min = min(.),
max = max(.)))
summaries_qs
## # A tibble: 2 x 7
## gender q_1_min q_2_min q_3_min q_1_max q_2_max q_3_max
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 female 15.9 18.4 17.3 64.9 46.2 33.5
## 2 male 18.1 18.5 7.71 44.7 53.0 47.5
Say I want to get this into a long df where each questionnaire has its own column, and there's one indicator column for whether the score is a minimum or maximum, with that indicator corresponding across all three questionnaire columns. `pivot_longer()` in its usual form won't get me all the way there:
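With a single `names_to` column, the call looks like this (reconstructed to match the output below, assuming `summaries_qs` from above):

```r
# All six q_*_min / q_*_max columns collapse into one indicator column
summaries_qs %>%
  pivot_longer(cols = starts_with("q"),
               names_to = "extreme",
               values_to = "q_score")
```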
## # A tibble: 12 x 3
## gender extreme q_score
## <chr> <chr> <dbl>
## 1 female q_1_min 15.9
## 2 female q_2_min 18.4
## 3 female q_3_min 17.3
## 4 female q_1_max 64.9
## 5 female q_2_max 46.2
## 6 female q_3_max 33.5
## 7 male q_1_min 18.1
## 8 male q_2_min 18.5
## 9 male q_3_min 7.71
## 10 male q_1_max 44.7
## 11 male q_2_max 53.0
## 12 male q_3_max 47.5
You can see that the key/indicator column, `extreme`, actually contains information for two distinct variables: which questionnaire the score is from, AND which extreme the score is for. `pivot_longer()` can handle this, but we need to tell it where to get the names for all the new columns, since there's more than one value column now.
Here, we feed a vector into the `names_to` argument, telling it which part of the old column names should go into the names column (here `extreme`) and which part should be split out into value columns (with the special phrase `".value"`). Then we use `names_pattern` to specify regex capture groups that tell R how to break up the column names.
summaries_qs %>%
pivot_longer(cols = starts_with("q"),
names_to = c(".value", "extreme"),
names_pattern = "(._.*)_(.*)")
## # A tibble: 4 x 5
## gender extreme q_1 q_2 q_3
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 female min 15.9 18.4 17.3
## 2 female max 64.9 46.2 33.5
## 3 male min 18.1 18.5 7.71
## 4 male max 44.7 53.0 47.5
From long to wide
To go from long form to wide form, you'll use `tidyr::pivot_wider()`.
`pivot_wider()`'s main arguments are analogous to `pivot_longer()`'s, but not exactly the same, because you're doing something different:
- `names_from`: the name of the existing column that is the key/indicator column. You must specify this as a bare variable name, because you're referring to an existing column.
- `values_from`: the name of the existing value column. Again, specify as a bare variable name.
summaries_long <- summaries %>%
pivot_longer(cols = starts_with("rt"), names_to = "statistic", values_to = "rt")
summaries_long %>%
pivot_wider(names_from = statistic, values_from = rt)
## # A tibble: 3 x 4
## id rt_min rt_median rt_max
## <int> <dbl> <dbl> <dbl>
## 1 3 -0.198 3.30 6.31
## 2 6 -0.716 3.02 5.78
## 3 12 -1.11 2.48 5.11
`pivot_wider()` automatically sets the names of the new wide-split columns to whatever the indicator values were in the key column. Keep this in mind if you're modifying the indicator values to be cleaner: you may want to modify them again before you spread your data, to put the name of the value variable back into the indicator conditions.
For example, the below is nice in long form but a little less clear in wide form, because we now don't necessarily know what the min/median/max are calculated on:
summaries %>%
pivot_longer(cols = starts_with("rt"), names_to = "statistic", values_to = "rt", names_prefix = "rt_") %>%
pivot_wider(names_from = statistic, values_from = rt)
## # A tibble: 3 x 4
## id min median max
## <int> <dbl> <dbl> <dbl>
## 1 3 -0.198 3.30 6.31
## 2 6 -0.716 3.02 5.78
## 3 12 -1.11 2.48 5.11
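If you'd rather keep the trimmed indicator values in the long df, one option (a sketch relying on `pivot_wider()`'s `names_prefix` argument) is to prepend the value name again on the way back out:

```r
# names_prefix here ADDS "rt_" to each new wide column name,
# restoring rt_min / rt_median / rt_max
summaries %>%
  pivot_longer(cols = starts_with("rt"), names_to = "statistic",
               values_to = "rt", names_prefix = "rt_") %>%
  pivot_wider(names_from = statistic, values_from = rt,
              names_prefix = "rt_")
```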
Pivot-widering multiple distinct variables at once
Similarly to `pivot_longer()`, `pivot_wider()`'s default behavior is to let you pivot only one value column at a time. Again, if you want to widen multiple unique variables that have repeated measures over the same conditions, you can modify the arguments of `pivot_wider()` to achieve this.
With `pivot_wider()`, the main change is to pass either a vector of column names or a `tidyselect` call to `values_from`, instead of just one column. With multiple columns going in as the argument, `pivot_wider()` usually knows how to pivot those columns and construct their new names.
summaries_qs_long <- summaries_qs %>%
pivot_longer(cols = starts_with("q"),
names_to = c(".value", "extreme"),
names_pattern = "(._.*)_(.*)")
summaries_qs_long %>%
pivot_wider(names_from = extreme,
values_from = starts_with("q"))
## # A tibble: 2 x 7
## gender q_1_min q_1_max q_2_min q_2_max q_3_min q_3_max
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 female 15.9 64.9 18.4 46.2 17.3 33.5
## 2 male 18.1 44.7 18.5 53.0 7.71 47.5
If you want to change the separator character that splits up the new appended names, you can use the `names_sep` argument. As you can see here, the default separator is an underscore.
Coming up in part 2
The next walkthrough will get into working with nested dfs, custom-vectorizing functions using `purrr::map()`, using dfs to store multiple model objects, and more.
Next: Tidy Walkthrough, Part 2