Tidy Walkthrough, Part 1

Created by: Monica Thieu extra R


Goals for this vignette

  • Demonstrate (what I think are) key first-level features of the tidyverse
  • Illustrate psychology use cases for tidyverse functions
  • Hopefully convince you to join the cult of tidy

We shall do this by walking through various use cases you might encounter with your data. I’ll jump in with specific examples of functions when appropriate, to highlight solutions for particular situations you might run into with your own data.

Please keep in mind!

This vignette is not intended to be an exhaustive walkthrough of all functions from the indicated tidyverse packages.

For reference, please see the excellent reference pages online for the relevant package. For a more detailed self-teaching experience, please refer to the incredible R for Data Science online textbook by Garrett Grolemund and Hadley Wickham, which teaches using tidyverse functions.

Quick cheat-list of key functions from various pkgs

Remember that each of these pkgs contain other functions too! Often these other functions are more specific use cases of the primary functions that crop up often enough that you’d want a separate function for them.

To see all the functions in each pkg, visit their reference pages at tidyverse.org! I love these reference pages so much.

  • %>%
  • tidyr
    • gather() & spread() for making long and wide
    • separate() & unite() for fixing key columns (pre/post-processing for gather()/spread())
    • expand(), crossing(), etc.– like expand.grid() but a little more predictable
  • dplyr
    • select(), filter(), distinct() for subsetting and reordering dfs/tibbles
    • group_by()/ungroup() for setting groups
    • mutate() for your column manipulation workhorse
    • summarize() for squishing multiple rows down to summary statistics
    • arrange() to sort dfs/tibbles more easily

Brief intro to the pipe

Before we begin…

Enter the pipe %>%! The pipe does one simple, but key, thing: takes the object on the left-hand side and feeds it into the first argument of the function on the right-hand side. This means that:

  • a %>% foo() is equivalent to foo(a). Fine and good
  • a %>% foo() %>% bar(arg = TRUE) is equivalent to bar(foo(a), arg = TRUE). Now, nested function calls read left-to-right!
  • Most common use case: df_new <- df_old %>% foo() %>% bar(arg = TRUE) %>% baz() is equivalent to df_new <- baz(bar(foo(df_old), arg = TRUE)). Now, you can chain a series of preprocessing commands to operate on a dataframe all at once, and easily read those commands as typed in your script. No more accidentally not running some key preprocessing command that causes later code to break!

Note that #pipelife requires functions to:

  • take as their first argument the object to be operated upon
  • return the same object (or an analog of said), but now operated upon

Essentially all functions from the tidyverse are pipe-safe, but bear this in mind when trying to incorporate functions from base R or other packages into your tidy new world.

Initializing fake data

We’ll use a simulated dataset for this vignette, so you don’t need to worry about any dependencies involving datasets you don’t have access to while you’re following along.

Here, I’ll simulate a recognition memory task, where each fake subject sees a series of objects, some they’ve seen before in a learning phase, and some they haven’t. For each object, which is either old (they saw it before) or new (they did not see it before), subjects must respond “old” if they think they saw it before, or “new” if they think they didn’t see it before. I’ll also simulate some person-level demographic variables (age, gender) and trait variables (questionnaire scores).

If you have tidyverse loaded, all this code should run if you try to run it in your R console. I’m using functions that I’ll introduce later, so don’t worry if you don’t fully understand this right now! Feel free to skip past this.

# 20 fake subjects, 50 fake trials per subject

# Will simulate the person-level variables FIRST,
# then expand to simulate the trial-level variables
raw <- tibble(id = 1L:20L,
              age = sample(18L:35L, size = 20, replace = TRUE),
              # assuming binary gender for the purposes of this simulation
              gender = sample(c("male", "female"), size = 20, replace = TRUE)) %>%
  # simulating some "questionnaire" scores; person-level
  mutate(q_1 = rnorm(n = n(), mean = 30, sd = 10),
              q_2 = rnorm(n = n(), mean = 30, sd = 10),
              q_3 = rnorm(n = n(), mean = 30, sd = 10)) %>%
  # slice() subsets rows by position; you can use it to repeat rows by repeating position indices
  slice(rep(1:n(), each = 50)) %>%
  # We'll get to this in a bit--this causes every "group"
  # aka every set of rows with the same value for "id", to behave as an independent df
  group_by(id) %>%
  # I just want to have a column for "trial order", I like those in my task data
  mutate(trial_num = 1:n(),
         # Each subject sees half OLD and half NEW trials in this recognition memory task
         is_old = rep(0L:1L, times = n()/2),
         # I'm shuffling the order of "old" and "new" trials in my fake memory task
         is_old = sample(is_old),
         # This will generate binary "old"/"new" responses corresponding roughly to a d' of 1
         # yep, everyone has the same d' today
         response = if_else(is_old == 1,
                            rbinom(n = n(), size = 1, prob = 0.7),
                            rbinom(n = n(), size = 1, prob = 0.3)),
         rt = rnorm(n = n(), mean = 3, sd = 1.5)) %>%
  ungroup()

Loading data

In today’s example, we won’t actually be loading any data, because we’re generating simulated data. If you do have real data you’re looking to analyze, though, chances are you have it saved in tabular form (e.g. a CSV) and you need to read it into R. You might be using something like base::read.csv() to do this.

As it happens, the tidyverse features a package, readr, which provides file reading functions that operate in the same family as base::read.table(), base::read.csv(), etc. I like to use these file reading functions, readr::read_csv() (note that tidyverse functions tend to use underscores instead of periods!) in particular, because they run a little more quickly and predictably than their base R counterparts.

The command you’d use would look a little something like this, aka very similar to base R:

data <- read_csv("/path/to/data.csv")

Features of readr file reading functions include:

  • default args:
    • header always TRUE
    • stringsAsFactors always FALSE (better safe than sorry!)
  • attempts to guess the data type of each column it reads and tells you what data type it guessed for each one, or you can manually specify the data type of each column upon reading in the data
  • always returns a tibble

Now, there will be cases when you need the fine-grained control over delimiter assumptions, header assumptions, and such that the base R file reading functions afford you. Most of the time, though, I find that read_csv() suits my needs well.

PS: In our companion R package, cuPsychRTools, we provide helper functions like read_csv_multi(), which wraps read_csv() to allow you to specify a vector of files to read instead of just one file. This way, you can read in multiple data files that contain different observations of the same data (e.g. a psychological task run on multiple subjects, with each subject’s data in a separate file) and have them all output into one very long df!

Visually inspecting data

The tidyverse introduces the tibble (via the tibble package, which is loaded automatically when you load the whole tidyverse). You can think of a tibble as a dataframe, but a little better. One of the reasons I like to work with tibbles is that visually inspecting them using the console is quite easy:

raw
## # A tibble: 1,000 x 10
##       id   age gender   q_1   q_2   q_3 trial_num is_old response    rt
##    <int> <int> <chr>  <dbl> <dbl> <dbl>     <int>  <int>    <int> <dbl>
##  1     1    32 male    26.6  23.7  37.8         1      0        1 6.01 
##  2     1    32 male    26.6  23.7  37.8         2      0        0 1.40 
##  3     1    32 male    26.6  23.7  37.8         3      1        1 3.19 
##  4     1    32 male    26.6  23.7  37.8         4      0        0 3.33 
##  5     1    32 male    26.6  23.7  37.8         5      0        0 0.865
##  6     1    32 male    26.6  23.7  37.8         6      1        1 6.39 
##  7     1    32 male    26.6  23.7  37.8         7      0        0 1.94 
##  8     1    32 male    26.6  23.7  37.8         8      1        1 3.50 
##  9     1    32 male    26.6  23.7  37.8         9      0        1 2.58 
## 10     1    32 male    26.6  23.7  37.8        10      0        0 3.53 
## # … with 990 more rows

You can see that the default print output of tibbles is kind of like the output of str() on steroids.

Compared to a dataframe, a tibble:

  • has a cleaner & more informative print() method
    • automatically tells you the data type of each column
    • prints only the top 10 rows for longer tibbles (but you can customize how many rows show)
    • in some RStudio themes, prints numbers in certain colors to help you quickly identify negatives, decimals, etc
  • is stricter when you create tibbles using tibble(), or when you index (with [] or $)

There are a couple other useful things that tibbles do that dataframes don’t, but the strictness and the print method are the two most useful things that I find. As you’ll see, imposing a bit more strictness on R can have the beneficial side effect of minimizing buggy cases in your own code, since you’re forced to be more explicit.

By the way, note that packages within the tidyverse work together–raw rectangular data read in using readr is given the tibble class by default.

A tibble isn’t too far off from a dataframe in its underlying properties, though.

class(raw)
## [1] "tbl_df"     "tbl"        "data.frame"

"tbl_df" is the class name for tibbles. As you can see, tibbles still have the "data.frame" class, so they can do most anything dataframes do. (There are the occasional older non-tidyverse functions that get confused when they see tibbles, and that’s okay, you can always call as.data.frame(your_tibble) if you need to in a pinch.)

Non-tibble dataframes can be coerced to tibble using as_tibble(). This shouldn’t change any of the existing data type of any of the columns.

For the rest of this vignette, I will refer to data of a tabular structure (dataframe or tibble) as a df. Know that while the tidyverse is tibble-optimized, everything WORKS on a dataframe, so you can still think of your data structures as dataframes. Hence, I’ll use the term df to help you with mental consistency.

Subsetting data

We’ll deal with subsetting dfs two ways: choosing a subset of columns and a subset of rows.

Subsetting by column

dplyr::select() is the workhorse function for subsetting a df by column. You simply need to name the columns you would like to select!

Note that these are unquoted variable names and not a character vector. Tidyverse functions allow you to call column names bare as variables inside of them.

# This auto-indenting biz after the pipe works because the pipe is a left-right operator
# Remember that raw is getting piped into select() as the first argument
# so the only enumerated arguments inside select() are the columns of interest,
# which is kind of nice!
raw %>%
  select(id, age, gender, trial_num, is_old, response, rt)
## # A tibble: 1,000 x 7
##       id   age gender trial_num is_old response    rt
##    <int> <int> <chr>      <int>  <int>    <int> <dbl>
##  1     1    32 male           1      0        1 6.01 
##  2     1    32 male           2      0        0 1.40 
##  3     1    32 male           3      1        1 3.19 
##  4     1    32 male           4      0        0 3.33 
##  5     1    32 male           5      0        0 0.865
##  6     1    32 male           6      1        1 6.39 
##  7     1    32 male           7      0        0 1.94 
##  8     1    32 male           8      1        1 3.50 
##  9     1    32 male           9      0        1 2.58 
## 10     1    32 male          10      0        0 3.53 
## # … with 990 more rows

Note that you can use select() to rename columns as well; simply put the new name in the argument name spot, and the old name in the argument value spot.

# Renaming columns is so smooth in the tidyverse!
raw %>%
  select(subject_id = id, age, gender, trial_num, is_old, response, rt)
## # A tibble: 1,000 x 7
##    subject_id   age gender trial_num is_old response    rt
##         <int> <int> <chr>      <int>  <int>    <int> <dbl>
##  1          1    32 male           1      0        1 6.01 
##  2          1    32 male           2      0        0 1.40 
##  3          1    32 male           3      1        1 3.19 
##  4          1    32 male           4      0        0 3.33 
##  5          1    32 male           5      0        0 0.865
##  6          1    32 male           6      1        1 6.39 
##  7          1    32 male           7      0        0 1.94 
##  8          1    32 male           8      1        1 3.50 
##  9          1    32 male           9      0        1 2.58 
## 10          1    32 male          10      0        0 3.53 
## # … with 990 more rows

There also exists a version of select() called rename() which behaves exactly as select() does, except that rename() never subsets columns, and always gives you all the columns of your original df (renamed or otherwise).

You can specify columns in select() by naming each individual column you want, or you can select groups of columns in the following ways.

You can use the colon : to select all columns between the ones named, kind of like when creating integer vectors using the colon:

raw %>%
  select(id:gender)
## # A tibble: 1,000 x 3
##       id   age gender
##    <int> <int> <chr> 
##  1     1    32 male  
##  2     1    32 male  
##  3     1    32 male  
##  4     1    32 male  
##  5     1    32 male  
##  6     1    32 male  
##  7     1    32 male  
##  8     1    32 male  
##  9     1    32 male  
## 10     1    32 male  
## # … with 990 more rows

You can also use any one of a series of special helper functions provided by dplyr to select columns in a regex-style way. Like so:

raw %>%
  select(id, starts_with("q"))
## # A tibble: 1,000 x 4
##       id   q_1   q_2   q_3
##    <int> <dbl> <dbl> <dbl>
##  1     1  26.6  23.7  37.8
##  2     1  26.6  23.7  37.8
##  3     1  26.6  23.7  37.8
##  4     1  26.6  23.7  37.8
##  5     1  26.6  23.7  37.8
##  6     1  26.6  23.7  37.8
##  7     1  26.6  23.7  37.8
##  8     1  26.6  23.7  37.8
##  9     1  26.6  23.7  37.8
## 10     1  26.6  23.7  37.8
## # … with 990 more rows

Check the the reference page for tidyselect::select_helpers to see all of these helper functions that are available to you.

Note that the order in which you specify columns/helper calls determines the ordering of the columns in your output.

raw %>%
  select(gender, age, id)
## # A tibble: 1,000 x 3
##    gender   age    id
##    <chr>  <int> <int>
##  1 male      32     1
##  2 male      32     1
##  3 male      32     1
##  4 male      32     1
##  5 male      32     1
##  6 male      32     1
##  7 male      32     1
##  8 male      32     1
##  9 male      32     1
## 10 male      32     1
## # … with 990 more rows

In this way, you can explicitly use select() to reorder columns without dropping columns. To do this, use the tidyselect helper function everything():

raw %>%
  select(id, gender, everything())
## # A tibble: 1,000 x 10
##       id gender   age   q_1   q_2   q_3 trial_num is_old response    rt
##    <int> <chr>  <int> <dbl> <dbl> <dbl>     <int>  <int>    <int> <dbl>
##  1     1 male      32  26.6  23.7  37.8         1      0        1 6.01 
##  2     1 male      32  26.6  23.7  37.8         2      0        0 1.40 
##  3     1 male      32  26.6  23.7  37.8         3      1        1 3.19 
##  4     1 male      32  26.6  23.7  37.8         4      0        0 3.33 
##  5     1 male      32  26.6  23.7  37.8         5      0        0 0.865
##  6     1 male      32  26.6  23.7  37.8         6      1        1 6.39 
##  7     1 male      32  26.6  23.7  37.8         7      0        0 1.94 
##  8     1 male      32  26.6  23.7  37.8         8      1        1 3.50 
##  9     1 male      32  26.6  23.7  37.8         9      0        1 2.58 
## 10     1 male      32  26.6  23.7  37.8        10      0        0 3.53 
## # … with 990 more rows

everything() essentially returns all the names of the other, not-explicitly-named columns, and keeps them all in their original order minus the columns you explicitly specified. You can use this syntax to pull a selected number of columns to the far left of your df, and then leave all other columns in their previous order.

You can also use select() to keep all columns except ones you specify. For this, use the - operator like so:

raw %>%
  select(-gender)
## # A tibble: 1,000 x 9
##       id   age   q_1   q_2   q_3 trial_num is_old response    rt
##    <int> <int> <dbl> <dbl> <dbl>     <int>  <int>    <int> <dbl>
##  1     1    32  26.6  23.7  37.8         1      0        1 6.01 
##  2     1    32  26.6  23.7  37.8         2      0        0 1.40 
##  3     1    32  26.6  23.7  37.8         3      1        1 3.19 
##  4     1    32  26.6  23.7  37.8         4      0        0 3.33 
##  5     1    32  26.6  23.7  37.8         5      0        0 0.865
##  6     1    32  26.6  23.7  37.8         6      1        1 6.39 
##  7     1    32  26.6  23.7  37.8         7      0        0 1.94 
##  8     1    32  26.6  23.7  37.8         8      1        1 3.50 
##  9     1    32  26.6  23.7  37.8         9      0        1 2.58 
## 10     1    32  26.6  23.7  37.8        10      0        0 3.53 
## # … with 990 more rows

You can combine the - with the : operator, or with one of the tidyselect helpers we saw before:

raw %>%
  select(-(trial_num:rt))
## # A tibble: 1,000 x 6
##       id   age gender   q_1   q_2   q_3
##    <int> <int> <chr>  <dbl> <dbl> <dbl>
##  1     1    32 male    26.6  23.7  37.8
##  2     1    32 male    26.6  23.7  37.8
##  3     1    32 male    26.6  23.7  37.8
##  4     1    32 male    26.6  23.7  37.8
##  5     1    32 male    26.6  23.7  37.8
##  6     1    32 male    26.6  23.7  37.8
##  7     1    32 male    26.6  23.7  37.8
##  8     1    32 male    26.6  23.7  37.8
##  9     1    32 male    26.6  23.7  37.8
## 10     1    32 male    26.6  23.7  37.8
## # … with 990 more rows
raw %>%
  select(-starts_with("q"))
## # A tibble: 1,000 x 7
##       id   age gender trial_num is_old response    rt
##    <int> <int> <chr>      <int>  <int>    <int> <dbl>
##  1     1    32 male           1      0        1 6.01 
##  2     1    32 male           2      0        0 1.40 
##  3     1    32 male           3      1        1 3.19 
##  4     1    32 male           4      0        0 3.33 
##  5     1    32 male           5      0        0 0.865
##  6     1    32 male           6      1        1 6.39 
##  7     1    32 male           7      0        0 1.94 
##  8     1    32 male           8      1        1 3.50 
##  9     1    32 male           9      0        1 2.58 
## 10     1    32 male          10      0        0 3.53 
## # … with 990 more rows

Subsetting by row

There are a number of dplyr functions which will, in some way, reduce the number of rows in your df.

Logical subsetting

filter() will return a df subset for only those rows that return TRUE for the logical statements specified in its arguments.

Notice that, as with other functions in the tidyverse, you call df columns as bare variable names, without needing $ or [] indexing.

raw %>%
  filter(id == 13)
## # A tibble: 50 x 10
##       id   age gender   q_1   q_2   q_3 trial_num is_old response    rt
##    <int> <int> <chr>  <dbl> <dbl> <dbl>     <int>  <int>    <int> <dbl>
##  1    13    35 female  46.7  38.9  17.3         1      0        1 2.74 
##  2    13    35 female  46.7  38.9  17.3         2      0        1 2.95 
##  3    13    35 female  46.7  38.9  17.3         3      0        0 2.10 
##  4    13    35 female  46.7  38.9  17.3         4      0        1 2.76 
##  5    13    35 female  46.7  38.9  17.3         5      1        1 3.08 
##  6    13    35 female  46.7  38.9  17.3         6      0        0 2.36 
##  7    13    35 female  46.7  38.9  17.3         7      0        0 3.00 
##  8    13    35 female  46.7  38.9  17.3         8      1        1 0.946
##  9    13    35 female  46.7  38.9  17.3         9      1        1 1.60 
## 10    13    35 female  46.7  38.9  17.3        10      1        1 6.23 
## # … with 40 more rows

filter() takes logical statements with & and | in them like you might use in bracket indexing.

# would I ordinarily use a %in% call for this?
# Yes I would, but this is just to show how the or operator | works here
raw %>%
  filter(id == 13 | id == 17)
## # A tibble: 100 x 10
##       id   age gender   q_1   q_2   q_3 trial_num is_old response    rt
##    <int> <int> <chr>  <dbl> <dbl> <dbl>     <int>  <int>    <int> <dbl>
##  1    13    35 female  46.7  38.9  17.3         1      0        1 2.74 
##  2    13    35 female  46.7  38.9  17.3         2      0        1 2.95 
##  3    13    35 female  46.7  38.9  17.3         3      0        0 2.10 
##  4    13    35 female  46.7  38.9  17.3         4      0        1 2.76 
##  5    13    35 female  46.7  38.9  17.3         5      1        1 3.08 
##  6    13    35 female  46.7  38.9  17.3         6      0        0 2.36 
##  7    13    35 female  46.7  38.9  17.3         7      0        0 3.00 
##  8    13    35 female  46.7  38.9  17.3         8      1        1 0.946
##  9    13    35 female  46.7  38.9  17.3         9      1        1 1.60 
## 10    13    35 female  46.7  38.9  17.3        10      1        1 6.23 
## # … with 90 more rows

filter() also has a nice feature: specifying separate logical statements as separate arguments, separated by commas, implicitly acts like chaining the statements together with &, since the intersection of logical vectors is probably the most common use case for row subsetting.

raw %>%
  select(id, age, gender, trial_num, is_old, response, rt) %>%
  filter(id == 13, is_old == 1)
## # A tibble: 25 x 7
##       id   age gender trial_num is_old response    rt
##    <int> <int> <chr>      <int>  <int>    <int> <dbl>
##  1    13    35 female         5      1        1 3.08 
##  2    13    35 female         8      1        1 0.946
##  3    13    35 female         9      1        1 1.60 
##  4    13    35 female        10      1        1 6.23 
##  5    13    35 female        11      1        1 2.29 
##  6    13    35 female        12      1        1 1.58 
##  7    13    35 female        13      1        0 0.903
##  8    13    35 female        16      1        1 0.606
##  9    13    35 female        17      1        1 2.53 
## 10    13    35 female        19      1        1 5.20 
## # … with 15 more rows

Subsetting rows by position

Should you find this necessary, you can subset rows by numeric index using dplyr::slice(). It takes any numeric vector of valid indices as input.

raw %>%
  slice(351:360)
## # A tibble: 10 x 10
##       id   age gender   q_1   q_2   q_3 trial_num is_old response    rt
##    <int> <int> <chr>  <dbl> <dbl> <dbl>     <int>  <int>    <int> <dbl>
##  1     8    32 male    38.6  31.7  24.8         1      1        1  2.61
##  2     8    32 male    38.6  31.7  24.8         2      1        1  1.63
##  3     8    32 male    38.6  31.7  24.8         3      0        0  2.68
##  4     8    32 male    38.6  31.7  24.8         4      0        0  4.30
##  5     8    32 male    38.6  31.7  24.8         5      0        1  2.59
##  6     8    32 male    38.6  31.7  24.8         6      0        0  1.35
##  7     8    32 male    38.6  31.7  24.8         7      1        0  4.76
##  8     8    32 male    38.6  31.7  24.8         8      1        0  2.10
##  9     8    32 male    38.6  31.7  24.8         9      1        1  3.92
## 10     8    32 male    38.6  31.7  24.8        10      1        1  3.65

Getting only unique/un-duplicated rows

Another function that technically “subsets” data, but not logically: distinct() is the dplyr function for removing rows with duplicate observations.

raw %>%
  # First, I'm removing columns with trialwise observations so that all remaining values are in fact duplicated
  select(id, gender, age) %>%
  # the .keep_all argument specifies whether to drop columns that aren't being distinct-ed by
  distinct(id, .keep_all = TRUE)
## # A tibble: 20 x 3
##       id gender   age
##    <int> <chr>  <int>
##  1     1 male      32
##  2     2 male      31
##  3     3 female    32
##  4     4 female    25
##  5     5 male      23
##  6     6 male      29
##  7     7 male      25
##  8     8 male      32
##  9     9 female    18
## 10    10 male      19
## 11    11 male      20
## 12    12 female    25
## 13    13 female    35
## 14    14 female    28
## 15    15 male      33
## 16    16 male      29
## 17    17 male      27
## 18    18 female    35
## 19    19 male      31
## 20    20 male      21

You can, of course, call distinct() on multiple columns, which will retain every unique combination of values in said columns once.

raw %>%
  # note that all other columns are now DROPPED
  distinct(gender, age)
## # A tibble: 15 x 2
##      age gender
##    <int> <chr> 
##  1    32 male  
##  2    31 male  
##  3    32 female
##  4    25 female
##  5    23 male  
##  6    29 male  
##  7    25 male  
##  8    18 female
##  9    19 male  
## 10    20 male  
## 11    35 female
## 12    28 female
## 13    33 male  
## 14    27 male  
## 15    21 male

Subsetting random rows

This is something you might need to do on occasion. For example, sometimes when I want to plot single-subject data for just a few subjects from a dataset, I want to select a random subset of subjects so I’m not always just looking at the subjects who happened to be at the top of the df for whatever reason.

dplyr contains the functions sample_n() and sample_frac(), which wrap base::sample() to let you quickly get a random subset of rows, specified either by number of rows to get or by fraction of total rows to get.

Note that these work best if your df has one row per subject, because these behave a very particular way on grouped dfs (more on these later).

raw %>%
  # this will create a df with only one row per subject
  distinct(id, gender, age) %>%
  sample_n(5)
## # A tibble: 5 x 3
##      id   age gender
##   <int> <int> <chr> 
## 1    17    27 male  
## 2    15    33 male  
## 3    12    25 female
## 4    16    29 male  
## 5     7    25 male

Subsetting by row and column simultaneously

Thanks to the pipe %>%, you can chain a select() call with a filter() (or otherwise) call to subset by row and by column in the same command.

In this case, order doesn’t super matter, so it’s a matter of personal preference as to which you ought to call first. I would generally recommend putting all renaming calls (if you are renaming any columns) as early as possible to allow the rest of the code to be consistent with itself re: column names.

raw %>%
  select(id, age, gender, trial_num, is_old, response, rt) %>%
  filter(id == 13)
## # A tibble: 50 x 7
##       id   age gender trial_num is_old response    rt
##    <int> <int> <chr>      <int>  <int>    <int> <dbl>
##  1    13    35 female         1      0        1 2.74 
##  2    13    35 female         2      0        1 2.95 
##  3    13    35 female         3      0        0 2.10 
##  4    13    35 female         4      0        1 2.76 
##  5    13    35 female         5      1        1 3.08 
##  6    13    35 female         6      0        0 2.36 
##  7    13    35 female         7      0        0 3.00 
##  8    13    35 female         8      1        1 0.946
##  9    13    35 female         9      1        1 1.60 
## 10    13    35 female        10      1        1 6.23 
## # … with 40 more rows

Manipulating data

dplyr is the home for most of the functions of the tidyverse that deal with manipulating your data in some kind of way–creating new data, calculating summary data, etc. We’ll tour these functions by breaking them down into their use cases.

Creating/modifying columns

dplyr::mutate() will be your workhorse function for creating new columns or modifying existing ones in a df.

Inside of every mutate() call, as with previous tidyverse functions, column names are fed in as symbols (like variables, without quotes) without needing to $ index them from the parent df every time.

raw %>%
  # just to make the output narrower
  select(-starts_with("q")) %>%
  mutate(rt_ms = round(rt * 1000))
## # A tibble: 1,000 x 8
##       id   age gender trial_num is_old response    rt rt_ms
##    <int> <int> <chr>      <int>  <int>    <int> <dbl> <dbl>
##  1     1    32 male           1      0        1 6.01   6012
##  2     1    32 male           2      0        0 1.40   1404
##  3     1    32 male           3      1        1 3.19   3188
##  4     1    32 male           4      0        0 3.33   3326
##  5     1    32 male           5      0        0 0.865   865
##  6     1    32 male           6      1        1 6.39   6389
##  7     1    32 male           7      0        0 1.94   1943
##  8     1    32 male           8      1        1 3.50   3500
##  9     1    32 male           9      0        1 2.58   2575
## 10     1    32 male          10      0        0 3.53   3532
## # … with 990 more rows

You can do anything inside a mutate() call that:

  • is totally element-wise
  • returns a vector of length 1 (this will recycle)
  • returns a vector of the same length as the total number of rows

Like tibble() for creating tibbles, mutate() will not recycle vectors longer than 1 but shorter than the input df. I like this behavior, as it has protected me on many occasions from creating columns that didn’t contain what I meant them to! If you do have a vector that you WANT recycled into a column, I would feed that vector into rep() so that you can explicitly create an output vector of the right length.

# This won't run
raw %>%
  distinct(id, gender, age) %>%
  mutate(group = c("treatment", "control"))
# This will run, because I've wrapped the offending vector in rep() to force it to be the right length
# you can use any of the rep() args to control the length of the output vector
# I'll explain n() in a little bit also
raw %>%
  distinct(id, gender, age) %>%
  mutate(group = rep(c("treatment", "control"), length.out = n()))
## # A tibble: 20 x 4
##       id   age gender group    
##    <int> <int> <chr>  <chr>    
##  1     1    32 male   treatment
##  2     2    31 male   control  
##  3     3    32 female treatment
##  4     4    25 female control  
##  5     5    23 male   treatment
##  6     6    29 male   control  
##  7     7    25 male   treatment
##  8     8    32 male   control  
##  9     9    18 female treatment
## 10    10    19 male   control  
## 11    11    20 male   treatment
## 12    12    25 female control  
## 13    13    35 female treatment
## 14    14    28 female control  
## 15    15    33 male   treatment
## 16    16    29 male   control  
## 17    17    27 male   treatment
## 18    18    35 female control  
## 19    19    31 male   treatment
## 20    20    21 male   control

Nice helpers

This section features a bunch of handy functions you can use inside mutate() (but not exclusively! they work on any old vector too) to do all sorts of magical things. Again, this is not an exhaustive walkthrough of all the tidy functions you might find useful–this is a limited subset of functions/cases that I find myself running into regularly.

n()

n() is a shortcut (only usable inside other tidyverse functions, alas) that returns the number of rows in the current df. I find it a lot safer than using length(my_df) inside of calls operating on my_df (for example).

You can call it inside of other functions inside of mutate() as well.

raw %>%
  # again, to make the print output narrower
  select(-starts_with("q")) %>%
  mutate(nrows = n())
## # A tibble: 1,000 x 8
##       id   age gender trial_num is_old response    rt nrows
##    <int> <int> <chr>      <int>  <int>    <int> <dbl> <int>
##  1     1    32 male           1      0        1 6.01   1000
##  2     1    32 male           2      0        0 1.40   1000
##  3     1    32 male           3      1        1 3.19   1000
##  4     1    32 male           4      0        0 3.33   1000
##  5     1    32 male           5      0        0 0.865  1000
##  6     1    32 male           6      1        1 6.39   1000
##  7     1    32 male           7      0        0 1.94   1000
##  8     1    32 male           8      1        1 3.50   1000
##  9     1    32 male           9      0        1 2.58   1000
## 10     1    32 male          10      0        0 3.53   1000
## # … with 990 more rows

Observe that the value of the column nrows is equal to the number of rows reported in the tibble output. You can use this number for all sorts of stuff–I frequently find myself creating a helper column in my dfs with the row index, by doing something like my_df %>% mutate(index = 1:n()).

recode()

recode() will point-replace specified values in a vector with other values. It’s most useful when you would have otherwise used a series of nested if-else calls to reassign old values of your column to new values.

It has a particular syntax: the old values must be argument names (on the left of the =) and formatted as bare variable names, and the new values can be specified as strings or numbers respectively.

As with other functions in the tidyverse, recode() won’t let you try to put substitute values of different data types inside of it–all new values must be the same type.

You also don’t have to replace every existing value (you can leave values the way they are by just not listing them in the recode() call), but again, all values in the output vector must be the same data type.

raw %>%
  mutate(gender = recode(gender,
                         male = "M",
                         female = "F"))
## # A tibble: 1,000 x 10
##       id   age gender   q_1   q_2   q_3 trial_num is_old response    rt
##    <int> <int> <chr>  <dbl> <dbl> <dbl>     <int>  <int>    <int> <dbl>
##  1     1    32 M       26.6  23.7  37.8         1      0        1 6.01 
##  2     1    32 M       26.6  23.7  37.8         2      0        0 1.40 
##  3     1    32 M       26.6  23.7  37.8         3      1        1 3.19 
##  4     1    32 M       26.6  23.7  37.8         4      0        0 3.33 
##  5     1    32 M       26.6  23.7  37.8         5      0        0 0.865
##  6     1    32 M       26.6  23.7  37.8         6      1        1 6.39 
##  7     1    32 M       26.6  23.7  37.8         7      0        0 1.94 
##  8     1    32 M       26.6  23.7  37.8         8      1        1 3.50 
##  9     1    32 M       26.6  23.7  37.8         9      0        1 2.58 
## 10     1    32 M       26.6  23.7  37.8        10      0        0 3.53 
## # … with 990 more rows

Note that to call recode() on a numeric, or otherwise not variable-name-safe, input vector, you have to use backtick `` quotes to force the non-variable-able values to render as variables.

raw %>%
  mutate(is_old = recode(is_old,
                         `1` = "old",
                         `0` = "new"))
## # A tibble: 1,000 x 10
##       id   age gender   q_1   q_2   q_3 trial_num is_old response    rt
##    <int> <int> <chr>  <dbl> <dbl> <dbl>     <int> <chr>     <int> <dbl>
##  1     1    32 male    26.6  23.7  37.8         1 new           1 6.01 
##  2     1    32 male    26.6  23.7  37.8         2 new           0 1.40 
##  3     1    32 male    26.6  23.7  37.8         3 old           1 3.19 
##  4     1    32 male    26.6  23.7  37.8         4 new           0 3.33 
##  5     1    32 male    26.6  23.7  37.8         5 new           0 0.865
##  6     1    32 male    26.6  23.7  37.8         6 old           1 6.39 
##  7     1    32 male    26.6  23.7  37.8         7 new           0 1.94 
##  8     1    32 male    26.6  23.7  37.8         8 old           1 3.50 
##  9     1    32 male    26.6  23.7  37.8         9 new           1 2.58 
## 10     1    32 male    26.6  23.7  37.8        10 new           0 3.53 
## # … with 990 more rows

if_else() and case_when()

dplyr::if_else() (again, observe the underscore in the function name) is a stricter & faster version of base::ifelse(), your workhorse conditional vector modifying function. In line with the tidy way, if_else() will yell at you if your values for true and false aren’t the same data type, to try to protect you from accidental data typing disaster.

raw %>%
  distinct(id, age, gender) %>%
  # I know being over 30 isn't OLD. Just for illustration!
  mutate(age_bin = if_else(age >= 30,
                           "old",
                           "young"))
## # A tibble: 20 x 4
##       id   age gender age_bin
##    <int> <int> <chr>  <chr>  
##  1     1    32 male   old    
##  2     2    31 male   old    
##  3     3    32 female old    
##  4     4    25 female young  
##  5     5    23 male   young  
##  6     6    29 male   young  
##  7     7    25 male   young  
##  8     8    32 male   old    
##  9     9    18 female young  
## 10    10    19 male   young  
## 11    11    20 male   young  
## 12    12    25 female young  
## 13    13    35 female old    
## 14    14    28 female young  
## 15    15    33 male   old    
## 16    16    29 male   young  
## 17    17    27 male   young  
## 18    18    35 female old    
## 19    19    31 male   old    
## 20    20    21 male   young

dplyr::case_when() is a function that lets you create a vector based on a switch statement. A switch statement is essentially a neater if-else if-else if-… statement, that lets you specify a series of conditions, one of which you’d expect to be true for each piece of data going in.

Notice the particular syntax that case_when() expects to let it do its job: the logical statement is specified on the left, then a tilde ~, then the desired output value on the right. The last logical statement, TRUE ~ ..., functions as a final “else” statement in this large if-else chain, which assigns a default output to any values that aren’t TRUE for any other conditions in the statement. I like to have the final case either be NA, or the existing value of the old vector (if overwriting an existing vector).

Below is one of my common use cases for case_when()–to bin numeric values. (I prefer using this to base::cut() because I find that this reads much more cleanly to my eyes.)

raw %>%
  distinct(id, age, gender) %>%
  mutate(age_bin = case_when(age >= 30 ~ "old",
                             age > 22 & age < 30 ~ "medium",
                             age <= 22 ~ "young",
                             TRUE ~ NA_character_))
## # A tibble: 20 x 4
##       id   age gender age_bin
##    <int> <int> <chr>  <chr>  
##  1     1    32 male   old    
##  2     2    31 male   old    
##  3     3    32 female old    
##  4     4    25 female medium 
##  5     5    23 male   medium 
##  6     6    29 male   medium 
##  7     7    25 male   medium 
##  8     8    32 male   old    
##  9     9    18 female young  
## 10    10    19 male   young  
## 11    11    20 male   young  
## 12    12    25 female medium 
## 13    13    35 female old    
## 14    14    28 female medium 
## 15    15    33 male   old    
## 16    16    29 male   medium 
## 17    17    27 male   medium 
## 18    18    35 female old    
## 19    19    31 male   old    
## 20    20    21 male   young

case_when() lets you specify all manner of logical statements, so you can let your heart go wild!

# You will probably never need this specific coding
# This is just to show how you can have slightly more complex logical statements
raw %>%
  distinct(id, age, gender) %>%
  mutate(age_gender = case_when(age >= 30 & gender == "male" ~ "old male",
                                age >= 30 & gender == "female" ~ "old female",
                                gender == "female" ~ "female",
                                TRUE ~ NA_character_))
## # A tibble: 20 x 4
##       id   age gender age_gender
##    <int> <int> <chr>  <chr>     
##  1     1    32 male   old male  
##  2     2    31 male   old male  
##  3     3    32 female old female
##  4     4    25 female female    
##  5     5    23 male   <NA>      
##  6     6    29 male   <NA>      
##  7     7    25 male   <NA>      
##  8     8    32 male   old male  
##  9     9    18 female female    
## 10    10    19 male   <NA>      
## 11    11    20 male   <NA>      
## 12    12    25 female female    
## 13    13    35 female old female
## 14    14    28 female female    
## 15    15    33 male   old male  
## 16    16    29 male   <NA>      
## 17    17    27 male   <NA>      
## 18    18    35 female old female
## 19    19    31 male   old male  
## 20    20    21 male   <NA>

Important: case_when() evaluates its logical statements in the order you specify them, just like an if-else if statement. If you have some values for which multiple of your conditions are true, the values will be assigned to the first true condition listed in your statement. Observe the behavior of the following call, versus the one immediately above:

# Now, ALL the values of age_gender where gender == "female" will be "female",
# because that condition was satisfied BEFORE the case_when() statement could check
# the age >= 30 & gender == "female" statement
raw %>%
  distinct(id, age, gender) %>%
  mutate(age_gender = case_when(gender == "female" ~ "female",
                                age >= 30 & gender == "male" ~ "old male",
                                age >= 30 & gender == "female" ~ "old female",
                                TRUE ~ NA_character_))
## # A tibble: 20 x 4
##       id   age gender age_gender
##    <int> <int> <chr>  <chr>     
##  1     1    32 male   old male  
##  2     2    31 male   old male  
##  3     3    32 female female    
##  4     4    25 female female    
##  5     5    23 male   <NA>      
##  6     6    29 male   <NA>      
##  7     7    25 male   <NA>      
##  8     8    32 male   old male  
##  9     9    18 female female    
## 10    10    19 male   <NA>      
## 11    11    20 male   <NA>      
## 12    12    25 female female    
## 13    13    35 female female    
## 14    14    28 female female    
## 15    15    33 male   old male  
## 16    16    29 male   <NA>      
## 17    17    27 male   <NA>      
## 18    18    35 female female    
## 19    19    31 male   old male  
## 20    20    21 male   <NA>

As long as you remember that case_when() statements check conditions sequentially, and you’re deliberate when you specify conditions that might simultaneously be true, you should be okay.

By the way, notice that a case_when() statement for the case where every logical condition is testing for equality with == is equivalent to calling recode() to do the same thing.

raw %>%
  distinct(id, age, gender) %>%
  mutate(gender = case_when(gender == "male" ~ "M",
                            gender == "female" ~ "F",
                            TRUE ~ gender))
## # A tibble: 20 x 3
##       id   age gender
##    <int> <int> <chr> 
##  1     1    32 M     
##  2     2    31 M     
##  3     3    32 F     
##  4     4    25 F     
##  5     5    23 M     
##  6     6    29 M     
##  7     7    25 M     
##  8     8    32 M     
##  9     9    18 F     
## 10    10    19 M     
## 11    11    20 M     
## 12    12    25 F     
## 13    13    35 F     
## 14    14    28 F     
## 15    15    33 M     
## 16    16    29 M     
## 17    17    27 M     
## 18    18    35 F     
## 19    19    31 M     
## 20    20    21 M

The above case_when() call is equivalent to the following recode() call:

raw %>%
  distinct(id, age, gender) %>%
  mutate(gender = recode(gender,
                         male = "M",
                         female = "F"))
## # A tibble: 20 x 3
##       id   age gender
##    <int> <int> <chr> 
##  1     1    32 M     
##  2     2    31 M     
##  3     3    32 F     
##  4     4    25 F     
##  5     5    23 M     
##  6     6    29 M     
##  7     7    25 M     
##  8     8    32 M     
##  9     9    18 F     
## 10    10    19 M     
## 11    11    20 M     
## 12    12    25 F     
## 13    13    35 F     
## 14    14    28 F     
## 15    15    33 M     
## 16    16    29 M     
## 17    17    27 M     
## 18    18    35 F     
## 19    19    31 M     
## 20    20    21 M

The recode() call is a lot less verbose. In the case where you can use recode() in place of case_when(), I recommend that. In situations where you need the full flexibility that case_when() affords you, then continue on your merry way!

coalesce() and na_if()

One thing you might find yourself doing a lot with if_else() calls is replacing values in a column with NA, or replacing NA values in a column with something else. dplyr helper functions are here to help you save a few keystrokes!

To replace a particular value in a column with NA, use the aptly named na_if().

raw %>%
  distinct(id, gender) %>%
  # In this example, I'm saving the result into a new column,
  # but you could easily overwrite the contents of an old column
  mutate(female = na_if(gender, "male"))
## # A tibble: 20 x 3
##       id gender female
##    <int> <chr>  <chr> 
##  1     1 male   <NA>  
##  2     2 male   <NA>  
##  3     3 female female
##  4     4 female female
##  5     5 male   <NA>  
##  6     6 male   <NA>  
##  7     7 male   <NA>  
##  8     8 male   <NA>  
##  9     9 female female
## 10    10 male   <NA>  
## 11    11 male   <NA>  
## 12    12 female female
## 13    13 female female
## 14    14 female female
## 15    15 male   <NA>  
## 16    16 male   <NA>  
## 17    17 male   <NA>  
## 18    18 female female
## 19    19 male   <NA>  
## 20    20 male   <NA>

Bear in mind that na_if() is a specialized case of if_else(column == value_to_remove, NA, column), and so you should use if_else() if you want to replace things using any comparison other than direct equality.

To replace missing values in a column with non-missing values, use coalesce(). You can use this a few ways.

First off, you can replace all missing values with the same new value.

raw %>%
  distinct(id, gender) %>%
  mutate(female = na_if(gender, "male"),
         # See that I'm calling coalesce() on the column with NAs I just created
         gender2 = coalesce(female, "male"))
## # A tibble: 20 x 4
##       id gender female gender2
##    <int> <chr>  <chr>  <chr>  
##  1     1 male   <NA>   male   
##  2     2 male   <NA>   male   
##  3     3 female female female 
##  4     4 female female female 
##  5     5 male   <NA>   male   
##  6     6 male   <NA>   male   
##  7     7 male   <NA>   male   
##  8     8 male   <NA>   male   
##  9     9 female female female 
## 10    10 male   <NA>   male   
## 11    11 male   <NA>   male   
## 12    12 female female female 
## 13    13 female female female 
## 14    14 female female female 
## 15    15 male   <NA>   male   
## 16    16 male   <NA>   male   
## 17    17 male   <NA>   male   
## 18    18 female female female 
## 19    19 male   <NA>   male   
## 20    20 male   <NA>   male

You can also use coalesce() to zip together a series of vectors, where every time one has a missing value the other one has a non-missing value.

raw %>%
  distinct(id, gender) %>%
  mutate(female = na_if(gender, "male"),
         male = na_if(gender, "female"),
         # notice that coalesce is being called on two columns,
         # specified as bare variables
         gender2 = coalesce(female, male))
## # A tibble: 20 x 5
##       id gender female male  gender2
##    <int> <chr>  <chr>  <chr> <chr>  
##  1     1 male   <NA>   male  male   
##  2     2 male   <NA>   male  male   
##  3     3 female female <NA>  female 
##  4     4 female female <NA>  female 
##  5     5 male   <NA>   male  male   
##  6     6 male   <NA>   male  male   
##  7     7 male   <NA>   male  male   
##  8     8 male   <NA>   male  male   
##  9     9 female female <NA>  female 
## 10    10 male   <NA>   male  male   
## 11    11 male   <NA>   male  male   
## 12    12 female female <NA>  female 
## 13    13 female female <NA>  female 
## 14    14 female female <NA>  female 
## 15    15 male   <NA>   male  male   
## 16    16 male   <NA>   male  male   
## 17    17 male   <NA>   male  male   
## 18    18 female female <NA>  female 
## 19    19 male   <NA>   male  male   
## 20    20 male   <NA>   male  male

Here’s a more explicit example of using coalesce() to zip together more than two vectors:

coalesce(c(1, NA, NA, NA, NA),
         c(NA, NA, NA, NA, 5),
         c(NA, NA, NA, 4, NA),
         c(NA, NA, 3, NA, NA),
         c(NA, 2, NA, NA, NA))
## [1] 1 2 3 4 5

As with many other tidy functions, coalesce() obeys order, so if there are cases where more than one of your input vectors have a non-missing value, your output vector will take the value of the first vector with an non-missing value.

coalesce(c(1, NA, NA, NA, NA),
         c(NA, 3, NA, NA, 5),
         c(NA, NA, NA, 4, NA),
         c(NA, NA, 3, NA, NA),
         c(NA, 2, NA, NA, NA))
## [1] 1 3 3 4 5

stringr

If you don’t already use stringr, it’s a fantastic package in the tidyverse that makes string manipulation a breeze. Detecting string matches, subsetting strings, removing white space, you name it, stringr probably has a function for it!

There won’t be any code examples here because there are so many functions in this package that you might want to use that I don’t know which ones to show. If you want to peruse what the package has to offer, please check out the official stringr cheatsheet to get the full experience!

Calculating summary statistics

Handily, dplyr’s general function for calculating summary statistics is summarize()! You can use this to generate any single summary value on a vector you can think of–mean, median, SD, you name it.

What can you use to generate a summary statistic? In this case, any function that operates on a vector (or multiple vectors) and returns a single value (length 1) as output. Gotta be length 1!

In summarize(), as in mutate(), the argument name will be the name of the new column generated by the code provided in the argument value.

(If you DON’T specify the column name (by leaving the argument name empty), the code you provided to calculate the column will automatically be substituted in as the name for the column. Don’t do this…)

# Just trying not to overload any function names
summaries <- raw %>%
  summarize(rt_mean = mean(rt),
            rt_median = median(rt))

summaries
## # A tibble: 1 x 2
##   rt_mean rt_median
##     <dbl>     <dbl>
## 1    2.98      2.99

Special versions of summarize()

summarize() lets you custom-specify every summary statistic you want to calculate, which is very nice indeed. Sometimes, however, you may have cases where you would like to calculate the same statistic(s) on very many variables, and you wouldn’t like to type each of those out. We can handle this!

The two functions I use most often for this purpose are summarize_if() and summarize_at().

summarize_if() lets you specify summary functions, and then applies those functions to every column in your starting df that meets some logical criteria. For example, let’s say you want to calculate the mean and median of every column in a df that contains numeric data.

# Creating one character column and three numeric columns so you can see that the character column shan't appear in the output of summarize_if()
tibble(chars = letters[1:10],
       nums1 = rnorm(10, mean = 0, sd = 1)) %>%
  mutate(nums2 = rnorm(n(), mean = 100, sd = 20),
         nums3 = rnorm(n(), mean = 0, sd = 20)) %>%
  summarize_if(is.numeric, funs(avg = mean(.),
                                median = median(.)))
## Warning: `funs()` is deprecated as of dplyr 0.8.0.
## Please use a list of either functions or lambdas: 
## 
##   # Simple named list: 
##   list(mean = mean, median = median)
## 
##   # Auto named with `tibble::lst()`: 
##   tibble::lst(mean, median)
## 
##   # Using lambdas
##   list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
## # A tibble: 1 x 6
##   nums1_avg nums2_avg nums3_avg nums1_median nums2_median nums3_median
##       <dbl>     <dbl>     <dbl>        <dbl>        <dbl>        <dbl>
## 1    -0.173      93.8     -2.28        0.124         90.2        -5.79

The first true argument of summarize_if() is .predicate, where you specify the tester function that should return TRUE or FALSE for every column in your starter df.

The next argument is .funs, where you specify the summary function(s) you would like done to all of the columns for which .predicate returns TRUE. dplyr provides the helper function funs() to generate a list of function calls in the right format to be fed into summarize_if(). All you need is to specify the input function calls, each separated by a comma. The period . in the function calls represents “this df in which we are working”. This allows you to tell tidyverse functions which argument in your summary functions will receive the data from the starter df that’s being summarized. This is useful in cases where you have a summary function where the data isn’t the first argument; this way you can make sure the data to be summarized doesn’t get sent into the wrong argument and gum up the works.

You can see that funs() allows you to name the function calls you enter inside it, and that summarize_if() automatically appends the provided names of functions to the output cols.

The other specialty summarizing function, summarize_at(), does not use a logical predicate function to choose columns to summarize, but instead allows you to specify by name all the columns to summarize.

Instead of specifying a function in .predicate, that argument is replaced with .vars, where you specify the variables you want. You can do this a couple of ways:

With a character vector explicitly naming each column you want as a string:

tibble(chars = letters[1:10],
       nums1 = rnorm(10, mean = 0, sd = 1)) %>%
  mutate(nums2 = rnorm(n(), mean = 100, sd = 20),
         nums3 = rnorm(n(), mean = 0, sd = 20)) %>%
  summarize_at(c("nums1", "nums2"), funs(avg = mean(.),
                                         median = median(.)))
## # A tibble: 1 x 4
##   nums1_avg nums2_avg nums1_median nums2_median
##       <dbl>     <dbl>        <dbl>        <dbl>
## 1     0.142      102.        0.272         108.

With tidyselect helpers:

tibble(chars = letters[1:10],
       nums1 = rnorm(10, mean = 0, sd = 1)) %>%
  mutate(nums2 = rnorm(n(), mean = 100, sd = 20),
         # This one doesn't start with "nums" like the other two do, notice!
         numbers3 = rnorm(n(), mean = 0, sd = 20)) %>%
  summarize_at(vars(starts_with("nums")), funs(avg = mean(.),
                                               median = median(.)))
## # A tibble: 1 x 4
##   nums1_avg nums2_avg nums1_median nums2_median
##       <dbl>     <dbl>        <dbl>        <dbl>
## 1    0.0805      106.     -0.00313         103.

Note that here, you have to wrap your tidyselect call (in this case starts_with()) inside a vars() call. dplyr::vars() is a helper function that lets you use tidyselect helper functions inside other tidyverse functions that aren’t select().

Grouping

dplyr supports grouping within dfs/tibbles. This means that, if groups are defined in your df, all operations that in some way operate along the nrows of your df now operate independently for each group. Your df still looks the same, but this grouping metadata will affect the behavior of some other dplyr functions called on your data.

To group your data, call dplyr::group_by(), listing the variable(s) you want to group by.

# grouping by one variable
raw %>%
  group_by(id)
## # A tibble: 1,000 x 10
## # Groups:   id [20]
##       id   age gender   q_1   q_2   q_3 trial_num is_old response    rt
##    <int> <int> <chr>  <dbl> <dbl> <dbl>     <int>  <int>    <int> <dbl>
##  1     1    32 male    26.6  23.7  37.8         1      0        1 6.01 
##  2     1    32 male    26.6  23.7  37.8         2      0        0 1.40 
##  3     1    32 male    26.6  23.7  37.8         3      1        1 3.19 
##  4     1    32 male    26.6  23.7  37.8         4      0        0 3.33 
##  5     1    32 male    26.6  23.7  37.8         5      0        0 0.865
##  6     1    32 male    26.6  23.7  37.8         6      1        1 6.39 
##  7     1    32 male    26.6  23.7  37.8         7      0        0 1.94 
##  8     1    32 male    26.6  23.7  37.8         8      1        1 3.50 
##  9     1    32 male    26.6  23.7  37.8         9      0        1 2.58 
## 10     1    32 male    26.6  23.7  37.8        10      0        0 3.53 
## # … with 990 more rows

You can group by as many variables as you would like. Each group will be composed of the observations belonging to one of the unique combinations of grouping variable levels present in your data.

raw %>%
  group_by(id, is_old)
## # A tibble: 1,000 x 10
## # Groups:   id, is_old [40]
##       id   age gender   q_1   q_2   q_3 trial_num is_old response    rt
##    <int> <int> <chr>  <dbl> <dbl> <dbl>     <int>  <int>    <int> <dbl>
##  1     1    32 male    26.6  23.7  37.8         1      0        1 6.01 
##  2     1    32 male    26.6  23.7  37.8         2      0        0 1.40 
##  3     1    32 male    26.6  23.7  37.8         3      1        1 3.19 
##  4     1    32 male    26.6  23.7  37.8         4      0        0 3.33 
##  5     1    32 male    26.6  23.7  37.8         5      0        0 0.865
##  6     1    32 male    26.6  23.7  37.8         6      1        1 6.39 
##  7     1    32 male    26.6  23.7  37.8         7      0        0 1.94 
##  8     1    32 male    26.6  23.7  37.8         8      1        1 3.50 
##  9     1    32 male    26.6  23.7  37.8         9      0        1 2.58 
## 10     1    32 male    26.6  23.7  37.8        10      0        0 3.53 
## # … with 990 more rows

To remove grouping from your df, use ungroup(). No arguments needed!

# Notice that the tibble output no longer tells you that groups are present in this df
raw %>%
  group_by(id, is_old) %>%
  ungroup()
## # A tibble: 1,000 x 10
##       id   age gender   q_1   q_2   q_3 trial_num is_old response    rt
##    <int> <int> <chr>  <dbl> <dbl> <dbl>     <int>  <int>    <int> <dbl>
##  1     1    32 male    26.6  23.7  37.8         1      0        1 6.01 
##  2     1    32 male    26.6  23.7  37.8         2      0        0 1.40 
##  3     1    32 male    26.6  23.7  37.8         3      1        1 3.19 
##  4     1    32 male    26.6  23.7  37.8         4      0        0 3.33 
##  5     1    32 male    26.6  23.7  37.8         5      0        0 0.865
##  6     1    32 male    26.6  23.7  37.8         6      1        1 6.39 
##  7     1    32 male    26.6  23.7  37.8         7      0        0 1.94 
##  8     1    32 male    26.6  23.7  37.8         8      1        1 3.50 
##  9     1    32 male    26.6  23.7  37.8         9      0        1 2.58 
## 10     1    32 male    26.6  23.7  37.8        10      0        0 3.53 
## # … with 990 more rows

Mutating grouped data

If we think back to the two types of vector operations mutate() supports, you can think about how grouping affects these operations.

  • Grouping does not affect:
    • truly element-wise operations
    • operations returning a vector length 1
  • Grouping does affect:
    • operations using all values in the column (z-scoring, ranking, etc)
    • operations referencing the number of rows in the df (e.g. n())

For all operations that use all values in the column, grouping causes these operations to see each group as if it were its own independent df.

raw %>%
  # Just to make the df narrower for printing
  select(-starts_with("q")) %>%
  mutate(rt_z_across = scale(rt)) %>%
  group_by(id) %>%
  mutate(rt_z_within = scale(rt))
## # A tibble: 1,000 x 9
## # Groups:   id [20]
##       id   age gender trial_num is_old response    rt rt_z_across[,1]
##    <int> <int> <chr>      <int>  <int>    <int> <dbl>           <dbl>
##  1     1    32 male           1      0        1 6.01            1.97 
##  2     1    32 male           2      0        0 1.40           -1.03 
##  3     1    32 male           3      1        1 3.19            0.136
##  4     1    32 male           4      0        0 3.33            0.226
##  5     1    32 male           5      0        0 0.865          -1.38 
##  6     1    32 male           6      1        1 6.39            2.22 
##  7     1    32 male           7      0        0 1.94           -0.675
##  8     1    32 male           8      1        1 3.50            0.339
##  9     1    32 male           9      0        1 2.58           -0.263
## 10     1    32 male          10      0        0 3.53            0.360
## # … with 990 more rows, and 1 more variable: rt_z_within[,1] <dbl>

Here, scale() (a base R function to mean-center or z-score vectors) demonstrates different behavior when called on an ungrouped df (the first call) vs on a grouped df (the second call). In this case, the first call z-scores across all observations in the vector, across subject, while the second, grouped, call z-scores independently within each subject.

Sometimes, you’ll want to mutate() considering every observation in the df, and sometimes you’ll want to mutate() independently for each group in your df. You can do both! Just be mindful of the grouping status of your df at all times.

Summarizing grouped data

Grouping is incredibly useful in conjunction with the summarize() functions, because it allows a smooth, modular way to specify when you want summary statistics calculated separately for every level of some grouping variable.

raw %>%
  group_by(id) %>%
  summarize(rt_median = median(rt))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 20 x 2
##       id rt_median
##    <int>     <dbl>
##  1     1      2.98
##  2     2      2.96
##  3     3      3.30
##  4     4      2.59
##  5     5      3.29
##  6     6      3.02
##  7     7      3.08
##  8     8      2.95
##  9     9      3.54
## 10    10      3.82
## 11    11      2.85
## 12    12      2.48
## 13    13      2.74
## 14    14      3.51
## 15    15      2.78
## 16    16      2.85
## 17    17      3.00
## 18    18      3.07
## 19    19      2.98
## 20    20      2.63

Observe that the grouping variable is output by summarize() as well as the calculated summary statistic, because the values of the grouping variable are necessary to mark out which summary statistic belongs to which grouping levels.

We shouldn’t be surprised about the output of summarize() when we feed in data that has been grouped by multiple variables.

# for example, if you wanted to tabulate summary statistics for multiple conditions within subject
raw %>%
  group_by(id, is_old) %>%
  summarize(rt_median = median(rt))
## `summarise()` regrouping output by 'id' (override with `.groups` argument)
## # A tibble: 40 x 3
## # Groups:   id [20]
##       id is_old rt_median
##    <int>  <int>     <dbl>
##  1     1      0      2.66
##  2     1      1      3.37
##  3     2      0      2.67
##  4     2      1      3.15
##  5     3      0      2.53
##  6     3      1      3.83
##  7     4      0      2.58
##  8     4      1      2.68
##  9     5      0      3.44
## 10     5      1      2.69
## # … with 30 more rows

The special summarizing functions, like summarize_if() and summarize_at(), will also obey grouping.

When to ungroup your df

Some tidyverse functions either run more slowly on grouped data, or have unwanted behavior on grouped data. For the following functions, I would recommend ungrouping your df before applying:

  • filter(), because it runs rather slowly on grouped dfs
    • UNLESS! You may want to filter your df based on rankings within a grouping variable (e.g. the row with the minimum RT for every subject). In this case, something like filter(rt == min(rt)) would return one row per group if called on a grouped df.
  • complete() and crossing() from tidyr (more on these below)

Creating new rows

Repeating existing rows

This strategy is inspired by a Stack Overflow post that I can’t seem to find now, but I find this useful when I am generating simulated datasets with multilevel structure. Essentially, because dplyr::slice() and other numeric indexing methods (base R bracket indexing included) allow you to index the same position multiple times to get repeats in your output, you can repeat rows in your df by using slice() to index the same row multiple times.

If you want to repeat every row in your df, you can use the below syntax to rep() every index in your df (1:n()) each number of times. Specifying each instead of times ensures that all repeats of the same row stay next to each other, which is useful if you use this strategy to expand simulated person-level data to accommodate multiple observations per person.

tibble(x = letters[1:5]) %>%
  slice(rep(1:n(), each = 3))
## # A tibble: 15 x 1
##    x    
##    <chr>
##  1 a    
##  2 a    
##  3 a    
##  4 b    
##  5 b    
##  6 b    
##  7 c    
##  8 c    
##  9 c    
## 10 d    
## 11 d    
## 12 d    
## 13 e    
## 14 e    
## 15 e

Completing combinations of grouping variables

There might be situations (particularly, those that are about to be fed into a summarize() call) where you want to complete combinations of grouping variables that might not exist in your data. For example, you might be working with a task where each trial can have a correct or incorrect response, and you’d like to use summarize() to count up the total number of correct and incorrect responses per person. If you have some star subject who gets all their trials correct, there will be no instances of that subject and the “incorrect” response value, so instead of showing a count of 0 rows, the summary will just be missing that row.

After you’ve called summarize(), you can use tidyr::complete() to create all the grouping variable combos you might have been missing (because they had no instances to count).

tibble(x = letters[1:5]) %>%
  slice(rep(1:n(), each = 3)) %>%
  mutate(y = rbinom(n(), 1, 0.5)) %>%
  count(x, y) %>%
  # complete() must be called on ungrouped data, otherwise it attempts to complete the groups independently, which means it doesn't recognize when one group has levels of a variable that another group doesn't have
  ungroup() %>%
  complete(x, y, fill = list(n = 0L))
## # A tibble: 10 x 3
##    x         y     n
##    <chr> <int> <int>
##  1 a         0     1
##  2 a         1     2
##  3 b         0     3
##  4 b         1     0
##  5 c         0     0
##  6 c         1     3
##  7 d         0     1
##  8 d         1     2
##  9 e         0     2
## 10 e         1     1

More broadly, you can use tidyr functions to do what you might do using base::expand.grid(), but with a bit more control. tidyr::crossing() is directly analogous to base::expand.grid(), in that it will create every possible combination of each of the provided variable levels.

# in crossing(), vectors listed first will repeat slowest
crossing(x = letters[1:2], y = letters[3:4], z = 1:2)
## # A tibble: 8 x 3
##   x     y         z
##   <chr> <chr> <int>
## 1 a     c         1
## 2 a     c         2
## 3 a     d         1
## 4 a     d         2
## 5 b     c         1
## 6 b     c         2
## 7 b     d         1
## 8 b     d         2

Reshaping data

tidyr provides the functions you’ll need to reshape your data. Once you get the hang of them, hopefully you’ll find the functions easy to use, because they have essentially no ancillary arguments specifying how you want your data reshaped. Each of the tidy reshaping functions does only one thing, so what they lose in flexibility they gain in predictability, so you should get either what you expect, or an error, when you call these functions.

From wide to long

tidyr::pivot_longer() is how we’ll be taking data from wide form to long form. We’ll take wide data that has different conditions of an observed/calculated variable in different columns, and make it long, so that all those values of the same variable are in one column, with another column indicating what condition that value comes form.

(If you want a deeper review on tidy data structure, and what wide form and long form mean, read this vignette.)

In this example, I’m first going to calculate some related summary statistics on reaction time to create wide data, because our original trialwise data is in long form.

summaries <- raw %>%
  group_by(id) %>%
  summarize(rt_min = min(rt),
            rt_median = median(rt),
            rt_max = max(rt)) %>%
  # Note that I'm using sample_n() here to just make the output df shorter by selecting some subjs
  sample_n(3)
## `summarise()` ungrouping output (override with `.groups` argument)
summaries
## # A tibble: 3 x 4
##      id rt_min rt_median rt_max
##   <int>  <dbl>     <dbl>  <dbl>
## 1     3 -0.198      3.30   6.31
## 2     6 -0.716      3.02   5.78
## 3    12 -1.11       2.48   5.11

You can see this df is wide, because different observations of summary RT (minimum, median, and maximum) are each in their own column. (This data does have the utility of having one row per subject. Wide data is not necessarily always bad!)

When we gather() this wide df and thus make it long, we expect to have one column for RT, and another column indicating whether that RT is a minimum, median, or maximum.

gather() takes three main arguments (peep the other ones in the docs if you want to futz around):

  • key: you specify the name of the new column (doesn’t exist yet!) containing your indicator variable for the conditions of your value variable
  • value: you specify the name of the column that contains values of the variable themselves
  • ...: you specify all of the columns that actually contain values to be gather()-ed. Note that here, you can specify columns just as you would in select(), including using tidyselect() helpers.
summaries_long <- summaries %>%
  # if I wanted to use tidyselect helpers, I could use starts_with("rt") here
  pivot_longer(cols = c(rt_min, rt_median, rt_max), names_to = "statistic", values_to = "rt")

summaries_long
## # A tibble: 9 x 3
##      id statistic     rt
##   <int> <chr>      <dbl>
## 1     3 rt_min    -0.198
## 2     3 rt_median  3.30 
## 3     3 rt_max     6.31 
## 4     6 rt_min    -0.716
## 5     6 rt_median  3.02 
## 6     6 rt_max     5.78 
## 7    12 rt_min    -1.11 
## 8    12 rt_median  2.48 
## 9    12 rt_max     5.11

To trim off prefixes of key/indicator variable, you can use the names_prefix argument to tell R to drop a particular string from the beginning of each column name before pivoting.

summaries_long <- summaries %>%
  pivot_longer(cols = starts_with("rt"), names_to = "statistic", values_to = "rt", names_prefix = "rt_")

summaries_long
## # A tibble: 9 x 3
##      id statistic     rt
##   <int> <chr>      <dbl>
## 1     3 min       -0.198
## 2     3 median     3.30 
## 3     3 max        6.31 
## 4     6 min       -0.716
## 5     6 median     3.02 
## 6     6 max        5.78 
## 7    12 min       -1.11 
## 8    12 median     2.48 
## 9    12 max        5.11

Pivot-longering multiple distinct variables at once

You can also use pivot_longer() to pivot multiple value variables at once. If you have multiple unique variables in a wide df that are all bound by the same indicator conditions, you can modify the arguments of pivot_longer() and get the job done.

Below, I’m first going to create a summary df where we have observations of multiple variables (in this case, the three “questionnaire” scores), and each has multiple observation conditions (the minimum and maximum score on that questionnaire within each gender).

summaries_qs <- raw %>%
  select(id, gender, starts_with("q")) %>%
  distinct(id, .keep_all = TRUE) %>%
  group_by(gender) %>%
  summarize_at(vars(starts_with("q")), funs(min = min(.),
                                            max = max(.)))

summaries_qs
## # A tibble: 2 x 7
##   gender q_1_min q_2_min q_3_min q_1_max q_2_max q_3_max
##   <chr>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 female    15.9    18.4   17.3     64.9    46.2    33.5
## 2 male      18.1    18.5    7.71    44.7    53.0    47.5

Say I want to get this into a long df where each questionnaire has its own column, and there’s one indicator column for whether the score is a minimum or maximum score, and that indicator corresponds across all three questionnaire columns.

pivot_longer() in its usual won’t get me all the way there:

summaries_qs %>%
  pivot_longer(cols = starts_with("q"), names_to = "extreme", values_to = "q_score")
## # A tibble: 12 x 3
##    gender extreme q_score
##    <chr>  <chr>     <dbl>
##  1 female q_1_min   15.9 
##  2 female q_2_min   18.4 
##  3 female q_3_min   17.3 
##  4 female q_1_max   64.9 
##  5 female q_2_max   46.2 
##  6 female q_3_max   33.5 
##  7 male   q_1_min   18.1 
##  8 male   q_2_min   18.5 
##  9 male   q_3_min    7.71
## 10 male   q_1_max   44.7 
## 11 male   q_2_max   53.0 
## 12 male   q_3_max   47.5

You can see that the key/indicator column, extreme, actually contains information for two distinct variables: which questionnaire the score is from, AND which extreme the score is for. pivot_longer() can handle this, but we need to tell it where to get the names for all the new columns from, since there’s more than one value column now.

Here, we feed a vector into the names_to argument, telling it which part of the old column names should go into the names column (here extreme) and which should be split out into value columns (with the special phrase ".value"). Then we use names_pattern to specify capture groups to tell R how to break up the column names.

summaries_qs %>%
  pivot_longer(cols = starts_with("q"),
               names_to = c(".value", "extreme"),
               names_pattern = "(._.*)_(.*)")
## # A tibble: 4 x 5
##   gender extreme   q_1   q_2   q_3
##   <chr>  <chr>   <dbl> <dbl> <dbl>
## 1 female min      15.9  18.4 17.3 
## 2 female max      64.9  46.2 33.5 
## 3 male   min      18.1  18.5  7.71
## 4 male   max      44.7  53.0 47.5

From long to wide

To go from long form to wide form, you’ll use tidyr::pivot_wider().

pivot_wider()’s main arguments are analogous to pivot_longer()’s, but not exactly the same because you’re doing something different:

  • names_from: the name of the existing column that is the key/indicator column. You must specify this as a bare variable name, because you’re referring to an existing column.
  • values_from: the name of the existing value column. Again, specify as a bare variable name.
summaries_long <- summaries %>%
  pivot_longer(cols = starts_with("rt"), names_to = "statistic", values_to = "rt")

summaries_long %>%
  pivot_wider(names_from = statistic, values_from = rt)
## # A tibble: 3 x 4
##      id rt_min rt_median rt_max
##   <int>  <dbl>     <dbl>  <dbl>
## 1     3 -0.198      3.30   6.31
## 2     6 -0.716      3.02   5.78
## 3    12 -1.11       2.48   5.11

pivot_wider() automatically sets the column names of the new wide-split columns to whatever the indicator values were in the key column. Keep this in mind if you’re modifying the indicator values to be cleaner–you may want to modify them again if you plan to spread your data, to put the name of the value back into the indicator conditions.

For example, the below is nice in long form but a little less clear in wide form, because we now don’t necessarily know what the min/median/max are calculated on:

summaries %>%
  pivot_longer(cols = starts_with("rt"), names_to = "statistic", values_to = "rt", names_prefix = "rt_") %>%
  pivot_wider(names_from = statistic, values_from = rt)
## # A tibble: 3 x 4
##      id    min median   max
##   <int>  <dbl>  <dbl> <dbl>
## 1     3 -0.198   3.30  6.31
## 2     6 -0.716   3.02  5.78
## 3    12 -1.11    2.48  5.11

Pivot-widering multiple distinct variables at once

Similarly to pivot_longer(), pivot_wider()’s default behavior is to only let you pivot one value column at a time. Again, if you want to widen multiple unique variables that have repeated measures over the same conditions, you can modify the arguments of pivot_wider() to achieve this.

With pivot_wider(), the main thing you change is to pass either a vector of column names or a tidyselect call to values_from, instead of just one column. With multiple columns going in as the argument, usually pivot_wider() knows how to pivot those columns and make their new names.

summaries_qs_long <- summaries_qs %>%
  pivot_longer(cols = starts_with("q"),
               names_to = c(".value", "extreme"),
               names_pattern = "(._.*)_(.*)")

summaries_qs_long %>%
  pivot_wider(names_from = extreme,
              values_from = starts_with("q"))
## # A tibble: 2 x 7
##   gender q_1_min q_1_max q_2_min q_2_max q_3_min q_3_max
##   <chr>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 female    15.9    64.9    18.4    46.2   17.3     33.5
## 2 male      18.1    44.7    18.5    53.0    7.71    47.5

If you want to change the separator character that splits up the new appended names, you can use the names_sep argument. As you can see here, the default separator is an underscore.

Coming up in part 2

The next walkthrough will get into working with nested dfs, custom-vectorizing functions using purrr::map(), using dfs to store multiple model objects, and more.

Next: Tidy Walkthrough, Part 2