Data Manipulation

core R


Goals of This Lesson

  1. Learn how to group and categorize data in R
  2. Learn how to generate descriptive statistics in R

Load Data

This dataset examines the relationship between multitasking and working memory (see the original paper by Uncapher et al., 2016).

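# use the read_csv() function (from readr, part of the tidyverse) to read the data into R
# here() comes from the here package and builds the file path from the project root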
df <- read_csv(here("content", "tutorials", "r-core", "3-datamanipulation", "uncapher_2016_repeated_measures_dataset.csv"))

# we will continue using tidyverse tools
# so we have library()-ed tidyverse at the very top of this document

Familiarizing Yourself with the Data

Quick review from Data Cleaning: take a look at the basic data structure, number of rows and columns.

# base R functions
df
## # A tibble: 136 x 17
##    subjNum groupStatus numDist conf  hitCount allOldCount rtHit faCount
##      <dbl> <chr>         <dbl> <chr>    <dbl>       <dbl> <dbl>   <dbl>
##  1       6 HMM               0 hi          18          25 0.991       3
##  2       6 HMM               6 hi          14          25 0.952       3
##  3      10 HMM               0 hi           5          25 1.08        8
##  4      10 HMM               6 hi           5          25 1.00        8
##  5      14 HMM               0 hi           3          25 2.35        4
##  6      14 HMM               6 hi           7          25 2.03        4
##  7      15 HMM               0 hi          10          25 1.03        1
##  8      15 HMM               6 hi           9          25 1.16        1
##  9      20 HMM               0 hi           1          25 0.710       2
## 10      20 HMM               6 hi           2          25 0.963       2
## # … with 126 more rows, and 9 more variables: allNewCount <dbl>, rtFA <dbl>,
## #   distPresent <chr>, hitRate <dbl>, faRate <dbl>, dprime <dbl>, mmi <dbl>,
## #   adhd <dbl>, bis <dbl>
nrow(df)
## [1] 136
ncol(df)
## [1] 17
names(df)
##  [1] "subjNum"     "groupStatus" "numDist"     "conf"        "hitCount"   
##  [6] "allOldCount" "rtHit"       "faCount"     "allNewCount" "rtFA"       
## [11] "distPresent" "hitRate"     "faRate"      "dprime"      "mmi"        
## [16] "adhd"        "bis"

Selecting Relevant Variables

Datasets often contain many variables that are unnecessary for a given analysis. To simplify your life, and your code, you can select only the variables you need for now.

# tidyverse select() function 
df <- df %>%
  select(subjNum, groupStatus, adhd, hitRate, faRate, dprime)
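
To get a feel for what select() can do beyond listing column names, here is a minimal sketch on a small made-up tibble. The object `toy` and its values are invented for illustration (only the column names mimic the real dataset), so treat this as a sketch rather than part of the analysis.

```r
library(dplyr)

# Hypothetical toy data: column names mimic the real dataset, values are made up
toy <- tibble(
  subjNum     = c(1, 1, 2, 2),
  groupStatus = c("HMM", "HMM", "LMM", "LMM"),
  hitRate     = c(0.40, 0.35, 0.30, 0.25),
  faRate      = c(0.10, 0.08, 0.05, 0.07)
)

# Keep columns by name
kept <- toy %>% select(subjNum, hitRate)

# Drop a column by putting a minus sign in front of its name
dropped <- toy %>% select(-faRate)

# Keep every column whose name ends in "Rate"
rates <- toy %>% select(ends_with("Rate"))

names(kept)
names(dropped)
names(rates)
```

Dropping with a minus sign is handy when you want to keep almost everything; helpers like ends_with() are handy when the columns you want share a naming pattern.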

Basic Descriptive Statistics

Summarizing data

Let’s learn how to make simple tables of summary statistics. First, we will calculate summary info across all data using summarize(), a useful function for creating summaries. Like mutate(), it can take multiple expressions as arguments, each of which becomes a column in the output. Note that we’re not creating a new object for this summary (i.e. not using the <- symbol), so the result will print but will not be saved.

df %>% 
  summarize(min_HR = min(hitRate, na.rm = TRUE),
            max_HR = max(hitRate, na.rm = TRUE),
            mean_HR = mean(hitRate, na.rm = TRUE),
            sd_HR = sd(hitRate, na.rm = TRUE))
## # A tibble: 1 x 4
##   min_HR max_HR mean_HR sd_HR
##    <dbl>  <dbl>   <dbl> <dbl>
## 1 0.0577  0.788   0.351 0.153
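
The na.rm argument used above matters whenever a column contains missing values. The sketch below shows the difference on a made-up three-row tibble (`toy` and its values are invented for illustration); it also uses n() and is.na() to report how many rows and missing values went into the summary.

```r
library(dplyr)

# Hypothetical toy data with one missing value
toy <- tibble(hitRate = c(0.2, 0.4, NA))

toy_summary <- toy %>%
  summarize(
    mean_with_na = mean(hitRate),                # NA, because one value is missing
    mean_no_na   = mean(hitRate, na.rm = TRUE),  # mean of the two observed values
    n_rows       = n(),                          # number of rows, including the NA row
    n_missing    = sum(is.na(hitRate))           # number of missing values
  )

toy_summary
```

Reporting the number of missing values next to the mean makes it easier to notice when a summary is based on fewer observations than you expected.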

Grouping data

Next, we will learn how to group data based on certain variables of interest.

We will use the group_by() function from dplyr (part of the tidyverse). Once the data are grouped, subsequent operations such as summarize() are carried out separately within each group.

# Split the summary statistics by group
meanSummary <- df %>% 
  group_by(groupStatus) %>%
  summarize(meanHitRate = mean(hitRate, na.rm = TRUE),
            meanFalseAlarm = mean(faRate, na.rm = TRUE),
            meanDprime = mean(dprime, na.rm = TRUE))
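
It can help to see that group_by() on its own does not change the rows of the data; it only records which variable defines the groups, and later verbs use that information. Here is a minimal sketch on made-up data (`toy`, `grouped`, and `toy_means` are hypothetical names, and the values are invented).

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  groupStatus = c("HMM", "HMM", "LMM", "LMM"),
  hitRate     = c(0.4, 0.2, 0.3, 0.5)
)

grouped <- toy %>% group_by(groupStatus)

# Same number of rows as before; only the grouping information was added
nrow(grouped)
group_vars(grouped)

# summarize() now returns one row per group instead of one row overall
toy_means <- grouped %>%
  summarize(meanHitRate = mean(hitRate))

toy_means
```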

We can group data by more than one variable. Let’s say we’re interested in how levels of ADHD interact with groupStatus (multitasking: high or low). We will create a two-level ADHD variable using a mean split, and add it as a grouping variable.

# If adhd score is lower than the mean, label "Low", else label "High"
df <- df %>% 
  mutate(adhdF = if_else(adhd < mean(adhd), "Low", "High"))

table(df$adhdF)
## 
## High  Low 
##   66   70
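
One thing to watch with a mean split is missing data. The sketch below uses a made-up column with one missing score (`toy` and its values are invented for illustration) and assumes you want to compute the mean from the observed scores only, which is why na.rm = TRUE is passed to mean() here.

```r
library(dplyr)

# Hypothetical toy data with one missing adhd score
toy <- tibble(adhd = c(1, 2, 3, 6, NA))

toy <- toy %>%
  mutate(
    # The mean of the observed scores is 3, so scores below 3 are labelled "Low".
    # if_else() returns NA when the comparison itself is NA.
    adhdF = if_else(adhd < mean(adhd, na.rm = TRUE), "Low", "High")
  )

toy$adhdF
```

Without na.rm = TRUE, mean(adhd) would itself be NA in this toy example, and every row would end up labelled NA.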

Then we’ll check how evenly split these groups are.

# How many data points are in each group (adhdF x groupStatus)?
# count() will calculate this for us.
df %>%
  count(adhdF, groupStatus)
## # A tibble: 4 x 3
##   adhdF groupStatus     n
##   <chr> <chr>       <int>
## 1 High  HMM            44
## 2 High  LMM            22
## 3 Low   HMM            24
## 4 Low   LMM            46

Then we’ll calculate some summary info on these groups.

df %>%
  group_by(adhdF, groupStatus) %>%
  summarize(meanHitRate = mean(hitRate, na.rm = TRUE),
            meanFalseAlarm = mean(faRate, na.rm = TRUE),
            meanDprime = mean(dprime, na.rm = TRUE))
## `summarise()` has grouped output by 'adhdF'. You can override using the `.groups` argument.
## # A tibble: 4 x 5
## # Groups:   adhdF [2]
##   adhdF groupStatus meanHitRate meanFalseAlarm meanDprime
##   <chr> <chr>             <dbl>          <dbl>      <dbl>
## 1 High  HMM               0.377         0.0900      1.15 
## 2 High  LMM               0.288         0.0686      1.09 
## 3 Low   HMM               0.287         0.0850      0.819
## 4 Low   LMM               0.389         0.0780      1.30
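
The message printed above mentions the .groups argument. When you group by two variables, summarize() removes only the last grouping variable, so the result is still grouped by the first one. A minimal sketch on made-up data (`toy`, `still_grouped`, and `ungrouped` are hypothetical names) shows one way to get a fully ungrouped result, assuming that is what you want for later steps.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  adhdF       = c("High", "High", "Low", "Low"),
  groupStatus = c("HMM", "LMM", "HMM", "LMM"),
  hitRate     = c(0.4, 0.3, 0.2, 0.5)
)

# Default behaviour: the result stays grouped by adhdF
still_grouped <- toy %>%
  group_by(adhdF, groupStatus) %>%
  summarize(meanHitRate = mean(hitRate))

# .groups = "drop" returns an ungrouped tibble and silences the message
ungrouped <- toy %>%
  group_by(adhdF, groupStatus) %>%
  summarize(meanHitRate = mean(hitRate), .groups = "drop")

group_vars(still_grouped)
group_vars(ungrouped)
```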

Extra: Working with a Long Dataset

This is a repeated measures (“long”) dataset, with multiple rows per subject. This makes things a bit trickier, but we are going to show you some tools for working with “long” datasets.

Counting unique subjects

# Get a list of subjects using unique()
SubList <- unique(df$subjNum)

SubList
##  [1]   6  10  14  15  20  22  26  37  41  42  43  49  52  56  63  67  69  71  72
## [20]  85  90  97 108 110 114 115 118 121 125 127 138 139 141 142   2   5   9  11
## [39]  21  24  29  30  36  39  47  48  53  62  66  74  75  77  78  80  81  88  91
## [58]  96  99 102 104 105 120 122 124 131 132 136
# how many subjects are in this dataframe?
Nsubs <- length(SubList)
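
If you only need the number of unique subjects and not the list itself, dplyr's n_distinct() is a possible shortcut. The sketch below uses a made-up subject column (`toy` and `toy_n` are hypothetical names) and shows that both approaches give the same count.

```r
library(dplyr)

# Hypothetical toy data: three subjects with two rows each
toy <- tibble(subjNum = c(6, 6, 10, 10, 14, 14))

# Base R: list the unique values, then count them
length(unique(toy$subjNum))

# dplyr: count the unique values directly
n_distinct(toy$subjNum)

# n_distinct() also works inside summarize()
toy_n <- toy %>%
  summarize(Nsubs = n_distinct(subjNum))

toy_n
```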

Calculating number of trials per subject

df %>%
  count(subjNum)
## # A tibble: 68 x 2
##    subjNum     n
##  *   <dbl> <int>
##  1       2     2
##  2       5     2
##  3       6     2
##  4       9     2
##  5      10     2
##  6      11     2
##  7      14     2
##  8      15     2
##  9      20     2
## 10      21     2
## # … with 58 more rows
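
With many subjects it is hard to check by eye that everyone has the expected number of rows. One possible approach, sketched here on made-up data, is to filter the counts for subjects that do not match. The names `toy`, `trial_counts`, and `incomplete` are hypothetical, and the expected count of 2 is an assumption based on the output above.

```r
library(dplyr)

# Hypothetical toy data: subject 6 has only one row
toy <- tibble(subjNum = c(2, 2, 5, 5, 6))

# One row per subject, with the number of rows in column n
trial_counts <- toy %>%
  count(subjNum)

# Keep only subjects whose row count differs from the expected 2
incomplete <- trial_counts %>%
  filter(n != 2)

incomplete
```

An empty result would mean that every subject has exactly two rows.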

Combining summary statistics with the full dataframe

For some analyses, you might want to add a higher level variable (e.g. subject average hitRate) alongside your long data. We can do this by using mutate instead of summarize. Note: you’ll have repeating values for the average column.

df <- df %>% 
  group_by(subjNum) %>% 
  mutate(avgHR = mean(hitRate, na.rm = TRUE)) %>%
  ungroup()  # remove the grouping so later steps are not grouped by subject

# You should now have an avgHR column in df, which will
# repeat within each subject, but vary across subjects.
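
To make the difference between mutate() and summarize() on grouped data concrete, here is a minimal sketch on made-up data (`toy`, `with_mutate`, and `with_summarize` are hypothetical names, and the values are invented).

```r
library(dplyr)

# Hypothetical toy data: two subjects with two rows each
toy <- tibble(
  subjNum = c(1, 1, 2, 2),
  hitRate = c(0.2, 0.4, 0.5, 0.7)
)

# mutate() keeps every row and repeats the subject mean within each subject
with_mutate <- toy %>%
  group_by(subjNum) %>%
  mutate(avgHR = mean(hitRate)) %>%
  ungroup()

# summarize() collapses the data to one row per subject
with_summarize <- toy %>%
  group_by(subjNum) %>%
  summarize(avgHR = mean(hitRate))

nrow(with_mutate)     # 4
nrow(with_summarize)  # 2
with_mutate$avgHR
```

Calling ungroup() after mutate() is a common habit so that later operations are not accidentally carried out within each subject.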

Saving Your Work

Saving tables into .csv files

basic_descriptives <- df %>% 
  # if adhd score is lower than the mean, label "Low", else label "High"
  mutate(adhdF = if_else(adhd < mean(adhd), "Low", "High")) %>%
  # split the summary statistics by group 
  group_by(adhdF, groupStatus) %>%
  summarize(meanHitRate = mean(hitRate, na.rm = TRUE),
            meanFalseAlarm = mean(faRate, na.rm = TRUE),
            meanDprime = mean(dprime, na.rm = TRUE))
## `summarise()` has grouped output by 'adhdF'. You can override using the `.groups` argument.
# To save this out, use write_csv() to save a CSV file (which can open in
# other programs like Excel) or save() to save a combined file comprising
# the full data and descriptives (which can only open in R/RStudio).
write_csv(basic_descriptives, here("content", "tutorials", "r-core", "3-datamanipulation", "myDescriptives.csv"))
save(basic_descriptives, df, file = here("content", "tutorials", "r-core", "3-datamanipulation", "StudyData.rda"))

# Note: If you give only a file name, the file is saved to your working
# directory. To save it somewhere else, use here() to build the path to
# the location where you would like to save the file.
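
To check that a saved file can be read back in, you can use read_csv() for CSV files and load() for .rda files. The sketch below is self-contained and makes a few assumptions: it writes made-up data to a temporary folder via tempdir() rather than to the tutorial folder, and the names `toy`, `csv_path`, `rda_path`, `from_csv`, and `loaded` are hypothetical.

```r
library(readr)

# Hypothetical toy table standing in for the descriptives
toy <- data.frame(
  groupStatus = c("HMM", "LMM"),
  meanHitRate = c(0.35, 0.30)
)

# Assumption: save to a temporary folder so nothing in your project is overwritten
csv_path <- file.path(tempdir(), "toyDescriptives.csv")
rda_path <- file.path(tempdir(), "toyData.rda")

write_csv(toy, csv_path)
save(toy, file = rda_path)

# Read the CSV back in; col_types = cols() lets readr guess the column types quietly
from_csv <- read_csv(csv_path, col_types = cols())

# load() restores objects under their original names;
# loading into a new environment avoids overwriting objects in your workspace
loaded <- new.env()
load(rda_path, envir = loaded)

from_csv
ls(loaded)
```

Note that load() brings objects back under the names they had when they were saved, so loading straight into your workspace can silently replace an existing object with the same name.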

Next: Plotting (Graphing your data using ggplot2)