Data Manipulation

core R

Goals of This Lesson

Learn how to group and categorize data in R
Learn how to generate descriptive statistics in R

Links to Files and Video Recording

The files for all tutorials can be downloaded from the Columbia Psychology Scientific Computing GitHub page using these instructions. This particular file is located here: /content/tutorials/r-core/3-datamanipulation/index.rmd.

For a video recording of this tutorial from the Fall 2020 workshop, please visit the Workshop Recording: Session 2 page.

Load Data

This dataset examines the relationship between multitasking and working memory (link to the original paper by Uncapher et al., 2016).

# use the read_csv() function (from readr, part of the tidyverse) to read the data into R
df <- read_csv(here("content", "tutorials", "r-core", "3-datamanipulation", "uncapher_2016_repeated_measures_dataset.csv"))

# we will continue using tidyverse tools, so tidyverse (along with here,
# which builds the file path) is loaded with library() at the very top of this document

Familiarizing Yourself with the Data

Quick review from Data Cleaning: take a look at the basic data structure, number of rows and columns.

# base R functions
df

## # A tibble: 136 x 17
##    subjNum groupStatus numDist conf  hitCount allOldCount rtHit faCount
##      <dbl> <chr>         <dbl> <chr>    <dbl>       <dbl> <dbl>   <dbl>
##  1       6 HMM               0 hi          18          25 0.991       3
##  2       6 HMM               6 hi          14          25 0.952       3
##  3      10 HMM               0 hi           5          25 1.08        8
##  4      10 HMM               6 hi           5          25 1.00        8
##  5      14 HMM               0 hi           3          25 2.35        4
##  6      14 HMM               6 hi           7          25 2.03        4
##  7      15 HMM               0 hi          10          25 1.03        1
##  8      15 HMM               6 hi           9          25 1.16        1
##  9      20 HMM               0 hi           1          25 0.710       2
## 10      20 HMM               6 hi           2          25 0.963       2
## # … with 126 more rows, and 9 more variables: allNewCount <dbl>, rtFA <dbl>,
## #   distPresent <chr>, hitRate <dbl>, faRate <dbl>, dprime <dbl>, mmi <dbl>,
## #   adhd <dbl>, bis <dbl>

nrow(df)

## [1] 136

ncol(df)

## [1] 17

names(df)

##  [1] "subjNum"     "groupStatus" "numDist"     "conf"        "hitCount"   
##  [6] "allOldCount" "rtHit"       "faCount"     "allNewCount" "rtFA"       
## [11] "distPresent" "hitRate"     "faRate"      "dprime"      "mmi"        
## [16] "adhd"        "bis"

Selecting Relevant Variables

Datasets often contain many variables that a given analysis does not need. To simplify your life, and your code, you can select only the variables you want to work with for now.

# tidyverse select() function 
df <- df %>%
  select(subjNum, groupStatus, adhd, hitRate, faRate, dprime)
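select() can also pick columns by pattern or drop them by name. The sketch below uses a small made-up data frame (toy, with invented values) rather than the real dataset, so the object names rates and no_rt are purely illustrative.

```r
library(dplyr)

# Toy data standing in for the tutorial's df (all values are made up)
toy <- tibble(
  subjNum     = c(1, 1, 2, 2),
  groupStatus = c("HMM", "HMM", "LMM", "LMM"),
  hitRate     = c(0.40, 0.35, 0.30, 0.25),
  faRate      = c(0.10, 0.08, 0.05, 0.07),
  rtHit       = c(1.0, 1.1, 0.9, 1.2)
)

# Keep subjNum plus every column whose name ends in "Rate"
rates <- toy %>% select(subjNum, ends_with("Rate"))

# Drop a single column by prefixing its name with a minus sign
no_rt <- toy %>% select(-rtHit)

names(rates)  # "subjNum" "hitRate" "faRate"
names(no_rt)  # "subjNum" "groupStatus" "hitRate" "faRate"
```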

Basic Descriptive Statistics

Summarizing data

Let’s learn how to make simple tables of summary statistics. First, we will calculate summary statistics across the whole dataset using summarize(), which collapses a data frame into a single row of summary values. Like mutate(), it accepts multiple name = expression arguments, one per output column. Note that we are not assigning the result to a new object (i.e., not using the <- operator), so the summary will print but will not be saved.

df %>% 
  summarize(min_HR = min(hitRate, na.rm = TRUE),
            max_HR = max(hitRate, na.rm = TRUE),
            mean_HR = mean(hitRate, na.rm = TRUE),
            sd_HR = sd(hitRate, na.rm = TRUE))

## # A tibble: 1 x 4
##   min_HR max_HR mean_HR sd_HR
##    <dbl>  <dbl>   <dbl> <dbl>
## 1 0.0577  0.788   0.351 0.153
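When the same statistics are needed for several columns, one option is dplyr's across() helper (available in dplyr 1.0.0 and later). This is only a sketch on made-up values; toy and toy_summary are hypothetical names, not objects from this tutorial.

```r
library(dplyr)

# Toy data (made-up values), including one missing hit rate
toy <- tibble(
  hitRate = c(0.2, 0.4, 0.6, NA),
  faRate  = c(0.1, 0.1, 0.4, 0.2)
)

toy_summary <- toy %>%
  summarize(
    # Apply mean and sd to both columns; output columns are named
    # <column>_<function>, e.g. hitRate_mean
    across(c(hitRate, faRate),
           list(mean = ~ mean(.x, na.rm = TRUE),
                sd   = ~ sd(.x, na.rm = TRUE))),
    # n() counts rows, including rows with missing values
    n_rows = n()
  )

names(toy_summary)
# "hitRate_mean" "hitRate_sd" "faRate_mean" "faRate_sd" "n_rows"
```

Because na.rm = TRUE is set, the missing hit rate is ignored when computing the mean and standard deviation, while n_rows still counts all four rows.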

Grouping data

Next, we will learn how to group data based on certain variables of interest.

We will use the group_by() function from dplyr (part of the tidyverse). Once a data frame is grouped, subsequent operations such as summarize() are computed separately within each group.

# Split the summary statistics by group
meanSummary <- df %>% 
  group_by(groupStatus) %>%
  summarize(meanHitRate = mean(hitRate, na.rm = TRUE),
            meanFalseAlarm = mean(faRate, na.rm = TRUE),
            meanDprime = mean(dprime, na.rm = TRUE))

We can group data by more than one variable. Let’s say we’re interested in how levels of ADHD interact with groupStatus (multitasking: high or low). We will create a categorical ADHD variable using a mean split, and add it as a grouping variable.

# If adhd score is lower than the mean, label "Low"; otherwise label "High"
df <- df %>% 
  mutate(adhdF = if_else(adhd < mean(adhd), "Low", "High"))

table(df$adhdF)

## 
## High  Low 
##   66   70
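The mean split above stores adhdF as a character column, which is why table() lists "High" before "Low" (alphabetical order). If an explicit ordering matters, or if the scores might contain missing values, a variation along these lines may help. It is a sketch on made-up scores, and toy is a hypothetical object.

```r
library(dplyr)

# Toy ADHD scores (made up), including one missing value
toy <- tibble(adhd = c(1, 2, 3, 6, NA))

toy <- toy %>%
  mutate(
    # na.rm = TRUE keeps a missing score from turning the mean into NA
    adhdF = if_else(adhd < mean(adhd, na.rm = TRUE), "Low", "High"),
    # Convert to a factor with an explicit level order: Low, then High
    adhdF = factor(adhdF, levels = c("Low", "High"))
  )

# useNA = "ifany" also reports how many rows could not be classified
table(toy$adhdF, useNA = "ifany")
```

Here the mean of the non-missing scores is 3, so scores 1 and 2 are labeled "Low", scores 3 and 6 are labeled "High", and the missing score stays NA.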

Then we’ll check how evenly split these groups are.

# How many data points are in each group (adhdF x groupStatus)?
# count() will calculate this for us.
df %>%
  count(adhdF, groupStatus)

## # A tibble: 4 x 3
##   adhdF groupStatus     n
##   <chr> <chr>       <int>
## 1 High  HMM            44
## 2 High  LMM            22
## 3 Low   HMM            24
## 4 Low   LMM            46

Then we’ll calculate some summary info on these groups.

df %>%
  group_by(adhdF, groupStatus) %>%
  summarize(meanHitRate = mean(hitRate, na.rm = TRUE),
            meanFalseAlarm = mean(faRate, na.rm = TRUE),
            meanDprime = mean(dprime, na.rm = TRUE))

## `summarise()` has grouped output by 'adhdF'. You can override using the `.groups` argument.

## # A tibble: 4 x 5
## # Groups:   adhdF [2]
##   adhdF groupStatus meanHitRate meanFalseAlarm meanDprime
##   <chr> <chr>             <dbl>          <dbl>      <dbl>
## 1 High  HMM               0.377         0.0900      1.15 
## 2 High  LMM               0.288         0.0686      1.09 
## 3 Low   HMM               0.287         0.0850      0.819
## 4 Low   LMM               0.389         0.0780      1.30
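The message printed above means that summarize() removed only the last grouping variable, so the result is still grouped by adhdF. The .groups argument makes this choice explicit. The sketch below uses made-up values; toy, still_grouped, and ungrouped are illustrative names.

```r
library(dplyr)

# Toy data (made-up values), one row per adhdF x groupStatus cell
toy <- tibble(
  adhdF       = c("High", "High", "Low", "Low"),
  groupStatus = c("HMM", "LMM", "HMM", "LMM"),
  dprime      = c(1.2, 1.0, 0.8, 1.4)
)

# "drop_last" is the default behavior here: the result stays grouped by adhdF
still_grouped <- toy %>%
  group_by(adhdF, groupStatus) %>%
  summarize(meanDprime = mean(dprime, na.rm = TRUE), .groups = "drop_last")

# "drop" returns a plain, ungrouped tibble
ungrouped <- toy %>%
  group_by(adhdF, groupStatus) %>%
  summarize(meanDprime = mean(dprime, na.rm = TRUE), .groups = "drop")

group_vars(still_grouped)  # "adhdF"
group_vars(ungrouped)      # character(0)
```

Dropping the groups is often the safer choice when the summary table will be passed on to further mutate() or summarize() calls, since leftover grouping silently changes what those calls compute.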

Extra: Working with a Long Dataset

This is a repeated measures (“long”) dataset, with multiple rows per subject. This makes things a bit trickier, but we are going to show you some tools for working with “long” datasets.

Counting unique subjects

# Get a list of subjects using unique()
SubList <- unique(df$subjNum)

SubList

##  [1]   6  10  14  15  20  22  26  37  41  42  43  49  52  56  63  67  69  71  72
## [20]  85  90  97 108 110 114 115 118 121 125 127 138 139 141 142   2   5   9  11
## [39]  21  24  29  30  36  39  47  48  53  62  66  74  75  77  78  80  81  88  91
## [58]  96  99 102 104 105 120 122 124 131 132 136

# how many subjects are in this dataframe?
Nsubs <- length(SubList)
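An equivalent tidyverse route is dplyr's n_distinct(), which counts unique values in one step. A minimal sketch on a made-up data frame (toy, n_subs, and n_subs_tbl are hypothetical names):

```r
library(dplyr)

# Toy long-format data (made-up values): three subjects, two rows each
toy <- tibble(
  subjNum = c(6, 6, 10, 10, 14, 14),
  hitRate = c(0.72, 0.56, 0.20, 0.20, 0.12, 0.28)
)

# Same result as length(unique(toy$subjNum))
n_subs <- n_distinct(toy$subjNum)

# The same count inside a pipe, returned as a one-row tibble
n_subs_tbl <- toy %>% summarize(Nsubs = n_distinct(subjNum))

n_subs  # 3
```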

Calculating number of trials per subject

df %>%
  count(subjNum)

## # A tibble: 68 x 2
##    subjNum     n
##  *   <dbl> <int>
##  1       2     2
##  2       5     2
##  3       6     2
##  4       9     2
##  5      10     2
##  6      11     2
##  7      14     2
##  8      15     2
##  9      20     2
## 10      21     2
## # … with 58 more rows
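With many subjects it is impractical to scan this table by eye. One way to check that every subject contributes the expected number of rows is to filter the counts. The sketch below uses made-up data in which one subject is missing a row; toy, row_counts, and incomplete are illustrative names, and the expected count of 2 is an assumption that matches the output above.

```r
library(dplyr)

# Toy long-format data (made up): subject 6 has only one row
toy <- tibble(
  subjNum = c(2, 2, 5, 5, 6),
  numDist = c(0, 6, 0, 6, 0)
)

# name = lets us give the count column a more descriptive name than "n"
row_counts <- toy %>% count(subjNum, name = "nRows")

# Keep only subjects that do not have the expected 2 rows
incomplete <- row_counts %>% filter(nRows != 2)

incomplete$subjNum  # 6
```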

Combining summary statistics with the full dataframe

For some analyses, you might want to add a higher-level variable (e.g. each subject’s average hitRate) alongside your long data. We can do this by using mutate() instead of summarize(): mutate() keeps every row and adds the group-level value as a new column. Note: the average will repeat on every row that belongs to the same subject.

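# Assign the result back to df so that the new column is kept
df <- df %>% 
  group_by(subjNum) %>% 
  mutate(avgHR = mean(hitRate, na.rm = TRUE)) %>%
  # ungroup() so that later steps are not computed within each subject
  ungroup()

# You should now have an avgHR column in df, which will
# repeat within each subject, but vary across subjects.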
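To make the difference between the two verbs concrete, here is a sketch on made-up data. The names toy, per_subject, with_avg, and hitRate_centered are hypothetical, and the centering step is just one possible use of a subject-level average.

```r
library(dplyr)

# Toy long-format data (made-up values): two subjects, two rows each
toy <- tibble(
  subjNum = c(1, 1, 2, 2),
  hitRate = c(0.2, 0.4, 0.5, 0.7)
)

# summarize() collapses to one row per subject
per_subject <- toy %>%
  group_by(subjNum) %>%
  summarize(avgHR = mean(hitRate, na.rm = TRUE))

# mutate() keeps all rows and repeats the subject average on each one
with_avg <- toy %>%
  group_by(subjNum) %>%
  mutate(avgHR = mean(hitRate, na.rm = TRUE),
         # Example use: center each observation on its subject's average
         hitRate_centered = hitRate - avgHR) %>%
  ungroup()

nrow(per_subject)  # 2
nrow(with_avg)     # 4
```

The closing ungroup() matters: without it, any later mean() or similar call inside mutate() would still be computed within each subject rather than across the whole data frame.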

Saving Your Work

Saving tables into .csv files

basic_descriptives <- df %>% 
  # if adhd score is lower than the mean, label "Low"; otherwise label "High"
  mutate(adhdF = if_else(adhd < mean(adhd), "Low", "High")) %>%
  # split the summary statistics by group 
  group_by(adhdF, groupStatus) %>%
  summarize(meanHitRate = mean(hitRate, na.rm = TRUE),
            meanFalseAlarm = mean(faRate, na.rm = TRUE),
            meanDprime = mean(dprime, na.rm = TRUE))

## `summarise()` has grouped output by 'adhdF'. You can override using the `.groups` argument.

# To save this out, use write_csv() to save a CSV file (which can open in
# other programs like Excel) or save() to save a combined file comprising
# the full data and descriptives (which can only open in R/RStudio).
# (readr 1.4.0 and later name this argument `file`; `path` is deprecated)
write_csv(basic_descriptives, file = here("content", "tutorials", "r-core", "3-datamanipulation", "myDescriptives.csv"))
save(basic_descriptives, df, file = here("content", "tutorials", "r-core", "3-datamanipulation", "StudyData.rda"))

# Note: If you give only a file name, the file is saved to your current
# working directory. Building the path with here(), as above, saves the
# file to a specific location within the project instead.
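To confirm that saved files can be read back, a round trip like the following can be useful. This sketch writes to a temporary directory rather than the project folder, and every name in it (toy_descriptives, csv_path, rda_path, from_csv, restored) is made up for illustration.

```r
library(readr)

# Toy summary table (made-up values)
toy_descriptives <- data.frame(
  groupStatus = c("HMM", "LMM"),
  meanHitRate = c(0.35, 0.34)
)

# Write into a temporary directory so nothing in the project is overwritten
csv_path <- file.path(tempdir(), "toyDescriptives.csv")
rda_path <- file.path(tempdir(), "toyData.rda")

write_csv(toy_descriptives, csv_path)
save(toy_descriptives, file = rda_path)

# Read the CSV back in; col_types = cols() lets readr guess column types
# without printing the column specification
from_csv <- read_csv(csv_path, col_types = cols())

# load() restores objects under their original names; loading into a
# separate environment avoids overwriting anything in the workspace
restored <- new.env()
load(rda_path, envir = restored)
ls(restored)  # "toy_descriptives"
```

Note the difference between the two formats: read_csv() returns the table so you can assign it to any name, whereas load() recreates the saved objects under the names they had when save() was called.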

Next: Plotting (Graphing your data using ggplot2)