Data Manipulation
Goals of This Lesson
- Learn how to group and categorize data in R
- Learn how to generate descriptive statistics in R
Links to Files and Video Recording
The files for all tutorials can be downloaded from the Columbia Psychology Scientific Computing GitHub page using these instructions. This particular file is located here: /content/tutorials/r-core/3-datamanipulation/index.rmd.
For a video recording of this tutorial from the Fall 2020 workshop, please visit the Workshop Recording: Session 2 page.
Load Data
This dataset examines the relationship between multitasking and working memory (link to the original paper by Uncapher et al., 2016).
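As a rough sketch of the loading step, a CSV file can be read into a tibble with read_csv() from readr. The name and location of the data file are not given here, so the example writes a tiny made-up CSV to a temporary file and reads it back; the file contents and the name df_toy are assumptions for illustration only.

```r
library(readr)

# Write a tiny made-up CSV to a temporary file so the example runs anywhere.
# In practice, point read_csv() at the downloaded data file instead.
tmp <- tempfile(fileext = ".csv")
writeLines(c("subjNum,groupStatus,hitRate",
             "6,HMM,0.72",
             "6,HMM,0.56",
             "2,LMM,0.40"),
           tmp)

# col_types = cols() lets readr guess the column types without printing a message
df_toy <- read_csv(tmp, col_types = cols())
df_toy
```

With the real data you would assign the result to df, the name used throughout the rest of this lesson.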
Familiarizing Yourself with the Data
Quick review from Data Cleaning: take a look at the basic data structure, number of rows and columns.
## # A tibble: 136 x 17
## subjNum groupStatus numDist conf hitCount allOldCount rtHit faCount
## <dbl> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 6 HMM 0 hi 18 25 0.991 3
## 2 6 HMM 6 hi 14 25 0.952 3
## 3 10 HMM 0 hi 5 25 1.08 8
## 4 10 HMM 6 hi 5 25 1.00 8
## 5 14 HMM 0 hi 3 25 2.35 4
## 6 14 HMM 6 hi 7 25 2.03 4
## 7 15 HMM 0 hi 10 25 1.03 1
## 8 15 HMM 6 hi 9 25 1.16 1
## 9 20 HMM 0 hi 1 25 0.710 2
## 10 20 HMM 6 hi 2 25 0.963 2
## # … with 126 more rows, and 9 more variables: allNewCount <dbl>, rtFA <dbl>,
## # distPresent <chr>, hitRate <dbl>, faRate <dbl>, dprime <dbl>, mmi <dbl>,
## # adhd <dbl>, bis <dbl>
## [1] 136
## [1] 17
## [1] "subjNum" "groupStatus" "numDist" "conf" "hitCount"
## [6] "allOldCount" "rtHit" "faCount" "allNewCount" "rtFA"
## [11] "distPresent" "hitRate" "faRate" "dprime" "mmi"
## [16] "adhd" "bis"
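Output like the above can be produced with a handful of calls. The sketch below runs them on a small made-up tibble (df_toy, with invented values) so that it is self-contained; on the real data you would call the same functions on df.

```r
library(dplyr)

# Made-up stand-in for df: same kinds of columns, invented values
df_toy <- tibble(
  subjNum     = c(6, 6, 10, 10),
  groupStatus = c("HMM", "HMM", "HMM", "HMM"),
  hitRate     = c(0.72, 0.56, 0.20, 0.20)
)

df_toy          # printing a tibble shows its dimensions, column types, and first rows
nrow(df_toy)    # number of rows
ncol(df_toy)    # number of columns
names(df_toy)   # column names
```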
Selecting Relevant Variables
Sometimes datasets have many variables that are unnecessary for a given analysis. To simplify your life, and your code, you can select only the variables you need for now.
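A minimal sketch of this step uses select() from dplyr. The toy data and the particular columns kept here are assumptions for illustration, not necessarily the selection used in the workshop.

```r
library(dplyr)

# Made-up stand-in for df with a few extra columns
df_toy <- tibble(
  subjNum     = c(6, 6, 10, 10),
  groupStatus = c("HMM", "HMM", "HMM", "HMM"),
  numDist     = c(0, 6, 0, 6),
  conf        = c("hi", "hi", "hi", "hi"),
  hitRate     = c(0.72, 0.56, 0.20, 0.20)
)

# Keep only the named columns, in the order listed
df_small <- df_toy %>%
  select(subjNum, groupStatus, hitRate)

names(df_small)
```

Saving the result under a new name (df_small) leaves the full data frame available in case a dropped column is needed later.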
Basic Descriptive Statistics
Summarizing data
Let’s learn how to make simple tables of summary statistics.
First, we will calculate summary info across all data using summarize(), a useful function for creating summaries. Like mutate(), it accepts multiple expressions as arguments, each producing one column of the output. Note that we’re not assigning this summary to a new object (i.e. not using the <- symbol), so the result will print but not be saved.
df %>%
  summarize(min_HR = min(hitRate, na.rm = TRUE),
            max_HR = max(hitRate, na.rm = TRUE),
            mean_HR = mean(hitRate, na.rm = TRUE),
            sd_HR = sd(hitRate, na.rm = TRUE))
## # A tibble: 1 x 4
## min_HR max_HR mean_HR sd_HR
## <dbl> <dbl> <dbl> <dbl>
## 1 0.0577 0.788 0.351 0.153
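Each summary call above passes the na.rm argument. As a small illustration with made-up values, this is what that argument changes when a column contains a missing value:

```r
# Made-up hit rates, one of them missing
hit_rates <- c(0.40, 0.25, NA)

mean(hit_rates)                  # NA: a missing value propagates by default
mean(hit_rates, na.rm = TRUE)    # 0.325: the mean of the two observed values
```

The same applies to min(), max(), and sd().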
Grouping data
Next, we will learn how to group data based on certain variables of interest.
We will use the group_by() function from the tidyverse (dplyr), which marks the data as grouped so that subsequent operations such as summarize() are carried out separately within each group.
# Split the summary statistics by group
meanSummary <- df %>%
  group_by(groupStatus) %>%
  summarize(meanHitRate = mean(hitRate, na.rm = TRUE),
            meanFalseAlarm = mean(faRate, na.rm = TRUE),
            meanDprime = mean(dprime, na.rm = TRUE))
We can group data by more than one factor. Let’s say we’re interested in how levels of ADHD interact with groupStatus (multitasking: high or low). We will make a factor for ADHD (mean-split), and add it as a grouping variable.
# If adhd score is lower than the mean, label "Low"; otherwise label "High"
df <- df %>%
  mutate(adhdF = if_else(adhd < mean(adhd), "Low", "High"))
table(df$adhdF)
##
## High Low
## 66 70
Then we’ll check how evenly split these groups are.
# How many data points are in each group (adhdF x groupStatus)?
# count() will calculate this for us.
df %>%
  count(adhdF, groupStatus)
## # A tibble: 4 x 3
## adhdF groupStatus n
## <chr> <chr> <int>
## 1 High HMM 44
## 2 High LMM 22
## 3 Low HMM 24
## 4 Low LMM 46
Then we’ll calculate some summary info on these groups.
df %>%
  group_by(adhdF, groupStatus) %>%
  summarize(meanHitRate = mean(hitRate, na.rm = TRUE),
            meanFalseAlarm = mean(faRate, na.rm = TRUE),
            meanDprime = mean(dprime, na.rm = TRUE))
## `summarise()` has grouped output by 'adhdF'. You can override using the `.groups` argument.
## # A tibble: 4 x 5
## # Groups: adhdF [2]
## adhdF groupStatus meanHitRate meanFalseAlarm meanDprime
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 High HMM 0.377 0.0900 1.15
## 2 High LMM 0.288 0.0686 1.09
## 3 Low HMM 0.287 0.0850 0.819
## 4 Low LMM 0.389 0.0780 1.30
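The message printed with this output mentions the .groups argument. By default, summarize() removes only the last grouping variable, so the result above is still grouped by adhdF. The sketch below, on made-up data, shows one way to get an ungrouped result instead; it assumes dplyr 1.0.0 or later, where .groups is available.

```r
library(dplyr)

# Made-up stand-in for df with the two grouping columns
df_toy <- tibble(
  adhdF       = c("High", "High", "Low", "Low"),
  groupStatus = c("HMM", "LMM", "HMM", "LMM"),
  hitRate     = c(0.40, 0.30, 0.20, 0.50)
)

# .groups = "drop" returns an ungrouped tibble and silences the message
summary_toy <- df_toy %>%
  group_by(adhdF, groupStatus) %>%
  summarize(meanHitRate = mean(hitRate, na.rm = TRUE), .groups = "drop")

is_grouped_df(summary_toy)
```

Dropping the grouping is a reasonable default when the summary table is the end product; leftover grouping can otherwise affect later mutate() or summarize() calls.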
Extra: Working with a Long Dataset
This is a repeated measures (“long”) dataset, with multiple rows per subject. This makes things a bit trickier, but we are going to show you some tools for working with “long” datasets.
Counting unique subjects
## [1] 6 10 14 15 20 22 26 37 41 42 43 49 52 56 63 67 69 71 72
## [20] 85 90 97 108 110 114 115 118 121 125 127 138 139 141 142 2 5 9 11
## [39] 21 24 29 30 36 39 47 48 53 62 66 74 75 77 78 80 81 88 91
## [58] 96 99 102 104 105 120 122 124 131 132 136
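A listing like the one above can be obtained with unique(), which returns each subject number once, in order of first appearance. The sketch below uses made-up data; n_distinct() from dplyr is shown as one way to get just the number of subjects.

```r
library(dplyr)

# Made-up long data: two rows per subject
df_toy <- tibble(
  subjNum = c(6, 6, 10, 10, 2, 2),
  numDist = c(0, 6, 0, 6, 0, 6)
)

unique(df_toy$subjNum)       # each subject number once
n_distinct(df_toy$subjNum)   # how many unique subjects there are
```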
Calculating number of trials per subject
## # A tibble: 68 x 2
## subjNum n
## * <dbl> <int>
## 1 2 2
## 2 5 2
## 3 6 2
## 4 9 2
## 5 10 2
## 6 11 2
## 7 14 2
## 8 15 2
## 9 20 2
## 10 21 2
## # … with 58 more rows
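One way to build a table like this is count(), which returns one row per subject with the number of rows in a column named n. This is a sketch on made-up data and may not be the exact call that produced the output above.

```r
library(dplyr)

# Made-up long data: two rows per subject
df_toy <- tibble(
  subjNum = c(6, 6, 10, 10, 2, 2),
  numDist = c(0, 6, 0, 6, 0, 6)
)

# One row per subject; n is the number of rows (trials) for that subject
trials <- df_toy %>%
  count(subjNum)

trials
```

Checking that n is the same for every subject is a quick way to spot missing or duplicated rows.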
Combining summary statistics with the full dataframe
For some analyses, you might want to add a higher-level variable (e.g. each subject’s average hitRate) alongside your long data. We can do this by grouping and then using mutate() instead of summarize(). Note: the average will repeat on every row belonging to the same subject.
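A minimal sketch of this pattern on made-up data follows. The column name avgHR is a hypothetical choice for illustration.

```r
library(dplyr)

# Made-up long data: two rows per subject
df_toy <- tibble(
  subjNum = c(1, 1, 2, 2),
  hitRate = c(0.40, 0.20, 0.50, 0.30)
)

# mutate() after group_by() keeps every row and adds the per-subject mean
df_avg <- df_toy %>%
  group_by(subjNum) %>%
  mutate(avgHR = mean(hitRate, na.rm = TRUE)) %>%
  ungroup()

df_avg
```

Unlike summarize(), which would return one row per subject, this keeps all four rows. The ungroup() at the end removes the grouping so that later steps operate on the whole data frame.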
Saving Your Work
Saving tables into .csv files
basic_descriptives <- df %>%
  # If adhd score is lower than the mean, label "Low"; otherwise label "High"
  mutate(adhdF = if_else(adhd < mean(adhd), "Low", "High")) %>%
  # Split the summary statistics by group
  group_by(adhdF, groupStatus) %>%
  summarize(meanHitRate = mean(hitRate, na.rm = TRUE),
            meanFalseAlarm = mean(faRate, na.rm = TRUE),
            meanDprime = mean(dprime, na.rm = TRUE))
## `summarise()` has grouped output by 'adhdF'. You can override using the `.groups` argument.
# To save this out, use write_csv() to save a CSV file (which can open in
# other programs like Excel) or save() to save a combined file comprising
# the full data and descriptives (which can only open in R/RStudio).
write_csv(basic_descriptives, file = here("content", "tutorials", "r-core", "3-datamanipulation", "myDescriptives.csv"))
save(basic_descriptives, df, file = here("content", "tutorials", "r-core", "3-datamanipulation", "StudyData.rda"))
# Note: These files will automatically save to your working directory
# unless you specify otherwise. To do so, use here() to indicate the
# location where you would like to save the file.
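To confirm that saved files can be read back, a round trip like the following may help. It is a sketch on a made-up table, and it writes to R's temporary directory rather than to the tutorial folder; the object and file names are assumptions for illustration.

```r
library(readr)

# Made-up summary table standing in for basic_descriptives
toy <- data.frame(group = c("HMM", "LMM"), meanHitRate = c(0.35, 0.33))

csv_path <- file.path(tempdir(), "toyDescriptives.csv")
rda_path <- file.path(tempdir(), "toyData.rda")

write_csv(toy, csv_path)     # plain-text CSV, readable by other programs
save(toy, file = rda_path)   # R-only file that stores the object under its name

# Read the CSV back into a new object
from_csv <- read_csv(csv_path, col_types = cols())

# load() restores objects under their original names and returns those names
rm(toy)
loaded_names <- load(rda_path)
loaded_names
```

Note the difference: read_csv() returns a value that you assign to a name of your choice, while load() recreates the saved objects under the names they had when saved.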