Data Manipulation in Python
Goals of this Lesson
Students will learn:
- How to group and categorize data in Python
- How to generative descriptive statistics in Python
Links to Files and Video Recording
The files for all tutorials can be downloaded from the Columbia Psychology Scientific Computing GitHub page using these instructions. This particular file is located here: /content/tutorials/python/3-datamanipulation/index.ipynb
.
For a video recording of this tutorial from the Fall 2020 workshop, please visit the Workshop Recording: Session 3 page.
# load packages we will be using for this lesson
import pandas as pd
Open Dataset and Load Package
This dataset examines the relationship between multitasking and working memory. Link here to original paper by Uncapher et al. 2016.
# use pd.read_csv to open data into python
df = pd.read_csv("uncapher_2016_repeated_measures_dataset.csv")
Familiarize Yourself with the Data
Quick review from Data Cleaning: take a look at the basic data structure, number of rows and columns.
df.head()
subjNum | groupStatus | numDist | conf | hitCount | allOldCount | rtHit | faCount | allNewCount | rtFA | distPresent | hitRate | faRate | dprime | mmi | adhd | bis | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 6 | HMM | 0 | hi | 18 | 25 | 0.990657 | 3 | 50 | 1.062458 | nodist | 0.711538 | 0.068627 | 2.043976 | 5.77 | 4 | 74 |
1 | 6 | HMM | 6 | hi | 14 | 25 | 0.951638 | 3 | 50 | 1.062458 | dist | 0.557692 | 0.068627 | 1.631213 | 5.77 | 4 | 74 |
2 | 10 | HMM | 0 | hi | 5 | 25 | 1.081535 | 8 | 50 | 1.036764 | nodist | 0.211538 | 0.166667 | 0.166327 | 6.21 | 4 | 51 |
3 | 10 | HMM | 6 | hi | 5 | 25 | 0.999527 | 8 | 50 | 1.036764 | dist | 0.211538 | 0.166667 | 0.166327 | 6.21 | 4 | 51 |
4 | 14 | HMM | 0 | hi | 3 | 25 | 2.346210 | 4 | 50 | 2.075087 | nodist | 0.134615 | 0.088235 | 0.246866 | 8.60 | 5 | 62 |
df.shape
(136, 17)
df.columns
Index(['subjNum', 'groupStatus', 'numDist', 'conf', 'hitCount', 'allOldCount',
'rtHit', 'faCount', 'allNewCount', 'rtFA', 'distPresent', 'hitRate',
'faRate', 'dprime', 'mmi', 'adhd', 'bis'],
dtype='object')
Selecting Relevant Variables
Sometimes datasets have many variables that are unnecessary for a given analysis. To simplify your life, and your code, we can select only the given variables we’d like to use for now.
df = df[["subjNum", "groupStatus", "adhd", "hitRate", "faRate", "dprime"]]
df.head()
subjNum | groupStatus | adhd | hitRate | faRate | dprime | |
---|---|---|---|---|---|---|
0 | 6 | HMM | 4 | 0.711538 | 0.068627 | 2.043976 |
1 | 6 | HMM | 4 | 0.557692 | 0.068627 | 1.631213 |
2 | 10 | HMM | 4 | 0.211538 | 0.166667 | 0.166327 |
3 | 10 | HMM | 4 | 0.211538 | 0.166667 | 0.166327 |
4 | 14 | HMM | 5 | 0.134615 | 0.088235 | 0.246866 |
Basic Descriptives
Summarizing data
Let’s learn how to make simple tables of summary statistics.
First, we will calculate summary info across all data using describe()
, a useful function for creating summaries. Note that we’re not creating a new object for this summary (i.e. not using the =
symbol), so this will print but not save.
df.describe()
subjNum | adhd | hitRate | faRate | dprime | |
---|---|---|---|---|---|
count | 136.000000 | 136.000000 | 136.000000 | 136.000000 | 136.000000 |
mean | 72.676471 | 2.382353 | 0.350679 | 0.081603 | 1.133846 |
std | 41.664779 | 1.651302 | 0.153422 | 0.073607 | 0.566277 |
min | 2.000000 | 0.000000 | 0.057692 | 0.009804 | 0.047920 |
25% | 38.500000 | 1.000000 | 0.250000 | 0.029412 | 0.712359 |
50% | 73.000000 | 2.000000 | 0.326923 | 0.058824 | 1.094755 |
75% | 108.500000 | 4.000000 | 0.442308 | 0.107843 | 1.545407 |
max | 142.000000 | 5.000000 | 0.788462 | 0.362745 | 2.478890 |
Grouping Data
Next, we will learn how to group data based on certain variables of interest.
We will use the groupby()
function in pandas
, which will automatically group any subsequent actions called on the data.
df.groupby(["groupStatus"]).mean()
subjNum | adhd | hitRate | faRate | dprime | |
---|---|---|---|---|---|
groupStatus | |||||
HMM | 74.852941 | 2.941176 | 0.345023 | 0.088235 | 1.032845 |
LMM | 70.500000 | 1.823529 | 0.356335 | 0.074971 | 1.234847 |
We can group data by more than one factor. Let’s say we’re interested in how levels of adhd
interact with groupStatus
(multitasking: high or low).
We will first make a factor for adhd
(median-split), and add it as a grouping variable using the cut()
function in pandas
:
df["adhdF"] = pd.cut(df["adhd"],bins=2,labels=["Low","High"])
Then we’ll check how evenly split these groups are by using groupby()
the size()
functions:
df.groupby(["groupStatus","adhdF"]).size()
groupStatus adhdF
HMM Low 24
High 44
LMM Low 46
High 22
dtype: int64
Then we’ll calculate some summary info about these groups:
df.groupby(["groupStatus","adhdF"]).mean()
subjNum | adhd | hitRate | faRate | dprime | ||
---|---|---|---|---|---|---|
groupStatus | adhdF | |||||
HMM | Low | 82.500000 | 1.083333 | 0.286859 | 0.084967 | 0.818802 |
High | 70.681818 | 3.954545 | 0.376748 | 0.090018 | 1.149595 | |
LMM | Low | 72.478261 | 0.913043 | 0.388796 | 0.078005 | 1.302366 |
High | 66.363636 | 3.727273 | 0.288462 | 0.068627 | 1.093670 |
A note on piping / stringing commands together
In R, we often use the pipe %>%
to string a series of steps together. We can do the same in Python with many functions in a row.
This is how we’re able to take the output of df.groupby(["groupStatus","adhdF"])
and then send that output into the mean()
function
Extra: Working with a Long Dataset
This is a repeated measures (“long”) dataset, with multiple rows per subject. This makes things a bit tricker, but we are going to show you some tools for how to work with “long” datasets.
How many unique subjects are in the data?
subList = df["subjNum"].unique()
nSubs = len(subList)
nSubs
68
How many trials were there per subject?
nTrialsPerSubj = df.groupby(["subjNum"]).size().reset_index(name="nTrials")
nTrialsPerSubj.head()
subjNum | nTrials | |
---|---|---|
0 | 2 | 2 |
1 | 5 | 2 |
2 | 6 | 2 |
3 | 9 | 2 |
4 | 10 | 2 |
Combine summary statistics with the full dataframe
For some analyses, you might want to add a higher level variable (e.g., subject average hitRate
) alongside your long data. We can do this by summarizing the data in a new dataframe and then merging it with the full data.
avgHR = df.groupby(["subjNum"])["hitRate"].mean().reset_index(name="avgHR")
avgHR.head()
subjNum | avgHR | |
---|---|---|
0 | 2 | 0.326923 |
1 | 5 | 0.403846 |
2 | 6 | 0.634615 |
3 | 9 | 0.173077 |
4 | 10 | 0.211538 |
df = df.merge(avgHR,on="subjNum")
df.head()
subjNum | groupStatus | adhd | hitRate | faRate | dprime | adhdF | avgHR | |
---|---|---|---|---|---|---|---|---|
0 | 6 | HMM | 4 | 0.711538 | 0.068627 | 2.043976 | High | 0.634615 |
1 | 6 | HMM | 4 | 0.557692 | 0.068627 | 1.631213 | High | 0.634615 |
2 | 10 | HMM | 4 | 0.211538 | 0.166667 | 0.166327 | High | 0.211538 |
3 | 10 | HMM | 4 | 0.211538 | 0.166667 | 0.166327 | High | 0.211538 |
4 | 14 | HMM | 5 | 0.134615 | 0.088235 | 0.246866 | High | 0.211538 |
You should now have an avgHR
column in df
, which will repeat within each subject, but vary across subjects.
Next: Plotting in Python