Data Manipulation in Python

core python


Goals of this Lesson

Students will learn:

  1. How to group and categorize data in Python
  2. How to generative descriptive statistics in Python

The files for all tutorials can be downloaded from the Columbia Psychology Scientific Computing GitHub page using these instructions. This particular file is located here: /content/tutorials/python/3-datamanipulation/index.ipynb.

For a video recording of this tutorial from the Fall 2020 workshop, please visit the Workshop Recording: Session 3 page.

# load packages we will be using for this lesson
import pandas as pd

Open Dataset and Load Package

This dataset examines the relationship between multitasking and working memory. Link here to original paper by Uncapher et al. 2016.

# use pd.read_csv to open data into python
df = pd.read_csv("uncapher_2016_repeated_measures_dataset.csv")

Familiarize Yourself with the Data

Quick review from Data Cleaning: take a look at the basic data structure, number of rows and columns.

df.head()

subjNum groupStatus numDist conf hitCount allOldCount rtHit faCount allNewCount rtFA distPresent hitRate faRate dprime mmi adhd bis
0 6 HMM 0 hi 18 25 0.990657 3 50 1.062458 nodist 0.711538 0.068627 2.043976 5.77 4 74
1 6 HMM 6 hi 14 25 0.951638 3 50 1.062458 dist 0.557692 0.068627 1.631213 5.77 4 74
2 10 HMM 0 hi 5 25 1.081535 8 50 1.036764 nodist 0.211538 0.166667 0.166327 6.21 4 51
3 10 HMM 6 hi 5 25 0.999527 8 50 1.036764 dist 0.211538 0.166667 0.166327 6.21 4 51
4 14 HMM 0 hi 3 25 2.346210 4 50 2.075087 nodist 0.134615 0.088235 0.246866 8.60 5 62
df.shape
(136, 17)
df.columns
Index(['subjNum', 'groupStatus', 'numDist', 'conf', 'hitCount', 'allOldCount',
       'rtHit', 'faCount', 'allNewCount', 'rtFA', 'distPresent', 'hitRate',
       'faRate', 'dprime', 'mmi', 'adhd', 'bis'],
      dtype='object')

Selecting Relevant Variables

Sometimes datasets have many variables that are unnecessary for a given analysis. To simplify your life, and your code, we can select only the given variables we’d like to use for now.

df = df[["subjNum", "groupStatus", "adhd", "hitRate", "faRate", "dprime"]]
df.head()

subjNum groupStatus adhd hitRate faRate dprime
0 6 HMM 4 0.711538 0.068627 2.043976
1 6 HMM 4 0.557692 0.068627 1.631213
2 10 HMM 4 0.211538 0.166667 0.166327
3 10 HMM 4 0.211538 0.166667 0.166327
4 14 HMM 5 0.134615 0.088235 0.246866

Basic Descriptives

Summarizing data

Let’s learn how to make simple tables of summary statistics. First, we will calculate summary info across all data using describe(), a useful function for creating summaries. Note that we’re not creating a new object for this summary (i.e. not using the = symbol), so this will print but not save.

df.describe()

subjNum adhd hitRate faRate dprime
count 136.000000 136.000000 136.000000 136.000000 136.000000
mean 72.676471 2.382353 0.350679 0.081603 1.133846
std 41.664779 1.651302 0.153422 0.073607 0.566277
min 2.000000 0.000000 0.057692 0.009804 0.047920
25% 38.500000 1.000000 0.250000 0.029412 0.712359
50% 73.000000 2.000000 0.326923 0.058824 1.094755
75% 108.500000 4.000000 0.442308 0.107843 1.545407
max 142.000000 5.000000 0.788462 0.362745 2.478890

Grouping Data

Next, we will learn how to group data based on certain variables of interest.

We will use the groupby() function in pandas, which will automatically group any subsequent actions called on the data.

df.groupby(["groupStatus"]).mean()

subjNum adhd hitRate faRate dprime
groupStatus
HMM 74.852941 2.941176 0.345023 0.088235 1.032845
LMM 70.500000 1.823529 0.356335 0.074971 1.234847

We can group data by more than one factor. Let’s say we’re interested in how levels of adhd interact with groupStatus (multitasking: high or low).

We will first make a factor for adhd (median-split), and add it as a grouping variable using the cut() function in pandas:

df["adhdF"] = pd.cut(df["adhd"],bins=2,labels=["Low","High"])

Then we’ll check how evenly split these groups are by using groupby() the size() functions:

df.groupby(["groupStatus","adhdF"]).size()
groupStatus  adhdF
HMM          Low      24
             High     44
LMM          Low      46
             High     22
dtype: int64

Then we’ll calculate some summary info about these groups:

df.groupby(["groupStatus","adhdF"]).mean()

subjNum adhd hitRate faRate dprime
groupStatus adhdF
HMM Low 82.500000 1.083333 0.286859 0.084967 0.818802
High 70.681818 3.954545 0.376748 0.090018 1.149595
LMM Low 72.478261 0.913043 0.388796 0.078005 1.302366
High 66.363636 3.727273 0.288462 0.068627 1.093670

A note on piping / stringing commands together

In R, we often use the pipe %>% to string a series of steps together. We can do the same in Python with many functions in a row.

This is how we’re able to take the output of df.groupby(["groupStatus","adhdF"]) and then send that output into the mean() function


Extra: Working with a Long Dataset

This is a repeated measures (“long”) dataset, with multiple rows per subject. This makes things a bit tricker, but we are going to show you some tools for how to work with “long” datasets.

How many unique subjects are in the data?

subList = df["subjNum"].unique()
nSubs = len(subList)
nSubs
68

How many trials were there per subject?

nTrialsPerSubj = df.groupby(["subjNum"]).size().reset_index(name="nTrials")
nTrialsPerSubj.head()

subjNum nTrials
0 2 2
1 5 2
2 6 2
3 9 2
4 10 2

Combine summary statistics with the full dataframe

For some analyses, you might want to add a higher level variable (e.g., subject average hitRate) alongside your long data. We can do this by summarizing the data in a new dataframe and then merging it with the full data.

avgHR = df.groupby(["subjNum"])["hitRate"].mean().reset_index(name="avgHR")
avgHR.head()

subjNum avgHR
0 2 0.326923
1 5 0.403846
2 6 0.634615
3 9 0.173077
4 10 0.211538
df = df.merge(avgHR,on="subjNum")
df.head()

subjNum groupStatus adhd hitRate faRate dprime adhdF avgHR
0 6 HMM 4 0.711538 0.068627 2.043976 High 0.634615
1 6 HMM 4 0.557692 0.068627 1.631213 High 0.634615
2 10 HMM 4 0.211538 0.166667 0.166327 High 0.211538
3 10 HMM 4 0.211538 0.166667 0.166327 High 0.211538
4 14 HMM 5 0.134615 0.088235 0.246866 High 0.211538

You should now have an avgHR column in df, which will repeat within each subject, but vary across subjects.

Next: Plotting in Python