Data Manipulation in Python

core python

Goals of this Lesson

Students will learn:

How to group and categorize data in Python
How to generative descriptive statistics in Python

Links to Files and Video Recording

The files for all tutorials can be downloaded from the Columbia Psychology Scientific Computing GitHub page using these instructions. This particular file is located here: /content/tutorials/python/3-datamanipulation/index.ipynb.

For a video recording of this tutorial from the Fall 2020 workshop, please visit the Workshop Recording: Session 3 page.

# load packages we will be using for this lesson
import pandas as pd

Open Dataset and Load Package

This dataset examines the relationship between multitasking and working memory. Link here to original paper by Uncapher et al. 2016.

# use pd.read_csv to open data into python
df = pd.read_csv("uncapher_2016_repeated_measures_dataset.csv")

Familiarize Yourself with the Data

Quick review from Data Cleaning: take a look at the basic data structure, number of rows and columns.

df.head()

	subjNum	groupStatus	numDist	conf	hitCount	allOldCount	rtHit	faCount	allNewCount	rtFA	distPresent	hitRate	faRate	dprime	mmi	adhd	bis
0	6	HMM	0	hi	18	25	0.990657	3	50	1.062458	nodist	0.711538	0.068627	2.043976	5.77	4	74
1	6	HMM	6	hi	14	25	0.951638	3	50	1.062458	dist	0.557692	0.068627	1.631213	5.77	4	74
2	10	HMM	0	hi	5	25	1.081535	8	50	1.036764	nodist	0.211538	0.166667	0.166327	6.21	4	51
3	10	HMM	6	hi	5	25	0.999527	8	50	1.036764	dist	0.211538	0.166667	0.166327	6.21	4	51
4	14	HMM	0	hi	3	25	2.346210	4	50	2.075087	nodist	0.134615	0.088235	0.246866	8.60	5	62

df.shape

(136, 17)

df.columns

Index(['subjNum', 'groupStatus', 'numDist', 'conf', 'hitCount', 'allOldCount',
       'rtHit', 'faCount', 'allNewCount', 'rtFA', 'distPresent', 'hitRate',
       'faRate', 'dprime', 'mmi', 'adhd', 'bis'],
      dtype='object')

Selecting Relevant Variables

Sometimes datasets have many variables that are unnecessary for a given analysis. To simplify your life, and your code, we can select only the given variables we’d like to use for now.

df = df[["subjNum", "groupStatus", "adhd", "hitRate", "faRate", "dprime"]]
df.head()

	subjNum	groupStatus	adhd	hitRate	faRate	dprime
0	6	HMM	4	0.711538	0.068627	2.043976
1	6	HMM	4	0.557692	0.068627	1.631213
2	10	HMM	4	0.211538	0.166667	0.166327
3	10	HMM	4	0.211538	0.166667	0.166327
4	14	HMM	5	0.134615	0.088235	0.246866

Basic Descriptives

Summarizing data

Let’s learn how to make simple tables of summary statistics. First, we will calculate summary info across all data using describe(), a useful function for creating summaries. Note that we’re not creating a new object for this summary (i.e. not using the = symbol), so this will print but not save.

df.describe()

	subjNum	adhd	hitRate	faRate	dprime
count	136.000000	136.000000	136.000000	136.000000	136.000000
mean	72.676471	2.382353	0.350679	0.081603	1.133846
std	41.664779	1.651302	0.153422	0.073607	0.566277
min	2.000000	0.000000	0.057692	0.009804	0.047920
25%	38.500000	1.000000	0.250000	0.029412	0.712359
50%	73.000000	2.000000	0.326923	0.058824	1.094755
75%	108.500000	4.000000	0.442308	0.107843	1.545407
max	142.000000	5.000000	0.788462	0.362745	2.478890

Grouping Data

Next, we will learn how to group data based on certain variables of interest.

We will use the groupby() function in pandas, which will automatically group any subsequent actions called on the data.

df.groupby(["groupStatus"]).mean()

	subjNum	adhd	hitRate	faRate	dprime
groupStatus
HMM	74.852941	2.941176	0.345023	0.088235	1.032845
LMM	70.500000	1.823529	0.356335	0.074971	1.234847

We can group data by more than one factor. Let’s say we’re interested in how levels of adhd interact with groupStatus (multitasking: high or low).

We will first make a factor for adhd (median-split), and add it as a grouping variable using the cut() function in pandas:

df["adhdF"] = pd.cut(df["adhd"],bins=2,labels=["Low","High"])

Then we’ll check how evenly split these groups are by using groupby() the size() functions:

df.groupby(["groupStatus","adhdF"]).size()

groupStatus  adhdF
HMM          Low      24
             High     44
LMM          Low      46
             High     22
dtype: int64

Then we’ll calculate some summary info about these groups:

df.groupby(["groupStatus","adhdF"]).mean()

		subjNum	adhd	hitRate	faRate	dprime
groupStatus	adhdF
HMM	Low	82.500000	1.083333	0.286859	0.084967	0.818802
HMM	High	70.681818	3.954545	0.376748	0.090018	1.149595
LMM	Low	72.478261	0.913043	0.388796	0.078005	1.302366
LMM	High	66.363636	3.727273	0.288462	0.068627	1.093670

A note on piping / stringing commands together

In R, we often use the pipe %>% to string a series of steps together. We can do the same in Python with many functions in a row.

This is how we’re able to take the output of df.groupby(["groupStatus","adhdF"]) and then send that output into the mean() function

Extra: Working with a Long Dataset

This is a repeated measures (“long”) dataset, with multiple rows per subject. This makes things a bit tricker, but we are going to show you some tools for how to work with “long” datasets.

How many unique subjects are in the data?

subList = df["subjNum"].unique()
nSubs = len(subList)
nSubs

How many trials were there per subject?

nTrialsPerSubj = df.groupby(["subjNum"]).size().reset_index(name="nTrials")
nTrialsPerSubj.head()

	subjNum	nTrials
0	2	2
1	5	2
2	6	2
3	9	2
4	10	2

Combine summary statistics with the full dataframe

For some analyses, you might want to add a higher level variable (e.g., subject average hitRate) alongside your long data. We can do this by summarizing the data in a new dataframe and then merging it with the full data.

avgHR = df.groupby(["subjNum"])["hitRate"].mean().reset_index(name="avgHR")
avgHR.head()

	subjNum	avgHR
0	2	0.326923
1	5	0.403846
2	6	0.634615
3	9	0.173077
4	10	0.211538

df = df.merge(avgHR,on="subjNum")
df.head()

	subjNum	groupStatus	adhd	hitRate	faRate	dprime	adhdF	avgHR
0	6	HMM	4	0.711538	0.068627	2.043976	High	0.634615
1	6	HMM	4	0.557692	0.068627	1.631213	High	0.634615
2	10	HMM	4	0.211538	0.166667	0.166327	High	0.211538
3	10	HMM	4	0.211538	0.166667	0.166327	High	0.211538
4	14	HMM	5	0.134615	0.088235	0.246866	High	0.211538

You should now have an avgHR column in df, which will repeat within each subject, but vary across subjects.

Next: Plotting in Python