Data Cleaning in Python
Goals of this Lesson
Students will learn:
- How to open various data types in Python
- How to check for missing or problematic data and address issues
- How to filter, rearrange and shape data in preparation for analysis
Links to Files and Video Recording
The files for all tutorials can be downloaded from the Columbia Psychology Scientific Computing GitHub page using these instructions. This particular file is located here: `/content/tutorials/python/2-datacleaning/index.ipynb`.
For a video recording of this tutorial from the Fall 2020 workshop, please visit the Workshop Recording: Session 3 page.
A Quick Introduction to Python Scientific Computing Modules
As a programming language, Python can do quite a lot. For example, it is an extremely popular choice for GUI and web-based application development (Reddit, Google, Facebook), databases (Spotify, Netflix), and scientific computing (NASA, for example, but also us!).
One reason that Python is so widely used is due to its extensive library of third-party modules. Let’s start by briefly covering the most important modules for scientific computing, some (but not all) of which we’ll be using today.
Data Analysis
- NumPy: The fundamental package for scientific computing in Python. NumPy provides Python with most of the functionality of MATLAB.
- SciPy: Provides many user-friendly and efficient numerical routines such as routines for numerical integration, interpolation, optimization, linear algebra and statistics.
- Pandas: Provides high-performance, easy-to-use data structures and data analysis tools. Pandas provides Python with most of the functionality of R.
Data Visualization
- Matplotlib: Python 2D plotting library which produces publication quality figures. The pyplot module provides a MATLAB-like interface and is what most people use.
- Seaborn: A Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
We’ll now import a few of these modules using their standard abbreviations.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_context("poster")
sns.set(style="ticks",font="Arial",font_scale=2)
In order to call one of the functions belonging to a particular module, you can use the `.` syntax. For example, `numpy` has a `mean()` function which will compute the arithmetic mean across an axis. If we wanted to call that function, we would simply write:
np.mean([2,4,6])
4.0
For those coming from R, this is the equivalent of something like `dplyr::filter()`. Python is stricter than R about making sure you specify which library the function you are using comes from.
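If the module prefix feels verbose, you can also import a single function directly, though keeping the prefix makes it obvious where each function comes from. A quick sketch:

```python
from numpy import mean

mean([2, 4, 6])  # same result as np.mean([2, 4, 6]) above, i.e. 4.0
```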
Now that you’re familiar with the basics of modules in Python, let’s go ahead and move on to some data cleaning.
Python Data Structures
There are a few ways that data can be stored and manipulated in Python, some of which you’ve already covered.
To review, the first and most basic is a list:
mylist = [1,2,3]
mylist
[1, 2, 3]
Lists can be arbitrarily long and can hold multiple types of data, although this usually isn’t a good idea:
mylist2 = [1,"2",3.0, [4,5]]
mylist2
[1, '2', 3.0, [4, 5]]
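As with any Python sequence, you can pull individual items out of a list by position, counting from 0. For example, using the `mylist2` we just defined:

```python
mylist2[1]     # '2' (the string, not the number 2)
mylist2[3][0]  # 4 (the first element of the nested list [4, 5])
```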
Similar to lists, the `numpy` module provides the ability to work with n-dimensional arrays, for numerical data only. We can initialize an array full of zeros using the `np.zeros()` function:
myarray = np.zeros((10))
myarray
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
If we want to work with numerical data that has two dimensions, we can create a matrix in a very similar way:
mymatrix = np.zeros((10,2))
mymatrix
array([[0., 0.],
[0., 0.],
[0., 0.],
[0., 0.],
[0., 0.],
[0., 0.],
[0., 0.],
[0., 0.],
[0., 0.],
[0., 0.]])
We won’t be working much with `numpy` arrays directly today, but you should know that they are often a better option than lists when you are working with numerical data.
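One concrete reason arrays are the better option: arithmetic works elementwise on arrays, while the same operators mean something quite different for lists. Compare:

```python
[1, 2, 3] * 2            # list repetition: [1, 2, 3, 1, 2, 3]
np.array([1, 2, 3]) * 2  # elementwise math: array([2, 4, 6])
```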
Today, we will primarily be working with `pandas` dataframes. This object provides functionality that is very similar to dataframes in R. Let’s start by converting our empty matrix into a dataframe. We can also give each of our columns more informative names:
mydataframe = pd.DataFrame(mymatrix,columns=["Height","Weight"])
mydataframe
|   | Height | Weight |
|---|---|---|
0 | 0.0 | 0.0 |
1 | 0.0 | 0.0 |
2 | 0.0 | 0.0 |
3 | 0.0 | 0.0 |
4 | 0.0 | 0.0 |
5 | 0.0 | 0.0 |
6 | 0.0 | 0.0 |
7 | 0.0 | 0.0 |
8 | 0.0 | 0.0 |
9 | 0.0 | 0.0 |
Another way we can create a dataframe is by first creating a dictionary and then converting this to a dataframe. A dictionary is another type of data structure used by Python. Dictionaries consist of a collection of key-value pairs; keys are used to index the dictionary in order to access the values associated with them. Let’s start by making a simple dictionary with one key and one value:
data = {'name':["Monica","Michelle","Paul","Ellen"]}
data
{'name': ['Monica', 'Michelle', 'Paul', 'Ellen']}
If we index this dictionary using the `name` key, it will return its value, which is a list of names:
data['name']
['Monica', 'Michelle', 'Paul', 'Ellen']
We can also add new key-value pairs to this dictionary:
data['score'] = [16,20,19,35]
data['year'] = [2, 5, 2, 1]
data
{'name': ['Monica', 'Michelle', 'Paul', 'Ellen'],
'score': [16, 20, 19, 35],
'year': [2, 5, 2, 1]}
Similar to how we made a dataframe from our `numpy` array above, we can easily make a dataframe from this dictionary:
dataframe = pd.DataFrame(data)
dataframe
|   | name | score | year |
|---|---|---|---|
0 | Monica | 16 | 2 |
1 | Michelle | 20 | 5 |
2 | Paul | 19 | 2 |
3 | Ellen | 35 | 1 |
Reading Data into Python
It’s easy to introduce errors if you are entering data manually like above, and with a lot of data it would get tedious. Most of the time, you’ll be reading data from an external file (.txt or .csv), or opening up an existing dataset in Python. Once you find the location of your files, what you do next will depend on the file format.
Reminder about the `os` module
This module provides a way to interface with the operating system we are running Python on (Windows, Mac, or Linux). Let’s start by first loading this module:
import os
It’s always important to check where our working directory is when trying to read data into Python.
os.getcwd()
'/Users/emily/Documents/GitHub/cu-psych-comp-tutorial/content/tutorials/python/2-datacleaning'
You can access a list of everything (all files and directories) within your working directory using the `os.listdir()` function…
os.listdir()
['.DS_Store',
'Study1_Followup.csv',
'index.md',
'Study2_Subjects.csv',
'Study1_clean.csv',
'.ipynb_checkpoints',
'Study1.csv',
'challenge',
'README_Study2.txt',
'Study2_Trials.csv',
'index.ipynb']
…as well as in the “Files” tab on the left-hand side of the JupyterLab window.
What kind of file do you have?
For .txt, .csv, or any kind of delimited (such as tab-delimited) file, you can use the `pandas` function `read_table()`:
#help(pd.read_table)
mydata = pd.read_table("Study1.csv", sep=",")
If you know you have a .csv file, another common option is `read_csv()`, which has a default comma separator.
Remember, all of these commands can have arguments that will help Python make sense of your data. To find out what arguments are possible, you can use the `help()` function like we did above to look at what `read_table()` does.

To do this, just put whatever command you would like to learn about inside of `help()` (e.g. `help(pd.read_table)`). Remember that for functions associated with a particular module you will need to tell Python which module they come from using the `.` syntax.
You can always also Google a function to quickly find this information.
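As one illustration of how these arguments get used, here is a hypothetical call (the file name and missing-value code are made up for this example) that reads a tab-delimited file with no header row:

```python
# hypothetical file and values, for illustration only
mydata_tab = pd.read_table("some_study.txt",
                           sep="\t",            # tab-delimited file
                           header=None,         # file has no header row
                           na_values=["-999"])  # treat -999 as missing
```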
Inspecting your data
Now that you have data, it’s time to get some results! But wait! Are you sure this data is OK? Doing some basic steps to inspect your data now can save you lots of headaches later, and Python makes it really easy.
Start by checking that you have the expected number of rows and columns in your data frame. You can do this by asking Python:
# get the number of rows and columns
mydata.shape
(50, 9)
# get the names of columns
mydata.columns
Index(['ID', 'Age', 'Sex', 'Condition', 'Personality', 'T1', 'T2', 'T3', 'T4'], dtype='object')
# take a peek at the first few rows
mydata.head()
|   | ID | Age | Sex | Condition | Personality | T1 | T2 | T3 | T4 |
|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 21.0 | Female | 0 | 67 | 5 | 2 | 3 | 1 |
1 | 2 | 22.0 | Female | 1 | 63 | 4 | 1 | 5 | 2 |
2 | 3 | 24.0 | Male | 1 | 58 | 6 | 2 | 6 | 2 |
3 | 4 | 22.0 | Female | 1 | 51 | 2 | 3 | 6 | 5 |
4 | 5 | 19.0 | Female | 0 | 49 | 7 | 6 | 5 | 2 |
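A couple of other quick checks can be useful at this stage, for example confirming that each column was read in with the type you expect and skimming summary statistics:

```python
mydata.dtypes      # the data type of each column
mydata.describe()  # count, mean, std, min/max, and quartiles for numeric columns
```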
Rename a variable
Now that we’ve loaded our data into Python and have made sure it makes sense, we can start manipulating and cleaning it.
Look back at your dataframe. What is the fifth variable? What does that even mean? Luckily, this is your study and you know that it’s a personality questionnaire measuring neuroticism. Let’s fix that name and make it more intuitive:
mydata = mydata.rename({'Personality':'Neuroticism'}, axis="columns")
mydata.head()
|   | ID | Age | Sex | Condition | Neuroticism | T1 | T2 | T3 | T4 |
|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 21.0 | Female | 0 | 67 | 5 | 2 | 3 | 1 |
1 | 2 | 22.0 | Female | 1 | 63 | 4 | 1 | 5 | 2 |
2 | 3 | 24.0 | Male | 1 | 58 | 6 | 2 | 6 | 2 |
3 | 4 | 22.0 | Female | 1 | 51 | 2 | 3 | 6 | 5 |
4 | 5 | 19.0 | Female | 0 | 49 | 7 | 6 | 5 | 2 |
We can also rename multiple variables at once:
mydata = mydata.rename({'T1':'Day1',
'T2':'Day2',
'T3':'Day3',
'T4':'Day4'}, axis="columns")
mydata.head()
|   | ID | Age | Sex | Condition | Neuroticism | Day1 | Day2 | Day3 | Day4 |
|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 21.0 | Female | 0 | 67 | 5 | 2 | 3 | 1 |
1 | 2 | 22.0 | Female | 1 | 63 | 4 | 1 | 5 | 2 |
2 | 3 | 24.0 | Male | 1 | 58 | 6 | 2 | 6 | 2 |
3 | 4 | 22.0 | Female | 1 | 51 | 2 | 3 | 6 | 5 |
4 | 5 | 19.0 | Female | 0 | 49 | 7 | 6 | 5 | 2 |
Adding a new column
Often we’ll want to add some new data into a dataframe.
# here we add a column where all the values are the same string
mydata['studyName'] = 'study1'
# here we add a column 'random' of 50 random numbers
mydata['random'] = np.random.random(50)
mydata.head()
|   | ID | Age | Sex | Condition | Neuroticism | Day1 | Day2 | Day3 | Day4 | studyName | random |
|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 21.0 | Female | 0 | 67 | 5 | 2 | 3 | 1 | study1 | 0.726704 |
1 | 2 | 22.0 | Female | 1 | 63 | 4 | 1 | 5 | 2 | study1 | 0.118629 |
2 | 3 | 24.0 | Male | 1 | 58 | 6 | 2 | 6 | 2 | study1 | 0.925191 |
3 | 4 | 22.0 | Female | 1 | 51 | 2 | 3 | 6 | 5 | study1 | 0.639571 |
4 | 5 | 19.0 | Female | 0 | 49 | 7 | 6 | 5 | 2 | study1 | 0.579511 |
For those coming from R, the Python syntax for referencing columns as `df["columnName"]` is roughly equivalent to using R’s `$` operator.
Removing Columns
We can remove columns with the `.drop()` function:
mydata = mydata.drop(['random', 'studyName'], axis=1)
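An equivalent way to write the same operation, if you prefer to name the columns explicitly, is the `columns` keyword (shown commented out here, since these columns are already gone):

```python
# equivalent to the call above
# mydata = mydata.drop(columns=['random', 'studyName'])
```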
Indexing a Dataframe
Sometimes you might want to look at only a subset of the columns in a dataframe (for example, when there are many variables). Doing this with a `pandas` dataframe is relatively straightforward:
# indexing a single column
ids = mydata[['ID']]
ids.head()
|   | ID |
|---|---|
0 | 1 |
1 | 2 |
2 | 3 |
3 | 4 |
4 | 5 |
# indexing multiple columns
mydata_subset = mydata[['ID','Age','Neuroticism']]
mydata_subset.head()
|   | ID | Age | Neuroticism |
|---|---|---|---|
0 | 1 | 21.0 | 67 |
1 | 2 | 22.0 | 63 |
2 | 3 | 24.0 | 58 |
3 | 4 | 22.0 | 51 |
4 | 5 | 19.0 | 49 |
Using .loc and .iloc to index DataFrames
- If we want to pull out or manipulate specific pieces of dataframes, we can use the `.loc[]` and `.iloc[]` functions.
- With both functions, the data referenced is always formatted as `[selection of rows, selection of columns]`.

`.loc[]` takes selections of rows from named columns. So, here we’re asking for elements 0:2 from the `Age` column:
mydata.loc[0:2, ['Age']]
|   | Age |
|---|---|
0 | 21.0 |
1 | 22.0 |
2 | 24.0 |
We can also use conditional logic to select rows. Here, we ask for all elements in the `Age` column that are above 24:
mydata.loc[mydata['Age'] > 24, ['Age']]
|   | Age |
|---|---|
5 | 25.0 |
17 | 25.0 |
25 | 25.0 |
27 | 25.0 |
37 | 25.0 |
`.iloc[]` takes selections of rows and columns using numeric indices.
mydata.iloc[3:7, 1:4]
|   | Age | Sex | Condition |
|---|---|---|---|
3 | 22.0 | Female | 1 |
4 | 19.0 | Female | 0 |
5 | 25.0 | Female | 1 |
6 | 24.0 | Female | 1 |
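You can also combine conditions inside `.loc[]` using `&` (and) and `|` (or), wrapping each condition in parentheses. For example, one way to pull participants over 24 who are in condition 1:

```python
mydata.loc[(mydata['Age'] > 24) & (mydata['Condition'] == 1), ['Age', 'Condition']]
```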
Check for missing data
One problem you may have is missing data. Sometimes this is something you already know about, but you should check your data frame anyway to make sure nothing got missed in a data entry error. For small datasets, you can do this visually, but for larger ones you can ask Python.
mydata.isnull()
#mydata.isnull().values.any()
|   | ID | Age | Sex | Condition | Neuroticism | Day1 | Day2 | Day3 | Day4 |
|---|---|---|---|---|---|---|---|---|---|
0 | False | False | False | False | False | False | False | False | False |
1 | False | False | False | False | False | False | False | False | False |
2 | False | False | False | False | False | False | False | False | False |
3 | False | False | False | False | False | False | False | False | False |
4 | False | False | False | False | False | False | False | False | False |
5 | False | False | False | False | False | False | False | False | False |
6 | False | False | False | False | False | False | False | False | False |
7 | False | False | False | False | False | False | False | False | False |
8 | False | False | False | False | False | False | False | False | False |
9 | False | False | False | False | False | False | False | False | False |
10 | False | False | False | False | False | False | False | False | False |
11 | False | False | False | False | False | False | False | False | False |
12 | False | False | False | False | False | False | False | False | False |
13 | False | False | False | False | False | False | False | False | False |
14 | False | False | False | False | False | False | False | False | False |
15 | False | False | False | False | False | False | False | False | False |
16 | False | False | False | False | False | False | False | False | False |
17 | False | False | False | False | False | False | False | False | False |
18 | False | False | False | False | False | False | False | False | False |
19 | False | False | False | False | False | False | False | False | False |
20 | False | False | False | False | False | False | False | False | False |
21 | False | False | False | False | False | False | False | False | False |
22 | False | False | False | False | False | False | False | False | False |
23 | False | False | False | False | False | False | False | False | False |
24 | False | False | False | False | False | False | False | False | False |
25 | False | False | False | False | False | False | False | False | False |
26 | False | False | False | False | False | False | False | False | False |
27 | False | False | False | False | False | False | False | False | False |
28 | False | False | False | False | False | False | False | False | False |
29 | False | False | False | False | False | False | False | False | False |
30 | False | False | False | False | False | False | False | False | False |
31 | False | False | False | False | False | False | False | False | False |
32 | False | False | False | False | False | False | False | False | False |
33 | False | False | False | False | False | False | False | False | False |
34 | False | False | False | False | False | False | False | False | False |
35 | False | False | False | False | False | False | False | False | False |
36 | False | False | False | False | False | False | False | False | False |
37 | False | False | False | False | False | False | False | False | False |
38 | False | True | False | False | False | False | False | False | False |
39 | False | False | False | False | False | False | False | False | False |
40 | False | False | False | False | False | False | False | False | False |
41 | False | False | False | False | False | False | False | False | False |
42 | False | False | False | False | False | False | False | False | False |
43 | False | False | False | False | False | False | False | False | False |
44 | False | False | False | False | False | False | False | False | False |
45 | False | False | False | False | False | False | False | False | False |
46 | False | False | False | False | False | False | False | False | False |
47 | False | False | False | False | False | False | False | False | False |
48 | False | False | False | False | False | False | False | False | False |
49 | False | False | False | False | False | False | False | False | False |
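Scanning a table of booleans by eye doesn’t scale well. A more compact check is to count the missing values in each column, and then pull out any rows that contain them:

```python
mydata.isnull().sum()           # number of missing values per column
mydata[mydata['Age'].isnull()]  # show the row(s) with a missing Age
```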
In this case, the missing value is the `Age` value in row 38. You know you have this info somewhere on a paper form, so you go dig it up and want to replace it.
# Verify that this row contains the missing data
mydata.loc[mydata["ID"]==39]
|   | ID | Age | Sex | Condition | Neuroticism | Day1 | Day2 | Day3 | Day4 |
|---|---|---|---|---|---|---|---|---|---|
38 | 39 | NaN | Female | 1 | 53 | 1 | 3 | 1 | 3 |
# Replace this row's Age value with 30
mydata.loc[mydata["ID"]==39, "Age"] = 30
# Verify that the replacement worked
mydata.loc[mydata["ID"]==39]
|   | ID | Age | Sex | Condition | Neuroticism | Day1 | Day2 | Day3 | Day4 |
|---|---|---|---|---|---|---|---|---|---|
38 | 39 | 30.0 | Female | 1 | 53 | 1 | 3 | 1 | 3 |
Check for correct values
Let’s take a look at the `Sex` variable:
mydata['Sex'].head()
0 Female
1 Female
2 Male
3 Female
4 Female
Name: Sex, dtype: object
It looks like there are two categories here, but let’s double check. We can use the `unique()` function to list all of the unique values in a column:
mydata["Sex"].unique()
array(['Female', 'Male', 'Femle'], dtype=object)
Here we see another data entry problem. At least one of the rows has a third category label that should really be another case of “Female”. Let’s replace this label using the `replace()` function:
mydata["Sex"] = mydata["Sex"].replace('Femle', 'Female')
# Verify that the replacement worked
mydata["Sex"].unique()
array(['Female', 'Male'], dtype=object)
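Relatedly, `value_counts()` shows how often each label occurs, which makes a stray category (or a suspiciously rare one) easy to spot:

```python
mydata["Sex"].value_counts()
```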
Now let’s look at some of the continuous variables. You can also inspect these by indexing them individually, but sometimes it’s easier to visualize. The `hist()` function, which creates histograms, is good here.
mydata["Age"].hist();
mydata["Neuroticism"].hist();
Looks like we have a potential outlier on the `Neuroticism` score. This could be an entry error, but it could also be a real value that just happens to be really low. This is why data inspection is so important for later analysis: now that you know the value is there, it’s up to you to decide how to deal with it.
Filtering data
Let’s say we have decided a priori to exclude values more than 3 SD above or below the mean. We will first define these boundaries:
upper = np.mean(mydata["Neuroticism"]) + 3*np.std(mydata["Neuroticism"])
lower = np.mean(mydata["Neuroticism"]) - 3*np.std(mydata["Neuroticism"])
We can now use conditional indexing to exclude all rows with a `Neuroticism` score above or below these values:
mydata = mydata[(mydata["Neuroticism"] > lower) & (mydata["Neuroticism"] < upper)]
The line above says: keep only the rows with a `Neuroticism` value greater than the `lower` boundary and less than the `upper` boundary, and save the result back into the `mydata` variable.
# Verify that we excluded 1 outlier
mydata.shape
(49, 9)
mydata["Neuroticism"].hist();
Getting Ready for Analysis
Now that we’ve gone through and cleaned up the problems, you can think ahead to how you’ll want to use this data.
Recoding variables
Sometimes we want to treat categorical variables as factors, but sometimes we want to pretend they’re numeric (as in a regression, when binary variables can be coded as 0 and 1). Right now, `Condition` is coded as a binary numeric variable, but that’s not very informative, so you’d rather have the values be descriptive. Here, the `replace()` function is again useful:
mydata['ConditionF'] = mydata['Condition'].replace([0,1], ['Control','Treatment'])
# Verify that your variable is now recoded as you'd like
mydata[['Condition','ConditionF']].head()
|   | Condition | ConditionF |
|---|---|---|
0 | 0 | Control |
1 | 1 | Treatment |
2 | 1 | Treatment |
3 | 1 | Treatment |
4 | 0 | Control |
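An equivalent way to write this recoding is with `map()`, passing a dictionary from old values to new ones. One difference to keep in mind: `map()` returns NaN for any value not listed in the dictionary, whereas `replace()` leaves unlisted values untouched.

```python
# same recoding as above, written with a dictionary
mydata['ConditionF'] = mydata['Condition'].map({0: 'Control', 1: 'Treatment'})
```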
Calculating new variables
You may also want to recalculate or rescale some variables. For example, we can turn `Neuroticism` into a z-score, or calculate an average response across the four time points.

To compute a z-score, we can use the `zscore()` function from the `scipy.stats` module:
from scipy.stats import zscore
mydata['NeuroticismZ'] = zscore(mydata['Neuroticism'])
mydata['NeuroticismZ'].hist();
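If you’d rather not import `scipy` just for this, the same standardization can be written by hand. Note that `zscore()` divides by the population standard deviation (`ddof=0`), so we match that here:

```python
# a by-hand version of the zscore() call above
neuro = mydata['Neuroticism']
mydata['NeuroticismZ'] = (neuro - neuro.mean()) / neuro.std(ddof=0)
```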
To calculate each participant’s average across the four days, we can use the `mean()` function from `pandas` on a dataframe that has been indexed to include only the four day columns:
mydata['DayMean'] = mydata[['Day1','Day2','Day3','Day4']].mean(axis="columns")
mydata['DayMean'].hist();
Combining data from multiple sources
Sometimes, data might be spread across multiple files, and you’ll want to combine those for your analysis. For example, maybe this study had a follow-up survey on Day 30. Scores from that survey were entered into another spreadsheet, which only has the subject ID and that score. We want to merge that score into our data.
# first load the followup dataset
mydata2 = pd.read_csv("Study1_Followup.csv")
We can use the `merge()` function to combine the two dataframes. To make sure the data matches up, we use the `on` argument to specify that IDs should match. That way, even if the rows are in a different order, scores will be matched up with the correct IDs.
mydata = mydata.merge(mydata2,on="ID")
mydata.head()
|   | ID | Age | Sex | Condition | Neuroticism | Day1 | Day2 | Day3 | Day4 | ConditionF | NeuroticismZ | DayMean | Day30 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 21.0 | Female | 0 | 67 | 5 | 2 | 3 | 1 | Control | 0.754583 | 2.75 | 6 |
1 | 2 | 22.0 | Female | 1 | 63 | 4 | 1 | 5 | 2 | Treatment | 0.418450 | 3.00 | 6 |
2 | 3 | 24.0 | Male | 1 | 58 | 6 | 2 | 6 | 2 | Treatment | -0.001715 | 4.00 | 6 |
3 | 4 | 22.0 | Female | 1 | 51 | 2 | 3 | 6 | 5 | Treatment | -0.589946 | 4.00 | 3 |
4 | 5 | 19.0 | Female | 0 | 49 | 7 | 6 | 5 | 2 | Control | -0.758013 | 5.00 | 6 |
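By default, `merge()` performs an “inner” join, keeping only IDs that appear in both dataframes. If some participants might be missing from the follow-up file and you want to keep their rows anyway (with NaN for `Day30`), you could pass `how="left"` instead (shown commented out, since we have already merged):

```python
# alternative to the merge above: keep every row of mydata
# mydata = mydata.merge(mydata2, on="ID", how="left")
```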
Shaping data
Finally, you may want to change the layout of your data. Right now, our dataframe is in “wide” format, which means that each row is a subject, and each observation gets its own column. For some analyses, you’ll need to use “long” format, where each row is an observation, and columns specify things like `Time` and `ID` to differentiate the observations. For this, we can use the `melt()` function in `pandas`:
value_cols = ["Day1","Day2","Day3","Day4"] # columns we would like to convert to a single "long" column
id_cols = [c for c in mydata.columns if c not in value_cols] # columns we would like to stay in the same "wide" format
Wide → Long
mydata_Long = pd.melt(mydata,id_vars=id_cols,var_name="Time",value_vars=value_cols,value_name="Score")
mydata_Long.head()
|   | ID | Age | Sex | Condition | Neuroticism | ConditionF | NeuroticismZ | DayMean | Day30 | Time | Score |
|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 21.0 | Female | 0 | 67 | Control | 0.754583 | 2.75 | 6 | Day1 | 5 |
1 | 2 | 22.0 | Female | 1 | 63 | Treatment | 0.418450 | 3.00 | 6 | Day1 | 4 |
2 | 3 | 24.0 | Male | 1 | 58 | Treatment | -0.001715 | 4.00 | 6 | Day1 | 6 |
3 | 4 | 22.0 | Female | 1 | 51 | Treatment | -0.589946 | 4.00 | 3 | Day1 | 2 |
4 | 5 | 19.0 | Female | 0 | 49 | Control | -0.758013 | 5.00 | 6 | Day1 | 7 |
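After melting, the rows are grouped by time point (all of Day1, then all of Day2, and so on). If you’d rather see each participant’s four observations together, one option is to sort:

```python
mydata_Long = mydata_Long.sort_values(["ID", "Time"]).reset_index(drop=True)
mydata_Long.head()
```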
Long → Wide
We can go back in the other direction by using the `pivot_table()` function in `pandas`:
mydata_Wide = mydata_Long.pivot_table(values="Score", index=id_cols, columns='Time').reset_index()
mydata_Wide.columns.name = None
mydata_Wide.head()
|   | ID | Age | Sex | Condition | Neuroticism | ConditionF | NeuroticismZ | DayMean | Day30 | Day1 | Day2 | Day3 | Day4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 21.0 | Female | 0 | 67 | Control | 0.754583 | 2.75 | 6 | 5 | 2 | 3 | 1 |
1 | 2 | 22.0 | Female | 1 | 63 | Treatment | 0.418450 | 3.00 | 6 | 4 | 1 | 5 | 2 |
2 | 3 | 24.0 | Male | 1 | 58 | Treatment | -0.001715 | 4.00 | 6 | 6 | 2 | 6 | 2 |
3 | 4 | 22.0 | Female | 1 | 51 | Treatment | -0.589946 | 4.00 | 3 | 2 | 3 | 6 | 5 |
4 | 5 | 19.0 | Female | 0 | 49 | Control | -0.758013 | 5.00 | 6 | 7 | 6 | 5 | 2 |
Saving Your Work
Once you’ve created a data cleaning script like this one, you’ll have a record of all the edits you’ve made on the raw data, and you can recreate your cleaned data just by running the script again. However, it’s often easier to save your cleaned data as its own file (never overwrite the raw data), so when you come back to do analysis you don’t have to bother with all the cleaning steps.
You can always save data frames as a .csv for easy sharing and viewing outside of Python.
# write data to a .csv
mydata.to_csv("Study1_clean.csv", index=False)
Future Directions
Congratulations! You’ve now cleaned some data in Python and you’re ready to start working with your data. This tutorial only went over some basic cleaning steps. As you work with your own data, you may find yourself needing other tools.