Plotting using matplotlib and seaborn


Goals of this lesson

Students will learn:

  1. How to generate basic statistical visualizations in Python using the seaborn package
In [1]:
# load packages we will be using for this lesson
import pandas as pd
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

sns.set(rc={'figure.figsize':(12,8)})

1. Load in sample data (NHANES)

We’re going to practice here on a dataset from the 1990 NHANES (National Health and Nutrition Examination Survey). The variables are below:

  • Region - Geographic region in the USA: Northeast (1), Midwest (2), South (3), and West (4)
  • Sex - Biological sex: Male (1), Female (2)
  • Age - Age measured in months (we’ll convert this to years below)
  • Urban - Residential population density: Metropolitan Area (1), Other (2)
  • Weight - Weight in pounds
  • Height - Height in inches
  • BMI - BMI, measured in kg/(m^2)
In [2]:
nhanes = pd.read_csv("NHANES1990.csv")
nhanes.head()
Out[2]:
Region Sex Age Urban Weight Height BMI
0 3 2 513 2 171.7 65.3 28.4
1 4 1 307 2 155.2 62.3 28.2
2 4 2 886 1 166.7 59.2 33.5
3 4 1 458 1 224.7 71.9 30.6
4 2 1 888 2 245.0 67.7 37.6

First, let's clean up the data a little bit:

In [3]:
# convert age from months to years
nhanes['Age'] = nhanes['Age']/12
# replace the numeric codes with readable labels
nhanes['Urban'] = nhanes['Urban'].replace({1:'Metro Area',2:'Non-Metro Area'})
nhanes['Region'] = nhanes['Region'].replace({1:'Northeast',2:'Midwest',3:'South',4:'West'})
In [4]:
nhanes.head()
Out[4]:
Region Sex Age Urban Weight Height BMI
0 South 2 42.750000 Non-Metro Area 171.7 65.3 28.4
1 West 1 25.583333 Non-Metro Area 155.2 62.3 28.2
2 West 2 73.833333 Metro Area 166.7 59.2 33.5
3 West 1 38.166667 Metro Area 224.7 71.9 30.6
4 Midwest 1 74.000000 Non-Metro Area 245.0 67.7 37.6
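
The cleaning step above leaves Sex as numeric codes. As a minimal sketch, the same recoding idea can be applied to it using the labels from the variable list (Male = 1, Female = 2). The small `sample` frame here is a stand-in built from the five rows shown in Out[2], not the full file, and `sex_labels` is just an illustrative name.

```python
import pandas as pd

# Five rows copied from the raw nhanes.head() output (a stand-in for the full CSV)
sample = pd.DataFrame({
    "Sex": [2, 1, 2, 1, 1],
    "Age": [513, 307, 886, 458, 888],  # still in months
})

# Same conversion as the lesson: months to years
sample["Age"] = sample["Age"] / 12

# Recode Sex using the coding given in the variable list
sex_labels = {1: "Male", 2: "Female"}  # illustrative name
sample["Sex"] = sample["Sex"].map(sex_labels)

print(sample)
```

Note that `map` turns any code missing from the dictionary into NaN, whereas `replace` leaves unmatched values unchanged, so which one fits depends on whether unexpected codes should stand out.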

2. Scatter Plot

seaborn makes it possible to create attractive, publication-quality data visualizations with single-line commands. We'll start with a scatter plot, using the scatterplot() function to look at how two of our variables relate to each other.

In [9]:
sns.scatterplot(x="Age",y="Weight",data=nhanes);
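
scatterplot() returns the matplotlib Axes it drew on, so the plot can be adjusted after the fact. A minimal sketch, assuming we want the units from the variable list on the axis labels; the `sample` frame is built from the five cleaned rows shown in Out[4] rather than the full dataset, and the title text is made up for illustration.

```python
import pandas as pd
import seaborn as sns

# Five rows copied from the cleaned nhanes.head() output (a stand-in for the full data)
sample = pd.DataFrame({
    "Age": [42.75, 25.583333, 73.833333, 38.166667, 74.0],
    "Weight": [171.7, 155.2, 166.7, 224.7, 245.0],
})

# Keep the returned Axes so we can label it
ax = sns.scatterplot(x="Age", y="Weight", data=sample)
ax.set(xlabel="Age (years)", ylabel="Weight (pounds)",
       title="Weight by age (sample rows)")
```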

We can also easily represent another dimension of the data on this plot using the size of the points. Let's map point size to BMI:

In [10]:
sns.scatterplot(x="Age",y="Weight",data=nhanes,size="BMI");
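
By default seaborn picks the range of point sizes itself. As a sketch, the `sizes` argument of scatterplot() can set the smallest and largest marker area explicitly; the (20, 200) range below is an arbitrary choice for illustration, and `sample` again holds only the five rows shown in Out[4].

```python
import pandas as pd
import seaborn as sns

# Five rows copied from the cleaned nhanes.head() output (a stand-in for the full data)
sample = pd.DataFrame({
    "Age": [42.75, 25.583333, 73.833333, 38.166667, 74.0],
    "Weight": [171.7, 155.2, 166.7, 224.7, 245.0],
    "BMI": [28.4, 28.2, 33.5, 30.6, 37.6],
})

# Map BMI to marker size, constraining marker areas to the given (min, max) range
ax = sns.scatterplot(x="Age", y="Weight", data=sample,
                     size="BMI", sizes=(20, 200))
```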

We can even add a fourth dimension to this visualization by mapping the color of the points to a categorical variable. Let's now use the hue argument to color each point by whether it comes from a metro or non-metro area:

In [11]:
sns.scatterplot(x="Age",y="Weight",data=nhanes,size="BMI",hue="Urban");
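
seaborn chooses the colors for each hue level automatically. If specific colors are wanted, one option is to pass a dictionary to the `palette` argument. This is a sketch only: the two color names are arbitrary choices, `urban_colors` is an illustrative name, and `sample` contains just the five rows shown in Out[4].

```python
import pandas as pd
import seaborn as sns

# Five rows copied from the cleaned nhanes.head() output (a stand-in for the full data)
sample = pd.DataFrame({
    "Age": [42.75, 25.583333, 73.833333, 38.166667, 74.0],
    "Weight": [171.7, 155.2, 166.7, 224.7, 245.0],
    "BMI": [28.4, 28.2, 33.5, 30.6, 37.6],
    "Urban": ["Non-Metro Area", "Non-Metro Area", "Metro Area",
              "Metro Area", "Non-Metro Area"],
})

# Assign one color per category of the hue variable (colors chosen arbitrarily)
urban_colors = {"Metro Area": "tab:blue", "Non-Metro Area": "tab:orange"}

ax = sns.scatterplot(x="Age", y="Weight", data=sample,
                     size="BMI", hue="Urban", palette=urban_colors)
```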