Open on DataHub

# HIDDEN
# Clear previously defined variables
%reset -f

# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/03'))

# HIDDEN
import warnings
# Ignore numpy dtype warnings. These warnings are caused by an interaction
# between numpy and Cython and can be safely ignored.
# Reference: https://stackoverflow.com/a/40846742
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
import nbinteract as nbi

sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.options.display.max_rows = 7
pd.options.display.max_columns = 8
pd.set_option('precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)

Getting Started¶

In each section of this chapter we will work with the Baby Names dataset from Chapter 1. We will pose a question, break the question down into high-level steps, then translate each step into Python code using pandas DataFrames. We begin by importing pandas:

# pd is a common shorthand for pandas
import pandas as pd

Now we can read in the data using pd.read_csv (docs).

baby = pd.read_csv('babynames.csv')
baby

	Name	Sex	Count	Year
0	Mary	F	9217	1884
1	Anna	F	3860	1884
2	Emma	F	2587	1884
...	...	...	...	...
1891891	Verna	M	5	1883
1891892	Winnie	M	5	1883
1891893	Winthrop	M	5	1883

1891894 rows × 4 columns

Note that for the code above to work, the babynames.csv file must be located in the same directory as this notebook. We can check what files are in the current folder by running ls in a notebook cell:

ls

babynames.csv                  indexes_slicing_sorting.ipynb

When we use pandas to read in data, we get a DataFrame. A DataFrame is a tabular data structure where each column is labeled (in this case 'Name', 'Sex', 'Count', 'Year') and each row is labeled (in this case 0, 1, 2, ..., 1891893). The Table introduced in Data 8, however, only has labeled columns.

The labels of a DataFrame are called the indexes of the DataFrame and make many data manipulations easier.

Indexes, Slicing, and Sorting¶

Let's use pandas to answer the following question:

What were the five most popular baby names in 2016?

Breaking the Problem Down¶

We can decompose this question into the following simpler table manipulations:

Slice out the rows for the year 2016.
Sort the rows in descending order by Count.

Now, we can express these steps in pandas.

Slicing using `.loc`¶

To select subsets of a DataFrame, we use the .loc slicing syntax. The first argument is the label of the row and the second is the label of the column:

baby

	Name	Sex	Count	Year
0	Mary	F	9217	1884
1	Anna	F	3860	1884
2	Emma	F	2587	1884
...	...	...	...	...
1891891	Verna	M	5	1883
1891892	Winnie	M	5	1883
1891893	Winthrop	M	5	1883

1891894 rows × 4 columns

baby.loc[1, 'Name'] # Row labeled 1, Column labeled 'Name'

'Anna'

To slice out multiple rows or columns, we can use :. Note that .loc slicing is inclusive, unlike Python's slicing.

# Get rows 1 through 5, columns Name through Count inclusive
baby.loc[1:5, 'Name':'Count']

	Name	Sex	Count
1	Anna	F	3860
2	Emma	F	2587
3	Elizabeth	F	2549
4	Minnie	F	2243
5	Margaret	F	2142

We will often want a single column from a DataFrame:

baby.loc[:, 'Year']

0          1884
1          1884
2          1884
           ... 
1891891    1883
1891892    1883
1891893    1883
Name: Year, Length: 1891894, dtype: int64

Note that when we select a single column, we get a pandas Series. A Series is like a one-dimensional NumPy array since we can perform arithmetic on all the elements at once.

baby.loc[:, 'Year'] * 2

0          3768
1          3768
2          3768
           ... 
1891891    3766
1891892    3766
1891893    3766
Name: Year, Length: 1891894, dtype: int64

To select out specific columns, we can pass a list into the .loc slice:

# This is a DataFrame again
baby.loc[:, ['Name', 'Year']]

	Name	Year
0	Mary	1884
1	Anna	1884
2	Emma	1884
...	...	...
1891891	Verna	1883
1891892	Winnie	1883
1891893	Winthrop	1883

1891894 rows × 2 columns

Selecting columns is common, so there's a shorthand.

# Shorthand for baby.loc[:, 'Name']
baby['Name']

0              Mary
1              Anna
2              Emma
             ...   
1891891       Verna
1891892      Winnie
1891893    Winthrop
Name: Name, Length: 1891894, dtype: object

# Shorthand for baby.loc[:, ['Name', 'Count']]
baby[['Name', 'Count']]

	Name	Count
0	Mary	9217
1	Anna	3860
2	Emma	2587
...	...	...
1891891	Verna	5
1891892	Winnie	5
1891893	Winthrop	5

1891894 rows × 2 columns

Slicing rows using a predicate¶

To slice out the rows with year 2016, we will first create a Series containing True for each row we want to keep and False for each row we want to drop. This is simple because math and boolean operators on Series are applied to each element in the Series.

# Series of years
baby['Year']

0          1884
1          1884
2          1884
           ... 
1891891    1883
1891892    1883
1891893    1883
Name: Year, Length: 1891894, dtype: int64

# Compare each year with 2016
baby['Year'] == 2016

0          False
1          False
2          False
           ...  
1891891    False
1891892    False
1891893    False
Name: Year, Length: 1891894, dtype: bool

Once we have this Series of True and False, we can pass it into .loc.

# We are slicing rows, so the boolean Series goes in the first
# argument to .loc
baby_2016 = baby.loc[baby['Year'] == 2016, :]
baby_2016

	Name	Sex	Count	Year
1850880	Emma	F	19414	2016
1850881	Olivia	F	19246	2016
1850882	Ava	F	16237	2016
...	...	...	...	...
1883745	Zyahir	M	5	2016
1883746	Zyel	M	5	2016
1883747	Zylyn	M	5	2016

32868 rows × 4 columns

Sorting Rows¶

The next step is the sort the rows in descending order by 'Count'. We can use the sort_values() function.

sorted_2016 = baby_2016.sort_values('Count', ascending=False)
sorted_2016

	Name	Sex	Count	Year
1850880	Emma	F	19414	2016
1850881	Olivia	F	19246	2016
1869637	Noah	M	19015	2016
...	...	...	...	...
1868752	Mikaelyn	F	5	2016
1868751	Miette	F	5	2016
1883747	Zylyn	M	5	2016

32868 rows × 4 columns

Finally, we will use .iloc to slice out the first five rows of the DataFrame. .iloc works like .loc but takes in numerical indices instead of labels. It does not include the right endpoint in its slices, like Python's list slicing.

# Get the value in the zeroth row, zeroth column
sorted_2016.iloc[0, 0]

'Emma'

# Get the first five rows
sorted_2016.iloc[0:5]

	Name	Sex	Count	Year
1850880	Emma	F	19414	2016
1850881	Olivia	F	19246	2016
1869637	Noah	M	19015	2016
1869638	Liam	M	18138	2016
1850882	Ava	F	16237	2016

In Conclusion¶

We now have the five most popular baby names in 2016 and learned to express the following operations in pandas:

Operation	`pandas`
Read a CSV file	`pd.read_csv()`
Slicing using labels or indices	`.loc` and `.iloc`
Slicing rows using a predicate	Use a boolean-valued Series in `.loc`
Sorting rows	`.sort_values()`