Open on DataHub

# HIDDEN
# Clear previously defined variables
%reset -f

# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/03'))

# HIDDEN
import warnings
# Ignore numpy dtype warnings. These warnings are caused by an interaction
# between numpy and Cython and can be safely ignored.
# Reference: https://stackoverflow.com/a/40846742
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
import nbinteract as nbi

sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.options.display.max_rows = 7
pd.options.display.max_columns = 8
pd.set_option('precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)

Grouping and Pivoting¶

In this section, we will answer the question:

What were the most popular male and female names in each year?

Here's the Baby Names dataset once again:

baby = pd.read_csv('babynames.csv')
baby.head()
# the .head() method outputs the first five rows of the DataFrame

	Name	Sex	Count	Year
0	Mary	F	9217	1884
1	Anna	F	3860	1884
2	Emma	F	2587	1884
3	Elizabeth	F	2549	1884
4	Minnie	F	2243	1884

Breaking the Problem Down¶

We should first notice that the question in the previous section has similarities to this one; the question in the previous section restricts names to babies born in 2016 whereas this question asks for names in all years.

We once again decompose this problem into simpler table manipulations.

Group the baby DataFrame by 'Year' and 'Sex'.
For each group, compute the most popular name.

Recognizing which operation is needed for each problem is sometimes tricky. Usually, a convoluted series of steps will signal to you that there might be a simpler way to express what you want. If we didn't immediately recognize that we needed to group, for example, we might write steps like the following:

Loop through each unique year.
For each year, loop through each unique sex.
For each unique year and sex, find the most common name.

There is almost always a better alternative to looping over a pandas DataFrame. In particular, looping over unique values of a DataFrame should usually be replaced with a group.

Grouping¶

To group in pandas. we use the .groupby() method.

baby.groupby('Year')

<pandas.core.groupby.DataFrameGroupBy object at 0x1a14e21f60>

.groupby() returns a strange-looking DataFrameGroupBy object. We can call .agg() on this object with an aggregation function in order to get a familiar output:

# The aggregation function takes in a series of values for each group
# and outputs a single value
def length(series):
    return len(series)

# Count up number of values for each year. This is equivalent to
# counting the number of rows where each year appears.
baby.groupby('Year').agg(length)

	Name	Sex	Count
Year
1880	2000	2000	2000
1881	1935	1935	1935
1882	2127	2127	2127
...	...	...	...
2014	33206	33206	33206
2015	33063	33063	33063
2016	32868	32868	32868

137 rows × 3 columns

You might notice that the length function simply calls the len function, so we can simplify the code above.

baby.groupby('Year').agg(len)

	Name	Sex	Count
Year
1880	2000	2000	2000
1881	1935	1935	1935
1882	2127	2127	2127
...	...	...	...
2014	33206	33206	33206
2015	33063	33063	33063
2016	32868	32868	32868

137 rows × 3 columns

The aggregation is applied to each column of the DataFrame, producing redundant information. We can restrict the output columns by slicing before grouping.

year_rows = baby[['Year', 'Count']].groupby('Year').agg(len)
year_rows

# A further shorthand to accomplish the same result:
#
# year_counts = baby[['Year', 'Count']].groupby('Year').count()
#
# pandas has shorthands for common aggregation functions, including
# count, sum, and mean.

	Count
Year
1880	2000
1881	1935
1882	2127
...	...
2014	33206
2015	33063
2016	32868

137 rows × 1 columns

Note that the index of the resulting DataFrame now contains the unique years, so we can slice subsets of years using .loc as before:

# Every twentieth year starting at 1880
year_rows.loc[1880:2016:20, :]

	Count
Year
1880	2000
1900	3730
1920	10755
1940	8961
1960	11924
1980	19440
2000	29764

Grouping on Multiple Columns¶

As we've seen in Data 8, we can group on multiple columns to get groups based on unique pairs of values. To do this, pass in a list of column labels into .groupby().

grouped_counts = baby.groupby(['Year', 'Sex']).sum()
grouped_counts

		Count
Year	Sex
1880	F	90992
1880	M	110491
1881	F	91953
...	...	...
2015	M	1907211
2016	F	1756647
2016	M	1880674

274 rows × 1 columns

The code above computes the total number of babies born for each year and sex. Let's now use grouping by muliple columns to compute the most popular names for each year and sex. Since the data are already sorted in descending order of Count for each year and sex, we can define an aggregation function that returns the first value in each series. (If the data weren't sorted, we can call sort_values() first.)

# The most popular name is simply the first one that appears in the series
def most_popular(series):
    return series.iloc[0]

baby_pop = baby.groupby(['Year', 'Sex']).agg(most_popular)
baby_pop

		Name	Count
Year	Sex
1880	F	Mary	7065
1880	M	John	9655
1881	F	Mary	6919
...	...	...	...
2015	M	Noah	19594
2016	F	Emma	19414
2016	M	Noah	19015

274 rows × 2 columns

Notice that grouping by multiple columns results in multiple labels for each row. This is called a "multilevel index" and is tricky to work with. The important thing to know is that .loc takes in a tuple for the row index instead of a single value:

baby_pop.loc[(2000, 'F'), 'Name']

'Emily'

But .iloc behaves the same as usual since it uses indices instead of labels:

baby_pop.iloc[10:15, :]

		Name	Count
Year	Sex
1885	F	Mary	9128
1885	M	John	8756
1886	F	Mary	9889
1886	M	John	9026
1887	F	Mary	9888

Pivoting¶

If you group by two columns, you can often use pivot to present your data in a more convenient format. Using a pivot lets you use one set of grouped labels as the columns of the resulting table.

To pivot, use the pd.pivot_table() function.

pd.pivot_table(baby,
               index='Year',         # Index for rows
               columns='Sex',        # Columns
               values='Name',        # Values in table
               aggfunc=most_popular) # Aggregation function

Sex	F	M
Year
1880	Mary	John
1881	Mary	John
1882	Mary	John
...	...	...
2014	Emma	Noah
2015	Emma	Noah
2016	Emma	Noah

137 rows × 2 columns

Compare this result to the baby_pop table that we computed using .groupby(). We can see that the Sex index in baby_pop became the columns of the pivot table.

baby_pop

		Name	Count
Year	Sex
1880	F	Mary	7065
1880	M	John	9655
1881	F	Mary	6919
...	...	...	...
2015	M	Noah	19594
2016	F	Emma	19414
2016	M	Noah	19015

274 rows × 2 columns

In Conclusion¶

We now have the most popular baby names for each sex and year in our dataset and learned to express the following operations in pandas:

Operation	`pandas`
Group	`df.groupby(label)`
Group by multiple columns	`df.groupby([label1, label2])`
Group and aggregate	`df.groupby(label).agg(func)`
Pivot	`pd.pivot_table()`