Open on DataHub

# HIDDEN
# Clear previously defined variables
%reset -f

# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/08'))

# HIDDEN
import warnings
# Ignore numpy dtype warnings. These warnings are caused by an interaction
# between numpy and Cython and can be safely ignored.
# Reference: https://stackoverflow.com/a/40846742
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
import nbinteract as nbi

sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.options.display.max_rows = 7
pd.options.display.max_columns = 8
pd.set_option('precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)

Python String Methods¶

Python provides a variety of methods for basic string manipulation. Although simple, these methods form the primitives that piece together to form more complex string operations. We will introduce Python's string methods in the context of a common use case for working with text: data cleaning.

Cleaning Text Data¶

Data often comes from several different sources that each implements its own way of encoding information. In the following example, we have one table that records the state that a county belongs to and another that records the population of the county.

# HIDDEN
state = pd.DataFrame({
    'County': [
        'De Witt County',
        'Lac qui Parle County',
        'Lewis and Clark County',
        'St John the Baptist Parish',
    ],
    'State': [
        'IL',
        'MN',
        'MT',
        'LA',
    ]
})
population = pd.DataFrame({
    'County': [
        'DeWitt  ',
        'Lac Qui Parle',
        'Lewis & Clark',
        'St. John the Baptist',
    ],
    'Population': [
        '16,798',
        '8,067',
        '55,716',
        '43,044',
    ]
})

state

	County	State
0	De Witt County	IL
1	Lac qui Parle County	MN
2	Lewis and Clark County	MT
3	St John the Baptist Parish	LA

population

	County	Population
0	DeWitt	16,798
1	Lac Qui Parle	8,067
2	Lewis & Clark	55,716
3	St. John the Baptist	43,044

We would naturally like to join the state and population tables using the County column. Unfortunately, not a single county is spelled the same in the two tables. This example is illustrative of the following common issues in text data:

Capitalization: qui vs Qui
Different punctuation conventions: St. vs St
Omission of words: County/Parish is absent in the population table
Use of whitespace: DeWitt vs De Witt
Different abbreviation conventions: & vs and

String Methods¶

Python's string methods allow us to start resolving these issues. These methods are conveniently defined on all Python strings and thus do not require importing other modules. Although it is worth familiarizing yourself with the complete list of string methods, we describe a few of the most commonly used methods in the table below.

Method	Description
`str[x:y]`	Slices `str`, returning indices x (inclusive) to y (not inclusive)
`str.lower()`	Returns a copy of a string with all letters converted to lowercase
`str.replace(a, b)`	Replaces all instances of the substring `a` in `str` with the substring `b`
`str.split(a)`	Returns substrings of `str` split at a substring `a`
`str.strip()`	Removes leading and trailing whitespace from `str`

We select the string for St. John the Baptist parish from the state and population tables and apply string methods to remove capitalization, punctuation, and county/parish occurrences.

john1 = state.loc[3, 'County']
john2 = population.loc[3, 'County']

(john1
 .lower()
 .strip()
 .replace(' parish', '')
 .replace(' county', '')
 .replace('&', 'and')
 .replace('.', '')
 .replace(' ', '')
)

'stjohnthebaptist'

Applying the same set of methods to john2 allows us to verify that the two strings are now identical.

(john2
 .lower()
 .strip()
 .replace(' parish', '')
 .replace(' county', '')
 .replace('&', 'and')
 .replace('.', '')
 .replace(' ', '')
)

'stjohnthebaptist'

Satisfied, we create a method called clean_county that normalizes an input county.

def clean_county(county):
    return (county
            .lower()
            .strip()
            .replace(' county', '')
            .replace(' parish', '')
            .replace('&', 'and')
            .replace(' ', '')
            .replace('.', ''))

We may now verify that the clean_county method produces matching counties for all the counties in both tables:

([clean_county(county) for county in state['County']],
 [clean_county(county) for county in population['County']]
)

(['dewitt', 'lacquiparle', 'lewisandclark', 'stjohnthebaptist'],
 ['dewitt', 'lacquiparle', 'lewisandclark', 'stjohnthebaptist'])

Because each county in both tables has the same transformed representation, we may successfully join the two tables using the transformed county names.

String Methods in pandas¶

In the code above we used a loop to transform each county name. pandas Series objects provide a convenient way to apply string methods to each item in the series. First, the series of county names in the state table:

state['County']

0                De Witt County
1          Lac qui Parle County
2        Lewis and Clark County
3    St John the Baptist Parish
Name: County, dtype: object

The .str property on pandas Series exposes the same string methods as Python does. Calling a method on the .str property calls the method on each item in the series.

state['County'].str.lower()

0                de witt county
1          lac qui parle county
2        lewis and clark county
3    st john the baptist parish
Name: County, dtype: object

This allows us to transform each string in the series without using a loop.

(state['County']
 .str.lower()
 .str.strip()
 .str.replace(' parish', '')
 .str.replace(' county', '')
 .str.replace('&', 'and')
 .str.replace('.', '')
 .str.replace(' ', '')
)

0              dewitt
1         lacquiparle
2       lewisandclark
3    stjohnthebaptist
Name: County, dtype: object

We save the transformed counties back into their originating tables:

state['County'] = (state['County']
 .str.lower()
 .str.strip()
 .str.replace(' parish', '')
 .str.replace(' county', '')
 .str.replace('&', 'and')
 .str.replace('.', '')
 .str.replace(' ', '')
)

population['County'] = (population['County']
 .str.lower()
 .str.strip()
 .str.replace(' parish', '')
 .str.replace(' county', '')
 .str.replace('&', 'and')
 .str.replace('.', '')
 .str.replace(' ', '')
)

Now, the two tables contain the same string representation of the counties:

state

	County	State
0	dewitt	IL
1	lacquiparle	MN
2	lewisandclark	MT
3	stjohnthebaptist	LA

population

	County	Population
0	dewitt	16,798
1	lacquiparle	8,067
2	lewisandclark	55,716
3	stjohnthebaptist	43,044

It is simple to join these tables once the counties match.

state.merge(population, on='County')

	County	State	Population
0	dewitt	IL	16,798
1	lacquiparle	MN	8,067
2	lewisandclark	MT	55,716
3	stjohnthebaptist	LA	43,044

Summary¶

Python's string methods form a set of simple and useful operations for string manipulation. pandas Series implement the same methods that apply the underlying Python method to each string in the series.

You may find the complete docs on Python's string methods here and the docs on Pandas str methods here.