Open on DataHub
# HIDDEN
# Clear previously defined variables
%reset -f

# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/08'))
# HIDDEN
import warnings
# Ignore numpy dtype warnings. These warnings are caused by an interaction
# between numpy and Cython and can be safely ignored.
# Reference: https://stackoverflow.com/a/40846742
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
import nbinteract as nbi

sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.options.display.max_rows = 7
pd.options.display.max_columns = 8
pd.set_option('precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)

Regex and Python

In this section, we introduce regex usage in Python using the built-in re module. Since we only cover a few of the most commonly used methods, you will find it useful to consult the official documentation on the re module as well.

re.search

re.search(pattern, string) searches for a match of the regex pattern anywhere in string. It returns a truthy match object if the pattern is found; it returns None if not.

phone_re = r"[0-9]{3}-[0-9]{3}-[0-9]{4}"
text  = "Call me at 382-384-3840."
match = re.search(phone_re, text)
match
<_sre.SRE_Match object; span=(11, 23), match='382-384-3840'>

Although the returned match object has a variety of useful properties, we most commonly use re.search to test whether a pattern appears in a string.

if re.search(phone_re, text):
    print("Found a match!")
Found a match!
if re.search(phone_re, 'Hello world'):
    print("No match; this won't print")

Another commonly used method, re.match(pattern, string), behaves the same as re.search but only checks for a match at the start of string instead of a match anywhere in the string.

re.findall

We use re.findall(pattern, string) to extract substrings that match a regex. This method returns a list of all matches of pattern in string.

gmail_re = r'[a-zA-Z0-9]+@gmail\.com'
text = '''
From: email1@gmail.com
To: email2@yahoo.com and email3@gmail.com
'''
re.findall(gmail_re, text)
['email1@gmail.com', 'email3@gmail.com']

Regex Groups

Using regex groups, we specify subpatterns to extract from a regex by wrapping the subpattern in parentheses ( ). When a regex contains regex groups, re.findall returns a list of tuples that contain the subpattern contents.

For example, the following familiar regex extracts phone numbers from a string:

phone_re = r"[0-9]{3}-[0-9]{3}-[0-9]{4}"
text  = "Sam's number is 382-384-3840 and Mary's is 123-456-7890."
re.findall(phone_re, text)
['382-384-3840', '123-456-7890']

To split apart the individual three or four digit components of a phone number, we can wrap each digit group in parentheses.

# Same regex with parentheses around the digit groups
phone_re = r"([0-9]{3})-([0-9]{3})-([0-9]{4})"
text  = "Sam's number is 382-384-3840 and Mary's is 123-456-7890."
re.findall(phone_re, text)
[('382', '384', '3840'), ('123', '456', '7890')]

As promised, re.findall returns a list of tuples containing the individual components of the matched phone numbers.

re.sub

re.sub(pattern, replacement, string) replaces all occurrences of pattern with replacement in the provided string. This method behaves like the Python string method str.sub but uses a regex to match patterns.

In the code below, we alter the dates to have a common format by substituting the date separators with a dash.

messy_dates = '03/12/2018, 03.13.18, 03/14/2018, 03:15:2018'
regex = r'[/.:]'
re.sub(regex, '-', messy_dates)
'03-12-2018, 03-13-18, 03-14-2018, 03-15-2018'

re.split

re.split(pattern, string) splits the input string each time the regex pattern appears. This method behaves like the Python string method str.split but uses a regex to make the split.

In the code below, we use re.split to split chapter names from their page numbers in a table of contents for a book.

toc = '''
PLAYING PILGRIMS============3
A MERRY CHRISTMAS===========13
THE LAURENCE BOY============31
BURDENS=====================55
BEING NEIGHBORLY============76
'''.strip()

# First, split into individual lines
lines = re.split('\n', toc)
lines
['PLAYING PILGRIMS============3',
 'A MERRY CHRISTMAS===========13',
 'THE LAURENCE BOY============31',
 'BURDENS=====================55',
 'BEING NEIGHBORLY============76']
# Then, split into chapter title and page number
split_re = r'=+' # Matches any sequence of = characters
[re.split(split_re, line) for line in lines]
[['PLAYING PILGRIMS', '3'],
 ['A MERRY CHRISTMAS', '13'],
 ['THE LAURENCE BOY', '31'],
 ['BURDENS', '55'],
 ['BEING NEIGHBORLY', '76']]

Regex and pandas

Recall that pandas Series objects have a .str property that supports string manipulation using Python string methods. Conveniently, the .str property also supports some functions from the re module. We demonstrate basic regex usage in pandas, leaving the complete method list to the pandas documentation on string methods.

We've stored the text of the first five sentences of the novel Little Women in the DataFrame below. We can use the string methods that pandas provides to extract the spoken dialog in each sentence.

# HIDDEN
text = '''
"Christmas won't be Christmas without any presents," grumbled Jo, lying on the rug.
"It's so dreadful to be poor!" sighed Meg, looking down at her old dress.
"I don't think it's fair for some girls to have plenty of pretty things, and other girls nothing at all," added little Amy, with an injured sniff.
"We've got Father and Mother, and each other," said Beth contentedly from her corner.
The four young faces on which the firelight shone brightened at the cheerful words, but darkened again as Jo said sadly, "We haven't got Father, and shall not have him for a long time."
'''.strip()
little = pd.DataFrame({
    'sentences': text.split('\n')
})
little
sentences
0 "Christmas won't be Christmas without any pres...
1 "It's so dreadful to be poor!" sighed Meg, loo...
2 "I don't think it's fair for some girls to hav...
3 "We've got Father and Mother, and each other,"...
4 The four young faces on which the firelight sh...

Since spoken dialog lies within double quotation marks, we create a regex that captures a double quotation mark, a sequence of any characters except a double quotation mark, and the closing quotation mark.

quote_re = r'"[^"]+"'
little['sentences'].str.findall(quote_re)
0    ["Christmas won't be Christmas without any pre...
1                     ["It's so dreadful to be poor!"]
2    ["I don't think it's fair for some girls to ha...
3     ["We've got Father and Mother, and each other,"]
4    ["We haven't got Father, and shall not have hi...
Name: sentences, dtype: object

Since the Series.str.findall method returns a list of matches, pandas also provides Series.str.extract and Series.str.extractall method to extract matches into a Series or DataFrame. These methods require the regex to contain at least one regex group.

# Extract text within double quotes
quote_re = r'"([^"]+)"'
spoken = little['sentences'].str.extract(quote_re)
spoken
0    Christmas won't be Christmas without any prese...
1                         It's so dreadful to be poor!
2    I don't think it's fair for some girls to have...
3         We've got Father and Mother, and each other,
4    We haven't got Father, and shall not have him ...
Name: sentences, dtype: object

We can add this series as a column of the little DataFrame:

little['dialog'] = spoken
little
sentences dialog
0 "Christmas won't be Christmas without any pres... Christmas won't be Christmas without any prese...
1 "It's so dreadful to be poor!" sighed Meg, loo... It's so dreadful to be poor!
2 "I don't think it's fair for some girls to hav... I don't think it's fair for some girls to have...
3 "We've got Father and Mother, and each other,"... We've got Father and Mother, and each other,
4 The four young faces on which the firelight sh... We haven't got Father, and shall not have him ...

We can confirm that our string manipulation behaves as expected for the last sentence in our DataFrame by printing the original and extracted text:

print(little.loc[4, 'sentences'])
The four young faces on which the firelight shone brightened at the cheerful words, but darkened again as Jo said sadly, "We haven't got Father, and shall not have him for a long time."
print(little.loc[4, 'dialog'])
We haven't got Father, and shall not have him for a long time.

Summary

The re module in Python provides a useful group of methods for manipulating text using regular expressions. When working with DataFrames, we often use the analogous string manipulation methods implemented in pandas.

For the complete documentation on the re module, see https://docs.python.org/3/library/re.html

For the complete documentation on pandas string methods, see https://pandas.pydata.org/pandas-docs/stable/text.html