# HIDDEN
# Clear previously defined variables
%reset -f

# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/04'))
# HIDDEN
import warnings
# Ignore numpy dtype warnings. These warnings are caused by an interaction
# between numpy and Cython and can be safely ignored.
# Reference: https://stackoverflow.com/a/40846742
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
import nbinteract as nbi

sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.options.display.max_rows = 7
pd.options.display.max_columns = 8
pd.set_option('precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)

Investigating Berkeley Police Data

We will use the Berkeley Police Department's publicly available datasets to demonstrate data cleaning techniques. We have downloaded the Calls for Service dataset and Stops dataset.

We can use the ls shell command with the -lh flags to see more details about the files:

!ls -lh data/
total 13936
-rw-r--r--@ 1 sam  staff   979K Aug 29 14:41 Berkeley_PD_-_Calls_for_Service.csv
-rw-r--r--@ 1 sam  staff    81B Aug 29 14:28 cvdow.csv
-rw-r--r--@ 1 sam  staff   5.8M Aug 29 14:41 stops.json

The command above shows the data files and their file sizes. This is especially useful because we now know the files are small enough to load into memory. As a rule of thumb, it is usually safe to load a file that is around one fourth the size of the computer's total memory. For example, if a computer has 4GB of RAM we should be able to load a 1GB CSV file in pandas. To handle larger datasets we will need additional computational tools that we will cover later in this book.
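To make the rule of thumb concrete, here is a small sketch that compares the Calls file's size against the memory budget. It assumes a 4GB machine, matching the example above; this check is not part of the original analysis.

import os

# Rough sanity check: the file should be at most about one fourth of RAM.
# We assume 4 GB of RAM here to match the example above; substitute your
# machine's actual memory size.
ram_bytes = 4 * 1024**3
csv_bytes = os.path.getsize('data/Berkeley_PD_-_Calls_for_Service.csv')
print('Safe to load:', csv_bytes <= ram_bytes / 4)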

Notice the use of the exclamation point before ls. This tells Jupyter that the rest of the line is a shell command, not a Python expression. We can run any available shell command in Jupyter using !:

# The `wc` shell command shows us how many lines each file has.
# We can see that the `stops.json` file has the most lines (29852).
!wc -l data/*
   16497 data/Berkeley_PD_-_Calls_for_Service.csv
       8 data/cvdow.csv
   29852 data/stops.json
   46357 total

Understanding the Data Generation

We will state important questions you should ask of all datasets before data cleaning or processing. These questions are related to how the data were generated, so data cleaning will usually not be able to resolve issues that arise here.

What do the data contain? The website for the Calls for Service data states that the dataset describes "crime incidents (not criminal reports) within the last 180 days". Further reading reveals that "not all calls for police service are included (e.g. Animal Bite)".

The website for the Stops data states that the dataset contains data on all "vehicle detentions (including bicycles) and pedestrian detentions (up to five persons)" since January 26, 2015.

Are the data a census? This depends on our population of interest. For example, if we are interested in calls for service within the last 180 days for crime incidents then the Calls dataset is a census. However, if we are interested in calls for service within the last 10 years the dataset is clearly not a census. We can make similar statements about the Stops dataset since the data collection started on January 26, 2015.

If the data form a sample, is it a probability sample? If we are investigating a period of time that the data do not have entries for, the data do not form a probability sample since there is no randomness involved in the data collection process — we have all data for certain time periods but no data for others.

What limitations will this data have on our conclusions? Although we will ask this question at each step of our data processing, we can already see that our data impose important limitations. The most important limitation is that we cannot make unbiased estimations for time periods not covered by our datasets.

Cleaning The Calls Dataset

Let's now clean the Calls dataset. The head shell command prints the first ten lines of a file by default.

!head data/Berkeley_PD_-_Calls_for_Service.csv
CASENO,OFFENSE,EVENTDT,EVENTTM,CVLEGEND,CVDOW,InDbDate,Block_Location,BLKADDR,City,State
17091420,BURGLARY AUTO,07/23/2017 12:00:00 AM,06:00,BURGLARY - VEHICLE,0,08/29/2017 08:28:05 AM,"2500 LE CONTE AVE
Berkeley, CA
(37.876965, -122.260544)",2500 LE CONTE AVE,Berkeley,CA
17020462,THEFT FROM PERSON,04/13/2017 12:00:00 AM,08:45,LARCENY,4,08/29/2017 08:28:00 AM,"2200 SHATTUCK AVE
Berkeley, CA
(37.869363, -122.268028)",2200 SHATTUCK AVE,Berkeley,CA
17050275,BURGLARY AUTO,08/24/2017 12:00:00 AM,18:30,BURGLARY - VEHICLE,4,08/29/2017 08:28:06 AM,"200 UNIVERSITY AVE
Berkeley, CA
(37.865491, -122.310065)",200 UNIVERSITY AVE,Berkeley,CA

It appears to be a comma-separated values (CSV) file, though it's hard to tell whether the entire file is formatted properly. We can use pd.read_csv to read in the file as a DataFrame. If pd.read_csv errors, we will have to dig deeper and manually resolve formatting issues. Fortunately, pd.read_csv successfully returns a DataFrame:

calls = pd.read_csv('data/Berkeley_PD_-_Calls_for_Service.csv')
calls
CASENO OFFENSE EVENTDT EVENTTM ... Block_Location BLKADDR City State
0 17091420 BURGLARY AUTO 07/23/2017 12:00:00 AM 06:00 ... 2500 LE CONTE AVE\nBerkeley, CA\n(37.876965, -... 2500 LE CONTE AVE Berkeley CA
1 17020462 THEFT FROM PERSON 04/13/2017 12:00:00 AM 08:45 ... 2200 SHATTUCK AVE\nBerkeley, CA\n(37.869363, -... 2200 SHATTUCK AVE Berkeley CA
2 17050275 BURGLARY AUTO 08/24/2017 12:00:00 AM 18:30 ... 200 UNIVERSITY AVE\nBerkeley, CA\n(37.865491, ... 200 UNIVERSITY AVE Berkeley CA
... ... ... ... ... ... ... ... ... ...
5505 17018126 DISTURBANCE 04/01/2017 12:00:00 AM 12:22 ... 1600 FAIRVIEW ST\nBerkeley, CA\n(37.850001, -1... 1600 FAIRVIEW ST Berkeley CA
5506 17090665 THEFT MISD. (UNDER $950) 04/01/2017 12:00:00 AM 12:00 ... 2000 DELAWARE ST\nBerkeley, CA\n(37.874489, -1... 2000 DELAWARE ST Berkeley CA
5507 17049700 SEXUAL ASSAULT MISD. 08/22/2017 12:00:00 AM 20:02 ... 2400 TELEGRAPH AVE\nBerkeley, CA\n(37.866761, ... 2400 TELEGRAPH AVE Berkeley CA

5508 rows × 11 columns

We can define a function to show different slices of the data and then interact with it:

def df_interact(df):
    '''
    Outputs sliders that show rows and columns of df
    '''
    def peek(row=0, col=0):
        return df.iloc[row:row + 5, col:col + 6]
    interact(peek, row=(0, len(df), 5), col=(0, len(df.columns) - 6))
    print('({} rows, {} columns) total'.format(df.shape[0], df.shape[1]))

df_interact(calls)
(5508 rows, 11 columns) total

Based on the output above, the resulting DataFrame looks reasonably well-formed since the columns are properly named and the data in each column seems to be entered consistently. What data does each column contain? We can look at the dataset website:

Column         | Description                             | Type
CASENO         | Case Number                             | Number
OFFENSE        | Offense Type                            | Plain Text
EVENTDT        | Date Event Occurred                     | Date & Time
EVENTTM        | Time Event Occurred                     | Plain Text
CVLEGEND       | Description of Event                    | Plain Text
CVDOW          | Day of Week Event Occurred              | Number
InDbDate       | Date dataset was updated in the portal  | Date & Time
Block_Location | Block level address of event            | Location
BLKADDR        |                                         | Plain Text
City           |                                         | Plain Text
State          |                                         | Plain Text

On the surface the data looks easy to work with. However, before starting data analysis we must answer the following questions:

  1. Are there missing values in the dataset? This question is important because missing values can represent many different things. For example, missing addresses could mean that locations were removed to protect anonymity, or that some respondents chose not to answer a survey question, or that a recording device broke.
  2. Are there any missing values that were filled in (e.g. a 999 for unknown age or 12:00am for unknown date)? These will clearly impact analysis if we ignore them.
  3. Which parts of the data were entered by a human? As we will soon see, human-entered data is filled with inconsistencies and misspellings.

Although there are plenty more checks to go through, these three will suffice for many cases. See the Quartz bad data guide for a more complete list of checks.

Are there missing values?

This is a simple check in pandas:

# True if row contains at least one null value
null_rows = calls.isnull().any(axis=1)
calls[null_rows]
CASENO OFFENSE EVENTDT EVENTTM ... Block_Location BLKADDR City State
116 17014831 BURGLARY AUTO 03/16/2017 12:00:00 AM 22:00 ... Berkeley, CA\n(37.869058, -122.270455) NaN Berkeley CA
478 17042511 BURGLARY AUTO 07/20/2017 12:00:00 AM 16:00 ... Berkeley, CA\n(37.869058, -122.270455) NaN Berkeley CA
486 17022572 VEHICLE STOLEN 04/22/2017 12:00:00 AM 21:00 ... Berkeley, CA\n(37.869058, -122.270455) NaN Berkeley CA
... ... ... ... ... ... ... ... ... ...
4945 17091287 VANDALISM 07/01/2017 12:00:00 AM 08:00 ... Berkeley, CA\n(37.869058, -122.270455) NaN Berkeley CA
4947 17038382 BURGLARY RESIDENTIAL 06/30/2017 12:00:00 AM 15:00 ... Berkeley, CA\n(37.869058, -122.270455) NaN Berkeley CA
5167 17091632 VANDALISM 08/15/2017 12:00:00 AM 23:30 ... Berkeley, CA\n(37.869058, -122.270455) NaN Berkeley CA

27 rows × 11 columns

It looks like 27 calls didn't have a recorded address in BLKADDR. Unfortunately, the data description isn't very clear on how the locations were recorded. We know that all of these calls were made for events in Berkeley, so we can likely assume that the addresses for these calls were originally somewhere in Berkeley.
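To confirm that BLKADDR is the only column with missing values, we can also count the nulls in each column. This quick check isn't part of the original cleaning steps:

# Count missing values per column; we expect only BLKADDR to be nonzero (27)
calls.isnull().sum()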

Are there any missing values that were filled in?

From the missing value check above we can see that the Block_Location column has Berkeley, CA recorded if the location was missing.

In addition, an inspection of the calls table shows us that the EVENTDT column has the correct dates but records 12am for all of its times. Instead, the times are in the EVENTTM column.

# Show the first 7 rows of the table again for reference
calls.head(7)
CASENO OFFENSE EVENTDT EVENTTM ... Block_Location BLKADDR City State
0 17091420 BURGLARY AUTO 07/23/2017 12:00:00 AM 06:00 ... 2500 LE CONTE AVE\nBerkeley, CA\n(37.876965, -... 2500 LE CONTE AVE Berkeley CA
1 17020462 THEFT FROM PERSON 04/13/2017 12:00:00 AM 08:45 ... 2200 SHATTUCK AVE\nBerkeley, CA\n(37.869363, -... 2200 SHATTUCK AVE Berkeley CA
2 17050275 BURGLARY AUTO 08/24/2017 12:00:00 AM 18:30 ... 200 UNIVERSITY AVE\nBerkeley, CA\n(37.865491, ... 200 UNIVERSITY AVE Berkeley CA
3 17019145 GUN/WEAPON 04/06/2017 12:00:00 AM 17:30 ... 1900 SEVENTH ST\nBerkeley, CA\n(37.869318, -12... 1900 SEVENTH ST Berkeley CA
4 17044993 VEHICLE STOLEN 08/01/2017 12:00:00 AM 18:00 ... 100 PARKSIDE DR\nBerkeley, CA\n(37.854247, -12... 100 PARKSIDE DR Berkeley CA
5 17037319 BURGLARY RESIDENTIAL 06/28/2017 12:00:00 AM 12:00 ... 1500 PRINCE ST\nBerkeley, CA\n(37.851503, -122... 1500 PRINCE ST Berkeley CA
6 17030791 BURGLARY RESIDENTIAL 05/30/2017 12:00:00 AM 08:45 ... 300 MENLO PL\nBerkeley, CA\n 300 MENLO PL Berkeley CA

7 rows × 11 columns
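We can verify that EVENTDT really stores only dates with a quick check. This is a sketch; it assumes pd.to_datetime parses the MM/DD/YYYY 12:00:00 AM strings without errors:

# If EVENTDT only records dates, every parsed time should be midnight
pd.to_datetime(calls['EVENTDT']).dt.time.unique()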

As a data cleaning step, we want to merge the EVENTDT and EVENTTM columns to record both date and time in one field. If we define a function that takes in a DF and returns a new DF, we can later use the DataFrame pipe method to apply all transformations in one go.

def combine_event_datetimes(calls):
    combined = pd.to_datetime(
        # Combine date and time strings
        calls['EVENTDT'].str[:10] + ' ' + calls['EVENTTM'],
        infer_datetime_format=True,
    )
    return calls.assign(EVENTDTTM=combined)

# To peek at the result without mutating the calls DF:
calls.pipe(combine_event_datetimes).head(2)
CASENO OFFENSE EVENTDT EVENTTM ... BLKADDR City State EVENTDTTM
0 17091420 BURGLARY AUTO 07/23/2017 12:00:00 AM 06:00 ... 2500 LE CONTE AVE Berkeley CA 2017-07-23 06:00:00
1 17020462 THEFT FROM PERSON 04/13/2017 12:00:00 AM 08:45 ... 2200 SHATTUCK AVE Berkeley CA 2017-04-13 08:45:00

2 rows × 12 columns
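As a side note, the infer_datetime_format argument is deprecated in recent pandas releases. An equivalent sketch passes the format explicitly, assuming every EVENTDT begins with an MM/DD/YYYY date and every EVENTTM is an HH:MM time:

def combine_event_datetimes_explicit(calls):
    # Same transformation as above, but with an explicit datetime format
    combined = pd.to_datetime(
        calls['EVENTDT'].str[:10] + ' ' + calls['EVENTTM'],
        format='%m/%d/%Y %H:%M',
    )
    return calls.assign(EVENTDTTM=combined)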

Which parts of the data were entered by a human?

It looks like most of the data columns are machine-recorded, including the date, time, day of week, and location of the event.

In addition, the OFFENSE and CVLEGEND columns appear to contain consistent values. We can check the unique values in each column to see if anything was misspelled:

calls['OFFENSE'].unique()
array(['BURGLARY AUTO', 'THEFT FROM PERSON', 'GUN/WEAPON',
       'VEHICLE STOLEN', 'BURGLARY RESIDENTIAL', 'VANDALISM',
       'DISTURBANCE', 'THEFT MISD. (UNDER $950)', 'THEFT FROM AUTO',
       'DOMESTIC VIOLENCE', 'THEFT FELONY (OVER $950)', 'ALCOHOL OFFENSE',
       'MISSING JUVENILE', 'ROBBERY', 'IDENTITY THEFT',
       'ASSAULT/BATTERY MISD.', '2ND RESPONSE', 'BRANDISHING',
       'MISSING ADULT', 'NARCOTICS', 'FRAUD/FORGERY',
       'ASSAULT/BATTERY FEL.', 'BURGLARY COMMERCIAL', 'MUNICIPAL CODE',
       'ARSON', 'SEXUAL ASSAULT FEL.', 'VEHICLE RECOVERED',
       'SEXUAL ASSAULT MISD.', 'KIDNAPPING', 'VICE', 'HOMICIDE'], dtype=object)
calls['CVLEGEND'].unique()
array(['BURGLARY - VEHICLE', 'LARCENY', 'WEAPONS OFFENSE',
       'MOTOR VEHICLE THEFT', 'BURGLARY - RESIDENTIAL', 'VANDALISM',
       'DISORDERLY CONDUCT', 'LARCENY - FROM VEHICLE', 'FAMILY OFFENSE',
       'LIQUOR LAW VIOLATION', 'MISSING PERSON', 'ROBBERY', 'FRAUD',
       'ASSAULT', 'NOISE VIOLATION', 'DRUG VIOLATION',
       'BURGLARY - COMMERCIAL', 'ALL OTHER OFFENSES', 'ARSON', 'SEX CRIME',
       'RECOVERED VEHICLE', 'KIDNAPPING', 'HOMICIDE'], dtype=object)

Since each value in these columns appears to be spelled correctly, we won't have to perform any corrections on these columns.

We also check the BLKADDR column for inconsistencies and find that sometimes an address is recorded (e.g. 2500 LE CONTE AVE) but other times a cross street is recorded (e.g. ALLSTON WAY & FIFTH ST). This suggests that a human entered this data and that this column will be difficult to use for analysis. Fortunately, we can use the latitude and longitude of the event instead of the street address.

calls['BLKADDR'][[0, 5001]]
0            2500 LE CONTE AVE
5001    ALLSTON WAY & FIFTH ST
Name: BLKADDR, dtype: object
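To get a rough sense of how common the cross-street format is, we can count the BLKADDR values that contain an ampersand. This is a heuristic sketch, not an exhaustive consistency check:

# Count addresses recorded as cross streets (e.g. ALLSTON WAY & FIFTH ST);
# na=False treats the 27 missing addresses as non-matches
calls['BLKADDR'].str.contains('&', na=False).sum()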

Final Touchups

This dataset seems almost ready for analysis. The Block_Location column seems to contain strings that record address, latitude, and longitude. We will want to separate the latitude and longitude for easier use.

def split_lat_lon(calls):
    return calls.join(
        calls['Block_Location']
        # Get coords from string
        .str.split('\n').str[2]
        # Remove parens from coords
        .str[1:-1]
        # Split latitude and longitude
        .str.split(', ', expand=True)
        .rename(columns={0: 'Latitude', 1: 'Longitude'})
    )

calls.pipe(split_lat_lon).head(2)
CASENO OFFENSE EVENTDT EVENTTM ... City State Latitude Longitude
0 17091420 BURGLARY AUTO 07/23/2017 12:00:00 AM 06:00 ... Berkeley CA 37.876965 -122.260544
1 17020462 THEFT FROM PERSON 04/13/2017 12:00:00 AM 08:45 ... Berkeley CA 37.869363 -122.268028

2 rows × 13 columns
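Note that the new Latitude and Longitude columns hold strings, not numbers. If we want to compute with them later, a small follow-up sketch (using a hypothetical parse_lat_lon helper) converts them with pd.to_numeric:

def parse_lat_lon(calls):
    # Convert the string coordinates to floats; errors='coerce' turns
    # malformed or missing coordinates into NaN instead of raising
    return calls.assign(
        Latitude=pd.to_numeric(calls['Latitude'], errors='coerce'),
        Longitude=pd.to_numeric(calls['Longitude'], errors='coerce'),
    )

calls.pipe(split_lat_lon).pipe(parse_lat_lon).dtypes[['Latitude', 'Longitude']]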

Then, we can match the day of week number with its weekday:

# This DF contains the day for each number in CVDOW
day_of_week = pd.read_csv('data/cvdow.csv')
day_of_week
CVDOW Day
0 0 Sunday
1 1 Monday
2 2 Tuesday
3 3 Wednesday
4 4 Thursday
5 5 Friday
6 6 Saturday
def match_weekday(calls):
    return calls.merge(day_of_week, on='CVDOW')
calls.pipe(match_weekday).head(2)
CASENO OFFENSE EVENTDT EVENTTM ... BLKADDR City State Day
0 17091420 BURGLARY AUTO 07/23/2017 12:00:00 AM 06:00 ... 2500 LE CONTE AVE Berkeley CA Sunday
1 17038302 BURGLARY AUTO 07/02/2017 12:00:00 AM 22:00 ... BOWDITCH STREET & CHANNING WAY Berkeley CA Sunday

2 rows × 12 columns
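One caveat: merge does not preserve the original row order (the result above is grouped by CVDOW), which is why row 1 now shows a different case than before. If keeping the original order matters, a map-based sketch (using a hypothetical match_weekday_ordered helper) performs the same lookup without reordering:

def match_weekday_ordered(calls):
    # Look up each CVDOW number in day_of_week without reordering the rows
    day_lookup = day_of_week.set_index('CVDOW')['Day']
    return calls.assign(Day=calls['CVDOW'].map(day_lookup))

calls.pipe(match_weekday_ordered).head(2)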

We'll drop columns we no longer need:

def drop_unneeded_cols(calls):
    return calls.drop(columns=['CVDOW', 'InDbDate', 'Block_Location', 'City',
                               'State', 'EVENTDT', 'EVENTTM'])

Finally, we'll pipe the calls DF through all the functions we've defined:

calls_final = (calls.pipe(combine_event_datetimes)
               .pipe(split_lat_lon)
               .pipe(match_weekday)
               .pipe(drop_unneeded_cols))
df_interact(calls_final)
(5508 rows, 8 columns) total

The Calls dataset is now ready for further data analysis. In the next section, we will clean the Stops dataset.

# HIDDEN
# Save data to CSV for other chapters
# calls_final.to_csv('../ch5/data/calls.csv', index=False)