# HIDDEN
# Clear previously defined variables
%reset -f

# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/05'))
# HIDDEN
import warnings
# Ignore numpy dtype warnings. These warnings are caused by an interaction
# between numpy and Cython and can be safely ignored.
# Reference: https://stackoverflow.com/a/40846742
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
import nbinteract as nbi

sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.options.display.max_rows = 7
pd.options.display.max_columns = 8
pd.set_option('display.precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)
# HIDDEN
calls = pd.read_csv('data/calls.csv', parse_dates=['EVENTDTTM'], infer_datetime_format=True)
stops = pd.read_csv('data/stops.csv', parse_dates=[1], infer_datetime_format=True)

## Scope

The scope of the dataset refers to the coverage of the dataset in relation to what we are interested in analyzing. We seek to answer the following question about our data scope:

Does the data cover the topic of interest?

For example, the Calls and Stops datasets contain call and stop incidents recorded in Berkeley. If we are interested in crime incidents across the state of California, however, these datasets will be too limited in scope.

In general, a larger scope is more useful than a smaller one: we can filter a large-scope dataset down to a smaller scope, but we usually cannot go the other way. For example, if we had a dataset of police stops in the United States, we could subset it to investigate Berkeley, as in the sketch below.
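
As a minimal sketch of this kind of filtering, suppose we had a nationwide stops table. The us_stops DataFrame and its city and state columns below are hypothetical, invented purely for illustration:

# Hypothetical nationwide dataset; the table and column names are invented
us_stops = pd.DataFrame({
    'city':  ['Berkeley', 'Oakland', 'Berkeley', 'Chicago'],
    'state': ['CA', 'CA', 'CA', 'IL'],
    'stop_date': pd.to_datetime(['2017-01-01', '2017-01-02',
                                 '2017-01-03', '2017-01-04']),
})

# Filter the larger scope (United States) down to the smaller one (Berkeley)
berkeley_stops = us_stops[(us_stops['city'] == 'Berkeley') &
                          (us_stops['state'] == 'CA')]
berkeley_stops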

Keep in mind that scope is a broad term that is not limited to geographic coverage. For example, it can also refer to time coverage: the Calls dataset only contains data for a 180-day period.

We will often address the scope of the dataset during the investigation of the data generation process and confirm the dataset's scope during EDA. Let's confirm the geographic and time scope of the Calls dataset.

calls
|      | CASENO   | OFFENSE                  | CVLEGEND           | BLKADDR                         | EVENTDTTM           | Latitude  | Longitude   | Day    |
|------|----------|--------------------------|--------------------|---------------------------------|---------------------|-----------|-------------|--------|
| 0    | 17091420 | BURGLARY AUTO            | BURGLARY - VEHICLE | 2500 LE CONTE AVE               | 2017-07-23 06:00:00 | 37.876965 | -122.260544 | Sunday |
| 1    | 17038302 | BURGLARY AUTO            | BURGLARY - VEHICLE | BOWDITCH STREET & CHANNING WAY  | 2017-07-02 22:00:00 | 37.867209 | -122.256554 | Sunday |
| 2    | 17049346 | THEFT MISD. (UNDER $950) | LARCENY            | 2900 CHANNING WAY               | 2017-08-20 23:20:00 | 37.867948 | -122.250664 | Sunday |
| ...  | ...      | ...                      | ...                | ...                             | ...                 | ...       | ...         | ...    |
| 5505 | 17021604 | IDENTITY THEFT           | FRAUD              | 100 MONTROSE RD                 | 2017-03-31 00:00:00 | 37.896218 | -122.270671 | Friday |
| 5506 | 17033201 | DISTURBANCE              | DISORDERLY CONDUCT | 2300 COLLEGE AVE                | 2017-06-09 22:34:00 | 37.868957 | -122.254552 | Friday |
| 5507 | 17047247 | BURGLARY AUTO            | BURGLARY - VEHICLE | UNIVERSITY AVENUE & CHESTNUT ST | 2017-08-11 20:00:00 | 37.869679 | -122.288038 | Friday |

5508 rows × 8 columns

# Shows earliest and latest dates in calls
calls['EVENTDTTM'].dt.date.sort_values()
1384    2017-03-02
1264    2017-03-02
1408    2017-03-02
           ...    
3516    2017-08-28
3409    2017-08-28
3631    2017-08-28
Name: EVENTDTTM, Length: 5508, dtype: object
calls['EVENTDTTM'].dt.date.max() - calls['EVENTDTTM'].dt.date.min()
datetime.timedelta(179)

The table contains data for a time period of 179 days between its earliest and latest calls, which is close enough to the 180-day period in the data description that we can suppose there were no calls on either March 1st, 2017 or August 29th, 2017.
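
One way to double-check that no dates inside the observed range are missing entirely is to compare the dates that appear in calls against a full daily calendar. A small sketch using the calls table loaded above:

# Dates between the first and last call with no recorded calls at all
observed_dates = set(calls['EVENTDTTM'].dt.date)
all_dates = pd.date_range(calls['EVENTDTTM'].dt.date.min(),
                          calls['EVENTDTTM'].dt.date.max(), freq='D')
[d.date() for d in all_dates if d.date() not in observed_dates]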

To check the geographic scope, we can use a map:

import folium  # Use the Folium JavaScript map library
import folium.plugins

SF_COORDINATES = (37.87, -122.28)
sf_map = folium.Map(location=SF_COORDINATES, zoom_start=13)
# Drop rows with missing coordinates and convert to a list of [lat, lon] pairs
locs = calls[['Latitude', 'Longitude']].astype('float').dropna().to_numpy()
heatmap = folium.plugins.HeatMap(locs.tolist(), radius=10)
sf_map.add_child(heatmap)

With a few exceptions, the Calls dataset covers the Berkeley area. We can see that most police calls happened in Downtown Berkeley and in the area south of the UC Berkeley campus.
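
Alongside the map, a quick numeric check of the coordinate bounds can back this up; the rough Berkeley bounding box in the comment is our own approximation:

# The min/max coordinates should roughly bound Berkeley
# (approximately 37.85 to 37.91 N and -122.32 to -122.23 W)
calls[['Latitude', 'Longitude']].agg(['min', 'max'])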

Let's now confirm the temporal and geographic scope for the Stops dataset:

stops
|       | Incident Number | Call Date/Time      | Location                     | Incident Type | Dispositions | Location - Latitude | Location - Longitude |
|-------|-----------------|---------------------|------------------------------|---------------|--------------|---------------------|----------------------|
| 0     | 2015-00004825   | 2015-01-26 00:10:00 | SAN PABLO AVE / MARIN AVE    | T             | M            | NaN                 | NaN                  |
| 1     | 2015-00004829   | 2015-01-26 00:50:00 | SAN PABLO AVE / CHANNING WAY | T             | M            | NaN                 | NaN                  |
| 2     | 2015-00004831   | 2015-01-26 01:03:00 | UNIVERSITY AVE / NINTH ST    | T             | M            | NaN                 | NaN                  |
| ...   | ...             | ...                 | ...                          | ...           | ...          | ...                 | ...                  |
| 29205 | 2017-00024245   | 2017-04-30 22:59:26 | UNIVERSITY AVE/6TH ST        | T             | BM2TWN       | NaN                 | NaN                  |
| 29206 | 2017-00024250   | 2017-04-30 23:19:27 | UNIVERSITY AVE / WEST ST     | T             | HM4TCS       | 37.869876           | -122.286551          |
| 29207 | 2017-00024254   | 2017-04-30 23:38:34 | CHANNING WAY / BOWDITCH ST   | 1194          | AR           | 37.867208           | -122.256529          |

29208 rows × 7 columns

stops['Call Date/Time'].dt.date.sort_values()
0        2015-01-26
25       2015-01-26
26       2015-01-26
            ...    
29175    2017-04-30
29177    2017-04-30
29207    2017-04-30
Name: Call Date/Time, Length: 29208, dtype: object
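
As with the Calls dataset, subtracting the earliest date from the latest shows the span of the collection period:

# Span between the earliest and latest stops
stops['Call Date/Time'].dt.date.max() - stops['Call Date/Time'].dt.date.min()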

As promised, the data collection begins on January 26th, 2015. It looks like the data were downloaded sometime around the beginning of May 2017, since the dates stop on April 30th, 2017. Let's draw a map to see the geographic data:

SF_COORDINATES = (37.87, -122.28)
sf_map = folium.Map(location=SF_COORDINATES, zoom_start=13)
# Drop rows with missing coordinates and convert to a list of [lat, lon] pairs
locs = stops[['Location - Latitude', 'Location - Longitude']].astype('float').dropna().to_numpy()
heatmap = folium.plugins.HeatMap(locs.tolist(), radius=10)
sf_map.add_child(heatmap)

We can confirm that the police stops in the dataset happened in Berkeley, and that most stops occurred in the Downtown Berkeley and West Berkeley areas.