# HIDDEN
# Clear previously defined variables
%reset -f

# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/05'))
# HIDDEN
import warnings
# Ignore numpy dtype warnings. These warnings are caused by an interaction
# between numpy and Cython and can be safely ignored.
# Reference: https://stackoverflow.com/a/40846742
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
import nbinteract as nbi

sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.options.display.max_rows = 7
pd.options.display.max_columns = 8
pd.set_option('display.precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)
# HIDDEN
calls = pd.read_csv('data/calls.csv', parse_dates=['EVENTDTTM'], infer_datetime_format=True)
stops = pd.read_csv('data/stops.csv', parse_dates=[1], infer_datetime_format=True)

## Scope

The scope of the dataset refers to the coverage of the dataset in relation to what we are interested in analyzing. We seek to answer the following question about our data scope:

Does the data cover the topic of interest?

For example, the Calls and Stops datasets contain call and stop incidents recorded in Berkeley. If we are interested in crime incidents across the state of California, however, these datasets will be too limited in scope.

In general, a larger scope is more useful than a smaller one: we can filter a large-scope dataset down to a smaller scope, but we usually cannot go the other way. For example, if we had a dataset of police stops in the United States, we could subset it to investigate Berkeley, as in the sketch below.
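
As a minimal sketch of this kind of filtering, suppose we had a nationwide stops table. The us_stops DataFrame and its city and state columns below are hypothetical, invented purely for illustration:

# Hypothetical nationwide dataset; the table and column names are invented
us_stops = pd.DataFrame({
    'city':  ['Berkeley', 'Oakland', 'Berkeley', 'Chicago'],
    'state': ['CA', 'CA', 'CA', 'IL'],
    'stop_date': pd.to_datetime(['2017-01-01', '2017-01-02',
                                 '2017-01-03', '2017-01-04']),
})

# Filter the larger scope (United States) down to the smaller one (Berkeley)
berkeley_stops = us_stops[(us_stops['city'] == 'Berkeley') &
                          (us_stops['state'] == 'CA')]
berkeley_stops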

Keep in mind that scope is a broad term that is not limited to geographic coverage. For example, it can also refer to time coverage: the Calls dataset only contains data for a 180-day period.

We will often address the scope of the dataset during the investigation of the data generation process and confirm the dataset's scope during EDA. Let's confirm the geographic and time scope of the Calls dataset.

calls
|      | CASENO   | OFFENSE                  | CVLEGEND           | BLKADDR                         | EVENTDTTM           | Latitude  | Longitude   | Day    |
|------|----------|--------------------------|--------------------|---------------------------------|---------------------|-----------|-------------|--------|
| 0    | 17091420 | BURGLARY AUTO            | BURGLARY - VEHICLE | 2500 LE CONTE AVE               | 2017-07-23 06:00:00 | 37.876965 | -122.260544 | Sunday |
| 1    | 17038302 | BURGLARY AUTO            | BURGLARY - VEHICLE | BOWDITCH STREET & CHANNING WAY  | 2017-07-02 22:00:00 | 37.867209 | -122.256554 | Sunday |
| 2    | 17049346 | THEFT MISD. (UNDER $950) | LARCENY            | 2900 CHANNING WAY               | 2017-08-20 23:20:00 | 37.867948 | -122.250664 | Sunday |
| ...  | ...      | ...                      | ...                | ...                             | ...                 | ...       | ...         | ...    |
| 5505 | 17021604 | IDENTITY THEFT           | FRAUD              | 100 MONTROSE RD                 | 2017-03-31 00:00:00 | 37.896218 | -122.270671 | Friday |
| 5506 | 17033201 | DISTURBANCE              | DISORDERLY CONDUCT | 2300 COLLEGE AVE                | 2017-06-09 22:34:00 | 37.868957 | -122.254552 | Friday |
| 5507 | 17047247 | BURGLARY AUTO            | BURGLARY - VEHICLE | UNIVERSITY AVENUE & CHESTNUT ST | 2017-08-11 20:00:00 | 37.869679 | -122.288038 | Friday |

5508 rows × 8 columns

# Shows earliest and latest dates in calls
calls['EVENTDTTM'].dt.date.sort_values()
1384    2017-03-02
1264    2017-03-02
1408    2017-03-02
           ...    
3516    2017-08-28
3409    2017-08-28
3631    2017-08-28
Name: EVENTDTTM, Length: 5508, dtype: object
calls['EVENTDTTM'].dt.date.max() - calls['EVENTDTTM'].dt.date.min()
datetime.timedelta(179)

The table contains data for a time period of 179 days between its earliest and latest calls, which is close enough to the 180-day period in the data description that we can suppose there were no calls on either March 1st, 2017 or August 29th, 2017.
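
One way to double-check that no dates inside the observed range are missing entirely is to compare the dates that appear in calls against a full daily calendar. A small sketch using the calls table loaded above:

# Dates between the first and last call with no recorded calls at all
observed_dates = set(calls['EVENTDTTM'].dt.date)
all_dates = pd.date_range(calls['EVENTDTTM'].dt.date.min(),
                          calls['EVENTDTTM'].dt.date.max(), freq='D')
[d.date() for d in all_dates if d.date() not in observed_dates]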

To check the geographic scope, we can use a map:

import folium  # Use the Folium JavaScript map library
import folium.plugins

SF_COORDINATES = (37.87, -122.28)
sf_map = folium.Map(location=SF_COORDINATES, zoom_start=13)
# Drop rows with missing coordinates and convert to a list of [lat, lon] pairs
locs = calls[['Latitude', 'Longitude']].astype('float').dropna().to_numpy()
heatmap = folium.plugins.HeatMap(locs.tolist(), radius=10)
sf_map.add_child(heatmap)

With a few exceptions, the Calls dataset covers the Berkeley area. We can see that most police calls happened in Downtown Berkeley and in the area south of the UC Berkeley campus.
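
Alongside the map, a quick numeric check of the coordinate bounds can back this up; the rough Berkeley bounding box in the comment is our own approximation:

# The min/max coordinates should roughly bound Berkeley
# (approximately 37.85 to 37.91 N and -122.32 to -122.23 W)
calls[['Latitude', 'Longitude']].agg(['min', 'max'])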

Let's now confirm the temporal and geographic scope for the Stops dataset:

stops
|       | Incident Number | Call Date/Time      | Location                     | Incident Type | Dispositions | Location - Latitude | Location - Longitude |
|-------|-----------------|---------------------|------------------------------|---------------|--------------|---------------------|----------------------|
| 0     | 2015-00004825   | 2015-01-26 00:10:00 | SAN PABLO AVE / MARIN AVE    | T             | M            | NaN                 | NaN                  |
| 1     | 2015-00004829   | 2015-01-26 00:50:00 | SAN PABLO AVE / CHANNING WAY | T             | M            | NaN                 | NaN                  |
| 2     | 2015-00004831   | 2015-01-26 01:03:00 | UNIVERSITY AVE / NINTH ST    | T             | M            | NaN                 | NaN                  |
| ...   | ...             | ...                 | ...                          | ...           | ...          | ...                 | ...                  |
| 29205 | 2017-00024245   | 2017-04-30 22:59:26 | UNIVERSITY AVE/6TH ST        | T             | BM2TWN       | NaN                 | NaN                  |
| 29206 | 2017-00024250   | 2017-04-30 23:19:27 | UNIVERSITY AVE / WEST ST     | T             | HM4TCS       | 37.869876           | -122.286551          |
| 29207 | 2017-00024254   | 2017-04-30 23:38:34 | CHANNING WAY / BOWDITCH ST   | 1194          | AR           | 37.867208           | -122.256529          |

29208 rows × 7 columns

stops['Call Date/Time'].dt.date.sort_values()
0        2015-01-26
25       2015-01-26
26       2015-01-26
            ...    
29175    2017-04-30
29177    2017-04-30
29207    2017-04-30
Name: Call Date/Time, Length: 29208, dtype: object
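
As with the Calls dataset, subtracting the earliest date from the latest shows the span of the collection period:

# Span between the earliest and latest stops
stops['Call Date/Time'].dt.date.max() - stops['Call Date/Time'].dt.date.min()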

As promised, the data collection begins on January 26th, 2015. It looks like the data were downloaded sometime around the beginning of May 2017, since the dates stop on April 30th, 2017. Let's draw a map to see the geographic data:

SF_COORDINATES = (37.87, -122.28)
sf_map = folium.Map(location=SF_COORDINATES, zoom_start=13)
# Drop rows with missing coordinates and convert to a list of [lat, lon] pairs
locs = stops[['Location - Latitude', 'Location - Longitude']].astype('float').dropna().to_numpy()
heatmap = folium.plugins.HeatMap(locs.tolist(), radius=10)
sf_map.add_child(heatmap)

We can confirm that the police stops in the dataset happened in Berkeley, and that most stops occurred in the Downtown Berkeley and West Berkeley areas.