# HIDDEN
# Clear previously defined variables
%reset -f

# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/05'))
# HIDDEN
import warnings
# Ignore numpy dtype warnings. These warnings are caused by an interaction
# between numpy and Cython and can be safely ignored.
# Reference: https://stackoverflow.com/a/40846742
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
import nbinteract as nbi

sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.options.display.max_rows = 7
pd.options.display.max_columns = 8
pd.set_option('display.precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)

Granularity

The granularity of your data is what each record in your data represents. For example, in the Calls dataset each record represents a single police call.

# HIDDEN
calls = pd.read_csv('data/calls.csv')
calls.head()
| | CASENO | OFFENSE | CVLEGEND | BLKADDR | EVENTDTTM | Latitude | Longitude | Day |
|---|---|---|---|---|---|---|---|---|
| 0 | 17091420 | BURGLARY AUTO | BURGLARY - VEHICLE | 2500 LE CONTE AVE | 2017-07-23 06:00:00 | 37.876965 | -122.260544 | Sunday |
| 1 | 17038302 | BURGLARY AUTO | BURGLARY - VEHICLE | BOWDITCH STREET & CHANNING WAY | 2017-07-02 22:00:00 | 37.867209 | -122.256554 | Sunday |
| 2 | 17049346 | THEFT MISD. (UNDER $950) | LARCENY | 2900 CHANNING WAY | 2017-08-20 23:20:00 | 37.867948 | -122.250664 | Sunday |
| 3 | 17091319 | THEFT MISD. (UNDER $950) | LARCENY | 2100 RUSSELL ST | 2017-07-09 04:15:00 | 37.856719 | -122.266672 | Sunday |
| 4 | 17044238 | DISTURBANCE | DISORDERLY CONDUCT | TELEGRAPH AVENUE & DURANT AVE | 2017-07-30 01:16:00 | 37.867816 | -122.258994 | Sunday |

In the Stops dataset, each record represents a single police stop.

# HIDDEN
stops = pd.read_csv('data/stops.csv', parse_dates=[1], infer_datetime_format=True)
stops.head()
| | Incident Number | Call Date/Time | Location | Incident Type | Dispositions | Location - Latitude | Location - Longitude |
|---|---|---|---|---|---|---|---|
| 0 | 2015-00004825 | 2015-01-26 00:10:00 | SAN PABLO AVE / MARIN AVE | T | M | NaN | NaN |
| 1 | 2015-00004829 | 2015-01-26 00:50:00 | SAN PABLO AVE / CHANNING WAY | T | M | NaN | NaN |
| 2 | 2015-00004831 | 2015-01-26 01:03:00 | UNIVERSITY AVE / NINTH ST | T | M | NaN | NaN |
| 3 | 2015-00004848 | 2015-01-26 07:16:00 | 2000 BLOCK BERKELEY WAY | 1194 | BM4ICN | NaN | NaN |
| 4 | 2015-00004849 | 2015-01-26 07:43:00 | 1700 BLOCK SAN PABLO AVE | 1194 | BM4ICN | NaN | NaN |

On the other hand, we could have received the Stops data in the following format:

# HIDDEN
(stops
 .groupby(stops['Call Date/Time'].dt.date)
 .size()
 .rename('Num Incidents')
 .to_frame()
)
| Call Date/Time | Num Incidents |
|---|---|
| 2015-01-26 | 46 |
| 2015-01-27 | 57 |
| 2015-01-28 | 56 |
| ... | ... |
| 2017-04-28 | 82 |
| 2017-04-29 | 86 |
| 2017-04-30 | 59 |

825 rows × 1 columns

In this case, each record in the table corresponds to a single date instead of a single incident. We would describe this table as having a coarser granularity than the one above. It's important to know the granularity of your data because it determines what kinds of analyses you can perform. Generally speaking, a granularity that is too fine is better than one that is too coarse: we can use grouping and pivoting to move from a fine granularity to a coarse one, but we have few tools to go from coarse to fine.
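For instance, here is a minimal sketch of using a pivot table to coarsen the Calls data from one row per call to one row per day of the week; it assumes only the Day, CVLEGEND, and CASENO columns shown in the preview above.

# A minimal sketch: coarsen the Calls data from one row per call to one row
# per day of the week, counting calls in each offense category.
# Assumes the Day, CVLEGEND, and CASENO columns shown in the preview above.
calls_by_day = calls.pivot_table(
    index='Day',         # one row per day of the week
    columns='CVLEGEND',  # one column per offense category
    values='CASENO',     # any column works here; we only count rows
    aggfunc='count',
)
calls_by_day

Note that once the data is in this coarser form, we can no longer recover the time or exact location of any individual call.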

Granularity Checklist

You should have answers to the following questions after looking at the granularity of your datasets. We will answer them for the Calls and Stops datasets.

What does a record represent?

In the Calls dataset, each record represents a single police call. In the Stops dataset, each record represents a single police stop.

Do all records capture granularity at the same level? (Sometimes a table will contain summary rows.)

Yes, for both the Calls and Stops datasets.
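One rough way to check this is to see whether the incident identifier columns contain repeated values, which could signal duplicate or summary rows worth inspecting. The sketch below assumes the CASENO and Incident Number columns shown in the previews above.

# A rough check: True means each identifier appears exactly once, so no row
# looks like a repeated or pre-computed summary of other rows.
# Assumes the CASENO and Incident Number columns shown above.
print(calls['CASENO'].is_unique)
print(stops['Incident Number'].is_unique)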

If the data were aggregated, how was the aggregation performed? Sampling and averaging are common aggregations.

As far as we can tell, no aggregations were performed on either dataset. We do keep in mind, however, that in both datasets the location is recorded as a block location rather than a specific address.

What kinds of aggregations can we perform on the data?

For example, it's often useful to aggregate individual people to demographic groups or individual events to totals across time.

In this case, we can aggregate across various granularities of date or time. For example, we can aggregate the calls by hour of day to find the hours when incidents occur most often. We might also aggregate across event locations to find the regions of Berkeley with the most incidents.
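As a minimal sketch, the hour-of-day aggregation for the Calls data might look like the following; it assumes the EVENTDTTM strings shown in the preview above parse cleanly with pd.to_datetime.

# A minimal sketch: count calls in each hour of the day.
# Assumes the EVENTDTTM strings parse cleanly with pd.to_datetime.
hour_of_day = pd.to_datetime(calls['EVENTDTTM']).dt.hour
hour_of_day.value_counts().sort_index()  # calls per hour, 0 through 23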