# HIDDEN
# Clear previously defined variables
%reset -f

# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/05'))
# HIDDEN
import warnings
# Ignore numpy dtype warnings. These warnings are caused by an interaction
# between numpy and Cython and can be safely ignored.
# Reference: https://stackoverflow.com/a/40846742
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
import nbinteract as nbi

sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.options.display.max_rows = 7
pd.options.display.max_columns = 8
pd.set_option('display.precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)

Granularity

The granularity of your data is what each record in your data represents. For example, in the Calls dataset each record represents a single police call.

# HIDDEN
calls = pd.read_csv('data/calls.csv')
calls.head()
| | CASENO | OFFENSE | CVLEGEND | BLKADDR | EVENTDTTM | Latitude | Longitude | Day |
|---|---|---|---|---|---|---|---|---|
| 0 | 17091420 | BURGLARY AUTO | BURGLARY - VEHICLE | 2500 LE CONTE AVE | 2017-07-23 06:00:00 | 37.876965 | -122.260544 | Sunday |
| 1 | 17038302 | BURGLARY AUTO | BURGLARY - VEHICLE | BOWDITCH STREET & CHANNING WAY | 2017-07-02 22:00:00 | 37.867209 | -122.256554 | Sunday |
| 2 | 17049346 | THEFT MISD. (UNDER $950) | LARCENY | 2900 CHANNING WAY | 2017-08-20 23:20:00 | 37.867948 | -122.250664 | Sunday |
| 3 | 17091319 | THEFT MISD. (UNDER $950) | LARCENY | 2100 RUSSELL ST | 2017-07-09 04:15:00 | 37.856719 | -122.266672 | Sunday |
| 4 | 17044238 | DISTURBANCE | DISORDERLY CONDUCT | TELEGRAPH AVENUE & DURANT AVE | 2017-07-30 01:16:00 | 37.867816 | -122.258994 | Sunday |

In the Stops dataset, each record represents a single police stop.

# HIDDEN
stops = pd.read_csv('data/stops.csv', parse_dates=[1], infer_datetime_format=True)
stops.head()
| | Incident Number | Call Date/Time | Location | Incident Type | Dispositions | Location - Latitude | Location - Longitude |
|---|---|---|---|---|---|---|---|
| 0 | 2015-00004825 | 2015-01-26 00:10:00 | SAN PABLO AVE / MARIN AVE | T | M | NaN | NaN |
| 1 | 2015-00004829 | 2015-01-26 00:50:00 | SAN PABLO AVE / CHANNING WAY | T | M | NaN | NaN |
| 2 | 2015-00004831 | 2015-01-26 01:03:00 | UNIVERSITY AVE / NINTH ST | T | M | NaN | NaN |
| 3 | 2015-00004848 | 2015-01-26 07:16:00 | 2000 BLOCK BERKELEY WAY | 1194 | BM4ICN | NaN | NaN |
| 4 | 2015-00004849 | 2015-01-26 07:43:00 | 1700 BLOCK SAN PABLO AVE | 1194 | BM4ICN | NaN | NaN |

On the other hand, we could have received the Stops data in the following format:

# HIDDEN
(stops
 .groupby(stops['Call Date/Time'].dt.date)
 .size()
 .rename('Num Incidents')
 .to_frame()
)
| Call Date/Time | Num Incidents |
|---|---|
| 2015-01-26 | 46 |
| 2015-01-27 | 57 |
| 2015-01-28 | 56 |
| ... | ... |
| 2017-04-28 | 82 |
| 2017-04-29 | 86 |
| 2017-04-30 | 59 |

825 rows × 1 columns

In this case, each record in the table corresponds to a single date instead of a single incident. We would describe this table as having a coarser granularity than the one above. It's important to know the granularity of your data because it determines what kinds of analyses you can perform. Generally speaking, a granularity that is too fine is better than one that is too coarse: we can use grouping and pivoting to move from a fine granularity to a coarse one, but we have few tools to go from coarse to fine.
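For instance, here is a minimal sketch of using a pivot table to coarsen the Calls data from one row per call to one row per day of the week; it assumes only the Day, CVLEGEND, and CASENO columns shown in the preview above.

# A minimal sketch: coarsen the Calls data from one row per call to one row
# per day of the week, counting calls in each offense category.
# Assumes the Day, CVLEGEND, and CASENO columns shown in the preview above.
calls_by_day = calls.pivot_table(
    index='Day',         # one row per day of the week
    columns='CVLEGEND',  # one column per offense category
    values='CASENO',     # any column works here; we only count rows
    aggfunc='count',
)
calls_by_day

Note that once the data is in this coarser form, we can no longer recover the time or exact location of any individual call.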

Granularity Checklist

You should have answers to the following questions after looking at the granularity of your datasets. We will answer them for the Calls and Stops datasets.

What does a record represent?

In the Calls dataset, each record represents a single police call. In the Stops dataset, each record represents a single police stop.

Do all records capture granularity at the same level? (Sometimes a table will contain summary rows.)

Yes, for both the Calls and Stops datasets.
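One rough way to check this is to see whether the incident identifier columns contain repeated values, which could signal duplicate or summary rows worth inspecting. The sketch below assumes the CASENO and Incident Number columns shown in the previews above.

# A rough check: True means each identifier appears exactly once, so no row
# looks like a repeated or pre-computed summary of other rows.
# Assumes the CASENO and Incident Number columns shown above.
print(calls['CASENO'].is_unique)
print(stops['Incident Number'].is_unique)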

If the data were aggregated, how was the aggregation performed? Sampling and averaging are common aggregations.

As far as we can tell, no aggregations were performed on either dataset. We do keep in mind, however, that in both datasets the location is recorded as a block location rather than a specific address.

What kinds of aggregations can we perform on the data?

For example, it's often useful to aggregate individual people to demographic groups or individual events to totals across time.

In this case, we can aggregate across various granularities of date or time. For example, we can aggregate the calls by hour of day to find the hours when incidents occur most often. We might also aggregate across event locations to find the regions of Berkeley with the most incidents.
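As a minimal sketch, the hour-of-day aggregation for the Calls data might look like the following; it assumes the EVENTDTTM strings shown in the preview above parse cleanly with pd.to_datetime.

# A minimal sketch: count calls in each hour of the day.
# Assumes the EVENTDTTM strings parse cleanly with pd.to_datetime.
hour_of_day = pd.to_datetime(calls['EVENTDTTM']).dt.hour
hour_of_day.value_counts().sort_index()  # calls per hour, 0 through 23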