# HIDDEN
# Clear previously defined variables
%reset -f

# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/05'))
# HIDDEN
import warnings
# Ignore numpy dtype warnings. These warnings are caused by an interaction
# between numpy and Cython and can be safely ignored.
# Reference: https://stackoverflow.com/a/40846742
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
import nbinteract as nbi

sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.options.display.max_rows = 7
pd.options.display.max_columns = 8
pd.set_option('display.precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)

Faithfulness

We describe a dataset as "faithful" if we believe it accurately captures reality. Typically, untrustworthy datasets contain:

Unrealistic or incorrect values

For example, dates in the future, locations that don't exist, negative counts, or large outliers.

Violations of obvious dependencies

For example, a person's age and birthday don't match.

Hand-entered data

As we have seen, these are typically filled with spelling errors and inconsistencies.

Clear signs of data falsification

For example, repeated names, fake-looking email addresses, or repeated use of uncommon names or fields.
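
Several of these checks are easy to express as code. The sketch below runs a few of them on a small, made-up DataFrame (the column names and values are hypothetical, purely for illustration): it flags negative counts, birth dates in the future, and ages that disagree with the recorded birthday.

# Hypothetical survey data, made up purely for illustration
people = pd.DataFrame({
    'name': ['Ann', 'Bob', 'Cat'],
    'age': [34, -2, 71],
    'birthday': pd.to_datetime(['1984-05-01', '2030-03-12', '1948-11-30']),
    'num_pets': [1, 3, -5],
})

# Unrealistic or incorrect values: negative counts and dates in the future
print(people[people['num_pets'] < 0])
print(people[people['birthday'] > pd.Timestamp('today')])

# Violations of obvious dependencies: age should roughly agree with the
# year implied by the birthday
implied_age = pd.Timestamp('today').year - people['birthday'].dt.year
print(people[(people['age'] - implied_age).abs() > 1])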

Notice the many similarities to data cleaning. As we have mentioned, we often go back and forth between data cleaning and EDA, especially when determining data faithfulness. For example, visualizations often help us identify strange entries in the data.

calls = pd.read_csv('data/calls.csv')
calls.head()
CASENO OFFENSE EVENTDT EVENTTM ... BLKADDR Latitude Longitude Day
0 17091420 BURGLARY AUTO 07/23/2017 12:00:00 AM 06:00 ... 2500 LE CONTE AVE 37.876965 -122.260544 Sunday
1 17038302 BURGLARY AUTO 07/02/2017 12:00:00 AM 22:00 ... BOWDITCH STREET & CHANNING WAY 37.867209 -122.256554 Sunday
2 17049346 THEFT MISD. (UNDER $950) 08/20/2017 12:00:00 AM 23:20 ... 2900 CHANNING WAY 37.867948 -122.250664 Sunday
3 17091319 THEFT MISD. (UNDER $950) 07/09/2017 12:00:00 AM 04:15 ... 2100 RUSSELL ST 37.856719 -122.266672 Sunday
4 17044238 DISTURBANCE 07/30/2017 12:00:00 AM 01:16 ... TELEGRAPH AVENUE & DURANT AVE 37.867816 -122.258994 Sunday

5 rows × 9 columns

calls['CASENO'].plot.hist(bins=30)
[Histogram of CASENO with 30 bins: the case numbers fall into two clusters, one near 17030000 and one near 17090000.]

Notice the unexpected clusters at 17030000 and 17090000. By plotting the distribution of case numbers, we can quickly see anomalies in the data. In this case, we might guess that two different teams of police use different sets of case numbers for their calls.
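
To investigate further, we might count how many calls fall into each apparent cluster. In the sketch below, the cutoff of 17060000 is only an eyeballed guess from the histogram, not a value documented anywhere in the dataset:

# 17060000 is an assumed dividing line between the two clusters,
# chosen by eye from the histogram above
in_low_cluster = calls['CASENO'] < 17060000
in_low_cluster.value_counts()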

Exploring the data often reveals anomalies like these; when they are fixable, we can apply data cleaning techniques to correct them.