# HIDDEN
# Clear previously defined variables
%reset -f

# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/01'))
# HIDDEN
import warnings
# Ignore numpy dtype warnings. These warnings are caused by an interaction
# between numpy and Cython and can be safely ignored.
# Reference: https://stackoverflow.com/a/40846742
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
import nbinteract as nbi

sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.options.display.max_rows = 7
pd.options.display.max_columns = 8
pd.set_option('display.precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)

The Students of Data 100

The data science lifecycle involves the following general steps:

  1. Question/Problem Formulation:
    1. What do we want to know or what problems are we trying to solve?
    2. What are our hypotheses?
    3. What are our metrics of success?

  2. Data Acquisition and Cleaning:
    1. What data do we have and what data do we need?
    2. How will we collect more data?
    3. How do we organize the data for analysis?

  3. Exploratory Data Analysis:
    1. Do we already have relevant data?
    2. What are the biases, anomalies, or other issues with the data?
    3. How do we transform the data to enable effective analysis?

  4. Prediction and Inference:
    1. What does the data say about the world?
    2. Does it answer our questions or accurately solve the problem?
    3. How robust are our conclusions?

We now demonstrate this process applied to a dataset of student first names from a previous offering of Data 100. In this chapter, we proceed quickly in order to give the reader a general sense of a complete iteration through the lifecycle. In later chapters, we expand on each step in this process to develop a repertoire of skills and principles.

Question Formulation

We would like to find out whether students' first names tell us anything more about the students themselves. Although this is a vague question, it is enough to get us working with our data, and we can make the question more precise as we go.

Data Acquisition and Cleaning

Let's begin by looking at our data, the roster of student first names that we've downloaded from a previous offering of Data 100.

Don't worry if you don't understand the code for now; we introduce the libraries in more depth soon. Instead, focus on the process and the charts that we create.

import pandas as pd

students = pd.read_csv('roster.csv')
students
        Name              Role
0     Keeley           Student
1       John           Student
2      BRYAN           Student
...      ...               ...
276  Ernesto  Waitlist Student
277    Athan  Waitlist Student
278  Michael  Waitlist Student

279 rows × 2 columns

We can quickly see that there are some quirks in the data. For example, one student's name is written in all uppercase letters. In addition, it is not obvious what the Role column is for.
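One way to get a feel for an unfamiliar column like Role is to tally its distinct values. The following is a minimal sketch that uses a small made-up table in place of the real roster; the rows and the names `sample` and `role_counts` are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for the roster; the real data come from roster.csv.
sample = pd.DataFrame({
    'Name': ['Keeley', 'John', 'BRYAN', 'Ernesto'],
    'Role': ['Student', 'Student', 'Student', 'Waitlist Student'],
})

# Count how many rows carry each distinct value of the Role column.
role_counts = sample['Role'].value_counts()
print(role_counts)
```

Applied to the full roster, the same call would list every role that appears along with how often it occurs.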

In Data 100, we will study how to identify anomalies in data and apply corrections. Differences in capitalization cause our programs to treat 'BRYAN' and 'Bryan' as different names, even though they are identical for our purposes. Let's convert all names to lowercase to avoid this.

students['Name'] = students['Name'].str.lower()
students
        Name              Role
0     keeley           Student
1       john           Student
2      bryan           Student
...      ...               ...
276  ernesto  Waitlist Student
277    athan  Waitlist Student
278  michael  Waitlist Student

279 rows × 2 columns
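To see why capitalization matters, we can count distinct names before and after lowercasing. This is a rough sketch on a few made-up names rather than the roster itself; the variables `names`, `distinct_before`, and `distinct_after` are hypothetical.

```python
import pandas as pd

# Hypothetical names for illustration only.
names = pd.Series(['BRYAN', 'Bryan', 'bryan', 'John'])

# Without normalization, each capitalization counts as a separate name.
distinct_before = names.nunique()

# After lowercasing, the three spellings of "bryan" collapse into one.
distinct_after = names.str.lower().nunique()

print(distinct_before, distinct_after)
```

Any later tally of names would overcount in the same way unless the capitalization is made consistent first.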

Now that our data are in a more useful format, we proceed to exploratory data analysis.