# HIDDEN
# Clear previously defined variables
%reset -f

# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/10'))

Modeling and Estimation

Essentially, all models are wrong, but some are useful.

George Box, Statistician (1919-2013)

We have covered question formulation, data cleaning, and exploratory data analysis, the first three steps of the data science lifecycle. We have also seen that EDA often reveals relationships between variables in our dataset. How do we decide whether a relationship is real or spurious? How do we use these relationships to make reliable predictions about the future? To answer these questions, we will need mathematical tools for modeling and estimation.

A model is an idealized representation of a system. For example, if we drop a steel ball off the Leaning Tower of Pisa, a simple model of gravity states that we expect the ball to drop to the ground, accelerating at the rate of 9.8 m/s². This model also allows us to predict how long it will take the ball to hit the ground using the laws of projectile motion.
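As a rough illustration of the prediction described above, the sketch below computes the fall time from the constant-acceleration formula $ t = \sqrt{2h/g} $ for an object dropped from rest. The drop height of 56 meters is an assumed round figure for the tower, and the function names `fall_time` and `impact_speed` are our own, chosen only for this example.

```python
import math

G = 9.8  # acceleration due to gravity in m/s^2, as stated in the text


def fall_time(height_m, g=G):
    """Seconds for an object dropped from rest to fall height_m meters.

    Uses h = (1/2) * g * t^2, solved for t. Air resistance is ignored.
    """
    if height_m < 0:
        raise ValueError("height_m must be non-negative")
    return math.sqrt(2 * height_m / g)


def impact_speed(height_m, g=G):
    """Speed in m/s when the object reaches the ground (v = g * t)."""
    return g * fall_time(height_m, g)


# Assumption: a drop height of roughly 56 m, used only for illustration.
TOWER_HEIGHT_M = 56.0

print(f"Predicted fall time: {fall_time(TOWER_HEIGHT_M):.2f} s")
print(f"Predicted impact speed: {impact_speed(TOWER_HEIGHT_M):.1f} m/s")
```

The model takes a single input, the drop height, and returns a single prediction. That simplicity is what makes it easy to use, and also what makes it an approximation.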

This model of gravity describes our system's behavior but is only an approximation: it leaves out the effects of air resistance, the gravitational effects of other celestial bodies, and the buoyancy of air. Because of these omitted factors, our model's predictions will almost never exactly match what we measure in real life! Still, the simple model of gravity is accurate enough in so many situations that it's widely used and taught today.

Similarly, any model that we define using data is an approximation of a real-world process. When the approximations are not too severe, our model has practical use. This naturally raises a few fundamental questions. How do we choose a model? How do we know whether we need a more complicated model?

In the remaining chapters of the book, we will develop computational tools to design and fit models to data. We will also introduce inferential tools that allow us to reason about our models' ability to generalize to the population of interest.