# HIDDEN
# Clear previously defined variables
%reset -f

# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/02'))

Data Design

Data science would hardly be a discipline without data. It is thus of utmost importance that we begin any data analysis by understanding how our data were collected.

In this chapter, we discuss data design: the process by which data are collected. Many well-meaning scientists have drawn premature conclusions because they did not look carefully enough at how their data were designed and collected. We will use examples and simulations to justify the importance of probability sampling in data science.
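As a small preview of the kind of simulation this refers to, the sketch below compares a simple random sample with a convenience sample drawn from the same population. Everything in it is an illustrative assumption and not taken from the chapter: the population is synthetic (normal with mean 50 and standard deviation 10), the function name `simulate` is hypothetical, and the convenience sample is modeled in a deliberately exaggerated way as the units with the largest values.

```python
import random
import statistics


def simulate(seed=0, population_size=10_000, sample_size=100):
    """Compare a probability sample with a convenience sample.

    Hypothetical setup: the population and the selection mechanism
    are made up for illustration only.
    """
    rng = random.Random(seed)

    # Synthetic population of numeric measurements.
    population = [rng.gauss(50, 10) for _ in range(population_size)]
    true_mean = statistics.mean(population)

    # Simple random sample: every unit has the same chance of selection.
    srs = rng.sample(population, sample_size)

    # Convenience sample (assumption): only the units with the largest
    # values are "easy to reach", so only they end up in the sample.
    convenience = sorted(population, reverse=True)[:sample_size]

    return true_mean, statistics.mean(srs), statistics.mean(convenience)


true_mean, srs_mean, convenience_mean = simulate()
print(f"population mean:         {true_mean:.2f}")
print(f"simple random sample:    {srs_mean:.2f}")
print(f"convenience sample mean: {convenience_mean:.2f}")
```

Under these assumptions, the simple random sample's mean should land close to the population mean, while the convenience sample's mean is far too high, because selection into that sample depends on the very quantity being measured.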