# HIDDEN
# Clear previously defined variables
%reset -f

# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/02'))

Dewey Defeats Truman

In the 1948 US Presidential election, New York Governor Thomas Dewey ran against the incumbent Harry Truman. As usual, a number of polling agencies conducted polls of voters in order to predict which candidate was more likely to win the election.

1936: A Previous Polling Catastrophe

In 1936, three elections prior to 1948, the Literary Digest infamously predicted a landslide defeat for Franklin Delano Roosevelt. To make this claim, the magazine polled a sample of over 2 million people drawn from telephone and car registrations. As you may know, this sampling scheme suffers from sampling bias: those with telephones and cars tend to be wealthier than those without. In this case, the sampling bias was so great that the Literary Digest predicted Roosevelt would receive only 43% of the popular vote when he ended up with 61%, a difference of almost 20 percentage points and the largest error ever made by a major poll. The Literary Digest went out of business soon after.
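To see how a very large sample can still miss badly, here is a minimal simulation sketch. Every number in it is an assumption chosen for illustration, not historical data: a population of one million voters, 25% of whom own a telephone or car, with 43% support for Roosevelt among owners and 67% among everyone else (which works out to roughly 61% overall). The variable names are hypothetical as well.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population; all rates below are illustrative assumptions.
N = 1_000_000
owns_phone_or_car = rng.random(N) < 0.25

# Assumed support for Roosevelt: 43% among owners, 67% among non-owners.
p_support = np.where(owns_phone_or_car, 0.43, 0.67)
supports_fdr = rng.random(N) < p_support

true_share = supports_fdr.mean()

# Biased scheme: a very large sample drawn only from phone/car owners.
owner_idx = np.flatnonzero(owns_phone_or_car)
biased_sample = rng.choice(owner_idx, size=200_000, replace=False)
biased_estimate = supports_fdr[biased_sample].mean()

# Comparison: a much smaller simple random sample from the whole population.
srs = rng.choice(N, size=1_000, replace=False)
srs_estimate = supports_fdr[srs].mean()

print(f"True share:                {true_share:.3f}")
print(f"Biased sample (n=200,000): {biased_estimate:.3f}")
print(f"Random sample (n=1,000):   {srs_estimate:.3f}")
```

Under these assumptions, the biased estimate settles near the support rate among owners rather than the rate in the full population, no matter how many owners are polled. The small random sample is noisier but centered on the right value.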

1948: The Gallup Poll

Determined to learn from past mistakes, the Gallup Poll used a method called quota sampling to predict the results of the 1948 election. In their sampling scheme, each interviewer polled a set number of people from each demographic class. For example, the interviewers were required to interview both males and females of different ages, ethnicities, and income levels to match the demographics in the US Census. This ensured that the poll would not leave out important subgroups of the voting population.

Using this method, the Gallup Poll predicted that Thomas Dewey would earn 5 percentage points more of the popular vote than Harry Truman would. This difference was significant enough that the Chicago Tribune famously printed the headline "Dewey Defeats Truman".

As we know now, Truman ended up winning the election. In fact, he won with roughly 5 percentage points more of the popular vote than Dewey! What went wrong with the Gallup Poll?

The Problem With Quota Sampling

Although quota sampling did help pollsters reduce sampling bias, it introduced bias in another way. The Gallup Poll told its interviewers that as long as they fulfilled their quotas they could interview whomever they wished. Here's one possible explanation for why the interviewers ended up polling a disproportionate number of Republicans: at the time, Republicans were on average wealthier and more likely to live in nicer neighborhoods, making them easier to interview. This explanation is supported by the fact that the Gallup Poll overestimated the Republican vote by 2 to 6 percentage points in each of the three previous elections.
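The mechanism described above can be sketched with a small simulation. This is an illustration under made-up assumptions, not a reconstruction of Gallup's procedure: the population has a single demographic class (income), the Republican shares within each class are invented, and interviewers are assumed to be 1.5 times as likely to approach a Republican as a non-Republican within the same class. The `quota_sample` helper is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population; all rates below are illustrative assumptions.
N = 200_000
high_income = rng.random(N) < 0.30
# Assumed Republican share: 60% among high income, 40% among everyone else.
republican = rng.random(N) < np.where(high_income, 0.60, 0.40)
true_share = republican.mean()

# Assumed interviewer discretion: within any class, Republicans are
# 1.5 times as likely to be approached as non-Republicans.
weight = np.where(republican, 1.5, 1.0)

def quota_sample(class_mask, quota):
    """Fill a quota from one demographic class, favoring easier interviews."""
    idx = np.flatnonzero(class_mask)
    p = weight[idx] / weight[idx].sum()
    return rng.choice(idx, size=quota, replace=False, p=p)

# Quotas match the population's income breakdown exactly (30% / 70%).
sample = np.concatenate([
    quota_sample(high_income, 600),
    quota_sample(~high_income, 1_400),
])

quota_estimate = republican[sample].mean()
sample_high_income_share = high_income[sample].mean()

print(f"High-income share in sample: {sample_high_income_share:.3f}")
print(f"True Republican share:       {true_share:.3f}")
print(f"Quota-sample estimate:       {quota_estimate:.3f}")
```

The sample matches the population on income by construction, yet under these assumptions it still overstates Republican support, because the quotas constrain who is counted in each class but not how interviewers choose people within a class.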

These examples highlight the importance of understanding sampling bias as much as possible during the data collection process. Both the Literary Digest and the Gallup Poll made the mistake of assuming their methods were unbiased when their sampling schemes were based on human judgement all along.

We now rely on probability sampling, a family of sampling methods that assigns precise probabilities to the appearance of each sample, to reduce bias as much as possible in our data collection process.
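As a small illustration of what "assigning precise probabilities to each sample" can mean, the sketch below enumerates every possible sample for one member of that family, the simple random sample without replacement. The four-person population and the sample size of two are toy assumptions chosen so that every sample can be listed.

```python
from fractions import Fraction
from itertools import combinations

# Toy population (hypothetical labels) and sample size.
population = ["A", "B", "C", "D"]
n = 2

# In a simple random sample without replacement, every subset of size n
# is equally likely.
samples = list(combinations(population, n))
sample_probs = {s: Fraction(1, len(samples)) for s in samples}

# The chance that a given individual appears is the total probability
# of all samples that contain them.
inclusion = {
    person: sum(prob for s, prob in sample_probs.items() if person in s)
    for person in population
}

for s, prob in sample_probs.items():
    print(s, prob)
print(inclusion)
```

Because each sample's probability is known exactly, the chance that any individual is included can be computed rather than guessed, which is what the judgement-based schemes above could not offer.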

Big Data?

In the age of Big Data, we are tempted to deal with bias by collecting more data. After all, we know that a census will give us perfect estimates; shouldn't a very large sample give almost perfect estimates regardless of the sampling technique?

We will return to this question after discussing probability sampling methods to compare the two approaches.