Open on DataHub
# HIDDEN
# Clear previously defined variables
%reset -f

# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/09'))
# HIDDEN
import warnings
# Ignore numpy dtype warnings. These warnings are caused by an interaction
# between numpy and Cython and can be safely ignored.
# Reference: https://stackoverflow.com/a/40846742
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
import nbinteract as nbi

sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.options.display.max_rows = 7
pd.options.display.max_columns = 8
pd.set_option('precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)
# HIDDEN
# Make names table
sql_expr = """
CREATE TABLE names(
    cat_id INTEGER PRIMARY KEY,
    name TEXT);
"""
result = sqlite_engine.execute(sql_expr)

# Populate names table
sql_expr = """
INSERT INTO names VALUES 
(0, "Apricot"),
(1, "Boots"),
(2, "Cally"),
(4, "Eugene");
"""

result = sqlite_engine.execute(sql_expr)
# HIDDEN
# Make colors table
sql_expr = """
CREATE TABLE colors(
    cat_id INTEGER PRIMARY KEY,
    color TEXT);
"""
result = sqlite_engine.execute(sql_expr)

# Populate colors table
sql_expr = """
INSERT INTO colors VALUES 
(0, "orange"),
(1, "black"),
(2, "calico"),
(3, "white");
"""

result = sqlite_engine.execute(sql_expr)
# HIDDEN
# Make ages table
sql_expr = """
CREATE TABLE ages(
    cat_id INTEGER PRIMARY KEY,
    age INT);
"""
result = sqlite_engine.execute(sql_expr)

# Populate ages table
sql_expr = """
INSERT INTO ages VALUES 
(0, 4),
(1, 3),
(2, 9),
(4, 20);
"""

result = sqlite_engine.execute(sql_expr)

SQL Joins

In pandas we use the pd.merge method to join two tables using matching values in their columns. For example:

pd.merge(table1, table2, on='common_column')

In this section, we introduce SQL joins. SQL joins are used to combine multiple tables in a relational database.

Suppose we are cat store owners with a database for the cats we have in our store. We have two different tables: names and colors. The names table contains the columns cat_id, a unique number assigned to each cat, and name, the name for the cat. The colors table contains the columns cat_id and color, the color of each cat.

Note that there are some missing rows from both tables - a row with cat_id 3 is missing from the names table, and a row with cat_id 4 is missing from the colors table.

names
cat_id name
0 Apricot
1 Boots
2 Cally
4 Eugene
colors
cat_id color
0 orange
1 black
2 calico
3 white

To compute the color of the cat named Apricot, we have to use information in both tables. We can join the tables on the cat_id column, creating a new table with both name and color.

Joins

A join combines tables by matching values in their columns.

There are four main types of joins: inner joins, outer joins, left joins, and right joins. Although all four combine tables, each one treats non-matching values differently.

Inner Join

Definition: In an inner join, the final table only contains rows that have matching columns in both tables.

Inner Join

Example: We would like to join the names and colors tables together to match each cat with its color. Since both tables contain a cat_id column that is the unique identifier for a cat, we can use an inner join on the cat_id column.

SQL: To write an inner join in SQL we modify our FROM clause to use the following syntax:

SELECT ...
FROM <TABLE_1>
    INNER JOIN <TABLE_2>
    ON <...>

For example:

SELECT *
FROM names AS N
    INNER JOIN colors AS C
    ON N.cat_id = C.cat_id;
cat_id name cat_id color
0 0 Apricot 0 orange
1 1 Boots 1 black
2 2 Cally 2 calico

You may verify that each cat name is matched with its color. Notice that the cats with cat_id 3 and 4 are not present in our resulting table because the colors table doesn't have a row with cat_id 4 and the names table doesn't have a row with cat_id 3. In an inner join, if a row doesn't have a matching value in the other table, the row is not included in the final result.

Assuming we have a DataFrame called names and a DataFrame called colors, we can conduct an inner join in pandas by writing:

pd.merge(names, colors, how='inner', on='cat_id')

Full/Outer Join

Definition: In a full join (sometimes called an outer join), all records from both tables are included in the joined table. If a row doesn't have a match in the other table, the missing values are filled in with NULL.

Full outer join

Example: As before, we join the names and colors tables together to match each cat with its color. This time, we want to keep all rows in either table even if there isn't a match.

SQL: To write an outer join in SQL we modify our FROM clause to use the following syntax:

SELECT ...
FROM <TABLE_1>
    FULL JOIN <TABLE_2>
    ON <...>

For example:

SELECT name, color
FROM names N
    FULL JOIN colors C
    ON N.cat_id = C.cat_id;
cat_id name color
0 Apricot orange
1 Boots black
2 Cally calico
3 NULL white
4 Eugene NULL

Notice that the final output contains the entries with cat_id 3 and 4. If a row does not have a match, it is still included in the final output and any missing values are filled in with NULL.

In pandas:

pd.merge(names, colors, how='outer', on='cat_id')

Left Join

Definition: In a left join, all records from the left table are included in the joined table. If a row doesn't have a match in the right table, the missing values are filled in with NULL.

left join

Example: As before, we join the names and colors tables together to match each cat with its color. This time, we want to keep all the cat names even if a cat doesn't have a matching color.

SQL: To write an left join in SQL we modify our FROM clause to use the following syntax:

SELECT ...
FROM <TABLE_1>
    LEFT JOIN <TABLE_2>
    ON <...>

For example:

SELECT name, color
FROM names N
    LEFT JOIN colors C
    ON N.cat_id = C.cat_id;
cat_id name color
0 Apricot orange
1 Boots black
2 Cally calico
4 Eugene NULL

Notice that the final output includes all four cat names. Three of the cat_ids in the names relation had matching cat_ids in the colors table and one did not (Eugene). The cat name that did not have a matching color has NULL as its color.

In pandas:

pd.merge(names, colors, how='left', on='cat_id')

Right Join

Definition: In a right join, all records from the right table are included in the joined table. If a row doesn't have a match in the left table, the missing values are filled in with NULL.

right join

Example: As before, we join the names and colors tables together to match each cat with its color. This time, we want to keep all the cat color even if a cat doesn't have a matching name.

SQL: To write a right join in SQL we modify our FROM clause to use the following syntax:

SELECT ...
FROM <TABLE_1>
    RIGHT JOIN <TABLE_2>
    ON <...>

For example:

SELECT name, color
FROM names N
    RIGHT JOIN colors C
    ON N.cat_id = C.cat_id;
cat_id name color
0 Apricot orange
1 Boots black
2 Cally calico
3 NULL white

This time, observe that the final output includes all four cat colors. Three of the cat_ids in the colors relation had matching cat_ids in the names table and one did not (white). The cat color that did not have a matching name has NULL as its name.

You may also notice that a right join produces the same result a left join with the table order swapped. That is, names left joined with colors is the same as colors right joined with names. Because of this, some SQL engines (such as SQLite) do not support right joins.

In pandas:

pd.merge(names, colors, how='right', on='cat_id')

Implicit Inner Joins

There are typically multiple ways to accomplish the same task in SQL just as there are multiple ways to accomplish the same task in Python. We point out one other method for writing an inner join that appears in practice called an implicit join. Recall that we previously wrote the following to conduct an inner join:

SELECT *
FROM names AS N
    INNER JOIN colors AS C
    ON N.cat_id = C.cat_id;

An implicit inner join has a slightly different syntax. Notice in particular that the FROM clause uses a comma to select from two tables and that the query includes a WHERE clause to specify the join condition.

SELECT *
FROM names AS N, colors AS C
WHERE N.cat_id = C.cat_id;

When multiple tables are specified in the FROM clause, SQL creates a table containing every combination of rows from each table. For example:

sql_expr = """
SELECT *
FROM names N, colors C
"""
pd.read_sql(sql_expr, sqlite_engine)
cat_id name cat_id color
0 0 Apricot 0 orange
1 0 Apricot 1 black
2 0 Apricot 2 calico
3 0 Apricot 3 white
4 1 Boots 0 orange
5 1 Boots 1 black
6 1 Boots 2 calico
7 1 Boots 3 white
8 2 Cally 0 orange
9 2 Cally 1 black
10 2 Cally 2 calico
11 2 Cally 3 white
12 4 Eugene 0 orange
13 4 Eugene 1 black
14 4 Eugene 2 calico
15 4 Eugene 3 white

This operation is often called a Cartesian product: each row in the first table is paired with every row in the second table. Notice that many rows contain cat colors that are not matched properly with their names. The additional WHERE clause in the implicit join filters out rows that do not have matching cat_id values.

SELECT *
FROM names AS N, colors AS C
WHERE N.cat_id = C.cat_id;
cat_id name cat_id color
0 0 Apricot 0 orange
1 1 Boots 1 black
2 2 Cally 2 calico

Joining Multiple Tables

To join multiple tables, extend the FROM clause with additional JOIN operators. For example, the following table ages includes data about each cat's age.

cat_id age
0 4
1 3
2 9
4 20

To conduct an inner join on the names, colors, and ages table, we write:

# Joining three tables

sql_expr = """
SELECT name, color, age
    FROM names n
    INNER JOIN colors c ON n.cat_id = c.cat_id
    INNER JOIN ages a ON n.cat_id = a.cat_id;
"""
pd.read_sql(sql_expr, sqlite_engine)
name color age
0 Apricot orange 4
1 Boots black 3
2 Cally calico 9

Summary

We have covered the four main types of SQL joins: inner, full, left, and right joins. We use all four joins to combine information in separate relations, and each join differs only in how it handles non-matching rows in the input tables.