# HIDDEN
# Clear previously defined variables
%reset -f
# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/09'))
# HIDDEN
import warnings
# Ignore numpy dtype warnings. These warnings are caused by an interaction
# between numpy and Cython and can be safely ignored.
# Reference: https://stackoverflow.com/a/40846742
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
import nbinteract as nbi
sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.options.display.max_rows = 7
pd.options.display.max_columns = 8
pd.set_option('precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)
# HIDDEN
# Make names table
sql_expr = """
CREATE TABLE names(
cat_id INTEGER PRIMARY KEY,
name TEXT);
"""
result = sqlite_engine.execute(sql_expr)
# Populate names table
sql_expr = """
INSERT INTO names VALUES
(0, "Apricot"),
(1, "Boots"),
(2, "Cally"),
(4, "Eugene");
"""
result = sqlite_engine.execute(sql_expr)
# HIDDEN
# Make colors table
sql_expr = """
CREATE TABLE colors(
cat_id INTEGER PRIMARY KEY,
color TEXT);
"""
result = sqlite_engine.execute(sql_expr)
# Populate colors table
sql_expr = """
INSERT INTO colors VALUES
(0, "orange"),
(1, "black"),
(2, "calico"),
(3, "white");
"""
result = sqlite_engine.execute(sql_expr)
# HIDDEN
# Make ages table
sql_expr = """
CREATE TABLE ages(
cat_id INTEGER PRIMARY KEY,
age INT);
"""
result = sqlite_engine.execute(sql_expr)
# Populate ages table
sql_expr = """
INSERT INTO ages VALUES
(0, 4),
(1, 3),
(2, 9),
(4, 20);
"""
result = sqlite_engine.execute(sql_expr)
SQL Joins¶
In pandas
we use the pd.merge
method to join two tables using matching values in their columns. For example:
pd.merge(table1, table2, on='common_column')
In this section, we introduce SQL joins. SQL joins are used to combine multiple tables in a relational database.
Suppose we are cat store owners with a database for the cats we have in our store. We have two different tables: names
and colors
. The names
table contains the columns cat_id
, a unique number assigned to each cat, and name
, the name for the cat. The colors
table contains the columns cat_id
and color
, the color of each cat.
Note that there are some missing rows from both tables - a row with cat_id
3 is missing from the names
table, and a row with cat_id
4 is missing from the colors
table.
cat_id | name |
---|---|
0 | Apricot |
1 | Boots |
2 | Cally |
4 | Eugene |
cat_id | color |
---|---|
0 | orange |
1 | black |
2 | calico |
3 | white |
To compute the color of the cat named Apricot, we have to use information in both tables. We can join the tables on the cat_id
column, creating a new table with both name
and color
.
Joins¶
A join combines tables by matching values in their columns.
There are four main types of joins: inner joins, outer joins, left joins, and right joins. Although all four combine tables, each one treats non-matching values differently.
Inner Join¶
Definition: In an inner join, the final table only contains rows that have matching columns in both tables.
Example: We would like to join the names
and colors
tables together to match each cat with its color. Since both tables contain a cat_id
column that is the unique identifier for a cat, we can use an inner join on the cat_id
column.
SQL: To write an inner join in SQL we modify our FROM
clause to use the following syntax:
SELECT ...
FROM <TABLE_1>
INNER JOIN <TABLE_2>
ON <...>
For example:
SELECT *
FROM names AS N
INNER JOIN colors AS C
ON N.cat_id = C.cat_id;
cat_id | name | cat_id | color | |
---|---|---|---|---|
0 | 0 | Apricot | 0 | orange |
1 | 1 | Boots | 1 | black |
2 | 2 | Cally | 2 | calico |
You may verify that each cat name is matched with its color. Notice that the cats with cat_id
3 and 4 are not present in our resulting table because the colors
table doesn't have a row with cat_id
4 and the names
table doesn't have a row with cat_id
3. In an inner join, if a row doesn't have a matching value in the other table, the row is not included in the final result.
Assuming we have a DataFrame called names
and a DataFrame called colors
, we can conduct an inner join in pandas
by writing:
pd.merge(names, colors, how='inner', on='cat_id')
Full/Outer Join¶
Definition: In a full join (sometimes called an outer join), all records from both tables are included in the joined table. If a row doesn't have a match in the other table, the missing values are filled in with NULL
.
Example: As before, we join the names
and colors
tables together to match each cat with its color. This time, we want to keep all rows in either table even if there isn't a match.
SQL: To write an outer join in SQL we modify our FROM
clause to use the following syntax:
SELECT ...
FROM <TABLE_1>
FULL JOIN <TABLE_2>
ON <...>
For example:
SELECT name, color
FROM names N
FULL JOIN colors C
ON N.cat_id = C.cat_id;
cat_id | name | color |
---|---|---|
0 | Apricot | orange |
1 | Boots | black |
2 | Cally | calico |
3 | NULL | white |
4 | Eugene | NULL |
Notice that the final output contains the entries with cat_id
3 and 4. If a row does not have a match, it is still included in the final output and any missing values are filled in with NULL
.
In pandas
:
pd.merge(names, colors, how='outer', on='cat_id')
Left Join¶
Definition: In a left join, all records from the left table are included in the joined table. If a row doesn't have a match in the right table, the missing values are filled in with NULL
.
Example: As before, we join the names
and colors
tables together to match each cat with its color. This time, we want to keep all the cat names even if a cat doesn't have a matching color.
SQL: To write an left join in SQL we modify our FROM
clause to use the following syntax:
SELECT ...
FROM <TABLE_1>
LEFT JOIN <TABLE_2>
ON <...>
For example:
SELECT name, color
FROM names N
LEFT JOIN colors C
ON N.cat_id = C.cat_id;
cat_id | name | color |
---|---|---|
0 | Apricot | orange |
1 | Boots | black |
2 | Cally | calico |
4 | Eugene | NULL |
Notice that the final output includes all four cat names. Three of the cat_id
s in the names
relation had matching cat_id
s in the colors
table and one did not (Eugene). The cat name that did not have a matching color has NULL
as its color.
In pandas
:
pd.merge(names, colors, how='left', on='cat_id')
Right Join¶
Definition: In a right join, all records from the right table are included in the joined table. If a row doesn't have a match in the left table, the missing values are filled in with NULL
.
Example: As before, we join the names
and colors
tables together to match each cat with its color. This time, we want to keep all the cat color even if a cat doesn't have a matching name.
SQL: To write a right join in SQL we modify our FROM
clause to use the following syntax:
SELECT ...
FROM <TABLE_1>
RIGHT JOIN <TABLE_2>
ON <...>
For example:
SELECT name, color
FROM names N
RIGHT JOIN colors C
ON N.cat_id = C.cat_id;
cat_id | name | color |
---|---|---|
0 | Apricot | orange |
1 | Boots | black |
2 | Cally | calico |
3 | NULL | white |
This time, observe that the final output includes all four cat colors. Three of the cat_id
s in the colors
relation had matching cat_id
s in the names
table and one did not (white). The cat color that did not have a matching name has NULL
as its name.
You may also notice that a right join produces the same result a left join with the table order swapped. That is, names
left joined with colors
is the same as colors
right joined with names
. Because of this, some SQL engines (such as SQLite) do not support right joins.
In pandas
:
pd.merge(names, colors, how='right', on='cat_id')
Implicit Inner Joins¶
There are typically multiple ways to accomplish the same task in SQL just as there are multiple ways to accomplish the same task in Python. We point out one other method for writing an inner join that appears in practice called an implicit join. Recall that we previously wrote the following to conduct an inner join:
SELECT *
FROM names AS N
INNER JOIN colors AS C
ON N.cat_id = C.cat_id;
An implicit inner join has a slightly different syntax. Notice in particular that the FROM
clause uses a comma to select from two tables and that the query includes a WHERE
clause to specify the join condition.
SELECT *
FROM names AS N, colors AS C
WHERE N.cat_id = C.cat_id;
When multiple tables are specified in the FROM
clause, SQL creates a table containing every combination of rows from each table. For example:
sql_expr = """
SELECT *
FROM names N, colors C
"""
pd.read_sql(sql_expr, sqlite_engine)
This operation is often called a Cartesian product: each row in the first table is paired with every row in the second table. Notice that many rows contain cat colors that are not matched properly with their names. The additional WHERE
clause in the implicit join filters out rows that do not have matching cat_id
values.
SELECT *
FROM names AS N, colors AS C
WHERE N.cat_id = C.cat_id;
cat_id | name | cat_id | color | |
---|---|---|---|---|
0 | 0 | Apricot | 0 | orange |
1 | 1 | Boots | 1 | black |
2 | 2 | Cally | 2 | calico |
Joining Multiple Tables¶
To join multiple tables, extend the FROM
clause with additional JOIN
operators. For example, the following table ages
includes data about each cat's age.
cat_id | age |
---|---|
0 | 4 |
1 | 3 |
2 | 9 |
4 | 20 |
To conduct an inner join on the names
, colors
, and ages
table, we write:
# Joining three tables
sql_expr = """
SELECT name, color, age
FROM names n
INNER JOIN colors c ON n.cat_id = c.cat_id
INNER JOIN ages a ON n.cat_id = a.cat_id;
"""
pd.read_sql(sql_expr, sqlite_engine)
Summary¶
We have covered the four main types of SQL joins: inner, full, left, and right joins. We use all four joins to combine information in separate relations, and each join differs only in how it handles non-matching rows in the input tables.