Open on DataHub
# HIDDEN
# Clear previously defined variables
%reset -f

# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/09'))

The Relational Model

A database is an organized collection of data. In the past, data was stored in specialized data structures that were designed for specific tasks. For example, airlines might record flight bookings in a different format than a bank managing an account ledger. In 1969, Ted Codd introduced the relational model as a general method of storing data. Data is stored in two-dimensional tables called relations, consisting of individual observations in each row (commonly referred to as tuples). Each tuple is a structured data item that represents the relationship between certain attributes (columns). Each attribute of a relation has a name and data type.

Consider the purchases relation below:

purchases

name product retailer date purchased
Samantha iPod Best Buy June 3, 2016
Timothy Chromebook Amazon July 8, 2016
Jason Surface Pro Target October 2, 2016

In purchases, each tuple represents the relationship between the name, product, retailer, and date purchased attributes.

A relation's schema contains its column names, data types, and constraints. For example, the schema of the purchases table states that the columns are name, product, retailer, and date purchased; it also states that each column contains text.

The following prices relation shows the price of certain gadgets at a few retail stores:

prices

retailer product price
Best Buy Galaxy S9 719.00
Best Buy iPod 200.00
Amazon iPad 450.00
Amazon Battery pack 24.87
Amazon Chromebook 249.99
Target iPod 215.00
Target Surface Pro 799.00
Target Google Pixel 2 659.00
Walmart Chromebook 238.79

We can then reference both tables simultaneously to determine how much Samantha, Timothy, and Jason paid for their respective gadgets (assuming prices at each store stay constant over time). Together, the two tables form a relational database, which is a collection of one or more relations. The schema of the entire database is the set of schemas of the individual relations in the database.

Relational Database Management Systems

A relational database can be simply described as a set of tables containing rows of individual data entries. A relational database management system (RDBMSs) provides an interface to a relational database. Oracle, MySQL, and PostgreSQL are three of the most commonly used RDBMSs used in practice today.

Relational database management systems give users the ability to add, edit, and remove data from databases. These systems provide several key benefits over using a collection of text files to store data, including:

  1. Reliable data storage: RDBMSs protect against data corruption from system failures or crashes.
  2. Performance: RDBMSs often store data more efficiently than text files and have well-developed algorithms for querying data.
  3. Data management: RDBMSs implement access control, preventing unauthorized users from accessing sensitive datasets.
  4. Data consistency: RDBMSs can impose constraints on the data entered—for example, that a column GPA only contains floats between 0.0 and 4.0.

To work with data stored in a RDBMS, we use the SQL programming language.

RDBMS vs. pandas

How do RDBMSs and the pandas Python package differ? First, pandas is not concerned about data storage. Although DataFrames can read and write from multiple data formats, pandas does not dictate how the data are actually stored on the underlying computer like a RDBMS does. Second, pandas primarily provides methods for manipulating data while RDBMSs handle both data storage and data manipulation, making them more suitable for larger datasets. A typical rule of thumb is to use a RDBMS for datasets larger than several gigabytes. Finally, pandas requires knowledge of Python in order to use, whereas RDBMSs require knowledge of SQL. Since SQL is simpler to learn than Python, RDBMSs allow less technical users to store and query data, a handy trait.