Open on DataHub

# HIDDEN
# Clear previously defined variables
%reset -f

# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/09'))

The Relational Model¶

A database is an organized collection of data. In the past, data was stored in specialized data structures that were designed for specific tasks. For example, airlines might record flight bookings in a different format than a bank managing an account ledger. In 1969, Ted Codd introduced the relational model as a general method of storing data. Data is stored in two-dimensional tables called relations, consisting of individual observations in each row (commonly referred to as tuples). Each tuple is a structured data item that represents the relationship between certain attributes (columns). Each attribute of a relation has a name and data type.

Consider the purchases relation below:

purchases

name	product	retailer	date purchased
Samantha	iPod	Best Buy	June 3, 2016
Timothy	Chromebook	Amazon	July 8, 2016
Jason	Surface Pro	Target	October 2, 2016

In purchases, each tuple represents the relationship between the name, product, retailer, and date purchased attributes.

A relation's schema contains its column names, data types, and constraints. For example, the schema of the purchases table states that the columns are name, product, retailer, and date purchased; it also states that each column contains text.

The following prices relation shows the price of certain gadgets at a few retail stores:

prices

retailer	product	price
Best Buy	Galaxy S9	719.00
Best Buy	iPod	200.00
Amazon	iPad	450.00
Amazon	Battery pack	24.87
Amazon	Chromebook	249.99
Target	iPod	215.00
Target	Surface Pro	799.00
Target	Google Pixel 2	659.00
Walmart	Chromebook	238.79

We can then reference both tables simultaneously to determine how much Samantha, Timothy, and Jason paid for their respective gadgets (assuming prices at each store stay constant over time). Together, the two tables form a relational database, which is a collection of one or more relations. The schema of the entire database is the set of schemas of the individual relations in the database.

Relational Database Management Systems¶

A relational database can be simply described as a set of tables containing rows of individual data entries. A relational database management system (RDBMSs) provides an interface to a relational database. Oracle, MySQL, and PostgreSQL are three of the most commonly used RDBMSs used in practice today.

Relational database management systems give users the ability to add, edit, and remove data from databases. These systems provide several key benefits over using a collection of text files to store data, including:

Reliable data storage: RDBMSs protect against data corruption from system failures or crashes.
Performance: RDBMSs often store data more efficiently than text files and have well-developed algorithms for querying data.
Data management: RDBMSs implement access control, preventing unauthorized users from accessing sensitive datasets.
Data consistency: RDBMSs can impose constraints on the data entered—for example, that a column GPA only contains floats between 0.0 and 4.0.

To work with data stored in a RDBMS, we use the SQL programming language.

RDBMS vs. pandas¶

How do RDBMSs and the pandas Python package differ? First, pandas is not concerned about data storage. Although DataFrames can read and write from multiple data formats, pandas does not dictate how the data are actually stored on the underlying computer like a RDBMS does. Second, pandas primarily provides methods for manipulating data while RDBMSs handle both data storage and data manipulation, making them more suitable for larger datasets. A typical rule of thumb is to use a RDBMS for datasets larger than several gigabytes. Finally, pandas requires knowledge of Python in order to use, whereas RDBMSs require knowledge of SQL. Since SQL is simpler to learn than Python, RDBMSs allow less technical users to store and query data, a handy trait.