# HIDDEN
# Clear previously defined variables
%reset -f
# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/09'))
The Relational Model¶
A database is an organized collection of data. In the past, data was stored in specialized data structures that were designed for specific tasks. For example, airlines might record flight bookings in a different format than a bank managing an account ledger. In 1969, Ted Codd introduced the relational model as a general method of storing data. Data is stored in two-dimensional tables called relations, consisting of individual observations in each row (commonly referred to as tuples). Each tuple is a structured data item that represents the relationship between certain attributes (columns). Each attribute of a relation has a name and data type.
Consider the purchases
relation below:
name | product | retailer | date purchased |
Samantha | iPod | Best Buy | June 3, 2016 |
Timothy | Chromebook | Amazon | July 8, 2016 |
Jason | Surface Pro | Target | October 2, 2016 |
In purchases
, each tuple represents the relationship between the name
, product
, retailer
, and date purchased
attributes.
A relation's schema contains its column names, data types, and constraints. For example, the schema of the purchases
table states that the columns are name
, product
, retailer
, and date purchased
; it also states that each column contains text.
The following prices
relation shows the price of certain gadgets at a few retail stores:
retailer | product | price |
Best Buy | Galaxy S9 | 719.00 |
Best Buy | iPod | 200.00 |
Amazon | iPad | 450.00 |
Amazon | Battery pack | 24.87 |
Amazon | Chromebook | 249.99 |
Target | iPod | 215.00 |
Target | Surface Pro | 799.00 |
Target | Google Pixel 2 | 659.00 |
Walmart | Chromebook | 238.79 |
We can then reference both tables simultaneously to determine how much Samantha, Timothy, and Jason paid for their respective gadgets (assuming prices at each store stay constant over time). Together, the two tables form a relational database, which is a collection of one or more relations. The schema of the entire database is the set of schemas of the individual relations in the database.
Relational Database Management Systems¶
A relational database can be simply described as a set of tables containing rows of individual data entries. A relational database management system (RDBMSs) provides an interface to a relational database. Oracle, MySQL, and PostgreSQL are three of the most commonly used RDBMSs used in practice today.
Relational database management systems give users the ability to add, edit, and remove data from databases. These systems provide several key benefits over using a collection of text files to store data, including:
- Reliable data storage: RDBMSs protect against data corruption from system failures or crashes.
- Performance: RDBMSs often store data more efficiently than text files and have well-developed algorithms for querying data.
- Data management: RDBMSs implement access control, preventing unauthorized users from accessing sensitive datasets.
- Data consistency: RDBMSs can impose constraints on the data entered—for example, that a column
GPA
only contains floats between 0.0 and 4.0.
To work with data stored in a RDBMS, we use the SQL programming language.
RDBMS vs. pandas¶
How do RDBMSs and the pandas
Python package differ? First, pandas
is not concerned about data storage. Although DataFrames can read and write from multiple data formats, pandas
does not dictate how the data are actually stored on the underlying computer like a RDBMS does. Second, pandas
primarily provides methods for manipulating data while RDBMSs handle both data storage and data manipulation, making them more suitable for larger datasets. A typical rule of thumb is to use a RDBMS for datasets larger than several gigabytes. Finally, pandas
requires knowledge of Python in order to use, whereas RDBMSs require knowledge of SQL. Since SQL is simpler to learn than Python, RDBMSs allow less technical users to store and query data, a handy trait.