# HIDDEN
# Clear previously defined variables
%reset -f

# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/13'))
# HIDDEN
import warnings
# Ignore numpy dtype warnings. These warnings are caused by an interaction
# between numpy and Cython and can be safely ignored.
# Reference: https://stackoverflow.com/a/40846742
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
import nbinteract as nbi

sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.options.display.max_rows = 7
pd.options.display.max_columns = 8
pd.set_option('precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)
# HIDDEN
from scipy.optimize import minimize as sci_min

def minimize(cost_fn, grad_cost_fn, X, y, progress=True):
    '''
    Uses scipy.minimize to minimize cost_fn using a form of gradient descent.
    '''
    theta = np.zeros(X.shape[1])
    iters = 0
    
    def objective(theta):
        return cost_fn(theta, X, y)
    def gradient(theta):
        return grad_cost_fn(theta, X, y)
    def print_theta(theta):
        nonlocal iters
        if progress and iters % progress == 0:
            print(f'theta: {theta} | cost: {cost_fn(theta, X, y):.2f}')
        iters += 1
        
    print_theta(theta)
    return sci_min(
        objective, theta, method='BFGS', jac=gradient, callback=print_theta,
        tol=1e-7
    ).x

Linear Regression Case Study

In this section, we perform an end-to-end case study of applying the linear regression model to a dataset. The dataset we will be working with contains various physical attributes of donkeys, such as length and girth.

Our task is to predict a donkey's weight using linear regression.

Preliminary Data Overview

We will begin by reading in the dataset and taking a quick peek at its contents.

donkeys = pd.read_csv("donkeys.csv")
donkeys.head()
BCS Age Sex ... Height Weight WeightAlt
0 3.0 <2 stallion ... 90 77 NaN
1 2.5 <2 stallion ... 94 100 NaN
2 1.5 <2 stallion ... 95 74 NaN
3 3.0 <2 female ... 96 116 NaN
4 2.5 <2 female ... 91 91 NaN

5 rows × 8 columns

It's always a good idea to check how much data we have by looking at the dimensions of the dataset. If we have a large number of observations, printing out the entire dataframe may crash our notebook.

donkeys.shape
(544, 8)

The dataset is relatively small, with only 544 rows of observations and 8 columns. Let's look at what columns are available to us.

donkeys.columns.values
array(['BCS', 'Age', 'Sex', 'Length', 'Girth', 'Height', 'Weight',
       'WeightAlt'], dtype=object)

A good understanding of our data can guide our analysis, so we should understand what each of these columns represents. A few of these columns are self-explanatory, but others require a little more explanation:

  • BCS: Body Condition Score (a physical health rating)
  • Girth: the measurement around the middle of the donkey
  • WeightAlt: the second weighing (31 donkeys in our data were weighed twice in order to check the accuracy of the scale)

It is also a good idea to determine which variables are quantitative and which are categorical.

Quantitative: Length, Girth, Height, Weight, WeightAlt

Categorical: BCS, Age, Sex
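Since pandas infers a type for each column, a quick look at the dtypes can confirm this split (a minimal sketch reusing the donkeys frame loaded above). Note that BCS is stored as a number even though we will treat it as categorical.

# Check the types pandas inferred for each column. BCS appears as a float
# even though we treat it as a categorical variable.
donkeys.dtypes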

Data Cleaning

In this section, we will check the data for any abnormalities that we have to deal with.

To check that the scale is accurate, we can take the difference between the two weighings for the donkeys that were weighed twice and plot the distribution of those differences.

difference = donkeys['WeightAlt'] - donkeys['Weight']
sns.distplot(difference.dropna());

The measurements are all within 1 kg of each other, which seems reasonable.
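To back up this visual impression with numbers, we can summarize the differences directly (a minimal sketch reusing the difference series above; output not shown here).

# Summary statistics of the differences between the two weighings
difference.dropna().describe()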

Next, we can look for unusual values that might indicate errors or other problems. We can use the quantile function in order to detect anomalous values.

donkeys.quantile([0.005, 0.995])
BCS Length Girth Height Weight WeightAlt
0.005 1.5 71.145 90.000 89.0 71.715 98.75
0.995 4.0 111.000 131.285 112.0 214.000 192.80

For each of these numerical columns, we can look at which rows fall outside of these quantiles and what values they take on. Keep in mind that we want our model to apply only to healthy, mature donkeys.

First, let's look at the BCS column.

donkeys[(donkeys['BCS'] < 1.5) | (donkeys['BCS'] > 4)]['BCS']
291    4.5
445    1.0
Name: BCS, dtype: float64

Let's also look at a histogram of BCS:

plt.hist(donkeys['BCS'], density=True)
plt.xlabel('BCS');

Since BCS is an indication of a donkey's health, a BCS of 1 represents an extremely emaciated donkey and a BCS of 4.5 an overweight one. The histogram also shows that only two donkeys have such outlying BCS values, so we remove these two donkeys.


Now, let's look at Length, Girth, and Height.

donkeys[(donkeys['Length'] < 71.145) | (donkeys['Length'] > 111)]['Length']
8       46
22      68
26      69
216    112
Name: Length, dtype: int64
donkeys[(donkeys['Girth'] < 90) | (donkeys['Girth'] > 131.285)]['Girth']
8       66
239    132
283    134
523    134
Name: Girth, dtype: int64
donkeys[(donkeys['Height'] < 89) | (donkeys['Height'] > 112)]['Height']
8       71
22      86
244    113
523    116
Name: Height, dtype: int64

For these three columns, the donkey in row 8 has values far below the cut-offs, while the other anomalous donkeys are close to the cut-offs and likely do not need to be removed.
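Before deciding, it can help to look at the donkey in row 8 as a whole rather than one column at a time (a minimal sketch using its index label).

# Inspect the full record for the donkey in row 8, which is anomalous in
# Length, Girth, and Height.
donkeys.loc[8]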


Finally, let's take a look at Weight.

donkeys[(donkeys['Weight'] < 71.715) | (donkeys['Weight'] > 214)]['Weight']
8       27
26      65
50      71
291    227
523    230
Name: Weight, dtype: int64

The first two and last two donkeys in the list are far from the cut-offs and most likely should be removed. The middle donkey can stay.


Since WeightAlt closely corresponds to Weight, we skip checking this column for anomalies. Summarizing what we have learned, here is how we want to filter our donkeys:

  • Keep donkeys with BCS between 1.5 and 4
  • Keep donkeys with Weight between 71 and 214
donkeys_c = donkeys[(donkeys['BCS'] >= 1.5) & (donkeys['BCS'] <= 4) &
                         (donkeys['Weight'] >= 71) & (donkeys['Weight'] <= 214)]

Train-Test Split

Before we proceed with our data analysis, we divide our data into an 80/20 split, using 80% of our data to train our model and setting aside the other 20% for evaluation of the model.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(donkeys_c.drop(['Weight'], axis=1),
                                                    donkeys_c['Weight'],
                                                    test_size=0.2,
                                                    random_state=42)
X_train.shape, X_test.shape
((431, 7), (108, 7))

Let's also create a function that evaluates our predictions on the test set. We will use squared error; since the test set is fixed, summing the squared errors over the test set ranks models the same way as taking their mean.

def mse_test_set(predictions):
    # Sum of squared errors between the predictions and the actual test weights
    return float(np.sum((predictions - y_test) ** 2))
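To give later error values some context, here is a hedged baseline (a sketch reusing y_train, y_test, and the function above): predict the mean training weight for every test donkey and see what error that produces.

# A "do-nothing" baseline: predict the mean training weight for every test
# donkey. Any useful model should beat this error.
baseline_preds = np.full(len(y_test), y_train.mean())
print("Baseline error:", mse_test_set(baseline_preds))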

Exploratory Data Analysis and Visualization

As usual, we will explore our data before attempting to fit a model to it.

First, we will examine the categorical variables with boxplots.

# HIDDEN
sns.boxplot(x=X_train['BCS'], y=y_train);

It seems like median weight increases with BCS, but not linearly.

# HIDDEN
sns.boxplot(x=X_train['Sex'], y=y_train,
            order = ['female', 'stallion', 'gelding']);

The sex of the donkey doesn't appear to make much of a difference in weight.

# HIDDEN
sns.boxplot(x=X_train['Age'], y=y_train, 
            order = ['<2', '2-5', '5-10', '10-15', '15-20', '>20']);

For donkeys over 5 years of age, the weight distributions look quite similar.

Now, let's look at the quantitative variables. We can plot each of them against the target variable.

# HIDDEN
X_train['Weight'] = y_train
sns.regplot('Length', 'Weight', X_train, fit_reg=False);
# HIDDEN
sns.regplot('Girth', 'Weight', X_train, fit_reg=False);
# HIDDEN
sns.regplot('Height', 'Weight', X_train, fit_reg=False);

All three quantitative features appear to have a linear relationship with our target variable Weight, so we will not need to transform our input data.

It is also a good idea to see whether our features are linearly related to each other. We plot two pairs below:

# HIDDEN
sns.regplot('Height', 'Length', X_train, fit_reg=False);
# HIDDEN
sns.regplot('Height', 'Girth', X_train, fit_reg=False);

From these plots, we can see that our predictor variables also have strong linear relationships with each other. This collinearity makes our model's coefficients harder to interpret, so we should keep it in mind after we create our model.
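A correlation matrix gives a quick numerical companion to these scatterplots (a minimal sketch; X_train still contains the Weight column we added above, and the output is not shown here).

# Pairwise correlations among the quantitative variables and Weight
X_train[['Length', 'Girth', 'Height', 'Weight']].corr()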

Simpler Linear Models

Rather than using all of our variables at once, let's try fitting linear models with just one or two variables first.

Below are three simple linear regression models, each using a single quantitative variable. Which model appears to be the best?

# HIDDEN
sns.regplot('Length', 'Weight', X_train, fit_reg=True);
# HIDDEN
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train[['Length']], X_train['Weight'])
predictions = model.predict(X_test[['Length']])
print("MSE:", mse_test_set(predictions))
MSE: 26052.58007702549
sns.regplot('Girth', 'Weight', X_train, fit_reg=True);
# HIDDEN
model = LinearRegression()
model.fit(X_train[['Girth']], X_train['Weight'])
predictions = model.predict(X_test[['Girth']])
print("MSE:", mse_test_set(predictions))
MSE: 13248.814105932383
sns.regplot('Height', 'Weight', X_train, fit_reg=True);
# HIDDEN
model = LinearRegression()
model.fit(X_train[['Height']], X_train['Weight'])
predictions = model.predict(X_test[['Height']])
print("MSE:", mse_test_set(predictions))
MSE: 36343.308584306134

Looking at the scatterplots and the mean squared errors, it seems like Girth is the best sole predictor of Weight as it has the strongest linear relationship with Weight and the smallest mean squared error.

Can we do better with two variables? Let's try fitting a linear model using both Girth and Length. Although this model is not as easy to visualize, we can still look at its MSE.

# HIDDEN
model = LinearRegression()
model.fit(X_train[['Girth', 'Length']], X_train['Weight'])
predictions = model.predict(X_test[['Girth', 'Length']])
print("MSE:", mse_test_set(predictions))
MSE: 9680.902423377258

Wow! Looks like our MSE went down from around 13000 with Girth alone to under 10000 with both Girth and Length. Including the second variable improved our model.

We can also use categorical variables in our model. Let's now look at a linear model using the categorical variable of Age. This is the plot of Age versus Weight:

# HIDDEN
sns.stripplot(x='Age', y='Weight', data=X_train, order=['<2', '2-5', '5-10', '10-15', '15-20', '>20']);

Since Age is a categorical variable, we need to introduce dummy variables in order to include it in a linear regression model.

# HIDDEN
just_age_and_weight = X_train[['Age', 'Weight']]
with_age_dummies = pd.get_dummies(just_age_and_weight, columns=['Age'])
model = LinearRegression()
model.fit(with_age_dummies.drop('Weight', axis=1), with_age_dummies['Weight'])

just_age_and_weight_test = X_test[['Age']]
with_age_dummies_test = pd.get_dummies(just_age_and_weight_test, columns=['Age'])
predictions = model.predict(with_age_dummies_test)
print("MSE:", mse_test_set(predictions))
MSE: 41398.515625

An MSE of around 40000 is worse than what we could get using any single one of the quantitative variables, but this variable could still prove to be useful in our linear model.

Let's try to interpret this linear model. Every donkey in a given age category, say 2-5 years of age, receives the same prediction because all such donkeys share the same input: a 1 in the column for 2-5 years of age and a 0 in every other column. In other words, a categorical variable simply changes the constant of the model for each group, producing one prediction for all donkeys within that group.
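To make this concrete, we can check that each group's prediction matches that group's mean weight in the training set (a minimal sketch reusing the variables defined in the cell above; the two printed sets of values should agree up to rounding).

# Mean training weight for each age group
print(just_age_and_weight.groupby('Age')['Weight'].mean())

# One representative dummy row per age group, labeled by its group name,
# and the model's prediction for each
reps = with_age_dummies.drop('Weight', axis=1).drop_duplicates()
labels = reps.idxmax(axis=1).str.replace('Age_', '')
print(pd.Series(model.predict(reps), index=labels).sort_index())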

Our next step is to create a final model using both our categorical variables and multiple quantitative variables.

Transforming Variables

Recall from our boxplots that Sex was not a useful variable, so we will drop it. We will also remove the WeightAlt column because we only have its value for 31 donkeys. Finally, using get_dummies, we transform the categorical variables BCS and Age into dummy variables so that we can include them in the model.

# HIDDEN
X_train.drop('Weight', axis=1, inplace=True)
# HIDDEN
pd.set_option('max_columns', 15)
X_train.drop(['Sex', 'WeightAlt'], axis=1, inplace=True)
X_train = pd.get_dummies(X_train, columns=['BCS', 'Age'])
X_train.head()
Length Girth Height BCS_1.5 BCS_2.0 BCS_2.5 BCS_3.0 BCS_3.5 BCS_4.0 Age_10-15 Age_15-20 Age_2-5 Age_5-10 Age_<2 Age_>20
465 98 113 99 0 0 0 1 0 0 0 0 1 0 0 0
233 101 119 101 0 0 0 1 0 0 1 0 0 0 0 0
450 106 125 103 0 0 1 0 0 0 1 0 0 0 0 0
453 93 120 100 0 0 1 0 0 0 0 0 1 0 0 0
452 98 120 108 0 0 1 0 0 0 0 0 0 1 0 0

Recall that we noticed that the weight distributions of donkeys over the age of 5 are not very different. Let's combine the columns Age_10-15, Age_15-20, and Age_>20 into a single Age_>10 column.

age_over_10 = X_train['Age_10-15'] | X_train['Age_15-20'] | X_train['Age_>20']
X_train['Age_>10'] = age_over_10
X_train.drop(['Age_10-15', 'Age_15-20', 'Age_>20'], axis=1, inplace=True)

Since we do not want our design matrix to be over-parameterized, we should drop one category each from the BCS and Age dummies; otherwise, the dummy columns for a variable would sum to one and be perfectly collinear with the bias column we add below.

X_train.drop(['BCS_3.0', 'Age_5-10'], axis=1, inplace=True)
X_train.head()
Length Girth Height BCS_1.5 BCS_2.0 BCS_2.5 BCS_3.5 BCS_4.0 Age_2-5 Age_<2 Age_>10
465 98 113 99 0 0 0 0 0 1 0 0
233 101 119 101 0 0 0 0 0 0 0 1
450 106 125 103 0 0 1 0 0 0 0 1
453 93 120 100 0 0 1 0 0 1 0 0
452 98 120 108 0 0 1 0 0 0 0 0

We should also add a bias column of all ones so that our model has a constant term.

X_train = X_train.assign(bias=1)
# HIDDEN
X_train = X_train.reindex(columns=['bias'] + list(X_train.columns[:-1]))
X_train.head()
bias Length Girth Height BCS_1.5 BCS_2.0 BCS_2.5 BCS_3.5 BCS_4.0 Age_2-5 Age_<2 Age_>10
465 1 98 113 99 0 0 0 0 0 1 0 0
233 1 101 119 101 0 0 0 0 0 0 0 1
450 1 106 125 103 0 0 1 0 0 0 0 1
453 1 93 120 100 0 0 1 0 0 1 0 0
452 1 98 120 108 0 0 1 0 0 0 0 0

Multiple Linear Regression Model

We are finally ready to fit our model to all of the variables we have deemed important and transformed into the proper form.

Our model looks like this:

$$ f_\theta (\textbf{x}) = \theta_0 + \theta_1 (\text{Length}) + \theta_2 (\text{Girth}) + \theta_3 (\text{Height}) + \ldots + \theta_{11} (\text{Age>10}) $$

Here are the functions we defined in the multiple linear regression lesson, which we will use again:

def linear_model(thetas, X):
    '''Returns predictions by a linear model on x_vals.'''
    return X @ thetas

def mse_cost(thetas, X, y):
    return np.mean((y - linear_model(thetas, X)) ** 2)

def grad_mse_cost(thetas, X, y):
    n = len(X)
    return -2 / n * (X.T @ y  - X.T @ X @ thetas)
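Before handing the gradient to an optimizer, it is cheap to verify it against a finite-difference estimate (a minimal sketch on a small random problem, not our donkey data; scipy's check_grad returns a value near zero when the two agree).

from scipy.optimize import check_grad

# Compare the analytical gradient with a numerical approximation on a
# small random problem; a result near zero means they agree.
rng = np.random.RandomState(42)
X_small, y_small = rng.rand(5, 3), rng.rand(5)
print(check_grad(lambda t: mse_cost(t, X_small, y_small),
                 lambda t: grad_mse_cost(t, X_small, y_small),
                 rng.rand(3)))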

In order to use the above functions, we need X and y, both of which we can obtain from our data frames. Remember that X and y have to be NumPy arrays in order to multiply them with the @ operator.

X_train = X_train.values
y_train = y_train.values

Now we just need to call the minimize function defined in a previous section.

thetas = minimize(mse_cost, grad_mse_cost, X_train, y_train)
theta: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] | cost: 23979.72
theta: [0.01 0.53 0.65 0.56 0.   0.   0.   0.   0.   0.   0.   0.  ] | cost: 1214.03
theta: [-0.07  1.84  2.55 -2.87 -0.02 -0.13 -0.34  0.19  0.07 -0.22 -0.3   0.43] | cost: 1002.46
theta: [-0.25 -0.76  4.81 -3.06 -0.08 -0.38 -1.11  0.61  0.24 -0.66 -0.93  1.27] | cost: 815.50
theta: [-0.44 -0.33  4.08 -2.7  -0.14 -0.61 -1.89  1.02  0.4  -1.06 -1.57  2.09] | cost: 491.91
theta: [-1.52  0.85  2.   -1.58 -0.52 -2.22 -5.63  3.29  1.42 -2.59 -5.14  5.54] | cost: 140.86
theta: [-2.25  0.9   1.72 -1.3  -0.82 -3.52 -7.25  4.64  2.16 -2.95 -7.32  6.61] | cost: 130.33
theta: [ -4.16   0.84   1.32  -0.78  -1.65  -7.09 -10.4    7.82   4.18  -3.44
 -12.61   8.24] | cost: 116.92
theta: [ -5.89   0.75   1.17  -0.5   -2.45 -10.36 -11.81  10.04   6.08  -3.6
 -16.65   8.45] | cost: 110.37
theta: [ -7.75   0.67   1.13  -0.35  -3.38 -13.76 -11.84  11.55   8.2   -3.8
 -20.     7.55] | cost: 105.74
theta: [ -9.41   0.64   1.15  -0.31  -4.26 -16.36 -10.81  11.97  10.12  -4.33
 -21.88   6.15] | cost: 102.82
theta: [-11.08   0.66   1.17  -0.32  -5.18 -18.28  -9.43  11.61  11.99  -5.37
 -22.77   4.69] | cost: 100.70
theta: [-12.59   0.69   1.16  -0.32  -6.02 -19.17  -8.53  10.86  13.54  -6.65
 -22.89   3.73] | cost: 99.34
theta: [-14.2    0.72   1.14  -0.3   -6.89 -19.35  -8.29  10.03  14.98  -7.99
 -22.74   3.14] | cost: 98.30
theta: [-16.14   0.73   1.11  -0.26  -7.94 -19.03  -8.65   9.3   16.47  -9.18
 -22.59   2.76] | cost: 97.35
theta: [-18.68   0.73   1.1   -0.21  -9.27 -18.29  -9.42   8.76  18.14 -10.04
 -22.55   2.39] | cost: 96.38
theta: [-21.93   0.72   1.1   -0.17 -10.94 -17.19 -10.25   8.5   19.92 -10.36
 -22.66   1.99] | cost: 95.35
theta: [-26.08   0.7    1.13  -0.14 -13.03 -15.78 -10.79   8.54  21.78 -10.05
 -22.83   1.59] | cost: 94.18
theta: [-31.35   0.69   1.17  -0.13 -15.59 -14.12 -10.69   8.9   23.61  -9.19
 -22.93   1.32] | cost: 92.84
theta: [-37.51   0.7    1.21  -0.13 -18.44 -12.47  -9.79   9.52  25.14  -8.06
 -22.78   1.38] | cost: 91.40
theta: [-43.57   0.72   1.23  -0.12 -21.06 -11.3   -8.4   10.2   25.98  -7.16
 -22.24   1.87] | cost: 90.06
theta: [-48.96   0.74   1.23  -0.1  -23.13 -10.82  -7.13  10.76  26.06  -6.79
 -21.34   2.6 ] | cost: 88.89
theta: [-54.87   0.76   1.22  -0.05 -25.11 -10.88  -6.25  11.22  25.55  -6.8
 -20.04   3.41] | cost: 87.62
theta: [-63.83   0.78   1.21   0.02 -27.82 -11.42  -5.83  11.68  24.36  -6.96
 -17.97   4.26] | cost: 85.79
theta: [-77.9    0.8    1.22   0.13 -31.81 -12.47  -6.17  12.03  22.29  -6.98
 -14.93   4.9 ] | cost: 83.19
theta: [-94.94   0.81   1.26   0.23 -36.3  -13.73  -7.37  11.98  19.65  -6.47
 -11.73   4.88] | cost: 80.40
theta: [-108.1     0.81    1.34    0.28  -39.34  -14.55   -8.72   11.32   17.48
   -5.47   -9.92    4.21] | cost: 78.34
theta: [-115.07    0.81    1.4     0.29  -40.38  -14.75   -9.46   10.3    16.16
   -4.47   -9.7     3.5 ] | cost: 77.07
theta: [-119.8     0.81    1.44    0.28  -40.43  -14.6    -9.61    9.02   15.09
   -3.67  -10.25    3.05] | cost: 76.03
theta: [-125.16    0.82    1.47    0.3   -40.01  -14.23   -9.3     7.48   13.79
   -3.14  -11.09    2.94] | cost: 74.96
theta: [-131.24    0.83    1.48    0.33  -39.39  -13.76   -8.71    6.21   12.41
   -3.16  -11.79    3.17] | cost: 74.03
theta: [-137.42    0.84    1.48    0.39  -38.62  -13.25   -8.11    5.57   11.18
   -3.67  -12.11    3.47] | cost: 73.23
theta: [-144.82    0.85    1.47    0.46  -37.36  -12.53   -7.56    5.47    9.93
   -4.57  -12.23    3.56] | cost: 72.28
theta: [-155.48    0.86    1.48    0.54  -34.88  -11.3    -6.98    5.95    8.38
   -5.92  -12.27    3.13] | cost: 70.91
theta: [-167.86    0.88    1.52    0.62  -31.01   -9.63   -6.53    7.03    6.9
   -7.3   -12.29    1.91] | cost: 69.33
theta: [-176.09    0.89    1.57    0.64  -27.32   -8.32   -6.41    8.07    6.31
   -7.84  -12.29    0.44] | cost: 68.19
theta: [-178.63    0.9     1.6     0.62  -25.15   -7.88   -6.5     8.52    6.6
   -7.51  -12.19   -0.39] | cost: 67.59
theta: [-179.83    0.91    1.63    0.6   -23.4    -7.84   -6.6     8.61    7.27
   -6.83  -11.89   -0.72] | cost: 67.08
theta: [-182.79    0.91    1.66    0.58  -20.55   -8.01   -6.68    8.49    8.44
   -5.7   -11.11   -0.69] | cost: 66.27
theta: [-190.23    0.93    1.68    0.6   -15.62   -8.38   -6.68    8.1    10.26
   -4.1    -9.46    0.01] | cost: 65.11
theta: [-199.13    0.93    1.69    0.67  -11.37   -8.7    -6.55    7.67   11.53
   -3.17   -7.81    1.13] | cost: 64.28
theta: [-203.85    0.93    1.68    0.72  -10.03   -8.78   -6.42    7.5    11.68
   -3.25   -7.13    1.86] | cost: 64.01
theta: [-204.24    0.93    1.67    0.74  -10.33   -8.74   -6.39    7.52   11.46
   -3.52   -7.17    1.97] | cost: 63.98
theta: [-204.06    0.93    1.67    0.74  -10.48   -8.72   -6.39    7.54   11.39
   -3.59   -7.22    1.95] | cost: 63.98
theta: [-204.03    0.93    1.67    0.74  -10.5    -8.72   -6.39    7.54   11.39
   -3.6    -7.22    1.95] | cost: 63.98
theta: [-204.03    0.93    1.67    0.74  -10.5    -8.72   -6.39    7.54   11.39
   -3.6    -7.22    1.95] | cost: 63.98
theta: [-204.03    0.93    1.67    0.74  -10.5    -8.72   -6.39    7.54   11.39
   -3.6    -7.22    1.95] | cost: 63.98
theta: [-204.03    0.93    1.67    0.74  -10.5    -8.72   -6.39    7.54   11.39
   -3.6    -7.22    1.95] | cost: 63.98
theta: [-204.03    0.93    1.67    0.74  -10.5    -8.72   -6.39    7.54   11.39
   -3.6    -7.22    1.95] | cost: 63.98
theta: [-204.03    0.93    1.67    0.74  -10.5    -8.72   -6.39    7.54   11.39
   -3.6    -7.22    1.95] | cost: 63.98

Our linear model is:

$y = -204.03 + 0.93x_1 + \ldots - 7.22x_{10} + 1.95x_{11}$
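Since X_train was converted to a NumPy array above, the column names are no longer attached to the coefficients. Pairing them back up makes the model easier to read (a minimal sketch; the names are written out by hand to match the training table above).

# Pair each fitted coefficient with its column name for readability
feature_names = ['bias', 'Length', 'Girth', 'Height',
                 'BCS_1.5', 'BCS_2.0', 'BCS_2.5', 'BCS_3.5', 'BCS_4.0',
                 'Age_2-5', 'Age_<2', 'Age_>10']
print(pd.Series(thetas, index=feature_names))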

Let's compare this equation that we obtained to the one we would get if we had used sklearn's LinearRegression model instead.

model = LinearRegression(fit_intercept=False) # We already accounted for it with the bias column
model.fit(X_train, y_train)
print("Coefficients", model.coef_)
Coefficients [-204.03    0.93    1.67    0.74  -10.5    -8.72   -6.39    7.54   11.39
   -3.6    -7.22    1.95]

The coefficients look exactly the same! Our homemade functions create the same model as an established Python package!

We successfully fit a linear model to our donkey data! Nice!

Evaluating our Model

Our next step is to evaluate our model's performance on the test set. We need to perform the same data pre-processing steps on the test set as we did on the training set before we can pass it into our model.

X_test.drop(['Sex', 'WeightAlt'], axis=1, inplace=True)
X_test = pd.get_dummies(X_test, columns=['BCS', 'Age'])
age_over_10 = X_test['Age_10-15'] | X_test['Age_15-20'] | X_test['Age_>20']
X_test['Age_>10'] = age_over_10
X_test.drop(['Age_10-15', 'Age_15-20', 'Age_>20'], axis=1, inplace=True)
X_test.drop(['BCS_3.0', 'Age_5-10'], axis=1, inplace=True)
X_test = X_test.assign(bias=1)
# HIDDEN
X_test = X_test.reindex(columns=['bias'] + list(X_test.columns[:-1]))
X_test
bias Length Girth Height BCS_1.5 BCS_2.0 BCS_2.5 BCS_3.5 BCS_4.0 Age_2-5 Age_<2 Age_>10
490 1 98 119 103 0 0 1 0 0 0 0 1
75 1 86 114 105 0 0 0 0 0 1 0 0
352 1 94 114 101 0 0 0 0 0 0 0 1
... ... ... ... ... ... ... ... ... ... ... ... ...
182 1 94 114 102 0 0 0 0 0 1 0 0
334 1 104 113 105 0 0 1 0 0 0 0 0
543 1 104 124 110 0 0 0 0 0 0 0 0

108 rows × 12 columns
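One caveat: pd.get_dummies only creates columns for categories that actually appear in the test set, so if a category were missing from the test data, the test columns would not line up with the training columns. Here they happen to match, but a defensive step is to reindex the test set to the training column order (a minimal sketch; the column list is written out by hand to match the tables above, and this is a no-op for our data).

# Align the test columns with the training column order, filling any
# missing dummy columns with 0.
train_columns = ['bias', 'Length', 'Girth', 'Height',
                 'BCS_1.5', 'BCS_2.0', 'BCS_2.5', 'BCS_3.5', 'BCS_4.0',
                 'Age_2-5', 'Age_<2', 'Age_>10']
X_test = X_test.reindex(columns=train_columns, fill_value=0)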

We pass X_test into the predict method of our LinearRegression model:

X_test = X_test.values
predictions = model.predict(X_test)

Let's look at the mean squared error:

mse_test_set(predictions)
7261.974205350604

With these predictions, we can also make a residual plot:

# HIDDEN
y_test = y_test.values
resid = y_test - predictions
resid_prop = resid / y_test
plt.scatter(np.arange(len(resid_prop)), resid_prop, s=15)
plt.axhline(0)
plt.title('Residual proportions (resid / actual Weight)')
plt.xlabel('Index of row in data')
plt.ylabel('Error proportion');

Looks like our model does pretty well! The residual proportions indicate that most of our predictions are within 15% of the actual weight.
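We can back up that claim with a quick computation (a minimal sketch reusing resid_prop from above; no particular value is claimed here).

# Fraction of test donkeys whose prediction is within 15% of the actual weight
within_15_pct = np.mean(np.abs(resid_prop) < 0.15)
print(f'{within_15_pct:.1%} of predictions fall within 15% of the actual weight')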