Polynomial Regression using Mini-Batch Gradient Descent

Overview

Basics

Goal: The goal is to build a Polynomial Regression model from scratch using Mini-Batch Gradient Descent approach.

The basic idea of a linear regression model is to find the best fit line that models the linear relationsip between a dependent and various independent variables perfectly. The best fit line is achieved by finding the values of slope (m) and constant (c) parameters that minimize the sum of squared errors between the observed values (training data) and predicted values (generated by model).

Polynomial regression is a special case of linear regression which tries to find a non-linear relationship between the dependent and independent variable(s) by modeling the relation as an n-th degree polynomial. To generate a higher order polynomial, we can add powers of the original features as new features.

Hypothesis

A single feature polynomial regression model can be represented as:

$h_{\theta}(x)=\theta_{0}+\theta_{1}x +\theta_{2} x^2+\ldots\theta_{n} x^n$

where, Parameters: $\theta=\left\{\theta_{0}, \theta_{1}, \theta_{2},\ldots\theta_{n}\right\}$

Note: This is still considered to be linear model as it is a linear combination of coefficients/weights associated with the features. For e.g., $x^2$ is only a feature that has been added. However, the curve that we are fitting is polynomial in nature.

The parameters we need to find are:

$\theta=\left\{\theta_{0}, \theta_{1}, \theta_{2}, \ldots \theta_{n}\right\}$

Cost Function

Cost function gives an idea of how far the predicted values are from the actual values. The formula is:

$J\left(\theta_{0}, \theta_{1} \ldots\right)=\frac{1}{m} \Sigma\left(h(\theta)_{i}-y_{i}\right)^{2}$

where $ m = no.\ of\ records$, $ y_{i} = true\ values $

Gradient Descent

Mini-Batch Gradient Descent is an optimization technique that seeks to find model parameters (coefficients and bias) that minimize a cost function. In this technique, we repeatedly iterate through batches of 'm' training samples from the data and update the model parameters in accordance with the gradient of error.

Gradients can be calculated by taking a partial derivative of the cost function w.r.t each parameter $\theta$. The formula is:

$\theta_{j}:=\theta_{j}-\alpha \frac{2}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) \cdot x^{(i)}_{j}$

where $j = 0,...,n$, $m = no. of records$, $n = no. of features$, $\alpha = learning\ rate$

So, the gradients for various parameters can be written as:

$\theta_{0}:=\theta_{0}-\alpha \frac{2}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) \cdot x^{(i)}_{0}$
$\theta_{1}:=\theta_{1}-\alpha \frac{2}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) \cdot x^{(i)}_{1}$
$\theta_{2}:=\theta_{2}-\alpha \frac{2}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) \cdot x^{(i)}_{2}$
$\ldots$

image.png

Get Data

For this exercise, we will use the Boston Housing Prices toy dataset available in the scikit-learn library. We will:

  1. Read the data
  2. Identify dependent (Price) and independent (LSTAT - % lower status of the population) variable
  3. Add Polynomial terms to the data
  4. Split the data into train and test sets
  5. Standardize the data
  6. Build model using mini-batch gradient descent
  7. Evaluate model performance

Load Data

In [1]:
# Basic Imports
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.metrics import mean_squared_error, r2_score
In [2]:
# Read the data
from sklearn.datasets import load_boston
boston_data=pd.DataFrame(load_boston().data,columns=load_boston().feature_names)
boston_data.head()
Out[2]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33
In [3]:
# Create dependent (Price) and independent (LSTAT - % lower status of the population) variables
Y=load_boston().target
display(f'Head of Y: {Y[:5]}')
display(f'Shape of Y: {Y.shape}')

X=load_boston().data[:,12]
display(f'Head of X: {X[:5]}')
display(f'Shape of X: {X.shape}')
'Head of Y: [24.  21.6 34.7 33.4 36.2]'
'Shape of Y: (506,)'
'Head of X: [4.98 9.14 4.03 2.94 5.33]'
'Shape of X: (506,)'
In [4]:
# Plot X and Y
plt.scatter(X,Y)
plt.ylabel('Price')
plt.xlabel('LSTAT')
plt.title('Variation of Price and LSTAT')
plt.show()

From the figure above, we can see that LSTAT has a non-linear relation with target variable Price. We will transform the original features into higher degree polynomials before training the model.

Add Polynomial Term

Here, we will add a second order polynomial term to our data.

In [5]:
# Add polynomial term
X = np.c_[X, np.power(X,2)]

print(f'First 5 rows of X: \n {X[:5]}')
First 5 rows of X: 
 [[ 4.98   24.8004]
 [ 9.14   83.5396]
 [ 4.03   16.2409]
 [ 2.94    8.6436]
 [ 5.33   28.4089]]

Split and Standardize data

In [6]:
# Train test split
from sklearn.model_selection import train_test_split

# Split the data
x_train,x_test,y_train,y_test = train_test_split(X,Y,test_size=0.25,random_state=42)

display(f'Shape of training data {x_train.shape}')
display(f'Shape of test data {x_test.shape}')
'Shape of training data (379, 2)'
'Shape of test data (127, 2)'
In [7]:
# Standardize the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.fit_transform(x_test)

Add Column for Intercept term

Here, we will add a column of ones ($X_0$) to accomodate for intercept term ($\theta_0$) in to represent $\theta_0 x_0$.

Note - The column of ones must be added after standardizing the data. If it is done before standardizing, the computatuion of Q0*X0 will change as this column will get standardized as well.

In [8]:
x_train = np.c_[np.ones(len(x_train)), x_train]
x_test = np.c_[np.ones(len(x_test)), x_test]
In [9]:
print(f'First 5 rows of x_train: \n {x_train[:5]}')
First 5 rows of x_train: 
 [[ 1.         -1.07775629 -0.7846345 ]
 [ 1.          0.35116456  0.07689937]
 [ 1.         -1.22625199 -0.82335346]
 [ 1.          2.06586959  2.2800638 ]
 [ 1.         -0.72472879 -0.65417833]]

Mini Batch Gradient Descent

In mini-batch algorithm rather than using the complete data set, in every iteration we use a set of ‘m’ training examples called batch to compute the gradient of the cost function. What this means is that we do not calculate the gradients for each observation but for a group of observations which results in a faster optimization.

Build and Fit Model

Let's build a Linear Regression model using Mini-Batch Gradient Descent (MBGD). Here is how the algorithm works:

  1. Randomly initialize model parameters (b0 and b1)
  2. Pick a value for learning rate (alpha), number of epochs (iterations to iterate over) and batch size, then

    for each epoch(e):

     for each mini-batch (m):
    
         subset X and y per mini-bacth size
    
         - Forward Pass:
           Make predictions: y_pred = (b0 + b1Xm)
           Calculate loss/error: sum((ym - y_pred)**2) / len(Xm)
    
         - Backward Pass:
           Calculate gradient for bj (D_bj): partial derivative of J(b) w.r.t J(bj)
    
         - Update Parameters:
           Update parameter bj: bj - alpha * D_bj
In [10]:
class Poly_Regression():
    
    def __init__(self):
        self.b0 = None
        self.b1 = None
        
    def fit(self, X, y, epochs=100, learning_rate=0.01, batch_size=32, verbose=False):
        
        '''
        Used to calculate the coefficients of the linear regression model.
        
        '''
        
        n = X.shape[0]    # no of records in data
        batch_loss = []   # store total loss after each batch
        epoch_loss = []   # store total loss after each epoch
        
        # Initialize weights and bias
        # Note b0 (intercept) is not written here as it has been accounted for 
        # when adding a column of ones in the cells above 
        self.bj = np.zeros(X.shape[1])  # shape = (ncols,) = (11,)
        
        for e in range(epochs):   # iterate over epochs
            e_loss = 0
            
            for i in range(0, n, batch_size):   # iterate over batches
                # Subset data for batches
                X_new = X[i:i+batch_size]
                y_new = y[i:i+batch_size]
                
                # Calculate Loss
                y_pred = np.dot(self.bj, X_new.T)
                loss = (1/len(X_new)) * np.sum((y_new - y_pred)**2)
                e_loss += loss
                batch_loss.append(loss)
                
                # Calculate gradients
                D_bj = (-2/len(X_new)) * np.dot(y_new - y_pred, X_new) # np.sum is NOT used to ensure shape remains (11,) (1,)

                # Update parameters
                self.bj -= learning_rate * D_bj
                
            epoch_loss.append(e_loss)
            if verbose == True and e%1000 == 0:
                print(f'Epoch: {e}, Loss: {e_loss}')
                
        return self.bj, batch_loss, epoch_loss
    
    def predict(self, X):
        return np.dot(self.bj, X.T)

Plot Loss for different learning rates

Learning rate is extremely important for finding best model. Here, we iterate over various learning rates to build the models. Next, we plot the loss generated by different models.

In [11]:
epochs = 10000
batch = 32

model = Poly_Regression()
losses = {}

for lr in [0.1, 0.3, 0.01, 0.03, 0.001, 0.003]:
    bj, batch_loss, epoch_loss = model.fit(x_train, y_train, epochs = epochs, 
                                               learning_rate=lr, batch_size=batch)
    losses[f'LR={str(lr)}'] = epoch_loss
    
# Plot loss
plt.figure(figsize=(15,8))    
xs = np.arange(len(epoch_loss))

for k, v in losses.items():
    plt.plot(xs, v, lw=3, label=f"{k}, Final loss = {v[-1]:.2f}")
    
plt.title('Loss per Epoch (MSE) for different learning rates', size=20)
plt.xlabel('Epochs', size=14)
plt.ylabel('Loss', size=14)
plt.legend(fontsize=12)
plt.show()

From the plot above, we can see that the losses generated by various models seem to be fairly close. If the learning rate is too high, the model may not converge. On the other hand, if its too low, the model may take longer to converge. Let's use the learning rate of 0.001 with lowest final loss to build and fit a model and then evaluate its performance.

Build Best Model and Plot Loss

In [12]:
# Build and fit best LR model
alpha = 0.001
e = 10000
batch = 32

# Build model
lr_mbgd_model = Poly_Regression()


# Fit Model
bj, batch_loss, epoch_loss = lr_mbgd_model.fit(x_train, y_train, learning_rate=alpha, 
                                                   epochs=e, batch_size=batch, verbose=True)
Epoch: 0, Loss: 7199.329377907019
Epoch: 1000, Loss: 396.6391391494353
Epoch: 2000, Loss: 371.94764251399005
Epoch: 3000, Loss: 368.2185534720047
Epoch: 4000, Loss: 367.65453315224704
Epoch: 5000, Loss: 367.5689051038003
Epoch: 6000, Loss: 367.5557810324503
Epoch: 7000, Loss: 367.5537215591388
Epoch: 8000, Loss: 367.55338003641566
Epoch: 9000, Loss: 367.55331655050463
In [13]:
# Print bias and weights
print(f'Bias: {bj[0]}')
print(f'Weight Matrix: \n{bj[1:]}')
Bias: 22.910560472505956
Weight Matrix: 
[-17.5456265   10.99886224]

Plot Loss

Batch Loss
In [14]:
# Plot total loss which shows loss at each iteration
iterations = ((len(x_train)//batch)+1) * e

plt.figure(figsize=(8,5))
plt.plot(np.arange(iterations), batch_loss)
plt.xlabel('Batches')
plt.ylabel('Loss')
plt.title('Loss at each Batch')
plt.show()
Epoch Loss
In [15]:
# Plot epoch loss which shows loss at each epoch
plt.figure(figsize=(8,5))
plt.plot(np.arange(e), epoch_loss)
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Loss at each Epoch')
plt.show()

Evaluate Model Performance

In this section, we will check the performance of our model on training and test data. We will also build a polynomial regression model using scikit-learn library and compare performance.

Performance on Training data

In [16]:
# Create predictions
train_pred_mbgd = lr_mbgd_model.predict(x_train)
In [17]:
# Plot predictions
plt.figure(figsize=(12,8))

plt.scatter(x_train[:,1], y_train)
plt.scatter(x_train[:,1], train_pred_mbgd, 
            label = f'y={bj[0]:.2f} + {bj[1]:.2f}x + {bj[2]:.2f}x2')

plt.ylabel('Price')
plt.xlabel('LSTAT')
plt.title('Model fit on training data')
plt.legend()
plt.show()
In [18]:
# Calculate metrics
train_r2_mbgd = np.abs(r2_score(y_train, train_pred_mbgd))
train_mse_mbgd = mean_squared_error(y_train, train_pred_mbgd)
train_rmse_mbgd = np.sqrt(train_mse_mbgd)

print("Performance on training set:")
print(f'Train R2 is: {train_r2_mbgd}')
print(f'Train MSE is: {train_mse_mbgd}')
print(f'Train RMSE is: {train_rmse_mbgd}')
Performance on training set:
Train R2 is: 0.6592310035862463
Train MSE is: 30.219984656794377
Train RMSE is: 5.497270655224679

Perofrmance on Test data

Evaluate model performance on test data.

In [19]:
# Create predictions on test set
test_pred_mbgd = lr_mbgd_model.predict(x_test)
In [20]:
# Plot predictions
plt.figure(figsize=(12,8))

plt.scatter(x_test[:,1], y_test)           
plt.scatter(x_test[:,1], test_pred_mbgd,
           label = f'y={bj[0]:.2f} + {bj[1]:.2f}x + {bj[2]:.2f}x2')

plt.ylabel('Price')
plt.xlabel('LSTAT')
plt.title('Model fit on test data')
plt.legend()
plt.show()
In [21]:
# Calculate metrics
test_r2_mbgd = np.abs(r2_score(y_test, test_pred_mbgd))
test_mse_mbgd = mean_squared_error(y_test, test_pred_mbgd)
test_rmse_mbgd = np.sqrt(test_mse_mbgd)

print("Performance on test set:")
print(f'Test R2 is: {test_r2_mbgd}')
print(f'Test MSE is: {test_mse_mbgd}')
print(f'Test RMSE is: {test_rmse_mbgd}')
Performance on test set:
Test R2 is: 0.5281588456642883
Test MSE is: 33.04168829667163
Test RMSE is: 5.748190001789401

Comparison with Scikit Learn

Here, we build a Multiple Linear Regression model using LinearRegression from scikit-learn library and compare the results with model created from scratch.

In [22]:
# Build a model using scikit learn
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# "Creates a polynomial regression model for the given degree"
poly_features = PolynomialFeatures(degree=2)
   
# transform the features to higher degree features.
train_data = x_train[:,1].reshape(-1,1)
test_data = x_test[:,1].reshape(-1,1)
X_train_poly = poly_features.fit_transform(train_data)
   
# fit the transformed features to Linear Regression
poly_model = LinearRegression()

poly_model.fit(X_train_poly, y_train)
     
# predicting on training data-set
y_train_predicted = poly_model.predict(X_train_poly)
   
# predicting on test data-set
y_test_predicted = poly_model.predict(poly_features.fit_transform(test_data))
In [23]:
# Plot predictions
plt.figure(figsize=(12,8))

plt.scatter(test_data, y_test)           
plt.scatter(test_data, y_test_predicted)

plt.ylabel('Price')
plt.xlabel('LSTAT')
plt.title('Model fit on test data')
plt.show()
In [24]:
# Calculate metrics

test_r2 = np.abs(r2_score(y_test, y_test_predicted))
test_mse = mean_squared_error(y_test, y_test_predicted)
test_rmse = np.sqrt(test_mse)

print("Performance on test set using scikit-learn model:")
print(f'Test R2 is: {test_r2}')
print(f'Test MSE is: {test_mse}')
print(f'Test RMSE is: {test_rmse}')
Performance on test set using scikit-learn model:
Test R2 is: 0.5207911179492031
Test MSE is: 33.557629223781575
Test RMSE is: 5.792894718858748

Model Results

Training Data Test Data Model using scikit-learn
R2 0.66 0.53 0.52
MSE 30.22 33.042 33.56
RMSE 5.5 5.75 5.79

From the results comparison, we can see that the model created using mini-batch gradient descent fairs well when compared to the model created using scikit-learn library.

Summary

To summarize, in this notebook we created a polynomial regression model (degree 2) using mini-batch gradient descent from scratch. We saw how iterating over different learning rates helped identify a learning rate with minimum loss to generate the best results. Later, we evaluated the results of our model on training and test data and compared performance with a model created using scikit-learn library.