Multiple Linear Regression using Mini-Batch Gradient Descent

Overview

Basics

Goal: The goal is to build a Multiple Linear Regression model from scratch using Mini-Batch Gradient Descent approach.

The basic idea of a linear regression model is to find the best fit line that models the linear relationsip between a dependent and various independent variables perfectly. The best fit line is achieved by finding the values of slope(m) and constant(c) parameters that minimize the sum of squared errors between the observed values (training data) and predicted values (generated by model).

Hypothesis

Multiple linear model can be represented as:

$h_{\theta}(x)=\theta_{0} x_{0}+\theta_{1} x_{1}+\theta_{2} x_{2}+\ldots+\theta_{n} x_{n}$
or
$h_{\theta}(x)=\theta^{T} x$

where, Parameters: $\theta=\left\{\theta_{0}, \theta_{1}, \theta_{2}, \ldots \theta_{n}\right\}$ and Features: $x=\left\{x_{0}, x_{1}, x_{2}, \ldots x_{n}\right\}$

The parameters we need to find are:

$\theta=\left\{\theta_{0}, \theta_{1}, \theta_{2}, \ldots \theta_{n}\right\}$

Cost Function

Cost function gives an idea of how far the predicted values are from the actual values. The formula is:

$J\left(\theta_{0}, \theta_{1} \ldots\right)=\frac{1}{m} \Sigma\left(h(\theta)_{i}-y_{i}\right)^{2}$

where $ m = no.\ of\ records$, $ y_{i} = true\ values $

Gradient Descent

Mini-Batch Gradient Descent is an optimization technique that seeks to find model parameters (coefficients and bias) that minimize a cost function. In this technique, we repeatedly iterate through batches of 'm' training samples from the data and update the model parameters in accordance with the gradient of error.

Gradients can be calculated by taking a partial derivative of the cost function w.r.t each parameter $\theta$. The formula is:

$\theta_{j}:=\theta_{j}-\alpha \frac{2}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) \cdot x^{(i)}_{j}$

where $j = 0,...,n$, $m = no. of records$, $n = no. of features$, $\alpha = learning\ rate$

So, the gradients for various parameters can be written as:

$\theta_{0}:=\theta_{0}-\alpha \frac{2}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) \cdot x^{(i)}_{0}$
$\theta_{1}:=\theta_{1}-\alpha \frac{2}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) \cdot x^{(i)}_{1}$
$\theta_{2}:=\theta_{2}-\alpha \frac{2}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) \cdot x^{(i)}_{2}$
$\ldots$

image.png

Get Data

For this exercise, we will use the Diabetes dataset available in the scikit-learn library. We will:

  1. Read the data
  2. Identify dependent and independent variables
  3. Split the data into train and test sets
  4. Standardize the data
  5. Build model using mini-batch gradient descent
  6. Evaluate model performance

Load Data

In [1]:
# Basic Imports
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn.metrics import mean_squared_error, r2_score
In [25]:
# Read the data
from sklearn.datasets import load_diabetes
diabetes_data=pd.DataFrame(load_diabetes().data,columns=load_diabetes().feature_names)
diabetes_data.head()
Out[25]:
age sex bmi bp s1 s2 s3 s4 s5 s6
0 0.038076 0.050680 0.061696 0.021872 -0.044223 -0.034821 -0.043401 -0.002592 0.019908 -0.017646
1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163 0.074412 -0.039493 -0.068330 -0.092204
2 0.085299 0.050680 0.044451 -0.005671 -0.045599 -0.034194 -0.032356 -0.002592 0.002864 -0.025930
3 -0.089063 -0.044642 -0.011595 -0.036656 0.012191 0.024991 -0.036038 0.034309 0.022692 -0.009362
4 0.005383 -0.044642 -0.036385 0.021872 0.003935 0.015596 0.008142 -0.002592 -0.031991 -0.046641
In [26]:
# Check shape of data
diabetes_data.shape
Out[26]:
(442, 10)
In [27]:
# Create dependent and independent variables
data = load_diabetes()
X = data.data
Y = data.target
In [28]:
# Plot a variable from data and target variable Y
plt.scatter(diabetes_data.loc[:,'bmi'],Y)
plt.ylabel('Disease Progression')
plt.xlabel('BMI')
plt.title('Variation of Disease Progression and BMI')
plt.show()

Split and Standardize data

In [29]:
# Train test split
from sklearn.model_selection import train_test_split

# Split the data
x_train,x_test,y_train,y_test = train_test_split(X,Y,test_size=0.25,random_state=42)

display(f'Shape of training data {x_train.shape}')
display(f'Shape of test data {x_test.shape}')
'Shape of training data (331, 10)'
'Shape of test data (111, 10)'
In [30]:
# Standardize the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.fit_transform(x_test)

Add Column for Intercept term

Here, we will add a column of ones ($X_0$) to accomodate for intercept term ($\theta_0$) in to represent $\theta_0 x_0$.

Note - The column of ones must be added after standardizing the data. If it is done before standardizing, the computatuion of Q0*X0 will change as this column will get standardized as well.

In [31]:
# Add a column of ones to accomodate for intercept term (Q0*X0)
x_train = np.c_[np.ones(len(x_train), dtype='int64'), x_train]
x_test = np.c_[np.ones(len(x_test), dtype='int64'), x_test]
In [32]:
print(x_train.shape, x_test.shape)
(331, 11) (111, 11)

Mini Batch Gradient Descent

In mini-batch algorithm rather than using the complete data set, in every iteration we use a set of ‘m’ training examples called batch to compute the gradient of the cost function. What this means is that we do not calculate the gradients for each observation but for a group of observations which results in a faster optimization.

Build and Fit Model

Let's build a Linear Regression model using Mini-Batch Gradient Descent (MBGD). Here is how the algorithm works:

  1. Randomly initialize model parameters (b0 and b1)
  2. Pick a value for learning rate (alpha), number of epochs (iterations to iterate over) and batch size, then

    for each epoch(e):

     for each mini-batch (m):
    
         subset X and y per mini-bacth size
    
         - Forward Pass:
           Make predictions: y_pred = (b0 + b1Xm)
           Calculate loss/error: sum((ym - y_pred)**2) / len(Xm)
    
         - Backward Pass:
           Calculate gradient for bj (D_bj): partial derivative of J(b) w.r.t J(bj)
    
         - Update Parameters:
           Update parameter bj: bj - alpha * D_bj
In [35]:
class Multiple_LR():
    
    def __init__(self):
        self.b0 = None
        self.b1 = None
        
    def fit(self, X, y, epochs=100, learning_rate=0.01, batch_size=32, verbose=False):
        
        '''
        Used to calculate the coefficients of the linear regression model.
        
        '''
        
        n = X.shape[0]    # no of records in data
        batch_loss = []   # store total loss after each batch
        epoch_loss = []   # store total loss after each epoch
        
        # Initialize weights and bias
        # Note b0 (intercept) is not written here as it has been accounted for 
        # when adding a column of ones in the cells above 
        self.bj = np.random.rand(X.shape[1])  # shape = (ncols,) = (11,)
        
        for e in range(epochs):   # iterate over epochs
            e_loss = 0
            
            for i in range(0, n, batch_size):   # iterate over batches
                # Subset data for batches
                X_new = X[i:i+batch_size]
                y_new = y[i:i+batch_size]
                
                # Calculate Loss
                y_pred = np.dot(self.bj, X_new.T)
                loss = (1/len(X_new)) * np.sum((y_new - y_pred)**2)
                e_loss += loss
                batch_loss.append(loss)
                
                # Calculate gradients
                D_bj = (-2/len(X_new)) * np.dot(X_new.T, y_new - y_pred) # np.sum is NOT used to ensure shape remains (11,) (1,)

                # Update parameters
                self.bj -= learning_rate * D_bj
                
            epoch_loss.append(e_loss)
            if verbose == True and e%1000 == 0:
                print(f'Epoch: {e}, Loss: {e_loss}')
                
        return self.bj, batch_loss, epoch_loss
    
    def predict(self, X):
        return np.dot(self.bj, X.T)

Plot Loss for different learning rates

Learning rate is extremely important for finding best model. Here, we iterate over various learning rates and build model using each learning rate. Next, we plot the loss generated by different models.

In [36]:
# Build and fit LR model for different learning rates

e = 10000
batch = 32

losses = {}  
model = Multiple_LR()
for lr in [0.1, 0.01, 0.001, 0.0001, 0.00001]:
    bj, batch_loss, epoch_loss = model.fit(x_train, y_train, learning_rate=lr, 
                                               epochs=e, batch_size=batch)
    losses[f'LR={str(lr)}'] = epoch_loss

# Plot loss
plt.figure(figsize=(15,12))    
xs = np.arange(len(epoch_loss))

for k, v in losses.items():
    plt.plot(xs, v, lw=3, label=f"{k}, Final loss = {v[-1]:.2f}")
    
plt.title('Loss per Epoch (MSE) for different learning rates', size=20)
plt.xlabel('Epochs', size=14)
plt.ylabel('Loss', size=14)
plt.legend(fontsize=12)
plt.show()

From the plot above, we can see that the losses generated by various models seem to be fairly close. If the learning rate is too high, the model may not converge. On the other hand, if its too low, the model may take longer to converge. Let's use the learning rate of 0.001 with lowest final loss to build and fit a model and then evaluate its performance.

Build Best Model and Plot Loss

In [37]:
# Build and fit best LR model
alpha = 0.001
e = 10000
batch = 32

# Build model
lr_mbgd_model = Multiple_LR()


# Fit Model
bj, batch_loss, epoch_loss = lr_mbgd_model.fit(x_train, y_train, learning_rate=alpha, 
                                                   epochs=e, batch_size=batch, verbose=True)
Epoch: 0, Loss: 317578.18523734657
Epoch: 1000, Loss: 32572.50730941292
Epoch: 2000, Loss: 32505.61196692166
Epoch: 3000, Loss: 32460.81784782076
Epoch: 4000, Loss: 32430.767873670648
Epoch: 5000, Loss: 32410.606772913474
Epoch: 6000, Loss: 32397.079726312753
Epoch: 7000, Loss: 32388.0033240038
Epoch: 8000, Loss: 32381.91284306942
Epoch: 9000, Loss: 32377.825674168627
In [38]:
# Print bias and weights
print(f'Bias: {bj[0]}')
print(f'Weight Matrix: \n{bj[1:]}')
Bias: 154.89791723842532
Weight Matrix: 
[ -1.12720836 -10.8636127   26.33476025  18.58345778 -36.33049563
  20.86340649   3.04919268   9.02813465  31.79647368   1.63203223]

Plot Loss

Batch Loss
In [39]:
# Plot total loss which shows loss at each iteration
iterations = ((len(x_train)//batch)+1) * e

plt.figure(figsize=(8,5))
plt.plot(np.arange(iterations), batch_loss)
plt.xlabel('Batches')
plt.ylabel('Loss')
plt.title('Loss at each Batch')
plt.show()
Epoch Loss
In [40]:
# Plot epoch loss which shows loss at each epoch
plt.figure(figsize=(8,5))
plt.plot(np.arange(e), epoch_loss)
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Loss at each Epoch')
plt.show()

Evaluate Model Performance

In this section, we will check the performance of our model on training and test data. We will also build a polynomial regression model using scikit-learn library and compare performance.

Performance on Training data

Evaluate model performance on training data.

In [41]:
# Create predictions on test set
train_pred_mbgd = lr_mbgd_model.predict(x_train)
In [42]:
# Plot predictions
import seaborn as sns

sns.scatterplot(y_train, train_pred_mbgd, alpha=0.4)
sns.regplot(y_train, train_pred_mbgd, truncate=True, 
            scatter_kws={'s': 20, 'alpha':0.3}, 
            line_kws={'color':'green', 'linewidth': 2})
 
plt.xlabel("Actual Prices: $Y_i$")
plt.ylabel("Predicted Prices: $\hat{Y}_i$")
plt.title("Actual Prices vs Predicted prices: $Y_i$ vs $\hat{Y}_i$ [Training Set]")
 
plt.show()
In [43]:
# Calculate metrics
train_r2_mbgd = np.abs(r2_score(y_train, train_pred_mbgd))
train_mse_mbgd = mean_squared_error(y_train, train_pred_mbgd)
train_rmse_mbgd = np.sqrt(train_mse_mbgd)

print("Performance on training set:")
print(f'Train R2 is: {train_r2_mbgd}')
print(f'Train MSE is: {train_mse_mbgd}')
print(f'Train RMSE is: {train_rmse_mbgd}')
Performance on training set:
Train R2 is: 0.5165572826182001
Train MSE is: 2922.2297342316483
Train RMSE is: 54.05765194892994

Performance on Test data

Evaluate model performance on test data.

In [44]:
# Create predictions on test set
test_pred_mbgd = lr_mbgd_model.predict(x_test)
In [45]:
import seaborn as sns

sns.scatterplot(y_test, test_pred_mbgd, alpha=0.4)
sns.regplot(y_test, test_pred_mbgd, truncate=True, scatter_kws={'s': 20, 'alpha':0.3}, line_kws={'color':'green', 'linewidth': 2})
 
plt.xlabel("Actual Prices: $Y_i$")
plt.ylabel("Predicted prices: $\hat{Y}_i$")
plt.title("Actual Prices vs Predicted prices: $Y_i$ vs $\hat{Y}_i$ [Test Set]")
 
plt.show()
In [46]:
# Calculate metrics
test_r2_mbgd = np.abs(r2_score(y_test, test_pred_mbgd))
test_mse_mbgd = mean_squared_error(y_test, test_pred_mbgd)
test_rmse_mbgd = np.sqrt(test_mse_mbgd)

print("Performance on test set:")
print(f'Test R2 is: {test_r2_mbgd}')
print(f'Test MSE is: {test_mse_mbgd}')
print(f'Test RMSE is: {test_rmse_mbgd}')
Performance on test set:
Test R2 is: 0.4861487380720835
Test MSE is: 2841.438080723931
Test RMSE is: 53.3051412222492

Comparison with Scikit Learn

Here, we build a Multiple Linear Regression model using LinearRegression from scikit-learn library and compare the results with model created from scratch.

In [47]:
# Build a model using scikit learn
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

lr_model = LinearRegression()
lr_model.fit(x_train, y_train)
lr_preds = lr_model.predict(x_test)

test_r2 = np.abs(r2_score(y_test, lr_preds))
test_mse = mean_squared_error(y_test, lr_preds)
test_rmse = np.sqrt(mean_squared_error(y_test, lr_preds))

print("Using scikit-learn model:")
print(f'Test R2 is: {test_r2}')
print(f'Test MSE is: {test_mse}')
print(f'Test RMSE is: {test_rmse}')
Using scikit-learn model:
Test R2 is: 0.46771406356922773
Test MSE is: 2943.37611225009
Test RMSE is: 54.252890358487726

Model Results

Training Data Test Data Model using scikit-learn
R2 0.517 0.486 0.468
MSE 2922.23 2841.44 2943.376
RMSE 54.06 53.305 54.253

From the results comparison, we can see that the model created using mini-batch gradient descent fairs well when compared to the model created using scikit-learn library.

Summary

To summarize, in this notebook we created a multiple linear regression model using mini-batch gradient from scratch. We saw how iterating over different learning rates helped identify a learning rate with minimum loss to generate the best results. Later, we evaluated the results of our model on training and test data and compared performance with a model created using scikit-learn library.