Multiple Linear Regression using Mini-Batch Gradient Descent¶

Source: https://morioh.com/p/0d9b2bedf683

Overview¶

Basics¶

Goal: The goal is to build a Multiple Linear Regression model from scratch using Mini-Batch Gradient Descent approach.

The basic idea of a linear regression model is to find the best fit line that models the linear relationsip between a dependent and various independent variables perfectly. The best fit line is achieved by finding the values of slope(m) and constant(c) parameters that minimize the sum of squared errors between the observed values (training data) and predicted values (generated by model).

Hypothesis¶

Multiple linear model can be represented as:

$h_{\theta}(x)=\theta_{0} x_{0}+\theta_{1} x_{1}+\theta_{2} x_{2}+\ldots+\theta_{n} x_{n}$
or
$h_{\theta}(x)=\theta^{T} x$

where, Parameters: $\theta=\left\{\theta_{0}, \theta_{1}, \theta_{2}, \ldots \theta_{n}\right\}$ and Features: $x=\left\{x_{0}, x_{1}, x_{2}, \ldots x_{n}\right\}$

The parameters we need to find are:

$\theta=\left\{\theta_{0}, \theta_{1}, \theta_{2}, \ldots \theta_{n}\right\}$

Cost Function¶

Cost function gives an idea of how far the predicted values are from the actual values. The formula is:

$J\left(\theta_{0}, \theta_{1} \ldots\right)=\frac{1}{m} \Sigma\left(h(\theta)_{i}-y_{i}\right)^{2}$

where $ m = no.\ of\ records$, $ y_{i} = true\ values $

Gradient Descent¶

Mini-Batch Gradient Descent is an optimization technique that seeks to find model parameters (coefficients and bias) that minimize a cost function. In this technique, we repeatedly iterate through batches of 'm' training samples from the data and update the model parameters in accordance with the gradient of error.

Gradients can be calculated by taking a partial derivative of the cost function w.r.t each parameter $\theta$. The formula is:

$\theta_{j}:=\theta_{j}-\alpha \frac{2}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) \cdot x^{(i)}_{j}$

where $j = 0,...,n$, $m = no. of records$, $n = no. of features$, $\alpha = learning\ rate$

So, the gradients for various parameters can be written as:

$\theta_{0}:=\theta_{0}-\alpha \frac{2}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) \cdot x^{(i)}_{0}$
$\theta_{1}:=\theta_{1}-\alpha \frac{2}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) \cdot x^{(i)}_{1}$
$\theta_{2}:=\theta_{2}-\alpha \frac{2}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) \cdot x^{(i)}_{2}$
$\ldots$

Get Data¶

For this exercise, we will use the Diabetes dataset available in the scikit-learn library. We will:

Read the data
Identify dependent and independent variables
Split the data into train and test sets
Standardize the data
Build model using mini-batch gradient descent
Evaluate model performance

Load Data¶

# Basic Imports
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn.metrics import mean_squared_error, r2_score

# Read the data
from sklearn.datasets import load_diabetes
diabetes_data=pd.DataFrame(load_diabetes().data,columns=load_diabetes().feature_names)
diabetes_data.head()

# Check shape of data
diabetes_data.shape

(442, 10)

# Create dependent and independent variables
data = load_diabetes()
X = data.data
Y = data.target

# Plot a variable from data and target variable Y
plt.scatter(diabetes_data.loc[:,'bmi'],Y)
plt.ylabel('Disease Progression')
plt.xlabel('BMI')
plt.title('Variation of Disease Progression and BMI')
plt.show()

Split and Standardize data¶

# Train test split
from sklearn.model_selection import train_test_split

# Split the data
x_train,x_test,y_train,y_test = train_test_split(X,Y,test_size=0.25,random_state=42)

display(f'Shape of training data {x_train.shape}')
display(f'Shape of test data {x_test.shape}')

'Shape of training data (331, 10)'

'Shape of test data (111, 10)'

# Standardize the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.fit_transform(x_test)

Add Column for Intercept term¶

Here, we will add a column of ones ($X_0$) to accomodate for intercept term ($\theta_0$) in to represent $\theta_0 x_0$.

Note - The column of ones must be added after standardizing the data. If it is done before standardizing, the computatuion of Q0*X0 will change as this column will get standardized as well.

# Add a column of ones to accomodate for intercept term (Q0*X0)
x_train = np.c_[np.ones(len(x_train), dtype='int64'), x_train]
x_test = np.c_[np.ones(len(x_test), dtype='int64'), x_test]

print(x_train.shape, x_test.shape)

(331, 11) (111, 11)

Mini Batch Gradient Descent¶

In mini-batch algorithm rather than using the complete data set, in every iteration we use a set of ‘m’ training examples called batch to compute the gradient of the cost function. What this means is that we do not calculate the gradients for each observation but for a group of observations which results in a faster optimization.

Build and Fit Model¶

Let's build a Linear Regression model using Mini-Batch Gradient Descent (MBGD). Here is how the algorithm works:

Randomly initialize model parameters (b0 and b1)

Pick a value for learning rate (alpha), number of epochs (iterations to iterate over) and batch size, then

for each epoch(e):

 for each mini-batch (m):

     subset X and y per mini-bacth size

     - Forward Pass:
       Make predictions: y_pred = (b0 + b1Xm)
       Calculate loss/error: sum((ym - y_pred)**2) / len(Xm)

     - Backward Pass:
       Calculate gradient for bj (D_bj): partial derivative of J(b) w.r.t J(bj)

     - Update Parameters:
       Update parameter bj: bj - alpha * D_bj

class Multiple_LR():
    
    def __init__(self):
        self.b0 = None
        self.b1 = None
        
    def fit(self, X, y, epochs=100, learning_rate=0.01, batch_size=32, verbose=False):
        
        '''
        Used to calculate the coefficients of the linear regression model.
        
        '''
        
        n = X.shape[0]    # no of records in data
        batch_loss = []   # store total loss after each batch
        epoch_loss = []   # store total loss after each epoch
        
        # Initialize weights and bias
        # Note b0 (intercept) is not written here as it has been accounted for 
        # when adding a column of ones in the cells above 
        self.bj = np.random.rand(X.shape[1])  # shape = (ncols,) = (11,)
        
        for e in range(epochs):   # iterate over epochs
            e_loss = 0
            
            for i in range(0, n, batch_size):   # iterate over batches
                # Subset data for batches
                X_new = X[i:i+batch_size]
                y_new = y[i:i+batch_size]
                
                # Calculate Loss
                y_pred = np.dot(self.bj, X_new.T)
                loss = (1/len(X_new)) * np.sum((y_new - y_pred)**2)
                e_loss += loss
                batch_loss.append(loss)
                
                # Calculate gradients
                D_bj = (-2/len(X_new)) * np.dot(X_new.T, y_new - y_pred) # np.sum is NOT used to ensure shape remains (11,) (1,)

                # Update parameters
                self.bj -= learning_rate * D_bj
                
            epoch_loss.append(e_loss)
            if verbose == True and e%1000 == 0:
                print(f'Epoch: {e}, Loss: {e_loss}')
                
        return self.bj, batch_loss, epoch_loss
    
    def predict(self, X):
        return np.dot(self.bj, X.T)

Plot Loss for different learning rates¶

Learning rate is extremely important for finding best model. Here, we iterate over various learning rates and build model using each learning rate. Next, we plot the loss generated by different models.

# Build and fit LR model for different learning rates

e = 10000
batch = 32

losses = {}  
model = Multiple_LR()
for lr in [0.1, 0.01, 0.001, 0.0001, 0.00001]:
    bj, batch_loss, epoch_loss = model.fit(x_train, y_train, learning_rate=lr, 
                                               epochs=e, batch_size=batch)
    losses[f'LR={str(lr)}'] = epoch_loss

# Plot loss
plt.figure(figsize=(15,12))    
xs = np.arange(len(epoch_loss))

for k, v in losses.items():
    plt.plot(xs, v, lw=3, label=f"{k}, Final loss = {v[-1]:.2f}")
    
plt.title('Loss per Epoch (MSE) for different learning rates', size=20)
plt.xlabel('Epochs', size=14)
plt.ylabel('Loss', size=14)
plt.legend(fontsize=12)
plt.show()

From the plot above, we can see that the losses generated by various models seem to be fairly close. If the learning rate is too high, the model may not converge. On the other hand, if its too low, the model may take longer to converge. Let's use the learning rate of 0.001 with lowest final loss to build and fit a model and then evaluate its performance.

Build Best Model and Plot Loss¶

# Build and fit best LR model
alpha = 0.001
e = 10000
batch = 32

# Build model
lr_mbgd_model = Multiple_LR()


# Fit Model
bj, batch_loss, epoch_loss = lr_mbgd_model.fit(x_train, y_train, learning_rate=alpha, 
                                                   epochs=e, batch_size=batch, verbose=True)

Epoch: 0, Loss: 317578.18523734657
Epoch: 1000, Loss: 32572.50730941292
Epoch: 2000, Loss: 32505.61196692166
Epoch: 3000, Loss: 32460.81784782076
Epoch: 4000, Loss: 32430.767873670648
Epoch: 5000, Loss: 32410.606772913474
Epoch: 6000, Loss: 32397.079726312753
Epoch: 7000, Loss: 32388.0033240038
Epoch: 8000, Loss: 32381.91284306942
Epoch: 9000, Loss: 32377.825674168627

# Print bias and weights
print(f'Bias: {bj[0]}')
print(f'Weight Matrix: \n{bj[1:]}')

Bias: 154.89791723842532
Weight Matrix: 
[ -1.12720836 -10.8636127   26.33476025  18.58345778 -36.33049563
  20.86340649   3.04919268   9.02813465  31.79647368   1.63203223]

Plot Loss¶

Batch Loss¶

# Plot total loss which shows loss at each iteration
iterations = ((len(x_train)//batch)+1) * e

plt.figure(figsize=(8,5))
plt.plot(np.arange(iterations), batch_loss)
plt.xlabel('Batches')
plt.ylabel('Loss')
plt.title('Loss at each Batch')
plt.show()

Epoch Loss¶

# Plot epoch loss which shows loss at each epoch
plt.figure(figsize=(8,5))
plt.plot(np.arange(e), epoch_loss)
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Loss at each Epoch')
plt.show()

Evaluate Model Performance¶

In this section, we will check the performance of our model on training and test data. We will also build a polynomial regression model using scikit-learn library and compare performance.

Performance on Training data¶

Evaluate model performance on training data.

# Create predictions on test set
train_pred_mbgd = lr_mbgd_model.predict(x_train)

# Plot predictions
import seaborn as sns

sns.scatterplot(y_train, train_pred_mbgd, alpha=0.4)
sns.regplot(y_train, train_pred_mbgd, truncate=True, 
            scatter_kws={'s': 20, 'alpha':0.3}, 
            line_kws={'color':'green', 'linewidth': 2})
 
plt.xlabel("Actual Prices: $Y_i$")
plt.ylabel("Predicted Prices: $\hat{Y}_i$")
plt.title("Actual Prices vs Predicted prices: $Y_i$ vs $\hat{Y}_i$ [Training Set]")
 
plt.show()

# Calculate metrics
train_r2_mbgd = np.abs(r2_score(y_train, train_pred_mbgd))
train_mse_mbgd = mean_squared_error(y_train, train_pred_mbgd)
train_rmse_mbgd = np.sqrt(train_mse_mbgd)

print("Performance on training set:")
print(f'Train R2 is: {train_r2_mbgd}')
print(f'Train MSE is: {train_mse_mbgd}')
print(f'Train RMSE is: {train_rmse_mbgd}')

Performance on training set:
Train R2 is: 0.5165572826182001
Train MSE is: 2922.2297342316483
Train RMSE is: 54.05765194892994

Performance on Test data¶

Evaluate model performance on test data.

# Create predictions on test set
test_pred_mbgd = lr_mbgd_model.predict(x_test)

import seaborn as sns

sns.scatterplot(y_test, test_pred_mbgd, alpha=0.4)
sns.regplot(y_test, test_pred_mbgd, truncate=True, scatter_kws={'s': 20, 'alpha':0.3}, line_kws={'color':'green', 'linewidth': 2})
 
plt.xlabel("Actual Prices: $Y_i$")
plt.ylabel("Predicted prices: $\hat{Y}_i$")
plt.title("Actual Prices vs Predicted prices: $Y_i$ vs $\hat{Y}_i$ [Test Set]")
 
plt.show()

# Calculate metrics
test_r2_mbgd = np.abs(r2_score(y_test, test_pred_mbgd))
test_mse_mbgd = mean_squared_error(y_test, test_pred_mbgd)
test_rmse_mbgd = np.sqrt(test_mse_mbgd)

print("Performance on test set:")
print(f'Test R2 is: {test_r2_mbgd}')
print(f'Test MSE is: {test_mse_mbgd}')
print(f'Test RMSE is: {test_rmse_mbgd}')

Performance on test set:
Test R2 is: 0.4861487380720835
Test MSE is: 2841.438080723931
Test RMSE is: 53.3051412222492

Comparison with Scikit Learn¶

Here, we build a Multiple Linear Regression model using LinearRegression from scikit-learn library and compare the results with model created from scratch.

# Build a model using scikit learn
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

lr_model = LinearRegression()
lr_model.fit(x_train, y_train)
lr_preds = lr_model.predict(x_test)

test_r2 = np.abs(r2_score(y_test, lr_preds))
test_mse = mean_squared_error(y_test, lr_preds)
test_rmse = np.sqrt(mean_squared_error(y_test, lr_preds))

print("Using scikit-learn model:")
print(f'Test R2 is: {test_r2}')
print(f'Test MSE is: {test_mse}')
print(f'Test RMSE is: {test_rmse}')

Using scikit-learn model:
Test R2 is: 0.46771406356922773
Test MSE is: 2943.37611225009
Test RMSE is: 54.252890358487726

Model Results¶

	Training Data	Test Data	Model using scikit-learn
R2	0.517	0.486	0.468
MSE	2922.23	2841.44	2943.376
RMSE	54.06	53.305	54.253

From the results comparison, we can see that the model created using mini-batch gradient descent fairs well when compared to the model created using scikit-learn library.

Summary¶

To summarize, in this notebook we created a multiple linear regression model using mini-batch gradient from scratch. We saw how iterating over different learning rates helped identify a learning rate with minimum loss to generate the best results. Later, we evaluated the results of our model on training and test data and compared performance with a model created using scikit-learn library.

	age	sex	bmi	bp	s1	s2	s3	s4	s5	s6
0	0.038076	0.050680	0.061696	0.021872	-0.044223	-0.034821	-0.043401	-0.002592	0.019908	-0.017646
1	-0.001882	-0.044642	-0.051474	-0.026328	-0.008449	-0.019163	0.074412	-0.039493	-0.068330	-0.092204
2	0.085299	0.050680	0.044451	-0.005671	-0.045599	-0.034194	-0.032356	-0.002592	0.002864	-0.025930
3	-0.089063	-0.044642	-0.011595	-0.036656	0.012191	0.024991	-0.036038	0.034309	0.022692	-0.009362
4	0.005383	-0.044642	-0.036385	0.021872	0.003935	0.015596	0.008142	-0.002592	-0.031991	-0.046641