Goal: Build a Multiple Linear Regression model from scratch using the Mini-Batch Gradient Descent approach.
The basic idea of a linear regression model is to find the best-fit line that models the linear relationship between a dependent variable and one or more independent variables. The best-fit line is obtained by finding the values of the slope (m) and intercept (c) parameters that minimize the sum of squared errors between the observed values (training data) and the predicted values (generated by the model).
A multiple linear regression model can be represented as:
$h_{\theta}(x)=\theta_{0} x_{0}+\theta_{1} x_{1}+\theta_{2} x_{2}+\ldots+\theta_{n} x_{n}$
or
$h_{\theta}(x)=\theta^{T} x$
where, Parameters: $\theta=\left\{\theta_{0}, \theta_{1}, \theta_{2}, \ldots \theta_{n}\right\}$ and Features: $x=\left\{x_{0}, x_{1}, x_{2}, \ldots x_{n}\right\}$
The parameters we need to find are:
$\theta=\left\{\theta_{0}, \theta_{1}, \theta_{2}, \ldots \theta_{n}\right\}$
The cost function gives an idea of how far the predicted values are from the actual values. The formula is:
$J(\theta)=\frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right)^{2}$
where $m = no.\ of\ records$ and $y^{(i)} = true\ values$
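As a quick illustration, the cost can be computed in a few lines of NumPy. This is a minimal sketch; the names compute_cost, theta, X and y are placeholders, and X is assumed to already contain a leading column of ones for the intercept.
# Minimal sketch of the cost function J(theta) (illustrative, not part of the final model)
import numpy as np

def compute_cost(theta, X, y):
    predictions = X @ theta                  # h_theta(x) for all m records at once
    return np.mean((predictions - y) ** 2)   # (1/m) * sum of squared errors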
Mini-Batch Gradient Descent is an optimization technique that seeks to find the model parameters (coefficients and bias) that minimize a cost function. In this technique, we repeatedly iterate through batches of $m$ training samples and update the model parameters according to the gradient of the error computed on each batch.
Gradients are calculated by taking the partial derivative of the cost function with respect to each parameter $\theta_j$. The resulting parameter update rule is:
$\theta_{j}:=\theta_{j}-\alpha \frac{2}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) \cdot x^{(i)}_{j}$
where $j = 0,\ldots,n$, $m = no.\ of\ records$, $n = no.\ of\ features$, $\alpha = learning\ rate$
So, the update rules for the individual parameters can be written as:
$\theta_{0}:=\theta_{0}-\alpha \frac{2}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) \cdot x^{(i)}_{0}$
$\theta_{1}:=\theta_{1}-\alpha \frac{2}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) \cdot x^{(i)}_{1}$
$\theta_{2}:=\theta_{2}-\alpha \frac{2}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) \cdot x^{(i)}_{2}$
$\ldots$
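Continuing the NumPy sketch above, a single vectorized update of all parameters could look as follows; gradient_step and its arguments are illustrative names, with alpha being the learning rate.
# Minimal sketch of one vectorized gradient descent update (illustrative)
def gradient_step(theta, X, y, alpha):
    m = len(y)
    error = X @ theta - y                # h_theta(x^(i)) - y^(i) for every record
    gradient = (2 / m) * (X.T @ error)   # partial derivatives w.r.t. each theta_j
    return theta - alpha * gradient      # simultaneous update of all parameters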
For this exercise, we will use the Diabetes dataset available in the scikit-learn library. We will load and explore the data, split it into training and test sets, standardize the features, build the mini-batch gradient descent model, and compare its performance with scikit-learn's LinearRegression.
# Basic Imports
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn.metrics import mean_squared_error, r2_score
# Read the data
from sklearn.datasets import load_diabetes
diabetes_data = pd.DataFrame(load_diabetes().data, columns=load_diabetes().feature_names)
diabetes_data.head()
# Check shape of data
diabetes_data.shape
# Create dependent and independent variables
data = load_diabetes()
X = data.data
Y = data.target
# Plot a variable from data and target variable Y
plt.scatter(diabetes_data.loc[:,'bmi'],Y)
plt.ylabel('Disease Progression')
plt.xlabel('BMI')
plt.title('Variation of Disease Progression and BMI')
plt.show()
# Train test split
from sklearn.model_selection import train_test_split
# Split the data
x_train,x_test,y_train,y_test = train_test_split(X,Y,test_size=0.25,random_state=42)
display(f'Shape of training data {x_train.shape}')
display(f'Shape of test data {x_test.shape}')
# Standardize the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)  # use the scaler fitted on the training data; do not re-fit on the test data
Here, we will add a column of ones ($x_0$) to accommodate the intercept term ($\theta_0$), so that the model can represent $\theta_0 x_0$.
Note - The column of ones must be added after standardizing the data. If it is added before standardizing, the column of ones would be standardized as well (turning it into a column of zeros), which would break the computation of $\theta_0 x_0$.
# Add a column of ones to account for the intercept term (theta_0 * x_0)
x_train = np.c_[np.ones(len(x_train), dtype='int64'), x_train]
x_test = np.c_[np.ones(len(x_test), dtype='int64'), x_test]
print(x_train.shape, x_test.shape)
In the mini-batch algorithm, rather than using the complete dataset, in every iteration we use a subset of 'm' training examples, called a batch, to compute the gradient of the cost function. This means we do not calculate the gradients for each individual observation (as in stochastic gradient descent) but for a group of observations, which results in faster optimization.
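For example, the mini-batches can be formed by stepping through the training set in strides of the batch size. This is a rough sketch using the training arrays prepared above; the last batch may be smaller than the rest when the number of rows is not a multiple of the batch size.
# Sketch of how mini-batches are formed (the loop body is implemented in the class below)
batch_size = 32
for start in range(0, len(x_train), batch_size):
    X_batch = x_train[start:start + batch_size]  # at most batch_size rows
    y_batch = y_train[start:start + batch_size]
    # ...compute predictions, loss and gradients on this batch only...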
Let's build a Linear Regression model using Mini-Batch Gradient Descent (MBGD). Here is how the algorithm works:
Pick a value for the learning rate (alpha), the number of epochs (passes over the data) and the batch size, then
for each epoch (e):
    for each mini-batch (m):
        subset X and y per the mini-batch size
        - Forward Pass:
            Make predictions: y_pred = Xm . bj
            Calculate loss/error: sum((ym - y_pred)**2) / len(Xm)
        - Backward Pass:
            Calculate the gradient for each bj (D_bj): partial derivative of J(b) w.r.t. bj
        - Update Parameters:
            Update each parameter bj: bj = bj - alpha * D_bj
class Multiple_LR():
    def __init__(self):
        self.bj = None  # parameter vector (intercept + weights), set by fit()

    def fit(self, X, y, epochs=100, learning_rate=0.01, batch_size=32, verbose=False):
        '''
        Calculate the coefficients of the linear regression model
        using mini-batch gradient descent.
        '''
        n = X.shape[0]   # no. of records in the data
        batch_loss = []  # store loss after each batch
        epoch_loss = []  # store total loss after each epoch
        # Initialize the parameter vector.
        # Note: a separate bias term b0 is not needed here, as it is accounted for
        # by the column of ones added to the data in the cells above.
        self.bj = np.random.rand(X.shape[1])  # shape = (ncols,) = (11,)
        for e in range(epochs):  # iterate over epochs
            e_loss = 0
            for i in range(0, n, batch_size):  # iterate over batches
                # Subset data for the current batch
                X_new = X[i:i+batch_size]
                y_new = y[i:i+batch_size]
                # Forward pass: predictions and loss
                y_pred = np.dot(self.bj, X_new.T)
                loss = (1/len(X_new)) * np.sum((y_new - y_pred)**2)
                e_loss += loss
                batch_loss.append(loss)
                # Backward pass: gradient of the loss w.r.t. each parameter.
                # np.dot already sums over the batch, so the result keeps shape (11,)
                D_bj = (-2/len(X_new)) * np.dot(X_new.T, y_new - y_pred)
                # Update parameters
                self.bj -= learning_rate * D_bj
            epoch_loss.append(e_loss)
            if verbose and e % 1000 == 0:
                print(f'Epoch: {e}, Loss: {e_loss}')
        return self.bj, batch_loss, epoch_loss

    def predict(self, X):
        return np.dot(self.bj, X.T)
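Before applying the model to the diabetes data, a quick sanity check on a tiny synthetic problem can help confirm that the implementation recovers known coefficients. This is only a sketch; the synthetic data and hyperparameters below are arbitrary.
# Sanity check of Multiple_LR on synthetic data (arbitrary example values)
rng = np.random.RandomState(0)
X_demo = np.c_[np.ones(200), rng.rand(200, 2)]           # column of ones + 2 features
true_w = np.array([1.0, 2.0, -3.0])
y_demo = X_demo @ true_w + rng.normal(0, 0.1, size=200)  # noisy linear target
demo_model = Multiple_LR()
w_est, _, _ = demo_model.fit(X_demo, y_demo, epochs=2000, learning_rate=0.05, batch_size=32)
print(w_est)  # should be roughly [1.0, 2.0, -3.0]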
The learning rate is extremely important for finding the best model. Here, we iterate over various learning rates and build a model with each one. Next, we plot the loss generated by the different models.
# Build and fit LR model for different learning rates
e = 10000
batch = 32
losses = {}
model = Multiple_LR()
for lr in [0.1, 0.01, 0.001, 0.0001, 0.00001]:
bj, batch_loss, epoch_loss = model.fit(x_train, y_train, learning_rate=lr,
epochs=e, batch_size=batch)
losses[f'LR={str(lr)}'] = epoch_loss
# Plot loss
plt.figure(figsize=(15,12))
xs = np.arange(len(epoch_loss))
for k, v in losses.items():
plt.plot(xs, v, lw=3, label=f"{k}, Final loss = {v[-1]:.2f}")
plt.title('Loss per Epoch (MSE) for different learning rates', size=20)
plt.xlabel('Epochs', size=14)
plt.ylabel('Loss', size=14)
plt.legend(fontsize=12)
plt.show()
From the plot above, we can see that the losses generated by the various models are fairly close. If the learning rate is too high, the model may not converge; on the other hand, if it is too low, the model may take longer to converge. Let's use the learning rate of 0.001, which has the lowest final loss, to build and fit a model and then evaluate its performance.
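The learning rate with the lowest final loss can also be picked programmatically from the losses dictionary built above; this small sketch simply compares the last epoch loss of each run.
# Pick the learning rate whose final epoch loss is lowest (sketch using the losses dict above)
best_lr_key = min(losses, key=lambda k: losses[k][-1])
print(best_lr_key, losses[best_lr_key][-1])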
# Build and fit best LR model
alpha = 0.001
e = 10000
batch = 32
# Build model
lr_mbgd_model = Multiple_LR()
# Fit Model
bj, batch_loss, epoch_loss = lr_mbgd_model.fit(x_train, y_train, learning_rate=alpha,
epochs=e, batch_size=batch, verbose=True)
# Print bias and weights
print(f'Bias: {bj[0]}')
print(f'Weights: \n{bj[1:]}')
# Plot total loss which shows loss at each iteration
iterations = len(batch_loss)  # total number of mini-batch updates across all epochs
plt.figure(figsize=(8,5))
plt.plot(np.arange(iterations), batch_loss)
plt.xlabel('Batches')
plt.ylabel('Loss')
plt.title('Loss at each Batch')
plt.show()
# Plot epoch loss which shows loss at each epoch
plt.figure(figsize=(8,5))
plt.plot(np.arange(e), epoch_loss)
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Loss at each Epoch')
plt.show()
In this section, we will check the performance of our model on the training and test data. We will also build a linear regression model using the scikit-learn library and compare performance.
Evaluate model performance on training data.
# Create predictions on training set
train_pred_mbgd = lr_mbgd_model.predict(x_train)
# Plot predictions
import seaborn as sns
sns.scatterplot(x=y_train, y=train_pred_mbgd, alpha=0.4)
sns.regplot(x=y_train, y=train_pred_mbgd, truncate=True,
            scatter_kws={'s': 20, 'alpha': 0.3},
            line_kws={'color': 'green', 'linewidth': 2})
plt.xlabel("Actual Values: $Y_i$")
plt.ylabel("Predicted Values: $\hat{Y}_i$")
plt.title("Actual vs Predicted Disease Progression: $Y_i$ vs $\hat{Y}_i$ [Training Set]")
plt.show()
# Calculate metrics
train_r2_mbgd = r2_score(y_train, train_pred_mbgd)
train_mse_mbgd = mean_squared_error(y_train, train_pred_mbgd)
train_rmse_mbgd = np.sqrt(train_mse_mbgd)
print("Performance on training set:")
print(f'Train R2 is: {train_r2_mbgd}')
print(f'Train MSE is: {train_mse_mbgd}')
print(f'Train RMSE is: {train_rmse_mbgd}')
Evaluate model performance on test data.
# Create predictions on test set
test_pred_mbgd = lr_mbgd_model.predict(x_test)
import seaborn as sns
sns.scatterplot(x=y_test, y=test_pred_mbgd, alpha=0.4)
sns.regplot(x=y_test, y=test_pred_mbgd, truncate=True, scatter_kws={'s': 20, 'alpha': 0.3}, line_kws={'color': 'green', 'linewidth': 2})
plt.xlabel("Actual Values: $Y_i$")
plt.ylabel("Predicted Values: $\hat{Y}_i$")
plt.title("Actual vs Predicted Disease Progression: $Y_i$ vs $\hat{Y}_i$ [Test Set]")
plt.show()
# Calculate metrics
test_r2_mbgd = r2_score(y_test, test_pred_mbgd)
test_mse_mbgd = mean_squared_error(y_test, test_pred_mbgd)
test_rmse_mbgd = np.sqrt(test_mse_mbgd)
print("Performance on test set:")
print(f'Test R2 is: {test_r2_mbgd}')
print(f'Test MSE is: {test_mse_mbgd}')
print(f'Test RMSE is: {test_rmse_mbgd}')
Here, we build a Multiple Linear Regression model using LinearRegression from the scikit-learn library and compare the results with the model created from scratch.
# Build a model using scikit learn
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
lr_model = LinearRegression()
lr_model.fit(x_train, y_train)
lr_preds = lr_model.predict(x_test)
test_r2 = r2_score(y_test, lr_preds)
test_mse = mean_squared_error(y_test, lr_preds)
test_rmse = np.sqrt(mean_squared_error(y_test, lr_preds))
print("Using scikit-learn model:")
print(f'Test R2 is: {test_r2}')
print(f'Test MSE is: {test_mse}')
print(f'Test RMSE is: {test_rmse}')
| Metric | Training Data | Test Data | Model using scikit-learn |
|---|---|---|---|
| R2 | 0.517 | 0.486 | 0.468 |
| MSE | 2922.23 | 2841.44 | 2943.376 |
| RMSE | 54.06 | 53.305 | 54.253 |
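The comparison table above can also be assembled programmatically from the metrics computed earlier; this sketch reuses the variable names defined in the previous cells.
# Assemble the comparison table from the metrics computed above (sketch)
summary = pd.DataFrame({
    'Training Data': [train_r2_mbgd, train_mse_mbgd, train_rmse_mbgd],
    'Test Data': [test_r2_mbgd, test_mse_mbgd, test_rmse_mbgd],
    'Model using scikit-learn': [test_r2, test_mse, test_rmse],
}, index=['R2', 'MSE', 'RMSE'])
print(summary.round(3))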
From the comparison of results, we can see that the model created using mini-batch gradient descent fares well when compared to the model created using the scikit-learn library.
To summarize, in this notebook we created a multiple linear regression model from scratch using mini-batch gradient descent. We saw how iterating over different learning rates helped identify the learning rate with the minimum loss and the best results. Finally, we evaluated the results of our model on the training and test data and compared its performance with a model created using the scikit-learn library.