Goal: The goal is to build a Simple Linear Regression model from scratch using various Gradient Descent approaches.
The basic idea of a linear regression model
is to find the best fit line that models the linear relationsip between a dependent and independent variable perfectly. The best fit line is achieved by finding the values of slope(m) and constant(c) parameters that minimize the sum of squared errors between the observed values (training data) and predicted values (generated by model).
Simple linear model can be represented as:
$h(\theta)=\theta_{0}+\theta_{1} X$
The parameters we need to find are:
$\theta_{0}$ and $\theta_{1}$
Cost function gives an idea of how far the predicted values are from the actual values. The formula is:
$J\left(\theta_{0}, \theta_{1} \right)=\frac{1}{m} \Sigma\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right)^{2}$
where $ m = no.\ of\ records$, $ y_{i} = true\ values $
Gradient Descent is an optimization technique that seeks to find model parameters (coefficients and bias) that minimize a cost function. In this technique, we repeatedly iterate through the training data and update the model parameters in accordance with the gradient of error with respect to the training data.
Gradients can be calculated by taking a partial derivative of the cost function w.r.t each parameter $\theta$. The formulas are:
$\theta_{0}:=\theta_{0}-\alpha \frac{2}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right)$
$\theta_{1}:=\theta_{1}-\alpha \frac{2}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) \cdot x^{(i)}$
where $\alpha$ is the learning rate and $ m = no.\ of\ records$
We will explore 3-types of gradient descent algorithms:
For this exercise, we will use the Boston Housing Prices toy dataset available in the scikit-learn library. We will:
# Basic Imports
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.metrics import mean_squared_error, r2_score
# Read the data
from sklearn.datasets import load_boston
boston_data=pd.DataFrame(load_boston().data,columns=load_boston().feature_names)
boston_data.head()
# Create dependent (Price) and independent (LSTAT - % lower status of the population) variables
Y=load_boston().target
display(f'Head of Y: {Y[:5]}')
display(f'Shape of Y: {Y.shape}')
X=load_boston().data[:,12]
display(f'Head of X: {X[:5]}')
display(f'Shape of X: {X.shape}')
# Plot X and Y
plt.scatter(X,Y)
plt.ylabel('Price')
plt.xlabel('LSTAT')
plt.title('Variation of Price and LSTAT')
plt.show()
# Train test split
from sklearn.model_selection import train_test_split
# Split the data
x_train,x_test,y_train,y_test=train_test_split(X,Y,test_size=0.3,random_state=101)
# Reshape since the data has only one feature
x_train = x_train.reshape(-1, 1)
x_test = x_test.reshape(-1, 1)
display(x_train.shape)
# Standardize the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)
Stochastic Gradient Descent uses each training sample in the dataset for a forward pass and updates the parameters simultaneously. So, the algorithm uses each training sample to compute the gradient of cost function and update parameters right away. Depending on the problem, this can make SGD faster than batch gradient descent.
One advantage is the frequent updates allow us to have a pretty detailed rate of improvement. The frequent updates, however, are more computationally expensive than the batch gradient descent approach. Additionally, the frequency of those updates can result in noisy gradients, which may cause the error rate to jump around instead of slowly decreasing.
Here we will build a Linear Regression model using Stochastic Gradient Descent (SGD). Here is how the algorithm works:
Pick a value for learning rate (alpha) and number of epochs (iterations to iterate over). The learning rate determines how big the step would be on each iteration. Then,
for each epoch(e):
for each training sample(i):
- Forward Pass:
Make predictions: (b0 + b1Xi)
Calculate loss/error: (y - y_pred)**2
- Backward Pass:
Calculate gradient for b0 (D_b0): partial derivative of J(b) w.r.t J(b0)
Calculate gradient for b1 (D_b1): partial derivative of J(b) w.r.t J(b1)
- Update Parameters:
Update parameter b0: b0 - alpha * D_b0
Update parameter b1: b1 - alpha * D_b1
class LR_SGD():
def __init__(self):
# Initialize parameters with random weights
self.b0 = np.random.rand(1)
self.b1 = np.random.rand(1)
def fit(self, X, y, learning_rate=0.01, epochs=100):
n = X.shape[0]
loss_epoch = [] # store total loss after each epoch
total_loss = [] # store loss after each update
for e in range(epochs): # iterate over epochs
epoch_loss = 0
for i in range(n): # iterate over each row in data
y_pred = self.b0 + self.b1*X[i] # predicted value
error = np.sum((y[i] - y_pred)**2) #error=(observed-predicted)**2
epoch_loss += error
total_loss.append(error) # store loss after each update
# Calculate gradients
D_b0 = (-2/n) * np.sum(y[i] - y_pred)
D_b1 = (-2/n) * np.dot(X[i], y[i] - y_pred)
# Update parameters
self.b0 -= learning_rate * D_b0
self.b1 -= learning_rate * D_b1
loss_epoch.append(epoch_loss) # store total loss after each epoch
print(f'Epoch: {e}, epoch loss: {epoch_loss}')
return self.b0, self.b1, total_loss, loss_epoch
def predict(self, X):
return self.b0 + self.b1*X
# Build and fit LR model
alpha = 0.1
e = 20
# Build model
lr_sgd_model = LR_SGD()
# Fit Model
b0, b1, total_loss, loss_epoch = lr_sgd_model.fit(x_train,y_train,learning_rate=alpha, epochs=e)
# Plot total loss which shows loss at each iteration
plt.figure(figsize=(15,10))
plt.plot(np.arange(len(x_train)*e), total_loss)
plt.xlabel('Iterations')
plt.ylabel('Loss')
plt.title('Loss at each iteration of training sample')
plt.show()
# Plot epoch loss which shows loss at each epoch
plt.figure(figsize=(8,5))
plt.plot(np.arange(e), loss_epoch)
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Loss at each Epoch')
plt.show()
Check Model performance on training data
# Create predictions on training data and calculate metrics
train_predictions = lr_sgd_model.predict(x_train)
train_r2 = r2_score(y_train, train_predictions)
train_mse = mean_squared_error(y_train, train_predictions)
train_rmse = np.sqrt(train_mse)
print(f'Training R2 is: {train_r2}')
print(f'Training RMSE is: {train_rmse}')
# Plot model fit
plt.scatter(x_train,y_train)
plt.plot(x_train, train_predictions, color='r')
plt.ylabel('Price')
plt.xlabel('LSTAT')
plt.title('Model fit on training data')
plt.show()
Evaluate model performance on test data.
# Create predictions on test set
test_predictions = lr_sgd_model.predict(x_test)
# Calculate metrics
test_r2 = r2_score(y_test, test_predictions)
test_mse = mean_squared_error(y_test, test_predictions)
test_rmse = np.sqrt(test_mse)
print("Using Stochastic Gradient Descent:")
print(f'R2 is: {test_r2}')
print(f'RMSE is: {test_rmse}')
# Plot Predictions
plt.scatter(x_test,y_test)
plt.ylabel('Price')
plt.xlabel('LSTAT')
plt.plot(x_test, test_predictions, color='r', label='y={:.2f}+{:.2f}x'.format(b0[0], b1[0]))
plt.legend()
plt.title('Model fit on test data')
plt.show()
Batch Gradient Descent uses all training data for a forward pass and then updates the parameters after all training data has been considered. So, the algorithm uses all dataset available to compute the average gradients of all training examples and then uses that mean gradient to updates parameters. This whole process is like a cycle and it's called a training epoch.
Some advantages of batch gradient descent are its computational efficient, it produces a stable error gradient and a stable convergence. Some disadvantages are the stable error gradient can sometimes result in a state of convergence that isn’t the best the model can achieve. It also requires the entire training dataset be in memory and available to the algorithm.
Let's build a Linear Regression model using Batch Gradient Descent (BGD). Here is how the algorithm works:
Pick a value for learning rate (alpha) and number of epochs (iterations to iterate over), then
for each epoch(e):
- Forward Pass:
Make predictions: (b0 + b1Xi)
Calculate loss/error: sum((y - y_pred)**2) / len(training data)
- Backward Pass:
Calculate gradient for b0 (D_b0): partial derivative of J(b) w.r.t J(b0)
Calculate gradient for b1 (D_b1): partial derivative of J(b) w.r.t J(b1)
- Update Parameters:
Update parameter b0: b0 - alpha * D_b0
Update parameter b1: b1 - alpha * D_b1
class LR_BGD():
def __init__(self):
# Initialize parameters
self.b0 = np.random.rand(1)
self.b1 = np.random.rand(1)
def fit(self, X, y, learning_rate=0.01, epochs=100):
n = X.shape[0]
total_loss = [] # store loss after each epoch
for e in range(epochs): # iterate over epochs
y_pred = self.b0 + np.dot(X, self.b1) # predicted value
loss = (1/n) * np.sum((y - y_pred)**2) # error = (1/n) * sum((observed-predicted)**2)
total_loss.append(loss) # append loss to list
# Calculate gradients
D_b0 = (-2/n) * np.sum(y - y_pred)
D_b1 = (-2/n) * np.dot(y-y_pred, X)
# Update parameters
self.b0 -= learning_rate * D_b0
self.b1 -= learning_rate * D_b1
if e%100 == 0:
print(f'Epoch: {e}, Loss: {loss}')
# return parameters and loss
return self.b0, self.b1, total_loss
def predict(self, X):
return self.b0 + np.dot(X, self.b1)
# Build and fit LR model
alpha = 0.001
e = 2000
# Build model
lr_bgd_model = LR_BGD()
# Fit Model
b0, b1, total_loss = lr_bgd_model.fit(x_train,y_train,learning_rate=alpha, epochs=e)
# Plot epoch loss which shows loss at each epoch
plt.figure(figsize=(8,5))
plt.plot(np.arange(e), total_loss)
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Loss at each Epoch')
plt.show()
Check Model performance on training data
# Create predictions on training data and calculate metrics
train_pred_bgd = lr_bgd_model.predict(x_train)
train_r2_bgd = r2_score(y_train, train_pred_bgd)
train_mse_bgd = mean_squared_error(y_train, train_pred_bgd)
train_rmse_bgd = np.sqrt(train_mse_bgd)
print(f'Training R2 is: {train_r2_bgd}')
print(f'Training RMSE is: {train_rmse_bgd}')
# Plot model fit
plt.scatter(x_train,y_train)
plt.plot(x_train, train_pred_bgd, color='r')
plt.ylabel('Price')
plt.xlabel('LSTAT')
plt.title('Model fit on training data')
plt.show()
Evaluate model performance on test data.
# Create predictions on test set
test_pred_bgd = lr_bgd_model.predict(x_test)
# Calculate metrics
test_r2_bgd = np.abs(r2_score(y_test, test_pred_bgd))
test_mse_bgd = mean_squared_error(y_test, test_pred_bgd)
test_rmse_bgd = np.sqrt(test_mse_bgd)
print("Using Batch Gradient Descent:")
print(f'Test R2 is: {test_r2_bgd}')
print(f'Test RMSE is: {test_rmse_bgd}')
# Plot Predictions
plt.scatter(x_test,y_test)
plt.ylabel('Price')
plt.xlabel('LSTAT')
plt.plot(x_test, test_pred_bgd, color='r', label='y={:.2f}+{:.2f}x'.format(b0[0], b1[0]))
plt.legend()
plt.title('Model fit on test data')
plt.show()
In mini-batch algorithm rather than using the complete data set, in every iteration we use a set of ‘m’ training examples called batch to compute the gradient of the cost function. What this means is that we do not calculate the gradients for each observation but for a group of observations which results in a faster optimization.
Let's build a Linear Regression model using Mini-Batch Gradient Descent (MBGD). Here is how the algorithm works:
Pick a value for learning rate (alpha), number of epochs (iterations to iterate over) and batch size, then
for each epoch(e):
for each mini-batch (m):
subset X and y per mini-bacth size
- Forward Pass:
Make predictions: y_pred = (b0 + b1Xm)
Calculate loss/error: sum((ym - y_pred)**2) / len(Xm)
- Backward Pass:
Calculate gradient for b0 (D_b0): partial derivative of J(b) w.r.t J(b0)
Calculate gradient for b1 (D_b1): partial derivative of J(b) w.r.t J(b1)
- Update Parameters:
Update parameter b0: b0 - alpha * D_b0
Update parameter b1: b1 - alpha * D_b1
class LR_MBGD():
def __init__(self):
self.b0 = np.random.rand(1)
self.b1 = np.random.rand(1)
def fit(self, X, y, learning_rate=0.01, epochs=100, batch_size=50):
n = X.shape[0]
batch_loss = []
epoch_loss = []
for e in range(epochs): # iterate over epochs
e_loss = 0
for i in range(0, n, batch_size): # iterate over each batch
X_new = X[i:i+batch_size] # create X per batch size
y_new = y[i:i+batch_size] # create y per batch size
y_pred = self.b0 + np.dot(X_new, self.b1) # get prediction
loss = (1/len(X_new)) * np.sum((y_new - y_pred)**2) # error = (1/n) * sum((observed-predicted)**2)
e_loss += loss
batch_loss.append(loss)
D_b0 = (-2/len(X_new)) * np.sum(y_new - y_pred)
D_b1 = (-2/len(X_new)) * np.dot(y_new - y_pred, X_new)
self.b0 -= learning_rate * D_b0
self.b1 -= learning_rate * D_b1
epoch_loss.append(e_loss)
if e%10 == 0:
print(f'Epoch: {e}, Loss: {e_loss}')
return self.b0, self.b1, batch_loss, epoch_loss
def predict(self, X):
return self.b0 + self.b1*X
# Build and fit LR model
alpha = 0.001
e = 200
batch = 30
# Build model
lr_mbgd_model = LR_MBGD()
# Fit Model
b0, b1, batch_loss, epoch_loss = lr_mbgd_model.fit(x_train, y_train, learning_rate=alpha,
epochs=e, batch_size=batch)
# Plot total loss which shows loss at each iteration
iterations = ((len(x_train)//batch)+1) * e
plt.figure(figsize=(15,10))
plt.plot(np.arange(iterations), batch_loss)
plt.xlabel('Batches')
plt.ylabel('Loss')
plt.title('Loss at each Batch')
plt.show()
# Plot epoch loss which shows loss at each epoch
plt.figure(figsize=(8,5))
plt.plot(np.arange(e), epoch_loss)
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Loss at each Epoch')
plt.show()
In this section, we will check the performance of our model on training and test data.
Check model performance on training data.
# Create predictions on training data and calculate metrics
train_pred_mbgd = lr_mbgd_model.predict(x_train)
train_r2_mbgd = r2_score(y_train, train_pred_mbgd)
train_mse_mbgd = mean_squared_error(y_train, train_pred_mbgd)
train_rmse_mbgd = np.sqrt(train_mse_mbgd)
print(f'Training R2 is: {train_r2_mbgd}')
print(f'Training RMSE is: {train_rmse_mbgd}')
# Model fit on training data
plt.scatter(x_train,y_train)
plt.plot(x_train, train_pred_mbgd, color='r')
plt.ylabel('Price')
plt.xlabel('LSTAT')
plt.title('Model fit on training data')
plt.show()
Evaluate model performance on test data.
# Create predictions on test set
test_pred_mbgd = lr_mbgd_model.predict(x_test)
# Calculate metrics
test_r2_mbgd = np.abs(r2_score(y_test, test_pred_mbgd))
test_mse_mbgd = mean_squared_error(y_test, test_pred_mbgd)
test_rmse_mbgd = np.sqrt(test_mse_mbgd)
print("Using Mini-Batch Gradient Descent:")
print(f'Test R2 is: {test_r2_mbgd}')
print(f'Test RMSE is: {test_rmse_mbgd}')
# Plot Predictions
plt.scatter(x_test,y_test)
plt.ylabel('Price')
plt.xlabel('LSTAT')
plt.plot(x_test, test_pred_mbgd, color='r', label='y={:.2f}+{:.2f}x'.format(b0[0], b1[0]))
plt.legend()
plt.title('Model fit on test data')
plt.show()
Gradient Descent Type | Iterations | Epochs | Learning Rate | Test Data | Training Data | ||
---|---|---|---|---|---|---|---|
R2 | RMSE | R2 | RMSE | ||||
Stochastic | 354 (each record of training data in considered) | 20 | 0.1 | 0.565 | 5.704 | 0.534 | 6.416 |
Batch | None (all training data in considered) | 2000 | 0.001 | 0.565 | 5.704 | 0.534 | 6.416 |
Mini-Batch | 12 (batch size is considered) | 200 | 0.001 | 0.54 | 6.591 | 0.534 | 6.045 |
Batch | Stochastic | Mini-batch |
---|---|---|
Uses all training data to compute the gradient of cost function and then updates model parameters. So it takes the average of the gradients of all training examples and then uses that mean gradient to update model parameters. | Uses each record in the training sample to compute the gradient of cost function and update parameters right away. | Uses a set of ‘m’ records in the training sample to compute the gradient of the cost function and update parameters. The set of 'm' records is called a mini-batch. |
All training data in considered at once for gradient computation and parameter update. | Each record is considered for gradient computation and parameter update. | A set of 'm' records called 'mini-batch' is considered for gradient computation and parameter update. |
Pro - computationally efficient as all training data is considered at once | Pro - frequent updates allow us to have a pretty detailed rate of improvement | Pro - enjoys the advantages of both Batch and Stochastic |
Pro - produces a stable error gradient and a stable convergence | Pro - converges faster for larger datasets | 1. updates are less expensive and hence gradients are less noisy than stochastic |
Pro - converges directly to minima | Con - frequent updates to gradients are more computationally expensive than the batch gradient descent | 2. requires only a batch of 'm' records in memory unlike batch |
Con - requires the entire training dataset be in memory and available. So, not efficient for large datasets | Con - frequent updates can result in noisy gradients, which may cause the error rate to jump around instead of slowly decreasing | |
Con - stable error gradient can sometimes result in a state of convergence that isn’t the best the model can achieve | Con - never reaches the minima but it will keep dancing around it |
To summarize, in this notebook we created a simple linear regression model using various descent approaches from scratch. We saw how each approach requires choosing the right value of learning rate and epochs to generate the best results.
Note that gradient descent is a first-order optimization algorithm, which means it doesn’t take into account the second-order derivatives of the cost function. The gradient measures the steepness of the curve but the second derivative measures the curvature of the curve. The curvature of the function affects the size of each learning step.
Therefore, if:
As a result, the direction that looks promising to the gradient may not be so and may lead to slow the learning process or even diverge.