Simple Linear Regression using various Gradient Descent Approaches

Overview

Goal: Build a Simple Linear Regression model from scratch using various Gradient Descent approaches.

The basic idea of a linear regression model is to find the best-fit line that models the linear relationship between a dependent and an independent variable. The best-fit line is obtained by finding the values of the slope (m) and intercept (c) parameters that minimize the sum of squared errors between the observed values (training data) and the predicted values (generated by the model).

A simple linear model can be represented as:

$h(\theta)=\theta_{0}+\theta_{1} X$

The parameters we need to find are:

$\theta_{0}$ and $\theta_{1}$

The cost function gives an idea of how far the predicted values are from the actual values. The formula is:

$J\left(\theta_{0}, \theta_{1} \right)=\frac{1}{m} \Sigma\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right)^{2}$

where $m$ is the number of records and $y^{(i)}$ are the true (observed) values.
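As a quick illustration of these two pieces, here is a minimal NumPy sketch of the hypothesis and the cost function (the names `predict` and `cost` and the tiny made-up dataset are purely illustrative, not part of the classes used later in this notebook):

```python
import numpy as np

def predict(theta0, theta1, X):
    """Hypothesis h(theta) = theta0 + theta1 * X for a single feature."""
    return theta0 + theta1 * X

def cost(theta0, theta1, X, y):
    """Cost J(theta0, theta1) = (1/m) * sum((h - y)**2)."""
    m = len(X)
    errors = predict(theta0, theta1, X) - y
    return np.sum(errors ** 2) / m

# Tiny made-up example: the line y = 2x fits these points exactly, so the cost is 0
X = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
print(cost(0.0, 2.0, X, y))   # 0.0
print(cost(0.0, 1.0, X, y))   # > 0, a worse fit
```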

Gradient Descent

Gradient Descent is an optimization technique that seeks to find model parameters (coefficients and bias) that minimize a cost function. In this technique, we repeatedly iterate through the training data and update the model parameters in the direction opposite to the gradient of the error with respect to those parameters.

The gradients are obtained by taking the partial derivative of the cost function w.r.t. each parameter $\theta$. The resulting update rules are:

$\theta_{0}:=\theta_{0}-\alpha \frac{2}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right)$

$\theta_{1}:=\theta_{1}-\alpha \frac{2}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) \cdot x^{(i)}$

where $\alpha$ is the learning rate and $m$ is the number of records.
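These update rules translate directly into vectorized NumPy. The sketch below applies one full-data update repeatedly; the names and the toy data are illustrative only, and the classes later in the notebook are the actual implementations used here:

```python
import numpy as np

def gradient_step(theta0, theta1, X, y, alpha):
    """One gradient-descent update of theta0 and theta1 over all m records."""
    m = len(X)
    y_pred = theta0 + theta1 * X
    # Partial derivatives of J = (1/m) * sum((h - y)**2)
    d_theta0 = (2 / m) * np.sum(y_pred - y)
    d_theta1 = (2 / m) * np.sum((y_pred - y) * X)
    return theta0 - alpha * d_theta0, theta1 - alpha * d_theta1

# Repeated steps drive the parameters toward the least-squares solution (about (0, 2) here)
theta0, theta1 = 0.0, 0.0
X = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
for _ in range(500):
    theta0, theta1 = gradient_step(theta0, theta1, X, y, alpha=0.05)
print(theta0, theta1)
```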

We will explore three types of gradient descent algorithms:

  1. Stochastic Gradient Descent: uses each training sample to compute the gradient of the cost function and updates the parameters right away.
  2. Batch Gradient Descent: uses all the training data to compute the gradient of the cost function and then updates the parameters.
  3. Mini-Batch Gradient Descent: uses a small batch of training samples to compute the gradient of the cost function and then updates the parameters.


For this exercise, we will use the Boston Housing Prices toy dataset from the scikit-learn library (load_boston, which was deprecated in scikit-learn 1.0 and removed in 1.2, so running these cells as-is requires an older version). We will:

  1. Read the data
  2. Identify dependent (Price) and independent (LSTAT - % lower status of the population) variables
  3. Split the data into train and test sets
  4. Standardize the data
  5. Build model using various gradient descent approaches

Load Data

In [2]:
# Basic Imports
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.metrics import mean_squared_error, r2_score
In [3]:
# Read the data
from sklearn.datasets import load_boston
boston_data=pd.DataFrame(load_boston().data,columns=load_boston().feature_names)
boston_data.head()
Out[3]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33
In [4]:
# Create dependent (Price) and independent (LSTAT - % lower status of the population) variables
Y=load_boston().target
display(f'Head of Y: {Y[:5]}')
display(f'Shape of Y: {Y.shape}')

X=load_boston().data[:,12]
display(f'Head of X: {X[:5]}')
display(f'Shape of X: {X.shape}')
'Head of Y: [24.  21.6 34.7 33.4 36.2]'
'Shape of Y: (506,)'
'Head of X: [4.98 9.14 4.03 2.94 5.33]'
'Shape of X: (506,)'
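As noted above, load_boston is gone from recent scikit-learn releases. A hedged alternative, based on the snippet shown in scikit-learn's own deprecation notice, fetches the raw data from the original StatLib source and rebuilds the same X and Y (assuming the URL is still reachable):

```python
import numpy as np
import pandas as pd

# The raw file spreads each record over two physical rows
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)

data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])  # 13 feature columns
target = raw_df.values[1::2, 2]                                     # house prices (MEDV)

Y = target
X = data[:, 12]   # LSTAT is the 13th feature column (index 12)
```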
In [5]:
# Plot X and Y
plt.scatter(X,Y)
plt.ylabel('Price')
plt.xlabel('LSTAT')
plt.title('Variation of Price and LSTAT')
plt.show()

Split and Standardize data

In [6]:
# Train test split
from sklearn.model_selection import train_test_split

# Split the data
x_train,x_test,y_train,y_test=train_test_split(X,Y,test_size=0.3,random_state=101)

# Reshape since the data has only one feature
x_train = x_train.reshape(-1, 1)
x_test = x_test.reshape(-1, 1)

display(x_train.shape)
(354, 1)
In [7]:
# Standardize the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)

Stochastic Gradient Descent

Stochastic Gradient Descent performs a forward pass on one training sample at a time and updates the parameters immediately afterwards. In other words, the algorithm uses each training sample to compute the gradient of the cost function and updates the parameters right away. Depending on the problem, this can make SGD faster than batch gradient descent.

One advantage is that the frequent updates give us a fairly detailed picture of the rate of improvement. Those frequent updates are, however, more computationally expensive than the batch gradient descent approach, and they can result in noisy gradients, which may cause the error rate to jump around instead of decreasing steadily.
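One practical detail: SGD is usually run over the training samples in a freshly shuffled order each epoch, so the noisy updates do not follow the same pattern on every pass. The class below visits the rows in their stored order; a shuffled epoch could look like this rough sketch (it assumes x_train and y_train from the split above, and its per-sample step is not scaled by 1/n the way the class below scales it):

```python
import numpy as np

rng = np.random.default_rng(0)
b0, b1 = rng.random(1), rng.random(1)   # random initial parameters
alpha, epochs = 0.01, 20

for epoch in range(epochs):
    # Visit the training samples in a freshly shuffled order each epoch
    for i in rng.permutation(len(x_train)):
        error = y_train[i] - (b0 + b1 * x_train[i])   # forward pass on one sample
        b0 -= alpha * (-2) * error                    # per-sample gradient for b0
        b1 -= alpha * (-2) * error * x_train[i]       # per-sample gradient for b1
```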

Build and Fit Model

Here we will build a Linear Regression model using Stochastic Gradient Descent (SGD). Here is how the algorithm works:

  1. Randomly initialize model parameters (b0 and b1)
  2. Pick a value for the learning rate (alpha) and the number of epochs (passes over the training data). The learning rate determines how big a step is taken on each update. Then,

    for each epoch (e):

     for each training sample (i):
         - Forward Pass:
           Make a prediction: y_pred = b0 + b1*Xi
           Calculate the loss/error: (yi - y_pred)**2

         - Backward Pass:
           Calculate the gradient for b0 (D_b0): partial derivative of the loss w.r.t b0
           Calculate the gradient for b1 (D_b1): partial derivative of the loss w.r.t b1

         - Update Parameters:
           Update parameter b0: b0 = b0 - alpha * D_b0
           Update parameter b1: b1 = b1 - alpha * D_b1
In [8]:
class LR_SGD():
    def __init__(self):
        # Initialize parameters with random weights
        self.b0 = np.random.rand(1)
        self.b1 = np.random.rand(1)
        
    def fit(self, X, y, learning_rate=0.01, epochs=100):
        n = X.shape[0]
        loss_epoch = []  # store total loss after each epoch
        total_loss = []  # store loss after each update
        
        for e in range(epochs):  # iterate over epochs
            epoch_loss = 0
            
            for i in range(n):   # iterate over each row in data
                
                y_pred = self.b0 + self.b1*X[i]    # predicted value
                
                error = np.sum((y[i] - y_pred)**2)  #error=(observed-predicted)**2
                epoch_loss += error
                total_loss.append(error)  # store loss after each update
                
                # Calculate the gradients for this sample
                # (the 1/n factor keeps the per-sample step small, effectively scaling down the learning rate)
                D_b0 = (-2/n) * np.sum(y[i] - y_pred)
                D_b1 = (-2/n) * np.dot(X[i], y[i] - y_pred)
                
                # Update parameters
                self.b0 -= learning_rate * D_b0
                self.b1 -= learning_rate * D_b1
            
            loss_epoch.append(epoch_loss)  # store total loss after each epoch
            print(f'Epoch: {e}, epoch loss: {epoch_loss}')
    
        return self.b0, self.b1, total_loss, loss_epoch
    
    def predict(self, X):
        return self.b0 + self.b1*X
            
In [9]:
# Build and fit LR model
alpha = 0.1
e = 20

# Build model
lr_sgd_model = LR_SGD()

# Fit Model
b0, b1, total_loss, loss_epoch = lr_sgd_model.fit(x_train,y_train,learning_rate=alpha, epochs=e)
Epoch: 0, epoch loss: 163053.5993787917
Epoch: 1, epoch loss: 113358.41135641943
Epoch: 2, epoch loss: 80039.72833396995
Epoch: 3, epoch loss: 57698.8243880432
Epoch: 4, epoch loss: 42717.10125751486
Epoch: 5, epoch loss: 32669.05816695833
Epoch: 6, epoch loss: 25928.85787334445
Epoch: 7, epoch loss: 21406.64079167891
Epoch: 8, epoch loss: 18371.796607963908
Epoch: 9, epoch loss: 16334.515344065194
Epoch: 10, epoch loss: 14966.39794281587
Epoch: 11, epoch loss: 14047.245185473575
Epoch: 12, epoch loss: 13429.39193668331
Epoch: 13, epoch loss: 13013.800822705562
Epoch: 14, epoch loss: 12734.037573565676
Epoch: 15, epoch loss: 12545.529084120866
Epoch: 16, epoch loss: 12418.362320410624
Epoch: 17, epoch loss: 12332.456612327487
Epoch: 18, epoch loss: 12274.326852646607
Epoch: 19, epoch loss: 12234.913141758334

Plot Loss

In [10]:
# Plot total loss which shows loss at each iteration
plt.figure(figsize=(15,10))
plt.plot(np.arange(len(x_train)*e), total_loss)
plt.xlabel('Iterations')
plt.ylabel('Loss')
plt.title('Loss at each iteration of training sample')
plt.show()
In [11]:
# Plot epoch loss which shows loss at each epoch
plt.figure(figsize=(8,5))
plt.plot(np.arange(e), loss_epoch)
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Loss at each Epoch')
plt.show()

Evaluate Model Performance

Performance on training data

Check model performance on training data.

In [73]:
# Create predictions on training data and calculate metrics
train_predictions = lr_sgd_model.predict(x_train)

train_r2 = r2_score(y_train, train_predictions)
train_mse = mean_squared_error(y_train, train_predictions)
train_rmse = np.sqrt(train_mse)

print(f'Training R2 is: {train_r2}')
print(f'Training RMSE is: {train_rmse}')
Training R2 is: 0.5525133881416271
Training RMSE is: 5.870662067069559
In [74]:
# Plot model fit
plt.scatter(x_train,y_train)
plt.plot(x_train, train_predictions, color='r')
plt.ylabel('Price')
plt.xlabel('LSTAT')
plt.title('Model fit on training data')
plt.show()

Performance on test data

Evaluate model performance on test data.

In [12]:
# Create predictions on test set
test_predictions = lr_sgd_model.predict(x_test)
In [13]:
# Calculate metrics
test_r2 = r2_score(y_test, test_predictions)
test_mse = mean_squared_error(y_test, test_predictions)
test_rmse = np.sqrt(test_mse)

print("Using Stochastic Gradient Descent:")
print(f'R2 is: {test_r2}')
print(f'RMSE is: {test_rmse}')
Using Stochastic Gradient Descent:
R2 is: 0.5048369349839705
RMSE is: 7.010703520024047
In [72]:
# Plot Predictions
plt.scatter(x_test,y_test)
plt.ylabel('Price')
plt.xlabel('LSTAT')
plt.plot(x_test, test_predictions, color='r', label='y={:.2f}+{:.2f}x'.format(b0[0], b1[0]))
plt.legend()
plt.title('Model fit on test data')
plt.show()

Batch Gradient Descent

Batch Gradient Descent uses all of the training data for a forward pass and updates the parameters only after the entire dataset has been considered. In other words, the algorithm computes the gradient for every training example, averages those gradients, and uses the mean gradient to update the parameters. One full cycle through the training data is called a training epoch.

Some advantages of batch gradient descent are that it is computationally efficient and produces a stable error gradient and stable convergence. Some disadvantages are that the stable error gradient can sometimes result in a state of convergence that isn't the best the model can achieve, and that it requires the entire training dataset to be in memory and available to the algorithm.

Build and Fit Model

Let's build a Linear Regression model using Batch Gradient Descent (BGD). Here is how the algorithm works:

  1. Randomly initialize model parameters (b0 and b1)
  2. Pick a value for the learning rate (alpha) and the number of epochs (passes over the training data), then

    for each epoch (e):

     - Forward Pass:
       Make predictions: y_pred = b0 + b1*X
       Calculate the loss/error: sum((y - y_pred)**2) / len(training data)

     - Backward Pass:
       Calculate the gradient for b0 (D_b0): partial derivative of J(b) w.r.t b0
       Calculate the gradient for b1 (D_b1): partial derivative of J(b) w.r.t b1

     - Update Parameters:
       Update parameter b0: b0 = b0 - alpha * D_b0
       Update parameter b1: b1 = b1 - alpha * D_b1
In [75]:
class LR_BGD():
    def __init__(self):
        # Initialize parameters
        self.b0 = np.random.rand(1)
        self.b1 = np.random.rand(1)
        
    def fit(self, X, y, learning_rate=0.01, epochs=100):
        n = X.shape[0]
        total_loss = []    # store loss after each epoch
        
        for e in range(epochs):    # iterate over epochs
            y_pred = self.b0 + np.dot(X, self.b1)   # predicted value
            loss = (1/n) * np.sum((y - y_pred)**2)  # error = (1/n) * sum((observed-predicted)**2)
            total_loss.append(loss)   # append loss to list
            
            # Calculate gradients
            D_b0 = (-2/n) * np.sum(y - y_pred)
            D_b1 = (-2/n) * np.dot(y-y_pred, X)
            
            # Update parameters
            self.b0 -= learning_rate * D_b0
            self.b1 -= learning_rate * D_b1
            
            if e%100 == 0:
                print(f'Epoch: {e}, Loss: {loss}')
        
        # return parameters and loss
        return self.b0, self.b1, total_loss
    
    def predict(self, X):
        return self.b0 + np.dot(X, self.b1)
In [76]:
# Build and fit LR model
alpha = 0.001
e = 2000

# Build model
lr_bgd_model = LR_BGD()

# Fit Model
b0, b1, total_loss = lr_bgd_model.fit(x_train,y_train,learning_rate=alpha, epochs=e)
Epoch: 0, Loss: 554.9963104421455
Epoch: 100, Loss: 383.18232570628436
Epoch: 200, Loss: 268.0580879712949
Epoch: 300, Loss: 190.91890669662845
Epoch: 400, Loss: 139.23167380112363
Epoch: 500, Loss: 104.598559989848
Epoch: 600, Loss: 81.39258619183721
Epoch: 700, Loss: 65.84338600012236
Epoch: 800, Loss: 55.424619319331406
Epoch: 900, Loss: 48.443507891709544
Epoch: 1000, Loss: 43.765802913947915
Epoch: 1100, Loss: 40.63149914500939
Epoch: 1200, Loss: 38.531353846687566
Epoch: 1300, Loss: 37.12414810046313
Epoch: 1400, Loss: 36.18124761934428
Epoch: 1500, Loss: 35.54945563037643
Epoch: 1600, Loss: 35.12612238862191
Epoch: 1700, Loss: 34.84246726683541
Epoch: 1800, Loss: 34.65240369473739
Epoch: 1900, Loss: 34.5250512915403

Plot Loss

In [77]:
# Plot epoch loss which shows loss at each epoch
plt.figure(figsize=(8,5))
plt.plot(np.arange(e), total_loss)
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Loss at each Epoch')
plt.show()

Evaluate Model Performance

Performance on training data

Check model performance on training data.

In [81]:
# Create predictions on training data and calculate metrics
train_pred_bgd = lr_bgd_model.predict(x_train)

train_r2_bgd = r2_score(y_train, train_pred_bgd)
train_mse_bgd = mean_squared_error(y_train, train_pred_bgd)
train_rmse_bgd = np.sqrt(train_mse_bgd)

print(f'Training R2 is: {train_r2_bgd}')
print(f'Training RMSE is: {train_rmse_bgd}')
Training R2 is: 0.552837395379034
Training RMSE is: 5.868536325887803
In [82]:
# Plot model fit
plt.scatter(x_train,y_train)
plt.plot(x_train, train_pred_bgd, color='r')
plt.ylabel('Price')
plt.xlabel('LSTAT')
plt.title('Model fit on training data')
plt.show()

Performance on test data

Evaluate model performance on test data.

In [78]:
# Create predictions on test set
test_pred_bgd = lr_bgd_model.predict(x_test)
In [79]:
# Calculate metrics
test_r2_bgd = r2_score(y_test, test_pred_bgd)
test_mse_bgd = mean_squared_error(y_test, test_pred_bgd)
test_rmse_bgd = np.sqrt(test_mse_bgd)

print("Using Batch Gradient Descent:")
print(f'Test R2 is: {test_r2_bgd}')
print(f'Test RMSE is: {test_rmse_bgd}')
Using Batch Gradient Descent:
Test R2 is: 0.5055082248531583
Test RMSE is: 7.005949722022083
In [80]:
# Plot Predictions
plt.scatter(x_test,y_test)
plt.ylabel('Price')
plt.xlabel('LSTAT')
plt.plot(x_test, test_pred_bgd, color='r', label='y={:.2f}+{:.2f}x'.format(b0[0], b1[0]))
plt.legend()
plt.title('Model fit on test data')
plt.show()

Mini Batch Gradient Descent

In the mini-batch algorithm, rather than using the complete dataset, in every iteration we use a set of 'm' training examples, called a batch, to compute the gradient of the cost function. This means we calculate the gradients not for each individual observation and not for the whole dataset at once, but for a small group of observations, which typically results in faster optimization. A sketch of how such batches can be formed follows below.
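The batching itself is just index arithmetic. Here is a small sketch of how shuffled mini-batches can be carved out of the training set; it is illustrative only, assumes x_train and y_train from the earlier split, and the class below slices the data in its stored order instead:

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size = 30

# Shuffle the row indices once per epoch, then cut them into consecutive batches
indices = rng.permutation(len(x_train))
for start in range(0, len(x_train), batch_size):
    batch_idx = indices[start:start + batch_size]
    X_batch, y_batch = x_train[batch_idx], y_train[batch_idx]
    # ...compute gradients on (X_batch, y_batch) and update b0, b1 exactly as in the class below
```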

Build and Fit Model

Let's build a Linear Regression model using Mini-Batch Gradient Descent (MBGD). Here is how the algorithm works:

  1. Randomly initialize model parameters (b0 and b1)
  2. Pick a value for the learning rate (alpha), the number of epochs (passes over the training data) and the batch size, then

    for each epoch(e):

     for each mini-batch (m):

         subset X and y for the mini-batch

         - Forward Pass:
           Make predictions: y_pred = b0 + b1*Xm
           Calculate the loss/error: sum((ym - y_pred)**2) / len(Xm)

         - Backward Pass:
           Calculate the gradient for b0 (D_b0): partial derivative of J(b) w.r.t b0
           Calculate the gradient for b1 (D_b1): partial derivative of J(b) w.r.t b1

         - Update Parameters:
           Update parameter b0: b0 = b0 - alpha * D_b0
           Update parameter b1: b1 = b1 - alpha * D_b1
In [83]:
class LR_MBGD():
    def __init__(self):
        self.b0 = np.random.rand(1)
        self.b1 = np.random.rand(1)
        
    def fit(self, X, y, learning_rate=0.01, epochs=100, batch_size=50):
        n = X.shape[0]
        batch_loss = []
        epoch_loss = []
        
        for e in range(epochs):   # iterate over epochs
            e_loss = 0
            
            for i in range(0, n, batch_size):   # iterate over each batch
                X_new = X[i:i+batch_size]   # create X per batch size
                y_new = y[i:i+batch_size]   # create y per batch size
                
                y_pred = self.b0 + np.dot(X_new, self.b1)   # get prediction
                loss = (1/len(X_new)) * np.sum((y_new - y_pred)**2)  # error = (1/n) * sum((observed-predicted)**2)
                e_loss += loss
                batch_loss.append(loss)
                
                # Calculate the gradients for this mini-batch
                D_b0 = (-2/len(X_new)) * np.sum(y_new - y_pred)
                D_b1 = (-2/len(X_new)) * np.dot(y_new - y_pred, X_new)
                
                # Update parameters
                self.b0 -= learning_rate * D_b0
                self.b1 -= learning_rate * D_b1

                
            epoch_loss.append(e_loss)
            if e%10 == 0:
                print(f'Epoch: {e}, Loss: {e_loss}')
        
        return self.b0, self.b1, batch_loss, epoch_loss
    
    def predict(self, X):
        return self.b0 + self.b1*X
            
In [84]:
# Build and fit LR model
alpha = 0.001
e = 200
batch = 30

# Build model
lr_mbgd_model = LR_MBGD()

# Fit Model
b0, b1, batch_loss, epoch_loss = lr_mbgd_model.fit(x_train, y_train, learning_rate=alpha, 
                                                   epochs=e, batch_size=batch)
Epoch: 0, Loss: 6507.575730348718
Epoch: 10, Loss: 4178.928961216095
Epoch: 20, Loss: 2739.206562173148
Epoch: 30, Loss: 1849.0625690531926
Epoch: 40, Loss: 1298.6987762920467
Epoch: 50, Loss: 958.4083543098174
Epoch: 60, Loss: 748.0001507613397
Epoch: 70, Loss: 617.8956763803964
Epoch: 80, Loss: 537.4425495384049
Epoch: 90, Loss: 487.68941060505233
Epoch: 100, Loss: 456.91907820424836
Epoch: 110, Loss: 437.88695313721087
Epoch: 120, Loss: 426.1136720228275
Epoch: 130, Loss: 418.82954042325593
Epoch: 140, Loss: 414.3219235789203
Epoch: 150, Loss: 411.5317636389716
Epoch: 160, Loss: 409.8041189314743
Epoch: 170, Loss: 408.73392777001015
Epoch: 180, Loss: 408.07064560852007
Epoch: 190, Loss: 407.65928153773666

Plot Loss

Batch Loss
In [85]:
# Plot total loss which shows loss at each iteration
iterations = ((len(x_train)//batch)+1) * e

plt.figure(figsize=(15,10))
plt.plot(np.arange(iterations), batch_loss)
plt.xlabel('Batches')
plt.ylabel('Loss')
plt.title('Loss at each Batch')
plt.show()
Epoch Loss
In [86]:
# Plot epoch loss which shows loss at each epoch
plt.figure(figsize=(8,5))
plt.plot(np.arange(e), epoch_loss)
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Loss at each Epoch')
plt.show()

Evaluate Model Performance

In this section, we will check the performance of our model on training and test data.

Performance on training data

Check model performance on training data.

In [90]:
# Create predictions on training data and calculate metrics
train_pred_mbgd = lr_mbgd_model.predict(x_train)

train_r2_mbgd = r2_score(y_train, train_pred_mbgd)
train_mse_mbgd = mean_squared_error(y_train, train_pred_mbgd)
train_rmse_mbgd = np.sqrt(train_mse_mbgd)

print(f'Training R2 is: {train_r2_mbgd}')
print(f'Training RMSE is: {train_rmse_mbgd}')
Training R2 is: 0.5544928128105445
Training RMSE is: 5.857663451962277
In [91]:
# Model fit on training data
plt.scatter(x_train,y_train)
plt.plot(x_train, train_pred_mbgd, color='r')
plt.ylabel('Price')
plt.xlabel('LSTAT')
plt.title('Model fit on training data')
plt.show()

Performance on test data

Evaluate model performance on test data.

In [87]:
# Create predictions on test set
test_pred_mbgd = lr_mbgd_model.predict(x_test)
In [88]:
# Calculate metrics
test_r2_mbgd = r2_score(y_test, test_pred_mbgd)
test_mse_mbgd = mean_squared_error(y_test, test_pred_mbgd)
test_rmse_mbgd = np.sqrt(test_mse_mbgd)

print("Using Mini-Batch Gradient Descent:")
print(f'Test R2 is: {test_r2_mbgd}')
print(f'Test RMSE is: {test_rmse_mbgd}')
Using Mini-Batch Gradient Descent:
Test R2 is: 0.5093411537588715
Test RMSE is: 6.978744470207237
In [89]:
# Plot Predictions
plt.scatter(x_test,y_test)
plt.ylabel('Price')
plt.xlabel('LSTAT')
plt.plot(x_test, test_pred_mbgd, color='r', label='y={:.2f}+{:.2f}x'.format(b0[0], b1[0]))
plt.legend()
plt.title('Model fit on test data')
plt.show()

Model Results

| Gradient Descent Type | Iterations (per epoch) | Epochs | Learning Rate | Test R2 | Test RMSE | Training R2 | Training RMSE |
|---|---|---|---|---|---|---|---|
| Stochastic | 354 (one per training record) | 20 | 0.1 | 0.505 | 7.011 | 0.553 | 5.871 |
| Batch | 1 (all the training data is considered at once) | 2000 | 0.001 | 0.506 | 7.006 | 0.553 | 5.869 |
| Mini-Batch | 12 (one per mini-batch of 30 records) | 200 | 0.001 | 0.509 | 6.979 | 0.554 | 5.858 |

Comparison: Batch vs Stochastic vs Mini-batch

| Batch | Stochastic | Mini-batch |
|---|---|---|
| Uses all the training data to compute the gradient of the cost function and then updates the model parameters. It takes the average of the gradients of all training examples and uses that mean gradient to update the parameters. | Uses each record in the training sample to compute the gradient of the cost function and updates the parameters right away. | Uses a set of 'm' records in the training sample to compute the gradient of the cost function and update the parameters. The set of 'm' records is called a mini-batch. |
| All the training data is considered at once for gradient computation and parameter update. | Each record is considered for gradient computation and parameter update. | A set of 'm' records, called a mini-batch, is considered for gradient computation and parameter update. |
| Pro - computationally efficient, as all the training data is considered at once | Pro - frequent updates give a detailed picture of the rate of improvement | Pro - enjoys the advantages of both Batch and Stochastic: |
| Pro - produces a stable error gradient and stable convergence | Pro - converges faster for larger datasets | 1. updates are less expensive, so gradients are less noisy than Stochastic |
| Pro - converges directly to the minimum | Con - frequent updates are more computationally expensive than the batch gradient descent approach | 2. only a batch of 'm' records needs to be in memory, unlike Batch |
| Con - requires the entire training dataset to be in memory and available, so it is not efficient for large datasets | Con - frequent updates can result in noisy gradients, which may cause the error rate to jump around instead of decreasing steadily | |
| Con - the stable error gradient can sometimes result in a state of convergence that isn't the best the model can achieve | Con - never reaches the minimum but keeps dancing around it | |

Summary

To summarize, in this notebook we built a simple linear regression model from scratch using various gradient descent approaches. We also saw how each approach requires choosing a suitable learning rate and number of epochs to get good results.
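As a sanity check on the hand-rolled models, the results can be compared against the closed-form least-squares fit. Here is a hedged sketch using scikit-learn's LinearRegression on the same standardized split from above; its coefficients and test metrics should land close to the gradient-descent numbers reported earlier:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

ols = LinearRegression().fit(x_train, y_train)
print(f'Intercept (b0): {ols.intercept_}, Slope (b1): {ols.coef_[0]}')

ols_pred = ols.predict(x_test)
print(f'Test R2 is: {r2_score(y_test, ols_pred)}')
print(f'Test RMSE is: {np.sqrt(mean_squared_error(y_test, ols_pred))}')
```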

Challenges

Note that gradient descent is a first-order optimization algorithm, which means it doesn’t take into account the second-order derivatives of the cost function. The gradient measures the steepness of the curve but the second derivative measures the curvature of the curve. The curvature of the function affects the size of each learning step.

Therefore, if:

  • Second derivative = 0 → the curvature is linear. Therefore, the step size = the learning rate α.
  • Second derivative > 0 → the curvature is going upward. Therefore, the step size < the learning rate α and may lead to divergence.
  • Second derivative < 0 → the curvature is going downward. Therefore, the step size > the learning rate α.

As a result, a direction that looks promising based on the gradient alone may not actually be, which can slow down the learning process or even cause it to diverge.
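A quick one-dimensional illustration of how curvature interacts with the learning rate: on f(x) = c*x**2 the second derivative is 2c, and plain gradient descent diverges once alpha exceeds 2 divided by that second derivative (i.e. 1/c). The values of c and alpha below are made up purely for the example:

```python
def run_gd(c, alpha, steps=20, x=1.0):
    """Gradient descent on f(x) = c * x**2, whose gradient is f'(x) = 2*c*x."""
    for _ in range(steps):
        x = x - alpha * 2 * c * x
    return x

# Stable: alpha is below 2 / f''(x) = 1/c, so x shrinks toward the minimum at 0
print(run_gd(c=5.0, alpha=0.05))   # roughly 1e-6 after 20 steps

# Divergent: alpha is above 1/c, so every step overshoots further from the minimum
print(run_gd(c=5.0, alpha=0.25))   # grows with every step
```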