Logistic Regression using Mini-Batch Gradient Descent

image.png

Overview

Basics

Goal: The goal is to build a Logistic Regression model from scratch using Mini-Batch Gradient Descent approach.

Logistic Regression allows us to study relationships between binary response (0/1, Yes/No etc.) and continuous or categorical variables. The algorithm tries to find a straight line or a boundary to correctly separate the 2 classes. For e.g., logistic regression can be used by a bank to predict whether to approve or deny a loan application based on various factors such as income, debt, credit score etc.

Note: Logistic regression is considered a generalized linear model because the outcome always depends on the sum of the inputs and parameters. In other words, $z$ can be written as:

$z=w_{1} x_{1}+w_{2} x_{2}$

There is no interaction between the weights and parameter values ($w_{1} x_{1}*w_{2} x_{2}$) that would make a model non-linear.

Hypothesis

Hypothesis:

With modeling the value of response variable, we model the probability of success (yes). So, our hypothesis is:

$h_{\theta}(x) = probability\ of\ success = estimated\ probability\ that\ y=1\ given\ input\ x$.

So, the hypothesis can be written as: probability that y = 1, given $x$, parametrized by $\theta$

$h_{\theta}(x) = P(y = 1|x ; \theta$)
where $y = response\ \{0,1\}$, $x = input\ features$, $\theta = parameters$

Logistic regression hypothesis is represented by a sigmoid/logistic function:

$\begin{aligned} h_{\theta}(x) &=g(\theta^{T} x) \\ \text { where } g(z) &=\frac{1}{1+e^{-z}} \\ \text { hence } h_{\theta}(x) &=\frac{1}{1+e^{-\left(\theta^{T} x\right)}} \end{aligned}$

image.png The sigmoid function has values very close to either 0 or 1 across most of its domain. This makes it suitable for application in classification methods such as logistic regression. The function takes in any value and returns an output to be between 0 and 1. We can set a cut-off point (threshold) at 0.5 and say anything < 0.5 results in class 0 and anything >= 0.5 results in class 1.

Our goal is to predict:
$y=1$ when $h_{\theta}(x) \geqslant 0.5$ i.e. when $\theta^{T}x \geqslant 0$
$y=0$ when $h_{\theta}(x) < 0.5$ i.e. when $\theta^{T}x < 0$

The parameters we need to find are:

$\theta=\left\{\theta_{0}, \theta_{1}, \theta_{2}, \ldots \theta_{n}\right\}$

Cost Function

Cost Function gives an idea of how far the predicted values are from the actual values. The formula for logistic regression cost function is:

$J(\theta)=\frac{1}{m} \sum_{i=1}^{m} {\operatorname{cost}\left(h_{\theta}\left(x^{(i)}\right), y^{(i)}\right)}$

where, ${\operatorname{cost}\left(h_{\theta}(x), y\right)}=\left\{\begin{aligned}-\log \left(h_{\theta}(x)\right) & \text { if } y=1 \\-\log \left(1-h_{\theta}(x)\right) & \text { if } y=0 \end{aligned}\right.$

So, the cost function can be written as:

$J(\theta)= -\frac{1}{m}\left[\sum_{i=1}^{m} y^{(i)} \log h_{\theta}\left(x^{(i)}\right)+\left(1-y^{(i)}\right) \log \left(1-h_{\theta}\left(x^{(i)}\right)\right)\right]$

Gradient Descent

Mini-Batch Gradient Descent is an optimization technique that seeks to find model parameters (coefficients and bias) that minimize a cost function. In this technique, we repeatedly iterate through batches of 'm' training samples from the data and update the model parameters in accordance with the gradient of error.

Gradients can be calculated by taking a partial derivative of the cost function w.r.t each parameter $\theta$. The formula is:

$\theta_{j}:=\theta_{j}-\alpha \frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) \cdot x^{(i)}_{j}$

where $j = 0,...,n$, $m = no. of records$, $n = no. of features$, $\alpha = learning\ rate$

So, the gradients for various parameters can be written as:

$\theta_{0}:=\theta_{0}-\alpha \frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) \cdot x^{(i)}_{0}$
$\theta_{1}:=\theta_{1}-\alpha \frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) \cdot x^{(i)}_{1}$
$\theta_{2}:=\theta_{2}-\alpha \frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) \cdot x^{(i)}_{2}$
$\ldots$

image.png

Get Data

For this exercise, we will use the Breast Cancer toy dataset available in the scikit-learn library. We will:

  1. Read the data
  2. Identify dependent and independent variables
  3. Split the data into train and test sets
  4. Standardize the data
  5. Build model using mini-batch gradient descent
  6. Evaluate model performance

Load Data

In [119]:
# Basic Imports
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix, classification_report
In [75]:
# Read the data
cancer_data=pd.DataFrame(load_breast_cancer().data,columns=load_breast_cancer().feature_names)
cancer_data.head()
Out[75]:
mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry mean fractal dimension ... worst radius worst texture worst perimeter worst area worst smoothness worst compactness worst concavity worst concave points worst symmetry worst fractal dimension
0 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.3001 0.14710 0.2419 0.07871 ... 25.38 17.33 184.60 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890
1 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.0869 0.07017 0.1812 0.05667 ... 24.99 23.41 158.80 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902
2 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.1974 0.12790 0.2069 0.05999 ... 23.57 25.53 152.50 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758
3 11.42 20.38 77.58 386.1 0.14250 0.28390 0.2414 0.10520 0.2597 0.09744 ... 14.91 26.50 98.87 567.7 0.2098 0.8663 0.6869 0.2575 0.6638 0.17300
4 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.1980 0.10430 0.1809 0.05883 ... 22.54 16.67 152.20 1575.0 0.1374 0.2050 0.4000 0.1625 0.2364 0.07678

5 rows × 30 columns

In [76]:
# Create dependent and independent variables
Y=load_breast_cancer().target
display(f'Shape of Y: {Y.shape}')

X=load_breast_cancer().data
display(f'Shape of X: {X.shape}')
'Shape of Y: (569,)'
'Shape of X: (569, 30)'

Split and Standardize data

In [79]:
# Train test split
from sklearn.model_selection import train_test_split

# Split the data
x_train,x_test,y_train,y_test = train_test_split(X,Y,test_size=0.25,random_state=1234)

display(f'Shape of training data {x_train.shape}')
display(f'Shape of test data {x_test.shape}')
'Shape of training data (426, 30)'
'Shape of test data (143, 30)'
In [80]:
# Standardize the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.fit_transform(x_test)

Add Column for Intercept term

Here, we will add a column of ones ($X_0$) to accomodate for intercept term ($\theta_0$) in to represent $\theta_0 x_0$.

Note - The column of ones must be added after standardizing the data. If it is done before standardizing, the computatuion of Q0*X0 will change as this column will get standardized as well.

In [81]:
x_train = np.c_[np.ones(len(x_train)), x_train]
x_test = np.c_[np.ones(len(x_test)), x_test]
In [82]:
print(f'First 5 rows of x_train: \n {x_train[:2]}')
First 5 rows of x_train: 
 [[ 1.          0.56834943 -0.7455382   0.58009089  0.37053367  0.16248975
   0.75051504 -0.29232688  0.12077331  0.64546786  0.42409254 -0.8459362
  -1.33183264 -0.76246363 -0.59518062 -0.83573848 -0.40831834 -0.39989516
   0.02644473 -0.13791584 -0.04101054  0.13524938 -1.06324969  0.15967859
  -0.0411782  -0.41416553 -0.0124391  -0.31116881  0.14461744  0.39847515
   0.28446451]
 [ 1.         -0.28961453 -0.14718352 -0.26447524 -0.35755004  1.3705287
   0.35627046  0.40744592  0.62125191  1.15404923  0.71471915 -0.4349292
  -0.59897155 -0.48949416 -0.37623321 -0.18336028 -0.12288072 -0.10165446
   0.05631866 -0.37842054 -0.06070741 -0.13701334  0.34997575 -0.14949267
  -0.22447494  1.96801062  0.97681458  1.02736027  1.41557963  1.61872573
   1.78132724]]
In [83]:
x_train.shape
Out[83]:
(426, 31)

Mini Batch Gradient Descent

In mini-batch algorithm rather than using the complete data set, in every iteration we use a set of ‘m’ training examples called batch to compute the gradient of the cost function. What this means is that we do not calculate the gradients for each observation but for a group of observations which results in a faster optimization.

Build and Fit Model

Let's build a Linear Regression model using Mini-Batch Gradient Descent (MBGD). Here is how the algorithm works:

  1. Randomly initialize model parameters (bj)
  2. Pick a value for learning rate (alpha), number of epochs (iterations to iterate over) and batch size, then

    for each epoch(e):

     for each mini-batch (m):
    
         subset X and y per mini-bacth size
    
         - Forward Pass:
           Make predictions: y_pred = 1 / (1 + e^(-(Q_t * x))
           Calculate loss/error: -(1/len(Xm)) * sum(y * log(hQ_x) + (1-y) * log(1 - hQ_x))
    
         - Backward Pass:
           Calculate gradient for bj (D_bj): partial derivative of J(b) w.r.t J(bj)
    
         - Update Parameters:
           Update parameter bj: bj - alpha * D_bj
In [91]:
class Logistic_Regression():
    
    def __init__(self):
#         self.b0 = None
        self.bj = None
        
    
    def _sigmoid(self, X, weights):
        
        # Linear model = Q_t * x
        z = np.dot(X, weights)
        
        # Sigmoid function 1 / (1 + e^(-z))
        return 1/(1 + np.exp(-z))
    
    def _loss(self, y, h):
        '''
        a really small value 'epsilon' is added to avoid 
        overflow and divison by zero error for log
        loss = (-1/m) * sum(y * log(hQ_x) + (1-y) * log(1 - hQ_x))
        where hQ_x = 1/(1 + e^(-Q_t*x))
        '''
        epsilon = 1e-5
        los = (-1/len(y)) * np.sum(y * np.log(h + epsilon) + (1-y) * np.log(1-h+epsilon))
        return los

    
    def fit(self, X, y, learning_rate=0.01, epochs=100, batch_size=32, verbose=False):
        n_obs = X.shape[0]
        batch_loss = []
        epoch_loss = []
        
        # Initialize weights and bias
        self.bj = np.zeros(X.shape[1])  # shape = (ncols,) = (31,)
        
        for e in range(epochs):                
            loss_e = 0
            
            for i in range(0, n_obs, batch_size):
                # Subset data for batches
                X_new = X[i:i+batch_size]
                y_new = y[i:i+batch_size]
                
                # Calculate loss
                y_pred = self._sigmoid(X_new, self.bj)   # predictions
                loss = self._loss(y_new, y_pred)
                loss_e += loss
                batch_loss.append(loss)
                
                # Calculate gradients
                D_bj = (1/len(X_new)) * np.dot(y_pred - y_new, X_new)
                
                # Update parameters
                self.bj -= learning_rate * D_bj
                
            epoch_loss.append(loss_e)
            if verbose==True and e%1000==0:
                print(f'Epoch: {e}, Loss: {loss_e}')
                
        return self.bj, batch_loss, epoch_loss
                
        
    def predict(self, X, threshold):
        y_predicted = self._sigmoid(X, self.bj)  #make prediction
        
        # Assign prediction to a class: 
        # if pred >= threshold then 1 else 0 and return as an array
        y_predicted_cls = [1 if i >= threshold else 0 for i in y_predicted]
        return np.array(y_predicted_cls)

Plot Loss for different learning rates

Learning rate is extremely important for finding best model. Here, we iterate over various learning rates to build the models. Next, we plot the loss generated by different models.

In [92]:
epochs = 5000
batch = 32

model = Logistic_Regression()
losses = {}

for lr in [0.1, 0.3, 0.01, 0.03, 0.001, 0.003]:
    bj, batch_loss, epoch_loss = model.fit(x_train, y_train, epochs = epochs, 
                                               learning_rate=lr, batch_size=batch)
    losses[f'LR={str(lr)}'] = epoch_loss
    
# Plot loss
plt.figure(figsize=(15,8))    
xs = np.arange(len(epoch_loss))

for k, v in losses.items():
    plt.plot(xs, v, lw=3, label=f"{k}, Final loss = {v[-1]:.2f}")
    
plt.title('Loss per Epoch (MSE) for different learning rates', size=20)
plt.xlabel('Epochs', size=14)
plt.ylabel('Loss', size=14)
plt.legend(fontsize=12)
plt.show()

From the plot above, we can see that the losses generated by various models seem to be fairly close. If the learning rate is too high, the model may not converge. On the other hand, if its too low, the model may take longer to converge. Let's use the learning rate of 0.01 to build and fit a model and then evaluate its performance.

Build Best Model and Plot Loss

In [93]:
# Build and fit best LR model
alpha = 0.01
e = 5000
batch = 32

# Build model
log_model = Logistic_Regression()


# Fit Model
bj, batch_loss, epoch_loss = log_model.fit(x_train, y_train, learning_rate=alpha, 
                                                   epochs=e, batch_size=batch, verbose=True)
Epoch: 0, Loss: 8.23310208513671
Epoch: 1000, Loss: 0.4840312474055349
Epoch: 2000, Loss: 0.395579589621741
Epoch: 3000, Loss: 0.35424869722499774
Epoch: 4000, Loss: 0.32868245767105925
In [94]:
# Print bias and weights
print(f'Bias: {bj[0]}')
print(f'Weight Matrix: \n{bj[1:]}')
Bias: 0.6395366474906564
Weight Matrix: 
[-0.95871735 -1.05167035 -0.91601    -0.9716524  -0.21731968  0.47890513
 -1.2272293  -1.22554714 -0.08052077  0.92169111 -1.70174711  0.27163408
 -0.90702226 -1.54354609 -0.49318326  1.28487694 -0.14599505 -0.55228438
  0.64782947  0.97123319 -1.31642645 -1.82216761 -1.02641107 -1.37081966
 -1.720245    0.08425357 -1.47543613 -1.2299526  -1.39463277 -0.75310917]

Plot Loss

Batch Loss
In [95]:
# Plot total loss which shows loss at each iteration
iterations = ((len(x_train)//batch)+1) * e

plt.figure(figsize=(8,5))
plt.plot(np.arange(iterations), batch_loss)
plt.xlabel('Batches')
plt.ylabel('Loss')
plt.title('Loss at each Batch')
plt.show()
Epoch Loss
In [96]:
# Plot epoch loss which shows loss at each epoch
plt.figure(figsize=(8,5))
plt.plot(np.arange(e), epoch_loss)
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Loss at each Epoch')
plt.show()

Evaluate Model Performance

In this section, we will check the performance of our model on training and test data. We will also build a logistic regression model using scikit-learn library and compare performance.

Before we make predictions, let's define a function accuracy which returns the ratio of the number of correct predictions to the number of observations. We will use this function to determine the accuracy of out models on training and test sets.

In [98]:
# Define Accuracy
def accuracy(y_true, y_pred):
    acc = np.sum(y_true == y_pred) / len(y_true)
    return acc

Performance on Training data

In [97]:
# Create predictions
train_pred = log_model.predict(x_train, 0.5)
In [100]:
# Calculate accuracy
train_acc = accuracy(y_train, train_pred)
print(f'Accuracy on training set: {train_acc}')
Accuracy on training set: 0.9953051643192489
In [125]:
# Calculate metrics
cm_train = confusion_matrix(y_train, train_pred)
train_report = classification_report(y_train, train_pred)

print("Performance on training set:\n")
print(f'Confusion Matrix:\n {cm_train}\n')
print(f'Classification Report:\n {train_report}')
Performance on training set:

Confusion Matrix:
 [[156   1]
 [  1 268]]

Classification Report:
               precision    recall  f1-score   support

           0       0.99      0.99      0.99       157
           1       1.00      1.00      1.00       269

    accuracy                           1.00       426
   macro avg       0.99      0.99      0.99       426
weighted avg       1.00      1.00      1.00       426

Confusion Matrix shows the following:

  • True negatives (TN) (upper-left position): 156 observations are zeros predicted correctly.
  • False negatives (FN) (lower-left position): 1 observation is incorrectly predicted as zero.
  • False positives (FP) (upper-right position): 1 observation is incorrectly predicted as one.
  • True positives (TP) (lower-right position): 268 observations are ones predicted correctly.

Precision: measures how precise/accurate the model is and is the ratio of correctly predicted positive observations to the total predicted positive observations (TP / TP + FP). Precision is a good measure to determine, when the costs of False Positive is high.

Recall: is the ratio of correctly predicted positive observations to the actual positive observations (TP / TP + FN). Recall is a good measure to determine, when the costs of False Negative is high.

F1 Score: is the harmonic mean of precision and recall taking both metrics into account. Harmonic mean is used because it punishes extremen values. In cases where we want to create a balanced classification model with the optimal balance of recall and precision, then we try to maximize the F1 score.

Perofrmance on Test data

Evaluate model performance on test data.

In [101]:
# Create predictions on test set
test_pred = log_model.predict(x_test, 0.5)
In [102]:
# Calculate accuracy
test_acc = accuracy(y_test, test_pred)
print(f'Accuracy on test set: {test_acc}')
Accuracy on test set: 0.958041958041958
In [126]:
# Calculate metrics
cm_test = confusion_matrix(y_test, test_pred)
test_report = classification_report(y_test, test_pred)

print("Performance on test set:\n")
print(f'Confusion Matrix:\n {cm_test}\n')
print(f'Classification Report:\n {test_report}')
Performance on test set:

Confusion Matrix:
 [[51  4]
 [ 2 86]]

Classification Report:
               precision    recall  f1-score   support

           0       0.96      0.93      0.94        55
           1       0.96      0.98      0.97        88

    accuracy                           0.96       143
   macro avg       0.96      0.95      0.96       143
weighted avg       0.96      0.96      0.96       143

Confusion Matrix shows the following:

  • True negatives (TN) (upper-left position): 51 observations are zeros predicted correctly.
  • False negatives (FN) (lower-left position): 2 observations are incorrectly predicted as zero.
  • False positives (FP) (upper-right position): 4 observations are incorrectly predicted as one.
  • True positives (TP) (lower-right position): 86 observations are ones predicted correctly.

Comparison with Scikit Learn

Here, we build a Multiple Linear Regression model using LinearRegression from scikit-learn library and compare the results with model created from scratch.

In [103]:
# Build a model using scikit learn
from sklearn.linear_model import LogisticRegression
   
# fit the Logistic Regression
clf = LogisticRegression(fit_intercept=True, C=1e15)

clf.fit(x_train, y_train)
     
# predicting on training data-set
y_train_pred = clf.predict(x_train)
   
# predicting on test data-set
y_test_pred = clf.predict(x_test)
In [110]:
# Calculate accuracy
clf.score(x_test, y_test)
Out[110]:
0.958041958041958
In [127]:
# Calculate metrics
cm_test_scikit = confusion_matrix(y_test, y_test_pred)
test_report_scikit = classification_report(y_test, y_test_pred)

print("Performance on test set using Scikit learn:\n")
print(f'Confusion Matrix:\n {cm_test_scikit}\n')
print(f'Classification Report:\n {test_report_scikit}')
Performance on test set using Scikit learn:

Confusion Matrix:
 [[52  3]
 [ 3 85]]

Classification Report:
               precision    recall  f1-score   support

           0       0.95      0.95      0.95        55
           1       0.97      0.97      0.97        88

    accuracy                           0.96       143
   macro avg       0.96      0.96      0.96       143
weighted avg       0.96      0.96      0.96       143

Confusion Matrix shows the following:

  • True negatives (TN) (upper-left position): 52 observations are zeros predicted correctly.
  • False negatives (FN) (lower-left position): 3 observations are incorrectly predicted as zero.
  • False positives (FP) (upper-right position): 3 observations are incorrectly predicted as one.
  • True positives (TP) (lower-right position): 85 observations are ones predicted correctly.

Model Results

Accuracy Precision Recall F1 Score
0 1 0 1 0 1
Training Data 0.995 0.99 1 0.99 1 0.99 1
Test Data 0.958 0.96 0.96 0.93 0.98 0.94 0.97
Model using Scikit Learn 0.958 0.95 0.97 0.95 0.97 0.95 0.97

From the results comparison, we can see that the model created using mini-batch gradient descent fairs well when compared to the model created using scikit-learn library.

Summary

To summarize, in this notebook we created a logistic regression model using mini-batch gradient descent from scratch. We saw how iterating over different learning rates helped identify a learning rate with reduced loss to generate the best results. Later, we evaluated the results of our model on training and test data and compared performance with a model created using scikit-learn library.