Logistic Regression using Mini-Batch Gradient Descent¶

Overview¶

Basics¶

Goal: The goal is to build a Logistic Regression model from scratch using Mini-Batch Gradient Descent approach.

Logistic Regression allows us to study relationships between binary response (0/1, Yes/No etc.) and continuous or categorical variables. The algorithm tries to find a straight line or a boundary to correctly separate the 2 classes. For e.g., logistic regression can be used by a bank to predict whether to approve or deny a loan application based on various factors such as income, debt, credit score etc.

Note: Logistic regression is considered a generalized linear model because the outcome always depends on the sum of the inputs and parameters. In other words, $z$ can be written as:

$z=w_{1} x_{1}+w_{2} x_{2}$

There is no interaction between the weights and parameter values ($w_{1} x_{1}*w_{2} x_{2}$) that would make a model non-linear.

Hypothesis¶

Hypothesis:

With modeling the value of response variable, we model the probability of success (yes). So, our hypothesis is:

$h_{\theta}(x) = probability\ of\ success = estimated\ probability\ that\ y=1\ given\ input\ x$.

So, the hypothesis can be written as: probability that y = 1, given $x$, parametrized by $\theta$

$h_{\theta}(x) = P(y = 1|x ; \theta$)
where $y = response\ \{0,1\}$, $x = input\ features$, $\theta = parameters$

Logistic regression hypothesis is represented by a sigmoid/logistic function:

$\begin{aligned} h_{\theta}(x) &=g(\theta^{T} x) \\ \text { where } g(z) &=\frac{1}{1+e^{-z}} \\ \text { hence } h_{\theta}(x) &=\frac{1}{1+e^{-\left(\theta^{T} x\right)}} \end{aligned}$

The sigmoid function has values very close to either 0 or 1 across most of its domain. This makes it suitable for application in classification methods such as logistic regression. The function takes in any value and returns an output to be between 0 and 1. We can set a cut-off point (threshold) at 0.5 and say anything < 0.5 results in class 0 and anything >= 0.5 results in class 1.

Our goal is to predict:
$y=1$ when $h_{\theta}(x) \geqslant 0.5$ i.e. when $\theta^{T}x \geqslant 0$
$y=0$ when $h_{\theta}(x) < 0.5$ i.e. when $\theta^{T}x < 0$

The parameters we need to find are:

$\theta=\left\{\theta_{0}, \theta_{1}, \theta_{2}, \ldots \theta_{n}\right\}$

Cost Function¶

Cost Function gives an idea of how far the predicted values are from the actual values. The formula for logistic regression cost function is:

$J(\theta)=\frac{1}{m} \sum_{i=1}^{m} {\operatorname{cost}\left(h_{\theta}\left(x^{(i)}\right), y^{(i)}\right)}$

where, ${\operatorname{cost}\left(h_{\theta}(x), y\right)}=\left\{\begin{aligned}-\log \left(h_{\theta}(x)\right) & \text { if } y=1 \\-\log \left(1-h_{\theta}(x)\right) & \text { if } y=0 \end{aligned}\right.$

So, the cost function can be written as:

$J(\theta)= -\frac{1}{m}\left[\sum_{i=1}^{m} y^{(i)} \log h_{\theta}\left(x^{(i)}\right)+\left(1-y^{(i)}\right) \log \left(1-h_{\theta}\left(x^{(i)}\right)\right)\right]$

Gradient Descent¶

Mini-Batch Gradient Descent is an optimization technique that seeks to find model parameters (coefficients and bias) that minimize a cost function. In this technique, we repeatedly iterate through batches of 'm' training samples from the data and update the model parameters in accordance with the gradient of error.

Gradients can be calculated by taking a partial derivative of the cost function w.r.t each parameter $\theta$. The formula is:

$\theta_{j}:=\theta_{j}-\alpha \frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) \cdot x^{(i)}_{j}$

where $j = 0,...,n$, $m = no. of records$, $n = no. of features$, $\alpha = learning\ rate$

So, the gradients for various parameters can be written as:

$\theta_{0}:=\theta_{0}-\alpha \frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) \cdot x^{(i)}_{0}$
$\theta_{1}:=\theta_{1}-\alpha \frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) \cdot x^{(i)}_{1}$
$\theta_{2}:=\theta_{2}-\alpha \frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) \cdot x^{(i)}_{2}$
$\ldots$

Get Data¶

For this exercise, we will use the Breast Cancer toy dataset available in the scikit-learn library. We will:

Read the data
Identify dependent and independent variables
Split the data into train and test sets
Standardize the data
Build model using mini-batch gradient descent
Evaluate model performance

Load Data¶

# Basic Imports
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix, classification_report

# Read the data
cancer_data=pd.DataFrame(load_breast_cancer().data,columns=load_breast_cancer().feature_names)
cancer_data.head()

# Create dependent and independent variables
Y=load_breast_cancer().target
display(f'Shape of Y: {Y.shape}')

X=load_breast_cancer().data
display(f'Shape of X: {X.shape}')

'Shape of Y: (569,)'

'Shape of X: (569, 30)'

Split and Standardize data¶

# Train test split
from sklearn.model_selection import train_test_split

# Split the data
x_train,x_test,y_train,y_test = train_test_split(X,Y,test_size=0.25,random_state=1234)

display(f'Shape of training data {x_train.shape}')
display(f'Shape of test data {x_test.shape}')

'Shape of training data (426, 30)'

'Shape of test data (143, 30)'

# Standardize the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.fit_transform(x_test)

Add Column for Intercept term¶

Here, we will add a column of ones ($X_0$) to accomodate for intercept term ($\theta_0$) in to represent $\theta_0 x_0$.

Note - The column of ones must be added after standardizing the data. If it is done before standardizing, the computatuion of Q0*X0 will change as this column will get standardized as well.

x_train = np.c_[np.ones(len(x_train)), x_train]
x_test = np.c_[np.ones(len(x_test)), x_test]

print(f'First 5 rows of x_train: \n {x_train[:2]}')

First 5 rows of x_train: 
 [[ 1.          0.56834943 -0.7455382   0.58009089  0.37053367  0.16248975
   0.75051504 -0.29232688  0.12077331  0.64546786  0.42409254 -0.8459362
  -1.33183264 -0.76246363 -0.59518062 -0.83573848 -0.40831834 -0.39989516
   0.02644473 -0.13791584 -0.04101054  0.13524938 -1.06324969  0.15967859
  -0.0411782  -0.41416553 -0.0124391  -0.31116881  0.14461744  0.39847515
   0.28446451]
 [ 1.         -0.28961453 -0.14718352 -0.26447524 -0.35755004  1.3705287
   0.35627046  0.40744592  0.62125191  1.15404923  0.71471915 -0.4349292
  -0.59897155 -0.48949416 -0.37623321 -0.18336028 -0.12288072 -0.10165446
   0.05631866 -0.37842054 -0.06070741 -0.13701334  0.34997575 -0.14949267
  -0.22447494  1.96801062  0.97681458  1.02736027  1.41557963  1.61872573
   1.78132724]]

x_train.shape

(426, 31)

Mini Batch Gradient Descent¶

In mini-batch algorithm rather than using the complete data set, in every iteration we use a set of ‘m’ training examples called batch to compute the gradient of the cost function. What this means is that we do not calculate the gradients for each observation but for a group of observations which results in a faster optimization.

Build and Fit Model¶

Let's build a Linear Regression model using Mini-Batch Gradient Descent (MBGD). Here is how the algorithm works:

Randomly initialize model parameters (bj)

Pick a value for learning rate (alpha), number of epochs (iterations to iterate over) and batch size, then

for each epoch(e):

 for each mini-batch (m):

     subset X and y per mini-bacth size

     - Forward Pass:
       Make predictions: y_pred = 1 / (1 + e^(-(Q_t * x))
       Calculate loss/error: -(1/len(Xm)) * sum(y * log(hQ_x) + (1-y) * log(1 - hQ_x))

     - Backward Pass:
       Calculate gradient for bj (D_bj): partial derivative of J(b) w.r.t J(bj)

     - Update Parameters:
       Update parameter bj: bj - alpha * D_bj

class Logistic_Regression():
    
    def __init__(self):
#         self.b0 = None
        self.bj = None
        
    
    def _sigmoid(self, X, weights):
        
        # Linear model = Q_t * x
        z = np.dot(X, weights)
        
        # Sigmoid function 1 / (1 + e^(-z))
        return 1/(1 + np.exp(-z))
    
    def _loss(self, y, h):
        '''
        a really small value 'epsilon' is added to avoid 
        overflow and divison by zero error for log
        loss = (-1/m) * sum(y * log(hQ_x) + (1-y) * log(1 - hQ_x))
        where hQ_x = 1/(1 + e^(-Q_t*x))
        '''
        epsilon = 1e-5
        los = (-1/len(y)) * np.sum(y * np.log(h + epsilon) + (1-y) * np.log(1-h+epsilon))
        return los

    
    def fit(self, X, y, learning_rate=0.01, epochs=100, batch_size=32, verbose=False):
        n_obs = X.shape[0]
        batch_loss = []
        epoch_loss = []
        
        # Initialize weights and bias
        self.bj = np.zeros(X.shape[1])  # shape = (ncols,) = (31,)
        
        for e in range(epochs):                
            loss_e = 0
            
            for i in range(0, n_obs, batch_size):
                # Subset data for batches
                X_new = X[i:i+batch_size]
                y_new = y[i:i+batch_size]
                
                # Calculate loss
                y_pred = self._sigmoid(X_new, self.bj)   # predictions
                loss = self._loss(y_new, y_pred)
                loss_e += loss
                batch_loss.append(loss)
                
                # Calculate gradients
                D_bj = (1/len(X_new)) * np.dot(y_pred - y_new, X_new)
                
                # Update parameters
                self.bj -= learning_rate * D_bj
                
            epoch_loss.append(loss_e)
            if verbose==True and e%1000==0:
                print(f'Epoch: {e}, Loss: {loss_e}')
                
        return self.bj, batch_loss, epoch_loss
                
        
    def predict(self, X, threshold):
        y_predicted = self._sigmoid(X, self.bj)  #make prediction
        
        # Assign prediction to a class: 
        # if pred >= threshold then 1 else 0 and return as an array
        y_predicted_cls = [1 if i >= threshold else 0 for i in y_predicted]
        return np.array(y_predicted_cls)

Plot Loss for different learning rates¶

Learning rate is extremely important for finding best model. Here, we iterate over various learning rates to build the models. Next, we plot the loss generated by different models.

epochs = 5000
batch = 32

model = Logistic_Regression()
losses = {}

for lr in [0.1, 0.3, 0.01, 0.03, 0.001, 0.003]:
    bj, batch_loss, epoch_loss = model.fit(x_train, y_train, epochs = epochs, 
                                               learning_rate=lr, batch_size=batch)
    losses[f'LR={str(lr)}'] = epoch_loss
    
# Plot loss
plt.figure(figsize=(15,8))    
xs = np.arange(len(epoch_loss))

for k, v in losses.items():
    plt.plot(xs, v, lw=3, label=f"{k}, Final loss = {v[-1]:.2f}")
    
plt.title('Loss per Epoch (MSE) for different learning rates', size=20)
plt.xlabel('Epochs', size=14)
plt.ylabel('Loss', size=14)
plt.legend(fontsize=12)
plt.show()

From the plot above, we can see that the losses generated by various models seem to be fairly close. If the learning rate is too high, the model may not converge. On the other hand, if its too low, the model may take longer to converge. Let's use the learning rate of 0.01 to build and fit a model and then evaluate its performance.

Build Best Model and Plot Loss¶

# Build and fit best LR model
alpha = 0.01
e = 5000
batch = 32

# Build model
log_model = Logistic_Regression()


# Fit Model
bj, batch_loss, epoch_loss = log_model.fit(x_train, y_train, learning_rate=alpha, 
                                                   epochs=e, batch_size=batch, verbose=True)

Epoch: 0, Loss: 8.23310208513671
Epoch: 1000, Loss: 0.4840312474055349
Epoch: 2000, Loss: 0.395579589621741
Epoch: 3000, Loss: 0.35424869722499774
Epoch: 4000, Loss: 0.32868245767105925

# Print bias and weights
print(f'Bias: {bj[0]}')
print(f'Weight Matrix: \n{bj[1:]}')

Bias: 0.6395366474906564
Weight Matrix: 
[-0.95871735 -1.05167035 -0.91601    -0.9716524  -0.21731968  0.47890513
 -1.2272293  -1.22554714 -0.08052077  0.92169111 -1.70174711  0.27163408
 -0.90702226 -1.54354609 -0.49318326  1.28487694 -0.14599505 -0.55228438
  0.64782947  0.97123319 -1.31642645 -1.82216761 -1.02641107 -1.37081966
 -1.720245    0.08425357 -1.47543613 -1.2299526  -1.39463277 -0.75310917]

Plot Loss¶

Batch Loss¶

# Plot total loss which shows loss at each iteration
iterations = ((len(x_train)//batch)+1) * e

plt.figure(figsize=(8,5))
plt.plot(np.arange(iterations), batch_loss)
plt.xlabel('Batches')
plt.ylabel('Loss')
plt.title('Loss at each Batch')
plt.show()

Epoch Loss¶

# Plot epoch loss which shows loss at each epoch
plt.figure(figsize=(8,5))
plt.plot(np.arange(e), epoch_loss)
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Loss at each Epoch')
plt.show()

Evaluate Model Performance¶

In this section, we will check the performance of our model on training and test data. We will also build a logistic regression model using scikit-learn library and compare performance.

Before we make predictions, let's define a function accuracy which returns the ratio of the number of correct predictions to the number of observations. We will use this function to determine the accuracy of out models on training and test sets.

# Define Accuracy
def accuracy(y_true, y_pred):
    acc = np.sum(y_true == y_pred) / len(y_true)
    return acc

Performance on Training data¶

# Create predictions
train_pred = log_model.predict(x_train, 0.5)

# Calculate accuracy
train_acc = accuracy(y_train, train_pred)
print(f'Accuracy on training set: {train_acc}')

Accuracy on training set: 0.9953051643192489

# Calculate metrics
cm_train = confusion_matrix(y_train, train_pred)
train_report = classification_report(y_train, train_pred)

print("Performance on training set:\n")
print(f'Confusion Matrix:\n {cm_train}\n')
print(f'Classification Report:\n {train_report}')

Performance on training set:

Confusion Matrix:
 [[156   1]
 [  1 268]]

Classification Report:
               precision    recall  f1-score   support

           0       0.99      0.99      0.99       157
           1       1.00      1.00      1.00       269

    accuracy                           1.00       426
   macro avg       0.99      0.99      0.99       426
weighted avg       1.00      1.00      1.00       426

Confusion Matrix shows the following:

True negatives (TN) (upper-left position): 156 observations are zeros predicted correctly.
False negatives (FN) (lower-left position): 1 observation is incorrectly predicted as zero.
False positives (FP) (upper-right position): 1 observation is incorrectly predicted as one.
True positives (TP) (lower-right position): 268 observations are ones predicted correctly.

Precision: measures how precise/accurate the model is and is the ratio of correctly predicted positive observations to the total predicted positive observations (TP / TP + FP). Precision is a good measure to determine, when the costs of False Positive is high.

Recall: is the ratio of correctly predicted positive observations to the actual positive observations (TP / TP + FN). Recall is a good measure to determine, when the costs of False Negative is high.

F1 Score: is the harmonic mean of precision and recall taking both metrics into account. Harmonic mean is used because it punishes extremen values. In cases where we want to create a balanced classification model with the optimal balance of recall and precision, then we try to maximize the F1 score.

Perofrmance on Test data¶

Evaluate model performance on test data.

# Create predictions on test set
test_pred = log_model.predict(x_test, 0.5)

# Calculate accuracy
test_acc = accuracy(y_test, test_pred)
print(f'Accuracy on test set: {test_acc}')

Accuracy on test set: 0.958041958041958

# Calculate metrics
cm_test = confusion_matrix(y_test, test_pred)
test_report = classification_report(y_test, test_pred)

print("Performance on test set:\n")
print(f'Confusion Matrix:\n {cm_test}\n')
print(f'Classification Report:\n {test_report}')

Performance on test set:

Confusion Matrix:
 [[51  4]
 [ 2 86]]

Classification Report:
               precision    recall  f1-score   support

           0       0.96      0.93      0.94        55
           1       0.96      0.98      0.97        88

    accuracy                           0.96       143
   macro avg       0.96      0.95      0.96       143
weighted avg       0.96      0.96      0.96       143

Confusion Matrix shows the following:

True negatives (TN) (upper-left position): 51 observations are zeros predicted correctly.
False negatives (FN) (lower-left position): 2 observations are incorrectly predicted as zero.
False positives (FP) (upper-right position): 4 observations are incorrectly predicted as one.
True positives (TP) (lower-right position): 86 observations are ones predicted correctly.

Comparison with Scikit Learn¶

Here, we build a Multiple Linear Regression model using LinearRegression from scikit-learn library and compare the results with model created from scratch.

# Build a model using scikit learn
from sklearn.linear_model import LogisticRegression
   
# fit the Logistic Regression
clf = LogisticRegression(fit_intercept=True, C=1e15)

clf.fit(x_train, y_train)
     
# predicting on training data-set
y_train_pred = clf.predict(x_train)
   
# predicting on test data-set
y_test_pred = clf.predict(x_test)

# Calculate accuracy
clf.score(x_test, y_test)

0.958041958041958

# Calculate metrics
cm_test_scikit = confusion_matrix(y_test, y_test_pred)
test_report_scikit = classification_report(y_test, y_test_pred)

print("Performance on test set using Scikit learn:\n")
print(f'Confusion Matrix:\n {cm_test_scikit}\n')
print(f'Classification Report:\n {test_report_scikit}')

Performance on test set using Scikit learn:

Confusion Matrix:
 [[52  3]
 [ 3 85]]

Classification Report:
               precision    recall  f1-score   support

           0       0.95      0.95      0.95        55
           1       0.97      0.97      0.97        88

    accuracy                           0.96       143
   macro avg       0.96      0.96      0.96       143
weighted avg       0.96      0.96      0.96       143

Confusion Matrix shows the following:

True negatives (TN) (upper-left position): 52 observations are zeros predicted correctly.
False negatives (FN) (lower-left position): 3 observations are incorrectly predicted as zero.
False positives (FP) (upper-right position): 3 observations are incorrectly predicted as one.
True positives (TP) (lower-right position): 85 observations are ones predicted correctly.

Model Results¶

	Accuracy	Precision		Recall		F1 Score
		0	1	0	1	0	1
Training Data	0.995	0.99	1	0.99	1	0.99	1
Test Data	0.958	0.96	0.96	0.93	0.98	0.94	0.97
Model using Scikit Learn	0.958	0.95	0.97	0.95	0.97	0.95	0.97

From the results comparison, we can see that the model created using mini-batch gradient descent fairs well when compared to the model created using scikit-learn library.

Summary¶

To summarize, in this notebook we created a logistic regression model using mini-batch gradient descent from scratch. We saw how iterating over different learning rates helped identify a learning rate with reduced loss to generate the best results. Later, we evaluated the results of our model on training and test data and compared performance with a model created using scikit-learn library.

	mean radius	mean texture	mean perimeter	mean area	mean smoothness	mean compactness	mean concavity	mean concave points	mean symmetry	mean fractal dimension	...	worst radius	worst texture	worst perimeter	worst area	worst smoothness	worst compactness	worst concavity	worst concave points	worst symmetry	worst fractal dimension
0	17.99	10.38	122.80	1001.0	0.11840	0.27760	0.3001	0.14710	0.2419	0.07871	...	25.38	17.33	184.60	2019.0	0.1622	0.6656	0.7119	0.2654	0.4601	0.11890
1	20.57	17.77	132.90	1326.0	0.08474	0.07864	0.0869	0.07017	0.1812	0.05667	...	24.99	23.41	158.80	1956.0	0.1238	0.1866	0.2416	0.1860	0.2750	0.08902
2	19.69	21.25	130.00	1203.0	0.10960	0.15990	0.1974	0.12790	0.2069	0.05999	...	23.57	25.53	152.50	1709.0	0.1444	0.4245	0.4504	0.2430	0.3613	0.08758
3	11.42	20.38	77.58	386.1	0.14250	0.28390	0.2414	0.10520	0.2597	0.09744	...	14.91	26.50	98.87	567.7	0.2098	0.8663	0.6869	0.2575	0.6638	0.17300
4	20.29	14.34	135.10	1297.0	0.10030	0.13280	0.1980	0.10430	0.1809	0.05883	...	22.54	16.67	152.20	1575.0	0.1374	0.2050	0.4000	0.1625	0.2364	0.07678