Goal: The goal is to build a Logistic Regression model from scratch using Mini-Batch Gradient Descent approach.
Logistic Regression allows us to study relationships between binary response (0/1, Yes/No etc.) and continuous or categorical variables. The algorithm tries to find a straight line or a boundary to correctly separate the 2 classes. For e.g., logistic regression can be used by a bank to predict whether to approve or deny a loan application based on various factors such as income, debt, credit score etc.
Note: Logistic regression is considered a generalized linear model because the outcome always depends on the sum of the inputs and parameters. In other words, $z$ can be written as:
$z=w_{1} x_{1}+w_{2} x_{2}$
There is no interaction between the weights and parameter values ($w_{1} x_{1}*w_{2} x_{2}$) that would make a model non-linear.
Hypothesis:
With modeling the value of response variable, we model the probability of success (yes). So, our hypothesis is:
$h_{\theta}(x) = probability\ of\ success = estimated\ probability\ that\ y=1\ given\ input\ x$.
So, the hypothesis can be written as: probability that y = 1, given $x$, parametrized by $\theta$
$h_{\theta}(x) = P(y = 1|x ; \theta$)
where $y = response\ \{0,1\}$, $x = input\ features$, $\theta = parameters$
Logistic regression hypothesis is represented by a sigmoid/logistic function:
$\begin{aligned} h_{\theta}(x) &=g(\theta^{T} x) \\ \text { where } g(z) &=\frac{1}{1+e^{-z}} \\ \text { hence } h_{\theta}(x) &=\frac{1}{1+e^{-\left(\theta^{T} x\right)}} \end{aligned}$
The sigmoid function has values very close to either 0 or 1 across most of its domain. This makes it suitable for application in classification methods such as logistic regression. The function takes in any value and returns an output to be between 0 and 1. We can set a cut-off point (threshold) at 0.5 and say anything < 0.5 results in class 0 and anything >= 0.5 results in class 1.
Our goal is to predict:
$y=1$ when $h_{\theta}(x) \geqslant 0.5$ i.e. when $\theta^{T}x \geqslant 0$
$y=0$ when $h_{\theta}(x) < 0.5$ i.e. when $\theta^{T}x < 0$
The parameters we need to find are:
$\theta=\left\{\theta_{0}, \theta_{1}, \theta_{2}, \ldots \theta_{n}\right\}$
Cost Function gives an idea of how far the predicted values are from the actual values. The formula for logistic regression cost function is:
$J(\theta)=\frac{1}{m} \sum_{i=1}^{m} {\operatorname{cost}\left(h_{\theta}\left(x^{(i)}\right), y^{(i)}\right)}$
where, ${\operatorname{cost}\left(h_{\theta}(x), y\right)}=\left\{\begin{aligned}-\log \left(h_{\theta}(x)\right) & \text { if } y=1 \\-\log \left(1-h_{\theta}(x)\right) & \text { if } y=0 \end{aligned}\right.$
So, the cost function can be written as:
$J(\theta)= -\frac{1}{m}\left[\sum_{i=1}^{m} y^{(i)} \log h_{\theta}\left(x^{(i)}\right)+\left(1-y^{(i)}\right) \log \left(1-h_{\theta}\left(x^{(i)}\right)\right)\right]$
Mini-Batch Gradient Descent is an optimization technique that seeks to find model parameters (coefficients and bias) that minimize a cost function. In this technique, we repeatedly iterate through batches of 'm' training samples from the data and update the model parameters in accordance with the gradient of error.
Gradients can be calculated by taking a partial derivative of the cost function w.r.t each parameter $\theta$. The formula is:
$\theta_{j}:=\theta_{j}-\alpha \frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) \cdot x^{(i)}_{j}$
where $j = 0,...,n$, $m = no. of records$, $n = no. of features$, $\alpha = learning\ rate$
So, the gradients for various parameters can be written as:
$\theta_{0}:=\theta_{0}-\alpha \frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) \cdot x^{(i)}_{0}$
$\theta_{1}:=\theta_{1}-\alpha \frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) \cdot x^{(i)}_{1}$
$\theta_{2}:=\theta_{2}-\alpha \frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) \cdot x^{(i)}_{2}$
$\ldots$
For this exercise, we will use the Breast Cancer toy dataset available in the scikit-learn library. We will:
# Basic Imports
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix, classification_report
# Read the data
cancer_data=pd.DataFrame(load_breast_cancer().data,columns=load_breast_cancer().feature_names)
cancer_data.head()
# Create dependent and independent variables
Y=load_breast_cancer().target
display(f'Shape of Y: {Y.shape}')
X=load_breast_cancer().data
display(f'Shape of X: {X.shape}')
# Train test split
from sklearn.model_selection import train_test_split
# Split the data
x_train,x_test,y_train,y_test = train_test_split(X,Y,test_size=0.25,random_state=1234)
display(f'Shape of training data {x_train.shape}')
display(f'Shape of test data {x_test.shape}')
# Standardize the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.fit_transform(x_test)
Here, we will add a column of ones ($X_0$) to accomodate for intercept term ($\theta_0$) in to represent $\theta_0 x_0$.
Note - The column of ones must be added after standardizing the data
. If it is done before standardizing, the computatuion of Q0*X0 will change as this column will get standardized as well.
x_train = np.c_[np.ones(len(x_train)), x_train]
x_test = np.c_[np.ones(len(x_test)), x_test]
print(f'First 5 rows of x_train: \n {x_train[:2]}')
x_train.shape
In mini-batch algorithm rather than using the complete data set, in every iteration we use a set of ‘m’ training examples called batch to compute the gradient of the cost function. What this means is that we do not calculate the gradients for each observation but for a group of observations which results in a faster optimization.
Let's build a Linear Regression model using Mini-Batch Gradient Descent (MBGD). Here is how the algorithm works:
Pick a value for learning rate (alpha), number of epochs (iterations to iterate over) and batch size, then
for each epoch(e):
for each mini-batch (m):
subset X and y per mini-bacth size
- Forward Pass:
Make predictions: y_pred = 1 / (1 + e^(-(Q_t * x))
Calculate loss/error: -(1/len(Xm)) * sum(y * log(hQ_x) + (1-y) * log(1 - hQ_x))
- Backward Pass:
Calculate gradient for bj (D_bj): partial derivative of J(b) w.r.t J(bj)
- Update Parameters:
Update parameter bj: bj - alpha * D_bj
class Logistic_Regression():
def __init__(self):
# self.b0 = None
self.bj = None
def _sigmoid(self, X, weights):
# Linear model = Q_t * x
z = np.dot(X, weights)
# Sigmoid function 1 / (1 + e^(-z))
return 1/(1 + np.exp(-z))
def _loss(self, y, h):
'''
a really small value 'epsilon' is added to avoid
overflow and divison by zero error for log
loss = (-1/m) * sum(y * log(hQ_x) + (1-y) * log(1 - hQ_x))
where hQ_x = 1/(1 + e^(-Q_t*x))
'''
epsilon = 1e-5
los = (-1/len(y)) * np.sum(y * np.log(h + epsilon) + (1-y) * np.log(1-h+epsilon))
return los
def fit(self, X, y, learning_rate=0.01, epochs=100, batch_size=32, verbose=False):
n_obs = X.shape[0]
batch_loss = []
epoch_loss = []
# Initialize weights and bias
self.bj = np.zeros(X.shape[1]) # shape = (ncols,) = (31,)
for e in range(epochs):
loss_e = 0
for i in range(0, n_obs, batch_size):
# Subset data for batches
X_new = X[i:i+batch_size]
y_new = y[i:i+batch_size]
# Calculate loss
y_pred = self._sigmoid(X_new, self.bj) # predictions
loss = self._loss(y_new, y_pred)
loss_e += loss
batch_loss.append(loss)
# Calculate gradients
D_bj = (1/len(X_new)) * np.dot(y_pred - y_new, X_new)
# Update parameters
self.bj -= learning_rate * D_bj
epoch_loss.append(loss_e)
if verbose==True and e%1000==0:
print(f'Epoch: {e}, Loss: {loss_e}')
return self.bj, batch_loss, epoch_loss
def predict(self, X, threshold):
y_predicted = self._sigmoid(X, self.bj) #make prediction
# Assign prediction to a class:
# if pred >= threshold then 1 else 0 and return as an array
y_predicted_cls = [1 if i >= threshold else 0 for i in y_predicted]
return np.array(y_predicted_cls)
Learning rate is extremely important for finding best model. Here, we iterate over various learning rates to build the models. Next, we plot the loss generated by different models.
epochs = 5000
batch = 32
model = Logistic_Regression()
losses = {}
for lr in [0.1, 0.3, 0.01, 0.03, 0.001, 0.003]:
bj, batch_loss, epoch_loss = model.fit(x_train, y_train, epochs = epochs,
learning_rate=lr, batch_size=batch)
losses[f'LR={str(lr)}'] = epoch_loss
# Plot loss
plt.figure(figsize=(15,8))
xs = np.arange(len(epoch_loss))
for k, v in losses.items():
plt.plot(xs, v, lw=3, label=f"{k}, Final loss = {v[-1]:.2f}")
plt.title('Loss per Epoch (MSE) for different learning rates', size=20)
plt.xlabel('Epochs', size=14)
plt.ylabel('Loss', size=14)
plt.legend(fontsize=12)
plt.show()
From the plot above, we can see that the losses generated by various models seem to be fairly close. If the learning rate is too high, the model may not converge. On the other hand, if its too low, the model may take longer to converge. Let's use the learning rate of 0.01 to build and fit a model and then evaluate its performance.
# Build and fit best LR model
alpha = 0.01
e = 5000
batch = 32
# Build model
log_model = Logistic_Regression()
# Fit Model
bj, batch_loss, epoch_loss = log_model.fit(x_train, y_train, learning_rate=alpha,
epochs=e, batch_size=batch, verbose=True)
# Print bias and weights
print(f'Bias: {bj[0]}')
print(f'Weight Matrix: \n{bj[1:]}')
# Plot total loss which shows loss at each iteration
iterations = ((len(x_train)//batch)+1) * e
plt.figure(figsize=(8,5))
plt.plot(np.arange(iterations), batch_loss)
plt.xlabel('Batches')
plt.ylabel('Loss')
plt.title('Loss at each Batch')
plt.show()
# Plot epoch loss which shows loss at each epoch
plt.figure(figsize=(8,5))
plt.plot(np.arange(e), epoch_loss)
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Loss at each Epoch')
plt.show()
In this section, we will check the performance of our model on training and test data. We will also build a logistic regression model using scikit-learn library and compare performance.
Before we make predictions, let's define a function accuracy
which returns the ratio of the number of correct predictions to the number of observations. We will use this function to determine the accuracy of out models on training and test sets.
# Define Accuracy
def accuracy(y_true, y_pred):
acc = np.sum(y_true == y_pred) / len(y_true)
return acc
# Create predictions
train_pred = log_model.predict(x_train, 0.5)
# Calculate accuracy
train_acc = accuracy(y_train, train_pred)
print(f'Accuracy on training set: {train_acc}')
# Calculate metrics
cm_train = confusion_matrix(y_train, train_pred)
train_report = classification_report(y_train, train_pred)
print("Performance on training set:\n")
print(f'Confusion Matrix:\n {cm_train}\n')
print(f'Classification Report:\n {train_report}')
Confusion Matrix shows the following:
Precision: measures how precise/accurate the model is and is the ratio of correctly predicted positive observations to the total predicted positive observations (TP / TP + FP). Precision is a good measure to determine, when the costs of False Positive is high.
Recall: is the ratio of correctly predicted positive observations to the actual positive observations (TP / TP + FN). Recall is a good measure to determine, when the costs of False Negative is high.
F1 Score: is the harmonic mean of precision and recall taking both metrics into account. Harmonic mean is used because it punishes extremen values. In cases where we want to create a balanced classification model with the optimal balance of recall and precision, then we try to maximize the F1 score.
Evaluate model performance on test data.
# Create predictions on test set
test_pred = log_model.predict(x_test, 0.5)
# Calculate accuracy
test_acc = accuracy(y_test, test_pred)
print(f'Accuracy on test set: {test_acc}')
# Calculate metrics
cm_test = confusion_matrix(y_test, test_pred)
test_report = classification_report(y_test, test_pred)
print("Performance on test set:\n")
print(f'Confusion Matrix:\n {cm_test}\n')
print(f'Classification Report:\n {test_report}')
Confusion Matrix shows the following:
Here, we build a Multiple Linear Regression model using LinearRegression
from scikit-learn library and compare the results with model created from scratch.
# Build a model using scikit learn
from sklearn.linear_model import LogisticRegression
# fit the Logistic Regression
clf = LogisticRegression(fit_intercept=True, C=1e15)
clf.fit(x_train, y_train)
# predicting on training data-set
y_train_pred = clf.predict(x_train)
# predicting on test data-set
y_test_pred = clf.predict(x_test)
# Calculate accuracy
clf.score(x_test, y_test)
# Calculate metrics
cm_test_scikit = confusion_matrix(y_test, y_test_pred)
test_report_scikit = classification_report(y_test, y_test_pred)
print("Performance on test set using Scikit learn:\n")
print(f'Confusion Matrix:\n {cm_test_scikit}\n')
print(f'Classification Report:\n {test_report_scikit}')
Confusion Matrix shows the following:
Accuracy | Precision | Recall | F1 Score | ||||
---|---|---|---|---|---|---|---|
0 | 1 | 0 | 1 | 0 | 1 | ||
Training Data | 0.995 | 0.99 | 1 | 0.99 | 1 | 0.99 | 1 |
Test Data | 0.958 | 0.96 | 0.96 | 0.93 | 0.98 | 0.94 | 0.97 |
Model using Scikit Learn | 0.958 | 0.95 | 0.97 | 0.95 | 0.97 | 0.95 | 0.97 |
From the results comparison, we can see that the model created using mini-batch gradient descent fairs well when compared to the model created using scikit-learn library.
To summarize, in this notebook we created a logistic regression model using mini-batch gradient descent from scratch. We saw how iterating over different learning rates helped identify a learning rate with reduced loss to generate the best results. Later, we evaluated the results of our model on training and test data and compared performance with a model created using scikit-learn library.