Goal: Build a naive Support Vector Classifier from scratch using the Batch Gradient Descent approach.
Given a training set with labeled data, the idea is to find a decision boundary/hyperplane that separates the two classes such that the distance between the hyperplane and the closest data points in each class is maximized.
A hyperplane is represented as:
$w^{\top} x+b=0$
It is possible to create many hyperplanes that separate the two classes {1, -1} perfectly. Any such separating hyperplane has the property that:
$w^{\top} x_i+b>0$ if $y_i = 1$
$w^{\top} x_i+b<0$ if $y_i = -1$
A natural choice is to select the single hyperplane with the maximum margin, i.e. the maximum distance to the data points of both classes. This is the maximal margin classifier, also known as the optimal separating hyperplane.
Margin: If we compute the perpendicular distance from each training observation to a separating hyperplane, the smallest such distance is called the margin. We try to find the hyperplane that maximizes this smallest distance, i.e. the margin.
Support Vectors: are the data points that lie closest to the hyperplane and influence its position and orientation. The maximal margin hyperplane depends directly on the support vectors but not on any other observations, so moving any other observation would not affect the separating hyperplane.
The idea of a maximal margin classifier is to maximize the margin. If we consider two imaginary lines/hyperplanes passing through the support vectors in each class that are parallel to the optimal hyperplane, then the margin can be computed as the distance between these two parallel hyperplanes.
The margin is represented as: $\frac{2 c}{\|w\|}$, where $w^{\top} x+b=\pm c$ are the two parallel hyperplanes passing through the support vectors.
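This follows from the point-to-hyperplane distance formula:
$\operatorname{dist}\left(x_{0},\ \{x : w^{\top} x+b=0\}\right)=\frac{\left|w^{\top} x_{0}+b\right|}{\|w\|}$
so a support vector lying on $w^{\top} x+b=c$ is at distance $\frac{c}{\|w\|}$ from the separating hyperplane, and the two margin hyperplanes are $\frac{2 c}{\|w\|}$ apart.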
With the conventional scaling $c = 1$ (used in the constraint below), the margin is $\frac{2}{\|w\|}$, so it can be maximized by minimizing ${\|w\|}$ (equivalently, by minimizing the differentiable $\frac{1}{2}\|w\|^{2}$). So, the optimization problem is written as:
$\min _{w, b}\|w\|$ s.t. $y^{i}\left(w^{\top} x^{i}+b\right) \geq 1, \forall i$
where:
$y^{i}$: class that the $i$-th observation belongs to
$x^{i}$: features of the $i$-th observation
$w,\ b$: parameters of the hyperplane
$i$: index of the observation, $i = 1, \ldots, n$
Important Points:
For each observation to be on the correct side of the hyperplane, we just need to find $w,\ b$ such that $y^{i}\left(w^{\top} x^{i}+b\right) > 0$. The constraint for the maximal margin, however, guarantees that each observation will be on the correct side of the hyperplane with some cushion/margin (represented by the 1 on the right-hand side).
Although the constraint applies to all observations, only a few constraints (those for the support vectors) are active, as the optimal hyperplane depends directly on these vectors only.
Disadvantages of the maximal margin classifier: Sometimes perfect separation is achievable but not desirable. For example, in the presence of an outlier, a maximal (hard) margin classifier may separate the classes perfectly, but the model may not generalize well and accuracy may suffer. There are also cases, such as most real-world scenarios, where the data is not completely separable, so a hard margin classifier may not exist. In such cases, we may consider a classifier that does not perfectly separate the two classes in the interest of:
- greater robustness to individual observations, and
- better classification of most of the training observations.
The Support Vector Classifier, also known as the soft margin classifier, does exactly this. Rather than seeking the largest margin such that every observation is not only on the correct side of the hyperplane but also on the correct side of the margin, we instead allow some observations to be on the incorrect side of the margin, or even of the hyperplane.
Support Vector Classifier allows us to loosen the constraints (or soften the margin) by allowing some points to be on the wrong side of the margin. This is achieved by using Hinge Loss. Hinge loss is represented as:
$\operatorname{Loss}(\underbrace{y^{i}\left(\omega^{\top} x^{i}+b\right)}_{z})=\left\{\begin{array}{cc}0 & \text { if } y^{i}\left(\omega^{\top} x^{i}+b\right) \geqslant 1 \\ 1-\left(y^{i}\left(\omega^{\top} x^{i}+b\right)\right) & \text { if } y^{i}\left(\omega^{\top} x^{i}+b\right)<1\end{array}\right.$
OR
Loss $(z)=\left\{\begin{array}{ll}0 & \text { if } z \geqslant 1 \\ 1-z & \text { if } z<1\end{array}\right.$
If a point is on the correct side of the margin, the loss is 0. However, if it is on the incorrect side of margin or hyperplane, the loss increases linearly.
For points that are on or outside the margin boundary (on the correct side), the loss is 0. For a positive point ($y_i = 1$) on its margin boundary, $w^{\top} x_i + b = 1$, so $z = 1 \times 1 = 1$ and Loss(1) = 0 (since $z \geq 1$).
For points that are on the decision boundary, the loss is 1. For a positive point ($y_i = 1$) on the decision boundary, $w^{\top} x_i + b = 0$, so $z = 1 \times 0 = 0$ and Loss(0) = 1 - 0 = 1 (since $z < 1$).
For points that are between the margin and the decision boundary, 0 < loss < 1, since $y_i = 1$ and $w^{\top} x_i + b$ is somewhere between 0 and 1.
For points that are on the incorrect side of the decision boundary, the loss is greater than 1. For a negative point ($y_i = -1$) lying on the positive margin boundary, $w^{\top} x_i + b = 1$, so $z = -1 \times 1 = -1$ and Loss(-1) = 1 - (-1) = 2 (since $z < 1$).
Total Error (loss) is given by:
$\sum_{i=1}^{n}\max \left(0, 1-y_{i}\left(\mathbf{w}^{T} \mathbf{x}_{i}+b\right) \right)$
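As a quick sanity check, here is a minimal NumPy sketch of this total hinge loss (the function name hinge_loss and the toy data are illustrative, not part of the notebook's implementation); it reproduces the point-wise cases discussed above:
import numpy as np

def hinge_loss(w, b, X, y):
    """Total hinge loss: sum(max(0, 1 - y_i * (w . x_i + b))) over all observations."""
    z = y * (X @ w + b)                 # margin score z_i for each observation
    return np.sum(np.maximum(0, 1 - z))

# Toy 2-D example with w = [1, 0], b = 0: decision boundary x1 = 0, margin boundaries x1 = +/- 1
w, b = np.array([1.0, 0.0]), 0.0
X = np.array([[1.0, 0.0],    # positive point on its margin boundary      -> loss 0
              [0.0, 0.0],    # positive point on the decision boundary    -> loss 1
              [0.5, 0.0],    # positive point inside the margin           -> loss 0.5
              [1.0, 0.0]])   # negative point on the positive margin side -> loss 2
y = np.array([1, 1, 1, -1])
print(hinge_loss(w, b, X, y))  # 0 + 1 + 0.5 + 2 = 3.5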
The Support Vector Classifier works when the data is (roughly) linearly separable. The Support Vector Machine is an extension of the SVC that uses kernels to learn complex non-linear decision boundaries. Various kernels, such as the Gaussian or Radial Basis Function (RBF) kernel, can be used along with the SVC to learn non-linear decision boundaries.
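For reference, here is how a kernelized SVM can be fit with scikit-learn on a non-linearly separable toy data set (a library example for illustration only; it is separate from the from-scratch implementation below):
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Non-linearly separable toy data
X_moons, y_moons = make_moons(n_samples=200, noise=0.2, random_state=0)
# An RBF (Gaussian) kernel lets the SVM learn a non-linear decision boundary
rbf_svm = SVC(kernel='rbf', gamma='scale', C=1.0)
rbf_svm.fit(X_moons, y_moons)
print(f'Training accuracy with RBF kernel: {rbf_svm.score(X_moons, y_moons):.3f}')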
Some pros and cons of SVM are:
Pros | Cons |
---|---|
Works really well when there is a clear margin of separation | Doesn't perform well with large data sets, as training time is high |
Effective in high dimensional spaces | Doesn't perform well when the data set is noisy (overlapping classes) |
Kernel flexibility allows for learning different types of decision boundaries | Doesn't provide probability estimates directly |
Can be used for both Classification and Regression tasks | Computation can be costly |
The SVC combines the robustness of the hinge loss with the maximal margin objective. The cost function is written as:
$J\left(w, b\right)= \frac{1}{n} \sum_{i=1}^{n} \operatorname{max}\left(0, 1 - y_{i}\left(\mathbf{w}^{T} \mathbf{x}_{i}+b\right) \right)+\frac{\lambda}{2}\|w\|^{2}$
OR
$J\left(\theta, \theta_{0}\right)= \frac{1}{n} \sum_{i=1}^{n} \operatorname{max}\left(0, 1 - y^{(i)}\left(\theta \cdot x^{(i)}+\theta_{0}\right)\right)+\frac{\lambda}{2}\|\theta\|^{2}$
where:
n : no. of records
$\theta_{0}$: Bias
$\theta_{j}$: Weights
${\lambda}$: regularization parameter
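A minimal NumPy sketch of this cost, assuming X has one observation per row and y is already coded as ±1 (the function name svc_cost is illustrative); it mirrors the _cost method implemented later:
import numpy as np

def svc_cost(theta, theta0, X, y, lambd):
    """Regularized hinge-loss cost J(theta, theta_0) for labels y in {-1, +1}."""
    z = y * (X @ theta + theta0)                # y_i * (theta . x_i + theta_0)
    hinge = np.mean(np.maximum(0, 1 - z))       # (1/n) * sum of per-observation hinge losses
    reg = 0.5 * lambd * np.dot(theta, theta)    # (lambda/2) * ||theta||^2
    return hinge + reg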
Batch Gradient Descent is an optimization technique that seeks to find the model parameters (coefficients and bias) that minimize a cost function. In this technique, we use all of the training data to compute the gradient of the cost function and then update the parameters.
Gradients can be calculated by taking the partial derivative of the cost function w.r.t. each parameter $\theta$. In the SVC, the gradient depends on the outcome of the hinge loss. (In the implementation below, the gradient is evaluated and the parameters are updated one observation at a time within each epoch.)
If $y^{(i)}\left(\theta \cdot x^{(i)}+\theta_{0}\right) \geq 1$:
Bias: $\theta_{0}:=\theta_{0}-\alpha \cdot 0$
Note: the gradient is 0 in this scenario, so the bias is unchanged.
Weights: $\theta_{j}:=\theta_{j}-\alpha \cdot \lambda \cdot \theta_{j}$
If $y^{(i)}\left(\theta \cdot x^{(i)}+\theta_{0}\right) < 1$:
Bias: $\theta_{0}:=\theta_{0}-\alpha \cdot\left(-y^{(i)}\right)$
Weights: $\theta_{j}:=\theta_{j}-\alpha \cdot\left(\lambda \cdot \theta_{j}-y^{(i)} \cdot x_{j}^{(i)}\right)$
where:
$\alpha$: learning rate
$\theta_{0}$: Bias
$\theta_{j}$: Weights
${\lambda}$: regularization parameter
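A minimal sketch of this update rule for a single observation, under the conventions above (the function name gradient_step is illustrative; the fit method below applies the same logic):
import numpy as np

def gradient_step(theta, theta0, x_i, y_i, alpha, lambd):
    """One parameter update for observation (x_i, y_i), with y_i in {-1, +1}."""
    z = y_i * (np.dot(theta, x_i) + theta0)
    if z >= 1:
        # Only the regularization term contributes to the gradient
        theta = theta - alpha * (lambd * theta)
    else:
        # The hinge term contributes -y_i to the bias gradient and -y_i * x_i to the weight gradient
        theta0 = theta0 - alpha * (-y_i)
        theta = theta - alpha * (lambd * theta - y_i * x_i)
    return theta, theta0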
# Basic Imports
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix, classification_report
# Read the data
cancer_data=pd.DataFrame(load_breast_cancer().data,columns=load_breast_cancer().feature_names)
cancer_data.head()
# Create dependent and independent variables
Y=load_breast_cancer().target
display(f'Shape of Y: {Y.shape}')
X=load_breast_cancer().data[:,:10]
display(f'Shape of X: {X.shape}')
features = load_breast_cancer().feature_names
# Train test split
from sklearn.model_selection import train_test_split
# Split the data
x_train,x_test,y_train,y_test = train_test_split(X,Y,test_size=0.25,random_state=123)
display(f'Shape of training data {x_train.shape}')
display(f'Shape of test data {x_test.shape}')
# Standardize the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)  # transform the test data with the scaler fit on the training data
In this section, we will create the class to build and fit a Support Vector classifier. We will also build and fit a model using training data.
Here is how the algorithm works:
Pick values for the learning rate ($\alpha$), the number of epochs (iterations over the training data), and the regularization parameter ($\lambda$), then
for each epoch(e):
- Forward Pass: Calculate Cost = Hinge loss + Maximal margin
z = y*f(x) = y*(w*X + b)
if z >= 1:
loss = 0
else:
loss = 1 - z
hinge = loss/len(X)
cost = hinge + (lambd/2)*||w||^2
- Backward Pass:
Calculate gradients for b and w (D_b, D_w): partial derivatives of the objective function w.r.t. the bias and the weights
Objective function: (1/n)*sum(max(0, 1-z)) + (lambda/2) * ||w||^2
if z >= 1:
D_b = 0, D_w = lambda*w
else:
D_b = -y_i, D_w = lambda*w - (y_i*x_i)
- Update Parameters:
b: b - alpha * D_b
w: w - alpha * D_w
class My_SVC():
    def __init__(self):
        self.b = None
        self.w = None

    def _cost(self, y, X, w, b, lambd):
        '''
        computes cost = hinge loss + max margin term
        hinge loss = (1/n) * sum(max(0, 1 - y_i(w*x_i + b)))
        max margin term = (lambd/2) * ||w||^2
        '''
        loss = 0
        for i in range(X.shape[0]):
            z = y[i] * (np.dot(w, X[i]) + b)
            loss += max(0, 1 - z)
        hinge = loss / len(X)
        cost = lambd * 0.5 * np.dot(w, w) + hinge
        return cost

    def fit(self, X, y, learning_rate=0.01, epochs=100, lambd=0.1, verbose=False):
        n_obs, m_features = X.shape
        total_loss = []
        # convert y values to be -1 or 1
        y_new = np.where(y <= 0, -1, 1)
        # initialize weights and bias
        self.b = 0
        self.w = np.random.rand(m_features)
        # iterate over epochs
        for e in range(epochs):
            cost = self._cost(y_new, X, self.w, self.b, lambd)  # cost at the start of the epoch
            # Calculate gradient and update weights, bias one observation at a time
            for i in range(X.shape[0]):
                z = y_new[i] * (np.dot(self.w, X[i]) + self.b)
                if z >= 1:
                    # only the regularization term contributes to the gradient
                    self.w -= learning_rate * (lambd * self.w)
                else:
                    # D_b = -y_i, so b := b - lr*D_b = b + lr*y_i
                    self.b += learning_rate * y_new[i]
                    # D_w = lambd*w - y_i*x_i
                    self.w -= learning_rate * ((lambd * self.w) - np.dot(X[i], y_new[i]))
            total_loss.append(cost)
            if verbose and e % 10 == 0:
                print(f'Epoch: {e}, Loss: {cost}')
        return self.b, self.w, total_loss

    def predict(self, X):
        pred = np.dot(self.w, X.T) + self.b
        return np.sign(pred)
The learning rate is extremely important for finding the best model. Here, we iterate over several learning rates and regularization parameters to build models, and then plot the loss generated by each one.
epochs = 10
model = My_SVC()
losses = {}
for lr in [0.001, 0.0001]:
    for lambd in [0.25, 0.5, 0.75]:
        b, w, total_loss = model.fit(x_train, y_train, learning_rate=lr, epochs=epochs, lambd=lambd)
        losses[f'LR={str(lr)}, Lambda={str(lambd)}'] = total_loss
# Plot loss
plt.figure(figsize=(15,8))
xs = np.arange(len(total_loss))
for k, v in losses.items():
plt.plot(xs, v, lw=3, label=f"{k}, Final loss = {v[-1]:.2f}")
plt.title('Loss per Epoch for different learning rates', size=20)
plt.xlabel('Epochs', size=14)
plt.ylabel('Loss', size=14)
plt.legend(fontsize=12)
plt.show()
From the plot above, we can see that the losses generated by the models with a learning rate of 0.001 are fairly close. If the learning rate is too high, the model may not converge; if it is too low, the model may take longer to converge. Let's use a learning rate of 0.001 and a lambda of 0.25 to build and fit a model, and then evaluate its performance.
# Build and fit best LR model
alpha = 0.001
e = 10
lambd=0.25
# Build model
svc_clf = My_SVC()
# Fit Model
b, w, total_loss = svc_clf.fit(x_train, y_train, learning_rate=alpha, epochs=e, lambd=lambd)
# Print bias and weights
print(f'Bias: {b}')
print(f'Weight Matrix: \n{w}')
# Plot total loss which shows loss at each iteration
plt.figure(figsize=(8,5))
plt.plot(np.arange(len(total_loss)), total_loss)
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Loss at each Epoch')
plt.show()
In this section, we will check the performance of our model on the training and test data. We will also build a Support Vector Classifier using the scikit-learn library and compare performance.
Before we make predictions, let's define a function, accuracy, which returns the ratio of the number of correct predictions to the number of observations.
# Define Accuracy
def accuracy(y_true, y_pred):
acc = np.sum(y_true == y_pred) / len(y_true)
return acc
Evaluate model performance on training data.
# Create predictions
train_pred = svc_clf.predict(x_train)
# Calculate accuracy
y_train_new = np.where(y_train <= 0, -1, 1)
train_acc = accuracy(y_train_new, train_pred)
print(f'Accuracy on training set: {train_acc}')
# Print classification report
train_report = classification_report(y_train_new, train_pred)
print(f'Classification Report:\n {train_report}')
Evaluate model performance on test data.
# Create predictions on test set
test_pred = svc_clf.predict(x_test)
# Calculate accuracy
y_test_new = np.where(y_test <= 0, -1, 1)
test_acc = accuracy(y_test_new, test_pred)
print(f'Accuracy on test set: {test_acc}')
# Calculate metrics
cm_test = confusion_matrix(y_test_new, test_pred)
print("Confusion Matrix:\n")
# Plot confusion matrix
classes = np.unique(y_test_new)  # class labels in sorted order
sns.heatmap(cm_test, square=True, annot=True, fmt='d', cbar=False,
xticklabels=classes, yticklabels=classes)
plt.ylabel('True label')
plt.xlabel('Predicted label');
# Print classification report
test_report = classification_report(y_test_new, test_pred)
print(f'Classification Report:\n {test_report}')
Precision: measures how precise/accurate the model is, and is the ratio of correctly predicted positive observations to the total predicted positive observations (TP / (TP + FP)). Precision is a good measure when the cost of a False Positive is high.
Recall: is the ratio of correctly predicted positive observations to the actual positive observations (TP / (TP + FN)). Recall is a good measure when the cost of a False Negative is high.
In this scenario where we are trying to predict if a patient has cancer or not, the cost of a False Negative is very high. Hence the weighted avg. recall of 0.73 can be used for evaluating the model.
F1 Score: is the harmonic mean of precision and recall taking both metrics into account. Harmonic mean is used because it punishes extreme values. In cases where we want to create a balanced classification model with the optimal balance of recall and precision, then we try to maximize the F1 score.
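For reference, here is a small sketch of how these metrics can be computed from raw predictions (the function name precision_recall_f1 is illustrative; the classification_report calls above already report the same quantities):
import numpy as np

def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 score for the chosen positive class."""
    tp = np.sum((y_pred == positive) & (y_true == positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    precision = tp / (tp + fp)                            # TP / (TP + FP)
    recall = tp / (tp + fn)                               # TP / (TP + FN)
    f1 = 2 * precision * recall / (precision + recall)    # harmonic mean of precision and recall
    return precision, recall, f1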
Here, we build a model using SVC from the scikit-learn library and compare the results with the model created from scratch.
# Build a model using scikit learn
from sklearn.svm import SVC
# fit the model
clf = SVC(C=4, kernel='linear')
clf.fit(x_train, y_train)
# predicting on training data-set
y_train_pred = clf.predict(x_train)
# predicting on test data-set
y_test_pred = clf.predict(x_test)
# Calculate accuracy
clf_acc = clf.score(x_test, y_test)
print(f'SVC accuracy on test set: {clf_acc:.3f}')
# Calculate metrics
cm_test_scikit = confusion_matrix(y_test, y_test_pred)
print("Confusion Matrix:\n")
# Plot confusion matrix
sns.heatmap(cm_test_scikit, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=np.unique(y_test), yticklabels=np.unique(y_test))
plt.ylabel('True label')
plt.xlabel('Predicted label');
# Print classification report
test_report_scikit = classification_report(y_test, y_test_pred)
print(f'Classification Report:\n {test_report_scikit}')
Model | Accuracy | Precision | Recall | F1 Score |
---|---|---|---|---|
From-scratch SVC (training data) | 0.75 | 0.85 | 0.75 | 0.75 |
From-scratch SVC (test data) | 0.73 | 0.83 | 0.73 | 0.73 |
scikit-learn SVC (test data) | 0.94 | 0.94 | 0.94 | 0.94 |
From the results comparison, we can see that the model created from scratch performs poorly compared to the model created with the scikit-learn library. Given that the cost of a false negative is high, note that the recall of our model on the test set is 0.73, compared to a recall of 0.94 for the scikit-learn model.
To summarize, in this notebook we created a naive Support Vector Classifier from scratch. We then evaluated the model on the training and test data and compared its performance with a model created using the scikit-learn library.