Goal: Build a naive Support Vector Classifier from scratch using the Batch Gradient Descent approach.
Given a training set with labeled data, the idea is to find a decision boundary/hyperplane that separates the two classes such that the distance between the hyperplane and the closest data points in each class is maximized.
A hyperplane is represented as:
$w^{\top} x+b=0$
It is possible to create many hyperplanes that separate the two classes {1, -1} perfectly. Any such separating hyperplane has the property that:
$w^{\top} x_i+b>0$ if $y_i = 1$
$w^{\top} x_i+b<0$ if $y_i = -1$
A natural choice is to select the single hyperplane with the maximum margin, i.e. the maximum distance to the data points of both classes. This is the maximal margin classifier, also known as the optimal separating hyperplane.
Margin: If we compute the perpendicular distance from each training observation to a separating hyperplane, the smallest such distance is called the margin. We try to find the hyperplane that maximizes this smallest distance, i.e. the margin.
Support Vectors: are the data points that lie closest to the hyperplane and influence its position and orientation. The maximal margin hyperplane depends directly on the support vectors but not on any other observations, so moving any other observation would not affect the separating hyperplane.
The idea of a maximal margin classifier is to maximize the margin. If we consider two imaginary lines/hyperplanes passing through the support vectors in each class that are parallel to the optimal hyperplane, then the margin can be computed as the distance between these two parallel hyperplanes.
The margin is represented as: $\frac{2 c}{\|w\|}$, where $w^{\top} x+b=\pm c$ are the two parallel hyperplanes passing through the support vectors.
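This follows from the point-to-hyperplane distance formula:
$\operatorname{dist}\left(x_{0},\ \{x : w^{\top} x+b=0\}\right)=\frac{\left|w^{\top} x_{0}+b\right|}{\|w\|}$
so a support vector lying on $w^{\top} x+b=c$ is at distance $\frac{c}{\|w\|}$ from the separating hyperplane, and the two margin hyperplanes are $\frac{2 c}{\|w\|}$ apart.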
With the conventional scaling $c = 1$ (used in the constraint below), the margin is $\frac{2}{\|w\|}$, so it can be maximized by minimizing ${\|w\|}$ (equivalently, by minimizing the differentiable $\frac{1}{2}\|w\|^{2}$). So, the optimization problem is written as:
$\min _{w, b}\|w\|$ s.t. $y^{i}\left(w^{\top} x^{i}+b\right) \geq 1, \forall i$
where:
$y^{i}$: class that the $i$-th observation belongs to
$x^{i}$: features of the $i$-th observation
$w,\ b$: parameters of the hyperplane
$i$: index of the observation, $i = 1, \ldots, n$
Important Points:
For each observation to be on the correct side of the hyperplane, we just need to find $w,\ b$ such that $y^{i}\left(w^{\top} x^{i}+b\right) > 0$. The constraint for the maximal margin, however, guarantees that each observation will be on the correct side of the hyperplane with some cushion/margin (represented by the 1 on the right-hand side).
Although the constraint applies to all observations, only a few constraints (those for the support vectors) are active, as the optimal hyperplane depends directly on these vectors only.
Disadvantages of the maximal margin classifier: Sometimes perfect separation is achievable but not desirable. For example, in the presence of an outlier, a maximal (hard) margin classifier may separate the classes perfectly, but the model may not generalize well and accuracy may suffer. There are also cases, such as most real-world scenarios, where the data is not completely separable, so a hard margin classifier may not exist. In such cases, we may consider a classifier that does not perfectly separate the two classes in the interest of:
- greater robustness to individual observations, and
- better classification of most of the training observations.
The Support Vector Classifier, also known as the soft margin classifier, does exactly this. Rather than seeking the largest margin such that every observation is not only on the correct side of the hyperplane but also on the correct side of the margin, we instead allow some observations to be on the incorrect side of the margin, or even of the hyperplane.
Support Vector Classifier allows us to loosen the constraints (or soften the margin) by allowing some points to be on the wrong side of the margin. This is achieved by using Hinge Loss. Hinge loss is represented as:
$\operatorname{Loss}(\underbrace{y^{i}\left(\omega^{\top} x^{i}+b\right)}_{z})=\left\{\begin{array}{cc}0 & \text { if } y^{i}\left(\omega^{\top} x^{i}+b\right) \geqslant 1 \\ 1-\left(y^{i}\left(\omega^{\top} x^{i}+b\right)\right) & \text { if } y^{i}\left(\omega^{\top} x^{i}+b\right)<1\end{array}\right.$
OR
Loss $(z)=\left\{\begin{array}{ll}0 & \text { if } z \geqslant 1 \\ 1-z & \text { if } z<1\end{array}\right.$
If a point is on the correct side of the margin, the loss is 0. However, if it is on the incorrect side of margin or hyperplane, the loss increases linearly.
For points that are on or outside the margin boundary (on the correct side), the loss is 0. For a positive point ($y_i = 1$) on its margin boundary, $w^{\top} x_i + b = 1$, so $z = 1 \times 1 = 1$ and Loss(1) = 0 (since $z \geq 1$).
For points that are on the decision boundary, the loss is 1. For a positive point ($y_i = 1$) on the decision boundary, $w^{\top} x_i + b = 0$, so $z = 1 \times 0 = 0$ and Loss(0) = 1 - 0 = 1 (since $z < 1$).
For points that are between the margin and the decision boundary, 0 < loss < 1, since $y_i = 1$ and $w^{\top} x_i + b$ is somewhere between 0 and 1.
For points that are on the incorrect side of the decision boundary, the loss is greater than 1. For a negative point ($y_i = -1$) lying on the positive margin boundary, $w^{\top} x_i + b = 1$, so $z = -1 \times 1 = -1$ and Loss(-1) = 1 - (-1) = 2 (since $z < 1$).
Total Error (loss) is given by:
$\sum_{i=1}^{n}\max \left(0, 1-y_{i}\left(\mathbf{w}^{T} \mathbf{x}_{i}+b\right) \right)$
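As a quick sanity check, here is a minimal NumPy sketch of this total hinge loss (the function name hinge_loss and the toy data are illustrative, not part of the notebook's implementation); it reproduces the point-wise cases discussed above:
import numpy as np

def hinge_loss(w, b, X, y):
    """Total hinge loss: sum(max(0, 1 - y_i * (w . x_i + b))) over all observations."""
    z = y * (X @ w + b)                 # margin score z_i for each observation
    return np.sum(np.maximum(0, 1 - z))

# Toy 2-D example with w = [1, 0], b = 0: decision boundary x1 = 0, margin boundaries x1 = +/- 1
w, b = np.array([1.0, 0.0]), 0.0
X = np.array([[1.0, 0.0],    # positive point on its margin boundary      -> loss 0
              [0.0, 0.0],    # positive point on the decision boundary    -> loss 1
              [0.5, 0.0],    # positive point inside the margin           -> loss 0.5
              [1.0, 0.0]])   # negative point on the positive margin side -> loss 2
y = np.array([1, 1, 1, -1])
print(hinge_loss(w, b, X, y))  # 0 + 1 + 0.5 + 2 = 3.5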
The Support Vector Classifier works when the data is (roughly) linearly separable. The Support Vector Machine is an extension of the SVC that uses kernels to learn complex non-linear decision boundaries. Various kernels, such as the Gaussian or Radial Basis Function (RBF) kernel, can be used along with the SVC to learn non-linear decision boundaries.
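For reference, here is how a kernelized SVM can be fit with scikit-learn on a non-linearly separable toy data set (a library example for illustration only; it is separate from the from-scratch implementation below):
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Non-linearly separable toy data
X_moons, y_moons = make_moons(n_samples=200, noise=0.2, random_state=0)
# An RBF (Gaussian) kernel lets the SVM learn a non-linear decision boundary
rbf_svm = SVC(kernel='rbf', gamma='scale', C=1.0)
rbf_svm.fit(X_moons, y_moons)
print(f'Training accuracy with RBF kernel: {rbf_svm.score(X_moons, y_moons):.3f}')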
Some pros and cons of SVM are:
Pros | Cons |
---|---|
Works really well when there is a clear margin of separation | Doesn't perform well with large data sets, as training time is high |
Effective in high dimensional spaces | Doesn't perform well when the data set is noisy (overlapping classes) |
Kernel flexibility allows for learning different types of decision boundaries | Doesn't provide probability estimates directly |
Can be used for both Classification and Regression tasks | Computation can be costly |
The SVC combines the robustness of the hinge loss with the maximal margin objective. The cost function is written as:
$J\left(w, b\right)= \frac{1}{n} \sum_{i=1}^{n} \operatorname{max}\left(0, 1 - y_{i}\left(\mathbf{w}^{T} \mathbf{x}_{i}+b\right) \right)+\frac{\lambda}{2}\|w\|^{2}$
OR
$J\left(\theta, \theta_{0}\right)= \frac{1}{n} \sum_{i=1}^{n} \operatorname{max}\left(0, 1 - y^{(i)}\left(\theta \cdot x^{(i)}+\theta_{0}\right)\right)+\frac{\lambda}{2}\|\theta\|^{2}$
where:
n : no. of records
$\theta_{0}$: Bias
$\theta_{j}$: Weights
${\lambda}$: regularization parameter
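A minimal NumPy sketch of this cost, assuming X has one observation per row and y is already coded as ±1 (the function name svc_cost is illustrative); it mirrors the _cost method implemented later:
import numpy as np

def svc_cost(theta, theta0, X, y, lambd):
    """Regularized hinge-loss cost J(theta, theta_0) for labels y in {-1, +1}."""
    z = y * (X @ theta + theta0)                # y_i * (theta . x_i + theta_0)
    hinge = np.mean(np.maximum(0, 1 - z))       # (1/n) * sum of per-observation hinge losses
    reg = 0.5 * lambd * np.dot(theta, theta)    # (lambda/2) * ||theta||^2
    return hinge + reg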
Batch Gradient Descent is an optimization technique that seeks to find the model parameters (coefficients and bias) that minimize a cost function. In this technique, we use all of the training data to compute the gradient of the cost function and then update the parameters.
Gradients can be calculated by taking the partial derivative of the cost function w.r.t. each parameter $\theta$. In the SVC, the gradient depends on the outcome of the hinge loss. (In the implementation below, the gradient is evaluated and the parameters are updated one observation at a time within each epoch.)
If $y^{(i)}\left(\theta \cdot x^{(i)}+\theta_{0}\right) \geq 1$:
Bias: $\theta_{0}:=\theta_{0}-\alpha \cdot 0$
Note: the gradient is 0 in this scenario, so the bias is unchanged.
Weights: $\theta_{j}:=\theta_{j}-\alpha \cdot \lambda \cdot \theta_{j}$
If $y^{(i)}\left(\theta \cdot x^{(i)}+\theta_{0}\right) < 1$:
Bias: $\theta_{0}:=\theta_{0}-\alpha \cdot\left(-y^{(i)}\right)$
Weights: $\theta_{j}:=\theta_{j}-\alpha \cdot\left(\lambda \cdot \theta_{j}-y^{(i)} \cdot x_{j}^{(i)}\right)$
where:
$\alpha$: learning rate
$\theta_{0}$: Bias
$\theta_{j}$: Weights
${\lambda}$: regularization parameter
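A minimal sketch of this update rule for a single observation, under the conventions above (the function name gradient_step is illustrative; the fit method below applies the same logic):
import numpy as np

def gradient_step(theta, theta0, x_i, y_i, alpha, lambd):
    """One parameter update for observation (x_i, y_i), with y_i in {-1, +1}."""
    z = y_i * (np.dot(theta, x_i) + theta0)
    if z >= 1:
        # Only the regularization term contributes to the gradient
        theta = theta - alpha * (lambd * theta)
    else:
        # The hinge term contributes -y_i to the bias gradient and -y_i * x_i to the weight gradient
        theta0 = theta0 - alpha * (-y_i)
        theta = theta - alpha * (lambd * theta - y_i * x_i)
    return theta, theta0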
# Basic Imports
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix, classification_report
# Read the data
cancer_data=pd.DataFrame(load_breast_cancer().data,columns=load_breast_cancer().feature_names)
cancer_data.head()
# Create dependent and independent variables
Y=load_breast_cancer().target
display(f'Shape of Y: {Y.shape}')
X=load_breast_cancer().data[:,:10]
display(f'Shape of X: {X.shape}')
features = load_breast_cancer().feature_names
# Train test split
from sklearn.model_selection import train_test_split
# Split the data
x_train,x_test,y_train,y_test = train_test_split(X,Y,test_size=0.25,random_state=123)
display(f'Shape of training data {x_train.shape}')
display(f'Shape of test data {x_test.shape}')
# Standardize the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)  # transform the test data with the scaler fit on the training data
In this section, we will create the class to build and fit a Support Vector classifier. We will also build and fit a model using training data.
Here is how the algorithm works:
Pick values for the learning rate ($\alpha$), the number of epochs (iterations over the training data), and the regularization parameter ($\lambda$), then
for each epoch(e):
- Forward Pass: Calculate Cost = Hinge loss + Maximal margin
z = y*f(x) = y*(w*X + b)
if z >= 1:
loss = 0
else:
loss = 1 - z
hinge = loss/len(X)
cost = hinge + (lambd/2)*||w||^2
- Backward Pass:
Calculate gradients for b and w (D_b, D_w): partial derivatives of the objective function w.r.t. the bias and the weights
Objective function: (1/n)*sum(max(0, 1-z)) + (lambda/2) * ||w||^2
if z >= 1:
D_b = 0, D_w = lambda*w
else:
D_b = -y_i, D_w = lambda*w - (y_i*x_i)
- Update Parameters:
b: b - alpha * D_b
w: w - alpha * D_w
class My_SVC():
    def __init__(self):
        self.b = None
        self.w = None

    def _cost(self, y, X, w, b, lambd):
        '''
        computes cost = hinge loss + max margin term
        hinge loss = (1/n) * sum(max(0, 1 - y_i(w*x_i + b)))
        max margin term = (lambd/2) * ||w||^2
        '''
        loss = 0
        for i in range(X.shape[0]):
            z = y[i] * (np.dot(w, X[i]) + b)
            loss += max(0, 1 - z)
        hinge = loss / len(X)
        cost = lambd * 0.5 * np.dot(w, w) + hinge
        return cost

    def fit(self, X, y, learning_rate=0.01, epochs=100, lambd=0.1, verbose=False):
        n_obs, m_features = X.shape
        total_loss = []
        # convert y values to be -1 or 1
        y_new = np.where(y <= 0, -1, 1)
        # initialize weights and bias
        self.b = 0
        self.w = np.random.rand(m_features)
        # iterate over epochs
        for e in range(epochs):
            cost = self._cost(y_new, X, self.w, self.b, lambd)  # cost at the start of the epoch
            # Calculate gradient and update weights, bias one observation at a time
            for i in range(X.shape[0]):
                z = y_new[i] * (np.dot(self.w, X[i]) + self.b)
                if z >= 1:
                    # only the regularization term contributes to the gradient
                    self.w -= learning_rate * (lambd * self.w)
                else:
                    # D_b = -y_i, so b := b - lr*D_b = b + lr*y_i
                    self.b += learning_rate * y_new[i]
                    # D_w = lambd*w - y_i*x_i
                    self.w -= learning_rate * ((lambd * self.w) - np.dot(X[i], y_new[i]))
            total_loss.append(cost)
            if verbose and e % 10 == 0:
                print(f'Epoch: {e}, Loss: {cost}')
        return self.b, self.w, total_loss

    def predict(self, X):
        pred = np.dot(self.w, X.T) + self.b
        return np.sign(pred)
The learning rate is extremely important for finding the best model. Here, we iterate over several learning rates and regularization parameters to build models, and then plot the loss generated by each one.
epochs = 10
model = My_SVC()
losses = {}
for lr in [0.001, 0.0001]:
    for lambd in [0.25, 0.5, 0.75]:
        b, w, total_loss = model.fit(x_train, y_train, learning_rate=lr, epochs=epochs, lambd=lambd)
        losses[f'LR={str(lr)}, Lambda={str(lambd)}'] = total_loss
# Plot loss
plt.figure(figsize=(15,8))
xs = np.arange(len(total_loss))
for k, v in losses.items():
plt.plot(xs, v, lw=3, label=f"{k}, Final loss = {v[-1]:.2f}")
plt.title('Loss per Epoch for different learning rates', size=20)
plt.xlabel('Epochs', size=14)
plt.ylabel('Loss', size=14)
plt.legend(fontsize=12)
plt.show()
From the plot above, we can see that the losses generated by the models with a learning rate of 0.001 are fairly close. If the learning rate is too high, the model may not converge; if it is too low, the model may take longer to converge. Let's use a learning rate of 0.001 and a lambda of 0.25 to build and fit a model, and then evaluate its performance.
# Build and fit best LR model
alpha = 0.001
e = 10
lambd=0.25
# Build model
svc_clf = My_SVC()
# Fit Model
b, w, total_loss = svc_clf.fit(x_train, y_train, learning_rate=alpha, epochs=e, lambd=lambd)
# Print bias and weights
print(f'Bias: {b}')
print(f'Weight Matrix: \n{w}')
# Plot total loss which shows loss at each iteration
plt.figure(figsize=(8,5))
plt.plot(np.arange(len(total_loss)), total_loss)
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Loss at each Epoch')
plt.show()
In this section, we will check the performance of our model on the training and test data. We will also build a Support Vector Classifier using the scikit-learn library and compare performance.
Before we make predictions, let's define a function, accuracy, which returns the ratio of the number of correct predictions to the number of observations.
# Define Accuracy
def accuracy(y_true, y_pred):
acc = np.sum(y_true == y_pred) / len(y_true)
return acc
Evaluate model performance on training data.
# Create predictions
train_pred = svc_clf.predict(x_train)
# Calculate accuracy
y_train_new = np.where(y_train <= 0, -1, 1)
train_acc = accuracy(y_train_new, train_pred)
print(f'Accuracy on training set: {train_acc}')
# Print classification report
train_report = classification_report(y_train_new, train_pred)
print(f'Classification Report:\n {train_report}')
Evaluate model performance on test data.
# Create predictions on test set
test_pred = svc_clf.predict(x_test)
# Calculate accuracy
y_test_new = np.where(y_test <= 0, -1, 1)
test_acc = accuracy(y_test_new, test_pred)
print(f'Accuracy on test set: {test_acc}')
# Calculate metrics
cm_test = confusion_matrix(y_test_new, test_pred)
print("Confusion Matrix:\n")
# Plot confusion matrix
classes = np.unique(y_test_new)  # class labels in sorted order
sns.heatmap(cm_test, square=True, annot=True, fmt='d', cbar=False,
xticklabels=classes, yticklabels=classes)
plt.ylabel('True label')
plt.xlabel('Predicted label');
# Print classification report
test_report = classification_report(y_test_new, test_pred)
print(f'Classification Report:\n {test_report}')
Precision: measures how precise/accurate the model is, and is the ratio of correctly predicted positive observations to the total predicted positive observations (TP / (TP + FP)). Precision is a good measure when the cost of a False Positive is high.
Recall: is the ratio of correctly predicted positive observations to the actual positive observations (TP / (TP + FN)). Recall is a good measure when the cost of a False Negative is high.
In this scenario where we are trying to predict if a patient has cancer or not, the cost of a False Negative is very high. Hence the weighted avg. recall of 0.73 can be used for evaluating the model.
F1 Score: is the harmonic mean of precision and recall taking both metrics into account. Harmonic mean is used because it punishes extreme values. In cases where we want to create a balanced classification model with the optimal balance of recall and precision, then we try to maximize the F1 score.
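For reference, here is a small sketch of how these metrics can be computed from raw predictions (the function name precision_recall_f1 is illustrative; the classification_report calls above already report the same quantities):
import numpy as np

def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 score for the chosen positive class."""
    tp = np.sum((y_pred == positive) & (y_true == positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    precision = tp / (tp + fp)                            # TP / (TP + FP)
    recall = tp / (tp + fn)                               # TP / (TP + FN)
    f1 = 2 * precision * recall / (precision + recall)    # harmonic mean of precision and recall
    return precision, recall, f1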
Here, we build a model using SVC from the scikit-learn library and compare the results with the model created from scratch.
# Build a model using scikit learn
from sklearn.svm import SVC
# fit the model
clf = SVC(C=4, kernel='linear')
clf.fit(x_train, y_train)
# predicting on training data-set
y_train_pred = clf.predict(x_train)
# predicting on test data-set
y_test_pred = clf.predict(x_test)
# Calculate accuracy
clf_acc = clf.score(x_test, y_test)
print(f'SVC accuracy on test set: {clf_acc:.3f}')
# Calculate metrics
cm_test_scikit = confusion_matrix(y_test, y_test_pred)
print("Confusion Matrix:\n")
# Plot confusion matrix
sns.heatmap(cm_test_scikit, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=np.unique(y_test), yticklabels=np.unique(y_test))
plt.ylabel('True label')
plt.xlabel('Predicted label');
# Print classification report
test_report_scikit = classification_report(y_test, y_test_pred)
print(f'Classification Report:\n {test_report_scikit}')
Model | Accuracy | Precision | Recall | F1 Score |
---|---|---|---|---|
From-scratch SVC (training data) | 0.75 | 0.85 | 0.75 | 0.75 |
From-scratch SVC (test data) | 0.73 | 0.83 | 0.73 | 0.73 |
scikit-learn SVC (test data) | 0.94 | 0.94 | 0.94 | 0.94 |
From the results comparison, we can see that the model created from scratch performs poorly compared to the model created with the scikit-learn library. Given that the cost of a false negative is high, note that the recall of our model on the test set is 0.73, compared to a recall of 0.94 for the scikit-learn model.
To summarize, in this notebook we created a naive Support Vector Classifier from scratch. We then evaluated the model on the training and test data and compared its performance with a model created using the scikit-learn library.