K-Nearest Neighbor (KNN)

Overview

Basics

Goal: The goal is to build a K-Nearest Neighbor classification model from scratch.

KNN is a simple supervised learning algorithm that can be used for classification as well as regression tasks. It is based on the intuition that similar instances should have similar class labels (in classification) or similar target values (in regression). The KNN algorithm assumes that similar things exist in close proximity (near to each other).

Given training data with class labels, new data points can be classified based on k-neighbors defined by the user. Because the user must specify in advance what k to choose, the algorithm is somewhat naive. Elbow method helps to select the optimal number of clusters k for KNN.

Pros Cons
Non-parametric; does not make any assumptions about data and generates non-linear decision boundary Computationally expensive; Time – need to compute distance to all examples; Space – need to store all training examples
Easy to interpret Highly sensitive to outliers
Fair prediction performance Sensitive to irrelevant attributes
Can be used for Classification and Regression Does not work well on large data

Get Data

For this exercise, we will use the Iris flower toy dataset available in the scikit-learn library. We will:

  1. Read the data
  2. Identify dependent and independent variables
  3. Split the data into train and test sets
  4. Standardize the data

Load Data

In [21]:
# Basic Imports
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix, classification_report
import seaborn as sns
In [2]:
# Read the data
iris_data=pd.DataFrame(load_iris().data,columns=load_iris().feature_names)
iris_data.head()
Out[2]:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
In [3]:
iris_data.iloc[0]
Out[3]:
sepal length (cm)    5.1
sepal width (cm)     3.5
petal length (cm)    1.4
petal width (cm)     0.2
Name: 0, dtype: float64
In [4]:
# Create dependent and independent variables
Y=load_iris().target
display(f'Shape of Y: {Y.shape}')

X=load_iris().data
display(f'Shape of X: {X.shape}')
'Shape of Y: (150,)'
'Shape of X: (150, 4)'

Split and Standardize data

In [5]:
# Train test split
from sklearn.model_selection import train_test_split

# Split the data
x_train,x_test,y_train,y_test = train_test_split(X,Y,test_size=0.30,random_state=100)

display(f'Shape of training data {x_train.shape}')
display(f'Shape of test data {x_test.shape}')
'Shape of training data (105, 4)'
'Shape of test data (45, 4)'
In [6]:
# Standardize the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.fit_transform(x_test)

KNN Model

In this section, we will create the class to build and fit a KNN model. We will build and fit the model using a random value for k and test its performance.

Next, we will write a function that iterates over different values of k and plots the error rate for models generated using each k. The plot is also known as Elbow Plot.

Build and Fit Model

Here is how the KNN algorithm works:

Given the training data and their class labels, for a new data point:

  1. Calculate Distance
    a. Calculate the distance from new point to each point in training data using a distance metric: Euclidean or Manhattan.
  1. Get k nearest samples and their labels
    a. Sort the distances in ascending order
    d. Subset indices of the first k elements
    a. Using indices, get class labels for those k data points
  1. Get majority vote
    a. Identify the most common class label out of the k data points selected
In [7]:
from collections import Counter


class My_KNN():
    '''
    Fits and predicts a KNN model given data values and true labels
    '''
    
    def __init__(self):
        pass
    
    def fit(self, X, y, k=3, p=2):
        '''
        Initializes the:
        - training data and their class labels
        - k for no of neighbors
        - p for distance type: 1: Manhattan, 2: Euclidean
        '''
        self.X_train = X
        self.y_train = y
        self.k = k
        self.p = p
    
    def predict(self, X):
        '''
        Generates predictions given a set of new points 
        '''
        # Iterate over each point and call _predict on each point
        pred_cls = [self._predict(x, self.p) for x in X]
        return np.array(pred_cls)
    
    def _predict(self, x, p):
        '''
        Given a new data point and distance type:
        1. Computes distance from new point to each point in the data
        2. Sorts the distances in descending order
        3. Subsets the indices for first `k` distances
        4. Using indices, get class labels for those `k` points
        5. Identify the most common class label out of the k data points selected
        '''
        # Compute distance
        dist = [self._compute_dist(x, train_x, p) for train_x in self.X_train]
        
        # Get k nearest sample indexes and their labels
        k_indices = [np.argsort(dist)[:self.k]]
        k_labels = [self.y_train[i] for i in k_indices[0]]
        
        
        # Compute majority vote
        maj_vote = Counter(k_labels).most_common(1)
        return maj_vote[0][0]

    
    def _compute_dist(self, x1, x2, p):
        '''
        Computes Manhattan (p=1) or Euclidean(p=2) distance between 2 points
        '''
        dist = np.sum(np.abs(x1 - x2)**p)
        return dist**(1/p)

Check Performance

Here, we will build a model using a random k value of 5 neighbors and check its performance.

In [8]:
# Build Model
knn_model = My_KNN()
knn_model.fit(x_train, y_train, k=5, p=2)

Before we make predictions, let's define a function accuracy which returns the ratio of the number of correct predictions to the number of observations.

In [9]:
# Define Accuracy
def accuracy(y_true, y_pred):
    acc = np.sum(y_true == y_pred) / len(y_true)
    return acc
In [10]:
# Create predictions
test_pred = knn_model.predict(x_test)
In [11]:
# Calculate accuracy
test_acc = accuracy(y_test, test_pred)
print(f'Accuracy on test set: {test_acc}')
Accuracy on test set: 0.9333333333333333

Find Best k Value (Elbow Plot)

We will define a function that:

  • Iterates over different values of k
  • Builds and fits model for each k on training data
  • Evaluate performance on test data
  • Plot error rate for different k values
In [12]:
# Define function to plot error rates for different k-values

def find_best_k(x_train, y_train, x_test, y_test, highest_k=40):
    '''
    Iterates over different values of k, computes error and plot error for each k
    '''
    errors = []
    
    for i in range(1,highest_k):
        model = My_KNN()
        model.fit(x_train, y_train, k=i)
        pred = model.predict(x_test)
        loss = np.mean(pred != y_test)
        errors.append(loss)
    
    # Plot errors
    fig = plt.figure(figsize=(12,8))
    plt.plot(range(1,highest_k), errors, color='blue', 
             linestyle='dashed', marker='o',
             markerfacecolor='red', markersize=10)
    plt.title('Error Rate vs. K Value')
    plt.xlabel('K value')
    plt.ylabel('Error Rate')
    return fig
In [13]:
# Plot Error Rate for different k values
k_fig = find_best_k(x_train, y_train, x_test, y_test, highest_k=40)

From the plot above, we can see the variation in error rate as k-value changes. The error rate reduces for k=8 to k=18 and then increases again.

Evaluate Model Performance

In this section, we will check the performance of our model on test data using the best k-value = 8. We will also build a KNN model using scikit-learn library and compare performance.

Performance on Training data

Evaluate model performance on training data.

In [14]:
# Build Model
knn_model = My_KNN()
knn_model.fit(x_train, y_train, k=8)
In [15]:
# Create predictions
train_pred = knn_model.predict(x_train)
In [16]:
# Calculate accuracy
train_acc = accuracy(y_train, train_pred)
print(f'Accuracy on test set: {train_acc}')
Accuracy on test set: 0.9619047619047619
In [17]:
# Print classification report
train_report = classification_report(y_train, train_pred)
print(f'Classification Report:\n {train_report}')
Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        34
           1       0.93      0.97      0.95        39
           2       0.97      0.91      0.94        32

    accuracy                           0.96       105
   macro avg       0.96      0.96      0.96       105
weighted avg       0.96      0.96      0.96       105

Perofrmance on Test data

Evaluate model performance on test data.

In [18]:
# Create predictions on test set
test_pred = knn_model.predict(x_test)
In [19]:
# Calculate accuracy
test_acc = accuracy(y_test, test_pred)
print(f'Accuracy on test set: {test_acc}')
Accuracy on test set: 0.9555555555555556
In [22]:
# Calculate metrics
cm_test = confusion_matrix(y_test, test_pred)
print("Confusion Matrix:\n")

# Plot confusion matrix
classes = set(y_test)
sns.heatmap(cm_test, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=classes, yticklabels=classes)
plt.ylabel('True label')
plt.xlabel('Predicted label');
Confusion Matrix:

In [23]:
# Calculate metrics
test_report = classification_report(y_test, test_pred)
print(f'Classification Report:\n {test_report}')
Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        16
           1       0.85      1.00      0.92        11
           2       1.00      0.89      0.94        18

    accuracy                           0.96        45
   macro avg       0.95      0.96      0.95        45
weighted avg       0.96      0.96      0.96        45

Precision: measures how precise/accurate the model is and is the ratio of correctly predicted positive observations to the total predicted positive observations (TP / TP + FP). Precision is a good measure to determine, when the costs of False Positive is high.

Recall: is the ratio of correctly predicted positive observations to the actual positive observations (TP / TP + FN). Recall is a good measure to determine, when the costs of False Negative is high.

F1 Score: is the harmonic mean of precision and recall taking both metrics into account. Harmonic mean is used because it punishes extremen values. In cases where we want to create a balanced classification model with the optimal balance of recall and precision, then we try to maximize the F1 score.

Comparison with Scikit Learn

Here, we build a KNN model using KNeighborsClassifier from scikit-learn library and compare the results with model created from scratch.

In [18]:
# Build a model using scikit learn
from sklearn.neighbors import KNeighborsClassifier
   
# fit the Logistic Regression
clf = KNeighborsClassifier(n_neighbors=8)

clf.fit(x_train, y_train)
     
# predicting on training data-set
y_train_pred = clf.predict(x_train)
   
# predicting on test data-set
y_test_pred = clf.predict(x_test)
In [19]:
# Calculate accuracy
clf.score(x_test, y_test)
Out[19]:
0.9333333333333333
In [20]:
# Calculate metrics
cm_test_scikit = confusion_matrix(y_test, y_test_pred)
test_report_scikit = classification_report(y_test, y_test_pred)

print("Performance on test set using Scikit learn:\n")
print(f'Confusion Matrix:\n {cm_test_scikit}\n')
print(f'Classification Report:\n {test_report_scikit}')
Performance on test set using Scikit learn:

Confusion Matrix:
 [[16  0  0]
 [ 0 11  0]
 [ 0  3 15]]

Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        16
           1       0.79      1.00      0.88        11
           2       1.00      0.83      0.91        18

    accuracy                           0.93        45
   macro avg       0.93      0.94      0.93        45
weighted avg       0.95      0.93      0.93        45

Model Results

Model Accuracy Precision Recall F1 Score
Training Data 0.96 0.96 0.96 0.96
Test Data 0.95 0.96 0.96 0.96
Model using Scikit learn 0.93 0.95 0.93 0.93

From the results comparison, we can see that the model created from scratch fairs well when compared to the model created using scikit-learn library.

Summary

To summarize, in this notebook we created a K-Nearest Neighbor model from scratch. We saw how iterating over different values of k helped identify an optimal k-value to generate the best results. Later, we evaluated the results of our model on test data and compared performance with a model created using scikit-learn library.