Goal: Build a K-Nearest Neighbor (KNN) classification model from scratch.
KNN is a simple supervised learning algorithm that can be used for both classification and regression tasks. It is based on the intuition that similar instances should have similar class labels (in classification) or similar target values (in regression); in other words, the algorithm assumes that similar things exist in close proximity to each other.
Given training data with class labels, a new data point is classified by looking at its k nearest neighbors, where k is chosen by the user. Because k must be specified in advance, picking a good value takes some experimentation; the elbow method helps select an appropriate number of neighbors k.
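To make this intuition concrete, here is a minimal sketch on a few made-up 2D points (hypothetical data, not the Iris dataset used later): the predicted class of a query point is simply the majority label among its k closest points.
# Toy illustration of the KNN idea (hypothetical points, for intuition only)
import numpy as np
from collections import Counter
points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [4.0, 4.2], [4.1, 3.9]])
labels = np.array([0, 0, 0, 1, 1])
query = np.array([1.1, 1.0])
# Euclidean distance from the query to every labeled point
dists = np.sqrt(((points - query) ** 2).sum(axis=1))
# Majority vote among the k = 3 nearest points
nearest = np.argsort(dists)[:3]
print(Counter(labels[nearest]).most_common(1)[0][0])  # predicts class 0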
Pros | Cons
---|---
Non-parametric: makes no assumptions about the data distribution and can produce non-linear decision boundaries | Computationally expensive: time to compute the distance to every training example, and space to store all training examples
Easy to interpret | Highly sensitive to outliers
Fair prediction performance | Sensitive to irrelevant attributes
Can be used for classification and regression | Does not scale well to large datasets
For this exercise, we will use the Iris flower toy dataset available in the scikit-learn library. We will load and split the data, standardize the features, build a KNN classifier from scratch, use an elbow plot to choose k, and compare the results against scikit-learn's KNeighborsClassifier.
# Basic Imports
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix, classification_report
import seaborn as sns
# Read the data
iris = load_iris()
iris_data = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_data.head()
iris_data.iloc[0]
# Create dependent and independent variables
Y = iris.target
display(f'Shape of Y: {Y.shape}')
X = iris.data
display(f'Shape of X: {X.shape}')
# Train test split
from sklearn.model_selection import train_test_split
# Split the data
x_train,x_test,y_train,y_test = train_test_split(X,Y,test_size=0.30,random_state=100)
display(f'Shape of training data {x_train.shape}')
display(f'Shape of test data {x_test.shape}')
# Standardize the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)  # use the scaler fitted on the training data
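As an optional sanity check (assuming the x_train and x_test arrays from the split above), the scaled training features should have roughly zero mean and unit variance; the test features will be close to but not exactly standardized, because the scaler statistics come from the training data only.
# Optional check: means ~0 and standard deviations ~1 on the training set
print(x_train.mean(axis=0).round(2), x_train.std(axis=0).round(2))
print(x_test.mean(axis=0).round(2), x_test.std(axis=0).round(2))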
In this section, we will create a class to build and fit a KNN model. We will first fit the model with an arbitrarily chosen value of k and test its performance.
Next, we will write a function that iterates over different values of k and plots the error rate of the model obtained for each k. This plot is also known as an elbow plot.
Here is how the KNN algorithm works. Given the training data and their class labels, for a new data point:
1. Compute the distance from the new point to every point in the training data.
2. Sort the distances in ascending order and take the indices of the k smallest.
3. Look up the class labels of those k nearest training points.
4. Assign the most common label among them (majority vote).
from collections import Counter

class My_KNN():
    '''
    Fits and predicts a KNN model given data values and true labels
    '''
    def __init__(self):
        pass

    def fit(self, X, y, k=3, p=2):
        '''
        Initializes the:
        - training data and their class labels
        - k for the number of neighbors
        - p for the distance type: 1: Manhattan, 2: Euclidean
        '''
        self.X_train = X
        self.y_train = y
        self.k = k
        self.p = p

    def predict(self, X):
        '''
        Generates predictions for a set of new points
        '''
        # Iterate over each point and call _predict on it
        pred_cls = [self._predict(x, self.p) for x in X]
        return np.array(pred_cls)

    def _predict(self, x, p):
        '''
        Given a new data point and the distance type:
        1. Computes the distance from the new point to each training point
        2. Sorts the distances in ascending order
        3. Takes the indices of the first `k` (smallest) distances
        4. Uses those indices to get the class labels of the `k` nearest points
        5. Returns the most common class label among those `k` points
        '''
        # Compute distance to every training point
        dist = [self._compute_dist(x, train_x, p) for train_x in self.X_train]
        # Get the indices of the k nearest samples and their labels
        k_indices = np.argsort(dist)[:self.k]
        k_labels = [self.y_train[i] for i in k_indices]
        # Majority vote
        maj_vote = Counter(k_labels).most_common(1)
        return maj_vote[0][0]

    def _compute_dist(self, x1, x2, p):
        '''
        Computes the Manhattan (p=1) or Euclidean (p=2) distance between 2 points
        '''
        dist = np.sum(np.abs(x1 - x2)**p)
        return dist**(1/p)
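As an optional sanity check on the distance helper, we can compare it against NumPy for a pair of made-up points: with p=2 it should match np.linalg.norm, and with p=1 it reduces to the sum of absolute differences.
# Optional check of _compute_dist on two hypothetical points
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])
m = My_KNN()
print(m._compute_dist(a, b, 2), np.linalg.norm(a - b))  # Euclidean, both ~3.61
print(m._compute_dist(a, b, 1), np.sum(np.abs(a - b)))  # Manhattan, both 5.0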
Here, we will build a model using an arbitrarily chosen value of k = 5 neighbors and check its performance.
# Build Model
knn_model = My_KNN()
knn_model.fit(x_train, y_train, k=5, p=2)
Before we make predictions, let's define a function accuracy, which returns the ratio of the number of correct predictions to the total number of observations.
# Define Accuracy
def accuracy(y_true, y_pred):
    acc = np.sum(y_true == y_pred) / len(y_true)
    return acc
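This is the same quantity scikit-learn reports as accuracy_score; a quick check on two small hypothetical label arrays:
# Optional check: our accuracy matches sklearn's accuracy_score
from sklearn.metrics import accuracy_score
y_true_demo = np.array([0, 1, 2, 1])  # hypothetical labels, for illustration only
y_pred_demo = np.array([0, 1, 1, 1])
print(accuracy(y_true_demo, y_pred_demo), accuracy_score(y_true_demo, y_pred_demo))  # both 0.75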
# Create predictions
test_pred = knn_model.predict(x_test)
# Calculate accuracy
test_acc = accuracy(y_test, test_pred)
print(f'Accuracy on test set: {test_acc}')
Finding the Best k Value (Elbow Plot)
We will define a function that iterates over a range of k values, fits a model for each, computes the error rate on the test set, and plots the error rate against k.
# Define function to plot error rates for different k-values
def find_best_k(x_train, y_train, x_test, y_test, highest_k=40):
    '''
    Iterates over different values of k, computes the error rate for each k and plots it
    '''
    errors = []
    for i in range(1, highest_k):
        model = My_KNN()
        model.fit(x_train, y_train, k=i)
        pred = model.predict(x_test)
        loss = np.mean(pred != y_test)
        errors.append(loss)
    # Plot errors
    fig = plt.figure(figsize=(12, 8))
    plt.plot(range(1, highest_k), errors, color='blue',
             linestyle='dashed', marker='o',
             markerfacecolor='red', markersize=10)
    plt.title('Error Rate vs. K Value')
    plt.xlabel('K value')
    plt.ylabel('Error Rate')
    return fig
# Plot Error Rate for different k values
k_fig = find_best_k(x_train, y_train, x_test, y_test, highest_k=40)
From the plot above, we can see how the error rate varies with the value of k. The error rate is lower for k between 8 and 18 and then increases again.
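Since find_best_k only returns the figure, here is a small sketch (under the same setup as the plot above) that recomputes the error rates and reads off the k with the lowest test error programmatically:
# Sketch: recompute error rates and pick the k with the lowest test error
errors = []
for i in range(1, 40):
    model = My_KNN()
    model.fit(x_train, y_train, k=i)
    errors.append(np.mean(model.predict(x_test) != y_test))
best_k = int(np.argmin(errors)) + 1  # +1 because k starts at 1
print(f'Lowest test error at k = {best_k}')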
In this section, we will check the performance of our model on the test data using the best k value of 8. We will also build a KNN model using the scikit-learn library and compare performance.
Evaluate model performance on training data.
# Build Model
knn_model = My_KNN()
knn_model.fit(x_train, y_train, k=8)
# Create predictions
train_pred = knn_model.predict(x_train)
# Calculate accuracy
train_acc = accuracy(y_train, train_pred)
print(f'Accuracy on training set: {train_acc}')
# Print classification report
train_report = classification_report(y_train, train_pred)
print(f'Classification Report:\n {train_report}')
Evaluate model performance on test data.
# Create predictions on test set
test_pred = knn_model.predict(x_test)
# Calculate accuracy
test_acc = accuracy(y_test, test_pred)
print(f'Accuracy on test set: {test_acc}')
# Calculate metrics
cm_test = confusion_matrix(y_test, test_pred)
print("Confusion Matrix:\n")
# Plot confusion matrix
classes = np.unique(y_test)
sns.heatmap(cm_test, square=True, annot=True, fmt='d', cbar=False,
xticklabels=classes, yticklabels=classes)
plt.ylabel('True label')
plt.xlabel('Predicted label');
# Calculate metrics
test_report = classification_report(y_test, test_pred)
print(f'Classification Report:\n {test_report}')
Precision: measures how accurate the model's positive predictions are; it is the ratio of correctly predicted positive observations to all predicted positive observations, TP / (TP + FP). Precision is a good measure when the cost of a false positive is high.
Recall: is the ratio of correctly predicted positive observations to all actual positive observations, TP / (TP + FN). Recall is a good measure when the cost of a false negative is high.
F1 Score: is the harmonic mean of precision and recall, taking both metrics into account. The harmonic mean is used because it punishes extreme values. When we want a balanced classifier with an optimal trade-off between precision and recall, we try to maximize the F1 score.
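For this multiclass problem, the aggregate numbers in the classification report can also be reproduced with scikit-learn's individual metric functions; a short sketch using the 'weighted' averaging strategy (which corresponds to the report's weighted-average row):
# Sketch: per-metric scores for the multiclass test predictions
from sklearn.metrics import precision_score, recall_score, f1_score
print(precision_score(y_test, test_pred, average='weighted'))
print(recall_score(y_test, test_pred, average='weighted'))
print(f1_score(y_test, test_pred, average='weighted'))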
Here, we build a KNN model using KNeighborsClassifier from the scikit-learn library and compare the results with the model created from scratch.
# Build a model using scikit learn
from sklearn.neighbors import KNeighborsClassifier
# Fit the KNN classifier
clf = KNeighborsClassifier(n_neighbors=8)
clf.fit(x_train, y_train)
# predicting on training data-set
y_train_pred = clf.predict(x_train)
# predicting on test data-set
y_test_pred = clf.predict(x_test)
# Calculate accuracy
clf.score(x_test, y_test)
# Calculate metrics
cm_test_scikit = confusion_matrix(y_test, y_test_pred)
test_report_scikit = classification_report(y_test, y_test_pred)
print("Performance on test set using Scikit learn:\n")
print(f'Confusion Matrix:\n {cm_test_scikit}\n')
print(f'Classification Report:\n {test_report_scikit}')
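Beyond the aggregate metrics, we can also check directly how often the two models agree on the test set (assuming test_pred from the from-scratch model is still in scope):
# Fraction of test points where the from-scratch and scikit-learn predictions agree
agreement = np.mean(test_pred == y_test_pred)
print(f'The two models agree on {agreement:.0%} of test points')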
Model | Accuracy | Precision | Recall | F1 Score
---|---|---|---|---
From scratch (training data) | 0.96 | 0.96 | 0.96 | 0.96
From scratch (test data) | 0.95 | 0.96 | 0.96 | 0.96
scikit-learn (test data) | 0.93 | 0.95 | 0.93 | 0.93
From this comparison, we can see that the model created from scratch fares well against the model built with the scikit-learn library.
To summarize, in this notebook we created a K-Nearest Neighbor model from scratch. We saw how iterating over different values of k helped identify an optimal k value that gives the best results. Finally, we evaluated our model on test data and compared its performance with a model created using the scikit-learn library.