Goal: Build a Gaussian Naive Bayes classifier from scratch.
Naive Bayes is a simple supervised learning algorithm based on Bayes' theorem. The algorithm "naively" assumes that the features are conditionally independent of each other given the class. For the Gaussian variant, this means that decision boundaries can only be linear or quadratic (straight lines, parabolas or ellipses); they cannot take arbitrary non-linear shapes.
Some well-known applications of the algorithm include text classification, spam filtering, sentiment analysis and recommender systems.
Pros | Cons |
---|---|
It is easy to implement and fast to train | Assumption of independent predictors: in real life, it is almost impossible to get a set of predictors that are completely independent |
When the assumption of independence holds, a Naive Bayes classifier performs better compared to other models like logistic regression | Zero frequency: if a categorical variable has a category in the test data set that was not observed in the training data set, the model will assign it a zero probability |
Performs well in multi-class classification | |
Works well on large datasets | |
Bayes’ theorem describes the probability of an event based on the prior knowledge of conditions that might be related to the event. It can be written as:
$P\left(y \mid x_{1}, \cdots, x_{j}\right)=\frac{P\left(x_{1}, \cdots x_{j} \mid y\right) P(y)}{P\left(x_{1}, \cdots, x_{j}\right)}$
where,
$x_{1}, \cdots, x_{j}$: are $j$ features that are assumed to be independent of each other given the class
y: is the dependent variable
$P\left(y \mid x_{1}, \cdots, x_{j}\right)$: Posterior Probability of target class i.e. probability of $y$ given features $x_{1}, \cdots, x_{j}$
$P\left(x_{1}, \cdots x_{j} \mid y\right)$: Likelihood of features $x_{1}$ to $x_{j}$ given that their class is y.
$P(y)$: Prior Probability
$P\left(x_{1}, \cdots, x_{j}\right)$: Marginal Probability
Posterior probability, $P\left(y \mid x_{1}, \cdots, x_{j}\right)$, is the probability of the target class given the features $x_{1}, \cdots, x_{j}$. It is calculated for each class, i.e. if the target variable has 2 classes (male/female), then the posterior probability can be calculated as:
$P\left(male \mid height, weight, foot size\right)$ $=\frac{P(\text { male })\ *\ p(\text { height } \mid \text { male })\ *\ p(\text { weight } \mid \text { male })\ *\ p(\text { foot size } \mid \text { male })}{\text { marginal probability }}$
$P\left(female \mid height, weight, foot size\right)$ $=\frac{P(\text { female })\ *\ p(\text { height } \mid \text { female })\ *\ p(\text { weight } \mid \text { female })\ *\ p(\text { foot size } \mid \text { female })}{\text { marginal probability }}$
Prior Probability, $P(y)$, is the probability of the class before any features are observed. It is simply the relative frequency of the class and can be calculated as:
$P(y)$ = No. of records for the class / Total no. of records
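As a quick illustrative sketch (separate from the class built below; y_example is a hypothetical label array), the priors can be computed with numpy like this:
# Sketch: class priors from a hypothetical label array y_example
import numpy as np
y_example = np.array([0, 1, 1, 0, 1])                      # hypothetical labels
classes, counts = np.unique(y_example, return_counts=True)
priors = counts / len(y_example)                           # records per class / total records
print(classes, priors)                                     # [0 1] [0.4 0.6]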
Likelihood (class conditional probability), $P\left(x_{1}, \cdots x_{j} \mid y\right)$, is calculated based on the assumed distribution of the features. Under the independence assumption it factorizes into a product of per-feature terms $P\left(x_{i} \mid y\right)$, and for a Gaussian distribution each term is computed using the probability density function (PDF) of the Gaussian distribution:
$P\left(x_{i} \mid y\right)=\frac{1}{\sqrt{2 \pi \sigma_{y}^{2}}} \cdot \exp \left(-\frac{\left(x_{i}-\mu_{y}\right)^{2}}{2 \sigma_{y}^{2}}\right)$
where,
$x_{i}$: value of the $i$-th feature of the point being classified
$\mu_{y}$: mean of feature $x_{i}$ for the target class $y$
$\sigma_{y}^{2}$: variance of feature $x_{i}$ for the target class $y$
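For illustration, here is a minimal numpy sketch of this PDF; gaussian_pdf is a hypothetical helper, and in practice the mean and variance come from the training data for each class and feature:
# Sketch: Gaussian PDF used as the per-feature likelihood P(x_i | y)
import numpy as np

def gaussian_pdf(x, mean, var):
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

print(gaussian_pdf(0.0, mean=0.0, var=1.0))                # ~0.3989, peak of the standard normal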
Selecting Class with Highest Posterior:
To select the class with the highest posterior probability, we take the argmax of the posterior probabilities:
$y=\operatorname{argmax}_{y} P\left(y \mid x_{1}, \cdots, x_{j}\right)=\operatorname{argmax}_{y} \frac{P\left(x_{1} \mid y\right) \cdot P\left(x_{2} \mid y\right) \cdot \ldots \cdot P\left(x_{j} \mid y\right) \cdot P(y)}{P(X)}$
Since marginal probability is the same for all classes, we ignore the denominator.
$y=\operatorname{argmax}_{y} P\left(x_{1} \mid y\right) \cdot P\left(x_{2} \mid y\right) \cdot \ldots \cdot P\left(x_{j} \mid y\right) \cdot P(y)$
Multiplying these probabilities can lead to very small numbers and cause numerical underflow. To avoid underflow, it is more convenient to work with the sum of log probabilities:
$y=\operatorname{argmax}_{y} \log \left(P\left(x_{1} \mid y\right)\right)+\log \left(P\left(x_{2} \mid y\right)\right)+\ldots+\log \left(P\left(x_{j} \mid y\right)\right)+\log (P(y))$
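As a quick illustration of why the log form matters (using hypothetical likelihood values), multiplying many small probabilities underflows to zero in floating point, while summing their logs stays usable:
# Sketch: product of many small likelihoods vs. the sum of their logs
import numpy as np
likelihoods = np.full(500, 1e-3)                           # 500 hypothetical values of P(x_i | y)
print(np.prod(likelihoods))                                # 0.0 -> numerical underflow
print(np.sum(np.log(likelihoods)))                         # approx -3453.9, still usable for argmax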
Here is how the Naive Bayes algorithm works:
Given the training data and their class labels:
Calculate the posterior probability of each class in target 'y' given features 'X':
$P(y \mid X)=\frac{P\left(x_{1} \mid y\right) \cdot P\left(x_{2} \mid y\right) \cdot \ldots \cdot P\left(x_{j} \mid y\right) \cdot P(y)}{P(X)}$
To calculate the posterior probability, we:
Calculate the conditional probability of each feature given the class, $P\left(x_{i} \mid y\right)$, using the probability density function (PDF) of the Gaussian/normal distribution:
$P\left(x_{i} \mid y\right)=\frac{1}{\sqrt{2 \pi \sigma_{y}^{2}}} \cdot \exp \left(-\frac{\left(x_{i}-\mu_{y}\right)^{2}}{2 \sigma_{y}^{2}}\right)$
For this we first compute, from the training data, the mean and variance of each feature for each class, along with the prior probability of each class.
Note - Technically, we only calculate the numerator of the posterior probability. Since the marginal probability, $P(X)$, is the same for all classes, we can ignore the denominator.
Find the class with the highest probability using the following formula:
$y=\operatorname{argmax}_{y} \log \left(P\left(x_{1} \mid y\right)\right)+\log \left(P\left(x_{2} \mid y\right)\right)+\ldots+\log \left(P\left(x_{j} \mid y\right)\right)+\log (P(y))$
class NaiveBayes():
def __init__(self):
pass
def fit(self, X, y):
'''
1. Identify unique classes and their number in target variable
2. Initialize mean, variance and prior probability vectors
- Since mean and variance are calculated for each feature in each class,
their shape will be (no. of classes, no. of features)
- Since prior is just the probability of each class, it remains a 1-D array
3. Iterate over each class and
- Subset the data for class
- Compute mean, variance and prior probability for each class
- Add these values to initialized arrays at the same time
'''
# Get no. of records and features
n_records, n_features = X.shape
# Identify unique classes and their number in target variable
self._classes = np.unique(y)
n_classes = len(self._classes)
# Initialize mean, variance and prior probability vectors
self._mean = np.zeros((n_classes, n_features), dtype = np.float64)
self._var = np.zeros((n_classes, n_features), dtype = np.float64)
self._prior = np.zeros(n_classes, dtype = np.float64)
        # Iterate over each class to compute mean, variance and prior
        # (enumerate so stats are stored by class index, matching the indexing in _predict)
        for idx, c in enumerate(self._classes):
            X_class = X[y == c]
            self._mean[idx, :] = np.mean(X_class, axis=0)
            self._var[idx, :] = np.var(X_class, axis=0)
            self._prior[idx] = X_class.shape[0] / n_records
def predict(self, X):
'''
Returns the predicted class given test data
'''
y_pred = [self._predict(x) for x in X]
return y_pred
def _predict(self, x):
'''
1. Initialize a list of posterior prob for each class
2. Calculate posterior probability for each class using argmax formula:
- get the prior probability of class from fit method
- calculate conditional probability
- calculate posterior and append to list of posterior probabilities
3. Select class with highest posterior
'''
# Initialize posteriors
posteriors = []
# Iterate over each class and calculate posterior using PDF function
for idx, c in enumerate(self._classes):
prior = np.log(self._prior[idx])
conditional = np.sum(np.log(self._pdf(idx, x)))
posterior = prior + conditional
posteriors.append(posterior)
# Select class with highest posterior
final_cls = self._classes[np.argmax(posteriors)]
return final_cls
def _pdf(self, cls_idx, x):
'''
Calculates conditional probability using Gaussian PDF formula
'''
mean = self._mean[cls_idx]
var = self._var[cls_idx]
numerator = np.exp(-(x - mean)**2 / (2 * var))
denominator = np.sqrt(2 * np.pi * var)
return numerator / denominator
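Before moving to the real dataset, here is an optional sanity check of the class on synthetic data; this is only a sketch, and the make_classification settings below are arbitrary:
# Sketch: quick sanity check of NaiveBayes on synthetic data
import numpy as np
from sklearn.datasets import make_classification
X_demo, y_demo = make_classification(n_samples=200, n_features=4,
                                     n_informative=3, n_redundant=0,
                                     random_state=42)
nb_demo = NaiveBayes()
nb_demo.fit(X_demo, y_demo)
demo_pred = nb_demo.predict(X_demo)
print('Accuracy on synthetic training data:', np.mean(np.array(demo_pred) == y_demo))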
For this exercise, we will use the Rice Seed dataset downloaded from Kaggle. The data was extracted for two kinds of rice: Jasmine - 1, Gonen - 0. We will load and explore the data, split it into training and test sets, standardize the features, fit the classifier built above and compare its performance with the GaussianNB model from scikit-learn.
# Basic Imports
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.naive_bayes import GaussianNB
# Read the data
data = pd.read_csv('Datasets/riceClassification.csv')
data.head()
# Check size
data.shape
# Check class count
data['Class'].value_counts()
There are 9985 records for Jasmine and 8200 records for Gonen.
The plots below show the variation in each feature colored by class (Jasmine - 1, Gonen - 0).
cols = data.columns[1:]
fig = plt.figure(figsize=(15, 20))
for sp in range(0,len(cols)):
ax = fig.add_subplot(4,3,sp+1)
sns.scatterplot(data=data,
x = np.arange(0, data.shape[0]),
y = data.loc[:,cols[sp]],
hue = data['Class'].map({0:'Gonen', 1:'Jasmine'}),
ax=ax)
ax.set_xlabel('No. of Records')
ax.set_ylabel(cols[sp])
plt.tight_layout(pad=1.0) # automatically adjusts layout to fit long labels
plt.show()
# Drop columns that are not needed
data.drop('id', axis=1, inplace=True)
data.columns
# Check for NA values
data.isnull().sum()
# Create X and Y variables
X = data.drop('Class', axis=1)
print(X.shape)
Y = data.Class
print(Y.shape)
# Train test split
x_train,x_test,y_train,y_test = train_test_split(X,Y,test_size=0.25,random_state=100)
display(f'Shape of training data {x_train.shape}')
display(f'Shape of test data {x_test.shape}')
# Standardize the data
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)  # use the scaler fitted on the training data; do not refit on test data
# Build Model
NB_model = NaiveBayes()
NB_model.fit(x_train, y_train)
In this section, we will check the performance of our model on train and test data. We will also build a Naive Bayes model using scikit-learn library and compare performance.
Before we make predictions, let's define a function, accuracy, which returns the ratio of the number of correct predictions to the number of observations.
# Define Accuracy
def accuracy(y_true, y_pred):
acc = np.sum(y_true == y_pred) / len(y_true)
return acc
Evaluate model performance on training data.
# Create predictions
train_pred = NB_model.predict(x_train)
# Calculate accuracy
train_acc = accuracy(y_train, train_pred)
print(f'Accuracy on training set: {train_acc:.3f}')
# Print classification report
train_report = classification_report(y_train, train_pred)
print(f'Classification Report:\n {train_report}')
Evaluate model performance on test data.
# Create predictions on test set
test_pred = NB_model.predict(x_test)
# Calculate accuracy
test_acc = accuracy(y_test, test_pred)
print(f'Accuracy on test set: {test_acc:.3f}')
# Calculate metrics
cm_test = confusion_matrix(y_test, test_pred)
print("Confusion Matrix:\n")
classes = ['Gonen','Jasmine']
# Plot confusion matrix
sns.heatmap(cm_test, square=True, annot=True, fmt='d', cbar=False,
xticklabels=classes, yticklabels=classes)
plt.ylabel('True label')
plt.xlabel('Predicted label');
# Print classification report
test_report = classification_report(y_test, test_pred)
print(f'Classification Report:\n {test_report}')
Precision: measures how precise the model is and is the ratio of correctly predicted positive observations to the total predicted positive observations, TP / (TP + FP). Precision is a good measure when the cost of a false positive is high.
Recall: is the ratio of correctly predicted positive observations to all actual positive observations, TP / (TP + FN). Recall is a good measure when the cost of a false negative is high.
F1 Score: is the harmonic mean of precision and recall, taking both metrics into account. The harmonic mean is used because it punishes extreme values. When we want a balanced classification model with an optimal balance of recall and precision, we try to maximize the F1 score.
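For reference, the same quantities summarized in the classification report can also be computed individually with scikit-learn, using the y_test and test_pred defined above:
# Individual metrics for the scratch model on the test set
from sklearn.metrics import precision_score, recall_score, f1_score
print(f'Precision: {precision_score(y_test, test_pred):.2f}')
print(f'Recall:    {recall_score(y_test, test_pred):.2f}')
print(f'F1 Score:  {f1_score(y_test, test_pred):.2f}')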
Here, we build a model using GaussianNB from the scikit-learn library and compare the results with the model created from scratch.
# Build a model using scikit learn
# fit the Gaussian Naive Bayes model
clf = GaussianNB()
clf.fit(x_train, y_train)
# predicting on training data-set
y_train_pred = clf.predict(x_train)
# predicting on test data-set
y_test_pred = clf.predict(x_test)
# Calculate accuracy
clf_acc = clf.score(x_test, y_test)
print(f'GaussianNB accuracy on test set: {clf_acc:.3f}')
# Calculate metrics
cm_test_scikit = confusion_matrix(y_test, y_test_pred)
print("Confusion Matrix:\n")
# Plot confusion matrix
sns.heatmap(cm_test_scikit, square=True, annot=True, fmt='d', cbar=False,
xticklabels=classes, yticklabels=classes)
plt.ylabel('True label')
plt.xlabel('Predicted label');
# Print classification report
test_report_scikit = classification_report(y_test, y_test_pred)
print(f'Classification Report:\n {test_report_scikit}')
Model | Accuracy | Precision | Recall | F1 Score |
---|---|---|---|---|
Scratch model (training data) | 0.984 | 0.98 | 0.98 | 0.98 |
Scratch model (test data) | 0.982 | 0.98 | 0.98 | 0.98 |
scikit-learn GaussianNB (test data) | 0.982 | 0.98 | 0.98 | 0.98 |
From this comparison, we can see that the model created from scratch fares well against the model created using the scikit-learn library.
To summarize, in this notebook we created a Gaussian Naive Bayes classifier from scratch, evaluated it on the training and test data, and compared its performance with a model created using the scikit-learn library.