Goal: The goal is to build a Gaussian Naive Bayes classifier from scratch.
Naive Bayes is a simple supervised learning algorithm based on Bayes' theorem. The algorithm makes the "naive" assumption that the features are conditionally independent of each other given the class.
For Gaussian Naive Bayes, the resulting decision boundaries are quadratic: they can be straight lines, parabolas, or ellipses, depending on the class means and variances.
Some well-known applications of the algorithm include text classification, spam filtering, sentiment analysis, and recommender systems.
| Pros | Cons |
|---|---|
| It is easy to implement and fast to train | Assumes independent predictors: in real life, it is almost impossible to get a set of predictors that are completely independent |
| When the independence assumption holds, a Naive Bayes classifier can outperform other models such as logistic regression | Zero frequency: if a categorical variable has a category in the test data that was not observed in the training data, the model assigns it zero probability |
| Performs well in multi-class classification | |
| Works well on large datasets | |
Bayes’ theorem describes the probability of an event based on the prior knowledge of conditions that might be related to the event. It can be written as:
$P\left(y \mid x_{1}, \cdots, x_{j}\right)=\frac{P\left(x_{1}, \cdots x_{j} \mid y\right) P(y)}{P\left(x_{1}, \cdots, x_{j}\right)}$
where,
  $x_{1}, \cdots, x_{j}$: the $j$ features, assumed conditionally independent of each other given $y$
  $y$: the target (class) variable
  $P\left(y \mid x_{1}, \cdots, x_{j}\right)$: posterior probability of the target class, i.e. the probability of $y$ given the features $x_{1}, \cdots, x_{j}$
  $P\left(x_{1}, \cdots, x_{j} \mid y\right)$: likelihood of features $x_{1}$ to $x_{j}$ given that their class is $y$
  $P(y)$: prior probability of the class
  $P\left(x_{1}, \cdots, x_{j}\right)$: marginal probability of the features
Posterior probability, $P\left(y \mid x_{1}, \cdots, x_{j}\right)$, is the probability of the target class given the features $x_{1}, \cdots, x_{j}$. It is calculated for each class, i.e. if the target variable has 2 classes (male/female), then the posterior probabilities can be calculated as:
$P\left(male \mid height, weight, foot size\right)$ $=\frac{P(\text { male })\ *\ p(\text { height } \mid \text { male })\ *\ p(\text { weight } \mid \text { male })\ *\ p(\text { foot size } \mid \text { male })}{\text { marginal probability }}$
$P\left(female \mid height, weight, foot size\right)$ $=\frac{P(\text { female })\ *\ p(\text { height } \mid \text { female })\ *\ p(\text { weight } \mid \text { female })\ *\ p(\text { foot size } \mid \text { female })}{\text { marginal probability }}$
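For intuition, here is a small worked example with made-up numbers (all values are hypothetical and chosen only to illustrate the arithmetic). Suppose both priors are 0.5 and the per-feature likelihoods evaluate as follows:
$P(\text{male}) \cdot p(\text{height} \mid \text{male}) \cdot p(\text{weight} \mid \text{male}) \cdot p(\text{foot size} \mid \text{male}) = 0.5 \cdot 0.10 \cdot 0.20 \cdot 0.30 = 0.003$
$P(\text{female}) \cdot p(\text{height} \mid \text{female}) \cdot p(\text{weight} \mid \text{female}) \cdot p(\text{foot size} \mid \text{female}) = 0.5 \cdot 0.05 \cdot 0.10 \cdot 0.20 = 0.0005$
The marginal probability is $0.003 + 0.0005 = 0.0035$, so $P(\text{male} \mid \text{height, weight, foot size}) = 0.003 / 0.0035 \approx 0.86$ and male is the predicted class.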
Prior Probability, $P(y)$, is the probability of a class before observing any features. It is simply the relative frequency of the class and can be calculated as:
$P(y)=\frac{\text{No. of records for the class}}{\text{Total no. of records}}$
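As a quick illustration, here is a minimal sketch of this calculation with NumPy (the label array `y` below is hypothetical):
import numpy as np

# Hypothetical class labels for 8 training records
y = np.array([0, 0, 1, 1, 1, 0, 1, 1])

# Prior of each class = no. of records for the class / total no. of records
classes, counts = np.unique(y, return_counts=True)
priors = counts / len(y)
print(classes)  # [0 1]
print(priors)   # [0.375 0.625]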
Likelihood (class-conditional probability), $P\left(x_{1}, \cdots, x_{j} \mid y\right)$, is calculated based on the assumed probability distribution of the features. For a Gaussian distribution, the per-feature likelihood can be calculated using the probability density function (PDF) of the Gaussian distribution:
$P\left(x_{i} \mid y\right)=\frac{1}{\sqrt{2 \pi \sigma_{y}^{2}}} \cdot \exp \left(-\frac{\left(x_{i}-\mu_{y}\right)^{2}}{2 \sigma_{y}^{2}}\right)$
where,
  $x_{i}$: value of feature $i$ for the point being classified
  $\mu_{y}$: mean of feature $x_{i}$ for class $y$
  $\sigma_{y}^{2}$: variance of feature $x_{i}$ for class $y$
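As a minimal sketch, this density can be computed with NumPy as follows (the `x`, `mean`, and `var` values below are hypothetical; the same formula appears later as the `_pdf` helper of the class):
import numpy as np

def gaussian_pdf(x, mean, var):
    # Gaussian probability density of x given the class mean and variance
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Hypothetical feature value, class mean and class variance
print(gaussian_pdf(x=1.5, mean=1.0, var=0.25))  # approx 0.484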
Selecting Class with Highest Posterior:
To select the class with the highest posterior probability, we take the argmax of the posterior probabilities:
$y=\operatorname{argmax}_{y} P\left(y \mid x_{1}, \cdots, x_{j}\right)=\operatorname{argmax}_{y} \frac{P\left(x_{1} \mid y\right) \cdot P\left(x_{2} \mid y\right) \cdot \ldots \cdot P\left(x_{n} \mid y\right) \cdot P(y)}{P(X)}$
Since marginal probability is the same for all classes, we ignore the denominator.
$y=\operatorname{argmax}_{y} P\left(x_{1} \mid y\right) \cdot P\left(x_{2} \mid y\right) \cdot \ldots \cdot P\left(x_{n} \mid y\right) \cdot P(y)$
Multiplying many such probabilities can produce extremely small numbers and cause numerical underflow. To avoid this, it is more convenient to work with logarithms:
$y=\operatorname{argmax}_{y} \log \left(P\left(x_{1} \mid y\right)\right)+\log \left(P\left(x_{2} \mid y\right)\right)+\ldots+\log \left(P\left(x_{n} \mid y\right)\right)+\log (P(y))$
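A short sketch of the numerical issue (the probability values here are hypothetical):
import numpy as np

# 500 hypothetical per-feature likelihoods, each a small probability
probs = np.full(500, 1e-5)

print(np.prod(probs))         # 0.0 -- the raw product underflows to zero
print(np.sum(np.log(probs)))  # about -5756.5 -- the log-sum is still representable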
Here is how the Naive Bayes algorithm works:
Given the training data and their class labels, we first calculate the posterior probability of each class in the target 'y' given the features 'X':
$P(y \mid X)=\frac{P\left(x_{1} \mid y\right) \cdot P\left(x_{2} \mid y\right) \cdot \ldots \cdot P\left(x_{n} \mid y\right) \cdot P(y)}{P(X)}$
To calculate the posterior probability, we compute the class-conditional probability of each feature, $P\left(x_{i} \mid y\right)$, using the probability density function (PDF) of the Gaussian/Normal distribution:
$P\left(x_{i} \mid y\right)=\frac{1}{\sqrt{2 \pi \sigma_{y}^{2}}} \cdot \exp \left(-\frac{\left(x_{i}-\mu_{y}\right)^{2}}{2 \sigma_{y}^{2}}\right)$
For this, we first estimate the mean $\mu_{y}$ and variance $\sigma_{y}^{2}$ of each feature, as well as the prior probability $P(y)$, for every class from the training data.
Note - Technically, we only calculate the numerator of the posterior probability. Since the marginal probability, $P(X)$, is the same for all classes, we can ignore the denominator.
Finally, we find the class with the highest probability using the following formula:
$y=\operatorname{argmax}_{y} \log \left(P\left(x_{1} \mid y\right)\right)+\log \left(P\left(x_{2} \mid y\right)\right)+\ldots+\log \left(P\left(x_{n} \mid y\right)\right)+\log (P(y))$
# NumPy is used throughout the class below
import numpy as np

class NaiveBayes():
    
    def __init__(self):
        pass
    
    
    def fit(self, X, y):
        '''
        1. Identify unique classes and their number in target variable
        
        2. Initialize mean, variance and prior probability vectors
            - Since mean and variance are calculated for each feature in each class, 
              their shape will be (no. of classes, no. of features)
            - Since prior is just the probability of each class, it remains a 1-D array
            
        3. Iterate over each class and
            - Subset the data for class
            - Compute mean, variance and prior probability for each class
            - Add these values to initialized arrays at the same time
        '''
        
        # Get no. of records and features
        n_records, n_features = X.shape
        
        # Identify unique classes and their number in target variable
        self._classes = np.unique(y)
        n_classes = len(self._classes)
        
        # Initialize mean, variance and prior probability vectors
        self._mean = np.zeros((n_classes, n_features), dtype = np.float64)
        self._var = np.zeros((n_classes, n_features), dtype = np.float64)
        self._prior = np.zeros(n_classes, dtype = np.float64)
        
        # Iterate over each class to compute its mean, variance and prior
        # (use the positional index so this also works when class labels are not 0..n_classes-1)
        for idx, c in enumerate(self._classes):
            X_class = X[y == c]
            self._mean[idx, :] = np.mean(X_class, axis=0)
            self._var[idx, :] = np.var(X_class, axis=0)
            self._prior[idx] = len(X_class) / n_records
            
    def predict(self, X):
        '''
        Returns the predicted class given test data
        '''
        y_pred = [self._predict(x) for x in X]
        return y_pred
  
    def _predict(self, x):
        
        '''
        1. Initialize a list of posterior prob for each class
        
        2. Calculate posterior probability for each class using argmax formula:
          - get the prior probability of class from fit method
          - calculate conditional probability
          - calculate posterior and append to list of posterior probabilities 
          
        3. Select class with highest posterior
        '''
        # Initialize posteriors
        posteriors = []
        
        # Iterate over each class and calculate posterior using PDF function
        for idx, c in enumerate(self._classes):
            prior = np.log(self._prior[idx])
            conditional = np.sum(np.log(self._pdf(idx, x)))
            posterior = prior + conditional
            posteriors.append(posterior)
            
        # Select class with highest posterior
        final_cls = self._classes[np.argmax(posteriors)]
        return final_cls
    
    
    def _pdf(self, cls_idx, x):
        '''
        Calculates conditional probability using Gaussian PDF formula
        '''
        mean = self._mean[cls_idx]
        var = self._var[cls_idx]
        numerator = np.exp(-(x - mean)**2 / (2 * var))
        denominator = np.sqrt(2 * np.pi * var)
        return numerator / denominator
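As a quick usage example, the class can be sanity-checked on a tiny synthetic dataset before moving to real data (the data below is made up purely for illustration):
import numpy as np

# Two well-separated synthetic classes with two features each
rng = np.random.default_rng(42)
X_toy = np.vstack([rng.normal(0, 1, size=(50, 2)),
                   rng.normal(5, 1, size=(50, 2))])
y_toy = np.array([0] * 50 + [1] * 50)

toy_model = NaiveBayes()
toy_model.fit(X_toy, y_toy)
print(toy_model.predict(np.array([[0.2, -0.1], [5.3, 4.8]])))  # should predict class 0, then class 1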
For this exercise, we will use the Rice Seed dataset downloaded from Kaggle. The data was extracted for two kinds of rice: Jasmine (labeled 1) and Gonen (labeled 0). We will load and explore the data, split it into training and test sets, standardize the features, fit the model we built above, and finally evaluate its performance.
# Basic Imports
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.naive_bayes import GaussianNB
# Read the data
data = pd.read_csv('Datasets/riceClassification.csv')
data.head()
# Check size
data.shape
# Check class count
data['Class'].value_counts()
There are 9985 records for Jasmine and 8200 records for Gonen.
The plots below show the variation in each feature colored by class (Jasmine - 1, Gonen - 0).
cols = data.columns[1:]
fig = plt.figure(figsize=(15, 20))
for sp in range(0,len(cols)):
    ax = fig.add_subplot(4,3,sp+1)
    sns.scatterplot(data=data,
                   x = np.arange(0, data.shape[0]),
                   y = data.loc[:,cols[sp]],
                   hue = data['Class'].map({0:'Gonen', 1:'Jasmine'}),
                   ax=ax)
    ax.set_xlabel('No. of Records')
    ax.set_ylabel(cols[sp])
plt.tight_layout(pad=1.0) # adjust subplot spacing so labels do not overlap
plt.show()
# Drop columns that are not needed
data.drop('id', axis=1, inplace=True)
data.columns
# Check for NA values
data.isnull().sum()
# Create X and Y variables
X = data.drop('Class', axis=1)
print(X.shape)
Y = data.Class
print(Y.shape)
# Train test split
x_train,x_test,y_train,y_test = train_test_split(X,Y,test_size=0.25,random_state=100)
display(f'Shape of training data {x_train.shape}')
display(f'Shape of test data {x_test.shape}')
# Standardize the data
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)  # transform only: reuse the scaler fitted on the training data
# Build Model
NB_model = NaiveBayes()
NB_model.fit(x_train, y_train)
In this section, we will check the performance of our model on train and test data. We will also build a Naive Bayes model using scikit-learn library and compare performance.
Before we make predictions, let's define a function, accuracy, which returns the ratio of the number of correct predictions to the number of observations.
# Define Accuracy
def accuracy(y_true, y_pred):
    acc = np.sum(y_true == y_pred) / len(y_true)
    return acc
Evaluate model performance on training data.
# Create predictions
train_pred = NB_model.predict(x_train)
# Calculate accuracy
train_acc = accuracy(y_train, train_pred)
print(f'Accuracy on training set: {train_acc:.3f}')
# Print classification report
train_report = classification_report(y_train, train_pred)
print(f'Classification Report:\n {train_report}')
Evaluate model performance on test data.
# Create predictions on test set
test_pred = NB_model.predict(x_test)
# Calculate accuracy
test_acc = accuracy(y_test, test_pred)
print(f'Accuracy on test set: {test_acc:.3f}')
# Calculate metrics
cm_test = confusion_matrix(y_test, test_pred)
print("Confusion Matrix:\n")
classes = ['Gonen','Jasmine']
# Plot confusion matrix
sns.heatmap(cm_test, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=classes, yticklabels=classes)
plt.ylabel('True label')
plt.xlabel('Predicted label');
# Print classification report
test_report = classification_report(y_test, test_pred)
print(f'Classification Report:\n {test_report}')
Precision: the ratio of correctly predicted positive observations to all predicted positive observations, TP / (TP + FP). Precision is a good measure when the cost of a false positive is high.
Recall: the ratio of correctly predicted positive observations to all actual positive observations, TP / (TP + FN). Recall is a good measure when the cost of a false negative is high.
F1 Score: the harmonic mean of precision and recall, taking both metrics into account. The harmonic mean is used because it penalizes extreme values. When we want a balanced classification model with an optimal trade-off between precision and recall, we try to maximize the F1 score.
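As a sketch, these metrics can also be computed directly from the test predictions made earlier; the values should match the class 1 (Jasmine) row of the classification report above:
from sklearn.metrics import precision_score, recall_score, f1_score

# Metrics for the positive class (Jasmine = 1) using the predictions from our scratch model
print(f'Precision: {precision_score(y_test, test_pred):.3f}')
print(f'Recall:    {recall_score(y_test, test_pred):.3f}')
print(f'F1 Score:  {f1_score(y_test, test_pred):.3f}')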
Here, we build a model using GaussianNB from scikit-learn library and compare the results with model created from scratch.
# Build a model using scikit learn
   
# fit the Gaussian Naive Bayes classifier
clf = GaussianNB()
clf.fit(x_train, y_train)
     
# predicting on training data-set
y_train_pred = clf.predict(x_train)
   
# predicting on test data-set
y_test_pred = clf.predict(x_test)
# Calculate accuracy
clf_acc = clf.score(x_test, y_test)
print(f'GaussianNB accuracy on test set: {clf_acc:.3f}')
# Calculate metrics
cm_test_scikit = confusion_matrix(y_test, y_test_pred)
print("Confusion Matrix:\n")
# Plot confusion matrix
sns.heatmap(cm_test_scikit, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=classes, yticklabels=classes)
plt.ylabel('True label')
plt.xlabel('Predicted label');
# Print classification report
test_report_scikit = classification_report(y_test, y_test_pred)
print(f'Classification Report:\n {test_report_scikit}')
| Model | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|
| From scratch (training data) | 0.984 | 0.98 | 0.98 | 0.98 |
| From scratch (test data) | 0.982 | 0.98 | 0.98 | 0.98 |
| scikit-learn GaussianNB (test data) | 0.982 | 0.98 | 0.98 | 0.98 |
From this comparison, we can see that the model created from scratch fares well against the model built using the scikit-learn library.
To summarize, in this notebook we created a Gaussian Naive Bayes classifier from scratch, evaluated it on the training and test data, and compared its performance with a model built using the scikit-learn library.