Naive Bayes Classifier


Overview

Basics

Goal: Build a Gaussian Naive Bayes classifier from scratch.

Naive Bayes is a simple supervised learning algorithm based on Bayes' theorem. The algorithm treats the features as independent and identically distributed (iid). This means that:

  • Independence: each feature is assumed to be independent of every other feature given the class. For example, BMI would be treated as unrelated to weight or height, even though in reality it is not. This assumption is rarely true in real-world scenarios and is a “naive” assumption, hence the name “Naive Bayes”.
  • Identically distributed: the values of the features (e.g. height, weight, etc.) are assumed to follow the same family of probability distribution. In the case of a Gaussian classifier, each feature is assumed to be normally distributed and is modelled with the Gaussian probability density function.

Decision boundaries generated by the algorithm are at most quadratic: depending on the class means and variances they can be straight lines, parabolas or ellipses, but they cannot take arbitrary shapes.

Some well known applications of the algorithm include Text classification, Spam Filtering, Sentiment Analysis and Recommender Systems.

Pros:

  • It is easy and fast.
  • When the assumption of independence holds, a Naive Bayes classifier performs better compared to other models such as logistic regression.
  • Performs well in multi-class classification.
  • Works well on large datasets.

Cons:

  • Assumption of independent predictors: in real life, it is almost impossible to get a set of predictors that are completely independent.
  • Zero frequency: if a categorical variable has a category in the test data set that was not observed in the training data set, the model will assign it a probability of zero.

Bayes' Theorem and Naive Bayes

Bayes’ theorem describes the probability of an event based on the prior knowledge of conditions that might be related to the event. It can be written as:

$P\left(y \mid x_{1}, \cdots, x_{j}\right)=\frac{P\left(x_{1}, \cdots x_{j} \mid y\right) P(y)}{P\left(x_{1}, \cdots, x_{j}\right)}$

where,
$x_{1}, \cdots, x_{j}$: are j features that are independent of each other
y: is the dependent variable
$P\left(y \mid x_{1}, \cdots, x_{j}\right)$: Posterior Probability of target class i.e. probability of $y$ given features $x_{1}, \cdots, x_{j}$
$P\left(x_{1}, \cdots x_{j} \mid y\right)$: Likelihood of features $x_{1}$ to $x_{j}$ given that their class is y.
$P(y)$: Prior Probability
$P\left(x_{1}, \cdots, x_{j}\right)$: Marginal Probability

  • Posterior probability, $P\left(y \mid x_{1}, \cdots, x_{j}\right)$, is the probability of the target class given the features $x_{1}, \cdots, x_{j}$. It is calculated for each class; e.g., if the target variable has two classes (male/female), the posterior probabilities can be calculated as:

    $P\left(male \mid height, weight, foot size\right)$ $=\frac{P(\text { male })\ *\ p(\text { height } \mid \text { male })\ *\ p(\text { weight } \mid \text { male })\ *\ p(\text { foot size } \mid \text { male })}{\text { marginal probability }}$

    $P\left(female \mid height, weight, foot size\right)$ $=\frac{P(\text { female })\ *\ p(\text { height } \mid \text { female })\ *\ p(\text { weight } \mid \text { female })\ *\ p(\text { foot size } \mid \text { female })}{\text { marginal probability }}$

  • Prior Probability, $P(y)$, is the probability of the class before observing any features. It is simply the relative frequency of the class and can be calculated as:

    $P(y)$ = No. of records for the class / Total no. of records

  • Likelihood (class conditional probability), $P\left(x_{1}, \cdots x_{j} \mid y\right)$, is calculated based on the assumed probability distribution. In the case of a Gaussian distribution, it can be calculated using the probability density function (pdf) of the Gaussian distribution (see the sketch after this list):

    $P\left(x_{i} \mid y\right)=\frac{1}{\sqrt{2 \pi \sigma_{y}^{2}}} \cdot \exp \left(-\frac{\left(x_{i}-\mu_{y}\right)^{2}}{2 \sigma_{y}^{2}}\right)$

    where,
    $x_{i}$: new point to be classified
    $\mu_{y}$: mean of the target class y
    $\sigma_{y}^{2}$: variance of the target class y

  • Marginal Probability $P\left(x_{1}, \cdots, x_{j}\right)$: for Naive Bayes, we only calculate the numerator of posterior probability. Since marginal probability is the same for all classes, we can ignore the denominator.
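
To make the prior and the class conditional probability concrete, here is a minimal sketch with a single invented feature (height) and two classes. All values, and the helper name gaussian_pdf, are made up for illustration and are not part of the dataset used later in this notebook.

import numpy as np

# Hypothetical example (values invented for illustration): one feature,
# height in inches, and two classes (1 = male, 0 = female)
heights = np.array([72.0, 70.5, 69.0, 71.0, 64.0, 62.5, 65.0, 63.0])
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])

# Prior probability of the 'male' class: its relative frequency in the data
prior_male = np.mean(labels == 1)

# Class conditional probability via the Gaussian pdf, using the class mean and variance
mu_male = heights[labels == 1].mean()
var_male = heights[labels == 1].var()

def gaussian_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

x_new = 66.0  # new height to be classified
print(prior_male, gaussian_pdf(x_new, mu_male, var_male))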

Selecting Class with Highest Posterior:

To select the class with the highest posterior probability, we take the argmax over the posterior probabilities:

$y=\operatorname{argmax}_{y} P\left(y \mid x_{1}, \cdots, x_{j}\right)=\operatorname{argmax}_{y} \frac{P\left(x_{1} \mid y\right) \cdot P\left(x_{2} \mid y\right) \cdot \ldots \cdot P\left(x_{n} \mid y\right) \cdot P(y)}{P(X)}$

Since marginal probability is the same for all classes, we ignore the denominator.

$y=\operatorname{argmax}_{y} P\left(x_{1} \mid y\right) \cdot P\left(x_{2} \mid y\right) \cdot \ldots \cdot P\left(x_{n} \mid y\right) \cdot P(y)$

Multiplying many small probabilities can lead to extremely small numbers and cause numerical underflow. To avoid this, it is more convenient to work with logarithms and sum the log probabilities:

$y=\operatorname{argmax}_{y} \log \left(P\left(x_{1} \mid y\right)\right)+\log \left(P\left(x_{2} \mid y\right)\right)+\ldots+\log \left(P\left(x_{n} \mid y\right)\right)+\log (P(y))$
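
The snippet below is a small sketch of this log-space rule for two invented features (height and weight); the per-class means, variances and priors, and the helper name log_posterior, are all made up for illustration. The full implementation follows in the next section.

import numpy as np

# Illustrative only: per-class means, variances and priors for two features
# (height, weight); all numbers are invented
means = {1: np.array([70.6, 180.0]), 0: np.array([63.6, 130.0])}
variances = {1: np.array([1.2, 90.0]), 0: np.array([1.1, 80.0])}
priors = {1: 0.5, 0: 0.5}

def log_posterior(x, c):
    mu, var = means[c], variances[c]
    # sum of log Gaussian pdfs (log likelihood) plus log prior
    log_likelihood = np.sum(-0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var))
    return log_likelihood + np.log(priors[c])

x_new = np.array([66.0, 140.0])
prediction = max(priors, key=lambda c: log_posterior(x_new, c))
print(prediction)  # class with the highest log posterior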

Naive Bayes Classifier

Define the Class

Here is how the Naive Bayes algorithm works:

Given the training data and their class labels:

  1. Calculate the posterior probability of each class in target 'y' given features 'X':

    $P(y \mid X)=\frac{P\left(x_{1} \mid y\right) \cdot P\left(x_{2} \mid y\right) \cdot \ldots \cdot P\left(x_{n} \mid y\right) \cdot P(y)}{P(X)}$

    To calculate posterior probability, we:

    • Subset the data for each class in target $y$
    • Calculate the prior probability of the class, $P(y)$: the relative frequency of the class in the training data
    • Calculate the class conditional probability, $P\left(x_{i} \mid y\right)$, using the probability density function (PDF) of the Gaussian/Normal distribution:

      $P\left(x_{i} \mid y\right)=\frac{1}{\sqrt{2 \pi \sigma_{y}^{2}}} \cdot \exp \left(-\frac{\left(x_{i}-\mu_{y}\right)^{2}}{2 \sigma_{y}^{2}}\right)$

      For this, we calculate the mean and variance of each feature within the class.

    Note - Technically, we only calculate the numerator of posterior probability. Since marginal probability, $P(X)$, is the same for all classes, we can ignore the denominator.

  2. Find the class with the highest posterior probability using the following formula:

    $y=\operatorname{argmax}_{y} \log \left(P\left(x_{1} \mid y\right)\right)+\log \left(P\left(x_{2} \mid y\right)\right)+\ldots+\log \left(P\left(x_{n} \mid y\right)\right)+\log (P(y))$

In [1]:
import numpy as np


class NaiveBayes():
    
    def __init__(self):
        pass
    
    
    def fit(self, X, y):
        '''
        1. Identify unique classes and their number in target variable
        
        2. Initialize mean, variance and prior probability vectors
            - Since mean and variance are calculated for each feature in each class, 
              their shape will be (no. of classes, no. of features)
            - Since prior is just the probability of each class, it remains a 1-D array
            
        3. Iterate over each class and
            - Subset the data for class
            - Compute mean, variance and prior probability for each class
            - Add these values to initialized arrays at the same time
        '''
        
        # Get no. of records and features
        n_records, n_features = X.shape
        
        # Identify unique classes and their number in target variable
        self._classes = np.unique(y)
        n_classes = len(self._classes)
        
        # Initialize mean, variance and prior probability vectors
        self._mean = np.zeros((n_classes, n_features), dtype = np.float64)
        self._var = np.zeros((n_classes, n_features), dtype = np.float64)
        self._prior = np.zeros(n_classes, dtype = np.float64)
        
        # Iterate over each class to compute mean, variance and prior
        # (index by position so this also works when class labels are not 0..n-1)
        for idx, c in enumerate(self._classes):
            X_class = X[y == c]
            self._mean[idx, :] = np.mean(X_class, axis=0)
            self._var[idx, :] = np.var(X_class, axis=0)
            self._prior[idx] = len(X_class) / n_records

            
    def predict(self, X):
        '''
        Returns the predicted class given test data
        '''
        y_pred = [self._predict(x) for x in X]
        return y_pred
  

    def _predict(self, x):
        
        '''
        1. Initialize a list of posterior prob for each class
        
        2. Calculate posterior probability for each class using argmax formula:
          - get the prior probability of class from fit method
          - calculate conditional probability
          - calculate posterior and append to list of posterior probabilities 
          
        3. Select class with highest posterior
        '''
        # Initialize posteriors
        posteriors = []
        
        # Iterate over each class and calculate posterior using PDF function
        for idx, c in enumerate(self._classes):
            prior = np.log(self._prior[idx])
            conditional = np.sum(np.log(self._pdf(idx, x)))
            posterior = prior + conditional
            posteriors.append(posterior)
            
        # Select class with highest posterior
        final_cls = self._classes[np.argmax(posteriors)]
        return final_cls
    
    
    def _pdf(self, cls_idx, x):
        '''
        Calculates conditional probability using Gaussian PDF formula
        '''
        mean = self._mean[cls_idx]
        var = self._var[cls_idx]
        numerator = np.exp(-(x - mean)**2 / (2 * var))
        denominator = np.sqrt(2 * np.pi * var)
        return numerator / denominator
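
Before moving to the real dataset, a quick sanity check of the class on a tiny synthetic problem can be useful. The toy data below (X_toy, y_toy) is generated purely for illustration:

import numpy as np

# Two well-separated synthetic clusters, one per class
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(0, 1, size=(50, 2)), rng.normal(5, 1, size=(50, 2))])
y_toy = np.array([0] * 50 + [1] * 50)

toy_model = NaiveBayes()
toy_model.fit(X_toy, y_toy)
print(toy_model.predict(np.array([[0.2, -0.1], [4.8, 5.3]])))  # expected: [0, 1]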

Get Data and Build Model

For this exercise, we will use the Rice Seed dataset downloaded from Kaggle. The data was extracted for two kinds of rice, Jasmine (Class = 1) and Gonen (Class = 0), and contains 18,185 records with morphological features such as Area, MajorAxisLength and Perimeter.

We will:

  1. Read the data
  2. Split the data into train and test sets
  3. Standardize the data
  4. Build and fit the model
In [2]:
# Basic Imports
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.naive_bayes import GaussianNB

Load Data

In [3]:
# Read the data
data = pd.read_csv('Datasets/riceClassification.csv')
data.head()
Out[3]:
id Area MajorAxisLength MinorAxisLength Eccentricity ConvexArea EquivDiameter Extent Perimeter Roundness AspectRation Class
0 1 4537 92.229317 64.012769 0.719916 4677 76.004525 0.657536 273.085 0.764510 1.440796 1
1 2 2872 74.691881 51.400454 0.725553 3015 60.471018 0.713009 208.317 0.831658 1.453137 1
2 3 3048 76.293164 52.043491 0.731211 3132 62.296341 0.759153 210.012 0.868434 1.465950 1
3 4 3073 77.033628 51.928487 0.738639 3157 62.551300 0.783529 210.657 0.870203 1.483456 1
4 5 3693 85.124785 56.374021 0.749282 3802 68.571668 0.769375 230.332 0.874743 1.510000 1
In [4]:
# Check size
data.shape
Out[4]:
(18185, 12)
In [5]:
# Check class count
data['Class'].value_counts()
Out[5]:
1    9985
0    8200
Name: Class, dtype: int64

There are 9985 records for Jasmine and 8200 records for Gonen.

Plot Data

The plots below show the variation in each feature colored by class (Jasmine - 1, Gonen - 0).

In [6]:
cols = data.columns[1:]
fig = plt.figure(figsize=(15, 20))
for sp in range(0,len(cols)):
    ax = fig.add_subplot(4,3,sp+1)
    sns.scatterplot(data=data,
                   x = np.arange(0, data.shape[0]),
                   y = data.loc[:,cols[sp]],
                   hue = data['Class'].map({0:'Gonen', 1:'Jasmine'}),
                   ax=ax)
    ax.set_xlabel('No. of Records')
    ax.set_ylabel(cols[sp])
plt.tight_layout(pad=1.0) # adjust subplot spacing so labels do not overlap
plt.show()

Clean the data

In [7]:
# Drop columns that are not needed
data.drop('id', axis=1, inplace=True)
data.columns
Out[7]:
Index(['Area', 'MajorAxisLength', 'MinorAxisLength', 'Eccentricity',
       'ConvexArea', 'EquivDiameter', 'Extent', 'Perimeter', 'Roundness',
       'AspectRation', 'Class'],
      dtype='object')
In [8]:
# Check for NA values
data.isnull().sum()
Out[8]:
Area               0
MajorAxisLength    0
MinorAxisLength    0
Eccentricity       0
ConvexArea         0
EquivDiameter      0
Extent             0
Perimeter          0
Roundness          0
AspectRation       0
Class              0
dtype: int64

Split and Standardize data

In [9]:
# Create X and Y variables
X = data.drop('Class', axis=1)
print(X.shape)

Y = data.Class
print(Y.shape)
(18185, 10)
(18185,)
In [10]:
# Train test split
x_train,x_test,y_train,y_test = train_test_split(X,Y,test_size=0.25,random_state=100)

display(f'Shape of training data {x_train.shape}')
display(f'Shape of test data {x_test.shape}')
'Shape of training data (13638, 10)'
'Shape of test data (4547, 10)'
In [11]:
# Standardize the data: fit the scaler on the training set only,
# then apply the same transformation to the test set
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

Build Model

In [12]:
# Build Model
NB_model = NaiveBayes()
NB_model.fit(x_train, y_train)

Evaluate Model Performance

In this section, we will check the performance of our model on the train and test data. We will also build a Naive Bayes model using the scikit-learn library and compare performance.

Before we make predictions, let's define a function, accuracy, which returns the ratio of the number of correct predictions to the number of observations.

In [13]:
# Define Accuracy
def accuracy(y_true, y_pred):
    acc = np.sum(y_true == y_pred) / len(y_true)
    return acc

Performance on Training data

Evaluate model performance on training data.

In [14]:
# Create predictions
train_pred = NB_model.predict(x_train)
In [15]:
# Calculate accuracy
train_acc = accuracy(y_train, train_pred)
print(f'Accuracy on training set: {train_acc:.3f}')
Accuracy on training set: 0.984
In [16]:
# Print classification report
train_report = classification_report(y_train, train_pred)
print(f'Classification Report:\n {train_report}')
Classification Report:
               precision    recall  f1-score   support

           0       1.00      0.97      0.98      6151
           1       0.97      1.00      0.99      7487

    accuracy                           0.98     13638
   macro avg       0.99      0.98      0.98     13638
weighted avg       0.98      0.98      0.98     13638

Performance on Test data

Evaluate model performance on test data.

In [17]:
# Create predictions on test set
test_pred = NB_model.predict(x_test)
In [18]:
# Calculate accuracy
test_acc = accuracy(y_test, test_pred)
print(f'Accuracy on test set: {test_acc:.3f}')
Accuracy on test set: 0.982
In [19]:
# Calculate metrics
cm_test = confusion_matrix(y_test, test_pred)
print("Confusion Matrix:\n")

classes = ['Gonen','Jasmine']

# Plot confusion matrix
sns.heatmap(cm_test, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=classes, yticklabels=classes)
plt.ylabel('True label')
plt.xlabel('Predicted label');
Confusion Matrix:

In [20]:
# Print classification report
test_report = classification_report(y_test, test_pred)
print(f'Classification Report:\n {test_report}')
Classification Report:
               precision    recall  f1-score   support

           0       1.00      0.96      0.98      2049
           1       0.97      1.00      0.98      2498

    accuracy                           0.98      4547
   macro avg       0.98      0.98      0.98      4547
weighted avg       0.98      0.98      0.98      4547

Precision: the ratio of correctly predicted positive observations to all predicted positive observations, TP / (TP + FP). Precision is a good measure when the cost of a false positive is high.

Recall: the ratio of correctly predicted positive observations to all actual positive observations, TP / (TP + FN). Recall is a good measure when the cost of a false negative is high.

F1 Score: the harmonic mean of precision and recall, taking both metrics into account. The harmonic mean is used because it penalizes extreme values. When we want a balanced classification model with an optimal trade-off between precision and recall, we try to maximize the F1 score.
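
To make the connection explicit, here is a small sketch that derives these three metrics for the positive class (Jasmine = 1) directly from the test-set confusion matrix cm_test computed above. The helper name metrics_from_cm is just for illustration and is not part of scikit-learn; classification_report already prints the same values.

import numpy as np

def metrics_from_cm(cm):
    # cm is a 2x2 confusion matrix laid out as [[TN, FP], [FN, TP]],
    # which is how scikit-learn's confusion_matrix orders a binary problem
    tn, fp, fn, tp = cm.ravel()
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

precision, recall, f1 = metrics_from_cm(cm_test)
print(f'Precision: {precision:.2f}, Recall: {recall:.2f}, F1: {f1:.2f}')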

Comparison with Scikit Learn

Here, we build a model using GaussianNB from the scikit-learn library and compare the results with the model created from scratch.

In [21]:
# Build a model using scikit learn
   
# Fit the Gaussian Naive Bayes classifier
clf = GaussianNB()

clf.fit(x_train, y_train)
     
# predicting on training data-set
y_train_pred = clf.predict(x_train)
   
# predicting on test data-set
y_test_pred = clf.predict(x_test)
In [22]:
# Calculate accuracy
clf_acc = clf.score(x_test, y_test)
print(f'GaussianNB accuracy on test set: {clf_acc:.3f}')
GaussianNB accuracy on test set: 0.982
In [23]:
# Calculate metrics
cm_test_scikit = confusion_matrix(y_test, y_test_pred)
print("Confusion Matrix:\n")

# Plot confusion matrix
sns.heatmap(cm_test_scikit, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=classes, yticklabels=classes)
plt.ylabel('True label')
plt.xlabel('Predicted label');
Confusion Matrix:

In [24]:
# Print classification report
test_report_scikit = classification_report(y_test, y_test_pred)
print(f'Classification Report:\n {test_report_scikit}')
Classification Report:
               precision    recall  f1-score   support

           0       1.00      0.96      0.98      2049
           1       0.97      1.00      0.98      2498

    accuracy                           0.98      4547
   macro avg       0.98      0.98      0.98      4547
weighted avg       0.98      0.98      0.98      4547

Model Results

Model                        Accuracy    Precision    Recall    F1 Score
Training Data                0.984       0.98         0.98      0.98
Test Data                    0.982       0.98         0.98      0.98
Model using scikit-learn     0.982       0.98         0.98      0.98

From the results comparison, we can see that the model created from scratch fares well when compared to the model built with the scikit-learn library.

Summary

To summarize, in this notebook we created a Gaussian Naive Bayes classifier from scratch, evaluated it on the training and test data, and compared its performance with a model built using the scikit-learn library.