Naive Bayes Classifier


Overview

Basics

Goal: Build a Gaussian Naive Bayes classifier from scratch.

Naive Bayes is a simple supervised learning algorithm based on Bayes' theorem. The algorithm treats the features as independent and identically distributed (iid). This means that:

  • Independence: each feature is assumed to be independent of every other feature given the class. For example, BMI would be treated as unrelated to weight or height, even though in reality it is not. This assumption is rarely true in real-world scenarios and is a “naive” assumption, hence the name “Naive Bayes”.
  • Identically distributed: the values of the features (e.g. height, weight, etc.) are assumed to follow the same family of probability distribution. In the case of a Gaussian classifier, each feature is assumed to be normally distributed and is modelled with the Gaussian probability density function.

Decision boundaries generated by the algorithm are at most quadratic: depending on the class means and variances they can be straight lines, parabolas or ellipses, but they cannot take arbitrary shapes.

Some well known applications of the algorithm include Text classification, Spam Filtering, Sentiment Analysis and Recommender Systems.

Pros:

  • It is easy and fast.
  • When the assumption of independence holds, a Naive Bayes classifier performs better compared to other models such as logistic regression.
  • Performs well in multi-class classification.
  • Works well on large datasets.

Cons:

  • Assumption of independent predictors: in real life, it is almost impossible to get a set of predictors that are completely independent.
  • Zero frequency: if a categorical variable has a category in the test data set that was not observed in the training data set, the model will assign it a probability of zero.

Bayes' Theorem and Naive Bayes

Bayes’ theorem describes the probability of an event based on the prior knowledge of conditions that might be related to the event. It can be written as:

$P\left(y \mid x_{1}, \cdots, x_{j}\right)=\frac{P\left(x_{1}, \cdots x_{j} \mid y\right) P(y)}{P\left(x_{1}, \cdots, x_{j}\right)}$

where,
$x_{1}, \cdots, x_{j}$: are j features that are independent of each other
y: is the dependent variable
$P\left(y \mid x_{1}, \cdots, x_{j}\right)$: Posterior Probability of target class i.e. probability of $y$ given features $x_{1}, \cdots, x_{j}$
$P\left(x_{1}, \cdots x_{j} \mid y\right)$: Likelihood of features $x_{1}$ to $x_{j}$ given that their class is y.
$P(y)$: Prior Probability
$P\left(x_{1}, \cdots, x_{j}\right)$: Marginal Probability

  • Posterior probability, $P\left(y \mid x_{1}, \cdots, x_{j}\right)$, is the probability of the target class given the features $x_{1}, \cdots, x_{j}$. It is calculated for each class; e.g., if the target variable has two classes (male/female), the posterior probabilities can be calculated as:

    $P\left(male \mid height, weight, foot size\right)$ $=\frac{P(\text { male })\ *\ p(\text { height } \mid \text { male })\ *\ p(\text { weight } \mid \text { male })\ *\ p(\text { foot size } \mid \text { male })}{\text { marginal probability }}$

    $P\left(female \mid height, weight, foot size\right)$ $=\frac{P(\text { female })\ *\ p(\text { height } \mid \text { female })\ *\ p(\text { weight } \mid \text { female })\ *\ p(\text { foot size } \mid \text { female })}{\text { marginal probability }}$

  • Prior Probability, $P(y)$, is the probability of the class before observing any features. It is simply the relative frequency of the class and can be calculated as:

    $P(y)$ = No. of records for the class / Total no. of records

  • Likelihood (class conditional probability), $P\left(x_{1}, \cdots x_{j} \mid y\right)$, is calculated based on the assumed probability distribution. In the case of a Gaussian distribution, it can be calculated using the probability density function (pdf) of the Gaussian distribution (see the sketch after this list):

    $P\left(x_{i} \mid y\right)=\frac{1}{\sqrt{2 \pi \sigma_{y}^{2}}} \cdot \exp \left(-\frac{\left(x_{i}-\mu_{y}\right)^{2}}{2 \sigma_{y}^{2}}\right)$

    where,
    $x_{i}$: new point to be classified
    $\mu_{y}$: mean of the target class y
    $\sigma_{y}^{2}$: variance of the target class y

  • Marginal Probability $P\left(x_{1}, \cdots, x_{j}\right)$: for Naive Bayes, we only calculate the numerator of posterior probability. Since marginal probability is the same for all classes, we can ignore the denominator.
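
To make the prior and the class conditional probability concrete, here is a minimal sketch with a single invented feature (height) and two classes. All values, and the helper name gaussian_pdf, are made up for illustration and are not part of the dataset used later in this notebook.

import numpy as np

# Hypothetical example (values invented for illustration): one feature,
# height in inches, and two classes (1 = male, 0 = female)
heights = np.array([72.0, 70.5, 69.0, 71.0, 64.0, 62.5, 65.0, 63.0])
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])

# Prior probability of the 'male' class: its relative frequency in the data
prior_male = np.mean(labels == 1)

# Class conditional probability via the Gaussian pdf, using the class mean and variance
mu_male = heights[labels == 1].mean()
var_male = heights[labels == 1].var()

def gaussian_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

x_new = 66.0  # new height to be classified
print(prior_male, gaussian_pdf(x_new, mu_male, var_male))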

Selecting Class with Highest Posterior:

To select the class with the highest posterior probability, we take the argmax over the posterior probabilities:

$y=\operatorname{argmax}_{y} P\left(y \mid x_{1}, \cdots, x_{j}\right)=\operatorname{argmax}_{y} \frac{P\left(x_{1} \mid y\right) \cdot P\left(x_{2} \mid y\right) \cdot \ldots \cdot P\left(x_{n} \mid y\right) \cdot P(y)}{P(X)}$

Since marginal probability is the same for all classes, we ignore the denominator.

$y=\operatorname{argmax}_{y} P\left(x_{1} \mid y\right) \cdot P\left(x_{2} \mid y\right) \cdot \ldots \cdot P\left(x_{n} \mid y\right) \cdot P(y)$

Multiplying many small probabilities can lead to extremely small numbers and cause numerical underflow. To avoid this, it is more convenient to work with logarithms and sum the log probabilities:

$y=\operatorname{argmax}_{y} \log \left(P\left(x_{1} \mid y\right)\right)+\log \left(P\left(x_{2} \mid y\right)\right)+\ldots+\log \left(P\left(x_{n} \mid y\right)\right)+\log (P(y))$
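
The snippet below is a small sketch of this log-space rule for two invented features (height and weight); the per-class means, variances and priors, and the helper name log_posterior, are all made up for illustration. The full implementation follows in the next section.

import numpy as np

# Illustrative only: per-class means, variances and priors for two features
# (height, weight); all numbers are invented
means = {1: np.array([70.6, 180.0]), 0: np.array([63.6, 130.0])}
variances = {1: np.array([1.2, 90.0]), 0: np.array([1.1, 80.0])}
priors = {1: 0.5, 0: 0.5}

def log_posterior(x, c):
    mu, var = means[c], variances[c]
    # sum of log Gaussian pdfs (log likelihood) plus log prior
    log_likelihood = np.sum(-0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var))
    return log_likelihood + np.log(priors[c])

x_new = np.array([66.0, 140.0])
prediction = max(priors, key=lambda c: log_posterior(x_new, c))
print(prediction)  # class with the highest log posterior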

Naive Bayes Classifier

Define the Class

Here is how the Naive Bayes algorithm works:

Given the training data and their class labels:

  1. Calculate the posterior probability of each class in target 'y' given features 'X':

    $P(y \mid X)=\frac{P\left(x_{1} \mid y\right) \cdot P\left(x_{2} \mid y\right) \cdot \ldots \cdot P\left(x_{n} \mid y\right) \cdot P(y)}{P(X)}$

    To calculate posterior probability, we:

    • Subset the data for each class in target $y$
    • Calculate the prior probability of the class, $P(y)$: the relative frequency of the class in the training data
    • Calculate the class conditional probability, $P\left(x_{i} \mid y\right)$, using the probability density function (PDF) of the Gaussian/Normal distribution:

      $P\left(x_{i} \mid y\right)=\frac{1}{\sqrt{2 \pi \sigma_{y}^{2}}} \cdot \exp \left(-\frac{\left(x_{i}-\mu_{y}\right)^{2}}{2 \sigma_{y}^{2}}\right)$

      For this, we calculate the mean and variance of each feature within the class.

    Note - Technically, we only calculate the numerator of posterior probability. Since marginal probability, $P(X)$, is the same for all classes, we can ignore the denominator.

  2. Find the class with the highest posterior probability using the following formula:

    $y=\operatorname{argmax}_{y} \log \left(P\left(x_{1} \mid y\right)\right)+\log \left(P\left(x_{2} \mid y\right)\right)+\ldots+\log \left(P\left(x_{n} \mid y\right)\right)+\log (P(y))$

In [1]:
import numpy as np


class NaiveBayes():
    
    def __init__(self):
        pass
    
    
    def fit(self, X, y):
        '''
        1. Identify unique classes and their number in target variable
        
        2. Initialize mean, variance and prior probability vectors
            - Since mean and variance are calculated for each feature in each class, 
              their shape will be (no. of classes, no. of features)
            - Since prior is just the probability of each class, it remains a 1-D array
            
        3. Iterate over each class and
            - Subset the data for class
            - Compute mean, variance and prior probability for each class
            - Add these values to initialized arrays at the same time
        '''
        
        # Get no. of records and features
        n_records, n_features = X.shape
        
        # Identify unique classes and their number in target variable
        self._classes = np.unique(y)
        n_classes = len(self._classes)
        
        # Initialize mean, variance and prior probability vectors
        self._mean = np.zeros((n_classes, n_features), dtype = np.float64)
        self._var = np.zeros((n_classes, n_features), dtype = np.float64)
        self._prior = np.zeros(n_classes, dtype = np.float64)
        
        # Iterate over each class to compute mean, variance and prior
        # (index by position so this also works when class labels are not 0..n-1)
        for idx, c in enumerate(self._classes):
            X_class = X[y == c]
            self._mean[idx, :] = np.mean(X_class, axis=0)
            self._var[idx, :] = np.var(X_class, axis=0)
            self._prior[idx] = len(X_class) / n_records

            
    def predict(self, X):
        '''
        Returns the predicted class given test data
        '''
        y_pred = [self._predict(x) for x in X]
        return y_pred
  

    def _predict(self, x):
        
        '''
        1. Initialize a list of posterior prob for each class
        
        2. Calculate posterior probability for each class using argmax formula:
          - get the prior probability of class from fit method
          - calculate conditional probability
          - calculate posterior and append to list of posterior probabilities 
          
        3. Select class with highest posterior
        '''
        # Initialize posteriors
        posteriors = []
        
        # Iterate over each class and calculate posterior using PDF function
        for idx, c in enumerate(self._classes):
            prior = np.log(self._prior[idx])
            conditional = np.sum(np.log(self._pdf(idx, x)))
            posterior = prior + conditional
            posteriors.append(posterior)
            
        # Select class with highest posterior
        final_cls = self._classes[np.argmax(posteriors)]
        return final_cls
    
    
    def _pdf(self, cls_idx, x):
        '''
        Calculates conditional probability using Gaussian PDF formula
        '''
        mean = self._mean[cls_idx]
        var = self._var[cls_idx]
        numerator = np.exp(-(x - mean)**2 / (2 * var))
        denominator = np.sqrt(2 * np.pi * var)
        return numerator / denominator
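
Before moving to the real dataset, a quick sanity check of the class on a tiny synthetic problem can be useful. The toy data below (X_toy, y_toy) is generated purely for illustration:

import numpy as np

# Two well-separated synthetic clusters, one per class
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(0, 1, size=(50, 2)), rng.normal(5, 1, size=(50, 2))])
y_toy = np.array([0] * 50 + [1] * 50)

toy_model = NaiveBayes()
toy_model.fit(X_toy, y_toy)
print(toy_model.predict(np.array([[0.2, -0.1], [4.8, 5.3]])))  # expected: [0, 1]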

Get Data and Build Model

For this exercise, we will use the Rice Seed dataset downloaded from Kaggle. The data was extracted for two kinds of rice, Jasmine (Class = 1) and Gonen (Class = 0), and contains 18,185 records with morphological features such as Area, MajorAxisLength and Perimeter.

We will:

  1. Read the data
  2. Split the data into train and test sets
  3. Standardize the data
  4. Build and fit the model
In [2]:
# Basic Imports
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.naive_bayes import GaussianNB

Load Data

In [3]:
# Read the data
data = pd.read_csv('Datasets/riceClassification.csv')
data.head()
Out[3]:
id Area MajorAxisLength MinorAxisLength Eccentricity ConvexArea EquivDiameter Extent Perimeter Roundness AspectRation Class
0 1 4537 92.229317 64.012769 0.719916 4677 76.004525 0.657536 273.085 0.764510 1.440796 1
1 2 2872 74.691881 51.400454 0.725553 3015 60.471018 0.713009 208.317 0.831658 1.453137 1
2 3 3048 76.293164 52.043491 0.731211 3132 62.296341 0.759153 210.012 0.868434 1.465950 1
3 4 3073 77.033628 51.928487 0.738639 3157 62.551300 0.783529 210.657 0.870203 1.483456 1
4 5 3693 85.124785 56.374021 0.749282 3802 68.571668 0.769375 230.332 0.874743 1.510000 1
In [4]:
# Check size
data.shape
Out[4]:
(18185, 12)
In [5]:
# Check class count
data['Class'].value_counts()
Out[5]:
1    9985
0    8200
Name: Class, dtype: int64

There are 9985 records for Jasmine and 8200 records for Gonen.

Plot Data

The plots below show the variation in each feature colored by class (Jasmine - 1, Gonen - 0).

In [6]:
cols = data.columns[1:]
fig = plt.figure(figsize=(15, 20))
for sp in range(0,len(cols)):
    ax = fig.add_subplot(4,3,sp+1)
    sns.scatterplot(data=data,
                   x = np.arange(0, data.shape[0]),
                   y = data.loc[:,cols[sp]],
                   hue = data['Class'].map({0:'Gonen', 1:'Jasmine'}),
                   ax=ax)
    ax.set_xlabel('No. of Records')
    ax.set_ylabel(cols[sp])
plt.tight_layout(pad=1.0) # adjust subplot spacing so labels do not overlap
plt.show()

Clean the data

In [7]:
# Drop columns that are not needed
data.drop('id', axis=1, inplace=True)
data.columns
Out[7]:
Index(['Area', 'MajorAxisLength', 'MinorAxisLength', 'Eccentricity',
       'ConvexArea', 'EquivDiameter', 'Extent', 'Perimeter', 'Roundness',
       'AspectRation', 'Class'],
      dtype='object')
In [8]:
# Check for NA values
data.isnull().sum()
Out[8]:
Area               0
MajorAxisLength    0
MinorAxisLength    0
Eccentricity       0
ConvexArea         0
EquivDiameter      0
Extent             0
Perimeter          0
Roundness          0
AspectRation       0
Class              0
dtype: int64

Split and Standardize data

In [9]:
# Create X and Y variables
X = data.drop('Class', axis=1)
print(X.shape)

Y = data.Class
print(Y.shape)
(18185, 10)
(18185,)
In [10]:
# Train test split
x_train,x_test,y_train,y_test = train_test_split(X,Y,test_size=0.25,random_state=100)

display(f'Shape of training data {x_train.shape}')
display(f'Shape of test data {x_test.shape}')
'Shape of training data (13638, 10)'
'Shape of test data (4547, 10)'
In [11]:
# Standardize the data: fit the scaler on the training set only,
# then apply the same transformation to the test set
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

Build Model

In [12]:
# Build Model
NB_model = NaiveBayes()
NB_model.fit(x_train, y_train)

Evaluate Model Performance

In this section, we will check the performance of our model on the train and test data. We will also build a Naive Bayes model using the scikit-learn library and compare performance.

Before we make predictions, let's define a function, accuracy, which returns the ratio of the number of correct predictions to the number of observations.

In [13]:
# Define Accuracy
def accuracy(y_true, y_pred):
    acc = np.sum(y_true == y_pred) / len(y_true)
    return acc

Performance on Training data

Evaluate model performance on training data.

In [14]:
# Create predictions
train_pred = NB_model.predict(x_train)
In [15]:
# Calculate accuracy
train_acc = accuracy(y_train, train_pred)
print(f'Accuracy on training set: {train_acc:.3f}')
Accuracy on training set: 0.984
In [16]:
# Print classification report
train_report = classification_report(y_train, train_pred)
print(f'Classification Report:\n {train_report}')
Classification Report:
               precision    recall  f1-score   support

           0       1.00      0.97      0.98      6151
           1       0.97      1.00      0.99      7487

    accuracy                           0.98     13638
   macro avg       0.99      0.98      0.98     13638
weighted avg       0.98      0.98      0.98     13638

Performance on Test data

Evaluate model performance on test data.

In [17]:
# Create predictions on test set
test_pred = NB_model.predict(x_test)
In [18]:
# Calculate accuracy
test_acc = accuracy(y_test, test_pred)
print(f'Accuracy on test set: {test_acc:.3f}')
Accuracy on test set: 0.982
In [19]:
# Calculate metrics
cm_test = confusion_matrix(y_test, test_pred)
print("Confusion Matrix:\n")

classes = ['Gonen','Jasmine']

# Plot confusion matrix
sns.heatmap(cm_test, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=classes, yticklabels=classes)
plt.ylabel('True label')
plt.xlabel('Predicted label');
Confusion Matrix:

In [20]:
# Print classification report
test_report = classification_report(y_test, test_pred)
print(f'Classification Report:\n {test_report}')
Classification Report:
               precision    recall  f1-score   support

           0       1.00      0.96      0.98      2049
           1       0.97      1.00      0.98      2498

    accuracy                           0.98      4547
   macro avg       0.98      0.98      0.98      4547
weighted avg       0.98      0.98      0.98      4547

Precision: the ratio of correctly predicted positive observations to all predicted positive observations, TP / (TP + FP). Precision is a good measure when the cost of a false positive is high.

Recall: the ratio of correctly predicted positive observations to all actual positive observations, TP / (TP + FN). Recall is a good measure when the cost of a false negative is high.

F1 Score: the harmonic mean of precision and recall, taking both metrics into account. The harmonic mean is used because it penalizes extreme values. When we want a balanced classification model with an optimal trade-off between precision and recall, we try to maximize the F1 score.
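
To make the connection explicit, here is a small sketch that derives these three metrics for the positive class (Jasmine = 1) directly from the test-set confusion matrix cm_test computed above. The helper name metrics_from_cm is just for illustration and is not part of scikit-learn; classification_report already prints the same values.

import numpy as np

def metrics_from_cm(cm):
    # cm is a 2x2 confusion matrix laid out as [[TN, FP], [FN, TP]],
    # which is how scikit-learn's confusion_matrix orders a binary problem
    tn, fp, fn, tp = cm.ravel()
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

precision, recall, f1 = metrics_from_cm(cm_test)
print(f'Precision: {precision:.2f}, Recall: {recall:.2f}, F1: {f1:.2f}')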

Comparison with Scikit Learn

Here, we build a model using GaussianNB from the scikit-learn library and compare the results with the model created from scratch.

In [21]:
# Build a model using scikit learn
   
# Fit the Gaussian Naive Bayes classifier
clf = GaussianNB()

clf.fit(x_train, y_train)
     
# predicting on training data-set
y_train_pred = clf.predict(x_train)
   
# predicting on test data-set
y_test_pred = clf.predict(x_test)
In [22]:
# Calculate accuracy
clf_acc = clf.score(x_test, y_test)
print(f'GaussianNB accuracy on test set: {clf_acc:.3f}')
GaussianNB accuracy on test set: 0.982
In [23]:
# Calculate metrics
cm_test_scikit = confusion_matrix(y_test, y_test_pred)
print("Confusion Matrix:\n")

# Plot confusion matrix
sns.heatmap(cm_test_scikit, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=classes, yticklabels=classes)
plt.ylabel('True label')
plt.xlabel('Predicted label');
Confusion Matrix:

In [24]:
# Print classification report
test_report_scikit = classification_report(y_test, y_test_pred)
print(f'Classification Report:\n {test_report_scikit}')
Classification Report:
               precision    recall  f1-score   support

           0       1.00      0.96      0.98      2049
           1       0.97      1.00      0.98      2498

    accuracy                           0.98      4547
   macro avg       0.98      0.98      0.98      4547
weighted avg       0.98      0.98      0.98      4547

Model Results

Model                        Accuracy    Precision    Recall    F1 Score
Training Data                0.984       0.98         0.98      0.98
Test Data                    0.982       0.98         0.98      0.98
Model using scikit-learn     0.982       0.98         0.98      0.98

From the results comparison, we can see that the model created from scratch fares well when compared to the model built with the scikit-learn library.

Summary

To summarize, in this notebook we created a Gaussian Naive Bayes classifier from scratch, evaluated it on the training and test data, and compared its performance with a model built using the scikit-learn library.