Random Forest Classifier

Overview

Goal: The goal is to build a naive Random Forest Classifier from scratch.

We will:

  • Understand the basics
  • Create the class to build and fit a Random Forest classifier
  • Utilize Breast cancer wisconsin (diagnostic) dataset from sklearn.datasets to:
    • Build a model
    • Evaluate model performance on training and test data
    • Build a model using RandomForestClassifier from scikit-learn
    • Review results to compare our model's performance with scikit-learn model

Basics

Random Forests are built on the fundamental of Decision Tree. The idea is to build a large number of decison trees and then aggregate the prediction of all trees. Even though individual trees overfit, aggregation reduces the variance of the overall procedure and results in improved predictive performance.

image.png Image Source

Bagging

Bagging, also known as bootstrap aggregation, is a technique to reduce overfiting by randomly sampling the data with replacement. Data is randomly sampled (only rows/records are sampled) and these subsets of data are then passed to build multiple Decision Trees. The idea of bagging is to reduce model variance by averaging many models trained on different subsets of data.

Trees in bagging are not completely independent of each other since all the original features are considered at every split of every tree. Simply bagging trees results in tree correlation that limits the effect of variance reduction.

Building a Forest

Random Forest introduces more randomness by randomly sampling the features as well thereby reducing the effects of tree correlation. So, in random forest, both records and features are sampled with replacement.

image.png

At a high level, here is how the algorithm works:

a. Identify the number of trees to build.
b. Randomly sample the data with replacement - here both records (rows) and features (columns) are randomly sampled.
c. Pass randomly sampled data to different trees.
d. Each tree performs its own computation and votes on a outcome.
e. Use aggregate of the predicted response from all trees.

The elements of building an individual tree such as Where and how to split, How to stop splitting and How to assign a class label are discussed in the Decision Tree notebook.

Pros and Cons

Pros Cons
Handle continous and discrete features Not interpretable
Can be used for both classification and regression tasks Do not work well with unstructured data (images, audio etc)
Require very little pre-processing
Can easily handle missing values
Non-parametric model: no assumptions about the shape of data
Often quite accurate

Random Forest Classifier

In this section, we will create the class to build and fit a Random Forest classifier. We will also build and fit a model using training data.

At a high level, here is how the algorithm works:

a. Identify the number of trees to build.    
b. Randomly sample the data with replacement - here both records (rows) and features (columns) are randomly sampled.   
c. Pass randomly sampled data to different trees.    
d. Each tree performs its own computation and votes on a outcome.    
e. Use aggregate of the predicted response from all trees. 

Define the Class

Since Random Forest builds multiple decision trees, we will import the My_Tree class from Decision Tree notebook.

Here is an approach to defining the classes:

My_RandomForest: fits the random forest model and makes predictions

a. `fit()`
    i. Initializes the following parameters:
        a. n_trees = no of trees to build     
        b. min_split_samples = min. samples required by a node to create a split     
        c. max_depth = max. no. of nodes in a tree    
        d. n_feats = number of features to randomly sample and pass to each tree     
        e. replace = when True randomly samples features with replacement   
    ii. Iterates over each tree to:
        a. Randomly sample the data with replacement using `_bootstrap_agg()`     
        b. Fit a tree using `My_Tree` class imported from Decision Tree         

b. `_bootstrap_agg()` randomly samples the data with replacement and returns it

c. `predict()` 
    i. Iterates over different trees created    
    ii. Uses `My_Tree.predict()` to generates predictions for each tree     
    iii. Uses `_most_common()` to return the most common value as an array
In [1]:
# Imports My_Tree class from Decision_Tree.ipynb
import nbimporter
from Decision_Tree import My_Tree
In [2]:
class My_RandomForest():
    def __init__(self):
        self.trees = []
    
    def fit(self, X, y, n_trees=100, min_split_samples=2, max_depth=100, n_feats=None, replace=False):
        """
        i. Initializes the following parameters:
            a. n_trees = no of trees to build     
            b. min_split_samples = min. samples required by a node to create a split     
            c. max_depth = max. no. of nodes in a tree    
            d. n_feats = number of features to randomly sample and pass to each tree     
            e. replace = when True randomly samples features with replacement   
        ii. Iterates over each tree to:
            a. Randomly sample the data with replacement using `_bootstrap_agg()`     
            b. Fit a tree using `My_Tree` class imported from Decision Tree
        """
        self.X = X
        self.y = y
        self.n_trees = n_trees
        self.min_split_samples = min_split_samples
        self.max_depth = max_depth
        self.n_feats = n_feats
        self.replace = replace
        self.trees = []
        
        for _ in range(self.n_trees):
            tree = My_Tree()
            x_sub, y_sub = self._bootstrap_agg(X, y)
            tree.fit(x_sub, y_sub, min_split_samples=min_split_samples, 
                    max_depth=max_depth, n_feats=n_feats, replace=replace)
            self.trees.append(tree)
    
    def _bootstrap_agg(self, X, y):
        """
        randomly samples the data with replacement and returns it
        """
        n_records = X.shape[0]
        idxs = np.random.choice(n_records, size=n_records, replace=True)
        return X[idxs], y[idxs]
    
    def predict(self, X):
        """
        i. Iterates over different trees created    
        ii. Uses `My_Tree.predict()` to generates predictions for each tree     
        iii. Uses `_most_common()` to return the most common value as an array
        
        `tree_preds` returns an array of predictions generated by each tree for each record in test data.
        For e.g. a test data with 4 records when iterated over 3 trees will result in labels generated 
        by each tree [1111 0000 1111]. The result we want is the most common label for each test record i.e.
        [101 101 101 101]. This shows that for 1st test record, 1st tree predicted 1, second: 0 and third tree: 1.
        To get the result in this form, we use `swapaxes()`.
        """
        tree_preds = np.array([tree.predict(X) for tree in self.trees])
        tree_preds = np.swapaxes(tree_preds, 0, 1)
        y_pred = [self._most_common(pred) for pred in tree_preds]
        return np.array(y_pred)
    
    def _most_common(self, y):
        """
        returns the most common value in a node
        """
        count = Counter(y)
        common = count.most_common(1)[0][0]
        return common

Get Data and Build Model

In this section, we will:

  • Load the Breast cancer wisconsin (diagnostic) dataset from sklearn.datasets
  • Split it into training and test sets
  • Build the model
In [3]:
# Basic Imports
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn import datasets
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix, classification_report
from collections import Counter

View Data Description

In [4]:
# View data description
data = datasets.load_breast_cancer()
print(data.DESCR)
.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry 
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 3 is Mean Radius, field
        13 is Radius SE, field 23 is Worst Radius.

        - class:
                - WDBC-Malignant
                - WDBC-Benign

    :Summary Statistics:

    ===================================== ====== ======
                                           Min    Max
    ===================================== ====== ======
    radius (mean):                        6.981  28.11
    texture (mean):                       9.71   39.28
    perimeter (mean):                     43.79  188.5
    area (mean):                          143.5  2501.0
    smoothness (mean):                    0.053  0.163
    compactness (mean):                   0.019  0.345
    concavity (mean):                     0.0    0.427
    concave points (mean):                0.0    0.201
    symmetry (mean):                      0.106  0.304
    fractal dimension (mean):             0.05   0.097
    radius (standard error):              0.112  2.873
    texture (standard error):             0.36   4.885
    perimeter (standard error):           0.757  21.98
    area (standard error):                6.802  542.2
    smoothness (standard error):          0.002  0.031
    compactness (standard error):         0.002  0.135
    concavity (standard error):           0.0    0.396
    concave points (standard error):      0.0    0.053
    symmetry (standard error):            0.008  0.079
    fractal dimension (standard error):   0.001  0.03
    radius (worst):                       7.93   36.04
    texture (worst):                      12.02  49.54
    perimeter (worst):                    50.41  251.2
    area (worst):                         185.2  4254.0
    smoothness (worst):                   0.071  0.223
    compactness (worst):                  0.027  1.058
    concavity (worst):                    0.0    1.252
    concave points (worst):               0.0    0.291
    symmetry (worst):                     0.156  0.664
    fractal dimension (worst):            0.055  0.208
    ===================================== ====== ======

    :Missing Attribute Values: None

    :Class Distribution: 212 - Malignant, 357 - Benign

    :Creator:  Dr. William H. Wolberg, W. Nick Street, Olvi L. Mangasarian

    :Donor: Nick Street

    :Date: November, 1995

This is a copy of UCI ML Breast Cancer Wisconsin (Diagnostic) datasets.
https://goo.gl/U2Uwz2

Features are computed from a digitized image of a fine needle
aspirate (FNA) of a breast mass.  They describe
characteristics of the cell nuclei present in the image.

Separating plane described above was obtained using
Multisurface Method-Tree (MSM-T) [K. P. Bennett, "Decision Tree
Construction Via Linear Programming." Proceedings of the 4th
Midwest Artificial Intelligence and Cognitive Science Society,
pp. 97-101, 1992], a classification method which uses linear
programming to construct a decision tree.  Relevant features
were selected using an exhaustive search in the space of 1-4
features and 1-3 separating planes.

The actual linear program used to obtain the separating plane
in the 3-dimensional space is that described in:
[K. P. Bennett and O. L. Mangasarian: "Robust Linear
Programming Discrimination of Two Linearly Inseparable Sets",
Optimization Methods and Software 1, 1992, 23-34].

This database is also available through the UW CS ftp server:

ftp ftp.cs.wisc.edu
cd math-prog/cpo-dataset/machine-learn/WDBC/

.. topic:: References

   - W.N. Street, W.H. Wolberg and O.L. Mangasarian. Nuclear feature extraction 
     for breast tumor diagnosis. IS&T/SPIE 1993 International Symposium on 
     Electronic Imaging: Science and Technology, volume 1905, pages 861-870,
     San Jose, CA, 1993.
   - O.L. Mangasarian, W.N. Street and W.H. Wolberg. Breast cancer diagnosis and 
     prognosis via linear programming. Operations Research, 43(4), pages 570-577, 
     July-August 1995.
   - W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Machine learning techniques
     to diagnose breast cancer from fine-needle aspirates. Cancer Letters 77 (1994) 
     163-171.

Load Data

In [5]:
# Read the data
cancer_data=pd.DataFrame(load_breast_cancer().data,columns=load_breast_cancer().feature_names)
cancer_data.head()
Out[5]:
mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry mean fractal dimension ... worst radius worst texture worst perimeter worst area worst smoothness worst compactness worst concavity worst concave points worst symmetry worst fractal dimension
0 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.3001 0.14710 0.2419 0.07871 ... 25.38 17.33 184.60 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890
1 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.0869 0.07017 0.1812 0.05667 ... 24.99 23.41 158.80 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902
2 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.1974 0.12790 0.2069 0.05999 ... 23.57 25.53 152.50 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758
3 11.42 20.38 77.58 386.1 0.14250 0.28390 0.2414 0.10520 0.2597 0.09744 ... 14.91 26.50 98.87 567.7 0.2098 0.8663 0.6869 0.2575 0.6638 0.17300
4 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.1980 0.10430 0.1809 0.05883 ... 22.54 16.67 152.20 1575.0 0.1374 0.2050 0.4000 0.1625 0.2364 0.07678

5 rows × 30 columns

In [6]:
# Create dependent and independent variables
Y=load_breast_cancer().target
display(f'Shape of Y: {Y.shape}')

X=load_breast_cancer().data
display(f'Shape of X: {X.shape}')

features = load_breast_cancer().feature_names
'Shape of Y: (569,)'
'Shape of X: (569, 30)'

Split Data

Decision trees (within Random Forest) do not require data to be standardized as nodes are costructed by iterating over threshold values of each feature. So, here we will just split the data.

In [7]:
# Train test split
from sklearn.model_selection import train_test_split

# Split the data
x_train,x_test,y_train,y_test = train_test_split(X,Y,test_size=0.25,random_state=100)

display(f'Shape of training data {x_train.shape}')
display(f'Shape of test data {x_test.shape}')
'Shape of training data (426, 30)'
'Shape of test data (143, 30)'

Build the Model

In [8]:
# Build Model
rf_model = My_RandomForest()
rf_model.fit(x_train, y_train, n_trees=100, min_split_samples=5, 
             max_depth=50, n_feats=3, replace=True)

Evaluate Model Performance

In this section, we will check the performance of our model on train and test data. We will also build a Decision Tree model using scikit-learn library and compare performance.

Before we make predictions, let's define a function, accuracy, which returns the ratio of the number of correct predictions to the number of observations.

In [9]:
# Define Accuracy
def accuracy(y_true, y_pred):
    acc = np.sum(y_true == y_pred) / len(y_true)
    return acc

Performance on Training data

Evaluate model performance on training data.

In [10]:
# Create predictions
train_pred = rf_model.predict(x_train)
In [11]:
# Calculate accuracy
train_acc = accuracy(y_train, train_pred)
print(f'Accuracy on train set: {train_acc}')
Accuracy on train set: 0.9976525821596244
In [12]:
# Print classification report
train_report = classification_report(y_train, train_pred)
print(f'Classification Report:\n {train_report}')
Classification Report:
               precision    recall  f1-score   support

           0       1.00      0.99      1.00       156
           1       1.00      1.00      1.00       270

    accuracy                           1.00       426
   macro avg       1.00      1.00      1.00       426
weighted avg       1.00      1.00      1.00       426

Perofrmance on Test data

Evaluate model performance on test data.

In [13]:
# Create predictions on test set
test_pred = rf_model.predict(x_test)
In [14]:
# Calculate accuracy
test_acc = accuracy(y_test, test_pred)
print(f'Accuracy on test set: {test_acc}')
Accuracy on test set: 0.951048951048951
In [15]:
# Calculate metrics
cm_test = confusion_matrix(y_test, test_pred)
print("Confusion Matrix:\n")

# Plot confusion matrix
classes = set(y_test)
sns.heatmap(cm_test, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=classes, yticklabels=classes)
plt.ylabel('True label')
plt.xlabel('Predicted label');
Confusion Matrix:

In [16]:
# Print classification report
test_report = classification_report(y_test, test_pred)
print(f'Classification Report:\n {test_report}')
Classification Report:
               precision    recall  f1-score   support

           0       0.96      0.91      0.94        56
           1       0.94      0.98      0.96        87

    accuracy                           0.95       143
   macro avg       0.95      0.94      0.95       143
weighted avg       0.95      0.95      0.95       143

Precision: measures how precise/accurate the model is and is the ratio of correctly predicted positive observations to the total predicted positive observations (TP / TP + FP). Precision is a good measure to determine, when the costs of False Positive is high.

Recall: is the ratio of correctly predicted positive observations to the actual positive observations (TP / TP + FN). Recall is a good measure to determine, when the costs of False Negative is high.

In this scenario where we are trying to predict if a patient has cancer or not, the cost of a False Negative is very high. Hence the weighted avg. recall of 0.95 can be used for evaluating the model.

F1 Score: is the harmonic mean of precision and recall taking both metrics into account. Harmonic mean is used because it punishes extreme values. In cases where we want to create a balanced classification model with the optimal balance of recall and precision, then we try to maximize the F1 score.

Comparison with Scikit Learn

Here, we build a model using DecisionTreeClassifier from scikit-learn library and compare the results with model created from scratch.

In [17]:
# Build a model using scikit learn
from sklearn.ensemble import RandomForestClassifier
   
# fit the model
clf = RandomForestClassifier(n_estimators=100, max_depth=50,
                            min_samples_split=5, bootstrap=True)

clf.fit(x_train, y_train)
     
# predicting on training data-set
y_train_pred = clf.predict(x_train)
   
# predicting on test data-set
y_test_pred = clf.predict(x_test)
In [18]:
# Calculate accuracy
clf_acc = clf.score(x_test, y_test)
print(f'SVC accuracy on test set: {clf_acc:.3f}')
SVC accuracy on test set: 0.951
In [19]:
# Calculate metrics
cm_test_scikit = confusion_matrix(y_test, y_test_pred)
print("Confusion Matrix:\n")

# Plot confusion matrix
sns.heatmap(cm_test_scikit, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=classes, yticklabels=classes)
plt.ylabel('True label')
plt.xlabel('Predicted label');
Confusion Matrix:

In [20]:
# Print classification report
test_report_scikit = classification_report(y_test, y_test_pred)
print(f'Classification Report:\n {test_report_scikit}')
Classification Report:
               precision    recall  f1-score   support

           0       0.96      0.91      0.94        56
           1       0.94      0.98      0.96        87

    accuracy                           0.95       143
   macro avg       0.95      0.94      0.95       143
weighted avg       0.95      0.95      0.95       143

Visualize a tree

scikit-learn allows the flexibility to visualize a decision tree. Here, we visualize the first of 100 decision trees created by the model.

In [21]:
from sklearn import tree
fig = plt.figure(figsize=(25,20))
_ = tree.plot_tree(clf.estimators_[0], 
                   feature_names=load_breast_cancer().feature_names,  
                   class_names=load_breast_cancer().target_names,
                   filled=True)

Model Results

Model Accuracy Precision Recall F1 Score
Training Data 1 1 1 1
Test Data 0.951 0.95 0.95 0.95
Model using Scikit learn 0.951 0.95 0.95 0.95

From the results comparison, we can see that the model created from scratch performs fairly when compared to the model created using scikit-learn library. Given that the cost of false negative is high, the recall for our model on test set and the scikit-learn model with similar hyper-parameters is 0.95.

Summary

To summarize, in this notebook we created a naive Random Forest Classifier from scratch. Later, we evaluated the results of our model on training, test data and compared performance with a model created using scikit-learn library.