Goal: Build a naive Random Forest Classifier from scratch.
We will:
a. Load the data from `sklearn.datasets`
b. Build and fit a naive Random Forest classifier from scratch
c. Compare the results to `RandomForestClassifier` from scikit-learn

Random Forests are built on the foundation of Decision Trees. The idea is to build a large number of decision trees and then aggregate the predictions of all the trees. Even though individual trees overfit, aggregation reduces the variance of the overall procedure and results in improved predictive performance.
Bagging, also known as bootstrap aggregation, is a technique to reduce overfitting by randomly sampling the data with replacement. The data is randomly sampled (only rows/records are sampled) and these subsets of the data are then used to build multiple Decision Trees. The idea of bagging is to reduce model variance by averaging many models trained on different subsets of the data.
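As a small illustration (not part of the classifier itself), here is a minimal sketch of drawing a bootstrap sample of rows with NumPy, using a toy array of assumed values:

# Minimal bootstrap-sampling sketch (toy data, assumed values)
import numpy as np

rng = np.random.default_rng(42)  # fixed seed for reproducibility

# Toy data: 6 records, 3 features (illustrative values only)
X = np.arange(18).reshape(6, 3)
y = np.array([0, 1, 0, 1, 1, 0])

# Bootstrap sample: draw 6 row indices with replacement
idxs = rng.choice(X.shape[0], size=X.shape[0], replace=True)
X_boot, y_boot = X[idxs], y[idxs]

print(idxs)    # some rows appear more than once, others are left out
print(X_boot)  # rows of X selected by the bootstrap indices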
Trees in bagging are not completely independent of each other since all the original features are considered at every split of every tree. Simply bagging trees results in tree correlation that limits the effect of variance reduction.
Random Forest introduces more randomness by randomly sampling the features as well, thereby reducing the effect of tree correlation. So, in a Random Forest, both the records (rows) and the features (columns) are randomly sampled.
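A similarly minimal sketch of sampling a random subset of feature columns; `n_features` and `n_feats` here are assumed values used only for illustration:

# Minimal feature-subsampling sketch (assumed values)
import numpy as np

rng = np.random.default_rng(0)

n_features = 10  # total number of features (assumed)
n_feats = 3      # number of features to pass to a tree (assumed)

# Sample feature indices without replacement: three distinct columns
print(rng.choice(n_features, size=n_feats, replace=False))

# Sample with replacement (as the `replace=True` option used later allows):
# the same column index may be drawn more than once
print(rng.choice(n_features, size=n_feats, replace=True))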
At a high level, here is how the algorithm works:
a. Identify the number of trees to build.
b. Randomly sample the data with replacement - here both records (rows) and features (columns) are randomly sampled.
c. Pass the randomly sampled data to the different trees.
d. Each tree performs its own computation and votes on an outcome.
e. Aggregate the predicted responses from all the trees - a majority vote for classification (see the sketch below).
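For step e, a minimal sketch of a majority vote over per-tree predictions using `collections.Counter`; the votes are assumed values:

# Minimal majority-vote sketch (assumed per-tree predictions)
from collections import Counter

# Assumed predictions from 5 trees for a single test record
tree_votes = [1, 0, 1, 1, 0]

# The most frequent label across the trees wins the vote
majority_label = Counter(tree_votes).most_common(1)[0][0]
print(majority_label)  # 1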
The elements of building an individual tree, such as where and how to split, how to stop splitting, and how to assign a class label, are discussed in the Decision Tree notebook.
Pros | Cons |
---|---|
Handle continuous and discrete features | Not interpretable |
Can be used for both classification and regression tasks | Do not work well with unstructured data (images, audio, etc.) |
Require very little pre-processing | |
Can easily handle missing values | |
Non-parametric model: no assumptions about the shape of the data | |
Often quite accurate | |
In this section, we will create the class to build and fit a Random Forest classifier. We will also build and fit a model using training data.
Since Random Forest builds multiple decision trees, we will import the `My_Tree` class from the Decision Tree notebook.
Here is an approach to defining the classes:
My_RandomForest: fits the random forest model and makes predictions
a. `fit()`
i. Initializes the following parameters:
a. n_trees = no of trees to build
b. min_split_samples = min. samples required by a node to create a split
c. max_depth = max. depth of a tree
d. n_feats = number of features to randomly sample and pass to each tree
e. replace = when True randomly samples features with replacement
ii. Iterates over each tree to:
a. Randomly sample the data with replacement using `_bootstrap_agg()`
b. Fit a tree using `My_Tree` class imported from Decision Tree
b. `_bootstrap_agg()` randomly samples the data with replacement and returns it
c. `predict()`
i. Iterates over different trees created
ii. Uses `My_Tree.predict()` to generate predictions for each tree
iii. Uses `_most_common()` to return the most common value as an array
# Imports My_Tree class from Decision_Tree.ipynb
import nbimporter
from Decision_Tree import My_Tree
# numpy and Counter are used by the My_RandomForest class below
import numpy as np
from collections import Counter
class My_RandomForest():

    def __init__(self):
        self.trees = []

    def fit(self, X, y, n_trees=100, min_split_samples=2, max_depth=100, n_feats=None, replace=False):
        """
        i. Initializes the following parameters:
            a. n_trees = no. of trees to build
            b. min_split_samples = min. samples required by a node to create a split
            c. max_depth = max. depth of a tree
            d. n_feats = number of features to randomly sample and pass to each tree
            e. replace = when True, randomly samples features with replacement
        ii. Iterates over each tree to:
            a. Randomly sample the data with replacement using `_bootstrap_agg()`
            b. Fit a tree using the `My_Tree` class imported from the Decision Tree notebook
        """
        self.X = X
        self.y = y
        self.n_trees = n_trees
        self.min_split_samples = min_split_samples
        self.max_depth = max_depth
        self.n_feats = n_feats
        self.replace = replace
        self.trees = []

        for _ in range(self.n_trees):
            tree = My_Tree()
            # Bootstrap sample of the training data for this tree
            x_sub, y_sub = self._bootstrap_agg(X, y)
            tree.fit(x_sub, y_sub, min_split_samples=min_split_samples,
                     max_depth=max_depth, n_feats=n_feats, replace=replace)
            self.trees.append(tree)

    def _bootstrap_agg(self, X, y):
        """
        Randomly samples the data (rows) with replacement and returns it.
        """
        n_records = X.shape[0]
        idxs = np.random.choice(n_records, size=n_records, replace=True)
        return X[idxs], y[idxs]

    def predict(self, X):
        """
        i. Iterates over the different trees created
        ii. Uses `My_Tree.predict()` to generate predictions for each tree
        iii. Uses `_most_common()` to return the most common value as an array

        `tree_preds` holds the predictions generated by each tree for each record in the test data,
        one row per tree. For e.g. a test set with 4 records passed to 3 trees gives a (3, 4) array
        such as [[1 1 1 1], [0 0 0 0], [1 1 1 1]]. The result we want is the most common label per
        test record, so we regroup the predictions into a (4, 3) array
        [[1 0 1], [1 0 1], [1 0 1], [1 0 1]], i.e. for the 1st test record the 1st tree predicted 1,
        the second tree 0 and the third tree 1. To get the array in this form, we use `swapaxes()`.
        """
        # Shape (n_trees, n_records): one row of predictions per tree
        tree_preds = np.array([tree.predict(X) for tree in self.trees])
        # Shape (n_records, n_trees): one row of per-tree votes per record
        tree_preds = np.swapaxes(tree_preds, 0, 1)
        # Majority vote per record
        y_pred = [self._most_common(pred) for pred in tree_preds]
        return np.array(y_pred)

    def _most_common(self, y):
        """
        Returns the most common value in a node.
        """
        count = Counter(y)
        common = count.most_common(1)[0][0]
        return common
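To see what the `swapaxes()` call inside `predict()` does, here is a small standalone sketch using the toy predictions from the docstring (illustrative values only):

# Standalone swapaxes illustration (toy predictions)
import numpy as np

# Predictions from 3 trees for 4 test records: one row per tree, shape (3, 4)
tree_preds = np.array([[1, 1, 1, 1],
                       [0, 0, 0, 0],
                       [1, 1, 1, 1]])

# Regroup so each row holds the votes of all trees for one record, shape (4, 3)
per_record = np.swapaxes(tree_preds, 0, 1)
print(per_record)
# [[1 0 1]
#  [1 0 1]
#  [1 0 1]
#  [1 0 1]]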
In this section, we will load the breast cancer dataset from `sklearn.datasets`, create the dependent and independent variables, split the data into training and test sets, and fit the model defined above.
# Basic Imports
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn import datasets
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix, classification_report
from collections import Counter
# View data description
data = datasets.load_breast_cancer()
print(data.DESCR)
# Read the data
cancer_data=pd.DataFrame(load_breast_cancer().data,columns=load_breast_cancer().feature_names)
cancer_data.head()
# Create dependent and independent variables
Y=load_breast_cancer().target
display(f'Shape of Y: {Y.shape}')
X=load_breast_cancer().data
display(f'Shape of X: {X.shape}')
features = load_breast_cancer().feature_names
Decision trees (within a Random Forest) do not require the data to be standardized, as nodes are constructed by iterating over threshold values of each feature. So, here we will just split the data.
# Train test split
from sklearn.model_selection import train_test_split
# Split the data
x_train,x_test,y_train,y_test = train_test_split(X,Y,test_size=0.25,random_state=100)
display(f'Shape of training data {x_train.shape}')
display(f'Shape of test data {x_test.shape}')
# Build Model
rf_model = My_RandomForest()
rf_model.fit(x_train, y_train, n_trees=100, min_split_samples=5,
max_depth=50, n_feats=3, replace=True)
In this section, we will check the performance of our model on train and test data. We will also build a Random Forest model using the scikit-learn library and compare performance.
Before we make predictions, let's define a function, `accuracy`, which returns the ratio of the number of correct predictions to the number of observations.
# Define Accuracy
def accuracy(y_true, y_pred):
acc = np.sum(y_true == y_pred) / len(y_true)
return acc
Evaluate model performance on training data.
# Create predictions
train_pred = rf_model.predict(x_train)
# Calculate accuracy
train_acc = accuracy(y_train, train_pred)
print(f'Accuracy on train set: {train_acc}')
# Print classification report
train_report = classification_report(y_train, train_pred)
print(f'Classification Report:\n {train_report}')
Evaluate model performance on test data.
# Create predictions on test set
test_pred = rf_model.predict(x_test)
# Calculate accuracy
test_acc = accuracy(y_test, test_pred)
print(f'Accuracy on test set: {test_acc}')
# Calculate metrics
cm_test = confusion_matrix(y_test, test_pred)
print("Confusion Matrix:\n")
# Plot confusion matrix
classes = np.unique(y_test)  # sorted class labels for the tick labels
sns.heatmap(cm_test, square=True, annot=True, fmt='d', cbar=False,
xticklabels=classes, yticklabels=classes)
plt.ylabel('True label')
plt.xlabel('Predicted label');
# Print classification report
test_report = classification_report(y_test, test_pred)
print(f'Classification Report:\n {test_report}')
Precision: measures how precise/accurate the model is and is the ratio of correctly predicted positive observations to the total predicted positive observations, TP / (TP + FP). Precision is a good measure when the cost of a False Positive is high.
Recall: is the ratio of correctly predicted positive observations to the actual positive observations, TP / (TP + FN). Recall is a good measure when the cost of a False Negative is high.
In this scenario, where we are trying to predict whether a patient has cancer or not, the cost of a False Negative is very high. Hence the weighted avg. recall of 0.95 can be used for evaluating the model.
F1 Score: is the harmonic mean of precision and recall, taking both metrics into account. The harmonic mean is used because it punishes extreme values. When we want a balanced classification model with an optimal balance of recall and precision, we try to maximize the F1 score.
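As a quick check of these formulas, a minimal sketch that computes the three metrics by hand from assumed confusion-matrix counts (not the counts from this notebook):

# Precision / Recall / F1 from assumed confusion-matrix counts
tp, fp, fn = 90, 5, 5  # illustrative values only

precision = tp / (tp + fp)                           # TP / (TP + FP)
recall = tp / (tp + fn)                              # TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two

print(f'Precision: {precision:.3f}, Recall: {recall:.3f}, F1: {f1:.3f}')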
Here, we build a model using `RandomForestClassifier` from the scikit-learn library and compare the results with the model created from scratch.
# Build a model using scikit learn
from sklearn.ensemble import RandomForestClassifier
# fit the model
clf = RandomForestClassifier(n_estimators=100, max_depth=50,
min_samples_split=5, bootstrap=True)
clf.fit(x_train, y_train)
# predicting on training data-set
y_train_pred = clf.predict(x_train)
# predicting on test data-set
y_test_pred = clf.predict(x_test)
# Calculate accuracy
clf_acc = clf.score(x_test, y_test)
print(f'Random Forest accuracy on test set: {clf_acc:.3f}')
# Calculate metrics
cm_test_scikit = confusion_matrix(y_test, y_test_pred)
print("Confusion Matrix:\n")
# Plot confusion matrix
sns.heatmap(cm_test_scikit, square=True, annot=True, fmt='d', cbar=False,
xticklabels=classes, yticklabels=classes)
plt.ylabel('True label')
plt.xlabel('Predicted label');
# Print classification report
test_report_scikit = classification_report(y_test, y_test_pred)
print(f'Classification Report:\n {test_report_scikit}')
scikit-learn allows the flexibility to visualize a decision tree. Here, we visualize the first of the 100 decision trees created by the model.
from sklearn import tree
fig = plt.figure(figsize=(25,20))
_ = tree.plot_tree(clf.estimators_[0],
feature_names=load_breast_cancer().feature_names,
class_names=load_breast_cancer().target_names,
filled=True)
Model | Accuracy | Precision | Recall | F1 Score |
---|---|---|---|---|
Scratch model (training data) | 1 | 1 | 1 | 1 |
Scratch model (test data) | 0.951 | 0.95 | 0.95 | 0.95 |
scikit-learn model (test data) | 0.951 | 0.95 | 0.95 | 0.95 |
From the results comparison, we can see that the model created from scratch performs comparably to the model created using the scikit-learn library. Given that the cost of a false negative is high, recall is the key metric here; both our model and the scikit-learn model with similar hyper-parameters achieve a recall of 0.95 on the test set.
To summarize, in this notebook we created a naive Random Forest Classifier from scratch. We then evaluated the results of our model on training and test data and compared its performance with a model created using the scikit-learn library.