Goal: Build a naive Decision Tree Classifier from scratch.
A decision tree is a tree in which each node represents a feature (attribute), each link (branch) represents a decision (rule), and each leaf represents an outcome (a categorical or continuous value).
Classification trees are a hierarchical way of partitioning the space. We start with the entire space and recursively divide it into smaller regions. At the end, every region is assigned a class label.
So, a basic decision tree partitions the training data into homogeneous subgroups. The subgroups (also called nodes) are formed recursively using binary partitions created by asking simple yes-or-no questions about each feature (e.g., is commute > 1hr?). This is repeated until a suitable stopping criterion is satisfied. After all the partitioning has been done, the model predicts the output for an observation based on the region it falls into.
Building a tree involves the following three general elements:
a. Where and how to split: how do we decide which feature to split on and how to split the tree?
b. How to stop splitting: once we start to 'grow' the tree, how do we decide when to stop splitting?
c. How to assign a class label: each leaf node needs to be assigned to a class. How do we assign these class labels?
The objective at each node is to find the “best” feature to partition the remaining data such that the Information Gain is maximized.
Entropy is calculated as:
$H(S) = -\,p_{+} \log_{2}(p_{+}) - p_{-} \log_{2}(p_{-})$
where:
$p_{+}$ → % of positive class
$p_{-}$ → % of negative class
Entropy(F1) $= -\frac{9}{14}\log_{2}\left(\frac{9}{14}\right) - \frac{5}{14}\log_{2}\left(\frac{5}{14}\right) = 0.94$
Entropy(F2) $= -\frac{6}{8}\log_{2}\left(\frac{6}{8}\right) - \frac{2}{8}\log_{2}\left(\frac{2}{8}\right) = 0.81$
Entropy(F3) $= -\frac{3}{6}\log_{2}\left(\frac{3}{6}\right) - \frac{3}{6}\log_{2}\left(\frac{3}{6}\right) = 1$
Information gain is used to decide which feature to split on at each step in building the tree. Information Gain is the decrease in entropy achieved by a split: the difference between the entropy before the split and the weighted average entropy after splitting the dataset on a given attribute.
Information Gain is calculated as:
$\operatorname{Gain}(S, A) = H(S) - \sum_{v} \frac{|S_{v}|}{|S|}\, H(S_{v})$
where:
$H(S)$ - entropy of the parent (root) node
$S_{v}$ - subset after the split
$S$ - the complete set at the node
$H(S_{v})$ - entropy of each child node
Information gain for the tree above is:
$H(F1) - \frac{8}{14} \cdot H(F2) - \frac{6}{14} \cdot H(F3)$
$= 0.94 - \frac{8}{14} \cdot 0.81 - \frac{6}{14} \cdot 1$
$\approx 0.05$
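As a quick check on the arithmetic above, the three entropies and the resulting gain can be reproduced in a few lines of Python. This snippet is illustrative only; the `entropy` helper and the class counts simply restate the numbers from the example.
# Verify the entropy and information gain arithmetic (illustrative)
import numpy as np

def entropy(counts):
    '''Entropy from raw class counts, e.g. entropy([9, 5]).'''
    p = np.array(counts) / sum(counts)
    return -np.sum(p * np.log2(p))

h_f1 = entropy([9, 5])    # ~0.94
h_f2 = entropy([6, 2])    # ~0.81
h_f3 = entropy([3, 3])    # 1.0
# Information gain of splitting F1 into F2 and F3
gain = h_f1 - (8/14) * h_f2 - (6/14) * h_f3   # ~0.05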
Gini Impurity is calculated as:
$GI = 1 - \sum_{i=1}^{n} p_{i}^{2}$
$= 1 - \left[(p_{+})^{2} + (p_{-})^{2}\right]$
Gini (F2) = 1 - [(6/8)^2 + (2/8)^2] = 1 - (0.5625 + 0.0625) = 0.375
Gini (F3) = 1 - [(3/6)^2 + (3/6)^2] = 0.5
Weighted Average Gini Impurity
It is used to decide which feature to split on at each step in building the tree and serves the same purpose as Information Gain (both measure how impure the resulting nodes are). It is calculated as the weighted average of the Gini impurity of each split.
Gini(F1) $= \left(\frac{8}{8+6}\right) \times GI(F2) + \left(\frac{6}{8+6}\right) \times GI(F3)$
$= \left(\frac{8}{14}\right) \times 0.375 + \left(\frac{6}{14}\right) \times 0.5$
$= 0.214 + 0.214 \approx 0.43$
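The Gini numbers can be verified the same way. Again, this is purely illustrative; `gini` is a throwaway helper, not part of the classifier built later.
# Verify the Gini impurity arithmetic (illustrative)
def gini(counts):
    '''Gini impurity from raw class counts, e.g. gini([6, 2]).'''
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

g_f2 = gini([6, 2])                              # 0.375
g_f3 = gini([3, 3])                              # 0.5
weighted_gini = (8/14) * g_f2 + (6/14) * g_f3    # ~0.43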
In the scenario where the response variable is continuous, the objective at each node is to find the “best” feature to partition the remaining data such that the overall error between the actual response and the predicted value is minimized.
So at each node, we select the predictor and the threshold that lead to the greatest reduction in RSS (residual sum of squares).
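Below is a minimal sketch of this search for a single numeric feature; `best_regression_split` is a hypothetical helper used only to illustrate the idea, not part of the classifier built later in this notebook.
# Sketch: best split of one feature by RSS reduction (illustrative)
import numpy as np

def best_regression_split(x, y):
    '''Return the threshold on feature x that minimizes the total RSS of the two children.'''
    best_thresh, best_rss = None, np.inf
    for thresh in np.unique(x):
        left, right = y[x < thresh], y[x >= thresh]
        if len(left) == 0 or len(right) == 0:
            continue
        # RSS of a child = sum of squared deviations from its mean prediction
        rss = np.sum((left - left.mean())**2) + np.sum((right - right.mean())**2)
        if rss < best_rss:
            best_rss, best_thresh = rss, thresh
    return best_thresh, best_rss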
If we grow an overly complex tree, we tend to overfit to the training data resulting in poor generalization performance. To achieve a balance between depth and complexity of a tree, there are two primary approaches:
Early stopping explicitly restricts the growth of the tree, for example by limiting the maximum tree depth or requiring a minimum number of samples at a node before it may be split, as illustrated below.
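As an illustration, scikit-learn's `DecisionTreeClassifier` exposes these limits directly as hyperparameters; the specific values below are arbitrary.
# Early stopping via hyperparameters (illustrative values)
from sklearn.tree import DecisionTreeClassifier
shallow_tree = DecisionTreeClassifier(max_depth=5,           # stop after 5 levels
                                      min_samples_split=10,  # need >= 10 samples to split a node
                                      min_samples_leaf=5)    # every leaf keeps >= 5 samples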
The idea is to build a very large, complex tree and then prune it (remove nodes) to see if the pruned tree results in improved performance. This can be done using Cost Complexity Pruning: for a given value of α, the goal is to find the pruned subtree with the lowest cost-complexity score (penalized error).
In case of regression:
minimize $\{SSE + \alpha|T|\}$
where:
$\alpha$ : cost complexity parameter
$|T|$ : number of terminal nodes
The pruning path (the effective values of α and the corresponding impurities) can be computed with `cost_complexity_pruning_path()` from scikit-learn. Note that although the first (unpruned) tree has the lowest SSE, it will not necessarily be selected: the pruned tree with the lowest cost-complexity score is the one that is chosen.
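Here is a sketch of how this could be done with scikit-learn; the data loading and split are illustrative, while `cost_complexity_pruning_path()` and the `ccp_alpha` parameter are the library's cost complexity pruning interface.
# Sketch: selecting a pruned tree via cost complexity pruning (illustrative)
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_bc, y_bc = load_breast_cancer(return_X_y=True)
xp_train, xp_test, yp_train, yp_test = train_test_split(X_bc, y_bc, random_state=123)
# Effective alphas (and total leaf impurities) along the pruning path
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(xp_train, yp_train)
# Fit one tree per alpha and keep the alpha that scores best on held-out data
scores = [DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(xp_train, yp_train).score(xp_test, yp_test)
          for a in path.ccp_alphas]
best_alpha = path.ccp_alphas[int(np.argmax(scores))]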
Classification:
Pure leaf - When a leaf is pure, it contains only one class, and that class label is assigned to the leaf.
Impure leaf - When the leaf is impure, store the most common class label for the leaf node, i.e., if the leaf has 3 Y and 2 N, then "Y" is assigned as the class label.
Regression:
In case of regression, the average of the target values in the leaf is the predicted value assigned to it. A short illustration of both rules follows.
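A minimal illustration of both rules; the classification rule is the same `Counter`-based logic used in the from-scratch classifier later in this notebook, and the example values are made up.
# Leaf value assignment (illustrative)
import numpy as np
from collections import Counter

y_leaf_cls = ['Y', 'Y', 'Y', 'N', 'N']                   # impure classification leaf: 3 Y, 2 N
class_label = Counter(y_leaf_cls).most_common(1)[0][0]   # -> 'Y'

y_leaf_reg = np.array([3.1, 2.8, 3.4])                   # regression leaf
predicted_value = y_leaf_reg.mean()                      # -> average of the leaf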
Pros | Cons |
---|---|
Easy to interpret and communicate | Easy to overfit |
Can be used for both classification and regression tasks | Elaborate pruning required |
Require very little pre-processing | Unstable: small changes in data can dramatically affect the structure of the tree and the final prediction |
Can easily handle categorical features | Inflexible: inserting new data requires retraining the tree from scratch |
Most decision tree implementations can easily handle missing values | |
Non-parametric model: no assumptions about the shape of data | |
# Basic Imports
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix, classification_report
from collections import Counter
# Read the data (load the dataset once and reuse it)
cancer = load_breast_cancer()
cancer_data = pd.DataFrame(cancer.data, columns=cancer.feature_names)
cancer_data.head()
# Create dependent and independent variables
Y = cancer.target
display(f'Shape of Y: {Y.shape}')
X = cancer.data[:, :10]  # keep only the first 10 (mean) features
display(f'Shape of X: {X.shape}')
features = cancer.feature_names
Decision trees do not require data to be standardized, as nodes are constructed by iterating over threshold values of each feature. So, here we will just split the data.
# Train test split
from sklearn.model_selection import train_test_split
# Split the data
x_train,x_test,y_train,y_test = train_test_split(X,Y,test_size=0.25,random_state=123)
display(f'Shape of training data {x_train.shape}')
display(f'Shape of test data {x_test.shape}')
In this section, we will create the class to build and fit a Decision Tree classifier. We will also build and fit a model using training data.
Here is how the algorithm works:
The objective at each node is to find the “best” feature to partition the remaining data such that the information gain is maximized.
Iterate over all the features and all possible feature values to find the best feature and threshold to split on.
Create a split based on best feature and best threshold value
Build the tree recursively
Apply a stopping criterion to stop growing the tree. Criteria could be maximum depth, minimum number of samples at each node, or a node containing only a single class.
Once the tree stops growing, store the most common class label for the leaf node.
Here is an approach to defining the classes:
`calc_entropy()`: calculates the entropy of a single node
`Node`: stores a node's feature, split threshold, left and right children, and leaf value; its `is_leaf()` method determines if a node is a leaf node
`My_Tree`: fits the decision tree model and makes predictions, via the following methods:
a. fit()
creates a root node and starts growing the tree using _grow_tree()
b. _grow_tree()
Checks for stopping criteria
i. If met, returns the `_most_common()` value for node
ii. If not met:
a. Randomly choose features to split on
b. Find the best feature and threshold value using `_best_criteria()`
c. Partition the data on best feature and threshold using `_create_split()`
d. Recursively grow the left and right trees using `_grow_tree()`
e. Return `Node` with best feature, threshold, left and right value
c. _best_criteria()
i. Iterates over all features and unique feature values
ii. Calculates `_information_gain()` using a feature and a unique value as the threshold
iii. Identifies the largest gain and returns the best feature and threshold value
d. _information_gain()
i. Calculates the parent entropy, entropy(y), using `calc_entropy()`
ii. Creates split using `_create_split()` based on feature and threshold
iii. Computes entropy for the left and right splits using `calc_entropy()`
iv. Calculate weighted avg. of entropy for splits (left and right children)
v. Return Information Gain = Parent entropy - Wt. avg of entropy for children
e. _create_split()
i. Compute left indices - indices where feature value >= threshold
ii. Compute right indices - indices where feature value < threshold
iii. Return left and right indices
f. _most_common()
returns the most common value in a node
g. predict()
iterates over test data and traverses the tree using _traverse_tree()
h. _traverse_tree()
starts at the root node
i. Checks if the node is leaf and returns value
ii. If not, checks if the test data value of best feature for the node >= threshold
a. If so, return `_traverse_tree()` again for that feature and left of Node
b. Else, return `_traverse_tree()` again for that feature and right of Node
def calc_entropy(y):
'''
Calculates the entropy for a single node
'''
# Get total count of each class in y
y_classes = np.bincount(y) # returns array([54, 89])
# Divide class count by length
y_hist = y_classes/len(y)
# calculate entropy
entropy = -np.sum([p * np.log2(p) for p in y_hist if p > 0])
return entropy
class Node:
"""
Stores information such as its feature, split threshold, left node, right node
and value if node is a leaf node.
"""
def __init__(self, feature=None, threshold=None, left=None, right=None, *, value=None):
self.feature = feature
self.threshold = threshold
self.left = left
self.right = right
self.value = value
def is_leaf(self):
"""
determines if a node is a leaf node
"""
return self.value is not None
class My_Tree():
"""
fits the decision tree model and makes predictions
"""
def __init__(self):
self.root = None
def fit(self, X, y, min_split_samples=2, max_depth=100, n_feats=None, replace=False):
"""
creates a root node and starts growing the tree
"""
# Initialize
self.min_split_samples = min_split_samples
self.max_depth = max_depth
self.n_feats = n_feats
# Subset number of features
self.n_feats = X.shape[1] if not self.n_feats else min(self.n_feats, X.shape[1])
self.root = self._grow_tree(X, y)
def _grow_tree(self, X, y, depth=0):
"""
Checks for stopping criteria
i. If met, returns the `_most_common()` value for node
ii. If not met:
a. Randomly choose features to split on
b. Find the best feature and threshold value using `_best_criteria()`
c. Partition the data on best feature and threshold using `_create_split()`
d. Recursively grow the left and right trees using `_grow_tree()`
e. Return `Node` with best feature, threshold, left and right value
"""
n_records, n_features = X.shape
n_classes = len(np.unique(y))
# Check for stopping criteria
if (depth >= self.max_depth
or n_records < self.min_split_samples
or n_classes == 1):
leaf_value = self._most_common(y)
return Node(value=leaf_value)
# Randomly choose feature indices to split on
feat_indxs = np.random.choice(n_features, self.n_feats, replace=False)
# Find the best feature and threshold value
best_feat_idx, best_thresh = self._best_criteria(feat_indxs, X, y)
# Split left and right indices
left_idxs, right_idxs = self._create_split(X[:,best_feat_idx], best_thresh)
# Grow left and right trees
left_tree = self._grow_tree(X[left_idxs,:], y[left_idxs], depth+1)
right_tree = self._grow_tree(X[right_idxs,:], y[right_idxs], depth+1)
# Return `Node` with best feature, threshold, left and right value
return Node(best_feat_idx, best_thresh, left_tree, right_tree)
def _best_criteria(self, feat_indxs, X, y):
"""
i. Iterates over all features and unique feature values
ii. Calculates `_information_gain()` using a feature and unique value as threhold
iii. Identifies largest gain and returns best feature and threhold value
"""
best_gain = -1
split_idx, split_thresh = None, None
# Iterate over features and their unique values to
# identify the best feature and its split threhold
for idx in feat_indxs:
X_col = X[:, idx]
unique_vals = np.unique(X_col)
for thresh in unique_vals:
gain = self._information_gain(X_col, y, thresh)
if gain > best_gain:
best_gain = gain
split_idx = idx
split_thresh = thresh
return split_idx, split_thresh
def _information_gain(self, X_col, y, thresh):
"""
i. Calculates parent entropy - entropy(y) for the feature using `_calc_entropy()`
ii. Creates split using `_create_split()` based on feature and threshold
iii. Compute entropy for indices on left and right splits using `_calc_entropy()`
iv. Calculate weighted avg. of entropy for splits (left and right children)
v. Return Information Gain = Parent entropy - Wt. avg of entropy for children
"""
# Calculate parent entropy
ent_parent = calc_entropy(y)
# Create split
left_idx, right_idx = self._create_split(X_col, thresh)
# Calculate weighted avg. entropy of left and right indices
n = len(y)
n_l, n_r = len(left_idx), len(right_idx)
if n_l == 0 or n_r == 0:
return 0
left_ent, right_ent = calc_entropy(y[left_idx]), calc_entropy(y[right_idx])
wt_avg_ent = (n_l/n) * left_ent + (n_r/n) * right_ent
# Calculate information gain
gain = ent_parent - wt_avg_ent
return gain
def _create_split(self, X_col, thresh):
"""
i. Compute left indices - indices where feature value >= threshold
ii. Compute right indices - indices where feature value < threshold
iii. Return left and right indices
"""
left_idx = np.argwhere(X_col >= thresh).flatten()
right_idx = np.argwhere(X_col < thresh).flatten()
return left_idx, right_idx
def _most_common(self,y):
"""
returns the most common value in a node
"""
count = Counter(y)
common = count.most_common(1)[0][0]
return common
def predict(self, X):
'''
iterates over test data and traverses the tree using `_traverse_tree()`
'''
return np.array([self._traverse_tree(x, self.root) for x in X])
def _traverse_tree(self, x, node):
"""
i. Checks if the node is leaf and returns value
ii. If not, checks if the test data value of best feature for the node >= threshold
a. If so, return `_traverse_tree()` again for that feature and left of Node
b. Else, return `_traverse_tree()` again for that feature and right of Node
"""
if node.is_leaf():
return node.value
if x[node.feature] >= node.threshold:
return self._traverse_tree(x, node.left)
return self._traverse_tree(x, node.right)
# Build Model
tree_model = My_Tree()
tree_model.fit(x_train, y_train)
In this section, we will check the performance of our model on train and test data. We will also build a Decision Tree model using scikit-learn library and compare performance.
Before we make predictions, let's define a function, `accuracy`, which returns the ratio of the number of correct predictions to the number of observations.
# Define Accuracy
def accuracy(y_true, y_pred):
acc = np.sum(y_true == y_pred) / len(y_true)
return acc
Evaluate model performance on training data.
# Create predictions
train_pred = tree_model.predict(x_train)
# Calculate accuracy
train_acc = accuracy(y_train, train_pred)
print(f'Accuracy on train set: {train_acc}')
# Print classification report
train_report = classification_report(y_train, train_pred)
print(f'Classification Report:\n {train_report}')
Evaluate model performance on test data.
# Create predictions on test set
test_pred = tree_model.predict(x_test)
# Calculate accuracy
test_acc = accuracy(y_test, test_pred)
print(f'Accuracy on test set: {test_acc}')
# Calculate metrics
cm_test = confusion_matrix(y_test, test_pred)
print("Confusion Matrix:\n")
# Plot confusion matrix
classes = np.unique(y_test)  # sorted class labels for the axis ticks
sns.heatmap(cm_test, square=True, annot=True, fmt='d', cbar=False,
xticklabels=classes, yticklabels=classes)
plt.ylabel('True label')
plt.xlabel('Predicted label');
# Print classification report
test_report = classification_report(y_test, test_pred)
print(f'Classification Report:\n {test_report}')
Precision: the ratio of correctly predicted positive observations to the total predicted positive observations, TP / (TP + FP); it measures how precise/accurate the model's positive predictions are. Precision is a good measure when the cost of a False Positive is high.
Recall: the ratio of correctly predicted positive observations to the actual positive observations, TP / (TP + FN). Recall is a good measure when the cost of a False Negative is high.
In this scenario where we are trying to predict if a patient has cancer or not, the cost of a False Negative is very high. Hence the weighted avg. recall of 0.90 can be used for evaluating the model.
F1 Score: the harmonic mean of precision and recall, taking both metrics into account. The harmonic mean is used because it punishes extreme values. When we want a balanced classification model with an optimal trade-off between recall and precision, we try to maximize the F1 score.
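For reference, these metrics can also be computed directly with scikit-learn's metric functions on the predictions made above; this assumes `y_test` and `test_pred` from the earlier cells.
# Precision, recall and F1 for the from-scratch model's test predictions
from sklearn.metrics import precision_score, recall_score, f1_score
print(f'Precision: {precision_score(y_test, test_pred):.2f}')
print(f'Recall:    {recall_score(y_test, test_pred):.2f}')
print(f'F1 score:  {f1_score(y_test, test_pred):.2f}')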
Here, we build a model using `DecisionTreeClassifier` from the scikit-learn library and compare the results with the model created from scratch.
# Build a model using scikit learn
from sklearn.tree import DecisionTreeClassifier
# fit the model
clf = DecisionTreeClassifier()
clf.fit(x_train, y_train)
# predicting on training data-set
y_train_pred = clf.predict(x_train)
# predicting on test data-set
y_test_pred = clf.predict(x_test)
# Calculate accuracy
clf_acc = clf.score(x_test, y_test)
print(f'Decision Tree accuracy on test set: {clf_acc:.3f}')
# Calculate metrics
cm_test_scikit = confusion_matrix(y_test, y_test_pred)
print("Confusion Matrix:\n")
# Plot confusion matrix
sns.heatmap(cm_test_scikit, square=True, annot=True, fmt='d', cbar=False,
xticklabels=classes, yticklabels=classes)
plt.ylabel('True label')
plt.xlabel('Predicted label');
# Print classification report
test_report_scikit = classification_report(y_test, y_test_pred)
print(f'Classification Report:\n {test_report_scikit}')
scikit-learn also makes it easy to visualize a decision tree.
from sklearn import tree
fig = plt.figure(figsize=(25,20))
_ = tree.plot_tree(clf,
                   feature_names=load_breast_cancer().feature_names[:10],  # clf was trained on the first 10 features only
class_names=load_breast_cancer().target_names,
filled=True)
Model | Accuracy | Precision | Recall | F1 Score |
---|---|---|---|---|
From-scratch model (training data) | 1 | 1 | 1 | 1 |
From-scratch model (test data) | 0.9 | 0.9 | 0.9 | 0.9 |
scikit-learn model (test data) | 0.92 | 0.92 | 0.92 | 0.92 |
From the results comparison, we can see that the model created from scratch performs comparably to the model created using the scikit-learn library. Given that the cost of a false negative is high, recall is the key metric: our model achieves a weighted average recall of 0.90 on the test set versus 0.92 for the scikit-learn model.
To summarize, in this notebook we created a naive Decision Tree Classifier from scratch. We then evaluated the model on training and test data and compared its performance with a model created using the scikit-learn library.