Decision Tree Classifier

Overview

Goal: build a naive Decision Tree Classifier from scratch.

A decision tree is a tree where each node represents a feature (attribute), each link (branch) represents a decision (rule), and each leaf represents an outcome (a categorical or continuous value).

Basics

Classification trees are a hierarchical way of partitioning the feature space. We start with the entire space and recursively divide it into smaller regions. At the end, every region is assigned a class label.

So, a basic decision tree partitions the training data into homogeneous subgroups. The subgroups (also called nodes) are formed recursively using binary partitions created by asking simple yes-or-no questions about each feature (e.g., is commute > 1hr?). This is repeated until a suitable stopping criterion is satisfied. Once partitioning stops, the model predicts a constant output within each terminal subgroup.

Building a Tree

Building a tree involves the following three general elements:

a. Where and how to split: how do we decide which feature to split on and how to split the tree?

b. How to stop splitting: once we start to 'grow' the tree, how do we decide when to stop splitting?

c. How to assign a class label: each leaf node needs to be assigned to a class. How do we assign these class labels?

Where and how to split?

The objective at each node is to find the “best” feature to partition the remaining data such that the Information Gain is maximized.

  • Iterate over all the features and all possible feature values to find the best feature and threshold to split on. So,
    • For each feature:
      • For each value in all possible values in the feature:
        • Create a split based on feature value
          • Calculate Entropy (or Gini Impurity) for each node in the split
        • Calculate the Information Gain (or weighted avg. Gini Impurity) for the split on that value
      • Choose the value with the highest Information Gain (or lowest weighted avg. Gini Impurity) as the split value for the feature.
    • Pick the feature with the highest Information Gain (or lowest weighted avg. Gini Impurity) as the best feature, and its value as the best threshold to split on.
Entropy
  • Entropy measures the impurity of a single node, i.e., how mixed the class labels in that node are.
  • For binary classification it ranges between 0 and 1, where 0 represents a pure node and 1 a completely impure (50/50) node.
  • It helps in deciding which feature or threshold value to split on.
  • Entropy is calculated w.r.t. only one node. To measure the improvement from a whole split (parent vs. children), Information Gain is used.

Entropy is calculated as:

$H(S) = -\,p_{+} \log_{2}(p_{+}) - p_{-} \log_{2}(p_{-})$
where
$p_{+} \rightarrow$ proportion of the positive class
$p_{-} \rightarrow$ proportion of the negative class

[Image: example split of a parent node F1 (9 positive, 5 negative) into children F2 (6 positive, 2 negative) and F3 (3 positive, 3 negative)]

Entropy (F1) = $-(9/14)\log_{2}(9/14) - (5/14)\log_{2}(5/14)$ = 0.94

Entropy (F2) = $-(6/8)\log_{2}(6/8) - (2/8)\log_{2}(2/8)$ = 0.81

Entropy (F3) = $-(3/6)\log_{2}(3/6) - (3/6)\log_{2}(3/6)$ = 1
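
To sanity-check these numbers, here is a minimal NumPy sketch of the same calculation. The label arrays f1, f2 and f3 are made-up stand-ins that reproduce the class counts above (this mirrors the calc_entropy() helper implemented later in the notebook):

import numpy as np

def entropy(y):
    # Entropy of a single node from its class proportions
    probs = np.bincount(y) / len(y)
    return -np.sum([p * np.log2(p) for p in probs if p > 0])

f1 = np.array([1]*9 + [0]*5)   # hypothetical parent node: 9 positive, 5 negative
f2 = np.array([1]*6 + [0]*2)   # hypothetical left child: 6 positive, 2 negative
f3 = np.array([1]*3 + [0]*3)   # hypothetical right child: 3 positive, 3 negative
print(round(entropy(f1), 2), round(entropy(f2), 2), round(entropy(f3), 2))  # 0.94 0.81 1.0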

Information Gain

Information gain is used to decide which feature to split on at each step in building the tree. It is the decrease in entropy: the difference between the entropy before the split and the weighted average entropy of the children after splitting on a given attribute.

Information Gain is calculated as:

$\operatorname{Gain}(S, A) = H(S) - \sum_{v} \frac{|S_{v}|}{|S|} H(S_{v})$
where:
$H(S)$ - entropy of the parent node
$S_{v}$ - subset of samples in child $v$ after the split
$S$ - set of samples at the parent node
$H(S_{v})$ - entropy of child node $v$

Information gain for the split above is:

$H(F1) - \frac{8}{14} \times H(F2) - \frac{6}{14} \times H(F3)$

$= 0.94 - \frac{8}{14} \times 0.81 - \frac{6}{14} \times 1$

$= 0.05$
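
Reusing entropy() and the made-up label arrays from the sketch above, the same gain can be computed directly:

# Information gain of splitting the parent F1 into children F2 and F3
gain = entropy(f1) - (len(f2)/len(f1)) * entropy(f2) - (len(f3)/len(f1)) * entropy(f3)
print(round(gain, 2))  # 0.05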

Gini Impurity
  • Gini Impurity also measures the impurity of a single node.
  • For binary classification it varies between 0 and 0.5; lower is better.
  • It is slightly cheaper to compute than entropy because it avoids logarithms.

Gini Impurity is calculated as:

$GI = 1 - \sum_{i=1}^{n} p_{i}^{2}$

$= 1 - \left[(p_{+})^{2} + (p_{-})^{2}\right]$

Gini (F2) = 1 - [(6/8)^2 + (2/8)^2] = 1 - (0.5625 + 0.0625) = 0.375
Gini (F3) = 1 - [(3/6)^2 + (3/6)^2] = 0.5

Weighted Average Gini Impurity

It is used to decide which feature to split on at each step in building the tree and plays the same role as Information Gain (here, lower is better). It is calculated as the weighted average of the Gini Impurity of each child node.

Gini (F1) $=\left(\frac{8}{8+6}\right) \times$ G.I. (F2) $+\left(\frac{6}{8+6}\right) \times $ G.I. (F3)

$=\left(\frac{8}{8+6}\right) \times 0.375$ $+\left(\frac{6}{8+6}\right) \times 0.5$

$=0.214 + 0.214 \approx 0.43$
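
A minimal sketch of the same Gini arithmetic, again reusing the made-up label arrays from the entropy sketch:

def gini(y):
    # Gini impurity of a single node from its class proportions
    probs = np.bincount(y) / len(y)
    return 1 - np.sum(probs ** 2)

weighted_gini = (len(f2)/len(f1)) * gini(f2) + (len(f3)/len(f1)) * gini(f3)
print(round(gini(f2), 3), round(gini(f3), 3), round(weighted_gini, 2))  # 0.375 0.5 0.43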

Regression Scenario

In this scenario, where the response variable is continuous, the objective at each node is to find the “best” feature to partition the remaining data such that the overall error between the actual response and the predicted value is minimized.

So at each node, we:

  • make the prediction, which is the mean of the response values of the training samples in that node
  • calculate the difference between each observed value and the predicted value (the mean), and square it to get the squared residuals
  • add the squared residuals from both splits to get the residual sum of squares (RSS)

We select the predictor and threshold that leads to the greatest reduction in RSS.
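
A small illustrative sketch of scoring one candidate split under this criterion; the feature values, responses and threshold below are invented purely for illustration:

import numpy as np

def rss_for_split(x_col, resp, thresh):
    # RSS if we split on x_col at thresh and predict each side's mean
    left, right = resp[x_col <= thresh], resp[x_col > thresh]
    return sum(np.sum((part - part.mean()) ** 2) for part in (left, right) if len(part) > 0)

# Toy data: commute time (hours) vs. a continuous response
x_col = np.array([0.5, 0.8, 1.2, 1.5, 2.0])
resp = np.array([10.0, 11.0, 20.0, 22.0, 21.0])
print(rss_for_split(x_col, resp, thresh=1.0))   # 2.5 -- this split separates the two groups well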

How to stop splitting?

If we grow an overly complex tree, we tend to overfit to the training data resulting in poor generalization performance. To achieve a balance between depth and complexity of a tree, there are two primary approaches:

  • Early Stopping
  • Pruning
Early Stopping

Early stopping explicitly restricts the growth of the tree. There are several ways such as:

  • Set a max depth to restrict the tree to a certain depth: we stop splitting once that depth is reached. The shallower the tree, the less variance we have in our predictions; however, this may lead to higher bias, as shallow trees (e.g., stumps) are not able to capture interactions and complex patterns in the data.
  • Set a minimum number of observations allowed in any terminal node: we stop splitting once a node has fewer observations than the minimum. Large minimums restrict further splits, leading to shallower trees. This may also lead to higher bias, as shallow trees are not able to capture complex patterns.
  • Stop growing if the improvement falls below a certain threshold: we stop splitting if the Information Gain (classification) or the reduction in RSS (regression) is below a certain threshold. This results in smaller trees, but it is short-sighted: a seemingly worthless split early on might be followed by a really good split later. The scikit-learn sketch after this list shows the corresponding hyperparameters.
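
In scikit-learn, these early-stopping rules correspond to hyperparameters of DecisionTreeClassifier. A minimal configuration sketch follows; the specific values are arbitrary placeholders, not recommendations:

from sklearn.tree import DecisionTreeClassifier

# Each argument maps to one of the early-stopping rules above
shallow_tree = DecisionTreeClassifier(
    max_depth=4,                 # restrict the depth of the tree
    min_samples_leaf=10,         # minimum observations allowed in any terminal node
    min_impurity_decrease=0.01,  # require at least this impurity reduction to split
    random_state=0,
)
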
Pruning

The idea is to build a very large, complex tree and then prune it (remove nodes) to see if the pruned tree generalizes better. This can be done using Cost Complexity Pruning: for a given value of α, we find the subtree that minimizes the penalized error below, and we then choose the α (and hence the tree) with the lowest cross-validated error or highest accuracy.

In case of regression:

minimize $\{SSE + \alpha|T|\}$
where:
$\alpha$: cost complexity parameter
$|T|$: number of terminal nodes

The steps to cost complexity pruning are:

  • Build a large tree
  • Use cost complexity pruning to identify the effective $\alpha$ values of its subtrees, e.g., with cost_complexity_pruning_path() from scikit-learn (see the sketch below).
  • Split the data into train and test sets
  • Use these $\alpha$ values and k-fold cross-validation to pick the $\alpha$ that gives the lowest error or highest accuracy.
  • The tree pruned with that $\alpha$ is the final tree.
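
A hedged sketch of these steps using scikit-learn; for brevity it cross-validates on the full breast cancer dataset (used later in this notebook) rather than a separate train/test split, and the variable names are placeholders:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_breast_cancer(return_X_y=True)

# Effective alpha values of the subtrees of a fully grown tree
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_demo, y_demo)

# Pick the alpha whose pruned tree has the best cross-validated accuracy
scores = [cross_val_score(DecisionTreeClassifier(random_state=0, ccp_alpha=a), X_demo, y_demo, cv=5).mean()
          for a in path.ccp_alphas]
best_alpha = path.ccp_alphas[int(np.argmax(scores))]
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X_demo, y_demo)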

The image below shows an example: although the first tree has the lowest SSE, it will not be selected. The tree score under cost complexity pruning (SSE + $\alpha|T|$) is lowest for the second tree, and hence that tree is selected.

[Image: tree scores (SSE + $\alpha|T|$) compared across candidate pruned trees]

How to assign a class label?

Classification:

  • Pure leaf - When a leaf is pure, it contains only one class, and that class label is assigned to the leaf.

  • Impure leaf - When the leaf is impure, assign the most common class label to the leaf node, i.e., if the leaf has 3 Y and 2 N then "Y" is assigned as the class label.

Regression:

In the case of regression, the average of the response values in the leaf is the final (predicted) value assigned.
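
A minimal sketch of both assignment rules; the toy leaf contents mirror the 3Y/2N example above, and the regression responses are invented:

from collections import Counter
import numpy as np

leaf_labels = ['Y', 'Y', 'Y', 'N', 'N']              # classification leaf
print(Counter(leaf_labels).most_common(1)[0][0])     # 'Y' -> majority class is the label

leaf_values = np.array([3.1, 2.8, 3.4])              # regression leaf (hypothetical responses)
print(round(leaf_values.mean(), 2))                  # 3.1 -> mean response is the prediction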

Pros and Cons

Pros:

  • Easy to interpret and communicate
  • Can be used for both classification and regression tasks
  • Require very little pre-processing
  • Can easily handle categorical features
  • Most decision tree implementations can easily handle missing values
  • Non-parametric model: no assumptions about the shape of data

Cons:

  • Easy to overfit
  • Elaborate pruning required
  • Unstable: small changes in data can dramatically affect the structure of the tree and the final prediction
  • Inflexible: inserting new data requires retraining the tree from scratch

Get Data

Load Data

In [1]:
# Basic Imports
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix, classification_report
from collections import Counter
In [2]:
# Read the data
cancer_data=pd.DataFrame(load_breast_cancer().data,columns=load_breast_cancer().feature_names)
cancer_data.head()
Out[2]:
mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry mean fractal dimension ... worst radius worst texture worst perimeter worst area worst smoothness worst compactness worst concavity worst concave points worst symmetry worst fractal dimension
0 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.3001 0.14710 0.2419 0.07871 ... 25.38 17.33 184.60 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890
1 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.0869 0.07017 0.1812 0.05667 ... 24.99 23.41 158.80 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902
2 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.1974 0.12790 0.2069 0.05999 ... 23.57 25.53 152.50 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758
3 11.42 20.38 77.58 386.1 0.14250 0.28390 0.2414 0.10520 0.2597 0.09744 ... 14.91 26.50 98.87 567.7 0.2098 0.8663 0.6869 0.2575 0.6638 0.17300
4 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.1980 0.10430 0.1809 0.05883 ... 22.54 16.67 152.20 1575.0 0.1374 0.2050 0.4000 0.1625 0.2364 0.07678

5 rows × 30 columns

In [3]:
# Create dependent and independent variables
Y=load_breast_cancer().target
display(f'Shape of Y: {Y.shape}')

X=load_breast_cancer().data[:,:10]
display(f'Shape of X: {X.shape}')

features = load_breast_cancer().feature_names
'Shape of Y: (569,)'
'Shape of X: (569, 10)'

Split Data

Decision trees do not require data to be standardized, as nodes are constructed by iterating over threshold values of each feature. So, here we will just split the data.

In [4]:
# Train test split
from sklearn.model_selection import train_test_split

# Split the data
x_train,x_test,y_train,y_test = train_test_split(X,Y,test_size=0.25,random_state=123)

display(f'Shape of training data {x_train.shape}')
display(f'Shape of test data {x_test.shape}')
'Shape of training data (426, 10)'
'Shape of test data (143, 10)'

Decision Tree Classifier

In this section, we will create the class to build and fit a Decision Tree classifier. We will also build and fit a model using training data.

Here is how the algorithm works:

The objective at each node is to find the “best” feature to partition the remaining data such that the Information Gain is maximized.

  • Iterate over all the features and all possible feature values to find the best feature and threshold to split on. So,

    • For each feature:
      • For each value in all possible values in the feature:
        • Create a split based on feature value
          • Calculate Entropy (or Gini Impurity) for each node in the split
        • Calculate Information Gain (or Weighted avg. Gini Impurity) for the split on that value
      • Choose the value with the highest Information Gain (or lowest weighted avg. Gini Impurity) as the split value for the feature.
    • Pick the feature with the highest Information Gain as the best feature and its value as the best threshold value to split on.
  • Create a split based on best feature and best threshold value

    • Subset the data based on the threshold value so each split gets the corresponding subset
  • Build the tree recursively

  • Apply stopping criteria to stop growing the tree. Criteria could be maximum depth, minimum no. of samples at a node, or node purity (only one class remaining).

  • Once the tree stops growing, store the most common class label for the leaf node.

Define the Classes

Here is an approach to defining the classes:

  • calc_entropy() is a standalone helper that calculates the entropy of a single node
  • Node: stores information about a node such as its feature, split threshold, left node, right node and value if node is a leaf node.
    a. is_leaf() determines if a node is a leaf node
  • My_Tree: fits the decision tree model and makes predictions
    a. fit() creates a root node and starts growing the tree using _grow_tree()

    b. _grow_tree()

      Checks for stopping criteria    
      i. If met, returns the `_most_common()` value for node    
      ii. If not met:    
          a. Randomly choose features to split on    
          b. Find the best feature and threshold value using `_best_criteria()`    
          c. Partition the data on best feature and threshold using `_create_split()`    
          d. Recursively grow the left and right trees using `_grow_tree()`    
          e. Return `Node` with best feature, threshold, left and right value    
    

    c. _best_criteria()

      i. Iterates over all features and unique feature values    
      ii. Calculates `_information_gain()` using a feature and unique value as threshold    
      iii. Identifies the largest gain and returns the best feature and threshold value     
    

    d. _information_gain()

      i. Calculates the parent entropy, entropy(y), using `calc_entropy()`    
      ii. Creates the split using `_create_split()` based on feature and threshold    
      iii. Computes entropy for indices on the left and right splits using `calc_entropy()`    
      iv. Calculate weighted avg. of entropy for splits (left and right children)    
      v. Return Information Gain = Parent entropy - Wt. avg of entropy for children    
    

    e. _create_split()

      i. Compute left indices - indices where feature value >= threshold    
      ii. Compute right indices - indices where feature value < threshold    
      iii. Return left and right indices    
    

    f. _most_common() returns the most common value in a node

    g. predict() iterates over test data and traverses the tree using _traverse_tree()

    h. _traverse_tree() starts at the root node

      i. Checks if the node is leaf and returns value     
      ii. If not, checks if the test data value of best feature for the node >= threshold    
          a. If so, return `_traverse_tree()` again for that feature and left of Node    
          b. Else, return `_traverse_tree()` again for that feature and right of Node    
In [5]:
def calc_entropy(y):
    '''
    Calculates the entropy for a single node
    '''
    # Get total count of each class in y
    y_classes = np.bincount(y)  # e.g., array([count_class_0, count_class_1])

    # Divide class count by length
    y_hist = y_classes/len(y)
    
    # calculate entropy
    entropy = -np.sum([p * np.log2(p) for p in y_hist if p > 0])
    return entropy

class Node:
    """
    Stores information such as its feature, split threshold, left node, right node 
    and value if node is a leaf node.
    """
    def __init__(self, feature=None, threshold=None, left=None, right=None, *, value=None):
        self.feature = feature
        self.threshold = threshold
        self.left = left
        self.right = right
        self.value = value
        
    def is_leaf(self):
        """
        determines if a node is a leaf node
        """
        return self.value is not None
    
class My_Tree():
    """
    fits the decision tree model and makes predictions
    """
    def __init__(self):
        self.root = None
    
    def fit(self, X, y, min_split_samples=2, max_depth=100, n_feats=None):
        """
        creates a root node and starts growing the tree
        """
        # Initialize
        self.min_split_samples = min_split_samples
        self.max_depth = max_depth
        self.n_feats = n_feats
        
        # Subset number of features
        self.n_feats = X.shape[1] if not self.n_feats else min(self.n_feats, X.shape[1])
        self.root = self._grow_tree(X, y)
        
    def _grow_tree(self, X, y, depth=0):
        """
        Checks for stopping criteria    
        i. If met, returns the `_most_common()` value for node    
        ii. If not met:    
            a. Randomly choose features to split on    
            b. Find the best feature and threshold value using `_best_criteria()`    
            c. Partition the data on best feature and threshold using `_create_split()`    
            d. Recursively grow the left and right trees using `_grow_tree()`    
            e. Return `Node` with best feature, threshold, left and right value  
        """
        n_records, n_features = X.shape
        n_classes = len(np.unique(y))
        
        # Check for stopping criteria
        if (depth >= self.max_depth
           or n_records < self.min_split_samples
           or n_classes == 1):
            leaf_value = self._most_common(y)
            return Node(value=leaf_value)
        
        # Randomly choose feature indices to split on
        feat_indxs = np.random.choice(n_features, self.n_feats, replace=False)
        
        # Find the best feature and threshold value
        best_feat_idx, best_thresh = self._best_criteria(feat_indxs, X, y)
        
        # Split left and right indices
        left_idxs, right_idxs = self._create_split(X[:,best_feat_idx], best_thresh)
        
        # Grow left and right trees
        left_tree = self._grow_tree(X[left_idxs,:], y[left_idxs], depth+1)
        right_tree = self._grow_tree(X[right_idxs,:], y[right_idxs], depth+1)
        
        # Return `Node` with best feature, threshold, left and right value 
        return Node(best_feat_idx, best_thresh, left_tree, right_tree)
    
    def _best_criteria(self, feat_indxs, X, y):
        """
        i. Iterates over all features and unique feature values    
        ii. Calculates `_information_gain()` using a feature and unique value as threshold    
        iii. Identifies the largest gain and returns the best feature and threshold value 
        """
        best_gain = -1
        split_idx, split_thresh = None, None
        
        # Iterate over features and their unique values to 
        # identify the best feature and its split threshold
        for idx in feat_indxs:
            X_col = X[:, idx]
            unique_vals = np.unique(X_col)
            for thresh in unique_vals:
                gain = self._information_gain(X_col, y, thresh)
                if gain > best_gain:
                    best_gain = gain
                    split_idx = idx
                    split_thresh = thresh
                    
        return split_idx, split_thresh  
    
    def _information_gain(self, X_col, y, thresh):
        """
        i. Calculates the parent entropy, entropy(y), using `calc_entropy()`    
        ii. Creates the split using `_create_split()` based on feature and threshold    
        iii. Computes entropy for indices on the left and right splits using `calc_entropy()`    
        iv. Calculate weighted avg. of entropy for splits (left and right children)    
        v. Return Information Gain = Parent entropy - Wt. avg of entropy for children 
        """
        # Calculate parent entropy
        ent_parent = calc_entropy(y)
        
        # Create split
        left_idx, right_idx = self._create_split(X_col, thresh)
        
        # Calculate weighted avg. entropy of left and right indices
        n = len(y)
        n_l, n_r = len(left_idx), len(right_idx)
        if n_l == 0 or n_r == 0:
            return 0
        left_ent, right_ent = calc_entropy(y[left_idx]), calc_entropy(y[right_idx]) 
        wt_avg_ent = (n_l/n) * left_ent + (n_r/n) * right_ent
        
        # Calculate information gain
        gain = ent_parent - wt_avg_ent
        return gain
        
    def _create_split(self, X_col, thresh):
        """
        i. Compute left indices - indices where feature value >= threshold    
        ii. Compute right indices - indices where feature value < threshold    
        iii. Return left and right indices 
        """
        left_idx = np.argwhere(X_col >= thresh).flatten()
        right_idx = np.argwhere(X_col < thresh).flatten()
        return left_idx, right_idx
        
    def _most_common(self,y):
        """
        returns the most common value in a node
        """
        count = Counter(y)
        common = count.most_common(1)[0][0]
        return common
    
    def predict(self, X):
        '''
        iterates over test data and traverses the tree using `_traverse_tree()`
        '''
        return np.array([self._traverse_tree(x, self.root) for x in X])
    
    def _traverse_tree(self, x, node):
        """
        i. Checks if the node is leaf and returns value     
        ii. If not, checks if the test data value of best feature for the node >= threshold    
          a. If so, return `_traverse_tree()` again for that feature and left of Node    
          b. Else, return `_traverse_tree()` again for that feature and right of Node
        """
    
        if node.is_leaf():
            return node.value
        
        if x[node.feature] >= node.threshold:
            return self._traverse_tree(x, node.left)
        return self._traverse_tree(x, node.right)
        

Build the Model

In [6]:
# Build Model
tree_model = My_Tree()
tree_model.fit(x_train, y_train)

Evaluate Model Performance

In this section, we will check the performance of our model on train and test data. We will also build a Decision Tree model using the scikit-learn library and compare performance.

Before we make predictions, let's define a function, accuracy, which returns the ratio of the number of correct predictions to the number of observations.

In [7]:
# Define Accuracy
def accuracy(y_true, y_pred):
    acc = np.sum(y_true == y_pred) / len(y_true)
    return acc

Performance on Training data

Evaluate model performance on training data.

In [8]:
# Create predictions
train_pred = tree_model.predict(x_train)
In [9]:
# Calculate accuracy
train_acc = accuracy(y_train, train_pred)
print(f'Accuracy on train set: {train_acc}')
Accuracy on train set: 1.0
In [10]:
# Print classification report
train_report = classification_report(y_train, train_pred)
print(f'Classification Report:\n {train_report}')
Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00       158
           1       1.00      1.00      1.00       268

    accuracy                           1.00       426
   macro avg       1.00      1.00      1.00       426
weighted avg       1.00      1.00      1.00       426

Performance on Test data

Evaluate model performance on test data.

In [11]:
# Create predictions on test set
test_pred = tree_model.predict(x_test)
In [12]:
# Calculate accuracy
test_acc = accuracy(y_test, test_pred)
print(f'Accuracy on test set: {test_acc}')
Accuracy on test set: 0.8951048951048951
In [13]:
# Calculate metrics
cm_test = confusion_matrix(y_test, test_pred)
print("Confusion Matrix:\n")

# Plot confusion matrix
classes = np.unique(y_test)
sns.heatmap(cm_test, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=classes, yticklabels=classes)
plt.ylabel('True label')
plt.xlabel('Predicted label');
Confusion Matrix:

In [14]:
# Print classification report
test_report = classification_report(y_test, test_pred)
print(f'Classification Report:\n {test_report}')
Classification Report:
               precision    recall  f1-score   support

           0       0.85      0.87      0.86        54
           1       0.92      0.91      0.92        89

    accuracy                           0.90       143
   macro avg       0.89      0.89      0.89       143
weighted avg       0.90      0.90      0.90       143

Precision: measures how precise the model's positive predictions are; it is the ratio of correctly predicted positive observations to the total predicted positive observations (TP / (TP + FP)). Precision is a good measure when the cost of a false positive is high.

Recall: is the ratio of correctly predicted positive observations to the actual positive observations (TP / (TP + FN)). Recall is a good measure when the cost of a false negative is high.

In this scenario, where we are trying to predict whether a patient has cancer, the cost of a false negative is very high. Hence the weighted avg. recall of 0.90 is a suitable metric for evaluating the model.

F1 Score: is the harmonic mean of precision and recall, taking both metrics into account. The harmonic mean is used because it punishes extreme values. When we want a balanced classification model with an optimal trade-off between recall and precision, we try to maximize the F1 score.
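
These metrics can also be computed individually from the test predictions above. By default scikit-learn reports them for the positive class (label 1); passing average='weighted' would reproduce the weighted averages in the classification report:

from sklearn.metrics import precision_score, recall_score, f1_score

print(f'Precision: {precision_score(y_test, test_pred):.2f}')
print(f'Recall:    {recall_score(y_test, test_pred):.2f}')
print(f'F1 Score:  {f1_score(y_test, test_pred):.2f}')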

Comparison with Scikit Learn

Here, we build a model using DecisionTreeClassifier from the scikit-learn library and compare the results with the model created from scratch.

In [16]:
# Build a model using scikit learn
from sklearn.tree import DecisionTreeClassifier
   
# fit the model
clf = DecisionTreeClassifier()

clf.fit(x_train, y_train)
     
# predicting on training data-set
y_train_pred = clf.predict(x_train)
   
# predicting on test data-set
y_test_pred = clf.predict(x_test)
In [17]:
# Calculate accuracy
clf_acc = clf.score(x_test, y_test)
print(f'Decision Tree accuracy on test set: {clf_acc:.3f}')
Decision Tree accuracy on test set: 0.916
In [18]:
# Calculate metrics
cm_test_scikit = confusion_matrix(y_test, y_test_pred)
print("Confusion Matrix:\n")

# Plot confusion matrix
sns.heatmap(cm_test_scikit, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=classes, yticklabels=classes)
plt.ylabel('True label')
plt.xlabel('Predicted label');
Confusion Matrix:

In [19]:
# Print classification report
test_report_scikit = classification_report(y_test, y_test_pred)
print(f'Classification Report:\n {test_report_scikit}')
Classification Report:
               precision    recall  f1-score   support

           0       0.89      0.89      0.89        54
           1       0.93      0.93      0.93        89

    accuracy                           0.92       143
   macro avg       0.91      0.91      0.91       143
weighted avg       0.92      0.92      0.92       143

Visualize a tree

scikit-learn also provides the flexibility to visualize a fitted decision tree.

In [22]:
from sklearn import tree
fig = plt.figure(figsize=(25,20))
_ = tree.plot_tree(clf, 
                   feature_names=list(features[:10]),  # clf was trained on the first 10 features only
                   class_names=load_breast_cancer().target_names,
                   filled=True)
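
As an alternative to the figure, scikit-learn's export_text can print the same fitted tree as plain text (a small optional sketch; note again that clf was trained on the first 10 features only):

from sklearn.tree import export_text

print(export_text(clf, feature_names=list(features[:10])))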

Model Results

Model                       Accuracy   Precision   Recall   F1 Score
Training Data               1.00       1.00        1.00     1.00
Test Data                   0.90       0.90        0.90     0.90
Model using scikit-learn    0.92       0.92        0.92     0.92

From the results comparison, we can see that the model created from scratch performs comparably to the model created using the scikit-learn library. Given that the cost of a false negative is high, note that the recall of our model on the test set is 0.90, compared to 0.92 for the scikit-learn model.

Summary

To summarize, in this notebook we created a naive Decision Tree Classifier from scratch. We then evaluated the model on training and test data and compared its performance with a model created using the scikit-learn library.