This document provides an overview of decision trees and the CART (Classification and Regression Trees) algorithm, explaining their components such as root nodes, nodes, and leaves, as well as impurity measures like Gini index and entropy. It discusses the advantages and disadvantages of decision trees, including their interpretability and the risk of overfitting, while also highlighting their applications in various fields like business management and healthcare. Additionally, it includes references for further reading and resources for coding with decision trees.
Decision tree algorithm
● Decision trees are also termed CART (Classification and Regression Trees) algorithms.
● They are used for
○ Classification and
○ Regression
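As a minimal sketch of the two CART variants, the snippet below fits a classification tree and a regression tree with scikit-learn; the toy data and the query point 1.5 are invented purely for illustration.

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Tiny toy dataset, invented for illustration only.
X = [[0.0], [1.0], [2.0], [3.0]]
y_class = [0, 0, 1, 1]          # categorical target -> classification tree
y_value = [0.1, 0.9, 2.1, 2.9]  # continuous target  -> regression tree

clf = DecisionTreeClassifier().fit(X, y_class)
reg = DecisionTreeRegressor().fit(X, y_value)

print(clf.predict([[1.5]]))  # predicted class label
print(reg.predict([[1.5]]))  # predicted numeric value
```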
Decision tree components
● Root node
○ The starting node of the tree; it corresponds to the split with the
maximum information gain.
● Node
○ An internal node tests a condition (question) and has multiple
outgoing branches, one per outcome.
● Leaf
○ The end point of a branch; it holds the final decision once no
further conditions are tested.
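To make these components concrete, the sketch below fits a shallow scikit-learn tree on the Iris dataset and prints its structure; the dataset and the max_depth value are arbitrary choices for the example.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Load a small, well-known dataset (the choice of dataset is arbitrary here).
X, y = load_iris(return_X_y=True)

# Fit a shallow tree so the printed structure stays readable.
tree = DecisionTreeClassifier(max_depth=2, criterion="gini", random_state=0)
tree.fit(X, y)

# The first condition printed is the root node's split; indented conditions are
# internal nodes, and the "class: ..." lines are the leaves (final decisions).
print(export_text(tree, feature_names=load_iris().feature_names))
```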
Information Gain ( IG )
Each node is split so that the split yields the maximum information, which is
achieved by choosing the split with the highest information gain (IG).
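A minimal sketch of the information-gain calculation, written generically so that any of the impurity metrics introduced in the next section can be plugged in; the function name and signature are mine, not from any library.

```python
def information_gain(impurity, parent_counts, children_counts):
    """IG = I(parent) - sum over children of (N_child / N_parent) * I(child).

    `impurity` is any node-impurity function of per-class counts, e.g. the
    Gini index, entropy, or classification error defined in the next section.
    """
    n_parent = sum(parent_counts)
    weighted_children = sum(
        sum(child) / n_parent * impurity(child) for child in children_counts
    )
    return impurity(parent_counts) - weighted_children
```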
Impurity Metrics
Information gain can be calculated using the impurity measure of each split:
1. Gini index ( I_G )
2. Entropy ( I_H )
3. Classification error ( I_E )
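A minimal sketch of the three impurity measures as plain Python functions of per-class counts (the function names are mine); these plug directly into the information_gain sketch above.

```python
import math

def gini(counts):
    # I_G = 1 - sum_i p_i^2
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    # I_H = -sum_i p_i * log2(p_i)   (classes with zero count contribute 0)
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def classification_error(counts):
    # I_E = 1 - max_i p_i
    return 1.0 - max(counts) / sum(counts)
```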
Principle of splitting nodes
● The root node is split to obtain the maximum information gain.
● Increasing the number of nodes in the tree causes overfitting.
● Splitting continues until every leaf is pure (contains only one of the
possible outcomes).
● Pruning can also be done, meaning the removal of branches that use
features of low importance (see the sketch after this list).
● Gini index ≅ Entropy (they usually select very similar splits).
● For a uniform class distribution, entropy is 1 (its maximum for two classes).
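As a hedged sketch of controlling overfitting, the snippet below compares a fully grown scikit-learn tree with one that is depth-limited and cost-complexity pruned; the dataset and the parameter values are illustrative choices, not recommendations.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fully grown tree: splitting continues until every leaf is pure, which tends to overfit.
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Pruned tree: ccp_alpha > 0 removes branches whose improvement is too small to
# justify their complexity (cost-complexity pruning); max_depth caps tree growth.
pruned = DecisionTreeClassifier(
    max_depth=4, ccp_alpha=0.01, random_state=0
).fit(X_train, y_train)

print("full tree test accuracy:  ", full.score(X_test, y_test))
print("pruned tree test accuracy:", pruned.score(X_test, y_test))
```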
Example : two candidate splits of the same parent dataset
Split A
Parent dataset → 40 items in class 1 and 40 items in class 2
Child 1 → 30 items in class 1 and 10 items in class 2
Child 2 → 10 items in class 1 and 30 items in class 2
Split B
Parent dataset → 40 items in class 1 and 40 items in class 2
Child 1 → 20 items in class 1 and 40 items in class 2
Child 2 → 20 items in class 1 and 0 items in class 2
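As a check on which split the metrics prefer, the self-contained sketch below plugs the class counts listed above into the Gini index and the classification error, using the standard formulas rather than any particular library.

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def classification_error(counts):
    return 1.0 - max(counts) / sum(counts)

def information_gain(impurity, parent, children):
    n = sum(parent)
    return impurity(parent) - sum(sum(c) / n * impurity(c) for c in children)

parent = (40, 40)
split_a = [(30, 10), (10, 30)]   # Child 1 and Child 2 of Split A
split_b = [(20, 40), (20, 0)]    # Child 1 and Child 2 of Split B

for name, metric in [("Gini index", gini),
                     ("Classification error", classification_error)]:
    ig_a = information_gain(metric, parent, split_a)
    ig_b = information_gain(metric, parent, split_b)
    print(f"{name}: IG(Split A) = {ig_a:.3f}, IG(Split B) = {ig_b:.3f}")
```

The classification error rates both splits equally (gain 0.25), whereas the Gini index gives Split B a higher gain (≈ 0.167 vs 0.125); entropy behaves like the Gini index here, which is one reason Gini and entropy are preferred over the classification error when growing a tree.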
Comparison of all Impurity Metrics
● Scaled entropy = entropy / 2 (scaling makes the curves easier to compare).
● The Gini index takes intermediate values of impurity, lying between the
classification error and the scaled entropy.
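A hedged sketch that reproduces this comparison numerically: it evaluates the Gini index, scaled entropy, and classification error over class proportions p in [0, 1] and plots the curves; the formulas are the standard binary-class ones, and matplotlib is assumed to be available.

```python
import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0.01, 0.99, 199)   # proportion of class 1 at the node

entropy = -p * np.log2(p) - (1 - p) * np.log2(1 - p)
gini = 1 - (p ** 2 + (1 - p) ** 2)
error = 1 - np.maximum(p, 1 - p)

plt.plot(p, entropy / 2, label="Scaled entropy (entropy / 2)")
plt.plot(p, gini, label="Gini index")
plt.plot(p, error, label="Classification error")
plt.xlabel("p (proportion of class 1)")
plt.ylabel("Impurity")
plt.legend()
plt.show()
```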
Pros :
● Simple to understand, interpret, and visualize.
● Effective with both numerical and categorical data.
● Requires little effort from users for data preparation.
● Nonlinear relationships between parameters do not affect tree
performance.
● Able to handle irrelevant attributes (their information gain is 0, so they
are never chosen for a split).
Cons :
● May grow an overly complex tree if its depth is not limited.
● Unstable: a small variation in the input data may result in a
completely different tree being generated.
● As it is a greedy algorithm, it may not find the globally best tree for a
data set.