# How do CART models work

Example for binary classification problem

Cover image credit: Elliott Engelmann

# r import
library(tidyverse)
library(knitr)
library(kableExtra)
library(Rgraphviz)
library(rpart)
library(rattle)
library(reticulate)

use_python("/home/ruslan/anaconda3/bin/python3.7")
# python import
import pandas as pd
from sklearn import tree
import matplotlib.pyplot as plt

## What is CART?

Classification and Regression Trees (or CART for short) is a type of supervised learning algorithm [1] that is mostly used in classification problems [2], but can be applied for regression problems [3] as well. It can handle both categorical and numerical input variables.

In this tutorial we are going to look at the classification problem.

Let’s start with a simple example. Imagine that we have medical data about patients and we have two drugs (Drug A and Drug B) that work differently for each patient based on his health metrics. We have historical data about 15 patients and what drug type have worked better for them. Now, the 16-th patient comes to us and we want to assign a drug to him based on this historical data. You can load the sample data set here.

patients_df <- read_csv("DecisionTree_SampleData.csv")
kable(patients_df, format = "markdown", caption = "<b>Patients Data</b>")
Patient IDGenderBMIBPDrug
p01Mnot normalnot normalDrug A
p02Mnot normalnormalDrug B
p03Mnot normalnot normalDrug A
p04Mnormalnot normalDrug A
p05Mnot normalnot normalDrug A
p06Fnot normalnormalDrug B
p07Fnormalnot normalDrug A
p08Fnot normalnormalDrug B
p09Mnot normalnot normalDrug A
p10Mnormalnot normalDrug A
p11Fnot normalnot normalDrug B
p12FnormalnormalDrug B
p13FnormalnormalDrug B
p14Mnormalnot normalDrug B
p15Fnot normalnormalDrug B
p16Mnot normalnot normal???

Our decision tree is going to look something like this:

It doesn’t look exactly like a “normal tree” we are used to see in real life. The trick is that this tree is growing from top to down.

Imagine that all our data is located at the root node of the tree. Now we want to do is to split the data by most significant features (that will create new decision nodes) until we reach the leaf nodes with the most homogeneous target possible (meaning that ideally just one class is present in each leaf node).

Now the question is how do we split the data at each node?

## Information Gain and Entropy

The main metrics to use for dealing with classification problems are Information gain and Gini impurity. And variance reduction is most commonly used for dealing with regression tasks. We are going to use Information gain for this tutorial but in the end there will be formulas provided on how to use other metrics since they have the same idea.

Information gain is based on the concept of entropy [4] and information content from information theory [5]. First of all, let’s build an intuition behind the entropy. Simply saying, entropy is the measure of chaos or uncertainty in the data.

Imagine that I asked you to play in a following simple game: I have two coins, one coin is unfair ($$P(\text{Heads}) = \frac{4}{5}$$), the other one is fair ($$P(\text{Heads}) = \frac{1}{2}$$). You can peak a coin and flip it. If it comes up Heads I will give you $1. Which coin would you choose? Well, I hope that you would choose an unfair coin. In this case you are more certain about the outcome of the coin flip (on average it will come up Heads in $$80\%$$ of the time). For the fair coin you have the highest level of uncertainty, meaning that Heads and Tails can occur equally likely. Probability mass functions for both coins look as: $$$p_{\text{unfair}} = \begin{cases} \frac{4}{5} & \text{Heads}\\ \frac{1}{5} & \text{Tails}\\ \end{cases}$$$ $$$p_{\text{fair}} = \begin{cases} \frac{1}{2} & \text{Heads}\\ \frac{1}{2} & \text{Tails}\\ \end{cases}$$$ We can calculate the entropy using the formula: $H(X) = - \sum_{i=1}^{n} p(x_i) \cdot log_2 [p(x_i)]$ • $$n$$ is the number of all possible outcomes for random variable $$X$$ • $$log_2$$ is the logarithm with base 2 ($$log_2 2 = 1$$, $$log_2 4 = 2$$, etc) Entropy for the unfair coin: $H(\text{unfair}) = - \left( \frac{4}{5} \cdot log_2 \frac{4}{5} + \frac{1}{5} \cdot log_2 \frac{1}{5} \right) \\ \approx 0.72$ Entropy for the fair coin: $H(\text{fair}) = - \left( \frac{1}{2} \cdot log_2 \frac{1}{2} + \frac{1}{2} \cdot log_2 \frac{1}{2} \right) \\ = 1$ entropy <- function(p) { return(-sum(p*log(p, base = 2))) } p_unfair <- c(4/5, 1/5) p_fair <- c(1/2, 1/2) entropy(p_unfair) ## [1] 0.7219281 entropy(p_fair) ## [1] 1 As you can see, entropy for the fair coin is higher meaning that the level of uncertainty is also higher. Note that entropy cannot be negative and the minimum value is $$0$$ which means the highest level of certainty. For example, imagine the unfair coin with the probability $$P(\text{Heads}) = 1$$ and $$P(\text{Tails}) = 0$$. In such way you are sure that coin will come up Heads and entropy is equal to zero. Coming back to our decision tree problem we want to split out data into nodes that have reduced the level of uncertainty (or in other words increasing the information gain). Basic algorithm look as following: 1. For each node: • Choose a feature from the data set. • Calculate the significance (information gain) of that feature in the splitting of data. • Repeat for each feature. 1. Split the data by the feature that is the most significant splitter (highest information gain). 2. Repeat until there are no features left. ## Growing a Tree As always for machine learning task we are going to split the data on training (first 15 patients) and test (16-th patient) sets. train_df <- patients_df[-16, ] test_df <- patients_df[16, ] ### Root Node The initial entropy of the distribution of target variable (Drug) is: $H(X) = -\left( p(A) \cdot log_2 p(A) + p(B) \cdot log_2 p(B) \right)$ • $$p(A)$$ - probability of Drug A. Since there are 7 observations out of 15 with the Drug A $$p(A) = \frac{7}{15} \approx 0.47$$ • $$p(B)$$ - probability of Drug B. Since there are 8 observations out of 15 with the Drug A $$p(B) = \frac{8}{15} \approx 0.53$$ $H(X) = -\left(0.47 \cdot log_2 0.47 + 0.53 \cdot log_2 0.53 \right) \\ \approx 0.997$ temp_df <- train_df %>% group_by(Drug) %>% summarise(n = n(), p = round(n() / dim(train_df)[1], 2)) ## summarise() ungrouping output (override with .groups argument) kable(temp_df, format = "markdown") Drugnp Drug A70.47 Drug B80.53 entropy(temp_df$p)
## [1] 0.9974016

For the next step we are going to split the data by each feature (Gender, BMI, BP), calculate the entropy, and check which split variant gives the highest information gain.

#### Gender

Let’s start with Gender. We have two labels in the data set: M and F. After splitting the data we end up with the following proportions:

M62
F16

Meaning that among 15 patients 6 men were assigned to drug A and 2 were assigned to drug B whereas 1 woman was assigned to drug A and 6 women were assigned to drug B.

Calculate the entropy for each gender label:

M62$$\frac{6}{8}$$$$\frac{2}{8}$$0.811
F16$$\frac{1}{6}$$$$\frac{5}{6}$$0.592

Now we would like to know how much information did we gain after the split.

Information Gain = Entropy before the split - Weighted entropy after the split
• Entropy before the split is the entropy at the root in this case
• Weighted entropy is defined as entropy times the label proportion. For example, weighted entropy for Male patients is equals to:

$\text{Weighted Entropy (M)} = \frac{\text{# Male patients}}{\text{# All patients}} \cdot H(M)$

$\text{Weighted Entropy (M)} = \frac{8}{15} \cdot 0.811 = 0.433$

M62$$\frac{6}{8}$$$$\frac{2}{8}$$0.811$$\frac{8}{15}$$0.433
F16$$\frac{1}{6}$$$$\frac{5}{6}$$0.592$$\frac{7}{15}$$0.276

Now we can calculate the information gain:

$\text{IG} = 0.997 - (0.433 + 0.276) = 0.288$

#### BMI

Repeat the previous steps for the BMI feature. We have two labels: normal and not normal.

normal33$$\frac{3}{6}$$$$\frac{3}{6}$$1$$\frac{6}{15}$$0.4
not normal45$$\frac{4}{9}$$$$\frac{5}{9}$$0.991$$\frac{9}{15}$$0.595

$\text{IG} = 0.997 - (0.4 + 0.595) = 0.002$

#### Blood Pressure

And the same for BP column (labels are the same: normal and not normal):

normal06$$\frac{0}{6}$$$$\frac{6}{6}$$0$$\frac{6}{15}$$0
not normal72$$\frac{7}{9}$$$$\frac{2}{9}$$0.764$$\frac{9}{15}$$0.459

$\text{IG} = 0.997 - (0 + 0.459) = 0.538$

#### Node Summary

As we can see BP gives the highest information gain, or in other words, if we split the data by this column we will be more certain about the target variable distribution. So our first split will look like this:

### BP:normal Node

Continue to grow the tree from the (normal node). We filter observations that have normal blood pressure:

train_df %>%
filter(BP == "normal") %>%
select(-BP) %>%
kable(format = "markdown")
Patient IDGenderBMIDrug
p02Mnot normalDrug B
p06Fnot normalDrug B
p08Fnot normalDrug B
p12FnormalDrug B
p13FnormalDrug B
p15Fnot normalDrug B

As we can see, there is no Drug A for this node and entropy is already 0. This node is going to be last for this branch with the drug B as an output with probability = 1.

### BP:not normal Node

Filter all the observations that have not normal blood pressure:

train_df %>%
filter(BP == "not normal") %>%
select(-BP) %>%
kable(format = "markdown")
Patient IDGenderBMIDrug
p01Mnot normalDrug A
p03Mnot normalDrug A
p04MnormalDrug A
p05Mnot normalDrug A
p07FnormalDrug A
p09Mnot normalDrug A
p10MnormalDrug A
p11Fnot normalDrug B
p14MnormalDrug B

Now we consider this set as “initial set”. The entropy of target distribution at this node is $$0.764$$ (we have already calculated it). At this node we have two features: Gender and BMI. We perform the same procedure to find the best splitting feature.

#### Gender

M61$$\frac{6}{7}$$$$\frac{1}{7}$$0.592$$\frac{7}{9}$$0.46
F11$$\frac{1}{2}$$$$\frac{1}{2}$$1$$\frac{2}{9}$$0.222

$\text{IG} = 0.764 - (0.46 + 0.222) = 0.082$

#### BMI

normal41$$\frac{4}{5}$$$$\frac{1}{5}$$0.722$$\frac{5}{9}$$0.401
not normal31$$\frac{3}{4}$$$$\frac{1}{4}$$0.811$$\frac{4}{9}$$0.361

$\text{IG} = 0.764 - (0.401 + 0.361) = 0.002$

#### Node Summary

Gender gives the most information gain after splitting the data. Now the tree will look like this:

### Gender:M Node

Now we filer the set so that we have only observations with BP:not normal and Gender:M:

train_df %>%
filter(BP == "not normal" & Gender == "M") %>%
select(-c(BP, Gender)) %>%
kable(format = "markdown")
Patient IDBMIDrug
p01not normalDrug A
p03not normalDrug A
p04normalDrug A
p05not normalDrug A
p09not normalDrug A
p10normalDrug A
p14normalDrug B

There is just one feature left BMI. Since we have no more features apart from this one we are going to “grow” last nodes with BMI:normal and BMI:not normal labels:

normal21$$\frac{2}{3}$$$$\frac{1}{3}$$
not normal40$$\frac{4}{4}$$$$\frac{0}{4}$$

For the BMI:normal node the predicted class is going to be Drug A, since its probability is higher. The same applies to BMI:not normal node.

### Gender:F Node

Now we filer the set so that we have only observations with BP:not normal and Gender:F:

train_df %>%
filter(BP == "not normal" & Gender == "F") %>%
select(-c(BP, Gender)) %>%
kable(format = "markdown") 
Patient IDBMIDrug
p07normalDrug A
p11not normalDrug B

We are left with just 1 observation in each node. Predicted class for node BMI:normal is going to be Drug A ($$P = 1.$$). Predicted class for node BMI:not normal is going to be Drug B ($$P = 1.$$).

### Final Tree

Using the information from the “female” branch we can visualize the final tree.

Code

node0 <- "Blood Pressure"
node1 <- "Drug B (P=1.)"
node2 <- "Gender"
node3 <- "BMI"
node4 <- "BMI "
node5 <- "Drug A (P=0.66)"
node6 <- "Drug A (P=1.)"
node7 <- "Drug A (P=1.) "
node8 <- "Drug B (P=1.) "
nodeNames <- c(node0, node1, node2, node3,
node4, node5, node6, node7, node8)

# create new graph object
rEG <- new("graphNEL", nodes=nodeNames, edgemode="directed")

plot(rEG, attrs = at)
text(150, 300, "normal", col="black")
text(260, 300, "not normal", col="black")
text(200, 190, "male", col="black")
text(310, 190, "female", col="black")
text(100, 80, "normal", col="black")
text(230, 80, "not normal", col="black")
text(290, 80, "normal", col="black")
text(420, 80, "not normal", col="black")

### Prediction

Let’s look at the 16-th patient:

test_df %>%
kable(format = "markdown") 
Patient IDGenderBMIBPDrug
p16Mnot normalnot normal???

To make a prediction we just need to move top to bottom by the branches of our tree.

1. Check BP. Since BP:not normal go to left branch.
2. Check Gender. Gender:M => right branch.
3. Check BMI. BMI:not normal => right branch.

The predicted value is going to be Drug A with a probability of 1.0.

## Let Computer Do the Math

### R

We can build a decision tree using rpart function (rpart package) and plot it with the help of fancyRpartPlot function (rattle package). Note, that R doesn’t require to encode the categorical variables since it can handle them by itself.

model_dt <- rpart(formula = Drug~Gender+BMI+BP,
data = train_df,
method = "class",
minsplit = 1,
minbucket = 1,
parms = list(split = "information")
)

#plot decision tree model
fancyRpartPlot(model_dt, caption = NULL, type=4)

Get a prediction for 16-th patient:

# return the class
predict(object = model_dt, newdata = test_df, type = "class")
##      1
## Drug A
## Levels: Drug A Drug B

# return the probability of each class
predict(object = model_dt, newdata = test_df, type = "prob")
##      Drug A    Drug B
## 1 0.8571429 0.1428571

Note that since R didn’t split Gender:M node by BMI the probabilities are quite different that we expected but it doesn’t effect the predicted class.

### Python

Unlike R, Python doesn’t allow categorical input variables, so we need to encode them:

patients_df = pd.read_csv("DecisionTree_SampleData.csv")
# drop id column
patients_df.drop('Patient ID', axis=1, inplace=True)

# convert to dummy variables
patients_df_labeled = pd.get_dummies(patients_df)
columns_to_drop = ['Gender_F', 'BMI_not normal', 'BP_not normal', 'Drug_???', 'Drug_Drug B']
patients_df_labeled.drop(
columns_to_drop,
axis=1,
inplace=True)
unknown_drug = patients_df_labeled.iloc[15, :]
patients_df_labeled.drop(15, axis=0, inplace=True) # unknow drug row
patients_df_labeled.head().to_html(classes='table table-striped')
Gender_MBMI_normalBP_normalDrug_Drug A
01001
11010
21001
31101
41001

For example Gender_M:1 means that subject is Male and Gender_M:0 means that subject is female.

To build a decision tree model we can use a DecisionTreeClassifier function from sklearn.tree module.

# split the data
X = patients_df_labeled.drop('Drug_Drug A', axis=1)
y = patients_df_labeled['Drug_Drug A']

# fit the decision tree model
model = tree.DecisionTreeClassifier(criterion='entropy')
model.fit(X, y)
## DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='entropy',
##                        max_depth=None, max_features=None, max_leaf_nodes=None,
##                        min_impurity_decrease=0.0, min_impurity_split=None,
##                        min_samples_leaf=1, min_samples_split=2,
##                        min_weight_fraction_leaf=0.0, presort='deprecated',
##                        random_state=None, splitter='best')
plt.figure(figsize=(10,10))
tree.plot_tree(model, filled=True, rounded=True,
feature_names=X.columns.tolist(), proportion=False,
class_names=['Drug B', 'Drug A'])
plt.show()

Get a prediction for the 16-th patient:

unknown_drug = pd.DataFrame(unknown_drug).T
unknown_drug.drop(['Drug_Drug A'], axis=1, inplace=True)

# return the class
model.predict(unknown_drug)
## array([1], dtype=uint8)
# return the probability of each class
model.predict_proba(unknown_drug)
## array([[0., 1.]])

## Numerical Explanatory Variables

In our simple data set we had only categorical variables (with 2 labels each). But what would model do if we had numerical value as an input? In fact, the decision tree model (in R or Python) takes only numerical variables as input.

I want to draw your attention to Python decision tree, especially on the decision nodes like Gender_M <= 0.5. Where does this 0.5 come from? The answer intersects with the answer on how the model handles numerical values.

Recall the Gender_M column we created for Python train set. It consisted of two numerical values (0 and 1). What model does is filtering the data by the middle value among these two numbers. “Middle” value was :$$\frac{0+1}{2}=0.5$$. Now model have to sets of data: one set has only those values that agree with condition Gender_M <= 0 == True (OR Gender_M = 0 OR Gender is Female) and the other set that doesn’t agree with condition Gender_M <= 0 == False (OR Gender_M = 1 OR Gender is Male). R does this behind the scenes, so we don’t need to worry about encoding, however, Python tree was a good way to show how does the algorithm works.

Image that we add a third category Non-binary to the Gender feature. First, we would encode the values and create a new column Gender_encoded, which would have three numerical values (0, 1 and 2). Say, Non-binary = 0, Male = 1, Female = 2. Now model would have three conditions for filtering the data.

• Gender_encoded <= 0.5 (OR Gender_encoded = 0 OR Gender is Non-binary)
• Gender_encoded <= 1.5 (OR Gender_encoded = 0 or 1 OR Gender is Non-binary | Male)

In a general form we can say that if we have numerical feature, the algorithm will convert it to “categorical” by choosing the average “middle” value as a threshold between actual input values:

• filter 1: feature <= threshold
• filter 2: feature > threshold

For example:

idblood_pressuredrug
p01100A
p02120B
p03150B
p03200B

“Middle threshold” values are: 110 ($$\frac{100+120}{2}$$), 135 ($$\frac{120+150}{2}$$), 175 ($$\frac{150+200}{2}$$).

## Other Splitting Metrics

### Gini Impurity

As told before, for classification problems you can also use Gini impurity as a way to find the best splitting feature. The idea stays the same. The only difference is that instead of Entropy formula we are using Gini formula:

$G(X) = 1 - \sum_{i=1}^{n} p(x_i)^2$ After calculating Gini impurity value for each feature we would calculate the gain and chose the feature that has the highest gain.

Gain = Gini impurity before the split - Weighted Gini impurity after the split

### Variance Reduction

If we are dealing with the regression problem (predicting numerical variable, rather than a class) we would use variance reduction. Instead of calculating the probabilities of target variable we calculate its variance:

$Var(x) = \frac{\sum_{i=1}^n(x_i- \bar x)^2}{n-1}$ And in the same way we are looking for a variable that has the highest variance reduction, or in other words variance in target variable becomes lower after the split.

Variance Reduction = Variance before the split - Weighted variance after the split

## Summary

I hope that it all made sense and it became a bit clearer how do CART models work. It is a really simple algorithm that doesn’t require much feature engineering and often used as a base estimator in more complex bagging and boosting models. Please let me know if you have any questions that I might have missed here.

##### Ruslan Klymentiev
###### Data Scientist

My life credo is “Never stop learning”. When I am not learning, I am travelling or hiking.