Tech Notes

My notes on Statistics, Big Data, Cloud Computing, Cyber Security

Decision Trees and Prediction with Trees

A decision tree is a kind of flowchart — a graphical representation of the process for making a decision or a series of decisions.

In the figure below, suppose you want to build a decision tree to decide the species of a flower (the outcome) based on the variables that describe the flower. For example, if the petal length is < 2.45 the flower is classified as setosa; otherwise the second criterion comes into play: if the petal width is < 1.75 then ..., and so on.

[Figure: decision tree classifying iris species by petal length and petal width]

The decision points in the middle of the tree are called nodes, and the end points are called leaves.

Tree Generation

Let's say we want to generate a tree for an outcome of interest. The basic algorithm is:

  1. Start with all variables in one group
  2. Find the variable/split that best separates the outcomes
  3. Divide the data into two groups (“leaves”) on that split (“node”)
  4. Within each split, find the best variable/split that separates the outcomes
  5. Continue until the groups are too small or sufficiently “pure”

R does this in a completely automated way.
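
For intuition, here is a minimal sketch of step 2 (the split search) for one numeric predictor x and a factor outcome y. The functions gini and best.split are illustrative helpers I am naming here, not part of any package:

# Gini impurity of a set of class labels: 0 when pure, higher when mixed
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# Scan candidate thresholds (midpoints between consecutive unique values)
# and pick the split that gives the lowest weighted impurity of the two groups
best.split <- function(x, y) {
  xs <- sort(unique(x))
  thresholds <- (head(xs, -1) + tail(xs, -1)) / 2
  scores <- sapply(thresholds, function(t) {
    left <- y[x < t]; right <- y[x >= t]
    (length(left) * gini(left) + length(right) * gini(right)) / length(y)
  })
  thresholds[which.min(scores)]
}

best.split(iris$Petal.Length, iris$Species) # 2.45, the first split in the tree above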

How do we then know whether the tree we have got is good? Using the “misclassification rate” (the fraction of observations the tree labels wrongly) or the “residual mean deviance” (the total deviance divided by the number of observations minus the number of terminal nodes).
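
As a toy sketch of the misclassification rate, with made-up actual/predicted vectors:

actual    <- factor(c("a", "a", "b", "b"))
predicted <- factor(c("a", "b", "b", "b"))
mean(predicted != actual) # 0.25, i.e. 1 wrong out of 4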

R Example

Noted below is a detailed explanation of generating and interpreting a decision tree in R.


data(iris)
set.seed(200)
inorout = rbinom(nrow(iris),1,prob=0.9) # each row has a 90% chance of entering the training set
trainingset = iris[inorout==1,] # that's 140 rows
testset = iris[inorout==0,] # that's 10 rows. Also called the hold out data set
library(tree)
tree1 <- tree(Species ~ .,data=trainingset) #the dot means use all columns except the outcome one
plot(tree1)
text(tree1)
summary(tree1)

tree1 is the model that has been trained on the data from the training set.

The output of the ‘summary’ command is:

Classification tree:
tree(formula = Species ~ ., data = trainingset)
Variables actually used in tree construction:
[1] "Petal.Length" "Petal.Width"  "Sepal.Length"
Number of terminal nodes:  6 
Residual mean deviance:  0.1347 = 18.05 / 134 
Misclassification error rate: 0.02857 = 4 / 140 

This says that when this model (tree1) is checked against the actual training set, 4 errors are present, giving a misclassification error rate of 4/140, i.e. 2.9%. The residual mean deviance is the total deviance (18.05) divided by the number of observations minus the number of terminal nodes (140 - 6 = 134). The errors are visible using the table command below.


table(trainingset$Species, predict(tree1, type="class"))

The results are below; the 4 errors are visible off the diagonal. This misclassification is also known as the resubstitution error: the error obtained by comparing the actual data against a tree model derived from that same data. The resubstitution error rate is highly optimistic and underestimates the true error rate, because the same dataset is used both to train the model and to test it.


            setosa versicolor virginica
  setosa         45          0         0
  versicolor      0         45         3
  virginica       0          1        46
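
The same rate can be computed directly from the model's fitted values; calling predict without new data returns the predictions on the training set:

mean(predict(tree1, type="class") != trainingset$Species) # 4/140 = 0.02857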

But since the misclassification error rate is quite low, is this a good model to choose, or is it overfitting?

Try k-fold cross-validation on the training set.

ct = cv.tree(tree1, FUN=prune.tree) # 10-fold cross-validation by default; FUN=prune.misclass gives similar results
plot(ct)

[Plot: cross-validated deviance versus tree size]
The cross-validated deviance is lowest between 4 and 6 leaves.
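
Instead of reading the numbers off the plot, you can inspect the cross-validation object directly (the exact values depend on the random folds):

ct$size # number of terminal nodes for each candidate tree
ct$dev  # the corresponding cross-validated deviance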

Now we prune the tree down to the best 4-leaf subtree.


pt = prune.tree(tree1, best=4)
summary(pt)

After the k-fold cross-validation and pruning, we again take the summary of the resulting tree and check the misclassification error.


Classification tree:
snip.tree(tree = tree1, nodes = c(7L, 12L))
Variables actually used in tree construction:
[1] "Petal.Length" "Petal.Width" 
Number of terminal nodes:  4 
Residual mean deviance:  0.1969 = 26.77 / 136 
Misclassification error rate: 0.02857 = 4 / 140 
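
The pruned tree pt can be plotted in the same way as the original tree:

plot(pt)
text(pt)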

which again leads to quite a small misclassification error (the residual mean deviance denominator is now 140 - 4 = 136). Now for the final step, we see how well our model predicts unseen data (the hold-out set).


testprediction = predict(pt, testset, type="class")
table(testset$Species, testprediction)
table(testset$Species== testprediction)

...which gives an “all correct” result: all 10 hold-out rows fall on the diagonal.


            testprediction
             setosa versicolor virginica
  setosa          5          0         0
  versicolor      0          2         0
  virginica       0          0         3

Disclaimer: These are my study notes, kept online instead of on paper so that others can benefit. In the process I have used some pictures/content from other original authors. All sources/original content publishers are listed below, and they deserve credit for their work. No copyright violation intended.

References for these notes :

http://www.wikihow.com/Create-a-Decision-Tree

The study material for the MOOC “Data Analysis” at Coursera.org

http://mkseo.pe.kr/stats/?p=16
