Resources: Link
http://dataaspirant.com/2017/02/01/decisiontreealgorithmpythonwithscikitlearn/
Decision Tree Algorithm implementation with scikit learn
One of the most intuitive and popular supervised learning algorithms is the Decision Tree algorithm. It can be used for both classification and regression purposes.
The previous article, how the decision tree algorithm works, gave enough introduction to the working aspects of the decision tree algorithm. In this article, we are going to build a decision tree classifier in Python using the scikit-learn machine learning package on the balance scale dataset.
In short, this article explains how to implement a Decision Tree classifier on the Balance Scale data set. We will program our classifier in Python using the sklearn library.
Decision tree algorithm prerequisites
Before you start building the decision tree classifier in Python, make sure you have enough knowledge of how the decision tree algorithm works. If you don't yet have that basic understanding, spend some time with the how the Decision Tree Algorithm works article.
Once we have finished modeling the Decision Tree classifier, we will use the trained model to predict whether the balance scale tips to the right, tips to the left, or stays balanced. The greatness of using sklearn is that it provides the functionality to implement machine learning algorithms in a few lines of code.
Before we get started, let's quickly look at the assumptions we make while creating the decision tree and at the decision tree algorithm pseudocode.
Assumptions we make while using Decision tree
 In the beginning, the whole training set is considered at the root.
 Feature values are preferred to be categorical. If values are continuous, they are discretized prior to building the model.
 Records are distributed recursively on the basis of attribute values.
 The order in which attributes are placed as the root or as internal nodes of the tree is decided using a statistical approach.
Decision Tree Algorithm Pseudocode
 Place the best attribute of our dataset at the root of the tree.
 Split the training set into subsets. Subsets should be made in such a way that each subset contains data with the same value for an attribute.
 Repeat step 1 and step 2 on each subset until you find leaf nodes in all the branches of the tree.
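The three steps above can be sketched as a short recursive routine. This is an illustrative toy (using Gini impurity to pick the "best" attribute and dictionaries as records), not the sklearn implementation we use later in the article:

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_attribute(rows, labels, attrs):
    """Attribute whose split gives the lowest weighted Gini impurity."""
    def weighted_gini(a):
        total = 0.0
        for v in set(r[a] for r in rows):
            subset = [l for r, l in zip(rows, labels) if r[a] == v]
            total += len(subset) / len(labels) * gini(subset)
        return total
    return min(attrs, key=weighted_gini)

def build_tree(rows, labels, attrs):
    # leaf: all labels identical, or no attributes left to split on
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]
    a = best_attribute(rows, labels, attrs)       # step 1: best attribute at root
    rest = [x for x in attrs if x != a]
    tree = {a: {}}
    for v in set(r[a] for r in rows):             # step 2: one subset per value
        sub = [(r, l) for r, l in zip(rows, labels) if r[a] == v]
        srows, slabels = zip(*sub)
        tree[a][v] = build_tree(list(srows), list(slabels), rest)  # step 3: recurse
    return tree

# toy demo: attribute "a" alone determines the label
rows = [{"a": 0, "b": 0}, {"a": 0, "b": 1}, {"a": 1, "b": 0}, {"a": 1, "b": 1}]
labels = ["L", "L", "R", "R"]
print(build_tree(rows, labels, ["a", "b"]))
```

On this toy data the routine correctly picks attribute "a" as the root and stops with pure leaves.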
While building our decision tree classifier, we can improve its accuracy by tuning it with different parameters. But this tuning should be done carefully, since overdoing it can make our algorithm overfit the training data, and we end up with a model that generalizes poorly.
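One careful way to tune is to search the parameter grid with cross-validation rather than tuning by eye. The following is a sketch of that idea using sklearn's GridSearchCV, shown on the bundled iris data as a stand-in dataset (this example is not from the original article):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=100)

# a cross-validated search over depth and leaf size keeps the tree
# from simply memorizing the training data
params = {"max_depth": [2, 3, 4, 5], "min_samples_leaf": [1, 5, 10]}
search = GridSearchCV(DecisionTreeClassifier(random_state=100), params, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.score(X_test, y_test))  # accuracy on held-out data
```

The held-out test score, not the training score, is what tells us whether the tuned tree generalizes.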
Sklearn Library Installation
Python's sklearn library holds tons of modules that help to build predictive models. It contains tools for data splitting, preprocessing, feature selection, tuning, and supervised and unsupervised learning algorithms, etc. It is similar to the Caret library in R programming.
For using it, we first need to install it. The best way to install data science libraries and their dependencies is by installing the Anaconda package. You can also install only the most popular machine learning Python libraries.
The sklearn library provides us direct access to different modules for training our model with different machine learning algorithms, like the K-nearest neighbor classifier, support vector machine classifier, decision tree, linear regression, etc.
Balance Scale Data Set Description
The Balance Scale data set consists of 5 attributes: 4 feature attributes and 1 target attribute. We will try to build a classifier for predicting the Class attribute. The target attribute is the first column of the data.
1. Class Name: 3 (L, B, R)
2. Left-Weight: 5 (1, 2, 3, 4, 5)
3. Left-Distance: 5 (1, 2, 3, 4, 5)
4. Right-Weight: 5 (1, 2, 3, 4, 5)
5. Right-Distance: 5 (1, 2, 3, 4, 5)
Index | Variable Name | Variable Values
1 | Class Name (target variable) | "R": balance scale tips to the right; "L": balance scale tips to the left; "B": balance scale is balanced
2 | Left-Weight | 1, 2, 3, 4, 5
3 | Left-Distance | 1, 2, 3, 4, 5
4 | Right-Weight | 1, 2, 3, 4, 5
5 | Right-Distance | 1, 2, 3, 4, 5
The above table shows all the details of the data set.
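For intuition (this is background on the UCI data set, not something the classifier is told): each label follows a simple lever rule, where the scale tips toward the side with the larger weight × distance product and balances when the products are equal. A small hypothetical helper illustrating that rule:

```python
def balance_class(left_weight, left_distance, right_weight, right_distance):
    """Label a record by the lever rule: torque = weight * distance."""
    left = left_weight * left_distance
    right = right_weight * right_distance
    if left > right:
        return "L"
    if right > left:
        return "R"
    return "B"

print(balance_class(2, 3, 1, 4))  # left torque 6 vs right torque 4 -> "L"
```

The decision tree we build below has to rediscover an approximation of this rule purely from the labeled examples.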
Balance Scale Problem Statement
The problem we are going to address is to model a classifier for evaluating the direction of the balance scale's tip.
Decision Tree classifier implementation in Python with sklearn Library
The modeled Decision Tree will compare a new record's metrics with the prior records (training data) that correctly classified the balance scale's tip direction.
Python packages used
 NumPy
 NumPy is a Numeric Python module. It provides fast mathematical functions.
 NumPy provides robust data structures for efficient computation over multidimensional arrays & matrices.
 We use NumPy to read data files into NumPy arrays and for data manipulation.
 Pandas
 Provides the DataFrame object for data manipulation.
 Provides reading & writing of data between different file formats.
 DataFrames can hold multidimensional data of different types.
 Scikit-learn
 It’s a machine learning library. It includes various machine learning algorithms.
 We are using its
 train_test_split,
 DecisionTreeClassifier,
 accuracy_score modules.
If you haven't set up a machine learning environment on your system yet, the posts below will be helpful.
Importing Python Machine Learning Libraries
This section involves importing all the libraries we are going to use: numpy, pandas, and sklearn's train_test_split, DecisionTreeClassifier & accuracy_score modules.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn import tree

Numpy arrays and pandas dataframes will help us in manipulating data. As discussed above, sklearn is a machine learning library. The train_test_split() function will help us by splitting the data into train & test sets.
The tree module will be used to build a Decision Tree classifier. The accuracy_score function will be used to calculate the accuracy metric from the predicted class labels.
Data Import
For importing the data and manipulating it, we are going to use pandas dataframes. First of all, we need to download the dataset, which is available in the UCI Machine Learning Repository. All the data values are separated by commas.
After downloading the data file, we will use the pandas read_csv() method to import the data into a pandas dataframe. Since our data is separated by commas "," and there is no header row, we pass header=None and sep=','.
balance_data = pd.read_csv(
    # path assumed: the UCI ML repository copy of the balance-scale data,
    # since the original link was lost from the scraped article
    'https://archive.ics.uci.edu/ml/machine-learning-databases/balance-scale/balance-scale.data',
    sep=',', header=None)

We are saving our data into “balance_data” dataframe.
For checking the length & dimensions of our dataframe, we can use the len() function & the .shape attribute.
print("Dataset Length:: ", len(balance_data))
print("Dataset Shape:: ", balance_data.shape)

Output:
Dataset Length:: 625
Dataset Shape:: (625, 5)

We can print the head, i.e., the top 5 rows of our dataframe, using the head() method.
print("Dataset:: ")
balance_data.head()

Output:
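The fit() and predict() calls in the next sections use X_train, X_test, y_train and y_test, but the splitting step itself is missing from the scraped article. Below is a minimal sketch of that step, assuming the class label sits in column 0 and the four features in columns 1-4; the few rows shown are hypothetical stand-ins in the same layout as the real file, and on the loaded data the identical slicing and split apply to balance_data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# hypothetical stand-in for the loaded balance_data dataframe:
# column 0 is the class label, columns 1-4 are the features
balance_data = pd.DataFrame([
    ["B", 1, 1, 1, 1],
    ["R", 1, 1, 1, 2],
    ["L", 2, 1, 1, 1],
    ["R", 1, 2, 2, 2],
    ["L", 2, 2, 1, 1],
    ["B", 2, 1, 1, 2],
])

# slice the features (columns 1-4) away from the target label (column 0)
X = balance_data.values[:, 1:5]
Y = balance_data.values[:, 0]

# hold out 30% of the records as the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.3, random_state=100)
```

Fixing random_state makes the split reproducible, so the accuracy numbers later in the article can be compared between the two classifiers.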
Decision Tree Classifier with criterion gini index
clf_gini = DecisionTreeClassifier(criterion="gini", random_state=100,
                                  max_depth=3, min_samples_leaf=5)
clf_gini.fit(X_train, y_train)

Output:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
max_features=None, max_leaf_nodes=None, min_samples_leaf=5,
min_samples_split=2, min_weight_fraction_leaf=0.0,
presort=False, random_state=100, splitter='best')

Decision Tree Classifier with criterion information gain
clf_entropy = DecisionTreeClassifier(criterion="entropy", random_state=100,
                                     max_depth=3, min_samples_leaf=5)
clf_entropy.fit(X_train, y_train)

Output
DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=3,
max_features=None, max_leaf_nodes=None, min_samples_leaf=5,
min_samples_split=2, min_weight_fraction_leaf=0.0,
presort=False, random_state=100, splitter='best')
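The `tree` module imported at the top is never used again in the scraped post; one plausible use is inspecting the rules the fitted classifier learned. A sketch with export_text (available in scikit-learn 0.21+), shown on the bundled iris data as a stand-in so the snippet is self-contained:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(criterion="gini", random_state=100,
                             max_depth=3, min_samples_leaf=5)
clf.fit(iris.data, iris.target)

# human-readable if/else view of the fitted tree
print(export_text(clf, feature_names=list(iris.feature_names)))
```

Reading the printed rules is a quick sanity check that the tree's splits are sensible before trusting its predictions.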

Prediction
Now we have modeled two classifiers: one with the gini index and another with information gain as the criterion. We are ready to predict classes for our test set using the predict() method. Let's try to predict the target variable for the test set's first record.
clf_gini.predict([[4, 4, 3, 3]])

Output
This way we can predict the class for a single record. Now it's time to predict the target variable for the whole test dataset.
Prediction for Decision Tree classifier with criterion as gini index
y_pred = clf_gini.predict(X_test)
y_pred

Output
array(['R', 'L', 'R', 'R', 'R', 'L', 'R', 'L', 'L', 'L', 'R', 'L', 'L',
'L', 'R', 'L', 'R', 'L', 'L', 'R', 'L', 'R', 'L', 'L', 'R', 'L',
'L', 'L', 'R', 'L', 'L', 'L', 'R', 'L', 'L', 'L', 'L', 'R', 'L',
'L', 'R', 'L', 'R', 'L', 'R', 'R', 'L', 'L', 'R', 'L', 'R', 'R',
'L', 'R', 'R', 'L', 'R', 'R', 'L', 'L', 'R', 'R', 'L', 'L', 'L',
'L', 'L', 'R', 'R', 'L', 'L', 'R', 'R', 'L', 'R', 'L', 'R', 'R',
'R', 'L', 'R', 'L', 'L', 'L', 'L', 'R', 'R', 'L', 'R', 'L', 'R',
'R', 'L', 'L', 'L', 'R', 'R', 'L', 'L', 'L', 'R', 'L', 'R', 'R',
'R', 'R', 'R', 'R', 'R', 'L', 'R', 'L', 'R', 'R', 'L', 'R', 'R',
'R', 'R', 'R', 'L', 'R', 'L', 'L', 'L', 'L', 'L', 'L', 'L', 'R',
'R', 'R', 'R', 'L', 'R', 'R', 'R', 'L', 'L', 'R', 'L', 'R', 'L',
'R', 'L', 'L', 'R', 'L', 'L', 'R', 'L', 'R', 'L', 'R', 'R', 'R',
'L', 'R', 'R', 'R', 'R', 'R', 'L', 'L', 'R', 'R', 'R', 'R', 'L',
'R', 'R', 'R', 'L', 'R', 'L', 'L', 'L', 'L', 'R', 'R', 'L', 'R',
'R', 'L', 'L', 'R', 'R', 'R'], dtype=object)

Prediction for Decision Tree classifier with criterion as information gain
y_pred_en = clf_entropy.predict(X_test)
y_pred_en

Output
array(['R', 'L', 'R', 'L', 'R', 'L', 'R', 'L', 'R', 'R', 'R', 'R', 'L',
'L', 'R', 'L', 'R', 'L', 'L', 'R', 'L', 'R', 'L', 'L', 'R', 'L',
'R', 'L', 'R', 'L', 'R', 'L', 'R', 'L', 'L', 'L', 'L', 'L', 'R',
'L', 'R', 'L', 'R', 'L', 'R', 'R', 'L', 'L', 'R', 'L', 'L', 'R',
'L', 'L', 'R', 'L', 'R', 'R', 'L', 'R', 'R', 'R', 'L', 'L', 'R',
'L', 'L', 'R', 'L', 'L', 'L', 'R', 'R', 'L', 'R', 'L', 'R', 'R',
'R', 'L', 'R', 'L', 'L', 'L', 'L', 'R', 'R', 'L', 'R', 'L', 'R',
'R', 'L', 'L', 'L', 'R', 'R', 'L', 'L', 'L', 'R', 'L', 'L', 'R',
'R', 'R', 'R', 'R', 'R', 'L', 'R', 'L', 'R', 'R', 'L', 'R', 'R',
'L', 'R', 'R', 'L', 'R', 'R', 'R', 'L', 'L', 'L', 'L', 'L', 'R',
'R', 'R', 'R', 'L', 'R', 'R', 'R', 'L', 'L', 'R', 'L', 'R', 'L',
'R', 'L', 'R', 'R', 'L', 'L', 'R', 'L', 'R', 'R', 'R', 'R', 'R',
'L', 'R', 'R', 'R', 'R', 'R', 'R', 'L', 'R', 'L', 'R', 'R', 'L',
'R', 'L', 'R', 'L', 'R', 'L', 'L', 'L', 'L', 'L', 'R', 'R', 'R',
'L', 'L', 'L', 'R', 'R', 'R'], dtype=object)

Calculating Accuracy Score
The function accuracy_score() will be used to calculate the accuracy of the Decision Tree algorithm. By accuracy, we mean the ratio of the correctly predicted data points to all the predicted data points. Accuracy as a metric helps to understand the effectiveness of our algorithm. It takes 4 parameters:
 y_true,
 y_pred,
 normalize,
 sample_weight.
Out of these 4, normalize & sample_weight are optional parameters. The parameter y_true accepts an array of correct labels and y_pred takes an array of predicted labels that are returned by the classifier. It returns accuracy as a float value.
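A quick illustration of these parameters on a toy pair of label arrays (not the article's data):

```python
from sklearn.metrics import accuracy_score

y_true = ["L", "R", "B", "L"]
y_pred = ["L", "R", "R", "L"]   # one of the four predictions is wrong

print(accuracy_score(y_true, y_pred))                   # 0.75
print(accuracy_score(y_true, y_pred, normalize=False))  # 3 correct predictions
```

With the default normalize=True we get the fraction of correct predictions; with normalize=False we get the raw count instead.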
Accuracy for Decision Tree classifier with criterion as gini index
print("Accuracy is ", accuracy_score(y_test, y_pred) * 100)

Output
Accuracy is 73.4042553191

Accuracy for Decision Tree classifier with criterion as information gain
print("Accuracy is ", accuracy_score(y_test, y_pred_en) * 100)

Output
Accuracy is 70.7446808511

Conclusion
In this article, we have learned how to model the decision tree algorithm in Python using the machine learning library scikit-learn. In the process, we learned how to split the data into train and test datasets. To model the decision tree classifier we used the information gain and gini index split criteria. In the end, we calculated the accuracy of these two decision tree models.