
Predicting the species using the iris data file.

Short note on Decision tree


A decision tree is a flowchart-like structure used to make decisions or predictions. It consists of nodes
representing decisions or tests on attributes, branches representing the outcome of these decisions, and
leaf nodes representing final outcomes or predictions. Each internal node corresponds to a test on an
attribute, each branch corresponds to the result of the test, and each leaf node corresponds to a class
label or a continuous value.

Decision Tree for Understanding and Classification.

Structure of a Decision Tree


• Root Node: Represents the entire dataset and the initial decision to be made.
• Internal Nodes: Represent decisions or tests on attributes. Each internal node has two or
more branches.
• Branches: Represent the outcome of a decision or test, leading to another node.
• Leaf Nodes: Represent the final decision or prediction. No further splits occur at these
nodes.
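
The four parts listed above can be sketched in R as a nested list. This is a hand-built illustration only, not how any package stores its trees. The helper names leaf, node and predict_one are hypothetical, and the attributes and thresholds are the iris splits reported later in this document.

```r
# A leaf holds only a class label; no further split happens here.
leaf <- function(label) list(type = "leaf", label = label)

# An internal node tests one attribute against a threshold and has two branches.
node <- function(attribute, threshold, yes, no) {
  list(type = "internal", attribute = attribute, threshold = threshold,
       yes = yes, no = no)
}

# The outermost node is the root: every observation starts here.
tree <- node("Petal.Length", 2.5,
             yes = leaf("setosa"),
             no  = node("Petal.Width", 1.8,
                        yes = leaf("versicolor"),
                        no  = leaf("virginica")))

# Follow one branch per test, starting at the root, until a leaf is reached.
predict_one <- function(tree, obs) {
  while (tree$type == "internal") {
    tree <- if (obs[[tree$attribute]] < tree$threshold) tree$yes else tree$no
  }
  tree$label
}

predict_one(tree, list(Petal.Length = 1.4, Petal.Width = 0.2))
```

Each call walks a single path from the root to a leaf, so the number of tests performed equals the depth of the leaf that is reached.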

Short note on the file iris


The iris data are very small, and methods can be applied to them in memory, within R, without splitting
them into pieces and applying MapReduce algorithms. They are an accessible introductory example
nonetheless, as it is easy to check computations done with MapReduce against those done with the
traditional approach. It is the MapReduce principles, not the size of the data, that are important: once
an algorithm has been expressed in MapReduce terms, it can in theory be applied unchanged to much
larger data.
The iris data are a data frame of 150 measurements of iris petal and sepal lengths and widths, with 50
measurements for each species of “setosa,” “versicolor,” and “virginica.”
Screen shot of the file iris

The file has 150 records and 5 fields.


Dim: 150 rows, 5 columns

Number of rows: 150
Number of columns: 5

Names: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species
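
The figures above can be reproduced with a few base R calls. This sketch assumes the file holds the same data as the iris data frame that ships with R (package datasets); if the data were read from a separate file, the object name would differ.

```r
# iris ships with base R (package "datasets"), so no file needs to be read.
data(iris)

dim(iris)            # rows and columns together
nrow(iris)           # number of rows (records)
ncol(iris)           # number of columns (fields)
names(iris)          # column names
table(iris$Species)  # observations per species
```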

Objectives:
Obj.1
To construct a decision tree that identifies the variables influencing Species.
Obj.2
To use the decision tree model for prediction and classification.

The packages required:


1. require("rpart")
2. require("rpart.plot")
3. require("caret")
Code in RStudio:
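
A minimal sketch of the model-fitting step. It assumes that Species is modelled on all four measurements, that the whole data set is used for fitting, and that rpart's default settings are kept; the object name fit is hypothetical.

```r
library(rpart)

# Assumption: all four measurements are candidate predictors, default settings.
fit <- rpart(Species ~ ., data = iris, method = "class")

print(fit)                 # text listing of the splits; leaves are marked with *
fit$variable.importance    # relative influence of each variable on Species
```

The printed listing gives split points at full precision, so they can differ slightly from rounded values shown in a plot.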

Decision tree diagram:
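
A diagram of this kind can be drawn as sketched below, assuming default plot settings. The rpart.plot call is used when that package is installed; otherwise the sketch falls back to rpart's own plot and text methods, which give a plainer drawing.

```r
library(rpart)

fit <- rpart(Species ~ ., data = iris, method = "class")

if (requireNamespace("rpart.plot", quietly = TRUE)) {
  # Draw the tree with rpart.plot's default settings.
  rpart.plot::rpart.plot(fit)
} else {
  # Fallback: rpart's built-in plotting if rpart.plot is not installed.
  plot(fit, margin = 0.1)
  text(fit, use.n = TRUE)
}
```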

Interpretation:
1. If Petal.Length is less than 2.5, the observation is classified as setosa. This test is the root node,
the initial decision point.
2. For observations with Petal.Length greater than or equal to 2.5, the next decision depends on
Petal.Width. If Petal.Width is less than 1.8, the observation is classified as versicolor.
3. If Petal.Width is greater than or equal to 1.8, the observation is classified as virginica.
This tree separates setosa from the other two species using Petal.Length alone. For flowers with
longer petals (>= 2.5), it then distinguishes versicolor from virginica using Petal.Width.
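
The interpretation above can be checked directly against the data by writing the two tests as a plain function. This is a sketch, not the fitted model itself: the function name classify_iris is hypothetical, and the thresholds are the rounded values quoted in the text.

```r
# Thresholds are the rounded values quoted in the interpretation.
classify_iris <- function(petal_length, petal_width) {
  ifelse(petal_length < 2.5, "setosa",
         ifelse(petal_width < 1.8, "versicolor", "virginica"))
}

predicted <- classify_iris(iris$Petal.Length, iris$Petal.Width)

# Cross-tabulate the rule-based labels against the recorded species.
table(predicted, actual = iris$Species)
```

Off-diagonal cells in the resulting table are the flowers that the two tests assign to the wrong species.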
Description of the rules:
Rule for setosa
If Petal.Length is less than 2.5
Then Species = setosa
Confidence: 100%
Data coverage: 33%
Rule for versicolor
If Petal.Length is greater than or equal to 2.5 and Petal.Width is less than 1.8
Then Species = versicolor
Confidence: 91%
Data coverage: 36%
Rule for virginica
If Petal.Length is greater than or equal to 2.5 and Petal.Width is greater than or equal to 1.8
Then Species = virginica
Confidence: 98%
Data coverage: 31%
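
Confidence and coverage can be derived from the fitted tree as sketched below, assuming the model was fitted on all 150 rows with rpart's defaults. Because each rule here predicts a different species, grouping by predicted class is the same as grouping by rule; that shortcut would not hold for a tree with several leaves per class.

```r
library(rpart)

fit  <- rpart(Species ~ ., data = iris, method = "class")
pred <- predict(fit, iris, type = "class")

# Rows: predicted class (one rule each); columns: actual species.
tab <- table(predicted = pred, actual = iris$Species)

coverage   <- rowSums(tab) / sum(tab)   # share of all rows the rule applies to
confidence <- diag(tab) / rowSums(tab)  # share of those rows the rule gets right

round(100 * cbind(confidence, coverage))
```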
Prediction:

Data frame output:
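
A sketch of the prediction step and the resulting data frame. It assumes predictions are made on the same 150 rows used for fitting; the names result and Predicted are hypothetical.

```r
library(rpart)

fit <- rpart(Species ~ ., data = iris, method = "class")

# Assumption: predictions are made on the same 150 rows used for fitting.
result <- data.frame(iris,
                     Predicted = predict(fit, newdata = iris, type = "class"))

head(result)   # original columns plus the predicted species
```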


Description of Confusion Matrix:
A confusion matrix is a matrix that summarizes the performance of a machine learning model on a set
of test data. It is a means of displaying the number of accurate and inaccurate instances based on the
model’s predictions. It is often used to measure the performance of classification models, which aim
to predict a categorical label for each input instance.
Each cell of the matrix counts the test instances that have a given combination of actual and
predicted class. For a two-class problem the four cells are:

• True positives (TP): occur when the model predicts positive and the data point is actually positive.
• True negatives (TN): occur when the model predicts negative and the data point is actually negative.
• False positives (FP): occur when the model predicts positive but the data point is actually negative.
• False negatives (FN): occur when the model predicts negative but the data point is actually positive.
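
With three species there is no single positive class, so the four counts are usually taken one species at a time, treating that species as positive and the other two as negative. The sketch below does this in base R, assuming the tree is fitted and evaluated on the full iris data.

```r
library(rpart)

fit  <- rpart(Species ~ ., data = iris, method = "class")
pred <- predict(fit, iris, type = "class")
tab  <- table(predicted = pred, actual = iris$Species)

# Treat each species in turn as "positive" and the other two as "negative".
TP <- diag(tab)
FP <- rowSums(tab) - TP        # predicted as this species, actually another
FN <- colSums(tab) - TP        # actually this species, predicted as another
TN <- sum(tab) - TP - FP - FN  # everything else

cbind(TP, FP, FN, TN)
```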
The caret package is used to compute the confusion matrix.
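
A sketch of the call, assuming predictions on the full data set are compared with the recorded species. caret's confusionMatrix takes the predicted factor as data and the true factor as reference; the base R fallback is only there so that the sketch still runs when caret is not installed.

```r
library(rpart)

fit  <- rpart(Species ~ ., data = iris, method = "class")
pred <- predict(fit, iris, type = "class")

if (requireNamespace("caret", quietly = TRUE)) {
  # Prints the table plus overall and per-class statistics.
  cm <- caret::confusionMatrix(data = pred, reference = iris$Species)
  print(cm)
} else {
  # Base R fallback: the table only, without the summary statistics.
  print(table(Prediction = pred, Reference = iris$Species))
}
```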

Interpretation of confusion matrix:


• Confusion Matrix:
The confusion matrix shows the counts of actual versus predicted classifications for the three species
of iris flowers: setosa, versicolor, and virginica.
• Overall Statistics:
o Accuracy: 0.96 (96%)
o 95% CI: (0.915, 0.9852)
o No Information Rate: 0.3333 (proportion of the most frequent class)
o P-Value [Acc > NIR]: < 2.2e-16 (the accuracy is significantly higher than the No Information Rate)
o Kappa: 0.94 (measures the agreement between predicted and true classes, adjusted for
chance)
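
The overall statistics can be recomputed by hand from the confusion table, which shows where each number comes from. This is a sketch under the assumption that the tree is fitted and evaluated on the full iris data; the exact binomial interval and one-sided test are assumed to correspond to the CI and P-Value lines of the report.

```r
library(rpart)

fit  <- rpart(Species ~ ., data = iris, method = "class")
pred <- predict(fit, iris, type = "class")
tab  <- table(predicted = pred, actual = iris$Species)
n    <- sum(tab)

accuracy   <- sum(diag(tab)) / n                      # observed agreement
nir        <- max(colSums(tab)) / n                   # always guess the largest class
expected   <- sum(rowSums(tab) * colSums(tab)) / n^2  # agreement expected by chance
kappa_stat <- (accuracy - expected) / (1 - expected)

# Exact binomial interval, and one-sided test of accuracy against the NIR.
ci      <- binom.test(sum(diag(tab)), n)$conf.int
p_value <- binom.test(sum(diag(tab)), n, p = nir,
                      alternative = "greater")$p.value

round(c(accuracy = accuracy, nir = nir, kappa = kappa_stat), 4)
```

Kappa rescales accuracy so that 0 means no better than chance agreement and 1 means perfect agreement, which is why it sits slightly below the raw accuracy here.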
