
Classifier Series - Decision Trees

Creator: Muhammad Bilal Alam

What are Decision Trees?


Decision trees are a type of machine learning algorithm that helps us make decisions by
breaking down a complex problem into smaller, more manageable pieces. Decision trees
work by creating a tree-like model of decisions and their possible consequences. Each
decision or event is represented by a node on the tree, and the branches represent the
possible outcomes or consequences of that decision. At the end of each branch, we reach a
final decision or prediction.

A decision tree is built by recursive binary splitting of the data on the most informative
feature at each node. For a single numeric feature x, the resulting prediction rule can be
written as a piecewise function:

T(x) = f1(x) if x < k1

T(x) = f2(x) if k1 ≤ x < k2

T(x) = f3(x) if k2 ≤ x < k3

...

T(x) = fn(x) if x ≥ kn

where T(x) is the predicted target value for the input x, f1(x) to fn(x) are the decision rules
for each node in the tree, and k1 to kn are the thresholds for each split in the tree.
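
To make the piecewise idea concrete, here is a minimal, hypothetical sketch: a hand-written
two-level tree on a single numeric feature, expressed as nested if/else rules. The thresholds
(2.5 and 7.0) and the leaf labels are invented purely for illustration.

# A hand-written two-level "tree" for a single numeric feature x.
# Thresholds and leaf labels are illustrative only.
def tiny_tree_predict(x):
    if x < 2.5:        # first split: x < k1
        return 'class_A'
    elif x < 7.0:      # second split: k1 <= x < k2
        return 'class_B'
    else:              # final region: x >= k2
        return 'class_C'

print([tiny_tree_predict(v) for v in [1.0, 4.2, 9.9]])   # ['class_A', 'class_B', 'class_C']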

The splitting feature at each node is chosen using a metric such as information gain or Gini
impurity, and the threshold for each split is the feature value that yields the greatest
reduction in impurity or entropy.
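
As a rough sketch of how one candidate split is scored, the snippet below computes Gini
impurity and the resulting impurity decrease (the Gini counterpart of information gain) for
a toy node; the labels and the candidate split are invented for illustration.

import numpy as np

def gini(labels):
    # Gini impurity: 1 - sum of squared class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

# Toy node with 10 labels and one candidate split into a left and right child
parent = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
left, right = parent[:4], parent[4:]

# Impurity decrease = parent impurity - size-weighted child impurity
weighted_children = (len(left) * gini(left) + len(right) * gini(right)) / len(parent)
print(gini(parent), weighted_children, gini(parent) - weighted_children)   # 0.48, 0.0, 0.48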

The goal of building a decision tree is to create a model that accurately predicts the target
value for new input data, while minimizing overfitting and maintaining simplicity. This is
achieved by controlling the depth and complexity of the tree, and by using pruning and
other techniques to reduce overfitting.
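
In scikit-learn, for example, this is controlled through hyperparameters such as max_depth,
min_samples_leaf and ccp_alpha (cost-complexity pruning). The sketch below, run on synthetic
data chosen purely for illustration, compares an unconstrained tree with a constrained one.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only to illustrate the effect of constraining the tree
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
pruned = DecisionTreeClassifier(max_depth=4, ccp_alpha=0.01, random_state=0).fit(X_tr, y_tr)

print('full tree:   depth', full.get_depth(), ' test accuracy', full.score(X_te, y_te))
print('pruned tree: depth', pruned.get_depth(), ' test accuracy', pruned.score(X_te, y_te))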

When to use decision trees:


When the problem involves making decisions based on a set of rules or criteria that can
be easily represented as a tree.
When the data has a clear structure that can be easily partitioned into discrete
categories or groups.
When the data has a mix of categorical and numerical features, and the decision rules
are based on a combination of both.
When interpretability is important, and we want to be able to understand and explain
the decision-making process.

When not to use decision trees:


When the problem is too complex or the data is too noisy to be easily represented as a
tree.
When the data has many features and the decision rules are not clear or simple, as the
resulting tree may be too large and difficult to interpret.
When the data has a high degree of correlation between features, as this can lead to
overfitting and reduce the performance of the tree.
When the goal is to predict continuous or real-valued outputs, since the decision tree
classifiers covered here predict discrete classes (regression trees exist for continuous
targets, but other models often handle them better).

The Titanic Dataset:


The Titanic dataset is a classic dataset in machine learning and data science. It contains
information on the passengers aboard the Titanic, including their demographic and ticket
information, as well as whether or not they survived the sinking of the ship. The dataset is
often used as a benchmark dataset for classification tasks in machine learning.

The dataset contains 14 columns (13 features plus the target variable), which are as follows:

pclass: Passenger class, where 1 = 1st class, 2 = 2nd class, and 3 = 3rd class.
name: Name of the passenger.
sex: Sex of the passenger.
age: Age of the passenger.
sibsp: Number of siblings/spouses the passenger had aboard the Titanic.
parch: Number of parents/children the passenger had aboard the Titanic.
ticket: Ticket number of the passenger.
fare: Fare paid by the passenger.
cabin: Cabin number of the passenger.
embarked: Port of embarkation, where C = Cherbourg, Q = Queenstown, and S =
Southampton.
boat: Lifeboat number (if the passenger survived).
body: Body number (if the passenger did not survive and their body was recovered).
home.dest: Home or destination of the passenger.
target: Binary variable that indicates whether a passenger survived the sinking of the
Titanic (1) or not (0).

Import Necessary Libraries


In [1]: import pandas as pd
import numpy as np
import seaborn as sns
sns.set_style('whitegrid')
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             confusion_matrix, ConfusionMatrixDisplay, roc_curve, auc)

Load the Titanic Dataset


In [2]: # Load the Titanic dataset from OpenML
titanic = fetch_openml(name='titanic', version=1)

# Convert the data to a pandas dataframe


df = pd.DataFrame(data=titanic.data, columns=titanic.feature_names)
df['target'] = titanic.target

# Print the first five rows of the dataframe


df.head()

Out[2]:
   pclass  name                                              sex     age      sibsp  parch  ticket  fare      cabin    embarked  boat  ...
0  1.0     Allen, Miss. Elisabeth Walton                     female  29.0000  0.0    0.0    24160   211.3375  B5       S         2     ...
1  1.0     Allison, Master. Hudson Trevor                    male    0.9167   1.0    2.0    113781  151.5500  C22 C26  S         11    ...
2  1.0     Allison, Miss. Helen Loraine                      female  2.0000   1.0    2.0    113781  151.5500  C22 C26  S         None  ...
3  1.0     Allison, Mr. Hudson Joshua Creighton              male    30.0000  1.0    2.0    113781  151.5500  C22 C26  S         None  ...
4  1.0     Allison, Mrs. Hudson J C (Bessie Waldo Daniels)   female  25.0000  1.0    2.0    113781  151.5500  C22 C26  S         None  ...

Check the Shape of Dataset


In [3]: df.shape

Out[3]: (1309, 14)

View Summary of the Dataset


In [4]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 pclass 1309 non-null float64
1 name 1309 non-null object
2 sex 1309 non-null category
3 age 1046 non-null float64
4 sibsp 1309 non-null float64
5 parch 1309 non-null float64
6 ticket 1309 non-null object
7 fare 1308 non-null float64
8 cabin 295 non-null object
9 embarked 1307 non-null category
10 boat 486 non-null object
11 body 121 non-null float64
12 home.dest 745 non-null object
13 target 1309 non-null category
dtypes: category(3), float64(6), object(5)
memory usage: 116.8+ KB

Showing Statistical Summary of the Dataset


In [5]: df.describe().T

Out[5]: count mean std min 25% 50% 75% max

pclass 1309.0 2.294882 0.837836 1.0000 2.0000 3.0000 3.000 3.0000

age 1046.0 29.881135 14.413500 0.1667 21.0000 28.0000 39.000 80.0000

sibsp 1309.0 0.498854 1.041658 0.0000 0.0000 0.0000 1.000 8.0000

parch 1309.0 0.385027 0.865560 0.0000 0.0000 0.0000 0.000 9.0000

fare 1308.0 33.295479 51.758668 0.0000 7.8958 14.4542 31.275 512.3292

body 121.0 160.809917 97.696922 1.0000 72.0000 155.0000 256.000 328.0000

EDA: View the distribution of the target variable


In [6]: # Visualize the distribution of the target variable
plt.figure(figsize=(6,4))
sns.countplot(x='target', data=df, palette='cool')
plt.title('Survival Count', fontsize=16, fontweight='bold')
plt.xlabel('Survived', fontsize=14)
plt.ylabel('Count', fontsize=14)
plt.xticks([0, 1], ['No', 'Yes'], fontsize=12)
plt.yticks(fontsize=12)
plt.show()
EDA: Visualize the distribution of age by survival status
In [7]: # Visualize the distribution of age by survival status
plt.figure(figsize=(8,6))
sns.histplot(data=df, x='age', hue='target', multiple='stack', palette='cool', alpha=0.8)
plt.title('Age Distribution by Survival', fontsize=16, fontweight='bold')
plt.xlabel('Age', fontsize=14)
plt.ylabel('Count', fontsize=14)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.legend(['No', 'Yes'], fontsize=12, title='Survived', title_fontsize=12)
plt.show()

EDA: Visualize the distribution of fare by survival status


In [8]: # Visualize the distribution of fare by survival status
plt.figure(figsize=(8,6))
sns.histplot(data=df, x='fare', hue='target', multiple='stack', palette='cool', alpha=0.8)
plt.title('Fare Distribution by Survival', fontsize=16, fontweight='bold')
plt.xlabel('Fare', fontsize=14)
plt.ylabel('Count', fontsize=14)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.legend(['No', 'Yes'], fontsize=12, title='Survived', title_fontsize=12)
plt.show()

EDA: Visualize the distribution of passenger class by survival


status
In [9]: # Visualize the distribution of passenger class by survival status
plt.figure(figsize=(8,6))
sns.countplot(x='pclass', hue='target', data=df, palette='cool', alpha=0.8)
plt.title('Passenger Class Distribution by Survival', fontsize=16, fontweight='bold')
plt.xlabel('Passenger Class', fontsize=14)
plt.ylabel('Count', fontsize=14)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.legend(['No', 'Yes'], fontsize=12, title='Survived', title_fontsize=12)
plt.show()
EDA: Visualize the distribution of sex by survival status
In [10]: # Visualize the distribution of sex by survival status
plt.figure(figsize=(8,6))
sns.countplot(x='sex', hue='target', data=df, palette='cool', alpha=0.8)
plt.title('Sex Distribution by Survival', fontsize=16, fontweight='bold')
plt.xlabel('Sex', fontsize=14)
plt.ylabel('Count', fontsize=14)
plt.xticks(fontsize=12)  # keep the category labels from the data to avoid mislabeling
plt.yticks(fontsize=12)
plt.legend(['No', 'Yes'], fontsize=12, title='Survived', title_fontsize=12)
plt.show()
EDA: Visualize the distribution of siblings/spouses by survival status
In [11]: # Visualize the distribution of siblings/spouses by survival status
plt.figure(figsize=(8,6))
sns.countplot(x='sibsp', hue='target', data=df, palette='cool', alpha=0.8)
plt.title('Sibling/Spouse Distribution by Survival', fontsize=16, fontweight='bold')
plt.xlabel('Number of Siblings/Spouses', fontsize=14)
plt.ylabel('Count', fontsize=14)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.legend(['No', 'Yes'], fontsize=12, title='Survived', title_fontsize=12)
plt.show()

EDA: Visualize the correlation between features


In [12]: # Visualize the correlation between the numeric features
plt.figure(figsize=(8,6))
sns.heatmap(df.select_dtypes(include='number').corr(), annot=True, cmap='cool')
plt.title('Feature Correlation', fontsize=16, fontweight='bold')
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.show()
Data Preprocessing: Check and Drop Duplicates if any
In [13]: # Check for duplicates and drop them if they exist
if df.duplicated().any():
    print("Duplicate rows found!")
    df.drop_duplicates(inplace=True)
else:
    print("No Duplicate Rows found!")

No Duplicate Rows found!

Data Preprocessing: Drop Unnecessary Features:


The features that are dropped are:

name: The passenger's name, which is not relevant to predicting survival and is unlikely to
be useful in a machine learning model.
ticket: The ticket number, which is unlikely to be useful in predicting survival.
cabin: The cabin number, which has many missing values and is unlikely to be useful in
predicting survival.
boat: The lifeboat number, recorded only for passengers who survived; keeping it would leak
the target variable into the features.
body: The body identification number, recorded only for passengers who did not survive;
keeping it would likewise leak the target variable.
home.dest: The passenger's home or destination, which is not likely to be useful in
predicting survival.

In [14]: # Drop unnecessary features


df = df.drop(['name', 'ticket', 'cabin', 'boat', 'body', 'home.dest'], axis=1)
Data Preprocessing: Check for Missing Values
In [15]: # Check for missing values
print(df.isnull().sum())

pclass 0
sex 0
age 263
sibsp 0
parch 0
fare 1
embarked 2
target 0
dtype: int64

Data Preprocessing: Fill Missing Values


In [16]: # Fill in missing values for age and fare with median values
df['age'] = df['age'].fillna(df['age'].median())
df['fare'] = df['fare'].fillna(df['fare'].median())

In [17]: # Fill in missing values for embarked with mode value


mode_embarked = df['embarked'].mode()[0]
df['embarked'] = df['embarked'].fillna(mode_embarked)

Data Preprocessing: Do Data Encoding


In [18]: # Convert sex and embarked features to numeric values
df['sex'] = df['sex'].map({'male': 0, 'female': 1})
df['embarked'] = df['embarked'].map({'C': 0, 'Q': 1, 'S': 2})

Data Preprocessing: Split Dataset into Target and Feature


In [19]: # Split the dataset into features and target
X = df.drop(['target'], axis=1)
y = df['target']

Data Preprocessing: Convert target variable to integer


In [20]: # Convert the target variable to int64
y = y.astype('int64')

Checking Decision Tree Assumptions on Titanic Dataset


Binary or categorical target: Requirement Met

Numeric and categorical features: Requirement Met

No missing values: Requirement Met

No correlated features: Reasonably Met (Features are not Highly Correlated)

Balanced dataset: Reasonably Met (Not Extremely Unbalanced)
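
A quick way to sanity-check the last two points on the preprocessed data (a rough sketch
using the X and y defined above):

# Class balance of the target (the Titanic data is roughly 62% not survived / 38% survived)
print(y.value_counts(normalize=True))

# Pairwise correlation between the encoded features
print(X.corr().round(2))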


ML: Split Dataset into test and train
In [21]: # Splitting Dataset in the ratio 70-30
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

ML: Define Hyperparameters


In [22]: # Define the hyperparameters to tune
params = {'max_depth': [3, 5, 7, 9], 'min_samples_leaf': [1, 5, 10, 15], 'min_samples_split': [2, 4, 6, 8]}

ML: Do GridSearch to Find the Best Parameters and Fit the Model


In [23]: # Fit a decision tree classifier to the training data using grid search cross-validation
clf = DecisionTreeClassifier(random_state=0)
grid_clf = GridSearchCV(clf, params, cv=5)
grid_clf.fit(X_train, y_train)

Out[23]: GridSearchCV(cv=5, estimator=DecisionTreeClassifier(random_state=0),
                      param_grid={'max_depth': [3, 5, 7, 9],
                                  'min_samples_leaf': [1, 5, 10, 15],
                                  'min_samples_split': [2, 4, 6, 8]})
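
The winning combination can be read straight off the fitted search object (the exact values
depend on the train/test split and the grid defined above):

# Best hyperparameter combination and its mean cross-validated accuracy
print(grid_clf.best_params_)
print(grid_clf.best_score_)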

ML: Make Prediction using the trained Model


In [24]: # Make predictions on the testing data using the best estimator from the grid search
best_clf = grid_clf.best_estimator_
y_pred = best_clf.predict(X_test)
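
Because interpretability is one of the main selling points of decision trees, it can help to
plot the top of the fitted tree. A minimal sketch using sklearn.tree.plot_tree, limited to
the first two levels for readability (the class names assume 0 = did not survive, 1 = survived):

from sklearn.tree import plot_tree

plt.figure(figsize=(14, 6))
plot_tree(best_clf, feature_names=list(X.columns), class_names=['Not survived', 'Survived'],
          filled=True, max_depth=2, fontsize=10)
plt.show()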

ML: Find the Most Important Features in the Dataset


In [25]: # Compute feature importances
importances = best_clf.feature_importances_
features = X.columns

# Sort the feature importances in descending order


importances_sorted = sorted(zip(features, importances), key=lambda x: x[1], reverse=True)
features_sorted = [f[0] for f in importances_sorted]
importances_sorted = [f[1] for f in importances_sorted]

# Define colors for the bars


colors = plt.cm.Paired(np.arange(len(importances)))

# Plot the feature importances


plt.figure(figsize=(8, 6))
plt.barh(features_sorted, importances_sorted, color=colors)
plt.xticks(rotation=0)
plt.xlabel('Feature Importance')
plt.ylabel('Feature')
plt.title('Decision Tree Classifier Feature Importances')

# Add the feature importance values to the bars


for i, v in enumerate(importances_sorted):
    plt.text(v, i, '{:.2f}'.format(v), color='black', fontsize=10, ha='left', va='center')

plt.show()
ML: Evaluate the Performance of the Model
In [26]: # Evaluate the performance of the classifier
acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred)
rec = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
fpr, tpr, thresholds = roc_curve(y_test, y_pred)
roc_auc = auc(fpr, tpr)
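
One caveat: roc_curve is being fed the hard 0/1 predictions here, so the "curve" has only one
operating point. A hedged alternative is to use the predicted probability of the positive
class, which traces a full curve and usually gives a somewhat different AUC; separate variable
names are used below so the metrics reported later are unchanged.

# ROC curve based on predicted probabilities of the positive class
y_prob = best_clf.predict_proba(X_test)[:, 1]
fpr_p, tpr_p, _ = roc_curve(y_test, y_prob)
print('ROC AUC from probabilities:', auc(fpr_p, tpr_p))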

In [27]: # Plot the confusion matrix


cm_display = ConfusionMatrixDisplay(cm).plot(cmap='cool')
plt.show()

In [28]: # Create a pandas DataFrame to store the evaluation metrics


data = {'Metric': ['Accuracy', 'Precision', 'Recall', 'F1-score', 'ROC AUC'],
'Score': [acc, prec, rec, f1, roc_auc]}
df_eval = pd.DataFrame(data)

# Print the evaluation metrics


print('\nEvaluation Metrics:')
df_eval

Evaluation Metrics:
Out[28]: Metric Score

0 Accuracy 0.811705

1 Precision 0.778626

2 Recall 0.693878

3 F1-score 0.733813

4 ROC AUC 0.787996

In [32]: # Create a bar plot of the evaluation metrics


plt.figure(figsize=(6, 6))
ax = sns.barplot(x='Metric', y='Score', data=df_eval[df_eval.Metric != 'ROC AUC'])
plt.ylim([0, 1])
plt.title('Decision Tree Classifier Evaluation Metrics')

# Add values at the top of each bar


for i in ax.containers:
    ax.bar_label(i, label_type='edge', fontsize=10)

plt.show()

In [30]: # Plot ROC curve


plt.figure(figsize=(6, 6))
plt.plot(fpr, tpr, color='skyblue', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='gray', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc="lower right")
plt.show()
Conclusion
Based on the evaluation metrics, the decision tree model performed reasonably well on the
Titanic dataset. The model achieved an accuracy score of 0.81, which means that it correctly
classified 81% of the passengers in the test set. The precision score of 0.78 indicates that
when the model predicted that a passenger survived, it was correct 78% of the time. The
recall score of 0.69 means that the model correctly identified 69% of the passengers who
actually survived. The F1-score, which is the harmonic mean of precision and recall, was
0.73. Finally, the ROC AUC score, which measures the model's ability to distinguish between
positive and negative samples, was 0.79. Overall, these metrics suggest that the decision
tree model is a reasonable choice for predicting survival on the Titanic dataset. However,
there is still room for improvement, and it may be worth exploring other models or
optimizing the hyperparameters of the decision tree to achieve better performance.
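
As one possible next step along those lines, a quick, untuned comparison against a random
forest (an ensemble of decision trees) on the same split might look like this; the
hyperparameters below are defaults, not tuned values.

from sklearn.ensemble import RandomForestClassifier

# Untuned random forest on the same split, as a rough point of comparison
rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X_train, y_train)
print('Random forest accuracy:', accuracy_score(y_test, rf.predict(X_test)))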
