
Classifier Series - Decision Trees

Creator: Muhammad Bilal Alam

What are Decision Trees?


Decision trees are a type of machine learning algorithm that helps us make decisions by
breaking down a complex problem into smaller, more manageable pieces. Decision trees
work by creating a tree-like model of decisions and their possible consequences. Each
decision or event is represented by a node on the tree, and the branches represent the
possible outcomes or consequences of that decision. At the end of each branch, we reach a
final decision or prediction.

A decision tree is built by recursive binary splitting of the data on the most informative
feature at each node. For a single numeric feature x, the resulting prediction rule can be
written as a piecewise function:

T(x) = f1(x) if x < k1

T(x) = f2(x) if k1 ≤ x < k2

T(x) = f3(x) if k2 ≤ x < k3

...

T(x) = fn(x) if x ≥ kn

where T(x) is the predicted target value for the input x, f1(x) to fn(x) are the decision rules
for each node in the tree, and k1 to kn are the thresholds for each split in the tree.
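
To make the piecewise idea concrete, here is a minimal, hypothetical sketch: a hand-written
two-level tree on a single numeric feature, expressed as nested if/else rules. The thresholds
(2.5 and 7.0) and the leaf labels are invented purely for illustration.

# A hand-written two-level "tree" for a single numeric feature x.
# Thresholds and leaf labels are illustrative only.
def tiny_tree_predict(x):
    if x < 2.5:        # first split: x < k1
        return 'class_A'
    elif x < 7.0:      # second split: k1 <= x < k2
        return 'class_B'
    else:              # final region: x >= k2
        return 'class_C'

print([tiny_tree_predict(v) for v in [1.0, 4.2, 9.9]])   # ['class_A', 'class_B', 'class_C']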

The splitting feature at each node is chosen using a metric such as information gain or Gini
impurity, and the threshold for each split is the feature value that yields the greatest
reduction in impurity or entropy.
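
As a rough sketch of how one candidate split is scored, the snippet below computes Gini
impurity and the resulting impurity decrease (the Gini counterpart of information gain) for
a toy node; the labels and the candidate split are invented for illustration.

import numpy as np

def gini(labels):
    # Gini impurity: 1 - sum of squared class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

# Toy node with 10 labels and one candidate split into a left and right child
parent = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
left, right = parent[:4], parent[4:]

# Impurity decrease = parent impurity - size-weighted child impurity
weighted_children = (len(left) * gini(left) + len(right) * gini(right)) / len(parent)
print(gini(parent), weighted_children, gini(parent) - weighted_children)   # 0.48, 0.0, 0.48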

The goal of building a decision tree is to create a model that accurately predicts the target
value for new input data, while minimizing overfitting and maintaining simplicity. This is
achieved by controlling the depth and complexity of the tree, and by using pruning and
other techniques to reduce overfitting.
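
In scikit-learn, for example, this is controlled through hyperparameters such as max_depth,
min_samples_leaf and ccp_alpha (cost-complexity pruning). The sketch below, run on synthetic
data chosen purely for illustration, compares an unconstrained tree with a constrained one.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only to illustrate the effect of constraining the tree
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
pruned = DecisionTreeClassifier(max_depth=4, ccp_alpha=0.01, random_state=0).fit(X_tr, y_tr)

print('full tree:   depth', full.get_depth(), ' test accuracy', full.score(X_te, y_te))
print('pruned tree: depth', pruned.get_depth(), ' test accuracy', pruned.score(X_te, y_te))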

When to use decision trees:


When the problem involves making decisions based on a set of rules or criteria that can
be easily represented as a tree.
When the data has a clear structure that can be easily partitioned into discrete
categories or groups.
When the data has a mix of categorical and numerical features, and the decision rules
are based on a combination of both.
When interpretability is important, and we want to be able to understand and explain
the decision-making process.

When not to use decision trees:


When the problem is too complex or the data is too noisy to be easily represented as a
tree.
When the data has many features and the decision rules are not clear or simple, as the
resulting tree may be too large and difficult to interpret.
When the data has a high degree of correlation between features, as this can lead to
overfitting and reduce the performance of the tree.
When the goal is to predict continuous or real-valued outputs, since the decision tree
classifiers covered here predict discrete classes (regression trees exist for continuous
targets, but other models often handle them better).

The Titanic Dataset:


The Titanic dataset is a classic dataset in machine learning and data science. It contains
information on the passengers aboard the Titanic, including their demographic and ticket
information, as well as whether or not they survived the sinking of the ship. The dataset is
often used as a benchmark dataset for classification tasks in machine learning.

The dataset contains 14 columns (13 features plus the target variable), which are as follows:

pclass: Passenger class, where 1 = 1st class, 2 = 2nd class, and 3 = 3rd class.
name: Name of the passenger.
sex: Sex of the passenger.
age: Age of the passenger.
sibsp: Number of siblings/spouses the passenger had aboard the Titanic.
parch: Number of parents/children the passenger had aboard the Titanic.
ticket: Ticket number of the passenger.
fare: Fare paid by the passenger.
cabin: Cabin number of the passenger.
embarked: Port of embarkation, where C = Cherbourg, Q = Queenstown, and S =
Southampton.
boat: Lifeboat number (if the passenger survived).
body: Body number (if the passenger did not survive and their body was recovered).
home.dest: Home or destination of the passenger.
target: Binary variable that indicates whether a passenger survived the sinking of the
Titanic (1) or not (0).

Import Necessary Libraries


In [1]: import pandas as pd
import numpy as np
import seaborn as sns
sns.set_style('whitegrid')
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             confusion_matrix, ConfusionMatrixDisplay, roc_curve, auc)

Load the Titanic Dataset


In [2]: # Load the Titanic dataset from OpenML
titanic = fetch_openml(name='titanic', version=1)

# Convert the data to a pandas dataframe


df = pd.DataFrame(data=titanic.data, columns=titanic.feature_names)
df['target'] = titanic.target

# Print the first five rows of the dataframe


df.head()

Out[2]:
   pclass  name                                              sex     age      sibsp  parch  ticket  fare      cabin    embarked  boat  ...
0  1.0     Allen, Miss. Elisabeth Walton                     female  29.0000  0.0    0.0    24160   211.3375  B5       S         2     ...
1  1.0     Allison, Master. Hudson Trevor                    male    0.9167   1.0    2.0    113781  151.5500  C22 C26  S         11    ...
2  1.0     Allison, Miss. Helen Loraine                      female  2.0000   1.0    2.0    113781  151.5500  C22 C26  S         None  ...
3  1.0     Allison, Mr. Hudson Joshua Creighton              male    30.0000  1.0    2.0    113781  151.5500  C22 C26  S         None  ...
4  1.0     Allison, Mrs. Hudson J C (Bessie Waldo Daniels)   female  25.0000  1.0    2.0    113781  151.5500  C22 C26  S         None  ...

Check the Shape of Dataset


In [3]: df.shape

Out[3]: (1309, 14)

View Summary of the Dataset


In [4]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 pclass 1309 non-null float64
1 name 1309 non-null object
2 sex 1309 non-null category
3 age 1046 non-null float64
4 sibsp 1309 non-null float64
5 parch 1309 non-null float64
6 ticket 1309 non-null object
7 fare 1308 non-null float64
8 cabin 295 non-null object
9 embarked 1307 non-null category
10 boat 486 non-null object
11 body 121 non-null float64
12 home.dest 745 non-null object
13 target 1309 non-null category
dtypes: category(3), float64(6), object(5)
memory usage: 116.8+ KB

Showing Statistical Summary of the Dataset


In [5]: df.describe().T

Out[5]: count mean std min 25% 50% 75% max

pclass 1309.0 2.294882 0.837836 1.0000 2.0000 3.0000 3.000 3.0000

age 1046.0 29.881135 14.413500 0.1667 21.0000 28.0000 39.000 80.0000

sibsp 1309.0 0.498854 1.041658 0.0000 0.0000 0.0000 1.000 8.0000

parch 1309.0 0.385027 0.865560 0.0000 0.0000 0.0000 0.000 9.0000

fare 1308.0 33.295479 51.758668 0.0000 7.8958 14.4542 31.275 512.3292

body 121.0 160.809917 97.696922 1.0000 72.0000 155.0000 256.000 328.0000

EDA: View the distribution of the target variable


In [6]: # Visualize the distribution of the target variable
plt.figure(figsize=(6,4))
sns.countplot(x='target', data=df, palette='cool')
plt.title('Survival Count', fontsize=16, fontweight='bold')
plt.xlabel('Survived', fontsize=14)
plt.ylabel('Count', fontsize=14)
plt.xticks([0, 1], ['No', 'Yes'], fontsize=12)
plt.yticks(fontsize=12)
plt.show()
EDA: Visualize the distribution of age by survival status
In [7]: # Visualize the distribution of age by survival status
plt.figure(figsize=(8,6))
sns.histplot(data=df, x='age', hue='target', multiple='stack', palette='cool', alpha=0.8)
plt.title('Age Distribution by Survival', fontsize=16, fontweight='bold')
plt.xlabel('Age', fontsize=14)
plt.ylabel('Count', fontsize=14)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.legend(['No', 'Yes'], fontsize=12, title='Survived', title_fontsize=12)
plt.show()

EDA: Visualize the distribution of fare by survival status


In [8]: # Visualize the distribution of fare by survival status
plt.figure(figsize=(8,6))
sns.histplot(data=df, x='fare', hue='target', multiple='stack', palette='cool', alpha=0.8)
plt.title('Fare Distribution by Survival', fontsize=16, fontweight='bold')
plt.xlabel('Fare', fontsize=14)
plt.ylabel('Count', fontsize=14)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.legend(['No', 'Yes'], fontsize=12, title='Survived', title_fontsize=12)
plt.show()

EDA: Visualize the distribution of passenger class by survival


status
In [9]: # Visualize the distribution of passenger class by survival status
plt.figure(figsize=(8,6))
sns.countplot(x='pclass', hue='target', data=df, palette='cool', alpha=0.8)
plt.title('Passenger Class Distribution by Survival', fontsize=16, fontweight='bold')
plt.xlabel('Passenger Class', fontsize=14)
plt.ylabel('Count', fontsize=14)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.legend(['No', 'Yes'], fontsize=12, title='Survived', title_fontsize=12)
plt.show()
EDA: Visualize the distribution of sex by survival status
In [10]: # Visualize the distribution of sex by survival status
plt.figure(figsize=(8,6))
sns.countplot(x='sex', hue='target', data=df, palette='cool', alpha=0.8)
plt.title('Sex Distribution by Survival', fontsize=16, fontweight='bold')
plt.xlabel('Sex', fontsize=14)
plt.ylabel('Count', fontsize=14)
plt.xticks(fontsize=12)  # keep the category labels from the data to avoid mislabeling
plt.yticks(fontsize=12)
plt.legend(['No', 'Yes'], fontsize=12, title='Survived', title_fontsize=12)
plt.show()
EDA: Visualize the distribution of siblings/spouses by survival status
In [11]: # Visualize the distribution of siblings/spouses by survival status
plt.figure(figsize=(8,6))
sns.countplot(x='sibsp', hue='target', data=df, palette='cool', alpha=0.8)
plt.title('Sibling/Spouse Distribution by Survival', fontsize=16, fontweight='bold')
plt.xlabel('Number of Siblings/Spouses', fontsize=14)
plt.ylabel('Count', fontsize=14)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.legend(['No', 'Yes'], fontsize=12, title='Survived', title_fontsize=12)
plt.show()

EDA: Visualize the correlation between features


In [12]: # Visualize the correlation between the numeric features
plt.figure(figsize=(8,6))
sns.heatmap(df.select_dtypes(include='number').corr(), annot=True, cmap='cool')
plt.title('Feature Correlation', fontsize=16, fontweight='bold')
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.show()
Data Preprocessing: Check and Drop Duplicates if any
In [13]: # Check for duplicates and drop them if they exist
if df.duplicated().any():
    print("Duplicate rows found!")
    df.drop_duplicates(inplace=True)
else:
    print("No Duplicate Rows found!")

No Duplicate Rows found!

Data Preprocessing: Drop Unnecessary Features:


The features that are dropped are:

name: The passenger's name, which is not relevant to predicting survival and is unlikely to
be useful in a machine learning model.
ticket: The ticket number, which is unlikely to be useful in predicting survival.
cabin: The cabin number, which has many missing values and is unlikely to be useful in
predicting survival.
boat: The lifeboat number, recorded only for passengers who survived; keeping it would leak
the target variable into the features.
body: The body identification number, recorded only for passengers who did not survive;
keeping it would likewise leak the target variable.
home.dest: The passenger's home or destination, which is not likely to be useful in
predicting survival.

In [14]: # Drop unnecessary features


df = df.drop(['name', 'ticket', 'cabin', 'boat', 'body', 'home.dest'], axis=1)
Data Preprocessing: Check for Missing Values
In [15]: # Check for missing values
print(df.isnull().sum())

pclass 0
sex 0
age 263
sibsp 0
parch 0
fare 1
embarked 2
target 0
dtype: int64

Data Preprocessing: Fill Missing Values


In [16]: # Fill in missing values for age and fare with median values
df['age'] = df['age'].fillna(df['age'].median())
df['fare'] = df['fare'].fillna(df['fare'].median())

In [17]: # Fill in missing values for embarked with mode value


mode_embarked = df['embarked'].mode()[0]
df['embarked'] = df['embarked'].fillna(mode_embarked)

Data Preprocessing: Do Data Encoding


In [18]: # Convert sex and embarked features to numeric values
df['sex'] = df['sex'].map({'male': 0, 'female': 1})
df['embarked'] = df['embarked'].map({'C': 0, 'Q': 1, 'S': 2})

Data Preprocessing: Split Dataset into Target and Feature


In [19]: # Split the dataset into features and target
X = df.drop(['target'], axis=1)
y = df['target']

Data Preprocessing: Convert target variable to integer


In [20]: # Convert the target variable to int64
y = y.astype('int64')

Checking Decision Tree Assumptions on Titanic Dataset


Binary or categorical target: Requirement Met

Numeric and categorical features: Requirement Met

No missing values: Requirement Met

No correlated features: Reasonably Met (Features are not Highly Correlated)

Balanced dataset: Reasonably Met (Not Extremely Unbalanced)
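
A quick way to sanity-check the last two points on the preprocessed data (a rough sketch
using the X and y defined above):

# Class balance of the target (the Titanic data is roughly 62% not survived / 38% survived)
print(y.value_counts(normalize=True))

# Pairwise correlation between the encoded features
print(X.corr().round(2))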


ML: Split Dataset into test and train
In [21]: # Splitting Dataset in the ratio 70-30
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

ML: Define Hyperparameters


In [22]: # Define the hyperparameters to tune
params = {'max_depth': [3, 5, 7, 9], 'min_samples_leaf': [1, 5, 10, 15], 'min_samples_split': [2, 4, 6, 8]}

ML: Do GridSearch to Find the Best Parameters and Fit the Model


In [23]: # Fit a decision tree classifier to the training data using grid search cross-validation
clf = DecisionTreeClassifier(random_state=0)
grid_clf = GridSearchCV(clf, params, cv=5)
grid_clf.fit(X_train, y_train)

Out[23]: GridSearchCV(cv=5, estimator=DecisionTreeClassifier(random_state=0),
                      param_grid={'max_depth': [3, 5, 7, 9],
                                  'min_samples_leaf': [1, 5, 10, 15],
                                  'min_samples_split': [2, 4, 6, 8]})
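
The winning combination can be read straight off the fitted search object (the exact values
depend on the train/test split and the grid defined above):

# Best hyperparameter combination and its mean cross-validated accuracy
print(grid_clf.best_params_)
print(grid_clf.best_score_)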

ML: Make Prediction using the trained Model


In [24]: # Make predictions on the testing data using the best estimator from the grid search
best_clf = grid_clf.best_estimator_
y_pred = best_clf.predict(X_test)
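
Because interpretability is one of the main selling points of decision trees, it can help to
plot the top of the fitted tree. A minimal sketch using sklearn.tree.plot_tree, limited to
the first two levels for readability (the class names assume 0 = did not survive, 1 = survived):

from sklearn.tree import plot_tree

plt.figure(figsize=(14, 6))
plot_tree(best_clf, feature_names=list(X.columns), class_names=['Not survived', 'Survived'],
          filled=True, max_depth=2, fontsize=10)
plt.show()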

ML: Find the Most Important Features in the Dataset


In [25]: # Compute feature importances
importances = best_clf.feature_importances_
features = X.columns

# Sort the feature importances in descending order


importances_sorted = sorted(zip(features, importances), key=lambda x: x[1], reverse=True)
features_sorted = [f[0] for f in importances_sorted]
importances_sorted = [f[1] for f in importances_sorted]

# Define colors for the bars


colors = plt.cm.Paired(np.arange(len(importances)))

# Plot the feature importances


plt.figure(figsize=(8, 6))
plt.barh(features_sorted, importances_sorted, color=colors)
plt.xticks(rotation=0)
plt.xlabel('Feature Importance')
plt.ylabel('Feature')
plt.title('Decision Tree Classifier Feature Importances')

# Add the feature importance values to the bars


for i, v in enumerate(importances_sorted):
    plt.text(v, i, '{:.2f}'.format(v), color='black', fontsize=10, ha='left', va='center')

plt.show()
ML: Evaluate the Performance of the Model
In [26]: # Evaluate the performance of the classifier
acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred)
rec = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
fpr, tpr, thresholds = roc_curve(y_test, y_pred)
roc_auc = auc(fpr, tpr)
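
One caveat: roc_curve is being fed the hard 0/1 predictions here, so the "curve" has only one
operating point. A hedged alternative is to use the predicted probability of the positive
class, which traces a full curve and usually gives a somewhat different AUC; separate variable
names are used below so the metrics reported later are unchanged.

# ROC curve based on predicted probabilities of the positive class
y_prob = best_clf.predict_proba(X_test)[:, 1]
fpr_p, tpr_p, _ = roc_curve(y_test, y_prob)
print('ROC AUC from probabilities:', auc(fpr_p, tpr_p))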

In [27]: # Plot the confusion matrix


cm_display = ConfusionMatrixDisplay(cm).plot(cmap='cool')
plt.show()

In [28]: # Create a pandas DataFrame to store the evaluation metrics


data = {'Metric': ['Accuracy', 'Precision', 'Recall', 'F1-score', 'ROC AUC'],
'Score': [acc, prec, rec, f1, roc_auc]}
df_eval = pd.DataFrame(data)

# Print the evaluation metrics


print('\nEvaluation Metrics:')
df_eval

Evaluation Metrics:
Out[28]: Metric Score

0 Accuracy 0.811705

1 Precision 0.778626

2 Recall 0.693878

3 F1-score 0.733813

4 ROC AUC 0.787996

In [32]: # Create a bar plot of the evaluation metrics


plt.figure(figsize=(6, 6))
ax = sns.barplot(x='Metric', y='Score', data=df_eval[df_eval.Metric != 'ROC AUC'])
plt.ylim([0, 1])
plt.title('Decision Tree Classifier Evaluation Metrics')

# Add values at the top of each bar


for i in ax.containers:
    ax.bar_label(i, label_type='edge', fontsize=10)

plt.show()

In [30]: # Plot ROC curve


plt.figure(figsize=(6, 6))
plt.plot(fpr, tpr, color='skyblue', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='gray', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc="lower right")
plt.show()
Conclusion
Based on the evaluation metrics, the decision tree model performed reasonably well on the
Titanic dataset. The model achieved an accuracy score of 0.81, which means that it correctly
classified 81% of the passengers in the test set. The precision score of 0.78 indicates that
when the model predicted that a passenger survived, it was correct 78% of the time. The
recall score of 0.69 means that the model correctly identified 69% of the passengers who
actually survived. The F1-score, which is the harmonic mean of precision and recall, was
0.73. Finally, the ROC AUC score, which measures the model's ability to distinguish between
positive and negative samples, was 0.79. Overall, these metrics suggest that the decision
tree model is a reasonable choice for predicting survival on the Titanic dataset. However,
there is still room for improvement, and it may be worth exploring other models or
optimizing the hyperparameters of the decision tree to achieve better performance.
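
As one possible next step along those lines, a quick, untuned comparison against a random
forest (an ensemble of decision trees) on the same split might look like this; the
hyperparameters below are defaults, not tuned values.

from sklearn.ensemble import RandomForestClassifier

# Untuned random forest on the same split, as a rough point of comparison
rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X_train, y_train)
print('Random forest accuracy:', accuracy_score(y_test, rf.predict(X_test)))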
