Titanic EDA
A decision tree is built by recursive binary splitting of the data on the most significant feature at each node. For a feature x with thresholds k1 < k2 < ... < kn, the fitted tree can be written as a piecewise rule:

    T(x) = f1(x)  if x < k1
    T(x) = f2(x)  if k1 ≤ x < k2
    ...
    T(x) = fn(x)  if x ≥ kn

where T(x) is the predicted target value for the input x, f1(x) to fn(x) are the decision rules for each node in the tree, and k1 to kn are the thresholds for each split in the tree.
The decision rules for each node are determined by the most significant feature at that
node, as determined by a metric such as information gain or Gini impurity. The thresholds
for each split are determined by the values of the feature that result in the greatest
reduction in impurity or entropy.
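As a small illustration of scoring a candidate split by its impurity reduction, here is a self-contained Gini computation; the helper names and the tiny age/survival sample are ours, not the notebook's:

```python
def gini(labels):
    """Gini impurity of a list of binary class labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)           # fraction of positive labels
    return 1.0 - p**2 - (1.0 - p)**2

def split_impurity(values, labels, threshold):
    """Weighted Gini impurity after splitting on values >= threshold."""
    left = [y for x, y in zip(values, labels) if x < threshold]
    right = [y for x, y in zip(values, labels) if x >= threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Toy example: threshold 30 separates the classes exactly, so its
# weighted impurity is 0; threshold 15 leaves a mixed right branch.
ages = [5, 10, 25, 40, 50, 60]
survived = [1, 1, 1, 0, 0, 0]
print(split_impurity(ages, survived, 30))   # -> 0.0 (pure split)
print(split_impurity(ages, survived, 15))   # -> 0.25 (impure split)
```

The tree builder would evaluate many candidate thresholds this way and keep the one with the lowest weighted impurity.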
The goal of building a decision tree is to create a model that accurately predicts the target
value for new input data, while minimizing overfitting and maintaining simplicity. This is
achieved by controlling the depth and complexity of the tree, and by using pruning and
other techniques to reduce overfitting.
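That depth/complexity control maps directly onto scikit-learn hyperparameters. A minimal sketch on synthetic data (the notebook's own variables are not shown in this excerpt):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# An unconstrained tree grows until the leaves are pure, which tends to
# memorize the training data; max_depth and min_samples_leaf cap its size.
deep = DecisionTreeClassifier(random_state=0).fit(X, y)
shallow = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5,
                                 random_state=0).fit(X, y)

print(deep.get_depth(), shallow.get_depth())  # shallow is at most 3 deep
```

These are the same hyperparameters the grid search later in this notebook tunes.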
pclass: Passenger class, where 1 = 1st class, 2 = 2nd class, and 3 = 3rd class.
name: Name of the passenger.
sex: Sex of the passenger.
age: Age of the passenger.
sibsp: Number of siblings/spouses the passenger had aboard the Titanic.
parch: Number of parents/children the passenger had aboard the Titanic.
ticket: Ticket number of the passenger.
fare: Fare paid by the passenger.
cabin: Cabin number of the passenger.
embarked: Port of embarkation, where C = Cherbourg, Q = Queenstown, and S =
Southampton.
boat: Lifeboat number (if the passenger survived).
body: Body number (if the passenger did not survive and their body was recovered).
home.dest: Home or destination of the passenger.
target: Binary variable that indicates whether a passenger survived the sinking of the
Titanic (1) or not (0).
Out[2]:
   pclass  name                                              sex     age      sibsp  parch  ticket  fare      cabin    embarked  boat  body
0  1.0     Allen, Miss. Elisabeth Walton                     female  29.0000  0.0    0.0    24160   211.3375  B5       S         2     NaN
1  1.0     Allison, Master. Hudson Trevor                    male    0.9167   1.0    2.0    113781  151.5500  C22 C26  S         11    NaN
2  1.0     Allison, Miss. Helen Loraine                      female  2.0000   1.0    2.0    113781  151.5500  C22 C26  S         None  NaN
3  1.0     Allison, Mr. Hudson Joshua Creighton              male    30.0000  1.0    2.0    113781  151.5500  C22 C26  S         None  13
4  1.0     Allison, Mrs. Hudson J C (Bessie Waldo Daniels)   female  25.0000  1.0    2.0    113781  151.5500  C22 C26  S         None  NaN

Out[3]: (1309, 14)
name: This feature contains the name of each passenger, which is not relevant to
predicting survival and is unlikely to be useful in a machine learning model.
ticket: This feature contains the ticket number of each passenger, which is
unlikely to be useful in predicting survival.
cabin: This feature contains the cabin number of each passenger, which is
unlikely to be useful in predicting survival and has many missing values.
boat: This feature contains the lifeboat number for passengers who survived;
because it is recorded only for survivors, it leaks the target and cannot be
used for prediction.
body: This feature contains the identification number of the body for
passengers who did not survive; likewise, it is recorded only for
non-survivors and leaks the target.
home.dest: This feature contains the home and destination of each passenger,
which is not likely to be useful in predicting survival.
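Dropping those columns is a one-liner in pandas. A sketch, assuming the DataFrame is named df (the notebook's variable name is not shown); the one-row sample only stands in for the real data:

```python
import pandas as pd

# Illustrative one-row frame with the dataset's 14 columns.
df = pd.DataFrame({
    "pclass": [1.0], "name": ["Allen, Miss. Elisabeth Walton"],
    "sex": ["female"], "age": [29.0], "sibsp": [0.0], "parch": [0.0],
    "ticket": ["24160"], "fare": [211.3375], "cabin": ["B5"],
    "embarked": ["S"], "boat": ["2"], "body": [None],
    "home.dest": ["St Louis, MO"], "target": [1],
})

# Drop the features argued to be irrelevant or target-leaking above.
drop_cols = ["name", "ticket", "cabin", "boat", "body", "home.dest"]
df = df.drop(columns=drop_cols)
print(list(df.columns))
# ['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked', 'target']
```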
pclass 0
sex 0
age 263
sibsp 0
parch 0
fare 1
embarked 2
target 0
dtype: int64
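The counts above show missing values in age (263), fare (1), and embarked (2). One common way to fill them (median for the numeric columns, mode for the port) can be sketched as follows; the tiny DataFrame is illustrative, not the notebook's data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [29.0, np.nan, 2.0, 30.0],
    "fare": [211.3375, 151.55, np.nan, 151.55],
    "embarked": ["S", "S", None, "C"],
})

# Median imputation is robust to the fare column's heavy right skew;
# mode imputation fills the two missing embarkation ports.
df["age"] = df["age"].fillna(df["age"].median())
df["fare"] = df["fare"].fillna(df["fare"].median())
df["embarked"] = df["embarked"].fillna(df["embarked"].mode()[0])

print(df.isnull().sum().sum())   # -> 0, no missing values remain
```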
Out[23]: GridSearchCV(cv=5, estimator=DecisionTreeClassifier(random_state=0),
                      param_grid={'max_depth': [3, 5, 7, 9],
                                  'min_samples_leaf': [1, 5, 10, 15],
                                  'min_samples_split': [2, 4, 6, 8]})
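The search shown in Out[23] can be reproduced roughly as follows; the synthetic data and the train/test variable names are assumptions, since the notebook's split and fitting code is not included in this excerpt:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Same grid as Out[23]: 4 * 4 * 4 = 64 combinations, each scored by
# 5-fold cross-validation on the training set.
param_grid = {
    "max_depth": [3, 5, 7, 9],
    "min_samples_leaf": [1, 5, 10, 15],
    "min_samples_split": [2, 4, 6, 8],
}
grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    param_grid=param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)   # best combination found by 5-fold CV
```

After fitting, grid.best_estimator_ is the refit tree used for prediction.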
ML: Evaluate the Performance of the Model
In [26]: # Evaluate the performance of the classifier
         from sklearn.metrics import (accuracy_score, precision_score,
                                      recall_score, f1_score,
                                      confusion_matrix, roc_curve, auc)

         acc = accuracy_score(y_test, y_pred)
         prec = precision_score(y_test, y_pred)
         rec = recall_score(y_test, y_pred)
         f1 = f1_score(y_test, y_pred)
         cm = confusion_matrix(y_test, y_pred)
         # Note: roc_curve is computed on hard 0/1 predictions here; passing
         # predicted probabilities (predict_proba) would give a smoother curve.
         fpr, tpr, thresholds = roc_curve(y_test, y_pred)
         roc_auc = auc(fpr, tpr)
Evaluation Metrics:
Out[28]: Metric Score
0 Accuracy 0.811705
1 Precision 0.778626
2 Recall 0.693878
3 F1-score 0.733813
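The ROC quantities computed in cell 26 (fpr, tpr, roc_auc) are typically plotted like this; the synthetic labels and scores below are illustrative, since the notebook's plotting cell is not shown in this excerpt:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripted runs
import matplotlib.pyplot as plt
from sklearn.metrics import auc, roc_curve

# Stand-in labels and scores; in the notebook these come from y_test / y_pred.
y_test = [0, 0, 1, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2]
fpr, tpr, thresholds = roc_curve(y_test, y_score)
roc_auc = auc(fpr, tpr)

plt.plot(fpr, tpr, label=f"ROC curve (AUC = {roc_auc:.2f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="chance")  # diagonal baseline
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```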