
Exploratory Data Analysis and Prediction Model on the Titanic Dataset using various Machine Learning Algorithms

Supervised By:
Muhammad Bux Alvi
mbalvi@iub.edu.pk

Project Report By:


Syed Qasim Raza (20CSE-29)
Muhammad Saood Ch (20CSE-27)

Batch: 2020-24

Department of Computer Systems Engineering


Faculty of Engineering

The Islamia University of Bahawalpur


Data Mining | CEP

Abstract
The Royal Mail Ship (RMS) Titanic, a British passenger liner, was the largest ship afloat at the time of her maiden voyage. On that voyage from Southampton to New York City, she struck an iceberg in the North Atlantic and sank in the early hours of April 15, 1912. More than 1,500 of the roughly 2,200 people on board died, making it one of the deadliest peacetime maritime disasters in history. This notorious incident has prompted researchers to study the passenger data in depth. The purpose of this report is to analyze the data and to understand the factors that contributed to a person's survival on board. The study examines passenger characteristics - cabin class, age, and port of embarkation - and their relationship to survival. Survival is predicted using different machine learning algorithms: K-Nearest Neighbors, Support Vector Machine, Logistic Regression, and Random Forest. In this report, we conducted a comparative analysis of the accuracy of the aforementioned algorithms and found that the SVM-based model achieved the highest accuracy among them.


1. Introduction
The Titanic dataset is widely recognized and extensively used in the field of machine
learning and predictive modeling. It provides a rich collection of information about the
passengers aboard the ill-fated RMS Titanic, including their demographics, ticket class,
cabin details, and survival outcomes. This dataset has become a benchmark for
developing predictive models to determine passenger survival based on the available
features.
The primary objective of this report is to perform an Exploratory Data Analysis (EDA)
on the Titanic dataset, enabling us to gain valuable insights into the variables and their
relationships. Through EDA, we aim to uncover patterns, identify trends, and highlight
potential factors associated with survival. This analysis is crucial for developing robust prediction models using the K-Nearest Neighbors (KNN), Logistic Regression, Random Forest, and Support Vector Machine (SVM) algorithms.

1.1 K-Nearest Neighbors


The KNN algorithm is a widely used classification algorithm that relies on the principle of similarity. It is a simple but effective method for classification. The major drawbacks of KNN [1] are its low efficiency - as a lazy learning method it is impractical in many applications, such as dynamic web mining on large repositories - and its dependency on the selection of a "good value" for k.
It classifies new instances by comparing them to the labeled instances in the training
dataset and assigning them to the majority class among their nearest neighbors. By
implementing the KNN algorithm on the Titanic dataset, we can create a predictive
model that estimates the likelihood of survival for passengers based on their
characteristics. It uses the Euclidean distance formula to compute the distance
between the points, which is mathematically represented as:

D = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}    (1)
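
As an illustration, the sketch below fits a KNN classifier with scikit-learn on a few made-up passenger rows (columns Pclass, Sex, Age, with Sex already encoded numerically); the helper function mirrors equation (1), which the classifier uses as its default distance metric.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def euclidean(x, y):
    # Euclidean distance between two feature vectors, as in equation (1)
    return np.sqrt(np.sum((x - y) ** 2))

# Hypothetical toy data: each row is [Pclass, Sex, Age]; labels are Survived
X_train = np.array([[3, 0, 22.0], [1, 1, 38.0], [3, 1, 26.0], [1, 1, 35.0]])
y_train = np.array([0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3)  # k = 3 nearest neighbours
knn.fit(X_train, y_train)

# Predict survival for a new, unseen passenger
print(knn.predict([[3, 0, 30.0]]))
print(euclidean(X_train[0], np.array([3, 0, 30.0])))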

1.2 Logistic Regression


Logistic regression is a statistical modeling technique used to predict binary outcomes
or categorical outcomes with two or more levels. It is a type of regression analysis that
estimates the probability of an event occurring based on a set of independent variables.
Logistic regression may include only one or multiple independent variables, although examining multiple variables is generally more informative because it reveals the unique contribution of each variable after adjusting for the others [2]. The logistic regression model can be written as:

P(y = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}}    (2)
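
As a minimal sketch of equation (2), the logistic (sigmoid) function below maps a linear combination of passenger features to a survival probability; the coefficient values are invented purely for illustration.

import numpy as np

def sigmoid(z):
    # Logistic function from equation (2)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients for intercept, Pclass, Sex, and Age
beta = np.array([1.2, -0.9, 2.5, -0.03])
passenger = np.array([1.0, 3, 0, 30.0])  # leading 1.0 multiplies the intercept

# Estimated probability that this passenger survived
print(sigmoid(beta @ passenger))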
1.3 Support Vector Machine

Support Vector Machine is one of the classical machine learning techniques that can
still help solve big data classification problems. In particular, it can help multi-domain applications in a big data environment. However, the support vector machine is mathematically complex and computationally expensive [3].

The basic idea behind SVM is to find an optimal hyperplane that separates different
classes of data points in a high-dimensional space. The hyperplane is chosen to
maximize the margin, which is the distance between the hyperplane and the nearest data
points of each class. The data points that are closest to the hyperplane are called support
vectors, hence the name "Support Vector Machine."
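
A minimal sketch of this idea with scikit-learn's SVC is given below, using a handful of made-up numeric rows; after fitting, the model exposes the support vectors that define the maximum-margin hyperplane.

import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: [Pclass, Sex, Age] -> Survived
X_train = np.array([[3, 0, 22.0], [1, 1, 38.0], [3, 1, 26.0], [2, 0, 54.0]])
y_train = np.array([0, 1, 1, 0])

# Linear-kernel SVM; the hyperplane maximises the margin between the classes
svm = SVC(kernel='linear', C=1.0)
svm.fit(X_train, y_train)

print(svm.support_vectors_)         # the points closest to the hyperplane
print(svm.predict([[3, 0, 30.0]]))  # prediction for a new passenger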

1.4 Random Forest


Random forest is a commonly used machine-learning algorithm trademarked by Leo
Breiman and Adele Cutler. Its ease of use and flexibility have fueled its adoption, as it
handles both classification and regression problems.
Random forests[4] are a combination of tree predictors such that each tree depends on
the values of a random vector sampled independently and with the same distribution for
all trees in the forest. The generalization error for forests converges to a limit as the
number of trees in the forest becomes large. The generalization error of a forest of tree
classifiers depends on the strength of the individual trees in the forest and the correlation
between them. The mean squared error over S data points can be written as:

\text{MSE} = \frac{1}{S} \sum_{s=1}^{S} (f_s - y_s)^2    (3)

where S denotes the number of data points, f_s is the value returned by the model for data point s, and y_s is its actual value.
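
The sketch below illustrates both ideas on made-up data: a small random forest regressor averaged over many bootstrap-sampled trees, followed by the mean squared error of equation (3) computed directly with NumPy.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical training data (two features) and targets
X_train = np.array([[3, 22.0], [1, 38.0], [3, 26.0], [1, 35.0], [2, 54.0]])
y_train = np.array([0, 1, 1, 1, 0])

# Forest of 100 trees, each grown on a bootstrap sample of the data
forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# Mean squared error over the S data points, as in equation (3)
f_s = forest.predict(X_train)
print(np.mean((f_s - y_train) ** 2))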

1.5 Related Work


Eric Lam and Chongxuan Tang used the Titanic problem to compare and contrast three algorithms: SVM, decision tree analysis, and Naive Bayes. They concluded that sex was crucial for accurately predicting survival and that selecting the right features is essential for getting better outcomes. The three methods they employed were about equally accurate, with no discernible differences [5].
Kunal Vyas, Lin Li, and Zeshi Zheng suggested that dimensionality reduction and further experimentation with the data set could improve the accuracy of the algorithms. Their most important conclusion is that using more features in the models does not necessarily produce better results [6].
Bruno S. Frey, David A. Savage, and Benno Torgler concluded that people in their prime died less often than older people. Financially well-off passengers travelling in first class were better able to save themselves, as were passengers in second class compared to third class [7].
Trevor Stephens made predictions using random forest and decision tree algorithms, with the following features: Title, Fare, Pclass, FamilyID, Family Size, SibSp, Parch, Sex, Age, and Embarked. He did not report the accuracy of the implemented algorithms [8].

2. Research Method
2.1 Data set
The data set is publicly available on Kaggle.com in CSV (Comma-Separated Values) format. As mentioned before, the data set has 891 rows with attributes such as the name of the passenger, the number of siblings, parents, or children aboard, cabin, ticket number, the fare of the ticket, and the port where the person embarked. The raw data set contains metadata as well as incomplete or missing entries, which are filtered in pre-processing. Pre-processing includes assigning the median of the available values to missing entries and converting string values to numeric ones.
Further, the data set has been split into training and test sets to measure how well each algorithm performs. Before building the models, a few data exploration graphs were made to analyze which features could harm the model and which could help improve the result. The features are listed below:

Attribute       Description
PassengerId     Identification number of the passenger
Pclass          Passenger class (1, 2, 3)
Name            Name of the passenger
Sex             Gender of the passenger
Age             Age of the passenger
SibSp           Number of siblings or spouses aboard
Parch           Number of parents or children aboard
Ticket          Ticket number
Fare            Price of the ticket
Embarked        Port of embarkation

There are a few additional attributes, such as wiki_name and wiki_age, that largely duplicate existing columns, and others, such as boat and body, that are not useful for prediction.

2.2 Exploratory Data Analysis


An exploratory analysis looks at the data from as many angles as possible, always on
the lookout for some interesting feature. The data analyst is interested in uncovering
facts about the data and may use any procedure of his/her liking to this end. The only
limits to such an analysis are those imposed by time constraints and the creativity of the
data analyst. EDA is not guided by a desire to confirm the presence of a particular effect, and it is not supported by a statistical model that incorporates a mathematical expression for such an effect.

Fig. 1.1: Histogram

We can explore correlations between variables using the heatmap

Fig. 1.2: Heatmap of correlations


Then we examine the relation of survival with the Sex and Embarked attributes:

Fig. 1.3: Relation between Sex and Survival    Fig. 1.4: Relation between Embarked and Survival

Further analysis yields the following results:

Fig. 1.5: Relation between Age and Survival    Fig. 1.6: Embarked city distribution percentages
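
The figures above can be reproduced with a few lines of pandas/seaborn code. The sketch below is one possible way to do so, assuming the data set is loaded into 'data' with the column names from Section 2.1 (the file name is an assumption):

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

data = pd.read_csv('titanic.csv')  # file name is an assumption

# Histograms of the numerical attributes (Fig. 1.1)
data.hist(figsize=(10, 8))

# Heatmap of correlations between the numerical attributes (Fig. 1.2)
plt.figure(figsize=(8, 6))
sns.heatmap(data.select_dtypes('number').corr(), annot=True, cmap='coolwarm')

# Survival counts split by Sex and by Embarked (Fig. 1.3 and Fig. 1.4)
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.countplot(data=data, x='Sex', hue='Survived', ax=axes[0])
sns.countplot(data=data, x='Embarked', hue='Survived', ax=axes[1])
plt.show()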

2.3 Data Pre-processing


Data preprocessing is an essential step in the data analysis and machine learning
pipeline. It involves transforming raw data into a format suitable for further analysis or
model training. The goal of data preprocessing is to improve data quality, address
missing values, handle outliers, and standardize the data to make it consistent and ready
for modeling.

2.3.1 Missing Value Imputation:


Handling missing values is an essential task in data preprocessing and analysis. Missing
values can occur in datasets due to various reasons, such as data entry errors, equipment
failures, or incomplete survey responses. Dealing with missing values is crucial to
ensure the accuracy and reliability of data analysis and machine learning models. There
are several strategies for handling missing values, such as removal, mean/median/mode imputation, forward/backward fill, regression imputation, and multiple imputation.

A runnable version of these steps with pandas (the file name 'titanic.csv' is assumed):

import pandas as pd

# Read the Titanic dataset into a DataFrame called 'data'
data = pd.read_csv('titanic.csv')

# Identify the columns with missing values
cols_with_missing = data.columns[data.isnull().any()]

for col in cols_with_missing:
    if data[col].dtype == 'object':
        # Categorical column: replace missing values with the mode
        data[col] = data[col].fillna(data[col].mode()[0])
    else:
        # Numerical column: replace missing values with the median
        data[col] = data[col].fillna(data[col].median())

# Print the summary statistics of the modified DataFrame
print(data.describe(include='all'))

# Verify that missing values have been imputed
print(data.isnull().sum())

2.3.2 Handling Categorical Variables:


Categorical data is a type of data that represents qualitative or descriptive characteristics rather than numerical values. It consists of categories or labels that do not have a natural numerical order. In machine learning, handling categorical data is an important task, as many real-world datasets contain categorical features. There are several common approaches for handling categorical data in machine learning, such as Label Encoding, One-Hot Encoding, Ordinal Encoding, and Target Encoding.

1. Label Encoding: This approach assigns a unique numerical label to each category in a categorical feature. For example, mapping "red" to 0, "green" to 1, and "blue" to 2. However, using label encoding alone may introduce an ordinal relationship among categories, which might be inappropriate for certain algorithms.

2. One-Hot Encoding: One-hot encoding creates binary features for each category
in a categorical variable. Each category is represented by a binary vector where only
one element is active (1) and the rest are inactive (0). This approach avoids the ordinal
relationship issue and allows algorithms to interpret categorical variables properly.
However, it can result in a high-dimensional and sparse feature space, especially when
dealing with categorical variables with many unique categories.

3. Ordinal Encoding: Ordinal encoding is suitable when there is a natural ordering or ranking among categories. It assigns numerical values based on the order of categories. For example, "low" as 1, "medium" as 2, and "high" as 3. This approach preserves the ordinal relationship between categories.

4. Target Encoding: Target encoding, also known as mean encoding, replaces each category with the mean target value of the corresponding category. It is particularly useful for binary classification or regression tasks. However, target encoding may introduce data leakage if not performed carefully, as it utilizes information from the target variable during encoding.

5. Hashing Trick: The hashing trick is a technique that converts categorical features into a fixed-size representation, typically using a hash function. It reduces the dimensionality of the feature space, which can be beneficial when dealing with high-cardinality categorical variables. However, it may result in collisions, where different categories are mapped to the same hash value.

It is important to preprocess categorical data appropriately to ensure effective utilization in machine learning models. The choice of encoding technique depends on the specific dataset, the number of unique categories, and the machine learning algorithm being used. We employed the Label Encoding technique to transform the categorical data present in the "Sex" and "Embarked" columns into numerical values.

A runnable version of this step (again assuming the file name 'titanic.csv'):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Read the Titanic dataset into a DataFrame called 'data'
data = pd.read_csv('titanic.csv')

# Columns with categorical variables
categorical_cols = ['Sex', 'Embarked']

# Create an instance of LabelEncoder
encoder = LabelEncoder()

# Fit the encoder to each categorical column and replace the original
# values with the transformed (numeric) labels
for col in categorical_cols:
    data[col] = encoder.fit_transform(data[col])

# Print the first few rows of the modified DataFrame
print(data.head())
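
For comparison, a one-hot encoded version of the same columns could be produced with pandas; this is only a sketch of the alternative described in approach 2, not the encoding used in our models.

import pandas as pd

data = pd.read_csv('titanic.csv')  # file name is an assumption
# One binary column per category of Sex and Embarked
data_onehot = pd.get_dummies(data, columns=['Sex', 'Embarked'])
print(data_onehot.head())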

2.4 Feature Engineering


Feature engineering is the process of creating new features or transforming existing
features in a dataset to improve the performance of machine learning models. It involves
selecting, extracting, and transforming the raw data into a more suitable representation
that captures relevant information for the task at hand.

Feature engineering is crucial because the quality and relevance of the features directly
impact the performance of machine learning algorithms.

Here are some common techniques used in feature engineering:

2.4.1 Feature Reduction/ Selection

Feature reduction, also known as feature selection or dimensionality reduction, is a technique used in machine learning to reduce the number of input variables (features) in a dataset. The goal of feature reduction is to select the most informative and relevant features while discarding irrelevant, redundant, or noisy ones.

This process helps improve the performance of machine learning models by mitigating the curse of dimensionality, reducing overfitting, and enhancing interpretability.

We only need those features that help increase the accuracy of the model or that affect its result.

In our project, we only need Survived, Pclass, Sex, Age, and Embarked for modeling; the other attributes are used only in the EDA.
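
A minimal sketch of this selection step, assuming 'data' is the encoded DataFrame from Section 2.3:

# Keep only the columns needed for modeling; the rest were used for EDA only
features = ['Pclass', 'Sex', 'Age', 'Embarked']
X = data[features]
y = data['Survived']
print(X.head())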
2.4.2 Feature Extraction

Feature extraction is a critical step in machine learning, particularly in scenarios where the original input data contains a large number of features or is of high dimensionality. Feature extraction involves reducing the dimensionality of the data by selecting, combining, or transforming the original features into a more compact and representative set of features while retaining as much relevant information as possible.

The process of feature extraction can involve various techniques and methods,
depending on the nature of the data and the specific problem at hand.

2.4.3 Data Scaling

Data scaling, also known as feature scaling or normalization, is a preprocessing step in machine learning that involves transforming the numerical features of a dataset to a common scale. Scaling the data is important because it can help improve the performance and efficiency of many machine learning algorithms.

There are several common methods for data scaling:


a. Min-Max Scaling (Normalization): This method rescales the data to a fixed
range, typically between 0 and 1. It is done by subtracting the minimum value of
the feature and then dividing it by the difference between the maximum and
minimum values. The formula is:
X_scaled = (X - X_min) / (X_max - X_min)
Min-Max scaling is sensitive to outliers, as it compresses the data range into a
fixed interval. However, it can be useful when the distribution of the data is
known to be uniformly distributed.

b. Standardization (Z-score normalization): This method transforms the data to have zero mean and unit variance. It subtracts the mean value of the feature and divides it by the standard deviation. The formula is:

X_scaled = (X - X_mean) / X_std
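
Both methods are available in scikit-learn. The sketch below applies them to the Age and Fare columns, assuming 'data' is the pre-processed DataFrame from Section 2.3:

from sklearn.preprocessing import MinMaxScaler, StandardScaler

numeric_cols = ['Age', 'Fare']

# Min-Max scaling: rescale each column to the [0, 1] range
data_minmax = MinMaxScaler().fit_transform(data[numeric_cols])

# Standardization: zero mean and unit variance per column
data_standard = StandardScaler().fit_transform(data[numeric_cols])

print(data_minmax[:5])
print(data_standard[:5])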
2.5 Modeling
We used the KNN algorithm to build a prediction model for survival on the Titanic
dataset. We trained and tested the KNN algorithm on the dataset and evaluated its
performance using appropriate metrics such as accuracy, precision, recall, and F1 score.
We also compared the performance of KNN with other machine learning algorithms
such as SVM, LR, and Random forest.
2.6 Evaluation
Accuracy, precision, and recall are used to evaluate model performance. In terms of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), these metrics can be defined as:

\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}    (4)

\text{Precision} = \frac{TP}{TP + FP}    (5)

\text{Recall} = \frac{TP}{TP + FN}    (6)
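
A sketch of the full evaluation loop with scikit-learn is shown below. It assumes the feature matrix X and target y from Section 2.4.1, splits them into training and test sets, fits the four models, and reports the metrics of equations (4)-(6) together with the F1 score; the hyperparameter values are illustrative only.

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    'KNN': KNeighborsClassifier(n_neighbors=5),
    'SVM': SVC(),
    'LR': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(name,
          'accuracy:', round(accuracy_score(y_test, y_pred), 3),
          'precision:', round(precision_score(y_test, y_pred), 3),
          'recall:', round(recall_score(y_test, y_pred), 3),
          'f1:', round(f1_score(y_test, y_pred), 3))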

3. Results & Discussion


Our EDA showed that females had a much higher survival rate than males. The KNN algorithm achieved an accuracy score of 0.77 on the test set. SVM achieved an accuracy score of 0.79, while LR achieved an accuracy score of 0.78. Therefore, we can conclude that SVM performed slightly better than KNN and LR in predicting survival on the Titanic dataset.

4. Summary/ Conclusion
The comprehensive study shows that the Support Vector Machine achieved the highest score, with 79.3 percent correct predictions and the lowest false discovery rate. The study also made us aware of the features that are highly relevant to predicting the survival of a passenger, with Sex being the most important feature. The correlations between factors that were first evaluated with basic measures were confirmed in some cases and contradicted in others.
Obtaining valuable results from the raw and missing data using machine learning and
feature engineering methods is very important for a knowledge-based world. In this
paper, we have proposed models for predicting whether a person survived the Titanic
disaster or not. First, detailed data analysis is conducted to investigate features that have
correlations or are non-informative.
In conclusion, this paper presents a comparative study of machine learning techniques applied to the Titanic data set, examining which features affect the classification results and which techniques are robust.

5. References
1. Guo, G., et al., KNN Model-Based Approach in Classification. 2003, Springer: Berlin, Heidelberg.
2. Stoltzfus, J.C., Logistic regression: a brief primer. Academic Emergency Medicine, 2011. 18(10): p. 1099-1104.
3. Suthaharan, S., Support Vector Machine, in Machine Learning Models and Algorithms for Big Data Classification: Thinking with Examples for Effective Learning. 2016, Springer US: Boston, MA. p. 207-235.
4. Breiman, L., Random forests. Machine Learning, 2001. 45: p. 5-32.
5. Lam, E. and Tang, C., CS229 Titanic - Machine Learning From Disaster. 2012.
6. Vyas, K., Zheng, Z., and Li, L., Titanic - Machine Learning from Disaster. Machine Learning Final Project, UMass Lowell, 2015: p. 1-7.
7. Frey, B.S., Savage, D.A., and Torgler, B., Behavior under extreme conditions: The Titanic disaster. Journal of Economic Perspectives, 2011.
8. Stephens, T., Titanic: Getting Started With R - Part 3: Decision Trees. 2014.

