CEP Final
Supervised By:
Muhammad Bux Alvi
mbalvi@iub.edu.pk
Batch: 2020-24
Abstract
The Royal Mail Ship (RMS) Titanic, a British passenger liner, was the largest ship
afloat when she entered service. On April 15, 1912, during her maiden voyage across the
Atlantic Ocean from Southampton to New York City, she collided with an iceberg and sank.
More than half of the roughly 2,200 people on board died, making the sinking one of the
deadliest peacetime maritime disasters. This notorious incident has prompted researchers
to dig deeper into the associated data set. The purpose
of this report is to analyze the research data and to understand the factors that contribute
to the survival of a person on board. The study examines passenger characteristics -
cabin class, age, and point of departure - and their relationship to survival.
Survival is predicted using different machine learning algorithms like K-Nearest
Neighbor, Support Vector Machine, Logistic Regression, and Random Forest. In this
report, we conducted a comparative analysis of the accuracy of the aforementioned
algorithms and found that the SVM-based model achieved the highest accuracy among
them.
Page 2|
Data Mining | CEP
1. Introduction
The Titanic dataset is widely recognized and extensively used in the field of machine
learning and predictive modeling. It provides a rich collection of information about the
passengers aboard the ill-fated RMS Titanic, including their demographics, ticket class,
cabin details, and survival outcomes. This dataset has become a benchmark for
developing predictive models to determine passenger survival based on the available
features.
The primary objective of this report is to perform an Exploratory Data Analysis (EDA)
on the Titanic dataset, enabling us to gain valuable insights into the variables and their
relationships. Through EDA, we aim to uncover patterns, identify trends, and highlight
potential factors associated with survival. This analysis will be crucial in developing a
robust prediction model using the K-Nearest Neighbors (KNN), Logistic Regression,
Random Forest, and Support Vector Machine (SVM) algorithms.
1.3 Support Vector Machine
Support Vector Machine is one of the classical machine learning techniques that can
still help solve big data classification problems. In particular, it can help multi-domain
applications in a big data environment. However, the support vector machine is
mathematically complex and computationally expensive.[3]
The basic idea behind SVM is to find an optimal hyperplane that separates different
classes of data points in a high-dimensional space. The hyperplane is chosen to
maximize the margin, which is the distance between the hyperplane and the nearest data
points of each class. The data points that are closest to the hyperplane are called support
vectors, hence the name "Support Vector Machine."
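The idea above can be sketched with scikit-learn. This is a minimal illustration on synthetic data, not the report's actual code; the linear kernel, C value, and dataset are assumptions for demonstration.

```python
# Illustrative sketch: fitting a maximum-margin linear SVM on synthetic data.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated clusters of 2-D points, one per class.
X, y = make_blobs(n_samples=100, centers=2, random_state=42)

# SVC with a linear kernel finds the maximum-margin separating hyperplane.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

# The support vectors are the training points closest to the hyperplane.
print("Support vectors per class:", clf.n_support_)
print("Training accuracy:", clf.score(X, y))
```

The points reported by `n_support_` are exactly the "support vectors" described above: only they determine where the hyperplane lies.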
traveling in first class are better able to save themselves, as are passengers in second
class compared to third class.[7]
Trevor Stephens has made the prediction using Random Forest and decision tree
algorithms. He used the following parameters: Title, Fare, Pclass, FamilyID,
Family Size, SibSp, Parch, Sex, Age, and Embarked. He did not mention the accuracy
percentages of the implemented algorithms.[8]
2. Research Method
2.1 Data set
The data set is available publicly on Kaggle.com in CSV (Comma Separated Values)
format. As mentioned before, the data set has 891 rows with attributes such as the name
of the passenger, the number of siblings, parents, or children aboard, cabin, ticket
number, ticket fare, and the port of embarkation. The raw data set contains metadata and
incomplete or missing entries that are filtered in pre-processing. Pre-processing includes
assigning the median of the available values to missing entries and converting string
values to numeric codes.
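The pre-processing steps described above can be sketched as follows. This is a simplified illustration on a tiny hand-made frame, assuming the standard Kaggle column names; the mode-based fill for the categorical Embarked column is a common choice, not something the report specifies.

```python
# Sketch of the described pre-processing: median imputation and
# string-to-numeric conversion, on a tiny example frame.
import pandas as pd

data = pd.DataFrame({
    "Age": [22.0, None, 26.0, 35.0],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", None, "S"],
})

# Assign the median of the available values to missing numeric entries.
data["Age"] = data["Age"].fillna(data["Age"].median())

# For the categorical column, fill missing entries with the most
# frequent value (an assumption; the report only mentions medians).
data["Embarked"] = data["Embarked"].fillna(data["Embarked"].mode()[0])

# Convert string values to numeric codes.
data["Sex"] = data["Sex"].map({"male": 0, "female": 1})
data["Embarked"] = data["Embarked"].map({"S": 0, "C": 1, "Q": 2})

print(data)
```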
Further, the data set has been split into training and test sets to evaluate how
efficiently each algorithm works. Before the models were built, a few data-exploration
graphs were made to analyze which features could be detrimental to the model and
which could help improve our results. The features are listed below:
Attribute      Description
PassengerId    Identification number of the passenger
Pclass         Passenger class (1, 2, 3)
Name           Name of the passenger
Sex            Gender of the passenger
Age            Age of the passenger
SibSp          Number of siblings or spouses on the ship
Parch          Number of children or parents on the ship
Ticket         Ticket number
Fare           Price of the ticket
Embarked       Port of embarkation
There are a few more attributes, but some largely duplicate existing ones (such as
wiki_name and wiki_age) and others are not useful for prediction (such as boat and body).
Fig. 1.1: Histogram
2. One-Hot Encoding: One-hot encoding creates binary features for each category
in a categorical variable. Each category is represented by a binary vector where only
one element is active (1) and the rest are inactive (0). This approach avoids the ordinal
relationship issue and allows algorithms to interpret categorical variables properly.
However, it can result in a high-dimensional and sparse feature space, especially when
dealing with categorical variables with many unique categories.
4. Target Encoding: Target encoding, also known as mean encoding, replaces each
category with the mean target value of the corresponding category. It is particularly
useful when the target variable is binary or regression-based. However, target encoding
may introduce data leakage if not performed carefully, as it utilizes information from
the target variable during encoding.
5. Hashing Trick: The hashing trick is a technique that converts categorical features
into a fixed-size representation, typically using a hash function. It reduces the
dimensionality of the feature space, which can be beneficial when dealing with high-
cardinality categorical variables. However, it may result in potential collisions where
different categories are mapped to the same hash value.
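A minimal version of the hashing trick can be written with a stable hash function; the bucket count and helper name here are illustrative choices, not part of the report.

```python
# Sketch of the hashing trick: map each category to one of n_buckets
# fixed indices. Distinct categories may collide on the same bucket.
import hashlib

def hash_bucket(category, n_buckets=8):
    # md5 gives a stable digest across runs (Python's built-in hash()
    # is salted per process, so it would not be reproducible).
    digest = hashlib.md5(category.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

for port in ["S", "C", "Q"]:
    print(port, "->", hash_bucket(port))
```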
It is important to preprocess categorical data appropriately to ensure its effective use
in machine learning models. The choice of encoding technique depends on the specific
dataset, the number of unique categories, and the machine learning algorithm being
used. We employed the Label Encoding technique to transform the categorical data in
the "Sex" and "Embarked" columns into numerical values.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Read the Titanic dataset into a DataFrame called 'data'.
data = pd.read_csv("train.csv")

# Columns containing categorical variables to encode.
categorical_cols = ["Sex", "Embarked"]

# Fill the missing Embarked values first so the encoder only sees strings.
data["Embarked"] = data["Embarked"].fillna(data["Embarked"].mode()[0])

# Fit a LabelEncoder to each categorical column and replace the
# original values with the numeric codes in place.
encoder = LabelEncoder()
for col in categorical_cols:
    data[col] = encoder.fit_transform(data[col])

# Print the first few rows of the modified DataFrame.
print(data.head())
Feature engineering is crucial because the quality and relevance of the features directly
impact the performance of machine learning algorithms.
This process helps improve the performance of machine learning models by mitigating
the curse of dimensionality, reducing overfitting, and enhancing interpretability.
We only need the features that help increase the accuracy of the model or that affect
its result.
In our project, only Survived, Pclass, Sex, Age, and Embarked are needed for modeling;
the other attributes are used in EDA.
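Restricting the frame to the modeling columns is a one-line selection in pandas; the example frame below is a stand-in for the real dataset.

```python
# Keep only the columns used for modeling, as described above.
import pandas as pd

data = pd.DataFrame({
    "PassengerId": [1, 2],
    "Survived": [0, 1],
    "Pclass": [3, 1],
    "Name": ["A", "B"],
    "Sex": [0, 1],
    "Age": [22.0, 38.0],
    "Embarked": [0, 1],
})

model_cols = ["Survived", "Pclass", "Sex", "Age", "Embarked"]
model_data = data[model_cols]
print(model_data.columns.tolist())
```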
2.4.2 Feature Extraction
The process of feature extraction can involve various techniques and methods,
depending on the nature of the data and the specific problem at hand.
Accuracy = (TP + TN) / (TP + TN + FP + FN)    (4)

Precision = TP / (TP + FP)    (5)

Recall = TP / (TP + FN)    (6)

where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false
positives, and false negatives, respectively.
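These metrics follow directly from the four confusion-matrix counts; the counts in this sketch are hypothetical and only illustrate the arithmetic.

```python
# Compute accuracy, precision, and recall from confusion-matrix counts:
# TP/TN/FP/FN = true/false positives and negatives.
def metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall

# Hypothetical counts for illustration.
acc, prec, rec = metrics(tp=50, tn=80, fp=10, fn=20)
print(acc, prec, rec)
```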
4. Summary/ Conclusion
The comprehensive research shows that the Support Vector Machine achieved the
highest score, with 79.3 percent correct predictions and the lowest false discovery
rate. The research also made us aware of the features that are highly relevant to
predicting the survival of a passenger, with Sex being the most important feature. The
correlation between factors, first evaluated using a basic formula, was confirmed in
some cases and contradicted in others.
Obtaining valuable results from the raw and missing data using machine learning and
feature engineering methods is very important for a knowledge-based world. In this
paper, we have proposed models for predicting whether a person survived the Titanic
disaster or not. First, detailed data analysis is conducted to investigate features that have
correlations or are non-informative.
In conclusion, this paper presents a comparative study of machine learning techniques
applied to the Titanic data set to learn which features affect the classification results
and which techniques are robust.
5. References
1. Guo, G., et al., "KNN Model-Based Approach in Classification." Berlin, Heidelberg:
Springer, 2003.
2. Stoltzfus, J.C., "Logistic regression: a brief primer," Academic Emergency Medicine,
18(10), pp. 1099-1104, 2011.
3. Suthaharan, S., "Support Vector Machine," in Machine Learning Models and Algorithms
for Big Data Classification: Thinking with Examples for Effective Learning. Boston, MA:
Springer US, 2016, pp. 207-235.
4. Breiman, L., "Random forests," Machine Learning, 45, pp. 5-32, 2001.
5. Lam, E. and Tang, C., "CS229 Titanic: Machine Learning From Disaster," 2012.
6. Vyas, K., Zheng, Z., and Li, L., "Titanic: machine learning from disaster," Machine
Learning Final Project, UMass Lowell, pp. 1-7, 2015.
7. Frey, B.S., Savage, D.A., and Torgler, B., "Behavior under extreme conditions: The
Titanic disaster," Journal of Economic Perspectives, 2011.
8. Stephens, T., "Titanic: Getting Started With R - Part 3: Decision Trees," 2014.