ACKNOWLEDGEMENT

It gives us a great sense of pleasure to present this mini-report on building a machine learning
model that predicts the type of people who survived the Titanic shipwreck using passenger
data. We owe a special debt of gratitude to Prof. Afeefa Rafeeque for his constant support and
guidance throughout the report work. His sincerity, thoroughness and perseverance have been
a constant source of inspiration for us; it is only through his conscientious efforts that our
endeavours have seen the light of day. We also take this opportunity to acknowledge the
contribution of Dr. Salman Baig for his full support and assistance during the development of
the project, and we would not like to miss the opportunity to thank all the faculty members of
the department for their kind assistance and cooperation during the development of our project.
Last but not least, we acknowledge our friends for their contribution to the completion of the
project.

1
ABSTRACT
The sinking of the RMS Titanic is presumably one of the most infamous and disastrous
shipwrecks in history. In the early morning hours of April 15, 1912, during her maiden voyage,
the Titanic sank after colliding with an iceberg, killing approximately 1502 of the 2224
passengers and crew and making it one of the deadliest commercial maritime disasters in
history to date. The entire international community was deeply shocked and saddened by the
news of this disaster, which resulted in improved ship-safety legislation. Her architect,
Thomas Andrews, died in the disaster. An eye-opening observation from the sinking of the
Titanic is that some individuals had a better chance of surviving than others: women and
children were given foremost priority, and the heavily stratified social classes of the early
twentieth century were strongly reflected on board. Firstly, the aim is to apply exploratory
data analytics (EDA) to uncover previously unknown or hidden facts in the available data set.
The task is then to apply various machine learning models to determine which types of
individuals were more likely to survive. The outcomes of the different machine learning
models are then compared and analysed on the basis of accuracy.
Keywords— Feature Engineering, Machine Learning, Data Science, Exploratory Data
Analytics, Model Evaluation, Data Mining, ggplot, Logistic Regression, Random Forest,
Support Vector Machine, Confusion Matrix.

2
TABLE OF CONTENTS
CONTENT PAGE NO
Acknowledgement 1
Abstract 2

CHAPTER PAGE NO
CHAPTER 1: INTRODUCTION 4-5
1 Introduction 4

CHAPTER 2: LITERATURE REVIEW 6-7

CHAPTER 3: METHODOLOGY 8-11


3.1 Feature Engineering 8
3.2 Machine Learning Models 8
3.2.1 Logistic Regression 9
3.2.2 Decision Tree 9
3.2.3 Random Forest Classifier 10
3.2.4 Support Vector Machine 11

CHAPTER 4: RESULT AND DISCUSSION 12-21

CHAPTER 5: CONCLUSION 22-23


5.1 Future Scope 23

REFERENCES 24

3
Chapter 1
INTRODUCTION
1. Introduction
Machine learning is the umbrella term for any method that can be implemented on a computer
and applied to a set of data to search for patterns. In essence, this refers to all algorithms used
in data science, regardless of whether they are supervised or unsupervised, employed for
segmentation, classification, or regression. Machine learning is an essential tool in many fields,
including autonomous driving, handwriting identification, language translation, speech
recognition, and image categorization.
For image classification, the pixels that make up an image serve as the features from
which predictions are made, and machine learning algorithms are able to make these
discoveries. Autonomous vehicles combine information from cameras, range sensors, and
GPS, while voice recognition draws on features such as the pitch and volume of sound samples.
In this work, machine learning techniques are used to predict which Titanic passengers
survived. The predictions are produced from a number of features, such as name, title, age,
sex, and class. Predictive analysis uses computational approaches to find meaningful and
useful patterns in vast amounts of data, and here the likelihood of survival is estimated from
different feature combinations. The objective is to perform exploratory data analytics on the
accessible dataset and to examine the effect of each field on passengers' survival by applying
analytics between each dataset field and the "Survival" field. Numerous algorithms are
examined for accuracy, and the most accurate algorithm is then recommended for predictions.
The sinking of "The Titanic" on April 15, 1912, more than a century ago, remains the
most notorious of maritime catastrophes. The Titanic suffered extensive damage as a result of
the collision with the iceberg. A wide variety of people of all ages and genders were present
on that terrible night, but unfortunately there were not enough lifeboats available for rescue.
Many of the dead were men, who gave up their lifeboat places to the numerous women and
children on board; male passengers travelling in second class fared especially badly.
Machine learning techniques are used to forecast which people survived the sinking, based
on features such as ticket price, age, sex, and class.
By applying analytics between each dataset field and the "Survival" field, it will be
possible to undertake exploratory data analytics to mine the available dataset for diverse
information and determine the impact of each field on passenger survival. A machine learning

4
method is used to make predictions about newer data sets. The correctness of the data analysis
using the implemented algorithms will be examined. The most accurate algorithm is offered
for predictions after being compared to other algorithms' accuracy levels.
Many lives were lost during that fatal night more than a century ago; unfortunately,
there were not enough lifeboats to save all of the 2224 people on board, and many men in
particular perished. The purpose of this report is to accurately forecast, given a collection of
demographic data, who would survive the Titanic. A prediction model developed from
passenger data shows that gender, age, ticket type, and socioeconomic status all had a role
in whether a passenger was fortunate enough to survive or sadly perished. By combining
statistical methods, predictive analysis identifies meaningful and practical trends in large
data sets, and machine learning techniques based on diverse feature combinations are used
to predict survival.

5
Chapter 2
LITERATURE REVIEW
Every machine learning algorithm operates most effectively under a certain set of conditions,
and making sure an algorithm complies with those requirements helps guarantee good
performance. No single algorithm can be employed in every situation; instead, one should
consider algorithms such as Logistic Regression, Decision Trees, SVM, Random Forest, etc.,
depending on the problem. Logistic regression and decision trees are used as prediction
models in this study.
Logistic regression is used to predict the effect of a series of variables on a binary response
variable and to classify observations by modelling the probability of an event occurring as a
function of the independent variables, which can be categorical or numerical. It also estimates
the probability that the event occurs for a randomly selected observation versus the
probability that it does not. It is most frequently utilised in biology and the social sciences.
The efficiency of a logistic regression model can be evaluated using the AIC (Akaike
Information Criterion), the null and residual deviance, the confusion matrix, and McFadden's
R2 (also known as pseudo R2). AIC plays a role comparable to adjusted R2 in logistic
regression: it is a measure of fit that penalises a model according to the number of model
coefficients, so models with low AIC values are consistently preferred. The null deviance
reflects how well the response is predicted by a model with only an intercept, while the
residual deviance reflects the prediction after the independent variables are taken into
account; in both cases, lower values indicate a better model. The confusion matrix is simply
a table comparing actual and predicted values; it allows the model's accuracy to be assessed
and overfitting to be detected. Because logistic regression has no statistic directly comparable
to R-squared, pseudo R-squared measures such as McFadden's R2 have been created to
evaluate the goodness of fit of logistic models.
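The metrics above can be computed directly from a fitted model's log-likelihoods. Below is a minimal sketch (not taken from the report's own code, and using synthetic data) that fits a scikit-learn logistic regression and derives the AIC, null and residual deviance, McFadden's pseudo R2, and the confusion matrix by hand; note that scikit-learn applies L2 regularization by default, so the AIC here is approximate.

```python
# Sketch: evaluation metrics for a logistic regression, computed manually.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

model = LogisticRegression().fit(X, y)
p = model.predict_proba(X)[:, 1]

# Log-likelihood of the fitted model and of the intercept-only (null) model.
ll_model = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
p_null = y.mean()
ll_null = np.sum(y * np.log(p_null) + (1 - y) * np.log(1 - p_null))

k = X.shape[1] + 1                     # number of coefficients + intercept
aic = 2 * k - 2 * ll_model             # penalised fit: lower is better
null_deviance = -2 * ll_null           # intercept-only model
residual_deviance = -2 * ll_model      # drops below the null deviance if predictors help
mcfadden_r2 = 1 - ll_model / ll_null   # pseudo R^2, between 0 and 1

print(confusion_matrix(y, model.predict(X)))
```

A model whose residual deviance sits well below the null deviance, with a correspondingly higher McFadden R2, is making real use of its predictors.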
A decision tree [9] is a hierarchical tree structure that can be used to divide a sizable
collection of records into more manageable sets of classes by applying a series of simple
decision rules. A decision tree model is a set of rules for grouping a large number of
heterogeneous individuals into more manageable, homogeneous (mutually exclusive)
groupings.

6
The class attributes may be variables of any type, including binary, nominal, ordinal, and
quantitative values; the class itself, however, must be qualitative (categorical, binary, or
ordinal).
To put it simply, a decision tree creates a set of rules (or a series of questions) that can be used
to identify a class given a set of attributes and their classes. A hierarchy of segments within
segments is created by applying one rule at a time. The hierarchy is referred to as a tree, and
each section is referred to as a node. The members of the generated sets are increasingly similar
to one another with each subsequent division. As a result, the method for creating decision trees
is known as recursive partitioning. Applications for decision trees include determining whether
or not to grant a loan, classifying buyers from non-buyers, classifying tumour cells as benign
or malignant, and diagnosing various diseases based on symptoms and characteristics.

7
Chapter 3
METHODOLOGY

Figure: Flowchart for the proposed Methodology

The proposed work begins by reading the Titanic dataset obtained from the well-known
Kaggle data repository. Next, exploratory data analysis is performed on the dataset. In the
third stage the dataset is broken into two parts, training and testing. Finally, different
well-known machine learning algorithms are applied to obtain predictions for the expected
survivors of the Titanic accident.
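The four steps above can be sketched as follows. This is a hypothetical outline rather than the report's actual code: a tiny inline sample stands in for the full Kaggle train.csv, and the column names mirror the Kaggle schema.

```python
# Sketch of the pipeline: read data, explore it, split it 80/20 for modelling.
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny inline sample standing in for the full Kaggle train.csv.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass":   [3, 1, 3, 2, 1, 3],
    "Sex":      [0, 1, 1, 0, 1, 0],   # already encoded: 0 = male, 1 = female
    "Age":      [22, 38, 26, 35, 35, 28],
})

# Step 2: a first exploratory look, e.g. survival rate by passenger class.
print(df.groupby("Pclass")["Survived"].mean())

# Step 3: separate features from the target, then make the 80/20 split.
X = df.drop(columns="Survived")
y = df["Survived"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```

Step 4 then fits each candidate model on `X_train`/`y_train` and compares their scores on the held-out `X_test`.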

3.1 Feature Engineering


Feature engineering is the most significant aspect of data analytics. It governs the choice of
characteristics used during training and, therefore, during prediction. Feature engineering
uses domain awareness to identify the aspects of the data set that assist in creating a machine
learning model, and when it comes to modelling it assists in interpreting the data collection.
A bad selection of features may result in a less precise and poorly performing predictive
model; reliability and predictive capacity depend upon the selection of accurate
characteristics. Feature engineering also filters out unused and inessential features.
The exploratory study is carried out on the following features: a passenger's age, sex,
title, cabin number, Pclass, place of embarkation, fare, and family size (which combines
both parch and sibsp). The survival column is selected as the response column. These

8
characteristics are chosen because their values have an effect on the survival rate. If the
wrong features are chosen, bad predictions may be generated. Thus, in creating an effective
predictive model, feature engineering acts as a backbone.
Several columns contain empty values, represented as NaN, NAN or na. Only four
columns, namely age, deck, embarked and embark_town, have missing values. The next step
is to look at every column's value names and counts in order to spot unwanted columns.
Redundant rows and columns should be dropped, and rows with missing values removed.
It is also advisable to drop the column called deck, as it has 688 missing rows of data, so
77.2% of all the data is missing for this column. A look at the data types tells which columns
need to be encoded numerically; embarked and sex are the two columns that must be altered.
The distinct values of the non-numeric columns are printed, the columns are converted into
numeric ones, and the new values are printed. The data then needs to be split twice: first into
independent 'X' and dependent 'Y' data sets, and second into 80% training and 20% testing
data sets. Optionally, it is also possible to scale the data to a specific range. Afterwards, a
function is designed that holds all the different machine learning models that could
potentially aid the decision about which one to choose. After storing all the models in a
variable called model, the Decision Tree Classifier had the most accurate result, 99.29%,
while the Logistic Regression Model reached an accuracy of 81.11% on the testing data.
However, the model ultimately chosen was the Random Forest Classifier. This model, at
position 6, was chosen because it had an accuracy of a whopping 97.53% on the training
data and 80.41% on the testing data.
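The cleaning and encoding steps described above can be sketched like this. The sample data below is illustrative (the full deck column has 688 missing rows, as noted above), and the column names follow the seaborn copy of the Titanic data mentioned in the text.

```python
# Sketch: drop mostly-empty columns, fill or drop missing values, encode text.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "survived": [0, 1, 1, 0],
    "sex":      ["male", "female", "female", "male"],
    "age":      [22.0, None, 26.0, 35.0],
    "embarked": ["S", "C", None, "S"],
    "deck":     [None, "C", None, None],   # mostly missing -> drop the column
})

df = df.drop(columns="deck")                    # ~77% missing in the full data
df["age"] = df["age"].fillna(df["age"].median())
df = df.dropna(subset=["embarked"])             # drop rows still missing values

# Encode the remaining non-numeric columns as integers.
for col in ["sex", "embarked"]:
    df[col] = LabelEncoder().fit_transform(df[col])

X = df.drop(columns="survived")                 # independent 'X'
y = df["survived"]                              # dependent 'Y'
```

After this, `X` and `y` can be passed to `train_test_split` for the 80/20 split described above.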

3.2 Machine Learning Models


To verify and predict survival, various machine learning models are used.
3.2.1 Logistic Regression
Logistic regression is the approach that works best where the dependent variable is binary or
categorical. It is used to describe the data and to clarify the relationship between a binary
dependent variable and one or several independent ratio-level, ordinal, nominal and interval
variables. This form of model helps to resolve binary classification problems; examples are
spam detection, health prediction and marketing prediction.
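A minimal illustration of the idea, on hypothetical Titanic-style data (Pclass plus an encoded sex flag), might look as follows; this is a sketch, not the report's actual code.

```python
# Sketch: logistic regression for a binary outcome (survived / did not).
from sklearn.linear_model import LogisticRegression

# Each row: [Pclass, Sex (0 = male, 1 = female)]; labels are illustrative.
X = [[3, 0], [1, 1], [3, 1], [2, 0], [1, 1], [3, 0], [2, 1], [3, 0]]
y = [0, 1, 1, 0, 1, 0, 1, 0]

clf = LogisticRegression().fit(X, y)
# predict_proba returns the estimated probability of each class.
print(clf.predict_proba([[1, 1]]))   # a first-class female passenger
```

The fitted model outputs a probability, which is thresholded at 0.5 by `predict` to give the binary class.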

9
3.2.2 Decision Tree
The decision tree is an example of a supervised learning algorithm, usually used for
classification-based challenges. It supports both continuous and categorical input and output
variables. Each internal node represents a split point on a single input variable (x), together
with the variable itself, and the dependent variable (y) is found at the leaf nodes. Consider
this: assume that the task of identifying a person's gender requires two independent input
variables (x), height in centimetres and weight in kilogrammes. (This is a hypothetical
example used merely for illustration.)
3.2.3 Random Forest Classifier
This algorithm is specifically used for supervised classification. The classifier generates a
forest with a large number of trees; the more trees in the forest, the more reliable the
expected outcomes become. The Random Forest algorithm can be used for both classification
and regression problems. For example, to construct each tree it may take 5 randomly chosen
input variables from a random sample of 100 observations; after repeating this many times,
a final decision is generated by aggregating the trees.
‘1’ means that a passenger survived while ‘0’ indicates a passenger did not. It is now possible
to visually inspect how well the chosen model did on the testing data by printing the Random
Forest Classifier model's prediction for each passenger. The accuracy of this model on the
test data came out to be 80.4%.
Lastly, after creating and analysing the data and deciding upon the desired model, it was
tested whether I would make it onto the list of survivors. A variable called my_survival is
created with the following inputs:
• Travelling as a steerage passenger due to a less expensive ticket price, so pclass = 3.
• Female, hence sex = 1.
• Above the age of 18, hence age = 21.
• Boarding the Titanic with my sibling, so sibsp = 1.
• Parch = 0.
• Minimum fare possible, fare = 0.
• Embarkation port of Cherbourg, embarked = 0.
The features must be placed in a 2D array for the model's prediction method to accept them.
After putting them in an array, the model predicted that I would not have survived the Titanic
had I been onboard.
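The my_survival prediction above can be sketched as follows. The report's trained forest and encodings are not available here, so a small stand-in forest is fitted on a few illustrative rows with the assumed feature order [pclass, sex, age, sibsp, parch, fare, embarked].

```python
# Sketch: predicting a single hypothetical passenger with a random forest.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Stand-in training data in the assumed feature order.
X_train = np.array([
    [3, 0, 22, 1, 0,  7.25, 0],
    [1, 1, 38, 1, 0, 71.28, 1],
    [3, 1, 26, 0, 0,  7.92, 0],
    [1, 1, 35, 1, 0, 53.10, 0],
    [3, 0, 35, 0, 0,  8.05, 0],
    [2, 0, 54, 0, 0, 26.00, 0],
])
y_train = np.array([0, 1, 1, 1, 0, 0])

forest = RandomForestClassifier(n_estimators=10, random_state=1)
forest.fit(X_train, y_train)

# The hypothetical passenger from the text, passed as a 2D array (one row).
my_survival = [[3, 1, 21, 1, 0, 0, 0]]
print(forest.predict(my_survival))   # 1 = survived, 0 = did not
```

Note that `predict` requires a 2D array even for a single passenger, which is why the feature list is wrapped in an outer list.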

10
3.2.4 Support Vector Machine
Support Vector Machine (SVM) is a supervised machine learning algorithm used to solve
both classification and regression problems. Classification is performed by constructing
hyperplanes in a multidimensional space that separate cases with different class labels. For
categorical variables, dummy variables are created with values of either 0 or 1, so a
categorical dependent variable with three levels, say (A, B, C), can be represented
by a set of three dummy variables: A: {1, 0, 0}; B: {0, 1, 0}; C: {0, 0, 1}
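Both ideas above, the dummy-variable encoding and the separating hyperplane, can be sketched briefly; the data values below are illustrative only.

```python
# Sketch: dummy variables for a three-level categorical, plus a linear SVM.
import pandas as pd
from sklearn.svm import SVC

# Three-level categorical variable encoded as 0/1 dummy variables.
levels = pd.Series(["A", "B", "C", "A"])
dummies = pd.get_dummies(levels, dtype=int)
print(dummies.loc[0].tolist())   # level A encodes as [1, 0, 0]

# A linear SVM separating two classes with a hyperplane.
X = [[0, 0], [1, 1], [0, 1], [1, 0], [2, 2], [2, 3]]
y = [0, 0, 0, 0, 1, 1]
svm = SVC(kernel="linear").fit(X, y)
print(svm.predict([[3, 3]]))
```

With `kernel="linear"` the decision boundary is a hyperplane in the input space; other kernels place the hyperplane in a higher-dimensional feature space.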

11
Chapter 4
RESULT AND DISCUSSION

12
• We created a model to determine the survivors of the Titanic incident by calculating
and analysing the data, and used it to predict the survivors from the given dataset.
• The data shown above is the filtered data of the survivors, obtained by sorting and
selecting from the main data.

21
Chapter 5
CONCLUSION
To further improve the overall result, extensive hyperparameter tuning should be carried
out on the machine learning models; ensemble learning can improve it further still. This
report began with data exploration, then moved on to checking for missing data and learning
which features are important. In the data preprocessing stage, missing values were imputed
and features were converted into numeric ones, after which a few more features were created.
Eight different machine learning models were trained, and cross-validation was applied to
the chosen Random Forest model. Finally, the confusion matrix was examined and evaluated,
along with the model's F-score, precision and recall. When attempting data analysis, data
cleaning is the first and foremost step. Exploratory data analytics allows one to understand
the dataset and the dependencies linking its characteristics; EDA is implemented, using
different graphical techniques, to find the connections between the features of the dataset,
and conclusions are drawn from the facts it uncovers.
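The evaluation steps named above, cross-validation on a Random Forest followed by the confusion matrix, precision, recall and F-score, can be sketched as follows on synthetic stand-in data (the report's own dataset and trained model are not reproduced here).

```python
# Sketch: 5-fold cross-validation, then confusion matrix and F-score.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # synthetic stand-in labels

forest = RandomForestClassifier(n_estimators=50, random_state=0)
scores = cross_val_score(forest, X, y, cv=5)   # 5-fold cross-validation
print(scores.mean())

forest.fit(X, y)
pred = forest.predict(X)
print(confusion_matrix(y, pred))
print(precision_score(y, pred), recall_score(y, pred), f1_score(y, pred))
```

In practice the confusion matrix and scores should be computed on held-out test data; the cross-validation mean gives the more honest estimate of generalisation here.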

Figure: Passenger Class Vs. Survival Rate


It has been observed that female survival rates are very high (approx. 74%) while male
survival rates are very low. Extracting titles from the name column also verifies this fact:
the survival rate for Mr is about 16%, while the survival rate for Mrs is 79%. To determine
each passenger's family size, the parch and sibsp columns were merged, and it was
discovered that the survival rate increases when the family size lies between 0 and 3,
whereas it tends to decrease once the family size becomes greater than 3. The exact
parameters needed when designing the prediction and training model are discovered through
feature engineering, in accordance with the exploratory data analytics. The survival of
passengers is then estimated by the machine learning models; for making predictions in a
classification problem, logistic regression is primarily used.

22
5.1 FUTURE SCOPE
This project involves the implementation of data analytics and machine learning, and can be
used as a reference for learning the implementation of EDA and machine learning from the
very basics. In future the idea can be extended with a more advanced graphical user interface
built with newer libraries such as shiny in R: an interactive page can be made in which, if the
value of an attribute is changed on a scale, the values in its corresponding graph (ggplot or
histogram) also change. More focused conclusions can also be drawn by combining the
results obtained.

23
REFERENCES
[1]. Singh, A., Saraswat, S., & Faujdar, N. (2017, May). Analysing Titanic disaster using
machine learning algorithms. In 2017 International Conference on Computing,
Communication and Automation (ICCCA) (pp. 406-411). IEEE.
[2]. Whitley, M. A. (2015). Using statistical learning to predict survival of passengers on the
RMS Titanic.
[3]. Lam, E., & Tang, C. (2012). CS229 Titanic–Machine Learning From Disaster.
[4]. Wang, D., Peleg, M., Tu, S.W., Boxwala, A.A., Greenes, R.A., Patel, V.L. and Shurtleff,
E.H., 2002. Representation primitives, process models and patient data in computer-
interpretable clinical practice guidelines: A literature review of guideline representation
models. International journal of medical informatics, 68(1-3), pp.59-70.
[5]. Kakde, Y. and Agrawal, S., 2018. Predicting survival on Titanic by applying exploratory
data analytics and machine learning techniques. Int J Comput Appl, pp.32-38.
[6]. Cicoria, S., Sherlock, J., Muniswamaiah, M. and Clarke, L., 2014, May. Classification of
titanic passenger data and chances of surviving the disaster. In Proceedings of Student-Faculty
Research Day, CSIS (pp. 1-6).
[7]. Chatterjee, T. (2017). Prediction of survivors in titanic dataset: a comparative study using
machine learning algorithms. Int J Emerg Res Manag Technol. Department of Management
Studies, NIT Trichy, Tiruchirappalli, Tamilnadu, India.
[8]. Mishra, V. P., Shukla, B., & Bansal, A. (2018). Intelligent Intrusion Detection System with
Innovative Data Cleaning Algorithm and Efficient Unique User Identification Algorithm. Jour
of Adv Research in Dynamical & Control Systems, 10(01), 398-412.
[9]. Mishra, V.P., Shukla, B. and Bansal, A., 2019. Analysis of alarms to prevent the
organizations network in real-time using process mining approach. Cluster Computing, 22(3),
pp.7023-7030.
[10]. Vyas, K., Zheng, Z., & Li, L. (2015). Titanic-Machine Learning From Disaster. Machine
Learning Final Project, UMass Lowell, 1-7.
[11]. Shmueli, G. and Koppius, O. R. (2011). Predictive Analytics in Information Systems
Research. MIS Quarterly, 35(3), pp. 553-572.
[12]. Niklas Donges, 2014. Predicting the Survival of Titanic Passengers.
[13]. Segal, M. R. (2004). Machine learning benchmarks and random forest regression.

24
