
Abstract

The RMS Titanic was a British passenger liner that sank in the North Atlantic Ocean. Of the
estimated 2,224 passengers and crew aboard the ship, more than 1,500 died in the accident.
The aim of this project is to build a machine learning model that predicts the type of people who
survived the Titanic shipwreck using passenger data (i.e. name, age, gender, socio-economic class,
etc.). The dataset was obtained from the Kaggle website. It consists of a training set of 891
passengers and a test set of 418 passengers, who are different from the passengers in the training
set. We have used the Random Forest algorithm for building the prediction model.

Keywords: Machine learning, Random Forest, Titanic Survival Prediction



Contents

Introduction
1. Machine Learning
2. Introduction to random forest algorithms
Theory
1. Random Forest Algorithm
2. Assumptions for Random Forest
3. Advantages of Random Forest
4. Working of Random Forest algorithm
Advantages and Disadvantages
1. Advantages of Random Forest
2. Disadvantages of Random Forest
Applications
Conclusions
References
Passenger Survival Prediction on 1

Introduction

The RMS Titanic was a British passenger liner that sank in the North Atlantic Ocean in the
early morning hours of 15 April 1912, after it collided with an iceberg during its maiden voyage from
Southampton to New York City. There were an estimated 2,224 passengers and crew aboard the ship,
and more than 1,500 died, making it one of the deadliest commercial peacetime maritime disasters in
modern history. The RMS Titanic was the largest ship afloat at the time it entered service and was the
second of three Olympic-class ocean liners operated by the White Star Line. The Titanic was built by
the Harland and Wolff shipyard in Belfast. Thomas Andrews, her architect, died in the disaster.
The training set has 891 examples, each with 11 features plus the target variable (Survived). Of
these 12 columns, 2 are floats, 5 are integers and 5 are objects.
In this project we build a machine learning model to predict the survival of the passengers of
the Titanic. We have used the random forest model for this purpose.
As in any other data science and machine learning project, we follow the usual process: load the
dataset, analyse it, and split it into training and test sets. We then train the model and, at the
end, evaluate it by checking its accuracy. The model predicts whether a passenger would survive on
the Titanic by finding relations among the various features.
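
The split and evaluation steps of this process can be sketched in a few lines. This is only a minimal illustration using the Python standard library; the helper names `train_test_split` and `accuracy` and the toy rows are assumptions made for this example and are not taken from the project code.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Toy (feature, label) rows; made-up values, not Titanic data.
rows = [(i, i % 2) for i in range(10)]
train, test = train_test_split(rows, test_fraction=0.2, seed=42)
```

The model is fitted on `train` only, and `accuracy` is then computed on the labels of `test`, which the model has not seen during training.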

1. Machine Learning
To understand machine learning we must first understand the basic concepts of Artificial Intelligence.
AI is defined as a program that exhibits cognitive ability similar to that of a human being. Alan Turing
described a test for it: if a machine is behind a curtain and a human interacting with it feels that
he or she is interacting with another human, then the machine is artificially intelligent.
AI exists as an umbrella term that is used to denote all computer programs that can think as
humans do. Any computer program that shows characteristics, such as self-improvement, learning
through inference, or even basic human tasks, such as image recognition and language processing, is
considered to be a form of AI.
The field of artificial intelligence includes the sub-fields of machine learning and deep
learning. Deep learning is a more specialized version of machine learning that uses more
complex methods for difficult problems. The process of self-learning by collecting new data on the
problem has allowed machine learning algorithms to take over the corporate space.
With machine learning algorithms, AI was able to develop beyond just performing the tasks it was
programmed to do. Before ML entered the mainstream, AI programs were only used to automate
low-level tasks in business and enterprise settings, such as intelligent automation or simple
rule-based classification. This meant that AI algorithms were restricted to the domain they were
programmed for. However, with machine learning, computers were able to move past doing what they
were programmed for and began improving with each iteration.

2. Introduction to random forest algorithms


Random forests, or random decision forests, are an ensemble learning method for classification,
regression and other tasks that operates by constructing a multitude of decision trees at training time.
For classification tasks, the output of the random forest is the class selected by most trees. For
regression tasks, the mean prediction of the individual trees is returned. Random decision forests
correct for decision trees' habit of overfitting to their training set. Random forests generally
outperform single decision trees, but their accuracy is often lower than that of gradient boosted
trees. However, data characteristics can affect their performance.
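
The two aggregation rules described above can be illustrated as follows. This is a small sketch, not the project's code: the function names `vote` and `average` and the per-tree outputs are hypothetical values chosen for the example.

```python
from collections import Counter
from statistics import mean

def vote(class_predictions):
    """Classification: return the class predicted by the most trees."""
    return Counter(class_predictions).most_common(1)[0][0]

def average(value_predictions):
    """Regression: return the mean of the individual tree predictions."""
    return mean(value_predictions)

# Hypothetical outputs of individual trees for one passenger / one sample.
tree_classes = [1, 0, 1, 1, 0]      # e.g. 1 = survived, 0 = did not survive
tree_values = [20.0, 22.0, 27.0]    # e.g. three regression trees

forest_class = vote(tree_classes)   # 1, since three of five trees say 1
forest_value = average(tree_values) # 23.0
```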
The first algorithm for random decision forests was created in 1995 by Tin Kam Ho using the
random subspace method, which, in Ho's formulation, is a way to implement the "stochastic
discrimination" approach to classification proposed by Eugene Kleinberg.
An extension of the algorithm was developed by Leo Breiman and Adele Cutler, who registered
"Random Forests" as a trademark in 2006. The extension combines Breiman's "bagging" idea and
random selection of features, introduced first by Ho and later independently by Amit and Geman in
order to construct a collection of decision trees with controlled variance.
The proper introduction of random forests was made in a paper by Leo Breiman. This paper
describes a method of building a forest of uncorrelated trees using a CART-like procedure, combined
with randomized node optimization and bagging. In addition, this paper combines several ingredients,
some previously known and some novel, which form the basis of the modern practice of random
forests, in particular:
1. Using out-of-bag error as an estimate of the generalization error.
2. Measuring variable importance through permutation.
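
The idea behind the out-of-bag estimate can be sketched as follows: each tree is trained on a bootstrap sample drawn with replacement, so some training rows are never drawn for that tree and can serve as its private validation set. The snippet covers only this sampling step, not the error computation or permutation importance, and the helper names are assumptions made for illustration.

```python
import random

def bootstrap_indices(n, rng):
    """Draw n row indices with replacement (the 'bag' for one tree)."""
    return [rng.randrange(n) for _ in range(n)]

def out_of_bag(n, in_bag):
    """Indices that were never drawn, i.e. unseen by this tree."""
    drawn = set(in_bag)
    return [i for i in range(n) if i not in drawn]

rng = random.Random(0)
n = 891                      # size of the Titanic training set
in_bag = bootstrap_indices(n, rng)
oob = out_of_bag(n, in_bag)
oob_fraction = len(oob) / n  # roughly 1/e (about 0.37) for large n
```

Evaluating every tree only on its own out-of-bag rows and averaging the errors gives an estimate of the generalization error without a separate validation set.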
Random forests are frequently used as "blackbox" models in businesses, as they generate
reasonable predictions across a wide range of data while requiring little configuration.

Dept. of Computer Engineering., PVPIT,



Theory

1. Random Forest Algorithm


Random Forest is a popular supervised machine learning algorithm. It can be used for both
classification and regression problems. It is based on the concept of ensemble learning, which is
the process of combining multiple classifiers to solve a complex problem and to improve the
performance of the model.
As the name suggests, "Random Forest is a classifier that contains a number of decision trees on
various subsets of the given dataset and takes the average to improve the predictive accuracy of that
dataset." Instead of relying on one decision tree, the random forest takes the prediction from each
tree and predicts the final output based on the majority vote of those predictions.
A greater number of trees in the forest generally leads to higher accuracy and reduces the risk of
overfitting.
The diagram below explains the working of the Random Forest algorithm:

2. Assumptions for Random Forest


Since the random forest combines multiple trees to predict the class of the dataset, some decision
trees may predict the correct output while others may not. Taken together, however, the trees are
expected to predict the correct output. This rests on two assumptions for a better Random Forest
classifier:

• The feature variables of the dataset should contain some actual values, so that the classifier
can predict accurate results rather than guessed results.
• The predictions from each tree must have very low correlations.
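
One way to see why low correlation between trees matters is the standard variance identity for an average of identically distributed predictors. The notation is introduced here only for illustration: T_i(x) is the prediction of tree i, sigma^2 is the variance of a single tree's prediction, rho is the pairwise correlation between trees, and N is the number of trees.

```latex
\[
\operatorname{Var}\!\left(\frac{1}{N}\sum_{i=1}^{N} T_i(x)\right)
  = \rho\,\sigma^{2} + \frac{1-\rho}{N}\,\sigma^{2}
\]
```

Adding trees shrinks only the second term, so the variance of the forest cannot fall below rho * sigma^2. The less correlated the trees are, the more the averaging helps.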


3. Advantages of Random Forest


Below are some points that explain why we should use the Random Forest algorithm:

• It takes less training time than many other algorithms.
• It predicts output with high accuracy, and it runs efficiently even on large datasets.
• It can also maintain accuracy when a large proportion of data is missing.

4. Working of Random Forest algorithm


Random Forest works in two phases: the first creates the random forest by combining N decision
trees, and the second makes predictions with each tree created in the first phase.
The working process can be explained in the steps and diagram below:

• Step-1: Select K random data points from the training set.
• Step-2: Build the decision tree associated with the selected data points (subset).
• Step-3: Choose the number N of decision trees that you want to build.
• Step-4: Repeat Steps 1 and 2 until N trees have been built.
• Step-5: For new data points, find the prediction of each decision tree, and assign the new
data points to the category that wins the majority vote.
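
The steps above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the model used in this project: each "tree" is a one-split decision stump on one randomly chosen feature, standing in for a full decision tree, and all function names and the toy data are hypothetical.

```python
import random
from collections import Counter

def majority(labels):
    """Most common label in an iterable of labels."""
    return Counter(labels).most_common(1)[0][0]

def fit_stump(X, y, feature):
    """Find the single threshold on one feature with the fewest errors."""
    best = None
    for threshold in sorted({row[feature] for row in X}):
        left = [lab for row, lab in zip(X, y) if row[feature] <= threshold]
        right = [lab for row, lab in zip(X, y) if row[feature] > threshold]
        if not left or not right:
            continue
        left_label, right_label = majority(left), majority(right)
        errors = (sum(lab != left_label for lab in left)
                  + sum(lab != right_label for lab in right))
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    if best is None:  # the feature is constant in this sample
        label = majority(y)
        return (feature, float("inf"), label, label)
    return (feature, best[1], best[2], best[3])

def predict_stump(stump, row):
    feature, threshold, left_label, right_label = stump
    return left_label if row[feature] <= threshold else right_label

def fit_forest(X, y, n_trees, k, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):                             # Steps 3 and 4
        idx = [rng.randrange(len(X)) for _ in range(k)]  # Step 1
        sample_X = [X[i] for i in idx]
        sample_y = [y[i] for i in idx]
        feature = rng.randrange(len(X[0]))               # random feature
        forest.append(fit_stump(sample_X, sample_y, feature))  # Step 2
    return forest

def predict_forest(forest, row):
    """Step 5: every tree votes and the majority class wins."""
    return majority(predict_stump(stump, row) for stump in forest)

# Toy data with two numeric features; made-up values, not Titanic data.
X = [[1, 10], [2, 11], [3, 12], [8, 20], [9, 21], [10, 22]]
y = [0, 0, 0, 1, 1, 1]
forest = fit_forest(X, y, n_trees=25, k=len(X), seed=1)
```

An odd number of trees is used so that a two-class vote cannot end in a tie.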


Implementation

1. Aim:
To build a machine learning model that predicts the type of people who survived the Titanic
shipwreck using passenger data such as name, age and gender.

2. Dataset Description:
The dataset has been taken from the Kaggle website under the name “Titanic: Machine Learning
from Disaster”. The Titanic dataset consists of a training set of 891 passengers and a test set
of 418 passengers, who are different from the passengers in the training set.
A description of the features is given in Table I.

The features PassengerId, Survived, Pclass, Age, SibSp, Parch and Fare take numeric values,
while Name, Sex and Embarked take nominal values; Ticket and Cabin can take both numeric and
nominal values.
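
A short sketch of how these column types might be applied when reading the CSV file. It assumes the column order of the Kaggle training file; the sample row is invented for illustration and is not a real passenger record, and `parse_row` is a hypothetical helper.

```python
import csv
import io

# Numeric columns as described above; Age and Fare are the two float columns.
INT_COLUMNS = ("PassengerId", "Survived", "Pclass", "SibSp", "Parch")
FLOAT_COLUMNS = ("Age", "Fare")

def parse_row(raw):
    """Convert numeric columns; leave nominal and mixed columns as strings."""
    row = dict(raw)
    for name in INT_COLUMNS:
        row[name] = int(raw[name])
    for name in FLOAT_COLUMNS:
        # Empty cells (missing values) become None instead of raising.
        row[name] = float(raw[name]) if raw[name] else None
    return row

# Header assumed to match the Kaggle file; the data row is made up.
SAMPLE = (
    "PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n"
    '1001,0,3,"Doe, Mr. John",male,,1,0,X 12345,7.25,,S\n'
)
rows = [parse_row(r) for r in csv.DictReader(io.StringIO(SAMPLE))]
```

Ticket and Cabin stay as strings because they mix digits and letters, and missing values are kept explicit so that they can be handled before training.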


Advantages and Disadvantages

1. Advantages of Random Forest


• It can perform both regression and classification tasks.
• A random forest produces good predictions that can be understood easily.
• It can handle large datasets efficiently.
• The random forest algorithm provides a higher level of accuracy in predicting outcomes than
the decision tree algorithm.

2. Disadvantages of Random Forest


The main limitation of random forest is that a large number of trees can make the algorithm too
slow and ineffective for real-time predictions. In general, these algorithms are fast to train, but quite
slow to create predictions once they are trained. A more accurate prediction requires more trees,
which results in a slower model. In most real-world applications, the random forest algorithm is fast
enough but there can certainly be situations where run-time performance is important and other
approaches would be preferred.


Applications

Random Forest is mostly used in the following sectors:

• Banking:
The banking sector mostly uses this algorithm to identify loan risk. Random forest is
used in banking to predict the creditworthiness of a loan applicant. This helps the
lending institution make a good decision on whether to give the customer the loan or not.
Banks also use the random forest algorithm to detect fraudsters.
• Medicine:
With the help of this algorithm, disease trends and risks of the disease can be
identified.
• Marketing:
Marketing trends can be identified using this algorithm.
• Health care:
Health professionals use random forest systems to diagnose patients. Patients are
diagnosed by assessing their previous medical history. Past medical records are reviewed to
establish the right dosage for the patients.
• Stock market:
Financial analysts use it to identify potential markets for stocks. It also enables them
to identify the behavior of stocks.
• E-commerce:
Through random forest algorithms, e-commerce vendors can predict the preference of
customers based on past consumption behavior.


Conclusion

In this project, we have built a machine learning model to predict the type of people who
survived the Titanic shipwreck using passenger data. The passenger data was taken from the Kaggle
website. We have used the Random Forest Classifier algorithm for building the model.



References

[1] https://www.kaggle.com/competitions/titanic/overview

[2] https://www.javatpoint.com/machine-learning-random-forest-algorithm

[3] https://towardsdatascience.com/predicting-the-survival-of-titanic-passengers-30870ccc7e8

[4] https://en.wikipedia.org/wiki/Random_forest

[5] Tabbakh, A., Rout, J. K., & Rout, M. (2021). Analysis and Prediction of the Survival of
Titanic Passengers Using Machine Learning. In Advances in Distributed Computing and
Machine Learning (pp. 297-304). Springer, Singapore.
[6] Ekinci, E., Omurca, S. İ., & Acun, N. (2018). A comparative study on machine learning
techniques using Titanic dataset. In 7th international conference on advanced technologies
(pp. 411-416).
