
Predicting the Survival of Passengers on the Titanic

Christine Chen (Chen Jia Bei)


L6-EVK, Essay, Miss Henfrey
Subject: Maths + Computer Science
Word count: 2095 words

Abstract
In this paper, data sets from Kaggle are used to predict the survival of
passengers aboard the RMS Titanic. Decision-tree-based computational models
are built to analyse the different factors affecting survival, including sex, age
and socio-economic status. The survival rate for each factor is compared, for
example male against female for the factor sex. The factors that emerged as
most important were sex and socio-economic status. The highest survival rates
occurred among female passengers and those of higher socio-economic status,
largely because women were asked to board the lifeboats first; as a result, the
survival rate of women was much higher than that of men. Research on the
Titanic remains valuable because it gives engineers a better understanding of
ship design and of how to deal with unexpected accidents.

1 Introduction
The RMS Titanic was an ocean liner operated by the White Star Line, described by people at the
time as "absolutely unsinkable". However, on the 15th of April 1912, after striking an iceberg on
the way from Southampton to New York, it sank in the North Atlantic Ocean. Of the estimated 2224
people aboard, around 1500 died. In this paper a data set of 1309 passengers is used, of which the
survival of 891 passengers is known; this known subset is used to predict the survival of the rest.
The main purpose of the paper is to review papers produced by other researchers, apply the same
methodologies, and compare the conclusions with theirs.

2 Literature Review
Dasgupta, Mishra, Jha, Singh and Shukla [1] used several different methods to predict survival
from factors such as sex, age and socio-economic status. Methods such as logistic regression,
decision trees and a random forest classifier were used.

Before picking a model, a process flow chart is used to determine the most appropriate model for
the research:

Fig. 1 Flowchart to determine a suitable model [1]

Figure 1 shows the process of choosing the most appropriate method for the prediction. It starts
with the problem itself and then develops a "fitting model" that matches the initial problem. The
validity of the assumptions is then checked; if the assumptions are not satisfied, the previous
steps have to be repeated. Once the assumptions are acceptable, the fitted model is evaluated,
and if it passes this test it is a suitable model to use.

They decided to use feature engineering, which is the process of selecting and manipulating raw
data to transform it into features used during training and prediction. A machine learning model
is then developed to predict the survival of the remaining passengers. Secondly, machine learning
models are used to verify and actually make the prediction; methods such as logistic regression,
decision trees and a random forest classifier are applied. Logistic regression analyses data by
modelling the relationship between the different factors and the outcome. It starts by assuming
that the log-odds, log(p(x)/(1 − p(x))), is a linear function of x, so that the probability p(x)
can only vary between 0 and 1. After solving the equation for p(x), probabilities can be predicted
by fitting the model to the data set and setting a threshold: for a given data point x the model
then outputs a predicted class y. A decision tree is a supervised algorithm for making predictions.
This links to the random forest classifier, which combines the outputs from many decision trees
and produces a single output. In that paper a high number of trees is used, which makes the result
more reliable, as increasing the number of trees in the forest tends to increase the accuracy of
the output.
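
Since the reviewed papers do not publish their code, the following is only a minimal sketch of the
three model families named above, written with scikit-learn and invented feature values (sex
encoded 0/1, age, class); it is an illustration, not the authors' actual implementation.

# Hedged sketch: logistic regression, a single decision tree and a random
# forest on made-up Titanic-style features; data and encoding are assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Hypothetical training data: [sex, age, class] per passenger, 1 = survived.
X = np.array([[0, 22, 3], [1, 38, 1], [1, 26, 3], [0, 35, 3], [1, 4, 2]])
y = np.array([0, 1, 1, 0, 1])

# Logistic regression: models the log-odds of survival as a linear function
# of the features, so the predicted probability p(x) stays between 0 and 1.
logreg = LogisticRegression().fit(X, y)
probs = logreg.predict_proba(X)[:, 1]      # p(survived | x)
pred_class = (probs >= 0.5).astype(int)    # thresholding gives the class y

# Decision tree: a single set of human-readable split rules.
tree = DecisionTreeClassifier(max_depth=3).fit(X, y)

# Random forest: combines the votes of many decision trees into one output;
# more trees generally gives a more stable prediction.
forest = RandomForestClassifier(n_estimators=100).fit(X, y)
print(pred_class, tree.predict(X), forest.predict(X))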

Yogesh Kade [2] similarly used a process flow chart to determine the most efficient machine
learning model. The methods used are exactly the same, which further supports the view that
logistic regression, decision trees and random forests are suitable models for this research.
However, Kade's paper is more precise in describing the exact use of random forests and decision
trees.

Fig. 2 Decision tree model example [2]

As shown in figure 2 above, they considered even the minor factors that could easily be ignored,
such as height and weight. This makes the final outcome of the random forest more precise and
accurate, because the model is built in more detail and is better trained. In this research paper
they also analysed the accuracy of the results obtained from the different methods.

Fig. 3 General confusion matrix [2]

Figure 3 shows the general confusion matrix layout used for each of the methods; from it an
accuracy score is calculated for each model. The best possible value of this score is 1.0 and the
worst is 0.0.

By comparison, the highest accuracy was achieved by the support vector machine method (see the
confusion matrix for SVM in [2]).
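
As a minimal sketch of how such a score in the range 0.0 to 1.0 can be read off a confusion
matrix, the following uses invented counts (chosen to sum to the 891 passengers of the training
set); it is not taken from either reviewed paper.

# Hedged sketch: accuracy computed from a 2x2 confusion matrix; counts invented.
import numpy as np

# Rows = actual class (died, survived), columns = predicted class.
confusion = np.array([[480,  69],
                      [ 97, 245]])

tn, fp = confusion[0]
fn, tp = confusion[1]
accuracy = (tp + tn) / confusion.sum()   # best possible value 1.0, worst 0.0
print(f"accuracy = {accuracy:.3f}")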

2.2 Data
The data set contains multiple different factors, including: passenger ID, survival state,
socio-economic class, name, sex, age and the place where the passenger embarked.

Table 1. Data set description

Variable     Values
Survival     0 (No), 1 (Yes)
Class        1 (Highest), 2, 3 (Lowest)
Sex          M (Male), F (Female)
Embarked     C (Cherbourg), Q (Queenstown), S (Southampton)

As shown in table 1, these are the encodings used for the categorical variables in the data set.

The source data comes from Kaggle [3]. In the data table, the passenger ID refers to the ID of
each passenger on the Titanic. Survival refers to whether the passenger survived the accident,
where 0 indicates no and 1 indicates yes. Class refers to the socio-economic class, dividing the
passengers into different levels of wealth and status. Name contains the full name of the
passenger, and sex and age are the basic information given for the passenger. Embarked is the
port where the passenger boarded the Titanic.
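
A minimal sketch of inspecting these columns is given below, assuming the Kaggle training file has
been saved locally as train.csv; the file name is an assumption, the column names are those used
by the Kaggle data set.

# Hedged sketch: load the training split and check the columns described above.
import pandas as pd

train_df = pd.read_csv("train.csv")
print(train_df.shape)                               # 891 passengers with known survival
print(train_df[["PassengerId", "Survived", "Pclass",
                "Name", "Sex", "Age", "Embarked"]].head())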

3 Methodology
This paper uses TensorFlow Decision Forests. Decision forests are machine learning algorithms used
for regression, classification and similar tasks [4]. A random forest essentially combines the
predictions of many individual trees, which depend on the given data set and the different
variables [5]. This is used to build the model for prediction.

Before the actual code, dependencies such as the different libraries need to be imported, e.g.
pandas, TensorFlow and TensorFlow Decision Forests. The TensorFlow Decision Forests library can be
used for training the models, evaluating the outcomes and interpreting them. The first step is
preparing the data sets: names need to be tokenised, which produces a new table of data that is
used further on in the code. Secondly, the pandas data set needs to be converted into a TensorFlow
dataset using the imported dependencies, and the model is then trained.

First a Gradient Boosted Trees (GBT) model is trained with the default parameters, and then a more
specific GBT model with improved parameters, including information about the importance of each
variable. The basic logic behind a gradient boosting model is that it combines multiple predictions
coming from the decision trees and then generates a final prediction. In gradient boosting models
the model learns first before making estimations and predictions, which provides "a more accurate
estimate of the response variable" [6]. The advantage of this method is its high flexibility,
allowing it to be used for almost all tasks requiring data set understanding and interpretation.

After the model is built, predictions can be made by adding a subprogram to the code; data sets can
then be loaded into the model for prediction, and all the outputs are printed into a new file.
Finally, an ensemble of 100 models is created by training them with different seeds in TensorFlow
Decision Forests and combining their results; for the GBT models this uses the "honest" parameter.
This method is very effective in that decision tree algorithms can "provide human-readable rules of
classification" [7], which is very helpful when it comes to reviewing and understanding the method.
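
The following is a hedged sketch of the pipeline described above, based on the public TensorFlow
Decision Forests API [4]; the file names, the selected columns and the simplified name
"tokenisation" are assumptions made for illustration, not the exact code used for this paper.

# Hedged sketch: train a Gradient Boosted Trees model with default parameters.
import pandas as pd
import tensorflow_decision_forests as tfdf

# Load the Kaggle files (assumed to be saved locally as train.csv / test.csv).
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

# Keep the factors discussed in this paper; normalise the Name column into
# space-separated tokens and fill missing strings so the conversion succeeds.
def prepare(df, columns):
    df = df[columns].copy()
    df["Name"] = df["Name"].fillna("").apply(lambda n: " ".join(n.split()))
    df["Embarked"] = df["Embarked"].fillna("")
    return df

features = ["Pclass", "Name", "Sex", "Age", "Embarked"]
train_prep = prepare(train_df, features + ["Survived"])
test_prep = prepare(test_df, features)

# Convert the pandas frames into TensorFlow datasets.
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_prep, label="Survived")
test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(test_prep)

# Train a GBT model with the default parameters.
model = tfdf.keras.GradientBoostedTreesModel(verbose=0)
model.fit(train_ds)

# Predict survival for the remaining passengers and print the outputs to a file.
probabilities = model.predict(test_ds)          # predicted probability of survival
output = pd.DataFrame({
    "PassengerId": test_df["PassengerId"],
    "Survived": (probabilities[:, 0] >= 0.5).astype(int),
})
output.to_csv("submission.csv", index=False)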

4 Results

Table 2. Results sorted by sex

Sex       Total    Deaths    Death rate
Male      577      468       81.11%
Female    314      81        25.80%

Sex is one of the main factors that affected the survival of the passengers on the Titanic.
However, to a certain extent this paper has excluded the possibility of luck being involved. Even
though the crew tried to divide the passengers up by sex, age and class, survival still depended on
where a passenger ended up on the Titanic, which lifeboat they reached and when. There are many
uncertainties and other factors affecting our prediction results. Different officers on different
parts of the Titanic also announced different things: Officer Lightoller [8] decided on "women and
children only" for the lifeboats, whereas Officer Murdoch decided on "women and children first" but
still allowed men to board afterwards. Overall, as shown in table 2, the death rate of men is still
significantly higher than that of women, 81.11% against 25.80%, which matches what other papers and
researchers have predicted, and the actual death rate. Most of the men were not able to board a
lifeboat and were forced to stay in the cold ocean water, which is potentially the main cause of
death: human bodies cannot withstand the extremely low temperature of the ocean water, and the
passengers were also not provided with food or shelter.
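
As a hedged sketch of how group totals and death rates like those in tables 2 to 5 can be
reproduced from the training data with pandas, the following groups the known 891 passengers by
each factor; the age bins are an assumption made for illustration.

# Hedged sketch: totals, deaths and death rates per group from train.csv.
import pandas as pd

train_df = pd.read_csv("train.csv")
train_df["AgeGroup"] = pd.cut(train_df["Age"], bins=[0, 20, 40, 60, 120],
                              labels=["0-20", "20-40", "40-60", "60+"])

for factor in ["Sex", "Embarked", "Pclass", "AgeGroup"]:
    group = train_df.groupby(factor, observed=True)["Survived"]
    summary = pd.DataFrame({
        "Total": group.size(),
        "Deaths": group.size() - group.sum(),   # Survived is 1 for survivors
    })
    summary["Death rate %"] = (100 * summary["Deaths"] / summary["Total"]).round(2)
    print(summary, "\n")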

Table 3. Results sorted by place of embarkation

Embarked    Total    Deaths    Death rate
C           168      75        44.64%
S           644      427       66.30%
Q           77       47        61.04%

The place of embarkation turned out to be the least relevant factor in determining a passenger's
survival, as expected. Southampton had the largest number of passengers embark there, so many of
those who died had embarked at Southampton; it also shows the highest death rate (66.30%) compared
to the other ports, as shown in table 3. However, the place of embarkation could be closely linked
to the kind of people who boarded there: for example, people who live in Southampton have a lower
life expectancy in general (82.5 for women, 78 for men) than those from Queenstown (87 for women,
84.5 for men). Life expectancy is related to the general health and living conditions of a region
and therefore affects the likelihood of a passenger surviving.

Table 4. Results sorted by socio-economic class

Class    Total    Deaths    Death rate
1        216      80        37.04%
2        184      97        52.71%
3        491      372       75.76%

Now looking at the socio-economic status of each passenger in table 4, this could potentially have
been an important factor when deciding who boarded the lifeboats first. Well-respected passengers
of high social status appear to have been considered first, as can be seen from the 37% death rate
in class 1, compared with a death rate of up to 76% for people in class 3. For example, the
well-known American socialite Margaret Brown survived because she was asked to board a lifeboat
early, and J. Bruce Ismay, the managing director of the White Star Line, also survived, which
further suggests that more respected people were asked to board the lifeboats first. [9]

Table 5. Results sorted by age

Age      Total    Deaths    Death rate
0-20     164      85        51.83%
20-40    372      225       60.48%
40-60    124      76        61.29%
60+      22       17        77.27%

From table 5 it is easy to observe that the death rate was indeed highest for passengers above 60,
owing to their weaker immune systems and poorer health, which made them more vulnerable to the
extreme conditions in the middle of the North Atlantic Ocean and less able to withstand the cold
and hunger.

After training the GBT model, the importance of the different factors is also analysed by the model
itself. The importance of a variable is measured in two ways: first, how much the accuracy of the
predictions decreases when that variable is excluded from the data; and second, the decrease in
Gini impurity when the variable is chosen to split a node [10].
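
Continuing from the training sketch in section 3, the following hedged sketch shows how variable
importances can be read from a TensorFlow Decision Forests model through its inspector; which
importance measures are reported depends on the model type, so this only illustrates the idea.

# Hedged sketch: list the variable importances of the trained GBT model.
inspector = model.make_inspector()        # "model" is the GBT model trained earlier
for name, importances in inspector.variable_importances().items():
    print(name)                           # the name of the importance measure
    for feature, score in importances:
        print("   ", feature, score)      # feature column and its importance score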

Table 6. Variable importance score

Variable                 Importance score
Sex                      460.498
Age                      355.963
Socio-economic class     28.132
Embarked                 8.156

As shown in table 6, in line with the results obtained earlier, the importance of the factors in
order is: sex, age, socio-economic class and place of embarkation. This suggests that the
predictions made by the GBT model are accurate to a certain extent and that the model built was
successful in making the predictions.

5 Conclusions

After producing the final model and analysing the results generated, we can understand more about
the Titanic disaster and how the data set can be used to help prevent further accidents like it;
it is important to learn from an accident and take those lessons into account in the future.
Passengers who were female and in first class were much more likely to survive. With the factor
sex having an importance score of 460.498, the decision of who was able to board the lifeboats was
hugely dependent on the sex of the passenger. This research also has some limitations: the data
set is not complete, and it is nearly impossible to obtain a full set of data from the time of the
Titanic. There is also room for improvement; for example, one could increase the number of trees
in the random forest in order to increase the accuracy and reliability of the final prediction.

References

[1] Anasuya Dasgupta, Ved Prakash Mishra, Sanjiv Jha, Bhopendra Singh, Vinod Kumar Shukla,
Predicting the Likelihood of Survival of Titanic's Passengers by Machine Learning, 2021.
[2] Yogesh Kade, Predicting Survival on Titanic by Applying Exploratory Data Analytics and Machine
Learning Techniques, 2018.
[3] Kaggle, Titanic – Machine Learning from Disaster, 2012, accessed 14/04/2024.
[4] TensorFlow, Build, train and evaluate models with TensorFlow Decision Forests, 2024, accessed
14/04/2024.
[5] Leo Breiman, Random Forests, Machine Learning 45, 5-32, 2001.
[6] Alexey Natekin, Alois Knoll, Gradient boosting machines, a tutorial, 2013.
[7] Jehad Ali, Rehanullah Khan, Nasir Ahmad, Imran Maqsood, Random Forests and Decision Trees,
International Journal of Computer Science Issues (IJCSI), 2012.
[8] Reddit, Why did more 2nd class male passengers die than 3rd class male passengers on the
Titanic?, 2023, accessed 14/04/2024.
[9] Wikipedia, J. Bruce Ismay, 2024, accessed 14/04/2024.
[10] Jake Hoare, How is variable importance calculated for a random forest,
https://www.displayr.com/how-is-variable-importance-calculated-for-a-random-forest/
