
A GAZE INTO THE TITANIC’S TRAGEDY

By Miguel García de Julián and Martina Lapera Sancho


1. INTRODUCTION:
Initially, the database provided was thoroughly examined. Some passengers’ records appeared to be
repeated, so a new dataset without the repeated rows was obtained. It contained 657 rows instead of
668 (11 rows were repeats).
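The deduplication step can be sketched as follows. This is a minimal Python illustration, not the authors' code; the records and the `drop_duplicate_rows` helper are hypothetical, and a row counts as repeated only when every field matches an earlier row.

```python
def drop_duplicate_rows(rows):
    """Return rows with exact repeats removed, keeping first occurrences in order."""
    seen = set()
    unique = []
    for row in rows:
        # A hashable fingerprint of the whole record.
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Hypothetical passenger records (not taken from the report's data).
passengers = [
    {"Name": "Passenger A", "Ticket": "T-001", "Fare": 7.25},
    {"Name": "Passenger B", "Ticket": "T-002", "Fare": 71.28},
    {"Name": "Passenger A", "Ticket": "T-001", "Fare": 7.25},  # exact repeat
]

clean = drop_duplicate_rows(passengers)
removed = len(passengers) - len(clean)
print(f"{len(clean)} unique rows, {removed} repeated")
```

Comparing the row counts before and after, as the report does (668 versus 657), is a quick check that only the expected number of rows was dropped.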
In order to extract the variables’ main characteristics and relationships, tables and prop.tables of the
categorical variables (Survived, Pclass, Sex, Embarked, Cabin and Ticket) were computed.
The following information was obtained from the clean dataset:
Most of the passengers died (61.34%), and most of the passengers were men (63.77%). Moreover,
the passengers were divided into three classes, of which the third was the cheapest and the most
common, and most passengers embarked in Southampton (71.38%).
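A proportion table of a categorical variable amounts to dividing each category count by the total. The sketch below shows the idea in Python on a hypothetical Sex column; `prop_table` is an illustrative helper, not a function used in the report.

```python
from collections import Counter


def prop_table(values):
    """Map each category to its share of the total, as a percentage."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: 100 * n / total for category, n in counts.items()}


# Hypothetical Sex column for eight passengers.
sex = ["male", "male", "female", "male", "female", "male", "male", "female"]

shares = prop_table(sex)
print(shares)
```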
The variables Ticket and Cabin were not taken into account in the analysis because they were
considered uninformative: 505 people were not assigned a cabin. Even so, the variable Ticket was key
for identifying the repeated rows in the initial dataset.
For the numerical variables, the summary function was used to obtain the mean, maximum and
minimum values, median and quartiles. Subsequently, the following plots were
computed:

In order to study the correlation between all the variables, a new dataset was created in which all
categorical variables were converted to numerical codes.

2. QUESTIONS:
2.1. What was the price difference between the classes?

As on any passenger liner, there were different classes in which a passenger could travel. In this
case there were three, and a high correlation between the variables Fare and Pclass was observed.
Subsets of each class were created so that the difference between them (in terms of fare) could be
established.
Class 1 is the most expensive, with a mean fare of 88.69, followed by class 2 with a mean
of 22.20. The cheapest is class 3, with a mean of 13.49.
The following plots show how the fare depends on the class.

[Plots: Class 1, Class 2, Class 3]

This last plot shows the Fare values for each class. Two outliers can be observed in the first class,
at a Fare of around 500.
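The per-class mean fares reported in this section come from grouping passengers by Pclass and averaging Fare within each group. A minimal Python sketch of that grouping, using hypothetical fares rather than the report's data and an illustrative `mean_by_group` helper:

```python
from collections import defaultdict
from statistics import mean


def mean_by_group(rows, group_key, value_key):
    """Average value_key within each distinct value of group_key."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    return {group: mean(values) for group, values in sorted(groups.items())}


# Hypothetical fares, two passengers per class.
rows = [
    {"Pclass": 1, "Fare": 80.0}, {"Pclass": 1, "Fare": 100.0},
    {"Pclass": 2, "Fare": 20.0}, {"Pclass": 2, "Fare": 26.0},
    {"Pclass": 3, "Fare": 8.0}, {"Pclass": 3, "Fare": 7.0},
]

fare_means = mean_by_group(rows, "Pclass", "Fare")
print(fare_means)
```

Because the mean is sensitive to extreme values such as the two first-class outliers, comparing it with the median of each group would show how much those outliers pull the class 1 figure upwards.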

2.2. What were the characteristics of the workers?

When studying the variable Fare, it was discovered that some passengers had not paid anything for
their passage on the Titanic. These were thought to be workers. A new dataset called ‘workers’ was
created containing the passengers whose Fare value was 0.
This dataset contained 7 people. They were between 28 and 49 years old, they were all men
and they all embarked in Southampton. Also, none of them brought any family members, and none of
them survived.
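Building the ‘workers’ subset is a filter on Fare followed by a few summaries. The Python sketch below uses hypothetical records; only the Fare = 0 rule is taken from the text.

```python
# Hypothetical passenger records; only the Fare == 0 rule comes from the text.
passengers = [
    {"Fare": 0.0, "Age": 36, "Sex": "male", "Embarked": "S", "Survived": 0},
    {"Fare": 7.9, "Age": 22, "Sex": "female", "Embarked": "Q", "Survived": 1},
    {"Fare": 0.0, "Age": 40, "Sex": "male", "Embarked": "S", "Survived": 0},
    {"Fare": 52.0, "Age": 58, "Sex": "male", "Embarked": "C", "Survived": 1},
]

# Keep only the passengers who paid nothing.
workers = [p for p in passengers if p["Fare"] == 0]

summary = {
    "count": len(workers),
    "age_range": (min(p["Age"] for p in workers), max(p["Age"] for p in workers)),
    "all_male": all(p["Sex"] == "male" for p in workers),
    "ports": sorted({p["Embarked"] for p in workers}),
    "survivors": sum(p["Survived"] for p in workers),
}
print(summary)
```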

2.3. Where did the passengers stay (Pclass) depending on where they embarked?

A high correlation was observed between the variables Pclass and Embarked, so the relationship
between them was studied.
Using the class subsets created previously, the class in which passengers stayed was evaluated
against the place where they embarked. Tables and prop.tables showed large differences between the
embarkation cities: in Cherbourg most people stayed in class 1, the most expensive
one, whilst in Southampton and Queenstown class 3 was the most common. These conclusions
are derived from the following geom_bar() plots, one per city and one for all passengers.
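The comparison between cities relies on a two-way table normalised by row, so that each port's class shares sum to 1. A minimal Python sketch with hypothetical (Embarked, Pclass) pairs; `row_proportions` is an illustrative helper, not the report's code.

```python
from collections import Counter


def row_proportions(pairs):
    """Two-way table of (row, column) pairs, normalised so each row sums to 1."""
    counts = Counter(pairs)
    row_totals = Counter(row for row, _ in pairs)
    return {(row, col): n / row_totals[row] for (row, col), n in counts.items()}


# Hypothetical (Embarked, Pclass) pairs: C = Cherbourg, S = Southampton, Q = Queenstown.
pairs = [
    ("C", 1), ("C", 1), ("C", 3),
    ("S", 3), ("S", 3), ("S", 2), ("S", 1),
    ("Q", 3),
]

table = row_proportions(pairs)
for (port, pclass), share in sorted(table.items()):
    print(port, pclass, round(share, 2))
```

Normalising by row rather than by the grand total matters here: Southampton has far more passengers than the other ports, so overall proportions would hide the class mix within each city.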
2.4. How much money did the passengers spend depending on where they embarked?

Using subsets of each city, a summary of their respective fares was made.

From this information, it is noticeable that those who embarked in Cherbourg spent more money
than those who embarked in Queenstown, who paid significantly less (since, as previously discovered,
most of the passengers who embarked in Queenstown were placed in the third class, which we suppose
was the cheapest). In order to analyze this information more clearly, a boxplot of this relationship was made.

2.5. Did class level have anything to do with surviving the Titanic?

The passengers’ class could have had something to do with the rate of survival. This hypothesis was
checked using the subsets of the different classes and prop.tables.
These are the general table and general prop.table of this relationship. They clearly show that the
better the class, the higher the percentage of people who survived. Next, this dependence is studied
separately for each class using a geom_bar().
2.6. Did passengers’ age condition their chances of survival?

The initial hypothesis was that the older the passengers were, the lower their survival rate (as
ageing implies a gradual physical decline). In order to test this, the mean of the passengers’ age
was obtained (29.17 years).
Then, the mean age of those who survived was obtained (28.048 years), as well as the mean age of
those who did not (29.88 years). The mean age of the surviving passengers
was lower than the overall mean age, whilst the mean age of those who died was higher.
Nonetheless, as these results were not conclusive, the survival percentages of passengers older and
younger than the mean age were obtained: 41.56% of the passengers older than the mean age survived,
versus 36.96% of the younger ones. These percentages
contradicted the initial hypothesis. As a result, a histogram was plotted in order to further analyze the
data:
This graph does not show any observable dependence pattern between the variables ‘Survived’ and
‘Age’. To check the lack of relationship between them, the percentages of survivors within four age
intervals were computed.
The intervals ranged from 0.42 years (minimum age) to 74 years (maximum age), each 18.395 years wide.
The percentages obtained were the following:

❖ Interval 1 (0.42 – 18.815 years): 50.44% survived.
❖ Interval 2 (18.815 – 37.21 years): 35.25% survived.
❖ Interval 3 (37.21 – 55.605 years): 40.87% survived.
❖ Interval 4 (55.605 – 74 years): 32.14% survived.
Consequently, the initial hypothesis was deemed incorrect. Finally, the correlation between these two
variables was computed (-0.067; negligible), and the dependence between ‘Survived’ and ‘Age’ was
considered null.
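The four age intervals follow from splitting the range between the minimum age (0.42) and the maximum age (74) into equal parts: (74 − 0.42) / 4 = 18.395 years per interval. The Python sketch below recomputes the edges from those two figures and then derives a survival percentage per interval. The passenger records are hypothetical, and treating each interval as closed on the left and open on the right (with the maximum age falling in the last one) is an assumption, since the report does not say how boundary ages were assigned.

```python
def interval_edges(lowest, highest, n_bins):
    """Edges of n_bins equal-width intervals between lowest and highest."""
    width = (highest - lowest) / n_bins
    return [round(lowest + i * width, 3) for i in range(n_bins + 1)]


def survival_by_interval(records, edges):
    """Percentage of survivors per interval; None for an empty interval."""
    n = len(edges) - 1
    totals = [0] * n
    survivors = [0] * n
    for age, survived in records:
        for i in range(n):
            is_last = i == n - 1
            # Assumed convention: [low, high), with the maximum kept in the last bin.
            if edges[i] <= age < edges[i + 1] or (is_last and age == edges[-1]):
                totals[i] += 1
                survivors[i] += survived
                break
    return [100 * s / t if t else None for s, t in zip(survivors, totals)]


# Minimum and maximum ages as quoted in the text.
edges = interval_edges(0.42, 74, 4)
print(edges)  # [0.42, 18.815, 37.21, 55.605, 74.0]

# Hypothetical (age, survived) pairs, not the report's data.
records = [(5, 1), (12, 0), (25, 0), (30, 1), (33, 0), (45, 1), (60, 0), (74, 0)]
rates = survival_by_interval(records, edges)
print(rates)
```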

2.7. Did passengers’ sex determine their survival?

In order to study the relationship between these two variables, the following percentages were
obtained: 75.63% of the female passengers survived, whilst only 17.66% of the male passengers did.
These data reveal a clear tendency, which is supported by the fact that, of the total number of
survivors, 70.87% were women and 29.13% were men. A geom_bar() was computed in
order to show this relationship.
Finally, the correlation between these variables was obtained: -0.57. This value is high enough to
consider that passengers’ sex was closely related to their chances of survival. The negative value
implies that high values of the variable Survived (2) are associated with low values of the variable Sex
(1), 1 being the number that represents female passengers.
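The sign of the coefficient depends entirely on how the categories are coded. The Python sketch below encodes two categorical columns as numbers and computes Pearson's correlation by hand. Female = 1 is taken from the text; male = 2 and the coding of Survived as 1 (died) and 2 (survived) are assumptions, and the eight passengers are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


SEX_CODES = {"female": 1, "male": 2}       # female = 1 per the text; male = 2 assumed
SURVIVED_CODES = {"no": 1, "yes": 2}       # assumed coding

# Hypothetical passengers.
sex = ["female", "female", "female", "male", "male", "male", "male", "male"]
survived = ["yes", "yes", "no", "no", "no", "no", "yes", "no"]

r = pearson([SEX_CODES[s] for s in sex], [SURVIVED_CODES[s] for s in survived])
print(round(r, 3))
```

Swapping the two Sex codes would flip the sign of the coefficient without changing its magnitude, which is why the coding has to be stated alongside the value.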

2.8. Did having relatives on board the Titanic condition the passengers’ survival?
In order to evaluate this relationship, the following percentages were obtained: 29.38% of the
passengers who travelled alone survived, whilst 50.04% of those who had family on board survived.
The hypothesis was that being accompanied improved the passengers’ rate of survival.
In order to test it, a variable called ‘Family’ with two levels was created and bound to the initial data
frame: the value 1 represents the passengers with relatives on the Titanic and 0 those without them.
Then, a bar plot, a table and a prop.table were computed in order to evaluate this dependence:

As observed from the prop.table above, the passengers who did not have any family members on
board and did not survive represent 41.7% of the total number of passengers.

Lastly, as a final check, the correlation between the variables ‘Family’ and ‘Survived’ was computed
and considered significant: 0.23. The positive value implies that passengers who had family on board
(value 1) had a higher survival rate.
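A two-level ‘Family’ variable of the kind described in this section can be derived from the counts of relatives on board. The Python sketch below assumes those counts are stored in columns named SibSp (siblings and spouses) and Parch (parents and children); the report does not name its source columns, so both names, the helpers and the rows are hypothetical.

```python
def family_flag(sibsp, parch):
    """1 if the passenger had any relative on board, otherwise 0."""
    return 1 if sibsp + parch > 0 else 0


def survival_rate_by_family(rows):
    """Percentage of survivors for Family = 0 and Family = 1."""
    totals = {0: 0, 1: 0}
    survivors = {0: 0, 1: 0}
    for row in rows:
        flag = family_flag(row["SibSp"], row["Parch"])
        totals[flag] += 1
        survivors[flag] += row["Survived"]
    return {flag: 100 * survivors[flag] / totals[flag]
            for flag in totals if totals[flag]}


# Hypothetical rows; SibSp and Parch are assumed column names.
rows = [
    {"SibSp": 0, "Parch": 0, "Survived": 0},
    {"SibSp": 1, "Parch": 0, "Survived": 1},
    {"SibSp": 0, "Parch": 2, "Survived": 0},
    {"SibSp": 0, "Parch": 0, "Survived": 0},
    {"SibSp": 0, "Parch": 0, "Survived": 1},
    {"SibSp": 1, "Parch": 1, "Survived": 1},
]

rates = survival_rate_by_family(rows)
print(rates)
```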
