
Project 1 - Data Processing

Data Processing for training data


1. Removed Cabin and Ticket
a. Ticket: The format is inconsistent and not indicative of any variable. Even with
processing to find “hidden variables”, the ticket would most likely encode pclass,
embarked, ticket vendor, or ticket number (nth ticket sold). Half of these are
already given, and the other half are not relevant.
b. Cabin: The majority of the rows have no values (__). Attempting to compensate
may skew the data greatly.
2. Checked the basic statistics of the variable with missing values (Age).
- Low skewness and kurtosis scores indicate that the data roughly follows a
normal distribution, so the mean can be used to replace the missing values.
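
The check in step 2 can be sketched as follows. This is a minimal illustration in plain Python, assuming the population (biased) moment formulas for skewness and excess kurtosis; the report does not say which tool or formula produced its scores. The function names are hypothetical and the ages are made-up sample values, not rows from the dataset.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def skewness(values):
    """Population skewness: third central moment over variance^1.5."""
    m = mean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def excess_kurtosis(values):
    """Population excess kurtosis: fourth central moment over variance^2, minus 3."""
    m = mean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m4 = sum((v - m) ** 4 for v in values) / n
    return m4 / m2 ** 2 - 3.0


def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical ages (assumption), with None marking a missing value.
ages = [22.0, 38.0, None, 26.0, 35.0, None, 54.0, 2.0, 27.0, 14.0]
observed_ages = [a for a in ages if a is not None]

print("skewness:", skewness(observed_ages))
print("excess kurtosis:", excess_kurtosis(observed_ages))

filled_ages = fill_missing_with_mean(ages)
```

Values near zero for both statistics are what the step above treats as "low"; the cut-off for "low" is a judgment call that the report does not specify.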

Data Processing for Testing data


1. Removed Cabin and Ticket
a. Ticket: The format is inconsistent and not indicative of any variable. Even with
processing to find “hidden variables”, the ticket would most likely encode pclass,
embarked, ticket vendor, or ticket number (nth ticket sold). Half of these are
already given, and the other half are not relevant.
b. Cabin: The majority of the rows have no values (78%). Attempting to compensate
may skew the data greatly.
2. Checked the basic statistics of the variables with missing values. Because there are
missing values, grouping the rows directly by categorical variables (e.g. Embarked and
Sex) will not work.
a. Age: Low skewness and kurtosis values indicate that the mean may be used to
fill in the missing values. Removing the 2 outliers changes the mean by only
0.25, so for the sake of accuracy these outliers will not be removed.
b. Fare: High skewness and kurtosis indicate a need to use the median instead.
Zero values will not be removed, as they are probably padded data from the
researchers and, given the size of the dataset, are unlikely to change the
median significantly.
3. Because they are categorical, the gender of passengers (0=Female, 1=Male) and the
embarked location (1=S, 10=Q, 11=C) were converted to binary codes for easier processing.
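
Steps 2b and 3 above can be sketched in a few lines of plain Python. The code values (0/1 for gender; 1, 10, 11 for the port) come from the list above. The raw labels "female"/"male", the dictionary and function names, and the sample fares are assumptions made for illustration.

```python
from statistics import median


def fill_missing_with_median(values):
    """Replace None entries with the median of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


# Codes as listed in step 3; the raw label spellings are assumed.
SEX_CODES = {"female": 0, "male": 1}
EMBARKED_CODES = {"S": 1, "Q": 10, "C": 11}


def encode(rows):
    """Return copies of the rows with Sex and Embarked replaced by their codes."""
    return [
        {
            **row,
            "Sex": SEX_CODES[row["Sex"]],
            "Embarked": EMBARKED_CODES[row["Embarked"]],
        }
        for row in rows
    ]


# Hypothetical fares (assumption): one missing value, one zero, one extreme value.
fares = [7.25, 71.28, None, 8.05, 512.33, 0.0]
filled_fares = fill_missing_with_median(fares)

# Hypothetical passenger rows (assumption).
passengers = [
    {"Sex": "male", "Embarked": "S"},
    {"Sex": "female", "Embarked": "C"},
]
encoded = encode(passengers)
```

The median is used for Fare because a single extreme fare pulls the mean upward while leaving the median close to a typical ticket price.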

Variable Relationships (test)


1. Passenger Class: According to an article from bowdoin.edu, the survival rate of
first-class passengers was 62%, well above the rates for second class (41%) and third
class (25%), and only slightly under those two rates added together. This may be due
to easier access to information and lifeboats for the upper classes.
2. Sex: From the same article, the survival rate was 20% for men, 75% for women, and
50% for children. This is a direct consequence of the captain’s famous order to the
officers, “Women and children first”, which only allowed men to board lifeboats if
there were no women or children in line.
3. Horizontal (sibsp) and Vertical (parch) family relations: According to Egan & Zanchi
(n.d.), the sibsp count has no effect on one’s survival rate. In contrast, having
more parents/children on board greatly increases survivability.
4. Port embarked: According to a previous analysis, the port with the highest survival rate is
C, followed by Q, then S. This is mainly because C has the fewest total passengers and S
the most. In absolute terms, S had the most survivors and the most non-survivors, but
because of C's small population, the percentages give C the highest survivability.
Because of this, it might not be the best idea to use this as a decision variable.
5. Fare: A correlation analysis of Fare against Pclass (testing dataset) gives a
correlation score of -0.577, which indicates a moderate negative correlation. Because
of this, using both variables as predictors may be redundant later on.
6. Age: According to Egan & Zanchi (n.d.), on average, female survivors are older, while
male survivors are younger. This is consistent with the earlier claim that Sex directly
affects survival rate, as males had a higher probability of boarding a lifeboat if they
were children at the time.
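
Two of the calculations in the list above, the survival rate per port (item 4) and the Fare/Pclass correlation (item 5), can be sketched as below. This assumes the correlation score is a Pearson coefficient, which the report does not state. The function names and all sample values are hypothetical and will not reproduce the -0.577 figure.

```python
import math


def survival_rate_by_group(groups, survived):
    """Map each group label to the fraction of its members with survived == 1."""
    totals, survivors = {}, {}
    for group, flag in zip(groups, survived):
        totals[group] = totals.get(group, 0) + 1
        survivors[group] = survivors.get(group, 0) + flag
    return {group: survivors[group] / totals[group] for group in totals}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical sample (assumption): S has more survivors in absolute terms,
# but C has the higher survival rate because its group is much smaller.
ports = ["S", "S", "S", "S", "S", "S", "C", "C"]
survived = [1, 1, 1, 0, 0, 0, 1, 1]
rates = survival_rate_by_group(ports, survived)

# Hypothetical sample (assumption): fares fall as the class number rises,
# so the coefficient comes out negative.
pclass = [1, 1, 2, 2, 3, 3]
fares = [80.0, 60.0, 20.0, 26.0, 8.0, 7.0]
r = pearson(pclass, fares)
```

The first function shows why raw survivor counts and survival percentages can rank the ports differently, which is the concern raised in item 4.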

Takeaways: The best predictors for survivability are Sex, Passenger class, and possibly Age or
Parch. To treat the “holes” in the dataset, the mean may be used for age, while the median
should be used for fare. Because the variables are on different scales, normalization may be
needed. Because the outcome is categorical, methods that accommodate this (like k-means
clustering) should be used.
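
The normalization mentioned in the takeaways could look like the sketch below. The report does not say which scaling method is intended; min-max scaling to the range [0, 1] is assumed here as one common choice. The function name and sample values are hypothetical.

```python
def min_max_scale(values):
    """Rescale numbers linearly so the smallest maps to 0.0 and the largest to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no scale information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical columns (assumption) on very different scales.
ages = [2.0, 22.0, 38.0, 54.0]
fares = [0.0, 7.25, 71.28, 512.33]

scaled_ages = min_max_scale(ages)
scaled_fares = min_max_scale(fares)
```

After scaling, both columns span the same range, so a distance-based method such as k-means is not dominated by the column with the larger raw values.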

Regression Analysis
When attempting to fit the data with a linear regression, an error is raised stating that
the data cannot be fit.

Sources:
https://courses.bowdoin.edu/history-2203-fall-2020-kmoyniha/reflection/
https://bibinmjose.github.io/explore_titanic_data/#:~:text=Passengers%20with%20Sibling%2FSpouse&text=Perished%20%3A%20398%20(45%25)%20pasengers,atleast%20one%20Sibling%2FSpouse%20onboard.
https://www.shiftcomm.com/insights/never-let-go-titanic-survival-101/
