Individual Assignment UCS551
UCS 551
This study uses the Titanic dataset. It contains 12 columns and 1309 examples, holding each
passenger's class, name, sex, age, number of siblings or spouses on board, number of parents or
children on board, ticket number, passenger fare, cabin, port of embarkation, lifeboat and
survival.
The table above shows the variables in the dataset before it undergoes the data preparation
process. A few attributes have missing values: Age, Passenger Fare and Cabin.
The second stage uses the Select Attributes operator to keep only 6 attributes of the
dataset: Survival, Sex, Age, Passenger Fare, Passenger Class and Cabin.
Figure 3 Missing values from Age, Passenger Fare and Cabin
The third stage is data cleaning and filtering with the Filter Examples operator. The data
contained missing values, which could lead to inaccurate observations, so the filter is applied
to Age, Passenger Fare and Cabin. As a result, only 272 of the 1309 examples were left after
the cleaning process. The original dataset includes passengers as young as 0.167 years and as
old as 80 years; after filtering, the remaining 272 examples still cover ages from 0 to
80 years.
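The cleaning step can be pictured with a minimal Python sketch. This is not the RapidMiner process itself; the function name `drop_incomplete` and the four example rows are hypothetical and only illustrate the idea of keeping examples that have no missing value in Age, Passenger Fare or Cabin.

```python
# Hypothetical examples; None marks a missing value.
examples = [
    {"Survival": "Yes", "Age": 29.0, "Passenger Fare": 211.338, "Cabin": "B5"},
    {"Survival": "No", "Age": None, "Passenger Fare": 7.25, "Cabin": None},
    {"Survival": "Yes", "Age": 0.167, "Passenger Fare": 20.575, "Cabin": None},
    {"Survival": "No", "Age": 80.0, "Passenger Fare": 30.0, "Cabin": "A23"},
]


def drop_incomplete(examples, attributes):
    """Keep only the examples that have a value for every listed attribute."""
    return [
        example for example in examples
        if all(example.get(name) is not None for name in attributes)
    ]


complete = drop_incomplete(examples, ["Age", "Passenger Fare", "Cabin"])
print(f"{len(complete)} of {len(examples)} examples kept")
```

Most examples in the real dataset lack a Cabin value, which would explain why such a filter reduces 1309 examples to 272.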
The next stage is to set the attribute Survival as the target label. An attribute with the
label role acts as the target Attribute for learning Operators; the label is often called the
‘target variable’ or ‘class’. After giving Survival the label role, I renamed the operator to
Set Label.
Figure 6 Passengers' Age in Integer (Before Transforming)
Figure 7 Passengers' Age in Categories (After Transforming)
The last step transforms the Age attribute into five categories: Infant (ages 0-2 years),
Children (ages 3-12 years), Teenager (ages 13-19 years), Adults (ages 20-59 years) and Senior
Citizens (ages 60 years and above). Figures 6 and 7 show the result of applying the Discretize
by User Specification operator. This operator discretizes the selected numerical attributes
into user-specified classes, so the selected numerical attributes become nominal attributes.
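The five age categories can be sketched in Python as a lookup over upper limits. This is an illustration, not the operator's implementation. It assumes each class is closed at its upper limit (2, 12, 19 and 59), so a fractional age such as 2.5 falls into Children; the names `UPPER_LIMITS`, `CLASS_NAMES` and `age_category` are hypothetical.

```python
from bisect import bisect_left

# Assumed upper limits (inclusive) of the first four classes; the last class is open-ended.
UPPER_LIMITS = [2, 12, 19, 59]
CLASS_NAMES = ["Infant", "Children", "Teenager", "Adults", "Senior Citizens"]


def age_category(age):
    """Map a numerical age to the nominal class whose upper limit it does not exceed."""
    return CLASS_NAMES[bisect_left(UPPER_LIMITS, age)]


for age in (0.167, 12, 13, 59, 80):
    print(age, "->", age_category(age))
```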
2.3 DATA EXPLORATION
The chart above illustrates the Titanic passenger fares. 100 passengers paid a fare between 0
and 50, 104 paid between 50 and 100, 34 between 100 and 150, 2 between 150 and 200, 17 between
200 and 250, 12 between 250 and 300, and 3 between 400 and 500. The maximum fare is 512.329 and
the minimum is 0. The average fare is 84.906, with a standard deviation of 80.401.
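As a rough illustration of how such a chart is built, the sketch below groups fares into bins of width 50 and checks that the bin counts reported above add up to the 272 cleaned examples. The `sample_fares` list is hypothetical, and the convention that a fare on a boundary (for example exactly 50) goes into the higher bin is an assumption.

```python
def bin_counts(values, width):
    """Count values per bin; each bin is keyed by its lower edge."""
    counts = {}
    for value in values:
        lower = int(value // width) * width
        counts[lower] = counts.get(lower, 0) + 1
    return dict(sorted(counts.items()))


# Hypothetical fares, not taken from the dataset.
sample_fares = [0.0, 26.55, 49.5, 77.958, 151.55, 211.338, 512.329]
print(bin_counts(sample_fares, 50))

# Bin counts as reported in the text.
reported_counts = [100, 104, 34, 2, 17, 12, 3]
print(sum(reported_counts))
```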
The chart above illustrates passenger survival. Of the 272 passengers, 182 survived and 90
did not survive the disaster.
3.0 APPLYING MACHINE LEARNING MODEL USING RAPIDMINER
The figure above shows the process set up to build the decision tree and make predictions. I
selected Decision Tree and Apply Model from the Operators panel and dragged them into the
process. Apply Model is used to get a prediction on unseen data or to transform data by
applying a preprocessing model. The ExampleSet to which the model is applied has to be
compatible with the Attributes of the model: it must have the same number, order, type and
role of Attributes as the ExampleSet used to generate the model.
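The compatibility rule can be expressed as a small check. This is only a sketch of the rule as stated here, not RapidMiner's own validation; the `Attribute` tuple and `is_compatible` function are hypothetical names.

```python
from typing import NamedTuple


class Attribute(NamedTuple):
    name: str
    value_type: str  # e.g. "nominal" or "numerical"
    role: str        # e.g. "regular" or "label"


def is_compatible(model_attributes, new_attributes):
    """Compare number, order, type and role of attributes, position by position."""
    if len(model_attributes) != len(new_attributes):
        return False
    return all(
        a.value_type == b.value_type and a.role == b.role
        for a, b in zip(model_attributes, new_attributes)
    )


training = [
    Attribute("Sex", "nominal", "regular"),
    Attribute("Passenger Fare", "numerical", "regular"),
]
print(is_compatible(training, list(training)))            # same layout
print(is_compatible(training, list(reversed(training))))  # different order
```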
The figure above shows the decision tree of the Titanic dataset. A decision tree is a
tree-like collection of nodes intended to create a decision on which class a value belongs
to, or an estimate of a numerical target value. Each node represents a splitting rule for
one specific Attribute. For classification, this rule separates values belonging to
different classes. The building of new nodes is repeated until the stopping criteria are
met. A prediction for the class label Attribute is determined by the majority of Examples
that reached a leaf during generation, while an estimate for a numerical value is obtained
by averaging the values in a leaf. This Operator can process ExampleSets containing both
nominal and numerical Attributes. The label Attribute must be nominal for classification
and numerical for regression.
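The way a leaf produces its prediction can be sketched in a few lines of Python. The tiny tree below is hand-written and hypothetical; it is not the tree from the figure, and the label lists at its leaves are made up for illustration.

```python
from collections import Counter
from statistics import mean


def classify_leaf(labels):
    """Classification: predict the majority class among the examples in the leaf."""
    return Counter(labels).most_common(1)[0][0]


def regress_leaf(values):
    """Regression: predict the average of the target values in the leaf."""
    return mean(values)


# Hypothetical one-split tree: the inner node holds a splitting rule on Sex.
tree = {
    "attribute": "Sex",
    "branches": {
        "Female": {"leaf_labels": ["Yes", "Yes", "No", "Yes"]},
        "Male": {"leaf_labels": ["No", "Yes", "No"]},
    },
}


def predict(node, example):
    """Follow the splitting rules down to a leaf, then take the majority vote."""
    while "attribute" in node:
        node = node["branches"][example[node["attribute"]]]
    return classify_leaf(node["leaf_labels"])


print(predict(tree, {"Sex": "Female"}))
print(predict(tree, {"Sex": "Male"}))
```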
The figure above shows the predicted survival of the passengers. In the data, 182 passengers
survived and 90 did not survive the disaster. The model, however, predicts that 153 passengers
survive and 119 do not.
Split validation with split data
Split data is an operator that produces the desired number of subsets of the given ExampleSet.
The ExampleSet will be partitioned into subsets according to the specified relative sizes.
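A minimal sketch of partitioning by relative sizes follows. The 0.7/0.3 ratios, the shuffling and the fixed seed are assumptions made for illustration; the text does not state which ratios or sampling type were used, and `split_data` is a hypothetical helper, not the operator's implementation.

```python
import random


def split_data(examples, ratios, seed=0):
    """Shuffle the examples and cut them into subsets of the given relative sizes."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    subsets, start = [], 0
    for i, ratio in enumerate(ratios):
        # The last subset takes whatever remains, so no example is lost to rounding.
        if i == len(ratios) - 1:
            end = len(shuffled)
        else:
            end = start + round(ratio * len(shuffled))
        subsets.append(shuffled[start:end])
        start = end
    return subsets


# 272 stands for the number of examples left after cleaning.
train, test = split_data(range(272), [0.7, 0.3])
print(len(train), len(test))
```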
Model evaluation metrics are required to quantify model performance. The choice of evaluation
metrics depends on the machine learning task, such as classification, regression, ranking,
clustering, topic modeling, and others. Some metrics, such as precision and recall, are useful
for multiple tasks. Supervised learning tasks such as classification and regression constitute
a majority of machine learning applications. For this Titanic dataset, I use the Performance
(Classification) operator to evaluate the model.
Accuracy
Accuracy is a common evaluation metric for classification problems. It is the number of correct
predictions divided by the total number of predictions made. The figure above shows that the
accuracy of the model is 78.05%.
Recall
Recall indicates how many of the actual examples of a class the model finds. It is defined as
the ratio of correctly classified positive examples to the total number of positive examples.
High recall indicates that the class is correctly recognized, which means there is a small
number of false negatives. The figure shows that the class recall is 72.73% for true yes and
88.89% for true no.
Precision
Precision indicates how often the model is correct when it predicts a class. It is the ratio
of correctly classified positive examples to the total number of examples predicted as
positive. When the recall is high but the precision is low, most of the positive examples are
correctly recognized, but there are many false positives. This can be seen in the figure for
true no, where the class recall is 88.89% but the class precision is lower, at 61.54%.
Conversely, when the recall is low but the precision is high, many positive examples are
missed (predicted negative), but the examples predicted as positive are mostly correct. This
can be seen in the result for true yes, where the class recall is 72.73% but the precision is
high, at 93.02%.
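The three metrics can be computed from the four cells of a two-class confusion matrix, as in the sketch below. The counts are an assumption: the text does not list them, and they were chosen because they reproduce the reported accuracy of 78.05%, the class recalls of 72.73% and 88.89%, and the class precision of 61.54% for no. "Yes" is treated as the positive class.

```python
# Assumed confusion-matrix counts, with "yes" as the positive class.
tp = 40  # true yes, predicted yes
fn = 15  # true yes, predicted no
fp = 3   # true no, predicted yes
tn = 24  # true no, predicted no

accuracy = (tp + tn) / (tp + fn + fp + tn)
recall_yes = tp / (tp + fn)      # share of actual yes that was found
recall_no = tn / (tn + fp)       # share of actual no that was found
precision_yes = tp / (tp + fp)   # share of predicted yes that was correct
precision_no = tn / (tn + fn)    # share of predicted no that was correct

for name, value in [
    ("accuracy", accuracy),
    ("recall (yes)", recall_yes),
    ("recall (no)", recall_no),
    ("precision (yes)", precision_yes),
    ("precision (no)", precision_no),
]:
    print(f"{name}: {value:.2%}")
```

Because recall divides by the actual class total and precision by the predicted class total, the 15 yes examples predicted as no lower both the recall for yes and the precision for no.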
Confusion Matrix
In conclusion, the estimated performance of a model tells us how well it performs on unseen
data. Making predictions on future data is often the main problem we want to solve, because
humans tend to make mistakes. It is very important to understand the problem context before
choosing a metric, because each machine learning model tries to solve a problem with a
different objective and a different dataset. Machine learning has had a big impact on the
economy in general, as it helps people work more efficiently and creatively.