
FACULTY OF BUSINESS AND MANAGEMENT

BACHELOR OF BUSINESS ADMINISTRATION (HONS) FINANCE (BA242)

UCS 551

INDIVIDUAL ASSIGNMENT 5 (Titanic Dataset)


1.0 INTRODUCTION TO RAPIDMINER

Previously known as YALE (Yet Another Learning Environment), RapidMiner is a data
science software platform developed by Ingo Mierswa, Ralf Klinkenberg, and Simon Fischer
in 2001. It is an integrated software environment designed for commercial use as well as for
research, education, training, rapid prototyping, and application development. RapidMiner
provides an integrated environment for data preparation, machine learning, deep learning,
text mining, and predictive analytics, and it supports all steps of the machine learning
process, including data preparation, results visualization, model validation, and optimization.

RapidMiner offers a range of analysis solutions through template-based frameworks that are
easy to use and reduce errors by nearly eliminating the need to write any code. It provides
data mining and machine learning procedures including data loading and transformation, data
pre-processing and visualization, predictive analytics and statistical modelling, evaluation,
and deployment. RapidMiner uses a graphical user interface to design and run analytical
workflows. These workflows are called “Processes” in RapidMiner, and they consist of
multiple operators. RapidMiner's functionality can be extended with additional plugins made
available via the RapidMiner Marketplace, which provides a platform for developers to create
data analysis algorithms and publish them to the community.
2.0 DATA PRE-PROCESSING AND EXPLORATION USING RAPIDMINER

2.1 Data Understanding

The Titanic dataset is used in this study. It contains 12 attributes and 1,309 examples
holding information about each passenger: class, name, sex, age, number of siblings or
spouses on board, number of parents or children on board, ticket number, passenger fare,
cabin, port of embarkation, life boat, and survival.

VARIABLE                              ROLE     VARIABLE TYPE

Survival                              Target   Binominal
Passenger Class                       Input    Polynominal
Age                                   Input    Integer
Life Boat                             Input    Polynominal
Sex                                   Input    Binominal
Passenger Fare                        Input    Numeric
Cabin                                 Input    Polynominal
Name                                  Input    Polynominal
No. of Siblings or Spouses on Board   Input    Integer
No. of Parents or Children on Board   Input    Integer
Ticket Number                         Input    Polynominal
Port of Embarkation                   Input    Polynominal

The table above shows the variables included in the dataset before it undergoes the data
preparation process. Based on the data, a few attributes have missing values, namely Age,
Passenger Fare, and Cabin.
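
For readers who want to reproduce this inspection outside RapidMiner, below is a minimal pandas sketch. The file name titanic.csv and the column names are assumptions, since RapidMiner ships this dataset in its own repository.

```python
import pandas as pd

# Load the dataset; "titanic.csv" is an assumed file name, since
# RapidMiner stores this dataset in its internal repository.
df = pd.read_csv("titanic.csv")

# Report the number of examples and attributes (expected: 1309 x 12).
print(df.shape)

# Count missing values per attribute; Age, Passenger Fare and Cabin
# are the attributes with missing entries in this dataset.
print(df.isna().sum())
```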

2.2 Data Preparation

Figure 1 Retrieve Titanic Dataset


The dataset needs to undergo several data preparation processes to ensure that the data can
accurately predict the outcome of this study. The figures in this section show the processes
in RapidMiner and the operators used to clean and filter the attributes in the dataset. The
first stage of the process is to retrieve the Titanic data from the repository (Figure 1).

Figure 2 Select 6 (Six) Attributes from Dataset

The second stage is to select attributes using the Select Attributes operator; only six
attributes are kept from the dataset: Survival, Sex, Age, Passenger Fare, Passenger Class,
and Cabin.
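
Outside RapidMiner, the same selection can be sketched in pandas; the column names below are assumptions matching the variable table in Section 2.1.

```python
import pandas as pd

df = pd.read_csv("titanic.csv")  # assumed file name

# Keep only the six attributes used in this study, mirroring the
# Select Attributes operator (column names assumed).
selected = ["Survived", "Sex", "Age", "Passenger Fare",
            "Passenger Class", "Cabin"]
df = df[selected]
print(df.columns.tolist())
```
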
Figure 3 Missing values from Age, Passenger Fare and Cabin

The third stage is data cleaning and filtering using the Filter Examples operator. As noted
earlier, a few attributes contain missing values, which could lead to inaccurate
observations, so the dataset is filtered on Age, Passenger Fare, and Cabin to remove examples
with missing values in these attributes. The original dataset contains passengers aged from
as young as 0.167 years to as old as 80 years; after filtering, only 272 of the 1,309
examples remain, covering passengers aged between 0 and 80 years.
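
The same cleaning can be sketched in pandas under the naming assumptions above; dropping every example with a missing Age, Passenger Fare, or Cabin reproduces the reduction from 1,309 to 272 examples.

```python
import pandas as pd

df = pd.read_csv("titanic.csv")  # assumed file name

# Remove every example with a missing Age, Passenger Fare or Cabin,
# mirroring the Filter Examples operator ("no missing values").
df = df.dropna(subset=["Age", "Passenger Fare", "Cabin"])

# Keep only passengers aged between 0 and 80 years, as in the filter.
df = df[df["Age"].between(0, 80)]

print(len(df))  # 272 examples remain after cleaning
```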

Figure 4 Filtering process


Figure 5 Set attribute Survival as Target Role

The next stage is to set the attribute Survival as the target label. An attribute with the
label role acts as the target attribute for learning operators; the label is often called the
‘target variable’ or ‘class’. After setting Survival as the target role, I renamed the
operator to Set Label.
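
In code, the label role simply corresponds to separating the target column from the input attributes; a minimal sketch under the same naming assumptions:

```python
import pandas as pd

df = pd.read_csv("titanic.csv")  # assumed file name
df = df.dropna(subset=["Age", "Passenger Fare", "Cabin"])

# Separate the target attribute (the "label" role in RapidMiner)
# from the input attributes; the column name "Survived" is assumed.
y = df["Survived"]                 # class / target variable
X = df.drop(columns=["Survived"])  # all remaining attributes are inputs
```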

Figure 6 Passengers' Age in Integer (Before Transforming)
Figure 7 Passengers' Age in Categories (After Transforming)
The last step is transforming the Age attribute into five categories: Infant (ages 0-2
years), Children (ages 3-12 years), Teenager (ages 13-19 years), Adults (ages 20-59 years),
and lastly Senior Citizens (ages 60 and above). The result is shown in Figures 6 and 7, where
I used the Discretize by User Specification operator. This operator discretizes the selected
numerical attributes into user-specified classes; the selected numerical attributes become
nominal attributes.
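
An equivalent discretization can be sketched with pandas.cut, using the same five user-specified classes; the bin edges follow the age ranges above, and the upper bound of 120 is an arbitrary assumption.

```python
import pandas as pd

df = pd.read_csv("titanic.csv")  # assumed file name
df = df.dropna(subset=["Age", "Passenger Fare", "Cabin"])

# Discretize Age into the five user-specified classes; the numerical
# attribute becomes a nominal (categorical) attribute, as in RapidMiner.
bins = [0, 2, 12, 19, 59, 120]   # 120 is an assumed upper bound
labels = ["Infant", "Children", "Teenager", "Adults", "Senior Citizens"]
df["Age"] = pd.cut(df["Age"], bins=bins, labels=labels, include_lowest=True)

print(df["Age"].value_counts())
```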

2.3 Data Exploration

Figure 8 Passenger Fare

The chart above illustrates the Titanic's passenger fares. There are 100 passengers who paid
a fare between 0 and 50, and 104 passengers who paid between 50 and 100. In the 100-150
range there are 34 passengers; in the 150-200 range, 2 passengers; in the 200-250 range, 17
passengers; in the 250-300 range, 12 passengers; and lastly, in the 400-500 range, 3
passengers. The maximum fare is 512.329 and the minimum is 0; the average fare is 84.906 with
a standard deviation of 80.401.
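
The same summary statistics and a histogram comparable to Figure 8 can be sketched with pandas and matplotlib, again under the assumed file and column names:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("titanic.csv")  # assumed file name
df = df.dropna(subset=["Age", "Passenger Fare", "Cabin"])

# Summary statistics: the report cites max 512.329, min 0,
# mean 84.906 and standard deviation 80.401 for the cleaned data.
print(df["Passenger Fare"].describe())

# Histogram of fares with 50-unit bins, comparable to Figure 8.
df["Passenger Fare"].plot(kind="hist", bins=range(0, 600, 50))
plt.xlabel("Passenger Fare")
plt.show()
```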

Figure 9 Passenger Survival

The chart above illustrates the passengers' survival. Of the total 272 passengers, the chart
shows that 182 passengers survived while the other 90 passengers did not survive the
disaster.
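
These counts can be reproduced with a single value_counts call on the cleaned data; a sketch under the same assumptions:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("titanic.csv")  # assumed file name
df = df.dropna(subset=["Age", "Passenger Fare", "Cabin"])

# Count survivors vs. non-survivors among the 272 cleaned examples;
# the report observes 182 survived and 90 did not survive.
print(df["Survived"].value_counts())

# A bar chart comparable to Figure 9.
df["Survived"].value_counts().plot(kind="bar")
plt.show()
```
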
3.0 APPLYING MACHINE LEARNING MODEL USING RAPIDMINER

Decision tree and prediction

Figure 10 Processes of Decision Tree and Prediction

The figure above shows the process used to create the decision tree and prediction. I
selected the Decision Tree and Apply Model operators and dragged them onto the panel. Apply
Model aims to get a prediction on unseen data or to transform data by applying a
preprocessing model. The ExampleSet on which the model is applied has to be compatible with
the attributes of the model: it must have the same number, order, type, and role of
attributes as the ExampleSet used to generate the model.
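
A comparable workflow can be sketched in scikit-learn: train a decision tree, then apply it back to the data. This is only an illustrative sketch under the naming assumptions above, not the exact RapidMiner process; Cabin is left out here because it has too many distinct values to encode usefully.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("titanic.csv")  # assumed file name
df = df.dropna(subset=["Age", "Passenger Fare", "Cabin"])

# One-hot encode nominal attributes so the tree can split on them.
X = pd.get_dummies(df[["Sex", "Age", "Passenger Fare", "Passenger Class"]])
y = df["Survived"]

# Train the tree (Decision Tree operator); max_depth=4 is an assumption.
model = DecisionTreeClassifier(max_depth=4, random_state=42)
model.fit(X, y)

# Apply the model (Apply Model operator) to obtain predictions; the
# applied ExampleSet must have the same attributes as the training data.
predictions = model.predict(X)
print(pd.Series(predictions).value_counts())
```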

Figure 11 Decision Tree

The figure above shows the decision tree of the Titanic dataset. A decision tree is a
tree-like collection of nodes intended to make a decision on values belonging to a class or
to estimate a numerical target value. Each node represents a splitting rule for one specific
attribute; for classification, this rule separates values belonging to different classes. New
nodes are built repeatedly until the stopping criteria are met. A prediction for the class
label attribute is determined by the majority of examples that reached a leaf during
generation, while an estimation of a numerical value is obtained by averaging the values in a
leaf. This operator can process ExampleSets containing both nominal and numerical attributes;
the label attribute must be nominal for classification and numerical for regression.
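
Continuing the sketch above, the learned splitting rules can also be printed as text, one line per node, which mirrors the graphical tree in Figure 11; model and X are the assumed names from the previous block, not part of the report's process.

```python
from sklearn.tree import export_text

# Print the tree's splitting rules, one line per node; "model" and "X"
# come from the previous sketch and are assumptions.
print(export_text(model, feature_names=list(X.columns)))
```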

Figure 12 Prediction of Survived Passengers

The figure above shows the prediction of the passengers' survival. Before the prediction, the
data shows that 182 passengers survived while the other 90 passengers did not survive the
disaster. After applying the model, however, it predicts a total of 153 passengers surviving
while the other 119 passengers are predicted not to survive.
Split validation with the Split Data operator

Split Data is an operator that produces the desired number of subsets of the given
ExampleSet. The ExampleSet is partitioned into subsets according to the specified relative
sizes.
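
In scikit-learn this partition is a single train_test_split call. The 70/30 ratio below is an assumption, since the report does not state the relative sizes used; a 30% test set of the 272 cleaned examples would, however, give the 82 examples seen later in the confusion matrix.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("titanic.csv")  # assumed file name
df = df.dropna(subset=["Age", "Passenger Fare", "Cabin"])
X = pd.get_dummies(df[["Sex", "Age", "Passenger Fare", "Passenger Class"]])
y = df["Survived"]

# Partition the ExampleSet into two subsets by relative size;
# the 70/30 ratio here is an assumption, not taken from the report.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

print(len(X_train), len(X_test))
```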

Figure 13 Survival Prediction

Split validation with the Split Validation operator

Figure 14 Preparation Data of Split Validation

Figure 15 Split Validation


Figure 16 Performance Vector

Figure 17 Description of Performance Vector


4.0 MODEL EVALUATION AND DISCUSSION

Model evaluation metrics are required to quantify model performance. The choice of evaluation
metrics depends on the given machine learning task, such as classification, regression,
ranking, clustering, or topic modeling. Some metrics, such as precision and recall, are
useful for multiple tasks. Supervised learning tasks such as classification and regression
constitute the majority of machine learning applications. For this Titanic dataset, I use the
Performance (Classification) operator to evaluate the model.
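
An equivalent end-to-end evaluation can be sketched in scikit-learn under the same assumptions as the earlier snippets:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix

df = pd.read_csv("titanic.csv")  # assumed file name
df = df.dropna(subset=["Age", "Passenger Fare", "Cabin"])
X = pd.get_dummies(df[["Sex", "Age", "Passenger Fare", "Passenger Class"]])
y = df["Survived"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)  # assumed 70/30 split

model = DecisionTreeClassifier(max_depth=4, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Accuracy plus per-class precision and recall, as in the
# Performance Vector produced by RapidMiner.
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```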

Figure 18 Performance Vector

Accuracy

Accuracy is a common evaluation metric for classification problems. It is the number of
correct predictions made as a ratio of all predictions made. From the figure above, the
accuracy of the model is 78.05%.

Recall

Recall measures how many of the actual positive examples the model identifies. It can be
defined as the ratio of the number of correctly classified positive examples to the total
number of positive examples. A high recall indicates that the class is correctly recognized,
meaning there is a small number of false negatives. The figure shows that for the class 'yes'
the recall is 72.73%, while for the class 'no' the recall is 88.89%.
Precision

Precision measures how many of the examples predicted as positive are actually correct. When
recall is high but precision is low, most of the positive examples are correctly recognized,
but many negative examples are wrongly predicted as positive. This can be seen in the figure,
where the class recall for 'no' is 88.89% but the class precision is only 61.54%. Conversely,
when recall is low but precision is high, many positive examples are missed (predicted as
negative), but the examples that are predicted positive are mostly correct. This can be seen
in the result for 'yes', where the class recall is 72.73% but the precision is high at
93.02%.

Figure 19 Confusion Matrix

Confusion Matrix

A confusion matrix is a summary of prediction results on a classification problem. The
numbers of correct and incorrect predictions are summarized with count values and broken down
by each class; this is the key to the confusion matrix. It shows the ways in which the
classification model is confused when it makes predictions, giving an understanding not only
of the errors being made by a classifier but, more importantly, of the types of errors being
made. From the figure above, the total number of examples is 82 (40 + 15 + 3 + 24). Taking
'survived' as the positive class, the value of 40 signifies the true positives, 15 the false
negatives, 3 the false positives, and 24 the true negatives. This means that out of the 27
passengers who did not survive the sinking, 24 are correctly identified as not survived while
the other 3 are wrongly predicted as survived. Furthermore, out of the 55 passengers who
survived, the model correctly identifies only 40 as survived, while the other 15 are wrongly
predicted as not survived.
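
As a quick arithmetic check, every figure reported in this section follows directly from these four counts:

```python
# Confusion-matrix counts from Figure 19, with "survived" as the
# positive class.
tp, fn = 40, 15  # actual survivors: correctly / wrongly classified
fp, tn = 3, 24   # actual non-survivors: wrongly / correctly classified

total = tp + fn + fp + tn        # 82 examples
accuracy = (tp + tn) / total     # 64 / 82 = 78.05 %
recall_yes = tp / (tp + fn)      # 40 / 55 = 72.73 %
recall_no = tn / (tn + fp)       # 24 / 27 = 88.89 %
precision_yes = tp / (tp + fp)   # 40 / 43 = 93.02 %
precision_no = tn / (tn + fn)    # 24 / 39 = 61.54 %

print(accuracy, recall_yes, recall_no, precision_yes, precision_no)
```
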
5.0 CONCLUSION

In conclusion, the estimated performance of a model tells us how well it performs on unseen
data. Making predictions on future data is often the main problem we want to solve, since as
humans we are prone to mistakes. It is very important to understand the framework before
choosing any metric, because each machine learning model tries to solve a problem with a
different objective using a different dataset. Machine learning has had a big impact on the
economy in general, as it helps people work more efficiently and creatively.
