Lab Assignment 1 Ucs551
UCS551
PREPARED BY:
PREPARED FOR:
DUE DATE: 17/12/2023
Table of Contents
1 Introduction
12 Step 3: Run the process and it will generate an AUC – ROC graph.
13 Conclusion
Introduction
This report analyses a telco customer dataset. The dataset has about 21 attributes that
help with the analysis. The main objective is to find out whether a customer will keep
using the telco services or will stop using them.
Knowing whether a customer will stay with or leave the telco services is important to
the service provider, because it allows the provider to analyse the reasons customers
leave. The company can then improve its services by making them more affordable, broader,
and more user friendly. With the analysis, management can make improvements and solve
problems that the company may face. The analysis can also surface information that the
company has forgotten or recorded incorrectly.
This report shows each step of the process used to achieve the objective and analyse the
telco service dataset. The analysis is carried out in RapidMiner, which helps to do the
work efficiently and effectively. RapidMiner has many operators that help the user make
assumptions and analyses based on various datasets.
Data Understanding
(Polynominal) streaming TV or not
Streaming Movies    Input    Qualitative (Polynominal)    Whether the customer has streaming movies or not
Contract            Input    Qualitative (Polynominal)    Type of contract the customer has
Paperless Billing   Input    Qualitative (Polynominal)    Whether the customer uses paperless billing or not
Payment Method      Input    Qualitative (Polynominal)    Payment method used by the customer
Monthly Charges     Input    Quantitative (Integer)       Monthly charges incurred by the customer
Total Charges       Input    Quantitative (Integer)       Total charges incurred by the customer
Part 1: Using Machine Learning Algorithm
Step 2: Use the Read CSV operator to load the sample dataset “Telco Customer Churn.csv”.
Step 3: Use the Set Role operator to set the role of the attribute “Churn” to label.
The Set Role operator is used to assign roles to attributes.
Choose Churn as the attribute name and change the target role to label.
After running the process, the Churn column is highlighted, which shows that its role has
been changed to label.
Step 4: Clean the missing values. In this dataset, Total Charges has missing values,
which need to be removed from the dataset.
Use the Filter Examples operator to remove the missing values.
The Filter Examples operator is used to selectively include or exclude examples from
the dataset based on certain conditions.
In the filters, select Total Charges and choose “is not missing” to remove the missing
values.
After running the process, the examples with missing values in Total Charges are removed.
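For readers who do not use RapidMiner, the same "is not missing" filter can be sketched in plain Python. This is only an illustration under stated assumptions: the sample rows, the customerID values, and the helper name drop_missing are hypothetical and are not taken from the real “Telco Customer Churn.csv” file.

```python
import csv
import io

# Hypothetical sample rows, used only to illustrate the filter.
SAMPLE_CSV = """customerID,Total Charges,Churn
A001,29.85,No
A002, ,No
A003,108.15,Yes
A004,,Yes
"""

def drop_missing(rows, column):
    """Keep only the rows whose value in `column` is not blank."""
    return [row for row in rows if (row.get(column) or "").strip() != ""]

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
clean_rows = drop_missing(rows, "Total Charges")
print(len(rows), "rows before,", len(clean_rows), "rows after")
```

A value that is empty or only whitespace is treated as missing here; a real file may mark missing values differently.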
The Sample operator is used to create a random sample of examples from the dataset. It is
used on a large dataset to reduce the number of examples for exploration, testing, or
model training.
After running the process, the dataset is reduced to 1000 examples only.
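The idea behind the Sample operator, drawing a fixed number of examples at random without replacement, can be sketched as follows. The function name sample_examples, the seed value, and the stand-in dataset of 5000 rows are assumptions made for illustration; they do not describe how RapidMiner implements sampling.

```python
import random

def sample_examples(examples, size, seed=42):
    """Return `size` examples drawn at random without replacement."""
    if size > len(examples):
        raise ValueError("sample size is larger than the dataset")
    rng = random.Random(seed)  # fixed seed so the same sample is drawn every run
    return rng.sample(examples, size)

# Stand-in for the cleaned rows; the count of 5000 is hypothetical.
examples = list(range(5000))
sample = sample_examples(examples, 1000)
print(len(sample))
```

Fixing the seed keeps the experiment repeatable, which matters when the three algorithms are later compared on the same 1000 examples.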
Identify the accuracy and correlation using Logistic Regression, Random Forest, and Naïve
Bayes.
- The accuracy using Naïve Bayes is 69.90%.
This bar chart shows churn against gender, that is, how many customers of each gender
will keep using the service or will leave it. The x-axis represents the gender of the
customers and whether they will use or leave the service. The y-axis represents the
number of customers.
The total number of customers who will continue the service is lower than the number who
will leave it: 253 customers will continue and 747 customers will leave. Among those who
remain, female customers (134) outnumber male customers (119). Among those who will
leave, there are 374 female customers and 373 male customers. From this bar chart, we can
conclude that female customers are the larger group among both those who remain and those
who leave, although among those who leave the difference is only one customer.
This bar chart shows whether contracted customers are likely to remain with the services
or leave them after the contract ends. The x-axis represents the contract length (monthly
or yearly) and whether the customers will remain or leave. The y-axis represents the
number of customers.
Among the customers who will continue using the services, those on a month-to-month
contract form the largest group with 223 customers, followed by those on a one-year
contract with 23 customers and, last, those on a two-year contract with 7 customers. It
can be assumed that the services offer a good short-term plan but not a good long-term
plan.
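To make the comparison between contract types easier to read, the counts reported above can be turned into shares of the group. The sketch below uses only the three counts given in the text; the variable names and the contract labels used as dictionary keys are illustrative.

```python
# Counts of customers in the "continue" group, as reported in the text above.
continue_by_contract = {"Month-to-month": 223, "One year": 23, "Two year": 7}

total = sum(continue_by_contract.values())
shares = {contract: count / total for contract, count in continue_by_contract.items()}

for contract, share in shares.items():
    print(f"{contract}: {share:.1%}")
```

The three counts add up to 253, which matches the size of the group reported in the gender chart.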
The number of customers who will not continue the services, 747, is higher than the
number who will continue using them. The majority of customers will not continue using
the services.
Step 1: To generate the attribute “predict Churn”, connect the operators as in the figure below.
The predict Churn attribute is a new attribute generated by RapidMiner using the machine
learning algorithm. It shows the ability of the machine learning algorithm to predict
the outcome of the event itself.
Inside the Cross Validation operator, three predictive operators are used, namely Logistic
Regression, Random Forest, and Naïve Bayes, to show the difference between the three
predictive operators.
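Cross-validation splits the examples into k folds and uses each fold once for testing while training on the rest. The sketch below shows only that splitting idea. The report does not state how many folds were used, so k=10 is an assumption, and the function name k_fold_indices is hypothetical. The split here is sequential and unshuffled, which may differ from the sampling used in the actual process.

```python
def k_fold_indices(n_examples, k=10):
    """Yield (train_indices, test_indices) for each of the k folds."""
    if not 2 <= k <= n_examples:
        raise ValueError("k must be between 2 and the number of examples")
    indices = list(range(n_examples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_examples // k + (1 if i < n_examples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

# 1000 examples, as in the sampled dataset; k=10 is an assumed setting.
folds = list(k_fold_indices(1000, k=10))
print(len(folds), len(folds[0][0]), len(folds[0][1]))
```

Every example lands in exactly one test fold, so each model is evaluated on all 1000 examples without ever being tested on data it was trained on.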
To evaluate how accurately the machine learning algorithm predicts the result, we can
compare the actual result with the predicted result. If the two tally closely enough, the
algorithm is good enough to be used. That is why the confusion matrix is important: it
lets us make sure that the counts of “True Positive” and “True Negative” are higher than
the counts of “False Positive” and “False Negative”.
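The four cells of a confusion matrix come from comparing each actual label with its predicted label. A minimal sketch follows, assuming a binary label. The example labels are made up, and treating "Yes" as the positive class is an assumption, since the report does not say which Churn value RapidMiner used as positive.

```python
def confusion_counts(actual, predicted, positive):
    """Count TP, TN, FP and FN for a binary label."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            # Predicted positive: correct if the actual label is also positive.
            counts["TP" if a == positive else "FP"] += 1
        else:
            # Predicted negative: correct if the actual label is also negative.
            counts["FN" if a == positive else "TN"] += 1
    return counts

# Hypothetical labels, for illustration only.
actual = ["No", "No", "Yes", "No", "Yes", "No"]
predicted = ["No", "Yes", "Yes", "No", "No", "No"]
counts = confusion_counts(actual, predicted, positive="Yes")
print(counts)
```

The correct predictions are the TP and TN cells, so a model is doing well when those two counts dominate the FP and FN counts.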
The accuracy for Logistic Regression is 78.40%. There are 668 True Positive samples and
116 True Negative samples. The algorithm predicts the positive values well but not the
negative values. Precision (the share of predicted positives that are actually positive)
is 82.98% and recall (the share of actual positives that are predicted as positive) is 89.42%.
The accuracy for Random Forest is 80.70%. There are 700 True Positive samples and 107
True Negative samples. The algorithm predicts the positive values well but not the
negative values. Precision (the share of predicted positives that are actually positive)
is 82.74% and recall (the share of actual positives that are predicted as positive) is 93.71%.
The accuracy for Naïve Bayes is 69.90%. There are 486 True Positive samples and 213 True
Negative samples. Compared with the other two models, Naïve Bayes predicts more of the
negative values correctly but fewer of the positive values. Precision (the share of
predicted positives that are actually positive) is 92.40% and recall (the share of actual
positives that are predicted as positive) is 65.06%.
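The accuracy, precision, and recall figures for the three models can be cross-checked from the True Positive and True Negative counts. The sketch below assumes that the 1000-example sample contains 747 positive and 253 negative examples. These are the two group sizes reported in the bar charts, and assigning 747 to the positive class is an assumption that fits the reported recall values. The False Positive and False Negative counts are derived from that assumption and are not stated in the report.

```python
# Assumed class sizes in the 1000-example sample (see the note above).
POSITIVES, NEGATIVES = 747, 253

# (True Positive, True Negative) counts as stated in the report.
reported = {
    "Logistic Regression": (668, 116),
    "Random Forest": (700, 107),
    "Naive Bayes": (486, 213),
}

def metrics(tp, tn):
    """Return (accuracy, precision, recall) from TP and TN and the class sizes."""
    fn = POSITIVES - tp  # positives the model missed (derived, not reported)
    fp = NEGATIVES - tn  # negatives predicted as positive (derived, not reported)
    accuracy = (tp + tn) / (POSITIVES + NEGATIVES)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall

for name, (tp, tn) in reported.items():
    accuracy, precision, recall = metrics(tp, tn)
    print(f"{name}: accuracy={accuracy:.2%} precision={precision:.2%} recall={recall:.2%}")
```

Under this assumption the computed values agree with the percentages reported for all three models, which suggests the reported figures are internally consistent.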
Part 3: Compare AUC ROC Curve
Step 2: Double-click the Compare ROCs operator and, inside its process, insert the 3
algorithms that have been used, which are Logistic Regression, Random Forest, and Naïve
Bayes. Then link all the algorithms.
Step 3: Run the process and it will generate an AUC – ROC graph.
As you can see, the blue line (representing Random Forest) is close to 1.00, while the
other two lines are below it. The red line (Naïve Bayes) is the lowest. An excellent model
has an AUC near 1, which means it has a good measure of separability. An AUC of 0.5 means
the model cannot separate the two classes at all. A poor model has an AUC near 0, which
means it has the worst measure of separability: it predicts 0s as 1s and 1s as 0s. We can
conclude that Random Forest is the best machine learning model among the three.
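The meaning of AUC values near 1, 0.5, and 0 can be illustrated with a small calculation. The sketch below computes AUC as the chance that a randomly chosen positive example receives a higher score than a randomly chosen negative one, which is a standard equivalent definition and not necessarily how the Compare ROCs operator computes it. The scores and labels are made up.

```python
def auc(scores, labels):
    """AUC as the chance that a random positive outranks a random negative."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0  # positive ranked above negative
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))

# Hypothetical labels and scores, for illustration only.
labels = [1, 1, 1, 0, 0]
perfect = auc([0.9, 0.8, 0.7, 0.2, 0.1], labels)        # every positive outranks every negative
inverted = auc([0.1, 0.2, 0.3, 0.8, 0.9], labels)       # every negative outranks every positive
uninformative = auc([0.5, 0.5, 0.5, 0.5, 0.5], labels)  # all scores tied
print(perfect, inverted, uninformative)
```

The three cases give 1.0, 0.0, and 0.5 respectively, matching the excellent, worst, and no-separability situations described above.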
Conclusion
This analysis tests a predictive model for churn. The model predicts customer churn based
on the Churn attribute in the telco customer dataset. By using machine learning algorithms
such as Logistic Regression, Random Forest, and Naïve Bayes, we can determine which
algorithm is the most accurate at predicting outcomes for the telco customer dataset.
Testing each machine learning algorithm shows its ability to predict the outcome of the
event itself. To evaluate how accurately an algorithm predicts the result, we compare the
actual result with the predicted result. In this analysis, we use three different
algorithms, Logistic Regression, Random Forest, and Naïve Bayes, to determine which is
the most accurate.
Lastly, after testing the accuracy of each algorithm, we can compare the algorithms using
the Compare ROCs operator. This operator generates a graph that the user can easily read
and understand. From the graph, we can determine which algorithm is the most accurate and
has the best ability to predict.
RUBRIC UCS551 LAB INDIVIDUAL ASSIGNMENT
Report Marks
Punctuality (Punctual Submission) 2 4 6 8 10
Correct format 2 4 6 8 10
Able to give good introduction 2 4 6 8 10
Good explanation on each operator uses 2 4 6 8 10
Dataset and Analytic Application 2 4 6 8 10