Lab Assignment 1 Ucs551

INTRODUCTION TO DATA ANALYTICS AND APPLICATION

UCS551

INDIVIDUAL ASSIGNMENTS 2 (15%)

(TELCO CUSTOMER CHURN DATASET)

PREPARED BY:

NAME STUDENT ID GROUP

MUHAMMAD IZWAN BIN MOHD AMIN 2022971659 D1AC2206D

PREPARED FOR:

NOOR ASSYIKIN BINTI ALIAS

DUE DATE:

17/12/2023
Table of Contents

No  Content

1   Introduction
2   Data Understanding
3   Part 1: Using Machine Learning Algorithm
      Step 1: Create repository “Lab Assignment 1”.
      Step 2: Using Read CSV and load the sample dataset “Telco Customer Churn.csv”.
      Step 3: Using set role operators to set role for attribute “Churn” as label.
4   Step 4: Cleaning the missing values.
5   Step 5: Reduce the sample to only 1000 samples.
6   Step 6: Use Cross Validation technique for different modelling algorithms.
7   Performance vector results
8   Bar Graph from the Telco Customer Dataset.
      Bar Chart Graph Churn Vs. Gender.
      Bar Graph Chart Churn Vs. Contract.
9   Part 2: Predictive Churn Analysis
      Step 1: To generate attribute “predict Churn” connect the link.
10  Result for “predictChurn” attribute.
11  Part 3: Compare AUC ROC Curve
      Step 1: Using compare ROCs operator and link the operator.
      Step 2: Double click at the Compare ROCs operator, in the process insert 3 algorithms.
12  Step 3: Run the process and it will generate an AUC – ROC graph.
13  Conclusion

Introduction

This report analyses the Telco Customer Churn dataset, which has 21 attributes describing
each customer. The main objective of the analysis is to determine whether a customer will
continue using the telco services or stop using them.

Analysing whether customers will stay with or leave the telco services is important to
the service provider because it helps the company understand why customers leave. The
company can then improve its services by making them more affordable, broader, and more
user friendly. With the analysis, management can make improvements and address the
problems that the company may face. The analysis can also reveal information that the
company has overlooked or recorded incorrectly.

This report shows each step of the process used to achieve the objective and analyse the
telco dataset. The analysis is carried out in RapidMiner, which helps to do the work
efficiently and effectively. RapidMiner has many operators that help the user make
assumptions and carry out analysis on various datasets.

Data Understanding

Variable Name      Role    Variable Type               Description

Churn              Output  Qualitative (Polynominal)   Whether the customer has left the service (churned)
Customer ID        Input   Quantitative (Integer)      Customer ID number
Gender             Input   Qualitative (Polynominal)   Gender of the customer
Senior Citizen     Input   Quantitative (Integer)      Whether the customer is a senior citizen
Partner            Input   Qualitative (Polynominal)   Marital status of the customer
Dependents         Input   Qualitative (Polynominal)   Whether the customer has dependents
Tenure             Input   Quantitative (Integer)      Number of months the customer has been with the service
Phone Service      Input   Qualitative (Polynominal)   Whether the customer has phone service or not
Multiple Lines     Input   Qualitative (Polynominal)   Whether the customer has multiple lines or not
Internet Services  Input   Qualitative (Polynominal)   Type of internet services the customer has
Online Security    Input   Qualitative (Polynominal)   Whether the customer has online security or not
Online Backup      Input   Qualitative (Polynominal)   Whether the customer has online backup or not
Device Protection  Input   Qualitative (Polynominal)   Whether the customer has device protection or not
Tech Support       Input   Qualitative (Polynominal)   Whether the customer has tech support or not
Streaming TV       Input   Qualitative (Polynominal)   Whether the customer has streaming TV or not
Streaming Movies   Input   Qualitative (Polynominal)   Whether the customer has streaming movies or not
Contract           Input   Qualitative (Polynominal)   Type of contract the customer has
Paperless Billing  Input   Qualitative (Polynominal)   Whether the customer uses paperless billing or not
Payment Method     Input   Qualitative (Polynominal)   Payment method used by the customer
Monthly Charges    Input   Quantitative (Integer)      Monthly charges incurred by the customer
Total Charges      Input   Quantitative (Integer)      Total charges incurred by the customer

Part 1: Using Machine Learning Algorithm

Step 1: Create a repository named “Lab Assignment 2”.

Step 2: Use the Read CSV operator to load the sample dataset “Telco Customer Churn.csv”.

Step 3: Use the Set Role operator to set the role of the attribute “Churn” to label.
- The Set Role operator is used to assign roles to attributes.

Choose Churn as the attribute name and change the target role to label.

After running the process, the Churn column is highlighted, which shows that its role has
been changed to label.

Step 4: Clean the missing values. In this dataset, Total Charges has missing values, which
need to be removed from the dataset.

Use the Filter Examples operator to remove the missing values.
- The Filter Examples operator is used to selectively include or exclude examples from the
dataset based on certain conditions.

In the filters, select Total Charges and choose “is not missing” to remove the missing
values.

After running the process, the examples with missing values in Total Charges are removed.
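
For readers who want to see the same filtering idea outside RapidMiner, the following is a minimal Python sketch, not the method used in this report. The three-row extract, the column spelling `TotalCharges`, and the helper name `drop_missing` are assumptions made for illustration, as is the assumption that a missing value shows up as an empty or whitespace-only field.

```python
import csv
import io

# Hypothetical three-row extract; the real file has 21 attributes.
SAMPLE_CSV = """customerID,tenure,TotalCharges,Churn
A-001,12,350.5,No
A-002,0, ,No
A-003,5,99.0,Yes
"""

def drop_missing(rows, column):
    """Keep only rows whose value in `column` is present (not blank)."""
    return [
        row for row in rows
        if row[column] is not None and row[column].strip() != ""
    ]

# Read the CSV text into a list of dictionaries, one per example.
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# Equivalent in spirit to the "is not missing" filter on Total Charges.
clean_rows = drop_missing(rows, "TotalCharges")
print(len(rows), "rows before,", len(clean_rows), "rows after")
```

In this sketch the second row is dropped because its Total Charges field contains only a space.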

Step 5: Reduce the sample to only 1000 samples.

- The Sample operator is used to create a random sample of examples from the dataset. It is
useful for reducing a large dataset for exploration, testing, or model training.

After running the process, the dataset is reduced to 1000 examples.
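
The sampling step can be illustrated with a short Python sketch. This is only an analogy to the Sample operator, assuming sampling without replacement; the helper name `sample_examples`, the seed value, and the stand-in dataset of 7,000 row identifiers are hypothetical.

```python
import random

def sample_examples(examples, sample_size, seed=None):
    """Return `sample_size` examples drawn at random without replacement."""
    if sample_size >= len(examples):
        # Nothing to reduce: return a copy of the full dataset.
        return list(examples)
    rng = random.Random(seed)  # a fixed seed makes the sample repeatable
    return rng.sample(examples, sample_size)

# Hypothetical stand-in for the cleaned dataset: 7,000 row identifiers.
examples = list(range(7000))
subset = sample_examples(examples, 1000, seed=42)
print(len(subset))  # 1000
```

Fixing the seed means the same 1000 examples are drawn on every run, which keeps later accuracy figures comparable between runs.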

Step 6: Use the Cross Validation technique for different modelling algorithms.

- The Cross Validation operator is used to assess the performance of a predictive model
by dividing the dataset into multiple folds and iteratively training and testing the
model on different subsets of the data.

Identify the accuracy and correlation using Logistic Regression, Random Forest, and Naïve
Bayes.
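
The fold-splitting idea behind cross validation can be sketched in a few lines of Python. This is a simplified illustration, not RapidMiner's implementation: it uses contiguous folds without shuffling or stratification, and the choice of 10 folds is an assumption because the report does not state the number of folds. The names `k_fold_indices` and `cross_validate` are hypothetical.

```python
def k_fold_indices(n_examples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_examples // k + (1 if i < n_examples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_examples))
        yield train, test
        start += size

def cross_validate(n_examples, k, train_and_score):
    """Average the score returned by `train_and_score(train, test)` over k folds."""
    scores = [train_and_score(train, test) for train, test in k_fold_indices(n_examples, k)]
    return sum(scores) / len(scores)

# With 1000 examples and 10 folds, each model is trained on 900 and tested on 100.
folds = list(k_fold_indices(1000, 10))
print(len(folds), len(folds[0][0]), len(folds[0][1]))  # 10 900 100
```

A real scorer passed to `cross_validate` would fit a model on the training indices and return its accuracy on the test indices; every example is used for testing exactly once.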

Performance vector results using Logistic Regression.


- The accuracy using Logistic Regression is 78.40%.

Performance vector results using Random Forest.


- The accuracy using Random Forest is 80.70%.

Performance vector results using Naïve Bayes.

- The accuracy using Naïve Bayes is 69.90%.

Bar Graph from the Telco Customer Dataset.

1. Bar Chart Graph Churn Vs. Gender.

This bar chart shows churn against gender, that is, how many customers of each gender
will continue using the service or leave it. The x-axis represents the gender of the
customers and whether they will continue or leave the service. The y-axis represents the
number of customers.

The total number of customers that will continue the service is lower than the number
that will leave: 253 customers will continue and 747 customers will leave. Among the
customers who remain, 134 are female and 119 are male. Among the customers who leave,
374 are female and 373 are male. From this bar chart, we can conclude that female
customers make up the larger share of both those who remain and those who leave the
services.
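
The counts reported above can be cross-checked with a small Python sketch. The figures are the ones stated in the text; the dictionary layout and the helper name `summarize` are illustrative assumptions.

```python
# Counts as reported in the text for the 1000-example sample.
counts = {
    "continue": {"female": 134, "male": 119},
    "leave": {"female": 374, "male": 373},
}

def summarize(group):
    """Return the group total and the female share of that total."""
    total = sum(group.values())
    return total, group["female"] / total

for outcome, group in counts.items():
    total, female_share = summarize(group)
    print(f"{outcome}: total={total}, female share={female_share:.2%}")
```

The totals come to 253 and 747, matching the figures in the text, and the female share of each group shows how large the gender difference is relative to the group size.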

2. Bar Graph Chart Churn Vs. Contract.

This bar chart shows, for each contract type, how many customers will remain with the
services or leave after the contract ends. The x-axis represents the contract type
(month-to-month, one year, or two years) and whether the customers will remain or leave.
The y-axis represents the number of customers.

Among the customers that will continue using the services, those on a month-to-month
contract are the largest group with 223 customers, followed by those on a one-year
contract with 23 customers and, last, those on a two-year contract with 7 customers. It
can be assumed that the services offer a better plan for the short term than for the
long term.

The number of customers that will not continue the services, 747, is higher than the
number that will continue using the service. The majority of customers will not continue
using the services.

Part 2: Predictive Churn Analysis

Step 1: To generate the attribute “predict Churn”, connect the operators as shown in the
figure below.

- The predict Churn attribute is a new attribute generated by RapidMiner using the machine
learning algorithm. It shows the ability of the machine learning algorithm to predict
the outcome of the event.

Inside the Cross Validation operator, three predictive operators are used, namely Logistic
Regression, Random Forest, and Naïve Bayes, to show the differences between them.

To evaluate how accurately the machine learning algorithm predicts the result, we can
compare the actual result with the predicted result. If the two tally closely enough, the
algorithm is good to use. That is why the confusion matrix is important: it lets us check
that the counts for “True Positive” and “True Negative” are higher than the counts for
“False Positive” and “False Negative”.

Result for “predictChurn” attribute using Logistic Regression.

Performance Vector using Logistic Regression.

The accuracy for Logistic Regression is 78.40%. There are 668 True Positive samples and
116 True Negative samples. The algorithm predicts the positive class well but is weaker
on the negative class. Its precision (the share of predicted positives that are actually
positive) is 82.98% and its recall (the share of actual positives that are predicted as
positive) is 89.42%.

Result for “predictChurn” attribute using Random Forest.

Performance Vector using Random Forest.

The accuracy for Random Forest is 80.70%. There are 700 True Positive samples and 107
True Negative samples. The algorithm predicts the positive class well but is weaker on
the negative class. Its precision (the share of predicted positives that are actually
positive) is 82.74% and its recall (the share of actual positives that are predicted as
positive) is 93.71%.

Result for “predictChurn” attribute using Naïve Bayes.

Performance Vector using Naïve Bayes.

The accuracy for Naïve Bayes is 69.90%. There are 486 True Positive samples and 213 True
Negative samples. Compared with the other two algorithms, Naïve Bayes classifies more
negative samples correctly but fewer positive samples. Its precision (the share of
predicted positives that are actually positive) is 92.40% and its recall (the share of
actual positives that are predicted as positive) is 65.06%.
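
The accuracy, precision, and recall figures reported for the three algorithms can be reproduced from the True Positive and True Negative counts with a short Python sketch. It assumes that the positive class has 747 examples and the negative class has 253, taken from the class totals in the bar chart section; mapping those totals to the positive and negative classes is an inference, not something the report states. The function name `metrics` is hypothetical.

```python
def metrics(tp, tn, positives, negatives):
    """Derive accuracy, precision and recall from TP, TN and the class sizes."""
    fn = positives - tp  # actual positives that were predicted negative
    fp = negatives - tn  # actual negatives that were predicted positive
    accuracy = (tp + tn) / (positives + negatives)
    precision = tp / (tp + fp)  # share of predicted positives that are correct
    recall = tp / (tp + fn)     # share of actual positives that are found
    return accuracy, precision, recall

# (True Positive, True Negative) counts as reported in the text.
reported = {
    "Logistic Regression": (668, 116),
    "Random Forest": (700, 107),
    "Naive Bayes": (486, 213),
}

# Assumed class sizes: 747 positive and 253 negative examples.
for name, (tp, tn) in reported.items():
    acc, prec, rec = metrics(tp, tn, positives=747, negatives=253)
    print(f"{name}: accuracy={acc:.2%} precision={prec:.2%} recall={rec:.2%}")
```

Under these assumptions the sketch gives the same percentages as the three performance vectors, which suggests the reported counts and rates are consistent with each other.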

Part 3: Compare AUC ROC Curve

Step 1: Add the Compare ROCs operator and link it to the process.


- The Compare ROCs operator is used to generate ROC curves, which are graphical
representations of the performance of a classification model at various threshold
settings.

Step 2: Double-click the Compare ROCs operator and, inside it, insert the three algorithms
that have been used, which are Logistic Regression, Random Forest, and Naïve Bayes. Then
link all the algorithms.

Step 3: Run the process and it will generate an AUC – ROC graph.

As the graph shows, the blue line (Random Forest) is close to 1.00, while the other two
lines are below it. The red line (Naïve Bayes) is the lowest. An excellent model has an
AUC near 1, which means it has a good measure of separability. A poor model has an AUC
near 0, which means it has the worst measure of separability: it predicts 0s as 1s and
1s as 0s. We can conclude that Random Forest is the best machine learning model among
the three.
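
The meaning of AUC as a measure of separability can be illustrated with a small Python sketch. It computes AUC as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, which is one standard way to define it and not necessarily how the Compare ROCs operator computes it. The labels, scores, and the function name `roc_auc` are made up for illustration.

```python
def roc_auc(labels, scores):
    """AUC as the chance that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes are needed to compute AUC")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical labels: three positive (1) and three negative (0) examples.
labels = [1, 1, 1, 0, 0, 0]
print(roc_auc(labels, [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]))  # 1.0, perfect separation
print(roc_auc(labels, [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]))  # 0.0, 1s scored as 0s
print(roc_auc(labels, [0.5, 0.5, 0.5, 0.5, 0.5, 0.5]))  # 0.5, no separation
```

The third case shows that a model that cannot tell the classes apart scores 0.5, halfway between the two extremes described above.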

Conclusion

In conclusion, using RapidMiner to process and analyse the telco customer dataset is
practical and simple. Churn patterns in the telco customer dataset are easy to identify
through the bar chart visualizations. All the processes and steps are well organized and
understandable. The required outputs, such as the bar charts and other important
visualizations, are easy to generate automatically in RapidMiner.

This analysis tests a predictive churn model, which predicts customer churn based on the
Churn attribute of the telco customer dataset. By using machine learning algorithms such
as Logistic Regression, Random Forest, and Naïve Bayes, we can determine which algorithm
gives the most accurate outcome on the telco customer dataset. Testing the algorithms
shows how well machine learning can predict the outcome of the event. To evaluate the
accuracy of the predictions, we compare the actual results with the predicted results
across the three algorithms to identify the most accurate one.

Lastly, after testing the accuracy of each algorithm, we can compare the algorithms
using the Compare ROCs operator. This operator generates a graph that the user can easily
read and understand. From the graph, we can determine which algorithm has the best
ability to predict.

RUBRIC UCS551 LAB INDIVIDUAL ASSIGNMENT

Report Marks
Punctuality (Punctual Submission) 2 4 6 8 10
Correct format 2 4 6 8 10
Able to give good introduction 2 4 6 8 10
Good explanation on each operator uses 2 4 6 8 10
Dataset and Analytic Application 2 4 6 8 10

Result and Discussion - Process 2 4 6 8 10


Result and Discussion - Result 2 4 6 8 10
Result and Discussion - Techniques 2 4 6 8 10

Result and Discussion - Graph 2 4 6 8 10

Conclusion and references 2 4 6 8 10


/100
