Telecommunication Customer Churn (New)
CUSTOMER CHURN
For a company to expand its clientele, its growth rate (measured by the number of new customers) must
exceed its churn rate.
OBJECTIVE
● To find out why customers are lost, by measuring customer loyalty, so that lost customers can be regained.
● To highlight the main factors or variables that affect customer churn.
● To build models with various machine learning algorithms and evaluate their accuracy and performance.
● To provide insights and recommendations.
METHODOLOGY
● Exploratory Data Analysis (EDA): Exploratory data analysis is an approach to analyzing data sets to summarize their main characteristics, often with visual methods.
● Data Cleaning: Data is not oil but crude oil, which needs to be refined. This is also known as the refining part. Data cleaning is the process of preparing data for analysis by removing or modifying data that is incorrect, incomplete, irrelevant, duplicated, or improperly formatted.
● Feature Selection: Feature selection is the process by which a subset of features, or variables, is selected from a large dataset for building machine learning models.
● Modelling: A machine learning model is a file that has been trained to recognize certain types of patterns. Train a model over a set of data, providing it an algorithm that it can use to reason over and learn from these data.
DATA VISUALIZATION
● Histogram: A histogram is a graphical representation that organizes a group of data points into ranges (bins).
● Box Plot: A box and whisker plot, also called a box plot, displays the five-number summary of a set of data. The five-number summary is the minimum, first quartile, median, third quartile, and maximum. In a box plot, we draw a box from the first quartile to the third quartile. A line goes through the box at the median. The whiskers go from each quartile to the minimum or maximum.
● Heat Map: A heat map (or heatmap) is a graphical representation of data where values are depicted by color. Heat maps make it easy to visualize complex data and understand it at a glance. The heatmap-style correlation matrix is a very effective tool when used properly.
● Count Plot: A count plot can be thought of as a histogram across a categorical, instead of quantitative, variable.
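The five-number summary behind a box plot can be computed directly. The following is a minimal sketch using only the Python standard library; the `minutes` values are hypothetical and do not come from the churn dataset, and the "inclusive" quartile method is one of several common conventions.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    ordered = sorted(values)
    # method="inclusive" interpolates between observed points, so the
    # quartiles always lie between the observed minimum and maximum.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]

# Hypothetical daily call minutes for nine customers (illustration only).
minutes = [120, 135, 150, 165, 180, 195, 210, 225, 400]
summary = five_number_summary(minutes)
print(summary)
```

In this made-up sample the maximum (400) sits far above the third quartile, which is the kind of value a box plot makes visible at a glance.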
HISTOGRAM
Accuracy alone does not tell the full story when we are working with a class-imbalanced data set. We will look at two better metrics for evaluating class-imbalanced problems: precision and recall.
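To illustrate why accuracy can mislead on imbalanced classes, here is a small sketch in plain Python. The labels are invented for illustration (10 churners out of 100 customers, an assumed ratio) and are not results from this analysis.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = churn)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

# Hypothetical data: 10 churners, 90 non-churners.
y_true = [1] * 10 + [0] * 90
# A model that finds only 2 of the 10 churners and raises 1 false alarm.
y_pred = [1] * 2 + [0] * 8 + [1] * 1 + [0] * 89

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = safe_ratio(tp + tn, len(y_true))
precision = safe_ratio(tp, tp + fp)   # of predicted churners, how many churned
recall = safe_ratio(tp, tp + fn)      # of actual churners, how many were found

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

The accuracy looks high because the majority class dominates, while recall shows that most churners were missed.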
RANDOM FOREST
Figures: Classification Report, Confusion Matrix
● The fundamental reason to use a random forest instead of a single decision tree is to combine the predictions of many decision trees into a single model.
● A random forest can reduce the high variance of a flexible model like a decision tree by combining many trees into one ensemble model.
● It reduces overfitting by averaging or combining the results of different decision trees.
● Random forests are very flexible and often achieve high accuracy.
For this churn analysis, I did not use accuracy for evaluation since it can be misleading for imbalanced classes such as ours. For the evaluation of our model, I used precision and recall instead.
FEATURE SELECTION
Feature importance refers to techniques that
assign a score to input features based on how
useful they are at predicting a target variable.
According to feature importance, customer churn is highly impacted by the voicemail plan, while account length has an almost negligible impact on customer churn.
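A random forest exposes feature importance scores after fitting. The sketch below assumes scikit-learn is available and uses synthetic data in place of the churn dataset; the feature names, the 15% positive rate, and the hyperparameters are all assumptions made for illustration, not the settings used in this analysis.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn data: roughly 15% positives (assumed).
X, y = make_classification(
    n_samples=2000, n_features=6, n_informative=3, n_redundant=0,
    weights=[0.85, 0.15], random_state=0,
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # hypothetical names

# Stratify so both splits keep the same class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)

# Evaluate with precision and recall rather than accuracy.
y_pred = forest.predict(X_test)
print("precision:", precision_score(y_test, y_pred))
print("recall:   ", recall_score(y_test, y_pred))

# Impurity-based importances: non-negative scores that sum to 1.
ranked = sorted(
    zip(feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

On real data, the ranked list is what supports statements such as one variable having a strong impact and another a negligible one.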
MULTI-COLLINEARITY BETWEEN VARIABLES
● Total Day Calls, Total Night Calls, and Total Eve Calls each had a variance inflation factor (VIF) above 20, so they were removed.
● Total Night Charge and Total Eve Charge were also removed, as the VIF is approximately 12 for both variables.
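The VIF of a variable is 1 / (1 - R²), where R² comes from regressing that variable on all the other predictors. The following is a minimal sketch assuming NumPy is available; the three columns are synthetic and only loosely named after telecom variables, so the printed values say nothing about the actual dataset.

```python
import numpy as np

def variance_inflation_factors(X):
    """Return the VIF of each column of a 2-D array X (rows = observations)."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for j in range(n_cols):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns plus an intercept.
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        ss_res = float(residuals @ residuals)
        ss_tot = float(((target - target.mean()) ** 2).sum())
        # VIF = 1 / (1 - R^2) = ss_tot / ss_res; infinite for a perfect fit.
        vifs.append(float("inf") if ss_res == 0.0 else ss_tot / ss_res)
    return vifs

# Synthetic example: "charge" is almost a linear function of "minutes",
# while "calls" is generated independently (all values are made up).
rng = np.random.default_rng(0)
minutes = rng.normal(200.0, 50.0, size=500)
charge = 0.17 * minutes + rng.normal(0.0, 1.0, size=500)
calls = rng.normal(100.0, 20.0, size=500)

vifs = variance_inflation_factors(np.column_stack([minutes, charge, calls]))
for name, vif in zip(["minutes", "charge", "calls"], vifs):
    print(f"{name}: VIF = {vif:.1f}")
```

The two nearly collinear columns get large VIFs and the independent column stays close to 1, which is the pattern used above to decide which variables to drop.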
LOGISTIC REGRESSION
Figures: Without Balancing, With Balancing
• Logistic regression is an effective model for binary classification tasks, although by default it is not effective at imbalanced classification. The given dataset is highly imbalanced.
• We applied an over-sampling technique, as it performed better than under-sampling.
• In balancing the dataset we faced a trade-off between accuracy and precision-recall, but the balanced dataset clearly gives a better result, which can also be seen in the ROC curve (a model validation tool).
• Precision and recall are quite low when the data is imbalanced, but they increase significantly after balancing the dataset; recall, for example, rises from 16% to 76%.
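The slides do not say which over-sampling technique was used. As one possibility, the sketch below shows plain random over-sampling in standard-library Python: minority-class rows are duplicated at random until every class is the same size. The `rows` and `labels` values are placeholders, not the churn data.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until all classes are equal in size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        # All original rows belonging to this class.
        pool = [row for row, label in zip(rows, labels) if label == cls]
        # Sample with replacement to make up the shortfall (none for the majority).
        extra = [rng.choice(pool) for _ in range(target - count)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels

# Placeholder data: 15 churners (label 1) and 85 non-churners (label 0).
rows = [[i] for i in range(100)]
labels = [1] * 15 + [0] * 85

balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(labels), "->", Counter(balanced_labels))
```

Over-sampling of this kind is normally applied to the training split only, so that duplicated rows do not leak into the data used for evaluation.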
ROC (Receiver Operating Characteristic) CURVE
ROC curves are frequently used to show graphically the trade-off between sensitivity and specificity for every possible cut-off for a test or a combination of tests. The AUC, the area under the ROC curve, tells how well the classification mechanism has worked.
The closer an ROC curve is to the upper left corner, the more efficient the test is, meaning it maximises the true positive rate and minimises the false positive rate.
Thus we can clearly deduce that balancing the data gave a better classification of 1's and 0's using logistic regression.
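The way an ROC curve is traced by sweeping the cut-off can be sketched in plain Python. The labels and scores below are invented for illustration and are not outputs of the logistic regression model; the code assumes both classes are present.

```python
def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) pairs for every cut-off."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    # Sort by descending score; lowering the cut-off admits one score group at a time.
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Samples with tied scores cross the cut-off together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Hypothetical labels (1 = churn) and predicted churn probabilities.
y_true = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1]

points = roc_points(y_true, scores)
print(points)
print("AUC =", auc(points))
```

An AUC of 0.5 corresponds to random guessing and 1.0 to a perfect ranking, so a curve that hugs the upper left corner yields an AUC close to 1.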
INSIGHTS
From data visualization we found that customers who do not have a Voice mail plan and an International plan are more likely to churn.
Out of all the algorithms used, Random Forest gave the best results. According to the model, the most significant variables affecting churn are:
2. International Plan