
Predictive Analysis using Logistic Regression and CHAID

Customer Retention
Prakhar Deep Gupta

Business Objective/Problem
Cross selling to improve revenue per customer. Cross selling is a way to improve the relationship with the customer and to improve customer retention. The objective is to identify the customers with the potential for cross-selling opportunities.

Data Inputs
o Transactional data of customers obtained from POS
o State of residence
o Annual income
o No. of dependents
o Credit class/category
o No. of transactions per month
o Average balance per month


Analytical Techniques used and the rationale for using them


Logistic Regression:
The logistic regression equation constrains the predicted values of the dependent variable to lie in the interval between zero and one, whereas OLS regression often produces predicted values of less than zero or greater than one, which are substantively irrelevant and have no interpretative value. A simple transformation (exponentiation) of the logistic regression model's parameters leads to an easily interpretable and explainable quantity: the odds ratio. A number of useful tests for assessing model adequacy and fit are available for logistic regression models. Tests are also available for influential and other ill-fitting observations. Parameter estimates generated from a logistic regression model can be applied in a simple data step to the population of interest, thus scoring each member of the population, that is, creating a probability of the event outcome for each member. This score can then be used to select subsets of the population for various treatments as may be appropriate to the substantive issue under analysis.
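
As a minimal sketch of the scoring and odds-ratio ideas described above, the following Python snippet applies a set of logistic regression coefficients to individual customers. The coefficient values, the intercept, and the variable names (annual_income_k, num_dependents) are hypothetical and chosen only for illustration; they are not the fitted values from this analysis.

```python
import math

# Hypothetical coefficients on the log-odds scale (not taken from the report).
COEFFICIENTS = {
    "annual_income_k": 0.02,   # effect per 1,000 units of annual income
    "num_dependents": -0.30,   # effect per additional dependent
}
INTERCEPT = -2.0


def odds_ratios(coefficients):
    """Exponentiate each coefficient to obtain its odds ratio."""
    return {name: math.exp(beta) for name, beta in coefficients.items()}


def score(customer):
    """Return the predicted event probability for one customer."""
    z = INTERCEPT + sum(beta * customer[name] for name, beta in COEFFICIENTS.items())
    # The logistic function maps any real z into the interval (0, 1).
    return 1.0 / (1.0 + math.exp(-z))


customers = [
    {"annual_income_k": 100, "num_dependents": 0},
    {"annual_income_k": 40, "num_dependents": 3},
]
scores = [score(c) for c in customers]

for customer, probability in zip(customers, scores):
    print(customer, round(probability, 3))
print(odds_ratios(COEFFICIENTS))
```

An odds ratio below one (as for num_dependents in this made-up example) means the odds of the event fall as the variable increases; the scores can be sorted to select the customers to target.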

Decision Tree Analysis using CHAID (Chi-Squared Automatic Interaction Detection)


The output of the CHAID prediction model is displayed in a hierarchical tree-structured form, in which the root is the population and the branches are the connecting segments, such that the variation of the response variable is minimized within each segment and maximized between segments. This ensures that the target clusters formed each have a tendency toward a particular kind of product. In this way we understand which kinds of customers will want which kinds of products.


Analytical Approach taken


Both techniques followed broadly the same four steps to obtain the final results:
1) Data capture from data sources.
2) Data transformation, cleaning and refining.
3) Generating the best-fit model / using the algorithm to generate clusters.
4) Determining the predictive power by comparing the clusters with actual values.

Decision Tree analysis (CHAID)


CHAID is a technique that recursively partitions (or splits) a population into separate and distinct segments. These segments, called nodes, are split in such a way that the variation of the (categorical) response variable is minimized within the segments and maximized between the segments. After the initial splitting of the population into two or more nodes (defined by values of an independent or predictor variable), the splitting process is repeated on each of the nodes. Each node is treated like a new sub-population. The splitting process is repeated until the stopping rules are met, i.e. when all objects in a partition have the same class value or there is only one object in the partition.
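
The core of one splitting step can be sketched as follows: for each candidate predictor, cross-tabulate it against the response, run a chi-squared test of independence, and split on the predictor with the smallest p-value. This is a simplified illustration only. A full CHAID implementation also merges similar categories and applies a Bonferroni adjustment to the p-values, which is omitted here. The toy data and the column names (credit_class, state, bought) are hypothetical.

```python
import math
from collections import Counter


def chi2_sf(x, df):
    """Upper-tail probability of a chi-squared variable with integer df."""
    y = x / 2.0
    # Closed-form starting points: df = 2 (even case) or df = 1 (odd case).
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))
    # Step up in half-df using Q(a + 1, y) = Q(a, y) + y^a e^(-y) / Gamma(a + 1).
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q


def chi_square_test(predictor, response):
    """Return (statistic, p-value) for independence of two categorical lists."""
    n = len(response)
    row_totals = Counter(predictor)
    col_totals = Counter(response)
    observed = Counter(zip(predictor, response))
    stat = 0.0
    for r, r_n in row_totals.items():
        for c, c_n in col_totals.items():
            expected = r_n * c_n / n
            stat += (observed[(r, c)] - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, chi2_sf(stat, df)


def best_split(predictors, response):
    """Pick the predictor whose association with the response has the lowest p-value."""
    results = {name: chi_square_test(values, response) for name, values in predictors.items()}
    best = min(results, key=lambda name: results[name][1])
    return best, results


# Hypothetical toy data: eight customers, two candidate predictors.
predictors = {
    "credit_class": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "state": ["X", "Y", "X", "Y", "X", "Y", "X", "Y"],
}
bought = [1, 1, 1, 0, 0, 0, 0, 1]

best, results = best_split(predictors, bought)
print(best, results)
```

To grow a tree, the same selection would be repeated on each resulting node until a stopping rule is met.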

Logistic Regression
The modeling data is first used in a univariate logistic regression analysis to get an estimate of the number of significant factors that affect the cross-sell probability. After this step, the data set is used for a multivariate regression analysis to determine the final set of significant factors, using pseudo R-square values as well as p-values. Once the significant factors are identified, the resulting model is fed the validation data set and the subsequent ROC curves are studied to determine the predictive power of the model. Next, the prediction data set is input and the model provides the predicted cross-sell probabilities. As a final step, cluster analysis is performed on the logistic regression model output for segmentation.
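
The validation step can be illustrated with a small sketch that computes the area under the ROC curve directly from validation labels and predicted probabilities. It uses the pairwise definition of the AUC (the share of positive/negative pairs in which the positive case receives the higher score, with ties counted as half). The labels and scores below are made up for illustration and are not results from this analysis.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(positives) * len(negatives))


# Hypothetical validation set: actual outcomes and model-predicted probabilities.
validation_labels = [1, 0, 1, 0]
validation_scores = [0.9, 0.4, 0.35, 0.1]

auc = roc_auc(validation_labels, validation_scores)
print(auc)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.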


Validation of the approach and outputs


Logistic Regression
If we want to be 95% confident that our results are valid, we set our alpha level to 0.05. With the alpha level set to 0.05:

If the associated value of (Pr>ChiSq) >= 0.05:
We fail to reject the null hypothesis and conclude that the regression coefficient for the given factor has not been found to be statistically different from zero in estimating cross-selling opportunities, given the presence of the other factors in the model.

If the associated value of (Pr>ChiSq) < 0.05:
We reject the null hypothesis and conclude that the regression coefficient for the given factor has been found to be statistically different from zero in estimating cross-selling opportunities, given the presence of the other factors in the model.
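
The decision rule above can be sketched in a few lines of Python. This sketch assumes that the reported Pr>ChiSq is the p-value of a Wald chi-squared test with one degree of freedom, computed as the squared ratio of a coefficient to its standard error; the report does not state this explicitly. The coefficient and standard error used here are hypothetical.

```python
import math

ALPHA = 0.05


def wald_test(beta, std_error):
    """Return (Wald chi-squared statistic, p-value) for one coefficient, 1 df."""
    chi_sq = (beta / std_error) ** 2
    # Upper-tail probability of a chi-squared variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi_sq / 2.0))
    return chi_sq, p_value


def is_significant(p_value, alpha=ALPHA):
    """Reject the null hypothesis (coefficient equals zero) when p < alpha."""
    return p_value < alpha


# Hypothetical estimate: coefficient 0.5 with standard error 0.25.
chi_sq, p_value = wald_test(beta=0.5, std_error=0.25)
print(chi_sq, round(p_value, 4), is_significant(p_value))
```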

Conclusion:
1. The above analysis shows that, with the exception of condition of account (condition_of_accnt), the factors shown in Exhibit 4 have a significant bearing on the final outcome.
2. Also, the credit limit has a Pr>ChiSq value of 0.048, which is only just below the 0.05 threshold and therefore very close to being an insignificant factor, suggesting that a credit limit increase may not necessarily translate into higher spending by the consumer.

Validation:
The technique is already in widespread use and plenty of research articles have been written on it. Also, since the final output shows a high degree of correlation, this suggests that the factors identified are correct and that the data can be used for predictive analysis.


Project Outputs
Logistic Regression
o List of significant factors
o Area under ROC curve
o The set of predicted probabilities for each and every customer

CHAID
o The visual representation of the tree
o The ROC and Lift curves
o The final set of prospects/leads from amongst the customers
o A confusion matrix, which tabulates correctly classified and misclassified items; the proportion of misclassified items is an indicator of the model's efficacy and predictive power
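
As a minimal sketch of how a confusion matrix and the proportion of misclassified items can be derived for a binary outcome, consider the following Python snippet. The actual and predicted labels are hypothetical and serve only to illustrate the calculation.

```python
def confusion_matrix(actual, predicted):
    """Count true/false positives and negatives for binary labels (1 = event)."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["tp"] += 1
        elif a == 0 and p == 1:
            counts["fp"] += 1
        elif a == 1 and p == 0:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


def misclassification_rate(counts):
    """Proportion of items assigned to the wrong class."""
    total = sum(counts.values())
    return (counts["fp"] + counts["fn"]) / total


# Hypothetical outcomes: whether each customer actually took up the offer,
# and whether the model flagged that customer as a prospect.
actual = [1, 0, 1, 1, 0, 0]
predicted = [1, 0, 0, 1, 1, 0]

matrix = confusion_matrix(actual, predicted)
error_rate = misclassification_rate(matrix)
print(matrix, round(error_rate, 3))
```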



