Part I- Review Questions

Q1) [20 marks] List and explain different stages in customer life cycle, the business
process organized around the customer life cycle, and mention how data mining can help
in each process.
Customer life cycle
There are five stages in the customer life cycle.
1. Prospects: These are people in the target market who are not yet customers. They
are the people the company's model identifies as likely to be willing to purchase the
products. The rest of the world (outside the target market) is less likely to buy
the products.
2. Responders: These are people in the target market who have shown interest in the
products. They are the intended customers of the company, such as those who
subscribe to the website or to specific products and may be willing to pay in the future.
3. New customers: These are responders from the previous stage who place orders or
commit to buying the products.
4. Established customers: These are new customers who find the products
reliable or well designed and are willing to reorder them or to sign deals for
purchasing large quantities.
5. Former customers: In this stage, existing customers no longer purchase or use
the products. This can happen for different reasons: voluntary attrition, such as
moving to a competitor with a lower price; expected attrition, such as a change in
taste or no longer working in the related field; and forced attrition, in which the
customer can no longer afford the products (Linoff, G., & Berry, M., 2011).

Business processes organized around the customer life cycle, and the role of data mining in each
• Customer Acquisition
Customer acquisition means that the company uses strategies to attract potential
customers, make them willing to buy its products, and finally turn them into
customers. This is done by identifying the target market and using mail or email to
send information and promotions to the targeted audience in the targeted areas.
Because prospects change over time for many reasons, it is always helpful to
redefine the targeted areas and audience. The role of data mining here is to find
the right target market and audience and to communicate effectively with those
prospects, for example by selecting the right channel at the appropriate time with
an understandable and attractive message (Linoff, G., & Berry, M., 2011).
• Customer Activation
Customer activation is the process in which the intended audience subscribes to or
registers for the website or products. They want more detailed information about
specific products, and the company treats this as a business opportunity. There are
four steps involved in this process:

1. The Sale (Leads): Members of the targeted audience want to learn more about
specific products or subscriptions, so they may give out their personal information
for use in a later purchase.
2. The Order: They have successfully become subscribed members of the website,
and their information is being verified.
3. The Subscription: More verification is needed before the final purchase.
4. The Paid Subscription: In this last step, the intended audience have become
customers and their orders of products or subscriptions have been placed.

Data mining in these steps involves identifying whether the customer stays in the
activation process and finding out why customers leave, for example because the
process is too complicated, the products or subscriptions are not attractive enough,
or the customers change their mind. By analyzing this information, the company can
learn more and improve its performance.

• Customer Relationship Management

There are four activities that strengthen customer relationships as well as generate profit:
1. Up-selling: when customers make a purchase, recommend higher-priced or
higher-quality products to improve their experience.
2. Cross-selling: when customers purchase one product, recommend another related
product to enhance their experience. For example, when someone buys a PS5,
recommend a new game that he or she may be interested in, such as FIFA.
3. Customer value calculation: estimate the value of each customer, such as the
price range the customer is willing to pay or whether the customer will come
back to make another purchase.
4. Usage stimulation: ask customers to subscribe to the company’s website to
receive further information on more products, which increases the chance that
they will come back again.
Data mining applications for customer relationship management include making
appropriate recommendations to each customer based on their age, income, preferences,
and area; identifying different target customers, dividing them into groups, and
designing for each group the specific marketing campaign that is most attractive to it;
and predicting whether the customer can afford the products or subscriptions, in order
to decrease risk and maximize profit (Linoff, G., & Berry, M., 2011).

• Retention
It is always better to keep customers loyal to the company and to build strong
relationships with them, so it is important to identify customer attrition and to
run effective retention campaigns.
Regarding the application of data mining, by building a model of attrition the
company can find out which customers will leave and why, who will stay, and how
long they will stay, in order to plan better how to develop the relationships. Models
used for retention include classification methods, neural networks, and decision trees
(Linoff, G., & Berry, M., 2011).

Q2) [10 marks] What is “over fitting”? what is its drawback? What are the approaches to
take to avoid “over fitting”?

Overfitting occurs when a model uses too many parameters, or fits too much of the
detail in one specific dataset, so that it is tailored to that dataset alone. If the
model is too complicated, it will fit the available data almost perfectly but fail to
generalize its predictive ability to other data sets. The drawback is that a model
that precisely matches and explains the training dataset may not adapt to additional
relevant data and cannot reliably predict future patterns and observations. In other
words, the model only memorizes the data instead of recognizing general patterns and
predicting further observations (Nelson, 2020).

One way to avoid overfitting is to train the model on more data. With more data,
drawn from different sources, the algorithm cannot stick to one specific dataset and
simply imitate and memorize it, which increases accuracy on new data. The more data
available to the model, the lower the chance that it overfits the sample and the
better the patterns and results it produces (Corporate Finance Institute, n.d.).

Another method to prevent overfitting is cross-validation, which splits the sample
into two parts: one contains most of the sample and the other a small proportion of
it. The large part is used to build the model, while the small part is used to test
its predictions and record the errors (Corporate Finance Institute, n.d.).
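
The split described above can be sketched in a few lines of Python. This is only an illustrative sketch, not part of the assignment workflow: the function names `holdout_split` and `k_fold_indices`, the 80/20 proportion, and the choice of 5 folds are assumptions made for the example. The second function shows the common k-fold variant, in which the split is repeated so that every record is held out exactly once.

```python
import random


def holdout_split(records, test_fraction=0.2, seed=0):
    """Shuffle the records, then return (large training part, small test part)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(round(len(shuffled) * test_fraction)))
    return shuffled[n_test:], shuffled[:n_test]


def k_fold_indices(n_records, k=5):
    """Yield (train_indices, test_indices); each record is held out exactly once."""
    indices = list(range(n_records))
    for fold in range(k):
        test_idx = indices[fold::k]          # every k-th record, starting at `fold`
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        yield train_idx, test_idx


# Hypothetical sample of 100 record ids.
train, test = holdout_split(range(100), test_fraction=0.2)
folds = list(k_fold_indices(100, k=5))
```

The model would be built on `train` only, and its errors recorded on `test`; a large gap between training error and test error is the usual sign of overfitting.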

Data simplification is another effective way to avoid overfitting: a complex model,
which has a greater chance of overfitting the dataset, is simplified into a more
straightforward one. For example, pruning a decision tree increases the stability of
the model and effectively reduces overfitting. Other methods that can reduce
overfitting include the Akaike information criterion, model comparison, and stopping
the training before the model starts to overfit (Corporate Finance Institute, n.d.).

Q3) [10 marks] Aside from the Gini index and Entropy, explain two other parameters to
measure purity (best split) in a decision tree.

• Chi-Square
The formula used for Chi-Square is
$\text{Chi-Square} = \sqrt{\frac{(\text{Actual} - \text{Expected})^2}{\text{Expected}}}$
which measures the difference between the child nodes and the parent node. Actual is
the observed count of a class in a child node, while Expected is the count that would
be expected if the child node had the same class distribution as the parent.
Chi-Square is only useful when the target variable is categorical, not continuous. If
the split makes no difference (observed and expected counts are the same), the result
is zero. A high Chi-Square value means that the distribution in the child nodes
differs more from that of the parent node, so the nodes are purer after the split.
The steps are: find the expected value of each class in each child node, apply the
formula with the actual value of each class, add the values up to get the total for
each split, and choose the split with the higher total, for example between a split
on performance in class and a split on class (Singh, 2021).
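
The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: it uses the square-root form of the formula quoted above (the standard Pearson statistic omits the square root), and the function name `chi_square_of_split` and the yes/no counts are hypothetical, not taken from the cited source.

```python
import math


def chi_square_of_split(child_counts):
    """Total Chi-Square of one split.

    child_counts: list of dicts, one per child node, mapping class -> observed count.
    """
    classes = sorted({c for node in child_counts for c in node})
    parent_total = sum(sum(node.values()) for node in child_counts)
    # Share of each class in the parent node.
    parent_share = {
        c: sum(node.get(c, 0) for node in child_counts) / parent_total
        for c in classes
    }
    total = 0.0
    for node in child_counts:
        node_size = sum(node.values())
        for c in classes:
            expected = node_size * parent_share[c]
            if expected > 0:
                actual = node.get(c, 0)
                total += math.sqrt((actual - expected) ** 2 / expected)
    return total


# Hypothetical parent node: 10 "yes" and 10 "no" records, split two ways.
split_a = [{"yes": 8, "no": 2}, {"yes": 2, "no": 8}]   # separates the classes well
split_b = [{"yes": 5, "no": 5}, {"yes": 5, "no": 5}]   # same distribution as parent

chi_a = chi_square_of_split(split_a)
chi_b = chi_square_of_split(split_b)
best = "split_a" if chi_a > chi_b else "split_b"
```

Here `split_b` scores zero because its child nodes mirror the parent, so `split_a` would be chosen.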

• Information Gain
o Information Gain is used to split nodes when the target variable is
categorical. It is based on the idea of entropy and is given by:
$\text{Information Gain} = 1 - \text{Entropy}$
Entropy measures the impurity of a node: the lower the entropy, the higher
the purity of the node. A homogeneous node has zero entropy.
For nodes with higher purity, the information gain is higher, and the
maximum value is 1 because the entropy is subtracted from 1.
The formula for computing the entropy is:
$\text{Entropy} = -\sum_i p_i \log_2 p_i$
where $p_i$ is the proportion of records in the node that belong to class $i$.

o Using information gain to split a decision tree follows these steps:
▪ For each candidate split, calculate the entropy of each sub-node separately.
▪ Calculate the entropy of each split as the weighted average entropy of its
sub-nodes.
▪ Select the split that has the lowest entropy value, that is, the one that
provides the most information.
▪ Repeat steps 1-3 until you get homogeneous nodes (Sharma, 2020).
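
A short Python sketch of steps 1 to 3 is given below. It is illustrative only: the function names and the class counts are hypothetical, and it follows the two-class form used above, where Information Gain is 1 minus the weighted entropy (more generally, information gain is the parent entropy minus the weighted child entropy, which is the same thing when the parent is an evenly balanced two-class node).

```python
import math


def entropy(counts):
    """Entropy of one node, given the number of records in each class."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)


def split_entropy(children):
    """Weighted average entropy of the sub-nodes of one split."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * entropy(child) for child in children)


# Hypothetical candidate splits of a parent node with 10 + 10 records.
candidates = {
    "split_a": [[8, 2], [2, 8]],
    "split_b": [[5, 5], [5, 5]],
}

# Step 3: choose the split with the lowest weighted entropy.
chosen = min(candidates, key=lambda name: split_entropy(candidates[name]))
gain = {name: 1 - split_entropy(children) for name, children in candidates.items()}
```

In this example `split_b` leaves both sub-nodes evenly mixed (entropy 1, gain 0), so `split_a` is selected.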

Q4) [10 marks] What do the x-axis and y-axis represent in an ROC curve? What does the
area under the curve represent? Draw an example ROC curve for a classifier which
performs better than a random guess. Describe an application where false negative might
matter more than the accuracy of the classifier?

The ROC curve shows the relative tradeoff between the true positive rate and the false positive rate.

The x-axis: $\text{False Positive Rate} = \frac{FP}{FP + TN} = 1 - \text{Specificity}$
The y-axis: $\text{True Positive Rate} = \frac{TP}{TP + FN} = \text{Sensitivity}$

The area under the ROC curve (AUC) can be used as a measure of test performance.
Larger areas mean more useful tests: when the area is close to 1, the classifier
performs best; when the area is close to 0.5, the classifier performs poorly
(similar to a random guess); and when the area is less than 0.5, the classifier
performs worse than a random guess.
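
To make the axes and the area concrete, the sketch below builds ROC points by sweeping the decision threshold over a set of scores and then measures the area with the trapezoidal rule. The labels and scores are made-up values for illustration, and the function names are assumptions of this sketch.

```python
def roc_points(labels, scores):
    """Return ROC points (FPR, TPR), from threshold 'very high' down to 'very low'."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Records with the same score cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a list of (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical test set: 1 = actual positive, 0 = actual negative.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(labels, scores)
auc = area_under_curve(points)
```

For these values the area is 8/9 (about 0.89), well above the 0.5 of a random guess, because only one negative record is scored above a positive one.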

The red curve above represents a classifier that performs better than a random guess.

Consider breathalyzer tests. A false positive would show that a driver is over the
alcohol limit when he or she has not even touched an alcoholic drink; a false negative
would register a driver as sober when he or she is drunk, or at least over the limit.
Here, a false negative matters more, because allowing drunk drivers to continue driving
on the assumption that they are sober is dangerous to them and to others around them.
In contrast, a driver with a false positive can provide a blood sample to prove his or
her innocence. Therefore, a false negative causes the bigger problem, and it matters
more than the overall accuracy of the classifier (Valchanov, 2018).

Part II- Problems

Q5) [10 marks] The following confusion matrix is produced during the test phase of a

Actual vs. Classified     Classified Positive     Classified Negative

Actual Positive                   85                      10
Actual Negative                    5                      20

Calculate the following measures for this classifier:

• Accuracy (also called Correctness)
$\text{Correctness} = \frac{TP + TN}{TP + TN + FP + FN} = \frac{85 + 20}{85 + 20 + 5 + 10} = 0.875$

• Specificity
$\text{Specificity} = \frac{TN}{FP + TN} = \frac{20}{5 + 20} = 0.8$

• Sensitivity
$\text{Sensitivity} = \frac{TP}{TP + FN} = \frac{85}{85 + 10} = 0.895$

• Precision
$\text{Precision} = \frac{TP}{TP + FP} = \frac{85}{85 + 5} = 0.944$

• F_Measure
$\text{F-Measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2 \times 0.944 \times 0.895}{0.944 + 0.895} = 0.919$
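
The five measures can be checked with a few lines of Python. This is just a verification sketch; the function name `classifier_measures` is made up for the example, and the counts are read from the confusion matrix above with rows as actual classes and columns as classified classes (TP = 85, FN = 10, FP = 5, TN = 20).

```python
def classifier_measures(tp, fn, fp, tn):
    """Return the standard measures computed from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # recall is the same quantity as sensitivity
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "specificity": tn / (fp + tn),
        "sensitivity": recall,
        "precision": precision,
        "f_measure": 2 * precision * recall / (precision + recall),
    }


measures = classifier_measures(tp=85, fn=10, fp=5, tn=20)
rounded = {name: round(value, 3) for name, value in measures.items()}
```

Rounded to three decimals, the results agree with the hand calculations: 0.875, 0.8, 0.895, 0.944 and 0.919.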

Q6) [30 marks] Pre-processing using IBM Modeler: Use the dataset posted with this
assignment on the “Assignments” page on BrightSpace called Unclean-Bank-Data.Xlsx.
This dataset includes missing values, invalid values, and outliers.

Use the IBM SPSS Modeler to pre-process and clean the data. Note: Using Excel or
other applications is not acceptable. This question must be done with IBM Modeler.

Do not remove a record if there is only one missing value in that record. Instead, use the
IBM Modeler to fill in the missing value with an algorithm of your choice.

Similarly, do not remove a record if it has only one invalid value in that record. Instead,
use the IBM Modeler to fill in the invalid value with an algorithm of your choice.

If you find a record with more than one missing value, or more than one invalid value,
then you may either remove the record, or use the IBM Modeler to fill in for the missing
or invalid values.

If you detect outliers, you must record and report them, then delete the entire record.

You may also want to do other pre-processing tasks such as data normalization,
binning data, etc.

Deliverables for this Question:

1- Briefly explain the three different cleaning/ pre-processing tasks you applied on
the data using the IBM SPSS Modeler.
The first step in cleaning and pre-processing the dataset is removing the records
that contain more than one invalid or missing value from the Excel sheet. Three
records have more than one invalid or missing value: rows 579, 164 and 135.

Three different cleaning/pre-processing tasks:

Reclassify: The fields that should be categorical variables with only two values
(gender, car, save_act, cheq_act, mortgage and approved), such as Male and Female
or Yes and No, contain some values in a non-standard format, such as M and F for
Male and Female, or Y and N for Yes and No. In this case, Reclassify puts those
non-standard records into the standard format. By selecting Multiple mode and
reclassifying into the existing fields, the Modeler replaces the original
(non-standard) values with new (standard) values to reorganize the data. After this
step, gender has only two unique values, while the other fields may have three,
including the blank value.

Type: Type is the process that sets the proper range for a specific field. In this
case, three fields need the Type process: age, income and children. These fields
contain unrealistic outliers, such as a negative age (-18), 22 children entered by
mistake, and implausible incomes (annual income of -1 and of 10 million). In the
Missing value field, select Specify, set the proper range for age, income and
children, and select Nullify for Check values. After that, tick the check box for
Define blanks. For the fields save_act, cheq_act and mortgage, which contain blank
values, select Specify under the Missing value tab and tick Define blanks. Connect
Type with Data Audit and select the Quality tab; under Impute Missing, select Blank
and Null Value; under Method, select Algorithm; then click Generate and select
Missing Value SuperNode. After clicking OK, a star labelled Missing Value Imputation
appears, and all the blanks are filled in automatically.

Binning: The last process is to clean the income field by dividing it into 5
bins. After selecting Income in Bin fields, setting the number of bins to 5 and
clicking OK, another field, Income_Bin, appears in Data Audit and can be
used for future analysis.
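
For readers unfamiliar with binning, the idea can be sketched outside the Modeler in a few lines of Python. This is only an illustration of the concept, not the procedure used for the deliverable: the text above does not state which binning method was selected, so equal-width bins are an assumption here, and the function name and income values are hypothetical.

```python
def equal_width_bins(values, n_bins=5):
    """Assign each value a bin number from 1 to n_bins using equal-width intervals."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    bins = []
    for value in values:
        index = int((value - low) / width) if width > 0 else 0
        # The maximum value falls on the upper edge, so clamp it into the last bin.
        bins.append(min(index, n_bins - 1) + 1)
    return bins


# Hypothetical annual incomes after cleaning.
incomes = [10000, 20000, 30000, 40000, 50000, 60000]
income_bin = equal_width_bins(incomes, n_bins=5)
```

With these values each bin spans 10,000, so the six incomes fall into bins 1, 2, 3, 4, 5 and 5.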

2- Include with your submission your clean dataset (name it “Clean-Bank-Data.xlsx”).
You may submit one zip file including all your deliverables.

