Assignment #2 (Group Work) : 10 AM On Nov. 02, 2021
Submission Instructions:
Submit the report on BrightSpace® no later than 10 AM on Nov. 02, 2021.
Only one submission per team. Please choose a team representative to submit
one copy of the report on BrightSpace
When submitting your group work, the submission must include the following two
statements. Without them, your assignment will not be marked.
NOTE: If the two statements are not included in your original submission, the following
deductions will be applied:
-20% if the statements were not in the original submission but were submitted within
24 hours of being reminded by the TA.
-100% if the statements were not submitted within 24 hours of being reminded by the TA.
Important Note: Each individual member of the group must read the following
academic integrity statement and type in their full name and student ID (which is
considered the signature). The Academic Integrity Statement must be e-signed by
ALL members of the group UNLESS a group member has not contributed to the
assignment. Your assignment will not be marked if the academic integrity statement
is not submitted.
I attest to the fact that my own work in this project adheres to the fraud policies as
outlined in the Academic Regulations in the University’s Undergraduate Studies
Calendar. I further attest that I have knowledge of and have respected the “Beware of
Plagiarism” brochure found on the Telfer School of Management’s doc-depot site. To the
best of my knowledge, I also believe that each of my group colleagues has also met the
aforementioned requirements and regulations. I understand that if my group assignment
is submitted without a completed copy of this Personal Work Statement from each group
member, it will be interpreted by the school that the missing student(s) name is
confirmation of non-participation of the aforementioned student(s) in the required work.
We, by typing in our names and student IDs on this form and submitting it electronically,
warrant that the work submitted herein is our own group members’ work and not the
work of others;
Statement of Contribution
Questions 1 and 6 were done by Ketong Li (Student No. 300112336).
Questions 2 and 4 were done by Zhenni Zhao (Student No. 300128622).
Questions 3 and 5 were done by Rong Mu (Student No. 300127742).
Q1) [20 marks] List and explain different stages in customer life cycle, the business
process organized around the customer life cycle, and mention how data mining can help
in each process.
Customer life cycle
There are five stages in the customer life cycle.
1. Prospects: people in the identified target market. At this stage they are not yet
customers; they are the people who, according to the targeting model, may be
willing to purchase the products. People outside the target market are less likely
to buy.
2. Responders: people in the target market who show interest in the products, for
example by subscribing to the website or to specific products, and who may be
willing to pay in the future. They are the company's intended customers.
3. New customers: responders from the previous stage who place orders or decide
to buy the products.
4. Established customers: new customers who find the products reliable or well
designed and are willing to reorder them or to sign deals for purchasing large
quantities.
5. Former customers: existing customers who no longer purchase and use the
products. This may happen for different reasons, including voluntary attrition
(e.g., moving to a competitor with a lower price), expected attrition (e.g., a
change in taste, or no longer working in the related field), and forced attrition,
where customers can no longer afford the products (Linoff, G., & Berry, M., 2011).
Business process organized around the customer life cycle (data mining
applications)
Customer Acquisition
Customer acquisition means that the company uses strategies to attract potential
customers and make them willing to buy its products, so that they finally become
customers. This is done by identifying the target market and sending out information
and promotions, for example by mail or email, to the targeted audience in the targeted
areas. Because prospects change over time for multiple reasons, it is always helpful to
define new target areas and audiences. The role of data mining here is to find the
correct target market and audience and to communicate effectively with those prospects,
for example by selecting the right channels at the appropriate time with an
understandable and attractive message (Linoff, G., & Berry, M., 2011).
Customer Activation
Customer activation is the process in which the intended audience comes into place to
subscribe and register for the website or products. They are willing to get more
detailed information about specific products, and this is considered a business move.
Four steps are involved in this process:
1. The Sale (Leads): a member of the targeted audience would like to learn more
about specific products or a subscription, and may give out personal information
for further purchase use.
2. The Order: the person has successfully become a subscribed member of the
website and their information is being verified.
3. The Subscription: more verification is needed before the final purchase.
4. The Paid Subscription: the intended audience member has become a customer and
the order for the products or subscription has been placed.
Data mining in these steps involves identifying whether a customer stays in the
activation stage, and finding the reasons customers leave: the process may be too
complicated, the products or subscriptions may not be attractive enough, or the
customers may simply change their minds. By analyzing this information, the company
can learn and improve its performance.
Retention
It is always better to keep customers loyal to the company and to build strong
relationships with them, so it is important to identify customer attrition; effective
retention campaigns are necessary.
Regarding the application of data mining: by designing an attrition model, the
company can find out which customers will leave and why, who will stay, and for how
long, in order to make better plans for developing the relationships. Models used for
this retention purpose include classification methods, neural networks, and decision
trees (Linoff, G., & Berry, M., 2011).
Q2) [10 marks] What is "over fitting"? What is its drawback? What are the approaches to
take to avoid "over fitting"?
"Overfitting" occurs when a model uses too much data, information, and too many
parameters from a specific dataset to tune the statistical model. If the model is too
complicated, it will fit the available data perfectly but fail to generalize its
predictive ability to other datasets. This happens when the model precisely matches
the training dataset and explains it well, which may leave it unable to adjust to
additional, relevant information and datasets, and unable to predict future patterns
and observations. In other words, the model only memorizes the data instead of
recognizing broader patterns and predicting further observations (Nelson, 2020).
One method to avoid overfitting is to have more data available to train the model. By
adding more data, the model's algorithm can access data from different sources and
will not stick to one specific dataset to imitate and memorize, which increases
accuracy. The more data available to the model, the less chance it will overfit the
samples and the better the patterns and results it will produce (Corporate Finance
Institute, n.d.).
Another method to prevent overfitting is cross-validation, which splits the sample
into two parts: one containing most of the sample, and a second, smaller part. The
large part is used to build the model, while the small part is used to predict and
forecast, and to record the errors (Corporate Finance Institute, n.d.).
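The split described above can be sketched in Python (a minimal, illustrative sketch; the function name and the 80/20 fraction are our own choices, not from the cited source):

```python
import random

def train_test_split(data, test_fraction=0.2, seed=42):
    """Split a dataset into a large part (to build the model) and a
    small held-out part (to estimate the error), as described above."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

# Illustrative dataset of 10 records: 8 go to training, 2 to validation
records = list(range(10))
train, test = train_test_split(records, test_fraction=0.2)
```

A model trained on `train` is then scored on `test`; a large gap between training and test error is the symptom of overfitting described above.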
Model simplification is another effective way to avoid overfitting: it simplifies a
complex model, which has a greater chance of overfitting the dataset, into a more
straightforward one. For example, pruning a decision tree increases the stability of
the model and effectively avoids overfitting. Many other methods can reduce
overfitting, including the Akaike information criterion, model comparison, and
stopping the training before the model starts to overfit (Corporate Finance
Institute, n.d.).
Q3) [10 marks] Aside from the Gini index and Entropy, explain two other parameters to
measure purity (best split) in a decision tree.
Chi-Square
The formula used for Chi-Square is

Chi-Square = √((Actual − Expected)² / Expected)

where it measures the difference between the parent and child nodes. "Actual"
represents the observed frequency of a class, while "Expected" is its expected
frequency. Similarly, Chi-Square is only useful with categorical variables, not
continuous variables. If there is no difference in the split (observed and expected
have the same result), the value is zero. A high Chi-Square value means there is a
greater difference between the distribution in the child node and that in the parent
node, and more purity after the split. The steps involved are: find the expected value
for each class, apply the formula with the actual value of each class, add up the
values of each split to get its total, and choose the split with the higher value
(Singh, 2021).
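The steps above can be sketched in Python (a minimal illustration with made-up class counts; the function name is ours):

```python
from math import sqrt

def chi_square_for_node(actual_counts, expected_counts):
    """Sum sqrt((Actual - Expected)^2 / Expected) over the classes
    of one child node, as in the formula above."""
    return sum(sqrt((a - e) ** 2 / e)
               for a, e in zip(actual_counts, expected_counts))

# Child node with 30 records; the parent is split 50/50 between two
# classes, so the expected counts are 15/15, while the observed
# counts after the split are 20/10.
value = chi_square_for_node([20, 10], [15, 15])
# A split whose children match the parent distribution scores 0;
# a higher value means a purer (better) split.
```

Summing this value over all child nodes of a candidate split gives the total used to compare splits.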
Information Gain
o Information Gain is employed to split the nodes when the target variable is
categorical. It is based on the idea of entropy and is given by:
Information Gain = 1 − Entropy
Entropy is used to measure the impurity of a node: the purity of a node is
inversely proportional to its entropy value, and a homogeneous node has zero
entropy. Purer nodes therefore have higher information gain, with a maximum
value of 1, because the entropy is subtracted from 1.
The formula for computing the entropy is:
Entropy = −Σ (i = 1 to n) p_i log₂(p_i)
o To split a decision tree using information gain, follow these steps:
1. Calculate the entropy of each sub-node separately for each candidate split.
2. Calculate the entropy of each split as the weighted average entropy of its
sub-nodes.
3. Select the split that has the lowest entropy value, i.e., that provides the
most information.
Repeat steps 1-3 until you get homogeneous nodes (Sharma, 2020).
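The steps above can be sketched in Python (an illustrative sketch with made-up counts; the function names are ours):

```python
from math import log2

def entropy(class_counts):
    """Entropy = -sum(p_i * log2(p_i)) over the classes of a node."""
    total = sum(class_counts)
    return -sum((c / total) * log2(c / total)
                for c in class_counts if c > 0)

def split_entropy(children):
    """Weighted average entropy of the sub-nodes of a split (step 2)."""
    total = sum(sum(c) for c in children)
    return sum(sum(c) / total * entropy(c) for c in children)

# A homogeneous node has entropy 0; a 50/50 node has entropy 1.
# Compare two candidate splits of 20 records (step 3): prefer the
# one with lower weighted entropy, i.e., higher information gain.
split_a = split_entropy([[10, 0], [0, 10]])  # perfect split
split_b = split_entropy([[6, 4], [4, 6]])    # impure split
```

Here `split_a` is 0 while `split_b` is close to 1, so the first split would be selected.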
Q4) [10 marks] What do the x-axis and y-axis represent in an ROC curve? What does the
area under the curve represent? Draw an example ROC curve for a classifier which
performs better than a random guess. Describe an application where false negative might
matter more than the accuracy of the classifier?
The ROC curve shows the relative tradeoff between the true positive rate and the false
positive rate.
The x-axis: False Positive Rate = FP / (FP + TN) = 1 − Specificity
The y-axis: True Positive Rate = TP / (TP + FN) = Sensitivity
The area under the ROC curve can be used as a measure of test performance. A larger
area means a more useful test: when the area is close to 1, the classifier performs
best; when the area is close to 0.5, the classifier performs poorly (similar to a
random guess); and when the area is less than 0.5, the classifier performs
substantially worse than a random guess.
[Figure: example ROC curve. The red curve, lying above the diagonal, represents a
classifier that performs better than a random guess.]
Consider breathalyzer tests: a false positive would show that a driver is over the
alcohol limit when they have not even touched an alcoholic drink, while a false
negative would register a driver as sober when he or she is drunk, or at least over
the limit. Here, a false negative matters more, because allowing drunk drivers to
continue driving while assuming they are sober is dangerous to them and to others
around them. In contrast, to deal with a false positive, drivers can provide a blood
sample to prove their innocence. Therefore, a false negative causes the bigger problem
and matters more than the accuracy of the classifier (Valchanov, 2018).
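The two axis definitions can be illustrated with a short Python sketch that sweeps a threshold over made-up classifier scores, collects (FPR, TPR) points, and computes the area under the resulting curve (all names and numbers here are our own illustration, not from the cited source):

```python
def roc_points(scores, labels):
    """Sweep a threshold over the scores and return (FPR, TPR) points,
    where FPR = FP/(FP+TN) and TPR = TP/(TP+FN)."""
    thresholds = sorted(set(scores), reverse=True)
    points = [(0.0, 0.0)]
    pos = sum(labels)
    neg = len(labels) - pos
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Made-up scores: higher scores mostly go to the positive class,
# so the area lands well above the 0.5 of a random guess.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1,   1,   0,   1,   1,    0,   0,   0]
area = auc(roc_points(scores, labels))  # 0.875 for these numbers
```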
Q5) [10 marks] The following confusion matrix is produced during the test phase of a
classifier.
Specificity
Specificity = TN / (FP + TN) = 20 / (5 + 20) = 0.8
Sensitivity
Sensitivity = TP / (TP + FN) = 85 / (85 + 10) = 0.895
Precision
Precision = TP / (TP + FP) = 85 / (85 + 5) = 0.944
F-Measure
F-Measure = (2 × Precision × Recall) / (Precision + Recall)
          = (2 × 0.944 × 0.895) / (0.944 + 0.895) = 0.919
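These calculations can be verified with a short Python snippet (the counts TP = 85, FN = 10, FP = 5, TN = 20 are read from the confusion matrix in the question):

```python
# Counts read off the confusion matrix in the question
TP, FN, FP, TN = 85, 10, 5, 20

specificity = TN / (FP + TN)   # 20 / 25 = 0.8
sensitivity = TP / (TP + FN)   # 85 / 95 ≈ 0.895 (recall)
precision   = TP / (TP + FP)   # 85 / 90 ≈ 0.944
f_measure   = (2 * precision * sensitivity) / (precision + sensitivity)
# f_measure ≈ 0.919, matching the value computed above
```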
Q6) [30 marks] Pre-processing using IBM Modeler: Use the dataset posted with this
assignment on the “Assignments” page on BrightSpace called Unclean-Bank-Data.Xlsx.
This dataset includes missing values, invalid values, and outliers.
Use the IBM SPSS Modeler to pre-process and clean the data. Note: Using Excel or
other applications is not acceptable. This question must be done with IBM Modeler
only.
Do not remove a record if there is only one missing value in that record. Instead, use the
IBM Modeler to fill in the missing value with an algorithm of your choice.
Similarly, do not remove a record if it has only one invalid value in that record. Instead,
use the IBM Modeler to fill in the invalid value with an algorithm of your choice.
If you find a record with more than one missing value, or more than one invalid value,
then you may either remove the record, or use the IBM Modeler to fill in for the missing
or invalid values.
If you detect outliers, you must record and report them, then delete the entire record.
You may also want to do other pre-processing tasks such as data normalization,
binning data, etc.
Type: the Type node sets the proper range for a specific field. In this case, three
fields need the Type process: age, income, and children. These fields contain
unrealistic outliers and invalid values, such as a negative age (-18), 22 children
from a wrong entry, and extreme incomes (an annual income of -1 and one of 10
million). In the Missing values field, select Specify, set the proper range for age,
income, and children, and select Nullify for Check values. After that, click the
checkbox to define blanks. For the fields save act, cheq_act, and mortgage, which
contain blank values, select Specify under the Missing values tab and click Define
blanks. Connect the Type node to a Data Audit node and select the Quality tab; under
Impute Missing, select Blank and Null Values, under Method, select Algorithm, then
click Generate and select Missing Value SuperNode. After clicking OK, a star labelled
Missing Value Imputation appears, and all the blanks are filled automatically.
Binning: the last process is to clean the income field by dividing it into 5 bins.
Select Income under Bin fields, choose 5 as the number of bins, and click OK. A new
field, Income_Bin, appears in the Data Audit and can be used for future analysis.
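The question requires IBM SPSS Modeler, so the following is only an illustrative Python analogue of what the nullify, impute, and binning steps compute (all function names, thresholds, and sample values here are our own, not part of the Modeler workflow):

```python
from statistics import median

def nullify_out_of_range(values, low, high):
    """Analogue of the Type node's Nullify: out-of-range entries become None."""
    return [v if low <= v <= high else None for v in values]

def impute_missing(values):
    """Analogue of the missing-value SuperNode: fill None with the field median."""
    known = [v for v in values if v is not None]
    fill = median(known)
    return [fill if v is None else v for v in values]

def equal_width_bins(values, n_bins=5):
    """Analogue of the Binning node: assign each value to one of
    n_bins equal-width bins (0 .. n_bins-1)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

# Illustrative income field containing invalid values like those noted above
income = [40000, -1, 52000, 10_000_000, 61000, 48000]
cleaned = impute_missing(nullify_out_of_range(income, 0, 1_000_000))
bins = equal_width_bins(cleaned, n_bins=5)  # Income_Bin-style field
```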
References
Linoff, G., & Berry, M. (2011). Data Mining Techniques: For Marketing, Sales, and
Customer Relationship Management. John Wiley & Sons Inc.
Sharma, A. (2020, June 30). Decision Tree Split Methods | Decision Tree Machine
Learning. Analytics Vidhya. https://www.analyticsvidhya.com/blog/2020/06/4-
ways-split-decision-tree/
Singh, H. (2021, March 25). How to select Best Split in Decision Trees using Chi-
Square. Analytics Vidhya. https://www.analyticsvidhya.com/blog/2021/03/how-
to-select-best-split-in-decision-trees-using-chi-square/
Valchanov, I. (2018, June 6). False Positive and False Negative. Medium.
https://towardsdatascience.com/false-positive-and-false-negative-b29df2c60aca